Rethink Reboot | Cypress Semiconductor
Quick test: the gadget in your hand is acting up. What do you do? Restart it, right? If it is running Windows, you give it the three-finger salute. CTRL-ALT-DEL, reboot, restart, cycle power: they all mean the same thing, and since the dawn of the computer age this has been the number one troubleshooting step for any problem. But have we accepted this "behavior" too easily?
There is a massive difference between rebooting an MP3 player and having to reset a pacemaker, but at heart, are the two so different? Both are embedded systems; both have specific, regular and irregular inputs; both are useless if they cannot produce their output. So what is the biggest difference between them? One MUST NOT hang and require a reboot. The other will take any opportunity to shave off unnecessary design and test steps to save money and schedule. Hold on, why don't both statements apply to both products? Shouldn't every project eliminate unnecessary design and test steps? Why shouldn't every product just continue to work?
So the real difference is what we, the consumers, have deemed necessary when we vote with our dollars. Which answers my earlier question: YES, we have accepted this behavior too easily, AND now it is time to rebel.
Great, how do we get started? Since we are the designers and testers of these products, we need to start looking for these bugs and, once they are found, rate them high and get them fixed BEFORE we ship. We also need to learn from the bugs we find and eradicate their root causes in our designs, before testing even begins.
How can we do this? For PSoC and many microcontrollers, the WatchDog Timer (WDT) is the primary mechanism for catching a hung device - but it is also the "savior of last resort". Once a watchdog times out, that's it: a reset is still required. It is only marginally better that the controller takes care of it rather than the user.
So what is a better way to use the watchdog timer? There are two: 1) as a diagnostic resource during development and test; 2) as a way to recover gracefully, provided enough bread crumbs were left before the reset. These are not mutually exclusive; use both. It is essential to have a WDT reset recovery plan and to design in the "bread crumbs" needed to recover as gracefully as possible. But we also need to design in the diagnostic "bread crumbs" that will help in seeking out WDT-reset-inducing situations before they "shoot the engineer and ship it". And if someone in the field finds a failure, those same bread crumbs will help identify the source and eradicate it.
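To make the "bread crumbs" idea concrete, here is a minimal sketch in C that serves both purposes: it records which stage the firmware was in (diagnostics) and what the user was doing (recovery). It is a host-runnable illustration, not PSoC code. Every name in it (crumbs_t, crumb_drop, boot_recover, the stage list, the reset causes) is hypothetical, and the assumption that the record survives a watchdog reset depends on placing it in RAM the startup code does not clear, or in non-volatile memory, on a real part.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical application stages; a real product would list its own. */
typedef enum {
    STAGE_IDLE = 0,
    STAGE_READ_SENSOR,
    STAGE_PROCESS,
    STAGE_UPDATE_OUTPUT,
    STAGE_COUNT
} stage_t;

/* Hypothetical reset causes; real parts report these in a status register. */
typedef enum { RESET_POWER_ON, RESET_WATCHDOG } reset_cause_t;

#define CRUMB_MAGIC 0x43524D42u /* ASCII "CRMB" */

typedef struct {
    uint32_t magic;      /* marks the record as initialized */
    uint32_t stage;      /* last stage entered before the reset */
    uint32_t user_state; /* what to restore for the user, e.g. a menu index */
    uint32_t wdt_resets; /* running count of watchdog resets (diagnostic) */
    uint32_t check;      /* simple checksum over the fields above */
} crumbs_t;

/* ASSUMPTION: on hardware this lives in memory that survives a WDT reset.
   Here it is an ordinary static so the sketch runs on a host. */
static crumbs_t g_crumbs;

static uint32_t crumb_check(const crumbs_t *c)
{
    return c->magic ^ c->stage ^ c->user_state ^ c->wdt_resets ^ 0xA5A5A5A5u;
}

static int crumbs_valid(const crumbs_t *c)
{
    return c->magic == CRUMB_MAGIC &&
           c->stage < (uint32_t)STAGE_COUNT &&
           c->check == crumb_check(c);
}

/* Drop a crumb on entry to each stage, before doing the risky work. */
void crumb_drop(stage_t stage, uint32_t user_state)
{
    assert(stage < STAGE_COUNT);
    if (!crumbs_valid(&g_crumbs)) { /* first use, or record was corrupted */
        memset(&g_crumbs, 0, sizeof g_crumbs);
        g_crumbs.magic = CRUMB_MAGIC;
    }
    g_crumbs.stage = (uint32_t)stage;
    g_crumbs.user_state = user_state;
    g_crumbs.check = crumb_check(&g_crumbs);
}

typedef struct {
    int      recovered;    /* 1 if a trustworthy trail was found */
    stage_t  failed_stage; /* where the firmware was when the WDT fired */
    uint32_t user_state;   /* state to hand back to the user */
    uint32_t wdt_resets;   /* how many WDT resets this trail has seen */
} boot_report_t;

/* Call once at startup with the reset cause read from the hardware. */
boot_report_t boot_recover(reset_cause_t cause)
{
    boot_report_t r = {0, STAGE_IDLE, 0u, 0u};
    if (cause == RESET_WATCHDOG && crumbs_valid(&g_crumbs)) {
        g_crumbs.wdt_resets++;
        g_crumbs.check = crumb_check(&g_crumbs);
        r.recovered = 1;
        r.failed_stage = (stage_t)g_crumbs.stage;
        r.user_state = g_crumbs.user_state;
        r.wdt_resets = g_crumbs.wdt_resets;
    } else {
        /* Cold start, or an untrustworthy trail: begin from scratch. */
        memset(&g_crumbs, 0, sizeof g_crumbs);
    }
    return r;
}
```

In use, the main loop would call crumb_drop() as it enters each stage, and the startup code would call boot_recover() before anything else. During development, failed_stage and wdt_resets point at the code that starved the watchdog; in the field, user_state is what lets the device put the user back where they were. The magic value and checksum are there so that a stale or corrupted record is treated as a cold start rather than trusted.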
Sounds simple. But it's not easy. It's simple to say: "Don't let the device get into a state where the user needs to reset it, and if the device resets itself, make sure to get the user back to where they were". Adding this to any requirements document is simple - seeing it through is not easy. But you have to start, and the place to start is to reboot your attitude.
Forced restarts mean you failed the user. Plan to succeed.