Saturday, 18 April 2020

Finding extremely rarely occurring bugs


Sometimes I get reports of issues that I just can't seem to replicate in my office. These can vary from mild inconveniences to fatal corruption issues. Either way, I at least try to figure out what is going on. It is not uncommon for these to happen to only a very few people, with a very long time between occurrences. And of course the report usually has very few details to use as a starting point.

Sometimes the issues may be of external origin - supply voltage fluctuations, EM fields and whatnot - but they can also be very subtle bugs. Or, in the worst case, a combination of both: a bug triggered by some outside influence. It's just that their rarity makes them very, very difficult to track down.

Over time I've developed a process I often use to find these bugs in my embedded devices. Note that these steps require that you either can write custom software for the system and can at least monitor it from outside (debug prints are sufficient), or have sufficient access to the system via JTAG or other such means.
  1. First, figure out how you can detect the problem "from inside" - so that either the device software or a JTAG debugger can notice the issue and react to it (for example, by stopping). This may involve adding data checksums, code flow tracing (such as "if current state is XYZ and previous was HGR, we have an issue here") or other such means. If the problem is a system crash, a system like the one I described before might help.
  2. Problems usually don't happen when the device is just idling, so you may also need to implement a way for the system to exercise itself in a loop to shake out the problem. This is obviously highly device- and problem-dependent, so I can't say much here about how to do it in your specific case.
  3. Sprinkle your code liberally with traces (such as debug prints) to let you see what was and is going on while the system runs. If you can make an educated guess about where to focus, all the better.
  4. Run the system until you run into the issue. If the code from (2) isn't triggering the issue, it may need refining, and that may take a few iterations.
  5. Now the hard part. Based on the prints from (3), see what was going on before the issue. If you happen to have a really good hardware emulator that can show you the exact instruction history the system ran - the code path - this might be a breeze. I don't have one (and I don't like using embedded debuggers anyway), so I use debug prints.
  6. Based on (3) and (5), refine your traces so that they tell you more about what was going on. Add traces to promising locations and remove the ones that don't appear helpful. Then repeat from (4).
  7. Eventually you should be able to lock onto the main issue and fix it.
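
To make steps 1 to 5 a bit more concrete, here is a minimal sketch in C. It is not code from any real device: every name in it (trace, enter_state, stress_run, demo_next_state and so on) is made up for illustration, the states XYZ and HGR are just the placeholders from step 1, and printf stands in for whatever debug output the device actually has. The idea is a small ring buffer of trace events, a check that recognises the bad state transition, and a loop that exercises the system until the check fires and then dumps the recent history.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical states; XYZ and HGR are the placeholder names from step 1. */
enum state { ST_IDLE, ST_XYZ, ST_HGR };

#define TRACE_LEN 16u   /* power of two, so the index can be masked */

struct trace_entry {
    uint32_t seq;       /* running event number */
    uint16_t id;        /* which trace point fired */
    uint16_t arg;       /* small piece of context, here the new state */
};

static struct trace_entry trace_buf[TRACE_LEN];
static uint32_t trace_seq;
static enum state prev_state = ST_IDLE;

/* Step 3: cheap trace point; only the newest TRACE_LEN events are kept. */
static void trace(uint16_t id, uint16_t arg)
{
    struct trace_entry *e = &trace_buf[trace_seq & (TRACE_LEN - 1u)];
    e->seq = trace_seq++;
    e->id = id;
    e->arg = arg;
}

/* Step 5: print the recorded history, oldest event first. */
static void trace_dump(void)
{
    uint32_t n = trace_seq < TRACE_LEN ? trace_seq : TRACE_LEN;
    for (uint32_t i = trace_seq - n; i != trace_seq; i++) {
        const struct trace_entry *e = &trace_buf[i & (TRACE_LEN - 1u)];
        printf("#%lu id=%u arg=%u\n",
               (unsigned long)e->seq, (unsigned)e->id, (unsigned)e->arg);
    }
}

/* Step 1: detect the problem "from inside". In this sketch the bad case
 * is assumed to be: current state is XYZ and the previous one was HGR. */
static int transition_is_bad(enum state prev, enum state cur)
{
    return prev == ST_HGR && cur == ST_XYZ;
}

/* Record the transition and report whether it was the bad one. */
static int enter_state(enum state next)
{
    int bad = transition_is_bad(prev_state, next);
    trace(1u, (uint16_t)next);
    prev_state = next;
    return bad;
}

/* Stand-in for real device activity: a normal IDLE -> XYZ -> HGR cycle
 * that skips IDLE exactly once, late in the run. */
static enum state demo_next_state(long i)
{
    if (i == 999)
        return ST_XYZ;  /* the rare glitch: HGR is followed by XYZ */
    return (enum state)(i % 3);
}

/* Steps 2 and 4: exercise the system until the detector fires.
 * Returns the iteration where the issue was seen, or -1 if it never was. */
static long stress_run(long max_iterations, enum state (*next_state)(long))
{
    for (long i = 0; i < max_iterations; i++) {
        if (enter_state(next_state(i))) {
            trace_dump();   /* real firmware might halt here instead */
            return i;
        }
    }
    return -1;
}
```

With this stand-in, stress_run(2000, demo_next_state) started from a fresh state stops at iteration 999 and prints the last 16 trace entries leading up to it. On a real device the loop body would of course drive the actual hardware, and the trace points would sit in the code you suspect.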
The whole process is obviously very time-consuming. Adding the needed helper routines alone takes a while, and depending on the issue each run may take from hours to days. Then come the refining and the repeated attempts. All this interrupts, or is interrupted by, all the other work that also needs to be done in parallel - you generally don't want this to be anyone's main focus because of the sheer time scale involved, unless it is some highly critical issue you want fixed right now.

Some time ago I finished locating one issue like this, and it took me only three days - mainly because I had a very good initial guess to work with, so I was able to focus on the most likely code first and make the code from (2) do exactly the right things on the first attempt. Often I am too busy to allocate this much time to this kind of issue, but I guess this entire Covid-19 situation is kind of a mixed blessing in this sense. Without all the typical office work, working at home while also taking care of the kids' homeschooling and all that leaves me just enough time to work on issues like this, especially since this is exactly the type of process that doesn't suffer from being interrupted - the system happily keeps trying to hang itself while I walk away from it.