Sunday, 23 October 2022

Mysterious crash

This is kinda a follow-up to the earlier topics about the 433 MHz transmitter and receiver.

I've been using this setup to receive temperature data and enable/disable heaters at our cottage over the last winter, and in general it has worked very nicely. It's DIY home automation, kinda-sorta. Previously it was used mostly to switch on extra heaters before we went there, so that it would be closer to 20 °C on arrival instead of 5 °C or so.

With the energy mess going on in Europe, I was thinking about optimizing the setup to lower energy usage even further by having it automatically disable the heaters when it's warmer outside (above -10 or -5 °C or so), when the heat pump alone is sufficient for heating.

I made the changes to the software, but during testing I found that the system occasionally crashes hard, once or twice a week or so. The thing is, I have a watchdog timer running on the MCU, and if the software crashes, the watchdog should reset the MCU automatically. Except here it doesn't.

Obviously I first checked that the watchdog actually is running and (by simulating an imagined failure mode in code) that it really can reset the MCU. And it works. Except in these mysterious crashes.

I've been trying different things for weeks now, so far with no results. At the moment my best guess is that the MCU gets flooded with interrupts (particularly from the RF receiver, as it is noisy) and has no time for anything else.

To counter this, I made the following changes:

- The timer interrupt has a counter that is incremented on every call. If this counter ever exceeds a certain value (equivalent to 3 seconds or so), the interrupt handler resets the MCU.
- The RF receiver interrupt has a similar counter. I did some measurements, and in normal operation it seemed to generate some 3000 interrupts per second (which is nothing, really). Likewise, if the counter exceeds a certain value, the MCU is reset.
- The main loop, which normally runs at 50 Hz or so, resets these counters to prevent said MCU resets.
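
For illustration, the three changes above could look roughly like the sketch below. This is not the actual firmware: the function names, the 1 kHz timer rate and both limits are assumptions made up for the example, and force_reset() is only a placeholder that sets a flag, since the real reset mechanism depends on the MCU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, not taken from the real firmware. */
#define TIMER_TICK_HZ     1000u                  /* assumed timer interrupt rate */
#define TIMER_TICK_LIMIT  (3u * TIMER_TICK_HZ)   /* about 3 s without a main loop pass */
#define RF_IRQ_LIMIT      20000u                 /* assumed; far above the ~3000/s normal rate */

static volatile uint16_t timer_ticks;    /* timer interrupts since last main loop pass */
static volatile uint16_t rf_irqs;        /* RF interrupts since last main loop pass */
static volatile bool reset_requested;    /* stand-in for an actual MCU reset */

/* Placeholder: on real hardware this would trigger an MCU-specific reset. */
static void force_reset(void)
{
    reset_requested = true;
}

/* Called from the periodic timer interrupt. */
void timer_isr(void)
{
    if (++timer_ticks > TIMER_TICK_LIMIT) {
        force_reset();   /* main loop has been silent for too long */
    }
}

/* Called from the RF receiver interrupt. */
void rf_isr(void)
{
    if (++rf_irqs > RF_IRQ_LIMIT) {
        force_reset();   /* too many RF interrupts without a main loop pass */
    }
}

/* Called once per main loop pass (about 50 Hz in normal operation). */
void main_loop_checkin(void)
{
    /* On an 8-bit MCU these 16-bit writes are not atomic, so real code
       would probably disable interrupts around them. */
    timer_ticks = 0;
    rf_irqs = 0;
}
```

The point of the design is that the reset decision is made inside the interrupt handlers themselves, so it can still happen when the main loop is starved, as long as at least one of the two interrupts keeps firing.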

Winter (well, autumn) got here, so I had to install the system as it is, before I had time to do more testing. So far so good: it's been a few weeks and it's still up.