Friday, January 6, 2023

Server overload

I run a server for IoT-like devices. They send data to the server, which verifies it (catching transmission errors with checksums and so on), checks that it isn't a duplicate entry, and then stores it and acknowledges it to the device, after which the device marks it handled on its end.

This system has been running for almost ten years now, gradually growing in number of users and, by extension, in the amount of data.

Some time ago this failed hard. The server overloaded and requests started timing out. Restarts helped for a while, but since the devices were still trying to send data, it very quickly fell over again. After a while even more devices had data to send but couldn't, and they really didn't have any throttling implemented, so the server had no way to tell them to cool down.

Things were looking quite bad.

I had no great tools to profile the system, so I effectively had to resort to using top to see what was going on. And what was going on was that the SQL server (not the MS product with a similar name) was using the vast majority of the CPU power.

So, I started to think back through what the code was doing; a rough sketch follows the list below.

1) Parse the incoming message, doing validations and checksums
2) Find the client device in the database; discard the data if it doesn't exist
3) Look through the device's data for a duplicate of this entry; ignore the data if a duplicate is found
4) Insert the new data into the database
5) Send the device an acknowledgement (in case of a duplicate, the data is acknowledged without further action)
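
To make that concrete, here is a minimal sketch of what such a single-pass handler could look like, in Python with sqlite3 as a stand-in for the real stack. The message framing, table names and columns are all invented for illustration; this is not the actual code.

    import sqlite3
    from typing import NamedTuple, Optional

    class Msg(NamedTuple):
        device_id: str
        ts: int
        payload: bytes

    def parse_message(raw: bytes) -> Optional[Msg]:
        # 1) validation; the "device|ts|payload|checksum" framing is invented
        try:
            device_id, ts, payload, check = raw.split(b"|")
            if int(check) != sum(payload) % 256:
                return None                     # checksum mismatch
            return Msg(device_id.decode(), int(ts), payload)
        except ValueError:
            return None                         # malformed message

    def handle_message(db: sqlite3.Connection, raw: bytes) -> bytes:
        msg = parse_message(raw)
        if msg is None:
            return b"ERR"                       # reject; device will retry
        if db.execute("SELECT 1 FROM devices WHERE device_id = ?",
                      (msg.device_id,)).fetchone() is None:
            return b"ERR"                       # 2) unknown device, discard
        dup = db.execute(
            "SELECT 1 FROM measurements WHERE device_id = ? AND ts = ?"
            " AND payload = ?",
            (msg.device_id, msg.ts, msg.payload)).fetchone()
        if dup is None:                         # 3) duplicate search (the slow part)
            db.execute(
                "INSERT INTO measurements (device_id, ts, payload)"
                " VALUES (?, ?, ?)",
                (msg.device_id, msg.ts, msg.payload))   # 4) insert
            db.commit()
        return b"ACK"                           # 5) ack either way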

The process is like this for a reason. Basically, I want all data produced by the devices to be stored exactly once, and the process assumes that there will be hiccups: data getting corrupted during transmission, connections dropping mid-transmit for some reason, acknowledgements not being received, and so on.
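
This is also where duplicates come from in the first place: if an acknowledgement is lost, the device re-sends data the server already has. A minimal sketch of the device side, assuming a 'send' transport helper that returns True once an ACK arrives (both the helper and the timing are made up):

    import time

    def flush_outbox(outbox: list, send) -> None:
        # An entry is only marked handled after an ack, so a lost ack
        # leads to a re-send of the same data.
        for entry in list(outbox):
            while not send(entry):   # re-send until acknowledged
                time.sleep(60)       # retry interval is assumed
            outbox.remove(entry)     # mark handled only after the ack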

At this point I had strong suspicions about step 3. After a few experiments (running and timing those duplicate search queries) I confirmed that this step was indeed taking way too long, even with fully indexed device measurement data tables and query optimizations. This was Not Good.
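
The experiments were along these lines: run the duplicate-search query in isolation and time it (again with invented names and sqlite3 as a stand-in):

    import sqlite3
    import time

    db = sqlite3.connect("measurements.db")    # invented file name
    t0 = time.perf_counter()
    db.execute(
        "SELECT 1 FROM measurements WHERE device_id = ? AND ts = ?"
        " AND payload = ? LIMIT 1",
        ("device-123", 1672992000, b"sample")).fetchone()
    print(f"duplicate check: {time.perf_counter() - t0:.3f} s")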

I already knew what I would have to do. I just didn't want to do it, since it would make a small but not insignificant dent in reliability. I had to separate data reception from data storage. But there was no other option.

So I created a new table that contains just raw measurement data. The process above was split into two separate parts, running independently (both sketched below).

A1) Parse incoming data, validating etc.
A2) Store the data in the intermediate table
A3) Send the device an acknowledgement
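
Part A reduces to little more than validate, store, ack. A sketch, continuing with the same invented names and reusing parse_message() from the earlier sketch:

    import sqlite3

    # Invented schema: CREATE TABLE incoming_raw (id INTEGER PRIMARY KEY, raw BLOB)
    def receive(db: sqlite3.Connection, raw: bytes) -> bytes:
        if parse_message(raw) is None:      # A1) validate only
            return b"ERR"
        db.execute("INSERT INTO incoming_raw (raw) VALUES (?)", (raw,))  # A2)
        db.commit()
        return b"ACK"                       # A3) ack immediately, no heavy queries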

Then there is B, another script run as a cronjob. Architecturally this could be a continuously running service that is notified by (A) via a signal or such when there is data available. A cronjob was, however, a quick and good enough solution here, causing a latency of a few minutes before received data appears in the main database. Basically a non-issue here. A sketch of such a script follows the list below.

B1) Load a batch of data from the intermediate table
B2) Find the client device in the database; discard the data if not found
B3) Look for duplicates; discard if found
B4) Insert the data
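
Here is what that script might look like, reusing parse_message() and the invented schema from the earlier sketches. A crontab entry along the lines of "*/5 * * * * /usr/bin/python3 store_data.py" (the path and file name are made up) would give the few-minute latency mentioned above.

    import sqlite3

    def process_batch(db: sqlite3.Connection, limit: int = 1000) -> None:
        # B1) load a batch of raw messages
        rows = db.execute(
            "SELECT id, raw FROM incoming_raw ORDER BY id LIMIT ?",
            (limit,)).fetchall()
        for row_id, raw in rows:
            msg = parse_message(raw)        # re-validate; from earlier sketch
            if msg is not None:
                known = db.execute(
                    "SELECT 1 FROM devices WHERE device_id = ?",
                    (msg.device_id,)).fetchone()                       # B2)
                dup = db.execute(
                    "SELECT 1 FROM measurements WHERE device_id = ?"
                    " AND ts = ? AND payload = ?",
                    (msg.device_id, msg.ts, msg.payload)).fetchone()   # B3)
                if known is not None and dup is None:
                    db.execute(
                        "INSERT INTO measurements (device_id, ts, payload)"
                        " VALUES (?, ?, ?)",
                        (msg.device_id, msg.ts, msg.payload))          # B4)
            db.execute("DELETE FROM incoming_raw WHERE id = ?", (row_id,))
        db.commit()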

And just like that, the entire server issue was solved, at the cost of a very slight reduction in data reliability (adding a very unlikely but non-zero possibility that data corrupted in transport somehow passes A1 but then fails at B2 or B3, with no possibility of re-transmission).

Since then the site has been chugging along nicely. I have also added some throttling to the devices (by using their remote update capability), which should ease the worst-case scenario in the future. Switching to a beefier server is another future-proofing option, which I need to keep an eye on in case things start to get iffy again. And of course there is the option of switching to separate database and front-end servers.
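
A minimal sketch of one way to do that kind of throttling, assuming exponential backoff with jitter (the helper names are made up, as before):

    import random
    import time

    # 'send' is an assumed transport helper returning True once an ACK arrives.
    def send_with_backoff(send, entry, base: float = 60.0, cap: float = 3600.0) -> None:
        delay = base
        while not send(entry):
            time.sleep(delay + random.uniform(0, delay))  # jitter spreads retries
            delay = min(delay * 2, cap)                   # back off up to the cap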