Friday, April 20, 2018

A lesson on forward design


I've got a product that is now over ten years old (measuring from the start of design), with several years of lifetime left before its production ramps down. And then maybe another five before most of the units are removed from service.

This product consists of two parts: a main unit and a relay/aux unit, connected together with a serial link (essentially RS-232 with no flow control). Relay/aux in this context means that it both controls some high(er)-current devices and acts as a communication hub for some auxiliary devices (via several other serial links). The plan didn't go above 38400 bps anywhere, with the usual speed being 9600 bps and a lot of idle time between communication bursts, so all in all, pretty easy stuff.

So, as it usually starts, I chose a relatively simple, ASCII-based command/response protocol with fairly long timeouts for the first iteration of the protocol. This makes it very easy to test at first, using a simple terminal program and typing commands by hand, which makes product release faster.

Unfortunately, the limits of that design became apparent very quickly, after just a few (this being a very relative unit) devices were delivered. Long timeouts mean that in case of any error, there is a long wait before the system can even start to attempt recovery. Then, retrying an operation might cause it to fire twice before the acknowledgement came through. When setting the state of a relay this isn't a problem; it's still "on", no matter if you set it once or a thousand times. But when sending packets of serial data... Yeaaah, not so nice. So the manual-friendly commands proved to be less than machine-friendly, and then I started to need the capability to communicate with several devices at the same time (previously the need was strictly sequential: the main unit opens an exclusive link to one of the connected devices and, when done with that for the time being, opens another exclusive link to another device).

Enter revision two. This was fairly early in the product lifecycle, and it was a "quick fix". The improved protocol allows easier "muxing" of data: data is sent in frames, and frames now have a target (where they should go: relay control parts, aux devices and so on) and a checksum, making the system much more robust. Of course, the commands are still essentially version 1 commands, now just wrapped in frames. This fixed many of the original issues, but not all of them. Most importantly, a command might still fire twice before being acknowledged, as the implementation had no guards against that. Not exactly your typical "two army problem", but close to it.
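To make the framing idea concrete, here is a minimal sketch of a frame with a target and a checksum. The actual byte layout of my v2 frames is not shown here; the start byte `SOF`, the one-byte length field and the additive checksum below are assumptions made purely for illustration.

```python
SOF = 0x7E  # hypothetical start-of-frame marker, not the real protocol's value


def checksum(body: bytes) -> int:
    """Two's-complement additive checksum: body bytes plus checksum sum to 0 mod 256."""
    return (-sum(body)) & 0xFF


def encode_frame(target: int, payload: bytes) -> bytes:
    """Wrap a v1-style command in a frame: SOF, target, length, payload, checksum."""
    if not 0 <= target <= 0xFF:
        raise ValueError("target must fit in one byte")
    if len(payload) > 0xFF:
        raise ValueError("payload too long for a one-byte length field")
    body = bytes([target, len(payload)]) + payload
    return bytes([SOF]) + body + bytes([checksum(body)])


def decode_frame(frame: bytes) -> tuple[int, bytes]:
    """Validate a frame and return (target, payload); raise ValueError if it is damaged."""
    if len(frame) < 4 or frame[0] != SOF:
        raise ValueError("missing start-of-frame marker")
    target, length = frame[1], frame[2]
    if len(frame) != 4 + length:
        raise ValueError("length field does not match frame size")
    if (sum(frame[1:-1]) + frame[-1]) & 0xFF != 0:
        raise ValueError("checksum mismatch")
    return target, frame[3:3 + length]
```

Note that nothing in such a frame tells the receiver whether it has seen the same frame before, which is exactly the gap that a retried command can slip through.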

At that time this wasn't a major issue, so I let it slide.

Fast forward some years to this day. I need to interface a WLAN module that uses 115 kbps with flow control and can potentially send large amounts of data in bursts. With a design originally planned to run at 38.4 kbps max. Now the "two army problem" is becoming a serious nuisance, as receiving duplicate data packets is a huge no-no for data integrity. I've grown very fond of data integrity over the years, you see. Much more so than when I started this design. This trouble here might have been one reason for it.

Data integrity - defined here as the ability of the "lower level interface" to transmit and receive data with as few errors as possible, and to correct those errors whenever possible - is a fairly complex issue once you get into it. When you are working with the TCP/IP stack provided by your operating system it is very easy to overlook the complexities at that level, as the stack already provides you with a huge amount of integrity - not necessarily freedom from errors, but at least your data will arrive in order and exactly once (to your application, that is).

Now, back to the issue at hand. At this point there is a fairly large number of devices deployed already, with aux modules of old and older revisions installed (which, unfortunately, cannot be field-updated - that requires a hardware programmer). On the plus side, the hardware did not need any changes, so I was able to hack together flow control in software using signalling lines already included in the design. But the lack of software update capability means that I can't fully overhaul the system, as it should still be able to work with old modules to a certain degree - at least when not using advanced features like the ones needed by that WLAN module.

Eventually I figured out a way to do this, by extending the revision two frame system with a few commands that carry packet sequence identifiers and the respective acknowledge/timeout mechanisms. And now the system is fairly robust.
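The core of the sequence identifier idea can be sketched as a simple stop-and-wait exchange. This is not my actual implementation: the one-byte sequence counter, the `Receiver` class and the simulated loss of acknowledgements are all assumptions made for illustration.

```python
class Receiver:
    """Delivers each sequence number once; duplicates are acknowledged but dropped."""

    def __init__(self) -> None:
        self.expected_seq = 0
        self.delivered: list[bytes] = []

    def on_frame(self, seq: int, payload: bytes) -> int:
        if seq == self.expected_seq:
            self.delivered.append(payload)
            self.expected_seq = (seq + 1) % 256
        # Anything else is treated as a retransmission of an already delivered
        # frame: acknowledge it again, but do not deliver it a second time.
        return seq


def send_reliably(receiver: Receiver, seq: int, payload: bytes,
                  lost_acks: set[int], max_tries: int = 5) -> int:
    """Send one frame, retransmitting on timeout; return the number of attempts.

    lost_acks holds the attempt numbers (0-based) whose acknowledgement is
    lost on the way back, which the sender experiences as a timeout.
    """
    for attempt in range(max_tries):
        ack = receiver.on_frame(seq, payload)
        if attempt in lost_acks:
            continue  # timeout expired without an acknowledgement: retransmit
        if ack == seq:
            return attempt + 1
    raise TimeoutError("no acknowledgement after %d tries" % max_tries)
```

For example, if the acknowledgements of the first two attempts are lost, the sender transmits the frame three times, but the receiver still hands the payload to the application only once:

```python
rx = Receiver()
attempts = send_reliably(rx, 0, b"wlan data", lost_acks={0, 1})
print(attempts, rx.delivered)
```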

But it is carrying a lot of compatibility burden. So now I have v3 frames that contain v2 frames that contain v1 commands. With a fallback to plain v2 frames if the aux unit doesn't happen to speak the v3 language. Oh joy...
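The layering looks roughly like the sketch below. The `V2`/`V3` byte prefixes and the field order are made up for illustration, and how the main unit learns whether an aux unit speaks v3 is left out; here it is simply passed in as a flag.

```python
def wrap_v2(target: int, v1_command: bytes) -> bytes:
    """Hypothetical v2 envelope: marker, target, length, then the v1 command as-is."""
    return b"V2" + bytes([target, len(v1_command)]) + v1_command


def wrap_v3(seq: int, v2_frame: bytes) -> bytes:
    """Hypothetical v3 envelope: marker and sequence number around a complete v2 frame."""
    return b"V3" + bytes([seq]) + v2_frame


def build_message(v1_command: bytes, target: int, seq: int,
                  peer_speaks_v3: bool) -> bytes:
    """Use v3 when the peer supports it, otherwise fall back to a bare v2 frame."""
    v2_frame = wrap_v2(target, v1_command)
    return wrap_v3(seq, v2_frame) if peer_speaks_v3 else v2_frame
```

An old aux unit gets the same v2 frame it has always received; only the outermost envelope differs between the two paths.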

But hey, next time I design a protocol for a similar use, I will know better.

(You may want to read that as "I will have an existing and tested protocol (almost) ready to use - and strip out the bad parts" ...)




