Monday, 10 October 2016

Embedded debugging fun


Sometimes I forget how difficult crash tracking and fixing can be. And for once I would be really happy to have a step-by-step debugger handy. Unfortunately, for this platform I don't.

So I have this simple struct:

struct struct_wf_settings {
    unsigned short magic;        // A3F0: structure valid
    unsigned char mode;          // 1=sta, 2=ap, other=undefined
    unsigned char ssid[4][32];   // ap:first is name; sta:associated SSIDs
    unsigned char ssidPw[4][32]; // passwords
    unsigned char bssid[4][6];   // MACs
};
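To see where each member of this struct ends up in memory, a small sketch using offsetof can help. The helper name print_wf_settings_layout is made up for this illustration, and the printed numbers depend on the compiler and ABI.

```c
#include <stddef.h>  /* offsetof */
#include <stdio.h>

struct struct_wf_settings {
    unsigned short magic;        // A3F0: structure valid
    unsigned char mode;          // 1=sta, 2=ap, other=undefined
    unsigned char ssid[4][32];   // ap:first is name; sta:associated SSIDs
    unsigned char ssidPw[4][32]; // passwords
    unsigned char bssid[4][6];   // MACs
};

/* Hypothetical helper (not part of the original code): prints the byte
 * offset of every member and the total size of the struct. */
void print_wf_settings_layout(void)
{
    printf("magic  @ %zu\n", offsetof(struct struct_wf_settings, magic));
    printf("mode   @ %zu\n", offsetof(struct struct_wf_settings, mode));
    printf("ssid   @ %zu\n", offsetof(struct struct_wf_settings, ssid));
    printf("ssidPw @ %zu\n", offsetof(struct struct_wf_settings, ssidPw));
    printf("bssid  @ %zu\n", offsetof(struct struct_wf_settings, bssid));
    printf("sizeof = %zu\n", sizeof(struct struct_wf_settings));
}
```

Assuming a 2-byte unsigned short, the char arrays follow each other without padding, so ssid would start at offset 3, ssidPw at 131 and bssid at 259. None of the arrays would then be word-aligned, which is fine for a byte-by-byte copy.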

As you can guess, this is for controlling an external WLAN module, in this case a WF121.

The code is written so that it compiles for several different devices, including the WiFi module evaluation board (with a PC as host via USB port) and the embedded board I'm using (with an ARM processor as host). It works perfectly when run on the PC, but on the embedded side it fails in the following part (some irrelevant code omitted for simplicity, some debug stuff added).

... 
if (addNewAp) {
  // checking pointer to struct; it is valid.
  debugWriteInt("wfSet=", (unsigned int)wfSettings); 
 
  // n is slot; 0-3. In this case 0 as this is first to be added (yes, that too was verified)
  // print address to ssidPw used. This is correct.
  debugWriteInt("ssid p adx=", (unsigned int)wfSettings->ssidPw[n]); 
 
  // store ssid (kept in temporary buffer, typed char[66])
  cmemcpy(wfSettings->ssid[n], wfConnectTempBuff, 32); 
  
  // store password; also in same temp buffer 
  cmemcpy(wfSettings->ssidPw[n], wfConnectTempBuff+33, 32);
 
  // store bssid from event sent by module
  cmemcpy(wfSettings->bssid[n], buff+6, 6);
 
  // signal change to main 
  wfEvent(WFE_SETTING_CHANGE, 0,0);
}
... 

This is part of the "Access Point connected" event handler, where we're figuring out what we actually just connected to. At this point we've figured out that this AP wasn't previously known, so we add it to our table of APs we can use in the future.

This crashes on the second cmemcpy (ssidPw). After some checking I found out that it fails because the target address is invalid. Except it's kinda-sorta correct - the lowest 24 bits of the target address are exactly correct, but the highest 8 bits of the (32-bit) address are now 0x61, causing the processor to segfault when trying to access a (nonexistent) memory area.
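The "low 24 bits right, top byte wrong" check can be written as a couple of masks. This is only a sketch: the helper names are invented for illustration and nothing like them appears in the original code.

```c
#include <stdint.h>

/* Hypothetical helpers for comparing two 32-bit addresses. */

/* Top 8 bits of a 32-bit address. */
uint32_t addr_top_byte(uint32_t addr)
{
    return addr >> 24;
}

/* Lowest 24 bits of a 32-bit address. */
uint32_t addr_low24(uint32_t addr)
{
    return addr & 0x00FFFFFFu;
}

/* Returns 1 when the low 24 bits match but the top byte differs,
 * which is the corruption pattern described above; 0 otherwise. */
int only_top_byte_differs(uint32_t expected, uint32_t actual)
{
    return addr_low24(expected) == addr_low24(actual)
        && addr_top_byte(expected) != addr_top_byte(actual);
}
```

Feeding the address printed by debugWriteInt and the faulting address into only_top_byte_differs would confirm the pattern without comparing hex dumps by eye.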

What the f***?

I'm using a custom "standard" library here, and cmemcpy (so named to avoid a namespace clash) is a simple while-loop that does nothing but copy source to target byte by byte. Nothing fancy, exotic or optimized in a way that could do anything insidious, nothing is overflowing, and it has already been used for several years without any issues (and yes, I know I should also be using sizeof() in the call, but the copy size isn't the issue here).
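The real cmemcpy isn't shown here, so the following is only a guess at what such a byte-by-byte loop typically looks like. The name cmemcpy_sketch and the memcpy-like signature are assumptions.

```c
#include <stddef.h>

/* Assumed shape of a plain byte-by-byte copy; the actual cmemcpy
 * may differ in signature and return value. */
void *cmemcpy_sketch(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy one byte per iteration until len bytes have been moved. */
    while (len--) {
        *d++ = *s++;
    }
    return dst;
}
```

A loop like this makes no alignment assumptions and never writes outside the len bytes starting at dst.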

Note the second debugWriteInt; it prints out the address of the ssidPw slot I want to use. The address there is correct. Just a few lines later it isn't anymore.

I even tried to dig through the disassembly, but since I am not sufficiently familiar with ARM assembly this was a bit of a dead end. Nothing very obviously wrong though, and specifically no mysterious 0x61000000 added to r0 anywhere. A reference to [sp, #20] kinda stumped me; I'm guessing it refers to the n variable, but it was "a bit" difficult to figure out so I gave up (oh, did I mention that this is compiled with -O2?)

Unless this is a compiler bug (and quick googling suggested that there have been somewhat similar issues in the GCC 4.7 branch, although on x86), I'm at a loss here. I'm a bit apprehensive about upgrading the compiler as it might cause other issues elsewhere. The devil you know, you know...

Eventually I just moved the first cmemcpy to the end of the sequence, and now the code works correctly. Not the first time code rearrangement fixed things. But I can't shake the feeling that there is still a ticking time bomb in there... So it works correctly, for now.
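For reference, the reordered sequence would look roughly like this. It is a sketch only: store_new_ap and its parameters are invented for illustration, the standard memcpy stands in for cmemcpy, and the copy sizes come from sizeof instead of hard-coded numbers. It shows the ordering, not why the ordering matters.

```c
#include <string.h>

struct struct_wf_settings {
    unsigned short magic;        // A3F0: structure valid
    unsigned char mode;          // 1=sta, 2=ap, other=undefined
    unsigned char ssid[4][32];   // ap:first is name; sta:associated SSIDs
    unsigned char ssidPw[4][32]; // passwords
    unsigned char bssid[4][6];   // MACs
};

/* Hypothetical wrapper around the three copies, in the reordered sequence.
 * tempBuff: 66-byte temporary buffer, SSID at offset 0, password at 33.
 * eventBuff: event data from the module, BSSID at offset 6.
 * n: slot index, 0-3. */
void store_new_ap(struct struct_wf_settings *s, unsigned int n,
                  const unsigned char *tempBuff,
                  const unsigned char *eventBuff)
{
    /* password first */
    memcpy(s->ssidPw[n], tempBuff + 33, sizeof s->ssidPw[n]);

    /* then bssid from the event sent by the module */
    memcpy(s->bssid[n], eventBuff + 6, sizeof s->bssid[n]);

    /* ssid copy moved to the end of the sequence */
    memcpy(s->ssid[n], tempBuff, sizeof s->ssid[n]);
}
```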

And I don't like it.



