Friday, 25 November 2016

C: sizeof(pointer) is ...?


Pointers in C are one of the biggest traps for programmers, novices and experienced alike. On a modern PC they are fairly simple to use - a linear 32- or 64-bit address, handily the same size as the architecture's native word, so you "can" cast one to an integer and do other things with it (but you really shouldn't, unless you really, really know what you are doing).

Too often people assume that sizeof(int*) == sizeof(int). On many architectures this holds - but not on all, and not even on all x86-based ones. That is why you really shouldn't do the casting thing I mentioned above. For example, for backwards compatibility 64-bit Linux (amd64 ABI) has a 32-bit int, while its pointers are 64 bits wide.

And when you get to older architectures it gets even weirder.

Let's take the original 16-bit x86 as an example. It was, of course, a 16-bit architecture, but it could address 1 MB of memory in total. Memory was addressed with a segment:offset scheme; both parts being 16-bit values, the whole pointer was 32 bits (I won't go into the actual addressing details here, as they are irrelevant to the discussion).
But wait, there's more! This was the "far" pointer, which could point to any address in the system (no memory protection back then). Then there was also the "near" pointer, which held only the 16-bit offset and so could only access memory within a single segment. And then there was apparently a "huge" pointer, which I've never encountered myself.

So here even sizeof(int*) isn't necessarily sizeof(int*), if one of them happens to be "near" and the other "far". Confused already?

How about the embedded world, then?

I used to work with C51-based (8051-family) devices. This is an old architecture, introduced back in the early 80s, but it is still going strong in modern variants. Different variants are easily and cheaply available from several manufacturers, and many (old-school) programmers are familiar with it, so these chips are very popular in the embedded world.
These devices are 8-bit MCUs (int is typically 16 bits) and have four different memory access modes (and three completely separate memories):
  • DATA, directly accessible RAM, 128 bytes.
  • IDATA, indirectly accessible RAM, 256 bytes total. This overlaps DATA for the first 128 bytes.
  • XDATA, external RAM, up to 64 KB. These days this is not really external but built into the chip itself. Being a separate memory, it does not overlap with DATA or IDATA.
  • CODE, the code memory. I've used devices with up to 8 KB of flash built in.
So here a pointer must identify which memory is accessed *and* hold an address of up to 16 bits within it. Commonly the pointer is therefore three bytes: one byte to specify which memory is accessed, and two to specify the offset (and yes, other compilers may have used other methods; this is the one I am familiar with).

Having separate memory spaces for code and data is known as the Harvard architecture, and it is common even these days; many Atmel chips use it and, by extension, so do many Arduino devices. I am not familiar with those (aside from the code/data separation) or with how C compilers there handle pointers, but I wouldn't be surprised if methods similar to those used with C51 applied.

C is full of traps, especially when you start mixing several architectures in the same project, and it's kinda okay to play fast and loose iff you know damn well what you are doing. If you don't (even I don't, not always), play it safe. Don't assume you know the sizes, and don't do any wild casts. It'll come back to bite you.

I should know, I've been there, many, many times over the years.





