The case of the different shifts

Larry Osterman has commented on an interesting edge case in the C/C++ standards, involving the underflow of the right shift operator.

They reported that if they compiled code which calculated: 1130149156 >> -05701653 it generated different results on 32bit and 64bit operating systems. On 32bit machines it reported 0 but on 64bit machines, it reported 0x21a.

This is one of various areas of “undefined behavior” for which you can ask 2 engineers what it should do and get 3 answers, at least one being that “but I know there must be one!” I think I know where at least 2 of the answers are coming from …

The Hardware shift

On x64, Larry mentions that the sar instruction is used to carry out the operations. This instruction is not new to the x86 architecture, and in that architecture line’s early days you did not waste die space. So what’s cheap?

Enter the barrel shifter. It’s cheap, it’s small, you pick the maximum power-of-2 shift you want, and that’s what you pay for. For a 16-bit shift, you can support shifts of 0 to 15 and pay for 4 stages. If you want to shift out all 16, you might think “add another stage” — but it gets worse! It only takes rotation as input. If you need to identify greater shifts than 31, then you’re toast.

No, what you consider doing is detect shifts > 15, and add a an extra stage to mask away the entire output. But wait, it’s 1972, and you don’t have either the die space nor the extra cycles to waste on a conditional masking stage (masking to 4 bits of shift input is free — you just don’t connect the other 12 input bits to anything). You just tell the programmer to never shift too much, use the bottom most bits, and ignore the rest. (1 >> 128) == 1.

In time the processor is expanded to support 32-bit quantities and 64-bit quantities, but each time you need to remain compatible, so the sar instruction continues to mask away the upper bits.

The Programming Language

I’m gonna gloss over the whole “undefined behavior” thing. Enough others have discussed it. Even languages or implementations which aim to minimize it will occasionally miss edge cases or fall short.

And C takes a very brutal route of making a great many things “undefined behavior” in the name of permitting performance shortcuts. For instance, signed arithmetic overflow yields undefined behavior. This might seem strange now, but at the time 2’s compliment arithmetic was not common. Choosing 2’s compliment wrap-around would have introduced perf penalties on 1’s compliment architectures as they did not generally include native 2’s compliment instructions, and vice-versa. A modern spin on this are special purpose DSP chips that default to saturating arithmetic.

In any case, I imagine a similar case stands for shifting. One architecture may prefer precise shifting, but x86 prefers to take shifts in modulo. Rather than specify one behavior and penalize compilation on all other hardware, the C language says instead “this area off limits — if you want a particular slow-but-precise behavior, implement it yourself”

The Software Shift

The 32-bit x86 CRT routine mentioned was _allshr (the extra leading underscore comes from cdecl calling convention on x86), and you can find its source in your Visual C++ product installation, under %programfiles%/Microsoft Visual Studio 10.0/vc/crt/src/intel/llshr.asm. It leads off:

; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result
; depends only on the high order bit of edx).
        cmp     cl,64
        jae     short RETSIGN

Interesting — for large shifts, it bails early. Following it is another branch, switching between different implementations for 0-31 and 32-63 bit. In any case, the resulting 3 code paths are short, with the cl >= 64 path being the shortest of all.

Now, the designer could have considered masking instead of comparing — but for random shift amounts, that would have been a longer code sequence to execute*. And by most reasonable interpretations, returning the sign bit is the right answer.

So there you have it

Each implementation is most efficient and reasonable, when and where it was designed, and both are permitted by a language standard that valued efficiency above most else. And in that respect, both implementations were perfect … at least, they were when originally written.

* yes, the conditional branch is probably a bigger cost than the savings… today. While I don’t know when this code was written, it mentions PROC NEAR, leading me to think that perhaps it comes from a time before superscalar architectures, where branches were cheap(er) and arithmetic was costly. You might be want to change it now, but that decision opens up back-compat questions.

Leave a comment

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.