3.9 Division using multiplication
Unrolled loops are slower than short loops if there is a fast cache betweenRAMandCPUand the body
of loop can fit into cache, andCPUwill load the code from it not touching theRAM. Fast loops are the
loops which body’s size can fit into L1 cache, but even faster loops are those small ones which can fit into
micro-operation cache.
3.9 Division using multiplication
A very simple function:
int f(int a)
{
return a/9;
};
3.9.1 x86
...is compiled in a very predictable way:
Listing 3.18: MSVC
_a$ = 8 ; size = 4
_f PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _a$[ebp]
cdq ; sign extend EAX to EDX:EAX
mov ecx, 9
idiv ecx
pop ebp
ret 0
_f ENDP
IDIVdivides the 64-bit number stored in theEDX:EAXregister pair by the value in theECX. As a result,
EAXwill contain thequotient, andEDX— the remainder. The result is returned from thef()function in the
EAXregister, so the value is not moved after the division operation, it is in right place already.
SinceIDIVuses the value in theEDX:EAXregister pair, theCDQinstruction (beforeIDIV) extends the value
inEAXto a 64-bit value taking its sign into account, just asMOVSXdoes.
If we turn optimization on (/Ox), we get:
Listing 3.19: Optimizing MSVC
_a$ = 8 ; size = 4
_f PROC
mov ecx, DWORD PTR _a$[esp-4]
mov eax, 954437177 ; 38e38e39H
imul ecx
sar edx, 1
mov eax, edx
shr eax, 31 ; 0000001fH
add eax, edx
ret 0
_f ENDP
This is division by multiplication. Multiplication operations work much faster. And it is possible to use this
trick^12 to produce code which is effectively equivalent and faster.
This is also called “strength reduction” in compiler optimizations.
GCC 4.4.1 generates almost the same code even without additional optimization flags, just like MSVC with
optimization turned on:
(^12) Read more about division by multiplication in [Henry S. Warren,Hacker’s Delight, (2002)10-3]