Assembly Language for Beginners


3.8. DUFF’S DEVICE


movsxd rax, r8d
cmp rax, r9
jb SHORT $LL19@bzero
$LN11@bzero:
; work out the tail
and edx, 7
dec rdx
cmp rdx, 6
ja SHORT $LN9@bzero
lea r8, OFFSET FLAT:__ImageBase
mov eax, DWORD PTR $LN22@bzero[r8+rdx*4]
add rax, r8
jmp rax
$LN8@bzero:
mov BYTE PTR [rcx], 0
inc rcx
$LN7@bzero:
mov BYTE PTR [rcx], 0
inc rcx
$LN6@bzero:
mov BYTE PTR [rcx], 0
inc rcx
$LN5@bzero:
mov BYTE PTR [rcx], 0
inc rcx
$LN4@bzero:
mov BYTE PTR [rcx], 0
inc rcx
$LN3@bzero:
mov BYTE PTR [rcx], 0
inc rcx
$LN2@bzero:
mov BYTE PTR [rcx], 0
$LN9@bzero:
fatret 0
npad 1
$LN22@bzero:
DD $LN2@bzero
DD $LN3@bzero
DD $LN4@bzero
DD $LN5@bzero
DD $LN6@bzero
DD $LN7@bzero
DD $LN8@bzero
bzero ENDP


The first part of the function is predictable. The second part is just an unrolled loop and a jump that passes
control to the correct instruction inside it. There is no other code between the MOV/INC instruction
pairs, so execution falls through to the very end, executing as many pairs as needed. By the way, we
can observe that each MOV/INC pair occupies a fixed number of bytes (3+3), so a pair takes 6 bytes.
Knowing that, we can get rid of the switch() jumptable: we can just multiply the input value by 6 and jump
to current_RIP + input_value * 6.
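The fall-through behaviour described above can be sketched in C. This is only an illustration of the tail-handling part of the listing, not the book's original source: the name zero_tail and its return value are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the listing): zero the last (cnt & 7)
   bytes starting at dst, the way the unrolled tail does.
   Returns a pointer just past the last byte written. */
static uint8_t *zero_tail(uint8_t *dst, size_t cnt)
{
    switch (cnt & 7) {          /* and edx, 7 */
    case 7: *dst++ = 0;         /* $LN8@bzero */
            /* fall through */
    case 6: *dst++ = 0;         /* $LN7@bzero */
            /* fall through */
    case 5: *dst++ = 0;         /* $LN6@bzero */
            /* fall through */
    case 4: *dst++ = 0;         /* $LN5@bzero */
            /* fall through */
    case 3: *dst++ = 0;         /* $LN4@bzero */
            /* fall through */
    case 2: *dst++ = 0;         /* $LN3@bzero */
            /* fall through */
    case 1: *dst++ = 0;         /* $LN2@bzero (the listing omits the final INC) */
            /* fall through */
    case 0: break;              /* $LN9@bzero: nothing left to do */
    }
    return dst;
}
```

Entering at case 3, for instance, executes exactly three stores. A compiler may turn such a switch() into a jumptable like $LN22@bzero, but that depends on the compiler and its options.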


This can also be faster, because we do not need to fetch a value from the jumptable.


6 is probably not a very good constant for fast multiplication, and maybe it is not worth
it, but you get the idea^11.
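The offset arithmetic can be worked through in C. This is a sketch under stated assumptions: the 3+3 byte sizes are taken from the text, offsets are measured from the first MOV ($LN8@bzero), and the helper names times6 and entry_offset are made up for this example. To execute exactly tail stores, the first (7 - tail) pairs have to be skipped, so this sketch multiplies that number, rather than the raw input value, by 6.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_TAIL = 7 };   /* the tail is (cnt & 7), so at most 7 stores */

/* x * 6 without a multiply instruction: 6x = 4x + 2x. */
static size_t times6(size_t x)
{
    return (x << 2) + (x << 1);
}

/* Hypothetical helper: byte offset, relative to $LN8@bzero, of the
   instruction to jump to so that exactly `tail` stores are executed.
   Assumes every MOV/INC pair is 6 bytes long, as stated in the text. */
static size_t entry_offset(size_t tail)
{
    assert(tail >= 1 && tail <= MAX_TAIL);
    return times6(MAX_TAIL - tail);
}
```

With these assumptions, entry_offset(7) is 0 (start at $LN8@bzero) and entry_offset(1) is 36 (start at $LN2@bzero), the same targets the jumptable provides. Whether the shift-and-add sequence actually beats the table lookup on a given CPU would have to be measured.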


That is what old-school demomakers did in the past with unrolled loops.


3.8.1 Should one use unrolled loops?


Unrolled loops can be beneficial if there is no fast cache memory between RAM and the CPU, so that the CPU
must load the code of each next instruction from RAM every time. This is the case for modern
low-cost MCUs and old CPUs.


(^11) As an exercise, you can try to rework the code to get rid of the jumptable. The instruction pair can be rewritten in such a way that it
will consume 4 bytes or maybe 8. 1 byte is also possible (using the STOSB instruction).
