3.8 Duff’s device
inc r10
mov DWORD PTR [r9+rcx], r11d
add r9, 12
add rax, 28
cmp r10, r8
jb .B1.8
jmp exit
just_copy::
; R8 = cnt
; RDX = a2
; RCX = a1
xor r10d, r10d
xor r9d, r9d
xor eax, eax
.B1.11::
mov r11d, DWORD PTR [rax+rdx]
inc r10
mov DWORD PTR [r9+rcx], r11d
add r9, 12
add rax, 28
cmp r10, r8
jb .B1.11
exit::
ret
First, there are some decisions taken, then one of the routines is executed.
Looks like it is a check if arrays intersect.
This is very well known way of optimizing memory block copy routines. But copy routines are the same!
This is has to be an error of the Intel C++ optimizer, which still produces workable code, though.
Weintentionallyconsideringsuchexamplecodeinthisbooksothereaderwouldunderstandthatcompiler
output is weird at times, but still correct, because when the compiler was tested, it passed the tests.
3.8 Duff’s device
Duff’s device^9 is an unrolled loop with the possibility to jump to the middle of it. The unrolled loop is
implemented using a fallthrough switch() statement. We would use here a slightly simplified version of
Tom Duff’s original code. Let’s say, we have to write a function that clears a region in memory. One
can come with a simple loop, clearing byte by byte. It’s obviously slow, since all modern computers have
much wider memory bus. So the better way is to clear the memory region using 4 or 8 bytes blocks. Since
we are going to work with a 64-bit example here, we are going to clear the memory in 8 bytes blocks. So
far so good. But what about the tail? Memory clearing routine can also be called for regions of size that’s
not a multiple of 8. So here is the algorithm:
- calculate the number of 8-bytes blocks, clear them using 8-bytes (64-bit) memory accesses;
- calculate the size of the tail, clear it using 1-byte memory accesses.
The second step can be implemented using a simple loop. But let’s implement it as an unrolled loop:
#include <stdint.h>
#include <stdio.h>
void bzero(uint8_t* dst, size_t count)
{
int i;
if (count&(~7))
// work out 8-byte blocks
for (i=0; i<count>>3; i++)
{
(^9) wikipedia