1.16. LOOPS
lw $v0, 0x28+i($fp)
or $at, $zero ; NOP
slti $v0, 0xA
; if it is less than 10, jump to loc_80 (loop body begin):
bnez $v0, loc_80
or $at, $zero ; branch delay slot, NOP
; finishing, return 0:
move $v0, $zero
; function epilogue:
move $sp, $fp
lw $ra, 0x28+saved_RA($sp)
lw $fp, 0x28+saved_FP($sp)
addiu $sp, 0x28
jr $ra
or $at, $zero ; branch delay slot, NOP
The instruction that is new to us is B. It is actually a pseudo instruction, encoded as BEQ.
One more thing
In the generated code we can see that, after i is initialized, the body of the loop is not executed right away:
the condition for i is checked first, and only after that can the loop body be executed. And that is correct,
because if the loop condition is not met at the beginning, the body of the loop must not be executed.
This is possible in the following case:
for (i=0; i<total_entries_to_process; i++)
loop_body;
If total_entries_to_process is 0, the body of the loop must not be executed at all.
This is why the condition is checked before the body is executed.
However, an optimizing compiler may swap the condition check and the loop body if it is sure that the situation
described here is not possible (as in our very simple example, with compilers like Keil,
Xcode (LLVM) and MSVC in optimization mode).
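The two loop shapes discussed here can be sketched in C. This is only an illustration: the function names and the executed counter are hypothetical and are not taken from any compiler output. The first function mirrors the "check first" layout by using goto, the second one is the "body first" layout that is only safe when the number of iterations is known to be non-zero.

```c
#include <stddef.h>

/* Shape 1: jump to the condition before the first iteration.
   The body never runs if total is 0. */
size_t count_check_first(size_t total)
{
    size_t i = 0, executed = 0;
    goto check;
body:
    executed++;          /* stands in for loop_body */
    i++;
check:
    if (i < total)
        goto body;
    return executed;
}

/* Shape 2: the body comes first, the condition is tested afterwards.
   Equivalent to shape 1 only when total > 0; with total == 0 the body
   still runs once, which is why a compiler needs to prove this case
   is impossible before using this layout. */
size_t count_body_first(size_t total)
{
    size_t i = 0, executed = 0;
    do {
        executed++;      /* stands in for loop_body */
        i++;
    } while (i < total);
    return executed;
}
```

For total equal to 10 both functions run the body 10 times. For total equal to 0 they differ: the first one skips the body, the second one executes it once.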
1.16.2 Memory blocks copying routine
Real-world memory copy routines may copy 4 or 8 bytes at each iteration, use SIMD^102, vectorization, etc.
But for the sake of simplicity, this example is the simplest possible.
#include <stdio.h>
void my_memcpy (unsigned char* dst, unsigned char* src, size_t cnt)
{
        size_t i;
        for (i=0; i<cnt; i++)
                dst[i]=src[i];
};
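For contrast with the byte-by-byte routine, the idea of copying 8 bytes at each iteration can be sketched as follows. This is a minimal sketch, not how any particular library does it: the name my_memcpy_by8 is hypothetical, the blocks are assumed not to overlap, and the fixed-size memcpy() calls are used only as a portable way to express one 64-bit load and one 64-bit store (optimizing compilers typically reduce each of them to a single instruction).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variant of my_memcpy: moves 8 bytes per iteration,
   then finishes the remaining 0..7 bytes one at a time.
   Assumes dst and src do not overlap. */
void my_memcpy_by8(unsigned char *dst, const unsigned char *src, size_t cnt)
{
    size_t i = 0;

    /* main loop: one 64-bit chunk per iteration */
    for (; cnt - i >= 8; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, src + i, sizeof chunk);   /* 8-byte load  */
        memcpy(dst + i, &chunk, sizeof chunk);   /* 8-byte store */
    }

    /* tail: same loop as in the simple version */
    for (; i < cnt; i++)
        dst[i] = src[i];
}
```

The tail loop is needed because cnt is not necessarily a multiple of 8. If cnt is 0, neither loop executes its body.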
Straight-forward implementation
Listing 1.171: GCC 4.9 x64 optimized for size (-Os)
my_memcpy:
; RDI = destination address
; RSI = source address
; RDX = size of block
; initialize counter (i) at 0
xor eax, eax
.L2:
(^102) Single Instruction, Multiple Data