1.19. FLOATING-POINT UNIT
Non-optimizing GCC is more verbose.
First, the function saves its input argument values in the local stack (Register Save Area). Then the code
reloads these values into registers X0/X1 and finally copies them to D0/D1 to be compared using FCMPE. A
lot of redundant code, but that is how non-optimizing compilers work. FCMPE compares the values and
sets the condition flags (N, Z, C, V). At this point the compiler is not yet thinking about the more convenient FCSEL
instruction, so it proceeds the old way, using the BLE instruction (Branch if Less than or Equal). In
the first case (a>b), the value of a gets loaded into X0. In the other case (a<=b), the value of b gets
loaded into X0. Finally, the value from X0 gets copied into D0, because the return value needs to be in this
register.
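The steps just described can be modelled in C. This is only a rough sketch of the data flow, not compiler output: the names save_area, x0 and d0 are hypothetical stand-ins for the Register Save Area and the X0/D0 registers, and the sketch assumes that double is 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: double is a 64-bit value, as it is on ARM64. */
static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits wide");

/* Hypothetical model of the non-optimized function: every value takes
   a detour through the stack and through a general-purpose register. */
double d_max_unoptimized_model(double a, double b)
{
    double save_area[2];  /* stands in for the Register Save Area */
    uint64_t x0;          /* stands in for general-purpose register X0 */
    double d0;            /* stands in for FPU register D0 */

    /* save the input arguments on the local stack */
    save_area[0] = a;
    save_area[1] = b;

    /* reload them for the comparison */
    double cmp_a = save_area[0];
    double cmp_b = save_area[1];

    /* compare, then branch: one arm loads a, the other loads b */
    if (cmp_a > cmp_b)
        memcpy(&x0, &save_area[0], sizeof x0);  /* case a>b  */
    else
        memcpy(&x0, &save_area[1], sizeof x0);  /* case a<=b */

    /* copy the raw 64 bits from X0 into D0, the return register */
    memcpy(&d0, &x0, sizeof d0);
    return d0;
}
```

The memcpy() calls copy the bit pattern unchanged, which is the point: the value is never converted, it just travels through an integer register on its way back to D0.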
Exercise
As an exercise, you can try optimizing this piece of code manually by removing redundant instructions
and not introducing new ones (including FCSEL).
Optimizing GCC (Linaro) 4.9—float
Let’s also rewrite this example to use float instead of double.
float f_max (float a, float b)
{
        if (a>b)
                return a;
        return b;
};
f_max:
; S0 - a, S1 - b
        fcmpe   s0, s1
        fcsel   s0, s0, s1, gt
; now result in S0
        ret
It is the same code, but the S-registers are used instead of the D-registers. That is because values of type float
are passed in 32-bit S-registers (which are in fact the lower halves of the 64-bit D-registers).
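To see why the GT condition gives the same answer as the C expression a>b, the FCMPE/FCSEL pair can be modelled in C. This is a simplified sketch: struct nzcv, fcmpe_model(), cond_gt() and f_max_model() are hypothetical names, and only the four comparison outcomes are modelled, not exceptions.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical model of the four condition flags. */
struct nzcv { bool n, z, c, v; };

/* Flags produced by a floating-point compare on ARM64:
   unordered (a NaN is involved) -> 0011, equal -> 0110,
   less than -> 1000, greater than -> 0010. */
static struct nzcv fcmpe_model(float a, float b)
{
    if (isnan(a) || isnan(b))
        return (struct nzcv){ false, false, true, true };
    if (a == b)
        return (struct nzcv){ false, true, true, false };
    if (a < b)
        return (struct nzcv){ true, false, false, false };
    return (struct nzcv){ false, false, true, false };
}

/* The GT condition: Z is clear and N equals V. */
static bool cond_gt(struct nzcv f)
{
    return !f.z && f.n == f.v;
}

/* fcmpe s0, s1 followed by fcsel s0, s0, s1, gt */
float f_max_model(float a, float b)
{
    struct nzcv flags = fcmpe_model(a, b);
    return cond_gt(flags) ? a : b;  /* select a if GT holds, else b */
}
```

In this model, GT is false for the unordered case, so f_max_model(NAN, 1.0f) yields 1.0f while f_max_model(1.0f, NAN) yields NaN, which matches what the C source does, because a>b is false whenever a NaN is involved.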
MIPS
The co-processor of the MIPS processor has a condition bit which can be set in the FPU and checked in the
CPU.
Earlier MIPS processors have only one condition bit (called FCC0); later ones have 8 (called FCC7-FCC0).
This bit (or these bits) is located in the register called FCCR.
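The idea of a bit that one unit sets and the other one tests can be sketched in C. This is a toy model under stated assumptions: the variable fccr and the functions fcc_write(), fcc_read(), c_lt_d() and bc1t_taken() are hypothetical names chosen to mirror the instructions, and bit n of the model holds FCCn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of FCCR: bit n holds condition bit FCCn. */
static uint32_t fccr;

/* Set or clear condition bit cc (0..7). */
void fcc_write(unsigned cc, bool value)
{
    assert(cc < 8);
    if (value)
        fccr |= 1u << cc;
    else
        fccr &= ~(1u << cc);
}

/* Read condition bit cc (0..7). */
bool fcc_read(unsigned cc)
{
    assert(cc < 8);
    return (fccr >> cc) & 1u;
}

/* FPU side: a "less than" compare stores its result in bit cc. */
void c_lt_d(unsigned cc, double fs, double ft)
{
    fcc_write(cc, fs < ft);
}

/* CPU side: a "branch if true" is taken when bit cc is set. */
bool bc1t_taken(unsigned cc)
{
    return fcc_read(cc);
}
```

On a processor with a single condition bit, cc would always be 0. With eight bits, several comparison results can be kept at the same time and tested later.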
Listing 1.223: Optimizing GCC 4.4.5 (IDA)
d_max:
; set FPU condition bit if $f14<$f12 (b<a):
        c.lt.d  $f14, $f12
        or      $at, $zero ; NOP
; jump to locret_14 if condition bit is set
        bc1t    locret_14
; this instruction is always executed (set return value to "a"):
        mov.d   $f0, $f12 ; branch delay slot
; this instruction is executed only if branch was not taken (i.e., if b>=a)
; set return value to "b":
        mov.d   $f0, $f14
locret_14:
        jr      $ra
        or      $at, $zero ; branch delay slot, NOP
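The control flow of this listing, including the branch delay slot, can be written out in C. This is a sketch for reading the listing, not decompiler output; d_max_mips_model(), fcc0 and f0 are hypothetical names standing in for the function, the condition bit and the $f0 register.

```c
#include <stdbool.h>

/* Hypothetical C model of the MIPS listing: a is in $f12, b is in $f14. */
double d_max_mips_model(double a, double b)
{
    bool fcc0 = (b < a);  /* c.lt.d $f14, $f12 sets the condition bit */
    double f0;

    /* Branch delay slot: executed whether or not the branch is taken. */
    f0 = a;               /* mov.d $f0, $f12 */

    /* bc1t jumps over the next instruction if the bit is set, so the
       overwrite below happens only when the branch is not taken. */
    if (!fcc0)
        f0 = b;           /* mov.d $f0, $f14 */

    return f0;            /* jr $ra, result in $f0 */
}
```

The return value is first set to a unconditionally and then replaced by b only when b>=a, which is why the compiler gets away with a single conditional branch.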