1.19. FLOATING-POINT UNIT
; D0 = a/3.14
fmov d1, x0
; D1 = X0 = b*4.1
fadd d0, d0, d1
; D0 = D0+D1 = a/3.14 + b*4.1
fmov x0, d0 ; \ redundant code
fmov d0, x0 ; /
add sp, sp, 16
ret
.LC25:
.word 1374389535 ; 3.14
.word 1074339512
.LC26:
.word 1717986918 ; 4.1
.word 1074816614
Non-optimizing GCC is more verbose.
There is a lot of unnecessary value shuffling, including some clearly redundant code (the last two FMOV
instructions). Probably, GCC 4.9 is not yet good at generating ARM64 code.
What is worth noting is that ARM64 has 64-bit registers, and the D-registers are 64-bit ones as well.
So the compiler is free to save values of type double in GPRs instead of the local stack. This isn’t possible
on 32-bit CPUs.
And again, as an exercise, you can try to optimize this function manually, without introducing new
instructions like FMADD.
MIPS
MIPS can support several coprocessors (up to 4), the zeroth of which^118 is a special control coprocessor,
and the first coprocessor is the FPU.
As in ARM, the MIPS coprocessor is not a stack machine; it has 32 32-bit registers ($F0-$F31): .3.1 on
page 1042.
When one needs to work with 64-bit double values, a pair of 32-bit F-registers is used.
Listing 1.206: Optimizing GCC 4.4.5 (IDA)
f:
; $f12-$f13=A
; $f14-$f15=B
lui $v0, (dword_C4 >> 16) ;?
; load low 32-bit part of 3.14 constant to $f0:
lwc1 $f0, dword_BC
or $at, $zero ; load delay slot, NOP
; load high 32-bit part of 3.14 constant to $f1:
lwc1 $f1, $LC0
lui $v0, ($LC1 >> 16) ;?
; A in $f12-$f13, 3.14 constant in $f0-$f1, do division:
div.d $f0, $f12, $f0
; $f0-$f1=A/3.14
; load low 32-bit part of 4.1 to $f2:
lwc1 $f2, dword_C4
or $at, $zero ; load delay slot, NOP
; load high 32-bit part of 4.1 to $f3:
lwc1 $f3, $LC1
or $at, $zero ; load delay slot, NOP
; B in $f14-$f15, 4.1 constant in $f2-$f3, do multiplication:
mul.d $f2, $f14, $f2
; $f2-$f3=B*4.1
jr $ra
; sum 64-bit parts and leave result in $f0-$f1:
add.d $f0, $f2 ; branch delay slot
(^118) Starting at 0.