186 4. 3D Math for Games
a single 128-bit register; four operations such as additions or multiplications
are performed in parallel on four pairs of floats using a single instruction. This
is just what the doctor ordered when multiplying a four-element vector by a
4 × 4 matrix!
4.7.1.1. SSE Registers
In packed 32-bit floating-point mode, each 128-bit SSE register contains four
32-bit floats. The individual floats within an SSE register are conveniently
referred to as [ x y z w ], just as they would be when doing vector/matrix math
in homogeneous coordinates on paper (see Figure 4.30). To see how the SSE
registers work, here’s an example of a SIMD instruction:
addps xmm0, xmm1
The addps instruction adds the four floats in the 128-bit XMM0 register to
the four floats in the XMM1 register, and stores the four results back into
XMM0. Put another way:
xmm0.x = xmm0.x + xmm1.x;
xmm0.y = xmm0.y + xmm1.y;
xmm0.z = xmm0.z + xmm1.z;
xmm0.w = xmm0.w + xmm1.w;
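The same packed addition can be reached from C++ through compiler intrinsics. The following is a minimal sketch, assuming an x86 compiler that provides `<xmmintrin.h>`; the helper name `AddFourFloats` is hypothetical and chosen only for illustration. The `_mm_add_ps` intrinsic typically maps to the addps instruction shown above.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_loadu_ps, ...

// Hypothetical helper (not from the text): adds two four-float vectors
// with a single packed SSE addition.
inline void AddFourFloats(const float a[4], const float b[4], float out[4])
{
    __m128 va  = _mm_loadu_ps(a);      // va  = [ a.x a.y a.z a.w ]
    __m128 vb  = _mm_loadu_ps(b);      // vb  = [ b.x b.y b.z b.w ]
    __m128 sum = _mm_add_ps(va, vb);   // four additions in one operation
    _mm_storeu_ps(out, sum);           // write all four results at once
}
```

The unaligned load and store intrinsics are used here so that the sketch works with any float array; code that can guarantee 16-byte alignment would more likely use `_mm_load_ps` and `_mm_store_ps`.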
The four floating-point values stored in an SSE register can be extracted
to or loaded from memory or registers individually, but such operations tend
to be comparatively slow. Moving data between the x87 FPU registers and the
SSE registers is particularly bad, because the CPU has to wait for either the x87
or the SSE unit to spit out its pending calculations. This stalls out the CPU’s
entire instruction execution pipeline and results in a lot of wasted cycles. In a
nutshell, code that mixes regular float mathematics with SSE mathematics
should be avoided like the plague.
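One way to follow this advice is to load each operand once, chain the arithmetic on `__m128` values, and store the result once at the end. The sketch below assumes an x86 compiler with `<xmmintrin.h>`; the function name `AddThenMul` is hypothetical. It is meant only to illustrate the pattern, not to measure the cost of the alternatives.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_mul_ps, ...

// Hypothetical helper (not from the text): computes out = (a + b) * c,
// component by component. The intermediate sum is never written to a
// float variable; it is held in a __m128 value from load to store.
inline void AddThenMul(const float a[4], const float b[4],
                       const float c[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);

    __m128 sum    = _mm_add_ps(va, vb);    // intermediate kept as __m128
    __m128 result = _mm_mul_ps(sum, vc);   // consumed directly by the multiply

    _mm_storeu_ps(out, result);            // single transfer back to memory
}
```

Extracting `sum` into four separate float variables between the two operations would give the same answer, but it would add exactly the kind of per-component traffic that the paragraph above warns against.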
To minimize the costs of going back and forth between memory, x87 FPU
registers, and SSE registers, most SIMD math libraries do their best to leave
data in the SSE registers for as long as possible. This means that even scalar
values are left in SSE registers, rather than transferring them out to float
variables. For example, a dot product between two vectors produces a scalar
result, but if we leave that result in an SSE register it can be used later in other
Figure 4.30. The four components of an SSE register in 32-bit floating-point mode: [ x y z w ], 32 bits each.