Game Engine Architecture


190 4. 3D Math for Games


__m128 mulVectorMatrixAttempt1(__m128 v,
                               __m128 Mcol1, __m128 Mcol2,
                               __m128 Mcol3, __m128 Mcol4)
{
    __m128 vMcol1 = _mm_mul_ps(v, Mcol1);
    __m128 vMcol2 = _mm_mul_ps(v, Mcol2);
    __m128 vMcol3 = _mm_mul_ps(v, Mcol3);
    __m128 vMcol4 = _mm_mul_ps(v, Mcol4);
    // ... then what?
}

The above code would yield the following intermediate results:
    vMcol1 = [ v_x M_11   v_y M_21   v_z M_31   v_w M_41 ];
    vMcol2 = [ v_x M_12   v_y M_22   v_z M_32   v_w M_42 ];
    vMcol3 = [ v_x M_13   v_y M_23   v_z M_33   v_w M_43 ];
    vMcol4 = [ v_x M_14   v_y M_24   v_z M_34   v_w M_44 ].
But the problem with doing it this way is that we now have to add “across
the registers” in order to generate the results we need. For example,
r_x = (v_x M_11 + v_y M_21 + v_z M_31 + v_w M_41), so we’d need to add the
four components of vMcol1 together. Adding across a register like this is
difficult and inefficient, and moreover it leaves the four components of the
result in four separate SSE registers, which would need to be combined into
the single result vector r. We can do better.
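To make the cost of adding across the registers concrete, here is one possible way to finish the first attempt. It is only a sketch and does not come from the original text: horizontalSum and mulVectorMatrixAttempt1Completed are hypothetical names, and spilling each register to memory is just one of several ways to sum its lanes. It assumes that lane 0 of each register holds the first component (x, or the row-1 entry of a column).

```c
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps, ...

// Hypothetical helper: sums the four lanes of one register by storing
// it to memory and adding the floats one at a time (no SIMD benefit).
static float horizontalSum(__m128 a)
{
    float lanes[4];
    _mm_storeu_ps(lanes, a);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

// Attempt 1 carried through: four multiplies, four horizontal sums,
// then a repack of the four scalars into a single result register.
__m128 mulVectorMatrixAttempt1Completed(__m128 v,
                                        __m128 Mcol1, __m128 Mcol2,
                                        __m128 Mcol3, __m128 Mcol4)
{
    float rx = horizontalSum(_mm_mul_ps(v, Mcol1));
    float ry = horizontalSum(_mm_mul_ps(v, Mcol2));
    float rz = horizontalSum(_mm_mul_ps(v, Mcol3));
    float rw = horizontalSum(_mm_mul_ps(v, Mcol4));

    // _mm_setr_ps places its first argument in the lowest lane.
    return _mm_setr_ps(rx, ry, rz, rw);
}
```

Each output component takes a store, three scalar additions and a final repack, which is the overhead the row-based approach avoids.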
The “trick” here is to multiply with the rows of M, not its columns.
That way, we’ll have results that we can add in parallel, and the final sums
will end up in the four components of a single SSE register representing
the output vector r. However, we don’t want to multiply v as-is with the
rows of M. Instead, we want to multiply v_x with all of row 1, v_y with all of
row 2, v_z with all of row 3, and v_w with all of row 4. To do this, we need to
replicate a single component of v, such as v_x, across a register to yield a
vector like [ v_x v_x v_x v_x ]. Then we can multiply the replicated component
vectors by the appropriate rows of M.
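The row-based scheme can be modelled in plain scalar C before any SSE is involved. The sketch below is illustrative only: the function name mulVectorMatrixRowsScalar and the array layout (M[i] holds row i+1 of the matrix) are assumptions, not part of the original text. The inner loop over lanes is the work a single SSE multiply and add would perform in parallel.

```c
// Scalar model of the row-based scheme: r = v_x*row1 + v_y*row2
//                                          + v_z*row3 + v_w*row4.
// Assumed layout: M[i] is row i+1 of the matrix, v = { x, y, z, w }.
void mulVectorMatrixRowsScalar(const float v[4],
                               const float M[4][4],
                               float r[4])
{
    for (int lane = 0; lane < 4; ++lane)
        r[lane] = 0.0f;

    for (int row = 0; row < 4; ++row)
    {
        // v[row] plays the role of the replicated register
        // [ v v v v ]; every lane is scaled by the same component.
        for (int lane = 0; lane < 4; ++lane)
            r[lane] += v[row] * M[row][lane];
    }
}
```

Because every lane of r accumulates independently, no addition ever has to cross lanes, which is why the sums land directly in the four components of one register.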
Thankfully, there’s a powerful SSE instruction which can replicate values
like this. It is called shufps, and it’s wrapped by the intrinsic
_mm_shuffle_ps(). This beast is a bit complicated to understand, because it’s
a general-purpose instruction that can shuffle the components of an SSE
register around in arbitrary ways. However, for our purposes we need only
know that the following macros replicate the x, y, z or w components of a
vector across an entire register:
#define SHUFFLE_PARAM(x, y, z, w) \
    ((x) | ((y) << 2) | ((z) << 4) | ((w) << 6))
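As a rough illustration of how SHUFFLE_PARAM might be used, the sketch below builds four replication macros on top of _mm_shuffle_ps() and applies them in a row-based multiply. The names replicateX through replicateW and mulVectorMatrixRows are hypothetical, chosen only for this example, and the code assumes that lane 0 of v holds x (for instance, a vector built with _mm_setr_ps(x, y, z, w)).

```c
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_shuffle_ps, ...

#define SHUFFLE_PARAM(x, y, z, w) \
    ((x) | ((y) << 2) | ((z) << 4) | ((w) << 6))

// Hypothetical helpers: passing the same register as both shuffle
// sources and the same index four times copies one lane to all four.
#define replicateX(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(0, 0, 0, 0))
#define replicateY(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(1, 1, 1, 1))
#define replicateZ(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(2, 2, 2, 2))
#define replicateW(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(3, 3, 3, 3))

// Row-based r = vM: scale each row by one replicated component of v,
// then add the four scaled rows lane by lane.
__m128 mulVectorMatrixRows(__m128 v,
                           __m128 Mrow1, __m128 Mrow2,
                           __m128 Mrow3, __m128 Mrow4)
{
    __m128 xMrow1 = _mm_mul_ps(replicateX(v), Mrow1);
    __m128 yMrow2 = _mm_mul_ps(replicateY(v), Mrow2);
    __m128 zMrow3 = _mm_mul_ps(replicateZ(v), Mrow3);
    __m128 wMrow4 = _mm_mul_ps(replicateW(v), Mrow4);

    // All additions are lane-wise; nothing crosses lanes.
    return _mm_add_ps(_mm_add_ps(xMrow1, yMrow2),
                      _mm_add_ps(zMrow3, wMrow4));
}
```

The shuffle selector must be a compile-time constant, which is why the replication helpers are written as macros rather than as functions taking a lane index.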