// test the two functions
__m128 c = addWithAssembly(a, b);
__m128 d = addWithIntrinsics(a, b);
// store the original values back to check that they
// weren’t overwritten
_mm_store_ps(&A[0], a);
_mm_store_ps(&B[0], b);
// store results into float arrays so we can print
// them
_mm_store_ps(&C[0], c);
_mm_store_ps(&D[0], d);
// inspect the results
printf("%g %g %g %g\n", A[0], A[1], A[2], A[3]);
printf("%g %g %g %g\n", B[0], B[1], B[2], B[3]);
printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);
printf("%g %g %g %g\n", D[0], D[1], D[2], D[3]);
return 0;
}
4.7.1.4. Vector-Matrix Multiplication with SSE
Let’s take a look at how vector-matrix multiplication might be implemented
using SSE instructions. We want to multiply the 1 × 4 vector v by the 4 × 4
matrix M to generate a result vector r.

$$
\begin{aligned}
\mathbf{r} &= \mathbf{v}\mathbf{M}; \\
\begin{bmatrix} r_x & r_y & r_z & r_w \end{bmatrix}
&= \begin{bmatrix} v_x & v_y & v_z & v_w \end{bmatrix}
\begin{bmatrix}
M_{11} & M_{12} & M_{13} & M_{14} \\
M_{21} & M_{22} & M_{23} & M_{24} \\
M_{31} & M_{32} & M_{33} & M_{34} \\
M_{41} & M_{42} & M_{43} & M_{44}
\end{bmatrix} \\
&= \begin{bmatrix}
(v_x M_{11} & (v_x M_{12} & (v_x M_{13} & (v_x M_{14} \\
{}+ v_y M_{21} & {}+ v_y M_{22} & {}+ v_y M_{23} & {}+ v_y M_{24} \\
{}+ v_z M_{31} & {}+ v_z M_{32} & {}+ v_z M_{33} & {}+ v_z M_{34} \\
{}+ v_w M_{41}) & {}+ v_w M_{42}) & {}+ v_w M_{43}) & {}+ v_w M_{44})
\end{bmatrix}.
\end{aligned}
$$

The multiplication involves taking the dot product of the row vector v
with the columns of matrix M. So to do this calculation using SSE instructions,
we might first try storing v in an SSE register (__m128), and storing each of
the columns of M in SSE registers as well. Then we could calculate all of the
products $v_k M_{kj}$ in parallel using only four mulps instructions, like this:
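As a rough, self-contained sketch of this first attempt, the function below places v in one __m128, gathers each column of M into another, and issues one _mm_mul_ps (mulps) per column. The names mulVectorMatrixByColumns and sumLanes are hypothetical, the matrix is assumed to be stored row-major as float[4][4], and an x86 target with SSE is assumed. The four products in each register are summed with scalar code purely to keep the sketch short.

```c
#include <xmmintrin.h>  // SSE intrinsics (assumes an x86/x64 target)

// Hypothetical helper: adds the four lanes of an SSE register
// using plain scalar code.
float sumLanes(__m128 x)
{
    float t[4];
    _mm_storeu_ps(t, x);  // unaligned store, so t needs no special alignment
    return t[0] + t[1] + t[2] + t[3];
}

// Hypothetical name. Computes r = vM, where v is a 1 x 4 row vector
// and M is a 4 x 4 matrix assumed to be stored row-major (M[row][col]).
void mulVectorMatrixByColumns(const float v[4],
                              const float M[4][4],
                              float r[4])
{
    __m128 vReg = _mm_loadu_ps(v);  // (v_x, v_y, v_z, v_w)

    for (int j = 0; j < 4; ++j)
    {
        // gather column j of M into one register:
        // (M_1j, M_2j, M_3j, M_4j)
        __m128 col = _mm_setr_ps(M[0][j], M[1][j], M[2][j], M[3][j]);

        // one mulps per column:
        // (v_x*M_1j, v_y*M_2j, v_z*M_3j, v_w*M_4j)
        __m128 prod = _mm_mul_ps(vReg, col);

        // r_j is the sum of those four products
        r[j] = sumLanes(prod);
    }
}
```

For v = (1, 2, 3, 4) and M holding the values 1 through 16 in row-major order, this yields r = (90, 100, 110, 120). Note that each component of r still requires a sum across the lanes of a single product register, which is the awkward part of this layout.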