4.7. Hardware-Accelerated SIMD Math
vector calculations without incurring a transfer cost. Scalars are represented
by duplicating the single floating-point value across all four “slots” in an SSE
register. So to store the scalar s in an SSE register, we’d set x = y = z = w = s.
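As a minimal sketch of this idea, the standard intrinsic _mm_set1_ps() replicates one float into all four slots of a register. The helper names BroadcastScalar and ScaleVector are hypothetical and used only for illustration.

```cpp
#include <xmmintrin.h>

// Hypothetical helper (not from the text): replicate one scalar into
// all four 32-bit slots of an SSE register, so that x = y = z = w = s.
inline __m128 BroadcastScalar(float s)
{
    return _mm_set1_ps(s);
}

// Hypothetical helper: scale a four-component vector by a scalar by
// multiplying slot-by-slot with the broadcast copy of s.
inline __m128 ScaleVector(__m128 v, float s)
{
    return _mm_mul_ps(v, BroadcastScalar(s));
}
```

Because the scalar already lives in an SSE register in this form, it can be combined with vectors using ordinary four-wide instructions.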
4.7.1.2. The __m128 Data Type
Using one of these magic SSE 128-bit values in C or C++ is quite easy. The
Microsoft Visual Studio compiler provides a predefined data type called
__m128. This data type can be used to declare global variables, automatic
variables, and even class and structure members. In many cases, variables of
this type will be stored in RAM. But when used in calculations, __m128 values
are manipulated directly in the CPU’s SSE registers. In fact, declaring
automatic variables and function arguments to be of type __m128 often results
in the compiler storing those values directly in SSE registers, rather than
keeping them in RAM on the program stack.
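The following sketch shows __m128 used in each of the roles just listed: a global, structure members, a function argument, and an automatic variable. The names g_gravity, Particle, and Integrate are hypothetical, and the gravity value is arbitrary.

```cpp
#include <xmmintrin.h>

// Global __m128: stored in RAM. Value chosen only for illustration.
static const __m128 g_gravity = _mm_setr_ps(0.0f, -10.0f, 0.0f, 0.0f);

// Hypothetical structure with __m128 members; the compiler gives the
// whole struct 16-byte alignment because of them.
struct Particle
{
    __m128 position;
    __m128 velocity;
};

static_assert(alignof(Particle) == 16, "Particle should be 16-byte aligned");

// Hypothetical function: advance a particle by one time step.
inline void Integrate(Particle& p, float dt)
{
    // Automatic variable; the compiler will often keep it in a register.
    const __m128 dtAll = _mm_set1_ps(dt);

    p.velocity = _mm_add_ps(p.velocity, _mm_mul_ps(g_gravity, dtAll));
    p.position = _mm_add_ps(p.position, _mm_mul_ps(p.velocity, dtAll));
}
```

Whether a given value actually stays in a register is the compiler's decision and depends on optimization settings.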
Alignment of __m128 Variables
When an __m128 variable is stored in RAM, it is the programmer’s
responsibility to ensure that the variable is aligned to a 16-byte address
boundary. This means that the hexadecimal address of an __m128 variable must
always end in the nibble 0x0. The compiler will automatically pad structures
and classes so that if the entire struct or class is aligned to a 16-byte
boundary, all of the __m128 data members within it will be properly aligned as
well. If you declare an automatic or global struct/class containing one or
more __m128s, the compiler will align the object for you. However, it is still
your responsibility to align dynamically allocated data structures (i.e., data
allocated with new or malloc()); the compiler can’t help you there.
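One way to satisfy the alignment requirement for heap data is sketched below, assuming the compiler exposes _mm_malloc() and _mm_free() through <xmmintrin.h>, as the major x86 compilers generally do. The helper names IsAligned16, AllocVectors, and FreeVectors are hypothetical.

```cpp
#include <xmmintrin.h>  // __m128; assumed to also declare _mm_malloc/_mm_free
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the address ends in the nibble 0x0,
// i.e., it lies on a 16-byte boundary.
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

// Hypothetical helper: allocate 'count' __m128 values on the heap with
// an explicit 16-byte alignment request. Returns nullptr on failure.
inline __m128* AllocVectors(std::size_t count)
{
    return static_cast<__m128*>(_mm_malloc(count * sizeof(__m128), 16));
}

// Memory obtained from _mm_malloc() must be released with _mm_free(),
// never with free() or delete.
inline void FreeVectors(__m128* p)
{
    _mm_free(p);
}
```

A check such as IsAligned16() is cheap enough to place in a debug-only assertion wherever a raw pointer is about to be treated as __m128 data.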
4.7.1.3. Coding with SSE Intrinsics
SSE mathematics can be done in raw assembly language, or via inline assembly
in C or C++. However, writing code like this is not only non-portable, it’s
also a big pain in the butt. To make life easier, modern compilers provide
intrinsics: special commands that look and behave like regular C functions,
but are really boiled down to inline assembly code by the compiler. Many
intrinsics translate into a single assembly language instruction, although
some are macros that translate into a sequence of instructions.
In order to use the __m128 data type and SSE intrinsics, your .cpp file
must #include <xmmintrin.h>.
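To illustrate the two kinds of intrinsic, the sketch below pairs _mm_sqrt_ps(), which typically compiles to a single sqrtps instruction, with _MM_TRANSPOSE4_PS, a macro from <xmmintrin.h> that expands into a sequence of shuffle or unpack instructions. The wrapper names SquareRoots and Transpose4x4 are hypothetical.

```cpp
#include <xmmintrin.h>

// Hypothetical wrapper: four square roots at once. On typical compilers
// this intrinsic becomes one sqrtps instruction.
inline __m128 SquareRoots(__m128 v)
{
    return _mm_sqrt_ps(v);
}

// Hypothetical wrapper: transpose a 4x4 matrix held as four row
// registers, in place. _MM_TRANSPOSE4_PS is a macro, not a single
// instruction; it expands to several shuffle/unpack operations.
inline void Transpose4x4(__m128 rows[4])
{
    _MM_TRANSPOSE4_PS(rows[0], rows[1], rows[2], rows[3]);
}
```

The exact instruction sequence produced by the macro varies between compilers, so it is worth inspecting the generated assembly when performance matters.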
As an example, let’s take another look at the addps assembly language
instruction. This instruction can be invoked in C/C++ using the intrinsic
_mm_add_ps(). Here’s a side-by-side comparison of what the code would look
like with and without the use of the intrinsic.