for (i = 0; i < 1024; i++)
{
    C[i] = A[i]*B[i];
}
This fragment of code takes elements from A and B, multiplies them, and saves the results into C.
If each array element is a 32-bit int, then it is possible to load 4 elements from A into a 128-bit XMM register, 4 elements from B into another XMM register, and, by executing PMULLD (Multiply Packed Signed Dword Integers and Store Low Result) and
PMULHW (Multiply Packed Signed Integers and Store High Result), to get 4 64-bit products at once.
Thus, the loop body executes 1024/4=256 times instead of 1024, which is 4 times fewer iterations and, of course, faster.
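To make this concrete, here is a minimal sketch (not taken from the original example) of the same loop written by hand with SSE intrinsics. The _mm_mullo_epi32() intrinsic compiles to PMULLD and keeps only the low 32 bits of each product, which is what the C fragment above stores into C[i]. The global arrays and the function name are assumed for illustration, and SSE4.1 support is required for PMULLD:

#include <smmintrin.h> // _mm_mullo_epi32 (PMULLD, SSE4.1); also provides the SSE2 load/store intrinsics

int A[1024], B[1024], C[1024];

void multiply_vectorized (void)
{
    for (int i=0; i<1024; i+=4) // 4 elements per iteration: 256 iterations instead of 1024
    {
        __m128i a = _mm_loadu_si128 ((__m128i*)&A[i]); // load 4 elements of A
        __m128i b = _mm_loadu_si128 ((__m128i*)&B[i]); // load 4 elements of B
        __m128i c = _mm_mullo_epi32 (a, b);            // PMULLD: 4 low 32-bit products
        _mm_storeu_si128 ((__m128i*)&C[i], c);         // store 4 results into C
    };
};

This is roughly the shape of the code a vectorizing compiler emits for the plain loop.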
25.1.1 Addition example
Some compilers can do vectorization automatically in simple cases, e.g., Intel C++^4.
Here is a tiny function:
int f (int sz, int *ar1, int *ar2, int *ar3)
{
    for (int i=0; i<sz; i++)
        ar3[i]=ar1[i]+ar2[i];
    return 0;
};
Intel C++
Let’s compile it with Intel C++ 11.1.051 win32:
icl intel.cpp /QaxSSE2 /Faintel.asm /Ox
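(GCC can also attempt such automatic vectorization; a comparable invocation, assuming GCC 4.4 or later, would be gcc -O3 -msse2 -S intel.cpp, since -O3 enables its tree vectorizer.)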
We got (in IDA):
; int __cdecl f(int, int *, int *, int *)
public ?f@@YAHHPAH00@Z
?f@@YAHHPAH00@Z proc near
var_10 = dword ptr -10h
sz = dword ptr 4
ar1 = dword ptr 8
ar2 = dword ptr 0Ch
ar3 = dword ptr 10h
push edi
push esi
push ebx
push esi
mov edx, [esp+10h+sz] ; EDX=sz
test edx, edx
jle loc_15B ; if sz<=0, jump
mov eax, [esp+10h+ar3] ; EAX=ar3
cmp edx, 6
jle loc_143 ; if sz<=6, jump
cmp eax, [esp+10h+ar2] ; compare ar3 with ar2
jbe short loc_36
mov esi, [esp+10h+ar2] ; ESI=ar2
sub esi, eax ; ESI=ar2-ar3
lea ecx, ds:0[edx*4] ; ECX=sz*4 (block size in bytes)
neg esi ; ESI=ar3-ar2
cmp ecx, esi
jbe short loc_55 ; if sz*4<=ar3-ar2, the blocks do not overlap: jump
loc_36: ; CODE XREF: f(int,int *,int *,int *)+21
cmp eax, [esp+10h+ar2]
jnb loc_143
(^4) More about Intel C++ automatic vectorization: Excerpt: Effective Automatic Vectorization