Reverse Engineering for Beginners


CHAPTER 25. SIMD


for (i = 0; i < 1024; i++)
{
C[i] = A[i]*B[i];
}


This fragment of code takes elements from A and B, multiplies them and saves the result into C.


If each array element is a 32-bit int, then it is possible to load 4 elements from A into a 128-bit XMM-register, 4 elements from B into another XMM-register, and by executing PMULLD (Multiply Packed Signed Dword Integers and Store Low Result) and
PMULHW (Multiply Packed Signed Integers and Store High Result), it is possible to get 4 64-bit products at once.


Thus, the loop body executes 1024/4 = 256 times instead of 1024, that is 4 times fewer and, of course, faster.
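The low/high split described above can be illustrated in scalar C: a 32×32-bit multiplication yields a 64-bit product, whose low and high 32-bit halves can be stored separately. This is only an illustration of one lane of the packed operation; mul_split is a hypothetical helper, not a real intrinsic:

```c
#include <stdint.h>

// One lane of a packed 32x32 -> 64-bit multiplication:
// compute the full 64-bit product, then split it into
// the low half (what a "store low result" instruction keeps)
// and the high half.
void mul_split(int32_t a, int32_t b, uint32_t *lo, int32_t *hi)
{
    int64_t prod = (int64_t)a * (int64_t)b;
    *lo = (uint32_t)prod;        /* low 32 bits of the product  */
    *hi = (int32_t)(prod >> 32); /* high 32 bits of the product */
}
```

When both inputs are small, the high half is simply zero; it only becomes non-zero once the true product no longer fits in 32 bits.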


25.1.1 Addition example


Some compilers can do vectorization automatically in simple cases, e.g., Intel C++^4.


Here is a tiny function:


int f (int sz, int *ar1, int *ar2, int *ar3)
{
	for (int i=0; i<sz; i++)
		ar3[i]=ar1[i]+ar2[i];

	return 0;
};
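Before looking at what the compiler emits, it may help to see roughly what a vectorized version of this loop looks like when written by hand. This is a sketch using SSE2 intrinsics (_mm_add_epi32 maps to the PADDD instruction), not the compiler's actual output; f_simd is a name chosen here for illustration:

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Add four 32-bit ints per iteration, then finish
   any leftover elements with a scalar loop. */
int f_simd(int sz, int *ar1, int *ar2, int *ar3)
{
	int i = 0;
	for (; i + 4 <= sz; i += 4)
	{
		__m128i a = _mm_loadu_si128((__m128i *)&ar1[i]);
		__m128i b = _mm_loadu_si128((__m128i *)&ar2[i]);
		/* PADDD: 4 parallel 32-bit additions */
		_mm_storeu_si128((__m128i *)&ar3[i], _mm_add_epi32(a, b));
	}
	for (; i < sz; i++) /* tail: sz not divisible by 4 */
		ar3[i] = ar1[i] + ar2[i];
	return 0;
}
```

Note the unaligned loads/stores (_mm_loadu_si128): unlike this sketch, an auto-vectorizing compiler typically emits extra code to reach an aligned position first, as we will see below.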


Intel C++


Let’s compile it with Intel C++ 11.1.051 win32:


icl intel.cpp /QaxSSE2 /Faintel.asm /Ox


We got (in IDA):


; int __cdecl f(int, int *, int *, int *)
public ?f@@YAHHPAH00@Z
?f@@YAHHPAH00@Z proc near


var_10 = dword ptr -10h
sz = dword ptr 4
ar1 = dword ptr 8
ar2 = dword ptr 0Ch
ar3 = dword ptr 10h


push edi
push esi
push ebx
push esi
mov edx, [esp+10h+sz]
test edx, edx
jle loc_15B
mov eax, [esp+10h+ar3]
cmp edx, 6
jle loc_143
cmp eax, [esp+10h+ar2]
jbe short loc_36
mov esi, [esp+10h+ar2]
sub esi, eax
lea ecx, ds:0[edx*4]
neg esi
cmp ecx, esi
jbe short loc_55

loc_36: ; CODE XREF: f(int,int ,int ,int *)+21
cmp eax, [esp+10h+ar2]
jnb loc_143
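What the listing shows so far is a runtime dispatch: the compiler cannot prove at compile time that the arrays are long enough and do not overlap, so it emits checks and falls back to a scalar loop when vectorization would be unsafe. A rough C sketch of that decision logic (choose_simd_path and overlaps are hypothetical names, and the exact thresholds and comparisons are the compiler's, not ours) could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Does dst point inside [src, src + n)? This mirrors the
   pointer-subtraction trick in the listing: the unsigned
   difference is small only when dst lands within the buffer. */
static int overlaps(const int *dst, const int *src, int n)
{
	uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
	return d - s < (uintptr_t)n * sizeof(int);
}

/* Returns 1 if the vectorized loop may be used safely. */
int choose_simd_path(int sz, const int *ar1, const int *ar2,
                     const int *ar3)
{
	if (sz <= 6)                /* "cmp edx, 6 / jle": too short  */
		return 0;
	if (overlaps(ar3, ar1, sz)) /* writes to ar3 would clobber ar1 */
		return 0;
	if (overlaps(ar3, ar2, sz)) /* ... or ar2                      */
		return 0;
	return 1;
}
```

This is why annotating the pointers with restrict (telling the compiler the buffers never alias) can let it drop such checks entirely.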


(^4) More about Intel C++ automatic vectorization: Excerpt: Effective Automatic Vectorization
