for (i = 0; i < 1024; i++)
{
    C[i] = A[i]*B[i];
}
This fragment of code takes elements from A and B, multiplies them, and saves the results into C.
If each array element is a 32-bit int, then it is possible to load 4 elements from A into a 128-bit XMM register, 4 elements from B into another XMM register, and, by executing PMULLD (Multiply Packed Signed Dword Integers and Store Low Result) and
PMULHW (Multiply Packed Signed Integers and Store High Result), to get 4 64-bit products at once.
Thus, the loop body executes 1024/4=256 times instead of 1024, which is 4 times fewer iterations and, of course, faster.
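To make this concrete, here is a minimal sketch (not taken from the original example) of the same loop written by hand with SSE intrinsics. The _mm_mullo_epi32() intrinsic compiles to PMULLD and keeps only the low 32 bits of each product, which is what the C fragment above stores into C[i]. The global arrays and the function name are assumed for illustration, and SSE4.1 support is required for PMULLD:

#include <smmintrin.h> // _mm_mullo_epi32 (PMULLD, SSE4.1); also provides the SSE2 load/store intrinsics

int A[1024], B[1024], C[1024];

void multiply_vectorized (void)
{
    for (int i=0; i<1024; i+=4) // 4 elements per iteration: 256 iterations instead of 1024
    {
        __m128i a = _mm_loadu_si128 ((__m128i*)&A[i]); // load 4 elements of A
        __m128i b = _mm_loadu_si128 ((__m128i*)&B[i]); // load 4 elements of B
        __m128i c = _mm_mullo_epi32 (a, b);            // PMULLD: 4 low 32-bit products
        _mm_storeu_si128 ((__m128i*)&C[i], c);         // store 4 results into C
    };
};

This is roughly the shape of the code a vectorizing compiler emits for the plain loop.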
25.1.1 Addition example
Some compilers can do vectorization automatically in simple cases, e.g., Intel C++^4.
Here is a tiny function:
int f (int sz, int *ar1, int *ar2, int *ar3)
{
    for (int i=0; i<sz; i++)
        ar3[i]=ar1[i]+ar2[i];
    return 0;
};
Intel C++
Let’s compile it with Intel C++ 11.1.051 win32:
icl intel.cpp /QaxSSE2 /Faintel.asm /Ox
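(GCC can also attempt such automatic vectorization; a comparable invocation, assuming GCC 4.4 or later, would be gcc -O3 -msse2 -S intel.cpp, since -O3 enables its tree vectorizer.)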
We got (in IDA):
; int __cdecl f(int, int *, int *, int *)
public ?f@@YAHHPAH00@Z
?f@@YAHHPAH00@Z proc near
var_10 = dword ptr -10h
sz = dword ptr 4
ar1 = dword ptr 8
ar2 = dword ptr 0Ch
ar3 = dword ptr 10h
push edi
push esi
push ebx
push esi
mov edx, [esp+10h+sz] ; EDX=sz
test edx, edx
jle loc_15B ; if sz<=0, jump
mov eax, [esp+10h+ar3] ; EAX=ar3
cmp edx, 6
jle loc_143 ; if sz<=6, jump
cmp eax, [esp+10h+ar2] ; compare ar3 with ar2
jbe short loc_36
mov esi, [esp+10h+ar2] ; ESI=ar2
sub esi, eax ; ESI=ar2-ar3
lea ecx, ds:0[edx*4] ; ECX=sz*4 (block size in bytes)
neg esi ; ESI=ar3-ar2
cmp ecx, esi
jbe short loc_55 ; if sz*4<=ar3-ar2, the blocks do not overlap: jump
loc_36: ; CODE XREF: f(int,int *,int *,int *)+21
cmp eax, [esp+10h+ar2]
jnb loc_143
(^4) More about Intel C++ automatic vectorization: Excerpt: Effective Automatic Vectorization