1.29. SIMD
mov ebx, [esi+ecx*4]
add ebx, [edi+ecx*4]
mov [eax+ecx*4], ebx
inc ecx
cmp ecx, edx
jb short loc_14D
loc_15B: ; CODE XREF: f(int,int *,int *,int *)+A
; f(int,int *,int *,int *)+129 ...
xor eax, eax
pop ecx
pop ebx
pop esi
pop edi
retn
loc_162: ; CODE XREF: f(int,int *,int *,int *)+8C
; f(int,int *,int *,int *)+9F
xor ecx, ecx
jmp short loc_127
?f@@YAHHPAH00@Z endp
The SSE2-related instructions are:
- MOVDQU (Move Unaligned Double Quadword): just loads 16 bytes from memory into an XMM register.
- PADDD (Add Packed Integers): adds 4 pairs of 32-bit numbers and leaves the result in the first operand.
By the way, no exception is raised in case of overflow and no flags are set; only the low 32 bits
of each result are stored. If one of PADDD's operands is the address of a value in memory, then
the address must be aligned on a 16-byte boundary. If it is not aligned, an exception will be triggered
(^181).
- MOVDQA (Move Aligned Double Quadword) is the same as MOVDQU, but requires the address of the
value in memory to be aligned on a 16-byte boundary. If it is not aligned, an exception will be raised.
MOVDQA works faster than MOVDQU, but requires the alignment described above.
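To make the PADDD behaviour concrete, here is a minimal sketch in plain C that emulates the instruction on one pair of 128-bit operands, each viewed as four 32-bit lanes. The helper name paddd_emulated is hypothetical and is not part of any listing here; the sketch only illustrates the wraparound semantics described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration): emulates PADDD
   on two 128-bit values, each treated as four independent 32-bit lanes.
   The result is left in the first operand, as with the real instruction. */
void paddd_emulated(uint32_t dst[4], const uint32_t src[4])
{
    assert(dst != NULL && src != NULL);
    for (int lane = 0; lane < 4; lane++) {
        /* Unsigned addition wraps modulo 2^32: only the low 32 bits are
           kept, nothing carries into the neighbouring lane, and no flag
           or exception reports the overflow. */
        dst[lane] += src[lane];
    }
}
```

For instance, adding 1 to a lane holding 0xFFFFFFFF leaves 0 in that lane, and the other three lanes are not affected.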
So, these SSE2 instructions are to be executed only in case there are more than 4 pairs to work on and
the pointer ar3 is aligned on a 16-byte boundary.
Also, if ar2 is aligned on a 16-byte boundary as well, this fragment of code is to be executed:
movdqu xmm0, xmmword ptr [ebx+edi*4] ; ar1+i*4
paddd xmm0, xmmword ptr [esi+edi*4] ; ar2+i*4
movdqa xmmword ptr [eax+edi*4], xmm0 ; ar3+i*4
Otherwise, the value from ar2 is to be loaded into XMM0 using MOVDQU, which does not require an aligned
pointer, but may work slower:
movdqu xmm1, xmmword ptr [ebx+edi*4] ; ar1+i*4
movdqu xmm0, xmmword ptr [esi+edi*4] ; ar2+i*4 is not 16-byte aligned, so load it to XMM0
paddd xmm1, xmm0
movdqa xmmword ptr [eax+edi*4], xmm1 ; ar3+i*4
In all other cases, non-SSE2 code is to be executed.
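The selection between the three code paths can be sketched in C as follows. This is only an illustration of the conditions stated in the text, not a reconstruction of the exact checks the compiler emitted; the names add_path, is_aligned16 and choose_path are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three paths described in the text. */
enum add_path {
    PATH_SCALAR,      /* plain MOV/ADD loop, one pair per iteration */
    PATH_PADDD_MEM,   /* PADDD reads ar2 directly from memory */
    PATH_MOVDQU_BOTH  /* ar2 is loaded with MOVDQU first, then PADDD reg, reg */
};

/* A pointer is 16-byte aligned when its low 4 address bits are zero. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Assumption: the conditions are taken from the prose (more than 4 pairs,
   ar3 aligned, then ar2 aligned or not). */
enum add_path choose_path(size_t pairs, const int *ar2, const int *ar3)
{
    assert(ar2 != NULL && ar3 != NULL);
    if (pairs <= 4 || !is_aligned16(ar3))
        return PATH_SCALAR;       /* MOVDQA store to ar3 would not be safe */
    if (is_aligned16(ar2))
        return PATH_PADDD_MEM;    /* aligned memory operand is allowed */
    return PATH_MOVDQU_BOTH;      /* unaligned ar2 needs a separate MOVDQU */
}
```

Note that ar1 does not take part in the decision: in both SSE2 fragments it is always loaded with MOVDQU, which accepts any address.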
GCC
GCC may also vectorize in simple cases (^182), if the -O3 option is used and SSE2 support is turned on: -msse2.
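For reference, a loop of the kind being vectorized can be sketched as below. The source is an assumption reconstructed from the function signature and the listings (three int arrays, element-wise addition, return value 0); the parameter names sz, ar1, ar2 and ar3 follow the comments in the listings where available.

```c
/* Assumed source: adds ar1 and ar2 element by element into ar3.
   The function always returns 0, matching the XOR EAX, EAX before RETN. */
int f(int sz, int *ar1, int *ar2, int *ar3)
{
    for (int i = 0; i < sz; i++)
        ar3[i] = ar1[i] + ar2[i];
    return 0;
}
```

Saved under a hypothetical name such as f.cpp, it could be compiled to assembly with something like g++ -O3 -msse2 -S f.cpp. The mangled names in the listings suggest the original was built as C++, although this code is valid C as well.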
What we get (GCC 4.4.1):
; f(int, int *, int *, int *)
public _Z1fiPiSS
_Z1fiPiSS proc near
var_18 = dword ptr -18h
var_14 = dword ptr -14h
(^181) More about data alignment: Wikipedia: Data structure alignment
(^182) More about GCC vectorization support: http://go.yurichev.com/17083