x86 - Sum reduction of unsigned bytes without overflow, using SSE2 on Intel


I am trying to find the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this:

s = 0;
for (i = 0; i < 32; i++) {
    s = s + a[i];
}

However, this takes more time than I can afford, since my application is a real-time application that needs it to be much faster. Please note that the final sum could be more than 255.
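The overflow concern can be made concrete: 32 bytes of at most 255 each sum to at most 32 × 255 = 8160, which does not fit in 8 bits but fits comfortably in 16. A minimal scalar baseline in C++, assuming the data is an array of `uint8_t` (the name `sum_32_scalar` is hypothetical, not from the original post):

```cpp
#include <cstdint>

// Scalar baseline: accumulate in 16 bits so the sum cannot wrap.
// Worst case is 32 * 255 = 8160, well below 65535.
uint16_t sum_32_scalar(const uint8_t a[32]) {
    uint16_t s = 0;
    for (int i = 0; i < 32; i++) {
        s = static_cast<uint16_t>(s + a[i]);
    }
    return s;
}
```

Any SIMD version only has to reproduce this result, so a function like this is a convenient reference to compare against.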

Is there a way I can implement this using low-level SIMD SSE2 instructions? Unfortunately I have never used SSE. I tried searching for an SSE2 function for this purpose, but could not find one. Is SSE guaranteed to reduce the computation time for such small-sized problems?

Any suggestions?

Note: I have implemented similar algorithms using OpenCL and CUDA, and that worked great when the problem size was big. For small-sized problems, though, the cost of the overhead was higher. I am not sure how this works out on SSE.

You can abuse PSADBW to calculate small horizontal sums quickly.

Something like this: (not tested)

pxor   xmm0, xmm0
psadbw xmm0, [a + 0]
pxor   xmm1, xmm1
psadbw xmm1, [a + 16]
paddw  xmm0, xmm1
pshufd xmm1, xmm0, 2
paddw  xmm0, xmm1      ; low word in xmm0 is the total sum
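The reason this works: PSADBW computes the sum of absolute differences of the bytes in each 8-byte half of its operands and writes each sum into the low 16 bits of the corresponding 64-bit lane. With one operand all zero, |0 - x| = x, so each lane ends up holding the plain sum of 8 bytes. A rough scalar model of that behaviour in C++ (the function name `psadbw_model` is made up for illustration):

```cpp
#include <cstdint>
#include <cstdlib>

// Scalar model of PSADBW on one pair of 128-bit operands.
// out[0] receives the sum over bytes 0..7, out[1] the sum over bytes 8..15.
void psadbw_model(const uint8_t x[16], const uint8_t y[16], uint64_t out[2]) {
    for (int lane = 0; lane < 2; lane++) {
        uint64_t sad = 0;
        for (int i = 0; i < 8; i++) {
            sad += std::abs(int(x[lane * 8 + i]) - int(y[lane * 8 + i]));
        }
        out[lane] = sad;  // at most 8 * 255 = 2040, so it fits in 16 bits
    }
}
```

Each lane holds at most 2040, so adding the four lane sums that come out of the two 16-byte halves (at most 8160 in total) cannot overflow a 16-bit word.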

Attempted intrinsics version:

I never use intrinsics, so this code probably makes no sense whatsoever. The disassembly looked OK though.

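#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// a must be 16-byte aligned because of _mm_load_si128.
uint16_t sum_32(const uint8_t a[32])
{
    __m128i zero = _mm_setzero_si128();
    // PSADBW against zero sums each group of 8 bytes into a 64-bit lane.
    __m128i sum0 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    // Add the upper lane's sum to the lower lane's.
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
    // Extract the low 16 bits portably (the m128i_u16 member is MSVC-specific).
    return static_cast<uint16_t>(_mm_cvtsi128_si32(totalsum));
}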
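`_mm_load_si128` requires the pointer to be 16-byte aligned. If the alignment of the buffer is not known, a variant using `_mm_loadu_si128` is one option. The sketch below assumes an x86 target with SSE2 available; `sum_32_unaligned` is a hypothetical name, and extracting the result with `_mm_cvtsi128_si32` is a compiler-neutral choice made here, not something taken from the answer:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Sums 32 unsigned bytes starting at a; a may have any alignment.
uint16_t sum_32_unaligned(const uint8_t* a) {
    const __m128i zero = _mm_setzero_si128();
    // Unaligned loads of the two 16-byte halves.
    __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + 16));
    // Each PSADBW leaves two partial sums, one per 64-bit lane.
    __m128i sums = _mm_add_epi16(_mm_sad_epu8(lo, zero), _mm_sad_epu8(hi, zero));
    // Bring the upper lane's partial sum down to the low dword and add it.
    __m128i upper = _mm_shuffle_epi32(sums, 2);
    __m128i total = _mm_add_epi16(sums, upper);
    // The low 16 bits now hold the full sum (at most 8160).
    return static_cast<uint16_t>(_mm_cvtsi128_si32(total));
}
```

Whether this beats the plain loop for only 32 bytes is something to measure on the target machine; an optimizing compiler may already vectorize the scalar version.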
