x86 - Sum reduction of unsigned bytes without overflow, using SSE2 on Intel
I am trying to find the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this:
    s = 0; for (i = 0; i < 32; i++) { s = s + a[i]; }
However, this takes too much time, since my application is a real-time application that requires less time. Please note that the final sum can be more than 255.
Is there a way I can implement this using low-level SIMD SSE2 instructions? Unfortunately, I have never used SSE. I tried searching for an SSE2 function for this purpose, but it is not available. Is SSE guaranteed to reduce the computation time for such small-sized problems?
Any suggestions?
Note: I have implemented similar algorithms using OpenCL and CUDA, and that worked great when the problem size was big. For small-sized problems, though, the cost of the overhead was higher. I am not sure how this works with SSE.
You can abuse psadbw to calculate small horizontal sums quickly.
Something like this (not tested):
    pxor   xmm0, xmm0
    psadbw xmm0, [a + 0]
    pxor   xmm1, xmm1
    psadbw xmm1, [a + 16]
    paddw  xmm0, xmm1
    pshufd xmm1, xmm0, 2
    paddw  xmm0, xmm1    ; low word in xmm0 is the total sum
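To see why psadbw against a zeroed register gives byte sums, it may help to look at a scalar model of the instruction. For each 8-byte half of its operands, psadbw sums the absolute differences of the unsigned bytes into a 16-bit word, so with one operand all zero each half simply holds the sum of 8 bytes. The sketch below is only an illustration; the names `psadbw_model` and `sum16_model` are made up for this example and are not part of any library.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

// Scalar model of PSADBW (hypothetical helper, for illustration only).
// For each 8-byte half, sum |x[i] - y[i]| over the unsigned bytes.
// Returns {sum over bytes 0..7, sum over bytes 8..15}.
std::array<uint16_t, 2> psadbw_model(const uint8_t x[16], const uint8_t y[16]) {
    std::array<uint16_t, 2> out{0, 0};
    for (int i = 0; i < 16; ++i) {
        int diff = std::abs(static_cast<int>(x[i]) - static_cast<int>(y[i]));
        out[i / 8] = static_cast<uint16_t>(out[i / 8] + diff);
    }
    return out;
}

// With x all zero, |0 - y[i]| == y[i], so the two halves are plain byte sums.
// Adding them mirrors the final pshufd + paddw step of the assembly.
uint16_t sum16_model(const uint8_t a[16]) {
    const uint8_t zero[16] = {};
    const std::array<uint16_t, 2> halves = psadbw_model(zero, a);
    return static_cast<uint16_t>(halves[0] + halves[1]);
}
```

Each half can reach at most 8 * 255 = 2040, so the 16-bit words cannot overflow for this problem size.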
An attempted intrinsics version:
I never use intrinsics, so this code may make no sense whatsoever. The disassembly looked OK though.
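    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstdint>

    // a must be 16-byte aligned, because _mm_load_si128 is an aligned load.
    uint16_t sum_32(const uint8_t a[32])
    {
        __m128i zero = _mm_setzero_si128();
        // psadbw against zero: each 64-bit half holds the sum of its 8 bytes
        __m128i sum0 = _mm_sad_epu8(
            zero, _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
        __m128i sum1 = _mm_sad_epu8(
            zero, _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
        __m128i sum2 = _mm_add_epi16(sum0, sum1);
        // bring the high-half sum down to the low dword and add it in
        __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
        // read the low word portably instead of via the MSVC-only m128i_u16 member
        return static_cast<uint16_t>(_mm_cvtsi128_si32(totalsum));
    }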
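The intrinsics version above uses `_mm_load_si128`, which requires the input to be 16-byte aligned. If that cannot be guaranteed, a variant with unaligned loads is one option. The following is a minimal sketch under that assumption, paired with a scalar reference for checking the result; the function names `sum_32_unaligned` and `sum_32_scalar` are made up for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Scalar reference: the maximum result is 32 * 255 = 8160, which fits in 16 bits.
uint16_t sum_32_scalar(const uint8_t* a) {
    uint16_t s = 0;
    for (int i = 0; i < 32; ++i) {
        s = static_cast<uint16_t>(s + a[i]);
    }
    return s;
}

// Same psadbw trick, but with unaligned loads so a may have any alignment.
uint16_t sum_32_unaligned(const uint8_t* a) {
    const __m128i zero = _mm_setzero_si128();
    // Each _mm_sad_epu8 yields two partial sums, one per 64-bit half.
    __m128i lo = _mm_sad_epu8(
        zero, _mm_loadu_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i hi = _mm_sad_epu8(
        zero, _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + 16)));
    __m128i sum = _mm_add_epi16(lo, hi);
    // Move the high-half partial sum into the low dword and add it.
    sum = _mm_add_epi16(sum, _mm_shuffle_epi32(sum, 2));
    // The low 32 bits now hold the total (the upper word of that dword is zero).
    return static_cast<uint16_t>(_mm_cvtsi128_si32(sum));
}
```

Whether this actually beats the scalar loop for only 32 bytes depends on the compiler and the surrounding code, so it is worth measuring both in the real application.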