My SSE2 code is long and slow; how can I make it fast? Also, _mm_store_si128() fails but _mm_storeu_si128() is accepted. Why?

C++
void tom::add(void* ptr)
{
    __declspec(align(16)) short* b = (short*)ptr;
    int j;
#if cplusplus
    for (j = 0; j < 4; j++)
    {
        /// 1st stage transform.
        int x0 = (int)(b[j]     + b[j + 12]);
        int x3 = (int)(b[j]     - b[j + 12]);
        int x1 = (int)(b[j + 4] + b[j + 8]);
        int x2 = (int)(b[j + 4] - b[j + 8]);

        /// 2nd stage transform.
        b[j]      = (short)(x0 + x1);
        b[j + 8]  = (short)(x0 - x1);
        b[j + 4]  = (short)(x2 + (x3 << 1));
        b[j + 12] = (short)(x3 - (x2 << 1));
    } // end for j...
#else
    __m128i f0, f1, f2, f3;

    j = 0;
    f0 = _mm_set_epi32(b[j + 3],  b[j + 2],  b[j + 1],  b[j]);
    f1 = _mm_set_epi32(b[j + 7],  b[j + 6],  b[j + 5],  b[j + 4]);
    f2 = _mm_set_epi32(b[j + 11], b[j + 10], b[j + 9],  b[j + 8]);
    f3 = _mm_set_epi32(b[j + 15], b[j + 14], b[j + 13], b[j + 12]);
    __declspec(align(16)) __m128i* b = (__m128i*)ptr;
    __m128i temp0, temp1, temp2, temp3, temp4;
    temp0 = f0;
    temp1 = f1;
    temp2 = f2;
    temp3 = f3;
    temp0 = _mm_add_epi16(temp0, f3);
    temp1 = _mm_add_epi16(temp1, f2);
    f0 = _mm_sub_epi16(f0, f3);
    f1 = _mm_sub_epi16(f1, f2);
    temp4 = temp0;
    temp4 = _mm_add_epi16(temp4, temp1);
    _mm_storeu_si128(b, temp4);
    temp0 = _mm_sub_epi16(temp0, temp1);
    _mm_storeu_si128(b + 2, temp0);
    temp1 = f0;
    temp4 = f1;
    temp1 = _mm_slli_epi16(temp1, 1);
    temp4 = _mm_slli_epi16(temp4, 1);
    f0 = _mm_add_epi16(f0, temp4);
    f1 = _mm_sub_epi16(f1, temp1);
    _mm_storeu_si128(b + 1, f0);
    _mm_storeu_si128(b + 3, f1);
#endif
}
Updated 25-Aug-10 2:24am
v4
Comments
Sauro Viti 25-Aug-10 7:04am    
"Loop unrolling" and "make the code small" point in opposite directions... ;-)
SMART LUBOBYA 25-Aug-10 7:14am    
You are right, unrolling means longer code. Perhaps the straightforward way to put it is: how do I unroll the loop using SSE2 intrinsics?
short* b = (short*)ptr; has two rows, each with 8 integers.
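
One possible way to get rid of the loop with SSE2, offered as a rough sketch rather than as the poster's code: treat each row of four shorts as one 64-bit load, so that all four columns go through the butterfly at once. The names butterfly4x4_scalar and butterfly4x4_sse2 are hypothetical, the sketch assumes b points at 16 contiguous shorts, and the scalar reference uses `* 2` in place of `<< 1`. Note that the posted SSE2 branch widens every short to 32 bits with _mm_set_epi32 and then writes four 16-byte stores, which is 64 bytes into a 32-byte block; the 64-bit load/store forms below avoid that.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical scalar reference: same arithmetic as the loop in the question.
inline void butterfly4x4_scalar(short* b) {
    assert(b != nullptr);
    for (int j = 0; j < 4; ++j) {
        int x0 = b[j] + b[j + 12];
        int x3 = b[j] - b[j + 12];
        int x1 = b[j + 4] + b[j + 8];
        int x2 = b[j + 4] - b[j + 8];
        b[j]      = static_cast<short>(x0 + x1);
        b[j + 8]  = static_cast<short>(x0 - x1);
        b[j + 4]  = static_cast<short>(x2 + x3 * 2);
        b[j + 12] = static_cast<short>(x3 - x2 * 2);
    }
}

// Hypothetical SSE2 version: one row (4 shorts, 8 bytes) per register,
// held in the low 64 bits. The 64-bit load/store intrinsics used here
// have no 16-byte alignment requirement.
inline void butterfly4x4_sse2(short* b) {
    assert(b != nullptr);
    __m128i r0 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b));
    __m128i r1 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 4));
    __m128i r2 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 8));
    __m128i r3 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 12));

    // 1st stage: sums and differences of the outer and inner rows.
    __m128i x0 = _mm_add_epi16(r0, r3);
    __m128i x3 = _mm_sub_epi16(r0, r3);
    __m128i x1 = _mm_add_epi16(r1, r2);
    __m128i x2 = _mm_sub_epi16(r1, r2);

    // 2nd stage: combine and write each row back.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b),
                     _mm_add_epi16(x0, x1));
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 8),
                     _mm_sub_epi16(x0, x1));
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 4),
                     _mm_add_epi16(x2, _mm_slli_epi16(x3, 1)));
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 12),
                     _mm_sub_epi16(x3, _mm_slli_epi16(x2, 1)));
}
```

Both functions wrap to 16 bits in the same way for inputs in the short range, so they can be compared element by element on a test block before the scalar path is dropped.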

1 solution

_mm_store_si128() does not work because the data is not aligned. I believe your declaration(s) of the pointer variable b give 16-byte alignment to that variable itself, not to what it points at. For _mm_store_si128() to work, you would have to make sure that the data pointed to by the ptr parameter is 16-byte aligned. The _mm_storeu_si128() intrinsic works because it accepts unaligned addresses.
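
To illustrate the point about aligning the data rather than the pointer, here is a minimal sketch. It assumes a C++11 compiler, so alignas is available; with the older Microsoft syntax the equivalent would be putting __declspec(align(16)) on the array declaration itself. The helper names is_aligned16, store128 and aligned_block_demo are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: picks the aligned store only when the target
// address allows it, and falls back to the unaligned store otherwise.
inline void store128(short* dst, __m128i v) {
    assert(dst != nullptr);
    if (is_aligned16(dst)) {
        _mm_store_si128(reinterpret_cast<__m128i*>(dst), v);
    } else {
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
    }
}

// The alignment specifier goes on the storage (the array),
// which is what makes the aligned store legal here.
inline bool aligned_block_demo() {
    alignas(16) short block[16] = {};
    _mm_store_si128(reinterpret_cast<__m128i*>(block), _mm_set1_epi16(3));
    // The first 8 shorts were written, the rest stay zero.
    return block[0] == 3 && block[7] == 3 && block[8] == 0;
}
```

In the question's function the caller owns the memory behind ptr, so the alignment has to be arranged where that buffer is declared or allocated; a runtime check like is_aligned16 only tells you which store is safe to use.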

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
