Any suggestions on how to improve the code below to make it faster? How would I rewrite the following function as inline assembly?

C++
void tomSimd::calculations(void* btr)
{
    __declspec(align(8))short* block =(short*)btr;
    
    int j;
    
    __declspec(align(8)) __m64*block1 = (__m64*)block;
    __m64 s0,s1,s2,s3,f0,f1,f2,f3,temp4,temp5,temp6,temp7;
    j=0;
    
    // transpose input
    temp4 = _mm_unpacklo_pi16(block1[j],block1[j+1]);
    temp5 = _mm_unpacklo_pi16(block1[j+2],block1[j+3]);
    temp6 = _mm_unpackhi_pi16(block1[j],block1[j+1]);
    temp7 = _mm_unpackhi_pi16(block1[j+2],block1[j+3]);
    f0 = _mm_unpacklo_pi32(temp4,temp5);
    f2 = _mm_unpacklo_pi32(temp6,temp7);
    f1 = _mm_unpackhi_pi32(temp4,temp5);
    f3 = _mm_unpackhi_pi32(temp6,temp7);
    
    // stage one
    s0 =_mm_add_pi16(f0,f3);
    s3 =_mm_sub_pi16(f0,f3);
    s1 =_mm_add_pi16(f1,f2);
    s2 =_mm_sub_pi16(f1,f2);
    
    //stage 2
    block1[j] =_mm_add_pi16(s0,s1);
    block1[j+2] =_mm_sub_pi16(s0,s1);
    block1[j+1] =_mm_add_pi16(s2,_mm_slli_pi16(s3, 1));
    block1[j+3] =_mm_sub_pi16(s3,_mm_slli_pi16(s2, 1));
    
    _mm_empty();
}
Posted
Updated 20-Oct-10 2:58am
v2
Comments
super 20-Oct-10 7:01am    
Do you have any metrics with respect to speed?
I wanted to know, how fast is the execution and how much you desire?
SMART LUBOBYA 20-Oct-10 7:09am    
block is a 4x4 matrix that I transpose and then run through those two stages, but it is declared as a single flat array of 16 short elements. When I did it in plain C++ the time was 260 ms; with MMX it is 255 ms, and sometimes the same as the C++ version. I expected MMX to be faster. Here is the C++ equivalent:
for(j = 0; j < 16; j += 4)
{
    /// 1st stage transform.
    int s0 = (int)(block[j] + block[j+3]);
    int s3 = (int)(block[j] - block[j+3]);
    int s1 = (int)(block[j+1] + block[j+2]);
    int s2 = (int)(block[j+1] - block[j+2]);

    /// 2nd stage transform.
    block[j] = (short)(s0 + s1);
    block[j+2] = (short)(s0 - s1);
    block[j+1] = (short)(s2 + (s3 << 1));
    block[j+3] = (short)(s3 - (s2 << 1));
}//end for j...

You can make it faster by avoiding the casts and by not passing a void* pointer. Code like that is not easy for the compiler to optimize.
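As a rough illustration of that suggestion, here is a minimal sketch of the scalar loop from the comments taking a typed short* instead of void*. The free-function name transformRows is hypothetical (the original is the member tomSimd::calculations), and whether the change affects speed at all would have to be measured.

```cpp
// Hypothetical typed variant of the scalar loop from the question's comments.
// `block` points at 16 shorts: four rows of four elements each.
void transformRows(short* block)
{
    for (int j = 0; j < 16; j += 4) {
        // 1st stage transform (operands are promoted to int automatically).
        const int s0 = block[j] + block[j + 3];
        const int s3 = block[j] - block[j + 3];
        const int s1 = block[j + 1] + block[j + 2];
        const int s2 = block[j + 1] - block[j + 2];

        // 2nd stage transform. Multiplying by 2 replaces the original << 1,
        // because left-shifting a negative int is undefined before C++20.
        block[j]     = static_cast<short>(s0 + s1);
        block[j + 2] = static_cast<short>(s0 - s1);
        block[j + 1] = static_cast<short>(s2 + 2 * s3);
        block[j + 3] = static_cast<short>(s3 - 2 * s2);
    }
}
```

For example, the row {5, 1, 0, 2} becomes {8, 7, 6, 1}.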
 
 
The _mm_empty() call (assembly instruction: emms) is an expensive instruction that takes quite a few cycles. If you call this method in a loop, consider moving that loop into the method, so that emms runs only once, after everything has been processed (as long as you don't use any floating-point instructions in between).

Good luck!
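A minimal sketch of that idea, assuming the caller has several 4x4 blocks stored back to back in one 8-byte-aligned array of shorts. The name transformBlocksMmx and the count parameter are hypothetical and not part of the original class. The loop body is the question's intrinsics unchanged, and only _mm_empty() moves outside the loop. It needs an x86 compiler that provides <mmintrin.h>.

```cpp
#include <cstddef>
#include <mmintrin.h>  // MMX intrinsics (x86 only)

// Hypothetical batch version: transforms `count` consecutive 4x4 blocks in place.
// Assumes `blocks` is 8-byte aligned and holds 16 * count shorts.
void transformBlocksMmx(short* blocks, std::size_t count)
{
    __m64* p = reinterpret_cast<__m64*>(blocks);

    for (std::size_t n = 0; n < count; ++n, p += 4) {
        // Transpose the 4x4 block so that f0..f3 each hold one column.
        const __m64 t4 = _mm_unpacklo_pi16(p[0], p[1]);
        const __m64 t5 = _mm_unpacklo_pi16(p[2], p[3]);
        const __m64 t6 = _mm_unpackhi_pi16(p[0], p[1]);
        const __m64 t7 = _mm_unpackhi_pi16(p[2], p[3]);
        const __m64 f0 = _mm_unpacklo_pi32(t4, t5);
        const __m64 f1 = _mm_unpackhi_pi32(t4, t5);
        const __m64 f2 = _mm_unpacklo_pi32(t6, t7);
        const __m64 f3 = _mm_unpackhi_pi32(t6, t7);

        // Stage one: each 16-bit lane handles one row of the block.
        const __m64 s0 = _mm_add_pi16(f0, f3);
        const __m64 s3 = _mm_sub_pi16(f0, f3);
        const __m64 s1 = _mm_add_pi16(f1, f2);
        const __m64 s2 = _mm_sub_pi16(f1, f2);

        // Stage two: results are stored column-wise, as in the question.
        p[0] = _mm_add_pi16(s0, s1);
        p[2] = _mm_sub_pi16(s0, s1);
        p[1] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
        p[3] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));
    }

    // Clear the MMX state once, after all blocks are done.
    _mm_empty();
}
```

Because the code transposes on input but not on output, each result block appears to come out transposed relative to the scalar loop from the comments, so the two versions are not drop-in replacements for each other without a final transpose. Any speed gain from calling emms once would still need to be measured.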
 
 
Comments
SMART LUBOBYA 20-Oct-10 11:14am    
If I use assembly, how do I load the elements? That is, block[j] loads the first column, block[j+1] the second column, and so on. I am not sure how to treat j.
E.F. Nijboer 21-Oct-10 12:13pm    
In your code I'm not sure what j is for anyway. I think the following article could get you going:
http://www.codeproject.com/KB/recipes/mmxintro.aspx
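
On how block1[j] maps to a load, here is a minimal sketch, assuming the flat 16-short layout described in the question and 2-byte shorts. Each __m64 element covers four consecutive shorts, which is one row of the 4x4 block rather than one column, so block1[j + k] sits at byte offset 8 * (j + k) from the start. With j fixed at 0, the four loads are at offsets 0, 8, 16 and 24. The helper name loadQuad is hypothetical, and std::memcpy stands in for the 8-byte load (such as movq) that an assembly version would issue.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the 8 bytes that block1[index] refers to,
// i.e. the shorts block[4 * index] .. block[4 * index + 3].
// In assembly this corresponds to one 8-byte load from base + 8 * index,
// where base is the address held in `block`.
std::uint64_t loadQuad(const short* block, std::size_t index)
{
    std::uint64_t quad;
    std::memcpy(&quad, block + 4 * index, sizeof quad);
    return quad;
}
```

So loadQuad(block, 2) picks up block[8] through block[11], the same data the intrinsic version reads as block1[2].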

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


