Code works but is slow

Question



Any suggestions on how to improve this code to make it faster? How would I rewrite the following function as inline assembly?

C++

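void tomSimd::calculations(void* btr)
{
    // __declspec(align(8)) on a pointer aligns the pointer variable itself,
    // not the data it points to; the 16 shorts behind btr should be 8-byte aligned.
    __declspec(align(8)) short* block = (short*)btr;

    int j;

    // block1[j] .. block1[j+3] are the four rows (4 shorts each) of the 4x4 block
    __declspec(align(8)) __m64* block1 = (__m64*)block;
    __m64 s0, s1, s2, s3, f0, f1, f2, f3, temp4, temp5, temp6, temp7;
    j = 0;

    // transpose input: afterwards fk holds element k of rows 0..3
    temp4 = _mm_unpacklo_pi16(block1[j], block1[j+1]);
    temp5 = _mm_unpacklo_pi16(block1[j+2], block1[j+3]);
    temp6 = _mm_unpackhi_pi16(block1[j], block1[j+1]);
    temp7 = _mm_unpackhi_pi16(block1[j+2], block1[j+3]);
    f0 = _mm_unpacklo_pi32(temp4, temp5);
    f2 = _mm_unpacklo_pi32(temp6, temp7);
    f1 = _mm_unpackhi_pi32(temp4, temp5);
    f3 = _mm_unpackhi_pi32(temp6, temp7);

    // stage one
    s0 = _mm_add_pi16(f0, f3);
    s3 = _mm_sub_pi16(f0, f3);
    s1 = _mm_add_pi16(f1, f2);
    s2 = _mm_sub_pi16(f1, f2);

    // stage two: each stored __m64 holds one output element from each of the four rows
    block1[j]   = _mm_add_pi16(s0, s1);
    block1[j+2] = _mm_sub_pi16(s0, s1);
    block1[j+1] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
    block1[j+3] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));

    _mm_empty();   // EMMS: clear MMX state before any x87 floating-point code runs
}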

Posted 20-Oct-10 0:55am

SMART LUBOBYA

Updated 20-Oct-10 2:58am

Saurabh.Garg

v2


Comments

super 20-Oct-10 7:01am

Do you have any speed measurements?
How fast does it run now, and how fast do you need it to be?

SMART LUBOBYA 20-Oct-10 7:09am

block is a 4x4 matrix that I transpose and then process through those two stages, but block is declared as a single column of 16 shorts. In C++ it takes 260 ms; with MMX it takes 255 ms, sometimes the same as C++. I expected MMX to be faster. Here is the C++ equivalent:
for (j = 0; j < 16; j += 4)
{
    /// 1st stage transform.
    int s0 = (int)(block[j] + block[j+3]);
    int s3 = (int)(block[j] - block[j+3]);
    int s1 = (int)(block[j+1] + block[j+2]);
    int s2 = (int)(block[j+1] - block[j+2]);

    /// 2nd stage transform.
    block[j]   = (short)(s0 + s1);
    block[j+2] = (short)(s0 - s1);
    block[j+1] = (short)(s2 + (s3 << 1));
    block[j+3] = (short)(s3 - (s2 << 1));
}//end for j...
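When comparing timings or results, note that the MMX function and this loop do not store results in the same layout. The loop transforms each row in place, whereas the MMX version transposes first and writes each result vector back as a row, so its output is the transpose of the loop's output (which may well be intended for a 2-D transform). A minimal portable model of the MMX layout, assuming row-major storage; the helper name `transformRowsTransposedOut` is hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper modelling what the MMX function stores: row i of the
// input gets the same two-stage transform as the C++ loop, but output
// element k of row i is written to out[k * 4 + i] (the result is transposed).
void transformRowsTransposedOut(const std::int16_t* in, std::int16_t* out)
{
    // Not safe in place: out must not alias in.
    assert(in != nullptr && out != nullptr && in != out);
    for (int i = 0; i < 4; ++i)
    {
        const std::int16_t* row = in + 4 * i;
        // 1st stage
        const int s0 = row[0] + row[3];
        const int s3 = row[0] - row[3];
        const int s1 = row[1] + row[2];
        const int s2 = row[1] - row[2];
        // 2nd stage, stored column-wise like the MMX version
        out[0 * 4 + i] = static_cast<std::int16_t>(s0 + s1);
        out[1 * 4 + i] = static_cast<std::int16_t>(s2 + 2 * s3);
        out[2 * 4 + i] = static_cast<std::int16_t>(s0 - s1);
        out[3 * 4 + i] = static_cast<std::int16_t>(s3 - 2 * s2);
    }
}
```

Running the loop above on the same input gives the same numbers row by row instead, so an element-by-element comparison of the two versions has to account for the transpose.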

2 solutions


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

T2102 · Answer 1 · 2010-10-20T03:34:00

Solution 1

You can make it faster by not casting and not using a void* pointer; that kind of code is not easy for the compiler to optimize.
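A minimal sketch of this suggestion for the scalar version: the hypothetical `transformRows` takes the block as a reference to exactly 16 `std::int16_t`, so there is no `void*` and no cast, and the compiler knows the size of the data. Whether this alone gives a measurable speed-up depends on the compiler and settings, so it is worth timing against the original.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical typed signature: a reference to exactly 16 int16_t values
// replaces the void* parameter and the casts.
void transformRows(std::int16_t (&block)[16])
{
    for (int j = 0; j < 16; j += 4)
    {
        // Load the row into locals once, then compute.
        const int x0 = block[j],     x1 = block[j + 1];
        const int x2 = block[j + 2], x3 = block[j + 3];
        const int s0 = x0 + x3, s3 = x0 - x3;
        const int s1 = x1 + x2, s2 = x1 - x2;
        // Multiply instead of shifting: left-shifting a negative int
        // is undefined behaviour before C++20.
        block[j]     = static_cast<std::int16_t>(s0 + s1);
        block[j + 1] = static_cast<std::int16_t>(s2 + 2 * s3);
        block[j + 2] = static_cast<std::int16_t>(s0 - s1);
        block[j + 3] = static_cast<std::int16_t>(s3 - 2 * s2);
    }
}
```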

Posted 20-Oct-10 3:34am

T2102

E.F. Nijboer · Answer 2 · 2010-10-20T03:47:00

Solution 2

_mm_empty() (assembly instruction: EMMS) is an expensive instruction that takes quite a few cycles. If you call this function in a loop, consider moving that loop into this method so you can skip EMMS until you're completely done (as long as you don't use any floating-point instructions in between).
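A minimal sketch of this advice, assuming GCC or Clang on x86 (the `__m64` intrinsics are not supported by 64-bit MSVC) and 4x4 blocks stored back to back in an 8-byte-aligned buffer. The hypothetical `transformBlocksMmx` runs the question's kernel over `count` blocks and calls `_mm_empty()` once at the end. It also shows one way to generalize `j`: the rows of block `b` are `rows[4*b]` to `rows[4*b + 3]`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mmintrin.h>  // MMX intrinsics (__m64, _mm_unpacklo_pi16, _mm_empty, ...)

// Hypothetical batch version: transforms `count` consecutive 4x4 blocks of
// int16_t with the question's transpose + two stages, executing EMMS once.
// As in the question, each block's result is stored transposed.
void transformBlocksMmx(std::int16_t* blocks, std::size_t count)
{
    assert(reinterpret_cast<std::uintptr_t>(blocks) % 8 == 0);
    __m64* rows = reinterpret_cast<__m64*>(blocks);
    for (std::size_t b = 0; b < count; ++b)
    {
        __m64* r = rows + 4 * b;  // the four rows of block b
        // transpose input: fk holds element k of rows 0..3
        const __m64 t4 = _mm_unpacklo_pi16(r[0], r[1]);
        const __m64 t5 = _mm_unpacklo_pi16(r[2], r[3]);
        const __m64 t6 = _mm_unpackhi_pi16(r[0], r[1]);
        const __m64 t7 = _mm_unpackhi_pi16(r[2], r[3]);
        const __m64 f0 = _mm_unpacklo_pi32(t4, t5);
        const __m64 f1 = _mm_unpackhi_pi32(t4, t5);
        const __m64 f2 = _mm_unpacklo_pi32(t6, t7);
        const __m64 f3 = _mm_unpackhi_pi32(t6, t7);
        // stage one
        const __m64 s0 = _mm_add_pi16(f0, f3);
        const __m64 s3 = _mm_sub_pi16(f0, f3);
        const __m64 s1 = _mm_add_pi16(f1, f2);
        const __m64 s2 = _mm_sub_pi16(f1, f2);
        // stage two
        r[0] = _mm_add_pi16(s0, s1);
        r[2] = _mm_sub_pi16(s0, s1);
        r[1] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
        r[3] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));
    }
    _mm_empty();  // one EMMS for the whole batch, before any x87 FP code runs
}
```

Any floating-point code belongs after this call returns, not inside the loop.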

Good luck!

Posted 20-Oct-10 3:47am

E.F. Nijboer

Comments

SMART LUBOBYA 20-Oct-10 11:14am

If I use assembly, how do I load the elements? I.e. block[j] loads the first column, block[j+1] the second column, etc. I'm not sure how to handle j.

E.F. Nijboer 21-Oct-10 12:13pm

In your code I'm not sure what j is for anyway. I think the following article could get you going:
http://www.codeproject.com/KB/recipes/mmxintro.aspx
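On the assembly question, a sketch using GCC/Clang extended asm (AT&T syntax, x86 only; MSVC's `__asm` blocks use different syntax and are not available when compiling for x64). Each row of the block is one 8-byte `movq` load, so row k sits at byte offset 8*k and `j` turns into the fixed displacements 0, 8, 16 and 24; for block b of a larger buffer, pass `blocks + 16 * b`. The function name `transformBlockAsm` is hypothetical, and since the compiler already emits essentially these instructions from the intrinsics, the hand-written version is unlikely to be much faster.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical extended-asm version of the question's transform for one
// 4x4 block of int16_t (x86, GCC/Clang). Result is stored transposed,
// exactly like the intrinsics version.
void transformBlockAsm(std::int16_t* block)
{
    assert(block != nullptr);
    __asm__ volatile(
        "movq      (%0), %%mm0\n\t"   // mm0 = row 0
        "movq     8(%0), %%mm1\n\t"   // mm1 = row 1
        "movq    16(%0), %%mm2\n\t"   // mm2 = row 2
        "movq    24(%0), %%mm3\n\t"   // mm3 = row 3
        // transpose
        "movq      %%mm0, %%mm4\n\t"
        "punpcklwd %%mm1, %%mm0\n\t"  // mm0 = temp4
        "punpckhwd %%mm1, %%mm4\n\t"  // mm4 = temp6
        "movq      %%mm2, %%mm5\n\t"
        "punpcklwd %%mm3, %%mm2\n\t"  // mm2 = temp5
        "punpckhwd %%mm3, %%mm5\n\t"  // mm5 = temp7
        "movq      %%mm0, %%mm1\n\t"
        "punpckldq %%mm2, %%mm0\n\t"  // mm0 = f0
        "punpckhdq %%mm2, %%mm1\n\t"  // mm1 = f1
        "movq      %%mm4, %%mm2\n\t"
        "punpckldq %%mm5, %%mm2\n\t"  // mm2 = f2
        "punpckhdq %%mm5, %%mm4\n\t"  // mm4 = f3
        // stage one
        "movq      %%mm0, %%mm3\n\t"
        "paddw     %%mm4, %%mm0\n\t"  // mm0 = s0 = f0 + f3
        "psubw     %%mm4, %%mm3\n\t"  // mm3 = s3 = f0 - f3
        "movq      %%mm1, %%mm5\n\t"
        "paddw     %%mm2, %%mm1\n\t"  // mm1 = s1 = f1 + f2
        "psubw     %%mm2, %%mm5\n\t"  // mm5 = s2 = f1 - f2
        // stage two
        "movq      %%mm0, %%mm2\n\t"
        "paddw     %%mm1, %%mm0\n\t"  // mm0 = s0 + s1
        "psubw     %%mm1, %%mm2\n\t"  // mm2 = s0 - s1
        "movq      %%mm3, %%mm4\n\t"
        "psllw     $1, %%mm4\n\t"     // mm4 = s3 << 1
        "paddw     %%mm5, %%mm4\n\t"  // mm4 = s2 + (s3 << 1)
        "psllw     $1, %%mm5\n\t"     // mm5 = s2 << 1
        "psubw     %%mm5, %%mm3\n\t"  // mm3 = s3 - (s2 << 1)
        // store back to the same four rows
        "movq      %%mm0, (%0)\n\t"
        "movq      %%mm4, 8(%0)\n\t"
        "movq      %%mm2, 16(%0)\n\t"
        "movq      %%mm3, 24(%0)\n\t"
        "emms"
        :                              // no register outputs
        : "r"(block)                   // %0 = address of the block
        : "mm0", "mm1", "mm2", "mm3", "mm4", "mm5", "memory");
}
```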