Comments by gilgamash

gilgamash 21-Sep-11 4:06am View

Deleted

Hi,
I have to disagree strongly here. Take O2-optimized code, which you assume the compiler has optimized well, run it through Intel's profiling tools, for instance, and you will be surprised by the number of cache misses, branch mispredictions, etc. And this
"They either have no insight in optimization techniques, or they haven't upgraded their compilers since the last millennium"
is arrogant and wrong.
Best regards,
G.

gilgamash 12-Apr-11 7:36am View

Deleted

This short paper explains it well and thoroughly:

http://symbolaris.com/course/Compilers/23-cachedep.pdf

Best regards,
G.

P.S.: My comment number 4 was a rather garbled sentence, so I edited it :-)

gilgamash 12-Apr-11 6:29am View

Deleted

You can make it faster still:
1) Replace the divisions: define a constant 1/255 before the loops and multiply by it instead. That saves roughly 20 clock cycles per iteration at least. Keep it in a register to avoid cache misses!
2) On Intel/AMD: the SSE2 and later instructions could help a lot when computing rDest, gDest and bDest, as those are all independent and make a perfect case for SIMD.
3) The variables av and rem are prime candidates for cache misses. You might want to consider register variables for those, too.
4) Using loop variables within loops frequently results in heavy cache misses, too. Reordering where necessary might reduce that.
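
To illustrate point 1, here is a minimal sketch in C. The original loop is not shown here, so the function names and the meaning of av (assumed to be an alpha value in 0..255) and rem (assumed to be 255 - av) are guesses for illustration only. Instead of a floating-point 1/255 constant, the sketch uses the fixed-point equivalent (multiply by a precomputed reciprocal, then shift), which stays in integer arithmetic and gives the same result as the division for every 16-bit input.

```c
#include <stdint.h>

/* Hypothetical helper names; not taken from the article's code. */

/* Exact x / 255 for 0 <= x <= 65535 without a divide instruction.
   0x8081 is ceil(2^23 / 255); the product fits in 32 bits. */
static uint32_t div255_fast(uint32_t x)
{
    return (x * 0x8081u) >> 23;
}

/* Baseline: one integer division per channel.
   Assumption: av is the alpha value, rem its complement. */
static uint8_t blend_div(uint8_t src, uint8_t dst, uint8_t av)
{
    uint32_t rem = 255u - av;
    uint32_t x = (uint32_t)src * av + (uint32_t)dst * rem; /* at most 65025 */
    return (uint8_t)(x / 255u);
}

/* Same blend with the division replaced by multiply and shift. */
static uint8_t blend_mul(uint8_t src, uint8_t dst, uint8_t av)
{
    uint32_t rem = 255u - av;
    uint32_t x = (uint32_t)src * av + (uint32_t)dst * rem; /* at most 65025 */
    return (uint8_t)div255_fast(x);
}
```

Calling blend_mul once per channel (for rDest, gDest and bDest) is a drop-in replacement for blend_div. Note that a modern compiler may already turn a division by the constant 255 into this kind of multiply and shift, so it is worth measuring before and after.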

Otherwise: Nice and quick!

G.

Comments by gilgamash (Top 3 by date)