|
Write the loop as suggested by Michael. The processor might even have its own instructions for copying a byte from one register into a different byte position in another (or even combining multiple byte copies into a single instruction). Look at the generated assembly for both the bit-shift-and-or solution and the union one, and see which is more efficient. I'm interested to hear how much faster it becomes.
|
|
|
|
|
It isn't a division. It's a constant division by k, which any sane compiler (even the .NET JIT compiler, though its sanity is debatable) turns into a multiplication by approximately 0x100000000 / k (and possibly a few extra instructions for correct signed rounding) or some other constant, depending on the data size.
|
|
|
|
|
I hope it also does some scaling: 0x100000000 is a mighty big factor to introduce without the matching 32-bit right shift.
|
|
|
|
|
You don't need to; you can just take the upper half of the product (edx).
|
|
|
|
|
You got me there.
The last time I really looked at CPUs, registers and assembly language was a long time ago, in a galaxy far, far away (the Motorola 68k family, to be exact). And I never did get to know the Intel CPUs.
By now I've been working at the C/C++/C# level for far too long and I've clearly gone soft in the head - and didn't put 2 & 2 together ("64-bit CPUs" => 64-bit registers!)
|
|
|
|
|
There is that funny 64-bit thing... but x86 has always had a double-width mul (although it used to give almost no advantage over div).
You can even multiply two 64-bit numbers and get a 128-bit result (in rdx:rax).
|
|
|
|
|
That pesky Intel chip. Always doing things by halves!
|
|
|
|
|
I faced the exact same issue a little while ago (targeting a different processor, though). IIRC the big surprise for me was that I gained a significant performance increase by swapping from the for loop you have to the while loop below (it must be friendlier to the C# optimizer). Something else you might try is to manually unroll your loop to do 4 pixels at a time in the inner loop and read your source image channels 32 bits at a time. This is definitely something where you'd benefit from dropping down to native code if the performance of this step is that critical (and if p/invoke overhead proves significant you can implement it in a mixed-mode assembly).

Usually with machine vision, though, converting to a packed byte format is only done as a last step for displaying/storing the results; processing is usually done in planar formats (which I really wouldn't call 'proprietary' either, btw) for better performance.
private static unsafe void PlanarToPackedByteRgb32(
    int width, int height,
    IntPtr rSrc, IntPtr gSrc, IntPtr bSrc,
    IntPtr dest, int stride)
{
    var rSrcPtr = (byte*)rSrc.ToPointer();
    var gSrcPtr = (byte*)gSrc.ToPointer();
    var bSrcPtr = (byte*)bSrc.ToPointer();
    var destPtr = (byte*)dest.ToPointer();
    var destEndPtr = destPtr + stride * height;
    var rowStep = 4 * width;
    while (destPtr != destEndPtr)
    {
        var it = (uint*)destPtr;
        var end = (uint*)(destPtr + rowStep);
        destPtr += stride;
        while (it != end)
        {
            *it++ =
                ((uint)(*rSrcPtr++) << 16) |
                ((uint)(*gSrcPtr++) << 8) |
                ((uint)(*bSrcPtr++) << 0);
        }
    }
}
|
|
|
|
|
I wonder how fast your function is when you don't do anything within your 'for' loops.
|
|
|
|
|
To all,
Thank you one and all for your responses.
Of all the suggestions, the one with the largest impact on speed is going to the 32bpp format. This reduced the conversion time from ~50ms to ~42ms, but the extra 1.9 MB required for each image is not (in my particular case) a good trade-off. The other suggestions resulted in 1ms or maybe 2ms improvements, with no single technique showing a clear winner.
This project involves inspecting components in trays - there may be up to 4 images per component with (so far) a max of 52 components per tray. All these images need to be available to the operator at "a touch of the screen".
With this many images (each is 1600x1200) I really just need to dust off the ol' C/ASM skills and convert to a 16bpp format - saving 2 MB per image in the process.
Again, thanks for the suggestions.
Subvert The Dominant Paradigm
|
|
|
|
|
Buy a fast memory adapter that pretends to be a disk.
Problem solved without any coding.
|
|
|
|
|
Loop unrolling in the form of parallel loop execution would help if it were C/C++. You'd need to use array indexing into the rows rather than incrementing pointers.
Reading the data in large, data-cache-line-sized chunks would also be a win in C/C++. The instructions to mask and shift can fit in the instruction cache, and they will absolutely scream through the data, no longer being RAM-limited. Maybe the MMX registers can be applied here?
Writing the output data as uncached, data-cache-line-sized chunks also helps in C/C++ on some CPU architectures. Not writing back through the CPU data cache helps keep the input data in it (= fewer misses).
Unfortunately, to get to this level of tuning the code, you need to be generating native instructions, where you have some chance of being able to predict what the CPU and its caches will be chewing on at any given moment. Running under the C# JIT you're probably SOL, since you have neither any idea nor any control of what native code is getting executed.
BTW, that operation on a 2MP image takes no more than a couple of milliseconds in our C++ code... and that's without applying all the fancy tricks I mentioned above. Your best bet is probably to just bite the bullet and thunk through to a native language for the high-performance work to get the ~20X speed improvement -- that's what I'd do, unless I was in a mood to learn something about what kind of performance C# can be made to deliver.
patbob
|
|
|
|
|
Thanks for the thoughts - definitely on the mark.
I've already decided to do the C/C++ approach (maybe MMX if time allows) - it's been a few years since I've done any bit bangin' like this. It's actually fun.
Subvert The Dominant Paradigm
|
|
|
|
|
Don't use byte pointers for reading the source colors.
Maintain uint* (or stretch to ulong*) instead of using byte*.
Precalculate the single loop count: width*height/4 (or /8 if you use ulong).
In the loop, process 4 (or 8) bytes at a time:
uint tempRed, tempGreen, tempBlue;
begin loop
    tempRed   = *redPtr++;    // four red bytes at once
    tempGreen = *greenPtr++;  // four green bytes
    tempBlue  = *bluePtr++;   // four blue bytes
    *imgPtr++ = alpha | ((tempRed & 0x000000FF) << 16) | ((tempGreen & 0x000000FF) << 8)  |  (tempBlue & 0x000000FF);
    *imgPtr++ = alpha | ((tempRed & 0x0000FF00) << 8)  |  (tempGreen & 0x0000FF00)        | ((tempBlue & 0x0000FF00) >> 8);
    *imgPtr++ = alpha |  (tempRed & 0x00FF0000)        | ((tempGreen & 0x00FF0000) >> 8)  | ((tempBlue & 0x00FF0000) >> 16);
    *imgPtr++ = alpha | ((tempRed & 0xFF000000) >> 8)  | ((tempGreen & 0xFF000000) >> 16) | ((tempBlue & 0xFF000000) >> 24);
end loop
You will need some after-loop checks that perform the same logic for the stragglers (modulus 4 or 8):
switch(leftover_modulus) {
case 3:
*imgPtr++ = alpha | ((tempRed & 0x00FF0000) ) | ((tempGreen & 0xFF0000) >> 8) | ((tempBlue & 0xFF0000) >> 16);
case 2:
...
case 1:
...
break;
default:
}
|
|
|
|
|
Some ideas off the top of my head...
Can you interleave the input data for the function (_b, _g, _r) so it might fit the caches better?
Maybe you can unroll some of the loop: iterate 12 bytes at a time (4 * 3) and store as 3 uint* operations...
The last w % 4 columns (I think) must be handled by your current loop (between 0 and 3 operations).
|
|
|
|
|
The first thing I see is that
( ( s / 3 ) - w ) * 3;
is the same as
s - w * 3;
although the suggestion to move this outside the loop will save you more...
|
|
|
|
|
Looks like an ideal case for some loop unrolling. If say bmp.Width is always even, the inner col loop could be rewritten as
for ( int col = 0; col < w; col += 2 ) {
*imgPtr++ = *b++;
*imgPtr++ = *g++;
*imgPtr++ = *r++;
*imgPtr++ = *b++;
*imgPtr++ = *g++;
*imgPtr++ = *r++;
}
saving half the overhead of loop management. C# doesn't allow implicit fallthrough in switch statements, otherwise Duff's device would be perfect here.
- turin
|
|
|
|
|
Another approach. Some of the other posts got this idea kicking...
Have one outer loop based on a block size that tries to optimize read and write cache sizes.
By working with a single color at a time you should optimize read hits.
A simple starting point of blocksize = 1 should be close to what you have now.
You are trading read cache hits for loop overhead. The correct block size might swing this in your favor.
loop totalsize/blocksize
    output pointer = (set based on block size)
    loop blocksize (red)
        read one byte from red pointer (pointer += 1)
        write one byte to output pointer (pointer += 3)
    endloop blocksize (red)
    output pointer = (set based on block size) + 1 offset to skip red
    loop blocksize (green)
        read one byte from green pointer (pointer += 1)
        write one byte to output pointer (pointer += 3)
    endloop blocksize (green)
    output pointer = (set based on block size) + 2 offset to skip red and green
    loop blocksize (blue)
        read one byte from blue pointer (pointer += 1)
        write one byte to output pointer (pointer += 3)
    endloop blocksize (blue)
endloop totalsize/blocksize
Variations:
- Have the input pointers start at different offsets (thirds) within the blocksize and wrap at the end. This will smooth write conflicts on the output pointer.
- Apply one thread per color - you would want these threads to be preallocated and dedicated to the blit engine.
|
|
|
|
|
Why don't you do a BlockCopy using the Buffer class?
|
|
|
|
|
I'm using a low level mouse hook to track mouse movement and clicks.
The main window installs the hook on its thread, and everything works until that window has to do more demanding tasks (it can be emulated with Thread.Sleep(1000)), which causes the mouse to freeze for the duration of the work.
I thought to fix this by initialising the hook on a different thread.
MSDN ( http://msdn.microsoft.com/en-us/library/ms644986(VS.85).aspx ) says "This hook is called in the context of the thread that installed it. The call is made by sending a message to the thread that installed the hook. Therefore, the thread that installed the hook must have a message loop."
The problem is I don't know how to set up a callback loop on the background thread.
Any help is appreciated.
|
|
|
|
|
You're looking at it in the wrong way. The normal approach is to have the main thread do all the GUI stuff and delegate all long-running computations (including Thread.Sleep) to other threads. That is the way to keep your app responsive no matter what.
|
|
|
|
|
That idea crossed my mind once and I will probably end up doing it, but still, any unexpected window lag will result in the mouse being frozen, which is very, very bad.
For the sake of learning, I'm still interested in how to achieve my first suggestion.
Is there any other way to make sure there is no risk of unresponsive mouse when hooking it?
|
|
|
|
|
There is an old application, developed on Vista, with an icon for the application and the main form. However, in Windows 7 that icon is not displayed when the application is in the taskbar.
Чесноков
|
|
|
|
|
I'd expect the form-icon on the taskbar. Can you give us some more clues? Is there no icon at all, or are you seeing some kind of default-icon? What version of Weven are you running?
I are Troll
|
|
|
|
|
The default icon is in there, as though no icon was attached to the application.
Though in the ALT+TAB window-switch dialog the icon is correct.
Windows 7 home premium.
Чесноков
|
|
|
|