|
(this is a long one, i apologize)
i'm working on an image filter. the basic operation is:
1. run across a row of pixels (one BYTE per component); perform a calculation and store the result in an array of floats. the output array is the same width as a pixel row.
2. run across the same row, but in the opposite direction; perform a similar (but not identical) calculation. store the results in a different array of floats.
3. sum each pair from the two float arrays and store the result in a temporary float image of the same dimensions as the input image, but rotate the output 90 degrees. in other words: output rows as columns in the temp image.
4. using the same basic method as 1 and 2, process the floating point temp image. sum the results and write them to an output 8-bit image.
so, again: there are two calculation loops and a summation loop per row. after all input rows are done, the process is repeated using the temp as input.
Loop over 8-bit Rows
    Loop over Column[y] left to right => tempRow1
    Loop over Column[y] right to left => tempRow2
    tempRow1 + tempRow2 => temp f.p. image row[y]
Loop over temp f.p. Rows
    Loop over Column[y] left to right => tempRow1
    Loop over Column[y] right to left => tempRow2
    tempRow1 + tempRow2 => output 8-bit image row[y]
just for reference, here's the first calculation loop:
for (x = 4; x < width; x++)
{
    pTempPos[x] =
          numerator_factors_positive[0] * pCurPix[0]
        + numerator_factors_positive[1] * pCurPix[-iChannels]
        + numerator_factors_positive[2] * pCurPix[-2 * iChannels]
        + numerator_factors_positive[3] * pCurPix[-3 * iChannels]
        - denominator_factors_positive[1] * pTempPos[x - 1]
        - denominator_factors_positive[2] * pTempPos[x - 2]
        - denominator_factors_positive[3] * pTempPos[x - 3]
        - denominator_factors_positive[4] * pTempPos[x - 4];

    pCurPix += iChannels;
}
that's the first loop. the third loop looks identical, except that pCurPix is a pointer to a floating point value in the third loop - it is a BYTE pointer here. the 2nd and 4th loops are very similar to that and are also identical to each other - again, except for the pCurPix data type.
also, i wrote this code as a template so i can switch the type of the floating point data from float to double, for testing (the "factor" arrays, the temp row buffers and the temp image).
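structurally it's something like this (a boiled-down sketch just to show the layout i mean - the names are placeholders, not the real code):

#include <vector>

typedef unsigned char BYTE;
const int FACTORS = 5;

// simplified sketch of the structure described above -- illustrative only
template <typename FloatType>                    // float or double, switchable for testing
class IIRFilter
{
    FloatType numerator_factors_positive[FACTORS];
    FloatType numerator_factors_negative[FACTORS];
    FloatType denominator_factors_positive[FACTORS];
    FloatType denominator_factors_negative[FACTORS];

    std::vector<FloatType> tempRow1, tempRow2;   // per-row scratch buffers
    std::vector<FloatType> tempImage;            // 90-degree-rotated intermediate image

public:
    void CalcIIRFactors(FloatType sigma);        // fills the factor arrays from sigma
    void Filter(const BYTE* pSrc, BYTE* pDst,
                int width, int height, int channels);
};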
a little more info:
one of the parameters ("sigma") to the function is used to set the values in those factor arrays. the algorithm is constant complexity with respect to that parameter - sigma changes the values that pixels are multiplied by, not the number of times the calculations happen. the only thing that influences the amount of calculation performed is the size of the input image. in theory...
another parameter is the number of color channels in the input image (1=grayscale, 3=RGB, 4=RGBA, etc.)
and finally here's the problem! :
when:
1. the class is using float s for the f.p. data type
2. the image is a single channel
3. the value of sigma is near 3...
the third loop (which, again, looks exactly like what i've posted above) slows down to where it's literally ten times slower than all the other loops. as i move sigma away from three, the performance quickly goes back to where it should be: by 6, loop #3 is as fast as the rest, and they all stay the same speed as far as i can tell for all other values of sigma: 50 is as fast as 6, and 70 is as fast as 6.
so it would seem the solution is to use doubles. but an array of doubles is 2x as big as an array of floats. and, even worse, the float version of this is 2x faster than the double version!
here are the overall timings for the float version (sigma in the 1st column, time in seconds for 50 reps in the 2nd):
0.10 0.44
0.60 0.56
1.10 0.76
1.60 0.80
2.10 1.29
2.60 1.48 -- spike, around 3.0
3.10 1.44
3.60 1.25
4.10 0.97
4.60 0.66
5.10 0.45
5.60 0.39
6.10 0.34
6.50 0.34
.. and then it stays at 0.34s until sigma = 93.5, when it totally blows up and does this:
93.00 0.34
93.10 0.36
93.20 0.34
93.30 0.34
93.40 0.34
93.50 27.22
93.60 27.38
93.70 27.19
93.80 0.34
93.90 0.36
94.00 0.36
these timings are on a 2.4GHz Core 2. but i can duplicate the slowdown on a single-core Pentium D 2.8.
anybody have any idea what could be going on?
update: ok, 93.5 issue is when some of the array values go to infinity... i don't see anything like that at 3.0, though.
|
|
|
|
|
Odd; according to the manual, the speeds of fmul, fadd and fsub are the same for all normal inputs, which is everything except denormals, infinities and NaNs.
Have you checked for denormals?
edit: it changes a little if you're using SSE, but not much; I would still check for denormals just in case
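A quick way to check is to scan your temp row buffers after a pass. Rough sketch for your float version (the helper names here are made up):

#include <float.h>      // FLT_MIN
#include <math.h>       // fabsf
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

// Count denormals in a row buffer -- a float is denormal if it is
// nonzero but smaller in magnitude than FLT_MIN.
static int CountDenormals(const float* p, int count)
{
    int hits = 0;
    for (int i = 0; i < count; i++)
        if (p[i] != 0.0f && fabsf(p[i]) < FLT_MIN)
            hits++;
    return hits;
}

// If denormals turn out to be the culprit and you can live with them being
// flushed to zero, these mode bits avoid the penalty -- but note they only
// affect SSE arithmetic, not the x87 FPU.
static void EnableFlushToZero()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}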
|
|
|
|
|
yeah, i've been looking for that. it was blowing up at sigma = 93.5 because of a div by zero in the factors calculations (which put a bunch of #INFs and #INDs in the calcs). but everything looks reasonable in the sigma=3 range.
|
|
|
|
|
Any chance you can show your code that includes your use of the sigma function parameter? Does your code produce the correct values?
|
|
|
|
|
 sure.
here's most of it:
void CalcIIRFactors (FloatType sigma)
{
    const FloatType b0 = (FloatType)-1.783;
    const FloatType b1 = (FloatType)-1.723;
    const FloatType w0 = (FloatType)0.6318;
    const FloatType w1 = (FloatType)1.997;

    const FloatType b0OverSigma = b0 / sigma;
    const FloatType b1OverSigma = b1 / sigma;
    const FloatType w0OverSigma = w0 / sigma;
    const FloatType w1OverSigma = w1 / sigma;

    const FloatType pi = (FloatType)acos(-1.0);
    const FloatType scale = sqrt(2 * pi) * sigma;

    const FloatType a0 = (FloatType)1.680 / scale;
    const FloatType a1 = (FloatType)3.735 / scale;
    const FloatType c0 = (FloatType)-0.6803 / scale;
    const FloatType c1 = (FloatType)-0.2598 / scale;

    numerator_factors_positive[0] = a0 + c0;
    numerator_factors_positive[1] =
        (exp(b1OverSigma) * (c1 * sin(w1OverSigma) -
         (c0 + 2 * a0) * cos(w1OverSigma)) +
         exp(b0OverSigma) * (a1 * sin(w0OverSigma) -
         (2 * c0 + a0) * cos(w0OverSigma)));
    numerator_factors_positive[2] =
        (2 * exp(b0OverSigma + b1OverSigma) *
         ((a0 + c0) * cos(w1OverSigma) * cos(w0OverSigma) -
          a1 * cos(w1OverSigma) * sin(w0OverSigma) -
          c1 * cos(w0OverSigma) * sin(w1OverSigma)) +
         c0 * exp(2 * b0OverSigma) + a0 * exp(2 * b1OverSigma));
    numerator_factors_positive[3] =
        (exp(b1OverSigma + 2 * b0OverSigma) * (c1 * sin(w1OverSigma) -
         c0 * cos(w1OverSigma)) + exp(b0OverSigma + 2 * b1OverSigma) *
         (a1 * sin(w0OverSigma) - a0 * cos(w0OverSigma)));
    numerator_factors_positive[4] = 0.0;

    denominator_factors_positive[0] = 0.0;
    denominator_factors_positive[1] =
        -2 * exp(b1OverSigma) * cos(w1OverSigma) -
         2 * exp(b0OverSigma) * cos(w0OverSigma);
    denominator_factors_positive[2] =
        4 * cos(w1OverSigma) * cos(w0OverSigma) *
        exp(b0OverSigma + b1OverSigma) +
        exp(2 * b1OverSigma) + exp(2 * b0OverSigma);
    denominator_factors_positive[3] =
        -2 * cos(w0OverSigma) * exp(b0OverSigma + 2 * b1OverSigma) -
         2 * cos(w1OverSigma) * exp(b1OverSigma + 2 * b0OverSigma);
    denominator_factors_positive[4] = exp(2 * b0OverSigma + 2 * b1OverSigma);

    int i;
    for (i = 0; i < FACTORS; i++)
    {
        denominator_factors_negative[i] = denominator_factors_positive[i];
    }

    numerator_factors_negative[0] = 0.0;
    for (i = 1; i < FACTORS; i++)
    {
        numerator_factors_negative[i] =
            numerator_factors_positive[i] -
            denominator_factors_positive[i] * numerator_factors_positive[0];
    }
there's a little more after this, but it's working on things that aren't used inside the offending loop. and, yes, the values are correct. - even when slow, it's giving good results.
also, testing with finer precision, it looks like the slowdown isn't centered at 3.0 exactly; it's looking more like sigma = 2.7 is the key. e, maybe??
i wish i could post the whole thing, but... well, it's proprietary.
|
|
|
|
|
I understand about the proprietary requirements.
Your code produces normal values for the numerator/denominator factors over the range of sigma you mentioned, in my experiments.
The only suggestion I have at this point is: can you instrument your runtime statistics at a finer level, to see which portions of your calculations could be responsible for the efficiency reduction?
If you find the solution it would be nice if you could post a follow-up to the thread.
Good luck.
|
|
|
|
|
Since performance is an issue, you could probably achieve a speed improvement by storing the sin, cos, and exp of the various expressions in temporary variables, because several of them are recomputed more than once. For example, cos(w1OverSigma) is computed seven times in that function.
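Something along these lines (just a fragment sketched against your posted code; it assumes the same locals and FloatType):

// Fragment meant to sit inside CalcIIRFactors -- compute each repeated
// sin/cos/exp once and reuse the results below.
const FloatType cosW0 = cos(w0OverSigma);
const FloatType sinW0 = sin(w0OverSigma);
const FloatType cosW1 = cos(w1OverSigma);
const FloatType sinW1 = sin(w1OverSigma);
const FloatType expB0 = exp(b0OverSigma);
const FloatType expB1 = exp(b1OverSigma);
const FloatType expB0B1 = expB0 * expB1;   // == exp(b0OverSigma + b1OverSigma)

// For example, denominator_factors_positive[1] then becomes:
denominator_factors_positive[1] = -2 * expB1 * cosW1 - 2 * expB0 * cosW0;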
|
|
|
|
|
the thing is, those are calculated ahead of the actual processing, and don't change. the actual work is pretty much 4x the loop i posted.
|
|
|
|
|
Hello. I'm writing some C code for an embedded application, and I've run into a problem wherein a compare against an enumerated value is not being executed correctly. Take the following code snippet, for example:
typedef unsigned int UINT16;

typedef enum enum_items_tag
{
    ITEM_1,
    ITEM_2,
    ITEM_3,
    /* ... (items elided) ... */
    ITEM_918,
    MAX_ENUM_ITEMS
} enum_items_t;

UINT16 n;

for ( n = 0; n < MAX_ENUM_ITEMS; n++ )
{
}
The code executes as expected, until n is incremented to equal MAX_ENUM_ITEMS, at which time the compare fails, and execution continues within the loop (when it should have exited). I've done things like this in the past without any problems.
I've tried re-typing n as enum_items_t (i.e. declaring n as "enum_items_t n"), as well as casting MAX_ENUM_ITEMS to UINT16. The only other thing I can think of at this point is that maybe there is an issue with the number of items in my enumerated type (919). Does anyone know if there are such constraints on enumerated types? I'm using a GCC-based compiler. Or, if you have any other ideas, it would be much appreciated. Thanks.
|
|
|
|
|
Why not output the value of MAX_ENUM_ITEMS to see what it is? You may be missing something in the middle of the enum.
919 items in total? Then MAX_ENUM_ITEMS should be 918.
|
|
|
|
|
You're right. Good catch! The enumerated type enum_items_t has a total of 920 items, including MAX_ENUM_ITEMS. I was thinking in terms of offset, or what n would be as the loop repeats.
Unfortunately, outputting constants information with embedded programming can be a little tricky, especially when you're stuck using a ***** compiler/IDE. I'll see what I can do, though. Thanks.
|
|
|
|
|
Just curiosity: what is the usefulness of such an enum?
|
|
|
|
|
It's used for quickly indexing into an array, rather than having to use processor time to search for an element.
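Something like this, simplified (the names are illustrative, not the real table):

/* Simplified illustration -- made-up names, not the real table. */
typedef enum
{
    ITEM_1,
    ITEM_2,
    ITEM_3,
    /* ... */
    MAX_ENUM_ITEMS      /* count of real items, used as the array size */
} enum_items_t;

static int item_table[MAX_ENUM_ITEMS];

/* O(1) lookup: the enum value is the array index, no searching needed. */
int GetItemValue(enum_items_t item)
{
    return item_table[item];
}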
|
|
|
|
|
Well, I cannot see any speed advantage. Would you please elaborate a bit?
|
|
|
|
|
I found the problem. As usual, it turned out to be really simple. My INVALID_ITEM item was positioned at the beginning of the enumeration. When I moved it to the end, after MAX_ENUM_ITEMS, things lined up and started working.
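In other words, roughly (simplified sketch of the fixed ordering; names are made up):

/* Previously INVALID_ITEM came first, so every real item was shifted up by
   one and MAX_ENUM_ITEMS was one larger than the number of real items.
   With the marker moved after the sentinel, the values line up again. */
typedef enum
{
    ITEM_1,             /* = 0 */
    ITEM_2,             /* = 1 */
    /* ... */
    ITEM_918,
    MAX_ENUM_ITEMS,     /* = number of real items */
    INVALID_ITEM        /* out-of-range marker, deliberately outside 0..MAX_ENUM_ITEMS-1 */
} enum_items_t;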
|
|
|
|
|
I did a test on the size change of an MFC exe.
When a line of code or a small resource bitmap is deleted from or added to the program, the exe size stays unchanged.
So I guess the exe is aligned in size, similar to the way a class is aligned in size by its members.
Do you think that is correct, or do you have more info about this?
|
|
|
|
|
Yes. By default, applications are aligned on 512-byte boundaries. (Due to a quirk in the Windows 98 paging mechanism, before Visual Studio 2008 Win32 applications were aligned on 4096-byte boundaries unless you added the /OPT:NOWIN98 switch to the linker. The linker in VS 2008 and later uses 512-byte alignment by default.)
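If you want to verify the alignment for yourself, "dumpbin /headers yourapp.exe" will show the file alignment, or you can read it straight from the PE header with a quick throwaway program (rough sketch):

#include <windows.h>
#include <stdio.h>

// Quick sketch: print the file and section alignment from an exe's PE header.
int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        printf("usage: %s file.exe\n", argv[0]);
        return 1;
    }

    FILE* f = fopen(argv[1], "rb");
    if (!f)
    {
        printf("cannot open %s\n", argv[1]);
        return 1;
    }

    IMAGE_DOS_HEADER dos;
    IMAGE_NT_HEADERS nt;
    fread(&dos, sizeof(dos), 1, f);
    fseek(f, dos.e_lfanew, SEEK_SET);
    fread(&nt, sizeof(nt), 1, f);
    fclose(f);

    printf("FileAlignment:    %lu\n", (unsigned long)nt.OptionalHeader.FileAlignment);
    printf("SectionAlignment: %lu\n", (unsigned long)nt.OptionalHeader.SectionAlignment);
    return 0;
}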
|
|
|
|
|
Hi Joe,
Where in the project settings should /OPT:NOWIN98 be added, and how?
|
|
|
|
|
In Visual Studio 2003/2005 it's in Properties|Linker|Optimization. Select "No (/OPT:NOWIN98)". Only set this for Release mode.
In Visual Studio 2008, this will generate a warning, so it should be set to Default.
In Visual C++ 6.0, go to Settings|Link. In the Project Options edit control, go to the end and manually add "/OPT:NOWIN98".
|
|
|
|
|
Joe, thanks.
The minimum size change is now 512 bytes instead of 4096.
Cheers.
|
|
|
|
|
How are you looking at the size? If you're looking at the Windows Explorer, it only shows size in "cluster" increments.
Open a command window and use the "dir" command to see the actual file size at the byte level.
|
|
|
|
|
krmed wrote: Open a command window and use the "dir" command to see the actual files size at the byte level.
Better still, right-click in Explorer and select Properties.
|
|
|
|
|
krmed wrote: If you're looking at the Windows Explorer, it only shows size in "cluster" increments.
It also shows the file's actual size, not just its size on disk.
"Old age is like a bank account. You withdraw later in life what you have deposited along the way." - Unknown
"Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons
|
|
|
|
|
Unless you have a different Windows Explorer than I do, it only shows one size - and it's not the actual size.
But perhaps you were responding to the idea of looking at the properties?
|
|
|
|
|
krmed wrote: But perhaps you were responding to the idea of looking at the properties?
Correct.
"Old age is like a bank account. You withdraw later in life what you have deposited along the way." - Unknown
"Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons
|
|
|
|