Logarithmic scaling algo by x, not y

Question

OK, it seems I've gone dull, because I can't wrap my head around the following:

I have a data set y(x). Basically, x is discrete time (seconds) and y is some data value that depends on x. Scaling y is simple: ScaledY = ln(y - min(y) + 1). If I need to compress the data, let's say by minute, that's also easy: calculate the average y for each minute, then calculate ScaledY using the formula above. Now, the problem I'm trying to solve is to scale x logarithmically instead of y. The only idea I have so far is this:
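For reference, the two easy steps described above (averaging per minute, then log-scaling y) could look roughly like this. This is only a sketch: the names `averageByGroup` and `scaleYLn` are made up for illustration, and a group size of 60 is assumed for per-minute averaging of per-second data.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Average consecutive groups of `groupSize` samples (e.g. 60 for per-minute).
// A shorter trailing group is averaged over the samples it actually has.
std::vector<double> averageByGroup(const std::vector<double>& y, std::size_t groupSize)
{
    std::vector<double> out;
    if (groupSize == 0)
        return out;
    for (std::size_t start = 0; start < y.size(); start += groupSize)
    {
        const std::size_t end = std::min(start + groupSize, y.size());
        double sum = 0.0;
        for (std::size_t k = start; k < end; ++k)
            sum += y[k];
        out.push_back(sum / static_cast<double>(end - start));
    }
    return out;
}

// ScaledY = ln(y - min(y) + 1); the smallest value maps to ln(1) = 0.
std::vector<double> scaleYLn(const std::vector<double>& y)
{
    std::vector<double> out;
    if (y.empty())
        return out;
    const double minY = *std::min_element(y.begin(), y.end());
    out.reserve(y.size());
    for (double v : y)
        out.push_back(std::log(v - minY + 1.0));
    return out;
}
```

Calling `scaleYLn(averageByGroup(data, 60))` would then give one scaled value per minute.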

- Divide my data set into two parts. Calculate the average y over all x in the left part; divide the right part into two parts again, and repeat this divide/average logic until everything is processed.

The reason I don't like this approach is the plain "average" over the left part, because it is not correct from a math point of view.

Is there any other approach, or maybe a standard algo?
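To make the halving idea concrete, here is a rough sketch of it. The name `halvingAverages` is hypothetical, and rounding the left half up (so that every step consumes at least one sample) is an assumption, since the description does not say how odd sizes are split. It uses the plain, unweighted average that the concern above is about.

```cpp
#include <cstddef>
#include <vector>

// Repeatedly average the left half of what remains, then continue with the
// right half. Oldest data (low indices) ends up in the widest intervals.
std::vector<double> halvingAverages(const std::vector<double>& y)
{
    std::vector<double> out;
    std::size_t left = 0;
    while (left < y.size())
    {
        const std::size_t remaining = y.size() - left;
        const std::size_t len = (remaining + 1) / 2; // left half, rounded up
        double sum = 0.0;
        for (std::size_t k = left; k < left + len; ++k)
            sum += y[k];
        out.push_back(sum / static_cast<double>(len));
        left += len;
    }
    return out;
}
```

For eight samples this produces four values, covering 4, 2, 1 and 1 input points respectively.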

Posted 26-Sep-13 4:37am

Kosta Cherry

Comments

CPallini 26-Sep-13 12:33pm

Why do you need to average?

Kosta Cherry 26-Sep-13 14:16pm

Because in (any) left part the y values could be very different, and that whole part is reduced to one y value. Basically, imagine an ln graph, but rotated by 90 degrees.

Sergey Alexandrovich Kryukov 26-Sep-13 15:54pm

The question is not 100% clear, but what's the problem, school algebra? Then remember one thing: if you have some equation, you can take the logarithm of the left and right parts; the equation still holds at the points where the expressions under the logarithms are positive. :-)
—SA

Kosta Cherry 26-Sep-13 16:25pm

Nope, it's not school algebra, and not an assignment either :)
It's a real-world problem. I have a data set, a very big one. It can be represented as an array, where the index is a second and the value is the data value for that second. I need to compress the data for future analysis, but in a special way: older data is less "influential", and newer data is more "important". So I decided to do logarithmic compression, so that the older the data is, the more it gets compressed. For example, for the most recent minute I want to keep all 60 values, but for a minute that was an hour ago I need just one value. For data from a day ago I need one value for a whole hour, and so forth. The thing is, this is not exactly logarithmic, and averaging is not exactly the way to do it, because within one compression interval the values are also not equally important: the closer the data is to the present, the more "weight" it should have. I mean, I can go with the idea I currently have, but what bugs me is this: it should be a pretty standard scaling algorithm, but I'm already 20 years past university :(, never needed this before, and just don't know how to formulate the problem properly to google it. If someone can just point me to the right resource (preferably with a readily available algo, language is not important), this would be very helpful.

CPallini 26-Sep-13 16:58pm

I guess you are looking for a distribution. Suppose your data is continuous, say y(x).
Then you could choose, rather arbitrarily, w(x) in such a way that the contribution of the interval {x0,x1} is given by the integral of w(x)y(x)dx, computed over the range {x0,x1}, divided by the integral of w(x)dx, computed over the whole x range (the latter integral is a normalization factor). That would work provided the integrals converge. Of course, in your actual algorithm you have to discretize, using sums instead of integrals, but I suppose you got the idea.
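Written out, the suggestion above reads as follows. The symbols C, I_k, x_min and x_max are introduced here only for illustration; w(x) is still a free choice.

```latex
% Continuous form: contribution of the interval [x_0, x_1]
C(x_0, x_1) = \frac{\int_{x_0}^{x_1} w(x)\, y(x)\, dx}
                   {\int_{x_{\min}}^{x_{\max}} w(x)\, dx}

% Discrete form: I_k is the set of sample indices in the k-th interval
C_k = \frac{\sum_{j \in I_k} w_j\, y_j}{\sum_{j} w_j}
```

If the denominator is restricted to the same interval, i.e. \sum_{j \in I_k} w_j, the result is a weighted average of y over that interval rather than its share of the whole.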

Kosta Cherry 26-Sep-13 17:11pm

Yeah, but that's the exact question I'm struggling with: which weight function w(x) to use, and how to define the {x0,x1} intervals so that they are logarithmically distributed. But thank you anyway, now I know where to look.

CPallini 26-Sep-13 17:43pm

Well, if x is "how much time has elapsed", that is, events far in the past have large positive x, then w(x)=1/x looks like a good candidate to me.

Kosta Cherry 26-Sep-13 20:02pm

Yeah, I did it differently, but you gave me the right direction to think in.
I'll post the resulting function in a solution in case anyone is interested.

CPallini 27-Sep-13 3:20am

I am interested, of course.

Sergey Alexandrovich Kryukov 26-Sep-13 18:08pm

Perhaps you don't understand the phrasing. If something is not a school assignment, it does not mean it's not "school algebra", which simply means the level of mathematical knowledge, something people used to do at school. Frankly, I never understood those "20 years". You learned the multiplication table even earlier, but you don't complain that it was too long ago, right?
Come on, I hope you are almost there already, make some tiny effort... :-)
—SA

Kosta Cherry 26-Sep-13 20:13pm

It's not about math, it's about understanding the problem first, sort of "requirements gathering" from myself :). You know the old joke (as you are Russian too) about the professor who explained the subject to his students so many times that he finally understood it himself? That's what happened to me today :) Once I started talking with CPallini, I was able to explain to myself what exactly I need, then came the algo idea, and the rest was very simple.

Sergey Alexandrovich Kryukov 26-Sep-13 20:50pm

This is not really a joke, this is the perfect truth... Well, mathematics is all about understanding the problem, too...
—SA

[no name] 27-Sep-13 1:08am

Thank you. Some humour is always appreciated.

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Kosta Cherry · Answer 1 · 2013-09-26T14:04:00

Thanks to CPallini, who gave me a good hint, I came up with this function, in case anyone is interested. (This is a first, rough version, but it produced exactly what I was looking for.)

C++

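#include <cmath>
#include <cstddef>
#include <vector>

// Compresses data along x: the oldest samples (low indices) fall into wide
// intervals, the newest samples (high indices) into narrow ones.
void scaleXLn(const std::vector<double>& data, std::vector<double>& result)
{
   result.clear();
   const std::size_t oldSize = data.size();
   if (oldSize == 0)
      return; // log(0) is undefined, and there is nothing to compress
   // floor(ln(oldSize)) + 1 points; the cast truncates, hence the + 1
   const std::size_t newSize =
      static_cast<std::size_t>(std::log(static_cast<double>(oldSize))) + 1;
   result.resize(newSize, 0.0);
   // now we have "newSize" number of "new" points.
   // Let's loop by them, oldest interval first:
   std::size_t leftIdx = 0;
   for (std::size_t i = newSize; i-- > 0; )
   {
      // get the right index (the last interval ends at the newest sample):
      const std::size_t rightIdx = i
         ? static_cast<std::size_t>(static_cast<double>(oldSize) - std::exp(static_cast<double>(i)))
         : oldSize - 1;
      // now we have "old" interval "left" to "right".
      // Our w(x) is 1/log(x + 1), where x = oldSize - j is the age of sample j
      // Let's calculate weighted average:
      double wxSum = 0.0;
      double avg = 0.0;
      for (std::size_t j = leftIdx; j <= rightIdx; ++j)
      {
         const double wx = 1.0 / std::log(static_cast<double>(oldSize - j) + 1.0);
         wxSum += wx;
         avg += wx * data[j];
      }
      result[newSize - i - 1] = avg / wxSum;
      // next
      leftIdx = rightIdx + 1;
   }
}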
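
To see which input indices each output point covers, the index arithmetic of the function above can be pulled out on its own. This is only a sketch for inspection; the name `bucketRanges` is hypothetical and not part of the original answer.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the inclusive [left, right] index range of every output point for
// an input of n samples, using the same floor(ln(n)) + 1 intervals and the
// same n - exp(i) boundaries as scaleXLn.
std::vector<std::pair<std::size_t, std::size_t>> bucketRanges(std::size_t n)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (n == 0)
        return ranges;
    const std::size_t buckets =
        static_cast<std::size_t>(std::log(static_cast<double>(n))) + 1;
    std::size_t left = 0;
    for (std::size_t i = buckets; i-- > 0; )
    {
        // The last interval (i == 0) always ends at the newest sample.
        const std::size_t right = i
            ? static_cast<std::size_t>(static_cast<double>(n) - std::exp(static_cast<double>(i)))
            : n - 1;
        ranges.emplace_back(left, right);
        left = right + 1;
    }
    return ranges;
}
```

For n = 100 this gives five intervals: [0,45], [46,79], [80,92], [93,97] and [98,99], so the interval widths shrink from 46 samples for the oldest data to 2 for the newest.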