For similar strings, the relevant data in the 2D matrix is concentrated around the main diagonal (although a naive implementation of the algorithm, such as the one shown on Wikipedia, computes all the matrix elements to arrive at the distance). So what you could do is:
1. use a sparse matrix, as ahmed already suggested; [ADDED] maybe in the top triangle you don't need to store the elements that equal their top neighbour plus one, and in the bottom triangle those that equal their left neighbour plus one. However, this is likely to fail dramatically if the strings start out with a couple of differences... [/ADDED]
2. or use a band centered around the main diagonal, making the memory cost linear rather than quadratic (unless you make the band's thickness a constant fraction of the problem size); see the sketch below this list;
3. or develop something similar to the sliding window approach I applied here: LPTextFileDiff: another textfile compare utility.[^]
For #2 and #3 you would need:
- to estimate the first row and column values from what you already have, which could be tricky and somewhat unreliable;
- to detect pathological cases.
You would not get the exact distance for very distant strings, but I guess that isn't really important.
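To make #2 concrete, here is a rough C# sketch of a banded computation; the band half-width k is a parameter you would have to choose, the result is exact whenever it comes out at most k, and -1 flags the pathological case where the band was too narrow:

static int BandedLevenshtein(string s, string t, int k)
{
    int n = s.Length, m = t.Length;
    const int INF = int.MaxValue / 2;                // "infinite", yet safe to add 1 to
    var prev = new int[m + 1];
    var curr = new int[m + 1];
    for (int j = 0; j <= m; j++) prev[j] = j <= k ? j : INF;
    for (int i = 1; i <= n; i++)
    {
        for (int j = 0; j <= m; j++) curr[j] = INF;  // everything outside the band
        curr[0] = i <= k ? i : INF;
        int lo = Math.Max(1, i - k), hi = Math.Min(m, i + k);
        for (int j = lo; j <= hi; j++)
            curr[j] = Math.Min(
                Math.Min(prev[j] + 1, curr[j - 1] + 1),          // delete / insert
                prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1));   // match / replace
        var tmp = prev; prev = curr; curr = tmp;
    }
    return prev[m] >= INF ? -1 : prev[m];            // -1: the band was too narrow
}

Only two rows are ever alive, and only 2k+1 cells per row get real work, so this costs O(k*max(n,m)) time and O(m) space.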
Another way to attack the problem is by "tokenizing", i.e. replacing your input (a sequence of characters) by a sequence of tokens, where a token would be a specific identifier, a keyword, an operator, white space, etc., whatever is relevant in the language you're dealing with. That may reduce the size of the problem by a factor of 3 to 5, I'd say.
If you're not inclined to create a full tokenizer (which is still much simpler than a full parser), you could look for popular strings (say, all the keywords) and replace each of them by a different Unicode character that is not supposed to be present otherwise.
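Something along those lines, say; the keyword list below is an arbitrary sample and the Private Use Area code points (U+E000 and up) are just characters unlikely to occur in real source. Note that a plain Replace ignores word boundaries ("int" inside "print", say), which a real tokenizer would handle properly:

// poor man's tokenizer: map popular strings to single stand-in characters
// from the Unicode Private Use Area; longest keywords first, so that the
// shorter ones cannot clobber them
static readonly string[] Keywords = { "static", "return", "public", "string", "while", "int" };

static string Compress(string source)
{
    for (int i = 0; i < Keywords.Length; i++)
        source = source.Replace(Keywords[i], ((char)(0xE000 + i)).ToString());
    return source;
}

Running both inputs through Compress before the distance computation shrinks n and m by the length of every keyword occurrence minus one.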
[ADDED]And then, maybe you don't really want Levenshtein at all; any good measure of difference could be sufficient. How about this stochastic one (sketched in C#):
static int StochasticDistance(string s, string t, Random rng)
{
    string shorter = s.Length <= t.Length ? s : t, longer = s.Length <= t.Length ? t : s;
    int len = Math.Max(1, shorter.Length / 100);     // sample length: ~1% of the shorter string
    const int BIG_NUMBER = 1000000;                  // penalty when a sample has no hit at all
    int distance = 0;
    for (int iter = 0; iter < 300; iter++)           // a fixed number of iterations, say 300
    {
        int start = rng.Next(shorter.Length - len + 1);
        string sample = shorter.Substring(start, len);
        double posShort = 100.0 * start / shorter.Length;  // relative position, in %
        int best = -1;
        // on multiple hits, keep the one closest to the original position (relative, in %)
        for (int hit = longer.IndexOf(sample); hit >= 0; hit = longer.IndexOf(sample, hit + 1))
            if (best < 0 || Math.Abs(100.0 * hit / longer.Length - posShort)
                          < Math.Abs(100.0 * best / longer.Length - posShort)) best = hit;
        if (best < 0) { distance += BIG_NUMBER; continue; }
        double delta = 100.0 * best / longer.Length - posShort;
        distance += (int)(delta * delta);            // square of the position difference
    }
    return distance;
}
This may take more cycles but it won't cost you a lot of memory!
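For example (fileA, fileB and fileC being hypothetical strings holding the inputs; a fixed seed keeps runs comparable):

var rng = new Random(12345);
int dAB = StochasticDistance(fileA, fileB, rng);
int dAC = StochasticDistance(fileA, fileC, rng);
// the smaller score marks the more similar pair; the absolute value means little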
[/ADDED]
|
I gave it another pass, and now I discovered that Wikipedia[^] holds the obvious optimization: "We can adapt the algorithm to use less space, O(min(n,m)) instead of O(mn), since it only requires that the previous row and current row be stored at any one time." (I had come to a similar conclusion, holding two lines perpendicular to the main diagonal.)
And you would only need one sweep, as all you want is the distance, not the way to get there with insert/delete/replace operations.
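For reference, a minimal sketch of that two-row scheme in C#; swapping the strings first makes the rows min(n,m)+1 cells wide, which is where the space bound comes from:

static int Levenshtein(string s, string t)
{
    if (s.Length < t.Length) { var swap = s; s = t; t = swap; }  // t is now the shorter
    var prev = new int[t.Length + 1];
    var curr = new int[t.Length + 1];
    for (int j = 0; j <= t.Length; j++) prev[j] = j;
    for (int i = 1; i <= s.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= t.Length; j++)
            curr[j] = Math.Min(
                Math.Min(prev[j] + 1, curr[j - 1] + 1),          // delete / insert
                prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1));   // match / replace
        var tmp = prev; prev = curr; curr = tmp;
    }
    return prev[t.Length];
}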
|
I don't know... it sounds similar to what I'm doing, which only reduces the memory footprint, not the complexity of the algorithm. I expect it's still O(mn), but I'll give it another read.
|
Of course the calculations are O(mn); each matrix element needs some (simple) calculations. If you don't like that (your original question was about memory footprint, not calculations), you shouldn't be using Levenshtein at all; I've never used it myself, as I consider it too much trouble to get an estimated difference between strings. That is why I suggested some alternatives. Tokenizing would save both cycles and bytes; the stochastic approach probably only saves bytes, not cycles, as it needs a lot of string scanning.
|
I was merely stating that I don't think the page is correct when it says the stated algorithm is O(min(n,m)).
Anyway, I've had an idea of a different technique forming in my head, but before I try it I wanted to find out what other Code Projectors had used.
|
It is O(min(n,m)) in space, and that is what Wikipedia said. The computational effort remains quadratic no matter what.
|
I didn't realize that Big-O could be used for space.
|
You could use Big-O notation for any cost function you choose; time is the most popular one obviously.
|
PIEBALDconsult wrote: Code Projectors
LOL. I like that. I'm gonna have to remember that and use it.
|
The French Wikipedia article points to two references that should be of interest to you: http://fr.wikipedia.org/wiki/Distance_de_Levenshtein[^]. Check the "Notes et références" section.
These describe implementations that are faster than O(NM) when you bound the number of differences.
|
Great link. I seldom see a regional Wikipedia page offering more or better information than the English one does.
BTW: the work by Myers is what Nick Butler[^] started from when creating his entry for the lean-and-mean competition[^] two years ago. However, IMO it is overkill for what PIEBALD is aiming at.
|
And a great algorithmician.
Well, from the little info provided by PIEBALD, it is hard to figure out what he needs.
|
YvesDaoust wrote: a great algorithmician
YvesDaoust wrote: it is hard to figure out what he needs
he wants the codez.
|
Shouldn't be a big deal to find source code of a good diff utility.
|
I'm not looking for a diff utility; as stated in my post, I'm just looking for a more efficient distance calculation.
|
Hey! I'm standing right here, I can hear you!
|
I'll give it a look later. Merci.
|
You can also have a look at
"A faster algorithm computing string edit distances, William J. Masek, Michael S. Paterson, Journal of Computer and System Sciences, Volume 20, Issue 1, February 1980, Pages 18-31."
It has an O(n^2/log(n)) behavior.
There are faster algorithms for approximate Levenshtein distance computation.
|
YvesDaoust wrote: for approximate Levenshtein distance computation
Approximate isn't good enough.
|
Hi, I'm a researcher in computer vision systems (Electronics Engineer by profession), designing a system capable of outperforming the current state-of-the-art vision systems. OpenCV 2.2 did not impress me; vision by machines seems to lag behind the simplest animal you can think of (like a cat or something else). I think computers are powerful enough to handle vision nearly as well as humans. Why are the state-of-the-art vision systems so task-specific and not as robust as they should be? Any suggestions?
|
Let's see your system and then we can judge.
|
Well, first I have to deal with patent issues, plus I'm writing a journal paper on it. I've written a proprietary vision library and will be ready to show my system to the world when all the legal issues are done and when I finalise the design. These legal issues make innovation very difficult.
|
BCDXBOX360 wrote: I'm designing a system capable of outperforming the current state-of-the-art vision systems.
BCDXBOX360 wrote: These legal issues make innovation very difficult.
You seem to have two conflicting statements here.
|
Richard MacCutchan wrote: You seem to have two conflicting statements here.
Maybe I was supposed to write that the legal process of getting patents and other rights to an invention discourages innovation but does not make it impossible.
|