|
Skippums wrote: It is not a dictionary of words, but a dictionary of characters
My dictionary is different from yours: if you have little string that are repeated you can think that that are 'words' and proceed with my method (only to reduce the length of the array)
Only you can decide, I don't know what kynd of strings you are working on and what kynd of semplification can be done.
Russell
|
|
|
|
|
Sounds like part of doing a diff of two file.
|
|
|
|
|
I get the feeling its going to end up On^2ish no matter how you code around it. How big are your strings?
Brute force without the ram overhead would be to scan index pointers thru A and B, looking ahead in A as you get matches in B and keeping the highest match. Then incrementing the A pointer to the next character etc.
Depending on what your text looks at you might be able to hack in some optimisations (like checking if A[ia]+longestmatchlength != B[ib]+longestmatchlength -> increment ib by longestmatchlength).
In fact that small optimisation could speed an individual scan of the B array up over time - possibly even flattening the On^2 back down to closer to On.
|
|
|
|
|
There has been a lot of work done on this for diff'ing files and many algorithms published. Just google on diff algorithms, there are a couple of articles on CP, e.g. here[^] and here[^]
Peter
"Until the invention of the computer, the machine gun was the device that enabled humans to make the most mistakes in the smallest amount of time."
|
|
|
|
|
The second link that you gave proposes the exact same idea as to how to do this that I came up with yesterday. I will only perform step #1, but that page does confirm to me that my idea on how to tackel this is a valid one. Using that algorithm, I think that I will eliminate most of the checks, which will yield an O(n*log(n)) algorithm (I think). Thank you very much for posting this so I could confirm that I am not going down a dead end! Thanks to everyone else who posted responses for me as well!
|
|
|
|
|
Here is the code (in c#) that I finally ended up implementing. As I said before, I think it is O(n*log(n))...
public static string LongestMatchingSubstring(string s1, string s2, int minMatchLength, bool caseSensitive) {
if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
return string.Empty;
string shorter, longer;
if (s1.Length > s2.Length) {
shorter = s2;
longer = s1;
} else {
shorter = s1;
longer = s2;
}
if (shorter.Length < minMatchLength)
return string.Empty;
Dictionary<char, IList<int>> hashTable = new Dictionary<char, IList<int>>(
Math.Min(caseSensitive ? 96 : 64, longer.Length));
if (caseSensitive) {
for (int i = 0; i < longer.Length; ++i) {
char c = longer[i];
if (!hashTable.ContainsKey(c))
hashTable.Add(c, new List<int>());
hashTable[c].Add(i);
}
} else {
for (int i = 0; i < longer.Length; ++i) {
char c = char.ToLower(longer[i]);
if (!hashTable.ContainsKey(c))
hashTable.Add(c, new List<int>());
hashTable[c].Add(i);
}
}
string result = string.Empty;
--minMatchLength;
FindLongestMatchingSubstring(shorter, longer,
hashTable, ref result, ref minMatchLength, caseSensitive);
return result;
}
private static void FindLongestMatchingSubstring(string str1, string str2,
Dictionary<char, IList<int>> ht, ref string result, ref int minMatchLength, bool caseSensitive) {
if (str1.Length <= minMatchLength)
return;
int midpoint = str1.Length / 2;
char key = caseSensitive ? str1[midpoint] : char.ToLower(str1[midpoint]);
if (ht.ContainsKey(key)) {
foreach (int index in ht[key]) {
string temp = GetFullMatch(str1, midpoint, str2, index, caseSensitive);
if (temp.Length > minMatchLength) {
minMatchLength = temp.Length;
result = temp;
}
}
}
FindLongestMatchingSubstring(str1.Substring(0, midpoint),
str2, ht, ref result, ref minMatchLength, caseSensitive);
FindLongestMatchingSubstring(str1.Substring(midpoint + 1),
str2, ht, ref result, ref minMatchLength, caseSensitive);
}
private static string GetFullMatch(string str1, int str1Idx, string str2, int str2Idx, bool caseSensitive) {
int minOffset = -Math.Min(str1Idx, str2Idx);
int maxOffset = Math.Min(str1.Length - str1Idx, str2.Length - str2Idx) - 1;
int lowOffset = -1, highOffset = 1;
if (caseSensitive) {
for (; lowOffset >= minOffset && str1[str1Idx + lowOffset] ==
str2[str2Idx + lowOffset]; --lowOffset)
;
for (; highOffset <= maxOffset && str1[str1Idx + highOffset] ==
str2[str2Idx + highOffset]; ++highOffset)
;
} else {
for (; lowOffset >= minOffset && char.ToLower(str1[str1Idx + lowOffset]) ==
char.ToLower(str2[str2Idx + lowOffset]); --lowOffset)
;
for (; highOffset <= maxOffset && char.ToLower(str1[str1Idx + highOffset]) ==
char.ToLower(str2[str2Idx + highOffset]); ++highOffset)
;
}
++lowOffset;
--highOffset;
return str2.Substring(str2Idx + lowOffset, highOffset - lowOffset + 1);
}
Hope this helps some people!
Jeff
|
|
|
|
|
Skippums wrote: I think it is O(n*log(n))...
I did not look at the code, but why don't you measure it?
It suffices to perform three runs with complexity c, 10*c and 100*c
where c should take a measurable time (say 1 second)
From the measured times, you can figure out the curve that fits...
Luc Pattyn [Forum Guidelines] [My Articles]
this weeks tips:
- make Visual display line numbers: Tools/Options/TextEditor/...
- show exceptions with ToString() to see all information
- before you ask a question here, search CodeProject, then Google
|
|
|
|
|
It is slightly more difficult, though, because the time it takes depends on the data. For example, if the longest substring matches immediately and is longer than half the length of the shorter substring, then it takes only O(n) time. If none of the charaters from either string match, it also only takes O(n) time. If the smaller string is shorter than the minimum result you are willing to accept, it takes O(1) time to inform me that I am an idiot for trying. I would like to test it for the worst case senario, but I am having trouble figuring out what that is. Anyone have any insight to a case when this will perform worse than O(n)?
|
|
|
|
|
Seems like you need a lot of test cases (random data could be considered) and a good
definition of what your primary measure of size "n" is.
Lacking these, your "I think it is O(n*log(n))..." seems a bit premature.
Luc Pattyn [Forum Guidelines] [My Articles]
this weeks tips:
- make Visual display line numbers: Tools/Options/TextEditor/...
- show exceptions with ToString() to see all information
- before you ask a question here, search CodeProject, then Google
|
|
|
|
|
Worst case that I can think of is if you have a string like 0101010...10 of length m = 2^i-1 where i is a + int, and another of all 1's of length n where n is any + int > m. Then this algorithm will perform approximately (m / 2) * 2 * n * log2(m + 1) = m*n*i operations, or O(n*n*log(n)). So I guess that solves whether or not it is O(n * log(n))! It still should outperform almost any other algorithm when 1) The alphabet over which you are performing the search is reasonably large, 2) The expected overlap between the two strings is reasonably large. Ultimately, I don't really care what the final time is on this algorithm. I am really only looking for any advice on how I can improve it. Thanks,
Jeff
|
|
|
|
|
Sorry for this late posting, but I have just been reading old forums and ran into this.
If you are still interested in this, I have an algorithm that runs pretty quickly, abet, written in MASM, but the method could be written in C# (I think it could, I have never coded in C#). I have a complete working program - source and EXE and a method to enter test cases via a file with the strings. It is quite lengthy, but mostly commentary. The source with comments is about 2250 lines. The MaxSubstring function is 472 lines with comments. Assembled with MASM 6.15.
Dave Augustine
|
|
|
|
|
It's been a long time since i did any really hard maths and a friend of mine just asked me the following question:
"How can I easily find all possible solutions for
3125r = 1024x - 8404
where r and x are both integers, and r is divisible by 5"
Any very intelligent people out there who know of an answer?
|
|
|
|
|
Hi,
there is an infinite number of solutions, given by:
r=1020+5120t
x=3121+15625t
where any value of t will do.
Luc Pattyn [Forum Guidelines] [My Articles]
this weeks tips:
- make Visual display line numbers: Tools/Options/TextEditor/...
- show exceptions with ToString() to see all information
- before you ask a question here, search CodeProject, then Google
|
|
|
|
|
I thought there would be any number of solutions (with no particular grounding or proof). But that's certainly a faster way of finding them...
|
|
|
|
|
Would it be an impudent question to ask how you arrived at this solution? Trying to work back from your solution leaves me a bit stumped...
|
|
|
|
|
Hi Paddy,
there are several ways to find the solution(s) of ax+by=c.
First of all a single, linear equation in two variables has either no solution
at all (say 2x + 4y = 1) or an infinite number, by adding the second coefficient
to the first number and subtracting the first coef from the second number.
So the main problem is to find a first solution.
Method 1: write a little prog, assume x=u+bt and y=v-at and have it try a lot
of values u (keep t at 0) and calculate real v until it happens to be integer.
Method 2: follow the strict mathematical approach; that's what you would need
if decent code is required to solve such problems in general
Method 3: the way in-between, with a couple of shortcuts, the ideal way
to solve a single problem manually. Like so:
3125r = 1024x - 8404 with constraint r=multiple of 5
lets try to replace variables by other variables and reduce the size of the
constants:
right hand side is multiple of 4, hence r must be multiple of 20, say r=20a
3125 * 20a = 1024x - 8404
15625a = 256x - 2101
lets ignore all multiples of the smallest coef, 256
15625=61*256+9
-2101=-9*256+203
9a = 203 (modulo 256)
try a couple of numbers 203 + 256k until a multiple of 9 is found;
this is bound to happen in 9 tries !
one immediately sees a solution is a=(203+256)/9=51
(this is where the mathematician would go more formally !)
hence r=1020, then calculate the x that goes with it, and add one variable t
with the right coefs.
Luc Pattyn [Forum Guidelines] [My Articles]
this weeks tips:
- make Visual display line numbers: Tools/Options/TextEditor/...
- show exceptions with ToString() to see all information
- before you ask a question here, search CodeProject, then Google
|
|
|
|
|
Thanks very much for your time )
|
|
|
|
|
Paddy Boyd wrote: r is divisible by 5
Then use r=5*y , then:
3125*5*y = 1024x - 8404
where y and x are both integers
It is a nice start point
Russell
|
|
|
|
|
Just in case Luc's asleep (seems to be never) you could use this applet[^] and solve
-1024x + 15625y + 8404 = 0
then use r = 5y.
If you tick the box 'Step-by-step' it even shows all working!
The source code for this applet is 3742 lines long! (mostly to do with the quadratic problem I guess)
Peter
"Until the invention of the computer, the machine gun was the device that enabled humans to make the most mistakes in the smallest amount of time."
|
|
|
|
|
cp9876 wrote: seems to be never
Really? Gee I must be sleeptyping again.
Luc Pattyn [Forum Guidelines] [My Articles]
this weeks tips:
- make Visual display line numbers: Tools/Options/TextEditor/...
- show exceptions with ToString() to see all information
- before you ask a question here, search CodeProject, then Google
|
|
|
|
|
i want to make a schedule for distributing employees over a month ,using genetic algorithm & C#
can anyone tell me plz how to proceed ?
|
|
|
|
|
|
In excel there is function COMBIN(number,number_chosen)
COMBIN(7,6) WILL RETURN TOTAL 7 COMBINATION
COMBIN(20,6) WILL RETURN TOTAL 38760 COMBINATION
I need the total combintions generated the when the values are given in sql server stored procedure
|
|
|
|
|
You can calculate factorial with FACT function so calculation a combination shouldn't be a problem
|
|
|
|
|
COMBIN(x,y) is simply calculed as x!/(y!(x-y)!)
where x! = 1*2*3*4...*x
(this is simple to implement)
a faster way to solve x!/(y!(x-y)!) is (x*(x-1)*(x-2)*...(x-y+1))/y!
(in this way you have to do some * less)
Russell
|
|
|
|
|