One way to approach a problem like this is called "divide-and-conquer": break the problem into functional chunks, then implement and test one chunk at a time. I think you can deal with this problem by dividing it into three tasks:
1. It's clear from the information we have now that you need to pre-process the string before you analyze which words are unique and "second longest." Runs of white-space have to be collapsed to a single white-space (or ignored), and punctuation and other special characters must be removed.
2. Then you want to eliminate duplicate words in the string.
3. Finally, you want to get the lengths of the cleaned-up unique words in the string, and get the words whose length equals the length of the second-longest word in the string.
Note that you could start with any of these tasks. For example, to work on tasks 2 and 3 first, you could create sample data that is already cleaned up, or that already has its duplicates removed.
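As a rough illustration of tasks 2 and 3, here is a minimal sketch that works on sample data which is already cleaned up. The class and method names are hypothetical, and it makes two assumptions the question may or may not share: duplicates are compared ignoring case, and "second longest" means the second-largest distinct word length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SecondLongestSketch
{
    // Hypothetical helper: expects a string that has already been cleaned,
    // so that words are separated by spaces and carry no punctuation.
    public static List<string> SecondLongestWords(string cleanString)
    {
        // Task 2: split into words and drop duplicates.
        // Assumption: case is ignored; use StringComparer.Ordinal otherwise.
        List<string> uniqueWords = cleanString
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();

        // Task 3: collect the distinct word lengths, longest first.
        List<int> lengths = uniqueWords
            .Select(w => w.Length)
            .Distinct()
            .OrderByDescending(n => n)
            .ToList();

        // With fewer than two distinct lengths there is no "second longest".
        if (lengths.Count < 2)
        {
            return new List<string>();
        }

        int secondLongest = lengths[1];
        return uniqueWords.Where(w => w.Length == secondLongest).ToList();
    }
}
```

Called with "I like apples I treasure mango pineapple lychee I like rambutan and green bananas durian stinks", this returns "treasure" and "rambutan": the longest word is "pineapple" (9 characters), and those two are the only words with 8.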
Let's focus on task 1:
While you could do some fancy stuff in LINQ to collapse runs of special characters and white-space into a single white-space, I think simple may be better here; we'll use a StringBuilder for efficiency in dealing with characters.
You already know you need a loop that goes through the whole string character by character:
// requires: using System.Linq; using System.Text;
private string StripWords(string theString, bool doRemovePunctuation, params char[] otherCharsToRemove)
{
    StringBuilder sb = new StringBuilder();

    // true while the previous character was a separator (white-space or a removed character);
    // starts as true so that leading separators are dropped
    bool lastCIsWhiteSpace = true;

    foreach (char currentc in theString)
    {
        // treat white-space, punctuation (if requested), and the extra characters alike
        bool currentCIsWhiteSpace = char.IsWhiteSpace(currentc)
            || (doRemovePunctuation && char.IsPunctuation(currentc))
            || otherCharsToRemove.Contains(currentc);

        if (!currentCIsWhiteSpace)
        {
            sb.Append(currentc);
        }
        else if (!lastCIsWhiteSpace)
        {
            // append only one space for a whole run of separators
            sb.Append(' ');
        }

        lastCIsWhiteSpace = currentCIsWhiteSpace;
    }

    // a trailing separator leaves one space at the end; remove it
    return sb.ToString().TrimEnd();
}
If we get this task right, then with a string like:
string testString = @"I like <apples>. I treasure: mango, pineapple, lychee. I like rambutan, and green bananas; durian stinks";
And a call to:
string cleanString = StripWords(testString, true, ':', '<', '>');
We should get output like this:
"I like apples I treasure mango pineapple lychee I like rambutan and green bananas durian stinks"