Hi all,

I'm wondering how to find the 2nd longest words in the text?

The following is the code for counting letters and words, but I can't figure out how to put them together to get the outcome of the 2nd longest words (red, and)...

I'm new to programming so simpler coding (must be in C#) would be great :)

Thank you all in advance~

Counting Letters

C#
int count = 0;
            string st = "I like apples. I like red apples. I like red apples and green bananas.";


            foreach (char c in st)
            {
                if (char.IsLetter(c))
                {
                    count++;
                }
            }

            lblNumOfLetters.Text = count.ToString();



Counting Words
C#
string st = "I like apples. I like red apples. I like red apples and green bananas.";

            char[] sep = { ' ' };

            String[] res = st.Split(sep);

            lblNumOfWords.Text = res.Length.ToString();
Posted
Updated 7-Nov-14 18:16pm
v3
Comments
BillWoodruff 8-Nov-14 1:19am    
Do you need to remove punctuation before you analyze the word lengths ? So, using your sample string: "apple" and "apple," would both be considered the same words, with five characters ?

When you say: "the 2nd longest words" ... that implies to me that if your string was:

that cat in the hat was the next up at bat

Then the second longest words would be: cat, the, hat, was, bat

I assume you want the result list NOT to contain duplicates: correct ?
Member 10977819 8-Nov-14 7:43am    
Hi Bill, maybe I left my msg little too late. I'm in NZ time zone so if you leave a msg, I'll be checking it about 9 hours later... Thanks and sorry for replying it late...
BillWoodruff 8-Nov-14 9:20am    
fyi: I'm at GMT +7, so I'm five hours "before" you. I'm going to post some code to help you get started, leaving out parts of it for you to figure out.
Member 10977819 8-Nov-14 4:04am    
Hi Bill, thanks again~ and sorry I was away... and yes, "apple" and "apple," are considered the same words with 5 characters. And yes, if it was "that cat in the hat was the next up at bat" then I want to see "cat, the, hat, was, bat" with NO duplicates. Thanks again and sorry for the late reply...

One way to approach a problem like this is called "divide-and-conquer:" break the problem into functional chunks: implement, and test, one chunk at a time. I think you can deal with this problem by dividing it up into three tasks:

1. It's clear from the information we have now that you need to do some pre-processing of the string before you analyze which words are unique and "second longest." Multiple-white-space has to be changed to one white-space ... or ignored; and, punctuation, and other special characters, must be removed.

2. Then you want to eliminate duplicate words in the string.

3. Finally, you want to get the lengths of the cleaned-up unique words in the string, and get the words with length equal to the length of a second longest word in the string.

Note you could start with working on any of these tasks; you could create appropriate sample data to test with that was cleaned-up, or had duplicates removed, to use in tasks 2, 3, for example.
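As an illustration of working on tasks 2 and 3 with sample data that is already clean, here is a minimal sketch. It assumes the input contains only words separated by single spaces; the class and method names are made up for this example, and it uses plain loops rather than Linq.

```csharp
using System;
using System.Collections.Generic;

public static class SecondLongestSketch
{
    // Assumes 'cleanText' contains only words separated by single spaces.
    public static List<string> SecondLongestWords(string cleanText)
    {
        // Task 2: drop duplicates, keeping first-seen order.
        var seen = new HashSet<string>();
        var unique = new List<string>();
        foreach (string word in cleanText.Split(' '))
        {
            if (word.Length > 0 && seen.Add(word))
            {
                unique.Add(word);
            }
        }

        // Task 3a: find the longest and the second-longest distinct lengths.
        int longest = 0, second = 0;
        foreach (string word in unique)
        {
            if (word.Length > longest)
            {
                second = longest;
                longest = word.Length;
            }
            else if (word.Length < longest && word.Length > second)
            {
                second = word.Length;
            }
        }

        // Task 3b: collect the words whose length equals the second length.
        var result = new List<string>();
        foreach (string word in unique)
        {
            if (second > 0 && word.Length == second)
            {
                result.Add(word);
            }
        }
        return result;
    }

    public static void Main()
    {
        string sample = "that cat in the hat was the next up at bat";
        Console.WriteLine(string.Join(", ", SecondLongestWords(sample)));
    }
}
```

With the sample sentence from the comments above, this should list cat, the, hat, was, bat. If every word has the same length, there is no second length and the list comes back empty.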

Let's focus on task 1:

While you could do some fancy stuff in Linq to handle multi-character and multi-white-space change to one white-space, I think simple may be better here; we'll use a StringBuilder for efficiency in dealing with characters.

You already know you need to have a loop and go through the whole string character by character.
C#
// needs: using System.Linq; (for Contains) and using System.Text; (for StringBuilder)
private string StripWords(string theString, bool doRemovePunctuation, params char[] otherCharsToRemove)
{
    StringBuilder sb = new StringBuilder();

    // keep track of whether the last character was white-space
    // note it may be white-space now because we changed
    // it to white-space in the code below
    bool currentCIsWhiteSpace = false, lastCIsWhiteSpace = false;

    char chr;

    foreach (var currentc in theString)
    {
        // can we take a short-cut here ?
        if (lastCIsWhiteSpace && char.IsLetterOrDigit(currentc))
        {
            // what do you need to update here
            // so that you continue the loop iteration ?
            //continue;
        }

        lastCIsWhiteSpace = currentCIsWhiteSpace;

        currentCIsWhiteSpace = Char.IsWhiteSpace(currentc);

        // if we are removing punctuation: replace with space
        // remove other optional chars: replace with space
        chr = currentc;

        if (currentCIsWhiteSpace
            || (doRemovePunctuation && Char.IsPunctuation(currentc))
            || otherCharsToRemove.Contains(currentc))
        {
            chr = ' ';
            currentCIsWhiteSpace = true;
        }

        sb.Append(chr);
    }

    return sb.ToString();
}
If we get this task right, then with a string like:
C#
string testString = @"I like <apples>. I treasure: mango, pineapple, lychee. I like rambutan, and green bananas; durian stinks";
And a call to:
C#
string cleanString = StripWords(testString, true,':','<','>');
We should get output like this:
"I like apples I treasure mango pineapple lychee I like rambutan and green bananas durian stinks"
 
Comments
Maciej Los 8-Nov-14 10:47am    
Good advice about duplicates!
+5!
Manas Bhardwaj 8-Nov-14 11:18am    
Nice +5
[no name] 8-Nov-14 17:55pm    
Also my 5. Bruno
Member 10977819 8-Nov-14 18:59pm    
Wow! a lot happened overnight :O so many suggests to read and try! Thanks Bill~ you have a talent in explaining things I believe, I bet you can be a professor in Computer Science, maybe you already are :) Sorry for replying it late coz I had to take some time to try all the coding. As PIEBALD mentioned that this looks like an assignment, yes it is but it's not for school it is for job interview... As I never done this sort of tasks during my study(mainly we built DB driven websites), this website allows me to see how other coders do the same tasks in different way. It's been very valuable :)
BillWoodruff 9-Nov-14 3:29am    
Please believe me when I say that "computer science" and I have little in common :)

Some friendly advice re job interview: it may be better to have a strong grasp of the basic aspects of C# and its structures (namespaces, classes [including abstract, and virtual], interfaces, structs, enums, properties), and C#'s Generic collection features, than to focus on the intricacies of using Linq at this point.

To use any computer language you need to achieve mastery of how variables are created, and how semantic control (what a given name refers to at any point in the program) is expressed, how to use iterators (loops), and how to evaluate code conditionally. Second you need to develop the ability to use the inheritance features of the language (assuming it's a modern OOP language).

To sum up: beware of becoming "intoxicated with the icing" (Linq) to the point you "forget the cake" (fundamental knowledge and skill) :)
The discussion about the list of separators is still active. I agree with BillWoodruff about the duplicates. As you can see, the word "apples" occurs 3 times. If you wanted the third or fourth longest word, the result would still be "apples". Why? Have a look at the list of returned words:
1 - bananas
2 - apples 
3 - apples 
4 - apples 
5 - green 
6 - like 
7 - like 
8 - like 
9 - red 
10 - red 
11 - and 
12 - I 
13 - I 
14 - I


When we remove the duplicated words, the result list should look like:
1 - bananas
2 - apples
3 - green
4 - like
5 - red
6 - and
7 - I


And now it's time for some sample code. It uses Linq:
C#
string st = "I like apples. I like red apples. I like red apples and green bananas.";
char[] sep = new char[]{'.',',',' '};
string secondLongestWord = (from words in st.Split(sep).Distinct().ToArray()
    orderby words.Length descending
    select words).Take(2).Last().ToString();

Console.WriteLine("Second longest word is: {0}" , secondLongestWord);

Result:
Second longest word is: apples


[EDIT]
A few words of explanation about the code (see comments).
C#
//get the array of non-duplicated words, split by the defined characters
(from words in st.Split(sep).Distinct().ToArray()
//order by the length of text
orderby words.Length descending
//list words, returns IOrderedEnumerable<String>
select words)
//get only the first 2 rows
.Take(2)
//get the last row, in this case the second one ;)
.Last()
//Last() already returns a string, so this call is redundant but harmless
.ToString()


For further information, please see:
Take()
Last()
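To make the behaviour of Take() and Last() on short input concrete, here is a small sketch. The helper name SecondLongestOrNull is hypothetical and not part of the answer's code; it swaps Take(2).Last() for Skip(1).FirstOrDefault() so that a missing second word shows up as null.

```csharp
using System;
using System.Linq;

public static class TakeLastSketch
{
    // Returns the second-longest entry, or null when there are fewer than two entries.
    public static string SecondLongestOrNull(string[] words)
    {
        return words
            .OrderByDescending(w => w.Length)
            .Skip(1)            // skip the longest entry
            .FirstOrDefault();  // null if nothing is left
    }

    public static void Main()
    {
        string[] none = new string[0];
        string[] one = { "bananas" };

        // Take(2) does not throw on a short sequence; it returns what is available.
        Console.WriteLine(none.Take(2).Count());
        // With a single element, Take(2).Last() returns that element, not a "second" one.
        Console.WriteLine(one.Take(2).Last());

        // Last() on an empty sequence throws InvalidOperationException.
        try
        {
            none.Take(2).Last();
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("Last() threw on empty input");
        }

        Console.WriteLine(SecondLongestOrNull(one) ?? "(no second word)");
    }
}
```

Note that splitting an empty string still yields one (empty) entry, so the query in the answer returns an empty string in that case rather than throwing.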

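Final note: In the meantime I found another solution, using the Regex class.

Using a proper pattern we are able to remove punctuation characters. Note that \W matches any non-word character, and digits and the underscore count as word characters, so numbers are kept as words; a pattern such as [\W\d]+ would drop them as well. Just replace the from clause with: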
C#
from words in Regex.Split(st, @"\W").Distinct().ToArray()

and remove this line:
C#
char[] sep = new char[]{'.',',',' '};


[EDIT 2]
Thanks to BillWoodruff for the valuable comment.
If a list of "second-longest words" is required, the solution is:
C#
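// requires: using System.Linq; using System.Text.RegularExpressions;
string st = "I like apples :) I like red apples :P I'd like to eat 1 red apple and 5 (five) yellow bananas.";

// split on runs of non-word characters, drop empty entries and duplicates
string[] uniqueWords = Regex.Split(st, @"\W+").Where(w => w.Length > 0).Distinct().ToArray();

// take the second-largest *distinct* length, so that two words tied
// for longest are not mistaken for the second longest
int secondvalue = (from length in uniqueWords.Select(w => w.Length).Distinct()
    orderby length descending
    select length).Take(2).Last();
Console.WriteLine("Second length of word is: {0}", secondvalue);

var qry = from words in uniqueWords
where words.Length == secondvalue
select words;
Console.WriteLine();
Console.WriteLine("List of words:");
foreach (string word in qry)
{
    Console.WriteLine("{0}", word);
}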


I'd suggest writing an extension method ;)
Extension Methods (C# Programming Guide)
How to: Add Custom Methods for LINQ Queries
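One possible shape for such an extension method is sketched below. The names StringWordExtensions and SecondLongestWords are assumptions made for this example, and the choice of \W+ as the splitting pattern is just one option.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StringWordExtensions
{
    // Hypothetical helper: returns the distinct words whose length is the
    // second-largest word length in 'text' (empty list if there is none).
    public static List<string> SecondLongestWords(this string text)
    {
        // \W+ splits on runs of non-word characters; empty entries are dropped.
        List<string> words = Regex.Split(text ?? string.Empty, @"\W+")
            .Where(w => w.Length > 0)
            .Distinct()
            .ToList();

        // Distinct lengths, largest first; the second entry is the one we want.
        List<int> lengths = words.Select(w => w.Length)
            .Distinct()
            .OrderByDescending(n => n)
            .ToList();

        if (lengths.Count < 2)
        {
            return new List<string>();
        }
        return words.Where(w => w.Length == lengths[1]).ToList();
    }
}

public static class ExtensionDemo
{
    public static void Main()
    {
        string st = "I like apples. I like red apples. I like red apples and green bananas.";
        Console.WriteLine(string.Join(", ", st.SecondLongestWords()));
    }
}
```

Because the method is declared with "this string text" in a static class, it can be called directly on any string, as in st.SecondLongestWords().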
 
Comments
[no name] 8-Nov-14 10:57am    
Now this guy starts also as a Linq lawyer...solution looks that cool! Is Take(2) not dangerous if for example input is empty...just a question. My 5.
Maciej Los 8-Nov-14 11:06am    
Thank you, Bruno ;)
Linq is an area which I'm still trying to discover...
Is Take(2) not dangerous if for example input is empty? - good question! As far as I know, it's a safe method even if the input string is empty.
[no name] 8-Nov-14 11:29am    
And btw should it not be Take(1)...but that is really a Q because I have no idea about Linq
Maciej Los 8-Nov-14 11:33am    
OK, i'll improve my answer to provide more details about what code does.
[no name] 8-Nov-14 11:49am    
Sorry it wasn't my intent to force you or to stress you. Keep in mind I have no idea about Linq. So I think your answer is ok like it is.
Regards, Bruno
You can improve the second snippet (word counting) by adding more separators (e.g. punctuation) and by specifying StringSplitOptions: http://msdn.microsoft.com/en-us/library/system.stringsplitoptions(v=vs.110).aspx

Then iterate the results and compare the Lengths of each string.
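A minimal sketch of that suggestion follows. The separator list is only an assumption about which punctuation matters for the text at hand, and the class and method names are made up for illustration.

```csharp
using System;

public static class SplitSketch
{
    // Assumed separator set; extend it to match the punctuation in your text.
    private static readonly char[] Separators = { ' ', '.', ',', ';', ':', '!', '?' };

    public static string[] SplitWords(string text)
    {
        // RemoveEmptyEntries drops the empty strings produced by adjacent
        // separators such as ". " or by a trailing period.
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string st = "I like apples. I like red apples. I like red apples and green bananas.";
        string[] words = SplitWords(st);
        Console.WriteLine("Words: {0}", words.Length);

        // Iterate the results and compare the lengths.
        int longest = 0;
        foreach (string word in words)
        {
            if (word.Length > longest)
            {
                longest = word.Length;
            }
        }
        Console.WriteLine("Longest length: {0}", longest);
    }
}
```

Without RemoveEmptyEntries the same split would also return empty strings, which would inflate the word count.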
 
Comments
Sergey Alexandrovich Kryukov 8-Nov-14 1:12am    
This is not a good idea, because the number of potential delimiters is much greater. Please see my answer.
—SA
BillWoodruff 8-Nov-14 1:42am    
Except: if the OP has a clear idea of which delimiters they want to deal with, then this is a very reasonable suggestion ... at this point where we are "in the dark fumbling with the elephant" trying to figure out if we are "wise men" :)
Sergey Alexandrovich Kryukov 8-Nov-14 1:48am    
Well, more exactly, if the set of the delimiters is limited to some predefined set.
But the problem looks like an assignment, so it should be solved strictly.
—SA
PIEBALDconsult 8-Nov-14 9:48am    
"But the problem looks like an assignment, so it should be solved " by the student, not us. Don't give a student a fish.
Sergey Alexandrovich Kryukov 8-Nov-14 21:38pm    
I don't think I gave too much, but I agree with the idea: it's much better to teach how to fish.
—SA
Use the following functions:
http://msdn.microsoft.com/en-us/library/system.char.isletter%28v=vs.110%29.aspx,
http://msdn.microsoft.com/en-us/library/0t641e58%28v=vs.110%29.aspx.

Everything else will be considered a word delimiter. Whether digits count as delimiters or as parts of words depends on the required approach. For example, one can count digits as non-delimiters, but ignore all words containing at least one digit. Or something else.
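The idea can be sketched as follows, under the assumption that digits are treated as delimiters (only one of the policies mentioned above). The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LetterTokenizer
{
    // Every character that is not a letter ends the current word.
    // Digits are treated as delimiters here; that is one possible policy.
    public static List<string> Words(string text)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetter(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                words.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            // The last word may not be followed by a delimiter.
            words.Add(current.ToString());
        }
        return words;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Words("I like <apples>. 5 (five)")));
    }
}
```

This way there is no separator list to maintain: anything char.IsLetter rejects, including "<", ">" and ".", splits words automatically.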

—SA
 
Comments
BillWoodruff 8-Nov-14 1:38am    
I think until the OP clarifies the use-cases they need to deal with, we are kind of guessing here in terms of what specific functions of the 'Char Class might be most useful ... of course, the OP may not have fully thought the use-cases through ... yet: nothing wrong with that; if we can help the OP elucidate the use-cases, I think we all win :) I've asked the OP to clarify.

Char.IsLetterOrDigit was the first thing that sprang to my mind :)

If I'm going to write a function to strip/replace a bunch of chars in a string, I think I'll take the trouble to make it general-purpose, and handle cases like "<apple>" => "apple"
Sergey Alexandrovich Kryukov 8-Nov-14 1:51am    
You see, even if the OP clarifies the definition of "words" (I wouldn't hold my breath :-), this approach will work anyway. Don't you agree? I explained why to use IsLetter and IsDigit separately in the last paragraph.
Anyway, it was the good idea to ask for clarification. It just cannot change much...
—SA
BillWoodruff 8-Nov-14 1:55am    
In your last paragraph: "For example, one can count the as non-delimiters" I think you accidentally left out some characters, or the CP editor stripped them out.
Sergey Alexandrovich Kryukov 8-Nov-14 1:59am    
Typo, of course. Thank you very much; fixed.
—SA
Maciej Los 8-Nov-14 10:49am    
Yeah, good point.
+5!

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


