Click here to Skip to main content
15,883,731 members
Please Sign up or sign in to vote.
1.44/5 (2 votes)
See more:
How to get no. of occurrence of a letter which are together in c# win. form

Example:-
Case
string s = "aaaabbbcca";

Output
a-4,b-3,c-2,a-1

Note:
1) Please don't provide solutions with loops or iteration because of performance issue
2) I have to use GC.Collect() because there is 10,000,000,000 length of string in my real case so i need to free memory as soon as the data has been processed)

My code till now is
StringBuilder Output = new StringBuilder();

int Times = 0;
char NewChar = snewbuild[0];
char Lastchar = NewChar;//Need first char
for (; snewbuild.Length > 0; )
{
      NewChar = snewbuild[0];
      if (Lastchar == NewChar)
      {
            Times++;
      }
      else
      {
            Output.Append(Lastchar + "-" + Times + ",");
            Times = 1;
            GC.Collect();
      }
      Lastchar = NewChar;
      snewbuild.Remove(0, 1);
}
Output.Append(Lastchar + "-" + Times);


Any piece of code or any new idea for the following question will be appreciated & thanks in advance
Posted
Updated 28-Oct-14 0:13am
v5
Comments
Robert Welliever 28-Oct-14 4:16am    
I think the number you posted exceeds the maximum allowable size for a C# string. You should run my code with your large input set and no explicit garbage collection. Watch memory use with perfmon to see if normal gc isn't working well enough.
agent_kruger 28-Oct-14 4:18am    
no sir, 1) my string holds that big amount of data 2) sir can u suggest any large input set which will be efficient in this case
Robert Welliever 28-Oct-14 4:19am    
By the way, your using string concatenation inside your improperly cased "Output" StringBuilder variable. That's buying a Mercedes and then sh*tting on the seat.
agent_kruger 28-Oct-14 4:24am    
shitting in Mercedes sounds good :laugh by the way sir according to you how should i use it then?
Robert Welliever 28-Oct-14 4:58am    
Just do multiple calls to Append(). For example:
Output.Append(Lastchar).Append("-").Append(Times).Append(",");

C#
public static string TallyPhraseFrequencies(string input)
{
    StringBuilder output = new StringBuilder();
    int frequency = 1;
    char c = input[0];
    for (int i = 1; i < input.Length; i++)
    {
        if (i == input.Length - 1)
        {
            if (input[i] != c)
            {
                output.Append(c.ToString()).Append("-").Append(frequency).Append(",");
                output.Append(input[i].ToString()).Append("-1,");
            }
            else
                output.Append(c.ToString()).Append("-").Append(frequency + 1).Append(",");
        }
        else if (input[i] != c)
        {
            output.Append(c.ToString()).Append("-").Append(frequency).Append(",");
            c = input[i];
            frequency = 1;
        }
        else
            frequency++;
    }
    return output.ToString();
}
 
Share this answer
 
v5
Comments
Robert Welliever 28-Oct-14 5:06am    
Wait, it's still not right. I will get this before going to bed.
agent_kruger 28-Oct-14 5:09am    
thank you sir and as soon as you do u get accepted answer and +5 before me going to bed
Robert Welliever 28-Oct-14 5:23am    
Ok, I apologize. It's done and testing working now. I have not been so humbled in a long time. That was extremely easy and yet I made a terrible mess out of it at first. Thank you for the challenge though, way better than a crossword puzzle.
Robert Welliever 28-Oct-14 5:54am    
Oh and if you ever did actually need to call the GC explicitly it would not need to be done in every iteration of the loop. You might just add some code like:
if(i %100000 == 0) GC.Collect();

Good night and good luck.
agent_kruger 28-Oct-14 6:13am    
gn sir and actually i just did that
.NET objects cannot be > 2gb. in size. A string object on a 64-bit system ... Unicode = two bytes per character ... would probably max out at a little over 1gb.

StringBuilder has a maximum capacity the same as that of an Int32: +2,147,483,647.

While I believe you could read a 10gb. text file (5gb. of Unicode two-byte character codes), and process it chunk-by-chunk to analyze same-letter adjacent frequency, I cannot believe you can have an in-memory string of that size.

There is also simply no way to perform frequency analysis without some form of loop/iteration.

Anyhow, here's one way you could go about this:
C#
private string parseFrequencies(string data)
{
    StringBuilder sb1 = new StringBuilder();
    StringBuilder sb2 = new StringBuilder();

    sb1.Append(data);

    int count, pos;

    while (sb1.Length > 0)
    {
        char c = sb1[0];

        pos = 1;
        count = 1;

        while (pos < sb1.Length)
        {
            if (sb1[pos] == c)
            {
                count++;
                pos++;
            }
            else
            {
                break;
            }
        }

        sb2.Append(string.Format("{0}-{1},", c, count));

        sb1.Remove(0, count);
    }

    return sb2.ToString();
}

// sample test
string freq = parseFrequencies("aaaabbbcca"); // => "a-4,b-3,c-2,a-1,"
 
Share this answer
 
Comments
Maciej Los 28-Oct-14 16:53pm    
Nice work!
Please, see my answer ;)
[no name] 28-Oct-14 17:09pm    
Nicely answered, my 5+
The loops required can be tidied up a bit, for example
C#
public static void RunLengthEncode(string s) {
  Console.WriteLine(s);
  int pos = 0;
  while (pos < s.Length) {
    char c = s[pos];
    int startPos = pos;
    for (; pos < s.Length && c == s[pos]; ++pos) ;
    int runLength = pos - startPos;
    // example output
    Console.Write("{0}-{1},", c, runLength);
  }
  Console.WriteLine();
}


Alan.
 
Share this answer
 
Comments
agent_kruger 29-Oct-14 2:21am    
sir, your answer is right but sir as i said in the question that if we use loop on 10,000,000,000 length of string the performance will surely go down. sir, please see if any amendments can be made but thanks for answering
Alan N 29-Oct-14 8:36am    
The time taken is proportional to the number of characters. As each character must be read and compared, the only way to go faster is to divide the task between several threads. Of course this will only get a real speed increase if sufficient processor cores are available to run the threads.
agent_kruger 29-Oct-14 8:56am    
problem with threads are they are bit slower than running a normal code and will take the same time or even more
agent_kruger 29-Oct-14 2:24am    
sir, it worked great with 100,000,000 so +5 and accepted answer but see sir if any amendments can be made beacuse i need to do it with 10,000,000,000 length
Alan N 29-Oct-14 8:19am    
In common with other respondents I can't understand your claim to be able to work with 10,000,000,000 character strings. Think about it, even if it were possible it would require a single memory allocation of at least twice that many bytes.
Quote:
1) Please don't provide solutions with loops or iteration because of performance issue
You are using iterations in your own code.



That said, as far as I know, without using concurrency, there nothing more you can do for speeding up the process.
Your GC.Collect looks indeed suspicious. If you have an already allocated huge string, what's the purpose of continuously remove and collect small pieces? Such a operation only slows down your algorithm.
 
Share this answer
 
Comments
agent_kruger 28-Oct-14 4:19am    
i know sir but i dont want iteration because of performance issue
agent_kruger 28-Oct-14 4:20am    
sir and i am removing the string after processing and freeing the memory bcoz as soon as some memory is freed performance increases bit by bit
CPallini 28-Oct-14 7:17am    
If you are reading characters from a file (or multiple files) then you have to read chunks of them in order to maximize performance.
I haven't enough time, but i think i'm close to finding solution using Linq...

Here is my code:
C#
string text = "aaaaBbCcDDa";
int number = 0;
int counter = 1;
var query = from letters in text.ToArray()
          select new { index = number++, letter = letters, prev = text[number - 2 < 0 ? 0 : number - 2], countNo = letters == text[number - 2 < 0 ? 0 : number - 2] ? counter++ : counter=1 };

Console.WriteLine("Index | Letter | Prev | Count");
foreach (var v in query)
{
    Console.WriteLine("  {0}   |   {1}    |   {2}  |   {3}", v.index, v.letter, v.prev, v.countNo );
}
Console.ReadKey();


Result:
index letter prev    countNo
0        a    a        1
1        a    a        2
2        a    a        3
3        a    a        4
4        B    a        1
5        b    B        1
6        C    b        1
7        c    C        1
8        D    c        1
9        D    D        1 --> here should be 2, that's why i wrote that i almost found solution ;)
10       a    D        1


To do: improve linq query, then write another query to get max of countNo for each letter ;)
 
Share this answer
 
Comments
BillWoodruff 28-Oct-14 22:17pm    
I wondered if a Linq solution was possible for this problem (the problem as I understand it now, not the problem as I mis-understood it at first) ... assumed there probably was since I am certainly not at a high level of mastery of Linq.

My unexplored hunch is that using 'TakeWhile might be used for the look-ahead function required here. Be interesting to see what you come up with if you keep working on this. cheers, Bill
Maciej Los 29-Oct-14 2:20am    
Thank you for your comment, Bill ;)
agent_kruger 29-Oct-14 2:16am    
sir. i think the performance of the code with 10,000,000,000 length of string would not be great. please see if you can do something but thank you for your answer
Maciej Los 29-Oct-14 2:20am    
OK ;)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900