Introduction
I needed to know, so I wrote a quick, brute-force-and-ignorance (henceforth known as "BF+I") method to do it:
static long LinesCount(string s)
{
long count = 0;
int position = 0;
while ((position = s.IndexOf('\n', position)) != -1)
{
count++;
position++;
}
return count;
}
A quick test, and it works fine.
But then, I thought - isn't that a bit...well...rubbish? Isn't there another way to do it?
And sure, there is: I can think of two off the top of my head: Regex and Linq. So, I knocked up more methods.
And then I got a lot of feedback suggesting other ways, so I have re-worked this tip to reflect those.
A quick-n-dirty test rig to check them:
string s = File.ReadAllText(@"D:\Temp\MyLargeTextFile.txt");
long index = LinesCountIndexOf(s);
long regex = LinesCountRegex(s);
long linq= LinesCountLinq(s);
Console.WriteLine("{0}:{1}:{2}", index, regex, linq);
Stopwatch s0 = new Stopwatch();
... repeated, one per test
Stopwatch sn = new Stopwatch();
s0.Start();
for (int i = 0; i < 100; i++)
{
index = LinesCountIndexOf(s);
}
s0.Stop();
sb.AppendFormat("IndexOf : {0,06}\n", s0.ElapsedMilliseconds);
tbResults.Lines = sb.ToString().Split('\n');
sl.Start();
... Identical blocks, different method names.
The input file is 1.64MB, and contains around 23K lines. Please note that this file is read once for the entire test run, and is just used to give a significant amount of data for the tests to run against. A small amount of data would take us below the timer resolution of 1 millisecond and distort the result. This is not a test of how fast I can count lines in a file!
The methods tried so far:
static long LinesCountIndexOf(string s)
{
long count = 0;
int position = 0;
while ((position = s.IndexOf('\n', position)) != -1)
{
count++;
position++;
}
return count;
}
static Regex r = new Regex("\n", RegexOptions.Multiline);
static long LinesCountRegex(string s)
{
MatchCollection mc = r.Matches(s);
return mc.Count;
}
static long LinesCountLinq(string s)
{
return (from ch in s
where ch == '\n'
select ch).Count();
}
static long LinesCountSplit(string s)
{
return (s.Split(new char[] { '\n' })).Length;
}
static long LinesCount2(string s)
{
long count = 0;
int posMax = s.Length;
for (int position = 0; position < posMax; )
if (s[position++] == '\n')
count++;
return count;
}
static long LinesCount3(string s)
{
long count = 0;
int posMax = s.Length;
char[] a = s.ToCharArray();
for (int position = 0; position < posMax; )
if (a[position++] == '\n')
count++;
return count;
}
static long LinesCount4(string s)
{
long count = 0;
int position = -1;
while ((position = s.IndexOf('\n', position + 1)) != -1)
{
count++;
}
return count;
}
static long LinesCount5(string s)
{
long count = 0;
int posMax = s.Length;
char[] a = s.ToCharArray();
foreach (char c in a)
if (c == '\n')
count++;
return count;
}
static long LinesCount6(string s)
{
long count = 0;
foreach (char c in s)
if (c == '\n')
count++;
return count;
}
static long LinesCount7(string s)
{
return s.Length - s.Replace("\n", "").Length;
}
static long LinesCountAhoCorasick(string s)
{
IStringSearchAlgorithm searchAlg = new StringSearch();
searchAlg.Keywords = new string[] { "\n" };
StringSearchResult[] results = searchAlg.FindAll(s);
return results.Length;
}
static long LinesCountLinqCount(string s)
{
return s.Count(c => (c == '\n'));
}
unsafe static long LinesCountUnsafePtr(string s)
{
long lineCount = 1;
fixed (char* pchar = s)
{
char* p = pchar;
for (; *p != '\0'; p++)
{
if (*p == '\n') lineCount++;
}
}
return lineCount;
}
The addition of a StreamReader
version, which doesn't really fit, but people seem to be suggesting file based solutions:
ssr.Start();
for (int i = 0; i < 100; i++)
{
StreamReader sr = new StreamReader(@"D:\Temp\MyLargeTextFile.txt");
index = LinesCountStream(sr);
sr.Dispose();
}
ssr.Stop();
...
static long LinesCountStream(StreamReader sr)
{
long count = 0;
while (sr.ReadLine() != null)
{
count++;
}
return count;
}
They all return the same number of lines, which is fine. But the timing tests were interesting:
Test ElapsedMilliseconds
IndexOf : 167
Index (2) : 953
Index (3) : 1026
Index (4) : 167
Index (5) : 1211
Index (6) : 1544
Index (7) : 1088
Split : 1103
Regex : 2517
Linq (Expl) : 3576
Linq Count : 3531
Stream Read : 976
Unsafe Ptr : 633
I had expected Regex to be much, much faster! The serious difference between Linq and Regex was a surprise, but I was really surprised when BF+I beat them both hands down! The Split
method didn't surprise me by being slower than Index, as it means .NET has to allocate 23K string
s each time. To finish it all off, I rewrote the BF+I version as an Extension Method, and made it assume that a string
with no newlines was a single line. As an extension method, it was within a millisecond of the original times, as you would expect.
static class ExtensionMethods
{
public static long Lines(this string s)
{
long count = 1;
int position = 0;
while ((position = s.IndexOf('\n', position)) != -1)
{
count++;
position++;
}
return count;
}
}
BTW: Adding RegexOptions.Compiled
to the regular expression brought it down to:
Test ElapsedMilliseconds
BF+I 170
Split 1089
Regex 2063
Linq 3583
So it is WELL worth adding it to all your Regexes if you can.
Revision History
- [Edit]Split method suggested by Grasshopper.iics[^] added -OriginalGriff[/edit]
- [Edit2]I forgot to add the credit for the split method " /> - OriginalGriff[/edit2]
- [Edit3]Deeksha Shenoy[/Edit3]
- [Edit4]Inclusion of more variants as suggested by: unruled boy[^] (Aho-Corasick, awaiting method) Grasshopper.iics[^] (Split) akemper[^] (Variations on BF+I, Replace, Split) George Swan[^] (LinqCount) Thanks to all these people (and others if I haven't updated this yet)! -OriginalGriff [/Edit4]
- [Edit5]I forgot to introduce the three letter acronym "BF+I" when I first defined it as "Brute-Force-and-Ignorance". My thanks to fuzz_ball for pointing this out. Sorry for any confusion caused... - OriginalGriff[/Edit5]
- [Edit6]Jon Bellamy[^] suggested a StreamReader, which I have included despite it being file related, rather than string. Please note that the test is different, as StreamReader does not support Seek or Rewind so has to be instantiated new each time. (Removing the Dispose only took 10 milliseconds off the time) Roman Shero[^] suggested the unsafe version Extra text added to the file read to make it clear why I load a file at all - OriginalGriff[/Edit6]
- [Edit7]Added download for test rig (it's not pretty!), fixed formatting of code section (don't know what happened to it there), minor spelling mistake fixed, and Revision history reformatted - OriginalGriff[/Edit7]
Born at an early age, he grew older. At the same time, his hair grew longer, and was tied up behind his head.
Has problems spelling the word "the".
Invented the portable cat-flap.
Currently, has not died yet. Or has he?