|
I think your regex is too restrictive. I assume a "word" here means anything that is separated by whitespace.
In regex, whitespace is \s and non-whitespace is \S.
So your regex should rather look as follows: ...Regex(@"\S+",...)
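To make the suggestion concrete, here is a minimal sketch of a counter built on \S+. The class and method names are made up for illustration; only Regex.Matches and the pattern come from the discussion.

```csharp
using System;
using System.Text.RegularExpressions;

static class WhitespaceWordCount
{
    // Hypothetical helper: counts maximal runs of non-whitespace characters.
    public static int Count(string text) => Regex.Matches(text, @"\S+").Count;

    static void Main()
    {
        // Double spaces and tabs do not produce empty "words".
        Console.WriteLine(Count("The quick  brown\tfox")); // prints 4
    }
}
```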
|
|
|
|
|
I think your proposal ignores punctuation characters. \w treats only "letters, digits, and underscores" as word characters, so \w+ lets spaces, punctuation, etc. act as separators.
Peter Wasser
Art is making something out of nothing and selling it.
Frank Zappa
|
|
|
|
|
I think using \S instead of \w is a better choice here. Otherwise you need to make the search a bit more sophisticated, e.g. to match monetary amounts as one word or to match "doesn't" as one word.
\w is a very strict "word character". I would expect the following to be one word (which \w+ does not match as one): $123,000.00.
Since punctuation is in most cases followed by a space, the \S approach gives a better match for day-to-day use in my eyes.
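A small sketch that lists the actual matches makes the difference visible. The Tokens helper and the class name are assumptions for illustration, not part of the original tip.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TokenDemo
{
    // Hypothetical helper: returns the text of every match of the pattern.
    public static string[] Tokens(string text, string pattern) =>
        Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        foreach (var sample in new[] { "$123,000.00", "doesn't" })
        {
            // Show how each pattern cuts the same input.
            Console.WriteLine("{0}: \\w+ -> [{1}]   \\S+ -> [{2}]", sample,
                string.Join("|", Tokens(sample, @"\w+")),
                string.Join("|", Tokens(sample, @"\S+")));
        }
    }
}
```

With \w+ the amount falls apart into 123, 000 and 00, and "doesn't" into doesn and t, while \S+ keeps each sample in one piece.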
|
|
|
|
|
Sure.
How would that cope with something like:
"incidentally , and might I say ( without prejudice )"
Peter Wasser
|
|
|
|
|
So we each have a counterexample that renders the other's solution kind of useless.
The question is: do you prefer to define a "word" by an inclusive set of characters (e.g. \w+) or by an exclusive set of characters (e.g. \S+)?
In either case, both \S+ and \w+ give not-so-reliable results.
A bit more sophistication is obviously needed, maybe: (?:\w[-\.,:;']\w|\w)+.
This treats combined terms as one word (e.g., i.e., ad-hoc, 12,345.00, 10:45:00, didn't, etc.). Ignoring the currency symbol ($) seems to be no problem to me.
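As a rough check of what that pattern actually captures, the following sketch prints the matches for the list of combined terms. The class and helper names are hypothetical; the pattern is the one quoted above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CombinedTermDemo
{
    // A word is a run of word characters in which a single joiner
    // (- . , : ; ') may sit between two word characters.
    const string Pattern = @"(?:\w[-\.,:;']\w|\w)+";

    // Hypothetical helper: returns every match as a string.
    public static string[] Words(string text) =>
        Regex.Matches(text, Pattern).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        string sample = " (e.g., i.e., ad-hoc, 12,345.00, 10:45:00, didn't, etc.).";
        string[] words = Words(sample);
        Console.WriteLine(words.Length);               // prints 7
        Console.WriteLine(string.Join(" | ", words));
    }
}
```

Note that a joiner only counts when a word character follows it, so the trailing dot of "e.g." and "etc." is not part of the match (the matches are "e.g" and "etc"). That does not change the count.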
|
|
|
|
|
Correct. The title of your post says it all. How do we define "word"?
Some definitions exclude abbreviations, numbers, etc.
So word counting algorithms come up with different answers.
Your point is quite valid and a more complex expression probably needs to be used to get the best result.
May have to do some experimentation.
Peter Wasser
|
|
|
|
|
// Needs: using System; using System.Text.RegularExpressions;
string[] text = new string[]
{
    "The total number of words \t in this sentence,is 10.",
    "Mr O'Brien-Smith arrived at 8.30 and spent \t $1,000.99",
    "$123,000.00",
    "incidentally , and might I say ( without prejudice )",
    " (e.g., i.e., ad-hoc, 12,345.00, 10:45:00, didn't, etc.).",
};
// One counter per column of the result table, in column order.
Func<string, int>[] counters = new Func<string, int>[]
{
    t => Regex.Matches(t, @"\w+").Count,                                          // tip
    t => t.Split(default(Char[]), StringSplitOptions.RemoveEmptyEntries).Length,  // A#1
    t => Regex.Matches(t, @"[^\s!?¡¿\-\–]+").Count,                               // A#2
    t => CountWords(t),                                                           // A#3 (CountWords is defined elsewhere, not shown here)
    t => Regex.Matches(t, @"\S+").Count,                                          // \S+
    t => Regex.Matches(t, @"(?:\w[-\.,:;']\w|\w)+").Count,                        // mine
};
// Header: one label per counter (the original header lacked the \S+ column).
Console.WriteLine("{1,3} {2,3} {3,3} {4,3} {5,3} {6} <-- {0}", "Text", "tip", "A#1", "A#2", "A#3", "\\S+", "mine");
foreach (var str in text)
{
    foreach (var f in counters) Console.Write("{0,3} ", f(str));
    Console.WriteLine(" <-- {0}", str);
}
results in:
tip A#1 A#2 A#3 \S+ mine <-- Text
 10   9   9   9   9   9  <-- The total number of words in this sentence,is 10.
 13   8   9   8   8   8  <-- Mr O'Brien-Smith arrived at 8.30 and spent $1,000.99
  3   1   1   1   1   1  <-- $123,000.00
  7  10  10  10  10   7  <-- incidentally , and might I say ( without prejudice )
 15   7   8   7   7   7  <-- (e.g., i.e., ad-hoc, 12,345.00, 10:45:00, didn't, etc.).
You may now evaluate the outcome...
Cheers
Andi
|
|
|
|
|
The following is the best I could find so far (to my intuitive understanding of "word"):
...
t=> Regex.Matches(t, @"(?:\d[\.,:]\d|\w[-\.']\w|\w)+").Count,
...
The resulting table is:
tip A#1 A#2 A#3 \S+ M#1 M#2  <-- Text
 10   9   9   9   9   9  10  <-- The total number of words in this sentence,is 10.
 13   8   9   8   8   8   8  <-- Mr O'Brien-Smith arrived at 8.30 and spent $1,000.99
  3   1   1   1   1   1   1  <-- $123,000.00
  7  10  10  10  10   7   7  <-- incidentally , and might I say ( without prejudice )
 15   7   8   7   7   7   7  <-- (e.g., i.e., ad-hoc, 12,345.00, 10:45:00, didn't, etc.).
M#2 is the only one that matches my expectations for all sample strings. But as I said, first, it is not an absolute measure, and second, it is purely heuristic: it may be sufficient in many cases, but not in all.
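The only sample where M#1 and M#2 disagree is the first sentence, because of "sentence,is". A minimal sketch of why, assuming M#1 is the earlier pattern (?:\w[-\.,:;']\w|\w)+ and M#2 the one shown above; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class JoinerComparison
{
    // M#1: any of - . , : ; ' may join two word characters.
    public const string M1 = @"(?:\w[-\.,:;']\w|\w)+";
    // M#2: . , : join only digits; - . ' join any word characters.
    public const string M2 = @"(?:\d[\.,:]\d|\w[-\.']\w|\w)+";

    public static int Count(string text, string pattern) =>
        Regex.Matches(text, pattern).Count;

    static void Main()
    {
        string sentence = "The total number of words \t in this sentence,is 10.";
        Console.WriteLine("M#1: {0}", Count(sentence, M1)); // 9: "sentence,is" stays glued together
        Console.WriteLine("M#2: {0}", Count(sentence, M2)); // 10: a comma joins only digits
    }
}
```

In M#2 the comma and colon still hold 1,000.99 or 10:45:00 together, but they no longer glue two ordinary words when the space after the comma is missing.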
|
|
|
|
|
Reason for my vote of 5
Very nice.
|
|
|
|
|
@Pete,
I have updated the regex code part; please take a look and check whether it's correct. I updated it just for readability's sake, because earlier some HTML tags had got mixed up in the code.
|
|
|
|
|
Reason for my vote of 5
I looked at this problem in the past and counted the runs of at least one whitespace character (tab, enter, space) instead; this is a lot cleaner.
|
|
|
|
|
Excellent, Pete. I've done something similar in the past, but yours is more elegant.
|
|
|
|
|
Reason for my vote of 5
Nice tip, and my "learn something new today" moment too. Thanks, Pete.
|
|
|
|
|
I am looking for a regex to count the characters (excl. whitespace etc.).
Any help would be appreciated. Many thanks!!
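One possible sketch, assuming "excl. whitespace etc." means every character that is not whitespace: match single \S characters instead of runs. The class and method names are hypothetical, and a LINQ variant without regex is shown alongside for comparison.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CharCount
{
    // Each match of \S is exactly one non-whitespace character.
    public static int NonWhitespaceRegex(string text) =>
        Regex.Matches(text, @"\S").Count;

    // Alternative without regex; agrees with the regex version for this sample.
    public static int NonWhitespaceLinq(string text) =>
        text.Count(c => !char.IsWhiteSpace(c));

    static void Main()
    {
        string sample = "Hello, world!\t42";
        Console.WriteLine(NonWhitespaceRegex(sample)); // prints 14
        Console.WriteLine(NonWhitespaceLinq(sample));  // prints 14
    }
}
```

If punctuation should be excluded as well, \w in place of \S would count only letters, digits, and underscores.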
|
|
|
|
|
It helps in proper counting of words.
|
|
|
|
|