StringTokenizer Library

23 Apr 2006 | CPOL | 4 min read
Yet another C# implementation of Java's StringTokenizer (in a ready to use library).

Introduction

I've been using CodeProject for some time now, and I thought I might write something instead of just reading. Since I needed a StringTokenizer class while porting a big Java project, I decided to write one myself, and here it is: my first article. Do be gentle :)

The StringToke... what?

This class was designed not just for people who know Java, so a quick introduction is required. If you know what this is, feel free to skip to the next paragraph.

Let's say you have a string, for example, "one, two, three, four", and you want to easily extract and use all those numbers separately. What this class offers is a simple interface to do that. You can specify the characters which will be used to 'cut' the string yourself (called delimiters), or use the default set, which is " \t\n\r\f":

  • the space character
  • the tab character
  • the newline character
  • the carriage-return character
  • the form-feed character

After the tokenization, you use the NextToken property to obtain the next token. Each read of this property increments a private index of the current token position, so the following read returns the following token. Always check the HasMoreTokens property first to make sure a next token exists before you try to extract it. You can also ask for the delimiters themselves to be returned as tokens, and for empty tokens to be returned (you can specify your own string for an empty token, such as "MISSING" or simply null). Please see the example for details.
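To make the idea of delimiters concrete, here is a minimal sketch that uses only the framework's String.Split with the same five default characters. It is not the library's code; DefaultDelimiterDemo and Tokenize are names made up for this illustration, and dropping empty entries is assumed to match the default behaviour described above.

```csharp
using System;

// Hypothetical helper for illustration only (not part of the StringTokenizer library).
public static class DefaultDelimiterDemo
{
    // The same five characters as the default set: space, tab, newline,
    // carriage return, form feed.
    public static readonly char[] Delimiters = { ' ', '\t', '\n', '\r', '\f' };

    // Splits the input on any of the delimiters and drops empty entries,
    // so runs of consecutive delimiters do not produce empty tokens.
    public static string[] Tokenize(string input)
    {
        return input.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Calling DefaultDelimiterDemo.Tokenize("one two\tthree") would give three tokens, because both the space and the tab act as cut points.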

Documentation

I've included HTML documentation in the ./doc/ folder of the Zip file. It was generated with NDoc, hence the case-sensitivity problems: NDoc redirects all properties' links to the equivalent methods, so, for example, clicking on 'NextToken' takes you to the 'nextToken()' page. I don't know how to fix this, so I'm just waiting for a new release. If someone knows how, please tell me and I'll regenerate the documentation. The code itself is also XML-commented, so Visual Studio will provide on-the-fly support while you use the class.

Nevertheless, this is a very self-explanatory class, and if you don't need the details, just start using it :)

About the implementation

This is just another C# version of the java.util.StringTokenizer class. Basically, it's a wrapper class around the String.Split method. It implements all the methods of its Java equivalent apart from those only needed by the Enumeration interface. Every implemented Java-compatible method has a C# equivalent in the form of a property. The example will clarify this later. In short, this implementation includes:

  • Java's methods 'as is' (preserving the exact names for compatibility), which are just aliases for
  • C# properties named exactly like the Java-compatible methods, but with the first letter capitalized (Pascal case)

Do please remember this subtle difference: Each Java methodName() method has (and uses) its equivalent property MethodName.
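The method/property pairing described above can be sketched as follows. AliasSketch is a hypothetical class written for this illustration, not the library's source; in particular, throwing InvalidOperationException when the tokens run out is an assumption.

```csharp
using System;

// Hypothetical class for illustration only: each Java-style method simply
// forwards to the Pascal-case C# property of the same name.
public class AliasSketch
{
    private readonly string[] tokens;
    private int position;

    public AliasSketch(string[] tokens)
    {
        this.tokens = tokens;
    }

    // C# style: true while unread tokens remain.
    public bool HasMoreTokens
    {
        get { return position < tokens.Length; }
    }

    // C# style: returns the current token and advances the position.
    public string NextToken
    {
        get
        {
            if (!HasMoreTokens)
                throw new InvalidOperationException("No more tokens.");
            return tokens[position++];
        }
    }

    // Java style: aliases that use the properties above.
    public bool hasMoreTokens() { return HasMoreTokens; }
    public string nextToken() { return NextToken; }
}
```

Because both spellings share the same position field, mixing st.nextToken() and st.NextToken in one loop walks through the tokens exactly once.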

The IEnumerable interface has been implemented so you can iterate through an instance of the class using the foreach loop. Doing this increments the internal current position index (just like using NextToken or nextToken()), so remember to invoke the Reset() method to re-read the tokens. Or, use the indexer (it doesn't increment the index and can be used at any time).
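The interplay between foreach, Reset() and the indexer can be sketched like this. EnumerableSketch and its Remaining property are hypothetical names used only for this illustration; the real class may be implemented differently.

```csharp
using System;
using System.Collections;

// Hypothetical class for illustration only: foreach consumes tokens from a
// shared position, Reset() rewinds it, and the indexer leaves it untouched.
public class EnumerableSketch : IEnumerable
{
    private readonly string[] tokens;
    private int position;

    public EnumerableSketch(string[] tokens)
    {
        this.tokens = tokens;
    }

    // Number of tokens that have not been consumed yet.
    public int Remaining
    {
        get { return tokens.Length - position; }
    }

    // Random access; does not move the current position.
    public string this[int index]
    {
        get { return tokens[index]; }
    }

    // Rewinds the current position so the tokens can be read again.
    public void Reset()
    {
        position = 0;
    }

    // Each step of a foreach loop advances the shared position.
    public IEnumerator GetEnumerator()
    {
        while (position < tokens.Length)
            yield return tokens[position++];
    }
}
```

With this design, a second foreach over the same instance yields nothing until Reset() is called, which is the reason for the reminder above.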

This StringTokenizer class is a member of the StringTools namespace.

Public methods, properties, and a constant

All the methods do (I hope* :) what their Java relatives do, so I'll just describe the new things:

  • StringTokenizer(string, params char[]) constructor - gives a way to specify the delimiters one by one without the need to stringify them.
  • StringTokenizer(string, string, bool, bool returnEmpty) constructor - gives a way to ask the class to return empty tokens as well, represented by the default String.Empty string.
  • StringTokenizer(string, string, bool, bool, string empty) constructor - gives a way to specify the string to be returned instead of the default String.Empty string.
  • void Reset() method - resets the current position so that the tokens can be extracted again.
  • string DefaultDelimiters constant - holds the default set of delimiters.
  • int Count property - returns the total number of tokens extracted from the tokenized string.
  • string this[int] indexer - returns the token at the specified index.
  • string EmptyString property - returns the string used for empty tokens.

*I've tested it in a variety of ways, and it seems to work like it should, though if you find a bug, please let me know.
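The empty-token options in the constructors above can be approximated with plain String.Split. The following is a sketch under that assumption, not the library's code; EmptyTokenSketch and its parameter names are invented for the illustration.

```csharp
using System;

// Hypothetical helper for illustration only: keeps empty tokens and replaces
// each of them with a caller-supplied placeholder string.
public static class EmptyTokenSketch
{
    public static string[] Tokenize(string source, string delimiters, string empty)
    {
        // StringSplitOptions.None keeps the empty entries that appear between
        // consecutive delimiters and after a trailing delimiter.
        string[] parts = source.Split(delimiters.ToCharArray(), StringSplitOptions.None);

        for (int i = 0; i < parts.Length; i++)
        {
            if (parts[i].Length == 0)
                parts[i] = empty;
        }
        return parts;
    }
}
```

Passing String.Empty as the last argument corresponds to the four-argument constructor, while a custom string such as "MISSING DATA" corresponds to the five-argument one.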

The long awaited example

Example usage of the class:

Don't forget to add the using StringTools; directive at the top of your file.

C#
string str = "One, two, three";
Console.WriteLine("The string to be tokenized: [{0}]", str);

StringTokenizer st = new StringTokenizer(str, ",");
Console.WriteLine("\nThe Java way + comma tokenization:");
while (st.hasMoreTokens()) // == st.HasMoreTokens
    Console.WriteLine("[{0}]", st.nextToken()); // == st.NextToken

Console.WriteLine("\nThe C# way + comma tokenization");
st.Reset(); // Not available in Java - after this we can read the tokens again
foreach(string token in st)
    Console.WriteLine("[{0}]", token);

Console.WriteLine("\nThe other C# way + tokenize using \", \" + return tokens");
Console.WriteLine("Uses the indexer to get tokens - doesn't " + 
                  "increment the 'current position'");
st = new StringTokenizer(str, " ,", true);
for (int i = 0; i<st.Count; i++)
    Console.WriteLine("Tokens left:{2}, token number {0} is [{1}]", 
                      i.ToString(), st[i], st.CountTokens.ToString());

string database = "John|Smith|46|5550000|||john@internet.com|";
Console.WriteLine("\nSample database tokenization for database line:\n[{0}]\n", database);
st = new StringTokenizer(database, "|", false, true, "MISSING DATA");
foreach (string token in st)
    Console.WriteLine("[{0}]", token);

This outputs:

The string to be tokenized: [One, two, three]

The Java way + comma tokenization:
[One]
[ two]
[ three]

The C# way + comma tokenization
[One]
[ two]
[ three]

The other C# way + tokenize using ", " + return tokens
Uses the indexer to get tokens - doesn't increment the 'current position'
Tokens left:7, token number 0 is [One]
Tokens left:7, token number 1 is [,]
Tokens left:7, token number 2 is [ ]
Tokens left:7, token number 3 is [two]
Tokens left:7, token number 4 is [,]
Tokens left:7, token number 5 is [ ]
Tokens left:7, token number 6 is [three]

Sample database tokenization for database line:
[John|Smith|46|5550000|||john@internet.com|]

[John]
[Smith]
[46]
[5550000]
[MISSING DATA]
[MISSING DATA]
[john@internet.com]
[MISSING DATA]
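The third part of the output above, where the delimiters come back as tokens, cannot be produced by String.Split alone, because Split discards the separators. One possible way to get that behaviour is a simple character scan. This is a sketch only; ReturnDelimitersSketch is a hypothetical name, and the library may do it differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only: returns every delimiter as its
// own one-character token, in addition to the text between delimiters.
public static class ReturnDelimitersSketch
{
    public static List<string> Tokenize(string source, string delimiters)
    {
        List<string> tokens = new List<string>();
        StringBuilder current = new StringBuilder();

        foreach (char c in source)
        {
            if (delimiters.IndexOf(c) >= 0)
            {
                // Flush the text collected so far, then emit the delimiter.
                if (current.Length > 0)
                {
                    tokens.Add(current.ToString());
                    current.Length = 0;
                }
                tokens.Add(c.ToString());
            }
            else
            {
                current.Append(c);
            }
        }

        // Flush whatever follows the last delimiter.
        if (current.Length > 0)
            tokens.Add(current.ToString());

        return tokens;
    }
}
```

For the input "One, two, three" with the delimiters " ,", this yields the same seven tokens listed in the output above.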

Requirements

.NET Framework 2.0.

Usage

Just add a reference to the StringTokenizer.dll library in your project and you're done! Alternatively, you can add the 'raw' StringTokenizer.cs source file to your project.

And don't forget the using StringTools; directive.

Credits

I'd like to thank:

  • M.Lansdaal for proving that empty tokens are useful and that using Microsoft namespaces isn't a good practice
  • paillave for pointing out that I forgot to implement the IEnumerable interface

History

  • 24.04.2006 - Publication
  • 24.04.2006 - Minor changes
  • 24.04.2006 - Second release: added returning empty tokens, implemented IEnumerable, changed the namespace, included projects in the .Zip file (not just source code), and documentation in HTML format instead of .CHM.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
QmQ
Poland
