You need a tokenizer, I think.
At the risk of plugging my own offerings, try this:
Rolex (Reboot): Unicode Enabled Lexer Generator in C#
// rolex lex spec
complexToken = 'complex token[0-9]+'
token = '[A-Za-z][A-Za-z0-9]*'
int = '\-?[0-9]+' // just an example; + rather than * so it cannot match an empty string
ws<hidden> = '[ \t]+' // hide whitespace
Then you can tokenize by doing this:
foreach(var tok in new MyTokenizer("my test string"))
    Console.WriteLine("{0}: {1} at {2}",tok.SymbolId,tok.Value,tok.Position);
It generates a single file with no dependencies. Input is IEnumerable<char>, and the "output" is IEnumerable<Token>.
If you need that over a file, use this:
IO: A Small Streaming I/O and UTF-32 Library
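If you just want to try it on a file before pulling in that library, here is a rough sketch of the idea using only the BCL. It wraps a StreamReader so the file is exposed as a lazy IEnumerable<char>. The names CharSource and ReadChars are mine, made up for illustration, and are not part of Rolex or the IO library. Note that this yields UTF-16 chars, so it does nothing special for surrogate pairs.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of Rolex or the IO library):
// streams a text file one char at a time without loading it all into memory.
static class CharSource
{
    public static IEnumerable<char> ReadChars(string path)
    {
        // The reader is disposed when enumeration finishes or is abandoned.
        using (var reader = new StreamReader(path))
        {
            int ch;
            // StreamReader.Read() returns -1 at end of stream.
            while ((ch = reader.Read()) != -1)
                yield return (char)ch;
        }
    }
}
```

Since the generated tokenizer takes an IEnumerable<char>, you should be able to pass CharSource.ReadChars("input.txt") where the string went in the foreach above, assuming the constructor accepts any IEnumerable<char> and not just string.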