Easily Create Your Own Parser

9 Jul 2011
Create a handmade parser in VB.NET or C# quickly and easily

Introduction

Building parsers and compilers is one of the more difficult tasks in computer science. Many tools are available to help with this tedious work, most notably Flex and Yacc, both available on Linux/UNIX platforms. The program I present in this article is called TokenIcer. It is similar to Flex, but TokenIcer provides an easy-to-use GUI that serves as both an editor and a test bed for your rules. In addition, once your parsing rules have been defined, TokenIcer can generate a parser class based on those rules in either C# or VB.NET.

Background

To use TokenIcer well, you should have a good understanding of how regular expressions work. Each rule you enter into TokenIcer is based on a regular expression. Any regular expression that the .NET Regex library can parse is also valid in TokenIcer.

A parser, TokenIcer's included, works on an input string that you feed into it. For example, suppose we feed the following line into a parser:

3+2 * (6 + 1)  

We should expect our parser to provide us output like this:

{Integer}{Plus}{Integer}{Whitespace}{Asterisk}{Whitespace}{LeftParen}
{Integer}{Whitespace}{Plus}{Whitespace}{Integer}{RightParen}{Newline} 

TokenIcer does exactly this: it takes input like "3+2 * (6 + 1)" and converts it into a series of enumerated values. What you do with this output depends on what you are trying to accomplish, whether that is building a language compiler or a math parser.
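To make the idea concrete, here is a minimal sketch of this kind of tokenizing. It is written in Python purely for illustration and is not the code TokenIcer generates; the `RULES` table, the `tokenize` function, and the choice to emit one `Undefined` token per unmatched character are all my own assumptions.

```python
import re

# Hypothetical rule table: (pattern, token name) pairs, tried in order.
RULES = [
    (r"[0-9]+", "Integer"),
    (r"\+", "Plus"),
    (r"\*", "Asterisk"),
    (r"\(", "LeftParen"),
    (r"\)", "RightParen"),
    (r"[ \t]+", "Whitespace"),
    (r"[\r\n]+", "Newline"),
]

def tokenize(text, rules=RULES):
    """Return a list of (name, value) pairs for the whole input."""
    compiled = [(re.compile(pattern), name) for pattern, name in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        for regex, name in compiled:
            m = regex.match(text, pos)  # match only at the current position
            if m and m.end() > pos:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # Assumption: a character no rule matches becomes its own token.
            tokens.append(("Undefined", text[pos]))
            pos += 1
    return tokens

tokens = tokenize("3+2 * (6 + 1)\n")
print("".join("{" + name + "}" for name, _ in tokens))
```

Each token keeps the matched text next to its name, so a later stage can still tell a 3 from a 6 even though both are an Integer.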

Using the Code

When you run TokenIcer, you are presented with a screen containing three text boxes and a TreeView. The first text box is where we enter our rules. A rule is a regular expression wrapped in double quotes, followed by one or more spaces, followed by the identifier for that rule. As in the example above, an identifier can be something like Integer or Whitespace, or anything else you want to use, as long as it is a valid VB.NET or C# identifier. Go ahead and enter the following rule in the text box:

"[0-9]+" INTEGER 

This rule correctly identifies an integer. The middle text box is our test bed: anything you enter here is parsed according to the rules in the text box above it. Enter any integer you wish in the test box, then click the "Test Grammar" button at the bottom left of the window. The third text box, the output box, shows that our number was correctly identified as an {INTEGER}. The output tree also shows INTEGER, and if you expand the node, you will see the actual value that was parsed. Now enter the following two rules in the first text box, immediately after the first rule:

"[0-9]+\.[0-9]+" FLOAT
"[ \t]+" WHITESPACE

In your test box, try testing the following:

3.2 15 4.932

Click "Test Grammar". You may be surprised to see a bunch of "UNDEFINED" tags and no FLOAT tags. The reason (gotcha #1) is that the parser uses the first rule that matches, in the order the rules are listed, so you must place "higher priority" rules first. The parser came to the "3" and said "ok, 3 is an integer". It never looked past the 3 to see whether a decimal point followed, so the "." matched no rule at all. To fix this, put the FLOAT line above the INTEGER line. This way, the parser tries the FLOAT rule first and, if there is no decimal point, moves on to the INTEGER rule. Rearrange your rules in the rule text box so they look like this:

"[0-9]+\.[0-9]+" FLOAT
"[0-9]+" INTEGER
"[ \t]+" WHITESPACE

Now when you click "Test Grammar", the output box and the output tree show the two FLOAT values and the INTEGER, separated by WHITESPACE tokens, as expected.
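The ordering gotcha is easy to reproduce outside TokenIcer. The following Python sketch is an illustration only, not TokenIcer's engine: `first_match_tokens` is a hypothetical helper that tries rules top to bottom and takes the first one that matches, and reporting one UNDEFINED per unmatched character is my assumption. The FLOAT pattern used here requires exactly one decimal point.

```python
import re

def first_match_tokens(text, rules):
    """Tokenize by trying rules in order and taking the first that matches."""
    tokens, pos = [], 0
    while pos < len(text):
        for pattern, name in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                tokens.append(name)
                pos = m.end()
                break
        else:
            tokens.append("UNDEFINED")  # no rule matched this character
            pos += 1
    return tokens

FLOAT = (r"[0-9]+\.[0-9]+", "FLOAT")
INTEGER = (r"[0-9]+", "INTEGER")
WHITESPACE = (r"[ \t]+", "WHITESPACE")

# INTEGER listed first: "3" is consumed before the "." is ever examined.
integer_first = first_match_tokens("3.2 15 4.932", [INTEGER, FLOAT, WHITESPACE])

# FLOAT listed first: "3.2" is consumed whole, and "15" falls through to INTEGER.
float_first = first_match_tokens("3.2 15 4.932", [FLOAT, INTEGER, WHITESPACE])

print(integer_first)
print(float_first)
```

With INTEGER first, the FLOAT rule never gets a chance to match, because by the time the tokenizer reaches the decimal point the digits in front of it are already gone.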

To wrap up this section, I will show you how TokenIcer can create C# or VB.NET classes that you can include in your own projects. Now that our three rules are in place and tested against our input, click the "Generate Class..." button. A window pops up with a drop-down list asking which language you prefer, C# or VB.NET. There is also a checkbox for including comments; if you check it, some simple comments are generated to help you understand the code. Once you've selected your language, click "Generate my Class" and another window pops up with the generated C# or VB.NET code. Press <Ctrl>+A to select it all and <Ctrl>+C to copy it, then paste it into your own project. It's that simple!

The TokenParser class exposes the following property:

  • InputString -- After creating an instance of TokenParser, you must set this string property to the string you want parsed.

The TokenParser class exposes the following methods:

  • GetToken() -- Call this method to retrieve the next token from the input string. It returns a Token object, which contains the TokenName (a value from the enum of tokens you defined in your rules earlier, like INTEGER, WHITESPACE, FLOAT) and the TokenValue, the text the parser matched. For example, an INTEGER token might have a TokenValue of "53".
  • Peek() -- This method returns the token that the next call to GetToken() will return, letting you look ahead in the token buffer without pulling anything off the queue. The return value is a PeekToken, a special object that contains a Token object and an index. If you pass a PeekToken back into Peek() as an argument, Peek() returns the token that follows the one from that earlier Peek() call. In this way, you can peek ahead several tokens deep.
  • ResetParser() -- This method resets the parser. After calling it, you must set the InputString property again, and then you can call GetToken() or Peek() as you normally would to parse a new string.
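To show how these members fit together, here is a rough Python analogue of the class. It is a sketch under assumptions, not the C# or VB.NET code TokenIcer emits: the snake_case names, the `_scan` helper, returning `None` at the end of input, and storing "the position after the peeked token" as the PeekToken index are all my own choices for illustration.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", ["name", "value"])           # mirrors TokenName / TokenValue
PeekToken = namedtuple("PeekToken", ["index", "token"])  # a token plus a position

class TokenParser:
    """Hypothetical Python analogue of the generated class."""

    RULES = [
        (re.compile(r"[0-9]+\.[0-9]+"), "FLOAT"),
        (re.compile(r"[0-9]+"), "INTEGER"),
        (re.compile(r"[ \t]+"), "WHITESPACE"),
    ]

    def __init__(self):
        self.reset_parser()

    def reset_parser(self):
        """Forget the current input and position (like ResetParser())."""
        self.input_string = ""
        self._index = 0

    def _scan(self, index):
        """Return (token, next_index) for the token starting at index."""
        if index >= len(self.input_string):
            return None, index  # assumption: None signals end of input
        for regex, name in self.RULES:
            m = regex.match(self.input_string, index)
            if m and m.end() > index:
                return Token(name, m.group()), m.end()
        return Token("UNDEFINED", self.input_string[index]), index + 1

    def get_token(self):
        """Consume and return the next token (like GetToken())."""
        token, self._index = self._scan(self._index)
        return token

    def peek(self, peek_token=None):
        """Look ahead without consuming; pass a PeekToken to look further."""
        start = self._index if peek_token is None else peek_token.index
        token, next_index = self._scan(start)
        return None if token is None else PeekToken(next_index, token)

parser = TokenParser()
parser.input_string = "3.2 15"
first_peek = parser.peek()             # the FLOAT, nothing consumed
second_peek = parser.peek(first_peek)  # the WHITESPACE after it, still nothing consumed
first_token = parser.get_token()       # the same FLOAT, now consumed
```

Chaining the peeks through the returned object means the parser itself keeps only one position, the one that get_token() advances.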

Comments

TokenIcer supports two types of comments: inline comments and full-line comments. Both are written by prefacing the comment with a hash symbol (#). Inline comments are a way to document your enumerations. Look at this example:

"[a-zA-Z_][a-zA-Z0-9_]*" IDENTIFIER #This is an Identifier 

When you create your C# or VB.NET class, you have the option to comment rules. If this option is turned on, then whatever comment you have inline with your rule will be added as a comment to the generated enumeration.

The second type of comment is a full line comment, like so:

# This line is completely ignored.  

Any rule line beginning with a hash symbol is completely ignored. You may use such lines in any way you see fit.
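If you want to read a rule file yourself, the line format described so far (quoted regular expression, spaces, identifier, optional # comment) can be picked apart with a single regular expression. This Python sketch is only my guess at a reasonable reader, not how TokenIcer parses its rules: `RULE_LINE` and `parse_rule_line` are hypothetical names, and the pattern text is returned exactly as written, escapes included.

```python
import re

# Assumed line shape: "regex" IDENTIFIER [#comment]
RULE_LINE = re.compile(
    r'^"(?P<pattern>.*)"\s+(?P<name>[A-Za-z_][A-Za-z0-9_]*)\s*(?:#(?P<comment>.*))?$'
)

def parse_rule_line(line):
    """Return (pattern, name, comment) or None for blank and full-line comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # full-line comment or empty line: ignored
    m = RULE_LINE.match(line)
    if m is None:
        raise ValueError(f"not a valid rule line: {line!r}")
    comment = m.group("comment")
    return m.group("pattern"), m.group("name"), comment.strip() if comment else None

print(parse_rule_line('"[a-zA-Z_][a-zA-Z0-9_]*" IDENTIFIER #This is an Identifier'))
print(parse_rule_line("# This line is completely ignored."))
```

Because a full-line comment is detected by its first character, a rule whose pattern happens to contain a hash is still read as a rule: its line starts with a quote.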

Example Parser Rules

Here is an example rule set for a very simple BASIC parser:

"[Ll][Ee][Tt]" LET
"[Pp][Rr][Ii][Nn][Tt]" PRINT
"[Cc][Ll][Ss]" CLS
"[Rr][Ee][Mm][^\r\n]*" REM
"[Ee][Nn][Dd]" END
"[Gg][Oo][Tt][Oo]" GOTO
"[Ff][Oo][Rr]" FOR
"[Ss][Tt][Ee][Pp]" STEP
"[Nn][Ee][Xx][Tt]" NEXT
"[Tt][Oo]" TO
"\:" COLON
"=" EQUALS
"\".*?\"" STRING
"[a-zA-Z_][a-zA-Z0-9_]*" IDENTIFIER
"[ \t]+" WHITESPACE
"[\r\n]+" NEWLINE
"[0-9]*\.[0-9]+" FLOAT
"[0-9]+" INTEGER
"'.*" APOSTROPHE
"\(" LPAREN
"\)" RPAREN
"\*" ASTERISK
"\/" SLASH
"\+" PLUS
"\-" MINUS

These rules can properly parse the following program:

10 CLS
20 PRINT "Hello" : PRINT " World!"
30 X = 5
40 Y = X * 2 + 3.5
50 PRINT Y
60 FOR Z = 50 TO 1 STEP -1
70 PRINT "Z = " + Z
80 NEXT Z
90 END
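As a rough check of how rules in this style split up individual lines of that program, here is a small Python sketch. It is illustrative only: it uses just a handful of rules, writes the FLOAT pattern as `[0-9]*\.[0-9]+`, and drops WHITESPACE tokens the way a consumer of the token stream typically would. `significant_tokens` is a hypothetical helper, not part of TokenIcer.

```python
import re

# A few BASIC-style rules, highest priority first
# (keywords before IDENTIFIER, FLOAT before INTEGER).
RULES = [
    (r"[Pp][Rr][Ii][Nn][Tt]", "PRINT"),
    (r"=", "EQUALS"),
    (r"\".*?\"", "STRING"),
    (r"[a-zA-Z_][a-zA-Z0-9_]*", "IDENTIFIER"),
    (r"[ \t]+", "WHITESPACE"),
    (r"[0-9]*\.[0-9]+", "FLOAT"),
    (r"[0-9]+", "INTEGER"),
    (r"\*", "ASTERISK"),
    (r"\+", "PLUS"),
]
COMPILED = [(re.compile(pattern), name) for pattern, name in RULES]

def significant_tokens(line):
    """Tokenize one line and return (name, value) pairs without whitespace."""
    tokens, pos = [], 0
    while pos < len(line):
        for regex, name in COMPILED:
            m = regex.match(line, pos)
            if m and m.end() > pos:
                break
        else:
            m, name = None, "UNDEFINED"  # assumption: one token per stray character
        value = m.group() if m else line[pos]
        pos += len(value)
        if name != "WHITESPACE":
            tokens.append((name, value))
    return tokens

for source_line in ['40 Y = X * 2 + 3.5', '70 PRINT "Z = " + Z']:
    print(significant_tokens(source_line))
```

Note that the BASIC line number at the start of each line simply comes out as an INTEGER token; deciding that it is a line number is a job for whatever consumes the tokens.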

Feel free to expand the parser rules and try to make your own BASIC compiler!

History

  • 7/3/2011 - Version 1.0 released
  • 7/5/2011 - Version 1.1 released. TokenIcer can now save grammar and test input files!
  • 7/7/2011 - Version 1.2 released. TokenIcer now works properly when using reserved VB.NET or C# keywords as identifiers. Also, did some code refactoring to clean some stuff up.
  • 7/8/2011 - Version 1.3 released. The GetToken() engine is now about 95% faster (thank you T_uRRiCA_N for the help). Also, the rule editor text box now has line numbers for easier reading.
  • 7/9/2011 - Version 1.4 released. Before I get into the new features, I just want to apologize for having so many updates in such a short amount of time. I promise to slow it down a bit and you will not see new updates, at least for a week or so (unless there is some major bug in this release). Okay new features for this release include syntax highlighting, as well as a new option menu to disable or change the colors, as well as the ability to disable line numbering. Also, this version now supports comments, both inline and full line comments. Also added is the option to create XML-style comments so you can create documentation using SandCastle or some other program.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
Web Developer http://www.icemanind.com
United States
