65.9K
CodeProject is changing. Read more.
Home

Parsing Math Expressions: Approach towards State of Art

starIconstarIconstarIconstarIcon
emptyStarIcon
starIcon

4.82/5 (6 votes)

Dec 7, 2017

CPOL

7 min read

viewsIcon

12683

downloadIcon

234

Parse Expressions in three steps: 1) Tokenize, 2) Convert2Postfix 3) ProcessPostfix

Parsing math Expressions is obviously a difficult Task. On the other Hand it is a very wellknown Task, and is solved, since Editors are in use to write Code.
So it may be a good idea - instead of re-inventing the wheel - to figure out some of the concepts, which may be used by professional Expression-Parsers.

And before we can come to concret code we must discuss several more or less abstract topics, namely:

  • What is an Expression?
  • What is a Token?
  • What is Infix-Notation, and what is PostFix-Notation?
  • What is an Operator?

With this background we can create code of surprisingly power, shortness and performance.
So lets step in

Expressions and Tokens

An Expression is a Text, composed of several atomic units - the Tokens.
In the here introduced point of view a token (of an infix-noted Expression) can only have two meanings: It is either an opening/closing Bracket or an Operator (the term 'Operator' will be discussed later).

Infix-Notation

On Infix-noted Expressions mainly binary Operators are set between their both Operands.
Example: (3 + 2) * (15 / 3 - 8)
That causes a significant problem: In which Order the Operators shall operate?
To solve it, there was introduced the concept "Operator-Rank" (power over multiply, multiply over addition,...). Additionally there is a concept "Bracket" to break the Ranking-Concept and define operating-orders, which differ from the order as it would ensue from the Operator-Ranks.
A real confusing mess of concepts - isn't it?
And things get worse, when unary Operators come into play, like Negation, or Factorial (eg: -3, 5!)

Postfix-Notation

The same Expression-Sample as above, but postfix-noted:
3 2 + 15 3 / 8 - *
Postfix has no such circumstancies like Infix: The Operators Operands are always at its left, the Order of Operating is always from left to right, and that's that!
Operating means: consume Operator and Operands, and replace them with the Operation-Result.
Theese Results (except the last one) are again Operands of further Operators.
See Picture:

You see clearly the sukzessive Reducing of Token, until only one remains: the expression-result.
This Algorithm - "PostfixProcessing" - can be implemented very smart and efficient.
But that's the last step of all - please keep patient.

Operator

In the here introduced meaning an Operator can have one Operand (unary OP), two (binary OP), or more, and even can have no Operand at all!
What is meant with: "Operator with more than 2 Operands"? Functions can do so, eg a Function SumOfThree( 3, 4, -15 ) would be considered as Operator with 3 Operands.
And how can I imagine an "Operator without Operand"? Very simple: Imagine a Number. A Number consumes no Operand, but creates a Result - namely itself.
Another such "zeroary Operator" is a named Parameter. A Parameter is also a no-operand-consuming Operator, and the Result which it creates is taken from another, external mechanism.
You can consider Numbers and Constants as Operators with "intrinsic result", while Parameters use a kind of "result-injection" - as said: none of them consumes an Operand.

Code

Tokenize

First we need to convert the expression to a List of Tokens:

Private Function Tokenize(expression As String) As List(Of Token)
   Tokenize = New List(Of Token)
   Dim rgxToken As New Regex("(?<operator>[+\-*/^()])|(?<number>[\d\.]+)|(?<param>[a-z])")
   Dim tk = New Token("("c) ' since the following Loop accesses the precuser-token, an initial dummy-Token is needed
   For Each mt As Match In rgxToken.Matches(expression)
      If mt.Groups("number").Success Then
         tk = New Token("0"c) With {.Number = Double.Parse(mt.Value, CultureInfo.InvariantCulture)}
      Else
         Dim c = mt.Value(0)
         If c = "-"c AndAlso tk.CausesUnaryNegation Then ' access the precuser-token to decide the meaning of "-"
            ' "-" can mean substraction as well as unary negation. If so replace it by "~" for easyer distinction
            c = "~"c
         End If
         tk = New Token(c)
      End If
      Tokenize.Add(tk)
   Next
End Function

Basically that is simple - Regex help to distinct operators, numbers, params (line #3).
The '-'-Token causes a little problem, because its meaning is context-sensitiv: if it has a Number at its left, it is the binary Substraction-Operator, otherwise it is the unary Negation-Operator, and applies to its Operand at its right.
Lines #10-14 solve this by replacing the Negation-Token with unambigous '~'.

But see the Token-Class itself:

Public Class Token

   Public Number As Double, OpCode As Char, Rank, OperandCount As Integer
   
   Private Shared _Ranks As String() = {"(", ")", "+-", "*/", "^", "~"} ' the element-index is the Token-Rank
   Private Shared _OperandCounts As String() = {"~", "+-*/^"} ' the element-index is the operators OperandCount-1

   Public Sub New(opcode As Char)
      Me.OpCode = opcode
      Rank = _Ranks.FindIndex(Function(s) s.Contains(opcode))
      If Rank < 0 Then Rank = 99 ' OpCodes, not listed in _Ranks, are Numbers or Parameters, and get Rank=99
      OperandCount = _OperandCounts.FindIndex(Function(s) s.Contains(opcode)) + 1 ' OpCodes, not listed in _OperandCounts, automatically get OperandCount=0
   End Sub

   ''' <summary> if true, a following '-'-Token means unary Negation. Otherwise means (binary) Subtraction </summary>
   Public Function CausesUnaryNegation() As Boolean
      Return Rank <> 99 AndAlso OpCode <> ")"c
   End Function
	
End Class

The Main-Thing is line #3, which says, what a Token needs to know: Namely an OperatorCode (OpCode), its Rank and its OperandCount.
Moreover my Token knows a 'Number', but that is only in use for Number-Tokens.

Infix2Postfix

This is the most tricky code of this article - it is known as "Shunting-Yard"-Algorithm, and was invented by Dijkstra (who else?).
Please refer to the link - its explainations are excellent, especially the detailed samples of processing an expression step by step.

Private Function Infix2Postfix(infixTokens As List(Of Token)) As List(Of Token)
   'shunting-yard-algo by Dijkstra: holds a 'tokenStack' as interim Token-storage.
   'The rules of populating and flushing it to output are as follows:
   '1) Every token, except '(', flushes all higher or equal ranked tokens from tokenStack to output
   '2) Then every token, except ')', enter the tokenStack.
   ' Exceptional ')'-Behavior: since '(' is of lowest Rank, and ')' is next, the latter has flushed all other tokens, and a '(' now is stackTop. Remove it too.
   '3) Neither '(' nor ')' move to output
   Dim tokenStack As New Stack(Of Token)
   Dim output = New List(Of Token)
   For Each tk In infixTokens
      If tk.OpCode <> "("c Then ' all Tokens - except '(' - flush higher ranked other Tokens from tokenStack to output
         While tokenStack.Count > 0 AndAlso tokenStack.Peek.Rank >= tk.Rank
            output.Add(tokenStack.Pop())
         End While
      End If
      ' ')' doesn't enter the tokenStack. Instead remove the '(', which now is on Stack-Top
      If tk.OpCode = ")"c Then tokenStack.Pop() Else tokenStack.Push(tk)
   Next
   output.AddRange(tokenStack) ' final rest-flush
   Return output
End Function

Since the code is less than 20 lines, and the comment describes exactly what is done I dispense further explainations. Because when you know, what a Stack is, then you should understand it by closely looking at and thinking about it.

ProcessPostfix

Postfix-Processing can target several goals: eg build an expression-tree, or build a TreeNode-tree to visualize the expression-tree.
But mostly expected is to solve the Expression.
The Algorithm was already anticipated in the "What is Postfix" - Section, but in Repetition see the Picture, how Operations sukzessive reduce Tokens until only the Expression-Result remains:

To implement that again the Stack-Class is very helpful - we use an operandStack As Stack(Of Double), from which Operators can pop their Operands. And the ProcessPostfix-Algo (here named: Solve()) pushes the Operation-Results of the Operators back to operandStack.
See Code:

Module ExpressionSolver
   Private _Operators As Dictionary(Of Char, Func(Of Stack(Of Double), Double))

   Sub New()
      'associate Tokens with Operators: Functions, which pop their (Double-)Operands from operandStack, process them and retrieve a Double as Result
      '"~" is Unary-Negation-Postfix, all other Operators are as in Vb.
      '"^" takes some effort, to respect order of left and right operand
      _Operators = New Dictionary(Of Char, Func(Of Stack(Of Double), Double)) From
                                          {{"~"c, Function(stk) -stk.Pop},
                                          {"^"c, Function(stk)
                                                    Dim right = stk.Pop
                                                    Return stk.Pop ^ right
                                                 End Function},
                                          {"*"c, Function(stk) stk.Pop * stk.Pop},
                                          {"/"c, Function(stk) 1 / stk.Pop * stk.Pop},
                                          {"+"c, Function(stk) stk.Pop + stk.Pop},
                                          {"-"c, Function(stk) -stk.Pop + stk.Pop}}
      ' Number- and Parameter-Operators create Values without processing Stack-Elements as Operands. So their Operators are Nothing
      _Operators.Add("0"c, Nothing) ' Number-Operator
      For i = 0 To 25 ' Parameter-Operator: [a-z] 
         _Operators.Add(Convert.ToChar(97 + i), Nothing)
      Next
   End Sub

   Public Function Solve(expression As String, ParamArray paramValues() As Double) As Double
      Dim operandStack As New Stack(Of Double)
      For Each tk In Infix2Postfix(Tokenize(expression)) ' process postfix-ordered Tokens
         Dim result As Double
         Select Case tk.OpCode
            Case "0"c : result = tk.Number ' numbers: 'intrinsic result'

            Case "a"c To "z"c            ' parameters: result 'injected' by paramValues
               Dim i = Convert.ToInt32(tk.OpCode) - 97 ' compute param-index from OpCode
               result = paramValues(i)

            Case Else             ' other operators: result by consuming operands from stack
               result = _Operators(tk.OpCode).Invoke(operandStack) 
         End Select
         operandStack.Push(result)
      Next
      Return operandStack.Pop
   End Function

Upto line#23 you see the initializing of the available Operators, and how they get stored in a Dictionary.
In this case an Operator is an anonymous Function which pops one or more Operands from a given Stack, and return a result.
Note the Number- and Parameter- Operators (#19 - #22): Since their evaluating-result-mechanism is different, they do not interact with the stack - (especially don't pop Operands from it).

Now look at the Solve()-Function - after such effortful Initialisation it is banal: Just invoke every Tokens Operator with the operandStack (line #37) and push the result back to operandStack (line #39).
Exceptions are made with Numbers and Parameters: From a Number-Token (OpCode '0') the result is its Number (#30), and for a ParameterToken (OpCodes 'a' To 'z') its result is taken from the paramValues-Array (#34).

Another PostfixProcessing: Build Treeview

The Principle is the same: Execute eaches Tokens Operation on the operandStack. Only two differences to the Solve()-Code above: 1) the operands aren't numbers, but are TreeNodes 2) no algo-exceptions are required.

   Private Function BuildTreeNodes(postfixTokens As List(Of Token)) As TreeNode
      Dim operandStack As New Stack(Of TreeNode)
      For Each tk In postfixTokens
         Dim result = New TreeNode(tk.ToString)
         For i = 1 To tk.OperandCount
            result.Nodes.Insert(0, operandStack.Pop) ' respect reversed order of stack-elements
         Next
         operandStack.Push(result)
      Next
      Return operandStack.Pop
   End Function

there are no differentiated Operators, since every Token operates the same way:
Create a (Result-)TreeNode (line #4), pop Operands, and for each Operand insert a further TreeNode into the Result-TreeNodes NodeCollection (lines #5 - #7).
Then push the Result-TreeNode back to Stack (#8).
See picture oft the attached Sample-App:

On Top you see the infix-noted Expression, below its Postfix-Version, below-left a Tree-Representation.
On the right you can set Values for the Expression-Params, and right-bottom you get the Expression-Result.

Further PostfixProcessings: Linq-ExpressionTree

Maybe in an update or a continuing article I will show the "Kings-Way" of Expression-parsing: There will be a PostfixProcessing-Function, which composes a Linq.Expression with classes of the Linq.Expression-Namespace, and compile it to an usual anonymous Function.
To this Function you can pass parameter-values, and its evaluation-Performance is simply not toppable, since it doesn't need to re-parse the Expression at each call, and of course since it is compiled.
Moreover i will support Constants and Functions, namely all stuff of the Math-Namespace: Sin(), Cos(), Abs(), Pi,...
And even userdefined Functions, even with more than 2 Operands, can be made recognizable as Expression-terms.
(In former times I used that technique to create a Tablefunction-Feature for Datagridviews - similar to Excel. But that would be a too big topic for an article)

Conclusion

We saw, that a little Background-Knowledge (especially about Infix-/Postfix-Notation, and of course about the faboulous Shunting-Yard-Algo, created by Mr. Dijkstra) places us in a position, where we have a clear concept of how to solve the task "infix-math-expression-parsing" in an efficient, reliable and powerful way (eg. including params and function-calls).
The general Approach is: 1) Tokenize 2) Convert2Postfix 3) ProcessPostfix

And please note, that the attached sources meet the goal "demonstrate a general Approach for parsing expressions", but nevertheless it has several drawbacks (and lazyness prevents me to fix them):

  • The Datatype Char for Token.OpCode was a wrong choice. When it comes to extend functionality with math. Function-Names (Sin, Cos, Abs,...), DataType Char can't handle that - String obviously would have been better.
  • My "named Parameters" are fix predefined: They are a, b, c, ..., z, and in that order. It would have been better, if one could choose freely the param-Names within an expression, as useful.
  • some other, minor drawbacks...

It is consolatory, that it can't be goal of an article, to introduce the "perfect code". It's enough, when an article just introduces some more or less basic concepts, which enable programmers to develop their own "perfect code" - namely perfect fitting to their needs.