Yet another Pratt parser example and explaination

earthengine

5.00/5 (2 votes)

Jul 12, 2013

Public Domain

4 min read

14354

Includes some extensions of Pratt parser

Introduction

This is another explaintion of Pratt's top down operator precedence parser. This parser orginally developed by Vaughan Pratt in 1973, and becomes famous after Douglas Crockford introduced it in "Beautiful Code" and Jslint. From then on, several (not plenty) blogs contains some explanations of its detail.

Background

Parsing is a major topic in computer science, and there are a lot of tools and algorithms have been invented in this field. Most of them require some sort of formal grammar (basically a set of production rules) to work with. Although top down descendent is usually constructed by hand, the program structure is still basically represents the grammar in production rules.

It is sure that every parser MUST define the grammar of the language to parse, Pratt's parser do, as well. However, as an interesting exception, his parser is not presenting the grammar in production rules. He's parser is based on rules over tokens instead.

Furthermore, Pratt's paper begins with a long discussion reasoning his motivation that lead him to the algorithm, but that discussion was basically ignored by most of his follower. Even Pratt himself did not go any further some of his motivating ideas.

In this article, I want to represent my implementation of Pratt's parser in Java, and to review some of Pratt's ignored ideas and figure out the interesting consequences.

Let's get started

As some other bloggers spotted out, Pratt's terminology (and at least some of his followers) is terrible hard to understand for beginners. However, when use proper wording, his idea and the heart of the algorithm is incredibly simple and easy to understand. So let me start on this part

First of all, the parser's job is to convert the input (usually a sequence of tokens) into some other form (it can be some other language, or some internal data format, usually the Abstract Syntax Tree - AST). You may think that the "token sequence" is an Iterator, but the natural operations on the token sequence is not the familiar Iterator.next or Iterator.hasNext in Java, since in most parser programs there exists a special "End of Input" (EOI) token, so there is not need to check the end of input. You can image that the sequence is infinite and after some point all subsequent tokens are EOI. Also, most parsers requires "lookahead" ability, or equivalently a small cache space to allow the control flow jump to the right branch without actually calling Iterator.next. So we design the class TokenSequence as following:

public interface TokenSequence<T> {
    /**
     * Return the current token, without moving to the next 
     * @return
     */
    T current();
    /**
     * Return the current token and move to the next
     * @return
     */
    T consume();
}

In any time, calling any of above method return the same value. The only difference is consume() move the input and current() do not.

Now, let's consider the interface of the parser. You need to give the parser a TokenSequence, and a grammar (so we can suppose an interface Grammar although we don't know how it looks like yet) as the rule to parse.

public class Parser<T> {
    private TokenSequence<T> tokens;
    private Grammar<T> g;

    public Parser(TokenSequence<T> tokens, Grammar<T> g) {
        this.tokens = tokens;
        this.g = g;
    }
        ...

To run the parser, we need a method "parse" but what it would return?

As we have talked about, the answer is anything. So there should be another generic type parameter. I call it E for expression.

There is one thing interesting. In the original paper and most of the followers, the parser accepts an integer rbp as a parameter. We will soon see what it is but now let me say there is another generic type here. I call it ET. So now the class is

public class Parser<E, ET, T> {
    private TokenSequence<T> tokens;
    private Grammar<E, ET, T> g;

    public Parser(TokenSequence<T> tokens, Grammar<E, ET, T> g) {
        this.tokens = tokens;
        this.g = g;
    }
    
    public E parse(ET et) throws ParseException {        
                //To be implemented...
        throw new UnsupportedOperationException();
    }
}

Now let's see how to implement the parse method. I give the implement as the following before explaining it:

public E parse(ET et) throws ParseException {
    E parsed = g.parsePrefix(this, tokens.consume());

    while (g.isInfixable(et, parsed, tokens.current())) {            
        parsed = g.parseInfix(this, parsed, tokens.consume());
    }

    return parsed;
}

When parse method being called, it retrieves one token by calling tokens.consume(). and then it delegate the job to the Grammar object, hopes it gives some result.

Briefly, the Grammar object need to check the given token, if it is a prefix operator line "!" in C, it then call the handler for it. The handler may call parse recursively, to finally construct a result. Since we are at the very beginning of the parsing process, the token can only be a prefix of a bigger group, or a standalone token that represents something to be returned.

Once you have something parsed, the next thing to do is to make a decision: to continue or to return. If you are familiar with LR algorithms, that is similar to the "shift-reduce" conflicts. If parsing continues, any further tokens read by the parser will be either infix or suffix. The Grammar shall then provide two methods for it: one to decide weather to return to the caller (reads "the current token is not infixable", means the current token cannot used with the parsed result to form another result, thus the loop must end and parse method must return).

(to be continued...)