Multi-Machine Parsing #1: Introduction

Peter Belcak

5.00/5 (3 votes)

Jan 10, 2021

CPOL

8 min read

4211

Presenting the case for a generalized approach to parser construction from multiple parts

Motivation

When constructing a parser for a new (or, indeed, an existing) language, one is faced with two standard choices:

to employ a parser generator (Wikipedia humbly lists some 80 of them), which involves feeding your language's grammar to the generator and harvesting an oven-ready parser on the other side; and
to write the entire parser yourself from scratch.

Regardless of the choice made, one very likely ends up using one standard parsing algorithm, such as LALR/GLR with Bison, or recursive descent LL(k)/LL-regular parsing for a large part when writing the parser from scratch.

This causes issues, and it is this one-algorithm-fits-all-of-my-grammar approach that is one of the main reasons why efficient parsing remains to be a hard problem even some 50 years after its golden age, in spite of the plethora of parsing algorithms and generators that have been invented.

We can broadly classify the most common of the issues due to using a single parsing generator into four groups:

blindness to desirable essential ambiguity
creation of non-essential ambiguity
prohibitive parser size
sub-optimal parser performance

In the rest of this article, we will go through each of these issues, give examples where appropriate, and hint on how a multi-machine parser would solve the problem introduced by limiting ourselves to a single parsing algorithm, painting the solution in broad strokes. The following articles in this series will then give the solutions to these issues in detail.

The Issues

Issue #1: Blindness to Desirable Essential Ambiguity

A grammar for a programming language may suffer from a constructional homonymity: we construct what we believe is the reasonably simplest grammar capturing the structure of our language, and then discover there exists an input that can be understood in several ways (i.e., multiple equivalent parser trees can be constructed for the same string of terminals).

For some (rare) languages, this is a feature, for this essential ambiguity (i.e., one that is an attribute of the grammar rather than the parsing method used) may be mirroring the underlying reality of there being a semantic ambiguity. Borrowing an example from Chomsky's seminal 1956 paper, the grammar...

Sentence -> NounPhrase VerbPhrase

VerbPhrase -> Verb NounPhrase
Verb -> are flying
Verb -> are

NounPhrase -> they
NounPhrase -> flying planes
NounPhrase -> planes

... may comprehend

Quote:

they are flying planes

... in two ways,

"they" - those dots on the horizon - "are" - exist as - "flying planes" - the vehicles for flying in the process of flying, and
"they" - the pilots - "are flying" - are controlling the flight - "planes" - of the vehicles for flying

The example semantic meaning was given only for illustration, the important difference here is in the syntactic comprehension of the input string.

In such a scenario, it might be desirable to recognize both alternatives, and then let the semantic analysis phase that follows the parsing decide what was meant, based on the context.

While there are algorithms that can find all possible comprehensions of an input by a grammar rather efficiently (Earley, 1970; Hext and Roberts 1970), their parsing internals grow huge in size for normal grammars, they are on large too slow to be deployed to parsing of modern versatile programming languages, and generators that would produce them are hard to find. At the moment, they are more of an ancient vase in the museum of parsing techniques rather than a trusted tool a compiler designer would use to handle a more peculiar task at hand.

Thus we arrive at the issue: any common, well-understood, standard algorithm is likely to be blind to the ambiguity that may itself be desired, or even be a feature of one's language; but it is the fact that we grew so accustomed to using a single parsing algorithm that forces our hand to strip all non-determinism from our language at the syntactic analysis level. A real-world example from C++: the use of typename keyword.

A Multi-Machine Solution

Isolate the part of grammar that contains the desired essential ambiguity. If designing a language yourself, you may wish to use such instances of ambiguity on purpose, but keep it small.

Once you have the critical part of the grammar isolated, handle it with one of the algorithms that deal with ambiguity well (see the references linked above), using one of the efficient deterministic approaches for the rest of the grammar.

Issue #2: Creation of Non-Essential Ambiguity

This is, by far, the most common issue. Suppose you make the choice to use the parsing algorithm A. Since there are no truly general efficient parsing algorithms, there will exist an intricacy, which, if introduced, makes it impossible to construct the parser of type A.

For the canonical LL(k) parsers, such intricacy is having two productions that differ only by their tails, separated from their beginnings by nonterminals of arbitrary length. For canonical LR(k) parsers, such intricacy is any adjustment to the grammar that introduces a shift/reduce or a reduce/reduce conflict -- and for the less general variants of LR(k) parsing (such as SLR or LALR), breaching the constraints of the parsing method chosen is even easier.

The message here is that committing to one parsing method forces you to fit your grammar into the constraints of that method.

We shall demonstrate this on a specific example. Consider the following code snippet, written in a language designed for the configuration of warehouse sorting robots:

node Sorter : OutputTrackSolids OutputTrackLiquids {
    use input InputTrackA;
    // the description of the sorting process
}

Captured with a Bison-like context-free grammar (CFG), the syntactical skeleton for the language may be described as:

NodeDefinition = KW_NODE IDENTIFIER OP_COLON NodeOutputList OP_LCURLY NodeBody OP_RCURLY;

NodeOutputList = IDENTIFIER NodeOutputList | /* empty */;
...

The rest of the grammar may then (well) be designed to be LL(k), and a parser can be constructed with textbook techniques.

Suppose that a new version of the language is being prepared in which any sorter may have an arbitrary number of input tracks -- that something like the following is desirable:

node Sorter : OutputTrackSolids OutputTrackLiquids with inputs TrackA TrackB {
    // the description of the multi-track sorting process
    // the `use input` directive is prohibited; inputs have already been specified
}

Again, this structure can be captured by an otherwise LL(k) grammar containing the following:

NewNodeDefinition = KW_NODE IDENTIFIER OP_COLON NodeOutputList KW_WITH_INPUTS NodeInputList
...

Suppose now further that any parser for this new language is to be backward-compatible, i.e., the first piece of code is to be properly recognized as well. No canonical LL(k) parser, not even an LL(finite) parser can successfully disambiguate between the nonterminals NodeDefinition and NewNodeDefinition, as no fixed amount of lookahead can see through the arbitrarily long non-terminal NodeOutputList. There are several things that can be done to resolve this issue, but each of them either involves a "hack" of some sort (for example, using a predicate) or some other adjustment to your grammar.

Again, the bottom line here is that:

Quote:

... by committing to using a single parsing method you were robbed of your freedom to use the grammar that you wish to use and consequently design the language that you want to design.

A Multi-Machine Solution

A clever combination of multiple types of parsers can eradicate this issue. For the example given above, an LL(k) parser using a dedicated finite automaton when skimming over the NodeOutputList to disambiguate between NodeDefinition and NewNodeDefinition would work, even with linear-time complexity. This example will be elaborated on much more closely later in the series.

Issue #3: Prohibitive Parser Size

Parsers take up space in proportion to the size of the underlying grammar. Further, as a general rule of thumb, the more expressive the grammars parsing method M can handle, the bigger the M-parsers.

The problem the use of a single parsing method introduces here is that the most complex part of the grammar determines the parsing method for the entire parser.

As an example, consider writing a parser for a language that supports the usual C-like arithmetic expressions, e.g., by also implementing the parsing routines for the following small sub-grammar.

Expression -> Expression - Term
Expression -> Expression + Term
Expression -> Term

Term -> NUMBER /* a number constant */
Term -> IDENTIFIER /* a variable name */
Term -> '(' Expression ')'

The parser for the rest of the language might be implementable with a simple LL(1) parser, yet the presence of the arithmetic expressions alone forces you to use a different parsing method, a one that will likely produce a much bigger parser.

A Multi-Machine Solution

Isolate the most-complex part of your grammar, handle it with a complex machine, and then attach this machine to a simpler super-parser that handles the rest of the grammar.

In the example above, the grammar for arithmetic expressions can be isolated, and an operator-precedence or an LR(1) parser can be used to parse it. The rest of our grammar can be handled differently, and the OP/LR(1) parser will be invoked only when desirable.

Issue #4: Sub-Optimal Parser Performance

Parsing methods designed to handle more complex grammars are usually plausibly efficient on one type of grammars, while somewhat less efficient for some particular types of syntactic structures.

An artificial but real example is an ANTLR-generated LL(*) parser for the following grammar:

A = a* b* B;
B = a b+ c;

... working on strings of the form aⁿb^mc with non-negative n and positive m. This example mimics the competition without ambiguity for adjectives between two nouns in a sentence in some Slavic languages, and will trigger the worst-case exponential-time parsing behaviour of the parser. However, we know for sure from the theory of LL-regular grammars that a linear-time parser exists - if only we could use such a parser for this part of grammar, and the LL(*) parser for everything else.

A Multi-Machine Solution

Knowing the limitations of your primary choice of parsing machine, find the parts of your grammar that trigger the less-than-optimal performance. For those, a specialized parsing machine can be used, while the main parsing machine can handle the overall skeleton of the grammar.

The Takeaway

Committing to using just one parsing algorithm, especially when using a parser generator, often has a significant impact on the design of the language being parsed as well.

Tables can be turned by being aware of the strengths and weaknesses of various parsing methods and compounding them into a single parser.

While combining various parsing methods is somewhat common when a compiler is written by programmers from scratch, it is rare when a parser generator is used. In the second part of this series, we will go through the literature available on the subject and explore how various parser generators such as Bison, Yacc, Astir or ANTLR handle multi-machine parser generation, if at all.

History

10^th January, 2021: Initial version
10^th January, 2021: Series Thumbnail added
11^th January, 2021: Added further clarification to the "they are flying planes" example under Issue #1