Maximum Munch Principle

Zeeshan Amjad

4.31/5 (13 votes)

Sep 2, 2003

CPOL

4 min read

52931

How the C++ compiler breaks its input into tokens

Introduction

Most STL users are already aware of the need to put an extra space when taking a template as a template argument to any class. For example when you want to make a vector of complex numbers, which is also a template class, you have to put extra space between two > operators.

std::vector<std::complex<int> > vecComp;

If you forget to put the extra space then the compiler will treat it as a logical shift operator >>. (Although it has been recommended to extend the language to understand the meaning of >> dependent on its context [VAN03]).

To understand better what is going behind the scene and why should we need it, we have to look at the workings of the compiler and its construction. The simplest definition of a compiler is that it is a program which translates one language into other language. The C++ compiler translates the C++ source program into the required platform specific machine language.

The first phase of any compiler is to read the source program, and convert it into tokens after removing any white space and comments. The tokens are used by the parser. This process of converting the input stream into tokens is called Lexical Analysis or scanning [AHO86]. It is possible to make a language which doesn't have any comments and where the Lexical Analyzer doesn't remove white space, but in this case it is the responsibility of the parser and other compiler phase to handle white space. And you have to rewrite the grammar of it to handle white space, which is not a reasonable and easy task.

The Lexical Analyzer usually also stores line numbers for better error information, and may also store a copy of the source program with error information. But in the case of multi-line comments or lines that only contain comments in the source program the storage of line numbers will not be the same after removing the comment lines.

Tokens are logically cohesive sequence of characters of collective meaning. In more technical terms tokens are usually terminal symbols in the grammar [AHO86]. Usually token types are keywords, operators, identifiers, constant, string, digits punctuation symbols etc. The character sequence, from the input stream to create a token is called a lexeme.

One typical C++ scanner may generate the following tokens from a given C++ statement.

int a = b + 10;

This statement generates seven tokens. First it generates the token keyword for int. Then there is an identifier a. The tokenizer first searches for it in the symbol table, a data structure to store information about identifiers. If the token is not found in the symbol table enter its name a into the symbol table and generates a token type identifier. Then the punctuation token is generated followed by identifier b, with the same rule of symbol table which was used with identifier a, then plus token, digit token of value 10 and finally punctuation token for semicolon. Tokens are usually generated with its attributes. In the case of an identifier the pointer to its symbol table entry is stored as an attribute of that token. If we write the pairs of tokens and its attribute of above statement then it would be something like this

{Keyword integer, NULL}
{Identifier, Pointer to symbol table entry of "a"}
{Assignment Operator, NULL }
{Identifier, Pointer to symbol table entry of "b"}
{Plus Operator, NULL}
{Digit, 10}
{Terminator symbol, NULL}

Compilers usually create token from the longest lexeme e.g. in case of >> in C++ compiler create token of shift right operator rather than two greater than operators [HOL90]. This is also called Maximum Munch Tokenization Principle, or simply Maximum Munch Principle, that the C++ implementation has to consider as many characters as possible to create token.

Other examples of the Maximum Munch Principle can be seen with digraphs. Take a look at the following code.

std::list<::x> lst;

Here <: is a digraph symbol and treated as [. So the above code is compiled as

std::list[:x> lst;

So you have to put an extra space between < and the colon. The correct code is written as

std::list< ::x> lst;

And the same is true in the following example.

int x = y%::z;

Here %: is again a digraph and considered to be # and this statement becomes

int x = y#:z;

You have to put extra space here too to compile it correctly.

int x = y % ::z;

It is interesting to note exactly the reverse of this when the Maximum Munch Principle is not applied, but the reader of the code thinks it is. For example once someone ask Bjarne Stroustrup to overload or create an ** operator, which is present in COBOL and some other languages. There is no ** operator in C++ and there isn't any way to create any new operator in C++. But in C++ the * operator has two meanings, multiplication and dereference operator. You can overload both operators; the only difference is number of parameters. The multiplication operator needs two parameters, however the dereference operator needs only one. Stroustrup used the nice technique to overload both operators to give the illusion of ** operator.

// overloading ** operator
#include <cmath>
#include <ciostream>

struct Index {
    double d;
    Index(double dd) :d(dd) { }
};

struct II {
    double d;
    II(double dd) :d(dd) { }
};

struct III {
    double d;
    III(double ddd) : d(ddd) { }
};

II operator*(Index i) {
    return II(i.d); 
}

double operator*(double d , II i) {
    return pow(d,i.d); 
}

int main () {
    Index i = 3;
    std::cout << 2**i << "\n";    
}

Here the statement 2**i gives the illusion of the ** operator, but in fact it is two * operators; one is a multiplication operator and the other is a dereference operator.

References

[AHO98] Compilers Principles, Techniques and Tools. Alfred V. Aho, Ravi Sehi, Jeffrey D Ullman 1986
[HOL90] Compiler Design in C. Allen I. Holub 1990
[VAN03] C++ Templates, The Complete Guide. David Vandevoorde, Nicolai M. Josuttis 2003