LexContext: A Streamlined Cursor Over a Text Input

honey the codewitch

4.20/5 (3 votes)

Jan 17, 2020

MIT

6 min read

8871

Manage location tracking, lifetime, error reporting and basic parsing/lexing with this simple class

Introduction

This is a topic I've covered before, but the code was getting rusty, and a bit bloated, so I've decided to streamline, rewrite, and produce an accompanying article.

I write a lot of parsers and lexers and generators for those, and in the course of doing this, I've developed some simple tools to cover common pitfalls in doing so. The class presented here is absolutely foundational to a lot of my code. I use it or one of its variants in many of my projects, including Parsley. It provides a unified facade over disparate input sources, managing lifetime, location information such as line and column info, capturing input, error reporting, and a better cursor than TextReader or IEnumerator<char> provides, all in a lightweight class.

I'm going to recover ground I've covered in my previous article mentioned in the introduction. If you've followed along so far from the previous article, the main changes I've made are removing lookahead to eke a little bit more performance out of the class, and separating some of the helper methods into a separate file, making the core API extremely spartan, while keeping the extended functionality as a separate included file.

Conceptualizing this Mess

There are several pitfalls to lexing and parsing this class covers, despite the API being small, so we'll cover them in sections.

Unified Input and Lifetime Management

LexContext provides Create() and CreateFrom() which take an IEnumerator<char> or a TextReader respectively. There is also a CreateFrom() overload that takes a filename. Furthermore, there is CreateFromUrl() which allows you to create a LexContext over a remote or local source indicated by URL.

In addition, LexContext provides a Close() method and is disposable so that you can close the underlying file or network stream, if any.

Regardless of how the instance was created, the interface for accessing it is the same.

Cursor Management

When writing lexers or lexerless parsers, you often need to keep the current character under the cursor around, and pass it to the various parse or lex functions you've created. I'll demonstrate later, but basically, you'll find yourself using Peek() on TextReader if you don't, and that's Bad(TM) as it prevents your code from working over network streams or other non-seekable sources, like Console input.

IEnumerator<char> is better for this as it provides the Current property which provides access to that character under the cursor that we need. However, unlike TextReader, IEnumerator<char> will throw if the cursor is positioned past the end of the stream, and does not give you any indication of that except while you MoveNext(). This is extremely undesirable as it means you have to track the ending condition yourself, which can complicate the parsing code considerably if you have to incorporate that bookkeeping into your parsing routines.

LexContext, like TextReader will report the current cursor value even if the cursor is not over a character. Like IEnumerator<char>, it reports the current character through the Current property and does so without seeking. It will report the end of input with -1/EndOfInput or before the beginning with -2/BeforeInput when the cursor is in the initial position. It will also report -3/Disposed if the LexContext has been closed/disposed of.

Advance() works like TextReader.Read() in that it advances the cursor and returns the character that was read. After Advance(), that character can be accessed again through the Current property. You can use whichever mechanism for retrieving the current character that suits your code in the moment.

Using these makes the above pitfalls vanish, and in doing so, makes your code to execute a lex/parse almost trivial while preserving the ability to examine non-seekable streams.

Location Information

LexContext tracks position, line and column info, and the current file or URL where the document resides, if any. This is often used for error reporting, but can be used for whatever it's needed for. Tab width is 4 unless otherwise specified using the TabWidth property. Setting this to the input decide's tab width allows column information to be accurately tracked. For the console, as well as most underlying input sources, the default should be appropriate. The Line and Column properties are one based, while the Position property is zero based. FileOrUrl holds the source location of the current document.

Error Handling

ExpectingException is the exception that is thrown when the lexer/parser encounters an error. It reports error location information, a message, and an optional collection of expected characters or symbols. You can throw it manually, but typically it will be raised as a result of calling Expecting().

While parsing, you call Expecting() to indicate that you expect Current to be one of a series of characters. If Current is not one of those characters, an ExpectingException is automatically filled with the current context information and then thrown.

Capturing Input

As you parse or lex, you'll typically need to capture the input you're examining to be retrieved later during the parse/lex, as part of a subparse, or simply to report from your lex or parse functions. LexContext provides a CaptureBuffer to help do just that.

CaptureBuffer itself is a StringBuilder, while LexContext provides Capture() to capture the current character under the cursor, if any, GetCapture() to retrieve all or part of the captured buffer, and ClearCapture() to clear the capture buffer.

If you need to execute a subparse, simply capture what you need to subparse into the capture buffer, and then create another LexContext over the captured string! You can then feed that LexContext to your subparse functions.

Using this Mess

I've provided a simply JSON minifier by way of example. It is about half of a JSON parser as well in that it loads the parsed text into dictionaries and lists depending on whether they are JSON objects or JSON arrays. What it doesn't do is normalize scalar values like strings. Strings are returned with quotes and embedded escapes, numbers, bools and nulls are returned as strings.

static void Main(string[] args)
{
    // minifies JSON. Does so by parsing into an intermediary graph
    // this step wasn't required, but makes it easier to adapt
    // the code to a real world JSON parser

    // holds our json data
    IDictionary<string, object> json = null;
            
    // parse our file
    using (var pc = LexContext.CreateFrom(@"..\..\Burn Notice.2919.tv.json"))
        json = _ParseJsonObject(pc);
            
    // write our json data out
    _WriteJsonTo(json, Console.Out);
}
static object _ParseJson(LexContext pc)
{
    // parses a JSON object, array, or value
    pc.TrySkipWhiteSpace();
    switch (pc.Current)
    {
        case '{':
            return _ParseJsonObject(pc);
        case '[':
            return _ParseJsonArray(pc);
        default:
            return _ParseJsonValue(pc);
    }
}
static IDictionary<string, object> _ParseJsonObject(LexContext pc)
{
    // a JSON {} object - our objects are dictionaries
    var result = new Dictionary<string, object>();
    pc.TrySkipWhiteSpace();
    pc.Expecting('{');
    pc.Advance();
    pc.Expecting(); // expecting anything other than end of input
    while ('}' != pc.Current && -1 != pc.Current) // loop until } or end
    {
        pc.TrySkipWhiteSpace();
        // _ParseJsonValue parses any scalar value, but we only want 
        // a string so we check here that there's a quote mark to 
        // ensure the field will be a string.
        pc.Expecting('"');
        var fn = _ParseJsonValue(pc);
        pc.TrySkipWhiteSpace();
        pc.Expecting(':');
        pc.Advance();
        // add the next value to the dictionary
        result.Add(fn, _ParseJson(pc));
        pc.TrySkipWhiteSpace();
        pc.Expecting('}', ',');
        // skip commas
        if (',' == pc.Current) pc.Advance();
    }
    // make sure we're positioned on the end
    pc.Expecting('}');
    // ... and read past it
    pc.Advance();
    return result;
}
static IList<object> _ParseJsonArray(LexContext pc)
{
    // a JSON [] array - our arrays are lists
    var result = new List<object>();
    pc.TrySkipWhiteSpace();
    pc.Expecting('[');
    pc.Advance();
    pc.Expecting(); // expect anything but end of input
                    // loop until end of array or input
    while (-1 != pc.Current && ']' != pc.Current)
    {
        pc.TrySkipWhiteSpace();
        // add the next item
        result.Add(_ParseJson(pc));
        pc.TrySkipWhiteSpace();
        pc.Expecting(']', ',');
        // skip the comma
        if (',' == pc.Current) pc.Advance();
    }
    // ensure we're on the final position
    pc.Expecting(']');
    // .. and read past it
    pc.Advance();
    return result;
}
static string _ParseJsonValue(LexContext pc)
{
    // parses a scalar JSON value, represented as a string
    // strings are returned quotes and all, with escapes 
    // embedded
    pc.TrySkipWhiteSpace();
    pc.Expecting(); // expect anything but end of input
    pc.ClearCapture();
    if ('\"' == pc.Current)
    {
        pc.Capture();
        pc.Advance();
        // reads until it finds a quote
        // using \ as an escape character
        // and consuming the final quote 
        // at the end
        pc.TryReadUntil('\"', '\\', true);
        // return what we read
        return pc.GetCapture();
    }
    pc.TryReadUntil(false, ',', '}', ']', ' ', '\t', '\r', '\n', '\v', '\f');
    return pc.GetCapture();
}

You can see for starters, that there are API functions like TrySkipWhiteSpace(). Those are in LexContext.BaseExtensions.cs. All these do is drive the LexContext in order to fulfill the requested operation, like skipping over whitespace. TrySkipUntil()/TryReadUntil() are very helpful as they are escape aware. Above, we use TryReadUntil() to capture a string with \ as the escape character.

After we parse the JSON into the appropriate objects, we simply write them to the console, as shown here:

static void _WriteJsonTo(object json, TextWriter writer)
{
    var d = json as IDictionary<string, object>;
    if (null != d)
        _WriteJsonObjectTo(d, writer);
    else
    {
        var l = json as IList<object>;
        if (null != l)
            _WriteJsonArrayTo(l, writer);
        else
            writer.Write(json);
    }
}
static void _WriteJsonObjectTo(IDictionary<string, object> json, TextWriter writer)
{
    var delim = "{";
    foreach (var field in json)
    {
        writer.Write(delim);
        _WriteJsonTo(field.Key, writer);
        writer.Write(":");
        _WriteJsonTo(field.Value, writer);
        delim = ",";
    }
    if ("{" == delim)
        writer.Write(delim);
    writer.Write("}");
}
static void _WriteJsonArrayTo(IList<object> json, TextWriter writer)
{
    var delim = "[";
    foreach (var item in json)
    {
        writer.Write(delim);
        _WriteJsonTo(item, writer);
        delim = ",";
    }
    if ("[" == delim)
        writer.Write(delim);
    writer.Write("]");
}

Note that this is suitable for minification, but it is not a standards compliant JSON parser. In order to keep things simple for the demo, it's a bit "looser" than the JSON spec in that it allows more than the JSON spec does. Namely, it doesn't enforce valid string escapes, accepts more kinds of whitespace, and doesn't enforce termination of a string on a newline. Despite that, as long as the JSON is valid to begin with, this will parse it.

History

17^th January, 2020 - Initial submission