How to Implement a Simple Parser from Scratch

Aug 4, 2019 · 1195 words

A while ago, I received a private message from Brother Ye, who suddenly needed to solve a task like this:

Convert expressions in the following format:

Multi(a, Multi(b, c)) --> a * (b * c)

Divide(a, Sub(b, c)) --> a / (b - c)

The supported operators are:

Add: +

Sub: -

Multi: *

Divide: /

And, as luck would have it, he needed to write it in C++, a language he hadn't used much. I realized this was a parser problem. My first reaction was to recommend flex/bison, but then I thought it might be overkill for such a small task. So, I started wondering if writing such an expression parser by hand would be difficult. The conclusion I reached was: it's not difficult.

Anyone familiar with compiler principles knows what a parser is. A parser, or syntactic analyzer, is a component found in the front-end of every compiler. However, from the perspective of compiler principles, the scope of 'language' is much broader than what we typically understand as programming languages. Any string formation with certain rules can be considered a language, such as the expressions described using functions like Add and Sub in the task above.

So, to solve the task above, one only needs to perform syntactic analysis on the expression string, obtain an intermediate representation (usually a parse tree or abstract syntax tree), and then output this intermediate representation in the desired format. In other words, we need to provide a parser for the expression, and any solution to this task can essentially be seen as writing a parser.

Normally, there's absolutely no need for us to hand-write a parser, because tools already exist to generate them for us. Thanks to the great programmers who invented such tools decades ago. The ones I've used include flex/bison for C/C++ and ANTLR for Java. You just need to provide a grammar description, and these tools can automatically generate the corresponding parser for you. Hand-writing a parser would be very complex and error-prone, not a wise choice.

However, for small tasks like the one exemplified above, using tools that automatically generate parsers can sometimes feel too heavy-handed. In such cases, perhaps hand-writing a parser is a better option. Moreover, in this specific task scenario, our parser can be significantly simplified in at least two aspects:

First, the language we need to process should not have complex state transitions like general-purpose programming languages. Typically, by looking at the current string, one should be able to tell what kind of content needs to be parsed next. Markup languages generally follow this style, for example:

XML/HTML: Seeing <tag> indicates the start of a tag, until </tag>
CSS: Declarations after a selector are always enclosed in curly braces, with each declaration separated by a ;
Markdown: A line starting with # is a heading, and one starting with 1. is an ordered list item

Second, we don't need to perform complex syntax error handling; simply reporting 'syntax error' is sufficient, without painstakingly explaining what exactly went wrong.

With these two prerequisites, we can start thinking about how to hand-write a parser. Of course, I've already thought it through, and below is a simple implementation of a parser that I've provided. I implemented it in Java, using a bit of lambda expression syntax, but it's not difficult to understand. Since the main work of a parser is string comparison, it's pretty much the same in any language. I'll consider implementing it in other languages later.

In terms of implementation, let's make another simplification: we'll store the string to be parsed as a character array, rather than reading it from a so-called 'character stream'. This way, we don't have to consider scenarios where characters are read (get) but not consumed. These are concerns for the input module; we'll focus on the parser itself.

First, our SimpleParser is defined as follows:

public class SimpleParser {

    private char[] input;
    private int pos;

    public SimpleParser(String source) {
        this.input = source.toCharArray();
        this.pos = 0;
    }
}

We store the input as a character array, and pos is a pointer to the next character to be read. Incrementing pos by one is equivalent to consuming a character.

Next, let's add some utility functions:

private void consumeWhitespace() {
    consumeWhile(Character::isWhitespace);
}

private String consumeWhile(Predicate<Character> test) {
    StringBuilder sb = new StringBuilder();
    while (!eof() && test.test(nextChar())) {
        sb.append(consumeChar());
    }
    return sb.toString();
}

private char consumeChar() {
    return input[pos++];
}

private boolean startsWith(String s) {
    return new String(input, pos, input.length - pos).startsWith(s);
}

private char nextChar() {
    return input[pos];
}

private boolean eof() {
    return pos >= input.length;
}

These functions are inspired by a series of articles I read previously: Let's build a browser engine! (the original was in Rust). Let's take a look at these functions:

Among them, the nextChar and startsWith functions are used for 'looking ahead' to determine the state of the subsequent input. This is actually a bit different from the syntactic analysis described in compiler principles (recall that syntactic analysis methods in compiler principles only look ahead one character), but since we are only checking if it equals a fixed string, it's not a major issue.

The functions starting with consume... are the actual input reading functions. Among them, consumeWhile is a general-purpose function, and consumeWhitespace is implemented based on it. Similarly, we can also implement a function to parse variable names based on it:

private String parseVariableName() {
    return consumeWhile(Character::isAlphabetic);
}

Notice that this is essentially parsing the variable names in our task. With this approach, the subsequent implementation is actually very simple. We might initially think that hand-writing a parser would be very complex, but that's usually because we haven't found the right starting point. Therefore, these utility functions are particularly important; once we have them, we can gradually build out the entire parser functionality step by step.

Then we can write it like this:

// Parses an expression consisting of a single variable
private VariableExpression parseVariableExpression() {
    String name = parseVariableName();
    // Definition of VariableExpression omitted
    return new VariableExpression(name);
}

// Parses an Add/Sub/Multi/Divide expression
private CompoundExpression parseCompoundExpression(String name) {
    for (char c : name.toCharArray()) {
        checkState(c == consumeChar());
    }
    checkState('(' == consumeChar());
    // Recursive parsing
    Expression left = parseExpression();
    checkState(',' == consumeChar());
    consumeWhitespace();
    Expression right = parseExpression();
    checkState(')' == consumeChar());
    // Definition of CompoundExpression omitted
    return new CompoundExpression(name, left, right);
}

// VariableExpression and CompoundExpression are both Expression
private Expression parseExpression() {
    if (startsWith("Add")) {
        return parseCompoundExpression("Add");
    } else if (startsWith("Sub")) {
        return parseCompoundExpression("Sub");
    } else if (startsWith("Multi")) {
        return parseCompoundExpression("Multi");
    } else if (startsWith("Divide")) {
        return parseCompoundExpression("Divide");
    } else {
        return parseVariableExpression();
    }
}

At this point, the main work of our parser is done, and the remaining task is very simple. Does it seem like our task was a bit too simple? In this scenario, hand-writing a parser is indeed not difficult. Next, one could try hand-writing a Markdown parser for practice 😜.

P.S. Brother Ye never actually did this task later; I only remembered to implement this parser now, just because I thought it would be fun.

The complete code for the parser discussed in this article can be found on my GitHub: simpleparser.