Posted on Dec 3, 2021

100 Languages Speedrun: Episode 14: Recursive Descent Parser

Back in episode 8, we took our first steps towards creating a new programming language, and implemented a tokenizer. If you didn't read that episode, I recommend doing it now, otherwise this episode might be a bit confusing.

Our language

We'll create a very simple language for doing math.

It will have one statement per line, and there will only be these statements:

reading number from user and saving it to variable read name
printing variable print name
setting variable set name = expression

Any empty line will be ignored. # will start a comment that will go until end of current line.

So a program could look like this:

# Program for converting miles to km read miles set kms = miles * 1.60934 print kms

What is an `expression`?

It's very clear what a statement is, but what is an expression?

Let's write down the possibilities. These would be enough for our simple language:

a number like 1.60934
( expression )
expression + expression
expression - expression
expression * expression
expression / expression

And we have the whole language!

Well, almost. There's just tiny problem of ambiguity. Right now the language has no idea if 3 + 4 * 5 means 3 + ( 4 * 5 ) or (3 + 4) * 5. It also doesn't know if 2 - 3 - 4 is 2 - (3 - 4) or (2 - 3) - 4.

There are two approaches to solving this in use. One is to tell our parser "operator precedence" (if + or * goes first), and for each operator its associativity (left or right or not allowed - it determines which way 2 - 3 - 4 is parsed).

The other is to rewrite grammar to get rid of such ambiguity, and that's generally the simpler approach.

Unambiguous grammar

Instead of having expression with 6 possibilities, it will now only have these:

expression + product
expression - product
product

Then product is:

product * factor
product / factor
factor

And factor is:

( expression )
number

Expressed this way, it's impossible to have ambiguities. You can verify it yourself. A product cannot possibly have anything with + on either side, unless we use parentheses, because none of the rules it refers to has +. And 2 - 3 - 4 cannot possibly mean 2 - (3 - 4) because the right side of - is a product, so it cannot have any - in it, without explicit parentheses.

By the way don't think too much about names like product or factor. I don't find them terribly intuitive myself. You could call it expression2, expression3 or such.

A lot of "parser generator" tools accept definitions like what we just came up with. Unfortunately for this episode we won't be using a parser generator, we'll be writing Recursive Descent Parser ourselves, and it requires grammars to be tweaked a bit.

Grammar for Recursive Descent Parser

I didn't discuss what Recursive Descent Parser are yet. We'll get to it soon.

But a big limitation of Recursive Descent Parser is that none of the parsing rules can refer to itself on the left, because then it would goes into infinite recursive loop.

So let's rewrite our definitions in a way that won't have this issue:

expression is:

product ( ("+"|"-") product )*

product is:

factor ( ("*"|"/") factor ]*)

And factor is:

"(" expression ")"
number

I had to extend the notation a bit. a|b means a or b. ( ... )* means zero or more of whatever's in the parentheses. And as there are special characters now, I'm quoting all literal symbols like "+", or "(".

Tokenizer

OK, let's start with tokenizer. It will use StringScanner from Ruby standard library, and some regexp. Here's tokenizer program for our math language:

#!/usr/bin/env ruby require "strscan" require "pathname" class MathLanguage def initialize(path) @lines = Pathname(path).readlines end def tokenize(string) scanner = StringScanner.new(string) tokens = [] until scanner.eos? if scanner.scan(/\s+/) # do nothing elsif scanner.scan(/#.*/) # comments - ignore rest of line elsif scanner.scan(/-?\d+(?:\.\d*)?/) tokens << { type: "number", value: scanner.matched.to_f } elsif scanner.scan(/[\+\-\*\/=()]|set|read|print/) tokens << { type: scanner.matched } elsif scanner.scan(/[a-zA-Z][a-zA-Z0-9]*/) tokens << { type: "id", value: scanner.matched } else raise "Invalid character #{scanner.rest}" end end tokens end def call @lines.each do |line| tokens = tokenize(line) puts "Line: \"#{line.chomp}\" has tokens:" tokens.each do |token| p token end end end end unless ARGV.size == 1 STDERR.puts "Usage: #{$0} file.math" exit 1 end MathLanguage.new(ARGV[0]).call

If we run it, we can see it turns all lines into tokens, and skips comments:

$ ./math_tokenizer.rb miles_to_km.math Line: "# Program for converting miles to km" has tokens: Line: "read miles" has tokens: {:type=>"read"} {:type=>"id", :value=>"miles"} Line: "set kms = miles * 1.60934" has tokens: {:type=>"set"} {:type=>"id", :value=>"kms"} {:type=>"="} {:type=>"id", :value=>"miles"} {:type=>"*"} {:type=>"number", :value=>1.60934} Line: "print kms" has tokens: {:type=>"print"} {:type=>"id", :value=>"kms"}

What is a Recursive Descent Parser?

We have a list of definitions. Now what is this Recursive Descent Parser I've been alluding to?

It's simply a bunch of method (or functions if you don't like OOP terminology) - one per grammar rule. Each method consumes some of the tokens from the left, and returns parsed content.

For Recursive Descent Parser, methods for rule X should expect that X will be on the left of the list, and raise exception if it isn't - but they should be OK with there being some leftover tokens.

As we're only parsing and not executing anything, for now our loop will look like this:

 def call @lines.each do |line| @tokens = tokenize(line) next if @tokens.empty? statement = parse_statement raise "Extra tokens left over" unless @tokens.empty? pp statement end end

I'm saving @tokens to instance variable, so we don't need to pass it to every method.

Also to simplify the code, here's some helper methods. next_token_is? checks if next token is of any of the expected types and return true or false. expect_token shifts token if it's one of expected types, or raises exception if it's of a wrong type:

 def next_token_is?(*types) @tokens[0] && types.include?(@tokens[0][:type]) end def expect_token(*types) raise "Parse error" unless next_token_is?(*types) @tokens.shift end

And with that, here's our parser:

 def parse_factor token = expect_token("number", "id", "(") case token[:type] when "number", "id" token when "(" result = parse_expression expect_token(")") result end end def parse_product result = parse_factor while next_token_is?("*", "/") op_token = @tokens.shift result = {type: op_token[:type], left: result, right: parse_factor} end result end def parse_expression result = parse_product while next_token_is?("+", "-") op_token = @tokens.shift result = {type: op_token[:type], left: result, right: parse_product} end result end def parse_statement token = expect_token("read", "set", "print") case token[:type] when "read" token = expect_token("id") {type: "read", id: token[:value]} when "set" var_token = expect_token("id") expect_token("=") expr = parse_expression {type: "set", id: var_token[:value], expr: expr} when "print" token = expect_token("id") {type: "print", id: token[:value]} end end

I recommend taking a good look at this code. Recursive Descent Parsers are very readable like that, but since it's probably not the type of code you've seen before, it might be a bit confusing initially.

We can give it a try:

$ ./math_parser.rb miles_to_km.math {:type=>"read", :id=>"miles"} {:type=>"set", :id=>"kms", :expr=> {:type=>"*", :left=>{:type=>"id", :value=>"miles"}, :right=>{:type=>"number", :value=>1.60934}}} {:type=>"print", :id=>"kms"}

Error handling

One huge advantage of Recursive Descent Parsers over parser generators is potential for great error reporting. Recursive Descent is much closer to the way humans think, so if there's an error, we can say something like At line 20 char 1: expected one of "read", "set", or "print", but got token: "number".

For complicated technical reasons, parsers created by parser generators often have atrociously bad error reporting.

I'm mentioning this, as for sake of simplicity, for this episode we won't be doing anything beyond raise "Parse error", but we could, and we will in the future.

Run it!

To run the code we change the loop slightly:

 def call @variables = {} @lines.each do |line| @tokens = tokenize(line) next if @tokens.empty? statement = parse_statement raise "Extra tokens left over" unless @tokens.empty? eval_statement(statement) end end

And the code to evaluate parsed statements and expressions:

 def eval_expression(expr) case expr[:type] when "number" expr[:value] when "id" @variables[expr[:value]] when "+", "-", "*", "/" left = eval_expression(expr[:left]) right = eval_expression(expr[:right]) left.send(expr[:type], right) else raise end end def eval_statement(statement) case statement[:type] when "read" @variables[statement[:id]] = STDIN.readline.to_f when "print" puts @variables[statement[:id]] when "set" @variables[statement[:id]] = eval_expression(statement[:expr]) else raise end end

I left the else raise there to catch programming errors. If we coded things correctly, it should never be reached.

If you're not familiar with Ruby, left.send(expr[:type], right) is equivalent to left + right, left - right, left * right, or left / right depending on expr[:type].

And that's all we needed to create our proper programming language:

$ ./math.rb miles_to_km.math 500 804.67

Where do we go from here?

We didn't need any special tooling, just some regular expressions for tokenizer, and some plain recursive methods for parser and for execution.

This approach is perfectly fine for simple programming languages. You can just add more token types, more grammar rules, some error handling, and some interesting interpreter, and you can have a decent language for your needs. Nothing about it is specific to Ruby - you can use any language. Regular expression support is definitely a big plus, but people have been writing tokenizers without regular expressions too, looking one character at a time.

In the future we'll try to do something similar with tools like PLY (Python Lex-Yacc), or ANTLR.

Code

All code examples for the series will be in this repository.

Code for the Recursive Descent Parser episode is available here.

DEV Community

100 Languages Speedrun: Episode 14: Recursive Descent Parser

Our language

What is an `expression`?

Unambiguous grammar

Grammar for Recursive Descent Parser

Tokenizer

What is a Recursive Descent Parser?

Error handling

Run it!

Where do we go from here?

Code

Top comments (0)

Our language

What is an expression?

Unambiguous grammar

Grammar for Recursive Descent Parser

Tokenizer

What is a Recursive Descent Parser?

Error handling

Run it!

Where do we go from here?

Code

What is an `expression`?