Combinator Parsing By Swanand Pagnis
Higher-Order Functions for Parsing By Graham Hutton
• Abstract & Introduction • Build a parser, one fn at a time • Moving beyond toy parsers
Abstract
In combinator parsing, the text of parsers resembles BNF notation. We present the basic method, and a number of extensions. We address the special problems presented by whitespace, and parsers with separate lexical and syntactic phases. In particular, a combining form for handling the “offside rule” is given. Other extensions to the basic method include an “into” combining form with many useful applications, and a simple means by which combinator parsers can produce more informative error messages.
• Combinators that resemble BNF notation • Whitespace handling through "Offside Rule" • "Into" combining form for advanced parsing • Strategy for better error messages
Introduction
Primitive Parsers • Take input • Process one character • Return results and unused input
Combinators • Combine primitives • Define building blocks • Return results and unused input
Lexical analysis and syntax • Combine the combinators • Define lexical elements • Return results and unused input
input: "from:swiggy to:me" output: [("f", "rom:swiggy to:me")]
input: "42 !=> ans" output: [("4", "2 !=> ans")]
rule: 'a' followed by 'b' input: "abcdef" output: [(('a','b'),"cdef")]
rule: 'a' followed by 'b' input: "abcdef" output: [(('a','b'),"cdef")] Combinator
Language choice
Suggested: Lazy Functional Languages
Miranda: Author's choice
Haskell: An obvious choice. 🤓
Racket: Another obvious choice. 🤓
Ruby: 🍎 to 🍊 so $ for learning
OCaml: Functional, but not lazy.
Haskell %
Simple when stick to fundamental FP • Higher order functions • Immutability • Recursive problem solving • Algebraic types
Let's build a parser, one fn at a time
type Parser a b = [a] !-> [(b, [a])]
Types help with abstraction • We'll be dealing with parsers and combinators • Parsers are functions, they accept input and return results • Combinators accept parsers and return parsers
A parser is a function that accepts an input and returns parsed results and the unused input for each result
Parser is a function type that accepts a list of type a and returns all possible results as a list of tuples of type (b, [a])
(Parser Char Number) input: "42 it is!" !-- a is a [Char] output: [(42, " it is!")] !-- b is a Number
type Parser a b = [a] !-> [(b, [a])]
Primitive Parsers
succeed !:: b !-> Parser a b succeed v inp = [(v, inp)]
Always succeeds Returns "v" for all inputs
failure !:: Parser a b failure inp = []
Always fails Returns "[]" for all inputs
satisfy !:: (a !-> Bool) !-> Parser a a satisfy p [] = failure [] satisfy p (x:xs) | p x = succeed x xs !-- if p(x) is true | otherwise = failure []
satisfy !:: (a !-> Bool) !-> Parser a a satisfy p [] = failure [] satisfy p (x:xs) | p x = succeed x xs !-- if p(x) is true | otherwise = failure [] Guard Clauses, if you want to Google
literal !:: Eq a !=> a !-> Parser a a literal x = satisfy (!== x)
match_3 = (literal '3') match_3 "345" !-- !=> [('3',"45")] match_3 "456" !-- !=> []
succeed failure satisfy literal
Combinators
match_3_or_4 = match_3 `alt` match_4 match_3_or_4 "345" !-- !=> [('3',"45")] match_3_or_4 "456" !-- !=> [('4',"56")]
alt !:: Parser a b !-> Parser a b !-> Parser a b (p1 `alt` p2) inp = p1 inp !++ p2 inp
(p1 `alt` p2) inp = p1 inp !++ p2 inp List concatenation
(match_3 `and_then` match_4) "345" # !=> [(('3','4'),"5")]
🐉
and_then !:: Parser a b !-> Parser a c !-> Parser a (b, c) (p1 `and_then` p2) inp = [ ((v1, v2), out2) | (v1, out1) !<- p1 inp, (v2, out2) !<- p2 out1 ]
and_then !:: Parser a b !-> Parser a c !-> Parser a (b, c) (p1 `and_then` p2) inp = [ ((v1, v2), out2) | (v1, out1) !<- p1 inp, (v2, out2) !<- p2 out1 ] List comprehensions
(v11, out11) (v12, out12) (v13, out13) … (v21, out21) (v22, out22) … (v31, out31) (v32, out32) … p1 p2
((v11, v21), out21) ((v11, v22), out22) …
(match_3 `and_then` match_4) "345" # !=> [(('3','4'),"5")]
Manipulating values
match_3 = (literal '3') match_3 "345" !-- !=> [('3',"45")] match_3 "456" !-- !=> []
(number "42") "42 !=> answer" # !=> [(42, " answer")]
(keyword "for") "for i in 1!..42" # !=> [(:for, " i in 1!..42")]
using !:: Parser a b !-> (b !-> c) !-> Parser a c (p `using` f) inp = [(f v, out) | (v, out) !<- p inp ]
((string "3") `using` float) "3" # !=> [(3.0, "")]
Levelling up
many !:: Parser a b !-> Parser a [b] many p = ((p `and_then` many p) `using` cons) `alt` (succeed [])
0 or many
(many (literal 'a')) "aab" !=> [("aa","b"),("a","ab"),("","aab")]
(many (literal 'a')) "xyz" !=> [("","xyz")]
some !:: Parser a b !-> Parser a [b] some p = ((p `and_then` many p) `using` cons)
1 or many
(some (literal 'a')) "aab" !=> [("aa","b"),("a","ab")]
(some (literal 'a')) "xyz" !=> []
positive_integer = some (satisfy Data.Char.isDigit) negative_integer = ((literal '-') `and_then` positive_integer) `using` cons positive_decimal = (positive_integer `and_then` (((literal '.') `and_then` positive_integer) `using` cons)) `using` join negative_decimal = ((literal '-') `and_then` positive_decimal) `using` cons
number !:: Parser Char [Char] number = negative_decimal `alt` positive_decimal `alt` negative_integer `alt` positive_integer
word !:: Parser Char [Char] word = some (satisfy isLetter)
string !:: (Eq a) !=> [a] !-> Parser a [a] string [] = succeed [] string (x:xs) = (literal x `and_then` string xs) `using` cons
(string "begin") "begin end" # !=> [("begin"," end")]
xthen !:: Parser a b !-> Parser a c !-> Parser a c p1 `xthen` p2 = (p1 `and_then` p2) `using` snd
thenx !:: Parser a b !-> Parser a c !-> Parser a b p1 `thenx` p2 = (p1 `and_then` p2) `using` fst
ret !:: Parser a b !-> c !-> Parser a c p `ret` v = p `using` (const v)
succeed, failure, satisfy, literal, alt, and_then, using, string, many, some, string, word, number, xthen, thenx, ret
Expression Parser & Evaluator
data Expr = Const Double | Expr `Add` Expr | Expr `Sub` Expr | Expr `Mul` Expr | Expr `Div` Expr
(Const 3) `Mul` ((Const 6) `Add` (Const 1))) # !=> "3*(6+1)"
parse "3*(6+1)" # !=> (Const 3) `Mul` ((Const 6) `Add` (Const 1)))
(Const 3) Mul ((Const 6) `Add` (Const 1))) # !=> 21
BNF Notation expn !::= expn + expn | expn − expn | expn ∗ expn | expn / expn | digit+ | (expn)
Improving a little: expn !::= term + term | term − term | term term !::= factor ∗ factor | factor / factor | factor factor !::= digit+ | (expn)
Parsers that resemble BNF
addition = ((term `and_then` ((literal '+') `xthen` term)) `using` plus)
subtraction = ((term `and_then` ((literal '-') `xthen` term)) `using` minus)
multiplication = ((factor `and_then` ((literal '*') `xthen` factor)) `using` times)
division = ((factor `and_then` ((literal '/') `xthen` factor)) `using` divide)
parenthesised_expression = ((nibble (literal '(')) `xthen` ((nibble expn) `thenx`(nibble (literal ')'))))
value xs = Const (numval xs) plus (x,y) = x `Add` y minus (x,y) = x `Sub` y times (x,y) = x `Mul` y divide (x,y) = x `Div` y
expn = addition `alt` subtraction `alt` term
term = multiplication `alt` division `alt` factor
factor = (number `using` value) `alt` parenthesised_expn
expn "12*(5+(7-2))" # !=> [ (Const 12.0 `Mul` (Const 5.0 `Add` (Const 7.0 `Sub` Const 2.0)),""), … ]
value xs = Const (numval xs) plus (x,y) = x `Add` y minus (x,y) = x `Sub` y times (x,y) = x `Mul` y divide (x,y) = x `Div` y
value = numval plus (x,y) = x + y minus (x,y) = x - y times (x,y) = x * y divide (x,y) = x / y
expn "12*(5+(7-2))" # !=> [(120.0,""), (12.0,"*(5+(7-2))"), (1.0,"2*(5+(7-2))")]
expn "(12+1)*(5+(7-2))" # !=> [(130.0,""), (13.0,"*(5+(7-2))")]
Moving beyond toy parsers
Whitespace? 🤔 (
white = (literal " ") `alt` (literal "t") `alt` (literal "n")
white = many (any literal " tn")
/s!*/
any p = foldr (alt.p) fail
any p [x1,x2,!!...,xn] = (p x1) `alt` (p x2) `alt` !!... `alt` (p xn)
white = many (any literal " tn")
nibble p = white `xthen` (p `thenx` white)
The parser (nibble p) has the same behaviour as parser p, except that it eats up any white- space in the input string before or afterwards
(nibble (literal 'a')) " a " # !=> [('a',""),('a'," "),('a'," ")]
symbol = nibble.string
symbol "$fold" " $fold " # !=> [("$fold", ""), ("$fold", " ")]
The Offside Rule
w = x + y where x = 10 y = 15 - 5 z = w * 2
w = x + y where x = 10 y = 15 - 5 z = w * 2
When obeying the offside rule, every token must lie either directly below, or to the right of its first token
i.e. A weak indentation policy
The Offside Combinator
type Pos a = (a, (Integer, Integer))
prelex "3 + n 2 * (4 + 5)" # !=> [('3',(0,0)), ('+',(0,2)), ('2',(1,2)), ('*',(1,4)), … ]
satisfy !:: (a !-> Bool) !-> Parser a a satisfy p [] = failure [] satisfy p (x:xs) | p x = succeed x xs !-- if p(x) is true | otherwise = failure []
satisfy !:: (a !-> Bool) !-> Parser (Pos a) a satisfy p [] = failure [] satisfy p (x:xs) | p a = succeed a xs !-- if p(a) is true | otherwise = failure [] where (a, (r, c)) = x
satisfy !:: (a !-> Bool) !-> Parser (Pos a) a satisfy p [] = failure [] satisfy p (x:xs) | p a = succeed a xs !-- if p(a) is true | otherwise = failure [] where (a, (r, c)) = x
offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)] where inpON = takeWhile (onside (head inp)) inp inpOFF = drop (length inpON) inp onside (a, (r, c)) (b, (r', c')) = r' !>= r !&& c' !>= c
offside !:: Parser (Pos a) b !-> Parser (Pos a) b
offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)]
(nibble (literal 'a')) " a " # !=> [('a',""),('a'," "),('a'," ")]
offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)]
offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)] where inpON = takeWhile (onside (head inp)) inp
offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)] where inpON = takeWhile (onside (head inp)) inp inpOFF = drop (length inpON) inp
offside !:: Parser (Pos a) b !-> Parser (Pos a) b offside p inp = [(v, inpOFF) | (v, []) !<- (p inpON)] where inpON = takeWhile (onside (head inp)) inp inpOFF = drop (length inpON) inp onside (a, (r, c)) (b, (r', c')) = r' !>= r !&& c' !>= c
(3 + 2 * (4 + 5)) + (8 * 10) (3 + 2 * (4 + 5)) + (8 * 10)
(offside expn) (prelex inp_1) # !=> [(21.0,[('+',(2,0)),('(',(2,2)),('8',(2,3)),('*', (2,5)),('1',(2,7)),('0',(2,8)),(')',(2,9))])] (offside expn) (prelex inp_2) # !=> [(101.0,[])]
Quick recap before we 🛫
∅ !|> succeed, fail !|> satisfy, literal !|> alt, and_then, using !|> many, some !|> string, thenx, xthen, return !|> expression parser & evaluator !|> any, nibble, symbol !|> prelex, offside
Practical parsers
🎯 Syntactical analysis 🎯 Lexical analysis 🎯 Parse trees
type Parser a b = [a] !-> [(b, [a])] type Pos a = (a, (Integer, Integer))
data Tag = Ident | Number | Symbol | Junk deriving (Show, Eq) type Token = (Tag, [Char])
(Symbol, "if") (Number, "123")
Parse the string with parser p, & apply token t to the result
(p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) !<- p inp] where (x, (r,c)) = head inp
(p `tok` t) inp = [ ((<token>,<pos>),<unused input>) | (xs, out) !<- p inp] where (x, (r,c)) = head inp
(p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) !<- p inp] where (x, (r,c)) = head inp
((string "where") `tok` Symbol) inp # !=> ((Symbol,"where"), (r, c))
many ((p1 `tok` t1) `alt` (p2 `tok` t2) `alt` !!... `alt` (pn `tok` tn))
[(p1, t1), (p2, t2), …, (pn, tn)]
lex = many.(foldr op failure) where (p, t) `op` xs = (p `tok` t) `alt` xs
🐉
lex = many.(foldr op failure) where (p, t) `op` xs = (p `tok` t) `alt` xs
# Rightmost computation cn = (pn `tok` tn) `alt` failure
# Followed by (pn-1 `tok` tn-1) `alt` cn
many ((p1 `tok` t1) `alt` (p2 `tok` t2) `alt` !!... `alt` (pn `tok` tn))
lexer = lex [ ((some (any_of literal " nt")), Junk), ((string "where"), Symbol), (word, Ident), (number, Number), ((any_of string ["(", ")", "="]), Symbol)]
lexer = lex [ ((some (any_of literal " nt")), Junk), ((string "where"), Symbol), (word, Ident), (number, Number), ((any_of string ["(", ")", "="]), Symbol)]
lexer = lex [ ((some (any_of literal " nt")), Junk), ((string "where"), Symbol), (word, Ident), (number, Number), ((any_of string ["(", ")", "="]), Symbol)]
head (lexer (prelex "where x = 10")) # !=> ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[])
(head.lexer.prelex) "where x = 10" # !=> ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[])
(head.lexer.prelex) "where x = 10" # !=> ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[]) Function composition
length ((lexer.prelex) "where x = 10") # !=> 198
Conflicts? Ambiguity?
In this case, "where" is a source of conflict. It can be a symbol, or identifier.
lexer = lex [ {- 1 -} ((some (any_of literal " nt")), Junk), {- 2 -} ((string "where"), Symbol), {- 3 -} (word, Ident), {- 4 -} (number, Number), {- 5 -} ((any_of string ["(",")","="]), Symbol)]
Higher priority, higher precedence
Removing Junk
strip !:: [(Pos Token)] !-> [(Pos Token)] strip = filter ((!!= Junk).fst.fst)
((!!= Junk).fst.fst) ((Symbol,"where"),(0,0)) # !=> True ((!!= Junk).fst.fst) ((Junk,"where"),(0,0)) # !=> False
(fst.head.lexer.prelex) "where x = 10" # !=> [((Symbol,"where"),(0,0)), ((Junk," "),(0,5)), ((Ident,"x"),(0,6)), ((Junk," "),(0,7)), ((Symbol,"="),(0,8)), ((Junk," "),(0,9)), ((Number,"10"),(0,10))]
(strip.fst.head.lexer.prelex) "where x = 10" # !=> [((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10))]
Syntax Analysis
characters !|> lexical analysis !|> tokens
tokens !|> syntax analysis !|> parse trees
f x y = add a b where a = 25 b = sub x y answer = mult (f 3 7) 5
f x y = add a b where a = 25 b = sub x y answer = mult (f 3 7) 5 Script
f x y = add a b where a = 25 b = sub x y answer = mult (f 3 7) 5 Definition
f x y = add a b where a = 25 b = sub x y answer = mult (f 3 7) 5 Body
f x y = add a b where a = 25 b = sub x y answer = mult (f 3 7) 5 Expression
f x y = add a b where a = 25 b = sub x y answer = mult (f 3 7) 5 Definition
f x y = add a b where a = 25 b = sub x y answer = mult (f 3 7) 5 Primitives
data Script = Script [Def] data Def = Def Var [Var] Expn data Expn = Var Var | Num Double | Expn `Apply` Expn | Expn `Where` [Def] type Var = [Char]
prog = (many defn) `using` Script
defn = ( (some (kind Ident)) `and_then` ((lit "=") `xthen` (offside body))) `using` defnFN
body = ( expr `and_then` (((lit "where") `xthen` (some defn)) `opt` [])) `using` bodyFN
expr = (some prim) `using` (foldl1 Apply)
prim = ((kind Ident) `using` Var) `alt` ((kind Number) `using` numFN) `alt` ((lit "(") `xthen` (expr `thenx` (lit ")")))
!-- only allow a kind of tag kind !:: Tag !-> Parser (Pos Token) [Char] kind t = (satisfy ((!== t).fst)) `using` snd — only allow a given symbol lit !:: [Char] !-> Parser (Pos Token) [Char] lit xs = (literal (Symbol, xs)) `using` snd
prog = (many defn) `using` Script
defn = ( (some (kind Ident)) `and_then` ((lit "=") `xthen` (offside body))) `using` defnFN
body = ( expr `and_then` (((lit "where") `xthen` (some defn)) `opt` [])) `using` bodyFN
expr = (some prim) `using` (foldl1 Apply)
prim = ((kind Ident) `using` Var) `alt` ((kind Number) `using` numFN) `alt` ((lit "(") `xthen` (expr `thenx` (lit ")")))
data Script = Script [Def] data Def = Def Var [Var] Expn data Expn = Var Var | Num Double | Expn `Apply` Expn | Expn `Where` [Def] type Var = [Char]
Orange functions are for transforming values.
Use data constructors to generate parse trees
Use evaluation functions to evaluate and generate a value
f x y = add a b where a = 25 b = sub x y answer = mult (f 3 7) 5
Script [ Def "f" ["x","y"] ( ((Var "add" `Apply` Var "a") `Apply` Var "b") `Where` [ Def "a" [] (Num 25.0), Def "b" [] ((Var "sub" `Apply` Var "x") `Apply` Var "y")]), Def "answer" [] ( (Var "mult" `Apply` ( (Var "f" `Apply` Num 3.0) `Apply` Num 7.0)) `Apply` Num 5.0)]
Strategy for writing parsers
1. Identify components i.e. Lexical elements
lexer = lex [ ((some (any_of literal " nt")), Junk), ((string "where"), Symbol), (word, Ident), (number, Number), ((any_of string ["(", ")", "="]), Symbol)]
2. Structure these elements a.k.a. syntax
defn = ((some (kind Ident)) `and_then` ((lit "=") `xthen` (offside body))) `using` defnFN body = (expr `and_then` (((lit "where") `xthen` (some defn)) `opt` [])) `using` bodyFN expr = (some prim) `using` (foldl1 Apply) prim = ((kind Ident) `using` Var) `alt` ((kind Number) `using` numFN) `alt` ((lit "(") `xthen` (expr `thenx` (lit ")")))
3. BNF notation is very helpful
4. TDD in the absence of types
Where to, next?
Monadic Parsers Graham Hutton, Eric Meijer
Introduction to FP Philip Wadler
The Dragon Book If your interest is in compilers
Libraries?
Haskell: Parsec, MegaParsec. ✨ OCaml: Angstrom. ✨ 🚀 Ruby: rparsec, or roll you own Elixir: Combine, ExParsec Python: Parsec. ✨
Thank you!
Twitter: @_swanand GitHub: @swanandp

Combinator parsing