Clojure Instaparse Parser Tutorial: Grammar Syntax
Instaparse supports 3 grammar specification syntax. It supports:
- “Instaparse's EBNF”
- (Extended Backus–Naur Form) grammar syntax (it has some PEG (Parsing Expression Grammar) features.) This is the default.
- “Instaparse's ABNF”
- (Augmented Backus–Naur Form).
- Using Parser Combinators
By default, the function parser
interpret its input as Instaparse's EBNF.
To use ABNF syntax, use the optional parameter :input-format
. For example,
(parser grammar :input-format :abnf)
For how to use parser combinator, see below.
(BNF is a syntax that specify grammar. EBNF is a extension based on BNF. Each parser software has different idea of EBNF. ABNF is a alternative syntax, a standard, with precise specification, but each parser software may also have variations.)
Instaparse's EBNF Syntax
Instaparse's EBNF syntax is very lose. For example, extra spaces can be used, and the separator for left-hand-side and right-hand-side can be any of {:
, :=
, ::=
, =
}, end of a rule may be indicated by a semicolon ;
or a period .
, or simply a newline char, etc.
Meaning | Syntax | Example |
---|---|---|
Rule Definition | : , := , ::= , = | S=A |
End of rule | ; , . (optional) | S=A; |
Alternation | | | A|B |
Concatenation | space(s) or , | A B |
Grouping | (…) | (A|B)C |
Optional | …? , […] | A? , [A] |
One or more | …+ | A+ |
Zero or more | …* , {…} | A* , {A} |
String terminal | "…" , '…' | 'a' , "a" |
Regex terminal | #"…" , #'…' | #'a' , #"a" |
Epsilon | Epsilon , epsilon , EPSILON , eps , ε , "" , '' | S='a' S|Epsilon |
Comment | (*…*) | (*love*) |
Hide element in parse tree | <…> | A=('+'|'-') <' '> NUMBER |
Name | Syntax | Example |
Lookahead | & | &A |
Negative lookahead | ! | !A |
Ordered Alternative | / | A/B |
Look Ahead Example
here's a look-ahead example.
;; lookahead example (ns example.core (:require [instaparse.core :as insta])) (def x7 (insta/parser "S = &'ab'('a'|'b')+")) (x7 "aba") ;; output ;; [:S "a" "b" "a"] (x7 "aaa") ;; output ;; Parse error at line 1, column 1: ;; aaa ;; ^ ;; Expected: ;; "ab"
in the above example, the &'ab'
is a condition that applies to the next unit of the grammar rule. If a string does not starts with ab
, then the rule ('a'|'b')+
is ignored.
Here is my favorite example of lookahead, a parser that only succeeds on strings with a run of a's followed by a run of b's followed by a run of c's, where each of those runs must be the same length. If you've ever taken an automata course, you may remember that there is a very elegant proof that it is impossible to express this set of constraints with a pure context-free grammar. Well, with lookahead, it is possible:
;; language, a followed by b followed by c, but only if each has the same length (def abc (insta/parser "S = &(A 'c') 'a'+ B A = 'a' A? 'b' <B> = 'b' B? 'c'")) (abc "aaabbbccc") ;; output [:S "a" "a" "a" "b" "b" "b" "c" "c" "c"]
negative look-ahead is the like look-ahead but with negation. That is, the grammar rule unit after the negative look-ahead is applied only if the input string at that point does not match the negative look-ahead spec.
ABNF
For Instaparse's ABNF syntax, see https://github.com/Engelberg/instaparse/blob/master/docs/ABNF.md
Combinators
The main idea of parser combinator is that parsers (as functions) can be combined to create new parsers. See Parser combinator
To use parser combinator, you need to load (use 'instaparse.combinators)
and build the grammar. It is a Clojure map or vector data type. Then, feed this to the function “parser”. Example:
(use 'instaparse.core) (use 'instaparse.combinators) (def myGrammar xyz) (parser myGrammar)
For detail of Instaparse's combinators syntax, see https://github.com/Engelberg/instaparse#combinators