Clojure Instaparse Parser Tutorial
Instaparse is a Clojure library. It is a parser. Basically, it has a function that takes a bnf-like grammar (string), output a function, and this function takes a string input, and output a nested list (syntax tree) for that grammar.
Instaparse home page is at https://github.com/Engelberg/instaparse
Loading Instaparse Library
The easiest way to use it, is to create a new Clojure project via Leiningen, by:
# create a clojure project lein new app parser-play
then, just add the following line to your leiningen config file project.clj
. This will make sure the library is downloaded. For example, you should have this:
:dependencies [[org.clojure/clojure "1.6.0"] [instaparse "1.3.4"] ]
Then, in your source code, you need to load the lib. Example:
;; load the lib as “insta” (ns example.core (:require [instaparse.core :as insta]))
You can also play with instaparse using lein repl by lein repl
If you are not familiar with Leiningen, see Clojure Leiningen Tutorial .
You should have working knowledge with Clojure the language before you can use Instaparse. 〔see Clojure Tutorial〕
Basic Example
(parser …)
is a parser generator. It takes in a grammar (as string) and returns a function.
;; load the lib as “insta” (ns example.core (:require [instaparse.core :as insta])) (def xx (insta/parser "S = AB* AB = A B A = 'a'+ B = 'b'+"))
The argument to (parser …)
is a string. The string is Instaparse's EBNF notation to specify a grammar. Each line is a grammar rule.
In the above example, the grammar means:
- Start with 0 or more AB
- AB can be replaced by A followed by B
- A is 1 or more “a”
- B is 1 or more “b”
〔see Clojure Instaparse Parser Tutorial: Grammar Syntax〕
(parser …)
returns a function. So, xx is now a parser function.
Now, if we call “xx” with the input string "aaaaabbbaaaabb"
:
(xx "aaaaabbbaaaabb")
It returns the parse tree:
[:S [:AB [:A "a" "a" "a" "a" "a"] [:B "b" "b" "b"]] [:AB [:A "a" "a" "a" "a"] [:B "b" "b"]]]
- Each grammar rule results in 1 level of nesting in the result parse tree.
- The first element of each node in parse tree is the name of a rule. (as a Clojure key datatype)
Instaparse can also take a grammar from a URL. Example:
(def phone-uri-parser (insta/parser "https://raw.github.com/Engelberg/instaparse/master/test/instaparse/phone_uri.txt" :input-format :abnf))
(def lang-x-parser (insta/parser "lang-x-grammar.bnf"))
The URL content should be the grammar, but without the beginning and ending string quotes. For example, here's a file content:
S = AB* AB = A B A = 'a'+ B = 'b'+
Example 2
Here's a example of a simple language, of the form n+n+…
that is single-digit integer addition. (without spaces)
;; load the lib as “insta” (ns example.core (:require [instaparse.core :as insta])) (def pp (insta/parser "S = N | (N ('+' N)+); N = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';" )) (pp "1") ;; output ;; [:S [:N "1"]] (pp "3+4") ;; output ;; [:S [:N "3"] "+" [:N "4"]] (pp "3+4+2") ;; output ;; [:S [:N "3"] "+" [:N "4"] "+" [:N "2"]]
Example 3
Here's a example of addition of digits, with space as optional separator.
;; load the lib as “insta” (ns example.core (:require [instaparse.core :as insta])) (def qq (insta/parser "S = DIGIT (SPACES '+' SPACES DIGIT)*; SPACES = ' '*; DIGIT = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';" )) (qq "1") ;; output ;; [:S [:DIGIT "1"]] (qq "3+4") ;; output ;; [:S [:DIGIT "3"] [:SPACES] "+" [:SPACES] [:DIGIT "4"]] (qq "3 +4") ;; output ;; [:S [:DIGIT "3"] [:SPACES " "] "+" [:SPACES] [:DIGIT "4"]] (qq "3 + 4") ;; output ;; [:S [:DIGIT "3"] [:SPACES " "] "+" [:SPACES " "] [:DIGIT "4"]] (qq "3+4+2") ;; output ;; [:S [:DIGIT "3"] [:SPACES] "+" [:SPACES] [:DIGIT "4"] [:SPACES] "+" [:SPACES] [:DIGIT "2"]]
Instaparse Functions
The functions provided by Instaparse are:
- “parser”
- Takes a grammar string and returns a parser function.
- “parse”
- Takes a parser function and returns a parse tree (or failure object).
- “parses”
- Takes a parser function and returns all possible parse tree and or partial parse tree. (if failure, returns empty list with Clojure Metadata of the failure object.)
- “transform”
- Takes a parse tree and a keyword/function map, returns the transformed result.
- “failure?”
-
Takes output from “parse” or “parses”, returns
true
orfalse
. - “get-failure”
- Takes output from “parse” or “parses”, returns the failure object. (if the output is a failed parse.)
- “span”
- Takes a parse tree (or parts of), return the corresponding start/end position of input
- “visualize”
- Shows a image of the parse tree (require other libs).
How to use Instaparse?
Here's the general use pattern.
- Use
(parser grammar)
to generate a parser function f. - Use
(parse f input)
to generate a parser tree. - Use
(transform parse_tree map)
to compute, do things, to your parse tree.
- Use
(parses f input …)
to check for errors or debug your grammar. - Use
(span parse_tree)
to check for errors or debug your grammar. - Use
(failure? parse_output)
and(get-failure parse_output)
to check for parse errors or get parse error.
Function: parser
(parser grammarX)
- Return a function for parsing grammar grammarX.
The returned function f takes a input string and returns a parse tree (a nested vector, by default). (f x …)
is a equivalent to (parse f x …)
. See parse
for detail of f's input and output.
Optional arguments for parser
:
:input-format :ebnf
- Interpret the input as Instaparse's default syntax. This is the default.
:input-format :abnf
- interpret input as ABNF grammar.
〔see Clojure Instaparse Parser Tutorial: Grammar Syntax〕
:output-format :hiccup
- default.
:output-format :enlive
〔see Clojure Instaparse Parser Tutorial: Output Formats〕
:start :rule_name
-
Specify start rule. rule_name is the left-hand-side of a grammar rule. Note the
:
in front. By default, it's the first rule. 〔see Clojure Instaparse Parser Tutorial: Change Start Rule〕 :string-ci true
- Ignore letter case in all string literals in grammar. (case insensitive)
:no-slurp true
- When true, the input is interpreted as grammar. When false, smartly determine if the input is a URI (that contains grammar). (the name “slurp” came from Clojure standard function “slurp” for reading file content. )
:auto-whitespace :standard
:auto-whitespace :comma
:auto-whitespace custom-whitespace-parser
- ?
Function: parse
(parse parser inputX)
-
Return a parse tree (or failure object), using parser parser on input string inputX. The parser is a function returned by function
parser
.
Optional arguments for parse
:
:unhide :content
- Unhide terminal symbols.
:unhide :tags
- Unhide rules.
:unhide :all
- Unhide both terminal symbols and rules.
〔see Clojure Instaparse Parser Tutorial: Hide Tokens〕
:start :rule_left_side_name
- 〔see Clojure Instaparse Parser Tutorial: Change Start Rule〕
:partial true
- Return a lazy sequence of parse trees that are results of parsing increasingly more the input. 〔see Clojure Instaparse Parser Tutorial: Partial Parse〕
:total true
- When parsing fails, force return a parse tree, with failure object embedded as a element in parse tree. 〔see Clojure Instaparse Parser Tutorial: Total Parse〕
:optimize :memory
- Try to use less memory.
Function: parses
(parses parser inputX)
- Return a lazy sequence of all possible parse tree of inputX. (if failure, returns empty list, with failure object attached as Clojure Metadata)
“parses” has the same optional parameters as “parse”, except :optimize
.
“parses” is most useful when you are still experimenting with your grammar. For example, creating a equivalent grammar that has only one parse tree.
Function: transform
Clojure Instaparse Parser Tutorial: Transform Parse Tree
Function: failure?, get-failure
(failure? result)
-
Return
true
if result is a parsing failure. The result is output from(parse …)
or(parses …)
(get-failure result)
- get the failure object.
detail: Clojure Instaparse Parser Tutorial: Parse Error
Function: span
(span parse_tree)
-
Return a
[start-index end-index]
of the char span of the sub tree of parse_tree. The start-index is inclusive, end-index is exclusive. Returnsnil
if no span metadata is attached.
Function: visualize
(visualize parse_tree …)
- Creates a graphviz visualization of the parse tree.
This requires clojure lib “rhizome” https://github.com/ztellman/rhizome, which in turn requires the Graphviz tool for actually generating the image.
Optional keyword arguments:
:output-file filename
- save image to file named filename
:options options
- pass options options to “rhizome”.
Important: This function will only work if you have added rhizome to your dependencies, and installed graphviz on your system. See https://github.com/ztellman/rhizome for more information.
Grammar Syntax
Clojure Instaparse Parser Tutorial: Grammar Syntax
Transform Parse Tree
Clojure Instaparse Parser Tutorial: Transform Parse Tree
Parse Tree Output Formats
Clojure Instaparse Parser Tutorial: Output Formats
Hiding Elements in Output
Clojure Instaparse Parser Tutorial: Hide Tokens
“Partial Parse”: Show Parse Progress
Clojure Instaparse Parser Tutorial: Partial Parse
“Total Parse”: Force Return Parse Tree with Failure Embedded
Clojure Instaparse Parser Tutorial: Total Parse
Change Start Rule
Clojure Instaparse Parser Tutorial: Change Start Rule
Instaparse Parse Quirks
Instaparse supports any context-free grammar, including left recrusion, right recrusion.
;; right recursion grammar example ((insta/parser "S = 'a' S | ''") "aaaa") [:S "a" [:S "a" [:S "a" [:S "a" [:S]]]]]
;; left recursion grammar example ((insta/parser "S = S 'a' | ''") "aaaa") ;; [:S [:S [:S [:S [:S] "a"] "a"] "a"] "a"]
Handling Infinite Loop Grammar
;; example of grammar that are infinite loop ((insta/parser "S = S") "a") ;; output ;; Parse error at line 1, column 1: ;; a ;; ^
Handling Ambiguous Grammar
Ambiguous Grammar is one that has many possible parse tree.
(def abi (insta/parser "S = A A A = 'a'*")) (abi "aaaa") ;; is same as (insta/parse abi "aaaa") ;; output ;; [:S ;; [:A "a"] ;; [:A "a" "a" "a"]] ;; show all possible parse tree (insta/parses abi "aaaa") ;; output ;; ([:S ;; [:A "a"] ;; [:A "a" "a" "a"]] ;; [:S ;; [:A "a" "a" "a" "a"] ;; [:A]] ;; [:S ;; [:A "a" "a"] ;; [:A "a" "a"]] ;; [:S ;; [:A "a" "a" "a"] ;; [:A "a"]] ;; [:S ;; [:A] ;; [:A "a" "a" "a" "a"]])
Notes and Exercise
• Instaparse's accepted grammar syntax is abusive of sugar syntax. Find a example where it is ambiguous or some other ill behavior. (note: in parser community, “ambiguous grammar” means a grammar that has more than one passible parse tree. Here, for instaparse's grammar syntax, i meant to say that it's “ambiguous” in the sense that what language it specifies is not determined, or no well defined. For example, explore languages that has lots white spaces, or grammar rules's left-hand-side and right-hand-side cannot be determined without a end of rule marker.)
• find a systematic (or formal) way to prove or disprove whether a grammar is ambiguous. (note: this is different from “ambiguous grammar”)
• explore, or formally find equivalence, of BNF and EBNF. For example, write a translation engine. For example, for the operators {*
, +
} in EBNF, find systematic way to translate them to BNF without. Also, consider translation between EBNF and ABNF
Reference
Instaparse home page https://github.com/Engelberg/instaparse