Clojure Instaparse Parser Tutorial

By Xah Lee. Date: 2014-11-22. Last updated: 2015-01-01.

Instaparse is a Clojure library. It is a parser. Basically, it has a function that takes a bnf-like grammar (string), output a function, and this function takes a string input, and output a nested list (syntax tree) for that grammar.

Instaparse home page is at https://github.com/Engelberg/instaparse

Loading Instaparse Library

The easiest way to use it, is to create a new Clojure project via Leiningen, by:

# create a clojure project
lein new app parser-play

then, just add the following line to your leiningen config file project.clj. This will make sure the library is downloaded. For example, you should have this:

:dependencies [[org.clojure/clojure "1.6.0"]
                 [instaparse "1.3.4"]
                 ]

Then, in your source code, you need to load the lib. Example:

;; load the lib as “insta”
(ns example.core (:require [instaparse.core :as insta]))

You can also play with instaparse using lein repl by lein repl

If you are not familiar with Leiningen, see Clojure Leiningen Tutorial .

You should have working knowledge with Clojure the language before you can use Instaparse. [see Clojure Tutorial]

Basic Example

(parser …) is a parser generator. It takes in a grammar (as string) and returns a function.

;; load the lib as “insta”
(ns example.core (:require [instaparse.core :as insta]))

(def xx
  (insta/parser
    "S = AB*
     AB = A B
     A = 'a'+
     B = 'b'+"))

The argument to (parser …) is a string. The string is Instaparse's EBNF notation to specify a grammar. Each line is a grammar rule.

In the above example, the grammar means:

Start with 0 or more AB
AB can be replaced by A followed by B
A is 1 or more “a”
B is 1 or more “b”

[see Clojure Instaparse Parser Tutorial: Grammar Syntax]

(parser …) returns a function. So, xx is now a parser function.

Now, if we call “xx” with the input string "aaaaabbbaaaabb":

(xx "aaaaabbbaaaabb")

It returns the parse tree:

[:S
 [:AB
  [:A "a" "a" "a" "a" "a"]
  [:B "b" "b" "b"]]
 [:AB
  [:A "a" "a" "a" "a"]
  [:B "b" "b"]]]

Each grammar rule results in 1 level of nesting in the result parse tree.
The first element of each node in parse tree is the name of a rule. (as a Clojure key datatype)

Instaparse can also take a grammar from a URL. Example:

(def phone-uri-parser
     (insta/parser
      "https://raw.github.com/Engelberg/instaparse/master/test/instaparse/phone_uri.txt"
      :input-format :abnf))

(def lang-x-parser
     (insta/parser "lang-x-grammar.bnf"))

The URL content should be the grammar, but without the beginning and ending string quotes. For example, here's a file content:

S = AB*
AB = A B
A = 'a'+
B = 'b'+

Example 2

Here's a example of a simple language, of the form n+n+… that is single-digit integer addition. (without spaces)

;; load the lib as “insta”
(ns example.core (:require [instaparse.core :as insta]))

(def pp
     (insta/parser
      "S = N | (N ('+' N)+);
       N = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';"
))

(pp "1")
;; output
;; [:S [:N "1"]]

(pp "3+4")
;; output
;; [:S [:N "3"] "+" [:N "4"]]

(pp "3+4+2")
;; output
;; [:S [:N "3"] "+" [:N "4"] "+" [:N "2"]]

Example 3

Here's a example of addition of digits, with space as optional separator.

;; load the lib as “insta”
(ns example.core (:require [instaparse.core :as insta]))

(def qq
     (insta/parser
      "S = DIGIT (SPACES '+' SPACES DIGIT)*;
       SPACES = ' '*;
       DIGIT = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';"
      ))

(qq "1")
;; output
;; [:S [:DIGIT "1"]]

(qq "3+4")
;; output
;; [:S [:DIGIT "3"] [:SPACES] "+" [:SPACES] [:DIGIT "4"]]

(qq "3 +4")
;; output
;; [:S [:DIGIT "3"] [:SPACES " "] "+" [:SPACES] [:DIGIT "4"]]

(qq "3 + 4")
;; output
;; [:S [:DIGIT "3"] [:SPACES " "] "+" [:SPACES " "] [:DIGIT "4"]]

(qq "3+4+2")
;; output
;; [:S [:DIGIT "3"] [:SPACES] "+" [:SPACES] [:DIGIT "4"] [:SPACES] "+" [:SPACES] [:DIGIT "2"]]

Instaparse Functions

The functions provided by Instaparse are:

“parser”: Takes a grammar string and returns a parser function.
“parse”: Takes a parser function and returns a parse tree (or failure object).
“parses”: Takes a parser function and returns all possible parse tree and or partial parse tree. (if failure, returns empty list with Clojure Metadata of the failure object.)
“transform”: Takes a parse tree and a keyword/function map, returns the transformed result.
“failure?”: Takes output from “parse” or “parses”, returns true or false.
“get-failure”: Takes output from “parse” or “parses”, returns the failure object. (if the output is a failed parse.)
“span”: Takes a parse tree (or parts of), return the corresponding start/end position of input
“visualize”: Shows a image of the parse tree (require other libs).

How to use Instaparse?

Here's the general use pattern.

Use (parser grammar) to generate a parser function f.
Use (parse f input) to generate a parser tree.
Use (transform parse_tree map) to compute, do things, to your parse tree.

Use (parses f input …) to check for errors or debug your grammar.
Use (span parse_tree) to check for errors or debug your grammar.
Use (failure? parse_output) and (get-failure parse_output) to check for parse errors or get parse error.

Function: parser

(parser grammarX): Return a function for parsing grammar grammarX.

The returned function f takes a input string and returns a parse tree (a nested vector, by default). (f x …) is a equivalent to (parse f x …) . See parse for detail of f's input and output.

Optional arguments for parser:

:input-format :ebnf: Interpret the input as Instaparse's default syntax. This is the default.
:input-format :abnf: interpret input as ABNF grammar.

[see Clojure Instaparse Parser Tutorial: Grammar Syntax]

:output-format :hiccup: default.
:output-format :enlive

[see Clojure Instaparse Parser Tutorial: Output Formats]

:start :rule_name: Specify start rule. rule_name is the left-hand-side of a grammar rule. Note the : in front. By default, it's the first rule. [see Clojure Instaparse Parser Tutorial: Change Start Rule]
:string-ci true: Ignore letter case in all string literals in grammar. (case insensitive)
:no-slurp true: When true, the input is interpreted as grammar. When false, smartly determine if the input is a URI (that contains grammar). (the name “slurp” came from Clojure standard function “slurp” for reading file content. )
:auto-whitespace :standard
:auto-whitespace :comma
:auto-whitespace custom-whitespace-parser: ?

Function: parse

(parse parser inputX): Return a parse tree (or failure object), using parser parser on input string inputX. The parser is a function returned by function parser.

Optional arguments for parse:

:unhide :content: Unhide terminal symbols.
:unhide :tags: Unhide rules.
:unhide :all: Unhide both terminal symbols and rules.

[see Clojure Instaparse Parser Tutorial: Hide Tokens]

:start :rule_left_side_name: [see Clojure Instaparse Parser Tutorial: Change Start Rule]
:partial true: Return a lazy sequence of parse trees that are results of parsing increasingly more the input. [see Clojure Instaparse Parser Tutorial: Partial Parse]
:total true: When parsing fails, force return a parse tree, with failure object embedded as a element in parse tree. [see Clojure Instaparse Parser Tutorial: Total Parse]
:optimize :memory: Try to use less memory.

Function: parses

(parses parser inputX): Return a lazy sequence of all possible parse tree of inputX. (if failure, returns empty list, with failure object attached as Clojure Metadata)

“parses” has the same optional parameters as “parse”, except :optimize.

“parses” is most useful when you are still experimenting with your grammar. For example, creating a equivalent grammar that has only one parse tree.

Function: transform

Clojure Instaparse Parser Tutorial: Transform Parse Tree

Function: failure?, get-failure

(failure? result): Return true if result is a parsing failure. The result is output from (parse …) or (parses …)
(get-failure result): get the failure object.

detail: Clojure Instaparse Parser Tutorial: Parse Error

Function: span

(span parse_tree): Return a [start-index end-index] of the char span of the sub tree of parse_tree. The start-index is inclusive, end-index is exclusive. Returns nil if no span metadata is attached.

Function: visualize

(visualize parse_tree …): Creates a graphviz visualization of the parse tree.

This requires clojure lib “rhizome” https://github.com/ztellman/rhizome, which in turn requires the Graphviz tool for actually generating the image.

Optional keyword arguments:

:output-file filename: save image to file named filename
:options options: pass options options to “rhizome”.

Important: This function will only work if you have added rhizome to your dependencies, and installed graphviz on your system. See https://github.com/ztellman/rhizome for more information.

Grammar Syntax

Clojure Instaparse Parser Tutorial: Grammar Syntax

Transform Parse Tree

Clojure Instaparse Parser Tutorial: Transform Parse Tree

Parse Tree Output Formats

Clojure Instaparse Parser Tutorial: Output Formats

Instaparse Parse Quirks

Instaparse supports any context-free grammar, including left recrusion, right recrusion.

;; right recursion grammar example
((insta/parser "S = 'a' S | ''") "aaaa")
[:S "a" [:S "a" [:S "a" [:S "a" [:S]]]]]

;; left recursion grammar example
((insta/parser "S = S 'a' | ''") "aaaa")
;; [:S [:S [:S [:S [:S] "a"] "a"] "a"] "a"]

Handling Infinite Loop Grammar

;; example of grammar that are infinite loop
((insta/parser "S = S") "a")

;; output
;; Parse error at line 1, column 1:
;; a
;; ^

Handling Ambiguous Grammar

Ambiguous Grammar is one that has many possible parse tree.

(def abi
  (insta/parser
    "S = A A
     A = 'a'*"))

(abi "aaaa")
;; is same as
(insta/parse abi "aaaa")
;; output
;; [:S
;;  [:A "a"]
;;  [:A "a" "a" "a"]]

;; show all possible parse tree
(insta/parses abi "aaaa")
;; output
;; ([:S
;;   [:A "a"]
;;   [:A "a" "a" "a"]]
;;  [:S
;;   [:A "a" "a" "a" "a"]
;;   [:A]]
;;  [:S
;;   [:A "a" "a"]
;;   [:A "a" "a"]]
;;  [:S
;;   [:A "a" "a" "a"]
;;   [:A "a"]]
;;  [:S
;;   [:A]
;;   [:A "a" "a" "a" "a"]])

Notes and Exercise

• Instaparse's accepted grammar syntax is abusive of sugar syntax. Find a example where it is ambiguous or some other ill behavior. (note: in parser community, “ambiguous grammar” means a grammar that has more than one passible parse tree. Here, for instaparse's grammar syntax, i meant to say that it's “ambiguous” in the sense that what language it specifies is not determined, or no well defined. For example, explore languages that has lots white spaces, or grammar rules's left-hand-side and right-hand-side cannot be determined without a end of rule marker.)

• find a systematic (or formal) way to prove or disprove whether a grammar is ambiguous. (note: this is different from “ambiguous grammar”)

• explore, or formally find equivalence, of BNF and EBNF. For example, write a translation engine. For example, for the operators {*, +} in EBNF, find systematic way to translate them to BNF without. Also, consider translation between EBNF and ABNF

Reference

Instaparse home page https://github.com/Engelberg/instaparse