Clojure Instaparse Parser Tutorial

By Xah Lee. Date: . Last updated: .

Instaparse is a Clojure library. It is a parser. Basically, it has a function that takes a bnf-like grammar (string), output a function, and this function takes a string input, and output a nested list (syntax tree) for that grammar.

Instaparse home page is at https://github.com/Engelberg/instaparse

Loading Instaparse Library

The easiest way to use it, is to create a new Clojure project via Leiningen, by:

# create a clojure project
lein new app parser-play

then, just add the following line to your leiningen config file project.clj. This will make sure the library is downloaded. For example, you should have this:

:dependencies [[org.clojure/clojure "1.6.0"]
                 [instaparse "1.3.4"]
                 ]

Then, in your source code, you need to load the lib. Example:

;; load the lib as “insta”
(ns example.core (:require [instaparse.core :as insta]))

You can also play with instaparse using lein repl by lein repl

If you are not familiar with Leiningen, see Clojure Leiningen Tutorial.

You should have working knowledge with Clojure the language before you can use Instaparse. 〔➤see Xah Clojure Tutorial

Basic Example

(parser …) is a parser generator. It takes in a grammar (as string) and returns a function.

;; load the lib as “insta”
(ns example.core (:require [instaparse.core :as insta]))

(def xx
  (insta/parser
    "S = AB*
     AB = A B
     A = 'a'+
     B = 'b'+"))

The argument to (parser …) is a string. The string is Instaparse's EBNF notation to specify a grammar. Each line is a grammar rule.

In the above example, the grammar means:

〔➤see Clojure Instaparse Parser Tutorial: Grammar Syntax

(parser …) returns a function. So, xx is now a parser function.

Now, if we call “xx” with the input string "aaaaabbbaaaabb":

(xx "aaaaabbbaaaabb")

It returns the parse tree:

[:S
 [:AB
  [:A "a" "a" "a" "a" "a"]
  [:B "b" "b" "b"]]
 [:AB
  [:A "a" "a" "a" "a"]
  [:B "b" "b"]]]

Instaparse can also take a grammar from a URL. Example:

(def phone-uri-parser
     (insta/parser
      "https://raw.github.com/Engelberg/instaparse/master/test/instaparse/phone_uri.txt"
      :input-format :abnf))
(def lang-x-parser
     (insta/parser "lang-x-grammar.bnf"))

The URL content should be the grammar, but without the beginning and ending string quotes. For example, here's a file content:

S = AB*
AB = A B
A = 'a'+
B = 'b'+

Example 2

Here's a example of a simple language, of the form n+n+… that is single-digit integer addition. (without spaces)

;; load the lib as “insta”
(ns example.core (:require [instaparse.core :as insta]))

(def pp
     (insta/parser
      "S = N | (N ('+' N)+);
       N = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';"
))

(pp "1")
;; output
;; [:S [:N "1"]]

(pp "3+4")
;; output
;; [:S [:N "3"] "+" [:N "4"]]

(pp "3+4+2")
;; output
;; [:S [:N "3"] "+" [:N "4"] "+" [:N "2"]]

Example 3

Here's a example of addition of digits, with space as optional separator.

;; load the lib as “insta”
(ns example.core (:require [instaparse.core :as insta]))

(def qq
     (insta/parser
      "S = DIGIT (SPACES '+' SPACES DIGIT)*;
       SPACES = ' '*;
       DIGIT = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';"
      ))

(qq "1")
;; output
;; [:S [:DIGIT "1"]]

(qq "3+4")
;; output
;; [:S [:DIGIT "3"] [:SPACES] "+" [:SPACES] [:DIGIT "4"]]

(qq "3 +4")
;; output
;; [:S [:DIGIT "3"] [:SPACES " "] "+" [:SPACES] [:DIGIT "4"]]

(qq "3 + 4")
;; output
;; [:S [:DIGIT "3"] [:SPACES " "] "+" [:SPACES " "] [:DIGIT "4"]]

(qq "3+4+2")
;; output
;; [:S [:DIGIT "3"] [:SPACES] "+" [:SPACES] [:DIGIT "4"] [:SPACES] "+" [:SPACES] [:DIGIT "2"]]

Instaparse Functions

The functions provided by Instaparse are:

How to use Instaparse?

Here's the general use pattern.

Function: parser

(parser grammarX) → returns a function for parsing grammar grammarX.

The returned function f takes a input string and returns a parse tree (a nested vector, by default). (f x …) is a equivalent to (parse f x …) . See parse for detail of f's input and output.

Optional arguments for parser:

〔➤see Clojure Instaparse Parser Tutorial: Grammar Syntax

〔➤see Clojure Instaparse Parser Tutorial: Output Formats

Function: parse

(parse parser inputX) → return a parse tree (or failure object), using parser parser on input string inputX. The parser is a function returned by function parser.

Optional arguments for parse:

〔➤see Clojure Instaparse Parser Tutorial: Hide Tokens

Function: parses

(parses parser inputX) → returns a lazy sequence of all possible parse tree of inputX. (if failure, returns empty list, with failure object attached as Clojure Metadata)

“parses” has the same optional parameters as “parse”, except :optimize.

“parses” is most useful when you are still experimenting with your grammar. For example, creating a equivalent grammar that has only one parse tree.

Function: transform

(transform key/function map tree) → traverse the parse tree tree depth first, and for each node in tree, apply the corresponding function in key/function map. Return the result. The key/function map has the form {:key1 f1, :key2 f2, …}

detail: Clojure Instaparse Parser Tutorial: Transform Parse Tree

Function: failure?, get-failure

(failure? result → returns true if result is a parsing failure. The result is output from (parse …) or (parses …)

(get-failure result → get the failure object.

detail: Clojure Instaparse Parser Tutorial: Parse Error

Function: span

(span parse_tree)→ returns a [start-index end-index] of the char span of the sub tree of parse_tree. The start-index is inclusive, end-index is exclusive. Returns nil if no span metadata is attached.

Function: visualize

(visualize parse_tree …) → Creates a graphviz visualization of the parse tree.

This requires clojure lib “rhizome” https://github.com/ztellman/rhizome, which in turn requires the Graphviz tool for actually generating the image.

Optional keyword arguments:

Important: This function will only work if you have added rhizome to your dependencies, and installed graphviz on your system. See https://github.com/ztellman/rhizome for more information.

Grammar Syntax

Clojure Instaparse Parser Tutorial: Grammar Syntax

Transform Parse Tree

Clojure Instaparse Parser Tutorial: Transform Parse Tree

Parse Tree Output Formats

Clojure Instaparse Parser Tutorial: Output Formats

Hiding Elements in Output

Clojure Instaparse Parser Tutorial: Hide Tokens

“Partial Parse”: Show Parse Progress

Clojure Instaparse Parser Tutorial: Partial Parse

“Total Parse”: Force Return Parse Tree with Failure Embedded

Clojure Instaparse Parser Tutorial: Total Parse

Change Start Rule

Clojure Instaparse Parser Tutorial: Change Start Rule

Instaparse Parse Quirks

Instaparse supports any context-free grammar, including left recrusion, right recrusion.

;; right recursion grammar example
((insta/parser "S = 'a' S | ''") "aaaa")
[:S "a" [:S "a" [:S "a" [:S "a" [:S]]]]]
;; left recursion grammar example
((insta/parser "S = S 'a' | ''") "aaaa")
;; [:S [:S [:S [:S [:S] "a"] "a"] "a"] "a"]

Handling Infinite Loop Grammar

;; example of grammar that are infinite loop
((insta/parser "S = S") "a")

;; output
;; Parse error at line 1, column 1:
;; a
;; ^

Handling Ambiguous Grammar

Ambiguous Grammar is one that has many possible parse tree.

(def abi
  (insta/parser
    "S = A A
     A = 'a'*"))

(abi "aaaa")
;; is same as
(insta/parse abi "aaaa")
;; output
;; [:S
;;  [:A "a"]
;;  [:A "a" "a" "a"]]

;; show all possible parse tree
(insta/parses abi "aaaa")
;; output
;; ([:S
;;   [:A "a"]
;;   [:A "a" "a" "a"]]
;;  [:S
;;   [:A "a" "a" "a" "a"]
;;   [:A]]
;;  [:S
;;   [:A "a" "a"]
;;   [:A "a" "a"]]
;;  [:S
;;   [:A "a" "a" "a"]
;;   [:A "a"]]
;;  [:S
;;   [:A]
;;   [:A "a" "a" "a" "a"]])

Notes and Exercise

• Instaparse's accepted grammar syntax is abusive of sugar syntax. Find a example where it is ambiguous or some other ill behavior. (note: in parser community, “ambiguous grammar” means a grammar that has more than one passible parse tree. Here, for instaparse's grammar syntax, i meant to say that it's “ambiguous” in the sense that what language it specifies is not determined, or no well defined. For example, explore languages that has lots white spaces, or grammar rules's left-hand-side and right-hand-side cannot be determined without a end of rule marker.)

• find a systematic (or formal) way to prove or disprove whether a grammar is ambiguous. (note: this is different from “ambiguous grammar”)

• explore, or formally find equivalence, of BNF and EBNF. For example, write a translation engine. For example, for the operators {*, +} in EBNF, find systematic way to translate them to BNF without. Also, consider translation between EBNF and ABNF

Reference

Instaparse home page https://github.com/Engelberg/instaparse