Pattern Matching vs Grammar Specification

By Xah Lee. Date: . Last updated: .

This page is a personal note on the relation of pattern matching and grammar specification, and their application in formal logic. Pattern matching here can mean textual pattern matching such as regex or list structure and datatype pattern matching such as in Mathematica.

A Need for Grammar Spec Lang

In computing, there's pattern matching. For example, those in Mathematica, for pattern matching list structure and datatypes ( Mathematica patterns ) , and those in perl, for pattern matching strings. These pattern matching languages are designed to match patterns, but i have realized, that they are rather unfit, when used as a mean to specify the structure of textual forms. In short, what i came to understand is that pattern matching facilities (may it be regex for strings or Mathematica's for structure), although powerful, but is not capable of specifying formal grammar, even very simple ones. (this seems obvious now when stated as above, but it was a realization for me.)

Often, when writing doc or in verbal communication, on computer programing or doing math with Mathematica, you are tempted to use a pattern to communicate a particular form a expression may have. For example, you would like to say that email address has the form xyz, where xyz is a perl regex:

[A-z0-9]+@[A-z]+\.com

Similarly, you might use regex to communicate the form of a url, domain name, file path. In math (especially in formal logic or in computer algebra systems), i often want to say that certain function has the form xyz, where xyz is a Mathematica pattern that indicates the function's parameters, output, and their types. (i.e.: a computer language expression analogous to traditional notation f:ℂ²→ℝ³). However, in practice, using regex or Mathematica's patterns for the purpose of specifying a form, is often unsatisfactory. Typically, using patterns only gives a general idea, but isn't a correct specification of the form. For the purpose of giving a general idea, verbal English description is almost as good. Almost.

For example, in the official doc for E-mail address, RFC 2822, section 1.2.2 and section 3.4 〔http://tools.ietf.org/html/rfc2822#section-1.2.2http://tools.ietf.org/html/rfc2822#section-3.4 〕 it does not use a regex to specify email address's syntax. Instead, it uses a variant of ABNF (Augmented Backus–Naur Form) mixed with many pages of explanations, for human reading. 〔➤see What's the Difference Between BNF, EBNF, ABNF?

Here's a regex for email address that presumably correctly match the spec in RFC 2822.

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

(from http://www.regular-expressions.info/email.html)

The regex is 426 chars long and not human readable.

It is quite desirable to have a simple language for specifying syntax, that can be parsed by a computer, yet designed in a human-readable way, and concise. With such a language, we could use it to verify if a text is a valid form. We could use it for human communication. Conceivably, it could replace regex for string pattern matching or much enhance pattern matching as used in Mathematica or other functional langs.

Pattern Matching vs Grammar Specification

Pattern matching is used to check if a given text is valid (matches the pattern). Grammar spec lang such as BNF does reverse: it takes a spec and “generates” all possible valid texts. Can there be a single syntax, that works both ways?

Parsing Expression Grammar!

I have found what i wanted!! See Parsing expression grammar.

Apparently, it's still not in wide use. There are one or two implementations in languages like Haskell and Perl6. I would like to see this in Perl, Python, PHP, JavaScript, and most of all, Emacs Lisp.

There are 2 Emacs Lisp implementations:

Origin of My Interest

PS here's what lead me to write this article. I was studying algebraic curves Elementary Geometry of Algebraic Curves amazon , by C G Gibson. While reading the book, in conjunction of working on my Visual Dictionary of Special Plane Curves website, and also my long time interest in formal mathematics (See: The Codification of Mathematics), i wanted to express the definition and theorems of algebraic curves as a sequence of symbols. (and hopefully show proofs as derivation of these symbol strings according to some given rules. In other words, my wish is to put the human element out)

In other words, i was foraging into formal systems. This is when i EXPLICITLY realized a long experience, that pattern matching as by regex or Mathematica isn't sufficient to specify even simple forms (i.e. conext free grammar). This got me started to study formal languages, grammar, BNF, CFL, PEG, parsers, automata, etc. I have actually never studied these subjects. (I recall, back in 1997, i was naively trying to devise production rules for the Mathematica language. I was, effectively, devising BNF for Mathematica's FullForm (which is just nested list like lisp).)

Another reason that i dived into the subject, is that for the past 2 or so years i've been wanting to have a lisp source code reformatter in emacs based on a lexical scan. 〔➤see A Simple Lisp Code Formatter〕 Getting acquainted with parser theory will help me.