Pattern Matching vs Lexical Grammar Specification

,

This page is a personal note on the relation of pattern matching and lexical grammar specification, and their application in formal logic. Pattern matching here can mean textual pattern matching such as regex or list structure and datatype pattern matching such as in Mathematica.

A Need For Lexical Grammar Spec Lang

In computing, there's pattern matching. ⁖ those in Mathematica, for pattern matching list structure and datatypes ( Mathematica patterns ) , and those in perl, for pattern matching strings. These pattern matching domain-specific languages are designed to match patterns, but i have realized, that they are rather unfit, when used as a mean to specify the structure of textual forms. In short, what i came to understand is that pattern matching facilities (may it be regex for strings or Mathematica's for structure), although powerful, but is not capable of specifying formal grammar, even very simple ones. (this seems obvious now when stated as above, but it was a realization for me.)

Often, when writing doc or in verbal communication, on computer programing or doing math with Mathematica, you are tempted to use a pattern to communicate a particular form a expression may have. For example, you would like to say that email address has the form xyz, where xyz is a perl regex:

[A-z0-9]+@[A-z]+\.com

Similarly, you might use regex to communicate the form of a url, domain name, file path. In math (especially in formal logic or in computer algebra systems), i often want to say that certain function has the form xyz, where xyz is a Mathematica pattern that indicates the function's parameters, output, and their types. (i.e.: a computer language expression analogous to traditional notation f:ℂ²→ℝ³). However, in practice, using regex or Mathematica's patterns for the purpose of specifying a form, is often unsatisfactory. Typically, using patterns only gives a general idea, but isn't a correct specification of the form. For the purpose of giving a general idea, verbal English description is almost as good. Almost.

For example, in the official doc for E-mail address RFC 2822, it does not use a regex to specify email address's syntax. Instead, it uses a variant of BNF mixed with many pages of explanations, for human reading.

Here's a regex for email address that presumably correctly match the spec in RFC 2822. Source www.regular-expressions.info

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

The regex is 426 chars long and not human readable.

It is quite desirable to have a simple language for specifying syntax, that can be parsed by a computer, yet designed in a human-readable way, and concise. With such a language, we could use it to verify if a text is a valid form. We could use it for human communication. Conceivably, it could replace regex for string pattern matching or much enhance pattern matching as used in Mathematica or other functional langs.

Pattern Matching vs Grammar Specification

Pattern matching is used to check if a given text is valid (matches the pattern). Grammar spec lang such as BNF does reverse: it takes a spec and “generates” all possible valid texts. Can there be a single syntax, that works both ways?

To what degree, pattern matching language and grammar specification language are orthogonal? That is: are these simply two ways of looking at the same idea? Can they be unified (in theory or practice)? This question, can be considered with respect to computer science of formal languages field. And, it can be also considered with respect to practical programing, as in tools to parse languages, regex to match strings, patterns to match text structures, or text generator tools to generate possible strings (such as math formula syntax, email syntax, URL syntax, programing lang syntax).

One grammar specification lang is BNF. There are various (incompatible) versions and extensions of BNF, and BNF is mostly used for humans reading, in the same way that traditional math notation is a language for human consumption, not for machine to process. The machine readable versions used by parsers each has its own ad hoc incompatible extensions. As a human-to-human lang, it is imprecise and filled with errors. Is there a general purpose, widely used, BNF-like language that is designed to run as executable program, but also has qualities for human reading?

If we look at the top 20 most popular computer languages used today (⁖ Java, Perl, Visual Basic, SQL, … lisp), most of them do not have a language specification or anything close to a grammar as a formal language. For those that have some form of spec, perhaps 95% of them are specified in a form for human to read (as opposed to machine readable parser lang), and are extremely complex and imprecise. For examples: Java with Java Language Specification , Python with its The Python Language Reference, Perl, PHP, have no grammar spec. Even the functional lang Haskell, does not have a formal grammar spec.

Take HTML for example, and even the much simpler XML's official specifications of its syntax by W3C at http://www.w3.org/TR/xml11/, are wildly complex with dense and verbose descriptions, in tens of pages. (as far as i know, there is no parser based specification for as simple a language as XML.)

The grammar of XML is conceptually very simple. It seems something is lacking, by the fact that its Lexical grammar is not specified by a computer language. Conceivable it would be within 50 lines of such a language, the essence would probably just 10 lines.

The above issues in computer science are roughly in the fields of parsing, pattern matching, formal language theory.

I don't have any experience nor had any study of parsing. It appears that Yacc and Flex lexical analyser is something i'm looking for.

I'll be spending time studying parsing in the near future to fill my void in this important topic. As of now, my impression of yacc and lex is that they don't seem to be the system i'm looking for. For example, apparently, SGML, XML, do not use them to specify the grammar. Nor, most languages i know of. They seem just to be a very technical tool and not something general enough to be used for human communication as well in wide number of computer languages.

Parsing Expression Grammar!

I have found what i wanted!! See Parsing expression grammar.

Apparently, it's still not in wide use. There are one or two implementations in languages like Haskell and Perl6. I would like to see this in Perl, Python, PHP, JavaScript, and most of all, Emacs Lisp.

There are 2 Emacs Lisp implementations:

Study Notes

Some Wikipedia learning and notes:

BNFs, CFL, Generative Grammar

The Backus–Naur form has several variants used today. The original BNF discussed in 1959 is hardly used today. Major variants are, perhaps in order of popularity:

BNF and variants are so-called Metasyntax. Metasyntax means roughly a syntax that is used to describe the formal grammar of other langs. In particular, the BNF (and variants, henceforth BNFs) are used to describe a class of language called Context-free languages (CFL).

Context-free langs are those can be described using Context-free grammar, and context-free grammar is roughly those that can be described by a set of certain transformation rules (often called production rules). BNFs are syntaxes for these production rules.

A grammar (formal grammar), is a high-level (typically human-level) spec that describes a language's possible strings (i.e. the source code). There are many classes of grammars. One classification is whether they are context-free. And, of the context-free ones, there are few ways to describe them. One way, by listing several transformation rules, is called Generative grammar. Another way, by mathematical qualifications, is called Analytic grammar. (So, for example, BNFs are used for generative grammar.)

BNF Examples

As far as i know, none of the following examples are written for machine to parse. Also, the BNF extention is non-standard.

parsing

parser flow diagram
parser flow diagram.

Parsing can be broken down to few steps.

Lexical analysis, which breaks the text stream into tokens. Then, syntactic analysis organize the stream of tokens into a parse tree.

In lexical analysis, it can be separated into 2 steps: scanner and tokenizer.

Algorithms for parsing can be grouped into types, based on their approach. In one classification, there's Top-down parsing and Bottom-up parsing.

formal language

formal grammar specifies all possible string of a language.

Lexical grammar specifies the syntax of tokens. For example, XML is specified in lexical grammar. (since here is not one specific language, but a lexical structure that can spawn off new langs)

Once a formal grammar is given, the reverse problem of determining whether a string belongs to that lang, is theoretically studied in a field named Automata theory, which is how regex originated from.

Some links to check out:

Origin of My Interest

PS here's what lead me to write this article. I was studying algebraic curves Elementary Geometry of Algebraic Curves amazon , by C G Gibson. While reading the book, in conjunction of working on my Visual Dictionary of Special Plane Curves website, and also my long time interest in formal mathematics (See: The Codification of Mathematics), i wanted to express the definition and theorems of algebraic curves as a sequence of symbols. (and hopefully show proofs as derivation of these symbol strings according to some given rules. In other words, my wish is to put the human element out)

In other words, i was foraging into formal systems. This is when i EXPLICITLY realized a long experience, that pattern matching as by regex or Mathematica isn't sufficient to specify even simple forms (i.e. lexical grammar). This got me started to study formal languages, grammar, BNF, CFL, PEG, parsers, automata, etc. I have actually never studied these subjects. (I recall, back in 1997, i was naively trying to devise production rules for the Mathematica language. I was, effectively, devising BNF for Mathematica's FullForm (which is just nested list like lisp).)

Another reason that i dived into the subject, is that for the past 2 or so years i've been wanting to have a lisp source code reformatter in emacs based on a lexical analysis. 〔➤ A Simple Lisp Code Formatter〕 Getting acquainted with parser theory will help me.

Transformation Systems

Discovered several languages. TXL, Colm, and the concept of program transformation.

There's the concept of Program transformation. Basically, it takes a input source code and transforms by some spec into another form. Similar to the concept of compiler, but more emphasis that the input and output are high-level source code, i think. ⁖ Source-to-source compiler.

There seems to be quite a few such languages. Here's a big list: Source www.program-transformation.org.

TXL (programming language) (home at www.txl.ca) seems to be a popular transformation language. And a new supposedly improved one is Colm, at: Source www.complang.org. On the colm site, it has a Ph D thesis in PDF that gives a nice overview of such systems.

blog comments powered by Disqus