Formal Grammar

By Xah Lee.

warning: work in progress.

A formal grammar is a set of rules for rewriting strings. Formally, a grammar consists of a set N of nonterminal symbols, a set Σ of terminal symbols, a set of production rules (each rewriting one string of symbols into another), and a designated start symbol S.

The set Σ (the terminals) is called the alphabet. You can think of each element as an English letter, a character, or a string.

The system above is sometimes also called {generative grammar, rewriting system, phrase structure grammar}.

Essentially, it's a string replacement system. You are given a finite list of replacement rules and a designated starting symbol. Begin with the starting symbol. Whenever the current string contains a nonterminal, pick any rule whose left-hand side matches a substring, and do the replacement. Repeat, until no nonterminal is left. The result is a generated string, and is said to be a valid string/sentence of this particular grammar.
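
Here is a minimal sketch of this rewriting process in Python. The grammar, the function name, and the random rule choice are just for illustration:

import random

# rules: a finite list of (lhs, rhs) replacements. Uppercase = nonterminal.
rules = [("S", "aSb"), ("S", "")]

def generate(start="S"):
    s = start
    while any(c.isupper() for c in s):  # a nonterminal remains
        # pick any rule whose left-hand side occurs in the string
        lhs, rhs = random.choice([r for r in rules if r[0] in s])
        i = s.index(lhs)
        s = s[:i] + rhs + s[i + len(lhs):]  # do the replacement
    return s

print(generate())  # for example, prints aabb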

Formal Grammar Hierarchy (Chomsky Hierarchy)

Chomsky hierarchy of formal languages

recursively enumerable
 context sensitive
  context free
   regular

Chomsky hierarchy

Context-sensitive grammar

A context-sensitive grammar
Every rule rewrites a single nonterminal, but the rule may require surrounding context: rules have the form αAβ → αγβ, where A is one nonterminal, α and β are (possibly empty) strings of symbols, and γ is a nonempty string.

It's called context-sensitive because the nonterminal on the lhs may be surrounded by other symbols. So, in order to replace it, the neighboring symbols (the context) must match the rule.
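
For example, here is a well-known grammar for the language {aⁿbⁿcⁿ, n ≥ 1}, which is not context-free. (Uppercase letters are nonterminals, lowercase are terminals. The rule CB → BC is strictly in noncontracting form rather than context-sensitive form, but noncontracting grammars generate the same class of languages; see below.)

S → aSBC
S → aBC
CB → BC
aB → ab
bB → bb
bC → bc
cC → cc

This generates {abc, aabbcc, aaabbbccc, …}.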

see also Noncontracting grammar

see also Kuroda normal form

Context-free grammar

Context-free grammar

a context-free grammar
The left-hand side of every production rule is a single nonterminal.
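
For example, a context-free grammar for balanced parentheses:

S = S S
S = '(' S ')'
S = ''

This grammar generates all balanced parentheses strings: {"", (), ()(), (()), …}.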

Linear Grammar and Regular Grammar

Linear grammar

a linear grammar
A context-free grammar in which every production rule has at most one nonterminal on the right-hand side.

That is, the lhs is just 1 nonterminal (so it's “context-free”), and the rhs has at most 1 nonterminal.

example

S = 'a' S 'b'
S = ''

This grammar generates {"", ab, aabb, aaabbb, …}, that is, aⁿbⁿ for n ≥ 0. (For example: S ⇒ aSb ⇒ aaSbb ⇒ aabb.)

regular grammar
Like a linear grammar, but the nonterminals on the right-hand sides must either all be at the left ends, or all be at the right ends.
left-regular grammar (aka left-linear grammar)
Every nonterminal on a right-hand side is at the left end.
right-regular grammar (aka right-linear grammar)
Every nonterminal on a right-hand side is at the right end.
mixed-left-right linear grammar
A linear grammar in which the nonterminal on the rhs is sometimes at the left end, and sometimes at the right end.

By inserting new nonterminals, every linear grammar can be converted to a mixed-left-right linear grammar.

For example, if you have a rule

X → aYb

it can be replaced by

X → aZ
Z → Yb

where Z is a new nonterminal.

Analytic Grammar

like formal grammar, but with a different interpretation of the rules.

analytic grammar is basically a parser specification: it specifies a language by working backwards, from string to structure, telling you how to recognize whether a given string is in the language, as opposed to a list of rules that generates strings.

An example is the parsing expression grammar (PEG), discussed below.

Lexical Grammar

Lexical grammar

when a grammar is used to describe the syntax for the “words” in a programming language, we call it a lexical grammar.

For example, identifiers (variable names, function names), keywords, strings, numbers, and operators are the “words” here.

so, “lexical grammar” is not the name of a grammar power classification. Rather, it describes the grammar's purpose.

The grammar for words is usually very simple, typically a regular grammar. Therefore, regular expressions are used for lexical grammar.
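
Here is a minimal sketch of a regular-expression based lexer in Python. The token names and patterns are made up for illustration:

import re

# one named alternative per token type; names are illustrative
token_spec = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
lexer = re.compile("|".join(f"(?P<{n}>{p})" for n, p in token_spec))

def tokenize(code):
    for m in lexer.finditer(code):
        if m.lastgroup != "SKIP":  # drop whitespace
            yield (m.lastgroup, m.group())

print(list(tokenize("x1 = 42 + y")))
# [('IDENT', 'x1'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]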

Recursive grammar

Recursive grammar

a grammar is called a recursive grammar if, in the rewrite process, a nonterminal can lead to a string that contains that same nonterminal. Otherwise, it's said to be a non-recursive grammar.

Note: a non-recursive grammar can only generate a finite number of sentences. You need recursion to have a language that's not just a finite set of sentences.

example of immediate left recursive grammar:

A → A α | β

example of indirect left recursive grammar:

A → B α | C
B → A β | D

A naive recursive descent parser may fall into infinite recursion when trying to parse with a grammar containing such left-recursive rules: to parse an A, it first tries to parse an A, without consuming any input.
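
The standard fix is to eliminate the left recursion, rewriting A → A α | β into the right-recursive form A → β A′, A′ → α A′ | ε. A minimal sketch in Python, taking α = 'a' and β = 'b' as an illustrative instance:

# grammar: A → A 'a' | 'b'    (left recursive)
# rewritten: A → 'b' R ;  R → 'a' R | ε    (right recursive)

def parse_A(s, i=0):
    if i < len(s) and s[i] == 'b':
        return parse_R(s, i + 1)
    return None  # no match

def parse_R(s, i):
    # R → 'a' R | ε, done iteratively instead of by recursion
    while i < len(s) and s[i] == 'a':
        i += 1
    return i

print(parse_A("baaa"))  # 4: the whole input is consumed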

non-terminating recursive grammar

example

S = S
S = A
A = S

Every right-hand side here still contains a nonterminal, so the rewriting never stops and no sentence is ever generated; the language is the empty set.

ambiguous grammar

Ambiguous Grammar

A grammar is ambiguous if some string it generates has more than one parse tree (more than one leftmost derivation).
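
For example, in this expression grammar, operator precedence is not encoded:

E = E '+' E
E = E '*' E
E = 'n'

The string n + n * n has two parse trees: one grouping as (n + n) * n, the other as n + (n * n).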

Grammar Equivalence

Equivalence (formal languages)

Two grammars are (weakly) equivalent if they generate the same language. For example, the left-recursive grammar A → A 'a' | 'b' and its right-recursive rewrite in the section above generate the same language ('b' followed by any number of 'a's), so they are equivalent, even though they give different parse trees.

Parsing expression grammar

Parsing expression grammar

parsing expression grammar, or PEG, is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004[1] and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

Unlike CFGs, PEGs cannot be ambiguous; if a string parses, it has exactly one valid parse tree. It is conjectured that there exist context-free languages that cannot be parsed by a PEG, but this is not yet proven.[1] PEGs are well-suited to parsing computer languages, but not natural languages where their performance is comparable to general CFG algorithms such as the Earley algorithm.[2]
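
Here is a minimal sketch of PEG-style ordered choice as parser combinators in Python; the combinator names are made up for illustration:

def lit(t):
    # parser that matches literal t at position i, returns the new position
    def p(s, i):
        return i + len(t) if s.startswith(t, i) else None
    return p

def choice(*parsers):
    # PEG ordered choice: commit to the FIRST alternative that matches
    def p(s, i):
        for q in parsers:
            r = q(s, i)
            if r is not None:
                return r
        return None
    return p

# A ← 'a' / 'ab'
A = choice(lit("a"), lit("ab"))
print(A("ab", 0))  # 1: 'a' wins; the alternative 'ab' is never tried

Note that because the first alternative matched, the second alternative 'ab' is never tried, even though it would have consumed more input. This first-match behavior is what makes PEGs unambiguous.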