Formal Grammar

By Xah Lee. Date: 2014-12-29. Last updated: 2016-08-27.

warning: work in progress.

A formal grammar is a set of rules for rewriting strings.

A finite set Σ. of symbols called terminal symbols.
A finite set N. of symbols called nonterminal symbols. The N and ∑ does not share elements.
A finite set P (Each element is called production rule or just rule in short.). Each rule is a set of 2 elements, called left-hand-side and right-hand-side. Usually written as L → R. L is a sequence of one or more terminal and or nonterminal symbols, with at least one nonterminal, and R is a sequence of zero or more terminal/nonterminal symbols. (If R is empty, it can be written as ε.)
One of the rule, is designated as the start rule.

The elements in Σ (the terminals) is called the alphabet. You can think of each element as english alphabet, or characters, or strings.

The system above is sometimes also called {generative grammar, rewriting system, phrase structure grammar}.

Essentially, it's a string replacement system. You are given a finite list of replacements, with one of them as the starting point. When the right-hand-side has a nonterminal, pick any rule where the left hand side that matches, and do the replacement. Repeat, until you reach a point where no more noneterminal is left. This is a generated string, and is said to be a valid string/sentence of this particular grammar.

Formal Grammar Hierarchy (Chomsky Hierarchy)

Chomsky hierarchy of formal languages

recursively enumerable
 context sensitive
  context free
   regular

Chomsky hierarchy

Type-0 grammars (unrestricted grammars) include all formal grammars. (in particular, lhs can have more than one nonterminal)
Type-1 grammars (context-sensitive grammars) lhs have at most one nonterminal, for all rules.
Type-2 grammars (context-free grammars) lhs is just one nonterminal, for all rules. (that is, no terminal allowed on lhs.)
Type-3 grammars (regular grammars) lhs is just one nonterminal, and rhs is of a single terminal, possibly followed by a single nonterminal (right regular). Alternatively, rhs can consist of a single terminal, possibly preceded by a single nonterminal (called left regular).

Context-sensitive grammar

A context-sensitive grammar: Lhs must have only one nonterminal, for all rules.

It's called context-sensitive because the nonterminal on the lhs may be surrounded by terminals. So, in order to replace, the neighboring nonterminals (the context) must be considered.

Context-free grammar

a context-free grammar: Left-hand-side of every production rule is just one nonterminal.

Linear Grammar and Regular Grammar

Linear grammar

a linear grammar: A context-free grammar, and has at most one nonterminal in the right hand side, for every production rule.

That is, lhs is just 1 nonterminal (so it's “context-free”), and rhs must have at most 1 nonterminal.

example

S = 'a' S 'b'
S = ''

This grammar generates {ab, aabb, aaabbb, …}.

regular grammar: Like linear grammar, but every nonterminal on rhs must all be at left ends, or all be at right ends.
left-regular grammar (aka left-linear grammar): Any nonterminal in right hand sides are at the left ends.
right-regular grammar (aka right-linear grammar): Any nonterminal in right hand sides are at the right ends.
mixed-left-right linear grammar: If in a linear grammar, the nonterminal in rhs is sometimes at left end, and sometimes at right end.

By inserting new nonterminals, every linear grammar can be converted to mixed-left-right linear grammar.

For example, if you have a rule

X → aYb

it can be replaced by

X → aZ
Z → Yb

Analytic Grammar

like formal grammar, but with slightly different requirement or spec for the rules.

analytic grammar is basically parsers, which specifies a lang by working backwards from string to grammar, as opposed a list of rules that generates string.

examples are:

Lexical Grammar

Lexical grammar

when a grammar is used to describe the syntax for “words” in a programing language, we call it lexical grammar.

For example, identifier (variable name, function name), keywords, string, numbers, operators, are the “words” here.

so, “lexical grammar” is not a name of a grammar power classification. Rather, it describe the grammar's purpose.

The grammar for words are usually very simple, of the type regular grammar. Therefore, regular expression is used for lexical grammar.

Recursive grammar

a grammar is called a recursive grammar if in the rewrite process, a nonterminal can lead to a string that contains the same nonterminal. Else, it's said to be non-recursive grammar.

Note: a non-recursive grammar can only generate a finite number of sentences. You need recursion to have a language that's not just a finite set of sentences.

example of immediate left recursive grammar:

A → A α | β

example of indirect left recursive grammar:

A → B α | C
B → A β | D

naive recursive descent parser might fall into infinite recursion when trying to parse a grammar which contains this rule.

unterminating recursive grammar

example

S = S

S = A
A = S

ambiguous grammar

Ambiguous Grammar

grammar Equivalence

weak equivalence of two grammars means they generate the same set of strings.
strong equivalence (aka structural equivalence) means that the two parse trees are reasonably similar.

Equivalence (formal languages)

Parsing expression grammar

parsing expression grammar, or PEG, is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004[1] and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

Unlike CFGs, PEGs cannot be ambiguous; if a string parses, it has exactly one valid parse tree. It is conjectured that there exist context-free languages that cannot be parsed by a PEG, but this is not yet proven.[1] PEGs are well-suited to parsing computer languages, but not natural languages where their performance is comparable to general CFG algorithms such as the Earley algorithm.[2]