# Formal Grammar

**warning**: work in progress.

A formal grammar is a set of rules for rewriting strings.

- A finite set Σ. of symbols called
**terminal symbols**. - A finite set N. of symbols called
**nonterminal symbols**. The N and ∑ does not share elements. - A finite set P (Each element is called
**production rule**or just**rule**in short.). Each rule is a set of 2 elements, called left-hand-side and right-hand-side. Usually written as L → R. L is a sequence of one or more terminal and or nonterminal symbols, with at least one nonterminal, and R is a sequence of zero or more terminal/nonterminal symbols. (If R is empty, it can be written as ε.) - One of the rule, is designated as the
**start rule**.

The elements in Σ (the terminals) is called the **alphabet**. You can think of each element as english alphabet, or characters, or strings.

The system above is sometimes also called {generative grammar, rewriting system, phrase structure grammar}.

Essentially, it's a string replacement system. You are given a finite list of replacements, with one of them as the starting point. When the right-hand-side has a nonterminal, pick any rule where the left hand side that matches, and do the replacement. Repeat, until you reach a point where no more noneterminal is left. This is a generated string, and is said to be a valid string/sentence of this particular grammar.

## Formal Grammar Hierarchy (Chomsky Hierarchy)

Chomsky hierarchy of formal languages

recursively enumerable context sensitive context free regular

[ Chomsky hierarchy ] [ https://en.wikipedia.org/wiki/Chomsky_hierarchy ]

- Type-0 grammars (
**unrestricted grammars**) include all formal grammars. (in particular, lhs can have more than one nonterminal) - Type-1 grammars (
**context-sensitive grammars**) lhs have at most one nonterminal, for all rules. - Type-2 grammars (
**context-free grammars**) lhs is just one nonterminal, for all rules. (that is, no terminal allowed on lhs.) - Type-3 grammars (
**regular grammars**) lhs is just one nonterminal, and rhs is of a single terminal, possibly followed by a single nonterminal (right regular). Alternatively, rhs can consist of a single terminal, possibly preceded by a single nonterminal (called left regular).

### Context-sensitive grammar

- A context-sensitive grammar
- Lhs must have only one nonterminal, for all rules.

It's called context-sensitive because the nonterminal on the lhs may be surrounded by terminals. So, in order to replace, the neighboring nonterminals (the context) must be considered.

see also [ Noncontracting grammar ] [ https://en.wikipedia.org/wiki/Noncontracting_grammar ]

see also [ Kuroda normal form ] [ https://en.wikipedia.org/wiki/Kuroda_normal_form ]

### Context-free grammar

[ Context-free grammar ] [ https://en.wikipedia.org/wiki/Context-free_grammar ]

- a context-free grammar
- Left-hand-side of every production rule is just one nonterminal.

## Linear Grammar and Regular Grammar

[ Linear grammar ] [ https://en.wikipedia.org/wiki/Linear_grammar ]

- a linear grammar
- A context-free grammar, and has at most one nonterminal in the right hand side, for every production rule.

That is, lhs is just 1 nonterminal (so it's “context-free”), and rhs must have at most 1 nonterminal.

example

S = 'a' S 'b' S = ''

This grammar generates {ab, aabb, aaabbb, …}.

**regular grammar**- Like linear grammar, but every nonterminal on rhs must all be at left ends, or all be at right ends.
**left-regular grammar**(aka**left-linear grammar**)- Any nonterminal in right hand sides are at the left ends.
**right-regular grammar**(aka**right-linear grammar**)- Any nonterminal in right hand sides are at the right ends.
**mixed-left-right linear grammar**- If in a linear grammar, the nonterminal in rhs is sometimes at left end, and sometimes at right end.

By inserting new nonterminals, every linear grammar can be converted to mixed-left-right linear grammar.

For example, if you have a rule

`X → aYb`

it can be replaced by

`X → aZ`

`Z → Yb`

## Analytic Grammar

like formal grammar, but with slightly different requirement or spec for the rules.

analytic grammar is basically parsers, which specifies a lang by working backwards from string to grammar, as opposed a list of rules that generates string.

examples are:

- [ Parsing expression grammar ] [ https://en.wikipedia.org/wiki/Parsing_expression_grammar ]
- [ Top-down parsing language ] [ https://en.wikipedia.org/wiki/Top-down_parsing_language ]
- [ Link grammar ] [ https://en.wikipedia.org/wiki/Link_grammar ]

## Lexical Grammar

[ Lexical grammar ] [ https://en.wikipedia.org/wiki/Lexical_grammar ]

when a grammar is used to describe the syntax for “words” in a programing language, we call it lexical grammar.

For example, identifier (variable name, function name), keywords, string, numbers, operators, are the “words” here.

so, “lexical grammar” is not a name of a grammar power classification. Rather, it describe the grammar's purpose.

The grammar for words are usually very simple, of the type regular grammar. Therefore, regular expression is used for lexical grammar.

## Recursive grammar

[ Recursive grammar ] [ https://en.wikipedia.org/wiki/Recursive_grammar ]

a grammar is called a recursive grammar if in the rewrite process, a nonterminal can lead to a string that contains the same nonterminal. Else, it's said to be non-recursive grammar.

Note: a non-recursive grammar can only generate a finite number of sentences. You need recursion to have a language that's not just a finite set of sentences.

example of immediate left recursive grammar:

A → A α | β

example of indirect left recursive grammar:

A → B α | C B → A β | D

naive recursive descent parser might fall into infinite recursion when trying to parse a grammar which contains this rule.

### unterminating recursive grammar

example

S = S

S = A A = S

## ambiguous grammar

## grammar Equivalence

- weak equivalence of two grammars means they generate the same set of strings.
- strong equivalence (aka structural equivalence) means that the two parse trees are reasonably similar.

Equivalence (formal languages)

## Parsing expression grammar

[ Parsing expression grammar ] [ https://en.wikipedia.org/wiki/Parsing_expression_grammar ]

parsing expression grammar, or PEG, is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004[1] and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

Unlike CFGs, PEGs cannot be ambiguous; if a string parses, it has exactly one valid parse tree. It is conjectured that there exist context-free languages that cannot be parsed by a PEG, but this is not yet proven.[1] PEGs are well-suited to parsing computer languages, but not natural languages where their performance is comparable to general CFG algorithms such as the Earley algorithm.[2]