Parser Notes

By Xah Lee. Date: . Last updated: .

warning: work in progress.

todo. write a script that's rewrite system. given a bnf, generates random sentences.

show the connection of formal language to decidability problem. (computability theory, aka recursion theory)

Ambiguous Grammar (study/detail dangling else problem)

context free grammar ≠ context free semantics

remember, “context free grammar” ≠ “context free semantics”. Basically all langs are the former. Very few are the latter.

For example, parenthesis has different meaning depending on neighboring chars, in most languages.

todo: need more research on this. Clarify, define, “context free semantics”

research on the question of any real programing languages C Java Python JavaScript etc are really context free syntax. Particularly look into C++, perl.

virtual machines: turing, state machine, automaton, regex

The general decision problem of whether a grammar is ambiguous is undecidable.

string rewrite system, parser, automata, math logic, regex, formal logic, formal math, are essentially the same topic, is practical and theoretical.

most people think parser is (trivial) part of compiler. Actually, compiler is a formal lang. For example, ∀ existing compiler ⇒ ∃ formal grammar. Wee!


In mathematics, a cofinite subset of a set X is a subset A whose complement in X is a finite set.

If the complement is not finite, but it is countable, then one says the set is cocountable.


Recursive set

In computability theory, a set of natural numbers (1, 2, 3, etc) is called recursive, computable or decidable if there is an algorithm which terminates after a finite amount of time and correctly decides whether or not a given number belongs to the set.

A more general class of sets consists of the recursively enumerable sets, also called semidecidable sets. For these sets, it is only required that there is an algorithm that correctly decides when a number is in the set; the algorithm may give no answer (but not the wrong answer) for numbers not in the set. Or, equivalently: There is an algorithm that enumerates the members of S. That means that its output is simply a list of the members of S: s1, s2, s3, ... . If necessary, this algorithm may run forever.

A set which is not computable is called noncomputable or undecidable.

A subset S of the natural numbers is called recursive if there exists a total computable function f such that f(x) = 1 if x ∈ S and f(x) = 0 if x ∉ S . In other words, the set S is recursive if and only if the indicator function 1S is computable.

Recursive set

recursive language

a formal language is called recursive if it is a recursive subset of the set of all possible finite sequences over the alphabet of the language.

Recursive language

This type of language was not defined in the Chomsky hierarchy of (Chomsky 1959). All regular, context-free and context-sensitive languages are recursive.

turing complete

to determine if a language or system is turing complete, 2 major ways:

Μ-recursive function

Finite state machine

Finite state machine

A finite-state machine (FSM) or finite-state automaton (plural: automata), or simply a state machine, is a mathematical model of computation used to design both computer programs and sequential logic circuits.

Considered as an abstract model of computation, the finite state machine is weak; it has less computational power than some other models of computation such as the Turing machine.[1] That is, there are tasks which no FSM can do, but some Turing machines can. This is because the FSM has limited memory. The memory is limited by the number of states.

state machine is this:

math defintions goes like this:

The state machines can be subdivided into Transducers, Acceptors, Classifiers and Sequencers.


Automata theory

Formal definition

A deterministic finite automaton is represented formally by a 5-tuple (Q,Σ,δ,q0,F), where:

Input word

An automaton reads a finite string of symbols a1,a2,...., an , where ai ∈ Σ, which is called an input word. The set of all words is denoted by Σ*.


A sequence of states q0,q1,q2,...., qn, where qi ∈ Q such that q0 is the start state and qi = δ(qi-1,ai) for 0 < i ≤ n, is a run of the automaton on an input word w = a1,a2,...., an ∈ Σ*. In other words, at first the automaton is at the start state q0, and then the automaton reads symbols of the input word in sequence. When the automaton reads symbol ai it jumps to state qi = δ(qi-1,ai). qn is said to be the final state of the run.

Accepting word

A word w ∈ Σ* is accepted by the automaton if qn ∈ F.

Recognized language

An automaton can recognize a formal language. The language L ⊆ Σ* recognized by an automaton is the set of all the words that are accepted by the automaton.

Recognizable languages

The recognizable languages are the set of languages that are recognized by some automaton. For the above definition of automata the recognizable languages are regular languages. For different definitions of automata, the recognizable languages are different.

Pushdown automaton

Pushdown automaton

Pushdown automata differ from finite state machines in two ways:

Content Word and Function Word

In linguistics “content words” are words such as nouns, most verbs, adjectives, and adverbs that refer to some object, action, or characteristic. Content words contrast with function words , which function primarily to express the grammatical relationships between other words in a sentence, such as {the, of, and, to}.

Content words are open class words, meaning that new content words can be added to the lexicon easily.[1] In relation to phonology, content words adhere to the minimal word constraint, while function words do not.[2]

Content word

Function words:

  • Prepositions: of, at, in, without, between
  • Pronouns: he, they, anybody, it, one
  • Determiners: the, a, that, my, more, much, either, neither
  • Conjunctions: and, that, when, while, although, or
  • Auxiliary verbs: be (is, am, are), have, got, do
  • Particles: no, not, nor, as Function Words

content words:

  • Nouns: John, room, answer
  • Adjectives: happy, new, large, grey
  • “Full” verbs: search, grow, hold, have
  • Adverbs: really, completely, very, also, enoug

from, by Liina Pylkkanen, Associate Professor of Linguistics and Psychology, at New York University.

lex, lexeme, lexicon


Formally, in linguistics, a lexicon is a language's inventory of lexemes. The word “lexicon” derives from the Greek λεξικόν (lexicon), neuter of λεξικός (lexikos) meaning “of or for words”.

A lexeme is a unit of lexical meaning that exists regardless of the number of inflectional endings it may have or the number of words it may contain.

2016-07-26 resources for amateur compiler writer

Xah Regular Expression Articles Index