# Parser Notes

By Xah Lee. Date: . Last updated: .

warning: work in progress.

todo. write a script that's rewrite system. given a bnf, generates random sentences.

show the connection of formal language to decidability problem. (computability theory, aka recursion theory)

Ambiguous Grammar (study/detail dangling else problem)

### context free grammar ≠ context free semantics

remember, “context free grammar” ≠ “context free semantics”. Basically all langs are the former. Very few are the latter.

For example, parenthesis has different meaning depending on neighboring chars, in most languages.

todo: need more research on this. Clarify, define, “context free semantics”

research on the question of any real programing languages C Java Python JavaScript etc are really context free syntax. Particularly look into C++, perl.

virtual machines: turing, state machine, automaton, regex

The general decision problem of whether a grammar is ambiguous is undecidable.

string rewrite system, parser, automata, math logic, regex, formal logic, formal math, are essentially the same topic, is practical and theoretical.

most people think parser is (trivial) part of compiler. Actually, compiler is a formal lang. For example, ∀ existing compiler ⇒ ∃ formal grammar. Wee!

### cofinite

In mathematics, a cofinite subset of a set X is a subset A whose complement in X is a finite set.

If the complement is not finite, but it is countable, then one says the set is cocountable.

### Recursive set

In computability theory, a set of natural numbers (1, 2, 3, etc) is called recursive, computable or decidable if there is an algorithm which terminates after a finite amount of time and correctly decides whether or not a given number belongs to the set.

A more general class of sets consists of the recursively enumerable sets, also called semidecidable sets. For these sets, it is only required that there is an algorithm that correctly decides when a number is in the set; the algorithm may give no answer (but not the wrong answer) for numbers not in the set. Or, equivalently: There is an algorithm that enumerates the members of S. That means that its output is simply a list of the members of S: s1, s2, s3, ... . If necessary, this algorithm may run forever.

A set which is not computable is called noncomputable or undecidable.

A subset S of the natural numbers is called recursive if there exists a total computable function f such that f(x) = 1 if x ∈ S and f(x) = 0 if x ∉ S . In other words, the set S is recursive if and only if the indicator function 1S is computable.

### recursive language

a formal language is called recursive if it is a recursive subset of the set of all possible finite sequences over the alphabet of the language.

This type of language was not defined in the Chomsky hierarchy of (Chomsky 1959). All regular, context-free and context-sensitive languages are recursive.

## turing complete

to determine if a language or system is turing complete, 2 major ways:

• show it can emulate turing machine.
• show it can compute μ-recursive function

## Finite state machine

A finite-state machine (FSM) or finite-state automaton (plural: automata), or simply a state machine, is a mathematical model of computation used to design both computer programs and sequential logic circuits.

Considered as an abstract model of computation, the finite state machine is weak; it has less computational power than some other models of computation such as the Turing machine. That is, there are tasks which no FSM can do, but some Turing machines can. This is because the FSM has limited memory. The memory is limited by the number of states.

state machine is this:

• there are n “nodes”. (lets lable them n1 n2 n3 …)
• there are m “events”. (lets lable them e1 e2 e3 …)
• there is a function, f(node, event) that returns a node.
• one of the node is designated start node.

math defintions goes like this:

• A deterministic finite state machine or acceptor deterministic finite state machine is a quintuple (Σ, S, s_0, δ, F), where:
• Σ is the input alphabet (a finite, non-empty set of symbols).
• S is a finite, non-empty set of states.
• s_0 is an initial state, a element of S.
• δ is the state-transition function: δ: S × Σ → S (in a nondeterministic finite automaton it would be δ: S × Σ → \mathcal{P}(S), i.e., δ would return a set of states).
• F is the set of final states, a (possibly empty) subset of S.

The state machines can be subdivided into Transducers, Acceptors, Classifiers and Sequencers.

## Automaton

Formal definition

A deterministic finite automaton is represented formally by a 5-tuple (Q,Σ,δ,q0,F), where:

• Q is a finite set of states.
• Σ is a finite set of symbols, called the alphabet of the automaton.
• δ is the transition function, that is, δ: Q × Σ → Q.
• q0 is the start state, that is, the state of the automaton before any input has been processed, where q0∈ Q.
• F is a set of states of Q (i.e. F⊆Q) called accept states.

### Input word

An automaton reads a finite string of symbols a1,a2,...., an , where ai ∈ Σ, which is called an input word. The set of all words is denoted by Σ*.

### Run

A sequence of states q0,q1,q2,...., qn, where qi ∈ Q such that q0 is the start state and qi = δ(qi-1,ai) for 0 < i ≤ n, is a run of the automaton on an input word w = a1,a2,...., an ∈ Σ*. In other words, at first the automaton is at the start state q0, and then the automaton reads symbols of the input word in sequence. When the automaton reads symbol ai it jumps to state qi = δ(qi-1,ai). qn is said to be the final state of the run.

### Accepting word

A word w ∈ Σ* is accepted by the automaton if qn ∈ F.

### Recognized language

An automaton can recognize a formal language. The language L ⊆ Σ* recognized by an automaton is the set of all the words that are accepted by the automaton.

### Recognizable languages

The recognizable languages are the set of languages that are recognized by some automaton. For the above definition of automata the recognizable languages are regular languages. For different definitions of automata, the recognizable languages are different.

### Pushdown automaton

Pushdown automata differ from finite state machines in two ways:

• They can use the top of the stack to decide which transition to take.
• They can manipulate the stack as part of performing a transition.

### Content Word and Function Word

In linguistics “content words” are words such as nouns, most verbs, adjectives, and adverbs that refer to some object, action, or characteristic. Content words contrast with function words , which function primarily to express the grammatical relationships between other words in a sentence, such as {the, of, and, to}.

Content words are open class words, meaning that new content words can be added to the lexicon easily. In relation to phonology, content words adhere to the minimal word constraint, while function words do not.

### Function words:

• Prepositions: of, at, in, without, between
• Pronouns: he, they, anybody, it, one
• Determiners: the, a, that, my, more, much, either, neither
• Conjunctions: and, that, when, while, although, or
• Auxiliary verbs: be (is, am, are), have, got, do
• Particles: no, not, nor, as Function Words

### content words:

• Adjectives: happy, new, large, grey
• “Full” verbs: search, grow, hold, have
• Adverbs: really, completely, very, also, enoug

from http://www.psych.nyu.edu/pylkkanen/Neural_Bases/13_Function_Words.pdf, by Liina Pylkkanen, Associate Professor of Linguistics and Psychology, at New York University.

### lex, lexeme, lexicon

Formally, in linguistics, a lexicon is a language's inventory of lexemes. The word “lexicon” derives from the Greek λεξικόν (lexicon), neuter of λεξικός (lexikos) meaning “of or for words”.

A lexeme is a unit of lexical meaning that exists regardless of the number of inflectional endings it may have or the number of words it may contain.

2016-07-26 resources for amateur compiler writer http://c9x.me/compile/bib/