How Compiler Works

By Xah Lee. Date: 2015-12-29. Last updated: 2018-04-05.

Compiler: Translates a computer language A to B. Typically, translates a higher level language such as JavaScript, Python, Golang, etc, to machine code.
Interpreter: Is like a compiler, but translate a small bit of code at a time and run it. (when you use interactive command line interface of a language (aka REPL (read eval print loop)), it's interpreter doing the work.)

The difference of compiler and interpreter is blurred today.

Assembly Language: A low level language that still make sense to human.

A line of assembly language looks like this:

MOV AL, 61h       ; Load AL with 97 decimal (61 hex)

Machine Code: Set of instructions executed directly by CPU. Each CPU has different machine language. Machine code is basically a sequence of 0 and 1s.
Assembler: Software that translates assembly language code into machine code. The output of assembler is called object file. (it may generate more than one.)
Linker: Software that takes the object files of assembler's object files into one executable file.
Loader: Part of the operating system that loads (aka run, execute) a file.

Compiler Phases

The task of a compiler is usually divided into 2 steps.

analysis. This is parsing: Read source code, understand what it means, output intermediate code.
synthesis. Reads intermediate code, and generate machine code or target language code.

The compiler part that does analysis (aka parsing) is often called the front-end of compiler.
The compiler part that does synthesis (aka code generation) is often called the back-end of compiler.

parser flow diagram — compiler flow diagram

When compiler translates language A to B, usually it generates a intermediate representation first, then from that, to B.

Intermediate Language (aka bytecode): This is a language created by compiler after parsing (aka syntactic analysis). After parsing, the semantics of the source code is known, and this info is the intermediate language.

Intermediate language is useful because:

Breaks compiler software into components for easier management.
Is often portable (that is, not hardware dependent).
Can be used to generate different target language.

One example of intermediate code is known as three-address code.

t1 := b * b
t2 := 4 * a
t3 := t2 * c
t4 := t1 - t3
t5 := sqrt(t4)
t6 := 0 - b
t7 := t5 + t6
t8 := 2 * a
t9 := t7 / t8
x := t9

example of three-address code.

Parser Phases

The analysis part, is called parsing, done by parser. Parsing usually has the following phases:

lexical analysis
syntactic analysis
semantic analysis

Lexical analysis: The first step of parsing. It takes a input string, and lexical grammar, then breaks the input string into a sequence of syntactical units, each is called a lexeme. (lexeme basically means “word”)
Lexer (aka tokenizer, scanner): The software that does lexical analysis.
Token: A lexeme with extra info, such as what type the lexeme is. e.g. Keyword, operator, identifier, number, string. Often, informally, people use the word “token” for both “lexeme” and “token”.

For example, given code of ruby, print 3+2;, the lexemes are

print
3
+
2
;

example of lexeme and token
Lexeme/Token	Token category
sum	Identifier
=	Assignment operator
3	Integer literal
+	Addition operator
2	Integer literal
;	End of statement

Symbol Table: A table that stores info about identifiers and their associated info, such as its type, location in the source. Symbol table is usually generated by parser, and used thru-out compiling process.
Parser: Software that does parsing (aka Syntactic Analysis). Parser takes a input string, and a grammar, and generates a tree. The tree structure corresponds to about how the input string is generated from the grammar. (parser may also output other related info, such as symbol table). In a narrower sense, the software after lexer is called parser. I.e. Input string is feed to lexer, and its output feed to parser.
Semantic Analysis: Check the parse tree for invalid semantic. For example, variable used but not declared, which is syntactically valid, but semantically invalid. Other example: type checking, such as in 4+"some", is syntactically valid but semantically invalid in some languages.

Grammar

A language's syntax, needs to be specified precisely, and that spec is called grammar.

syntax vs grammar

in our context of computer language or formal language,

syntax: Rules for valid sequence of characters of a language. For example, we say “what's the syntax for array in JavaScript?”
grammar: Specification of syntax for a programing language, written in a form called Backus-Naur Form (aka BNF) or a variant, e.g. EBNF, ABNF.

In broader linguistics context (e.g. English, French, Chinese), syntax is a subset of grammar:

syntax is the order structure of words (such as position of verbs, nouns).
grammar is all rules about the validity of a sentence (such as including correct tense agreement, singular plural, person pronoun (“i go”, “she goes”))

Note: a computer language do not necessarily have a grammar spec. For example, unix config files, of lines of the form x = y, we can usually just parse it without needing to read in a grammar spec, because it is too simple, we understand it implicitly.

What is Formal Grammar

There are many approaches to specifying grammar.

One approach suitable for computer language is called formal grammar. It is also known as, or can be said as phrase structure grammar, or generative grammar, or rewrite rules. These different terms pretty much mean the same thing. Each emphasize different aspects.

Formal Grammar is basically a list of string rewrite rules. Each rule says how one string can be replaced by another, and we are also given a starting string. In this way, we can follow the rules to generate many strings. The collection of all possible strings, is said to be the valid strings of the language, namely, it is the language. This is also why, “formal grammar” is also called “generative grammar” and “rewrite rules”.

Backus–Naur Form (BNF)

Formal grammar is almost always written in a notation called Backus–Naur Form (or a variant of), shortened as BNF.

Backus–Naur Form: A notation used to specify the grammar of computer languages.

For example, let's say we have a language for addition. For example, 3+4 and 6+2 are both valid source code.

The grammar of this language, is specified in BNF like this:

S → NUMBER+NUMBER NUMBER → 0 NUMBER → 1 NUMBER → 2 NUMBER → 3 NUMBER → 4 NUMBER → 5 NUMBER → 6 NUMBER → 7 NUMBER → 8 NUMBER → 9

This grammar, describes a language's syntax, that is addition of 2 single digit integers.

In the above, those NUMBER are called nonterminals and those 0 to 9, and the plus sign + are called terminals.

Nonterminal (aka Variable, Place-Holder): . Nonterminal needs to be replaced by terminal. In linguistics, they are also called “syntactic categories”, such as “verb”, “noun”. Nonterminals are usually written in all UPPERCASE letters.
Terminal (aka Constant): They are called terminals, because, the variables (nonterminals) are eventually replaced by terminal, as the end of process.

In practice, terminals are either token strings, such as {“print”, “while”, “if”, “+”, “327”, “4.5”}. Or, in a lower level of lexical parser, they are characters, such as letters, digits, punctuation, space. The concepts are the same. Terminal denotes something that's fixed, they are constants.

In our grammar above, there are many rules that has NUMBER on the lhs (left-hand-side) . We make the grammar syntax simpler, by adding a “or” (aka alternative) syntax, using the VERTICAL LINE | character, like this:

S → NUMBER + NUMBER NUMBER → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This way, the notation is simpler. Each production rule will just have one unique nonterminal on the left-hand-side.

Some characters in BNF have special meta meaning here:

The arrow → is used to separate the left-hand-side and right-hand-side.
The vertical bar | means “or” (aka alternative). It means, you can pick the left or right side of | for replacement.
nonterminals are ALLCAPS by convention.
space are use as separators in our syntax.

Here, we assume that the terminals of a language does not include → and | and space character, otherwise, we need to use a notation other than BNF. For example, in EBNF (a variant of BNF), terminals are enclosed by straight double quote (U+22: QUOTATION MARK), for example:

S → NUMBER "+" NUMBER NUMBER→ "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Note: BNF is mostly used for teaching purposes or theoretical computer science context. BNF is not a specific language. There are many variations. Two most popular are called Extended Backus–Naur Form (EBNF) and Augmented Backus–Naur Form (ABNF). The differences of these notations have no importance with respect to theoretical computer science.

[see What is the Difference Between BNF, EBNF, ABNF?]

Formal Grammar Hierarchy (Chomsky Hierarchy)

Given a formal grammar, all possible strings from that grammar is said to be the language defined by that grammar. That is, a given grammar defines a language.

In theory of formal grammar, The set of possible rewrite rules are classified into categories, and one of the classification is called Formal Grammar Hierarchy (aka Chomsky Hierarchy).

Type-0 grammars (unrestricted grammars): no restriction.
Type-1 grammars (context-sensitive grammars): any lhs have at most one nonterminal.
Type-2 grammars (context-free grammars): any lhs is just one nonterminal.
Type-3 grammars (regular grammars): any lhs is just one nonterminal, and rhs is just 1 terminal followed by 0 or 1 nonterminal (called “right regular grammar”). Alternatively, rhs is just 1 terminal preceded by 0 or 1 nonterminal (called “left regular grammar”).

The classification of formal grammar is important in theoretical computer science. The study of formal grammar is not just for writing compiler, but is fundamental to questions about computability and is a foundation of mathematical logic. But for us, for creating compiler for programing language, we don't have to worry about the deeper theories.

For detail about formal grammar, see see Formal Grammar .

For practical programing languages, we are interested in just context free grammar.

The class of languages of context free grammar is called context free languages.

Context-Free Grammar (CFG)

Here's description of context-free grammar (CFG).

A set of replacement rules. (called production rules)
Each rule has left-hand-side (lhs) and right-hand-side (rhs).
One of the rule is designated as start rule. (often written as S, as lhs. Or, the first rule is implicitly understood as the start rule.)
A set of symbols, called nonterminal.
A set of symbols, called terminal. (nonterminal and terminal do not share elements)
lhs is one nonterminal.
rhs is a sequence of nonterminal and or terminal.
There's a special symbol, usually ε (U+3B5: GREEK SMALL LETTER EPSILON), that means empty string. This can be used for rhs as the sole rhs.

Note: all context-free grammar with the empty string ε can be replaced by a equivalent one without using ε. But it is not trivial. Grammars involving ε is convenient for human understanding, but painful for proofs and parsers to deal with.

Generate Sentence by Grammar

Begin with the start rule. You get the rhs.
Replace a nonterminal on the rhs, to a matching rule's rhs. That is, pick a rule whose lhs is the nonterminal. This replacement step is called a production.
Repeat the above until no more nonterminal is left.
The result is called a sentence.

All possible sentences, is called the language of that grammar.

when generating a sentence from a grammar, the intermediate string that still contain nonterminals is called sentential form.

Grammar Equivalence

Two grammars are said to be equivalent if they generates the same set of strings.

Grammar Equivalence is important topic in both theory and practice. In theoretical work, the concept is needed to classify grammars. In practical work, we often need to rewrite a grammar to a equivalent one, so it can be parsed by some specific parsing algorithm.

Recognize Language: Derive Language from Grammar

Inverse problem of generating language, is called recognizing language. That is, given a input string , and given a grammar, find out the exact ordered sequence of rule selections to produce the input string. (or, determine if it is not possible from the given grammar.)

Parsing

Parsing, is exactly the inverse problem of generating a sentence from grammar. Parser takes the input string and a (context-free) grammar, and returns a tree structure, such that each non-leaf node corresponds to one of the nonterminal. Each the node's children is the rhs elements of the rule. The leaf nodes, are the terminals, corresponding to the input string.

invalid language → If it is impossible to derive the input string from the given grammar, then the input string is said to be invalid (or, we say, it has syntax error).

Is it input string, or input stream, or input language?

The input to a parser is called various things depending on the focus:

Input string.
Input stream (of characters). Because, in practice, the first step of parsing, the lexer reads the input character by character in continuous fashion.
Input stream (of lexems). Because, parser takes output from lexer, lexem by lexem, in a continuous fashion.
Source code. (input to parser is source code, from programing language user's point of view.)
A (potential) language. (point of view of formal language theory, or computability.)

the following is work in progress.

warning: work in progress.

Order Doesn't Matter, Choice Matters

When you generate a sentence from a grammar, given a sentential form, which nonterminal you replace first doesn't matter, with respect to the final sentence. What matters is, for a nonterminal, which choice (if there are more than 1) of rule you picked to replace a nonterminal.

for example, if we have

E → E + T | T T → T * F | F F → ( E ) | NUMBER NUMBER → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

todo. explain

Leftmost Derivation, Rightmost Derivation

leftmost derivation → in parsing, in each sentential form, always replace the leftmost nonterminal first.

rightmost derivation → in parsing, in each sentential form, always replace the rightmost nonterminal first.

todo, show that a input string and grammar can have multiple derivations.

todo: show that a given derivation has more info than parse tree, because derivation contains order info. 2 derivation can arrive at the same parse tree. show example.

Ambiguous Grammar

Operator Associativity and Precedence

Note: operator associativity and precedence, is implied from a given grammar. Or, when you design a grammar for a language you have in mind, such as adding and multiplying numbers (e.g 3+4*2), you have to encode the operator precedence into the grammar.

todo. explain.

Top-Down Parsing, Bottom-Up Parsing

There are many techniques to parsing. They can in general divided into top-down and bottom-up approach.

parsing methods can be classified as top-down or bottom-up and as directional or non-directional; the directional methods can be further distinguished into deterministic and non-deterministic ones. [Dick Grune, §1.3] (but there are also techniques that doesn't fall into one of these classification.)

Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar. [Dragon Book] CYK algorithm Earley parser

top-down parsing is the approach of building the root of the parse tree first (with the start production rule), then build children.

bottom-up parsing is the approach of building the leaf nodes first. That is, start from terminals.

top-down parser is easier to write manually. Bottom-up parsing are usually used by automated tools.

Here's the major approaches to parsing, and their hierachy.

Top-down parsers:

recursive decent
␣predictive parser (no backtracking)
␣␣LL(k) parser

Bottom-up parsers:

Precedence parser
␣Operator-precedence parser
␣Simple precedence parser
BC (bounded context) parsing
LR parser (Left-to-right, Rightmost derivation)
␣Simple LR (SLR) parser
␣LALR parser
␣Canonical LR (LR(1)) parser
␣GLR parser
CYK parser
Recursive ascent parser
Shift-Reduce Parser

Some of the parser, can only parse specific class of grammar.

In practice, LL parser (top-down) and LR parser (bottom-up) is enough for most programing languages.

top-down parsers cannot handle left-recursive grammars. The grammar needs to be transformed to eliminate left-recursion. [dragon book, §4.3.3]

eliminate left-recursion

left-factoring

Left factoring, a grammar transformation to eliminate to make grammar suitable for predictive, or top-down, parsing.

lexical analysis

whitespace and comment

the lexer strip out these. Make things much easier for parser. but if part of grammar, it'll be much difficult. [dragon book 2.6.1]

lex generates symbol table.

lex may count newlines, so it can report errors.

some lex are separated into 2 steps:

token and attributes. Token needs a attribute, because, for example, 3 and 4 are both numbers. But, they are not the same. For parser, it's not necessary to know the difference, but later, for semantic, you need them.

symbol table: keep info of tokens, type, location found in source code (in case of reporting syntax error)

scan. Scan are those that does not require output tokens.
lexical analysis

token is a name and attribute pair.

common classes of tokens:

one token for each keyword
token for operators, one for each, or together
1 token for all identifiers
one or more tokens for constants, for example, numbers, strings
tokens for each punctuation

original regular expressions

regular expressions, in its original theoretical use, is just

| → the “or”
* → 0 or more.
adjacency → concatenation

here's later additions, as syntactic convenience, but equivalent in power:

+ → one or more. (called positive closure) This is equivalent as x|x*
? → zero or one. This is equivalent as x|ε
[…] → character class. This is equivalent to many or |.

nondeterministic finite automaton (NFA) and Deterministic Finite Automaton (DFA)

In automata theory, a finite state machine is called a deterministic finite automaton (DFA), if

each of its transitions is uniquely determined by its source state and input symbol, and
reading an input symbol is required for each state transition.

A nondeterministic finite automaton (NFA), or nondeterministic finite state machine, does not need to obey these restrictions. In particular, every DFA is also an NFA.

parsing arithmetic

suppose we have a lang of just plus and multiply. This grammar can be written as:

E → E + T | T
T → T * F | F
F → (E) | id

The letters of nonterminals are chosen as a convenient reminder, E means expression, T is term, F is factor.

this grammar belong to LR grammar. (which is bottom-up) The grammar is left-recursive.

A equivalent grammar, but not left-recursive, is this:

E → T E'
E → +T E' | E
T → F T
V → ????

the following grammar treats + and * as same (no precedence). It is a ambiguous grammar.

E → E + E | E * E | ( E ) | id

Simple Parser in Java

put the dragon book java code here. annotate it

Syntax Error Handling

two strategies:

panic-mode recovery: Discard input symbols until reaching a bracket.
phrase-level recovery: Do local correction. Remove/Add a comma, semicolon, bracket, etc.

[dragon book, §4.1.4]

Common errors

Lexical Error: Example: misspelling. Missing quotes
Syntactic Error: Example: missing semicolon, missing brackets.
Semantic Error: Example: type mismatch.

error detection and handling is a difficult problem.

another technique to handle error is to add production rules to the grammar for checking common errors, such as missing semicolon.

McNaughton-Yamada algorithm

given a regex, generates nda diagram At http://hackingoff.com/compilers/regular-expression-to-nfa-dfa

In computer science, Thompson's construction is an algorithm for transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). This NFA can be used to match strings against the regular expression.

Regular expression and nondeterministic finite automaton are two abstract representation of formal languages. While regular expressions are used e.g. to describe advanced search patterns in "find and replace"-like operations of text processing utilities, the NFA format is better suited for execution on a computer. Hence, this algorithm is of practical interest, since it can be considered as a compiler from regular expression to NFA. On a more theoretical point of view, this algorithm is a part of the proof that they both accept exactly the same languages, that is, the regular languages.

A thus obtained automaton can be made deterministic by the powerset construction and then be minimized to get an optimal automaton corresponding to the given regular expression, but it may also be used directly.

Thompson's construction algorithm (aka McNaughton-Yamada algorithm)