How Compiler Works

By Xah Lee. Date: . Last updated: .

warning: work in progress.

compiler → translates a computer language A to B. Typically, translates a higher level language such as ruby, C, to machine code.

interpreter → is like a compiler, but translate a small bit of code at a time, and also run it. (when you use interactive command line interface of a language (aka REPL (read eval print loop)), it's interpreter doing the work.)

Assembly language → a low level language that still make sense to human.

A line of assembly language looks like this:

MOV AL, 61h       ; Load AL with 97 decimal (61 hex)

machine code → set of instructions executed directly by CPU. Each CPU has different machine language. Machine code is basically a sequence of 0 and 1s.

Assembler → software that translates assembly language code into machine code. The output of assembler is called object file. (it may generate more than one.)

linker → software that takes the object files of assembler's object files into one executable file.

loader → part of the operating system that loads (aka run, execute) a file.

Compiler Phases

The task of a compiler is usually divided into 2 steps.

  1. analysis → This is parsing: Read source code, understand what it means, output intermediate code.
  2. synthesis → Reads intermediate code, and generate machine code or target language code.
parser flow diagram
parser flow diagram

When compiler translates language A to B, usually it generates a intermediate representation first, then from that, to B.

intermediate language → (aka bytecode) This is a language created by compiler after parsing (aka syntactic analysis). After parsing, the semantics of the source code is known, and this info is the intermediate language. Intermediate language is useful because: ① breaks compiler software into components for easier management. ② is often portable (that is, not hardware dependent). ③ can be used to generate different target language.

One example of intermediate code is known as three-address code.

t1 := b * b
t2 := 4 * a
t3 := t2 * c
t4 := t1 - t3
t5 := sqrt(t4)
t6 := 0 - b
t7 := t5 + t6
t8 := 2 * a
t9 := t7 / t8
x := t9
example of three-address code.

Phases of Parser

The analysis part, is called parsing, done by parser. Parsing usually has the following phases:

  1. lexical analysis
  2. syntactic analysis
  3. semantic analysis

lexical analysis → is the first step of parsing. It takes a input string, and lexical grammar, then breaks the input string into a sequence of syntactical units, each is called a lexeme. The software that does lexical analysis is called lexer → (aka tokenizer, scanner)

For example, given code of ruby, print 3+2;, the lexemes are

  1. print
  2. 3
  3. +
  4. 2
  5. ;

token → a lexeme with extra info, such as what type the lexeme is. e.g. keyword, operator, identifier, number, string.

example of lexeme and token
Lexeme/TokenToken category
sumIdentifier
=Assignment operator
3Integer literal
+Addition operator
2Integer literal
;End of statement

symbol table → a table that stores info about identifiers and each is associated info, such its type, location in the source. Symbol table is usually generated by parser, and used thru-out compiling process.

parser → software that does parsing (aka Syntactic Analysis). Parser takes a input string, and a grammar, and generates a tree. The tree structure corresponds to about how the input string is generated from the grammar. (parser may also output other related info, such as symbol table). In a narrower sense, the software after lexer is called parser. i.e. input string is feed to lexer, and its output feed to parser.

semantic analysis → check the parse tree for invalid semantic. For example, variable used but not declared, which is syntactically valid, but semantically invalid. Other example: type checking, such as in 4+"some".

grammar

A language's syntax, needs to be specified precisely, and that spec is called grammar.

syntax vs grammar

in our context (computer language, formal language),

In broader linguistics context, syntax is a subset of grammar:

Note: a computer language do not necessarily have to specify grammar. For example, unix config files, of lines of the form x = y, we can usually just parse it without needing to read in a grammar spec, because it is too simple.

What's Formal Grammar

There are many approaches to specifying grammar.

One approach suitable for computer language is called formal grammar. It is also known as, or can be said as phrase structure grammar, or generative grammar, or rewrite rules. These different terms pretty much mean the same thing. They emphasize different aspects.

Formal Grammar is basically a list of string rewrite rules. Each rule says how one string can be replaced by another, and we are also given a starting string. In this way, we can follow the rules to generate many strings. The collection of strings, are said to be the valid strings of the language, namely, it is the language. This is also why, “formal grammar” is also called “generative grammar” and “rewrite rules”.

Backus–Naur Form

Formal grammar is usually written in a notation called Backus–Naur Form, shortened as BNF.

Backus–Naur Form → a notation used to specify the grammar of computer languages.

for example, let's say we have a language for addition. For example, 3+4, 6+2 are all valid source code.

the grammar of this language, is specified in BNF like this:

SNUMBER+NUMBER NUMBER → 0 NUMBER → 1 NUMBER → 2 NUMBER → 3 NUMBER → 4 NUMBER → 5 NUMBER → 6 NUMBER → 7 NUMBER → 8 NUMBER → 9

This grammar, describes a language's syntax, that is addition of 2 single digit integers.

In the above, those NUMBER are called nonterminals and those 0 to 9, and the plus sign + are called terminals.

nonterminal → aka variable, place-holder. Nonterminal needs to be replaced by terminal. In linguistics, they are also called “syntactic categories”, such as “verb”, “noun”. Nonterminals are usually written in all UPPERCASE letters.

terminal → a symbol (or a fixed string). They are called terminals, because, when generating sentences from the rules, the terminals in rules are the end. As in, the process of replacing a variable terminates when it is replaced by a terminal.

Each terminal is represented by a symbol (a letter). In practice, terminals are either token strings, such as {“print”, “while”, “if”, “+”, “327”, “4.5”}. Or, in a lower level of lexical parser, they are characters, such as letters, digits, punctuation, space. The concepts are the same. Terminal denotes something that's fixed.

If you look at the grammar, there are many rules that has NUMBER on the lhs (left-hand-side) . We make the grammar syntax simpler, by adding a “or” (aka alternative) syntax, using the VERTICAL LINE | character, like this:

SNUMBER + NUMBER NUMBER → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This way, the notation is simpler. Each production rule will just have one unique nonterminal on the lhs.

Some characters in BNF have special meta meaning here:

Here, we assume that the terminals of a language does not include and | and space character, otherwise, we need to use a notation other than BNF. For example, in EBNF (a variant of BNF), terminals are enclosed by QUOTATION MARK " , for example:

SNUMBER + NUMBER NUMBER→ "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Note: BNF is mostly used for teaching purposes or theoretical computer science context. BNF is not a specific language. There are many variations. Two most popular are called Extended Backus–Naur Form (EBNF) and Augmented Backus–Naur Form (ABNF). The differences of these notations have no theoretical importance. 〔►see What's the Difference Between BNF, EBNF, ABNF?

Formal Grammar Hierarchy (Chomsky Hierarchy)

Given a formal grammar, all possible strings from that grammar is said to be the language defined by that grammar. That is, each grammar defines a language.

The set of rewrite rules are classified into categories, and one of the classification is called Formal Grammar Hierarchy (aka Chomsky Hierarchy).

The classification of formal grammar is important in theoretical computer science. The study of formal grammar is not just for writing compiler, but is fundamental to questions about computability and is a foundation of mathematical logic. But for us, for creating compiler for programing language, we don't have to worry about the deeper theories.

For detail about formal grammar, see see Formal Grammar.

For programing languages, we are interested in a particular class of grammar called context free grammar. The class of languages of context free grammar is called context free languages.

Context-Free Grammar (CFG)

Here's informal description of context-free grammar (CFG).

Note: all context-free grammar with the empty string ε can be replaced by a equivalent one without using ε. But it is not trivial. Grammars involving ε is convenient for human understanding, but painful for proofs and parsers to deal with.

is terminal a string or a symbol?

terminals can be considered to be a symbol, or a sequence of glyph, or a string. Which word to use depends on the context.

In theoretical text (such as formal language, computability), terminal is usually called a symbol. In practical contexts, terminals is usually called a string or a token.

Generate Sentence by Grammar

  1. Begin with the start rule. You get the rhs. Lets call this the current “sentential form”.
  2. Replace a nonterminal on the rhs, to a matching rule's rhs. That is, pick a rule whose lhs is the nonterminal. This replacement step is called a production.
  3. Repeat the above until no more nonterminal is left in the sentential form.
  4. The result is called a sentence.

All possible sentences, is called the language of that grammar.

when generating a sentence from a grammar, the intermediate string (which still contain nonterminals) is called sentential form.

context-free grammar → a grammar where the lhs is just one single nonterminal, for all production rules.

What is context sensitive grammar?

In a context-free grammar, the lhs must be one single nonterminal.

If the lhs can have both nonterminal and terminals, then it is not context-free grammar. It is context-sensitive grammar.

For practical programing language parsing, we are generally only interested in context-free grammar. For detail of other grammar, see: Formal Grammar

Grammar Equivalence

two grammars are said to be equivalent if they generates the same set of strings.

Grammar Equivalence is important topic in both theory and practice. In theoretical work, the concept is needed to classify grammars. In practical work, we often need to rewrite a grammar to a equivalent one, so it can be parsed by some specific parsing algorithm.

Recognize Language: Derive Language from Grammar

inverse problem of generating language, is called recognizing language. That is, given a input string , and given a grammar, find out the exact ordered sequence of rule selections to produce the input string. (or, determine if it is not possible from the given grammar.)

Parsing, Syntactic Analysis

parsing (aka Syntactic Analysis), is exactly the derivation problem. Parser takes the input string and a context-free grammar, and returns a tree structure, such that each non-leaf node corresponds to one of the nonterminal. Each the node's children is the rhs elements of the rule. The leaf nodes, are the terminals, corresponding to the input string.

invalid language → If it is impossible to derive the input string from the given grammar, then the input string is said to be invalid (aka syntax error, doesn't belong to the language)

Is it input string, or input stream, or input language?

The input to a parser are called various things depending on the focus:






the following is work in progress.

Derivation of Grammar

When you generate a sentence from a grammar, given a sentential form, which nonterminal you replace first doesn't matter, with respect to the final sentence. What matters is which choice you pick when a nonterminal has alternatives on the rhs.

for example, if we have

SNUMBER + NUMBER NUMBER → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

EE + T | T TT * F | F F → ( E ) | NUMBER NUMBER → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

in step 1, we have this NUMBER + NUMBER

leftmost derivation → in parsing, in each sentential form, always replace the leftmost nonterminal first.

rightmost derivation → in parsing, in each sentential form, always replace the rightmost nonterminal first.

todo, show that a input string and grammar can have multiple derivations.

todo: show that a given derivation has more info than parse tree, because derivation contains order info. 2 derivation can arrive at the same parse tree. show example.

Ambiguous Grammar

Ambiguous Grammar

Operator Associativity and Precedence

Top-Down Parsing, Bottom-Up Parsing

There are many techniques to parsing. They can in general divided into top-down and bottom-up approach.

parsing methods can be classified as top-down or bottom-up and as directional or non-directional; the directional methods can be further distinguished into deterministic and non-deterministic ones. [Dick Grune, §1.3] (but there are also techniques that doesn't fall into one of these classification.)

Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar. [Dragon Book] 〔CYK algorithm〕 〔Earley parser

top-down parsing is the approach of building the root of the parse tree first (with the start production rule), then build children.

bottom-up parsing is the approach of building the leaf nodes first. That is, start from terminals.

top-down parser is easier to write manually. Bottom-up parsing are usually used by automated tools.

Here's the major approaches to parsing, and their hierachy.

Top-down parsers:

Bottom-up parsers:

Some of the parser, can only parse specific class of grammar.

In practice, LL parser (top-down) and LR parser (bottom-up) is enough for most programing languages.

top-down parsers cannot handle left-recursive grammars. The grammar needs to be transformed to eliminate left-recursion. [dragon book, §4.3.3]

eliminate left-recursion

left-factoring

Left factoring, a grammar transformation to eliminate to make grammar suitable for predictive, or top-down, parsing.

lexical analysis

whitespace and comment

the lexer strip out these. Make things much easier for parser. but if part of grammar, it'll be much difficult. [dragon book 2.6.1]

lex generates symbol table.

lex may count newlines, so it can report errors.

some lex are separated into 2 steps:

token and attributes. Token needs a attribute, because, for example, 3 and 4 are both numbers. But, they are not the same. For parser, it's not necessary to know the difference, but later, for semantic, you need them.

symbol table: keep info of tokens, type, location found in source code (in case of reporting syntax error)

token is a name and attribute pair.

common classes of tokens:

original regular expressions

regular expressions, in its original theoretical use, is just

here's later additions, as syntactic convenience, but equivalent in power:

nondeterministic finite automaton (NFA) and Deterministic Finite Automaton (DFA)

In automata theory, a finite state machine is called a deterministic finite automaton (DFA), if

A nondeterministic finite automaton (NFA), or nondeterministic finite state machine, does not need to obey these restrictions. In particular, every DFA is also an NFA.

parsing arithmetic

suppose we have a lang of just plus and multiply. This grammar can be written as:

E → E + T | T
T → T * F | F
F → (E) | id

The letters of nonterminals are chosen as a convenient reminder, E means expression, T is term, F is factor.

this grammar belong to LR grammar. (which is bottom-up) The grammar is left-recursive.

A equivalent grammar, but not left-recursive, is this:

E → T E'
E → +T E' | E
T → F T
V → ????

the following grammar treats + and * as same (no precedence). It is a ambiguous grammar.

E → E + E | E * E | ( E ) | id

Simple Parser in Java

put the dragon book java code here. annotate it

Syntax Error Handling

two strategies:

[dragon book, §4.1.4]

common errors

error detection and handling is a difficult problem.

another technique to handle error is to add production rules to the grammar for checking common errors, such as missing semicolon.

McNaughton-Yamada algorithm

given a regex, generates nda diagram @ http://hackingoff.com/compilers/regular-expression-to-nfa-dfa

In computer science, Thompson's construction is an algorithm for transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). This NFA can be used to match strings against the regular expression.

Regular expression and nondeterministic finite automaton are two abstract representation of formal languages. While regular expressions are used e.g. to describe advanced search patterns in "find and replace"-like operations of text processing utilities, the NFA format is better suited for execution on a computer. Hence, this algorithm is of practical interest, since it can be considered as a compiler from regular expression to NFA. On a more theoretical point of view, this algorithm is a part of the proof that they both accept exactly the same languages, that is, the regular languages.

A thus obtained automaton can be made deterministic by the powerset construction and then be minimized to get an optimal automaton corresponding to the given regular expression, but it may also be used directly.

Thompson's construction algorithm (aka McNaughton-Yamada algorithm)