Compiler Construction
Hanspeter Mössenböck, University of Linz
http://ssw.jku.at/Misc/CC/

Text Book
N. Wirth: Compiler Construction, Addison-Wesley 1996
http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

1. Overview
1.1 Motivation
1.2 Structure of a Compiler
1.3 Grammars
1.4 Chomsky's Classification of Grammars
1.5 The MicroJava Language

Why should I learn about compilers?

It's part of the general background of any software engineer
• How do compilers work?
• How do computers work? (instruction set, registers, addressing modes, run-time data structures, ...)
• What machine code is generated for certain language constructs? (efficiency considerations)
• What is good language design?
• Opportunity for a non-trivial programming project

Also useful for general software development
• Reading syntactically structured command-line arguments
• Reading structured data (e.g. XML files, part lists, image files, ...)
• Searching in hierarchical namespaces
• Interpretation of command codes
• ...

Dynamic Structure of a Compiler

character stream:   val = 10 * val + i

  lexical analysis (scanning)

token stream:
  token number:   1      3       2       4      1      5     1
  token:          ident  assign  number  times  ident  plus  ident
  token value:    "val"          10             "val"        "i"

  syntax analysis (parsing)

syntax tree:
  [Figure: tree with the nodes Statement, Expression and Term above the leaves ident = number * ident + ident]

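The scanning step shown above can be sketched in a few lines of Java. This is only an illustration, not the scanner used in the course: the token codes (ident = 1, number = 2, assign = 3, times = 4, plus = 5) are taken from the token stream on the slide, while the class name `MiniScanner`, the `Token` class and the method `scan` are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the scanning phase; all names are illustrative only.
class MiniScanner {
    // Token codes as they appear in the token stream on the slide.
    static final int IDENT = 1, NUMBER = 2, ASSIGN = 3, TIMES = 4, PLUS = 5;

    static final class Token {
        final int kind;
        final String value;   // null for tokens that carry no value
        Token(int kind, String value) { this.kind = kind; this.value = value; }
        @Override public String toString() { return value == null ? "" + kind : kind + " " + value; }
    }

    // Turns a character stream into a token stream.
    static List<Token> scan(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;                                   // skip blanks
            } else if (Character.isLetter(ch)) {
                int start = i;                         // identifier: letter {letter | digit}
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add(new Token(IDENT, src.substring(start, i)));
            } else if (Character.isDigit(ch)) {
                int start = i;                         // number: digit {digit}
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(new Token(NUMBER, src.substring(start, i)));
            } else {
                int kind;
                switch (ch) {
                    case '=': kind = ASSIGN; break;
                    case '*': kind = TIMES; break;
                    case '+': kind = PLUS; break;
                    default: throw new IllegalArgumentException("unexpected character: " + ch);
                }
                tokens.add(new Token(kind, null));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : scan("val = 10 * val + i")) System.out.println(t);
    }
}
```

The sketch only knows the three operators that occur in the example; a real scanner would cover the full lexical structure of the language.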
Dynamic Structure of a Compiler

syntax tree
  [Figure: tree with the nodes Statement, Expression and Term above the leaves ident = number * ident + ident]

  semantic analysis (type checking, ...)

intermediate representation:   syntax tree, symbol table, ...

  optimization
  code generation

machine code:   const 10, load 1, mul, ...

Compiler versus Interpreter

Compiler: translates to machine code
  source code → scanner → parser → ... → code generator → machine code → loader

Interpreter: executes source code "directly"
  source code → scanner → parser → interpretation
• statements in a loop are scanned and parsed again and again

Variant: interpretation of intermediate code
  source code → ... compiler ... → intermediate code (e.g. Java bytecode) → VM
• source code is translated into the code of a virtual machine (VM)
• VM interprets the code, simulating the physical machine

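To make the "VM interprets the code" variant concrete, here is a minimal sketch of an interpreter loop for a stack machine. It is an assumption-laden illustration: the instruction set (`CONST`, `LOAD`, `ADD`, `MUL`, `STORE`, `HALT`), the opcode numbers and the class name `TinyVM` are invented for this example and are not Java bytecode or the course's VM.

```java
// Hypothetical stack-machine sketch; the opcodes are illustrative, not real Java bytecode.
class TinyVM {
    static final int CONST = 0, LOAD = 1, ADD = 2, MUL = 3, STORE = 4, HALT = 5;

    // Interprets the code against the local variables and returns them when HALT is reached.
    static int[] run(int[] code, int[] locals) {
        int[] stack = new int[16];   // small fixed-size expression stack (assumption)
        int sp = 0;                  // stack pointer
        int pc = 0;                  // program counter
        while (true) {
            int op = code[pc++];     // fetch the next instruction
            switch (op) {
                case CONST: stack[sp++] = code[pc++]; break;            // push a constant
                case LOAD:  stack[sp++] = locals[code[pc++]]; break;    // push a variable
                case ADD:   { int b = stack[--sp]; int a = stack[--sp]; stack[sp++] = a + b; break; }
                case MUL:   { int b = stack[--sp]; int a = stack[--sp]; stack[sp++] = a * b; break; }
                case STORE: locals[code[pc++]] = stack[--sp]; break;    // pop into a variable
                case HALT:  return locals;
                default: throw new IllegalStateException("bad opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // val = 10 * val + i, with val in local 0 and i in local 1
        int[] code = {CONST, 10, LOAD, 0, MUL, LOAD, 1, ADD, STORE, 0, HALT};
        int[] locals = run(code, new int[] {3, 4});
        System.out.println(locals[0]);   // prints 34
    }
}
```

Unlike the pure interpreter, this loop never rescans or reparses source text: the statement is translated once, and only the compact intermediate code is executed.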
Static Structure of a Compiler

• parser & sem. analysis: "main program", directs the whole compilation
• scanner: provides tokens from the source code
• code generation: generates machine code
• symbol table: maintains information about declared names and types

[Figure: component diagram; the legend distinguishes "uses" relations from data flow]

What is a grammar?

Example
  Statement = "if" "(" Condition ")" Statement ["else" Statement].

Four components
• terminal symbols: are atomic, e.g. "if", ">=", ident, number, ...
• nonterminal symbols: are decomposed into smaller units, e.g. Statement, Condition, Type, ...
• productions: rules how to decompose nonterminals, e.g.
    Statement = Designator "=" Expr ";".
    Designator = ident ["." ident].
    ...
• start symbol: topmost nonterminal, e.g. Java

EBNF Notation

Extended Backus-Naur form for writing grammars
• John Backus: developed the first Fortran compiler
• Peter Naur: edited the Algol60 report

Productions
  Statement = "write" ident "," Expression ";" .

  left-hand side:   Statement
  right-hand side:  "write" ident "," Expression ";"
  "write" is a literal, ident is a terminal symbol, Expression is a nonterminal symbol;
  the final period terminates a production

by convention
• terminal symbols start with lower-case letters
• nonterminal symbols start with upper-case letters

Metasymbols
  |      separates alternatives   a | b | c  ≡  a or b or c
  (...)  groups alternatives      a (b | c)  ≡  ab | ac
  [...]  optional part            [a] b      ≡  ab | b
  {...}  iterative part           {a} b      ≡  b | ab | aab | aaab | ...

Example: Grammar for Arithmetic Expressions

Productions
  Expr   = ["+" | "-"] Term {("+" | "-") Term}.
  Term   = Factor {("*" | "/") Factor}.
  Factor = ident | number | "(" Expr ")".

Terminal symbols
• simple TS: "+", "-", "*", "/", "(", ")" (just 1 instance)
• terminal classes: ident, number (multiple instances)

Nonterminal symbols
  Expr, Term, Factor

Start symbol
  Expr

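One way to see how these three productions work together is a recursive-descent recognizer with one method per nonterminal. The sketch below is an illustration under simplifying assumptions: it works directly on characters instead of tokens, blanks are simply removed before parsing, ident is taken to be a letter followed by letters or digits, and the class name `ExprRecognizer` and the method `accepts` are hypothetical.

```java
// Hypothetical recursive-descent recognizer for the Expr grammar; names are illustrative.
class ExprRecognizer {
    private final String src;
    private int pos = 0;

    ExprRecognizer(String src) { this.src = src.replace(" ", ""); }   // simplification: drop blanks

    // Returns true if the whole input can be derived from Expr.
    static boolean accepts(String s) {
        ExprRecognizer r = new ExprRecognizer(s);
        try {
            r.expr();
            return r.pos == r.src.length();
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // Current character, or '\0' at the end of the input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // Expr = ["+" | "-"] Term {("+" | "-") Term}.
    private void expr() {
        if (peek() == '+' || peek() == '-') pos++;
        term();
        while (peek() == '+' || peek() == '-') { pos++; term(); }
    }

    // Term = Factor {("*" | "/") Factor}.
    private void term() {
        factor();
        while (peek() == '*' || peek() == '/') { pos++; factor(); }
    }

    // Factor = ident | number | "(" Expr ")".
    private void factor() {
        char ch = peek();
        if (Character.isLetter(ch)) {
            while (Character.isLetterOrDigit(peek())) pos++;   // ident
        } else if (Character.isDigit(ch)) {
            while (Character.isDigit(peek())) pos++;           // number
        } else if (ch == '(') {
            pos++;
            expr();
            if (peek() != ')') throw new IllegalStateException("')' expected");
            pos++;
        } else {
            throw new IllegalStateException("factor expected at position " + pos);
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("-a + 10 * (b - 2)"));   // prints true
        System.out.println(accepts("a + * b"));             // prints false
    }
}
```

Each optional part `[...]` becomes an `if`, each iteration `{...}` becomes a `while`, and each nonterminal on a right-hand side becomes a method call.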
Terminal Start Symbols of Nonterminals

What are the terminal symbols with which a nonterminal can start?

  Expr   = ["+" | "-"] Term {("+" | "-") Term}.
  Term   = Factor {("*" | "/") Factor}.
  Factor = ident | number | "(" Expr ")".

  First(Factor) = ident, number, "("
  First(Term)   = First(Factor)
                = ident, number, "("
  First(Expr)   = "+", "-", First(Term)
                = "+", "-", ident, number, "("

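The First sets above were derived by hand. As a cross-check, they can also be computed mechanically by fixed-point iteration. The following sketch makes several assumptions that are not part of the slides: the grammar is rewritten from EBNF into plain BNF with empty alternatives, the helper nonterminals `Sign`, `Rest` and `TermRest` are introduced for that purpose, and the class `FirstSets` with its methods is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: First sets by fixed-point iteration over a BNF grammar.
class FirstSets {
    // g maps each nonterminal to its alternatives; an empty alternative is the empty string.
    static Map<String, Set<String>> compute(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        Set<String> nullable = new HashSet<>();    // nonterminals that can derive the empty string
        for (String nt : g.keySet()) first.put(nt, new TreeSet<>());
        boolean changed = true;
        while (changed) {                          // repeat until no set grows any more
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                Set<String> target = first.get(e.getKey());
                for (List<String> alt : e.getValue()) {
                    boolean allNullable = true;
                    for (String sym : alt) {
                        if (g.containsKey(sym)) {  // nonterminal: inherit its First set
                            changed |= target.addAll(first.get(sym));
                            if (nullable.contains(sym)) continue;   // look past it
                        } else {                   // terminal: it starts this alternative
                            changed |= target.add(sym);
                        }
                        allNullable = false;
                        break;
                    }
                    if (allNullable) changed |= nullable.add(e.getKey());
                }
            }
        }
        return first;
    }

    // The Expr grammar in BNF; Sign, Rest and TermRest are helper nonterminals (assumption).
    static Map<String, List<List<String>>> exprGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        List<String> empty = Collections.emptyList();
        g.put("Expr", Arrays.asList(Arrays.asList("Sign", "Term", "Rest")));
        g.put("Sign", Arrays.asList(Arrays.asList("+"), Arrays.asList("-"), empty));
        g.put("Rest", Arrays.asList(Arrays.asList("+", "Term", "Rest"),
                                    Arrays.asList("-", "Term", "Rest"), empty));
        g.put("Term", Arrays.asList(Arrays.asList("Factor", "TermRest")));
        g.put("TermRest", Arrays.asList(Arrays.asList("*", "Factor", "TermRest"),
                                        Arrays.asList("/", "Factor", "TermRest"), empty));
        g.put("Factor", Arrays.asList(Arrays.asList("ident"), Arrays.asList("number"),
                                      Arrays.asList("(", "Expr", ")")));
        return g;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> first = compute(exprGrammar());
        for (String nt : Arrays.asList("Factor", "Term", "Expr")) {
            System.out.println("First(" + nt + ") = " + first.get(nt));
        }
    }
}
```

The iteration is needed because First(Expr) depends on First(Term), which depends on First(Factor); the loop simply repeats until all dependencies have been propagated.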
Terminal Successors of Nonterminals

Which terminal symbols can follow a nonterminal in the grammar?

  Expr   = ["+" | "-"] Term {("+" | "-") Term}.
  Term   = Factor {("*" | "/") Factor}.
  Factor = ident | number | "(" Expr ")".

Where does Expr occur on the right-hand side of a production? What terminal symbols can follow there?

  Follow(Expr)   = ")", eof
  Follow(Term)   = "+", "-", Follow(Expr)
                 = "+", "-", ")", eof
  Follow(Factor) = "*", "/", Follow(Term)
                 = "*", "/", "+", "-", ")", eof

Strings and Derivations

String
  A finite sequence of symbols from an alphabet.
  Alphabet: all terminal and nonterminal symbols of a grammar.
  Strings are denoted by Greek letters (α, β, γ, ...)
  e.g.: α = ident + number
        β = - Term + Factor * number

Empty String
  The string that contains no symbol (denoted by ε).

Derivation
  α ⇒ β (direct derivation)
    β results from α by replacing one nonterminal symbol (NTS) with the right-hand side of a production of NTS
    e.g.: Term + Factor * Factor ⇒ Term + ident * Factor
  α ⇒* β (indirect derivation)
    α ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γn ⇒ β

Recursion

A production is recursive if X ⇒* ω1 X ω2
Can be used to express repetitions and nested structures

Direct recursion X ⇒ ω1 X ω2
  Left recursion     X = b | X a.          X ⇒ X a ⇒ X a a ⇒ X a a a ⇒ ... ⇒ b a a a a a
  Right recursion    X = b | a X.          X ⇒ a X ⇒ a a X ⇒ a a a X ⇒ ... ⇒ a a a a a b
  Central recursion  X = b | "(" X ")".    X ⇒ (X) ⇒ ((X)) ⇒ (((X))) ⇒ ... ⇒ (((... (b)...)))

Indirect recursion X ⇒* ω1 X ω2
  Example
    Expr = Term {"+" Term}.
    Term = Factor {"*" Factor}.
    Factor = id | "(" Expr ")".

    Expr ⇒ Term ⇒ Factor ⇒ "(" Expr ")"

How to Remove Left Recursion

Left recursion cannot be handled in top-down parsing
  X = b | X a.
  Both alternatives start with b. The parser cannot decide which one to choose.

Left recursion can always be transformed into iteration
  X ⇒ baaaa...a
  X = b {a}.

Another example
  E = T | E "+" T.
  What phrases can be derived?
    E ⇒ T
    E ⇒ E + T ⇒ T + T
    E ⇒ E + T ⇒ E + T + T ⇒ T + T + T
    E ⇒ E + T ⇒ E + T + T ⇒ E + T + T + T ⇒ ...
  Thus
    E = T {"+" T}.

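The effect of the transformation can be sketched in Java. A method written directly after the left-recursive production X = b | X a would have to call itself before consuming any input and would therefore never terminate; the iterative form X = b {a} turns into a plain loop. The class name `LeftRecursionDemo` and the method `parseX` are hypothetical, and single characters stand in for the terminal symbols a and b.

```java
// Hypothetical sketch: recognizing X = b {a}, the iterative form of X = b | X a.
class LeftRecursionDemo {
    // Returns true if s consists of one 'b' followed by any number of 'a'.
    static boolean parseX(String s) {
        int pos = 0;
        if (pos >= s.length() || s.charAt(pos) != 'b') return false;   // the mandatory b
        pos++;
        while (pos < s.length() && s.charAt(pos) == 'a') pos++;        // {a} becomes a loop
        return pos == s.length();                                      // no input may be left over
    }

    public static void main(String[] args) {
        System.out.println(parseX("baaa"));   // prints true
        System.out.println(parseX("ab"));     // prints false
    }
}
```

The same pattern applies to E = T {"+" T}: parse one T, then loop while the next symbol is "+".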
Classification of Grammars

Due to Noam Chomsky (1956)
Grammars are sets of productions of the form α = β.

class 0  Unrestricted grammars (α and β arbitrary)
         e.g.: X = a X b | Y c Y.
               a Y c = d.
               d Y = b b.
               X ⇒ aXb ⇒ aYcYb ⇒ dYb ⇒ bbb
         Recognized by Turing machines

class 1  Context-sensitive grammars (|α| ≤ |β|)
         e.g.: a X = a b c.
         Recognized by linear bounded automata

class 2  Context-free grammars (α = NT, β ≠ ε)
         e.g.: X = a b c.
         Recognized by push-down automata

class 3  Regular grammars (α = NT, β = T or T NT)
         e.g.: X = b | b Y.
         Recognized by finite automata

Only classes 2 and 3 are relevant in compiler construction.

Sample MicroJava Program

  program P                          // main program; no separate compilation
    final int size = 10;
    class Table {                    // classes (without methods)
      int[] pos;
      int[] neg;
    }
    Table val;                       // global variables
  {
    void main()
      int x, i;                      // local variables
    {
      //---------- initialize val ----------
      val = new Table;
      val.pos = new int[size];
      val.neg = new int[size];
      i = 0;
      while (i < size) {
        val.pos[i] = 0;
        val.neg[i] = 0;
        i = i + 1;
      }
      //---------- read values ----------
      read(x);
      while (x != 0) {
        if (x >= 0)
          val.pos[x] = val.pos[x] + 1;
        else if (x < 0)
          val.neg[-x] = val.neg[-x] + 1;
        read(x);
      }
    }
  }

Lexical Structure of MicroJava

Identifiers     ident = letter {letter | digit | '_'}.
Numbers         number = digit {digit}.
                all numbers are of type int
Char constants  charConst = '\'' char '\''.
                all character constants are of type char (may contain \r, \n, \t)
                no strings
Keywords        program class if else while read print return void final new
Operators       + - * / % == != > >= < <= ( ) [ ] { } = ; , .
Comments        // ... eol
Types           int char arrays classes

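The lexical rules above are regular, so each of them can be checked with a regular expression. The sketch below classifies a single lexeme; it is not the MicroJava scanner. It assumes that "letter" means an ASCII letter and that a char constant holds exactly one plain character or one of the escapes \r, \n, \t. The class name `LexemeClassifier` and the method `classify` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: classifying one MicroJava lexeme with regular expressions.
class LexemeClassifier {
    // Keyword list as given in the lexical structure.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "program", "class", "if", "else", "while", "read", "print",
        "return", "void", "final", "new"));

    static String classify(String lexeme) {
        // Keywords look like identifiers, so they must be checked first.
        if (KEYWORDS.contains(lexeme)) return "keyword";
        // ident = letter {letter | digit | '_'}.  Assumption: ASCII letters only.
        if (lexeme.matches("[A-Za-z][A-Za-z0-9_]*")) return "ident";
        // number = digit {digit}.
        if (lexeme.matches("[0-9]+")) return "number";
        // charConst = '\'' char '\''.  Assumption: one plain character or \r, \n, \t.
        if (lexeme.matches("'(\\\\[rnt]|[^'\\\\])'")) return "charConst";
        return "unknown";
    }

    public static void main(String[] args) {
        for (String s : Arrays.asList("while", "val_1", "10", "'x'", "1abc")) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Checking the keyword table before the identifier rule mirrors what a hand-written scanner usually does: it reads an identifier first and then looks it up in the keyword list.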