Compiler Design, Spring 2018
3.0 Frontend
Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland

Admin issues
§ Recitation sessions take place only when announced
  § In the lecture / on course website / on the mailing list
§ No recitation session this week
§ Next recitation session
  § March 15, 2018 @ 15:00
  § ETF E1 (tentative)

Compiler model
[Figure: Source program → “Front-end” → IR → Optimizer → “Back-end” → ASM file]
§ Question: How to build IR (tree)?

Overview
§ 3.1 Introduction
§ 3.2 Lexical analysis
§ 3.3 “Top down” parsing
§ 3.4 “Bottom up” parsing

3.1 Introduction
§ The frontend is responsible for turning the input program into IR
  § Input: Usually a string of ASCII or Unicode characters
  § IR: As required by later stages of the compiler
§ The frontend is divided into
  § Lexical analysis: deals with reading the input program
    § Also known as scanning
    § Scanner, Lexer
  § Syntactic analysis: understands the structure of the input program
    § Also known as parsing
    § Parser

3.1 Introduction (cont’d)
§ Good news: Syntactic and lexical analysis well understood
  § Good theory and books, e.g., Aho et al., Chapters 2 (in part), 3, and 4
  § Good tools
§ Bad news: Even good tools may be painful to use
  § Good == powerful
  § Many options
  § Still can’t handle all possible languages
  § May give cryptic error messages

3.1 Introduction (cont’d)
§ Need to understand theory to use tool
  § Same theory that allows building tool
§ Tools made hand-crafted frontends obsolete
§ Frontend tools used for other domains

Languages
§ Frontend processes input program
  § Need a way to describe what input is allowed
§ Formal languages
  § Well-researched area
  § First part of compilers supported by tools
§ In this lecture: brief review
  § Aho et al. covers topic in more depth
  § Focus on essentials
  § (Speed an issue in real life)
  § Theory behind tools

Languages: Grammar
§ Grammars provide a set of rules to generate “strings”
§ A grammar consists of
  § Terminals (T): a, b, c, …
  § Non-terminals (NT): X, Y, Z, …
  § Set of productions
  § Start symbol: S
§ Some terminology
  § Terminal symbols: Sometimes called characters or tokens
  § Non-terminal symbols: Also called syntactic variables
  § String: Sequence of symbols from some alphabet
  § Other terms: Word, sentence

Productions
§ General form
  § Left-hand side → Right-hand side
  § LHS → RHS (for short)
  § LHS, RHS: Strings over alphabets of terminal and non-terminal symbols
§ Example: Grammar G1
    S → aBa
    S → aXa
    Xb → Xbc | c
    Ba → aBa | b
§ How does a grammar generate a language (known as L(G))?
  § Using the grammar G1 as an example

L(G)
§ From production to derivation
§ Grammar G1
    S → aBa
    S → aXa
    Xb → Xbc | c
    Ba → aBa | b
§ Given
  § w: a word over (T ∪ NT)
  § α, β, γ: words over (T ∪ NT) (α, β, γ may be empty)
  § such that w = α β γ and P a production β → δ
§ We say that w’ = α δ γ is derived from w, i.e., w ⇒ w’.
§ Example derivation (with G1)
  § S ⇒ aBa ⇒ aaBa ⇒ aab
§ L(G1) = { aⁿb | n ≥ 1 }

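A single derivation step w ⇒ w’ can be sketched in Python as plain string rewriting. This is only an illustration, assuming one character per symbol (uppercase for non-terminals, lowercase for terminals); the helper name `derive_step` is hypothetical and not part of the lecture.

```python
def derive_step(w, lhs, rhs):
    """Return every w' with w => w' using the production lhs -> rhs.

    w is split as alpha + lhs + gamma at each position where lhs occurs,
    and each result is alpha + rhs + gamma.
    """
    results = []
    start = w.find(lhs)
    while start != -1:
        alpha, gamma = w[:start], w[start + len(lhs):]
        results.append(alpha + rhs + gamma)
        start = w.find(lhs, start + 1)
    return results


# Replay S => aBa => aaBa => aab with productions of G1.
step1 = derive_step("S", "S", "aBa")
step2 = derive_step(step1[0], "Ba", "aBa")
step3 = derive_step(step2[0], "Ba", "b")
print(step1, step2, step3)
```

Because the left-hand side may be any string (here "Ba"), the function searches for the whole LHS rather than for a single non-terminal.
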
L(G)
§ L(G) = set of strings w such that
  § w consists only of symbols from the set of terminals
  § There exists a sequence of productions P1, P2, …, Pn such that
      S ⇒ RHS1 (by P1) ⇒ … (by Pi) ⇒ … ⇒ w (by Pn)
§ In other words: there exists a derivation S ⇒(P1) … ⇒(Pn) w (or S ⇒* w)

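The definition of L(G) can be explored with a small breadth-first search over sentential forms. The sketch below is a hedged illustration, not a general algorithm: it assumes one character per symbol (uppercase for non-terminals), and it cuts off sentential forms longer than the hypothetical parameter `max_form_len`, so it only returns a finite slice of L(G1).

```python
from collections import deque

# Grammar G1 as (LHS, RHS) pairs; uppercase letters are non-terminals.
G1 = [("S", "aBa"), ("S", "aXa"), ("Xb", "Xbc"), ("Xb", "c"),
      ("Ba", "aBa"), ("Ba", "b")]


def generate(productions, start="S", max_form_len=5):
    """Collect terminal-only strings derivable from start.

    Sentential forms longer than max_form_len are not explored.
    """
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        w = queue.popleft()
        if w.islower():  # only terminals left: w is in L(G)
            words.add(w)
            continue
        for lhs, rhs in productions:
            pos = w.find(lhs)
            while pos != -1:
                nxt = w[:pos] + rhs + w[pos + len(lhs):]
                if len(nxt) <= max_form_len and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                pos = w.find(lhs, pos + 1)
    return words


print(sorted(generate(G1), key=len))
```

With the default bound this finds the three shortest words of the form aⁿb. The form aXa is generated but never leads to a word, because no production applies to it.
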
Productions, 2nd look
§ No constraints on LHS, RHS
§ Some RHS could be a dead-end street
    S → aXa
    Xb → …
§ Remove dead-end streets
§ Updated grammar G1’
    S → aBa
    Ba → aBa | b

Productions, 3rd look
§ We care about L(G): prune productions that do not contribute
§ Restrictions on LHS
  § Only a single non-terminal is allowed on the left-hand side
  § For example: A → a
  § “Context free” grammar or Type-2 grammar
§ Context-free grammars important
  § Efficient analysis techniques known
§ From now on only context-free grammars unless noted

Regular and linear grammars
§ Linear grammar: Context-free, at most 1 NT in the RHS
§ Left-linear grammar: Linear, NT appears at the left end of RHS
§ Right-linear grammar: Linear, NT appears at the right end of RHS
§ Regular grammar: Either right-linear or left-linear
§ Regular grammars generate regular languages
  § Could also be described by a regular expression
  § Can be recognized by a deterministic finite automaton (DFA)
  § Type-3 grammar

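The definitions above can be turned into a small classifier. This is a minimal sketch under stated assumptions: the grammar is context-free and given as a dictionary from a non-terminal to its right-hand sides, every symbol is one character, uppercase letters are non-terminals, and "" stands for an empty RHS. The function name `classify` is hypothetical.

```python
def classify(productions):
    """Classify a context-free grammar given as {non_terminal: [rhs, ...]}."""
    all_rhs = [rhs for alts in productions.values() for rhs in alts]

    def nt_positions(rhs):
        # Indices of the non-terminals inside one right-hand side.
        return [i for i, sym in enumerate(rhs) if sym.isupper()]

    if any(len(nt_positions(rhs)) > 1 for rhs in all_rhs):
        return "context-free, not linear"
    right = all(not nt_positions(rhs) or nt_positions(rhs) == [len(rhs) - 1]
                for rhs in all_rhs)
    left = all(not nt_positions(rhs) or nt_positions(rhs) == [0]
               for rhs in all_rhs)
    if right:
        return "regular (right-linear)"
    if left:
        return "regular (left-linear)"
    return "linear, not regular"


print(classify({"S": ["aS", "b"]}))   # NT at the right end of every RHS
print(classify({"S": ["aSb", ""]}))   # NT in the middle of an RHS
```

A grammar that mixes left-linear and right-linear productions is reported as "linear, not regular", following the slide's definition that a regular grammar is entirely one or the other.
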
Special cases
§ ∅: a language (but not an interesting one)
§ ε: the empty string
  § Must use a symbol so that we can see it
  § Can be the RHS
  § A → ε

3.1 Introduction
§ So far: Brief summary of grammars
§ Using multiple grammars to save work
§ Properties of derivations
  § Parse trees
§ Properties of grammars
  § Detect ambiguity
  § Avoid ambiguity

3.1.1 Example grammar G2
§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, ( , ) }
§ Non-Terminals: { S, E, Op, Id, L, M, N }
§ Productions
    S → E
    E → E Op E | - E | ( E ) | Id
    Op → + | - | * | /
    Id → L M
    L → a | b | ... | z
    M → L M | N M | ε
    N → 0 | 1 | ... | 9
§ Note: the ε-production allows us to make M “disappear”
    S ⇒ E ⇒ Id ⇒ L M ⇒ L L M ⇒ a L M ⇒ ap M ⇒ ap

Parsing
§ Given G and a word w ∈ T*: we want to know if “w ∈ L(G)?”
  § Analysis problem
§ Answer is either YES or NO
  § ap ∈ L(G2)
  § ap + bp ∈ L(G2)
  § ap++ ∉ L(G2)
§ For YES we need to find a sequence of productions so that S ⇒ … ⇒ w
  § (or S ⇒* w for short)

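The YES/NO question can be made concrete with a small hand-written recognizer. This is a sketch of one possible approach, not the parsing technique the lecture develops later. Assumptions: blanks in the input are ignored, and the rule E → E Op E is restructured as "operand, then any number of Op operand pairs" to avoid left recursion, which should accept the same strings as G2. The function name `in_language` is hypothetical.

```python
OPS = "+-*/"


def in_language(text):
    """Return True if text (blanks ignored) is accepted as a word of G2."""
    s = text.replace(" ", "")
    pos = 0

    def is_letter(ch):
        return "a" <= ch <= "z"

    def operand():
        nonlocal pos
        while pos < len(s) and s[pos] == "-":        # E -> - E
            pos += 1
        if pos < len(s) and s[pos] == "(":           # E -> ( E )
            pos += 1
            if not expression() or pos >= len(s) or s[pos] != ")":
                return False
            pos += 1
            return True
        if pos < len(s) and is_letter(s[pos]):       # Id -> L M
            pos += 1
            while pos < len(s) and (is_letter(s[pos]) or "0" <= s[pos] <= "9"):
                pos += 1                             # M -> L M | N M | ε
            return True
        return False

    def expression():
        nonlocal pos
        if not operand():
            return False
        while pos < len(s) and s[pos] in OPS:        # E -> E Op E
            pos += 1
            if not operand():
                return False
        return True

    return expression() and pos == len(s)


for word in ["ap", "ap + bp", "ap++"]:
    print(word, in_language(word))
```

The recognizer only answers YES or NO; it does not record which productions were used, so it does not yet produce a derivation.
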
w = a3 + b
§ Derivation
    S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id
      ⇒ Id + L M ⇒ Id + L ⇒ Id + b ⇒ L M + b ⇒ a M + b
      ⇒ a N M + b ⇒ a3 M + b ⇒ a3 + b

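Each step of this derivation replaces exactly one non-terminal by one of its right-hand sides, which can be checked mechanically. The sketch below assumes sentential forms are written as blank-separated symbols (so the identifier a3 appears as the two terminals a and 3) and encodes M → ε as an empty list; the names `G2` and `is_step` are hypothetical.

```python
G2 = {
    "S": [["E"]],
    "E": [["E", "Op", "E"], ["-", "E"], ["(", "E", ")"], ["Id"]],
    "Op": [["+"], ["-"], ["*"], ["/"]],
    "Id": [["L", "M"]],
    "L": [[c] for c in "abcdefghijklmnopqrstuvwxyz"],
    "M": [["L", "M"], ["N", "M"], []],   # [] encodes the ε-production
    "N": [[d] for d in "0123456789"],
}


def is_step(w, w_next):
    """True if w => w_next by expanding exactly one non-terminal of w."""
    for i, sym in enumerate(w):
        for rhs in G2.get(sym, []):      # terminals have no productions
            if w[:i] + rhs + w[i + 1:] == w_next:
                return True
    return False


derivation = [form.split() for form in [
    "S", "E", "E Op E", "E Op Id", "E + Id", "Id + Id", "Id + L M",
    "Id + L", "Id + b", "L M + b", "a M + b", "a N M + b", "a 3 M + b",
    "a 3 + b",
]]
print(all(is_step(w, nxt) for w, nxt in zip(derivation, derivation[1:])))
```

Note that the derivation expands non-terminals in no fixed order: it first finishes the right operand and only then returns to the left one.
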
Comments
§ If a string w contains multiple non-terminals we have a choice when expanding w ⇒ w’
§ Grammars that are context-free and without useless non-terminals: must have a production for each non-terminal in w
§ Assume A, B ∈ NT, and A → α, B → β are productions P1, P2
  § w = δ A τ B γ
  § Choice #1: w1 = δ α τ B γ
  § Choice #2: w2 = δ A τ β γ
  § (Both w ⇒ w1 and w ⇒ w2 are possible)

More comments
§ Question: Does the choice influence L(G)?
  § Or, is (w1 ⇒* x ∈ L(G)) ⇔ (w2 ⇒* x ∈ L(G))?
  § Answer: choice does not matter for context-free grammars
§ How to decide which production to pick?
  § Everything worked out in the example
  § We’ve always picked the right production
  § Found w = a3 + b
  § Later more…

3.1.1 Example grammar G2
§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, ( , ) }
§ Non-Terminals: { S, E, Op, Id, L, M, N }
§ Productions
    S → E
    E → E Op E | - E | ( E ) | Id
    Op → + | - | * | /
    Id → L M
    L → a | b | ... | z
    M → L M | N M | ε
    N → 0 | 1 | ... | 9

More comments
§ Part of the derivation is pretty boring
  § Do we care about exact steps to generate identifier “a3”?
  § Details (not always) important
§ Can we find a better way to deal with this aspect?
  § Better: Simpler
  § Better: Maybe also more efficient

Regular expressions
§ Idea: Use regular expression to capture “uninteresting” part of a grammar
  § Here: Exact rules for identifier names
§ Replace part of grammar G2
    …
    Id → L M
    L → a | b | ... | z
    M → L M | N M | ε
    N → 0 | 1 | ... | 9
§ Regular expressions recognized by Finite State Machines
  § Either a Deterministic Finite Automaton (DFA)
  § Or a Nondeterministic Finite Automaton (NFA)

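For the identifier rules above, a DFA needs only two states. The following is a minimal sketch, assuming the state names "start" and "in_id" (both hypothetical) and treating every missing transition as a move to an implicit rejecting state.

```python
import string

LETTERS = set(string.ascii_lowercase)   # L -> a | b | ... | z
DIGITS = set(string.digits)             # N -> 0 | 1 | ... | 9


def accepts_identifier(text):
    """Run a two-state DFA for the strings generated by Id -> L M."""
    state = "start"
    for ch in text:
        if state == "start" and ch in LETTERS:
            state = "in_id"             # first symbol must be a letter
        elif state == "in_id" and (ch in LETTERS or ch in DIGITS):
            state = "in_id"             # letters and digits may follow
        else:
            return False                # no transition: reject
    return state == "in_id"             # "in_id" is the only accepting state


for candidate in ["a3", "ap", "3a", ""]:
    print(repr(candidate), accepts_identifier(candidate))
```

The automaton reads each character once and keeps no stack, which is what makes this part of the analysis simpler than running the full grammar.
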
Token
§ Idea: Introduce a grammar symbol that represents a string described by a regular expression
  § Terminal for the grammar
  § Rules/production to generate regular expression string
§ When looking for a derivation, identify strings that can be described by the regular expression
  § “Token”
§ Example: a3 + b
    Tokens: Id(“a3”) + Id(“b”), where each Id string is matched by the regexp
§ Chunks of the input stream
§ More in 3.2 Lexical analysis

Examples
§ a3 + b … really … Id(“a3”) + Id(“b”)
§ z * u + x … really … Id(“z”) * Id(“u”) + Id(“x”)
  § Id * Id + Id ∈ L(G2)
§ Treat terminals the same way
  § Id(“z”) Term(“*”) Id(“u”) Term(“+”) Id(“x”)

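Splitting the input into Id and Term tokens can be sketched with Python's standard `re` module. The token patterns and the function name `tokenize` are assumptions made for this illustration: identifiers are a letter followed by letters or digits, every other terminal is a one-character token, and blanks between tokens are skipped.

```python
import re

# Named groups give the token kind; exactly one of them matches per token.
TOKEN_RE = re.compile(r"\s*(?:(?P<Id>[a-z][a-z0-9]*)|(?P<Term>[-+*/()]))")


def tokenize(text):
    """Split text into (kind, string) pairs; raise ValueError on other input."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        kind = match.lastgroup               # "Id" or "Term"
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


print(tokenize("z * u + x"))
```

The parser can then work on the sequence of token kinds (Id Term Id Term Id) and ignore how each identifier is spelled.
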
3.1.2 Simplified grammar G3
§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, ( , ), Id }
§ Non-Terminals: { S, E, Op, Id, L, M, N }
§ Productions and regular definitions
    S → E
    E → E Op E | - E | ( E ) | Id
    Op → + | - | * | /
    Id: L { L | N }*    (regexp)
    L = { a | b | c | … | z }
    N = { 0 | 1 | 2 | … | 9 }

More simplifications?
§ Can grammar G3 be simplified even further?
  § Are there other productions we can replace with a regular expression?
§ Productions
    S → E
    E → E Op E | - E | ( E ) | Id
    Op → + | - | * | /
    Id → L { L | N }*
    L = { a | b | c | … | z }
    N = { 0 | 1 | 2 | … | 9 }
§ Could treat Op the same way
    Op: { + | - | * | / }

Simplified grammar G4
§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, ( , ), Id }
§ Non-Terminals: { S, E, Op }
§ Productions and regular definitions
    S → E               (1)
    E → E Op E          (2)
      | - E             (3)
      | ( E )           (4)
      | Id              (5)
    Op → + | - | * | /  (6)
    Id: L { L | N }*
    L = { a | b | c | … | z }, N = { 0 | 1 | 2 | … | 9 }