Scanning and Parsing
CS553 Lecture Scanning and Parsing

Announcements
– Project 1 is 5% of total grade
– Project 2 is 10% of total grade
– Project 3 is 15% of total grade
– Project 4 is 10% of total grade

Today
– Outline of planned topics for course
– Overall structure of a compiler
– Lexical analysis (scanning)
– Syntactic analysis (parsing)

Structure of a Typical Interpreter / Compiler
Analysis: character stream → lexical analysis → tokens ("words") → syntactic analysis → AST ("sentences") → semantic analysis → annotated AST (→ interpreter)
Synthesis: IR code generation → IR → optimization → IR → code generation → target language

Lexical Analysis (Scanning)
Break character stream into tokens ("words")
– Tokens, lexemes, and patterns
– Lexical analyzers are usually automatically generated from patterns (regular expressions), e.g., by lex

Examples
  token        lexeme(s)            pattern
  const        const                const
  if           if                   if
  relation     <, <=, =, !=, ...    < | <= | = | != | ...
  identifier   foo, index           [a-zA-Z_]+[a-zA-Z0-9_]*
  number       3.14159, 570         [0-9]+ | [0-9]*\.[0-9]+
  string       "hi", "mom"          ".*"

  const pi := 3.14159 ⇒ const, identifier(pi), assign, number(3.14159)

Interaction Between Scanning and Parsing
character stream → Lexical analyzer → token → Parser → parse tree or AST
– The parser pulls tokens from the lexical analyzer with lexer.next() and lexer.peek()
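
The token table and the lexer.next() / lexer.peek() interface above can be sketched in a few lines of Python. This is a hand-written illustration, not what lex or SableCC generate: the class name Lexer, the helper tokenize, the "eof" and "skip" token names, and the tightened string pattern are all assumptions made for the example. At each position it tries every pattern, keeps the longest match, and breaks ties in favor of the rule listed first.

```python
import re

# Patterns adapted from the examples table, in priority order.
# The string pattern is narrowed to "[^"\n]*" so one lexeme cannot
# swallow two string literals (an assumption, the slide uses ".*").
TOKEN_SPEC = [(name, re.compile(pat)) for name, pat in [
    ("const",      r"const"),
    ("if",         r"if"),
    ("relation",   r"<=|!=|<|="),
    ("assign",     r":="),
    ("identifier", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("number",     r"[0-9]*\.[0-9]+|[0-9]+"),
    ("string",     r'"[^"\n]*"'),
    ("skip",       r"\s+"),          # whitespace, never handed to the parser
]]


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.lookahead = None        # token buffered by peek()

    def _scan(self):
        while self.pos < len(self.text):
            best = None
            for name, pattern in TOKEN_SPEC:
                m = pattern.match(self.text, self.pos)
                # Strictly longer wins, so earlier rules win ties.
                if m and (best is None or m.end() > best[1].end()):
                    best = (name, m)
            if best is None:
                raise SyntaxError(f"unexpected character at offset {self.pos}")
            name, m = best
            self.pos = m.end()
            if name != "skip":
                return (name, m.group())
        return ("eof", "")

    def peek(self):
        """Return the next token without consuming it."""
        if self.lookahead is None:
            self.lookahead = self._scan()
        return self.lookahead

    def next(self):
        """Return the next token and advance past it."""
        tok = self.peek()
        self.lookahead = None
        return tok


def tokenize(text):
    lexer, out = Lexer(text), []
    while lexer.peek()[0] != "eof":
        out.append(lexer.next())
    return out
```

For example, tokenize("const pi := 3.14159") yields the const, identifier, assign, number sequence shown on the slide, while "constant" comes back as a single identifier because the longer match beats the keyword rule.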
Specifying Tokens with SableCC
Theory meets practice:
– Regular expressions, formal languages, grammars, parsing…

SableCC example input file:

  Package minijava;

  Helpers
    all = [0..0xFFFF];
    cr = 13;
    digit = ['0'..'9'];
    letter = ['a'..'z'] | ['A'..'Z'];
    underscore = '_';
    not_star = [all - '*'];
    not_star_slash = [not_star - '/'];
    c_comment = '/*' not_star* ('*' (not_star_slash not_star*)?)* '*/';

  Tokens
    t_plus = '+';
    t_if = 'if';
    t_id = letter (letter | digit | underscore)*;
    t_blank = (' ' | eol | tab)+;
    t_comment = c_comment | line_comment;

  Ignored Tokens
    t_blank,
    t_comment;

Recognizing Tokens with DFAs
– 'if': state 1 –i→ state 4 –f→ state 5 (accept t_if)
– letter (letter | digit)*: state 1 –letter→ state 2 (accept t_id), with a loop on state 2 for letter or digit
Ambiguity due to matching substrings
– Longest match
– Rule priority

Syntactic Analysis (Parsing)
Impose structure on token stream
– Limited to syntactic structure (⇒ high-level)
– Structure usually represented with an abstract syntax tree (AST)
– Parsers are usually automatically generated from context-free grammars (e.g., yacc, bison, cup, javacc, sablecc)

Example
  for i = 1 to 10 do a[i] = x * 5;

  Token stream:
  for id(i) equal number(1) to number(10) do id(a) lbracket id(i) rbracket equal id(x) times number(5) semi

  AST:
  for
    i
    1
    10
    asg
      arr
        a
        i
      tms
        x
        5

Interaction Between Scanning and Parsing
character stream → Lexical analyzer → token → Parser → parse tree or AST (the parser calls lexer.next() and lexer.peek())
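
Returning to the "Recognizing Tokens with DFAs" slide: the two automata for t_if and t_id can be merged into one DFA and run with the longest-match rule. The sketch below is an assumption-laden illustration, not SableCC's generated lexer. The merged transition function, the names step, ACCEPTING and longest_match, and the choice to ignore underscores (as the slide's letter (letter | digit)* pattern does) are all made up for the example. State numbers 1, 2, 4 and 5 follow the slide.

```python
def char_class(ch):
    """Classify a character the way the slide's helpers do."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def step(state, ch):
    """Merged DFA transition; returns None when no move is possible."""
    cls = char_class(ch)
    if state == 1:                      # start state
        if ch == "i":
            return 4
        return 2 if cls == "letter" else None
    if state == 4:                      # seen "i"
        if ch == "f":
            return 5
        return 2 if cls in ("letter", "digit") else None
    if state in (2, 5):                 # inside an identifier, or seen "if"
        return 2 if cls in ("letter", "digit") else None
    return None


# Rule priority: state 5 matches both t_if and t_id, and t_if is listed
# first in the Tokens section, so state 5 is labeled t_if.
ACCEPTING = {2: "t_id", 4: "t_id", 5: "t_if"}


def longest_match(text, start=0):
    """Run the DFA from `start`; return (token, lexeme) for the longest
    accepted prefix, or None if no prefix is accepted."""
    state, pos, last = 1, start, None
    while pos < len(text):
        state = step(state, text[pos])
        if state is None:
            break
        pos += 1
        if state in ACCEPTING:
            last = (ACCEPTING[state], text[start:pos])
    return last
```

With this table, "if" is recognized as t_if, but "iffy" keeps running past state 5 into state 2 and is returned as one t_id, which is the longest-match rule at work.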
Bottom-Up Parsing: Shift-Reduce
Grammar
  (1) S -> E
  (2) E -> E + T
  (3) E -> T
  (4) T -> id

Rightmost derivation of a + b + c: expand rightmost non-terminals first
  S -> E
    -> E + T
    -> E + id
    -> E + T + id
    -> E + id + id
    -> T + id + id
    -> id + id + id

SableCC, yacc, and bison generate shift-reduce parsers:
– LALR(1): look-ahead, left-to-right, rightmost derivation in reverse, 1 symbol lookahead
– LALR is a parsing table construction method, smaller tables than canonical LR

Shift-Reduce Parsing Example
  Stack        Input          Action
  $            a + b + c      shift
  $ a          + b + c        reduce (4)
  $ T          + b + c        reduce (3)
  $ E          + b + c        shift
  $ E +        b + c          shift
  $ E + b      + c            reduce (4)
  $ E + T      + c            reduce (2)
  $ E          + c            shift
  $ E +        c              shift
  $ E + c                     reduce (4)
  $ E + T                     reduce (2)
  $ E                         reduce (1)
  $ S                         accept

Shift-Reduce Parsing Example (precedence problem)
Grammar
  (1) S -> E
  (2) E -> E + T
  (3) E -> E * T
  (4) E -> T
  (5) T -> id

  Stack        Input          Action
  $            a + b * c      shift

Rules (2) and (3) put + and * at the same level, so a parser for this grammar groups a + b * c as (a + b) * c.

Syntax-directed Translation: AST Construction example
Grammar with production rules
  S: E         { $$ = $1; };
  E: E '+' T   { $$ = new node("+", $1, $3); }
   | T         { $$ = $1; } ;
  T: T_ID      { $$ = new leaf("id", $1); };

Implicit parse tree for a+b+c
  S
    E
      E
        E
          T
            T_ID (a)
        +
        T
          T_ID (b)
      +
      T
        T_ID (c)

AST for a+b+c
  +
    +
      a
      b
    c

Reference: Barbara Ryder's 198:515 lecture notes
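
The shift-reduce trace and the AST-building actions above can be combined in a small Python sketch. It is deliberately not table-driven: a generated LALR(1) parser consults an action table, whereas this version just inspects the top of the stack, which happens to be enough for the four-rule grammar. The function name parse, the tuple encoding of AST nodes, and the convention that any token other than "+" is an identifier are assumptions made for the example.

```python
def parse(tokens):
    """Shift-reduce parse for  S -> E,  E -> E + T | T,  T -> id.

    Returns (ast, actions), where actions lists each step taken.
    Stack entries are (grammar symbol, semantic value) pairs.
    """
    stack, rest, actions = [], list(tokens), []
    while True:
        syms = [sym for sym, _ in stack]
        if syms[-1:] == ["id"]:
            # (4) T -> id          { $$ = new leaf("id", $1); }
            _, name = stack.pop()
            stack.append(("T", ("id", name)))
            actions.append("reduce (4)")
        elif syms[-3:] == ["E", "+", "T"]:
            # (2) E -> E + T       { $$ = new node("+", $1, $3); }
            _, right = stack.pop()
            stack.pop()
            _, left = stack.pop()
            stack.append(("E", ("+", left, right)))
            actions.append("reduce (2)")
        elif syms[-1:] == ["T"]:
            # (3) E -> T           { $$ = $1; }
            _, value = stack.pop()
            stack.append(("E", value))
            actions.append("reduce (3)")
        elif syms == ["E"] and not rest:
            # (1) S -> E           { $$ = $1; }
            _, value = stack.pop()
            stack.append(("S", value))
            actions.append("reduce (1)")
        elif syms == ["S"]:
            actions.append("accept")
            return stack[0][1], actions
        elif rest:
            tok = rest.pop(0)
            stack.append(("+" if tok == "+" else "id", tok))
            actions.append("shift")
        else:
            raise SyntaxError("no shift or reduce applies")
```

Calling parse(["a", "+", "b", "+", "c"]) reproduces the thirteen actions in the trace and returns a left-leaning tree, matching the AST drawn for a+b+c.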
Using SableCC to specify grammar and generate AST

  Productions
    cst_program {-> program} =
      cst_main_class cst_class_decl*
        {-> New program(cst_main_class.main_class, [cst_class_decl.class_decl])} ;

    cst_exp_list {-> exp* } =
        {many_rule} cst_exp cst_exp_rest*
          {-> [cst_exp.exp, cst_exp_rest.exp] }
      | {empty_rule}
          {-> [] } ;

    cst_exp_rest {-> exp* } =
      t_comma cst_exp {-> [cst_exp.exp] };

  Abstract Syntax Tree
    program = main_class [class_decls]:class_decl*;
    exp = {call} exp t_id [args]:exp* | ...

Parsing Terms
CFG (Context-free Grammar)
– production rule
– terminal
– nonterminal
– FOLLOW(X): "the set of terminals that can immediately follow X"
BNF (Backus-Naur Form) and EBNF (Extended BNF): equivalent to CFGs

Parsing Terms cont…
Top-down parsing
– LL(1): left-to-right reading of tokens, leftmost derivation, 1 symbol look-ahead
– Predictive parser: an efficient non-backtracking top-down parser that can handle LL(1)
– More generally recursive descent parsing may involve backtracking
Bottom-up Parsing
– LR(1): left-to-right reading of tokens, rightmost derivation in reverse, 1 symbol lookahead
– Shift-reduce parsers: for example, bison, yacc, and SableCC generated parsers
– Methods for producing an LR parsing table
  – SLR, simple LR
  – Canonical LR, most powerful
  – LALR(1)

Concepts
Compilation stages in a compiler
– Scanning, parsing, semantic analysis, intermediate code generation, optimization, code generation
Lexical analysis or scanning
– Tools: SableCC, lex, flex, etc.
Syntactic analysis or parsing
– Tools: SableCC, yacc, bison, etc.
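
The FOLLOW(X) definition from the Parsing Terms slide can be made concrete with the usual fixed-point computation. The sketch below is one common way to do it, offered as an illustration rather than the course's reference algorithm. The grammar encoding (a dict from nonterminal to a list of right-hand sides, with [] for an empty production), the names compute_first and compute_follow, and the use of "$" as the end-of-input marker are assumptions made for the example.

```python
EPSILON = ""   # marker meaning "this sequence can derive the empty string"
END = "$"      # end-of-input marker, added to FOLLOW of the start symbol


def first_of_seq(seq, first, grammar):
    """FIRST of a symbol sequence, given FIRST sets of the nonterminals."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in grammar else {sym}
        out |= sym_first - {EPSILON}
        if EPSILON not in sym_first:
            return out
    out.add(EPSILON)            # every symbol (or none at all) can vanish
    return out


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:              # iterate until no set grows
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_seq(rhs, first, grammar)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue            # terminals have no FOLLOW set
                    tail = first_of_seq(rhs[i + 1:], first, grammar)
                    new = tail - {EPSILON}
                    if EPSILON in tail:     # sym can end the production
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


# The expression grammar used in the shift-reduce slides.
GRAMMAR = {
    "S": [["E"]],
    "E": [["E", "+", "T"], ["T"]],
    "T": [["id"]],
}
```

For GRAMMAR with start symbol "S", compute_follow gives FOLLOW(S) = {$} and FOLLOW(E) = FOLLOW(T) = {+, $}, which is the information an SLR table construction uses to decide when a reduction is allowed.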
Next Time
Lecture
– More undergraduate compilers review

Language Implementation Timeline
For entertainment purposes only!

Entries on the '50 to '80 axis:
– A-0 [Hopper]
– Fortran [Backus]
– LISP [McCarthy]
– Algol [Comm.]
– COBOL [Short Range Comm.]
– BASIC [Kemeny & Kurtz]
– Simula [Dahl & Nygaard]
– Parser generators
– Lex & YACC [Johnson]
– Copying GC [Cheney]
– Value numbering [Cocke & Schwartz]
– Pascal [Wirth] & 1st µproc [4004]
– C [Ritchie] & ML [Milner et al.]
– Prolog [Colmerauer]
– Modern DFA [Kildall] & Lamport's parallelism
– GCD test [Banerjee & Towle]
– Dep. vectors [Karp et al.]
– Parafrase [Kuck]
– May v. must [Barth]
– Flow-sens. defined [Banning]
– PRE [Morel et al.]

Entries on the '80 to 2010 axis:
– Trace sched. [Fisher]
– Coloring reg. alloc. [Chaitin]
– 1st RISC (IBM 801), Wolfe's thesis
– Smalltalk [Kay] & PFC [Kennedy]
– C++ [Stroustrup]
– PDG [Ferrante]
– Dragon book [ASU]
– Perl [Wall]
– SW pipelining [Lam]
– Superblock scheduling [Hwu]
– 486 w/ cache
– SSA [Cytron]
– Sparse cond. const. [Wegman & Zadeck]
– Java [Gosling & Sun]
– Ocaml [INRIA]
– Itanium ships & Jikes RVM [IBM]
– CS553 @ CSU