
CSE 401: Introduction to Compiler Construction — Course Outline



CSE 401: Introduction to Compiler Construction

Professor: Craig Chambers
TA: Markus Mock
Text: Compilers: Principles, Techniques, and Tools, Aho et al.

Goals:
• learn principles & practice of language implementation
• brings together theory & pragmatics of previous courses
• study interactions among:
  • language features
  • implementation efficiency
  • compiler complexity
  • architectural features
• gain experience with object-oriented design & C++
• gain experience working on a team

Prerequisites:
• 326, 341, 378
• very helpful: 322

Course Outline

Front-end issues:
• lexical analysis (scanning): characters → tokens
• syntax analysis (parsing): tokens → abstract syntax trees
• semantic analysis (typechecking): annotate ASTs

Midterm

Back-end issues:
• run-time storage representations
• intermediate & target code generation: ASTs → asm code
• optimizations

Final

Project

Start with compiler for PL/0, written in C++

Add:
• comments
• arrays
• call-by-reference arguments
• results of procedures
• for loops
• break statements
• and more...

Completed in stages over the quarter

Strongly encourage working in a 2-person team on project

Grading based on:
• correctness
• clarity of design & implementation
• quality of test cases

Grading

Project: 40% total
Homework: 20% total
Midterm: 15%
Final: 25%

Homework & projects due at the start of class
• 3 free late days, per person
• thereafter, 25% off per calendar day late

An example compilation

Sample PL/0 program: squares.0

    module main;
    var x: int, result: int;

    procedure square(n: int);
    begin
        result := n * n;
    end square;

    begin
        x := input;
        while x <> 0 do
            square(x);
            output := result;
            x := input;
        end;
    end main.

First step: lexical analysis

“Scanning”, “tokenizing”

Read in characters, clump into tokens
• strip out whitespace in the process

Specifying tokens: regular expressions

Example:
    Ident    ::= Letter AlphaNum*
    Integer  ::= Digit+
    AlphaNum ::= Letter | Digit
    Letter   ::= 'a' | ... | 'z' | 'A' | ... | 'Z'
    Digit    ::= '0' | ... | '9'

Second step: syntax analysis

“Parsing”

Read in tokens, turn into a tree based on syntactic structure

Specifying syntax: context-free grammars

EBNF is a popular notation for CFG’s

Example:
    Stmt     ::= AsgnStmt | IfStmt | ...
    AsgnStmt ::= LValue := Expr ;
    LValue   ::= Id
    IfStmt   ::= if Test then Stmt [ else Stmt ] ;
    Test     ::= Expr = Expr | Expr < Expr | ...
    Expr     ::= Term + Term | Term - Term | Term
    Term     ::= Factor * Factor | ... | Factor
    Factor   ::= - Factor | Id | Int | ( Expr )

EBNF specifies concrete syntax of language

Parser usually constructs tree representing abstract syntax of language

Third step: semantic analysis

“Name resolution and typechecking”

Given AST:
• figure out what declaration each name refers to
• perform static consistency checks

Key data structure: symbol table
• maps names to info about name derived from declaration

Semantic analysis steps:
1. Process each scope, top down
2. Process declarations in each scope into symbol table for scope
3. Process body of each scope in context of symbol table

Fourth step: storage layout

Given symbol tables, determine how & where variables will be stored at run-time

What representation for each kind of data?

How much space does each variable require?

In what kind of memory should it be placed?
• static, global memory
• stack memory
• heap memory

Where in that kind of memory should it be placed?
• e.g. what stack offset

Fifth step: intermediate & target code generation

Given annotated AST & symbol tables, produce target code

Often done as three steps:
• produce machine-independent low-level representation of program (intermediate representation)
• perform machine-independent optimizations of IR (optional)
• translate IR into machine-specific target instructions
  • instruction selection
  • register allocation

The bigger picture

Compilers are translators

Characterized by:
• input language
• target language
• degree of “understanding”

Compilers vs. interpreters

Compilers implement languages by translation

Interpreters implement languages directly

Embody different trade-offs among:
• execution speed of program
• start-up overhead, turn-around time
• ease of implementation
• programming environment facilities
• conceptual difficulty

Engineering issues

Portability
• ideal: multiple front-ends & back-ends sharing intermediate language

Sequencing phases of compilation
• stream-based
• syntax-directed

Multiple passes?

Lexical Analysis / Scanning

Purpose: turn character stream (input program) into token stream
• parser turns token stream into syntax tree

Token: group of characters forming basic, atomic chunk of syntax

Whitespace: characters between tokens that are ignored

Separate lexical and syntax analysis

Separation of concerns / good design
• scanner:
  • grouping chars into tokens
  • ignoring whitespace
  • handling I/O, machine dependencies
• parser:
  • grouping tokens into syntax trees

Restricted nature of scanning allows faster implementation
• scanning is time-consuming in many compilers

Language design issues

Most languages today are “free-form”
• layout doesn’t matter
• use whitespace to separate tokens, if needed

Alternatives:
• Fortran, Algol 68: whitespace ignored
• Haskell: use layout to imply grouping

Most languages today have “reserved words”
• can’t be used as identifiers

Alternative: PL/I has “keywords”
• keywords treated specially only in certain contexts, otherwise just idents

Most languages separate scanning and parsing

Alternative: C++: type vs. ident
• parser wants scanner to distinguish types from vars
• scanner doesn’t know how things were declared

Lexemes, tokens, and patterns

Lexeme: group of characters that form a token

Token: class of lexemes that match a pattern

Pattern: description of a string of characters

Token may have attributes, if more than one lexeme in token

Needs of parser determine how lexemes are grouped into tokens

Regular expressions

Notation for specifying patterns of lexemes in a token

Regular expressions:
• powerful enough to do this
• simple enough to be implemented efficiently
• equivalent to finite state machines

Syntax of regular expressions

Defined inductively
• base cases:
  • the empty string (ε)
  • a symbol from the alphabet (x)
• inductive cases:
  • sequence of two RE’s: E1 E2
  • either of two RE’s: E1 | E2
  • Kleene closure (zero or more occurrences) of a RE: E*

Notes:
• can use parens for grouping
• precedence: * highest, then sequence, | lowest
• whitespace insignificant

Notational conveniences

E+ means 1 or more occurrences of E
E^k means k occurrences of E
[E] means 0 or 1 occurrence of E (optional E)
{E} means E*
not(x) means any character in the alphabet but x
not(E) means any string of characters in the alphabet but those matching E
E1 - E2 means any string matching E1 except those matching E2

Naming regular expressions

Can assign names to regular expressions

Can use the name of a RE in the definition of another RE

Examples:
    letter   ::= a | b | ... | z
    digit    ::= 0 | 1 | ... | 9
    alphanum ::= letter | digit

EBNF-like notation for RE’s

Can reduce named RE’s to plain RE by “macro expansion”
• no recursive definitions allowed

Using regular expressions to specify tokens

Identifiers
    ident ::= letter (letter | digit)*

Integer constants
    integer    ::= digit+
    sign       ::= + | -
    signed_int ::= [sign] integer

Real number constants
    real     ::= signed_int [fraction] [exponent]
    fraction ::= . digit+
    exponent ::= (E|e) signed_int
