Compilation 2014
Warm-up project
Aslan Askarov, aslan@cs.au.dk
Revised from slides by E. Ernst
Straight-line Programming Language
• Toy programming language: no branching, no loops
• Skip lexing and parsing issues
• Focus on the "meaning": interpretation
• Syntax:

    Stm     → Stm ; Stm            (CompoundStm)
    Stm     → id := Exp            (AssignStm)
    Stm     → print ( ExpList )    (PrintStm)
    Exp     → id                   (IdExp)
    Exp     → num                  (NumExp)
    Exp     → Exp Binop Exp        (OpExp)
    Exp     → ( Stm , Exp )        (EseqExp)
    ExpList → Exp , ExpList        (PairExpList)
    ExpList → Exp                  (LastExpList)
    Binop   → +                    (Plus)
    Binop   → -                    (Minus)
    Binop   → ×                    (Times)
    Binop   → /                    (Div)
Straight-line program
• Source:

    a := 5 + 3;
    b := (print (a, a - 1), 10 * a);
    print (b)

• Corresponding syntax tree:

    CompoundStm
      AssignStm
        a
        OpExp
          NumExp 5
          BinOp Plus
          NumExp 3
      CompoundStm
        AssignStm
          b
          EseqExp
            PrintStm
              PairExpList
                IdExp a
                LastExpList
                  OpExp
                    IdExp a
                    BinOp Minus
                    NumExp 1
            OpExp
              NumExp 10
              BinOp Times
              IdExp a
        PrintStm
          LastExpList
            IdExp b
SLP syntax representation datatype
• Grammar:

    Stm     → Stm ; Stm            (CompoundStm)
    Stm     → id := Exp            (AssignStm)
    Stm     → print ( ExpList )    (PrintStm)
    Exp     → id                   (IdExp)
    Exp     → num                  (NumExp)
    Exp     → Exp Binop Exp        (OpExp)
    Exp     → ( Stm , Exp )        (EseqExp)
    ExpList → Exp , ExpList        (PairExpList)
    ExpList → Exp                  (LastExpList)
    Binop   → +                    (Plus)
    Binop   → -                    (Minus)
    Binop   → ×                    (Times)
    Binop   → /                    (Div)

• SML declaration:

    type id = string

    datatype binop = Plus | Minus | Times | Div

    datatype stm = CompoundStm of stm * stm
                 | AssignStm of id * exp
                 | PrintStm of exp list
         and exp = IdExp of id
                 | NumExp of int
                 | OpExp of exp * binop * exp
                 | EseqExp of stm * exp
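To illustrate how the datatype is consumed by pattern matching, here is a minimal sketch of a printer that turns a syntax tree back into source text. The function names `binopToString`, `stmToString`, and `expToString` and the value `small` are hypothetical and not part of the project. The printer does not add parentheses around nested operator expressions, so it is only meant as a reading aid.

```sml
type id = string

datatype binop = Plus | Minus | Times | Div

datatype stm = CompoundStm of stm * stm
             | AssignStm of id * exp
             | PrintStm of exp list
     and exp = IdExp of id
             | NumExp of int
             | OpExp of exp * binop * exp
             | EseqExp of stm * exp

(* Concrete symbol for each operator constructor. *)
fun binopToString Plus  = "+"
  | binopToString Minus = "-"
  | binopToString Times = "*"
  | binopToString Div   = "/"

(* One clause per constructor; stm and exp are mutually recursive,
   so the two printers are defined together with "and". *)
fun stmToString (CompoundStm (s1, s2)) = stmToString s1 ^ "; " ^ stmToString s2
  | stmToString (AssignStm (x, e))     = x ^ " := " ^ expToString e
  | stmToString (PrintStm es)          =
      "print (" ^ String.concatWith ", " (map expToString es) ^ ")"
and expToString (IdExp x)            = x
  | expToString (NumExp n)           = Int.toString n
  | expToString (OpExp (l, oper, r)) =
      expToString l ^ " " ^ binopToString oper ^ " " ^ expToString r
  | expToString (EseqExp (s, e))     =
      "(" ^ stmToString s ^ ", " ^ expToString e ^ ")"

(* a := 5 + 3; print (a) *)
val small = CompoundStm (AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
                         PrintStm [IdExp "a"])
val text = stmToString small
```

Here `text` evaluates to `"a := 5 + 3; print (a)"`.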
SLP syntax representation
• Source program:

    a := 5 + 3;
    b := (print (a, a - 1), 10 * a);
    print (b)

• SML value:

    val prog =
      CompoundStm (
        AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
        CompoundStm (
          AssignStm ("b", EseqExp (
            PrintStm [IdExp "a", OpExp (…)],
            OpExp (NumExp 10, …))),
          PrintStm [IdExp "b"]))
Project assignment
• Follow descriptions p10-12 in MCIML
• "Modularity principles" p9-10: discussed on Friday, may be ignored at first
Lexical analysis
Lexical analysis
High-level source code → Lexing → Parsing → Elaboration → … → Low-level target code
Lexical analysis
• First phase in the compilation
• Input: stream of characters

    i f ( x > 0 ) \n \t t h e n 1 \n \t e l s e 0

• Output: stream of tokens in our language

    IF LPAREN ID ("x") GT INT (0) RPAREN THEN INT (1) ELSE INT (0)

• Discards comments, whitespace, newline, tab characters, preprocessor directives
Tokens

    Type      Examples
    ID        foo   n14   a'   my-fun
    INT       73   0   070
    REAL      0.0   .5   10.
    IF        if
    COMMA     ,
    LPAREN    (
    ASGMT     :=
Non-tokens

    Type                       Examples
    comments                   /* dead code */   // comment   (* nest (*ed*) *)
    preprocessor directives    #define N 10   #include <stdio.h>
    whitespace
Token data structure
• Many tokens need no associated data, e.g.: IF, COMMA, LPAREN, RPAREN, ASGMT
• Some tokens carry an associated string: ID ("my-fun")
• Some tokens carry associated data of other types: INT (73), INT (1), FLOAT (IEEE754, 1001111100…)
• Tokens may include useful additional information: start/end pos in input file (line number + column, or charpos)
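One way to picture such a token type is an SML datatype in which every constructor carries a start and end position, and some constructors carry a payload as well. This is only a sketch under assumptions: the type `pos`, the choice of constructors, and the helper `tokenToString` are hypothetical and do not reproduce the `Tokens` structure used in the project.

```sml
(* Position as a character offset into the input (an assumption;
   line number + column would work just as well). *)
type pos = int

datatype token
  = IF    of pos * pos             (* no payload, only start/end position *)
  | COMMA of pos * pos
  | ID    of string * pos * pos    (* carries the identifier text *)
  | INT   of int * pos * pos       (* carries the integer value *)
  | REAL  of real * pos * pos      (* carries the floating-point value *)

(* Render a token without its positions, in the notation of the slides. *)
fun tokenToString (IF _)           = "IF"
  | tokenToString (COMMA _)        = "COMMA"
  | tokenToString (ID (s, _, _))   = "ID (\"" ^ s ^ "\")"
  | tokenToString (INT (n, _, _))  = "INT (" ^ Int.toString n ^ ")"
  | tokenToString (REAL (r, _, _)) = "REAL (" ^ Real.toString r ^ ")"

val example = [IF (0, 2), ID ("x", 3, 4), COMMA (4, 5), INT (73, 6, 8)]
val rendered = String.concatWith " " (map tokenToString example)
```

Here `rendered` evaluates to `"IF ID (\"x\") COMMA INT (73)"`.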
Q/A
• Consider source program

    var δ := 0.0

• Language: case sensitive, ASCII
• How to report error of using δ?

    FileName:Line.Col: Illegal character δ
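If the lexer only tracks a character offset, the `Line.Col` part of such a message has to be computed from it. A minimal sketch of that conversion follows; the functions `lineCol` and `report` and the file name `test.tig` are hypothetical, and the example uses `?` as the offending character so that the input stays plain ASCII.

```sml
(* Convert a 0-based character offset into a 1-based (line, column) pair
   by walking the text and counting newlines before the offset. *)
fun lineCol (text : string, pos : int) : int * int =
  let
    fun go (i, line, col) =
      if i >= pos orelse i >= size text then (line, col)
      else if String.sub (text, i) = #"\n" then go (i + 1, line + 1, 1)
      else go (i + 1, line, col + 1)
  in
    go (0, 1, 1)
  end

(* Format a message as FileName:Line.Col: message *)
fun report (file : string, text : string, pos : int, msg : string) : string =
  let
    val (line, col) = lineCol (text, pos)
  in
    file ^ ":" ^ Int.toString line ^ "." ^ Int.toString col ^ ": " ^ msg
  end

val source = "var x := 1\nvar y := ?"
val message = report ("test.tig", source, 20, "Illegal character")
```

Offset 20 is the `?` on the second line, so `message` evaluates to `"test.tig:2.10: Illegal character"`. Rescanning the text for every error is simple but slow; recording the offsets of line starts while lexing is a common alternative.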
Regular expressions
• We can use regular expressions to specify programming language tokens
• Regular expressions:
    • Expected to be well-known
    • Syntax:
        • symbol    a
        • choice    x | y
        • concat    x y
        • empty     ε
        • repeat    x*
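The five syntactic forms map directly onto an SML datatype. The sketch below pairs that datatype with a small matcher based on Brzozowski derivatives, which is a different technique from the NFA/DFA construction used later in these slides and is chosen here only because it fits in a few lines. All names (`regex`, `nullable`, `deriv`, `matches`) are hypothetical, and the extra constructor `Null` (matches nothing) is an internal helper that is not part of the syntax above.

```sml
datatype regex
  = Empty                   (* ε: matches only the empty string *)
  | Null                    (* matches no string at all (internal helper) *)
  | Sym of char             (* symbol a *)
  | Alt of regex * regex    (* choice x | y *)
  | Seq of regex * regex    (* concat x y *)
  | Star of regex           (* repeat x* *)

(* Does r match the empty string? *)
fun nullable Empty        = true
  | nullable Null         = false
  | nullable (Sym _)      = false
  | nullable (Alt (x, y)) = nullable x orelse nullable y
  | nullable (Seq (x, y)) = nullable x andalso nullable y
  | nullable (Star _)     = true

(* deriv c r matches exactly those strings s for which c followed by s matches r. *)
fun deriv c Empty        = Null
  | deriv c Null         = Null
  | deriv c (Sym a)      = if a = c then Empty else Null
  | deriv c (Alt (x, y)) = Alt (deriv c x, deriv c y)
  | deriv c (Seq (x, y)) =
      if nullable x
      then Alt (Seq (deriv c x, y), deriv c y)
      else Seq (deriv c x, y)
  | deriv c (Star x)     = Seq (deriv c x, Star x)

(* Take the derivative for each input character in turn, then test for ε. *)
fun matches (r : regex) (s : string) : bool =
  nullable (List.foldl (fn (c, r') => deriv c r') r (String.explode s))

(* a b* *)
val abStar = Seq (Sym #"a", Star (Sym #"b"))
val yes = matches abStar "abbb"
val no  = matches abStar "ba"
```

Here `yes` is `true` and `no` is `false`. The derivative terms grow with the input because nothing is simplified, so this is a teaching sketch rather than a practical scanner.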
Regular expressions used for scanning
• Examples:

    if                                      (IF);
    [a-z][a-z0-9]*                          (ID);
    [0-9]+                                  (NUM);
    ([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+)   (REAL);
    ("--"[a-z]*"\n") | (" "|"\t")           (continue());
    .                                       (error (); continue());
Resolving ambiguities
• Rule: when a string can match multiple tokens, the longest matching token wins

    if               (IF);
    [a-z][a-z0-9]*   (ID);

    Input ifx > 0 starts with ID ("ifx"), not with IF followed by ID ("x")

• We also need to specify priorities if we match several tokens of the same length
• Usual rule: earliest declaration wins

    Input if matches both IF and ID ("if") with the same length; IF is declared first, so the result is IF
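The two rules can be sketched as a small hand-written scanner step: try every rule at the current position, keep the longest match, and on a tie keep the rule declared first. This is a simplified illustration and not how ML-Lex is implemented; the names `spanLen`, `matchIf`, `matchId`, `matchNum`, `rules`, and `nextToken` are hypothetical, and whitespace handling is left out.

```sml
datatype token = IF | ID of string | NUM of int

(* Length of the longest prefix of cs whose characters all satisfy p. *)
fun spanLen p cs =
  let
    fun go (n, c :: rest) = if p c then go (n + 1, rest) else n
      | go (n, [])        = n
  in
    go (0, cs)
  end

fun isLowerOrDigit c = Char.isLower c orelse Char.isDigit c

(* Each matcher returns how many leading characters it accepts (0 = no match). *)
fun matchIf (#"i" :: #"f" :: _) = 2
  | matchIf _                   = 0

fun matchId (c :: rest) =
      if Char.isLower c then 1 + spanLen isLowerOrDigit rest else 0
  | matchId [] = 0

fun matchNum cs = spanLen Char.isDigit cs

(* Rules in declaration order: earlier rules have higher priority. *)
val rules : ((char list -> int) * (string -> token)) list =
  [ (matchIf,  fn _ => IF),
    (matchId,  fn s => ID s),
    (matchNum, fn s => NUM (valOf (Int.fromString s))) ]

(* Longest match wins; a later rule replaces the current best only if it
   is strictly longer, so ties go to the earliest declaration. *)
fun nextToken (cs : char list) : (token * char list) option =
  let
    fun best ([], acc) = acc
      | best ((m, mk) :: rs, (len, mk0)) =
          let
            val n = m cs
          in
            if n > len then best (rs, (n, mk)) else best (rs, (len, mk0))
          end
    val (len, mk) = best (rules, (0, fn _ => raise Fail "no rule matches"))
  in
    if len = 0 then NONE
    else SOME (mk (String.implode (List.take (cs, len))), List.drop (cs, len))
  end

val longest  = nextToken (String.explode "ifx > 0")
val priority = nextToken (String.explode "if")
```

Here `longest` is `SOME (ID "ifx", ...)` with the remaining characters `" > 0"`, and `priority` is `SOME (IF, [])`.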
Lexical analysis
• Specification: tokens as regular expressions + longest-matching rule + priorities
• Formalism: NFA, DFA
• Implementation: simulate NFA, or simulate DFA (linear complexity)
• Output: program that translates raw text into stream of tokens
Total NFA for ID, IF, NUM, REAL
[Figure: state diagram of the combined automaton with states 1-13. Accepting states are labelled ID, IF, NUM, REAL, whitespace, and error; edges are labelled with character classes such as a-z, 0-9, ".", "-", "\n", blank, and other.]
ML-Lex
• Lexer generator, "built-in" part of SML/NJ
• Accepts lexical specification, produces a scanner
• Example specification:

    (* SML declarations *)
    type lexresult = Tokens.token
    fun eof() = Tokens.EOF(0,0)
    %%
    (* Lex definitions *)
    digits=[0-9]+
    %%
    (* Regular Expressions and Actions *)
    if                                    => (Tokens.IF(yypos,yypos+2));
    [a-z][a-z0-9]*                        => (Tokens.ID(yytext,yypos,yypos + size yytext));
    {digits}                              => (Tokens.NUM(Int.fromString yytext,
                                                yypos, yypos + size yytext));
    ({digits}"."[0-9]*)|([0-9]*"."{digits}) => (Tokens.REAL(Real.fromString yytext,
                                                yypos, yypos + size yytext));
    ("--"[a-z]*"\n")|(" "|"\n"|"\t")+     => (continue());
    .                                     => (ErrorMsg.error yypos "Illegal character";
                                                continue());
Lexer states
• Helpful when handling different "kinds" of tokens
• For ex.: use state
    • INITIAL in general lexing (automatic)
    • STRING when scanning the contents of a string
    • COMMENT when scanning a comment
• Point: keep different concerns apart, simpler!
• Syntax:

    ...
    (* Regular Expressions and Actions *)
    <INITIAL>if             => (Tokens.IF(yypos,yypos+2));
    <INITIAL>[a-z][a-z0-9]* => (Tokens.ID(yytext,yypos,yypos + size yytext));
    ...
    <INITIAL>"\""           => (YYBEGIN STRING; continue());
    ...
    <STRING>.               => (continue());
    ...
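The idea behind lexer states can be imitated in plain SML by threading an explicit state through the scanning loop, so that the same character is handled by different rules depending on the state. The sketch below is not ML-Lex output and does not use its API; the type `state` and the function `stringLiterals` are hypothetical, escapes inside strings are ignored, and an unterminated string is silently dropped.

```sml
datatype state = INITIAL | STRING

(* Collect the contents of all double-quoted string literals in the input.
   Arguments of go: current state, remaining input, characters of the
   current string in reverse order, and literals found so far (reversed). *)
fun stringLiterals (input : string) : string list =
  let
    fun go (_, [], _, acc) = List.rev acc
        (* In INITIAL, a quote switches to STRING (compare YYBEGIN STRING). *)
      | go (INITIAL, #"\"" :: cs, _, acc) = go (STRING, cs, [], acc)
        (* In INITIAL, every other character is skipped in this sketch. *)
      | go (INITIAL, _ :: cs, _, acc) = go (INITIAL, cs, [], acc)
        (* In STRING, a quote ends the literal and switches back to INITIAL. *)
      | go (STRING, #"\"" :: cs, buf, acc) =
          go (INITIAL, cs, [], String.implode (List.rev buf) :: acc)
        (* In STRING, every other character belongs to the literal. *)
      | go (STRING, c :: cs, buf, acc) = go (STRING, cs, c :: buf, acc)
  in
    go (INITIAL, String.explode input, [], [])
  end

val found = stringLiterals "x := \"hi\" , \"a b\""
```

Here `found` evaluates to `["hi", "a b"]`. Keeping the string-specific clauses under the `STRING` state mirrors how `<STRING>` rules are kept apart from `<INITIAL>` rules in a specification.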
Summary
• Warm-up project: Program in SML!
    • Straight-line programming language, no lexing/parsing involved
    • Express programs: use abstract syntax tree datatype
    • Project specified on website, essentially as in the book
• Lexical analysis
    • Avoid complexity in grammar. Use lexer
    • Based on regular expressions. Implementation via NFA/DFA
    • Theory assumed known
• Tools: ML-Lex
    • Scanner generator, outputs SML code from spec
    • Note lexer states