Compilation 2016 Lexical Analysis Aslan Askarov aslan@cs.au.dk acknowledgments: E. Ernst
Lexical analysis
High-level source code → Lexing → Parsing → Elaboration → … → Low-level target code
Lexical analysis
First phase in the compilation
Input: stream of characters
  i f ( x > 0 ) \n \t t h e n 1 \n \t e l s e 0
Output: stream of tokens in our language
  IF LPAREN ID (“x”) GT INT (0) RPAREN THEN INT (1) ELSE INT (0)
Discards comments, whitespace, newline, tab characters, preprocessor directives
Tokens
  Type     Examples
  ID       foo   n14   a'   my-fun
  INT      73   0   070
  REAL     0.0   .5   10.
  IF       if
  COMMA    ,
  LPAREN   (
  ASGMT    :=
Non-tokens
  Type                      Examples
  comments                  /* dead code */   // comment   (* nest (*ed*) *)
  preprocessor directives   #define N 10   #include <stdio.h>
  whitespace
Token data structure
• Many tokens need no associated data, e.g.: IF, COMMA, LPAREN, RPAREN, ASGMT
• Some tokens carry an associated string: ID (“my-fun”)
• Some tokens carry associated data of other types: INT (73), INT (1), FLOAT (IEEE754, 1001111100…)
• Tokens may include useful additional information: start/end position in the input file (line number + column, or charpos)
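One possible way to write such a token type down in SML is sketched below. The constructor names follow the examples on the slide; the `pos` type (a line/column pair) and the helpers `startPos` and `posToString` are assumptions made for illustration, not part of the course material.

```sml
(* Assumption: a position is a (line, column) pair. *)
type pos = int * int

(* Every token carries its start and end position;
   some also carry a string, an int, or a real. *)
datatype token
  = IF of pos * pos
  | COMMA of pos * pos
  | LPAREN of pos * pos
  | RPAREN of pos * pos
  | ASGMT of pos * pos
  | ID of string * pos * pos
  | INT of int * pos * pos
  | FLOAT of real * pos * pos

(* Hypothetical helper: the start position of any token. *)
fun startPos (IF (s, _)) = s
  | startPos (COMMA (s, _)) = s
  | startPos (LPAREN (s, _)) = s
  | startPos (RPAREN (s, _)) = s
  | startPos (ASGMT (s, _)) = s
  | startPos (ID (_, s, _)) = s
  | startPos (INT (_, s, _)) = s
  | startPos (FLOAT (_, s, _)) = s

(* Hypothetical helper: render a position in Line.Col form. *)
fun posToString ((line, col) : pos) =
  Int.toString line ^ "." ^ Int.toString col

val example = ID ("my-fun", (3, 7), (3, 13))
val shown = posToString (startPos example)   (* "3.7" *)
```

Keeping both positions on every token costs little and lets later phases report errors against the original source text.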
Q/A
• Consider source program var δ := 0.0
• Language: case sensitive, ASCII
• How to report the error of using δ?
  FileName:Line.Col: Illegal character δ
Regular expressions
• We can use regular expressions to specify programming language tokens
• Regular expressions R
  • Expected to be well-known (dRegAut)
• Syntax
  • character: a
  • choice: R₁ | R₂
  • concat: R₁ · R₂, also sometimes written R₁ R₂
  • empty string: ε
  • repeat: R*
Regular expressions used for scanning
Examples
  if                                    (IF);
  [a-z][a-z0-9]*                        (ID);
  [0-9]+                                (NUM);
  ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   (REAL);
  ("--"[a-z]*"\n")|(" "|"\t")           (continue());
  .                                     (error (); continue());
Resolving ambiguities
• Rule: when a string can match multiple tokens, the longest matching token wins
  • Rules: if (IF); and [a-z][a-z0-9]* (ID);
  • Input ifx > 0 starts with ID (“ifx”), not with IF followed by ID (“x”)
• We also need to specify priorities if several tokens match with the same length
  • Usual rule: earliest declaration wins
  • Input if matches both IF and ID (“if”) with length 2; IF is declared first, so IF wins
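The two rules can be sketched in a few lines of SML. This is only an illustration under stated assumptions: the `rule` type, the hand-written matchers `matchIf` and `matchId`, and the function `pick` are hypothetical names, and a generated lexer would drive an automaton rather than call one matcher per rule.

```sml
(* Assumption: a rule pairs a token name with a matcher that returns the
   length of the longest prefix it accepts, or NONE if it accepts none. *)
type rule = string * (char list -> int option)

(* Number of leading characters that satisfy p. *)
fun span p cs =
  let
    fun go (n, c :: rest) = if p c then go (n + 1, rest) else n
      | go (n, []) = n
  in
    go (0, cs)
  end

fun isLowerOrDigit c = Char.isLower c orelse Char.isDigit c

(* if *)
fun matchIf (#"i" :: #"f" :: _) = SOME 2
  | matchIf _ = NONE

(* [a-z][a-z0-9]* *)
fun matchId (c :: rest) =
      if Char.isLower c then SOME (1 + span isLowerOrDigit rest) else NONE
  | matchId [] = NONE

(* Declaration order matters: IF comes before ID. *)
val rules : rule list = [("IF", matchIf), ("ID", matchId)]

(* Longest match wins; on a tie the earlier rule is kept (strict >). *)
fun pick (rs : rule list) (input : string) : (string * string) option =
  let
    val cs = String.explode input
    fun step ((name, m), best) =
      case (m cs, best) of
          (NONE, _) => best
        | (SOME n, NONE) => SOME (name, n)
        | (SOME n, SOME (_, k)) => if n > k then SOME (name, n) else best
  in
    case List.foldl step NONE rs of
        NONE => NONE
      | SOME (name, n) => SOME (name, String.substring (input, 0, n))
  end

val t1 = pick rules "ifx > 0"   (* SOME ("ID", "ifx"): longest match *)
val t2 = pick rules "if (x"     (* SOME ("IF", "if"): tie, earliest rule *)
```

The strict comparison in `step` is what encodes the priority rule: a later rule replaces the current best only if it matches strictly more characters.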
Lexical analysis
Specification: tokens as regular expressions + longest-matching rule + priorities
Formalism: NFA, DFA
Implementation: simulate NFA, or simulate DFA (linear complexity)
Output: program that translates raw text into a stream of tokens
“Classical” approach: from RegEx to NFA to DFA
Total NFA for ID, IF, NUM, REAL
[Figure: combined automaton with states 1-13. Accepting states are labeled ID, IF, NUM, REAL, whitespace, and error; edges are labeled with character classes such as a-z, 0-9, ".", "-", "\n", blank, and other.]
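Simulating a DFA with the longest-matching rule amounts to remembering the last accepting state seen while consuming input. The sketch below does this for a small hand-built DFA covering only NUM and REAL; the state numbering, the dead state `~1`, and the names `delta`, `accepting`, and `longest` are assumptions for illustration and do not correspond to the automaton in the figure.

```sml
(* States (arbitrary numbering for this sketch):
   0 start, 1 digits seen (NUM), 2 "." seen with no digits before it,
   3 REAL; ~1 is the dead state. *)
fun delta (0, c) = if Char.isDigit c then 1 else if c = #"." then 2 else ~1
  | delta (1, c) = if Char.isDigit c then 1 else if c = #"." then 3 else ~1
  | delta (2, c) = if Char.isDigit c then 3 else ~1
  | delta (3, c) = if Char.isDigit c then 3 else ~1
  | delta _ = ~1

fun accepting 1 = SOME "NUM"
  | accepting 3 = SOME "REAL"
  | accepting _ = NONE

(* Scan s from index i. Returns the token kind and the end index of the
   longest match, or NONE if no prefix is accepted. *)
fun longest (s : string, i : int) : (string * int) option =
  let
    fun go (state, j, last) =
      let
        (* Remember the most recent accepting configuration. *)
        val last' =
          case accepting state of
              SOME kind => SOME (kind, j)
            | NONE => last
      in
        if j >= String.size s then last'
        else
          let
            val next = delta (state, String.sub (s, j))
          in
            if next < 0 then last' else go (next, j + 1, last')
          end
      end
  in
    go (0, i, NONE)
  end

val r1 = longest ("10.5+x", 0)   (* SOME ("REAL", 4) *)
val r2 = longest ("42;", 0)      (* SOME ("NUM", 2) *)
val r3 = longest (".", 0)        (* NONE: "." alone is not a token *)
```

Each character is inspected once per call, which is where the linear running time of DFA simulation comes from.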
ML-Lex
• Lexer generator, “built-in” part of SML/NJ
• Accepts lexical specification, produces a scanner
• Example specification

  (* SML declarations *)
  type lexresult = Tokens.token
  fun eof() = Tokens.EOF(0,0)
  %%
  (* Lex definitions *)
  digits=[0-9]+
  %%
  (* Regular Expressions and Actions *)
  if => (Tokens.IF(yypos,yypos+2));
  [a-z][a-z0-9]* => (Tokens.ID(yytext,yypos,yypos + size yytext));
  {digits} => (Tokens.NUM(Int.fromString yytext, yypos, yypos + size yytext));
  ({digits}"."[0-9]*)|([0-9]*"."{digits}) => (Tokens.REAL(Real.fromString yytext, yypos, yypos + size yytext));
  ("--"[a-z]*"\n")|(" "|"\n"|"\t")+ => (continue());
  . => (ErrorMsg.error yypos "Illegal character"; continue());
Lexer states
• Helpful when handling different “kinds” of tokens
• For example, use state
  • INITIAL in general lexing (automatic)
  • STRING when scanning the contents of a string
  • COMMENT when scanning a comment
• Point: keep different concerns apart, which is simpler!
• Syntax:

  ...
  (* Regular Expressions and Actions *)
  <INITIAL>if => (Tokens.IF(yypos,yypos+2));
  <INITIAL>[a-z][a-z0-9]* => (Tokens.ID(yytext,yypos,yypos + size yytext));
  ...
  <INITIAL>"\"" => (YYBEGIN STRING; continue());
  ...
  <STRING>. => (continue());
  ...
Lexical analysis
Specification: tokens as regular expressions + longest-matching rule + priorities
Formalism: NFA, DFA
Implementation: simulate NFA, or simulate DFA (linear complexity)
Output: program that translates raw text into a stream of tokens
Alternative, purely algebraic approach: from RegEx to DFA using regexp derivatives
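To give a flavor of the derivative-based approach, here is a minimal sketch of a matcher built on regular expression derivatives. It only decides whether a whole string matches; it does not construct the DFA (whose states would be the derivatives themselves, identified up to a suitable equivalence). The datatype `re` and the functions `nullable`, `deriv`, and `matches` are names chosen for this sketch, and `Empty` (the regex that matches nothing) is added because derivatives need it.

```sml
datatype re
  = Empty            (* matches no string at all *)
  | Eps              (* empty string ε *)
  | Chr of char      (* character a *)
  | Alt of re * re   (* choice R1 | R2 *)
  | Seq of re * re   (* concat R1 · R2 *)
  | Star of re       (* repeat R* *)

(* nullable r: does r accept the empty string? *)
fun nullable Empty = false
  | nullable Eps = true
  | nullable (Chr _) = false
  | nullable (Alt (r, s)) = nullable r orelse nullable s
  | nullable (Seq (r, s)) = nullable r andalso nullable s
  | nullable (Star _) = true

(* deriv c r accepts exactly the strings w such that r accepts c followed by w. *)
fun deriv c Empty = Empty
  | deriv c Eps = Empty
  | deriv c (Chr d) = if c = d then Eps else Empty
  | deriv c (Alt (r, s)) = Alt (deriv c r, deriv c s)
  | deriv c (Seq (r, s)) =
      if nullable r
      then Alt (Seq (deriv c r, s), deriv c s)
      else Seq (deriv c r, s)
  | deriv c (Star r) = Seq (deriv c r, Star r)

(* Take the derivative by each character in turn, then test nullability. *)
fun matches r s =
  nullable (List.foldl (fn (c, r') => deriv c r') r (String.explode s))

(* [0-9] as a choice over the ten digit characters, and [0-9]+ on top of it. *)
val digit =
  List.foldl (fn (c, r) => Alt (Chr c, r)) Empty (String.explode "0123456789")
val num = Seq (digit, Star digit)

val m1 = matches num "073"   (* true *)
val m2 = matches num ""      (* false *)
val m3 = matches num "7a"    (* false *)
```

Without simplification the derivative terms grow with the input, so a practical implementation would normalize them (for example, rewriting `Alt (Empty, r)` to `r`) before using them as DFA states.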
More on SML [online demo]
Warmup project
Straight-line Programming Language
• Toy programming language: no branching, no loops
• Skip lexing and parsing issues
• Focus on the “meaning”: interpretation
• Syntax
  Stm → Stm ; Stm            (CompoundStm)
  Stm → id := Exp            (AssignStm)
  Stm → print ( ExpList )    (PrintStm)
  Exp → id                   (IdExp)
  Exp → num                  (NumExp)
  Exp → Exp Binop Exp        (OpExp)
  Exp → ( Stm , Exp )        (EseqExp)
  ExpList → Exp , ExpList    (PairExpList)
  ExpList → Exp              (LastExpList)
  Binop → +                  (Plus)
  Binop → -                  (Minus)
  Binop → ×                  (Times)
  Binop → /                  (Div)
Straight-line program
• Source:
  a := 5 + 3;
  b := (print (a, a - 1), 10 * a);
  print (b)
• Corresponding syntax tree:
  [Figure: syntax tree of the program, with nodes CompoundStm, AssignStm, PrintStm, EseqExp, OpExp, NumExp, IdExp, BinOp, PairExpList, and LastExpList]
SLP syntax representation datatype
• Grammar
  Stm → Stm ; Stm            (CompoundStm)
  Stm → id := Exp            (AssignStm)
  Stm → print ( ExpList )    (PrintStm)
  Exp → id                   (IdExp)
  Exp → num                  (NumExp)
  Exp → Exp Binop Exp        (OpExp)
  Exp → ( Stm , Exp )        (EseqExp)
  ExpList → Exp , ExpList    (PairExpList)
  ExpList → Exp              (LastExpList)
  Binop → +                  (Plus)
  Binop → -                  (Minus)
  Binop → ×                  (Times)
  Binop → /                  (Div)
• SML declaration
  type id = string
  datatype binop
    = Plus | Minus | Times | Div
  datatype stm
    = CompoundStm of stm * stm
    | AssignStm of id * exp
    | PrintStm of exp list
  and exp
    = IdExp of id
    | NumExp of int
    | OpExp of exp * binop * exp
    | EseqExp of stm * exp
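Functions over this representation typically come in mutually recursive pairs, one per datatype, joined with `and`. As a small illustration (not part of the project assignment), the sketch below turns a syntax tree back into source-like text. The function names `binopToString`, `stmToString`, and `expToString` are hypothetical, and nested operator expressions are printed without extra parentheses to keep the example short.

```sml
type id = string

datatype binop = Plus | Minus | Times | Div

datatype stm
  = CompoundStm of stm * stm
  | AssignStm of id * exp
  | PrintStm of exp list
and exp
  = IdExp of id
  | NumExp of int
  | OpExp of exp * binop * exp
  | EseqExp of stm * exp

fun binopToString Plus = "+"
  | binopToString Minus = "-"
  | binopToString Times = "*"
  | binopToString Div = "/"

(* One function per datatype; `and` lets each call the other. *)
fun stmToString (CompoundStm (s1, s2)) = stmToString s1 ^ "; " ^ stmToString s2
  | stmToString (AssignStm (x, e)) = x ^ " := " ^ expToString e
  | stmToString (PrintStm es) =
      "print (" ^ String.concatWith ", " (map expToString es) ^ ")"
and expToString (IdExp x) = x
  | expToString (NumExp n) = Int.toString n
  | expToString (OpExp (l, oper, r)) =
      expToString l ^ " " ^ binopToString oper ^ " " ^ expToString r
  | expToString (EseqExp (s, e)) =
      "(" ^ stmToString s ^ ", " ^ expToString e ^ ")"

val shown = stmToString (AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)))
(* shown = "a := 5 + 3" *)
```

The same clause-per-constructor shape carries over to other traversals of the tree.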
SLP syntax representation
• Source program
  a := 5 + 3;
  b := (print (a, a - 1), 10 * a);
  print (b)
• SML value:
  val prog =
    CompoundStm (
      AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
      CompoundStm (
        AssignStm ("b",
          EseqExp (
            PrintStm [IdExp "a", OpExp (…)],
            OpExp (NumExp 10, …))),
        PrintStm [IdExp "b"]))
  [Figure: the corresponding syntax tree, shown next to the SML value]
Project assignment
• Follow descriptions pp. 10-12 in MCIML
• “Modularity principles” pp. 9-10: discussed on Friday, may be ignored at first
• Clarification:
  • Let bindings are OK
  • References, arrays, and ref update (:=) are not OK
Summary
• Warm-up project: Program in SML!
  • Straight-line programming language, no lexing/parsing involved
  • Express programs: use abstract syntax tree datatype
  • Project specified on website, essentially as in the book
• Lexical analysis
  • Avoid complexity in grammar. Use lexer
  • Based on regular expressions
  • Implementation via RE → NFA → DFA (theory assumed known)
  • Alternatives: via RE derivatives → DFA
• Tools: ML-Lex
  • Scanner generator, outputs SML code from spec
  • Note lexer states