Scanning COMP 520: Compiler Design (4 credits) Alexander Krolik - - PowerPoint PPT Presentation



slide-1
SLIDE 1

COMP 520 Winter 2017 Scanning (1)

Scanning

COMP 520: Compiler Design (4 credits) Alexander Krolik

alexander.krolik@mail.mcgill.ca

MWF 13:30-14:30, MD 279

slide-2
SLIDE 2

COMP 520 Winter 2017 Scanning (2)

Announcements (Friday, January 6th) Facebook group:

  • Useful for discussions/announcements
  • Link on myCourses or in email

Milestones:

  • Continue picking your group (3 recommended)
  • Create a GitHub account, learn git as needed

Midterm:

  • Either 1st or 2nd week after break on the Friday
  • 1.5 hour “in class” midterm, so either 30 minutes before/after class. Thoughts?
  • Tentative date: Friday, March 10th. Or the week after? Thoughts?
slide-3
SLIDE 3

COMP 520 Winter 2017 Scanning (3)

Readings Textbook, Crafting a Compiler:

  • Chapter 2: A Simple Compiler
  • Chapter 3: Scanning–Theory and Practice

Modern Compiler Implementation in Java:

  • Chapter 1: Introduction
  • Chapter 2: Lexical Analysis

Flex tool:

  • Manual - https://github.com/westes/flex
  • Reference book, Flex & bison - http://mcgill.worldcat.org/title/flex-bison/oclc/457179470

slide-4
SLIDE 4

COMP 520 Winter 2017 Scanning (4)

Scanning:

  • also called lexical analysis;
  • is the first phase of a compiler;
  • takes an arbitrary source file, and identifies meaningful character sequences.
  • note: at this point we do not have any semantic or syntactic information

Overall:

  • a scanner transforms a string of characters into a string of tokens.
slide-5
SLIDE 5

COMP 520 Winter 2017 Scanning (5)

An example:

var a = 5
if (a == 5)
{
    print "success"
}

tVAR
tIDENTIFIER: a
tASSIGN
tINTEGER: 5
tIF
tLPAREN
tIDENTIFIER: a
tEQUALS
tINTEGER: 5
tRPAREN
tLBRACE
tIDENTIFIER: print
tSTRING: success
tRBRACE

slide-6
SLIDE 6

COMP 520 Winter 2017 Scanning (6)

Review of COMP 330:

  • Σ is an alphabet, a (usually finite) set of symbols;
  • a word is a finite sequence of symbols from an alphabet;
  • Σ∗ is a set consisting of all possible words using symbols from Σ;
  • a language is a subset of Σ∗.

An example:

  • alphabet: Σ={0,1}
  • words: {ε, 0, 1, 00, 01, 10, 11, . . . , 0001, 1000, . . . }
  • language:
      – {1, 10, 100, 1000, 10000, 100000, . . . }: “1” followed by any number of zeros
      – {0, 1, 1000, 0011, 11111100, . . . }: ?!
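A language is just a set of words, so deciding membership is a yes/no question about a string. As a minimal sketch, the first example language (“1” followed by any number of zeros) can be checked as below. The function name in_language is made up for illustration.

```c
#include <stdbool.h>

/* Returns true if w belongs to the language over {0,1} described as
 * "1 followed by any number of zeros": 1, 10, 100, ... */
bool in_language(const char *w)
{
    if (*w != '1')
        return false;              /* a word must start with a single '1' */
    for (w++; *w != '\0'; w++)
        if (*w != '0')
            return false;          /* every remaining symbol must be '0' */
    return true;
}
```

For instance, in_language("1000") holds, while the empty word and "0011" are rejected.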

slide-7
SLIDE 7

COMP 520 Winter 2017 Scanning (7)

A regular expression:

  • is a string that defines a language (set of strings);
  • in fact, a regular language.

A regular language:

  • is a language that can be accepted by a DFA;
  • is a language for which a regular expression exists.
slide-8
SLIDE 8

COMP 520 Winter 2017 Scanning (8)

In a scanner, tokens are defined by regular expressions:

  • ∅ is a regular expression [the empty set: a language with no strings]
  • ε is a regular expression [the empty string]
  • a, where a ∈ Σ is a regular expression [Σ is our alphabet]
  • if M and N are regular expressions, then M|N is a regular expression [alternation: either M or N]
  • if M and N are regular expressions, then M · N is a regular expression [concatenation: M followed by N]
  • if M is a regular expression, then M∗ is a regular expression [zero or more occurrences of M]

What are M? and M+?
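Both operators are shorthand and can be expanded using only the constructions listed above. Under the usual definitions they read:

```latex
\begin{align*}
M?    &= M \mid \varepsilon && \text{(zero or one occurrence of } M\text{)} \\
M^{+} &= M \cdot M^{*}      && \text{(one or more occurrences of } M\text{)}
\end{align*}
```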

slide-9
SLIDE 9

COMP 520 Winter 2017 Scanning (9)

Examples of regular expressions:

  • Alphabet Σ={a,b}
  • a* = {ε, a, aa, aaa, aaaa, . . . }
  • (ab)* = {ε, ab, abab, ababab, . . . }
  • (a|b)* = {ε, a, b, aa, bb, ab, ba, . . . }
  • a*ba* = strings with exactly 1 “b”
  • (a|b)*b(a|b)* = strings with at least 1 “b”
slide-10
SLIDE 10

COMP 520 Winter 2017 Scanning (10)

We can write regular expressions for the tokens in our source language using standard POSIX notation:

  • simple operators: "*", "/", "+", "-"
  • parentheses: "(", ")"
  • integer constants: 0|([1-9][0-9]*)
  • identifiers: [a-zA-Z_][a-zA-Z0-9_]*
  • white space: [ \t\n]+

[...] defines a character class:

  • matches a single character from a set;
  • allows ranges of characters to be “alternated”; and
  • can be negated using “^” (i.e. [^\n]).

The wildcard character:

  • is represented as “.” (dot); and
  • matches all characters except newlines by default (in most implementations).
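These patterns can be tried outside of flex with the POSIX regex.h interface (regcomp, regexec, regfree). The sketch below is only an experiment harness, not how flex works internally. The helper full_match and the two macros are hypothetical names, and the ^...$ anchors are added so that the whole string must match.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* The token patterns from the slide, anchored to cover the whole string. */
#define INT_CONST  "^(0|([1-9][0-9]*))$"
#define IDENTIFIER "^[a-zA-Z_][a-zA-Z0-9_]*$"

/* Returns true if s matches the POSIX extended regular expression. */
bool full_match(const char *pattern, const char *s)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                       /* pattern failed to compile */
    bool ok = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

With these definitions, "17" matches INT_CONST but "007" does not, and "a_1" matches IDENTIFIER but "1a" does not.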
slide-11
SLIDE 11

COMP 520 Winter 2017 Scanning (11)

A scanner:

  • can be generated using tools like flex (or lex), JFlex, . . . ;
  • by defining regular expressions for each type of token.

Internally, a scanner or lexer:

  • uses a combination of deterministic finite automata (DFA);
  • plus some glue code to make it work.
slide-12
SLIDE 12

COMP 520 Winter 2017 Scanning (12)

A finite state machine (FSM):

  • represents a set of possible states for a system;
  • uses transitions to link related states.

A deterministic finite automaton (DFA):

  • is a machine which recognizes regular languages;
  • for an input sequence of symbols, the automaton either accepts or rejects the string;
  • it works deterministically, that is, given some input, there is only one sequence of steps.
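To make the accept/reject behaviour concrete, here is a hand-coded DFA for the integer constant pattern 0|([1-9][0-9]*). This is one possible encoding, and the state names are invented for the example. Each input symbol causes exactly one transition.

```c
#include <stdbool.h>

/* States of a DFA for 0|([1-9][0-9]*). DEAD is a trap state. */
enum state { START, ZERO, NONZERO, DEAD };

bool accepts_integer(const char *input)
{
    enum state s = START;
    for (; *input != '\0'; input++) {
        char c = *input;
        switch (s) {
        case START:
            if (c == '0')
                s = ZERO;
            else if (c >= '1' && c <= '9')
                s = NONZERO;
            else
                s = DEAD;
            break;
        case NONZERO:
            s = (c >= '0' && c <= '9') ? NONZERO : DEAD;
            break;
        case ZERO:   /* nothing may follow a lone 0 */
        case DEAD:
            s = DEAD;
            break;
        }
    }
    return s == ZERO || s == NONZERO;   /* the two accepting states */
}
```

The string "17" ends in NONZERO and is accepted, while "007" falls into DEAD after the second character and is rejected.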
slide-13
SLIDE 13

COMP 520 Winter 2017 Scanning (13)

Background (DFAs) from textbook, “Crafting a Compiler”

slide-14
SLIDE 14

COMP 520 Winter 2017 Scanning (14)

DFAs (for the previous example regexes):

[Figure: one DFA per token class from the previous slide. Recoverable transition labels: * / + ( ) for the operators and parentheses; 0, 1-9 and 0-9 for integer constants; a-zA-Z_ and a-zA-Z0-9_ for identifiers; \t\n for white space.]

slide-15
SLIDE 15

COMP 520 Winter 2017 Scanning (15)

Try it yourself:

  • Design a DFA matching binary strings divisible by 3. Use only 3 states.
  • Design a regular expression for floating point numbers of form: {1., 1.1, .1} (a digit on at least one side of the decimal point)
  • Design a DFA for the language above.
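One possible answer to the first exercise, written as a table-driven DFA: let the state be the value read so far modulo 3. Reading bit b takes state s to (2s + b) mod 3, which gives the table below. The names delta and divisible_by_3 are made up for this sketch, and it assumes the empty string counts as 0 and is therefore accepted.

```c
#include <stdbool.h>

/* delta[s][b]: next state when the value so far is s (mod 3) and bit b is read.
 * Appending a bit doubles the value and adds b, so the entry is (2*s + b) % 3. */
static const int delta[3][2] = {
    {0, 1},   /* remainder 0 */
    {2, 0},   /* remainder 1 */
    {1, 2},   /* remainder 2 */
};

bool divisible_by_3(const char *bits)
{
    int state = 0;                          /* start state: remainder 0 */
    for (; *bits != '\0'; bits++) {
        if (*bits != '0' && *bits != '1')
            return false;                   /* symbol outside the alphabet {0,1} */
        state = delta[state][*bits - '0'];
    }
    return state == 0;                      /* accept iff the remainder is 0 */
}
```

For example, "110" (six) and "1001" (nine) are accepted, while "10" (two) and "111" (seven) are rejected.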
slide-16
SLIDE 16

COMP 520 Winter 2017 Scanning (16)

Background (Scanner Table) from textbook, “Crafting a Compiler”

slide-17
SLIDE 17

COMP 520 Winter 2017 Scanning (17)

Background (Scanner Algorithm) from textbook, “Crafting a Compiler”

slide-18
SLIDE 18

COMP 520 Winter 2017 Scanning (18)

A non-deterministic finite automaton (NFA):

  • is a machine which recognizes regular languages;
  • for an input sequence of symbols, the automaton either accepts or rejects the string;
  • it works non-deterministically, that is, given some input, there is potentially more than one path;
  • an NFA accepts a string if at least one path leads to an accept.

Note: DFAs and NFAs are equally powerful.
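One way to see why the two are equally powerful is to simulate an NFA by tracking the set of states it could be in after each symbol; each such set behaves like a single DFA state. The sketch below does this for a small hand-made NFA for (a|b)*ab, with the state set kept as a bitmask. The automaton and the function names are illustrative assumptions, not taken from the textbook.

```c
#include <stdbool.h>

/* NFA for (a|b)*ab with states 0, 1, 2 (bit i set = state i is active):
 *   0 --a,b--> 0     0 --a--> 1     1 --b--> 2 (accepting) */
static unsigned step(unsigned states, char c)
{
    unsigned next = 0;
    if (states & 1u) {                      /* from state 0 */
        if (c == 'a')
            next |= 1u | 2u;                /* stay in 0, or guess "ab" starts here */
        if (c == 'b')
            next |= 1u;
    }
    if ((states & 2u) && c == 'b')          /* from state 1 */
        next |= 4u;
    return next;
}

bool nfa_accepts(const char *input)
{
    unsigned states = 1u;                   /* start in {0} */
    for (; *input != '\0'; input++)
        states = step(states, *input);
    return (states & 4u) != 0;              /* accept if some path ends in state 2 */
}
```

On "aab" the active sets are {0}, {0,1}, {0,1}, {0,2}, so the string is accepted because at least one path reaches the accepting state.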

slide-19
SLIDE 19

COMP 520 Winter 2017 Scanning (19)

Regular Expressions to NFA (1) from textbook, “Crafting a Compiler”

slide-20
SLIDE 20

COMP 520 Winter 2017 Scanning (20)

Regular Expressions to NFA (2) from textbook, “Crafting a Compiler”

slide-21
SLIDE 21

COMP 520 Winter 2017 Scanning (21)

Regular Expressions to NFA (3) from textbook, “Crafting a Compiler”

slide-22
SLIDE 22

COMP 520 Winter 2017 Scanning (22)

How to go from regular expressions to DFAs?

  • 1. flex accepts a list of regular expressions (regex);
  • 2. converts each regex internally to an NFA (Thompson construction);
  • 3. converts each NFA to a DFA (subset construction)
  • 4. may minimize DFA

See “Crafting a Compiler”, Chapter 3; or “Modern Compiler Implementation in Java”, Chapter 2

slide-23
SLIDE 23

COMP 520 Winter 2017 Scanning (23)

What you should know:

  • 1. Understand the definition of a regular language, whether that be: prose, regular expression, DFA, or NFA.

  • 2. Given the definition of a regular language, construct either a regular expression or an automaton.

What you do not need to know:

  • 1. Specific algorithms for converting between regular language definitions.
  • 2. DFA minimization
slide-24
SLIDE 24

COMP 520 Winter 2017 Scanning (24)

Let’s assume we have a collection of DFAs, one for each lex rule

reg_expr1 -> DFA1
reg_expr2 -> DFA2
...
reg_exprn -> DFAn

How do we decide which regular expression should match the next characters to be scanned?

slide-25
SLIDE 25

COMP 520 Winter 2017 Scanning (25)

Given DFAs D1, . . . , Dn, ordered by the input rule order, the behaviour of a flex-generated scanner on an input string is:

while input is not empty do
    si := the longest prefix that Di accepts (for each i)
    l := max{|si|}
    if l > 0 then
        j := min{i : |si| = l}
        remove sj from input
        perform the jth action
    else (error case)
        move one character from input to output
    end
end

  • The longest initial substring match forms the next token, and it is subject to some action
  • The first rule to match breaks any ties
  • Non-matching characters are echoed back
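The selection step can be sketched in plain C. Here each rule is modelled as a function returning the length of the longest prefix it accepts, standing in for the DFAs D1, . . . , Dn. The three rules and all names are hypothetical, and a real flex scanner uses one combined automaton rather than a loop over rules.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule returns the length of the longest prefix of s it accepts, 0 if none. */
typedef size_t (*rule_fn)(const char *s);

static size_t match_keyword(const char *s, const char *kw)
{
    size_t n = strlen(kw);
    return strncmp(s, kw, n) == 0 ? n : 0;
}

static size_t rule_import(const char *s)   { return match_keyword(s, "import"); }
static size_t rule_continue(const char *s) { return match_keyword(s, "continue"); }

static size_t rule_identifier(const char *s)   /* [a-zA-Z_][a-zA-Z0-9_]* */
{
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_')
        for (n = 1; isalnum((unsigned char)s[n]) || s[n] == '_'; n++)
            ;
    return n;
}

/* Rules in input order: keywords are listed before the identifier rule. */
static const rule_fn rules[] = { rule_import, rule_continue, rule_identifier };
enum { NUM_RULES = sizeof rules / sizeof rules[0] };

/* Returns the index j of the selected rule and stores the match length in *len.
 * Returns -1 with *len == 0 when no rule matches (the error case). */
int next_token(const char *input, size_t *len)
{
    int best = -1;
    *len = 0;
    for (int i = 0; i < NUM_RULES; i++) {
        size_t n = rules[i](input);
        if (n > *len) {          /* strictly longer only: earlier rules win ties */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

On "importedFiles" the identifier rule wins with length 13 (longest match). On "continue foo" both the keyword and the identifier rule match 8 characters, and the keyword wins because it comes first.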
slide-26
SLIDE 26

COMP 520 Winter 2017 Scanning (26)

Why the “longest match” principle? Example: keywords

...
import                  return tIMPORT;
[a-zA-Z_][a-zA-Z0-9_]*  return tIDENTIFIER;
...

Given a string “importedFiles”, we want the token output of the scanner to be

tIDENTIFIER(importedFiles)

and not

tIMPORT tIDENTIFIER(edFiles)

Because we prefer longer matches, we get the right result.

slide-27
SLIDE 27

COMP 520 Winter 2017 Scanning (27)

Why the “first match” principle? Example: keywords

...
continue                return tCONTINUE;
[a-zA-Z_][a-zA-Z0-9_]*  return tIDENTIFIER;
...

Given a string “continue foo”, we want the token output of the scanner to be

tCONTINUE tIDENTIFIER(foo)

and not

tIDENTIFIER(continue) tIDENTIFIER(foo)

“First match” rule gives us the right answer: When both tCONTINUE and tIDENTIFIER match, prefer the first.

slide-28
SLIDE 28

COMP 520 Winter 2017 Scanning (28)

When “first longest match” (flm) is not enough, look-ahead may help. FORTRAN allows for the following tokens:

.EQ., 363, 363., .363

flm analysis of 363.EQ.363 gives us:

tFLOAT(363) E Q tFLOAT(0.363)

What we actually want is:

tINTEGER(363) tEQ tINTEGER(363)

To distinguish between a tFLOAT and a tINTEGER followed by a “.”, flex allows us to use look-ahead, using ’/’:

363/\.EQ\.    return tINTEGER;

A look-ahead matches on the full pattern, but only processes the characters before the ’/’. All subsequent characters are returned to the input stream for further matches.
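The effect of a trailing context can be mimicked by hand: check the whole pattern, but report only the length of the part before the ’/’. The sketch below does this for a generalized version of the rule (any digit sequence followed by .EQ.). The function name is hypothetical and the generalization is an assumption, not part of the slide.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Models a rule of the form  digits / ".EQ."
 * Returns how many characters to consume as tINTEGER, or 0 if the rule
 * does not apply. The look-ahead text itself is never consumed. */
size_t match_int_before_eq(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    if (n == 0)
        return 0;                        /* no digits: the rule does not match */
    if (strncmp(s + n, ".EQ.", 4) != 0)
        return 0;                        /* required context is missing */
    return n;                            /* ".EQ." stays in the input */
}
```

On "363.EQ.363" this reports 3, so the scanner would emit tINTEGER and continue scanning at ".EQ.363".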

slide-29
SLIDE 29

COMP 520 Winter 2017 Scanning (29)

Another example taken from FORTRAN, which ignores whitespace:

  • 1. DO5I = 1.25 -> DO5I=1.25

in C, this is equivalent to an assignment:

do5i = 1.25;

  • 2. DO 5 I = 1,25 -> DO5I=1,25

in C, this is equivalent to a loop:

for(i=1;i<=25;++i){...}

(5 is interpreted as a line number, i.e. a statement label, here)

To get the correct token output:

  • 1. flm analysis correct:

tID(DO5I) tEQ tREAL(1.25)

  • 2. flm analysis gives the incorrect result. What we want is:

tDO tINT(5) tID(I) tEQ tINT(1) tCOMMA tINT(25)

But we cannot decide on tDO until we see the comma. Look-ahead comes to the rescue:

DO/({letter}|{digit})*=({letter}|{digit})*, return tDO;

slide-30
SLIDE 30

COMP 520 Winter 2017 Scanning (30)

Announcements (Monday, January 9th) Facebook group:

  • Useful for discussions/announcements
  • Link on myCourses or in email

Milestones:

  • Learn flex, bison, SableCC
  • Assignment 1 out Wednesday
  • Continue forming your groups

Midterm:

  • Friday, March 17th
  • 1.5 hour “in class” midterm. You have the option of either 13:00-14:30 or 13:30-15:00.
slide-31
SLIDE 31

COMP 520 Winter 2017 Scanning (31)

Introduce yourselves! (no, not joking)

  • Name
  • Major/year
  • If grad student, research area
  • Any other fun facts we should know...
slide-32
SLIDE 32

COMP 520 Winter 2017 Scanning (32)

In practice, we use tools to generate scanners. Using flex:

[Figure: joos.l -> flex -> lex.yy.c -> gcc -> scanner; foo.joos -> scanner -> tokens]

slide-33
SLIDE 33

COMP 520 Winter 2017 Scanning (33)

A flex file:

  • is used to define a scanner implementation;
  • has 3 main sections divided by %%:
  • 1. Declarations, helper code
  • 2. Regular expression rules and associated actions
  • 3. User code
  • and saves much effort in compiler design.

/* includes and other arbitrary C code. copied to the scanner verbatim */
%{
%}

/* helper definitions */
DIGIT [0-9]

%%
    /* regex + action rules come after the first %% */
RULE    ACTION
%%
/* user code comes after the second %% */
int main () { return 0; }

slide-34
SLIDE 34

COMP 520 Winter 2017 Scanning (34)

$ cat print_tokens.l # flex source code
/* includes and other arbitrary C code */
%{
#include <stdio.h> /* for printf */
%}

/* helper definitions */
DIGIT [0-9]

/* regex + action rules come after the first %% */
%%
[ \t\n]+                printf ("white space, length %i\n", yyleng);
"*"                     printf ("times\n");
"/"                     printf ("div\n");
"+"                     printf ("plus\n");
"-"                     printf ("minus\n");
"("                     printf ("left parenthesis\n");
")"                     printf ("right parenthesis\n");
0|([1-9]{DIGIT}*)       printf ("integer constant: %s\n", yytext);
[a-zA-Z_][a-zA-Z0-9_]*  printf ("identifier: %s\n", yytext);
%%
/* user code comes after the second %% */
int main () {
    yylex ();
    return 0;
}

slide-35
SLIDE 35

COMP 520 Winter 2017 Scanning (35)

Sometimes a token is not enough, we need the value as well:

  • want to capture the value of an identifier; or
  • need the value of a string, int, or float literal.

In these cases, flex provides:

  • yytext: the scanned sequence of characters;
  • yylval: a user-defined variable from the parser (bison) to be returned with the token; and
  • yyleng: the length of the scanned sequence.

[a-zA-Z_][a-zA-Z0-9_]* {
    /* copy the lexeme: yytext is reused for the next match */
    yylval.stringconst = (char *)malloc(strlen(yytext)+1);
    sprintf(yylval.stringconst,"%s",yytext);
    return tIDENTIFIER;
}
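The copy matters because the scanner reuses the yytext buffer on the next match, so a pointer saved without copying would soon point at different text. A minimal sketch of the copy as a standalone helper follows. The name copy_lexeme is hypothetical; in an action it would be called with yytext and yyleng.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a freshly allocated, NUL-terminated copy of the first leng
 * characters of text, or NULL if allocation fails. The caller owns the
 * result and must free() it. */
char *copy_lexeme(const char *text, size_t leng)
{
    char *copy = malloc(leng + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, leng);
    copy[leng] = '\0';
    return copy;
}
```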

slide-36
SLIDE 36

COMP 520 Winter 2017 Scanning (36)

Using flex to create a scanner is really simple:

$ vim print_tokens.l
$ flex print_tokens.l
$ gcc -o print_tokens lex.yy.c -lfl

slide-37
SLIDE 37

COMP 520 Winter 2017 Scanning (37)

Running this scanner with input:

a*(b-17) + 5/c

$ echo "a*(b-17) + 5/c" | ./print_tokens

Our print_tokens scanner outputs:

identifier: a
times
left parenthesis
identifier: b
minus
integer constant: 17
right parenthesis
white space, length 1
plus
white space, length 1
integer constant: 5
div
identifier: c
white space, length 1

slide-38
SLIDE 38

COMP 520 Winter 2017 Scanning (38)

Count lines and characters:

%{
int lines = 0, chars = 0;
%}

%%
\n      lines++; chars++;
.       chars++;
%%
int main () {
    yylex ();
    printf ("#lines = %i, #chars = %i\n", lines, chars);
    return 0;
}

slide-39
SLIDE 39

COMP 520 Winter 2017 Scanning (39)

Getting (better) position information in flex:

  • is easy for line numbers: option and variable yylineno; but
  • is more involved for character positions.

If position information is useful for further compilation phases:

  • it can be stored in a structure yylloc provided by the parser (bison); but
  • must be updated by a user action.

typedef struct yyltype {
    int first_line, first_column, last_line, last_column;
} yyltype;

%{
#define YY_USER_ACTION yylloc.first_line = yylloc.last_line = yylineno;
%}
%option yylineno
%%
. { printf("Error: (line %d) unexpected char '%s'\n", yylineno, yytext); exit(1); }
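Column tracking is the "more involved" part: the user action has to walk over the matched text and reset the column at every newline. The sketch below shows one way to do it, assuming a convention where last_line/last_column point just past the lexeme and the location starts at {1, 1, 1, 1}. The struct and function names are hypothetical; in a flex file the function could be invoked from YY_USER_ACTION with yytext and yyleng.

```c
#include <stddef.h>

struct location {
    int first_line, first_column;   /* where the current lexeme starts */
    int last_line, last_column;     /* position just past the lexeme   */
};

/* Advances loc over one lexeme: the new lexeme starts where the previous
 * one ended, and each newline moves to column 1 of the next line. */
void update_location(struct location *loc, const char *text, size_t leng)
{
    loc->first_line = loc->last_line;
    loc->first_column = loc->last_column;
    for (size_t i = 0; i < leng; i++) {
        if (text[i] == '\n') {
            loc->last_line++;
            loc->last_column = 1;
        } else {
            loc->last_column++;
        }
    }
}
```

Starting from {1, 1, 1, 1}, scanning "ab" gives a lexeme spanning columns 1 to 3 (exclusive) on line 1, and a following newline moves the end position to line 2, column 1.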

slide-40
SLIDE 40

COMP 520 Winter 2017 Scanning (40)

Actions in a flex file can either:

  • do nothing – ignore the characters;
  • perform some computation, call a function, etc.; and/or
  • return a token (token definitions provided by the parser).

%{
#include <stdlib.h> /* for atoi */
#include <stdio.h> /* for printf */
#include "lang.tab.h" /* for tokens */
%}
%%
[aeiouy]    /* ignore */
[0-9]+      printf ("%i", atoi (yytext) + 1);
'\\n'       { yylval.rune_const = '\n'; return tRUNECONST; }
%%
int main () {
    yylex ();
    return 0;
}

slide-41
SLIDE 41

COMP 520 Winter 2017 Scanning (41)

Summary

  • a scanner transforms a string of characters into a string of tokens;
  • scanner generating tools like flex allow you to define a regular expression for each type of token;
  • internally, the regular expressions are transformed into deterministic finite automata for matching;
  • to break ties, matching uses 2 principles: “longest match” and “first match”.