Introduction to lex (or flex)
Some slides borrowed from M. Scherger

Lex/Flex: A Scanner Generator in C
    Regular Expression
        -> Thompson's Construction
    Nondeterministic Finite Automaton
        -> "Subset" Construction
    Deterministic Finite Automaton
        -> Table-driven Scanner
    So why not do this with a tool?
2 Introduction to lex (or flex) Fall 2012

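To make the last stage of that pipeline concrete, here is a minimal hand-written sketch of a table-driven scanner in C. It is not output from lex; the pattern `[0-9]+(\.[0-9]+)?`, the state names, and the function `scan_number` are all assumptions chosen for illustration. A generated scanner works on the same principle, with much larger tables.

```c
#include <ctype.h>

/* Character classes: the columns of the transition table. */
enum { C_DIGIT, C_DOT, C_OTHER, N_CLASSES };

/* DFA states: the rows of the table. DEAD means "no transition". */
enum { S_START, S_INT, S_DOT, S_FRAC, N_STATES };
#define DEAD (-1)

/* Transition table for the illustrative pattern [0-9]+(\.[0-9]+)? */
static const int delta[N_STATES][N_CLASSES] = {
    /*            digit    '.'    other */
    /* START */ { S_INT,  DEAD,  DEAD },
    /* INT   */ { S_INT,  S_DOT, DEAD },
    /* DOT   */ { S_FRAC, DEAD,  DEAD },
    /* FRAC  */ { S_FRAC, DEAD,  DEAD },
};

/* 1 if the state is accepting, 0 otherwise. */
static const int accepting[N_STATES] = { 0, 1, 0, 1 };

static int char_class(char c)
{
    if (isdigit((unsigned char)c)) return C_DIGIT;
    if (c == '.') return C_DOT;
    return C_OTHER;
}

/* Return the length of the longest prefix of s that the DFA accepts,
 * or 0 if no prefix is accepted. */
int scan_number(const char *s)
{
    int state = S_START;
    int longest = 0;
    for (int i = 0; s[i] != '\0'; i++) {
        state = delta[state][char_class(s[i])];
        if (state == DEAD)
            break;
        if (accepting[state])
            longest = i + 1;   /* remember the last accepting position */
    }
    return longest;
}
```

For example, `scan_number("3.14abc")` returns 4, while `scan_number("7.")` returns 1, because the DFA falls back to the last accepting position it passed through.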
Lex
    Lex is one such tool for creating lexical analyzers
        M. E. Lesk and E. Schmidt, 1975
    Lexical analyzers tokenize input streams
    Regular expressions define tokens
        Tokens are the terminals of a language
    Converts regular expressions into DFAs
        DFAs are implemented as table-driven state machines
    Some versions of Lex are proprietary, so not all versions of *nix come with an open source version
    flex (Fast Lexical Analyzer) is an open source version
        Vern Paxson

The Basic Process
    Lex source program (any.l)  ->  Lex compiler  ->  lex.yy.c
    lex.yy.c                    ->  C compiler    ->  a.out
    Input stream                ->  a.out         ->  Sequence of tokens

Format of a lex File
    Definitions
    %%
    Rules
    %%
    User code

    1st section holds declarations of simple name definitions and start conditions
    2nd section holds pattern-action pairs
    3rd section is copied directly to lex.yy.c
        C code and comments
    Typical file extensions: .l .lex .flex

Compiling and Running
    > flex linenos.flex
    > gcc lexyy.c -lfl
    > a.out < infile > outfile
    yywrap() issue: linking with -lfl supplies a default yywrap()

Regular Expressions and Lex
    A regular expression is an expression that matches a set of strings (the "language" of the regular expression).
    In its basic form, a regular expression is built up out of basic expressions (individual symbols) and the operations choice (|), concatenation (no operator), and repetition (*).
    A regular expression may also contain certain other metasymbols:
        parentheses for grouping (to change precedence, just as in arithmetic)
        others as needed to extend the operator set in useful ways

Regular Expressions in Lex
    c (c is a single character): matches the character c
        RE        Matches
        A         A
        x         x
        d         d
    \c (c is a single character): use this to escape special characters
        \.        .
        \n        newline
        \t        tab
    "str" (str is a string): matches the entire string str
        "Abc"     Abc
        "The"     The
    [str] (str is a string): matches any single character from str
        [aeiou]   lowercase vowels
        [abcde]   the letters a to e

Regular Expressions: Character Classes
    [x-y] (x and y are characters): all characters in the range x-y
        RE            Matches
        [a-z]         all lowercase letters
        [0-9]         all digits
        [a-df-z]      lowercase letters except e
    These can be combined
        [a-z0-9A-Z]   alphanumeric characters
        [A-Zaeiou]    uppercase letters and lowercase vowels
    [^str] (str is a string): any single character not in str
        [^ \n\t]      all non-whitespace
        [^aeiou]      anything but lowercase vowels

Regular Expressions
    p* (p is a pattern): zero or more occurrences of p
        RE       Matches
        A*       (empty string) A AA AAA ...
        r*       (empty string) r rr ...
        ab*c*    a ab ac abb abc acc abbb abbc abcc accc ...
    p+ (p is a pattern): one or more occurrences of p
        A+       A AA AAA AAAA ...
        ab+      ab abb abbb ...
        a*b+     b ab bb aab abb bbb ...

Regular Expressions
    p? (p is a pattern): zero or one occurrence of p
        RE       Matches
        A?       (empty string) A
        ab?c?    a ab ac abc
    p{m,n} (p is a pattern, m and n are ints): matches m through n occurrences of p
        If ",n" is missing, n = m; if just n is missing, n = ∞
        a{1,3}   a aa aaa
        a{1,1}   a
        a{1}     a
        a{3,}    aaa aaaa aaaaa ...

Regular Expressions
    p1p2 (p1 and p2 are patterns): matches p1 followed by p2
        RE         Matches
        ab         ab
        a+b+       ab aab abb ...
    (p) (p is a pattern): used to override precedence (group things)
        (abc)+     abc abcabc abcabcabc ...
        abc+       abc abcc abccc ...
    p1|p2 (p1 and p2 are patterns): matches either p1 or p2
        a|an|the   a an the
    Notice precedence
        ba|ed      ba ed
        b(a|e)d    bed bad

Regular Expression - Extra Things
    p1/p2 (p1 and p2 are patterns): matches p1 only if it's followed by p2
        p2 is not part of yytext
        RE: a+/bc
        Input: aaabc bc aaaad
        Matches the first aaa only
    ^p (p is a pattern): matches p only if it is at the start of a line
    p$ (p is a pattern): matches p only if it is at the end of a line

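The trailing-context behavior of `a+/bc` can be imitated by hand. The sketch below is only an illustration of the idea, not how lex implements it; the function name `match_a_plus_before_bc` is made up for this example.

```c
#include <string.h>

/* Hand-written imitation of the lex pattern a+/bc.
 * Returns the length of the a+ part if s starts with one or more 'a'
 * immediately followed by "bc"; returns 0 if there is no match.
 * The "bc" is only looked at, never counted, which mirrors the rule
 * that trailing context is not part of yytext. */
int match_a_plus_before_bc(const char *s)
{
    int n = 0;
    while (s[n] == 'a')
        n++;
    if (n == 0)
        return 0;                        /* a+ needs at least one 'a' */
    if (strncmp(s + n, "bc", 2) != 0)
        return 0;                        /* trailing context missing */
    return n;                            /* "bc" stays in the input */
}
```

With the slide's input, `match_a_plus_before_bc("aaabc")` returns 3, while `"bc"` and `"aaaad"` both return 0.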
Two more complex examples
    [-+]?[0-9]+(\.[0-9]+)?([Ee][-+]?[0-9]+)?
    or:
        nat = [0-9]+
        signedNat = [-+]? nat
        number = signedNat(\. nat)? ([Ee] signedNat)?
    C comments
        /\*/*(\**[^/*]/*)*\**\*/

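One way to try the number pattern outside lex is the POSIX `<regex.h>` API, whose extended syntax accepts this expression unchanged. The sketch below assumes a POSIX system; the helper name `is_number` and the added `^`/`$` anchors (to force a whole-string match) are assumptions for illustration, not part of the slide.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole string s matches the slide's number pattern,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int is_number(const char *s)
{
    /* Same pattern as on the slide, anchored at both ends. */
    static const char *pattern =
        "^[-+]?[0-9]+(\\.[0-9]+)?([Ee][-+]?[0-9]+)?$";
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);   /* 0 means "matched" */
    regfree(&re);
    return rc == 0;
}
```

Strings such as `"42"`, `"-3.14"`, and `"6.02E+23"` are accepted; `"1."`, `".5"`, and `"1e"` are rejected because the fraction and exponent parts each require at least one digit.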
Pattern Matching Examples

Format of a lex File
    Definitions
    %%
    Rules
    %%
    User code

    1st section holds declarations of simple name definitions and start conditions
    2nd section holds pattern-action pairs
    3rd section is copied directly to lex.yy.c
        C code and comments

Definitions
    Definitions are of the form:
        name definition
    A name begins with a letter or underscore followed by 0 or more letters, digits, '-', or '_'.
    You access it with {name}
    Example definitions:
        Digit         [0-9]
        Char          [A-Z]
        AlphaNum      [a-zA-Z0-9]
        ws            [ \n\t]
        IntegerConst  [0-9]+

Definitions Example
Digit     [0-9]
Char      [a-zA-Z]
AlphaNum  [a-zA-Z0-9]
%%
{Digit}+"."{Digit}+
({Char}|_)({AlphaNum}|[_-])*   {printf("A name '%s'\n", yytext);}
%%

Rules
    Rules are of the form:
        pattern action
    pattern is the RE to match and action is what to do when it is matched
    Default rule is to echo the input
    Lex matches the longest string possible
        If there is a tie, it matches the 1st rule in the spec
    Actions can be empty: do nothing
    Actions can be complex
        Use {} if multi-lined
        Don't forget ';'s
    yytext contains the string matched

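The "longest match, then first rule" policy can be sketched in plain C. This is a deliberately naive illustration: a real lex scanner runs one combined DFA rather than trying each rule in turn, and the two rules here (the keyword `if` and the identifier pattern `[a-z]+`) along with the names `match_if`, `match_ident`, and `pick_rule` are assumptions made up for the example.

```c
#include <ctype.h>
#include <string.h>

/* Each illustrative "rule" reports how many leading characters of s
 * it matches (0 means no match). */
static int match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static int match_ident(const char *s)
{
    int n = 0;
    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

typedef int (*matcher)(const char *);

/* Array order stands in for the order of rules in the spec. */
static const matcher rules[] = { match_if, match_ident };

/* Returns the index of the winning rule, or -1 if none matches.
 * The length of the winning match is stored in *len. */
int pick_rule(const char *s, int *len)
{
    int best = -1;
    *len = 0;
    for (int i = 0; i < (int)(sizeof rules / sizeof rules[0]); i++) {
        int n = rules[i](s);
        if (n > *len) {      /* strictly longer only: ties keep the earlier rule */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

On the input `"if"` both rules match two characters, so rule 0 (the keyword) wins the tie; on `"iffy"` the identifier rule matches four characters and wins on length.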
Example Rules
\n          linecount++;
[0-9]+      sum += atoi(yytext);
{ws}+
a|an|the    printf("found an article\n");
[aeiou]+    { printf("A string of vowels\n");
              vcnt++; }

Predefined Rules
    ECHO
        Copy yytext to output
[a-z]+      ECHO;
    REJECT
        Go to the next alternative; that is, the second-choice rule is selected and its action taken
she         s++;
he          h++;
        Won't count the embedded he
she         {s++; REJECT;}
he          {h++; REJECT;}
\n
        But this will

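The net effect of the REJECT version can be approximated in plain C by testing both words at every input position. This sketch assumes the scanner also has a catch-all rule that consumes one character at a time after each REJECT; the function name `count_she_he` is made up for illustration, and the code imitates the counting result, not the REJECT mechanism itself.

```c
#include <string.h>

/* Count every occurrence of "she" and of "he" in s, including the
 * "he" embedded in each "she". */
void count_she_he(const char *s, int *she, int *he)
{
    *she = 0;
    *he = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (strncmp(s + i, "she", 3) == 0)
            (*she)++;
        if (strncmp(s + i, "he", 2) == 0)
            (*he)++;
    }
}
```

For the input `"she"` this yields one "she" and one "he". The version without REJECT would report zero for "he", because the three characters are consumed by the longer match.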
Rules Example
ex1.l:
%%
a*b     printf("Token 1 found\n");
c+      printf("Token 2 found\n");
%%
int main(void) {
    yylex();
    return 0;
}

The commands
    lex ex1.l
        produces lex.yy.c
    cc -o ex1 lex.yy.c -ll
        creates the executable
        May need -lfl if using flex
    ./ex1
        to execute
    Default is stdin and stdout, so type aaaaaaabbccd <return>

    aaaaaaabbccd
    Token 1 found
    Token 1 found
    Token 2 found
    d

An Example
Count chars, words, lines
%{
unsigned ccnt = 0, wcnt = 0, lcnt = 0;
%}
word    [^ \t\n]+
eol     \n
%%
{word}  { wcnt++; ccnt += yyleng; }
{eol}   { ccnt++; lcnt++; }
.       ccnt++;
%%
int main(void) { yylex(); return 0; }

The %{ %} pair allows you to make declarations for your lexer

About lex
    Lex uses some predefined functions stored in the lex library (link with -ll or -lfl)
    By default lex copies input to output
    By default lex reads stdin, writes stdout
    Lex reads its input (a lex script) and produces lex.yy.c
    Use %{ and %} in the definitions section to declare globals and put #includes
    You can use flex instead
    Not all 'lex'es are equal!
    Man page has more info!

Example 1: The Simplest Example
The simplest example of a lex program is a scanner that acts like the UNIX `cat` program
%%
.|\n    ECHO;
%%
Or it could be written as
%%
.       ECHO;
\n      ECHO;
%%

Lex Predefined Variables

Flex Internal Names
    Lex internal name      Meaning/Use
    lex.yy.c or lexyy.c    Lex output file name
    yylex                  Lex scanning routine
    yytext                 string matched on current action
    yyleng                 length of yytext
    yyin                   Lex input file (default: stdin)
    yyout                  Lex output file (default: stdout)
    input                  Lex buffered input routine
    ECHO                   Lex default action (print yytext to yyout)
    See the Flex documentation for others

Flex Operational Conventions
    yylex() runs until it is stopped by a return
    Ambiguity is resolved by order
    Any text not explicitly matched is echoed to stdout
    EOF is automatically matched and returns 0 from yylex() (unless yywrap() is suitably defined)
    yylex() returns an int, which can be a token