Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory
COMPILING A LANGUAGE
DCC 888

Dealing with Programming Languages
• LLVM gives developers many tools to interpret or compile a language:
  – The intermediate representation
  – Lots of analyses and optimizations
• We can work on a language that already exists, e.g., C, C++, Java, etc.
• We can design our own language.
When is it worth designing a new language?
We need a front end to convert programs in the source language to LLVM IR.
[Figure: compilation pipeline, showing machine-independent optimizations, such as constant propagation, and machine-dependent optimizations, such as register allocation]

The Simple Calculator
• To illustrate this capacity of LLVM, let's design a very simple programming language:
  – A program is a function application
  – A function contains only one argument, x
  – Only the integer type exists
  – The function body contains only additions, multiplications, references to x, and integer constants, written in Polish notation.
1) Can you understand why we got each of these values?
2) What is the grammar of our language?

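The language described on this slide can be summarized by a small grammar and checked with a direct evaluator. The sketch below is not part of the course code: the grammar is one plausible reading of the bullets, `evalExpr` and `evalPolish` are hypothetical helpers, and the code interprets the expression instead of compiling it. It also assumes that tokens are separated by whitespace.

```cpp
// Assumed grammar (one possible reading of the slide):
//   expr ::= '+' expr expr | '*' expr expr | 'x' | integer
#include <cctype>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: evaluates the next expression read from `in`,
// with the single variable x bound to the value `x`.
static int evalExpr(std::istringstream &in, int x) {
  std::string tk;
  if (!(in >> tk)) throw std::runtime_error("unexpected end of input");
  if (tk == "+" || tk == "*") {
    // Operands are read left to right, in separate statements,
    // so the order of evaluation is well defined.
    const int a = evalExpr(in, x);
    const int b = evalExpr(in, x);
    return tk == "+" ? a + b : a * b;
  }
  if (tk == "x") return x;
  if (std::isdigit(static_cast<unsigned char>(tk[0]))) return std::stoi(tk);
  throw std::runtime_error("unknown token: " + tk);
}

// Hypothetical entry point: evaluates a whole program for a given argument.
int evalPolish(const std::string &source, int x) {
  std::istringstream in(source);
  return evalExpr(in, x);
}
```

With the argument 4, `evalPolish("* x x", 4)` yields 16, `evalPolish("+ x * x 2", 4)` yields 12, and `evalPolish("* x + x 2", 4)` yields 24. Each operator simply consumes the next two complete expressions, which is why no parentheses are needed.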
The Architecture of Our Compiler
[Figure: class diagram relating Driver, Lexer, Parser, Expr (with subclasses VarExpr, NumExpr, MulExpr and AddExpr) and the LLVM libs]
1) Can you guess the meaning of the different arrows?
2) Can you guess the role of each class?
3) What would be a good execution mode for our system?

The Execution Engine
Our execution engine parses the expression, converts it to a function written in LLVM IR, JIT compiles this function, and runs it with the argument passed to the program on the command line.

    $> ./driver 4
    * x x
    Result: 16
    $> ./driver 4
    + x * x 2
    Result: 12
    $> ./driver 4
    * x + x 2
    Result: 24

    ; ModuleID = 'Example'

    define i32 @fun(i32 %x) {
    entry:
      %addtmp = add i32 %x, 2
      %multmp = mul i32 %x, %addtmp
      ret i32 %multmp
    }

Let's start with our lexer. Which tokens do we have?

The Lexer
• A lexer is a program that divides a string of characters into tokens.
  – A token is a terminal in our grammar, i.e., a symbol that is part of the alphabet of our language.
  – Lexers can be easily implemented as finite automata.

Lexer.h:

    #ifndef LEXER_H
    #define LEXER_H
    #include <cstdio>   // getchar, EOF
    #include <string>
    class Lexer {
      public:
        std::string getToken();
        Lexer() : lastChar(' ') {}
      private:
        // Kept as an int so that it can hold EOF as well as any character.
        int lastChar;
        // Returns the current character and reads the next one from stdin.
        inline char getNextChar() {
          char c = static_cast<char>(lastChar);
          lastChar = getchar();
          return c;
        }
    };
    #endif

1) Again: which kind of tokens do we have?
2) Can you guess the implementation of the getToken() method?

Implementation of the Lexer

Lexer.cpp:

    #include <cctype>   // isspace, isalpha, isalnum, isdigit
    #include <cstdio>   // getchar, EOF
    #include "Lexer.h"
    std::string Lexer::getToken() {
      // Skip blanks between tokens.
      while (isspace(lastChar)) { lastChar = getchar(); }
      if (isalpha(lastChar)) {
        // Identifier: a letter followed by letters or digits.
        std::string idStr;
        do { idStr += getNextChar(); } while (isalnum(lastChar));
        return idStr;
      } else if (isdigit(lastChar)) {
        // Number: a sequence of digits.
        std::string numStr;
        do { numStr += getNextChar(); } while (isdigit(lastChar));
        return numStr;
      } else if (lastChar == EOF) {
        // The empty string marks the end of the input.
        return "";
      } else {
        // Anything else is a one-character operator.
        std::string operatorStr;
        operatorStr = getNextChar();
        return operatorStr;
      }
    }

1) Would you be able to represent this lexer as a state machine?
2) We must now define the parser. How can we implement it?

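One way to picture the lexer as a state machine is to make the states explicit. The sketch below is an illustration, not the course's implementation: it reads from a `std::string` instead of standard input, and the names `State` and `tokenize` are hypothetical. It uses three states (between tokens, inside an identifier, inside a number) and the same character classes as `getToken()`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical states of the automaton.
enum class State { Start, Ident, Number };

// Splits `input` into tokens by running the automaton over its characters.
std::vector<std::string> tokenize(const std::string &input) {
  std::vector<std::string> tokens;
  std::string current;
  State state = State::Start;
  std::size_t i = 0;
  while (i <= input.size()) {
    // The end of the input is treated as a blank, so the last token is flushed.
    const char ch = i < input.size() ? input[i] : ' ';
    const unsigned char c = static_cast<unsigned char>(ch);
    switch (state) {
    case State::Start:
      if (std::isalpha(c)) {
        current.assign(1, ch);
        state = State::Ident;
      } else if (std::isdigit(c)) {
        current.assign(1, ch);
        state = State::Number;
      } else if (!std::isspace(c)) {
        tokens.push_back(std::string(1, ch));  // one-character operator
      }
      ++i;  // the Start state always consumes its character
      break;
    case State::Ident:
      if (std::isalnum(c)) {
        current += ch;
        ++i;
      } else {
        tokens.push_back(current);  // emit, but do not consume ch
        state = State::Start;
      }
      break;
    case State::Number:
      if (std::isdigit(c)) {
        current += ch;
        ++i;
      } else {
        tokens.push_back(current);  // emit, but do not consume ch
        state = State::Start;
      }
      break;
    }
  }
  return tokens;
}
```

For the input `+ x * x 2` this produces the five tokens `+`, `x`, `*`, `x` and `2`. Blanks are only needed to separate adjacent identifiers or numbers: `*x+x 12` also yields five tokens.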
Parsing
• Parsing is the act of transforming a string of tokens into a syntax tree ♤.

Parser.h:

    #ifndef PARSER_H
    #define PARSER_H
    #include <string>
    class Expr;
    class Lexer;
    class Parser {
      public:
        Parser(Lexer* argLexer) : lexer(argLexer) {}
        Expr* parseExpr();
      private:
        Lexer* lexer;
    };
    #endif

1) What are these forward declarations good for?
2) Do you understand this syntax?
3) What does the parser return?
♤: it used to be one of the most important problems in computer science.

Syntax Trees
• The parser produces syntax trees.

    * x x        + x * x 2        * x + x 2

      *              +                *
     / \            / \              / \
    x   x          x   *            x   +
                      / \              / \
                     x   2            x   2

How can we implement these trees in C++?

The Nodes of the Tree

Expr.h:

    #ifndef AST_H
    #define AST_H
    #include "llvm/IR/IRBuilder.h"

    class Expr {
      public:
        virtual ~Expr() {}
        virtual llvm::Value *gen(llvm::IRBuilder<> *builder,
                                 llvm::LLVMContext& con) const = 0;
    };

    class NumExpr : public Expr {
      public:
        NumExpr(int argNum) : num(argNum) {}
        llvm::Value *gen(llvm::IRBuilder<> *builder,
                         llvm::LLVMContext& con) const;
        static const unsigned int SIZE_INT = 32;
      private:
        const int num;
    };

    class VarExpr : public Expr {
      public:
        llvm::Value *gen(llvm::IRBuilder<> *builder,
                         llvm::LLVMContext& con) const;
        static llvm::Value* varValue;
    };

    class AddExpr : public Expr {
      public:
        AddExpr(Expr* op1Arg, Expr* op2Arg) : op1(op1Arg), op2(op2Arg) {}
        llvm::Value *gen(llvm::IRBuilder<> *builder,
                         llvm::LLVMContext& con) const;
      private:
        const Expr* op1;
        const Expr* op2;
    };

    class MulExpr : public Expr {
      public:
        MulExpr(Expr* op1Arg, Expr* op2Arg) : op1(op1Arg), op2(op2Arg) {}
        llvm::Value *gen(llvm::IRBuilder<> *builder,
                         llvm::LLVMContext& con) const;
      private:
        const Expr* op1;
        const Expr* op2;
    };

    #endif

There is a gen method that is a bit weird. We shall look into it later.

Going Back into the Parser
• Our parser will build a syntax tree.

    + x * x 2   ->   Parser   ->   new AddExpr(
                                       new VarExpr(),
                                       new MulExpr(
                                           new VarExpr(),
                                           new NumExpr(2))
                                   )

          +
         / \
        x   *
           / \
          x   2

The Polish notation really simplifies parsing. We already have the tree, and without parentheses!
So, how can we implement our parser?
[Photo: Jan Łukasiewicz, father of the Polish notation]

The Parser's Implementation

Parser.cpp:

    #include <cctype>   // isdigit
    #include <cstdlib>  // atoi
    #include "Expr.h"
    #include "Lexer.h"
    #include "Parser.h"

    Expr* Parser::parseExpr() {
      std::string tk = lexer->getToken();
      if (tk == "") {
        // End of input.
        return NULL;
      } else if (isdigit(tk[0])) {
        return new NumExpr(atoi(tk.c_str()));
      } else if (tk[0] == 'x') {
        return new VarExpr();
      } else if (tk[0] == '+') {
        // An operator is followed by its two operands.
        Expr *op1 = parseExpr();
        Expr *op2 = parseExpr();
        return new AddExpr(op1, op2);
      } else if (tk[0] == '*') {
        Expr *op1 = parseExpr();
        Expr *op2 = parseExpr();
        return new MulExpr(op1, op2);
      } else {
        // Unknown token.
        return NULL;
      }
    }

1) Why is checking the first character of each token already enough to avoid any ambiguity?
2) Now we need a way to translate trees into LLVM IR. How to do it?

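The claim that Polish notation already encodes the tree can be checked without LLVM. The sketch below is an assumption-laden illustration rather than the course's parser: `Node`, `parse` and `toInfix` are hypothetical names, the input is a vector of tokens instead of a `Lexer`, and, unlike `parseExpr()`, it rejects an operator that is missing an operand. It builds the tree recursively and prints it back in fully parenthesized infix form.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type: a leaf holds "x" or a number; an inner node holds
// "+" or "*" and owns its two children.
struct Node {
  std::string label;
  std::unique_ptr<Node> lhs, rhs;
};

// Parses one expression starting at tokens[pos] and advances pos past it.
// Returns nullptr if the input ends early or contains an unknown token.
std::unique_ptr<Node> parse(const std::vector<std::string> &tokens,
                            std::size_t &pos) {
  if (pos >= tokens.size()) return nullptr;
  const std::string tk = tokens[pos++];
  std::unique_ptr<Node> node(new Node());
  node->label = tk;
  if (tk == "+" || tk == "*") {
    // An operator is followed by exactly two complete expressions.
    node->lhs = parse(tokens, pos);
    node->rhs = parse(tokens, pos);
    if (!node->lhs || !node->rhs) return nullptr;
    return node;
  }
  if (tk == "x" || std::isdigit(static_cast<unsigned char>(tk[0])))
    return node;
  return nullptr;
}

// Renders the tree in infix notation, with one pair of parentheses
// around every operator application.
std::string toInfix(const Node &n) {
  if (!n.lhs) return n.label;
  return "(" + toInfix(*n.lhs) + " " + n.label + " " + toInfix(*n.rhs) + ")";
}
```

Parsing the tokens of `+ x * x 2` and calling `toInfix` on the result gives `(x + (x * 2))`, while `* x + x 2` gives `(x * (x + 2))`. The parentheses that infix notation needs are implied by token order in the prefix form.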
The Translator

Expr.cpp:

    #include "Expr.h"

    llvm::Value* VarExpr::varValue = NULL;

    llvm::Value* NumExpr::gen
      (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      return llvm::ConstantInt::get
        (llvm::Type::getInt32Ty(context), num);
    }

    llvm::Value* VarExpr::gen
      (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* var = VarExpr::varValue;
      return var ? var : NULL;
    }

    llvm::Value* AddExpr::gen
      (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* v1 = op1->gen(builder, context);
      llvm::Value* v2 = op2->gen(builder, context);
      return builder->CreateAdd(v1, v2, "addtmp");
    }

    llvm::Value* MulExpr::gen
      (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* v1 = op1->gen(builder, context);
      llvm::Value* v2 = op2->gen(builder, context);
      return builder->CreateMul(v1, v2, "multmp");
    }

Our implementation has a small hack: our language has only one variable, which we have decided to call 'x'. This variable must be represented by an LLVM value, which is the argument of the function that we will create. Thus, we need a way to inform the translator of this value. We do it through a static variable, varValue. That is the only static variable that we are using in this class.

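The shape of the `gen` methods, which translate both operands first and then emit one instruction for the operator, can be tried out without linking against LLVM. The sketch below only mimics the idea by producing text: `Tree`, `leaf`, `node` and `gen` are hypothetical names, and the numeric suffix on each temporary is an assumption made here to keep names unique, not LLVM's actual naming scheme. The real `IRBuilder` may also simplify operations on constants, which this sketch does not attempt.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree type, independent of the Expr classes.
struct Tree {
  std::string label;               // "+", "*", "x", or a decimal integer
  std::unique_ptr<Tree> lhs, rhs;  // both null for leaves
};

// Hypothetical helpers to build trees by hand.
std::unique_ptr<Tree> leaf(const std::string &label) {
  std::unique_ptr<Tree> t(new Tree());
  t->label = label;
  return t;
}

std::unique_ptr<Tree> node(const std::string &op, std::unique_ptr<Tree> l,
                           std::unique_ptr<Tree> r) {
  std::unique_ptr<Tree> t(new Tree());
  t->label = op;
  t->lhs = std::move(l);
  t->rhs = std::move(r);
  return t;
}

// Post-order walk: emit code for both operands, then one instruction for the
// operator. Returns the textual name of the value that holds the result.
std::string gen(const Tree &t, std::vector<std::string> &code, int &counter) {
  if (!t.lhs) return t.label == "x" ? "%x" : t.label;  // variable or constant
  const std::string v1 = gen(*t.lhs, code, counter);
  const std::string v2 = gen(*t.rhs, code, counter);
  const bool isAdd = t.label == "+";
  // Assumed naming scheme: a running counter keeps temporaries distinct.
  const std::string name =
      std::string(isAdd ? "%addtmp" : "%multmp") + std::to_string(counter++);
  code.push_back(name + " = " + (isAdd ? "add" : "mul") + " i32 " + v1 +
                 ", " + v2);
  return name;
}
```

For the tree of `* x + x 2`, built as `node("*", leaf("x"), node("+", leaf("x"), leaf("2")))`, the walk emits `%addtmp0 = add i32 %x, 2` followed by `%multmp1 = mul i32 %x, %addtmp0` and returns `%multmp1`. The instruction for the inner addition comes first because operands are translated before their operator.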
The Driver's Skeleton

Driver.cpp:

    int main(int argc, char** argv) {
      if (argc != 2) {
        llvm::errs() << "Inform an argument to your expression.\n";
        return 1;
      } else {
        llvm::LLVMContext context;
        llvm::Module *module = new llvm::Module("Example", context);
        llvm::Function *function = createEntryFunction(module, context);
        module->dump();
        llvm::ExecutionEngine* engine = createEngine(module);
        JIT(engine, function, atoi(argv[1]));
      }
    }

The procedure that creates an LLVM function is not that complicated. Can you guess its implementation?

Creating an LLVM Function

Driver.cpp:

    llvm::Function *createEntryFunction(
        llvm::Module *module,
        llvm::LLVMContext &context) {
      llvm::Function *function =
        llvm::cast<llvm::Function>(
          module->getOrInsertFunction("fun",
            llvm::Type::getInt32Ty(context),
            llvm::Type::getInt32Ty(context),
            (llvm::Type *)0)
        );
      llvm::BasicBlock *bb = llvm::BasicBlock::Create(context, "entry", function);
      llvm::IRBuilder<> builder(context);
      builder.SetInsertPoint(bb);
      llvm::Argument *argX = function->arg_begin();
      argX->setName("x");
      VarExpr::varValue = argX;
      Lexer lexer;
      Parser parser(&lexer);
      Expr* expr = parser.parseExpr();
      llvm::Value* retVal = expr->gen(&builder, context);
      builder.CreateRet(retVal);
      return function;
    }

This code is not "that" complicated, but it is not super straightforward either, so we will go a bit more carefully over it.
Let's start with this humongous call, getOrInsertFunction. What do you think it is doing?
