Inside PHP Tom Lee @tglee OSCON 2012 19th July, 2012
Overview • About me! • New Relic’s PHP Agent escapee. • Now on New Projects, doing unspeakably un-PHP things. • Wannabe compiler nerd. • Terminology & brief intro to compilers: • Grammars, Scanners & Parsers • General architecture of a bytecode compiler • Hands on: Modifying the PHP language • PHP/Zend compiler architecture & summary • Case study in adding a new keyword
“Zend” vs. “Zend Engine” vs. “PHP” • I will use all of these interchangeably throughout this talk. • Referring to the bytecode compiler in the “Zend Engine 2” in most cases. • The distinction doesn’t really matter here.
Compilers 101: Scanners • Or lexical analyzers, or tokenizers • Input: raw source code • Output: a stream of tokens
Example: while ($x == $y) → T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
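A minimal sketch of what a scanner does, written as a toy in Python rather than as re2c input. The token names mirror the slide; the regular expressions, the `tokenize` helper and the (kind, value) tuple shape are assumptions made for illustration, not PHP's real scanner rules.

```python
import re

# Toy token specification: (token name, regex). The names follow the slide;
# the patterns are assumptions, not PHP's actual scanner definitions.
TOKEN_SPEC = [
    ("T_WHILE", r"while\b"),
    ("T_VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("T_IS_EQUAL", r"=="),
    ("CHAR", r"[(){};]"),  # single-character tokens stand for themselves
    ("SKIP", r"\s+"),      # whitespace produces no token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Turn raw source text into a flat list of (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        if kind == "T_VARIABLE":
            tokens.append((kind, text[1:]))  # drop the leading '$'
        elif kind == "CHAR":
            tokens.append((text, text))
        else:
            tokens.append((kind, text))
    return tokens


tokens = tokenize("while ($x == $y)")
for token in tokens:
    print(token)
```

Note that the result is a flat list: nothing here knows that the parentheses belong to the while. Imposing that structure is the parser's job.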
Compilers 101: Parsers • Input: a stream of tokens from the scanner • Output is implementation dependent • Often an intermediate, in-memory representation of the program in tree form. • e.g. Parse Tree or Abstract Syntax Tree • Or directly generate bytecode. • Goal of a parser is to structure the token stream. • Parsers are frequently generated from a DSL • See parser generators like Yacc/Bison, ANTLR, etc. or e.g. parser combinators in Haskell, Scala, ML.
Example token stream: T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
Example bytecode output:
0: ZEND_IS_EQUAL ~0 !0 !1
1: ZEND_JMPZ ~0 ->3
2: …
3: …
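A rough sketch of a parser that skips the tree and emits bytecode directly, as a hand-written toy in Python. The `parse_while_header` function, the tuple-based opline shape and the `None` placeholder for the not-yet-known jump target are all assumptions for illustration; the real PHP parser is generated by Bison.

```python
def parse_while_header(tokens):
    """Parse T_WHILE '(' T_VARIABLE T_IS_EQUAL T_VARIABLE ')' and emit toy oplines."""
    pos = 0
    compiled_vars = {}  # variable name -> compiled-variable slot (!0, !1, ...)

    def expect(kind):
        # Consume one token of the given kind or fail with a syntax error.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos][0] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        value = tokens[pos][1]
        pos += 1
        return value

    def variable():
        # Each distinct variable name gets the next free slot.
        name = expect("T_VARIABLE")
        slot = compiled_vars.setdefault(name, len(compiled_vars))
        return f"!{slot}"

    expect("T_WHILE")
    expect("(")
    left = variable()
    expect("T_IS_EQUAL")
    right = variable()
    expect(")")

    return [
        ("ZEND_IS_EQUAL", "~0", left, right),
        ("ZEND_JMPZ", "~0", None),  # target is unknown until the loop body is compiled
    ]


tokens = [
    ("T_WHILE", "while"),
    ("(", "("),
    ("T_VARIABLE", "x"),
    ("T_IS_EQUAL", "=="),
    ("T_VARIABLE", "y"),
    (")", ")"),
]
oplines = parse_while_header(tokens)
for number, opline in enumerate(oplines):
    print(number, opline)
```

The same token list in a different order (say, a missing closing parenthesis) is rejected with a SyntaxError, which is the "structure" the scanner alone cannot check.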
Compilers 101: Context-free grammars • Or simply “grammar” • A grammar describes the complete syntax of a (programming) language. • Usually expressed in Extended Backus-Naur Form (EBNF) • Or some variant thereof. • Variants of EBNF used for a lot of DSL-based parser generators • e.g. Yacc/Bison, ANTLR, etc.
Generalized Compiler Architecture*
Source files → (source code) → Scanner → (token stream) → Parser → (abstract syntax tree) → Code Generator → (bytecode) → Interpreter
* Actually a generalized *bytecode* compiler architecture
Generalized *PHP* Compiler Architecture
Source files → (source code) → Scanner [Zend/zend_language_scanner.l] → (token stream) → Parser [Zend/zend_language_parser.y] → (abstract syntax tree) → Code Generator [Zend/zend_compile.c] → (bytecode) → Interpreter [Zend/zend_execute.c]
Note on the abstract syntax tree step: PHP compiles directly to bytecode!
Case Study: The “until” statement • It’s basically while (!...) ...
    <?php
    $x = 5;
    until ($x == 0) {
        $x--;
        echo "Oh hi, Mark [$x]\n";
    }
-- output --
Oh hi, Mark [4]
Oh hi, Mark [3]
Oh hi, Mark [2]
Oh hi, Mark [1]
Oh hi, Mark [0]
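To make "basically while (!...)" concrete, here is a small Python sketch of the intended semantics. Python has no until keyword, so the hypothetical `run_until` helper stands in for it; the loop inside is just a while with the condition negated.

```python
def run_until(condition, body):
    """Run body repeatedly until condition() is true, i.e. while (!condition) body."""
    while not condition():
        body()


state = {"x": 5}
lines = []


def body():
    # Mirrors the slide's loop body: decrement, then report the new value.
    state["x"] -= 1
    lines.append(f"Oh hi, Mark [{state['x']}]")


run_until(lambda: state["x"] == 0, body)
print("\n".join(lines))
```

The condition is checked before every iteration, so the body runs five times and the last line printed is the one for 0, matching the output on the slide.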
How to add “until” to the PHP language 1. Tell the scanner how to tokenize new keyword(s) 2. Describe the syntax of the new construct 3. Emit bytecode
Before you start... • You’ll need the usual gcc toolchain, GNU Bison, etc. • Debian/Ubuntu: apt-get install build-essential • OSX: Xcode command line tools should give you most of what you need. • Also ensure that you have re2c • Debian/Ubuntu: apt-get install re2c • OSX (Homebrew): brew install re2c • Used to generate the scanner • If re2c is not found, the configure script silently carries on without it! • And, of course, source code for some recent version of PHP 5. • I’m working with PHP 5.4.4
1. Tell the scanner how to tokenize “until” • Zend/zend_language_scanner.l • Input for re2c, which will generate the Zend language scanner. • Describes how raw source code should be converted into tokens. • Note that no structure is implied here: that’s the parser’s job. • Tell the scanner that the word “until” is special. • The parser also needs to know about new tokens! • How is this done for the while keyword?
Example: until ($x == $y) → T_UNTIL '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
2. Describe the syntax of “until” • Zend/zend_language_parser.y • Essentially serves as the grammar for the Zend language. • Also describes actions to perform during parsing. • Input for the parser generator (Bison) used to generate the PHP parser. • Tell PHP how until statements are structured syntactically. • How was it done for a while statement?
Syntax: T_UNTIL '(' expr ')' statement
3. Emit bytecode • Add actions to Zend/zend_language_parser.y • What should they do? • Recall that PHP generates bytecode during the parsing process. • Generate bytecode describing the semantics of until in terms of the PHP VM. • Er, wait -- what bytecode do we need to generate?
Diagram: Compiler → Bytecode
Intermission: PHP bytecode intro • opline: <opcode> <result?> <op1?> <op2?> • Data structure representing a single line of PHP VM “assembly” • Includes opcode + operands • opline # associated with each opline • Different variable types, differentiated by prefix: • Variables ( $ ) • Compiled variables ( ! ) • Temporary variables ( ~ ) • ZEND_JMP • “goto” • Conditional variants: ZEND_JMPZ, ZEND_JMPNZ • opline #s used as address operand for JMP instructions (->)
ZEND_JMP <op1>: unconditional jump to the opline # in op1. e.g. jump to opline #10: ZEND_JMP ->10
ZEND_JMPZ <op1> <op2>: conditional jump to the opline # in op2 if op1 is zero. e.g. jump to opline #3 if ~0 is zero: ZEND_JMPZ ~0 ->3
ZEND_IS_EQUAL <result> <op1> <op2>: result=1 if op1 == op2, otherwise result=0. e.g. set ~0=1 if !0 == 10: ZEND_IS_EQUAL ~0 !0 10
Unconditional jump: ZEND_JMP
0: ...
1: ...
2: ZEND_JMP ->0
Conditional jump: ZEND_JMPZ / ZEND_JMPNZ
0: ...
1: ...
2: ZEND_JMPZ ~0 ->0
3: ...
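To see how the jump oplines drive control flow, here is a toy interpreter in Python. It is a sketch under assumptions, not how the Zend VM is implemented: oplines are plain tuples, jump targets are integers, and `TOY_DEC` is a made-up opcode that exists only so the loop has something to do.

```python
def run(oplines, compiled_vars, max_steps=1000):
    """Interpret toy oplines; return how many oplines were executed."""
    temps = {}  # temporary variables such as "~0"

    def read(operand):
        if isinstance(operand, int):
            return operand                 # literal value
        if operand.startswith("!"):
            return compiled_vars[operand]  # compiled variable
        return temps[operand]              # temporary variable

    pc = 0  # opline number of the next instruction
    steps = 0
    while pc < len(oplines):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded (infinite loop?)")
        opcode, *operands = oplines[pc]
        if opcode == "ZEND_JMP":
            pc = operands[0]
        elif opcode == "ZEND_JMPZ":
            value, target = operands
            pc = target if read(value) == 0 else pc + 1
        elif opcode == "ZEND_JMPNZ":
            value, target = operands
            pc = target if read(value) != 0 else pc + 1
        elif opcode == "ZEND_IS_EQUAL":
            result, op1, op2 = operands
            temps[result] = 1 if read(op1) == read(op2) else 0
            pc += 1
        elif opcode == "TOY_DEC":  # hypothetical opcode for this sketch only
            compiled_vars[operands[0]] -= 1
            pc += 1
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return steps


program = [
    ("TOY_DEC", "!0"),                 # 0: !0 = !0 - 1
    ("ZEND_IS_EQUAL", "~0", "!0", 0),  # 1: ~0 = 1 if !0 == 0, otherwise 0
    ("ZEND_JMPZ", "~0", 0),            # 2: jump back to opline 0 while ~0 is zero
]
variables = {"!0": 3}
executed = run(program, variables)
print(variables, executed)
```

Opline 2 has the same shape as the conditional jump on the slide: as long as the comparison result in ~0 is zero, control goes back to opline 0, otherwise execution falls through to the next opline.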
3. Emit bytecode (cont.) • Zend/zend_compile.c • The Zend language’s code generation logic lives here. • No DSLs here: plain old C source code. • First, let’s try to understand the bytecode for while • How do we need to modify it for until?
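One plausible answer, sketched in Python as a toy code generator. This is an assumption based on "until is basically while (!...)", not a copy of zend_compile.c: a while loop leaves when its condition is zero (ZEND_JMPZ), so an until loop can leave when its condition is non-zero (ZEND_JMPNZ). The `emit_loop` helper, the tuple oplines and the `TOY_DEC` opcode are hypothetical, and the sketch assumes the loop starts at opline 0.

```python
def emit_loop(keyword, cond_oplines, body_oplines, cond_result="~0"):
    """Lay out a loop as: condition, exit jump, body, jump back to the condition."""
    exit_jump = {"while": "ZEND_JMPZ", "until": "ZEND_JMPNZ"}[keyword]

    oplines = [list(op) for op in cond_oplines]
    jump_index = len(oplines)
    oplines.append([exit_jump, cond_result, None])  # exit target is patched below
    oplines.extend(list(op) for op in body_oplines)
    oplines.append(["ZEND_JMP", 0])                 # re-evaluate the condition at opline 0

    # Now that the body has been emitted, the first opline after the loop is known.
    oplines[jump_index][2] = len(oplines)
    return [tuple(op) for op in oplines]


condition = [("ZEND_IS_EQUAL", "~0", "!0", 0)]  # ~0 = 1 if !0 == 0
body = [("TOY_DEC", "!0")]                      # hypothetical opcode: !0 = !0 - 1

while_code = emit_loop("while", condition, body)
until_code = emit_loop("until", condition, body)
for number, opline in enumerate(until_code):
    print(number, opline)
```

The two listings differ only in the opcode of the exit jump. Emitting the jump with a placeholder target and filling it in once the body is known is a common technique (often called backpatching) when a compiler generates code in a single pass.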
Demo! • Time to build! • The usual ./configure && make dance on Linux & OSX. • To be thorough, regenerate data used by the tokenizer extension. (cd ext/tokenizer && ./tokenizer_data_gen.sh) • http://php.net/manual/en/book.tokenizer.php • You’ll need to run make again once you’ve done this. • With a little luck, magic happens and you get a binary in sapi/cli/php • Take until out for a spin!
And exhale. • Lots to take in, right? • In my experience, this stuff is best learned bit-by-bit through practice. • Ask questions! • Google • php-internals • Or hey, ask me...
Thanks! oscon@tomlee.co @tglee http://newrelic.com ... and come see Inside Python @ 5pm in D135 :)