Motivation • Meta-tracing • PyPy >> CPython • PyPy << C

Python-based interpreter

Application: FooLang
    ...
    b += a
    ...

  compile ->

Application: Bytecode
    ...
    21 LOAD_FAST    1 (b)
    24 LOAD_FAST    0 (a)
    27 INPLACE_ADD
    28 STORE_FAST   1 (b)
    ...

  interpret ->

Interpreter: Python
    while True:
        bc = bcs[bci]
        bci += bc.length
        if bc.type == INPLACE_ADD:
            v1 = stack.pop()
            v2 = stack.pop()
            if (type(v1) == int and type(v2) == int):
                stack.push(v1 + v2)
            elif ...
        elif bc.type == LOAD_FAST:
            stack.push(local[bc.varnum])
        ...

  compile ->

Interpreter: Bytecode

  interpret ->

Interpreter Interpreter

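The dispatch loop on this slide is pseudocode. A minimal runnable sketch of the same idea follows, under several assumptions that are not in the slides: the `Bytecode` record, the fixed instruction length of 1, the added `STORE_FAST` branch, and the use of `operator.iadd` as the generic fallback are all illustrative choices.

```python
import operator
from collections import namedtuple

# Hypothetical bytecode record for this sketch: an opcode name and an operand.
Bytecode = namedtuple("Bytecode", ["type", "varnum"])

LOAD_FAST = "LOAD_FAST"
STORE_FAST = "STORE_FAST"
INPLACE_ADD = "INPLACE_ADD"


def interpret(bcs, local):
    """Execute the bytecodes in `bcs`, reading and writing the list `local`."""
    stack = []
    bci = 0
    while bci < len(bcs):
        bc = bcs[bci]
        bci += 1  # assumption: every bytecode has length 1 in this sketch
        if bc.type == INPLACE_ADD:
            right = stack.pop()
            left = stack.pop()
            if type(left) == int and type(right) == int:
                stack.append(left + right)  # fast path for two ints
            else:
                stack.append(operator.iadd(left, right))  # generic fallback
        elif bc.type == LOAD_FAST:
            stack.append(local[bc.varnum])
        elif bc.type == STORE_FAST:
            local[bc.varnum] = stack.pop()
        else:
            raise ValueError("unknown bytecode: %r" % (bc.type,))
    return local


# b += a, with a in local slot 0 and b in local slot 1, as on the slide.
PROGRAM = [
    Bytecode(LOAD_FAST, 1),
    Bytecode(LOAD_FAST, 0),
    Bytecode(INPLACE_ADD, None),
    Bytecode(STORE_FAST, 1),
]

print(interpret(PROGRAM, [3, 4]))  # prints [3, 7]
```

Every executed bytecode pays for the fetch, the dispatch chain, and the type checks, which is the overhead the rest of the talk sets out to remove.
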
RPython Framework

▪ Application: Python -> compile -> Application: Bytecode
▪ Interpreter: RPython and Framework: RPython -> translate -> Interpreter + Framework: C
▪ Interpreter + Framework: C -> compile -> PyPy: Binary
▪ PyPy: Binary -> interpret -> Application: Bytecode
▪ trace and optimize -> Meta-trace: JIT IR
▪ Meta-trace: JIT IR -> assemble -> JIT-ed code: Binary

Meta-trace

Application bytecode
    ...
    21 LOAD_FAST    1 (b)
    24 LOAD_FAST    0 (a)
    27 INPLACE_ADD
    28 STORE_FAST   1 (b)
    ...

Interpreter (run by the meta-interpreter)
    while True:
        bc = bcs[bci]
        bci += bc.length
        if bc.type == INPLACE_ADD:
            v1 = stack.pop()
            v2 = stack.pop()
            if (type(v1) == int and type(v2) == int):
                stack.push(v1 + v2)
            elif ...
        elif bc.type == LOAD_FAST:
            stack.push(local[bc.varnum])
        ...

Meta-trace (recorded by the meta-interpreter)
    ...
    p1 = getarrayitem(p0, 1)
    p2 = getarrayitem(p0, 0)
    guard_class(p1, int)
    guard_class(p2, int)
    i3 = getfield(p1, intval)
    i4 = getfield(p2, intval)
    i5 = int_add_ovf(i3, i4)
    guard_no_overflow()
    ...

Deoptimization back to interpreter on guard failure

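To make the guard and deoptimization behaviour of this trace concrete, here is a small sketch that replays the eight trace operations from the slide and falls back to a generic path when a guard fails. It is not RPython's implementation: the tuple encoding of the trace, the `W_Int` and `W_Str` box classes, the 64-bit overflow bounds, and the helpers `run_trace`, `generic_add`, and `add_locals` are all assumptions made for illustration.

```python
class GuardFailed(Exception):
    """Raised when an assumption recorded in the trace no longer holds."""


class W_Int:
    """Hypothetical boxed integer with an `intval` field, as in the trace."""
    def __init__(self, intval):
        self.intval = intval


class W_Str:
    """Hypothetical boxed string, used only to make a guard fail."""
    def __init__(self, strval):
        self.strval = strval


MAXINT = 2**63 - 1   # assumption: machine integers are 64-bit
MININT = -2**63

# The trace from the slide, encoded as (result, operation, arguments).
TRACE = [
    ("p1", "getarrayitem", ("p0", 1)),
    ("p2", "getarrayitem", ("p0", 0)),
    (None, "guard_class", ("p1", W_Int)),
    (None, "guard_class", ("p2", W_Int)),
    ("i3", "getfield", ("p1", "intval")),
    ("i4", "getfield", ("p2", "intval")),
    ("i5", "int_add_ovf", ("i3", "i4")),
    (None, "guard_no_overflow", ()),
]


def run_trace(trace, local):
    """Replay the trace; p0 is the array of local variables."""
    env = {"p0": local}
    overflow = False
    for result, op, args in trace:
        if op == "getarrayitem":
            env[result] = env[args[0]][args[1]]
        elif op == "guard_class":
            if type(env[args[0]]) is not args[1]:
                raise GuardFailed(op)
        elif op == "getfield":
            env[result] = getattr(env[args[0]], args[1])
        elif op == "int_add_ovf":
            env[result] = env[args[0]] + env[args[1]]
            overflow = not (MININT <= env[result] <= MAXINT)
        elif op == "guard_no_overflow":
            if overflow:
                raise GuardFailed(op)
        else:
            raise ValueError(op)
    return env["i5"]


def generic_add(left, right):
    """Stand-in for the interpreter's slower, fully general addition."""
    if isinstance(left, W_Int) and isinstance(right, W_Int):
        return left.intval + right.intval
    if isinstance(left, W_Str) and isinstance(right, W_Str):
        return left.strval + right.strval
    raise TypeError("unsupported operand types")


def add_locals(local):
    """Compute b + a (b in slot 1, a in slot 0), preferring the trace."""
    try:
        return ("trace", run_trace(TRACE, local))
    except GuardFailed:
        # Deoptimization: resume in the interpreter's generic path.
        return ("interpreter", generic_add(local[1], local[0]))


print(add_locals([W_Int(3), W_Int(4)]))        # prints ('trace', 7)
print(add_locals([W_Str("a"), W_Str("b")]))    # prints ('interpreter', 'ba')
```

The trace contains no bytecode fetch or dispatch at all. Only the guards remain as a record of the decisions the interpreter made while the trace was being recorded.
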
Cross-layer annotations

Annotations are added at each source layer:
▪ application annotations (Application: Python)
▪ interpreter annotations (Interpreter: RPython)
▪ framework annotations (Framework: RPython)

Pipeline:
▪ Application: Python -> compile -> Application: Bytecode
▪ Interpreter: RPython and Framework: RPython -> translate -> Interpreter + Framework: C
▪ Interpreter + Framework: C -> compile -> PyPy: Binary
▪ PyPy: Binary -> interpret -> Application: Bytecode
▪ trace and optimize -> Meta-trace: JIT IR (IR node of interest)
▪ Meta-trace: JIT IR -> assemble -> JIT-ed code: Binary (asm of interest)

Measurement:
▪ perf counters using PAPI
▪ Dynamic Binary Instrumentation: phase counters, IR node counters

Cross-layer workload characterization of meta-tracing JIT VMs

PyPy >> CPython
▪ How can meta-tracing JITs significantly improve the performance of multiple dynamic languages?

PyPy << C
▪ Why are meta-tracing JITs for dynamic languages still slower than C?

Meta-tracing JIT improves the performance significantly

PyPy with meta-tracing JIT speedup over CPython:
[Figure: bar chart of speedup per benchmark, y-axis 0 to 30; two bars exceed the axis and are labeled 51.2 and 30.2. Benchmarks in plotted order: richards, crypto_pyaes, chaos, telco, spectral-norm, django, twisted_iteration, spitfire_cstringio, raytrace-simple, hexiom2, float, ai, nbody_modified, twisted_pb, fannkuch, genshi_text, pyflate-fast, bm_mako, twisted_names, json_bench, genshi_xml, bm_chameleon, pypy_interp, twisted_tcp, html5lib, meteor-contest, sympy_sum, spitfire, spambayes, rietveld, deltablue, eparse, sympy_expand, slowspitfire, sympy_integrate, pidigits, bm_mdp, sympy_str]

PyPy speedup over CPython and Pycket speedup over Racket: Meta-tracing JIT improves performance significantly across multiple languages

[Figure, left panel: Pycket speedup, y-axis 0 to 2; benchmarks: binarytrees, fannkuchredux, fasta, mandelbrot, meteor, nbody, pidigits, revcomp, spectralnorm]
[Figure, right panel: PyPy speedup, y-axis 0 to 12; benchmarks: binarytrees, chameneosredux, fannkuchredux, fasta, knucleotide, mandelbrot, meteor, nbody, pidigits, regexdna, revcomp, spectralnorm, threadring]

Meta-tracing JIT VM phases

[Figure: phase timelines for richards and sympy_str; x-axis: instructions, 0 to 10B; phases: calls to AOT funs, JIT, GC, deoptimization, tracing & opt, interpreter]

Meta-tracing JIT VM phases

[Figure: phase breakdown per benchmark, from fastest on PyPy to slowest on PyPy; phases: JIT calls, JIT, GC, deopt, tracing, interp]

The JIT phase: The fastest benchmarks tend to execute JIT-compiled code the most

[Figure: JIT + JIT call to AOT per benchmark, y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy]

Meta-tracing inlines all loops and can hurt performance

Interpreter (run by the meta-interpreter)
    while True:
        ...
        memcpy(d, s, n)
        ...

    def memcpy(dest, src, n):
        i = 0
        while i < n:
            dest[i] = src[i]
            i += 1

Meta-trace
    ...
    guard_gt(i0, 0)
    i3 = getarrayitem(p1, 0)
    setarrayitem(p2, 0, i3)
    guard_gt(i0, 1)
    i4 = getarrayitem(p1, 1)
    setarrayitem(p2, 1, i4)
    guard_gt(i0, 2)
    i5 = getarrayitem(p1, 2)
    setarrayitem(p2, 2, i5)
    ...

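The trace growth shown on this slide can be reproduced with a toy recorder. The sketch below is an illustration only: `trace_memcpy_inlined` emits the three operations per iteration that appear on the slide, while `trace_memcpy_as_call` and its `call(...)` operation name are hypothetical, standing in for a tracer that leaves `memcpy` as a call into AOT-compiled code.

```python
def trace_memcpy_inlined(n):
    """Operations recorded when the tracer follows memcpy's loop for n items."""
    trace = []
    for i in range(n):
        value = "i%d" % (i + 3)  # i3, i4, i5, ... as numbered on the slide
        trace.append("guard_gt(i0, %d)" % i)
        trace.append("%s = getarrayitem(p1, %d)" % (value, i))
        trace.append("setarrayitem(p2, %d, %s)" % (i, value))
    return trace


def trace_memcpy_as_call():
    """Operations recorded when memcpy is left as a single opaque call.

    The operation name below is hypothetical.
    """
    return ["call(memcpy, p2, p1, i0)"]


for op in trace_memcpy_inlined(3):
    print(op)

# Trace length grows linearly with n when the loop is inlined.
print(len(trace_memcpy_inlined(1000)), len(trace_memcpy_as_call()))  # prints 3000 1
```

An inlined trace is also specialized to the value of `n` seen during recording, so a different length fails a guard, whereas the single call works for any `n`.
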
Examples of significant AOT-compiled functions

Benchmark       %     Source         Function
ai              19.4  interpreter    setobject.get_storage_from_list
bm_chameleon    17.9  RPython types  rordereddict.ll_call_lookup_function
bm_mako         26.1  RPython lib    runicode.unicode_encode_ucs1_helper
json_bench      18.5  PyPy module    _pypyjson.raw_encode_basestring_ascii
nbody_modified  44.6  external lib   pow

JIT calls to AOT-compiled functions: AOT-compiled functions can improve performance by avoiding long traces

[Figure: JIT and JIT call to AOT functions per benchmark, y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy]

PyPy bytecode execution rate compared to CPython: Benchmarks that perform the best also warm up the fastest

Breakeven point: the performance of the two VMs at this point is equal

[Figure: three panels, richards (y-axis ticks 10, 30, 50), html5lib (y-axis ticks 1, 2, 3), and sympy_str (y-axis ticks 1, 2); x-axis: instructions, 0 to 10B; markers: PyPy w/o JIT breakeven point, CPython breakeven point]

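One way to locate a breakeven point from sampled execution rates is sketched below. The slide only states that the two VMs perform equally at that point, so this sketch rests on an assumption of its own: that "equal" means the cumulative number of bytecodes executed by the JIT-enabled VM has caught up with the baseline VM's. The function name and the sample numbers are hypothetical.

```python
def breakeven_index(jit_rates, baseline_rates):
    """Return the first interval at which cumulative JIT-VM work catches up.

    Each list holds bytecodes executed per fixed-size interval of
    instructions. Returns None if the JIT VM never catches up in the window.
    """
    jit_total = 0
    baseline_total = 0
    for index, (jit, base) in enumerate(zip(jit_rates, baseline_rates)):
        jit_total += jit
        baseline_total += base
        if jit_total >= baseline_total:
            return index
    return None


# Hypothetical samples: the JIT VM starts slower while warming up, then speeds up.
print(breakeven_index([1, 1, 5, 5], [2, 2, 2, 2]))  # prints 2
# A benchmark that never warms up within the window has no breakeven point.
print(breakeven_index([1, 1, 1], [2, 2, 2]))        # prints None
```
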
Cross-layer workload characterization of meta-tracing JIT VMs

PyPy >> CPython
▪ How can meta-tracing JITs significantly improve the performance of multiple dynamic languages?
▪ Meta-tracing JIT compilation significantly improves performance
▪ AOT-compiled functions are good at breaking up pathological traces
▪ Easier-to-JIT programs perform the best and warm up the fastest

PyPy << C
▪ Why are meta-tracing JITs for dynamic languages still slower than C?

PyPy and Pycket slowdown over C/C++: Meta-tracing JITs still have a big performance gap relative to static languages

[Figure, left panel: Pycket slowdown, y-axis 0 to 12; benchmarks: binarytrees, fannkuchredux, fasta, mandelbrot, meteor, nbody, pidigits, revcomp, spectralnorm]
[Figure, right panel: PyPy slowdown, y-axis 0 to 30; two bars exceed the axis and are labeled 1374 and 31; benchmarks: binarytrees, chameneosredux, fannkuchredux, fasta, knucleotide, mandelbrot, meteor, nbody, pidigits, regexdna, revcomp, spectralnorm, threadring]

Meta-tracing JIT phases

[Figure: JIT and JIT call to AOT functions per benchmark, y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy]

Meta-tracing JIT IR node breakdown: A large part of JIT-compiled code is likely overhead

[Figure: IR node breakdown per benchmark, from fastest on PyPy to slowest on PyPy]

Interpreter phase

[Figure: Interpreter per benchmark, y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy]

RPython-to-C translation has overheads

PyPy without meta-tracing JIT speedup over CPython:
[Figure: bar chart of speedup per benchmark, y-axis 0 to 1.2; x-axis: benchmarks, richards through sympy_str]

Tracing and optimization phase

[Figure: Tracing & optimization per benchmark, y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy]

Deoptimization phase

[Figure: Deoptimization per benchmark, y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy]

Garbage collection phase

[Figure: Garbage collection per benchmark, y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy]

Meta-tracing JIT VM overheads: Overheads are diverse and can add up to a significant portion of execution

[Figure: Interpreter, Tracing & optimization, Deoptimization, and Garbage collection per benchmark, y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy]

Iron law of processor performance: Does meta-tracing VM code execute poorly in addition to executing more instructions?

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

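The iron law separates the two ways a VM can be slower than C: executing more instructions, or executing each instruction less efficiently. A short worked example follows, with instruction counts, IPC, and clock rate that are purely hypothetical and not taken from the measurements in this talk.

```python
def execution_time(instructions, ipc, clock_hz):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    cycles_per_instruction = 1.0 / ipc
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cycles_per_instruction * seconds_per_cycle


# Hypothetical workload: the VM runs 5x the instructions of the C version
# at the same IPC and clock rate.
c_time = execution_time(2e9, ipc=1.25, clock_hz=2e9)
vm_time = execution_time(10e9, ipc=1.25, clock_hz=2e9)

# With IPC and clock held equal, the slowdown equals the instruction ratio.
print("slowdown: %.2fx" % (vm_time / c_time))
```

If the measured IPC of the two systems is similar, as the next slides examine, the gap has to come from the instruction-count term.
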
Comparing meta-tracing JIT IPC to C/C++: Meta-tracing has a similar IPC for most benchmarks

[Figure: IPC per benchmark, y-axis 0 to 2.25; series: C/C++ IPC, PyPy IPC, Pycket IPC; x-axis: benchmarks]

IPC measurements can be accurately matched against VM phases

[Figure: IPC over time aligned with VM phases: JIT, GC, deopt, trace, interp]

Microarchitectural characterization by the VM phase: Meta-tracing-JIT-compiled code has a similar IPC, fewer branches and mispredictions

[Figure, left panel: IPC, y-axis 0 to 1.8; categories: Interp, Trace, Deopt, GC, JIT, C/C++]
[Figure, right panel: Branch per instruction, y-axis 0 to 0.2; categories: Interp, Trace, Deopt, GC, JIT, C/C++]