
Cross-Layer Workload Characterization of Meta-Tracing JIT VMs

Berkin Ilbeyi (1), Carl Friedrich Bolz-Tereick (2), and Christopher Batten (1)
(1) Cornell University, (2) Heinrich-Heine-Universität Düsseldorf

Dynamic languages are popular S. Cass. The


  1. Python-based interpreter
      Application (FooLang):
          ... b += a ...
      compile -> Application: Bytecode
          ...
          21 LOAD_FAST 1 (b)
          24 LOAD_FAST 0 (a)
          27 INPLACE_ADD
          28 STORE_FAST 1 (b)
          ...
      interpret <- Interpreter: Python
          while True:
              bc = bcs[bci]
              bci += bc.length
              if bc.type == INPLACE_ADD:
                  v1 = stack.pop()
                  v2 = stack.pop()
                  if (type(v1) == int and type(v2) == int):
                      stack.push(v1 + v2)
                  elif ...
              elif bc.type == LOAD_FAST:
                  stack.push(local[bc.varnum])
              ...
      Interpreter: Python -> compile -> Interpreter: Bytecode -> interpret
      Motivation • Meta-tracing • PyPy >> CPython • PyPy << C
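The dispatch loop on this slide can be made runnable with a little scaffolding. The following is a minimal sketch, not the talk's code: the `Bytecode` record, the `STORE_FAST` and `RETURN` handlers, and the offset-keyed `bcs` dictionary are assumptions added so the four bytecodes for `b += a` execute end to end. A Python list stands in for the slide's `stack`, so `push` becomes `append`.

```python
from collections import namedtuple

# Hypothetical bytecode record; `length` mirrors the slide's `bc.length`.
Bytecode = namedtuple("Bytecode", ["type", "varnum", "length"])

LOAD_FAST, STORE_FAST, INPLACE_ADD, RETURN = range(4)


def interpret(bcs, local, bci):
    """Run bytecodes from offset `bci` until RETURN; `local` holds variables."""
    stack = []
    while True:
        bc = bcs[bci]
        bci += bc.length
        if bc.type == INPLACE_ADD:
            rhs = stack.pop()
            lhs = stack.pop()
            if type(lhs) == int and type(rhs) == int:
                stack.append(lhs + rhs)  # fast path for two ints
            else:
                stack.append(lhs + rhs)  # stand-in for the slide's elided `elif ...`
        elif bc.type == LOAD_FAST:
            stack.append(local[bc.varnum])
        elif bc.type == STORE_FAST:      # assumption: not shown on the slide
            local[bc.varnum] = stack.pop()
        elif bc.type == RETURN:          # assumption: added so the loop terminates
            return local


# Bytecode for `b += a`, keyed by the offsets shown on the slide.
bcs = {
    21: Bytecode(LOAD_FAST, 1, 3),
    24: Bytecode(LOAD_FAST, 0, 3),
    27: Bytecode(INPLACE_ADD, None, 1),
    28: Bytecode(STORE_FAST, 1, 3),
    31: Bytecode(RETURN, None, 1),
}

result = interpret(bcs, [2, 5], 21)  # local[0] is a, local[1] is b
```

Every application-level `+` pays for the dispatch, the stack traffic, and the type checks, which is the overhead the rest of the talk measures.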

  3. RPython Framework
      Application: Python -> compile -> Application: Bytecode
      Interpreter: RPython + Framework: RPython -> translate -> Interpreter + Framework: C -> compile -> PyPy: Binary
      PyPy: Binary -> interpret -> Application: Bytecode
      trace and optimize -> Meta-trace: JIT IR -> assemble -> JIT-ed code: Binary

  9. Meta-trace
      Application bytecode:
          ...
          21 LOAD_FAST 1 (b)
          24 LOAD_FAST 0 (a)
          27 INPLACE_ADD
          28 STORE_FAST 1 (b)
          ...
      Interpreter (executed by the meta-interpreter):
          while True:
              bc = bcs[bci]
              bci += bc.length
              if bc.type == INPLACE_ADD:
                  v1 = stack.pop()
                  v2 = stack.pop()
                  if (type(v1) == int and type(v2) == int):
                      stack.push(v1 + v2)
                  elif ...
              elif bc.type == LOAD_FAST:
                  stack.push(local[bc.varnum])
              ...
      Meta-trace:
          ...
          p1 = getarrayitem(p0, 1)
          p2 = getarrayitem(p0, 0)
          guard_class(p1, int)
          guard_class(p2, int)
          i3 = getfield(p1, intval)
          i4 = getfield(p2, intval)
          i5 = int_add_ovf(i3, i4)
          guard_no_overflow()
          ...
      Deoptimization back to interpreter on guard failure
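To make the guard and deoptimization behaviour concrete, here is a small Python model of the trace above. It is a sketch under stated assumptions: `W_Int`, `W_Str`, `GuardFailed`, `run_trace`, `interpret_generic`, and `execute` are hypothetical names, the final store back into the frame is not shown on the slide, and deoptimization is simplified to restarting the operation in a generic interpreter path, which is only safe here because the trace writes nothing until all guards have passed.

```python
WORD_MIN, WORD_MAX = -(2 ** 63), 2 ** 63 - 1  # assumed 64-bit machine word


class GuardFailed(Exception):
    """Raised when a guard in the trace does not hold."""


class W_Int:  # hypothetical boxed integer with an `intval` field
    def __init__(self, intval):
        self.intval = intval


class W_Str:  # hypothetical boxed string
    def __init__(self, strval):
        self.strval = strval


def run_trace(p0):
    """Straight-line model of the meta-trace for `b += a`."""
    p1 = p0[1]                                   # p1 = getarrayitem(p0, 1)
    p2 = p0[0]                                   # p2 = getarrayitem(p0, 0)
    if type(p1) is not W_Int:                    # guard_class(p1, int)
        raise GuardFailed("guard_class(p1, int)")
    if type(p2) is not W_Int:                    # guard_class(p2, int)
        raise GuardFailed("guard_class(p2, int)")
    i3 = p1.intval                               # i3 = getfield(p1, intval)
    i4 = p2.intval                               # i4 = getfield(p2, intval)
    i5 = i3 + i4                                 # i5 = int_add_ovf(i3, i4)
    if not WORD_MIN <= i5 <= WORD_MAX:           # guard_no_overflow()
        raise GuardFailed("guard_no_overflow()")
    p0[1] = W_Int(i5)                            # assumption: store elided on the slide


def interpret_generic(p0):
    """Slow path standing in for the interpreter after deoptimization."""
    b, a = p0[1], p0[0]
    if isinstance(b, W_Int) and isinstance(a, W_Int):
        p0[1] = W_Int(b.intval + a.intval)       # arbitrary-precision add
    elif isinstance(b, W_Str) and isinstance(a, W_Str):
        p0[1] = W_Str(b.strval + a.strval)
    else:
        raise TypeError("unsupported operand types")


def execute(p0):
    """Try the trace first; fall back to the interpreter on guard failure."""
    try:
        run_trace(p0)
        return "jit"
    except GuardFailed:
        interpret_generic(p0)
        return "interp"
```

The trace contains no dispatch loop and no stack operations: only the loads, the type guards, and the add remain. A real meta-tracing VM resumes the interpreter at the failing guard rather than restarting the operation.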

  22. Cross-layer annotations
      Annotation layers:
          application annotations: Application: Python
          interpreter annotations: Interpreter: RPython
          framework annotations: Framework: RPython
      Application: Python -> compile -> Application: Bytecode
      Interpreter: RPython + Framework: RPython -> translate -> Interpreter + Framework: C -> compile -> PyPy: Binary
      PyPy: Binary -> interpret -> Application: Bytecode
      trace and optimize -> Meta-trace: JIT IR (IR node of interest) -> assemble -> JIT-ed code: Binary (asm of interest)
      Measurement:
          perf counters using PAPI
          Dynamic Binary Instrumentation: phase counters, IR node counters
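One way to picture the phase counters is as an attribution problem: each annotation marks the entry into a VM phase, and the instructions retired until the next marker are charged to that phase. The sketch below assumes a hypothetical event format, `(instruction_count_at_entry, phase_name)`, with the first event at count 0. The slides do not show the actual instrumentation interface, so `phase_breakdown` is illustrative only.

```python
from collections import defaultdict


def phase_breakdown(events, total_instructions):
    """Return the fraction of instructions spent in each phase.

    events: list of (instruction_count_at_entry, phase_name), sorted by count,
            with the first event at count 0 (assumed format).
    """
    totals = defaultdict(int)
    # Pair each event with the next one; the last phase runs to the end.
    boundaries = events[1:] + [(total_instructions, None)]
    for (start, phase), (end, _) in zip(events, boundaries):
        totals[phase] += end - start
    return {phase: count / total_instructions for phase, count in totals.items()}


# Made-up event stream, for illustration only.
events = [
    (0, "interpreter"),
    (100, "tracing & opt"),
    (150, "JIT"),
    (900, "GC"),
    (950, "JIT"),
]
shares = phase_breakdown(events, 1000)
```

With this made-up stream, the JIT phase is entered twice and its two intervals are summed, which is how a per-phase breakdown like the ones on the following slides could be assembled.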

  30. Cross-layer workload characterization of meta-tracing JIT VMs
      PyPy >> CPython
          ▪ How can meta-tracing JITs significantly improve the performance of multiple dynamic languages?
      PyPy << C
          ▪ Why are meta-tracing JITs for dynamic languages still slower than C?

  31. Meta-tracing JIT improves the performance significantly
      [Figure: PyPy with meta-tracing JIT, speedup over CPython; axis 0 to 30, with two bars labeled by value: 51.2 and 30.2.]
      Benchmarks: richards, crypto_pyaes, chaos, telco, spectral-norm, django, twisted_iteration, spitfire_cstringio, raytrace-simple, hexiom2, float, ai, nbody_modified, twisted_pb, fannkuch, genshi_text, pyflate-fast, bm_mako, twisted_names, json_bench, genshi_xml, bm_chameleon, pypy_interp, twisted_tcp, html5lib, meteor-contest, sympy_sum, spitfire, spambayes, rietveld, deltablue, eparse, sympy_expand, slowspitfire, sympy_integrate, pidigits, bm_mdp, sympy_str

  32. PyPy speedup over CPython and Pycket speedup over Racket: Meta-tracing JIT improves performance significantly across multiple languages
      [Figure, two panels: Pycket speedup (axis 0 to 2) and PyPy speedup (axis 0 to 12).]
      Pycket benchmarks: binarytrees, fannkuchredux, fasta, mandelbrot, meteor, nbody, pidigits, revcomp, spectralnorm
      PyPy benchmarks: binarytrees, chameneosredux, fannkuchredux, fasta, knucleotide, mandelbrot, meteor, nbody, pidigits, regexdna, revcomp, spectralnorm, threadring

  34. Meta-tracing JIT VM phases
      [Figure: phase timelines for richards and sympy_str; x-axis: instructions, 0 to 10B.]
      Phases: calls to AOT funs, JIT, GC, deoptimization, tracing & opt, interpreter

  36. Meta-tracing JIT VM phases
      [Figure: phase breakdown per benchmark; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]
      Phases: JIT calls, JIT, GC, deopt, tracing, interp

  37. The JIT phase: The fastest benchmarks tend to execute JIT-compiled code the most
      [Figure: JIT + JIT call to AOT; y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]

  38. Meta-tracing inlines all loops and can hurt performance
      Interpreter (executed by the meta-interpreter):
          while True:
              ...
              memcpy(d, s, n)
              ...

          def memcpy(dest, src, n):
              i = 0
              while i < n:
                  dest[i] = src[i]
                  i += 1
      Meta-trace:
          ...
          guard_gt(i0, 0)
          i3 = getarrayitem(p1, 0)
          setarrayitem(p2, 0, i3)
          guard_gt(i0, 1)
          i4 = getarrayitem(p1, 1)
          setarrayitem(p2, 1, i4)
          guard_gt(i0, 2)
          i5 = getarrayitem(p1, 2)
          setarrayitem(p2, 2, i5)
          ...
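The cost of inlining the loop can be seen by recording what a tracer would emit. The sketch below is illustrative only: `trace_memcpy_inlined` mimics the unrolled trace from the slide, while `trace_memcpy_as_call` shows the alternative of leaving `memcpy` as an ahead-of-time-compiled function, using a made-up `call(...)` operation name that is not taken from the slides.

```python
def trace_memcpy_inlined(src, n):
    """Copy n items and record one guard, load, and store per iteration."""
    trace = []
    dest = [None] * n
    i = 0
    while i < n:
        trace.append(f"guard_gt(i0, {i})")                 # loop condition i < n
        trace.append(f"i{i + 3} = getarrayitem(p1, {i})")  # load src[i]
        trace.append(f"setarrayitem(p2, {i}, i{i + 3})")   # store dest[i]
        dest[i] = src[i]
        i += 1
    return dest, trace


def trace_memcpy_as_call(src, n):
    """Copy n items but record a single residual call (hypothetical op name)."""
    dest = list(src[:n])
    trace = ["call(memcpy, p2, p1, i0)"]
    return dest, trace


copied, unrolled = trace_memcpy_inlined([7, 8, 9], 3)
_, residual = trace_memcpy_as_call([7, 8, 9], 3)
```

The inlined trace grows by three operations per copied element and is specialized to one value of `n`, whereas the call variant stays at one operation regardless of length.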

  43. Examples of significant AOT-compiled functions
      Benchmark        %      Source          Function
      ai               19.4   interpreter     setobject.get_storage_from_list
      bm_chameleon     17.9   RPython types   rordereddict.ll_call_lookup_function
      bm_mako          26.1   RPython lib     runicode.unicode_encode_ucs1_helper
      json_bench       18.5   PyPy module     _pypyjson.raw_encode_basestring_ascii
      nbody_modified   44.6   external lib    pow

  44. JIT calls to AOT-compiled functions: AOT-compiled functions can improve performance by avoiding long traces
      [Figure: JIT and JIT call to AOT functions; y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]

  45. PyPy bytecode execution rate compared to CPython: Benchmarks that perform the best also warm up the fastest
      Breakeven point: the performance of the two VMs at this point is equal
      [Figure, three panels: richards (y-axis ticks 10, 30, 50), html5lib (ticks 1, 2, 3), sympy_str (ticks 1, 2); x-axis: instructions, 0 to 10B.]
      Markers: PyPy w/o JIT breakeven point, CPython breakeven point
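A breakeven point can be located from two execution-rate series. The sketch below assumes one particular reading of the slide's definition: both VMs are sampled over equal-sized instruction windows, and breakeven is the first window at which the cumulative bytecodes executed by PyPy catch up with the baseline. The function name and the input format are hypothetical, and the numbers are made up.

```python
from itertools import accumulate


def breakeven_window(pypy_bytecodes, baseline_bytecodes):
    """Return the index of the first window where PyPy has caught up, or None.

    Both arguments list bytecodes executed per instruction window
    (assumed format; windows are the same size for both VMs).
    """
    cumulative_pypy = accumulate(pypy_bytecodes)
    cumulative_base = accumulate(baseline_bytecodes)
    for index, (pypy, base) in enumerate(zip(cumulative_pypy, cumulative_base)):
        if pypy >= base:
            return index
    return None  # never catches up within the measured windows


# Made-up series: PyPy starts slow during warmup, then runs JIT-compiled code.
warm_fast = breakeven_window([1, 1, 10, 10], [4, 4, 4, 4])
never = breakeven_window([1, 1], [4, 4])
```

A benchmark that spends longer in the interpreter and tracing phases pushes this index further right, which matches the contrast between the richards and sympy_str panels.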

  51. Cross-layer workload characterization of meta-tracing JIT VMs
      PyPy >> CPython
          ▪ How can meta-tracing JITs significantly improve the performance of multiple dynamic languages?
              ▪ Meta-tracing JIT compilation significantly improves performance
              ▪ AOT-compiled functions are good at breaking pathological traces
              ▪ Easier-to-JIT programs perform the best and warm up the fastest
      PyPy << C
          ▪ Why are meta-tracing JITs for dynamic languages still slower than C?

  55. PyPy and Pycket slowdown over C/C++: Meta-tracing JITs still have a large performance gap relative to static languages
      [Figure, two panels: Pycket slowdown (axis 0 to 12) and PyPy slowdown (axis 0 to 30, with two bars labeled by value: 1374 and 31).]
      Pycket benchmarks: binarytrees, fannkuchredux, fasta, mandelbrot, meteor, nbody, pidigits, revcomp, spectralnorm
      PyPy benchmarks: binarytrees, chameneosredux, fannkuchredux, fasta, knucleotide, mandelbrot, meteor, nbody, pidigits, regexdna, revcomp, spectralnorm, threadring

  57. Meta-tracing JIT phases
      [Figure: JIT and JIT call to AOT functions; y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]

  58. Meta-tracing JIT IR node breakdown: A large part of JIT-compiled code is likely overhead
      [Figure: IR node breakdown per benchmark; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]


  62. Interpreter phase
      [Figure: Interpreter; y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]

  63. RPython-to-C translation has overheads
      [Figure: PyPy without meta-tracing JIT, speedup over CPython; axis 0 to 1.2.]
      Benchmarks: richards, crypto_pyaes, chaos, telco, spectral-norm, django, twisted_iteration, spitfire_cstringio, raytrace-simple, hexiom2, float, ai, nbody_modified, twisted_pb, fannkuch, genshi_text, pyflate-fast, bm_mako, twisted_names, json_bench, genshi_xml, bm_chameleon, pypy_interp, twisted_tcp, html5lib, meteor-contest, sympy_sum, spitfire, spambayes, rietveld, deltablue, eparse, sympy_expand, slowspitfire, sympy_integrate, pidigits, bm_mdp, sympy_str

  64. Tracing and optimization phase
      [Figure: Tracing & optimization; y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]

  65. Deoptimization phase
      [Figure: Deoptimization; y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]

  66. Garbage collection phase
      [Figure: Garbage collection; y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]

  67. Meta-tracing JIT VM overheads: Overheads are diverse and can add up to a significant portion of execution
      [Figure: Interpreter, Tracing & optimization, Deoptimization, Garbage collection; y-axis 0 to 1; x-axis: benchmarks, from fastest on PyPy to slowest on PyPy.]

  68. Iron law of processor performance: Does meta-tracing VM code execute poorly in addition to executing more instructions?
      $$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
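The iron law separates the two ways a VM can be slower than C: executing more instructions, or executing them at a worse rate. The sketch below works through that split with illustrative numbers only; the function names are hypothetical and none of the values are measurements from the talk.

```python
def execution_time(instructions, ipc, clock_hz):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    cycles_per_instruction = 1.0 / ipc
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cycles_per_instruction * seconds_per_cycle


def slowdown_factors(instr_vm, ipc_vm, instr_native, ipc_native):
    """Split a slowdown (same clock) into instruction-count and CPI factors."""
    instruction_factor = instr_vm / instr_native
    cpi_factor = ipc_native / ipc_vm  # CPI ratio is the inverse IPC ratio
    return instruction_factor, cpi_factor


# Illustrative values: the VM executes 5x the instructions at the same IPC.
clock = 2e9
t_vm = execution_time(10e9, 1.25, clock)
t_native = execution_time(2e9, 1.25, clock)
instruction_factor, cpi_factor = slowdown_factors(10e9, 1.25, 2e9, 1.25)
```

When the IPC of the two runs is similar, the CPI factor is close to 1 and the slowdown is explained by instruction count, which is the case the following IPC slides examine.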

  69. Comparing meta-tracing JIT IPC to C/C++: Meta-tracing has a similar IPC for most benchmarks
      [Figure: C/C++ IPC, PyPy IPC, and Pycket IPC per benchmark; y-axis 0 to 2.25.]

  71. IPC measurements can be accurately matched against VM phases
      [Figure: IPC over time aligned with VM phases: JIT, GC, deopt, trace, interp.]

  72. Microarchitectural characterization by the VM phase: Meta-tracing-JIT-compiled code has a similar IPC, fewer branches and mispredictions
      [Figure, two panels: IPC (y-axis 0 to 1.8) and Branch per instruction (y-axis 0 to 0.2); categories: Interp, Trace, Deopt, GC, JIT, C/C++.]
