Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters
Marc Berndl, Benjamin Vitale, Mathew Zaleski, Angela Demke Brown
Research supported by IBM CAS, NSERC, CITO

Interpreter performance
• Why not just-in-time (JIT) compile?
• High-performance JVMs still interpret
• People use interpreted languages that don't yet have JITs
• They still want performance!
• 30-40% of execution time is due to stalls caused by branch misprediction.
• Our technique eliminates 95% of branch mispredictions

Overview
✔ Motivation
• Background: The Context Problem
• Existing Solutions
• Our Approach
• Inlining
• Results

A Tale of Two Machines
[Diagram: two machines side by side]
• Virtual machine (interpreter): a virtual program is loaded into a loaded program; bytecode bodies; execution cycle
• Real machine (CPU): pipeline; execution cycle; predictors for target address (indirect), return address, and wayness (conditional)

Interpreter
[Diagram: structure of the interpreter]
• Load: the program is loaded into an internal representation (the loaded program)
• Bytecode bodies
• Execution cycle (stage labels): fetch, dispatch, load parms, execute

Running Java Example
Java source:
  void foo(){
    int i=1;
    do{
      i+=i;
    } while(i<64);
  }
Java bytecode (produced by the javac compiler):
   0: iconst_1
   1: istore_1
   2: iload_1
   3: iload_1
   4: iadd
   5: istore_1
   6: iload_1
   7: bipush 64
   9: if_icmplt 2
  12: return

Switched Interpreter
  while(1){
    opcode = *vPC++;
    switch(opcode){
      case iload_1:
        ..
        break;
      case iadd:
        ..
        break;
      //and many more..
    }
  }
‣ Slow: burdened by switch and loop overhead
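To make the switch-and-loop overhead concrete, here is a minimal sketch of a switched interpreter that runs the foo() loop from the running example. It is an illustration under stated assumptions, not the interpreter measured in the talk: the opcode numbering, the one-byte branch offset, the eight-slot stack, and the names run_switched and example_loop are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering; real JVM opcodes have different values. */
enum { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN };

/* foo() from the running example. Assumption: the branch offset is a single
 * byte relative to the if_icmplt opcode (the JVM uses a two-byte offset). */
const int8_t example_loop[] = {
    ICONST_1, ISTORE_1,
    ILOAD_1, ILOAD_1, IADD, ISTORE_1, ILOAD_1,
    BIPUSH, 64,
    IF_ICMPLT, -7,
    RETURN
};

/* Runs the program and returns local variable 1 (i), or -1 on a bad opcode. */
int run_switched(const int8_t *code) {
    int stack[8];
    int sp = 0;
    int local1 = 0;
    const int8_t *vPC = code;

    while (1) {
        const int8_t *insn = vPC;   /* address of the current opcode */
        int opcode = *vPC++;
        switch (opcode) {           /* one shared dispatch point for every instruction */
        case ICONST_1: assert(sp < 8); stack[sp++] = 1; break;
        case ILOAD_1:  assert(sp < 8); stack[sp++] = local1; break;
        case ISTORE_1: local1 = stack[--sp]; break;
        case IADD: { int b = stack[--sp]; stack[sp - 1] += b; break; }
        case BIPUSH:   assert(sp < 8); stack[sp++] = *vPC++; break;
        case IF_ICMPLT: {
            int b = stack[--sp];
            int a = stack[--sp];
            int offset = *vPC++;
            if (a < b)
                vPC = insn + offset;  /* taken: jump back to the loop head */
            break;
        }
        case RETURN: return local1;
        default:     return -1;
        }
    }
}
```

Calling run_switched(example_loop) doubles i from 1 until it reaches 64 and returns that value. Every virtual instruction pays for the loop back-edge and the switch before its body runs.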

"Threading" Dispatch
Bytecode bodies:
  iload_1:
    ..
    goto *vPC++;
  iadd:
    ..
    goto *vPC++;
  istore_1:
    ..
    goto *vPC++;
Virtual program:
   0: iconst_1
   1: istore_1
   2: iload_1
   3: iload_1
   4: iadd
   5: istore_1
   6: iload_1
   7: bipush 64
   9: if_icmplt 2
  12: return
• Execution of the virtual program "threads" through the bodies (as in needle & thread)
‣ No switch overhead. Data-driven indirect branch.

Context Problem
[Diagram: the same bodies and virtual program as on the previous slide; each goto *vPC++ is an indirect branch that goes through the indirect branch predictor (micro-arch)]
‣ Data-driven indirect branches are hard to predict
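A toy model can show why these branches are hard to predict. The sketch below replays the dynamic instruction trace of foo() against a "last target" predictor that keeps one entry per branch site. Everything here is an assumption made for illustration: the predictor model, the three keying modes, and the names count_mispredictions, KEY_SWITCH, KEY_DIRECT and KEY_VPC. In particular, KEY_VPC is a hypothetical predictor indexed by virtual program counter; it stands for the context that the hardware lacks, not for any real microarchitecture.

```c
#include <assert.h>

enum op { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN };

/* Which branch site a dispatch is charged to (all three are modelling assumptions). */
enum keying {
    KEY_SWITCH,  /* one shared indirect branch, as in a switched interpreter */
    KEY_DIRECT,  /* one indirect branch per body, as in direct threading */
    KEY_VPC      /* hypothetical: one entry per virtual instruction */
};

/* The ten static instructions of foo(), in program order. */
static const enum op program[10] = {
    ICONST_1, ISTORE_1, ILOAD_1, ILOAD_1, IADD,
    ISTORE_1, ILOAD_1, BIPUSH, IF_ICMPLT, RETURN
};

#define TRACE_LEN 45  /* 2 setup instructions + 6 iterations of 7 + return */

/* Fills trace[] with the program index of every executed instruction. */
static void build_trace(int trace[TRACE_LEN]) {
    int n = 0;
    trace[n++] = 0;
    trace[n++] = 1;
    for (int iter = 0; iter < 6; iter++)      /* i: 1, 2, 4, 8, 16, 32 */
        for (int idx = 2; idx <= 8; idx++)
            trace[n++] = idx;
    trace[n++] = 9;
    assert(n == TRACE_LEN);
}

/* Counts dispatches whose target differs from the last target seen at the same site. */
int count_mispredictions(enum keying key) {
    int trace[TRACE_LEN];
    int last_target[10];
    int misses = 0;

    build_trace(trace);
    for (int s = 0; s < 10; s++)
        last_target[s] = -1;                  /* cold predictor: first use always misses */

    for (int j = 1; j < TRACE_LEN; j++) {
        int from = trace[j - 1];              /* instruction performing the dispatch */
        int target = (int)program[trace[j]];  /* body being dispatched to */
        int site = (key == KEY_SWITCH) ? 0
                 : (key == KEY_DIRECT) ? (int)program[from]
                 : from;
        if (last_target[site] != target)
            misses++;
        last_target[site] = target;
    }
    return misses;
}
```

Out of 44 dispatches, this model mispredicts 38 with KEY_SWITCH, 24 with KEY_DIRECT and 10 with KEY_VPC. In the direct-threaded case the body of iload_1 is the culprit: its single dispatch branch is followed by iload_1, iadd and bipush in turn, so the last target is wrong on every visit.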

Direct Threaded Interpreter
Virtual program      DTT - Direct Threading Table (vPC points into it)
  …                    …
  iload_1              &&iload_1
  iload_1              &&iload_1
  iadd                 &&iadd
  istore_1             &&istore_1
  iload_1              &&iload_1
  bipush 64            &&bipush
                       64
  if_icmplt 2          &&if_icmplt
                       -7
  …                    …
C implementation of each body:
  iload_1:
    ..
    goto *vPC++;
  iadd:
    ..
    goto *vPC++;
  istore_1:
    ..
    goto *vPC++;
‣ Target of computed goto is data-driven
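The table and bodies above can be sketched as runnable C using the labels-as-values extension of GCC and Clang (the && operator and computed goto are not ISO C). This is a minimal sketch, not the SableVM implementation: the function name run_direct_threaded, the eight-slot stack, the vreturn label, and the choice to measure the -7 offset in table slots from the &&if_icmplt entry are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Runs foo() with direct threading and returns local variable 1 (i).
 * Requires GCC or Clang (labels as values). */
int run_direct_threaded(void) {
    int stack[8];
    int sp = 0;
    int local1 = 0;

    /* DTT: body addresses with operands stored inline. */
    void *dtt[] = {
        &&iconst_1, &&istore_1,
        &&iload_1, &&iload_1, &&iadd, &&istore_1, &&iload_1,
        &&bipush, (void *)(intptr_t)64,
        &&if_icmplt, (void *)(intptr_t)-7,
        &&vreturn
    };
    void **vPC = dtt;

    goto **vPC++;                 /* dispatch the first instruction */

iconst_1:
    assert(sp < 8);
    stack[sp++] = 1;
    goto **vPC++;                 /* every body ends with its own indirect branch */
iload_1:
    assert(sp < 8);
    stack[sp++] = local1;
    goto **vPC++;
istore_1:
    local1 = stack[--sp];
    goto **vPC++;
iadd: {
    int b = stack[--sp];
    stack[sp - 1] += b;
    goto **vPC++;
}
bipush:
    assert(sp < 8);
    stack[sp++] = (int)(intptr_t)*vPC++;   /* operand follows the body address */
    goto **vPC++;
if_icmplt: {
    int b = stack[--sp];
    int a = stack[--sp];
    void **insn = vPC - 1;                 /* slot holding &&if_icmplt */
    intptr_t offset = (intptr_t)*vPC++;
    if (a < b)
        vPC = insn + offset;               /* taken: back to the first iload_1 */
    goto **vPC++;
}
vreturn:
    return local1;
}
```

run_direct_threaded() returns 64. The slide writes the dispatch as goto *vPC++; with vPC declared as void **, the equivalent C spelling is goto **vPC++. Note that the target of each goto comes from the table, which is what makes the branch data-driven.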

Existing Solutions
[Diagram: two ways of giving each dispatch branch more context]
• Replicate (Ertl & Gregg): bodies and dispatch replicated, e.g. two copies of iload_1, each ending in its own goto *pc
• Super Instruction (Piumarta & Riccardi): bodies replicated, e.g. Body 1 and Body 2 copied back to back and followed by a single goto *pc
‣ Limited to relocatable virtual instructions

Overview
✔ Motivation
✔ Background: The Context Problem
✔ Existing Solutions
• Our Approach
• Inlining
• Results

Key Observation
• Virtual and native control flow are similar
  • Linear or straight-line code
  • Conditional branches
  • Calls and returns
  • Indirect branches
• Hardware has predictors for each type
• Direct threading uses an indirect branch for everything!
‣ Solution: Leverage hardware predictors

Essence of our Solution
Virtual program: …, iload_1, iload_1, iadd, istore_1, iload_1, bipush 64, if_icmplt 2, …
CTT - Context Threading Table (generated code):
  call iload_1
  call iload_1
  call iadd
  call istore_1
  call iload_1
  ..
Bytecode bodies (ret terminated):
  iload_1:
    ..
    ret;
  iadd:
    ..
    ret;
• Hardware: return branch predictor stack
‣ Package bodies as subroutines and call them

Subroutine Threading
Virtual program: …, iload_1, iload_1, iadd, istore_1, iload_1, bipush 64, if_icmplt 2, …
CTT (code generated at load time):
  …
  call iload_1
  call iload_1
  call iadd
  call istore_1
  call iload_1
  call bipush
  call if_icmplt
  …
DTT (vPC points into it): contains operands such as 64 and -7, and addresses in the CTT
Bytecode bodies (ret terminated):
  iload_1:
    …
    ret;
  iadd:
    …
    ret;
  if_icmplt:
    …
    goto *vPC++;
• Virtual branch instructions dispatch as before
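Real subroutine threading emits native call instructions at load time, which portable C cannot do directly. As a rough stand-in, the sketch below writes by hand what such generated code might look like for foo(): each body is a function that ends in a native return, straight-line runs become sequences of direct calls, and only the virtual branch goes through an indirect transfer. The block functions, the ctt array and the name run_subroutine_threaded are hypothetical, and passing the bipush operand as an argument is a simplification (the slide's design reads it from the DTT through vPC).

```c
#include <assert.h>

static int stack[8];
static int sp;
static int local1;

/* Bytecode bodies packaged as subroutines ("ret terminated"). */
static void push(int v)      { assert(sp < 8); stack[sp++] = v; }
static void iconst_1(void)   { push(1); }
static void iload_1(void)    { push(local1); }
static void istore_1(void)   { local1 = stack[--sp]; }
static void iadd(void)       { int b = stack[--sp]; stack[sp - 1] += b; }
static void bipush(int v)    { push(v); }   /* simplification: operand passed directly */
static int  if_icmplt(void)  { int b = stack[--sp]; int a = stack[--sp]; return a < b; }

/* Hand-written stand-ins for straight-line stretches of the generated CTT.
 * Each returns the index of the block to run next, or -1 to stop. */
static int block_entry(void) {
    iconst_1();
    istore_1();
    return 1;                     /* models falling through into the loop block */
}

static int block_loop(void) {
    iload_1();
    iload_1();
    iadd();
    istore_1();
    iload_1();
    bipush(64);
    return if_icmplt() ? 1 : 2;   /* the branch target is still computed as data */
}

static int block_exit(void) { return -1; }

typedef int (*block_fn)(void);

int run_subroutine_threaded(void) {
    static const block_fn ctt[] = { block_entry, block_loop, block_exit };
    sp = 0;
    local1 = 0;
    for (int next = 0; next >= 0; )
        next = ctt[next]();       /* indirect transfer, needed only at virtual branches */
    return local1;
}
```

run_subroutine_threaded() returns 64. Within a block, every dispatch is a direct call paired with a return, which is the pattern that return address predictors handle well; the indirect transfer survives only where the virtual program branches.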

The Context Threading Table
• A sequence of generated call instructions
• Good alignment of virtual and hardware control flow for straight-line code.
‣ Can virtual branches go into the CTT?

Specialized Branch Inlining
[Diagram: the DTT (indexed by vPC) beside the CTT; the CTT contains a target: label, calls such as call iload_1, and the inlined branch if(icmplt) goto target]
• Branch inlined into the CTT
• Conditional branch predictor now mobilized
‣ Inlining conditional branches provides context
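Continuing with hand-written C as a stand-in for generated code, the sketch below shows the effect of inlining the virtual branch: the comparison of if_icmplt sits directly in the straight-line code and ends in a native conditional jump to the loop head. The function name run_branch_inlined and the label loop_head are hypothetical, and the bipush operand is again passed as an argument for brevity.

```c
#include <assert.h>

static int stack[8];
static int sp;
static int local1;

/* Non-branching bodies remain subroutines that are called directly. */
static void push(int v)      { assert(sp < 8); stack[sp++] = v; }
static void iconst_1(void)   { push(1); }
static void iload_1(void)    { push(local1); }
static void istore_1(void)   { local1 = stack[--sp]; }
static void iadd(void)       { int b = stack[--sp]; stack[sp - 1] += b; }
static void bipush(int v)    { push(v); }   /* simplification: operand passed directly */

int run_branch_inlined(void) {
    sp = 0;
    local1 = 0;

    iconst_1();
    istore_1();
loop_head:
    iload_1();
    iload_1();
    iadd();
    istore_1();
    iload_1();
    bipush(64);
    {
        /* if_icmplt inlined: a native conditional branch with its own
         * address, so the conditional branch predictor can track it. */
        int b = stack[--sp];
        int a = stack[--sp];
        if (a < b)
            goto loop_head;
    }
    return local1;
}
```

run_branch_inlined() returns 64. Compared with dispatching the branch through a table, the loop back-edge is now an ordinary conditional branch: taken five times and then not taken once for this program.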

Tiny Inlining
• Context Threading is a dispatch technique
• But, we inline branches
• Some non-branching bodies are very small
• Why not inline those?
‣ Inline all tiny linear bodies into the CTT

Overview
✔ Motivation
✔ Background: The Context Problem
✔ Existing Solutions
✔ Our Approach
✔ Inlining
• Results

Experimental Setup
• Two virtual machines on two hardware architectures
• VM: Java/SableVM, OCaml interpreter
• Compare against direct-threaded SableVM
• SableVM distro uses selective inlining
• Arch: P4, PPC
• Branch misprediction
• Execution time
‣ Is our technique effective and general?

Mispredicted Taken Branches
[Bar chart: mispredicted taken branches normalized to direct threading (y-axis 0 to 1.00), SableVM/Java on Pentium 4. Series: Subroutine, Branch Inlining, Tiny Inlining. Benchmarks: compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, soot]
‣ 95% of mispredictions eliminated on average

Execution time
[Bar chart: execution time normalized to direct threading (y-axis 0 to 1.00), Pentium 4. Series: Subroutine, Branch Inlining, Tiny Inlining. Benchmarks: compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, soot]
‣ 27% average reduction in execution time

Execution Time (geomean)
[Bar chart: geometric mean execution time normalized to direct threading (y-axis 0 to 1.00). Series: Subroutine, Branch Inlining, Tiny Inlining. Configurations: java/p4, java/ppc, ocaml/p4, ocaml/ppc]
‣ Our technique is effective and general

Conclusions
• Context Problem: branch mispredictions due to mismatch between native and virtual control flow
• Solution: Generate control flow code into the Context Threading Table
• Results
  • Eliminate 95% of branch mispredictions
  • Reduce execution time by 30-40%
‣ Recent work, done after CGO 2005, follows

What about Scripting Languages?
• Recently ported context threading to Tcl.
• 10x cycles executed per bytecode dispatched.
• Much lower dispatch overhead.
• Speedup due to subroutine threading: approx. 5%.
• Tcl conference 2005
[Chart: cycles per dispatch (cycles per virtual instruction) on a log scale from 10^0 to 10^5, for Tcl and OCaml benchmarks]