
Automatic Synthesis of High-Speed Processor Simulators



  1. Automatic Synthesis of High-Speed Processor Simulators
     Martin Burtscher and Ilya Ganusov
     Computer Systems Laboratory, Cornell University

  2. Motivation
     - Processor simulators are invaluable tools
       - They allow us to cheaply and quickly test ideas
     - Problem
       - Portable simulators tend to be slow
       - Fast simulators are complex and either require access to source code or symbol table info, are ISA specific (non-portable), need dynamic compilation support, or perturb the simulation

  3. Functional Simulation
     - Simulates correct behavior but not timing
       - Used for prototyping, trace generation, etc.
       - Needed for fast forwarding (sampling)
       - Integral part of cycle-accurate simulators
     - Average fast-forwarding and simulation time for SPECcpu2000 with early SimPoints
       - sim-fast + sim-mase: 1.9h + 1.25h = 3.15h
       - SyntSim + sim-mase: 0.25h + 1.25h = 1.5h

  4. Contributions
     - Goal
       - Develop a functional simulator that is simple, portable, and fast (+ supports instrumentation)
     - Our approach: SyntSim
       - Before every run, statically synthesize a simulator that is optimized for the given binary
       - Combine interpreted- and compiled-mode simulation for speed and simplicity
       - Perform other important optimizations

  5. SyntSim's Features
     - Simplicity
       - Only a little more complex than an interpreter
       - Even works with stripped executables
       - Easy to add code to simulate caches, etc.
     - Portability
       - Emits C source code
       - Does not perturb simulation
     - Performance
       - Only 6.6x slower than native execution on SPECcpu2000 reference runs (geo. mean)

  6. Interpreted-Mode Simulation
     - Instruction example
       - addq r7, 200, r22
       - Fields: op | rsrc | imm | func | rdst
     - Interpreted code
       - Slow simulation speed
       - Handles all adds in all programs
       - Compiled once

     Code shown on the slide (fragment):

         inst = mem[pc];
         op = inst >> 26;
         switch (op) {
           case ALUop:
             rsrc = (inst >> 21) & 31;
             imm = (inst >> 13) & 255;
             func = (inst >> 5) & 255;
             rdst = inst & 31;
             switch (func) {
               case AddI:
                 reg[rdst] = reg[rsrc] + imm;
                 pc++;
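
The slide's fragment can be completed into a small, compilable sketch of one interpreter step. This is an illustration only, not SyntSim's actual code: the `encode` helper, the `step` function, and the numeric values chosen for `ALU_OP` and `ADD_I` are assumptions made here so the example is self-contained; only the shift and mask expressions come from the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for the slide's ALUop and AddI constants (illustration only). */
enum { ALU_OP = 0x10, ADD_I = 0xA0 };

/* Hypothetical helper: packs fields at the bit positions the slide decodes. */
static uint32_t encode(uint32_t op, uint32_t rsrc, uint32_t imm,
                       uint32_t func, uint32_t rdst)
{
    assert(op < 64 && rsrc < 32 && imm < 256 && func < 256 && rdst < 32);
    return (op << 26) | (rsrc << 21) | (imm << 13) | (func << 5) | rdst;
}

/* Interprets the instruction at mem[*pc].
 * Returns 0 on success, -1 if the instruction is not handled here. */
static int step(const uint32_t *mem, uint64_t *pc, uint64_t reg[32])
{
    uint32_t inst = mem[*pc];
    uint32_t op = inst >> 26;

    switch (op) {
    case ALU_OP: {
        /* Decoding happens on every execution; this is the interpreter's cost. */
        uint32_t rsrc = (inst >> 21) & 31;
        uint32_t imm  = (inst >> 13) & 255;
        uint32_t func = (inst >> 5) & 255;
        uint32_t rdst = inst & 31;

        switch (func) {
        case ADD_I:
            reg[rdst] = reg[rsrc] + imm;
            (*pc)++;
            return 0;
        default:
            return -1;
        }
    }
    default:
        return -1;
    }
}
```

Calling `step` on a memory word built with `encode(ALU_OP, 7, 200, ADD_I, 22)` adds 200 to register 7 and stores the sum in register 22, which corresponds to the slide's `addq r7, 200, r22` example.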

  7. Compiled-Mode Simulation
     - Instruction example
       - addq r7, 200, r22
       - Fields: op | rsrc | imm | func | rdst
     - Translated code
       - Fast simulation speed
       - Only handles this add in this program
       - Incurs synthesis and compilation overhead
     - Optimizations
       - No decoding
       - Hardcoded indices and immediates
       - Other optims.

     Code shown on the slide:

         reg[22] = reg[7] + 200;
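
One way to picture the synthesis step is a function that decodes an instruction once and prints the specialized C statement. The sketch below assumes the same field layout as the interpreter slide; `emit_addi` is a hypothetical name, and the real SyntSim code generator is certainly more involved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical emitter: decodes an add-immediate at synthesis time and writes
 * a C statement with hardcoded register indices and immediate into `out`.
 * Returns the statement length, or -1 if `out` is too small. */
static int emit_addi(uint32_t inst, char *out, size_t cap)
{
    assert(out != NULL && cap > 0);

    unsigned rsrc = (inst >> 21) & 31;
    unsigned imm  = (inst >> 13) & 255;
    unsigned rdst = inst & 31;

    int n = snprintf(out, cap, "reg[%u] = reg[%u] + %u;", rdst, rsrc, imm);
    return (n >= 0 && (size_t)n < cap) ? n : -1;
}
```

For an instruction word whose fields are rsrc = 7, imm = 200, and rdst = 22, the emitted text is `reg[22] = reg[7] + 200;`, matching the slide. The decoding work is paid once, when the simulator is generated, instead of on every simulated execution.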

  8. Mixed-Mode Simulation
     - Combine interpreted and compiled mode
       - Translating the 15% most frequently executed static instructions suffices to run 99.9% of the dynamic instructions in compiled mode
       - Remaining instructions are interpreted
     - Translating only frequently executed instrs
       - Much shorter compilation time
       - Smaller executable (better i-cache performance)
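
The selection of instructions to translate can be sketched as a coverage computation over a profile: sort the per-instruction execution counts and take the hottest ones until the target share of dynamic instructions is reached. The function name `hot_set_size` and the whole procedure are assumptions for illustration; the slides do not describe how SyntSim picks the set.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: orders execution counts from largest to smallest. */
static int by_count_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/* Hypothetical helper: returns how many of the hottest static instructions
 * must be translated so that at least `coverage` (0.0 to 1.0) of all dynamic
 * executions run in compiled mode. Sorts `counts` in place. */
static size_t hot_set_size(uint64_t *counts, size_t n, double coverage)
{
    assert(coverage >= 0.0 && coverage <= 1.0);

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += counts[i];

    qsort(counts, n, sizeof counts[0], by_count_desc);

    uint64_t covered = 0;
    size_t k = 0;
    while (k < n && (double)covered < coverage * (double)total)
        covered += counts[k++];
    return k;
}
```

With a skewed profile, a small `k` covers nearly all executions, which is the effect the slide reports (15% of static instructions for 99.9% of dynamic instructions).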

  9. SyntSim's Operation
     [Diagram: SyntSim (code generator and optimizer) takes four inputs and produces a high-speed simulator.]
     - Inputs
       - Instruction definitions (C code), e.g. add: D=A+B; sub: D=A-B; bne: if (A) goto B; …
       - Program executable
       - Optional profile
       - User options
     - Output
       - High-speed simulator (C code), consisting of an interpreter and a compiled-mode simulator

  10. Compiled-Mode Simulator

          static void RunCompiled()
          {
            forever {
              switch (pc/4) {
                case 0x4800372c:
                  r[2] = RdMem8(r[1]-30768);       // 12000dcb8: ldq r2, -30768(r1)
                  s1 = r[0]; s2 = r[4];            // 12000dcd0: cmplt r0, r4, r0
                  r[0] = 0; if (s1<s2) r[0] = 1;
                  ic += 3;
                  if (0!=r[0]) goto L12000dcf0;    // 12000dcdc: bne r0, 12000dcf0
                  ic += 1;
                  goto L12000c970;                 // 12000dcec: br r31, 12000c970
                L12000dcf0:
                  ic += 1;
                  pc = r[26]&(~3ULL);              // 12000dcf0: ret r31, (r26), 1
                  icnt[fnc(lasttarget)] += ic;
                  ic = 0;
                  lasttarget = pc;
                  break;
                default: RunInterpreted();
              } // switch
            } // forever
          }
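
The dispatch structure above (a `switch` on the program counter whose `default` case falls back to the interpreter) can be tried out on a toy example. Everything in this sketch is hypothetical: the three-operation ISA, the `prog` array, and the `cpu` struct are invented for illustration and are unrelated to Alpha or to SyntSim's generated code. It assumes `r[1]` starts at a positive value.

```c
#include <assert.h>
#include <stdint.h>

enum op { ADDI, BNE, HALT };                 /* toy ISA, not Alpha */
struct inst { enum op op; int rd, rs; int64_t imm; };

static const struct inst prog[] = {
    { ADDI, 1, 1,  -1 },   /* 0: r1 -= 1          (hot loop) */
    { ADDI, 2, 2,   2 },   /* 1: r2 += 2                     */
    { BNE,  0, 1,   0 },   /* 2: if (r1) goto 0              */
    { ADDI, 2, 2, 100 },   /* 3: r2 += 100        (cold)     */
    { HALT, 0, 0,   0 },   /* 4: stop                        */
};

struct cpu {
    int64_t  r[4];
    uint64_t pc;
    uint64_t compiled, interpreted;          /* instructions run in each mode */
    int      halted;
};

/* Interprets exactly one instruction at c->pc. */
static void run_interpreted(struct cpu *c)
{
    assert(c->pc < sizeof prog / sizeof prog[0]);
    const struct inst *i = &prog[c->pc];
    c->interpreted++;
    switch (i->op) {
    case ADDI: c->r[i->rd] = c->r[i->rs] + i->imm; c->pc++; break;
    case BNE:  c->pc = (c->r[i->rs] != 0) ? (uint64_t)i->imm : c->pc + 1; break;
    case HALT: c->halted = 1; break;
    }
}

/* Mixed mode: the hot loop at pc 0..2 is hand-translated into straight C;
 * every other pc falls through to the interpreter. */
static void run_mixed(struct cpu *c)
{
    while (!c->halted) {
        switch (c->pc) {
        case 0:
        L0:
            c->r[1] += -1;                   /* 0: addi r1, r1, -1 */
            c->r[2] += 2;                    /* 1: addi r2, r2, 2  */
            c->compiled += 3;
            if (c->r[1] != 0) goto L0;       /* 2: bne r1, 0       */
            c->pc = 3;
            break;
        default:
            run_interpreted(c);
        }
    }
}
```

Starting with `r[1] = 5`, the loop body runs entirely in the translated block, and only the two cold instructions at the end go through `run_interpreted`.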

  11. Related Work
      - MINT 1994
        - Dyn. decompile short code sequences into fncs
      - QPT/EEL 1994/1995
        - Rewrite executable, use quite precise algorithm for indirect branches, need dyn. translation
      - SuperSim 1996
        - Static decompilation into C, fully labeled
      - UQBT 2000
        - Decompilation into special high-level language, static hooks to interpret untranslated code

  12. Evaluation Methodology
      - System
        - 750MHz 64-bit Alpha 21264A
        - 64kB L1, 8MB L2, 2GB RAM
        - Tru64 UNIX V5.1
      - Benchmarks
        - 20 SPECcpu2000 programs, highly optimized
        - All F77 and C programs except perlbmk
        - Full test, train, and reference runs

  13. Profile vs. Heuristic Performance
      - Runtimes include
        - Synthesis time (0.08s)
        - Compilation time (33s)
        - Simulation time (3160s)
      - Profile based
        - 6.6x gmean slowdown (2x to 16x)
      - Heuristic based
        - 8.7x gmean slowdown (2.2x to 66x)
      [Chart: slowdown relative to native execution per benchmark (gzip, vpr, gcc, mcf, crafty, parser, gap, vortex, bzip2, twolf, mesa, art, equake, ammp, wupwise, swim, mgrid, applu, sixtrack, apsi, geo_mean). Series: ref runs with ref profiles, ref runs with heuristics. Y-axis 0 to 24; the values 32.8 and 66.3 are labeled beyond the axis range.]

  14. Mixed-Mode Performance
      - Observations
        - Better profiles help
        - Pure compiled mode is slower than mixed mode with good profile (24% on train runs)
        - Best c/i ratio decreases with quality of profile
        - 99.9% compiled mode is best with self profile (15% of static instrs)
      [Chart: slowdown relative to native execution (y-axis, 6.0 to 10.0) versus percent interpreted instructions, dynamic (x-axis: 0, 0.001, 0.01, 0.1, 1, 10). Series: ref runs with test profiles, ref runs with train profiles, ref runs with ref profiles, ref runs with heuristics.]

  15. Comparison with Interpreters
      - SyntSim's interpreter
        - 2.5x faster than sim-fast
      - Mixed mode
        - 19x faster than sim-fast
        - 8x faster on ref runs than SyntSim interpreter (3.6x to 14x)
        - 7x faster on train runs
        - 3.7x faster on test runs
      [Chart: slowdown relative to native execution per benchmark (twolf, gzip, gcc, mcf, parser, vortex, bzip2, mesa, art, equake, ammp, mgrid, applu, geo_mean). Series: ref runs using mixed mode, ref runs using interpreter, ref runs using sim-fast. Y-axis 0 to 250; the value 282 is labeled beyond the axis range.]

  16. Comparison with ATOM
      - Adding instrumentation
        - Identical C code
        - Instruction count (ic)
        - Mem hierarchy (memh)
        - Branch predictor (bp)
      - Results
        - ic: ATOM is 2x faster
        - rest: SyntSim is 2.6x faster than ATOM
      [Chart: slowdown relative to native execution (y-axis, 0 to 35) for SyntSim and ATOM with three instrumentation levels: ic, ic+memh, ic+memh+bp.]

  17. Conclusions
      - Presented a fully automated technique to statically create fast yet portable simulators
      - Interleaves compiled- and interpreted-mode simulation for speed and simplicity
      - Only 6.6x slower than native execution
      - Only 13x slowdown when counting instructions and simulating a memory hierarchy and a branch predictor (warmup)
