Automatic Synthesis of High-Speed Processor Simulators
Martin Burtscher and Ilya Ganusov
Computer Systems Laboratory, Cornell University

Motivation
• Processor simulators are invaluable tools
  • They allow us to test ideas cheaply and quickly
• Problem
  • Portable simulators tend to be slow
  • Fast simulators are complex and either require access to source code or symbol table information, are ISA specific (non-portable), need dynamic compilation support, or perturb the simulation

Functional Simulation
• Simulates correct behavior but not timing
  • Used for prototyping, trace generation, etc.
  • Needed for fast forwarding (sampling)
  • Integral part of cycle-accurate simulators
• Average fast-forwarding and simulation time for SPECcpu2000 with early SimPoints
  • sim-fast + sim-mase: 1.9h + 1.25h = 3.15h
  • SyntSim + sim-mase: 0.25h + 1.25h = 1.5h
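
Taking the two totals on this slide at face value, the end-to-end gain and the gain in the fast-forwarding phase alone work out as follows (simple ratios of the quoted hours, nothing beyond the slide's numbers):

```latex
\[
\frac{1.9\,\mathrm{h} + 1.25\,\mathrm{h}}{0.25\,\mathrm{h} + 1.25\,\mathrm{h}}
  = \frac{3.15}{1.5} = 2.1,
\qquad
\frac{1.9\,\mathrm{h}}{0.25\,\mathrm{h}} = 7.6
\]
```

So fast forwarding is 7.6x faster, while the total time improves by 2.1x because the 1.25h of sim-mase simulation is unchanged.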

Contributions
• Goal
  • Develop a functional simulator that is simple, portable, and fast (and supports instrumentation)
• Our approach: SyntSim
  • Before every run, statically synthesize a simulator that is optimized for the given binary
  • Combine interpreted- and compiled-mode simulation for speed and simplicity
  • Perform other important optimizations

SyntSim’s Features
• Simplicity
  • Only a little more complex than an interpreter
  • Even works with stripped executables
  • Easy to add code to simulate caches, etc.
• Portability
  • Emits C source code
  • Does not perturb simulation
• Performance
  • Only 6.6x slower than native execution on SPECcpu2000 reference runs (geometric mean)

Interpreted-Mode Simulation
• Instruction example
  • addq r7, 200, r22
  • Fields: op | rsrc | imm | func | rdst
• Interpreted code
  • Slow simulation speed
  • Handles all adds in all programs
  • Compiled once

    inst = mem[pc];
    op = inst >> 26;
    switch (op) {
      case ALUop:
        rsrc = (inst >> 21) & 31;
        imm = (inst >> 13) & 255;
        func = (inst >> 5) & 255;
        rdst = inst & 31;
        switch (func) {
          case AddI:
            reg[rdst] = reg[rsrc] + imm;
    pc++;
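
The fragment on this slide is cut off, so here is a minimal, compilable sketch of the same decode-and-dispatch step. The field positions follow the slide; the `Cpu` struct, the helper `encode_addi`, and the opcode values `ALU_OP` and `ADD_I` are placeholders introduced only for illustration and are not taken from SyntSim.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder encodings, chosen only so the example is self-consistent. */
enum { ALU_OP = 0x10, ADD_I = 0xA0 };

typedef struct {
    uint64_t reg[32];
    uint64_t pc;            /* word index into mem, as on the slide (pc++) */
} Cpu;

/* Builds an add-immediate instruction word using the slide's field layout. */
static uint32_t encode_addi(unsigned rsrc, unsigned imm, unsigned rdst)
{
    assert(rsrc < 32 && imm < 256 && rdst < 32);
    return ((uint32_t)ALU_OP << 26) | (rsrc << 21) | (imm << 13) |
           ((uint32_t)ADD_I << 5) | rdst;
}

/* Decodes and executes one instruction. Returns 1 on success and 0 if the
   instruction is not handled by this sketch. */
static int interpret_step(Cpu *cpu, const uint32_t *mem)
{
    uint32_t inst = mem[cpu->pc];
    uint32_t op = inst >> 26;

    switch (op) {
    case ALU_OP: {
        uint32_t rsrc = (inst >> 21) & 31;
        uint32_t imm  = (inst >> 13) & 255;
        uint32_t func = (inst >> 5) & 255;
        uint32_t rdst = inst & 31;

        switch (func) {
        case ADD_I:
            cpu->reg[rdst] = cpu->reg[rsrc] + imm;
            break;
        default:
            return 0;       /* other ALU functions are omitted here */
        }
        break;
    }
    default:
        return 0;           /* other opcodes are omitted here */
    }
    cpu->pc++;
    return 1;
}
```

Every call repeats the shifts, masks, and two switch dispatches, which is the decoding cost that compiled mode removes.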

Compiled-Mode Simulation
• Instruction example
  • addq r7, 200, r22
  • Fields: op | rsrc | imm | func | rdst
• Translated code
  • Fast simulation speed
  • Only handles this add in this program
  • Incurs synthesis and compilation overhead
• Optimizations
  • No decoding
  • Hardcoded indices and immediates
  • Other optimizations

    reg[22] = reg[7] + 200;
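
To make "hardcoded indices and immediates" concrete, the following sketch decodes an add-immediate word once, at synthesis time, and appends the resulting C statement to a text buffer. The function name `emit_addi` and the buffer-based interface are assumptions for illustration; the slides do not show how SyntSim's code generator is organized.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Appends the C statement for one add-immediate instruction to the
   NUL-terminated string in out (capacity cap bytes). The register indices
   and the immediate are decoded here, once, and printed as constants.
   Returns 0 on success and -1 if the buffer is too small. */
static int emit_addi(uint32_t inst, char *out, size_t cap)
{
    assert(out != NULL);
    size_t used = strlen(out);
    assert(used < cap);

    unsigned rsrc = (inst >> 21) & 31;
    unsigned imm  = (inst >> 13) & 255;
    unsigned rdst = inst & 31;

    size_t room = cap - used;
    int n = snprintf(out + used, room, "reg[%u] = reg[%u] + %u;\n",
                     rdst, rsrc, imm);
    if (n < 0 || (size_t)n >= room) {
        out[used] = '\0';   /* drop a truncated statement */
        return -1;
    }
    return 0;
}
```

For a word whose fields are rsrc = 7, imm = 200, rdst = 22, the emitted text is the statement shown on the slide; the C compiler then sees only constants and can optimize across neighboring statements.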

Mixed-Mode Simulation
• Combine interpreted and compiled mode
  • Translating the 15% most frequently executed static instructions suffices to run 99.9% of the dynamic instructions in compiled mode
  • Remaining instructions are interpreted
• Translating only frequently executed instructions
  • Much shorter compilation time
  • Smaller executable (better i-cache performance)
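
One straightforward way to pick the instructions to translate from a profile is sketched below: sort the per-instruction execution counts and take the hottest ones until the target share of dynamic instructions is covered. This greedy selection is an assumption made for illustration; the slides do not state how SyntSim chooses the set, and a real tool would also need to keep track of which instruction each count belongs to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator that orders execution counts from largest to smallest. */
static int cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/* Returns how many of the hottest static instructions must be translated so
   that at least `coverage` (0.0 to 1.0) of all dynamic instructions run in
   compiled mode. Note: sorts `counts` in place. */
static size_t hot_set_size(uint64_t *counts, size_t n, double coverage)
{
    assert(coverage >= 0.0 && coverage <= 1.0);

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += counts[i];

    qsort(counts, n, sizeof counts[0], cmp_desc);

    uint64_t covered = 0;
    size_t k = 0;
    while (k < n && (double)covered < coverage * (double)total)
        covered += counts[k++];
    return k;
}
```

Dividing the returned value by the number of static instructions gives the translated fraction, which the slide reports as about 15% for 99.9% coverage.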

SyntSim’s Operation
[Figure: block diagram of SyntSim’s inputs and output]
• Inputs
  • Instruction definitions (C code), e.g. add: D=A+B; sub: D=A-B; bne: if (A) goto B; …
  • Program executable
  • Optional profile
  • User options
• SyntSim: code generator and optimizer
• Output: high-speed simulator (C code) containing an interpreter and a compiled-mode simulator
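
The instruction definitions in the diagram use the placeholders D, A, and B. A minimal sketch of how such a definition could be specialized for one concrete instruction is shown below. It assumes a deliberately naive rule (every capital D, A, or B in the definition is a placeholder) and a hypothetical function name `expand_def`; the actual definition syntax and substitution rules in SyntSim may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expands a definition such as "D=A+B;" by replacing the letters D, A and B
   with the given C expressions. Naive by design: any capital D, A or B is
   treated as a placeholder. Returns 0 on success, -1 if out is too small. */
static int expand_def(const char *def, const char *d, const char *a,
                      const char *b, char *out, size_t cap)
{
    assert(def && d && a && b && out);
    if (cap == 0)
        return -1;

    size_t used = 0;
    for (const char *p = def; *p != '\0'; p++) {
        char one[2] = { *p, '\0' };
        const char *piece = one;        /* default: copy the character */
        if (*p == 'D') piece = d;
        else if (*p == 'A') piece = a;
        else if (*p == 'B') piece = b;

        size_t len = strlen(piece);
        if (used + len + 1 > cap)       /* keep room for the terminator */
            return -1;
        memcpy(out + used, piece, len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

With the add definition and the operands of addq r7, 200, r22, this yields `r[22]=r[7]+200;`, and the bne definition becomes a plain C `if (...) goto ...;`.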

Compiled-Mode Simulator

    static void RunCompiled()
    {
      forever {
        switch (pc/4) {
          case 0x4800372c:
            r[2] = RdMem8(r[1]-30768);         // 12000dcb8: ldq r2, -30768(r1)
            s1 = r[0]; s2 = r[4];              // 12000dcd0: cmplt r0, r4, r0
            r[0] = 0; if (s1<s2) r[0] = 1;
            ic += 3;
            if (0!=r[0]) goto L12000dcf0;      // 12000dcdc: bne r0, 12000dcf0
            ic += 1;
            goto L12000c970;                   // 12000dcec: br r31, 12000c970
          L12000dcf0:
            ic += 1;
            pc = r[26]&(~3ULL);                // 12000dcf0: ret r31, (r26), 1
            icnt[fnc(lasttarget)] += ic;
            ic = 0;
            lasttarget = pc;
            break;
          default:
            RunInterpreted();
        } // switch
      } // forever
    }
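
The control structure above (a switch on the program counter whose default case falls back to the interpreter) can be tried out on a toy machine. The sketch below is not SyntSim output: the four-register machine, the `Inst` encoding, and the names `run_mixed`, `run_interpreted`, and `simulate` are all invented for illustration. The loop at pc 1 to 3 plays the role of hot, translated code, while pc 0 and pc 4 stay interpreted.

```c
#include <assert.h>
#include <stdint.h>

enum Op { OP_LI, OP_ADD, OP_SUBI, OP_BNE, OP_HALT };
typedef struct { enum Op op; int a, b, c; } Inst;

/* Toy program: sums r2 + (r2-1) + ... + 1 into r1. */
static const Inst prog[] = {
    { OP_LI,   1, 0, 0 },   /* 0: r1 = 0          (cold) */
    { OP_ADD,  1, 1, 2 },   /* 1: r1 = r1 + r2    (hot)  */
    { OP_SUBI, 2, 2, 1 },   /* 2: r2 = r2 - 1     (hot)  */
    { OP_BNE,  2, 1, 0 },   /* 3: if (r2) goto 1  (hot)  */
    { OP_HALT, 0, 0, 0 },   /* 4: stop            (cold) */
};

typedef struct {
    int64_t r[4];
    int pc, halted;
    uint64_t compiled, interpreted;   /* instructions run in each mode */
} Sim;

/* Interprets exactly one instruction at s->pc. */
static void run_interpreted(Sim *s)
{
    const Inst *i = &prog[s->pc];
    s->interpreted++;
    switch (i->op) {
    case OP_LI:   s->r[i->a] = i->b;                     s->pc++; break;
    case OP_ADD:  s->r[i->a] = s->r[i->b] + s->r[i->c];  s->pc++; break;
    case OP_SUBI: s->r[i->a] = s->r[i->b] - i->c;        s->pc++; break;
    case OP_BNE:  s->pc = s->r[i->a] ? i->b : s->pc + 1;          break;
    case OP_HALT: s->halted = 1;                                  break;
    }
}

/* Mixed mode: translated code for pc 1..3, interpreter for everything else. */
static void run_mixed(Sim *s)
{
    while (!s->halted) {
        switch (s->pc) {
        case 1:
        L1:
            s->r[1] += s->r[2];             /* 1: add  */
            s->r[2] -= 1;                   /* 2: subi */
            s->compiled += 3;
            if (s->r[2] != 0) goto L1;      /* 3: bne  */
            s->pc = 4;
            break;
        default:
            run_interpreted(s);
        }
    }
}

/* Runs the toy program for n > 0 in mixed or purely interpreted mode. */
static Sim simulate(int n, int mixed)
{
    assert(n > 0);
    Sim s = { {0}, 0, 0, 0, 0 };
    s.r[2] = n;
    if (mixed)
        run_mixed(&s);
    else
        while (!s.halted)
            run_interpreted(&s);
    return s;
}
```

Both modes compute the same register state; in mixed mode only the two cold instructions go through the interpreter, which mirrors the slide's point that a small translated region can cover almost all dynamic instructions.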

Related Work
• MINT (1994)
  • Dynamically decompiles short code sequences into functions
• QPT/EEL (1994/1995)
  • Rewrite the executable, use a quite precise algorithm for indirect branches, need dynamic translation
• SuperSim (1996)
  • Static decompilation into C, fully labeled
• UQBT (2000)
  • Decompilation into a special high-level language, static hooks to interpret untranslated code

Evaluation Methodology
• System
  • 750MHz 64-bit Alpha 21264A
  • 64kB L1, 8MB L2, 2GB RAM
  • Tru64 UNIX V5.1
• Benchmarks
  • 20 SPECcpu2000 programs, highly optimized
  • All F77 and C programs except perlbmk
  • Full test, train, and reference runs

Profile vs. Heuristic Performance
• Runtimes include
  • Synthesis time (0.08s)
  • Compilation time (33s)
  • Simulation time (3160s)
• Profile based
  • 6.6x gmean slowdown (2x to 16x)
• Heuristic based
  • 8.7x gmean slowdown (2.2x to 66x)
[Figure: bar chart of slowdown relative to native execution (axis 0 to 24; two bars are labeled 32.8 and 66.3) for ref runs with ref profiles and ref runs with heuristics. Benchmarks: gzip, vpr, gcc, mcf, crafty, parser, gap, vortex, bzip2, twolf, mesa, art, equake, ammp, wupwise, swim, mgrid, applu, sixtrack, apsi, geo_mean]
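
Taking the three times listed on this slide at face value, the share of the total runtime spent on synthesis and compilation is small:

```latex
\[
\frac{0.08\,\mathrm{s} + 33\,\mathrm{s}}{0.08\,\mathrm{s} + 33\,\mathrm{s} + 3160\,\mathrm{s}}
  = \frac{33.08}{3193.08} \approx 0.0104
\]
```

That is roughly 1% of the total, so for runs of this length the one-time cost of generating and compiling the simulator is amortized almost completely.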

Mixed-Mode Performance
• Observations
  • Better profiles help
  • Pure compiled mode is slower than mixed mode with a good profile (24% on train runs)
  • Best c/i ratio decreases with quality of profile
  • 99.9% compiled mode is best with self profile (15% of static instructions)
[Figure: line chart of slowdown relative to native execution (axis 6.0 to 10.0) versus percent interpreted instructions (dynamic; axis 0, 0.001, 0.01, 0.1, 1, 10). Series: ref runs with test profiles, ref runs with train profiles, ref runs with ref profiles, ref runs with heuristics]

Comparison with Interpreters
• SyntSim’s interpreter
  • 2.5x faster than sim-fast
• Mixed mode
  • 19x faster than sim-fast
  • 8x faster on ref runs than SyntSim interpreter (3.6x to 14x)
  • 7x faster on train runs
  • 3.7x faster on test runs
[Figure: bar chart of slowdown relative to native execution (axis 0 to 250; one bar is labeled 282) for ref runs using mixed mode, ref runs using interpreter, and ref runs using sim-fast. Benchmarks: twolf, gzip, gcc, mcf, parser, vortex, bzip2, mesa, art, equake, ammp, mgrid, applu, geo_mean]

Comparison with ATOM
• Adding instrumentation
  • Identical C code
  • Instruction count (ic)
  • Memory hierarchy (memh)
  • Branch predictor (bp)
• Results
  • ic: ATOM is 2x faster
  • rest: SyntSim is 2.6x faster than ATOM
[Figure: bar chart of slowdown relative to native execution (axis 0 to 35) for SyntSim and ATOM with ic, ic+memh, and ic+memh+bp instrumentation]

Conclusions
• Presented a fully automated technique to statically create fast yet portable simulators
  • Interleaves compiled- and interpreted-mode simulation for speed and simplicity
  • Only 6.6x slower than native execution
  • Only 13x slowdown when counting instructions and simulating a memory hierarchy and a branch predictor (warmup)