AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

Rose Liu & Krste Asanović, Computer Architecture Group, MIT CSAIL


  1. AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments
     Rose Liu & Krste Asanović
     Computer Architecture Group, MIT CSAIL

  2. Simulation for Large Design Space Exploration
     • Large design space studies explore thousands of processor designs
       • Identify those that minimize costs and maximize performance
     • Speed vs. accuracy tradeoff
       • Maximize simulation speedup while maintaining sufficient accuracy to identify interesting design points for later detailed simulation
     [Figure: CPI vs. area plot; Pareto-optimal designs lie on the curve]

  3. Reduce Simulated Instructions: Sampling
     • Perform detailed microarchitectural simulation during sample points & functional warming between sample points
       • SimPoints [ASPLOS, 2002], SMARTS [ISCA, 2003]
     • Use efficient checkpoint techniques to reduce simulation time to minutes
       • TurboSMARTS [SIGMETRICS, 2005], Biesbrouck [HiPEAC, 2005]
     [Figure: execution timeline; sample points are simulated in detail]

  4. Reduce Simulated Instructions: Statistical Simulation
     • Generate a short synthetic trace (with statistical properties similar to original workload) for simulation
       • Eeckhout [ISCA, 2004], Oskin [ISCA, 2000], Nussbaum [PACT, 2001]
     [Figure: three-stage flow]
       Stage 1: Program -> Execution-Driven Profiling -> Statistical Image
       Stage 2: Statistical Image -> Synthetic Trace Generation -> Synthetic Trace
       Stage 3: Synthetic Trace + Config -> Simulation -> IPC

  5. AXCIS Framework
     Stage 1 (performed once): Program & Inputs -> Dynamic Trace Compressor -> CIST (Canonical Instruction Segment Table)
     • CIST is machine independent except for branch predictor and cache organizations
     • CIST stores all information needed for performance analysis
     Stage 2: CIST + Configs -> AXCIS Performance Model -> IPC1, IPC2, IPC3
     Configs describe in-order superscalars:
     • Issue width
     • # of functional units
     • # of cache primary-miss tags
     • Latencies
     • Branch penalty

  6. In-Order Superscalar Machine Model
     [Figure: pipeline block diagram; configurable parameters in parentheses]
     • Branch predictor (size & penalty)
     • Blocking Icache (org. & latency)
     • Fetch
     • Issue (issue width)
     • Functional units: ALU, FPU, LSU, ... (number of units, latency)
     • Non-blocking Dcache (organization & latency, # primary miss tags)
     • Memory (latency)
     • Completion

  7. Stage 1: Dynamic Trace Compression
     [Figure: AXCIS framework diagram with Stage 1 (performed once) highlighted: Program & Inputs -> Dynamic Trace Compressor -> CIST]

  8. Instruction Segments
     Events: (dcache, icache, bpred)
       addq (--, hit, correct)
       ldq  (miss, hit, correct)
       subq (--, hit, correct)
       stq  (miss, hit, correct)   <- defining instruction
     [Figure: a bracket labeled "instruction segment" marks the instructions leading up to the defining instruction]
     • An instruction segment captures all performance-critical information associated with a dynamic instruction

  10. Dynamic Trace Compression
      • Program behavior repeats due to loops and repeated function calls
      • Multiple different dynamic instruction segments can have the same behavior (canonically equivalent) regardless of the machine configuration
      • Compress the dynamic trace by storing in a table:
        • 1 copy of each type of segment
        • How often we see it in the dynamic trace

  11. Canonical Instruction Segment Table
      Dynamic trace, with events (dcache, icache, bpred):
        addq (--, hit, correct)
        ldq  (miss, hit, correct)
        addq (--, hit, correct)
        ldq  (miss, hit, correct)
        subq (--, hit, correct)
        stq  (miss, hit, correct)
      CIST after the first addq (new entry):
        Freq  Segment
        1     Int_ALU

  12. Canonical Instruction Segment Table
      CIST after the first ldq (new entry):
        Freq  Segment
        1     Int_ALU
        1     Int_ALU, Load_Miss

  13. Canonical Instruction Segment Table
      CIST after the second addq (new entry):
        Freq  Segment
        1     Int_ALU
        1     Int_ALU, Load_Miss
        1     Load_Miss, Int_ALU

  14. Canonical Instruction Segment Table
      CIST after the second ldq (matches the second entry; Freq 1 -> 2):
        Freq  Segment
        1     Int_ALU
        2     Int_ALU, Load_Miss
        1     Load_Miss, Int_ALU

  15. Canonical Instruction Segment Table
      CIST after subq (matches the third entry; Freq 1 -> 2):
        Freq  Segment
        1     Int_ALU
        2     Int_ALU, Load_Miss
        2     Load_Miss, Int_ALU

  16. Canonical Instruction Segment Table
      CIST after stq (new entry):
        Freq  Segment
        1     Int_ALU
        2     Int_ALU, Load_Miss
        2     Load_Miss, Int_ALU
        1     Load_Miss, Int_ALU, Store_Miss
      Total ins: 6
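The table-building steps on the CIST slides can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' compressor: it assumes the instruction segments have already been extracted from the trace, and the name `build_cist` and the tuple-of-type-names representation are assumptions made for this example. The six segments are those of the six-instruction example trace (addq, ldq, addq, ldq, subq, stq).

```python
def build_cist(segments):
    """Store one copy of each distinct segment plus how often it occurs."""
    cist = {}  # segment (tuple of instruction types) -> frequency
    for seg in segments:
        cist[seg] = cist.get(seg, 0) + 1
    return cist

# One segment per dynamic instruction; the defining instruction comes last.
segments = [
    ("Int_ALU",),                            # addq
    ("Int_ALU", "Load_Miss"),                # ldq
    ("Load_Miss", "Int_ALU"),                # addq
    ("Int_ALU", "Load_Miss"),                # ldq
    ("Load_Miss", "Int_ALU"),                # subq
    ("Load_Miss", "Int_ALU", "Store_Miss"),  # stq
]

cist = build_cist(segments)
for seg, freq in cist.items():
    print(freq, ", ".join(seg))
print("Total ins:", sum(cist.values()))
```

The frequencies sum to the number of dynamic instructions, so the table stays small for repetitive traces while still accounting for every instruction.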

  17. Stage 2: AXCIS Performance Model
      [Figure: AXCIS framework diagram with Stage 2 highlighted: CIST + Config -> AXCIS Performance Model -> IPC]

  18. AXCIS Performance Model
      • Calculates IPC using a single linear dynamic programming pass over the CIST entries
      • Total work is proportional to the # of CIST entries

      $$\mathrm{IPC} = \frac{\text{Total Ins}}{\text{Total Cycles}} = \frac{\text{Total Ins}}{\text{Total Ins} + \text{Total Effective Stalls}}$$

      $$\text{Total Effective Stalls} = \sum_{i=1}^{\text{CIST Size}} \mathrm{Freq}(i) \cdot \mathrm{EffectiveStalls}(\mathrm{DefiningIns}(i))$$

      $$\mathrm{EffectiveStalls} = \mathrm{MAX}\big(\mathrm{stalls}(\text{DataHazards}),\ \mathrm{stalls}(\text{StructuralHazards}),\ \mathrm{stalls}(\text{ControlFlowHazards})\big)$$
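As a rough illustration of these formulas, the Python sketch below computes IPC from per-entry frequencies and effective stalls. The function names and the `(freq, effective_stalls)` row layout are assumptions for this example. The frequencies and the first three stall values follow the example table on the next slide; the last entry's value of -1 is an assumed placeholder, since that slide leaves it as "???".

```python
def effective_stalls(data, structural, control):
    """Effective stalls of a defining instruction: the largest of the three hazard stalls."""
    return max(data, structural, control)

def ipc(cist_rows):
    """cist_rows: (freq, effective_stalls) pairs, one per CIST entry."""
    rows = list(cist_rows)
    total_ins = sum(freq for freq, _ in rows)
    total_stalls = sum(freq * stalls for freq, stalls in rows)
    return total_ins / (total_ins + total_stalls)

# (freq, effective stalls of the defining instruction); last value is assumed.
rows = [(1, 0), (2, 2), (2, 99), (1, -1)]
print(ipc(rows))  # 6 instructions over 6 + 201 = 207 cycles
```

The work is a single pass over the table rows, which is why the cost scales with the number of CIST entries rather than with the length of the dynamic trace.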

  19. Performance Model Calculations
      For each defining instruction:
      • Calculate its effective stalls & its corresponding microarchitecture state snapshot
      • Follow dependencies to look up the effective stalls & state of other instructions in previous entries

        Freq  Segment                         Stalls  State
        1     Int_ALU                         0
        2     Int_ALU, Load_Miss              2
        2     Load_Miss, Int_ALU              99
        1     Load_Miss, Int_ALU, Store_Miss  ???     ???
      Total ins: 6
      In the last entry, the Int_ALU's 99 stalls are looked up in a previous segment; the Store_Miss values (???) are the ones to calculate.

  20. Stall Cycles From Data Hazards
      CIST entry under evaluation (Freq 1):
        Ins         Stalls  State
        Load_Miss   …
        Int_ALU     99
        Store_Miss  ???     ???
      Input configuration:
        Ins Type   Latency (cycles)
        Int_ALU    3
        Load_Miss  100
      • Use data dependencies (e.g. RAW) to detect data hazards
      • Stall calculation for the Store_Miss:

      $$\mathrm{Stalls}(\text{DataHazards}) = \mathrm{MAX}\big(-1,\ \mathrm{Latency}(\text{producer} = \text{Load\_Miss}) - \mathrm{DepDist} - \mathrm{EffectiveStalls}(\text{IntermediateIns} = \text{Int\_ALU})\big)$$

      $$= \mathrm{MAX}(-1,\ 100 - 2 - 99) = -1 \text{ stalls (can issue with previous instruction)}$$
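The data-hazard formula can be checked numerically with a short Python sketch. The function name is hypothetical, and summing over several intermediate instructions is an assumed generalization; the slide shows only one intermediate instruction.

```python
def data_hazard_stalls(producer_latency, dep_dist, intermediate_stalls=()):
    """Stalls from a data dependency, floored at -1 (issue with the previous instruction).

    intermediate_stalls: effective stalls of the instructions between the
    producer and the consumer (assumption: they simply add up).
    """
    return max(-1, producer_latency - dep_dist - sum(intermediate_stalls))

# Store_Miss depends on the Load_Miss two instructions earlier;
# the intermediate Int_ALU already stalled for 99 cycles.
print(data_hazard_stalls(100, 2, [99]))  # -1

# Applying the same formula to the other example entries:
print(data_hazard_stalls(3, 1))    # Load_Miss right after an Int_ALU producer: 2
print(data_hazard_stalls(100, 1))  # Int_ALU right after a Load_Miss producer: 99
```

The last two results agree with the stall values 2 and 99 listed in the example table, assuming a dependence distance of 1 in both cases.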

  21. Stall Cycles from Structural Hazards
      CIST entry under evaluation (Freq 1):
        Ins         Stalls  Microarchitectural State
        Load_Miss   …
        Int_ALU     99
        Store_Miss  ???     ???
      • CISTs record special dependencies to capture all possible structural hazards across all configurations
      • The AXCIS performance model follows these special dependencies to find the necessary microarchitectural states to:
        1. Determine if a structural hazard exists & the number of stall cycles until it is resolved
        2. Derive the microarchitectural state after issuing the current defining instruction

  22. Stall Cycles From Control Flow Hazards
      CIST entry under evaluation (Freq 1):
        Ins         Icache  Branch Pred.
        Load_Miss   …       …
        Int_ALU     …       …
        Store_Miss  hit     correct & not taken
      • Control flow events directly map to stall cycles

        Icache  Bpred                        Stalls
        Hit     Incorrect & taken/not taken  Mispred penalty
        Hit     Correct & taken              0
        Hit     Correct & not taken          -1
        Miss    Incorrect & taken/not taken  Memory latency + mispred penalty
        Miss    Correct & taken              Memory latency
        Miss    Correct & not taken          Memory latency - 1
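The mapping in this table is regular enough to write as a small function. The Python sketch below is one way to encode it; the function and parameter names are assumptions, and the latency and penalty values in the usage lines are made up for illustration.

```python
def control_flow_stalls(icache_hit, bpred_correct, taken, mem_latency, mispred_penalty):
    """Map (icache, bpred) events to stall cycles, following the table above."""
    if not bpred_correct:
        stalls = mispred_penalty  # same for taken and not taken
    elif taken:
        stalls = 0
    else:
        stalls = -1               # can issue with the previous instruction
    if not icache_hit:
        stalls += mem_latency     # an icache miss adds the memory latency
    return stalls

# Store_Miss from the example entry: icache hit, correct & not taken.
print(control_flow_stalls(True, True, False, mem_latency=100, mispred_penalty=3))    # -1
# Icache miss with a mispredicted branch (illustrative parameter values).
print(control_flow_stalls(False, False, True, mem_latency=100, mispred_penalty=3))   # 103
```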

  23. Lossless Compression Scheme
      • Lossless compression scheme (perfect accuracy):
        • Compress two segments if they always experience the same stall cycles regardless of the machine configuration
        • Impractical to implement within the Dynamic Trace Compressor
      Example segments:
        Segment A                  Segment B
        addq (--, hit, correct)    addq (--, hit, correct)
        ldiq (--, hit, correct)    stq  (miss, hit, correct)
        stq  (miss, hit, correct)
      ldiq always issues with addq

  24. Three Compression Schemes
      • Instruction Characteristics Based Compression:
        • Compress segments that "look" alike (i.e. have the same length, instruction types, dependence distances, branch and cache behaviors)
      • Limit Configurations Based Compression:
        • Compress segments whose defining instructions have the same instruction types, stalls and microarchitectural state under the 2 configurations simulated during trace compression
      • Relaxed Limit Configurations Based Compression:
        • Relaxed version of the limit-based scheme: does not compare microarchitectural state
        • Improves compression at the cost of accuracy
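One way to see how the three schemes differ is to treat each as a different equality key for grouping segments. The Python sketch below is a hypothetical illustration, not the authors' implementation: the `Segment` fields, the opaque string state snapshots, and all function names are assumptions made for this example.

```python
from collections import Counter
from typing import NamedTuple, Tuple

class Segment(NamedTuple):
    types: Tuple[str, ...]                    # instruction types, defining instruction last
    dep_dists: Tuple[int, ...]                # dependence distances
    events: Tuple[Tuple[str, str, str], ...]  # (dcache, icache, bpred) per instruction
    limit_stalls: Tuple[int, int]             # defining-instruction stalls under the 2 limit configs
    limit_states: Tuple[str, str]             # opaque state snapshots under the 2 limit configs

def characteristics_key(s):
    # Segments that "look" alike: length, types, dependence distances, events.
    return (len(s.types), s.types, s.dep_dists, s.events)

def limit_key(s):
    # Same defining-instruction type, stalls, and state under both limit configs.
    return (s.types[-1], s.limit_stalls, s.limit_states)

def relaxed_limit_key(s):
    # As limit_key, but without comparing microarchitectural state.
    return (s.types[-1], s.limit_stalls)

def compress(segments, key):
    """Count segments per equivalence class; fewer classes means more compression."""
    return Counter(key(s) for s in segments)

ev = (("miss", "hit", "correct"), ("--", "hit", "correct"))
a = Segment(("Load_Miss", "Int_ALU"), (1,), ev, (0, 99), ("s0", "s1"))
b = Segment(("Int_ALU", "Int_ALU"), (1,), ev, (0, 99), ("s0", "s2"))
segments = [a, b, a]

for key in (characteristics_key, limit_key, relaxed_limit_key):
    print(key.__name__, len(compress(segments, key)))
```

In this made-up input, only the relaxed key merges all three segments into one entry, which mirrors the slide's point that dropping the state comparison improves compression while giving up some accuracy.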
