
StressRight: Finding the Right Stress for Accurate In-development System Evaluation

StressRight: Finding the Right Stress for Accurate In-development System Evaluation. Jaewon Lee (1), Hanhwi Jang (1), Jae-eon Jo (1), Gyu-Hyeon Lee (2), Jangwoo Kim (2). High Performance Computing Lab. (1) Pohang University of Science and Technology (POSTECH), (2) Seoul National University


  1. StressRight: Finding the Right Stress for Accurate In-development System Evaluation
     Jaewon Lee (1), Hanhwi Jang (1), Jae-eon Jo (1), Gyu-Hyeon Lee (2), Jangwoo Kim (2)
     High Performance Computing Lab
     (1) Pohang University of Science and Technology (POSTECH), (2) Seoul National University

  2. Configuring Workloads
     • Modern workloads are configurable
       − No definite answer: depends on the usage scenario

  3. Evaluating a System
     [Diagram: workloads run on the system, which yields a performance report (e.g., latency, throughput); the report is used to reconfigure the workloads and the system]

  4. Evaluating an In-development System
     [Diagram: workloads run on a system simulator / emulator; no performance report is available, so one has to investigate microarchitecture details]
     • System modeling tools: too slow or too inaccurate

  5. Workload Configuration Matters
     • Configuration ⇒ System behavior
       − The system executes different code patterns
         ⇒ Different analysis results & system design insights
     • Must configure to represent actual usage scenarios

  6. Index
     • Introduction / Motivation
     • Limitations
     • Proposed idea: StressRight
     • Evaluation
     • Conclusion

  7. Limitations (of the Existing Methods)
     • Inaccurate insights about the configurations
       − Short simulation: No high-level metrics
       − DBT-based simulation: No kernel considerations
       − Emulator: No timing considerations
     [Figure: normalized latency vs. normalized memcached query throughput]

  8. Index
     • Introduction / Motivation
     • Limitations
     • Proposed idea: StressRight
       − Goals & Key ideas
       − Method details
     • Conclusion

  9. StressRight: Goals
     • Goals
       − Quickly derive workload-reported performance metrics (e.g., latency, throughput)
         ⇒ To explore workload configurations for in-development systems
         ⇒ To evaluate the systems with the right stress behaviors
     • Requirements
       − Long workload execution
         ⇒ Must observe high-level workload-reported performance metrics
       − Efficient performance model
         ⇒ To quickly derive the performance metrics

  10. StressRight: Key Ideas
      • Long workload execution
        − Use timing-agnostic platforms (e.g., emulators)
          ⇒ Extract user & kernel behavior, analyze performance later
      • Efficient performance model
        − Leverage redundancy in workloads
          ⇒ Analyze only the unique behaviors (i.e., code blocks)
          ⇒ Overall behavior = ∑ Analyzed unique behaviors
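
The "overall behavior = sum of analyzed unique behaviors" idea can be illustrated with a small sketch. It assumes each unique code block has already been analyzed once to obtain an IPC, and that total time is the count-weighted sum of per-block cycles. The function name, block lengths, and IPC values below are hypothetical and are not taken from the talk.

```python
from collections import Counter

def total_cycles(block_trace, block_lengths, block_ipc):
    """Sum per-block cycles, analyzing each unique block only once.

    block_trace   -- sequence of block IDs in execution order
    block_lengths -- instructions per unique block (assumed known)
    block_ipc     -- IPC estimated once per unique block (assumed known)
    """
    counts = Counter(block_trace)  # how often each unique block ran
    return sum(counts[b] * block_lengths[b] / block_ipc[b] for b in counts)

# Hypothetical trace: blocks A, B, C as in the overview slides.
trace = ["A", "B", "A", "A", "C", "A", "B"]
lengths = {"A": 40, "B": 20, "C": 10}
ipc = {"A": 2.0, "B": 1.0, "C": 0.5}

cycles = total_cycles(trace, lengths, ipc)
print(cycles)
```

Only three blocks are analyzed here even though seven executed, which is where the reduction in analysis load comes from.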

  11. StressRight: Overview
      (1) Emulation (inaccurate): code blocks with no timing, 1-IPC
          Core 0: A B A A
          Core 1: C A B
          ⇒ 100 Ops/sec

  12. StressRight: Overview
      (1) Emulation (inaccurate): code blocks with no timing, 1-IPC
          Core 0: A B A A
          Core 1: C A B
          ⇒ 100 Ops/sec
      (2) Functional cache / branch simulation: memory / branch trace ⇒ hit rate over time

  13. StressRight: Overview
      (1) Emulation (inaccurate): code blocks with no timing, 1-IPC
          Core 0: A B A A
          Core 1: C A B
          ⇒ 100 Ops/sec
      (2) Functional cache / branch simulation: memory / branch trace ⇒ hit rate over time
      (3) Timing reconstruction of the unique blocks A, B, C

  14. StressRight: Overview
      (1) Emulation (inaccurate): code blocks with no timing, 1-IPC
          Core 0: A B A A
          Core 1: C A B
          ⇒ 100 Ops/sec
      (2) Functional cache / branch simulation: hit rate over time
      (3) Timing reconstruction of the unique blocks A, B, C under high, low, and medium cache hit rates

  15. StressRight: Overview
      (1) Emulation (inaccurate): code blocks with no timing, 1-IPC
          Core 0: A B A A
          Core 1: C A B
          ⇒ 100 Ops/sec
      (2) Functional cache / branch simulation: hit rate over time
      (3) Timing reconstruction of the unique blocks A, B, C
      (4) Reschedule & reinterpret (accurate)
          Core 0: A B A
          Core 1: C A B A
          ⇒ 120 Ops/sec
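
The overview derives a cache hit rate from the emulator's memory trace with a functional (timing-free) simulation. The following is a minimal sketch of that idea, assuming a direct-mapped cache; the function name, cache geometry, and trace are hypothetical. The talk states that its functional simulators are written in C++, so Python is used here only for illustration.

```python
def hit_rate(addresses, num_sets=4, line_bytes=64):
    """Functional direct-mapped cache: tracks tags only, no timing."""
    tags = [None] * num_sets      # one cached line tag per set
    hits = 0
    for addr in addresses:
        line = addr // line_bytes  # cache-line number of this access
        idx = line % num_sets      # set the line maps to
        tag = line // num_sets     # identifies the line within the set
        if tags[idx] == tag:
            hits += 1
        else:
            tags[idx] = tag        # miss: fill (and evict) the line
    return hits / len(addresses) if addresses else 0.0

# Hypothetical byte-address trace.
trace = [0, 8, 64, 0, 256, 0]
rate = hit_rate(trace)
print(f"{rate:.2f}")
```

In this trace, address 256 maps to the same set as address 0 and evicts it, so the final access to 0 misses.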

  16. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      [Diagram: issue queue (IQ) and reorder buffer (ROB)]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009

  17. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      [Diagram: IQ and ROB with empty entries at the end of a block; the issue rate drops (not true for longer traces)]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009

  18. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      • Solution: Consider hypothetical next block
        − Assume: next block issue rate ≈ current block issue rate
        − Use power law* to further adjust the rate
      [Diagram: IQ and ROB with empty entries at the end of a block; the issue rate drops (not true for longer traces)]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009

  19. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      • Solution: Consider hypothetical next block
        − Assume: next block issue rate ≈ current block issue rate
        − Use power law* to further adjust the rate
      [Diagram: IQ and ROB; current block avg. issue = 2.0 IPC; the hypothetical next block issues proportionally to 2.0 IPC]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009

  20. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      • Solution: Consider hypothetical next block
        − Assume: next block issue rate ≈ current block issue rate
        − Use power law* to further adjust the rate
      [Diagram: IQ and ROB; current block avg. issue = 2.0 IPC; the hypothetical next block issues proportionally to 2.0 IPC; larger window ⇒ issue more]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009
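
The "larger window, issue more" adjustment can be sketched as a power-law scaling of the issue rate with the instruction-window size. This is only an illustration of the general shape of such a rule: the function name, the exponent of 0.5, and the issue width of 4 are assumptions made for this example and are not values given in the talk or taken from the cited paper.

```python
def adjusted_issue_rate(base_ipc, base_window, window, exponent=0.5, width=4.0):
    """Scale an issue rate with window size using an assumed power law.

    base_ipc    -- average issue rate observed for the current block
    base_window -- window size at which base_ipc was observed
    window      -- effective window size once a hypothetical next block
                   keeps the window filled
    exponent    -- assumed power-law exponent (illustrative only)
    width       -- assumed maximum issue width, used as an upper bound
    """
    scaled = base_ipc * (window / base_window) ** exponent
    return min(scaled, width)

# A block that issues at 2.0 IPC, evaluated with a 4x larger window.
rate_large = adjusted_issue_rate(2.0, 32, 128)
print(rate_large)
```

With the same window size the rate is left unchanged, and the cap keeps the estimate from exceeding what the assumed pipeline could issue.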

  21. StressRight: Multiple Performances
      • Challenge: Difficult to model every scenario
      [Diagram: a code block containing five memory accesses]

  22. StressRight: Multiple Performances
      • Challenge: Difficult to model every scenario
      [Diagram: a code block containing five memory accesses]
        90% $ hit → Analysis → IPC A
        50% $ hit → Analysis → IPC B
        30% $ hit → Analysis → IPC C
        …

  23. StressRight: Multiple Performances
      • Challenge: Difficult to model every scenario
      • Solution: Mix template scenarios
        − Randomly generate scenarios & mix them
        − A few templates are enough
      [Diagram: a code block containing five memory accesses]
        60% hit template: Hit Miss Hit Hit Miss ⇒ IPC 2.0
        40% hit template: Hit Miss Miss Hit Miss ⇒ IPC 1.6

  24. StressRight: Multiple Performances
      • Challenge: Difficult to model every scenario
      • Solution: Mix template scenarios
        − Randomly generate scenarios & mix them
        − A few templates are enough
      [Diagram: a code block containing five memory accesses]
        60% hit template: Hit Miss Hit Hit Miss ⇒ IPC 2.0
        40% hit template: Hit Miss Miss Hit Miss ⇒ IPC 1.6
        Mixed, 50% hit ⇒ IPC 1.8
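
The slide's numbers (60% hit gives IPC 2.0, 40% hit gives IPC 1.6, and the mix at 50% gives IPC 1.8) are consistent with linear interpolation between the two nearest templates. The sketch below assumes that this is how the mixing works and that IPC is interpolated directly; the talk does not spell out the exact rule, and the function name is hypothetical.

```python
def mix_templates(templates, hit_rate):
    """Estimate IPC at hit_rate by interpolating between templates.

    templates -- list of (hit_rate, ipc) pairs analyzed in advance
    """
    pts = sorted(templates)
    if hit_rate <= pts[0][0]:
        return pts[0][1]           # clamp below the lowest template
    if hit_rate >= pts[-1][0]:
        return pts[-1][1]          # clamp above the highest template
    for (h0, i0), (h1, i1) in zip(pts, pts[1:]):
        if h0 <= hit_rate <= h1:
            w = (hit_rate - h0) / (h1 - h0)   # weight of upper template
            return (1 - w) * i0 + w * i1

# Templates from the slide: 40% hit -> IPC 1.6, 60% hit -> IPC 2.0.
templates = [(0.4, 1.6), (0.6, 2.0)]
mixed_ipc = mix_templates(templates, 0.5)
print(round(mixed_ipc, 3))
```

Hit rates outside the template range are clamped to the nearest template here, which is a design choice of this sketch rather than something the slides specify.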

  25. StressRight: Rescheduling
      • Basic scheduling method
        − Schedule to the earliest possible slot
      • Three rules
        − Rule 1: Blocks from a thread execute serially
        − Rule 2: Critical sections shouldn’t overlap
        − Rule 3: Threads should wait for barriers

  26. StressRight: Rescheduling
      • Rule 1: Blocks from a thread execute serially
        − Tag code blocks with the executor thread ID
        − Prohibit blocks from a thread from running concurrently
      [Diagram: before, two blocks of Thread 1 overlap on Core 0 and Core 1; after, Core 1 idles until the Thread 1 block on Core 0 finishes]

  27. StressRight: Rescheduling
      • Rule 2: Critical sections shouldn’t overlap
        − Tag code blocks with synchronization variable ID (if applicable)
        − Prohibit the critical sections from overlapping
      [Diagram: before, critical section A of Thread 1 (Core 0) overlaps critical section A of Thread 2 (Core 1); after, Core 1 idles until Thread 1 leaves A]

  28. StressRight: Rescheduling
      • Rule 3: Threads should wait for barriers
        − Tag code blocks related to barrier operations (if applicable)
        − Prohibit the scheduling before the last barrier_wait()
      [Diagram: before, Thread 1 on Core 0 continues past its barrier_wait() while Thread 2 on Core 1 has not yet reached the last barrier_wait(); after, Core 0 idles until the last barrier_wait() completes]
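
The three rescheduling rules can be sketched as a greedy list scheduler that walks the blocks in emulation order and places each one at the earliest start time the rules allow. This is a simplified illustration, not the authors' scheduler: the dictionary keys (`thread`, `cycles`, `lock`, `barrier`) and the tie-breaking policy are assumptions, and the sketch does not back-fill idle gaps on a core.

```python
def reschedule(blocks, num_cores):
    """Place blocks greedily; returns a list of (core, start, end)."""
    core_free = [0.0] * num_cores  # time at which each core becomes free
    thread_ready = {}              # Rule 1: end of each thread's last block
    lock_free = {}                 # Rule 2: end of last critical section per lock
    barrier_done = {}              # Rule 3: end of the last barrier_wait() seen
    pending_barrier = {}           # barrier each thread must wait on next
    schedule = []
    for blk in blocks:
        tid = blk["thread"]
        earliest = thread_ready.get(tid, 0.0)
        lock = blk.get("lock")
        if lock is not None:
            earliest = max(earliest, lock_free.get(lock, 0.0))
        if tid in pending_barrier:
            earliest = max(earliest, barrier_done[pending_barrier.pop(tid)])
        # Earliest start wins; ties go to the core that idles the least.
        core = min(range(num_cores),
                   key=lambda c: (max(core_free[c], earliest), -core_free[c]))
        start = max(core_free[core], earliest)
        end = start + blk["cycles"]
        core_free[core] = end
        thread_ready[tid] = end
        if lock is not None:
            lock_free[lock] = end
        barrier = blk.get("barrier")
        if barrier is not None:
            barrier_done[barrier] = max(barrier_done.get(barrier, 0.0), end)
            pending_barrier[tid] = barrier
        schedule.append((core, start, end))
    return schedule

# Hypothetical trace: threads 1 and 2 contend for lock "A".
blocks = [
    {"thread": 1, "cycles": 10, "lock": "A"},
    {"thread": 2, "cycles": 10, "lock": "A"},
    {"thread": 1, "cycles": 5},
]
schedule = reschedule(blocks, num_cores=2)
print(schedule)
```

Processing blocks in emulation order matters for Rule 3: because the emulator already honored the barrier, every `barrier_wait()` block of a barrier has been placed before any block that follows it, so `barrier_done` is final by the time it is read.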

  29. Index
      • Introduction / Motivation
      • Limitations
      • Proposed idea: StressRight
      • Evaluation
      • Conclusion

  30. Evaluation
      • Quantitative analysis
        − Why StressRight would work well
      • Accuracy and speed
        − Comparison with cycle-level simulation (MARSSx86)
        − Model 1 / 12 / 16 OoO x86 cores
        − SPEC, PARSEC, memcached
      • Implementation
        − Emulation: QEMU
        − Reconstruction models: Python
        − Functional simulators: C++

  31. Quantitative Analysis
      • Efficiency of the method
        − # instructions: full-execution vs. unique code blocks
        − Orders of magnitude reduction in the analysis load
      *mcd: memcached, BS: blackscholes, BT: bodytrack, SW: swaptions, DD: dedup

  32. Quantitative Analysis
      • Accuracy of the dynamic resource models
        − Functional simulations are accurate enough
      [Figure: functional vs. cycle-level memory simulation]


  34. Accuracy: SPEC
      • Validating the pipeline model
        − Correctly estimates the first-order performance
          ⇒ Improvement in progress: Better memory model

  35. Accuracy: PARSEC
      • Validating the scheduler
        − Correctly estimates the scaling behavior
          ⇒ Improvement in progress: Barrier synchronizations
      *We model a 12-core system
