Fast and Accurate Microarchitectural Simulation with ZSim
Daniel Sanchez, Nathan Beckmann, Anurag Mukkara, Po-An Tsai (MIT CSAIL)
MICRO-48 Tutorial, December 5, 2015

Welcome!
Agenda (as captured): 8:30-9:10 Intro and Overview; 9:10 ...


Parallelization Techniques (slide 17)
- Parallel Discrete Event Simulation (PDES), sketched below:
  - Divide components across host threads
  - Execute events from each component, maintaining the illusion of full order
  - Accurate, but not scalable
- Lax synchronization: allow skews above inter-component latencies and tolerate ordering violations
  - Scalable, but inaccurate
[Diagram: two host threads simulate Core 0/Core 1, L3 Bank 0/1, and Mem 0; pending events at cycles 5, 10, and 15 execute with skew kept below 10 cycles]
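
A minimal sketch of the conservative-PDES loop described above (illustrative names and structure, not any particular simulator's code): each host thread executes its components' events in timestamp order, but stalls whenever it would run more than the allowed skew ahead of the slowest host thread.

    #include <atomic>
    #include <cstdint>
    #include <queue>
    #include <vector>

    // Toy event for one simulated component (core, cache bank, memory controller).
    struct Event {
        uint64_t cycle;   // simulated timestamp
        int component;    // which component the event belongs to
    };
    struct LaterFirst {
        bool operator()(const Event& a, const Event& b) const { return a.cycle > b.cycle; }
    };

    constexpr uint64_t kMaxSkew = 10;            // smallest inter-component latency
    std::atomic<uint64_t> slowestThreadCycle{0}; // min clock across host threads (updated elsewhere)

    // One host thread's conservative-PDES loop: execute events in timestamp order,
    // but never run more than kMaxSkew cycles ahead of the slowest host thread,
    // so cross-component ordering is preserved.
    void pdesThreadLoop(std::priority_queue<Event, std::vector<Event>, LaterFirst>& myEvents) {
        while (!myEvents.empty()) {
            Event ev = myEvents.top();
            while (ev.cycle > slowestThreadCycle.load() + kMaxSkew) {
                // spin/wait until the other host threads catch up
            }
            myEvents.pop();
            // simulateEvent(ev);  // may enqueue new events on this or other threads
        }
    }

Lax synchronization corresponds to dropping the inner wait loop: threads never stall, so they scale, but dependent events may execute out of order.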

Characterizing Interference (slide 18)
- Path-altering interference: if we simulate two accesses out of order, their paths through the memory hierarchy change
- Path-preserving interference: if we simulate two accesses out of order, their timing changes but their paths do not
[Diagrams: path-altering -- Core 0 and Core 1 both issue GETS A to the LLC, so swapping the order swaps which access misses to memory and which one hits; path-preserving -- with a blocking LLC, Core 0's GETS A (miss) and Core 1's GETS B (hit) follow the same paths in either order, only their timing shifts]

Characterizing Interference (slide 19)
- Accesses with path-altering interference, measured with barrier synchronization every 1K/10K/100K cycles on 64 cores: about 1 in 10K accesses
- Path-altering interference is extremely rare in small intervals
- Strategy:
  - Simulate path-preserving interference faithfully
  - Ignore (but optionally profile) path-altering interference

Bound-Weave Parallelization (slide 20)
- Divide the simulation into small intervals (e.g., 1000 cycles)
- Two parallel phases per interval: bound and weave (a structural sketch of the interval loop follows)
  - Bound phase: find paths
  - Weave phase: find timings
- Bound-weave is equivalent to PDES for path-preserving interference
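
A structural sketch of the per-interval loop (hypothetical types and method names; shown sequentially here, whereas zsim runs both phases across multiple host threads):

    #include <cstdint>
    #include <vector>

    // Hypothetical stand-ins for simulated cores and weave-phase domains.
    struct SimCore {
        uint64_t cycles = 0;
        void simulateUntil(uint64_t limit) { cycles = limit; /* run DBT'd code, record access trace */ }
        void applyWeaveFeedback(uint64_t delay) { cycles += delay; }
    };
    struct SimDomain {
        void replayEventsUntil(uint64_t limit) { (void)limit; /* event-driven replay of recorded traces */ }
    };

    // Sketch of the per-interval bound-weave loop.
    void runSimulation(std::vector<SimCore>& cores, std::vector<SimDomain>& domains,
                       uint64_t totalCycles, uint64_t intervalLen = 1000) {
        for (uint64_t limit = intervalLen; limit <= totalCycles; limit += intervalLen) {
            // Bound phase: simulate every core up to the interval boundary with
            // unordered accesses at minimum latencies, gathering access traces.
            for (auto& core : cores) core.simulateUntil(limit);

            // Weave phase: event-driven simulation of the gathered traces, one
            // host thread per domain, producing the actual (contended) timings.
            for (auto& dom : domains) dom.replayEventsUntil(limit);

            // Feedback: adjust each core's cycle count by the extra delay the
            // weave phase found for it (zero in this stub).
            for (auto& core : cores) core.applyWeaveFeedback(0);
        }
    }

The key property is that the bound phase fixes the paths (and thus which events exist), so the weave phase can replay them as an ordinary event-driven simulation.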

Bound-Weave Example (slide 21)
- 2-core host simulating a 4-core system (per-core L1I/L1D and L2, four L3 banks, two memory controllers), with 1000-cycle intervals
- Divide the simulated components among 2 domains
- Bound phase: parallel simulation until cycle 1000, gathering access traces (host thread 0 simulates cores 0 and 1, host thread 1 simulates cores 3 and 2)
- Weave phase: parallel event-driven simulation of the gathered traces until actual cycle 1000 (host thread 0 takes domain 0, host thread 1 takes domain 1)
- Feedback: adjust core cycles, then run the next bound phase (until cycle 2000), and so on

Example: Bound Phase (slide 22)
- Host thread 0 simulates core 0 and records a trace of timestamped events (see the sketch below)
- Edges fix the minimum latency between events
- Minimum L3 and main-memory latencies are used (no interference yet)
[Trace diagram: core events Core0@30, @60, @90, @250, @290; memory-hierarchy events L3b1 HIT @50, L3b0 MISS @80, Mem1 READ @110, L3b0 RESP @230, L3b3 HIT @270; edges carry minimum latencies (20-30 cycles between adjacent events, ~120 cycles for the memory access)]
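
A minimal sketch of what one recorded trace node could look like (assumed structure, not zsim's actual timing-event classes): each event carries the lower-bound cycle computed during the bound phase plus outgoing edges annotated with minimum latencies.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // One node of a bound-phase trace: the event's lower-bound cycle (computed
    // with minimum, contention-free latencies) and edges to successor events,
    // each carrying the minimum latency between the two events.
    struct TraceEvent {
        uint64_t lowerBoundCycle;
        std::vector<std::pair<TraceEvent*, uint32_t>> succs;  // (successor, min latency)
    };

    // For example, the slide's L3 bank 0 MISS at cycle 80 followed 30 cycles
    // later by the memory READ at cycle 110:
    TraceEvent memRead{110, {}};
    TraceEvent l3Miss{80, {{&memRead, 30}}};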

Example: Weave Phase (slide 23)
- Host threads simulate the components from domains 0 and 1 (in the example, thread 0 handles the core 0, L3b0, and L3b1 events, while thread 1 handles the Mem1 and L3b3 events)
- Host threads only sync when needed; e.g., thread 1 simulates other events (not shown) until cycle 110, then syncs
- Lower bounds guarantee no order violations

Example: Weave Phase (slide 24)
- Delays propagate as events are simulated (a sketch of this propagation follows)
- e.g., a DRAM row miss at Mem1 adds 50 cycles: the memory-access edge grows from 120 to 170 cycles, and the dependent events slide later (L3b0 RESP 230 -> 280, Core0 250 -> 300, L3b3 HIT 270 -> 320, final Core0 event 290 -> 340)
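
A sketch of that propagation rule, again with assumed structure: when an event finishes later than its lower bound, each successor is pushed to start no earlier than the event's actual finish time plus the edge's minimum latency.

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Weave-phase view of a trace node: lower bound from the bound phase plus
    // the actual cycle assigned during event-driven replay.
    struct TimingEvent {
        uint64_t lowerBound;                                    // bound-phase estimate
        uint64_t actualCycle = 0;                               // filled in by the weave phase
        std::vector<std::pair<TimingEvent*, uint32_t>> succs;   // (successor, min edge latency)
    };

    // Execute an event with some weave-phase-only delay (e.g., +50 cycles for a
    // DRAM row-buffer miss) and propagate the delay to its successors: a
    // successor can start no earlier than its own lower bound, and no earlier
    // than this event's actual finish plus the edge's minimum latency.
    void executeEvent(TimingEvent* ev, uint64_t extraDelay) {
        ev->actualCycle = ev->lowerBound + extraDelay;
        for (auto& edge : ev->succs) {
            TimingEvent* succ = edge.first;
            succ->lowerBound = std::max(succ->lowerBound, ev->actualCycle + edge.second);
        }
    }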

Bound-Weave Scalability (slide 25)
- Bound phase scales almost linearly, using a novel shared-memory synchronization protocol (covered later)
- Weave phase scales much better than PDES:
  - Threads only need to sync when an event crosses domains
  - A lot of work has shifted to the bound phase
- Need bound and weave models for each component, but the division is often very natural
  - e.g., caches: hit/miss in the bound phase; MSHRs, pipelined accesses, and port contention in the weave phase (see the sketch below)
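
A toy illustration of that bound/weave split for a cache (hypothetical interface and latencies, not zsim's classes): the bound phase resolves hit/miss and charges a contention-free latency, while the weave phase later adds contention-related delays to the recorded events.

    #include <cstdint>
    #include <unordered_set>

    // Toy cache split into bound and weave models (hypothetical interface and
    // latencies; zsim's real cache models are considerably more detailed).
    class ToyCache {
        std::unordered_set<uint64_t> presentLines;  // which lines are cached ("path" state)

    public:
        // Bound phase: decide hit/miss -- the part that determines the access path --
        // and return the minimum, contention-free latency. A real model would also
        // record a weave-phase event here.
        uint32_t boundAccess(uint64_t lineAddr) {
            bool hit = presentLines.count(lineAddr) != 0;
            if (!hit) presentLines.insert(lineAddr);  // fill on miss (no evictions in this toy)
            return hit ? 20u : 120u;                  // placeholder hit/miss latencies
        }

        // Weave phase: given the event's lower-bound cycle, add delays from MSHR
        // pressure, pipelining, and port contention (all stubbed out here).
        uint64_t weaveAccess(uint64_t lowerBoundCycle) {
            uint64_t contentionDelay = 0;  // a real model would compute this from shared state
            return lowerBoundCycle + contentionDelay;
        }
    };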

Bound-Weave Take-Aways (slide 26)
- Minimal synchronization:
  - Bound phase: unordered accesses (like lax synchronization)
  - Weave phase: only sync on actual dependencies
- No ordering violations in the weave phase
- Works with standard event-driven models; e.g., 110 lines of code to integrate with DRAMSim2

Multithreaded Accuracy (slide 27)
- 23 apps: PARSEC, SPLASH-2, SPEC OMP2001, STREAM
- 11.2% average performance error (not IPC); 10 of the 23 apps within 10%
- Differences are similar to the single-core results

1024-Core Performance (slide 28)
- Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads)
- Results for the 14 of 23 parallel apps that scale
- 200 MIPS harmonic mean with the least detailed models, 41 MIPS harmonic mean with the most detailed (~5x between them)
- ~100-1000x faster than existing simulators

Bound-Weave Scalability (slide 29)
- 10.1-13.6x speedup at 16 host cores
[Plot: speedup vs. host core count]

Outline (slide 30)
- Introduction
- Detailed DBT-accelerated core models
- Bound-weave parallelization
- Lightweight user-level virtualization

Lightweight User-Level Virtualization (slide 31)
- ZSim has to be user-level for now: there are no 1K-core OSs and no parallel full-system DBT
- Problem: user-level simulators are limited to simple workloads
- Lightweight user-level virtualization bridges the gap with full-system simulation
- Simulation is accurate as long as the time spent in the OS is minimal

Lightweight User-Level Virtualization (slide 32)
- Multiprocess workloads
- Scheduler (threads > cores)
- Time virtualization
- System virtualization
- Simulator-OS deadlock avoidance
- Signals
- ISA extensions
- Fast-forwarding

ZSim Limitations (slide 33)
- Not implemented yet:
  - Multithreaded cores
  - Detailed NoC models
  - Virtual memory (TLBs)
- Fundamentally hard:
  - Systems or workloads with frequent path-altering interference (e.g., fine-grained message passing across the whole chip)
  - Kernel-intensive applications

Summary (slide 34)
- Three techniques make 1K-core simulation practical:
  - DBT-accelerated models: 10-100x faster core models
  - Bound-weave parallelization: ~10-15x speedup from parallelization with minimal accuracy loss
  - Lightweight user-level virtualization: simulate complex workloads without full-system support
- ZSim achieves high performance and accuracy:
  - Simulates 1024-core systems at 10s-1000s of MIPS
  - Validated against a real Westmere system; ~10% average error

Simulator Organization (slide 35)

Main Components (slide 36)
[Block diagram: harness, config, initialization, driver, global memory, core timing models, memory-system timing models, user-level virtualization, stats]

ZSim Harness (slide 37)
- Most of zsim is implemented as a pintool (libzsim.so)
- A separate harness process (zsim) controls the simulation:
  - Initializes global memory
  - Launches the pin processes
  - Checks for deadlock
- Example: ./build/opt/zsim test.cfg, where test.cfg contains entries such as

    process0 = {
        command = "ls";
    };
    process1 = {
        command = "echo foo";
    };
    ...

- The harness launches one pin process per entry (a sketch of this launch step follows):

    pin -t libzsim.so -- ls
    pin -t libzsim.so -- echo foo
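
A hypothetical sketch of the launch step (not zsim's actual harness code): fork one child per processN entry in the config and exec pin with the zsim pintool and the configured command.

    #include <string>
    #include <sys/types.h>
    #include <unistd.h>
    #include <vector>

    // Launch "pin -t libzsim.so -- <command args...>" for one configured process.
    // The real harness also sets up the global-memory segment, passes per-process
    // IDs and the config to the pintool, and monitors the children for deadlock.
    pid_t launchPinProcess(const std::vector<std::string>& commandArgs) {
        pid_t pid = fork();
        if (pid == 0) {  // child: becomes the pin process
            std::vector<const char*> argv = {"pin", "-t", "libzsim.so", "--"};
            for (const auto& arg : commandArgs) argv.push_back(arg.c_str());
            argv.push_back(nullptr);
            execvp("pin", const_cast<char* const*>(argv.data()));
            _exit(127);  // exec failed
        }
        return pid;  // parent (harness) keeps the pid to wait on / watch
    }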

Global Memory (slide 38)
- Pin processes communicate through a shared memory segment, managed as a single global heap
- All simulator objects must be allocated in the global heap
- The global heap and the libzsim.so code sit at the same memory locations in every process, so the simulator can use normal pointers and virtual functions (a sketch of this idea follows)
[Diagram: each pin process's address space holds its own program code and local heap, plus the shared global heap and libzsim.so mapped at identical addresses]
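
A loose, illustrative sketch of the idea (the fixed mapping address, the toy allocator, and the exact allocator interface are simplifications, not zsim's real implementation): map a named shared segment at the same virtual address in every process and allocate simulator objects from it via an overridden operator new.

    #include <cstddef>
    #include <cstdint>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Mapping the segment at the same virtual address in every process is what
    // makes raw pointers and vtable pointers into the global heap valid everywhere.
    // The base address below is an assumption; a real implementation must pick a
    // range known to be free in all processes.
    static void* const kGlobalHeapBase = reinterpret_cast<void*>(0x100000000000ULL);
    static const size_t kGlobalHeapSize = 1ULL << 30;  // 1 GB

    void* initGlobalHeap(const char* segName, bool create) {
        int fd = shm_open(segName, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
        if (fd < 0) return nullptr;
        if (create && ftruncate(fd, kGlobalHeapSize) != 0) return nullptr;
        return mmap(kGlobalHeapBase, kGlobalHeapSize, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    }

    // Toy bump-pointer allocator over the segment. A real allocator keeps its
    // metadata inside the shared segment and synchronizes across processes.
    static char* gmCursor = nullptr;
    void* gm_malloc(size_t bytes) {
        if (!gmCursor) gmCursor = static_cast<char*>(kGlobalHeapBase);
        void* p = gmCursor;
        gmCursor += (bytes + 63) & ~static_cast<size_t>(63);  // 64-byte aligned, never reused (toy)
        return p;
    }
    void gm_free(void*) { /* toy allocator never reuses memory */ }

    // Simulator objects derive from a base class whose operator new carves space
    // out of the shared segment, so "new SomeModel(...)" lands in the global heap.
    struct GlobAlloc {
        void* operator new(size_t sz) { return gm_malloc(sz); }
        void operator delete(void* p) { gm_free(p); }
    };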
