Fast and Accurate Microarchitectural Simulation with ZSim
Daniel Sanchez, Nathan Beckmann, Anurag Mukkara, Po-An Tsai (MIT CSAIL)
MICRO-48 Tutorial, December 5, 2015

Welcome!
Agenda (as captured): 8:30-9:10 Intro and Overview; 9:10 ...


Parallelization Techniques (slide 17)
- Parallel Discrete Event Simulation (PDES), sketched below:
  - Divide components across host threads
  - Execute events from each component, maintaining the illusion of full order
  - Accurate, but not scalable
- Lax synchronization: allow skews above inter-component latencies and tolerate ordering violations
  - Scalable, but inaccurate
[Diagram: two host threads simulate Core 0/Core 1, L3 Bank 0/1, and Mem 0; pending events at cycles 5, 10, and 15 execute with skew kept below 10 cycles]
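
A minimal sketch of the conservative-PDES loop described above (illustrative names and structure, not any particular simulator's code): each host thread executes its components' events in timestamp order, but stalls whenever it would run more than the allowed skew ahead of the slowest host thread.

    #include <atomic>
    #include <cstdint>
    #include <queue>
    #include <vector>

    // Toy event for one simulated component (core, cache bank, memory controller).
    struct Event {
        uint64_t cycle;   // simulated timestamp
        int component;    // which component the event belongs to
    };
    struct LaterFirst {
        bool operator()(const Event& a, const Event& b) const { return a.cycle > b.cycle; }
    };

    constexpr uint64_t kMaxSkew = 10;            // smallest inter-component latency
    std::atomic<uint64_t> slowestThreadCycle{0}; // min clock across host threads (updated elsewhere)

    // One host thread's conservative-PDES loop: execute events in timestamp order,
    // but never run more than kMaxSkew cycles ahead of the slowest host thread,
    // so cross-component ordering is preserved.
    void pdesThreadLoop(std::priority_queue<Event, std::vector<Event>, LaterFirst>& myEvents) {
        while (!myEvents.empty()) {
            Event ev = myEvents.top();
            while (ev.cycle > slowestThreadCycle.load() + kMaxSkew) {
                // spin/wait until the other host threads catch up
            }
            myEvents.pop();
            // simulateEvent(ev);  // may enqueue new events on this or other threads
        }
    }

Lax synchronization corresponds to dropping the inner wait loop: threads never stall, so they scale, but dependent events may execute out of order.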

Characterizing Interference (slide 18)
- Path-altering interference: if we simulate two accesses out of order, their paths through the memory hierarchy change
- Path-preserving interference: if we simulate two accesses out of order, their timing changes but their paths do not
[Diagrams: path-altering -- Core 0 and Core 1 both issue GETS A to the LLC, so swapping the order swaps which access misses to memory and which one hits; path-preserving -- with a blocking LLC, Core 0's GETS A (miss) and Core 1's GETS B (hit) follow the same paths in either order, only their timing shifts]

Characterizing Interference (slide 19)
- Accesses with path-altering interference, measured with barrier synchronization every 1K/10K/100K cycles on 64 cores: about 1 in 10K accesses
- Path-altering interference is extremely rare in small intervals
- Strategy:
  - Simulate path-preserving interference faithfully
  - Ignore (but optionally profile) path-altering interference

Bound-Weave Parallelization (slide 20)
- Divide the simulation into small intervals (e.g., 1000 cycles)
- Two parallel phases per interval: bound and weave (a structural sketch of the interval loop follows)
  - Bound phase: find paths
  - Weave phase: find timings
- Bound-weave is equivalent to PDES for path-preserving interference
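
A structural sketch of the per-interval loop (hypothetical types and method names; shown sequentially here, whereas zsim runs both phases across multiple host threads):

    #include <cstdint>
    #include <vector>

    // Hypothetical stand-ins for simulated cores and weave-phase domains.
    struct SimCore {
        uint64_t cycles = 0;
        void simulateUntil(uint64_t limit) { cycles = limit; /* run DBT'd code, record access trace */ }
        void applyWeaveFeedback(uint64_t delay) { cycles += delay; }
    };
    struct SimDomain {
        void replayEventsUntil(uint64_t limit) { (void)limit; /* event-driven replay of recorded traces */ }
    };

    // Sketch of the per-interval bound-weave loop.
    void runSimulation(std::vector<SimCore>& cores, std::vector<SimDomain>& domains,
                       uint64_t totalCycles, uint64_t intervalLen = 1000) {
        for (uint64_t limit = intervalLen; limit <= totalCycles; limit += intervalLen) {
            // Bound phase: simulate every core up to the interval boundary with
            // unordered accesses at minimum latencies, gathering access traces.
            for (auto& core : cores) core.simulateUntil(limit);

            // Weave phase: event-driven simulation of the gathered traces, one
            // host thread per domain, producing the actual (contended) timings.
            for (auto& dom : domains) dom.replayEventsUntil(limit);

            // Feedback: adjust each core's cycle count by the extra delay the
            // weave phase found for it (zero in this stub).
            for (auto& core : cores) core.applyWeaveFeedback(0);
        }
    }

The key property is that the bound phase fixes the paths (and thus which events exist), so the weave phase can replay them as an ordinary event-driven simulation.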

Bound-Weave Example (slide 21)
- 2-core host simulating a 4-core system (per-core L1I/L1D and L2, four L3 banks, two memory controllers), with 1000-cycle intervals
- Divide the simulated components among 2 domains
- Bound phase: parallel simulation until cycle 1000, gathering access traces (host thread 0 simulates cores 0 and 1, host thread 1 simulates cores 3 and 2)
- Weave phase: parallel event-driven simulation of the gathered traces until actual cycle 1000 (host thread 0 takes domain 0, host thread 1 takes domain 1)
- Feedback: adjust core cycles, then run the next bound phase (until cycle 2000), and so on

Example: Bound Phase (slide 22)
- Host thread 0 simulates core 0 and records a trace of timestamped events (see the sketch below)
- Edges fix the minimum latency between events
- Minimum L3 and main-memory latencies are used (no interference yet)
[Trace diagram: core events Core0@30, @60, @90, @250, @290; memory-hierarchy events L3b1 HIT @50, L3b0 MISS @80, Mem1 READ @110, L3b0 RESP @230, L3b3 HIT @270; edges carry minimum latencies (20-30 cycles between adjacent events, ~120 cycles for the memory access)]
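
A minimal sketch of what one recorded trace node could look like (assumed structure, not zsim's actual timing-event classes): each event carries the lower-bound cycle computed during the bound phase plus outgoing edges annotated with minimum latencies.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // One node of a bound-phase trace: the event's lower-bound cycle (computed
    // with minimum, contention-free latencies) and edges to successor events,
    // each carrying the minimum latency between the two events.
    struct TraceEvent {
        uint64_t lowerBoundCycle;
        std::vector<std::pair<TraceEvent*, uint32_t>> succs;  // (successor, min latency)
    };

    // For example, the slide's L3 bank 0 MISS at cycle 80 followed 30 cycles
    // later by the memory READ at cycle 110:
    TraceEvent memRead{110, {}};
    TraceEvent l3Miss{80, {{&memRead, 30}}};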

Example: Weave Phase (slide 23)
- Host threads simulate the components from domains 0 and 1 (in the example, thread 0 handles the core 0, L3b0, and L3b1 events, while thread 1 handles the Mem1 and L3b3 events)
- Host threads only sync when needed; e.g., thread 1 simulates other events (not shown) until cycle 110, then syncs
- Lower bounds guarantee no order violations

Example: Weave Phase (slide 24)
- Delays propagate as events are simulated (a sketch of this propagation follows)
- e.g., a DRAM row miss at Mem1 adds 50 cycles: the memory-access edge grows from 120 to 170 cycles, and the dependent events slide later (L3b0 RESP 230 -> 280, Core0 250 -> 300, L3b3 HIT 270 -> 320, final Core0 event 290 -> 340)
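
A sketch of that propagation rule, again with assumed structure: when an event finishes later than its lower bound, each successor is pushed to start no earlier than the event's actual finish time plus the edge's minimum latency.

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Weave-phase view of a trace node: lower bound from the bound phase plus
    // the actual cycle assigned during event-driven replay.
    struct TimingEvent {
        uint64_t lowerBound;                                    // bound-phase estimate
        uint64_t actualCycle = 0;                               // filled in by the weave phase
        std::vector<std::pair<TimingEvent*, uint32_t>> succs;   // (successor, min edge latency)
    };

    // Execute an event with some weave-phase-only delay (e.g., +50 cycles for a
    // DRAM row-buffer miss) and propagate the delay to its successors: a
    // successor can start no earlier than its own lower bound, and no earlier
    // than this event's actual finish plus the edge's minimum latency.
    void executeEvent(TimingEvent* ev, uint64_t extraDelay) {
        ev->actualCycle = ev->lowerBound + extraDelay;
        for (auto& edge : ev->succs) {
            TimingEvent* succ = edge.first;
            succ->lowerBound = std::max(succ->lowerBound, ev->actualCycle + edge.second);
        }
    }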

Bound-Weave Scalability (slide 25)
- Bound phase scales almost linearly, using a novel shared-memory synchronization protocol (covered later)
- Weave phase scales much better than PDES:
  - Threads only need to sync when an event crosses domains
  - A lot of work has shifted to the bound phase
- Need bound and weave models for each component, but the division is often very natural
  - e.g., caches: hit/miss in the bound phase; MSHRs, pipelined accesses, and port contention in the weave phase (see the sketch below)
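
A toy illustration of that bound/weave split for a cache (hypothetical interface and latencies, not zsim's classes): the bound phase resolves hit/miss and charges a contention-free latency, while the weave phase later adds contention-related delays to the recorded events.

    #include <cstdint>
    #include <unordered_set>

    // Toy cache split into bound and weave models (hypothetical interface and
    // latencies; zsim's real cache models are considerably more detailed).
    class ToyCache {
        std::unordered_set<uint64_t> presentLines;  // which lines are cached ("path" state)

    public:
        // Bound phase: decide hit/miss -- the part that determines the access path --
        // and return the minimum, contention-free latency. A real model would also
        // record a weave-phase event here.
        uint32_t boundAccess(uint64_t lineAddr) {
            bool hit = presentLines.count(lineAddr) != 0;
            if (!hit) presentLines.insert(lineAddr);  // fill on miss (no evictions in this toy)
            return hit ? 20u : 120u;                  // placeholder hit/miss latencies
        }

        // Weave phase: given the event's lower-bound cycle, add delays from MSHR
        // pressure, pipelining, and port contention (all stubbed out here).
        uint64_t weaveAccess(uint64_t lowerBoundCycle) {
            uint64_t contentionDelay = 0;  // a real model would compute this from shared state
            return lowerBoundCycle + contentionDelay;
        }
    };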

Bound-Weave Take-Aways (slide 26)
- Minimal synchronization:
  - Bound phase: unordered accesses (like lax synchronization)
  - Weave phase: only sync on actual dependencies
- No ordering violations in the weave phase
- Works with standard event-driven models; e.g., 110 lines of code to integrate with DRAMSim2

Multithreaded Accuracy (slide 27)
- 23 apps: PARSEC, SPLASH-2, SPEC OMP2001, STREAM
- 11.2% average performance error (not IPC); 10 of the 23 apps within 10%
- Differences are similar to the single-core results

1024-Core Performance (slide 28)
- Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads)
- Results for the 14 of 23 parallel apps that scale
- 200 MIPS harmonic mean with the least detailed models, 41 MIPS harmonic mean with the most detailed (~5x between them)
- ~100-1000x faster than existing simulators

Bound-Weave Scalability (slide 29)
- 10.1-13.6x speedup at 16 host cores
[Plot: speedup vs. host core count]

Outline (slide 30)
- Introduction
- Detailed DBT-accelerated core models
- Bound-weave parallelization
- Lightweight user-level virtualization

Lightweight User-Level Virtualization (slide 31)
- ZSim has to be user-level for now: there are no 1K-core OSs and no parallel full-system DBT
- Problem: user-level simulators are limited to simple workloads
- Lightweight user-level virtualization bridges the gap with full-system simulation
- Simulation is accurate as long as the time spent in the OS is minimal

Lightweight User-Level Virtualization (slide 32)
- Multiprocess workloads
- Scheduler (threads > cores)
- Time virtualization
- System virtualization
- Simulator-OS deadlock avoidance
- Signals
- ISA extensions
- Fast-forwarding

ZSim Limitations (slide 33)
- Not implemented yet:
  - Multithreaded cores
  - Detailed NoC models
  - Virtual memory (TLBs)
- Fundamentally hard:
  - Systems or workloads with frequent path-altering interference (e.g., fine-grained message passing across the whole chip)
  - Kernel-intensive applications

Summary (slide 34)
- Three techniques make 1K-core simulation practical:
  - DBT-accelerated models: 10-100x faster core models
  - Bound-weave parallelization: ~10-15x speedup from parallelization with minimal accuracy loss
  - Lightweight user-level virtualization: simulate complex workloads without full-system support
- ZSim achieves high performance and accuracy:
  - Simulates 1024-core systems at 10s-1000s of MIPS
  - Validated against a real Westmere system; ~10% average error

Simulator Organization (slide 35)

Main Components (slide 36)
[Block diagram: harness, config, initialization, driver, global memory, core timing models, memory-system timing models, user-level virtualization, stats]

ZSim Harness (slide 37)
- Most of zsim is implemented as a pintool (libzsim.so)
- A separate harness process (zsim) controls the simulation:
  - Initializes global memory
  - Launches the pin processes
  - Checks for deadlock
- Example: ./build/opt/zsim test.cfg, where test.cfg contains entries such as

    process0 = {
        command = "ls";
    };
    process1 = {
        command = "echo foo";
    };
    ...

- The harness launches one pin process per entry (a sketch of this launch step follows):

    pin -t libzsim.so -- ls
    pin -t libzsim.so -- echo foo
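
A hypothetical sketch of the launch step (not zsim's actual harness code): fork one child per processN entry in the config and exec pin with the zsim pintool and the configured command.

    #include <string>
    #include <sys/types.h>
    #include <unistd.h>
    #include <vector>

    // Launch "pin -t libzsim.so -- <command args...>" for one configured process.
    // The real harness also sets up the global-memory segment, passes per-process
    // IDs and the config to the pintool, and monitors the children for deadlock.
    pid_t launchPinProcess(const std::vector<std::string>& commandArgs) {
        pid_t pid = fork();
        if (pid == 0) {  // child: becomes the pin process
            std::vector<const char*> argv = {"pin", "-t", "libzsim.so", "--"};
            for (const auto& arg : commandArgs) argv.push_back(arg.c_str());
            argv.push_back(nullptr);
            execvp("pin", const_cast<char* const*>(argv.data()));
            _exit(127);  // exec failed
        }
        return pid;  // parent (harness) keeps the pid to wait on / watch
    }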

Global Memory (slide 38)
- Pin processes communicate through a shared memory segment, managed as a single global heap
- All simulator objects must be allocated in the global heap
- The global heap and the libzsim.so code sit at the same memory locations in every process, so the simulator can use normal pointers and virtual functions (a sketch of this idea follows)
[Diagram: each pin process's address space holds its own program code and local heap, plus the shared global heap and libzsim.so mapped at identical addresses]
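
A loose, illustrative sketch of the idea (the fixed mapping address, the toy allocator, and the exact allocator interface are simplifications, not zsim's real implementation): map a named shared segment at the same virtual address in every process and allocate simulator objects from it via an overridden operator new.

    #include <cstddef>
    #include <cstdint>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Mapping the segment at the same virtual address in every process is what
    // makes raw pointers and vtable pointers into the global heap valid everywhere.
    // The base address below is an assumption; a real implementation must pick a
    // range known to be free in all processes.
    static void* const kGlobalHeapBase = reinterpret_cast<void*>(0x100000000000ULL);
    static const size_t kGlobalHeapSize = 1ULL << 30;  // 1 GB

    void* initGlobalHeap(const char* segName, bool create) {
        int fd = shm_open(segName, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
        if (fd < 0) return nullptr;
        if (create && ftruncate(fd, kGlobalHeapSize) != 0) return nullptr;
        return mmap(kGlobalHeapBase, kGlobalHeapSize, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    }

    // Toy bump-pointer allocator over the segment. A real allocator keeps its
    // metadata inside the shared segment and synchronizes across processes.
    static char* gmCursor = nullptr;
    void* gm_malloc(size_t bytes) {
        if (!gmCursor) gmCursor = static_cast<char*>(kGlobalHeapBase);
        void* p = gmCursor;
        gmCursor += (bytes + 63) & ~static_cast<size_t>(63);  // 64-byte aligned, never reused (toy)
        return p;
    }
    void gm_free(void*) { /* toy allocator never reuses memory */ }

    // Simulator objects derive from a base class whose operator new carves space
    // out of the shared segment, so "new SomeModel(...)" lands in the global heap.
    struct GlobAlloc {
        void* operator new(size_t sz) { return gm_malloc(sz); }
        void operator delete(void* p) { gm_free(p); }
    };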
