

  1. Fast and Accurate Performance Analysis of Synchronization Mario Badr and Natalie Enright Jerger

  2. Evaluating Multi-Threaded Performance
  • Difficult and Time Consuming
    • Non-Determinism
    • Cross-stack effects
    • Different Architectures
  • Goal: Make it Straightforward and Fast
    • One trace, many total orders
    • High level of abstraction

  3. More Synchronization != More Overhead
  (chart: synchronization behavior of the Ferret benchmark)

  4. Multi-threaded, Multi-core Workflows
  • Programmer: Write Multi-threaded Program → Profile for Bottlenecks → Implement Optimization → Release Program
  • Systems Researcher: Change Kernel → Test Implementation → Modify Implementation → Release Modifications
  • Architect: Design Multi-processor → Simulate with Benchmarks → Optimize Design → Release Chip
  Open questions: One application? One architecture? Application input? Multiple architectures? Simulation time? Architectures that don't exist?

  5. Cross-Stack Interactions for Synchronization
  • Application
  • Thread Library / Application Runtime
  • Operating System
  • Architecture

  6. Modelling Multithreaded Applications
  • Thread Model: representation of the application
  • Runtime/OS Model
  • Architecture Model: architectural configuration

  7. Execution of a Parallel Program
  (timeline of threads t1–t4)

  8. What impacts a thread's execution time?
  • Heterogeneity
    • Architectures (e.g., big.LITTLE)
    • Dynamic Voltage and Frequency Scaling (DVFS)
  • Contention
  • Synchronization
  • Many other things

  9. The Impact of Heterogeneity
  (timeline of threads t1–t4)

  10. The Impact of Synchronization
  (timeline of threads t1–t4)

  11. Heterogeneity and Synchronization
  (timeline of threads t1–t4)
  The order and time of synchronization events impact performance.

  12. Modelling Cross-Stack Interactions
  • How to represent a multi-threaded application?
    • Task Graph
    • Trace
  • How to model the operating system and runtime?
    • Thread scheduling
    • Synchronization
  • How to model the architecture?
    • Rate of execution (e.g., cycles per instruction)

  13. The Producer Consumer Example
  • Adding Work to a Queue
  • Removing Work from a Queue
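The producer-consumer pattern on this slide can be sketched with Python's threading primitives; a Condition bundles the mutex and the "enqueue" condition variable from the trace on the next slide. This is a minimal illustration, not the presentation's actual benchmark code, and the "dequeue" signal for a bounded queue is omitted for brevity:

```python
import threading
from collections import deque

queue = deque()                      # shared work queue
enqueue = threading.Condition()      # mutex + "enqueue" condition variable

def producer(items):
    for item in items:
        with enqueue:                # Lock mutex
            queue.append(item)       # add work to the queue
            enqueue.notify()         # Signal enqueue
                                     # Unlock mutex on leaving the block

def consumer(n, out):
    for _ in range(n):
        with enqueue:                # Lock mutex
            while not queue:
                enqueue.wait()       # Wait enqueue: silent unlock, then silent lock
            out.append(queue.popleft())

results = []
c = threading.Thread(target=consumer, args=(3, results))
p = threading.Thread(target=producer, args=([1, 2, 3],))
c.start(); p.start()
p.join(); c.join()
```

With a single producer and a FIFO queue, the consumer removes items in the order they were added.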

  14. Representing an Application
  Synchronization Trace:
    Thread    Event   Primitive
    Consumer  Lock    mutex
    Producer  Lock    mutex
    Consumer  Wait    enqueue
    Producer  Signal  enqueue
    Producer  Unlock  mutex
    Consumer  Signal  dequeue
    Consumer  Unlock  mutex
  (the slide also shows the equivalent task graph)
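The synchronization trace on this slide can be stored as an ordered list of (thread, event, primitive) tuples; each thread's program order is then just the subsequence of its own events. A minimal sketch (names are illustrative, not the presentation's code):

```python
# Synchronization trace from slide 14, in captured order.
trace = [
    ("Consumer", "Lock",   "mutex"),
    ("Producer", "Lock",   "mutex"),
    ("Consumer", "Wait",   "enqueue"),
    ("Producer", "Signal", "enqueue"),
    ("Producer", "Unlock", "mutex"),
    ("Consumer", "Signal", "dequeue"),
    ("Consumer", "Unlock", "mutex"),
]

def program_order(trace, thread):
    """Per-thread program order: the events issued by one thread, in sequence."""
    return [(event, primitive) for t, event, primitive in trace if t == thread]

producer_order = program_order(trace, "Producer")
consumer_order = program_order(trace, "Consumer")
```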

  15. The order of synchronization events
  • A synchronization trace gives us the program order of each thread
  • We want to determine the total order of all synchronization events
  • The total order must be correct
    • Safety (e.g., no two threads in the same critical section)
    • Liveness (e.g., all threads make progress eventually)

  16. One Trace, Multiple Total Orders – Captures Non-Determinism
  • Consumer locks mutex first (original trace):
    Consumer Lock, Producer Lock, Consumer Wait, Producer Signal, Producer Unlock, Consumer Signal, Consumer Unlock
  • Consumer is much faster than producer:
    Consumer Lock, Consumer Wait, Producer Lock, Producer Signal, Producer Unlock, Consumer Signal, Consumer Unlock
  • Producer locks mutex first:
    Producer Lock, Consumer Lock, Producer Signal, Producer Unlock, Consumer Wait, Consumer Signal, Consumer Unlock
  • Producer is much faster than consumer:
    Producer Lock, Producer Signal, Producer Unlock, Consumer Lock, Consumer Wait, Consumer Signal, Consumer Unlock
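The "one trace, multiple total orders" idea can be made concrete by enumerating every merge of the two per-thread event sequences that preserves each thread's program order; the synchronization model then keeps only the merges that respect mutex and condition-variable semantics (four of which appear in the backup scenario slides). A sketch with illustrative names:

```python
from itertools import combinations

def interleavings(a, b):
    """All merges of two per-thread event sequences that preserve each
    thread's program order (before filtering by synchronization semantics)."""
    n, total = len(a), len(a) + len(b)
    result = []
    for positions in combinations(range(total), n):
        merged, ai, bi = [], iter(a), iter(b)
        pos = set(positions)
        for i in range(total):
            merged.append(next(ai) if i in pos else next(bi))
        result.append(merged)
    return result

producer = ["P:Lock", "P:Signal", "P:Unlock"]
consumer = ["C:Lock", "C:Wait", "C:Signal", "C:Unlock"]
orders = interleavings(producer, consumer)
```

For 3 producer events and 4 consumer events there are C(7,3) = 35 raw interleavings; far fewer survive the safety and liveness checks.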

  17. Modelling Locks and Condition Variables
  • Per-lock thread queues (figure: threads t1–t4 queued on a lock)
  • Condition variable counters
    • On wait: decrement counter by 1
    • On signal: increment counter by 1
    • On broadcast: increment counter by the number of waiting consumers
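The counter scheme above can be sketched as a small class; the class name, the waiter list, and the return conventions are illustrative assumptions, not the presentation's implementation:

```python
class CondVarCounter:
    """Counter model of a condition variable: wait decrements, signal
    increments, broadcast increments by the number of waiting threads."""
    def __init__(self):
        self.count = 0
        self.waiters = []              # threads currently blocked on this condition

    def wait(self, tid):
        self.count -= 1
        if self.count < 0:             # no pending signal: the thread must block
            self.waiters.append(tid)
            return "blocked"
        return "continues"             # a signal was pending: consume it

    def signal(self):
        self.count += 1
        if self.waiters:
            return self.waiters.pop(0)  # wake the oldest waiter
        return None                     # no waiter: the signal stays pending

    def broadcast(self):
        woken = self.waiters
        self.count += len(woken)        # one increment per waiting thread
        self.waiters = []
        return woken
```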

  18. Estimating the Time Between Events
  1. Dynamic Instructions: the distance between events
  2. Core Frequency and Microarchitecture: the rate between events
  3. The Scheduling of Threads: the opportunity to execute dynamic instructions
  4. The Timing of Prior Events: the dependencies between threads
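Items 1 and 2 combine into a simple estimate (scheduling and prior-event dependencies, items 3 and 4, shift when the interval starts rather than its length). A hedged sketch of the arithmetic, with illustrative parameter values:

```python
def time_between_events(instructions, cpi, freq_hz):
    """Estimate the wall-clock time between two synchronization events:
    cycles = dynamic instructions x cycles-per-instruction,
    time   = cycles / core frequency."""
    return instructions * cpi / freq_hz

# e.g., 100 dynamic instructions at CPI 1.5 on a 2.6 GHz core
t = time_between_events(100, 1.5, 2.6e9)   # about 58 nanoseconds
```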

  19. Our High Level Abstraction
  • Thread Model – a sequence of events: each trace entry gives the thread ID (TID), the synchronization event and the object it is acting on, and the instruction count until the next event (e.g., TID(1) Acquire(A) 100, TID(3) Acquire(A) 342, TID(2) Barrier(B) 612, TID(1) Release(A) 30, ...)
  • Synchronization Model – a controller tracks the executing threads, each thread's current event, inter-thread dependencies, and a sleep queue
  • Scheduler – maintains the thread-to-core map; each core has its own frequency, and cycles per instruction (CPI) is tracked per thread
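The controller/scheduler idea can be sketched as a tiny discrete-event loop. This is a simplified illustration under stated assumptions, not the presentation's implementation: it handles only lock Acquire/Release (no barriers, condition variables, or thread scheduling), assumes one thread per core and zero-latency synchronization, and the trace format and names are invented for the example:

```python
import heapq

def replay(threads, cpi, freq_hz):
    """Minimal controller sketch: each thread is a list of
    (event, lock, instructions-before-event) entries. Advance threads on
    their own cores, serialize lock acquires with per-lock wait queues,
    and emit one total order with timestamps."""
    now = {tid: 0.0 for tid in threads}      # per-thread clock (seconds)
    pc = {tid: 0 for tid in threads}         # index of each thread's next event
    holder, waiting = {}, {}                 # lock -> owner, lock -> blocked threads
    ready = [(trace[0][2] * cpi[tid] / freq_hz, tid)
             for tid, trace in threads.items()]
    heapq.heapify(ready)
    order = []
    while ready:
        t_done, tid = heapq.heappop(ready)
        event, lock, _ = threads[tid][pc[tid]]
        if event == "Acquire" and lock in holder:
            waiting.setdefault(lock, []).append((t_done, tid))   # block thread
            continue
        now[tid] = t_done
        order.append((t_done, tid, event, lock))
        if event == "Acquire":
            holder[lock] = tid
        else:                                 # "Release"
            del holder[lock]
            if waiting.get(lock):             # wake the oldest blocked thread
                t_blk, w = waiting[lock].pop(0)
                heapq.heappush(ready, (max(t_blk, t_done), w))
        pc[tid] += 1
        if pc[tid] < len(threads[tid]):       # schedule the thread's next event
            instr = threads[tid][pc[tid]][2]
            heapq.heappush(ready, (now[tid] + instr * cpi[tid] / freq_hz, tid))
    return order
```

With CPI 1.0 and a 1 Hz "core" (so times equal instruction counts), a thread that reaches Acquire(A) while another holds A is delayed until the release, which is exactly how the order and time of synchronization events shape the estimated execution time.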

  20. Validation Methodology
  • Benchmarks: PARSEC 3.0, Splash-3
  • Execution time measured with GNU time
  • Traces generated with Pin
  • Cycles-per-instruction profiled with VTune™
  • Architecture: Intel Xeon E5-2650 v2
    • 2 sockets, 8 cores per socket, 2 threads per core
    • 20 MB L3 Cache
    • 2.6 GHz
  • Three runs for each experiment

  21. Assumptions and Approximations
  • Cycles-per-instruction encompasses microarchitecture and memory hierarchy performance
  • Synchronization events have zero latency
  • Context switches have zero latency
  • Synchronization model approximates application state (i.e., for condition variables)

  22. Model Validation: 4 Cores, Single Socket
  (bar chart: average measured vs. average estimated execution time per benchmark, 0–200 seconds)

  23. Model Validation: 32 Cores, Dual Socket
  (bar chart: average measured vs. average estimated execution time per benchmark, 0–60 seconds)

  24. Water (nsquared): 8 Cores
  (per-thread charts for thread IDs 0–7: average computation vs. average synchronization time, estimated with our model in nanoseconds and with VTune™ in seconds)

  25. Model Runtime
    Benchmark         Input Set  Input Size  Trace Size  Runtime
    blackscholes      Native     603 MB      1.1 KB      4 ms
    bodytrack         Native     616 MB      31 MB       4.9 minutes
    water (nsquared)  Native     3.6 MB      53 MB       7.5 minutes
    average                                  2 MB        32 seconds
  Orders of magnitude faster than simulation of smaller input sets.

  26. Conclusion
  • A very high level of abstraction can accurately and quickly estimate the performance of a multi-threaded application on a multi-core processor.
    • Average 7.2% error in total execution time
    • Average 32 seconds to generate an estimate
  • Programmers and systems researchers can evaluate on many architectures
  • Architects can evaluate with native inputs and many applications

  27. Future Work
  • How much non-determinism is there across multiple traces of an application?
  • How can a {memory, network} contention model be added to improve error without significantly increasing model complexity?

  28. Our Work is Open Source
  https://github.com/mariobadr/simsync-pmam
  License: Apache 2.0
  Mario Badr and Natalie Enright Jerger

  29. Scenario A – Consumer locks mutex first
  Trace: Consumer Lock mutex; Producer Lock mutex; Consumer Wait enqueue; Producer Signal enqueue; Producer Unlock mutex; Consumer Signal dequeue; Consumer Unlock mutex
  1. Consumer locks mutex
  2. Producer attempts lock
     • Producer blocked
  3. Consumer waits for enqueue
     • Consumer blocked, silent unlock
     • Producer unblocked, silent lock
  4. Producer signals enqueue
     • Consumer tries to lock, remains blocked
  5. Producer unlocks mutex
     • Consumer unblocked, silent lock
  6. Consumer signals dequeue
  7. Consumer unlocks mutex

  30. Scenario B – Consumer is much faster
  Trace: Consumer Lock mutex; Consumer Wait enqueue; Producer Lock mutex; Producer Signal enqueue; Producer Unlock mutex; Consumer Signal dequeue; Consumer Unlock mutex
  1. Consumer locks mutex
  2. Consumer waits for enqueue
     • Consumer blocked, silent unlock
  3. Producer locks mutex
  4. Producer signals enqueue
     • Consumer tries lock, remains blocked
  5. Producer unlocks mutex
     • Consumer unblocked, silent lock
  6. Consumer signals dequeue
  7. Consumer unlocks mutex

  31. Scenario C – Producer locks mutex first
  Trace: Producer Lock mutex; Consumer Lock mutex; Producer Signal enqueue; Producer Unlock mutex; Consumer Wait enqueue; Consumer Signal dequeue; Consumer Unlock mutex
  1. Producer locks mutex
  2. Consumer attempts lock
     • Consumer blocked
  3. Producer signals enqueue
  4. Producer unlocks mutex
     • Consumer unblocked
  5. Consumer locks mutex
  6. Consumer does not have to wait
  7. Consumer signals dequeue
  8. Consumer unlocks mutex

  32. Scenario D – Producer is much faster
  Trace: Producer Lock mutex; Producer Signal enqueue; Producer Unlock mutex; Consumer Lock mutex; Consumer Wait enqueue; Consumer Signal dequeue; Consumer Unlock mutex
  1. Producer locks mutex
  2. Producer signals enqueue
  3. Producer unlocks mutex
  4. Consumer locks mutex
  5. Consumer does not have to wait
  6. Consumer signals dequeue
  7. Consumer unlocks mutex
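Scenario D's walkthrough can be checked mechanically by replaying the total order against a single mutex and condition-variable counters. This simplified sketch (names and the state machine are illustrative; it assumes the replayed order is already a valid one) reports which events block, and confirms that in Scenario D nothing does:

```python
def blocked_steps(total_order):
    """Replay (thread, event, primitive) events against one mutex and
    condition-variable counters; return the events that block."""
    holder = None                       # thread inside the critical section
    pending = {}                        # condition variable -> pending signals
    blocked = []
    for thread, event, obj in total_order:
        if event == "Lock":
            if holder is not None:
                blocked.append((thread, event))
            holder = thread             # acquires, possibly after waiting
        elif event == "Unlock":
            holder = None
        elif event == "Wait":
            if pending.get(obj, 0) > 0:
                pending[obj] -= 1       # a signal is pending: no need to block
            else:
                blocked.append((thread, event))
                holder = None           # silent unlock while waiting
        elif event == "Signal":
            pending[obj] = pending.get(obj, 0) + 1
    return blocked

# Scenario D: the producer runs entirely before the consumer.
scenario_d = [
    ("Producer", "Lock",   "mutex"),
    ("Producer", "Signal", "enqueue"),
    ("Producer", "Unlock", "mutex"),
    ("Consumer", "Lock",   "mutex"),
    ("Consumer", "Wait",   "enqueue"),
    ("Consumer", "Signal", "dequeue"),
    ("Consumer", "Unlock", "mutex"),
]
```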
