

  1. Fast and Accurate Performance Analysis of Synchronization Mario Badr and Natalie Enright Jerger

  2. Evaluating Multi-Threaded Performance
  • Difficult and Time Consuming
    • Non-Determinism
    • Cross-stack effects
    • Different Architectures
  • Goal: Make it Straightforward and Fast
    • One trace, many total orders
    • High level of abstraction

  3. More Synchronization != More Overhead
  (chart: synchronization behavior of the Ferret benchmark)

  4. Multi-threaded, Multi-core Workflows
  • Programmer: Write Multi-threaded Program → Profile for Bottlenecks → Implement Optimization → Release Program
  • Systems Researcher: Change Kernel → Test Implementation → Modify Implementation → Release Modifications
  • Architect: Design Multi-processor → Simulate with Benchmarks → Optimize Design → Release Chip
  Open questions: One application? One architecture? Application input? Multiple architectures? Simulation time? Architectures that don't exist?

  5. Cross-Stack Interactions for Synchronization
  • Application
  • Thread Library / Application Runtime
  • Operating System
  • Architecture

  6. Modelling Multithreaded Applications
  • Thread Model: representation of the application
  • Runtime/OS Model
  • Architecture Model: architectural configuration

  7. Execution of a Parallel Program
  (timeline of threads t1–t4)

  8. What impacts a thread's execution time?
  • Heterogeneity
    • Architectures (e.g., big.LITTLE)
    • Dynamic Voltage and Frequency Scaling (DVFS)
  • Contention
  • Synchronization
  • Many other things

  9. The Impact of Heterogeneity
  (timeline of threads t1–t4)

  10. The Impact of Synchronization
  (timeline of threads t1–t4)

  11. Heterogeneity and Synchronization
  (timeline of threads t1–t4)
  The order and time of synchronization events impact performance.

  12. Modelling Cross-Stack Interactions
  • How to represent a multi-threaded application?
    • Task Graph
    • Trace
  • How to model the operating system and runtime?
    • Thread scheduling
    • Synchronization
  • How to model the architecture?
    • Rate of execution (e.g., cycles per instruction)

  13. The Producer Consumer Example
  • Adding Work to a Queue
  • Removing Work from a Queue
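The producer-consumer pattern on this slide can be sketched with Python's threading primitives; a Condition bundles the mutex and the "enqueue" condition variable from the trace on the next slide. This is a minimal illustration, not the presentation's actual benchmark code, and the "dequeue" signal for a bounded queue is omitted for brevity:

```python
import threading
from collections import deque

queue = deque()                      # shared work queue
enqueue = threading.Condition()      # mutex + "enqueue" condition variable

def producer(items):
    for item in items:
        with enqueue:                # Lock mutex
            queue.append(item)       # add work to the queue
            enqueue.notify()         # Signal enqueue
                                     # Unlock mutex on leaving the block

def consumer(n, out):
    for _ in range(n):
        with enqueue:                # Lock mutex
            while not queue:
                enqueue.wait()       # Wait enqueue: silent unlock, then silent lock
            out.append(queue.popleft())

results = []
c = threading.Thread(target=consumer, args=(3, results))
p = threading.Thread(target=producer, args=([1, 2, 3],))
c.start(); p.start()
p.join(); c.join()
```

With a single producer and a FIFO queue, the consumer removes items in the order they were added.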

  14. Representing an Application
  Synchronization Trace:
    Thread    Event   Primitive
    Consumer  Lock    mutex
    Producer  Lock    mutex
    Consumer  Wait    enqueue
    Producer  Signal  enqueue
    Producer  Unlock  mutex
    Consumer  Signal  dequeue
    Consumer  Unlock  mutex
  (the slide also shows the equivalent task graph)
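The synchronization trace on this slide can be stored as an ordered list of (thread, event, primitive) tuples; each thread's program order is then just the subsequence of its own events. A minimal sketch (names are illustrative, not the presentation's code):

```python
# Synchronization trace from slide 14, in captured order.
trace = [
    ("Consumer", "Lock",   "mutex"),
    ("Producer", "Lock",   "mutex"),
    ("Consumer", "Wait",   "enqueue"),
    ("Producer", "Signal", "enqueue"),
    ("Producer", "Unlock", "mutex"),
    ("Consumer", "Signal", "dequeue"),
    ("Consumer", "Unlock", "mutex"),
]

def program_order(trace, thread):
    """Per-thread program order: the events issued by one thread, in sequence."""
    return [(event, primitive) for t, event, primitive in trace if t == thread]

producer_order = program_order(trace, "Producer")
consumer_order = program_order(trace, "Consumer")
```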

  15. The order of synchronization events
  • A synchronization trace gives us the program order of each thread
  • We want to determine the total order of all synchronization events
  • The total order must be correct
    • Safety (e.g., no two threads in the same critical section)
    • Liveness (e.g., all threads make progress eventually)

  16. One Trace, Multiple Total Orders – Captures Non-Determinism
  • Consumer locks mutex first (original trace):
    Consumer Lock, Producer Lock, Consumer Wait, Producer Signal, Producer Unlock, Consumer Signal, Consumer Unlock
  • Consumer is much faster than producer:
    Consumer Lock, Consumer Wait, Producer Lock, Producer Signal, Producer Unlock, Consumer Signal, Consumer Unlock
  • Producer locks mutex first:
    Producer Lock, Consumer Lock, Producer Signal, Producer Unlock, Consumer Wait, Consumer Signal, Consumer Unlock
  • Producer is much faster than consumer:
    Producer Lock, Producer Signal, Producer Unlock, Consumer Lock, Consumer Wait, Consumer Signal, Consumer Unlock
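The "one trace, multiple total orders" idea can be made concrete by enumerating every merge of the two per-thread event sequences that preserves each thread's program order; the synchronization model then keeps only the merges that respect mutex and condition-variable semantics (four of which appear in the backup scenario slides). A sketch with illustrative names:

```python
from itertools import combinations

def interleavings(a, b):
    """All merges of two per-thread event sequences that preserve each
    thread's program order (before filtering by synchronization semantics)."""
    n, total = len(a), len(a) + len(b)
    result = []
    for positions in combinations(range(total), n):
        merged, ai, bi = [], iter(a), iter(b)
        pos = set(positions)
        for i in range(total):
            merged.append(next(ai) if i in pos else next(bi))
        result.append(merged)
    return result

producer = ["P:Lock", "P:Signal", "P:Unlock"]
consumer = ["C:Lock", "C:Wait", "C:Signal", "C:Unlock"]
orders = interleavings(producer, consumer)
```

For 3 producer events and 4 consumer events there are C(7,3) = 35 raw interleavings; far fewer survive the safety and liveness checks.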

  17. Modelling Locks and Condition Variables
  • Per-lock thread queues (figure: threads t1–t4 queued on a lock)
  • Condition variable counters
    • On wait: decrement counter by 1
    • On signal: increment counter by 1
    • On broadcast: increment counter by the number of waiting consumers
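The counter scheme above can be sketched as a small class; the class name, the waiter list, and the return conventions are illustrative assumptions, not the presentation's implementation:

```python
class CondVarCounter:
    """Counter model of a condition variable: wait decrements, signal
    increments, broadcast increments by the number of waiting threads."""
    def __init__(self):
        self.count = 0
        self.waiters = []              # threads currently blocked on this condition

    def wait(self, tid):
        self.count -= 1
        if self.count < 0:             # no pending signal: the thread must block
            self.waiters.append(tid)
            return "blocked"
        return "continues"             # a signal was pending: consume it

    def signal(self):
        self.count += 1
        if self.waiters:
            return self.waiters.pop(0)  # wake the oldest waiter
        return None                     # no waiter: the signal stays pending

    def broadcast(self):
        woken = self.waiters
        self.count += len(woken)        # one increment per waiting thread
        self.waiters = []
        return woken
```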

  18. Estimating the Time Between Events
  1. Dynamic Instructions: the distance between events
  2. Core Frequency and Microarchitecture: the rate between events
  3. The Scheduling of Threads: the opportunity to execute dynamic instructions
  4. The Timing of Prior Events: the dependencies between threads
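Items 1 and 2 combine into a simple estimate (scheduling and prior-event dependencies, items 3 and 4, shift when the interval starts rather than its length). A hedged sketch of the arithmetic, with illustrative parameter values:

```python
def time_between_events(instructions, cpi, freq_hz):
    """Estimate the wall-clock time between two synchronization events:
    cycles = dynamic instructions x cycles-per-instruction,
    time   = cycles / core frequency."""
    return instructions * cpi / freq_hz

# e.g., 100 dynamic instructions at CPI 1.5 on a 2.6 GHz core
t = time_between_events(100, 1.5, 2.6e9)   # about 58 nanoseconds
```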

  19. Our High Level Abstraction
  • Thread Model – a sequence of events: each trace entry gives the thread ID (TID), the synchronization event and the object it is acting on, and the instruction count until the next event (e.g., TID(1) Acquire(A) 100, TID(3) Acquire(A) 342, TID(2) Barrier(B) 612, TID(1) Release(A) 30, ...)
  • Synchronization Model – a controller tracks the executing threads, each thread's current event, inter-thread dependencies, and a sleep queue
  • Scheduler – maintains the thread-to-core map; each core has its own frequency, and cycles per instruction (CPI) is tracked per thread
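The controller/scheduler idea can be sketched as a tiny discrete-event loop. This is a simplified illustration under stated assumptions, not the presentation's implementation: it handles only lock Acquire/Release (no barriers, condition variables, or thread scheduling), assumes one thread per core and zero-latency synchronization, and the trace format and names are invented for the example:

```python
import heapq

def replay(threads, cpi, freq_hz):
    """Minimal controller sketch: each thread is a list of
    (event, lock, instructions-before-event) entries. Advance threads on
    their own cores, serialize lock acquires with per-lock wait queues,
    and emit one total order with timestamps."""
    now = {tid: 0.0 for tid in threads}      # per-thread clock (seconds)
    pc = {tid: 0 for tid in threads}         # index of each thread's next event
    holder, waiting = {}, {}                 # lock -> owner, lock -> blocked threads
    ready = [(trace[0][2] * cpi[tid] / freq_hz, tid)
             for tid, trace in threads.items()]
    heapq.heapify(ready)
    order = []
    while ready:
        t_done, tid = heapq.heappop(ready)
        event, lock, _ = threads[tid][pc[tid]]
        if event == "Acquire" and lock in holder:
            waiting.setdefault(lock, []).append((t_done, tid))   # block thread
            continue
        now[tid] = t_done
        order.append((t_done, tid, event, lock))
        if event == "Acquire":
            holder[lock] = tid
        else:                                 # "Release"
            del holder[lock]
            if waiting.get(lock):             # wake the oldest blocked thread
                t_blk, w = waiting[lock].pop(0)
                heapq.heappush(ready, (max(t_blk, t_done), w))
        pc[tid] += 1
        if pc[tid] < len(threads[tid]):       # schedule the thread's next event
            instr = threads[tid][pc[tid]][2]
            heapq.heappush(ready, (now[tid] + instr * cpi[tid] / freq_hz, tid))
    return order
```

With CPI 1.0 and a 1 Hz "core" (so times equal instruction counts), a thread that reaches Acquire(A) while another holds A is delayed until the release, which is exactly how the order and time of synchronization events shape the estimated execution time.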

  20. Validation Methodology
  • Benchmarks: PARSEC 3.0, Splash-3
  • Execution time measured with GNU time
  • Traces generated with Pin
  • Cycles-per-instruction profiled with VTune™
  • Architecture: Intel Xeon E5-2650 v2
    • 2 sockets, 8 cores per socket, 2 threads per core
    • 20 MB L3 Cache
    • 2.6 GHz
  • Three runs for each experiment

  21. Assumptions and Approximations
  • Cycles-per-instruction encompasses microarchitecture and memory hierarchy performance
  • Synchronization events have zero latency
  • Context switches have zero latency
  • Synchronization model approximates application state (i.e., for condition variables)

  22. Model Validation: 4 Cores, Single Socket
  (bar chart: average measured vs. average estimated execution time per benchmark, 0–200 seconds)

  23. Model Validation: 32 Cores, Dual Socket
  (bar chart: average measured vs. average estimated execution time per benchmark, 0–60 seconds)

  24. Water (nsquared): 8 Cores
  (per-thread charts for thread IDs 0–7: average computation vs. average synchronization time, estimated with our model in nanoseconds and with VTune™ in seconds)

  25. Model Runtime
    Benchmark         Input Set  Input Size  Trace Size  Runtime
    blackscholes      Native     603 MB      1.1 KB      4 ms
    bodytrack         Native     616 MB      31 MB       4.9 minutes
    water (nsquared)  Native     3.6 MB      53 MB       7.5 minutes
    average                                  2 MB        32 seconds
  Orders of magnitude faster than simulation of smaller input sets.

  26. Conclusion
  • A very high level of abstraction can accurately and quickly estimate the performance of a multi-threaded application on a multi-core processor.
    • Average 7.2% error in total execution time
    • Average 32 seconds to generate an estimate
  • Programmers and systems researchers can evaluate on many architectures
  • Architects can evaluate with native inputs and many applications

  27. Future Work
  • How much non-determinism is there across multiple traces of an application?
  • How can a {memory, network} contention model be added to improve error without significantly increasing model complexity?

  28. Our Work is Open Source
  https://github.com/mariobadr/simsync-pmam
  License: Apache 2.0
  Mario Badr and Natalie Enright Jerger

  29. Scenario A – Consumer locks mutex first
  Trace: Consumer Lock mutex; Producer Lock mutex; Consumer Wait enqueue; Producer Signal enqueue; Producer Unlock mutex; Consumer Signal dequeue; Consumer Unlock mutex
  1. Consumer locks mutex
  2. Producer attempts lock
     • Producer blocked
  3. Consumer waits for enqueue
     • Consumer blocked, silent unlock
     • Producer unblocked, silent lock
  4. Producer signals enqueue
     • Consumer tries to lock, remains blocked
  5. Producer unlocks mutex
     • Consumer unblocked, silent lock
  6. Consumer signals dequeue
  7. Consumer unlocks mutex

  30. Scenario B – Consumer is much faster
  Trace: Consumer Lock mutex; Consumer Wait enqueue; Producer Lock mutex; Producer Signal enqueue; Producer Unlock mutex; Consumer Signal dequeue; Consumer Unlock mutex
  1. Consumer locks mutex
  2. Consumer waits for enqueue
     • Consumer blocked, silent unlock
  3. Producer locks mutex
  4. Producer signals enqueue
     • Consumer tries lock, remains blocked
  5. Producer unlocks mutex
     • Consumer unblocked, silent lock
  6. Consumer signals dequeue
  7. Consumer unlocks mutex

  31. Scenario C – Producer locks mutex first
  Trace: Producer Lock mutex; Consumer Lock mutex; Producer Signal enqueue; Producer Unlock mutex; Consumer Wait enqueue; Consumer Signal dequeue; Consumer Unlock mutex
  1. Producer locks mutex
  2. Consumer attempts lock
     • Consumer blocked
  3. Producer signals enqueue
  4. Producer unlocks mutex
     • Consumer unblocked
  5. Consumer locks mutex
  6. Consumer does not have to wait
  7. Consumer signals dequeue
  8. Consumer unlocks mutex

  32. Scenario D – Producer is much faster
  Trace: Producer Lock mutex; Producer Signal enqueue; Producer Unlock mutex; Consumer Lock mutex; Consumer Wait enqueue; Consumer Signal dequeue; Consumer Unlock mutex
  1. Producer locks mutex
  2. Producer signals enqueue
  3. Producer unlocks mutex
  4. Consumer locks mutex
  5. Consumer does not have to wait
  6. Consumer signals dequeue
  7. Consumer unlocks mutex
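Scenario D's walkthrough can be checked mechanically by replaying the total order against a single mutex and condition-variable counters. This simplified sketch (names and the state machine are illustrative; it assumes the replayed order is already a valid one) reports which events block, and confirms that in Scenario D nothing does:

```python
def blocked_steps(total_order):
    """Replay (thread, event, primitive) events against one mutex and
    condition-variable counters; return the events that block."""
    holder = None                       # thread inside the critical section
    pending = {}                        # condition variable -> pending signals
    blocked = []
    for thread, event, obj in total_order:
        if event == "Lock":
            if holder is not None:
                blocked.append((thread, event))
            holder = thread             # acquires, possibly after waiting
        elif event == "Unlock":
            holder = None
        elif event == "Wait":
            if pending.get(obj, 0) > 0:
                pending[obj] -= 1       # a signal is pending: no need to block
            else:
                blocked.append((thread, event))
                holder = None           # silent unlock while waiting
        elif event == "Signal":
            pending[obj] = pending.get(obj, 0) + 1
    return blocked

# Scenario D: the producer runs entirely before the consumer.
scenario_d = [
    ("Producer", "Lock",   "mutex"),
    ("Producer", "Signal", "enqueue"),
    ("Producer", "Unlock", "mutex"),
    ("Consumer", "Lock",   "mutex"),
    ("Consumer", "Wait",   "enqueue"),
    ("Consumer", "Signal", "dequeue"),
    ("Consumer", "Unlock", "mutex"),
]
```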
