Analyzing the Performance of Lock-Free Data Structures: A Conflict-based Model


  1. Analyzing the Performance of Lock-Free Data Structures: A Conflict-based Model
     Aras Atalar, Paul Renaud-Goud and Philippas Tsigas
     Chalmers University of Technology

  3. Motivation
     ◮ Lock-free data structures:
       ◮ Present in the literature and in industrial applications (Intel’s Threading Building Blocks framework, the Java concurrency package)
       ◮ Avoid the limitations of their lock-based counterparts: deadlocks, convoying and reduced programming flexibility
       ◮ Provide high scalability
     ◮ Framework to characterize the scalability:
       ◮ Facilitate lock-free designs
       ◮ Rank implementations within a fair framework

  9. Settings
     Output: data structure throughput, i.e. the number of successful operations per unit of time

     Procedure AbstractAlgorithm
       Initialization();
       while !done do
         Parallel_Work();  /* Application-specific code, conflict-free */
         while !success do
           current ← Read(AP);
           new ← Critical_Work(current);
           success ← CAS(AP, current, new);

     Inputs of the analysis:
     ◮ Platform parameters: CAS and Read latencies, in clock cycles
     ◮ Algorithm parameters:
       ◮ Critical Work and Parallel Work latencies, in clock cycles
       ◮ Total number of threads
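
The retry loop of AbstractAlgorithm can be tried out with a small Python sketch. This is only an illustration under stated assumptions: Python exposes no hardware CAS, so the hypothetical `AtomicCell` class emulates one with a lock, the critical work is an arbitrary increment, and `success` is reset before each operation, which the pseudocode leaves implicit. The names `AtomicCell`, `worker` and `run` are not from the slides.

```python
import threading


class AtomicCell:
    """Hypothetical stand-in for the access point AP.

    A lock emulates the atomicity of the hardware CAS primitive;
    a real lock-free structure would use the instruction itself.
    """

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def cas(self, expected, new):
        """Atomically store `new` if the cell still holds `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def worker(ap, operations, stats):
    """Perform `operations` successful operations on `ap`."""
    failures = 0
    for _ in range(operations):
        # Parallel_Work() would run here (conflict-free).
        success = False
        while not success:
            current = ap.read()              # Read(AP)
            new = current + 1                # Critical_Work(current)
            success = ap.cas(current, new)   # CAS(AP, current, new)
            if not success:
                failures += 1                # failed retry
    stats.append(failures)


def run(threads=4, operations=1000):
    """Return (final value, total failed retries) for a shared counter."""
    ap = AtomicCell(0)
    stats = []
    pool = [threading.Thread(target=worker, args=(ap, operations, stats))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return ap.read(), sum(stats)
```

Calling `run(4, 1000)` returns a final value equal to the number of successful operations (threads × operations); the number of failed retries varies from run to run and depends on how the threads interleave.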

  10. Overview
      [Figure: throughput (ops/msec) versus parallel work (cycles) for cw = 50, threads = 8; cases: Constant, Exponential, Poisson]

  11. Executions Under Contention Levels
      [Figure, built up over four slides: throughput versus parallel work, with execution timelines of threads T0 to T3 and the system at low contention, peak performance and high contention; legend: parallel work, successful retry, failed retry]

  15. Impacting Factors
      ◮ Logical Conflicts
      ◮ Hardware Conflicts: CAS Expansion

  16. Logical Conflicts: (f)-Cyclic Executions
      ◮ Periodic: every thread is in the same state as one period before
      ◮ Shortest period contains exactly 1 successful attempt and exactly f failures per thread
      [Figure: timeline (past, present, future) of threads T0 to T3 and the system; legend: parallel work, successful retry, failed retry, idle thread]
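
The two properties above suggest a simple throughput estimate, sketched here in Python. It rests on an assumption that the slide does not state: a thread is never idle, so its period is one parallel work followed by f failed retries and one successful retry. The function and parameter names are illustrative.

```python
def cyclic_throughput(threads, parallel_work, retry, f):
    """Successful operations per cycle in an (f)-cyclic execution.

    Assumption: each thread's period is one Parallel_Work followed by
    f failed retries and one successful retry, with no idle time.
    All durations are in clock cycles.
    """
    if threads <= 0 or f < 0:
        raise ValueError("threads must be positive and f non-negative")
    period = parallel_work + (f + 1) * retry   # cycles in one period
    return threads / period                    # one success per thread per period
```

For instance, with 4 threads, 600 cycles of parallel work, 100 cycles per retry and f = 1, the period is 800 cycles and the estimate is 4 / 800 successful operations per cycle.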

  17. Inevitable and Wasted Failures
      [Figure: two execution timelines of threads T0 to T3 and the system, compared side by side]

  18. Hardware Conflicts: CAS Expansion
      [Figure: a retry divided into Read & Critical Work, previously expanded CAS, expansion and CAS]
      ◮ Input: $P_{rl}$ threads already in the retry loop
      ◮ A new thread attempts to CAS during the retry (Read + Critical_Work + $e(P_{rl})$ + CAS) with probability $h$:
        $$e(P_{rl} + h) = e(P_{rl}) + h \times \int_0^{\mathit{retry}} \frac{\mathit{cost}(t)}{\mathit{retry}} \, dt$$

  20. Throughput: Combining Impacting Factors
      ◮ Input: $P_{rl}$ (average number of threads inside the retry loop)
        1. Calculate expansion: $e(P_{rl})$
        2. Compute amount of work in a retry: Retry = Read + Critical_Work + $e(P_{rl})$ + CAS
        3. Estimate number of logical conflicts: LogicalConflicts(Retry, Parallel_Work, Threads)
        ⇒ Average number of threads inside the retry loop
      ◮ Convergence via fixed-point iteration
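
The loop formed by steps 1 to 3 can be sketched as a generic fixed-point iteration in Python. The slide does not define the expansion function or the logical-conflict estimate, so both are passed in as caller-supplied placeholders (`expansion` and `threads_in_retry` are hypothetical names), and the stopping rule is an assumption.

```python
def solve_fixed_point(step, initial, tolerance=1e-9, max_iterations=1000):
    """Iterate x <- step(x) until two successive values differ by < tolerance."""
    x = initial
    for _ in range(max_iterations):
        nxt = step(x)
        if abs(nxt - x) < tolerance:
            return nxt
        x = nxt
    raise RuntimeError("fixed-point iteration did not converge")


def make_step(read, critical_work, cas, parallel_work, threads,
              expansion, threads_in_retry):
    """Build the map from an input P_rl to the resulting P_rl.

    `expansion` stands in for e(P_rl); `threads_in_retry` stands in for
    the logical-conflict estimate and returns the average number of
    threads inside the retry loop. Both are placeholders.
    """
    def step(p_rl):
        e = expansion(p_rl)                                      # step 1
        retry = read + critical_work + e + cas                   # step 2
        return threads_in_retry(retry, parallel_work, threads)   # step 3
    return step


# Toy usage with made-up models, only to exercise the iteration:
# expansion grows linearly, and each thread is taken to spend a
# fraction retry / (retry + parallel_work) of its time retrying.
toy_step = make_step(
    read=10, critical_work=50, cas=20, parallel_work=1000, threads=8,
    expansion=lambda p_rl: 2.0 * p_rl,
    threads_in_retry=lambda retry, pw, n: n * retry / (retry + pw),
)
p_rl = solve_fixed_point(toy_step, initial=0.0)
```

The iteration is kept separate from the two models so that any expansion or conflict estimate can be plugged in; convergence is only guaranteed when the resulting map is a contraction, which holds for the toy models above but has to be checked for others.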

  21. Results: Synthetic Tests
      [Figure: throughput (ops/msec) versus parallel work (cycles); cases: Low, High, Average, Real; panels: cw = 50 with 4 and 8 threads, cw = 1600 with 4 and 8 threads]

  22. Back-off Optimization: Michael-Scott Queue
      [Figure: throughput (ops/msec) versus parallel work (cycles) for cw = 225, threads = 8; back-off types: Exponential, Linear, New, None; back-off values: 0, 1, 2, 4, 8, 16, 32]

  23. Conclusion
      ◮ Focus on the cases where parallel work is constant
      ◮ An approach based on the estimation of logical and hardware conflicts
      ◮ Validate our model using synthetic tests and several reference data structures
      ◮ Linear combination of retry loops

  24. Results: Treiber’s Stack
      [Figure: throughput (ops/msec) versus parallel work (cycles); cases: Low, High, Average, Real; panels: cw = 50 with 6 and 8 threads, cw = 1500 with 6 and 8 threads]
