

  1. NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 14 November 2014

  2. Lecture 6
     - Introduction
     - Amdahl’s law
     - Basic spin-locks
     - Queue-based locks
     - Hierarchical locks
     - Reader-writer locks
     - Reading without locking
     - Flat combining

  3. Overview
     - Building shared memory data structures
       - Lists, queues, hashtables, …
     - Why?
       - Used directly by applications (e.g., in C/C++, Java, C#, …)
       - Used in the language runtime system (e.g., management of work, implementations of message passing, …)
       - Used in traditional operating systems (e.g., synchronization between top/bottom-half code)
     - Why not?
       - Don’t think of “threads + shared data structures” as a default/good/complete/desirable programming model
       - It’s better to have shared memory and not need it…

  4. What do we care about?
     - Ease to write: Does it matter? Who is the target audience? How much effort can they put into it? Is implementing a data structure an undergrad programming exercise? …or a research paper?
     - When can it be used? Between threads in the same process? Between processes sharing memory? Within an interrupt handler? With/without some kind of runtime system support?
     - Correctness: What does it mean to be correct? e.g., if multiple concurrent threads are using iterators on a shared data structure at the same time?
     - How fast is it? How well does it scale? Suppose I have a sequential implementation (no concurrency control at all): is the new implementation 5% slower? 5x slower? 100x slower? How does performance change as we increase the number of threads? When does the implementation add or avoid synchronization?

  5. What do we care about?
     - Ease to write
     - When can it be used?
     - Correctness
     - How fast is it? How well does it scale?

  6. What do we care about?
     1. Be explicit about goals and trade-offs
        - A benefit in one dimension often has costs in another
        - Does a perf increase prevent a data structure being used in some particular setting?
        - Does a technique to make something easier to write make the implementation slower?
        - Do we care? It depends on the setting
     2. Remember, parallel programming is rarely a recreational activity
        - The ultimate goal is to increase perf (time, or resources used)
        - Does an implementation scale well enough to out-perform a good sequential implementation?

  7. Suggested reading
     - “The art of multiprocessor programming”, Herlihy & Shavit – excellent coverage of shared memory data structures, from both practical and theoretical perspectives
     - “Transactional memory, 2nd edition”, Harris, Larus, Rajwar – recently revamped survey of TM work, with 350+ references
     - “NOrec: streamlining STM by abolishing ownership records”, Dalessandro, Spear, Scott, PPoPP 2010
     - “Simplifying concurrent algorithms by exploiting transactional memory”, Dice, Lev, Marathe, Moir, Nussbaum, Olszewski, SPAA 2010
     - Intel “Haswell” spec for SLE (speculative lock elision) and RTM (restricted transactional memory)

  8. Amdahl’s law

  9. Amdahl’s law
     “Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to get a 4x speed-up on the overall algorithm?”

  10. Amdahl’s law, f=70%
      [Chart: speedup vs. #cores (1–16): the speedup achieved (perfect scaling on 70%) flattens out well below the desired 4x speedup line]

  11. Amdahl’s law, f=70%

      speedup(f, c) = 1 / ((1 − f) + f/c)

      f = fraction of code the speedup applies to
      c = number of cores used
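      The formula is easy to check numerically. The helper below is not from the lecture; it is a minimal C sketch of the formula above, printing the achieved speedup for f = 70% and the 1/(1 − f) limit.

      #include <stdio.h>

      /* Amdahl's law: speedup(f, c) = 1 / ((1 - f) + f/c) */
      double speedup(double f, double c) {
          return 1.0 / ((1.0 - f) + f / c);
      }

      int main(void) {
          for (int c = 1; c <= 16; c *= 2)
              printf("f=0.70, c=%2d cores: %.2fx\n", c, speedup(0.70, c));
          /* Limit as c -> infinity is 1/(1-f) = 3.33x, so the desired
             4x overall speedup is unreachable however many cores we use */
          printf("f=0.70, limit: %.2fx\n", 1.0 / (1.0 - 0.70));
          return 0;
      }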

  12. Amdahl’s law, f=70%
      [Chart: as slide 10, with the asymptote marked: the limit as c → ∞ is 1/(1−f) = 3.33, below the desired 4x speedup]

  13. Amdahl’s law, f=10%
      [Chart: the speedup achieved with perfect scaling on the 10% barely rises above 1.0; the Amdahl’s law limit is just 1.11x]

  14. Amdahl’s law, f=98%
      [Chart: speedup vs. #cores, now out to 128 cores; with f=98% the speedup keeps climbing across the whole range]

  15. Amdahl’s law & multi-core
      Suppose that the same h/w budget (space or power) can make us:
      [Diagram: 1 big core, or 4 medium cores, or 16 small cores]

  16. Perf of big & small cores
      Assumption: perf = α √resource
      [Chart: core perf (relative to 1 big core) vs. resources dedicated to the core (1/16 … 1). Total perf: 1 big core = 1 × 1 = 1; 16 small cores = 16 × 1/4 = 4]
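      To make the totals on this slide concrete, here is a small sketch of the same arithmetic (ours, not from the slides), taking α = 1: splitting a fixed budget of 1 across n equal cores gives each core 1/n of the resources and perf √(1/n), so total throughput is n · √(1/n) = √n.

      #include <math.h>
      #include <stdio.h>

      int main(void) {
          int configs[] = {1, 4, 16};            /* 1 big, 4 medium, 16 small */
          for (int i = 0; i < 3; i++) {
              int n = configs[i];
              double per_core = sqrt(1.0 / n);   /* perf = sqrt(resource), alpha = 1 */
              printf("%2d cores: per-core perf %.2f, total perf %.2f\n",
                     n, per_core, n * per_core);
          }
          return 0;                              /* 16 cores: 16 * 0.25 = 4.00 */
      }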

  17. Amdahl’s law, f=98%
      [Chart: perf (relative to 1 big core) vs. #cores: 16 small cores come out ahead of 4 medium, with 1 big core flat at 1.0]

  18. Amdahl’s law, f=75%
      [Chart: perf (relative to 1 big core) vs. #cores: with f=75% the small-core advantage disappears; 16 small cores now do worse than 1 big core]

  19. Amdahl’s law, f=5%
      [Chart: perf (relative to 1 big core) vs. #cores: 1 big core is clearly best; 4 medium cores trail it, and 16 small cores do worst]

  20. Asymmetric chips
      [Diagram: the same budget spent on an asymmetric chip: 1 big core plus 12 small cores (“1+12”)]

  21. Amdahl’s law, f=75%
      [Chart: perf (relative to 1 big core) vs. #cores, now including the 1+12 asymmetric chip, which comes out on top (around 1.4x), ahead of 4 medium, 1 big, and 16 small]

  22. Amdahl’s law, f=5%
      [Chart: with f=5%, 1 big core remains best; 4 medium and the 1+12 asymmetric chip sit in the middle, and 16 small cores do worst]

  23. Amdahl’s law, f=98%
      [Chart: with f=98%, the 1+12 asymmetric chip and 16 small cores lead at around 3x, ahead of 4 medium (about 2x) and 1 big]

  24. Amdahl’s law, f=98%
      [Chart: speedup (relative to 1 big core) vs. #cores for a larger budget: a 1+192 asymmetric chip reaches roughly 8x, well ahead of 256 small cores]

  25. Amdahl’s law, f=98%
      [Chart: as slide 24, annotated: leave the larger core idle in the parallel section]

  26. Basic spin-locks

  27. Test and set (pseudo-code)

      bool testAndSet(bool *b) {
        bool result = *b;  // Read the current contents of the location b points to…
        *b = TRUE;         // …set the contents of *b to TRUE
        return result;
      }

      b is a pointer to a location holding a boolean value (TRUE/FALSE); the read and the write happen as a single atomic step.
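      On real hardware, testAndSet is typically a single atomic instruction (e.g., x86 XCHG). As a hedged illustration, not from the slides, here is one way to express it in portable C11 using <stdatomic.h>; the atomic_bool parameter type is an assumption of this sketch.

      #include <stdatomic.h>
      #include <stdbool.h>

      /* Atomically read the old contents of *b and set *b to true */
      bool testAndSet(atomic_bool *b) {
          return atomic_exchange(b, true);
      }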

  28. Test and set
      Suppose two threads use it at once:
      [Timeline: Thread 1’s testAndSet(b) returns TRUE while Thread 2’s returns FALSE; the atomic read-and-set ensures exactly one caller sees the old value FALSE]

  29. Test and set lock

      lock: FALSE   (FALSE => lock available, TRUE => lock held)

      void acquireLock(bool *lock) {
        // Each call to testAndSet tries to acquire the lock,
        // returning TRUE if it is already held
        while (testAndSet(lock)) {
          /* Nothing */
        }
      }

      void releaseLock(bool *lock) {
        *lock = FALSE;
      }

      NB: all this is pseudo-code, assuming SC memory
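      The slide’s pseudo-code assumes SC memory. As a minimal sketch of the same lock in C11, with explicit acquire/release orderings in place of that assumption (the names spin_acquire/spin_release are ours, not the lecture’s):

      #include <stdatomic.h>
      #include <stdbool.h>

      void spin_acquire(atomic_bool *lock) {
          /* atomic_exchange returns the previous value: keep trying
             until we are the thread that flips false -> true */
          while (atomic_exchange_explicit(lock, true, memory_order_acquire)) {
              /* Nothing */
          }
      }

      void spin_release(atomic_bool *lock) {
          /* The pseudo-code's plain store needs release ordering here */
          atomic_store_explicit(lock, false, memory_order_release);
      }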

  30. Test and set lock
      [Diagram: Thread 1 and Thread 2 both run acquireLock; one thread’s testAndSet flips lock from FALSE to TRUE and succeeds, while the other spins. Code as on slide 29]

  31. What are the problems here?
      testAndSet implementation causes contention

  32. Contention from testAndSet
      [Diagram: two single-threaded cores, each with a private L1 and L2 cache, sharing main memory]

  33. Multi-core h/w – separate L2
      [Diagram: one core executes testAndSet(k); the cache line holding k is fetched from main memory into that core’s L2 and L1 in exclusive mode]

  34. Multi-core h/w – separate L2
      [Diagram: the other core now executes testAndSet(k); the line holding k migrates to its caches, invalidating the first core’s copy]

  35. Multi-core h/w – separate L2
      [Diagram: the cores keep executing testAndSet(k), so the line holding k ping-pongs between their caches]
      Does this still happen in practice? Do modern CPUs avoid fetching the line in exclusive mode on failing TAS?

  36. What are the problems here?
      - testAndSet implementation causes contention
      - No control over locking policy
      - Only supports mutual exclusion: not reader-writer locking
      - Spinning may waste resources while waiting

  37. General problem
      - No logical conflict between two failed lock acquires
      - Cache protocol introduces a physical conflict
      - For a good algorithm: only introduce physical conflicts if a logical conflict occurs
        - In a lock: successful lock-acquire & failed lock-acquire
        - In a set: successful insert(10) & failed insert(10)
      - But not:
        - In a lock: two failed lock acquires
        - In a set: successful insert(10) & successful insert(20)
        - In a non-empty queue: enqueue on the left and remove on the right
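      One standard way to avoid the physical conflict between two failed acquires is test-and-test-and-set (TATAS): spin on an ordinary read, which lets waiting threads share the cache line, and only attempt the atomic exchange when the lock looks free. A hedged sketch in the same C11 style as above; this is the usual textbook refinement, not code from the lecture.

      #include <stdatomic.h>
      #include <stdbool.h>

      void tatas_acquire(atomic_bool *lock) {
          for (;;) {
              /* Spin on a plain load: failed waiters share the line in
                 read mode instead of fighting for exclusive ownership */
              while (atomic_load_explicit(lock, memory_order_relaxed)) {
                  /* Nothing */
              }
              /* Lock looked free: now pay for one atomic exchange */
              if (!atomic_exchange_explicit(lock, true, memory_order_acquire)) {
                  return;   /* we flipped false -> true: lock acquired */
              }
          }
      }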
