
Concurrency, Races & Synchronization
CS 450: Operating Systems
Michael Lee <lee@iit.edu>

Agenda:
- Concurrency: what, why, how
- Concurrency-related problems
- Locks & locking strategies
- Concurrent programming with semaphores


  1. Context switching for user-level threads: C wrappers plus x86 assembly to save and restore registers.

     void contextswitch(Context *from, Context *to) {
         if (swapcontext(&from->uc, &to->uc) < 0) {
             fprint(2, "swapcontext failed: %r\n");
             assert(0);
         }
     }

     int swapcontext(ucontext_t *oucp, const ucontext_t *ucp) {
         if (getcontext(oucp) == 0)
             setcontext(ucp);
         return 0;
     }

     #define setcontext(u) setmcontext(&(u)->uc_mcontext)
     #define getcontext(u) getmcontext(&(u)->uc_mcontext)
     #define SET setmcontext
     #define GET getmcontext

     struct ucontext {
         sigset_t uc_sigmask;
         mcontext_t uc_mcontext;
         ...
     };

     struct mcontext {
         ...
         int mc_ebp;
         ...
         int mc_ecx;
         int mc_eax;
         ...
         int mc_eip;
         int mc_cs;
         int mc_eflags;
         int mc_esp;
         ...
     };

     SET:    /* setmcontext: restore the saved registers, jump to the saved %eip */
         movl 4(%esp), %eax
         ...
         movl 28(%eax), %ebp
         ...
         movl 72(%eax), %esp
         pushl 60(%eax)        /* new %eip */
         movl 48(%eax), %eax
         ret

     GET:    /* getmcontext: save the current registers into the mcontext */
         movl 4(%esp), %eax
         ...
         movl %ebp, 28(%eax)
         ...
         movl $1, 48(%eax)     /* %eax: the resumed call returns 1 */
         movl (%esp), %ecx     /* %eip */
         movl %ecx, 60(%eax)
         leal 4(%esp), %ecx    /* %esp */
         movl %ecx, 72(%eax)
         movl 44(%eax), %ecx   /* restore %ecx */
         movl $0, %eax
         ret
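
     (Aside: the portable, if now obsolescent, <ucontext.h> API packages up the same machinery. A minimal sketch, assuming a Linux-style toolchain; the names main_ctx, coro_ctx, and coro_fn are illustrative:)

     #include <stdio.h>
     #include <ucontext.h>

     static ucontext_t main_ctx, coro_ctx;
     static char coro_stack[64 * 1024];         /* stack for the second context */

     static void coro_fn(void) {
         printf("in coroutine\n");
         /* returning ends this context; uc_link resumes main_ctx */
     }

     int main(void) {
         getcontext(&coro_ctx);                 /* initialize from the current context */
         coro_ctx.uc_stack.ss_sp = coro_stack;
         coro_ctx.uc_stack.ss_size = sizeof coro_stack;
         coro_ctx.uc_link = &main_ctx;          /* resumed when coro_fn returns */
         makecontext(&coro_ctx, coro_fn, 0);
         printf("before switch\n");
         swapcontext(&main_ctx, &coro_ctx);     /* save here, jump to coro_fn */
         printf("back in main\n");
         return 0;
     }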

  2. Next: return to reason #3 for concurrency (performance)

  3. int A[DIM][DIM],   /* src matrix A */
         B[DIM][DIM],   /* src matrix B */
         C[DIM][DIM];   /* dest matrix C */

     /* C = A x B */
     void matrix_mult() {
         int i, j, k;
         for (i = 0; i < DIM; i++) {
             for (j = 0; j < DIM; j++) {
                 C[i][j] = 0;
                 for (k = 0; k < DIM; k++)
                     C[i][j] += A[i][k] * B[k][j];
             }
         }
     }

     Run time, with DIM=50, 500 iterations:
         real 0m1.279s
         user 0m1.260s
         sys  0m0.012s

  4. One thread per cell:

     void *row_dot_col(void *index) {
         int *pindex = (int *)index;
         int i = pindex[0];
         int j = pindex[1];
         C[i][j] = 0;
         for (int x = 0; x < DIM; x++)
             C[i][j] += A[i][x] * B[x][j];
         return NULL;
     }

     void run_with_thread_per_cell() {
         pthread_t ptd[DIM][DIM];
         int index[DIM][DIM][2];
         for (int i = 0; i < DIM; i++)
             for (int j = 0; j < DIM; j++) {
                 index[i][j][0] = i;
                 index[i][j][1] = j;
                 pthread_create(&ptd[i][j], NULL, row_dot_col, index[i][j]);
             }
         for (int i = 0; i < DIM; i++)
             for (int j = 0; j < DIM; j++)
                 pthread_join(ptd[i][j], NULL);
     }

     Run time, with DIM=50, 500 iterations:
         real 4m18.013s
         user 0m33.655s
         sys  4m31.936s

  5. A fixed number of threads, each computing a band of rows:

     void *compute_rows(void *arg) {
         int *bounds = (int *)arg;    /* [first row, last row], inclusive */
         for (int i = bounds[0]; i <= bounds[1]; i++) {
             for (int j = 0; j < DIM; j++) {
                 C[i][j] = 0;
                 for (int k = 0; k < DIM; k++)
                     C[i][j] += A[i][k] * B[k][j];
             }
         }
         return NULL;
     }

     void run_with_n_threads(int num_threads) {
         pthread_t tid[num_threads];
         int tdata[num_threads][2];
         int n_per_thread = DIM / num_threads;
         for (int i = 0; i < num_threads; i++) {
             tdata[i][0] = i * n_per_thread;
             tdata[i][1] = (i < num_threads - 1)
                           ? (i + 1) * n_per_thread - 1
                           : DIM - 1;    /* last thread takes any leftover rows */
             pthread_create(&tid[i], NULL, compute_rows, tdata[i]);
         }
         for (int i = 0; i < num_threads; i++)
             pthread_join(tid[i], NULL);
     }

  6. [Charts: real, user, and system run times (0.000–1.700 s) vs. number of threads (1–10); dual-processor system, kernel threading, DIM=50, 500 iterations]

  7. but matrix multiplication happens to be an embarrassingly parallel computation! - not typical of concurrent tasks!

  8. computations on shared data are typically interdependent (and this isn’t always obvious!) — may impose a cap on parallelizability

  9. Amdahl’s law predicts max speedup given two parameters: - P : parallelizable fraction of the program - N : # of execution cores

  10. max speedup S = 1 / (P/N + (1 − P))
      - as P → 1, S → N
      - as N → ∞, S → 1/(1 − P)
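
      (worked example: with P = 0.95 and N = 8, S = 1 / (0.95/8 + 0.05) ≈ 5.9; and no matter how many cores we add, S can never exceed 1/(1 − 0.95) = 20)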

  11. source: http://en.wikipedia.org/wiki/File:AmdahlsLaw.svg

  12. Amdahl’s law is based on a fixed problem size with fixed parallelizable fraction — but we can argue that as we have more computing power we simply tend to throw larger / more granular problem sets at it

  13. e.g., graphics processing: keep turning up resolution/detail weather modeling: increase model parameters/accuracy chess/weiqi AI: deeper search tree

  14. Gustafson & Barsis posit that - we tend to scale problem size to complete in the same amount of time, regardless of the number of cores - the parallelizable amount of work scales linearly with # of cores

  15. Gustafson’s Law computes speedup based on: - N cores - non-parallelizable fraction, P

  16. speedup S = N − P ∙ (N − 1)
      - as P → 1, S → 1
      - as P → 0, S → N
      - predicted speedup is linear with respect to the number of cores!
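
      (worked example: with N = 8 cores and non-parallelizable fraction P = 0.05, S = 8 − 0.05 ∙ 7 = 7.65, close to the full 8× speedup)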

  17. [Plot: speedup S vs. number of cores N, linear under Gustafson’s law]

  18. Amdahl’s vs. Gustafson’s: - the latter has rosier implications for big data / data science - but not all datasets naturally increase in resolution - both stress the importance of maximizing parallelization

  19. some primary challenges of concurrent programming are to: 1. identify thread interdependencies 2. identify (1)’s potential ramifications 3. ensure correctness

  20. e.g., what is the final change in count? (expected: +2)

      Thread A                 Thread B
      a1  count = count + 1    b1  count = count + 1

      interdependency: shared variable count

  21. factoring in machine-level granularity:

      Thread A                 Thread B
      a1  lw   (count), %r0    b1  lw   (count), %r0
      a2  add  $1, %r0         b2  add  $1, %r0
      a3  sw   %r0, (count)    b3  sw   %r0, (count)

      answer: either +1 or +2!
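
      (The same lost update is easy to reproduce with pthreads. A minimal sketch, not from the slides; incr and the iteration count are illustrative, and it should be compiled without optimization, e.g. gcc -O0 -pthread race.c, so the increment stays a load/add/store:)

      #include <pthread.h>
      #include <stdio.h>

      static long count = 0;              /* shared, unprotected */

      static void *incr(void *arg) {
          for (int i = 0; i < 1000000; i++)
              count = count + 1;          /* load, add, store: not atomic */
          return NULL;
      }

      int main(void) {
          pthread_t a, b;
          pthread_create(&a, NULL, incr, NULL);
          pthread_create(&b, NULL, incr, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          /* frequently prints far less than 2000000 */
          printf("count = %ld (expected 2000000)\n", count);
          return 0;
      }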

  22. a race condition exists when results depend on the order of execution of concurrent tasks

  23. shared resources are the problem: more specifically, the concurrent mutability of shared resources

  24. code that accesses shared resource(s) = critical section

  25. synchronization : time-sensitive coordination of critical sections so as to avoid race conditions

  26. e.g., a specific ordering of different threads, or mutually exclusive access to variables

  27. important: try to separate and decouple application logic from synchronization details - not doing this well adds unnecessary complexity to high-level code, and makes it much harder to test and maintain!

  28. most common technique for implementing synchronization is via software “locks” - explicitly acquired & released by consumers of shared resources

  29. § Locks & Locking Strategies

  30. basic idea:
      - create a shared software construct that has well-defined concurrency semantics (a.k.a. a “thread-safe” object)
      - use this object as a guard for another, non-thread-safe shared resource
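
      (A minimal sketch of such a guard using a POSIX mutex; the names count_lock and incr are illustrative. With the lock held around the load/add/store, the interleaving from slide 21 can no longer lose an update; this incr is a drop-in replacement for the racy one sketched earlier:)

      #include <pthread.h>

      static long count = 0;
      static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

      static void *incr(void *arg) {
          for (int i = 0; i < 1000000; i++) {
              pthread_mutex_lock(&count_lock);    /* acquire */
              count = count + 1;                  /* critical section */
              pthread_mutex_unlock(&count_lock);  /* release */
          }
          return NULL;
      }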

  31. [Diagram: threads T_A and T_B each run count = count + 1; both attempt to acquire the lock guarding count]

  32. [Diagram: the lock is allocated to one thread; the other’s acquire blocks]

  33. [Diagram: the lock holder uses count while the other thread waits]

  34. [Diagram: the lock holder releases the lock]

  35. [Diagram: the waiting thread now acquires the lock and uses count]

  36. locking can be: - global ( coarse-grained ) - per-resource ( fine-grained )

  37.–39. [Diagrams: coarse-grained locking policy: threads T_A–T_D share one global lock guarding count, buff, GUI, and logfile]

  40. coarse-grained locking: - is (typically) easier to reason about - results in a lot of lock contention - could result in poor resource utilization — may be impractical for this reason

  41. [Diagram: fine-grained locking policy: each resource (count, buff, GUI, logfile) has its own lock; threads T_A–T_D acquire only the locks they need]

  42. fine-grained locking: - may reduce (individual) lock contention - may improve resource utilization - can result in a lot of locking overhead - can be much harder to verify correctness! - e.g., due to problems such as deadlock

  43. [Diagram: deadlock with a fine-grained locking policy: two threads each hold one lock while waiting for the other’s]
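
      (A minimal sketch of how this deadlock arises in code, assuming POSIX mutexes; the lock and thread names are illustrative. Each thread grabs its first lock, then blocks forever on the other’s:)

      #include <pthread.h>

      static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t buff_lock  = PTHREAD_MUTEX_INITIALIZER;

      static void *thread_a(void *arg) {
          pthread_mutex_lock(&count_lock);
          pthread_mutex_lock(&buff_lock);   /* blocks if B holds buff_lock */
          /* ... use count and buff ... */
          pthread_mutex_unlock(&buff_lock);
          pthread_mutex_unlock(&count_lock);
          return NULL;
      }

      static void *thread_b(void *arg) {
          pthread_mutex_lock(&buff_lock);
          pthread_mutex_lock(&count_lock);  /* blocks if A holds count_lock */
          /* ... use buff and count ... */
          pthread_mutex_unlock(&count_lock);
          pthread_mutex_unlock(&buff_lock);
          return NULL;
      }

      /* If A holds count_lock while B holds buff_lock, each waits forever on
       * the other: a circular wait. One standard fix is a global rule that
       * all threads acquire locks in the same order. */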

  44. so far, we have only considered mutual exclusion; what about instances where we require a specific order of execution? - often very difficult to achieve with simple-minded locks

  45. § Abstraction: Semaphore

  46. The Little Book of Semaphores (Allen B. Downey)

  47. Semaphore rules:
      1. When you create the semaphore, you can initialize its value to any integer, but after that the only operations you are allowed to perform are increment (increase by one) and decrement (decrease by one). You cannot read the current value of the semaphore.
      2. When a thread decrements the semaphore, if the result is negative, the thread blocks itself and cannot continue until another thread increments the semaphore.
      3. When a thread increments the semaphore, if there are other threads waiting, one of the waiting threads gets unblocked.
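
      (POSIX semaphores expose the same abstraction under different names. One caveat: a POSIX semaphore's value never goes negative (sem_wait simply blocks while the value is 0), but the observable behavior matches the rules above. A minimal sketch; demo is an illustrative name:)

      #include <semaphore.h>

      static sem_t fred;

      void demo(void) {
          sem_init(&fred, 0, 1);   /* initial value 1; middle arg 0 = shared between threads, not processes */
          sem_wait(&fred);         /* "decrement": blocks while the value is 0 */
          /* ... */
          sem_post(&fred);         /* "increment": wakes one waiter, if any */
          sem_destroy(&fred);
      }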

  48. Initialization syntax (Listing 2.1):
      fred = Semaphore(1)

  49. Operation names?
      fred.increment_and_wake_a_waiting_process_if_any() / fred.decrement_and_block_if_the_result_is_negative()
      fred.increment() / fred.decrement()
      fred.signal() / fred.wait()
      fred.V() / fred.P()

  50. How to use semaphores for synchronization? 1. Identify essential usage “patterns” 2. Solve “classic” synchronization problems

  51. Essential synchronization criteria: 1. avoid starvation 2. guarantee bounded waiting 3. no assumptions on relative speed (of threads) 4. allow for maximum concurrency

  52. § Using Semaphores for Synchronization

  53. Basic patterns: I. Rendezvous II. Mutual exclusion (Mutex) III. Multiplex IV. Generalized rendezvous / Barrier & Turnstile

  54. I. Rendezvous

      Thread A          Thread B
      1 statement a1    1 statement b1
      2 statement a2    2 statement b2

      Ensure that a1 < b2 and b1 < a2 (read x < y as: x happens before y)

  55. aArrived = Semaphore(0)
      bArrived = Semaphore(0)

      Thread A               Thread B
      1 statement a1         1 statement b1
      2 aArrived.signal()    2 bArrived.signal()
      3 bArrived.wait()      3 aArrived.wait()
      4 statement a2         4 statement b2
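
      (The same rendezvous, sketched with POSIX semaphores; the printouts stand in for statements a1/a2/b1/b2, and the thread names are illustrative:)

      #include <pthread.h>
      #include <semaphore.h>
      #include <stdio.h>

      static sem_t aArrived, bArrived;   /* both initialized to 0 */

      static void *thread_a(void *arg) {
          printf("statement a1\n");
          sem_post(&aArrived);           /* signal: A has arrived */
          sem_wait(&bArrived);           /* wait for B to arrive */
          printf("statement a2\n");
          return NULL;
      }

      static void *thread_b(void *arg) {
          printf("statement b1\n");
          sem_post(&bArrived);
          sem_wait(&aArrived);
          printf("statement b2\n");
          return NULL;
      }

      int main(void) {
          pthread_t a, b;
          sem_init(&aArrived, 0, 0);
          sem_init(&bArrived, 0, 0);
          pthread_create(&a, NULL, thread_a, NULL);
          pthread_create(&b, NULL, thread_b, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          return 0;
      }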

  56. Note: swapping lines 2 & 3 → deadlock!

      Thread A               Thread B
      1 statement a1         1 statement b1
      2 bArrived.wait()      2 aArrived.wait()
      3 aArrived.signal()    3 bArrived.signal()
      4 statement a2         4 statement b2

      Each thread is waiting for a signal that will never arrive.

  57. II. Mutual exclusion

      Thread A             Thread B
      count = count + 1    count = count + 1

      Ensure that the critical sections do not overlap.

  58. mutex = Semaphore(1)

      Here is a solution:

      Thread A              Thread B
      mutex.wait()          mutex.wait()
      # critical section    # critical section
      count = count + 1     count = count + 1
      mutex.signal()        mutex.signal()

      Danger: if a thread blocks while “holding” the mutex semaphore, it will also block all other mutex-ed threads!

  59. III. Multiplex

      multiplex = Semaphore(N)

      1 multiplex.wait()
      2 critical section
      3 multiplex.signal()

      Permits up to N threads into their critical sections at once.
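
      (A minimal multiplex sketch with a POSIX counting semaphore; N = 4 and the function name are illustrative:)

      #include <semaphore.h>

      #define N 4                   /* illustrative capacity */
      static sem_t multiplex;       /* set up once with sem_init(&multiplex, 0, N) */

      void do_critical_work(void) {
          sem_wait(&multiplex);     /* at most N threads get past this point */
          /* ... critical section ... */
          sem_post(&multiplex);     /* leaving frees a slot for another thread */
      }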

  60. IV. Generalized Rendezvous / Barrier

      Puzzle: Generalize the rendezvous solution. Every thread should run the following code:

      Listing 3.2: Barrier code
      1 rendezvous
      2 critical point

  61. Hint:
      1 n = the number of threads
      2 count = 0
      3 mutex = Semaphore(1)
      4 barrier = Semaphore(0)

  62. 1 rendezvous
      2
      3 mutex.wait()
      4     count = count + 1
      5 mutex.signal()
      6
      7 if count == n: barrier.signal()
      8
      9 barrier.wait()
      10 barrier.signal()
      11
      12 critical point

  63. 1 rendezvous
      2
      3 mutex.wait()
      4     count = count + 1
      5 mutex.signal()
      6
      7 if count == n: turnstile.signal()
      8
      9 turnstile.wait()
      10 turnstile.signal()
      11
      12 critical point

      state of turnstile after all threads make it to line 12?

  64. 1 rendezvous
      2
      3 mutex.wait()
      4     count = count + 1
      5     if count == n: turnstile.signal()
      6 mutex.signal()
      7
      8 turnstile.wait()
      9 turnstile.signal()
      10
      11 critical point

      fix for non-determinism (but still off by one)
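
      (The corrected barrier, sketched with POSIX primitives; names follow the pseudocode, and n must be set before any thread calls barrier():)

      #include <pthread.h>
      #include <semaphore.h>

      static int n;                  /* number of participating threads */
      static int count = 0;
      static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
      static sem_t turnstile;        /* initialized to 0 via sem_init(&turnstile, 0, 0) */

      void barrier(void) {
          pthread_mutex_lock(&mutex);
          count = count + 1;
          if (count == n)
              sem_post(&turnstile);  /* last arrival opens the turnstile */
          pthread_mutex_unlock(&mutex);

          sem_wait(&turnstile);      /* each thread passes through... */
          sem_post(&turnstile);      /* ...and lets the next one in */
      }

      /* As the slide notes, the turnstile is left with value 1 ("off by one"),
       * so this barrier is not reusable as written. */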

  65. next: we would like a reusable barrier; we need a way to re-lock the turnstile
