Blowing up the (C++11) atomic barrier: Optimizing C++11 atomics in LLVM

Blowing up the (C++11) atomic barrier: Optimizing C++11 atomics in LLVM. Robin Morisset, Intern at Google. Outline: Background: C++11 atomics; Optimizing around atomics; Fence elimination; Miscellaneous optimizations; Further work: Problems with atomics.


  1. Blowing up the (C++11) atomic barrier: Optimizing C++11 atomics in LLVM. Robin Morisset, Intern at Google

  2. Outline
      ● Background: C++11 atomics
      ● Optimizing around atomics
      ● Fence elimination
      ● Miscellaneous optimizations
      ● Further work: Problems with atomics

  3. Section: Background: C++11 atomics

  4. Can this possibly print 0-0 ?
      Thread 1: x <- 1; print y;
      Thread 2: y <- 1; print x;

  5. Can this possibly print 0-0 ? Yes if your compiler reorders accesses
      Thread 1: print y; x <- 1;
      Thread 2: print x; y <- 1;

  6. Can this possibly print 0-0 ? Yes on x86: needs a fence. Flush your (FIFO) store buffer
      Thread 1: x <- 1; mfence; print y;
      Thread 2: y <- 1; mfence; print x;

  7. Can this possibly print 0 ?
      Thread 1: x <- 42; ready <- 1;
      Thread 2: if (ready) print x;

  8. Can this possibly print 0 ? Yes on ARM. Flush your (non-FIFO) store buffer
      Thread 1: x <- 42; dmb ish; ready <- 1;
      Thread 2: if (ready) print x;

  9. Can this possibly print 0 ? Yes on ARM: needs 2 fences to prevent it
      Thread 1: x <- 42; dmb ish; ready <- 1;            (flush your (non-FIFO) store buffer)
      Thread 2: if (ready) { dmb ish; print x; }         (don't speculate reads across)

  10. Doing it portably: C11/C++11 memory model
      ● data race (dynamic) = undefined
      ● no data race (using mutexes) = intuitive behavior (“Sequentially consistent”)
      ● for lock-free code: atomic accesses

  11. Sequentially consistent
      Thread 1: x.store(1, seq_cst); print(y.load(seq_cst));
      Thread 2: y.store(1, seq_cst); print(x.load(seq_cst));
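
The two-thread program from slides 4 to 11 can be written in real C++ roughly as below. This is a minimal sketch, not code from the talk: the function names `store_buffering_seq_cst` and `never_both_zero` are made up for illustration. With `seq_cst` on all four accesses, the memory model rules out the 0-0 outcome, so the repeated run should never observe it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Runs the two-thread test once (hypothetical helper).
// first  = value of y read by thread 1
// second = value of x read by thread 2
std::pair<int, int> store_buffering_seq_cst() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return {r1, r2};
}

// Repeats the test; returns true if the 0-0 outcome never appeared.
bool never_both_zero(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::pair<int, int> r = store_buffering_seq_cst();
        if (r.first == 0 && r.second == 0) return false;
    }
    return true;
}
```

Calling `never_both_zero(200)` is expected to return `true`. Replacing `seq_cst` with `relaxed` would make 0-0 a permitted outcome, although a short run may still not show it.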

  12. Release/acquire
      Thread 1: x = 42; ready.store(1, release);
      Thread 2: if (ready.load(acquire)) print(x);
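
A compilable version of the release/acquire slide might look like the sketch below. It is an illustration under assumptions: the function name `message_passing` is invented here, and the reader spins instead of using the single `if` from the slide so that the read of `x` always happens and the result is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the value of x read by the consumer once it has seen ready == 1.
int message_passing() {
    int x = 0;                     // plain, non-atomic data
    std::atomic<int> ready{0};
    int seen = -1;

    std::thread producer([&] {
        x = 42;
        ready.store(1, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Spin until the release store becomes visible.
        while (!ready.load(std::memory_order_acquire)) {
        }
        // The acquire load that read 1 synchronizes with the release store,
        // so the write x = 42 happens-before this read.
        seen = x;
    });
    producer.join();
    consumer.join();
    return seen;
}
```

`message_passing()` returns 42. With `relaxed` on both accesses to `ready`, the read of `x` would be a data race and the program would have undefined behavior.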

  15. Section: Optimizing around atomics

  16. Compiler optimizations ?
      Before:
          void foo(int *x, int n) {
              for(int i=0; i<n; ++i){
                  *x *= 42;
              }
          }
      After LICM:
          void foo(int *x, int n) {
              int tmp = *x;
              for(int i=0; i < n; ++i){
                  tmp *= 42;
              }
              *x = tmp;
          }

  17. Compiler optimizations ?
      Before:
          void foo(int *x, int n) {
          }
      After LICM:
          void foo(int *x, int n) {
              int tmp = *x;
              *x = tmp;
          }

  18. Compiler optimizations ?
      Before:
          void foo(int *x, int n) {
          }
      After LICM:
          void foo(int *x, int n) {
              int tmp = *x;
              *x = tmp;
          }
      In another thread:
          ++(*x);

  19. Never introduce a store where there was none
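
The LICM slides can be made concrete with the sketch below. `foo` is the loop from slide 16; `foo_licm_naive` is the promoted form shown on the slides; `foo_licm_guarded` is a hypothetical variant, not taken from the talk, that skips the promotion when the loop would not run. All three agree in single-threaded code. The difference is that the naive form stores to `*x` even when the original performed no store at all, which can overwrite a concurrent `++(*x)`.

```cpp
#include <cassert>

// Original loop: performs no store to *x when n <= 0.
void foo(int *x, int n) {
    for (int i = 0; i < n; ++i) {
        *x *= 42;
    }
}

// Naive promotion of *x to a temporary: always stores to *x,
// even when the loop body never executes.
void foo_licm_naive(int *x, int n) {
    int tmp = *x;
    for (int i = 0; i < n; ++i) {
        tmp *= 42;
    }
    *x = tmp;  // introduced store when n <= 0
}

// Hypothetical guarded variant: promote only when the loop is known to store.
void foo_licm_guarded(int *x, int n) {
    if (n <= 0) return;  // no iteration: no load, no store
    int tmp = *x;
    for (int i = 0; i < n; ++i) {
        tmp *= 42;
    }
    *x = tmp;
}
```

For `n == 2` and `*x == 1`, all three leave 1764 in `*x`. For `n == 0`, only `foo_licm_naive` writes to memory.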

  20. Dead store elimination ? x = 42; … x = 43;

  21. Dead store elimination ?
          x = 42;
          flag1.store(true, release);
          while (!flag2.load(acquire)) continue;
          x = 43;

  22. Dead store elimination ?
      Thread 1:
          x = 42;
          flag1.store(true, release);
          while (!flag2.load(acquire)) continue;
          x = 43;
      Thread 2:
          while (!flag1.load(acquire)) continue;
          print(x);
          flag2.store(true, release);

  23. Dead store elimination ? Race !
      Thread 1:
          x = 42;
          while (!flag2.load(acquire)) continue;
          x = 43;
      Thread 2:
          print(x);
          flag2.store(true, release);

  24. Dead store elimination ? Race !
      Thread 1:
          x = 42;
          flag1.store(true, release);
          x = 43;
      Thread 2:
          while (!flag1.load(acquire)) continue;
          print(x);

  25. Anything can happen to memory between a release and an acquire
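
The race-free handshake from slide 22 can be written as the sketch below (the function name `handshake` is an assumption for illustration). It shows why `x = 42` is not a dead store: the other thread legitimately reads it between the release on `flag1` and the acquire on `flag2`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the value of x observed by the second thread.
int handshake() {
    int x = 0;  // plain, non-atomic
    std::atomic<bool> flag1{false}, flag2{false};
    int printed = -1;

    std::thread t1([&] {
        x = 42;  // looks dead, but is read by t2 below
        flag1.store(true, std::memory_order_release);
        while (!flag2.load(std::memory_order_acquire)) continue;
        x = 43;
    });
    std::thread t2([&] {
        while (!flag1.load(std::memory_order_acquire)) continue;
        printed = x;  // ordered after x = 42 and before x = 43
        flag2.store(true, std::memory_order_release);
    });
    t1.join();
    t2.join();
    return printed;
}
```

`handshake()` returns 42. If a compiler deleted the first store as dead, the second thread would observe the initial value instead.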

  26. Section: Fence elimination

  27. Fence elimination
      Source:
          int t = y.load(acquire);
          …
          x.store(1, release);
      ARM:
          ldr r0, [r0]
          dmb ish
          …
          dmb ish
          str r2, [r1]

  28. [CFG figure: ldr … / dmb ish / str … / dmb ish / str …] 2 fences on main path

  29. [CFG figure: ldr … / dmb ish / str … / dmb ish / str …] 1 fence on main path

  31. [CFG figure: ldr … / str … / str …] Build graph from CFG

  32. [CFG figure: Source / ldr … / str … / str … / Sink] Build graph from CFG; Identify sources/sinks

  34. [CFG figure: Source / ldr … / str … / str … / Sink; edges labeled 5, ∞, 2, ∞, 2, ∞, 5] Build graph from CFG; Identify sources/sinks; Annotate with frequency

  35. [CFG figure: Source / ldr … / str … / str … / Sink; edges labeled 5, ∞, 2, ∞, 2, ∞, 5] Build graph from CFG; Identify sources/sinks; Annotate with frequency; Find min-cut: 2 + 5 = 7 is minimum

  36. [CFG figure: ldr … / dmb ish / str … / dmb ish / str …] Build graph from CFG; Identify sources/sinks; Annotate with frequency; Find min-cut; Move fences
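
The "find min-cut" step can be illustrated with a small max-flow computation, since the maximum flow from source to sink equals the weight of the cheapest cut. The sketch below uses a plain Edmonds-Karp search, which is a choice made for this example and not necessarily what the LLVM pass does. The 5-node graph in `example_fence_cut` is hypothetical: the exact shape of the slide's graph is not recoverable from the transcript, so the edges were chosen only to reproduce the 2 + 5 = 7 cut.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

// Large finite capacity standing in for the edges labeled ∞ (never worth cutting).
const long long INF = 1000000000LL;

// Edmonds-Karp max-flow on a dense capacity matrix.
// The returned flow value equals the weight of the minimum cut.
long long min_cut(std::vector<std::vector<long long>> cap, int source, int sink) {
    const int n = static_cast<int>(cap.size());
    long long total = 0;
    while (true) {
        // Breadth-first search for an augmenting path in the residual graph.
        std::vector<int> parent(n, -1);
        parent[source] = source;
        std::queue<int> q;
        q.push(source);
        while (!q.empty() && parent[sink] == -1) {
            int u = q.front();
            q.pop();
            for (int v = 0; v < n; ++v) {
                if (parent[v] == -1 && cap[u][v] > 0) {
                    parent[v] = u;
                    q.push(v);
                }
            }
        }
        if (parent[sink] == -1) break;  // no augmenting path left

        // Find the bottleneck on the path, then push that much flow.
        long long bottleneck = INF;
        for (int v = sink; v != source; v = parent[v])
            bottleneck = std::min(bottleneck, cap[parent[v]][v]);
        for (int v = sink; v != source; v = parent[v]) {
            cap[parent[v]][v] -= bottleneck;
            cap[v][parent[v]] += bottleneck;
        }
        total += bottleneck;
    }
    return total;
}

// Hypothetical graph: 0 = Source, 1 = ldr, 2 = cold str, 3 = hot str, 4 = Sink.
long long example_fence_cut() {
    std::vector<std::vector<long long>> cap(5, std::vector<long long>(5, 0));
    cap[0][1] = INF;  // Source -> ldr
    cap[1][2] = 2;    // ldr -> cold store, executed 2 times
    cap[1][3] = 5;    // ldr -> hot store, executed 5 times
    cap[2][4] = INF;  // cold store -> Sink
    cap[3][4] = INF;  // hot store -> Sink
    return min_cut(cap, 0, 4);
}
```

`example_fence_cut()` returns 7: the cheapest way to separate the load from both stores in this graph is to cut the two frequency-weighted edges, which is where the fences would be placed.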

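  37. Source:
          while(flag.load(acquire)) {}
      ARM:
          .loop:
              ldr r0, [r1]
              dmb ish
              bnz .loop

  38. Source:
          while(flag.load(acquire)) {}
      ARM:
          .loop:
              ldr r0, [r1]
              bnz .loop
              dmb ish

  39. [Min-cut figure: Source / .loop: ldr r0, [r1] / dmb ish / bnz .loop / … / memory access / Sink; edges labeled 98, 100, 2; dmb ish inside the loop]

  40. [Min-cut figure: Source / .loop: ldr r0, [r1] / bnz .loop / … / dmb ish / memory access / Sink; edges labeled 98, 100, 2; dmb ish after the loop, before the memory access]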
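
At the source level, the effect of moving the fence out of the spin loop corresponds roughly to spinning with a relaxed load and issuing one acquire fence after the loop. The sketch below is an illustration under assumptions, not the compiler transformation itself, and the name `wait_then_read` is invented here. As on the slide, the loop spins while `flag` is true.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the data value seen by the waiting thread.
int wait_then_read() {
    int data = 0;                  // plain, non-atomic
    std::atomic<bool> flag{true};  // the waiter spins while this is true
    int seen = -1;

    std::thread writer([&] {
        data = 42;
        flag.store(false, std::memory_order_release);
    });
    std::thread waiter([&] {
        // No ordering is needed on the iterations that keep spinning.
        while (flag.load(std::memory_order_relaxed)) {
        }
        // One fence on the loop exit instead of one per iteration.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = data;
    });
    writer.join();
    waiter.join();
    return seen;
}
```

`wait_then_read()` returns 42: the relaxed load that finally reads `false`, followed by the acquire fence, synchronizes with the release store.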

  41. Section: Miscellaneous optimizations

  42. x.load(release) ?

  43. x.load(release) ? x.fetch_add(0, release)

  44. x86: x.load(release) ? x.fetch_add(0, release)
          mov %eax, $0
          lock xadd (%ebx), %eax

  45. x86: x.load(release) ? x.fetch_add(0, release)
          mov %eax, $0
          lock xadd (%ebx), %eax
      becomes (7200% speedup for a seqlock*):
          mfence
          mov %eax, (%ebx)
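
The seqlock mentioned on slide 45 is a typical place where a "release load" is wanted: the reader must keep its data loads from sinking below the second read of the sequence number. C++11 has no release load, so the reader uses `fetch_add(0, release)`. The sketch below is a hypothetical single-writer seqlock written for this note, not the benchmark from the talk. `SeqLock` and `seqlock_stress` are invented names, and the writer uses an acquire read-modify-write to begin a write so that it pairs with the reader's release operation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical single-writer seqlock protecting two ints.
struct SeqLock {
    std::atomic<unsigned> seq{0};  // odd while a write is in progress
    std::atomic<int> d1{0}, d2{0};

    // Assumes exactly one writer thread.
    void write(int v) {
        unsigned s = seq.fetch_add(1, std::memory_order_acquire);  // now odd
        d1.store(v, std::memory_order_relaxed);
        d2.store(v, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);               // even again
    }

    // Retries until it has read a consistent pair.
    std::pair<int, int> read() {
        unsigned s0, s1;
        int a, b;
        do {
            s0 = seq.load(std::memory_order_acquire);
            a = d1.load(std::memory_order_relaxed);
            b = d2.load(std::memory_order_relaxed);
            // The "release load": a read-modify-write that adds 0.
            s1 = seq.fetch_add(0, std::memory_order_release);
        } while (s0 != s1 || (s0 & 1u));
        return {a, b};
    }
};

// One writer, one reader; returns true if the reader never saw a torn pair.
bool seqlock_stress(int writes) {
    SeqLock lock;
    std::atomic<bool> done{false};
    std::atomic<bool> ok{true};

    std::thread writer([&] {
        for (int i = 1; i <= writes; ++i) lock.write(i);
        done.store(true);
    });
    std::thread reader([&] {
        while (!done.load()) {
            std::pair<int, int> p = lock.read();
            if (p.first != p.second) ok.store(false);
        }
    });
    writer.join();
    reader.join();
    return ok.load();
}
```

Every read-side `fetch_add(0, release)` here is the operation that slides 44 and 45 show being compiled on x86 as a locked `xadd` before the optimization and as `mfence` plus a plain `mov` after it.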

  46. Power / ARM
      x.store(0, release):   Power: hwsync; stw …      ARM: dmb sy; str …
      x.load(acquire):       Power: lwz …; hwsync      ARM: ldr …; dmb sy

  47. Power / ARM
      x.store(0, release):   Power: lwsync; stw …      ARM: dmb ish; str …
      x.load(acquire):       Power: lwz …; lwsync      ARM: ldr …; dmb ish

  48. Power / ARM (Swift)
      x.store(0, release):   Power: lwsync; stw …      ARM (Swift): dmb ishst; str …
      x.load(acquire):       Power: lwz …; lwsync      ARM (Swift): ldr …; dmb ish

  49. Power: x.store(2, relaxed)
            rlwinm r2, r3, 3, 27, 28
            li r4, 2
            xori r5, r2, 24
            rlwinm r2, r3, 0, 0, 29
            li r3, 255
            slw r4, r4, r5
            slw r3, r3, r5
            and r4, r4, r3
          LBB4_1:
            lwarx r5, 0, r2
            andc r5, r5, r3
            or r5, r4, r5
            stwcx. r5, 0, r2
            bne cr0, LBB4_1

  50. Power: x.store(2, relaxed), same listing as slide 49, annotated: Shuffling

  51. Power: x.store(2, relaxed), same listing as slide 49, annotated: Shuffling; Loop

  52. Power: x.store(2, relaxed), same listing as slide 49, annotated: Shuffling; Load linked (lwarx); Store conditional (stwcx.); Loop

  53. Power: x.store(2, relaxed)
          li r2, 2
          stb r2, 0(r3)

  54. x86: x.store(2, relaxed)
          mov %eax, $2
          mov (%ebx), %eax

  55. x86: x.store(2, relaxed)
          mov (%ebx), $2

  56. Section: Further work: Problems with atomics

  57. Relaxed attribute
      Thread 1: print(y.load(relaxed)); x.store(1, relaxed);
      Thread 2: print(x.load(relaxed)); y.store(1, relaxed);

  58. Relaxed attribute
      Thread 1: print(y.load(relaxed)); x.store(1, relaxed);
      Thread 2: print(x.load(relaxed)); y.store(1, relaxed);
      Can print 1-1
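
The relaxed example can be tried with the sketch below (the function name `load_buffering_relaxed` is invented for illustration). The memory model permits both loads to return 1, because relaxed accesses to different variables impose no ordering between the load and the later store in each thread. Whether that outcome ever shows up depends on the compiler and the hardware, so the code only records what was read.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// first  = value of y read by thread 1
// second = value of x read by thread 2
std::pair<int, int> load_buffering_relaxed() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        r1 = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        r2 = x.load(std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return {r1, r2};
}
```

Each component of the result is 0 or 1. The pair (1, 1) is allowed by the standard even though no interleaving of the four statements produces it.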

  59. Relaxed attribute
      Thread 1: t_y = y.load(relaxed); x.store(t_y, relaxed);
      Thread 2: t_x = x.load(relaxed); y.store(t_x, relaxed);
      x = y = ???

  60. Relaxed attribute
      Thread 1: if(y.load(relaxed)) { x.store(1, relaxed); print("foo"); }
      Thread 2: if(x.load(relaxed)) { y.store(1, relaxed); print("bar"); }
      Can print foobar !

  61. Consume attribute
      Thread 1: *x = 42; x.store(1, release);
      Thread 2: t = x.load(acquire); print(*t);

  62. Consume attribute
      Thread 1: *x = 42; x.store(1, release);
      Thread 2: t = x.load(consume); print(*t);
      Ordered

  63. Consume attribute
      Thread 1: *x = 42; x.store(1, release);
      Thread 2: t = x.load(consume); print(*y);
      Unordered !

  64. Consume attribute
      Thread 1: *x = 42; x.store(1, release);
      Thread 2: t = x.load(consume); print(*(y + t - t));
      ???
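
The "Ordered" case from slide 62 can be written as below. This is a sketch with invented names (`consume_dependency`, `payload`, `ptr`); the slide's pseudocode uses `x` for both the pointer and the atomic, so the example publishes a pointer through a separate `std::atomic<int*>`. The read `*t` carries a dependency from the consume load, which is what orders it after the write to the payload. Mainstream compilers are generally understood to implement consume as acquire, so the sketch does not demonstrate any performance difference.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the payload value read through the published pointer.
int consume_dependency() {
    int payload = 0;
    std::atomic<int*> ptr{nullptr};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        ptr.store(&payload, std::memory_order_release);
    });
    std::thread consumer([&] {
        int* t = nullptr;
        // Spin until the pointer has been published.
        while ((t = ptr.load(std::memory_order_consume)) == nullptr) {
        }
        seen = *t;  // dependent on t, therefore ordered after payload = 42
    });
    producer.join();
    consumer.join();
    return seen;
}
```

`consume_dependency()` returns 42. Reading some unrelated location instead of `*t`, as on slide 63, would not be ordered by the consume load.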

  65. Conclusion
      ● Atomics = portable lock-free code in C11/C++11
      ● Tricky to compile, but can be done
      ● Lots of open questions

  66. Questions ?
