
SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in - PowerPoint PPT Presentation

SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric School of Computing, University of Utah, Salt Lake City , UT 84112 Presented at IPDPS 2018 See paper for


  1. SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs. Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric (School of Computing, University of Utah, Salt Lake City, UT 84112); Ignacio Laguna, Greg L. Lee, Dong H. Ahn (Lawrence Livermore National Laboratory, Livermore, CA). Presented at IPDPS 2018; see paper for details. github.com/PRUNERS. Image courtesy Pinterest.

  2. What is a data race?

  3. What is a data race? Thread 1 Thread 2

  4. What is a data race? Thread 1: R/W. Thread 2: W.

  5. What is a data race? Thread 1: R/W. Thread 2: W. No synchronizations.

  6. One way to eliminate this race. T0: W. T1: R/W.

  7. One way to eliminate this race. T0: LOCK, W, UNLOCK. T1: LOCK, R/W, UNLOCK.
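
The LOCK/UNLOCK pattern on this slide can be sketched in a few lines. This is only an illustration in Python (the talk itself concerns OpenMP); `run_locked`, the counter, and the iteration count are made-up names and values, not anything from the slides.

```python
import threading

def run_locked(iterations=10_000):
    """Two threads update one shared location, each inside LOCK ... UNLOCK."""
    state = {"counter": 0}          # the shared location (hypothetical example data)
    lock = threading.Lock()         # plays the role of LOCK/UNLOCK on the slide

    def worker():
        for _ in range(iterations):
            with lock:              # LOCK on entry, UNLOCK on exit
                state["counter"] += 1   # R/W of the shared location

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]

print(run_locked())
```

Because every read-modify-write happens while the lock is held, no update is lost, whichever thread gets the lock first.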


  9. Another way to eliminate this race. T0: W. T1: R/W.

  10. Another way to eliminate this race. T0: W, then RELEASE. T1: ACQUIRE, then R/W. Signal using “special” variables: • Java ‘volatile’ annotations • C++11 ‘atomic’ annotations • NOT C ‘volatiles’!
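
As a rough analogy to the RELEASE/ACQUIRE signalling above, the sketch below uses Python's `threading.Event` as the "special" variable. This is an assumption made for illustration: the slide names Java `volatile` and C++11 `atomic`, not Python, and `run_release_acquire` and the value 24 are hypothetical.

```python
import threading

def run_release_acquire():
    """T0 writes data and then signals; T1 waits for the signal and then reads."""
    payload = {}                     # ordinary shared data
    ready = threading.Event()        # the "special" signalling variable
    seen = []

    def t0():
        payload["x"] = 24            # W: write the data first
        ready.set()                  # RELEASE: publish the write

    def t1():
        ready.wait()                 # ACQUIRE: block until T0 has signalled
        seen.append(payload["x"])    # R: runs only after T0's write

    # Start the reader first to show that start order does not matter.
    threads = [threading.Thread(target=t1), threading.Thread(target=t0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

print(run_release_acquire())
```

The reader cannot get past `ready.wait()` until the writer has called `ready.set()`, so the read is always ordered after the write.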

  11. A third way. T0: W. T1: R/W.

  12. A third way. Put a barrier between T0's W and T1's R/W.
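
A minimal sketch of the barrier approach, again in Python rather than OpenMP; `run_with_barrier` and the value 42 are hypothetical names chosen for the example.

```python
import threading

def run_with_barrier():
    """T0 writes before the barrier; T1 reads only after the barrier."""
    shared = {"x": None}
    seen = []
    barrier = threading.Barrier(2)   # both threads must arrive before either continues

    def t0():
        shared["x"] = 42             # W happens before T0 reaches the barrier
        barrier.wait()

    def t1():
        barrier.wait()               # T1 cannot pass until T0 has arrived
        seen.append(shared["x"])     # R happens after the barrier

    threads = [threading.Thread(target=t0), threading.Thread(target=t1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

print(run_with_barrier())
```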

  13. Why eliminate races?

  14. Popular answer: “avoid nondeterminism”. T0: t = X. T1: X = 0.

  15. Unclear what “nondeterminism” means..

  16. Execution Order is Still Nondeterministic. T0: LOCK, X = 0, UNLOCK. T1: LOCK, t = X, UNLOCK.

  17. More relevant: Avoid “pink elephants” !

  18. More relevant: Avoid “pink elephants” ! Pink elephant (Sutter) : “A value you never wrote but managed to read” Aka ”out of thin air” value

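  19. The birth of a pink elephant… [Figure: T0/T1 shown before and after compiler optimizations. Before: T0 executes X = 0 and T1 executes t = X, annotated “t is 0 here”. After: T0 executes X = 24 and T1 executes t = X, annotated “24 read here”.] You may never have written “24” in your program.

  20. Details of how a pink elephant is made! [Figure: T0/T1 shown before and after compiler optimizations. Statements shown on T0: X = 0, Y = 23, X = Y + 1, X = 24; on T1: t = X, annotated “24 read here”.] The compiler has NO IDEA that the user meant to communicate here!! Compiler optimizations create these pink-elephant values…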

  21. This is why code containing data races often fails (only) when optimized!

  22. Race-freedom ensures intended communications. T0: W. T1: R/W. • You don’t observe “half baked” values • Code does not reorder around sync. points • No “word tearing” • Pending writes flushed (fences inserted)

  23. Exploding a myth! There is no such thing as a benign race !!

  24. Races in OpenMP programs are hard to spot • See tinyurl.com/ompRaces if you wish (but later!) • Static analysis tools never shown to work well • First usable OpenMP dynamic race checker (afaik): Archer [Atzeni, IPDPS’16]; more on that soon • This talk will present the second usable dynamic race checker: Sword

  25. This talk: Why and how of another OMP race checker

  26. The Pink Elephant Actually Struck Us! • HYDRA porting on Sequoia at LLNL • Large multiphysics MPI/OpenMP application • Non-deterministic crashes in OpenMP region • Only when the code was optimized! • Suspected data race • Emergency hack: disabled OpenMP in Hypre • Root-cause found by Archer: two threads writing 0 to a common location without synchronization

  27. Archer to the rescue!

  28. Archer to the rescue! Archer [IPDPS’16] • Utah: Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric • LLNL: Dong H. Ahn, Ignacio Laguna, Martin Schulz, Gregory L. Lee • RWTH: Joachim Protze, Matthias S. Muller • In production use at LLNL • Part of the “PRUNERS” tool suite • PRUNERS was a finalist of the 2017 R&D 100 Award Selection

  29. Archer’s “find”: Two threads writing 0 to the same location without synchronization


  31. Did we live “happily ever after?”

  32. No !

  33. Archer has “memory-outs”; also misses races

  34. Archer has “memory-outs”; also misses races • Archer increases memory 500% • It also misses races! • These were known issues • Finally surfaced with the “right large example” • Root-cause found by Archer: two threads writing 0 to a common location without synchronization

  35. Reason: Archer employs “shadow cells”. [Figure: Cores 0 to 3 access application addresses A0, A1, …, Amax; each address is backed by shadow cells ss0 to ss3.] A programmable number of cells per address (4 shown, and is typical).

  36. ~4 shadow cells per application location. [Figure: Cores 0 to 3 access application addresses A0, A1, …, Amax; each address is backed by shadow cells ss0 to ss3.] A programmable number of cells per address (4 shown, and is typical). Shadow cells immediately increase memory demand by a factor of four.

  37. Archer misses races due to shadow cell eviction

  38. Archer misses races due to shadow cell eviction. [Figure: Cores 0 to 3; application addresses A0, A1, …, Amax, each with shadow cells ss0 to ss3.]

  39. Archer misses races due to shadow cell eviction. [Figure: Cores 0 to 3; application addresses A0, A1, …, Amax, each with shadow cells ss0 to ss3.] Thread 3 writes a[3]; all threads read a[3]. All threads read A[3]; Thread 3 writes A[3].

  40. Capacity conflict! Evict shadow cell. [Figure: Cores 0 to 3; application addresses A0, A1, …, Amax, each with shadow cells ss0 to ss3.] With shadow cell evicted, races are missed.
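
The effect of a capacity conflict can be illustrated with a toy model. The sketch below is not Archer's implementation: it assumes every access is unsynchronized, it evicts the oldest cell (the real eviction policy may differ), and `detect` and the example trace are hypothetical.

```python
from collections import deque

def detect(accesses, cells_per_address=None):
    """Report racing pairs (i, j) of access indices.

    accesses: list of (thread_id, address, is_write); all are assumed to be
    unsynchronized. cells_per_address=None models unlimited shadow memory.
    """
    shadow = {}                      # address -> recent accesses ("shadow cells")
    races = set()
    for idx, (tid, addr, is_write) in enumerate(accesses):
        cells = shadow.setdefault(addr, deque())
        # Compare the new access against whatever is still in the shadow cells.
        for prev_idx, prev_tid, prev_write in cells:
            if prev_tid != tid and (is_write or prev_write):
                races.add((prev_idx, idx))
        # Capacity conflict: drop the oldest cell (assumed FIFO policy).
        if cells_per_address is not None and len(cells) == cells_per_address:
            cells.popleft()
        cells.append((idx, tid, is_write))
    return races

# Thread 0 writes address "a"; threads 1..5 then read it.
trace = [(0, "a", True)] + [(t, "a", False) for t in range(1, 6)]
unbounded = detect(trace)
bounded = detect(trace, cells_per_address=4)
print(sorted(unbounded - bounded))   # pairs missed once the write's cell is evicted
```

With four cells, the write's record is evicted when the fourth read arrives, so the fifth read has nothing left to conflict with and that racing pair goes unreported.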

  41. Archer misses races due to HB-masking

  42. Archer misses races due to HB-masking. [Figure annotations:] “These are concurrent; there are two races here!” and “These races are missed in this interleaving!”

  43. Solution : Get rid of shadow cells !!

  44. Need New Approach with Online/Offline split. [Figure: Cores 0 to 3, each with its own Compression stage, feeding Offline Analysis, which produces Race Reports.]

  45. Details of the online phase. [Figure: Cores 0 to 3, each with its own Compression stage.] • Collect traces per core, uncoordinated • Trace-collection speeds increased; we use the OMPT tracing method • Employ data compression to bring FULL traces out • Only 2.5 MB compression buffer per thread (fits in L3 cache)

  46. Consequences for the offline phase. [Figure: Cores 0 to 3, each with its own Compression stage.] • We would have lost all the synchronization information • We only know what each thread is doing • We must recover the concurrency structure • And in the context of its happens-before order, detect races!
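
The bounded per-thread buffer idea can be sketched as follows. Everything concrete here is an assumption made for illustration: the record layout, the use of `zlib`, the length-prefixed chunk format, and the names `TraceBuffer` and `read_trace` are not taken from Sword; see the paper for the actual trace format and compressor.

```python
import io
import struct
import zlib

RECORD = struct.Struct("<QB")   # assumed record: 64-bit address, 1-byte is_write flag
HEADER = struct.Struct("<I")    # assumed chunk header: compressed length

class TraceBuffer:
    """Per-thread trace buffer whose memory use never exceeds `capacity` bytes."""

    def __init__(self, sink, capacity=2_500_000):   # 2.5 MB, as on the slide
        self.sink = sink
        self.capacity = capacity
        self.buf = bytearray()
        self.peak = 0            # largest buffer size seen, for inspection

    def log(self, address, is_write):
        # Flush first if the next record would overflow the bound.
        if len(self.buf) + RECORD.size > self.capacity:
            self.flush()
        self.buf += RECORD.pack(address, int(is_write))
        self.peak = max(self.peak, len(self.buf))

    def flush(self):
        if not self.buf:
            return
        chunk = zlib.compress(bytes(self.buf))
        self.sink.write(HEADER.pack(len(chunk)))
        self.sink.write(chunk)
        self.buf.clear()

def read_trace(source):
    """Offline side: decompress chunks and yield (address, is_write) records."""
    while True:
        header = source.read(HEADER.size)
        if not header:
            break
        (size,) = HEADER.unpack(header)
        raw = zlib.decompress(source.read(size))
        for address, is_write in RECORD.iter_unpack(raw):
            yield address, bool(is_write)

# Tiny demonstration with a deliberately small capacity (10 records).
sink = io.BytesIO()
buffer = TraceBuffer(sink, capacity=90)
events = [(0x1000 + 8 * i, i % 2 == 0) for i in range(25)]
for address, is_write in events:
    buffer.log(address, is_write)
buffer.flush()
sink.seek(0)
recovered = list(read_trace(sink))
```

The point of the design is that the in-memory cost per thread is fixed by `capacity`, however long the run is; the full trace still reaches the offline phase.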

  47. Offline synchronization recovery and analysis. [Figure: compressed traces from Cores 0 to 3 feed the OpSem analysis (HIPS’18), shown on an example containing a FOR-LOOP. Offset-span labels: 0 - [0,1]; 1 - [0,1][0,2]; 2 - [0,1][1,2]; 3 - [0,1][0,2][0,2]; 4 - [0,1][0,2][1,2]; 5 - [0,1][1,2][0,2]; 6 - [0,1][1,2][1,2]; 7 - [0,1][2,2]; 8 - [0,1][2,2][0,2]; 9 - [0,1][2,2][1,2]; 10 - [0,1][4,2]; 11 - [0,1][3,2]; 12 - [1,1]. Events shown: m_acq(), m_acq(M1), write(x), write(y), read(x), read(y), m_rel(), m_rel(M1), Barrier(1), Barrier(2), IBarrier(3) to IBarrier(7). Races flagged: R1: race on y; R2: race on y; R3: race on x.]

  48. Offset-Span Labels: How we record concurrency (Mellor-Crummey, 1991)
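
A minimal sketch of offset-span labelling, following the published fork/join rules of the scheme; how Sword extends it for barriers is described in the paper and not modelled here. The function names are hypothetical, and the labels used in the checks are the ones shown in the example diagram.

```python
def fork(label, n):
    """Forking n threads appends the pair [i, n] for child i."""
    return [label + ((i, n),) for i in range(n)]

def join(label):
    """A join replaces the last pair [o, s] of the forking label with [o + s, s]."""
    offset, span = label[-1]
    return label[:-1] + ((offset + span, span),)

def ordered(a, b):
    """True if the segment labelled a must execute before the segment labelled b."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    if i == len(a):
        return len(b) > len(a)           # a is a proper prefix of b
    if i == len(b):
        return False                     # b is a proper prefix of a
    (oa, sa), (ob, sb) = a[i], b[i]
    return oa < ob and oa % sa == ob % sb

def concurrent(a, b):
    return a != b and not ordered(a, b) and not ordered(b, a)

# Rebuild some labels from the example: 0 - [0,1], 1 - [0,1][0,2], and so on.
L0 = ((0, 1),)
L1, L2 = fork(L0, 2)        # [0,1][0,2] and [0,1][1,2]
L3, L4 = fork(L1, 2)        # [0,1][0,2][0,2] and [0,1][0,2][1,2]
L5, L6 = fork(L2, 2)        # [0,1][1,2][0,2] and [0,1][1,2][1,2]
L7 = join(L1)               # [0,1][2,2]
L12 = join(L0)              # [1,1]

print(concurrent(L3, L4), ordered(L3, L7), concurrent(L5, L7))
```

Labels 3 and 4 belong to sibling threads of the same fork, so they are concurrent; label 7 follows the join of that fork, so it is ordered after both; label 5 lives under the other outer thread, so it is concurrent with label 7.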

  49. Key state in OpSem: Maintain Barrier Intervals. [Figure: the offset-span example from slide 47, annotated with Barrier Interval 1, Barrier Interval 2, Barrier Interval 3, and Barrier Interval 5.]

  50. Examples of Races Reported. [Figure: the example from slide 47 with races R1 (race on y), R2 (race on y), and R3 (race on x); Barrier Interval 3 is highlighted with the annotation “Race within same barrier interval”.]

  51. Examples of Races Reported. [Figure: the example from slide 47 with races R1 (race on y), R2 (race on y), and R3 (race on x); Barrier Interval 2, Barrier Interval 3, and Barrier Interval 5 are highlighted with the annotation “Races across parallel regions”.]
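
How an offline pass might turn labelled accesses into race reports can be sketched as below. This is a simplified pairwise check under stated assumptions, not Sword's algorithm: the `Access` record, the lock names, and the example trace are hypothetical, and the trace is not a transcription of the diagram.

```python
from dataclasses import dataclass
from itertools import combinations

def ordered(a, b):
    """Offset-span order test: True if label a must precede label b."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    if i == len(a):
        return len(b) > len(a)
    if i == len(b):
        return False
    (oa, sa), (ob, sb) = a[i], b[i]
    return oa < ob and oa % sa == ob % sb

@dataclass(frozen=True)
class Access:
    label: tuple                     # offset-span label of the executing segment
    address: str
    is_write: bool
    locks: frozenset = frozenset()   # mutexes held at the time of the access

def find_races(accesses):
    """Return index pairs that conflict, are concurrent, and share no lock."""
    races = []
    for (i, a), (j, b) in combinations(enumerate(accesses), 2):
        if a.address != b.address or not (a.is_write or b.is_write):
            continue                 # different location, or two reads
        if a.label == b.label or ordered(a.label, b.label) or ordered(b.label, a.label):
            continue                 # same segment, or ordered by fork/join structure
        if a.locks & b.locks:
            continue                 # protected by a common mutex
        races.append((i, j))
    return races

L3 = ((0, 1), (0, 2), (0, 2))
L4 = ((0, 1), (0, 2), (1, 2))
L5 = ((0, 1), (1, 2), (0, 2))
L6 = ((0, 1), (1, 2), (1, 2))
L7 = ((0, 1), (2, 2))

example_trace = [
    Access(L3, "y", True, frozenset({"critical"})),  # 0
    Access(L4, "y", True, frozenset({"M1"})),        # 1: concurrent with 0, no common lock
    Access(L7, "y", False),                          # 2: ordered after 0 and 1
    Access(L5, "x", True, frozenset({"M1"})),        # 3
    Access(L6, "x", True, frozenset({"M1"})),        # 4: concurrent with 3, same lock
]
print(find_races(example_trace))
```

Only the pair (0, 1) is reported: the two writes to `y` are concurrent and hold different mutexes. The read of `y` is ordered after both writes, and the two writes to `x` hold the same mutex.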
