
Weak Memory Behaviors in GPU Applications - Tyler Sorensen

Weak Memory Behaviors in GPU Applications. Tyler Sorensen. Supervisors: Alastair F. Donaldson and James Brotherston. 15 July 2015, Imperial Concurrency Workshop.


  1. Weak Memory Behaviors in GPU Applications. Tyler Sorensen. Supervisors: Alastair F. Donaldson and James Brotherston. 15 July 2015, Imperial Concurrency Workshop

  4. Overview • Current techniques for reasoning about GPU applications under weak memory models are limited to hand analysis • This is laborious, error prone, and requires a formal model • We propose a new methodology based on stress and fuzz testing

  5. Overview [Diagram: a GPU application, with stressing/fuzzing hooks and a postcondition added, becomes an annotated GPU application] • Run the annotated application for many iterations and check for postcondition violations
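
The run-and-check loop described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `count_violations`, `flaky_dot`, and the 25% drop probability are hypothetical names and values, and the "application" is a CPU stand-in that imitates an intermittent bug, not the GPU dot product from the talk.

```python
import random

def count_violations(run_app, postcondition, iterations):
    """Run the application repeatedly; count runs whose result fails the postcondition."""
    failures = 0
    for _ in range(iterations):
        result = run_app()
        if not postcondition(result):
            failures += 1
    return failures

# Hypothetical stand-in for an annotated application: a dot product that
# sometimes drops its last partial product, imitating an intermittent bug.
def flaky_dot(a, b, rng, drop_probability):
    partials = [x * y for x, y in zip(a, b)]
    if rng.random() < drop_probability:
        partials = partials[:-1]
    return sum(partials)

a, b = [1, 2, 3], [4, 5, 6]
expected = 32  # 1*4 + 2*5 + 3*6
rng = random.Random(0)
failures = count_violations(
    lambda: flaky_dot(a, b, rng, 0.25),   # one run of the "application"
    lambda result: result == expected,    # the postcondition
    1000,
)
print(failures, "of 1000 runs violated the postcondition")
```

The postcondition is kept separate from the application so the same loop can wrap any program whose final state can be checked.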

  8. Overview • Buggy dot product routine • Running the program for 1 hour (~2 seconds per run), the number of failed postconditions is: no stress: 0; stress/fuzzing: 396
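
As a rough reading of these numbers: one hour at about 2 seconds per run is on the order of 1800 runs, so 396 failures is roughly one run in five. The helper names below are hypothetical, and the result is only as precise as the "~2 seconds" estimate.

```python
def estimated_runs(total_seconds, seconds_per_run):
    """Approximate number of runs that fit in the time budget."""
    return total_seconds / seconds_per_run

def failure_rate(failures, total_seconds, seconds_per_run):
    """Fraction of runs that violated the postcondition."""
    return failures / estimated_runs(total_seconds, seconds_per_run)

print(estimated_runs(3600, 2))      # 1800.0
print(failure_rate(396, 3600, 2))   # about 0.22
```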

  9. Roadmap • Background • Stress testing details • Results

  10. Weak memory models • Consider the test known as message passing (MP)

  14. Message passing (MP) test • Tests how to implement a handshake idiom [figure label: Data]

  15. Message passing (MP) test • Tests how to implement a handshake idiom [figure label: Flag]

  16. Message passing (MP) test • Tests how to implement a handshake idiom [figure label: Stale Data]

  22. The assertion cannot be satisfied by interleavings; this is known as Lamport’s sequential consistency (or SC)
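
The claim that no interleaving satisfies the assertion can be checked by brute force. The sketch below assumes the standard form of the MP test, since the slide's own code was an image and is not in this transcript: T0 stores to data `x` and then to flag `y`, T1 loads `y` into `r1` and then `x` into `r2`, and the assertion is `r1 == 1 and r2 == 0` (flag seen, data stale). All names here are illustrative.

```python
from itertools import combinations

# Assumed standard MP test: T0 writes the data then the flag,
# T1 reads the flag then the data.
T0 = [("store", "x", 1), ("store", "y", 1)]
T1 = [("load", "y", "r1"), ("load", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory; return (r1, r2)."""
    mem, regs = {"x": 0, "y": 0}, {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]

sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))      # [(0, 0), (0, 1), (1, 1)]
print((1, 0) in sc_outcomes)    # False: flag set but data stale never occurs
```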

  24. Weak memory models • Can we assume the assertion will never pass? No!

  25. Weak memory models • Alglave et al. report that this assertion passes 41 million times out of 5 billion test runs on a Tegra2 ARM processor [1] [1] http://diy.inria.fr/cats/tables.html

  27. Weak memory models • What happened? • Architectures implement weak memory models, where the hardware is allowed to re-order certain memory instructions • Weak memory models can allow weak behaviors (executions that do not correspond to an interleaving)
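
To see how re-ordering produces an execution that matches no interleaving, here is a deliberately crude model: each thread's two accesses (which touch different locations) may swap before the threads are interleaved. Real hardware permits only certain re-orderings, so this is an illustration rather than a model of any chip. The MP test form and all names are assumptions.

```python
from itertools import combinations, permutations, product

# Assumed standard MP test (not taken from the slides).
T0 = [("store", "x", 1), ("store", "y", 1)]
T1 = [("load", "y", "r1"), ("load", "x", "r2")]

def merges(a, b):
    """Yield every merge of sequences a and b that keeps the order within each."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute a schedule against one shared memory; return (r1, r2)."""
    mem, regs = {"x": 0, "y": 0}, {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]

def outcomes(reorder):
    """reorder=False keeps program order (SC); True lets each thread's accesses swap."""
    orders0 = permutations(T0) if reorder else [tuple(T0)]
    orders1 = permutations(T1) if reorder else [tuple(T1)]
    return {run(s) for a, b in product(orders0, orders1) for s in merges(a, b)}

weak_only = outcomes(True) - outcomes(False)
print(weak_only)   # {(1, 0)}: the weak behavior, reachable only with re-ordering
```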

  28. GPU programming [Diagram: threads are organized into blocks (Block 0, Block 1, ..., Block n); within blocks, threads are grouped into warps; each block has its own shared memory; all blocks access global memory]

  33. Roadmap • Background • Stress testing details • Results

  34. GPU memory models • Previous work [1] showed that GPUs empirically have weak memory models • Done using a tool which ran litmus tests on GPUs • Required heuristics for weak behaviors to appear [1] GPU concurrency: Weak behaviours and programming assumptions. ASPLOS ’15.

  35. Litmus tests

  39. Memory stress [Diagram: threads T0 and T1 run the test program (run T0, run T1); extra threads 1..n each loop, reading or writing to a scratchpad; memory holds the test locations X and Y alongside scratchpad regions]

  40. Memory stress • Can we extend memory stress for testing applications?

  41. Memory stress [Diagram: blocks 0..n run the application on application memory; extra blocks 0..x run memory stress on scratchpad memory]

  43. Memory stress • Goal: design stress to reveal weak behaviors with no a priori knowledge about the application • We investigate using the litmus tests message passing (MP), store buffering (SB), and load buffering (LB)

  53. Memory stress • Where to stress: [Diagram: test locations X and Y separated by distance D; scratchpad location I] • For each distance D: • For each scratchpad location I: • Run the MP, SB, and LB litmus tests at distance D, stressing only location I, for 1000 iterations
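
The nested loops on this slide amount to a parameter sweep. A minimal sketch follows, assuming a hypothetical `run_litmus` callback that runs one litmus test on the GPU and returns how many weak behaviors it observed; the callback, its signature, and the placeholder runner are illustrative and not the tool used in the talk.

```python
from itertools import product

LITMUS_TESTS = ("MP", "SB", "LB")

def sweep(distances, scratchpad_locations, run_litmus, iterations=1000):
    """Return {(test, D, I): weak-behavior count} for every combination."""
    results = {}
    for d, i in product(distances, scratchpad_locations):
        for test in LITMUS_TESTS:
            results[(test, d, i)] = run_litmus(
                test, distance=d, stress_location=i, iterations=iterations
            )
    return results

# Placeholder runner so the sketch executes; it measures nothing and always
# reports zero weak behaviors. A real runner would launch the test on a GPU.
def placeholder_run_litmus(test, distance, stress_location, iterations):
    return 0

results = sweep(range(0, 64, 32), range(4), placeholder_run_litmus)
print(len(results))   # 24 = 2 distances x 4 locations x 3 tests
```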

  60. Memory stress [Figure legend: distance D (between X and Y), index I stressed, litmus test] • A vertical bar represents the magnitude of weak behaviors observed

  61. Memory stress • Visualization samples [figures]

  66. Memory stress • What does this tell us? • To reveal weak behaviors we only need to stress 1 in every 32 locations* • We call a contiguous region of 32 elements a patch (*64 for some chips)
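
In index terms, a patch is just an integer division of the location by the patch size. The sketch below uses hypothetical helper names, and choosing the first element of each patch is an arbitrary assumption: the slide only says that one location per patch is enough.

```python
PATCH_SIZE = 32  # 64 on some chips, per the slide

def patch_of(location, patch_size=PATCH_SIZE):
    """Index of the patch that contains a scratchpad location."""
    return location // patch_size

def one_location_per_patch(scratchpad_size, patch_size=PATCH_SIZE):
    """One representative location per patch (the first element, by assumption)."""
    return list(range(0, scratchpad_size, patch_size))

print(patch_of(33))                  # 1
print(one_location_per_patch(128))   # [0, 32, 64, 96]
```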

  67. Memory stress • How many patches can we effectively stress? • If D is unknown (as in applications), we would like to stress as many disjoint patches as possible

  68. Memory stress • The scratchpad has a size of 64 patches • We try stressing n randomly selected patches, for n from 1 to 64

  70. [Figure] Zoom in on first 8

  72. [Figure] Stressing 2 random patches is most effective

  73. Memory stress • Now we have a memory stressing strategy! • Stress two random patches in the scratchpad • Patch size may change per chip
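
Selecting the patches to stress can be sketched as follows. The defaults (64 patches of 32 elements, 2 patches chosen) come from the slides; the function name and the choice to return the first location of each chosen patch are assumptions made for illustration.

```python
import random

def pick_stress_locations(num_patches=64, patch_size=32, count=2, rng=random):
    """Choose `count` distinct patches at random; return one location in each."""
    patches = rng.sample(range(num_patches), count)
    return [p * patch_size for p in patches]

# A seeded generator makes a run reproducible; omit it to use the global one.
locations = pick_stress_locations(rng=random.Random(1))
print(locations)   # two scratchpad offsets in different patches
```

Keeping the patch size a parameter reflects the slide's note that it may change per chip.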

  74. Roadmap • Background • Stress testing details • Results

  76. Application N-body particle simulation in the Lonestar GPU benchmark [1] • Documented to have communication across blocks • No other a priori information needed for our testing • Postcondition checks the final location of particles [1] see: http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu

  79. Application Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs on a Quadro K5200: no stress: 0; with stress: 48

  80. Comparing stresses • Does it matter how we stress? • We compare our systematic stressing method to 2 other stressing strategies
