
Presenter: Box Leangsuksun, SWEPCO Endowed Professor - PowerPoint PPT Presentation



  1. Partially supported*. Presenter: Box Leangsuksun, SWEPCO Endowed Professor, Computer Science, Louisiana Tech University, box@latech.edu. Co-authors: S. Laosooksathit, N. Naksinehaboon, C. Chandler (Louisiana Tech University); K. Chanchio (Thammasat University); Amir Fabin, A. Dhungana (University of Texas, Arlington). 4th HPCVirt workshop, Paris, France, April 13, 2010.

  2. Outline: Motivations; Background: VCCP; GPU checkpoint protocols: Memcopy vs. simpleStream; CheCUDA (related work); GPU checkpoint protocols: CUDA streams; Restart protocols; Scheduling model and analysis; Conclusion.

  3. More attention on GPUs: the ORNL-NVIDIA 10-petaflop machine. Large-scale GPU clusters need fault tolerance for GPU applications: a normal checkpoint doesn't help a GPU application when a failure occurs, because GPU execution state isn't saved when the checkpoint is taken on the CPU.

  4. High transparency: checkpoint/restart mechanisms should be transparent to applications, OS, and runtime environments, with no modification required. Efficiency: checkpoint/restart mechanisms should not generate unacceptable overheads in normal execution, communication, or checkpointing delay.

  5. (Architecture diagram) Run apps/OS unmodified; checkpoint/restart protocols; FIFO, reliable.

  6. 1. Pause VM computation. 2. Flush messages out of the network. 3. Locally save the state of every VM. 4. Continue computation. (A control-flow sketch of this round follows after the protocol diagrams below.)

  7. VCCP checkpoint protocol (diagram): the head node sends a save request to compute01 and compute02, and each node saves.

  8. VCCP checkpoint protocol (diagram, continued): flush the communication channels; once each channel is empty, every node saves its VM and message buffers.

  9. VCCP checkpoint protocol (diagram, continued): compute01 and compute02 report their results; on success, the head node tells every node to resume and computation continues.
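Slides 6 through 9 describe one coordinated checkpoint round. The sketch below only illustrates that control flow; the node list and the functions pause_vm(), flush_channels(), save_vm_and_buffers(), and resume_vm() are hypothetical placeholders standing in for the hypervisor and network operations VCCP actually performs.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical per-node operations; in VCCP these act on guest VMs and their
// communication channels rather than printing messages.
void pause_vm(const std::string &node)            { std::printf("%s: paused\n", node.c_str()); }
void flush_channels(const std::string &node)      { std::printf("%s: channel empty\n", node.c_str()); }
void save_vm_and_buffers(const std::string &node) { std::printf("%s: VM and buffered messages saved\n", node.c_str()); }
void resume_vm(const std::string &node)           { std::printf("%s: resumed\n", node.c_str()); }

int main() {
    std::vector<std::string> nodes = {"head", "compute01", "compute02"};
    for (const auto &n : nodes) pause_vm(n);             // 1. pause VM computation
    for (const auto &n : nodes) flush_channels(n);       // 2. flush messages out of the network
    for (const auto &n : nodes) save_vm_and_buffers(n);  // 3. locally save the state of every VM
    for (const auto &n : nodes) resume_vm(n);            // 4. continue computation
    return 0;
}
```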

  10. Published in IEEE Cluster 2009. Average overhead of 12%. Provides transparent checkpoint/restart.

  11. CUDA host/device execution model (grid, block, and thread diagram): 1. Device initialization. 2. Device memory allocation. 3. Copy data to device memory. 4. Execute the kernel (calling a __global__ function). 5. Copy data from device memory (retrieve results). Issue: the round-trip latency of data movement.
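As a concrete illustration of the five steps on slide 11, the toy CUDA program below allocates device memory, copies data in, launches a __global__ kernel, and copies the result back; the kernel, array size, and scale factor are made up for the example.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale(float *d_data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;                      // 4. kernel executes on the device
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_data(n, 1.0f);                  // host-side input

    float *d_data = nullptr;                             // 1. device initialized on first CUDA call
    cudaMalloc(&d_data, bytes);                          // 2. device memory allocation
    cudaMemcpy(d_data, h_data.data(), bytes,
               cudaMemcpyHostToDevice);                  // 3. copy data to device memory

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);    // 4. launch the __global__ kernel

    cudaMemcpy(h_data.data(), d_data, bytes,
               cudaMemcpyDeviceToHost);                  // 5. copy results back to the host
    std::printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    return 0;
}
```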

  12. Long-running GPU applications and a (relatively) high failure rate in a large-scale GPU cluster in an MPI and GPU environment. Need to save the GPU software state and move data back from the GPU with low latency: Memcopy (pause the GPU) vs. simpleStream (concurrency).

  13. "CheCUDA: A Checkpoint/Restart Tool for CUDA Applications" by H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi: a prototype add-on package for BLCR for GPU checkpointing, using the memcopy approach.

  14. (Diagram) 1. GPU checkpointing; 2. CPU checkpointing / migration.

  15. (Timeline diagram) Process starts; H-D memory copy; kernel starts; Syncthread(); GPU checkpoint duration; CPU checkpoint/migration; kernel completes; D-H memory copy; process ends.

  16. 1. Copy all the user data in the device memory to the host memory. 2. Write the current status of the application and the user data to a checkpoint file. (A sketch of these two steps follows below.)
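A minimal sketch of those two steps, assuming the checkpointer keeps a list of device allocations; the DeviceRegion bookkeeping structure and the file layout here are illustrative, not the tool's actual format.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

struct DeviceRegion { void *dev_ptr; size_t bytes; };   // hypothetical bookkeeping entry

void checkpoint_gpu(const std::vector<DeviceRegion> &regions, const char *path) {
    FILE *ckpt = std::fopen(path, "wb");
    for (const auto &r : regions) {
        std::vector<char> host(r.bytes);
        // 1. copy the user data from device memory to host memory
        cudaMemcpy(host.data(), r.dev_ptr, r.bytes, cudaMemcpyDeviceToHost);
        // 2. write the region's size and contents to the checkpoint file
        std::fwrite(&r.bytes, sizeof(r.bytes), 1, ckpt);
        std::fwrite(host.data(), 1, r.bytes, ckpt);
    }
    std::fclose(ckpt);
}
```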

  17. 1. Read the checkpoint file. 2. Initialize the GPU and recreate CUDA resources. 3. Send the user data back to the device memory. (A matching restart sketch follows below.)
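The matching restart sketch reads back the same illustrative file layout used in the checkpoint sketch above; cudaFree(0) is used here only to force (re)initialization of the CUDA context before the allocations are recreated.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

std::vector<void *> restart_gpu(const char *path) {
    std::vector<void *> restored;
    cudaFree(0);                               // 2. force CUDA context (re)initialization
    FILE *ckpt = std::fopen(path, "rb");       // 1. read the checkpoint file
    size_t bytes = 0;
    while (std::fread(&bytes, sizeof(bytes), 1, ckpt) == 1) {
        std::vector<char> host(bytes);
        std::fread(host.data(), 1, bytes, ckpt);
        void *dev_ptr = nullptr;
        cudaMalloc(&dev_ptr, bytes);           // 2. recreate the device allocation
        // 3. send the saved user data back to device memory
        cudaMemcpy(dev_ptr, host.data(), bytes, cudaMemcpyHostToDevice);
        restored.push_back(dev_ptr);
    }
    std::fclose(ckpt);
    return restored;
}
```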

  18. Transferring data from device to host is overhead: GPU computation must pause until the copy is completed. SimpleStream uses latency hiding (streams) to reduce the overhead: CUDA streams overlap memory copy and kernel execution. (A stream-overlap sketch follows below.)
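A small sketch of that overlap: the kernel runs in one CUDA stream while checkpoint data drains to the host through cudaMemcpyAsync in another. Buffer names and sizes are illustrative, and asynchronous copies require page-locked (pinned) host memory, hence cudaMallocHost.

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d_compute, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_compute[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_compute, *d_ckpt_copy, *h_ckpt;
    cudaMalloc(&d_compute, n * sizeof(float));
    cudaMalloc(&d_ckpt_copy, n * sizeof(float));
    cudaMallocHost(&h_ckpt, n * sizeof(float));   // pinned host buffer for the async copy

    cudaStream_t compute_stream, copy_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&copy_stream);

    // The kernel keeps executing in its own stream...
    work<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_compute, n);
    // ...while the checkpoint snapshot drains to the host in parallel.
    cudaMemcpyAsync(h_ckpt, d_ckpt_copy, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy_stream);

    cudaStreamSynchronize(copy_stream);     // checkpoint data is now on the host
    cudaStreamSynchronize(compute_stream);  // kernel has finished

    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(copy_stream);
    cudaFreeHost(h_ckpt);
    cudaFree(d_compute);
    cudaFree(d_ckpt_copy);
    return 0;
}
```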

  19. (Flowchart) Process starts; H-D memory copy; code analysis; kernel starts; Syncthread(); after the sync point, OVERWRITE? (YES / NO); GPU checkpoint duration; CPU checkpoint/migration; D-H memory copy; kernel completes; process ends.

  20. (Flowchart, code-analysis step) Process starts; H-D memory copy; code analysis; kernel starts; Syncthread(); after the sync point, OVERWRITE? (YES / NO).

  21. (Flowchart, NO branch) Process starts; H-D memory copy; kernel starts; Syncthread(); after the sync point, no overwrite: GPU checkpoint duration; CPU checkpoint/migration; D-H memory copy; kernel completes; process ends.

  22. (Flowchart, YES branch) Process starts; H-D memory copy; kernel starts; Syncthread(); after the sync point the data would be overwritten: duplicate the image, i.e. copy the sync image within the GPU; GPU checkpoint duration; CPU checkpoint/migration; D-H memory copy; kernel completes; process ends. (A sketch of the duplicate-image step follows below.)
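On the YES branch, one way to realize the "copy the sync image in GPU" step is a device-to-device snapshot into a shadow buffer, which is then streamed out asynchronously while the kernel continues; d_live, d_shadow, and the sizes below are illustrative names, not taken from the talk.

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);
    float *d_live, *d_shadow, *h_ckpt;
    cudaMalloc(&d_live, bytes);
    cudaMalloc(&d_shadow, bytes);          // shadow buffer for the sync-point image
    cudaMallocHost(&h_ckpt, bytes);        // pinned host buffer for the async copy out

    cudaStream_t copy_stream;
    cudaStreamCreate(&copy_stream);

    // At the synchronization point: the kernel would overwrite d_live, so take
    // a cheap device-to-device snapshot before letting it continue.
    cudaMemcpy(d_shadow, d_live, bytes, cudaMemcpyDeviceToDevice);

    // The kernel can now resume on d_live while the snapshot drains to the host.
    cudaMemcpyAsync(h_ckpt, d_shadow, bytes, cudaMemcpyDeviceToHost, copy_stream);
    cudaStreamSynchronize(copy_stream);    // the sync-point image is safe on the host

    cudaStreamDestroy(copy_stream);
    cudaFreeHost(h_ckpt);
    cudaFree(d_shadow);
    cudaFree(d_live);
    return 0;
}
```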

  23. Restart the CPU; transfer the last GPU checkpoint back to the CPU; recreate the CUDA context from the checkpoint file; restart the kernel execution from the marked synchronization point.

  24. A GPU checkpoint is taken after a thread synchronization, but NOT at every thread synchronization. Question: at which thread synchronization should a checkpoint be invoked? Factors: the GPU checkpoint overhead and the chance of a failure occurring.

  25. Notation: O is the GPU checkpoint overhead; the kernel has n thread synchronizations and the current one is the m-th; C_j is the computation time between the j-th and (j+1)-th synchronizations; \hat{C} is the computation completed (and not yet checkpointed) by the m-th synchronization; P_f is the probability that a failure occurs before completion. Skip the checkpoint: expected cost P_f ( \hat{C} + \sum_{j=m}^{n} C_j ). Perform the checkpoint when P_f \hat{C} \ge (1 - P_f) O.

  26. Skip the checkpoint: P_f ( \hat{C} + \sum_{j=m}^{n} C_j ). Perform the checkpoint: expected cost (1 - P_f) O + P_f \sum_{j=m}^{n} C_j. Perform the checkpoint when P_f \hat{C} \ge (1 - P_f) O.
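Comparing the two expected costs shows where the threshold comes from: the common \sum_{j=m}^{n} C_j term cancels, leaving a condition on \hat{C}, O, and P_f alone. A short LaTeX sketch of that step, assuming the cost expressions as written above:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Perform the checkpoint at the m-th synchronization when its expected cost
% does not exceed the expected cost of skipping it.
\begin{align*}
\underbrace{(1 - P_f)\,O + P_f \sum_{j=m}^{n} C_j}_{\text{perform}}
  \;\le\;
\underbrace{P_f\Big(\hat{C} + \sum_{j=m}^{n} C_j\Big)}_{\text{skip}}
\;\Longleftrightarrow\;
(1 - P_f)\,O \;\le\; P_f\,\hat{C}.
\end{align*}
\end{document}
```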
