Atlas Tracking Optimization on GPU



  1. Master Thesis: Atlas Tracking Optimization on GPU
     Luis Domingues
     Professor: Frédéric Bapst
     Supervisors: Paolo Calafiura, Wim Lavrijsen
     Expert: Mathieu Monney
     02/25/2015

  2. Target
     Luis Domingues - January 2015

  3. Code we started from
     ● Demonstrator of the ATLAS trigger on GPUs
     ● Basic host side
       – Take data
       – Send the data to the GPU and process it there
       – Sleep while waiting for the response

  4. Code we started from

  5. Overlapping pixels and SCT
     ● The pixel and SCT processing are done in sequence
     ● Same event, but sequential processing...
     [Timeline diagram: pixel kernels and SCT kernels, each bracketed by time stamps, along a time axis]

  6. Overlapping pixels and SCT

  7. CUDA Streams
     ● A stream is a queue of operations that execute in order
     ● Non-default streams can be executed in parallel
     [Diagram: Stream1, Stream2 and Stream3 each run H2D, Kernel, D2H along a time axis]
     H2D = Host to device transfer
     D2H = Device to host transfer
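The benefit of running the three stages of several streams in parallel can be illustrated with a toy timing model. This is a sketch in Python, not CUDA code. It assumes one host-to-device copy engine, one kernel slot and one device-to-host copy engine, with every stage lasting one arbitrary time unit. The names `STAGES` and `makespan` are hypothetical and do not come from the thesis.

```python
# Toy model of the H2D -> Kernel -> D2H pipeline (not real CUDA code).
# Assumption: each engine (H2D copy, kernel execution, D2H copy) handles
# one operation at a time, and every stage takes one arbitrary time unit.
STAGES = [("h2d", 1.0), ("kernel", 1.0), ("d2h", 1.0)]


def makespan(n_streams, stages, overlap):
    """Return the time at which the last stream finishes."""
    engine_free = {}  # engine name -> time at which it becomes idle
    previous_end = 0.0
    finish = 0.0
    for _ in range(n_streams):
        # Without overlap, a stream starts only after the previous one ends.
        t = 0.0 if overlap else previous_end
        for engine, duration in stages:
            start = max(t, engine_free.get(engine, 0.0))
            t = start + duration
            engine_free[engine] = t
        previous_end = t
        finish = max(finish, t)
    return finish


print("one after another:", makespan(3, STAGES, overlap=False))
print("overlapped streams:", makespan(3, STAGES, overlap=True))
```

In this model, three streams need 9 time units when run one after another and 5 when overlapped. Real hardware may have fewer copy engines or run several kernels at once, so the numbers are only indicative.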

  8. Overlapping pixels and SCT
     ● Use CUDA Streams
     ● Start the SCT processing before the pixel processing ends
     [Timeline diagram: a pixel stream and an SCT stream, each with kernels bracketed by time stamps, along a time axis]

  9. Overlapping pixels and SCT

  10. Overlapping pixels and SCT
      ● For 2000 events, without overlapping
        – Avg Pixel: 2.03 ms
        – Avg SCT: 1.95 ms
        – Total avg: 3.98 ms
      ● For 2000 events, overlapping
        – Avg Pixel: 2.3 ms
        – Avg SCT: 2.5 ms

  11. Overlapping pixels and SCT
      ● Total execution
        – Without overlapping: 8.65 s
        – With overlapping: 6.53 s
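With overlapping, the per-event averages go up (2.03 ms to 2.3 ms for pixel, 1.95 ms to 2.5 ms for SCT) while the total run time goes down. The short calculation below is one way to reconcile the two. It assumes the averages are per-event processing times and that the total run is dominated by them, which the slides do not state. The variable names are illustrative.

```python
# Figures copied from the slides; the interpretation is an assumption.
N_EVENTS = 2000
SEQ_PIXEL_MS, SEQ_SCT_MS = 2.03, 1.95
OVL_PIXEL_MS, OVL_SCT_MS = 2.3, 2.5
MEASURED_OVERLAP_S = 6.53

# Sequential: pixel and SCT times add up for every event.
sequential_s = N_EVENTS * (SEQ_PIXEL_MS + SEQ_SCT_MS) / 1000

# Overlapped: the per-event time lies between the slower of the two
# (perfect overlap) and their sum (no overlap at all).
best_case_s = N_EVENTS * max(OVL_PIXEL_MS, OVL_SCT_MS) / 1000
worst_case_s = N_EVENTS * (OVL_PIXEL_MS + OVL_SCT_MS) / 1000

print(f"sequential processing time: {sequential_s:.2f} s")
print(f"overlapped bounds: {best_case_s:.2f} s to {worst_case_s:.2f} s")
print(f"measured with overlapping: {MEASURED_OVERLAP_S:.2f} s")
```

Under these assumptions the measured 6.53 s falls between the perfect-overlap and no-overlap bounds, which is consistent with the two detectors being processed partly in parallel.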

  12. Multi-thread server side
      ● Huge amount of “small” data
        – They do not fill the GPU
      ● Parallelize the “event” level processing with streams

  13. Multi-thread server side
      [Diagram: several clients and a FIFO queue]

  14. Multi-thread server side
      ● Life of a thread
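The transcript does not preserve the thread life-cycle diagram, so the following is only a plausible sketch of a FIFO served by several worker threads, written in Python for brevity. It assumes each worker repeatedly takes one event from the FIFO, processes it and stores the reply. `process_event`, `worker` and `serve` are hypothetical names, and `process_event` is a stand-in for the real work of copying data to the GPU, launching kernels on the thread's own stream and copying results back.

```python
import queue
import threading


def process_event(event):
    # Hypothetical stand-in for "copy to GPU, run kernels, copy back".
    return event * 2


def worker(fifo, results, lock):
    """Life of one thread: take an event, process it, store the reply."""
    while True:
        event = fifo.get()
        if event is None:  # sentinel: no more work for this thread
            break
        reply = process_event(event)
        with lock:
            results[event] = reply


def serve(events, n_workers=4):
    fifo = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(fifo, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for event in events:
        fifo.put(event)
    for _ in threads:
        fifo.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


print(serve(range(8)))
```

With several workers pulling from the same queue, several events can be in flight at once, which is the idea behind keeping the GPU busy with many small requests.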

  15. Multi-thread server side

  16. Multi-thread server side
      ● Execution times
        – Without overlapping: 8.65 s
        – With overlapping: 6.53 s
        – Multi-threading server side: 4.7 s
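The three run times can be turned into speedups relative to the non-overlapping baseline. The snippet below only restates the figures from the slide; the dictionary keys are labels chosen for illustration.

```python
# Total run times for 2000 events, in seconds, as given on the slide.
runs = {
    "without overlapping": 8.65,
    "with overlapping": 6.53,
    "multi-threaded server": 4.7,
}

baseline = runs["without overlapping"]
speedups = {name: baseline / seconds for name, seconds in runs.items()}

for name, factor in speedups.items():
    print(f"{name}: {runs[name]:.2f} s, {factor:.2f}x")
```

Overlapping alone gives roughly a 1.32x speedup, and adding the multi-threaded server brings it to roughly 1.84x over the original code.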

  17. CUDA Occupancy
      ● A good choice of grid/block size on the card can have a significant impact
      ● CUDA offers an API to maximize the occupancy of the kernels
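Occupancy is the ratio of resident warps to the maximum number of warps a multiprocessor can hold. The sketch below computes a simplified theoretical occupancy from the block size alone. It is a Python model, not the CUDA occupancy API, and it ignores register and shared-memory limits. The per-multiprocessor limits (2048 threads, 16 blocks) are assumed example values and are not taken from the slides.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def theoretical_occupancy(block_size, max_threads_per_sm=2048,
                          max_blocks_per_sm=16):
    """Fraction of a multiprocessor's warp slots filled by one kernel.

    Simplified: register and shared-memory usage are ignored, and the
    default limits are assumed example values.
    """
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    max_warps_per_sm = max_threads_per_sm // WARP_SIZE
    resident_blocks = min(max_blocks_per_sm,
                          max_warps_per_sm // warps_per_block)
    return resident_blocks * warps_per_block / max_warps_per_sm


for block_size in (64, 256, 1024):
    print(block_size, theoretical_occupancy(block_size))
```

In this model a block size of 64 reaches 50% occupancy, while 256 and 1024 both reach 100%. The largest blocks, however, fit only two per multiprocessor, so equal occupancy does not mean equal scheduling freedom.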

  18. CUDA Occupancy
      [Diagram labels: CUDA core, multiprocessor, GPU]

  19. CUDA Occupancy
      ● Bad block size setup
      [Diagram labels: CUDA core, multiprocessor, GPU, Kernel 1, Kernel 2, intra-block synchronization]

  20. CUDA Occupancy
      ● Better block setup
      [Diagram labels: CUDA core, multiprocessor, GPU, Kernel 1, Kernel 2, intra-block synchronization]

  21. CUDA Occupancy
      ● Maximizing the occupancy kills overall performance
      ● Run results for 2000 events
        – Big block size: 10.88 s
        – Original configuration: 4.7 s
        – Small block size: 4.4 s

  22. CUDA Occupancy
      ● Maximizing the occupancy kills overall performance
      ● Run results for 2000 events
        – Big block size: 3 kernels in parallel (max 5)
        – Small block size: 4 kernels in parallel (max 7)

  23. Conclusion
      ● Important points when using a GPU
        – Porting the algorithm to the GPU
        – Communicating with the GPU
        – Host side design
      ● Keep the GPU busy
      ● High occupancy does not allow the GPU to schedule its tasks efficiently
