Master Thesis
ATLAS Tracking Optimization on GPU
Luis Domingues
Professor: Frédéric Bapst
Supervisors: Paolo Calafiura, Wim Lavrijsen
Expert: Mathieu Monney
02/25/2015

Target Luis Domingues - January 2015 2
Code we started from
● Demonstrator of the ATLAS trigger on GPUs
● Basic host side
– Take the data
– Send the data to the GPU and run the computation there
– Sleep while waiting for the response

Overlapping pixels and SCT
● The pixel and SCT processing are done in sequence
● Same event, but sequential processing
[Figure: timeline with time stamps before and after the pixel kernels, followed by time stamps before and after the SCT kernels]

CUDA Streams
● A stream is a queue of operations that execute in issue order
● Operations in different non-default streams can run concurrently
[Figure: timeline of three streams (Stream1, Stream2, Stream3), each issuing H2D, Kernel, D2H]
H2D = host-to-device transfer
D2H = device-to-host transfer

Overlapping pixels and SCT
● Use CUDA streams
● Start the SCT processing before the pixel processing ends
[Figure: timeline of a pixel stream and an SCT stream, each with time stamps before and after its kernels]

Overlapping pixels and SCT
● For 2000 events, without overlapping
– Avg pixel: 2.03 ms
– Avg SCT: 1.95 ms
– Total avg: 3.98 ms
● For 2000 events, with overlapping
– Avg pixel: 2.3 ms
– Avg SCT: 2.5 ms

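The averages can be read with a small piece of arithmetic: in sequence the two stages add up, while with overlap the event can finish, at best, when the slower stage finishes. The sketch below is only an idealized model; the function names are hypothetical and the "full overlap" bound assumes both streams start at the same time, which the measurements do not claim.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Per-event latency when the SCT kernels start only after the pixel kernels finish.
double sequentialLatencyMs(double pixelMs, double sctMs) {
    return pixelMs + sctMs;
}

// Best-case per-event latency if both stages start together and overlap fully.
// This is an idealization: real overlap depends on the GPU resources left free.
double overlappedLowerBoundMs(double pixelMs, double sctMs) {
    return std::max(pixelMs, sctMs);
}
```

With the averages above, `sequentialLatencyMs(2.03, 1.95)` gives the 3.98 ms total, and `overlappedLowerBoundMs(2.3, 2.5)` gives 2.5 ms as the best case for the overlapped run. Each stage is individually slower under overlap, but the stages no longer add up.
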
Overlapping pixels and SCT
● Total execution time
– Without overlapping: 8.65 s
– With overlapping: 6.53 s

Multi-thread server side
● Huge amount of “small” data
– Each piece is too small to fill the GPU
● Parallelize the event-level processing with streams

Multi-thread server side
[Figure: several clients connected to a FIFO queue]

Multi-thread server side
● Life of a thread

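A minimal host-side sketch of the FIFO plus worker-thread idea follows. It is not the thesis code: `EventFifo`, `runWorkers` and `processEvent` are hypothetical names, and the sketch assumes each worker would issue its GPU work on its own CUDA stream inside `processEvent`, which is left abstract here so that the example stays plain C++.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Thread-safe FIFO shared by the producers (clients) and the worker threads.
template <typename T>
class EventFifo {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(item));
        }
        notEmpty_.notify_one();
    }

    // Signals that no more items will arrive; blocked workers wake up and exit.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        notEmpty_.notify_all();
    }

    // Blocks until an item is available. Returns false once the FIFO is
    // closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable notEmpty_;
    std::queue<T> queue_;
    bool closed_ = false;
};

// Life of a worker thread: pop an event, process it, repeat until the FIFO closes.
// Assumption: in a GPU server, processEvent would enqueue the transfers and
// kernels for one event on the worker's own stream and wait for the result.
template <typename T>
void runWorkers(EventFifo<T>& fifo, int workerCount,
                const std::function<void(const T&)>& processEvent) {
    std::vector<std::thread> workers;
    for (int i = 0; i < workerCount; ++i) {
        workers.emplace_back([&fifo, &processEvent] {
            T event;
            while (fifo.pop(event)) processEvent(event);
        });
    }
    for (auto& worker : workers) worker.join();
}
```

A caller would push events as clients deliver them, call `close()` at shutdown, and run something like `runWorkers<int>(fifo, 4, handler)`. The handler must be safe to call from several threads at once.
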
Multi-thread server side
● Execution times
– Without overlapping: 8.65 s
– With overlapping: 6.53 s
– Multi-threaded server side: 4.7 s

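To put the three run times on a common scale, the sketch below computes the speedup over the baseline and the average time per event. The helper names are hypothetical, and the per-event figure assumes that all three runs processed the same 2000 events.

```cpp
#include <cassert>
#include <cmath>

// Speedup of an optimized run relative to a baseline run (both in seconds).
double speedup(double baselineSeconds, double optimizedSeconds) {
    assert(optimizedSeconds > 0.0);
    return baselineSeconds / optimizedSeconds;
}

// Average wall-clock time per event, in milliseconds.
double perEventMs(double totalSeconds, int eventCount) {
    assert(eventCount > 0);
    return totalSeconds * 1000.0 / eventCount;
}
```

With the measured times, `speedup(8.65, 6.53)` is about 1.32 and `speedup(8.65, 4.7)` is about 1.84. Under the 2000-event assumption, `perEventMs(4.7, 2000)` is 2.35 ms per event.
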
CUDA Occupancy
● A good choice of grid and block sizes for the card can make a significant difference
● CUDA offers an API to maximize the occupancy of the kernels

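Occupancy is commonly understood as the share of a multiprocessor's thread slots that a kernel can keep resident. The sketch below is a simplified model, not the CUDA API: `SmLimits` and both functions are hypothetical, the limits must be supplied by the caller for the actual card, and registers and shared memory are ignored although they also cap occupancy in practice. It also assumes block sizes that are multiples of the warp size. The CUDA runtime provides helpers such as `cudaOccupancyMaxPotentialBlockSize` that take those extra resources into account.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor limits; real values depend on the GPU generation.
struct SmLimits {
    int maxThreadsPerSm;
    int maxBlocksPerSm;
};

// Number of blocks of the given size that can be resident on one multiprocessor,
// considering only the thread and block limits (registers and shared memory ignored).
int residentBlocksPerSm(int blockSize, const SmLimits& limits) {
    assert(blockSize > 0);
    return std::min(limits.maxThreadsPerSm / blockSize, limits.maxBlocksPerSm);
}

// Theoretical occupancy: resident threads divided by the maximum resident threads.
double theoreticalOccupancy(int blockSize, const SmLimits& limits) {
    return static_cast<double>(residentBlocksPerSm(blockSize, limits) * blockSize)
           / limits.maxThreadsPerSm;
}
```

For example, with assumed limits of 2048 threads and 16 blocks per multiprocessor, a block size of 1024 fills the multiprocessor with only 2 blocks, while a block size of 64 reaches the 16-block limit at half the thread slots. A kernel that fills every slot leaves no room for blocks of another kernel on the same multiprocessor.
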
CUDA Occupancy
[Figure legend: CUDA core, multiprocessor, GPU]

CUDA Occupancy
● Bad block size setup
[Figure legend: CUDA core, multiprocessor, GPU, Kernel 1, Kernel 2, intra-block synchronization]

CUDA Occupancy
● Better block size setup
[Figure legend: CUDA core, multiprocessor, GPU, Kernel 1, Kernel 2, intra-block synchronization]

CUDA Occupancy
● Maximizing the occupancy hurts overall performance
● Run results for 2000 events
– Big block size: 10.88 s
– Original configuration: 4.7 s
– Small block size: 4.4 s

CUDA Occupancy
● Maximizing the occupancy hurts overall performance
● Run results for 2000 events
– Big block size: 3 kernels in parallel (max 5)
– Small block size: 4 kernels in parallel (max 7)

Conclusion
● Important points when using a GPU
– Porting the algorithm to the GPU
– Communicating with the GPU
– Host-side design
● Keep the GPU busy
● High occupancy prevents the GPU from scheduling its tasks efficiently