

  1. Enabling predictable parallelism in single-GPU systems with persistent CUDA threads
     Paolo Burgio, University of Modena, Italy
     paolo.burgio@unimore.it
     Toulouse, 6 July 2016

  2. GP-GPUs / General-Purpose GPUs
     - Born for graphics, later repurposed for general-purpose computation
     - Massively parallel architectures
     - Baseline for the next generation of power-efficient embedded devices
       - Tremendous performance/watt
     - Growing interest also in automotive and avionics
     - Still not adoptable in (real-time) industrial settings

  3. Why not real-time GPUs?
     - Complex architecture hampers analyzability
       - Poor predictability
     - Non-openness of drivers, firmware...
       - Hard to do research
     - Typically, the GPU is treated as a "black box"
       - An atomic shared resource
     => Hard to extract timing guarantees

  4. LightKer
     - Exposes the GPU architecture at the application level
       - Host-accelerator architecture
       - Clusters of cores
       - Non-Uniform Memory Access (NUMA) system
       - Same as modern accelerators
     - Pure software approach: no additional hardware!
     [Diagram: host cores and GPU clusters; each cluster has an L1 local
     memory, clusters share the L2 (global) memory, and host and GPU
     memories are distinct]

  5. Persistent GPU threads
     - Run at user level
     - Pinned to cores
     - Continuously spin-wait for work to execute (see the sketch below)
     - Mapping: 1 CUDA thread <-> 1 GPU core; 1 CUDA block <-> 1 GPU cluster
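     To make the spin-wait loop concrete, here is a minimal sketch of a
     persistent-threads kernel. It is not the LightKer source: the mailbox
     pointer, the FLAG_* protocol, and the payload are assumptions for
     illustration. The kernel is launched once, with one block per cluster
     and one thread per core, and blocks never exit until the host posts a
     stop command.

        // Minimal persistent-threads sketch (illustrative, not LightKer code).
        #include <cuda_runtime.h>

        #define FLAG_IDLE 0   // no work pending (hypothetical protocol)
        #define FLAG_WORK 1   // host posted a job for this cluster
        #define FLAG_STOP 2   // host asks the persistent threads to exit

        __global__ void persistent_kernel(volatile int *mbox, float *data)
        {
            const int cluster = blockIdx.x;   // 1 CUDA block <-> 1 GPU cluster

            while (true) {
                __shared__ int cmd;
                if (threadIdx.x == 0) {
                    while (mbox[cluster] == FLAG_IDLE)
                        ;                     // spin-wait for the host's trigger
                    cmd = mbox[cluster];
                }
                __syncthreads();              // broadcast the command to the block

                if (cmd == FLAG_STOP)
                    return;                   // Dispose: the whole block exits

                // --- per-cluster user work goes here (1 thread <-> 1 core) ---
                data[cluster * blockDim.x + threadIdx.x] += 1.0f;

                __syncthreads();
                if (threadIdx.x == 0) {
                    __threadfence_system();    // make results visible to the host
                    mbox[cluster] = FLAG_IDLE; // signal completion
                }
            }
        }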

  6. Host-to-device communication
     - Lock-free mailbox
       - One mailbox item for each cluster
     - Clusters exposed at the application level
       - A master thread for each cluster, triggered by the host
     [Diagram: per-cluster master core reads/writes the from_GPU and
     to_GPU mailbox flags shared between GPU memory and host memory]
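     A plausible host-side counterpart, again as a sketch: the Mailbox
     layout, the lk_trigger/lk_wait names, and the use of mapped
     (zero-copy) memory are assumptions, not the actual LightKer API. The
     key point is that each flag has exactly one writer and one reader,
     which is what makes the exchange lock-free.

        // Host-side mailbox sketch, following the slide's to_GPU/from_GPU
        // scheme (illustrative names, not the LightKer API).
        #include <cuda_runtime.h>
        #include <cstring>

        #define NUM_CLUSTERS 16   // e.g. the 16 SMs of a GTX 980

        // One slot per cluster; every flag has a single writer and a single
        // reader, so no locks are required.
        struct Mailbox {
            volatile int to_gpu[NUM_CLUSTERS];    // host writes, GPU master reads
            volatile int from_gpu[NUM_CLUSTERS];  // GPU master writes, host reads
        };

        void lk_trigger(Mailbox *m, int cluster) { m->to_gpu[cluster] = 1; }

        void lk_wait(Mailbox *m, int cluster)
        {
            while (m->from_gpu[cluster] == 0)
                ;                                 // host-side spin-wait
            m->from_gpu[cluster] = 0;             // re-arm for the next job
        }

        int main()
        {
            Mailbox *h_mbox, *d_mbox;
            // Mapped (zero-copy) allocation: the spinning GPU master threads
            // see host writes without a kernel launch or explicit copy.
            cudaHostAlloc((void **)&h_mbox, sizeof(Mailbox), cudaHostAllocMapped);
            memset((void *)h_mbox, 0, sizeof(Mailbox));
            cudaHostGetDevicePointer((void **)&d_mbox, (void *)h_mbox, 0);

            // With a persistent kernel (previous sketch) already resident and
            // polling d_mbox, dispatching a job to cluster 0 would be:
            //   lk_trigger(h_mbox, 0);
            //   lk_wait(h_mbox, 0);

            cudaFreeHost((void *)h_mbox);
            return 0;
        }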

  7. LK vs. the traditional execution model
     - LK execution is split into:
       Init, { Copyin, Trigger, Wait, Copyout }, Dispose
     - "Traditional" GPU kernel:
       { Alloc, Copyin, Launch, Wait, Copyout, Dispose }
     - Testbench: NVIDIA GTX 980 (2048 CUDA cores, 16 clusters)
     (A host-side sketch contrasting the two flows follows.)
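     The difference between the two models is easiest to see from the host.
     In the sketch below, traditional_job() uses real CUDA API calls, while
     the lk_* functions are hypothetical stand-ins named after the phases
     above, given empty bodies only so that the sketch compiles.

        #include <cuda_runtime.h>

        __global__ void kernel(float *d) { d[threadIdx.x] *= 2.0f; }

        // Traditional model: every job pays the full Alloc..Dispose
        // sequence, including a kernel launch.
        void traditional_job(const float *in, float *out, size_t n)
        {
            float *d;
            cudaMalloc(&d, n * sizeof(float));                  // Alloc
            cudaMemcpy(d, in, n * sizeof(float),
                       cudaMemcpyHostToDevice);                 // Copyin
            kernel<<<1, (int)n>>>(d);                           // Launch
            cudaDeviceSynchronize();                            // Wait
            cudaMemcpy(out, d, n * sizeof(float),
                       cudaMemcpyDeviceToHost);                 // Copyout
            cudaFree(d);                                        // Dispose
        }

        // Hypothetical LightKer-style API, named after this slide's phases.
        void lk_init()                         {}  // Init: launch persistent kernel (once)
        void lk_copyin(const float *, size_t)  {}  // Copyin
        void lk_trigger(int)                   {}  // Trigger: flip a mailbox flag
        void lk_wait(int)                      {}  // Wait: poll the from_GPU flag
        void lk_copyout(float *, size_t)       {}  // Copyout
        void lk_dispose()                      {}  // Dispose: stop the threads (once)

        // LK model: Init/Dispose bracket the whole run; each job costs only
        // Copyin + Trigger + Wait + Copyout against the resident kernel.
        void lk_run(const float *in, float *out, size_t n, int jobs)
        {
            lk_init();
            for (int j = 0; j < jobs; ++j) {
                lk_copyin(in, n);
                lk_trigger(0);
                lk_wait(0);
                lk_copyout(out, n);
            }
            lk_dispose();
        }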

  8. Validation
     - Synthetic benchmark (copy in/out not yet considered)
     - Trigger phase is ~1000x faster than a CUDA kernel launch
     - Synch/Wait cost is comparable

     Single SM:
       LK:    Init 509M  | Trigger 239 | Wait 190k | Dispose 30M
       CUDA:  Alloc 496M | Spawn 3.9k  | Wait 175k | Dispose 274k
     Full GPU:
       LK:    Init 503M  | Trigger 210 | Wait 190k | Dispose 30M
       CUDA:  Alloc 497M | Spawn 3.8k  | Wait 176k | Dispose 247k

  9. Try it!
     - LightKer v0.2, open source:
       http://hipert.mat.unimore.it/LightKer/
     - ...and visit our poster!
     - This project has received funding from the European Union’s Horizon
       2020 research and innovation programme under grant agreement No. 688860.
