Green Flash Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal

GTC 2017. Green Flash Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal. Julien BERNARD. Project #671662 funded by the European Commission under program H2020-EU.1.2.2, coordinated in H2020-FETHPC-2014.


  1. Green Flash Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal (title slide). Julien BERNARD.

  2. Green Flash
     ● Public and private actors
       – Paris Observatory
       – University of Durham
       – Microgate
       – PLDA
     ● Part of Horizon 2020: EU Research and Innovation programme
     ● 3-year project
     ● 3.8 million €
     ● Involves about 30 people
     ● Research axes
       – Real-time HPC with accelerators and smart interconnects
       – Energy-efficient platform based on FPGA
       – Real Time Controller (RTC) prototype for the European Extremely Large Telescope Adaptive Optics (AO) system

  3. Contributors
     ● Maxime Lainé: software engineer
     ● Denis Perret: FPGA expert
     ● Arnaud Sevin: software lead
     ● Damien Gratadour: project lead
     ● Christophe Rouaud: PLDA project lead
     ● Gaetan Dufourcq: QuickPlay expert

  4. E-ELT: Adaptive Optics
     ● Compensate for wavefront perturbations in real time
     ● A wavefront sensor (WFS) measures them
     ● A deformable mirror (DM) reshapes the wavefront
     ● Commands to the mirror must be computed in real time (~ms rate)

  5. RTC concept for ELT AO (diagram)

  6. RTC concept for ELT AO (diagram)

  7. Real Time Controller: legacy architecture
     Diagram labels: Sensor, Switch, RTC, Active elements
     ● e.g. the SPARTA architecture
       – DSP & CPU
       – VXS backplane
     Instrument  WFS  meas.  DM  com.  Freq (Hz)  Performance (GMAC/s)
     Sphere      1    2.6k   1   1.3k  1.5k       5.2
     AOF         4    2.4k   1   1.2k  1k         11.8

  8. Real Time Controller: cluster architecture
     Diagram labels: Sensor 0 to Sensor 5, network Switch, RTC Node 0 ... RTC Node N-1, Active elements 0 to 2
     Instrument  WFS  meas.  DM  com.  Freq (Hz)  Performance (GMAC/s)
     Sphere      1    2.6k   1   1.3k  1.5k       5.2
     AOF         4    2.4k   1   1.2k  1k         11.8
     ELT         6    80k    3   15k   500        1.2k
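The "Performance" column can be roughly cross-checked with a small sketch. It assumes the dominant cost is one dense matrix-vector product per frame, at one multiply-accumulate per matrix element; the slides do not state how the column was derived, and `gmac_per_second` is a hypothetical helper name. Under that assumption the Sphere row gives about 5.07 GMAC/s (listed: 5.2), and AOF with 4 x 2.4k measurements gives about 11.5 GMAC/s (listed: 11.8). Treat the result as an order-of-magnitude estimate only.

```cpp
#include <cassert>

// Estimated multiply-accumulate rate, in GMAC/s, for one dense
// matrix-vector product per frame: measurements x commands x frame rate.
// Assumption (not from the slides): one MAC per matrix element per frame.
double gmac_per_second(double measurements, double commands, double frame_rate_hz) {
    assert(measurements > 0.0 && commands > 0.0 && frame_rate_hz > 0.0);
    return measurements * commands * frame_rate_hz / 1e9;
}
```

For example, `gmac_per_second(2600, 1300, 1500)` evaluates the Sphere row.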

  9. Legacy GPU programming
     Diagram labels: 10GbE NIC, CPU, CPU RAM, PCIe, GPU RAM, GPU
     Host loop: main { setup(); while(run){ recv(…); cudaMemcpy(…, HostToDevice); computing_kernel<<<>>>(…); cudaMemcpy(…, DeviceToHost); send(…); } }

  10. Legacy GPU programming
      ● cudaMemcpy() overhead times (5.12 MB in, 64 KB out)
      ● Kernel launch overhead times
      ● In both cases: jitter of 20 to 30 µs (sometimes 40 µs)

  11. Legacy GPU programming: leaves too little time for computation

  12. Improvement: GPUDirect & I/O memory mapping; persistent kernel

  13. GPUDirect & I/O memory mapping

  14. GPUDirect & I/O memory mapping
      Diagram labels: host (CPU app, RAM; camera control, FPGA control, latency measures), GPU (GPU RAM with pixel, measurement and DM command buffers; compute kernels), FPGA NIC (UDP offload engine, camera protocol handler, DM protocol handler), linked by DMA over PCIe 3.0
      ● FPGA writes/reads directly to/from GPU memory
      ● CPU is free for other kinds of computation

  15. FPGA development platform: development process eased by the QuickPlay tool from PLDA

  16. FPGA development platform
      ● Single generic design / multiple target boards
        – ExpressK-US board (hosting a Kintex UltraScale from Xilinx)
        – ExpressGX V board (hosting a Stratix V from Altera)
        – μXlink board from Microgate (hosting an Arria 10 board from Altera)

  17. Persistent kernel

  18. Classic implementation (diagram)

  19. Persistent kernel implementation (diagram)

  20. GPUDirect, I/O memory mapping & persistent kernel
      Diagram labels: FPGA NIC (10GbE), PCIe, GPU RAM, GPU, CPU, CPU RAM
      Host side: main { setup(); persistent_kernel<<<>>>(…); … }
      Device side: persistent_kernel(…){ while(run){ pollMemory(…); computation(...); startDMATransfer(…); } }
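The control flow of the persistent kernel above can be illustrated with a host-only analogue. This is a sketch under stated assumptions, not the authors' GPU code: a CPU thread stands in for the persistent kernel, a second thread plays the FPGA, and `std::atomic` sequence counters stand in for the memory locations polled over PCIe. The names `Mailbox`, `persistent_worker` and `run_frames` are hypothetical, and the "computation" is a placeholder sum.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Shared state standing in for the GPU buffers that the FPGA reads and writes.
struct Mailbox {
    std::atomic<int> frame_ready{0};   // sequence number of the last frame written
    std::atomic<int> result_ready{0};  // sequence number of the last frame processed
    std::atomic<bool> run{true};       // cleared to stop the worker
    std::vector<float> input;          // frame data (written by the producer)
    float output = 0.0f;               // result (written by the worker)
};

// Launched once; loops until 'run' is cleared (analogue of persistent_kernel).
void persistent_worker(Mailbox& mb) {
    int last_seen = 0;
    while (mb.run.load(std::memory_order_acquire)) {
        // Analogue of pollMemory: wait for a new frame sequence number.
        int seq = mb.frame_ready.load(std::memory_order_acquire);
        if (seq == last_seen) { std::this_thread::yield(); continue; }
        // Analogue of computation: placeholder reduction over the frame.
        float sum = 0.0f;
        for (float v : mb.input) sum += v;
        mb.output = sum;
        last_seen = seq;
        // Analogue of startDMATransfer: publish the result.
        mb.result_ready.store(seq, std::memory_order_release);
    }
}

// Producer playing the FPGA role: writes frames, waits for each result.
std::vector<float> run_frames(int n_frames) {
    Mailbox mb;
    mb.input.assign(4, 0.0f);
    std::thread worker(persistent_worker, std::ref(mb));
    std::vector<float> results;
    for (int f = 1; f <= n_frames; ++f) {
        for (float& v : mb.input) v = static_cast<float>(f);
        mb.frame_ready.store(f, std::memory_order_release);
        while (mb.result_ready.load(std::memory_order_acquire) != f) {
            std::this_thread::yield();
        }
        results.push_back(mb.output);
    }
    mb.run.store(false, std::memory_order_release);
    worker.join();
    return results;
}
```

The point of the pattern is that the worker is started once and never relaunched per frame, so per-iteration launch and copy overheads drop out of the loop. On a real GPU the polling would target device-visible memory, which this sketch does not model.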

  21. Pipelining I/O and compute
      Test setup:
        FPGA     PLDA XPressG5
        Camera   EVT HS-2000M
        GPU      Tesla C2070
        Network  10GbE
        OS       Debian wheezy
      SCAO Pyramid case: 240 x 240 pixels, encoded on 16 bits
      Plot: time (µs) versus iterations, for "No GPUDirect" and "GPUDirect + persistent kernel"
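To put the frame size in the test case above in perspective, the following sketch computes the payload of one 240 x 240 frame at 16 bits per pixel and an idealized transfer time on a 10 Gb/s link. The helper names are hypothetical, and the estimate ignores all protocol overhead, so it is a lower bound rather than a measured figure.

```cpp
#include <cassert>
#include <cstddef>

// Payload size of one camera frame in bytes.
std::size_t frame_bytes(std::size_t width, std::size_t height, std::size_t bits_per_pixel) {
    assert(bits_per_pixel % 8 == 0);
    return width * height * (bits_per_pixel / 8);
}

// Idealized time on the wire in microseconds, ignoring protocol overhead.
double wire_time_us(std::size_t bytes, double link_gbit_per_s) {
    assert(link_gbit_per_s > 0.0);
    return static_cast<double>(bytes) * 8.0 / (link_gbit_per_s * 1e9) * 1e6;
}
```

With these helpers, `frame_bytes(240, 240, 16)` is 115,200 bytes, and `wire_time_us(115200, 10.0)` is about 92 µs.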

  22. Pipelining I/O and compute (diagram)

  23. DGX-1 benchmark
      ● The FPGA is replaced by a CPU
      ● Each node master receives frame data
      ● Work is shared between all devices
      ● The RTC master sends back the final result
      Diagram labels: RTC Master, Node masters, Slaves
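One way to share a matrix-vector product between devices is to give each device a contiguous block of rows of the command matrix. The slides do not say how the work is actually divided, so the row partitioning below is an assumption, and `RowRange`, `split_rows` and `partial_mvm` are hypothetical names. The sketch runs on the CPU and only illustrates the partitioning logic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open row range [begin, end) of the matrix owned by one device.
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Split n_rows as evenly as possible over n_devices; the first
// (n_rows % n_devices) devices get one extra row.
std::vector<RowRange> split_rows(std::size_t n_rows, std::size_t n_devices) {
    assert(n_devices > 0);
    std::vector<RowRange> ranges;
    const std::size_t base = n_rows / n_devices;
    const std::size_t extra = n_rows % n_devices;
    std::size_t begin = 0;
    for (std::size_t d = 0; d < n_devices; ++d) {
        const std::size_t len = base + (d < extra ? 1 : 0);
        ranges.push_back({begin, begin + len});
        begin += len;
    }
    return ranges;
}

// Compute the slice y[r.begin..r.end) of y = M * x for a row-major matrix M
// with x.size() columns. Each device would run this on its own range.
void partial_mvm(const std::vector<float>& m, const std::vector<float>& x,
                 RowRange r, std::vector<float>& y) {
    const std::size_t n_cols = x.size();
    assert(m.size() >= r.end * n_cols && y.size() >= r.end);
    for (std::size_t row = r.begin; row < r.end; ++row) {
        float acc = 0.0f;
        for (std::size_t col = 0; col < n_cols; ++col) {
            acc += m[row * n_cols + col] * x[col];
        }
        y[row] = acc;
    }
}
```

With 15,000 command rows and 4 devices, `split_rows(15000, 4)` assigns 3,750 rows to each device; the master then only has to gather the disjoint slices of the output vector.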

  24. Result 1/2: Time and jitter
      Histogram (time in ms): 4-device case with 10,048 slopes x 15,000 commands
      ● Average: 0.45 ms
      ● Jitter, peak to peak: 17 µs
      ● Variation: 1.8 %
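The average and peak-to-peak jitter reported above are simple statistics over per-frame timing samples. A minimal sketch of how such figures could be computed is given below; `TimingStats` and `timing_stats` are hypothetical names, and the slide's "Variation" figure is not reproduced because its definition is not given.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct TimingStats {
    double mean_ms;          // arithmetic mean of the samples
    double peak_to_peak_ms;  // max sample minus min sample
};

// Mean and peak-to-peak spread of per-frame execution times, in ms.
TimingStats timing_stats(const std::vector<double>& samples_ms) {
    assert(!samples_ms.empty());
    const auto mm = std::minmax_element(samples_ms.begin(), samples_ms.end());
    const double sum = std::accumulate(samples_ms.begin(), samples_ms.end(), 0.0);
    return {sum / static_cast<double>(samples_ms.size()), *mm.second - *mm.first};
}
```

Peak-to-peak jitter is sensitive to a single outlier, which is why it is a demanding metric for a real-time controller.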

  25. Result 2/2: Sync & intercom time
      Two plots: intercommunication time and synchronization time
      Values in slide order: average 24 µs, jitter 12 µs; average 15 µs, jitter 8.8 µs

  26. Conclusion & future work
      ● Conclusion
        – Using GPUDirect and a persistent kernel allows efficient data delivery to the RTC
        – Lower jitter
        – Simpler execution stream
        – QuickPlay tool from PLDA
          ● Eased FPGA development cycle
          ● Mix communication protocols and data processing into the same streams
          ● Expandable ecosystem, with QuickStore / QuickAlliance
      ● Future
        – Test on AO bench (with DM and WFS)
        – Use multi-node architecture
        – Test with fp16

  27. Thank you. Questions?

  28. ● DGX-1 benchmark
      ● Result 1/2: Time and jitter
      ● Result 2/2: Sync & intercom time
      ● Conclusion & future work
      ● Thank you
      ● RTC AO prototype for E-ELT
      ● Test pipeline
      ● Time measurement strategies
      ● Conclusion: persistent kernel
      ● Future work
      ● New features
      ● Test architecture
