“How GPUs can Help High Energy Physics” — Gianluca Lamanna (INFN), on behalf of the GAP collaboration. GTC2016, San Jose, 6.4.2016
Outline — High Energy Physics: what is it? The challenge of the trigger systems. Big data and real time. GPUs for online selection: why? A physics case: the rings in the NA62 RICH detector.
What is high energy physics? HEP (High Energy Physics) is devoted to the study of subatomic particles, radiation and their interactions. All matter is built from very few particles. The mass of the particles is generated by the interaction with a “field” (the Higgs particle). The interaction between particles is mediated by bosons.
Huge machines for the infinitesimal. To investigate the subatomic world we need very high energy. LHC@CERN is the biggest accelerator in the world: 27 km between France and Switzerland, 21 countries, 12000 scientists of 120 nationalities.
…Huge machines = big data. Higher energy and higher intensity are mandatory for new discoveries. Technically challenging: a huge volume of data.
What is the trigger? The purpose of the trigger systems is to decide whether an “event” from the collisions is interesting. L0: hardware level, for bandwidth reduction. HLT: software levels, before storage. A high-efficiency, high-purity trigger is mandatory when searching for tiny effects and rare events, and increases the physics potential of the experiment.
Next generation trigger. Next generation experiments will look for tiny effects: the trigger systems become more and more important. Higher readout bandwidth; new links to bring data faster to the processing nodes; accurate online selection; high-quality selection closer and closer to the detector readout; flexibility, scalability, upgradability; more software, less hardware.
Different solutions. Brute force: PCs — bring all data to a huge PC farm, using fast (and possibly smart) routers. Pro: easy to program, flexible. Cons: very expensive, and most of the resources go into processing junk. Elegant: FPGA — use programmable logic as a flexible way to apply your trigger conditions. Pro: low, deterministic latency. Cons: not so easy (up to now) to program; algorithm complexity limited by the FPGA clock and logic. Rock solid: custom hardware — build your own board with dedicated processors and links. Pro: power, reliability. Cons: several years of R&D (sometimes reinventing the wheel), limited flexibility. Off-the-shelf: GPU — exploit hardware built for other purposes and continuously developed for other reasons. Pro: cheap, flexible, scalable, PC-based. Cons: latency.
GPU in a low level trigger? Latency: is the GPU latency per event small enough to cope with the tiny latency of a low level trigger system? Is the latency stable enough for use in synchronous trigger systems? Computing power: is the GPU fast enough to take trigger decisions at event rates of tens of MHz?
Low level trigger: the NA62 test bench. RICH: 17 m long, 3 m in diameter, filled with Ne at 1 atm. Reconstruct Cherenkov rings to distinguish between pions and muons from 15 to 35 GeV. 2 spots of ~1000 PMs each. Time resolution: 70 ps. MisID: 5×10⁻³. 10 MHz event rate: about 20 hits per particle.
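As a rough illustration of why Ne at 1 atm separates pions from muons in this momentum range (not from the slides — the refractive index value is an assumption, approximately correct for neon at atmospheric pressure), the Cherenkov threshold momentum p = m/√(n²−1) can be computed directly:

```python
import math

def threshold_momentum(mass_gev, n):
    """Momentum above which a particle emits Cherenkov light: p = m / sqrt(n^2 - 1)."""
    return mass_gev / math.sqrt(n * n - 1.0)

N_NEON = 1.000067                    # assumed refractive index of Ne at 1 atm (approximate)
M_MUON, M_PION = 0.10566, 0.13957    # PDG masses in GeV/c^2

p_mu = threshold_momentum(M_MUON, N_NEON)
p_pi = threshold_momentum(M_PION, N_NEON)
print(f"muon threshold: {p_mu:.1f} GeV/c, pion threshold: {p_pi:.1f} GeV/c")
```

With these numbers both particles radiate in the 15–35 GeV window, but their ring radii still differ enough for identification, which is the point of the RICH.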
Latency: the main problem of GPU computing. The total latency is dominated by the double copy in host RAM (data path: NIC → chipset → CPU/host RAM → PCI Express → GPU VRAM). Decrease the data transfer time: DMA (Direct Memory Access), custom management of the NIC buffers. “Hide” some components of the latency by optimizing multi-event computing.
NaNet-1 board. NaNet-1: a board based on the ApeNet+ card logic. PCIe interface with the GPU; direct P2P/RDMA capability; offloading of the network protocol; multiple 1 Gb/s link support. FPGA resources are used to perform on-the-fly data preparation.
Nanet et-1 1 in NA62 G.Lamanna – GTC2016 San Jose 6.4.2016 NANET TTC interface TESLA K20 14
NaNet-1: performance.
NaNet-1: performance. With NaNet, the latency is fully dominated by the GbE transmission.
NaNet-10. Altera Stratix V dev board (Terasic DE5-Net): PCIe x8 Gen3 (8 GB/s); 4 SFP+ ports (link speed up to 10 Gb/s); GPUDirect/RDMA capability; UDP offload support; FPGA preprocessing (merging, decompression, …).
Ring fitting. Trackless: no information from the tracker (it is difficult to merge information from many detectors at L0). Requirements: a fast, non-iterative procedure; event rates at the level of tens of MHz; low latency for an online (synchronous) trigger; accurate, offline-quality resolution. Multi-ring algorithms on the market — with seeds: likelihood, constrained Hough, …; trackless: fiTQun, APFit, possibilistic clustering, Metropolis-Hastings, Hough transform, …
Histogram algorithm. The XY plane is divided into a grid, and a histogram is filled with the distances between the grid points and the hits of the physics event. Rings are identified by looking for distance bins whose contents exceed a threshold value.
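The procedure above can be sketched in plain Python as a serial toy (this is not the CUDA kernel; the grid range, grid step, bin width and threshold are invented for illustration):

```python
import math

def find_ring(hits, grid_step=2.0, bin_width=1.0, threshold=8):
    """Toy histogram ring finder: for each candidate centre on a grid,
    histogram the hit distances; a bin above threshold identifies a ring."""
    best = None
    for gx in range(-10, 11):
        for gy in range(-10, 11):
            cx, cy = gx * grid_step, gy * grid_step
            bins = {}
            for (x, y) in hits:
                r = math.hypot(x - cx, y - cy)
                b = int(r / bin_width)
                bins[b] = bins.get(b, 0) + 1
            b, count = max(bins.items(), key=lambda kv: kv[1])
            if count >= threshold and (best is None or count > best[0]):
                best = (count, cx, cy, (b + 0.5) * bin_width)
    return best  # (votes, centre_x, centre_y, radius) or None

# example: 12 hits on a ring of radius 7.5 centred at (4, -2)
hits = [(4 + 7.5 * math.cos(t), -2 + 7.5 * math.sin(t))
        for t in (2 * math.pi * k / 12 for k in range(12))]
result = find_ring(hits)
print(result)
```

On the GPU the outer loops over grid points parallelize naturally (one thread per candidate centre), which is what makes this approach attractive for a synchronous trigger.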
Results. Sending real data from the NA62 2015 run: NaNet-1 board, NVIDIA K20 GPU, histogram kernel. Events from two different sources are merged in the GPU (an FPGA merger will be implemented soon). 33×10⁶ protons per pulse, >10 MHz event rate, max 1 ms latency allowed.
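The rate and latency figures on this slide fix a hard real-time budget; a back-of-the-envelope check (the event rate and latency limit are from the slide, the batch size is an assumed value for illustration) shows how batching events for the GPU eats into it:

```python
EVENT_RATE_HZ = 10e6      # >10 MHz event rate (from the slide)
LATENCY_BUDGET_S = 1e-3   # max 1 ms latency allowed (from the slide)
BATCH_SIZE = 1000         # assumed number of events per GPU batch

# time spent just accumulating a batch before it can be shipped to the GPU
gather_time = BATCH_SIZE / EVENT_RATE_HZ
# whatever remains must cover the NIC->GPU copy plus kernel execution
remaining = LATENCY_BUDGET_S - gather_time
print(f"gathering {BATCH_SIZE} events takes {gather_time*1e6:.0f} us, "
      f"leaving {remaining*1e6:.0f} us for transfer + kernel")
```

This is why multi-event batches are feasible here at all: at 10 MHz, even a 1000-event batch consumes only a tenth of the 1 ms budget in accumulation time.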
Almagest: multi-ring identification. A new algorithm (Almagest) based on Ptolemy’s theorem: “A quadrilateral is cyclic (the vertices lie on a circle) if and only if the relation AD·BC + AB·DC = AC·BD holds.”
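Ptolemy’s condition can be tested numerically on four hits; a minimal sketch (the tolerance is an arbitrary choice, and the points must be taken in cyclic order around the quadrilateral):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def is_cyclic(a, b, c, d, rel_tol=1e-6):
    """Ptolemy: a quadrilateral ABCD (vertices in order) is cyclic
    iff AD*BC + AB*DC == AC*BD."""
    lhs = dist(a, d) * dist(b, c) + dist(a, b) * dist(d, c)
    rhs = dist(a, c) * dist(b, d)
    return math.isclose(lhs, rhs, rel_tol=rel_tol)

# four points on the unit circle, in angular order, satisfy the relation...
on_circle = [(math.cos(t), math.sin(t)) for t in (0.1, 1.0, 2.5, 4.0)]
print(is_cyclic(*on_circle))   # expected: True

# ...while moving one point off the circle breaks it (Ptolemy inequality)
off_circle = [on_circle[0], on_circle[1], (0.5, 0.5), on_circle[3]]
print(is_cyclic(*off_circle))  # expected: False
```

In a trigger context this test lets quadruplets of hits vote for belonging to a common ring without any iterative fit, which is what makes the approach fast.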
Almagest: multi-ring identification (continued).
Almagest results. Tesla K20; only the computing time is presented: <0.5 µs per event (multi-ring) for large buffers.
Conclusions (1). There are several possible uses of GPUs in HEP: data analysis, Monte Carlo, … GPUs in the trigger could give several advantages, but the processing performance must be carefully studied (I/O, latency, throughput). Several experiments are considering GPUs in their future triggers (both at lower and higher levels): upgrades of ATLAS, LHCb, CMS, ALICE (which already used GPUs in Run 1), …; NA62, PANDA, CBM, STAR, …
Conclusions (2). To match the required latency of low level triggers, data coming from the network must be copied to GPU memory while avoiding bounce buffers on the host. A working solution with the NaNet-1 board has been realized and tested on the NA62 RICH detector. Multi-ring algorithms such as Almagest and Histogram have been implemented on GPU. The GPU-based L0 trigger with the new NaNet-10 board will be deployed during the next NA62 run, starting in April 2016. GPUs are flexible, scalable, powerful, ready to use and cheap, and they benefit from continuous development for other purposes: they are a viable alternative to more expensive and less powerful solutions.
SPARES
HLT with GPU. The HLT is a “natural” place to use GPUs. The increase in LHC luminosity and in the number of overlapping events poses new challenges to the trigger system, and new solutions have to be developed for the forthcoming upgrades. A simple increase of the thresholds can drastically reduce the signal efficiency; more resolution and more complex reconstruction capabilities are needed in the HLT. Reconstruction complexity and computing time scale with the number of hits/tracks, and higher throughput means increased network and CPU capabilities. Parallel computing is the solution.
PF_RING. A special driver for direct access to the NIC buffers: data are directly available in userland, and the double copy is avoided. Pros: no extra hardware needed. Cons: pre-processing on the CPU.