“How GPUs can Help High Energy Physics” — Gianluca Lamanna (INFN), on behalf of the GAP collaboration. GTC2016, San Jose, 6.4.2016
Outline — High Energy Physics: what is it? The challenge of the trigger systems. Big data and real time. GPUs for online selection: why? A physics case: the rings in the NA62 RICH detector.
What is high energy physics? HEP (High Energy Physics) is devoted to the study of subatomic particles, radiation and their interactions. All matter is built from very few particles. The mass of the particles is generated by the interaction with a “field” (the Higgs particle). The interaction between particles is mediated by bosons.
Huge machines for the infinitesimal. To investigate the subatomic world we need very high energy. LHC@CERN is the biggest accelerator in the world: 27 km between France and Switzerland, 21 countries, 12000 scientists of 120 nationalities.
…Huge machines = big data. Higher energy and higher intensity are mandatory for new discoveries. Technically challenging: a huge volume of data.
What is the trigger? The purpose of the trigger systems is to decide whether an “event” from the collisions is interesting. L0: hardware level, for bandwidth reduction. HLT: software levels, before storage. A high-efficiency, high-purity trigger is mandatory when searching for tiny effects and rare events, and increases the physics potential of the experiment.
Next generation trigger. Next generation experiments will look for tiny effects: the trigger systems become more and more important. Higher readout bandwidth; new links to bring data faster to the processing nodes; accurate online selection; high-quality selection closer and closer to the detector readout; flexibility, scalability, upgradability; more software, less hardware.
Different solutions. Brute force: PCs — bring all data to a huge PC farm, using fast (and possibly smart) routers. Pro: easy to program, flexible. Cons: very expensive, and most of the resources go into processing junk. Elegant: FPGA — use programmable logic as a flexible way to apply your trigger conditions. Pro: low, deterministic latency. Cons: not so easy (up to now) to program; algorithm complexity limited by the FPGA clock and logic. Rock solid: custom hardware — build your own board with dedicated processors and links. Pro: power, reliability. Cons: several years of R&D (sometimes reinventing the wheel), limited flexibility. Off-the-shelf: GPU — exploit hardware built for other purposes and continuously developed for other reasons. Pro: cheap, flexible, scalable, PC-based. Cons: latency.
GPU in a low level trigger? Latency: is the GPU latency per event small enough to cope with the tiny latency of a low level trigger system? Is the latency stable enough for use in synchronous trigger systems? Computing power: is the GPU fast enough to take trigger decisions at event rates of tens of MHz?
Low level trigger: the NA62 test bench. RICH: 17 m long, 3 m in diameter, filled with Ne at 1 atm. Reconstruct Cherenkov rings to distinguish between pions and muons from 15 to 35 GeV. 2 spots of ~1000 PMs each. Time resolution: 70 ps. MisID: 5×10⁻³. 10 MHz event rate: about 20 hits per particle.
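As a rough illustration of why Ne at 1 atm separates pions from muons in this momentum range (not from the slides — the refractive index value is an assumption, approximately correct for neon at atmospheric pressure), the Cherenkov threshold momentum p = m/√(n²−1) can be computed directly:

```python
import math

def threshold_momentum(mass_gev, n):
    """Momentum above which a particle emits Cherenkov light: p = m / sqrt(n^2 - 1)."""
    return mass_gev / math.sqrt(n * n - 1.0)

N_NEON = 1.000067                    # assumed refractive index of Ne at 1 atm (approximate)
M_MUON, M_PION = 0.10566, 0.13957    # PDG masses in GeV/c^2

p_mu = threshold_momentum(M_MUON, N_NEON)
p_pi = threshold_momentum(M_PION, N_NEON)
print(f"muon threshold: {p_mu:.1f} GeV/c, pion threshold: {p_pi:.1f} GeV/c")
```

With these numbers both particles radiate in the 15–35 GeV window, but their ring radii still differ enough for identification, which is the point of the RICH.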
Latency: the main problem of GPU computing. The total latency is dominated by the double copy in host RAM (data path: NIC → chipset → CPU/host RAM → PCI Express → GPU VRAM). Decrease the data transfer time: DMA (Direct Memory Access), custom management of the NIC buffers. “Hide” some components of the latency by optimizing multi-event computing.
NaNet-1 board. NaNet-1: a board based on the ApeNet+ card logic. PCIe interface with the GPU; direct P2P/RDMA capability; offloading of the network protocol; multiple 1 Gb/s link support. FPGA resources are used to perform on-the-fly data preparation.
Nanet et-1 1 in NA62 G.Lamanna – GTC2016 San Jose 6.4.2016 NANET TTC interface TESLA K20 14
NaNet-1: performance.
NaNet-1: performance. With NaNet, the latency is fully dominated by the GbE transmission.
NaNet-10. Altera Stratix V dev board (Terasic DE5-Net): PCIe x8 Gen3 (8 GB/s); 4 SFP+ ports (link speed up to 10 Gb/s); GPUDirect/RDMA capability; UDP offload support; FPGA preprocessing (merging, decompression, …).
Ring fitting. Trackless: no information from the tracker (it is difficult to merge information from many detectors at L0). Requirements: a fast, non-iterative procedure; event rates at the level of tens of MHz; low latency for an online (synchronous) trigger; accurate, offline-quality resolution. Multi-ring algorithms on the market — with seeds: likelihood, constrained Hough, …; trackless: fiTQun, APFit, possibilistic clustering, Metropolis-Hastings, Hough transform, …
Histogram algorithm. The XY plane is divided into a grid, and a histogram is filled with the distances between the grid points and the hits of the physics event. Rings are identified by looking for distance bins whose contents exceed a threshold value.
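The procedure above can be sketched in plain Python as a serial toy (this is not the CUDA kernel; the grid range, grid step, bin width and threshold are invented for illustration):

```python
import math

def find_ring(hits, grid_step=2.0, bin_width=1.0, threshold=8):
    """Toy histogram ring finder: for each candidate centre on a grid,
    histogram the hit distances; a bin above threshold identifies a ring."""
    best = None
    for gx in range(-10, 11):
        for gy in range(-10, 11):
            cx, cy = gx * grid_step, gy * grid_step
            bins = {}
            for (x, y) in hits:
                r = math.hypot(x - cx, y - cy)
                b = int(r / bin_width)
                bins[b] = bins.get(b, 0) + 1
            b, count = max(bins.items(), key=lambda kv: kv[1])
            if count >= threshold and (best is None or count > best[0]):
                best = (count, cx, cy, (b + 0.5) * bin_width)
    return best  # (votes, centre_x, centre_y, radius) or None

# example: 12 hits on a ring of radius 7.5 centred at (4, -2)
hits = [(4 + 7.5 * math.cos(t), -2 + 7.5 * math.sin(t))
        for t in (2 * math.pi * k / 12 for k in range(12))]
result = find_ring(hits)
print(result)
```

On the GPU the outer loops over grid points parallelize naturally (one thread per candidate centre), which is what makes this approach attractive for a synchronous trigger.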
Results. Sending real data from the NA62 2015 run: NaNet-1 board, NVIDIA K20 GPU, histogram kernel. Events from two different sources are merged in the GPU (an FPGA merger will be implemented soon). 33×10⁶ protons per pulse, >10 MHz event rate, max 1 ms latency allowed.
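The rate and latency figures on this slide fix a hard real-time budget; a back-of-the-envelope check (the event rate and latency limit are from the slide, the batch size is an assumed value for illustration) shows how batching events for the GPU eats into it:

```python
EVENT_RATE_HZ = 10e6      # >10 MHz event rate (from the slide)
LATENCY_BUDGET_S = 1e-3   # max 1 ms latency allowed (from the slide)
BATCH_SIZE = 1000         # assumed number of events per GPU batch

# time spent just accumulating a batch before it can be shipped to the GPU
gather_time = BATCH_SIZE / EVENT_RATE_HZ
# whatever remains must cover the NIC->GPU copy plus kernel execution
remaining = LATENCY_BUDGET_S - gather_time
print(f"gathering {BATCH_SIZE} events takes {gather_time*1e6:.0f} us, "
      f"leaving {remaining*1e6:.0f} us for transfer + kernel")
```

This is why multi-event batches are feasible here at all: at 10 MHz, even a 1000-event batch consumes only a tenth of the 1 ms budget in accumulation time.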
Almagest: multi-ring identification. A new algorithm (Almagest) based on Ptolemy’s theorem: “A quadrilateral is cyclic (the vertices lie on a circle) if and only if the relation AD·BC + AB·DC = AC·BD holds.”
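Ptolemy’s condition can be tested numerically on four hits; a minimal sketch (the tolerance is an arbitrary choice, and the points must be taken in cyclic order around the quadrilateral):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def is_cyclic(a, b, c, d, rel_tol=1e-6):
    """Ptolemy: a quadrilateral ABCD (vertices in order) is cyclic
    iff AD*BC + AB*DC == AC*BD."""
    lhs = dist(a, d) * dist(b, c) + dist(a, b) * dist(d, c)
    rhs = dist(a, c) * dist(b, d)
    return math.isclose(lhs, rhs, rel_tol=rel_tol)

# four points on the unit circle, in angular order, satisfy the relation...
on_circle = [(math.cos(t), math.sin(t)) for t in (0.1, 1.0, 2.5, 4.0)]
print(is_cyclic(*on_circle))   # expected: True

# ...while moving one point off the circle breaks it (Ptolemy inequality)
off_circle = [on_circle[0], on_circle[1], (0.5, 0.5), on_circle[3]]
print(is_cyclic(*off_circle))  # expected: False
```

In a trigger context this test lets quadruplets of hits vote for belonging to a common ring without any iterative fit, which is what makes the approach fast.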
Almagest: multi-ring identification (continued).
Almagest results. Tesla K20; only the computing time is presented: <0.5 µs per event (multi-ring) for large buffers.
Conclusions (1). There are several possible uses of GPUs in HEP: data analysis, Monte Carlo, … GPUs in the trigger could give several advantages, but the processing performance must be carefully studied (I/O, latency, throughput). Several experiments are considering GPUs in their future triggers (both at lower and higher levels): upgrades of ATLAS, LHCb, CMS, ALICE (which already used GPUs in Run 1), …; NA62, PANDA, CBM, STAR, …
Conclusions (2). To match the required latency of low level triggers, data coming from the network must be copied to GPU memory while avoiding bounce buffers on the host. A working solution with the NaNet-1 board has been realized and tested on the NA62 RICH detector. Multi-ring algorithms such as Almagest and Histogram have been implemented on GPU. The GPU-based L0 trigger with the new NaNet-10 board will be deployed during the next NA62 run, starting in April 2016. GPUs are flexible, scalable, powerful, ready to use and cheap, and they benefit from continuous development for other purposes: they are a viable alternative to more expensive and less powerful solutions.
SPARES
HLT with GPU. The HLT is a “natural” place to use GPUs. The increase in LHC luminosity and in the number of overlapping events poses new challenges to the trigger system, and new solutions have to be developed for the forthcoming upgrades. A simple increase of the thresholds can drastically reduce the signal efficiency; more resolution and more complex reconstruction capabilities are needed in the HLT. Reconstruction complexity and computing time scale with the number of hits/tracks, and higher throughput means increased network and CPU capabilities. Parallel computing is the solution.
PF_RING. A special driver for direct access to the NIC buffers: data are directly available in userland, and the double copy is avoided. Pros: no extra hardware needed. Cons: pre-processing on the CPU.