


  1. Accelerated machine learning inference as a service for particle physics computing. Nhan Tran, December 8, 2019.

  2. Work based on https://arxiv.org/abs/1904.08986 and studies in the fastmachinelearning.org community.
     Collaborators: Markus Atkinson, Burt Holzman, Mark Neubauer, Sergo Jindariani, Zhenbin Wu, Thomas Klijnsma, Javier Duarte, Ben Kreis, Mia Liu, Phil Harris, Kevin Pedro, Jeff Krupa, Nhan Tran, Giuseppe Di Guglielmo, Sang Eon Park, Dylan Rankin, Suffian Khan, Brian Lee, Scott Hauck, Kalin Ovcharov, Shih Chieh Hsu, Brandon Perez, Paul Chow, Kelvin Mei, Andrew Putnam, Naif Tarafdar, Cha Suaysom, Vladimir Loncar, Ted Way, Matt Trahms, Jennifer Ngadiuba, Colin Versteeg, Dustin Werran, Maurizio Pierini, Sioni Summers

  3. The computing conundrum
     [Figures: CMS detector; CMS online filter farm project; CMS offline computing profile projection]

     Quantity                   LHC (current)   HL-LHC (upgraded)
     Simultaneous interactions  60              200
     L1 accept rate             100 kHz         750 kHz
     HLT accept rate            1 kHz           7.5 kHz
     Event size                 2.0 MB          7.4 MB
     HLT computing power        0.5 MHS06       9.2 MHS06
     Storage throughput         2.5 GB/s        61 GB/s
     Event network throughput   1.6 Tb/s        44 Tb/s

     Compute needs growing by more than 10x. Environments getting more complex. Need more sophisticated analysis techniques.

  4. The computing conundrum
     [Figure: a 136-PU event (2018); projected data volumes vs. Run 2/3 grow to 30+ petabytes/year and exabytes in total.]
     Compute needs growing by more than 10x. Environments getting more complex. Need more sophisticated analysis techniques.

  5. The computing conundrum

  6. Heterogeneous compute
     [Diagram: hardware spectrum from CPUs (registers, control unit (CU), arithmetic logic unit (ALU)) through GPUs and FPGAs to ASICs, trading flexibility for efficiency.]
     Advances in heterogeneous computing driven by machine learning.

  7. Heterogeneous compute
     [Same hardware spectrum diagram as slide 6.]

  8. Why fast inference?
     • Training has its own computing challenges, but happens ~once/year and outside of the compute infrastructure
     • Inference happens on billions of events many times a year
     • Unique challenge across HEP: massive datasets of statistically independent events
     [Shown: "Opportunities for Accelerated Machine Learning Inference in Fundamental Physics", J. Duarte, P. Harris, A. Himmel, B. Holzman, W. Ketchum, J. Kowalkowski, M. Liu, B. Nord, G. Perdue, K. Pedro, N. Tran, M. Williams; community white paper inspired by the Fast Machine Learning Workshop, September 10-13, 2019. Work in progress, [link]]

  9. Pros & cons
     On how to integrate heterogeneous compute into our computing model.
     [Matrix of options: domain-specific vs. ML algorithms, deployed as a service (aaS) or via direct connect, on GPU / FPGA / ASIC.]

  10. Pros & cons
      On how to integrate heterogeneous compute into our computing model.
      Our first study: MLaaS with FPGAs.
      [Same matrix of options as slide 9, highlighting ML as a Service on FPGAs.]

  11. To ML or not to ML
      Re-engineer physics algorithms for new hardware:
        Language: OpenCL, OpenMP, HLS, Kokkos, …?
        Hardware: CPU, FPGA, GPU
      Re-cast the physics problem as a machine learning problem:
        Language: C++, Python (TensorFlow, PyTorch, …)
        Hardware: CPU, FPGA, GPU, ASIC
      Is there a way to have the best of both worlds with physics-aware ML?

  12. aaS or direct connect
      [Diagram: worker CPUs using coprocessors (GPU, FPGA, ASIC) either remotely as a service or directly attached.]
      aaS pros: scalable algorithms; scalable to the grid/cloud; heterogeneity (mixed hardware)
      Direct-connect pros: less system complexity; no network latency

  13. aaS or direct connect
      [Diagram as on slide 12, now with two algorithms (Algo 1, Algo 2) per job sharing the coprocessors (GPU, FPGA, ASIC).]
      aaS pros: scalable algorithms; scalable to the grid/cloud; heterogeneity (mixed hardware)
      Direct-connect pros: less system complexity; no network latency

  14. Hardware choices
      • GPUs: power hungry; batching needed for optimal performance; mature software ecosystem
      • ASICs: most efficient Op/W; less flexible
      • FPGAs: a middle solution, flexible and less power hungry than a GPU; does not require batching

  15. Hardware choices (same comparison as slide 14)

  16. Hardware choices (same comparison as slide 14)

  17. Brainwave on Azure ML

  18. Brainwave on Azure ML
      [Diagram: Azure datacenter fabric with pools of CPUs and network-attached FPGAs.]

  19. Hardware choices

  20. Hardware choices

  21. The models
      • This talk focuses on a standard CNN for top tagging: ResNet-50. One big network, single-to-few batch. [cf. DeepAK8; network comparison from arXiv:1605.07678]
      • Another, different example: HCal Reco, a network for per-channel reconstruction in the CMS detector. A small network with a batch of 16000 (will come back to this).

  22. Tagging tops
      [Figure: top tagging performance, averaged over 1000 jets; public top tagging data challenge.]

  23. SONIC: Services for Optimized Network Inference on Coprocessors
      [Diagram: a CMSSW event processing job with an input source (data or simulation), parameter sets / configuration, an event setup database, and many modules running across threads; the ML inference modules (ML INFER 1, ML INFER 2) offload their work to a coprocessor, and results flow to the outputs.]

  24. SONIC: Services for Optimized Network Inference on Coprocessors
      • Convert experimental data to a neural network input (TF tensor) and send it to the coprocessor using a communication protocol
      • CMSSW ExternalWork mechanism for asynchronous, non-blocking requests: the CMSSW thread calls acquire(), continues with other work while the external FPGA/GPU processes the request, and produce() runs once the result is back
      • SONIC CMSSW repository
      • Supporting gRPC with TensorFlow, working on TensorRT
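As a rough Python illustration of this request path (the production SONIC client is C++ inside CMSSW; the service address, model name, and tensor names below are placeholders), a gRPC "Predict" call in the TensorFlow Serving style looks roughly like this:

```python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Placeholder endpoint; in SONIC this would be the Brainwave/coprocessor service.
channel = grpc.insecure_channel("inference-service.example.org:50051")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# One preprocessed jet image, batch of 1 (the FPGA service runs at batch-of-1).
jet_image = np.random.rand(1, 224, 224, 3).astype(np.float32)

request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet50"  # placeholder model name
request.inputs["images"].CopyFrom(tf.make_tensor_proto(jet_image, shape=jet_image.shape))

# Blocking call for simplicity; SONIC issues this asynchronously from acquire()
# so the CPU thread can do other work until produce() consumes the result.
response = stub.Predict(request, timeout=10.0)
scores = tf.make_ndarray(response.outputs["output"])  # placeholder output tensor name
print(scores.shape)
```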

  25. SONIC: single service
      • Fermilab (IL) → Azure (VA) → Fermilab (IL): <Δt> ~ 60 ms
      • Azure (on-prem): <Δt> ~ 10 ms
      • ResNet-50 time on FPGA ~ 1.8 ms; classifier on CPU ~ 2 ms
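A quick decomposition of these numbers (my own arithmetic, not from the slide): the FPGA plus the CPU classifier account for only ~4 ms, so nearly all of the remote round trip, and most of the on-prem one, is network transfer and serialization.

```python
# Split the measured round-trip time into compute vs. network/serialization overhead.
fpga_ms, cpu_classifier_ms = 1.8, 2.0
compute_ms = fpga_ms + cpu_classifier_ms  # ~3.8 ms of actual inference work

for label, round_trip_ms in [("remote (FNAL -> Azure VA -> FNAL)", 60.0), ("on-prem Azure", 10.0)]:
    overhead_ms = round_trip_ms - compute_ms
    print(f"{label}: ~{overhead_ms:.0f} ms of {round_trip_ms:.0f} ms is overhead")
```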

  26. SONIC: scale out
      [Diagram: scaling tests with many worker nodes, each running a JetImageProducer, all sending requests to one Brainwave service.]
      • Simple scaling tests show we can hit the maximum throughput of the FPGA from SONIC → a "worst case" scenario, i.e. the optimal way to use the hardware is to keep it busy all the time
      • 50 simultaneous CPU jobs saturate 1 FPGA; this is conservative since these jobs only ran 1 module
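A back-of-the-envelope way to see why the saturation point lands near 50 jobs (assuming, which the slide does not state, that each job issues one blocking remote request at a time): at ~60 ms per round trip a single job submits ~17 requests/s, while the service sustains ~660 inferences/s, so on the order of 40-50 concurrent jobs are needed to keep the FPGA busy.

```python
# Rough estimate of how many blocking CPU jobs it takes to saturate one FPGA service.
round_trip_ms = 60.0        # average remote request latency (slide 25)
service_throughput = 660.0  # sustained inferences per second of the service (slide 27)

requests_per_job = 1000.0 / round_trip_ms               # ~17 requests/s per blocking job
jobs_to_saturate = service_throughput / requests_per_job
print(f"~{jobs_to_saturate:.0f} concurrent jobs to saturate one FPGA")  # ~40, same order as the measured 50
```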

  27. Performance comparisons

      Type        Note          Latency [ms]        Throughput [img/s]
      CPU*        Xeon 2.6 GHz  1750                0.6
      CPU*        i7 3.6 GHz    500                 2
      GPU†**      batch = 1     7                   143
      GPU†**      batch = 32    1.5                 667
      Brainwave   remote        60                  660
      Brainwave   on-prem       10 (1.8 on FPGA)    660

      * Performance depends on clock speed, TensorFlow version, # threads (1)
      † Directly connected to CPU via PCIe, not a service
      ** Performance depends on batch size & optimization of ResNet-50

      30x (cloud, remote) to 175x (edge, on-prem) faster than current CMSSW CPU inference.
      FPGA runs at batch-of-1; GPU is competitive at large batch size.
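The quoted speedup factors follow directly from the table: ~1750 ms for single-threaded ResNet-50 on the Xeon, divided by the 60 ms remote and 10 ms on-prem Brainwave latencies.

```python
# How the 30x (cloud, remote) and 175x (edge, on-prem) factors come out of the table.
cpu_xeon_ms = 1750.0
brainwave_remote_ms = 60.0
brainwave_onprem_ms = 10.0

print(f"remote speedup:  {cpu_xeon_ms / brainwave_remote_ms:.0f}x")   # ~29x, quoted as 30x
print(f"on-prem speedup: {cpu_xeon_ms / brainwave_onprem_ms:.0f}x")   # 175x
```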
