  1. ParaDnn (github.com/Emma926/paradnn): A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms. Yu (Emma) Wang, Gu-Yeon Wei, David Brooks, Harvard University. Contact: ywang03@g.harvard.edu. 3/3/2020

  2. Acknowledgement Frank Chen, Glenn Holloway, Dan Janni, Peter Mattson, Lifeng Nai, David Patterson, Francesco Pontiggia, Parthasarathy Ranganathan, Vijay Reddi, Brennan Saeta, Zak Stone, Anitha Vijayakumar, Shibo Wang, Qiumin Xu, Doe Hyun Yoon, Cliff Young

  3. Challenges with ML Benchmarking
     - Diversity in deep learning models used: problem domains, models, datasets
     - Pace of the field: state-of-the-art models evolve every few months
     - Varying evaluation metrics: accuracy, time to train, latency of inference
     - Multi-disciplinary field: algorithms, systems, hardware, ML software stacks

  4. State of the art: MLPerf 0.6
     Area     | Benchmark               | Dataset       | Model        | Reference Implementation
     Vision   | Image classification    | ImageNet      | ResNet-50    | TensorFlow
     Vision   | Object detection        | COCO 2017     | Mask R-CNN   | PyTorch
     Vision   | Object detection        | COCO 2017     | SSD-ResNet34 | PyTorch
     Language | Translation             | WMT Eng-Germ  | Transformer  | TensorFlow
     Language | Translation             | WMT Eng-Germ  | GNMT         | PyTorch
     Commerce | Recommendation          | MovieLens-20M | NCF          | PyTorch
     Action   | Reinforcement Learning  | Go            | Mini-Go      | TensorFlow

  5. State of the art: MLPerf 0.6 (table repeated from the previous slide)

  6. Our Methodology: ParaDnn

  7. Our Methodology: ParaDnn

  8. ParaDnn vs MLPerf
     ParaDnn:
     - Avoid drawing conclusions based on several arbitrary models
     - Generate thousands of parameterized, end-to-end models
     - Prepare hardware designs for future models
     - Complement the use of existing real-world models, i.e. MLPerf
     MLPerf (real-world models):
     - Good for studying accuracy or convergence with real datasets
     - Represent the specific models people care about

  9. ParaDnn Canonical Models
     - Fully connected (FC): input -> a configurable number of layers, each with a configurable number of nodes -> output
     - CNNs (Residual, Bottleneck): input -> a configurable number of residual/bottleneck blocks (x4 groups, configurable filter size) -> FC layer -> output
     - RNNs (RNN, LSTM, GRU): input -> a configurable number of layers of RNN/LSTM/GRU cells (configurable cell size) -> output
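To make the parameterization concrete, below is a minimal sketch of how a fully connected canonical model could be generated by sweeping its hyperparameters. The function name, Keras usage, and sweep ranges are illustrative assumptions, not ParaDnn's actual code or ranges.

```python
# Illustrative sketch of a parameterized FC model in the spirit of ParaDnn's
# canonical models (not ParaDnn's actual implementation).
import tensorflow as tf

def build_fc_model(num_layers, num_nodes, input_dim, output_dim):
    """Stack `num_layers` hidden Dense layers of `num_nodes` units each."""
    layers = [tf.keras.Input(shape=(input_dim,))]
    layers += [tf.keras.layers.Dense(num_nodes, activation="relu")
               for _ in range(num_layers)]
    layers.append(tf.keras.layers.Dense(output_dim))
    return tf.keras.Sequential(layers)

# Sweeping the hyperparameters yields a whole family of models.
for num_layers in (4, 16, 64):        # illustrative ranges
    for num_nodes in (32, 512, 8192):
        model = build_fc_model(num_layers, num_nodes, input_dim=2000, output_dim=1000)
```

The CNN and RNN canonical models are parameterized the same way, over block counts and filter sizes, or layer counts and cell sizes.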

  10. Models

  11. Models - ParaDnn covers a larger range than the real models - from 10k to ~1 billion parameters
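The parameter counts behind that range follow directly from the FC hyperparameters. A rough check, with illustrative input/output sizes and sweep points (not the exact ParaDnn settings):

```python
# Rough parameter count (weights + biases) of an FC model with `layers` hidden
# layers of `nodes` units each, plus input and output layers.
def fc_param_count(layers, nodes, input_dim=2000, output_dim=1000):
    params = input_dim * nodes + nodes                 # input -> first hidden layer
    params += (layers - 1) * (nodes * nodes + nodes)   # hidden -> hidden layers
    params += nodes * output_dim + output_dim          # last hidden -> output layer
    return params

print(fc_param_count(layers=4, nodes=64))     # ~2e5 parameters (small end)
print(fc_param_count(layers=64, nodes=4096))  # ~1.1e9 parameters (large end, ~1 billion)
```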

  12. Analysis Enabled by ParaDnn - Roofline analysis of TPU v2 - Homogeneous Platform Comparison: TPU v2 vs v3 - Heterogeneous Platform Comparison: TPU vs GPU

  13. The Roofline Model

  14. The Roofline Model: Peak FLOPS

  15. The Roofline Model: Peak FLOPS, Memory Bandwidth

  16. The Roofline Model: the compute-intensive region

  17. The Roofline Model: the memory-intensive and compute-intensive regions
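For reference, the roofline bound sketched on these slides is a one-liner: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The hardware numbers below are placeholders, not measured TPU values.

```python
# Roofline model: attainable FLOP/s for an op with a given arithmetic
# intensity (FLOPs per byte of memory traffic).
def attainable_flops(intensity, peak_flops, mem_bw):
    return min(peak_flops, mem_bw * intensity)

peak_flops = 100e12          # placeholder peak compute, FLOP/s
mem_bw = 500e9               # placeholder memory bandwidth, bytes/s
ridge = peak_flops / mem_bw  # 200 FLOPs/byte: where the two bounds meet

for intensity in (10, ridge, 2000):  # memory-bound, ridge point, compute-bound
    print(intensity, attainable_flops(intensity, peak_flops, mem_bw))
```

Ops to the left of the ridge point are memory-bound (the slanted part of the roof); ops to the right are compute-bound (the flat part).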

  18. Transformer

  19. FC Models: ParaDnn sweeps a large range of models, from memory-bound to compute-bound.

  20. FC Models: the compute-bound end of the sweep

  21. FC Models: the memory-bound end of the sweep

  22. TPU v2 vs v3?

  23. How to upgrade to TPU v3? Start from the TPU v2 roofline.

  24. How to upgrade to TPU v3? TPU v2; TPU v3 (FLOPS up)

  25. How to upgrade to TPU v3? TPU v2; TPU v3 (FLOPS up); TPU v3 (Mem BW up)

  26. How to upgrade to TPU v3? TPU v2; TPU v3 (FLOPS up); TPU v3 (Mem BW up); TPU v3 (FLOPS and Mem BW up)

  27. How to upgrade to TPU v3? TPU v3 (FLOPS and Mem BW up) vs TPU v2: by what factor (? x) for each?

  28. Architecture of TPU v2 vs v3: 180 TFLOPS/board (v2) vs 420 TFLOPS/board (v3). Figure from https://cloud.google.com/tpu/docs/system-architecture
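The 2.3x factor quoted on the next slide is just the ratio of these board-level peak numbers:

```python
# Peak-compute ratio of a TPU v3 board to a TPU v2 board (numbers from the slide above).
tpu_v2_tflops_per_board = 180
tpu_v3_tflops_per_board = 420
print(tpu_v3_tflops_per_board / tpu_v2_tflops_per_board)  # ~2.33, the "2.3x" FLOPS upgrade
```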

  29. Google's Choice of TPU v3: peak FLOPS 2.3x over TPU v2; memory bandwidth factor still unknown (? x)

  30. TPU v3 vs v2: FC Operation Breakdown

  31. TPU v3 vs v2: FC Operation Breakdown - Compute-bound ops: 2.3x speedup

  32. TPU v3 vs v2: FC Operation Breakdown - Memory-bound ops: 1.5x speedup

  33. TPU v3 vs v2: FC Operation Breakdown - Memory-bound ops that benefit from the 2x memory capacity: 3x speedup

  34. Google's Choice of TPU v3: FLOPS 2.3x and memory bandwidth 1.5x over TPU v2

  35. TPU v3 vs v2: FC Operation Breakdown - ParaDnn provides a diverse set of operations and shows that different operations are sensitive to different system-component upgrades.
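A simple roofline-style way to reason about this sensitivity: an op bound by compute should speed up by roughly the peak-FLOPS ratio, while an op bound by memory bandwidth should speed up by roughly the bandwidth ratio; effects such as the doubled memory capacity (slide 33) fall outside this simple bound. In the sketch below the 2.3x and 1.5x factors come from the preceding slides, while the ridge point is an illustrative placeholder.

```python
# Roofline-style estimate of the per-op speedup from a hardware upgrade.
# flops_ratio and bw_ratio are the new/old peak-FLOPS and memory-bandwidth ratios.
def op_speedup_bound(intensity, ridge_point, flops_ratio, bw_ratio):
    if intensity >= ridge_point:   # compute-bound op: limited by peak FLOPS
        return flops_ratio
    return bw_ratio                # memory-bound op: limited by memory bandwidth

print(op_speedup_bound(intensity=500, ridge_point=100, flops_ratio=2.3, bw_ratio=1.5))  # 2.3
print(op_speedup_bound(intensity=10,  ridge_point=100, flops_ratio=2.3, bw_ratio=1.5))  # 1.5
```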

  36. TPU vs GPU?

  37. Hardware Platforms

  38. Hardware Platforms: 300 GB/s per core

  39. FC and CNN: FC training dataflow (weights W, activations A, gradients G, weighted sum)

  40. FC and CNN: side-by-side dataflows. Compared with the FC model, the CNN has fewer weights and larger conv ops.
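The "fewer weights, larger conv ops" contrast can be made concrete by counting weights and FLOPs for one layer of each type. The shapes below are illustrative, not taken from the slides:

```python
# Weights vs. compute for one FC layer and one conv layer (illustrative shapes).
batch = 64

# FC layer: 4096 -> 4096 nodes
fc_weights = 4096 * 4096
fc_flops = 2 * batch * 4096 * 4096                    # one multiply-add per weight per example

# Conv layer: 3x3 kernel, 256 -> 256 channels, on a 56x56 feature map
conv_weights = 3 * 3 * 256 * 256
conv_flops = 2 * batch * 56 * 56 * 3 * 3 * 256 * 256  # the small filter is reused at every position

print(fc_weights, conv_weights)     # ~16.8M vs ~0.6M weights
print(fc_flops // fc_weights)       # 128 FLOPs per weight
print(conv_flops // conv_weights)   # 401,408 FLOPs per weight: far more reuse
```

The much higher compute per weight is why convolutions typically have higher arithmetic intensity than FC layers.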

  41. Hardware Platforms: 300 GB/s per core

  42. FC TPU/GPU Speedup colored with Batch Size (speedups range from 0.35x to 9x)

  43. FC TPU/GPU Speedup colored with Batch Size: above 1x the TPU is better; below 1x the GPU is better.

  44. FC TPU/GPU Speedup colored with Batch Size: above 1x the TPU is better; below 1x the GPU is better.

  45. FC TPU/GPU Speedup colored with Node Size: more nodes means more weights, which makes the models more memory-bound.
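A rough sketch of the "more nodes, more weights" point: the weight footprint of an FC model grows quadratically with the node count, so wide models far exceed on-chip memory and have to stream their weights from off-chip memory. The layer count and node sizes below are illustrative.

```python
# Weight footprint of an FC model grows quadratically with the node count.
def fc_weight_bytes(layers, nodes, bytes_per_elem=4):
    return layers * nodes * nodes * bytes_per_elem

for nodes in (512, 2048, 8192):
    print(nodes, fc_weight_bytes(layers=32, nodes=nodes) / 1e9, "GB")
# 0.03 GB, 0.5 GB, and 8.6 GB of weights respectively: the wide configurations
# are dominated by weight traffic, pushing them toward the memory-bound regime.
```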

  46. Hardware Platforms: 1.44x; 300 GB/s per core

  47. CNN TPU/GPU Speedup colored with Batch Size

  48. CNN TPU/GPU Speedup colored with Batch Size - Up to 6x speedup - The TPU architecture and software are highly optimized for CNNs

  49. CNN TPU/GPU Speedup colored with Batch Size - All models run faster on the TPU. - Larger batch sizes lead to higher speedups.

  50. CNN TPU/GPU Speedup colored with Filters - Models with more filters have a higher lower bound on speedup

  51. Conclusion - Parameterized methodology: ParaDnn + a set of analysis methods - Single-platform analysis: TPU v2 - Homogeneous platform comparison: TPU v2 vs v3 - Heterogeneous platform comparison: TPU vs GPU

  52. Limitations of this Work
     Does not include:
     - Inference
     - Multi-node systems: multi-GPU or TPU pods
     - Accuracy and convergence
     - Cloud overhead
     Tractability:
     - The range of hyperparameters and datasets is limited
     - Small batch sizes (<16) and large batch sizes (>2k) are not studied
     - Synthetic datasets do not include data-infeed overhead
     - The number of iterations per TPU loop is 100; larger values can slightly increase performance

  53. ParaDnn is available at github.com/Emma926/paradnn. Questions?
