
Compressing DMA Engine: Leveraging Activation Sparsity For Training Deep Neural Networks

  1. (C) Minsoo Rhu 1 Compressing DMA Engine: Leveraging Activation Sparsity For Training Deep Neural Networks Minsoo Rhu✝, Mike O’Connor*, Niladrish Chatterjee*, Jeff Pool*, Youngeun Kwon✝, and Stephen W. Keckler* (POSTECH✝ and NVIDIA*)

  2. (C) Minsoo Rhu 2 Motivation

  3. (C) Minsoo Rhu 3 ML trends: deeper & larger DNN models From AlexNet to ResNet [AlexNet*] 7 convolutional layers (2012) * Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012

  4. (C) Minsoo Rhu 4 ML trends: deeper & larger DNN models From AlexNet to ResNet [ResNet*] 153 convolutional layers (2016) * He et al., “Deep Residual Learning for Image Recognition”, CVPR-2016

  5. (C) Minsoo Rhu 5 Memory “capacity” limits in DNN training Training large & deep DNNs incurs large memory allocations — The Next Platform, “Baidu eyes deep learning strategy in wake of new GPU options”, April 26th, 2016

  6. (C) Minsoo Rhu 6 Prior solution: virtualized DNN (vDNN) Expose both CPU and GPU memory for allocating DNN training data [Diagram: CPU memory and GPU memory connected over PCIe] * Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016

  7. (C) Minsoo Rhu 7 Prior solution: virtualized DNN (vDNN) Expose both CPU and GPU memory for allocating DNN training data [Diagram: activations spill from GPU memory to CPU memory over PCIe] * Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016

  8. (C) Minsoo Rhu 8 Prior solution: virtualized DNN (vDNN) Expose both CPU and GPU memory for allocating DNN training data [Diagram: activations migrate back from CPU memory to GPU memory over PCIe] * Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016
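The spill and migrate-back steps in the vDNN slides above boil down to asynchronous copies over PCIe. The PyTorch-style sketch below only illustrates that idea under assumed names; it is not vDNN's actual CUDA/cuDNN implementation.

```python
import torch

# Illustrative sketch of the vDNN-style offload/prefetch idea (assumed API,
# not the paper's implementation): copy a layer's activations to pinned CPU
# memory during the forward pass, then migrate them back before backprop.
copy_stream = torch.cuda.Stream()

def offload_to_cpu(act_gpu: torch.Tensor) -> torch.Tensor:
    # Pinned (page-locked) host memory enables an asynchronous DMA copy over PCIe.
    host_buf = torch.empty(act_gpu.shape, dtype=act_gpu.dtype,
                           device="cpu", pin_memory=True)
    with torch.cuda.stream(copy_stream):
        host_buf.copy_(act_gpu, non_blocking=True)
    return host_buf

def prefetch_to_gpu(host_buf: torch.Tensor) -> torch.Tensor:
    # Bring the activation back to GPU memory ahead of its use in the backward pass.
    with torch.cuda.stream(copy_stream):
        return host_buf.to("cuda", non_blocking=True)
```

Overlapping these copies with computation on a separate stream is what hides part of the migration cost; the remaining cost is bounded by the PCIe link bandwidth, which motivates the rest of the talk.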

  9. (C) Minsoo Rhu 9 Large Model Support (LMS) with PowerAI Expose both CPU and GPU memory for allocating DNN training data * https://developer.ibm.com/linuxonpower/2017/09/22/realizing-value-large-model-support-lms-powerai-ibm-caffe/

  10. (C) Minsoo Rhu 10 HPC system node for deep learning [Diagram: multiple GPUs (4 to 8) connected under a PCIe root complex; CPUs with high-capacity, low-bandwidth memory (DDR4), linked by QuickPath Interconnect (QPI); each GPU with low-capacity, high-bandwidth stacked memory (HBM); driven by big data and deeper & wider neural networks]

  11. (C) Minsoo Rhu 11 HPC system node for deep learning [Diagram: the same node, now showing GPU-CPU migration traffic flowing over the PCIe links]

  12. (C) Minsoo Rhu 12 HPC system node for deep learning [Diagram: the same node, with GPU-CPU migration traffic crowding the PCIe links] Challenges: PCIe channel bandwidth becomes a performance bottleneck!
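As a rough, editor-supplied illustration of the bandwidth gap, with approximate figures not taken from the slides:

```python
# Approximate, illustrative numbers (not from the slides): a PCIe 3.0 x16 link
# versus the stacked HBM attached to a single GPU.
pcie_gen3_x16_gb_s = 16     # GB/s per direction, theoretical peak
hbm_gb_s = 700              # GB/s, order of magnitude for HBM2
print(f"HBM : PCIe bandwidth ratio ~ {hbm_gb_s / pcie_gen3_x16_gb_s:.0f}x")  # ~44x
```

With the link roughly an order of magnitude (or more) slower than local GPU memory, activation migration can easily dominate training time once it stops overlapping with computation.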

  13. (C) Minsoo Rhu 13 Opportunity: “sparse” data structures Amplify effective PCIe bandwidth by compressing CPU-migrated data [Diagram: activations spill from GPU memory to CPU memory over PCIe]

  14. (C) Minsoo Rhu 14 Opportunity: “sparse” data structures Amplify effective PCIe bandwidth by compressing CPU-migrated data [Illustration: a mostly-zero activation matrix with only a few nonzero entries (a, b, c, d)]
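Because the migrated activation maps are mostly zero, as in the matrix sketched above, even a simple zero-value compression scheme shrinks the PCIe traffic substantially. The Python sketch below is a generic bitmask-plus-nonzeros encoding for illustration only; it is not the paper's compressing DMA engine design.

```python
import numpy as np

# Generic zero-value compression sketch (illustrative, not the cDMA hardware):
# keep a 1-bit-per-element mask of nonzero positions plus the packed nonzeros.
def compress(acts: np.ndarray):
    flat = acts.ravel()
    mask = flat != 0.0
    return np.packbits(mask), flat[mask], acts.shape

def decompress(packed_mask, nonzeros, shape):
    n = int(np.prod(shape))
    mask = np.unpackbits(packed_mask)[:n].astype(bool)
    flat = np.zeros(n, dtype=nonzeros.dtype)
    flat[mask] = nonzeros
    return flat.reshape(shape)

# Example: at ~38% density, FP32 data shrinks from 32 bits/element to roughly
# 0.38 * 32 + 1 ≈ 13 bits/element of payload plus mask.
```

The compression ratio tracks the activation density directly, which is why the characterization study in the next slides matters.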

  15. (C) Minsoo Rhu 15 Key contributions of this work: (1) an application characterization study of sparsity when training convolutional neural networks, and (2) architectural support for leveraging activation sparsity in virtualized DNNs

  16. (C) Minsoo Rhu 16 Q. How much sparsity do DNNs exhibit during training?

  17. (C) Minsoo Rhu 17 Case study) AlexNet Characterizing the changes in layer density during training [AlexNet*] * Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012

  18. (C) Minsoo Rhu 18 Case study) AlexNet Characterizing the changes in layer density during training [Figure: AlexNet and a test image] * Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012

  19. (C) Minsoo Rhu 19 Case study) AlexNet Characterizing the changes in layer density during training [Figure: conv0 (96, 55, 55) feature maps for a test image, visualized with the network trained to 0%, 20%, 40%, 60%, 80%, and 100%]

  20. (C) Minsoo Rhu 20 Case study) AlexNet Characterizing the changes in layer density during training [Figure: conv0 (96, 55, 55) feature maps for a test image; each feature map is a 55x55 2D image, visualized at 0% through 100% of training]

  21. (C) Minsoo Rhu 21 Case study) AlexNet Characterizing the changes in layer density during training [Figure: conv0 (96, 55, 55) feature maps for a test image; 96 channels, each a 55x55 2D image, visualized at 0% through 100% of training]

  22. (C) Minsoo Rhu 22 Case study) AlexNet Characterizing the changes in layer density during training conv0 (96, 55, 55) [Figure: feature maps for a test image at 0% through 100% of training] Average layer density: 49% (51% of activations are 0-valued)

  23. (C) Minsoo Rhu 23 Case study) AlexNet Characterizing the changes in layer density during training conv0 (96, 55, 55) [Figure: feature maps for a test image at 0% through 100% of training]

  24. (C) Minsoo Rhu 24 Case study) AlexNet Characterizing the changes in layer density during training conv0 (96, 55, 55) [Figure: feature maps for a test image at 0% through 100% of training] Average layer density: 49% (51% of activations are 0-valued)

  25. (C) Minsoo Rhu 25 Case study) AlexNet Characterizing the changes in layer density during training conv1 (256, 27, 27) [Figure: feature maps at 0% through 100% of training] Average layer density: 36% (64% of activations are 0-valued)

  26. (C) Minsoo Rhu 26 Case study) AlexNet Characterizing the changes in layer density during training conv4 (256, 13, 13) [Figure: feature maps at 0% through 100% of training] Average layer density: 22% (78% of activations are 0-valued)
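The per-layer density figures on the preceding slides could be gathered with a simple forward hook; the sketch below shows one way to do that in PyTorch and is not the authors' actual measurement harness.

```python
import torch

# Illustrative density measurement (assumed setup, not the paper's harness):
# record the fraction of nonzero output activations after every ReLU layer.
def register_density_hooks(model: torch.nn.Module, stats: dict):
    def make_hook(name):
        def hook(module, inputs, output):
            stats.setdefault(name, []).append((output != 0).float().mean().item())
        return hook
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.ReLU):
            module.register_forward_hook(make_hook(name))

# Usage: stats = {}; register_density_hooks(model, stats); run forward passes
# at different points in training; stats then maps each ReLU layer name to its
# observed densities over time.
```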

  27. (C) Minsoo Rhu 27 Case study) AlexNet Putting everything together [Chart: per-layer activation density over training time (0% to 100%)]

  28. (C) Minsoo Rhu 28 Case study) AlexNet Putting everything together [Chart: per-layer activation density over training time (0% to 100%)]

  29. (C) Minsoo Rhu 29 Case study) AlexNet Putting everything together Observation #1: First CONV layer consistently exhibits around 50% layer density across the entire training process.

  30. (C) Minsoo Rhu 30 Case study) AlexNet Putting everything together Observation #2: Pooling layers always increase overall activation density.
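A tiny worked example (editor's illustration, not from the slides) of why pooling raises density: a max-pool window produces a nonzero output whenever any of its inputs is nonzero.

```python
import torch
import torch.nn.functional as F

# Illustration of Observation #2: a 2x2 max-pool window is nonzero if any of
# its four inputs is nonzero, so the pooled map is denser than its input.
x = torch.tensor([[[[0., 0., 3., 0.],
                    [1., 0., 0., 0.],
                    [0., 0., 0., 0.],
                    [0., 2., 0., 0.]]]])             # input density = 3/16 ≈ 0.19
y = F.max_pool2d(x, kernel_size=2)                   # output = [[1, 3], [0, 2]]
print((x != 0).float().mean().item(),                # ≈ 0.19
      (y != 0).float().mean().item())                # = 0.75
```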

  31. (C) Minsoo Rhu 31 Case study) AlexNet Putting everything together Observation #3: Within each layer, activation density decreases rapidly during the initial training period; once training reaches the fine-tuning stage, density gradually climbs back up.

  32. (C) Minsoo Rhu 32 Case study) AlexNet Putting everything together Observation #4: Later layers are generally sparser than earlier layers.

  33. (C) Minsoo Rhu 33 Case study) VGG-16 Putting everything together [Chart: per-layer density for VGG-16; deeper layers are sparser]

  34. (C) Minsoo Rhu 34 What causes such behavior in DNNs? Discussed in much more detail in our paper :)

  35. (C) Minsoo Rhu 35 What causes such behavior in DNNs? Observation #4: Sparsity increases as you go deeper into the network [Chart: per-layer density; deeper layers are sparser]

  36. (C) Minsoo Rhu 36 What causes such behavior in DNNs? Observation #4: Sparsity increases as you go deeper into the network [Figure: input images and their activations] * Zeiler et al., “Visualizing and Understanding Convolutional Networks”, arXiv.org, 2013

  37. (C) Minsoo Rhu 37 What causes such behavior in DNNs? Observation #4: Sparsity increases as you go deeper into the network First few layers: filters are trained to respond to “class-invariant” features (corners, edges, colors) [Figure: input images and their activations] * Zeiler et al., “Visualizing and Understanding Convolutional Networks”, arXiv.org, 2013

  38. (C) Minsoo Rhu 38 What causes such behavior in DNNs? Observation #4: Sparsity increases as you go deeper into the network [Figure: input images and their activations] * Zeiler et al., “Visualizing and Understanding Convolutional Networks”, arXiv.org, 2013

  39. (C) Minsoo Rhu 39 What causes such behavior in DNNs? Observation #4: Sparsity increases as you go deeper into the network Deeper layers: filters respond to more “class-specific” features (e.g., textures) [Figure: input images and their activations] * Zeiler et al., “Visualizing and Understanding Convolutional Networks”, arXiv.org, 2013

  40. (C) Minsoo Rhu 40 What causes such behavior in DNNs? Observation #4: Sparsity increases as you go deeper into the network [Figure: input images and their activations] * Zeiler et al., “Visualizing and Understanding Convolutional Networks”, arXiv.org, 2013
