Persistent RNNs
(stashing recurrent weights on-chip)
Gregory Diamos
Baidu SVAIL
April 7, 2016
SVAIL
Think hard AI.
Goal: Develop hard AI technologies that impact 100 million users.
Deep Learning at SVAIL
[Figure: recognition accuracy vs. data and compute, comparing deep learning (state of the art) with many previous methods relative to human level; the compute scale spans 100 GFLOP/s (1 laptop), 6 TFLOP/s (1 GPU), 800 TFLOP/s (128 GPUs), and 100 PFLOP/s (16K GPUs).]
Hypothesis: deep learning scales with data and compute.
Can we strong scale deep learning to the limits of technology?
Persistent RNNs
30x speedup at a mini-batch size of 4.
Why is reducing the mini-batch size important?
Train bigger and deeper models.
Strong scale to more GPUs.
Improve efficiency of deployed models.
Training Deep RNNs
Deep Speech
Near human level speech recognition in Mandarin and English.
Trained on over 10,000 hours (about 1 year) of speech data.
20 ExaFLOPs of work to train (7 days on 16 GPUs at 40% of peak).
Data parallel training
[Figure: a mini-batch of speech data split across GPU 0, GPU 1, ...]
Data parallelism:
The training data is grouped into mini-batches.
Each GPU trains a copy of the model on a slice of the mini-batch.
GPUs synchronize their models after a fixed number of steps.
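As a rough illustration of the data-parallel step above, the sketch below assumes NCCL is used for the cross-GPU synchronization; the helper names computeGradients and applyUpdate, and the choice to all-reduce gradients every step rather than synchronizing models after several steps, are illustrative assumptions rather than the SVAIL implementation.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

// One data-parallel step on a single GPU (each GPU/process runs this).
// The gradient buffer is summed across all GPUs so every model copy applies
// the same update, as if it had seen the whole mini-batch.
void dataParallelStep(float* gradients, float* parameters, size_t parameterCount,
                      ncclComm_t comm, cudaStream_t stream)
{
    // 1. Forward/backward pass on this GPU's slice of the mini-batch.
    // computeGradients(parameters, gradients, stream);            // assumed helper

    // 2. Sum gradients across GPUs (in place).
    ncclAllReduce(gradients, gradients, parameterCount,
                  ncclFloat, ncclSum, comm, stream);

    // 3. Apply the identical update on every GPU's copy of the model.
    // applyUpdate(parameters, gradients, parameterCount, stream);  // assumed helper
}
```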
Mini-batch constraints
So how should you choose the mini-batch size?
[Figure: wall-clock time to convergence vs. mini-batch size; below roughly 64 per GPU the hardware is inefficient, above roughly 1024 the optimization is inefficient.]
Hardware efficiency will set a lower bound.
Optimization efficiency will set an upper bound.
Shrinking the mini-batch per GPU enables the use of more GPUs.
Determining the batch size
The upper bound can be found empirically. In general a hyperparameter search is needed, but a useful heuristic is:
momentum = 1.0 - miniBatchSize / windowSize
learningRate = stepSize * (1.0 - momentum) * miniBatchSize
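A minimal sketch of that heuristic, with windowSize and stepSize treated as tuning constants; the concrete values below are illustrative assumptions, not the settings used for Deep Speech.

```cuda
#include <cstdio>

struct OptimizerSettings
{
    float momentum;
    float learningRate;
};

// Heuristic from the slide: larger mini-batches get less momentum and a
// proportionally larger learning rate.
OptimizerSettings settingsForMiniBatch(float miniBatchSize, float windowSize,
                                       float stepSize)
{
    OptimizerSettings settings;
    settings.momentum     = 1.0f - miniBatchSize / windowSize;
    settings.learningRate = stepSize * (1.0f - settings.momentum) * miniBatchSize;
    return settings;
}

int main()
{
    // Example with the algorithmic mini-batch of 512 used later in the talk;
    // windowSize = 20000 and stepSize = 1e-3 are made-up placeholders.
    OptimizerSettings settings = settingsForMiniBatch(512.0f, 20000.0f, 1.0e-3f);
    std::printf("momentum %.4f, learning rate %.6f\n",
                settings.momentum, settings.learningRate);
    return 0;
}
```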
Persistent RNN Details
RNN primer
RNNs built on GEMM calls reload the weights (U) each timestep.
However, the weights are constant, and this is wasteful.
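To make the reloading concrete, here is a minimal cuBLAS-style sketch of a GEMM-based recurrent layer computing h_t = f(U h_{t-1} + x_t); the buffer layout and the applyNonlinearity helper are assumptions, but the structural point stands: every timestep streams the full weight matrix U back in from off-chip memory.

```cuda
#include <cublas_v2.h>

// One forward pass of a simple recurrent layer using a GEMM per timestep.
// preActivations is assumed to already hold the input contribution x_t for
// the current timestep; beta = 1 accumulates U * h_{t-1} on top of it.
void rnnForwardWithGemm(cublasHandle_t handle,
                        const float* U,        // hiddenSize x hiddenSize, never changes
                        float* h,              // hiddenSize x miniBatch hidden state
                        float* preActivations, // hiddenSize x miniBatch scratch
                        int hiddenSize, int miniBatch, int timesteps)
{
    const float alpha = 1.0f;
    const float beta  = 1.0f;

    for (int t = 0; t < timesteps; ++t)
    {
        // Each call re-reads all hiddenSize^2 weights from DRAM, even though
        // U is identical on every iteration. At small mini-batch sizes this
        // load dominates and leaves the math units idle.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    hiddenSize, miniBatch, hiddenSize,
                    &alpha, U, hiddenSize, h, hiddenSize,
                    &beta, preActivations, hiddenSize);

        // applyNonlinearity(h, preActivations, hiddenSize * miniBatch);  // assumed kernel
    }
}
```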
Caching weights in registers
GPU memory hierarchy (per level: bandwidth, compute, capacity, latency):
GPU (x1): 380 GB/s, 6.144 TFLOP/s, 5.5 MB, 300 ns
Core (x24): 128 GB/s, 256 GFLOP/s, 230 KB, 30 ns
Thread (x128): 16 GB/s, 2 GFLOP/s, 896 B, 6 ns
Off-chip memory is much slower and less efficient than registers.
GPUs have more on-chip memory in registers than anywhere else.
Cache RNN weights in registers and reuse them over timesteps.
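A hypothetical sketch of that idea follows: each thread claims a small tile of the recurrent weight matrix, loads it into registers once, and reuses it on every timestep, with activations staged through shared memory. The tile shape, the indexing, and the simplified reduction are placeholders rather than the actual SVAIL kernel.

```cuda
#include <cuda_runtime.h>

// Illustrative tile shape only; the real kernel tunes this to fill the
// register file without spilling.
#define ROWS_PER_THREAD 4
#define COLS_PER_THREAD 4

__global__ void persistentRnnSketch(const float* __restrict__ weights, // hiddenSize x hiddenSize
                                    const float* __restrict__ input,   // current activations
                                    float* output, int hiddenSize, int timesteps)
{
    extern __shared__ float activations[];   // one hidden-state vector per SM

    // One-time load: this thread's weight tile lives in registers for the
    // entire kernel instead of being re-fetched from DRAM each timestep.
    int row0 = blockIdx.x  * ROWS_PER_THREAD;   // block row owned by this CTA
    int col0 = threadIdx.x * COLS_PER_THREAD;   // column slice owned by this thread
    float w[ROWS_PER_THREAD][COLS_PER_THREAD];
    for (int i = 0; i < ROWS_PER_THREAD; ++i)
        for (int j = 0; j < COLS_PER_THREAD; ++j)
            w[i][j] = weights[(row0 + i) * hiddenSize + col0 + j];

    for (int t = 0; t < timesteps; ++t)
    {
        // Stage the current activations into shared memory (simplified; the
        // real kernel uses interleaved, vectorized loads).
        for (int c = threadIdx.x; c < hiddenSize; c += blockDim.x)
            activations[c] = input[c];
        __syncthreads();

        // Thread-local math against the register-cached weights.
        float partial[ROWS_PER_THREAD] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int i = 0; i < ROWS_PER_THREAD; ++i)
            for (int j = 0; j < COLS_PER_THREAD; ++j)
                partial[i] += w[i][j] * activations[col0 + j];

        // Crude stand-in for the intra-SM reduction: accumulate each thread's
        // partial sums into the output row. The real kernel reduces in shared
        // memory, applies the nonlinearity, and crosses an inter-CTA barrier
        // before the next timestep (see the barrier sketch later in the deck).
        for (int i = 0; i < ROWS_PER_THREAD; ++i)
            atomicAdd(&output[row0 + i], partial[i]);
        __syncthreads();
    }
}
```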
Choosing the tile sizes
[Figure: the 1152 x 1152 recurrent weight matrix partitioned into block rows across SMs 0-23, with 8 warps per SM and interleaved thread (0-31) assignments within each warp.]
Block rows avoid additional inter-CTA synchronizations.
Each SM loads the activations into shared memory.
Threads are interleaved to avoid shared memory bank conflicts.
Vector loads and broadcasts amplify shared memory bandwidth.
Global barriers on GPUs
[Figure: a kernel launch compared with a persistent kernel launch; a grid of cooperative thread arrays crosses global barriers around divergent branches instead of relaunching kernels.]
An inter-CTA barrier is implemented with a counting semaphore.
Uses atomic, membar, and cache modified load/store operations.
Completes in about 500 ns on a TitanX GPU.
Disclaimer: global barriers violate the CUDA 7.5 model.
CUDA does not guarantee forward progress of multiple CTAs.
Our system implements cooperative threading for correctness.
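The sketch below shows one common way to build such a counting-semaphore barrier with atomics and memory fences. It is a simplified illustration in the spirit of the slide, not the production implementation, and it only works because cooperative threading guarantees every CTA in the grid is resident on the GPU.

```cuda
#include <cuda_runtime.h>

__device__ unsigned int barrierCount = 0;                // CTAs arrived so far
__device__ volatile unsigned int barrierGeneration = 0;  // release signal

// Inter-CTA barrier: every CTA of the grid must call this, and every CTA must
// be resident on the GPU, or the spin loop deadlocks.
__device__ void globalBarrier(unsigned int totalCtas)
{
    __syncthreads();                          // quiesce this CTA first
    if (threadIdx.x == 0)
    {
        unsigned int generation = barrierGeneration;
        __threadfence();                      // publish this CTA's prior writes

        // atomicInc wraps to zero after totalCtas - 1, so the counter resets
        // itself; the last CTA to arrive sees totalCtas - 1 and releases all.
        if (atomicInc(&barrierCount, totalCtas - 1) == totalCtas - 1)
        {
            barrierGeneration = generation + 1;
        }
        else
        {
            while (barrierGeneration == generation)
                ;                             // spin until the generation bumps
        }
        __threadfence();                      // observe other CTAs' published writes
    }
    __syncthreads();                          // hold the rest of the CTA here
}
```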
Software pipelining
[Figure: pipeline diagram; four mini-batch samples each step through load, math, reduce, and barrier stages per timestep, staggered across iterations i0 through i4n+2.]
Software pipelining is used to hide latency:
Thread-local math (430 ns).
Intra-SM reduction (320 ns).
Global loads (315 ns).
Global barrier (500 ns).
These are grouped into 4 pipeline stages, kept full with a mini-batch of 4.
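The fragment below is a hypothetical sketch of the steady-state schedule only: four mini-batch samples occupy the four stages at once, so the roughly 500 ns barrier of one sample hides behind the loads, math, and reduction of the other three. SampleState and the stage helpers are placeholders, and in the real kernel the overlap comes from independent instruction streams rather than an explicit round-robin loop.

```cuda
struct SampleState { /* per-sample registers, buffers, barrier phase, ... */ };

// Placeholder stage bodies; in the real kernel these are the latency-bound
// operations being hidden.
__device__ void loadStage   (SampleState&) { /* global loads,      ~315 ns */ }
__device__ void mathStage   (SampleState&) { /* thread-local math, ~430 ns */ }
__device__ void reduceStage (SampleState&) { /* intra-SM reduce,   ~320 ns */ }
__device__ void barrierStage(SampleState&) { /* global barrier,    ~500 ns */ }

__device__ void pipelinedTimesteps(SampleState state[4], int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        loadStage   (state[(i + 3) % 4]);  // newest sample fetches its activations
        mathStage   (state[(i + 2) % 4]);  // FMAs against register-cached weights
        reduceStage (state[(i + 1) % 4]);  // combine partial sums within the SM
        barrierStage(state[(i + 0) % 4]);  // oldest sample crosses the global barrier
    }
}
```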
Strong Scaling
Scaling to 128 GPUs
Scaling results for end-to-end model training.
8 GPUs per node, 7 GB/s InfiniBand between nodes.
The algorithmic mini-batch size is fixed at 512.
[Figure: Deep Speech scaling with 1152-unit layers; TeraFLOP/s (0-300) vs. GPU count (0-140) for PERSISTENT-RNN, GEMM-RNN, and PERFECT SCALING.]
A smaller mini-batch per GPU enables the use of up to 128 GPUs.
Exploring deep residual RNNs
Using a mini-batch per GPU of 4 provides a 16x reduction in memory.
Models with more parameters can now fit into GPU memory.
[Figure: deep residual network error rate reduction with depth; word error rate (English, 27-36) vs. recurrent layer count (0-90) for a deep residual RNN.]
Results suggest that residual skip connections apply to RNNs.
Pascal and future
Future GPUs will enable bigger and faster RNN layers:
bigger GPUs (more threads, more registers)
low latency atomics between GPUs (NVLink)
lower precision (fp16)
Conclusions
So far, deep learning for speech recognition has scaled with compute.
[Figure: recognition accuracy vs. data and compute, repeated from the opening slide; deep learning (state of the art) compared with many previous methods relative to human level, from 100 GFLOP/s (1 laptop) and 6 TFLOP/s (1 GPU) to 800 TFLOP/s (128 GPUs) and 100 PFLOP/s (16K GPUs).]
Persistent kernels provide a new tool for accelerating RNN training.
Let's continue building faster computers, software, and algorithms.
What other hard AI problems will scale with deep learning and compute?
Questions
Questions?
Contact Me: Gregory Diamos - gregdiamos@baidu.com
Baidu USA is hiring! http://usa.baidu.com/careers/