Direct Convolution for Deep Neural Network Classification on Tegra X1
Alan Wang, NVIDIA
Convolutional Neural Network

[Diagram: input layer → convolutional layers → fully connected layers → result; the convolutional layers are the optimization target]
Convolutional Layer

An example:

Parameter   Description              Value in this case
C           # input channels         4
H, W        input feature map size   6x6
K           # output channels        3
U, V        output stride            1, 1
R, S        filter size              3x3

4 input feature maps (A, B, C, D) and 3 output feature maps (E, F, G), so 12 different filters.

Each output feature map is the sum of convolutions over all input channels:
E = A convolve with filters[A][E]
  + B convolve with filters[B][E]
  + C convolve with filters[C][E]
  + D convolve with filters[D][E]

Example filter (3x3):          Example input feature map (6x6):
1 1 0                          2 1 4 5 8 5
0 0 1                          6 3 4 6 5 9
0 1 0                          5 3 5 2 5 6
                               8 1 7 3 2 4
                               7 1 2 4 4 3
                               3 9 9 8 3 2

Total input pixels  = C * H * W             = 4 * 6 * 6
Total coefficients  = K * C * R * S         = 3 * 4 * 3 * 3
Total math          = K * C * H * W * R * S = 3 * 4 * 6 * 6 * 3 * 3
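The per-output-channel sum above can be sketched as a plain C++ reference (a minimal sketch: dimension names follow the slide's C/H/W/R/S, while the zero-padded border and the flattened data layout are assumptions not fixed by the slide):

```cpp
#include <vector>
#include <cassert>

// Naive direct convolution for one output channel, stride 1, with a
// centered filter window and zero padding so the output is the same
// H x W as the input (an assumption; the slide leaves the border open).
// in:      C x H x W input pixels, flattened
// filters: C x R x S coefficients for this output channel, flattened
std::vector<float> convolve_one_channel(const std::vector<float>& in,
                                        const std::vector<float>& filters,
                                        int C, int H, int W, int R, int S) {
    std::vector<float> out(H * W, 0.0f);
    for (int ci = 0; ci < C; ++ci)                 // sum over input channels
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                for (int ri = 0; ri < R; ++ri)
                    for (int si = 0; si < S; ++si) {
                        int iy = y + ri - R / 2;   // centered filter window
                        int ix = x + si - S / 2;
                        if (iy < 0 || iy >= H || ix < 0 || ix >= W)
                            continue;              // zero padding
                        out[y * W + x] += in[(ci * H + iy) * W + ix] *
                                          filters[(ci * R + ri) * S + si];
                    }
    return out;
}
```

With C = 1 and an identity filter (only the center tap set to 1), the output reproduces the input, which makes the index arithmetic easy to check.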
Analysis of Overfeat
#1 Analysis of Math/Memory ratio

Total math   = K * C * P * Q * R * S
Total memory = C * H * W + K * C * R * S + K * P * Q

Overfeat
Layer                     1        2      3      4      5      6
Input channels            3        96     256    512    512    1024
Output channels           96       256    512    512    1024   1024
Filter size               7x7      7x7    3x3    3x3    3x3    3x3
Padding size              /        /      1x1    1x1    1x1    1x1
Input size                221x221  36x36  15x15  15x15  15x15  15x15
Math/Memory (FFMA/Byte)   31.8     173.8  48.5   50.6   52.1   53.3

With math/memory ratios this high, there is a big space to explore on the GPU to make these layers math-throughput limited.
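The Math/Memory row can be approximated from the two formulas above (a sketch; it assumes 4-byte floats and that P, Q are the output sizes implied by each layer's filter, padding, and stride, which the slide does not spell out):

```cpp
#include <cassert>
#include <cmath>

// FFMA per byte for one convolutional layer, assuming 4-byte floats:
// math   = K*C*P*Q*R*S FFMAs
// memory = (C*H*W + K*C*R*S + K*P*Q) elements * 4 bytes
double math_per_byte(long K, long C, long H, long W,
                     long R, long S, long P, long Q) {
    double math  = (double)K * C * P * Q * R * S;
    double bytes = 4.0 * ((double)C * H * W + (double)K * C * R * S
                          + (double)K * P * Q);
    return math / bytes;
}
```

For Overfeat layer 2 (C = 96, K = 256, 36x36 input, 7x7 filter, no padding, stride 1, so P = Q = 30) this gives ~173.8 FFMA/byte, matching the table.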
#2 Analysis of layer configuration variety

Overfeat
Layer             1        2      3      4      5      6      Range
Input channels    3        96     256    512    512    1024   3 ~ 1024
Output channels   96       256    512    512    1024   1024   96 ~ 1024
Filter size       7x7      7x7    3x3    3x3    3x3    3x3    3 ~ 7
Padding size      /        /      1x1    1x1    1x1    1x1    0 ~ 1
Input size        221x221  36x36  15x15  15x15  15x15  15x15  221 ~ 15

The configurations vary widely, so the implementation should not rely on the assumption of any particular configuration.
#3 Analysis of Input/Coefficient ratio

Total inputs       = C * H * W
Total coefficients = K * C * R * S

Overfeat
Layer               1        2      3      4      5      6
Input channels      3        96     256    512    512    1024
Output channels     96       256    512    512    1024   1024
Filter size         7x7      7x7    3x3    3x3    3x3    3x3
Padding size        /        /      1x1    1x1    1x1    1x1
Input size          221x221  36x36  15x15  15x15  15x15  15x15
Input/Coefficient   10.383   0.103  0.063  0.063  0.031  0.027

Coefficients dominate in every layer except the first few.
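Since C appears in both totals, the ratio reduces to H*W / (K*R*S). A quick check (a sketch) reproduces the first two table entries:

```cpp
#include <cassert>
#include <cmath>

// Ratio of input pixels to filter coefficients:
// (C*H*W) / (K*C*R*S) = H*W / (K*R*S)  -- C cancels out.
double input_coeff_ratio(long K, long H, long W, long R, long S) {
    return (double)(H * W) / (double)(K * R * S);
}
```

Layer 1 (221x221, K = 96, 7x7) gives 10.383 and layer 2 (36x36, K = 256, 7x7) gives 0.103, as in the table; the later entries also depend on how padding is counted into the input size.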
The direct convolution prototype on Tegra X1
Workload Distribution

Distribute the workload by output space (K x P x Q):
Per CUDA block: 32 output channels.
Each output channel is split into tiles in the P x Q plane; threads (T0, T1, T2, T3, ...) each take one tile, with thread x/y mapped to the tile position and thread z mapped to the output channel.
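That distribution can be expressed as index arithmetic (a sketch: the 32-channels-per-block group size comes from the slide, while the exact axis assignment and the TILE_X/TILE_Y parameters are assumptions for illustration):

```cpp
#include <cassert>

// Map a (block, thread) pair to the output element it owns, given
// 32 output channels per block and TILE_X x TILE_Y output tiles.
struct OutCoord { int k, p, q; };

OutCoord owner(int blockIdxX, int blockIdxY, int blockIdxZ,
               int threadIdxX, int threadIdxY, int threadIdxZ,
               int TILE_X, int TILE_Y) {
    OutCoord o;
    o.k = blockIdxZ * 32 + threadIdxZ;      // output channel: 32 per block
    o.p = blockIdxY * TILE_Y + threadIdxY;  // output row within P
    o.q = blockIdxX * TILE_X + threadIdxX;  // output column within Q
    return o;
}
```

Neighboring threads along x then cover neighboring output columns of the same channel, which keeps their input footprints overlapping and shareable.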
Data Reuse

[Diagram: C x H x W input pixels staged through shared memory and R x S coefficients (32*C filters) combined to produce 32 output channels]
Data Reuse: Input pixels

Every CUDA block needs to fetch the entire input pixel space.
Inter-block redundant loads are handled by the cache.
Intra-block redundant loads are handled by shared memory.

Total load ~ K/32 * C * H * W

[Diagram: Block 0, Block 1, Block 2, ..., Block n, each loading all input pixels]
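Plugging in Overfeat layer 6 makes the redundancy concrete (a sketch): with K = 1024 there are K/32 = 32 blocks per output position group, so every input pixel is fetched about 32 times across blocks, and that is the traffic the cache has to absorb.

```cpp
#include <cassert>

// Total input loads across all blocks: each of the K/32 blocks
// fetches the full C*H*W input space once.
long total_input_loads(long K, long C, long H, long W) {
    return (K / 32) * C * H * W;
}
```

For layer 6 (K = C = 1024, 15x15 input) this is 32x the raw input size.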
Data Reuse: Coefficients

There is no coefficient reuse between blocks.
Intra-block reuse only occurs between threads working on the same output channel, and is handled by the cache.

[Diagram: threads T0, T1, T2, T3 reading coefficients from Filters[Ki]{C}{R}{S} for output channel Ki]
Input Pixels Layout

Use a 3D texture to elegantly handle out-of-bound accesses in every input channel.

[Diagram: a 6x6 input channel with in-bound accesses and out-of-bound accesses beyond the border]

Return-data modes when an out-of-bound access occurs:
• cudaAddressModeWrap
• cudaAddressModeClamp
• cudaAddressModeMirror
• cudaAddressModeBorder

Mapping to texture: tex3D<float>(textureObject, W, H, C)
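On the CPU side, the effect of cudaAddressModeBorder (return zero for any coordinate outside the volume) can be emulated like this (a sketch of the addressing behavior, not of the texture hardware; the flattened layout is an assumption):

```cpp
#include <vector>
#include <cassert>

// Read one pixel from a C x H x W volume, returning 0.0f for any
// out-of-bound coordinate -- the behavior of cudaAddressModeBorder.
float read_border(const std::vector<float>& pix,
                  int C, int H, int W, int ci, int y, int x) {
    if (ci < 0 || ci >= C || y < 0 || y >= H || x < 0 || x >= W)
        return 0.0f;                      // border mode: zero outside
    return pix[(ci * H + y) * W + x];
}
```

With a texture, this zero-fill happens in hardware on every tex3D fetch, so the kernel needs no explicit bound checks in its inner loops.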
Coefficients Layout

To keep load requests coalesced, store coefficients in C, R, S, K order: for each (ci, ri, si), the K values (ki = 0 ~ K-1) are contiguous in memory.

[Diagram: threads 0, 1, ..., n of block 0 all issuing "load filters[][ci=1][ri=0][si=1]", so the warp's requests hit consecutive addresses; memory walks Ri=0,Si=0; Ri=0,Si=1; ... Ri=2,Si=2 within Ci=0, then Ci=1]
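In C,R,S,K order the flat offset of coefficient (ci, ri, si, ki) is ((ci*R + ri)*S + si)*K + ki, so the threads of a warp, which share (ci, ri, si) but hold consecutive ki, touch consecutive addresses and their load coalesces (a sketch of the index math):

```cpp
#include <cassert>

// Flat offset of coefficient (ci, ri, si, ki) in C,R,S,K order.
long coeff_offset(long ci, long ri, long si, long ki,
                  long R, long S, long K) {
    return ((ci * R + ri) * S + si) * K + ki;
}
```

Two threads at the same (ci, ri, si) with adjacent ki are exactly one element apart, which is the coalescing property the layout is chosen for.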
Per-Thread pseudo code

for ci:
    load input footprint to shared memory (pixBuffer)
    __syncthreads()
    load 1 coefficient to cbuffer_front
    for ri:
        for si:
            load 1 coefficient to cbuffer_back
            for TILE_X:
                for TILE_Y:
                    outputBuffer[] += pixBuffer[] * cbuffer_front
            switch cbuffer_front <-> cbuffer_back
write the outputBuffer to global memory
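A single-thread C++ model of that loop structure (a sketch: shared memory becomes direct zero-padded reads instead of an explicit pixBuffer, the two coefficient registers become front/back variables, and stride 1 with a centered window is assumed; names follow the pseudocode):

```cpp
#include <vector>
#include <cassert>
#include <utility>

// Scalar model of the per-thread loop: computes a TILE_Y x TILE_X tile of
// one output channel at output origin (py, qx), stride 1, zero padding.
// in:      C x H x W input, flattened
// filters: C x R x S coefficients of this output channel, flattened
std::vector<float> conv_tile(const std::vector<float>& in,
                             const std::vector<float>& filters,
                             int C, int H, int W, int R, int S,
                             int py, int qx, int TILE_Y, int TILE_X) {
    std::vector<float> outputBuffer(TILE_Y * TILE_X, 0.0f);
    auto pixel = [&](int ci, int y, int x) -> float {  // zero-padded read
        if (y < 0 || y >= H || x < 0 || x >= W) return 0.0f;
        return in[(ci * H + y) * W + x];
    };
    for (int ci = 0; ci < C; ++ci) {
        // "load 1 coefficient to cbuffer_front": prefetch the first tap
        float cbuffer_front = filters[(ci * R + 0) * S + 0];
        for (int ri = 0; ri < R; ++ri)
            for (int si = 0; si < S; ++si) {
                // "load 1 coefficient to cbuffer_back": prefetch next tap
                int nsi = (si + 1) % S;
                int nri = (nsi == 0) ? ri + 1 : ri;
                float cbuffer_back = (nri < R)
                    ? filters[(ci * R + nri) * S + nsi] : 0.0f;
                // accumulate the whole tile with the front coefficient
                for (int ty = 0; ty < TILE_Y; ++ty)
                    for (int tx = 0; tx < TILE_X; ++tx)
                        outputBuffer[ty * TILE_X + tx] +=
                            pixel(ci, py + ty + ri - R / 2,
                                      qx + tx + si - S / 2) * cbuffer_front;
                std::swap(cbuffer_front, cbuffer_back);  // double buffer
            }
    }
    return outputBuffer;
}
```

The front/back swap is the point of the structure: the next coefficient is fetched while the current one feeds the FFMAs, hiding the load latency.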
Performance

Test layer: Overfeat 6

Input channels   Output channels   Input size   Filter size   Padding   Stride
1024             1024              15x15        3x3           1x1       1x1

Algorithm configuration

Tile size   Coefficients layout   Input layout         Output layout
5x8         C,R,S,K               3D texture (W,H,C)   K,P,Q

Test platform: Tegra X1
Current performance: GFLOPS utilization ~ 75%
Summary

• Prototyped a direct convolution implementation to accelerate the convolutional layers of DNN classification
• Analyzed the optimization techniques
• Achieved high GFLOPS utilization on Tegra X1, currently 75%, with optimization continuing
THANK YOU alanw@nvidia.com