Neural Acceleration for GPU Throughput Processors
Hardik Sharma, Jongse Park, Amir Yazdanbakhsh, Pejman Lotfi-Kamran*, Hadi Esmaeilzadeh
Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology
*The Institute for Research in Fundamental Sciences
[Figure: NGPU, a neurally accelerated GPU built from streaming multiprocessors (SMs)]
Approximate Computing: Embracing Imprecision
Relax the abstraction of "near perfect" accuracy in data processing, storage, and communication.
Accept imprecision to improve performance, energy dissipation, and resource utilization efficiency.
Opportunity
Many GPU applications are amenable to approximation:
Augmented Reality, Machine Learning, Computer Vision, Sensor Processing, Robotics, Multimedia
Opportunity
[Figure: fraction of runtime that is approximable vs. non-approximable for binarization, blackscholes, convolution, inversek2j, jmeint, laplacian, meanfilter, newton-raph, sobel, srad, and their geometric mean]
More than 55% of application runtime and energy is spent in neurally approximable regions.
Neural Transformation for GPUs
[Figure: an approximable code region is replaced with an invocation of a trained neural network]
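A minimal sketch of what the transformation produces, assuming a hypothetical approximable region; the 2-2-1 topology matches the example network on the later slides, but the region, names, and weights are illustrative, not from the talk:

```cuda
#include <math.h>

// Logistic activation used by the replacement network.
__device__ float sigmoidf(float x) {
    return 1.0f / (1.0f + expf(-x));
}

// Original precise region: live-ins (a, b), live-out y.
__device__ float precise_region(float a, float b) {
    return sqrtf(a * a + b * b);  // e.g., a gradient magnitude
}

// Transformed region: a trained 2-2-1 multilayer perceptron approximates
// the same live-in -> live-out mapping. The weights w[0..5] would come
// from offline training on observed input/output pairs of the region.
__device__ float neural_region(float a, float b, const float w[6]) {
    float n0 = sigmoidf(w[0] * a + w[2] * b);  // hidden neuron n0
    float n1 = sigmoidf(w[1] * a + w[3] * b);  // hidden neuron n1
    return sigmoidf(w[4] * n0 + w[5] * n1);    // output neuron
}
```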
Challenges
[Figure: a GPU as a sea of simple cores]
GPUs are many-core processors built from simple, in-order cores that rely on data-level parallelism.
Augmenting each SIMD lane with a CPU-style neural processing unit imposes a 31.2% area overhead.
NGPU: Neurally-Accelerated GPU Architecture
[Figure: a streaming multiprocessor (SM) pipeline with Fetch (I-$), Decode, Issue, Operand Collection (src_reg1, src_reg2, src_reg3, dst_reg), SIMD lanes governed by the active mask and SIMT stack, an LSU pipeline with D-$, and Writeback; SMs connect through the interconnection network to memory partitions with L2 cache and off-chip DRAM]
Neural Network Operations
Each neuron computes a weighted sum of its inputs followed by a sigmoid activation:
$y_j = \mathrm{sigmoid}\big(\sum_{i=0}^{n} w_{j,i} \times x_{j,i}\big)$
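This multiply-accumulate-then-sigmoid structure is what lets NGPU reuse the lane ALU. A sketch of one neuron in CUDA (the function name and array layout are illustrative):

```cuda
#include <math.h>

// Computes y_j = sigmoid( sum_{i=0..n} w_j[i] * x_j[i] ) for one neuron
// with inputs x_{j,0} .. x_{j,n}. The loop body is a multiply-accumulate,
// exactly the operation an existing SIMD-lane ALU already provides.
__device__ float neuron(const float* w_j, const float* x_j, int n) {
    float acc = 0.0f;
    for (int i = 0; i <= n; ++i)
        acc += w_j[i] * x_j[i];          // w_{j,i} * x_{j,i}
    return 1.0f / (1.0f + expf(-acc));   // sigmoid
}
```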
NGPU: Neurally-Accelerated GPU Architecture
[Figure: each SIMD lane augmented with an input FIFO, an output FIFO, an accumulator register (Acc Reg), a sigmoid unit, and a controller; a weight FIFO feeds the lanes]
NGPU reuses the existing ALU in each SIMD lane.
The weight FIFO is shared among all the SIMD lanes.
Overall, NGPU imposes ≤1% area overhead.
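A behavioral sketch of one lane in neural mode, written as host-side C++; structure names mirror the slide's callouts, while the sizes and single-lane view are assumptions. In the actual design the shared weight FIFO broadcasts each weight to all lanes rather than being consumed per lane:

```cuda
#include <cmath>
#include <queue>

// Behavioral model of one neurally accelerated SIMD lane (not the RTL).
struct SimdLaneModel {
    std::queue<float> input_fifo;   // per-lane live-in values
    std::queue<float> output_fifo;  // per-lane live-out values
    float acc_reg = 0.0f;           // accumulator register (Acc Reg)

    // The controller sequences one neuron: consume n_in inputs against
    // n_in weights, accumulate with the lane's existing ALU, then push
    // the sigmoid-unit result into the output FIFO.
    void run_neuron(std::queue<float>& weight_fifo, int n_in) {
        acc_reg = 0.0f;
        for (int i = 0; i < n_in; ++i) {
            float x = input_fifo.front(); input_fifo.pop();
            float w = weight_fifo.front(); weight_fifo.pop();
            acc_reg += w * x;  // MAC on the reused ALU
        }
        output_fifo.push(1.0f / (1.0f + std::exp(-acc_reg)));  // Sig. Unit
    }
};
```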
NGPU Execution Model
[Figure: a neurally accelerated GPU application and the neural network it invokes]
    ld.global    %r0, [addr0];   // in0 (%r0)
    ld.global    %r1, [addr1];   // in1 (%r1)
    send.n_data  %r0;
    send.n_data  %r1;
    recv.n_data  %r2;
    st.global    [addr2], %r2;   // out0 (%r2)
The network: inputs in0 and in1 feed hidden neurons n0 and n1 through weights w0–w3; n0 and n1 feed the output neuron n2 (out0) through weights w4 and w5.
NGPU Execution Model
[Figure: step-by-step execution of the neurally accelerated region across the SIMD lanes]
Before send.n_data, the SIMD lanes are in normal mode and perform precise computation; the ld.global instructions execute as usual.
send.n_data %r0 and send.n_data %r1 push each thread's live-ins, (in0, in0, …, in0) and (in1, in1, …, in1), into the input FIFOs; the SIMD lanes enter neural mode.
The SIMD lanes start the calculation of the neural network:
    w0 × (in0, in0, …, in0) + w2 × (in1, in1, …, in1) → sigmoid → n0
    w1 × (in0, in0, …, in0) + w3 × (in1, in1, …, in1) → sigmoid → n1
    w4 × (n0, n0, …, n0) + w5 × (n1, n1, …, n1) → sigmoid → (out0, out0, …, out0)
The accelerated SIMD lanes autonomously calculate the neural outputs in lock-step.
recv.n_data %r2 returns out0 to the register file, the lanes resume normal mode, and st.global writes the result.
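A functional sketch of this execution model in plain CUDA: since send.n_data and recv.n_data are NGPU ISA extensions, the network is evaluated inline here to show what each lane computes in lock-step. The weight layout and launch configuration are illustrative:

```cuda
#include <math.h>

// What the accelerated region computes, one network invocation per thread.
__global__ void ngpu_region(const float* addr0, const float* addr1,
                            float* addr2, const float* w) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float in0 = addr0[tid];  // ld.global %r0, [addr0]
    float in1 = addr1[tid];  // ld.global %r1, [addr1]
    // send.n_data %r0 / %r1: each active lane supplies its own live-ins
    float n0   = 1.0f / (1.0f + expf(-(w[0] * in0 + w[2] * in1)));
    float n1   = 1.0f / (1.0f + expf(-(w[1] * in0 + w[3] * in1)));
    float out0 = 1.0f / (1.0f + expf(-(w[4] * n0 + w[5] * n1)));
    // recv.n_data %r2: the lane resumes normal mode with the result
    addr2[tid] = out0;       // st.global [addr2], %r2
}
// Hypothetical launch: ngpu_region<<<grid, block>>>(d_in0, d_in1, d_out, d_w);
```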