DRAM Access Reduction by Node Fusion with TVM
Chia-Wei Chang, Jing-Jia Liou, Chih-Tsun Huang, Wei-Chung Hsu & Juin-Ming Lu
National Tsing Hua University & Industrial Technology Research Institute
Dec 5th, 2019
DRAM Access Consumes More Energy
• Energy efficiency is the key to DNN computation
• Hardware accelerators
• DRAM consumes 50-100x more energy per byte than SRAM
• Node fusion is used to save DRAM accesses
[Figure: relative energy: DRAM 250x, SRAM 4x, Register 1x]
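As a rough illustration of why moving an intermediate tensor from DRAM to SRAM matters, the sketch below weights access counts by the relative energies in the figure (250x, 4x, 1x). The access counts and the helper name `relative_energy` are hypothetical and chosen only for the arithmetic.

```python
# Relative energy per access, taken from the figure on this slide (register = 1x).
RELATIVE_ENERGY = {"dram": 250, "sram": 4, "register": 1}


def relative_energy(accesses):
    """Sum access counts weighted by the relative cost of each memory level."""
    return sum(RELATIVE_ENERGY[level] * count for level, count in accesses.items())


# Hypothetical case: an intermediate tensor of 1000 elements is written once
# and read once, either through DRAM (no fusion) or through SRAM (fusion).
unfused = relative_energy({"dram": 2000})
fused = relative_energy({"sram": 2000})
print(unfused, fused, unfused / fused)
```

With the figure's numbers, the DRAM-to-SRAM ratio is 250 / 4 = 62.5, which falls inside the 50-100x range quoted in the bullet.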
TVM Only Fuses Elementwise OPs
• Currently, TVM only supports fusion of elementwise OPs into Conv
• Each OP has an attribute that indicates whether it can be fused
• Fusion generates a TVMOP, which groups nodes so that they share data in SRAM
[Figure: Conv, BatchNorm, and Relu grouped into a TVMOP; attribute labels shown: OutElementwiseFusable, Elementwise, Elementwise, TopLevel]
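The fusion rule described on this slide can be sketched in a few lines. This is a simplified model for a linear chain of operators, not TVM's actual pass or API: the pattern strings reuse the labels on the slide, and `fuse_elementwise` and the operator names are hypothetical.

```python
# Pattern names follow the labels on the slide; this is not TVM's API.
ELEMENTWISE = "Elementwise"
OUT_ELEMENTWISE_FUSABLE = "OutElementwiseFusable"


def fuse_elementwise(ops):
    """Group a linear chain of (name, pattern) ops into fused groups.

    An Elementwise op joins the group directly before it when that group
    is headed by an OutElementwiseFusable op; any other op starts a new group.
    """
    groups = []
    for name, pattern in ops:
        can_join = (
            pattern == ELEMENTWISE
            and len(groups) > 0
            and groups[-1][0][1] == OUT_ELEMENTWISE_FUSABLE
        )
        if can_join:
            groups[-1].append((name, pattern))
        else:
            groups.append([(name, pattern)])
    return [[name for name, _ in group] for group in groups]


chain = [
    ("conv1", OUT_ELEMENTWISE_FUSABLE),
    ("batch_norm1", ELEMENTWISE),
    ("relu1", ELEMENTWISE),
    ("conv2", OUT_ELEMENTWISE_FUSABLE),
    ("relu2", ELEMENTWISE),
]
fused_groups = fuse_elementwise(chain)
print(fused_groups)
```

Under this rule the two convolutions always land in separate groups, so the tensor between them still travels through DRAM.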
Our Node Fusion Merges Multiple Convs
[Figure: tensor data flow through the DNN operators. Without fusion: DRAM, 1st Conv, DRAM, 2nd Conv, DRAM. With fusion: DRAM, 1st Conv, SRAM, 2nd Conv, DRAM.]
Note: in the loop bounds, W1 and W2 are output widths; in the multiply-accumulate lines, W1 and W2 are the weight tensors.

Without fusion:
    for (n = 0; n < N; n++)                      # 1st Conv
      for (k = 0; k < C1; k++)
        for (y = 0; y < H1; y++)
          for (x = 0; x < W1; x++)
            for (c = 0; c < C0; c++)
              for (r = 0; r < R1; r++)
                for (s = 0; s < S1; s++)
                  O1[n][k][y][x] += W1[k][c][r][s] * I[n][c][y+r][x+s]
    for (n = 0; n < N; n++)                      # 2nd Conv
      for (k = 0; k < C2; k++)
        for (y = 0; y < H2; y++)
          for (x = 0; x < W2; x++)
            for (c = 0; c < C1; c++)
              for (r = 0; r < R2; r++)
                for (s = 0; s < S2; s++)
                  O2[n][k][y][x] += W2[k][c][r][s] * O1[n][c][y+r][x+s]

With fusion:
    for (n = 0; n < N; n++)
      for (k = 0; k < C2; k++)
        for (y = 0; y < H2; y++)
          for (x = 0; x < W2; x++)
            # Internal SRAM buffer, zero-initialized
            int sram[C1][R2][S2]
            for (c = 0; c < C1; c++)
              for (r = 0; r < R2; r++)
                for (s = 0; s < S2; s++)
                  for (c2 = 0; c2 < C0; c2++)
                    for (r2 = 0; r2 < R1; r2++)
                      for (s2 = 0; s2 < S1; s2++)
                        sram[c][r][s] += W1[c][c2][r2][s2] * I[n][c2][y+r+r2][x+s+s2]
            for (c = 0; c < C1; c++)
              for (r = 0; r < R2; r++)
                for (s = 0; s < S2; s++)
                  O[n][k][y][x] += W2[k][c][r][s] * sram[c][r][s]
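To check that the fused loop nest computes the same result as the two separate convolutions, the following pure-Python sketch runs both on a small random example. The shapes are hypothetical (batch size 1, stride 1, no padding), the weights are named `k1` and `k2` to avoid clashing with the widths, and the traffic counter assumes every access to the intermediate tensor goes to DRAM with no caching.

```python
import random
from itertools import product


def zeros(*shape):
    """Nested-list tensor of zeros with the given shape."""
    if len(shape) == 1:
        return [0] * shape[0]
    return [zeros(*shape[1:]) for _ in range(shape[0])]


def rand(rng, *shape):
    """Nested-list tensor of small random integers."""
    if len(shape) == 1:
        return [rng.randint(-3, 3) for _ in range(shape[0])]
    return [rand(rng, *shape[1:]) for _ in range(shape[0])]


# Hypothetical small shapes, chosen only for this sketch.
C0, C1, C2 = 2, 3, 2                          # input, intermediate, output channels
R1 = S1 = 2                                   # 1st Conv kernel height/width
R2 = S2 = 2                                   # 2nd Conv kernel height/width
H0 = W0 = 5                                   # input height/width
H1, WD1 = H0 - R1 + 1, W0 - S1 + 1            # 1st Conv output size
H2, WD2 = H1 - R2 + 1, WD1 - S2 + 1           # 2nd Conv output size

rng = random.Random(0)
inp = rand(rng, C0, H0, W0)
k1 = rand(rng, C1, C0, R1, S1)
k2 = rand(rng, C2, C1, R2, S2)


def conv1_at(c, y, x):
    """One output element of the 1st Conv at channel c, position (y, x)."""
    return sum(k1[c][c2][r2][s2] * inp[c2][y + r2][x + s2]
               for c2, r2, s2 in product(range(C0), range(R1), range(S1)))


def unfused():
    """Two separate convolutions; count accesses to the intermediate tensor."""
    traffic = 0
    o1 = zeros(C1, H1, WD1)
    for k, y, x in product(range(C1), range(H1), range(WD1)):
        o1[k][y][x] = conv1_at(k, y, x)
        traffic += 1                          # write of the intermediate
    o2 = zeros(C2, H2, WD2)
    for k, y, x in product(range(C2), range(H2), range(WD2)):
        for c, r, s in product(range(C1), range(R2), range(S2)):
            o2[k][y][x] += k2[k][c][r][s] * o1[c][y + r][x + s]
            traffic += 1                      # read of the intermediate
    return o2, traffic


def fused():
    """Fused loop nest: the intermediate lives only in a small local buffer."""
    out = zeros(C2, H2, WD2)
    for k, y, x in product(range(C2), range(H2), range(WD2)):
        sram = zeros(C1, R2, S2)              # stands in for the SRAM buffer
        for c, r, s in product(range(C1), range(R2), range(S2)):
            sram[c][r][s] = conv1_at(c, y + r, x + s)
        for c, r, s in product(range(C1), range(R2), range(S2)):
            out[k][y][x] += k2[k][c][r][s] * sram[c][r][s]
    return out, 0                             # intermediate never reaches DRAM


out_unfused, traffic_unfused = unfused()
out_fused, traffic_fused = fused()
print(out_unfused == out_fused, traffic_unfused, traffic_fused)
```

In this sketch the fused version recomputes the required patch of the 1st Conv for every output element of the 2nd Conv, so it trades extra arithmetic for the removal of intermediate DRAM traffic.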
Experiment Settings: Hardware
• Eyeriss-like architecture
• 256MB DRAM
• 108KB SRAM
• 12x14 PE
• Runs AlexNet
• Due to hardware limitation, only Conv is evaluated
[Figure: block diagram with a Controller, a Buffer holding ifmap, weights, ipsum, and opsum, a PE array, and DRAM]
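A quick capacity check shows how one might test whether a fused intermediate buffer of shape [C1][R2][S2] fits in the 108KB SRAM. Every number other than 108KB is an assumption made for illustration: 384 intermediate channels and a 3x3 second kernel (as in AlexNet's later Conv layers), 16-bit data, and 1KB = 1024 bytes. The slides do not state which layers were fused or how the SRAM is shared with ifmap, weights, and partial sums.

```python
SRAM_BYTES = 108 * 1024  # 108KB from the slide, assuming 1KB = 1024 bytes


def fused_buffer_bytes(c1, r2, s2, bytes_per_element):
    """Size of an intermediate buffer of shape [c1][r2][s2]."""
    return c1 * r2 * s2 * bytes_per_element


# Assumed for illustration: 384 channels, 3x3 second kernel, 16-bit data.
needed = fused_buffer_bytes(c1=384, r2=3, s2=3, bytes_per_element=2)
fits = needed <= SRAM_BYTES
print(needed, SRAM_BYTES, fits)
```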
Experimental Results
[Figure: three bar charts comparing "w/o Fusion" and "Fusion": Energy (mJ), MCycle, and Energy-Delay (KCycle.J); the charts are annotated with differences of 16%, 23%, and 40%]