Scaling the Cascades: Interconnect-aware FPGA implementation of Machine Learning problems
Anand Samajdar, Tushar Garg, Tushar Krishna, Nachiket Kapre
nachiket@uwaterloo.ca
[Title figure: BRAM, DSP, and URAM cascade columns]
Claim
• Hard FPGA interconnect (cascades) efficiently supports nearest-neighbour communication + reuse in ML workloads
• Three kinds of UltraScale+ cascades: DSP, BRAM, URAM
• Combination of (1) pixel, (2) row, (3) map reuse
• Deliverables:
• 650 MHz full-chip operation
• 7x better latency, 30% lower throughput than the formidable Xilinx SuperTile design for GoogLeNet v1
Landscape of FPGA+ML accelerators
Communication Requirements of 3x3 Convolution
[Figure: input rows k, k+1, k+2 of input map I are each multiplied by a row of the 3x3 weights and summed into output row k of output map J; annotated with the three data-movement patterns: (1) pixel streaming along a row, (2) row streaming across the three rows, (3) channel streaming across input maps]
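As a behavioral reference, a minimal Python sketch of this slide's dataflow, with the three streaming patterns appearing as the three loops; array names and shapes are illustrative, not from the paper:

```python
import numpy as np

def conv3x3_output_row(in_maps, weights, k):
    """in_maps: [C, H, W] pixels; weights: [C, 3, 3]; returns output row k."""
    C, H, W = in_maps.shape
    out_row = np.zeros(W - 2)
    for c in range(C):                        # (3) channel streaming
        for r in range(3):                    # (2) row streaming: rows k..k+2
            for x in range(W - 2):            # (1) pixel streaming along a row
                out_row[x] += np.dot(weights[c, r], in_maps[c, k + r, x:x + 3])
    return out_row
```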
Reuse Patterns
[Figure build: the 3x3 window unrolls onto a chain of nine multiply-add units; (1) pixel streaming rides the A cascade, partial sums ride the P cascade, and weight rows 0..2 stream in on the B cascade; (2) row streaming across input rows k, k+1, k+2 exploits data reuse, turning input map I into output map J]
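A minimal sketch of the length-9 reduction the P cascade performs for one 3x3 window; purely behavioral, with pipelining and the A/B cascades omitted:

```python
def p_cascade_mac(pixels9, weights9):
    """One 3x3 window: 9 pixels against 9 weights, reduced along the chain."""
    psum = 0
    for dsp in range(9):          # PCOUT -> PCIN hop to the next DSP slice
        psum += pixels9[dsp] * weights9[dsp]
    return psum
```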
Reuse Patterns
[Figure build: the chain is packaged as a "3x3 Convolution Tile" taking weights and input map I to output map J; stacking tiles for input maps I+1, I+2, ... and summing their outputs gives (3) channel streaming]
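A sketch of the tile-and-accumulate structure, assuming one tile per input map; `conv_tile` is an illustrative stand-in for the DSP/BRAM tile of the previous slides:

```python
import numpy as np

def conv_tile(in_map, w3x3):
    """One '3x3 Convolution Tile': valid 3x3 correlation of a single map."""
    H, W = in_map.shape
    out = np.zeros((H - 2, W - 2))
    for r in range(3):
        for c in range(3):
            out += w3x3[r, c] * in_map[r:r + H - 2, c:c + W - 2]
    return out

def output_map_J(in_maps, kernels):
    """(3) channel streaming: sum tile outputs over input maps I, I+1, ..."""
    return sum(conv_tile(m, w) for m, w in zip(in_maps, kernels))
```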
Xilinx UltraScale+ FPGA Cascades
• BRAM18 supports A/B cascades
• DSP48 supports A, B, P cascades (systolic input and summation)
• URAM288 supports A, B cascades (2x72b-wide links)
[Figure: (1) pixel streaming maps onto DSP cascades, (2) row streaming onto BRAM cascades, (3) channel streaming onto URAM cascades]
Outline
• Understanding Cascades
• Assembling the FPGA accelerator + FPGA Layout
• MLPerf Evaluation
• Conclusions + Discussion
Promise of Cascades
• Absorb data movement onto dedicated interconnect vs. general-purpose wiring
• Higher clock frequency operation, layout-friendly architecture
Our approach
• Exploit cascades aggressively!
• DSP48
• For 3x3 convolution, length-9 cascades
• P cascade for summation (like the INT8 paper)
• A cascade for systolic retiming (like the DSP48E2 user guide), see the sketch below
• B cascade for weights (our contribution)
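A behavioral sketch of the systolic chain, following the two-registers-on-A, one-register-on-P arrangement of the user guide's systolic FIR; a 3x3 window unrolled along a row behaves as a 9-tap FIR. Register counts are simplified (the M register is folded in), so this illustrates the timing idea rather than the exact DSP48E2 pipeline:

```python
def systolic_fir9(samples, taps):
    n = len(taps)                       # nine DSP48 slices
    a = [0] * (2 * n)                   # A cascade: two registers per slice
    p = [0] * n                         # P cascade: one register per slice
    out = []
    for s in list(samples) + [0] * (2 * n):    # pad to flush the pipeline
        a = [s] + a[:-1]                       # pixels advance along ACIN/ACOUT
        # each slice multiplies its preloaded tap by its delayed pixel and adds
        # the neighbour's previous-cycle partial sum (registered PCIN)
        p = [a[2 * i + 1] * taps[i] + (p[i - 1] if i else 0) for i in range(n)]
        out.append(p[-1])
    return out                          # y(t) = sum_j taps[j] * x(t - 9 - j)
```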
Our approach
• Exploit cascades aggressively!
• RAMB18E2 (our contribution)
• For 3x3 convolution, chains only 3 BRAMs long
• A/B cascade for the shift operation
• Swap between A and B to keep one read port available (toy model below)
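A toy model of the 3-BRAM row buffer; `deque` stands in for the A/B cascade chain, and the port-swapping detail is only noted in comments:

```python
from collections import deque

class RowCascade:
    """Three RAMB18s, each holding one image row of the sliding window."""
    def __init__(self):
        self.brams = deque(maxlen=3)    # RAMB18 (A), (B), (C)

    def push_row(self, row):
        # In hardware, one port (A or B, alternating) drives the cascade
        # shift while the other port stays free for window reads.
        self.brams.appendleft(list(row))   # oldest row falls off the chain

    def window_rows(self):
        """Rows k, k+1, k+2 feeding the 3x3 window."""
        return list(self.brams)
```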
Our approach
• Exploit cascades aggressively!
• URAM288 (our contribution)
• Alternating A/B cascades of length 2
• Cascade both data and addresses
• Shift operation is tricky!
• Due to the 72b width and resource ratios, idle cycles are available for realizing shifts (sketch below)
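A sketch of why those idle cycles exist; this is our reading of the slide, not a schedule from the paper: a 72b URAM word carries several narrow pixels, so most cycles need no fresh word and can step the shift instead.

```python
def uram_schedule(n_cycles, pixels_per_word=8):
    """Assumed 8b pixels packed into 72b words; exact packing may differ."""
    ops = []
    for t in range(n_cycles):
        if t % pixels_per_word == 0:
            ops.append("read")          # fetch the next 72b word for compute
        else:
            ops.append("shift")         # idle slot: advance the URAM shift
    return ops
```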
Putting it together
[Figure build: nine DSP48 multiply-add slices form the compute chain; weights enter with an initial shift; pixel streaming feeds the chain; RAMB18s (A), (B), (C) hold input rows i, i+1, i+2, with row streaming rotating rows through them; a URAM288 (Input/Weights) is cascaded from the previous URAM to the next, a RAMB18 (Kern) holds the kernel, and map streaming steps through the input maps; a second URAM288 (Output) accumulates results (+); the finished datapath is annotated (1) pixel streaming, (2) row streaming, (3) channel streaming]
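Pulling the build together, a self-contained behavioral model of one tile: the input URAM feeds a 3-row BRAM buffer, nine DSP MACs reduce each window, and the output URAM accumulates across maps. It mirrors the block diagram, not the cycle-exact hardware:

```python
import numpy as np

def run_tile(input_maps, kernels, H, W):
    out = np.zeros((H - 2, W - 2))                        # URAM 288 (Output)
    for img, k in zip(input_maps, kernels):               # (3) map streaming
        rows = [img[0], img[1], img[2]]                   # RAMB18 (A)(B)(C)
        for y in range(H - 2):                            # (2) row streaming
            for x in range(W - 2):                        # (1) pixel streaming
                window = np.array([r[x:x + 3] for r in rows])
                out[y, x] += np.sum(window * k)           # 9-DSP P cascade
            if y + 3 < H:
                rows = rows[1:] + [img[y + 3]]            # BRAM row shift
    return out
```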
A 3x3 tile layout
[Figure: placed-and-routed 3x3 convolution tile; places and routes at 1.2 ns]
Tiling the design
• VU37P device has a specific resource mix
• For each URAM, you get 4.2 BRAMs and 9.4 DSP48s
• Repeating pattern must conform to this ratio
• One tile: 2 URAMs, 8 BRAMs, 18 DSPs (see the check below)
• Physical layout XDC constraints must account for the irregular column arrangement of hard resources
[Figure: one tile highlighted on the device floorplan]
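A quick consistency check of the ratios. The device totals are assumptions taken from public VU37P tables (verify against the datasheet), but they reproduce the 4.2 and 9.4 ratios on the slide; a 2/8/18 tile then repeats about 480 times, which also lines up with the 960x9 systolic array quoted in the evaluation (480 tiles x 2 chains of 9 DSPs):

```python
# Assumed VU37P totals: 960 URAMs, 4032 BRAM18s (2016 BRAM36), 9024 DSP48s
vu37p = {"URAM": 960, "BRAM18": 4032, "DSP48": 9024}

per_uram = {k: v / vu37p["URAM"] for k, v in vu37p.items()}
print(per_uram)     # ~{URAM: 1.0, BRAM18: 4.2, DSP48: 9.4}

tile = {"URAM": 2, "BRAM18": 8, "DSP48": 18}
n_tiles = min(vu37p[k] // tile[k] for k in tile)
print(n_tiles)      # ~480 tiles by resource ratio alone
```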
Matrix-Matrix Multiplication
• Limited reuse opportunities
• Split the large matrix across URAMs
• Each URAM stores a set of complete rows -> its slice of the result vector can be processed independently
• Partial vector results are then circulated across the chip in ring-like fashion -> using BRAM cascades (sketch below)
• URAM cascades are only used for loading the matrix at the start
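A functional sketch of the scheme, with the ring timing abstracted away; `n_urams` and the helper name are illustrative:

```python
import numpy as np

def ring_matvec(A, x, n_urams):
    """Rows of A split across URAMs; each produces its output slice locally."""
    chunks = np.array_split(A, n_urams, axis=0)   # complete rows per URAM
    results = [c @ x for c in chunks]             # independent local matvecs
    y = []
    for r in results:          # results circulate over the BRAM-cascade ring
        y.extend(r)
    return np.array(y)
```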
VU37P Layout
[Figure: full-device floorplan]
VU37P FPGA Mapping
[Figure: device floorplan partitioned ~80% convolution, ~20% matrix-multiply]
Effect of using cascades
• Registers in the hard interconnect save us fabric registers for other pipelining needs
• Clock period marginally better
• Obvious reduction in interconnect utilization
[Plot: route count (25K-75K) vs. congestion (%), cascade vs. no-cascade designs]
Evaluation Methodology
• We use the SCALE-Sim cycle-accurate simulator
• https://github.com/ARM-software/SCALE-Sim
• Map URAMs -> IFMAP/OFMAP SRAMs
• BRAMs and DSP cascades -> systolic array links
• VU37P can fit a systolic array of size 960x9 (conv), 480x9 (mm)
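Plain arithmetic (not SCALE-Sim output) for what those array sizes imply at the 650 MHz clock from the claim slide:

```python
def peak_macs_per_s(rows, cols, f_hz=650e6):
    """Ideal MACs/s for a fully utilized rows x cols systolic array."""
    return rows * cols * f_hz

for name, (r, c) in {"conv": (960, 9), "matmul": (480, 9)}.items():
    print(name, peak_macs_per_s(r, c) / 1e12, "TMAC/s")   # ~5.6 and ~2.8
```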
Xilinx SuperTile
• GoogLeNet v1 mapped to a VU9P FPGA (Amazon F1)
• 3046 images/s + 3.3 ms latency
• Scorching 720 MHz operation!
• Mind-numbing 88% overall efficiency
http://isfpga.org/slides/Compute-Efficient_Neural-Network_Acceleration.pdf
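The implied comparison figures, assuming the claim slide's "7x better latency, 30% lower throughput" are measured relative to these SuperTile numbers:

```python
supertile_lat_ms, supertile_tput = 3.3, 3046          # from the slide above
print("latency  ~%.2f ms" % (supertile_lat_ms / 7))   # ~0.47 ms
print("images/s ~%d" % (supertile_tput * 0.7))        # ~2132 images/s
```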