
NMAX: Fast, Modular, Low Latency, Low Cost/Power Neural Inferencing



  1. NMAX: Fast, Modular, Low Latency, Low Cost/Power Neural Inferencing No NDA Required – Public Information 1

  2. Leader in eFPGA
  • TSMC IP Alliance Member
  • eFPGA working silicon in TSMC 40/28/16/12
  • eFPGA in design for GF14 and TSMC 7/7+
  • More than 10 customer deals (design/fab/silicon) from 16 nm to 180 nm; more in negotiation
  • First 6 agreements announced

  3. NMAX Value Proposition
  ▪ Inferencing: 8x8 or 16x8 natively; 16x16 at half throughput
  ▪ High MAC utilization: more throughput out of less silicon
  ▪ Low DRAM bandwidth: more throughput at less system cost and power
  ▪ Low latency: high performance, low latency at batch = 1
  ▪ Modular: can respond quickly to customer needs
  ▪ Scalable: doubling MACs doubles throughput
  ▪ Flexible: run any NN, or multiple NNs
  ▪ TensorFlow/Caffe: easy to program

  4. NMAX Inferencing Applications
  ▪ Edge
  • Automotive • Aerospace • Cameras • Cell Phones • Drones • PCs • Retail • Robotics • Speech • Surveillance
  ▪ Data Centers

  5. Comparing Neural Engine Alternatives

  6. Neural Inferencing Challenges & Terminology
  • An inferencing chip with 1,000 MACs @ 1 GHz delivers 1 TMAC/s = 2 TOPS (each MAC is a multiply plus an accumulate, i.e. 2 operations)
  ▪ This is a peak number: no one uses all of the MACs
  • Challenge: move data where needed, at low power, to keep the MACs utilized
  • Challenge: maintain high performance at batch=1 for the lowest latency
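The peak-TOPS arithmetic on this slide can be sketched in a few lines; the 1,000-MAC, 1 GHz chip is the slide's example, not a real part:

```python
# Minimal sketch of the peak-TOPS calculation.
num_macs = 1000       # the slide's example chip
clock_hz = 1e9        # 1 GHz
ops_per_mac = 2       # one multiply + one accumulate

peak_tops = num_macs * clock_hz * ops_per_mac / 1e12
print(peak_tops)      # 2.0 -- a peak figure; real utilization is lower
```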

  7. How do you get the NN throughput you need?
  1. Determine how many OPS/MACs are needed for each image
  ▪ YOLOv3 2MP = 400 Billion MACs/image = 800 Billion Operations/image
  2. Determine how many images/second you need to process
  ▪ YOLOv3 2MP autonomous driving: 30 images/sec = 24 TOPS throughput
  3. The number of MACs you need follows from these formulas:
  ▪ Y TOPS Peak = X TOPS Throughput ÷ MAC utilization
  ▪ MAC utilization varies with the NN, image size, and batch size; batch=1 is what you need at the edge
  ▪ Number of MACs required = Y TOPS Peak ÷ frequency of MAC completion
  ▪ NOTE: no shortcuts in the above for model pruning, Winograd, or compression
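The three sizing steps above can be worked through numerically. The YOLOv3 figures are the slide's; the 50% MAC utilization and 1 GHz MAC clock below are assumed illustrative values, and the ops-to-MACs conversion (divide by 2) is made explicit:

```python
# Worked sizing example (sketch) following the slide's three steps.
macs_per_image = 400e9               # YOLOv3, 2 MP: 400 billion MACs/image
ops_per_image = 2 * macs_per_image   # 1 MAC = 2 ops -> 800 billion ops
fps = 30                             # required images/second

throughput_tops = ops_per_image * fps / 1e12   # 24.0 TOPS required

mac_utilization = 0.50               # assumed; varies with NN, image, batch
peak_tops = throughput_tops / mac_utilization  # 48.0 TOPS peak needed

mac_clock_hz = 1e9                   # assumed MAC completion frequency
peak_macs_per_sec = peak_tops * 1e12 / 2       # convert ops back to MACs
num_macs = peak_macs_per_sec / mac_clock_hz    # 24,000 MACs required
```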

  8. MAC Utilization / MAC Efficiency
  • A MAC can only do a useful calculation if both the activation and the weight are available on its inputs; if not, it stalls
  • MAC Utilization = (# of useful MAC calculations) ÷ (# of MACs available)
  • Example:
  ▪ Nvidia Tesla T4 claims 3,920 images/second on ResNet-50 @ batch=28
  ▪ Each ResNet-50 image takes 7 Billion Operations (3.5 Billion MACs)
  ▪ So the T4 is doing 3,920 × 7 Billion Ops = 27.44 Trillion Ops/sec = 27.44 TOPS
  ▪ The T4 data sheet claims 130 TOPS (int8)
  ▪ So T4 MAC utilization, for ResNet-50 @ batch=28, is 21%
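The utilization arithmetic in this example is easy to reproduce; all figures below are the slide's claims, not independent measurements:

```python
# Sketch reproducing the slide's Tesla T4 utilization example.
images_per_sec = 3920        # claimed ResNet-50 throughput @ batch=28
ops_per_image = 7e9          # 3.5 billion MACs = 7 billion ops per image
peak_tops = 130.0            # T4 data-sheet int8 peak

achieved_tops = images_per_sec * ops_per_image / 1e12  # 27.44 TOPS
utilization = achieved_tops / peak_tops                # ~0.21, i.e. 21%
```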

  9. Microsoft BrainWave slide from Hot Chips 2018: [chart: hardware utilization (%) vs. batch size, and 99th-percentile latency vs. batch size, for existing solutions; the ideal is maximum utilization within the allowed latency] Batching improves HW utilization but increases latency

  10. ResNet-50 Int8
  ▪ Image classification
  ▪ 224 x 224 pixel images
  ▪ 50-stage neural network
  ▪ 22.7 Million weights
  ▪ 3.5 Billion MACs per image = 7 Billion Operations per image

  11. ResNet-50 Images/Second vs Batch Size: NMAX utilization is high at batch=1 [chart: images/second (0 to 16,000) at batch = 1, 5, 10, and 28 for Nvidia Tesla T4, Habana Goya, NMAX 6x6, NMAX 12x6, and NMAX 12x12; batch=1 is the edge case]

  12. Real Time Object Recognition: YOLOv3

  13. YOLOv3 Int8
  ▪ Real-time object recognition
  ▪ 2 Megapixel images
  ▪ >100-stage neural network
  ▪ 62 Million weights
  ▪ 400 Billion MACs per image = 800 Billion Operations per image

  14. NMAX: YOLOv3, 2048x1024, Batch=1, using 2 x 4Gbit LPDDR4 DRAM
  10x reduction in DRAM BW requirements vs competing solutions (<25 vs >300 GB/s); max usable DRAM BW: 25 GB/s

  NMAX array size      | 12x12     | 12x6      | 6x6
  SRAM size            | 64 MB     | 64 MB     | 32 MB
  TOPS peak            | 147       | 73        | 37
  Throughput (@1GHz)   | 124 fps   | 72 fps    | 27 fps
  Latency              | 8 ms      | 14 ms     | 37 ms
  Avg. DRAM BW         | 12 GB/s   | 14 GB/s   | 10 GB/s
  Avg. SRAM BW         | 177 GB/s  | 103 GB/s  | 34 GB/s
  XFLX & ArrayLINX BW  | 18 TB/s   | 10 TB/s   | 4 TB/s
  MAC efficiency       | 67%       | 78%       | 58%
  Throughput           | 98 TOPS   | 58 TOPS   | 22 TOPS

  The 12x12 array delivers T4-class throughput performance.

  15. Why NMAX is the right solution for performance inferencing
  • The most efficient implementation of any neural network is a hardwired ASIC
  • But customers want reconfigurability
  • NMAX is the reconfigurable architecture closest to a hardwired ASIC
  ▪ Each stage, once configured, executes just like an ASIC
  • NMAX can run any neural network written in TensorFlow/Caffe

  16. NMAX512 Tile Microarchitecture: 1 TOPS @ <2 mm2 in TSMC16FFC/12FFC
  Features:
  • 8 NMAX clusters achieve 50-90% MAC efficiency
  • Local eFPGA logic (EFLX) for:
  ▪ Control logic & management
  ▪ Reconfigurable data flow
  ▪ Additional signal processing (e.g. ReLU, Sigmoid, Tanh)
  • Local L1 SRAM for weights & activations
  • L2 SRAM (via RAMLINX)
  • L3 storage through DDR/PCIe
  • High-speed XFLX interconnect links all blocks within the tile
  • High-speed ArrayLINX connects to adjacent NMAX tiles to create larger NMAX arrays by abutment
  [diagram: NMAX512 tile with EFLX IO, EFLX logic, L1 SRAM, and 8 NMAX clusters around the XFLX interconnect; ArrayLINX to adjacent tiles; L2 SRAM via RAMLINX; connections to DDR, PCIe & SoC. Architectural diagram, not to scale]

  17. Every Tile is Reconfigured (quickly & differently) every stage
  This example does a matrix multiply of a 512-element activation vector from the prior stage times a weight matrix; the result is then passed through an activation function to produce the activation vector for the next stage.
  [diagram: input activation read from L2 SRAM, multiplied by the weight matrix in the NMAX clusters, passed through ACTIVATION, and written as the output activation to L2 SRAM. Architectural diagram, not to scale]
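The per-stage computation described here reduces to a matrix-vector product followed by an activation function. A minimal NumPy sketch, with illustrative random sizes/values and ReLU standing in for the configurable activation (ReLU/Sigmoid/Tanh) handled in EFLX logic:

```python
import numpy as np

# One NMAX stage (sketch): y = activation(W @ x).
rng = np.random.default_rng(0)
x = rng.random(512, dtype=np.float32)          # activations from prior stage
W = rng.random((512, 512), dtype=np.float32)   # weights for this stage

y = np.maximum(W @ x, 0.0)   # ReLU; becomes the next stage's activations
```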

  18. NMAX Clusters Systolically Multiply the Activation Vector by the Weights
  • Example of a 4-input vector multiplying a 4x4 weight matrix
  Source: Hardware for Neural Networks, page 466, https://page.mi.fu-berlin.de/rojas/neural/chapter/K18.pdf
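The cited 4x4 example can be modeled in software. The sketch below is a flattened, purely sequential model of the systolic dataflow (real hardware overlaps these steps in a pipeline): activations stream past the array one per step, and each output column accumulates one MAC per step.

```python
# Toy model (sketch) of a systolic vector-by-matrix multiply,
# after the Rojas example cited on the slide.
def systolic_matvec(W, x):
    n = len(x)
    acc = [0.0] * n               # one accumulator per output neuron
    for step in range(n):         # one activation streams in per step
        for col in range(n):      # each column MACs the current activation
            acc[col] += W[step][col] * x[step]
    return acc

W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 1, 1, 1]
print(systolic_matvec(W, x))  # column sums: [28.0, 32.0, 36.0, 40.0]
```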

  19. Modular NMAX arrays easily scale from 1 to >100 TOPS
  Features:
  • NMAX tiles form arrays by abutment
  • ArrayLINX interconnects on all 4 sides of each NMAX tile connect automatically to provide high-bandwidth, array-wide interconnect
  • Shared L2 SRAM:
  ▪ Local, high-capacity SRAMs placed between the NMAX tiles
  ▪ Holds the weights for each layer, as well as the activations passed from one layer to the next
  ▪ EFLX place-and-route algorithms minimize interconnect distances between SRAM and NMAX
  [diagram: 2x2 NMAX512 array with a DDR interface, SoC/PCIe connection, and L2 SRAM blocks between the tiles. Architectural diagram, not to scale]

  20. NMAX is dataflow: the NMAX Compiler maps from Caffe/TensorFlow
  • The mapped NN automatically "unrolls" onto the NMAX hardware
  • Control logic & data operators map to EFLX reconfigurable logic
  [diagram: weights w00-w33 and inputs i0-i3 are loaded into an NMAX tile, which computes node outputs n0-n3 and streams the result out]
