
NMAX: Fast, Modular, Low Latency, Low Cost/Power Neural Inferencing



  1. NMAX: Fast, Modular, Low Latency, Low Cost/Power Neural Inferencing No NDA Required – Public Information 1

  2. Leader in eFPGA
  • TSMC IP Alliance Member
  • eFPGA working silicon in TSMC 40/28/16/12
  • eFPGA in design for GF14 and TSMC 7/7+
  • More than 10 customer deals (design/fab/silicon) from 16 nm to 180 nm; more in negotiation
  • First 6 agreements announced

  3. NMAX Value Proposition
  ▪ Inferencing: 8x8 or 16x8 natively; 16x16 at half throughput
  ▪ High MAC utilization: more throughput out of less silicon
  ▪ Low DRAM bandwidth: more throughput at less system cost and power
  ▪ Low latency: high performance, low latency at batch = 1
  ▪ Modular: can respond quickly to customer needs
  ▪ Scalable: doubling MACs doubles throughput
  ▪ Flexible: run any NN, or multiple NNs
  ▪ TensorFlow/Caffe: easy to program

  4. NMAX Inferencing Applications
  ▪ Edge
  • Automotive • Aerospace • Cameras • Cell Phones • Drones • PCs • Retail • Robotics • Speech • Surveillance
  ▪ Data Centers

  5. Comparing Neural Engine Alternatives

  6. Neural Inferencing Challenges & Terminology
  • An inferencing chip with 1,000 MACs @ 1 GHz delivers 1 TMAC/s = 2 TOPS (each MAC is a multiply plus an accumulate, i.e. 2 operations)
  ▪ This is a peak number: no one uses all of the MACs
  • Challenge: move data where needed, at low power, to keep the MACs utilized
  • Challenge: maintain high performance at batch=1 for the lowest latency
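The peak-TOPS arithmetic on this slide can be sketched in a few lines; the 1,000-MAC, 1 GHz chip is the slide's example, not a real part:

```python
# Minimal sketch of the peak-TOPS calculation.
num_macs = 1000       # the slide's example chip
clock_hz = 1e9        # 1 GHz
ops_per_mac = 2       # one multiply + one accumulate

peak_tops = num_macs * clock_hz * ops_per_mac / 1e12
print(peak_tops)      # 2.0 -- a peak figure; real utilization is lower
```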

  7. How do you get the NN throughput you need?
  1. Determine how many OPS/MACs are needed for each image
  ▪ YOLOv3 2MP = 400 Billion MACs/image = 800 Billion Operations/image
  2. Determine how many images/second you need to process
  ▪ YOLOv3 2MP autonomous driving: 30 images/sec = 24 TOPS throughput
  3. The number of MACs you need follows from these formulas:
  ▪ Y TOPS Peak = X TOPS Throughput ÷ MAC utilization
  ▪ MAC utilization varies with the NN, image size, and batch size; batch=1 is what you need at the edge
  ▪ Number of MACs required = Y TOPS Peak ÷ frequency of MAC completion
  ▪ NOTE: no shortcuts in the above for model pruning, Winograd, or compression
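The three sizing steps above can be worked through numerically. The YOLOv3 figures are the slide's; the 50% MAC utilization and 1 GHz MAC clock below are assumed illustrative values, and the ops-to-MACs conversion (divide by 2) is made explicit:

```python
# Worked sizing example (sketch) following the slide's three steps.
macs_per_image = 400e9               # YOLOv3, 2 MP: 400 billion MACs/image
ops_per_image = 2 * macs_per_image   # 1 MAC = 2 ops -> 800 billion ops
fps = 30                             # required images/second

throughput_tops = ops_per_image * fps / 1e12   # 24.0 TOPS required

mac_utilization = 0.50               # assumed; varies with NN, image, batch
peak_tops = throughput_tops / mac_utilization  # 48.0 TOPS peak needed

mac_clock_hz = 1e9                   # assumed MAC completion frequency
peak_macs_per_sec = peak_tops * 1e12 / 2       # convert ops back to MACs
num_macs = peak_macs_per_sec / mac_clock_hz    # 24,000 MACs required
```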

  8. MAC Utilization / MAC Efficiency
  • A MAC can only do a useful calculation if both the activation and the weight are available on its inputs; if not, it stalls
  • MAC Utilization = (# of useful MAC calculations) ÷ (# of MACs available)
  • Example:
  ▪ Nvidia Tesla T4 claims 3,920 images/second on ResNet-50 @ batch=28
  ▪ Each ResNet-50 image takes 7 Billion Operations (3.5 Billion MACs)
  ▪ So the T4 is doing 3,920 × 7 Billion Ops = 27.44 Trillion Ops/sec = 27.44 TOPS
  ▪ The T4 data sheet claims 130 TOPS (int8)
  ▪ So T4 MAC utilization, for ResNet-50 @ batch=28, is 21%
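The utilization arithmetic in this example is easy to reproduce; all figures below are the slide's claims, not independent measurements:

```python
# Sketch reproducing the slide's Tesla T4 utilization example.
images_per_sec = 3920        # claimed ResNet-50 throughput @ batch=28
ops_per_image = 7e9          # 3.5 billion MACs = 7 billion ops per image
peak_tops = 130.0            # T4 data-sheet int8 peak

achieved_tops = images_per_sec * ops_per_image / 1e12  # 27.44 TOPS
utilization = achieved_tops / peak_tops                # ~0.21, i.e. 21%
```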

  9. Microsoft BrainWave slide from Hot Chips 2018: [chart: hardware utilization (%) vs. batch size, and 99th-percentile latency vs. batch size, for existing solutions; the ideal is maximum utilization within the allowed latency] Batching improves HW utilization but increases latency

  10. ResNet-50 Int8
  ▪ Image classification
  ▪ 224 x 224 pixel images
  ▪ 50-stage neural network
  ▪ 22.7 Million weights
  ▪ 3.5 Billion MACs per image = 7 Billion Operations per image

  11. ResNet-50 Images/Second vs Batch Size: NMAX utilization is high at batch=1 [chart: images/second (0 to 16,000) at batch = 1, 5, 10, and 28 for Nvidia Tesla T4, Habana Goya, NMAX 6x6, NMAX 12x6, and NMAX 12x12; batch=1 is the edge case]

  12. Real Time Object Recognition: YOLOv3

  13. YOLOv3 Int8
  ▪ Real-time object recognition
  ▪ 2 Megapixel images
  ▪ >100-stage neural network
  ▪ 62 Million weights
  ▪ 400 Billion MACs per image = 800 Billion Operations per image

  14. NMAX: YOLOv3, 2048x1024, Batch=1, using 2 x 4Gbit LPDDR4 DRAM
  10x reduction in DRAM BW requirements vs competing solutions (<25 vs >300 GB/s); max usable DRAM BW: 25 GB/s

  NMAX array size      | 12x12     | 12x6      | 6x6
  SRAM size            | 64 MB     | 64 MB     | 32 MB
  TOPS peak            | 147       | 73        | 37
  Throughput (@1GHz)   | 124 fps   | 72 fps    | 27 fps
  Latency              | 8 ms      | 14 ms     | 37 ms
  Avg. DRAM BW         | 12 GB/s   | 14 GB/s   | 10 GB/s
  Avg. SRAM BW         | 177 GB/s  | 103 GB/s  | 34 GB/s
  XFLX & ArrayLINX BW  | 18 TB/s   | 10 TB/s   | 4 TB/s
  MAC efficiency       | 67%       | 78%       | 58%
  Throughput           | 98 TOPS   | 58 TOPS   | 22 TOPS

  The 12x12 array delivers T4-class throughput performance.

  15. Why NMAX is the right solution for performance inferencing
  • The most efficient implementation of any neural network is a hardwired ASIC
  • But customers want reconfigurability
  • NMAX is the reconfigurable architecture closest to a hardwired ASIC
  ▪ Each stage, once configured, executes just like an ASIC
  • NMAX can run any neural network written in TensorFlow/Caffe

  16. NMAX512 Tile Microarchitecture: 1 TOPS @ <2 mm2 in TSMC16FFC/12FFC
  Features:
  • 8 NMAX clusters achieve 50-90% MAC efficiency
  • Local eFPGA logic (EFLX) for:
  ▪ Control logic & management
  ▪ Reconfigurable data flow
  ▪ Additional signal processing (e.g. ReLU, Sigmoid, Tanh)
  • Local L1 SRAM for weights & activations
  • L2 SRAM (via RAMLINX)
  • L3 storage through DDR/PCIe
  • High-speed XFLX interconnect links all blocks within the tile
  • High-speed ArrayLINX connects to adjacent NMAX tiles to create larger NMAX arrays by abutment
  [diagram: NMAX512 tile with EFLX IO, EFLX logic, L1 SRAM, and 8 NMAX clusters around the XFLX interconnect; ArrayLINX to adjacent tiles; L2 SRAM via RAMLINX; connections to DDR, PCIe & SoC. Architectural diagram, not to scale]

  17. Every Tile is Reconfigured (quickly & differently) every stage
  This example does a matrix multiply of a 512-element activation vector from the prior stage times a weight matrix; the result is then passed through an activation function to produce the activation vector for the next stage.
  [diagram: input activation read from L2 SRAM, multiplied by the weight matrix in the NMAX clusters, passed through ACTIVATION, and written as the output activation to L2 SRAM. Architectural diagram, not to scale]
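The per-stage computation described here reduces to a matrix-vector product followed by an activation function. A minimal NumPy sketch, with illustrative random sizes/values and ReLU standing in for the configurable activation (ReLU/Sigmoid/Tanh) handled in EFLX logic:

```python
import numpy as np

# One NMAX stage (sketch): y = activation(W @ x).
rng = np.random.default_rng(0)
x = rng.random(512, dtype=np.float32)          # activations from prior stage
W = rng.random((512, 512), dtype=np.float32)   # weights for this stage

y = np.maximum(W @ x, 0.0)   # ReLU; becomes the next stage's activations
```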

  18. NMAX Clusters Systolically Multiply the Activation Vector by the Weights
  • Example of a 4-input vector multiplying a 4x4 weight matrix
  Source: Hardware for Neural Networks, page 466, https://page.mi.fu-berlin.de/rojas/neural/chapter/K18.pdf
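The cited 4x4 example can be modeled in software. The sketch below is a flattened, purely sequential model of the systolic dataflow (real hardware overlaps these steps in a pipeline): activations stream past the array one per step, and each output column accumulates one MAC per step.

```python
# Toy model (sketch) of a systolic vector-by-matrix multiply,
# after the Rojas example cited on the slide.
def systolic_matvec(W, x):
    n = len(x)
    acc = [0.0] * n               # one accumulator per output neuron
    for step in range(n):         # one activation streams in per step
        for col in range(n):      # each column MACs the current activation
            acc[col] += W[step][col] * x[step]
    return acc

W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 1, 1, 1]
print(systolic_matvec(W, x))  # column sums: [28.0, 32.0, 36.0, 40.0]
```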

  19. Modular NMAX arrays easily scale from 1 to >100 TOPS
  Features:
  • NMAX tiles form arrays by abutment
  • ArrayLINX interconnects on all 4 sides of each NMAX tile connect automatically to provide high-bandwidth, array-wide interconnect
  • Shared L2 SRAM:
  ▪ Local, high-capacity SRAMs placed between the NMAX tiles
  ▪ Holds the weights for each layer, as well as the activations passed from one layer to the next
  ▪ EFLX place-and-route algorithms minimize interconnect distances between SRAM and NMAX
  [diagram: 2x2 NMAX512 array with a DDR interface, SoC/PCIe connection, and L2 SRAM blocks between the tiles. Architectural diagram, not to scale]

  20. NMAX is dataflow: the NMAX Compiler maps from Caffe/TensorFlow
  • The mapped NN automatically "unrolls" onto the NMAX hardware
  • Control logic & data operators map to EFLX reconfigurable logic
  [diagram: weights w00-w33 and inputs i0-i3 are loaded into an NMAX tile, which computes node outputs n0-n3 and streams the result out]
