InS-DLA: An In-SSD Deep Learning Accelerator for Near-Data Processing


1. InS-DLA: An In-SSD Deep Learning Accelerator for Near-Data Processing
Shengwen Liang 1,2, Ying Wang 1,2, Cheng Liu 1,2, Huawei Li 1,2,3, Xiaowei Li 1,2
1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
2 University of Chinese Academy of Sciences
3 Peng Cheng Laboratory, Shenzhen, China

2. Outline
• Conventional setup: data travels from the SSD over SATA/PCIe to the CPU and DRAM, then over PCIe (peer to peer) to a GPU/FPGA. The result is a long data movement path, high energy consumption, and multiple memory hierarchies.
• InS-DLA: a deep learning accelerator integrated into the SSD controller, next to the DRAM and NAND flash. It shortens the data movement path and improves the energy efficiency of data analysis.

3. Opportunities & Challenges
• The internal bandwidth of an 8-channel SSD can be 16x higher than its external bandwidth (SATA, 2-port SAS, or PCIe x4), and the gap has widened from 2006 to 2016. In-SSD data access is therefore more efficient than access through the external interface; the opportunity is to fully utilize the high internal bandwidth of the SSD.
• Open-Channel SSD: a conventional SSD hides block metadata, write buffering, wear-leveling, and error handling behind a logical address inside the device. An Open-Channel SSD moves these functions to the host system and exposes physical addresses, so the host can specify the data layout on the non-volatile media directly through the media controller.
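The internal-vs-external bandwidth gap can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the per-plane NAND transfer rate, plane count, and SATA figure are assumed values, not numbers from the slides (which give only the 8-channel count and the rough 16x claim).

```python
# Illustrative comparison of aggregate internal SSD bandwidth vs. the
# external interface. All per-device rates here are assumptions.

N_CHANNELS = 8                  # from the slides (8-channel SSD)
PLANES_PER_CHANNEL = 2          # assumed multi-plane operation
MB_PER_S_PER_PLANE = 400        # assumed NAND channel transfer rate

internal_bw = N_CHANNELS * PLANES_PER_CHANNEL * MB_PER_S_PER_PLANE
sata_bw = 550                   # approximate effective SATA III, MB/s

print(f"internal: {internal_bw} MB/s, external (SATA): {sata_bw} MB/s")
print(f"ratio: {internal_bw / sata_bw:.1f}x")
```

Even with these modest assumed rates, the aggregate internal bandwidth exceeds the external interface by an order of magnitude, which is the headroom an in-SSD accelerator can exploit.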

4. Opportunities & Challenges
• NAND flash vs. DRAM: flash supports only coarse-grained data operations (pages of 4-16 KB, versus bytes in DRAM) and is slower than DRAM.
• The error correction engine occupies 75.3% of the SSD controller chip, a considerable share of its hardware resources.
These constraints raise two questions:
1. How to rearrange the data layout in the NAND flash to balance the flash bandwidth and the DLA throughput?
2. How to configure the error correction engine to provide more hardware resources for the deep learning accelerator in the area- and power-limited SSD controller chip?

5. Overview of InS-DLA
• Controller architecture: an ARM core runs the firmware, split into user space (I/O interface over SATA/PCIe, scheduling, error handling) and kernel space (wear leveling, garbage collection, bad block management, data placement). Each flash channel has its own PHY, flash memory controller (FMC), and D-ECC unit in front of the NAND flash.
• The InS-DLA itself is a PE array fed by a weight buffer and an in/out buffer, placed beside the flash channels.
• It directly accesses data from NAND flash instead of DRAM.
• It uses the output-stationary dataflow introduced in Eyeriss.
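The output-stationary dataflow named above can be sketched in a few lines. This is a minimal functional model, not the InS-DLA implementation: each (virtual) PE owns one output element and keeps its partial sum in place while inputs and weights stream past, which is the defining property of the output-stationary scheme.

```python
# Minimal model of an output-stationary dataflow (as in Eyeriss):
# every output element C[i][j] is pinned to one PE, and only inputs
# and weights move between time steps.

def output_stationary_matmul(A, B):
    """C = A @ B with one virtual PE per output element."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]    # accumulators, fixed per PE
    for t in range(k):                 # time step: stream one k-slice
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j]   # partial sum stays local
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(output_stationary_matmul(A, B))  # [[19, 22], [43, 50]]
```

Keeping the accumulator local avoids writing partial sums back to a buffer every cycle, which is why the scheme suits an accelerator with limited on-chip storage.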

6. Flash-aware data layout
Question 1: how to rearrange the data layout in the NAND flash to balance the flash bandwidth and the DLA throughput?
• A physical page address command names a (channel, LUN, block, page, sector) location. Data parallelism across channels is maximized by laying data out with the physical page address commands provided by the Open-Channel SSD, so PEs on different channels are fed concurrently.
• Single-channel throughput is improved with the read page cache command: while the page register streams one page out to the flash controller, the cache register loads the next page from the flash block, pipelining array reads with data transfer.
• Together, these two techniques form the flash-aware data layout.
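The channel-parallel part of the layout can be sketched as an address mapping: rotating the channel index fastest means consecutive logical pages land on different channels and can be read concurrently. The flash geometry constants below are illustrative assumptions, and the mapping is a sketch of the idea, not the paper's exact layout.

```python
# Sketch of a flash-aware layout: stripe consecutive logical pages
# channel-first so a sequential read keeps all channels busy.
# Geometry constants are assumed, not taken from the paper.

CHANNELS, LUNS, PAGES_PER_BLOCK = 8, 2, 256

def flash_aware_ppa(logical_page):
    """Map a logical page to (channel, lun, block, page), rotating
    the channel index fastest for maximum channel parallelism."""
    channel = logical_page % CHANNELS
    rest = logical_page // CHANNELS
    lun = rest % LUNS
    rest //= LUNS
    block = rest // PAGES_PER_BLOCK
    page = rest % PAGES_PER_BLOCK
    return channel, lun, block, page

# Eight consecutive pages hit eight different channels:
print([flash_aware_ppa(p)[0] for p in range(8)])  # [0, 1, 2, ..., 7]
```

With the physical page address commands of an Open-Channel SSD, the host can issue such a mapping directly; a conventional SSD would hide it behind its own flash translation layer.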

7. Fault-tolerance-aware strategy
Question 2: how to configure the error correction engine to provide more hardware resources for the deep learning accelerator in the area- and power-limited SSD controller chip?
• Deep learning is fault-tolerant: there is no accuracy loss when bit errors occur only in the low-order bit positions of a value (the LSB side of bits 7..0).
• The fault-tolerance-aware strategy changes the protection region accordingly. In the D-ECC encoder and decoder, only the high-order bits pass through the BCH(1101, 1024, 15) code; the low-order bits bypass ECC through a low-bit buffer on both the write path to flash and the read path back.
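The protection-region split can be sketched as follows. This is a toy model: a single parity bit stands in for the real BCH(1101, 1024, 15) code purely to show the datapath split, and the 4-bit/4-bit division of each byte is an assumption for illustration.

```python
# Sketch of the D-ECC split: only the high-order bits of each value
# go through the ECC path; the low-order bits bypass it through a
# low-bit buffer. A parity bit stands in for the real BCH code.

PROTECTED_BITS = 4  # assumed: protect bits 7..4, bits 3..0 unprotected

def split_for_decc(byte):
    high = byte >> (8 - PROTECTED_BITS)             # to the ECC encoder
    low = byte & ((1 << (8 - PROTECTED_BITS)) - 1)  # to the low-bit buffer
    return high, low

def parity(bits):
    """Stand-in for the BCH(1101, 1024, 15) encoder."""
    return bin(bits).count("1") & 1

def encode(byte):
    high, low = split_for_decc(byte)
    return high, parity(high), low   # check bits cover high bits only

def recombine(high, low):
    return (high << (8 - PROTECTED_BITS)) | low

h, p, l = encode(0b10100101)
assert recombine(h, l) == 0b10100101
```

Because only the protected region needs encoder/decoder logic and check-bit storage, shrinking it frees controller area and power for the DLA, which matches the FMC+D-ECC resource savings reported in the experiment setup.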

8. Experiment Setup

Baselines:
  CPU:        Intel Xeon E5-2630 v4 @ 2.20GHz
  CPU+FPGA-1: Intel Xeon E5-2630 v4 @ 2.20GHz + Zynq ZC706
  CPU+FPGA-2: Intel Xeon E5-2630 v4 @ 2.20GHz + Zynq ZC706
  CPU+GPU:    Intel Xeon E5-2630 v4 @ 2.20GHz + NVIDIA GTX 1080 Ti

OpenSSD platform:
  1. Zynq FPGA chip (hosts the InS-DLA and the flash controller)
  2. Dual Cortex-A9 (runs the firmware)
  3. 1 GB DRAM
  4. 8-channel NAND flash
  5. PCIe Gen 2 (maximum of 8 lanes)

FPGA resource usage:
  Module          LUT               FF
  FMC+ECC         102010            76712
  FMC+D-ECC       47858 (-53%)      66880 (-12%)
  NVMe Interface  8585              11456
  DLA             93232             21929
  Total           149675 (68.46%)   100265 (22.93%)

9. Experiment Result
[Bar charts: energy (J) and energy efficiency (GOPS/W) for CPU, CPU+GPU, CPU+FPGA-1, CPU+FPGA-2, InS-DLA+ECC (sim), and InS-DLA+ECC (FPGA) on AlexNet, SqueezeNet, ResNet-18, and GoogleNet.]
• InS-DLA, on both the simulator and the FPGA prototype, outperforms all four baselines in terms of energy efficiency.

10. Experiment Result

  Command strategy                       Latency (cycles)   Improvement
  General read                           60330240           1x (baseline)
  Read cache                             31809810           47.27%
  General read + flash-aware layout      8541280            85.84%
  Read cache + flash-aware layout        3976226            93.41%

[Bar charts: performance (GOPS/W) and energy (J) for InS-DLA+ECC vs. InS-DLA+D-ECC, each on simulator and FPGA, for AlexNet (AN), SqueezeNet (SQ), ResNet-18 (RN), and GoogleNet (GN).]
• The flash-aware data layout combined with the read cache command improves throughput.
• The D-ECC reduces the energy cost by 34% on the simulator and 30% on the FPGA prototype compared to the ECC hardware.

11. More details are shown in the poster; we look forward to seeing you there. Thank you for your attention.
