CRESITT EVENT — Embedded AI and Upstream Research at CEA
Presentation for CRESITT | October 17th, 2019
Sandrine Varenne, David Briand — CEA LIST
sandrine.varenne@cea.fr
EMBEDDED AI AND UPSTREAM RESEARCH — OUTLINE
1. The work of the CEA DRT (Direction de la Recherche Technologique) on artificial intelligence
2. Overview of our activities in embedded AI
3. Focus on our N2D2 tools and our hardware accelerators (PNeuro, DNeuro…)
4. Conclusion
CEA TECH & Artificial Intelligence
[Diagram: CEA know-how across the AI stack, from data to systems]
• DATA: text & audio, semantics, video, images, other signals…
• Algorithms: data analytics, machine-learning know-how, algorithm/software adequation, certification & verification, tools…
• Architecture: software and hardware architecture, IC conception, NVM, communication, 3D integration, smart systems…
CEA CONFIDENTIEL
CEA TECH & Artificial Intelligence — addressing the embedded challenges
• Training: labeled databases + machine-learning algorithm → DNN model; days to weeks on a multi-GPU server (e.g. Nvidia DGX-1, 8 Tesla P100) until the target accuracy is reached (topology, training set, parameters…)
• Prediction: new data → trained DNN model → prediction ("A car")
• Low-latency inference on embedded targets (TPU, FPGA, GPU, PNeuro…)
KNOW-HOW OF CEA IN DEEP LEARNING & EMBEDDED AI
• Experience with off-the-shelf framework elements
• Code-generation modules for CPU, many-core CPU, GPU, FPGA and dedicated hardware: optimized C, CUDA/CuDNN, C++/OpenMP, HLS/OpenCL, TensorRT
• Hardware IP libraries: PNeuro, DNeuro, Spiking, Spiking + NVM
N2D2 — A European Platform to Address Embedded Systems' Challenges
N2D2 has been entirely developed by CEA.
• Database handling and data-preprocessing help
  • Data conditioning
  • Semi-automatic data labelling
• Standalone code generation for:
  • COTS* components (CPU, GPU, FPGA)
  • Specific hardware targets (ST, Kalray, Renesas…)
  • NN hardware accelerators based on CEA IP (spike coding) — well adapted for embedded AI
• Decision help for the implementation phase:
  • Hardware cost & form factor
  • Power consumption
  • Latency
* COTS: Commercial Off-The-Shelf components
Context / Motivations
• Deep Neural Networks (DNNs) are very successful in the vast majority of classification/recognition benchmarks… on high-end clusters of multi-250W GPUs
[Figure: Top-1 ImageNet accuracy (%) vs. network complexity (MMACs), accuracies from ~45% to ~85% over complexities from 10 to 100,000 MMACs]
• Embedding low-power DNNs remains challenging:
  • Must adapt and simplify DNN topologies
  • Reduce layer complexity (number of operations)
  • Reduce precision (8-bit integer or less)
• Today's general-purpose COTS are inefficient for DNNs:
  • Number of cores too low
  • Computing cores too complex (floating-point computation)
  • Low MAC/cycle efficiency
  • Insufficient memory
Balancing speed/power against applicative performance is a major challenge.
→ Need for a framework to automate DNN shrinking exploration and evaluation, performance projection, and porting to embedded platforms
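The "reduce precision" point above can be illustrated with a minimal NumPy sketch — not N2D2 code, all names are illustrative — of symmetric, per-tensor 8-bit weight quantization:

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Map float weights to signed 8-bit integers with a per-tensor scale."""
    scale = np.max(np.abs(w)) / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer representation."""
    return q.astype(np.float32) * scale

w = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, s = quantize_weights_int8(w)  # q = [-127, 0, 64, 127]
```

The reconstruction error is bounded by half the quantization step, which is why 8-bit inference usually needs the calibration step described later in this deck rather than naive rounding alone.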
Deep Learning for Embedded Computing
N2D2: DNN design framework
• Unified modeling and NN exploration tool
• Custom application building & optimization (CNN, Faster-RCNN…)
• Optimized embedded code generation
• Hardware mapping & benchmarking (CPUs, GPUs, FPGAs, ASIPs)
• N2D2 is available at https://github.com/CEA-LIST/N2D2/
Hardware acceleration:
• PNeuro — embedded programmable ASIC processor for neural computing
  • Clustered 8-bit SIMD architecture
  • Designed for DNN processing chains and image processing
  • Published at DATE 2018
• DNeuro — dataflow FPGA IP (FPGA hardware acceleration)
  • Optimized RTL DNN layer kernels
  • Automatic RTL generation through N2D2
  • Dataflow computation, designed to use the DSPs available on the FPGA
N2D2: DNN Design Environment
• A unique platform for the design and exploration of DNN applications
• Considered criteria: accuracy (approximate computing…), memory need, computational complexity
• COTS targets: many-core CPUs (MPPA, P2012, ARM…), GPUs, FPGAs
• SW DNN libraries: OpenCL, OpenMP, CuDNN, CUDA, TensorRT
• HW accelerators: PNeuro, ASMP — HW DNN libraries: PNeuro, DNeuro, C/HLS
• Flow: data conditioning → modeling → learning (learning & test databases) → test → optimization (trained DNN) → code generation → code execution
N2D2: Data Augmentation, Conditioning and Analysis
• N2D2 integrates the building of data-processing and analysis dataflows
• Genericity: processes images and sound; 1D, 2D or 3D data
• Associates a label with each data point (1D or 2D labels)
• Supports arbitrary label shapes (circular, rectangular, polygonal or pixel-wise defined)
• Applies transformations to data, pixel-wise labels and geometrical labels
  • Basic operations: rescaling, flipping, normalization, affine, filtering, DFT…
  • Advanced operations: elastic distortion, random slice/label extraction, morphological reconstructions…
[Diagram: transformation module — per-channel Extract → Slice → Rescale → Affine chains feeding the DL core / spike coding, with STATS-based normalization (Op−=STATS.mean, Op/=STATS.stdDev) computed on the learn set — and data analysis module (cumulative min/max/mean, number of data, value histograms, geometric and pixel-wise annotation data) over the learn, validation and test sets]
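As an illustration of the STATS-driven normalization chain above (Op−=STATS.mean followed by Op/=STATS.stdDev), here is a minimal NumPy sketch — not N2D2 code — where the per-channel statistics are computed on the learn set only and then applied unchanged to every split:

```python
import numpy as np

def fit_stats(learn_set: np.ndarray):
    """Per-channel mean/stdDev over a (N, H, W, C) learn set, like the STATS modules."""
    mean = learn_set.mean(axis=(0, 1, 2))
    std = learn_set.std(axis=(0, 1, 2))
    return mean, std

def apply_normalization(data: np.ndarray, mean, std) -> np.ndarray:
    """Op -= STATS.mean, then Op /= STATS.stdDev."""
    return (data - mean) / std

# Fake learn set: 100 images of 24x24 pixels, 3 channels
learn = np.random.default_rng(0).normal(5.0, 2.0, size=(100, 24, 24, 3))
mean, std = fit_stats(learn)
norm = apply_normalization(learn, mean, std)
```

Computing the statistics on the learn set alone (and reusing them for validation and test) avoids leaking information across splits, which is what the diagram's separate STATS path expresses.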
N2D2: Typical Outputs
• Layer-wise detailed memory and computing requirements
• Dataflow visualization
• Results visualization: pixel-wise segmentation; ROI bounding-box extraction and classification
• Layer-wise weights and kernels visualization, distribution and data-range analysis
• Layer-wise output visualization and data-range analysis
• Pixel-wise and object-wise confusion-matrix reporting

N2D2 INI network description file (reconstructed from the slide's two-column layout):

; Database
[database]
Type=MNIST_IDX_Database
Validation=0.2

; Environment
[env]
SizeX=24
SizeY=24
BatchSize=128

[env.Transformation]
Type=PadCropTransformation
Width=[env]SizeX
Height=[env]SizeY

[env.OnTheFlyTransformation]
Type=DistortionTransformation
ApplyTo=LearnOnly
ElasticGaussianSize=21
ElasticSigma=6.0
ElasticScaling=36.0
Scaling=10.0
Rotation=10.0

; First layer (convolutional)
[conv1]
Input=env
Type=Conv
KernelWidth=5
KernelHeight=5
NbChannels=6
Stride=2
ConfigSection=common.config

; Second layer (convolutional)
[conv2]
Input=conv1
Type=Conv
KernelWidth=5
KernelHeight=5
NbChannels=12
Stride=2
ConfigSection=common.config

; Third layer (fully connected)
[fc1]
Input=conv2
Type=Fc
NbOutputs=100
ConfigSection=common.config

; Output layer (fully connected)
[fc2]
Input=fc1
Type=Fc
NbOutputs=10
ConfigSection=common.config

; Softmax layer
[soft]
Input=fc2
Type=Softmax
NbOutputs=10
WithLoss=1
ConfigSection=common.config

; Common solvers config
[common.config]
WeightsSolver.LearningRate=0.05
WeightsSolver.Decay=0.0005
Solvers.LearningRatePolicy=StepDecay
Solvers.LearningRateStepSize=[sp]_EpochSize
Solvers.LearningRateDecay=0.993
N2D2: DNN Complexity Analysis
• Identifies layers with high weights memory, high in/out buffer memory and high computation
• Reports both absolute and relative metrics
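The kind of per-layer metrics behind this analysis can be sketched as follows — illustrative formulas only (assuming square kernels and no padding), not the actual N2D2 implementation:

```python
def conv_metrics(in_c, out_c, k, in_h, in_w, stride=1):
    """Weights count, MAC count and output buffer size for a Conv layer."""
    out_h = (in_h - k) // stride + 1
    out_w = (in_w - k) // stride + 1
    weights = out_c * in_c * k * k   # parameters to store
    macs = weights * out_h * out_w   # one MAC per weight per output pixel
    out_buf = out_c * out_h * out_w  # in/out buffer memory (elements)
    return weights, macs, out_buf

def fc_metrics(in_n, out_n):
    """For a fully connected layer, MACs equal the weight count."""
    weights = in_n * out_n
    return weights, weights, out_n

# conv1 of the example network: 24x24 input, 6 output maps, 5x5 kernel, stride 2
w, m, o = conv_metrics(in_c=1, out_c=6, k=5, in_h=24, in_w=24, stride=2)
# -> 150 weights, 15,000 MACs, 600 output elements
```

Relative metrics then follow by dividing each layer's figure by the network total, which is how memory- or compute-dominated layers stand out.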
N2D2: Calibration for Integer Precision
• Weights clamping and/or normalization
• Quantization of each layer's output activation distribution
  • Histogram analysis and optimal quantization-threshold determination
  • Using the Kullback–Leibler divergence
Goal: automatic and guaranteed best result without retraining
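A simplified sketch of such KL-based threshold selection follows. It is illustrative, not N2D2's implementation: it assumes symmetric quantization into 128 bins, folds the clipped tail into the last retained bin, expands the quantized histogram uniformly, and ignores empty bins in the divergence.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) over normalized histograms, skipping empty bins (simplification)."""
    p = p / p.sum()
    q = q / q.sum()
    mask = (p > 0) & (q > 0)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def quantize_hist(ref, n_quant):
    """Merge len(ref) bins into n_quant groups, then expand back uniformly."""
    groups = np.array_split(np.arange(len(ref)), n_quant)
    q = np.empty(len(ref), dtype=np.float64)
    for idx in groups:
        q[idx] = ref[idx].sum() / len(idx)
    return q

def best_threshold(hist, bin_edges, n_quant=128):
    """Pick the clipping threshold whose quantized histogram best matches the original."""
    best_i, best_kl = len(hist), float("inf")
    for i in range(n_quant, len(hist) + 1):
        ref = hist[:i].astype(np.float64)
        ref[-1] += hist[i:].sum()  # fold the clipped tail into the last bin
        kl = kl_divergence(ref, quantize_hist(ref, n_quant))
        if kl < best_kl:
            best_i, best_kl = i, kl
    return bin_edges[best_i]

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(0.0, 1.0, size=20000))  # fake activation magnitudes
hist, edges = np.histogram(acts, bins=512)
t = best_threshold(hist, edges)
```

The idea is that clipping rare outliers sacrifices little of the distribution while shrinking the quantization step for the bulk of the activations, which is why a threshold below the observed maximum often minimizes the divergence.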
N2D2: Hardware Exports — a unified tool for multiple hardware targets
• GPU (NVidia): C++/CUDA/CuDNN/TensorRT — supports SSD and Faster-RCNN via TensorRT on Drive PX2
• Generic GPU: C++/OpenCL
• HLS FPGA (Intel, Xilinx): C/HLS
• DNeuro: dataflow configurable RTL library — optimized RTL
• PNeuro: DSP-like programmable SIMD processor — RTL/ASM
• MPPA: C++/OpenCL, KaNN API
• CPU x86 / ARM / DSP: C/OpenMP, C++/OpenCL
• R-Car: CNN-IP C API
• ASMP: C/OpenMP/CVA8
• NeuroSpike: generic spike, SystemC
Generic exports are not optimized for a specific product.