INSIDE NVIDIA'S AI INFRASTRUCTURE FOR SELF-DRIVING CARS (HINT: IT'S ALL ABOUT THE DATA)
CLEMENT FARABET | San Jose 2019
Self-driving cars require tremendously large datasets for training and testing.
NVIDIA DRIVE: SOFTWARE-DEFINED CAR
Powerful and efficient AI, CV, AR, HPC | Rich software development platform | Functional safety | Open platform | 370+ partners developing on DRIVE
- DRIVE IX: Trunk Opening, Eye Gaze, Distracted Driver, Drowsy Driver, Cyclist Alert
- DRIVE AR: Detect, Track, CG Lidar
- DRIVE AV: Surround Perception (Lanes, Signs, Lights), RADAR Localization, LIDAR Localization, Path Perception, Egomotion, Camera Localization, Path Planning
- DRIVE OS on DRIVE AGX XAVIER and DRIVE AGX PEGASUS
BUILDING AI FOR SDC IS HARD
Every neural net in our DRIVE Software stack needs to handle 1,000s of conditions and geolocations:
- Objects: Vehicles, Pedestrians, Bicycles, Animals, Hazards, Street Lamps
- Lighting: Day, Twilight, Night, Backlit
- Weather: Clear, Cloudy, Rain, Snow, Fog
WHAT TESTING SCALE ARE WE TALKING ABOUT? DATA AND INFERENCE TO GET THERE
We're on our way to 100s of PB of real test data (millions of real miles) plus billions of simulated miles, and 1,000s of DRIVE Constellation nodes for offline testing alone.
[Chart: target robustness per model (miles), test dataset size required (15PB to 180PB), and NVIDIA's ongoing data collection (miles), with real-time 24h test runs on 400, 1,600, and 3,200 DRIVE Pegasus nodes.]
SDC SCALE TODAY AT NVIDIA
- Data collection: 12-camera + Radar + Lidar rig mounted on 30 cars; 1PB+ of raw data collected per month
- Compute: 4,000 GPUs in cluster = 500 PFLOPS; 1PB of in-rack object cache per 72 GPUs; 30PB provisioned
- Inference: 100 DRIVE Pegasus in cluster (Constellations)
- Labeling: 1,500 labelers; 20M objects labeled per month; 50 labeling tasks
- Datasets and models: 15PB raw active training + test dataset; 20 unique models
Creating the right datasets is the cornerstone of machine learning.
TRADITIONAL SW DEVELOPMENT
Write initial code ⇨ Source Code ⇨ Compiler ⇨ Executable ⇨ Run, debug (logs, stdout, profiler) ⇨ Modify, add, delete, improve code ⇨ back to Source Code
ML-BASED SW DEVELOPMENT
Collect initial data ⇨ Dataset ⇨ Machine Learning Algorithms ⇨ Predictor ⇨ Run, debug (inference results, confidence estimates, characterization, etc.) ⇨ Modify, add, delete, improve data ⇨ back to Dataset
TRADITIONAL SOFTWARE | ML-BASED SOFTWARE
Source code | Data
Compiler | DL/ML algorithms
Executable | Predictor
ML-BASED SW DEVELOPMENT
Collect initial data ⇨ Dataset ⇨ Machine Learning Algorithms ⇨ Predictor ⇨ Run, debug ⇨ Modify, add, delete, improve data
Tremendous progress over the past 10 years on the algorithms side of this loop.
ML-BASED SW DEVELOPMENT
Collect initial data ⇨ Dataset ⇨ Machine Learning Algorithms ⇨ Predictor ⇨ Run, debug ⇨ Modify, add, delete, improve data
The data side of the loop is lagging behind; innovation required.
Active learning is a powerful paradigm for iteratively developing datasets, the analogue of developing and debugging traditional software.
ADD MORE RANDOM DATA... PLATEAU
Object detection performance: mAP as a function of epochs, for the base model (blue), random strategy (purple), and active strategy (orange).
ACTIVE LEARNING => GET OUT OF THE PLATEAU!
Object detection performance: mAP as a function of epochs, for the base model (blue), random strategy (purple), and active strategy (orange).
WHY? NOT ALL DATA IS CREATED EQUAL
Some samples are much more informative than others.
1. How do we find the most informative unlabeled data to build the right datasets the fastest?
2. How do we build training datasets that are 1/1000 the size for the same result?
HOW ACTIVE LEARNING WORKS
Collecting data ⇨ Training models ⇨ Model uncertainty ⇨ back to Collecting data (the uncertainty tells us what to collect and label next)
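In code, the cycle is short: train on what is labeled, score the unlabeled pool by the model's uncertainty, send the most uncertain samples to labeling, repeat. Below is a minimal, self-contained Python sketch of that loop using scikit-learn on synthetic 2-D data and predictive entropy as the uncertainty score; it illustrates the pattern only and is not MagLev code.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy pool: two Gaussian blobs standing in for frames sitting in the data lake.
X_pool = np.vstack([rng.normal(-1.0, 1.0, (5000, 2)), rng.normal(1.0, 1.0, (5000, 2))])
y_pool = np.array([0] * 5000 + [1] * 5000)          # oracle labels, i.e. "the labelers"

labeled = list(range(10)) + list(range(5000, 5010))  # small seed set, 10 examples per class
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

for round_idx in range(5):
    model = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    probs = model.predict_proba(X_pool[unlabeled])
    # Predictive entropy: high entropy = model is unsure = likely informative sample.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    pick = np.argsort(-entropy)[:50]                 # labeling budget for this round
    newly = [unlabeled[i] for i in pick]
    labeled += newly                                 # "label" them and grow the training set
    unlabeled = [i for i in unlabeled if i not in set(newly)]
    print(round_idx, "accuracy on full pool:", model.score(X_pool, y_pool))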
ACTIVE LEARNING NEEDS UNCERTAINTY
Bayesian Deep Networks (BNNs) are the principled way to model uncertainty. However, they are computationally demanding:
- Training: intractable without approximations.
- Testing: estimating the predictive distribution needs ~100 forward passes (varying the model).
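Concretely, the predictive distribution is usually estimated by Monte Carlo: run many forward passes with the model perturbed each time (sampled weights, dropout masks, or ensemble members) and aggregate the outputs. A small numpy sketch of that aggregation is below; stochastic_forward is a hypothetical stand-in for one perturbed forward pass returning softmax probabilities.

import numpy as np

def mc_predictive(stochastic_forward, x, passes=100):
    # stochastic_forward(x) is a hypothetical placeholder: one forward pass with the
    # model varied (sampled weights, a dropout mask, or an ensemble member).
    samples = np.stack([stochastic_forward(x) for _ in range(passes)])  # (passes, num_classes)
    mean = samples.mean(axis=0)                       # Monte Carlo predictive distribution
    entropy = -np.sum(mean * np.log(mean + 1e-12))    # uncertainty of the averaged prediction
    return mean, entropy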
OUR ACTIVE LEARNING APPROACH: AN APPROXIMATION TO BNNs
We proposed an approximation to BNNs that trains a network using ensembles:
- Samples from the same distribution as the training set will have consensus across the ensemble, while other samples will not.
- We regularize the weights in the ensemble to approximate probability distributions.
[Chitta, Alvarez, Lesnikowski], Deep Probabilistic Ensembles: Approximate … (NeurIPS 2018 Workshop on Bayesian Deep Learning)
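As a rough illustration of the idea (not the exact formulation from the paper): train E copies of the network jointly and add a penalty that keeps the spread of each parameter across the ensemble close to a simple prior, so the members behave like samples from a weight distribution. In the PyTorch sketch below the unit-Gaussian prior and the beta weight are illustrative assumptions.

import torch
import torch.nn as nn

def ensemble_regularizer(members):
    # Treat the E copies of each parameter as samples from a Gaussian and penalize
    # divergence from a standard-normal prior (an illustrative choice of prior).
    reg = 0.0
    for params in zip(*[m.parameters() for m in members]):
        w = torch.stack(params)                           # (E, ...) same tensor across members
        mean, var = w.mean(dim=0), w.var(dim=0) + 1e-8
        reg = reg + 0.5 * torch.sum(var + mean ** 2 - torch.log(var) - 1.0)  # KL(N(mean,var) || N(0,1))
    return reg

def ensemble_loss(members, x, y, beta=1e-4):
    # Average task loss over the ensemble plus the weight-space regularizer.
    ce = nn.CrossEntropyLoss()
    task = sum(ce(m(x), y) for m in members) / len(members)
    return task + beta * ensemble_regularizer(members)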
OUR ACTIVE LEARNING RESULTS
Quantitative results on CIFAR-10: competitive results using ~1/4 of the training data.
[Chitta, Alvarez, Lesnikowski], Deep Probabilistic Ensembles: Approximate … (NeurIPS 2018 Workshop on Bayesian Deep Learning)
OUR ACTIVE LEARNING RESULTS
The approach also applies to more challenging problems like semantic segmentation.
[Chitta, Alvarez, Lesnikowski], Deep Probabilistic Ensembles: Approximate … (NeurIPS 2018 Workshop on Bayesian Deep Learning)
Getting active learning to scale to the SDC problem is a massive challenge! But it is also necessary: labeling cost, data collection and storage cost, training cost.
Project MagLev
NVIDIA's internal production-grade ML infrastructure
MAGLEV
Goal: enable the full iterative ML development cycle (e.g. active learning) at the scale of self-driving car data.
- PB-scale data management
- PB-scale AI training
- PB-scale AI-based data selection/mining
- PB-scale AI testing + debugging [inference]
MAGLEV COMPONENTS
UI/UX/CLI: dashboard for the MagLev experience, visualizing results, spinning up notebooks, sharing pipelines, data exploration/browsing.
- Datasets: "Storing, tracking and versioning datasets"
- Workflows: "API and infra to describe and run workflows, manually or programmatically"
- Experiments: "Track and view all results from DL/ML experiments, from models to metrics"
- Apps: "Python building blocks to rapidly describe DL/ML apps, access data, produce metrics"
Infra/services: results saving, workflow management, read/stream/write data, artifacts and volumes for DL/ML apps, metrics traceability, workflow traceability, off-the-shelf models, data traceability, results analysis, ML pipelines, generic vertical ML operators (AV/Medical/…), data representation, hosted notebooks, persistence/resuming, ML data querying (Presto / Spark / Parquet), HyperOpt parameter tracking and sampling, pruning, exporting, testing.
WORKFLOWS IN MAGLEV
Workflow = directed graph of jobs. Each job is described by its inputs and outputs: datasets and models. Datasets and models are 1st-class citizens, tracked and versioned as assets.
Example:
- Job #1 (1x 8-GPU node): classify Street Scene Dataset #34 with Face Detector Model #13, filtering for images that contain a face ⇨ Street Scene Dataset with people #1
- Job #2 (4x 8-GPU nodes, hyper-opt): train pedestrian detector ⇨ Pedestrian Models #1 through #5
- Job #3 (1x 8-GPU node): select best model, prune and fine-tune ⇨ Pedestrian Model #1
- Job #4 (1x Xavier node): export Pedestrian Model #1 to TRT for Jetson/Xavier
WORKFLOWS IN MAGLEV
Step 1: define the workflow as a list of steps in a YAML file.
Step 2: execute the workflow:
maglev run //dlav/common:workflow -- -f my.yaml -e saturnv -r <results dir>
EXAMPLE WORKFLOW: FIND THE BEST MODEL
Improving DNNs through massively parallel experimentation. Experiments are run in parallel as part of a predefined workflow:
Define workflow ⇨ sample random hyper-parameters ⇨ run parallel experiments (Model 1 … Model 50) ⇨ evaluate ⇨ pick the best model ⇨ prune ⇨ re-train ⇨ define a new experiment set / parameters.
Example: run the 50 jobs in parallel, use 8 GPUs per job; total time: 1 day; output: optimal hyper-parameters.
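The driver logic behind such a workflow is plain random search: sample N hyper-parameter sets, fan the training jobs out in parallel, and keep the best model by its validation metric. The Python sketch below runs locally with a dummy metric in place of a real 8-GPU training job, just to show the shape of the loop; it is not the MagLev workflow API.

import random
from concurrent.futures import ProcessPoolExecutor

def sample_config():
    # One random hyper-parameter set per experiment.
    return {"lr": 10 ** random.uniform(-5, -2),
            "batch_size": random.choice([64, 128, 256]),
            "weight_decay": 10 ** random.uniform(-6, -3)}

def train_and_eval(config):
    # Stand-in for one 8-GPU training job; in the real workflow this step would be
    # submitted to the cluster and would return a validation metric for the model.
    score = -abs(config["lr"] - 1e-3)            # dummy metric so the sketch runs end to end
    return score, config

if __name__ == "__main__":
    configs = [sample_config() for _ in range(50)]        # 50 experiments
    with ProcessPoolExecutor() as pool:                   # run them in parallel
        results = list(pool.map(train_and_eval, configs))
    best_score, best_config = max(results, key=lambda r: r[0])   # pick the best model
    print(best_score, best_config)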
MAGLEV SERVICES
- Runs on Kubernetes
- Hybrid deployment: (1) service cluster on AWS, (2) compute cluster at NVIDIA (SaturnV)
- Multi-node training via MPI over Kubernetes
- Dataset management and versioning
- Workflow engine, based on Argo
- Experiments management and versioning
- Leverages NVIDIA TensorRT for inference
- Leverages NVIDIA GPU Cloud containers for pre-built DL/ML containers
MagLev + DRIVE Data Factory
End-to-end infrastructure to support AI development for DRIVE
MAGLEV + DRIVE DATA FACTORY
"Collect ⇨ Select ⇨ Label ⇨ Train ⇨ Test" as programmatic, reproducible workflows. Enables end-to-end AI dev for SDC, with labeling in the loop!
- Ingest: 1PB per week into the data lake (15PB today)
- Select: actively selected datasets feed the labeling UI; 1,500 labelers produce 20M labeled objects per month
- Train: 20 models actively developed by a large AI dev team; trained models, metrics and logs surfaced in the ML/metrics UI
- Test: data selection, training and testing jobs (#1 … #N) run as multi-step workflows (workflow = sequence of map jobs) on the 4,000-GPU cluster (SaturnV), with the service layer on AWS
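Stripped to its skeleton, one iteration of that factory is a fixed sequence of steps whose inputs and outputs (datasets, models, metrics) are tracked so the run is reproducible. The Python sketch below shows the shape of one cycle; every function named here is a hypothetical placeholder, not the MagLev API.

def weekly_cycle(data_lake, model):
    # Each step below stands for a tracked, versioned job in the pipeline (names are hypothetical).
    frames = select_informative(data_lake, model)      # active data selection from the lake
    dataset = send_to_labeling(frames)                 # labelers return a new labeled dataset
    model = train(model, dataset)                      # retrain on the grown dataset
    metrics = test(model, holdout_suites())            # per-condition test suites and metrics
    return model, metrics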