Project MagLev: NVIDIA’s production-grade AI Platform
Divya Vavili, Yehia Khoja - Mar 21, 2019
Agenda
● AI inside of NVIDIA
● Constraints and scale
● AI Platform needs
● Technical solutions
● Scenario walkthrough
● MagLev architecture evolution
AI inside of NVIDIA
Deep Learning is fueling all areas of business:
● Self-Driving Cars
● Robotics
● Healthcare
● AI Cities
● Retail
● AI for Public Good
Constraints and scale
SDC scale today at NVIDIA
Constraints and scale
What are our requirements?
● Safety
● Tons of data!
● Inference on edge
● Reproducibility
What testing scale are we talking about?
Data and inference to get there: we are on our way to 100s of PB of real test data (= millions of real miles), plus 1,000s of DRIVE Constellation nodes for offline testing alone, and billions of simulated miles.
[Chart: data volume vs. DRIVE Pegasus nodes needed for real-time test runs in 24h; ticks at 15PB, 30PB, 60PB, 120PB (active testing to date, in miles), and 180PB (NVIDIA's data collection, in miles); node counts of 400, 1,600, and 3,200; target robustness shown as a 24h test.]
The need for an AI platform
An end-to-end solution for industry-grade AI development: enable the development of AV Perception, fully tested across 1000s of conditions, and yielding failure rates < 1 in N miles, for large N.
● Scalable AI Training
● PB-Scale AI Testing
● AI-based Data Selection/Mining
● Traceability: model => code + data
● Seamless PB-Scale Data Access
● Workflow Automation
1PB per month
The need for an AI platform
Enabling automation of training and testing workflows
Pipeline stages: Data Indexing -> Data Selection -> Data Labeling -> Model Training -> Model Testing -> TransferPilot
● Data Factory / NVOrigin: data indexing, on-demand transcoding
● Dataset Store: training & testing datasets; LabelStore: labels, tags, etc. (labeled datasets)
● Automated Workflows: Training Workflows (data preproc, DNN training, pruning, export, fine-tuning) and Testing Workflows (nightly tests, re-simulation, etc.)
● Model Store: trained models, producing tested DNN models
● All backed by SaturnV Storage
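The diagram reduces to a chain of automated stages, each consuming the previous stage's outputs. A minimal sketch of that idea follows; the Workflow class and the stage bodies are hypothetical placeholders, not the MagLev API.

    # Minimal sketch of chaining automated pipeline stages; illustrative only.
    from typing import Callable, Dict, List, Tuple

    class Workflow:
        def __init__(self) -> None:
            self.steps: List[Tuple[str, Callable[[Dict], Dict]]] = []

        def add_step(self, name: str, fn: Callable[[Dict], Dict]) -> "Workflow":
            self.steps.append((name, fn))
            return self

        def run(self, context: Dict) -> Dict:
            for name, fn in self.steps:
                print(f"running {name}")   # a real platform would log and trace this
                context = fn(context)
            return context

    # Stages mirror the diagram: indexing -> selection -> labeling -> training -> testing.
    pipeline = (
        Workflow()
        .add_step("data_indexing",  lambda ctx: {**ctx, "index": "built"})
        .add_step("data_selection", lambda ctx: {**ctx, "dataset": "selected"})
        .add_step("data_labeling",  lambda ctx: {**ctx, "labels": "attached"})
        .add_step("model_training", lambda ctx: {**ctx, "model": "trained"})
        .add_step("model_testing",  lambda ctx: {**ctx, "report": "nightly"})
    )
    pipeline.run({})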
So how did we solve for this?
Technical solution(s)
Safety
● Non-compromisable primary objective for the passengers; all other engineering requirements stem from this
● Models tested on huge datasets to be confident
● Faster iteration that aids in producing extremely good and well-tested models
● Reproducibility/Traceability
Technical solution(s)
Tons of data!
● Collecting enormous amounts of data under innumerable scenarios is key to building good AV models
● Now that we have the data, what next?
  - How do engineers access this data?
  - How do you make sure that the data can be preprocessed for each team's needs, and is not corrupted by other members of the team or across teams?
  - Lifecycle management of data
Technical solution(s)
Tons of data!
What is the solution? vdisk
● Virtualized, immutable file-system
● Offers broad platform support
● Structured to support data deduplication
● Inherently supports caching
● Provides Kubernetes integration, making it cloud-native
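vdisk itself is internal to NVIDIA, but the deduplication idea can be sketched in a few lines: if a volume's metadata maps logical paths to content hashes, identical files collapse to a single stored blob and cached blocks can be shared. The manifest format below is an assumption for illustration, not the vdisk layout.

    # Hedged sketch of content-addressed dataset metadata; shows the principle
    # behind deduplication and immutability, not the real vdisk format.
    import hashlib
    import json
    from pathlib import Path

    def build_manifest(dataset_dir: str) -> dict:
        """Map each logical path to the SHA-256 of its contents."""
        manifest = {}
        for path in sorted(Path(dataset_dir).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                # Identical files share a digest, so their bytes are stored only once.
                manifest[str(path.relative_to(dataset_dir))] = digest
        return manifest

    if __name__ == "__main__":
        print(json.dumps(build_manifest("./my-dataset"), indent=2))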
Technical solution(s)
Inference on edge
● AV model inference is limited in terms of hardware capabilities
● So, finding a lighter model without losing performance is prudent, and takes multiple, faster iterations
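As one concrete, hedged example of "finding a lighter model": stock PyTorch utilities can prune and quantize a network. This is not NVIDIA's internal DRIVE deployment flow, just an illustration of the kind of iteration involved.

    # Illustrative only: prune 30% of the smallest weights, then quantize to int8,
    # using standard PyTorch APIs rather than NVIDIA's internal tooling.
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")   # bake the sparsity into the weights

    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
    print(quantized)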
Technical solution(s)
Reproducibility
Why?
● Being able to run a 10-year-old workflow and get the same results
● Faster iteration of model development
● Understanding why a model behaved a certain way
Requires:
● Proper version control of datasets, models, and the experiments
● Reproducibility and traceability go hand in hand
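What that version control has to record can be made concrete with a small sketch: a model is only reproducible if the exact code commit, dataset version, and hyperparameters are captured together. The field names below are assumptions for illustration, not the MagLev schema.

    # Hypothetical experiment record tying a trained model back to its code and data.
    from dataclasses import dataclass, asdict
    import json
    import time

    @dataclass(frozen=True)
    class ExperimentRecord:
        experiment_id: str
        code_commit: str        # git SHA of the training code
        dataset_version: str    # immutable volume version
        model_uri: str          # where the trained weights were stored
        hyperparameters: dict
        created_at: float

    record = ExperimentRecord(
        experiment_id="exp-0042",                               # placeholder id
        code_commit="9f2c1ab",                                  # placeholder SHA
        dataset_version="449c8efa-eaef-4d9b-81b9-3a59fe269e9b",
        model_uri="s3://example-models/exp-0042/weights.pt",    # placeholder URI
        hyperparameters={"lr": 1e-4, "epochs": 20},
        created_at=time.time(),
    )
    print(json.dumps(asdict(record), indent=2))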
MagLev Scenario walkthrough
Predicting 12-month mortgage delinquency using Fannie Mae single-family home loan data
Key points:
● Immutable dataset creation
● Specifying workflows and launching them
● End-to-end traceability
MagLev Scenario walkthrough
Creating an immutable dataset

    >> maglev volumes create --name <my-volume> --path </some/local/directory/path> [--resume-version <version>]
    Creating volume: Volume(name = my-volume, version = 449c8efa-eaef-4d9b-81b9-3a59fe269e9b)
    Uploading '<local-file>'...
    …
    Successfully created new volume. Volume(name = my-volume, version = 449c8efa-eaef-4d9b-81b9-3a59fe269e9b)

● Creates an ISO image
● The ISO image only contains the metadata for the dataset, while the actual data resides in S3
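The read path implied by that split can also be sketched: the metadata in the ISO is enough to resolve a logical path to a content hash, and the bytes are fetched from S3 on demand. The bucket name, key layout, and manifest shape below are assumptions for illustration only.

    # Hedged sketch of reading from a metadata-only volume; assumes a manifest
    # mapping logical paths to content hashes, and a hypothetical S3 layout.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-dataset-blobs"   # placeholder, not a real MagLev bucket

    def read_file(manifest: dict, logical_path: str) -> bytes:
        """Resolve the path via the metadata, then pull the blob from S3."""
        digest = manifest[logical_path]
        obj = s3.get_object(Bucket=BUCKET, Key=f"blobs/{digest}")
        return obj["Body"].read()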
[MagLev Scenario walkthrough: image-only slides]
MagLev Architecture Evolution
Version 1 - Technical viability
● Compute and data on public cloud; mostly for technical evaluation
● Costs skyrocketing
● Poor performance: clash between functionality and efficiency
Early decisions:
● Cloud-native platform
● General-purpose services/ETL pipelines hosted on public cloud allow us to elastically scale based on requirements
MagLev Architecture Evolution
Version 2 - Minimize costs
● Compute on internal data center for GPU workloads
● Minimize costs
● Take advantage of innovation on GPUs before it hits the market
● Huge compute cluster that is always kept busy by the training/testing workflows
What needed to improve:
● Performance, due to lack of data locality
MagLev Architecture Evolution
Version 3 - High performance
● Internal data center specialized for both compute and data performance
● High performance due to data locality
● Better UX for data scientists
● Programmatically create workflows (see the sketch below)
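"Programmatically create workflows" can be pictured as a small client call from a notebook rather than hand-edited config. The client class, endpoint, image name, and arguments below are hypothetical placeholders, not the actual MagLev SDK.

    # Hypothetical sketch only; MaglevClient and its methods are invented here
    # to show the shape of programmatic workflow submission.
    class MaglevClient:
        def __init__(self, endpoint: str) -> None:
            self.endpoint = endpoint

        def submit_training(self, dataset_version: str, image: str, gpus: int) -> str:
            # A real client would call the workflow service's API; this stub
            # just returns a deterministic-looking job id.
            return f"job-{abs(hash((dataset_version, image, gpus))) % 10_000}"

    client = MaglevClient("https://maglev.example.internal")     # placeholder endpoint
    job_id = client.submit_training(
        dataset_version="449c8efa-eaef-4d9b-81b9-3a59fe269e9b",  # volume from the walkthrough
        image="example/trainer:latest",                          # placeholder container image
        gpus=8,
    )
    print(job_id)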
MagLev Data Center Architecture
MagLev Service Architecture
● General service cluster on public cloud
  - Authentication
  - Volume management
  - Workflow traceability
  - Experiment/Model management
● Compute cluster on internal NGC cloud
● Both clusters are cloud-native, built on top of Kubernetes
Questions