Building On-prem GPU Training Infrastructure By Stephen Balaban - PowerPoint PPT Presentation

Building On-prem GPU Training Infrastructure By Stephen Balaban CEO, Lambda

Lambda Customers

About Me ● Started using CNNs for face recognition in 2012. ● First employee at Perceptio. We developed image recognition CNNs that ran locally on the iPhone. Acquired by Apple in 2015. ● Published in SPIE and NeurIPS.

Workshop Structure ● Audience survey ● Presentation w/ Q&A ● Q&A + Workshop

5 Stages of GPU Cloud Grief

It all starts with the Shock of an expensive AWS bill.

Stage 1 - Denial “This won’t happen again next month.”

Stage 2 - Anger “The bill doubled again!”

Stage 3 - Bargaining with your account manager.

Stage 4 - Depression “Spot instances and reserved instances aren’t enough, this is hopeless.”

Stage 5 - Acceptance “GPU cloud services are expensive. Managing hardware is scary.”

Hardware: A Quick Rundown 1. GPUs 2. CPUs 3. GPU-GPU Bandwidth & PCIe Topology

GPU Speed Comparisons Source: https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/

Performance / $ Source: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/

What to look for 1. Number of PCIe lanes. (Affects total bandwidth.) 2. NUMA Node Topology. (Affects GPU peering.) Source: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/
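To make the first point concrete, here is a minimal sketch of how lane count maps to theoretical one-way PCIe bandwidth. It assumes PCIe 3.0 by default (the slides do not state the generation), and the function name `pcie_bandwidth_gb_per_s` is made up for illustration. Measured throughput will be lower than these figures because of protocol overhead.

```python
# Theoretical one-way PCIe bandwidth from lane count and generation.
# Assumption: the slides do not name a PCIe generation; gen 3 is the default here.

# generation -> (transfer rate in GT/s per lane, line-encoding efficiency)
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gb_per_s(lanes: int, generation: int = 3) -> float:
    """Return the theoretical one-way bandwidth in GB/s for a PCIe link."""
    rate_gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    # One transfer carries one bit per lane; divide by 8 to convert bits to bytes.
    return lanes * rate_gt_per_s * efficiency / 8


if __name__ == "__main__":
    for lanes in (4, 8, 16):
        print(f"x{lanes}: {pcie_bandwidth_gb_per_s(lanes):.2f} GB/s")
```

A GPU that drops from a 16x to an 8x link halves this ceiling, which is why the lane budget of the CPU and any PCIe switches matters when packing many GPUs into one chassis.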

GPU Peering & PCIe Topology

PCIe Topology (Diagram: 16x PCIe links.)

Dual Root PCIe Topology (Diagram: two CPUs joined by a CPU-CPU interconnect; four PEX 8748 PCIe switches; GPUs 0-7. Arrow is 16x PCIe Connection.) Source: Lambda

Single Root PCIe Topology (Diagram: one CPU; two PEX 8796 PCIe switches; GPUs 0-7. Arrow is 16x PCIe Connection.) Source: Lambda
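The difference between dual root and single root layouts can be sketched as a tree lookup: two GPUs can stay on PCIe only if they share an ancestor (a switch or a CPU root complex). The wiring below is an assumption, since the slide diagrams did not survive extraction: it guesses two GPUs per PEX 8748 switch in the dual root case and four GPUs per PEX 8796 switch in the single root case. The switch labels (`PEX_A` and so on) and function names are hypothetical.

```python
# Hypothetical wiring: each entry maps a device to its upstream parent.
# Assumption: two GPUs per switch, two switches per CPU (dual root).
DUAL_ROOT = {
    "GPU0": "PEX_A", "GPU1": "PEX_A", "GPU2": "PEX_B", "GPU3": "PEX_B",
    "GPU4": "PEX_C", "GPU5": "PEX_C", "GPU6": "PEX_D", "GPU7": "PEX_D",
    "PEX_A": "CPU0", "PEX_B": "CPU0", "PEX_C": "CPU1", "PEX_D": "CPU1",
}

# Assumption: four GPUs per switch, both switches under one CPU (single root).
SINGLE_ROOT = {
    "GPU0": "PEX_A", "GPU1": "PEX_A", "GPU2": "PEX_A", "GPU3": "PEX_A",
    "GPU4": "PEX_B", "GPU5": "PEX_B", "GPU6": "PEX_B", "GPU7": "PEX_B",
    "PEX_A": "CPU0", "PEX_B": "CPU0",
}


def path_to_root(topology, node):
    """Return the list of devices from `node` up to its root complex."""
    path = [node]
    while path[-1] in topology:
        path.append(topology[path[-1]])
    return path


def lowest_common_ancestor(topology, a, b):
    """Return the nearest shared upstream device, or None if the two
    devices sit under different root complexes."""
    ancestors_of_a = path_to_root(topology, a)
    for node in path_to_root(topology, b):
        if node in ancestors_of_a:
            return node
    return None


if __name__ == "__main__":
    for name, topo in (("dual root", DUAL_ROOT), ("single root", SINGLE_ROOT)):
        shared = lowest_common_ancestor(topo, "GPU0", "GPU4")
        print(f"{name}: GPU0 and GPU4 share {shared}")
```

Under these assumptions, GPU0 and GPU4 have no shared ancestor in the dual root layout, so traffic between them would have to cross the CPU-CPU interconnect, while in the single root layout they meet at the one CPU. On a real machine, `nvidia-smi topo -m` reports the actual connectivity matrix.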

Cascaded PCIe Topology (Diagram: one CPU; two PEX 8796 PCIe switches; GPUs 0-7. Arrow is 16x PCIe Connection.) Source: Lambda

NVLink System Topology (Diagram: two CPUs joined by a CPU-CPU interconnect; four PEX 8748 PCIe switches; GPUs 0-7. Open Circle is CPU-CPU Comm. Green Double Arrow is NVLink. Arrow is 16x PCIe Connection.) Source: Lambda

Real Life Examples

Source: ASUS

Single Root Complex vs Dual Root Complex ● Single Root Complex (4029GP-TRT2) ● Dual Root Complex (4028GR-TRT) Source: Supermicro

1080 Ti GPUDirect Peer-to-Peer Bandwidth Benchmark (Diagram: 16x PCIe links.) Source: Lambda

No Peering on the new 2080 Ti. Topology used in this experiment. (For the 1080 Ti, no NVLink.) Source: Lambda

Lambda Stack = GPU-enabled Frameworks
For Ubuntu 16.04 or 18.04. One command:
    LAMBDA_REPO=$(mktemp) && \
    wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
    sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
    sudo apt-get update && sudo apt-get install -y lambda-stack-cuda
Also comes as a Docker Container.
Source: https://lambdalabs.com/lambda-stack-deep-learning-software

Cost Comparison: On-prem vs. Cloud (p3dn.24xlarge Instance) ● Lambda Hyperplane: $109,008 once. (Add $15,000 / year if you want to co-locate instead.) ● AWS: $160,308/year with reserved pricing.

Cost Comparison: On-prem vs. Cloud (p3.16xlarge Instance) ● Lambda Blade: $28,389 once. (Add $15,000 / year if you want to co-locate instead.) ● AWS: $139,371/year with reserved pricing.

Cost Comparison: On-prem vs. Cloud (p3.8xlarge Instance) ● Lambda Quad: $12,472 once. ● AWS: $69,729/year with reserved pricing.
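The three cost comparisons can be turned into a rough break-even estimate: how many months of reserved cloud spend equal the one-time hardware price. This is a minimal sketch using only the prices quoted on the slides. The function name `break_even_months` is hypothetical, and the model assumes constant usage and ignores power, staffing, and hardware resale value, none of which the slides quantify.

```python
# Rough break-even estimate from the slide prices (USD).
# Assumptions: constant utilization; power, staff, and depreciation are ignored.

# name -> (one-time on-prem price, AWS reserved price per year)
COMPARISONS = {
    "Lambda Hyperplane vs p3dn.24xlarge": (109_008, 160_308),
    "Lambda Blade vs p3.16xlarge": (28_389, 139_371),
    "Lambda Quad vs p3.8xlarge": (12_472, 69_729),
}


def break_even_months(on_prem_once, cloud_per_year, colocation_per_year=0.0):
    """Months until cumulative cloud spend matches the on-prem outlay.

    An optional yearly co-location fee reduces the monthly saving.
    """
    monthly_saving = (cloud_per_year - colocation_per_year) / 12
    if monthly_saving <= 0:
        raise ValueError("cloud is not more expensive than on-prem running costs")
    return on_prem_once / monthly_saving


if __name__ == "__main__":
    for name, (once, per_year) in COMPARISONS.items():
        months = break_even_months(once, per_year)
        print(f"{name}: break-even after about {months:.1f} months")
```

Passing `colocation_per_year=15_000` models the co-location option mentioned on the first two slides; for the Hyperplane this moves the break-even point from roughly eight months to roughly nine.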

Thank You! Tweet @LambdaAPI @stephenbalaban LAMBDALABS.COM/BLOG
