Building On-prem GPU Training Infrastructure
By Stephen Balaban, CEO, Lambda
Lambda Customers
About Me
● Started using CNNs for face recognition in 2012.
● First employee at Perceptio. We developed image recognition CNNs that ran locally on the iPhone. Acquired by Apple in 2015.
● Published in SPIE and NeurIPS.
Workshop Structure
● Audience survey
● Presentation w/ Q&A
● Q&A + Workshop
5 Stages of GPU Cloud Grief
It all starts with the Shock of an expensive AWS bill.
Stage 1 - Denial “This won’t happen again next month.”
Stage 2 - Anger “The bill doubled again!”
Stage 3 - Bargaining with your account manager.
Stage 4 - Depression “Spot instances and reserved instances aren’t enough, this is hopeless.”
Stage 5 - Acceptance “GPU cloud services are expensive. Managing hardware is scary.”
Hardware: A Quick Rundown
1. GPUs
2. CPUs
3. GPU-GPU Bandwidth & PCIe Topology
GPUs
GPU Speed Comparisons
Source: https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/
Performance / $
Source: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/
CPUs
What to Look For
1. Number of PCIe lanes (affects the total PCIe bandwidth available to the GPUs).
2. NUMA node topology (affects GPU peering).
Source: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/
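Lane count matters because every PCIe lane carries a fixed transfer rate. As a rough sketch, the per-direction ceiling of one link is transfer rate × lanes × encoding efficiency ÷ 8 bits per byte. PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding; PCIe 2.0 runs at 5 GT/s with 8b/10b. The helper name `pcie_gbytes_per_s` is hypothetical, and the result ignores packet overhead, so measured transfers will come in lower.

```shell
#!/usr/bin/env bash
# Hypothetical helper: theoretical per-direction PCIe link bandwidth in GB/s.
# Args: transfer rate (GT/s per lane), lane count, encoding payload bits, encoding total bits.
# Ignores packet/protocol overhead, so real transfers are lower.
pcie_gbytes_per_s() {
  local rate="$1" lanes="$2" payload="$3" total="$4"
  awk -v r="$rate" -v l="$lanes" -v p="$payload" -v t="$total" \
    'BEGIN { printf "%.2f\n", r * l * p / t / 8 }'
}

# PCIe 3.0 (8 GT/s, 128b/130b encoding): a full 16x slot vs. an 8x slot.
echo "PCIe 3.0 x16: $(pcie_gbytes_per_s 8 16 128 130) GB/s"
echo "PCIe 3.0 x8:  $(pcie_gbytes_per_s 8 8 128 130) GB/s"
```

This prints 15.75 GB/s for a 16x link and 7.88 GB/s for an 8x link. A CPU with too few lanes forces GPUs into narrower slots, or behind a PCIe switch where several GPUs share one 16x uplink to the CPU.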
GPU Peering & PCIe Topology
PCIe Topology
[Diagram: GPUs attached through 16x PCIe connections.]
Dual Root PCIe Topology
[Diagram: two CPUs joined by a CPU-CPU interconnect; each CPU connects to two PEX 8748 PCIe switches, and each switch connects to two GPUs (GPU 0-7 in total). Arrows are 16x PCIe connections.]
Source: Lambda
Single Root PCIe Topology
[Diagram: one CPU connects to two PEX 8796 PCIe switches, each of which connects to four GPUs (GPU 0-3 and GPU 4-7). Arrows are 16x PCIe connections.]
Source: Lambda
Cascaded PCIe Topology
[Diagram: one CPU connects to a PEX 8796 PCIe switch, which connects to four GPUs and to a second PEX 8796 switch serving the other four (GPU 0-7 in total). Arrows are 16x PCIe connections.]
Source: Lambda
NVLink System Topology
[Diagram: the dual root layout (two CPUs, a CPU-CPU interconnect, four PEX 8748 switches, GPU 0-7) with NVLink connections added between GPUs. Legend: open circle is CPU-CPU communication, green double arrow is NVLink, arrow is 16x PCIe connection.]
Source: Lambda
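On a running system, `nvidia-smi topo -m` prints a GPU-to-GPU matrix whose cells name the path between each pair, which is a quick way to tell which of the topologies above a server uses. The sketch below maps the legend codes that `nvidia-smi` documents (`PIX`, `PXB`, `PHB`, `NODE`, `SYS`, `NV#`) to where the traffic flows. The function name `classify_link` and its short labels are hypothetical, not part of `nvidia-smi`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: interpret one cell of the `nvidia-smi topo -m` matrix.
# Codes follow nvidia-smi's legend; the returned labels are illustrative only.
classify_link() {
  case "$1" in
    X)       echo "self" ;;
    NV*)     echo "nvlink" ;;            # bonded set of NVLinks
    PIX|PXB) echo "pcie-switch" ;;       # PCIe bridge(s) only, no host bridge
    PHB)     echo "cpu-root" ;;          # passes through a PCIe host bridge (the CPU)
    NODE)    echo "cpu-root" ;;          # crosses host bridges within one NUMA node
    SYS)     echo "cpu-interconnect" ;;  # crosses the CPU-CPU (SMP) link, e.g. QPI/UPI
    *)       echo "unknown" ;;
  esac
}

# Example: classify a few codes you might see for GPU pairs.
for code in PIX PHB SYS NV2; do
  printf '%-4s -> %s\n' "$code" "$(classify_link "$code")"
done
```

In a dual root system, GPU pairs split across the two CPUs typically show `SYS`, usually the slowest path for peer-to-peer traffic; pairs behind the same PCIe switch show `PIX` or `PXB`.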
Real Life Examples
Source: ASUS
Single Root Complex vs Dual Root Complex
[Photos: Single Root Complex (Supermicro 4029GP-TRT2) and Dual Root Complex (Supermicro 4028GR-TRT).]
Source: Supermicro
1080 Ti GPUDirect Peer-to-Peer Bandwidth Benchmark
[Chart: peer-to-peer bandwidth between GPUs connected through 16x PCIe links.]
Source: Lambda
No Peering on the New 2080 Ti
[Diagram: topology used in this experiment.] (The 1080 Ti has no NVLink.)
Source: Lambda
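To see whether peering is available on your own machine before running a bandwidth benchmark, a sketch like the one below asks the driver. It assumes an NVIDIA driver recent enough to support `nvidia-smi topo -p2p` (older drivers lack this option), and it only reports what the driver says; it does not measure bandwidth. The function name `check_gpu_peering` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report GPU PCIe link widths and P2P read capability.
# Assumes nvidia-smi is on PATH; skips gracefully when it is not.
check_gpu_peering() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; skipping GPU checks"
    return 0
  fi
  # Current vs. maximum PCIe link width per GPU (e.g. 16 vs. 16).
  nvidia-smi --query-gpu=index,name,pcie.link.width.current,pcie.link.width.max \
    --format=csv || echo "link-width query failed"
  # P2P read capability matrix; 'r' selects read (requires a recent driver).
  nvidia-smi topo -p2p r || echo "topo -p2p not supported by this driver"
}

check_gpu_peering
```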
Lambda Stack = GPU-enabled Frameworks
For Ubuntu 16.04 or 18.04. One command:
LAMBDA_REPO=$(mktemp) && \
wget -O "${LAMBDA_REPO}" https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
sudo dpkg -i "${LAMBDA_REPO}" && rm -f "${LAMBDA_REPO}" && \
sudo apt-get update && sudo apt-get install -y lambda-stack-cuda
Also available as a Docker container.
Source: https://lambdalabs.com/lambda-stack-deep-learning-software
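After installing, a quick sanity check can confirm that the driver sees the GPUs and that a framework can use them. This sketch assumes the install left `nvidia-smi` on the PATH and a PyTorch build importable by `python3`; if your setup differs, swap in the framework you use. The function name `verify_gpu_stack` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical post-install check: driver visibility and framework GPU access.
verify_gpu_stack() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi could not list GPUs"   # one line per GPU
  else
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
  fi

  if command -v python3 >/dev/null 2>&1; then
    # Assumes PyTorch is installed; prints True when CUDA devices are usable.
    python3 -c 'import torch; print("CUDA available:", torch.cuda.is_available())' 2>/dev/null \
      || echo "PyTorch not importable with python3"
  else
    echo "python3 not found"
  fi
}

verify_gpu_stack
```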
Cost Comparison: On-prem vs. Cloud
Lambda Hyperplane: $109,008, one-time
AWS p3dn.24xlarge instance: $160,308/year with reserved pricing
(Add $15,000/year to the Hyperplane cost if you want to co-locate it instead.)
Cost Comparison: On-prem vs. Cloud
Lambda Blade: $28,389, one-time
AWS p3.16xlarge instance: $139,371/year with reserved pricing
(Add $15,000/year to the Blade cost if you want to co-locate it instead.)
Cost Comparison: On-prem vs. Cloud
Lambda Quad: $12,472, one-time
AWS p3.8xlarge instance: $69,729/year with reserved pricing
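The three cost slides compare a one-time hardware price with a yearly cloud bill. One rough way to read them is the number of months until the hardware pays for itself: upfront cost ÷ ((yearly cloud cost - yearly co-location cost) ÷ 12). This sketch ignores electricity, staff time, depreciation, and financing, so treat its output as a rough estimate, not a full total-cost-of-ownership model. The function name `break_even_months` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: months until a one-time hardware purchase costs less
# than paying for a cloud instance. Ignores power, staff, and depreciation.
# Args: upfront cost, cloud cost per year, optional co-location cost per year.
break_even_months() {
  local upfront="$1" cloud_per_year="$2" colo_per_year="${3:-0}"
  awk -v u="$upfront" -v c="$cloud_per_year" -v k="$colo_per_year" 'BEGIN {
    monthly_savings = (c - k) / 12
    if (monthly_savings <= 0) { print "never"; exit }
    printf "%.2f\n", u / monthly_savings
  }'
}

# Figures from the three cost-comparison slides.
echo "Hyperplane vs p3dn.24xlarge:        $(break_even_months 109008 160308) months"
echo "Hyperplane vs p3dn.24xlarge (colo): $(break_even_months 109008 160308 15000) months"
echo "Blade vs p3.16xlarge:               $(break_even_months 28389 139371) months"
echo "Blade vs p3.16xlarge (colo):        $(break_even_months 28389 139371 15000) months"
echo "Quad vs p3.8xlarge:                 $(break_even_months 12472 69729) months"
```

With these inputs the script prints 8.16, 9.00, 2.44, 2.74, and 2.15 months respectively.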
Thank You!
Tweet @LambdaAPI @stephenbalaban
LAMBDALABS.COM/BLOG