An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures
Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. (DK) Panda
Dept. of Computer Science and Engineering, The Ohio State University
Credits: pdfs.semanticscholar.org (Ammar Ahmad Awan)
CPU-based Deep Learning is not as bad as you think!
• Introduction
  – CPU-based Deep Learning
  – Deep Learning Frameworks
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion
GPUs are great for Deep Learning
• NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
• GPUs are a natural fit for DL due to their throughput-oriented nature
• GPUs are also growing in the HPC arena!
But what about CPUs?
• Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
• Host CPUs exist even on GPU nodes
  – Many-core Xeon Phis are increasing
    • Xeon Phi 1st generation: a many-core co-processor
    • Xeon Phi 2nd generation (KNL): a self-hosted many-core processor!
• We usually hear that CPUs are 10x–100x slower than GPUs
  – But can we do better?
Deep Learning Frameworks – CPUs or GPUs?
• There are several Deep Learning (DL) or DNN training frameworks
  – Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting....
• Almost every framework has been optimized for NVIDIA GPUs
• But every framework is able to execute on a CPU as well
  – So why are we not using them?
  – Performance has been "terrible", and several studies have reported significant degradation when using CPUs
• But there is hope :-)
  – Coupled with Intel Xeon Phi (Knights Landing or KNL) and MCDRAM, the landscape for CPU-based DL looks promising
The DL Framework(s) in discussion: Caffe
• Caffe is a popular and widely used framework
• NVIDIA-Caffe and BVLC-Caffe (the official Caffe) are almost identical
• Intel-Caffe is optimized for CPU-based Deep Learning
• OSU-Caffe is a multi-node multi-GPU variant that we have worked on at OSU

Caffe Variant | Multi-GPU Support | Multi-node Support | Multi-node Communication
BVLC-Caffe    | Yes               | No                 | N/A
NVIDIA-Caffe  | Yes               | No                 | N/A
Intel-Caffe   | N/A               | Yes                | Intel MLSL 2017.1.016 (with Intel MPI 2017)
OSU-Caffe     | Yes               | Yes                | MVAPICH2-GDR 2.2
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion
The Key Question!
Can we provide a holistic yet comprehensive view of DNN training performance for a diverse set of hardware architectures, including Intel Xeon Phi (KNL) processors and NVIDIA Pascal GPUs?
Agenda
• Introduction
• Research Challenges
• Design Discussion
  – Caffe Architecture
  – Understanding the Impact of Execution Environments
• Performance Characterization
• Conclusion
Caffe Architecture
(Figure: data-parallel training loop across four GPUs, each holding a copy of layers L1..Ln)
1. Data propagation: GPU 0 broadcasts the packed parameter buffer (packed_comm_buff) to all GPUs
2. Forward/backward pass: each GPU runs the forward (F) and backward (B) passes over layers L1..Ln on its partition of the batch
3. Gradient aggregation: the packed gradient buffers (packed_reduce_buff) are reduced to GPU 0, which applies the updates
This loop repeats every training iteration.
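To make the three-step flow above concrete, here is a minimal data-parallel sketch in Python with mpi4py. It is not OSU-Caffe's actual code (which packs GPU buffers and communicates through MVAPICH2-GDR); the buffer size, learning rate, and the dummy forward_backward function are placeholders for illustration only.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Placeholder parameter buffer (stands in for packed_comm_buff).
params = np.random.rand(1024).astype(np.float32) if rank == 0 \
         else np.empty(1024, dtype=np.float32)

def forward_backward(params):
    """Stand-in for the per-device forward/backward pass over layers L1..Ln;
    returns this rank's local gradient."""
    return 0.01 * params

for step in range(10):
    # 1. Data propagation: rank 0 broadcasts the packed parameter buffer.
    comm.Bcast(params, root=0)

    # 2. Forward/backward pass on the local partition of the batch.
    local_grad = forward_backward(params)

    # 3. Gradient aggregation: reduce packed gradient buffers to rank 0,
    #    which averages them and applies the update.
    global_grad = np.empty_like(local_grad) if rank == 0 else None
    comm.Reduce(local_grad, global_grad, op=MPI.SUM, root=0)
    if rank == 0:
        params -= 0.1 * (global_grad / size)
```

Launched with e.g. `mpirun -np 4 python train_sketch.py`, each rank plays the role of one GPU in the figure.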
Understanding the Impact of Execution Environments
Performance is dependent on:
1. Hardware architectures
  – GPUs
  – Multi-/many-core CPUs
2. Software libraries
  – cuDNN (for GPUs)
  – MKL-DNN/MKL 2017 (for CPUs)
3. Hardware/software co-design
  – Software libraries optimized for one platform will not help the other!
(Figure: the DL software stack – DL applications (image recognition, speech processing, etc.) → DL frameworks (Caffe, TensorFlow, etc.) → convolution layer implementations (generic, MKL-optimized, cuDNN-optimized) → BLAS libraries (ATLAS, OpenBLAS, MKL 2017, cuDNN/cuBLAS) → hardware (multi-/many-core Xeon/Xeon Phi, Pascal P100 GPU))
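One practical way to see this software-stack dependence is to ask a framework which optimized libraries it was built against. The snippet below does this with NumPy and PyTorch purely as an illustration; the paper's experiments used Caffe variants, where the same information comes from the build configuration.

```python
import numpy as np
import torch

np.show_config()  # which BLAS backend NumPy was built with (MKL, OpenBLAS, ATLAS, ...)

print("MKL-DNN available:", torch.backends.mkldnn.is_available())  # optimized CPU path
print("cuDNN available:  ", torch.backends.cudnn.is_available())   # optimized GPU path
if torch.backends.cudnn.is_available():
    print("cuDNN version:", torch.backends.cudnn.version())
```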
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
  – Single-node Performance
  – Multi-node Performance
• Conclusion
Performance Characterization
• Several GPU generations and CPU architectures
• Single-node results for AlexNet and ResNet-50
  – Impact of the MKL engine
  – Impact of MCDRAM
  – Layer-wise breakdown
  – P100 vs. KNL
• Multi-node results using Intel-Caffe and OSU-Caffe
  – Weak scaling
  – ResNet-50 and AlexNet
Performance Characterization: Various Architectures

Name (Label) | Processor Architecture (Description)     | No. of Cores    | No. of Sockets
Haswell1     | Intel Xeon CPU E5-2660 v3 @ 2.60 GHz     | 20 (2*10)       | 2
Haswell2     | Intel Xeon CPU E5-2687 v3 @ 3.10 GHz     | 20 (2*10)       | 2
Broadwell    | Intel Xeon CPU E5-2680 v4 @ 2.40 GHz     | 28 (2*14)       | 2
KNL          | Intel Xeon Phi CPU 7250 @ 1.40 GHz       | 68 (1*68)       | 1
K40          | NVIDIA Tesla K40, 11.8 GB @ 0.75 GHz     | 2880 CUDA cores | N/A
K80          | NVIDIA Tesla K80, 11.8 GB @ 0.82 GHz     | 2496 CUDA cores | N/A
P100         | NVIDIA Tesla P100-PCIE, 16 GB @ 1.33 GHz | 3584 CUDA cores | N/A
Single-node: Impact of the MKL Engine in Intel-Caffe
• Comparison of the optimized MKL engine and the default Caffe engine
• The MKL engine is up to 3X better than the default Caffe engine
• Biggest gains are for the many-core Intel Xeon Phi (KNL) architecture
• Both Haswell and Broadwell architectures also get significant speedups (up to 1.5X)
(Figure: training time in ms for the default and MKL engines across CPU architectures)
Single-node: Impact of Utilizing MCDRAM
• "MCDRAM as Cache" and "MCDRAM-All" offer very similar performance
• MCDRAM as Cache was chosen for all subsequent results
• On average, DDR-All is up to 1.5X slower than MCDRAM
Diving Deeper: Layer-wise Breakdown
(Figure: per-layer forward and backward times in ms for AlexNet's conv1–conv5 layers)
• The full landscape for AlexNet: forward and backward pass
• Faster convolutions → faster training
• Most of the performance gains come from conv2 and conv3 for AlexNet
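The paper's layer-wise numbers come from the frameworks' own timing instrumentation; the sketch below only illustrates the general idea of a per-layer breakdown, using PyTorch forward hooks on a torchvision AlexNet model (the batch size and the set of timed layer types are arbitrary choices, not the paper's setup).

```python
import time
import torch
import torchvision.models as models

model = models.alexnet().eval()
starts, totals = {}, {}

def pre_hook(name):
    def fn(module, inputs):
        starts[name] = time.perf_counter()          # record layer start time
    return fn

def post_hook(name):
    def fn(module, inputs, output):
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - starts[name]
    return fn

# Time only the convolutional and fully connected layers.
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        module.register_forward_pre_hook(pre_hook(name))
        module.register_forward_hook(post_hook(name))

with torch.no_grad():
    model(torch.randn(64, 3, 224, 224))             # one dummy forward pass

for name, t in totals.items():
    print(f"{name}: {t * 1000:.1f} ms")
```

On a GPU, one would additionally have to synchronize the device around each layer for the timestamps to be meaningful.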
Diving Deeper: P100 vs. KNL (AlexNet)
• Fully connected layers are much slower on KNL compared to P100
• conv1 and conv3 also contribute to the degradation on KNL
• conv2 is faster on KNL compared to P100
(Figure: per-layer times in ms for conv1–conv5, fc6, and fc7 on P100 and KNL-Opt)
Multi-node Results: ResNet-50
• All results are weak scaling
• Images/second is a derived metric, but it is more meaningful for understanding scalability
(Figure: ResNet-50 weak scaling with Intel-Caffe on 2–32 nodes – training time in seconds and images/second)
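For reference, images/second is derived from the measured training time: under weak scaling the per-node batch size stays fixed, so the total number of images processed grows with the node count. A small sketch of that arithmetic follows; the numbers are placeholders, not the paper's measurements.

```python
def images_per_second(per_node_batch, num_nodes, iterations, total_time_s):
    """Derived throughput: total images processed divided by measured time."""
    total_images = per_node_batch * num_nodes * iterations
    return total_images / total_time_s

# e.g. 32 images per node per iteration, 100 iterations, 16 nodes, 180 s measured
print(images_per_second(per_node_batch=32, num_nodes=16, iterations=100,
                        total_time_s=180.0))   # ~284 images/second
```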
Multi-node Results: AlexNet Comparison
(Figure: AlexNet weak scaling on 1–32 nodes – training time in seconds and images/second for OSU-Caffe (GPU) and Intel-Caffe (CPU))
• OSU-Caffe vs. Intel-Caffe
  – Different frameworks, so not directly comparable
  – A rough comparison can still help in understanding scalability trends
  – The design of the framework can affect performance for distributed training
• MPI (or the communication runtime) can cause a marked difference
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion
Conclusion
• CPUs are very comparable to GPUs for DNN training workloads if the appropriate optimizations are exploited
• GPUs are still faster than CPUs in general
  – KNL beats P100 in one case, but P100 beats KNL in most cases
• When evaluating the performance of a DL framework
  – The hardware architecture matters
  – But the software stack has a more significant impact than the hardware
  – The full execution environment, including the communication runtime, needs to be evaluated to ensure fair comparisons
Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters
Arpan Jain, Ammar Ahmad Awan, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. (DK) Panda
Dept. of Computer Science and Engineering, The Ohio State University
Credits: http://nowlab.cse.ohio-state.edu/static/media/talks (Arpan Jain)
Agenda
• Introduction
• Background
• Research Challenges
• Characterization Strategy
  – Evaluation Platforms and Software Libraries
  – Experimental Setup
• Performance Evaluation
• Conclusion
Deep Learning Frameworks
• Easily implement and experiment with Deep Neural Networks
  – Several Deep Learning (DL) frameworks have emerged
    • Caffe, PyTorch, TensorFlow, MXNet, and counting....
  – The focus here is on TensorFlow and PyTorch
• Most frameworks are optimized for NVIDIA GPUs
  – But CPU-optimized implementations are also emerging, as we saw in the previous paper
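To give a sense of how little code these frameworks require to "implement and experiment", here is a minimal PyTorch training step; the model, data, and hyperparameters are arbitrary placeholders, not anything evaluated in the paper.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784)             # dummy batch of 64 flattened "images"
y = torch.randint(0, 10, (64,))      # dummy labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)          # forward pass
loss.backward()                      # backward pass (gradients)
optimizer.step()                     # parameter update
print(loss.item())
```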
Deep Learning and TensorFlow
• The most widely used framework, open-sourced by Google
• Replaced Google's DistBelief framework
• Runs on almost all execution platforms available (CPU, GPU, TPU, Mobile, etc.)
• https://github.com/tensorflow/tensorflow
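A small illustration of this portability, using the TensorFlow 2.x API (device names and tensor sizes are arbitrary): the same program can enumerate whatever CPUs, GPUs, or TPUs are visible on a node and pin work to one of them.

```python
import tensorflow as tf

print(tf.config.list_physical_devices())        # all visible devices
print(tf.config.list_physical_devices('GPU'))   # GPUs only (empty list if none)

with tf.device('/CPU:0'):                       # force placement on the host CPU
    a = tf.random.uniform((1024, 1024))
    b = tf.linalg.matmul(a, a)

print(b.device)                                 # shows where the op actually ran
```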
Agenda
• Introduction
• Background
• Research Challenges
• Characterization Strategy
  – Evaluation Platforms and Software Libraries
  – Experimental Setup
• Performance Evaluation
• Conclusion