Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems
Presenter: Weijie Zhao (1)
Joint work with Deping Xie (2), Ronglai Jia (2), Yulei Qian (2), Ruiquan Ding (3), Mingming Sun (1), Ping Li (1)
(1) Cognitive Computing Lab, Baidu Research  (2) Baidu Search Ads (Phoenix Nest), Baidu Inc.  (3) Sys. & Basic Infra., Baidu Inc.
Sponsored Online Advertising
Sponsored Online Advertising
A query is fed to a neural network that predicts the ad click-through rate (CTR):
CTR = (#ad clicks / #ad impressions) × 100%
The query, the ad, and the user portrait are represented as high-dimensional sparse vectors (~10^11 dimensions).
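A quick numerical illustration of this definition (the counts below are hypothetical, chosen only to show the arithmetic):

```python
# Hypothetical counts, purely to illustrate the CTR definition above.
ad_clicks = 42
ad_impressions = 10_000
ctr = ad_clicks / ad_impressions * 100  # = 0.42 (percent)
print(f"CTR = {ctr:.2f}%")
```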
CTR Prediction Models at Baidu
2009 and earlier: single-machine CTR models
2010: distributed logistic regression (LR) and a distributed parameter server
2013: distributed deep neural networks (DNN) with extremely large models
Since 2017: single-GPU AIBox, multi-GPU hierarchical parameter server, approximate nearest neighbor (ANN) search, maximum inner product search (MIPS)
ANN and MIPS have become increasingly important in the whole CTR prediction pipeline, due to the popularity and maturity of embedding learning and improved ANN/MIPS techniques.
A Visual Illustration of CTR Models
Sparse input → embedding layer with sparse parameters (~10 TB) → fully-connected layers with dense parameters (< 1 GB) → output
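A minimal sketch of this two-part architecture, shrunk to toy sizes (the real embedding table has ~10^11 rows and cannot be materialized as a single dense tensor; the use of torch and all layer sizes here are illustrative assumptions, not the production implementation):

```python
import torch.nn as nn

class ToyCTRModel(nn.Module):
    # Toy-scale stand-in: in production the sparse embedding table is ~10 TB,
    # while the dense fully-connected part stays under 1 GB.
    def __init__(self, num_features=100_000, emb_dim=8, hidden=64):
        super().__init__()
        self.embedding = nn.EmbeddingBag(num_features, emb_dim, mode="sum")  # sparse part
        self.mlp = nn.Sequential(                                            # dense part
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feature_ids, offsets):
        # feature_ids: flat tensor of the batch's non-zero feature indices
        # offsets: start position of each example's indices within feature_ids
        return self.mlp(self.embedding(feature_ids, offsets))
```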
MPI Cluster Solution: Distributed Parameter Server
Global parameter shards are stored in the memory of the parameter-server nodes (Node 1, Node 2, ...). Workers 1-4 hold local parameters and data shards and synchronize with the servers via pull/push.
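A minimal sketch of the pull/push loop each worker runs against such a parameter server (the ps_client API here is a hypothetical RPC stub, not an actual library):

```python
def run_worker(ps_client, data_shard, model, num_epochs=1):
    """Schematic worker loop for a distributed parameter server."""
    for _ in range(num_epochs):
        for batch in data_shard:
            keys = batch.sparse_feature_ids()      # sparse features touched by this batch
            params = ps_client.pull(keys)          # fetch only those parameters
            grads = model.forward_backward(batch, params)
            ps_client.push(keys, grads)            # send the gradients back to the servers
```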
Wait! Why do We Need Such a Massive Model?
Hashing For Reducing CTR Models
One permutation + one sign random projection (work done in 2015)
Image search ads are typically a small source of revenue.
1. Hashing + DNN significantly improves over LR (logistic regression)!
2. A fine solution if the goal is to achieve good accuracy on a single machine!
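A minimal sketch of the sign-random-projection half of this idea, assuming binary sparse features: each ±1 entry of the projection matrix is derived on the fly by hashing, so the 10^11-dimensional matrix is never materialized (the one-permutation step is omitted, and all names here are illustrative, not the 2015 implementation):

```python
import numpy as np

def sign_random_projection(nonzero_indices, k=64, seed=0):
    """Map a sparse binary vector (given by its non-zero feature indices)
    to a k-bit code; bit i is the sign of the i-th random +/-1 projection."""
    code = np.zeros(k, dtype=np.uint8)
    for i in range(k):
        acc = 0
        for j in nonzero_indices:
            h = hash((seed, i, j))        # implicit entry r[i, j] of the projection matrix
            acc += 1 if (h & 1) else -1
        code[i] = 1 if acc >= 0 else 0
    return code

# Similar sparse inputs agree on most bits, so the compact code can feed the DNN
# in place of the raw 1e11-dimensional sparse vector.
print(sign_random_projection([4, 53, 61, 87], k=16))
```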
Hashing For Reducing CTR Models
One permutation + one sign random projection (work done in 2015)
Web search ads use more features and larger models.
1. Even a 0.1% decrease in AUC would result in a noticeable decrease in revenue.
2. The hashing + DNN + single-machine solution is therefore typically not acceptable.
MPI Cluster Solution: Distributed Parameter Server (as above: global parameter shards in node memory; workers pull/push local parameters over data shards)
Drawbacks:
• Hardware and maintenance cost: the 10-TB model requires hundreds of computing nodes
• Communication cost
But all the cool kids use GPUs! Let’s train the 10-TB Model with GPUs!
Hold 10 TB of parameters in GPU?
Sparse input → embedding layer with sparse parameters (~10 TB) → fully-connected layers with dense parameters (< 1 GB) → output
A few hundred non-zeros per input
Only a small subset of the parameters in the embedding layer (~10 TB) is used and updated in each mini-batch; the dense fully-connected parameters stay under 1 GB.
The working parameters can therefore be held in GPU High Bandwidth Memory (HBM).
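A minimal sketch of how the working parameters of a training pass could be identified: the union of sparse feature indices its mini-batches touch (the toy data loosely mirrors the walkthrough on the slide after next; the exact grouping into examples is illustrative):

```python
def working_set(mini_batches):
    """Union of sparse feature ids referenced by the given mini-batches.
    Only the embedding rows for these keys need to be staged in GPU HBM."""
    keys = set()
    for batch in mini_batches:
        for example in batch:        # example = list of non-zero feature ids
            keys.update(example)
    return sorted(keys)

batches = [
    [[11, 87], [98]],          # mini-batch 1
    [[4, 61], [61, 87]],       # mini-batch 2
    [[4, 53], [50, 56, 61]],   # mini-batch 3
    [[5, 56]],                 # mini-batch 4
]
print(working_set(batches))    # [4, 5, 11, 50, 53, 56, 61, 87, 98]
```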
Solve the Machine Learning Problem in a System Way!
A three-level hierarchical parameter server:
• HBM-PS: parameter shards in the HBM of the GPUs (GPU 1-4); workers pull/push against them, with inter-GPU synchronization via local pull/push & data transfer within a node and remote pull/push over RDMA.
• MEM-PS: node memory holds local parameters and data shards, batch-loaded/dumped from HDFS.
• SSD-PS: SSDs hold the materialized parameters.
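A minimal sketch of the hierarchical lookup this design implies: try HBM first, fall back to host memory, then to SSD, promoting parameters upward as they are touched (class and method names are hypothetical, and eviction/dumping back to SSD is omitted):

```python
class HierarchicalPS:
    """Toy three-level store: HBM (fastest, smallest) -> MEM -> SSD (largest)."""

    def __init__(self, hbm, mem, ssd):
        self.hbm, self.mem, self.ssd = hbm, mem, ssd   # dict-like: key -> parameter value

    def pull(self, key):
        if key in self.hbm:                   # working parameters of the current batches
            return self.hbm[key]
        if key in self.mem:                   # recently used parameters cached in RAM
            self.hbm[key] = self.mem[key]     # promote for upcoming batches
            return self.hbm[key]
        value = self.ssd[key]                 # cold parameters materialized on SSD
        self.mem[key] = value
        self.hbm[key] = value
        return value

    def push(self, key, grad, lr=0.01):
        self.hbm[key] = self.pull(key) - lr * grad     # update the hot copy in HBM
```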
Example walkthrough (numbers are sparse parameter keys):
• Four mini-batches are to be trained; together they reference the working set 4, 5, 11, 50, 53, 56, 61, 87, 98 (mini-batch 1 alone references 11, 87, 98).
• SSD-PS partition: Node 1's SSD holds keys 1, 3, 5, ..., 99; Node 2's SSD holds keys 2, 4, 6, ..., 100.
• MEM-PS: each node loads its partition of the working set from its local SSD into memory (pull local; Node 1's MEM: 5, 11, 53, 61, 87; Node 2's MEM: 4, 50, 56, 98); parameters owned by the other node are fetched via remote pulls from its MEM-PS/SSD-PS.
• The working parameters are then partitioned across GPU HBMs: GPU 1 holds 4, 5, 11, 50; GPU 2 holds 53, 56, 61, 87, 98.
• Worker 1 trains mini-batch 1: it pulls key 11 from the local HBM-PS (GPU 1) and keys 87, 98 from the remote HBM-PS (GPU 2), then runs forward/backward propagation.
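A minimal sketch reproducing the last step of this walkthrough: deciding which of mini-batch 1's keys Worker 1 can pull from its local HBM shard and which must be pulled from a remote GPU over RDMA (the shard map is taken from the slide above; the function names are illustrative):

```python
# HBM shard assignment from the walkthrough above.
HBM_SHARDS = {1: [4, 5, 11, 50], 2: [53, 56, 61, 87, 98]}

def owner_gpu(key):
    for gpu, keys in HBM_SHARDS.items():
        if key in keys:
            return gpu
    raise KeyError(f"key {key} is not in the working set")

def split_pulls(batch_keys, my_gpu):
    """Split a mini-batch's keys into local HBM pulls and remote HBM pulls."""
    local = [k for k in batch_keys if owner_gpu(k) == my_gpu]
    remote = [k for k in batch_keys if owner_gpu(k) != my_gpu]
    return local, remote

# Worker 1 (co-located with GPU 1) trains mini-batch 1 = {11, 87, 98}:
print(split_pulls([11, 87, 98], my_gpu=1))   # ([11], [87, 98])
```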
Experimental Evaluation
• 4 GPU computing nodes; each node has:
• 8 cutting-edge 32 GB HBM GPUs
• Server-grade CPUs with 48 cores (96 threads)
• ~1 TB of memory
• ~20 TB RAID-0 NVMe SSDs
• A 100 Gb RDMA network adaptor
Execution Time
Price-Performance Ratio
• Hardware and maintenance cost: 1 GPU node ≈ 10 CPU-only nodes
• 4 GPU nodes vs. 75-150 CPU nodes
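A back-of-the-envelope check of the cost side, using only the node-cost ratio quoted on this slide (absolute prices are intentionally left out):

```python
# 1 GPU node is assumed (per the slide) to cost about as much to buy and
# maintain as 10 CPU-only nodes.
gpu_nodes = 4
cpu_equiv_per_gpu_node = 10
gpu_cluster_cost = gpu_nodes * cpu_equiv_per_gpu_node      # ~40 CPU-node equivalents

mpi_cluster_sizes = (75, 150)                              # baseline MPI cluster range
cost_gap = tuple(n / gpu_cluster_cost for n in mpi_cluster_sizes)
print(cost_gap)   # (1.875, 3.75): the MPI cluster costs roughly 1.9-3.8x more
```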
AUC
Scalability
[Chart: #examples trained/sec (0 to 9×10^4) vs. #nodes (1-4), real scaling vs. ideal linear scaling.]
Conclusions
• We introduce the architecture of a distributed hierarchical GPU parameter server for massive-scale deep learning ads systems.
• We perform an extensive set of experiments on 5 CTR prediction models in real-world online sponsored advertising applications.
• A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster.
• The cost of 4 GPU nodes is much less than the cost of maintaining an MPI cluster of 75-150 CPU nodes.
• The price-performance ratio of the proposed system is 4.4-9.0X better than the previous MPI solution.
• This system is being integrated with the PaddlePaddle deep learning platform (https://www.paddlepaddle.org.cn) to become PaddleBox.