Parameter Server Marco Serafini COMPSCI 532 Lecture 19
Machine Learning • Wide array of problems and algorithms • Classification • Given labeled data points, predict label of new data point • Regression • Learn a function from some (x, y) pairs • Clustering • Group data points into “similar” clusters • Segmentation • Partition image into meaningful segments • Outlier detection
More Dimensions • Supervision: • Supervised ML: labeled ground truth is available • Unsupervised ML: no ground truth • Training vs. Inference • Training: obtain model from training data • Inference: actually run the prediction • Today we focus on the training problem
Example: Ad Click Predictor • Ad prediction problem • A user is browsing the web • Choose ad that maximizes the likelihood of a click • Training data • Trillions of ad-click log entries • Trillions of features per ad and user • Important to reduce running time of training • Want to retrain frequently • Reduce energy and resource utilization costs
Abstracting ML Algorithms • Can we find commonalities among ML algorithms? • This would allow finding • Common abstractions • Systems solutions to efficiently implement these abstractions • Some common aspects • We have a prediction model A • A should optimize some complex objective function L • E.g.: Likelihood of correctly labeling a new ad as “click” or “no-click” • ML algorithm does this by iteratively refining A
High-Level View • Notation • D: data • A: model parameters • L: function to optimize (e.g., minimize loss) • Goal: Update A based on D to optimize L • Typical approach: iterative convergence A^{(t)} = F(A^{(t-1)}, \Delta_L(A^{(t-1)}, D)), where \Delta_L computes the updates that minimize L at iteration t and F merges those updates into the parameters
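To make the template concrete, here is a minimal sketch in Python (all names, such as train, delta_L, and merge, are illustrative, not from any paper): delta_L computes an update from the current parameters and the data, and merge folds it into the parameters; the instantiation below is plain batch gradient descent on a least-squares loss.

import numpy as np

def train(A, D, delta_L, merge, num_iters=100):
    """Iteratively refine parameters A on data D: A_t = merge(A_{t-1}, delta_L(A_{t-1}, D))."""
    for t in range(num_iters):
        update = delta_L(A, D)   # compute an update that reduces the loss L
        A = merge(A, update)     # merge the update into the parameters
    return A

# Example instantiation: batch gradient descent on a least-squares loss.
def delta_L(A, D):
    X, y = D
    return -0.1 * (X.T @ (X @ A - y)) / len(y)   # scaled negative gradient

def merge(A, update):
    return A + update

X = np.random.randn(100, 5)
y = X @ np.ones(5)
A_final = train(np.zeros(5), (X, y), delta_L, merge)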
How to Parallelize? • How to execute the algorithm over a set of workers? • Data-parallel approach • Partition data D • All workers share the model parameters A • Model-parallel approach • Partition model parameters A • All workers process the same data D
Data-Parallel Approach A^{(t)} = A^{(t-1)} + \sum_{p=1}^{P} \Delta(A^{(t-1)}, D_p) • Process for each worker • Compute updates based on its partition of the data • Push updates to the parameter servers • Servers aggregate and apply the updates • Pull the updated parameters • Requirements • Updates must be associative and commutative! • Example: Stochastic Gradient Descent
Example • Each worker • Loads a partition of the data • At every iteration, computes gradients • Server • Aggregates the gradients • Updates the parameters
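A hypothetical single-process simulation of this data-parallel loop (the Server class and its push/pull/apply methods are illustrative, not the paper's API): each worker computes a gradient on its own partition and pushes it, the server averages and applies the pushed gradients, and workers pull the new parameters.

import numpy as np

class Server:
    def __init__(self, dim):
        self.weights = np.zeros(dim)
        self.pending = []

    def push(self, grad):
        self.pending.append(grad)            # updates are associative/commutative sums

    def apply(self, lr=0.1):
        self.weights -= lr * np.mean(self.pending, axis=0)
        self.pending.clear()

    def pull(self):
        return self.weights.copy()

def worker_gradient(weights, X, y):
    """Least-squares gradient on this worker's partition."""
    return X.T @ (X @ weights - y) / len(y)

dim, num_workers = 5, 4
X = np.random.randn(400, dim)
y = X @ np.ones(dim)
parts = [(X[i::num_workers], y[i::num_workers]) for i in range(num_workers)]

server = Server(dim)
for step in range(100):
    w = server.pull()
    for Xp, yp in parts:                     # in reality each worker runs in parallel
        server.push(worker_gradient(w, Xp, yp))
    server.apply()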
Parameter Server • Stores the model parameters • Advantages • No need for explicit message passing between workers • Distributed shared memory abstraction • Very first implementation: a distributed key-value store • Improvements introduced by the work we read • Server-side UDFs • Worker scheduling • Bandwidth optimizations
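A minimal sketch of the shared key-value abstraction with a server-side UDF hook (push, pull, and register_udf are assumed names for illustration; the real system exposes range-based operations):

class KVParameterServer:
    def __init__(self):
        self.store = {}                               # key -> value
        self.udf = lambda old, upd: old + upd         # default: additive merge

    def register_udf(self, fn):
        """Install a server-side merge function, e.g. to apply a learning rate."""
        self.udf = fn

    def push(self, kv_pairs):
        for k, v in kv_pairs.items():
            self.store[k] = self.udf(self.store.get(k, 0.0), v)

    def pull(self, keys):
        return {k: self.store.get(k, 0.0) for k in keys}

ps = KVParameterServer()
ps.register_udf(lambda old, grad: old - 0.1 * grad)   # SGD step done server-side
ps.push({"w_3": 0.5, "w_42": -1.2})
print(ps.pull(["w_3", "w_42"]))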
Architecture • Different namespaces • Single parameters as <key, value> pairs • Server-side linear algebra operations • Sum • Multiplication • 2-norm
Does This Scale? • We said that a model can have trillions of parameters • Q: Does this scale? • A: Yes • Each data point (and thus each worker) only updates a few parameters • Example: Sparse Logistic Regression
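A sketch of why this scales, assuming the hypothetical key-value interface above: a sparse logistic-regression worker pulls and pushes only the keys of the features that appear in each example, so only a tiny slice of a trillion-parameter model ever crosses the network.

import math

def sparse_grad(example, label, pull, push):
    """example: dict feature_id -> value; pull/push talk to the parameter server."""
    w = pull(example.keys())                         # only the needed parameters
    margin = sum(w[j] * x for j, x in example.items())
    p = 1.0 / (1.0 + math.exp(-margin))              # predicted click probability
    err = p - label
    push({j: err * x for j, x in example.items()})   # the gradient is just as sparse

# Usage against the hypothetical KVParameterServer sketched earlier:
# sparse_grad({"ad_7": 1.0, "user_9k": 1.0}, 1, lambda ks: ps.pull(ks), ps.push)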
Optimizing Communication • Machine learning is communication-heavy • Ranges • Workers do not update single keys • Instead, they batch updates per key range • Message compression • Worker-side caching of key lists + sending a hash of the list instead • Don't send zeroes • Snappy compression • Filtering: small updates are omitted (application-specific)
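A rough sketch of these compression ideas (zlib stands in for Snappy, and the caching protocol is simplified): batch an update per key range, drop zero entries, and replace the key list with its hash when the receiver has already seen that exact list.

import hashlib
import json
import zlib                     # stand-in for Snappy: same idea, different codec

key_list_cache = set()          # key-list hashes the receiver is known to have

def encode_range_update(keys, values):
    nonzero = [(k, v) for k, v in zip(keys, values) if v != 0.0]   # don't send zeroes
    ks = [k for k, _ in nonzero]
    key_hash = hashlib.md5(json.dumps(ks).encode()).hexdigest()
    if key_hash in key_list_cache:
        payload = {"key_hash": key_hash, "values": [v for _, v in nonzero]}
    else:
        key_list_cache.add(key_hash)
        payload = {"keys": ks, "values": [v for _, v in nonzero]}
    return zlib.compress(json.dumps(payload).encode())

msg = encode_range_update(["w_1", "w_2", "w_3"], [0.0, 0.7, -0.2])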
Tasks • Activated by RPCs: push or pull operations • Executed asynchronously • Users can specify dependencies between tasks
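A sketch of asynchronous tasks with an explicit dependency, using a thread pool as a stand-in for the RPC layer (rpc_push and rpc_pull are placeholders): the push is issued asynchronously, local work overlaps with it, and only the dependent pull waits for its completion.

from concurrent.futures import ThreadPoolExecutor
import time

def rpc_push(grads):
    time.sleep(0.01)             # pretend network + server work
    return "push-ack"

def rpc_pull(keys):
    time.sleep(0.01)
    return {k: 0.0 for k in keys}

with ThreadPoolExecutor(max_workers=2) as pool:
    push_task = pool.submit(rpc_push, {"w_1": 0.3})   # returns immediately
    # ... overlap local computation with the in-flight push ...
    push_task.result()                                # dependency: pull only after the push
    params = pool.submit(rpc_pull, ["w_1"]).result()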
Flexible Consistency • Typical semantics • Sequential • Eventual • Bounded delay
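A sketch of one plausible bounded-delay rule (the exact off-by-one convention here is an assumption): a worker at iteration t may proceed only if the server has acknowledged every iteration up to t - tau - 1; tau = 0 degenerates to sequential execution and a very large tau to eventual consistency.

def may_start(t, acked_through, tau):
    """Bounded-delay check: allow at most tau unacknowledged iterations in flight."""
    return acked_through >= t - tau - 1

assert may_start(t=5, acked_through=4, tau=0)        # fully synchronous
assert may_start(t=5, acked_through=2, tau=3)        # some staleness allowed
assert not may_start(t=5, acked_through=1, tau=2)    # too stale: must wait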
Dependencies • Vector clocks to express dependencies • Size: one entry per parameter per node would be too large • Instead, use one entry per key range per node • Ranges are few and are not split frequently
Consistent Hashing • The server manager maintains the ring • The other servers are assigned key ranges on the ring
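A minimal consistent-hashing sketch (MD5 and a single point per server are simplifications; real systems use virtual nodes): keys and servers hash onto the same ring, and a key is owned by the first server clockwise from its hash, so adding or removing a server only moves adjacent ranges.

import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

servers = ["server-0", "server-1", "server-2"]
ring = sorted((h(s), s) for s in servers)     # ring points owned by servers
points = [p for p, _ in ring]

def owner(key):
    i = bisect.bisect(points, h(key)) % len(ring)   # first server clockwise, wrapping around
    return ring[i][1]

print(owner("w_12345"))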
Replication • Synchronous replication • The master pushes aggregated updates to the replicas • When all replicas have received the update, the master acknowledges the worker • Replication after aggregation • The master waits until it has aggregated multiple updates before replicating them
Results: Sparse Logistic Regression • Convergence and CPU utilization
Effect of Network Compression
Effect of Asynchrony • Note: More asynchrony not always better
How to Parallelize? • How to execute the algorithm over a set of workers? • Data-parallel approach • Partition data D • All workers share the model parameters A • Model-parallel approach • Partition model parameters A • All workers process the same data D
Model-Parallel Approach A^{(t)} = A^{(t-1)} + \mathrm{Con}(\{\Delta_p(A^{(t-1)}, S_p^{(t-1)}(A^{(t-1)}, D))\}_{p=1}^{P}) • Process for each worker • Receive the ids of the parameters S_p^{(t-1)} to update (from the scheduler) • These ids form a partition of the entire space of parameters • Compute updates on those parameters only • Send the updates to the parameter server, which • Concatenates the updates (which are disjoint) • Applies the updates to the parameters • Requirements • There should be no/weak correlation among parameters • Example: matrix factorization • Q: Advantage?
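A hypothetical sketch of one model-parallel iteration: a trivial scheduler splits the parameter indices into disjoint blocks, every worker computes an update for its block from the same A^(t-1) and the same data, and the disjoint updates are concatenated and applied.

import numpy as np

def schedule(dim, num_workers):
    """Trivial scheduler: split parameter indices into disjoint blocks."""
    return np.array_split(np.arange(dim), num_workers)

def worker_update(A, block, X, y):
    """Least-squares gradient step restricted to this worker's block of parameters."""
    residual = X @ A - y
    return -0.1 * (X[:, block].T @ residual) / len(y)

dim, num_workers = 8, 4
X = np.random.randn(200, dim)
y = X @ np.ones(dim)
A = np.zeros(dim)

for t in range(50):
    blocks = schedule(dim, num_workers)
    updates = [worker_update(A, b, X, y) for b in blocks]   # all computed from the same A^(t-1)
    for b, u in zip(blocks, updates):
        A[b] += u                                           # concatenate disjoint updates and apply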
Model-Parallel Scheduler • Some systems (e.g., Petuum) support a global scheduler • The scheduler runs application-specific logic • Two main goals • Partition the parameters • Prioritized scheduling: give precedence to the parameters that converge more slowly
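One way prioritized scheduling is often illustrated (the priority rule below is an assumption, not Petuum's exact policy): schedule first the parameters whose values changed the most in the previous iteration, since they are presumably farthest from convergence.

import numpy as np

def prioritize(prev_A, curr_A, budget):
    """Return the indices of the `budget` parameters with the largest recent change."""
    delta = np.abs(curr_A - prev_A)
    return np.argsort(-delta)[:budget]

prev_A = np.array([0.00, 0.50, 0.49, 2.00])
curr_A = np.array([0.00, 0.10, 0.48, 1.00])
print(prioritize(prev_A, curr_A, budget=2))   # indices 3 and 1 changed the most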
Horovod • Use a ring topology among workers for aggregation • Linear instead of quadratic number of messages • Schedule non-overlapping updates • [Diagram: workers exchanging updates along a ring vs. workers communicating with parameter servers]
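A simplified sketch of ring aggregation, the idea behind Horovod's allreduce (real implementations scatter-reduce per chunk rather than forwarding whole gradients): each worker only sends to its clockwise neighbor, and after n-1 steps every worker holds the full sum.

import numpy as np

def ring_allreduce(grads):
    """Sum per-worker gradient vectors using only neighbor-to-neighbor sends."""
    n = len(grads)
    acc = [g.copy() for g in grads]      # each worker's running sum
    send = [g.copy() for g in grads]     # the value each worker forwards next
    for _ in range(n - 1):
        recv = [send[(i - 1) % n] for i in range(n)]   # receive from the left neighbor
        acc = [a + r for a, r in zip(acc, recv)]
        send = recv                                    # forward what was just received
    return acc                                         # every worker now holds the full sum

grads = [np.full(4, float(i)) for i in range(4)]       # worker i holds a vector of i's
assert all(np.allclose(a, 0 + 1 + 2 + 3) for a in ring_allreduce(grads))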
Scheduling Updates