Scaling with the Parameter Server: Variations on a Theme
Alexander Smola, Google Research & CMU
alex.smola.org
Thanks: Amr Ahmed, Nino Shervashidze, Joey Gonzalez, Shravan Narayanamurthy, Sergiy Matyusevich, Markus Weimer
Practical Distributed Inference
• Multicore
  – asynchronous optimization with shared state
• Multiple machines
  – exact synchronization (Yahoo LDA)
  – approximate synchronization
  – dual decomposition
Motivation: Data & Systems
Commodity Hardware
• High Performance Computing: very reliable, custom built, expensive
• Consumer hardware: cheap, efficient, easy to replicate; not very reliable - deal with it!
The Joys of Real Hardware
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/stanford-295-talk.pdf
Slide courtesy of Jeff Dean
Scaling problems
• Data (lower bounds)
  – >10 billion documents (webpages, e-mails, advertisements, tweets)
  – >100 million users on Google, Facebook, Twitter, Yahoo, Hotmail
  – >1 million days of video on YouTube
  – >10 billion images on Facebook
• Processing capability of a single machine is about 1 TB/hour, but we have much more data
• Parameter space of the models is big for a single machine (but not excessively so): personalize content for many millions of users
• Need to process data on many cores and many machines simultaneously
Some Problems (this talk)
• Good old-fashioned supervised learning (classification, regression, tagging, entity extraction, ...)
• Graph factorization (latent variable estimation, social recommendation, discovery)
• Structure inference (clustering, topics, hierarchies, DAGs, whatever else your NP Bayes friends have)
• Example use case: combine information from generic webpages, databases, human-generated data, and semistructured tables into knowledge about entities
How do we solve it at scale?
Multicore parallelism
Multicore Parallelism
(diagram: data source → x → loss, gradient)
• Many processor cores
  – Decompose into separate tasks
  – Good Java/C++ tool support
• Shared memory
  – Exact estimates: requires locking of neighbors (see e.g. GraphLab). Good if the problem can be decomposed cleanly (e.g. Gibbs sampling in a large model)
  – Exact updates but delayed incorporation: requires locking of state. Good if a delayed update is of little consequence (e.g. Yahoo LDA, Yahoo online)
  – Hogwild updates: no locking whatsoever, requires atomic state. Good if the collision probability is low
Stochastic Gradient Descent
(diagram: data-parallel and parameter-parallel layouts - data source → x → loss/gradient → updater, split over parts 1..n)
• Delayed updates (round robin for data parallelism, aggregation tree for parameter parallelism)
• Goal: minimize Σ_i f_i(w) over w
• Online template:
  Input: scalar σ > 0 and delay τ
  for t = τ+1 to T+τ do
    Obtain f_t and incur loss f_t(w_t)
    Compute g_t := ∇f_t(w_t) and set η_t = 1/(σ(t − τ))
    Update w_{t+1} = w_t − η_t g_{t−τ}
  end for
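The online template can be turned into code almost line by line. Below is a minimal single-process Python sketch of delayed SGD, assuming a squared loss and a synthetic data stream purely for illustration: the gradient computed at step t is only applied τ steps later, which mimics the staleness introduced by τ round-robin workers.

```python
import collections
import numpy as np

def delayed_sgd(data, dim, sigma=5.0, tau=4):
    """Delayed SGD: apply the gradient computed tau steps ago.

    Follows the template above: eta_t = 1 / (sigma * (t - tau)),
    update w_{t+1} = w_t - eta_t * g_{t - tau}.
    """
    w = np.zeros(dim)
    stale = collections.deque()              # gradients waiting to be applied
    for t, (x, y) in enumerate(data, start=1):
        g = (w @ x - y) * x                  # gradient of 0.5 * (w.x - y)^2
        stale.append(g)
        if len(stale) > tau:                 # incorporate the tau-step-old gradient
            eta = 1.0 / (sigma * (t - tau))
            w -= eta * stale.popleft()
    return w

# toy usage: noisy linear data, just to exercise the loop
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
stream = [(x, x @ w_true + 0.01 * rng.normal())
          for x in rng.normal(size=(1000, 5))]
print("error:", np.linalg.norm(delayed_sgd(stream, dim=5) - w_true))
```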
Guarantees
• Worst-case guarantee (Zinkevich, Langford, Smola, 2010): SGD with delay τ on τ processors is no worse than sequential SGD
  E[R[X]] ≤ 4RL√(τT)
• The lower bound is tight (proof: send the same instance τ times)
• Better bounds with iid data
  – Penalty is the covariance in the features
  – Vanishing penalty for smooth f(w)
  E[R[X]] ≤ (28.3 R²H + (2/3)RL + 4σ) τ² + (8/3) R²H log T + 3RL√T
• Works even better if we don't lock between updates: Hogwild (Recht, Re, Wright, 2011)
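For contrast with the delayed scheme, here is a toy Hogwild-style sketch: several threads run SGD against one shared weight vector with no locking at all. This only illustrates the lock-free pattern referenced above (Recht, Re, Wright, 2011), not their implementation; the dense updates, Python threads, and synthetic data are simplifying assumptions.

```python
import threading
import numpy as np

def hogwild_worker(w, shard, sigma=10.0):
    """Run SGD on one data shard, writing into the shared w without locks."""
    for t, (x, y) in enumerate(shard, start=1):
        g = (w @ x - y) * x                   # reads may see other threads' writes
        w -= (1.0 / (sigma * t)) * g          # unsynchronized in-place update

rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
data = [(x, x @ w_true) for x in rng.normal(size=(4000, 5))]

w = np.zeros(5)                               # shared state, no lock anywhere
threads = [threading.Thread(target=hogwild_worker, args=(w, data[i::4]))
           for i in range(4)]
for th in threads: th.start()
for th in threads: th.join()
print("error:", np.linalg.norm(w - w_true))
```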
Speedup on TREC (plot: speedup in % vs. number of cores, 1-7 cores)
LDA Multicore Inference
(architecture diagram, Intel Threading Building Blocks: file → tokens → parallel samplers over a joint state table → count & topics → combiner/updater → output to file; diagnostics and hyperparameter optimization on the side)
• Decouple multithreaded sampling and updating: (almost) avoids stalling for locks in the sampler
• Joint state table
  – much less memory required
  – samplers synchronized (10 docs vs. millions delay)
• Hyperparameter update via stochastic gradient descent
• No need to keep documents in memory (streaming)
Smola and Narayanamurthy, 2010
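A minimal sketch of the sampler/updater decoupling, assuming a queue of count deltas as the hand-off: sampler threads never touch the shared topic counts directly, and a single updater thread folds their deltas into the joint state table, so samplers (almost) never stall on locks. The toy 'sampling' step and all names are placeholders, not the Yahoo LDA code.

```python
import queue
import threading
from collections import Counter

state = Counter()        # joint state table: (word, topic) -> count
deltas = queue.Queue()   # sampler -> updater channel
DONE = object()

def sampler(docs):
    """Pretend to sample a topic per token; emit count deltas, never touch state."""
    for doc in docs:
        for word in doc:
            topic = hash(word) % 8            # stand-in for a collapsed Gibbs draw
            deltas.put(((word, topic), +1))
    deltas.put(DONE)

def updater(num_samplers):
    """Single writer: applies queued deltas to the joint state table."""
    finished = 0
    while finished < num_samplers:
        item = deltas.get()
        if item is DONE:
            finished += 1
        else:
            key, delta = item
            state[key] += delta

docs = [["parameter", "server", "topic"], ["topic", "model", "server"],
        ["latent", "dirichlet", "allocation"], ["gibbs", "sampler", "state"]]
shards = [docs[0::2], docs[1::2]]
threads = [threading.Thread(target=sampler, args=(s,)) for s in shards]
threads.append(threading.Thread(target=updater, args=(len(shards),)))
for th in threads: th.start()
for th in threads: th.join()
print(state.most_common(5))
```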
LDA Multicore Inference
(same architecture diagram as the previous slide)
• Sequential collapsed Gibbs sampler, separate state table: Mallet (Mimno et al. 2008) - slow mixing, high memory load, many iterations
• Sequential collapsed Gibbs sampler (parallel): Yahoo LDA (Smola and Narayanamurthy, 2010) - fast mixing, many iterations
• Sequential stochastic gradient descent (variational, single logical thread): VW LDA (Hoffman et al., 2011) - fast convergence, few iterations, dense
• Sequential stochastic sampling gradient descent (only partly variational): Hoffman, Mimno, Blei, 2012 - fast convergence, quite sparse, single logical thread
Smola and Narayanamurthy, 2010
General strategy
• Shared state space
• Delayed updates from the cores
• The proof technique is usually to show that the problem hasn't changed too much during the delay (in terms of interactions)
• More work
  – Macready, Siapas and Kauffman, 1995: Criticality and Parallelism in Combinatorial Optimization
  – Low, Gonzalez, Kyrola, Bickson, Guestrin and Hellerstein, 2010: Shotgun for l1
This was easy ... what if we need many machines?
Parameter Server: 30,000 ft view
Why (not) MapReduce?
• Map(key, value): process instances on a subset of the data / emit aggregate statistics
• Reduce(key, value): aggregate over the whole dataset - update parameters
• This is a parameter exchange mechanism (simply repeat MapReduce): good if you can make your algorithm fit (e.g. distributed convex online solvers)
• Can be slow to propagate updates between machines & slow to converge (e.g. a really bad idea in clustering - each machine proposes a different clustering)
• Hadoop MapReduce loses the state between mapper iterations
(diagram from Ramakrishnan, Sakrejda, Canon, DoE 2011)
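To make the 'repeat MapReduce as a parameter exchange' point concrete, here is a hedged single-process sketch: each map task runs SGD on its shard starting from the broadcast parameters, and the reduce step averages the per-shard weights before the next round. The squared loss and plain averaging are illustrative choices.

```python
import numpy as np

def map_task(w_global, shard, eta=0.05):
    """One mapper: local SGD starting from the broadcast parameters."""
    w = w_global.copy()
    for x, y in shard:
        w -= eta * (w @ x - y) * x
    return w

def reduce_task(local_models):
    """One reducer: average the per-shard parameters."""
    return np.mean(local_models, axis=0)

rng = np.random.default_rng(2)
w_true = rng.normal(size=5)
data = [(x, x @ w_true) for x in rng.normal(size=(2000, 5))]
shards = [data[i::4] for i in range(4)]

w = np.zeros(5)
for it in range(10):                        # "repeat MapReduce"
    w = reduce_task([map_task(w, s) for s in shards])
print("error:", np.linalg.norm(w - w_true))
```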
General parallel algorithm template
(diagram: clients ↔ parameter server)
• Clients have a local copy of the parameters to be estimated
• P2P is infeasible since it needs O(n²) connections (see Asuncion et al. for an amazing tour de force)
• Synchronize* with the parameter server
  – Reconciliation protocol: average parameters, lock variables, turnstile counter
  – Synchronization schedule: asynchronous, synchronous, episodic
  – Load distribution algorithm: single server, uniform distribution, fault tolerance, recovery
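A schematic of the client/server split in plain Python, under simplifying assumptions: the server holds the authoritative parameters and reconciles pushed deltas additively, each client works on a stale local copy and synchronizes episodically, and the pushed delta is damped by the number of clients as a crude reconciliation rule. In a real system the push/pull calls are RPCs and the server itself is sharded.

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the global parameters; reconciliation = apply (damped) additive deltas."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.lock = threading.Lock()
    def pull(self):
        with self.lock:
            return self.w.copy()
    def push(self, delta):
        with self.lock:
            self.w += delta

def client(server, shard, num_clients, sync_every=50, eta=0.05):
    w = server.pull()                        # stale local copy
    acc = np.zeros_like(w)                   # accumulated local change
    for i, (x, y) in enumerate(shard, start=1):
        g = (w @ x - y) * x
        w -= eta * g
        acc -= eta * g
        if i % sync_every == 0:              # episodic synchronization
            server.push(acc / num_clients)   # damp so clients don't jointly overshoot
            acc[:] = 0.0
            w = server.pull()
    server.push(acc / num_clients)

rng = np.random.default_rng(3)
w_true = rng.normal(size=5)
data = [(x, x @ w_true) for x in rng.normal(size=(4000, 5))]

ps = ParameterServer(dim=5)
workers = [threading.Thread(target=client, args=(ps, data[i::4], 4)) for i in range(4)]
for t in workers: t.start()
for t in workers: t.join()
print("error:", np.linalg.norm(ps.pull() - w_true))
```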
General parallel algorithm template
(diagram: many clients ↔ many masters/servers)
• A client syncs to many masters; a master serves many clients
• A complete graph is bad for the network - use randomized messaging to fix it
Desiderata
• Variable and load distribution
  – large number of objects (a priori unknown)
  – large pool of machines (often faulty)
• Assign objects to machines such that
  – an object goes to the same machine (if possible)
  – machines can be added / fail dynamically
• Consistent hashing (elements, sets, proportional)
  – symmetric, dynamically scalable, fault tolerant
  – for large scale inference
  – for real time data sketches
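The consistent hashing requirement can be sketched in a few lines: machines are placed at many pseudo-random points on a hash ring and a key goes to the next point clockwise, so keys (almost always) keep their machine and adding or removing a machine only remaps the keys adjacent to its points. The ring with virtual nodes and the MD5 hash below are standard textbook choices, not the scheme of any particular system.

```python
import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHash:
    """Hash ring with virtual nodes; a key goes to the next machine clockwise."""
    def __init__(self, machines, replicas=100):
        self.ring = sorted((_h(f"{m}#{r}"), m)
                           for m in machines for r in range(replicas))
        self.points = [p for p, _ in self.ring]
    def owner(self, key):
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHash([f"server{i}" for i in range(8)])
print(ring.owner("user:42"), ring.owner("topic:7"))

# removing a machine only remaps the keys it owned
smaller = ConsistentHash([f"server{i}" for i in range(8) if i != 3])
moved = sum(ring.owner(f"key{k}") != smaller.owner(f"key{k}") for k in range(10000))
print("fraction of keys moved:", moved / 10000)
```

Running the snippet should show that dropping one of eight machines remaps only roughly an eighth of the keys, which is the property the desiderata above ask for.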
Random Caching Trees
• Cache / synchronize an object
• Uneven load distribution
• Must not generate a hotspot (Karger et al. 1999 - the 'Akamai' paper)
• For a given key, pick a random order of the machines
• Map the order onto a tree / star via BFS ordering
Karger et al. 1999, Ahmed et al. 2011
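A sketch of the per-key randomization: the key seeds a pseudo-random permutation of the machines, the first machine becomes the root, and the remaining machines are attached level by level (BFS order) with a fixed branching factor, so traffic for a hot key fans in over a tree instead of hammering a single server. The branching factor and the seeding scheme are illustrative assumptions.

```python
import hashlib
import random

def caching_tree(key, machines, branching=2):
    """Key-specific random order of machines, laid out as a tree via BFS.

    Returns a dict child -> parent; the root caches/owns the key, and
    updates flow up (and cached copies down) along the tree edges.
    """
    seed = int(hashlib.md5(key.encode()).hexdigest(), 16)
    order = list(machines)
    random.Random(seed).shuffle(order)            # same order for the same key
    parent = {order[0]: None}                     # order[0] is the root
    for i, m in enumerate(order[1:], start=1):
        parent[m] = order[(i - 1) // branching]   # BFS placement
    return parent

machines = [f"server{i}" for i in range(7)]
for m, p in caching_tree("topic:sports", machines).items():
    print(f"{m} -> {p}")
```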