MALT: Distributed Data Parallelism for Existing ML Applications
Hao Li*, Asim Kadav, Erik Kruus, Cristian Ungureanu
* University of Maryland, College Park
NEC Laboratories, Princeton
Data, data everywhere…
• User-generated content: Facebook, Twitter, reviews, emails
• Software: transactions, website visits, other metadata
• Hardware: data generated by camera feeds, sensors
• Applications (usually based on ML): ad-click/fraud prediction, recommendations, sentiment analysis, targeted advertising, surveillance, anomaly detection
Timely insights depend on updated models
[Diagram: training data (such as image, label pairs, e.g. an image labeled "labrador") → model parameters → test/deploy]
• Surveillance/safe driving: usually trained in real time; expensive to train over HD videos
• Advertising (ad prediction, ad-bidding): usually trained hourly; expensive to train over millions of requests
• Knowledge banks (automated answering): usually trained daily; expensive to train over a large corpus
Model training challenges
• Large amounts of data to train
  • Explosion in the types, speed and scale of data
    • Types: image, time-series, structured, sparse
    • Speed: sensor, feeds, financial
    • Scale: the amount of data generated is growing exponentially
      • Public datasets: the processed splice genomic dataset is 250 GB, and data subsampling is unhelpful
      • Private datasets: Google, Baidu perform learning over TBs of data
• Model sizes can be huge
  • Models with billions of parameters do not fit in a single machine
  • E.g.: image classification, genome detection
• Model accuracy generally improves by using larger models with more data
Properties of ML training workloads
• Fine-grained and incremental: small, repeated updates to model vectors
• Asynchronous computation: e.g. model-model communication, back-propagation
• Approximate output: stochastic algorithms; exact/strong consistency may be overkill
• Need a rich developer environment: requires a rich set of libraries, tools, graphing abilities
MALT: Machine Learning Toolset
• Run existing ML software in a data-parallel fashion
• Efficient shared memory over RDMA writes to communicate model information
  • Communication: asynchronously push (scatter) model information, gather locally arrived information
  • Network graph: specify which replicas to send updates to
  • Representation: SPARSE/DENSE hints to store model vectors
• MALT integrates with existing C++ and Lua applications
  • Demonstrates fault tolerance and speedup with SVM, matrix factorization and neural networks
  • Re-uses existing developer tools
Outline
• Introduction
• Background
• MALT Design
• Evaluation
• Conclusion
Distributed Machine Learning
• ML algorithms learn incrementally from data
  • Start with an initial guess of the model parameters
  • Compute the gradient over a loss function, and update the model
• Data parallelism: train over large data
  • Data is split over multiple machines
  • Model replicas train over different parts of the data and communicate model information periodically
• Model parallelism: train over large models
  • Models are split over multiple machines
  • A single training iteration spans multiple machines
Stochastic Gradient Descent (SGD)
• SGD trains over one (or a few) training examples at a time
  • Every data example processed is an iteration
  • The update to the model is the gradient
  • The number of data examples (iterations) used to compute a gradient is the batch size
  • One pass over the entire data is an epoch
  • Acceptable performance over the test set after multiple epochs is convergence
• Can train a wide range of ML methods: k-means, SVM, matrix factorization, neural networks etc.
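To make these terms concrete, the following is a minimal serial SGD sketch in C++ (not MALT code; the toy dataset, squared loss, and learning rate are illustrative assumptions). Each processed example is one iteration, each pass over the data is one epoch, and the per-example gradient is the model update.

    // Minimal serial SGD sketch: linear model, squared loss, one example per step.
    #include <cstdio>
    #include <vector>

    int main() {
        // Toy dataset generated by y = 2*x0 + 1*x1 (illustrative assumption).
        std::vector<std::vector<double>> X = {{1, 2}, {2, 1}, {3, 3}, {4, 1}};
        std::vector<double> y = {4, 5, 9, 9};
        std::vector<double> w(2, 0.0);   // model parameters, initial guess
        const double lr = 0.01;          // learning rate

        for (int epoch = 0; epoch < 100; ++epoch) {       // one pass over the data = epoch
            for (std::size_t i = 0; i < X.size(); ++i) {  // one example = one iteration
                double pred = w[0] * X[i][0] + w[1] * X[i][1];
                double err = pred - y[i];                 // derivative of 0.5*(pred - y)^2
                for (std::size_t j = 0; j < w.size(); ++j)
                    w[j] -= lr * err * X[i][j];           // gradient update to the model
            }
        }
        std::printf("w = [%f, %f]\n", w[0], w[1]);        // should approach [2, 1]
        return 0;
    }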
Data-parallel SGD: Mini-batching
• Machines train in parallel over a batch and exchange model information
  • Iterate over data examples faster (in parallel)
  • May need more passes over the data than serial SGD (poor convergence)
Approaches to data-parallel SGD
• Hadoop/Map-reduce: a variant of bulk-synchronous parallelism
  • Synchronous averaging of model updates every epoch (during reduce)
[Diagram: each data split trains a model parameter (map); the model parameters are merged into a single model (reduce)]
1) Infrequent communication produces low-accuracy models
2) Synchronous training hurts performance due to stragglers
Parameter server
• Central server to merge updates every few iterations
  • Workers send updates asynchronously and receive whole models from the server
  • Central server merges incoming models and returns the latest model
  • Examples: DistBelief (NIPS 2012), Parameter Server (OSDI 2014), Project Adam (OSDI 2014)
[Diagram: workers each train a model parameter over their data and exchange it with a central server holding the merged model]
Peer-to-peer approach (MALT)
• Workers send updates to one another asynchronously
  • Workers communicate every few iterations
  • No separate master/slave code to port applications
  • No central server/manager: simpler fault tolerance
[Diagram: workers each train a model parameter over their data and exchange updates directly with one another]
Outline
• Introduction
• Background
• MALT Design
• Evaluation
• Conclusion
MALT framework
[Stack diagram, per replica: existing ML applications → MALT VOL (Vector Object Library) → MALT dStorm (distributed one-sided remote memory) → InfiniBand communication substrate (such as MPI, GASPI)]
• Model replicas train in parallel and use shared memory to communicate
• A distributed file system loads datasets in parallel
dStorm: Distributed one-sided remote memory
• RDMA over InfiniBand allows high-throughput/low-latency networking
  • RDMA over Converged Ethernet (RoCE) support for non-RDMA hardware
• Shared memory abstraction built on one-sided RDMA writes (no reads)
[Diagram: S1.create(size, ALL), S2.create(size, ALL), S3.create(size, ALL) — each of machines 1–3 allocates a primary copy of its own object and receive queues for the other machines' objects O1, O2, O3]
• Similar to partitioned global address space languages: local vs. global memory
scatter() propagates using one-sided RDMA
• Updates propagate based on the communication graph
[Diagram: S1.scatter(), S2.scatter(), S3.scatter() — each machine writes its primary copy into the corresponding receive queues on the other machines]
• Remote CPU not involved: writes go over RDMA
• Per-sender copies do not need to be immediately merged by the receiver
gather() function merges locally
• Takes a user-defined function (UDF) as input, such as average
[Diagram: S1.gather(AVG), S2.gather(AVG), S3.gather(AVG) — each machine merges the received copies of O1, O2, O3 into its local object]
• A useful, general abstraction for data-parallel algorithms: train and scatter() the model vector, then gather() received updates (see the mock sketch below)
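The following is a single-process C++ mock of the scatter()/gather() semantics sketched on the last three slides; it is not the real MALT/RDMA implementation, and all names and signatures here are illustrative. Each replica holds one receive buffer per sender, scatter() copies the sender's object into every replica's slot for that sender, and gather() merges those per-sender copies with a user-defined function such as an average.

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct Replica {
        std::vector<double> local;                  // this replica's model object (its O_i)
        std::vector<std::vector<double>> recv;      // one receive buffer per sender
        Replica(std::size_t dim, std::size_t nsenders)
            : local(dim, 0.0), recv(nsenders, std::vector<double>(dim, 0.0)) {}
    };

    // scatter(): write the sender's object into every replica's slot for this sender.
    // The sender's own slot stands in for its primary copy; the real system does these
    // writes with one-sided RDMA, without involving the receiver's CPU.
    void scatter(const Replica& sender, std::size_t senderId, std::vector<Replica>& peers) {
        for (auto& p : peers)
            p.recv[senderId] = sender.local;
    }

    // gather(): merge the locally arrived per-sender copies with a user-defined function.
    void gather(Replica& r, const std::function<double(const std::vector<double>&)>& udf) {
        for (std::size_t j = 0; j < r.local.size(); ++j) {
            std::vector<double> column;
            for (const auto& buf : r.recv) column.push_back(buf[j]);
            r.local[j] = udf(column);
        }
    }

    int main() {
        auto avg = [](const std::vector<double>& v) {
            double s = 0.0;
            for (double x : v) s += x;
            return s / v.size();
        };
        std::vector<Replica> replicas(3, Replica(4, 3));
        replicas[0].local = {1, 1, 1, 1};               // pretend replica 0 trained a step
        for (std::size_t i = 0; i < replicas.size(); ++i)
            scatter(replicas[i], i, replicas);
        gather(replicas[1], avg);                       // replica 1 now holds the average
        return 0;
    }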
VOL: Vector Object Library
• Exposes vectors/tensors instead of raw memory objects
• Provides representation optimizations
  • sparse/dense parameters stored as arrays or as key-value stores
• Inherits scatter()/gather() calls from dStorm
• Existing applications can use these vectors/tensors directly (see the sketch below)
[Diagram: a vector V1 exposed over the underlying memory object O1]
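As a small illustration of the SPARSE/DENSE hint (not MALT's actual data structures), a dense parameter or gradient can be kept as a flat array while a sparse one is kept as index/value pairs, the shape a key-value store would hold; applying a sparse gradient then touches only its non-zero entries.

    #include <unordered_map>
    #include <vector>

    using DenseVec  = std::vector<double>;               // DENSE: contiguous array
    using SparseVec = std::unordered_map<long, double>;  // SPARSE: index -> value

    // Apply a sparse gradient to a dense parameter vector (indices assumed in range).
    void applySparse(DenseVec& w, const SparseVec& g, double lr) {
        for (const auto& kv : g)
            w[kv.first] -= lr * kv.second;
    }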
Propagating updates to everyone
[Diagram: six model replicas, each training over its own data, all sending updates to every other replica]
O(N^2) communication rounds for N nodes
[Diagram: six model replicas with all-to-all update propagation]
Indirect propagation of model updates
[Diagram: six model replicas; each node sends updates to only a subset of the other nodes]
• Use a uniform random sequence to determine where to send updates, to ensure all updates propagate uniformly
• Each node sends to fewer than N nodes (such as log N)
O(N log(N)) communication rounds for N nodes
• MALT proposes sending models to fewer nodes (log N instead of N)
  • Requires the node graph to be connected
  • Use any uniform random sequence
• Reduces processing/network times
  • Network communication time reduces
  • Time to update the model reduces
  • Iteration speed increases, but may need more epochs to converge
• Key idea: trade off model information recency against savings in network and update-processing time
  • Balance communication with computation
  • Send to fewer or more than log(N) nodes (see the peer-selection sketch below)
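One possible way to realize the log N fan-out is sketched below (illustrative only; the slides do not spell out the exact random sequence MALT uses): each node keeps the first roughly log2(N) entries of a seeded random permutation of the other nodes. Note that a simple per-node shuffle like this does not by itself guarantee the connected node graph that MALT requires.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Pick about log2(N) peers for node `self` out of N nodes.
    std::vector<int> choosePeers(int self, int N, unsigned seed) {
        int k = std::max(1, (int)std::ceil(std::log2((double)N)));   // fan-out ~ log N
        std::vector<int> others;
        for (int i = 0; i < N; ++i)
            if (i != self) others.push_back(i);
        std::mt19937 rng(seed + (unsigned)self);   // per-node seed: deterministic but distinct
        std::shuffle(others.begin(), others.end(), rng);
        if ((int)others.size() > k) others.resize(k);
        return others;
    }

    int main() {
        std::vector<int> peers = choosePeers(/*self=*/0, /*N=*/8, /*seed=*/42);  // ~3 of 7 peers
        for (int p : peers) std::printf("node 0 sends updates to node %d\n", p);
        return 0;
    }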
Converting serial algorithms to parallel

Serial SGD:
    Gradient g;
    Parameter w;
    for epoch = 1:maxEpochs do
      for i = 1:N do
        g = cal_gradient(data[i]);
        w = w + g;

Data-parallel SGD:
    maltGradient g(sparse, ALL);
    Parameter w;
    for epoch = 1:maxEpochs do
      for i = 1:N/ranks do
        g = cal_gradient(data[i]);
        g.scatter(ALL);
        g.gather(AVG);
        w = w + g;

• scatter() performs one-sided RDMA writes to other machines.
• "ALL" signifies communication with all other machines.
• gather(AVG) applies an average to the received gradients.
• An optional barrier() makes the training synchronous.
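To show what the converted loop computes, here is a single-process C++ simulation of the data-parallel pseudocode above (not MALT code; the toy data, the round-robin sharding, and the learning rate are assumptions). Each replica computes a gradient on its own example, and applying the averaged gradient at every replica stands in for scatter(ALL) followed by gather(AVG); because this simulation is fully synchronous, the replicas stay identical, as they would with a barrier().

    #include <cstdio>
    #include <vector>

    int main() {
        // Toy dataset generated by y = 2*x0 + 1*x1, sharded round-robin over R replicas.
        std::vector<std::vector<double>> X = {{1, 2}, {2, 1}, {3, 3}, {4, 1}};
        std::vector<double> y = {4, 5, 9, 9};
        const int R = 2;                 // number of model replicas ("ranks")
        const double lr = 0.01;
        std::vector<std::vector<double>> w(R, std::vector<double>(2, 0.0));

        for (int epoch = 0; epoch < 200; ++epoch) {
            for (std::size_t step = 0; step < X.size() / R; ++step) {
                // Each replica computes a (learning-rate-scaled) gradient on its example.
                std::vector<std::vector<double>> g(R, std::vector<double>(2, 0.0));
                for (int r = 0; r < R; ++r) {
                    std::size_t i = step * R + r;                // replica r's example
                    double pred = w[r][0] * X[i][0] + w[r][1] * X[i][1];
                    double err = pred - y[i];
                    for (std::size_t j = 0; j < 2; ++j)
                        g[r][j] = -lr * err * X[i][j];
                }
                // scatter(ALL) + gather(AVG): every replica applies the averaged gradient.
                std::vector<double> gavg(2, 0.0);
                for (int r = 0; r < R; ++r)
                    for (std::size_t j = 0; j < 2; ++j) gavg[j] += g[r][j] / R;
                for (int r = 0; r < R; ++r)
                    for (std::size_t j = 0; j < 2; ++j) w[r][j] += gavg[j];
            }
        }
        std::printf("replica 0: w = [%f, %f]\n", w[0][0], w[0][1]);  // approaches [2, 1]
        return 0;
    }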