Overviews and practical reports Justin Clarke, Cecilia Ferrando, William Rebelsky
Trends ● The growing volume of data and the demand for Machine Learning data processing challenge us to advance systems ● One trillion IoT devices are expected by 2035 ● A growing need to support Machine Learning computation on edge devices
Specific challenges ● Who should design better systems for Machine Learning? ● What research is needed to address systems issues for Machine Learning? ● How do we improve AI support on edge devices?
Why are these issues important today?
Source: https://mkomo.com/cost-per-gigabyte-update
Source: Forbes.com
Specific problems being addressed in these works ● How do we develop better systems for ML? ● Who should do so? ● What needs to be done in order to improve the systems? ● What research directions should we be taking? ● How do we advance inference on edge devices?
Strategies and Principles of Distributed Machine Learning on Big Data Eric P. Xing *, Qirong Ho, Pengtao Xie, Dai Wei
Overview ● Rise of Big Data -> more demands on Machine Learning ● More demands on Machine Learning -> larger clusters necessary ● Larger clusters necessary -> someone needs to engineer the systems ● Should this fall to Machine Learning researchers or systems researchers? ● View: Big ML benefits from ML-rooted insights; therefore, ML researchers should do the system design
Four Key Questions 1. How can an ML program be distributed over a cluster? 2. How can ML computation be bridged with inter-machine communication? 3. How can such communication be performed? 4. What should be communicated between machines?
Background on Machine Learning ● ML is becoming the primary mechanism for distilling raw data into usable insights ● Most ML programs belong to one of a few families of well-developed approaches ● Conventional ML R&D excels in model, algorithm, and theory development ● In general, ML is an optimization problem:
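The optimization view on the slide above can be written out. This is roughly the general form used by Xing et al. (symbols schematic: θ are the model parameters, L a loss over N data samples, Ω a structure-inducing regularizer or prior):

```latex
\max_{\theta}\; \mathcal{L}\big(\{x_i, y_i\}_{i=1}^{N};\, \theta\big) \;+\; \Omega(\theta)
```

Most of the well-developed ML families on the previous bullet (regression, deep nets, topic models, etc.) fit this template with different choices of L and Ω.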
Distributed Machine Learning ● Given P machines, we would expect a P-fold increase in performance ● However, the current state of the art shows less than a ½P-fold increase ● Much of this gap is due to idealized assumptions in research: ○ Infinitely fast networks ○ All machines process at the same rate ○ No additional users/background tasks ● Two obvious research avenues to improve performance: ○ Improve the convergence rate (number of iterations) ○ Improve throughput (per-iteration time)
Machine Learning vs Traditional Programs ● Machine Learning: ○ Error tolerance: ML programs are robust to minor errors in intermediate steps ○ Dynamic structural dependencies: parameters depend not only on the data but on each other ○ Non-uniform convergence: not all parameters converge in the same number of iterations ● Traditional: ○ Transaction-centric: programs only execute correctly if each step is atomically correct
State of Current Platforms ● Current platforms are general-purpose ● Each piece of software trades off between: ○ Speed of execution ○ Ease of programmability ○ Correctness of solution ● Current systems ensure that the outcome of each program is perfectly reproducible, which is not necessary in ML ● This paper: usable software should instead offer two utilities: ○ A ready-to-run set of ML-workhorse implementations (e.g. MCMC) ○ An ML distributed cluster OS that supports the above implementations
1. How can an ML program be distributed over a cluster? ● ML programs can be parallelized using either model-parallel or data-parallel strategies ● ML requires a mix of both ● Improved efficiency: ○ Compute prioritization of parameters (e.g. give more resources to the parameters that need them) ○ Workload balancing using slow-worker agnosticism ○ Structure-Aware Parallelization (SAP) for scheduling, prioritization, and load balancing
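The slow-worker-agnostic load balancing above can be illustrated with a toy greedy scheduler: each unit of work goes to whichever worker will be free earliest, so fast workers naturally take on more work and nothing waits on a straggler at the end. This is a sketch for intuition only, not Petuum's actual mechanism; the function name and speed model are invented.

```python
def dynamic_assign(n_tasks, speeds):
    """Greedy dynamic scheduling: each task goes to the worker that will
    become free earliest, so fast workers claim more tasks and the job
    does not stall on a straggler (slow-worker agnosticism)."""
    finish = [0.0] * len(speeds)   # time at which each worker becomes free
    counts = [0] * len(speeds)     # tasks assigned to each worker
    for _ in range(n_tasks):
        w = finish.index(min(finish))   # pick the earliest-free worker
        finish[w] += 1.0 / speeds[w]    # task cost on that worker
        counts[w] += 1
    return counts

# Two fast workers and one 10x slower straggler:
counts = dynamic_assign(40, speeds=[10.0, 10.0, 1.0])
```

With static, equal partitioning the straggler would hold the whole iteration hostage; here it simply receives proportionally fewer tasks.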
2. How can ML computation be bridged with inter-machine communication?
Bridging Models: Current State ● Bulk Synchronous Parallel (BSP) bridging model: ○ Workers wait at the end of each iteration until everyone is finished ○ Issue: we don’t get the P-fold speedup ■ The synchronization barrier suffers from stragglers ■ The synchronization barrier can take longer than the iteration itself ● Asynchronous execution: ○ Workers continue iterating and sending updates without waiting for others to finish ○ Issue: less progress per iteration ■ Information becomes stale ■ In the limit, errors can cause slow or incorrect convergence
Bridging Models: Solution ● Combine the two models and get the best of both worlds ● Stale Synchronous Parallel (SSP) bridging ○ Workers who get more than s iterations ahead of any other worker are stopped
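The SSP staleness rule can be sketched as a small clock simulation (hypothetical code, not the paper's implementation; assumes a staleness bound s >= 1). Fast workers run ahead like in asynchronous execution, but are blocked the moment they would exceed the staleness bound:

```python
def simulate_ssp(n_workers, n_iters, s, slow_id, slow_period):
    """Simulate Stale Synchronous Parallel clocks.

    Each tick, every worker tries to run one more iteration, but a worker
    is blocked whenever finishing it would leave it more than `s`
    iterations ahead of the slowest worker. Worker `slow_id` is a
    straggler that only makes progress every `slow_period` ticks.
    Returns the largest fast-vs-slow clock spread ever observed.
    """
    clocks = [0] * n_workers            # iterations completed per worker
    max_spread = 0
    tick = 0
    while min(clocks) < n_iters:
        tick += 1
        for w in range(n_workers):
            if w == slow_id and tick % slow_period != 0:
                continue                        # straggler stalls this tick
            if clocks[w] - min(clocks) < s:     # SSP staleness condition
                clocks[w] += 1
        max_spread = max(max_spread, max(clocks) - min(clocks))
    return max_spread

spread = simulate_ssp(n_workers=3, n_iters=20, s=2, slow_id=2, slow_period=4)
# fast workers run ahead of the straggler, but never by more than s = 2
```

Setting s very large recovers fully asynchronous behavior; small s approaches BSP-like lockstep.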
3. How can such communication be performed? ● Continuous communication with rate limiters in the SSP implementation ● Wait-free Backpropagation ○ Exploits the observation that the top (fully connected) layers account for 90% of the parameters but only 10% of the backpropagation cost ● Update prioritization based on the parameters that change the most (in absolute or relative terms) ● Decentralized storage using Halton topology
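Magnitude-based update prioritization can be sketched as follows (a toy illustration; the function name and the flat list-of-parameters representation are invented). Only the k parameters that moved the most since the last communication round get sent:

```python
def prioritize_updates(params, last_sent, k):
    """Pick the indices of the k parameters whose values changed most
    (in absolute terms) since the last communication round; only those
    updates are transmitted this round."""
    deltas = [abs(p, ) if False else abs(p - q) for p, q in zip(params, last_sent)]
    return sorted(range(len(params)), key=lambda i: deltas[i], reverse=True)[:k]

# Parameter 1 changed by 3.0, parameter 2 by 0.2 -- they win the budget k=2:
idx = prioritize_updates([1.0, 5.0, 2.2, 0.9], [1.0, 2.0, 2.0, 1.0], k=2)
```

Relative-change prioritization would simply divide each delta by the parameter's magnitude before ranking.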
4. What should be communicated between machines? ● Typical clusters can transmit at most a few gigabytes per second between two machines ○ Naive synchronization is not instantaneous ● Sufficient Factor Broadcasting: ○ Gradient updates can be decomposed into sufficient factors, transmitting S(K+D) elements (for S samples) instead of the full K×D matrix ● Convergence is still guaranteed (although it can take extra iterations)
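The sufficient-factor idea can be illustrated in plain Python: for matrix-parameterized models, a per-sample gradient of a K×D parameter matrix W is often rank-1, ΔW = u vᵀ, so only the K + D factor entries need to cross the network (a toy sketch, not the paper's implementation):

```python
def reconstruct_gradient(u, v):
    """Receiver side of Sufficient Factor Broadcasting: rebuild the full
    rank-1 gradient dW = u v^T from the two transmitted factor vectors
    u (length K) and v (length D)."""
    return [[ui * vj for vj in v] for ui in u]

u = [1.0, 2.0]           # K = 2 factor (e.g. derivative of the loss)
v = [3.0, 4.0, 5.0]      # D = 3 factor (e.g. the input sample)
grad = reconstruct_gradient(u, v)
# transmitted: K + D = 5 numbers, vs K * D = 6 for the full matrix --
# the gap widens dramatically for realistic K and D in the thousands
```

For a mini-batch of S samples the sender ships S such (u, v) pairs, hence the S(K+D) vs KD comparison on the slide.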
Strategies and Principles of Distributed Machine Learning on Big Data PROS: 1. Enough relevant background information to explain the issues to people not already in the field 2. Strong justification for ML researchers being involved in designing the systems 3. Separating the issues into 4 major questions allows for more directed research moving forward CONS: 1. Section 4: Petuum a. Claims close to P-fold speedup but doesn’t show data b. Basic implementation that “might become the foundation of an ML distributed cluster operating system” 2. Inconsistent specificity: the ML models are stated carefully, but the solutions only in general terms a. “Continuous communication can be achieved by a rate limiter in the SSP implementation” b. “SSP with properly selected staleness values”
A Berkeley View of Systems Challenges for AI Ion Stoica, Dawn Song, Raluca Ada Popa, David Patterson, Michael W. Mahoney, Randy Katz, Anthony D. Joseph, Michael Jordan, Joseph M. Hellerstein, Joseph Gonzalez, Ken Goldberg, Ali Ghodsi, David Culler, Pieter Abbeel
Trends: ● Mission critical AI ● Personalized AI ● AI across organizations ● AI demands outpacing Moore’s Law
Acting in Dynamic Environments ● Continual Learning ● Robust Decisions ● Explainable Decisions
Secure AI ● Secure enclaves ● Adversarial learning ● Shared learning on confidential data
AI-specific architectures ● Domain-specific hardware ● Composable AI systems ● Cloud-edge systems
Strengths and Weaknesses
Machine Learning at Facebook ● Machine Learning tasks: ranking posts, content understanding, object detection, virtual reality, speech recognition, translation ● Infrastructure: training runs in datacenters; inference runs on the edge to avoid latency
Challenges of edge inference ● “How does Facebook run inference at the edge?” ● Software limitations (diversity) ● Hardware limitations (low performance) ● Optimization
Challenges of edge inference ● Mobile inference runs on old CPU cores (chart: CPU cores by design year)
Challenges of edge inference ● Mobile inference runs on old CPU cores ● There is no “standard” mobile SoC: the most common SoC holds only 4% of the market
Challenges of edge inference ● Mobile inference runs on old CPU cores ● There is no “standard” mobile SoC ● Candidate approaches: GPUs? DSPs? “Holistic” optimization?
Challenges of edge inference ● GPUs? Only 20% of mobile SoCs have a GPU 3x more powerful than the CPUs (Apple devices stand out)
Challenges of edge inference ● DSPs? Digital Signal Processors (co-processors) have little support for vector structures; programmability is an issue; only available on 5% of SoCs
Challenges of edge inference ● “Holistic” optimization? Performance variability is a problem
Facebook mobile inference tools ● FBLearner workflow and optimization ● Caffe2 runs on mobile and is designed for broad support and CNN optimization ● New version of PyTorch designed to accelerate AI from research to production