  1. Overviews and practical reports Justin Clarke, Cecilia Ferrando, William Rebelsky

  2. Trends ● The growing volume of data and the need to process it with Machine Learning challenge us to advance systems ● One trillion IoT devices expected by 2035 ● Growing need to support Machine Learning computation on edge devices

  3. Specific challenges ● Who should design better systems for Machine Learning? ● What research is needed to address systems issues for Machine Learning? ● How do we improve AI support on edge devices?

  4. Why are these issues important today?

  5. [Chart: storage cost per gigabyte over time] Source: https://mkomo.com/cost-per-gigabyte-update

  6. [Chart] Source: Forbes.com

  7. Specific problems being addressed in these works ● How do we develop better systems for ML? ● Who should do so? ● What needs to be done in order to improve the systems? ● What research directions should we be taking? ● How do we advance inference on edge devices?

  8. Strategies and Principles of Distributed Machine Learning on Big Data Eric P. Xing, Qirong Ho, Pengtao Xie, Dai Wei

  9. Overview ● Rise of Big Data -> more demands on Machine Learning ● More demands on Machine Learning -> larger clusters necessary ● Larger clusters -> someone needs to engineer the systems ● Should this fall to Machine Learning researchers or systems researchers? ● View: Big ML benefits from ML-rooted insights; therefore ML researchers should do the system design.

  10. Four Key Questions 1. How can an ML program be distributed over a cluster? 2. How can ML computation be bridged with inter-machine communication? 3. How can such communication be performed? 4. What should be communicated between machines?

  11. Background on Machine Learning ● ML is becoming the primary mechanism for distilling raw data into usable insights ● Most ML programs belong to one of a few families of well-developed approaches ● Conventional ML R&D excels in model, algorithm, and theory development ● In general, an ML program is an optimization problem:
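The slide ends at the colon where the original deck showed the general form. A hedged reconstruction, following the paper's framing: the program searches for model parameters Θ that maximize an objective built from a data-fitting term f over the data D plus a structural/regularization term r (the symbol names here are my own choice):

\max_{\Theta} \; \mathcal{L}(\Theta; D) \;=\; \max_{\Theta} \big[ f(D, \Theta) + r(\Theta) \big]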

  12. Distributed Machine Learning ● Given P machines, we would ideally expect a P-fold increase in performance ● However, the current state of the art shows less than a ½P-fold increase (e.g., 8 machines yielding under a 4x speedup) ● Much of this gap is due to idealized assumptions in research: ○ Infinitely fast networks ○ All machines process at the same rate ○ No additional users/background tasks ● Two obvious research avenues to improve performance: ○ Improve convergence rate (fewer iterations) ○ Improve throughput (shorter per-iteration time)

  13. Machine Learning vs Traditional Programs ● Machine Learning: ○ Error tolerance: ML programs are robust to minor errors in intermediate steps ○ Dynamic structural dependencies: parameters depend not only on the data but on each other ○ Non-uniform convergence: not all parameters converge in the same number of iterations ● Traditional: ○ Transaction-centric: only execute correctly if each step is atomically correct

  14. State of current platforms ● Current platforms are general-purpose ● For each piece of software there is a tradeoff between: ○ Speed of execution ○ Ease of programmability ○ Correctness of solution ● Current systems ensure that the outcome of each program is perfectly reproducible, which is not necessary in ML ● This paper: usable software should instead offer two utilities: ○ A ready-to-run set of ML-workhorse implementations (e.g., MCMC) ○ An ML distributed cluster OS that supports the above implementations

  15. 1. How can an ML program be distributed over a cluster? ● ML programs can be parallelized using either model parallelism or data parallelism strategies ● Large-scale ML requires a mix of both ● Improved efficiency: ○ Compute prioritization of parameters (e.g., give more resources to the parameters that need them) ○ Workload balancing using slow-worker agnosticism ○ Create a Structure-Aware Parallelization (SAP) for scheduling, prioritization, and load balancing
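A minimal illustrative sketch (my own, not from the paper) of the two partitioning strategies named on this slide: data parallelism shards the training samples across workers, while model parallelism shards the parameters.

# Minimal sketch (illustrative only) of the two partitioning strategies:
# data parallelism splits the samples across workers, model parallelism
# splits the parameters across workers.
import numpy as np

def data_parallel_split(samples, num_workers):
    """Each worker gets a shard of the data and a full copy of the model."""
    return np.array_split(samples, num_workers)

def model_parallel_split(params, num_workers):
    """Each worker owns a shard of the parameters and sees all the data."""
    return np.array_split(params, num_workers)

samples = np.arange(10)              # 10 training examples
params = np.arange(6, dtype=float)   # 6 model parameters
print(data_parallel_split(samples, 3))   # [0..3], [4..6], [7..9]
print(model_parallel_split(params, 2))   # [0,1,2], [3,4,5]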

  16. 2. How can ML computation be bridged with inter-machine communication?

  17. Bridging Models: Current state ● Bulk Synchronous Parallel (BSP) bridging model: ○ Workers wait at the end of each iteration until everyone is finished ○ Issue: we don't get the P-fold speedup ■ The synchronization barrier suffers from stragglers ■ The synchronization barrier can take longer than the iteration itself ● Asynchronous execution: ○ Workers continue iterating and sending updates without waiting for others to finish ○ Issue: less progress per iteration ■ Information becomes stale ■ In the limit, errors can cause slow or incorrect convergence

  18. Bridging Models: Solution ● Combine the two models and get the best of both worlds ● Stale Synchronous Parallel (SSP) bridging: ○ Workers that get more than s iterations ahead of the slowest worker are blocked until it catches up
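A minimal sketch (not the paper's implementation) of the SSP rule just described: a worker may run ahead, but blocks once it is more than a staleness threshold s ahead of the slowest worker.

# Minimal sketch of the Stale Synchronous Parallel (SSP) rule.
def can_proceed(worker_iter, all_worker_iters, staleness):
    """Return True if this worker may start its next iteration under SSP."""
    slowest = min(all_worker_iters)
    return worker_iter - slowest <= staleness

# Example: with staleness s = 3, a worker at iteration 10 must wait
# if the slowest worker is still at iteration 6.
assert can_proceed(9, [6, 8, 9], staleness=3) is True
assert can_proceed(10, [6, 8, 10], staleness=3) is False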

  19. 3. How can such communication be performed? ● Continuous communication with rate limiters in the SSP implementation ● Wait-free backpropagation: ○ Exploits the fact that the top fully connected layers account for ~90% of the parameters but only ~10% of the backpropagation computation ● Update prioritization based on the parameters that change the most (in absolute or relative terms) ● Decentralized storage using a Halton-sequence topology
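To illustrate the update-prioritization bullet, here is one possible (assumed, not the paper's) realization: communicate only the parameters whose values changed the most since the last synchronization, ranked by absolute or relative change.

# Minimal sketch of update prioritization by magnitude of change.
import numpy as np

def top_k_changed(prev, curr, k, relative=False):
    """Indices of the k parameters with the largest change since last sync."""
    delta = np.abs(curr - prev)
    if relative:
        delta = delta / (np.abs(prev) + 1e-12)
    return np.argsort(delta)[-k:]

prev = np.array([1.0, 2.0, 3.0, 4.0])
curr = np.array([1.1, 2.0, 3.9, 4.05])
print(top_k_changed(prev, curr, k=2))   # indices of the two largest changes -> [0 2]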

  20. 4. What should be communicated between machines? ● Typical clusters can transmit at most a few gigabytes per second between two machines ○ Naive synchronization is therefore not instantaneous ● Sufficient Factor Broadcasting: ○ Low-rank gradient updates can be decomposed into sufficient factors, so machines transmit S(K+D) elements instead of K·D ● Convergence is still guaranteed (although it can take extra iterations)
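A minimal sketch (my own numbers, not the paper's code) of the sufficient-factor idea: for many models the gradient of a K x D weight matrix on one sample is a rank-1 outer product u v^T, so a worker can transmit the two vectors (K + D numbers) and let the receiver rebuild the full update locally.

# Minimal sketch of Sufficient Factor Broadcasting for one sample (S = 1).
import numpy as np

K, D = 100, 2000             # e.g. number of classes x feature dimension
u = np.random.randn(K)       # "sufficient factor" from the loss derivative
v = np.random.randn(D)       # the input feature vector

full_gradient = np.outer(u, v)      # K*D = 200,000 numbers
transmitted = u.size + v.size       # K+D = 2,100 numbers

# The receiver reconstructs the same update locally:
reconstructed = np.outer(u, v)
assert np.allclose(full_gradient, reconstructed)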

  21. Strategies and Principles of Distributed Machine Learning on Big Data PROS: 1. Enough relevant background information to explain the issues to people not already in the field 2. Strong justification for ML researchers being involved in designing the systems 3. Separating the issues into 4 major questions allows for more directed research moving forward CONS: 1. Section 4: Petuum a. Claims close to P-fold speedup but doesn’t show data b. Basic implementation that “might become the foundation of an ML distributed cluster operating system” 2. Inconsistent specificity: the ML models are stated carefully, but the solutions are described only in general terms a. “Continuous communication can be achieved by a rate limiter in the SSP implementation” b. “SSP with properly selected staleness values”

  22. A Berkeley View of Systems Challenges for AI Ion Stoica, Dawn Song, Raluca Ada Popa, David Patterson, Michael W. Mahoney, Randy Katz, Anthony D. Joseph, Michael Jordan, Joseph M. Hellerstein, Joseph Gonzalez, Ken Goldberg, Ali Ghodsi, David Culler, Pieter Abbeel

  23. Trends: ● Mission critical AI ● Personalized AI ● AI across organizations ● AI demands outpacing Moore’s Law

  24. Acting in Dynamic Environments ● Continual Learning ● Robust Decisions ● Explainable Decisions

  25. Secure AI ● Secure enclaves ● Adversarial learning ● Shared learning on confidential data

  26. AI-specific architectures ● Domain-specific hardware ● Composable AI systems ● Cloud-edge systems

  27. Strengths and Weaknesses

  28. [Cecilia]

  29. Machine Learning at Facebook ● MACHINE LEARNING TASKS: ranking posts, content understanding, object detection, virtual reality, speech recognition, translation ● INFRASTRUCTURE: training runs in DATACENTERS; inference runs on the EDGE to avoid latency

  30. Challenges of edge inference ● “How does Facebook run inference at the edge?” ● Software limitations (diversity) ● Hardware limitations (low performance) ● Optimization

  31. Challenges of edge inference ● Mobile inference runs on old CPU cores [chart: distribution of CPU core design years]

  32. Challenges of edge inference ● Mobile inference runs on old CPU cores ● There is no “standard” mobile SoC: the most common SoC has only 4% of the market

  33. Challenges of edge inference ● Mobile inference runs on old CPU cores ● There is no “standard” mobile SoC ● Candidate remedies: GPUs? DSPs? “Holistic” optimization?

  34. Challenges of edge inference ● GPUs? Only 20% of mobile SoCs have a GPU 3x more powerful than the CPU (Apple devices stand out)

  35. Challenges of edge inference ● DSPs? Digital Signal Processors (co-processors) have little support for vector structures ○ Programmability is an issue ○ Only available on 5% of the SoCs

  36. Challenges of edge inference ● “Holistic” optimization? ○ Performance variability

  37. Facebook mobile inference tools ● FBLearner handles workflow and optimization ● Caffe2 runs on mobile and is designed for broad device support and CNN optimization ● A new version of PyTorch is designed to accelerate AI from research to production
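To make the research-to-production path concrete, here is a minimal, hypothetical sketch of exporting a small PyTorch model to ONNX so a mobile-friendly runtime (such as Caffe2) can serve it; the model, input shape, and file name are illustrative choices of mine, not Facebook's actual pipeline.

# Hypothetical sketch: export a toy PyTorch model to ONNX for mobile serving.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 3, 224, 224)   # example input shape
torch.onnx.export(model, dummy_input, "tiny_classifier.onnx")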

  38. Facebook mobile inference tools [figure]
