Distributed Machine Learning with a Serverless Architecture
Hao Wang — presentation slides, INFOCOM'19, Paris, FR


  1. INFOCOM'19, Paris, FR — Distributed Machine Learning with a Serverless Architecture. Hao Wang¹, Di Niu², Baochun Li¹ (¹University of Toronto, ²University of Alberta)

  2. What is machine learning?

  3. Deep Learning

  4. Machine Learning — numerical optimization, driven by gradients

  5. ML Workflow — [diagram] inputs and constraints (datasets, objectives such as loss and convergence rate, data, budget) flow through resource reservation, model design, model tuning, and training & evaluation

  6. Our Key Insights • Most current ML training jobs are data-parallel • Model quality and resource investment have a nonlinear relation • ML training is inevitably a trial-and-error process

  7. Distributed ML Infrastructure
              IaaS                          PaaS
  Pricing     Per hour                      Per hour
  Maintenance By users                      By providers
  Examples    AWS EC2, Google Cloud         Azure ML Studio, Google Cloud
              Compute Engine, …             ML Engine, …

  8. Serverless?
              IaaS       PaaS              Serverless
  Pricing     Per hour   Per hour          Per call
  Maintenance By users   By providers      By providers
  Examples    AWS EC2    Azure ML Studio   AWS Lambda

  9. Serverless Computing? • Stateless functions: only input and output, no intermediate states

  10. Go Serverless?
  Pros: 1. Flexible concurrency 2. Instant response 3. Easy to deploy 4. Cheap? (priced per Runtime × MemSize)
  Cons: 1. Execution model is too simple 2. Runtime limitations (~15 min) 3. Communication overhead

  11. λ — ML Training on Serverless? • MapReduce on serverless cloud (PyWren [SoCC'17]) • Video processing on serverless cloud (Sprocket [SoCC'18])

  12. Stochastic Gradient Descent (SGD) — for each input sample (x_i, y_i): θ_j = θ_j + α (y_i − h_θ(x_i)) x_j^i

  13. Mini-batch SGD — for a mini-batch of b input samples starting at index i: θ_j = θ_j + (α/b) Σ_{k=i}^{i+b−1} (y_k − h_θ(x_k)) x_j^k
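The mini-batch update above can be sketched in a few lines of NumPy; the logistic-regression hypothesis h_θ (a sigmoid), the toy data, and the hyperparameters below are illustrative choices, not the paper's setup.

```python
import numpy as np

def sigmoid(z):
    # h_theta(x) for logistic regression
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd_step(theta, X, y, i, b, alpha):
    """One mini-batch update: theta_j = theta_j + (alpha/b) * sum_k (y_k - h(x_k)) x_j^k."""
    Xb, yb = X[i:i + b], y[i:i + b]          # the b samples starting at index i
    residual = yb - sigmoid(Xb @ theta)      # (y_k - h_theta(x_k)) for the batch
    return theta + (alpha / b) * (Xb.T @ residual)

# toy data: two separable samples
X = np.array([[1.0, 2.0], [1.0, -2.0]])
y = np.array([1.0, 0.0])
theta = np.zeros(2)
for _ in range(200):
    theta = minibatch_sgd_step(theta, X, y, i=0, b=2, alpha=0.5)
preds = sigmoid(X @ theta)                   # should separate the two samples
```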

  14. Parameter Server • Model replicas on workers • Servers update parameters Li, Mu, et al. Scaling distributed machine learning with the parameter server. OSDI'14

  15. SGD on Lambda — [diagram] input samples are partitioned across multiple stateless functions; each function runs the mini-batch update θ_j = θ_j + (α/b) Σ_{k=i}^{i+b−1} (y_k − h_θ(x_k)) x_j^k and exchanges parameters through KV storage

  16. ML Training on Lambda — [diagram] input samples feed parallel functions that share model state through KV storage
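One way to picture such a function: it reads the current model from shared key-value storage, applies a gradient step, and writes it back, keeping no state of its own. In this sketch a plain dict stands in for S3/KV storage, and kv_get/kv_put and the handler are hypothetical helpers, not AWS Lambda's actual API.

```python
import numpy as np

KV = {}  # stand-in for the shared key-value store (e.g., S3)

def kv_get(key):
    return KV[key]

def kv_put(key, value):
    KV[key] = value

def gradient_handler(X, y, alpha=0.1):
    """Stateless worker: nothing survives between invocations except the KV store."""
    theta = kv_get("theta")                        # fetch current parameters
    preds = 1.0 / (1.0 + np.exp(-(X @ theta)))     # h_theta(x)
    grad = X.T @ (y - preds) / len(y)              # mini-batch gradient
    kv_put("theta", theta + alpha * grad)          # push the update back

kv_put("theta", np.zeros(2))                       # initialize the shared model
gradient_handler(np.array([[1.0, 2.0]]), np.array([1.0]))
```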

  17. Toy Example
  • Workload: a logistic regression model
  • AWS Lambda: 20 functions / 150 functions / X functions (dynamic # of functions), with S3 storage
  • EC2 c5.2xlarge: 8 CPUs, 16 GB memory, local storage

  18. Toy Example — [plots] loss value vs. training time and loss value vs. monetary cost, comparing 20 functions, 150 functions, X functions, and an 8-core EC2 instance

  19. Toy Example — the fixed settings trade off as slowest & not cheap, fastest & expensive, and fast & cheap. The X-function schedule adapts: 120 functions in the first epoch, 20 in intermediate epochs, 10 in the last epoch

  20. Challenges • Functions on Serverless — limitations on performance and deployment • Dynamic Resource Provisioning — speed vs. cost (given a budget, how fast can training go?)

  21. Siren • Hybrid Synchronous Parallel (HSP) • Experience-Driven Resource Scheduler

  22. Architecture — [diagram] Cloud: stateless functions running a user-defined code package (model, API, libs). Local: client, function manager, scheduler, and DRL agent, exchanging function status, resource schemes, states, and actions (Steps 1–3)

  23. Enforce Parallelism on Siren

  24. Synchronous or Asynchronous? — [diagram] synchronous training vs. asynchronous training

  25. Hybrid Synchronous Parallel (HSP) — [diagram] within epoch t, each function f_{t,i} repeatedly fetches input, computes, and updates parameters per mini-batch asynchronously; all functions synchronize at the epoch boundary before epoch t+1
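The barrier structure in the figure can be mimicked with threads: workers apply mini-batch updates asynchronously within an epoch and meet only at epoch boundaries. The worker/batch counts and the shared update log below are illustrative, not Siren's implementation.

```python
import threading

N_WORKERS, BATCHES_PER_EPOCH, EPOCHS = 4, 5, 3
epoch_barrier = threading.Barrier(N_WORKERS)  # the only synchronization point
updates = []                                  # log of applied updates: (epoch, worker)
log_lock = threading.Lock()

def worker(wid):
    for epoch in range(EPOCHS):
        for _ in range(BATCHES_PER_EPOCH):
            with log_lock:                    # async within the epoch: no ordering
                updates.append((epoch, wid))  # imposed across workers' mini-batches
        epoch_barrier.wait()                  # synchronize only at the epoch boundary

threads = [threading.Thread(target=worker, args=(w,)) for w in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because of the barrier, every epoch-t update lands in the log before any epoch-(t+1) update, while updates inside an epoch interleave freely.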

  26. Experience-Driven Scheduler

  27. Toy Example - Find the X — choosing between slowest & not cheap, fastest & expensive, and fast & cheap

  28. Deep Reinforcement Learning — [diagram] the agent's policy π(a_t | s_{t−1}, θ) maps state features s_{t−1} to action a_t on the environment (the stateless functions), which returns reward r_t; θ are the policy parameters

  29. State: s_t = (t, ℓ_t, P_t, P^F_t, P^C_t, P^U_t, u_t, w_t, b_t)

  30. Action: a_t = (n_t, m_t), with n_t, m_t ∈ ℤ⁺ — n_t × m_t choices, ~138,000 actions on AWS. Policy: approximated with a Gaussian distribution, π(a | s, θ) = (1 / (√(2π) σ(s, θ))) exp(−(a − μ(s, θ))² / (2 σ(s, θ)²))
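Concretely, sampling from such a policy might look like the sketch below; the μ/σ values and the clipping bounds are placeholders (in Siren both come from the policy network given the state).

```python
import math
import random

def gaussian_policy_sample(mu, sigma, lo, hi):
    """Sample a ~ N(mu, sigma^2), then round/clip to a valid integer action."""
    a = random.gauss(mu, sigma)
    return max(lo, min(hi, round(a)))

def gaussian_log_prob(a, mu, sigma):
    """ln pi(a|s,theta) for the density exp(-(a-mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return -((a - mu) ** 2) / (2.0 * sigma ** 2) - math.log(sigma * math.sqrt(2.0 * math.pi))

n_t = gaussian_policy_sample(mu=120.0, sigma=15.0, lo=1, hi=1000)    # number of functions
m_t = gaussian_policy_sample(mu=512.0, sigma=64.0, lo=128, hi=3008)  # memory size (MB)
```

Sampling a continuous value and rounding sidesteps enumerating all ~138,000 discrete (n_t, m_t) pairs.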

  31. Reward — at each epoch t = 1, …, T−1: r_t = −β P_t (a regularizer). At the final epoch T (the expected loss value is reached, or the budget is used up): a constant as the final reward/penalty

  32. Training — maximize the cumulative discounted reward: max Σ_{t=1}^{T} γ^t r_t, where γ ∈ (0, 1] is the discount factor. Policy gradient: ∇_θ 𝔼_π[Σ_{t=1}^{T} γ^t r_t] = 𝔼_π[∇_θ ln π(a | s, θ) q_π(s, a)], where π is the policy function and q_π(s, a) is the expected reward with action a and state s
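A one-dimensional REINFORCE-style sketch of this update: a Gaussian policy whose mean is the only learned parameter, a synthetic reward, and one-step episodes. The real scheduler learns a policy network over (n_t, m_t), so everything below is illustrative.

```python
import random

random.seed(0)
mu, sigma = 0.0, 1.0          # Gaussian policy: a ~ N(mu, sigma^2); mu is learned
lr, gamma = 0.02, 0.99        # step size and discount factor

def reward(a):
    return -(a - 3.0) ** 2    # synthetic reward: the best action is a = 3

for _ in range(5000):
    a = random.gauss(mu, sigma)           # sample an action from the policy
    G = gamma * reward(a)                 # discounted return of a one-step episode
    # grad_mu ln pi(a | mu, sigma) = (a - mu) / sigma^2
    mu += lr * G * (a - mu) / sigma ** 2  # ascent step on the expected reward
```

mu drifts toward the high-reward region around 3 without ever differentiating through the reward itself, which is exactly why policy gradients suit a black-box environment like a cloud platform.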

  33. DRL — [diagram] the same agent–environment loop: policy π(a_t | s_{t−1}, θ), state s_{t−1}, action a_t, reward r_t, with the stateless functions as the environment

  34. Workflow — [diagram] five steps across the local client, function manager, scheduler, and DRL agent and the cloud-side stateless functions: submitting the resource scheme, deploying the user-defined code package (model, API, libs), collecting function status, and exchanging states and actions between the scheduler and the DRL agent

  35. Evaluation • Simulation: OpenAI Gym • Testbed: AWS Lambda + AWS EC2 • 44.3% ⬇ on job completion time

  36. Simulation - overview • Workload: mini-batch SGD algorithms • Goal: DRL agent vs. grid search (over # of functions)

  37. Simulation - grid search — [plots] total rewards vs. number of functions for grid search (GS), and training time vs. budget for Siren against GS-50, GS-100, GS-200, and GS-300, with annotated gains of 10.03%, 12.87%, and 36%

  38. Simulation

  39. Simulation - DRL training — [plots] number of functions per ML-training epoch and total rewards over DRL iterations for Siren-100, Siren-200, and Siren-300

  40. Testbed
  • Siren on AWS Lambda vs. MXNet on EC2: m4.large (2 vCPU, 8 GB memory, $0.1/hr), m4.xlarge (4 vCPU, 16 GB memory, $0.2/hr), m4.2xlarge (8 vCPU, 32 GB memory, $0.4/hr)
  • Workload: LeNet on MNIST, a CNN on movie reviews, linear classification on a click-through prediction dataset

  41. Testbed - Siren and EC2 on LeNet — [plots] training time and cost for m4.large, m4.xlarge, and m4.2xlarge over 2–8 EC2 instances vs. Siren

  42. Testbed - DRL training — [plots] number of functions and memory (MB) per LeNet training epoch, and total rewards over DRL iterations

  43. Testbed - time vs. cost — [plot] training time against monetary cost for Siren and EC2

  44. Testbed - given the same cost — [plots] training time of Siren vs. m4.2xlarge on LeNet, CNN, and Linear Classification

  45. Conclusion • Siren: Distributed Machine Learning with a Serverless Architecture - Hybrid Synchronous Parallel (HSP) - Experience-Driven Resource Scheduler • Evaluation - Simulation & Testbed - 44.3% ⬇ on job completion time

  46. Q&A Thank You
