Cirrus: A Serverless Framework for End-to-end ML Workflows

  1. Cirrus: A Serverless Framework for End-to-end ML Workflows
     Joao Carreira, Pedro Fonseca, Alexey Tumanov, Andrew Zhang, Randy Katz

  2. Machine Learning

  8. End-to-end ML workflows
     ● Modern end-to-end ML workflows are complex
     ● ML workflows consist of 3 heterogeneous stages: Dataset Preprocessing, Model Training, Hyperparameter Tuning
     ● ML workflows are interactive and iterative

  9. Provisioning ML workflows
     Provisioning ML workflows is challenging:
     ● Hard to accurately estimate resource demands of each stage
     ● Data scientists have limited systems expertise
     ● Complex infrastructure management detracts from ML work
     ● Resource waste due to overprovisioning

  10. Serverless computing
      [Diagram labels: Code, Output, Input, AWS S3]

  11. Serverless computing
      [Diagram labels: Code, Output, Input]
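The serverless model in the diagram (code that turns an input into an output, with an object store such as AWS S3 holding the data) can be sketched as a stateless handler. This is a minimal illustration, not code from Cirrus: `OBJECT_STORE`, `handler`, and the key names are hypothetical, and an in-memory dict stands in for S3 so the sketch runs without cloud credentials.

```python
# Hypothetical in-memory stand-in for an object store such as AWS S3.
OBJECT_STORE = {"input/numbers": [1, 2, 3, 4]}


def handler(event, context=None):
    """Stateless function: read input from the store, compute, write output back."""
    values = OBJECT_STORE[event["input_key"]]
    result = sum(v * v for v in values)  # the "code" step: sum of squares
    OBJECT_STORE[event["output_key"]] = result
    return {"output_key": event["output_key"]}


response = handler({"input_key": "input/numbers", "output_key": "output/sum_squares"})
print(OBJECT_STORE[response["output_key"]])  # 30
```

Because the function keeps no state between invocations, everything it needs has to arrive through the event or the store, which is the property the later slides on storage and communication build on.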

  12. Serverless computing benefits
      ● Tight provisioning of resources: fine-grained billing, fine-grained resources, high elasticity
      ● Simplifying infrastructure management: automatic resource configuration / provisioning / maintenance

  13. Challenges of serverless
      ● Small local memory and storage
      ● Low bandwidth and no P2P communication
      ● Limited lambda package size
      ● Lack of fast shared storage
      ● Short-lived and unpredictable launch times

  16. Existing approaches
      Serverless frameworks (PyWren):
      ● Download dependencies from S3 (limited package size)
      ● High-latency communication through S3 (no fast storage)
      ● Stragglers (unpredictable launch times)
      Machine learning frameworks:
      ● Unable to launch runtimes in lambdas (small memory)
      ● No ring/tree reduces, no driver-to-worker communication (no P2P communication)
      ● Precludes MPI (unpredictable launch times)

  17. Cirrus: a framework for serverless end-to-end ML workflows

  18. Cirrus: design principles
      1) Addressing serverless challenges
      ● Low memory, limited package size → ultra-lightweight runtime + data prefetching
      ● No P2P communication, no fast storage → high-performance data store (parameter server and KV)
      ● Short lifetimes and unpredictable launch → robust handling of lambda termination
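Data prefetching under a small memory budget can be sketched as a background thread that fills a bounded buffer while the worker computes. This is a minimal sketch under stated assumptions, not the Cirrus runtime: `fetch_minibatch` and `PrefetchingIterator` are hypothetical names, and the remote read is simulated locally.

```python
import queue
import threading


def fetch_minibatch(index):
    """Hypothetical stand-in for a high-latency remote read of one minibatch."""
    return [index * 10 + i for i in range(4)]


class PrefetchingIterator:
    """Fetches minibatches on a background thread into a small bounded buffer."""

    _DONE = object()  # sentinel marking the end of the data

    def __init__(self, num_batches, buffer_size=2):
        # The bound on the queue caps how much prefetched data sits in memory.
        self._buffer = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._producer, args=(num_batches,), daemon=True
        )
        self._thread.start()

    def _producer(self, num_batches):
        for i in range(num_batches):
            self._buffer.put(fetch_minibatch(i))  # blocks while the buffer is full
        self._buffer.put(self._DONE)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._buffer.get()  # waits only if the producer has fallen behind
        if item is self._DONE:
            raise StopIteration
        return item


batches = list(PrefetchingIterator(num_batches=3))
print(batches)  # [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
```

The buffer size is the knob that trades memory for latency hiding: a larger buffer tolerates slower fetches but uses more of the lambda's limited memory.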

  19. Cirrus: design principles
      2) Achieving benefits for end-to-end ML
      ● Tight provisioning of resources → per-stage fine-grained variable agile scalability
      ● Simplifying infrastructure management → high-level API supports end-to-end ML

  20. Cirrus architecture (client side)
      [Diagram: client side (stateful)]
      ● Data scientist
      ● Client frontend: Dashboard; Python API (Preproc., Training, Tuning)
      ● Create/Stop Task
      ● Client backend: Task Scheduler, Lambda Manager

  21. Cirrus Dashboard

  25. Cirrus architecture (server side)
      [Diagram: server side (stateless)]
      ● Cirrus runtime: Data Iterator API (Minibatch Buffer); Sparse LR, Mat. Fact., LDA; Data store client API
      ● Data store client API calls: put (gradient), get (model), put/get (key)
      ● Data store: PS API (SGD, Adagrad, Momentum; Models) and Key-value API (Key-values)
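The put (gradient) / get (model) interaction with the parameter-server side of the data store can be illustrated with a toy in-memory version. This is a sketch under assumptions, not the Cirrus data store: the class name, method names, and hyperparameter defaults are hypothetical, and the update rules are the textbook forms of SGD, momentum, and Adagrad applied only to the coordinates a sparse gradient touches.

```python
import math


class ParameterServer:
    """Toy in-memory parameter server: workers put gradients and get the model."""

    def __init__(self, dim, lr=0.1, rule="sgd", momentum=0.9, eps=1e-8):
        self.model = [0.0] * dim
        self.lr, self.rule, self.momentum, self.eps = lr, rule, momentum, eps
        # Per-weight state: velocity (momentum) or squared-gradient sum (Adagrad).
        self.state = [0.0] * dim

    def put_gradient(self, sparse_grad):
        """Apply a sparse gradient given as {index: value}; untouched weights stay put."""
        for i, g in sparse_grad.items():
            if self.rule == "sgd":
                step = self.lr * g
            elif self.rule == "momentum":
                self.state[i] = self.momentum * self.state[i] + self.lr * g
                step = self.state[i]
            elif self.rule == "adagrad":
                self.state[i] += g * g
                step = self.lr * g / (math.sqrt(self.state[i]) + self.eps)
            else:
                raise ValueError(f"unknown rule: {self.rule}")
            self.model[i] -= step

    def get_model(self, indices):
        """Return only the requested weights, as a worker with sparse data would need."""
        return {i: self.model[i] for i in indices}


ps = ParameterServer(dim=4, lr=0.5, rule="sgd")
ps.put_gradient({1: 2.0, 3: -1.0})
print(ps.get_model([0, 1, 3]))  # {0: 0.0, 1: -1.0, 3: 0.5}
```

Keeping the update rule on the server means workers only ship gradients, so a worker that is terminated mid-run loses at most one in-flight update.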

  26. Cirrus evaluation
      1. Cirrus provides benefits by specializing both for serverless and end-to-end ML
      2. We show that Cirrus outperforms a state-of-the-art serverless system: PyWren

  27. Evaluation setup
      1. Deployment: AWS Lambdas (3 GB of memory)
      2. Benchmark: asynchronous distributed SGD on a Sparse Logistic Regression task
      3. Dataset: Criteo Dataset (a dataset of display ads)
      4. PyWren:
         a. Baseline: iterative synchronous SGD training using AWS S3 to store gradients and model
         b. + 3 incremental optimizations
      5. Cirrus: 2 modes (with/without prefetching)
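For readers unfamiliar with the benchmark, the gradient a worker computes for sparse logistic regression can be sketched as follows. The function names and the dict-based sparse encoding are assumptions made for illustration; the point is that the gradient only has entries for features that appear in the minibatch, which keeps both the computation and the data sent to the store small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sparse_lr_gradient(weights, minibatch):
    """Gradient of the mean log loss over a minibatch.

    weights: {feature_index: weight}; missing indices count as 0.0.
    minibatch: list of ({feature_index: value}, label) with label in {0, 1}.
    """
    grad = {}
    for features, label in minibatch:
        z = sum(weights.get(i, 0.0) * x for i, x in features.items())
        error = sigmoid(z) - label
        for i, x in features.items():
            grad[i] = grad.get(i, 0.0) + error * x / len(minibatch)
    return grad


# Two hypothetical examples touching only features 0 and 5.
batch = [({0: 1.0, 5: 2.0}, 1), ({5: 1.0}, 0)]
print(sparse_lr_gradient({}, batch))  # {0: -0.25, 5: -0.25}
```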

  28. Cirrus outperforms vanilla serverless
      ● Synchronous SGD training suffers from stragglers
      [Plot: test loss]
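Why stragglers hurt synchronous training can be shown with a small calculation: each synchronous round lasts as long as its slowest worker, so a single slow lambda per round stalls everyone. The per-round times below are made-up numbers chosen for illustration, not measurements from the talk.

```python
def synchronous_time(worker_times_per_round):
    """Total wall-clock time when every round waits for its slowest worker."""
    return sum(max(times) for times in worker_times_per_round)


def mean_worker_busy_time(worker_times_per_round):
    """Average time a single worker actually spends computing over all rounds."""
    num_workers = len(worker_times_per_round[0])
    return sum(sum(times) for times in worker_times_per_round) / num_workers


# Hypothetical compute times in seconds for 4 workers; one straggler per round.
rounds = [
    [1.0, 1.0, 1.0, 5.0],
    [1.0, 4.0, 1.0, 1.0],
    [6.0, 1.0, 1.0, 1.0],
]
print(synchronous_time(rounds))       # 15.0
print(mean_worker_busy_time(rounds))  # 6.0
```

In this toy case workers are busy for 6 of the 15 seconds on average and idle for the rest, which is the waste that asynchronous updates avoid.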

  29. Cirrus outperforms vanilla serverless
      ● Multiple SGD iterations on each lambda invocation
      ● Asynchronous SGD
      [Plot: test loss]
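Running several asynchronous SGD iterations inside one invocation can be sketched as a loop that pulls the current model, pushes a gradient without waiting for other workers, and stops before a time budget runs out. Everything here is a hypothetical illustration: `run_worker`, its parameters, and the time-budget mechanism are assumptions, and the toy problem is a single scalar weight rather than a real model.

```python
import time


def run_worker(get_model, put_gradient, compute_gradient, minibatches,
               time_budget_s, clock=time.monotonic):
    """Run as many SGD iterations as fit in one invocation's time budget.

    Returns the number of minibatches consumed so a relaunched worker can resume.
    """
    deadline = clock() + time_budget_s
    done = 0
    for batch in minibatches:
        if clock() >= deadline:
            break  # stop cleanly before the platform would terminate the function
        model = get_model()  # latest model, possibly already updated by other workers
        put_gradient(compute_gradient(model, batch))  # no barrier with peers
        done += 1
    return done


# Toy shared model: one weight w, loss (w - target)^2 per minibatch.
model = {"w": 0.0}


def get_model():
    return dict(model)


def put_gradient(grad):
    model["w"] -= 0.5 * grad  # server-side SGD step with learning rate 0.5


def compute_gradient(current, target):
    return 2 * (current["w"] - target)


done = run_worker(get_model, put_gradient, compute_gradient,
                  [1.0, 1.0, 1.0], time_budget_s=60)
print(done, model["w"])  # 3 1.0
```

Amortizing launch cost over many iterations is the design choice here; the returned count is one simple way to let a replacement worker pick up where a short-lived one stopped.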

  30. Cirrus outperforms vanilla serverless
      ● Sparse gradients and training data prefetching
      [Plot: test loss]

  31. Cirrus outperforms vanilla serverless
      ● Replace AWS S3 with high-performance store (Redis): +700x updates/sec
      [Plot: test loss]

  32. Cirrus outperforms vanilla serverless
      ● Cirrus without training data prefetching: 10x faster
      [Plot: test loss]

  33. Cirrus outperforms vanilla serverless
      ● Cirrus with training data prefetching: 10x faster
      [Plot: test loss; "10x faster" appears twice]

  34. Conclusion
      1. End-to-end ML workflows:
         a. time-consuming infrastructure management
         b. resource overprovisioning
      2. Cirrus, a serverless end-to-end ML framework:
         a. simplify deployment of ML workflows
         b. per-stage provisioning of resources
      3. Cirrus outperforms existing serverless solutions by specializing for serverless and ML

  35. Thank you! github.com/ucbrise/cirrus @jccarreira
