Cirrus: A Serverless Framework for End-to-end ML Workflows
Joao Carreira, Pedro Fonseca, Alexey Tumanov, Andrew Zhang, Randy Katz
Machine Learning
End-to-end ML workflows
● Modern end-to-end ML workflows are complex
● ML workflows consist of 3 heterogeneous stages: dataset preprocessing, model training, hyperparameter tuning
● ML workflows are interactive and iterative
Provisioning ML workflows
● Provisioning ML workflows is challenging
● Hard to accurately estimate the resource demands of each stage
● Data scientists have limited systems expertise
● Complex infrastructure management detracts from ML work
● Resources are wasted due to overprovisioning
Serverless computing
● Input → Code (lambda) → Output, with AWS S3 for storage
Serverless computing benefits
● Tight provisioning of resources: fine-grained provisioning, high elasticity, fine-grained billing
● Simplified infrastructure management: automatic resource configuration / maintenance
Challenges of serverless
● Small local memory and storage
● Low bandwidth and no P2P communication
● Limited lambda package size
● Lack of fast shared storage
● Short-lived and unpredictable launch times
Existing approaches
● Serverless frameworks (e.g., PyWren): download dependencies from S3 (limited package size), high-latency communication through S3 (no fast storage), stragglers (unpredictable launch times)
● Machine learning frameworks: unable to launch runtimes in lambdas (small memory, limited package size), no ring/tree reduces and no driver-to-worker communication, precludes MPI (no P2P communication), short-lived and unpredictable launch times
Cirrus: a framework for serverless end-to-end ML workflows
Cirrus: design principles
1) Addressing serverless challenges
● Low memory and limited package size → ultra-lightweight runtime + data prefetching (see the sketch below)
● No P2P communication and no fast storage → high-performance data store (parameter server and key-value)
● Short lifetimes and unpredictable launch times → robust handling of lambda termination
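Illustrative only: a minimal sketch of the data-prefetching idea, assuming training data is pulled minibatch by minibatch from remote storage. The class and the fetch_minibatch callable are hypothetical, not Cirrus's actual runtime API; the point is that a background thread keeps a small bounded buffer full so the lambda's training loop rarely blocks on I/O.

```python
# Hypothetical sketch of "data prefetching": a background thread keeps a small
# bounded buffer of minibatches full so the training loop does not wait on S3.
# Names (fetch_minibatch, buffer_size) are illustrative, not Cirrus's real API.
import queue
import threading

class PrefetchingMinibatchIterator:
    def __init__(self, fetch_minibatch, num_batches, buffer_size=4):
        self._fetch = fetch_minibatch            # callable: batch index -> minibatch
        self._num_batches = num_batches
        self._buffer = queue.Queue(maxsize=buffer_size)  # bounded to fit small lambda memory
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        for i in range(self._num_batches):
            self._buffer.put(self._fetch(i))     # blocks when the buffer is full
        self._buffer.put(None)                   # sentinel: no more batches

    def __iter__(self):
        while True:
            batch = self._buffer.get()
            if batch is None:
                return
            yield batch

# Toy usage with a dummy fetch function standing in for an S3/data-store read.
if __name__ == "__main__":
    batches = PrefetchingMinibatchIterator(lambda i: list(range(i, i + 10)), num_batches=5)
    for b in batches:
        pass  # gradient computation would go here
```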
Cirrus: design principles
2) Achieving benefits for end-to-end ML
● Tight provisioning of resources → per-stage, fine-grained, agile scalability
● Simplified infrastructure management → high-level API that supports end-to-end ML
Cirrus architecture (client side)
● Client frontend: Python API and dashboard used by the data scientist to drive preprocessing, training, and tuning
● Client backend: task scheduler and lambda manager that create/stop tasks
● The client side is stateful (a hypothetical per-stage API is sketched below)
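Purely hypothetical sketch of what a high-level, per-stage client API could look like; StageConfig and run_stage are invented names, not Cirrus's real Python API. It only illustrates the idea that each workflow stage declares its own resource footprint and scales independently.

```python
# Illustrative sketch (not Cirrus's real API): each workflow stage carries its own
# resource configuration, so preprocessing, training, and tuning scale independently.
from dataclasses import dataclass

@dataclass
class StageConfig:
    name: str
    num_lambdas: int              # per-stage provisioning: stages scale independently
    lambda_memory_mb: int = 3008  # assumed per-lambda memory limit

def run_stage(cfg: StageConfig) -> None:
    # In a real client backend this would ask the lambda manager to launch
    # cfg.num_lambdas workers and hand them tasks; here we only log the intent.
    print(f"stage={cfg.name} lambdas={cfg.num_lambdas} mem={cfg.lambda_memory_mb}MB")

# End-to-end workflow expressed stage by stage, each with its own footprint.
run_stage(StageConfig("preprocessing", num_lambdas=400))
run_stage(StageConfig("training", num_lambdas=100))
run_stage(StageConfig("hyperparameter_tuning", num_lambdas=32))
```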
Cirrus Dashboard
Cirrus architecture (server side)
● Server side (stateless): lambdas run the Cirrus runtime with workloads such as sparse logistic regression, matrix factorization, and LDA
● Data iterator API with a minibatch buffer feeds training data to the workers
● Data store client API: put (gradient), get (model), and put/get by key
● Data store: parameter-server API (SGD, Adagrad, Momentum over models) and key-value API (see the sketch below)
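A minimal in-process sketch of the worker-to-data-store interaction described above, assuming the store applies plain SGD server-side; the DataStore class is a local stand-in for Cirrus's remote store, and all names are illustrative rather than the real client API.

```python
# Sketch of the parameter-server interaction: workers put gradients and get the
# latest model; the store applies an optimizer (plain SGD here). The in-memory
# DataStore stands in for Cirrus's remote store; names are illustrative.
import numpy as np

class DataStore:
    def __init__(self, dim, lr=0.1):
        self.model = np.zeros(dim)
        self.lr = lr

    def put_gradient(self, grad):      # PS API: apply the update server-side
        self.model -= self.lr * grad

    def get_model(self):               # PS API: fetch the current parameters
        return self.model.copy()

def worker_step(store, x, y):
    # One asynchronous SGD step for logistic regression on a single example.
    w = store.get_model()
    p = 1.0 / (1.0 + np.exp(-x @ w))   # predicted probability
    store.put_gradient((p - y) * x)    # gradient of the logistic loss

store = DataStore(dim=3)
rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.normal(size=3)
    worker_step(store, x, float(x[0] > 0))   # toy labels
print(store.model)
```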
Cirrus evaluation
1. Cirrus provides benefits by specializing both for serverless and end-to-end ML
2. Cirrus outperforms a state-of-the-art serverless system: PyWren
Evaluation setup
1. Deployment: AWS Lambda (3 GB of memory)
2. Benchmark: asynchronous distributed SGD on a sparse logistic regression task
3. Dataset: Criteo (a dataset of display ads)
4. PyWren:
   a. Baseline: iterative synchronous SGD training using AWS S3 to store gradients and the model
   b. Plus 3 incremental optimizations
5. Cirrus: 2 modes (with/without prefetching)
The sparse gradient computation behind this workload is sketched below.
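A hedged sketch of the sparse logistic regression gradient underlying the benchmark, assuming each Criteo-like example touches only a handful of features; function and variable names are illustrative. Shipping the gradient as (index, value) pairs is what lets the "sparse gradients" optimization cut communication.

```python
# Sketch of a sparse logistic regression gradient: each example touches only a
# few features, so the gradient can be sent to the data store as (index, value)
# pairs instead of a dense vector. Names are illustrative, not Cirrus's API.
import math

def sparse_lr_gradient(model, example, label):
    """model: dict index -> weight; example: dict index -> feature value; label: 0 or 1."""
    margin = sum(model.get(i, 0.0) * v for i, v in example.items())
    p = 1.0 / (1.0 + math.exp(-margin))
    # Only indices present in the example get a non-zero gradient entry.
    return {i: (p - label) * v for i, v in example.items()}

# Toy usage: a model with millions of potential features, an example with three.
model = {}
grad = sparse_lr_gradient(model, {1043: 1.0, 77: 0.5, 9_000_001: 1.0}, label=1)
print(grad)   # three (index, value) pairs -> far less data on the wire than a dense vector
```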
Cirrus outperforms vanilla serverless (test loss over time)
● PyWren baseline: synchronous SGD training suffers from stragglers
● + multiple SGD iterations per lambda invocation, and asynchronous SGD
● + sparse gradients and training-data prefetching
● + replacing AWS S3 with a high-performance store (Redis): ~700x more updates/sec
● Cirrus without training-data prefetching: 10x faster
● Cirrus with training-data prefetching: a further 10x faster
Conclusion
1. End-to-end ML workflows:
   a. time-consuming infrastructure management
   b. resource overprovisioning
2. Cirrus, a serverless end-to-end ML framework:
   a. simplifies deployment of ML workflows
   b. provisions resources per stage
3. Cirrus outperforms existing serverless solutions by specializing for serverless and ML
Thank you! github.com/ucbrise/cirrus @jccarreira