Cirrus: A Serverless Framework for End-to-end ML Workflows


SLIDE 1

Cirrus: A Serverless Framework for End-to-end ML Workflows

Joao Carreira, Pedro Fonseca, Alexey Tumanov, Andrew Zhang, Randy Katz

SLIDE 2

Machine Learning

SLIDE 8

End-to-end ML workflows

  • Modern end-to-end ML workflows are complex
  • ML workflows consist of 3 heterogeneous stages: dataset preprocessing, model training, and hyperparameter tuning
  • ML workflows are interactive and iterative

SLIDE 9

Provisioning ML workflows

Provisioning ML workflows is challenging

  • Complex infrastructure management detracts from ML work
  • Resource waste due to overprovisioning of resources
  • Hard to accurately estimate resource demands of each stage
  • Data scientists have limited systems expertise

SLIDE 10

Serverless computing

[Diagram: Input, Code, Output; AWS S3]
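
The model on this slide (code and input go in, output comes out, with data held in an object store such as AWS S3) can be sketched as a stateless handler. This is only an illustration: `ObjectStore` is a hypothetical in-memory stand-in for S3, and the `handler(event, store)` shape and the key names are assumptions, not the AWS Lambda or Cirrus API.

```python
class ObjectStore:
    """Hypothetical in-memory stand-in for an object store such as AWS S3."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]


def handler(event, store):
    """Stateless function: read input from the store, compute, write output.

    Nothing is kept between invocations; all state lives in the store.
    """
    numbers = store.get(event["input_key"])
    result = sum(numbers) / len(numbers)
    store.put(event["output_key"], result)
    return {"status": "ok", "output_key": event["output_key"]}


store = ObjectStore()
store.put("input/part-0", [1.0, 2.0, 3.0, 4.0])
response = handler({"input_key": "input/part-0", "output_key": "output/part-0"}, store)
mean = store.get("output/part-0")
```

Because the function keeps no local state, any number of copies can run in parallel on different input keys, which is where the elasticity comes from.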

SLIDE 12

Serverless computing benefits

  • Fine-grained resources
  • Fine-grained billing
  • High elasticity
  • Automatic resource configuration / provisioning / maintenance

  • Tight provisioning of resources
  • Simplifying infrastructure management

SLIDE 13

Challenges of serverless

  • Small local memory and storage
  • Short-lived and unpredictable launch times
  • Low bandwidth and no P2P communication
  • Lack of fast shared storage
  • Limited lambda package size

SLIDE 16

Existing approaches

Serverless Frameworks (PyWren)

  • Download dependencies from S3
  • High-latency communication through S3
  • Stragglers

Serverless challenges involved: limited package size, no fast storage, short-lived and unpredictable launch times

Machine Learning Frameworks

  • Unable to launch runtimes in lambdas
  • No ring/tree reduces
  • No driver-to-worker communication
  • Precludes MPI

Serverless challenges involved: small memory, unpredictable launch times, no P2P communication

SLIDE 17

Cirrus: a framework for serverless end-to-end ML workflows

SLIDE 18

Cirrus: design principles

1) Addressing serverless challenges

Challenges:

  • No fast storage
  • Low memory
  • Limited package size
  • No P2P communication
  • Short lifetimes and unpredictable launch

Cirrus design:

  • Ultra-lightweight runtime + data prefetching
  • High-perf. data store (parameter-server and KV)
  • Robust handling of lambda termination

SLIDE 19

Cirrus: design principles

2) Achieving benefits for end-to-end ML

Benefits:

  • Tight provisioning of resources
  • Simplifying infrastructure management

Cirrus design:

  • Per-stage fine-grained variable agile scalability
  • High-level API supports end-to-end ML

SLIDE 20

Cirrus architecture (client side)

Client side (stateful)

  • Data scientist
  • Client frontend: Dashboard, Python API
  • Stages: Preproc., Training, Tuning
  • Create/Stop Task
  • Client backend: Task Scheduler, Lambda Manager

SLIDE 21

Cirrus Dashboard

SLIDE 25

Cirrus architecture (server side)

Server side (stateless)

Cirrus runtime

  • Data Iterator API: Minibatch Buffer
  • Models: Sparse LR, Mat. Fact., LDA
  • Data store client API: put (gradient), get (model)

Data store

  • PS API: Models; SGD, Adagrad, Momentum
  • Key-value API: Key-values; put/get key
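
The put (gradient) / get (model) interaction between the runtime and the data store can be illustrated with a toy, single-process parameter server. This is a sketch under stated assumptions, not the Cirrus data store: the class name `ParameterServer`, the method names `put_gradient` and `get_model`, and the `{index: value}` sparse gradient encoding are all hypothetical. The three update rules are the standard textbook forms of SGD, momentum, and Adagrad.

```python
import math


class ParameterServer:
    """Toy in-process parameter server with pluggable update rules (hypothetical API)."""

    def __init__(self, dim, lr=0.1, rule="sgd", momentum=0.9, eps=1e-8):
        self.model = [0.0] * dim
        self.lr = lr
        self.rule = rule
        self.momentum = momentum
        self.eps = eps
        # Per-coordinate optimizer state: velocity for momentum,
        # running sum of squared gradients for Adagrad.
        self.state = [0.0] * dim

    def put_gradient(self, sparse_grad):
        """Apply a sparse gradient given as {index: value}.

        Only the coordinates present in the gradient are touched.
        """
        for i, g in sparse_grad.items():
            if self.rule == "sgd":
                self.model[i] -= self.lr * g
            elif self.rule == "momentum":
                self.state[i] = self.momentum * self.state[i] + g
                self.model[i] -= self.lr * self.state[i]
            elif self.rule == "adagrad":
                self.state[i] += g * g
                self.model[i] -= self.lr * g / (math.sqrt(self.state[i]) + self.eps)
            else:
                raise ValueError(f"unknown rule: {self.rule}")

    def get_model(self, indices=None):
        """Return the full model as a list, or only the requested coordinates as a dict."""
        if indices is None:
            return list(self.model)
        return {i: self.model[i] for i in indices}


ps = ParameterServer(dim=4, lr=0.5, rule="sgd")
ps.put_gradient({0: 1.0, 2: -2.0})
snapshot = ps.get_model()  # [-0.5, 0.0, 1.0, 0.0]
```

Running the update rule on the server side means a worker only ships its gradient and never needs the optimizer state, which keeps the worker-side memory footprint small.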

SLIDE 26

Cirrus evaluation

1. Cirrus provides benefits by specializing both for serverless and end-to-end ML
2. We show that Cirrus outperforms a state-of-the-art serverless system: PyWren

SLIDE 27

Evaluation setup

1. Deployment: AWS Lambdas (3GB of mem.)
2. Benchmark: async. distributed SGD on a Sparse Logistic Regression task
3. Dataset: Criteo Dataset (a dataset of display ads)
4. PyWren:
   a. Baseline: iterative synchronous SGD training using AWS S3 to store gradients and model
   b. + 3 incremental optimizations
5. Cirrus: 2 modes (with/without prefetching)
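
For readers unfamiliar with the benchmark, the per-minibatch computation in sparse logistic regression can be sketched as follows. This is a single-process illustration with made-up toy data, not the evaluation code: the function names and the `{index: value}` feature encoding are assumptions. In an asynchronous distributed setting, each worker would compute such a gradient and push it to the shared model without waiting for the others.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sparse_lr_gradient(weights, minibatch):
    """Average logistic-loss gradient over a minibatch of sparse examples.

    Each example is (features, label): features is {index: value}, label is 0 or 1.
    Only coordinates that appear in the minibatch get a gradient entry.
    """
    grad = {}
    for features, label in minibatch:
        z = sum(weights.get(i, 0.0) * x for i, x in features.items())
        error = sigmoid(z) - label
        for i, x in features.items():
            grad[i] = grad.get(i, 0.0) + error * x / len(minibatch)
    return grad


def sgd_step(weights, grad, lr):
    """Apply a plain SGD update in place, touching only the gradient's coordinates."""
    for i, g in grad.items():
        weights[i] = weights.get(i, 0.0) - lr * g


# Toy data (hypothetical): feature 0 signals a click, feature 1 signals no click.
data = [({0: 1.0}, 1), ({1: 1.0}, 0), ({0: 1.0, 2: 1.0}, 1), ({1: 1.0, 2: 1.0}, 0)]
weights = {}
for step in range(200):
    start = (step % 2) * 2
    batch = data[start:start + 2]
    sgd_step(weights, sparse_lr_gradient(weights, batch), lr=0.5)
```

Because each ad impression touches only a handful of features, the gradient dictionary stays small even when the full model has millions of coordinates.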

SLIDE 28

Cirrus outperforms vanilla serverless

Synchronous SGD training suffers from stragglers

[Plot; axis label: Test Loss]
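
Why stragglers hurt synchronous training can be seen with a small wall-clock simulation. The numbers below (10 workers, 100 rounds, a 10% chance of a 5x slower iteration) are made up for illustration and are not taken from the evaluation. The sketch compares elapsed time only; it says nothing about how asynchrony affects convergence.

```python
import random

random.seed(0)

NUM_WORKERS = 10
NUM_ROUNDS = 100


def iteration_time():
    """Hypothetical iteration time in seconds: usually 1.0, sometimes a 5x straggler."""
    return 5.0 if random.random() < 0.1 else 1.0


# times[r][w] is how long worker w takes for its r-th iteration.
times = [[iteration_time() for _ in range(NUM_WORKERS)] for _ in range(NUM_ROUNDS)]

# Synchronous: every round waits for the slowest worker in that round.
sync_total = sum(max(round_times) for round_times in times)

# Asynchronous: each worker proceeds at its own pace; the job ends when the
# slowest worker has finished its own NUM_ROUNDS iterations.
async_total = max(
    sum(times[r][w] for r in range(NUM_ROUNDS)) for w in range(NUM_WORKERS)
)

print(f"synchronous: {sync_total:.0f}s, asynchronous: {async_total:.0f}s")
```

The synchronous total is a sum of per-round maxima, so it can never be smaller than the asynchronous total, and the gap grows with the number of workers because each round is more likely to contain at least one straggler.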

SLIDE 29

Cirrus outperforms vanilla serverless

  • Multiple SGD iterations on each lambda invocation
  • Asynchronous SGD

[Plot; axis label: Test Loss]

SLIDE 30

Cirrus outperforms vanilla serverless

Sparse gradients and training data prefetching

[Plot; axis label: Test Loss]
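
Training data prefetching can be sketched as a bounded buffer filled by a background thread, so that the next minibatch is usually already local when the training loop asks for it. This is a minimal sketch using Python's standard `queue` and `threading` modules; `PrefetchingIterator` and `fake_remote_fetch` are hypothetical names, and the real fetch from remote storage is replaced by a stub.

```python
import queue
import threading


class PrefetchingIterator:
    """Iterate over minibatches while a background thread fetches ahead."""

    _DONE = object()  # sentinel marking the end of the stream

    def __init__(self, fetch_fn, num_batches, buffer_size=4):
        self._buffer = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._fill, args=(fetch_fn, num_batches), daemon=True
        )
        self._thread.start()

    def _fill(self, fetch_fn, num_batches):
        for i in range(num_batches):
            self._buffer.put(fetch_fn(i))  # blocks while the buffer is full
        self._buffer.put(self._DONE)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._buffer.get()  # blocks until a batch is available
        if item is self._DONE:
            self._buffer.put(self._DONE)  # keep later next() calls from blocking
            raise StopIteration
        return item


def fake_remote_fetch(i):
    """Hypothetical stand-in for downloading minibatch i from remote storage."""
    return [i * 10 + j for j in range(3)]


batches = list(PrefetchingIterator(fake_remote_fetch, num_batches=5, buffer_size=2))
```

The bounded buffer size matters on a lambda: it caps how much memory prefetched data can occupy while still hiding fetch latency behind computation. The sketch omits error handling, so a fetch that raises would leave the consumer waiting.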

SLIDE 31

Cirrus outperforms vanilla serverless

Replace AWS S3 with high-performance store (Redis)

[Plot; axis label: Test Loss; annotation: +700x updates/sec]

SLIDE 32

Cirrus outperforms vanilla serverless

Cirrus without training data prefetching

[Plot; axis label: Test Loss; annotation: 10x faster]

SLIDE 33

Cirrus outperforms vanilla serverless

Cirrus with training data prefetching

[Plot; axis label: Test Loss; annotations: 10x faster, 10x faster]

SLIDE 34

Conclusion

1. End-to-end ML workflows:
   a. time-consuming infrastructure management
   b. resource overprovisioning
2. Cirrus, a serverless end-to-end ML framework:
   a. simplify deployment of ML workflows
   b. per-stage provisioning of resources
3. Cirrus outperforms existing serverless solutions by specializing for serverless and ML

SLIDE 35

Thank you!

github.com/ucbrise/cirrus @jccarreira