Parity Models: Erasure-Coded Resilience for Prediction Serving Systems
Jack Kosaian, Rashmi Vinayak, Shivaram Venkataraman
Machine learning lifecycle
• Training: get a model to reach the desired accuracy; "batch" jobs taking hours to weeks
• Inference: deploy the model in its target domain; online, with millisecond latencies
Machine learning inference: a model maps queries to predictions (e.g., an image query yields class probabilities such as 0.8 cat, 0.05 dog, 0.15 bird).
Prediction serving systems perform inference in datacenter/cluster settings, in both open-source systems and cloud services.
Prediction serving system architecture: queries arrive at a frontend, which dispatches them to model instances and returns the resulting predictions.
Machine learning inference powers question answering, translation, and ranking, and must operate with low, predictable latency.
Unavailability in serving systems
• Slowdowns and failures (unavailability) arise from resource contention, hardware failures, runtime slowdowns, and ML-specific events
• These inflate tail latency and cause prediction serving systems to miss SLOs
Slowdowns and failures must be alleviated.
Redundancy-based resilience
• Proactive: send each query to 2+ servers; low recovery delay, but high resource overhead
• Reactive: wait for a timeout before duplicating a query; low resource overhead, but high recovery delay
(Tradeoff chart: recovery delay vs. resource overhead, lower is better on both axes.)
Erasure codes: proactive, resource-efficient
• Encoding: k data units are encoded into r "parity" units; in (n, k) notation, n = k + r
• Example (k = 2, r = 1): P = D1 + D2; if D2 is unavailable, decode it as D2 = P − D1
• Decoding: any k out of the (k + r) units recover the original k data units
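A minimal sketch of this addition-based code (my illustration, not code from the talk), in Python with NumPy: encode two data vectors into a parity, then recover a missing unit from any two of the three.

```python
import numpy as np

# (k = 2, r = 1) addition-based erasure code over numeric vectors.
def encode(d1, d2):
    return d1 + d2            # parity unit: P = D1 + D2

def decode_d2(p, d1):
    return p - d1             # recover D2 = P - D1

d1 = np.array([1.0, 2.0, 3.0])
d2 = np.array([4.0, 5.0, 6.0])
p = encode(d1, d2)

# Suppose D2 becomes unavailable; any k = 2 of the n = 3 units suffice.
assert np.allclose(decode_d2(p, d1), d2)
```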
Erasure codes: proactive, resource-efficient
• Widely used in storage and communication systems, where they achieve low recovery delay at low resource overhead
• Can they provide the same benefit for prediction serving systems?
Coded computation
Our goal: use erasure codes to reduce tail latency in prediction serving by preserving the results of computation over queries.
• Setup: queries X1 and X2 are sent to model instances running F, producing predictions F(X1) and F(X2)
• Encode: X1 and X2 are additionally encoded into a "parity query" P, which is sent to its own model instance
• Decode: if F(X2) is slow or failed, it is reconstructed from F(X1) and F(P)
Traditional coding vs. coded computation
• Codes for storage: encode D1 and D2 into P; decoding recovers a missing data unit (e.g., D2)
• Coded computation: encode X1 and X2 into P, run F over all three, and decode F(X1) and F(P) to recover F(X2)
• Key difference: we need to recover the computation over the inputs, not the inputs themselves
Challenge: non-linear computation
• Linear computation, e.g., F(X) = 2X, with P = X1 + X2:
  F(X2) = F(P) − F(X1) = 2(X1 + X2) − 2X1 = 2X2 ✓
• Non-linear computation, e.g., F(X) = X²:
  F(P) − F(X1) = (X1 + X2)² − X1² = X2² + 2X1X2, but the actual value is X2² ✗
• For a general non-linear F, such as a neural network, F(X2) = F(P) − F(X1) = ??? and no simple decoder applies
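A quick numeric check of the two cases above (my own illustration): the additive decoder is exact for the linear function but not for the squaring function.

```python
x1, x2 = 3.0, 5.0
p = x1 + x2                      # parity query P = X1 + X2

# Linear: F(X) = 2X -> F(P) - F(X1) = 2(X1 + X2) - 2*X1 = 2*X2
f_lin = lambda x: 2 * x
assert f_lin(p) - f_lin(x1) == f_lin(x2)

# Non-linear: F(X) = X^2 -> F(P) - F(X1) = X2^2 + 2*X1*X2 != X2^2
f_sq = lambda x: x ** 2
print(f_sq(p) - f_sq(x1), "vs actual", f_sq(x2))   # 55.0 vs actual 25.0
```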
Current approaches to coded computation
• Lots of great work on linear computations (Huang 1984, Lee 2015, Dutta 2016, Dutta 2017, Mallick 2018, and more)
• Recent work supports restricted non-linear computations (Yu 2018); at least 2x resource overhead
Current approaches are insufficient for neural networks in prediction serving systems.
Our approach: learning-based coded computation
• Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation (https://arxiv.org/abs/1806.01259)
• Parity Models: Erasure-Coded Resilience for Prediction Serving Systems (to appear in ACM SOSP 2019, https://jackkosaian.github.io)
Learning an erasure code?
• One approach: design the encoder and decoder themselves as neural networks (Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation, https://arxiv.org/abs/1806.01259)
• Reconstructions are accurate, but the learned encoder and decoder are computationally expensive
Learn computation over parities
• Use simple, fast encoders and decoders (e.g., P = X1 + X2; F(X2) = FP(P) − F(X1))
• Instead, learn the computation over parities with a "parity model" FP
• Result: accurate reconstructions with efficient encoding and decoding
(Parity Models: Erasure-Coded Resilience for Prediction Serving Systems, to appear in ACM SOSP 2019, https://jackkosaian.github.io)
Designing parity models
• Goal: transform parities such that the decoder can reconstruct unavailable predictions
• With the addition encoder P = X1 + X2 and the decoder F(X2) = FP(P) − F(X1), this means learning a parity model such that FP(P) = F(X1) + F(X2)
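Putting the pieces together, a sketch of the data path (my illustration; the models here are stand-ins, and FP would be trained as described next):

```python
import torch

def encode(x1, x2):
    return x1 + x2                    # generic addition encoder: P = X1 + X2

def decode(fp_of_p, f_of_x1):
    return fp_of_p - f_of_x1          # approximate F(X2) as F_P(P) - F(X1)

# Stand-ins for the deployed model F and the parity model F_P
# (trained so that F_P(P) approximates F(X1) + F(X2)).
F   = torch.nn.Linear(8, 3)
F_P = torch.nn.Linear(8, 3)

x1, x2 = torch.randn(8), torch.randn(8)
p = encode(x1, x2)
f_x2_approx = decode(F_P(p), F(x1))   # used only if F(X2) is slow/failed
```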
Training a parity model
1. Sample k inputs and encode them (P = X1 + X2)
2. Perform inference with the parity model: FP(P)
3. Compute the loss against the desired output, F(X1) + F(X2)
4. Backpropagate the loss
5. Repeat
Over successive training iterations, FP(P) moves closer to the desired output.
Training a parity model: higher values of k
The same procedure applies for larger k, e.g., k = 4: encode P = X1 + X2 + X3 + X4, with desired output F(X1) + F(X2) + F(X3) + F(X4).
Training a parity model: different encoders
The same procedure also applies with other encoding functions; only the encoder used to form P changes.
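A possible training loop along the lines of the steps above (a sketch of mine in PyTorch, not the paper's exact training code; the deployed model F supplies the targets, and the loss function and hyperparameters are placeholders):

```python
import torch

def train_parity_model(F, F_P, dataloader, k=2, epochs=10, lr=1e-3):
    """Train F_P so that F_P(X1 + ... + Xk) approximates F(X1) + ... + F(Xk)."""
    opt = torch.optim.Adam(F_P.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    F.eval()                                      # deployed model is frozen
    for _ in range(epochs):
        for inputs, _ in dataloader:              # 1. sample inputs
            for group in inputs.split(k):
                if group.size(0) < k:
                    continue
                parity = group.sum(dim=0, keepdim=True)         # encode
                with torch.no_grad():
                    target = F(group).sum(dim=0, keepdim=True)  # desired output
                out = F_P(parity)                 # 2. parity-model inference
                loss = loss_fn(out, target)       # 3. compute loss
                opt.zero_grad()
                loss.backward()                   # 4. backpropagate
                opt.step()                        # 5. repeat
```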
Learning results in approximate reconstructions, which is appropriate for machine learning inference:
1. Predictions resulting from inference are themselves approximations
2. Inaccuracy comes into play only when predictions would otherwise be slow or failed
Implementing parity models in Clipper
The frontend is extended with an encoder and a decoder. Queries are encoded into parity queries served by a parity model instance, and the decoder reconstructs any predictions whose original model instances are slow or failed.
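A generic sketch of the frontend logic (not Clipper's actual API; the replica objects and their predict method are hypothetical), showing encode-on-dispatch and decode-on-timeout for k queries with one parity:

```python
import asyncio

async def serve_group(queries, model_replicas, parity_replica, timeout=0.05):
    """Dispatch k queries plus one parity query; if one prediction misses the
    timeout, reconstruct it from the parity prediction and the others."""
    parity = sum(queries[1:], queries[0])            # generic addition encoder
    tasks = [asyncio.create_task(r.predict(q))
             for r, q in zip(model_replicas, queries)]
    parity_task = asyncio.create_task(parity_replica.predict(parity))

    done, pending = await asyncio.wait(tasks, timeout=timeout)
    results = {t: t.result() for t in done}
    if not pending:
        return [results[t] for t in tasks]

    assert len(pending) == 1, "r = 1 tolerates one unavailable prediction"
    missing = pending.pop()
    fp_of_p = await parity_task
    available_sum = sum(results[t] for t in tasks if t is not missing)
    reconstruction = fp_of_p - available_sum         # decode: F_P(P) - sum(available)
    return [results.get(t, reconstruction) for t in tasks]
```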
Design space in the parity models framework
• Encoder/decoder: many possibilities; a generic choice is addition/subtraction (P = X1 + X2, F(X2) = FP(P) − F(X1)), and encoders can also be specialized to the task
• Parity model architecture (FP): again many possibilities; using the same architecture as the original model gives the same latency as the original
Evaluation
1. How accurate are reconstructions using parity models?
2. How much can parity models help reduce tail latency?
Evaluation of accuracy
• Addition/subtraction code with k = 2, r = 1 (P = X1 + X2), i.e., 2x less resource overhead than replication
• The parity model only comes into play when predictions are slow or failed
Evaluation of overall accuracy
• Same setup: addition/subtraction code, k = 2, r = 1 (P = X1 + X2), 2x less overhead than replication; the parity model only comes into play when predictions are slow or failed
• Overall accuracy degradation: 6.1% in the worst case, but only 0.6% within the expected operating regime, where only a small fraction of predictions are slow or failed
Evaluation of accuracy: higher values of k
Using the addition/subtraction code with larger k exposes a tradeoff between resource overhead, resilience, and accuracy.
Evaluation of accuracy: object localization
[Figure: bounding boxes from ground truth, from available (non-reconstructed) predictions, and from parity-model reconstructions.]
Evaluation of accuracy: task-specific encoder
A task-specific image encoder, which forms a single parity image from the k input images (e.g., downsampling each input to 32x32 and tiling them), gives a 22% accuracy improvement over addition/subtraction at k = 4.
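One possible encoder along these lines (an illustration of mine; the exact dimensions and layout are assumptions based on the slide, not taken from the paper):

```python
import torch
import torch.nn.functional as nnF

def concat_encoder(images, out_size=32):
    """Form a parity image from k = 4 inputs by downsampling each to
    out_size x out_size and tiling them in a 2x2 grid."""
    assert len(images) == 4
    small = [nnF.interpolate(img.unsqueeze(0), size=(out_size, out_size),
                             mode="bilinear", align_corners=False).squeeze(0)
             for img in images]
    top = torch.cat([small[0], small[1]], dim=2)      # side by side (width)
    bottom = torch.cat([small[2], small[3]], dim=2)
    return torch.cat([top, bottom], dim=1)            # stacked (height)

# Example: four 3x64x64 images -> one 3x64x64 parity image
imgs = [torch.randn(3, 64, 64) for _ in range(4)]
parity_image = concat_encoder(imgs)                   # shape: (3, 64, 64)
```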
Evaluation of tail latency reduction: setup
• Implemented in the Clipper prediction serving system
• Evaluated on 18-36 nodes on AWS, varying: inference hardware (GPUs, CPUs), query arrival rates, batch sizes, levels of load imbalance, and amounts of redundancy
• Baseline: an approach given the same number of resources as parity models