Parity Models: Erasure-Coded Resilience for Prediction Serving Systems


  1. Parity Models: Erasure-Coded Resilience for Prediction Serving Systems. Jack Kosaian, Rashmi Vinayak, Shivaram Venkataraman.

  2. Rashmi Vinayak, Shivaram Venkataraman.

  3. Machine learning lifecycle. Training: get a model to reach the desired accuracy; "batch" jobs; hours to weeks. Inference: deploy the model in the target domain; online; milliseconds.

  4. Machine learning inference: a model maps queries to predictions, e.g., class probabilities such as 0.8 cat, 0.05 dog, 0.15 bird.

  5. Prediction serving systems: inference in datacenter/cluster settings, via open-source systems and cloud services.

  6. Prediction serving system architecture: a frontend receives queries, dispatches them to model instances, and returns predictions.

  7. Machine learning inference powers question-answering, translation, and ranking; it must operate with low, predictable latency.

  8. Unavailability in serving systems. Slowdowns and failures (unavailability) arise from resource contention, hardware failures, runtime slowdowns, and ML-specific events. They inflate tail latency and cause prediction serving systems to miss SLOs. Serving systems must alleviate slowdowns and failures.

  9. Redundancy-based resilience. Proactive: send each query to 2+ servers. Reactive: wait for a timeout before duplicating a query. Reactive approaches have low resource overhead but high recovery delay; proactive approaches have low recovery delay but high resource overhead (figure: recovery delay vs. resource overhead, lower is better on both axes).

  10. Erasure codes: proactive, resource-efficient. In (n, k) notation, k data units are encoded into r "parity" units, giving n = k + r units in total. Example (k = 2, r = 1): the parity is P = D1 + D2. Decoding: any k of the (k + r) units recover the original k data units; e.g., if D2 is unavailable, D2 = P - D1.
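As a concrete illustration of the addition code above, here is a minimal sketch (hypothetical data values, NumPy for the arithmetic):

```python
import numpy as np

# Two data units (k = 2) and one parity unit (r = 1), so n = k + r = 3.
d1 = np.array([1.0, 2.0, 3.0])
d2 = np.array([4.0, 5.0, 6.0])

# Encoding: the parity unit is the sum of the data units.
p = d1 + d2

# Decoding: any k of the n units recover the original data units.
# If d2 is unavailable, subtract the surviving data unit from the parity.
d2_recovered = p - d1
assert np.allclose(d2_recovered, d2)
```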

  11. Erasure codes are proactive yet resource-efficient, and are widely used in storage and communication systems. Can they provide the same benefit to prediction serving systems? (Figure: erasure codes achieve both low recovery delay and low resource overhead, unlike purely reactive or purely proactive replication.)

  12. Coded-computation. Our goal: use erasure codes to reduce tail latency in prediction serving, i.e., preserve the results of computation over queries. Queries X1 and X2 are sent to model instances running F, which return predictions F(X1) and F(X2).

  13. Coded-computation: encode the queries. X1 and X2 are encoded into a "parity query," which is dispatched to an additional instance of F alongside the original queries.

  14. Coded-computation: decode the results of inference. If F(X2) is unavailable, the decoder reconstructs it from F(X1) and F(P), the result of inference over the parity query.

  15. Traditional coding vs. coded-computation. Codes for storage encode data units D1, D2 into a parity P and decode to recover a missing data unit (e.g., D2). Coded-computation instead encodes queries X1, X2, performs inference over them and over the parity, and decodes to recover the computation over an input, e.g., F(X2) from F(X1) and F(P).

  16. Challenge: non-linear computation. Linear example, F(X) = 2X: with P = X1 + X2, decoding gives F(X2) = F(P) - F(X1) = 2(X1 + X2) - 2X1 = 2X2, which is exact. Non-linear example, F(X) = X^2: the same decoder gives F(P) - F(X1) = (X1 + X2)^2 - X1^2 = X2^2 + 2X1X2, whereas the actual value is X2^2.

  17. Challenge: non-linear computation (continued). For linear F, F(X2) = F(P) - F(X1) works; for non-linear F, there is no such simple formula for F(X2) in terms of F(P) and F(X1).
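A small numeric check (scalar queries, values chosen arbitrarily) makes the contrast concrete: the subtraction decoder is exact for the linear F but wrong for the non-linear one.

```python
def f_linear(x):
    return 2 * x

def f_square(x):
    return x ** 2

x1, x2 = 3.0, 5.0
p = x1 + x2  # parity query

# Linear computation: the subtraction decoder is exact.
print(f_linear(p) - f_linear(x1), f_linear(x2))   # 10.0 vs. 10.0
# Non-linear computation: the same decoder is off by 2*x1*x2.
print(f_square(p) - f_square(x1), f_square(x2))   # 55.0 vs. 25.0
```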

  18. Current approaches to coded-computation. There is much prior work on linear computations (Huang 1984, Lee 2015, Dutta 2016, Dutta 2017, Mallick 2018, and more). Recent work supports restricted non-linear computations (Yu 2018), but with at least 2x resource overhead. Current approaches are insufficient for neural networks in prediction serving systems.

  19. Our approach: learning-based coded-computation. "Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation" (https://arxiv.org/abs/1806.01259) and "Parity Models: Erasure-Coded Resilience for Prediction Serving Systems" (to appear in ACM SOSP 2019; https://jackkosaian.github.io).

  20. Learning an erasure code? Design the encoder and decoder as neural networks; this approach is accurate (Learning a Code, https://arxiv.org/abs/1806.01259).

  21. Learning an erasure code? The learned encoder and decoder are accurate, but they are themselves neural networks and therefore computationally expensive (Learning a Code, https://arxiv.org/abs/1806.01259).

  22. Learn computation over parities. Use simple, fast encoders and decoders (e.g., P = X1 + X2) and instead learn the computation over parities: a "parity model" FP. The unavailable prediction is reconstructed as F(X2) = FP(P) - F(X1), keeping the encoder/decoder efficient while remaining accurate (Parity Models, to appear in ACM SOSP 2019; https://jackkosaian.github.io).

  23. Designing parity models. Goal: transform parities such that the decoder can reconstruct unavailable predictions. With P = X1 + X2, the decoder computes F(X2) = FP(P) - F(X1).

  24. Designing parity models. For the decoder F(X2) = FP(P) - F(X1) to succeed, learn a parity model such that FP(P) = F(X1) + F(X2).

  25. Designing parity models, taken together: encode P = X1 + X2, run the parity model to obtain FP(P) ≈ F(X1) + F(X2), and decode the unavailable prediction as F(X2) = FP(P) - F(X1), as sketched below.
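A sketch of this decode step, with made-up prediction vectors and a hypothetical parity-model output standing in for FP(P):

```python
import torch

def reconstruct(pred_x1, parity_pred):
    # Decoder: approximate the unavailable prediction as FP(P) - F(X1).
    return parity_pred - pred_x1

# Example: 3-class softmax outputs (values are illustrative only).
pred_x1 = torch.tensor([0.80, 0.15, 0.05])       # F(X1), available
parity_pred = torch.tensor([1.00, 0.85, 0.15])   # FP(P), ideally ~ F(X1) + F(X2)
approx_pred_x2 = reconstruct(pred_x1, parity_pred)
print(approx_pred_x2)                            # approximation of F(X2)
```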

  26. Training a parity model: (1) sample k inputs and encode them (P = X1 + X2); (2) perform inference with the parity model to obtain FP(P); (3) compute the loss against the desired output F(X1) + F(X2); (4) backpropagate the loss; (5) repeat. (Figure: after the first iteration, FP(P) is still far from the desired output. A training-loop sketch in PyTorch follows slide 30 below.)

  27. Training a parity model (iteration 2): the parity model's output moves closer to the desired output F(X1) + F(X2).

  28. Training a parity model (iteration 3): with further training, FP(P) closely matches the desired output F(X1) + F(X2).

  29. Training a parity model with a higher parameter k: encode P = X1 + X2 + X3 + X4 and train FP so that its output matches the desired output F(X1) + F(X2) + F(X3) + F(X4).

  30. Training a parity model with a different encoder: the same training procedure applies when P is produced by an encoder other than addition (the slide's figure shows the encoded parity query).
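A PyTorch-style sketch of the training loop described on slides 26-30, assuming a hypothetical deployed model f (kept frozen), a parity_model with compatible input/output shapes, and a standard data loader; the addition encoder is used here, but any encoder could be substituted:

```python
import torch

def train_parity_model(f, parity_model, data_loader, k=2, epochs=10, lr=1e-3):
    """Train FP so that FP(X1 + ... + Xk) approximates F(X1) + ... + F(Xk)."""
    opt = torch.optim.Adam(parity_model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    f.eval()  # the deployed model is fixed; only the parity model is trained

    for _ in range(epochs):
        for xs, _ in data_loader:                      # xs: a batch of inputs
            for group in xs.split(k):                  # 1. sample k inputs
                if group.size(0) < k:
                    continue
                parity = group.sum(dim=0, keepdim=True)          # encode: P = sum(Xi)
                with torch.no_grad():
                    target = f(group).sum(dim=0, keepdim=True)   # F(X1) + ... + F(Xk)
                out = parity_model(parity)             # 2. inference over the parity
                loss = loss_fn(out, target)            # 3. compute loss
                opt.zero_grad()
                loss.backward()                        # 4. backpropagate
                opt.step()
    return parity_model                                # 5. repeat until converged
```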

  31. Learning results in approximate reconstructions, which is appropriate for machine learning inference: (1) predictions resulting from inference are themselves approximations, and (2) the inaccuracy only comes into play when predictions would otherwise be slow or failed.

  32. Implementing parity models in Clipper: the frontend hosts the encoder and decoder; queries are dispatched to model instances and the encoded parity query to the parity model; when a prediction is slow or failed, the decoder reconstructs it.

  33. Design space in the parity models framework. Encoder/decoder: many possibilities; a generic choice is addition/subtraction (P = X1 + X2, F(X2) = FP(P) - F(X1)), and the encoder can be specialized to the task. Parity model architecture: again many possibilities; using the same architecture as the original model gives the same latency as the original.
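A sketch of the generic addition/subtraction encoder and decoder a frontend could apply around the model instances (hypothetical helper names; this is not Clipper's actual API):

```python
import numpy as np

def encode(queries):
    # Generic encoder: the parity query is the element-wise sum of the k queries.
    return np.sum(queries, axis=0)

def decode(available_preds, parity_pred):
    # Generic decoder: approximate the single unavailable prediction as
    # FP(P) minus the k-1 predictions that did arrive in time.
    return parity_pred - np.sum(available_preds, axis=0)

# Frontend flow (sketch): dispatch the k queries to model instances and
# encode(queries) to the parity model; if exactly one of the k predictions
# is slow or failed, call decode() on the others plus the parity prediction.
```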

  34. Evaluation: (1) How accurate are reconstructions using parity models? (2) How much can parity models help reduce tail latency?

  35. Evaluation of accuracy: addition/subtraction code with k = 2, r = 1 (P = X1 + X2), i.e., 2x less overhead than replication.

  36. Evaluation of accuracy: the parity model only comes into play when predictions are slow or failed. (Same setup: addition/subtraction code, k = 2, r = 1, 2x less overhead than replication.)

  37. Evaluation of accuracy (continued): same setup as above; the figure shows reconstruction accuracy.

  38. Evaluation of overall accuracy: same setup (addition/subtraction code, k = 2, r = 1, 2x less overhead than replication); the parity model only comes into play when predictions are slow or failed (figure annotation: 6.1%).

  39. Evaluation of overall accuracy (continued): same setup as above (figure annotations: 0.6%, 6.1%).

  40. Evaluation of overall accuracy in the expected operating regime: same setup (addition/subtraction code, k = 2, r = 1, 2x less overhead than replication); the parity model only comes into play when predictions are slow or failed.

  41. Evaluation of accuracy at higher values of k (addition/subtraction code): there is a tradeoff between resource overhead, resilience, and accuracy.

  42. Evaluation of accuracy on object localization (figure: ground truth, available predictions, and parity-model reconstructions).

  43. Evaluation of accuracy with a task-specific encoder: 22% accuracy improvement over addition/subtraction at k = 4 (figure: four 32x32 input images encoded into one parity image).
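One plausible form of such a task-specific image encoder, consistent with the slide's figure of four 32x32 inputs producing one 32x32 parity image, is to downsample each input and tile the results; this is a hedged sketch, not necessarily the exact encoder used in the paper:

```python
import torch

def tile_encode(images):
    # Hypothetical task-specific encoder for k = 4 images shaped (4, C, 32, 32):
    # downsample each image to 16x16 and tile them into one 32x32 parity image.
    assert images.shape[0] == 4 and images.shape[-2:] == (32, 32)
    small = torch.nn.functional.interpolate(
        images, size=(16, 16), mode="bilinear", align_corners=False)  # (4, C, 16, 16)
    top = torch.cat([small[0], small[1]], dim=2)        # (C, 16, 32)
    bottom = torch.cat([small[2], small[3]], dim=2)     # (C, 16, 32)
    return torch.cat([top, bottom], dim=1)              # (C, 32, 32)
```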

  44. Evaluation of tail latency reduction: setup. Implemented in the Clipper prediction serving system and evaluated with 18-36 nodes on AWS, varying inference hardware (GPUs, CPUs), query arrival rates, batch sizes, levels of load imbalance, and amounts of redundancy. Baseline: an approach given the same number of resources as parity models.
