Willump: A Statistically-Aware End-to-end Optimizer for ML Inference
Peter Kraft, Daniel Kang, Deepak Narayanan, Shoumik Palkar, Peter Bailis, Matei Zaharia
Problem: ML Inference
● Often performance-critical.
● Recent focus on tools for ML prediction serving.
A Common Bottleneck: Feature Computation
● Many applications bottlenecked by feature computation.
● Pipeline of transformations computes numerical features from data for model.
[Diagram: Receive Raw Data → Compute Features → Predict With Model]
A Common Bottleneck: Feature Computation
● Feature computation is the bottleneck when models are inexpensive (boosted trees, not DNNs).
● Common on tabular/structured data!
A Common Bottleneck: Feature Computation
[Chart: breakdown of feature computation time vs. model run time in a production Microsoft sentiment analysis pipeline]
Feature computation takes > 99% of the time!
Source: Pretzel (OSDI ‘18)
Current State-of-the-art
● Apply traditional serving optimizations, e.g. caching (Clipper), compiler optimizations (Pretzel).
● Neglect unique statistical properties of ML apps.
Statistical Properties of ML
● Amenability to approximation
Statistical Properties of ML
● Amenability to approximation
Easy input: Definitely not a dog.
Hard input: Maybe a dog?
Statistical Properties of ML
● Amenability to approximation
Easy input: Definitely not a dog.
Hard input: Maybe a dog?
Existing Systems: Use expensive model for both.
Statistical Properties of ML
● Amenability to approximation
Easy input: Definitely not a dog.
Hard input: Maybe a dog?
Statistically-Aware Systems: Use the cheap model on the easy input (the bucket), the expensive model on the hard one (the cat).
Statistical Properties of ML
● Model is often part of a bigger app (e.g. top-K query)
Statistical Properties of ML
● Model is often part of a bigger app (e.g. top-K query)
Problem: Return top 10 artists.

Artist              Score   Rank
Beatles             9.7     1
Bruce Springsteen   9.5     2
…                   …       …
Justin Bieber       5.6     999
Nickelback          4.1     1000
Statistical Properties of ML
● Model is often part of a bigger app (e.g. top-K query)
Existing Systems: Use expensive model for everything, every row of the ranking above!
Statistical Properties of ML
● Model is often part of a bigger app (e.g. top-K query)
Statistically-Aware Systems:
● High-value rows (Beatles, Bruce Springsteen): rank precisely, return.
● Low-value rows (Justin Bieber, Nickelback): approximate, discard.
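To make the two-stage idea concrete, here is a minimal sketch of top-K approximation in Python. Everything named here (cheap_score, expensive_score, the candidate ratio) is a hypothetical stand-in, not Willump's actual implementation:

    def approximate_top_k(items, k, ratio=4):
        """Return the top-k items, scoring most of them only with the cheap model."""
        # Stage 1: score every candidate with the cheap approximate model.
        scored = sorted(items, key=cheap_score, reverse=True)
        # Keep only the high-value candidates; the rest never reach the
        # expensive model.
        candidates = scored[: k * ratio]
        # Stage 2: rank just the candidates precisely with the expensive model.
        return sorted(candidates, key=expensive_score, reverse=True)[:k]

    # Hypothetical stand-in scorers so the sketch runs; replace with real models.
    def cheap_score(item):
        return item["popularity"]                          # a few cheap features

    def expensive_score(item):
        return item["popularity"] + 0.1 * item["quality"]  # full feature set

    artists = [{"name": f"artist{i}", "popularity": i % 97, "quality": i % 13}
               for i in range(1000)]
    print([a["name"] for a in approximate_top_k(artists, k=10)])

The ratio of candidates kept in stage 1 trades speed for the chance of missing a true top-K item; tuning it against an accuracy target is exactly the kind of statistically-aware decision discussed here.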
Prior Work: Statistically-Aware Optimizations
● Statistically-aware optimizations exist in literature.
● Always application-specific and custom-built.
● Never automatic!
Source: Cheng et al. (DLRS ‘16), Kang et al. (VLDB ‘17)
ML Inference Dilemma
● ML inference systems:
  ○ Easy to use.
  ○ Slow.
● Statistically-aware systems:
  ○ Fast.
  ○ Require a lot of work to implement.
Can an ML inference system be fast and easy to use?
Willump: Overview
● Statistically-aware optimizer for ML inference.
● Targets feature computation!
● Automatic, model-agnostic, statistically-aware optimizations.
● 10x throughput and latency improvements.
Outline
● System Overview
● Optimization 1: End-to-end Cascades
● Optimization 2: Top-K Query Approximation
● Evaluation
Willump: Goals
● Automatically maximize performance of ML inference applications whose performance bottleneck is feature computation.
System Overview
Input Pipeline:
    def pipeline(x1, x2):
        input = lib.transform(x1, x2)
        preds = model.predict(input)
        return preds
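For concreteness, a runnable version of the kind of pipeline Willump targets, with a hypothetical TF-IDF featurizer and logistic regression standing in for lib.transform and model (the actual benchmark pipelines differ):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical stand-ins for lib.transform and model from the slide.
    vectorizer = TfidfVectorizer()
    model = LogisticRegression()

    train_texts = ["good movie", "bad movie", "great film", "terrible film"]
    model.fit(vectorizer.fit_transform(train_texts), [1, 0, 1, 0])

    def pipeline(texts):
        features = vectorizer.transform(texts)  # feature computation (the bottleneck)
        return model.predict(features)          # inexpensive model

    print(pipeline(["great movie"]))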
System Overview
Willump Optimization:
    1. Infer transformation graph from the input pipeline.
    2. Apply statistically-aware optimizations:
        ● End-to-end cascades
        ● Top-K query approximation
    3. Apply compiler optimizations (Weld, Palkar et al. VLDB ‘18).
Optimized Pipeline:
    def willump_pipeline(x1, x2):
        preds = compiled_code(x1, x2)
        return preds
Outline
● System Overview
● Optimization 1: End-to-end Cascades
● Optimization 2: Top-K Query Approximation
● Evaluation
Background: Model Cascades
● Classify “easy” inputs with cheap model.
● Cascade to expensive model for “hard” inputs.
Easy input: Definitely not a dog.
Hard input: Maybe a dog?
Background: Model Cascades
● Used for image classification, object detection.
● Existing systems application-specific and custom-built.
Source: Viola-Jones (CVPR ‘01), Kang et al. (VLDB ‘17)
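A minimal sketch of the classic two-model cascade, assuming toy data and stand-in models (a linear model as the cheap stage, a random forest as the expensive stage); prior systems build this pattern per-application:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

    cheap_model = LogisticRegression().fit(X, y)           # fast, less accurate
    expensive_model = RandomForestClassifier().fit(X, y)   # slow, more accurate

    def cascade_predict(x, threshold=0.9):
        probs = cheap_model.predict_proba(x.reshape(1, -1))[0]
        if probs.max() >= threshold:        # "easy" input: cheap model is confident
            return int(probs.argmax())
        # "hard" input: cascade to the expensive model.
        return int(expensive_model.predict(x.reshape(1, -1))[0])

    print(cascade_predict(X[0]))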
Our Optimization: End-to-end Cascades
● Compute only some features for “easy” data inputs; cascade to computing all for “hard” inputs.
● Automatic and model-agnostic, unlike prior work:
  ○ Estimates of runtime performance & accuracy of a feature set.
  ○ Efficient search process for tuning parameters.
End-to-end Cascades: Original Model
Compute All Features → Model → Prediction
End-to-end Cascades: Approximate Model
Original: Compute All Features → Model → Prediction
Cascades Optimization: Compute Selected Features → Approximate Model → Prediction
End-to-end Cascades: Confidence
Compute Selected Features → Approximate Model → Confidence > Threshold?
    Yes → Prediction
End-to-end Cascades: Final Pipeline
Compute Selected Features → Approximate Model → Confidence > Threshold?
    Yes → Prediction
    No → Compute Remaining Features → Original Model → Prediction
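A minimal sketch of the final pipeline's control flow, assuming the cascade has already been constructed: selected holds the chosen feature subset S, approx_model was trained on S, and threshold was tuned to the accuracy target (all toy stand-ins, not Willump's generated code):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))
    y = (X[:, 0] + 0.1 * X[:, 7] > 0).astype(int)

    selected, remaining = [0, 1, 2], [3, 4, 5, 6, 7]  # assumed already chosen
    threshold = 0.9                                    # assumed already tuned

    approx_model = LogisticRegression().fit(X[:, selected], y)
    original_model = LogisticRegression().fit(X, y)

    def compute_features(raw, idx):
        return raw[idx]  # stand-in: real pipelines run transformations here

    def cascade_infer(raw):
        f_sel = compute_features(raw, selected)       # compute selected features only
        probs = approx_model.predict_proba([f_sel])[0]
        if probs.max() >= threshold:                  # confidence > threshold: done
            return int(probs.argmax())
        f_rem = compute_features(raw, remaining)      # else compute remaining features
        f_all = np.concatenate([f_sel, f_rem])
        return int(original_model.predict([f_all])[0])

    print(cascade_infer(X[0]))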
End-to-end Cascades: Constructing Cascades
● Construct cascades during model training.
● Need model training set and an accuracy target.
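One piece of that construction is picking the confidence threshold on held-out data given the accuracy target. A toy stand-in for the actual tuning procedure, which returns the lowest (cheapest) threshold that still meets the target:

    import numpy as np

    def tune_threshold(approx_probs, orig_preds, labels, target_acc):
        """Return the lowest threshold whose cascade meets the accuracy target."""
        approx_preds = approx_probs.argmax(axis=1)
        for t in np.linspace(0.5, 1.0, 51):   # lower t => more early exits => faster
            confident = approx_probs.max(axis=1) >= t
            preds = np.where(confident, approx_preds, orig_preds)
            if (preds == labels).mean() >= target_acc:
                return t
        return 1.0  # fall back: always use the original model

    probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.20, 0.80]])
    print(tune_threshold(probs, orig_preds=np.array([0, 1, 1]),
                         labels=np.array([0, 1, 1]), target_acc=1.0))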
End-to-end Cascades: Selecting Features
Key question: Select which features?
[Diagram: Compute Selected Features → Approximate Model → Confidence > Threshold? Yes → Prediction; No → Compute Remaining Features → Original Model → Prediction]
End-to-end Cascades: Selecting Features
● Goal: Select features that minimize expected query time given accuracy target.
End-to-end Cascades: Selecting Features
Two possibilities for a query: can approximate (confidence above threshold, Yes branch) or can’t approximate (No branch, falls back to the original model).
End-to-end Cascades: Selecting Features
    min_S  P(approx) · cost(S) + P(~approx) · cost(F)
[Diagram: the Yes branch fires with probability P(approx) and costs cost(S), the cost of computing the selected features S and running the approximate model; the No branch fires with probability P(~approx) and costs cost(F), the cost of computing all features and running the original model.]
End-to-end Cascades: Selecting Features
● Goal: Select feature set S that minimizes expected query time:
    min_S  P(approx) · cost(S) + P(~approx) · cost(F)
End-to-end Cascades: Selecting Features
● Goal: Select feature set S that minimizes expected query time:
    min_S  P(approx) · cost(S) + P(~approx) · cost(F)
● Approach (sketched below):
    ○ Choose several potential values of cost(S).
    ○ Find the best feature set for each cost(S) budget.
    ○ Train a model & find the cascade threshold for each set.
    ○ Pick the best overall.
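A minimal sketch of this search, with toy stand-ins throughout: the greedy budgeted selection and the assumed P(approx) replace Willump's actual performance and accuracy estimators:

    import numpy as np

    def best_set_under_budget(costs, importances, budget):
        # Greedy stand-in: add features by importance per unit of cost.
        order = sorted(range(len(costs)),
                       key=lambda i: importances[i] / costs[i], reverse=True)
        chosen, spent = [], 0.0
        for i in order:
            if spent + costs[i] <= budget:
                chosen.append(i)
                spent += costs[i]
        return chosen, spent

    def search_feature_sets(costs, importances, budget_fracs=(0.1, 0.25, 0.5)):
        total_cost = float(sum(costs))
        best, best_expected = None, float("inf")
        for frac in budget_fracs:          # several potential values of cost(S)
            S, cost_S = best_set_under_budget(costs, importances,
                                              frac * total_cost)
            # Here Willump would train an approximate model on S and tune its
            # threshold; we stand in with an assumed P(approx) from importances.
            p_approx = min(1.0, sum(importances[i] for i in S))
            expected = p_approx * cost_S + (1 - p_approx) * total_cost
            if expected < best_expected:   # pick the best set overall
                best, best_expected = S, expected
        return best, best_expected

    costs = np.array([1.0, 2.0, 5.0, 10.0])        # toy per-feature costs
    importances = np.array([0.4, 0.3, 0.2, 0.1])   # toy per-feature importances
    print(search_feature_sets(costs, importances))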