Research at the Intersection of AI + Systems
Joseph E. Gonzalez, Assistant Professor, UC Berkeley (jegonzal@cs.berkeley.edu)
Looking Back on AI Systems Going back to when I started graduate school …
The machine learning community has had an evolving focus on AI Systems (2006 to 2017): fast algorithms, then distributed algorithms, then frameworks, and now ML for systems, with growing integration of the ML and systems communities.
Learning: Big Data + Training yields a Big Model. The focus of AI Systems research has been on model training.
Training (Big Data to Big Model) has been enabled by machine learning and systems innovations: deep learning (CNN/RNN), stochastic optimization, distributed dataflow systems (TensorFlow), domain-specific languages, symbolic methods, and GPU/TPU acceleration.
Example training systems: Splash, CoCoA, rllab, VW.
Learning (Big Training Data to Big Model) is only half the picture: what happens after training? Trained models must drive actions. An application issues a prediction query, and the model returns a decision.
Prediction: the application sends a query and the big model returns a decision. Goal: ~10 ms latency under heavy load. Complicated by deep learning, which brings new ML algorithms and systems.
Support low-latency, high-throughput serving workloads. Models are getting more complex:
Ø 10s of GFLOPs per prediction [1]
Ø Recurrent nets
Deployed on the critical path, using specialized hardware for predictions:
Ø Maintain latency goals under heavy load
[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2016.
Google Translate Serving: 140 billion words a day.
"If each of the world's Android phones used the new Google voice search for just three minutes a day, these engineers realized, the company would need twice as many data centers." [1]
82,000 GPUs running 24/7, so Google designed new hardware: the Tensor Processing Unit (TPU).
[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
Prediction-Serving Challenges:
Ø Large and growing ecosystem of ML models and frameworks (e.g., VW, Caffe)
Ø Support low-latency, high-throughput serving workloads
Wide range of applications and frameworks.
One-Off Systems for High-Value Tasks. Problems:
Ø Expensive to build and maintain: requires AI + systems expertise
Ø Tightly coupled model, framework, and application: difficult to update models and add new frameworks
Prediction Serving is an Open Problem
Ø Computationally challenging
Ø Need low latency & high throughput
Ø No standard technology or abstractions for serving models
Two efforts: Clipper, a low-latency prediction serving system [NSDI'17], and IDK Prediction Cascades, learning to make fast predictions [work in progress].
Clipper: A Low-Latency Prediction Serving System
Xin Wang, Giulio Zhou, Daniel Crankshaw, Yika Luo, Corey Zumar, Ion Stoica, Alexey Tumanov
Clipper: a middle layer for prediction serving, providing a common system abstraction and optimizations across frameworks (VW, Caffe, ...).
Clipper Decouples Applications and Models: applications call Predict and Observe through an RPC/REST interface; Clipper dispatches over RPC to per-framework model containers (MCs), e.g., a Caffe container.
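The decoupling can be sketched as a uniform container interface: every model, regardless of framework, sits behind the same predict call. This is an illustrative sketch, not Clipper's actual container API; the class and method names are assumptions.

```python
from abc import ABC, abstractmethod

class ModelContainer(ABC):
    """Uniform wrapper the serving layer calls over RPC,
    independent of the underlying ML framework."""

    @abstractmethod
    def predict(self, inputs):
        """Take a batch of inputs, return a batch of predictions."""

class SklearnContainer(ModelContainer):
    """Example: wrapping any object with a scikit-learn-style predict()."""
    def __init__(self, model):
        self.model = model

    def predict(self, inputs):
        return self.model.predict(inputs)

# The serving layer only ever sees ModelContainer.predict, so swapping
# frameworks (Caffe, TensorFlow, ...) does not change the application.
```

Because the application talks only to the abstract interface, models can be updated or replaced without redeploying the application.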
Clipper Architecture: applications call Predict/Observe via an RPC/REST interface.
Ø Model Selection Layer: combines predictions across frameworks.
Ø Model Abstraction Layer: provides a common interface and system optimizations, including caching, adaptive batching, and model isolation behind a common API.
Ø RPC connections to model containers (MCs), one per framework (e.g., Caffe).
Batching to Improve Throughput
Ø Why batching helps: a single page load may generate many queries, and frameworks and hardware are throughput-optimized.
Ø Optimal batch size depends on: hardware configuration, model and framework, and system load.
Clipper solution: adaptively trade off latency and throughput.
Ø Increase batch size until the latency objective is exceeded (Additive Increase)
Ø If latency exceeds the SLO, cut batch size by a fraction (Multiplicative Decrease)
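The additive-increase/multiplicative-decrease rule can be sketched as a small controller. A minimal sketch, not Clipper's implementation; the step size and backoff fraction are illustrative parameters.

```python
class AIMDBatchSizer:
    """Adaptively pick a batch size against a latency SLO:
    additive increase while under the SLO, multiplicative
    decrease when a batch exceeds it."""

    def __init__(self, slo_ms, step=1, backoff=0.75):
        self.slo_ms = slo_ms
        self.step = step        # additive increase per good batch
        self.backoff = backoff  # multiplicative decrease factor
        self.batch_size = 1

    def update(self, observed_latency_ms):
        if observed_latency_ms > self.slo_ms:
            # SLO violated: cut batch size by a fraction.
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        else:
            # Under the SLO: probe for more throughput.
            self.batch_size += self.step
        return self.batch_size
```

In steady state the controller oscillates just under the largest batch the model can serve within the latency objective, much like TCP congestion control.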
[Figure: adaptive batching vs. no batching across model frameworks. With adaptive batching, throughput (QPS) increases substantially while P99 latency (ms) stays near the latency objective; batch size converges to a different value per model.]
Overhead of the decoupled architecture: Clipper (Predict/Feedback over an RPC/REST interface to model containers) compared against TensorFlow Serving, a tightly integrated serving system.
[Figure: P99 latency (ms) and throughput (QPS); lower latency and higher throughput are better.] The decentralized system matches the performance of the centralized design.
Model Selection Layer: combine predictions across frameworks. The selection layer supports periodic retraining (version 1, version 2, version 3) and experimenting with new models and frameworks, while the Model Abstraction Layer continues to provide a common interface and system optimizations.
Selection Policy can Calibrate Confidence: each model version predicts a label; when the versions agree, the policy confidently returns that label ("CAT"), and when one disagrees, it returns UNSURE.
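The agreement rule can be sketched as a simple voting policy. A toy sketch of the idea; the function name and the UNSURE sentinel are illustrative, and the actual selection policies are learned online.

```python
from collections import Counter

def ensemble_policy(predictions, min_agree):
    """Return the majority label if at least `min_agree` of the
    model versions agree on it; otherwise report UNSURE."""
    label, votes = Counter(predictions).most_common(1)[0]
    if votes >= min_agree:
        return label      # confident: enough models agree
    return "UNSURE"       # disagreement: flag for fallback handling
```

Flagging disagreement lets the application apply a fallback (a default action, or a human review) only on the small, hard slice of the workload.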
Selection Policy: Estimate Confidence. [Figure: ImageNet Top-5 error rate, split into confident vs. unsure predictions, for a single model vs. 5-agree and 4-agree ensembles (reported values: 0.3182, 0.1983, 0.0586, 0.0469, 0.0327; lower is better). Bar width is the percentage of the query workload.]
Open Research Questions Ø Efficient execution of complex model compositions Ø Optimal batching to achieve end-to-end latency goals Ø Automatic model failure identification and correction Ø Use anomaly detection techniques to identify model failures Ø Prediction serving on the edge Ø Allowing models to span cloud and edge infrastructure http://clipper.ai
IDK Prediction Cascades: learning to make fast predictions [work in progress]. (Companion to Clipper, the low-latency prediction serving system [NSDI'17].)
[Figure: accuracy vs. relative computational cost as model complexity increases. Accuracy: 56.6, 69.8, 73.3, 76.2, 77.4, 78.3 (small but significant gains); relative cost: 0.08, 0.15, 0.31, 0.33, 0.67, 1.0 (an order-of-magnitude gap).] Model costs are increasing much faster than gains in accuracy.
IDK Prediction Cascades: simple models for simple tasks.
Xin Wang, Yika Luo, Daniel Crankshaw, Alexey Tumanov. https://arxiv.org/abs/1706.00885
A query first goes to a fast, simple model; if it answers "I don't know" (IDK), the query falls through to a slow, accurate model. Combine fast (inaccurate) models with slow (accurate) models to maximize accuracy while reducing computational cost.
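The query flow can be sketched as follows. The toy models here are illustrative stand-ins; in the linked paper the IDK decision comes from a learned gating function, not a hand-written rule.

```python
IDK = object()  # sentinel: the fast model declines to answer

def cascade(query, fast_model, slow_model):
    """Try the cheap model first; fall through to the accurate
    model only when the cheap model says 'I don't know'."""
    pred = fast_model(query)
    if pred is IDK:
        return slow_model(query)  # slow, accurate fallback
    return pred                   # fast path: most queries stop here

# Toy models: the fast model only handles "easy" (short) queries.
def fast(q):
    return q.upper() if len(q) <= 3 else IDK

def slow(q):
    return q.upper()
```

Because most queries are easy, the expensive model runs only on the hard tail, which is where the cost savings come from.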
[Figure: cascading a simple model in front of ResNet152. Accuracy holds at 78.3 across configurations while relative cost drops from 1.0 through 0.89, 0.8, and 0.76 to 0.63.] 37% reduction in runtime at no loss in accuracy.
Cascades within a Model: the same idea applies inside a single network. Insert gates between convolutional blocks; at each gate the network either executes the next block or skips it, and a final FC layer emits the prediction. Easy queries skip blocks and exit cheaply; hard queries use the full depth.
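Gating within a model can be sketched with plain functions standing in for conv blocks. A toy sketch under stated assumptions: real gates are small learned networks that decide per input whether each block runs, not hand-written predicates.

```python
def gated_forward(x, blocks, gates):
    """Run a sequence of residual blocks, letting a gate decide
    per block whether to execute it or skip it (identity shortcut)."""
    executed = 0
    for block, gate in zip(blocks, gates):
        if gate(x):          # gate says this block is worth running
            x = block(x)
            executed += 1
        # else: skip the block entirely; x passes through unchanged
    return x, executed
```

The average number of executed blocks, rather than the full depth, determines the cost per query, which is how gating reduces average depth without changing the architecture's accuracy ceiling.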
Cascading reduces computational cost. [Figure: average depth with no gate vs. a conv gate vs. an RNN gate. ResNet110: 110.0 down to ~67 (40% reduction); ResNet74: 74.0 down to ~54 (28%); ResNet38: 38.0 down to ~35 (10%).] Similar gains on larger models.
[Figure: distribution of the number of layers skipped. Easy images skip more layers; difficult images skip fewer.]