
Research at the intersection of AI + Systems Joseph E. Gonzalez - PowerPoint PPT Presentation



  1. Research at the intersection of AI + Systems Joseph E. Gonzalez Assistant Professor, UC Berkeley jegonzal@cs.berkeley.edu

  2. Looking Back on AI Systems: going back to when I started graduate school…

  3. The machine learning community has had an evolving focus on AI systems: from fast algorithms (2006), through frameworks, distributed frameworks, and deep learning frameworks, to ML for machine learning systems (2017). An integration of communities.

  4. Learning: Big Data → Training → Big Model. The focus of AI Systems research has been on model training.

  5. Training (Big Data → Big Model) has enabled machine learning and systems innovations: deep learning (CNN/RNN), stochastic optimization, symbolic methods, distributed dataflow systems, domain-specific languages (TensorFlow), and GPU/TPU acceleration.

  6. Training: Big Data → Big Model. [Logos: Splash, CoCoA, rllab, VW, and other training systems.]

  7. Learning: Big Training Data → Big Model → ?

  8. Learning: Big Training Data → Big Model → Drive Actions

  9. Learning: Big Training Data → Big Model. Prediction: Application → Query → Big Model → Decision (?)

  10. Prediction: Application → Query → Big Model → Decision. Goal: ~10 ms under heavy load. Complicated by deep learning → new ML algorithms and systems.

  11. Support low-latency, high-throughput serving workloads. Models are getting more complex: 10s of GFLOPs per prediction [1], recurrent nets, specialized hardware for predictions. Deployed on the critical path: must maintain latency goals under heavy load. [1] Deep Residual Learning for Image Recognition. He et al. CVPR 2016.

  12. Google Translate Serving: "If each of the world's Android phones used the new Google voice search for just three minutes a day, these engineers realized, the company would need twice as many data centers." – The New York Times [1]. 140 billion words a day; 82,000 GPUs running 24/7. Google designed new hardware: the Tensor Processing Unit (TPU). [1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

  13. Prediction-Serving Challenges: (1) a large and growing ecosystem of ML models and frameworks; (2) supporting low-latency, high-throughput serving workloads.

  14. Wide range of applications and frameworks.

  15. Wide range of applications and frameworks. [Logos of frameworks: VW, Caffe, and others.]

  16. One-off systems for high-value tasks. Problems: expensive to build and maintain (requires AI + systems expertise); tightly coupled model, framework, and application (difficult to update models and add new frameworks).

  17. Prediction serving is an open problem: computationally challenging; needs low latency and high throughput; no standard technology or abstractions for serving models. Talk outline: Clipper, a low-latency prediction serving system [NSDI'17]; IDK prediction cascades, learning to make fast predictions [work in progress].

  18. Clipper: a low-latency prediction serving system. Daniel Crankshaw, Xin Wang, Giulio Zhou, Yika Luo, Corey Zumar, Ion Stoica, Alexey Tumanov.

  19. Wide range of applications and frameworks. [Logos of frameworks: VW, Caffe, and others.]

  20. Clipper is a middle layer for prediction serving: a common system abstraction and shared optimizations sitting between applications and frameworks.

  21. Clipper decouples applications and models. Applications interact with Clipper through a Predict/Observe RPC/REST interface; Clipper talks to framework-specific model containers (MCs) over RPC.

  22. Clipper architecture: Applications → Predict/Observe RPC/REST interface → Clipper. Model Selection Layer: combines predictions across frameworks. Model Abstraction Layer: provides a common interface and system optimizations. Below, RPC connections to framework-specific model containers (MCs).
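The decoupling on this slide rests on every framework exposing the same batch-predict contract behind RPC. A minimal sketch of that idea, assuming an illustrative `ModelContainer` wrapper (not Clipper's actual API):

```python
# Illustrative sketch: wrap any framework-specific model behind a uniform
# batch-predict interface, so the serving layer never touches framework code.
class ModelContainer:
    """Wraps one model (any framework) behind a common interface."""

    def __init__(self, model, preprocess=lambda x: x):
        self.model = model          # framework-specific predict function
        self.preprocess = preprocess

    def predict_batch(self, inputs):
        # The serving layer sends a batch of queries in one RPC;
        # the container returns one prediction string per input.
        return [str(self.model(self.preprocess(x))) for x in inputs]

# Any callable model can be containerized, e.g. a toy threshold classifier:
container = ModelContainer(model=lambda x: "CAT" if x > 0.5 else "DOG")
print(container.predict_batch([0.9, 0.1]))  # ['CAT', 'DOG']
```

Because the interface is batch-oriented, the optimizations on the next slides (caching, adaptive batching) can be applied uniformly above any container.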

  23. Clipper architecture (continued): the Model Abstraction Layer provides caching, optimized batching, a common API, and model isolation, on top of the model selection layer and the model containers.

  24. Batching to improve throughput. Why batching helps: a single page load may generate many queries, and frameworks are throughput-optimized. The optimal batch size depends on: hardware configuration, model and framework, and system load. Clipper's solution: adaptively trade off latency and throughput by increasing the batch size until the latency objective is exceeded (additive increase), and if latency exceeds the SLO, cutting the batch size by a fraction (multiplicative decrease).
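The additive-increase/multiplicative-decrease rule above can be sketched as a small controller. This is a minimal illustration of the AIMD idea, with hypothetical parameter names and constants, not Clipper's actual implementation:

```python
# Sketch of an AIMD batch-size controller (illustrative constants).
def aimd_batch_size(batch_size, observed_latency_ms, slo_ms,
                    increase_step=1, decrease_factor=0.9):
    """Additive-increase / multiplicative-decrease batch sizing.

    Grow the batch while latency stays under the SLO; shrink it
    multiplicatively as soon as the SLO is violated.
    """
    if observed_latency_ms > slo_ms:
        # Multiplicative decrease: cut the batch by a fraction.
        return max(1, int(batch_size * decrease_factor))
    # Additive increase: keep probing for more throughput.
    return batch_size + increase_step

size = 10
size = aimd_batch_size(size, observed_latency_ms=5.0, slo_ms=10.0)   # -> 11
size = aimd_batch_size(size, observed_latency_ms=12.0, slo_ms=10.0)  # -> 9
```

Run once per completed batch, this converges to the largest batch that stays under the latency objective, which is exactly the trade-off the slide describes.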

  25. [Figure: Adaptive batching vs. no batching across model frameworks. Adaptive batching substantially raises throughput (QPS) while keeping P99 latency (ms) under the latency objective; the chosen batch size adapts per model.]

  26. Overhead of the decoupled architecture: Clipper (Predict/Feedback RPC/REST interface over RPC model containers) compared against TensorFlow Serving's centralized design.

  27. [Figure: P99 latency (ms) and throughput (QPS); lower latency and higher throughput are better.] The decentralized system matches the performance of the centralized design.

  28. Clipper architecture (recap): Model Selection Layer combines predictions across frameworks; Model Abstraction Layer provides a common interface and system optimizations over the model containers.


  30. The Model Selection Layer supports periodic retraining (version 1 → version 2 → version 3) and experimentation with new models and frameworks, without changing the application.

  31. The selection policy can calibrate confidence: given predictions from multiple model versions (e.g. "CAT", "CAT", "DJ", "CAT"), the policy returns "CAT" when the models agree, and UNSURE otherwise.
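The agreement rule on this slide can be made concrete with a few lines. A minimal sketch, assuming a simple "majority label wins only if enough models agree" policy (the function name and labels are illustrative, not Clipper's actual policy API):

```python
# Sketch of an agreement-based selection policy: return the majority label
# only when at least `min_agree` of the deployed models voted for it.
from collections import Counter

def select_prediction(predictions, min_agree):
    """Majority label if >= min_agree models agree, else UNSURE."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes >= min_agree else "UNSURE"

select_prediction(["CAT", "CAT", "DOG", "CAT", "CAT"], min_agree=4)  # 'CAT'
select_prediction(["CAT", "CAT", "DOG", "CAT", "CAT"], min_agree=5)  # 'UNSURE'
```

Tightening `min_agree` trades coverage for confidence, which is exactly the confident/unsure split quantified on the next slides.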

  32. Selection policy: estimate confidence. [Figure: ImageNet top-5 error rate (lower is better), split into confident and unsure queries, for 5-agree, ensemble, and 4-agree policies. Confident predictions have far lower error (0.0327-0.0586) than unsure ones (up to 0.3182); the plain ensemble sits at 0.1983.]

  33. Selection policy: estimate confidence. [Same figure as slide 32; bar width indicates the percentage of the query workload falling in each category.]

  34. Open research questions: efficient execution of complex model compositions; optimal batching to achieve end-to-end latency goals; automatic model failure identification and correction (using anomaly detection techniques to identify model failures); prediction serving on the edge (allowing models to span cloud and edge infrastructure). http://clipper.ai

  35. Talk outline (revisited): Clipper, a low-latency prediction serving system [NSDI'17]; IDK prediction cascades, learning to make fast predictions [work in progress].


  37. [Figure: accuracy (56.6 up to 78.3) and relative cost (0.08 up to 1) as model complexity increases. The accuracy gap is small but significant; the cost gap is an order of magnitude.] Model costs are increasing much faster than gains in accuracy.

  38. IDK Prediction Cascades: simple models for simple tasks. Xin Wang, Yika Luo, Daniel Crankshaw, Alexey Tumanov. https://arxiv.org/abs/1706.00885. A query first goes to a fast, simple model; if it answers "I don't know," the query falls through to a slow, accurate model. Combine fast (inaccurate) models with slow (accurate) models to maximize accuracy while reducing computational cost.
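The query flow above can be sketched in a few lines. A minimal illustration of the cascade idea, assuming the fast model reports a confidence score and a hypothetical threshold stands in for the learned IDK decision:

```python
# Sketch of an IDK prediction cascade: the fast model answers when it is
# confident; otherwise the query falls through to the slow, accurate model.
def cascade_predict(query, fast_model, slow_model, confidence_threshold=0.9):
    label, confidence = fast_model(query)
    if confidence >= confidence_threshold:
        return label             # cheap path: fast model is confident
    return slow_model(query)     # expensive path: fast model said "IDK"

# Toy usage with stand-in models:
fast = lambda q: ("CAT", 0.95)   # (prediction, confidence)
slow = lambda q: "CAT"
cascade_predict("image.jpg", fast, slow)  # 'CAT', without running slow model
```

Because most queries are easy, the slow model runs only on the hard tail, which is where the cost savings on the next slides come from.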

  39. [Figure: cascading a simple model with ResNet152. Accuracy holds at 78.3 across cascade configurations while relative cost drops from 1 through 0.89, 0.8, and 0.76 to 0.63.] 37% reduction in runtime at no loss in accuracy.

  40. Cascades within a model. [Diagram: a query flows through a stack of conv blocks toward the final FC layer and prediction; lightweight gates between blocks decide whether to continue executing or skip ahead.]

  41. Cascades within a model. [Diagram as on slide 40, with the gates skipping entire blocks ("skip blocks").]
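The skip-block diagram reduces to a simple forward pass where each residual block executes only if its gate fires. A minimal sketch under that assumption (function and gate names are illustrative, not the paper's architecture):

```python
# Sketch of gated execution within a residual network: a cheap gate decides,
# per input, whether to execute each block or skip it entirely.
def gated_forward(x, blocks, gates):
    """Run each residual block only if its gate fires for this input."""
    for block, gate in zip(blocks, gates):
        if gate(x):                # cheap gate network decides
            x = x + block(x)       # execute the block (residual update)
        # else: skip the block, saving its compute
    return x

# Toy usage: two "blocks" that each add 1; the second gate never fires.
gated_forward(1, [lambda v: 1, lambda v: 1],
                 [lambda v: True, lambda v: False])  # -> 2
```

Skipped blocks cost nothing at inference time, which is what drives the average-depth reductions reported on the next slide.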

  42. Cascading reduces computational cost. [Figure: average depth with no gate vs. ConvGate vs. RNNGate. ResNet110: 110.0 down to ~67 (40% reduction); ResNet74: 74.0 down to ~54 (28%); ResNet38: 38.0 down to ~35 (10%). Similar gains on larger models.]

  43. [Figure: distributions of the number of layers skipped. Easy images skip more layers; difficult images skip fewer.]
