Prediction Serving: what happens after learning? Joseph E. Gonzalez - PowerPoint PPT Presentation

  1. Prediction Serving: what happens after learning? Joseph E. Gonzalez, Asst. Professor, UC Berkeley (jegonzal@cs.berkeley.edu); Co-founder, Dato Inc. (joseph@dato.com)

  2. Outline: VELOX and Clipper, serving models from frameworks such as Caffe, VW, and Create. Joint work with Daniel Crankshaw, Xin Wang, Michael Franklin, and Ion Stoica.

  3. Learning: Training turns Big Data into a Big Model. Timescale: minutes to days. Systems: offline and batch optimized. Heavily studied - a major focus of the AMPLab.

  4. From Learning to Inference: the Big Model trained on Big Training Data takes a Query from the Application and returns a Decision.

  5. Inference: the trained Big Model answers each Application Query with a Decision. Timescale: ~10 milliseconds. Systems: online and latency optimized. Less studied.

  6. The full loop: Learning produces the Big Model, Inference answers Queries with Decisions, and the Application sends Feedback back into Learning.

  7. Feedback: closing the loop from the Application back into Training. Timescale: hours to weeks. Systems: a combination of systems. Less studied.

  8. What the loop needs: Responsive inference (~10 ms) and Adaptive learning (~1 second) over the Big Training Data.

  9. VELOX Model Serving System [CIDR'15]. Daniel Crankshaw, Peter Bailis, Haoyuan Li, Zhao Zhang, Joseph Gonzalez, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. Responsive (~10 ms) and Adaptive (~1 second). Key Insight: decompose models into fast and slow changing components.

  10. The learning/inference loop, revisited: Query in, Decision out, Feedback back to Learning.

  11. Decomposing the model inside the loop: a Slow Changing Model, updated offline, and a Fast Changing Model, updated online.

  12. Hybrid Offline + Online Learning. Predictions take the split form f(x; θ)ᵀ w_u. The feature function f(x; θ) is updated offline using batch solvers: leverage high-throughput systems (TensorFlow) and exploit the slow change in population statistics. The user weights w_u are updated online: simple to train, a more robust model, and it addresses rapidly changing user statistics. A minimal sketch of the online update appears below.
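
To make the fast path concrete, here is a minimal sketch, assuming the feature function f(x; θ) is trained offline and held fixed while only the per-user weights w_u change; the squared-loss SGD step and all names are illustrative choices, not Velox's exact update rule.

```python
import numpy as np

# Split-model sketch: f(x; theta) comes from the slow-changing offline
# model; only the per-user weights w_u change online. The squared-loss
# SGD step is an illustrative choice of online learner.

def predict(f_x, w_u):
    # score = f(x; theta)^T w_u
    return f_x @ w_u

def online_update(w_u, f_x, label, lr=0.1):
    # One gradient step on (f_x @ w_u - label)^2: sub-millisecond per
    # feedback event, versus seconds to retrain theta with a batch solver.
    return w_u - lr * (predict(f_x, w_u) - label) * f_x

# Usage: adapt one user's weights from a handful of feedback examples.
rng = np.random.default_rng(0)
d = 8
true_w = rng.normal(size=d)
w_u = np.zeros(d)
for _ in range(30):
    f_x = rng.normal(size=d)          # stands in for f(x; theta)
    w_u = online_update(w_u, f_x, f_x @ true_w)
```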

  13. Common modeling structure: the form f(x; θ)ᵀ w_u appears across matrix factorization (item factors combined with per-user factors), deep learning (learned features feeding a final linear layer), and ensemble methods.

  14. The split-model loop again: the Slow Changing Model is trained offline on the Big Training Data while the Fast Changing Model adapts online.

  15. The fast changing component becomes a model per user, layered on the shared slow changing model.

  16. Velox online learning for recommendations (20 Newsgroups): error versus number of examples. Online updates take 0.4 ms versus 7.1 seconds for retraining - more than 4 orders of magnitude faster adaptation, given sufficient offline training data.

  17. Velox partial updates for recommendations (20 Newsgroups): partial updates take 0.4 ms versus 7.1 seconds for retraining - again more than 4 orders of magnitude faster adaptation.

  18. The per-user fast changing model in the serving loop, revisited.

  19. Velox in the loop: it manages both the fast changing per-user models and the slow changing model between the Application and the Big Training Data.

  20. VELOX: the missing piece of BDAS. The Berkeley Data Analytics Stack - Spark, Spark Streaming, Spark SQL, GraphX, MLlib, KeystoneML, BlinkDB, and GraphFrames over Mesos, Tachyon, and HDFS/S3 - covers learning but not serving.

  21. VELOX completes the stack: Velox joins Spark at the model Management, Learning, and Serving layer of BDAS.

  22. The completed BDAS stack with Velox in place.

  23. VELOX architecture: applications (fraud detection, content recommendation) are served by Velox over KeystoneML and MLlib on Spark, all in a single JVM instance.

  24. More applications (fraud detection, content rec., personal assistant, robotic control, machine translation) and more frameworks (KeystoneML, MLlib, VW, Create, Caffe), but still a single JVM instance.

  25. Velox as a middle-layer architecture? Can the design generalize beyond Spark, KeystoneML, and MLlib to frameworks like VW, Create, and Caffe?

  26. Clipper: A Low-Latency Online Prediction Serving System. Daniel Crankshaw, Xin Wang, Michael Franklin, Joseph E. Gonzalez, and Ion Stoica.

  27. Clipper generalizes Velox across ML frameworks: a single serving layer between the applications (fraud detection, content rec., personal assistant, robotic control, machine translation) and the frameworks (VW, Create, Caffe).

  28. Key Insight: the challenges of prediction serving can be addressed between end-user applications and machine learning frameworks. As a result, Clipper is able to: hide complexity by providing a common prediction interface; bound latency and maximize throughput through approximate caching and adaptive batching; and enable robust online learning and personalization through generalized split-model correction policies - all without modifying machine learning frameworks or end-user applications.

  29. Clipper design goals: low and bounded latency predictions (interactive applications need reliable latency objectives); up-to-date and personalized predictions across models and frameworks (generalize the split-model decomposition); throughput optimized for performance under heavy load (a single query can trigger many predictions); and simplified deployment (serve models using their original code and systems).

  30. Clipper architecture overview: Clipper sits between the applications above and VW, Create, and Caffe below.

  31. Applications call Clipper through an RPC/REST interface with Predict and Observe; Clipper dispatches to VW, Create, and Caffe.

  32. Below Clipper, each framework (e.g., Caffe) runs behind its own Model Wrapper (MW), reached over RPC.

  33. Clipper's two layers: the Correction Layer improves accuracy through ensembles, online learning, and personalization; the Model Abstraction Layer provides a common interface to models while bounding latency and maximizing throughput, connecting to the model wrappers over RPC.

  34. The layers in detail: the Correction Layer implements the correction policy; the Model Abstraction Layer implements approximate caching and adaptive batching above the RPC model wrappers.

  35. The Model Abstraction Layer provides a unified generic prediction API across frameworks: reduce latency via approximate caching, increase throughput via adaptive batching, and simplify deployment via RPC + model wrappers. A sketch of such a wrapper contract follows.
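
As a minimal sketch of what a unified wrapper contract might look like (the class and method names are assumptions for illustration, not Clipper's actual API):

```python
from abc import ABC, abstractmethod
import numpy as np

# Hypothetical wrapper contract: every framework is adapted to one
# batch-prediction method, so the serving layer stays framework agnostic.

class ModelWrapper(ABC):
    @abstractmethod
    def predict_batch(self, inputs: np.ndarray) -> np.ndarray:
        """Evaluate the wrapped model on a batch of inputs."""

class LinearModelWrapper(ModelWrapper):
    # Example adapter: a trained linear model behind the common interface.
    # A Caffe or VW wrapper would translate the batch into that framework's
    # native calls, typically in a separate process reached over RPC.
    def __init__(self, weights: np.ndarray):
        self.weights = weights

    def predict_batch(self, inputs: np.ndarray) -> np.ndarray:
        return inputs @ self.weights

# Usage: the serving layer only ever calls ModelWrapper.predict_batch.
wrapper = LinearModelWrapper(np.array([0.5, -1.0, 2.0]))
scores = wrapper.predict_batch(np.array([[1.0, 0.0, 1.0],
                                         [0.0, 2.0, 0.5]]))
```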

  36. The Model Abstraction Layer diagram again, highlighting the common interface to the model wrappers.

  37. Common Interface simplifies Deployment: evaluate models using their original code and systems; models run in separate processes; resource isolation.

  38. The common interface also enables scale-out by replicating model wrappers. Problem: frameworks are optimized for batch processing, not latency.

  39. Adaptive Batching to Improve Throughput. Why batching helps: it enables hardware acceleration and amortizes system overhead, and a single page load may generate many queries. The optimal batch size depends on the hardware configuration, the model and framework, and the system load. Clipper's solution: be as slow as allowed - increase the batch size until the latency objective is exceeded (Additive Increase), and if latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease). A sketch of this AIMD controller appears below.
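
A minimal AIMD controller sketch, assuming a latency SLO in milliseconds and a caller that feeds back the measured latency of each batch; the constants and names are illustrative, not Clipper's implementation.

```python
def aimd_batch_controller(slo_ms: float,
                          increase: int = 1,
                          decrease_factor: float = 0.8):
    """Generator yielding the batch size to use for each round."""
    batch_size = 1
    while True:
        observed_ms = yield batch_size   # caller sends the measured latency
        if observed_ms <= slo_ms:
            batch_size += increase       # additive increase: probe upward
        else:
            # multiplicative decrease: back off when the SLO is violated
            batch_size = max(1, int(batch_size * decrease_factor))

# Usage: drive the controller with measured batch latencies (ms).
controller = aimd_batch_controller(slo_ms=20.0)
batch = next(controller)                 # initial batch size
for latency in [5.0, 8.0, 12.0, 25.0, 9.0]:
    batch = controller.send(latency)     # grows additively, backs off on 25.0
```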

  40. Plot: TensorFlow convolutional network (GPU) - throughput (queries per second) and latency (ms) versus batch size (queries), marking the latency deadline and the resulting optimal batch size.

  41. Comparison to TensorFlow Serving. Takeaway: Clipper matches the average latency of TensorFlow Serving while reducing tail latency (2x) and improving throughput (2x).

  42. Approximate Caching to Reduce Latency. Opportunity for caching: popular items may be evaluated frequently. Need for approximation: high-dimensional, continuous-valued queries (bag-of-words, images) have a low exact cache hit rate. Clipper's solution: approximate caching - apply locality-sensitive hash functions so that nearby queries land in the same cache entry, trading a small amount of error for cache hits. A sketch follows.
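
A minimal sketch of an approximate cache using random-hyperplane LSH (SimHash); the bucketing scheme, class, and parameters are illustrative assumptions, not Clipper's implementation.

```python
import numpy as np

# Nearby queries project to the same sign pattern and reuse a cached
# prediction, trading a small amount of error for reduced latency.

class ApproximateCache:
    def __init__(self, dim: int, n_bits: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.cache = {}

    def _key(self, query: np.ndarray):
        # The sign pattern of the projections is the (hashable) bucket id.
        return tuple(bool(b) for b in (self.planes @ query) > 0)

    def lookup(self, query: np.ndarray):
        return self.cache.get(self._key(query))      # None on a cache miss

    def store(self, query: np.ndarray, prediction: float):
        self.cache[self._key(query)] = prediction

# Usage: a near-duplicate query hits the entry cached for the original.
cache = ApproximateCache(dim=4)
q = np.array([1.0, 0.2, -0.5, 0.9])
cache.store(q, prediction=0.87)
hit = cache.lookup(q + 1e-3)   # tiny perturbation, very likely same bucket
```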

  43. The full Clipper architecture once more: Correction Layer (correction policy) over the Model Abstraction Layer (approximate caching, adaptive batching) over the RPC model wrappers.

  44. Clipper Correction Layer. Goal: maximize accuracy through ensembles, online learning, and personalization. Generalizing the split-model insight from Velox achieves robust predictions by combining multiple models and frameworks, and online learning and personalization by correcting and personalizing predictions in response to feedback. A sketch of one possible correction policy appears below.
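
As a minimal sketch of one possible correction policy (the multiplicative-weights update, class, and constants here are illustrative assumptions, not Clipper's actual policy):

```python
import numpy as np

# Base models stay black boxes behind the abstraction layer; a small
# per-user weight vector combines their predictions and is updated
# online from feedback, generalizing Velox's split-model idea.

class EnsembleCorrectionPolicy:
    def __init__(self, n_models: int, eta: float = 0.5):
        self.weights = np.ones(n_models) / n_models   # start uniform
        self.eta = eta

    def predict(self, model_preds: np.ndarray) -> float:
        # Weighted combination of the base models' predictions.
        return float(self.weights @ model_preds)

    def observe(self, model_preds: np.ndarray, label: float):
        # Down-weight each model in proportion to its per-example loss.
        losses = (model_preds - label) ** 2
        self.weights *= np.exp(-self.eta * losses)
        self.weights /= self.weights.sum()

# Usage: combine three framework outputs, then learn from feedback.
policy = EnsembleCorrectionPolicy(n_models=3)
preds = np.array([0.9, 0.4, 0.6])      # e.g., Caffe, VW, Create outputs
decision = policy.predict(preds)
policy.observe(preds, label=1.0)       # feedback shifts weight toward model 0
```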
