Prediction Serving: What Happens After Learning?
Joseph E. Gonzalez, Asst. Professor, UC Berkeley (jegonzal@cs.berkeley.edu), and Co-founder, Dato Inc. (joseph@dato.com)
Outline: Velox and Clipper
Joint work with Daniel Crankshaw, Xin Wang, Michael Franklin, & Ion Stoica
Learning: Big Data → Training → Big Model
Timescale: minutes to days. Systems: offline and batch optimized. Heavily studied ... a major focus of the AMPLab.
Inference: Query → Big Model → Decision → Application
Timescale: ~10 milliseconds. Systems: online and latency optimized. Less studied ...
Feedback: the application's decisions produce feedback that flows back into the training data, closing the loop.
Timescale: hours to weeks. Systems: a combination of systems. Less studied ...
Goal: close the loop with inference that is Responsive (~10ms) and learning that is Adaptive (~1 second).
Velox Model Serving System [CIDR'15]
Daniel Crankshaw, Peter Bailis, Haoyuan Li, Zhao Zhang, Joseph Gonzalez, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan
Responsive (~10ms) and Adaptive (~1 second). Key Insight: decompose models into fast and slow changing components.
Velox splits the model into two components: a Slow Changing Model, trained offline on Big Training Data, and a Fast Changing Model, updated online from application Feedback (diagram: Query → models → Decision → Application → Feedback).
Hybrid Offline + Online Learning, built around the shared prediction function $f(x; \theta)^\top w_u$.
Update the feature function $f(\,\cdot\,; \theta)$ offline using batch solvers:
• Leverage high-throughput systems (TensorFlow)
• Exploit slow change in population statistics
Update the user weights $w_u$ online:
• Simple to train + more robust model
• Addresses rapidly changing user statistics
Common modeling structure: $f(x; \theta)^\top w_u$. The same form appears in matrix factorization (item factors × user factors), deep learning (learned features of the input × per-user weights), and ensemble methods.
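A minimal sketch of the split-model idea in Python (illustrative only; `SplitModel`, `featurize`, and the online ridge-regression update are assumptions, not Velox's actual API): the feature function $f(x; \theta)$ is trained offline and held fixed, while each user's weight vector $w_u$ is refreshed with a cheap per-user solve on every feedback example.

```python
import numpy as np

class SplitModel:
    """Split model: prediction = f(x; theta) . w_u. The slow-changing
    feature function is passed in as `featurize` (trained offline); the
    fast-changing per-user weights w_u are fit online by ridge regression."""

    def __init__(self, featurize, dim, reg=1.0):
        self.featurize = featurize  # fixed offline-trained f(x; theta)
        self.dim = dim
        self.reg = reg
        self.A = {}  # per-user: reg * I + sum of f f^T
        self.b = {}  # per-user: sum of y * f
        self.w = {}  # per-user weight vector

    def _init_user(self, user):
        self.A[user] = self.reg * np.eye(self.dim)
        self.b[user] = np.zeros(self.dim)
        self.w[user] = np.zeros(self.dim)

    def predict(self, user, x):
        if user not in self.w:
            self._init_user(user)
        return float(self.featurize(x) @ self.w[user])

    def observe(self, user, x, y):
        """Fast online update from one feedback example: a small
        dim-by-dim solve, with no retraining of theta."""
        if user not in self.w:
            self._init_user(user)
        f = self.featurize(x)
        self.A[user] += np.outer(f, f)
        self.b[user] += y * f
        self.w[user] = np.linalg.solve(self.A[user], self.b[user])
```

Because only the small per-user system is re-solved on feedback, updates take milliseconds, while retraining $\theta$ remains an offline batch job.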
With personalization, the Fast Changing Model becomes a model per user, while the shared Slow Changing Model is still retrained slowly offline.
Velox Online Learning for Recommendations (20 Newsgroups)
Online partial updates: 0.4 ms. Full retraining: 7.1 seconds. Given sufficient offline training data, partial updates deliver >4 orders-of-magnitude faster adaptation. [Figure: test error vs. number of online examples, 0 to 30.]
Velox implements this split: it serves the slow changing model and maintains the fast changing per-user models, sitting between the application and the offline training pipeline.
Velox: the Missing Piece of BDAS
The Berkeley Data Analytics Stack today: Spark Streaming, Spark SQL, BlinkDB, GraphX, MLlib, GraphFrames, and KeystoneML on Spark, over Mesos, Tachyon, and HDFS/S3.
Velox adds the missing layer to BDAS: model management, learning, and serving, alongside the existing Spark processing engines.
Velox Architecture: applications (Fraud Detection, Content Rec.) are served by KeystoneML and MLlib models through Velox, all running on Spark inside a single JVM instance.
But new applications (Personal Asst., Robotic Control, Machine Translation) bring models from frameworks outside that JVM: VW, Create, and Caffe.
Velox as a Middle Layer Arch? Can the design generalize so that one serving layer sits between all applications and all frameworks (VW, Create, Caffe, and Spark/KeystoneML/MLlib)?
Clipper: A Low-Latency Online Prediction Serving System
Daniel Crankshaw, Xin Wang, Michael Franklin, Joseph E. Gonzalez, and Ion Stoica
Clipper Generalizes Velox Across ML Frameworks: applications (Fraud Detection, Content Rec., Personal Asst., Robotic Control, Machine Translation) → Clipper → frameworks (VW, Create, Caffe).
Key Insight: the challenges of prediction serving can be addressed between end-user applications and machine learning frameworks (Caffe, VW, Create).
As a result, Clipper is able to:
• hide complexity by providing a common prediction interface
• bound latency and maximize throughput through approximate caching and adaptive batching
• enable robust online learning and personalization through generalized split-model correction policies
all without modifying machine learning frameworks or end-user applications.
Clipper Design Goals
Low and bounded latency predictions
• interactive applications need reliable latency objectives
Up-to-date and personalized predictions across models and frameworks
• generalize the split-model decomposition
Optimize throughput for performance under heavy load
• a single query can trigger many predictions
Simplify deployment
• serve models using the original code and systems
Clipper Architecture: applications (Fraud Detection, Content Rec., Personal Asst., Robotic Control, Machine Translation) sit above Clipper, which sits above the frameworks (VW, Create, Caffe).
Applications call Clipper through an RPC/REST interface with two operations: Predict and Observe (feedback).
Inside Clipper:
• Correction Layer (Correction Policy): improves accuracy through ensembles, online learning, and personalization.
• Model Abstraction Layer (Approximate Caching, Adaptive Batching): provides a common interface to models while bounding latency and maximizing throughput.
Below Clipper, each model (Caffe, VW, Create, ...) runs behind its own RPC-connected Model Wrapper (MW).
Model Abstraction Layer: provides a unified generic prediction API across frameworks.
• Reduce Latency → Approximate Caching
• Increase Throughput → Adaptive Batching
• Simplify Deployment → RPC + Model Wrapper
The RPC + Model Wrapper design gives Clipper a common interface to models.
Common Interface → Simplifies Deployment:
• Evaluate models using original code & systems
• Models run in separate processes
• Resource isolation
• Scale-out
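A sketch of what such a common wrapper interface might look like (hypothetical names; Clipper's real RPC protocol and API differ): each framework's model answers a single batched `predict` call and runs in its own process.

```python
from abc import ABC, abstractmethod
from typing import Any, List

class ModelWrapper(ABC):
    """Illustrative common prediction interface. Each wrapper runs in
    its own process and answers batched predict calls over RPC, so the
    serving layer never links against framework code directly."""

    @abstractmethod
    def predict(self, inputs: List[Any]) -> List[float]:
        """Evaluate the wrapped model on a batch of inputs."""

class SklearnWrapper(ModelWrapper):
    """Example wrapper around any trained scikit-learn-style model."""

    def __init__(self, model):
        self.model = model  # trained with the original framework and code

    def predict(self, inputs):
        return self.model.predict(inputs).tolist()
```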
Problem: frameworks are optimized for batch processing, not latency.
Adaptive Batching to Improve Throughput
• Why batching helps: hardware acceleration, and it amortizes system overhead; a single page load may generate many queries.
• The optimal batch size depends on: hardware configuration, model and framework, system load.
Clipper Solution: be as slow as allowed ...
• Increase batch size until the latency objective is exceeded (Additive Increase)
• If latency exceeds the SLO, cut batch size by a fraction (Multiplicative Decrease)
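A minimal sketch of the additive-increase/multiplicative-decrease rule described above (the queue and reply plumbing are hypothetical stand-ins; the constants are illustrative, not Clipper's):

```python
import time

def serve_with_aimd(queue, model, slo_ms=20.0, backoff=0.9):
    """AIMD batch sizing: grow the batch additively while measured
    latency stays under the SLO; cut it multiplicatively on a violation."""
    batch_size = 1
    while True:
        batch = queue.take(batch_size)  # assumed: blocks until queries arrive
        start = time.perf_counter()
        predictions = model.predict([q.input for q in batch])
        for query, pred in zip(batch, predictions):
            query.reply(pred)           # assumed: returns result to caller
        latency_ms = (time.perf_counter() - start) * 1000.0
        if latency_ms > slo_ms:
            batch_size = max(1, int(batch_size * backoff))  # multiplicative decrease
        else:
            batch_size += 1                                 # additive increase
```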
[Figure: TensorFlow Conv. Net (GPU). Throughput (queries per second) and latency (ms) vs. batch size (queries); the latency deadline picks out the optimal batch size.]
Comparison to TensorFlow Serving. Takeaway: Clipper matches the average latency of TensorFlow Serving while reducing tail latency by 2x and improving throughput by 2x.
Approximate Caching to Reduce Latency
• Opportunity for caching: popular items may be evaluated frequently (cache hits).
• Need for approximation: high-dimensional, continuous-valued queries (bag-of-words text, images) have a low exact cache hit rate.
Clipper Solution: Approximate Caching. Apply locality-sensitive hash functions so that similar queries map to the same cache entry, trading a small amount of model error for cache hits.
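A minimal sketch of approximate caching with random-hyperplane locality-sensitive hashing (SimHash); the class and parameter names are illustrative, and a real system must also bound the resulting prediction error:

```python
import numpy as np

class ApproximateCache:
    """Nearby high-dimensional queries hash to the same bucket, so a
    cache hit can return the prediction of a similar (not identical)
    earlier input, trading a small error for a large latency win."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # random hyperplanes
        self.cache = {}

    def _key(self, x):
        # The sign pattern of the projections forms an n_bits-bit bucket id.
        return ((self.planes @ x) > 0).tobytes()

    def get(self, x):
        return self.cache.get(self._key(x))  # None on a cache miss

    def put(self, x, prediction):
        self.cache[self._key(x)] = prediction
```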
Clipper Correction Layer (Correction Policy)
Goal: maximize accuracy through ensembles, online learning, and personalization.
Generalize the split-model insight from Velox to achieve:
• robust predictions by combining multiple models & frameworks
• online learning and personalization by correcting and personalizing predictions in response to feedback
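A minimal sketch of one possible correction policy (illustrative, not Clipper's implementation): per-user multiplicative weights over the fixed base models, so feedback reweights the ensemble without touching the underlying frameworks.

```python
import numpy as np

class EnsembleCorrectionPolicy:
    """Per-user multiplicative-weights ensemble: the slow-changing base
    models stay fixed; only a small weight vector over their outputs is
    updated online from feedback."""

    def __init__(self, n_models, lr=0.5):
        self.n_models = n_models
        self.lr = lr
        self.weights = {}  # user id -> weight vector over base models

    def _w(self, user):
        return self.weights.setdefault(
            user, np.full(self.n_models, 1.0 / self.n_models))

    def predict(self, user, model_preds):
        # Weighted combination of the base models' predictions.
        return float(self._w(user) @ np.asarray(model_preds))

    def observe(self, user, model_preds, label):
        # Downweight models in proportion to their squared error on feedback.
        losses = (np.asarray(model_preds) - label) ** 2
        w = self._w(user) * np.exp(-self.lr * losses)
        self.weights[user] = w / w.sum()
```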