Prediction Serving: what happens after learning? Joseph E. Gonzalez - PowerPoint PPT Presentation

  1. Prediction Serving: what happens after learning? Joseph E. Gonzalez, Asst. Professor, UC Berkeley (jegonzal@cs.berkeley.edu); Co-founder, Dato Inc. (joseph@dato.com)

  2. Outline: VELOX and Clipper, serving models from frameworks such as Caffe, VW, and Create. Joint work with Daniel Crankshaw, Xin Wang, Michael Franklin, and Ion Stoica.

  3. Learning: Training turns Big Data into a Big Model. Timescale: minutes to days. Systems: offline and batch optimized. Heavily studied - a major focus of the AMPLab.

  4. From Learning to Inference: the Big Model trained on Big Training Data takes a Query from the Application and returns a Decision.

  5. Inference: the trained Big Model answers each Application Query with a Decision. Timescale: ~10 milliseconds. Systems: online and latency optimized. Less studied.

  6. The full loop: Learning produces the Big Model, Inference answers Queries with Decisions, and the Application sends Feedback back into Learning.

  7. Feedback: closing the loop from the Application back into Training. Timescale: hours to weeks. Systems: a combination of systems. Less studied.

  8. What the loop needs: Responsive inference (~10 ms) and Adaptive learning (~1 second) over the Big Training Data.

  9. VELOX Model Serving System [CIDR'15]. Daniel Crankshaw, Peter Bailis, Haoyuan Li, Zhao Zhang, Joseph Gonzalez, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. Responsive (~10 ms) and Adaptive (~1 second). Key Insight: decompose models into fast and slow changing components.

  10. The learning/inference loop, revisited: Query in, Decision out, Feedback back to Learning.

  11. Decomposing the model inside the loop: a Slow Changing Model, updated offline, and a Fast Changing Model, updated online.

  12. Hybrid Offline + Online Learning. Predictions take the split form f(x; θ)ᵀ w_u. The feature function f(x; θ) is updated offline using batch solvers: leverage high-throughput systems (TensorFlow) and exploit the slow change in population statistics. The user weights w_u are updated online: simple to train, a more robust model, and it addresses rapidly changing user statistics. A minimal sketch of the online update appears below.
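
To make the fast path concrete, here is a minimal sketch, assuming the feature function f(x; θ) is trained offline and held fixed while only the per-user weights w_u change; the squared-loss SGD step and all names are illustrative choices, not Velox's exact update rule.

```python
import numpy as np

# Split-model sketch: f(x; theta) comes from the slow-changing offline
# model; only the per-user weights w_u change online. The squared-loss
# SGD step is an illustrative choice of online learner.

def predict(f_x, w_u):
    # score = f(x; theta)^T w_u
    return f_x @ w_u

def online_update(w_u, f_x, label, lr=0.1):
    # One gradient step on (f_x @ w_u - label)^2: sub-millisecond per
    # feedback event, versus seconds to retrain theta with a batch solver.
    return w_u - lr * (predict(f_x, w_u) - label) * f_x

# Usage: adapt one user's weights from a handful of feedback examples.
rng = np.random.default_rng(0)
d = 8
true_w = rng.normal(size=d)
w_u = np.zeros(d)
for _ in range(30):
    f_x = rng.normal(size=d)          # stands in for f(x; theta)
    w_u = online_update(w_u, f_x, f_x @ true_w)
```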

  13. Common modeling structure: the form f(x; θ)ᵀ w_u appears across matrix factorization (item factors combined with per-user factors), deep learning (learned features feeding a final linear layer), and ensemble methods.

  14. The split-model loop again: the Slow Changing Model is trained offline on the Big Training Data while the Fast Changing Model adapts online.

  15. The fast changing component becomes a model per user, layered on the shared slow changing model.

  16. Velox online learning for recommendations (20 Newsgroups): error versus number of examples. Online updates take 0.4 ms versus 7.1 seconds for retraining - more than 4 orders of magnitude faster adaptation, given sufficient offline training data.

  17. Velox partial updates for recommendations (20 Newsgroups): partial updates take 0.4 ms versus 7.1 seconds for retraining - again more than 4 orders of magnitude faster adaptation.

  18. The per-user fast changing model in the serving loop, revisited.

  19. Velox in the loop: it manages both the fast changing per-user models and the slow changing model between the Application and the Big Training Data.

  20. VELOX: the missing piece of BDAS. The Berkeley Data Analytics Stack - Spark, Spark Streaming, Spark SQL, GraphX, MLlib, KeystoneML, BlinkDB, and GraphFrames over Mesos, Tachyon, and HDFS/S3 - covers learning but not serving.

  21. VELOX completes the stack: Velox joins Spark at the model Management, Learning, and Serving layer of BDAS.

  22. The completed BDAS stack with Velox in place.

  23. VELOX architecture: applications (fraud detection, content recommendation) are served by Velox over KeystoneML and MLlib on Spark, all in a single JVM instance.

  24. More applications (fraud detection, content rec., personal assistant, robotic control, machine translation) and more frameworks (KeystoneML, MLlib, VW, Create, Caffe), but still a single JVM instance.

  25. Velox as a middle-layer architecture? Can the design generalize beyond Spark, KeystoneML, and MLlib to frameworks like VW, Create, and Caffe?

  26. Clipper: A Low-Latency Online Prediction Serving System. Daniel Crankshaw, Xin Wang, Michael Franklin, Joseph E. Gonzalez, and Ion Stoica.

  27. Clipper generalizes Velox across ML frameworks: a single serving layer between the applications (fraud detection, content rec., personal assistant, robotic control, machine translation) and the frameworks (VW, Create, Caffe).

  28. Key Insight: the challenges of prediction serving can be addressed between end-user applications and machine learning frameworks. As a result, Clipper is able to: hide complexity by providing a common prediction interface; bound latency and maximize throughput through approximate caching and adaptive batching; and enable robust online learning and personalization through generalized split-model correction policies - all without modifying machine learning frameworks or end-user applications.

  29. Clipper design goals: low and bounded latency predictions (interactive applications need reliable latency objectives); up-to-date and personalized predictions across models and frameworks (generalize the split-model decomposition); throughput optimized for performance under heavy load (a single query can trigger many predictions); and simplified deployment (serve models using their original code and systems).

  30. Clipper architecture overview: Clipper sits between the applications above and VW, Create, and Caffe below.

  31. Applications call Clipper through an RPC/REST interface with Predict and Observe; Clipper dispatches to VW, Create, and Caffe.

  32. Below Clipper, each framework (e.g., Caffe) runs behind its own Model Wrapper (MW), reached over RPC.

  33. Clipper's two layers: the Correction Layer improves accuracy through ensembles, online learning, and personalization; the Model Abstraction Layer provides a common interface to models while bounding latency and maximizing throughput, connecting to the model wrappers over RPC.

  34. The layers in detail: the Correction Layer implements the correction policy; the Model Abstraction Layer implements approximate caching and adaptive batching above the RPC model wrappers.

  35. The Model Abstraction Layer provides a unified generic prediction API across frameworks: reduce latency via approximate caching, increase throughput via adaptive batching, and simplify deployment via RPC + model wrappers. A sketch of such a wrapper contract follows.
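
As a minimal sketch of what a unified wrapper contract might look like (the class and method names are assumptions for illustration, not Clipper's actual API):

```python
from abc import ABC, abstractmethod
import numpy as np

# Hypothetical wrapper contract: every framework is adapted to one
# batch-prediction method, so the serving layer stays framework agnostic.

class ModelWrapper(ABC):
    @abstractmethod
    def predict_batch(self, inputs: np.ndarray) -> np.ndarray:
        """Evaluate the wrapped model on a batch of inputs."""

class LinearModelWrapper(ModelWrapper):
    # Example adapter: a trained linear model behind the common interface.
    # A Caffe or VW wrapper would translate the batch into that framework's
    # native calls, typically in a separate process reached over RPC.
    def __init__(self, weights: np.ndarray):
        self.weights = weights

    def predict_batch(self, inputs: np.ndarray) -> np.ndarray:
        return inputs @ self.weights

# Usage: the serving layer only ever calls ModelWrapper.predict_batch.
wrapper = LinearModelWrapper(np.array([0.5, -1.0, 2.0]))
scores = wrapper.predict_batch(np.array([[1.0, 0.0, 1.0],
                                         [0.0, 2.0, 0.5]]))
```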

  36. The Model Abstraction Layer diagram again, highlighting the common interface to the model wrappers.

  37. Common Interface simplifies Deployment: evaluate models using their original code and systems; models run in separate processes; resource isolation.

  38. The common interface also enables scale-out by replicating model wrappers. Problem: frameworks are optimized for batch processing, not latency.

  39. Adaptive Batching to Improve Throughput. Why batching helps: it enables hardware acceleration and amortizes system overhead, and a single page load may generate many queries. The optimal batch size depends on the hardware configuration, the model and framework, and the system load. Clipper's solution: be as slow as allowed - increase the batch size until the latency objective is exceeded (Additive Increase), and if latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease). A sketch of this AIMD controller appears below.
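
A minimal AIMD controller sketch, assuming a latency SLO in milliseconds and a caller that feeds back the measured latency of each batch; the constants and names are illustrative, not Clipper's implementation.

```python
def aimd_batch_controller(slo_ms: float,
                          increase: int = 1,
                          decrease_factor: float = 0.8):
    """Generator yielding the batch size to use for each round."""
    batch_size = 1
    while True:
        observed_ms = yield batch_size   # caller sends the measured latency
        if observed_ms <= slo_ms:
            batch_size += increase       # additive increase: probe upward
        else:
            # multiplicative decrease: back off when the SLO is violated
            batch_size = max(1, int(batch_size * decrease_factor))

# Usage: drive the controller with measured batch latencies (ms).
controller = aimd_batch_controller(slo_ms=20.0)
batch = next(controller)                 # initial batch size
for latency in [5.0, 8.0, 12.0, 25.0, 9.0]:
    batch = controller.send(latency)     # grows additively, backs off on 25.0
```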

  40. Plot: TensorFlow convolutional network (GPU) - throughput (queries per second) and latency (ms) versus batch size (queries), marking the latency deadline and the resulting optimal batch size.

  41. Comparison to TensorFlow Serving. Takeaway: Clipper matches the average latency of TensorFlow Serving while reducing tail latency (2x) and improving throughput (2x).

  42. Approximate Caching to Reduce Latency. Opportunity for caching: popular items may be evaluated frequently. Need for approximation: high-dimensional, continuous-valued queries (bag-of-words, images) have a low exact cache hit rate. Clipper's solution: approximate caching - apply locality-sensitive hash functions so that nearby queries land in the same cache entry, trading a small amount of error for cache hits. A sketch follows.
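
A minimal sketch of an approximate cache using random-hyperplane LSH (SimHash); the bucketing scheme, class, and parameters are illustrative assumptions, not Clipper's implementation.

```python
import numpy as np

# Nearby queries project to the same sign pattern and reuse a cached
# prediction, trading a small amount of error for reduced latency.

class ApproximateCache:
    def __init__(self, dim: int, n_bits: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.cache = {}

    def _key(self, query: np.ndarray):
        # The sign pattern of the projections is the (hashable) bucket id.
        return tuple(bool(b) for b in (self.planes @ query) > 0)

    def lookup(self, query: np.ndarray):
        return self.cache.get(self._key(query))      # None on a cache miss

    def store(self, query: np.ndarray, prediction: float):
        self.cache[self._key(query)] = prediction

# Usage: a near-duplicate query hits the entry cached for the original.
cache = ApproximateCache(dim=4)
q = np.array([1.0, 0.2, -0.5, 0.9])
cache.store(q, prediction=0.87)
hit = cache.lookup(q + 1e-3)   # tiny perturbation, very likely same bucket
```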

  43. The full Clipper architecture once more: Correction Layer (correction policy) over the Model Abstraction Layer (approximate caching, adaptive batching) over the RPC model wrappers.

  44. Clipper Correction Layer. Goal: maximize accuracy through ensembles, online learning, and personalization. Generalizing the split-model insight from Velox achieves robust predictions by combining multiple models and frameworks, and online learning and personalization by correcting and personalizing predictions in response to feedback. A sketch of one possible correction policy appears below.
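
As a minimal sketch of one possible correction policy (the multiplicative-weights update, class, and constants here are illustrative assumptions, not Clipper's actual policy):

```python
import numpy as np

# Base models stay black boxes behind the abstraction layer; a small
# per-user weight vector combines their predictions and is updated
# online from feedback, generalizing Velox's split-model idea.

class EnsembleCorrectionPolicy:
    def __init__(self, n_models: int, eta: float = 0.5):
        self.weights = np.ones(n_models) / n_models   # start uniform
        self.eta = eta

    def predict(self, model_preds: np.ndarray) -> float:
        # Weighted combination of the base models' predictions.
        return float(self.weights @ model_preds)

    def observe(self, model_preds: np.ndarray, label: float):
        # Down-weight each model in proportion to its per-example loss.
        losses = (model_preds - label) ** 2
        self.weights *= np.exp(-self.eta * losses)
        self.weights /= self.weights.sum()

# Usage: combine three framework outputs, then learn from feedback.
policy = EnsembleCorrectionPolicy(n_models=3)
preds = np.array([0.9, 0.4, 0.6])      # e.g., Caffe, VW, Create outputs
decision = policy.predict(preds)
policy.observe(preds, label=1.0)       # feedback shifts weight toward model 0
```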
