SafePredict: a meta-algorithm for machine learning to guarantee correctness by refusing occasionally


  1. SafePredict: a meta-algorithm for machine learning to guarantee correctness by refusing occasionally
  Mustafa A. Kocak¹, David Ramirez¹, Elza Erkip¹, and Dennis E. Shasha²
  ¹ NYU Tandon School of Engineering
  ² NYU's Courant Institute of Mathematical Sciences

  2. Introduction
  - Machine learning and prediction algorithms are the building blocks of automation and forecasting.
  - Reliability is crucial in risk-critical applications:
    - analytics, risk assessment, credit decisions;
    - health care, medical diagnosis;
    - judicial decision making.
  - Basic idea: Create a meta-algorithm that takes predictions from underlying machine learning algorithms and decides whether to pass them on to higher-level applications.
  - Goal: Achieve robust correctness guarantees for the predictions emitted by the meta-algorithm.

  3. What Does It Mean to Refuse?
  - The implications of refusing to make a prediction vary according to the application of interest:
    - do more tests / collect more data;
    - request user feedback, or ask a human expert to make the decision.
  - We want to refuse seldom while still achieving the error bound.

  4. Novelty and Teaser
  - SafePredict achieves a desired error bound without any assumption on the data or the base predictor.
  - It tracks changes in the error rate of the base predictor to avoid refusing too much.

  5. Literature Review

  6. Batch Setup
  - Data: Z_i = (X_i, Y_i) ∈ X × Y, drawn i.i.d. from D for all i = 1, ..., m+1.
  - Probability of error: P_e = P(Ŷ_{m+1} ∉ {Y_{m+1}, ∅} | Z^m).
  - Probability of refusal: P_r = P(Ŷ_{m+1} = ∅ | Z^m).
  - Goal: Minimize P_e + κ·P_r, where κ is the cost of a refusal relative to an error.

  7. Related Work (Batch Setup)
  - Chow, 1970: Assuming D is known, the optimal refusal mechanism is
    Ŷ(X) = y* if P(Y = y* | X = x) ≥ 1 − κ, and ∅ otherwise,
    where y* = arg max_y P(Y = y | X) is the MAP prediction.
  - For unknown D, instead minimize the empirical estimate P̂_e + κ·P̂_r.
  - Wegkamp et al., 2006-2008: Rejection with hinge loss and lasso.
  - Wiener and El Yaniv, 2010-2012: Relationship with active learning, and selective SVM.
  - Cortes et al., 2016-2017: Kernel-based methods and boosting.
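To make Chow's rule concrete, here is a minimal Python sketch, assuming the true posterior P(Y | X = x) is available as a probability vector. The function name, the REFUSE sentinel, and the example values are illustrative, not part of the original work:

```python
import numpy as np

REFUSE = None  # stands in for the refusal symbol ∅

def chow_rule(posterior, kappa):
    """Chow's optimal refusal rule for a known distribution D.

    posterior: array of P(Y = y | X = x) over the label set.
    kappa:     cost of a refusal relative to an error, in [0, 1].
    Predicts the MAP label when its posterior mass is at least
    1 - kappa; refuses otherwise.
    """
    y_star = int(np.argmax(posterior))
    return y_star if posterior[y_star] >= 1 - kappa else REFUSE

# With kappa = 0.3, the rule refuses unless some label carries
# at least 70% of the posterior mass.
print(chow_rule(np.array([0.6, 0.3, 0.1]), kappa=0.3))  # -> None (refuse)
print(chow_rule(np.array([0.8, 0.1, 0.1]), kappa=0.3))  # -> 0
```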

  8. Refuse Option via Meta-Algorithms
  - In practice, a meta-algorithm approach is much more common.
  - The base predictor P is characterized by a scoring function S:
    - S(X, Y): how typical/probable/likely is (X, Y)?
    - y* = arg max_{y ∈ Y} S(X, y).
  - The meta-algorithm M is characterized by a threshold τ:
    Ŷ(X) = y* if S(X, y*) ≥ τ, and ∅ otherwise.
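A sketch of this generic threshold meta-algorithm, reusing the REFUSE sentinel from the snippet above; score_fn, labels, and tau are placeholder names. Note that with S(x, y) = P(Y = y | X = x) and τ = 1 − κ it reduces to Chow's rule:

```python
def meta_predict(score_fn, labels, x, tau):
    """Threshold meta-algorithm: predict the best-scoring label y*
    if S(x, y*) >= tau, otherwise refuse."""
    y_star = max(labels, key=lambda y: score_fn(x, y))
    return y_star if score_fn(x, y_star) >= tau else REFUSE
```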

  9. Conformal Prediction (Vovk et al., 2005)
  - A conformity score S(x, y) measures how well (x, y) conforms with the training data,
    e.g. distance to the decision boundary, out-of-bag scores, other probability estimates.
  - Strong guarantees in terms of coverage, i.e. P_e ≤ ϵ + o(1).
  - The probability of refusal is asymptotically minimized if S is consistent.

  10. Probability of Error on Non-Refused Data Points
  - A more practical quantity of interest than P_e is the probability of error given not refused:
    P_{e|r̄} := P(Ŷ_{m+1} ≠ Y_{m+1} | Ŷ_{m+1} ≠ ∅, Z^m) = P_e / (1 − P_r).
  - There are two main approaches to this problem:
    1. Conjugate prediction: For any given scoring function S, calibrate the threshold τ to guarantee P_{e|r̄} ≤ ϵ.
    2. Probability calibration: Fix τ = 1 − ϵ, and learn a monotonic function F to calibrate the scoring function S, i.e. F(S(x, y)) ≃ P(Y = y | X = x). Typical methods: isotonic and Platt's regression.
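As a small illustration, an empirical estimator for P_{e|r̄}, counting errors among the non-refused points only (again using the REFUSE sentinel from the earlier sketch):

```python
def error_given_not_refused(y_hat, y_true):
    """Empirical P_{e|r}: error rate among the non-refused predictions."""
    kept = [(p, t) for p, t in zip(y_hat, y_true) if p is not REFUSE]
    if not kept:
        return 0.0  # nothing was predicted, so no conditional error
    return sum(p != t for p, t in kept) / len(kept)
```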

  11. Conjugate Prediction - Calibration Step
  1. Split the training set into a core training set, Z_1^n, and a calibration set, Z_{n+1}^{n+l}, where n + l = m.
  2. Train the base classifier P on the core training set.
  3. Choose the smallest threshold τ* that gives an empirical error rate of at most ϵ on the calibration set, i.e.
     τ* = inf{ τ : ( Σ_{i=n+1}^{n+l} 1[Ŷ_i ∉ {Y_i, ∅}] ) / ( Σ_{i=n+1}^{n+l} 1[Ŷ_i ≠ ∅] ) ≤ ϵ }.
  - Theorem: With probability at least 1 − δ,
    P_{e|r̄} ≤ ϵ + √(log(l/δ)/l) / (1 − P_r).
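A possible implementation of the threshold-selection step (step 3), assuming the calibration scores and correctness indicators have already been computed. Candidate thresholds are taken from the observed scores; the function name is illustrative:

```python
import numpy as np

def calibrate_threshold(scores, correct, eps):
    """Smallest threshold whose empirical error rate on the
    non-refused calibration points is at most eps.

    scores:  S(x_i, y*_i), the score of the top label for each
             calibration point (numpy array).
    correct: boolean array, whether y*_i equals the true label y_i.
    """
    best_tau = np.inf  # refuse everything if no threshold qualifies
    for tau in np.unique(scores):       # candidate thresholds
        kept = scores >= tau            # non-refused points at this tau
        if kept.sum() == 0:
            continue
        err = (~correct[kept]).mean()   # empirical error among kept
        if err <= eps:
            best_tau = min(best_tau, tau)
    return best_tau
```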

  12. Empirical Comparison
  - Base predictor P: Random Forest (100 trees).
  - Scoring function S(x, y): fraction of trees that predict the label of x as y.
  - Baseline: Train a Random Forest on 75% of the data and test on the remaining 25%.
  - Core/calibrate/test split: 50/25/25.
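A sketch of this setup with scikit-learn, using synthetic data for self-containment; the dataset, ϵ = 0.05, and the seeds are illustrative, and calibrate_threshold is the helper sketched above. With fully grown trees (pure leaves), scikit-learn's averaged per-tree probabilities coincide with the fraction-of-votes score the slide describes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data and the 50/25/25 core/calibrate/test split.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=0)
X_core, X_rest, y_core, y_rest = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest,
                                                test_size=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_core, y_core)

# S(x, y): fraction of trees voting for y (via averaged probabilities).
proba = rf.predict_proba(X_cal)
scores = proba.max(axis=1)                            # S(x, y*)
correct = rf.classes_[proba.argmax(axis=1)] == y_cal  # is y* correct?

tau_star = calibrate_threshold(scores, np.asarray(correct), eps=0.05)
```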

  13. Empirical Comparison
  - Probability calibration tends to be too conservative, and thus leads to excessive refusals.

  14. Online/Adversarial Setup
  - Online: First observe x_1, ..., x_t and y_1, ..., y_{t−1}, then predict ŷ_t.
  - For each t = 1, ..., T:
    i. Observe x_t.
    ii. Predict ŷ_t.
    iii. Observe y_t and suffer a loss l_t ∈ [0, 1].
  - Adversarial: Assume nothing about the data.
  - Instead, assume access to a set of predictors P_1, P_2, ..., P_N.
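The protocol as a small Python skeleton; predictor, stream, and loss are placeholders for whatever the application supplies:

```python
def online_protocol(predictor, stream, loss):
    """One pass of the online prediction protocol.

    stream yields (x_t, y_t) pairs; loss(y_hat, y) lies in [0, 1].
    Nothing is assumed about how the stream is generated.
    """
    total_loss = 0.0
    for x_t, y_t in stream:             # i.   observe x_t
        y_hat = predictor(x_t)          # ii.  predict
        total_loss += loss(y_hat, y_t)  # iii. observe y_t, suffer l_t
    return total_loss
```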

  15. Related Work (Online/Adversarial Setup)
  - "Knows What It Knows" (Li et al., 2008): Minimize the number of refusals without allowing any errors.
    - Realizable setup: assumes there exists a perfect predictor in the ensemble.
  - "Trading off Mistakes and Don't-Know Predictions" (Sayedi et al., 2010): Allow up to k errors and minimize the refusals.
    - l-bias assumption: one of the predictors makes at most l mistakes.
  - "Extended Littlestone Dimension" (Zhang et al., 2016): Minimize the refusals while keeping the number of errors below k.

  16. SafePredict

  17. SafePredict is a meta-algorithm for the online setup that guarantees that the error rate on the non-refused predictions is bounded by a user-specified target rate. Our error guarantees do not depend on any assumption about the data or the base predictor, but are asymptotic in the number of non-refused predictions. The number of refusals depends on the quality of the base predictor and can be shown to be small if the base predictor has a low error rate.

  18. Meta-Algorithms in the Online Prediction Setup
  - The base algorithm P makes a prediction ŷ_{P,t} and suffers a loss l_{P,t} ∈ [0, 1].
  - The meta-algorithm M makes a (randomized) decision to refuse (∅) or predict ŷ_t, to guarantee a target error rate ϵ.
  - M predicts at time t with probability w_{P,t}.

  19. Validity and Efficiency
  - We use the following * notation to denote averages over the randomization of M:
    - T*: expected number of (non-refused) predictions, Σ_{t=1}^T w_{P,t}.
    - L*_T: expected cumulative loss of M, Σ_{t=1}^T l_{P,t}·w_{P,t}.
  - Validity: M is valid if lim sup_{T*→∞} L*_T / T* ≤ ϵ.
  - Efficiency: M is efficient if lim inf_{T*→∞} T* / T = 1.
  - SafePredict goal: M should be valid for any P, and efficient when P performs well.
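These quantities are easy to compute from the traces of w_{P,t} and l_{P,t}; a minimal sketch (the function name is illustrative, and at least one w_{P,t} is assumed positive):

```python
import numpy as np

def validity_efficiency(w_P, l_P):
    """Empirical validity/efficiency quantities from per-round traces.

    w_P[t]: probability that M follows P (i.e. predicts) at round t.
    l_P[t]: loss of the base predictor at round t, in [0, 1].
    Returns (L*_T / T*, T* / T): validity wants the first <= eps,
    efficiency wants the second -> 1.
    """
    w_P, l_P = np.asarray(w_P), np.asarray(l_P)
    T_star = w_P.sum()           # expected number of predictions
    L_star = (w_P * l_P).sum()   # expected cumulative loss of M
    return L_star / T_star, T_star / len(w_P)
```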

  20. Background: Expert Advice and EWAF
  - How to combine expert opinions P_1, ..., P_N to perform almost as well as the best expert? (Littlestone et al., 1989; Vovk, 1990)
  - Intuition: Weight experts according to their past performances.
  - Exponentially weighted average forecasting (EWAF):
    0. Initialize (w_{P_1,1}, ..., w_{P_N,1}) and choose a learning rate η > 0.
    1. For each t = 1, ..., T:
       1.1. Follow P_i with probability w_{P_i,t}.
       1.2. Update the probabilities: w_{P_i,t+1} ∝ w_{P_i,t}·e^{−η·l_{P_i,t}}.
  - Regret bound: L_T − min_i L_{P_i,T} ≤ √(T·log(N)/2), where L_T and L_{P_i,T} are the cumulative losses of EWAF and P_i.
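A minimal EWAF sketch following the update rule above; uniform initialization is assumed:

```python
import numpy as np

def ewaf(losses, eta):
    """Exponentially weighted average forecasting over N experts.

    losses: T x N array; losses[t][i] is expert i's loss at round t.
    Returns the sequence of weight vectors, one per round; at round t
    the forecaster follows expert i with probability weights[t][i].
    """
    T, N = losses.shape
    w = np.full(N, 1.0 / N)                # uniform initial weights
    weights = [w.copy()]
    for t in range(T):
        w = w * np.exp(-eta * losses[t])   # exponential downweighting
        w = w / w.sum()                    # renormalize to a distribution
        weights.append(w.copy())
    return np.array(weights)
```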

  21. Dummy and SafePredict
  - We compare P with a dummy predictor D that refuses all the time: ŷ_{D,t} = ∅, with loss l_{D,t} = ϵ.
  - SafePredict is simply EWAF run over the ensemble {D, P}.
  - The EWAF regret bound implies L*_T / T* − ϵ = O(√T / T*). Therefore, for validity, we need a better bound and a more careful choice of η.
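SafePredict's core update is thus EWAF specialized to the two-element ensemble {D, P}, with the dummy's loss pinned at ϵ. A sketch that returns the trace of w_{P,t}; at each round one would predict with probability w_{P,t} and refuse otherwise:

```python
import numpy as np

def safepredict_weights(l_P, eps, eta):
    """SafePredict's probability of predicting, w_{P,t}, over time.

    l_P: sequence of base-predictor losses in [0, 1].
    eps: target error rate (the dummy D always suffers loss eps).
    """
    w_P = 0.5                    # equal initial weight on D and P
    trace = []
    for l in l_P:
        trace.append(w_P)
        num = w_P * np.exp(-eta * l)                 # P's updated weight
        den = num + (1 - w_P) * np.exp(-eta * eps)   # plus D's
        w_P = num / den
    return np.array(trace)
```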

  22. Theoretical Guarantees (Validity)
  - Theorem (Validity): Denote the variance of the number of predictions by V* = Σ_{t=1}^T w_{P,t}·w_{D,t}. Choosing η = Θ(1/√V*), SafePredict is guaranteed to be valid for any P. In particular,
    L*_T / T* − ϵ = O(√V* / T*) = O(1/√T*).
  - In practice, V* can be estimated via the so-called "doubling trick".

  23. Theoretical Guarantees (Efficiency)
  - SafePredict is efficient as long as P has an error rate less than ϵ and η vanishes more slowly than 1/T. Formally,
  - Theorem (Efficiency): If lim sup_{t→∞} L_{P,t}/t < ϵ and η·T → ∞, then SafePredict is efficient. Furthermore, the number of refusals is finite almost surely.

  24. Weight Shifting
  - The probability of refusal depends only on the cumulative loss of P.
  - The probability of making a prediction decreases exponentially fast if the base predictor has an error rate higher than ϵ. Therefore, it is hard to recover from long sequences of mistakes, e.g. cold starts, concept changes.
  - Toy example: (figure)

  25. Weight Shifting
  - Weight shifting: At each step, shift an α portion of D's weight towards P, i.e.
    w_{P,t} ← w_{P,t} + α·w_{D,t} = α + (1 − α)·w_{P,t}.
  - This guarantees that w_{P,t} is always at least α.
  - Toy example: (figure)
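The weight-shifting variant adds a single line to the SafePredict update sketched earlier; the value of alpha is an illustrative constant:

```python
import numpy as np

def safepredict_shifted(l_P, eps, eta, alpha):
    """SafePredict with weight shifting: after each EWAF update, an
    alpha-portion of the dummy's weight moves to P, so w_{P,t} never
    drops below alpha and recovery after bad stretches is faster."""
    w_P = 0.5
    trace = []
    for l in l_P:
        trace.append(w_P)
        num = w_P * np.exp(-eta * l)
        w_P = num / (num + (1 - w_P) * np.exp(-eta * eps))
        w_P = alpha + (1 - alpha) * w_P   # the weight-shifting step
    return np.array(trace)
```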

  26. Weight Shifting
  - Weight shifting preserves the validity guarantee for α = O(1/T).
  - The probability of refusal decreases exponentially fast if P performs better than D after some time t_0:
    w*_{D,t} ≤ e^{η·(Σ_{τ=t_0}^{t−1} l_{P,τ} − ϵ·(t − t_0))} / α.
