SafePredict: a meta-algorithm for machine learning to guarantee correctness by refusing occasionally


  1. SafePredict: a meta-algorithm for machine learning to guarantee correctness by refusing occasionally
  Mustafa A. Kocak¹, David Ramirez¹, Elza Erkip¹, and Dennis E. Shasha²
  ¹ NYU Tandon School of Engineering
  ² NYU's Courant Institute of Mathematical Sciences

  2. Introduction
  - Machine learning and prediction algorithms are the building blocks of automation and forecasting.
  - Reliability is crucial in risk-critical applications:
    - analytics, risk assessment, credit decisions;
    - health care, medical diagnosis;
    - judicial decision making.
  - Basic idea: Create a meta-algorithm that takes predictions from underlying machine learning algorithms and decides whether to pass them on to higher-level applications.
  - Goal: Achieve robust correctness guarantees for the predictions emitted by the meta-algorithm.

  3. What Does It Mean to Refuse?
  - The implications of refusing to make a prediction vary according to the application of interest:
    - do more tests / collect more data;
    - request user feedback, or ask a human expert to make the decision.
  - We want to refuse seldom while still achieving the error bound.

  4. Novelty and Teaser
  - SafePredict achieves a desired error bound without any assumption on the data or the base predictor.
  - It tracks changes in the error rate of the base predictor to avoid refusing too much.

  5. Literature Review

  6. Batch Setup
  - Data: Z_i = (X_i, Y_i) ∈ X × Y, drawn i.i.d. from D for all i = 1, ..., m+1.
  - Probability of error: P_e = P(Ŷ_{m+1} ∉ {Y_{m+1}, ∅} | Z^m).
  - Probability of refusal: P_r = P(Ŷ_{m+1} = ∅ | Z^m).
  - Goal: Minimize P_e + κ·P_r, where κ is the cost of a refusal relative to an error.

  7. Related Work (Batch Setup)
  - Chow, 1970: Assuming D is known, the optimal refusal mechanism is
    Ŷ(X) = y* if P(Y = y* | X = x) ≥ 1 − κ, and ∅ otherwise,
    where y* = arg max_y P(Y = y | X) is the MAP prediction.
  - For unknown D, instead minimize the empirical estimate P̂_e + κ·P̂_r.
  - Wegkamp et al., 2006-2008: Rejection with hinge loss and lasso.
  - Wiener and El Yaniv, 2010-2012: Relationship with active learning, and selective SVM.
  - Cortes et al., 2016-2017: Kernel-based methods and boosting.
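To make Chow's rule concrete, here is a minimal Python sketch, assuming the true posterior P(Y | X = x) is available as a probability vector. The function name, the REFUSE sentinel, and the example values are illustrative, not part of the original work:

```python
import numpy as np

REFUSE = None  # stands in for the refusal symbol ∅

def chow_rule(posterior, kappa):
    """Chow's optimal refusal rule for a known distribution D.

    posterior: array of P(Y = y | X = x) over the label set.
    kappa:     cost of a refusal relative to an error, in [0, 1].
    Predicts the MAP label when its posterior mass is at least
    1 - kappa; refuses otherwise.
    """
    y_star = int(np.argmax(posterior))
    return y_star if posterior[y_star] >= 1 - kappa else REFUSE

# With kappa = 0.3, the rule refuses unless some label carries
# at least 70% of the posterior mass.
print(chow_rule(np.array([0.6, 0.3, 0.1]), kappa=0.3))  # -> None (refuse)
print(chow_rule(np.array([0.8, 0.1, 0.1]), kappa=0.3))  # -> 0
```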

  8. Refuse Option via Meta-Algorithms
  - In practice, a meta-algorithm approach is much more common.
  - The base predictor P is characterized by a scoring function S:
    - S(X, Y): how typical/probable/likely is (X, Y)?
    - y* = arg max_{y ∈ Y} S(X, y).
  - The meta-algorithm M is characterized by a threshold τ:
    Ŷ(X) = y* if S(X, y*) ≥ τ, and ∅ otherwise.
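A sketch of this generic threshold meta-algorithm, reusing the REFUSE sentinel from the snippet above; score_fn, labels, and tau are placeholder names. Note that with S(x, y) = P(Y = y | X = x) and τ = 1 − κ it reduces to Chow's rule:

```python
def meta_predict(score_fn, labels, x, tau):
    """Threshold meta-algorithm: predict the best-scoring label y*
    if S(x, y*) >= tau, otherwise refuse."""
    y_star = max(labels, key=lambda y: score_fn(x, y))
    return y_star if score_fn(x, y_star) >= tau else REFUSE
```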

  9. Conformal Prediction (Vovk et al., 2005)
  - A conformity score S(x, y) measures how well (x, y) conforms with the training data,
    e.g. distance to the decision boundary, out-of-bag scores, other probability estimates.
  - Strong guarantees in terms of coverage, i.e. P_e ≤ ϵ + o(1).
  - The probability of refusal is asymptotically minimized if S is consistent.

  10. Probability of Error on Non-Refused Data Points
  - A more practical quantity of interest than P_e is the probability of error given not refused:
    P_{e|r̄} := P(Ŷ_{m+1} ≠ Y_{m+1} | Ŷ_{m+1} ≠ ∅, Z^m) = P_e / (1 − P_r).
  - There are two main approaches to this problem:
    1. Conjugate prediction: For any given scoring function S, calibrate the threshold τ to guarantee P_{e|r̄} ≤ ϵ.
    2. Probability calibration: Fix τ = 1 − ϵ, and learn a monotonic function F to calibrate the scoring function S, i.e. F(S(x, y)) ≃ P(Y = y | X = x). Typical methods: isotonic and Platt's regression.
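As a small illustration, an empirical estimator for P_{e|r̄}, counting errors among the non-refused points only (again using the REFUSE sentinel from the earlier sketch):

```python
def error_given_not_refused(y_hat, y_true):
    """Empirical P_{e|r}: error rate among the non-refused predictions."""
    kept = [(p, t) for p, t in zip(y_hat, y_true) if p is not REFUSE]
    if not kept:
        return 0.0  # nothing was predicted, so no conditional error
    return sum(p != t for p, t in kept) / len(kept)
```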

  11. Conjugate Prediction - Calibration Step
  1. Split the training set into a core training set, Z_1^n, and a calibration set, Z_{n+1}^{n+l}, where n + l = m.
  2. Train the base classifier P on the core training set.
  3. Choose the smallest threshold τ* that gives an empirical error rate of at most ϵ on the calibration set, i.e.
     τ* = inf{ τ : ( Σ_{i=n+1}^{n+l} 1[Ŷ_i ∉ {Y_i, ∅}] ) / ( Σ_{i=n+1}^{n+l} 1[Ŷ_i ≠ ∅] ) ≤ ϵ }.
  - Theorem: With probability at least 1 − δ,
    P_{e|r̄} ≤ ϵ + √(log(l/δ)/l) / (1 − P_r).
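A possible implementation of the threshold-selection step (step 3), assuming the calibration scores and correctness indicators have already been computed. Candidate thresholds are taken from the observed scores; the function name is illustrative:

```python
import numpy as np

def calibrate_threshold(scores, correct, eps):
    """Smallest threshold whose empirical error rate on the
    non-refused calibration points is at most eps.

    scores:  S(x_i, y*_i), the score of the top label for each
             calibration point (numpy array).
    correct: boolean array, whether y*_i equals the true label y_i.
    """
    best_tau = np.inf  # refuse everything if no threshold qualifies
    for tau in np.unique(scores):       # candidate thresholds
        kept = scores >= tau            # non-refused points at this tau
        if kept.sum() == 0:
            continue
        err = (~correct[kept]).mean()   # empirical error among kept
        if err <= eps:
            best_tau = min(best_tau, tau)
    return best_tau
```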

  12. Empirical Comparison
  - Base predictor P: Random Forest (100 trees).
  - Scoring function S(x, y): fraction of trees that predict the label of x as y.
  - Baseline: Train a Random Forest on 75% of the data and test on the remaining 25%.
  - Core/calibrate/test split: 50/25/25.
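A sketch of this setup with scikit-learn, using synthetic data for self-containment; the dataset, ϵ = 0.05, and the seeds are illustrative, and calibrate_threshold is the helper sketched above. With fully grown trees (pure leaves), scikit-learn's averaged per-tree probabilities coincide with the fraction-of-votes score the slide describes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data and the 50/25/25 core/calibrate/test split.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=0)
X_core, X_rest, y_core, y_rest = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest,
                                                test_size=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_core, y_core)

# S(x, y): fraction of trees voting for y (via averaged probabilities).
proba = rf.predict_proba(X_cal)
scores = proba.max(axis=1)                            # S(x, y*)
correct = rf.classes_[proba.argmax(axis=1)] == y_cal  # is y* correct?

tau_star = calibrate_threshold(scores, np.asarray(correct), eps=0.05)
```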

  13. Empirical Comparison
  - Probability calibration tends to be too conservative, and thus leads to excessive refusals.

  14. Online/Adversarial Setup
  - Online: First observe x_1, ..., x_t and y_1, ..., y_{t−1}, then predict ŷ_t.
  - For each t = 1, ..., T:
    i. Observe x_t.
    ii. Predict ŷ_t.
    iii. Observe y_t and suffer a loss l_t ∈ [0, 1].
  - Adversarial: Assume nothing about the data.
  - Instead, assume access to a set of predictors P_1, P_2, ..., P_N.
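The protocol as a small Python skeleton; predictor, stream, and loss are placeholders for whatever the application supplies:

```python
def online_protocol(predictor, stream, loss):
    """One pass of the online prediction protocol.

    stream yields (x_t, y_t) pairs; loss(y_hat, y) lies in [0, 1].
    Nothing is assumed about how the stream is generated.
    """
    total_loss = 0.0
    for x_t, y_t in stream:             # i.   observe x_t
        y_hat = predictor(x_t)          # ii.  predict
        total_loss += loss(y_hat, y_t)  # iii. observe y_t, suffer l_t
    return total_loss
```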

  15. Related Work (Online/Adversarial Setup)
  - "Knows What It Knows" (Li et al., 2008): Minimize the number of refusals without allowing any errors.
    - Realizable setup: assumes there exists a perfect predictor in the ensemble.
  - "Trading off Mistakes and Don't-Know Predictions" (Sayedi et al., 2010): Allow up to k errors and minimize the refusals.
    - l-bias assumption: one of the predictors makes at most l mistakes.
  - "Extended Littlestone Dimension" (Zhang et al., 2016): Minimize the refusals while keeping the number of errors below k.

  16. SafePredict

  17. SafePredict is a meta-algorithm for the online setup that guarantees that the error rate on the non-refused predictions is bounded by a user-specified target rate. Our error guarantees do not depend on any assumption about the data or the base predictor, but are asymptotic in the number of non-refused predictions. The number of refusals depends on the quality of the base predictor and can be shown to be small if the base predictor has a low error rate.

  18. Meta-Algorithms in the Online Prediction Setup
  - The base algorithm P makes a prediction ŷ_{P,t} and suffers a loss l_{P,t} ∈ [0, 1].
  - The meta-algorithm M makes a (randomized) decision to refuse (∅) or predict ŷ_t, to guarantee a target error rate ϵ.
  - M predicts at time t with probability w_{P,t}.

  19. Validity and Efficiency
  - We use the following * notation to denote averages over the randomization of M:
    - T*: expected number of (non-refused) predictions, Σ_{t=1}^T w_{P,t}.
    - L*_T: expected cumulative loss of M, Σ_{t=1}^T l_{P,t}·w_{P,t}.
  - Validity: M is valid if lim sup_{T*→∞} L*_T / T* ≤ ϵ.
  - Efficiency: M is efficient if lim inf_{T*→∞} T* / T = 1.
  - SafePredict goal: M should be valid for any P, and efficient when P performs well.
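These quantities are easy to compute from the traces of w_{P,t} and l_{P,t}; a minimal sketch (the function name is illustrative, and at least one w_{P,t} is assumed positive):

```python
import numpy as np

def validity_efficiency(w_P, l_P):
    """Empirical validity/efficiency quantities from per-round traces.

    w_P[t]: probability that M follows P (i.e. predicts) at round t.
    l_P[t]: loss of the base predictor at round t, in [0, 1].
    Returns (L*_T / T*, T* / T): validity wants the first <= eps,
    efficiency wants the second -> 1.
    """
    w_P, l_P = np.asarray(w_P), np.asarray(l_P)
    T_star = w_P.sum()           # expected number of predictions
    L_star = (w_P * l_P).sum()   # expected cumulative loss of M
    return L_star / T_star, T_star / len(w_P)
```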

  20. Background: Expert Advice and EWAF
  - How to combine expert opinions P_1, ..., P_N to perform almost as well as the best expert? (Littlestone et al., 1989; Vovk, 1990)
  - Intuition: Weight experts according to their past performances.
  - Exponentially weighted average forecasting (EWAF):
    0. Initialize (w_{P_1,1}, ..., w_{P_N,1}) and choose a learning rate η > 0.
    1. For each t = 1, ..., T:
       1.1. Follow P_i with probability w_{P_i,t}.
       1.2. Update the probabilities: w_{P_i,t+1} ∝ w_{P_i,t}·e^{−η·l_{P_i,t}}.
  - Regret bound: L_T − min_i L_{P_i,T} ≤ √(T·log(N)/2), where L_T and L_{P_i,T} are the cumulative losses of EWAF and P_i.
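A minimal EWAF sketch following the update rule above; uniform initialization is assumed:

```python
import numpy as np

def ewaf(losses, eta):
    """Exponentially weighted average forecasting over N experts.

    losses: T x N array; losses[t][i] is expert i's loss at round t.
    Returns the sequence of weight vectors, one per round; at round t
    the forecaster follows expert i with probability weights[t][i].
    """
    T, N = losses.shape
    w = np.full(N, 1.0 / N)                # uniform initial weights
    weights = [w.copy()]
    for t in range(T):
        w = w * np.exp(-eta * losses[t])   # exponential downweighting
        w = w / w.sum()                    # renormalize to a distribution
        weights.append(w.copy())
    return np.array(weights)
```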

  21. Dummy and SafePredict
  - We compare P with a dummy predictor D that refuses all the time: ŷ_{D,t} = ∅, with loss l_{D,t} = ϵ.
  - SafePredict is simply EWAF run over the ensemble {D, P}.
  - The EWAF regret bound implies L*_T / T* − ϵ = O(√T / T*). Therefore, for validity, we need a better bound and a more careful choice of η.
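SafePredict's core update is thus EWAF specialized to the two-element ensemble {D, P}, with the dummy's loss pinned at ϵ. A sketch that returns the trace of w_{P,t}; at each round one would predict with probability w_{P,t} and refuse otherwise:

```python
import numpy as np

def safepredict_weights(l_P, eps, eta):
    """SafePredict's probability of predicting, w_{P,t}, over time.

    l_P: sequence of base-predictor losses in [0, 1].
    eps: target error rate (the dummy D always suffers loss eps).
    """
    w_P = 0.5                    # equal initial weight on D and P
    trace = []
    for l in l_P:
        trace.append(w_P)
        num = w_P * np.exp(-eta * l)                 # P's updated weight
        den = num + (1 - w_P) * np.exp(-eta * eps)   # plus D's
        w_P = num / den
    return np.array(trace)
```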

  22. Theoretical Guarantees (Validity)
  - Theorem (Validity): Denote the variance of the number of predictions by V* = Σ_{t=1}^T w_{P,t}·w_{D,t}. Choosing η = Θ(1/√V*), SafePredict is guaranteed to be valid for any P. In particular,
    L*_T / T* − ϵ = O(√V* / T*) = O(1/√T*).
  - In practice, V* can be estimated via the so-called "doubling trick".

  23. Theoretical Guarantees (Efficiency)
  - SafePredict is efficient as long as P has an error rate less than ϵ and η vanishes more slowly than 1/T. Formally,
  - Theorem (Efficiency): If lim sup_{t→∞} L_{P,t}/t < ϵ and η·T → ∞, then SafePredict is efficient. Furthermore, the number of refusals is finite almost surely.

  24. Weight Shifting
  - The probability of refusal depends only on the cumulative loss of P.
  - The probability of making a prediction decreases exponentially fast if the base predictor has an error rate higher than ϵ. Therefore, it is hard to recover from long sequences of mistakes, e.g. cold starts, concept changes.
  - Toy example: (figure)

  25. Weight Shifting
  - Weight shifting: At each step, shift an α portion of D's weight towards P, i.e.
    w_{P,t} ← w_{P,t} + α·w_{D,t} = α + (1 − α)·w_{P,t}.
  - This guarantees that w_{P,t} is always at least α.
  - Toy example: (figure)
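The weight-shifting variant adds a single line to the SafePredict update sketched earlier; the value of alpha is an illustrative constant:

```python
import numpy as np

def safepredict_shifted(l_P, eps, eta, alpha):
    """SafePredict with weight shifting: after each EWAF update, an
    alpha-portion of the dummy's weight moves to P, so w_{P,t} never
    drops below alpha and recovery after bad stretches is faster."""
    w_P = 0.5
    trace = []
    for l in l_P:
        trace.append(w_P)
        num = w_P * np.exp(-eta * l)
        w_P = num / (num + (1 - w_P) * np.exp(-eta * eps))
        w_P = alpha + (1 - alpha) * w_P   # the weight-shifting step
    return np.array(trace)
```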

  26. Weight Shifting
  - Weight shifting preserves the validity guarantee for α = O(1/T).
  - The probability of refusal decreases exponentially fast if P performs better than D after some time t_0:
    w*_{D,t} ≤ e^{η·(Σ_{τ=t_0}^{t−1} l_{P,τ} − ϵ·(t − t_0))} / α.
