  1. Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers
     Hiva Ghanbari, joint work with Prof. Katya Scheinberg
     Industrial and Systems Engineering Department, Lehigh University
     US & Mexico Workshop on Optimization and its Applications, Huatulco, Mexico, January 2018

  2. Outline: Introduction; Directly Optimizing Prediction Error; Directly Optimizing AUC; Numerical Analysis; Summary

  3. Outline: Introduction; Directly Optimizing Prediction Error; Directly Optimizing AUC; Numerical Analysis; Summary

  4. Supervised Learning Problem
     Given a finite sample data set S of n (input, label) pairs, S := {(x_i, y_i) : i = 1, ..., n}, where x_i ∈ R^d and y_i ∈ {+1, −1}.
     We are interested in the binary classification problem in supervised learning: binary classification ⇒ discrete-valued output, +1 or −1.
     We are interested in a linear classifier (predictor) f(x; w) = w^T x, so that f : X → Y, where X denotes the space of input values and Y the space of output values.

  5. Supervised Learning Problem
     How good is this classifier? Two measures: the prediction error and the area under the ROC curve (AUC).
     Given a finite sample data set S of n (input, label) pairs, S := {(x_i, y_i) : i = 1, ..., n}, where x_i ∈ R^d and y_i ∈ {+1, −1}.
     We are interested in the binary classification problem in supervised learning: binary classification ⇒ discrete-valued output, +1 or −1.
     We are interested in a linear classifier (predictor) f(x; w) = w^T x, so that f : X → Y, where X denotes the space of input values and Y the space of output values.
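As a quick illustration of this setup, the following minimal sketch (Python with numpy; the data and weight vector are made up for the example) forms a linear classifier f(x; w) = w^T x and converts its real-valued scores into ±1 predictions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sample: n = 6 points in R^d (d = 3) with labels in {+1, -1}.
    X = rng.normal(size=(6, 3))           # inputs x_i in R^d
    y = np.array([+1, -1, +1, +1, -1, -1])

    w = rng.normal(size=3)                # an arbitrary linear classifier f(x; w) = w^T x

    scores = X @ w                        # real-valued outputs w^T x_i
    y_hat = np.where(scores >= 0, 1, -1)  # discrete-valued predictions in {+1, -1}
    print(y_hat)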

  6. Outline: Introduction; Directly Optimizing Prediction Error; Directly Optimizing AUC; Numerical Analysis; Summary

  7. Expected Risk (Prediction Error)
     In S, each (x_i, y_i) is an i.i.d. observation of the random variables (X, Y).
     (X, Y) has an unknown joint probability distribution P_{X,Y}(x, y) over X and Y.
     The expected risk associated with a linear classifier f(x; w) = w^T x for the zero-one loss function is defined as
         R_{0-1}(f) = E_{X,Y}[ℓ_{0-1}(f(X; w), Y)] = ∫_X ∫_Y P_{X,Y}(x, y) ℓ_{0-1}(f(x; w), y) dy dx,
     where
         ℓ_{0-1}(f(x; w), y) = +1 if y · f(x; w) < 0, and 0 if y · f(x; w) ≥ 0.

  8. Empirical Risk Minimization
     The joint probability distribution P_{X,Y}(x, y) is unknown.
     The empirical risk of the linear classifier f(x; w) for the zero-one loss function over the finite training set S is of interest:
         R_{0-1}(f; S) = (1/n) Σ_{i=1}^{n} ℓ_{0-1}(f(x_i; w), y_i).

  9. Empirical Risk Minimization
     The joint probability distribution P_{X,Y}(x, y) is unknown.
     The empirical risk of the linear classifier f(x; w) for the zero-one loss function over the finite training set S is of interest:
         R_{0-1}(f; S) = (1/n) Σ_{i=1}^{n} ℓ_{0-1}(f(x_i; w), y_i).
     Using the logistic regression loss function instead of the 0-1 loss function results in
         R_log(f; S) = (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i · f(x_i; w))).
     In practice, we solve the regularized problem
         min_{w ∈ R^d} F_log(w) = (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i · f(x_i; w))) + λ‖w‖².
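A minimal numpy sketch of the two empirical objectives above, using hypothetical data; lam plays the role of the regularization weight λ, and np.logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow.

    import numpy as np

    def zero_one_risk(w, X, y):
        # R_{0-1}(f; S): fraction of training points with y_i * w^T x_i < 0.
        return np.mean(y * (X @ w) < 0)

    def regularized_logistic_loss(w, X, y, lam=0.1):
        # F_log(w) = (1/n) sum_i log(1 + exp(-y_i * w^T x_i)) + lam * ||w||^2
        margins = y * (X @ w)
        return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # hypothetical inputs
    y = np.where(rng.random(100) < 0.5, 1, -1)    # hypothetical labels
    w = rng.normal(size=5)
    print(zero_one_risk(w, X, y), regularized_logistic_loss(w, X, y))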

  10. Alternative Interpretation of the Prediction Error
      We can interpret the prediction error as a probability value:
          F_error(w) = R_{0-1}(f) = E_{X,Y}[ℓ_{0-1}(f(X; w), Y)] = P(Y · w^T X < 0).

  11. Alternative Interpretation of the Prediction Error
      We can interpret the prediction error as a probability value:
          F_error(w) = R_{0-1}(f) = E_{X,Y}[ℓ_{0-1}(f(X; w), Y)] = P(Y · w^T X < 0).
      If the true values of the prior probabilities P(Y = +1) and P(Y = −1) are known or obtainable from a trivial calculation, then:
      Lemma 1. The expected risk can be interpreted in terms of probability values, so that
          F_error(w) = P(Y · w^T X < 0) = P(Z⁺ ≤ 0) P(Y = +1) + (1 − P(Z⁻ ≤ 0)) P(Y = −1),
      where Z⁺ = w^T X⁺ and Z⁻ = w^T X⁻, for X⁺ and X⁻ random variables from the positive and negative classes, respectively.
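A Monte Carlo sanity check of Lemma 1 under assumed Gaussian class-conditional samples and assumed priors (all numbers are hypothetical): estimate P(Z⁺ ≤ 0) and P(Z⁻ ≤ 0) from samples of w^T X⁺ and w^T X⁻ and combine them with the priors.

    import numpy as np

    rng = np.random.default_rng(1)
    w = np.array([1.0, -0.5])
    p_pos, p_neg = 0.3, 0.7                       # assumed priors P(Y = +1), P(Y = -1)

    # Hypothetical class-conditional samples X+ and X-.
    X_pos = rng.normal(loc=[1.0, 1.0], size=(100_000, 2))
    X_neg = rng.normal(loc=[-1.0, -1.0], size=(100_000, 2))

    z_pos = X_pos @ w                             # samples of Z+ = w^T X+
    z_neg = X_neg @ w                             # samples of Z- = w^T X-

    # F_error(w) = P(Z+ <= 0) P(Y=+1) + (1 - P(Z- <= 0)) P(Y=-1)
    f_error = np.mean(z_pos <= 0) * p_pos + (1 - np.mean(z_neg <= 0)) * p_neg
    print(f_error)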

  12. Data with Any Arbitrary Distribution
      Suppose (X_1, ..., X_n) is a multivariate random variable. For a given mapping function g(·), we are interested in the c.d.f. of Z = g(X_1, ..., X_n).
      If we define a region in the space X_1 × ··· × X_n such that g(x_1, ..., x_n) ≤ z, then we have
          F_Z(z) = P(Z ≤ z) = P(g(X) ≤ z) = P({x_1 ∈ X_1, ..., x_n ∈ X_n : g(x_1, ..., x_n) ≤ z})
                 = ∫ ··· ∫_{{x_1 ∈ X_1, ..., x_n ∈ X_n : g(x_1, ..., x_n) ≤ z}} f_{X_1,...,X_n}(x_1, ..., x_n) dx_1 ··· dx_n.
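This region-based view can also be approximated numerically. The sketch below (the mapping g and the sampler are hypothetical) estimates F_Z(z) = P(g(X) ≤ z) by sampling the joint distribution and measuring the fraction of samples that fall in the region {x : g(x) ≤ z}.

    import numpy as np

    rng = np.random.default_rng(3)

    def cdf_via_monte_carlo(g, sampler, z, n_samples=200_000):
        # Estimate F_Z(z) = P(g(X_1, ..., X_n) <= z) by sampling the joint
        # distribution and counting samples that land in {x : g(x) <= z}.
        samples = sampler(n_samples)
        return np.mean(g(samples) <= z)

    # Example: Z = max(X_1, X_2) with X ~ N(0, I_2); F_Z(0) should be about 0.25.
    print(cdf_via_monte_carlo(lambda s: s.max(axis=1),
                              lambda m: rng.normal(size=(m, 2)),
                              z=0.0))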

  13. Data with Normal Distribution
      Assume X⁺ ∼ N(µ⁺, Σ⁺) and X⁻ ∼ N(µ⁻, Σ⁻).

  14. Data with Normal Distribution
      Assume X⁺ ∼ N(µ⁺, Σ⁺) and X⁻ ∼ N(µ⁻, Σ⁻).
      Why normal? The family of multivariate normal distributions is closed under linear transformations.
      Theorem 2 (Tong (1990)). If X ∼ N(µ, Σ) and Z = CX + b, where C is any given m × n real matrix and b is any m × 1 real vector, then Z ∼ N(Cµ + b, C Σ C^T).
      The normal distribution has a smooth c.d.f.
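A small numerical check of Theorem 2 with made-up µ, Σ, C, and b: sample X ∼ N(µ, Σ), apply Z = CX + b, and compare the empirical mean and covariance of Z with Cµ + b and CΣC^T.

    import numpy as np

    rng = np.random.default_rng(2)

    mu = np.array([1.0, -2.0, 0.5])
    A = rng.normal(size=(3, 3))
    Sigma = A @ A.T                          # a valid covariance matrix
    C = rng.normal(size=(2, 3))              # m x n transformation matrix
    b = np.array([0.3, -0.7])

    X = rng.multivariate_normal(mu, Sigma, size=200_000)
    Z = X @ C.T + b                          # Z = C X + b, applied row-wise

    print(Z.mean(axis=0), C @ mu + b)        # empirical vs. theoretical mean
    print(np.cov(Z.T), C @ Sigma @ C.T)      # empirical vs. theoretical covariance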

  15. Prediction Error as a Smooth Function
      Theorem 3. Suppose that X⁺ ∼ N(µ⁺, Σ⁺) and X⁻ ∼ N(µ⁻, Σ⁻). Then
          F_error(w) = P(Y = +1)(1 − φ(µ_{Z⁺}/σ_{Z⁺})) + P(Y = −1) φ(µ_{Z⁻}/σ_{Z⁻}),
      where
          µ_{Z⁺} = w^T µ⁺,  σ_{Z⁺} = √(w^T Σ⁺ w),  and  µ_{Z⁻} = w^T µ⁻,  σ_{Z⁻} = √(w^T Σ⁻ w),
      in which φ is the c.d.f. of the standard normal distribution, i.e.,
          φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−t²/2) dt  for all x ∈ R.

  16. Prediction Error as a Smooth Function
      Theorem 3. Suppose that X⁺ ∼ N(µ⁺, Σ⁺) and X⁻ ∼ N(µ⁻, Σ⁻). Then
          F_error(w) = P(Y = +1)(1 − φ(µ_{Z⁺}/σ_{Z⁺})) + P(Y = −1) φ(µ_{Z⁻}/σ_{Z⁻}),
      where
          µ_{Z⁺} = w^T µ⁺,  σ_{Z⁺} = √(w^T Σ⁺ w),  and  µ_{Z⁻} = w^T µ⁻,  σ_{Z⁻} = √(w^T Σ⁻ w),
      in which φ is the c.d.f. of the standard normal distribution, i.e.,
          φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−t²/2) dt  for all x ∈ R.
      Prediction error is a smooth function of w ⇒ we can compute the gradient and ...
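A minimal sketch of the closed-form expected risk in Theorem 3, assuming scipy is available; the class means, covariances, and priors in the example call are hypothetical. Because every piece is smooth in w, the gradient could be obtained analytically or by automatic differentiation.

    import numpy as np
    from scipy.stats import norm

    def f_error(w, mu_pos, Sigma_pos, mu_neg, Sigma_neg, p_pos, p_neg):
        # F_error(w) = P(Y=+1) (1 - phi(mu_{Z+}/sigma_{Z+})) + P(Y=-1) phi(mu_{Z-}/sigma_{Z-})
        mu_zp = w @ mu_pos
        sd_zp = np.sqrt(w @ Sigma_pos @ w)
        mu_zn = w @ mu_neg
        sd_zn = np.sqrt(w @ Sigma_neg @ w)
        return p_pos * (1 - norm.cdf(mu_zp / sd_zp)) + p_neg * norm.cdf(mu_zn / sd_zn)

    # Hypothetical symmetric two-class example.
    mu_pos, mu_neg = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
    Sigma = np.eye(2)
    print(f_error(np.array([1.0, 1.0]), mu_pos, Sigma, mu_neg, Sigma, 0.5, 0.5))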

  17. Outline: Introduction; Directly Optimizing Prediction Error; Directly Optimizing AUC; Numerical Analysis; Summary

  18. Learning From Imbalanced Data Sets
      Many real-world machine learning problems deal with imbalanced training data.
      [Figure: (a) a balanced data set; (b) an imbalanced data set]

  19. Receiver Operating Characteristic (ROC) Curve
      [Figure: example outputs sorted in descending order of f(x; w) = w^T x: − + + − + − +]
      For a given threshold on f(x; w):
                             Predicted Positive      Predicted Negative
          Actual Positive    True Positive (TP)      False Negative (FN)
          Actual Negative    False Positive (FP)     True Negative (TN)
      Various thresholds result in different True Positive Rate = TP / (TP + FN) and False Positive Rate = FP / (FP + TN).
      The ROC curve presents the trade-off between the TPR and the FPR for all possible thresholds.
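A short sketch of how a single threshold on the scores f(x; w) = w^T x yields one (TPR, FPR) point, and how sweeping the threshold traces out the ROC curve; the scores and labels are made up for the example.

    import numpy as np

    def tpr_fpr(scores, y, threshold):
        # Predict +1 when the score f(x; w) = w^T x exceeds the threshold.
        pred_pos = scores > threshold
        tp = np.sum(pred_pos & (y == +1))
        fn = np.sum(~pred_pos & (y == +1))
        fp = np.sum(pred_pos & (y == -1))
        tn = np.sum(~pred_pos & (y == -1))
        return tp / (tp + fn), fp / (fp + tn)

    scores = np.array([2.1, 1.7, 0.9, 0.3, -0.2, -0.8, -1.5])
    y      = np.array([ -1,  +1,  +1,  -1,   +1,   -1,   +1])

    # Sweeping the threshold over all observed scores gives the ROC points.
    roc = [tpr_fpr(scores, y, t) for t in np.sort(scores)[::-1]]
    print(roc)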

  20. Area Under ROC Curve (AUC)
      How can we compare ROC curves?

  21. Area Under ROC Curve (AUC)
      How can we compare ROC curves? Higher AUC ⇒ better classifier.

  22. An Unbiased Estimation of the AUC Value
      An unbiased estimate of the AUC value of a linear classifier can be obtained via the Wilcoxon-Mann-Whitney (WMW) statistic (Mann and Whitney (1947)):
          AUC(f; S⁺, S⁻) = (1 / (n⁺ · n⁻)) Σ_{i=1}^{n⁺} Σ_{j=1}^{n⁻} 𝟙[f(x⁺_i; w) > f(x⁻_j; w)],
      where
          𝟙[f(x⁺_i; w) > f(x⁻_j; w)] = +1 if f(x⁺_i; w) > f(x⁻_j; w), and 0 otherwise,
      in which S = S⁺ ∪ S⁻.
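A direct numpy sketch of the WMW estimate: form all n⁺ · n⁻ pairwise score differences and count the fraction that are positive. The score arrays here are hypothetical.

    import numpy as np

    def auc_wmw(scores_pos, scores_neg):
        # WMW statistic: fraction of (positive, negative) pairs ranked correctly,
        # i.e., with f(x_i^+; w) > f(x_j^-; w).
        diff = scores_pos[:, None] - scores_neg[None, :]   # n+ x n- pairwise differences
        return np.mean(diff > 0)

    scores_pos = np.array([1.2, 0.4, 2.3])          # f(x_i^+; w) over S+
    scores_neg = np.array([0.1, -0.7, 0.5, 1.5])    # f(x_j^-; w) over S-
    print(auc_wmw(scores_pos, scores_neg))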

  23. AUC Approximation via Surrogate Losses
      The indicator function 𝟙[·] can be approximated with:
          a sigmoid surrogate function, Yan et al. (2003);
          a pairwise exponential loss or pairwise logistic loss, Rudin and Schapire (2009);
          a pairwise hinge loss, Steck (2007):
              F_hinge(w) = (1 / (n⁺ · n⁻)) Σ_{i=1}^{n⁺} Σ_{j=1}^{n⁻} max(0, 1 − (f(x⁺_i; w) − f(x⁻_j; w))).
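A sketch of the pairwise hinge surrogate, written here in the standard minimization form max(0, 1 − (f(x⁺_i; w) − f(x⁻_j; w))), which penalizes positive/negative pairs ranked with a margin smaller than one; the design matrices X_pos and X_neg are hypothetical.

    import numpy as np

    def pairwise_hinge(w, X_pos, X_neg):
        # F_hinge(w): average over all (positive, negative) pairs of
        # max(0, 1 - (f(x_i^+; w) - f(x_j^-; w))).
        scores_pos = X_pos @ w
        scores_neg = X_neg @ w
        diff = scores_pos[:, None] - scores_neg[None, :]
        return np.mean(np.maximum(0.0, 1.0 - diff))

    rng = np.random.default_rng(0)
    X_pos = rng.normal(loc=1.0, size=(30, 4))       # hypothetical positive-class inputs
    X_neg = rng.normal(loc=-1.0, size=(50, 4))      # hypothetical negative-class inputs
    print(pairwise_hinge(rng.normal(size=4), X_pos, X_neg))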
