  1. Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers
     Hiva Ghanbari, joint work with Prof. Katya Scheinberg
     Industrial and Systems Engineering Department, Lehigh University
     US & Mexico Workshop on Optimization and its Applications, Huatulco, Mexico, January 2018

  2. Outline: Introduction; Directly Optimizing Prediction Error; Directly Optimizing AUC; Numerical Analysis; Summary

  3. Outline: Introduction; Directly Optimizing Prediction Error; Directly Optimizing AUC; Numerical Analysis; Summary

  4. Supervised Learning Problem
     Given a finite sample data set S of n (input, label) pairs, S := {(x_i, y_i) : i = 1, ..., n}, where x_i ∈ R^d and y_i ∈ {+1, −1}.
     We are interested in the binary classification problem in supervised learning: binary classification ⇒ discrete-valued output, +1 or −1.
     We are interested in a linear classifier (predictor) f(x; w) = w^T x, so that f : X → Y, where X denotes the space of input values and Y the space of output values.

  5. Supervised Learning Problem
     How good is this classifier? Two measures: the prediction error and the area under the ROC curve (AUC).
     Given a finite sample data set S of n (input, label) pairs, S := {(x_i, y_i) : i = 1, ..., n}, where x_i ∈ R^d and y_i ∈ {+1, −1}.
     We are interested in the binary classification problem in supervised learning: binary classification ⇒ discrete-valued output, +1 or −1.
     We are interested in a linear classifier (predictor) f(x; w) = w^T x, so that f : X → Y, where X denotes the space of input values and Y the space of output values.
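As a quick illustration of this setup, the following minimal sketch (Python with numpy; the data and weight vector are made up for the example) forms a linear classifier f(x; w) = w^T x and converts its real-valued scores into ±1 predictions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sample: n = 6 points in R^d (d = 3) with labels in {+1, -1}.
    X = rng.normal(size=(6, 3))           # inputs x_i in R^d
    y = np.array([+1, -1, +1, +1, -1, -1])

    w = rng.normal(size=3)                # an arbitrary linear classifier f(x; w) = w^T x

    scores = X @ w                        # real-valued outputs w^T x_i
    y_hat = np.where(scores >= 0, 1, -1)  # discrete-valued predictions in {+1, -1}
    print(y_hat)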

  6. Outline: Introduction; Directly Optimizing Prediction Error; Directly Optimizing AUC; Numerical Analysis; Summary

  7. Expected Risk (Prediction Error)
     In S, each (x_i, y_i) is an i.i.d. observation of the random variables (X, Y).
     (X, Y) has an unknown joint probability distribution P_{X,Y}(x, y) over X and Y.
     The expected risk associated with a linear classifier f(x; w) = w^T x for the zero-one loss function is defined as
         R_{0-1}(f) = E_{X,Y}[ℓ_{0-1}(f(X; w), Y)] = ∫_X ∫_Y P_{X,Y}(x, y) ℓ_{0-1}(f(x; w), y) dy dx,
     where
         ℓ_{0-1}(f(x; w), y) = +1 if y · f(x; w) < 0, and 0 if y · f(x; w) ≥ 0.

  8. Empirical Risk Minimization
     The joint probability distribution P_{X,Y}(x, y) is unknown.
     The empirical risk of the linear classifier f(x; w) for the zero-one loss function over the finite training set S is of interest:
         R_{0-1}(f; S) = (1/n) Σ_{i=1}^{n} ℓ_{0-1}(f(x_i; w), y_i).

  9. Empirical Risk Minimization
     The joint probability distribution P_{X,Y}(x, y) is unknown.
     The empirical risk of the linear classifier f(x; w) for the zero-one loss function over the finite training set S is of interest:
         R_{0-1}(f; S) = (1/n) Σ_{i=1}^{n} ℓ_{0-1}(f(x_i; w), y_i).
     Using the logistic regression loss function instead of the 0-1 loss function results in
         R_log(f; S) = (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i · f(x_i; w))).
     In practice, we solve the regularized problem
         min_{w ∈ R^d} F_log(w) = (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i · f(x_i; w))) + λ‖w‖².
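A minimal numpy sketch of the two empirical objectives above, using hypothetical data; lam plays the role of the regularization weight λ, and np.logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow.

    import numpy as np

    def zero_one_risk(w, X, y):
        # R_{0-1}(f; S): fraction of training points with y_i * w^T x_i < 0.
        return np.mean(y * (X @ w) < 0)

    def regularized_logistic_loss(w, X, y, lam=0.1):
        # F_log(w) = (1/n) sum_i log(1 + exp(-y_i * w^T x_i)) + lam * ||w||^2
        margins = y * (X @ w)
        return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # hypothetical inputs
    y = np.where(rng.random(100) < 0.5, 1, -1)    # hypothetical labels
    w = rng.normal(size=5)
    print(zero_one_risk(w, X, y), regularized_logistic_loss(w, X, y))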

  10. Alternative Interpretation of the Prediction Error
      We can interpret the prediction error as a probability value:
          F_error(w) = R_{0-1}(f) = E_{X,Y}[ℓ_{0-1}(f(X; w), Y)] = P(Y · w^T X < 0).

  11. Alternative Interpretation of the Prediction Error
      We can interpret the prediction error as a probability value:
          F_error(w) = R_{0-1}(f) = E_{X,Y}[ℓ_{0-1}(f(X; w), Y)] = P(Y · w^T X < 0).
      If the true values of the prior probabilities P(Y = +1) and P(Y = −1) are known or obtainable from a trivial calculation, then:
      Lemma 1. The expected risk can be interpreted in terms of probability values, so that
          F_error(w) = P(Y · w^T X < 0) = P(Z⁺ ≤ 0) P(Y = +1) + (1 − P(Z⁻ ≤ 0)) P(Y = −1),
      where Z⁺ = w^T X⁺ and Z⁻ = w^T X⁻, for X⁺ and X⁻ random variables from the positive and negative classes, respectively.
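A Monte Carlo sanity check of Lemma 1 under assumed Gaussian class-conditional samples and assumed priors (all numbers are hypothetical): estimate P(Z⁺ ≤ 0) and P(Z⁻ ≤ 0) from samples of w^T X⁺ and w^T X⁻ and combine them with the priors.

    import numpy as np

    rng = np.random.default_rng(1)
    w = np.array([1.0, -0.5])
    p_pos, p_neg = 0.3, 0.7                       # assumed priors P(Y = +1), P(Y = -1)

    # Hypothetical class-conditional samples X+ and X-.
    X_pos = rng.normal(loc=[1.0, 1.0], size=(100_000, 2))
    X_neg = rng.normal(loc=[-1.0, -1.0], size=(100_000, 2))

    z_pos = X_pos @ w                             # samples of Z+ = w^T X+
    z_neg = X_neg @ w                             # samples of Z- = w^T X-

    # F_error(w) = P(Z+ <= 0) P(Y=+1) + (1 - P(Z- <= 0)) P(Y=-1)
    f_error = np.mean(z_pos <= 0) * p_pos + (1 - np.mean(z_neg <= 0)) * p_neg
    print(f_error)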

  12. Data with Any Arbitrary Distribution
      Suppose (X_1, ..., X_n) is a multivariate random variable. For a given mapping function g(·), we are interested in the c.d.f. of Z = g(X_1, ..., X_n).
      If we define a region in the space X_1 × ··· × X_n such that g(x_1, ..., x_n) ≤ z, then we have
          F_Z(z) = P(Z ≤ z) = P(g(X) ≤ z) = P({x_1 ∈ X_1, ..., x_n ∈ X_n : g(x_1, ..., x_n) ≤ z})
                 = ∫ ··· ∫_{{x_1 ∈ X_1, ..., x_n ∈ X_n : g(x_1, ..., x_n) ≤ z}} f_{X_1,...,X_n}(x_1, ..., x_n) dx_1 ··· dx_n.
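This region-based view can also be approximated numerically. The sketch below (the mapping g and the sampler are hypothetical) estimates F_Z(z) = P(g(X) ≤ z) by sampling the joint distribution and measuring the fraction of samples that fall in the region {x : g(x) ≤ z}.

    import numpy as np

    rng = np.random.default_rng(3)

    def cdf_via_monte_carlo(g, sampler, z, n_samples=200_000):
        # Estimate F_Z(z) = P(g(X_1, ..., X_n) <= z) by sampling the joint
        # distribution and counting samples that land in {x : g(x) <= z}.
        samples = sampler(n_samples)
        return np.mean(g(samples) <= z)

    # Example: Z = max(X_1, X_2) with X ~ N(0, I_2); F_Z(0) should be about 0.25.
    print(cdf_via_monte_carlo(lambda s: s.max(axis=1),
                              lambda m: rng.normal(size=(m, 2)),
                              z=0.0))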

  13. Data with Normal Distribution
      Assume X⁺ ∼ N(µ⁺, Σ⁺) and X⁻ ∼ N(µ⁻, Σ⁻).

  14. Data with Normal Distribution
      Assume X⁺ ∼ N(µ⁺, Σ⁺) and X⁻ ∼ N(µ⁻, Σ⁻).
      Why normal? The family of multivariate normal distributions is closed under linear transformations.
      Theorem 2 (Tong (1990)). If X ∼ N(µ, Σ) and Z = CX + b, where C is any given m × n real matrix and b is any m × 1 real vector, then Z ∼ N(Cµ + b, C Σ C^T).
      The normal distribution has a smooth c.d.f.
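A small numerical check of Theorem 2 with made-up µ, Σ, C, and b: sample X ∼ N(µ, Σ), apply Z = CX + b, and compare the empirical mean and covariance of Z with Cµ + b and CΣC^T.

    import numpy as np

    rng = np.random.default_rng(2)

    mu = np.array([1.0, -2.0, 0.5])
    A = rng.normal(size=(3, 3))
    Sigma = A @ A.T                          # a valid covariance matrix
    C = rng.normal(size=(2, 3))              # m x n transformation matrix
    b = np.array([0.3, -0.7])

    X = rng.multivariate_normal(mu, Sigma, size=200_000)
    Z = X @ C.T + b                          # Z = C X + b, applied row-wise

    print(Z.mean(axis=0), C @ mu + b)        # empirical vs. theoretical mean
    print(np.cov(Z.T), C @ Sigma @ C.T)      # empirical vs. theoretical covariance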

  15. Prediction Error as a Smooth Function
      Theorem 3. Suppose that X⁺ ∼ N(µ⁺, Σ⁺) and X⁻ ∼ N(µ⁻, Σ⁻). Then
          F_error(w) = P(Y = +1)(1 − φ(µ_{Z⁺}/σ_{Z⁺})) + P(Y = −1) φ(µ_{Z⁻}/σ_{Z⁻}),
      where
          µ_{Z⁺} = w^T µ⁺,  σ_{Z⁺} = √(w^T Σ⁺ w),  and  µ_{Z⁻} = w^T µ⁻,  σ_{Z⁻} = √(w^T Σ⁻ w),
      in which φ is the c.d.f. of the standard normal distribution, i.e.,
          φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−t²/2) dt  for all x ∈ R.

  16. Prediction Error as a Smooth Function
      Theorem 3. Suppose that X⁺ ∼ N(µ⁺, Σ⁺) and X⁻ ∼ N(µ⁻, Σ⁻). Then
          F_error(w) = P(Y = +1)(1 − φ(µ_{Z⁺}/σ_{Z⁺})) + P(Y = −1) φ(µ_{Z⁻}/σ_{Z⁻}),
      where
          µ_{Z⁺} = w^T µ⁺,  σ_{Z⁺} = √(w^T Σ⁺ w),  and  µ_{Z⁻} = w^T µ⁻,  σ_{Z⁻} = √(w^T Σ⁻ w),
      in which φ is the c.d.f. of the standard normal distribution, i.e.,
          φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−t²/2) dt  for all x ∈ R.
      Prediction error is a smooth function of w ⇒ we can compute the gradient and ...
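A minimal sketch of the closed-form expected risk in Theorem 3, assuming scipy is available; the class means, covariances, and priors in the example call are hypothetical. Because every piece is smooth in w, the gradient could be obtained analytically or by automatic differentiation.

    import numpy as np
    from scipy.stats import norm

    def f_error(w, mu_pos, Sigma_pos, mu_neg, Sigma_neg, p_pos, p_neg):
        # F_error(w) = P(Y=+1) (1 - phi(mu_{Z+}/sigma_{Z+})) + P(Y=-1) phi(mu_{Z-}/sigma_{Z-})
        mu_zp = w @ mu_pos
        sd_zp = np.sqrt(w @ Sigma_pos @ w)
        mu_zn = w @ mu_neg
        sd_zn = np.sqrt(w @ Sigma_neg @ w)
        return p_pos * (1 - norm.cdf(mu_zp / sd_zp)) + p_neg * norm.cdf(mu_zn / sd_zn)

    # Hypothetical symmetric two-class example.
    mu_pos, mu_neg = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
    Sigma = np.eye(2)
    print(f_error(np.array([1.0, 1.0]), mu_pos, Sigma, mu_neg, Sigma, 0.5, 0.5))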

  17. Outline: Introduction; Directly Optimizing Prediction Error; Directly Optimizing AUC; Numerical Analysis; Summary

  18. Learning From Imbalanced Data Sets
      Many real-world machine learning problems deal with imbalanced training data.
      [Figure: (a) a balanced data set; (b) an imbalanced data set]

  19. Receiver Operating Characteristic (ROC) Curve
      [Figure: example outputs sorted in descending order of f(x; w) = w^T x: − + + − + − +]
      For a given threshold on f(x; w):
                             Predicted Positive      Predicted Negative
          Actual Positive    True Positive (TP)      False Negative (FN)
          Actual Negative    False Positive (FP)     True Negative (TN)
      Various thresholds result in different True Positive Rate = TP / (TP + FN) and False Positive Rate = FP / (FP + TN).
      The ROC curve presents the trade-off between the TPR and the FPR for all possible thresholds.
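A short sketch of how a single threshold on the scores f(x; w) = w^T x yields one (TPR, FPR) point, and how sweeping the threshold traces out the ROC curve; the scores and labels are made up for the example.

    import numpy as np

    def tpr_fpr(scores, y, threshold):
        # Predict +1 when the score f(x; w) = w^T x exceeds the threshold.
        pred_pos = scores > threshold
        tp = np.sum(pred_pos & (y == +1))
        fn = np.sum(~pred_pos & (y == +1))
        fp = np.sum(pred_pos & (y == -1))
        tn = np.sum(~pred_pos & (y == -1))
        return tp / (tp + fn), fp / (fp + tn)

    scores = np.array([2.1, 1.7, 0.9, 0.3, -0.2, -0.8, -1.5])
    y      = np.array([ -1,  +1,  +1,  -1,   +1,   -1,   +1])

    # Sweeping the threshold over all observed scores gives the ROC points.
    roc = [tpr_fpr(scores, y, t) for t in np.sort(scores)[::-1]]
    print(roc)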

  20. Area Under ROC Curve (AUC)
      How can we compare ROC curves?

  21. Area Under ROC Curve (AUC)
      How can we compare ROC curves? Higher AUC ⇒ better classifier.

  22. An Unbiased Estimation of the AUC Value
      An unbiased estimate of the AUC value of a linear classifier can be obtained via the Wilcoxon-Mann-Whitney (WMW) statistic (Mann and Whitney (1947)):
          AUC(f; S⁺, S⁻) = (1 / (n⁺ · n⁻)) Σ_{i=1}^{n⁺} Σ_{j=1}^{n⁻} 𝟙[f(x⁺_i; w) > f(x⁻_j; w)],
      where
          𝟙[f(x⁺_i; w) > f(x⁻_j; w)] = +1 if f(x⁺_i; w) > f(x⁻_j; w), and 0 otherwise,
      in which S = S⁺ ∪ S⁻.
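A direct numpy sketch of the WMW estimate: form all n⁺ · n⁻ pairwise score differences and count the fraction that are positive. The score arrays here are hypothetical.

    import numpy as np

    def auc_wmw(scores_pos, scores_neg):
        # WMW statistic: fraction of (positive, negative) pairs ranked correctly,
        # i.e., with f(x_i^+; w) > f(x_j^-; w).
        diff = scores_pos[:, None] - scores_neg[None, :]   # n+ x n- pairwise differences
        return np.mean(diff > 0)

    scores_pos = np.array([1.2, 0.4, 2.3])          # f(x_i^+; w) over S+
    scores_neg = np.array([0.1, -0.7, 0.5, 1.5])    # f(x_j^-; w) over S-
    print(auc_wmw(scores_pos, scores_neg))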

  23. AUC Approximation via Surrogate Losses
      The indicator function 𝟙[·] can be approximated with:
          a sigmoid surrogate function, Yan et al. (2003);
          a pairwise exponential loss or pairwise logistic loss, Rudin and Schapire (2009);
          a pairwise hinge loss, Steck (2007):
              F_hinge(w) = (1 / (n⁺ · n⁻)) Σ_{i=1}^{n⁺} Σ_{j=1}^{n⁻} max(0, 1 − (f(x⁺_i; w) − f(x⁻_j; w))).
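A sketch of the pairwise hinge surrogate, written here in the standard minimization form max(0, 1 − (f(x⁺_i; w) − f(x⁻_j; w))), which penalizes positive/negative pairs ranked with a margin smaller than one; the design matrices X_pos and X_neg are hypothetical.

    import numpy as np

    def pairwise_hinge(w, X_pos, X_neg):
        # F_hinge(w): average over all (positive, negative) pairs of
        # max(0, 1 - (f(x_i^+; w) - f(x_j^-; w))).
        scores_pos = X_pos @ w
        scores_neg = X_neg @ w
        diff = scores_pos[:, None] - scores_neg[None, :]
        return np.mean(np.maximum(0.0, 1.0 - diff))

    rng = np.random.default_rng(0)
    X_pos = rng.normal(loc=1.0, size=(30, 4))       # hypothetical positive-class inputs
    X_neg = rng.normal(loc=-1.0, size=(50, 4))      # hypothetical negative-class inputs
    print(pairwise_hinge(rng.normal(size=4), X_pos, X_neg))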
