Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers

Hiva Ghanbari, joint work with Prof. Katya Scheinberg
Industrial and Systems Engineering Department, Lehigh University

US & Mexico Workshop on Optimization and its Applications, Huatulco, Mexico, January 2018
Outline
- Introduction
- Directly Optimizing Prediction Error
- Directly Optimizing AUC
- Numerical Analysis
- Summary
Supervised Learning Problem

Given a finite sample data set S of n (input, label) pairs, i.e., S := {(x_i, y_i) : i = 1, ..., n}, where x_i ∈ R^d and y_i ∈ {+1, −1}.
We are interested in the binary classification problem in supervised learning.
Binary classification problem ⇒ discrete-valued output +1 or −1.
We are interested in a linear classifier (predictor) f(x; w) = w^T x, so that f : X → Y, where X denotes the space of input values and Y the space of output values.

How good is this classifier?
- Prediction Error
- Area Under ROC Curve (AUC)
Expected Risk (Prediction Error)

In S, each (x_i, y_i) is an i.i.d. observation of the random variables (X, Y).
(X, Y) has an unknown joint probability distribution P_{X,Y}(x, y) over X and Y.
The expected risk associated with a linear classifier f(x; w) = w^T x for the zero-one loss function is defined as
\[
R_{0\text{-}1}(f) = \mathbb{E}_{X,Y}\left[\ell_{0\text{-}1}(f(X; w), Y)\right]
= \int_{\mathcal{X}} \int_{\mathcal{Y}} P_{X,Y}(x, y)\, \ell_{0\text{-}1}(f(x; w), y)\, dy\, dx,
\]
where
\[
\ell_{0\text{-}1}(f(x; w), y) =
\begin{cases}
1 & \text{if } y \cdot f(x; w) < 0, \\
0 & \text{if } y \cdot f(x; w) \ge 0.
\end{cases}
\]
Empirical Risk Minimization

The joint probability distribution P_{X,Y}(x, y) is unknown.
Hence, the empirical risk of the linear classifier f(x; w) for the zero-one loss function over the finite training set S is of interest:
\[
R_{0\text{-}1}(f; S) = \frac{1}{n} \sum_{i=1}^{n} \ell_{0\text{-}1}(f(x_i; w), y_i).
\]
Utilizing the logistic regression loss function instead of the 0-1 loss function results in
\[
R_{\log}(f; S) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \cdot f(x_i; w))\right).
\]
In practice, one solves the regularized problem
\[
\min_{w \in \mathbb{R}^d} \; F_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \cdot f(x_i; w))\right) + \lambda \|w\|^2.
\]
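For concreteness, here is a minimal NumPy sketch (not from the slides) of the regularized logistic objective F_log and its gradient; the array names X, y and the regularization weight lam are placeholders.

```python
import numpy as np
from scipy.special import expit  # expit(t) = 1 / (1 + exp(-t))

def logistic_objective(w, X, y, lam):
    """F_log(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + lam * ||w||^2."""
    margins = y * (X @ w)                        # y_i * w^T x_i for every sample
    # np.logaddexp(0, -m) computes log(1 + exp(-m)) in a numerically stable way
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

def logistic_gradient(w, X, y, lam):
    """Gradient of F_log with respect to w."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -sigmoid(-m), chained with m = y_i * w^T x_i
    coeff = -y * expit(-margins)
    return X.T @ coeff / X.shape[0] + 2.0 * lam * w
```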
Alternative Interpretation of the Prediction Error

We can interpret the prediction error as a probability value:
\[
F_{\text{error}}(w) = R_{0\text{-}1}(f) = \mathbb{E}_{X,Y}\left[\ell_{0\text{-}1}(f(X; w), Y)\right] = P(Y \cdot w^T X < 0).
\]
If the true values of the prior probabilities P(Y = +1) and P(Y = −1) are known, or obtainable from a trivial calculation, then:

Lemma 1. The expected risk can be interpreted in terms of probability values, so that
\[
F_{\text{error}}(w) = P(Y \cdot w^T X < 0)
= P(Z^+ \le 0)\, P(Y = +1) + \left(1 - P(Z^- \le 0)\right) P(Y = -1),
\]
where Z^+ = w^T X^+ and Z^- = w^T X^-, for X^+ and X^- random variables drawn from the positive and negative classes, respectively.
Data with Any Arbitrary Distribution

Suppose (X_1, ..., X_n) is a multivariate random variable. For a given mapping function g(·), we are interested in the c.d.f. of Z = g(X_1, ..., X_n).
If we define the region in the space X_1 × ... × X_n where g(x_1, ..., x_n) ≤ z, then we have
\[
F_Z(z) = P(Z \le z) = P(g(X) \le z)
= P\left(\{x_1 \in \mathcal{X}_1, \dots, x_n \in \mathcal{X}_n : g(x_1, \dots, x_n) \le z\}\right)
= \int \cdots \int_{\{x_1 \in \mathcal{X}_1, \dots, x_n \in \mathcal{X}_n :\; g(x_1, \dots, x_n) \le z\}} f_{X_1, \dots, X_n}(x_1, \dots, x_n)\, dx_1 \cdots dx_n.
\]
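As an aside not on the slides: for an arbitrary distribution this region integral rarely has a closed form, but F_Z(z) can always be approximated by Monte Carlo as the fraction of sampled points with g(x) ≤ z. A small illustrative sketch (all names are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_cdf(g, sampler, z, n_samples=100_000):
    """Monte Carlo estimate of F_Z(z) = P(g(X_1, ..., X_n) <= z)."""
    samples = sampler(n_samples)                 # each row is one draw of (x_1, ..., x_n)
    return np.mean(g(samples) <= z)

# toy example: X uniform on [0, 1]^2 and g(x) = x_1 + x_2, so F_Z(1) = 0.5
sampler = lambda m: rng.uniform(size=(m, 2))
print(mc_cdf(lambda x: x.sum(axis=1), sampler, z=1.0))
```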
Data with Normal Distribution

Assume X^+ ∼ N(μ^+, Σ^+) and X^- ∼ N(μ^-, Σ^-).

Why Normal?
- The family of multivariate Normal distributions is closed under linear transformations.

Theorem 2 (Tong (1990)). If X ∼ N(μ, Σ) and Z = CX + b, where C is any given m × n real matrix and b is any m × 1 real vector, then Z ∼ N(Cμ + b, C Σ C^T).

- The Normal distribution has a smooth c.d.f.
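A quick Monte Carlo sanity check (illustrative only, not from the slides) of the special case used later, Z = w^T X: the empirical mean and variance of w^T X should match w^T μ and w^T Σ w.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 2.0])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T                        # a valid positive semidefinite covariance
w = np.array([0.3, -1.2, 0.7])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Z = X @ w                              # Z = w^T X for each sample

print(Z.mean(), w @ mu)                # empirical vs. theoretical mean  w^T mu
print(Z.var(), w @ Sigma @ w)          # empirical vs. theoretical variance  w^T Sigma w
```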
Prediction Error as a Smooth Function

Theorem 3. Suppose that X^+ ∼ N(μ^+, Σ^+) and X^- ∼ N(μ^-, Σ^-). Then,
\[
F_{\text{error}}(w) = P(Y = +1)\left(1 - \phi(\mu_{Z^+}/\sigma_{Z^+})\right) + P(Y = -1)\, \phi(\mu_{Z^-}/\sigma_{Z^-}),
\]
where
\[
\mu_{Z^+} = w^T \mu^+, \quad \sigma_{Z^+} = \sqrt{w^T \Sigma^+ w}, \qquad
\mu_{Z^-} = w^T \mu^-, \quad \sigma_{Z^-} = \sqrt{w^T \Sigma^- w},
\]
in which φ is the c.d.f. of the standard normal distribution, i.e.,
\[
\phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\tfrac{1}{2} t^2\right) dt \quad \text{for all } x \in \mathbb{R}.
\]
Prediction error is a smooth function of w ⇒ we can compute the gradient and ...
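A minimal plug-in sketch of Theorem 3 (not the authors' code): class means, covariances, and priors are estimated from the training sample and substituted into the formula, with SciPy providing the standard normal c.d.f. The names X_pos and X_neg are placeholders for the positive and negative sample matrices.

```python
import numpy as np
from scipy.stats import norm

def smoothed_error(w, X_pos, X_neg):
    """Plug-in evaluation of F_error(w) from Theorem 3 under a normal model for each class."""
    n_pos, n_neg = len(X_pos), len(X_neg)
    p_pos = n_pos / (n_pos + n_neg)              # estimated prior P(Y = +1)
    p_neg = 1.0 - p_pos

    mu_p, Sigma_p = X_pos.mean(axis=0), np.cov(X_pos, rowvar=False)
    mu_n, Sigma_n = X_neg.mean(axis=0), np.cov(X_neg, rowvar=False)

    mu_zp, sigma_zp = w @ mu_p, np.sqrt(w @ Sigma_p @ w)   # mean and std of Z^+ = w^T X^+
    mu_zn, sigma_zn = w @ mu_n, np.sqrt(w @ Sigma_n @ w)   # mean and std of Z^- = w^T X^-

    return p_pos * (1.0 - norm.cdf(mu_zp / sigma_zp)) + p_neg * norm.cdf(mu_zn / sigma_zn)
```

Since this expression is smooth in w, its gradient is available in closed form (or via automatic differentiation) and can be handed to any smooth optimizer.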
Learning From Imbalanced Data Sets

Many real-world machine learning problems deal with imbalanced learning data.

[Figure: (a) a balanced data set vs. (b) an imbalanced data set]
Receiver Operating Characteristic (ROC) Curve

Outputs sorted by descending value of f(x; w) = w^T x:
[Figure: examples ordered by score, labeled − + + − + − +, with a sliding classification threshold]

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)

Various thresholds result in different values of the True Positive Rate = TP / (TP + FN) and the False Positive Rate = FP / (FP + TN).
The ROC curve presents the tradeoff between the TPR and the FPR, for all possible thresholds.
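A short sketch (placeholder names, not from the slides) of how the ROC points arise: sort the examples by the score w^T x, then sweep the threshold down the sorted list and record (FPR, TPR) at each cut.

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold over the sorted scores."""
    order = np.argsort(-scores)            # sort by descending f(x; w) = w^T x
    labels = labels[order]
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == -1)

    tps = np.cumsum(labels == 1)           # TP count when the top-k examples are predicted positive
    fps = np.cumsum(labels == -1)          # FP count at the same threshold
    return fps / n_neg, tps / n_pos        # FPR = FP/(FP+TN), TPR = TP/(TP+FN)

# toy usage with the ordering from the slide: - + + - + - +
scores = np.array([7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
labels = np.array([-1, 1, 1, -1, 1, -1, 1])
print(roc_points(scores, labels))
```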
Area Under ROC Curve (AUC)

How can we compare ROC curves? Higher AUC ⇒ better classifier.
An Unbiased Estimate of the AUC Value

An unbiased estimate of the AUC value of a linear classifier can be obtained via the Wilcoxon-Mann-Whitney (WMW) statistic (Mann and Whitney (1947)):
\[
\text{AUC}(f; S^+, S^-) = \frac{\sum_{i=1}^{n^+} \sum_{j=1}^{n^-} \mathbb{1}\left[f(x_i^+; w) > f(x_j^-; w)\right]}{n^+ \cdot n^-},
\]
where
\[
\mathbb{1}\left[f(x_i^+; w) > f(x_j^-; w)\right] =
\begin{cases}
1 & \text{if } f(x_i^+; w) > f(x_j^-; w), \\
0 & \text{otherwise},
\end{cases}
\]
in which S = S^+ ∪ S^-.
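A direct NumPy sketch of the WMW estimate (illustrative; X_pos and X_neg are placeholder names for the positive and negative sample matrices):

```python
import numpy as np

def wmw_auc(w, X_pos, X_neg):
    """WMW estimate of AUC: fraction of (positive, negative) pairs ranked correctly by w^T x."""
    scores_pos = X_pos @ w                  # f(x^+; w) for all positive examples
    scores_neg = X_neg @ w                  # f(x^-; w) for all negative examples
    # compare every positive score against every negative score
    correct_pairs = (scores_pos[:, None] > scores_neg[None, :]).sum()
    return correct_pairs / (len(scores_pos) * len(scores_neg))
```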
AUC Approximation via Surrogate Losses

The indicator function 1[·] can be approximated with:
- a sigmoid surrogate function, Yan et al. (2003),
- a pairwise exponential loss or a pairwise logistic loss, Rudin and Schapire (2009),
- a pairwise hinge loss, Steck (2007):
\[
F_{\text{hinge}}(w) = \frac{\sum_{i=1}^{n^+} \sum_{j=1}^{n^-} \max\left(0,\, 1 - \left(f(x_i^+; w) - f(x_j^-; w)\right)\right)}{n^+ \cdot n^-}.
\]
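A compact sketch (placeholder names, not the authors' implementation) of the pairwise hinge surrogate F_hinge, averaged over all positive-negative pairs:

```python
import numpy as np

def pairwise_hinge(w, X_pos, X_neg):
    """Pairwise hinge surrogate of the AUC misranking error (F_hinge)."""
    margins = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]   # f(x_i^+; w) - f(x_j^-; w) for all pairs
    return np.maximum(0.0, 1.0 - margins).mean()            # average hinge over the n^+ * n^- pairs
```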