Calibrated Surrogate Maximization of Linear-fractional Utility
7th Feb.
Han Bao (The University of Tokyo / RIKEN AIP)
Is Accuracy Appropriate?
● Our focus: binary classification
[figure: two toy classifiers over positive / negative examples; both attain accuracy 0.8]
● The same accuracy may hide severe issues! (e.g. in medical diagnosis)
● The F-measure separates the two: one classifier has F-measure 0.75, the other F-measure 0
● F-measure: F₁ = 2TP / (2TP + FP + FN), where
  TP = 𝔼_{X,Y=+1}[1{f(X)>0}],  TN = 𝔼_{X,Y=−1}[1{f(X)<0}],  FP = 𝔼_{X,Y=−1}[1{f(X)>0}],  FN = 𝔼_{X,Y=+1}[1{f(X)<0}]
Training and Evaluation
● Usual empirical risk minimization (ERM): train by minimizing the 0/1-error and evaluate with accuracy, Acc = TP + TN = 1 − (0/1-risk) → training and evaluation are compatible
● Training with accuracy but evaluating with F₁ = 2TP / (2TP + FP + FN) → training and evaluation are incompatible
● Why not direct optimization of F₁ = 2TP / (2TP + FP + FN)? → training and evaluation become compatible again
● Accuracy: Acc = TP + TN
● F-measure: F₁ = 2TP / (2TP + FP + FN)
● Jaccard index: Jac = TP / (TP + FP + FN)
● Gower-Legendre index: GL = (TP + TN) / (TP + α(FP + FN) + TN)
● Weighted Accuracy: WAcc = (w₁TP + w₂TN) / (w₁TP + w₂TN + w₃FP + w₄FN)
● Balanced Error Rate: BER = ½(FN/π + FP/(1 − π))
● Fowlkes-Mallows index: FM = TP / √(π(TP + FP))
● Matthews Correlation Coefficient: MCC = (TP·TN − FP·FN) / √(π(1 − π)(TP + FP)(TN + FN))
(π = ℙ(Y = +1))
Wanna unify!!
Unification of Metrics
● Note: TN = ℙ(Y = −1) − FP and FN = ℙ(Y = +1) − TP, so every metric can be written with TP and FP only
  e.g. F₁ = 2TP / (2TP + FP + FN),  Jac = TP / (TP + FP + FN)
● Actual metrics are a linear fraction:
  U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁),  a_k, b_k, c_k: constants
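To make the unified form concrete, the sketch below (Python, not part of the talk) writes F₁, Jaccard, and accuracy as U = (a₀·TP + b₀·FP + c₀) / (a₁·TP + b₁·FP + c₁) using the identities above; `pi` stands for ℙ(Y = +1), and every function name here is made up for illustration.

```python
# Illustrative sketch (not from the talk): F1, Jaccard, and accuracy written in
# the unified linear-fractional form U = (a0*TP + b0*FP + c0) / (a1*TP + b1*FP + c1),
# using TN = P(Y=-1) - FP and FN = P(Y=+1) - TP to eliminate TN and FN.
# `pi` denotes the class prior P(Y=+1); all names are made up for this example.

def linear_fractional(tp, fp, coeffs):
    (a0, b0, c0), (a1, b1, c1) = coeffs
    return (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)

def coefficients(metric, pi):
    # returns ((a0, b0, c0), (a1, b1, c1)) for each metric
    if metric == "f1":        # 2TP / (2TP + FP + FN) = 2TP / (TP + FP + pi)
        return (2.0, 0.0, 0.0), (1.0, 1.0, pi)
    if metric == "jaccard":   # TP / (TP + FP + FN) = TP / (FP + pi)
        return (1.0, 0.0, 0.0), (0.0, 1.0, pi)
    if metric == "accuracy":  # TP + TN = TP - FP + (1 - pi)
        return (1.0, -1.0, 1.0 - pi), (0.0, 0.0, 1.0)
    raise ValueError(metric)

# sanity check: pi = 0.5, TP = 0.4, FP = 0.1 (so FN = 0.1, TN = 0.4)
print(linear_fractional(0.4, 0.1, coefficients("f1", 0.5)))        # 0.8 = 2*0.4/(0.8+0.1+0.1)
print(linear_fractional(0.4, 0.1, coefficients("accuracy", 0.5)))  # 0.8 = 0.4 + 0.4
```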
Unification of Metrics
U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁)
     = (a₀𝔼_P[1{f(X)>0}] + b₀𝔼_N[1{f(X)>0}] + c₀) / (a₁𝔼_P[1{f(X)>0}] + b₁𝔼_N[1{f(X)>0}] + c₁)
     =: 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))]   (linear fraction)
● TP and FP are expectations of the 0/1-loss, e.g. TP = ℙ(Y = +1, f(X) > 0) = 𝔼_{X,Y=+1}[1{f(X)>0}]
Goal of This Talk
Given: a metric (utility) U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) and a labeled sample {(xᵢ, yᵢ)}ᵢ₌₁ⁿ drawn i.i.d. from ℙ
Find: a classifier f : 𝒳 → ℝ s.t. U(f) = sup_{f′} U(f′)
Q. How to optimize U(f) directly?
  ▶ without estimating the class-posterior probability
Outline
● Introduction
● Preliminary
  ▶ Convex Risk Minimization
  ▶ Plug-in Principle vs. Cost-sensitive Learning
● Key Idea
  ▶ Quasi-concave Surrogate
● Calibration Analysis & Experiments
Formulation of Classification
● Goal of classification: maximize accuracy = minimize the misclassification rate
  R̂(f) = (1/n) Σᵢ 1[yᵢ ≠ sign(f(xᵢ))]
● Make the 0/1-loss smoother → (empirical) surrogate risk
  R̂_φ(f) = (1/n) Σᵢ φ(yᵢ f(xᵢ)),  m = yᵢ f(xᵢ): margin (m > 0: classified correctly, m < 0: classified incorrectly)
[figure: 0/1-loss, logistic loss, and hinge loss plotted against the margin m]
● Examples of surrogates φ convex in m:
  ▶ logistic loss  ▶ hinge loss → SVM  ▶ exponential loss → AdaBoost
3 Actors in Risk Minimization
● Classification risk (= 1 − Accuracy): R(f) = 𝔼[ℓ(Yf(X))]
  ▶ the 0/1-loss ℓ indicates whether X is correctly classified by f; Yf(X) is the prediction margin (positive ⇔ classified correctly)
● Surrogate loss φ makes the risk tractable: R_φ(f) = 𝔼[φ(Yf(X))]
  ▶ φ is a differentiable upper bound of the 0/1-loss
● Sample approximation (M-estimation): what we actually minimize (empirical surrogate risk)
  R̂_φ(f) = (1/n) Σᵢ φ(yᵢ f(xᵢ))
[figure: 0/1-loss, logistic loss, and hinge loss against the margin m]
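As a hedged illustration of these three actors, the following minimal sketch (assuming only NumPy; names invented for this example) minimizes the empirical surrogate risk with the logistic loss φ(m) = log(1 + e⁻ᵐ) and a linear model f(x) = ⟨w, x⟩ by plain gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(margin):
    # phi(m) = log(1 + exp(-m)), computed in a numerically stable way
    return np.logaddexp(0.0, -margin)

def empirical_surrogate_risk(w, X, y):
    # \hat{R}_phi(f) = (1/n) sum_i phi(y_i * f(x_i)) with f(x) = <w, x>
    return np.mean(logistic_loss(y * (X @ w)))

def minimize_surrogate_risk(X, y, lr=0.5, n_iter=2000):
    # gradient descent on the empirical surrogate risk (labels y in {-1, +1})
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margin = y * (X @ w)
        # d/dw (1/n) sum_i phi(y_i <w, x_i>) = -(1/n) sum_i sigmoid(-m_i) * y_i * x_i
        grad = -(X * (sigmoid(-margin) * y)[:, None]).mean(axis=0)
        w -= lr * grad
    return w
```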
Convexity & Statistical Property
Q. argmin R̂_φ (tractable, convex) generalizes to argmin R_φ; is argmin R_φ = argmin R (intractable)?
A. Yes, with a calibrated surrogate.
Theorem (informal) [Bartlett+ 2006]. Assume φ is convex. Then argmin_f R_φ(f) = argmin_f R(f) iff φ′(0) < 0.
where R̂_φ(f) = (1/n) Σᵢ φ(yᵢ f(xᵢ)),  R_φ(f) = 𝔼[φ(Yf(X))],  R(f) = 𝔼[ℓ(Yf(X))]
P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.
Related Work: Plug-in Rule
● Classifier based on the class-posterior probability ℙ(Y = +1 | x)
  ▶ Bayes-optimal classifier (accuracy): predict Y = +1 iff ℙ(Y = +1 | x) ≥ 1/2
  ▶ Bayes-optimal classifier (general case): predict Y = +1 iff ℙ(Y = +1 | x) ≥ δ*
● Plug-in rule: estimate ℙ(Y = +1 | x) and the threshold δ* independently [Koyejo+ NIPS2014; Yan+ ICML2018]
O. O. Koyejo, N. Natarajan, P. K. Ravikumar, & I. S. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, 2014.
B. Yan, O. Koyejo, K. Zhong, & P. Ravikumar. Binary classification with Karmic, threshold-quasi-concave metrics. In ICML, 2018.
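For contrast with the direct optimization pursued in this talk, here is a rough sketch of the plug-in principle, under the assumption that scikit-learn is available; `plug_in_classifier` and `f1_utility` are illustrative names, not an implementation from the cited papers. It estimates ℙ(Y = +1 | x) with a probabilistic classifier and then tunes the threshold δ* on held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def f1_utility(y_true, y_pred):
    # F1 computed from empirical TP, FP, FN (labels in {-1, +1})
    tp = np.mean((y_true == 1) & (y_pred == 1))
    fp = np.mean((y_true == -1) & (y_pred == 1))
    fn = np.mean((y_true == 1) & (y_pred == -1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def plug_in_classifier(X_tr, y_tr, X_val, y_val, metric=f1_utility):
    # step 1: estimate the class-posterior probability eta(x) = P(Y=+1|x)
    posterior = LogisticRegression().fit(X_tr, y_tr)
    eta_val = posterior.predict_proba(X_val)[:, 1]
    # step 2: choose the threshold delta* maximizing the metric on validation data
    thresholds = np.linspace(0.01, 0.99, 99)
    scores = [metric(y_val, np.where(eta_val > d, 1, -1)) for d in thresholds]
    delta_star = thresholds[int(np.argmax(scores))]
    # final classifier: sign(eta(x) - delta*)
    return lambda X: np.where(posterior.predict_proba(X)[:, 1] > delta_star, 1, -1)
```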
Outline
● Introduction
● Preliminary
  ▶ Convex Risk Minimization
  ▶ Plug-in Principle vs. Cost-sensitive Learning
● Key Idea (next)
  ▶ Quasi-concave Surrogate
● Calibration Analysis & Experiments
Convexity & Statistical Property
● For accuracy, the chain is: R̂_φ(f) = (1/n) Σᵢ φ(yᵢ f(xᵢ)) (tractable, convex) → generalizes to R_φ(f) = 𝔼[φ(Yf(X))] → calibration → argmin R_φ = argmin R, with R(f) = 𝔼[ℓ(Yf(X))] (intractable)
Q. What is a tractable & calibrated objective for the utility
  U(f) = 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))]   (intractable)?
Non-concave, but Quasi-concave
● Idea: concave / convex = quasi-concave
  Claim: if f is concave, g is convex, f(x) ≥ 0, and g(x) > 0 for all x, then f(x)/g(x) is quasi-concave.
(proof) Show that the super-level set {x | f/g ≥ α} is convex for every α ≥ 0:
  f(x)/g(x) ≥ α ⟺ f(x) − αg(x) ≥ 0; f − αg is concave, and the super-level set of a concave function is convex, so {x | f/g ≥ α} is convex. ∎
  NB: a function is quasi-concave iff all of its super-level sets are convex.
● Quasi-concave = non-concave but unimodal → can be optimized efficiently
Surrogate Utility
● Idea: bound the true (linear-fractional) utility from below
  U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) = (a₀𝔼_P[1{f(X)>0}] + b₀𝔼_N[1{f(X)>0}] + c₀) / (a₁𝔼_P[1{f(X)>0}] + b₁𝔼_N[1{f(X)>0}] + c₁) ≥ (surrogate)
  ▶ numerator bounded from below: a non-negative sum of concave functions → concave
  ▶ denominator bounded from above: a non-negative sum of convex functions → convex
Surrogate Utility
● Idea: bound the true utility from below with a surrogate loss φ
  U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) ≥ U_φ(f)
● Surrogate utility:
  U_φ(f) = (a₀𝔼_P[1 − φ(f(X))] + b₀𝔼_N[−φ(−f(X))] + c₀) / (a₁𝔼_P[1 + φ(f(X))] + b₁𝔼_N[φ(−f(X))] + c₁) =: 𝔼[W₀,φ] / 𝔼[W₁,φ]
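A minimal empirical version of this surrogate utility might look as follows; it assumes that 𝔼_P and 𝔼_N denote class-conditional expectations weighted by the empirical class priors, picks the hinge loss as one possible φ, and uses invented names throughout, so treat it as a sketch rather than the authors' implementation.

```python
import numpy as np

def hinge(margin):
    # one possible convex surrogate phi(m) = max(0, 1 - m)
    return np.maximum(0.0, 1.0 - margin)

def surrogate_utility_parts(w, X, y, coeffs, phi=hinge):
    # returns the numerator and denominator of U_phi for a linear model f(x) = <w, x>
    (a0, b0, c0), (a1, b1, c1) = coeffs
    scores = X @ w
    pos, neg = scores[y == 1], scores[y == -1]
    pi, rho = np.mean(y == 1), np.mean(y == -1)   # empirical class priors
    num = a0 * pi * np.mean(1.0 - phi(pos)) + b0 * rho * np.mean(-phi(-neg)) + c0
    den = a1 * pi * np.mean(1.0 + phi(pos)) + b1 * rho * np.mean(phi(-neg)) + c1
    return num, den
```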
Hybrid Optimization Strategy
● Note: the numerator of U_φ(f) = (a₀𝔼_P[1 − φ(f(X))] + b₀𝔼_N[−φ(−f(X))] + c₀) / (a₁𝔼_P[1 + φ(f(X))] + b₁𝔼_N[φ(−f(X))] + c₁) can be negative
  ▶ U_φ isn't quasi-concave if the numerator < 0
  ▶ maximize the numerator first (concave), then maximize the fractional form (quasi-concave)
Hybrid Optimization Strategy
[figure: two-stage procedure; stage 1 maximizes the numerator, stage 2 maximizes the fraction]
● Normalized gradient for quasi-concave optimization [Hazan+ NeurIPS2015]
Hazan, E., Levy, K., & Shalev-Shwartz, S. (2015). Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems (pp. 1594-1602).
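Putting the two stages together, a rough sketch of the hybrid strategy could look like this (reusing `surrogate_utility_parts` from the previous snippet; the finite-difference gradients, step sizes, and stopping rules are simplifications made for this example, not the method's actual implementation).

```python
import numpy as np
# assumes surrogate_utility_parts (and a surrogate loss phi) from the previous sketch is in scope

def numerical_grad(fn, w, eps=1e-6):
    # finite-difference gradient, used only to keep the sketch self-contained
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (fn(w + e) - fn(w - e)) / (2 * eps)
    return g

def hybrid_maximize(w0, X, y, coeffs, phi, lr=0.1, n_outer=500, n_inner=200):
    w = w0.astype(float).copy()
    numerator = lambda v: surrogate_utility_parts(v, X, y, coeffs, phi)[0]
    utility = lambda v: np.divide(*surrogate_utility_parts(v, X, y, coeffs, phi))
    # stage 1: concave maximization of the numerator until it becomes non-negative
    for _ in range(n_inner):
        if numerator(w) >= 0:
            break
        w += lr * numerical_grad(numerator, w)
    # stage 2: quasi-concave maximization of the fraction with normalized gradient ascent
    for _ in range(n_outer):
        g = numerical_grad(utility, w)
        w += lr * g / (np.linalg.norm(g) + 1e-12)
    return w
```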
Outline
● Introduction
● Preliminary
  ▶ Convex Risk Minimization
  ▶ Plug-in Principle vs. Cost-sensitive Learning
● Key Idea
  ▶ Quasi-concave Surrogate
● Calibration Analysis & Experiments (next)