

  1. Calibrated Surrogate Maximization of Linear-fractional Utility. 7th Feb. Han Bao (The University of Tokyo / RIKEN AIP)

  2. Is accuracy appropriate? ■ Our focus: binary classification (positive vs. negative). A classifier can reach accuracy 0.8 while still misclassifying most of one class, which may cause severe issues (e.g. in medical diagnosis).

  3. Is accuracy appropriate? Two classifiers with the same accuracy 0.8 can have very different F-measures (0.75 vs. 0). F-measure: F₁ = 2TP / (2TP + FP + FN), where TP = 𝔼_{X, Y=+1}[1{f(X) > 0}], TN = 𝔼_{X, Y=−1}[1{f(X) < 0}], FP = 𝔼_{X, Y=−1}[1{f(X) > 0}], FN = 𝔼_{X, Y=+1}[1{f(X) < 0}].
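
A minimal sketch of these empirical quantities in Python (my own illustration; the helper names are not from the talk):

```python
import numpy as np

def confusion_rates(y_true, scores):
    """Empirical TP, TN, FP, FN as fractions of the sample,
    mirroring the expectations defined on slide 3."""
    pred = np.where(scores > 0, 1, -1)
    n = len(y_true)
    tp = np.sum((y_true == 1) & (pred == 1)) / n
    tn = np.sum((y_true == -1) & (pred == -1)) / n
    fp = np.sum((y_true == -1) & (pred == 1)) / n
    fn = np.sum((y_true == 1) & (pred == -1)) / n
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return tp + tn

def f_measure(tp, tn, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)
```

For instance, on a sample with 2 positives and 8 negatives, a classifier that predicts everything negative gets accuracy 0.8 but F-measure 0, matching the slide's point.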

  4. Training and Evaluation ■ Usual empirical risk minimization (ERM): minimizing the 0/1-error is compatible with evaluation by accuracy, since Acc = TP + TN = 1 − (0/1-risk). ■ Training for accuracy but evaluating with F₁ = 2TP / (2TP + FP + FN) is incompatible: the training objective and the evaluation metric disagree. ■ Why not optimize the evaluation metric directly? (Direct Optimization)

  5. Wanna unify!! Many common metrics are built from TP, TN, FP, FN:
■ Accuracy: Acc = TP + TN
■ Weighted Accuracy: WAcc = (w₁TP + w₂TN) / (w₁TP + w₂TN + w₃FP + w₄FN)
■ Balanced Error Rate: BER = (1/2)(FN / π + FP / (1 − π))
■ F-measure: F₁ = 2TP / (2TP + FP + FN)
■ Jaccard index: Jac = TP / (TP + FP + FN)
■ Fowlkes-Mallows index: FMI = TP / √(π(TP + FP))
■ Gower-Legendre index: GLI = (TP + TN) / (TP + α(FP + FN) + TN)
■ Matthews Correlation Coefficient: MCC = (TP·TN − FP·FN) / √(π(1 − π)(TP + FP)(TN + FN))

  6. Unification of Metrics ■ Note: TN = ℙ(Y = −1) − FP and FN = ℙ(Y = +1) − TP, so the actual metrics above can be expressed with TP and FP only. ■ Linear-fractional form: U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁), where a_k, b_k, c_k are constants. ▶ e.g. F₁ = 2TP / (2TP + FP + FN) and Jac = TP / (TP + FP + FN) fit this form (see the worked substitution below).
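
To make the reduction explicit, here is the substitution worked out for F₁ and the Jaccard index (my own derivation following the slide's note, with π := ℙ(Y = +1) so that FN = π − TP):

```latex
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + (\pi - \mathrm{TP})}
    = \frac{2\,\mathrm{TP} + 0\cdot\mathrm{FP} + 0}{1\cdot\mathrm{TP} + 1\cdot\mathrm{FP} + \pi},
\qquad
\mathrm{Jac} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + (\pi - \mathrm{TP})}
             = \frac{1\cdot\mathrm{TP} + 0\cdot\mathrm{FP} + 0}{0\cdot\mathrm{TP} + 1\cdot\mathrm{FP} + \pi}.
```

Reading off the coefficients gives (a₀, b₀, c₀, a₁, b₁, c₁) = (2, 0, 0, 1, 1, π) for F₁ and (1, 0, 0, 0, 1, π) for Jac.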

  7. Unification of Metrics ■ TP and FP are expectations of the 0/1-loss: e.g. TP = ℙ(Y = +1, f(X) > 0) = 𝔼_{X, Y=+1}[1{f(X) > 0}]. ■ Hence the linear-fractional utility is a ratio of expectations:
U(f) = (a₀𝔼_P[1{f(X) > 0}] + b₀𝔼_N[1{f(X) > 0}] + c₀) / (a₁𝔼_P[1{f(X) > 0}] + b₁𝔼_N[1{f(X) > 0}] + c₁) =: 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))].

  8. Goal of This Talk ■ Given: a metric (utility) U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) and a labeled sample {(x_i, y_i)}_{i=1}^n drawn i.i.d. from ℙ. ■ Want: a classifier f : 𝒳 → ℝ such that U(f) = sup_{f′} U(f′). ■ Q. How to optimize U(f) directly? ▶ without estimating the class-posterior probability.

  9. Outline ■ Introduction ■ Preliminary ▶ Convex Risk Minimization ▶ Plug-in Principle vs. Cost-sensitive Learning ■ Key Idea ▶ Quasi-concave Surrogate ■ Calibration Analysis & Experiments

  10. Formulation of Classification ■ Goal of classification: maximize accuracy = minimize the mis-classification rate
R̂(f) = (1/n) ∑_{i=1}^n 1[y_i ≠ sign(f(x_i))] = (1/n) ∑_{i=1}^n ℓ_{0/1}(y_i f(x_i)).
■ Make the 0/1-loss smoother: the (empirical) surrogate risk R̂_φ(f) = (1/n) ∑_{i=1}^n φ(y_i f(x_i)), with φ a convex function of the margin m = y_i f(x_i). ▶ logistic loss ▶ hinge loss ⇒ SVM ▶ exponential loss ⇒ AdaBoost
(Figure: 0/1, logistic, and hinge losses plotted against the margin m.)
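
A minimal sketch of the empirical surrogate risk for the losses named above (my own illustration; the function names are not from the talk):

```python
import numpy as np

# Surrogate losses as functions of the margin m = y * f(x).
def zero_one_loss(m):
    return (m <= 0).astype(float)

def logistic_loss(m):
    return np.log1p(np.exp(-m))

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)

def exponential_loss(m):
    return np.exp(-m)

def empirical_risk(loss, y, scores):
    """R_hat_phi(f) = (1/n) * sum_i phi(y_i * f(x_i))."""
    return np.mean(loss(y * scores))
```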

  11. 3 Actors in Risk Minimization ■ Classification risk (= 1 − accuracy): R(f) = 𝔼[ℓ_{0/1}(Y f(X))]; the 0/1-loss indicates whether X is classified correctly or incorrectly by f. ■ Surrogate risk makes minimization tractable: R_φ(f) = 𝔼[φ(Y f(X))], where the surrogate loss φ is a differentiable upper bound of the 0/1-loss of the prediction margin. ■ Sample approximation (M-estimation): R̂_φ(f) = (1/n) ∑_{i=1}^n φ(y_i f(x_i)), which is what we actually minimize (the empirical surrogate risk).

  12. Convexity & Statistical Property ■ The empirical R̂_φ(f) = (1/n) ∑_{i=1}^n φ(y_i f(x_i)) is tractable (convex) and generalizes to R_φ(f) = 𝔼[φ(Y f(X))], while the target R(f) = 𝔼[ℓ_{0/1}(Y f(X))] is intractable. ■ Q. Is argmin_f R_φ(f) = argmin_f R(f)? A. Yes, with a calibrated surrogate. ■ Theorem (informal) [Bartlett+ 2006]. Assume φ is convex. Then argmin_f R_φ(f) = argmin_f R(f) iff φ′(0) < 0.
P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
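
As a quick check (my addition, not on the slide), the three surrogates listed on slide 10 all satisfy this calibration condition:

```latex
\phi_{\mathrm{log}}(m) = \log(1 + e^{-m}),\ \phi_{\mathrm{log}}'(0) = -\tfrac{1}{2} < 0; \quad
\phi_{\mathrm{hinge}}(m) = \max(0, 1 - m),\ \phi_{\mathrm{hinge}}'(0) = -1 < 0; \quad
\phi_{\mathrm{exp}}(m) = e^{-m},\ \phi_{\mathrm{exp}}'(0) = -1 < 0.
```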

  13. Related Work: Plug-in Rule [Koyejo+ NIPS2014; Yan+ ICML2018] ■ Classifier based on the class-posterior probability. ▶ Bayes-optimal classifier for accuracy: predict Y = +1 iff ℙ(Y = +1 | x) − 1/2 > 0, i.e. threshold the posterior at 1/2. ▶ Bayes-optimal classifier in the general case: predict Y = +1 iff ℙ(Y = +1 | x) − δ* > 0, with a metric-dependent threshold δ*. ⇒ estimate ℙ(Y = +1 | x) and δ* independently.
O. O. Koyejo, N. Natarajan, P. K. Ravikumar, & I. S. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, 2014.
B. Yan, O. Koyejo, K. Zhong, & P. Ravikumar. Binary classification with Karmic, threshold-quasi-concave metrics. In ICML, 2018.
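
A minimal sketch of the plug-in principle (my own illustration, assuming scikit-learn's LogisticRegression for the posterior estimate and a simple grid search for the threshold; this is not the exact procedure of the cited papers):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def plug_in_rule(X_train, y_train, X_val, y_val, metric):
    """Estimate P(Y=+1|x), then pick the threshold delta that maximizes
    `metric(y_true, y_pred)` (labels in {-1, +1}) on validation data."""
    model = LogisticRegression().fit(X_train, (y_train == 1).astype(int))
    p_val = model.predict_proba(X_val)[:, 1]          # estimated P(Y=+1|x)
    deltas = np.linspace(0.05, 0.95, 19)
    best_delta = max(
        deltas, key=lambda d: metric(y_val, np.where(p_val > d, 1, -1)))
    return model, best_delta
```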

  14. Outline ■ Introduction ■ Preliminary ▶ Convex Risk Minimization ▶ Plug-in Principle vs. Cost-sensitive Learning ■ Key Idea ▶ Quasi-concave Surrogate ■ Calibration Analysis & Experiments

  15. Convexity & Statistical Property ■ In standard ERM, the tractable (convex) surrogate risk R_φ(f) = 𝔼[φ(Y f(X))] is ① calibrated to the intractable R(f) = 𝔼[ℓ(Y f(X))], and the empirical R̂_φ(f) = (1/n) ∑_{i=1}^n φ(y_i f(x_i)) generalizes to R_φ. ■ Q. For the intractable utility U(f) = 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))], what is ② a tractable & calibrated surrogate objective?

  16. Non-concave, but Quasi-concave ■ Idea: concave / convex = quasi-concave. Quasi-concave functions are non-concave in general but unimodal, so they can be efficiently optimized. ■ quasi-concave ⊇ concave; a function is quasi-concave iff all of its super-level sets are convex (NB: super-level sets of a concave function are convex). ■ Claim: if f is concave, g is convex, f(x) ≥ 0 and g(x) > 0 for all x, then f/g is quasi-concave. (Proof) Show that {x | f(x)/g(x) ≥ α} is convex for all α ≥ 0: f(x)/g(x) ≥ α ⟺ f(x) − α g(x) ≥ 0, and f − α g is concave (a concave function minus a non-negative multiple of a convex one), so its super-level set is convex. ∎
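
A toy instance of this claim (my own example, not from the talk): take f(x) = 1 − x² (concave, non-negative) and g(x) = 1 + x² (convex, positive) on [−1, 1]. Then

```latex
h(x) = \frac{1 - x^2}{1 + x^2}, \qquad
h''(x) = \frac{4(3x^2 - 1)}{(1 + x^2)^3} > 0 \ \text{for } |x| > \tfrac{1}{\sqrt{3}},
```

so h is not concave on [−1, 1], yet every super-level set {x : h(x) ≥ α} = {x : x² ≤ (1 − α)/(1 + α)} is an interval, hence convex, so h is quasi-concave.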

  17. Surrogate Utility ■ Idea: bound the true linear-fractional utility from below, by bounding the numerator from below and the denominator from above:
U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) ≥ (lower bound of numerator) / (upper bound of denominator).
▶ numerator bound: non-negative sum of concave terms ⇒ concave; denominator bound: non-negative sum of convex terms ⇒ convex.

  18. Surrogate Utility ■ Replacing the indicators with a surrogate loss φ gives a lower bound of the true utility:
U(f) ≥ U_φ(f) = (a₀𝔼_P[1 − φ(f(X))] + b₀𝔼_N[−φ(−f(X))] + c₀) / (a₁𝔼_P[1 + φ(f(X))] + b₁𝔼_N[φ(−f(X))] + c₁) =: 𝔼[W_{0,φ}] / 𝔼[W_{1,φ}].
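
An empirical version of this surrogate utility as a sketch (my own code; it treats 𝔼_P and 𝔼_N as the positive- and negative-class contributions weighted by the class prior, consistent with TP = 𝔼_{X, Y=+1}[·], and assumes φ is a non-negative upper bound of the 0/1-loss such as the hinge loss):

```python
import numpy as np

def surrogate_utility(y, scores, a0, b0, c0, a1, b1, c1, phi):
    """Empirical U_phi(f): numerator lower-bounds a0*TP + b0*FP + c0,
    denominator upper-bounds a1*TP + b1*FP + c1."""
    pos, neg = scores[y == 1], scores[y == -1]
    pi = np.mean(y == 1)                     # class prior P(Y = +1)
    num = (a0 * pi * np.mean(1 - phi(pos))
           + b0 * (1 - pi) * np.mean(-phi(-neg)) + c0)
    den = (a1 * pi * np.mean(1 + phi(pos))
           + b1 * (1 - pi) * np.mean(phi(-neg)) + c1)
    return num / den
```

For example, with the F₁ coefficients from slide 6, (2, 0, 0, 1, 1, π), this yields a differentiable lower bound of F₁ once the scores come from a parametric model.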

  19. Hybrid Optimization Strategy ■ Note: the numerator of U_φ can be negative. ▶ U_φ isn't quasi-concave if the numerator < 0. ▶ Strategy: maximize the numerator first (concave), then maximize the fractional form (quasi-concave once the numerator is non-negative).

  20. Hybrid Optimization Strategy ■ Stage 1: maximize the numerator. ■ Stage 2: maximize the fraction, using the normalized gradient for quasi-concave optimization [Hazan+ NeurIPS2015].
E. Hazan, K. Levy, & S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, 2015 (pp. 1594-1602).
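
A schematic sketch of this two-stage procedure (my own; `numerator`, `grad_num`, and `grad_frac` are hypothetical callables supplied by the user, and the normalized step only follows the spirit of the cited normalized-gradient method):

```python
import numpy as np

def hybrid_maximize(theta, numerator, grad_num, grad_frac,
                    lr=0.1, n_stage1=200, n_stage2=200):
    """Stage 1: gradient ascent on the (concave) numerator until it is
    positive. Stage 2: normalized gradient ascent on the (quasi-concave)
    fractional surrogate, using only the gradient direction."""
    for _ in range(n_stage1):
        if numerator(theta) > 0:
            break
        theta = theta + lr * grad_num(theta)
    for _ in range(n_stage2):
        g = grad_frac(theta)
        norm = np.linalg.norm(g)
        if norm < 1e-12:
            break
        theta = theta + lr * g / norm
    return theta
```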

  21. Outline ■ Introduction ■ Preliminary ▶ Convex Risk Minimization ▶ Plug-in Principle vs. Cost-sensitive Learning ■ Key Idea ▶ Quasi-concave Surrogate ■ Calibration Analysis & Experiments
