Calibrated Surrogate Maximization of Linear-fractional Utility
7th Feb.
Han Bao (The University of Tokyo / RIKEN AIP)
Is Accuracy Appropriate?
● Our focus: binary classification
[figure: two toy classifiers over positive / negative examples; both attain accuracy 0.8]
● The same accuracy may hide severe issues! (e.g. in medical diagnosis)
● The F-measure separates the two: one classifier has F-measure 0.75, the other F-measure 0
● F-measure: F₁ = 2TP / (2TP + FP + FN), where
  TP = 𝔼_{X,Y=+1}[1{f(X)>0}],  TN = 𝔼_{X,Y=−1}[1{f(X)<0}],  FP = 𝔼_{X,Y=−1}[1{f(X)>0}],  FN = 𝔼_{X,Y=+1}[1{f(X)<0}]
Training and Evaluation
● Usual empirical risk minimization (ERM): train by minimizing the 0/1-error and evaluate with accuracy, Acc = TP + TN = 1 − (0/1-risk) → training and evaluation are compatible
● Training with accuracy but evaluating with F₁ = 2TP / (2TP + FP + FN) → training and evaluation are incompatible
● Why not direct optimization of F₁ = 2TP / (2TP + FP + FN)? → training and evaluation become compatible again
● Accuracy: Acc = TP + TN
● F-measure: F₁ = 2TP / (2TP + FP + FN)
● Jaccard index: Jac = TP / (TP + FP + FN)
● Gower-Legendre index: GL = (TP + TN) / (TP + α(FP + FN) + TN)
● Weighted Accuracy: WAcc = (w₁TP + w₂TN) / (w₁TP + w₂TN + w₃FP + w₄FN)
● Balanced Error Rate: BER = ½(FN/π + FP/(1 − π))
● Fowlkes-Mallows index: FM = TP / √(π(TP + FP))
● Matthews Correlation Coefficient: MCC = (TP·TN − FP·FN) / √(π(1 − π)(TP + FP)(TN + FN))
(π = ℙ(Y = +1))
Wanna unify!!
Unification of Metrics
● Note: TN = ℙ(Y = −1) − FP and FN = ℙ(Y = +1) − TP, so every metric can be written with TP and FP only
  e.g. F₁ = 2TP / (2TP + FP + FN),  Jac = TP / (TP + FP + FN)
● Actual metrics are a linear fraction:
  U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁),  a_k, b_k, c_k: constants
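To make the unified form concrete, the sketch below (Python, not part of the talk) writes F₁, Jaccard, and accuracy as U = (a₀·TP + b₀·FP + c₀) / (a₁·TP + b₁·FP + c₁) using the identities above; `pi` stands for ℙ(Y = +1), and every function name here is made up for illustration.

```python
# Illustrative sketch (not from the talk): F1, Jaccard, and accuracy written in
# the unified linear-fractional form U = (a0*TP + b0*FP + c0) / (a1*TP + b1*FP + c1),
# using TN = P(Y=-1) - FP and FN = P(Y=+1) - TP to eliminate TN and FN.
# `pi` denotes the class prior P(Y=+1); all names are made up for this example.

def linear_fractional(tp, fp, coeffs):
    (a0, b0, c0), (a1, b1, c1) = coeffs
    return (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)

def coefficients(metric, pi):
    # returns ((a0, b0, c0), (a1, b1, c1)) for each metric
    if metric == "f1":        # 2TP / (2TP + FP + FN) = 2TP / (TP + FP + pi)
        return (2.0, 0.0, 0.0), (1.0, 1.0, pi)
    if metric == "jaccard":   # TP / (TP + FP + FN) = TP / (FP + pi)
        return (1.0, 0.0, 0.0), (0.0, 1.0, pi)
    if metric == "accuracy":  # TP + TN = TP - FP + (1 - pi)
        return (1.0, -1.0, 1.0 - pi), (0.0, 0.0, 1.0)
    raise ValueError(metric)

# sanity check: pi = 0.5, TP = 0.4, FP = 0.1 (so FN = 0.1, TN = 0.4)
print(linear_fractional(0.4, 0.1, coefficients("f1", 0.5)))        # 0.8 = 2*0.4/(0.8+0.1+0.1)
print(linear_fractional(0.4, 0.1, coefficients("accuracy", 0.5)))  # 0.8 = 0.4 + 0.4
```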
Unification of Metrics
U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁)
     = (a₀𝔼_P[1{f(X)>0}] + b₀𝔼_N[1{f(X)>0}] + c₀) / (a₁𝔼_P[1{f(X)>0}] + b₁𝔼_N[1{f(X)>0}] + c₁)
     =: 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))]   (linear fraction)
● TP and FP are expectations of the 0/1-loss, e.g. TP = ℙ(Y = +1, f(X) > 0) = 𝔼_{X,Y=+1}[1{f(X)>0}]
Goal of This Talk
Given: a metric (utility) U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) and a labeled sample {(xᵢ, yᵢ)}ᵢ₌₁ⁿ drawn i.i.d. from ℙ
Find: a classifier f : 𝒳 → ℝ s.t. U(f) = sup_{f′} U(f′)
Q. How to optimize U(f) directly?
  ▶ without estimating the class-posterior probability
Outline
● Introduction
● Preliminary
  ▶ Convex Risk Minimization
  ▶ Plug-in Principle vs. Cost-sensitive Learning
● Key Idea
  ▶ Quasi-concave Surrogate
● Calibration Analysis & Experiments
Formulation of Classification
● Goal of classification: maximize accuracy = minimize the misclassification rate
  R̂(f) = (1/n) Σᵢ 1[yᵢ ≠ sign(f(xᵢ))]
● Make the 0/1-loss smoother → (empirical) surrogate risk
  R̂_φ(f) = (1/n) Σᵢ φ(yᵢ f(xᵢ)),  m = yᵢ f(xᵢ): margin (m > 0: classified correctly, m < 0: classified incorrectly)
[figure: 0/1-loss, logistic loss, and hinge loss plotted against the margin m]
● Examples of surrogates φ convex in m:
  ▶ logistic loss  ▶ hinge loss → SVM  ▶ exponential loss → AdaBoost
3 Actors in Risk Minimization
● Classification risk (= 1 − Accuracy): R(f) = 𝔼[ℓ(Yf(X))]
  ▶ the 0/1-loss ℓ indicates whether X is correctly classified by f; Yf(X) is the prediction margin (positive ⇔ classified correctly)
● Surrogate loss φ makes the risk tractable: R_φ(f) = 𝔼[φ(Yf(X))]
  ▶ φ is a differentiable upper bound of the 0/1-loss
● Sample approximation (M-estimation): what we actually minimize (empirical surrogate risk)
  R̂_φ(f) = (1/n) Σᵢ φ(yᵢ f(xᵢ))
[figure: 0/1-loss, logistic loss, and hinge loss against the margin m]
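As a hedged illustration of these three actors, the following minimal sketch (assuming only NumPy; names invented for this example) minimizes the empirical surrogate risk with the logistic loss φ(m) = log(1 + e⁻ᵐ) and a linear model f(x) = ⟨w, x⟩ by plain gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(margin):
    # phi(m) = log(1 + exp(-m)), computed in a numerically stable way
    return np.logaddexp(0.0, -margin)

def empirical_surrogate_risk(w, X, y):
    # \hat{R}_phi(f) = (1/n) sum_i phi(y_i * f(x_i)) with f(x) = <w, x>
    return np.mean(logistic_loss(y * (X @ w)))

def minimize_surrogate_risk(X, y, lr=0.5, n_iter=2000):
    # gradient descent on the empirical surrogate risk (labels y in {-1, +1})
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margin = y * (X @ w)
        # d/dw (1/n) sum_i phi(y_i <w, x_i>) = -(1/n) sum_i sigmoid(-m_i) * y_i * x_i
        grad = -(X * (sigmoid(-margin) * y)[:, None]).mean(axis=0)
        w -= lr * grad
    return w
```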
Convexity & Statistical Property
Q. argmin R̂_φ (tractable, convex) generalizes to argmin R_φ; is argmin R_φ = argmin R (intractable)?
A. Yes, with a calibrated surrogate.
Theorem (informal) [Bartlett+ 2006]. Assume φ is convex. Then argmin_f R_φ(f) = argmin_f R(f) iff φ′(0) < 0.
where R̂_φ(f) = (1/n) Σᵢ φ(yᵢ f(xᵢ)),  R_φ(f) = 𝔼[φ(Yf(X))],  R(f) = 𝔼[ℓ(Yf(X))]
P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.
Related Work: Plug-in Rule
● Classifier based on the class-posterior probability ℙ(Y = +1 | x)
  ▶ Bayes-optimal classifier (accuracy): predict Y = +1 iff ℙ(Y = +1 | x) ≥ 1/2
  ▶ Bayes-optimal classifier (general case): predict Y = +1 iff ℙ(Y = +1 | x) ≥ δ*
● Plug-in rule: estimate ℙ(Y = +1 | x) and the threshold δ* independently [Koyejo+ NIPS2014; Yan+ ICML2018]
O. O. Koyejo, N. Natarajan, P. K. Ravikumar, & I. S. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, 2014.
B. Yan, O. Koyejo, K. Zhong, & P. Ravikumar. Binary classification with Karmic, threshold-quasi-concave metrics. In ICML, 2018.
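For contrast with the direct optimization pursued in this talk, here is a rough sketch of the plug-in principle, under the assumption that scikit-learn is available; `plug_in_classifier` and `f1_utility` are illustrative names, not an implementation from the cited papers. It estimates ℙ(Y = +1 | x) with a probabilistic classifier and then tunes the threshold δ* on held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def f1_utility(y_true, y_pred):
    # F1 computed from empirical TP, FP, FN (labels in {-1, +1})
    tp = np.mean((y_true == 1) & (y_pred == 1))
    fp = np.mean((y_true == -1) & (y_pred == 1))
    fn = np.mean((y_true == 1) & (y_pred == -1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def plug_in_classifier(X_tr, y_tr, X_val, y_val, metric=f1_utility):
    # step 1: estimate the class-posterior probability eta(x) = P(Y=+1|x)
    posterior = LogisticRegression().fit(X_tr, y_tr)
    eta_val = posterior.predict_proba(X_val)[:, 1]
    # step 2: choose the threshold delta* maximizing the metric on validation data
    thresholds = np.linspace(0.01, 0.99, 99)
    scores = [metric(y_val, np.where(eta_val > d, 1, -1)) for d in thresholds]
    delta_star = thresholds[int(np.argmax(scores))]
    # final classifier: sign(eta(x) - delta*)
    return lambda X: np.where(posterior.predict_proba(X)[:, 1] > delta_star, 1, -1)
```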
Outline
● Introduction
● Preliminary
  ▶ Convex Risk Minimization
  ▶ Plug-in Principle vs. Cost-sensitive Learning
● Key Idea (next)
  ▶ Quasi-concave Surrogate
● Calibration Analysis & Experiments
Convexity & Statistical Property
● For accuracy, the chain is: R̂_φ(f) = (1/n) Σᵢ φ(yᵢ f(xᵢ)) (tractable, convex) → generalizes to R_φ(f) = 𝔼[φ(Yf(X))] → calibration → argmin R_φ = argmin R, with R(f) = 𝔼[ℓ(Yf(X))] (intractable)
Q. What is a tractable & calibrated objective for the utility
  U(f) = 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))]   (intractable)?
Non-concave, but Quasi-concave
● Idea: concave / convex = quasi-concave
  Claim: if f is concave, g is convex, f(x) ≥ 0, and g(x) > 0 for all x, then f(x)/g(x) is quasi-concave.
(proof) Show that the super-level set {x | f/g ≥ α} is convex for every α ≥ 0:
  f(x)/g(x) ≥ α ⟺ f(x) − αg(x) ≥ 0; f − αg is concave, and the super-level set of a concave function is convex, so {x | f/g ≥ α} is convex. ∎
  NB: a function is quasi-concave iff all of its super-level sets are convex.
● Quasi-concave = non-concave but unimodal → can be optimized efficiently
Surrogate Utility
● Idea: bound the true (linear-fractional) utility from below
  U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) = (a₀𝔼_P[1{f(X)>0}] + b₀𝔼_N[1{f(X)>0}] + c₀) / (a₁𝔼_P[1{f(X)>0}] + b₁𝔼_N[1{f(X)>0}] + c₁) ≥ (surrogate)
  ▶ numerator bounded from below: a non-negative sum of concave functions → concave
  ▶ denominator bounded from above: a non-negative sum of convex functions → convex
Surrogate Utility
● Idea: bound the true utility from below with a surrogate loss φ
  U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) ≥ U_φ(f)
● Surrogate utility:
  U_φ(f) = (a₀𝔼_P[1 − φ(f(X))] + b₀𝔼_N[−φ(−f(X))] + c₀) / (a₁𝔼_P[1 + φ(f(X))] + b₁𝔼_N[φ(−f(X))] + c₁) =: 𝔼[W₀,φ] / 𝔼[W₁,φ]
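A minimal empirical version of this surrogate utility might look as follows; it assumes that 𝔼_P and 𝔼_N denote class-conditional expectations weighted by the empirical class priors, picks the hinge loss as one possible φ, and uses invented names throughout, so treat it as a sketch rather than the authors' implementation.

```python
import numpy as np

def hinge(margin):
    # one possible convex surrogate phi(m) = max(0, 1 - m)
    return np.maximum(0.0, 1.0 - margin)

def surrogate_utility_parts(w, X, y, coeffs, phi=hinge):
    # returns the numerator and denominator of U_phi for a linear model f(x) = <w, x>
    (a0, b0, c0), (a1, b1, c1) = coeffs
    scores = X @ w
    pos, neg = scores[y == 1], scores[y == -1]
    pi, rho = np.mean(y == 1), np.mean(y == -1)   # empirical class priors
    num = a0 * pi * np.mean(1.0 - phi(pos)) + b0 * rho * np.mean(-phi(-neg)) + c0
    den = a1 * pi * np.mean(1.0 + phi(pos)) + b1 * rho * np.mean(phi(-neg)) + c1
    return num, den
```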
Hybrid Optimization Strategy
● Note: the numerator of U_φ(f) = (a₀𝔼_P[1 − φ(f(X))] + b₀𝔼_N[−φ(−f(X))] + c₀) / (a₁𝔼_P[1 + φ(f(X))] + b₁𝔼_N[φ(−f(X))] + c₁) can be negative
  ▶ U_φ isn't quasi-concave if the numerator < 0
  ▶ maximize the numerator first (concave), then maximize the fractional form (quasi-concave)
Hybrid Optimization Strategy
[figure: two-stage procedure; stage 1 maximizes the numerator, stage 2 maximizes the fraction]
● Normalized gradient for quasi-concave optimization [Hazan+ NeurIPS2015]
Hazan, E., Levy, K., & Shalev-Shwartz, S. (2015). Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems (pp. 1594-1602).
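Putting the two stages together, a rough sketch of the hybrid strategy could look like this (reusing `surrogate_utility_parts` from the previous snippet; the finite-difference gradients, step sizes, and stopping rules are simplifications made for this example, not the method's actual implementation).

```python
import numpy as np
# assumes surrogate_utility_parts (and a surrogate loss phi) from the previous sketch is in scope

def numerical_grad(fn, w, eps=1e-6):
    # finite-difference gradient, used only to keep the sketch self-contained
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (fn(w + e) - fn(w - e)) / (2 * eps)
    return g

def hybrid_maximize(w0, X, y, coeffs, phi, lr=0.1, n_outer=500, n_inner=200):
    w = w0.astype(float).copy()
    numerator = lambda v: surrogate_utility_parts(v, X, y, coeffs, phi)[0]
    utility = lambda v: np.divide(*surrogate_utility_parts(v, X, y, coeffs, phi))
    # stage 1: concave maximization of the numerator until it becomes non-negative
    for _ in range(n_inner):
        if numerator(w) >= 0:
            break
        w += lr * numerical_grad(numerator, w)
    # stage 2: quasi-concave maximization of the fraction with normalized gradient ascent
    for _ in range(n_outer):
        g = numerical_grad(utility, w)
        w += lr * g / (np.linalg.norm(g) + 1e-12)
    return w
```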
Outline
● Introduction
● Preliminary
  ▶ Convex Risk Minimization
  ▶ Plug-in Principle vs. Cost-sensitive Learning
● Key Idea
  ▶ Quasi-concave Surrogate
● Calibration Analysis & Experiments (next)