Learning Theory Bridges Loss Functions
July 13th, 2020
Han Bao (The University of Tokyo / RIKEN AIP)
Han Bao (包 含, read "tsutsumi / fukumu")
https://hermite.jp/
■ 2nd-year Ph.D. student @ Sugiyama-Honda-Yokoya Lab
■ Research Interests: robustness and knowledge transfer via loss functions; similarity learning
▶ (ICML2018) Classification from Pairwise Similarity and Unlabeled Data.
▶ (AAAI2019) Unsupervised Domain Adaptation Based on Source-guided Discrepancy.
▶ (AISTATS2020) Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification.
▶ (COLT2020) Calibrated Surrogate Losses for Adversarially Robust Classification.
▶ (MICCAI2020) Calibrated surrogate maximization of Dice.
▶ (preprint) Similarity-based Classification: Connecting Similarity Learning to Binary Classification.
[Figure: deep neural network with a softmax output layer and the cross-entropy loss; image from https://devblogs.nvidia.com/mocha-jl-deep-learning-julia/]
Prediction: feature x → Neural Network → softmax → prediction (e.g., "light traffic")
Training: feature x and label y ("light traffic") → Neural Network → softmax → cross-entropy
⇒ minimize the distance between the label and the prediction
Training: feature → Neural Network → softmax → cross-entropy −∑_i y_i log z_i ⇒ minimize
Evaluation: feature → Neural Network → softmax → misclassification rate 1[y ≠ z]
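As a minimal numeric sketch of the two criteria (the toy logits and labels are made up, not from the slides): the training loss is smooth and differentiable, while the evaluation metric is discrete.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],   # confidently correct prediction
                   [0.2, 0.1,  0.0]])  # barely correct prediction
labels = np.array([0, 0])              # true class indices

z = softmax(logits)
# training criterion: cross-entropy, smooth in the logits
cross_entropy = -np.log(z[np.arange(len(labels)), labels]).mean()
# evaluation criterion: misclassification rate, a step function of the logits
misclass_rate = (z.argmax(axis=1) != labels).mean()
```

Both predictions are correct here, so the misclassification rate is 0, yet the cross-entropy is still positive and keeps decreasing as confidence grows.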
SVM: margin maximization = hinge loss minimization
    min_{w,b} ∑_i max{0, 1 − y_i(w⊤x_i + b)}
⇒ a surrogate for minimizing the misclassification rate
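The objective above can be sketched with subgradient descent on toy data (the Gaussian clusters, learning rate, and iteration count are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, (50, 2)),    # positive cluster
               rng.normal(-1.5, 1.0, (50, 2))])   # negative cluster
y = np.hstack([np.ones(50), -np.ones(50)])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(200):
    margins = y * (X @ w + b)
    active = margins < 1                           # points with nonzero hinge loss
    # subgradient of (1/n) * sum_i max{0, 1 - y_i (w.x_i + b)}
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / len(X)
    grad_b = -y[active].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

train_error = np.mean(np.sign(X @ w + b) != y)     # evaluation: misclassification rate
```

Even though only the hinge surrogate is minimized, the resulting training misclassification rate is small on this separable data.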
Deep Learning: Neural Network classifier + softmax cross-entropy loss
SVM: classifier + hinge loss
Learning = minimizing a loss, yet the goal is a small misclassification rate. Does it work?
Background: Binary Classification
■ Input: sample {(x_i, y_i)}_{i=1}^n of feature-label pairs, x_i ∈ 𝒳 and y_i ∈ {±1}
■ Output: classifier f : 𝒳 → ℝ
▶ predict the class by sign(f(·))
▶ criterion: misclassification rate R_01(f) = 𝔼[1[Y ≠ sign(f(X))]], where 1[Y ≠ sign(f(X))] = 1 if Y ≠ sign(f(X)) and 0 if Y = sign(f(X))
Loss Function and Risk
■ Goal of classification: minimize the misclassification rate R_01(f) = 𝔼[1[Y ≠ sign(f(X))]]
■ Misclassification rate = expectation of the 0-1 loss: 1[Y ≠ sign(f(X))] = φ_01(Yf(X))
▶ φ_01(Yf(X)) = 1 (wrong) if Y ≠ sign(f(X)), 0 (correct) if Y = sign(f(X))
■ Minimize R_01 by the 0-1 loss directly?
▶ discrete function: no gradient for gradient descent
▶ 0-1 loss minimization is NP-hard [Feldman+ 2012]
Feldman, V., Guruswami, V., Raghavendra, P., & Wu, Y. (2012). Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6), 1558-1590.
Target Loss vs. Surrogate Loss
■ Target loss φ_01 (0-1 loss): the final learning criterion, but hard to optimize
▶ nonconvex, no gradient
■ Surrogate loss φ: an easily-optimizable criterion, different from the target loss
▶ usually convex, smooth
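A quick numeric check of this contrast (the loss definitions are the standard ones; the margin grid is arbitrary): the hinge loss upper-bounds the 0-1 loss pointwise, and all the surrogates are smooth or convex in the margin α = yf(x).

```python
import numpy as np

alphas = np.linspace(-2, 2, 401)               # margin grid
zero_one = (alphas <= 0).astype(float)         # target: 1[wrong sign] (tie counted as error)
hinge    = np.maximum(0.0, 1.0 - alphas)       # convex surrogate
squared  = (1.0 - alphas) ** 2                 # convex, smooth surrogate
logistic = np.log(1.0 + np.exp(-alphas))       # convex, smooth surrogate

# the hinge loss dominates the 0-1 loss everywhere
assert np.all(hinge >= zero_one)
```

Minimizing a dominating convex surrogate therefore pushes the 0-1 loss down as well, though domination alone is not the right condition; calibration (next slides) is.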
Elements of Learning Theory
▶ empirical surrogate risk: R̂_φ(f) = (1/n) ∑_{i=1}^n φ(y_i f(x_i))
▶ (population) surrogate risk: R_φ(f) = 𝔼[φ(Yf(X))]
▶ target risk: R_01(f) = 𝔼[φ_01(Yf(X))]
■ Generalization theory: if the model is not too complicated, then R̂_φ converges to R_φ (roughly speaking)
■ Calibration theory: relates the surrogate risk R_φ to the target risk R_01; key ingredient: the loss functions
What Surrogate Is Desirable?
▶ target risk R_01(f): the final learning criterion (target loss = 0-1 loss)
▶ surrogate risk R_φ(f): easily optimizable (surrogate loss φ)
Calibrated surrogate: R_φ(f_m) → R*_φ (m → ∞) ⟹ R_01(f_m) → R*_01 (m → ∞)
How to Check Risk Convergence?
Definition. A surrogate φ is calibrated for a target loss ψ if for any ε > 0, there exists δ > 0 such that for all f,
    R_φ(f) − R*_φ < δ ⟹ R_ψ(f) − R*_ψ < ε.
Idea: write δ as a function of ε (by using contraposition) [Steinwart 2007]
Definition. (calibration function)
    δ(ε) = inf_f { R_φ(f) − R*_φ : R_ψ(f) − R*_ψ ≥ ε }   (surrogate excess risk s.t. target excess risk ≥ ε)
If δ(ε) > 0 for all ε > 0, the surrogate is calibrated!
Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approximation, 26(2), 225-287.
Main Tool: Calibration Function
    δ(ε) = inf_f { R_φ(f) − R*_φ : R_ψ(f) − R*_ψ ≥ ε }
■ Provides an iff condition: ψ-calibrated ⟺ δ(ε) > 0 for all ε > 0
■ Provides an excess risk bound: ψ-calibrated ⟹ R_ψ(f) − R*_ψ ≤ (δ**)^{−1}(R_φ(f) − R*_φ)
    (δ**: biconjugate of δ, monotonically increasing)
Example: Binary Classification
Theorem [Bartlett+ 2006]. A convex surrogate φ is φ_01-calibrated iff φ is differentiable at 0 and φ′(0) < 0.
▶ squared loss: φ(α) = (1 − α)², calibration function δ(ε) = ε²
▶ hinge loss: φ(α) = [1 − α]₊, calibration function δ(ε) = ε
P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.
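These two calibration functions can be checked numerically from pointwise conditional risks C_φ(η, α) = η φ(α) + (1 − η) φ(−α), where η = ℙ(Y = +1 | X = x); this is a sketch of the definition on a finite grid (grid resolutions are arbitrary), not the paper's analytic derivation.

```python
import numpy as np

def calibration_fn(phi, eps, etas=np.linspace(0, 1, 201),
                   alphas=np.linspace(-3, 3, 1201)):
    """Numerical calibration function of surrogate phi w.r.t. the 0-1 loss."""
    best = np.inf
    for eta in etas:
        cond = eta * phi(alphas) + (1 - eta) * phi(-alphas)  # C_phi(eta, alpha)
        opt = cond.min()                                     # ~ C*_phi(eta)
        # the conditional 0-1 excess is |2*eta - 1| exactly when the sign of
        # alpha disagrees with the Bayes-optimal sign of 2*eta - 1
        if abs(2 * eta - 1) >= eps:
            wrong = alphas * (2 * eta - 1) <= 1e-12          # wrong-sign predictions
            best = min(best, cond[wrong].min() - opt)
    return best

sq = lambda a: (1 - a) ** 2
hinge = lambda a: np.maximum(0.0, 1 - a)

for eps in (0.2, 0.5, 0.8):
    print(eps, calibration_fn(sq, eps), calibration_fn(hinge, eps))
    # approximately eps**2 for the squared loss and eps for the hinge loss
```

The printed values recover δ(ε) = ε² and δ(ε) = ε up to grid resolution, matching the slide.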
Counterintuitive Result
■ e.g., multi-class classification: f : 𝒳 → ℝ³ maps a feature x to prediction scores f(x)
Crammer-Singer loss [Crammer & Singer 2001]: one of the multi-class extensions of the hinge loss,
    max{0, 1 − (prediction margin)}, where the prediction margin is the score of the correct class minus the largest other score
⇒ maximizes the prediction margin
The Crammer-Singer loss is not calibrated to the 0-1 loss! [Zhang 2004]
(a similar extension of the logistic loss is calibrated)
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec), 265-292.
Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct), 1225-1251.
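A sketch of the Crammer-Singer loss itself (the score vector below is made up): hinge applied to the prediction margin of the true class.

```python
import numpy as np

def crammer_singer_loss(scores, y):
    """max{0, 1 - (f_y(x) - max_{k != y} f_k(x))} for a score vector f(x)."""
    others = np.delete(scores, y)
    margin = scores[y] - others.max()      # prediction margin of the true class
    return max(0.0, 1.0 - margin)

scores = np.array([2.0, 0.5, -1.0])        # made-up prediction scores f(x) in R^3
loss_correct = crammer_singer_loss(scores, 0)  # margin 1.5 -> loss 0.0
loss_wrong   = crammer_singer_loss(scores, 1)  # margin -1.5 -> loss 2.5
```

Despite looking like the natural multi-class hinge, minimizing this loss can fail to minimize the multi-class 0-1 loss, which is exactly the calibration failure the slide refers to.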
Summary: Calibration Theory
■ Surrogate vs. target loss: the target loss φ_01(Yf(X)) is often hard to optimize ⇒ replace it with a surrogate loss φ
■ Calibrated surrogate: minimizing R_φ(f) leads to minimization of the target R_ψ(f), a stringent justification of the surrogate loss!
■ Binary classification: hinge and logistic losses are calibrated; a convex surrogate is calibrated iff φ′(0) < 0
■ Multi-class classification: cross-entropy is calibrated, but the CS loss (multi-class hinge) is not calibrated! (proof omitted)
When target is not 0-1 loss H. Bao and M. Sugiyama. Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification. In AISTATS , 2020.
Is Accuracy Appropriate?
■ Our focus: binary classification
▶ seemingly sensible classifier (5 positives / 5 negatives): accuracy 0.8
▶ unreasonable classifier that predicts everything negative (2 positives / 8 negatives): accuracy 0.8
Accuracy can't detect unreasonable classifiers under class imbalance!
Is Accuracy Appropriate?
■ F-measure is more appropriate under class imbalance
▶ seemingly sensible classifier: accuracy 0.8, F-measure 0.75
▶ unreasonable classifier: accuracy 0.8, F-measure 0
F₁ = 2TP / (2TP + FP + FN), where
    TP = 𝔼_{X,Y=+1}[1[f(X) > 0]],  FP = 𝔼_{X,Y=−1}[1[f(X) > 0]],
    TN = 𝔼_{X,Y=−1}[1[f(X) < 0]],  FN = 𝔼_{X,Y=+1}[1[f(X) < 0]]
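The slide's comparison can be reproduced on an assumed 2:8 positive/negative sample (the counts are chosen to match the 0.8 accuracy above): predicting "negative" for everything keeps the accuracy at 0.8 while the F-measure collapses to 0.

```python
import numpy as np

y_true = np.array([+1] * 2 + [-1] * 8)     # imbalanced labels: 2 positives, 8 negatives
y_pred = np.full(10, -1)                   # degenerate classifier: always negative

tp = np.sum((y_pred == +1) & (y_true == +1))
fp = np.sum((y_pred == +1) & (y_true == -1))
fn = np.sum((y_pred == -1) & (y_true == +1))

accuracy = np.mean(y_pred == y_true)                                  # 0.8
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0   # 0.0
```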
Training and Evaluation
■ Usual training with accuracy:
    Training: surrogate risk → (calibrated) 0-1 risk = Evaluation: 0-1 risk ⇒ compatible
■ Training with accuracy but evaluating with the F-measure:
    Training: surrogate risk → (calibrated) 0-1 risk ≠ Evaluation: F-measure ⇒ compatible???
⇒ we need a surrogate utility calibrated to the F-measure
Not Only F₁, but Many Others (π = ℙ(Y = +1))
▶ Accuracy: Acc = TP + TN
▶ Weighted Accuracy: WAcc = (w₁TP + w₂TN) / (w₁TP + w₂TN + w₃FP + w₄FN)
▶ F-measure: F₁ = 2TP / (2TP + FP + FN)
▶ Balanced Error Rate: BER = (1/2)(FN/π + FP/(1 − π))
▶ Jaccard index: Jac = TP / (TP + FP + FN)
▶ Gower-Legendre index: GL = (TP + TN) / (TP + α(FP + FN) + TN)
Q. Can we handle these in the same way?
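A sketch evaluating the listed metrics from normalized confusion-matrix entries (the entries, the weights w, and the Gower-Legendre α are all made up for illustration; with all weights equal to 1, WAcc reduces to Acc):

```python
# normalized confusion-matrix entries (fractions summing to 1), made up
tp, fp, fn, tn = 0.3, 0.1, 0.1, 0.5
pi = tp + fn                                   # class prior P(Y = +1)

acc = tp + tn                                  # Accuracy
w = (1.0, 1.0, 1.0, 1.0)                       # assumed weights
wacc = (w[0] * tp + w[1] * tn) / (w[0] * tp + w[1] * tn + w[2] * fp + w[3] * fn)
f1 = 2 * tp / (2 * tp + fp + fn)               # F-measure
ber = 0.5 * (fn / pi + fp / (1 - pi))          # Balanced Error Rate
jac = tp / (tp + fp + fn)                      # Jaccard index
alpha = 0.5                                    # assumed Gower-Legendre weight
gl = (tp + tn) / (tp + alpha * (fp + fn) + tn) # Gower-Legendre index
```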
Unification of Metrics
Note: TN = ℙ(Y = −1) − FP and FN = ℙ(Y = +1) − TP
⇒ actual metrics are linear-fractional in TP and FP, e.g.,
    F₁ = 2TP / (2TP + FP + FN),  Jac = TP / (TP + FP + FN)
    U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁)   (a_k, b_k, c_k: constants)
Unification of Metrics
■ TP and FP are expectations of the 0-1 loss of f(X):
    TP = 𝔼_{X,Y=+1}[1[f(X) > 0]]  (positive data && positive prediction)
    FP = 𝔼_{X,Y=−1}[1[f(X) > 0]]  (negative data && positive prediction)
    U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) = 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))]
⇒ a linear-fractional metric is an expectation divided by an expectation
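For example, the F-measure fits the linear-fractional form with (a₀, b₀, c₀) = (2, 0, 0) and (a₁, b₁, c₁) = (1, 1, π), since FN = ℙ(Y = +1) − TP. A sketch on synthetic data (the data-generating process and the threshold classifier are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.where(rng.random(10000) < 0.3, +1, -1)   # labels with prior pi ~ 0.3
x = y * 1.0 + rng.normal(0, 1, 10000)           # 1-D feature shifted by the label
f = lambda x: x - 0.2                           # assumed threshold classifier

tp = np.mean((f(x) > 0) & (y == +1))            # empirical E_{X,Y=+1}[1[f(X)>0]]
fp = np.mean((f(x) > 0) & (y == -1))            # empirical E_{X,Y=-1}[1[f(X)>0]]
fn = np.mean((f(x) <= 0) & (y == +1))
pi = np.mean(y == +1)

a0, b0, c0 = 2.0, 0.0, 0.0                      # F1 numerator coefficients
a1, b1, c1 = 1.0, 1.0, pi                       # F1 denominator coefficients
u = (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)
f1_direct = 2 * tp / (2 * tp + fp + fn)
# u and f1_direct agree because fn = pi - tp
```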