Learning Theory Bridges Loss Functions
July 13th, 2020
Han Bao (The University of Tokyo / RIKEN AIP)
Han Bao (包 含, read "tsutsumi / fukumu")
https://hermite.jp/
■ 2nd-year Ph.D. student @ Sugiyama-Honda-Yokoya Lab
■ Research Interests: robustness and knowledge transfer via loss functions; similarity learning
▶ (ICML2018) Classification from Pairwise Similarity and Unlabeled Data.
▶ (AAAI2019) Unsupervised Domain Adaptation Based on Source-guided Discrepancy.
▶ (AISTATS2020) Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification.
▶ (COLT2020) Calibrated Surrogate Losses for Adversarially Robust Classification.
▶ (MICCAI2020) Calibrated surrogate maximization of Dice.
▶ (preprint) Similarity-based Classification: Connecting Similarity Learning to Binary Classification.
[Figure: deep neural network with a softmax output layer and the cross-entropy loss; image from https://devblogs.nvidia.com/mocha-jl-deep-learning-julia/]
Prediction: feature x → Neural Network → softmax → prediction (e.g., "light traffic")
Training: feature x and label y ("light traffic") → Neural Network → softmax → cross-entropy
⇒ minimize the distance between the label and the prediction
Training: feature → Neural Network → softmax → cross-entropy −∑_i y_i log z_i ⇒ minimize
Evaluation: feature → Neural Network → softmax → misclassification rate 1[y ≠ z]
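As a minimal numeric sketch of the two criteria (the toy logits and labels are made up, not from the slides): the training loss is smooth and differentiable, while the evaluation metric is discrete.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],   # confidently correct prediction
                   [0.2, 0.1,  0.0]])  # barely correct prediction
labels = np.array([0, 0])              # true class indices

z = softmax(logits)
# training criterion: cross-entropy, smooth in the logits
cross_entropy = -np.log(z[np.arange(len(labels)), labels]).mean()
# evaluation criterion: misclassification rate, a step function of the logits
misclass_rate = (z.argmax(axis=1) != labels).mean()
```

Both predictions are correct here, so the misclassification rate is 0, yet the cross-entropy is still positive and keeps decreasing as confidence grows.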
SVM: margin maximization = hinge loss minimization
    min_{w,b} ∑_i max{0, 1 − y_i(w⊤x_i + b)}
⇒ a surrogate for minimizing the misclassification rate
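The objective above can be sketched with subgradient descent on toy data (the Gaussian clusters, learning rate, and iteration count are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, (50, 2)),    # positive cluster
               rng.normal(-1.5, 1.0, (50, 2))])   # negative cluster
y = np.hstack([np.ones(50), -np.ones(50)])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(200):
    margins = y * (X @ w + b)
    active = margins < 1                           # points with nonzero hinge loss
    # subgradient of (1/n) * sum_i max{0, 1 - y_i (w.x_i + b)}
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / len(X)
    grad_b = -y[active].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

train_error = np.mean(np.sign(X @ w + b) != y)     # evaluation: misclassification rate
```

Even though only the hinge surrogate is minimized, the resulting training misclassification rate is small on this separable data.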
Deep Learning: Neural Network classifier + softmax cross-entropy loss
SVM: classifier + hinge loss
Learning = minimizing a loss, yet the goal is a small misclassification rate. Does it work?
Background: Binary Classification
■ Input: sample {(x_i, y_i)}_{i=1}^n of feature-label pairs, x_i ∈ 𝒳 and y_i ∈ {±1}
■ Output: classifier f : 𝒳 → ℝ
▶ predict the class by sign(f(·))
▶ criterion: misclassification rate R_01(f) = 𝔼[1[Y ≠ sign(f(X))]], where 1[Y ≠ sign(f(X))] = 1 if Y ≠ sign(f(X)) and 0 if Y = sign(f(X))
Loss Function and Risk
■ Goal of classification: minimize the misclassification rate R_01(f) = 𝔼[1[Y ≠ sign(f(X))]]
■ Misclassification rate = expectation of the 0-1 loss: 1[Y ≠ sign(f(X))] = φ_01(Yf(X))
▶ φ_01(Yf(X)) = 1 (wrong) if Y ≠ sign(f(X)), 0 (correct) if Y = sign(f(X))
■ Minimize R_01 by the 0-1 loss directly?
▶ discrete function: no gradient for gradient descent
▶ 0-1 loss minimization is NP-hard [Feldman+ 2012]
Feldman, V., Guruswami, V., Raghavendra, P., & Wu, Y. (2012). Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6), 1558-1590.
Target Loss vs. Surrogate Loss
■ Target loss φ_01 (0-1 loss): the final learning criterion, but hard to optimize
▶ nonconvex, no gradient
■ Surrogate loss φ: an easily-optimizable criterion, different from the target loss
▶ usually convex, smooth
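A quick numeric check of this contrast (the loss definitions are the standard ones; the margin grid is arbitrary): the hinge loss upper-bounds the 0-1 loss pointwise, and all the surrogates are smooth or convex in the margin α = yf(x).

```python
import numpy as np

alphas = np.linspace(-2, 2, 401)               # margin grid
zero_one = (alphas <= 0).astype(float)         # target: 1[wrong sign] (tie counted as error)
hinge    = np.maximum(0.0, 1.0 - alphas)       # convex surrogate
squared  = (1.0 - alphas) ** 2                 # convex, smooth surrogate
logistic = np.log(1.0 + np.exp(-alphas))       # convex, smooth surrogate

# the hinge loss dominates the 0-1 loss everywhere
assert np.all(hinge >= zero_one)
```

Minimizing a dominating convex surrogate therefore pushes the 0-1 loss down as well, though domination alone is not the right condition; calibration (next slides) is.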
Elements of Learning Theory
▶ empirical surrogate risk: R̂_φ(f) = (1/n) ∑_{i=1}^n φ(y_i f(x_i))
▶ (population) surrogate risk: R_φ(f) = 𝔼[φ(Yf(X))]
▶ target risk: R_01(f) = 𝔼[φ_01(Yf(X))]
■ Generalization theory: if the model is not too complicated, then R̂_φ converges to R_φ (roughly speaking)
■ Calibration theory: relates the surrogate risk R_φ to the target risk R_01; key ingredient: the loss functions
What Surrogate Is Desirable?
▶ target risk R_01(f): the final learning criterion (target loss = 0-1 loss)
▶ surrogate risk R_φ(f): easily optimizable (surrogate loss φ)
Calibrated surrogate: R_φ(f_m) → R*_φ (m → ∞) ⟹ R_01(f_m) → R*_01 (m → ∞)
How to Check Risk Convergence?
Definition. A surrogate φ is calibrated for a target loss ψ if for any ε > 0, there exists δ > 0 such that for all f,
    R_φ(f) − R*_φ < δ ⟹ R_ψ(f) − R*_ψ < ε.
Idea: write δ as a function of ε (by using contraposition) [Steinwart 2007]
Definition. (calibration function)
    δ(ε) = inf_f { R_φ(f) − R*_φ : R_ψ(f) − R*_ψ ≥ ε }   (surrogate excess risk s.t. target excess risk ≥ ε)
If δ(ε) > 0 for all ε > 0, the surrogate is calibrated!
Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approximation, 26(2), 225-287.
Main Tool: Calibration Function
    δ(ε) = inf_f { R_φ(f) − R*_φ : R_ψ(f) − R*_ψ ≥ ε }
■ Provides an iff condition: ψ-calibrated ⟺ δ(ε) > 0 for all ε > 0
■ Provides an excess risk bound: ψ-calibrated ⟹ R_ψ(f) − R*_ψ ≤ (δ**)^{−1}(R_φ(f) − R*_φ)
    (δ**: biconjugate of δ, monotonically increasing)
Example: Binary Classification
Theorem [Bartlett+ 2006]. A convex surrogate φ is φ_01-calibrated iff φ is differentiable at 0 and φ′(0) < 0.
▶ squared loss: φ(α) = (1 − α)², calibration function δ(ε) = ε²
▶ hinge loss: φ(α) = [1 − α]₊, calibration function δ(ε) = ε
P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.
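These two calibration functions can be checked numerically from pointwise conditional risks C_φ(η, α) = η φ(α) + (1 − η) φ(−α), where η = ℙ(Y = +1 | X = x); this is a sketch of the definition on a finite grid (grid resolutions are arbitrary), not the paper's analytic derivation.

```python
import numpy as np

def calibration_fn(phi, eps, etas=np.linspace(0, 1, 201),
                   alphas=np.linspace(-3, 3, 1201)):
    """Numerical calibration function of surrogate phi w.r.t. the 0-1 loss."""
    best = np.inf
    for eta in etas:
        cond = eta * phi(alphas) + (1 - eta) * phi(-alphas)  # C_phi(eta, alpha)
        opt = cond.min()                                     # ~ C*_phi(eta)
        # the conditional 0-1 excess is |2*eta - 1| exactly when the sign of
        # alpha disagrees with the Bayes-optimal sign of 2*eta - 1
        if abs(2 * eta - 1) >= eps:
            wrong = alphas * (2 * eta - 1) <= 1e-12          # wrong-sign predictions
            best = min(best, cond[wrong].min() - opt)
    return best

sq = lambda a: (1 - a) ** 2
hinge = lambda a: np.maximum(0.0, 1 - a)

for eps in (0.2, 0.5, 0.8):
    print(eps, calibration_fn(sq, eps), calibration_fn(hinge, eps))
    # approximately eps**2 for the squared loss and eps for the hinge loss
```

The printed values recover δ(ε) = ε² and δ(ε) = ε up to grid resolution, matching the slide.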
Counterintuitive Result
■ e.g., multi-class classification: f : 𝒳 → ℝ³ maps a feature x to prediction scores f(x)
Crammer-Singer loss [Crammer & Singer 2001]: one of the multi-class extensions of the hinge loss,
    max{0, 1 − (prediction margin)}, where the prediction margin is the score of the correct class minus the largest other score
⇒ maximizes the prediction margin
The Crammer-Singer loss is not calibrated to the 0-1 loss! [Zhang 2004]
(a similar extension of the logistic loss is calibrated)
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec), 265-292.
Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct), 1225-1251.
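A sketch of the Crammer-Singer loss itself (the score vector below is made up): hinge applied to the prediction margin of the true class.

```python
import numpy as np

def crammer_singer_loss(scores, y):
    """max{0, 1 - (f_y(x) - max_{k != y} f_k(x))} for a score vector f(x)."""
    others = np.delete(scores, y)
    margin = scores[y] - others.max()      # prediction margin of the true class
    return max(0.0, 1.0 - margin)

scores = np.array([2.0, 0.5, -1.0])        # made-up prediction scores f(x) in R^3
loss_correct = crammer_singer_loss(scores, 0)  # margin 1.5 -> loss 0.0
loss_wrong   = crammer_singer_loss(scores, 1)  # margin -1.5 -> loss 2.5
```

Despite looking like the natural multi-class hinge, minimizing this loss can fail to minimize the multi-class 0-1 loss, which is exactly the calibration failure the slide refers to.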
Summary: Calibration Theory
■ Surrogate vs. target loss: the target loss φ_01(Yf(X)) is often hard to optimize ⇒ replace it with a surrogate loss φ
■ Calibrated surrogate: minimizing R_φ(f) leads to minimization of the target R_ψ(f), a stringent justification of the surrogate loss!
■ Binary classification: hinge and logistic losses are calibrated; a convex surrogate is calibrated iff φ′(0) < 0
■ Multi-class classification: cross-entropy is calibrated, but the CS loss (multi-class hinge) is not calibrated! (proof omitted)
When target is not 0-1 loss H. Bao and M. Sugiyama. Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification. In AISTATS , 2020.
Is Accuracy Appropriate?
■ Our focus: binary classification
▶ seemingly sensible classifier (5 positives / 5 negatives): accuracy 0.8
▶ unreasonable classifier that predicts everything negative (2 positives / 8 negatives): accuracy 0.8
Accuracy can't detect unreasonable classifiers under class imbalance!
Is Accuracy Appropriate?
■ F-measure is more appropriate under class imbalance
▶ seemingly sensible classifier: accuracy 0.8, F-measure 0.75
▶ unreasonable classifier: accuracy 0.8, F-measure 0
F₁ = 2TP / (2TP + FP + FN), where
    TP = 𝔼_{X,Y=+1}[1[f(X) > 0]],  FP = 𝔼_{X,Y=−1}[1[f(X) > 0]],
    TN = 𝔼_{X,Y=−1}[1[f(X) < 0]],  FN = 𝔼_{X,Y=+1}[1[f(X) < 0]]
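The slide's comparison can be reproduced on an assumed 2:8 positive/negative sample (the counts are chosen to match the 0.8 accuracy above): predicting "negative" for everything keeps the accuracy at 0.8 while the F-measure collapses to 0.

```python
import numpy as np

y_true = np.array([+1] * 2 + [-1] * 8)     # imbalanced labels: 2 positives, 8 negatives
y_pred = np.full(10, -1)                   # degenerate classifier: always negative

tp = np.sum((y_pred == +1) & (y_true == +1))
fp = np.sum((y_pred == +1) & (y_true == -1))
fn = np.sum((y_pred == -1) & (y_true == +1))

accuracy = np.mean(y_pred == y_true)                                  # 0.8
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0   # 0.0
```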
Training and Evaluation
■ Usual training with accuracy:
    Training: surrogate risk → (calibrated) 0-1 risk = Evaluation: 0-1 risk ⇒ compatible
■ Training with accuracy but evaluating with the F-measure:
    Training: surrogate risk → (calibrated) 0-1 risk ≠ Evaluation: F-measure ⇒ compatible???
⇒ we need a surrogate utility calibrated to the F-measure
Not Only F₁, but Many Others (π = ℙ(Y = +1))
▶ Accuracy: Acc = TP + TN
▶ Weighted Accuracy: WAcc = (w₁TP + w₂TN) / (w₁TP + w₂TN + w₃FP + w₄FN)
▶ F-measure: F₁ = 2TP / (2TP + FP + FN)
▶ Balanced Error Rate: BER = (1/2)(FN/π + FP/(1 − π))
▶ Jaccard index: Jac = TP / (TP + FP + FN)
▶ Gower-Legendre index: GL = (TP + TN) / (TP + α(FP + FN) + TN)
Q. Can we handle these in the same way?
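A sketch evaluating the listed metrics from normalized confusion-matrix entries (the entries, the weights w, and the Gower-Legendre α are all made up for illustration; with all weights equal to 1, WAcc reduces to Acc):

```python
# normalized confusion-matrix entries (fractions summing to 1), made up
tp, fp, fn, tn = 0.3, 0.1, 0.1, 0.5
pi = tp + fn                                   # class prior P(Y = +1)

acc = tp + tn                                  # Accuracy
w = (1.0, 1.0, 1.0, 1.0)                       # assumed weights
wacc = (w[0] * tp + w[1] * tn) / (w[0] * tp + w[1] * tn + w[2] * fp + w[3] * fn)
f1 = 2 * tp / (2 * tp + fp + fn)               # F-measure
ber = 0.5 * (fn / pi + fp / (1 - pi))          # Balanced Error Rate
jac = tp / (tp + fp + fn)                      # Jaccard index
alpha = 0.5                                    # assumed Gower-Legendre weight
gl = (tp + tn) / (tp + alpha * (fp + fn) + tn) # Gower-Legendre index
```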
Unification of Metrics
Note: TN = ℙ(Y = −1) − FP and FN = ℙ(Y = +1) − TP
⇒ actual metrics are linear-fractional in TP and FP, e.g.,
    F₁ = 2TP / (2TP + FP + FN),  Jac = TP / (TP + FP + FN)
    U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁)   (a_k, b_k, c_k: constants)
Unification of Metrics
■ TP and FP are expectations of the 0-1 loss of f(X):
    TP = 𝔼_{X,Y=+1}[1[f(X) > 0]]  (positive data && positive prediction)
    FP = 𝔼_{X,Y=−1}[1[f(X) > 0]]  (negative data && positive prediction)
    U(f) = (a₀TP + b₀FP + c₀) / (a₁TP + b₁FP + c₁) = 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))]
⇒ a linear-fractional metric is an expectation divided by an expectation
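For example, the F-measure fits the linear-fractional form with (a₀, b₀, c₀) = (2, 0, 0) and (a₁, b₁, c₁) = (1, 1, π), since FN = ℙ(Y = +1) − TP. A sketch on synthetic data (the data-generating process and the threshold classifier are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.where(rng.random(10000) < 0.3, +1, -1)   # labels with prior pi ~ 0.3
x = y * 1.0 + rng.normal(0, 1, 10000)           # 1-D feature shifted by the label
f = lambda x: x - 0.2                           # assumed threshold classifier

tp = np.mean((f(x) > 0) & (y == +1))            # empirical E_{X,Y=+1}[1[f(X)>0]]
fp = np.mean((f(x) > 0) & (y == -1))            # empirical E_{X,Y=-1}[1[f(X)>0]]
fn = np.mean((f(x) <= 0) & (y == +1))
pi = np.mean(y == +1)

a0, b0, c0 = 2.0, 0.0, 0.0                      # F1 numerator coefficients
a1, b1, c1 = 1.0, 1.0, pi                       # F1 denominator coefficients
u = (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)
f1_direct = 2 * tp / (2 * tp + fp + fn)
# u and f1_direct agree because fn = pi - tp
```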