

1. Information, Divergence and Risk for Binary Classification
Mark Reid* [mark.reid@anu.edu.au]
Research School of Information Science and Engineering, The Australian National University, Canberra, ACT, Australia
MLSS.cc Machine Learning Summer School, Thursday, 29th January 2009
*Joint work with Robert Williamson

  2. Introduction

3. The Blind Men & The Elephant
[Figure: the blind men and the elephant, labelled f-Divergence, Statistical Information, Bregman Divergence, AUC, Cost Curves.]

4. Overview
Convex function representations
• Integral (Taylor's theorem)
• Variational (LF Dual)
Binary Experiments
• Distinguishing between two probability distributions or classes
Classification Problems
• Distinguishing between two distributions, for each instance
Measures of Divergence
• Csiszár and Bregman divergences
• Loss, Risk and Regret
• Statistical Information
Representations
• Loss and Divergence
Bounds and Applications
• Reductions
• Loss and Pinsker Bounds

5. What's in it for me?
What to expect
• Lots of definitions
• Various points of view on the same concepts
• Relationships between those concepts
• An emphasis on problems over techniques
What not to expect
• Algorithms
• Models
• Sample complexity analysis
  ‣ Everything is idealised, i.e., assuming complete data
• Technicalities

  6. Part I : Convexity and Binary Experiments

7. Overview
Convex Functions
• Definitions & Properties
• Fenchel & Csiszár Duals
• Taylor Expansion
• The Jensen Gap
Binary Experiments
• Definitions & Examples
• Statistics
• Neyman-Pearson Lemma
• Bregman & f-Divergence
Class Probability Estimation
• Generative/Discriminative Views
• Loss, Risk, Regret
• Savage's Theorem
• Statistical Information
• Bregman Information and Divergence

  8. Convex Functions and their Representations

9. Convex Sets
• Given points $x_1, \ldots, x_n$ and weights $\lambda_1, \ldots, \lambda_n \ge 0$ such that $\sum_{i=1}^n \lambda_i = 1$, their convex combination is $\sum_{i=1}^n \lambda_i x_i$
• We say $S \subseteq \mathbb{R}^d$ is a convex set if it is closed under convex combination. That is, for any $n$, any $x_1, \ldots, x_n \subset S$ and weights $\lambda_1, \ldots, \lambda_n \ge 0$ summing to 1, $\sum_{i=1}^n \lambda_i x_i \in S$
• It suffices to show that for all $x_1, x_2 \in S$ and $\lambda \in [0, 1]$, $\lambda x_1 + (1 - \lambda) x_2 \in S$
[Figure: an example of a convex set and of a non-convex set.]
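To make the convex-combination formula concrete, here is a minimal Python sketch (the triangle and weights are my own illustrative choices, not from the slides) forming a random convex combination of points and checking that it stays inside their convex hull:

```python
import numpy as np

rng = np.random.default_rng(0)

# Points x_1, ..., x_n in R^2 and non-negative weights summing to 1.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
weights = rng.dirichlet(np.ones(len(points)))   # lambda_i >= 0, sum_i lambda_i = 1

combo = weights @ points                        # sum_i lambda_i x_i
print(weights, combo)

# The triangle above is convex, so every such combination stays inside it:
# its coordinates are non-negative and sum to at most 1.
assert np.all(combo >= 0) and combo.sum() <= 1 + 1e-12
```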

10. Convex Functions
• The epigraph of a function $f$ is the set of points that lie on or above its graph: $\mathrm{epi}(f) := \{(x, y) : x \in \mathbb{R}^d,\ y \ge f(x)\}$
• A function is convex if its epigraph is a convex set
  ‣ Lines interpolating any two points on its graph lie on or above the graph
  ‣ A convex function is necessarily continuous
  ‣ A point-wise sum of convex functions is convex
[Figure: the epigraph $\mathrm{epi}(f)$ of a convex function, with a chord between $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lying above the graph.]
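A small sketch (the test functions are my own examples) of the chord condition stated above: sample pairs of points and verify that the interpolating line lies on or above the graph.

```python
import numpy as np

rng = np.random.default_rng(1)

def chord_above_graph(f, lo=-3.0, hi=3.0, n_trials=5000):
    """Empirically test f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)."""
    for _ in range(n_trials):
        x1, x2, lam = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.uniform()
        lhs = f(lam * x1 + (1 - lam) * x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)
        if lhs > rhs + 1e-9:
            return False
    return True

print(chord_above_graph(lambda t: t ** 2))  # True: convex
print(chord_above_graph(np.sin))            # False: a violating pair is found
```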

11. The Legendre-Fenchel Transform
• The LF transform generalises the notion of a derivative to non-differentiable functions:
  $f^*(t^*) = \sup_{t \in \mathbb{R}^d} \{ \langle t, t^* \rangle - f(t) \}$
• When $f$ is differentiable at $t$,
  $f^*(t^*) = t^* \cdot t - f\big((f')^{-1}(t^*)\big)$
• The double LF transform
  $f^{**}(t) = \sup_{t^* \in \mathbb{R}^d} \{ \langle t^*, t \rangle - f^*(t^*) \}$
  is involutive for convex $f$. That is, $f^{**}(t) = f(t)$
[Figure: $f(t)$ and its conjugate $f^*(t^*)$; a supporting line of slope $t^*$ to $f$ corresponds to a supporting line of slope $t$ to $f^*$.]
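A numerical sketch of the LF transform (the grid-based computation and the choice $f(t) = t^2$, whose conjugate is $f^*(s) = s^2/4$, are my own illustration): it checks the closed-form conjugate and that the double transform recovers $f$.

```python
import numpy as np

ts = np.linspace(-5, 5, 1001)       # grid for t
ss = np.linspace(-12, 12, 1201)     # grid for t* (wide enough to contain f'(t))

f = ts ** 2                         # f(t) = t^2, with conjugate f*(s) = s^2 / 4

# f*(s) = sup_t { s*t - f(t) }, taken over the grid of t values
conj = np.max(ss[:, None] * ts[None, :] - f[None, :], axis=1)
inner = np.abs(ss) <= 9             # stay away from grid-boundary artefacts
print(np.max(np.abs(conj[inner] - ss[inner] ** 2 / 4)))   # ~0 (grid error only)

# f**(t) = sup_s { t*s - f*(s) } recovers f for convex f
biconj = np.max(ts[:, None] * ss[None, :] - conj[None, :], axis=1)
print(np.max(np.abs(biconj - f)))                          # ~0
```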

12. Taylor's Theorem
Integral Form of Taylor Expansion
• Let $[t_0, t]$ be an interval on which $f$ is twice differentiable. Then
  $f(t) = f(t_0) + (t - t_0) f'(t_0) + \int_{t_0}^{t} (t - s) f''(s)\, ds$
Corollary
• Let $f$ be twice differentiable on $[a, b]$. Then, for all $t$ in $[a, b]$,
  $f(t) = f(t_0) + (t - t_0) f'(t_0) + \int_{a}^{b} g(t, s) f''(s)\, ds$
  where $g(t, s) = (t - s)_+$ for $s \ge t_0$ and $g(t, s) = (s - t)_+$ for $s < t_0$
• The differentiability requirement can be removed if $f'$ and $f''$ are interpreted distributionally
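A quick numerical check of the integral form (the choice $f(t) = t \log t$ and the interval endpoints are my own illustration, not from the slides):

```python
import numpy as np
from scipy.integrate import quad

f   = lambda t: t * np.log(t)          # convex on (0, inf)
df  = lambda t: np.log(t) + 1.0
d2f = lambda t: 1.0 / t

t0, t = 0.5, 2.0

# f(t) = f(t0) + (t - t0) f'(t0) + int_{t0}^{t} (t - s) f''(s) ds
remainder, _ = quad(lambda s: (t - s) * d2f(s), t0, t)
print(f(t0) + (t - t0) * df(t0) + remainder, f(t))   # both ~ 1.3863

# Corollary form with g(t, s) = (t - s)_+ for s >= t0 and (s - t)_+ for s < t0
a, b = 0.1, 5.0
g = lambda s: max(t - s, 0.0) if s >= t0 else max(s - t, 0.0)
rem2, _ = quad(lambda s: g(s) * d2f(s), a, b, points=[t0, t])
print(f(t0) + (t - t0) * df(t0) + rem2, f(t))        # same value again
```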

13. Bregman Divergence
• A Bregman divergence is a general class of “distance” measures defined using a convex function $f$:
  $B_f(t, t_0) := f(t) - f(t_0) - \langle t - t_0, \nabla f(t_0) \rangle$
• In the 1-d case, $B_f(t, t_0)$ is the non-linear part of the Taylor expansion of $f$:
  $B_f(t, t_0) = \int_{t_0}^{t} (t - s) f''(s)\, ds$
[Figure: $B_f(t, t_0)$ for $f(t) = t \log(t)$, shown as the gap at $t$ between $f$ and its tangent at $t_0$.]
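A sketch (again with the illustrative choice $f(t) = t \log t$) computing $B_f$ in two ways, from the defining formula and as the Taylor remainder, which should agree:

```python
import numpy as np
from scipy.integrate import quad

f   = lambda t: t * np.log(t)
df  = lambda t: np.log(t) + 1.0
d2f = lambda t: 1.0 / t

def bregman(t, t0):
    """B_f(t, t0) = f(t) - f(t0) - (t - t0) f'(t0)."""
    return f(t) - f(t0) - (t - t0) * df(t0)

t, t0 = 2.0, 0.5
print(bregman(t, t0))

# The same quantity as the non-linear part of the Taylor expansion
remainder, _ = quad(lambda s: (t - s) * d2f(s), t0, t)
print(remainder)                       # matches bregman(t, t0)
```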

14. Jensen's Inequality
Jensen Gap
• For convex $f : \mathbb{R} \to \mathbb{R}$ and distribution $P$, define $J_P[f(x)] := E_P[f(x)] - f(E_P[x])$
Jensen's Inequality
• The Jensen gap is non-negative for all $P$ if and only if $f$ is convex
Affine Invariance
• For all values $a$, $b$: $J_P[f(x) + bx + a] = J_P[f(x)]$
Taylor Expansion
• $J_P[f(x)] = J_P\!\left[\int_a^b g_{x_0}(x, s) f''(s)\, ds\right] = \int_a^b J_P[g_{x_0}(x, s)]\, f''(s)\, ds$
[Figure: $E_P[f(x)]$ versus $f(E_P[x])$ for a distribution on points $x_1, \ldots, x_4$; the gap between them is $J_P[f(x)]$.]
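A sketch estimating the Jensen gap from samples and checking the affine-invariance property (the exponential distribution and the quadratic $f$ are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=200_000)      # samples from a distribution P

def jensen_gap(f, x):
    """J_P[f(x)] = E_P[f(x)] - f(E_P[x]), estimated from samples."""
    return np.mean(f(x)) - f(np.mean(x))

f = lambda t: t ** 2
print(jensen_gap(f, x))                           # ~ Var(x) = 1, non-negative

# Affine invariance: adding b*x + a does not change the gap
a, b = 3.0, -7.0
print(jensen_gap(lambda t: f(t) + b * t + a, x))  # identical value

# A non-convex f can give a negative gap
print(jensen_gap(lambda t: -t ** 2, x))           # ~ -1
```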

15. Representations of Convex Functions
Integral Representation (via Taylor's Theorem)
• $f(t) = \Lambda_f(t) + \int_a^b g(t, s) f''(s)\, ds$
  where $\Lambda_f(t) = f(t_0) + f'(t_0)(t - t_0)$ and $g(t, s) = (t - s)_+$ for $s \ge t_0$, $(s - t)_+$ for $s < t_0$
Variational Representation (via Fenchel Dual)
• $f(t) = \sup_{t^* \in \mathbb{R}} \{ t \cdot t^* - f^*(t^*) \}$
  where $f^*(t^*) = \sup_{t \in \mathbb{R}} \{ t \cdot t^* - f(t) \}$

  16. Binary Experiments and Measures of Divergence

17. Binary Experiments
• A binary experiment is a pair of distributions $(P, Q)$ over the same space $X$
• We will think of $P$ as the positive and $Q$ as the negative distribution
• Given samples from $X$, how can we tell if they came from $P$ or $Q$?
  ‣ Hypothesis testing
• The “further apart” $P$ and $Q$ are, the easier this will be
  ‣ How do we define distance for distributions?
[Figures: a discrete space, with $P$ and $Q$ given by probabilities over points $a$, $b$, $c$; a continuous space, with densities $dP$ and $dQ$ over $X$.]
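A minimal sketch of a binary experiment over a three-point space (the probabilities are illustrative choices of mine, not read off the slide's figure):

```python
import numpy as np

# A binary experiment: a pair of distributions (P, Q) over the same space X.
X = np.array(["a", "b", "c"])
P = np.array([0.7, 0.2, 0.1])   # the "positive" distribution
Q = np.array([0.2, 0.3, 0.5])   # the "negative" distribution

rng = np.random.default_rng(3)

# Draw a point and ask: did it come from P or Q?  The likelihood ratio
# dP/dQ(x) summarises how much more typical x is under P than under Q.
x = rng.choice(len(X), p=P)
print(X[x], P[x] / Q[x])
```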

18. Test Statistics
• We would like our distances not to depend on the topology of the underlying space
• A test statistic $\tau$ maps each point in $X$ to a point on the real line
  ‣ Usually a function of the distributions
• A statistical test can be obtained by thresholding a test statistic: $r(x) = [\![\tau(x) \ge \tau_0]\!]$
• Each threshold $\tau_0$ partitions the space into positive and negative parts
[Figure: $\tau$ maps $X$ to $\mathbb{R}$; the threshold $\tau_0$ splits $X$ into a positive (+) and a negative (−) region.]
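A sketch of a test statistic and its thresholded test (the Gaussian experiment, the likelihood-ratio statistic, and the threshold are my own illustrative choices):

```python
import numpy as np
from scipy.stats import norm

# A continuous binary experiment: P = N(1, 1), Q = N(0, 1)   (illustrative)
P, Q = norm(loc=1.0), norm(loc=0.0)

# A test statistic: here the likelihood ratio tau(x) = dP/dQ(x), a function
# of the two distributions mapping each x to a real number.
tau = lambda x: P.pdf(x) / Q.pdf(x)

# Thresholding tau gives the statistical test r(x) = [[ tau(x) >= tau_0 ]],
# which partitions X into a positive and a negative region.
tau_0 = 1.0
r = lambda x: tau(x) >= tau_0

xs = np.array([-2.0, 0.0, 0.5, 2.0])
print(tau(xs))
print(r(xs))     # [False False  True  True]
```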

19. Statistical Power and Size
• True Positive Rate: $P(\tau \ge \tau_0)$
• False Positive Rate: $Q(\tau \ge \tau_0)$
• True Negative Rate: $Q(\tau < \tau_0)$
• False Negative Rate: $P(\tau < \tau_0)$
Power
• Power = True Positive Rate = $P(\tau \ge \tau_0) = 1 - \beta$
Size
• Size = False Positive Rate = $Q(\tau \ge \tau_0) = \alpha$
Contingency Table (Predicted Class vs. Actual Class)
• Predicted +, actual +: True Positives (TP); predicted +, actual −: False Positives (FP); predicted −, actual +: False Negatives (FN); predicted −, actual −: True Negatives (TN)
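A sketch estimating power and size by thresholding a statistic on samples from $P$ and $Q$ (the Gaussian experiment and the threshold are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
pos = rng.normal(loc=1.0, size=n)      # samples from P (positives)
neg = rng.normal(loc=0.0, size=n)      # samples from Q (negatives)

tau = lambda x: x                      # any monotone statistic; identity suffices here
tau_0 = 0.5

tp_rate = np.mean(tau(pos) >= tau_0)   # power = P(tau >= tau_0) = 1 - beta
fp_rate = np.mean(tau(neg) >= tau_0)   # size  = Q(tau >= tau_0) = alpha
tn_rate = np.mean(tau(neg) < tau_0)    # true negative rate
fn_rate = np.mean(tau(pos) < tau_0)    # false negative rate

print(tp_rate, fp_rate)    # ~0.69, ~0.31 for this experiment and threshold
print(tn_rate, fn_rate)    # complements of the rates above
```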

20. The Neyman-Pearson Lemma
Likelihood Ratio
• $\tau^*(x) = \dfrac{dP}{dQ}(x)$
Neyman-Pearson Lemma (1933)
• The likelihood ratio $\tau^*$ is the uniformly most powerful (UMP) statistical test
  ‣ It always has the largest TP rate for any given FP rate
[Figure: ROC curves (TP rate vs. FP rate) for $\tau^*$ and for another statistic $\tau$; the curve for $\tau^*$ lies above.]
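A sketch illustrating the lemma on a toy experiment of my own choosing: the ROC curve of the likelihood-ratio statistic dominates that of a weaker statistic at every false positive rate.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
pos = rng.normal(loc=[1.0, 1.0], size=(n, 2))   # P = N((1,1), I)
neg = rng.normal(loc=[0.0, 0.0], size=(n, 2))   # Q = N((0,0), I)

def roc(stat_pos, stat_neg, fp_grid):
    """True positive rate achieved at each false positive rate in fp_grid."""
    thresholds = np.quantile(stat_neg, 1.0 - fp_grid)   # FP rate = Q(tau >= tau_0)
    return np.array([np.mean(stat_pos >= t) for t in thresholds])

fp_grid = np.linspace(0.05, 0.95, 19)

# Likelihood-ratio statistic: for these Gaussians it is monotone in x1 + x2
lr_tp  = roc(pos.sum(axis=1), neg.sum(axis=1), fp_grid)
# A weaker statistic: look at the first coordinate only
alt_tp = roc(pos[:, 0], neg[:, 0], fp_grid)

print(np.all(lr_tp >= alt_tp - 1e-3))   # True: the LR test dominates
print(lr_tp[:5])
print(alt_tp[:5])
```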

21. Csiszár f-Divergence
• The f-divergence of $P$ from $Q$ is the $Q$-average of the likelihood ratio transformed by the function $f$:
  $I_f(P, Q) = E_Q\!\left[ f\!\left( \dfrac{dP}{dQ} \right) \right] = \int_X f\!\left( \dfrac{dP}{dQ} \right) dQ$
  ‣ $f$ can be seen as a penalty for $dP(x) \ne dQ(x)$
• To be a divergence, we want
  ‣ $I_f(P, Q) \ge 0$ for all $P, Q$
  ‣ $I_f(Q, Q) = f(1) = 0$ for all $Q$
• Applying Jensen's inequality requires
  ‣ $f$ convex
  ‣ $f(1) = 0$
  giving $I_f(P, Q) = E_Q[f(dP/dQ)] \ge f(E_Q[dP/dQ]) = f(1) = 0$; equivalently, $I_f(P, Q) = J_Q[f(dP/dQ)] \ge 0$ (a “Jensen gap”)
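A sketch computing $I_f(P, Q)$ for discrete distributions (the distributions are illustrative), checking non-negativity and that $I_f(Q, Q) = f(1) = 0$:

```python
import numpy as np

def f_divergence(f, P, Q):
    """I_f(P, Q) = E_Q[ f(dP/dQ) ] for discrete distributions with Q > 0."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(Q * f(P / Q))

# The KL divergence corresponds to f(t) = t*log(t) (convex, f(1) = 0)
f_kl = lambda t: t * np.log(t)

P = np.array([0.7, 0.2, 0.1])   # illustrative distributions over {a, b, c}
Q = np.array([0.2, 0.3, 0.5])

print(f_divergence(f_kl, P, Q))   # >= 0
print(f_divergence(f_kl, Q, Q))   # = f(1) = 0
```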

22. Properties and Examples
Symmetry
• $I_f(P, Q) = I_{f^\diamond}(Q, P)$, where $f^\diamond(t) = t f(1/t)$ is the Csiszár dual of $f$
• $I_f(P, Q) = I_f(Q, P) \iff f(t) = f^\diamond(t) + c\,(t - 1)$ for some constant $c$
Closure
• $I_{af + bg} = a I_f + b I_g$
Affine Invariance
• $I_f = I_g \iff f(t) = g(t) + bt + a$ with $a + b = 0$; that is, adding a multiple of $(t - 1)$ to $f$ leaves the divergence unchanged
Examples
• Variational: $f(t) = |t - 1|$
• KL-divergence: $f(t) = t \ln t$
• Hellinger: $f(t) = (\sqrt{t} - 1)^2$
• Pearson $\chi^2$: $f(t) = (t - 1)^2$
• Triangular: $f(t) = \dfrac{(t - 1)^2}{t + 1}$
[Figure: plots of each generator $f$ on $[0, 3]$.]
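A sketch evaluating the listed generators on illustrative discrete distributions and checking the symmetry property via the Csiszár dual $f^\diamond(t) = t f(1/t)$ (distributions and values are my own example):

```python
import numpy as np

def f_divergence(f, P, Q):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(Q * f(P / Q))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.2, 0.3, 0.5])

generators = {
    "variational": lambda t: np.abs(t - 1),
    "KL":          lambda t: t * np.log(t),
    "Hellinger":   lambda t: (np.sqrt(t) - 1) ** 2,
    "Pearson chi2": lambda t: (t - 1) ** 2,
    "triangular":  lambda t: (t - 1) ** 2 / (t + 1),
}
for name, f in generators.items():
    print(name, f_divergence(f, P, Q))

# Symmetry: I_f(P, Q) = I_{f_dual}(Q, P) with the Csiszár dual f_dual(t) = t f(1/t)
f = generators["KL"]
f_dual = lambda t: t * f(1.0 / t)
print(f_divergence(f, P, Q), f_divergence(f_dual, Q, P))   # equal
```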

23. Bregman Divergence (Generative)
• Measures the average divergence between the densities of $P$ and $Q$:
  $B_f(P, Q) := E_M[B_f(dP, dQ)] = E_M[f(dP) - f(dQ) - (dP - dQ) f'(dQ)]$
• An “additive” analogue of the f-divergence
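A sketch of the generative Bregman divergence for a finite space, taking $M$ to be the counting measure so that $dP$ and $dQ$ are just the probability mass functions (that reading of $E_M$ is my assumption for this example; the distributions are illustrative):

```python
import numpy as np

def generative_bregman(f, df, P, Q):
    """B_f(P, Q) = E_M[ f(dP) - f(dQ) - (dP - dQ) f'(dQ) ].
    M is taken to be the counting measure on a finite space (an assumption for
    this sketch), so dP and dQ are simply the probability mass functions."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(f(P) - f(Q) - (P - Q) * df(Q))

P = np.array([0.7, 0.2, 0.1])   # illustrative distributions over {a, b, c}
Q = np.array([0.2, 0.3, 0.5])

# f(t) = t^2 gives the squared Euclidean distance between the mass functions
f, df = lambda t: t ** 2, lambda t: 2 * t
print(generative_bregman(f, df, P, Q), np.sum((P - Q) ** 2))   # equal
```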

24. Bregman and f-Divergences
• What is the relationship between the classes of (generative) Bregman divergences and f-divergences?
  ‣ One is “additive”, the other “multiplicative”
• They have only the KL divergence in common [Csiszár, 1995]:
  $I_f(P, Q) = B_f(P, Q) \iff f(t) = t \log(t) - t + 1$
[Figure: Venn diagram of Bregman divergences and Csiszár f-divergences, intersecting only in the KL divergence.]
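A numerical check of the statement (on the illustrative discrete distributions used above, with $M$ again taken as the counting measure): with $f(t) = t \log t - t + 1$, the f-divergence and the generative Bregman divergence coincide and both equal the KL divergence.

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.2, 0.3, 0.5])

f  = lambda t: t * np.log(t) - t + 1     # the generator singled out by Csiszár
df = lambda t: np.log(t)

i_f = np.sum(Q * f(P / Q))                        # I_f(P, Q) = E_Q[f(dP/dQ)]
b_f = np.sum(f(P) - f(Q) - (P - Q) * df(Q))       # B_f(P, Q), counting-measure M
kl  = np.sum(P * np.log(P / Q))                   # KL divergence of P from Q

print(i_f, b_f, kl)                               # all three agree
```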

  25. Classification and Probability Estimation
