Ranking observations with latent information and binary feedback



  1. Ranking observations with latent information and binary feedback. Nicolas Vayatis, École Normale Supérieure de Cachan. Workshop on The Mathematics of Ranking at AIM, Palo Alto, August 2010

  2. Outline:
     1 Statistical Issues in Machine Learning
     2 Prediction of Preferences
     3 Other Criteria for Ranking Error

  3. Statistical issues in Machine Learning

  4. Generalization ability of decision rules
     - Class $\mathcal{G}$ of candidate decision rules
     - Risk functional $L$, the "objective" criterion
     - Past data $D_n$ with sample size $n$
     - Method/algorithm outputs an empirical estimate $\hat{g}_n \in \mathcal{G}$
     - Main questions:
       (i) Strong Bayes-risk consistency: $L(\hat{g}_n) \to L^* = \inf_g L(g)$ a.s. as $n \to \infty$?
       (ii) Rate of this convergence?

  5. An example - Binary classification with i.i.d. data
     - Data $D_n = \{(X_i, Y_i) : i = 1, \ldots, n\}$, i.i.d. copies of $(X, Y) \in \mathcal{X} \times \{-1, +1\}$
     - Empirical Risk Minimization principle:
       $\hat{L}_n(g) := \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{g(X_i) \neq Y_i\}, \quad \hat{g}_n = \arg\min_{g \in \mathcal{G}} \hat{L}_n(g)$
     - First-order analysis: with probability at least $1 - \delta$,
       $L(\hat{g}_n) - \inf_{g \in \mathcal{G}} L(g) \leq 2\, \mathbb{E} \sup_{g \in \mathcal{G}} |\hat{L}_n(g) - L(g)| + c \sqrt{\frac{\log(1/\delta)}{n}}$
     - Tools: empirical processes techniques, concentration inequalities
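
A minimal numerical sketch of the ERM principle above, not taken from the slides: the data-generating model and the finite class of threshold ("stump") rules are illustrative assumptions chosen so that the empirical 0-1 risk can be minimized by a grid search.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: X uniform on [0, 1], Y = +1 with probability eta(x) = x (illustrative choice).
n = 500
X = rng.uniform(0.0, 1.0, size=n)
Y = np.where(rng.uniform(size=n) < X, 1, -1)

# Finite class G of stump rules g_t(x) = sign(x - t) over a grid of thresholds.
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """Empirical 0-1 risk L_n(g_t) = (1/n) * sum_i 1{g_t(X_i) != Y_i}."""
    pred = np.where(X >= t, 1, -1)
    return np.mean(pred != Y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]          # empirical risk minimizer within the class
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
# For eta(x) = x the Bayes rule thresholds at 1/2, so t_hat should be close to 0.5.
```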

  6. Complexity Control
     - Vapnik-Chervonenkis inequality:
       $\mathbb{E} \sup_{g \in \mathcal{G}} |\hat{L}_n(g) - L(g)| \leq c \sqrt{\frac{V}{n}}$
       where $V$ is the VC dimension of the class $\mathcal{G}$
     - Rademacher average:
       $R_n(\mathcal{G}) = \frac{1}{n}\, \mathbb{E} \sup_{g \in \mathcal{G}} \left| \sum_{i=1}^n \epsilon_i\, \mathbb{I}\{Y_i \neq g(X_i)\} \right|$
       where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. sign variables
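
A short Monte Carlo sketch of the Rademacher average for the same illustrative stump class as above. The expectation over the sign variables is approximated by simulation, conditionally on the sample; this choice, like the toy data, is an assumption of the sketch rather than anything prescribed on the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sample and the stump class g_t(x) = sign(x - t) over a grid of thresholds.
n = 500
X = rng.uniform(0.0, 1.0, size=n)
Y = np.where(rng.uniform(size=n) < X, 1, -1)
thresholds = np.linspace(0.0, 1.0, 101)
# Loss matrix: rows = rules g_t, columns = observations, entries 1{Y_i != g_t(X_i)}.
losses = np.array([(np.where(X >= t, 1, -1) != Y).astype(float) for t in thresholds])

def rademacher_average(losses, n_mc=200):
    """Monte Carlo estimate of R_n(G) = (1/n) E sup_g |sum_i eps_i 1{Y_i != g(X_i)}|."""
    n = losses.shape[1]
    vals = []
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)        # i.i.d. sign variables
        vals.append(np.abs(losses @ eps).max() / n)  # sup over the finite class
    return np.mean(vals)

print(f"Estimated Rademacher average: {rademacher_average(losses):.4f}")
# For a VC class this is of order sqrt(V / n), consistent with the VC inequality above.
```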

  7. Variance control
     - Second-order analysis: Talagrand's inequality
       $\sup_{f \in \mathcal{F}} \left( P(f) - \hat{P}_n(f) \right) \leq 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \left( P(f) - \hat{P}_n(f) \right) + \sqrt{\frac{2 \left( \sup_{f \in \mathcal{F}} \mathrm{Var}(f) \right) \log(1/\delta)}{n}} + c\, \frac{\log(1/\delta)}{n}$
     - Variance control assumption: $\mathrm{Var}(f) \leq C \left( L(g) - L^* \right)^\alpha$ for all $g$, with $\alpha \in (0, 1]$
     - Fast rates of convergence: excess risk in $n^{-1/(2-\alpha)}$

  8. Prediction of Preferences
     Joint work with Stéphan Clémençon (Telecom ParisTech) and Gábor Lugosi (Pompeu Fabra)

  9. Setup
     - $(X, Y)$ random pair with unknown distribution $P$ over $\mathcal{X} \times \mathbb{R}$
     - $(X, Y)$, $(X', Y')$ i.i.d., and $Y$, $Y'$ may not be observed
     - Preference label $R = R(Y, Y') \in \mathbb{R}$, with $R(Y, Y') = -R(Y', Y)$; $R > 0$ means "$X$ is better than $X'$"
     - Decision rule: $r : \mathcal{X} \times \mathcal{X} \to \{-1, 0, 1\}$
     - Prediction error = classification error with pairs of observations: $L(r) = P\{ R \cdot r(X, X') < 0 \}$
     - Same as before?

  10. Empirical Ranking Risk Minimization
     - Latent data $D_n = \{(X_i, Y_i) : i = 1, \ldots, n\}$ i.i.d.
     - Observed data: $\{(X_i, X_j, R_{i,j}) : i, j = 1, \ldots, n\}$, with $R_{i,j} = R(Y_i, Y_j)$
     - Empirical criterion for ranking:
       $L_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} \mathbb{I}\{ R_{i,j} \cdot r(X_i, X_j) < 0 \}$
     - General definition of a U-statistic (fixed $f$):
       $U_n(f) = \frac{1}{n(n-1)} \sum_{i \neq j} f(Z_i, Z_j)$, where $Z_1, \ldots, Z_n$ are i.i.d.
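
A hedged sketch of the empirical ranking risk as a U-statistic over pairs. The latent regression model, the sign-based preference labels, and the scoring rule below are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Latent data: Y = X + noise; Y itself is never used below, only the pairwise labels R_ij.
n = 200
X = rng.normal(size=n)
Y = X + 0.5 * rng.normal(size=n)

# Observed pairwise preference labels R_ij = R(Y_i, Y_j) = sign(Y_i - Y_j).
R = np.sign(Y[:, None] - Y[None, :])

def ranking_risk(r_values, R):
    """Empirical ranking risk L_n(r) = (1/(n(n-1))) * sum_{i != j} 1{R_ij * r(X_i, X_j) < 0}.
    r_values is the n x n matrix of values r(X_i, X_j)."""
    n = R.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    errors = (R * r_values < 0) & off_diag
    return errors.sum() / (n * (n - 1))

# Ranking rule induced by the scoring function s(x) = x: r(x, x') = sign(x - x').
r_values = np.sign(X[:, None] - X[None, :])
print(f"Empirical ranking risk: {ranking_risk(r_values, R):.3f}")
```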

  11. Structure of U-Statistics - First representation
     - Assume $f$ symmetric. Average of "sums-of-i.i.d." blocks:
       $U_n(f) = \frac{1}{n!} \sum_{\pi} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} f\left( Z_{\pi(i)}, Z_{\pi(\lfloor n/2 \rfloor + i)} \right)$
       where $\pi$ ranges over the permutations of $\{1, \ldots, n\}$
     - Lemma. Let $\psi$ be convex increasing and $\mathcal{F}$ a class of functions. Then:
       $\mathbb{E}\, \psi\left( \sup_{f \in \mathcal{F}} U_n(f) \right) \leq \mathbb{E}\, \psi\left( \sup_{f \in \mathcal{F}} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} f\left( Z_i, Z_{\lfloor n/2 \rfloor + i} \right) \right)$
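
A short numerical illustration of the first representation. The kernel and sample are illustrative, and the exact average over all $n!$ permutations is replaced by a Monte Carlo average over random permutations, which is an approximation made here purely for tractability.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 100
Z = rng.uniform(size=n)
f = lambda a, b: np.abs(a - b)          # a symmetric kernel (illustrative choice)

# Direct U-statistic: U_n(f) = (1/(n(n-1))) * sum_{i != j} f(Z_i, Z_j).
F = f(Z[:, None], Z[None, :])
U_n = (F.sum() - np.trace(F)) / (n * (n - 1))

# First representation: average over permutations of "sums-of-i.i.d." blocks.
m = n // 2
def block_average(perm):
    return np.mean(f(Z[perm[:m]], Z[perm[m:2 * m]]))

n_perm = 5000                            # Monte Carlo over permutations instead of all n!
est = np.mean([block_average(rng.permutation(n)) for _ in range(n_perm)])
print(f"U_n = {U_n:.4f}, permutation-average estimate = {est:.4f}")  # should be close
```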

  12. Consequences of the first representation
     - Back to classification with $\lfloor n/2 \rfloor$ i.i.d. pairs
     - Enough for first-order analysis (including ERM and CRM)
     - Overestimates the variance
     - Noise assumption too restrictive!
     - No fast rates in the general case!

  13. Structure of U-Statistics - Second representation
     - Hoeffding's decomposition: $U_n(f) = \mathbb{E}(U_n(f)) + 2\, T_n(f) + W_n(f)$, with
       - $T_n(f) = \frac{1}{n} \sum_{i=1}^n h(Z_i)$ (empirical average of i.i.d. terms), where $h(z) = \mathbb{E} f(Z_1, z) - \mathbb{E}(U_n(f))$
       - $W_n(f)$ = degenerate U-statistic (remainder term)
     - A degenerate U-statistic $W_n$ with kernel $\tilde{h}$ is such that $\mathbb{E}(\tilde{h}(Z_1, Z_2) \mid Z_1) = 0$ a.s.
     - Remark: one needs to observe the individual labels $Y$, $Y'$ here!
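
A numerical check of Hoeffding's decomposition using a kernel for which $h$ and $\mathbb{E}(U_n(f))$ have closed forms; the Uniform(0,1) sample and the kernel $|z - z'|$ are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)

# Z ~ Uniform(0, 1), symmetric kernel f(z, z') = |z - z'|; then
# theta = E f(Z, Z') = 1/3 and h(z) = E f(Z, z) - theta = z^2 - z + 1/6 (closed form).
n = 2000
Z = rng.uniform(size=n)
theta = 1.0 / 3.0
h = Z ** 2 - Z + 1.0 / 6.0

F = np.abs(Z[:, None] - Z[None, :])
off = ~np.eye(n, dtype=bool)
U_n = F[off].sum() / (n * (n - 1))

T_n = h.mean()                                     # empirical average of i.i.d. terms
H_tilde = F - h[:, None] - h[None, :] - theta      # degenerate kernel h~(z, z')
W_n = H_tilde[off].sum() / (n * (n - 1))           # degenerate U-statistic (remainder)

print(f"U_n                 = {U_n:.6f}")
print(f"theta + 2 T_n + W_n = {theta + 2 * T_n + W_n:.6f}")   # identical up to rounding
print(f"|T_n| = {abs(T_n):.2e} (order n^-1/2), |W_n| = {abs(W_n):.2e} (smaller order)")
```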

  14. Insights for rates-of-convergence results
     - Leading term $T_n$ is an empirical process
       - handled by Talagrand's concentration inequality
       - involves "standard" complexity measures
       ⇒ variance control involves the function $h$
     - Exponential inequality for degenerate U-processes
       - VC classes: exponential inequality by Arcones and Giné (AoP, 1993)
       - general case: a new moment inequality ⇒ additional complexity measures

  15. Fast Rates - Notations
     - Kernel: $q_r((x, y), (x', y')) = \mathbb{I}\{(y - y') \cdot r(x, x') < 0\} - \mathbb{I}\{(y - y') \cdot r^*(x, x') < 0\}$
     - U-process indexed by the ranking rule $r \in \mathcal{R}$:
       $\Lambda_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j} q_r((X_i, Y_i), (X_j, Y_j))$
     - Excess risk: $\Lambda(r) = L(r) - L^* = \mathbb{E}\{ q_r((X, Y), (X', Y')) \}$
     - Key quantity: $h_r(x, y) = \mathbb{E}\{ q_r((x, y), (X', Y')) \} - \Lambda(r)$

  16. Result on Fast Rates - VC Case
     Assume that:
     - the class $\mathcal{R}$ of ranking rules has finite VC dimension $V$
     - for all $r \in \mathcal{R}$,
       $\mathrm{Var}(h_r(X, Y)) \leq c \left( L(r) - L^* \right)^\alpha$   (V)
       with some constants $c > 0$ and $\alpha \in [0, 1]$.
     Then, with probability larger than $1 - \delta$:
     $L(r_n) - L^* \leq 2 \left( \inf_{r \in \mathcal{R}} L(r) - L^* \right) + C \left( \frac{V \log(n/\delta)}{n} \right)^{1/(2-\alpha)}$

  17. Comments
     - Question: sufficient condition for Assumption (V):
       $\forall r \in \mathcal{R}, \quad \mathrm{Var}(h_r(X, Y)) \leq c \left( L(r) - L^* \right)^\alpha$ ?
     - Goal: formulate noise assumptions on the regression function $\mathbb{E}\{ Y \mid X = x \}$

  18. Example 1 - Bipartite Ranking
     - Binary labels $Y, Y' \in \{-1, +1\}$
     - Posterior probability: $\eta(x) = P\{ Y = +1 \mid X = x \}$
     - Noise Assumption (NA): there exist constants $c > 0$ and $\alpha \in [0, 1]$ such that
       $\forall x \in \mathcal{X}, \quad \mathbb{E}\left( |\eta(x) - \eta(X)|^{-\alpha} \right) \leq c$
     - Sufficient condition for (NA) with $\alpha < 1$: $\eta(X)$ absolutely continuous on $[0, 1]$ with bounded density

  19. Example 2 - Regression Data
     - $Y = m(X) + \sigma(X) \cdot N$, where $N \sim \mathcal{N}(0, 1)$, $\mathbb{E}(N \mid X) = 0$
     - Key quantity: $\Delta(X, X') = \frac{m(X) - m(X')}{\sqrt{\sigma^2(X) + \sigma^2(X')}}$
     - Noise Assumption (NA): there exist constants $c > 0$ and $\alpha \in [0, 1]$ such that
       $\forall x \in \mathcal{X}, \quad \mathbb{E}\left( |\Delta(x, X)|^{-\alpha} \right) \leq c$
     - Sufficient condition for (NA) with $\alpha < 1$: $m(X)$ has a bounded density and $\sigma(X)$ is bounded over $\mathcal{X}$
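
A hedged Monte Carlo check of (NA) in the regression setting. The Gaussian model with $m(x) = x$ and constant noise level is an illustrative assumption; under it, $m(X)$ has a bounded density and $\sigma$ is bounded, so the negative moments of $\Delta(x, X)$ should stay finite for $\alpha < 1$, in line with the sufficient condition above.

```python
import numpy as np

rng = np.random.default_rng(5)

m = lambda x: x          # regression function (illustrative choice)
sigma = lambda x: 1.0    # constant noise level (illustrative choice)

def neg_moment(x, alpha, n_mc=200_000):
    """Monte Carlo estimate of E(|Delta(x, X)|^{-alpha}), where
    Delta(x, x') = (m(x) - m(x')) / sqrt(sigma(x)^2 + sigma(x')^2) and X ~ N(0, 1)."""
    Xp = rng.normal(size=n_mc)
    delta = (m(x) - m(Xp)) / np.sqrt(sigma(x) ** 2 + sigma(Xp) ** 2)
    return np.mean(np.abs(delta) ** (-alpha))

for alpha in (0.25, 0.5, 0.75):
    vals = [neg_moment(x, alpha) for x in (-1.0, 0.0, 1.0)]
    print(f"alpha = {alpha}: estimated negative moments {np.round(vals, 3)}")
# The estimates remain bounded across x for alpha < 1, as the sufficient condition suggests.
```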

  20. Remainder Term
     - Degenerate U-process: consider $\mathcal{F}$ a class of degenerate kernels, and
       $\widetilde{W}_n = \sup_{f \in \mathcal{F}} \left| \sum_{i, j} f(Z_i, Z_j) \right|$

  21. Additional Complexity Measures
     $\epsilon_1, \ldots, \epsilon_n$ i.i.d. Rademacher random variables. Complexity measures:
     (1) $Z_\epsilon = \sup_{f \in \mathcal{F}} \left| \sum_{i, j} \epsilon_i \epsilon_j f(Z_i, Z_j) \right|$
     (2) $U_\epsilon = \sup_{f \in \mathcal{F}} \sup_{\alpha : \|\alpha\|_2 \leq 1} \sum_{i, j} \epsilon_i \alpha_j f(Z_i, Z_j)$
     (3) $M_\epsilon = \sup_{f \in \mathcal{F}} \max_{k = 1, \ldots, n} \left| \sum_{i=1}^n \epsilon_i f(Z_i, Z_k) \right|$
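
A sketch of how the three quantities can be evaluated for one draw of the Rademacher signs and a small finite class of kernels; the kernels below are illustrative stand-ins, not the classes considered in the talk. The inner supremum in $U_\epsilon$ over the Euclidean ball has a closed form as the norm of the vector $\left( \sum_i \epsilon_i f(Z_i, Z_j) \right)_j$.

```python
import numpy as np

rng = np.random.default_rng(6)

n = 50
Z = rng.normal(size=n)
eps = rng.choice([-1.0, 1.0], size=n)     # i.i.d. Rademacher variables

# A small finite class F of symmetric kernels (illustrative stand-in for a VC class).
kernels = [
    lambda a, b: np.sign(a - 0.0) * np.sign(b - 0.0),
    lambda a, b: np.sign(a - 0.5) * np.sign(b - 0.5),
    lambda a, b: np.sign(a + 0.5) * np.sign(b + 0.5),
]
F_mats = [k(Z[:, None], Z[None, :]) for k in kernels]   # n x n matrices f(Z_i, Z_j)

# (1) Z_eps = sup_f | sum_{i,j} eps_i eps_j f(Z_i, Z_j) |
Z_eps = max(abs(eps @ Fm @ eps) for Fm in F_mats)

# (2) U_eps = sup_f sup_{||alpha||_2 <= 1} sum_{i,j} eps_i alpha_j f(Z_i, Z_j)
#     = sup_f || (sum_i eps_i f(Z_i, Z_j))_j ||_2
U_eps = max(np.linalg.norm(eps @ Fm) for Fm in F_mats)

# (3) M_eps = sup_f max_k | sum_i eps_i f(Z_i, Z_k) |
M_eps = max(np.abs(eps @ Fm).max() for Fm in F_mats)

print(f"Z_eps = {Z_eps:.2f}, U_eps = {U_eps:.2f}, M_eps = {M_eps:.2f}")
```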

  22. Moment Inequality
     Theorem. If $\widetilde{W}_n$ is a degenerate U-process, then there exists a universal constant $C > 0$ such that for all $n$ and $q \geq 2$,
     $\left( \mathbb{E}\, \widetilde{W}_n^q \right)^{1/q} \leq C \left( \mathbb{E}\, Z_\epsilon + q^{1/2}\, \mathbb{E}\, U_\epsilon + q \left( \mathbb{E}\, M_\epsilon + n \right) + q^{3/2} n^{1/2} + q^2 \right)$
     Main tools: symmetrization, decoupling and concentration inequalities.
     Related work: Adamczak (AoP, 2006), Arcones and Giné (AoP, 1993), Giné, Latała and Zinn (HDP II, 2000), Houdré and Reynaud-Bouret (SIA, 2003), Major (PTRF, 2006)

  23. Control of the Degenerate Part
     Corollary. With probability $1 - \delta$,
     $\widetilde{W}_n \leq C \left( \frac{\mathbb{E}\, Z_\epsilon}{n^2} + \frac{\mathbb{E}\, U_\epsilon \sqrt{\log(1/\delta)}}{n^2} + \frac{\mathbb{E}\, M_\epsilon \log(1/\delta)}{n^2} + \frac{\log(1/\delta)}{n} \right)$
     Special case - $\mathcal{F}$ is a VC class:
     $\mathbb{E}\, Z_\epsilon \leq C n V, \quad \mathbb{E}\, U_\epsilon \leq C n \sqrt{V}, \quad \mathbb{E}_\epsilon M_\epsilon \leq C \sqrt{V n}$
     Hence, with probability $1 - \delta$,
     $\widetilde{W}_n \leq \frac{1}{n} \left( V + \log(1/\delta) \right)$

  24. Other Criteria for Ranking Error
     AUC and beyond - Focus on the top of the list
     Joint work with Stéphan Clémençon (Telecom ParisTech)

  25. Global performance measures: ROC Curve
     - For a given scoring rule $s : \mathcal{X} \to \mathbb{R}$ and threshold $t \in \mathbb{R}$:
     - True positive rate: $\beta_s(t) = P\{ s(X) \geq t \mid Y = +1 \}$
     - False positive rate: $\alpha_s(t) = P\{ s(X) \geq t \mid Y = -1 \}$
     - ROC curve: $(s, t) \mapsto \left( \alpha_s(t), \beta_s(t) \right)$, plus continuous extension
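
A minimal sketch of the empirical ROC curve and its area for a toy scoring rule; the logistic posterior and the choice $s(x) = x$ are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy bipartite data with posterior eta(x) = 1 / (1 + exp(-x)); scoring rule s(x) = x.
n = 2000
X = rng.normal(size=n)
Y = np.where(rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X)), 1, -1)
s = X

# Empirical true/false positive rates over the grid of observed score thresholds.
thresholds = np.sort(np.unique(s))
pos, neg = (Y == 1), (Y == -1)
beta = np.array([np.mean(s[pos] >= t) for t in thresholds])   # beta_s(t) = P{s(X) >= t | Y = +1}
alpha = np.array([np.mean(s[neg] >= t) for t in thresholds])  # alpha_s(t) = P{s(X) >= t | Y = -1}

# Close the curve at (0, 0) and integrate with the trapezoidal rule to get the empirical AUC.
alpha = np.append(alpha, 0.0)
beta = np.append(beta, 0.0)
order = np.argsort(alpha)
auc = np.sum(0.5 * (beta[order][1:] + beta[order][:-1]) * np.diff(alpha[order]))
print(f"Empirical area under the ROC curve: {auc:.3f}")
```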

  26. Optimality, Metrics for ROC Curves
     - By Neyman-Pearson's lemma, optimal scoring rules are in $\mathcal{S}^* = \{ T \circ \eta : T \text{ strictly increasing} \}$
     - Optimal ROC curve: $\alpha \in [0, 1] \mapsto \mathrm{ROC}^*(\alpha) = \beta_\eta \circ \alpha_\eta^{-1}(\alpha)$
     - $L_1$ metric on ROC curves:
       $d_1(s, \eta) = \int_0^1 \left( \mathrm{ROC}^*(\alpha) - \mathrm{ROC}_s(\alpha) \right) d\alpha = \mathrm{AUC}(\eta) - \mathrm{AUC}(s)$
     - What about stronger metrics?
       $d_\infty(s, \eta) = \sup_{\alpha \in [0, 1]} \left( \mathrm{ROC}^*(\alpha) - \mathrm{ROC}_s(\alpha) \right)$

  27. Connection to the AUC criterion
     - Consider a real-valued scoring rule $s : \mathcal{X} \to \mathbb{R}$ and take $(X, Y)$, $(X', Y')$ i.i.d. copies
     - $\mathrm{AUC}(s) = \int_0^1 \mathrm{ROC}_s(\alpha)\, d\alpha = P\{ s(X) \geq s(X') \mid Y > Y' \}$
     - Ranking rule: $r(X, X') = 2\, \mathbb{I}\{ s(X) > s(X') \} - 1$
     - Ranking error and AUC: with $p = P\{ Y = +1 \}$,
       $\mathrm{AUC}(s) = 1 - \frac{1}{2 p (1 - p)} L(r)$
     - Maximization of the AUC = minimization of the ranking error
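
A quick empirical check of the pairwise form of the AUC and of the identity relating it to the ranking error; the data-generating model is an illustrative assumption. Both quantities are computed on the same sample, so with continuous scores (ties of probability zero) they should agree.

```python
import numpy as np

rng = np.random.default_rng(8)

n = 3000
X = rng.normal(size=n)
Y = np.where(rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X)), 1, -1)
s = X                                        # illustrative scoring rule

# Empirical AUC as the pairwise probability P{s(X) >= s(X') | Y > Y'}.
pos, neg = s[Y == 1], s[Y == -1]
auc = np.mean(pos[:, None] >= neg[None, :])

# Empirical ranking error of r(X, X') = 2 * 1{s(X) > s(X')} - 1, averaged over all pairs
# (diagonal pairs have R = 0 and therefore never count as errors).
R = np.sign(Y[:, None] - Y[None, :])         # pairwise preference labels
r = 2.0 * (s[:, None] > s[None, :]) - 1.0
L_r = np.mean(R * r < 0)

p = np.mean(Y == 1)
print(f"AUC (pairwise form)      : {auc:.4f}")
print(f"1 - L(r) / (2 p (1 - p)) : {1.0 - L_r / (2.0 * p * (1.0 - p)):.4f}")
```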
