
Ranking, Aggregation, and You
Lester Mackey
Collaborators: John C. Duchi and Michael I. Jordan
Stanford University and UC Berkeley
October 5, 2014


  1. Tractable ranking
  First try: Empirical risk minimization ← Intractable!
    \min_f \hat{R}_n(f) := \hat{\mathbb{E}}_n[L(f(Q), Y)] = \frac{1}{n} \sum_{k=1}^{n} L(f(Q_k), Y_k)
  Idea: Replace the loss L(α, Y) with a convex surrogate ϕ(α, Y)
    L(\alpha, Y) = \sum_{i \neq j} Y_{ij}\, 1(\alpha_i \le \alpha_j)   (Hard)
    \varphi(\alpha, Y) = \sum_{i \neq j} Y_{ij}\, \phi(\alpha_i - \alpha_j)   (Tractable)

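A minimal NumPy sketch of these two losses (the function names and toy preference matrix below are illustrative, not from the slides): Y is an m × m matrix with Y_ij = 1 when item i was observed preferred to item j, and φ is the logistic surrogate, which is convex, non-increasing, and has φ′(0) < 0.

```python
import numpy as np

def pairwise_loss(alpha, Y):
    """0-1 pairwise disagreement: L(alpha, Y) = sum_{i != j} Y_ij * 1(alpha_i <= alpha_j)."""
    m = len(alpha)
    diffs = alpha[:, None] - alpha[None, :]        # diffs[i, j] = alpha_i - alpha_j
    off_diag = ~np.eye(m, dtype=bool)
    return float(np.sum(Y[off_diag] * (diffs[off_diag] <= 0)))

def logistic_phi(t):
    """A convex, non-increasing surrogate with phi'(0) = -1/2 < 0."""
    return np.log1p(np.exp(-t))

def pairwise_surrogate(alpha, Y, phi=logistic_phi):
    """Convex surrogate: phi(alpha, Y) = sum_{i != j} Y_ij * phi(alpha_i - alpha_j)."""
    m = len(alpha)
    diffs = alpha[:, None] - alpha[None, :]
    off_diag = ~np.eye(m, dtype=bool)
    return float(np.sum(Y[off_diag] * phi(diffs[off_diag])))

# Toy query: 3 items, observed preferences 0 -> 1 and 2 -> 0
Y = np.array([[0, 1, 0],
              [0, 0, 0],
              [1, 0, 0]], dtype=float)
alpha = np.array([1.0, 0.0, 2.0])                  # scores; higher = ranked earlier
print(pairwise_loss(alpha, Y), pairwise_surrogate(alpha, Y))
```
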
  2. Surrogate ranking
  Idea: Empirical surrogate risk minimization
    \min_f \hat{R}_{\varphi,n}(f) := \hat{\mathbb{E}}_n[\varphi(f(Q), Y)] = \frac{1}{n} \sum_{k=1}^{n} \varphi(f(Q_k), Y_k)
  ◮ If ϕ is convex, then the minimization is tractable
  ◮ \mathrm{argmin}_f \hat{R}_{\varphi,n}(f) \to \mathrm{argmin}_f R_\varphi(f) := \mathbb{E}[\varphi(f(Q), Y)] as n → ∞
  Main Question: Are these tractable ranking procedures consistent?
  ⇐⇒ Does argmin_f R_ϕ(f) also minimize the true risk R(f)?

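As a concrete (hypothetical) instance of this empirical surrogate risk minimization, the sketch below assumes a linear scoring function f(Q) = Xw over per-item features and minimizes the logistic pairwise surrogate by plain gradient descent; the synthetic data, step size, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def surrogate_risk_and_grad(w, queries):
    """Empirical surrogate risk (1/n) sum_k phi(X_k w, Y_k) with logistic phi, and its gradient in w.

    queries: list of (X, Y) pairs, where X is an (m, d) item-feature matrix for one query
    and Y is an (m, m) pairwise-preference matrix."""
    risk, grad = 0.0, np.zeros_like(w)
    for X, Y in queries:
        alpha = X @ w
        diffs = alpha[:, None] - alpha[None, :]    # alpha_i - alpha_j
        risk += np.sum(Y * np.log1p(np.exp(-diffs)))
        # derivative of log(1 + exp(-t)) in t is -1 / (1 + exp(t))
        coef = -Y / (1.0 + np.exp(diffs))
        grad += (coef.sum(axis=1) - coef.sum(axis=0)) @ X
    n = len(queries)
    return risk / n, grad / n

# Tiny synthetic problem: complete preferences generated by a hidden linear scorer
rng = np.random.default_rng(0)
queries = []
for _ in range(50):
    X = rng.normal(size=(4, 3))
    true_scores = X @ np.array([1.0, -2.0, 0.5])
    Y = (true_scores[:, None] > true_scores[None, :]).astype(float)
    queries.append((X, Y))

w = np.zeros(3)
for _ in range(200):                               # plain gradient descent
    risk, grad = surrogate_risk_and_grad(w, queries)
    w -= 0.1 * grad
print(risk, w)
```
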
  3. Classification consistency
  Consider the special case of classification
  ◮ Observe: query X, items {0, 1}, label Y_01 = 1 or Y_10 = 1
  ◮ Pairwise loss: L(\alpha, Y) = Y_{01}\, 1(\alpha_0 \le \alpha_1) + Y_{10}\, 1(\alpha_1 \le \alpha_0)
  ◮ Surrogate loss: \varphi(\alpha, Y) = Y_{01}\, \phi(\alpha_0 - \alpha_1) + Y_{10}\, \phi(\alpha_1 - \alpha_0)
  Theorem [Bartlett, Jordan, and McAuliffe, 2006]: If φ is convex, the procedure based on minimizing φ is consistent if and only if φ′(0) < 0.
  ⇒ Tractable consistency for boosting, SVMs, logistic regression

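For intuition, the condition φ′(0) < 0 is easy to verify for the classic convex surrogates behind these methods; the small check below (the surrogate list and finite-difference test are mine, not from the slides) prints an approximate derivative at zero for each.

```python
import numpy as np

# Classic convex surrogates phi and a finite-difference check of phi'(0) < 0,
# the Bartlett-Jordan-McAuliffe condition for classification consistency.
surrogates = {
    "exponential (boosting)": lambda t: np.exp(-t),
    "hinge (SVM)":            lambda t: np.maximum(0.0, 1.0 - t),
    "logistic":               lambda t: np.log1p(np.exp(-t)),
}
eps = 1e-6
for name, phi in surrogates.items():
    deriv_at_zero = (phi(eps) - phi(-eps)) / (2 * eps)   # central difference
    print(f"{name:24s} phi'(0) ~ {deriv_at_zero:+.3f}")
```
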
  4. Ranking consistency?
  Good news: Can characterize surrogate ranking consistency
  Theorem [Duchi, Mackey, and Jordan, 2013]: The procedure based on minimizing ϕ is consistent ⇐⇒ for every query q,
    \min\{\, \mathbb{E}[\varphi(\alpha, Y) \mid q] : \alpha \notin \mathrm{argmin}_{\alpha'} \mathbb{E}[L(\alpha', Y) \mid q] \,\} \;>\; \min_{\alpha} \mathbb{E}[\varphi(\alpha, Y) \mid q].
  ◮ Translation: ϕ is consistent if and only if minimizing the conditional surrogate risk gives the correct ranking for every query

  5. Ranking consistency?
  Bad news: The consequences are dire...
  [Figure: partial preference graph over items 1–4 with observed preferences y_12, y_13, y_34]
  Consider the pairwise loss:
    L(\alpha, Y) = \sum_{i \neq j} Y_{ij}\, 1(\alpha_i \le \alpha_j)
  Task: Find argmin_α E[L(α, Y) | q]
  ◮ Classification (two node) case: Easy
    ◮ Choose α_0 > α_1 ⇐⇒ P[Class 0 | q] > P[Class 1 | q]
  ◮ General case: NP-hard
    ◮ Unless P = NP, we must restrict the problem for tractable consistency

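To see why the general case is hard, one can still compute argmin_α E[L(α, Y) | q] exactly for tiny item sets by enumerating all orderings; the brute-force sketch below (names and the cyclic example are illustrative) makes the m! blow-up concrete, and at scale this is the minimum feedback arc set problem.

```python
import numpy as np
from itertools import permutations

def min_conditional_risk(S):
    """Brute-force argmin over orderings of sum_{i != j} s_ij * 1(alpha_i <= alpha_j).

    S[i, j] = s_ij = E[Y_ij | q]. Exhausts all m! orderings, which is exactly why the
    general problem is intractable for large m."""
    m = S.shape[0]
    best_order, best_risk = None, np.inf
    for order in permutations(range(m)):           # order[0] ranked first (largest alpha)
        pos = {item: rank for rank, item in enumerate(order)}
        # item i incurs loss against j (alpha_i <= alpha_j) when it is ranked below j
        risk = sum(S[i, j] for i in range(m) for j in range(m)
                   if i != j and pos[i] > pos[j])
        if risk < best_risk:
            best_order, best_risk = order, risk
    return best_order, best_risk

# Cyclic average preferences (1 beats 2, 2 beats 3, 3 beats 1): some loss is unavoidable
S = np.array([[0.0, 0.6, 0.3],
              [0.3, 0.0, 0.6],
              [0.6, 0.3, 0.0]])
print(min_conditional_risk(S))
```
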
  6. Low noise distribution
  Define: Average preference for item i over item j: s_ij = E[Y_ij | q]
  ◮ We say i ≻ j on average if s_ij > s_ji
  Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.
  [Figure: triangle of items 1, 2, 3 with average preferences s_12, s_23, s_13, s_31; low noise ⇒ s_13 > s_31]
  ◮ No cyclic preferences on average
  ◮ Find argmin_α E[L(α, Y) | q]: Very easy
    ◮ Choose α_i > α_j ⇐⇒ s_ij > s_ji

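A minimal sketch of both steps, assuming no exact ties s_ij = s_ji (helper names are mine): first verify that the average preferences are transitive, then recover the optimal ordering by sorting items by their number of average pairwise wins.

```python
import numpy as np
from itertools import permutations

def is_low_noise(S):
    """Check transitivity of the average-preference relation: i beats j iff s_ij > s_ji."""
    m = S.shape[0]
    beats = S > S.T
    for i, j, k in permutations(range(m), 3):
        if beats[i, j] and beats[j, k] and not beats[i, k]:
            return False
    return True

def rank_by_average_preference(S):
    """Under low noise (and no exact ties), sorting by pairwise-win counts yields an
    ordering with alpha_i > alpha_j exactly when s_ij > s_ji."""
    wins = (S > S.T).sum(axis=1)
    return np.argsort(-wins)                       # best item first

S = np.array([[0.0, 0.7, 0.6],
              [0.2, 0.0, 0.8],
              [0.3, 0.1, 0.0]])
print(is_low_noise(S), rank_by_average_preference(S))
```
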
  7. Ranking consistency?
  Pairwise ranking surrogate [Herbrich, Graepel, and Obermayer, 2000; Freund, Iyer, Schapire, and Singer, 2003; Dekel, Manning, and Singer, 2004]:
    \varphi(\alpha, Y) = \sum_{ij} Y_{ij}\, \phi(\alpha_i - \alpha_j)
  for φ convex with φ′(0) < 0. Common in the ranking literature.
  Theorem [Duchi, Mackey, and Jordan, 2013]: ϕ is not consistent, even in low noise settings.
  ⇒ Inconsistency for RankBoost, RankSVM, Logistic Ranking...

  8. Ranking with pairwise data is challenging
  ◮ Inconsistent in general (unless P = NP)
  ◮ Even for low noise distributions:
    ◮ Inconsistent for standard convex losses
      \varphi(\alpha, Y) = \sum_{ij} Y_{ij}\, \phi(\alpha_i - \alpha_j)
    ◮ Inconsistent for margin-based convex losses
      \varphi(\alpha, Y) = \sum_{ij} \phi(\alpha_i - \alpha_j - Y_{ij})
  Question: Do tractable consistent losses exist for partial preference data?
  Yes, if we aggregate!

  9. Outline
  ◮ Supervised Ranking: formal definition, tractable surrogates, pairwise inconsistency
  ◮ Aggregation: restoring consistency, estimating complete preferences, U-statistics
  ◮ Practical procedures
  ◮ Experimental results

  10. An observation
  Can rewrite the risk of the pairwise loss (the second equality holds up to a constant that does not depend on α):
    \mathbb{E}[L(\alpha, Y) \mid q] = \sum_{i \neq j} s_{ij}\, 1(\alpha_i \le \alpha_j) = \sum_{i \neq j} \max\{s_{ij} - s_{ji}, 0\}\, 1(\alpha_i \le \alpha_j)
  where s_ij = E[Y_ij | q].
  ◮ Only depends on the net expected preferences s_ij − s_ji
  Consider the surrogate
    \varphi(\alpha, s) := \sum_{i \neq j} \max\{s_{ij} - s_{ji}, 0\}\, \phi(\alpha_i - \alpha_j) \;\neq\; \sum_{i \neq j} s_{ij}\, \phi(\alpha_i - \alpha_j)
  for φ non-increasing and convex, with φ′(0) < 0.
  ◮ Either the i → j preference or the j → i preference is penalized, but not both
  ◮ Consistent whenever average preferences are acyclic

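A sketch of this net-preference surrogate (assuming the logistic φ; the average-preference matrix S below is illustrative), showing that scores agreeing with the average preferences incur a smaller surrogate value than the reversed scores:

```python
import numpy as np

def net_preference_surrogate(alpha, S, phi=lambda t: np.log1p(np.exp(-t))):
    """phi(alpha, s) = sum_{i != j} max(s_ij - s_ji, 0) * phi(alpha_i - alpha_j),
    with a logistic phi (non-increasing, convex, phi'(0) < 0)."""
    net = np.maximum(S - S.T, 0.0)                 # only the winning direction of each pair
    np.fill_diagonal(net, 0.0)
    diffs = alpha[:, None] - alpha[None, :]
    return float(np.sum(net * phi(diffs)))

# Average preferences with 0 > 1 > 2 on average
S = np.array([[0.0, 0.7, 0.6],
              [0.2, 0.0, 0.8],
              [0.3, 0.1, 0.0]])
print(net_preference_surrogate(np.array([2.0, 1.0, 0.0]), S))   # scores agree with preferences
print(net_preference_surrogate(np.array([0.0, 1.0, 2.0]), S))   # reversed scores: larger value
```
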
  11. What happened?
  Old surrogates: \mathbb{E}[\varphi(\alpha, Y) \mid q] = \lim_{k \to \infty} \frac{1}{k} \sum_{l=1}^{k} \varphi(\alpha, Y_l)
  ◮ Loss ϕ(α, Y) applied to a single datapoint
  New surrogates: \varphi(\alpha, \mathbb{E}[Y \mid q]) = \lim_{k \to \infty} \varphi\big(\alpha, \frac{1}{k} \sum_{l=1}^{k} Y_l\big)
  ◮ Loss applied to an aggregation of many datapoints
  New framework: Ranking with aggregate losses
    L(\alpha, s_k(Y_1, \ldots, Y_k)) and \varphi(\alpha, s_k(Y_1, \ldots, Y_k)),
  where s_k is a structure function that aggregates the first k datapoints
  ◮ s_k combines partial preferences into more complete estimates
  ◮ The consistency characterization extends to this setting

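A minimal example of one natural structure function, the running average s_k(Y_1, …, Y_k) = (1/k) Σ_l Y_l appearing in the "new surrogates" line above; the specific partial observations are made up for illustration.

```python
import numpy as np

def mean_structure_function(Ys):
    """One natural structure function: s_k(Y_1, ..., Y_k) = (1/k) * sum of the observed
    pairwise-preference matrices, an estimate of E[Y | q]."""
    return np.mean(Ys, axis=0)

# Three partial observations for the same query over items {0, 1, 2}
Y1 = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]], dtype=float)   # saw 0 preferred to 1
Y2 = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], dtype=float)   # saw 1 preferred to 2
Y3 = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]], dtype=float)   # saw 0 > 1 and 2 > 1
print(mean_structure_function([Y1, Y2, Y3]))
```
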
  12. Aggregation via structure function
  [Figure: partial preference graphs Y_1, Y_2, …, Y_k over items 1–4 are combined by the structure function s_k(Y_1, …, Y_k) into a single, more complete preference graph]
  Question: When does aggregation help?

  13. Complete data losses
  ◮ Normalized Discounted Cumulative Gain (NDCG) (see the sketch after this slide)
  ◮ Precision, Precision@k
  ◮ Expected reciprocal rank (ERR)
  Pros: Popular, well-motivated, admit tractable consistent surrogates
  ◮ e.g., penalize mistakes at the top of the ranked list more heavily
  Cons: Require complete preference data
  Idea:
  ◮ Use aggregation to estimate complete preferences from partial preferences
  ◮ Plug the estimates into consistent surrogates
  ◮ Check that aggregation + surrogacy retains consistency

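For concreteness, here is a standard NDCG computation (the 2^rel − 1 gain and 1/log2(rank + 1) discount are the usual convention; the slides do not pin these choices down):

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum((2.0 ** relevances - 1.0) / np.log2(positions + 1)))

def ndcg(scores, relevances):
    """NDCG of the ranking induced by the scores, against graded relevances."""
    order = np.argsort(-np.asarray(scores))        # highest score ranked first
    ideal = np.sort(np.asarray(relevances))[::-1]  # best possible ordering
    return dcg(np.asarray(relevances)[order]) / dcg(ideal)

relevances = [3, 0, 1, 2]                          # complete (graded) preferences for one query
print(ndcg([0.9, 0.1, 0.4, 0.8], relevances))      # ideal ranking: NDCG = 1.0
print(ndcg([0.1, 0.9, 0.8, 0.4], relevances))      # poor ranking: NDCG well below 1
```
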
  14. Cascade model for click data [Craswell, Zoeter, Taylor, and Ramsey, 2008; Chapelle, Metzler, Zhang, and Grinspan, 2009]
  [Figure: ranked list of results, positions 1–5]
  ◮ Person i clicks on the first relevant result, at position k(i)
  ◮ Relevance probability of item k is p_k
  ◮ Probability of a click on item k is
    p_k \prod_{j=1}^{k-1} (1 - p_j)
  ◮ The ERR loss assumes p is known
  Estimate p via maximum likelihood on n clicks:
    \hat{s} = \mathrm{argmax}_{p \in [0,1]^m} \sum_{i=1}^{n} \Big( \log p_{k(i)} + \sum_{j=1}^{k(i)-1} \log(1 - p_j) \Big)
  ⇒ Consistent ERR minimization under our framework

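This likelihood separates across list positions, so its maximizer has a simple closed form; the sketch below (function name and click log are hypothetical) assumes every recorded session ends in a click, as in the cascade model above.

```python
import numpy as np

def cascade_mle(click_positions, m):
    """Maximum-likelihood relevance probabilities under the cascade model.

    click_positions: 1-based position k(i) of each observed click.
    Setting the derivative of the cascade log-likelihood to zero gives
    p_hat[j] = (#clicks at position j) / (#sessions that reached position j),
    where a session clicking at k(i) reaches positions 1, ..., k(i)."""
    clicks = np.asarray(click_positions)
    p_hat = np.zeros(m)
    for j in range(1, m + 1):
        reached = np.sum(clicks >= j)
        if reached > 0:
            p_hat[j - 1] = np.sum(clicks == j) / reached
    return p_hat

# Toy click log over a 5-item ranked list
print(cascade_mle([1, 1, 2, 3, 1, 2, 5], m=5))
```
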
  15. Benefits of aggregation
  ◮ Tractable consistency for partial preference losses:
    f^* \in \mathrm{argmin}_f \lim_{k \to \infty} \mathbb{E}[\varphi(f(Q), s_k(Y_1, \ldots, Y_k))]
    \;\Rightarrow\; f^* \in \mathrm{argmin}_f \lim_{k \to \infty} \mathbb{E}[L(f(Q), s_k(Y_1, \ldots, Y_k))]
  ◮ Use complete data losses with realistic partial preference data
  ◮ Models the process of generating relevance scores from clicks/comparisons

  16. What remains?
  Before aggregation, we had
    \underbrace{\mathrm{argmin}_f \tfrac{1}{n} \sum_{k=1}^{n} \varphi(f(Q_k), Y_k)}_{\text{empirical}} \;\to\; \underbrace{\mathrm{argmin}_f \mathbb{E}[\varphi(f(Q), Y)]}_{\text{population}}
  What is a suitable empirical analogue \hat{R}_{\varphi,n}(f) with aggregation?
