Tractable ranking

First try: Empirical risk minimization ← Intractable!

  min_f  R̂_n(f) := Ê_n[L(f(Q), Y)] = (1/n) ∑_{k=1}^n L(f(Q_k), Y_k)

Idea: Replace the loss L(α, Y) with a convex surrogate ϕ(α, Y)

  L(α, Y) = ∑_{i≠j} Y_ij 1(α_i ≤ α_j)      ← Hard
  ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)      ← Tractable
Surrogate ranking

Idea: Empirical surrogate risk minimization

  min_f  R̂_{ϕ,n}(f) := Ê_n[ϕ(f(Q), Y)] = (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)

◮ If ϕ is convex, then the minimization is tractable
◮ argmin_f R̂_{ϕ,n}(f) → argmin_f R_ϕ(f) := E[ϕ(f(Q), Y)]  as n → ∞

Main Question: Are these tractable ranking procedures consistent?
⇐⇒ Does argmin_f R_ϕ(f) also minimize the true risk R(f)?
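As a concrete illustration (my own sketch, not from the slides), the empirical surrogate risk for the pairwise surrogate ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j) can be computed as follows; the logistic choice of φ and all names are assumptions made for the example.

```python
import numpy as np

def empirical_surrogate_risk(scores, labels, phi=lambda t: np.log1p(np.exp(-t))):
    """(1/n) sum_k phi(f(Q_k), Y_k) with the pairwise surrogate.

    scores: list of length-m arrays, alpha = f(Q_k) for each observation
    labels: list of m x m arrays, Y[i, j] = 1 if item i was preferred to item j
    phi:    convex, non-increasing surrogate with phi'(0) < 0 (logistic by default)
    """
    total = 0.0
    for alpha, Y in zip(scores, labels):
        diffs = alpha[:, None] - alpha[None, :]   # diffs[i, j] = alpha_i - alpha_j
        total += np.sum(Y * phi(diffs))           # sum_{i != j} Y_ij phi(alpha_i - alpha_j)
    return total / len(scores)
```

Because φ is convex, this objective is convex in the scores, which is what makes the surrogate minimization tractable.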
Classification consistency

Consider the special case of classification
◮ Observe: query X, items {0, 1}, label Y_01 = 1 or Y_10 = 1
◮ Pairwise loss: L(α, Y) = Y_01 1(α_0 ≤ α_1) + Y_10 1(α_1 ≤ α_0)
◮ Surrogate loss: ϕ(α, Y) = Y_01 φ(α_0 − α_1) + Y_10 φ(α_1 − α_0)

Theorem: If φ is convex, the procedure based on minimizing φ is consistent if and only if φ′(0) < 0. [Bartlett, Jordan, and McAuliffe, 2006]

⇒ Tractable consistency for boosting, SVMs, logistic regression
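To make the φ′(0) < 0 condition concrete, here is a small check (my own example, not from the slides) that the convex surrogates behind boosting, SVMs, and logistic regression all satisfy it:

```python
import numpy as np

def deriv_at_zero(phi, h=1e-6):
    """Central-difference estimate of phi'(0)."""
    return (phi(h) - phi(-h)) / (2 * h)

# Convex surrogates behind the methods named on the slide.
surrogates = {
    "exponential (boosting)": lambda t: np.exp(-t),
    "hinge (SVM)":            lambda t: max(0.0, 1.0 - t),
    "logistic":               lambda t: np.log1p(np.exp(-t)),
}

for name, phi in surrogates.items():
    print(f"{name:24s} phi'(0) ≈ {deriv_at_zero(phi):+.2f}")  # all strictly negative
```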
Ranking consistency?

Good news: Can characterize surrogate ranking consistency [Duchi, Mackey, and Jordan, 2013]

Theorem: The procedure based on minimizing ϕ is consistent ⇐⇒

  min_{α ∉ argmin_{α′} E[L(α′, Y) | q]} E[ϕ(α, Y) | q]  >  min_α E[ϕ(α, Y) | q].

◮ Translation: ϕ is consistent if and only if minimizing the conditional surrogate risk gives the correct ranking for every query
Ranking consistency?

Bad news: The consequences are dire...

[Figure: partial preference graph over items 1–4 with observed pairwise labels y_12, y_13, y_34]

Consider the pairwise loss:

  L(α, Y) = ∑_{i≠j} Y_ij 1(α_i ≤ α_j)

Task: Find argmin_α E[L(α, Y) | q]
◮ Classification (two node) case: Easy
  ◮ Choose α_0 > α_1 ⇐⇒ P[Class 0 | q] > P[Class 1 | q]
◮ General case: NP hard
  ◮ Unless P = NP, must restrict the problem for tractable consistency
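For intuition (my own illustration, not part of the slides), finding argmin_α E[L(α, Y) | q] amounts to ordering the items so that as little average preference mass as possible points "backwards" — essentially a weighted minimum feedback arc set problem, which is why the general case is NP hard. Brute force works only for a handful of items:

```python
import itertools
import numpy as np

def best_ordering(S):
    """Brute-force argmin over orderings of sum_{i != j} s_ij * 1(alpha_i <= alpha_j).

    S[i, j] = s_ij = E[Y_ij | q]. Feasible only for small m (m! orderings).
    """
    m = S.shape[0]
    best_perm, best_risk = None, np.inf
    for perm in itertools.permutations(range(m)):
        pos = {item: r for r, item in enumerate(perm)}   # r = rank (0 = best)
        # item i is NOT ranked above item j  <=>  pos[i] > pos[j]  <=>  alpha_i < alpha_j
        risk = sum(S[i, j] for i in range(m) for j in range(m)
                   if i != j and pos[i] > pos[j])
        if risk < best_risk:
            best_perm, best_risk = perm, risk
    return best_perm, best_risk
```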
Low noise distribution

Define: Average preference for item i over item j: s_ij = E[Y_ij | q]
◮ We say i ≻ j on average if s_ij > s_ji

Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.
◮ No cyclic preferences on average

[Figure: three items with average preferences s_12, s_23, s_13, s_31; low noise ⇒ s_13 > s_31]

◮ Find argmin_α E[L(α, Y) | q]: Very easy
  ◮ Choose α_i > α_j ⇐⇒ s_ij > s_ji
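A minimal sketch (my own, assuming every pair of items is strictly ordered on average) of how easy the conditional minimization becomes under low noise: counting how many items each item beats on average already yields scores with α_i > α_j ⇐⇒ s_ij > s_ji.

```python
import numpy as np

def low_noise_scores(S):
    """S[i, j] = s_ij. Assumes s_ij != s_ji for i != j and acyclic average
    preferences (low noise); then win counts give a valid score vector alpha."""
    wins = (S > S.T).sum(axis=1)     # number of items each item beats on average
    return wins.astype(float)        # alpha_i > alpha_j  <=>  s_ij > s_ji
```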
Ranking consistency?

Pairwise ranking surrogate: [Herbrich, Graepel, and Obermayer, 2000, Freund, Iyer, Schapire, and Singer, 2003, Dekel, Manning, and Singer, 2004]

  ϕ(α, Y) = ∑_{ij} Y_ij φ(α_i − α_j)

for φ convex with φ′(0) < 0. Common in the ranking literature.

Theorem: ϕ is not consistent, even in low noise settings. [Duchi, Mackey, and Jordan, 2013]

⇒ Inconsistency for RankBoost, RankSVM, Logistic Ranking...
Ranking with pairwise data is challenging

◮ Inconsistent in general (unless P = NP)
◮ Even under low noise distributions:
  ◮ Inconsistent for standard convex losses
      ϕ(α, Y) = ∑_{ij} Y_ij φ(α_i − α_j)
  ◮ Inconsistent for margin-based convex losses
      ϕ(α, Y) = ∑_{ij} φ(α_i − α_j − Y_ij)

Question: Do tractable consistent losses exist for partial preference data?

Yes, if we aggregate!
Outline

Supervised Ranking
  Formal definition
  Tractable surrogates
  Pairwise inconsistency
Aggregation
  Restoring consistency
  Estimating complete preferences
U-statistics
  Practical procedures
  Experimental results
An observation

Can rewrite the risk of the pairwise loss:

  E[L(α, Y) | q] = ∑_{i≠j} s_ij 1(α_i ≤ α_j) = ∑_{i≠j} max{s_ij − s_ji, 0} 1(α_i ≤ α_j)

where s_ij = E[Y_ij | q] (the second equality holds up to an additive constant that does not depend on α).
◮ Only depends on the net expected preferences s_ij − s_ji

Consider the surrogate

  ϕ(α, s) := ∑_{i≠j} max{s_ij − s_ji, 0} φ(α_i − α_j)  ≠  ∑_{i≠j} s_ij φ(α_i − α_j)

for φ non-increasing and convex, with φ′(0) < 0.
◮ Either i → j is penalized or j → i, but not both
◮ Consistent whenever average preferences are acyclic
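A minimal sketch of this net-preference surrogate (names and the logistic choice of φ are my own assumptions); note that for each pair only the direction with the larger average preference contributes:

```python
import numpy as np

def net_preference_surrogate(alpha, S, phi=lambda t: np.log1p(np.exp(-t))):
    """phi(alpha, s) = sum_{i != j} max(s_ij - s_ji, 0) * phi(alpha_i - alpha_j).

    alpha: length-m score vector; S[i, j] = s_ij = E[Y_ij | q].
    """
    W = np.maximum(S - S.T, 0.0)              # one-sided net preferences
    diffs = alpha[:, None] - alpha[None, :]   # diffs[i, j] = alpha_i - alpha_j
    return np.sum(W * phi(diffs))             # diagonal of W is zero, so i = j drops out
```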
What happened?

Old surrogates:  E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) ∑_{j=1}^k ϕ(α, Y_j)
◮ Loss ϕ(α, Y) applied to a single datapoint

New surrogates:  ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) ∑_{j=1}^k Y_j)
◮ Loss applied to an aggregation of many datapoints

New framework: Ranking with aggregate losses

  L(α, s_k(Y_1, …, Y_k))  and  ϕ(α, s_k(Y_1, …, Y_k))

where s_k is a structure function that aggregates the first k datapoints
◮ s_k combines partial preferences into more complete estimates
◮ Consistency characterization extends to this setting
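As one concrete structure function (the mean aggregation appearing in the limit above; the function name is my own), the first k pairwise-preference matrices for a query can simply be averaged into an estimate of E[Y | q]:

```python
import numpy as np

def mean_structure_function(Ys):
    """s_k(Y_1, ..., Y_k) = (1/k) sum_j Y_j, an empirical estimate of E[Y | q].

    Ys: list of k pairwise-preference matrices observed for the same query.
    """
    return np.mean(np.stack(Ys, axis=0), axis=0)
```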
Aggregation via structure function

[Figure: several partial preference graphs Y_1, Y_2, …, Y_k over items 1–4 are combined by the structure function s_k(Y_1, …, Y_k) into a single aggregate preference graph]

Question: When does aggregation help?
Complete data losses

◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)

Pros: Popular, well-motivated, admit tractable consistent surrogates
◮ e.g., penalize mistakes at the top of the ranked list more heavily

Cons: Require complete preference data

Idea:
◮ Use aggregation to estimate complete preferences from partial preferences
◮ Plug the estimates into consistent surrogates
◮ Check that aggregation + surrogacy retains consistency
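For reference, a sketch of one of these complete-data metrics, NDCG (a standard formulation, not taken from the slides); it needs a relevance score for every item, which is exactly the complete information that aggregation is meant to supply:

```python
import numpy as np

def ndcg(relevance, ranking, k=None):
    """NDCG@k for graded relevance.

    relevance: relevance[i] = graded relevance of item i
    ranking:   items ordered from highest to lowest predicted score
    """
    rels = np.asarray(relevance, dtype=float)
    order = np.asarray(ranking)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))   # positions 1, 2, ...
    dcg = np.sum((2.0 ** rels[order] - 1.0) * discounts)
    ideal = np.sort(rels)[::-1][:len(order)]
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts[:len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0
```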
Cascade model for click data [Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]

[Figure: a ranked list of results, positions 1–5]

◮ Person i clicks on the first relevant result, at position k(i)
◮ Relevance probability of item k is p_k
◮ Probability of a click on item k is p_k ∏_{j=1}^{k−1} (1 − p_j)
◮ The ERR loss assumes p is known

Estimate p via maximum likelihood on n clicks:

  ŝ = argmax_{p ∈ [0,1]^m} ∑_{i=1}^n [ log p_{k(i)} + ∑_{j=1}^{k(i)−1} log(1 − p_j) ]

⇒ Consistent ERR minimization under our framework
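Setting the gradient of this log-likelihood to zero gives a closed form (my own derivation from the displayed objective, assuming every session ends in exactly one click): p̂_k is the fraction of sessions that clicked at position k among those that examined position k.

```python
import numpy as np

def cascade_mle(click_positions, m):
    """Maximum-likelihood relevance probabilities under the cascade model.

    click_positions: k(i) for each of the n sessions (1-indexed positions)
    m:               number of ranked items
    """
    clicks = np.asarray(click_positions)
    p_hat = np.zeros(m)
    for k in range(1, m + 1):
        examined = np.sum(clicks >= k)            # sessions that reached position k
        if examined > 0:
            p_hat[k - 1] = np.sum(clicks == k) / examined
    return p_hat
```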
Benefits of aggregation

◮ Tractable consistency for partial preference losses

    argmin_f lim_{k→∞} E[ϕ(f(Q), s_k(Y_1, …, Y_k))]
      ⇒ argmin_f lim_{k→∞} E[L(f(Q), s_k(Y_1, …, Y_k))]

◮ Use complete data losses with realistic partial preference data
◮ Models the process of generating relevance scores from clicks/comparisons
What remains?

Before aggregation, we had

  argmin_f (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)  →  argmin_f E[ϕ(f(Q), Y)]
          (empirical)                                (population)

What's a suitable empirical analogue R̂_{ϕ,n}(f) with aggregation?