

  1. Statistical Ranking Problem
     Tong Zhang, Statistics Department, Rutgers University

  2. Ranking Problems
     • Rank a set of items and display them to users in the corresponding order.
     • Two issues: performance at the top and dealing with a large search space.
       – web-page ranking
         ∗ rank pages for a query
         ∗ theoretical analysis with an error criterion focusing on the top
       – machine translation
         ∗ rank possible (English) translations for a given input (Chinese) sentence
         ∗ algorithms that handle a large search space

  3. Web-Search Problem
     • The user types a query; the search engine returns a result page:
       – selects from billions of pages.
       – assigns a score to each page, and returns pages ranked by the scores.
     • Quality of a search engine:
       – relevance (whether returned pages are on topic and authoritative)
       – other issues
         ∗ presentation (diversity, perceived relevance, etc.)
         ∗ personalization (predict user-specific intent)
         ∗ coverage (size and quality of the index)
         ∗ freshness (whether contents are timely)
         ∗ responsiveness (how quickly the search engine responds to the query)

  4. Relevance Ranking: Statistical Learning Formulation
     • Training:
       – randomly select queries q, and web pages p for each query.
       – use editorial judgment to assign a relevance grade y(p, q).
       – construct a feature vector x(p, q) for each query/page pair.
       – learn a scoring function f̂(x(p, q)) that preserves the order of y(p, q) for each q.
     • Deployment:
       – a query q comes in.
       – return pages p_1, ..., p_m in descending order of f̂(x(p, q)).

  5. Measuring Ranking Quality
     • Given a scoring function f̂, return the ordered page list p_1, ..., p_m for a query q.
       – only the order information is important.
       – should focus on the relevance of returned pages near the top.
     • DCG (discounted cumulative gain) with decreasing weights c_i:
       DCG(f̂, q) = Σ_{i=1}^{m} c_i r(p_i, q).
     • c_i reflects the effort (or likelihood) of the user clicking on the i-th position.
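The DCG measure above can be sketched in a few lines. The slide leaves the weights c_i abstract; the common logarithmic choice c_i = 1/log2(i + 1) (positions numbered from 1) is assumed here purely for illustration.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain: sum_i c_i * r(p_i, q).

    `relevances` lists the editorial grades r(p_i, q) of the returned
    pages in ranked order.  The weights c_i = 1 / log2(i + 1) are an
    assumption; any decreasing sequence fits the slide's definition.
    """
    if k is not None:
        relevances = relevances[:k]
    return sum(r / math.log2(i + 1)
               for i, r in enumerate(relevances, start=1))
```

Because the weights decrease, swapping a relevant page toward the top strictly increases the score, which is exactly the "focus on the top" property the slide emphasizes.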

  6. Subset Ranking Model
     • x ∈ X: feature (x(p, q) ∈ X)
     • S ∈ S: subset of X ({x_1, ..., x_m} = {x(p, q) : p} ∈ S)
       – each subset corresponds to a fixed query q.
       – assume each subset has size m for convenience; m is large.
     • y: quality grade of each x ∈ X (y(p, q)).
     • scoring function f: X × S → R.
       – ranking function r_f(S) = {j_i}: ordering of S ∈ S based on the scoring function f.
     • quality: DCG(f, S) = Σ_{i=1}^{m} c_i E_{y_{j_i} | (x_{j_i}, S)} y_{j_i}.

  7. Some Theoretical Questions
     • Learnability:
       – the subset size m is huge: do we need many samples (rows) to learn?
       – focusing quality on the top.
     • Learning method:
       – regression.
       – pair-wise learning? other methods?
     • Limited goal addressed here:
       – can we learn ranking by using regression when m is large?
         ∗ massive data size (more than 20 billion)
         ∗ want to derive error bounds independent of m.
       – what are some feasible algorithms and their statistical implications?

  8. Bayes Optimal Scoring
     • Given a set S ∈ S, for each x_j ∈ S, define the Bayes-scoring function as
       f_B(x_j, S) = E_{y_j | (x_j, S)} y_j.
     • The optimal Bayes ranking function r_{f_B} maximizes DCG:
       – induced by f_B.
       – returns a rank list J = [j_1, ..., j_m] in descending order of f_B(x_{j_i}, S).
       – not necessarily unique (depending on c_j).
     • The function is subset dependent: requires appropriate result-set features.

  9. Simple Regression
     • Given subsets S_i = {x_{i,1}, ..., x_{i,m}} and corresponding relevance scores {y_{i,1}, ..., y_{i,m}}.
     • We can estimate f_B(x_j, S) using regression in a family F:
       f̂ = arg min_{f ∈ F} Σ_{i=1}^{n} Σ_{j=1}^{m} (f(x_{i,j}, S_i) − y_{i,j})².
     • Problem: m is massive (> 20 billion)
       – computationally inefficient
       – statistically slow convergence
         ∗ ranking error bounded by O(√m) × root-mean-squared error.
     • Solution: should emphasize estimation quality at the top.
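A minimal sketch of this plain regression estimate, assuming for illustration a linear family F = {x ↦ xᵀβ} (the slide leaves F abstract). Note that every one of the n·m terms enters the objective with equal weight, which is exactly the inefficiency the slide points out when m is massive.

```python
import numpy as np

def fit_simple_regression(subsets, grades):
    """Unweighted least-squares estimate of the scoring function.

    `subsets` is a list of n arrays, each of shape (m, d): the feature
    vectors x_{i,1..m} for one query.  `grades` holds the matching
    relevance scores y_{i,1..m}.  The linear family is an assumption;
    the structure (sum over all queries and all m pages) follows the
    slide's objective.
    """
    X = np.vstack(subsets)          # stack all n*m rows
    y = np.concatenate(grades)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```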

  10. Importance Weighted Regression
     • Some samples are more important than others (focus on the top).
     • A revised formulation: f̂ = arg min_{f ∈ F} (1/n) Σ_{i=1}^{n} L(f, S_i, {y_{i,j}}_j), with
       L(f, S, {y_j}_j) = Σ_{j=1}^{m} w(x_j, S) (f(x_j, S) − y_j)² + u sup_j w′(x_j, S) (f(x_j, S) − δ(x_j, S))_+².
     • weight w: importance weighting focusing the regression error on the top
       – zero for irrelevant pages.
     • weight w′: large for irrelevant pages
       – for which f(x_j, S) should stay below the threshold δ.
     • importance weighting can be implemented through importance sampling.
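The revised per-query loss can be written down directly. One assumption in this sketch: the truncation at the threshold δ is read as a positive part (·)_+, so f is only penalized where it exceeds δ on (heavily w′-weighted) irrelevant pages.

```python
import numpy as np

def weighted_loss(f_scores, y, w, w_prime, delta, u):
    """Per-subset loss L(f, S, {y_j}) for one query.

    All arguments are length-m vectors except the scalar u: `f_scores`
    holds f(x_j, S), `w` and `w_prime` the two importance weights,
    `delta` the thresholds.  The first term is a weighted regression
    error; the second takes the sup over j, as on the slide.
    """
    sq_err = w * (f_scores - y) ** 2
    excess = np.maximum(f_scores - delta, 0.0)   # (f - delta)_+
    return sq_err.sum() + u * np.max(w_prime * excess ** 2)
```

With w zero on irrelevant pages, the regression term involves only the few relevant documents, which is what later makes the bounds independent of m.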

  11. Relationship of Regression and Ranking
     Let Q(f) = E_S L(f, S), where
       L(f, S) = E_{{y_j}_j | S} L(f, S, {y_j}_j)
               = Σ_{j=1}^{m} w(x_j, S) E_{y_j | (x_j, S)} (f(x_j, S) − y_j)² + u sup_j w′(x_j, S) (f(x_j, S) − δ(x_j, S))_+².
     Theorem 1. Assume that c_i = 0 for all i > k. Under appropriate parameter choices with some constants u and γ, for all f:
       DCG(r_B) − DCG(r_f) ≤ C(γ, u) (Q(f) − inf_{f′} Q(f′))^{1/2}.
     Key point: focus on relevant documents at the top; Σ_j w(x_j, S) is much smaller than m.

  12. Generalization Performance with Square Regularization
     Consider scoring f_β̂(x, S) = β̂^T ψ(x, S), with feature vector ψ(x, S):
       β̂ = arg min_{β ∈ H} [ (1/n) Σ_{i=1}^{n} L(β, S_i, {y_{i,j}}_j) + λ β^T β ],   (1)
       L(β, S, {y_j}_j) = Σ_{j=1}^{m} w(x_j, S) (f_β(x_j, S) − y_j)² + u sup_j w′(x_j, S) (f_β(x_j, S) − δ(x_j, S))_+².
     Theorem 2. Let M = sup_{x,S} ‖ψ(x, S)‖_2 and W = sup_S [ Σ_{x_j ∈ S} w(x_j, S) + u sup_{x_j ∈ S} w′(x_j, S) ]. Let f_β̂ be the estimator defined in (1). Then we have
       DCG(r_B) − E_{{S_i, {y_{i,j}}_j}_{i=1}^{n}} DCG(r_{f_β̂}) ≤ C(γ, u) [ inf_{β ∈ H} (1 + WM/√(2λn))² (Q(f_β) + λ β^T β) − inf_f Q(f) ]^{1/2}.

  13. Interpretation of Results
     • The result does not depend on m, but on the much smaller quantity
       W = sup_S [ Σ_{x_j ∈ S} w(x_j, S) + u sup_{x_j ∈ S} w′(x_j, S) ].
       – emphasize relevant samples at the top: w is small for irrelevant documents.
       – a refined analysis can replace the sup over S by some p-norm over S.
     • Can control generalization for the top portion of the rank list even with large m.
       – the learning complexity does not depend on the majority of items near the bottom of the rank list.
       – the bottom items are usually easy to estimate.

  14. Key Points
     • Ranking quality near the top is most important.
       – statistical analysis to deal with this scenario.
     • A regression-based algorithm to handle the large search space.
       – importance weighting of regression terms.
       – error bounds independent of the massive web size.

  15. Statistical Translation and Algorithm Challenge
     • Problem:
       – conditioned on a source sentence in one language,
       – generate a target sentence in another language.
     • General approach:
       – scoring function: measures the quality of the translated sentence given the source sentence (similar to web search).
       – search strategy: effectively generate target-sentence candidates.
         ∗ search for the optimal score.
         ∗ structure used in the scoring function (through the block model).
     • Main challenge: exponential growth of the search space.

  16. Graphical Illustration
     [Figure: block translation of an Arabic source sentence into the English target "Israeli warplanes violate Lebanese airspace", segmented as blocks b1 ("Israeli"), b2 ("warplanes"), b3 ("violate"), b4 ("Lebanese airspace").]

  17. Block Sequence Decoder
     • Database: a set of possible translation blocks.
       – e.g., block "a b" translates into potential block "z x y".
     • Scoring function:
       – candidate translation: a block sequence (b_1, ..., b_n).
       – map each sequence of blocks to a non-negative score:
         s_w(b_1^n) = Σ_{i=1}^{n} w^T F(b_{i−1}, b_i; o).
     • Input: source sentence.
     • Translation:
       – generate block sequences consistent with the source sentence.
       – find the sequence with the largest score.
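The additive block-pair score is easy to sketch. Two illustrative assumptions: the orientation argument o is folded into the feature map, and the sequence start is marked by b_0 = None; neither detail is specified on the slide.

```python
import numpy as np

def sequence_score(blocks, w, feature_fn):
    """Score s_w(b_1^n) = sum_i w^T F(b_{i-1}, b_i) of a candidate
    block sequence under the slide's linear model.  `feature_fn` maps
    a (previous block, current block) pair to a feature vector; the
    previous block is None at the sentence start (an assumption).
    """
    total = 0.0
    prev = None
    for b in blocks:
        total += float(np.dot(w, feature_fn(prev, b)))
        prev = b
    return total
```

Because the score decomposes over adjacent block pairs, the decoder can search the block sequence space with dynamic programming or beam search rather than full enumeration.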

  18. Decoder Training
     • Given source/target sentence pairs {(S_i, T_i)}_{i=1,...,n}.
     • Given a decoding scheme implementing ẑ(w, S) = arg max_{z ∈ V(S)} s_w(z).
     • Find a parameter w such that, on the training data, ẑ(w, S_i) achieves a high BLEU score on average.
     • Key difficulty: large search space (similar issues as in MLR).
     • Traditional tuning.
     • Our methods:
       – local training.
       – global training.
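The training objective can be sketched as a toy enumeration, under two loud assumptions: the candidate set V(S) is small enough to list exhaustively (the slide's key difficulty is precisely that it is not), and a generic `quality_fn` stands in for BLEU. The brute-force weight search below mirrors only the spirit of the tuning the slides describe.

```python
def decode(w, candidates, score_fn):
    """z_hat(w, S) = argmax over z in V(S) of s_w(z), assuming V(S)
    is enumerable (an illustrative assumption)."""
    return max(candidates, key=lambda z: score_fn(w, z))

def tune(weight_grid, candidate_sets, references, score_fn, quality_fn):
    """Pick the w whose decoded outputs score best on average under
    `quality_fn` (a stand-in for sentence-level BLEU)."""
    def avg_quality(w):
        return sum(quality_fn(decode(w, V, score_fn), ref)
                   for V, ref in zip(candidate_sets, references)) / len(references)
    return max(weight_grid, key=avg_quality)
```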

  19. Traditional Scoring Function
     • A collection of local statistics gathered from the training data, and beyond:
       – different language models of the target, P(T_i | T_{i−1}, T_{i−2}, ...).
         ∗ can depend on statistics beyond the training data.
         ∗ Google benefited significantly from huge language models.
       – orientation (swap) frequency.
       – block frequency.
       – block quality score:
         ∗ e.g., normalized in-block unigram translation probability:
           (S, T) = ({s_1, ..., s_p}, {t_1, ..., t_q}); P(S | T) = Π_j Σ_i p(s_j | t_i).
     • Linear combination of log-frequencies (five or six features):
       s_w(b_1^n) = Σ_i Σ_j w_j log f_j(b_i; b_{i−1}, ...).
     • Tuning: hand-tuning; gradient-descent adjustment to optimize the BLEU score.
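The log-linear combination of frequency features can be sketched as below. The feature functions here are illustrative placeholders: each is assumed to be a callable returning a positive frequency or probability for the current block given its predecessor, standing in for the language-model, block-frequency, and orientation statistics the slide lists.

```python
import math

def loglinear_score(blocks, w, feature_fns):
    """Traditional score: s_w = sum_i sum_j w_j * log f_j(b_i; b_{i-1}, ...).

    `feature_fns` holds the five-or-six statistics as callables
    f_j(block, prev_block) > 0; `w` holds the matching weights.
    """
    total = 0.0
    prev = None
    for b in blocks:
        total += sum(wj * math.log(fj(b, prev))
                     for wj, fj in zip(w, feature_fns))
        prev = b
    return total
```

Working in log space turns the product of frequencies into a sum, so tuning the handful of weights w_j is a low-dimensional problem even though each f_j may be built from massive corpus statistics.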
