Machine Learning for Ranking - CE-324: Modern Information Retrieval (PowerPoint PPT Presentation)


  1. Machine Learning for Ranking CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2018 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

  2. Machine learning for IR ranking? } We’ve looked at methods for ranking docs in IR } Cosine similarity, idf, proximity, pivoted doc length normalization, doc authority, … } We’ve looked at methods for classifying docs using supervised machine learning classifiers } Naïve Bayes, Rocchio, kNN } Surely we can also use machine learning to rank the docs displayed in search results? } Sounds like a good idea } A.k.a. “machine-learned relevance” or “learning to rank” 2

  3. Machine learning for IR ranking } Actively researched – and actively deployed by major web search engines } In the last decade } Why didn’t it happen earlier? } Modern supervised ML has been around for about 20 years… } Naïve Bayes has been around for about 50 years… 3

  4. Machine learning for IR ranking } The IR community wasn’t very connected to the ML community } But there were a whole bunch of precursors: } Wong, S.K. et al. 1988. Linear structure in information retrieval. SIGIR 1988. } Fuhr, N. 1992. Probabilistic methods in information retrieval. Computer Journal. } Gey, F. C. 1994. Inferring probability of relevance using the method of logistic regression. SIGIR 1994. } Herbrich, R. et al. 2000. Large Margin Rank Boundaries for Ordinal Regression. Advances in Large Margin Classifiers. 4

  5. Why weren’t early attempts very successful/influential? } Sometimes an idea just takes time to be appreciated… } Limited training data } Especially for real world use (as opposed to writing academic papers), it was very hard to gather test collection queries and relevance judgments } This has changed, both in academia and industry } Poor machine learning techniques } Insufficient customization to IR problem } Not enough features for ML to show value 5

  6. Why wasn’t ML much needed? } Traditional ranking functions in IR used a very small number of features, e.g., } Term frequency } Inverse document frequency } Doc length } It was easy to tune weighting coefficients by hand } And people did 6

  7. Why is ML needed now? } Modern systems – especially on the Web – use a great number of features: } Arbitrary useful features – not a single unified model } Log frequency of query word in anchor text? } Query word in color on page? } # of images on page? } # of (out) links on page? } PageRank of page? } URL length? } URL contains “~”? } Page edit recency? } Page length? } The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features. 7

  8. Weighted zone scoring } Simple example: } Only two zones: title and body } score(d, q) = β·s_T(d, q) + (1 − β)·s_B(d, q), with 0 ≤ β ≤ 1 } We intend to find an optimal value for β 8

  9. Weighted zone scoring } Training examples: each example is a tuple consisting of a query q and a doc d, together with a relevance judgment r for d on q. } In the simplest form, each relevance judgment is either Relevant or Non-relevant. } β is “learned” from the examples, so that the learned scores approximate the relevance judgments in the training examples. 9

  10. Weighted zone scoring } Training set: D = { (d_1, q_1, r^(1)), … , (d_n, q_n, r^(n)) } } Find the minimum of the following cost function: J(β) = Σ_{i=1..n} ( r^(i) − β·s_T(d_i, q_i) − (1 − β)·s_B(d_i, q_i) )² } β* = argmin_{0 ≤ β ≤ 1} J(β) 10
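The cost J(β) above is quadratic in β, so it can be minimized in closed form. Below is a minimal Python sketch of that fit, assuming the zone scores and judgments are given as plain lists; the function name fit_beta and the toy numbers are made up for illustration, not part of the lecture.

```python
import numpy as np

def fit_beta(s_title, s_body, rel):
    """Least-squares fit of the zone weight beta in
    score(d, q) = beta * s_T(d, q) + (1 - beta) * s_B(d, q).
    s_title, s_body: zone scores per training example; rel: judgments r^(i).
    Hypothetical helper, shown only to illustrate minimizing J(beta)."""
    s_t, s_b, r = map(np.asarray, (s_title, s_body, rel))
    # J(beta) = sum_i (r_i - beta*s_t_i - (1 - beta)*s_b_i)^2 is quadratic in beta,
    # so dJ/dbeta = 0 gives a closed-form solution; clip it into [0, 1].
    diff = s_t - s_b
    denom = np.sum(diff ** 2)
    beta = np.sum((r - s_b) * diff) / denom if denom > 0 else 0.5
    return float(np.clip(beta, 0.0, 1.0))

# Toy usage with made-up zone scores and judgments
print(fit_beta([1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 0]))
```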

  11. Weighted zone scoring: special case } Boolean match in each zone: s_T(d, q), s_B(d, q) ∈ {0, 1} } Boolean relevance judgment: r ∈ {0, 1} } β* = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n) } n_10r: number of training examples with s_T = 1, s_B = 0, and relevant (r = 1) } n_01n: number of training examples with s_T = 0, s_B = 1, and nonrelevant (r = 0) 11
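For this Boolean special case, the optimum reduces to the counting formula above. A small sketch, assuming training examples are given as (s_T, s_B, r) triples with values in {0, 1}; the helper name and toy data are hypothetical.

```python
def fit_beta_boolean(examples):
    """Closed-form beta for Boolean zone scores and Boolean judgments:
    beta* = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n).
    Examples where both zones match, or neither does, do not affect beta."""
    n_10r = n_10n = n_01r = n_01n = 0
    for s_t, s_b, r in examples:
        if s_t == 1 and s_b == 0:
            n_10r += (r == 1)
            n_10n += (r == 0)
        elif s_t == 0 and s_b == 1:
            n_01r += (r == 1)
            n_01n += (r == 0)
    denom = n_10r + n_10n + n_01r + n_01n
    return (n_10r + n_01n) / denom if denom else 0.5

# Toy usage: title-only matches that are relevant push beta toward 1
print(fit_beta_boolean([(1, 0, 1), (1, 0, 1), (0, 1, 0), (0, 1, 1)]))  # 0.75
```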

  12. Simple example: Using classification for ad hoc IR } Collect a training corpus of (q, d, r) triples } Relevance r is here binary (but may be multiclass, e.g., 3–7 levels) } Doc is represented by a feature vector } x = (x_1, x_2) } x_1: cosine similarity } x_2: minimum query window size, called ω (shortest text span including all query words; a sketch for computing it follows this slide) } Query term proximity is a very important new weighting factor } Train a model to predict the class r of a doc-query pair 12
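The second feature, the minimum query window ω, is just the length of the shortest span of the document containing every query term. A minimal sketch of one way to compute it, assuming a tokenized document; the function name and the simple O(n·|q|) sweep are illustrative choices, not the course's implementation.

```python
def min_window_size(doc_tokens, query_terms):
    """Minimum query window omega: length (in tokens) of the shortest span of
    doc_tokens containing every distinct query term at least once.
    Returns None if some query term does not occur in the doc."""
    needed = set(query_terms)
    if not needed:
        return 0
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t] for t in needed}
    if any(not p for p in positions.values()):
        return None
    best = None
    for start, tok in enumerate(doc_tokens):
        if tok not in needed:
            continue
        ends = []  # earliest occurrence >= start of each query term
        for t in needed:
            later = [p for p in positions[t] if p >= start]
            if not later:
                ends = None
                break
            ends.append(later[0])
        if ends is None:
            break  # some term never occurs again; later starts cannot help
        window = max(ends) - start + 1
        best = window if best is None else min(best, window)
    return best

print(min_window_size("a b q1 c q2 d q1 q2".split(), ["q1", "q2"]))  # -> 2
```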

  13. Simple example: Using classification for ad hoc IR } Training data 13

  14. Simple example: Using classification for ad hoc IR } A linear score function is then Score(d, q) = w_1·CosineScore + w_2·ω + w_0 } And the linear classifier is: Decide relevant if Score(d, q) > 0 } just like when we were doing text classification 14
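Any off-the-shelf linear classifier can be fit on these two features to obtain the weights w_1, w_2, w_0. A minimal sketch using scikit-learn's logistic regression (a linear SVM would work the same way); the feature values and labels below are made up, not the training data from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: rows are (cosine score, query window size omega),
# label 1 = relevant, 0 = nonrelevant.
X = np.array([[0.04, 2], [0.03, 3], [0.05, 4], [0.01, 2], [0.02, 5], [0.005, 4]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression()
clf.fit(X, y)

w1, w2 = clf.coef_[0]            # weights on CosineScore and omega
w0 = clf.intercept_[0]           # bias term
print(f"Score(d, q) = {w1:.2f} * cosine + {w2:.2f} * omega + {w0:.2f}")

# Decide relevant iff the learned linear score is positive:
print(clf.predict([[0.045, 3]]))
```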

  15. Simple example: Using classification for ad hoc IR } [Figure: training examples plotted with term proximity ω (0–5) on the x-axis and cosine score (up to 0.05) on the y-axis; R marks relevant and N nonrelevant doc-query pairs, separated by a linear decision surface.] 15

  16. More complex example of using classification for search ranking [Nallapati 2004] } We can generalize this to classifier functions over more features } We can use methods we have seen previously for learning the linear classifier weights 16

  17. Classifier for information retrieval } g(d, q) = w_1·f_1(d, q) + ⋯ + w_m·f_m(d, q) + b } Training: g(d, q) must be negative for nonrelevant docs and positive for relevant docs } Testing: decide relevant iff g(d, q) ≥ 0 } Features are not word presence features } To deal with query words not in your training data } but scores like the summed (log) tf of all query terms } Unbalanced data } Problem: it can result in trivial always-say-nonrelevant classifiers } A solution: undersampling nonrelevant docs during training (just take some at random; see the sketch after this slide) 17
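The undersampling fix can be as simple as keeping all relevant examples and a random subset of the nonrelevant ones before training. A sketch under that assumption; the helper name, ratio parameter, and toy data are hypothetical.

```python
import random

def undersample(examples, ratio=1.0, seed=0):
    """Randomly keep only some nonrelevant examples so the classifier is not
    dominated by the nonrelevant class.
    examples: list of (feature_vector, r) with r = 1 (relevant) or 0 (nonrelevant)
    ratio: desired nonrelevant-to-relevant ratio after sampling"""
    rel = [e for e in examples if e[1] == 1]
    nonrel = [e for e in examples if e[1] == 0]
    rng = random.Random(seed)
    keep = min(len(nonrel), int(ratio * len(rel)))
    balanced = rel + rng.sample(nonrel, keep)
    rng.shuffle(balanced)
    return balanced

# Toy usage: 2 relevant + 6 nonrelevant examples -> 2 relevant + 2 sampled nonrelevant
data = [([0.9], 1), ([0.8], 1)] + [([0.1 * i], 0) for i in range(6)]
print(len(undersample(data)))  # 4
```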

  18. An SVM classifier for ranking [Nallapati 2004] } Experiments: } 4 TREC data sets } Comparisons with Lemur, a state-of-the-art open source IR engine (Language Model (LM)-based – see IIR ch. 12) } 6 features, all variants of tf, idf, and tf.idf scores 18

  19. An SVM classifier for ranking [Nallapati 2004]
Train \ Test      Disk 3    Disk 4-5   WT10G (web)
Disk 3     LM     0.1785    0.2503     0.2666
           SVM    0.1728    0.2432     0.2750
Disk 4-5   LM     0.1773    0.2516     0.2656
           SVM    0.1646    0.2355     0.2675
} At best the results are about equal to LM } Actually a little bit below } Paper’s advertisement: easy to add more features } This is illustrated on a homepage finding task on WT10G: } Baseline LM 52% p@10, baseline SVM 58% } SVM with URL-depth and in-link features: 78% p@10 19

  20. Learning to rank } Classification probably isn’t the right way. } Classification: map to an unordered set of classes } Regression: map to a real value } Ordinal regression: map to an ordered set of classes } A fairly obscure sub-branch of statistics, but what we want here } Ordinal regression formulation gives extra power: } Relations between relevance levels are modeled } A number of categories of relevance: } These are totally ordered: c_1 < c_2 < ⋯ < c_J } Training data: each doc-query pair is represented as a feature vector ψ_i with relevance ranking c_i as its label 20

  21. Sec. 15.4.2 The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002] } Ranking Model: f(d)

  22. Sec. 15.4.2 Pairwise learning: The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002] } Aim is to classify instance pairs as correctly ranked or incorrectly ranked } This turns an ordinal regression problem back into a binary classification problem in an expanded space } We want a ranking function f such that c_i > c_k iff f(ψ_i) > f(ψ_k) } … or at least one that tries to do this with minimal error } Suppose that f is a linear function f(ψ_i) = w · ψ_i

  23. Sec. 15.4.2 The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002] } Then (combining c_i > c_k iff f(ψ_i) > f(ψ_k) and f(ψ_i) = w · ψ_i): c_i > c_k iff w · (ψ_i − ψ_k) > 0 } Let us then create a new instance space from such pairs: Φ_u = Φ(d_i, d_k, q) = ψ_i − ψ_k, with z_u = +1, 0, −1 as c_i >, =, < c_k } We can build a model over just the cases for which z_u = −1 } From training data S = {Φ_u}, we train an SVM
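Putting the last two slides together: form difference vectors ψ_i − ψ_k for pairs with different relevance grades and train an ordinary linear SVM on them. A minimal sketch with scikit-learn; the per-document features and grades are invented, and both orientations of each pair are added so the binary problem has both classes.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_transform(psi, c):
    """Build pairwise examples (psi_i - psi_k, z) from one query's docs.
    psi: array of per-document feature vectors; c: relevance grades.
    Pairs with equal grades (z = 0) are skipped."""
    X_pairs, z = [], []
    for i, k in combinations(range(len(c)), 2):
        if c[i] == c[k]:
            continue
        sign = 1 if c[i] > c[k] else -1
        X_pairs.append(psi[i] - psi[k]); z.append(sign)
        X_pairs.append(psi[k] - psi[i]); z.append(-sign)  # mirrored pair
    return np.array(X_pairs), np.array(z)

# Toy data for one query: 4 docs, 2 features each, graded relevance 2 > 1 > 0
psi = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.2], [0.1, 0.9]])
c = np.array([2, 1, 1, 0])

X_pairs, z = pairwise_transform(psi, c)
svm = LinearSVC().fit(X_pairs, z)      # linear SVM in the expanded pairwise space

w = svm.coef_[0]                       # ranking function f(psi) = w . psi
print(np.argsort(-psi @ w))            # doc indices ordered by the learned score
```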

  24. Two queries in the pairwise space

  25. Ranking by a classifier } Assume that the ranked list of docs for a set of sample queries is available as training data } Or even a set of training data in the form of (d, d′, q, z) is available } This form of training data can also be derived from the available ranked lists of docs for sample queries, if they exist } Again we can use an SVM classifier for ranking 25

  26. Ranking by a classifier } Φ(d, q) = (s_1(d, q), … , s_m(d, q)) } Φ(d′, q) = (s_1(d′, q), … , s_m(d′, q)) } It seeks a vector w in the space of scores (constructed as above) such that w · Φ(d, q) ≥ w · Φ(d′, q) for each d that precedes d′ in the ranked list of docs for q available in the training data } A linear classifier like an SVM can be used, with training data constructed as the following pairs of feature vector and label: (d, d′, q, z) ⇒ (Φ(d, q) − Φ(d′, q), z), where z = +1 if d must precede d′ and z = −1 if d′ must precede d 26
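The training pairs (Φ(d, q) − Φ(d′, q), z) can be generated directly from a ranked list of docs for each sample query and then fed to a linear SVM exactly as in the earlier pairwise sketch. A small illustration, assuming per-doc feature vectors are available; the helper name and toy values are made up.

```python
import numpy as np

def pairs_from_ranked_list(phi_by_doc, ranking):
    """Derive pairwise training examples from one query's ranked list.
    phi_by_doc: dict doc_id -> feature vector Phi(d, q)
    ranking: doc ids ordered from best to worst for this query
    Yields (Phi(d, q) - Phi(d', q), z) for every d that precedes d'."""
    for i, d in enumerate(ranking):
        for d_prime in ranking[i + 1:]:
            diff = np.asarray(phi_by_doc[d]) - np.asarray(phi_by_doc[d_prime])
            yield diff, +1     # d must precede d'
            yield -diff, -1    # mirrored example for the other orientation

# Toy usage: 3 docs with 2 scores each, ranked doc2 > doc1 > doc3 for query q
phi = {"doc1": [0.4, 0.2], "doc2": [0.7, 0.5], "doc3": [0.1, 0.3]}
X, z = zip(*pairs_from_ranked_list(phi, ["doc2", "doc1", "doc3"]))
print(len(X), z)   # 6 difference vectors with labels +1 / -1
```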
