Machine Learning for Ranking
CE-324: Modern Information Retrieval
Sharif University of Technology
- M. Soleymani
Fall 2018
Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
} We have seen IR ranking methods: cosine similarity, idf, proximity, pivoted doc length normalization, …
} We have also seen supervised ML classifiers: Naïve Bayes, Rocchio, kNN
} Can we also use machine learning to rank the docs shown in search results?
} Sounds like a good idea
} A.k.a. "machine-learned relevance" or "learning to rank"
} This idea has been actively researched, and deployed by major search engines, only in the last decade
} Why didn't it happen earlier?
} Modern supervised ML has been around for about 20 years…
} Naïve Bayes has been around for about 50 years…
} Wong, S.K.M. et al. 1988. Linear structure in information retrieval. SIGIR 1988.
} Fuhr, N. 1992. Probabilistic models in information retrieval. The Computer Journal.
} Gey, F.C. 1994. Inferring probability of relevance using the method of logistic regression. SIGIR 1994.
} Herbrich, R. et al. 2000. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, MIT Press.
} Especially for real world use (as opposed to writing academic papers), it was hard to gather test-collection queries and relevance judgments that are representative of real user needs
} This has changed, both in academia and industry
} Traditional IR ranking functions used a very small number of features:
} Term frequency
} Inverse document frequency
} Doc length
} It was easy to tune weighting coefficients by hand, and people did
} Arbitrary useful features – not a single unified model
} Log frequency of query word in anchor text?
} Query word in color on page?
} # of images on page?
} # of (out) links on page?
} PageRank of page?
} URL length?
} URL contains "~"?
} Page edit recency?
} Page length?
} Only two zones:
} title, body
} We intend to find an optimal value for 𝛽
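A minimal sketch of learning 𝛽, assuming the weighted-zone score from the IIR textbook, score(𝑑, 𝑞) = 𝛽·s_T(𝑑, 𝑞) + (1 − 𝛽)·s_B(𝑑, 𝑞) with Boolean zone scores s_T (title) and s_B (body); the training tuples are hypothetical, and 𝛽 is fit by a simple grid search:

```python
# Sketch: learn the title/body weight beta for weighted zone scoring.
# Assumes score(d, q) = beta*s_T + (1 - beta)*s_B with Boolean zone
# scores; the training tuples below are hypothetical.

def weighted_zone_score(s_title: int, s_body: int, beta: float) -> float:
    return beta * s_title + (1 - beta) * s_body

# (s_T, s_B, relevant?) judgments for doc-query pairs
train = [(1, 1, True), (1, 0, True), (0, 1, False), (1, 0, False), (0, 1, True)]

def error(beta: float) -> float:
    """Squared error of the zone score against the 0/1 relevance label."""
    return sum((weighted_zone_score(st, sb, beta) - int(rel)) ** 2
               for st, sb, rel in train)

# Simple grid search over candidate beta values in [0, 1]
best_beta = min((b / 100 for b in range(101)), key=error)
print(f"optimal beta ~ {best_beta:.2f}")
```

Since the error is quadratic in 𝛽, the optimum could also be found analytically; the grid search just keeps the sketch transparent.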
} In the simplest form, each relevance judgment is either relevant or nonrelevant
} Relevance r is here binary (but in general it may be multiclass, with e.g. 3–7 levels)
} 𝒚 = (𝑦₁, 𝑦₂)
} 𝑦₁: cosine similarity
} 𝑦₂: minimum query window size (shortest text span including all query words)
} Query term proximity is a very important new weighting factor
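Since the window-size feature may be less familiar, here is a sketch of computing it; min_query_window is a hypothetical helper, not from the slides, using the standard sliding-window technique over the doc's token sequence:

```python
from collections import Counter

def min_query_window(doc_tokens: list[str], query_terms: set[str]) -> int | None:
    """Length (in tokens) of the shortest span of doc_tokens that contains
    every query term at least once; None if some term never occurs."""
    need = set(query_terms)
    if not need.issubset(doc_tokens):
        return None
    counts: Counter = Counter()
    covered = 0                      # distinct query terms inside the window
    best = len(doc_tokens)
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in need:
            counts[tok] += 1
            if counts[tok] == 1:
                covered += 1
        while covered == len(need):  # window covers all terms: try shrinking
            best = min(best, right - left + 1)
            ltok = doc_tokens[left]
            if ltok in need:
                counts[ltok] -= 1
                if counts[ltok] == 0:
                    covered -= 1
            left += 1
    return best

doc = "the quick query ran to the terms of the query".split()
print(min_query_window(doc, {"query", "terms"}))  # -> 4
```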
[Figure: training doc-query pairs plotted in the two-feature plane; relevant (R) and nonrelevant (N) examples are roughly separated by a linear decision boundary]
} Score(𝑑, 𝑞) = 𝑤₁𝑥₁(𝑑, 𝑞) + ⋯ + 𝑤ₙ𝑥ₙ(𝑑, 𝑞) + 𝑐
} Training: Score(𝑑, 𝑞) must be negative for nonrelevant docs and positive for relevant docs
} Testing: decide relevant iff Score(𝑑, 𝑞) ≥ 0
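A sketch of this classification setup, assuming scikit-learn (the slides do not prescribe a library) and hypothetical doc-query pairs described by the two features above:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical training data: one row per (doc, query) pair,
# columns = (cosine similarity, minimum query window size)
X = np.array([[0.80, 3], [0.65, 2], [0.70, 15], [0.30, 4],
              [0.90, 2], [0.20, 30], [0.50, 8], [0.85, 5]])
r = np.array([1, 1, 0, 0, 1, 0, 0, 1])   # 1 = relevant, 0 = nonrelevant

clf = LinearSVC()   # learns the weights w_i and the intercept c
clf.fit(X, r)

# Score(d, q) = w1*x1(d, q) + w2*x2(d, q) + c; relevant iff Score >= 0
scores = X @ clf.coef_[0] + clf.intercept_[0]
print([(round(s, 2), s >= 0) for s in scores])
```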
} To deal with query words not in your training data, use features that aggregate over the query, e.g. scores like the summed (log) tf of all query terms
} Problem: it can result in trivial always-say-nonrelevant classifiers
} A solution: undersample nonrelevant docs during training (just take some at random)
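A small sketch of that undersampling step on hypothetical judgments; the 1:1 ratio is an arbitrary choice:

```python
import random

random.seed(0)

# Hypothetical judgments: (doc_id, relevant?), heavily skewed to nonrelevant
judged = [(i, i % 20 == 0) for i in range(1000)]
relevant = [d for d in judged if d[1]]
nonrelevant = [d for d in judged if not d[1]]

# Keep every relevant doc; take an equal-sized random sample of nonrelevant
balanced = relevant + random.sample(nonrelevant, k=len(relevant))
random.shuffle(balanced)
print(len(relevant), len(nonrelevant), len(balanced))  # 50 950 100
```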
} 4 TREC data sets
} Comparisons with Lemur, a state-of-the-art open source IR engine
} 6 features, all variants of tf, idf, and tf.idf scores
} Results were about equal to Lemur's, actually a little bit below
} But with more features, the SVM can do better; this is illustrated on a homepage finding task on WT10G:
} Baseline LM: 52% p@10; baseline SVM: 58%
} SVM with URL-depth and in-link features: 78% p@10
} Classification: map to an unordered set of classes
} Regression: map to a real value
} Ordinal regression: map to an ordered set of classes
} A fairly obscure sub-branch of statistics, but what we want here
} Relations between relevance levels are modeled
} A number of categories of relevance:
} These are totally ordered: 𝑐₁ < 𝑐₂ < ⋯ < 𝑐ⱼ
} Training data: each doc-query pair represented as a feature vector 𝜓ᵢ = 𝜓(𝑑ᵢ, 𝑞), together with its relevance category 𝑐ᵢ
} Aim: classify instance pairs as correctly ranked or incorrectly ranked
} This turns an ordinal regression problem back into a binary classification problem
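A sketch of that reduction on hypothetical graded data: every pair of docs for the same query with different grades becomes one binary example, the difference of their feature vectors labeled by which doc should rank higher:

```python
import itertools
import numpy as np

def to_pairwise(psi: np.ndarray, grades: np.ndarray):
    """Turn graded doc-query feature vectors (one query) into binary pairs.

    For docs i, j with grades[i] != grades[j] the example is
    Phi = psi_i - psi_j, class z = +1 if doc i should rank higher, else -1.
    """
    Phi, z = [], []
    for i, j in itertools.combinations(range(len(psi)), 2):
        if grades[i] == grades[j]:
            continue                    # no preference, no training pair
        Phi.append(psi[i] - psi[j])
        z.append(1 if grades[i] > grades[j] else -1)
    return np.array(Phi), np.array(z)

# Hypothetical: 4 docs for one query, 2 features each, grades in {0, 1, 2}
psi = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.3], [0.7, 0.2]])
grades = np.array([2, 1, 0, 1])
Phi, z = to_pairwise(psi, grades)
print(Phi.shape, z)   # (5, 2) [ 1  1  1  1 -1]  (equal-grade pair skipped)
```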
} This form of training data can also be derived from observed user clickthrough data (e.g., Joachims 2002)
} Training examples: pairs Φᵤ = 𝜓ᵢ − 𝜓ⱼ with preference label 𝑧ᵤ
} Find 𝒘 and slacks ξᵤ ≥ 0 such that ½𝒘ᵀ𝒘 + C Σᵤ ξᵤ is minimized, and
} for all Φᵤ such that 𝑧ᵤ < 0, 𝒘 · Φᵤ ≥ 1 − ξᵤ
} Using things like vector space model scores as features
} As we shall see, it outperforms standard IR in evaluations
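Putting the pieces together, a compact sketch of a ranking SVM under the same assumptions (hypothetical data; scikit-learn's LinearSVC standing in for a generic soft-margin linear SVM). Note that ranking at query time needs no pairs; docs are simply sorted by 𝒘 · 𝜓:

```python
import itertools
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical graded data for one query: feature vectors psi
# (e.g., vector space model scores) and relevance grades
psi = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.3], [0.7, 0.2], [0.1, 0.9]])
grades = np.array([2, 1, 0, 1, 0])

# Pairwise difference vectors Phi = psi_i - psi_j, class = preference sign
Phi, z = [], []
for i, j in itertools.combinations(range(len(psi)), 2):
    if grades[i] != grades[j]:
        Phi.append(psi[i] - psi[j])
        z.append(1 if grades[i] > grades[j] else -1)

svm = LinearSVC()                 # soft-margin linear SVM on the pair vectors
svm.fit(np.array(Phi), np.array(z))
w = svm.coef_[0]

# At query time no pairs are needed: rank docs by w . psi
order = np.argsort(-(psi @ w))
print("ranked doc indices:", order)
```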
} The ranking SVM considers all ordering violations as the same
} But mis-orderings near the top of the ranked list matter more to users
} Pointwise approach: predicting class label or relevance score
} Pairwise approach: predicting relative order is closer to the nature of ranking than predicting class label or relevance score
} Input is a pair of results for a query, and the class is the relevance ordering relationship between them
} Listwise approach: learns a ranking function
} Models the ranking problem in a straightforward fashion
} Can overcome the drawbacks of the above approaches by tackling the ranking problem directly
} When the number of associated docs varies largely across queries, the overall loss is dominated by the queries with many docs
} The position of docs in the ranked list is invisible to the pointwise loss function
} The pointwise loss may thus unconsciously emphasize too much those low-ranked docs that matter little to users
} This turns a ranking problem back into a binary classification problem
} Predicting relative order is closer to the nature of ranking than predicting class label or relevance score
} But the distribution of the number of doc pairs across queries is even more skewed than the distribution of the number of docs (e.g., 1,000 judged docs yield ~500,000 pairs, while 10 docs yield only 45)
} Log term frequency, idf, pivoted length normalization