Ranking and Learning
290N UCSB, Tao Yang, 2013
Partially based on Manning, Raghavan, and Schütze's textbook.

Table of Contents
• Weighted scoring for ranking
• Learning to rank: a simple example
• Learning to rank as classification

Scoring: Simple Model of Ranking with Similarity
• Similarity-based approach: score a document by the similarity of its features to the query features.
• Weighted approach: score with weighted features and return, in order, the documents most likely to be useful to the searcher. Each document has a subscore for each feature (or for each subarea of the document).
Similarity ranking: example

Weighted scoring with linear combination
• A simple weighted scoring method: use a linear combination of subscores, e.g.
  Score = 0.6*<Title score> + 0.3*<Abstract score> + 0.1*<Body score>
  The overall score is in [0,1].
• Example with binary subscores: a query term appears in the title and body only, so the document score is (0.6 · 1) + (0.1 · 1) = 0.7.

Example
• On the query "bill rights", suppose that we retrieve the following docs from the various zone indexes:
  Zone       bill          rights
  Abstract   1, 2          …
  Title      3, 5, 8       3, 5, 9
  Body       1, 2, 5, 9    3, 5, 8, 9
• Compute the score for each doc based on the weightings 0.6, 0.3, 0.1 (a code sketch follows this page).

How to determine weights automatically: Motivation
• Modern systems – especially on the Web – use a great number of features:
  – Arbitrary useful features – not a single unified model
  – Log frequency of query word in anchor text? Query word highlighted on page? Span of query words on page? # of (out)links on page? PageRank of page? URL length? URL contains "~"? Page edit recency? Page length?
• Major web search engines use "hundreds" of such features – and they keep changing.
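As a concrete illustration of the weighted zone scoring above, here is a minimal Python sketch for the "bill rights" example. It assumes a zone's Boolean subscore is 1 if the zone contains at least one of the query terms (an OR query), and it omits the abstract postings for "rights" since they are not shown on the slide; both are assumptions, not part of the lecture.

```python
# Minimal sketch of weighted zone scoring for the "bill rights" example.
# Assumption: a zone's subscore is 1 if it contains at least one query term.

from collections import defaultdict

# zone -> term -> set of doc IDs (taken from the slide's zone indexes)
zone_index = {
    "abstract": {"bill": {1, 2}},                            # "rights" postings not shown
    "title":    {"bill": {3, 5, 8}, "rights": {3, 5, 9}},
    "body":     {"bill": {1, 2, 5, 9}, "rights": {3, 5, 8, 9}},
}
weights = {"title": 0.6, "abstract": 0.3, "body": 0.1}

def weighted_zone_scores(query_terms):
    """Return doc -> weighted sum of Boolean zone subscores."""
    scores = defaultdict(float)
    docs = {d for terms in zone_index.values() for post in terms.values() for d in post}
    for doc in docs:
        for zone, terms in zone_index.items():
            matches = any(doc in terms.get(t, set()) for t in query_terms)
            scores[doc] += weights[zone] * (1 if matches else 0)
    return dict(scores)

# Rank the candidate docs by score, highest first
print(sorted(weighted_zone_scores(["bill", "rights"]).items(), key=lambda kv: -kv[1]))
```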
Machine learning for computing weights (Sec. 15.4)
• How do we combine these signals into a good ranker?
  "machine-learned relevance" or "learning to rank"
• Learning from examples
  – These examples are called training data
  – [Diagram: training examples are used to learn a ranking formula; a user query and its matched results are then scored by the formula to produce ranked results]

Learning weights: Methodology
• Given a set of training examples, each containing (query q, document d, relevance score r(d,q)):
  – r(d,q) is the relevance judgment for d on q
  – Simplest scheme: relevant (1) or nonrelevant (0)
  – More sophisticated: graded relevance judgments, e.g. 1 (Bad), 2 (Fair), 3 (Good), 4 (Excellent), 5 (Perfect)
• Learn weights from these examples, so that the learned scores approximate the relevance judgments in the training examples.

Simple example
• Each doc has two zones, Title and Body
• For a chosen w ∈ [0,1], the score for doc d on query q is
  score(d, q) = w · s_T(d, q) + (1 − w) · s_B(d, q)
  where s_T(d, q) ∈ {0,1} is a Boolean denoting whether q matches the Title and s_B(d, q) ∈ {0,1} is a Boolean denoting whether q matches the Body (a code sketch follows this page).

Learning w from training examples
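To make the two-zone scoring concrete, here is a minimal sketch of score(d, q) = w·s_T + (1 − w)·s_B applied to a few training examples. The example judgments are hypothetical stand-ins for the lecture's (not reproduced) table, and the chosen w is arbitrary.

```python
# Minimal sketch of the two-zone score: score(d, q) = w * s_T + (1 - w) * s_B.
# The training examples below are hypothetical.

def score(w, s_title, s_body):
    """Weighted combination of Boolean zone matches, for w in [0, 1]."""
    return w * s_title + (1 - w) * s_body

# Each example: (s_T, s_B, relevance judgment r in {0, 1})
training_examples = [
    (1, 1, 1),  # query matches title and body, judged relevant
    (0, 1, 1),  # body-only match, judged relevant
    (1, 0, 0),  # title-only match, judged nonrelevant
    (0, 0, 0),  # no match, judged nonrelevant
]

w = 0.7
for s_t, s_b, r in training_examples:
    print(f"s_T={s_t} s_B={s_b} judged={r}  score={score(w, s_t, s_b):.2f}")
```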
How?
• For each example t we can compute the score score(d_t, q_t) = w · s_T(d_t, q_t) + (1 − w) · s_B(d_t, q_t)
• We quantify Relevant as 1 and Nonrelevant as 0
• We would like the choice of w to be such that the computed scores are as close to these 1/0 judgments as possible
  – Denote by r(d_t, q_t) the judgment for t
• Then minimize the total squared error Σ_t ( score(d_t, q_t) − r(d_t, q_t) )²

Optimizing w
• There are 4 kinds of training examples, one for each possible value of (s_T, s_B)
  – Thus only four possible values for the score, and only 8 possible values for the error
• Let n_01r be the number of training examples for which s_T(d, q) = 0, s_B(d, q) = 1, judgment = Relevant
  – Similarly define n_00r, n_10r, n_11r, n_00i, n_01i, n_10i, n_11i
• For the examples with s_T = 0, s_B = 1 the score is 1 − w, so their error contribution is
  [1 − (1 − w)]² n_01r + [0 − (1 − w)]² n_01i = w² n_01r + (1 − w)² n_01i

Total error – then calculus
• Add up the contributions from the various cases to get the total error
• Now differentiate with respect to w to get the optimal value of w as
  w = (n_10r + n_01i) / (n_10r + n_10i + n_01r + n_01i)
  (the sketch after this page computes w from example counts)

Generalizing this simple example
• More (than 2) features
• Non-Boolean features
  – What if the title contains some but not all query terms?
  – Categorical features (query terms occur in plain, boldface, italics, etc.)
• Scores are nonlinear combinations of features
• Multilevel relevance judgments (Perfect, Good, Fair, Bad, etc.)
• Complex error functions
• Not always a unique, easily computable setting of score parameters
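Here is a minimal sketch of the closed-form solution above: it tallies the n_XYr / n_XYi counts from labeled examples and evaluates w = (n_10r + n_01i) / (n_10r + n_10i + n_01r + n_01i). The example data is hypothetical.

```python
# Minimal sketch: estimate the optimal w from counts of training examples.
# Each example is (s_T, s_B, judgment) with judgment 1 = Relevant, 0 = Nonrelevant.
# The data below is hypothetical.

from collections import Counter

examples = [
    (1, 0, 1), (1, 0, 1), (1, 0, 0),   # title-only matches
    (0, 1, 1), (0, 1, 0), (0, 1, 0),   # body-only matches
    (1, 1, 1), (0, 0, 0),              # these cases do not affect w
]

counts = Counter()
for s_t, s_b, r in examples:
    counts[(s_t, s_b, "r" if r else "i")] += 1

n10r, n10i = counts[(1, 0, "r")], counts[(1, 0, "i")]
n01r, n01i = counts[(0, 1, "r")], counts[(0, 1, "i")]

# Optimal weight minimizing the total squared error
w = (n10r + n01i) / (n10r + n10i + n01r + n01i)
print(f"optimal w = {w:.3f}")   # here (2 + 2) / (2 + 1 + 1 + 2) = 0.667
```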
Learning-based Web Search

Framework of Learning to Rank
• Given a set of features e_1, e_2, …, e_N, learn a ranking function f(e_1, e_2, …, e_N) that minimizes the loss function L:
  f* = arg min_{f ∈ F} L( f(e_1, e_2, …, e_N), GroundTruth )
• Some related issues
  – The functional space F – linear or non-linear? continuous? differentiable?
  – The search strategy
  – The loss function

A richer example (Sec. 15.4.1)
• Collect a training corpus of (q, d, r) triples
  – Relevance r is still binary for now
  – The document is represented by a feature vector x = (α, ω), where α is the cosine similarity and ω is the minimum query window size
  – ω is the shortest text span that includes all query words (query term proximity in the document)
• Train a machine learning model to predict the class r of a document-query pair

Using classification for deciding relevance (Sec. 15.4.1)
• A linear score function is
  Score(d, q) = Score(α, ω) = aα + bω + c
• And the linear classifier is: decide relevant if Score(d, q) > θ
• … just like when we were doing text classification (sketched below)
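A minimal sketch of the linear score-then-threshold classifier above. The coefficients a, b, c and the threshold θ are made-up values for illustration, not learned ones.

```python
# Minimal sketch of the linear relevance classifier:
#   Score(d, q) = a*alpha + b*omega + c, decide "relevant" if Score > theta.
# The coefficients and threshold below are illustrative, not learned values.

def linear_score(alpha, omega, a=20.0, b=-0.1, c=0.0):
    """alpha: cosine similarity; omega: minimum query window size (smaller = terms closer)."""
    return a * alpha + b * omega + c

def is_relevant(alpha, omega, theta=0.0):
    return linear_score(alpha, omega) > theta

# Example document-query pairs represented as (cosine similarity, min window size)
candidates = [(0.04, 2), (0.02, 5), (0.05, 3)]
for alpha, omega in candidates:
    label = "relevant" if is_relevant(alpha, omega) else "nonrelevant"
    print(f"alpha={alpha} omega={omega} -> {label}")
```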
Using classification for deciding relevance (Sec. 15.4.1)
• [Figure: training documents plotted by cosine score (y-axis, 0 to 0.05) against term proximity (x-axis, 2 to 5), labeled R (relevant) and N (nonrelevant), separated by a linear decision surface]

More complex example of using classification for search ranking [Nallapati SIGIR 2004]
• We can generalize this to classifier functions over more features
• We can use methods we have seen previously for learning the linear classifier weights

An SVM classifier for relevance [Nallapati SIGIR 2004]
• Let g(r | d, q) = w · f(d, q) + b
• Derive weights from the training examples:
  – want g(r | d, q) ≤ −1 for nonrelevant documents
  – want g(r | d, q) ≥ +1 for relevant documents
• Testing: decide relevant iff g(r | d, q) ≥ 0
• Use an SVM classifier (a sketch follows this page)

Ranking vs. Classification
• Classification
  – Well studied for over 30 years: Bayesian methods, neural networks, decision trees, SVMs, boosting, …
  – Training data: points – Pos: x1, x2, x3; Neg: x4, x5
• Ranking
  – Less studied: only a few works published in recent years
  – Training data: pairs (partial order) – (x1, x2), (x1, x3), (x1, x4), (x1, x5), (x2, x3), (x2, x4), …
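A minimal sketch of training such an SVM relevance classifier on (cosine similarity, term proximity) feature vectors. The toy data and labels are invented, not the points from the figure, and scikit-learn's LinearSVC stands in for whatever SVM implementation the paper used.

```python
# Minimal sketch: a linear SVM deciding relevance from two features,
# cosine similarity and term proximity, in the spirit of [Nallapati SIGIR 2004].
# The toy training data below is invented for illustration.

import numpy as np
from sklearn.svm import LinearSVC

# Feature vectors f(d, q) = (cosine score, term proximity) and labels (1 = relevant)
X = np.array([[0.045, 2], [0.040, 3], [0.035, 2],    # relevant-looking pairs
              [0.010, 5], [0.015, 4], [0.005, 5]])   # nonrelevant-looking pairs
y = np.array([1, 1, 1, 0, 0, 0])

clf = LinearSVC(C=1.0)
clf.fit(X, y)

# g(r | d, q) = w . f(d, q) + b ; decide relevant iff g >= 0
g = clf.decision_function([[0.030, 3]])[0]
print("g =", g, "->", "relevant" if g >= 0 else "nonrelevant")
```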
"Learning to rank" (Sec. 15.4.2)
• Assume a number of categories C of relevance exist
  – These are totally ordered: c_1 < c_2 < … < c_J
  – This is the ordinal regression setup
• Assume training data is available, consisting of document-query pairs represented as feature vectors ψ_i with relevance ranking c_i

Learning to rank: Classification vs. regression
• Classification probably isn't the right way to think about score learning:
  – Classification problems: map to an unordered set of classes
  – Regression problems: map to a real value
  – Ordinal regression problems: map to an ordered set of classes
• This formulation gives extra power:
  – Relations between relevance levels are modeled
  – Documents are good versus other documents for a query given the collection; not an absolute scale of goodness

"Learning to rank"
• Point-wise learning: given a query-document pair, predict a score (e.g., a relevancy score)
• Pair-wise learning: the input is a pair of results for a query, and the class is the relevance ordering relationship between them (sketched below)
• List-wise learning: directly optimize the ranking metric for each query

Modified example (Sec. 15.4.1)
• Collect a training corpus of (q, d, r) triples
  – Relevance r here takes 4 values: Perfect, Relevant, Weak, Nonrelevant
• Train a machine learning model to predict the class r of a document-query pair
• [Table of example document-query pairs with judgments: Perfect, Nonrelevant, Relevant, Weak, Relevant, Perfect, Nonrelevant]
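A minimal sketch of the pair-wise setup using graded judgments like those of the "Modified example": every pair of documents for the same query with different grades yields one ordering instance. The queries, documents, and grades below are hypothetical.

```python
# Minimal sketch: turning graded judgments into pair-wise training instances.
# The documents and grades below are hypothetical.

from itertools import combinations

GRADE = {"Perfect": 3, "Relevant": 2, "Weak": 1, "Nonrelevant": 0}

# Per-query judged documents: (doc id, graded relevance label)
judged = {
    "q1": [("d1", "Perfect"), ("d2", "Nonrelevant"), ("d3", "Relevant"), ("d4", "Weak")],
    "q2": [("d5", "Relevant"), ("d6", "Perfect"), ("d7", "Nonrelevant")],
}

def pairwise_instances(judged_docs):
    """Yield (query, preferred doc, other doc) for every pair with different grades."""
    for q, docs in judged_docs.items():
        for (d_i, g_i), (d_j, g_j) in combinations(docs, 2):
            if GRADE[g_i] == GRADE[g_j]:
                continue                      # equal grades carry no ordering information
            hi, lo = (d_i, d_j) if GRADE[g_i] > GRADE[g_j] else (d_j, d_i)
            yield (q, hi, lo)

for q, better, worse in pairwise_instances(judged):
    print(f"{q}: {better} should rank above {worse}")
```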
Point-wise learning: Example (Sec. 15.4.2)
• The goal is to learn a threshold to separate each rank

The Ranking SVM: Pairwise Learning [Herbrich et al. 1999, 2000; Joachims et al. KDD 2002]
• The aim is to classify instance pairs as correctly ranked or incorrectly ranked
  – This turns an ordinal regression problem back into a binary classification problem
• We want a ranking function f such that c_i is ranked before c_k:
  c_i ≺ c_k iff f(ψ_i) > f(ψ_k)
• Suppose that f is a linear function, f(ψ_i) = w · ψ_i
• Thus c_i ≺ c_k iff w · (ψ_i − ψ_k) > 0

Ranking SVM
• Training set: for each query q, we have a ranked list of documents totally ordered by a person for relevance to the query
• Features
  – a vector of features Φ(q_k, d_i) for each document/query pair
  – feature differences Φ(q_k, d_i) − Φ(q_k, d_j) for two documents d_i and d_j
• Classification: if d_i is judged more relevant than d_j, denoted d_i ≺ d_j, then assign the difference vector Φ(d_i, d_j, q) the class y_ijq = +1; otherwise −1

Ranking SVM
• The optimization problem is equivalent to that of a classification SVM on the pairwise difference vectors Φ(q_k, d_i) − Φ(q_k, d_j) (a sketch follows this page)
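Here is a minimal sketch of that pairwise reduction: build difference vectors for judged pairs, label them ±1, and train a linear SVM on them. The feature vectors and preference pairs are invented, and scikit-learn's LinearSVC stands in for the SVM solver described in the papers.

```python
# Minimal sketch of the Ranking SVM reduction to binary classification:
# train a linear SVM on pairwise feature-difference vectors.
# The feature vectors and preference pairs below are invented for illustration.

import numpy as np
from sklearn.svm import LinearSVC

# Phi(q, d): feature vectors for documents judged for one query
phi = {
    "d1": np.array([0.9, 0.2]),   # judged best
    "d2": np.array([0.6, 0.4]),
    "d3": np.array([0.1, 0.8]),   # judged worst
}
preferences = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]   # d_i preferred over d_j

X, y = [], []
for better, worse in preferences:
    diff = phi[better] - phi[worse]
    X.extend([diff, -diff])       # add both orientations as classes +1 and -1
    y.extend([+1, -1])

svm = LinearSVC(C=1.0)
svm.fit(np.array(X), np.array(y))
w = svm.coef_[0]

# Rank documents by the learned linear scoring function f(psi) = w . psi;
# with these judgments the order should come out as d1, d2, d3.
ranking = sorted(phi, key=lambda d: -float(w @ phi[d]))
print("learned ranking:", ranking)
```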