Ranking and Learning (290N, UCSB, Tao Yang, 2013). Partially based on Manning, Raghavan, and Schütze's textbook.


  1. Table of Contents
     • Weighted scoring for ranking
     • Learning to rank: a simple example
     • Learning to rank as classification
     Ranking and Learning, 290N UCSB, Tao Yang, 2013. Partially based on Manning, Raghavan, and Schütze's textbook.

     Scoring: Simple Model of Ranking with Similarity
     • Similarity-based approach
       – Similarity of query features with document features.
     • Weighted approach: scoring with weighted features
       – Return, in order, the documents most likely to be useful to the searcher.
       – Consider that each document has a subscore for each feature or each subarea.

  2. Similarity ranking example: Weighted scoring with a linear combination
     • A simple weighted scoring method: use a linear combination of subscores.
       – E.g., Score = 0.6*<Title score> + 0.3*<Abstract score> + 0.1*<Body score>
       – The overall score is in [0, 1].
     • Example with binary subscores: a query term appears in the title and body only.
       – Document score: (0.6 · 1) + (0.1 · 1) = 0.7.

     Example
     • On the query "bill rights", suppose we retrieve the following docs from the various zone indexes:
       – Abstract: bill → 1, 2; rights → (no postings)
       – Title: bill → 3, 5, 8; rights → 3, 5, 9
       – Body: bill → 1, 2, 5, 9; rights → 3, 5, 8, 9
     • Compute the score for each doc based on the weightings 0.6, 0.3, 0.1 (a small scoring sketch follows at the end of this section).

     How to determine weights automatically: Motivation
     • Modern systems, especially on the Web, use a great number of features:
       – Arbitrary useful features, not a single unified model.
       – Log frequency of a query word in anchor text? Query word highlighted on the page? Span of query words on the page? Number of (out)links on the page? PageRank of the page? URL length? URL contains "~"? Page edit recency? Page length?
     • Major web search engines use "hundreds" of such features, and they keep changing.
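To make the exercise concrete, here is a minimal Python sketch (not part of the slides). It assumes the same zone-to-weight assignment as the previous slide (title 0.6, abstract 0.3, body 0.1), the postings as reconstructed above, and an OR-style Boolean subscore (a zone scores 1 if it contains any query term); the slide itself does not pin these choices down.

# Weighted zone scoring for the "bill rights" exercise (assumed conventions noted above).
WEIGHTS = {"title": 0.6, "abstract": 0.3, "body": 0.1}

# zone -> term -> doc IDs whose zone contains the term (postings as reconstructed above)
ZONE_INDEX = {
    "abstract": {"bill": [1, 2], "rights": []},
    "title":    {"bill": [3, 5, 8], "rights": [3, 5, 9]},
    "body":     {"bill": [1, 2, 5, 9], "rights": [3, 5, 8, 9]},
}

def weighted_zone_score(doc_id, query_terms):
    """Sum the weights of the zones in which at least one query term occurs."""
    score = 0.0
    for zone, weight in WEIGHTS.items():
        if any(doc_id in ZONE_INDEX[zone].get(term, []) for term in query_terms):
            score += weight
    return score

query = ["bill", "rights"]
all_docs = sorted({d for zone in ZONE_INDEX.values() for plist in zone.values() for d in plist})
for d in all_docs:
    print(d, weighted_zone_score(d, query))
# Under these assumptions, doc 3 matches in title and body only: 0.6 + 0.1 = 0.7.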

  3. Machine learning for computing weights (Sec. 15.4)
     • How do we combine these signals into a good ranker?
       – "Machine-learned relevance", or "learning to rank".
     • Learning from examples
       – These examples are called training data.
     • [Figure: training examples are used to learn a ranking formula; a user query and its matched results are then fed to that formula to produce ranked results.]

     Learning weights: Methodology
     • Given a set of training examples, each containing (query q, document d, relevance score r(d, q)):
       – r(d, q) is the relevance judgment for d on q.
       – Simplest scheme: relevant (1) or nonrelevant (0).
       – More sophisticated: graded relevance judgments, e.g., 1 (Bad), 2 (Fair), 3 (Good), 4 (Excellent), 5 (Perfect).
     • Learn weights from these examples, so that the learned scores approximate the relevance judgments in the training examples.

     Simple example: Learning w from training examples
     • Each doc has two zones, Title and Body.
     • For a chosen w ∈ [0, 1], the score for doc d on query q is
       score(d, q) = w · s_T(d, q) + (1 − w) · s_B(d, q),
       where s_T(d, q) ∈ {0, 1} is a Boolean denoting whether q matches the Title and s_B(d, q) ∈ {0, 1} is a Boolean denoting whether q matches the Body.
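A minimal sketch of this two-zone scoring rule; the function name and the example weight w = 0.7 are illustrative, not from the slides.

# score(d, q) = w * s_T(d, q) + (1 - w) * s_B(d, q), with w in [0, 1]
def zone_score(s_title: int, s_body: int, w: float) -> float:
    assert 0.0 <= w <= 1.0, "w must lie in [0, 1]"
    return w * s_title + (1 - w) * s_body

# The four possible (s_T, s_B) combinations yield only four possible scores.
for s_t in (0, 1):
    for s_b in (0, 1):
        print((s_t, s_b), zone_score(s_t, s_b, w=0.7))   # 0.0, 0.3, 0.7, 1.0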

  4. How?
     • For each training example Φ_t we can compute the score score(d_t, q_t) = w · s_T(d_t, q_t) + (1 − w) · s_B(d_t, q_t).
     • We quantify Relevant as 1 and Nonrelevant as 0.
     • We would like the choice of w to be such that the computed scores are as close to these 1/0 judgments as possible.
       – Denote by r(d_t, q_t) the judgment for Φ_t.
       – Then minimize the total squared error Σ_t [score(d_t, q_t) − r(d_t, q_t)]².

     Optimizing w
     • There are 4 kinds of training examples, based on the values of (s_T, s_B).
       – Thus there are only four possible values for the score, and only 8 possible values for the per-example squared error.
     • Let n_01r be the number of training examples for which s_T(d, q) = 0, s_B(d, q) = 1, and the judgment is Relevant.
       – Similarly define n_00r, n_10r, n_11r, n_00i, n_01i, n_10i, n_11i.
     • Error contributed by the examples with s_T = 0, s_B = 1 (their score is 1 − w):
       [1 − (1 − w)]² n_01r + [0 − (1 − w)]² n_01i = w² n_01r + (1 − w)² n_01i.

     Total error – then calculus
     • Add up the contributions from the various cases to get the total error:
       (n_11i + n_00r) + (1 − w)² (n_10r + n_01i) + w² (n_10i + n_01r).
     • Now differentiate with respect to w to get the optimal value of w:
       w = (n_10r + n_01i) / (n_10r + n_10i + n_01r + n_01i)
       (a numeric check of this formula follows at the end of this section).

     Generalizing this simple example
     • More (than 2) features.
     • Non-Boolean features
       – What if the title contains some but not all of the query terms?
       – Categorical features (query terms occur in plain, boldface, italics, etc.).
     • Scores that are nonlinear combinations of features.
     • Multilevel relevance judgments (Perfect, Good, Fair, Bad, etc.).
     • Complex error functions.
     • Not always a unique, easily computable setting of the score parameters.
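A short sketch that checks the closed-form w above against a brute-force grid search; the counts are made up for illustration and the helper names are not from the slides.

def optimal_w(n_10r, n_10i, n_01r, n_01i):
    """Closed form: w = (n_10r + n_01i) / (n_10r + n_10i + n_01r + n_01i)."""
    return (n_10r + n_01i) / (n_10r + n_10i + n_01r + n_01i)

def total_squared_error(w, n_11r, n_11i, n_10r, n_10i, n_01r, n_01i, n_00r, n_00i):
    # Scores per case: (1,1) -> 1, (1,0) -> w, (0,1) -> 1 - w, (0,0) -> 0;
    # n_11r and n_00i contribute zero error and so do not appear below.
    return (n_11i + n_00r
            + n_10r * (1 - w) ** 2 + n_10i * w ** 2
            + n_01r * w ** 2 + n_01i * (1 - w) ** 2)

counts = dict(n_11r=5, n_11i=1, n_10r=8, n_10i=2, n_01r=3, n_01i=6, n_00r=1, n_00i=9)
w_star = optimal_w(counts["n_10r"], counts["n_10i"], counts["n_01r"], counts["n_01i"])
w_grid = min((i / 1000 for i in range(1001)), key=lambda w: total_squared_error(w, **counts))
print(round(w_star, 3), round(w_grid, 3))   # both should be about 0.737 for these counts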

  5. Learning-based Web Search: Framework of Learning to Rank
     • Given a set of features e_1, e_2, …, e_N, learn a ranking function f(e_1, e_2, …, e_N) that minimizes a loss function L:
       f* = argmin_{f ∈ F} L( f(e_1, e_2, …, e_N), GroundTruth ).
     • Some related issues:
       – The functional space F: linear or non-linear? Continuous? Differentiable?
       – The search strategy.
       – The loss function.

     A richer example: Using classification for deciding relevance (Sec. 15.4.1)
     • Collect a training corpus of (q, d, r) triples.
       – Relevance r is still binary for now.
       – Each document is represented by a feature vector x = (α, ω), where α is the cosine similarity and ω is the minimum query window size.
       – ω is the shortest text span that includes all query words (query term proximity in the document).
     • A linear score function is Score(d, q) = Score(α, ω) = aα + bω + c,
       and the linear classifier decides "relevant" if Score(d, q) > θ.
       – … just like when we were doing text classification.
     • Train a machine learning model to predict the class r of a document-query pair.
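A minimal sketch of this linear score-and-threshold rule; the coefficient values a, b, c and the threshold θ below are illustrative assumptions, not learned values.

def score(alpha: float, omega: float, a: float = 100.0, b: float = -1.0, c: float = 0.0) -> float:
    """Score(d, q) = Score(alpha, omega) = a*alpha + b*omega + c.
    alpha is the cosine similarity; omega is the minimum query window size
    (a larger window means worse proximity, hence the negative weight b)."""
    return a * alpha + b * omega + c

def is_relevant(alpha: float, omega: float, theta: float = 0.0) -> bool:
    """Linear classifier: decide relevant iff Score(d, q) > theta."""
    return score(alpha, omega) > theta

print(is_relevant(alpha=0.04, omega=3))   # 100*0.04 - 3 =  1.0 > 0 -> True
print(is_relevant(alpha=0.01, omega=5))   # 100*0.01 - 5 = -4.0     -> False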

  6. Using classification for deciding relevance (Sec. 15.4.1)
     • [Figure: training documents plotted by cosine score (vertical axis, roughly 0 to 0.05) against term proximity (horizontal axis, roughly 2 to 5), labelled R (relevant) and N (nonrelevant); a linear decision surface separates the two classes.]

     More complex example of using classification for search ranking [Nallapati SIGIR 2004]
     • We can generalize this to classifier functions over more features.
     • We can use methods we have seen previously for learning the linear classifier weights.

     An SVM classifier for relevance [Nallapati SIGIR 2004]
     • Let g(r | d, q) = w · f(d, q) + b.
     • Derive the weights w and b from the training examples:
       – want g(r | d, q) ≤ −1 for nonrelevant documents,
       – and g(r | d, q) ≥ 1 for relevant documents.
     • Testing: decide relevant iff g(r | d, q) ≥ 0.
     • Use an SVM classifier (a small sketch follows at the end of this section).

     Ranking vs. Classification
     • Classification
       – Well studied for over 30 years: Bayesian methods, neural networks, decision trees, SVMs, boosting, …
       – Training data: points, e.g., positive x1, x2, x3 and negative x4, x5. [Figure: the points x5, x4, x3, x2, x1 placed along a line through 0.]
     • Ranking
       – Less studied: only a few works published in recent years.
       – Training data: pairs (a partial order), e.g., (x1, x2), (x1, x3), (x1, x4), (x1, x5), (x2, x3), (x2, x4), …
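A sketch of the linear SVM relevance classifier g(r | d, q) = w · f(d, q) + b. The use of scikit-learn and the tiny two-feature dataset (cosine score, term proximity window), shaped roughly like the figure above, are assumptions for illustration.

import numpy as np
from sklearn.svm import LinearSVC

# f(d, q) = (cosine score, term proximity window); y: 1 = relevant, 0 = nonrelevant
X = np.array([
    [0.045, 2], [0.040, 2], [0.035, 3], [0.030, 3],   # relevant region of the plot
    [0.020, 4], [0.015, 5], [0.010, 4], [0.005, 5],   # nonrelevant region
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LinearSVC(C=1.0)          # learns w and b for g(x) = w . x + b
clf.fit(X, y)

# Testing rule from the slide: decide relevant iff g(r | d, q) >= 0.
g = clf.decision_function([[0.030, 2.5], [0.008, 5.0]])
print(g >= 0)                   # expected: [ True False] on this toy data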

  7. "Learning to rank" (Sec. 15.4.2)
     • Assume a number of categories C of relevance exist.
       – These are totally ordered: c_1 < c_2 < … < c_J.
       – This is the ordinal regression setup.
     • Assume training data is available, consisting of document-query pairs represented as feature vectors ψ_i with relevance rankings c_i.

     Learning to rank: Classification vs. regression
     • Classification probably isn't the right way to think about score learning:
       – Classification problems: map to an unordered set of classes.
       – Regression problems: map to a real value.
       – Ordinal regression problems: map to an ordered set of classes.
     • This formulation gives extra power:
       – Relations between relevance levels are modeled.
       – Documents are good relative to other documents for the query and collection, not on an absolute scale of goodness.

     "Learning to rank": approaches
     • Point-wise learning
       – Given a query-document pair, predict a score (e.g., a relevance score).
     • Pair-wise learning
       – The input is a pair of results for a query, and the class is the relevance ordering relationship between them (a small sketch of building such pairs follows at the end of this section).
     • List-wise learning
       – Directly optimize the ranking metric for each query.

     Modified example (Sec. 15.4.1)
     • Collect a training corpus of (q, d, r) triples.
       – Relevance r here takes 4 values: Perfect, Relevant, Weak, Nonrelevant.
     • Train a machine learning model to predict the class r of a document-query pair.
     • [Figure: example result lists annotated with the grades Perfect, Relevant, Weak, Nonrelevant.]
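A minimal sketch of turning graded point-wise judgments into pair-wise training instances, as in the pair-wise setup above; the grade names follow the "Modified example" slide, while the helper names and toy judgments are assumptions.

from itertools import combinations

GRADE = {"Perfect": 3, "Relevant": 2, "Weak": 1, "Nonrelevant": 0}

# (query_id, doc_id, grade) triples; made-up toy judgments
judgments = [
    ("q1", "d1", "Perfect"), ("q1", "d2", "Weak"), ("q1", "d3", "Nonrelevant"),
    ("q2", "d4", "Relevant"), ("q2", "d5", "Relevant"), ("q2", "d6", "Nonrelevant"),
]

def pairwise_instances(judgments):
    """Yield (query, preferred_doc, other_doc) for every same-query pair with different grades."""
    by_query = {}
    for q, d, grade in judgments:
        by_query.setdefault(q, []).append((d, GRADE[grade]))
    for q, docs in by_query.items():
        for (d_i, g_i), (d_j, g_j) in combinations(docs, 2):
            if g_i > g_j:
                yield (q, d_i, d_j)
            elif g_j > g_i:
                yield (q, d_j, d_i)
            # equal grades produce no preference pair

for pair in pairwise_instances(judgments):
    print(pair)
# q1 yields 3 ordered pairs; q2 yields 2 (the d4/d5 tie yields none).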

  8. Point-wise learning: Example (Sec. 15.4.2)
     • The goal is to learn a threshold that separates each rank from the next.

     The Ranking SVM: Pairwise Learning [Herbrich et al. 1999, 2000; Joachims et al. KDD 2002]
     • The aim is to classify instance pairs as correctly ranked or incorrectly ranked.
       – This turns an ordinal regression problem back into a binary classification problem.
     • We want a ranking function f such that c_i is ranked before c_k:
       c_i ≺ c_k iff f(ψ_i) > f(ψ_k).
     • Suppose that f is a linear function, f(ψ_i) = w · ψ_i.
     • Thus c_i ≺ c_k iff w · (ψ_i − ψ_k) > 0.

     Ranking SVM
     • Training set
       – For each query q, we have a ranked list of documents totally ordered by a person for relevance to the query.
     • Features
       – A vector of features Φ(q, d) for each document/query pair.
       – Feature differences for two documents d_i and d_j: Φ(q_k, d_i) − Φ(q_k, d_j).
     • Classification
       – If d_i is judged more relevant than d_j, denoted d_i ≺ d_j, then assign the vector Φ(d_i, d_j, q) = Φ(q_k, d_i) − Φ(q_k, d_j) the class y_ijq = +1; otherwise −1.
     • The optimization problem is then equivalent to that of a classification SVM on the pairwise difference vectors (a small sketch follows below).
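A sketch of this reduction: build pairwise feature differences, label them +1 or −1, and train an ordinary linear SVM on them. The use of scikit-learn, the toy feature vectors, and the grades are assumptions for illustration.

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

# Phi(q, d): toy 2-feature vectors per (query, doc), with a human relevance grade.
corpus = {
    "q1": [(np.array([0.9, 0.2]), 3), (np.array([0.5, 0.4]), 2), (np.array([0.1, 0.9]), 1)],
    "q2": [(np.array([0.8, 0.1]), 2), (np.array([0.3, 0.6]), 1)],
}

X_diff, y = [], []
for docs in corpus.values():
    for (phi_i, g_i), (phi_j, g_j) in combinations(docs, 2):
        if g_i == g_j:
            continue                         # ties yield no preference pair
        diff = phi_i - phi_j
        label = 1 if g_i > g_j else -1       # +1 means d_i should be ranked before d_j
        X_diff.append(diff)
        y.append(label)
        X_diff.append(-diff)                 # mirrored pair keeps both classes present
        y.append(-label)

rank_svm = LinearSVC(C=1.0).fit(np.array(X_diff), np.array(y))
w = rank_svm.coef_[0]                        # learned ranking function f(phi) = w . phi

# Rank q1's documents by the learned scores; they should come out in decreasing order.
print([float(w @ phi) for phi, _ in corpus["q1"]])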
