SLIDE 1

Machine Learning for Ranking

CE-324: Modern Information Retrieval

Sharif University of Technology

M. Soleymani

Fall 2018

Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Machine learning for IR ranking?

• We’ve looked at methods for ranking docs in IR
  • Cosine similarity, idf, proximity, pivoted doc length normalization, doc authority, …
• We’ve looked at methods for classifying docs using supervised machine learning classifiers
  • Naïve Bayes, Rocchio, kNN
• Surely we can also use machine learning to rank the docs displayed in search results?
  • Sounds like a good idea
  • A.k.a. “machine-learned relevance” or “learning to rank”

SLIDE 3

Machine learning for IR ranking

• Actively researched – and actively deployed by major web search engines
  • In the last decade
• Why didn’t it happen earlier?
  • Modern supervised ML has been around for about 20 years…
  • Naïve Bayes has been around for about 50 years…

SLIDE 4

Machine learning for IR ranking

• The IR community wasn’t very connected to the ML community
• But there were a whole bunch of precursors:
  • Wong, S.K. et al. 1988. Linear structure in information retrieval. SIGIR 1988.
  • Fuhr, N. 1992. Probabilistic methods in information retrieval. Computer Journal.
  • Gey, F. C. 1994. Inferring probability of relevance using the method of logistic regression. SIGIR 1994.
  • Herbrich, R. et al. 2000. Large Margin Rank Boundaries for Ordinal Regression. Advances in Large Margin Classifiers.

SLIDE 5

Why weren’t early attempts very successful/influential?

• Sometimes an idea just takes time to be appreciated…
• Limited training data
  • Especially for real-world use (as opposed to writing academic papers), it was very hard to gather test collection queries and relevance judgments
  • This has changed, both in academia and industry
• Poor machine learning techniques
• Insufficient customization to the IR problem
• Not enough features for ML to show value

SLIDE 6

Why wasn’t ML much needed?

• Traditional ranking functions in IR used a very small number of features, e.g.,
  • Term frequency
  • Inverse document frequency
  • Doc length
• It was easy to tune weighting coefficients by hand
  • And people did

SLIDE 7

Why is ML needed now?

• Modern systems – especially on the Web – use a great number of features:
  • Arbitrary useful features – not a single unified model
  • Log frequency of query word in anchor text?
  • Query word in color on page?
  • # of images on page?
  • # of (out) links on page?
  • PageRank of page?
  • URL length?
  • URL contains “~”?
  • Page edit recency?
  • Page length?
• The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features.

SLIDE 8

Weighted zone scoring

• Simple example:
  • Only two zones: title, body

  Score(d, q) = β · s_T(d, q) + (1 − β) · s_B(d, q),  where 0 ≤ β ≤ 1

• We intend to find an optimal value for β
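As a minimal illustrative sketch (the function name and the toy call below are mine, not from the slides), the two-zone score can be computed directly from the per-zone match scores:

```python
def weighted_zone_score(s_title, s_body, beta):
    """Two-zone weighted score: beta * s_T(d, q) + (1 - beta) * s_B(d, q)."""
    assert 0.0 <= beta <= 1.0, "beta must lie in [0, 1]"
    return beta * s_title + (1.0 - beta) * s_body

# e.g. a doc whose title matches the query but whose body does not:
print(weighted_zone_score(s_title=1, s_body=0, beta=0.7))  # -> 0.7
```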

SLIDE 9

Weighted zone scoring

• Training examples: each example is a tuple consisting of a query q and a doc d, together with a relevance judgment r for d on q.
• In the simplest form, each relevance judgment is either Relevant or Non-relevant.
• β is “learned” from the examples, so that the learned scores approximate the relevance judgments in the training examples.

SLIDE 10

Weighted zone scoring

• Training set: E = {(d₁, q₁, r₁), …, (d_n, q_n, r_n)}
• Find the minimum of the following cost function:

  ε(β) = Σ_{i=1..n} ( rᵢ − β · s_T(dᵢ, qᵢ) − (1 − β) · s_B(dᵢ, qᵢ) )²

  β* = argmin_{0 ≤ β ≤ 1} ε(β)
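One simple way to carry out this minimization is a brute-force grid search over the feasible interval [0, 1]. The sketch below uses made-up training triples purely for illustration:

```python
import numpy as np

def cost(beta, s_T, s_B, r):
    """epsilon(beta) = sum_i (r_i - beta * s_T_i - (1 - beta) * s_B_i)^2"""
    pred = beta * s_T + (1.0 - beta) * s_B
    return np.sum((r - pred) ** 2)

# toy training set: per-example title score, body score, relevance judgment
s_T = np.array([1, 1, 0, 0, 1])
s_B = np.array([0, 1, 1, 0, 0])
r   = np.array([1, 1, 0, 0, 0])

betas = np.linspace(0.0, 1.0, 1001)
beta_star = betas[np.argmin([cost(b, s_T, s_B, r) for b in betas])]
print(beta_star)  # ~0.667 for this toy data
```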

SLIDE 11

Weighted zone scoring: special case

• Boolean match in each zone: s_T(d, q), s_B(d, q) ∈ {0, 1}
• Boolean relevance judgment: r ∈ {0, 1}

  β* = (n₁₀ᵣ + n₀₁ₙ) / (n₁₀ᵣ + n₁₀ₙ + n₀₁ᵣ + n₀₁ₙ)

  n₁₀ᵣ: number of training examples with s_T = 1, s_B = 0, and r = 1
  n₀₁ₙ: number of training examples with s_T = 0, s_B = 1, and r = 0
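A one-line sketch of this closed form (the counts passed in are hypothetical):

```python
def beta_star(n10r, n10n, n01r, n01n):
    """Optimal beta for Boolean zone scores and Boolean judgments.
    n10r = #examples with s_T=1, s_B=0, relevant;
    n01n = #examples with s_T=0, s_B=1, nonrelevant; and so on."""
    return (n10r + n01n) / (n10r + n10n + n01r + n01n)

print(beta_star(n10r=120, n10n=30, n01r=40, n01n=80))  # -> ~0.741
```

For the toy data in the grid-search sketch above, this formula gives (1 + 1) / (1 + 1 + 0 + 1) = 2/3, matching the searched optimum.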

SLIDE 12

Simple example: Using classification for ad hoc IR

• Collect a training corpus of (q, d, r) triples
  • Relevance r is here binary (but may be multiclass, e.g., 3–7 levels)
• Each doc–query pair is represented by a feature vector
  • y = (y₁, y₂)
  • y₁: cosine similarity
  • y₂: minimum query window size (the shortest text span including all query words), called ω
  • Query term proximity is a very important new weighting factor
• Train a model to predict the class r of a doc–query pair

SLIDE 13

Simple example: Using classification for ad hoc IR

• Training data [table of example feature vectors and judgments not reproduced]

SLIDE 14

Simple example: Using classification for ad hoc IR

• A linear score function is then

  Score(d, q) = w₁ × CosineScore + w₂ × ω + w₀

• And the linear classifier is

  Decide relevant if Score(d, q) > 0

• … just like when we were doing text classification
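A hedged sketch of this two-feature setup, using scikit-learn’s LogisticRegression as one possible linear classifier (any linear learner would do); all feature values are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# (cosine score, minimum query window size omega) per doc-query pair
X = np.array([[0.05, 2], [0.04, 3], [0.03, 2],   # relevant pairs
              [0.02, 4], [0.01, 5], [0.02, 5]])  # nonrelevant pairs
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X, y)
w1, w2 = clf.coef_[0]   # weights on CosineScore and omega
w0 = clf.intercept_[0]  # bias term
# decide relevant iff w1 * CosineScore + w2 * omega + w0 > 0
print(clf.predict([[0.045, 2]]))  # expected: [1] (high cosine, small window)
```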

SLIDE 15

Simple example: Using classification for ad hoc IR

[Figure: doc–query training examples plotted by term proximity vs. cosine score; R = relevant, N = nonrelevant, separated by a linear decision surface]

SLIDE 16

More complex example of using classification for search ranking [Nallapati 2004]

• We can generalize this to classifier functions over more features
• We can use methods we have seen previously for learning the linear classifier weights

SLIDE 17

Classifier for information retrieval

• Score(d, q) = w₁ f₁(d, q) + ⋯ + w_m f_m(d, q) + c
• Training: Score(d, q) must be negative for nonrelevant docs and positive for relevant docs
• Testing: decide relevant iff Score(d, q) ≥ 0
• Features are not word-presence features
  • To deal with query words not in your training data
  • Instead, use scores like the summed (log) tf of all query terms
• Unbalanced data
  • Problem: it can result in trivial always-say-nonrelevant classifiers
  • A solution: undersample nonrelevant docs during training (just take some at random)
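A minimal sketch of the random-undersampling fix (the helper name is mine):

```python
import numpy as np

def undersample_nonrelevant(X, y, seed=0):
    """Keep all relevant (y=1) examples but only a random subset of
    nonrelevant (y=0) ones, to avoid the always-say-nonrelevant trap."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
    keep = np.concatenate([pos, keep_neg])
    return X[keep], y[keep]
```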

SLIDE 18

An SVM classifier for ranking [Nallapati 2004]

• Experiments:
  • 4 TREC data sets
  • Comparisons with Lemur, a state-of-the-art open-source IR engine (Language Model (LM)-based – see IIR ch. 12)
  • 6 features, all variants of tf, idf, and tf.idf scores

SLIDE 19

An SVM classifier for ranking [Nallapati 2004]

  Train \ Test      Disk 3    Disk 4-5   WT10G (web)
  Disk 3     LM     0.1785    0.2503     0.2666
             SVM    0.1728    0.2432     0.2750
  Disk 4-5   LM     0.1773    0.2516     0.2656
             SVM    0.1646    0.2355     0.2675

• At best the results are about equal to LM
  • Actually a little bit below
• Paper’s advertisement: easy to add more features
• This is illustrated on a homepage-finding task on WT10G:
  • Baseline LM 52% p@10, baseline SVM 58%
  • SVM with URL-depth and in-link features: 78% p@10

SLIDE 20

Learning to rank

• Classification probably isn’t the right way to think about this
  • Classification: map to an unordered set of classes
  • Regression: map to a real value
  • Ordinal regression: map to an ordered set of classes
    • A fairly obscure sub-branch of statistics, but what we want here
• The ordinal regression formulation gives extra power:
  • Relations between relevance levels are modeled
• A number of categories of relevance:
  • These are totally ordered: c₁ < c₂ < ⋯ < c_K
• Training data: each doc–query pair is represented as a feature vector ψᵢ, with its relevance category cᵢ as the label

SLIDE 21

The Ranking SVM

[Herbrich et al. 1999, 2000; Joachims et al. 2002]

• Ranking model: f(d)

Sec. 15.4.2
SLIDE 22

Pairwise learning: The Ranking SVM

[Herbrich et al. 1999, 2000; Joachims et al. 2002]

• Aim is to classify instance pairs as correctly ranked or incorrectly ranked
  • This turns an ordinal regression problem back into a binary classification problem in an expanded space
• We want a ranking function f such that

  cᵢ > cₖ iff f(ψᵢ) > f(ψₖ)

  • … or at least one that tries to do this with minimal error
• Suppose that f is a linear function

  f(ψᵢ) = w · ψᵢ

Sec. 15.4.2
SLIDE 23

The Ranking SVM

[Herbrich et al. 1999, 2000; Joachims et al. 2002]

• Then (combining cᵢ > cₖ iff f(ψᵢ) > f(ψₖ) and f(ψᵢ) = w · ψᵢ):

  cᵢ > cₖ iff w · (ψᵢ − ψₖ) > 0

• Let us then create a new instance space from such pairs:

  Φᵤ = Φ(dᵢ, dₖ, q) = ψᵢ − ψₖ
  zᵤ = +1, 0, −1 as cᵢ >, =, < cₖ

• We can build the model over just the cases for which zᵤ = −1
• From training data S = {Φᵤ}, we train an SVM

Sec. 15.4.2
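To make the construction concrete, here is a small sketch of the pairwise expansion for one query’s docs (feature values and relevance levels are invented for illustration):

```python
import numpy as np

def pairwise_transform(psi, c):
    """Expand one query's docs into difference vectors.
    psi: (n, m) feature vectors; c: (n,) relevance levels.
    Returns Phi_u = psi_i - psi_k with z_u = sign(c_i - c_k),
    skipping ties (z_u = 0), which carry no ordering signal."""
    Phi, z = [], []
    for i in range(len(c)):
        for k in range(len(c)):
            if i != k and c[i] != c[k]:
                Phi.append(psi[i] - psi[k])
                z.append(1 if c[i] > c[k] else -1)
    return np.array(Phi), np.array(z)

psi = np.array([[0.9, 0.2], [0.5, 0.5], [0.1, 0.9]])  # hypothetical features
Phi, z = pairwise_transform(psi, c=np.array([2, 1, 0]))
print(Phi.shape, z)  # (6, 2) [ 1  1 -1  1 -1 -1]
```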
SLIDE 24

Two queries in the pairwise space

SLIDE 25

Ranking by a classifier

• Assume that the ranked list of docs for a set of sample queries is available as training data
• Or even a set of training data in the form of (d, d′, q, z) is available
  • This form of training data can also be derived from the available ranked lists of docs for sample queries, if they exist
• Again, we can use an SVM classifier for ranking

SLIDE 26

Ranking by a classifier

• Φ(d, q) = (f₁(d, q), …, f_m(d, q))
• Φ(d′, q) = (f₁(d′, q), …, f_m(d′, q))
• The learner seeks a vector w in the space of scores (constructed as above) such that

  wᵀΦ(d, q) ≥ wᵀΦ(d′, q)

  for each d that precedes d′ in the ranked list of docs for q available in the training data
• A linear classifier like an SVM can be used, where its training data are constructed as the following pairs of input (feature vector) and output (label), one per tuple (d, d′, q, z):

  (Φ(d, q) − Φ(d′, q), z)

  z = +1 if d must precede d′
  z = −1 if d′ must precede d

SLIDE 27

The Ranking SVM

[Herbrich et al. 1999, 2000; Joachims et al. 2002]

• The SVM learning task is then like the other examples that we saw before
• Find w and ξᵤ ≥ 0 such that
  • ½wᵀw + C Σ ξᵤ is minimized, and
  • for all Φᵤ such that zᵤ < 0, w · Φᵤ ≥ 1 − ξᵤ
• We can just use the negative-zᵤ cases, as the ordering is antisymmetric
• You can again use libSVM or SVMlight (or other SVM libraries) to train your model (the SVMrank specialization)

Sec. 15.4.2
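As a hedged stand-in for SVMlight/SVMrank, the sketch below trains scikit-learn’s LinearSVC on the pairwise difference vectors for one query (all numbers are invented):

```python
import numpy as np
from sklearn.svm import LinearSVC

psi = np.array([[0.9, 0.2], [0.5, 0.5], [0.1, 0.9]])  # hypothetical features
c   = np.array([2, 1, 0])                             # relevance: doc 0 is best

# difference vectors Phi_u = psi_i - psi_k with z_u = sign(c_i - c_k);
# keeping both signs of each pair is harmless since the ordering
# constraint is antisymmetric
pairs = [(i, k) for i in range(3) for k in range(3) if c[i] != c[k]]
Phi = np.array([psi[i] - psi[k] for i, k in pairs])
z   = np.array([np.sign(c[i] - c[k]) for i, k in pairs])

svm = LinearSVC(C=1.0).fit(Phi, z)
print(psi @ svm.coef_[0])  # ranking scores, highest for the most relevant doc
```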
SLIDE 28

Adapting the Ranking SVM for (successful) Information Retrieval

[Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon. SIGIR 2006]

• A Ranking SVM model already works well
  • Using things like vector space model scores as features
  • As we shall see, it outperforms standard IR in evaluations
• But it does not model important aspects of practical IR well
• This paper addresses two customizations of the Ranking SVM to fit an IR utility model

SLIDE 29

The ranking SVM fails to model the IR problem well …

1. Correctly ordering the most relevant docs is crucial to the success of an IR system, while misordering less relevant results matters little
   § The Ranking SVM considers all ordering violations as the same
2. Some queries have many (somewhat) relevant docs, and other queries few. If we treat all pairs of results for queries equally, queries with many results will dominate the learning
   § But actually, queries with few relevant results are at least as important to do well on

SLIDE 30

Learning to rank: Approaches

• Point-wise
  • Predicts a class label or relevance score
• Pair-wise
  • Predicting relative order is closer to the nature of ranking than predicting a class label or relevance score
  • Input is a pair of results for a query, and the class is the ordering relationship between them
• List-wise
  • Learns a ranking function
  • Models the ranking problem in a straightforward fashion
  • Can overcome the drawbacks of the above approaches by tackling the ranking problem directly

SLIDE 31

Problem with Pointwise Approach

• Properties of IR evaluation measures have not been well considered:
  • When the number of associated docs varies greatly across queries, learning is dominated by the queries with a large number of docs
  • The position of docs in the ranked list is ignored
    • The pointwise loss function may therefore place too much emphasis on unimportant docs (those ranked low in the final results)

SLIDE 32

Pairwise ranking

• Goal: classify instance pairs as correctly ranked or incorrectly ranked
  • This turns a ranking problem back into a binary classification problem
• Advantage
  • Predicting relative order is closer to the nature of ranking than predicting a class label or relevance score
• Problem
  • Across queries, the distribution of the number of doc pairs is even more skewed than the distribution of the number of docs

SLIDE 33

The Limitation of Machine Learning

• Most work produces linear models over features, by weighting different base features
• This contrasts with most of the clever ideas of traditional IR, which are nonlinear scalings and combinations of basic measurements
  • log term frequency, idf, pivoted length normalization
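One common response is to hand-build those nonlinear scalings as base features and let the linear learner weight them. A sketch under that assumption (the function name and the exact pivoted-normalization form are illustrative, not from the slides):

```python
import numpy as np

def nonlinear_base_features(tf, df, N, doc_len, avg_len, b=0.75):
    """Nonlinear rescalings of raw measurements, offered to a linear model."""
    log_tf = np.log1p(tf)                      # log term frequency
    idf = np.log(N / df)                       # inverse document frequency
    pivot = 1.0 - b + b * (doc_len / avg_len)  # pivoted length normalization
    return np.array([log_tf, idf, log_tf * idf / pivot])

print(nonlinear_base_features(tf=3, df=100, N=100_000, doc_len=250, avg_len=200))
```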

SLIDE 34

Summary

• The idea of learning ranking functions has been around for about 20 years
• But only recently have ML knowledge, availability of training data, and a rich space of features come together to make this a hot research area
• It’s too early to give a definitive statement on which methods are best in this area … it’s still advancing rapidly
• But machine-learned ranking over many features now easily beats traditional hand-designed ranking functions
• And there is every reason to think that the importance of machine learning in IR will increase in the future.

SLIDE 35

Resources

• IIR, 6.1
• IIR, 15.4