SLIDE 1

Machine Learning for Ranking

CE-324: Modern Information Retrieval

Sharif University of Technology

M. Soleymani

Fall 2018

Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Machine learning for IR ranking?

• We’ve looked at methods for ranking docs in IR
  • Cosine similarity, idf, proximity, pivoted doc length normalization, doc authority, …
• We’ve looked at methods for classifying docs using supervised machine learning classifiers
  • Naïve Bayes, Rocchio, kNN
• Surely we can also use machine learning to rank the docs displayed in search results?
  • Sounds like a good idea
  • A.k.a. “machine-learned relevance” or “learning to rank”

SLIDE 3

Machine learning for IR ranking

• Actively researched – and actively deployed by major web search engines
  • In the last decade
• Why didn’t it happen earlier?
  • Modern supervised ML has been around for about 20 years…
  • Naïve Bayes has been around for about 50 years…

SLIDE 4

Machine learning for IR ranking

• The IR community wasn’t very connected to the ML community
• But there were a whole bunch of precursors:
  • Wong, S.K. et al. 1988. Linear structure in information retrieval. SIGIR 1988.
  • Fuhr, N. 1992. Probabilistic methods in information retrieval. Computer Journal.
  • Gey, F. C. 1994. Inferring probability of relevance using the method of logistic regression. SIGIR 1994.
  • Herbrich, R. et al. 2000. Large Margin Rank Boundaries for Ordinal Regression. Advances in Large Margin Classifiers.

SLIDE 5

Why weren’t early attempts very successful/influential?

• Sometimes an idea just takes time to be appreciated…
• Limited training data
  • Especially for real-world use (as opposed to writing academic papers), it was very hard to gather test collection queries and relevance judgments
  • This has changed, both in academia and industry
• Poor machine learning techniques
• Insufficient customization to the IR problem
• Not enough features for ML to show value

SLIDE 6

Why wasn’t ML much needed?

• Traditional ranking functions in IR used a very small number of features, e.g.,
  • Term frequency
  • Inverse document frequency
  • Doc length
• It was easy to tune weighting coefficients by hand
  • And people did

SLIDE 7

Why is ML needed now?

• Modern systems – especially on the Web – use a great number of features:
  • Arbitrary useful features – not a single unified model
  • Log frequency of query word in anchor text?
  • Query word in color on page?
  • # of images on page?
  • # of (out) links on page?
  • PageRank of page?
  • URL length?
  • URL contains “~”?
  • Page edit recency?
  • Page length?
• The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features.

SLIDE 8

Weighted zone scoring

• Simple example:
  • Only two zones: title, body

  Score(d, q) = β · s_T(d, q) + (1 − β) · s_B(d, q),  where 0 ≤ β ≤ 1

• We intend to find an optimal value for β
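As a minimal illustrative sketch (the function name and the toy call below are mine, not from the slides), the two-zone score can be computed directly from the per-zone match scores:

```python
def weighted_zone_score(s_title, s_body, beta):
    """Two-zone weighted score: beta * s_T(d, q) + (1 - beta) * s_B(d, q)."""
    assert 0.0 <= beta <= 1.0, "beta must lie in [0, 1]"
    return beta * s_title + (1.0 - beta) * s_body

# e.g. a doc whose title matches the query but whose body does not:
print(weighted_zone_score(s_title=1, s_body=0, beta=0.7))  # -> 0.7
```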

SLIDE 9

Weighted zone scoring

• Training examples: each example is a tuple consisting of a query q and a doc d, together with a relevance judgment r for d on q.
• In the simplest form, each relevance judgment is either Relevant or Non-relevant.
• β is “learned” from the examples, so that the learned scores approximate the relevance judgments in the training examples.

SLIDE 10

Weighted zone scoring

• Training set: E = {(d₁, q₁, r₁), …, (d_n, q_n, r_n)}
• Find the minimum of the following cost function:

  ε(β) = Σ_{i=1..n} ( rᵢ − β · s_T(dᵢ, qᵢ) − (1 − β) · s_B(dᵢ, qᵢ) )²

  β* = argmin_{0 ≤ β ≤ 1} ε(β)
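One simple way to carry out this minimization is a brute-force grid search over the feasible interval [0, 1]. The sketch below uses made-up training triples purely for illustration:

```python
import numpy as np

def cost(beta, s_T, s_B, r):
    """epsilon(beta) = sum_i (r_i - beta * s_T_i - (1 - beta) * s_B_i)^2"""
    pred = beta * s_T + (1.0 - beta) * s_B
    return np.sum((r - pred) ** 2)

# toy training set: per-example title score, body score, relevance judgment
s_T = np.array([1, 1, 0, 0, 1])
s_B = np.array([0, 1, 1, 0, 0])
r   = np.array([1, 1, 0, 0, 0])

betas = np.linspace(0.0, 1.0, 1001)
beta_star = betas[np.argmin([cost(b, s_T, s_B, r) for b in betas])]
print(beta_star)  # ~0.667 for this toy data
```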

SLIDE 11

Weighted zone scoring: special case

• Boolean match in each zone: s_T(d, q), s_B(d, q) ∈ {0, 1}
• Boolean relevance judgment: r ∈ {0, 1}

  β* = (n₁₀ᵣ + n₀₁ₙ) / (n₁₀ᵣ + n₁₀ₙ + n₀₁ᵣ + n₀₁ₙ)

  n₁₀ᵣ: number of training examples with s_T = 1, s_B = 0, and r = 1
  n₀₁ₙ: number of training examples with s_T = 0, s_B = 1, and r = 0
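A one-line sketch of this closed form (the counts passed in are hypothetical):

```python
def beta_star(n10r, n10n, n01r, n01n):
    """Optimal beta for Boolean zone scores and Boolean judgments.
    n10r = #examples with s_T=1, s_B=0, relevant;
    n01n = #examples with s_T=0, s_B=1, nonrelevant; and so on."""
    return (n10r + n01n) / (n10r + n10n + n01r + n01n)

print(beta_star(n10r=120, n10n=30, n01r=40, n01n=80))  # -> ~0.741
```

For the toy data in the grid-search sketch above, this formula gives (1 + 1) / (1 + 1 + 0 + 1) = 2/3, matching the searched optimum.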

SLIDE 12

Simple example: Using classification for ad hoc IR

• Collect a training corpus of (q, d, r) triples
  • Relevance r is here binary (but may be multiclass, e.g., 3–7 levels)
• Each doc–query pair is represented by a feature vector
  • y = (y₁, y₂)
  • y₁: cosine similarity
  • y₂: minimum query window size (the shortest text span including all query words), called ω
  • Query term proximity is a very important new weighting factor
• Train a model to predict the class r of a doc–query pair

SLIDE 13

Simple example: Using classification for ad hoc IR

• Training data [table of example feature vectors and judgments not reproduced]

SLIDE 14

Simple example: Using classification for ad hoc IR

• A linear score function is then

  Score(d, q) = w₁ × CosineScore + w₂ × ω + w₀

• And the linear classifier is

  Decide relevant if Score(d, q) > 0

• … just like when we were doing text classification
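A hedged sketch of this two-feature setup, using scikit-learn’s LogisticRegression as one possible linear classifier (any linear learner would do); all feature values are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# (cosine score, minimum query window size omega) per doc-query pair
X = np.array([[0.05, 2], [0.04, 3], [0.03, 2],   # relevant pairs
              [0.02, 4], [0.01, 5], [0.02, 5]])  # nonrelevant pairs
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X, y)
w1, w2 = clf.coef_[0]   # weights on CosineScore and omega
w0 = clf.intercept_[0]  # bias term
# decide relevant iff w1 * CosineScore + w2 * omega + w0 > 0
print(clf.predict([[0.045, 2]]))  # expected: [1] (high cosine, small window)
```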

SLIDE 15

Simple example: Using classification for ad hoc IR

[Figure: doc–query training examples plotted by term proximity vs. cosine score; R = relevant, N = nonrelevant, separated by a linear decision surface]

SLIDE 16

More complex example of using classification for search ranking [Nallapati 2004]

• We can generalize this to classifier functions over more features
• We can use methods we have seen previously for learning the linear classifier weights

SLIDE 17

Classifier for information retrieval

• Score(d, q) = w₁ f₁(d, q) + ⋯ + w_m f_m(d, q) + c
• Training: Score(d, q) must be negative for nonrelevant docs and positive for relevant docs
• Testing: decide relevant iff Score(d, q) ≥ 0
• Features are not word-presence features
  • To deal with query words not in your training data
  • Instead, use scores like the summed (log) tf of all query terms
• Unbalanced data
  • Problem: it can result in trivial always-say-nonrelevant classifiers
  • A solution: undersample nonrelevant docs during training (just take some at random)
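A minimal sketch of the random-undersampling fix (the helper name is mine):

```python
import numpy as np

def undersample_nonrelevant(X, y, seed=0):
    """Keep all relevant (y=1) examples but only a random subset of
    nonrelevant (y=0) ones, to avoid the always-say-nonrelevant trap."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
    keep = np.concatenate([pos, keep_neg])
    return X[keep], y[keep]
```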

SLIDE 18

An SVM classifier for ranking [Nallapati 2004]

• Experiments:
  • 4 TREC data sets
  • Comparisons with Lemur, a state-of-the-art open-source IR engine (Language Model (LM)-based – see IIR ch. 12)
  • 6 features, all variants of tf, idf, and tf.idf scores

SLIDE 19

An SVM classifier for ranking [Nallapati 2004]

  Train \ Test      Disk 3    Disk 4-5   WT10G (web)
  Disk 3     LM     0.1785    0.2503     0.2666
             SVM    0.1728    0.2432     0.2750
  Disk 4-5   LM     0.1773    0.2516     0.2656
             SVM    0.1646    0.2355     0.2675

• At best the results are about equal to LM
  • Actually a little bit below
• Paper’s advertisement: easy to add more features
• This is illustrated on a homepage-finding task on WT10G:
  • Baseline LM 52% p@10, baseline SVM 58%
  • SVM with URL-depth and in-link features: 78% p@10

SLIDE 20

Learning to rank

• Classification probably isn’t the right way to think about this
  • Classification: map to an unordered set of classes
  • Regression: map to a real value
  • Ordinal regression: map to an ordered set of classes
    • A fairly obscure sub-branch of statistics, but what we want here
• The ordinal regression formulation gives extra power:
  • Relations between relevance levels are modeled
• A number of categories of relevance:
  • These are totally ordered: c₁ < c₂ < ⋯ < c_K
• Training data: each doc–query pair is represented as a feature vector ψᵢ, with its relevance category cᵢ as the label

SLIDE 21

The Ranking SVM

[Herbrich et al. 1999, 2000; Joachims et al. 2002]

• Ranking model: f(d)

Sec. 15.4.2
SLIDE 22

Pairwise learning: The Ranking SVM

[Herbrich et al. 1999, 2000; Joachims et al. 2002]

• Aim is to classify instance pairs as correctly ranked or incorrectly ranked
  • This turns an ordinal regression problem back into a binary classification problem in an expanded space
• We want a ranking function f such that

  cᵢ > cₖ iff f(ψᵢ) > f(ψₖ)

  • … or at least one that tries to do this with minimal error
• Suppose that f is a linear function

  f(ψᵢ) = w · ψᵢ

Sec. 15.4.2
SLIDE 23

The Ranking SVM

[Herbrich et al. 1999, 2000; Joachims et al. 2002]

• Then (combining cᵢ > cₖ iff f(ψᵢ) > f(ψₖ) and f(ψᵢ) = w · ψᵢ):

  cᵢ > cₖ iff w · (ψᵢ − ψₖ) > 0

• Let us then create a new instance space from such pairs:

  Φᵤ = Φ(dᵢ, dₖ, q) = ψᵢ − ψₖ
  zᵤ = +1, 0, −1 as cᵢ >, =, < cₖ

• We can build the model over just the cases for which zᵤ = −1
• From training data S = {Φᵤ}, we train an SVM

Sec. 15.4.2
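To make the construction concrete, here is a small sketch of the pairwise expansion for one query’s docs (feature values and relevance levels are invented for illustration):

```python
import numpy as np

def pairwise_transform(psi, c):
    """Expand one query's docs into difference vectors.
    psi: (n, m) feature vectors; c: (n,) relevance levels.
    Returns Phi_u = psi_i - psi_k with z_u = sign(c_i - c_k),
    skipping ties (z_u = 0), which carry no ordering signal."""
    Phi, z = [], []
    for i in range(len(c)):
        for k in range(len(c)):
            if i != k and c[i] != c[k]:
                Phi.append(psi[i] - psi[k])
                z.append(1 if c[i] > c[k] else -1)
    return np.array(Phi), np.array(z)

psi = np.array([[0.9, 0.2], [0.5, 0.5], [0.1, 0.9]])  # hypothetical features
Phi, z = pairwise_transform(psi, c=np.array([2, 1, 0]))
print(Phi.shape, z)  # (6, 2) [ 1  1 -1  1 -1 -1]
```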
SLIDE 24

Two queries in the pairwise space

SLIDE 25

Ranking by a classifier

• Assume that the ranked list of docs for a set of sample queries is available as training data
• Or even a set of training data in the form of (d, d′, q, z) is available
  • This form of training data can also be derived from the available ranked lists of docs for sample queries, if they exist
• Again, we can use an SVM classifier for ranking

SLIDE 26

Ranking by a classifier

• Φ(d, q) = (f₁(d, q), …, f_m(d, q))
• Φ(d′, q) = (f₁(d′, q), …, f_m(d′, q))
• The learner seeks a vector w in the space of scores (constructed as above) such that

  wᵀΦ(d, q) ≥ wᵀΦ(d′, q)

  for each d that precedes d′ in the ranked list of docs for q available in the training data
• A linear classifier like an SVM can be used, where its training data are constructed as the following pairs of input (feature vector) and output (label), one per tuple (d, d′, q, z):

  (Φ(d, q) − Φ(d′, q), z)

  z = +1 if d must precede d′
  z = −1 if d′ must precede d

SLIDE 27

The Ranking SVM

[Herbrich et al. 1999, 2000; Joachims et al. 2002]

• The SVM learning task is then like the other examples that we saw before
• Find w and ξᵤ ≥ 0 such that
  • ½wᵀw + C Σ ξᵤ is minimized, and
  • for all Φᵤ such that zᵤ < 0, w · Φᵤ ≥ 1 − ξᵤ
• We can just use the negative-zᵤ cases, as the ordering is antisymmetric
• You can again use libSVM or SVMlight (or other SVM libraries) to train your model (the SVMrank specialization)

Sec. 15.4.2
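As a hedged stand-in for SVMlight/SVMrank, the sketch below trains scikit-learn’s LinearSVC on the pairwise difference vectors for one query (all numbers are invented):

```python
import numpy as np
from sklearn.svm import LinearSVC

psi = np.array([[0.9, 0.2], [0.5, 0.5], [0.1, 0.9]])  # hypothetical features
c   = np.array([2, 1, 0])                             # relevance: doc 0 is best

# difference vectors Phi_u = psi_i - psi_k with z_u = sign(c_i - c_k);
# keeping both signs of each pair is harmless since the ordering
# constraint is antisymmetric
pairs = [(i, k) for i in range(3) for k in range(3) if c[i] != c[k]]
Phi = np.array([psi[i] - psi[k] for i, k in pairs])
z   = np.array([np.sign(c[i] - c[k]) for i, k in pairs])

svm = LinearSVC(C=1.0).fit(Phi, z)
print(psi @ svm.coef_[0])  # ranking scores, highest for the most relevant doc
```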
SLIDE 28

Adapting the Ranking SVM for (successful) Information Retrieval

[Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon. SIGIR 2006]

• A Ranking SVM model already works well
  • Using things like vector space model scores as features
  • As we shall see, it outperforms standard IR in evaluations
• But it does not model important aspects of practical IR well
• This paper addresses two customizations of the Ranking SVM to fit an IR utility model

SLIDE 29

The ranking SVM fails to model the IR problem well …

1. Correctly ordering the most relevant docs is crucial to the success of an IR system, while misordering less relevant results matters little
   § The Ranking SVM considers all ordering violations as the same
2. Some queries have many (somewhat) relevant docs, and other queries few. If we treat all pairs of results for queries equally, queries with many results will dominate the learning
   § But actually, queries with few relevant results are at least as important to do well on

SLIDE 30

Learning to rank: Approaches

• Point-wise
  • Predicts a class label or relevance score
• Pair-wise
  • Predicting relative order is closer to the nature of ranking than predicting a class label or relevance score
  • Input is a pair of results for a query, and the class is the ordering relationship between them
• List-wise
  • Learns a ranking function
  • Models the ranking problem in a straightforward fashion
  • Can overcome the drawbacks of the above approaches by tackling the ranking problem directly

SLIDE 31

Problem with Pointwise Approach

• Properties of IR evaluation measures have not been well considered:
  • When the number of associated docs varies greatly across queries, learning is dominated by the queries with a large number of docs
  • The position of docs in the ranked list is ignored
    • The pointwise loss function may therefore place too much emphasis on unimportant docs (those ranked low in the final results)

SLIDE 32

Pairwise ranking

• Goal: classify instance pairs as correctly ranked or incorrectly ranked
  • This turns a ranking problem back into a binary classification problem
• Advantage
  • Predicting relative order is closer to the nature of ranking than predicting a class label or relevance score
• Problem
  • Across queries, the distribution of the number of doc pairs is even more skewed than the distribution of the number of docs

SLIDE 33

The Limitation of Machine Learning

• Most work produces linear models over features, by weighting different base features
• This contrasts with most of the clever ideas of traditional IR, which are nonlinear scalings and combinations of basic measurements
  • log term frequency, idf, pivoted length normalization
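One common response is to hand-build those nonlinear scalings as base features and let the linear learner weight them. A sketch under that assumption (the function name and the exact pivoted-normalization form are illustrative, not from the slides):

```python
import numpy as np

def nonlinear_base_features(tf, df, N, doc_len, avg_len, b=0.75):
    """Nonlinear rescalings of raw measurements, offered to a linear model."""
    log_tf = np.log1p(tf)                      # log term frequency
    idf = np.log(N / df)                       # inverse document frequency
    pivot = 1.0 - b + b * (doc_len / avg_len)  # pivoted length normalization
    return np.array([log_tf, idf, log_tf * idf / pivot])

print(nonlinear_base_features(tf=3, df=100, N=100_000, doc_len=250, avg_len=200))
```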

SLIDE 34

Summary

• The idea of learning ranking functions has been around for about 20 years
• But only recently have ML knowledge, availability of training data, and a rich space of features come together to make this a hot research area
• It’s too early to give a definitive statement on which methods are best in this area … it’s still advancing rapidly
• But machine-learned ranking over many features now easily beats traditional hand-designed ranking functions
• And there is every reason to think that the importance of machine learning in IR will increase in the future.

SLIDE 35

Resources

• IIR, 6.1
• IIR, 15.4