Why recall matters
Stephen Robertson
Microsoft Research Cambridge
Traditional ideas
● Assume binary relevance
● Assume (unranked, exact-match) set retrieval
● Note: although I will refer to metrics such as NDCG which can deal with graded relevance, I will not discuss that issue further in the present talk.
Traditional ideas
● Devices: things you might do to improve results
● Recall device: something to increase/improve recall (that is, increase the size of the retrieved set by allowing the query to match more items)
● Precision device: similarly, something to improve precision (that is, reduce the size of the retrieved set by making the query more specific)
The inverse relationship
● Recall devices reduce precision; precision devices reduce recall
● Hence recall and precision are in some sense in opposition (the recall/precision curve)
● The user should choose his/her emphasis
User orientation
● High-recall user: I want high recall; I’m not so interested in precision
● High-precision user: I want high precision; I’m not so interested in recall
Scoring and ranking
● Replace set retrieval with a scoring function measuring how well each document matches the query
● … and rank the results in descending score order (a minimal sketch follows below)
● Now think of stepping down the ranked list as a recall device
● … and stopping early as a precision device
● … leading to the usual recall-precision curve
● As a simplification, think of ranking the entire corpus
● Note that there are many other devices which interact with the scoring in complex ways
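A minimal sketch of the idea in Python. The scorer here (shared-term count) is an illustrative assumption; any per-document score (TF-IDF, BM25, …) slots into the same pattern.

```python
# Sketch: set retrieval replaced by scoring and ranking.
# The scoring function is a stand-in; the shape is what matters.

def score(query_terms, doc_terms):
    """How well a document matches the query: here, count of shared terms."""
    return len(set(query_terms) & set(doc_terms))

def rank(query_terms, corpus):
    """Score the entire corpus and return doc ids in descending score order.

    corpus: dict mapping doc_id -> list of terms.
    """
    return sorted(corpus, key=lambda doc_id: -score(query_terms, corpus[doc_id]))

# Stepping further down the ranked list acts as a recall device;
# cutting the list off earlier acts as a precision device.
```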
Recall–Precision curve [figure]
Single topic curves [figure: two per-topic recall-precision plots, both axes 0 to 1]
Scoring and ranking
● Note an asymmetry: as you step down, recall must increase (or at least not decrease), but precision may go either way (illustrated in the sketch below)
● Its tendency to decrease is not a logical property, but a statistical one
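The asymmetry is easy to see by tracing both measures down a ranked list. A sketch, where `relevance` is an assumed list of binary judgements in rank order:

```python
# Trace precision and recall down a ranked list of binary judgements.
def precision_recall_by_rank(relevance, total_relevant):
    hits, points = 0, []
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        points.append((hits / k, hits / total_relevant))  # (precision, recall)
    return points

# Ranking [1, 0, 1] with 2 relevant items in the collection:
# recall goes 0.5, 0.5, 1.0 (never falls);
# precision goes 1.0, 0.5, 0.67 (down, then up again).
print(precision_recall_by_rank([1, 0, 1], total_relevant=2))
```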
High-precision search
● Assume that the user really only wants to see a small number of (highly) relevant items
– extreme case: just one would suffice
● Metrics commonly used: Precision at 5, Mean Reciprocal Rank, NDCG@1, … (sketched below)
● Common view: recall is of no consequence
– what the eye does not see …
● Web search is generally thought of this way
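A sketch of two of the metrics named above, assuming judgements arrive as a 0/1 list in rank order. (For binary relevance, NDCG@1 reduces to the relevance of the top result.)

```python
def precision_at_k(relevance, k):
    """P@k, e.g. Precision at 5: fraction of the top k results that are relevant."""
    return sum(relevance[:k]) / k

def reciprocal_rank(relevance):
    """1/rank of the first relevant result; its mean over queries is MRR."""
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0  # no relevant result retrieved
```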
Recall-oriented search
● Some search tasks/environments are seen to be recall-oriented:
– E-discovery: documents required for disclosure in a legal case
– Prior-art patent search: looking for existing patents which might invalidate a new patent application
– Evidence-based medicine: assessing all the evidence on alternative approaches
● But these are often thought of as exceptions, strange special cases
It’s the web that’s strange
● Peculiarities of the (English) web:
– size
– variety of material
– variety of authorship
– lack of standardisation (of anything)
– linguistic variety
– variety of anchor text
– variety of quality
– variety of level
– scale of search activity
– monetisation of search engines
Example: enterprise search
● Enterprise search environment much more limited
– much less variety
– various kinds of standardisation
– few searches
● Even for high-precision search, need to think about recall issues
– e.g. using the right terminology
Recall
● So for all non-web search…
– some attention to recall is necessary
– recall devices may be useful
● But even in web search
– query modifications and suggestions (e.g. spelling correction, singular/plural, related queries…) are often recall devices
– these are only of value if they increase recall (that is, lead you to relevant items you would not otherwise have found)
– some kinds of queries/web environments need particular attention to recall (e.g. minority languages)
The recall-fallout curve
● (Another way of thinking about the performance curve)
● Signal detection systems
– the system is trying to tell us which items are relevant
– relevant item = signal, non-relevant item = noise
● Recall is the true positive rate, fallout is the false positive rate (definitions below)
● Operating characteristic (OC, ROC) curve: recall against fallout
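In contingency-table terms (true/false positive/negative counts over the retrieved set), the two rates are:

```latex
\[
\text{recall} = \frac{TP}{TP + FN},
\qquad
\text{fallout} = \frac{FP}{FP + TN}
\]
```

Recall is computed over the signal (relevant) items only, and fallout over the noise only; this separation is part of what makes the ROC view convenient.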
The recall-fallout curve [figure]
The recall-fallout curve
● As in IR, distinguish between rank order and decision point
– although the distinction is stronger in most IR contexts
– IR: the system only ranks; the user makes an on-the-fly decision about stopping point
– signal detection: the system normally has to have a set threshold for acceptance/rejection
● Nevertheless, consider harder stopping points in IR
– e.g. for filtering, legal discovery
The recall-fallout curve
● A single measure: area under the ROC curve = probability of pairwise success (sketched below)
– choose a random signal–noise pair (one of each); measure the probability that the signal instance is ranked before the noise instance
● Contrast Average Precision = area under the recall-precision curve
– can also be interpreted as a top-weighted probability of pairwise success
● Recall/fallout graphs have little practical use in IR
– but the idea does provide useful insights
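A sketch of both readings, assuming `signal` and `noise` are lists of scores for relevant and non-relevant items, and `relevance` is a binary judgement list ranking the whole collection (so every relevant item appears):

```python
# AUC computed directly as the pairwise probability described above
# (ties counted as half a success).
def auc_pairwise(signal, noise):
    wins = sum(1.0 if s > n else 0.5 if s == n else 0.0
               for s in signal for n in noise)
    return wins / (len(signal) * len(noise))

# Average Precision: precision at each relevant rank, averaged over
# the relevant items; this is what top-weights the pairwise comparisons.
def average_precision(relevance):
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0
```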
Single topic curves [figure: two per-topic plots, both axes 0 to 1]
A simplified view
● Assume (for the purpose of this discussion):
– all queries look the same
– the entire collection is scored/ranked
– evaluation metrics are continuous functions of the score
– the signal-to-noise ratio (generality) is reasonable
● Normally get a smooth concave curve
The recall-fallout curve [figures]
A touch of realism [figure]
Now where is precision?
● Precision combines information from both noise and signal
– and thus may be seen as an overall measure
– but only for a single point on the graph
● If we fix other things
– e.g. the generality of the query = ratio of signal to noise
● … then we can see a form of precision on the graph (made explicit below):
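One way to make this explicit (a standard derivation, not spelled out on the slide): write g for generality, the proportion of the collection of size N that is relevant. A point (F, R) on the curve then retrieves gNR relevant and (1−g)NF non-relevant documents, so

```latex
\[
P \;=\; \frac{gR}{gR + (1-g)F}
\]
```

Fixing P (and g) makes R proportional to F: lines of constant precision are straight lines through the origin of the recall-fallout graph, steeper for higher precision.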
The recall-fallout curve
The recall-fallout curve
Devices revisited
● Recall device: be more inclusive
– intention: increase recall
– necessarily increases both recall and fallout
– probably reduces precision, because of the curvature
● Fallout device (was: precision device): be more selective
– intention: reduce fallout
– necessarily reduces both
– should increase precision, because of the curvature
User orientation
● High-recall user: I want high recall; I’m less worried by high fallout
● Low-fallout user (was: high-precision user): I want low fallout; I’m less interested in high recall
The challenge of recall
● (Some) recall is necessary, even for precision!
● Recall challenges:
– measuring/estimating recall
– discovering recall failures
– improving recall!
– providing the user with evidence about recall
– providing the user with guidance on how far to go (optimising the stopping point)
– predicting recall-at-a-stopping-point