Why recall matters
Stephen Robertson, Microsoft Research Cambridge


  1. Why recall matters (Stephen Robertson, Microsoft Research Cambridge)

  2. Traditional ideas
     ● Assume binary relevance
     ● Assume (unranked, exact-match) set retrieval
     ● Note: although I will refer to metrics such as NDCG, which can deal with graded relevance, I will not discuss that issue further in the present talk.

  3. Traditional ideas
     ● Devices: things you might do to improve results
     ● Recall device: something to increase/improve recall (that is, increase the size of the retrieved set by allowing the query to match more items)
     ● Precision device: similarly, something to improve precision (that is, reduce the size of the retrieved set by making the query more specific)
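A minimal set-retrieval sketch of these two devices; the toy corpus, query terms, and function names are all illustrative, not from the talk:

```python
# Toy Boolean set retrieval: each document is a set of terms.
docs = {
    1: {"information", "retrieval", "evaluation"},
    2: {"information", "retrieval"},
    3: {"retrieval", "ranking"},
    4: {"information", "theory"},
    5: {"signal", "detection"},
}

def match_any(terms):
    """Recall device: OR the terms together -> a larger retrieved set."""
    return {d for d, words in docs.items() if words & terms}

def match_all(terms):
    """Precision device: AND the terms together -> a smaller retrieved set."""
    return {d for d, words in docs.items() if terms <= words}

query = {"information", "retrieval"}
print(match_any(query))   # {1, 2, 3, 4}: broader match, higher recall
print(match_all(query))   # {1, 2}: narrower match, higher precision
```

Broadening the match can only grow the retrieved set; narrowing it can only shrink the set, which is exactly the size trade-off the two device types describe.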

  4. The inverse relationship
     ● Recall devices reduce precision; precision devices reduce recall
     ● Hence recall and precision are in some sense in opposition (the recall-precision curve)
     ● The user should choose his/her emphasis

  5. User orientation
     ● High-recall user: I want high recall; I’m not so interested in precision
     ● High-precision user: I want high precision; I’m not so interested in recall

  6. Scoring and ranking
     ● Replace set retrieval with a scoring function measuring how well each document matches the query...
     ● ...and rank the results in descending score order
     ● Now think of stepping down the ranked list as a recall device...
     ● ...and stopping early as a precision device...
     ● ...leading to the usual recall-precision curve
     ● As a simplification, think of ranking the entire corpus
     ● Note that there are many other devices which interact with the scoring in complex ways
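A minimal sketch of stepping down a ranked list and reading off a (recall, precision) point at each position; the relevance judgements here are invented for illustration:

```python
def recall_precision_points(ranked_rel, total_relevant):
    """Yield (recall, precision) after each rank position."""
    hits = 0
    for k, rel in enumerate(ranked_rel, start=1):
        hits += rel
        yield hits / total_relevant, hits / k

# 1 = relevant, 0 = non-relevant, in ranked order (invented data).
ranked = [1, 0, 1, 1, 0, 0, 1, 0]
for recall, precision in recall_precision_points(ranked, total_relevant=4):
    # Recall never decreases as we step down; precision may go either way.
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```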

  7. Recall–Precision curve

  8. Single topic curves [figure: two panels of single-topic curves; both axes run 0 to 1]

  9. Scoring and ranking
     ● Note an asymmetry: as you step down, recall must increase (or at least not decrease)...
     ● ...but precision may go either way
     ● Its tendency to decrease is not a logical property, but a statistical one

  10. High-precision search
     ● Assume that the user really only wants to see a small number of (highly) relevant items
       – extreme case: just one would suffice
     ● Metrics commonly used: Precision at 5, mean reciprocal rank, NDCG@1, ...
     ● Common view: recall is of no consequence (what the eye does not see...)
     ● Web search is generally thought of this way
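Minimal sketches of the three metrics named above, for a single query with binary relevance judgements down the ranking (the data is invented):

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top k items that are relevant."""
    return sum(rels[:k]) / k

def reciprocal_rank(rels):
    """1/rank of the first relevant item, 0 if none is retrieved."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1 / i
    return 0.0

def ndcg_at_k(rels, k):
    """DCG of the top k, normalised by the ideal (re-sorted) DCG."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

rels = [0, 1, 1, 0, 1, 0]        # invented judgements, ranked order
print(precision_at_k(rels, 5))   # 0.6
print(reciprocal_rank(rels))     # 0.5: first relevant item at rank 2
print(ndcg_at_k(rels, 1))        # 0.0: top-ranked item is non-relevant
```

With binary judgements NDCG@1 reduces to asking whether the top-ranked item is relevant; the same formula handles graded gains, which the talk sets aside.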

  11. Recall-oriented search
     ● Some search tasks/environments are seen to be recall-oriented:
       – E-discovery: documents required for disclosure in a legal case
       – Prior-art patent search: looking for existing patents which might invalidate a new patent application
       – Evidence-based medicine: assessing all the evidence on alternative approaches
     ● But these are often thought of as exceptions, strange special cases

  12. It’s the web that’s strange
     ● Peculiarities of the (English) web:
       – size
       – variety of material
       – variety of authorship
       – lack of standardisation (of anything)
       – linguistic variety
       – variety of anchor text
       – variety of quality
       – variety of level
       – scale of search activity
       – monetisation of search engines

  13. Example: enterprise search
     ● The enterprise search environment is:
       – much more limited
       – much less varied
       – standardised in various ways
       – subject to few searches
     ● Even for high-precision search, we need to think about recall issues
       – e.g. using the right terminology

  14. Recall
     ● So for all non-web search...
       – some attention to recall is necessary
       – recall devices may be useful
     ● But even in web search:
       – query modifications and suggestions are often recall devices (e.g. spelling correction, singular/plural, related queries...)
       – these are only of value if they increase recall (that is, lead you to relevant items you would not otherwise have found)
       – some kinds of queries/web environments need particular attention to recall, e.g. minority languages

  15. The recall-fallout curve
     ● (Another way of thinking about the performance curve)
     ● Signal detection systems
       – the system is trying to tell us which items are relevant
       – relevant item = signal, non-relevant item = noise
     ● Recall is the true positive rate; fallout is the false positive rate
     ● Operating characteristic (OC, ROC) curve: recall against fallout
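A minimal sketch of tracing the recall-fallout (ROC) points as the cutoff steps down a ranking, again with invented judgements:

```python
def recall_fallout_points(ranked_rel):
    """Return (fallout, recall) after each rank position, starting at (0, 0)."""
    signal = sum(ranked_rel)            # number of relevant items
    noise = len(ranked_rel) - signal    # number of non-relevant items
    tp = fp = 0
    points = [(0.0, 0.0)]
    for rel in ranked_rel:
        tp += rel
        fp += 1 - rel
        points.append((fp / noise, tp / signal))
    return points

for fallout, recall in recall_fallout_points([1, 0, 1, 1, 0, 0, 1, 0]):
    print(f"fallout={fallout:.2f}  recall={recall:.2f}")
```

Unlike precision, both coordinates here are monotone: each step down the ranking moves the operating point up (a relevant item) or right (a non-relevant one), never back.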

  16. The recall-fallout curve

  17. The recall-fallout curve
     ● As in IR, distinguish between rank order and decision point
       – although the distinction is stronger in most IR contexts
       – IR: the system only ranks; the user makes an on-the-fly decision about the stopping point
       – signal detection: the system normally has to have a set threshold for acceptance/rejection
     ● Nevertheless, consider harder stopping points in IR
       – e.g. for filtering, legal discovery

  18. The recall-fallout curve
     ● A single measure: area under the ROC curve = probability of pairwise success
       – choose a random signal-noise pair (one instance of each) and measure the probability that the signal instance is ranked before the noise instance
     ● Contrast Average Precision = area under the recall-precision curve
       – it can also be interpreted as a top-weighted probability of pairwise success
     ● Recall-fallout graphs have little practical use in IR
       – but the idea does provide useful insights
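A small sketch of both pairwise readings on an invented ranking: the plain pairwise count gives the area under the ROC curve, while Average Precision weights success towards the top of the ranking:

```python
def auc_pairwise(ranked_rel):
    """P(random relevant item is ranked above a random non-relevant item)."""
    wins = pairs = 0
    for i, ri in enumerate(ranked_rel):
        for j, rj in enumerate(ranked_rel):
            if ri == 1 and rj == 0:
                pairs += 1
                wins += i < j          # the relevant item appears earlier
    return wins / pairs

def average_precision(ranked_rel):
    """Mean of precision at each relevant rank (area under the R-P curve)."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / sum(ranked_rel)

ranked = [1, 0, 1, 1, 0, 0, 1, 0]      # invented judgements
print(auc_pairwise(ranked))            # 0.6875
print(average_precision(ranked))       # ~0.747
```

On this ranking the top-weighted measure comes out higher than the unweighted one, because the relevant items cluster near the top.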

  19. Single topic curves [figure: two panels of single-topic curves; both axes run 0 to 1]

  20. A simplified view
     ● Assume (for the purpose of this discussion):
       – all queries look the same
       – the entire collection is scored/ranked
       – evaluation metrics are continuous functions of the score
       – the signal-to-noise ratio (generality) is reasonable
     ● Normally get a smooth concave curve

  21. The recall-fallout curve

  22. The recall-fallout curve

  23. A touch of realism

  24. Now where is precision?
     ● Precision combines information from both noise and signal
       – and thus may be seen as an overall measure
       – but only for a single point on the graph
     ● If we fix other things, e.g. the generality of the query (= ratio of signal to noise)...
     ● ...then we can see a form of precision on the graph:
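A sketch of the algebra this points at, taking generality g to be the signal-to-noise ratio as the slide defines it, and holding it fixed:

```latex
% With S relevant (signal) and N non-relevant (noise) items, recall R
% and fallout F give TP = R S true positives and FP = F N false
% positives, so
\[
  P \;=\; \frac{TP}{TP + FP}
    \;=\; \frac{R\,S}{R\,S + F\,N}
    \;=\; \frac{g\,R}{g\,R + F},
  \qquad g = \frac{S}{N}.
\]
% Fixing P gives R/F = P / (g (1 - P)): contours of constant precision
% are straight lines through the origin of the recall-fallout graph,
% steeper lines corresponding to higher precision.
\[
  \frac{R}{F} \;=\; \frac{P}{g\,(1 - P)}.
\]
```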

  25. The recall-fallout curve

  26. The recall-fallout curve

  27. Devices revisited
     ● Recall device: be more inclusive
       – intention: increase recall
       – necessarily increases both recall and fallout
       – probably reduces precision, because of the curvature
     ● Fallout device (previously: precision device): be more selective
       – intention: reduce fallout
       – necessarily reduces both recall and fallout
       – should increase precision, because of the curvature
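A numeric sketch of the curvature argument; the concave curve recall = sqrt(fallout) and the generality g = 0.01 (one relevant item per hundred non-relevant) are illustrative assumptions, not from the talk:

```python
g = 0.01  # assumed generality, as a signal-to-noise ratio

def precision(recall, fallout):
    # P = gR / (gR + F), from the identity on the previous slide
    return g * recall / (g * recall + fallout)

def curve(fallout):
    return fallout ** 0.5   # an assumed smooth concave operating characteristic

for f in (0.01, 0.04, 0.16):   # a recall device pushes the point rightwards
    r = curve(f)
    print(f"fallout={f:.2f}  recall={r:.2f}  precision={precision(r, f):.3f}")

# Recall and fallout both rise, while precision falls
# (0.091 -> 0.048 -> 0.024): the trade-off described above.
```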

  28. User orientation
     ● High-recall user: I want high recall; I’m less worried by high fallout
     ● Low-fallout user (previously: high-precision user): I want low fallout; I’m less interested in high recall

  29. The challenge of recall
     ● (Some) recall is necessary, even for precision!
     ● Recall challenges:
       – measuring/estimating recall
       – discovering recall failures
       – improving recall!
       – providing the user with evidence about recall
       – providing the user with guidance on how far to go (optimising the stopping point)
       – predicting recall-at-a-stopping-point
