Chapter 12: Query Processing

"Computers are useless, they can only give you answers." -- Pablo Picasso
"You have to think anyway, so why not think big?" -- Donald Trump
"There are lies, damn lies, and workload assumptions." -- anonymous

IRDM WS 2015
Outline
12.1 Query Processing Algorithms
12.2 Fast Top-k Search
12.3 Phrase and Proximity Queries
12.4 Query Result Diversification
loosely following Büttcher/Clarke/Cormack Chapters 5 and 8.6, Manning/Raghavan/Schütze Chapters 7 and 9, plus specific literature
Query Types
• Conjunctive (i.e., all query terms are required)
• Disjunctive (i.e., a subset of the query terms is sufficient)
• Phrase or proximity (i.e., query terms must occur in the right order or close enough to each other)
• Mixed-mode with negation (e.g., "harry potter" review +movie -book)
• Combined with ranking of result documents according to score(q,d) = Σ_{t∈q} score(t,d), with score(t,d) depending on the retrieval model (e.g., tf*idf)
Indexing with Document-Ordered Lists
Data items: d1, …, dn with per-term scores, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2

Index lists, with entries stored in ascending order of document identifiers (document-ordered lists):
t1: (d1, 0.7) (d10, 0.8) (d23, 0.8) (d78, 0.9) (d88, 0.2) …
t2: (d1, 0.2) (d10, 0.6) (d23, 0.6) (d64, 0.8) (d78, 0.1) …
t3: (d10, 0.7) (d34, 0.1) (d64, 0.4) (d78, 0.5) (d99, 0.2) …

Process all queries (conjunctive/disjunctive/mixed) by sequential scan and merge of posting lists.
Document-at-a-Time Query Processing
Document-at-a-Time (DAAT) query processing
– assumes document-ordered posting lists
– scans the posting lists for query terms t1, …, t|q| concurrently
– maintains an accumulator for each candidate result doc:
  acc(d) = Σ_{i: d occurs in L(t_i)} score(t_i, d)

Example posting lists:
a: (d1, 1.0) (d4, 2.0) (d7, 0.2) (d8, 0.1)
b: (d4, 1.0) (d7, 2.0) (d8, 0.2) (d9, 0.1)
c: (d4, 3.0) (d7, 1.0)
Accumulators: d1: 1.0, d4: 6.0, d7: 3.2, d8: 0.3, d9: 0.1

– always advances the posting list with the lowest current doc id
– exploits skip pointers when applicable
– required memory depends on the number of results to be returned
– keeps the top-k results in a priority queue
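The DAAT merge above can be sketched in Python. This is an illustrative sketch, not from the slides: posting lists are assumed to be in-memory lists of (doc_id, score) pairs, aggregation is a plain sum, and skip pointers are omitted.

```python
import heapq

def daat_topk(postings, k):
    """Document-at-a-time top-k over document-ordered posting lists.

    postings: one list per query term, each sorted by ascending doc id,
    with entries (doc_id, score). Returns the k best (score, doc_id) pairs.
    """
    pos = [0] * len(postings)  # current scan position per list
    topk = []                  # min-heap of (score, doc_id), size <= k

    while True:
        # next candidate = smallest current doc id across all non-exhausted lists
        heads = [lst[p][0] for lst, p in zip(postings, pos) if p < len(lst)]
        if not heads:
            break
        d = min(heads)
        score = 0.0
        # accumulate d's score and advance every list currently positioned on d
        for i, lst in enumerate(postings):
            if pos[i] < len(lst) and lst[pos[i]][0] == d:
                score += lst[pos[i]][1]
                pos[i] += 1
        # priority queue of the current top-k results
        if len(topk) < k:
            heapq.heappush(topk, (score, d))
        elif score > topk[0][0]:
            heapq.heapreplace(topk, (score, d))

    return sorted(topk, reverse=True)
```

On the slide's example lists a, b, c, `daat_topk(..., k=2)` yields d4 (score 6.0) and d7 (score 3.2), matching the accumulators above.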
DAAT with Weak And: The WAND Method [Broder et al. 2003]
Disjunctive (Weak And) query processing
– assumes document-ordered posting lists with a known maxscore(i) value for each t_i: maxscore(i) = max_d score(d, t_i)
– while scanning the posting lists, keep track of
  • min-k: the lowest total score in the current top-k results
  • ordered term list: terms sorted by doc id at the current scan position
  • pivot term: the smallest j such that min-k ≤ Σ_{i≤j} maxscore(i)
  • pivot doc: the doc id at the current scan position in posting list L_j
Eliminate docs that cannot become top-k results (maxscore pruning):
– if the pivot term does not exist (min-k > Σ_j maxscore(j)), then stop
– else advance the scan positions to the doc id of the pivot doc ("big skip")
Example: DAAT with the WAND Method [Broder et al. 2003]
Key invariant: for terms i = 1..|q| with current scan positions cur_i, assume cur_1 = min {cur_i | i = 1..|q|}. Then no posting list i contains any doc id between cur_1 and cur_i.

term i | maxscore_i | cur_i
1      | 5          | 101
2      | 4          | 250
3      | 2          | 300
4      | 3          | 600

By the invariant, list 4 cannot contain any doc id in [102, 599].
Suppose min-k = 12. Then the pivot term is 4 (Σ_{i=1..3} maxscore_i = 11 ≤ min-k, while Σ_{i=1..4} maxscore_i = 14 > min-k) and the pivot doc id is 600, so all scan positions cur_i can be advanced to 600.
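The pivot selection in this example can be sketched as follows. A minimal sketch, not from the slides: `cursors` and `maxscores` are assumed to be parallel arrays indexed by term, and the function only finds the pivot (the "big skip" of the cursors is left out).

```python
def find_pivot(cursors, maxscores, min_k):
    """WAND pivot selection.

    Sort terms by their current doc id, then walk this ordered term list
    accumulating maxscores; the first term whose prefix sum exceeds min-k
    is the pivot term, and its current doc id is the pivot doc.
    Returns the pivot doc id, or None if no document can beat min-k.
    """
    order = sorted(range(len(cursors)), key=lambda i: cursors[i])
    prefix = 0.0
    for i in order:
        prefix += maxscores[i]
        if prefix > min_k:
            # every doc before cursors[i] has an upper bound <= min_k
            return cursors[i]
    return None  # sum of all maxscores <= min_k: safe to stop entirely
```

With the table above (`cursors = [101, 250, 300, 600]`, `maxscores = [5, 4, 2, 3]`, `min_k = 12`) the function returns 600; raising `min_k` above 14 makes it return None, matching the stop condition.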
Term-at-a-Time Query Processing
Term-at-a-Time (TAAT) query processing
– assumes document-ordered posting lists
– scans the posting lists for query terms t1, …, t|q| one at a time (possibly in decreasing order of idf values)
– maintains an accumulator for each candidate result doc; after processing L(t_j):
  acc(d) = Σ_{i≤j} score(t_i, d)

Example posting lists:
a: (d1, 1.0) (d4, 2.0) (d7, 0.2) (d8, 0.1)
b: (d4, 1.0) (d7, 2.0) (d8, 0.2) (d9, 0.1)
c: (d4, 3.0) (d7, 1.0)
Accumulators after a: d1: 1.0, d4: 2.0, d7: 0.2, d8: 0.1
after b: d1: 1.0, d4: 3.0, d7: 2.2, d8: 0.3, d9: 0.1
after c: d1: 1.0, d4: 6.0, d7: 3.2, d8: 0.3, d9: 0.1

– memory depends on the number of accumulators maintained
– TAAT is attractive when scanning many short lists
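A minimal TAAT sketch in the same illustrative setting as before (in-memory (doc_id, score) lists, sum aggregation); the accumulator table is a plain dict keyed by doc id:

```python
def taat_topk(postings, k):
    """Term-at-a-time top-k: process one posting list completely
    before moving to the next, summing scores per document."""
    acc = {}  # accumulator table: doc_id -> partial score
    for lst in postings:          # one term at a time
        for doc, score in lst:
            acc[doc] = acc.get(doc, 0.0) + score
    # rank the accumulated candidates and keep the k best
    return sorted(acc.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

On the example lists this reproduces the final accumulator state (d4: 6.0, d7: 3.2, …); note that, unlike DAAT, memory grows with the number of accumulated candidates rather than with k.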
Indexing with Impact-Ordered Lists
Data items: d1, …, dn with per-term scores, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2

Index lists, with entries stored in descending order of per-term score impact (impact-ordered lists):
t1: (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.7) (d88, 0.2) …
t2: (d64, 0.8) (d23, 0.6) (d10, 0.6) (d1, 0.2) (d78, 0.1) …
t3: (d10, 0.7) (d78, 0.5) (d64, 0.4) (d99, 0.2) (d34, 0.1) …

Aims to avoid having to read entire lists; rather, scan only (short) prefixes of the lists.
Greedy Query Processing Framework
Assume index lists are sorted by tf(t_i,d_j) or tf(t_i,d_j)*idl(d_j) values; idf values are stored separately.

Open scan cursors on all m index lists L(i);
Repeat
  Find pos(g) among the current cursor positions pos(i) (i=1..m) with the largest value of idf(t_i)*tf(t_i,d_j) (or idf(t_i)*tf(t_i,d_j)*idl(d_j));
  Update the accumulator of the corresponding doc;
  Increment pos(g);
Until stopping condition
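The greedy loop above can be sketched with a heap over the list heads. An illustrative sketch, not from the slides: lists hold (doc_id, tf_value) pairs already sorted by descending impact, and a simple posting budget stands in for the quit/continue stopping criteria of the next slide.

```python
import heapq

def greedy_scan(lists, idf, budget):
    """Greedy framework: repeatedly consume the cursor whose current
    entry has the largest idf(t_i) * tf contribution.

    lists: per-term lists of (doc_id, tf_value), impact-ordered (descending).
    idf: per-term idf weights. budget: max number of postings to consume
    (a stand-in for the real stopping condition). Returns the accumulators.
    """
    acc = {}
    # max-heap over current cursor positions, keyed by negated contribution
    heap = [(-idf[i] * lists[i][0][1], i, 0)
            for i in range(len(lists)) if lists[i]]
    heapq.heapify(heap)
    steps = 0
    while heap and steps < budget:
        neg, i, p = heapq.heappop(heap)      # largest contribution overall
        doc, val = lists[i][p]
        acc[doc] = acc.get(doc, 0.0) + idf[i] * val
        if p + 1 < len(lists[i]):            # advance this list's cursor
            heapq.heappush(heap, (-idf[i] * lists[i][p + 1][1], i, p + 1))
        steps += 1
    return acc
```

The heap makes "find the cursor with the largest value" an O(log m) step instead of a linear scan over all m cursors.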
Stopping Criterion: Quit & Continue Heuristics [Zobel/Moffat 1996]
For scoring of the form score(q,d) = Σ_{i=1..m} s(t_i, d) with s(t_i, d) ~ tf(t_i, d_j) * idf(t_i) * idl(d_j)
Assume a hash array of accumulators for summing up the score mass of candidate results.
quit heuristics (with docId-ordered or tf-ordered or tf*idl-ordered index lists):
• ignore index list L(i) if idf(t_i) is below a tunable threshold, or
• stop scanning L(i) if idf(t_i)*tf(t_i,d_j)*idl(d_j) drops below a threshold, or
• stop scanning L(i) when the number of accumulators becomes too high
continue heuristics: upon reaching the threshold, continue scanning the index lists, but do not add any new documents to the accumulator array
12.2 Fast Top-k Search
Top-k aggregation query over relation R(Item, A1, ..., Am):
  Select Item, s(R1.A1, ..., Rm.Am) As Aggr
  From the Outer Join of R1, …, Rm
  Order By Aggr Limit k
with monotone s: (∀i: x_i ≤ x_i') ⇒ s(x_1, …, x_m) ≤ s(x_1', …, x_m')
(example: item is a doc, attributes are terms, attribute values are scores)
• Precompute per-attribute (index) lists sorted in descending attribute-value order (score-ordered, impact-ordered)
• Scan lists by sorted access (SA) in round-robin manner
• Perform random accesses (RA) by Item when convenient
• Compute the aggregation s incrementally in accumulators
• Stop when a threshold test guarantees the correct top-k (or when heuristics indicate a "good enough" approximation)
Simple & elegant; adaptable & extensible to distributed systems.
following R. Fagin: Optimal aggregation algorithms for middleware, JCSS 66(4), 2003
Threshold Algorithm (TA) [Fagin 01, Güntzer 00, Nepal 99, Buckley 85]
Simple & DB-style; needs only O(k) memory.
Threshold Algorithm (TA):
  scan the index lists round-robin; consider doc d at the current position of L_i;
  high_i := s(t_i, d);
  if d ∉ top-k then {
    look up s_ν(d) in all lists L_ν with ν ≠ i;   (random accesses)
    score(d) := aggr {s_ν(d) | ν = 1..m};
    if score(d) > min-k then add d to top-k and remove the min-score d';
    min-k := min {score(d') | d' ∈ top-k};
  }
  threshold := aggr {high_ν | ν = 1..m};
  if threshold ≤ min-k then exit;

Example: query q = (t1, t2, t3), k = 2
Index lists:
t1: (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.7) (d88, 0.2) …
t2: (d64, 0.9) (d23, 0.6) (d10, 0.6) (d12, 0.2) (d78, 0.1) …
t3: (d10, 0.7) (d78, 0.5) (d64, 0.3) (d99, 0.2) (d34, 0.1) …
After scan depth 1 the top-2 is already {d10: 2.1, d78: 1.5}; it does not change at depths 2-4, and at depth 4 the threshold (0.7 + 0.2 + 0.2 = 1.1) drops to min-k = 1.5 or below → STOP!
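A compact sketch of TA under the same illustrative assumptions as the earlier snippets: score-ordered lists of (doc_id, score) pairs, sum as the aggregation, and random access simulated by per-list dictionaries.

```python
import heapq

def threshold_algorithm(lists, k):
    """Threshold Algorithm over impact/score-ordered lists.

    lists: one list per term, sorted by descending score, entries (doc, score).
    Scans all lists in lockstep (round-robin by depth); each new doc is fully
    scored via random accesses; stops when threshold <= min-k.
    """
    ra = [dict(lst) for lst in lists]   # random-access structure per list
    seen = set()
    topk = []                           # min-heap of (score, doc), size <= k
    max_depth = max(len(lst) for lst in lists)

    for depth in range(max_depth):
        threshold = 0.0                 # aggr of high_i at this scan depth
        for lst, r_all in zip(lists, ra):
            if depth >= len(lst):
                continue
            doc, s = lst[depth]
            threshold += s              # high_i for this list
            if doc not in seen:
                seen.add(doc)
                # random accesses: full score across all lists
                score = sum(r.get(doc, 0.0) for r in ra)
                if len(topk) < k:
                    heapq.heappush(topk, (score, doc))
                elif score > topk[0][0]:
                    heapq.heapreplace(topk, (score, doc))
        min_k = topk[0][0] if len(topk) == k else float("-inf")
        if threshold <= min_k:          # threshold test
            break

    return sorted(topk, reverse=True)
```

On the slide's three lists with k = 2, the scan stops at depth 4 and returns d10 (2.1) and d78 (1.5), as in the example.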