Chapter V: Indexing & Searching
Information Retrieval & Data Mining
Universität des Saarlandes, Saarbrücken
Winter Semester 2011/12
Chapter V: Indexing & Searching*
V.1 Indexing & Query Processing — inverted indexes, B+-trees, merging vs. hashing, MapReduce & distribution, index caching
V.2 Compression — dictionary-based vs. variable-length encoding, Gamma encoding, S16, P-for-Delta
V.3 Top-k Query Processing — heuristic top-k approaches, Fagin's family of threshold algorithms, IO-Top-k, top-k with incremental merging, and others
V.4 Efficient Similarity Search — high-dimensional similarity search, SpotSigs algorithm, Min-Hashing & Locality-Sensitive Hashing (LSH)
*mostly following Chapters 4 & 5 from Manning/Raghavan/Schütze and Chapter 9 from Baeza-Yates/Ribeiro-Neto, with additions from recent research papers
IR&DM, WS'11/12, November 29, 2011
V.1 Indexing
Web, intranet, digital libraries, desktop search — unstructured/semistructured data.
[Figure: search-engine pipeline crawl → extract & clean → index → search → rank → present, annotated with: strategies for crawl schedule and priority queue for the crawl frontier; detect duplicates, detect spam; build and analyze the Web graph; index all tokens or word stems; handle fast top-k queries, query logging, auto-completion; scoring function over many data and context criteria; GUI, user guidance, dynamic pages, personalization]
Server farms with 10,000s (2002) to 100,000s (2010) of computers, distributed/replicated data in high-performance file systems (GFS, HDFS, …), massive parallelism for query processing (MapReduce, Hadoop, …).
Content Gathering and Indexing
[Figure: content-gathering pipeline] Documents (WWW, e.g., "Surfing in Internet cafes with or without …") → Extraction of relevant words → Linguistic methods: stemming, lemmas (e.g., Surfing → Surf, Cafes → Cafe) → Statistically weighted features (terms) → Indexing → Index (B+-tree) over terms and document URLs. A thesaurus/ontology contributes synonyms and sub-/super-concepts (e.g., Bistro ≈ Cafe).
Documents are represented as bag-of-words.
Vector Space Model for Relevance Ranking
Ranking by descending relevance: the search engine scores documents d_i ∈ [0,1]^|F| against a query q ∈ [0,1]^|F| (a set of weighted features) by a similarity metric, e.g., the cosine measure:

  sim(d_i, q) := ( Σ_{j=1..|F|} d_ij · q_j ) / ( sqrt(Σ_{j=1..|F|} d_ij²) · sqrt(Σ_{j=1..|F|} q_j²) )

Documents are feature vectors (bags of words), using, e.g., normalized tf·idf weights:

  d_ij := w_ij / sqrt(Σ_k w_ik²)
  w_ij := log(1 + freq(f_j, d_i) / max_k freq(f_k, d_i)) · log(#docs / #docs with f_j)
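A minimal sketch of this scoring in Python, assuming the tf·idf weighting reconstructed above; representing the sparse feature vectors as dicts is an implementation choice, not part of the slides:

```python
import math

def tfidf_weight(freq, max_freq, n_docs, doc_freq):
    """Damped term frequency times inverse document frequency,
    following w_ij := log(1 + tf/max_tf) * log(#docs / #docs with term)."""
    return math.log(1 + freq / max_freq) * math.log(n_docs / doc_freq)

def cosine(d, q):
    """Cosine similarity between two sparse feature vectors (term -> weight)."""
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Example: a document about internet cafes vs. a single-term query.
# cosine({"surf": 0.8, "cafe": 0.6}, {"surf": 1.0}) -> 0.8
```

Normalizing by both vector lengths makes the score independent of document length, which is why the slide constrains all vectors to [0,1]^|F|.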
Combined Ranking with Content & Link Structure
Ranking by descending relevance & authority: the search engine evaluates a query q ∈ [0,1]^|F| (a set of weighted features) with ranking functions such as:
• Low-dimensional queries (ad-hoc ranking, Web search): BM25(F), authority scores, recency, document structure, etc.
• High-dimensional queries (similarity search): Cosine, Jaccard, Hamming on bitwise signatures, etc.
• Plus dozens of further features employed by various search engines
Digression: Basic Hardware Considerations
[Figure: typical computer] CPU (64 bit @ 2 GHz: 16 GB/s) — bus system (32–256 bits @ 200–800 MHz) — memory M and controller C — secondary storage: hard disk (SATA-300: 300 MB/s) — tertiary storage. Memory bandwidth: 3,200 MB/s (DDR-SDRAM @ 200 MHz), 6,400–12,800 MB/s (DDR2, dual channel, 800 MHz).
TransferRate = width (number of bits) × clock rate × data per clock (typically 1) / 8  [bytes/sec]
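The transfer-rate formula can be checked against the figures on the slide; note that DDR memory transfers 2 data items per clock, which is an assumption needed to reproduce the DDR-SDRAM number:

```python
def transfer_rate_bytes(width_bits, clock_hz, data_per_clock=1):
    """TransferRate = width (bits) x clock rate x data per clock / 8 (bytes/sec)."""
    return width_bits * clock_hz * data_per_clock / 8

# DDR-SDRAM @ 200 MHz on a 64-bit bus, 2 transfers per clock ("double data rate"):
# 64 * 200e6 * 2 / 8 = 3.2e9 bytes/s = 3,200 MB/s, matching the slide.
```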
Moore's Law
Gordon Moore (Intel) anno 1965: "The density of integrated circuits (transistors) will double every 18 months!"
→ Has often been generalized to clock rates of CPUs, disk & memory sizes, etc.
→ Still holds today for integrated circuits!
Source: http://en.wikipedia.org/wiki/Moore%27s_law
More Modern View on Hardware
[Figure: multi-core, multi-CPU computer with L1/L2 caches, main memory M, and hard disks as secondary storage]
• CPU-to-L1 cache: 3–5 cycles initial latency, then "burst" mode
• CPU-to-L2 cache: 15–20 cycles latency
• CPU-to-main memory: ~200 cycles latency
→ The CPU cache becomes primary storage! Main memory becomes secondary storage!
Data Centers
[Figure: Google data center anno 2004]
Source: J. Dean: WSDM 2009 Keynote
Different Query Types
Find relevant docs by list processing on inverted indexes, including the variants:
• scan & merge only a subset of the q_i lists
• look up long or negated q_i lists only for the best result candidates
Query types:
• Conjunctive queries: all words in q = q_1 … q_k required
• Disjunctive ("andish") queries: a subset of the q words qualifies, more of q yields a higher score
• Mixed-mode queries and negations: q = q_1 q_2 q_3 +q_4 +q_5 −q_6
• Phrase queries and proximity queries: q = "q_1 q_2 q_3" q_4 q_5 …
• Vague-match (approximate) queries with tolerance to spelling variants (see Chapter III.5)
• Structured queries and XML-IR: //article[about(.//title, "Harry Potter")]//sec
Indexing with Inverted Lists
The vector space model suggests a term-document matrix, but the data is sparse and queries are even sparser. Better: use inverted index lists with terms as keys in a B+-tree.
[Figure: B+-tree on terms for query q: {professor, research, xml}; each term's index list holds postings (docId, score) sorted by docId, e.g.:
  professor → 17:0.3, 44:0.4, 52:0.1, 53:0.8, 55:0.6, …
  research  → 11:0.6, 17:0.1, 28:0.7, 44:0.2, 51:0.6, 52:0.3, …
  xml       → 12:0.5, 14:0.4, 17:0.1, 28:0.1, 44:0.2, …]
Google scale: > 10 million terms, > 20 billion docs, > 10 TB index.
Terms can be full words, word stems, word pairs, substrings, N-grams, etc. (whatever "dictionary terms" we prefer for the application).
• Index-list entries in docId order for fast Boolean operations
• Many techniques for excellent compression of index lists
• Additional position index needed for phrases, proximity, etc. (or other pre-computed data structures)
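A sketch of such an index in Python, assuming per-document term scores are already computed; the toy postings below are illustrative, not the slide's exact data:

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict docId -> {term: score}. Returns an inverted index
    term -> posting list [(docId, score), ...], kept in ascending
    docId order so that Boolean operations run as merge joins."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                 # ascending docId order
        for term, score in docs[doc_id].items():
            index[term].append((doc_id, score))
    return dict(index)

index = build_index({
    17: {"professor": 0.3, "research": 0.3, "xml": 0.1},
    44: {"professor": 0.4, "research": 0.4, "xml": 0.2},
    52: {"professor": 0.1, "xml": 0.3},
})
# index["professor"] == [(17, 0.3), (44, 0.4), (52, 0.1)]
```

In a real engine the dictionary (the dict keys here) would live in a B+-tree and the posting lists would be compressed on disk, as the following slides discuss.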
B+-Tree Index for Term Dictionary
[Figure: B+-tree over keywords [A-Z] with m = 3: internal nodes [A-I], [J-Z]; below them [A-D], [E-F], [G-I], [J-K], [L-Q], [R-Z]; leaves [A-B], [C], [D], [E], [F], [G], [H], [I], …]
• B-tree: balanced tree with internal nodes of fan-out ≤ m
• B+-tree: leaf nodes additionally linked via pointers for efficient range scans
• For the term dictionary: leaf entries point to inverted-list entries on local disk and/or a node in the compute cluster
Inverted Index for Posting Lists
Given documents d_1, …, d_n with per-term scores such as s(t_1, d_1) = 0.9, …, s(t_m, d_1) = 0.2, index-list entries are usually stored in ascending order of docId (for efficient merge joins), or sorted in descending order of the per-term score (impact-ordered lists for top-k style pruning). E.g.:
  t_1 → (d10, 0.9) (d23, 0.8) (d54, 0.8) (d67, 0.7) (d88, 0.2) …
  t_2 → (d10, 0.8) (d12, 0.6) (d17, 0.6) (d23, 0.2) (d78, 0.1) …
  t_3 → (d10, 0.7) (d12, 0.5) (d23, 0.4) (d88, 0.2) (d99, 0.1) …
Lists are usually compressed and divided into blocks whose sizes are convenient for disk operations.
Query Processing on Inverted Lists
[Figure: B+-tree on terms for q: {professor, research, xml} with index lists of postings (docId, score) sorted by docId, as on the previous slides]
Given: a query q = t_1 t_2 ... t_z with z (conjunctive) keywords, and a similarity scoring function score(q,d) for docs d ∈ D, e.g., q·d with precomputed scores (index weights) s_i(d) for the terms with q_i ≠ 0.
Find: the top-k results w.r.t. score(q,d) = aggr{s_i(d)} (e.g., Σ_{i∈q} s_i(d)).
Join-then-sort algorithm:
  top-k( σ[term=t_1](index) ⋈_DocId σ[term=t_2](index) ⋈_DocId ... ⋈_DocId σ[term=t_z](index) order by s desc )
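The join-then-sort plan can be sketched in a few lines of Python; summation as the aggregation function and the toy index below are illustrative assumptions:

```python
import heapq
from functools import reduce

def conjunctive_topk(index, terms, k):
    """Join-then-sort: intersect the posting lists on docId (conjunctive
    semantics), aggregate per-term scores by summation, sort descending."""
    lists = [dict(index[t]) for t in terms]                 # docId -> score
    common = reduce(lambda a, b: a & b, (set(l) for l in lists))
    scored = [(sum(l[d] for l in lists), d) for d in common]
    return heapq.nlargest(k, scored)                        # [(score, docId), ...]

index = {
    "professor": [(17, 0.3), (44, 0.4), (52, 0.1)],
    "research":  [(17, 0.3), (44, 0.4)],
}
# conjunctive_topk(index, ["professor", "research"], 2)
#   -> [(0.8, 44), (0.6, 17)]
```

This materializes the full join before sorting; the top-k algorithms of Section V.3 exist precisely to avoid that full materialization.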
Index-List Processing by Merge Join
Keep L_i in ascending order of docIds. Delta encoding: compress L_i by storing the gaps between successive docIds rather than the docIds themselves (or use some more sophisticated prefix-free code).
Query processing may start with those L_i lists that are short and have high idf. → Candidates then need to be looked up in the other lists L_j. To avoid uncompressing an entire list L_j, encode L_j into groups (i.e., blocks) of compressed entries with a skip pointer at the start of each block: sqrt(n) evenly spaced skip pointers for a list of length n.
Example:
  L_i: 2 4 9 16 59 66 128 135 291 311 315 591 672 899 …
  L_j: 1 2 3 5 8 17 21 35 39 46 52 66 75 88 …  (skip!)
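A sketch of the merge join with skip pointers, using the two example lists above; representing skip pointers implicitly via index arithmetic (every sqrt(n)-th position carries one) is a simplification of the block layout described on the slide:

```python
import math

def intersect_with_skips(li, lj):
    """Merge-join two ascending docId lists; lj carries sqrt(n) evenly
    spaced skip pointers, so runs of non-matching docIds in lj can be
    jumped over instead of being scanned (and decompressed) one by one."""
    skip = max(1, int(math.sqrt(len(lj))))    # skip span, as on the slide
    result, i, j = [], 0, 0
    while i < len(li) and j < len(lj):
        if li[i] == lj[j]:
            result.append(li[i]); i += 1; j += 1
        elif lj[j] < li[i]:
            # Follow the skip pointer while its target does not overshoot.
            if j % skip == 0 and j + skip < len(lj) and lj[j + skip] <= li[i]:
                j += skip
            else:
                j += 1
        else:
            i += 1
    return result

li = [2, 4, 9, 16, 59, 66, 128, 135, 291, 311, 315, 591, 672, 899]
lj = [1, 2, 3, 5, 8, 17, 21, 35, 39, 46, 52, 66, 75, 88]
# intersect_with_skips(li, lj) -> [2, 66]
```

With delta encoding, only the block behind the chosen skip pointer needs to be decompressed, which is the point of placing the pointers at block starts.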