Chapter 11: Text Indexing and Matching

"There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." -- Eric Schmidt

"There is nothing that cannot be found through some search engine." -- Eric Schmidt

"The best place to hide a dead body is page 2 of Google search results." -- anonymous

"An engineer is someone who can do for a dime what any fool can do for a dollar." -- anonymous
Outline
11.1 Search Engine Architecture
11.2 Dictionary and Inverted Lists
11.3 Index Compression
11.4 Similarity Search

Mostly following Büttcher/Clarke/Cormack, Chapters 2, 3, 4, 6 (alternatively: Manning/Raghavan/Schütze, Chapters 3, 4, 5, 6); 11.2 mostly BCC Ch. 4, 11.3 mostly BCC Ch. 6, 11.4 mostly MRS Ch. 3.
11.1 Search Engine Architecture
Pipeline (figure): crawl → extract & clean → index → search → rank → present
• crawl: strategies for crawl schedule and priority queue for the crawl frontier; handle dynamic pages, detect duplicates, detect spam
• extract & clean: extract and clean tokens from fetched pages
• index: build and analyze the Web graph, index all tokens or word stems
• search: fast top-k queries, query logging, auto-completion
• rank: scoring function over many data and context criteria
• present: GUI, user guidance, personalization
All of this runs on a server farm with 100,000's of computers, distributed/replicated data in a high-performance file system, and massive parallelism for query processing.
Content Gathering and Indexing
Pipeline (figure): from Web documents to index entries, using bag-of-words representations.
Example document: "Internet crisis: search engine users still love search engines and have faith and trust in the Internet"
• Crawling: fetch documents from the Internet
• Extraction of relevant words: Internet, crisis, users, love, trust, ...
• Linguistic methods (e.g. stemming): Internet, crisis, user, love, trust, ...
• Statistically weighted features (terms), optionally mapped via a thesaurus (ontology) with synonyms and sub-/super-concepts
• Indexing: build an index (B+ tree) that maps terms (e.g. crisis, love) to URLs
Crawling
• Traverse the Web: fetch pages by HTTP, parse the retrieved HTML content for href links
• Crawl frontier: maintain a priority queue of URLs to visit
• Crawl strategy: breadth-first for broad coverage, depth-first for capturing entire sites, clever prioritization
• Link extraction: handle dynamic pages (JavaScript, ...)
• Deep Web crawling: generate form-filling queries
• Focused crawling: interleave with a classifier
A minimal crawler sketch follows below.
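To make the crawl loop concrete, here is a minimal, illustrative sketch (not from the slides): a priority queue holds the crawl frontier, pages are fetched over HTTP, and href links are parsed out of the HTML. Using depth as the priority yields breadth-first order; the priority function is the natural hook for cleverer prioritization.

```python
import heapq
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags while parsing HTML."""
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base, value))

def crawl(seeds, max_pages=100):
    frontier = [(0, url) for url in seeds]  # (priority, url): the crawl frontier
    heapq.heapify(frontier)
    seen = set(seeds)
    while frontier and max_pages > 0:
        prio, url = heapq.heappop(frontier)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                         # skip unreachable or non-text pages
        max_pages -= 1
        yield url, html                      # hand the page to the extract/index stage
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                # priority = depth gives breadth-first; a real crawler would
                # plug in a smarter prioritization here
                heapq.heappush(frontier, (prio + 1, link))
```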
Deep Web Crawling
Deep Web (aka Hidden Web): DB/CMS content items without their own URLs.
Generate (valid) values for query form fields in order to bring these items to the surface.
Source: http://deepwebtechblog.com/wringing-science-from-google/
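As a rough illustration of form filling, the sketch below probes a search form with candidate values. The endpoint, parameter name, and value list are all hypothetical assumptions; a real deep-web crawler must first discover forms and mine plausible fill-in values.

```python
import urllib.parse
import urllib.request

SEARCH_FORM = "https://example.org/search"          # hypothetical form action URL
CANDIDATE_VALUES = ["crisis", "internet", "trust"]  # e.g. mined from a seed corpus

def surface_items():
    for value in CANDIDATE_VALUES:
        query = urllib.parse.urlencode({"q": value})  # fill the (assumed) form field
        with urllib.request.urlopen(f"{SEARCH_FORM}?{query}", timeout=5) as resp:
            # the result pages expose items that have no URLs of their own
            yield value, resp.read()
```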
Focused Crawling
Automatically populate an ad-hoc topic directory: starting from seeds, the crawler is trained and guided by a classifier and link analysis (figure shows a topic tree: Root → Database Technology, Web Retrieval, Data Mining → Semistructured Data, XML, ...).
Critical issues:
• classifier accuracy
• feature selection
• quality of training data
Focused Crawling (continued)
Interleave crawler and classifier with periodic re-training: documents accepted with high classifier confidence and high authority (by link analysis) serve as topic-specific archetypes for re-training (figure extends the topic tree with, e.g., Social Graphs).
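A minimal sketch of the interleaving, assuming a stand-in classify() and injected fetch/extract_links helpers (all hypothetical): only pages accepted for the topic are expanded, link priority is the parent page's classifier confidence, and high-confidence pages are collected as archetypes for re-training.

```python
import heapq

def classify(html):
    """Stand-in topic classifier: returns a confidence in [0, 1]."""
    return 1.0 if "data mining" in html.lower() else 0.1

def focused_crawl(seeds, fetch, extract_links, threshold=0.5):
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated confidence
    heapq.heapify(frontier)
    seen, archetypes = set(seeds), []
    while frontier:
        _, url = heapq.heappop(frontier)
        html = fetch(url)
        conf = classify(html)
        if conf < threshold:
            continue                       # off-topic: do not expand this page
        if conf > 0.9:
            archetypes.append(url)         # candidate for periodic re-training
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-conf, link))
    return archetypes
```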
Vector Space Model for Content Relevance Ranking
Documents are feature vectors (bags of words), $d_i \in [0,1]^{|F|}$; a query is a set of weighted features, $q \in [0,1]^{|F|}$; weights e.g. by the tf*idf model.
The search engine ranks documents by descending relevance, using the cosine similarity metric:

$$sim(d_i, q) := \frac{\sum_{j=1}^{|F|} d_{ij} \, q_j}{\sqrt{\sum_{j=1}^{|F|} d_{ij}^2}\;\sqrt{\sum_{j=1}^{|F|} q_j^2}}$$

Features are terms (words and other tokens) or term-zone pairs (term in title/heading/caption/...); terms can be stemmed/lemmatized (e.g. to unify singular and plural) and can also be multi-word phrases (e.g. bigrams).
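A direct transcription of the cosine similarity above, for sparse vectors represented as term-to-weight dicts; a sketch, not a production scorer.

```python
import math

def cosine(d, q):
    """Cosine similarity between two sparse feature vectors (dict term -> weight)."""
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# e.g. cosine({"internet": 0.5, "crisis": 0.8}, {"crisis": 1.0}) ≈ 0.85
```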
Vector Space Model: tf*idf Scores
Notation:
• $tf(d_i, t_j)$ = term frequency of term $t_j$ in doc $d_i$
• $df(t_j)$ = document frequency of $t_j$ = #docs containing $t_j$
• $idf(t_j) = N / df(t_j)$ with corpus size (total #docs) $N$
• $dl(d_i)$ = doc length of $d_i$ (avgdl: avg. doc length over all $N$ docs)

tf*idf score for a single-term query (index weight), with double-log dampening of tf, idf normalization, and optional length normalization:

$$d_{ij} = \bigl(1 + \ln(1 + \ln tf(d_i, t_j))\bigr) \cdot \ln \frac{N}{df(t_j)} \quad \text{for } tf(d_i, t_j) > 0,\; 0 \text{ else}$$

Cosine similarity for ranking (cosine of the angle between the q and d vectors when the vectors are L2-normalized) reduces to a sparse scalar product:

$$sim(q, d_i) = \sum_{j:\; q_j \neq 0 \,\wedge\, d_{ij} \neq 0} q_j \cdot d_{ij}$$
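The index weight $d_{ij}$ translates directly into code; natural logarithms are assumed, matching the ln in the formula above.

```python
import math

def index_weight(tf, df, N):
    """Double-log dampened tf times idf, as on the slide; 0 if the term is absent."""
    if tf <= 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) * math.log(N / df)
```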
(Many) tf*idf Variants: Pivoted tf*idf Scores
(Notation as on the previous slide.)
Pivoted tf*idf score, which avoids undue favoring of long docs:

$$d_{ij} = \frac{1 + \ln(1 + \ln tf(d_i, t_j))}{(1-s) + s \cdot \frac{dl(d_i)}{avgdl}} \cdot \ln \frac{N}{df(t_j)}$$

Score aggregation again uses the scalar product. tf*idf scoring often works very well, but it has many ad-hoc tuning issues. Chapter 13 covers more principled ranking models.
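The pivoted variant only adds a length normalizer that interpolates between 1 and dl/avgdl. The slope s is a tuning parameter; the default of 0.2 below is a common choice in the literature, not a value from the slide.

```python
import math

def pivoted_weight(tf, df, N, dl, avgdl, s=0.2):
    """Pivoted tf*idf index weight: dampened tf, divided by the length normalizer."""
    if tf <= 0:
        return 0.0
    damp = 1 + math.log(1 + math.log(tf))
    norm = (1 - s) + s * dl / avgdl
    return damp / norm * math.log(N / df)
```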
11.2 Indexing with Inverted Lists
The vector space model suggests a term-document matrix, but the data is sparse and queries are even sparser.
=> Use inverted index lists with terms as keys, organized in a B+ tree or hashmap.
Example (figure): query q = {Internet, crisis, trust}; each term's index list holds postings (DocId, score) sorted by DocId, e.g. crisis → 17: 0.3, 44: 0.4, ...
Scale at Google etc.: > 10 Mio. terms, > 100 Bio. docs, > 50 TB index.
Terms can be full words, word stems, word pairs, substrings, N-grams, etc. (whatever "dictionary terms" we prefer for the application).
• index-list entries in DocId order for fast Boolean operations (see the intersection sketch below)
• many techniques for excellent compression of index lists
• additional position index needed for phrases, proximity, etc. (or other precomputed data structures)
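DocId order makes Boolean AND a linear merge of the two sorted lists. Below is a sketch over (docid, score) pairs; summing the scores of matching docs is one simple aggregation choice, not prescribed by the slide.

```python
def intersect(list_a, list_b):
    """Merge-based intersection of two posting lists sorted by DocId."""
    i, j, result = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        (da, sa), (db, sb) = list_a[i], list_b[j]
        if da == db:
            result.append((da, sa + sb))  # doc contains both terms: aggregate scores
            i, j = i + 1, j + 1
        elif da < db:
            i += 1                        # advance the list with the smaller DocId
        else:
            j += 1
    return result

# e.g. intersect([(17, 0.3), (44, 0.4)], [(17, 0.1), (28, 0.7), (44, 0.2)])
# -> [(17, 0.4), (44, 0.6)]
```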
Dictionary
• Dictionary maintains information about terms:
  – mapping terms to unique term identifiers (e.g. crisis → 3141359)
  – location of the corresponding posting list on disk or in memory
  – statistics such as document frequency and collection frequency
• Operations supported by the dictionary:
  – lookups by term
  – range searches for prefix and suffix queries (e.g. net*, *net)
  – substring matching for wildcard queries (e.g. cris*s)
  – lookups by term identifier
• Typical implementations: B+ trees, hash tables, tries (digital trees), suffix arrays
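A hedged sketch of a dictionary entry; the field names are illustrative, not a prescribed layout. A hash table gives O(1) term lookups, whereas prefix queries like net* need an ordered structure (B+ tree, trie) instead.

```python
from dataclasses import dataclass

@dataclass
class DictEntry:
    term_id: int       # unique term identifier
    df: int            # document frequency
    cf: int            # collection frequency
    list_offset: int   # location of the posting list on disk or in memory

# term -> entry; hypothetical values matching the slide's example identifier
dictionary = {"crisis": DictEntry(term_id=3141359, df=3, cf=9, list_offset=0)}
entry = dictionary["crisis"]   # lookup by term
```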
B+ Tree
• Paginated hollow multiway search tree with high fanout (=> low depth)
• Node contents: (child pointer, key) pairs as routers in inner nodes; keys with id lists or record data in leaf nodes
• Perfectly balanced: all leaves have identical distance to the root
• Search and update efficiency: O(log_k(n/C)) page accesses (disk I/Os) with n keys, page storage capacity C, and fanout k
Example (figure): a B+ tree over city names (Aachen, Berlin, Bonn, Erfurt, Essen, Frankfurt, Jena, Köln, Mainz, Merzig, Paris, Saarbrücken, Trier, Ulm) with routers such as Bonn, Essen, Jena, Merzig in the inner nodes.
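A sketch of B+ tree search: descend from the root by binary search over the router keys, then scan the leaf. The node layout is simplified (in-memory, no paging); real implementations store nodes as fixed-size pages.

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # router keys (inner node) or stored keys (leaf)
        self.children = children  # child nodes, len(keys) + 1 of them (inner node)
        self.values = values      # payload per key (leaf node)

def search(node, key):
    while node.children is not None:               # descend via routers
        node = node.children[bisect.bisect_left(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)         # scan the leaf
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# tiny example tree in the spirit of the figure
leaf1 = Node(["Aachen", "Berlin", "Bonn"], values=[1, 2, 3])
leaf2 = Node(["Erfurt", "Essen"], values=[4, 5])
root = Node(["Bonn"], children=[leaf1, leaf2])
assert search(root, "Essen") == 5
```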
Prefix B+ Tree for Keys of Type String
Keys in inner nodes are mere routers for search-space partitioning. Rather than $x_i = \max\{s : s \text{ is a key in subtree } t_i\}$, a shorter router $y_i$ with $s_i \le y_i < s_{i+1}$ for all $s_i$ in $t_i$ and all $s_{i+1}$ in $t_{i+1}$ is sufficient, for example, $y_i$ = the shortest string with the above property.
=> even higher fanout, possibly lower depth of the tree
Example (figure): the same city names as before, now with short routers such as K, C, N, Et in the inner nodes.
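One simple way to compute such a short router, sketched below: take the shortest prefix of the right subtree's minimum key that is >= the left subtree's maximum key; such a prefix always exists when left < right, and it is strictly below the right key. This is one convention among several (the figure's router "Et" suggests separators can also be derived from the left key).

```python
def shortest_router(left_max: str, right_min: str) -> str:
    """Shortest prefix y of right_min with left_max <= y < right_min."""
    assert left_max < right_min
    for n in range(1, len(right_min) + 1):
        candidate = right_min[:n]
        if candidate >= left_max:
            return candidate
    return right_min  # defensive; the loop always returns first

# e.g. shortest_router("Essen", "Köln") -> "K"
```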
Posting Lists and Payload
• Inverted index keeps a posting list for each term, with the following payload per posting:
  – document identifier (e.g. d123, d234, ...)
  – term frequency (e.g. tf(crisis, d123) = 2, tf(crisis, d234) = 4)
  – score impact (e.g. tf(crisis, d123) * idf(crisis) = 3.75)
  – offsets: positions at which the term occurs in the document
• Posting lists can be sorted by doc id or sorted by score impact
• Posting lists are compressed for space and time efficiency
Example posting list for crisis, sorted by doc id, with payload tf and offsets:
(d123, 2, [4, 14]), (d234, 4, [47]), (d266, 3, [1, 9, 20])
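A sketch of a posting with the payload listed above; the field names are illustrative, and compression (Section 11.3) is omitted.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Posting:
    doc_id: int                                       # document identifier
    tf: int                                           # term frequency in the document
    offsets: List[int] = field(default_factory=list)  # positions of the term

# posting list for "crisis", sorted by doc id, mirroring the slide's example
crisis = [Posting(123, 2, [4, 14]), Posting(234, 4, [47]), Posting(266, 3, [1, 9, 20])]
```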