Text Indexing
Arun Chauhan
COMP 314, Lecture 15, 16
Mar 4, Mar 6, 2003

Searching Text
• grep utility on Unix
  - specify a regular expression
  - search all specified files
• what happens if
  - the files are very big, and
  - many repeated searches need to be carried out
• can we do better?

Indexing
• split the search process
  - create an index of frequently used terms (also called a concordance)
  - handle the search as a query that looks up the index
amortize indexing time over a large number of queries

Full-text Retrieval
• full-text retrieval ≡ searching large text databases using automatically constructed concordances
• Questions
  - How is this different from a library catalog?
  - Can we rely on high-speed modern processors to do exhaustive searches?
  - What kind of indexing would be required for full-text retrieval?

Indexing: A General Technique
• no large database can be searched without indexes
• there may be primary and secondary indexes
• elaborate data structures hold the index to support rapid queries
  - e.g., B+ trees
• other issues
  - separate structures for separate indexes?
  - rapid reindexing for addition, deletion, update
  - size of the index

Applications
• databases
  - every database has elaborate index generation schemes
• web search
  - search engines, e.g., Google, Yahoo!, Lycos
  - also the issue of ranking and displaying the results
• disk search
  - Apple’s Sherlock creates index files for filesystem search

Inverted File Index
• term ≡ keywords of interest
• lexicon ≡ list of all terms occurring in the text

index[term] = document1, document2, ...

How do you index non-text data (e.g., PDF files, images)?

An Example

Document  Text
1         Pease porridge hot, pease porridge cold
2         Pease porridge in the pot
3         Nine days old
4         Some like it hot, some like it cold
5         Some like it in the pot
6         Nine days old

find the lexicon and build the inverted index

Example (contd.)

Each entry lists the number of documents containing the term, then the document numbers.

Number  Term      Documents
1       cold      2; 1, 4
2       days      2; 3, 6
3       hot       2; 1, 4
4       in        2; 2, 5
5       it        2; 4, 5
6       like      2; 4, 5
7       nine      2; 3, 6
8       old       2; 3, 6
9       pease     2; 1, 2
10      porridge  2; 1, 2
11      pot       2; 2, 5
12      some      2; 4, 5
13      the       2; 2, 5

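The example can be reproduced with a short Python sketch. It assumes document 5 reads "Some like it in the pot" (which is what the "some" entry of the index implies), folds case, and treats any run of letters as a term. The names `DOCUMENTS` and `build_inverted_index` are illustrative, not from the lecture.

```python
import re
from collections import defaultdict

# The six example documents, keyed by document number.
DOCUMENTS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",  # assumed wording, see lead-in
    6: "Nine days old",
}

def build_inverted_index(documents):
    """Map each case-folded term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            postings[term].add(doc_id)
    # Sort the lexicon alphabetically and each posting list numerically.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(DOCUMENTS)
for number, (term, docs) in enumerate(index.items(), start=1):
    print(f"{number:2d}  {term:<9} {len(docs)}; {', '.join(map(str, docs))}")
```

The lexicon is simply the set of keys of `index`; each value is the posting list for that term.
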
Using the Inverted Index
• simple lookup is trivial
  - for large documents, may have to maintain the location within each document
• compound queries?
  - conjunctive and disjunctive queries, e.g., “term1 AND term2”, “term1 OR term2”
  - complement, “NOT term1”
• near queries?
• potentially huge index files
  - should we worry about the index size?

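One way to answer compound and near queries is to combine posting lists, as in the Python sketch below. It assumes posting lists are sorted lists of document numbers and, for near queries, that the index also stores word positions per document. All function names and the `window` parameter are hypothetical.

```python
def query_and(a, b):
    """term1 AND term2: merge-intersect two sorted posting lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def query_or(a, b):
    """term1 OR term2: union of two posting lists."""
    return sorted(set(a) | set(b))

def query_not(a, all_docs):
    """NOT term1: every document in the collection that is not in the list."""
    excluded = set(a)
    return [d for d in all_docs if d not in excluded]

def query_near(pos_a, pos_b, window):
    """Documents where the two terms occur within `window` words of each other.

    pos_a and pos_b map document number -> list of word positions.
    """
    hits = []
    for doc in sorted(pos_a.keys() & pos_b.keys()):
        if any(abs(p - q) <= window for p in pos_a[doc] for q in pos_b[doc]):
            hits.append(doc)
    return hits

# Posting lists taken from the pease-porridge example.
hot, cold, pot = [1, 4], [1, 4], [2, 5]
print(query_and(hot, cold))           # documents with both terms
print(query_or(hot, pot))             # documents with either term
print(query_not(hot, range(1, 7)))    # documents without "hot"
```

The complement needs the full list of document numbers, which is one reason "NOT" queries are usually combined with another term rather than evaluated alone.
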
Trimming the Index
• case folding
  - mostly, case is immaterial
• stemming
  - are “search”, “searching”, “searches” different?
  - strategy: maintain only the neutral form of the term
• eliminate stop words
  - frequently occurring terms ≡ stop list
  - e.g., “a”, “the”, “in”, “to”, etc.

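The three trimming steps can be sketched as a small normalization pipeline in Python. The stop list holds only the four words named above, and `naive_stem` is a deliberately crude suffix stripper written for illustration; it is an assumption of this sketch, not the stemming method used in the lecture, and real systems use a proper algorithm such as Porter's stemmer.

```python
import re

STOP_WORDS = {"a", "the", "in", "to"}   # the examples from the stop list above
SUFFIXES = ("ing", "es", "s")           # illustrative only, checked in this order

def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Case-fold, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("Search, searching, SEARCHES in the text"))
```

A stripper this simple also mangles words such as "this" (which becomes "thi"), which is why the suffix rules of a real stemmer are considerably more careful.
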
Effectiveness: Precision

\[ \text{Precision} = \frac{r}{t} \]

r: number of relevant documents retrieved
t: total number of documents retrieved

if 50 documents are retrieved and 35 of them are relevant, then the precision is 35/50 = 70%

Effectiveness: Recall

\[ \text{Recall} = \frac{r}{n} \]

r: number of relevant documents retrieved
n: total number of relevant documents in the collection

if 50 documents are retrieved and 35 of them are relevant, then the precision is 35/50 = 70%
if there are 140 relevant documents in the collection, then the recall is 35/140 = 25%

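The arithmetic of the two measures, checked in a few lines of Python. The function and parameter names are illustrative.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of the retrieved documents that are relevant (r / t)."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved (r / n)."""
    return relevant_retrieved / total_relevant

# The example above: 50 retrieved, 35 of them relevant, 140 relevant overall.
print(f"precision = {precision(35, 50):.0%}")
print(f"recall    = {recall(35, 140):.0%}")
```

The two measures pull against each other: retrieving every document in the collection gives 100% recall but very low precision.
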
Search Engines

Indexing the Web
• more than 2 billion documents on the web (as of 2003)
• Google claims to index 1.5 billion documents
• two indexing approaches
  - search engines (e.g., Google)
  - hierarchical directories (e.g., Yahoo!)

Web Search Characteristics
• bulk
• rapidly changing content
  - about one-third changes every year
• heterogeneous content
• duplication, as much as 30%
• high linkage
• wide variety of users
• varying user behavior
  - 85% only look at the first screen
  - 78% never modify their first query

Query Characteristics

Terms in query  Share of queries
0               21%
1               26%
2               26%
3               15%
> 3             12%

Goals of a Search Engine
• speed
• recall
• precision
• precision in the top result page

Search Engine Architecture
• crawler
  - collects pages from the Web
• indexer
  - indexes the collected pages
• query server
  - accepts and processes queries and returns the results

The Crawler

base ← set of known working hyperlinks
queue ← base
while (! queue.empty()) {
    p = remove first element of queue
    process p
    for each page, q, referenced from p
        if q has not been queued before
            add q to queue;
}

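The crawler loop can be sketched in Python as a breadth-first traversal. To stay runnable without network access, the sketch crawls a toy in-memory link graph; `TOY_WEB`, `fetch_links`, `process`, and the `max_pages` cap are all assumptions added for illustration. The `seen` set keeps a page from being queued twice, without which the loop would not terminate on a web with cycles.

```python
from collections import deque

def crawl(base, fetch_links, process, max_pages=1000):
    """Breadth-first crawl starting from the pages in `base`.

    fetch_links(page) returns the pages referenced from `page`;
    process(page) is whatever the indexer does with a collected page.
    """
    queue = deque(dict.fromkeys(base))   # de-duplicated, order preserved
    seen = set(queue)                    # pages queued so far
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()           # first element of the queue
        process(page)
        visited.append(page)
        for link in fetch_links(page):
            if link not in seen:         # skip pages already queued
                seen.add(link)
                queue.append(link)
    return visited

# A toy web: page -> pages it links to (hypothetical data).
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["a"]}

order = crawl(["a"], lambda page: TOY_WEB.get(page, []), process=lambda page: None)
print(order)
```

A real crawler would replace `fetch_links` with an HTTP fetch plus link extraction, and would also need politeness delays and robots.txt handling, none of which is shown here.
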
Indexing
• inverted index
  - most common, used by Google
  - superimposed coding is another technique
• term extraction
  - title or the whole document
  - document analysis to identify keywords

Query Processing
• keyword vs concept-based searching
  - concept-based searching uses “clustering”
  - Excite used concept-based searching
• searching “similar” results
• ranking the hits

Rankings
• Google’s page-popularity based rankings
  - combined with proximity of search keywords to those in the document

let page $P$ be pointed to by pages $T_1, T_2, \ldots, T_k$
let $L(x)$ be the number of links going out of page $x$
let $R(x)$ be the page rank of page $x$

\[ R(P) = (1 - d) + d \times \left( \frac{R(T_1)}{L(T_1)} + \frac{R(T_2)}{L(T_2)} + \cdots + \frac{R(T_k)}{L(T_k)} \right) \]

where $d$ is a damping factor

Solving the Rankings

\[
\begin{aligned}
R(P_1) &= (1 - d) + d \times \left( \frac{R(T^{p_1}_1)}{L(T^{p_1}_1)} + \frac{R(T^{p_1}_2)}{L(T^{p_1}_2)} + \cdots + \frac{R(T^{p_1}_{k_1})}{L(T^{p_1}_{k_1})} \right) \\
R(P_2) &= (1 - d) + d \times \left( \frac{R(T^{p_2}_1)}{L(T^{p_2}_1)} + \frac{R(T^{p_2}_2)}{L(T^{p_2}_2)} + \cdots + \frac{R(T^{p_2}_{k_2})}{L(T^{p_2}_{k_2})} \right) \\
&\;\;\vdots \\
R(P_n) &= (1 - d) + d \times \left( \frac{R(T^{p_n}_1)}{L(T^{p_n}_1)} + \frac{R(T^{p_n}_2)}{L(T^{p_n}_2)} + \cdots + \frac{R(T^{p_n}_{k_n})}{L(T^{p_n}_{k_n})} \right)
\end{aligned}
\]

\[ L \times R = C \]

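Because every rank appears on both sides, the n equations are a linear system in the unknown ranks. One common way to solve such a system, sketched here in Python, is fixed-point iteration: start from a guess and reapply the formula until the ranks stop changing. This is an illustrative solver, not necessarily the method intended in the lecture. The damping factor `d=0.85`, the tolerance, and the toy link graph are assumptions, and the sketch assumes every link target is itself a key of `links`.

```python
def page_rank(links, d=0.85, tolerance=1e-10, max_iterations=1000):
    """Iteratively solve R(P) = (1 - d) + d * sum(R(T) / L(T)).

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    out_degree = {p: len(links[p]) for p in pages}       # L(x)
    incoming = {p: [] for p in pages}                    # the T's for each P
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    rank = {p: 1.0 for p in pages}                       # initial guess
    for _ in range(max_iterations):
        new_rank = {
            p: (1 - d) + d * sum(rank[t] / out_degree[t] for t in incoming[p])
            for p in pages
        }
        change = max(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tolerance:                           # converged
            break
    return rank

# Toy graph (hypothetical): a -> b, c; b -> c; c -> a.
ranks = page_rank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page, value in sorted(ranks.items()):
    print(f"R({page}) = {value:.4f}")
```

With d below 1 each pass shrinks the error by at least a factor of d, so the iteration converges; when every page has at least one outgoing link, the ranks sum to the number of pages.
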