Database Management Systems
Winter 2003
CMPUT 391: Information Retrieval and the Web
Dr. Osmar R. Zaïane
University of Alberta
Chapter 27 of Textbook
Dr. Osmar R. Zaïane, 2001-2003

Course Content
• Introduction
• Database Design Theory
• Query Processing and Optimisation
• Concurrency Control
• Data Base Recovery and Security
• Object-Oriented Databases
• Inverted Index for IR
• XML
• Data Warehousing
• Data Mining
• Parallel and Distributed Databases
• Other Advanced Database Topics

Objectives of Lecture 7
Inverted Indexes and Information Retrieval
• Get a general idea about the technologies behind search engines
• Get acquainted with inverted indexes
• Discuss ranking issues

Inverted Indexes and IR
• Inverted Indexes and Information Retrieval
• Signature Files
• Anatomy of a Search Engine
• Web Crawler
• Ranking Results
• Authorities, Hubs and PageRank

Everyday Activity
• We use search engines whenever we look for resources on the Internet
• How do these search engines work?
• How come they give different results while the results come from the same Web?
• The results are often very disappointing. Why aren’t we satisfied?

Information Retrieval
• Find resources (documents) that contain a certain list of keywords
  Find the pages where the phrase “alpha beta” occurs.
  Searching sequentially is too expensive.
  You would need an index to directly find the pages.

Creating an Index
For each document Di, list its words (Di: wa, wb, wc …) and index the documents.

Documents:
D1: wa, wb, wc …
D2: wa, wd, we …
D3: wa, wb, wd …
…
Dn: wx, wy, wz …

Inverted Index:
wa: D1, D2, D3 …
wb: D1, D3 …
wc: D1, …
wd: D2, D3, …
…

Querying
• Which document contains wa and wb?
  (wa: D1, D2, D3 …) ∩ (wb: D1, D3 …) = D1, D3 …
• Which document contains wa or wb?
  (wa: D1, D2, D3 …) ∪ (wb: D1, D3 …) = D1, D2, D3 …
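
The index creation and the AND/OR queries from the slides above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine does it: the function names are hypothetical, and each document is assumed to contain only the words the slide lists explicitly (the slide's trailing "…" is left out).

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def query_and(index, *words):
    """Documents containing every word: intersect the posting lists."""
    result = None
    for word in words:
        ids = set(index.get(word, ()))
        result = ids if result is None else result & ids
    return sorted(result or ())

def query_or(index, *words):
    """Documents containing at least one word: union of the posting lists."""
    result = set()
    for word in words:
        result |= set(index.get(word, ()))
    return sorted(result)

# Documents from the "Creating an Index" slide (listed words only).
documents = {
    "D1": ["wa", "wb", "wc"],
    "D2": ["wa", "wd", "we"],
    "D3": ["wa", "wb", "wd"],
}
index = build_inverted_index(documents)

print(index["wd"])                   # ['D2', 'D3']
print(query_and(index, "wa", "wb"))  # ['D1', 'D3']
print(query_or(index, "wa", "wb"))   # ['D1', 'D2', 'D3']
```

Posting lists are held as sets while building so that a word repeated inside one document is recorded once; they are sorted at the end only to make the output easy to compare with the slide.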

Indexing for Text Search
• Text database: Collection of text documents
• Important class of queries: Keyword searches
  – Boolean queries: Query terms connected with AND, OR and NOT. The result is the list of documents that satisfy the boolean expression.
  – Ranked queries: The result is a list of documents ranked by their “relevance”.
  – IR: Precision (percentage of retrieved documents that are relevant) and recall (percentage of relevant documents that are retrieved)
• Inverted indexes are not the only approach in IR. Signature files are also used for document retrieval.

Signature Files
• Index structure (the signature file) with one data entry for each document
• A hash function maps each word to a bit vector.
• The data entry for a document (the signature of the document) is the OR of all hashed words.
• Signature S1 matches signature S2 if S2 & S1 = S2, that is, every bit set in S2 is also set in S1.

Signature Files: Query Evaluation
• Boolean query consisting of a conjunction of words:
  – Generate the query signature Sq
  – Scan the signatures of all documents.
  – If signature S matches Sq, then retrieve the document and check for false positives.
• Boolean query consisting of a disjunction of k words:
  – Generate k query signatures S1, …, Sk
  – Scan the signature file to find documents whose signature matches any of S1, …, Sk
  – Check for false positives
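
The signature-file definitions and the two query evaluation procedures can be tried out with a short Python sketch. The three-bit hash values and the two documents are borrowed from the "Signature Files: Example" slide; the function names, the lowercasing of words and the whitespace tokenizer are assumptions made for illustration only.

```python
# Hash values and documents taken from the "Signature Files: Example" slide.
WORD_HASH = {"agent": 0b010, "james": 0b100, "mobile": 0b001}
DOCUMENTS = {1: "Agent James", 2: "Mobile agent"}

def tokens(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def signature(words):
    """OR together the hashed bit vectors of all the words."""
    sig = 0
    for word in words:
        sig |= WORD_HASH[word.lower()]
    return sig

def matches(s1, s2):
    """S1 matches S2 if S2 & S1 = S2 (every bit set in S2 is set in S1)."""
    return (s2 & s1) == s2

# The signature file: one data entry (signature) per document.
SIGNATURE_FILE = {rid: signature(tokens(text)) for rid, text in DOCUMENTS.items()}

def conjunctive_query(words):
    """AND query: one query signature, scan, then check for false positives."""
    sq = signature(words)
    candidates = [rid for rid, s in SIGNATURE_FILE.items() if matches(s, sq)]
    wanted = {w.lower() for w in words}
    return [rid for rid in candidates if wanted <= set(tokens(DOCUMENTS[rid]))]

def disjunctive_query(words):
    """OR query: one signature per word, keep documents matching any of them."""
    sigs = [signature([w]) for w in words]
    candidates = [rid for rid, s in SIGNATURE_FILE.items()
                  if any(matches(s, sq) for sq in sigs)]
    wanted = {w.lower() for w in words}
    return [rid for rid in candidates if wanted & set(tokens(DOCUMENTS[rid]))]

print(format(SIGNATURE_FILE[1], "03b"))        # 110
print(format(SIGNATURE_FILE[2], "03b"))        # 011
print(conjunctive_query(["agent", "james"]))   # [1]
print(disjunctive_query(["james", "mobile"]))  # [1, 2]
```

With only three words and one distinct bit per word, this toy data produces no false positives. In a realistic signature file many words share bits, which is why both procedures end by checking the retrieved documents against the actual query words.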

Signature Files: Example

Word    Hash
Agent   010
James   100
Mobile  001

RID  Document      Signature
1    Agent James   110
2    Mobile agent  011

Search Engine Blocks
• A search engine has an interface to enter queries
• A search engine has access to an inverted index already built
• A search engine ranks the results found in the index

Search Engine Components
[Figure labels: User, Query/Results, Interface, Inverted Index (built off-line), Ranking]

Search Engine General Architecture
[Figure labels: Crawler, Page, Parser and indexer, Index, Search Engine, and the lists LTV, LV and LNV, connected by steps numbered 1 to 6]

Search Engines are not Enough
• Most of the knowledge in the World-Wide Web is buried inside documents.
• Search engines (and crawlers) barely scratch the surface of this knowledge by extracting keywords from web pages.
• There is text mining, text summarization, natural language statistical analysis, etc., but these are outside the scope of this course.

Relevancy Ranking
• Some search engines claim to have indexed about one billion documents
• Each search can yield a very large list of “supposedly relevant” documents
• Sifting through thousands of results is tedious and not necessary
• It is extremely important to rank the results, since most users will look mainly at the first 10 to 20 documents.

How do we Rank?
• Each search engine uses a different ranking function (similarity measure). Usually these ranking functions are not disclosed.
• Parameters used in ranking:
  - Frequency of words
  - Location of words
  - Metadata
  - Domain
  - And $$$$
  - Existence in directory
  - Inward and outward links
  - Entirety of query
  - Size of document
  - Age of document

Ontology for Search Results
• There are still too many results in typical search engine responses.
• Reorganize results using a semantic hierarchy (Zaïane et al. 2001).
[Figure labels: WordNet, Semantic network, Search result]
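
To make the "How do we Rank?" parameters more concrete, here is a toy scoring function in Python that combines three of them: frequency of words, location of words (title versus body) and entirety of query. Everything about it is an assumption for illustration: real ranking functions are usually not disclosed, and the weights, field names and sample pages below are invented.

```python
def score(query_terms, title, body, title_weight=2.0, full_match_bonus=1.0):
    """Toy relevance score; the weights are arbitrary illustrative values."""
    title_words = title.lower().split()
    body_words = body.lower().split()
    terms = [t.lower() for t in query_terms]
    total = 0.0
    for term in terms:
        total += body_words.count(term)                  # frequency of words
        total += title_weight * title_words.count(term)  # location of words
    present = set(title_words) | set(body_words)
    if all(term in present for term in terms):           # entirety of query
        total += full_match_bonus
    return total

def rank(query_terms, pages, top_k=10):
    """Return the names of the top_k pages with a positive score, best first."""
    scored = [(score(query_terms, p["title"], p["body"]), name)
              for name, p in pages.items()]
    scored = [(s, name) for s, name in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]

# Hypothetical pages, invented for this example.
pages = {
    "p1": {"title": "Inverted index",
           "body": "an inverted index maps each word to documents"},
    "p2": {"title": "Signature files",
           "body": "a signature file stores one signature per document"},
    "p3": {"title": "Search engines",
           "body": "a search engine looks up the inverted index and ranks results"},
}

print(rank(["inverted", "index"], pages))  # ['p1', 'p3']
```

The default top_k of 10 mirrors the observation that most users look mainly at the first 10 to 20 results, so only the head of the ranked list needs to be produced.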