Unstructured Data Management
Advanced Topics in Database Management (INFSCI 2711)

Textbooks:
• Database System Concepts - 2010
• Introduction to Information Retrieval - 2008

Vladimir Zadorozhny, DINS, SCI, University of Pittsburgh

Unstructured Data
• Typically refers to free text
• Allows
  - Keyword queries including operators
  - More sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse
• Classic model for searching text documents

Unstructured Data and Query Example
• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the
  Capitol; Brutus killed me.
Query: Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

Database Management vs Information Retrieval
• Data:
  - DB: Set of tables with a well-defined schema
  - IR: Set of (text) documents
• Goal:
  - DB: Find an accurate response to a user query
  - IR: Retrieve documents with information that is relevant to the user's information need

Querying Unstructured Data
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
  - One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Drawbacks:
    - Slow (for large corpora)
    - NOT Calpurnia is non-trivial
    - Other operations (e.g., find the word Romans near countrymen) are not feasible
    - Ranked retrieval (best documents to return) is not supported

Term-document incidence matrix
Query: Brutus AND Caesar but NOT Calpurnia

            Antony and   Julius   The
            Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony          1           1        0         0        0         1
Brutus          1           1        0         1        0         0
Caesar          1           1        0         1        1         1
Calpurnia       0           1        0         0        0         0
Cleopatra       1           0        0         0        0         0
mercy           1           0        1         1        1         1
worser          1           0        1         1        1         0

An entry is 1 if the play contains the word, 0 otherwise.
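
A minimal sketch of answering the query directly from the incidence matrix, assuming each term's row is stored as a 6-bit integer (leftmost bit = first play). The names `INCIDENCE`, `vec` and `plays_for` are illustrative and not part of the slides.

```python
# Rows of the term-document incidence matrix, one bit per play.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "Antony":    "110001",
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
    "Cleopatra": "100000",
    "mercy":     "101111",
    "worser":    "101110",
}

N = len(PLAYS)
MASK = (1 << N) - 1  # keeps a complemented vector within N bits


def vec(term):
    """Return the 0/1 vector of a term as an integer."""
    return int(INCIDENCE[term], 2)


def plays_for(bits):
    """Map a result vector back to play titles (leftmost bit is PLAYS[0])."""
    return [p for i, p in enumerate(PLAYS) if (bits >> (N - 1 - i)) & 1]


# Brutus AND Caesar AND (NOT Calpurnia)
result = vec("Brutus") & vec("Caesar") & (~vec("Calpurnia") & MASK)
print(format(result, "06b"), plays_for(result))
```

The mask is needed because Python integers are unbounded, so a bare `~` would produce a negative number rather than a 6-bit complement.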

Query evaluation and optimization
• 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND.
• 110100 AND 110111 AND 101111 = 100100.
• Consider: 1M documents, each with about 1K terms.
• 6GB of data in the documents (avg 6 bytes/term incl. spaces/punctuation)
• Assume 500K distinct terms. A 500K x 1M matrix has half-a-trillion 0's and 1's.

Inverted index
• For each term (token) T, we must store a list of all documents that contain T.

Dictionary       Postings lists (sorted by docID; each docID entry is a posting)
Brutus      →    2 4 8 16 32 64 128
Calpurnia   →    1 2 3 5 8 13 21 34
Caesar      →    13 16
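
The sizing argument can be written out as below. The first two lines restate the slide's numbers. The bound on the number of 1's is an added step: it assumes each document contributes at most about 1K distinct terms, which is what makes storing only the postings attractive.

```latex
\[
\begin{aligned}
\text{collection size} &\approx 10^{6}\ \text{docs} \times 10^{3}\ \text{terms/doc} \times 6\ \text{bytes/term}
  = 6 \times 10^{9}\ \text{bytes} \approx 6\ \text{GB},\\
\text{matrix cells} &= (5 \times 10^{5}) \times 10^{6} = 5 \times 10^{11},\\
\text{number of 1's} &\le 10^{6} \times 10^{3} = 10^{9},\\
\text{fraction of 1's} &\le \frac{10^{9}}{5 \times 10^{11}} = 0.2\%.
\end{aligned}
\]
```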

Indexer step 1: Sequence of (Term, DocumentID) pairs

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

(Term, Doc #) pairs in order of occurrence:
(I, 1) (did, 1) (enact, 1) (julius, 1) (caesar, 1) (I, 1) (was, 1) (killed, 1)
(i', 1) (the, 1) (capitol, 1) (brutus, 1) (killed, 1) (me, 1)
(so, 2) (let, 2) (it, 2) (be, 2) (with, 2) (caesar, 2) (the, 2) (noble, 2)
(brutus, 2) (hath, 2) (told, 2) (you, 2) (caesar, 2) (was, 2) (ambitious, 2)

Indexer step 2: Sorting by terms

(Term, Doc #) pairs sorted by term, then by Doc #:
(ambitious, 2) (be, 2) (brutus, 1) (brutus, 2) (caesar, 1) (caesar, 2) (caesar, 2)
(capitol, 1) (did, 1) (enact, 1) (hath, 2) (I, 1) (I, 1) (i', 1) (it, 2)
(julius, 1) (killed, 1) (killed, 1) (let, 2) (me, 1) (noble, 2) (so, 2)
(the, 1) (the, 2) (told, 2) (was, 1) (was, 2) (with, 2) (you, 2)

Indexer step 3: Merging terms and adding frequencies
• Multiple term entries in a single document are merged.
• Frequency information is added.

Term        Doc #   Term freq
ambitious   2       1
be          2       1
brutus      1       1
brutus      2       1
caesar      1       1
caesar      2       2
capitol     1       1
did         1       1
enact       1       1
hath        2       1
I           1       2
i'          1       1
it          2       1
julius      1       1
killed      1       2
let         2       1
me          1       1
noble       2       1
so          2       1
the         1       1
the         2       1
told        2       1
was         1       1
was         2       1
with        2       1
you         2       1

Indexer step 4: Splitting into dictionary and posting files
• Each dictionary entry holds a pointer to its postings list.

Dictionary                          Postings
Term        N docs   Tot freq       (Doc #, Freq)
ambitious   1        1          →   (2, 1)
be          1        1          →   (2, 1)
brutus      2        2          →   (1, 1) (2, 1)
caesar      2        3          →   (1, 1) (2, 2)
capitol     1        1          →   (1, 1)
did         1        1          →   (1, 1)
enact       1        1          →   (1, 1)
hath        1        1          →   (2, 1)
I           1        2          →   (1, 2)
i'          1        1          →   (1, 1)
it          1        1          →   (2, 1)
julius      1        1          →   (1, 1)
killed      1        2          →   (1, 2)
let         1        1          →   (2, 1)
me          1        1          →   (1, 1)
noble       1        1          →   (2, 1)
so          1        1          →   (2, 1)
the         2        2          →   (1, 1) (2, 1)
told        1        1          →   (2, 1)
was         2        2          →   (1, 1) (2, 1)
with        1        1          →   (2, 1)
you         1        1          →   (2, 1)
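
The four indexer steps can be sketched in a few lines. This is an illustration, not the indexer used in the course: the tokenizer is an assumption (lowercase everything, keep letters and apostrophes), so the slide's "I" appears here as "i", and `build_index` is a hypothetical helper name.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def build_index(docs):
    # Step 1: sequence of (term, docID) pairs.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in tokenize(text)]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge duplicate pairs and record the term frequency.
    freqs = Counter(pairs)
    # Step 4: split into a dictionary and postings lists.
    postings = defaultdict(list)
    for (term, doc_id), tf in sorted(freqs.items()):
        postings[term].append((doc_id, tf))
    dictionary = {term: (len(plist), sum(tf for _, tf in plist))
                  for term, plist in postings.items()}
    return dictionary, dict(postings)


dictionary, postings = build_index(DOCS)
# dictionary maps term -> (N docs, Tot freq); postings maps term -> [(docID, freq), ...]
print(dictionary["caesar"], postings["caesar"])
```

In a real system the dictionary would store a pointer (for example a file offset) to each postings list. Here the shared key plays that role.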

Query processing: AND
• Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the Dictionary;
    - Retrieve its postings.
  - Locate Caesar in the Dictionary;
    - Retrieve its postings.
  - “Merge” the two postings:

Brutus   →   2 4 8 16 32 64 128
Caesar   →   1 2 3 5 8 13 21 34

The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus   →   2 4 8 16 32 64 128
Caesar   →   1 2 3 5 8 13 21 34
Result   →   2 8

• If the list lengths are x and y, the merge takes O(x+y) operations.
• Crucial: postings sorted by docID.
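
A minimal sketch of the merge for an AND query, assuming both postings lists are Python lists of docIDs in increasing order. The function name `intersect` is illustrative.

```python
def intersect(p1, p2):
    """Intersect two docID lists that are sorted in increasing order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, which is where the O(x+y) bound comes from. Without the sort order, neither pointer could be advanced safely.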

Boolean queries: Exact match
• The Boolean retrieval model lets us ask a query that is a Boolean expression:
  - Boolean queries are queries using AND, OR and NOT to join query terms
    - Views each document as a set of words
    - Is precise: a document either matches the condition or it does not.

Boolean queries: More general merges
• Adapt the merge for the queries:
  - Brutus AND NOT Caesar
  - Brutus OR NOT Caesar
• Can we still run through the merge in time O(x+y)?
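
One possible adaptation of the merge, offered as a sketch rather than the expected answer. `and_not` stays linear in x+y. `or_not` needs the list of all docIDs in the collection (the `all_docs` parameter, an assumption not on the slide), so its cost and output size grow with the collection rather than with x+y.

```python
def and_not(p1, p2):
    """docIDs in p1 that are absent from p2; both lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            # p1[i] cannot appear later in p2, so it survives.
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer


def or_not(p1, p2, all_docs):
    """docIDs that are in p1 or not in p2, given every docID in the collection."""
    # p1 OR NOT p2 is the complement of (p2 AND NOT p1).
    return and_not(all_docs, and_not(p2, p1))


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))
```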

Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of t terms.
• For each of the t terms, get its postings, then AND them together.

Brutus      →   2 4 8 16 32 64 128
Calpurnia   →   1 2 3 5 8 16 21 34
Caesar      →   13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example
• Process in order of increasing freq:
  - start with the smallest set, then keep cutting further.

Brutus      →   2 4 8 16 32 64 128
Calpurnia   →   1 2 3 5 8 13 21 34
Caesar      →   13 16

Execute the query as (Caesar AND Brutus) AND Calpurnia.
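
A sketch of the ordering heuristic, using the postings from the "Query optimization example" slide and taking the length of each postings list as the term's frequency. `intersect_many` is a hypothetical helper, and the early exit on an empty intermediate result is an added convenience.

```python
def intersect(p1, p2):
    """Linear-time intersection of two docID lists sorted in increasing order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_by_term):
    """AND all terms together, processing the shortest postings list first."""
    order = sorted(postings_by_term, key=lambda t: len(postings_by_term[t]))
    result = postings_by_term[order[0]]
    for term in order[1:]:
        if not result:
            break  # nothing left to cut further
        result = intersect(result, postings_by_term[term])
    return order, result


POSTINGS = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

order, result = intersect_many(POSTINGS)
print(order, result)
```

Starting from the shortest list keeps every intermediate result no larger than that list, so later merges have less to scan.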

More general optimization
• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get freq's for all terms.
• Estimate the size of each OR by the sum of its freq's (conservative).
• Process in increasing order of OR sizes.

Exercise
• Recommend a query processing order for
  (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term           Freq
eyes           213312
kaleidoscope    87009
marmalade      107913
skies          271658
tangerine       46653
trees          316812
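
The size-estimation heuristic can be sketched as follows, applied to the frequencies from the exercise table. Running it gives one candidate ordering to compare against your own answer. `order_and_of_ors` is a hypothetical helper name.

```python
FREQ = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}


def estimated_size(or_group, freq):
    """Conservative (upper-bound) size of an OR: the sum of its term frequencies."""
    return sum(freq[term] for term in or_group)


def order_and_of_ors(or_groups, freq):
    """Order the OR groups of an AND query by increasing estimated size."""
    return sorted(or_groups, key=lambda group: estimated_size(group, freq))


query = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]
for group in order_and_of_ors(query, FREQ):
    print(group, estimated_size(group, FREQ))
```

The sum is an upper bound because a document containing both terms of an OR is counted twice.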

More Advanced IR
• What about phrases?
  - Stanford University
• Proximity: Find Gates NEAR Microsoft.
  - Need index to capture position information in docs.
• Zones in documents: Find documents with (author = Ullman) AND (text contains automata).

Ranking search results
• Boolean queries give inclusion or exclusion of docs.
• Often we want to rank/group results
  - Need to measure proximity from query to each doc.
  - Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.
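
For the phrase and proximity queries mentioned under "More Advanced IR", one way to capture position information is a positional index that maps each term to the word positions where it occurs in each document. The sketch below is an illustration under stated assumptions: the four sample documents are invented, tokenization is a plain whitespace split, and `phrase` and `near` compare positions by brute force rather than with a linear merge.

```python
from collections import defaultdict

# Hypothetical sample documents, invented for this illustration.
DOCS = {
    1: "stanford university is in california",
    2: "the university near stanford",
    3: "gates founded microsoft",
    4: "microsoft hired many people before gates left",
}


def build_positional_index(docs):
    """Map term -> docID -> list of word positions (0-based)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase(index, t1, t2):
    """docIDs in which t2 occurs immediately after t1."""
    hits = []
    for doc_id in sorted(index[t1].keys() & index[t2].keys()):
        following = set(index[t2][doc_id])
        if any(pos + 1 in following for pos in index[t1][doc_id]):
            hits.append(doc_id)
    return hits


def near(index, t1, t2, k):
    """docIDs in which t1 and t2 occur within k word positions of each other."""
    hits = []
    for doc_id in sorted(index[t1].keys() & index[t2].keys()):
        if any(abs(a - b) <= k
               for a in index[t1][doc_id]
               for b in index[t2][doc_id]):
            hits.append(doc_id)
    return hits


index = build_positional_index(DOCS)
print(phrase(index, "stanford", "university"))
print(near(index, "gates", "microsoft", 2))
```

Document 2 contains both "stanford" and "university" but not as a phrase, which is exactly the case a plain docID-only index cannot distinguish.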

Clustering and classification
• Given a set of docs, group them into clusters based on their contents.
• Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

The web and its challenges
• Unusual and diverse documents
• Unusual and diverse users, queries, information needs
• Beyond terms, exploit ideas from social networks
  - link analysis, clickstreams ...
• How do search engines work? And how can we make them better?