Web Data Management
Advanced Topics in Database Management (INFSCI 2711)
Vladimir Zadorozhny, DINS, SCI, University of Pittsburgh

Textbooks:
- Database System Concepts (2010)
- Introduction to Information Retrieval (2008)

The Web document collection
- No design/coordination
- Unstructured (text, HTML, ...), semi-structured (XML, annotated photos), structured (databases), ...
- Distributed content creation, linking, democratization of publishing
- Content includes truth, lies, obsolete information, contradictions, ...
- Scale much larger than previous text collections
- Growth has slowed from the initial "volume doubling every few months," but the Web is still expanding
- Content can be dynamically generated
Web search basics
[Figure: anatomy of a web search for the query "miele" — the user submits a query; a web spider crawls The Web; an indexer builds the search indexes and ad indexes. The results page shows sponsored links (appliance and vacuum-cleaner ads) alongside roughly 7,310,000 web results returned in 0.12 seconds, with snippets from miele.com, miele.co.uk, miele.de, and miele.at.]

How far do people look for results? (Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf)
How good are the retrieved docs?
- Precision: fraction of retrieved docs that are relevant to the user's information need
- Recall: fraction of relevant docs in the collection that are retrieved
- On the web, recall seldom matters. What matters:
  - Precision at 1? Precision above the fold?
  - Recall matters when the number of matches is very small
- Quality of pages varies widely; relevance alone is not enough:
  - Content: trustworthy, diverse, non-duplicated, well maintained
  - Web readability: displays correctly and fast
  - No annoyances: pop-ups, etc.

Distributed indexing
- For web-scale indexing we must use a distributed computing cluster
- Individual machines are fault-prone
  - They can unpredictably slow down or fail
- How do we exploit such a pool of machines?
- Example: Google data centers
  - Mainly contain commodity machines
  - Data centers are distributed around the world
  - Estimate: a total of 1 million servers, 3 million processors/cores (2007)
  - Estimate: Google installs 100,000 servers each quarter
    - Based on expenditures of 200-250 million dollars per year
  - If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the whole system? Answer: 0.999^1000 ≈ 37% — in other words, about 63% of the time at least one node is down.
  - What about the number of servers failing per minute in an installation of 1 million servers?
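As a quick sanity check on these numbers, here is a minimal Python sketch. The uptime calculation follows directly from the slide; the three-year mean time between failures used for the last question is an illustrative assumption, not a figure from the slides:

```python
# Uptime of a non-fault-tolerant system: all 1000 nodes must be up at once.
node_uptime = 0.999
nodes = 1000
system_uptime = node_uptime ** nodes
print(f"System uptime: {system_uptime:.1%}")          # ~36.8%
print(f"Some node is down: {1 - system_uptime:.1%}")  # ~63.2%

# Expected server failures per minute for 1 million servers,
# ASSUMING each server fails on average once every 3 years (illustrative).
servers = 1_000_000
mtbf_minutes = 3 * 365 * 24 * 60
print(f"Expected failures per minute: {servers / mtbf_minutes:.2f}")  # ~0.63
```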
Implementation of distributed indexing
- Maintain a master machine directing the indexing job; it is considered "safe".
- Break the indexing up into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.
- Term-partitioned vs. document-partitioned index:
  - Term-partitioned: one machine handles a subrange of terms
  - Document-partitioned: one machine handles a subrange of documents
- Index construction is just one phase; another phase transforms a term-partitioned index into a document-partitioned one.
- Most search engines use a document-partitioned index (see the sketch below).
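To make the distinction concrete, here is a minimal sketch of a document-partitioned index in Python. The shard count and the hash-based document assignment are illustrative assumptions; the point is that each shard holds complete postings for its own documents, so every query must be broadcast to all shards and the partial results merged:

```python
from collections import defaultdict

NUM_SHARDS = 4  # illustrative; real clusters use thousands of machines

def shard_of(doc_id: int) -> int:
    """Assign each document to one shard, e.g. by hashing its ID."""
    return doc_id % NUM_SHARDS

# One in-memory inverted index (term -> list of doc IDs) per shard.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]

def index_document(doc_id: int, text: str) -> None:
    postings = shards[shard_of(doc_id)]
    for term in set(text.lower().split()):
        postings[term].append(doc_id)

def search(term: str) -> list[int]:
    # Document partitioning: every shard must be consulted for every query.
    hits = []
    for postings in shards:
        hits.extend(postings.get(term.lower(), []))
    return sorted(hits)

index_document(0, "Caesar was ambitious")
index_document(1, "Brutus killed Caesar")
print(search("caesar"))  # [0, 1]
```

With a term-partitioned index, by contrast, only the single machine owning the queried term would be contacted, but multi-term queries would require shipping postings lists between machines.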
Ranked retrieval
- Thus far, our queries have all been Boolean: documents either match or they don't.
- Boolean retrieval is good for expert users with a precise understanding of their needs and of the collection.
- It is also good for applications: applications can easily consume thousands of results.
- It is not good for the majority of users:
  - Most users are incapable of writing Boolean queries (or they are capable, but think it's too much work).
  - Most users don't want to wade through thousands of results. This is particularly true of web search.

Problem with Boolean search
- Boolean queries often produce either too few (= 0) or too many (thousands of) results:
  - Query 1: "standard user dlink 650" → 200,000 hits
  - Query 2: "standard user dlink 650 no card found" → 0 hits
- It takes skill to come up with a query that produces a manageable number of hits.
- With a ranked list of documents, it does not matter how large the retrieved set is.
Query-document matching scores
- We need a way of assigning a score to each query/document pair.
- Let's start with a one-term query:
  - If the query term does not occur in the document, the score should be 0.
  - The more frequent the query term is in the document, the higher the score should be.
- We will look at a number of alternatives for this.

Recall: binary term-document incidence matrix

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony      1           1       0        0       0        1
Brutus      1           1       0        1       0        0
Caesar      1           1       0        1       1        1
Calpurnia   0           1       0        0       0        0
Cleopatra   1           0       0        0       0        0
mercy       1           0       1        1       1        1
worser      1           0       1        1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|.
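A minimal sketch of how such an incidence matrix supports the Boolean queries discussed above (the tiny term subset below is chosen for illustration): each term's row is a bit vector, and a query like "Brutus AND Caesar AND NOT Calpurnia" is answered with bitwise operations over those rows.

```python
import numpy as np

docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]
terms = ["Antony", "Brutus", "Caesar", "Calpurnia"]

# Binary incidence matrix: rows are terms, columns are documents.
incidence = np.array([
    [1, 1, 0, 0, 0, 1],  # Antony
    [1, 1, 0, 1, 0, 0],  # Brutus
    [1, 1, 0, 1, 1, 1],  # Caesar
    [0, 1, 0, 0, 0, 0],  # Calpurnia
], dtype=bool)

row = {t: incidence[i] for i, t in enumerate(terms)}

# Boolean query: Brutus AND Caesar AND NOT Calpurnia
answer = row["Brutus"] & row["Caesar"] & ~row["Calpurnia"]
print([d for d, hit in zip(docs, answer) if hit])
# ['Antony and Cleopatra', 'Hamlet']
```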
Term-document count matrices
- Consider the number of occurrences of a term in a document:
  - Each document is a count vector in ℕ^|V|: a column below.

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony      157         73      0        0       0        0
Brutus      4           157     0        1       0        0
Caesar      232         227     0        2       1        1
Calpurnia   0           10      0        0       0        0
Cleopatra   57          0       0        0       0        0
mercy       2           0       3        5       5        1
worser      2           0       1        1       1        0

Term frequency tf
- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term,
  - but not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.
Log-frequency weighting
- The log frequency weight of term t in d is:

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
    w_{t,d} = 0                     otherwise

- So: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t appearing in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

- The score is 0 if none of the query terms is present in the document.

Document frequency
- Rare terms are more informative than frequent terms.
- Consider a term in the query that is rare in the collection (e.g., arachnocentric).
- A document containing this term is very likely to be relevant to the query arachnocentric.
- We want a high weight for rare terms like arachnocentric.
- We will use document frequency (df) to capture this in the score.
- df_t (≤ N) is the number of documents that contain the term t.
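A minimal Python sketch of log-frequency weighting and the resulting overlap score (the function names are my own, not from the slides):

```python
import math

def log_tf_weight(tf: int) -> float:
    """w = 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms: set[str], doc_tf: dict[str, int]) -> float:
    """Sum log-frequency weights over terms shared by query and document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

# Reproduces the examples on the slide:
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 2))  # 0.0, 1.0, 1.3, 2.0, 4.0
```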
idf weight
- df_t is the document frequency of t: the number of documents that contain t.
  - df_t is an inverse measure of the informativeness of t.
- We define the idf (inverse document frequency) of t by:

    idf_t = log10(N / df_t)

  - We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

idf example, suppose N = 1 million:

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

There is one idf value for each term t in a collection.
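The table values follow directly from the definition; a quick sketch to verify them:

```python
import math

N = 1_000_000  # documents in the collection

def idf(df: int) -> float:
    return math.log10(N / df)

for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df:>9,} idf={idf(df):.0f}")
```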
Collection vs. document frequency
- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example:

  Word        Collection frequency   Document frequency
  insurance   10,440                 3,997
  try         10,422                 8,760

- Which word is a better search term (and should get a higher weight)?

tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

- Best-known weighting scheme in information retrieval
- Alternative names: tf.idf, tf × idf
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
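Putting the two pieces together, a minimal sketch of the full tf-idf weight (building on the hypothetical helpers above):

```python
import math

def tfidf(tf: int, df: int, N: int) -> float:
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in a doc, present in 1,000 of 1 million docs:
print(round(tfidf(tf=10, df=1_000, N=1_000_000), 2))  # 2 * 3 = 6.0
```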
Binary → count → weight matrix

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony      5.25        3.18    0        0       0        0.35
Brutus      1.21        6.1     0        1       0        0
Caesar      8.59        2.54    0        1.51    0.25     0
Calpurnia   0           1.54    0        0       0        0
Cleopatra   2.85        0       0        0       0        0
mercy       1.51        0       1.9      0.12    5.25     0.88
worser      1.37        0       0.11     4.15    0.25     1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Documents as vectors
- So we have a |V|-dimensional vector space.
- Terms are the axes of the space.
- Documents are points (or vectors) in this space.
- Very high-dimensional: hundreds of millions of dimensions when you apply this to a web search engine.
- These are very sparse vectors: most entries are zero.
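Because the vectors are so sparse, a practical representation stores only the nonzero entries. A minimal sketch under that assumption (the dict-based representation and dot-product helper are illustrative choices, shown here with the Hamlet and Othello columns from the matrix above):

```python
# Each document vector keeps only its nonzero tf-idf entries, so storage
# is proportional to the number of distinct terms in the document,
# not to the vocabulary size |V|.
hamlet = {"Brutus": 1.0, "Caesar": 1.51, "mercy": 0.12, "worser": 4.15}
othello = {"Caesar": 0.25, "mercy": 5.25, "worser": 0.25}

def dot(u: dict[str, float], v: dict[str, float]) -> float:
    """Dot product over the (few) dimensions where both vectors are nonzero."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

print(round(dot(hamlet, othello), 3))  # 2.045
```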