WebTables & Octopus
Michael J. Cafarella
University of Washington
CSE454, April 30, 2009

Outline
• WebTables
• Octopus

WebTables
• The WebTables system automatically extracts databases from the web crawl
  [WebDB08, "Uncovering…", Cafarella et al]
  [VLDB08, "WebTables: Exploring…", Cafarella et al]
• (Figure: a single crawled page contains 16 distinct HTML tables, but only one relational database)
• Each relational database has its own schema, usually with labeled columns
• Pipeline: Raw crawled pages → Raw HTML Tables → Recovered Relations → Schema Statistics + Applications
• An extracted relation is one table plus labeled columns
• Estimate: our crawl of 14.1B raw HTML tables contains ~154M good relational databases
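The slides don't show the filtering code, but the "raw HTML tables → recovered relations" step is concrete enough to sketch. Below is a minimal, hypothetical Python illustration of the kind of cheap heuristics that could weed out obviously non-relational tables; the real system used a trained classifier, and every threshold here is invented:

```python
def is_relational_candidate(rows, min_rows=2, min_cols=2):
    """Cheap heuristics separating obviously non-relational HTML tables
    (tiny layout tables, calendars) from relational candidates.
    `rows` is a list of rows, each a list of cell strings.
    All thresholds are illustrative, not the paper's."""
    if len(rows) < min_rows:
        return False                      # the "small tables" bucket
    ncols = max(len(r) for r in rows)
    if ncols < min_cols:
        return False
    # Relational tables keep a roughly consistent column count across rows.
    if sum(1 for r in rows if len(r) == ncols) < 0.8 * len(rows):
        return False
    # Calendars look like 7-column grids whose body cells are day numbers or blank.
    if ncols == 7 and all(c.strip().isdigit() or not c.strip()
                          for r in rows[1:] for c in r):
        return False
    return True

table = [["make", "model", "year"],
         ["Toyota", "Camry", "1984"],
         ["Mazda", "Protege", "2003"]]
print(is_relational_candidate(table))  # True
```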
The Table Corpus

  Table type                % total    Count
  Small tables                88.06    12.34B
  HTML forms                   1.34    187.37M
  Calendars                    0.04    5.50M
  Obvious non-rel (total)     89.44    12.53B
  Other non-rel (est.)         9.46    1.33B
  Relational (est.)            1.10    154.15M

Relation Recovery
• Step 1. Relational Filtering (Recall 81%, Precision 41%)
• Step 2. Metadata Detection (Recall 85%, Precision 89%)
• Example recovered relations:

  make       model    year        make      model   year  color
  Toyota     Camry    1984        Chrysler  Volare  1974  yellow
  Mazda      Protégé  2003        Nissan    Sentra  1994  red
  Chevrolet  Impala   1979

  name    addr     city     state  zip      name        size  last-modified
  Dan S   16 Park  Seattle  WA     98195    Readme.txt  182   Apr 26, 2005
  Alon H  129 Elm  Belmont  CA     94011    cac.xml     813   Jul 23, 2008

• Output: 271M databases, about 125M of them good
• Five orders of magnitude larger than the previous largest corpus
  [WWW02, "A Machine Learning…", Wang & Hu]

Schema Statistics
• Each recovered relation contributes its schema:

  Schema                            Freq
  {make, model, year}               1
  {make, model, year, color}        1
  {name, addr, city, state, zip}    1
  {name, size, last-modified}       1

• Schema stats are useful for computing attribute probabilities:
  - p("make"), p("model"), p("zipcode")
  - p("make" | "model"), p("make" | "zipcode")
• 2.6M unique relational schemas
• What can we do with those schemas? [VLDB08, "WebTables: Exploring…", Cafarella et al]
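To make these statistics concrete (including the schema coherency score S(R) that App #1 below uses for ranking), here is a minimal Python sketch over a hypothetical miniature corpus. The frequencies, smoothing constant, and normalization are all assumptions, not the paper's:

```python
import itertools
import math

# Hypothetical miniature schema corpus: schema (frozenset of attributes) -> frequency.
# The real WebTables statistics come from 2.6M unique schemas; these counts are invented.
corpus = {
    frozenset({"make", "model", "year"}): 40,
    frozenset({"make", "model", "year", "color"}): 15,
    frozenset({"name", "addr", "city", "state", "zip"}): 30,
    frozenset({"name", "size", "last-modified"}): 15,
}
total = sum(corpus.values())

def p(*attrs):
    """Probability that a schema drawn from the corpus contains all of `attrs`."""
    return sum(f for s, f in corpus.items() if set(attrs) <= s) / total

def p_cond(a, b):
    """p(a | b): of schemas containing b, the fraction that also contain a."""
    return p(a, b) / p(b) if p(b) > 0 else 0.0

def coherency(schema, eps=1e-9):
    """S(R): average pairwise pointwise mutual information over a schema's
    attributes; eps is ad-hoc smoothing for pairs that never co-occur."""
    pairs = list(itertools.combinations(schema, 2))
    if not pairs:
        return 0.0
    pmi = [math.log((p(a, b) + eps) / (p(a) * p(b) + eps)) for a, b in pairs]
    return sum(pmi) / len(pairs)

print(p_cond("make", "model"))       # p("make" | "model") = 1.0 in the toy corpus
print(coherency({"make", "model"}))  # high: the pair co-occurs often
print(coherency({"make", "zip"}))    # very low: the pair never co-occurs
```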
App #1: Relation Search
• Problem: keyword search for high-quality extracted databases
• Output depends on both the quality of the extracted tables and the ranking function

App #1: Relation Search (cont.)
• Schema statistics can help improve both:
  - Relation Recovery (metadata detection)
  - Ranking
• Metadata detection, when adding schema-stats scoring:
  - Precision 0.79 ⇒ 0.89
  - Recall 0.84 ⇒ 0.85
• Ranking: compute a schema coherency score S(R) for relation R and add it to the feature vector
  - Measures how well a schema "hangs together"
  - High: {make, model}; Low: {make, zipcode}
  - Average pairwise Pointwise Mutual Information score over all attributes in the schema
    (the coherency() function sketched above)

App #1: Experiments
• Ranking: compared 4 rankers on a test set
  - Naïve: top-10 pages from google.com
  - Filter: top-10 good tables from google.com
  - Rank: trained ranker
  - Rank-Stats: trained ranker with coherency score
• What fraction of the top-k are relevant? (parenthesized: improvement over Naïve)

  k    Naïve   Filter        Rank          Rank-Stats
  10   0.26    0.35 (+34%)   0.43 (+65%)   0.47 (+80%)
  20   0.33    0.47 (+42%)   0.56 (+70%)   0.59 (+79%)
  30   0.34    0.59 (+74%)   0.66 (+94%)   0.68 (+100%)

App #2: Schema Autocomplete
• Input: topic attribute (e.g., make)
• Output: relevant schema {make, model, year, price}
• "Tab-complete" for your database
• Algorithm, for input set I, output S, threshold t (a runnable sketch follows the App #3 slide below):
  - while p(S − I | I) > t:
    - newAttr = argmax p(newAttr, S − I | I)
    - S = S ∪ {newAttr}
    - emit newAttr
• Examples:

  Input        Output
  name         name, size, last-modified, type
  instructor   instructor, time, title, days, room, course
  elected      elected, party, district, incumbent, status, …
  ab           ab, h, r, bb, so, rbi, avg, lob, hr, pos, batters
  sqft         sqft, price, baths, beds, year, type, lot-sqft, …

App #2: Experiments
• Asked experts for schemas in 10 areas
• What was the autocompleter's recall?

App #3: Synonym Discovery
• Input: topic attribute (e.g., address)
• Output: relevant synonym pairs (telephone = tel-#)
• Used for schema matching [VLDB01, "Generic Schema Matching…", Madhavan et al]
• Linguistic thesauri are incomplete; hand-made thesauri are burdensome
• Scores candidate attribute pairs a, b against an input domain of context attributes C;
  true synonyms never co-occur in the same schema, so the score is applied only when p(a,b) = 0
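Here is the autocomplete loop from the slide above as runnable Python, reusing `corpus` and `p()` from the schema-statistics sketch earlier. The argmax objective and stopping rule are my paraphrase of the slide's pseudocode, and the threshold is arbitrary:

```python
def autocomplete(I, t=0.3):
    """Greedy schema autocomplete (the slide's loop, paraphrased):
    repeatedly add the attribute a maximizing p(S - I + {a} | I), stopping
    when the best conditional probability drops below threshold t."""
    all_attrs = set().union(*corpus)       # candidate pool from the corpus
    S = set(I)
    while True:
        best, best_score = None, t
        for a in all_attrs - S:
            # Since I is a subset of S, p(S - I + {a} | I) = p(S + {a}) / p(I).
            score = p(*(S | {a})) / p(*I) if p(*I) > 0 else 0.0
            if score > best_score:
                best, best_score = a, score
        if best is None:
            return S - set(I)              # the suggested completion
        S.add(best)

print(autocomplete({"make"}))              # {'model', 'year'} on the toy corpus
```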
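The synonym-scoring formula itself did not survive extraction from the slide. As a hedged reconstruction of the idea (again reusing `p()` and `p_cond()` from the earlier sketch): reward pairs a, b that never co-occur yet agree, attribute by attribute, on how they co-occur with every context attribute z ∈ C. The exact denominator below is an assumption, not the paper's formula:

```python
def syn_score(a, b, C, eps=0.01):
    """Synonymy score for attributes a, b given context attributes C.
    Applies only when p(a, b) == 0 (true synonyms never share a schema);
    rewards pairs that co-occur the same way with every other attribute z.
    The denominator form is a reconstruction; treat it as an assumption."""
    if p(a, b) > 0:
        return 0.0
    disagreement = sum((p_cond(z, a) - p_cond(z, b)) ** 2 for z in C)
    return (p(a) * p(b)) / (eps + disagreement)

# Toy demo: "zip" and "size" never co-occur, but they also behave differently
# with respect to addr/city/state, so the score stays modest.
C = {"name", "addr", "city", "state"}
print(syn_score("zip", "size", C))
```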
App #3: Synonym Discovery (cont.)
• For each input attribute, repeatedly emit the best synonym pair (until a minimum
  threshold is reached)

App #3: Experiments

  Input        Discovered synonyms
  name         e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified
  instructor   course-title|title, day|days, course|course-#, course-name|course-title
  elected      candidate|name, presiding-officer|speaker
  ab           k|so, h|hits, avg|ba, name|player
  sqft         bath|baths, list|list-price, bed|beds, price|rent

WebTables Contributions
• Largest collection of databases and schemas, by far
• Large-scale extracted schema data for the first time; enables novel applications

Outline
• WebTables
• Octopus

Multiple Tables
• Can we combine tables to create new data sources?
• Data integration for the Structured Web
• Many existing "mashup" tools, which ignore realities of Web data:
  - A lot of useful data is not in XML
  - User cannot know all sources in advance
  - Transient integrations

Integration Challenge
• Try to create a database of all "VLDB program committee members"
Octopus
• Provides a "workbench" of data integration operators to build the target database
• Most operators are not correct/incorrect, but high/low quality (like search)
• Also, prosaic traditional operators

Walkthrough - Operator #1
• SEARCH("VLDB program committee members") returns ranked extracted tables:

  serge abiteboul   inria            serge abiteboul   inria
  michael adiba     … grenoble       anastassia ail…   carnegie…
  antonio albano    … pisa           gustavo alonso    eth zurich
  …                 …                …                 …

Walkthrough - Operator #2
• Recover relevant data that the tables themselves omit
• CONTEXT() pulls the PC year from each source page and attaches it as a new column:

  serge abiteboul   inria        1996    serge abiteboul   inria       2005
  michael adiba     … grenoble   1996    anastassia ail…   carnegie…   2005
  antonio albano    … pisa       1996    gustavo alonso    eth zurich  2005
  …                 …            …       …                 …           …

Walkthrough - Union
• Combine datasets: Union() merges the 1996 and 2005 tables into one

  serge abiteboul   inria        1996
  michael adiba     … grenoble   1996
  antonio albano    … pisa       1996
  serge abiteboul   inria        2005
  anastassia ail…   carnegie…    2005
  gustavo alonso    eth zurich   2005
  …                 …            …

Walkthrough - Operator #3
• Add a column to the data
• Similar to "join", but the join target is a topic
• EXTEND("publications", col=0) attaches each member's publications:

  serge abiteboul   inria        1996   "Large Scale P2P Dist…"
  michael adiba     … grenoble   1996   "Exploiting temporal…"
  antonio albano    … pisa       1996   "Another Example of a…"
  serge abiteboul   inria        2005   "Large Scale P2P Dist…"
  anastassia ail…   carnegie…    2005   "Efficient Use of the…"
  gustavo alonso    eth zurich   2005   "A Dynamic and Flexible…"
  …                 …            …      …

• The user has integrated data sources with little effort
• No wrappers; the data was never intended for reuse
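To make the operator semantics concrete, here is a minimal Python sketch of how this walkthrough might compose. SEARCH, CONTEXT, and EXTEND internals are stubbed with canned data from the slides; every name and signature is illustrative, not Octopus's actual API:

```python
from typing import List

Table = List[List[str]]  # a web table: rows of string cells

def search(query: str) -> List[Table]:
    """SEARCH: rank extracted web tables against a keyword query.
    Stubbed here with canned results from the walkthrough."""
    return [
        [["serge abiteboul", "inria"], ["michael adiba", "grenoble"]],
        [["serge abiteboul", "inria"], ["gustavo alonso", "eth zurich"]],
    ]

def context(table: Table, value: str) -> Table:
    """CONTEXT: recover data the table omits from its source page (e.g., the
    PC year) and attach it as a new column. Here the value is just passed in."""
    return [row + [value] for row in table]

def union(*tables: Table) -> Table:
    """Plain union of compatible tables (a prosaic, traditional operator)."""
    return [row for t in tables for row in t]

def extend(table: Table, topic: str, col: int) -> Table:
    """EXTEND: like a join, but the join target is a topic keyed on column
    `col`. This stub ignores `topic` and uses a canned lookup."""
    lookup = {"serge abiteboul": "Large Scale P2P Dist…"}
    return [row + [lookup.get(row[col], "?")] for row in table]

# The walkthrough as a pipeline:
t1996, t2005 = search("VLDB program committee members")
members = union(context(t1996, "1996"), context(t2005, "2005"))
print(extend(members, "publications", col=0))
```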