WELCOME 1
DIADEM: domain-centric intelligent automated data extraction methodology
Web data as you want it
TEAM 2
INTRODUCTION 3
Cheng Wang · Tim Furche · now Facebook
Poster: Session II, № 57, today at 17:15-19:00
Demo paper WaDaR: today at 10:30-12:00, Giorgio Orsi · Stefano Ortona
HOW: TECHNOLOGY & TEAM 4
What? Data Extraction

ref-code  postcode  bedrooms  bathrooms  available   price
33453     OX2 6AR   3         2          15/10/2013  £1280 pcm
33433     OX4 7DG   2         1          18/04/2013  £995 pcm
HOW: TECHNOLOGY & TEAM 5
What? Data Extraction
>10000

ref-code  postcode  bedrooms  bathrooms  available   price
33453     OX2 6AR   3         2          15/10/2013  £1280 pcm
33433     OX4 7DG   2         1          18/04/2013  £995 pcm
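To make the target of extraction concrete, the sample rows above can be read as typed records. The following is a minimal Python sketch under stated assumptions: the `Listing` class, its field names, and `parse_row` are hypothetical and chosen for illustration only. It is not DIADEM's code and says nothing about how DIADEM obtains these rows from a page.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Listing:
    # Hypothetical schema mirroring the column names on the slide.
    ref_code: str
    postcode: str
    bedrooms: int
    bathrooms: int
    available: date
    price_gbp_pcm: int  # monthly rent in pounds ("pcm" = per calendar month)


def parse_row(row: str) -> Listing:
    """Parse one whitespace-separated row laid out like the sample table."""
    # A UK postcode has two parts, so each row splits into eight tokens.
    ref, outward, inward, beds, baths, avail, price, period = row.split()
    if period != "pcm" or not price.startswith("£"):
        raise ValueError(f"unexpected price format: {price} {period}")
    return Listing(
        ref_code=ref,
        postcode=f"{outward} {inward}",
        bedrooms=int(beds),
        bathrooms=int(baths),
        available=datetime.strptime(avail, "%d/%m/%Y").date(),
        price_gbp_pcm=int(price[1:]),
    )


rows = [
    "33453 OX2 6AR 3 2 15/10/2013 £1280 pcm",
    "33433 OX4 7DG 2 1 18/04/2013 £995 pcm",
]
listings = [parse_row(r) for r in rows]
```

Typing the fields (integers, a date, a numeric price) is what makes records from different sites comparable once they land in a single database.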
6
“For many kinds of information one has to extract from thousands of sites in order to build a comprehensive database”
– Nilesh Dalvi et al., VLDB 2012
HOW: TECHNOLOGY & TEAM 7
Result Summary
○ 6 domains (real estate, used cars, locations, electronics, …)
○ 500-5000 sites for each domain
○ > 96% precision of extracted primary attributes
○ 85-95% perfect recall wrappers (consistently in all domains)
DIADEM 8
DIADEM: Many Domains
○ Domains considered from 2014–2015
  ◗ Real estate UK & US
  ◗ Used cars UK & US
  ◗ Products:
    • consumer electronics (Singapore, Malaysia)
    • fashion (UK)
  ◗ Locations:
    • restaurant (chains & open web, US)
    • hotels (US)
DIADEM 9
DIADEM: Process
[Figure: process diagram. Labels: Site URL, Ontology, Exploration, Form understanding & filling, Record & attribute identification, Induction, Extraction]
DIADEM EXAMPLE 10
[Figure: annotated example, callouts 1-2]
DIADEM EXAMPLE 11
[Figure: annotated example, callouts 1-4]
DIADEM EXAMPLE 12
[Figure: annotated example, callouts 1-5. Labels: Contact Form; Up to £250,000; iFrame with results; <250k]
HOW: TECHNOLOGY & TEAM 13
Strong Principles
1 ROSeAnn (VLDB’14): Entity extraction from text and structure
2 OPAL (WWW’12, VLDBJ’13): Form understanding & filling
3 AMBER (under submission): Record identification for listing pages
4 OXPath (VLDB’11, VLDBJ’13): Extraction language
5 WaDaR (demo @ VLDB’15): Joint wrapper and relation repair
6 DIADEM (VLDB’14): World-first accurate, automatic full-site extraction system
DIADEM 14
Control Flow: guarded FST
[Figure: control-flow diagram, callouts 1-7. Labels: Stage 1: Init; Page; Browser; crawler; Interaction; Decision: Which action to take?; filling; failure; success; back; next; iFrame; link; Stage 5: Finalize]
Guarded FSTs: exponentially more succinct than plain FSTs.
DIADEM 15
Control Flow: guarded FST
[Figure: control-flow diagram, callouts 1-4. Labels: Stage 1: Page Init; Stage 3: Crawling; field set selection; behavior selection; value selection; field iteration; browser interaction; modification; classifier]
Guarded FSTs as relational transducers: scale to hundreds of states and millions of facts
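The idea of a guarded transducer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DIADEM's implementation: the state names, fact names, and action names are hypothetical and only loosely borrowed from the slide labels. Each transition carries a guard, a predicate over the facts currently known about the page, and emits an action when it fires.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple

Facts = FrozenSet[str]


@dataclass(frozen=True)
class Transition:
    source: str
    guard: Callable[[Facts], bool]  # must hold on the current facts to fire
    action: str                     # output symbol emitted when it fires
    target: str


# Hypothetical states, facts, and actions, chosen for illustration only.
TRANSITIONS = [
    Transition("init", lambda f: "has_form" in f, "fill_form", "filling"),
    Transition("init", lambda f: "has_form" not in f, "follow_link", "crawling"),
    Transition("filling", lambda f: "failure" in f, "try_next_value", "filling"),
    Transition("filling", lambda f: "success" in f, "open_results", "crawling"),
    Transition("crawling", lambda f: "has_next" in f, "click_next", "crawling"),
    Transition("crawling", lambda f: "has_next" not in f, "emit_wrapper", "finalize"),
]


def run(
    transitions: List[Transition],
    start: str,
    final: str,
    observations: Iterable[Facts],
) -> Tuple[str, List[str]]:
    """Fire the first enabled transition per observation; collect emitted actions."""
    state, actions = start, []
    for facts in observations:
        if state == final:
            break
        fired = next(
            (t for t in transitions if t.source == state and t.guard(facts)),
            None,
        )
        if fired is None:
            raise RuntimeError(f"no enabled transition in state {state!r}")
        actions.append(fired.action)
        state = fired.target
    return state, actions


observations = [
    frozenset({"has_form"}),
    frozenset({"failure"}),
    frozenset({"success"}),
    frozenset({"has_next"}),
    frozenset(),
]
final_state, actions = run(TRANSITIONS, "init", "finalize", observations)
```

Because a guard can test any combination of facts, the machine does not need a separate state for every combination, which is one way to read the succinctness claim on the slides.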
RESULT PAGE PHENOMENOLOGY 16
[Figure: eight annotated result pages in two rows of four]
Row 1
○ Layouts: GRID, GRID, LIST, LIST
○ Sites: girardlettings.co.uk, innesmackay.com, remax.co.uk, adzuna.co.uk
○ Annotations: 1: Single-level GRID; 1: Frequent description attribute; 1: Interspersed ad; 1: Outlier record; 2: Optional bathroom; 2: Multi-node location; 2: Multi-attribute title; 3: Location in title and separate
Row 2
○ Layouts: LIST, LIST, GRID, GRID
○ Sites: auto100.co.uk, finders.co.uk, motorclick.co.uk, perrys.co.uk
○ Annotations: 1: Multiple prices; 1: Many attributes; 1: Multiple prices; 1: Missing record (?); 2: Structured location; 2: Multi-attribute title; 2: Record without price and make; 3: Unit of measure
HOW: TECHNOLOGY & TEAM 17
http://diadem.cs.ox.ac.uk/demo
DIADEM ANALYSIS 18
Full-site extraction

                    wrapper effective  wrong or no data  missing data
UK real estate      91%                7%                2%
Oxford real estate  90%                6%                4%
ViNTs               4%                 5%                91%
UK used cars        93%                4%                3%
US real estate      90%                5%                5%
DIADEM ANALYSIS 19
Competition: Segmentation
[Bar charts: record segmentation, precision and recall on RE-RND and UC-RND. Values per system as extracted, in chart order:]
MDR     38%  48%  56%  72%
DEPTA   77%  53%  84%  58%
ViNTs   88%  78%  95%  81%
DIADEM  98%  97%  99%  99%
CONCLUSION: Competitors do only a part of the job, and poorly
DIADEM ANALYSIS 20
Competition: Attributes
[Bar charts: attribute extraction, precision and recall on RE-RND and UC-RND. Values per system as extracted, in chart order:]
RoadRunner  42%  65%  48%  60%
DEPTA       83%  74%  84%  58%
DIADEM      97%  96%  95%  95%
CONCLUSION: Competitors do only a part of the job, and poorly
DIADEM ANALYSIS 21
Competition: Forms
ICQ dataset, F1 for labeling:
HA [14]          92%
ExQ [41]         96%
StatParser [36]  96%
DIADEM [17]      98%
Only labelling, no classification or filling
DIADEM ANALYSIS 22
Performance: Analysis Phase
[Plot: time (minutes, 0-20) against visited pages (0-40), RE-FULL]
DIADEM ANALYSIS 23
Performance: Extraction Phase
[Plot: time (seconds, 0-2000) against number of records (0-1000)]
DIADEM extracts from the web as it is
“It's a cruel and random world, but the chaos is all so beautiful.”
– Hiromu Arakawa
DIADEM extracts full sites automatically
Segmentation · Form filling · Alignment · Crawling · Pagination · Object extraction · Wrapper induction
DIADEM extracts full domains + no per-site supervision at all
HOW: TECHNOLOGY & TEAM 28
Chain locations technology evaluation by a US tech company
○ Following a presentation of DIADEM
  ◗ they didn’t believe that this works
○ We need locations of restaurant chains all over
○ Challenge: what can you do in 2-3 weeks?
  ◗ from a given list of some 300 chains
HOW: TECHNOLOGY & TEAM 29
Chain locations technology evaluation by a large US tech company
○ 160,000 restaurant chain locations, from over 295 chains including all major chains
○ 95% precision of extracted location information
○ 85% effective wrappers, all automatically maintained
○ 30 days from start to finish
○ 3 person team
HOW: TECHNOLOGY & TEAM 30
PER ATTRIBUTE ACCURACY

Attribute       Correct  Wrong  Wrong & empty  Good Scrape, Bad Data  No Benchmark  Accuracy  Precision
category        826      0      0              0                      9             100.00%   100.00%
city            829      6      0              0                      0             99.28%    99.28%
closed          11       3      3              0                      821           78.57%    100.00%
hours           382      446    446            5                      2             46.14%    100.00%
latlong         745      89     88             0                      1             89.33%    99.87%
located_in      17       59     59             1                      758           22.37%    100.00%
name            831      4      0              0                      0             99.52%    99.52%
phone           709      126    117            0                      0             84.91%    98.75%
postal_code     803      9      8              0                      23            98.89%    99.88%
street_address  803      16     2              16                     0             98.05%    98.29%
website_0       818      6      0              9                      2             99.27%    99.27%
Average                                                                             83.30%    99.53%

This evaluation is done by independent, external evaluators on a sample of more than 830 locations.
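The slide does not define its two measures. The counts are consistent with the following reading, which is inferred from the numbers rather than stated by the authors: accuracy = Correct / (Correct + Wrong), and precision = Correct / (Correct + Wrong that are not empty), with "Wrong & empty" treated as a subset of "Wrong". A short Python sketch of that reading, with hypothetical class and function names, checked against three rows of the table:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeCounts:
    correct: int
    wrong: int
    wrong_and_empty: int  # assumed subset of `wrong`: nothing was extracted

    def accuracy(self) -> float:
        # Inferred definition: correct values over all judged values.
        return self.correct / (self.correct + self.wrong)

    def precision(self) -> float:
        # Inferred definition: empty extractions do not count against precision.
        wrong_non_empty = self.wrong - self.wrong_and_empty
        return self.correct / (self.correct + wrong_non_empty)


def pct(x: float) -> float:
    """Express a ratio as a percentage rounded to two decimals."""
    return round(100 * x, 2)


# Counts copied from the per-attribute table.
counts = {
    "hours": AttributeCounts(correct=382, wrong=446, wrong_and_empty=446),
    "phone": AttributeCounts(correct=709, wrong=126, wrong_and_empty=117),
    "street_address": AttributeCounts(correct=803, wrong=16, wrong_and_empty=2),
}

for name, c in counts.items():
    print(f"{name}: accuracy {pct(c.accuracy())}%, precision {pct(c.precision())}%")
```

Under this reading, an attribute such as hours can have low accuracy and perfect precision at the same time: the extractor often returns nothing, but what it does return is right.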
DIADEM 31
More
Demo: http://diadem.cs.ox.ac.uk/vldb15/demo.mp4
Slides: http://diadem.cs.ox.ac.uk/vldb15/slides.pdf
Evaluation: http://diadem.cs.ox.ac.uk/evaluation/14/02/
Selected papers:
○ Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang: DIADEM: Thousands of Websites to a Single Database. PVLDB 7(14): 1845-1856 (2014)
○ Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers: OXPath: A language for scalable data extraction, automation, and crawling on the deep web. VLDB J. 22(1): 47-72 (2013)
○ Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt: ROSeAnn: Reconciling Opinions of Semantic Annotators. PVLDB 6(12): 1238-1241 (2013)
○ Jens Lehmann, Tim Furche, Giovanni Grasso, Axel-Cyrille Ngonga Ngomo, Christian Schallhart, Andrew Jon Sellers, Christina Unger, Lorenz Bühmann, Daniel Gerber, Konrad Höffner, David Liu, Sören Auer: DEQA: Deep Web Extraction for Question Answering. International Semantic Web Conference (2) 2012: 131-147
○ Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang: Little Knowledge Rules the Web: Domain-Centric Result Page Extraction. RR 2011: 61-76
DIADEM 32
Summary
You want the location of all the restaurants in the US? … or the price of all the houses in the UK?
○ [attributes] location, amenities, opening times, offered services, terms, features, availability, …
○ [entities] restaurants, hotels, hairdressers, rock concerts, rental cars, headphones, mortgage loans, …
○ [regions] US, UK, Brasil, Germany, Indonesia, World
DIADEM in less than 30 words
▪ automated data extraction effectively covering entire verticals (100k+ sources)
▪ unrivalled performance in extracting entities, including places, people, products
▪ highly disruptive technology with value for even established players
Independently verified
○ from 100,000s of restaurant, real estate, used car, … websites
○ yielding 1,000,000s of products, businesses, places, and other entities
○ at >95% precision
○ >75-95% sources with 100% recall
Delivered at little human effort
○ with just 3 engineers
○ 2-3 weeks for any vertical, once
○ with automatic maintenance