Learning Rules to Pre-process Web Data for Automatic Integration
Kai Simon, Thomas Hornung, Georg Lausen
Workshop "Model Checking and Semantic Web Rules Languages", 28th October 2006
Databases and Information Systems (DBIS) Research Group, Computer Science Department, Albert-Ludwigs-University Freiburg

Motivation
Most information available on the Web is only human-accessible, through presentation-oriented HTML pages.
We still lack techniques that enable machines (agents) to extract and understand presentation-oriented HTML pages so that they can act on behalf of humans.

Outline
- Introduction
  • System Overview
- Extraction and Alignment
- Table Mining

System Overview
[Figure labels: DB, Materialized View, embedded by scripts]
1st access of an unknown source:
- extract & align records
- Table Mining heuristics: Column Splitting, Label Assignment, Arithmetic Dependencies
- export the result of the Table Mining heuristics as Table Mining Rules
Access of a known source:
- extract & align records
- apply the Table Mining Rules (Column Splitting, Label Assignment, Arithmetic Dependencies)
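
The two access modes can be sketched as a small cache of rules per source: the heuristics run on the first access, and the stored rules are simply re-applied afterwards. This is only an illustration under stated assumptions. `RuleStore`, `toy_heuristics`, and the source name `shop.example` are hypothetical names, and the "heuristic" here is a trivial stand-in, not the table mining heuristics from the talk.

```python
from typing import Callable, Dict, List

Table = List[List[str]]           # aligned records: rows of cell strings
Rule = Callable[[Table], Table]   # one learned pre-processing step


class RuleStore:
    """Keeps the learned rules per source (hypothetical helper)."""

    def __init__(self, learn: Callable[[Table], List[Rule]]) -> None:
        self._learn = learn
        self._rules: Dict[str, List[Rule]] = {}
        self.learn_calls = 0  # counts how often the heuristics were run

    def process(self, source: str, table: Table) -> Table:
        # Unknown source: run the heuristics once and store the rules.
        if source not in self._rules:
            self._rules[source] = self._learn(table)
            self.learn_calls += 1
        # Known source: only apply the stored rules.
        for rule in self._rules[source]:
            table = rule(table)
        return table


def toy_heuristics(table: Table) -> List[Rule]:
    """Stand-in for table mining: 'learns' one whitespace-stripping rule."""
    def strip_cells(t: Table) -> Table:
        return [[cell.strip() for cell in row] for row in t]
    return [strip_cells]


store = RuleStore(toy_heuristics)
first = store.process("shop.example", [[" a ", "b "]])    # 1st access
second = store.process("shop.example", [[" c", " d "]])   # known source
```

Keeping the rules as plain callables keyed by source is one possible design; it makes it easy to inspect or replace individual rules later.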

Outline
- Introduction
- Extraction and Alignment: ViPER [CIKM'05]
  • Automatic Data Extraction
  • Tabular Alignment
- Table Mining

Automatic Data Extraction
Scan the Web page for similar data records.
Use visual information to
• segment the data records
• compute their relevance according to their location inside the Web page.
Extract the similar data records with the highest relevance.

Tabular Alignment
Data record alignment

Data Representation
F-Logic Facts

Outline
- Introduction
- Extraction and Alignment
- Table Mining: data-driven / statistical methods
  • Column Splitting
  • Label Assignment
  • Arithmetic Dependencies

Column Splitting
Data items of one column:
  data item 1: Save: £3.00 (13%)
  data item 2: Save: £6.00 (21%)
  data item 3: Save: £3.00 (9%)
Each data item is split into two subsets and then into typed tokens:
               subset 1                  subset 2
  data item 1: Save:  £         3.00     (            13   %)
  data item 2: Save:  £         6.00     (            21   %)
  data item 3: Save:  £         3.00     (            9    %)
  type:        text   currency  float    punctuation  int  text
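
The splitting idea can be sketched as: tokenize every cell of a column, type each token, and split the column only if all cells share the same type signature. This is a minimal sketch, not the method used in the talk. The regular expression, the character sets, and the function names `tokenize`, `classify`, and `split_column` are assumptions chosen so that the example items produce the token types shown on the slide.

```python
import re
from typing import List, Optional

CURRENCY = set("£$€")          # assumed set of currency symbols
PUNCTUATION = set("()[]{},;")  # assumed set of punctuation characters
# Floats, then ints, then single currency symbols, then any other run.
TOKEN = re.compile(r"\d+\.\d+|\d+|[£$€]|[^\s\d£$€]+")


def tokenize(cell: str) -> List[str]:
    """Split a cell such as 'Save: £3.00 (13%)' into tokens."""
    return TOKEN.findall(cell)


def classify(token: str) -> str:
    """Assign a coarse data type to a single token."""
    if re.fullmatch(r"\d+\.\d+", token):
        return "float"
    if token.isdigit():
        return "int"
    if token in CURRENCY:
        return "currency"
    if all(ch in PUNCTUATION for ch in token):
        return "punctuation"
    return "text"


def split_column(cells: List[str]) -> Optional[List[List[str]]]:
    """Return one token row per cell, or None if the cells disagree."""
    rows = [tokenize(cell) for cell in cells]
    signatures = {tuple(classify(t) for t in row) for row in rows}
    if len(signatures) != 1:
        return None  # no common structure, leave the column unsplit
    return rows


items = ["Save: £3.00 (13%)", "Save: £6.00 (21%)", "Save: £3.00 (9%)"]
tokens = tokenize(items[0])
types = [classify(t) for t in tokens]
split = split_column(items)
```

Requiring an identical signature for every cell is a deliberately strict choice; a statistical variant could accept the majority signature instead.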

Splitting Rules
…

Label Assignment
[Figure: three examples, each shown as visual representation, HTML source code, and assignment strategy over the aligned columns Col i, Col i+1, ...]
- Label before value (tags: b, span, br): "List Price: $499.99", "Our Price: $299.95"
- Value before label (tags: span, b, br): "$499.99 List Price", "$299.95 Our Price"
- Table layout (tags: tr, td, b): labels "List Price" and "Our Price" and values "$499.99" and "$299.95" in separate cells

Inter Label Assignment

Inner Label Assignment

Column Label Assignment Rules
- Inter label assignment
- Inner label assignment
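
As a rough illustration of the two assignment kinds, the sketch below assumes that "inter" means the label sits in its own neighbouring column and "inner" means the label sits inside the same cell as the value; the slides do not spell this out. The function names and the second data row are made up for the example, and the real rules are expressed differently.

```python
from os.path import commonprefix
from typing import Dict, List, Optional

Table = List[List[str]]


def inter_label(table: Table, col: int) -> Optional[str]:
    """Label candidate: the column holds the same text in every row."""
    values = {row[col].strip() for row in table}
    if len(values) != 1:
        return None
    label = values.pop().rstrip(":").strip()
    # Require at least one letter so constant numbers are not labels.
    return label if any(ch.isalpha() for ch in label) else None


def inner_label(table: Table, col: int) -> Optional[str]:
    """Label candidate: all cells share a prefix ending in a colon."""
    prefix = commonprefix([row[col] for row in table])
    if ":" not in prefix:
        return None
    label = prefix[:prefix.rindex(":")].strip()
    return label or None


def assign_labels(table: Table) -> Dict[int, str]:
    """Map column index -> label for a non-empty aligned table."""
    labels: Dict[int, str] = {}
    width = len(table[0])
    for col in range(width):
        label = inter_label(table, col)
        if label is not None and col + 1 < width:
            labels[col + 1] = label  # constant column labels its right neighbour
            continue
        label = inner_label(table, col)
        if label is not None:
            labels.setdefault(col, label)
    return labels


# First row taken from the slide, second row is invented test data.
separate = [
    ["List Price:", "$499.99", "Our Price:", "$299.95"],
    ["List Price:", "$89.00", "Our Price:", "$59.50"],
]
embedded = [["List Price: $499.99"], ["List Price: $89.00"]]
inter_result = assign_labels(separate)
inner_result = assign_labels(embedded)
```

The sketch only looks to the right of a constant column; the "value before label" layout from the Label Assignment slide would need the mirrored check.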

Functional Methods and Updates
Solution

Improvement of Heuristics
Rules can be:
- modified
- removed
- added

Arithmetic Dependencies
Find arithmetic dependencies between numeric columns by checking the homogeneous systems of linear equations [equation] or [equation] for non-trivial solutions.

Arithmetic Dependencies
The system has discovered the arithmetic dependency:
newPrice = oldPrice - discount
|oldPrice - newPrice - discount| < threshold
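
The threshold test can be tried out with a simplified search. Instead of solving the homogeneous linear system in general, this sketch only tries coefficients -1, 0, and +1 for each numeric column and accepts a combination if its absolute value stays below the threshold in every row. That restriction, the function name `find_dependency`, the default threshold, and all rows except the first are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Optional


def find_dependency(
    columns: Dict[str, List[float]], threshold: float = 0.005
) -> Optional[Dict[str, int]]:
    """Search for coefficients c_i in {-1, 0, 1} with |sum c_i * col_i| < threshold."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    for coeffs in product((1, 0, -1), repeat=len(names)):
        nonzero = [c for c in coeffs if c != 0]
        # Skip trivial combinations and mirrored duplicates (leading -1).
        if len(nonzero) < 2 or nonzero[0] != 1:
            continue
        holds = all(
            abs(sum(c * columns[n][r] for c, n in zip(coeffs, names))) < threshold
            for r in range(n_rows)
        )
        if holds:
            return {n: c for n, c in zip(names, coeffs) if c != 0}
    return None


# First row uses the prices from the label assignment slide; the rest is invented.
table = {
    "oldPrice": [499.99, 24.00, 35.00],
    "newPrice": [299.95, 21.00, 29.00],
    "discount": [200.04, 3.00, 6.00],
}
dependency = find_dependency(table)
unrelated = find_dependency({"a": [1.0, 2.0], "b": [3.0, 5.0]})
```

A returned mapping such as oldPrice: 1, newPrice: -1, discount: -1 reads as oldPrice - newPrice - discount = 0, which is the dependency on the slide. The threshold absorbs rounding in the extracted values.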

Conclusion
[Figure: columns labeled oldPrice, description, newPrice, brand, discount]
Advantages
- Table Mining Heuristics are applied only once for each resource
- Manual post-processing of heuristics
- Qualitative information integration based on identified constraints
- Annotating HTML streams on-the-fly (OntoGather [PPSWR '06])

Outlook
So far
- Conversion of structured HTML pages to F-Logic facts
What's next
- Use Text Mining techniques to push the limit of structured content (e.g. rental listings)

Questions?
Thank you for your attention!