Learning Rules to Pre-process Web Data for Automatic Integration

Learning Rules to Pre-process Web Data for Automatic Integration. Kai Simon, Thomas Hornung, Georg Lausen. Workshop "Model Checking and Semantic Web Rules Languages", 28th October 2006. Databases and Information Systems Research Group.


  1. Learning Rules to Pre-process Web Data for Automatic Integration. Kai Simon, Thomas Hornung, Georg Lausen. Workshop "Model Checking and Semantic Web Rules Languages", 28th October 2006. Databases and Information Systems (DBIS) Research Group, Computer Science Department, Albert-Ludwigs-University Freiburg.

  2. Motivation: Most information available on the Web is only human-accessible, through presentation-oriented HTML pages. We still lack techniques that enable machines (agents) to extract and understand presentation-oriented HTML pages so that they can act on behalf of humans.

  3. Outline: Introduction (System Overview); Extraction and Alignment; Table Mining.

  4. System Overview (diagram). Labels: first access of an unknown source; DB; Materialized View; embedded by scripts.

  5. System Overview (diagram). Labels: first access of an unknown source; DB; Materialized View; embedded by scripts; extract & align records.

  6. System Overview (diagram). Labels: first access of an unknown source; DB; Materialized View; embedded by scripts; extract & align records; Table Mining (Column Splitting, Label Assignment, Arithmetic Dependencies).

  7. System Overview (diagram). Labels: first access of an unknown source; DB; Materialized View; embedded by scripts; extract & align records; Table Mining Rules; export result of Table Mining Heuristics.

  8. System Overview (diagram). Labels: access of a known source; DB; Materialized View; embedded by scripts; extract & align records; apply Table Mining Rules (Column Splitting, Label Assignment, Arithmetic Dependencies).

  9. Outline: Introduction; Extraction and Alignment with ViPER [CIKM'05] (Automatic Data Extraction, Tabular Alignment); Table Mining.

  10. Automatic Data Extraction: Scan the Web page for similar data records.

  11. Automatic Data Extraction: Scan the Web page for similar data records. Use visual information to • segment the data records • compute the relevance according to the location inside the Web page.

  12. Automatic Data Extraction: Scan the Web page for similar data records. Use visual information to • segment the data records • compute the relevance according to the location inside the Web page. Extract the similar data records with the highest relevance.

  13. Tabular Alignment: Data record alignment.

  14. Data Representation: F-Logic facts.
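
The slide does not preserve the facts themselves. As a minimal sketch of what serializing aligned records as F-Logic facts could look like: the object identifiers (`r1`, `r2`), the class name `record`, the column names and the helper `to_flogic_facts` are all hypothetical; only the general frame syntax `object:class[attribute -> value]` is assumed.

```python
def to_flogic_facts(rows, columns, class_name="record"):
    """Serialize aligned table rows as F-Logic frame facts, one fact per row.

    All identifiers (r1, r2, ..., class and attribute names) are illustrative.
    """
    facts = []
    for idx, row in enumerate(rows, start=1):
        # Escape embedded double quotes so each value stays one string literal.
        attrs = ", ".join(
            '{} -> "{}"'.format(col, str(val).replace('"', '\\"'))
            for col, val in zip(columns, row)
        )
        facts.append("r{}:{}[{}].".format(idx, class_name, attrs))
    return facts


# Two aligned records with hypothetical column names col1 and col2.
rows = [["List Price", "$499.99"], ["Our Price", "$299.95"]]
facts = to_flogic_facts(rows, ["col1", "col2"])
for fact in facts:
    print(fact)
```

Each row becomes one object of the class, and each aligned column becomes one attribute, so later rules can refer to columns by name.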

  15. Outline: Introduction; Extraction and Alignment; Table Mining with data-driven / statistical methods (Column Splitting, Label Assignment, Arithmetic Dependencies).

  16. Column Splitting (figure). Data items 1 to 3: "Save: £3.00 (13%)", "Save: £6.00 (21%)", "Save: £3.00 (9%)". First split: "Save:" | "£3.00", "£6.00" | "(13%)", "(21%)", "(9%)". Second split into typed tokens: "Save:" (text), "£" (currency), 3.00 and 6.00 (float), "(" (punctuation), 13, 21 and 9 (int), "%)" (text). Further figure labels: subset 1, subset 2.
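
The splitting shown on this slide can be sketched roughly as follows. The type names (text, currency, float, punctuation, int) come from the slide; the regular expressions, the function names and the rule "split only when every cell has the same type signature" are assumptions chosen to reproduce the slide's example, not the authors' heuristics.

```python
import re

# Patterns are tried left to right; they are assumptions tuned to the
# slide's example ("%)" stays a single text token, "(" is punctuation).
TOKEN_RE = re.compile(
    r"(?P<float>\d+\.\d+)"
    r"|(?P<int>\d+)"
    r"|(?P<currency>[£$€])"
    r"|(?P<punctuation>\()"
    r"|(?P<text>[^\s\d£$€(]+)"
)


def tokenize(cell):
    """Return a list of (type, token) pairs for one data item."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(cell)]


def split_column(cells):
    """Split a column into typed sub-columns if all cells share one signature."""
    tokenized = [tokenize(cell) for cell in cells]
    signatures = {tuple(kind for kind, _ in toks) for toks in tokenized}
    if len(signatures) != 1:
        return None  # heterogeneous cells: leave the column unsplit
    signature = list(signatures.pop())
    rows = [[tok for _, tok in toks] for toks in tokenized]
    # Transpose so that each inner list is one new sub-column.
    columns = [list(col) for col in zip(*rows)]
    return signature, columns


cells = ["Save: £3.00 (13%)", "Save: £6.00 (21%)", "Save: £3.00 (9%)"]
signature, columns = split_column(cells)
print(signature)
print(columns)
```

Requiring an identical type signature across all cells is a deliberately strict choice: it keeps the sketch short, at the cost of refusing to split columns with optional tokens.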

  17. Splitting Rules: … … … …

  18. Splitting Rules: … …

  19. Outline: Introduction; Extraction and Alignment; Table Mining with data-driven / statistical methods (Column Splitting, Label Assignment, Arithmetic Dependencies).

  20. Label Assignment (figure with three columns: visual representation, HTML source code, assignment strategy). Three example layouts for the labels "List Price" and "Our Price" with the values $499.99 and $299.95: (1) label before value ("List Price: $499.99", "Our Price: $299.95"), marked up with b, span and br, assigned to Col i, Col i+1, Col i+2; (2) value before label ("$499.99 List Price", "$299.95 Our Price"), marked up with span, b and br, assigned to Col i, Col i+1, Col i+2; (3) labels and values in table cells (tr, td, b), assigned to Col i, Col i+1, Col i+2, Col i+3.

  21. Inter Label Assignment: Inter label assignment.

  22. Inner Label Assignment: Inner label assignment.

  23. Column Label Assignment Rules: Inter label assignment; inner label assignment.
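
The slides name the two strategies without preserving their definitions. The sketch below shows one possible reading, stated here as an assumption: an inter label is a column whose cells all carry the same text and which therefore labels the neighbouring column, while an inner label is a prefix shared by every cell of a single column. The function names and the sample prices other than those on slide 20 are hypothetical.

```python
import os


def inter_label(columns):
    """Treat a constant column as the label of the column to its right.

    Returns a dict mapping the index of the labelled column to its label.
    """
    labels = {}
    for i in range(len(columns) - 1):
        values = set(columns[i])
        if len(values) == 1:
            labels[i + 1] = values.pop().rstrip(": ")
    return labels


def inner_label(column):
    """Split a shared leading prefix off every cell and use it as the label."""
    prefix = os.path.commonprefix(column)
    # Cut the prefix back to its last separator so values are not truncated.
    cut = max(prefix.rfind(":"), prefix.rfind(" ")) + 1
    if cut == 0:
        return None, list(column)  # no shared prefix ending in a separator
    label = prefix[:cut].rstrip(": ")
    return label, [cell[cut:] for cell in column]


# Inter label: column 0 is constant, so it labels column 1.
inter = inter_label([["Our Price:", "Our Price:"], ["$299.95", "$199.00"]])

# Inner label: the label is embedded in each cell of one column.
label, values = inner_label(["List Price: $499.99", "List Price: $459.00"])
print(inter, label, values)
```

Cutting the common prefix back to the last separator matters in the second example: the raw common prefix also swallows the leading "$4" of both prices.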

  24. Functional Methods and Updates

  25. Functional Methods and Updates

  26. Functional Methods and Updates

  27. Functional Methods and Updates: Solution

  28. Improvement of Heuristics

  29. Improvement of Heuristics

  30. Improvement of Heuristics

  31. Improvement of Heuristics: Rules can be modified, removed, or added.

  32. Outline: Introduction; Extraction and Alignment; Table Mining with data-driven / statistical methods (Column Splitting, Label Assignment, Arithmetic Dependencies).

  33. Arithmetic Dependencies: Find arithmetic dependencies between numeric columns by checking homogeneous systems of linear equations (two alternative systems; the equations are not preserved here) for non-trivial solutions.

  34. Arithmetic Dependencies: Find arithmetic dependencies between numeric columns by checking homogeneous systems of linear equations (two alternative systems; the equations are not preserved here) for non-trivial solutions.

  35. Arithmetic Dependencies: The system has discovered the arithmetic dependency newPrice = oldPrice - discount.

  36. Arithmetic Dependencies: The system has discovered the arithmetic dependency newPrice = oldPrice - discount. Check applied: |oldPrice - newPrice - discount| < threshold.

  37. Arithmetic Dependencies: The system has discovered the arithmetic dependency newPrice = oldPrice - discount. Check applied: |oldPrice - newPrice - discount| < threshold.
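
The threshold check on the last two slides can be illustrated with a small brute-force search. This is a simplified stand-in, not the linear-system test the slides refer to: it only tries coefficients in {-1, 0, 1} and accepts a combination when its residual stays below the threshold on every row. The function name, the threshold value and the sample rows are made up for illustration.

```python
from itertools import product


def find_dependencies(columns, threshold=0.005):
    """Find coefficient vectors c in {-1, 0, 1}^n with |sum(c_i * col_i)| < threshold.

    Simplified stand-in for solving the homogeneous linear system.
    """
    names = list(columns)
    rows = list(zip(*(columns[name] for name in names)))
    found = []
    for coeffs in product((-1, 0, 1), repeat=len(names)):
        nonzero = [c for c in coeffs if c != 0]
        # Skip the trivial and single-column cases, and keep only one of each
        # mirrored pair by requiring the first non-zero coefficient to be +1.
        if len(nonzero) < 2 or nonzero[0] != 1:
            continue
        if all(
            abs(sum(c * v for c, v in zip(coeffs, row))) < threshold
            for row in rows
        ):
            found.append(dict(zip(names, coeffs)))
    return found


# Made-up sample rows that satisfy oldPrice - newPrice - discount = 0.
table = {
    "oldPrice": [499.99, 28.00, 33.00],
    "newPrice": [299.95, 22.00, 30.00],
    "discount": [200.04, 6.00, 3.00],
}
deps = find_dependencies(table)
print(deps)
```

The threshold absorbs rounding in displayed prices and floating-point error; an exact comparison with zero would reject the first sample row.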

  38. Conclusion (figure with the columns oldPrice, description, newPrice, brand, discount). Advantages: Table Mining heuristics are applied only once for each resource; manual post-processing of heuristics; qualitative information integration based on identified constraints; annotating HTML streams on-the-fly (OntoGather [PPSWR '06]).

  39. Outlook. So far: conversion of structured HTML pages to F-Logic facts. What's next: use Text Mining techniques to push the limit of structured content (e.g. rental listings).

  40. Questions? Thank you for your attention!
