Extraction and Integration of Web Data by End-Users Sudhir Agarwal - PowerPoint PPT Presentation

Extraction and Integration of Web Data by End-Users. Sudhir Agarwal and Michael Genesereth, Stanford Logic Group, Stanford Computer Science Department, Stanford University.


  1. Extraction and Integration of Web Data by End-Users Sudhir Agarwal and Michael Genesereth Stanford Logic Group, Stanford Computer Science Department, Stanford University

  2. Introduction
     • End users often need to quickly find and analyze information from many web sites
     • Search engines:
       – (+) good at finding individual documents
       – (-) not good at fulfilling a complex information requirement
     • End users have new ideas as soon as some information is easily available → web aggregators only shift but don’t solve the problem
     • We propose an approach that empowers end users to
       – easily extract information from web pages,
       – clean, integrate and search extracted information by writing Datalog rules and queries

  3. Extraction of Data from the Web
     • While browsing normally, end users select the information they want to remember
     • Our extraction algorithm automatically creates a table from the user’s selection
       – Input: HTML DOM tree D (representing the selection)
       – D’ ← compress D by replacing parents of lone children by their respective children
       – D’’ ← remove all nodes of D’ that have no text content
       – T ← create a table from D’’ by interpreting nodes at level 1 as rows and those at level 2 as column values
       – Output: T
     [Figure: table T built from the k level-1 subtrees D’’1 … D’’k; row i holds |D’’i| values and T has max(|D’’i|), 1<=i<=k, columns]
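
The steps on this slide can be sketched in Python as follows. This is only an illustrative sketch, not the plugin's implementation: the `Node` class is a hypothetical stand-in for a DOM node, and two details the slide leaves open are assumptions here (a parent with text of its own is not collapsed, and a level-1 node without children becomes a one-cell row).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node: a tag, its own text, its children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def full_text(node: Node) -> str:
    """Concatenate the text of a node and all of its descendants."""
    parts = [node.text.strip()] + [full_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def compress(node: Node) -> Node:
    """D -> D': replace a parent of a lone child by that child.

    Assumption: a parent that carries text of its own is kept.
    """
    children = [compress(c) for c in node.children]
    if len(children) == 1 and not node.text.strip():
        return children[0]
    return Node(node.tag, node.text, children)


def prune(node: Node) -> Node:
    """D' -> D'': drop every subtree that has no text content."""
    kept = [prune(c) for c in node.children if full_text(c)]
    return Node(node.tag, node.text, kept)


def to_table(root: Node) -> List[List[str]]:
    """D'' -> T: level-1 nodes become rows, level-2 nodes become cell values."""
    rows = []
    for row in root.children:
        cells = [full_text(cell) for cell in row.children]
        # Assumption: a level-1 leaf becomes a one-cell row.
        rows.append(cells or [full_text(row)])
    width = max((len(r) for r in rows), default=0)
    # Pad short rows so that every row has max |D''i| columns.
    return [r + [""] * (width - len(r)) for r in rows]


def extract(selection: Node) -> List[List[str]]:
    """Run the three steps in the order given on the slide."""
    return to_table(prune(compress(selection)))


# Made-up selection: one row with a layout DIV and an empty cell, one short row.
selection = Node("table", children=[
    Node("tr", children=[
        Node("td", children=[Node("div", "Name A")]),
        Node("td", "Dept X"),
        Node("td", ""),
    ]),
    Node("tr", children=[Node("td", "Name B")]),
])
table = extract(selection)
```

In this example the layout `div` is collapsed into its cell, the empty cell is dropped, and the shorter second row is padded to the width of the longest row.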

  4. Extraction from HTML tables illustrated with the DBLP page of Jeffrey Ullman

  5. Extraction from HTML tables illustrated with the DBLP page of Jeffrey Ullman

  6. Extraction from HTML Tables that have DIV elements for layout (illustrated with the Stanford CS faculty web page)

  7. Extraction from HTML Tables that have DIV elements for layout (illustrated with the Stanford CS faculty web page)

  8. Extraction from arbitrary HTML elements (illustrated with the MIT CS faculty page)

  9. Extraction from arbitrary HTML elements (illustrated with the MIT CS faculty page)

  10. Extraction of text paragraphs (illustrated with the New York Times front page)

  11. Extraction of text paragraphs (illustrated with the New York Times front page)

  12. Extraction of non-adjacent elements that may even be on different web pages with the help of a clipboard (illustrated with the Amazon web page)

  13. Extraction of non-adjacent elements that may even be on different web pages with the help of a clipboard (illustrated with the Amazon web page)

  14. Rule-based Cleaning
      ct3(A,C,E,F):-t3(A,B) & distinct(B,"") & matches(B,"[^:]+","0,1",C) & matches(B,"[^:]+","1",D) & matches(D,"[^.]+","0,1",E) & matches(D,"[^.]+","1",F)
      [Figure: extracted table t3 and cleaned table ct3]
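
To make the effect of the cleaning rule concrete, here is a rough Python rendering. The slides do not define the arguments of `matches`, so the reading used here is an assumption: the third argument selects which regex match is bound to the fourth, with "0,1" taken as the first match and "1" as the second. The whitespace trimming and the sample rows are also additions for illustration only.

```python
import re
from typing import List, Optional, Tuple


def nth_match(text: str, pattern: str, index: int) -> Optional[str]:
    """Return the index-th non-overlapping match of pattern in text, or None.

    Assumed reading of matches(Text, Pattern, Index, Result) from the slide.
    Trimming surrounding whitespace is an addition, not part of the rule.
    """
    found = re.findall(pattern, text)
    return found[index].strip() if index < len(found) else None


def clean_t3(t3: List[Tuple[str, str]]) -> List[Tuple[str, str, str, str]]:
    """ct3(A,C,E,F) :- t3(A,B) & distinct(B,"") & matches(...) ..."""
    ct3 = []
    for a, b in t3:
        if b == "":                        # distinct(B, "")
            continue
        c = nth_match(b, r"[^:]+", 0)      # part before the first colon
        d = nth_match(b, r"[^:]+", 1)      # part after the first colon
        if c is None or d is None:
            continue                       # rule body fails, no output row
        e = nth_match(d, r"[^.]+", 0)      # part of D before the first period
        f = nth_match(d, r"[^.]+", 1)      # part of D after the first period
        if e is None or f is None:
            continue
        ct3.append((a, c, e, f))
    return ct3


# Made-up rows in the shape "authors: title. venue".
t3 = [
    ("2012", "A. Author: A made-up title. Some Venue"),
    ("2011", ""),
    ("2010", "no separator here"),
]
ct3 = clean_t3(t3)
```

Rows whose second column is empty, or which lack the colon or period separators, produce no row in `ct3`, mirroring a Datalog rule whose body cannot be satisfied.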

  15. Rule-based Integration
      • End users write simple Datalog rules to integrate (clean) tables in GAV fashion
          pubOf("Jeffrey Ullman",A,B,C,D):-ct3(A,B,C,D)
      • Multiple rules with the same head define a view as a union
          csPubs(A,B,C,D):-ct3(A,B,C,D)
          csPubs(A,B,C,D):-ct4(A,B,C,D)
      • Body can contain multiple tables and views. Assume table ct2 contains names of faculty members in column A
          faculty(A):-ct2(A,B,C,D,E,F)
          facPubs(B,C,D,E):-pubOf(A,B,C,D,E)&faculty(A)
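
For readers less familiar with Datalog, the views on this slide can be mimicked with Python sets of tuples. This is a sketch of the semantics only (one rule as a mapping, two rules with the same head as a union, a shared variable as a join), not of how the system evaluates rules; all table contents below are made up.

```python
def pub_of(ct3):
    # pubOf("Jeffrey Ullman",A,B,C,D) :- ct3(A,B,C,D)
    return {("Jeffrey Ullman", a, b, c, d) for (a, b, c, d) in ct3}


def cs_pubs(ct3, ct4):
    # Two rules with the same head csPubs define the union of their bodies.
    return set(ct3) | set(ct4)


def faculty(ct2):
    # faculty(A) :- ct2(A,B,C,D,E,F): project ct2 onto its first column.
    return {row[0] for row in ct2}


def fac_pubs(pub_of_rel, faculty_rel):
    # facPubs(B,C,D,E) :- pubOf(A,B,C,D,E) & faculty(A): join on A, then drop A.
    return {(b, c, d, e) for (a, b, c, d, e) in pub_of_rel if a in faculty_rel}


# Made-up sample tables with the arities used on the slide.
ct3 = {("2012", "Author X", "Title 1", "Venue 1")}
ct4 = {("2011", "Author Y", "Title 2", "Venue 2")}
ct2 = {("Jeffrey Ullman", "", "", "", "", ""), ("Someone Else", "", "", "", "", "")}

all_pubs = cs_pubs(ct3, ct4)
faculty_pubs = fac_pubs(pub_of(ct3), faculty(ct2))
```

Using sets reflects that a Datalog relation contains no duplicate tuples, so the union in `cs_pubs` needs no explicit deduplication.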

  16. Conclusion
      • We presented an end-user-driven approach to web data extraction, integration and search
      • The approach is implemented as a browser plugin (please refer to http://seamail.ksri.kit.edu/swb/ for details)
      • Our approach can suggest cleaning and integration rules that could be reused for tables of the same arity and origin (not part of this presentation, please refer to the paper)
      • As a next step, we plan to derive reusable extraction scripts from users’ browsing logs and extraction actions
      Thank you!

  17. Cleaning Extracted Data
      • Extracted data often needs to be cleaned before it can be integrated with other data
      • The simplest approach, allowing end users to freely edit the extracted tables, can quickly become very tedious if similar steps need to be performed repeatedly for multiple tables
      • We propose rule-based cleaning, since cleaning rules are reusable and thus save time

  18. [Figure: table T built from the k level-1 subtrees D’’1 … D’’k; row i holds |D’’i| values and T has max(|D’’i|), 1<=i<=k, columns]
