CSCI-548: Information Integration on the Web, Craig Knoblock


  1. CSCI-548: Information Integration on the Web. Craig Knoblock, University of Southern California, January 05

  4. Example Applications

  5. Integrating Country Information
     [Diagram: an agent combining World Governments, NATO Members, and the CIA World Factbook (1995, 1996, 1997)]

  6. Predicting Flight Delays
     [Diagram: an agent using Yahoo Weather Prediction, Historical Flight Data, and Historical Weather Data, with a Learned Flight Delay Predictor]

  7. Real Estate Notifications
     [Diagram: New Listing (3br 2bath, 200K) leading to Send Email Notification]

  8. TheaterLoc
     [Diagram labels: Entertainment Agent, Tiger Map Server, Hollywood.com Trailers, Etak Geocoder, Agent, Zagat, CuisineNet, Yahoo Movies]

  9. Travel Planning Assistant

  10. Geospatial Data Integration

  11. WorldInfo Assistant

  12. Course Overview

  13. XML
     - XML is widely used as an Internet data interchange language
     - XQuery: a language for manipulating XML documents
     - In this class I will cover the XQuery language

  14. Wrappers
     Example of extracted fields:
     - NAME: Casablanca Restaurant
     - STREET: 220 Lincoln Boulevard
     - CITY: Venice
     - PHONE: (310) 392-5751

  15. Wrappers
     - Turning online sources into structured information
     - Research Topics
       - Wrapper Learning
       - Automatic Wrapper Generation
       - Wrapper Maintenance
     - Tools
       - AgentBuilder
       - AgentRunner
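
As a rough illustration of what a wrapper does, the sketch below turns a page fragment into the NAME, STREET, CITY, and PHONE fields from the restaurant example. The HTML layout, the `RULE` pattern, and the `wrap` function are all assumptions made for this example; the slides do not show the actual page or how AgentBuilder represents its rules.

```python
import re

# Hypothetical page fragment; the real page layout is not given in the slides.
PAGE = """
<h2>Casablanca Restaurant</h2>
<p>220 Lincoln Boulevard, Venice</p>
<p>Phone: (310) 392-5751</p>
"""

# Hand-written extraction rule: one named group per output field.
RULE = re.compile(
    r"<h2>(?P<NAME>[^<]+)</h2>\s*"
    r"<p>(?P<STREET>[^,<]+),\s*(?P<CITY>[^<]+)</p>\s*"
    r"<p>Phone:\s*(?P<PHONE>\(\d{3}\)\s*\d{3}-\d{4})</p>"
)


def wrap(page):
    """Return a dict of fields, or None if the page does not match the rule."""
    match = RULE.search(page)
    if match is None:
        return None
    return {field: value.strip() for field, value in match.groupdict().items()}


record = wrap(PAGE)
print(record)
```

A rule like this stops matching as soon as the site changes its layout, which is one way to see why wrapper learning and wrapper maintenance appear as research topics.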

  16. Plan Execution
     [Diagram: a dataflow plan that takes an address and returns combined results with recent news. Wrappers: Vote-Smart (senators, house reps), OpenSecrets (names page, member page, funding page), Yahoo News. Operators: Select, Join. Intermediate labels: all officials, name, member URL, funding URL, graph URL. Sample values: address 4676 Admiralty Way, Marina del Rey, CA; officials George Bush, Dick Cheney, Barbara Boxer, Dianne Feinstein, Jane Harman, James Hahn; headlines "Anthrax investigation continues…" (Boxer), "Bay area politicians meet…" (Boxer, Feinstein), "Life in LA is just too sunny…" (Harman)]

  17. Plan Execution
     - Research Topics
       - Streaming dataflow execution systems
       - Optimizing execution systems
         - Adaptive execution strategies
         - Speculative execution
     - Tools
       - Theseus agent execution system
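
The idea of streaming dataflow can be sketched with Python generators, where each operator hands tuples to the next as soon as they are available. This is only an illustration under assumed names (`wrapper`, `select`, `join`) and made-up sample rows that reuse names from the plan diagram; it is not how Theseus is implemented.

```python
def wrapper(rows):
    """Stand-in for a source wrapper: yields one tuple at a time."""
    for row in rows:
        yield row


def select(stream, predicate):
    """Pass along only the tuples that satisfy the predicate."""
    return (row for row in stream if predicate(row))


def join(left, right, key):
    """Hash join: build a table from the right input, then stream the left."""
    table = {}
    for row in right:
        table.setdefault(row[key], []).append(row)
    for row in left:
        for other in table.get(row[key], []):
            yield {**other, **row}


# Made-up sample data (assumption) using names that appear on the slide.
officials = [
    {"name": "Barbara Boxer", "office": "senator"},
    {"name": "Dianne Feinstein", "office": "senator"},
    {"name": "Jane Harman", "office": "house rep"},
]
news = [
    {"name": "Barbara Boxer", "headline": "Anthrax investigation continues"},
    {"name": "Jane Harman", "headline": "Life in LA is just too sunny"},
]

senators = select(wrapper(officials), lambda r: r["office"] == "senator")
results = list(join(senators, wrapper(news), "name"))
print(results)
```

Note that this `join` still waits for its right input to finish before emitting anything, so it is only partly streaming.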

  18. Data Integration
     [Diagram labels: Mediator, Outlook Server, Timeline, Mediator Server, Yahoo, CDW Laptops; sources grouped as local sources & services and remote sources & services]

  19. Data Integration Systems
     - Information mediators
       - Used to automatically select and compose information across sources
     - Research Topics
       - Global-as-view vs. Local-as-view integration
       - Optimizing query plans
     - Tools
       - Prometheus information mediator

  20. Record Linkage
     - Sources: Zagat’s Restaurant Guide and a Department of Health restaurant source
     - Names that differ across the two sources: Art’s Deli / Art’s Delicatessen, California Pizza Kitchen / CPK, The Grill / Grill, The, Philippe The Original / Philippe’s The Original
     - Other names shown: Ca’ Brea, Campanile, Citrus, Patina, Spago, The Tillerman
     - How can the same objects be identified when they are stored in inconsistent text formats?

  21. Record Linkage
     - Align information across sources
     - Research Topics:
       - Matching individual attributes
       - Matching entire records
     - Tools
       - Apollo Record Linkage System
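
To make the matching problem concrete, here is a minimal sketch of attribute-level matching on the restaurant names from the example. The three rules used (ignore word order and punctuation, accept prefix abbreviations, accept acronyms) are assumptions chosen to cover those examples; they are not the method used by Apollo, and the function names are hypothetical.

```python
import re


def tokens(name):
    """Lowercase, drop apostrophes, split on anything that is not a letter or digit."""
    cleaned = name.lower().replace("'", "").replace("\u2019", "")
    return [t for t in re.split(r"[^a-z0-9]+", cleaned) if t]


def token_match(a, b):
    """Two tokens agree if one is a prefix of the other (deli / delicatessen)."""
    return a.startswith(b) or b.startswith(a)


def same_entity(x, y):
    """Guess whether two name strings refer to the same restaurant."""
    tx, ty = tokens(x), tokens(y)
    # Acronym rule: "CPK" vs "California Pizza Kitchen".
    for short, long_ in ((tx, ty), (ty, tx)):
        if len(short) == 1 and len(long_) > 1:
            if short[0] == "".join(t[0] for t in long_):
                return True
    # Sorting makes the comparison order-insensitive: "The Grill" vs "Grill, The".
    sx, sy = sorted(tx), sorted(ty)
    return len(sx) == len(sy) and all(token_match(a, b) for a, b in zip(sx, sy))


print(same_entity("Art's Deli", "Art's Delicatessen"))
print(same_entity("Spago", "Patina"))
```

Hand-written rules like these are deliberately permissive (any one-letter token is a prefix of many words), so a real system would need to weigh evidence across several attributes rather than trust the name alone.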

  22. Aligning Schemas and Ontologies
     - Mediated schema: price, agent-name, agent-phone, office-phone, description
     - Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
       - $250K, James Smith, (305) 729 0831, (305) 616 1822, Fantastic house
       - $320K, Mike Doan, (617) 253 1429, (617) 112 2315, Great location
     - Schema of homes.com: sold-at, contact-agent, extra-info
       - $350K, (206) 634 9435, Beautiful yard
       - $230K, (617) 335 4243, Close to Seattle
     - Example matching rules:
       - If “office” occurs in name => office-phone
       - If “fantastic” & “great” occur frequently in data instances => description

  23. Aligning Schemas and Ontologies
     - Given two different sources with different schemas, how do we automatically align the information?
     - Research Topics
       - Automatic schema alignment based on structure and naming
       - Automatic alignment based on the source contents
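
The two example rules from the real estate schemas, one based on the column name and one based on the column contents, can be written down directly. The sketch below is an assumption-laden illustration: the function name `match_column` and the 0.5 threshold standing in for "occur frequently" are invented here, and an automatic aligner would derive such rules rather than hard-code them.

```python
def match_column(name, values):
    """Map a source column to a mediated-schema attribute, or None if no rule fires."""
    # Name-based rule from the example: "office" in the column name => office-phone.
    if "office" in name.lower():
        return "office-phone"
    # Content-based rule: "fantastic" / "great" occurring frequently => description.
    # The 0.5 cutoff is an assumption; the slide only says "frequently".
    hits = sum(any(w in v.lower() for w in ("fantastic", "great")) for v in values)
    if values and hits / len(values) >= 0.5:
        return "description"
    return None


realestate = {
    "listed-price": ["$250K", "$320K"],
    "office": ["(305) 616 1822", "(617) 112 2315"],
    "comments": ["Fantastic house", "Great location"],
}
mapping = {col: match_column(col, vals) for col, vals in realestate.items()}
print(mapping)
```

The homes.com column extra-info ("Beautiful yard", "Close to Seattle") would not trigger either rule, which shows how brittle a fixed word list is.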

  24. Constraint Integration

  25. Constraint Integration Frameworks
     - Approach to tightly integrating closely related sources
     - Research:
       - Constraint propagation and constraint satisfaction techniques
     - Tools
       - Heracles constraint integration system

  26. Geospatial Data Integration
     [Diagram: identifying houses in a satellite image by constraint satisfaction. Panels: Street Vector Data (Corrected Tiger Line Files), Satellite Image (Terraserver), Geocoded Houses, Census Master Address File, Los Angeles County Assessor’s Site (Property Tax Records, data extracted from on-line site), Initial Hypothesis, Result After Constraint Satisfaction. Table columns shown: Address, Latitude, Longitude (e.g. 642 Penn St, 33.923413, -118.409809); Street Address, City, State, Zipcode (e.g. 642 Penn St, El Segundo, CA 90245); Address, # units, Area (sq ft), Lot size (e.g. 642 Penn St, 3, 1793, 135.72 * 53.33). Addresses covered: Penn St, Palm Ave, and Sierra St in El Segundo]

  27. Application Areas
     - Geospatial data integration
       - Includes satellite imagery, maps, vector data and many related online sources
     - Biological data integration
       - Huge number of sources on gene-related information
       - Many sources available as web services
     - In this course we will focus on the first application area

  28. And other topics
     - Semantic Web
     - Data mining from the Web
     - Information extraction

  29. Course Details

  30. Where to find me…
     - Research Associate Professor, Computer Science Department
       - PHE 416 (only for office hour after class)
     - Senior Project Leader, Information Sciences Institute, Marina del Rey
       - ISI 922 (office the rest of the time)

  31. TA, Grader & Office Hours
     - Professor: Craig Knoblock (Knoblock@isi.edu)
       - Office Hours:
         - Tuesday 5-6pm (PHE 416)
         - Thursday 3-4pm (ISI 922 or 310-448-8786)
     - TA: Martin Michalowski (martinm@isi.edu)
       - Office Hours: Monday 1-2:30pm (SAL 200c)
     - TA: Anshuman Chakravartty (achakrav@usc.edu)
       - Office Hours (all in SAL 200c):
         - Tue: 11-12:30pm, Wed: 1-2:30pm, Th: 10-11:30pm, Fri: 2-3:30pm
     - Grader: Junaid Chaudhry (chaudhry@isi.edu)

  32. Course Web Pages
     - Blackboard: totale.usc.edu
       - Your USC login works on this account
       - If you are registered for 548, you will have access
     - All readings, slides, homeworks, etc. will be posted on the site page
     - Please check for announcements and read the discussion board on a regular basis
       - All questions should be posted (not emailed!)
       - If you know the answer to a posted question, please answer it!
       - But please don’t post answers to homeworks!

  33. Prerequisites & Recommendations
     - Prerequisites
       - CS561 or CS573: Introduction to AI
       - CS585: Database Systems
     - Recommended Courses
       - CS571: Issues of Programming Language Design
       - CS573: Advanced AI
