1Table: A System for Managing Structured Web Data (Yang Zhang)


  1. 1Table: A System for Managing Structured Web Data
     Yang Zhang, with Alon Halevy, Mike Cafarella, Nodira Khoussainova, Eugene Wu and Daisy Zhe

  2. Structured Web Data
     • Web is more than just text
       – tables, tags, lists, etc
       – 50% pages have tables
       – 25% tables appear to be useful data tables (relational, entity, sets, etc.)
     • No existing tools to effectively query this data
       – RDBMSs don’t scale, process noisy data poorly
       – Search engines are structure-blind
     • 1Table fills the gap!
     [Chart labels: No tables; Other tables; Data tables]

  3. The 1Table Project [diagram; component labels: Table Search, Data Visualization, Synthetic Table Generation, Reference Reconciliation, Schema Reconciliation]

  4. The 1Table Project [diagram; component labels: Table Search, Data Visualization, Synthetic Table Generation, Reference Reconciliation, Schema Reconciliation]

  5. 1Table Project HOBO: TABLE SEARCH

  6. The Quest for Infrastructure
     • _: limited indexing options, inefficient structure
     • _: lots of hoops, unstructured
     • _: little bang for the buck, slow setup, inefficient structure
     • Wanted control over query model, ranking
     Hobo: “poor man’s text search”

  7. Challenges
     • Millions of tables (~100M in Core)
     • Noisy: many are not data tables (layout)
     • Query by: attributes? values? similar examples?
     • No structured metadata
     Hobo
     • Similar to traditional inverted index search
     • Schema-agnostic structured query model

  8. Hobo Query Processor [architecture diagram; labels: Master; Slave 0, Slave 1, …, Slave 499; Table Index; TID; Shard; shard numbers 00000 … 00216; GFS]
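
The diagram labels on this slide suggest a master that fans a query out to many slaves, each holding one shard of the table index keyed by table ID (TID). The slides do not show how this works in detail, so the following is only a minimal sketch of that scatter-gather pattern. The class names, the modulo placement rule, and the purely conjunctive lookup are all assumptions made for illustration.

```python
from collections import defaultdict


class Shard:
    """One slave's slice of the table index: term -> set of table IDs (TIDs)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, tid, terms):
        for term in terms:
            self.postings[term].add(tid)

    def search(self, terms):
        # Conjunctive lookup: TIDs in this shard that contain every query term.
        sets = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()


class Master:
    """Fans a query out to every shard and merges the partial results."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, tid, terms):
        # Hypothetical placement rule: TID modulo the number of shards.
        self.shards[tid % len(self.shards)].add(tid, terms)

    def search(self, terms):
        hits = set()
        for shard in self.shards:
            hits |= shard.search(terms)
        return sorted(hits)


master = Master(num_shards=4)
master.add(0, ["country", "code"])
master.add(5, ["country", "capital"])
master.add(216, ["country", "code"])
print(master.search(["country", "code"]))
```

In a real deployment the per-shard calls would run in parallel on separate machines; the loop here only shows the merge logic.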

  9. Processing Pipeline [diagram; stage labels: docjoins; extraction; raw tables; filtering; good tables; annotation servers; labeling, annotation, munging; Daffie; analyzed/cleaned tables; indexing; Hobo inverted index; querying; query processor]

  10. Recipe: Hobo Query Model
      • Start with Google.com-style conjunction of disjunctions
      • Add structural primitives: terms have attributes
      • Introduce binding of variables to terms
      • Impose binary relational constraints (½ cup)
      • Mix bindings and constraints in arbitrary boolean expressions
      • Serve and enjoy

  11. Query Model
      and: x → “united”, y → “states”
      where x.offset + 1 = y.offset
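
The offset constraint on this slide amounts to a phrase match: two variables bound to terms, with the second required to sit exactly one position after the first. A minimal sketch of checking it, assuming each term occurrence carries a token offset. The helper names `tokenize` and `adjacent_pairs` are hypothetical, not Hobo's API.

```python
def tokenize(text):
    """Return (term, offset) pairs; the offset is the token's position in the text."""
    return [(term, offset) for offset, term in enumerate(text.lower().split())]


def adjacent_pairs(tokens, first, second):
    """Bindings (x.offset, y.offset) that satisfy x.offset + 1 == y.offset."""
    first_offsets = {offset for term, offset in tokens if term == first}
    return [
        (offset - 1, offset)
        for term, offset in tokens
        if term == second and offset - 1 in first_offsets
    ]


tokens = tokenize("The United States and the United Kingdom")
print(adjacent_pairs(tokens, "united", "states"))
```

Only the first "united" qualifies, because the second one is followed by "kingdom".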

  12. Query Model
      and: x, z, y over “france”, “paris”, “germany”
      where x.row = y.row and x.col = z.col
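
Row and column constraints like the ones on this slide can be checked by enumerating candidate bindings and keeping those that satisfy every constraint. The slide's diagram does not survive extraction, so the binding used below (x = "france", y = "paris", z = "germany") is an assumption, as are the function names. This brute-force version is only a sketch of the idea.

```python
from itertools import product


def occurrences(table, term):
    """All (row, col) positions whose cell contains the term as a whole word."""
    return [
        (r, c)
        for r, row in enumerate(table)
        for c, cell in enumerate(row)
        if term in cell.lower().split()
    ]


def match(table):
    """Bindings (x, y, z) with x.row == y.row and x.col == z.col."""
    xs = occurrences(table, "france")   # assumed binding for x
    ys = occurrences(table, "paris")    # assumed binding for y
    zs = occurrences(table, "germany")  # assumed binding for z
    return [
        (x, y, z)
        for x, y, z in product(xs, ys, zs)
        if x[0] == y[0] and x[1] == z[1]
    ]


capitals = [
    ["Country", "Capital"],
    ["France", "Paris"],
    ["Germany", "Berlin"],
]
scrambled = [
    ["France", "Germany"],
    ["Berlin", "Paris"],
]
print(match(capitals))
print(match(scrambled))
```

The first table matches because France and Paris share a row while France and Germany share a column; the second table contains all three terms but violates both constraints.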

  13. Query Model
      • What attributes are currently available?
        – Physical: offset, col, row
        – Logical: source (header/body/context)
        – For ranking: size, pageRank, isDataTable, hasHeaders, …
        – Easy to add more!
      • Fast (poly-time) constraint verifier

  14. Query Languages

      High-level template-based query language, example:

          “united states”    us
          china | prc        cn
          *                  to

      After the parser and rewriter:

          ((("united states") (us))
           ((china | prc) (cn))
           ((_) (to)))

      Low-level constraint-based query language:

          and {
            a = and {
              a = term { united }
              b = term { states }
              where a.pos + 1 = b.pos
            }
            b = or {
              term { china }
              term { prc }
            }
            c = us
            d = cn
            e = to
            where
              a.col == b.col
              c.col == d.col
              c.col == e.col
              a.row == c.row
              b.row == d.row
          }
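
The slide shows a template grid being turned into variable bindings plus row and column constraints, but not how the parser and rewriter do it. The sketch below is one guess at that step. It names variables column by column and ties each cell to the first variable in its row and column, which happens to reproduce the constraint list on the slide. The function `rewrite` and its input format are hypothetical.

```python
from string import ascii_lowercase


def rewrite(template):
    """template: list of rows of cell patterns; "*" matches any cell (no variable)."""
    names = iter(ascii_lowercase)
    var_at = {}    # (row, col) -> variable name
    bindings = {}  # variable name -> cell pattern
    n_rows = len(template)
    n_cols = max(len(row) for row in template)

    # Name variables column by column, skipping wildcard cells.
    for c in range(n_cols):
        for r, row in enumerate(template):
            if c < len(row) and row[c] != "*":
                name = next(names)
                var_at[(r, c)] = name
                bindings[name] = row[c]

    constraints = []
    # Cells in the same template column must share a column.
    for c in range(n_cols):
        col_vars = [var_at[(r, c)] for r in range(n_rows) if (r, c) in var_at]
        constraints += [f"{col_vars[0]}.col == {v}.col" for v in col_vars[1:]]
    # Cells in the same template row must share a row.
    for r in range(n_rows):
        row_vars = [var_at[(r, c)] for c in range(n_cols) if (r, c) in var_at]
        constraints += [f"{row_vars[0]}.row == {v}.row" for v in row_vars[1:]]
    return bindings, constraints


template = [
    ["united states", "us"],
    ["china | prc", "cn"],
    ["*", "to"],
]
bindings, constraints = rewrite(template)
print(bindings)
print(constraints)
```

The sketch stops at cell-level variables; expanding "united states" into an adjacency sub-query and "china | prc" into a disjunction would be a further rewriting pass.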

  15. Demo!

  16. Areas for Future Work
      • Low-hanging performance fruit
        – O(n) constraint verification by ordering/hashing
        – Smarter concurrent iteration over inverted index
        – Query rewriting
        – More resources
      • Soft constraints: not required, but use for ranking
      • Frontend: richer data visualization
      • Ranking of results
      • Easy integration into Dataspaces
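
The first future-work item mentions verifying constraints in O(n) by ordering or hashing. For an equality constraint such as x.row == y.row, one common way to get there is to bucket one side by the attribute and probe with the other, instead of comparing every pair. This is a sketch of that general technique, not the planned Hobo implementation; `join_on_equal` and the dict-based occurrence format are assumptions.

```python
from collections import defaultdict


def join_on_equal(xs, ys, attr):
    """Pairs (x, y) with x[attr] == y[attr], via one hash pass over ys."""
    buckets = defaultdict(list)
    for y in ys:
        buckets[y[attr]].append(y)
    # Each x probes a single bucket: time linear in the inputs plus the output size.
    return [(x, y) for x in xs for y in buckets.get(x[attr], [])]


xs = [{"row": 1, "col": 0}, {"row": 3, "col": 0}]
ys = [{"row": 1, "col": 1}, {"row": 2, "col": 1}, {"row": 3, "col": 1}]
pairs = join_on_equal(xs, ys, "row")
print(pairs)
```

A nested-loop check over the same inputs yields the same pairs but inspects every (x, y) combination.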

  17. 1Table Project TABLE SUGGEST

  18. Synthetic Table Generation
      What country corresponds to code “tr”?
      [Figure: example country/code rows (united states, us), (china, cn), (turkey, tr), (japan, jp), …, with the country for “tr” as the value to be filled in]
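
One simple way to answer a question like the one on this slide is to collect every row across the retrieved tables whose code column equals the query and let the rows vote on the missing value. The slides do not say that TableSuggest works this way, so treat this as an illustrative sketch. The `suggest` function, the column positions, and the sample tables (including the deliberately wrong row) are made up.

```python
from collections import Counter


def suggest(tables, key, key_col=1, value_col=0):
    """Rank candidate values for `key` by how many rows across all tables support them."""
    votes = Counter()
    for table in tables:
        for row in table:
            if len(row) <= max(key_col, value_col):
                continue  # skip short or malformed rows
            if row[key_col].strip().lower() == key:
                votes[row[value_col].strip().lower()] += 1
    return votes.most_common()


tables = [
    [["united states", "us"], ["china", "cn"], ["turkey", "tr"]],
    [["Turkey", "TR"], ["japan", "jp"]],
    [["trinidad", "tr"]],  # made-up noisy source that disagrees
]
ranked = suggest(tables, "tr")
print(ranked)
```

Majority voting handles the inconsistent source here, but it would not resolve sources that are consistently wrong.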

  19. Challenges
      • Inconsistent/inaccurate information
      • Resolving data from multiple sources
      • Ad-hoc semantics
      • Data with nested (sub-cell) structure
        – .us (united states)
        – united states/us
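
The two sub-cell examples on this slide pack two values into one cell, once with parentheses and once with a slash. A minimal sketch of pulling them apart, covering only those two patterns; `split_cell` is a hypothetical helper, and real cells would need many more cases.

```python
import re


def split_cell(cell):
    """Split a cell such as '.us (united states)' or 'united states/us' into parts."""
    # Pattern 1: "value (other value)"
    m = re.fullmatch(r"\s*(.+?)\s*\((.+?)\)\s*", cell)
    if m:
        return [m.group(1), m.group(2)]
    # Pattern 2: "value/other value"
    if "/" in cell:
        return [part.strip() for part in cell.split("/") if part.strip()]
    # No recognized sub-structure: keep the cell whole.
    return [cell.strip()]


print(split_cell(".us (united states)"))
print(split_cell("united states/us"))
print(split_cell("japan"))
```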

  20. TableSuggest Features
      • Spreadsheet that suggests values to fill in
      • Can draw data from _ and Google Sets, but primarily 1Table (Hobo)
      • Hodgepodge of techniques (thrown in ad-hoc manner from inspecting results)
        – Type enumeration (_, Hobo)
        – Set expansion (Sets, Hobo)
        – Attribute resolution (Hobo)
        – Column clustering (1Table)
        – …

  21. Demo!

  22. Areas for Future Work
      • More principled evaluation
      • Implementation infelicities
      • Support for numeric queries using two-tier indexing structure with “range buckets”
      • Richer sub-structure extraction (lists)
      • Incremental indexing with live data feeds/sources
      • Tailoring to specific domains
      • Entity tables
      • Aggregating values in denormalized tables
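
The numeric-query item mentions a two-tier index with "range buckets" but gives no design. One plausible reading is that the first tier maps coarse value ranges to postings and the second tier filters by exact value inside the buckets a query touches. The sketch below follows that reading. The bucket width, class name, and method names are all assumptions.

```python
from collections import defaultdict

BUCKET_WIDTH = 100  # hypothetical bucket size


def bucket(value):
    """Coarse bucket ID for a numeric value (floor division also handles negatives)."""
    return int(value // BUCKET_WIDTH)


class NumericIndex:
    def __init__(self):
        # Tier 1: bucket ID -> list of (exact value, table ID) postings.
        self.buckets = defaultdict(list)

    def add(self, tid, value):
        self.buckets[bucket(value)].append((value, tid))

    def range_query(self, lo, hi):
        hits = set()
        # Visit only the buckets that overlap [lo, hi] ...
        for b in range(bucket(lo), bucket(hi) + 1):
            # ... then apply the exact filter inside each bucket (tier 2).
            for value, tid in self.buckets.get(b, []):
                if lo <= value <= hi:
                    hits.add(tid)
        return sorted(hits)


index = NumericIndex()
index.add(1, 50)
index.add(2, 150)
index.add(3, 250)
index.add(4, 199.5)
print(index.range_query(100, 200))
```

With fixed-width buckets a very wide range still visits many buckets; a production design might use variable-width or hierarchical buckets instead.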
