1Table: A System for Managing Structured Web Data
Yang Zhang
with: Alon Halevy, Mike Cafarella, Nodira Khoussainova, Eugene Wu and Daisy Zhe
Structured Web Data
• Web is more than just text
  – tables, tags, lists, etc.
  – 50% pages have tables
  – 25% tables appear to be useful data tables (relational, entity, sets, etc.)
• No existing tools to effectively query this data
  – RDBMSs don’t scale, process noisy data poorly
  – Search engines are structure-blind
• 1Table fills the gap!
[Pie chart: No tables / Other tables / Data tables]
The 1Table Project
[Diagram: Table Search, Data Visualization, Synthetic Table Generation, Reference Reconciliation, Schema Reconciliation]
1Table Project HOBO: TABLE SEARCH
The Quest for Infrastructure
• _: limited indexing options, inefficient structure
• _: lots of hoops, unstructured
• _: little bang for the buck, slow setup, inefficient structure
• Wanted control over query model, ranking
Hobo: “poor man’s text search”
Challenges
• Millions of tables (~100M in Core)
• Noisy: many are not data tables (layout)
• Query by: attributes? values? similar examples?
• No structured metadata
Hobo
• Similar to traditional inverted index search
• Schema-agnostic structured query model
Hobo
[Architecture diagram: Query Processor, Master, and Slaves 0 through 499; each slave is shown with Table Index and TID shards (shard labels such as 00000 and 00216), backed by GFS]
Processing Pipeline
[Pipeline diagram: docjoins → extraction → raw tables → filtering → good tables → annotation servers (labeling, annotation, munging) → analyzed/cleaned tables → indexing → Hobo inverted index → querying via the query processor; Daffie also appears in the diagram]
Recipe: Hobo Query Model
• Start with Google.com-style conjunction of disjunctions
• Add structural primitives: terms have attributes
• Introduce binding of variables to terms
• Impose binary relational constraints (½ cup)
• Mix bindings and constraints in arbitrary boolean expressions
• Serve and enjoy
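The first ingredient of the recipe, a conjunction of disjunctions, can be illustrated with a minimal sketch. This is not Hobo's implementation: the function name `matches` and the representation of a query as a list of term sets are assumptions made for illustration only.

```python
def matches(query, terms):
    """Evaluate a conjunction of disjunctions against a bag of terms.

    query: list of clauses; each clause is a set of alternative terms.
    terms: iterable of terms found in one table (hypothetical input).
    Returns True when every clause has at least one alternative present.
    """
    present = set(terms)
    return all(clause & present for clause in query)


# (china OR prc) AND (cn)
query = [{"china", "prc"}, {"cn"}]
hit = matches(query, ["china", "cn", "beijing"])
miss = matches(query, ["prc", "us"])
```

Attributes, variable bindings, and constraints would then be layered on top of this plain boolean core.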
Query Model
[Query tree: an "and" node binding variables x and y over “united states”]
where x.offset + 1 = y.offset
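The offset constraint above amounts to phrase matching over a positional index. The sketch below assumes that x is bound to "united" and y to "states", and that each term occurrence carries an integer `offset`; the helper names `build_index` and `adjacent_pairs` are hypothetical and not part of Hobo.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each term to the list of offsets where it occurs."""
    index = defaultdict(list)
    for offset, term in enumerate(tokens):
        index[term].append(offset)
    return index


def adjacent_pairs(index, first, second):
    """Return (x, y) offset pairs satisfying x.offset + 1 == y.offset."""
    second_offsets = set(index.get(second, []))
    return [(x, x + 1) for x in index.get(first, []) if x + 1 in second_offsets]


tokens = "the united states and the united kingdom border states".split()
index = build_index(tokens)
phrase_hits = adjacent_pairs(index, "united", "states")
```

Only the first "united" is followed directly by "states", so only that occurrence satisfies the constraint.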
Query Model
[Query tree: an "and" node binding variables x, y and z over the terms “france”, “paris” and “germany”]
where x.row = y.row and x.col = z.col
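A naive way to check the row and column constraints above is to enumerate candidate cells for each variable and keep the combinations that satisfy every constraint. The sketch assumes x = "france", y = "paris", z = "germany" (the slide does not spell out the pairing), represents a table as a list of rows, and uses hypothetical names throughout. It is a brute-force illustration, not the fast verifier the deck mentions.

```python
from itertools import product

ATTR = {"row": 0, "col": 1}  # position of each attribute in a (row, col) cell


def cell_positions(table):
    """Index a table: term -> list of (row, col) cells containing it."""
    positions = {}
    for r, row in enumerate(table):
        for c, value in enumerate(row):
            positions.setdefault(value.lower(), []).append((r, c))
    return positions


def match(table, terms, constraints):
    """Return bindings var -> (row, col) that satisfy all equality constraints.

    terms: dict var -> term; constraints: list of (var_a, attr, var_b).
    """
    positions = cell_positions(table)
    variables = list(terms)
    candidates = [positions.get(terms[v], []) for v in variables]
    results = []
    for combo in product(*candidates):
        binding = dict(zip(variables, combo))
        if all(binding[a][ATTR[attr]] == binding[b][ATTR[attr]]
               for a, attr, b in constraints):
            results.append(binding)
    return results


table = [
    ["country", "capital"],
    ["france", "paris"],
    ["germany", "berlin"],
]
terms = {"x": "france", "y": "paris", "z": "germany"}
constraints = [("x", "row", "y"), ("x", "col", "z")]
bindings = match(table, terms, constraints)
```

A table where "france" and "paris" sit in different rows yields no bindings, which is how the constraint separates relational tables from pages that merely mention all three words.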
Query Model
• What attributes are currently available?
  – Physical: offset, col, row
  – Logical: source (header/body/context)
  – For ranking: size, pageRank, isDataTable, hasHeaders, …
  – Easy to add more!
• Fast (poly-time) constraint verifier
Query Languages

High-level template-based query language example:

    “united states”    us
    china | prc        cn
    *                  to

After the parser and rewriter:

    ((("united states") (us))
     ((china | prc) (cn))
     ((_) (to)))

Low-level constraint-based query language:

    and {
      a = and {
        a = term { united }
        b = term { states }
        where a.pos + 1 = b.pos
      }
      b = or {
        term { china }
        term { prc }
      }
      c = us
      d = cn
      e = to
      where
        a.col == b.col
        c.col == d.col
        c.col == e.col
        a.row == c.row
        b.row == d.row
    }
Demo!
Areas for Future Work
• Low-hanging performance fruits
  – O(n) constraint verification by ordering/hashing
  – Smarter concurrent iteration over inverted index
  – Query rewriting
  – More resources
• Soft constraints: not required, but use for ranking
• Frontend: richer data visualization
• Ranking of results
• Easy integration into Dataspaces
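One reading of "constraint verification by ordering/hashing" is a hash join: bucket one side's occurrences by the constrained attribute, then probe with the other side. The sketch below is only that reading, not the deck's design. Occurrences are assumed to be `(row, col)` tuples and `hash_join` is a hypothetical helper name.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Pair occurrences from left and right that agree on key(occurrence).

    Runs in time linear in len(left) + len(right) plus the number of
    matching pairs, instead of comparing every left/right combination.
    """
    buckets = defaultdict(list)
    for occ in right:
        buckets[key(occ)].append(occ)
    return [(l, r) for l in left for r in buckets.get(key(l), [])]


def same_row(occ):
    """Key function for a constraint of the form a.row == b.row."""
    return occ[0]


left = [(0, 0), (1, 0), (3, 0)]
right = [(1, 1), (2, 1), (3, 1), (3, 2)]
pairs = hash_join(left, right, same_row)
```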
1Table Project TABLE SUGGEST
Synthetic Table Generation
What country corresponds to code “tr”?

Input table:
    united states | us
    china         | cn
    ?             | tr

Synthetic table:
    united states | us
    china         | cn
    turkey        | tr
    japan         | jp
    ...           | …
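One simple way to answer the question above is to vote across source tables that agree with the user's example rows. The sketch below makes several assumptions: source tables are lists of (name, code) pairs, the toy tables are invented for illustration, and `suggest` is a hypothetical function rather than the TableSuggest algorithm.

```python
from collections import Counter


def suggest(source_tables, examples, key):
    """Suggest the name paired with `key` by majority vote.

    Only tables containing at least one of the example pairs may vote,
    which filters out tables with unrelated semantics.
    """
    votes = Counter()
    for rows in source_tables:
        pairs = {(name.lower(), code.lower()) for name, code in rows}
        if not any(example in pairs for example in examples):
            continue
        for name, code in pairs:
            if code == key:
                votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy source tables, invented for illustration.
tables = [
    [("United States", "us"), ("Turkey", "tr"), ("Japan", "jp")],
    [("China", "cn"), ("Turkey", "tr")],
    [("Trinidad", "tr"), ("Tobago", "tb")],  # shares no example row, so it is ignored
]
examples = [("united states", "us"), ("china", "cn")]
answer = suggest(tables, examples, "tr")
```

Requiring agreement with an example row is a crude stand-in for resolving data from multiple sources; it does nothing about sub-cell structure such as "united states/us".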
Challenges
• Inconsistent/inaccurate information
• Resolving data from multiple sources
• Ad-hoc semantics
• Data with nested (sub-cell) structure
  – .us (united states)
  – united states/us
TableSuggest Features
• Spreadsheet that suggests values to fill in
• Can draw data from _ and Google Sets, but primarily 1Table (Hobo)
• Hodgepodge of techniques (thrown in ad hoc after inspecting results)
  – Type enumeration (_, Hobo)
  – Set expansion (Sets, Hobo)
  – Attribute resolution (Hobo)
  – Column clustering (1Table)
  – …
Demo!
Areas for Future Work
• More principled evaluation
• Implementation infelicities
• Support for numeric queries using two-tier indexing structure with “range buckets”
• Richer sub-structure extraction (lists)
• Incremental indexing with live data feeds/sources
• Tailoring to specific domains
• Entity tables
• Aggregating values in denormalized tables
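The deck does not describe the "range buckets" structure, so the following is only one plausible interpretation: the first tier maps a coarse bucket id to its postings, and the second tier filters the exact values inside the touched buckets. The class name, the fixed bucket width, and the method names are all assumptions.

```python
from collections import defaultdict


class RangeBucketIndex:
    """Two-tier numeric index: coarse buckets first, exact filtering second."""

    def __init__(self, width):
        self.width = width
        self.buckets = defaultdict(list)  # bucket id -> [(value, table_id)]

    def add(self, value, table_id):
        """Post a numeric cell value under its bucket id."""
        self.buckets[int(value // self.width)].append((value, table_id))

    def query(self, low, high):
        """Return ids of tables holding a value in the closed range [low, high]."""
        first, last = int(low // self.width), int(high // self.width)
        hits = set()
        for bucket_id in range(first, last + 1):
            for value, table_id in self.buckets.get(bucket_id, []):
                if low <= value <= high:
                    hits.add(table_id)
        return hits


index = RangeBucketIndex(width=10)
for value, table_id in [(3, "t1"), (12, "t2"), (19, "t3"), (25, "t4")]:
    index.add(value, table_id)
in_range = index.query(10, 20)
```

The bucket id could be stored as an ordinary term in an inverted index, so a range query becomes a disjunction over a handful of bucket terms followed by exact filtering.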