DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System
Yoan Chabot (Orange Labs, @yoan_chabot), Thomas Labbé (Orange Labs, @tau_labbe), Jixiong Liu (Orange Labs), Raphaël Troncy (EURECOM, @rtroncy)
Context & Goals
‒ Design a semantic engine able to query (semi-)structured data
‒ User need: "I want precise and relevant answers to my queries expressed in natural language, without having to know the model(s) of the target database(s)"
‒ We focus on tabular data: annotate the content and structure of tables for searching and recommending datasets
Tabular Data to Knowledge Graph Matching

Goals:
‒ CTA: Column-Type Annotation
‒ CEA: Cell-Entity Annotation
‒ CPA: Columns-Property Annotation

Workflow:
‒ 1st step: preprocessing to identify table characteristics (orientation, key column…)
‒ 2nd step: annotation workflows. Method 1: baseline lookups; Method 2: embedding approach

We focus on the CTA and CEA tasks. CPA processing: list the properties linking each pair of annotated entities, then apply majority voting (see the sketch below).
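The CPA step is only summarized above; below is a minimal sketch of the "properties of entity pairs + majority voting" idea (Python; annotate_column_pair and the properties_between helper are hypothetical names standing in for the actual KG query and logic used by DAGOBAH).

```python
from collections import Counter
from typing import Callable, Iterable, Optional

def annotate_column_pair(
    subject_entities: Iterable[str],
    object_entities: Iterable[str],
    properties_between: Callable[[str, str], Iterable[str]],
) -> Optional[str]:
    """CPA by majority voting: annotate the column pair with the property
    that most often links the row-wise (subject, object) entity pairs."""
    votes: Counter = Counter()
    for subj, obj in zip(subject_entities, object_entities):
        # properties_between is a hypothetical helper querying the KG
        # (e.g. Wikidata) for the properties whose triples connect subj to obj.
        votes.update(properties_between(subj, obj))
    return votes.most_common(1)[0][0] if votes else None
```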
Preprocessing (new homogeneity factor)

Pipeline, from a datatable corpus to pre-processed tables:
‒ Converter from the input formats (CSV, TSV, HTML, …) to the WTC table format
‒ Header detection, table orientation detection and key column detection: content-based algorithm (new homogeneity factor), compared against the DWTC algorithm [1]
‒ Primitive typing of cells: Object, Unit, Number, Date, Unknown

Homogeneity factor, computed for a row or column x from the primitive types t_i of its cells:
Hom(x) = (1 / len(x)) · Σ_{t_i ∈ x} (1 − |1 − 2 · count(t_i) / len(x)|)

[Figure: example lakes table (Lake, Area, Depth, Country) with per-cell primitive types (String, String_number, unknown); every row gets Hom. RH = 0.89 while every column gets Hom. CH = 0]

Decision rules:
‒ Mean(CH) < Mean(RH) → Horizontal orientation
‒ ∃ col where Hom(col[0:3]) ≠ 0 → Header = true

[1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
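A minimal sketch of the orientation rule, assuming primitive types are already detected per cell; the homogeneity function implements the reconstruction of the (partly garbled) slide formula given above, so its exact form is an assumption.

```python
from collections import Counter
from statistics import mean

def homogeneity(types: list[str]) -> float:
    """Reconstructed homogeneity factor (see formula above): 0 when all
    cells of x share one primitive type, larger when types are mixed."""
    counts = Counter(types)
    n = len(types)
    return sum(1 - abs(1 - 2 * counts[t] / n) for t in types) / n

def is_horizontal(typed_table: list[list[str]]) -> bool:
    """Slide rule: Mean(CH) < Mean(RH) -> horizontal orientation."""
    row_hom = [homogeneity(row) for row in typed_table]
    col_hom = [homogeneity(list(col)) for col in zip(*typed_table)]
    return mean(col_hom) < mean(row_hom)
```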
Baseline lookups

Workflow, from pre-processed tables to CTA/CEA outputs:
‒ Lookups from all table cells: 4 external sources (API lookups, incl. Wikidata CirrusSearch) + 1 internal Wikidata Elasticsearch server fed by an ingestion step
‒ Wikidata as pivot: candidates are returned as e.g. {title: "Q119936", label: "Windermere"}, {title: "Q390370", label: "Windermere"}
‒ DBpedia translation (URI & types) via SPARQL, yielding type metadata such as {"mainType": "populated place", "types": "settlement", "subTypes": ""}
‒ TF-IDF-like types scoring and type(s) selection → CTA output
‒ Entities disambiguation with the target type(s) → CEA output

Example input: a lakes table (Lake | Area): Windermere, 14,73 km²; Kielder Reservoir, 10,86 km²
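The slide does not detail the "TF-IDF-like types scoring"; the sketch below is one plausible reading (function and variable names are illustrative): a type is rewarded when it covers many cells of the column and discounted when it is carried by a large share of the candidate entities.

```python
import math
from collections import Counter

def score_column_types(cell_candidates: list[list[set[str]]]) -> dict[str, float]:
    """TF-IDF-like type scoring sketch. cell_candidates[i][j] is the set of
    KG types of the j-th candidate entity for cell i of the column."""
    n_cells = len(cell_candidates)
    coverage: Counter = Counter()   # "TF": how many cells the type covers
    frequency: Counter = Counter()  # "DF": how many candidates carry the type
    n_candidates = 0
    for candidates in cell_candidates:
        cell_types = set().union(*candidates) if candidates else set()
        coverage.update(cell_types)
        for types in candidates:
            frequency.update(types)
        n_candidates += len(candidates)
    return {t: (cov / n_cells) * math.log(1 + n_candidates / frequency[t])
            for t, cov in coverage.items()}
```

The CTA output would then be the top-scored type(s) for the column, and the CEA disambiguation keeps, for each cell, the candidate compatible with that target type.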
Embedding approach

‒ Embedding enrichment through the Wikidata ES server, using pre-trained OpenKE embeddings [2]; each candidate record carries its KG context, e.g. Id: ["Q223687"], label: ["Wes Anderson"], aliases: ["Wesley Wales Anderson"], types: ["Q5", "dbPedia.Person"], subTypes: ["dbPedia.Director", "Q2526255", "Q36180"]
‒ Regex + Levenshtein lookup to build the candidate space
‒ K-means clustering over the candidates space (hyperparameters derived from the lookup and the table)
‒ Scoring algorithm to extract the best cluster and deduce the target type (clusters scoring, then candidates' types scoring)
‒ Candidates disambiguation from the cluster, type and entity scores (candidates' entities scoring) → CTA and CEA outputs

Example input: a films table (Title | Director): Rushmore, Anderson; Fight Club, Fincher

[2] http://139.129.163.161/index/toolkits# (pre-trained Wikidata embeddings)
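A compressed sketch of the clustering and cluster-scoring steps, assuming scikit-learn's KMeans over the candidates' KG embedding vectors; scoring a cluster by the summed lookup (string-matching) scores of its members is an illustrative stand-in for DAGOBAH's actual scoring algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def best_cluster_members(vectors: np.ndarray, lookup_scores: np.ndarray,
                         n_clusters: int) -> np.ndarray:
    """Cluster candidate entities in the embedding space and return the
    indices of the members of the highest-scoring cluster, from which the
    target type is then deduced."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    # Illustrative cluster score: total lookup confidence of the members.
    scores = np.array([lookup_scores[labels == k].sum() for k in range(n_clusters)])
    return np.flatnonzero(labels == int(scores.argmax()))
```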
Embedding approach example

‒ Candidates scoring (CTA): S_c(Q941209)
‒ Clusters scoring: S_k(cluster#2)
‒ Entities scoring (CEA): S_e(i) = 0.25 · S_k(n) + 0.5 · R_T + 0.2 · S_c(i)
‒ Entities disambiguation: S_e(Paul Thomas Anderson), S_e(Wes Anderson) > S_e(Paul W. S. Anderson), with "Wes Anderson" selected for the cell "Anderson"
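Read literally, the entity score is a fixed weighted sum; the weights below come from the slide, while treating R_T as a normalized label-similarity ratio (e.g. a Levenshtein-based score between the cell text and the candidate label) is an assumption.

```python
def entity_score(s_k: float, r_t: float, s_c: float) -> float:
    """S_e(i) = 0.25*S_k + 0.5*R_T + 0.2*S_c(i), weights as read from the slide.
    s_k: cluster score, r_t: assumed label-similarity ratio, s_c: candidate (CTA) score."""
    return 0.25 * s_k + 0.5 * r_t + 0.2 * s_c

# The candidate with the highest S_e wins the CEA annotation,
# e.g. "Wes Anderson" for the cell "Anderson" in the Rushmore row.
```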
Results

Table 1: Preprocessing results
Task / Tool | DWTC | DAGOBAH
Orientation Detection | 0.9 | 0.957
Header Extraction | not evaluated | 1.0
Key Column Detection | 0.857 | 0.986

Table 2: Round 1 results (own evaluator < AIcrowd evaluator; AH: Average Hierarchical score, AP: Average Perfect score)
Approach | CTA F1 | CTA Precision | CTA AH | CTA AP | CEA F1 | CEA Precision
Baseline | 0.517 | 0.482 | NA | NA | 0.784 | 0.814
Baseline++ | 0.641 | 0.641 | 1.108 | 0.246 | 0.881 | 0.890
Embedding | 0.683 | 0.683 | 1.483 | 0.258 | 0.840 | 0.852

Approach | Pros | Cons
Baseline | High coverage (multiple sources); computational efficiency | Lookup-services dependency (reliability); blackbox (indexing, scoring…); queries volume
Embedding | Lookup strategy independence; relevant clustering even with few data; generalization (no tailored cleaning, fewer heuristics in lookups and scoring) | Computational performance; K optimization; embedding dependency
Discussion & Future Work

Performance bottlenecks (due to the challenge context):
‒ Light data cleaning … on purpose
‒ Basic lookup strategies … on purpose (e.g. no use of a dictionary)
‒ Missing Wikidata – DBpedia type mappings
‒ Subset embedding (restricted to the baseline candidates)

Future work:
‒ Test other Wikidata embedding methods (on the whole space)
‒ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
‒ Experiment with more clustering algorithms and parameters on different datasets
‒ Learn data table embeddings and find vectorial transformation(s) to the KG embedding space
‒ …
DAGOBAH: Datatable-powered Accurate-knowledge Graph for Outstanding and Beautiful Answers to Humans

Thanks!