DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System


  1. DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System
     Yoan Chabot (Orange Labs, @yoan_chabot), Thomas Labbé (Orange Labs, @tau_labbe), Jixiong Liu (Orange Labs), Raphaël Troncy (EURECOM, @rtroncy)
     Orange restricted

  2. Context & Goals
     ‒ Design a semantic engine able to query (semi-)structured data: "I want to have precise and relevant answers to my queries expressed in natural language, without having to know the target database(s) model(s)."
     ‒ We focus on tabular data: annotate the content and structure of tabular data for searching and recommending datasets.

  3. Tabular Data to Knowledge Graph Matching
     ‒ Goals: CTA (Column-Type Annotation), CEA (Cell-Entity Annotation), CPA (Columns-Property Annotation)
     ‒ 1st step: preprocessing to identify table characteristics (orientation, key column…)
     ‒ 2nd step: annotation workflows
       ‒ Method 1: Baseline lookups
       ‒ Method 2: Embedding approach
     ‒ We focus on the CTA and CEA tasks. CPA processing: list of properties associated with entity pairs, plus majority voting.
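
The CPA step is described only as "list of properties associated to entities pairs, plus majority voting". A minimal sketch of such a vote is given here, assuming that the properties found between the two entities of each row are already available as lists; the function name, the input shape, and the property labels are hypothetical and not taken from the slides.

```python
from collections import Counter

def annotate_column_pair(row_properties):
    """Pick the property linking two columns by majority vote.

    row_properties holds one list per table row, each listing the
    properties found between the two entities of that row
    (hypothetical input shape).
    """
    votes = Counter()
    for properties in row_properties:
        # A row votes at most once for a given property.
        votes.update(set(properties))
    if not votes:
        return None
    # The property supported by the largest number of rows wins.
    return votes.most_common(1)[0][0]

# Illustrative property labels, one list per row.
rows = [
    ["director", "producer"],
    ["director"],
    [],  # no property found between the entities of this row
]
best_property = annotate_column_pair(rows)
```

Rows without any property simply abstain, so a few failed lookups do not change the outcome.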

  4. Preprocessing (new homogeneity factor)
     Workflow: datatable corpus (CSV, TSV, HTML, …) → converter → table in WTC format → primitive typing, table orientation, header detection, key column detection → pre-processed tables.
     ‒ Primitive typing: Object, Unit, Number, Date, Unknown
     ‒ Algorithms named on the slide: content-based algorithm (homogeneity factor), DWTC algorithm [1]
     Homogeneity factor (as far as it can be recovered from the slide; the placement of the two exponents is uncertain):
     $$\mathrm{Hom}(x) = \left[ \frac{1}{\mathrm{len}(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - 2 \cdot \frac{\mathrm{count}(t_i)}{\mathrm{len}(x)} \right)^{2} \right) \right]^{2}$$
     Example (columns Lake, Area, Depth, Country): the rows Windermere, Kielder Reservoir, Ullswater, Bassenthwaite Lake and Derwent Water are each typed "String_number, String_number, String unknown" and each have a row homogeneity Hom. RH of 0.89; the three column homogeneities Hom. CH are 0, 0, 0.
     Decision rules:
     ‒ $\mathrm{Mean}(CH) < \mathrm{Mean}(RH) \rightarrow$ Horizontal
     ‒ $\exists\, col$ where $\mathrm{Hom}(col[0{:}3]) \neq 0 \rightarrow$ Header = true
     [1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
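
The orientation rule of this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the homogeneity factor is read as the average over the cells of `1 - (1 - 2 * count(t_i) / len(x)) ** 2`, with the average then squared, which is only one possible reading of the garbled formula. The "vertical" fallback and all function names are hypothetical; the slide only states that Mean(CH) < Mean(RH) implies a horizontal table.

```python
from collections import Counter

def hom(types):
    """Homogeneity factor of one row or column, given its cell types.

    Assumed reading of the slide formula; it returns 0 when all cells
    share the same primitive type.
    """
    n = len(types)
    if n == 0:
        return 0.0
    counts = Counter(types)
    inner = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in types) / n
    return inner ** 2

def mean(values):
    return sum(values) / len(values)

def orientation(type_grid):
    """Apply the rule Mean(CH) < Mean(RH) -> horizontal."""
    row_hom = [hom(list(row)) for row in type_grid]
    col_hom = [hom(list(col)) for col in zip(*type_grid)]
    # The "vertical" branch is an assumption, not stated on the slide.
    return "horizontal" if mean(col_hom) < mean(row_hom) else "vertical"

# Illustrative typed table: each column holds a single primitive type.
grid = [
    ["string", "number", "number", "string"],
    ["string", "number", "number", "string"],
    ["string", "number", "number", "string"],
]
result = orientation(grid)
```

In this toy grid every column is uniform, so each column scores 0 while the mixed rows score higher, and the table is classified as horizontal.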

  5. Baseline lookups
     Workflow: pre-processed tables → entities lookups → types scoring → type(s) selection (CTA output) → entities disambiguation (CEA output).
     ‒ Lookups from all table cells (4 external sources + 1 internal Wikidata ES)
     ‒ Wikidata as pivot metadata
     ‒ DBpedia translation (URI & types)
     ‒ TF-IDF-like types scoring
     ‒ Entities disambiguation with target type(s)
     Diagram labels: API, Ingestion, CirrusSearch, Server, SPARQL, DBpedia entity URI & types.
     Example: a table with columns Lake and Area (Windermere, 14,73 km²; Kielder Reservoir, 10,86 km²). The lookup returns {title: "Q119936", label: "Windermere"}, {title: "Q390370", label: "Windermere"}, …; the DBpedia translation returns {"mainType": "populated place", "types": "settlement", "subTypes": ""}.
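
The slide names a "TF-IDF-like types scoring" without giving its formula. The sketch below shows one plausible shape of such a score, purely as an assumption: a type is rewarded when many cells of the column have a candidate carrying it, and penalised when it is very common in the knowledge graph. The function name, the `kg_type_share` statistics, and the sample values are all hypothetical.

```python
import math
from collections import Counter

def score_types(cell_types, kg_type_share):
    """Score candidate column types in a TF-IDF-like way.

    cell_types: one set of candidate types per cell of the column.
    kg_type_share: hypothetical share (0..1] of knowledge-graph
    entities carrying each type; unknown types default to 1.0.
    """
    n_cells = len(cell_types)
    # Number of cells in which each type appears at least once.
    cells_per_type = Counter(t for types in cell_types for t in set(types))
    scores = {}
    for type_id, n in cells_per_type.items():
        tf = n / n_cells  # coverage of the column
        idf = math.log(1 + 1 / kg_type_share.get(type_id, 1.0))  # rarity
        scores[type_id] = tf * idf
    return scores

# Made-up values for illustration only.
column = [{"place", "lake"}, {"place", "lake"}, {"place", "settlement"}]
shares = {"place": 0.5, "lake": 0.01, "settlement": 0.05}
scores = score_types(column, shares)
target_type = max(scores, key=scores.get)
```

With these made-up numbers the generic type covers every cell but is outweighed by the rarer, more specific type, which is the behaviour a TF-IDF-like weighting is meant to produce.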

  6. Embedding approach
     Workflow: embedding enrichment → entities lookup → candidates clustering → clusters scoring → candidates' types scoring (CTA output) and candidates' entities scoring (CEA output).
     ‒ Embedding enrichment through Wikidata ES server (embedding: OpenKE [2])
     ‒ Regex + Levenshtein lookup
     ‒ K-means clustering over candidates space (lookup- and table-based hyperparameters)
     ‒ Scoring algorithm to extract best cluster and deduce target type
     ‒ Candidates disambiguation from clusters, types and entities scores
     Example table (Title, Director): (Rushmore, Anderson), (Fight Club, Fincher).
     Example of an enriched embedding entry (Q223687): Id: ["Q223687"], label: ["Wes Anderson"], aliases: ["Wesley Wales Anderson"], types: ["Q5", "dbPedia.Person"], subTypes: ["dbPedia.Director", "Q2526255", "Q36180"].
     [2] http://139.129.163.161/index/toolkits#pretrained-wikidata
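
The "Regex + Levenshtein lookup" step can be illustrated with a small self-contained sketch. It assumes an in-memory index mapping entity identifiers to labels and aliases; the index, the normalisation regex, the token-level matching, and the `max_ratio` threshold are all hypothetical choices, and the second identifier is a placeholder rather than a real Wikidata ID.

```python
import re

def normalize(label):
    """Lowercase and replace runs of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()

def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def lookup(cell, index, max_ratio=0.3):
    """Return (entity_id, distance ratio) pairs, closest first."""
    query = normalize(cell)
    hits = []
    for entity_id, labels in index.items():
        best = None
        for label in labels:
            target = normalize(label)
            # Also compare with single tokens: a cell may hold a surname only.
            for form in [target] + target.split():
                ratio = levenshtein(query, form) / max(len(query), len(form), 1)
                best = ratio if best is None else min(best, ratio)
        if best is not None and best <= max_ratio:
            hits.append((entity_id, best))
    return sorted(hits, key=lambda hit: hit[1])

index = {
    "Q223687": ["Wes Anderson", "Wesley Wales Anderson"],
    "PLACEHOLDER_ID": ["David Fincher"],  # placeholder identifier
}
candidates = lookup("Anderson", index)
```

Normalising the distance by the longer string keeps the threshold comparable across short and long labels.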

  7. Embedding approach example
     ‒ Candidates scoring (CTA): $S_c(\mathrm{Q941209})$
     ‒ Clusters scoring: $S_k(\mathrm{cluster\#2})$
     ‒ Entities scoring (CEA): $S_e(i) = 0.25 \cdot S_k(n) + 0.5 \cdot R_T + 0.2 \cdot S_c(i)$
     ‒ Entities disambiguation: $S_e(\text{Paul Thomas Anderson}),\ \mathbf{S_e(\text{Wes Anderson})} > S_e(\text{Paul W. S. Anderson})$
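
The entity score on this slide appears to be a weighted sum with weights 0.25, 0.5 and 0.2. A minimal sketch of that combination follows, assuming the three terms are the cluster score, a term written "R T" on the slide (not defined there, so treated here as an opaque input), and the candidate type score. The function names and all input scores are made up for illustration.

```python
def entity_score(cluster_score, r_t, type_score):
    """Weighted sum read from the slide: 0.25*S_k + 0.5*R_T + 0.2*S_c."""
    return 0.25 * cluster_score + 0.5 * r_t + 0.2 * type_score

def disambiguate(candidates):
    """Return the candidate label with the highest entity score."""
    return max(candidates, key=lambda name: entity_score(*candidates[name]))

# Made-up (cluster_score, r_t, type_score) triples; only the three
# candidate names come from the slide.
candidates = {
    "Wes Anderson": (0.9, 0.8, 0.7),
    "Paul Thomas Anderson": (0.9, 0.6, 0.7),
    "Paul W. S. Anderson": (0.4, 0.6, 0.7),
}
best = disambiguate(candidates)
```

The weights as extracted sum to 0.95 rather than 1, so the score of a candidate with all inputs equal to 1 is 0.95; the sketch keeps the weights as they appear.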

  8. Results
     Table 1: Preprocessing results

     | Task                 | DWTC          | DAGOBAH |
     |----------------------|---------------|---------|
     | Orientation Detection| 0.9           | 0.957   |
     | Header Extraction    | Not evaluated | 1.0     |
     | Key Column Detection | 0.857         | 0.986   |

     Table 2: Round 1 results (own evaluator < AI crowd evaluator)

     | Approach   | CTA F1 | CTA Precision | CTA AH | CTA AP | CEA F1 | CEA Precision |
     |------------|--------|---------------|--------|--------|--------|---------------|
     | Baseline   | 0.517  | 0.482         | NA     | NA     | 0.784  | 0.814         |
     | Baseline++ | 0.641  | 0.641         | 1.108  | 0.246  | 0.881  | 0.890         |
     | Embedding  | 0.683  | 0.683         | 1.483  | 0.258  | 0.840  | 0.852         |

     Pros and cons of each approach

     | Approach  | Pros | Cons |
     |-----------|------|------|
     | Baseline  | High coverage (multiple sources); computational efficiency | Lookup-services dependency (reliability); blackbox (indexing, scoring…); queries volume |
     | Embedding | Lookup strategy independence; relevant clustering even with few data; generalization (no tailored cleaning + fewer heuristics in lookups and scoring) | Computational performance; K optimization; embedding dependency |

  9. Discussion & Future Work
     ‒ Performance bottlenecks (due to the challenge context):
       ‒ Light data cleaning … on purpose
       ‒ Basic lookup strategies … on purpose (e.g. no use of dictionary)
       ‒ Missing Wikidata – DBpedia type mappings
       ‒ Subset embedding (restricted to baseline candidates)
     ‒ Future work:
       ‒ Test other Wikidata embedding methods (on the whole space)
       ‒ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
       ‒ Experiment with more clustering algorithms and parameters on different datasets
       ‒ Learn data table embeddings and find vectorial transformation(s) with the KG embedding space
       ‒ …

  10. DAGOBAH: Datatable-powered Accurate-knowledge Graph for Outstanding and Beautiful Answers to Humans
      Thanks!
