DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System
Yoan Chabot (Orange Labs, @yoan_chabot), Thomas Labbé (Orange Labs, @tau_labbe), Jixiong Liu (Orange Labs), Raphaël Troncy (EURECOM, @rtroncy)
Context & Goals
‒ Design a semantic engine able to query (semi-)structured data
‒ User need: "I want precise and relevant answers to my queries expressed in natural language, without having to know the model(s) of the target database(s)"
‒ We focus on tabular data: annotate the content and structure of tables for searching and recommending datasets
Tabular Data to Knowledge Graph Matching

Goals:
‒ CTA: Column-Type Annotation
‒ CEA: Cell-Entity Annotation
‒ CPA: Columns-Property Annotation

Workflow:
‒ 1st step: preprocessing to identify table characteristics (orientation, key column…)
‒ 2nd step: annotation workflows. Method 1: baseline lookups; Method 2: embedding approach

We focus on the CTA and CEA tasks. CPA processing: list the properties linking each pair of annotated entities, then apply majority voting (see the sketch below).
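The CPA step is only summarized above; below is a minimal sketch of the "properties of entity pairs + majority voting" idea (Python; annotate_column_pair and the properties_between helper are hypothetical names standing in for the actual KG query and logic used by DAGOBAH).

```python
from collections import Counter
from typing import Callable, Iterable, Optional

def annotate_column_pair(
    subject_entities: Iterable[str],
    object_entities: Iterable[str],
    properties_between: Callable[[str, str], Iterable[str]],
) -> Optional[str]:
    """CPA by majority voting: annotate the column pair with the property
    that most often links the row-wise (subject, object) entity pairs."""
    votes: Counter = Counter()
    for subj, obj in zip(subject_entities, object_entities):
        # properties_between is a hypothetical helper querying the KG
        # (e.g. Wikidata) for the properties whose triples connect subj to obj.
        votes.update(properties_between(subj, obj))
    return votes.most_common(1)[0][0] if votes else None
```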
Preprocessing (new homogeneity factor)

Pipeline, from a datatable corpus to pre-processed tables:
‒ Converter from the input formats (CSV, TSV, HTML, …) to the WTC table format
‒ Header detection, table orientation detection and key column detection: content-based algorithm (new homogeneity factor), compared against the DWTC algorithm [1]
‒ Primitive typing of cells: Object, Unit, Number, Date, Unknown

Homogeneity factor, computed for a row or column x from the primitive types t_i of its cells:
Hom(x) = (1 / len(x)) · Σ_{t_i ∈ x} (1 − |1 − 2 · count(t_i) / len(x)|)

[Figure: example lakes table (Lake, Area, Depth, Country) with per-cell primitive types (String, String_number, unknown); every row gets Hom. RH = 0.89 while every column gets Hom. CH = 0]

Decision rules:
‒ Mean(CH) < Mean(RH) → Horizontal orientation
‒ ∃ col where Hom(col[0:3]) ≠ 0 → Header = true

[1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
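A minimal sketch of the orientation rule, assuming primitive types are already detected per cell; the homogeneity function implements the reconstruction of the (partly garbled) slide formula given above, so its exact form is an assumption.

```python
from collections import Counter
from statistics import mean

def homogeneity(types: list[str]) -> float:
    """Reconstructed homogeneity factor (see formula above): 0 when all
    cells of x share one primitive type, larger when types are mixed."""
    counts = Counter(types)
    n = len(types)
    return sum(1 - abs(1 - 2 * counts[t] / n) for t in types) / n

def is_horizontal(typed_table: list[list[str]]) -> bool:
    """Slide rule: Mean(CH) < Mean(RH) -> horizontal orientation."""
    row_hom = [homogeneity(row) for row in typed_table]
    col_hom = [homogeneity(list(col)) for col in zip(*typed_table)]
    return mean(col_hom) < mean(row_hom)
```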
Baseline lookups

Workflow, from pre-processed tables to CTA/CEA outputs:
‒ Lookups from all table cells: 4 external sources (API lookups, incl. Wikidata CirrusSearch) + 1 internal Wikidata Elasticsearch server fed by an ingestion step
‒ Wikidata as pivot: candidates are returned as e.g. {title: "Q119936", label: "Windermere"}, {title: "Q390370", label: "Windermere"}
‒ DBpedia translation (URI & types) via SPARQL, yielding type metadata such as {"mainType": "populated place", "types": "settlement", "subTypes": ""}
‒ TF-IDF-like types scoring and type(s) selection → CTA output
‒ Entities disambiguation with the target type(s) → CEA output

Example input: a lakes table (Lake | Area): Windermere, 14,73 km²; Kielder Reservoir, 10,86 km²
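The slide does not detail the "TF-IDF-like types scoring"; the sketch below is one plausible reading (function and variable names are illustrative): a type is rewarded when it covers many cells of the column and discounted when it is carried by a large share of the candidate entities.

```python
import math
from collections import Counter

def score_column_types(cell_candidates: list[list[set[str]]]) -> dict[str, float]:
    """TF-IDF-like type scoring sketch. cell_candidates[i][j] is the set of
    KG types of the j-th candidate entity for cell i of the column."""
    n_cells = len(cell_candidates)
    coverage: Counter = Counter()   # "TF": how many cells the type covers
    frequency: Counter = Counter()  # "DF": how many candidates carry the type
    n_candidates = 0
    for candidates in cell_candidates:
        cell_types = set().union(*candidates) if candidates else set()
        coverage.update(cell_types)
        for types in candidates:
            frequency.update(types)
        n_candidates += len(candidates)
    return {t: (cov / n_cells) * math.log(1 + n_candidates / frequency[t])
            for t, cov in coverage.items()}
```

The CTA output would then be the top-scored type(s) for the column, and the CEA disambiguation keeps, for each cell, the candidate compatible with that target type.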
Embedding approach

‒ Embedding enrichment through the Wikidata ES server, using pre-trained OpenKE embeddings [2]; each candidate record carries its KG context, e.g. Id: ["Q223687"], label: ["Wes Anderson"], aliases: ["Wesley Wales Anderson"], types: ["Q5", "dbPedia.Person"], subTypes: ["dbPedia.Director", "Q2526255", "Q36180"]
‒ Regex + Levenshtein lookup to build the candidate space
‒ K-means clustering over the candidates space (hyperparameters derived from the lookup and the table)
‒ Scoring algorithm to extract the best cluster and deduce the target type (clusters scoring, then candidates' types scoring)
‒ Candidates disambiguation from the cluster, type and entity scores (candidates' entities scoring) → CTA and CEA outputs

Example input: a films table (Title | Director): Rushmore, Anderson; Fight Club, Fincher

[2] http://139.129.163.161/index/toolkits# (pre-trained Wikidata embeddings)
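A compressed sketch of the clustering and cluster-scoring steps, assuming scikit-learn's KMeans over the candidates' KG embedding vectors; scoring a cluster by the summed lookup (string-matching) scores of its members is an illustrative stand-in for DAGOBAH's actual scoring algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def best_cluster_members(vectors: np.ndarray, lookup_scores: np.ndarray,
                         n_clusters: int) -> np.ndarray:
    """Cluster candidate entities in the embedding space and return the
    indices of the members of the highest-scoring cluster, from which the
    target type is then deduced."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    # Illustrative cluster score: total lookup confidence of the members.
    scores = np.array([lookup_scores[labels == k].sum() for k in range(n_clusters)])
    return np.flatnonzero(labels == int(scores.argmax()))
```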
Embedding approach example

‒ Candidates scoring (CTA): S_c(Q941209)
‒ Clusters scoring: S_k(cluster#2)
‒ Entities scoring (CEA): S_e(i) = 0.25 · S_k(n) + 0.5 · R_T + 0.2 · S_c(i)
‒ Entities disambiguation: S_e(Paul Thomas Anderson), S_e(Wes Anderson) > S_e(Paul W. S. Anderson), with "Wes Anderson" selected for the cell "Anderson"
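Read literally, the entity score is a fixed weighted sum; the weights below come from the slide, while treating R_T as a normalized label-similarity ratio (e.g. a Levenshtein-based score between the cell text and the candidate label) is an assumption.

```python
def entity_score(s_k: float, r_t: float, s_c: float) -> float:
    """S_e(i) = 0.25*S_k + 0.5*R_T + 0.2*S_c(i), weights as read from the slide.
    s_k: cluster score, r_t: assumed label-similarity ratio, s_c: candidate (CTA) score."""
    return 0.25 * s_k + 0.5 * r_t + 0.2 * s_c

# The candidate with the highest S_e wins the CEA annotation,
# e.g. "Wes Anderson" for the cell "Anderson" in the Rushmore row.
```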
Results

Table 1: Preprocessing results
Task / Tool | DWTC | DAGOBAH
Orientation Detection | 0.9 | 0.957
Header Extraction | not evaluated | 1.0
Key Column Detection | 0.857 | 0.986

Table 2: Round 1 results (own evaluator < AIcrowd evaluator; AH: Average Hierarchical score, AP: Average Perfect score)
Approach | CTA F1 | CTA Precision | CTA AH | CTA AP | CEA F1 | CEA Precision
Baseline | 0.517 | 0.482 | NA | NA | 0.784 | 0.814
Baseline++ | 0.641 | 0.641 | 1.108 | 0.246 | 0.881 | 0.890
Embedding | 0.683 | 0.683 | 1.483 | 0.258 | 0.840 | 0.852

Approach | Pros | Cons
Baseline | High coverage (multiple sources); computational efficiency | Lookup-services dependency (reliability); blackbox (indexing, scoring…); queries volume
Embedding | Lookup strategy independence; relevant clustering even with few data; generalization (no tailored cleaning, fewer heuristics in lookups and scoring) | Computational performance; K optimization; embedding dependency
Discussion & Future Work

Performance bottlenecks (due to the challenge context):
‒ Light data cleaning … on purpose
‒ Basic lookup strategies … on purpose (e.g. no use of a dictionary)
‒ Missing Wikidata – DBpedia type mappings
‒ Subset embedding (restricted to the baseline candidates)

Future work:
‒ Test other Wikidata embedding methods (on the whole space)
‒ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
‒ Experiment with more clustering algorithms and parameters on different datasets
‒ Learn data table embeddings and find vectorial transformation(s) to the KG embedding space
‒ …
DAGOBAH: Datatable-powered Accurate-knowledge Graph for Outstanding and Beautiful Answers to Humans

Thanks!