FacetE: Exploiting Web Tables for Domain-Specific Word Embedding Evaluation

FacetE: Exploiting Web Tables for Domain-Specific Word Embedding Evaluation
Michael Günther, Paul Sikorski, Maik Thiele, and Wolfgang Lehner
DBTest '20 Workshop at SIGMOD 2020, 19.06.2020

NLP Systems Workflow
[Diagram: data storage with textual data → extracted text data (words) → language model → numerical representation (vectors, e.g. 5.02, 43.07, …)]

NLP Systems Workflow
State-of-the-art language models: word embeddings
▪ Training on a dummy task: a deep neural network is trained on large text corpora in natural language
▪ The weights are extracted as a pre-trained language model
[Diagram: data storage with textual data (relational database with text data, extracted relational information) → extracted text data (words) → language model → numerical representation (vectors, e.g. 5.02, 43.07, …)]

NLP Systems Workflow
[Diagram: data storage with textual data → extracted text data (words) → language model → numerical representation (vectors, e.g. 5.02, 43.07, …) → similarity search tasks, classification and regression tasks]

Word Embedding for Systems
ML Systems
▪ Utilize implicitly encoded knowledge from large text corpora
▪ Capture semantic similarities of text values
Database Systems
▪ Semantic text similarity queries
▪ Data exploration
▪ Data integration
Information Retrieval Systems
▪ Semantic search
▪ Query expansion
▪ Multi-lingual search
Choice of the word embedding model is crucial for the performance!

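Evaluation of Word Embedding Models
Word Similarity
▪ Similar words by cosine similarity of word vectors: $\mathrm{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \cdot \lVert \mathbf{y} \rVert}$
▪ Example: most similar to "king"? king → prince, man, and queen
Analogy Queries
▪ Retrieve similar relations: $\mathbf{a} - \mathbf{b} \approx \mathbf{c} - \,?$
▪ 3CosAdd: $\arg\max_{\mathbf{d} \in V \setminus \{\mathbf{a}, \mathbf{b}, \mathbf{c}\}} \mathrm{sim}_{\cos}(\mathbf{d}, \mathbf{c} - \mathbf{a} + \mathbf{b})$
▪ Example: man - woman ≈ king - ? → queen
[Figure: Schematic Representation of Word Vectors, showing the word groups woman, man, queen, king, prince; dry, drier, driest; wet, wetter, wettest; London, England, Berlin, Germany]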
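
The two measures on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors in `toy` are hypothetical and hand-picked, not taken from a trained embedding model, and the function names `cos_sim` and `three_cos_add` are chosen here for readability.

```python
import math

def cos_sim(x, y):
    """Cosine similarity: dot(x, y) / (|x| * |y|); 0.0 if either vector is all zeros."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0

def three_cos_add(vectors, a, b, c):
    """Answer 'a - b ≈ c - ?' by maximizing sim_cos(d, c - a + b) over d not in {a, b, c}."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos_sim(vectors[w], target))

# Hypothetical toy vectors (dimensions: person, royal, female); not from a real model.
toy = {
    "man":    [1.0, 0.0, 0.0],
    "woman":  [1.0, 0.0, 1.0],
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 1.0, 1.0],
    "prince": [1.0, 0.8, 0.0],
}

# man - woman ≈ king - ?
answer = three_cos_add(toy, "man", "woman", "king")
print(answer)
```

Excluding the three query words from the candidate set matters: without it, the nearest neighbour of c - a + b is often c itself.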

Evaluation of Word Embedding Models
Common Similarity Datasets
▪ WS-353: 353 word pairs of general domain knowledge quantifying semantic relatedness
▪ SimLex-999: 999 word pairs of general domain knowledge quantifying semantic similarity
Depend on human notion of similarity → Require human labeling effort
Common Analogy Query Datasets
▪ Google Analogy: 550 semantic and syntactic relations, mostly city-country relations
▪ MSR: 8,000 analogies of 800 syntactic relations
Facts of general domain knowledge → Automatic extraction possible
Similarity Eval*
Embedding Model   WS353   RW     …
…                 …       …      …
CBOW              57.2    32.5   …
SkipGram          62.8    37.2   …
…                 …       …      …
Analogy Eval*
Embedding Model   Semantic   Syntactic   Total
CBOW              57.3       68.9        63.7
SkipGram          66.1       65.1        65.6
…                 …          …           …
* Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
Limitations:
▪ Only small datasets
▪ Return a single value only
▪ Only general domain

Evaluation of Word Embedding Models
Common Similarity Datasets
▪ WS-353: 353 word pairs of general domain knowledge quantifying semantic relatedness
▪ SimLex-999: 999 word pairs of general domain knowledge quantifying semantic similarity
Depend on human notion of similarity → Require human labeling effort
Common Analogy Query Datasets
▪ Google Analogy: 550 semantic and syntactic relations, mostly city-country relations
▪ MSR: 8,000 analogies of 800 syntactic relations
Facts of general domain knowledge → Automatic extraction possible
Limitations:
▪ Only small datasets
▪ Return a single value only
▪ Only general domain
Design Goals:
▪ Large number of relations
▪ Flexible structure
▪ Multiple categories
Design Strategies:
▪ Extraction from millions of web tables
▪ Organization in facets
▪ Definition of categories

Dataset Design
Data Source: Web Tables
▪ Large amount of knowledge
▪ General enough to be expected in pre-trained word embedding models
▪ Redundancy allows excluding temporary facts (e.g. time-dependent facts such as home soccer team to visiting team)
Target Design: Facets
▪ Each facet $F: O \to V$ assigns objects (e.g. Soccer Player) to values (e.g. Teams)
▪ Allows flexible construction of application-specific evaluation datasets
▪ More flexible than hierarchical categorization
[Figure: the Web Tables Corpus is reduced to Header Pairs and Values (e.g. Airport and IATA, Airport and Country, Team and Country, Team and Rank). The FacetE Storage Format is a Collection of Facets, e.g. Soccer Player → Team (AC Milan, Arsenal, …), Soccer Player → Country (England, Brazil, …), Soccer Player → Position (Keeper, Forward, …) under Sports, and Airport → City (London, New York, …), Airport → IATA (LGW, LHR, …), Airport → Country (England, Brazil, …) under Economy.]
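
One way to picture a facet is as a plain mapping from objects to values, from which analogy queries can be generated. The sketch below is an assumption about how such queries might be built, not the authors' storage format: the dictionary layout and the helper `analogy_queries` are hypothetical, and the Team/Country entries are the example values that appear elsewhere in the slides.

```python
from itertools import permutations

# A facet maps objects to values; here facets are keyed by their (object, value) header pair.
facets = {
    ("Team", "Country"): {"Arsenal": "England", "AC Milan": "Italy", "Juventus": "Italy"},
}

def analogy_queries(facet):
    """Build (o1, v1, o2, v2) tuples meaning 'o1 is to v1 as o2 is to ?', expected answer v2.

    Pairs sharing the same value are skipped, because analogy methods such as
    3CosAdd exclude the query words (including v1) from the candidate answers.
    """
    queries = []
    for (o1, v1), (o2, v2) in permutations(facet.items(), 2):
        if v1 != v2:
            queries.append((o1, v1, o2, v2))
    return queries

queries = analogy_queries(facets[("Team", "Country")])
for q in queries:
    print(q)
```

Because each facet is self-contained, an application-specific evaluation set can be assembled by selecting only the facets of interest.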

Extraction Pipeline
Input: 125M Web Tables
1. Pre-Filtering: Frequency and Regex Filter, Facet Creation
2. Soft Functional Dependencies: Check contradiction of most frequent relation
3. Post-Filtering: Filter by Pooling, Blacklist, …
4. Categorization: Assign facets to 8 broader categories
Output: 250 Facets / 600K Values, used for the analogy evaluation of word embeddings

Extraction Pipeline: 1) Pre-Filtering
▪ Filters infrequent and non-textual data of English tables
[Figure: from example tables with the columns (Country, Date, Team) and (Country, Team, Nickname), infrequent columns and non-textual data are removed. The remaining column tuples (Country, Team) and (Team, Country) are the basis for facets.]
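
The slide does not give the actual regular expression or frequency threshold, so the following is only a rough sketch of the idea. The pattern `TEXTUAL`, the threshold `min_count`, and the two example tables are all hypothetical.

```python
import re
from collections import Counter
from itertools import permutations

# Assumed pattern for "textual" cells: letters, spaces, dots, apostrophes, hyphens.
TEXTUAL = re.compile(r"[A-Za-z][A-Za-z .'\-]*")

def is_textual_column(values):
    """A column counts as textual if every cell fully matches the assumed pattern."""
    return all(TEXTUAL.fullmatch(v) for v in values)

def frequent_header_pairs(tables, min_count=2):
    """Count ordered header pairs over textual columns and keep the frequent ones."""
    counts = Counter()
    for table in tables:
        headers = [h for h, column in table.items() if is_textual_column(column)]
        counts.update(permutations(headers, 2))
    return {pair for pair, n in counts.items() if n >= min_count}

# Hypothetical tables modelled on the slide's example columns.
tables = [
    {"Country": ["England", "Italy"], "Date": ["19.06.2020", "20.06.2020"],
     "Team": ["Arsenal", "AC Milan"]},
    {"Country": ["Italy"], "Team": ["Juventus"], "Nickname": ["Old Lady"]},
]

pairs = frequent_header_pairs(tables)
print(sorted(pairs))
```

In this toy input the Date column fails the regex filter and the Nickname pairs fail the frequency filter, which leaves the (Country, Team) and (Team, Country) column tuples.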

Extraction Pipeline: 2) Soft Functional Dependencies
▪ Determine static facts
1) Determine most frequent relation pairs
2) Check on contradictions
$\mathrm{SFD}(o, v) = \frac{\mathrm{count}(o, v)}{\sum_{v' : (o, v')} \mathrm{count}(o, v')}$
[Figure: three example (Team, Country) tables.
Table 1: Arsenal, England; AC Milan, Italy; Juventus, Italy
Table 2: Arsenal, United Kingdom; AC Milan, Italy
Table 3: AC Milan, Italy; Juventus, Italy; Arsenal, England
The most frequent value for "Arsenal" is England, with one contradiction, so $\mathrm{SFD}(\text{Arsenal}, \text{England}) = \frac{2}{3}$.]
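
The SFD score can be reproduced on the slide's own example with a short sketch. The observations below are the (Team, Country) rows from the three example tables. The acceptance threshold in `static_facts` is a hypothetical parameter, since the slide does not state which cut-off is used.

```python
from collections import Counter
from fractions import Fraction

def sfd_scores(pairs):
    """SFD(o, v) = count(o, v) / sum over v' of count(o, v')."""
    counts = Counter(pairs)
    totals = Counter()
    for (o, _), n in counts.items():
        totals[o] += n
    return {(o, v): Fraction(n, totals[o]) for (o, v), n in counts.items()}

def static_facts(pairs, threshold):
    """Keep, per object, its most frequent value if the SFD score reaches the threshold."""
    best = {}
    for (o, v), score in sfd_scores(pairs).items():
        if score >= threshold and score > best.get(o, (None, 0))[1]:
            best[o] = (v, score)
    return {o: v for o, (v, _) in best.items()}

# (Team, Country) rows from the three example tables on the slide.
observations = [
    ("Arsenal", "England"), ("AC Milan", "Italy"), ("Juventus", "Italy"),
    ("Arsenal", "United Kingdom"), ("AC Milan", "Italy"),
    ("AC Milan", "Italy"), ("Juventus", "Italy"), ("Arsenal", "England"),
]

scores = sfd_scores(observations)
print(scores[("Arsenal", "England")])  # 2/3, as on the slide
print(static_facts(observations, threshold=Fraction(1, 2)))
```

Using `Fraction` keeps the scores exact, so the value 2/3 from the slide is reproduced without floating-point rounding.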

Extraction Pipeline: 3) Post-Filtering
▪ Blacklists: remove too generic facets (example header pair: Name, Description)
▪ Word Embedding Pooling: retain only facets modeled by at least one word embedding model

Extraction Pipeline: 4) Categorization
▪ Assign each of the 250 facets to one of 8 broader categories (e.g. geographic, music, sports, …)
[Figure: a (Team, Country) facet (AC Milan, Italy; Juventus, Italy; Arsenal, England) and keywords for the categories are passed to a word embedding model, which yields the similarity to the keywords per category, e.g. Music 0.15, Sports 0.53, ….]
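
The slide only states that facets are compared with category keywords through a word embedding model. The sketch below fills in the details with assumptions: averaging the vectors of the facet's words and of each category's keywords, the keyword lists, and the two-dimensional toy embedding `emb` are all hypothetical.

```python
import math

def cos_sim(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def categorize(facet_words, category_keywords, emb):
    """Return the best category and all scores (assumed aggregation: mean vectors)."""
    facet_vec = mean_vector([emb[w] for w in facet_words if w in emb])
    scores = {
        category: cos_sim(facet_vec, mean_vector([emb[k] for k in keywords if k in emb]))
        for category, keywords in category_keywords.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical 2-d embedding (dimension 0: sports-like, dimension 1: music-like).
emb = {
    "Arsenal": [0.9, 0.1], "Juventus": [0.8, 0.2],
    "England": [0.6, 0.4], "Italy": [0.5, 0.5],
    "soccer": [1.0, 0.0], "team": [0.9, 0.2],
    "song": [0.0, 1.0], "album": [0.1, 0.9],
}
keywords = {"Sports": ["soccer", "team"], "Music": ["song", "album"]}

category, scores = categorize(["Arsenal", "Juventus", "England", "Italy"], keywords, emb)
print(category, scores)
```

Words missing from the embedding are skipped, so a facet needs at least one covered word for this sketch to work.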

Evaluation: Evaluation of Categories
Setup
▪ 4 pre-trained word embedding models: GloVe, Word2Vec-SkipGram, fastText, SentenceBert
▪ Selection of 4 FacetE categories
Calculation
▪ Select facets $F: O \to V$ from the categories
▪ Determine the value $v \in V$ for each object $o \in O$ with the 3CosAdd analogy method
▪ Calculate the number of correctly assigned values
▪ Calculate the average in each category
Coverage: for some text values, word embedding models cannot determine a vector
[Figure: Evaluation of 4 Categories]

Evaluation: Evaluation of Categories
Setup
▪ 4 pre-trained word embedding models: GloVe, Word2Vec-SkipGram, fastText, SentenceBert
▪ Selection of 4 FacetE categories
Calculation
▪ Select facets $F: O \to V$ from the categories
▪ Determine the value $v \in V$ for each object $o \in O$ with the 3CosAdd analogy method
▪ Calculate the number of correctly assigned values
▪ Calculate the average in each category
Coverage: for some text values, word embedding models cannot determine a vector
Observation
▪ No single best model
▪ High coverage
[Figure: Evaluation of 4 Categories]
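
The calculation steps can be sketched as follows. The slide does not define the exact metrics, so this assumes that accuracy is the fraction of correct answers among covered queries and that coverage is the fraction of queries whose words all have a vector. The function names, the query tuples, and the stub predictor are hypothetical. In practice `predict` would be an analogy method such as 3CosAdd over a real embedding model.

```python
from statistics import mean

def evaluate_facet(queries, vocab, predict):
    """Return (accuracy, coverage) for one facet.

    queries: (o1, v1, o2, v2) tuples, where v2 is the expected answer.
    vocab:   words for which the embedding model has a vector.
    predict: callable (o1, v1, o2) -> predicted value.
    """
    covered = [q for q in queries if all(word in vocab for word in q)]
    correct = sum(predict(o1, v1, o2) == v2 for o1, v1, o2, v2 in covered)
    coverage = len(covered) / len(queries) if queries else 0.0
    accuracy = correct / len(covered) if covered else 0.0
    return accuracy, coverage

def evaluate_category(facet_queries, vocab, predict):
    """Average accuracy and coverage over all facets of one category."""
    results = [evaluate_facet(q, vocab, predict) for q in facet_queries.values()]
    return mean(a for a, _ in results), mean(c for _, c in results)

# Hypothetical setup: "AC Milan" has no vector, and a stub predictor answers "Italy".
vocab = {"Arsenal", "England", "Juventus", "Italy"}
sports = {
    ("Team", "Country"): [
        ("Arsenal", "England", "Juventus", "Italy"),
        ("Juventus", "Italy", "Arsenal", "England"),
        ("Arsenal", "England", "AC Milan", "Italy"),
        ("AC Milan", "Italy", "Arsenal", "England"),
    ],
}

def stub_predict(o1, v1, o2):
    return "Italy"

accuracy, coverage = evaluate_category(sports, vocab, stub_predict)
print(accuracy, coverage)
```

Reporting coverage next to accuracy keeps a model from looking better simply because it lacks vectors for the harder text values.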
