FacetE: Exploiting Web Tables for Domain-Specific Word Embedding Evaluation
Michael Günther, Paul Sikorski, Maik Thiele, and Wolfgang Lehner
DBTest '20 Workshop at SIGMOD 2020, 19.06.2020

NLP Systems Workflow
[Figure: data storage with textual data → extracted text data (words) → language model → numerical representation (vectors, e.g. 5.02, 43.07, …)]

NLP Systems Workflow
State-of-the-art Language Models: Word Embeddings
▪ Training on Dummy Task (Deep Neural Network)
▪ Extract Weights as Pre-Trained Language Model
[Figure: workflow diagram with the labels "Data Storage with textual data", "Relational database with text data", "Large text corpora in natural language", "Extracted Relational Information", "Extracted text data", "Language Model", and "Numerical Representation (Vectors)"]

NLP Systems Workflow
[Figure: data storage with textual data → extracted text data (words) → language model → numerical representation (vectors, e.g. 5.02, 43.07, …), with the labels "Similarity Search Tasks" and "Classification and Regression Tasks"]

Word Embedding for Systems
ML Systems
▪ Utilize implicitly encoded knowledge from large text corpora
▪ Capture semantic similarities of text values
Database Systems
▪ Semantic text similarity queries
▪ Data exploration
▪ Data integration
Information Retrieval Systems
▪ Semantic search
▪ Query expansion
▪ Multi-lingual search
The choice of the word embedding model is crucial for performance!

Evaluation of Word Embedding Models
Word Similarity
▪ Similar words by cosine similarity of word vectors:
  $\mathrm{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{|\mathbf{x}| \cdot |\mathbf{y}|}$
▪ Example: most similar to "king"? king → prince, man, and queen
Analogy Queries
▪ Retrieve similar relations: $a - b \approx c - {?}$
  3CosAdd: $\arg\max_{d \in V \setminus \{a, b, c\}} \mathrm{sim}_{\cos}(\mathbf{d}, \mathbf{c} - \mathbf{a} + \mathbf{b})$
▪ Example: man − woman ≈ king − ? → queen
[Figure: Schematic Representation of Word Vectors, showing the words woman, man, queen, king, prince; dry, drier, driest; wet, wetter, wettest; London, England, Berlin, Germany]
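
The cosine similarity and the 3CosAdd rule from this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the 2-dimensional vectors in `TOY` are invented for the example, and the function names are hypothetical.

```python
import math

def cos_sim(x, y):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def three_cos_add(vectors, a, b, c):
    """Answer a - b ≈ c - ? by maximizing cos_sim(d, c - a + b) over d not in {a, b, c}."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos_sim(vectors[w], target))

# Toy 2-d vectors, invented for illustration only (real models use hundreds of dimensions).
TOY = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "prince": [3.0, -0.5],
}

answer = three_cos_add(TOY, "man", "woman", "king")
print(answer)
```

With these toy vectors the target king − man + woman coincides with the vector for "queen", so that word is returned.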

Evaluation of Word Embedding Models
Common Similarity Datasets
▪ WS-353: 353 word pairs of general domain knowledge quantifying semantic relatedness
▪ SimLex-999: 999 word pairs of general domain knowledge quantifying semantic similarity
Depend on human notion of similarity → Require human labeling effort
Common Analogy Query Datasets
▪ Google Analogy: 550 semantic and syntactic relations, mostly city-country relations
▪ MSR: 8,000 analogies of 800 syntactic relations
Facts of general domain knowledge → Automatic extraction possible

Similarity Eval *
Embedding Model   WS353   RW
…                 …       …
CBOW              57.2    32.5
SkipGram          62.8    37.2
…                 …       …

Analogy Eval *
Embedding Model   Semantic   Syntactic   Total
CBOW              57.3       68.9        63.7
SkipGram          66.1       65.1        65.6
…                 …          …           …

* Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

Limitations:
▪ Only small datasets
▪ Return a single value only
▪ Only general domain

Evaluation of Word Embedding Models
Limitations:
▪ Only small datasets
▪ Return a single value only
▪ Only general domain
Design Goals:
▪ Large number of relations
▪ Flexible structure
▪ Multiple categories
Design Strategies:
▪ Extraction from millions of web tables
▪ Organization in facets
▪ Definition of categories

Dataset Design
Data Source: Web Tables
▪ Large amount of knowledge
▪ General enough to be expected in pre-trained word embedding models
▪ Redundancy makes it possible to exclude temporary facts (e.g. time-dependent facts like home soccer team to visiting team)
Target Design: Facets
▪ Each facet $F: O \to V$ assigns objects (e.g. soccer players) to values (e.g. teams)
▪ Allows flexible construction of application-specific evaluation datasets
▪ More flexible than hierarchical categorization
[Figure: Web Tables Corpus (tables with headers such as Airport, IATA, Country, City, Location, Rank, Team, Event, Year) → Header Pairs and Values (e.g. Soccer Player / Team: AC Milan, Arsenal; Soccer Player / Country: England, Brazil; Soccer Player / Position: Keeper, Forward) → FacetE Storage Format: Collection of Facets, with labels such as Sports and Economy and facets such as Airport / City, Airport / IATA, and Airport / Country]
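
One way to picture a facet $F: O \to V$ is as a plain mapping from objects to values, from which an application-specific set of analogy queries can be assembled. The sketch below is an assumption about how such a construction could look, not the FacetE storage format itself: the facet names, the `FACETS` layout, and the helper functions are hypothetical.

```python
from itertools import permutations

def analogy_queries(facet):
    """Yield (a, b, c, expected) tuples for a - b ≈ c - ? from an {object: value} mapping."""
    for (o1, v1), (o2, v2) in permutations(facet.items(), 2):
        if v1 != v2:  # pairs that share a value would give a trivial query
            yield (o1, v1, o2, v2)

# Hypothetical in-memory layout: facet name -> {object: value}.
FACETS = {
    "team->country": {"Arsenal": "England", "AC Milan": "Italy", "Juventus": "Italy"},
    "airport->city": {"LHR": "London", "LGW": "London"},
}

def build_dataset(facets, names):
    """Select facets by name and expand each one into its analogy queries."""
    return {name: list(analogy_queries(facets[name])) for name in names}

dataset = build_dataset(FACETS, ["team->country"])
for query in dataset["team->country"]:
    print(query)
```

Because every facet is stored independently, a domain-specific benchmark is just a different list of names passed to `build_dataset`.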

Extraction Pipeline
125M Web Tables
1 Pre Filtering: Frequency and Regex Filter, Facet Creation
2 Soft Functional Dependencies: Check contradiction of most frequent relation
3 Post Filtering: Filter by Pooling, Blacklist, …
4 Categorization: Assign facets to 8 broader categories
250 Facets / 600K Values
Analogy Evaluation (with Word Embeddings)

Extraction Pipeline
1) Pre-Filtering
▪ Filters infrequent and non-textual data of English tables
[Figure: example table with columns such as Country, Date, Team, and Nickname; infrequent columns and non-textual data are removed, leaving the column tuples (Country, Team) and (Team, Country) → Basis for Facets]
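
The pre-filtering step can be approximated as follows. This is a rough sketch under several assumptions that the slides do not specify: tables are given as `{header: [values]}` dictionaries, a header counts as frequent when it appears in at least `min_header_count` tables, and a value counts as textual when it contains a letter and no digit. All names and the toy tables are hypothetical.

```python
import re
from collections import Counter
from itertools import permutations

def is_textual(value):
    """Assumed regex filter: at least one letter and no digits."""
    return bool(re.search(r"[^\W\d_]", value)) and not re.search(r"\d", value)

def column_tuples(tables, min_header_count=2):
    """Collect (object, value) pairs for every ordered pair of surviving headers."""
    header_freq = Counter(h for table in tables for h in table)
    result = {}
    for table in tables:
        kept = [
            h for h, column in table.items()
            if header_freq[h] >= min_header_count and all(is_textual(v) for v in column)
        ]
        for h1, h2 in permutations(kept, 2):
            result.setdefault((h1, h2), []).extend(zip(table[h1], table[h2]))
    return result

# Toy tables, invented for illustration.
TABLES = [
    {"Country": ["England", "Italy"], "Date": ["2019-05-01", "2019-05-02"],
     "Team": ["Arsenal", "AC Milan"], "Nickname": ["Gunners", "Rossoneri"]},
    {"Country": ["Italy"], "Date": ["2020-01-01"], "Team": ["Juventus"]},
]

tuples = column_tuples(TABLES)
print(sorted(tuples))
```

Here "Date" is dropped as non-textual and "Nickname" as infrequent, so only the two orderings of Country and Team remain.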

Extraction Pipeline
2) Soft-Functional Dependencies
▪ Determine static facts
  1) Determine most frequent relation pairs
  2) Check on contradictions
$\mathrm{SFD}(o, v) = \frac{\mathrm{count}(o, v)}{\sum_{v' : (o, v')} \mathrm{count}(o, v')}$
Example: three web tables with Team and Country columns. "England" is the most frequent Country for "Arsenal"; Table 2 contains one contradiction.

Table 1
Team       Country
Arsenal    England
AC Milan   Italy
Juventus   Italy

Table 2
Team       Country
Arsenal    United Kingdom
AC Milan   Italy

Table 3
Team       Country
AC Milan   Italy
Juventus   Italy
Arsenal    England

$\mathrm{SFD}(\text{Arsenal}, \text{England}) = \frac{2}{3}$
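
The soft functional dependency score can be computed directly from pair counts. The sketch below reproduces the Arsenal example from the slide; it assumes that each (object, value) pair is counted once per table, and the `threshold` used to accept a fact is a hypothetical parameter, since the slides do not state how the score is thresholded.

```python
from collections import Counter, defaultdict

def sfd_scores(tables):
    """Return {(object, value): count(o, v) / sum over v' of count(o, v')}."""
    counts = Counter()
    for table in tables:
        counts.update(set(table))  # assumption: one vote per table and pair
    totals = defaultdict(int)
    for (obj, _), n in counts.items():
        totals[obj] += n
    return {(obj, val): n / totals[obj] for (obj, val), n in counts.items()}

def static_facts(tables, threshold):
    """Keep, per object, its most frequent value if the SFD score reaches the threshold."""
    best = {}
    for (obj, val), score in sfd_scores(tables).items():
        if obj not in best or score > best[obj][1]:
            best[obj] = (val, score)
    return {obj: val for obj, (val, score) in best.items() if score >= threshold}

# The three example tables from the slide, as lists of (Team, Country) rows.
TABLES = [
    [("Arsenal", "England"), ("AC Milan", "Italy"), ("Juventus", "Italy")],
    [("Arsenal", "United Kingdom"), ("AC Milan", "Italy")],
    [("AC Milan", "Italy"), ("Juventus", "Italy"), ("Arsenal", "England")],
]

scores = sfd_scores(TABLES)
print(scores[("Arsenal", "England")])
print(static_facts(TABLES, threshold=0.9))
```

With a strict threshold such as 0.9 the contradicted Arsenal entry is dropped, while a looser one such as 0.6 would keep Arsenal → England.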

Extraction Pipeline
3) Post-Filtering
▪ Blacklists: Remove too generic facets (example header pair: Name, Description)
▪ Word Embedding Pooling: Retain only facets modeled by at least one word embedding model

Extraction Pipeline
4) Categorization
▪ Assign each of the 250 facets to one of 8 broader categories (e.g. geographic, music, sports, …)
[Figure: the facet Team / Country (AC Milan: Italy, Juventus: Italy, Arsenal: England) is passed to a word embedding model, which computes its similarity to keywords for the categories]

Cat.     Sim
Music    0.15
Sports   0.53
…        …
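
A possible reading of the categorization step is to compare a vector for the facet with a vector for each category's keywords and pick the most similar category. The sketch below makes assumptions the slide leaves open: the facet is represented by the mean vector of its header words, each category by the mean vector of its keywords, and the keywords and 2-dimensional vectors are invented for illustration.

```python
import math

def cos_sim(x, y):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def mean_vector(words, vectors):
    """Average the vectors of all words known to the model; None if none is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(v[i] for v in known) / len(known) for i in range(len(known[0]))]

def categorize(facet_words, category_keywords, vectors):
    """Return (best_category, {category: similarity}) for one facet."""
    facet_vec = mean_vector(facet_words, vectors)
    sims = {}
    for category, keywords in category_keywords.items():
        keyword_vec = mean_vector(keywords, vectors)
        sims[category] = cos_sim(facet_vec, keyword_vec) if facet_vec and keyword_vec else 0.0
    return max(sims, key=sims.get), sims

# Toy vectors and keywords, invented for illustration only.
VECTORS = {
    "team": [0.9, 0.1], "country": [0.6, 0.4],
    "football": [1.0, 0.1], "league": [0.9, 0.2],
    "album": [0.1, 1.0], "band": [0.2, 0.9],
}
KEYWORDS = {"sports": ["football", "league"], "music": ["album", "band"]}

category, sims = categorize(["team", "country"], KEYWORDS, VECTORS)
print(category)
```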

Evaluation
Evaluation of Categories
Setup
▪ 4 pre-trained word embedding models: GloVe, Word2Vec-SkipGram, fastText, SentenceBert
▪ Selection of 4 FacetE categories
Calculation
▪ Select facets $F: O \to V$ from the categories
▪ Determine the value $v \in V$ for each object $o \in O$ with the 3CosAdd analogy method
▪ Calculate the number of correctly assigned values
▪ Calculate the average in each category
Coverage: For some text values, word embedding models cannot determine a vector
[Figure: Evaluation of 4 Categories]
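
The calculation steps can be sketched as a small evaluation loop. This is an illustrative approximation rather than the authors' evaluation code: accuracy is taken as the share of correctly answered 3CosAdd queries, coverage as the share of facet entries for which both object and value have a vector, and the toy facet and vectors are invented.

```python
import math

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def three_cos_add(vectors, a, b, c):
    """Answer a - b ≈ c - ? by maximizing cos_sim(d, c - a + b) over d not in {a, b, c}."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos_sim(vectors[w], target))

def evaluate_facet(facet, vectors):
    """Return (accuracy, coverage) for one {object: value} facet."""
    covered = [(o, v) for o, v in facet.items() if o in vectors and v in vectors]
    coverage = len(covered) / len(facet) if facet else 0.0
    correct = total = 0
    for o1, v1 in covered:
        for o2, v2 in covered:
            if v1 == v2:
                continue  # also skips the pair of an entry with itself
            total += 1
            correct += three_cos_add(vectors, o1, v1, o2) == v2
    return (correct / total if total else 0.0), coverage

def evaluate_category(facets, vectors):
    """Average accuracy and coverage over all facets of a category."""
    results = [evaluate_facet(facet, vectors) for facet in facets]
    if not results:
        return 0.0, 0.0
    return (sum(a for a, _ in results) / len(results),
            sum(c for _, c in results) / len(results))

# Toy data, invented for illustration; "Real Betis" has no vector and lowers coverage.
TOY_VECTORS = {
    "Arsenal": [3.0, 1.0], "England": [3.0, 2.0],
    "AC Milan": [1.0, 3.0], "Italy": [1.0, 4.0],
    "Spain": [4.0, 4.0],
}
TOY_FACET = {"Arsenal": "England", "AC Milan": "Italy", "Real Betis": "Spain"}

accuracy, coverage = evaluate_facet(TOY_FACET, TOY_VECTORS)
print(accuracy, coverage)
```

Reporting coverage next to accuracy keeps a model from looking good merely because it skips the values it has no vector for.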

Evaluation
Evaluation of Categories
Observation
▪ No single best model
▪ High coverage
[Figure: Evaluation of 4 Categories]