Improving the Compositionality of Word Embeddings. Master Thesis. Author: Thijs Scheepers. Supervisors: dr. Evangelos Kanoulas, dr. Efstratios Gavves
Truly understanding: a far-out goal for Artificial Intelligence
What is your name? Such a simple question from Her by Spike Jonze (2013)
“What is your name?” 01010111 01101000 01100001 01110100 00100000 01101001 01110011 00100000 01111001 01101111 01110101 01110010 00100000 01101110 01100001 01101101 01100101 00111111 Transforming to Binary
“What is your name?” 01010111 01101000 01100001 01110100 00100000 01101001 01110011 00100000 01111001 01101111 01110101 01110010 00100000 01101110 01100001 01101101 01100101 00111111 ASCII
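The binary string on the slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes plain 8-bit ASCII, one byte per character; the variable names are illustrative.

```python
# Encode the sentence as ASCII bytes and print each byte as 8 binary digits.
sentence = "What is your name?"
bits = " ".join(format(byte, "08b") for byte in sentence.encode("ascii"))
print(bits)

# Decoding the bit groups recovers the original characters, nothing more:
# the representation carries no information about meaning.
decoded = bytes(int(group, 2) for group in bits.split()).decode("ascii")
print(decoded)
```

The encoding is lossless but purely symbolic, which is the point the slide makes before moving on to richer representations.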
“What is your name?” [Figure: the words What, is, your, name, each shown as a sparse 0/1 vector of dimension 100,000] Bag-of-words
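A bag-of-words representation can be sketched as follows. The five-word vocabulary is a hypothetical stand-in for the 100,000-word vocabulary on the slide, and the helper names `one_hot` and `bag_of_words` are illustrative.

```python
# Hypothetical toy vocabulary; the slide uses a vocabulary of 100,000 words.
vocabulary = ["cat", "is", "name", "what", "your"]

def one_hot(word, vocab):
    """Return a 0/1 vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

def bag_of_words(tokens, vocab):
    """Sum the one-hot vectors of all tokens; word order is discarded."""
    counts = [0] * len(vocab)
    for token in tokens:
        for i, value in enumerate(one_hot(token, vocab)):
            counts[i] += value
    return counts

tokens = ["what", "is", "your", "name"]
what_vector = one_hot("what", vocabulary)
sentence_vector = bag_of_words(tokens, vocabulary)
print(what_vector)
print(sentence_vector)
```

Every pair of distinct one-hot vectors is equally far apart, so this representation says nothing about which words are similar in meaning.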
“What is your name?” [Figure: the words What, is, your, name, each shown as a dense real-valued vector of dimension 300, e.g. 0.23, 1.62, -1.60, 0.87, …] Word Embeddings
Word Embeddings encode Lexical Semantics, i.e. word meaning [Figure: the words What, is, your, name, each shown as a dense real-valued vector of dimension 300]
[Figure: scatter plot of vocabulary words (e.g. plant, shrub, tree, north, south, european, greek, roman) laid out in two dimensions, both axes running from -20 to 20]
Word Embedding space: ⅕ · [(‘Berlin’ – ‘Germany’) + (‘Stockholm’ – ‘Sweden’) + (‘Washington DC’ – ‘United States’) + (‘Beijing’ – ‘China’) + (‘London’ – ‘United Kingdom’)] ≈ {capital}; ‘Netherlands’ + {capital} = ‘Amsterdam’
[Figure: two-dimensional projection of country and capital embeddings (China-Beijing, Russia-Moscow, Japan-Tokyo, Turkey-Ankara, Poland-Warsaw, Germany-Berlin, France-Paris, Italy-Rome, Greece-Athens, Spain-Madrid, Portugal-Lisbon), both axes running from -2 to 2; from Mikolov et al. (2013)]
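The {capital} offset arithmetic can be sketched with toy vectors. All numbers below are hypothetical two-dimensional values made up for illustration, not real pretrained embeddings, and Euclidean nearest-neighbour lookup is an assumption of this sketch.

```python
import math

# Hypothetical 2-D toy embeddings; real embeddings have hundreds of dimensions.
emb = {
    "Germany": (1.0, 0.0), "Berlin": (1.1, 1.0),
    "Sweden": (2.0, 0.5), "Stockholm": (1.9, 1.6),
    "United States": (4.0, 0.2), "Washington DC": (4.0, 1.1),
    "China": (3.0, -0.5), "Beijing": (3.1, 0.5),
    "United Kingdom": (0.0, -0.3), "London": (-0.1, 0.7),
    "Netherlands": (0.5, 0.2), "Amsterdam": (0.6, 1.2),
}
pairs = [("Berlin", "Germany"), ("Stockholm", "Sweden"),
         ("Washington DC", "United States"), ("Beijing", "China"),
         ("London", "United Kingdom")]

# Average the five (capital - country) offsets to estimate {capital}.
capital_offset = tuple(
    sum(emb[cap][d] - emb[country][d] for cap, country in pairs) / len(pairs)
    for d in range(2)
)

# Add the offset to 'Netherlands' and look up the nearest other word.
query = tuple(emb["Netherlands"][d] + capital_offset[d] for d in range(2))
nearest = min(
    (word for word in emb if word != "Netherlands"),
    key=lambda word: math.dist(emb[word], query),
)
print(nearest)
```

In practice the result of the addition rarely lands exactly on a word vector, so the answer is read off as the nearest neighbour of the query point.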
Word Embedding Composition: combine encodings of word meanings in such a way that a good encoding of their joint meaning is created
“What is your name?” [Figure: a composition function f takes the 300-dimensional embeddings of What, is, your, name and produces a single 300-dimensional vector, e.g. -0.13, 1.65, …, 1.63, 0.99] Word Embedding Composition
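As a minimal sketch of what a composition function f might look like, the code below uses two simple algebraic choices, summing and averaging. The three-dimensional vectors are hypothetical toy values, and the slides do not state which function f is pictured, so both are assumptions here.

```python
# Hypothetical 3-D toy embeddings; the slide uses 300 dimensions.
embeddings = {
    "what": [0.5, 1.0, -1.0],
    "is":   [1.5, -0.5, 1.0],
    "your": [-1.0, 0.5, 0.5],
    "name": [1.0, 1.0, 1.5],
}

def compose_sum(vectors):
    """Compose word vectors by element-wise addition."""
    return [sum(dimension) for dimension in zip(*vectors)]

def compose_average(vectors):
    """Compose word vectors by element-wise averaging."""
    return [total / len(vectors) for total in compose_sum(vectors)]

word_vectors = [embeddings[w] for w in ["what", "is", "your", "name"]]
summed = compose_sum(word_vectors)
averaged = compose_average(word_vectors)
print(summed)
print(averaged)
```

Both functions map any number of word vectors to one vector of the same dimensionality, which is the property the figure illustrates.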
Overview 1. Evaluating compositionality 2. Tuning word embeddings for better algebraic composition 3. Neural methods for composing word embeddings
1. Evaluating compositionality: introducing CompVecEval, a method to evaluate word embeddings on their compositionality
Dictionaries: a pragmatic solution for word meaning
cat /kat/ A small domesticated carnivorous mammal with soft fur, a short snout, and retractable claws. It is widely kept as a pet or for catching mice, and many breeds have been developed.
cat /kat/ A method of examining body organs by scanning them with X-rays and using a computer to construct a series of cross-sectional scans along a single axis.
[Figure: the definition “a human being” (words x[0…2]) is composed by f_c into a vector c, alongside the word “person”]
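One way a dictionary-based check could be sketched: compose the definition's word vectors and see where the defined word ranks among candidate words. The averaging composition, the cosine similarity, the candidate list, and the toy vectors are all assumptions of this sketch; the slides do not spell out the actual CompVecEval procedure.

```python
import math

# Hypothetical 2-D toy embeddings for definition words and candidate words.
emb = {
    "a": (0.1, 0.1), "human": (0.9, 0.2), "being": (0.8, 0.4),
    "person": (0.85, 0.3), "cat": (0.1, 0.9), "mouse": (0.2, 0.8),
}

def average(vectors):
    """Assumed composition function: element-wise mean."""
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

definition = ["a", "human", "being"]
composed = average([emb[w] for w in definition])

# Rank hypothetical candidate words by similarity to the composed definition.
candidates = ["cat", "mouse", "person"]
ranking = sorted(candidates, key=lambda w: cosine(emb[w], composed), reverse=True)
rank_of_person = ranking.index("person") + 1
print(ranking, rank_of_person)
```

A low rank for the defined word suggests that composing the definition lands close to the word's own embedding.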
Dictionary: 1. WordNet (Miller and Fellbaum 1998) 2. We use 4,119 data points for our evaluation method and 72,322 data points for tuning
Popular pretrained Word Embeddings 1. Word2Vec (Mikolov et al. 2013) 2. GloVe (Pennington et al. 2014) 3. fastText (Bojanowski et al. 2016) 4. Paragram (Wieting et al. 2015)
[Figure: Word2Vec Skip-gram on the sentence “the cat ate the mouse”, with center word w_t = “ate” and context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}]
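The (center, context) pairs that Skip-gram trains on can be sketched as below. The window size of 2 matches the w_{t-2} … w_{t+2} positions in the figure; the function name `skipgram_pairs` is illustrative, and the sketch covers pair generation only, not the training itself.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for t, center in enumerate(tokens):
        start = max(0, t - window)
        end = min(len(tokens), t + window + 1)
        for j in range(start, end):
            if j != t:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat ate the mouse".split()
pairs = skipgram_pairs(tokens, window=2)

# Pairs for the center word w_t = "ate" (position 2 in the sentence).
ate_pairs = [pair for pair in pairs if pair[0] == "ate"]
print(ate_pairs)
```

Skip-gram learns embeddings by predicting each context word from its center word, so words that occur in similar contexts end up with similar vectors.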