Learning Word Embeddings for Low-resource Languages by PU Learning



  1. Learning Word Embeddings for Low-resource Languages by PU Learning. Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang

  2. Word embeddings are useful • Many success stories: named entity recognition, document ranking, sentiment analysis, question answering, image captioning • Pre-trained word vectors have been widely used • GloVe [Pennington+14]: 3900+ citations • Word2Vec [Mikolov+13]: 7600+ citations

  3. Existing English embeddings are trained on large collections of text • Word2Vec is trained on the Google News dataset (100 billion tokens) • GloVe is trained on a crawled corpus (840 billion tokens)

  4. How about other languages?

  5. How about other languages? • Number of Wikipedia articles in different languages • High-resource languages (23 languages have more than 100K articles): English ~2.5M, German ~800K, French ~700K, … • Low-resource languages (60 languages have 10K–100K articles): Czech ~100K, Danish ~95K, … • Very low-resource languages (183 languages have fewer than 10K articles): Chichewa 58, …

  6. Sparsity of the co-occurrence matrix • Word embeddings are trained on co-occurrence statistics • When the training corpus is small, many word pairs are unobserved and the co-occurrence matrix is very sparse • Example: the text8 data has 17,000,000 tokens and 71,000 distinct words, so the co-occurrence matrix has roughly 71,000² ≈ 5,000,000,000 entries, more than 99% of which are zeros

  7. Zeros in the co-occurrence matrix • True zeros: word pairs that are unlikely to co-occur • Missing entries: word pairs that can co-occur but are unobserved in the training data • [Illustration: a context-word × center-word matrix with rows alien, table, …, cake, space; some zero entries are marked as true zeros (e.g., in the cake row) and others as missing entries (e.g., in the space row)]

  8. Motivation

  9. Our contributions 1. Propose a PU-Learning framework for training word embeddings 2. Design an efficient learning algorithm to deal with all negative pairs 3. Demonstrate that unobserved word pairs provide valuable information

  10. PU-Learning for Training Word Embeddings

  11. PU-Learning framework 1. Pre-processing: build the co-occurrence matrix 2. Matrix factorization by PU-Learning 3. Post-processing

  12. Step 1 – Building the co-occurrence matrix • Count word co-occurrence statistics within a context window (e.g., for "… the black cat likes milk …", the center word cat yields the pairs (cat, the), (cat, black), (cat, likes), (cat, milk)) • We follow [Levy+15] and scale the co-occurrence counts with the PPMI metric • [Illustration: the resulting context-word × center-word matrix, scaled by PPMI]
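As a rough illustration of Step 1, the sketch below builds a PPMI-scaled co-occurrence matrix from a token list. It is a minimal reading of the slide, not the paper's code: the function name build_ppmi_matrix, the symmetric window, and the default window size are illustrative assumptions.

    # Minimal sketch (assumed details): count co-occurrences in a symmetric
    # window, then rescale the counts with positive PMI (PPMI), as in Levy+15.
    import numpy as np

    def build_ppmi_matrix(tokens, window=2):
        vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
        V = len(vocab)
        counts = np.zeros((V, V))
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[vocab[center], vocab[tokens[j]]] += 1.0
        total = counts.sum()
        p_center = counts.sum(axis=1) / total    # marginal of center words
        p_context = counts.sum(axis=0) / total   # marginal of context words
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log((counts / total) / np.outer(p_center, p_context))
        ppmi = np.maximum(pmi, 0.0)
        ppmi[~np.isfinite(ppmi)] = 0.0           # unobserved pairs stay zero
        return ppmi, vocab

    # Toy example with the sentence from the slide:
    M, vocab = build_ppmi_matrix("the black cat likes milk".split())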

  13. [Illustration continued: in a small corpus the PPMI matrix is dominated by zeros; only the frequently observed context–center pairs receive positive values]

  14. Step 2 – PU-Learning for matrix factorization • [Illustration: the PPMI co-occurrence matrix is approximated by the product of two low-rank factors, one holding the center-word embeddings and the other the context-word embeddings]

  15. Step 2 – PU-Learning for matrix factorization • The training objective combines three parts: a weighting function, a reconstruction error, and a regularization term
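The formula on this slide did not survive the extraction. As a hedged reconstruction consistent with the three components named above (and with standard weighted / PU matrix factorization), the objective would look roughly like:

    \min_{W,\,H}\; \sum_{i,j} C_{ij}\,\bigl(A_{ij} - \mathbf{w}_i^{\top}\mathbf{h}_j\bigr)^2
      \;+\; \lambda\bigl(\lVert W\rVert_F^{2} + \lVert H\rVert_F^{2}\bigr)

Here A is the PPMI co-occurrence matrix, w_i and h_j are the center-word and context-word vectors (rows of W and H), C_ij is the weighting function, λ controls the regularization, and the sum runs over all entries, positive and zero alike; the paper's exact objective (e.g., whether bias terms are added) may differ.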

  16. Step 2 – Weighting function • [Illustration: the observed (frequent) entries and the zero entries of the matrix receive different weights in the reconstruction error]
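A simple weighting consistent with this slide and with slide 27 gives observed entries full weight and zero entries a small constant weight. This is a sketch of one common PU choice; the paper's exact scheme (e.g., any frequency-dependent scaling) may differ:

    C_{ij} =
    \begin{cases}
      1,    & A_{ij} > 0 \quad \text{(observed pair)} \\
      \rho, & A_{ij} = 0 \quad \text{(unobserved pair)}
    \end{cases}
    \qquad 0 < \rho \le 1

A small ρ (the confidence weight written 𝜍 on slide 27) says we are not confident that a zero entry is a true zero; ρ = 1 treats every zero entry as a genuinely negative pair.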

  17. Step 2 – PU-Learning for matrix factorization • The objective again combines the weighting function, the reconstruction error, and the regularization term • We consider all entries: both positive and zero entries

  18–22. Step 2 – PU-Learning for matrix factorization (shown as a sequence of animation frames in the original deck) • We design an efficient coordinate descent algorithm that deals with all the positive and zero entries (see paper for details)
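The paper's efficient coordinate descent is not reproduced here. For reference only, the dense sketch below minimizes the same weighted objective with alternating weighted ridge-regression updates (block coordinate descent over the rows of W and H); it is a naive illustration under the assumptions above, whereas the paper's algorithm handles all the zero entries far more efficiently. The names weighted_als_step, lam, rho, k, n_iters, and the reuse of build_ppmi_matrix from the Step 1 sketch are illustrative.

    # Naive dense reference for the weighted factorization objective sketched
    # after slide 15 (assumed form); NOT the paper's efficient algorithm.
    import numpy as np

    def weighted_als_step(A, C, W, H, lam):
        """One alternating pass minimizing
        sum_ij C_ij (A_ij - w_i . h_j)^2 + lam * (||W||_F^2 + ||H||_F^2)."""
        k = W.shape[1]
        # Each center-word vector w_i is a weighted ridge-regression solution.
        for i in range(A.shape[0]):
            D = np.diag(C[i])
            W[i] = np.linalg.solve(H.T @ D @ H + lam * np.eye(k),
                                   H.T @ D @ A[i])
        # Context-word vectors h_j are updated symmetrically.
        for j in range(A.shape[1]):
            D = np.diag(C[:, j])
            H[j] = np.linalg.solve(W.T @ D @ W + lam * np.eye(k),
                                   W.T @ D @ A[:, j])
        return W, H

    # Usage sketch: C has weight 1 on observed entries and rho on zeros.
    # A, vocab = build_ppmi_matrix(tokens)          # from the Step 1 sketch
    # C = np.where(A > 0, 1.0, rho)
    # W = 0.01 * np.random.randn(A.shape[0], k)
    # H = 0.01 * np.random.randn(A.shape[1], k)
    # for _ in range(n_iters):
    #     W, H = weighted_als_step(A, C, W, H, lam)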

  23. Step 3 – Post-processing

  24. Experiments

  25. Results on English • Simulate the low-resource setting: embeddings are trained on a subset of Wikipedia with 32M tokens • [Plots: word similarity task on WS353; word analogy task on the Google dataset]

  26. Results on Danish (more results in paper) • Danish Wikipedia with 64M tokens • Test sets were translated with Google Translate (about 90% accuracy, verified by native speakers) • [Plots: word similarity task on WS353; word analogy task on the Google dataset]

  27. • 𝜍 is the weight for zero entries in the co-occurrence matrix • Zero entries can be true zeros or missing entries • 𝜍 reflects how confident we are that the zero entries are true zeros

  28. Take-home messages • A PU-Learning framework for learning word embeddings in the low-resource setting • Unobserved word pairs provide valuable information • Thanks!
