

  1. PhD Mid-term Follow-up, 16/10/2018
 MID-TERM FOLLOW-UP
 Semantic mining: Unsupervised acquisition of multilingual semantic classes from texts
 Presenter: Zheng ZHANG
 Supervisors: Pierre ZWEIGENBAUM & Yue MA
 Evaluation committee: Vincent CLAVEAU & Alexandre ALLAUZEN

  2. OUTLINE
 • Introduction of Thesis Topic
 • Results Achieved
   • GNEG: Graph-Based Negative Sampling for word2vec
   • corpus2graph: Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
 • Work in Progress
 • Future Work

  3. Introduction of Thesis Topic

  4. Introduction of Thesis Topic: Multilingual semantic classes
 • Semantic class: a group of words clustered by using distributional similarity measures
 [Figure: example French semantic classes: “fruits” (pomme, pêche, prunelle, poire) and “couleurs”]

  5. Introduction of Thesis Topic: Multilingual semantic classes
 [Figure: aligned semantic classes across languages: the FR classes “fruits” (pomme, pêche, prunelle, poire) and “couleurs” alongside the EN classes “fruits” (peach, pear, apple) and “colors”]

  7. Introduction of Thesis Topic: Applications: Unknown words “translation”
 [Figure: the same FR-EN class alignment, now with the unknown FR word “roux” placed in the “couleurs” class and thereby linked to the EN “colors” class]

  9. Introduction of Thesis Topic: Applications: Universal classes extraction
 [Figure: the FR class (pomme, pêche, prunelle, poire) and the EN class (peach, pear, apple) are merged into a universal class of “fruits”]

  10. Introduction of Thesis Topic: Cross-lingual Word Embeddings Learning
 [Table: existing approaches organized along two axes: (1) the alignment data level: document (vulic2015bilingual), sentence (gouws2015 BilBOWA, luong2015 bilingual, levy2017strong), word (gouws2015simple, mikolov2013exploiting, artetxe2017learning); and (2) the training stage at which cross-lingual information is used: pre-processing, training (“count-based” or “neural”), post-embedding]
 • Multilingual word embeddings learning can be seen as an extension of (monolingual) word embeddings learning.

  11. Results Achieved

  12. Results Achieved: GNEG*
 Skip-gram negative sampling
 • Why? The softmax calculation is too expensive → replace every softmax term in the Skip-gram objective.
 • What? Distinguish the target word from draws from the noise distribution using logistic regression, with k negative examples for each data sample (objective recalled below).
 • Advantages: cheap to calculate; all valid words can be selected as negative examples.
 *GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM. In Proceedings of ACL 2018, Melbourne, Australia
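 For reference, the standard negative-sampling objective (Mikolov et al., 2013) that replaces each softmax term; v_w and v'_c are the input and output vectors of the target and context words, k the number of negative samples, and P_n(w) the noise distribution:

     \log \sigma\!\left({v'_c}^{\top} v_w\right)
       + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
           \left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_w\right)\right]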

  13. Results Achieved: GNEG*
 Drawbacks of skip-gram negative sampling
 • Negative sampling is not targeted at the training words: it is only based on the word count.
 [Figure: heat maps over word_id of the word count and of lg(P_n(w)) for the drawn negative examples: the two distributions look the same!]
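 A minimal sketch (variable names are mine) of this count-only noise distribution: standard word2vec draws negatives from the unigram distribution raised to the 3/4 power, and the same distribution is used whatever the current training word is.

     import numpy as np

     def unigram_noise_distribution(word_counts, power=0.75):
         # word2vec's noise distribution: depends only on word counts,
         # not on which training word we are drawing negatives for
         weights = np.asarray(word_counts, dtype=np.float64) ** power
         return weights / weights.sum()

     P_n = unigram_noise_distribution([50, 10, 3, 1])
     negatives = np.random.choice(len(P_n), size=5, p=P_n)  # same P_n for every target word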

  14. Results Achieved: GNEG*
 Graph-based word2vec training
 • Word co-occurrence networks (matrices)
   • Definition: a graph whose vertices represent unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window (sketched below)
   • Networks and matrices are interchangeable
   • Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction. https://safetyapp.shinyapps.io/GoWvis/
 • A new context → negative examples
   • word2vec already implicitly uses the statistics of word co-occurrences for the context word selection, but not for the negative examples selection
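 A minimal sketch of this definition in plain Python (names are mine): count an undirected edge for every pair of terms appearing within a fixed-size sliding window.

     from collections import Counter

     def cooccurrence_network(tokens, window_size=2):
         # Undirected word co-occurrence network as an edge -> weight map:
         # edge (u, v) gets +1 each time u and v occur within window_size tokens
         edges = Counter()
         for i, u in enumerate(tokens):
             for j in range(i + 1, min(i + 1 + window_size, len(tokens))):
                 v = tokens[j]
                 if u != v:
                     edges[tuple(sorted((u, v)))] += 1
         return edges

     edges = cooccurrence_network("the history of natural language processing".split())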

  15. Results Achieved: GNEG*
 Graph-based negative sampling
 • Based on the word co-occurrence network (matrix)
 [Figure: heat map over (word_id, word_id) of lg(word co-occurrence)]
 • Three methods to generate the noise distribution:
   • Training-word context distribution
   • Difference between the unigram distribution and the training words' contexts distribution
   • Random walks on the word co-occurrence network (sketched below)
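 One plausible reading of the random-walk variant, as a sketch only: take a few steps of a random walk over the row-normalized co-occurrence matrix and use the visit probabilities from each word as that word's own noise distribution. The step count and normalization are my assumptions, not necessarily the paper's exact formulation.

     import numpy as np

     def random_walk_noise(cooc, steps=2):
         # cooc: (V, V) co-occurrence count matrix; returns a (V, V) matrix
         # whose row w is the walk distribution after `steps` steps from word w
         P = cooc / cooc.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
         return np.linalg.matrix_power(P, steps)

     cooc = np.array([[0, 4, 1], [4, 0, 2], [1, 2, 0]], dtype=float)
     noise = random_walk_noise(cooc, steps=2)  # unlike unigram sampling, rows differ per word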

  16. Results Achieved: GNEG*
 Graph-based negative sampling: evaluation results
 • Total time for the entire English Wikipedia corpus (… tokens), trained on the server prevert (50 threads used): 2.5 h for word co-occurrence network (matrix) generation + 8 h for word2vec training

  17. Results Achieved: corpus2graph*
 How to generate a large word co-occurrence network within 3 hours?
 “Tech Specs”
 • Plays well with other graph libraries (“Don't reinvent the wheel.”)
 • NLP-applications oriented (built-in tokenizer, stemmer, sentence analyzer…)
 • Handles large corpora (e.g. the entire English Wikipedia corpus, … tokens) by using multiprocessing
 • Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
 • Fast!
 *Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM. In Proceedings of the NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US

  18. Results Achieved: corpus2graph*
 [Diagram: pipeline overview: corpus → word co-occurrence network generation with corpus2graph → network processing with igraph]
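 To illustrate the processing side, a minimal sketch that loads an edge list into python-igraph; the file name and the NCOL format (“source target weight” per line) are my assumptions about the hand-off between the two stages.

     import igraph as ig

     # Load a weighted, undirected co-occurrence network from an NCOL edge list
     # ("word_u word_v weight" per line); the path is hypothetical
     g = ig.Graph.Read_Ncol("cooccurrence_edges.ncol", weights=True, directed=False)

     # Typical downstream processing: weighted degree (strength) and PageRank
     strength = g.strength(weights="weight")
     pagerank = g.pagerank(weights="weight")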

  19. Results Achieved: corpus2graph*
 Word Co-occurrence Network Generation: Word level
 • NLP-applications oriented
   • Word processor (built-in): tokenizer, stemmer, replacing numbers & removing punctuation marks and/or stop words
   • User-customized word processor
 • Example (sketched below): “The history of natural language processing generally started in the 1950s.” → “The histori of natur languag process gener start in the 0000s”
 [Figure: the processed tokens become the network nodes h, n, l, p, g, s, 0]
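 A minimal sketch of such a word processor, using NLTK's Porter stemmer plus a digit-replacement rule (my choice of tools; corpus2graph's built-in processor may differ, and NLTK's stemmer lowercases, so the output differs slightly from the slide's example).

     import re
     from nltk.stem import PorterStemmer

     stemmer = PorterStemmer()

     def process_word(token):
         # stem the token, then replace every digit with 0, as in the slide's example
         return re.sub(r"\d", "0", stemmer.stem(token))

     sentence = "The history of natural language processing generally started in the 1950s"
     print(" ".join(process_word(t) for t in sentence.split()))
     # -> roughly "the histori of natur languag process gener start in the 0000s"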

  20. Results Achieved: corpus2graph*
 Word Co-occurrence Network Generation: Sentence level
 • NLP-applications oriented
   • Word processor (built-in or user-customized), as on the previous slide
   • Sentence analyzer: word pairs of different distances are extracted (sketched below)
   • User-customized sentence analyzer
 [Figure: with d_max = 2, word pairs are extracted at distance 1 and distance 2 over the nodes h, n, l, p, g, s, 0]
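 A minimal sketch of a sentence analyzer (names are mine): for each position, emit the word pairs at every distance up to d_max.

     def sentence_analyzer(tokens, d_max=2):
         # extract (word, word, distance) triples for all pairs at distance <= d_max
         pairs = []
         for i, u in enumerate(tokens):
             for d in range(1, d_max + 1):
                 if i + d < len(tokens):
                     pairs.append((u, tokens[i + d], d))
         return pairs

     # pairs at distance 1 and distance 2, as in the slide's figure
     print(sentence_analyzer(["h", "n", "l", "p", "g", "s", "0"], d_max=2))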

  21. Results Achieved: corpus2graph*
 Word Co-occurrence Network Generation: Word pair level
 • NLP-applications oriented
   • Word processor and sentence analyzer, as on the previous slides
   • Word pair analyzer (sketched below)
     • Word pair weight w.r.t. the maximum distance
     • Directed & undirected
     • User-customized word pair analyzer
 [Figure: the resulting undirected network over the nodes h, n, l, p, g, s, 0, with edge weights of 1]
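 A minimal sketch of a word pair analyzer that merges the extracted (u, v, distance) triples into a weighted edge list; the uniform weight of 1 per co-occurrence within d_max (as in the slide's figure) is my simplification of the distance-based weighting.

     from collections import Counter

     def word_pair_analyzer(triples, d_max=2, directed=False):
         # aggregate (u, v, distance) triples into edge weights, keeping pairs within d_max
         edges = Counter()
         for u, v, d in triples:
             if d <= d_max:
                 key = (u, v) if directed else tuple(sorted((u, v)))
                 edges[key] += 1  # uniform weight; a custom analyzer could weight by distance
         return edges

     edges = word_pair_analyzer([("h", "n", 1), ("h", "l", 2), ("n", "l", 1)])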

  22. Results Achieved: corpus2graph*
 Word Co-occurrence Network Generation: Multiprocessing
 • 3 multiprocessing steps: word processing, sentence analyzing, word pair merging
 • MapReduce-like (sketched below)
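 A MapReduce-like sketch of the parallelization idea using Python's multiprocessing; this is my own toy version, not corpus2graph's actual implementation.

     from collections import Counter
     from multiprocessing import Pool

     def count_pairs(sentence_tokens, d_max=2):
         # map step: one sentence -> local word pair counts
         local = Counter()
         for i, u in enumerate(sentence_tokens):
             for d in range(1, d_max + 1):
                 if i + d < len(sentence_tokens):
                     local[tuple(sorted((u, sentence_tokens[i + d])))] += 1
         return local

     if __name__ == "__main__":
         sentences = [["a", "b", "c"], ["b", "c", "d"]]   # toy corpus
         with Pool(processes=4) as pool:
             partials = pool.map(count_pairs, sentences)  # map: analyze sentences in parallel
         network = sum(partials, Counter())               # reduce: merge partial counts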

  23. Work in Progress

  24. Work in Progress
 Use of pre-computed graph information extracted from word co-occurrence networks
 • Graph-based negative sampling for fastText
 • Word co-occurrence based matrix factorization for word embeddings learning

  25. Work in Progress: Matrix Factorization
 • [Levy and Goldberg, 2014] show that skip-gram with negative sampling is implicitly factorizing a word-context matrix.
 • Pipeline: word co-occurrence matrix → “enhanced” matrix with entries max(PMI(w, c) - log k, 0) → SVD (sketched below)
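 A minimal sketch of that pipeline (the shifted positive PMI matrix of Levy & Goldberg, 2014, then a truncated SVD); taking U·√S as the word embeddings is one common choice, and the toy matrix is mine.

     import numpy as np

     def sppmi_embeddings(cooc, k=5, dim=2):
         # M[w, c] = max(PMI(w, c) - log k, 0), then truncated SVD
         total = cooc.sum()
         p_w = cooc.sum(axis=1, keepdims=True) / total   # P(w)
         p_c = cooc.sum(axis=0, keepdims=True) / total   # P(c)
         with np.errstate(divide="ignore"):
             pmi = np.log((cooc / total) / (p_w * p_c))  # log(0) -> -inf, clipped below
         sppmi = np.maximum(pmi - np.log(k), 0.0)        # shift by log k, keep positive part
         U, S, _ = np.linalg.svd(sppmi)
         return U[:, :dim] * np.sqrt(S[:dim])            # word embeddings

     cooc = np.array([[10, 2, 0], [2, 8, 1], [0, 1, 6]], dtype=float)
     emb = sppmi_embeddings(cooc, k=1, dim=2)  # k=1 keeps plain positive PMI on this toy matrix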
