PhD Mid-term Follow-up, 16/10/2018
Semantic mining: Unsupervised acquisition of multilingual semantic classes from texts
Presenter: Zheng ZHANG
Supervisors: Pierre ZWEIGENBAUM & Yue MA
Evaluation committee: Vincent CLAVEAU & Alexandre ALLAUZEN
OUTLINE
• Introduction of Thesis Topic
• Results Achieved
  • GNEG: Graph-Based Negative Sampling for word2vec
  • corpus2graph: Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
• Work in Progress
• Future Work
Introduction of Thesis Topic
Introduction of Thesis Topic
Multilingual semantic classes
• Semantic class: a group of words clustered by using distributional similarity measures
[Figure: example French semantic classes: fruits (pomme, pêche, prunelle, poire) and couleurs]
Introduction of Thesis Topic
Multilingual semantic classes
[Figure: aligned semantic classes across languages: FR fruits (pomme, pêche, prunelle, poire) matched with EN fruits (peach, pear, apple); FR couleurs matched with EN colors]
Introduction of Thesis Topic
Applications: Unknown words “translation”
[Figure: the unknown FR word roux is attached to the FR couleurs class; its alignment with the EN colors class suggests candidate translations]
Introduction of Thesis Topic
Applications: Universal classes extraction
[Figure: the FR class (pomme, pêche, prunelle, poire) and the EN class (peach, pear, apple) merge into a universal class of “fruits”]
Introduction of Thesis Topic
Cross-lingual Word Embeddings Learning
[Figure: taxonomy of methods by alignment data level (document: vulic2015bilingual; sentence: gouws2015bilbowa, luong2015bilingual, levy2017strong; word: gouws2015simple, mikolov2013exploiting, artetxe2017learning) and by training stage: “count-based” pre-processing, “neural” training, post-embedding]
Multilingual word embeddings learning can be seen as an extension of (monolingual) word embeddings learning.
Results Achieved
Results Achieved: GNEG*
Skip-gram negative sampling
• Why? The softmax calculation is too expensive → replace every term in the Skip-gram objective.
• What? Distinguish the target word from draws from the noise distribution using logistic regression, where there are k negative examples for each data sample.
• Advantages:
  • Cheap to calculate.
  • All valid words can be selected as negative examples.
*GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM, in Proceedings of ACL 2018, Melbourne, Australia
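For reference, this is the per-token SGNS objective from Mikolov et al. (2013) that the following slides build on; GNEG modifies only the noise distribution P_n(w):

```latex
\log \sigma\!\left( {v'_{w_O}}^{\top} v_{w_I} \right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left( -{v'_{w_i}}^{\top} v_{w_I} \right) \right],
\qquad P_n(w) \propto U(w)^{3/4}
```

Here w_I is the input word, w_O the context word, and U(w) the unigram (word count) distribution.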
Results Achieved: GNEG*
Drawbacks of skip-gram negative sampling
• Negative sampling is not targeted for training words. It is only based on the word count.
[Figures: word count per word_id, and heat map of the negative examples distribution lg(P_n(w)) per word_id: the two distributions have the same shape!]
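To make the drawback concrete, here is a minimal Python sketch of word2vec's count-based sampling (the function name is mine, not word2vec's internals): counts are raised to the 3/4 power and normalized, independently of which word is currently being trained.

```python
import numpy as np

def unigram_noise_distribution(word_counts, power=0.75):
    """word2vec's standard noise distribution: P_n(w) ∝ count(w)^0.75.
    It depends only on global word counts, so every target word draws
    its negatives from the very same distribution."""
    counts = np.asarray(word_counts, dtype=np.float64)
    weights = counts ** power
    return weights / weights.sum()

# Example: counts indexed by word_id; draw k = 5 negative examples.
probs = unigram_noise_distribution([5000, 1200, 300, 40])
negatives = np.random.choice(len(probs), size=5, p=probs)
```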
Results Achieved: GNEG*
Graph-based word2vec training
• Word co-occurrence networks (matrices)
  • Definition: a graph whose vertices represent unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window.
  • Networks and matrices are interchangeable.
  Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction. https://safetyapp.shinyapps.io/GoWvis/
• A new context → negative examples
  • word2vec already implicitly uses the statistics of word co-occurrences for the context word selection, but not for the negative examples selection.
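A minimal sketch of the definition above, building an undirected weighted co-occurrence network from tokenized sentences (illustrative code, not the GNEG implementation):

```python
from collections import defaultdict

def cooccurrence_network(sentences, window=2):
    """One node per unique term; one weighted edge per pair of terms
    that co-occur within a fixed-size sliding window."""
    edges = defaultdict(int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if w != tokens[j]:
                    edges[tuple(sorted((w, tokens[j])))] += 1
    return edges  # {(word_a, word_b): co-occurrence count}

edges = cooccurrence_network([["natural", "language", "processing"],
                              ["language", "processing", "history"]])
```

The dict-of-edges view and the adjacency-matrix view carry the same information, which is what makes networks and matrices interchangeable here.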
Results Achieved: GNEG*
Graph-based negative sampling
• Based on the word co-occurrence network (matrix)
[Figure: heat map of the word co-occurrence distribution, lg(word co-occurrence) per word_id pair]
• Three methods to generate the noise distribution:
  • Training-word context distribution
  • Difference between the unigram distribution and the training words’ context distribution
  • Random walks on the word co-occurrence network
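One plausible reading of the third method, sketched in Python under the assumption that a t-step random walk started at the training word over the row-normalized co-occurrence matrix yields its word-specific noise distribution (the paper's exact formulation may differ):

```python
import numpy as np

def random_walk_noise(cooc, word_id, t=2):
    """t-step random walk on the word co-occurrence network from `word_id`."""
    # Row-normalize the co-occurrence matrix into transition probabilities;
    # the epsilon guards against isolated words (all-zero rows).
    row_sums = cooc.sum(axis=1, keepdims=True)
    transition = cooc / np.maximum(row_sums, 1e-12)
    dist = np.zeros(cooc.shape[0])
    dist[word_id] = 1.0
    for _ in range(t):           # propagate the walk for t steps
        dist = dist @ transition
    return dist                  # noise distribution targeted at word_id
```

Unlike the count-based distribution, this one changes with the training word, which is exactly the targeting that plain negative sampling lacks.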
Results Achieved: GNEG*
Graph-based negative sampling
• Evaluation results
• Total time
  • Entire English Wikipedia corpus ( tokens) trained on the server prevert (50 threads used): 2.5 h of word co-occurrence network (matrix) generation + 8 h of word2vec training
Results Achieved: corpus2graph*
How to generate a large word co-occurrence network within 3 hours?
“Tech Specs”
• Plays well with other graph libraries (“Don’t reinvent the wheel.”)
• NLP-application oriented (built-in tokenizer, stemmer, sentence analyzer…)
• Handles large corpora (e.g. the entire English Wikipedia corpus, tokens), using multiprocessing
• Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
• Fast!
*Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM, in Proceedings of NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US
Results Achieved: corpus2graph*
[Figure: pipeline overview: word co-occurrence network generation with corpus2graph, then network processing with corpus2graph + igraph]
Results Achieved: corpus2graph*
Word Co-occurrence Network Generation
• NLP applications oriented
  • Word processor (built-in): tokenizer, stemmer, replacing numbers & removing punctuation marks and/or stop words
  • User-customized word processor
Example (see the sketch below):
“The history of natural language processing generally started in the 1950s.”
→ “The histori of natur languag process gener start in the 0000s”
(abbreviated in later figures as the nodes h, n, l, p, g, s, 0)
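A sketch of such a built-in word processor, assuming NLTK's Snowball stemmer; it reproduces the slide's example (modulo lowercasing), though corpus2graph's actual implementation may differ:

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def process(sentence, stop_words=frozenset()):
    """Tokenize, stem, replace digits with 0, drop punctuation / stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    out = []
    for tok in tokens:
        tok = re.sub(r"\d", "0", stemmer.stem(tok))
        if tok not in stop_words:
            out.append(tok)
    return out

process("The history of natural language processing generally started in the 1950s.")
# -> ['the', 'histori', 'of', 'natur', 'languag', 'process',
#     'gener', 'start', 'in', 'the', '0000s']
```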
Results Achieved: corpus2graph*
Word Co-occurrence Network Generation
• NLP applications oriented
  • Word processor (built-in or user-customized), as above
  • Sentence analyzer: word pairs of different distances are extracted
  • User-customized sentence analyzer
[Figure: word pairs at distance 1 and distance 2 (d_max = 2) over the sentence h n l p g s 0]
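A minimal sketch of a sentence analyzer extracting word pairs up to d_max (illustrative, not corpus2graph's API):

```python
def word_pairs_with_distance(tokens, d_max=2):
    """Emit (word, word, distance) triples for every pair of tokens
    at most d_max positions apart."""
    for i, w in enumerate(tokens):
        for d in range(1, d_max + 1):
            if i + d < len(tokens):
                yield (w, tokens[i + d], d)

list(word_pairs_with_distance(["h", "n", "l", "p"], d_max=2))
# -> [('h','n',1), ('h','l',2), ('n','l',1), ('n','p',2), ('l','p',1)]
```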
Results Achieved: corpus2graph*
Word Co-occurrence Network Generation
• NLP applications oriented
  • Word processor (built-in or user-customized)
  • Sentence analyzer (built-in or user-customized): word pairs of different distances
  • Word pair analyzer:
    • Word pair weight w.r.t. the maximum distance
    • Directed & undirected
    • User-customized word pair analyzer
[Figure: resulting co-occurrence network over the nodes h, n, l, p, g, s, 0, with edge weights of 1]
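And a sketch of a word pair analyzer folding those (word, word, distance) triples into directed or undirected weighted edges; here every occurrence contributes weight 1, but a distance-dependent weighting function could be plugged in instead:

```python
from collections import defaultdict

def merge_pairs(triples, d_max=2, directed=False):
    """Fold (word, word, distance) triples with distance <= d_max
    into a weighted edge list."""
    edges = defaultdict(int)
    for w1, w2, d in triples:
        if d > d_max:
            continue
        key = (w1, w2) if directed else tuple(sorted((w1, w2)))
        edges[key] += 1  # swap in a weight depending on d if desired
    return edges
```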
Results Achieved: corpus2graph*
Word Co-occurrence Network Generation: Multiprocessing
• 3 multiprocessing steps:
  • Word processing
  • Sentence analyzing
  • Word pair merging
• MapReduce-like
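A MapReduce-like sketch of that design, reusing word_pairs_with_distance from the earlier sketch; it simplifies corpus2graph's three steps into one map (per-file analysis, run in parallel) and one reduce (merging the per-file counts):

```python
from collections import Counter
from multiprocessing import Pool

def analyze_file(path):
    """Map step: turn one pre-processed corpus file into local pair counts."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            for w1, w2, d in word_pairs_with_distance(line.split(), d_max=2):
                counts[tuple(sorted((w1, w2)))] += 1
    return counts

def build_network(paths, processes=4):
    """Reduce step: merge per-file counts into one global edge list."""
    total = Counter()
    with Pool(processes) as pool:
        for counts in pool.map(analyze_file, paths):
            total.update(counts)
    return total
```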
Work in Progress
Work in Progress
Use of pre-computed graph information extracted from word co-occurrence networks
• Graph-based negative sampling for fastText
• Word co-occurrence based matrix factorization for word embeddings learning
Work in Progress
Matrix Factorization
• [Levy and Goldberg, 2014] show that skip-gram with negative sampling is implicitly factorizing a word-context matrix.
• Pipeline: word co-occurrence matrix → “enhanced” matrix, max(PMI(w, c) − log k, 0) → SVD
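A compact sketch of that pipeline, assuming a dense co-occurrence matrix in which every word occurs at least once: build the shifted positive PMI matrix max(PMI(w, c) − log k, 0) and factorize it with SVD, taking W = U_d · Σ_d^{1/2} as the word embeddings (the symmetric variant used by Levy and Goldberg):

```python
import numpy as np

def sppmi_svd(cooc, k=5, dim=100):
    """SGNS-as-factorization: SPPMI matrix followed by truncated SVD."""
    total = cooc.sum()
    pw = cooc.sum(axis=1, keepdims=True) / total  # P(w), assumes no zero rows
    pc = cooc.sum(axis=0, keepdims=True) / total  # P(c)
    with np.errstate(divide="ignore"):            # log(0) -> -inf, clipped below
        pmi = np.log((cooc / total) / (pw * pc))
    sppmi = np.maximum(pmi - np.log(k), 0)        # shift by log k, keep positives
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])          # word embedding matrix W
```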