PhD Mid-term Follow-up, 16/10/2018
Semantic mining: Unsupervised acquisition of multilingual semantic classes from texts
Presenter: Zheng ZHANG
Supervisors: Pierre ZWEIGENBAUM & Yue MA
Evaluation committee: Vincent CLAVEAU & Alexandre ALLAUZEN
OUTLINE
• Introduction of Thesis Topic
• Results Achieved
  • GNEG: Graph-Based Negative Sampling for word2vec
  • corpus2graph: Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
• Work in Progress
• Future Work
Introduction of Thesis Topic
Introduction of Thesis Topic
Multilingual semantic classes
• Semantic class: a group of words clustered by using distributional similarity measures
[Figure: example French semantic classes: fruits (pomme, pêche, prunelle, poire) and couleurs]
Introduction of Thesis Topic
Multilingual semantic classes
[Figure: aligned semantic classes across languages: FR fruits (pomme, pêche, prunelle, poire) matched with EN fruits (peach, pear, apple); FR couleurs matched with EN colors]
Introduction of Thesis Topic
Applications: Unknown words “translation”
[Figure: the unknown FR word roux is attached to the FR couleurs class; its alignment with the EN colors class suggests candidate translations]
Introduction of Thesis Topic
Applications: Universal classes extraction
[Figure: the FR class (pomme, pêche, prunelle, poire) and the EN class (peach, pear, apple) merge into a universal class of “fruits”]
Introduction of Thesis Topic
Cross-lingual Word Embeddings Learning
[Figure: taxonomy of methods by alignment data level (document: vulic2015bilingual; sentence: gouws2015bilbowa, luong2015bilingual, levy2017strong; word: gouws2015simple, mikolov2013exploiting, artetxe2017learning) and by training stage: “count-based” pre-processing, “neural” training, post-embedding]
Multilingual word embeddings learning can be seen as an extension of (monolingual) word embeddings learning.
Results Achieved
Results Achieved: GNEG*
Skip-gram negative sampling
• Why? The softmax calculation is too expensive → replace every term in the Skip-gram objective.
• What? Distinguish the target word from draws from the noise distribution using logistic regression, where there are k negative examples for each data sample.
• Advantages:
  • Cheap to calculate.
  • All valid words can be selected as negative examples.
*GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM, in Proceedings of ACL 2018, Melbourne, Australia
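For reference, this is the per-token SGNS objective from Mikolov et al. (2013) that the following slides build on; GNEG modifies only the noise distribution P_n(w):

```latex
\log \sigma\!\left( {v'_{w_O}}^{\top} v_{w_I} \right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left( -{v'_{w_i}}^{\top} v_{w_I} \right) \right],
\qquad P_n(w) \propto U(w)^{3/4}
```

Here w_I is the input word, w_O the context word, and U(w) the unigram (word count) distribution.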
Results Achieved: GNEG*
Drawbacks of skip-gram negative sampling
• Negative sampling is not targeted for training words. It is only based on the word count.
[Figures: word count per word_id, and heat map of the negative examples distribution lg(P_n(w)) per word_id: the two distributions have the same shape!]
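To make the drawback concrete, here is a minimal Python sketch of word2vec's count-based sampling (the function name is mine, not word2vec's internals): counts are raised to the 3/4 power and normalized, independently of which word is currently being trained.

```python
import numpy as np

def unigram_noise_distribution(word_counts, power=0.75):
    """word2vec's standard noise distribution: P_n(w) ∝ count(w)^0.75.
    It depends only on global word counts, so every target word draws
    its negatives from the very same distribution."""
    counts = np.asarray(word_counts, dtype=np.float64)
    weights = counts ** power
    return weights / weights.sum()

# Example: counts indexed by word_id; draw k = 5 negative examples.
probs = unigram_noise_distribution([5000, 1200, 300, 40])
negatives = np.random.choice(len(probs), size=5, p=probs)
```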
Results Achieved: GNEG*
Graph-based word2vec training
• Word co-occurrence networks (matrices)
  • Definition: a graph whose vertices represent unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window.
  • Networks and matrices are interchangeable.
  Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction. https://safetyapp.shinyapps.io/GoWvis/
• A new context → negative examples
  • word2vec already implicitly uses the statistics of word co-occurrences for the context word selection, but not for the negative examples selection.
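A minimal sketch of the definition above, building an undirected weighted co-occurrence network from tokenized sentences (illustrative code, not the GNEG implementation):

```python
from collections import defaultdict

def cooccurrence_network(sentences, window=2):
    """One node per unique term; one weighted edge per pair of terms
    that co-occur within a fixed-size sliding window."""
    edges = defaultdict(int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if w != tokens[j]:
                    edges[tuple(sorted((w, tokens[j])))] += 1
    return edges  # {(word_a, word_b): co-occurrence count}

edges = cooccurrence_network([["natural", "language", "processing"],
                              ["language", "processing", "history"]])
```

The dict-of-edges view and the adjacency-matrix view carry the same information, which is what makes networks and matrices interchangeable here.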
Results Achieved: GNEG*
Graph-based negative sampling
• Based on the word co-occurrence network (matrix)
[Figure: heat map of the word co-occurrence distribution, lg(word co-occurrence) per word_id pair]
• Three methods to generate the noise distribution:
  • Training-word context distribution
  • Difference between the unigram distribution and the training words’ context distribution
  • Random walks on the word co-occurrence network
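One plausible reading of the third method, sketched in Python under the assumption that a t-step random walk started at the training word over the row-normalized co-occurrence matrix yields its word-specific noise distribution (the paper's exact formulation may differ):

```python
import numpy as np

def random_walk_noise(cooc, word_id, t=2):
    """t-step random walk on the word co-occurrence network from `word_id`."""
    # Row-normalize the co-occurrence matrix into transition probabilities;
    # the epsilon guards against isolated words (all-zero rows).
    row_sums = cooc.sum(axis=1, keepdims=True)
    transition = cooc / np.maximum(row_sums, 1e-12)
    dist = np.zeros(cooc.shape[0])
    dist[word_id] = 1.0
    for _ in range(t):           # propagate the walk for t steps
        dist = dist @ transition
    return dist                  # noise distribution targeted at word_id
```

Unlike the count-based distribution, this one changes with the training word, which is exactly the targeting that plain negative sampling lacks.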
Results Achieved: GNEG*
Graph-based negative sampling
• Evaluation results
• Total time
  • Entire English Wikipedia corpus ( tokens) trained on the server prevert (50 threads used): 2.5 h of word co-occurrence network (matrix) generation + 8 h of word2vec training
Results Achieved: corpus2graph*
How to generate a large word co-occurrence network within 3 hours?
“Tech Specs”
• Plays well with other graph libraries (“Don’t reinvent the wheel.”)
• NLP-application oriented (built-in tokenizer, stemmer, sentence analyzer…)
• Handles large corpora (e.g. the entire English Wikipedia corpus, tokens), using multiprocessing
• Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
• Fast!
*Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM, in Proceedings of NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US
Results Achieved: corpus2graph*
[Figure: pipeline overview: word co-occurrence network generation with corpus2graph, then network processing with corpus2graph + igraph]
Results Achieved: corpus2graph*
Word Co-occurrence Network Generation
• NLP applications oriented
  • Word processor (built-in): tokenizer, stemmer, replacing numbers & removing punctuation marks and/or stop words
  • User-customized word processor
Example (see the sketch below):
“The history of natural language processing generally started in the 1950s.”
→ “The histori of natur languag process gener start in the 0000s”
(abbreviated in later figures as the nodes h, n, l, p, g, s, 0)
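A sketch of such a built-in word processor, assuming NLTK's Snowball stemmer; it reproduces the slide's example (modulo lowercasing), though corpus2graph's actual implementation may differ:

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def process(sentence, stop_words=frozenset()):
    """Tokenize, stem, replace digits with 0, drop punctuation / stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    out = []
    for tok in tokens:
        tok = re.sub(r"\d", "0", stemmer.stem(tok))
        if tok not in stop_words:
            out.append(tok)
    return out

process("The history of natural language processing generally started in the 1950s.")
# -> ['the', 'histori', 'of', 'natur', 'languag', 'process',
#     'gener', 'start', 'in', 'the', '0000s']
```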
Results Achieved: corpus2graph*
Word Co-occurrence Network Generation
• NLP applications oriented
  • Word processor (built-in or user-customized), as above
  • Sentence analyzer: word pairs of different distances are extracted
  • User-customized sentence analyzer
[Figure: word pairs at distance 1 and distance 2 (d_max = 2) over the sentence h n l p g s 0]
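A minimal sketch of a sentence analyzer extracting word pairs up to d_max (illustrative, not corpus2graph's API):

```python
def word_pairs_with_distance(tokens, d_max=2):
    """Emit (word, word, distance) triples for every pair of tokens
    at most d_max positions apart."""
    for i, w in enumerate(tokens):
        for d in range(1, d_max + 1):
            if i + d < len(tokens):
                yield (w, tokens[i + d], d)

list(word_pairs_with_distance(["h", "n", "l", "p"], d_max=2))
# -> [('h','n',1), ('h','l',2), ('n','l',1), ('n','p',2), ('l','p',1)]
```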
Results Achieved: corpus2graph*
Word Co-occurrence Network Generation
• NLP applications oriented
  • Word processor (built-in or user-customized)
  • Sentence analyzer (built-in or user-customized): word pairs of different distances
  • Word pair analyzer:
    • Word pair weight w.r.t. the maximum distance
    • Directed & undirected
    • User-customized word pair analyzer
[Figure: resulting co-occurrence network over the nodes h, n, l, p, g, s, 0, with edge weights of 1]
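And a sketch of a word pair analyzer folding those (word, word, distance) triples into directed or undirected weighted edges; here every occurrence contributes weight 1, but a distance-dependent weighting function could be plugged in instead:

```python
from collections import defaultdict

def merge_pairs(triples, d_max=2, directed=False):
    """Fold (word, word, distance) triples with distance <= d_max
    into a weighted edge list."""
    edges = defaultdict(int)
    for w1, w2, d in triples:
        if d > d_max:
            continue
        key = (w1, w2) if directed else tuple(sorted((w1, w2)))
        edges[key] += 1  # swap in a weight depending on d if desired
    return edges
```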
Results Achieved: corpus2graph*
Word Co-occurrence Network Generation: Multiprocessing
• 3 multiprocessing steps:
  • Word processing
  • Sentence analyzing
  • Word pair merging
• MapReduce-like
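A MapReduce-like sketch of that design, reusing word_pairs_with_distance from the earlier sketch; it simplifies corpus2graph's three steps into one map (per-file analysis, run in parallel) and one reduce (merging the per-file counts):

```python
from collections import Counter
from multiprocessing import Pool

def analyze_file(path):
    """Map step: turn one pre-processed corpus file into local pair counts."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            for w1, w2, d in word_pairs_with_distance(line.split(), d_max=2):
                counts[tuple(sorted((w1, w2)))] += 1
    return counts

def build_network(paths, processes=4):
    """Reduce step: merge per-file counts into one global edge list."""
    total = Counter()
    with Pool(processes) as pool:
        for counts in pool.map(analyze_file, paths):
            total.update(counts)
    return total
```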
Work in Progress
Work in Progress
Use of pre-computed graph information extracted from word co-occurrence networks
• Graph-based negative sampling for fastText
• Word co-occurrence based matrix factorization for word embeddings learning
Work in Progress
Matrix Factorization
• [Levy and Goldberg, 2014] show that skip-gram with negative sampling is implicitly factorizing a word-context matrix.
• Pipeline: word co-occurrence matrix → “enhanced” matrix, max(PMI(w, c) − log k, 0) → SVD
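A compact sketch of that pipeline, assuming a dense co-occurrence matrix in which every word occurs at least once: build the shifted positive PMI matrix max(PMI(w, c) − log k, 0) and factorize it with SVD, taking W = U_d · Σ_d^{1/2} as the word embeddings (the symmetric variant used by Levy and Goldberg):

```python
import numpy as np

def sppmi_svd(cooc, k=5, dim=100):
    """SGNS-as-factorization: SPPMI matrix followed by truncated SVD."""
    total = cooc.sum()
    pw = cooc.sum(axis=1, keepdims=True) / total  # P(w), assumes no zero rows
    pc = cooc.sum(axis=0, keepdims=True) / total  # P(c)
    with np.errstate(divide="ignore"):            # log(0) -> -inf, clipped below
        pmi = np.log((cooc / total) / (pw * pc))
    sppmi = np.maximum(pmi - np.log(k), 0)        # shift by log k, keep positives
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])          # word embedding matrix W
```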