TextGraphs-2018, June 6, 2018
Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
Zheng Zhang, Ruiqing Yin, Pierre Zweigenbaum
Why do we need this tool?
“However, the landscape of natural language processing has changed significantly since the glorious days of TextRank and similar co-occurrence-based approaches. I believe the authors should provide more recent studies confirming that one should still be curious about co-occurrence networks. Otherwise, the present tool looks too specific because people are too focused on word/sense/sentence/other embeddings today.”
Why do we need this tool? (continued)
Recent studies at NAACL 2018 that build on word co-occurrence information:
• Learning Word Embeddings for Low-resource Languages by PU Learning. Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang.
• Filling Missing Paths: Modeling Co-occurrences of Word Pairs and Dependency Paths for Recognizing Lexical Semantic Relations. Koki Washio and Tsuneaki Kato.
Why do we need this tool? (continued)
Example of injecting word co-occurrence networks into word embedding learning:
• GNEG: Graph-Based Negative Sampling for word2vec. Zheng Zhang, Pierre Zweigenbaum. In Proceedings of ACL 2018, Melbourne, Australia.
Why do we need this tool?
Negative sampling vs. graph-based negative sampling:
[Figure: two word_id × word_id heat maps, one of the negative examples distribution lg(f(w)^0.75) and one of the word co-occurrence distribution lg(word co-occurrence count); f(w) is the word count, and 0.75 is word2vec's standard smoothing exponent.]
In graph-based negative sampling, the negative examples distribution is based on the word co-occurrence network (matrix), which is exactly what corpus2graph builds.
What kind of tool? “Tech Specs”
• Works well with other graph libraries (“Don't reinvent the wheel.”)
• NLP-application oriented (built-in tokenizer, stemmer, sentence analyzer…)
• Handles large corpora (e.g. the entire English Wikipedia corpus, ~2.19 billion tokens, by using multiprocessing)
• Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
• Fast!
[Pipeline: corpus2graph handles word co-occurrence network generation; network processing can be done by corpus2graph itself or handed off to igraph.]
Word Co-occurrence Network Generation: NLP applications oriented
Word level:
• Word processor (built-in): tokenizer, stemmer, replacing numbers and removing punctuation marks and/or stop words (a sketch follows below)
• User-customized word processor
Example:
The history of natural language processing generally started in the 1950s.
→ The histori of natur languag process gener start in the 0000s
(in the following network figures, each kept word is shown by its initial: h n l p g s 0)
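A minimal sketch of such a word-processing step, assuming NLTK is installed; the regex tokenizer and the toy stop-word list are illustrative assumptions, not corpus2graph's actual implementation:

    import re
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"the", "of", "in"}  # toy list; a real run would use a full one
    stemmer = PorterStemmer()

    def process(sentence, remove_stop_words=True):
        # Tokenize on runs of letters/digits, which also strips punctuation.
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        # Replace every digit with 0 so all numbers collapse to one pattern.
        tokens = [re.sub(r"\d", "0", t) for t in tokens]
        if remove_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        return [stemmer.stem(t) for t in tokens]

    print(process("The history of natural language processing generally started in the 1950s."))
    # ['histori', 'natur', 'languag', 'process', 'gener', 'start', '0000']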
Word Co-occurrence Network Generation: NLP applications oriented
Sentence level:
• Word pairs of different distances are extracted by the sentence analyzer (see the sketch below)
• User-customized sentence analyzer
Example with d_max = 2 over the processed sentence h n l p g s 0: the analyzer extracts pairs at distance 1 (adjacent words, e.g. (h, n)) and at distance 2 (one word apart, e.g. (h, l)).
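A sketch of a sentence analyzer under those assumptions (the function name extract_pairs is ours, not the library's):

    def extract_pairs(tokens, d_max=2):
        """Yield (word, context_word, distance) for every pair up to d_max apart."""
        for i, word in enumerate(tokens):
            for d in range(1, d_max + 1):
                if i + d < len(tokens):
                    yield word, tokens[i + d], d

    # list(extract_pairs(["h", "n", "l", "p"], d_max=2))
    # -> [('h','n',1), ('h','l',2), ('n','l',1), ('n','p',2), ('l','p',1)]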
Word Co-occurrence Network Generation: NLP applications oriented
Word pair level:
• Word pair analyzer (a sketch follows below)
• Word pair weight w.r.t. the maximum distance; with d_max = 2 and unit coefficients:
  weight = 1 × count(distance = 1) + 1 × count(distance = 2)
• Directed & undirected networks
• User-customized word pair analyzer
[Figure: the resulting undirected network over the nodes h, n, l, p, g, s, 0, with edge weights computed as above.]
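A sketch of a word pair analyzer that aggregates the (word, word, distance) triples from the previous sketch into weighted edges; the coefficient map and function name are assumptions for illustration:

    from collections import Counter

    def build_edges(pairs, coeffs=None, directed=False):
        """Aggregate (w1, w2, distance) triples into weighted edges."""
        # Each distance gets a coefficient; {1: 1, 2: 1} reproduces
        # weight = 1 * count(distance=1) + 1 * count(distance=2).
        coeffs = coeffs or {1: 1, 2: 1}
        weights = Counter()
        for w1, w2, d in pairs:
            if d in coeffs:
                edge = (w1, w2) if directed else tuple(sorted((w1, w2)))
                weights[edge] += coeffs[d]
        return weights

    tokens = ["h", "n", "l", "p", "g", "s", "0"]
    print(build_edges(extract_pairs(tokens, d_max=2)))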
Word Co-occurrence Network Generation: Multiprocessing
Three multiprocessing steps:
• Word processing
• Sentence analyzing
• Word pair merging
The pipeline is MapReduce-like: corpus files are processed in parallel, and the per-file word pair counts are then merged (a sketch follows below).
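A MapReduce-like sketch using Python's multiprocessing module; it reuses the hypothetical process and extract_pairs helpers from the earlier sketches and is not the tool's actual code:

    from collections import Counter
    from multiprocessing import Pool

    def count_pairs_in_file(path):
        # Map step: one corpus file -> word pair counts.
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = process(line)                       # word processing
                for w1, w2, _ in extract_pairs(tokens, 2):   # sentence analyzing
                    counts[tuple(sorted((w1, w2)))] += 1
        return counts

    def count_pairs(paths, process_num=4):
        # Reduce step: merge the per-file counts into one network.
        total = Counter()
        with Pool(process_num) as pool:
            for counts in pool.map(count_pairs_in_file, paths):
                total.update(counts)
        return total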
Word Co-occurrence Network Generation: Grid search
• Word pair weights for different maximum distances (d_max)
• Reuse of the intermediate data:
  • 1st step: numerical-id-encoded text files after word processing
  • 2nd step: separate word pair files of different distances for each text file
  • 2nd step: distinct word counts
Because pairs are stored per distance, trying a new d_max only requires re-merging existing files rather than reprocessing the corpus (see the sketch below).
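A sketch of that reuse, assuming one intermediate file per distance with "word1 word2 count" lines; the file format and function names here are assumptions, not the tool's documented layout:

    from collections import Counter

    def load_distance_file(path):
        # One intermediate file holds the pair counts for a single distance.
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                w1, w2, c = line.split()
                counts[(w1, w2)] += int(c)
        return counts

    def network_for(d_max, files_by_distance):
        # Merge the per-distance files; the corpus itself is never reread.
        total = Counter()
        for d in range(1, d_max + 1):
            total.update(load_distance_file(files_by_distance[d]))
        return total

    # Grid search over window sizes reuses the same intermediate files:
    # for d_max in (2, 3, 5):
    #     net = network_for(d_max, {1: "d1.txt", 2: "d2.txt", 3: "d3.txt"})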
Word Co-occurrence Network Generation: Speed
[Table: word co-occurrence network generation speed in seconds, corpus2graph vs. the baseline.]
• The baseline: processing the corpus sentence by sentence, extracting word pairs and adding them to the graph as edges through graph libraries, on a single core.
• Why is corpus2graph sometimes slower than the baseline when using NetworkX? On a small corpus with one core, corpus2graph is slower; on a large corpus with multiple cores, it is much faster.
• Example: the entire English Wikipedia dump from April 2017 (~2.19 billion tokens), on 50 logical cores of a server with 4 Intel Xeon E5-4620 processors: ~2.5 hours.
Word Co-occurrence Network Processing
• Networks and matrices are interchangeable (see the sketch below)
[Table: graph loading & transition matrix calculation speed in seconds.]
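A minimal sketch of that interchangeability using NetworkX and NumPy; corpus2graph itself can also hand networks to igraph, and the row normalization shown here is just the standard way to get a transition matrix:

    import numpy as np
    import networkx as nx

    # A small weighted co-occurrence network.
    g = nx.Graph()
    g.add_weighted_edges_from([("h", "n", 2), ("n", "l", 1), ("l", "p", 3)])

    # Network -> matrix: weighted adjacency matrix.
    nodes = list(g.nodes)
    adj = nx.to_numpy_array(g, nodelist=nodes, weight="weight")

    # Matrix -> transition matrix: row-normalize so that row i gives the
    # random-walk step probabilities from node i.
    transition = adj / adj.sum(axis=1, keepdims=True)

    # Matrix -> network closes the loop.
    g2 = nx.from_numpy_array(adj)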
Open source: https://github.com/zzcoolj/corpus2graph

graph_from_corpus all
    [--max_window_size=<max_window_size> --process_num=<process_num>
     --min_count=<min_count> --max_vocab_size=<max_vocab_size>
     --safe_files_number_per_processor=<safe_files_number_per_processor>]
    <data_dir> <output_dir>
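For example, a run over a directory of Wikipedia text files might look like this, using only the options listed above; the argument values are illustrative, not recommendations:

    graph_from_corpus all --max_window_size=5 --process_num=50 --min_count=5 ./wiki_txt/ ./output/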
Future work
• Word co-occurrence network generation:
  • “Desktop mode”: less memory consumption and fewer cores, but also less suited to grid search.
• Word co-occurrence network processing:
  • Support for more graph processing methods
  • GPU mode
Thanks for your attention!