TextGraphs-2018, June 6, 2018
Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
Zheng Zhang, Ruiqing Yin, Pierre Zweigenbaum
Why do we need this tool?
“However, the landscape of natural language processing has changed significantly since the glorious days of TextRank and similar co-occurrence-based approaches. I believe the authors should provide more recent studies confirming that one should still be curious about co-occurrence networks. Otherwise, the present tool looks too specific because people are too focused on word/sense/sentence/other embeddings today.”
Why do we need this tool? (continued)
Recent studies at NAACL 2018 that build on word co-occurrence information:
• Learning Word Embeddings for Low-resource Languages by PU Learning. Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang.
• Filling Missing Paths: Modeling Co-occurrences of Word Pairs and Dependency Paths for Recognizing Lexical Semantic Relations. Koki Washio and Tsuneaki Kato.
Why do we need this tool? (continued)
Example of injecting word co-occurrence networks into word embedding learning:
• GNEG: Graph-Based Negative Sampling for word2vec. Zheng Zhang, Pierre Zweigenbaum. In Proceedings of ACL 2018, Melbourne, Australia.
Why do we need this tool?
Negative sampling vs. graph-based negative sampling:
[Figure: two word_id × word_id heat maps, one of the negative examples distribution lg(f(w)^0.75) and one of the word co-occurrence distribution lg(word co-occurrence count); f(w) is the word count, and 0.75 is word2vec's standard smoothing exponent.]
In graph-based negative sampling, the negative examples distribution is based on the word co-occurrence network (matrix), which is exactly what corpus2graph builds.
What kind of tool? “Tech Specs”
• Works well with other graph libraries (“Don't reinvent the wheel.”)
• NLP-application oriented (built-in tokenizer, stemmer, sentence analyzer…)
• Handles large corpora (e.g. the entire English Wikipedia corpus, ~2.19 billion tokens, by using multiprocessing)
• Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
• Fast!
[Pipeline: corpus2graph handles word co-occurrence network generation; network processing can be done by corpus2graph itself or handed off to igraph.]
Word Co-occurrence Network Generation: NLP applications oriented
Word level:
• Word processor (built-in): tokenizer, stemmer, replacing numbers and removing punctuation marks and/or stop words (a sketch follows below)
• User-customized word processor
Example:
The history of natural language processing generally started in the 1950s.
→ The histori of natur languag process gener start in the 0000s
(in the following network figures, each kept word is shown by its initial: h n l p g s 0)
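A minimal sketch of such a word-processing step, assuming NLTK is installed; the regex tokenizer and the toy stop-word list are illustrative assumptions, not corpus2graph's actual implementation:

    import re
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"the", "of", "in"}  # toy list; a real run would use a full one
    stemmer = PorterStemmer()

    def process(sentence, remove_stop_words=True):
        # Tokenize on runs of letters/digits, which also strips punctuation.
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        # Replace every digit with 0 so all numbers collapse to one pattern.
        tokens = [re.sub(r"\d", "0", t) for t in tokens]
        if remove_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        return [stemmer.stem(t) for t in tokens]

    print(process("The history of natural language processing generally started in the 1950s."))
    # ['histori', 'natur', 'languag', 'process', 'gener', 'start', '0000']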
Word Co-occurrence Network Generation: NLP applications oriented
Sentence level:
• Word pairs of different distances are extracted by the sentence analyzer (see the sketch below)
• User-customized sentence analyzer
Example with d_max = 2 over the processed sentence h n l p g s 0: the analyzer extracts pairs at distance 1 (adjacent words, e.g. (h, n)) and at distance 2 (one word apart, e.g. (h, l)).
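A sketch of a sentence analyzer under those assumptions (the function name extract_pairs is ours, not the library's):

    def extract_pairs(tokens, d_max=2):
        """Yield (word, context_word, distance) for every pair up to d_max apart."""
        for i, word in enumerate(tokens):
            for d in range(1, d_max + 1):
                if i + d < len(tokens):
                    yield word, tokens[i + d], d

    # list(extract_pairs(["h", "n", "l", "p"], d_max=2))
    # -> [('h','n',1), ('h','l',2), ('n','l',1), ('n','p',2), ('l','p',1)]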
Word Co-occurrence Network Generation: NLP applications oriented
Word pair level:
• Word pair analyzer (a sketch follows below)
• Word pair weight w.r.t. the maximum distance; with d_max = 2 and unit coefficients:
  weight = 1 × count(distance = 1) + 1 × count(distance = 2)
• Directed & undirected networks
• User-customized word pair analyzer
[Figure: the resulting undirected network over the nodes h, n, l, p, g, s, 0, with edge weights computed as above.]
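A sketch of a word pair analyzer that aggregates the (word, word, distance) triples from the previous sketch into weighted edges; the coefficient map and function name are assumptions for illustration:

    from collections import Counter

    def build_edges(pairs, coeffs=None, directed=False):
        """Aggregate (w1, w2, distance) triples into weighted edges."""
        # Each distance gets a coefficient; {1: 1, 2: 1} reproduces
        # weight = 1 * count(distance=1) + 1 * count(distance=2).
        coeffs = coeffs or {1: 1, 2: 1}
        weights = Counter()
        for w1, w2, d in pairs:
            if d in coeffs:
                edge = (w1, w2) if directed else tuple(sorted((w1, w2)))
                weights[edge] += coeffs[d]
        return weights

    tokens = ["h", "n", "l", "p", "g", "s", "0"]
    print(build_edges(extract_pairs(tokens, d_max=2)))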
Word Co-occurrence Network Generation: Multiprocessing
Three multiprocessing steps:
• Word processing
• Sentence analyzing
• Word pair merging
The pipeline is MapReduce-like: corpus files are processed in parallel, and the per-file word pair counts are then merged (a sketch follows below).
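A MapReduce-like sketch using Python's multiprocessing module; it reuses the hypothetical process and extract_pairs helpers from the earlier sketches and is not the tool's actual code:

    from collections import Counter
    from multiprocessing import Pool

    def count_pairs_in_file(path):
        # Map step: one corpus file -> word pair counts.
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = process(line)                       # word processing
                for w1, w2, _ in extract_pairs(tokens, 2):   # sentence analyzing
                    counts[tuple(sorted((w1, w2)))] += 1
        return counts

    def count_pairs(paths, process_num=4):
        # Reduce step: merge the per-file counts into one network.
        total = Counter()
        with Pool(process_num) as pool:
            for counts in pool.map(count_pairs_in_file, paths):
                total.update(counts)
        return total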
Word Co-occurrence Network Generation: Grid search
• Word pair weights for different maximum distances (d_max)
• Reuse of the intermediate data:
  • 1st step: numerical-id-encoded text files after word processing
  • 2nd step: separate word pair files of different distances for each text file
  • 2nd step: distinct word counts
Because pairs are stored per distance, trying a new d_max only requires re-merging existing files rather than reprocessing the corpus (see the sketch below).
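A sketch of that reuse, assuming one intermediate file per distance with "word1 word2 count" lines; the file format and function names here are assumptions, not the tool's documented layout:

    from collections import Counter

    def load_distance_file(path):
        # One intermediate file holds the pair counts for a single distance.
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                w1, w2, c = line.split()
                counts[(w1, w2)] += int(c)
        return counts

    def network_for(d_max, files_by_distance):
        # Merge the per-distance files; the corpus itself is never reread.
        total = Counter()
        for d in range(1, d_max + 1):
            total.update(load_distance_file(files_by_distance[d]))
        return total

    # Grid search over window sizes reuses the same intermediate files:
    # for d_max in (2, 3, 5):
    #     net = network_for(d_max, {1: "d1.txt", 2: "d2.txt", 3: "d3.txt"})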
Word Co-occurrence Network Generation: Speed
[Table: word co-occurrence network generation speed in seconds, corpus2graph vs. the baseline.]
• The baseline: processing the corpus sentence by sentence, extracting word pairs and adding them to the graph as edges through graph libraries, on a single core.
• Why is corpus2graph sometimes slower than the baseline when using NetworkX? On a small corpus with one core, corpus2graph is slower; on a large corpus with multiple cores, it is much faster.
• Example: the entire English Wikipedia dump from April 2017 (~2.19 billion tokens), on 50 logical cores of a server with 4 Intel Xeon E5-4620 processors: ~2.5 hours.
Word Co-occurrence Network Processing
• Networks and matrices are interchangeable (see the sketch below)
[Table: graph loading & transition matrix calculation speed in seconds.]
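A minimal sketch of that interchangeability using NetworkX and NumPy; corpus2graph itself can also hand networks to igraph, and the row normalization shown here is just the standard way to get a transition matrix:

    import numpy as np
    import networkx as nx

    # A small weighted co-occurrence network.
    g = nx.Graph()
    g.add_weighted_edges_from([("h", "n", 2), ("n", "l", 1), ("l", "p", 3)])

    # Network -> matrix: weighted adjacency matrix.
    nodes = list(g.nodes)
    adj = nx.to_numpy_array(g, nodelist=nodes, weight="weight")

    # Matrix -> transition matrix: row-normalize so that row i gives the
    # random-walk step probabilities from node i.
    transition = adj / adj.sum(axis=1, keepdims=True)

    # Matrix -> network closes the loop.
    g2 = nx.from_numpy_array(adj)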
Open source: https://github.com/zzcoolj/corpus2graph

graph_from_corpus all
    [--max_window_size=<max_window_size> --process_num=<process_num>
     --min_count=<min_count> --max_vocab_size=<max_vocab_size>
     --safe_files_number_per_processor=<safe_files_number_per_processor>]
    <data_dir> <output_dir>
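For example, a run over a directory of Wikipedia text files might look like this, using only the options listed above; the argument values are illustrative, not recommendations:

    graph_from_corpus all --max_window_size=5 --process_num=50 --min_count=5 ./wiki_txt/ ./output/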
Future work
• Word co-occurrence network generation:
  • “Desktop mode”: less memory consumption and fewer cores, but also less suited to grid search.
• Word co-occurrence network processing:
  • Support for more graph processing methods
  • GPU mode
Thanks for your attention!