Automatic construction of distributional thesaurus (for multiple languages)
Zheng ZHANG, 1st-year PhD student, ILES, LIMSI
28/03/2017
What is a distributional thesaurus?
• For a given input word, a distributional thesaurus identifies semantically similar words, based on the assumption that similar words share a similar distribution.
• Distributional assumption: in practice, two words are considered similar if their occurrences share similar contexts.
Ref. Vincent Claveau, Ewa Kijak. Distributional Thesauri for Information Retrieval and vice versa.
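To make the distributional assumption concrete, here is a minimal sketch, not the method of the cited work. It assumes the context of a word is simply every other word in the same sentence and that similarity is the cosine between raw co-occurrence counts. The names `context_vectors`, `cosine`, `most_similar` and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences):
    """Map each word to a Counter of the other words seen in the same sentence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    # Return a plain dict so that unknown words raise KeyError instead of
    # silently creating empty vectors.
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(word, vectors, top_n=3):
    """Rank all other words by cosine similarity to `word` (one thesaurus entry)."""
    target = vectors[word]
    scores = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Toy corpus, made up for illustration only.
TOY_CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

if __name__ == "__main__":
    print(most_similar("cat", context_vectors(TOY_CORPUS)))
```

A real thesaurus would typically weight the counts (for example with PMI) and restrict contexts to a window or to syntactic relations; this sketch only shows the ranking principle.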
Why do we need it?
• It is useful for alleviating data sparseness in many NLP applications.
• It is useful for completing lexical resources.
Ref. Enrique Henestroza Anguiano, Pascal Denis. FreDist: Automatic construction of distributional thesauri for French.
Contexts
• These contexts are typically co-occurring words in a limited window around the considered word, or syntactically linked words.
Ref. http://nlp.stanford.edu:8080/corenlp/process
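The window-based kind of context can be sketched as follows. This is an illustration under assumptions: the window is symmetric, counted in tokens, and the function name `window_contexts` is hypothetical. Syntactic contexts would need a dependency parser and are not shown.

```python
def window_contexts(tokens, window=2):
    """Yield (word, context_word, offset) for every pair at most `window` tokens apart."""
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                # offset < 0: context word is to the left; offset > 0: to the right.
                yield word, tokens[j], j - i


if __name__ == "__main__":
    for triple in window_contexts("this is an example".split(), window=1):
        print(triple)
```

Keeping the offset makes it possible to distinguish left from right contexts, or to drop it and treat the window as an unordered bag.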
A new context: Graph-of-words
• A graph whose vertices represent the unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window.
• Example: “This is an example about how to generate a graph.” (window size = 4)
Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction. https://safetyapp.shinyapps.io/GoWvis/
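A minimal sketch of building such a graph for the example sentence. Two assumptions are made here that the slide does not state: a window of size w links each term to the w - 1 terms that follow it, and the graph is undirected and unweighted with no stop-word removal. The function name `graph_of_words` is hypothetical.

```python
import re


def graph_of_words(text, window=4):
    """Build an undirected, unweighted graph-of-words as {term: set of neighbours}."""
    tokens = re.findall(r"\w+", text.lower())
    adjacency = {}
    for i, term in enumerate(tokens):
        adjacency.setdefault(term, set())
        # Assumption: a window of size w covers the current term and the next w - 1 terms.
        for other in tokens[i + 1 : i + window]:
            if other != term:  # no self-loops when a term repeats inside the window
                adjacency[term].add(other)
                adjacency.setdefault(other, set()).add(term)
    return adjacency


if __name__ == "__main__":
    graph = graph_of_words("This is an example about how to generate a graph.", window=4)
    for term, neighbours in graph.items():
        print(term, sorted(neighbours))
```

A directed variant, which keeps word order, would only add the edge from `term` to `other` instead of both directions.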
Graph attributes: K-core
• A subgraph $H_k = (V', E')$, induced by the subset of vertices $V' \subseteq V$ (and a fortiori by the subset of edges $E' \subseteq E$), is called a k-core or a core of order k iff $\forall v \in V'$, $\deg_{H_k}(v) \geq k$ and $H_k$ is the maximal subgraph with this property, i.e. it cannot be augmented without losing this property.
• In other words, the k-core of a graph corresponds to the maximal connected subgraph whose vertices are at least of degree k within the subgraph.
Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction.
Ref. Text Mining – an introduction, Michalis Vazirgiannis, 2017 Data Science Winter School, Beijing, China
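The definition can be illustrated with the standard peeling procedure: repeatedly delete vertices whose remaining degree is below k until none is left. This is a sketch on a plain adjacency dictionary; the names `k_core` and `main_core` and the toy graph are hypothetical. Note that peeling returns the maximal vertex set with minimum degree k and does not split it into connected components.

```python
def k_core(adjacency, k):
    """Return the vertex set of the k-core of an undirected graph {vertex: set of neighbours}."""
    degrees = {vertex: len(neighbours) for vertex, neighbours in adjacency.items()}
    alive = set(adjacency)
    to_remove = [vertex for vertex in alive if degrees[vertex] < k]
    while to_remove:
        vertex = to_remove.pop()
        if vertex not in alive:
            continue  # already peeled through an earlier queue entry
        alive.remove(vertex)
        for neighbour in adjacency[vertex]:
            if neighbour in alive:
                degrees[neighbour] -= 1
                if degrees[neighbour] < k:
                    to_remove.append(neighbour)
    return alive


def main_core(adjacency):
    """Return (k, vertices) for the non-empty core of highest order."""
    k, core = 0, set(adjacency)
    while True:
        candidate = k_core(adjacency, k + 1)
        if not candidate:
            return k, core
        k, core = k + 1, candidate


# Toy graph: a triangle a-b-c with a pendant vertex d attached to c.
TOY_GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

if __name__ == "__main__":
    print(main_core(TOY_GRAPH))
```

In the toy graph, d has degree 1 and is peeled when k = 2, which leaves the triangle; no vertex survives for k = 3.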
Why might graph-of-words be a good choice?
• Graph-of-words:
  • Takes into account word co-occurrence and, optionally, word order (compared with bag-of-words).
• K-core:
  • Within a core, all neighbors contribute equally to the subgraph (compared with centrality, which is used in PageRank and HITS).
  • K-cores are adaptive.
  • The main core has been shown to perform well in information retrieval.
Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction.
Difficulty: optimization for Big data
• Texts: multiprocessing
  • Encode text with local ids
  • Merge the local id-word dictionaries to get a universal id-word dictionary
  • Transfer the locally encoded text
• “MapReduce-like” multiprocessing to prepare edge files
  • Example: “This is an example about how to generate a graph.” (window size = 2)
  • Edges of window size n = edges of distance 2 + … + edges of distance n
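The pipeline on this slide could look roughly like the sketch below. It rests on several assumptions that the slide does not confirm: each worker encodes its own chunk independently, "transfer" is read as re-encoding local ids into universal ids, and "distance d" is read as tokens d - 1 positions apart, so that distance 2 means adjacent words (matching window size 2). All function names are hypothetical.

```python
from collections import Counter


def encode_locally(tokens):
    """Encode one chunk with its own ids; return (local id -> word, encoded chunk)."""
    word_to_id = {}
    encoded = [word_to_id.setdefault(word, len(word_to_id)) for word in tokens]
    id_to_word = {local_id: word for word, local_id in word_to_id.items()}
    return id_to_word, encoded


def merge_dictionaries(local_results):
    """Merge local dictionaries into one universal word -> id map and re-encode the chunks."""
    universal = {}
    recoded = []
    for id_to_word, encoded in local_results:
        local_to_universal = {
            local_id: universal.setdefault(word, len(universal))
            for local_id, word in id_to_word.items()
        }
        recoded.append([local_to_universal[local_id] for local_id in encoded])
    return universal, recoded


def edges_at_distance(task):
    """Map step: count edges between tokens exactly `distance - 1` positions apart."""
    encoded, distance = task
    return Counter(zip(encoded, encoded[distance - 1 :]))


def edges_for_window(recoded_chunks, window, map_fn=map):
    """Reduce step: edges of window size n = edges of distance 2 + ... + distance n."""
    tasks = [(chunk, d) for chunk in recoded_chunks for d in range(2, window + 1)]
    total = Counter()
    for partial in map_fn(edges_at_distance, tasks):
        total.update(partial)
    return total


if __name__ == "__main__":
    chunks = [["this", "is", "an", "example"], ["this", "is", "a", "graph"]]
    universal, recoded = merge_dictionaries([encode_locally(chunk) for chunk in chunks])
    print(universal)
    print(edges_for_window(recoded, window=2))
```

Because `edges_at_distance` is a top-level function taking a single argument, `map_fn` can be replaced by the `map` method of a `multiprocessing.Pool` to run the map step in parallel. Computing each distance separately also means that the edge counts for window size n can be reused when moving to window size n + 1.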
Multiple languages (ideas)
• Use a small dictionary to generate a mixed text
• Find common graph patterns across multiple languages
Ref. Stephan Gouws, Anders Søgaard. Simple task-specific bilingual word embeddings.
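One possible reading of the first idea, sketched under assumptions: each token that has an entry in a small bilingual dictionary is replaced by one of its translations with some probability, which yields a single mixed-language text. The function `mix_text`, the `replace_prob` parameter and the toy dictionary are hypothetical and are not taken from the cited paper.

```python
import random


def mix_text(tokens, bilingual_dict, replace_prob=0.5, rng=None):
    """Randomly replace tokens by a translation from `bilingual_dict` ({word: [translations]})."""
    rng = rng or random.Random()
    mixed = []
    for token in tokens:
        translations = bilingual_dict.get(token)
        if translations and rng.random() < replace_prob:
            mixed.append(rng.choice(translations))
        else:
            mixed.append(token)
    return mixed


# Toy English -> French dictionary, made up for illustration only.
TOY_DICT = {"cat": ["chat"], "milk": ["lait"]}

if __name__ == "__main__":
    print(mix_text("the cat drinks milk".split(), TOY_DICT, rng=random.Random(0)))
```

A graph-of-words built on such a mixed text would place words of both languages in the same graph, so translation pairs could end up with similar neighbourhoods.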
Future work
• word2vec: GoW model architecture
• Use graph-of-words for other tasks (e.g. identifying parallel sentences in comparable corpora, BUCC2017 shared task)
• From distributional thesaurus to semantic classes
Thank you