KONVENS 2016, The 13th Conference on Natural Language Processing, 21 September 2016, Bochum, Germany

Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Alexander Panchenko, Johannes Simon, Martin Riedl and Chris Biemann
Technische Universität Darmstadt, LT Group, Computer Science Department, Germany
Summary

▶ An approach to word sense induction and disambiguation.
▶ The method is unsupervised and knowledge-free.
▶ Sense induction by clustering of word similarity networks.
▶ Feature aggregation w.r.t. the induced inventory.
▶ Comparable to the state-of-the-art unsupervised WSD (SemEval’13 participants and various sense embeddings).
▶ Open source implementation: github.com/tudarmstadt-lt/JoSimText
Motivation for Unsupervised Knowledge-Free Word Sense Disambiguation

▶ A word sense disambiguation (WSD) system:
  ▶ Input: a word and its context.
  ▶ Output: a sense of this word.
  ▶ Surveys: Agirre and Edmonds (2007) and Navigli (2009).
▶ Knowledge-based approaches rely on hand-crafted resources, such as WordNet.
▶ Supervised approaches learn from hand-labeled training data, such as SemCor.
▶ Problem 1: hand-crafted lexical resources and training data are expensive to create, often inconsistent and domain-dependent.
▶ Problem 2: these methods assume a fixed sense inventory:
  ▶ senses emerge and disappear over time;
  ▶ different applications require different granularities of the sense inventory.
▶ An alternative route is the unsupervised knowledge-free approach:
  ▶ learn an interpretable sense inventory;
  ▶ learn a disambiguation model.
Contribution

▶ The contribution is a framework that relies on induced sense inventories as a pivot for learning contextual feature representations and disambiguation.
▶ We rely on the JoBimText framework for distributional semantics (Biemann and Riedl, 2013), adding word sense disambiguation functionality on top of it.
▶ The advantage of our method, compared to prior art, is that it can integrate several types of context features in an unsupervised way.
▶ The method achieves state-of-the-art results in unsupervised WSD.
Method: Data-Driven Noun Sense Modelling

1. Computation of a distributional thesaurus
   ▶ using distributional semantics
2. Word sense induction
   ▶ using ego-network clustering of related words
3. Building a disambiguation model of the induced senses
   ▶ by feature aggregation w.r.t. the induced sense inventory
Method: Distributional Thesaurus of Nouns using the JoBimText Framework

▶ A distributional thesaurus (DT) is a graph of word similarities, such as “(Python, Java, 0.781)”.
▶ We used the JoBimText framework (Biemann and Riedl, 2013):
  ▶ efficient computation of nearest neighbours for all words;
  ▶ providing state-of-the-art performance (Riedl, 2016).
▶ For each noun in the corpus, get the 200 most similar nouns.
Method: Distributional Thesaurus of Nouns using the JoBimText Framework (cont.)

▶ For each noun in the corpus, get the l = 200 most similar nouns:
  1. Extract word, feature and word-feature frequencies.
     ▶ We use dependency-based features, such as amod(•, grilled) or prep_for(•, dinner), obtained with the Malt parser (Nivre et al., 2007).
     ▶ Dependencies are collapsed in the same way as the Stanford dependencies.
  2. Discard rare words, features and word-feature pairs (t < 3).
  3. Normalize word-feature scores using the Local Mutual Information (LMI):
     \mathrm{LMI}(i,j) = f_{ij} \cdot \mathrm{PMI}(i,j) = f_{ij} \cdot \log \frac{f_{ij} \cdot \sum_{i,j} f_{ij}}{f_{i*} \cdot f_{*j}}
  4. Rank the features of each word by LMI.
  5. Prune all but the p = 1000 most significant features per word.
  6. Compute word similarities as the number of common features of two words:
     \mathrm{sim}(t_i, t_j) = |\{k : f_{ik} > 0 \wedge f_{jk} > 0\}|
  7. Return the l = 200 most related words per word.
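A minimal sketch of steps 3–7 under simplifying assumptions: toy in-memory counts instead of the distributed JoBimText implementation, rare-pair filtering only, and illustrative function names.

```python
import math
from collections import Counter

def lmi(f_wf, f_w, f_f, total):
    """Local Mutual Information of a (word, feature) pair from raw counts."""
    return f_wf * math.log((f_wf * total) / (f_w * f_f))

def top_features(word_feature_counts, p=1000, min_count=3):
    """Keep the p highest-LMI features per word (steps 2-5)."""
    word_counts, feat_counts = Counter(), Counter()
    for (w, f), c in word_feature_counts.items():
        word_counts[w] += c
        feat_counts[f] += c
    total = sum(word_feature_counts.values())
    per_word = {}
    for (w, f), c in word_feature_counts.items():
        if c < min_count:
            continue  # discard rare word-feature pairs
        per_word.setdefault(w, []).append((lmi(c, word_counts[w], feat_counts[f], total), f))
    return {w: {f for _, f in sorted(feats, reverse=True)[:p]} for w, feats in per_word.items()}

def most_similar(word, feature_sets, l=200):
    """Similarity = number of shared pruned features (steps 6-7)."""
    target = feature_sets[word]
    sims = [(len(target & feats), other)
            for other, feats in feature_sets.items() if other != word]
    return sorted(sims, reverse=True)[:l]
```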
Method: Noun Sense Induction via Ego-Network Clustering

▶ Figure: the "furniture" and the "data" sense clusters of the word "table".
▶ Graph clustering using the Chinese Whispers algorithm (Biemann, 2006).
Method: Noun Sense Induction via Ego-Network Clustering (cont.)

▶ Process one word per iteration.
▶ Construct an ego-network of the word:
  ▶ use dependency-based distributional word similarities;
  ▶ the ego-network size (N): the number of related words;
  ▶ the ego-network connectivity (n): how strongly the neighbours are related; this parameter controls the granularity of the sense inventory.
▶ Graph clustering using the Chinese Whispers algorithm.
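A compact sketch of ego-network construction from DT neighbours followed by Chinese Whispers clustering. The parameters N and n follow the slide; the data layout (dt maps a word to its list of (neighbour, similarity) pairs) and the connectivity rule are illustrative assumptions.

```python
import random
from collections import defaultdict

def ego_network(word, dt, N=200, n=100):
    """Nodes: the N nearest neighbours of `word`; two neighbours are connected
    if one is among the n nearest neighbours of the other (weight = similarity)."""
    nodes = [w for w, _ in dt[word][:N]]
    edges = defaultdict(dict)
    for u in nodes:
        for v, sim in dt.get(u, [])[:n]:
            if v in nodes and v != u:
                edges[u][v] = sim
                edges[v][u] = sim
    return nodes, edges

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """Chinese Whispers (Biemann, 2006): each node adopts the label with the
    highest total edge weight among its neighbours."""
    random.seed(seed)
    labels = {v: i for i, v in enumerate(nodes)}  # one class per node initially
    for _ in range(iterations):
        order = nodes[:]
        random.shuffle(order)
        for v in order:
            scores = defaultdict(float)
            for u, w in edges[v].items():
                scores[labels[u]] += w
            if scores:
                labels[v] = max(scores, key=scores.get)
    clusters = defaultdict(set)
    for v, lab in labels.items():
        clusters[lab].add(v)
    return list(clusters.values())  # each cluster is one induced sense
```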
Method: Disambiguation of Induced Noun Senses

▶ Learning a disambiguation model P(s_i | C) for each of the induced senses s_i ∈ S of the target word w in context C = {c_1, ..., c_m}.
▶ We use the Naïve Bayes model:
  P(s_i \mid C) = \frac{P(s_i) \prod_{j=1}^{|C|} P(c_j \mid s_i)}{P(c_1, ..., c_m)}
▶ The best sense given the context C:
  s_i^* = \arg\max_{s_i \in S} P(s_i) \prod_{j=1}^{|C|} P(c_j \mid s_i)
Method: Disambiguation of Induced Noun Senses (cont.)

▶ The prior probability of each sense is computed with the largest-cluster heuristic:
  P(s_i) = \frac{|s_i|}{\sum_{s_i \in S} |s_i|}
▶ Sense representations are extracted by aggregating features from all words of the cluster s_i.
▶ Probability of the feature c_j given the sense s_i:
  P(c_j \mid s_i) = \frac{1-\alpha}{\Lambda_i} \sum_{k}^{|s_i|} \lambda_k \frac{f(w_k, c_j)}{f(w_k)} + \alpha
▶ To normalize the score, we divide it by the sum of all the weights: \Lambda_i = \sum_{k}^{|s_i|} \lambda_k.
▶ α is a small number, e.g. 10^{-5}, added for smoothing.
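A minimal sketch of this scoring, assuming the sense clusters (mapping cluster word w_k to its weight λ_k) and the word and word-feature counts have already been extracted; the data structures are illustrative, not the JoSimText format, and log-space is used to avoid underflow.

```python
import math

def sense_priors(clusters):
    """Largest-cluster heuristic: P(s_i) proportional to the cluster size."""
    total = sum(len(c) for c in clusters)
    return [len(c) / total for c in clusters]

def feature_prob(feature, cluster, wf_counts, w_counts, alpha=1e-5):
    """P(c_j | s_i): similarity-weighted average of per-word feature probabilities,
    smoothed with a small alpha. `cluster` maps cluster word -> weight lambda_k."""
    norm = sum(cluster.values())  # Lambda_i
    score = sum(weight * wf_counts.get((word, feature), 0) / w_counts[word]
                for word, weight in cluster.items())
    return (1 - alpha) * score / norm + alpha

def disambiguate(context_features, clusters, wf_counts, w_counts):
    """Return the index of the most probable induced sense for the context."""
    priors = sense_priors(clusters)
    best, best_score = None, float("-inf")
    for i, cluster in enumerate(clusters):
        score = math.log(priors[i])
        for feature in context_features:
            score += math.log(feature_prob(feature, cluster, wf_counts, w_counts))
        if score > best_score:
            best, best_score = i, score
    return best
```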
Method: Disambiguation of Induced Noun Senses (cont.)

▶ To build a WSD model, we need to extract from the corpus:
  1. the distributional thesaurus;
  2. sense clusters;
  3. word-feature frequencies.
▶ Sense representations are obtained by “averaging” the feature representations of the words in each sense cluster.
Feature Extraction: Single Models

▶ The method requires sparse word-feature counts f(w_k, c_j).
▶ We demonstrate the approach on the four following types of features:
  1. Features based on sense clusters: Cluster
     ▶ Features: words from the induced sense clusters.
     ▶ Weights: similarity scores.
  2. Dependency features: Dep_target, Dep_all
     ▶ Features: syntactic dependencies attached to the word, e.g. “subj(•, type)” or “amod(digital, •)”.
     ▶ Weights: LMI scores.
  3. Dependency word features: Dep_word
     ▶ Features: words extracted from all syntactic dependencies attached to a target word. For instance, the dependency “subj(•, write)” yields the feature “write”.
     ▶ Weights: LMI scores.
  4. Trigram features: Trigram_target, Trigram_all
     ▶ Features: pairs of left and right words around the target word, e.g. “typing_•_or” and “digital_•_.”.
     ▶ Weights: LMI scores.
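A small sketch of how the trigram and dependency-word context features could be extracted for a target token. The dependency tuples are assumed to be already produced by a parser; the (relation, head, dependent) format and the sentence markers are illustrative assumptions, not the Malt parser output format.

```python
def trigram_features(tokens, target_index):
    """Pairs of left and right words around the target, e.g. 'typing_•_or'."""
    left = tokens[target_index - 1] if target_index > 0 else "<s>"
    right = tokens[target_index + 1] if target_index + 1 < len(tokens) else "</s>"
    return [f"{left}_\u2022_{right}"]

def dependency_word_features(dependencies, target):
    """Words attached to the target by any dependency relation, e.g. the
    dependency ('subj', 'write', 'user') with target 'user' yields 'write'."""
    feats = []
    for rel, head, dependent in dependencies:
        if head == target:
            feats.append(dependent)
        elif dependent == target:
            feats.append(head)
    return feats

# Illustrative usage:
tokens = ["touch", "typing", "or", "keyboarding"]
print(trigram_features(tokens, 1))                 # ['touch_•_or']
deps = [("subj", "write", "user"), ("amod", "keyboard", "digital")]
print(dependency_word_features(deps, "user"))      # ['write']
```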
Feature Combination: Combined Models

▶ Feature-level combination of features:
  ▶ union of context features of different types, such as dependencies and trigrams;
  ▶ “stacking” of feature spaces.
▶ Meta-level combination of features:
  1. Independent sense classifications by the single models.
  2. Aggregation of their predictions with one of the strategies:
     ▶ Majority: selects the sense s_i chosen by the largest number of single models.
     ▶ Ranks: the predictions of each single model are first ranked by their confidence \hat{P}(s_i | C): the sense most suitable to the context obtains rank one, and so on. The sense with the least sum of ranks is assigned.
     ▶ Sum: assigns the sense with the largest sum of classification confidences, i.e. \sum_i \hat{P}^i(s_i | C), where i is the index of the single model.
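A sketch of the three meta-combination strategies. Each single model is assumed to return a dict mapping a sense id to its confidence \hat{P}(s_i | C); this interface and the sense ids are illustrative, not the JoSimText API.

```python
from collections import Counter

def combine_majority(predictions):
    """Sense chosen by the largest number of single models."""
    votes = Counter(max(p, key=p.get) for p in predictions)
    return votes.most_common(1)[0][0]

def combine_ranks(predictions):
    """Sense with the least sum of ranks (rank 1 = most confident per model)."""
    rank_sums = Counter()
    for p in predictions:
        ordered = sorted(p, key=p.get, reverse=True)
        for rank, sense in enumerate(ordered, start=1):
            rank_sums[sense] += rank
    return min(rank_sums, key=rank_sums.get)

def combine_sum(predictions):
    """Sense with the largest sum of confidences across models."""
    totals = Counter()
    for p in predictions:
        for sense, conf in p.items():
            totals[sense] += conf
    return max(totals, key=totals.get)

# Illustrative usage with two single models and two induced senses:
preds = [{"table#1": 0.7, "table#2": 0.3}, {"table#1": 0.4, "table#2": 0.6}]
print(combine_majority(preds), combine_ranks(preds), combine_sum(preds))
```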
Corpora used for experiments

Corpus      # Tokens        Size       Text Type
Wikipedia   1.863 · 10^9    11.79 GB   encyclopaedic
ukWaC       1.980 · 10^9    12.05 GB   Web pages

Table: Corpora used for training our models.
Results: Evaluation on the “Python-Ruby-Jaguar” (PRJ) dataset: 3 words, 60 contexts, 2 senses per word

▶ A simple dataset: 60 contexts, 2 homonymous senses per word.
▶ The models based on the meta-combinations are not shown for brevity, as they did not improve the performance of the presented models in terms of F-score.
Results: Evaluation on the TWSI dataset: 1012 nouns, 145140 contexts, 2.33 senses per word
Results: the TWSI dataset: effect of the corpus choice on the WSD performance

▶ The 10 best models according to the F-score on the TWSI dataset.
▶ Trained on the Wikipedia and ukWaC corpora.
Results: Evaluation on the SemEval 2013 Task 13 dataset: 20 nouns, 1848 contexts
Conclusion

▶ An approach to word sense induction and disambiguation.
▶ The method is unsupervised and knowledge-free.
▶ Sense induction by clustering of word similarity networks.
▶ Feature aggregation w.r.t. the induced inventory.
▶ Comparable to the state-of-the-art unsupervised WSD (SemEval’13 participants and various sense embeddings).
▶ Open source implementation: github.com/tudarmstadt-lt/JoSimText
Thank you!