A Study of Hybrid Similarity Measures for Semantic Relation Extraction
Alexander Panchenko and Olga Morozova
Center for Natural Language Processing (CENTAL), Université catholique de Louvain, Belgium
alexander.panchenko@student.uclouvain.be, olga.morozova@uclouvain.be
April 23, 2012
Plan
Introduction
Methodology
Evaluation
Results
Conclusion
Semantic Relations
• Semantic relations are useful for NLP and IR applications:
• Query expansion (Hsu et al., 2006)
• QA systems (Sun et al., 2005)
• Text categorization (Tikk et al., 2003)
• Word Sense Disambiguation (Patwardhan et al., 2003)
• Semantic resources: thesauri, ontologies, synonymy dictionaries, WordNets, . . .
• In this work we consider the following relation types:
• synonyms: ⟨car, SYN, vehicle⟩, ⟨animal, SYN, beast⟩
• hypernyms: ⟨car, HYPER, Jeep Cherokee⟩, ⟨animal, HYPER, crocodile⟩
• co-hyponyms (terms sharing a common hypernym): ⟨Toyota Land Cruiser, COHYPER, Jeep Cherokee⟩
Motivation
Figure: Semantic relations in the EuroVoc thesaurus.
• Manual construction of relations:
• (+) Precision
• (–) Very expensive and time-consuming
• Existing relation extraction methods:
• (+) No or very little manual labor
• (–) Not as precise as manual construction
• ⇒ development of new relation extraction methods:
• Input: terms C
• Output: semantic relations R̂ ⊂ C × C
The State of the Art
• A multitude of complementary measures have been proposed to extract synonyms, hypernyms, and co-hyponyms
• Most of them are based on one of the 5 key approaches:
1. distributional analysis (Lin, 1998b)
2. Web as a corpus (Cilibrasi and Vitanyi, 2007)
3. lexico-syntactic patterns (Bollegala et al., 2007)
4. semantic networks (Resnik, 1995)
5. definitions of dictionaries or encyclopedias (Zesch et al., 2008a)
• Some attempts have been made to combine measures (Curran, 2002; Cederberg and Widdows, 2003; Mihalcea et al., 2006; Agirre et al., 2009; Yang and Callan, 2009)
• However, most studies still do not take all 5 existing extraction approaches into account.
Contributions
• A systematic analysis of
• 16 baseline similarity measures covering the 5 key extraction principles
• their combinations with 8 fusion methods and 3 techniques for combination set selection
• We are the first to propose hybrid similarity measures based on all 5 key extraction approaches:
1. distributional analysis
2. Web as a corpus
3. lexico-syntactic patterns
4. semantic networks
5. definitions of dictionaries or encyclopedias
• The best hybrid measure found combines 15 baseline measures with Logistic Regression
Plan
Introduction
Methodology
  Similarity-based Relation Extraction
  Single Similarity Measures
  Hybrid Similarity Measures
Evaluation
Results
Conclusion
Similarity-based Relation Extraction
Figure: Single (a) and hybrid (b) similarity-based relation extractors.
• sim_k – a similarity measure, sim_k(c_i, c_j) ∈ [0; 1], c_i, c_j ∈ C
• S_i – a term-term similarity matrix (|C| × |C|)
• knn – k-NN thresholding: R̂ = ∪_{i=1}^{|C|} {⟨c_i, c_j⟩ : (c_j ∈ top k% of c_i) ∧ (s_ij > 0)}
• S_cmb – the combined similarity matrix obtained with combination_method(S_1, . . . , S_N)
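A minimal sketch of pipeline (b) in Python, assuming the similarity matrices S_1, ..., S_N have already been computed as NumPy arrays; the mean combination used here is just one of the 8 fusion methods studied, and all names and parameter values are illustrative.

```python
import numpy as np

def knn_threshold(S, terms, k_percent=20):
    """k-NN thresholding: for each term keep relations to its top k% most
    similar terms, provided the similarity score is positive."""
    n = len(terms)
    k = max(1, int(n * k_percent / 100))
    relations = set()
    for i in range(n):
        # indices of the k nearest neighbours of term i (excluding itself)
        neighbours = np.argsort(-S[i])[:k + 1]
        for j in neighbours:
            if j != i and S[i, j] > 0:
                relations.add((terms[i], terms[j]))
    return relations

def combine_mean(matrices):
    """A simple fusion method: normalize each matrix to [0, 1] and average."""
    normed = [(S - S.min()) / (S.max() - S.min() + 1e-12) for S in matrices]
    return np.mean(normed, axis=0)

# Usage: S1, ..., SN are |C| x |C| similarity matrices from single measures.
terms = ["car", "vehicle", "Jeep Cherokee", "animal"]
S1, S2 = np.random.rand(4, 4), np.random.rand(4, 4)  # placeholder matrices
S_cmb = combine_mean([S1, S2])
print(knn_threshold(S_cmb, terms, k_percent=25))
```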
Single and Hybrid Similarity Measures
• 16 single measures
• 5 measures based on a semantic network
• 3 web-based measures
• 5 corpus-based measures
• 2 distributional
• 1 based on lexico-syntactic patterns
• 2 other co-occurrence-based
• 3 definition-based measures
• 64 hybrid measures
• 8 combination methods
• 8 measure sets obtained with 3 measure selection techniques
Plan
Introduction
Methodology
  Similarity-based Relation Extraction
  Single Similarity Measures
  Hybrid Similarity Measures
Evaluation
Results
Conclusion
Measures Based on a Semantic Network
1. Wu and Palmer (1994)
2. Leacock and Chodorow (1998)
3. Resnik (1995)
4. Jiang and Conrath (1997)
5. Lin (1998)
Data:
• WordNet 3.0
• SemCor corpus
Variables:
• lengths of the shortest paths between terms in the network
• probabilities of terms derived from a corpus
Coverage: 155,287 English terms encoded in WordNet 3.0.
Complexity: calculation of the shortest path(s) between nodes.
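These five measures are available, for example, in NLTK's WordNet interface. The sketch below is an illustrative use of that library rather than the authors' own implementation; the synset choices and the SemCor information-content file name are assumptions.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content estimated from the SemCor corpus (assumed file name).
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

# For illustration we pick the first noun sense of each term.
car, jeep = wn.synset('car.n.01'), wn.synset('jeep.n.01')

print(car.wup_similarity(jeep))             # 1. Wu and Palmer (path-based)
print(car.lch_similarity(jeep))             # 2. Leacock and Chodorow (path-based)
print(car.res_similarity(jeep, semcor_ic))  # 3. Resnik (information content)
print(car.jcn_similarity(jeep, semcor_ic))  # 4. Jiang and Conrath (IC + path)
print(car.lin_similarity(jeep, semcor_ic))  # 5. Lin (information content)
```

Since these measures are defined between synsets, a term-term score is typically taken as the maximum (or another aggregate) over all synset pairs of the two terms.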
Web-based Measures
Normalized Google Distance (NGD) (Cilibrasi and Vitanyi, 2007)
6. NGD-Yahoo!
7. NGD-Bing
8. NGD-Google over the wikipedia.org domain
Data: the number of times the terms co-occur in documents indexed by an IR system.
Variables:
• the number of hits returned by the query "c_i"
• the number of hits returned by the query "c_i AND c_j"
Coverage: huge vocabulary in dozens of languages.
Complexity: constraints of a search engine API.
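For reference, NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))), where f(·) are hit counts and N is the (approximate) number of indexed pages. The sketch below turns hit counts into a similarity score; the hit counts and the mapping from distance to a similarity in [0, 1] are illustrative assumptions, since each search engine API has its own query interface and limits.

```python
import math

def ngd(hits_x, hits_y, hits_xy, total_pages):
    """Normalized Google Distance from raw hit counts."""
    if min(hits_x, hits_y, hits_xy) == 0:
        return float('inf')
    lx, ly, lxy = math.log(hits_x), math.log(hits_y), math.log(hits_xy)
    return (max(lx, ly) - lxy) / (math.log(total_pages) - min(lx, ly))

def ngd_similarity(hits_x, hits_y, hits_xy, total_pages):
    """Map the distance to a similarity in [0, 1] (one possible choice)."""
    d = ngd(hits_x, hits_y, hits_xy, total_pages)
    return 0.0 if math.isinf(d) else max(0.0, 1.0 - d)

# Hypothetical hit counts for "car", "vehicle", and "car AND vehicle".
print(ngd_similarity(2_500_000, 1_800_000, 900_000, 50_000_000_000))
```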
Corpus-based Measures
9. Bag-of-words Distributional Analysis (BDA) (Sahlgren, 2006)
10. Syntactic Distributional Analysis (SDA) (Curran, 2003)
Data: the WaCkypedia (800M tokens) and PukWaC (2000M tokens) corpora (Baroni et al., 2009)
Variables:
• a feature vector based on the context window
• a feature vector based on the syntactic context
Coverage: a word must occur in the corpora.
Complexity: O(BDA) ≪ O(SDA), because SDA requires dependency parsing.
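A minimal sketch of the bag-of-words variant (BDA): build context-window co-occurrence vectors and compare them with the cosine. The window size, the toy corpus, and the absence of any weighting scheme (e.g. PMI) are simplifications.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Co-occurrence counts within a symmetric context window."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][tokens[j]] += 1
    return vectors

def cosine(u, v):
    num = sum(u[f] * v[f] for f in set(u) & set(v))
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

corpus = [["the", "car", "drives", "on", "the", "road"],
          ["the", "vehicle", "drives", "on", "the", "highway"]]
vecs = context_vectors(corpus)
print(cosine(vecs["car"], vecs["vehicle"]))
```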
Corpus-based Measures
11. A measure based on lexico-syntactic patterns (PatternWiki)
Data: WaCkypedia corpus (800M tokens)
Method:
• 10 patterns for hypernymy extraction: 6 Hearst (1992) patterns + 4 other patterns
• such diverse {[occupations]} as {[doctors]}, {[engineers]} and {[scientists]} [PATTERN=1]
• The semantic similarity s_ij between terms c_i, c_j ∈ C is a function of the number of times the terms co-occur in the same concordance, n_ij: sim(c_i, c_j) = s_ij = n_ij / max_ij(n_ij).
Corpus-based Measures
Figure: A UNITEX graph implementing the first extraction pattern.
Coverage: the target terms c_i, c_j must co-occur in a sentence.
Complexity: application of a cascade of FSTs to the text.
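A sketch of the scoring step above, with the pattern matching approximated by a single simplified Hearst-style regular expression rather than the UNITEX transducer cascade: count how often two target terms fall into the same pattern match, then normalize by the maximum count, as in the formula s_ij = n_ij / max_ij(n_ij).

```python
import re
from collections import Counter
from itertools import combinations

# A simplified stand-in for pattern 1 ("such diverse X as Y, Z and W").
HEARST_LIKE = re.compile(r"such (?:diverse )?(\w+) as ((?:\w+(?:, | and )?)+)")

def pattern_cooccurrences(sentences):
    counts = Counter()
    for s in sentences:
        for m in HEARST_LIKE.finditer(s):
            terms = [m.group(1)] + re.split(r", | and ", m.group(2))
            for a, b in combinations(sorted(set(terms)), 2):
                counts[(a, b)] += 1
    return counts

def pattern_similarity(counts):
    """s_ij = n_ij / max_ij(n_ij)."""
    n_max = max(counts.values()) if counts else 1
    return {pair: n / n_max for pair, n in counts.items()}

sents = ["We hired such diverse occupations as doctors, engineers and scientists."]
print(pattern_similarity(pattern_cooccurrences(sents)))
```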
Corpus-based Measures
12. Latent Semantic Analysis (LSA) on the TASA corpus (Landauer and Dumais, 1997)
13. NGD on the Factiva corpus (Veksler et al., 2008)
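A compact LSA sketch using scikit-learn (the TASA corpus is not freely redistributable, so the documents, the number of latent dimensions, and the library choice are illustrative assumptions): project a TF-IDF term-document matrix into a low-rank space and compare term vectors with the cosine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car drives on the road",
        "a vehicle was parked near the road",
        "the crocodile is a dangerous animal"]       # placeholder corpus

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                        # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0)   # latent dimensions
term_vectors = svd.fit_transform(X.T)                # terms x latent dims

vocab = tfidf.vocabulary_
i, j = vocab["car"], vocab["vehicle"]
print(cosine_similarity(term_vectors[i:i + 1], term_vectors[j:j + 1])[0, 0])
```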
Definition-based Measures
14. Extended Lesk (Banerjee and Pedersen, 2003)
15. GlossVectors (Patwardhan and Pedersen, 2006)
Data: WordNet glosses.
Variables:
• a bag-of-words vector of a term c_i derived from the glosses
• relations between words (c_i, c_j) in the network
Coverage: 117,659 glosses encoded in WordNet 3.0.
Complexity: calculation of a similarity in a vector space.
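A simplified gloss-overlap sketch in the spirit of (Extended) Lesk, using NLTK to fetch WordNet glosses; the real Extended Lesk also includes the glosses of synsets related through the network and scores overlaps by squared phrase length, which this sketch omits.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

STOP = set(stopwords.words('english'))

def gloss_bag(term):
    """Bag of content words from the glosses of all senses of a term."""
    words = set()
    for synset in wn.synsets(term):
        words.update(w.lower() for w in synset.definition().split())
    return {w for w in words if w.isalpha() and w not in STOP}

def gloss_overlap_similarity(t1, t2):
    g1, g2 = gloss_bag(t1), gloss_bag(t2)
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0

print(gloss_overlap_similarity("car", "vehicle"))
```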