
The Impact of Distributional Metrics in the Quality of Relational Triples



  1. The Impact of Distributional Metrics in the Quality of Relational Triples
     Hernani Costa, Hugo Gonçalo Oliveira¹, Paulo Gomes
     hpcosta@student.dei.uc.pt, {hroliv,pgomes}@dei.uc.pt
     Cognitive & Media Systems Group, CISUC, University of Coimbra
     LaTeCH 2010, Lisbon, August 16, 2010
     ¹ Supported by FCT scholarship grant SFRH/BD/44955/2008

  2. Outline
     1 Introduction: Information Extraction, Information Retrieval, Research Goals
     2 Approach
     3 Experimentation: Set-up, Metrics adaptation, Results, Additional experimentation
     4 Concluding remarks

  8. Introduction
     Knowledge bases (e.g. WordNet) are useful resources for NLP.
     Their creation and maintenance involve intensive human effort.
     Automatic creation/enrichment from textual resources is an alternative:
     ▶ Higher coverage, easier update, but...
     ▶ Precision is lower
     ▶ Evaluation once again requires intensive human labour!

  13. Introduction: Information Extraction
      Information extraction (IE): automatic extraction of structured information from natural language.
      “Car is a vehicle with 4 wheels and an engine, used for carrying a small number of passengers.”
      ▶ vehicle HYPERNYM OF car
      ▶ wheel PART OF car
      ▶ engine PART OF car
      ▶ carrying people PURPOSE OF car
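To make the notion of a relational triple concrete, the sketch below stores the four triples from the car example as simple records. The `Triple` type, its field names, and the `by_relation` helper are hypothetical names chosen for illustration; the slides do not say how the authors' system represents triples.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """A relational triple: <arg1> <relation> <arg2> (hypothetical layout)."""
    arg1: str
    relation: str
    arg2: str


# The four triples listed on the slide for the "car" definition.
car_triples = [
    Triple("vehicle", "HYPERNYM_OF", "car"),
    Triple("wheel", "PART_OF", "car"),
    Triple("engine", "PART_OF", "car"),
    Triple("carrying people", "PURPOSE_OF", "car"),
]


def by_relation(triples: List[Triple], relation: str) -> List[Triple]:
    """Return only the triples that carry the given relation name."""
    return [t for t in triples if t.relation == relation]


parts_of_car = by_relation(car_triples, "PART_OF")
```

Grouping by relation in this way is one possible starting point for weighting each relation type separately, which is the question the research goals raise.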

  16. Introduction: Information Retrieval
      Information retrieval (IR): locating specific information in natural language resources.
      Approaches based on the occurrence of words in documents.
      Distributional similarity metrics:
      ▶ Cocitation (Small, 1973)
      ▶ LSA (Deerwester et al., 1990)
      ▶ Lin's (Lin, 1998)
      ▶ PMI-IR (Turney, 2001)
      ▶ σ (Kozima and Furugori, 1993)
      ▶ ...
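As an illustration of a metric "based on the occurrence of words in documents", the sketch below computes plain pointwise mutual information (PMI) from document sets. This is the textbook PMI definition, log2 of P(a, b) / (P(a) P(b)), estimated from document frequencies; it is an assumption for illustration and not necessarily the exact PMI-IR variant used in the presentation. The function name and toy inputs are hypothetical.

```python
import math
from typing import Set


def pmi(docs_a: Set[int], docs_b: Set[int], n_docs: int) -> float:
    """Pointwise mutual information of two words.

    docs_a / docs_b are the ids of the documents containing each word,
    n_docs is the collection size. Returns -inf when the words never
    co-occur (the logarithm of zero is undefined).
    """
    joint = len(docs_a & docs_b)
    if joint == 0:
        return float("-inf")
    p_a = len(docs_a) / n_docs
    p_b = len(docs_b) / n_docs
    p_ab = joint / n_docs
    return math.log2(p_ab / (p_a * p_b))


# Toy document ids (hypothetical): both words occur in exactly documents 0 and 1.
score_same = pmi({0, 1}, {0, 1}, n_docs=4)
# Co-occurrence exactly as often as independence would predict.
score_independent = pmi({0, 1}, {1, 2}, n_docs=4)
```

A positive score means the two words share documents more often than chance would predict, zero means no association, and negative infinity is returned here for words that never co-occur.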

  22. Introduction: Research Goals
      Goals
      1 Use IR metrics to improve IE precision
        ▶ Adapt distributional metrics to determine word similarity
        ▶ Wandmacher et al. (2007) and Cederberg and Widdows (2003) used LSA to weight hypernymy triples
        ▶ What about other semantic relations?
        ▶ Which metrics should be used?
        ▶ New combined metrics?
      2 Help manual evaluation

  23. Approach
      IE system (pipeline diagram): Corpus and Grammars → Extraction of relational triples → Removal of triples with stopwords → Lemmatisation → Additional extraction of triples → Metrics application

  25. Experimentation: Set-up
      Experimentation set-up
      CETEMPúblico² corpus (annotated version)
      ▶ 28,000 documents
      ▶ 30,100 unique context words (nouns, verbs and adjectives)
      ▶ term-document matrix
      Triples obtained
      ▶ Extracted: 20,308
      ▶ Discarded: 5,844
      ▶ Inferred: 2,492
      ▶ Final triple set: 16,956
      ² http://www.linguateca.pt/cetempublico/
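As a rough illustration of the term-document matrix mentioned in the set-up, the sketch below maps each context word to the set of documents it occurs in. The toy documents and the function name are assumptions for illustration, not the CETEMPúblico data. The last lines only check that the reported counts are consistent if discarded triples are subtracted from the extracted ones and inferred triples are added, which is an assumed reading of the slide.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_term_document_index(documents: List[List[str]]) -> Dict[str, Set[int]]:
    """Map each term to the set of ids of the documents that contain it.

    This is a sparse view of a binary term-document matrix: a cell
    (term, doc) is 1 exactly when doc is in index[term].
    """
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, tokens in enumerate(documents):
        for token in tokens:
            index[token].add(doc_id)
    return dict(index)


# Hypothetical, already lemmatised toy documents.
toy_documents = [
    ["car", "vehicle", "wheel"],
    ["car", "engine"],
    ["vehicle", "passenger"],
]
index = build_term_document_index(toy_documents)

# Consistency check on the counts reported in the set-up.
extracted, discarded, inferred = 20_308, 5_844, 2_492
final_triple_set = extracted - discarded + inferred
```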

  27. Experimentation: Metrics adaptation
      Similarity between two documents
      For instance, Cocitation: first presented as a similarity metric between scientific papers (Small, 1973)
      $$\mathrm{Cocitation}(d_i, d_j) = \frac{P(d_i \cap d_j)}{P(d_i \cup d_j)} \quad (1)$$
      ▶ d_i, d_j represent two documents
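Equation (1) can be sketched as a ratio of set sizes. The code below assumes that P(·) is read as a count of elements, and that each item is described by a set (for two papers, the works citing them; for two words, the documents containing them). Both readings are assumptions made for illustration; the slides shown here do not spell out the adapted formula.

```python
from typing import Hashable, Set


def cocitation(items_i: Set[Hashable], items_j: Set[Hashable]) -> float:
    """Size of the intersection divided by size of the union, as in Eq. (1).

    Returns 0.0 when both sets are empty, to avoid dividing by zero.
    """
    union = items_i | items_j
    if not union:
        return 0.0
    return len(items_i & items_j) / len(union)


# Hypothetical document ids for two words.
docs_with_car = {1, 2, 3}
docs_with_vehicle = {2, 3, 4}
similarity = cocitation(docs_with_car, docs_with_vehicle)
```

The score ranges from 0 (no shared elements) to 1 (identical sets), so it can be compared across pairs without further normalisation.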
