
Cross-language High Similarity Search using a Conceptual Thesaurus



  1. Cross-language High Similarity Search using a Conceptual Thesaurus
     Parth Gupta (1), Alberto Barrón-Cedeño (2) and Paolo Rosso (1)
     (1) Universitat Politècnica de València (UPV), Spain
     (2) Universitat Politècnica de Catalunya (UPC), Spain
     September 19, 2012

  2. Outline
     - Introduction
     - Conceptual Thesaurus
     - Method
     - Results
     - Analysis
     - References

  3. Introduction
     - The task of cross-language high similarity search refers to the identification of documents that are duplicates or share very similar information in two different languages.
     - Some examples:
       - Wikipedia articles in multiple languages
       - news stories in different languages covering the same event
       - cross-language cases of plagiarism
       - translated documents, etc.
     - In the literature, also referred to as:
       - Cross-language pairwise similarity search
       - Cross-language mate retrieval
       - Cross-language near-duplicate search

  4. Conceptual Thesaurus (domain specific)
     - Often has a multi-word structure
     - Tries to exhaustively cover the omnipresent concepts of the domain
     - Eurovoc (http://eurovoc.europa.eu/)
       - Emerged from European Parliamentary proceedings
       - Contains 6,797 multilingual concepts in 22 languages
       - Spans 21 domains of European Parliament activities

  5. Eurovoc

     | English                                    | Spanish                      | German                         |
     |--------------------------------------------|------------------------------|--------------------------------|
     | action for failure to fulfil an obligation | recurso por incumplimiento   | Klage wegen Vertragsverletzung |
     | extra-community trade                      | intercambio extracomunitario | außergemeinschaftlicher Handel |
     | sexual harassment                          | acoso sexual                 | sexuelle Belästigung           |

  6. Eurovoc
     - Domains of concepts:
       - Politics
       - International relations
       - European community
       - Law
       - Economics
       - and so on
     - Assigning these concepts to Wikipedia documents or Shakespeare stories?

  7. Method - Cross-language Conceptual Thesaurus based Similarity (CL-CTS)
     - Represent documents as a vector of concepts
     - Concept assignment is the least trivial part
     - Challenge: exploit a domain-specific conceptual thesaurus (CT) for all the corpora
     - Assigning concepts according to their verbatim occurrence in the document gives very bad results [Pouliquen et al. 2006]
     - Assign a concept to a document if the document "triggers the concept"

  8. Method contd.
     - Heuristic: the terms of a concept are highly domain dependent together, but domain independent alone.
       - For example, "community" and "trade" compared to "community trade"
     - Concept assignment
       - Sum of the term frequencies (TF) of the concept's terms in the document
       - Stopword removal + stemming
       - Filter the terms based on their discriminative power in the corpora
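
The concept assignment step can be illustrated with a minimal Python sketch. It assumes a concept's score is simply the summed term frequency of its stemmed, non-stopword terms in the document. The names `STOPWORDS`, `crude_stem`, `terms` and `concept_vector` are hypothetical, and the one-rule stemmer is only a stand-in for a real per-language stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real system would use a full list per language.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "to"}


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def terms(text):
    """Lowercase, tokenize on letters, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def concept_vector(document, concepts):
    """Score each concept by the summed TF of its distinct terms in the document."""
    tf = Counter(terms(document))
    return {c: sum(tf[t] for t in set(terms(c))) for c in concepts}


doc = "The community regulates trade. Trade rules protect the community."
vector = concept_vector(doc, ["extra-community trade", "sexual harassment"])
print(vector)
```

Here "extra-community trade" is triggered although the phrase never occurs verbatim, because its terms "community" and "trade" each occur in the document.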

  9. Method contd.
     - Not all concepts help in similarity estimation, hence Reduced Concepts (RC)
     - Reduces the comparison vocabulary drastically
     - Domain-independent threshold: 0 < df(t) < �
     - Automatic domain adaptation (Football in "Sports" and "Society and Culture")
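
A minimal sketch of the document-frequency filter behind the reduced concepts, assuming df(t) is counted as the number of documents containing term t. The upper bound is passed in as `max_df`, a hypothetical parameter name, since the slide's symbol for it did not survive extraction. `document_frequency` and `reduced_vocabulary` are illustrative names as well.

```python
from collections import Counter


def document_frequency(corpus_terms):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc_terms in corpus_terms:
        df.update(set(doc_terms))
    return df


def reduced_vocabulary(corpus_terms, candidate_terms, max_df):
    """Keep only terms with 0 < df(t) < max_df (max_df is a hypothetical name)."""
    df = document_frequency(corpus_terms)
    return {t for t in candidate_terms if 0 < df[t] < max_df}


corpus = [["trade", "community"], ["trade", "law"], ["trade", "football"]]
kept = reduced_vocabulary(corpus, {"trade", "law", "football", "fishery"}, max_df=3)
print(sorted(kept))
```

In this toy corpus "trade" is dropped because it occurs in every document, and "fishery" is dropped because it never occurs, so neither can discriminate between documents.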

  10. Method contd.
      - Concern: the concepts are limited and are common across even slightly relevant documents
      - To overcome this limitation of conceptual similarity estimation, we also use Named Entities (NEs) in the similarity
      - n-gram similarity of NEs is the simplest method
      - NEs act as discriminative features, e.g. the Wikipedia page of Rome vs. that of Madrid
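
One possible reading of "n-gram similarity of NEs" is sketched below. It assumes the named entities have already been extracted, and it compares them by cosine similarity over character trigrams. The slide specifies neither the n-gram type nor the measure, so both are assumptions, and `char_ngrams` and `ne_similarity` are hypothetical names.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Bag of character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ne_similarity(entities_q, entities_d, n=3):
    """Cosine similarity between the pooled character n-grams of two NE lists."""
    grams_q, grams_d = Counter(), Counter()
    for entity in entities_q:
        grams_q.update(char_ngrams(entity, n))
    for entity in entities_d:
        grams_d.update(char_ngrams(entity, n))
    dot = sum(grams_q[g] * grams_d[g] for g in grams_q.keys() & grams_d.keys())
    norm = math.sqrt(sum(v * v for v in grams_q.values())) * \
        math.sqrt(sum(v * v for v in grams_d.values()))
    return dot / norm if norm else 0.0


rome_en = ["Rome", "Colosseum"]
rome_es = ["Roma", "Coliseo"]
madrid_es = ["Madrid", "Prado"]
print(ne_similarity(rome_en, rome_es), ne_similarity(rome_en, madrid_es))
```

Character n-grams need no translation step, so entities whose spelling is similar across languages still overlap, while the Rome and Madrid pages share no trigrams in this toy example.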

  11. Method contd.
      - Sometimes highly similar documents are parallel, and the task is to find the parallel document for a given document
      - A pattern in length is noticed for parallel documents across languages [Pouliquen et al. 2006]
      - We use the same "length penalty": len(parallel(d_q)) = f(μ, σ, len(d_q))
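
The slide does not spell out f. A common choice in the cross-language similarity literature is a Gaussian over the length ratio of the two documents, and the sketch below assumes that form. The function name `length_penalty` and the values of `mu` and `sigma` are illustrative only; in practice the two parameters would be estimated from parallel data for each language pair.

```python
import math


def length_penalty(len_q, len_d, mu, sigma):
    """Assumed Gaussian length model: close to 1 when len_d / len_q is near mu."""
    ratio = len_d / len_q
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)


# Illustrative parameters (assumption, not taken from the slides).
MU, SIGMA = 1.1, 0.3

plausible = length_penalty(1000, 1200, MU, SIGMA)
implausible = length_penalty(1000, 2000, MU, SIGMA)
print(plausible, implausible)
```

A candidate whose length is far from the expected translation length gets a penalty close to 0, which pushes it down the ranking even if its concepts match.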

  12. Method contd.
      - The similarity function:

        $$\omega(q, d) = \underbrace{\frac{\alpha}{2} \left( \frac{\vec{c}_q \cdot \vec{c}_d}{|\vec{c}_q|\,|\vec{c}_d|} + \ell(q, d) \right)}_{\text{Conceptual Component}} + \underbrace{(1 - \alpha) \, \zeta(q, d)}_{\text{NE Component}}$$

      - Inside the conceptual component, the fraction is the conceptual similarity and $\ell(q, d)$ is the length penalty.
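
A minimal sketch of how the pieces combine, reading the slide's similarity function as ω(q, d) = α/2 · (cos(c_q, c_d) + ℓ(q, d)) + (1 − α) · ζ(q, d). The concept vectors are assumed to be dictionaries from concept to weight, and the length penalty and NE similarity are assumed to be precomputed values in [0, 1]. The names `cosine` and `cl_cts_similarity` and the default `alpha=0.5` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def cl_cts_similarity(c_q, c_d, length_pen, ne_sim, alpha=0.5):
    """alpha/2 * (conceptual cosine + length penalty) + (1 - alpha) * NE similarity."""
    conceptual = alpha / 2 * (cosine(c_q, c_d) + length_pen)
    return conceptual + (1 - alpha) * ne_sim


c_q = {"extra-community trade": 4.0, "sexual harassment": 0.0}
c_d = {"extra-community trade": 2.0}
score = cl_cts_similarity(c_q, c_d, length_pen=1.0, ne_sim=1.0)
print(score)
```

Dividing α by 2 keeps the conceptual component within [0, α] when both the cosine and the length penalty lie in [0, 1], so the total score stays in [0, 1].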

  13. Compared with
      1. Cross-language Alignment based Similarity Analysis (CL-ASA) [Barrón-Cedeño et al. 2008, Pinto et al. 2009]
      2. Cross-language Character n-grams (CL-CNG) [McNamee and Mayfield 2004]

  14. Datasets
      - JRC-Acquis (JRC)
        - Nature: related to European Commission activities
        - Size: 10,000 in each language
        - Type: parallel
      - PAN-PC-2011 (PAN)
        - Nature: Project Gutenberg (artificially created cross-language plagiarism cases)
        - Size: 2,920 (en-es) and 2,222 (en-de)
        - Type: noisy parallel
      - Wikipedia (Wiki)
        - Nature: general Wikipedia pages
        - Size: 10,000 in each language
        - Type: comparable

  15. Datasets contd.
      - The vocabulary shared by Eurovoc and JRC is larger than that shared by Eurovoc and PAN or Wiki.

  16. Results: JRC (figure panels: en-es, en-de)

  17. Results: PAN (figure panels: en-es, en-de)

  18. Results: Wiki (figure panels: en-es, en-de)

  19. Analysis
      - The performance of CL-CTS with reduced concepts is much higher than with all concepts included
        - R@1: 0.02 → 0.58 (JRC en-es)
      - Including the NE component usually improves the performance, except on JRC (interesting!)
      - CL-ASA and CL-CNG exhibit very corpus-dependent performance.
      - German remains more difficult than Spanish (word compounding needs better care)
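
For reference, R@1 can be computed as sketched below, assuming it denotes the fraction of query documents whose true cross-language mate is ranked first. The slides do not define the metric, so this reading is an assumption. `recall_at_k` and the toy rankings are illustrative and are not results from the presentation.

```python
def recall_at_k(rankings, mates, k=1):
    """Fraction of queries whose true mate appears among the top-k ranked documents."""
    hits = sum(1 for query, ranked in rankings.items() if mates[query] in ranked[:k])
    return hits / len(rankings)


# Toy data (made up for illustration): ranked candidate lists and the true mate of each query.
rankings = {
    "q1": ["d1", "d7", "d3"],
    "q2": ["d5", "d2", "d9"],
    "q3": ["d3", "d4", "d8"],
    "q4": ["d6", "d1", "d4"],
}
mates = {"q1": "d1", "q2": "d2", "q3": "d3", "q4": "d4"}

print(recall_at_k(rankings, mates, k=1), recall_at_k(rankings, mates, k=3))
```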
