
Deep Investigation of Cross-Language Plagiarism Detection Methods



  1. Deep Investigation of Cross-Language Plagiarism Detection Methods
     Authors: Jérémy Ferrero, Laurent Besacier, Didier Schwab and Frédéric Agnès
     BUCC - August 2017

  2. What is Cross-Language Plagiarism Detection?
     Cross-language plagiarism is plagiarism by translation, i.e. a text has been plagiarized while being translated (manually or automatically).
     From a text in a language L, we must find similar passage(s) in other text(s) from a set of candidate texts in language L’ (cross-language textual similarity).

  3. Why is it so important?
     Sources:
     - McCabe, D. (2010). Students’ cheating takes a high-tech turn. In Rutgers Business School.
     - Josephson Institute. (2011). What would honest Abe Lincoln say?

  4. Research Questions
     • How do the state-of-the-art methods behave according to the characteristics of the compared texts?
     • Do the methods depend on the characteristics of the compared texts? And if so, which characteristics?
     • Are the state-of-the-art methods complementary?

  5. State-of-the-Art Methods
     • Syntax-Based Models: Length Model, CL-CnG [Potthast et al., 2011], Cognateness
     • Dictionary-Based Models: CL-VSM, CL-CTS [Pataki, 2012]
     • Parallel Corpora-Based Models: CL-ASA [Pinto et al., 2009], CL-LSI, CL-KCCA
     • Comparable Corpora-Based Models: CL-KGA, CL-ESA [Potthast et al., 2008]
     • MT-Based Models: Translation + Monolingual Analysis [Muhr et al., 2010]

  6. CL-C3G [Potthast et al., 2011]
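The slide names CL-C3G without showing its mechanics. As a rough illustration only, the sketch below compares two texts by the cosine similarity of their character 3-gram counts. The normalization (lowercasing, dropping punctuation, raw counts rather than weighted ones) and the function names `char_ngrams`, `cosine` and `cl_c3g_similarity` are assumptions made for this example, not the configuration used by the authors.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after a simple normalization (assumed here)."""
    cleaned = re.sub(r"[^\w ]+", "", text.lower())      # drop punctuation
    cleaned = re.sub(r"\s+", " ", cleaned).strip()      # collapse whitespace
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cl_c3g_similarity(text_a, text_b):
    """Character 3-gram similarity of two texts, possibly in different languages."""
    return cosine(char_ngrams(text_a), char_ngrams(text_b))


if __name__ == "__main__":
    english = "the international conference"
    french = "la conférence internationale"
    unrelated = "un chat noir"
    print(cl_c3g_similarity(english, french))
    print(cl_c3g_similarity(english, unrelated))
```

Character n-grams work across languages only to the extent that the two languages share vocabulary and spelling, which is why this family of models is listed among the syntax-based ones.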

  7. CL-CTS [Pataki, 2012]
     We use DBNary [Sérasset, 2015] as the linked lexical resource.

  8. CL-ASA [Pinto et al., 2009]

  9. CL-ESA [Potthast et al., 2008]

  10. T+MA [Muhr et al., 2010]

  11. Evaluation Dataset [Ferrero et al., 2016] (1)
     • French, English and Spanish;
     • Parallel and comparable (mix of Wikipedia, conference papers, product reviews, Europarl and JRC);
     • Different granularities: document level, sentence level and chunk level;
     • Human and machine translated texts;
     • Obfuscated (to make the similarity detection more complicated) and without added noise;
     • Written and translated by multiple types of authors;
     • Cover various fields.
     https://github.com/FerreroJeremy/Cross-Language-Dataset
     (1) A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection. In Proceedings of LREC 2016.

  12. First experiment: Evaluation Protocol
     • We compare each textual unit to its corresponding unit in another language and to 999 other units randomly selected;
     • We threshold the obtained distance matrix to find the threshold giving the best F1 score;
     • We repeat these two steps 10 times, leading to a 10-fold validation;
     • The final value is the average of the 10 F1 scores.
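The thresholding step of this protocol can be sketched as follows. This is a minimal illustration, assuming that the distance matrix has been flattened into a list of distances with one boolean per pair saying whether the pair is a true match, and that a smaller distance means a closer pair. The function name `best_f1_threshold` and the toy numbers are hypothetical.

```python
def best_f1_threshold(distances, is_match):
    """Try every observed distance as a threshold and keep the best F1.

    A pair is predicted as a match when its distance is <= threshold.
    Returns (threshold, f1); threshold is None if no threshold finds a match.
    """
    best_threshold, best_f1 = None, 0.0
    for threshold in sorted(set(distances)):
        tp = fp = fn = 0
        for distance, match in zip(distances, is_match):
            predicted = distance <= threshold
            if predicted and match:
                tp += 1
            elif predicted and not match:
                fp += 1
            elif match:
                fn += 1
        if tp == 0:
            continue  # precision and recall are both zero here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    # Toy data: three true pairs and three random pairs (made-up distances).
    distances = [0.1, 0.2, 0.4, 0.35, 0.8, 0.9]
    is_match = [True, True, True, False, False, False]
    print(best_f1_threshold(distances, is_match))
```

Under the protocol above, this search would run once per repetition, and the reported score would be the mean of the 10 best F1 values.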

  13. Results: Across Language Pairs
     Table 1: Overall F1 score over all sub-corpora of the state-of-the-art methods for each language pair (EN: English; FR: French; ES: Spanish).
     Columns (chunk level and sentence level): Methods | EN → FR | FR → EN | EN → ES | ES → EN | ES → FR | FR → ES
     Rows: CL-C3G, CL-CTS, CL-ASA, CL-ESA, T+MA
     Cell values as extracted (row and column alignment lost):
     0.4633 0.2694 0.3523 0.3576 CL-ASA 0.4575 0.4645 0.3204 0.3171 0.4734 0.3098 CL-CTS 0.4577 0.4577 0.3819 0.3819 0.4931 0.4931 CL-C3G 0.2531 0.2843 0.3505 0.3525 0.3673 0.3526 0.3692 CL-ESA 0.3760 T+MA 0.1383 0.1383 0.1337 0.1337 0.1430 0.1430 Chunk level
     Sentence level 0.4250 0.4252 0.3140 CL-ASA 0.4169 0.4203 0.3881 0.3780 0.4116 CL-CTS 0.3941 0.4795 0.4795 0.4375 0.4375 0.5071 0.5071 CL-C3G 0.4083 0.4738 0.3736 T+MA 0.3158 0.3540 0.3279 0.3177 0.3730 0.3634 0.1520 0.1520 0.1476 0.1476 0.1499 0.1499 CL-ESA

  14. Results: Across Language Pairs
     Table 2: Top 3 methods by source and target language.
     Panels: (a) Chunk granularity, (b) Sentence granularity
     Language groups as extracted: EN ↔ FR, ES ↔ FR, EN ↔ FR, EN ↔ ES, ES → FR, EN ↔ ES, FR → ES
     Cell entries as extracted (alignment lost):
     CL-C3G (b) Sentence granularity T+MA CL-CTS T+MA CL-C3G T+MA CL-CTS CL-CTS CL-C3G CL-ASA (a) Chunk granularity CL-CTS CL-CTS CL-ASA CL-C3G CL-C3G

  15. Results: Across Language Pairs
     Strong correlation between languages!
     Table 3: Pearson correlations of the overall F1 score over all sub-corpora of all methods between the different language pairs (EN: English; FR: French; ES: Spanish).
     Rows and columns (chunk level and sentence level): Lang. Pair | EN → FR | FR → EN | EN → ES | ES → EN | ES → FR | FR → ES, plus an Overall entry
     Cell values as extracted (row and column alignment lost):
     0.991 0.981 0.989 0.924 0.931 1.000 0.971 0.982 0.922 1.000 0.929 1.000 1.000 Lang. Pair Overall Sentence level 0.971 0.997 1.000 0.971 0.966 1.000 0.997 0.925 1.000 0.949 0.922 0.928 1.000 0.949 0.913 0.970 0.994 0.990 1.000 0.987 0.971 0.980 0.980 Overall 1.000 0.967 0.980 0.940 Lang. Pair 0.957 0.995 0.998 1.000 0.996 0.949 0.988 0.998 Chunk level 1.000 0.983 0.991 0.965 0.978 1.000

  16. Results: Across Language Pairs
     Strong correlation between granularities!
     Table 4: Pearson correlations of the results of all methods on all sub-corpora, between the chunk and the sentence granularity, by language pair (EN: English; FR: French; ES: Spanish) (calculated from Table 1).
     Lang. Pair: EN → FR | FR → EN | EN → ES | ES → EN | ES → FR | FR → ES
     Correlation (values as extracted, column alignment not confirmed): 0.939 0.932 0.838 0.833 0.946 0.907

  17. Results: Across Language Pairs
     Strong correlation between granularities!
     Table 5: Pearson correlations of the results on all sub-corpora on all language pairs, between the chunk and the sentence granularity, by methods (calculated from Table 1).
     Methods | Correlation
     CL-C3G | 0.996
     CL-CTS | 0.970
     CL-ASA | 0.649
     CL-ESA | 0.515
     T+MA | 0.780
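For reference, the Pearson correlation used in Tables 3 to 5 can be computed as in the sketch below. The function name `pearson` and the two score vectors are hypothetical placeholders, not values from the tables above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical F1 scores of one method on four sub-corpora.
    chunk_level = [0.50, 0.42, 0.31, 0.15]
    sentence_level = [0.48, 0.44, 0.35, 0.14]
    print(pearson(chunk_level, sentence_level))
```

A value close to 1 means that a method's scores rise and fall together across the two settings being compared, which is the sense in which the slides speak of a strong correlation.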
