Approaches for Source Retrieval and Text Alignment of Plagiarism Detection - PowerPoint PPT Presentation

  1. Approaches for Source Retrieval and Text Alignment of Plagiarism Detection. Kong Leilei, Qi Haoliang, Du Cuixia, Wang Mingxing, Han Zhongyuan. www.hljit.edu.cn. PAN@CLEF2013

  5. Who are we? Heilongjiang Institute of Technology, Harbin, Heilongjiang Province, China

  6. Our University PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei 3

  12. Index
      • Approaches for Source Retrieval
      • Approaches for Text Alignment
      • Further work

  13. Source Retrieval [Diagram: overview of the detection process, with the Source Retrieval step highlighted. Labels: Suspicious document, Query keywords, Source Retrieval, Candidate Documents, Text Alignment, Suspicious plagiarism text, Internet Document Resource Set]

  15. Two problems of source retrieval
      • Source retrieval has two core problems.
      • The retrieval source is millions of documents from the Internet. This part of the work was done by PAN.
      • The query keywords of the suspicious document, which would be used for retrieval, are not specified. How to extract query keywords is one of the important issues of our work.

  16. Query Keywords Extraction
      • Query Keywords Extraction Based on TF-IDF
      • Query Keywords Extraction Based on Weighted TF-IDF
      • Adjacent Query Keywords Extraction by PatTree
      • Combination of Queries and Execution of Retrieval

  17. Keywords Based on TF-IDF
      • TF (term frequency) denotes the frequency of term i in document j: $\mathrm{TF}_{i,j}$
      • IDF (inverse document frequency): $\mathrm{IDF}_i = \log_2(N / df_i)$, where $N$ is the number of documents and $df_i$ is the number of documents containing term i
      • The TF-IDF of term i is: $\text{TF-IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$
      • Tip: we found that the top 10 terms with the highest TF-IDF give good results
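
The TF-IDF ranking on this slide can be illustrated with a minimal sketch. It assumes raw counts as TF, a small in-memory reference collection for the document frequencies, and whitespace-tokenized input; the function name `tfidf_keywords` and its parameters are hypothetical, not the authors' code.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus_tokens, k=10):
    """Return the k terms of doc_tokens with the highest TF-IDF.

    doc_tokens: tokens of the suspicious document.
    corpus_tokens: list of token lists used to estimate document frequency
    (an assumption; the slides do not say which collection was used).
    """
    n_docs = len(corpus_tokens)
    # df[t] = number of corpus documents that contain term t
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))

    tf = Counter(doc_tokens)  # raw count as term frequency
    scores = {}
    for term, freq in tf.items():
        if df[term] == 0:
            continue  # term unseen in the corpus: IDF is undefined, skip it
        scores[term] = freq * math.log2(n_docs / df[term])

    # Highest score first; ties broken alphabetically for stable output
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


corpus = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "d"],
    ["a", "plagiarism", "plagiarism", "b"],
]
top_terms = tfidf_keywords(corpus[3], corpus, k=2)
```

With `k=10` this corresponds to the "top 10 terms" tip; a term that occurs in every document gets an IDF of 0 and sinks to the bottom of the ranking.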

  18. Keywords Based on Weighted TF-IDF
      • Weighted TF-IDF: the TF-IDF score of a term is combined with a weight
      • Here, weight is a weighting parameter; we calculate the weight of term i according to its location
      • Tip: keyword extraction based on weighted TF-IDF is sometimes useful and sometimes not
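
As an illustration only, a location-based weight could be applied as below. The multiplicative form (weight × TF × IDF) and the rule in `position_weight` (boost terms first seen in the opening 20% of the document) are both assumptions made for this sketch; the slide does not preserve the actual formula or weighting rule.

```python
import math
from collections import Counter


def position_weight(first_index, doc_length, head_fraction=0.2, head_boost=2.0):
    """Hypothetical location weight: boost terms first seen near the start."""
    return head_boost if first_index < head_fraction * doc_length else 1.0


def weighted_tfidf(doc_tokens, df, n_docs):
    """Score each term as weight * TF * log2(N / df) (assumed form).

    df: mapping from term to document frequency; n_docs: collection size.
    """
    tf = Counter(doc_tokens)
    # Remember where each term first appears in the document
    first_seen = {}
    for i, tok in enumerate(doc_tokens):
        first_seen.setdefault(tok, i)

    scores = {}
    for term, freq in tf.items():
        if df.get(term, 0) == 0:
            continue  # no document frequency available for this term
        idf = math.log2(n_docs / df[term])
        weight = position_weight(first_seen[term], len(doc_tokens))
        scores[term] = weight * freq * idf
    return scores


doc = ["alpha", "beta", "gamma", "delta", "beta",
       "gamma", "epsilon", "zeta", "eta", "theta"]
df = {t: 1 for t in doc if t != "theta"}
scores = weighted_tfidf(doc, df, n_docs=4)
```

In this toy input, `beta` and `gamma` have the same TF and IDF, but `beta` scores higher because it first appears in the boosted opening region.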

  19. Adjacent Query Keywords Extraction by PatTree
      • A string of adjacent words with high frequency is more important than a single word
      • We use PatTree, an efficient data structure, to get the adjacent strings and their frequencies
      • [Figure: example]
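
The idea of counting frequent adjacent strings can be sketched without a PatTree. The example below swaps in a plain hash-based counter (`collections.Counter`) over word n-grams, which gives the same counts on small inputs but is not the PatTree structure the slide refers to; `frequent_ngrams` and `min_count` are hypothetical names.

```python
from collections import Counter


def frequent_ngrams(tokens, n, min_count=2):
    """Count adjacent word n-grams and keep those seen at least min_count times."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    frequent = [(gram, c) for gram, c in counts.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically for stable output
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return frequent


tokens = "source retrieval and text alignment need source retrieval and text".split()
bigrams = frequent_ngrams(tokens, 2)
trigrams = frequent_ngrams(tokens, 3)
```

Running it with `n` from 2 to 5 yields candidates for the 2-gram to 5-gram query keywords.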

  20. Combination of Queries
      Table 1: Query Combination and Group

      Query | Query Keywords
      1     | Top 1 to 5 query keywords based on TF-IDF
      2     | Top 2 to 10 query keywords based on TF-IDF
      3     | 2-Gram query keywords based on PatTree
      4     | 3-Gram query keywords based on PatTree
      5     | 4-Gram query keywords based on PatTree
      6     | 4-Gram query keywords based on PatTree
      7     | Top 1 to 5 query keywords based on weighted TF-IDF
      8     | Top 6 to 10 query keywords based on weighted TF-IDF
      9     | 5-Gram query keywords based on PatTree
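
A query list of this kind could be assembled from ranked keyword lists and n-gram strings roughly as follows. This is a sketch under assumptions: the spec format, the function `build_queries`, the choice of one n-gram string per `n`, and the skipping of empty or duplicate queries are all hypothetical and not taken from the slides.

```python
def build_queries(specs, ranked, ngrams):
    """Build query strings in the order given by specs.

    specs: list of ("ngram", n) or (source, first, last) tuples, where
           first/last are 1-based inclusive ranks as in "top 1 to 5".
    ranked: mapping from source name to a ranked keyword list.
    ngrams: mapping from n to the n-gram string chosen for that n.
    """
    queries = []
    for spec in specs:
        if spec[0] == "ngram":
            text = ngrams.get(spec[1], "")
        else:
            source, first, last = spec
            text = " ".join(ranked[source][first - 1:last])
        # Skip empty queries and exact repeats of an earlier query
        if text and text not in queries:
            queries.append(text)
    return queries


ranked = {
    "tfidf": [f"t{i}" for i in range(1, 11)],
    "weighted": [f"w{i}" for i in range(1, 11)],
}
ngrams = {2: "source retrieval"}
specs = [("tfidf", 1, 5), ("ngram", 2), ("weighted", 6, 10), ("ngram", 5)]
queries = build_queries(specs, ranked, ngrams)
```

Keeping the groups as data makes it easy to reorder or drop a group without touching the query-building logic.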

  21. Results: Source Retrieval subtask
      Table 2: Results of the PAN@CLEF2013 Source Retrieval subtask

      Measure group         | Measure   | Value
      Workload              | Queries   | 48.50
      Workload              | Downloads | 5691.47
      Time to 1st Detection | Queries   | 2.46
      Time to 1st Detection | Downloads | 285.66
      Retrieval Performance | Precision | 0.01
      Retrieval Performance | Recall    | 0.65
      No Detection          |           | 3

  24. Text Alignment [Diagram: overview of the detection process, with the Text Alignment step highlighted. Labels: Suspicious document, Query keywords, Source Retrieval, Candidate Documents, Text Alignment, Suspicious plagiarism text, Internet Document Resource Set]

  32. Text Alignment: Seeding, Match Merging, Extraction Filtering

  39. Text Alignment: Seeding, Match Merging, Extraction Filtering
      • Bilateral Alternating Merging Algorithm
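
The three stages named on these slides can be illustrated with a generic seed-and-merge sketch. It is not the Bilateral Alternating Merging Algorithm, whose details are not preserved in this transcript; the exact word 3-gram seeds, the greedy gap-based merge, and the length filter are assumptions chosen for illustration, and all function names are hypothetical.

```python
def find_seeds(susp_tokens, src_tokens, n=3):
    """Seeding: (i, j) pairs where the same word n-gram starts in both texts."""
    index = {}
    for j in range(len(src_tokens) - n + 1):
        index.setdefault(tuple(src_tokens[j:j + n]), []).append(j)
    seeds = []
    for i in range(len(susp_tokens) - n + 1):
        for j in index.get(tuple(susp_tokens[i:i + n]), []):
            seeds.append((i, j))
    return seeds


def merge_seeds(seeds, n=3, max_gap=4):
    """Match merging: greedily join seeds that are close in both documents.

    Returns passages as (susp_start, susp_end, src_start, src_end),
    with exclusive end offsets measured in tokens.
    """
    passages = []
    for i, j in sorted(seeds):
        if passages:
            s0, s1, r0, r1 = passages[-1]
            # Extend the last passage if the seed lies within max_gap of it
            if i <= s1 + max_gap and r0 <= j <= r1 + max_gap:
                passages[-1] = (s0, max(s1, i + n), r0, max(r1, j + n))
                continue
        passages.append((i, i + n, j, j + n))
    return passages


def filter_passages(passages, min_len=5):
    """Extraction filtering: drop passages shorter than min_len tokens."""
    return [p for p in passages if p[1] - p[0] >= min_len]


susp = "x x the quick brown fox jumps over y y".split()
src = "a the quick brown fox jumps over b".split()
aligned = filter_passages(merge_seeds(find_seeds(susp, src)))
```

Here `aligned` holds one passage pair: tokens 2 to 7 of the suspicious text matched against tokens 1 to 6 of the source.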

  43. Performance on the PAN2012 test corpus. Table 3: Overall evaluation results for the final test corpus

  44. Performance on the PAN2012 test corpus. Table 4: Results for the 02-no-obfuscation sub-corpus

  45. Performance on the PAN2012 test corpus. Table 5: Results for the 03-random-obfuscation sub-corpus

  46. Performance on the PAN2012 test corpus. Table 6: Results for the 04-translation-obfuscation sub-corpus

  47. Performance on the PAN2012 test corpus. Table 7: Evaluation results for the 05-summary-obfuscation sub-corpus

  48. Further work
      • Use different methods to deal with different plagiarism problems to obtain better performance
      • Query keywords extraction and ranking

  49. Thank you for your attention!
