
Improving performance of a plagiarism detection system, Andrzej Sobecki, Marcin Kępa, IKC 2017



  1. Improving performance of a plagiarism detection system Andrzej Sobecki, Marcin Kępa IKC 2017

  2. Plagiarism detection problem • Text documents – unstructured form • Finding potential source documents based on the suspected document • Searching in many repositories • The process must be fast and accurate

  3. How we do that • Pipeline: Parsing → Hashing → Filtering → Calculating similarities • The diagram marks one stage as the important stage for accuracy and one as the crucial stage for performance

  4. Filtering – current solution • A hash function h(x) turns each document into a document profile: one hash per sentence • Profiles are built for the documents available in the repositories and for the suspected document • Filtering counts the identical hash values shared by the suspected document's profile and each repository document's profile
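The profile-and-count idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the sentence splitter, the normalisation step, the use of MD5 as h(x), and the names `sentence_hashes` and `count_identical` are all hypothetical choices made for the example.

```python
import hashlib
import re


def sentence_hashes(text):
    """Build a document profile: one hash per sentence."""
    # Naive sentence splitting on '.', '!' and '?' (an assumption for this sketch).
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    profile = []
    for sentence in sentences:
        # Normalise case and whitespace so trivial edits do not change the hash.
        normalised = " ".join(sentence.lower().split())
        # MD5 stands in for h(x); the slides do not say which function is used.
        profile.append(hashlib.md5(normalised.encode("utf-8")).hexdigest())
    return profile


def count_identical(suspect_profile, repository_profiles):
    """Count hash values shared by the suspected profile and each repository profile."""
    suspect = set(suspect_profile)
    return {
        doc_id: len(suspect & set(profile))
        for doc_id, profile in repository_profiles.items()
    }
```

Documents whose count exceeds some threshold would then be passed on to the similarity calculation stage; the threshold itself is not given in the slides.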

  5. Filtering – possible solutions • Algorithms dedicated to digital libraries, • Available search engines, e.g., Elasticsearch, • Components of the Hadoop ecosystem, • What is the effect of precision and recall values on the performance and accuracy of the plagiarism detection process?
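To make the closing question concrete, the sketch below computes precision and recall for a filtering step. It treats the filter's output as a set of selected document ids and compares it with the set of true source documents. The function name and the example ids are hypothetical. One plausible reading, not stated on the slide, is that low precision sends more irrelevant documents to the costly similarity stage (hurting performance), while low recall drops true sources before they can be compared (hurting accuracy).

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) of a filter's selection against the true sources."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    # Precision: share of selected documents that are real sources.
    precision = true_positives / len(selected) if selected else 0.0
    # Recall: share of real sources that the filter kept.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the filter keeps four documents, two of which are real sources,
# and misses one real source ("d5").
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```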

  6. Class of problem: detecting similarities • Unstructured text documents, • Most of the documents available in the repositories must be analyzed, • New documents are continuously added to the repositories, • Effective filtering requires high recall and precision, • Finding similar sentences is more important than finding keywords.
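One generic way to meet these requirements is an in-memory map from sentence hash to the documents that contain it, so that new documents can be added at any time and a suspected profile only touches documents that share a hash with it. The sketch below is an assumption made for illustration; it is not a description of any of the models compared in the talk, and the class name `HashIndex` and the `min_shared` parameter are hypothetical.

```python
from collections import defaultdict


class HashIndex:
    """Hypothetical inverted index: sentence hash -> ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_document(self, doc_id, profile):
        # New documents can be indexed incrementally, without rebuilding the structure.
        for h in profile:
            self._postings[h].add(doc_id)

    def candidates(self, suspect_profile, min_shared=1):
        """Return {doc_id: shared hash count} for documents sharing at least min_shared hashes."""
        shared = defaultdict(int)
        for h in set(suspect_profile):
            for doc_id in self._postings.get(h, ()):
                shared[doc_id] += 1
        return {d: n for d, n in shared.items() if n >= min_shared}
```

Raising `min_shared` trades recall for precision: fewer candidates reach the similarity stage, at the risk of dropping a true source that shares only a few sentences.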

  7. Models described in the article • KASKADA HashMap, • HDFS, • HDFS HashMap, • HBase.

  8. Results – documents with fixed length

  9. Results – documents with different lengths

  10. Results – parallel tasks

  11. Results – scalability

  12. Results – cost of preparing structures

  13. Summary • Do you have any questions?
