Improving performance of a plagiarism detection system Andrzej Sobecki, Marcin Kępa IKC 2017
Plagiarism detection problem • Text documents in unstructured form • Finding potential source documents based on the suspected document • Searching in many repositories • The process must be fast and accurate
How we do that • Pipeline stages: Parsing, Hashing, Filtering, Calculating similarities • Diagram labels: "The important stage for accuracy", "The crucial stage for performance"
Filtering – current solution • A hash function h(x) turns each document into a document profile (one hash – one sentence) • Repositories hold the available documents together with their profiles • The suspected document's profile is compared with each document profile by counting identical hash values
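The filtering step on this slide can be illustrated with a minimal sketch. The slides do not specify h(x) or how sentences are split, so the MD5 hash, the punctuation-based splitter, and the names `sentence_hashes` and `rank_candidates` are assumptions made for illustration only.

```python
import hashlib
import re
from collections import Counter


def sentence_hashes(text):
    """Build a document profile: one hash per sentence (assumed: MD5 of the lowercased sentence)."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text)]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in sentences if s}


def rank_candidates(suspected_text, repository):
    """Count identical hash values between the suspected profile and each repository profile."""
    suspected = sentence_hashes(suspected_text)
    counts = Counter()
    for doc_id, text in repository.items():
        shared = len(suspected & sentence_hashes(text))
        if shared:
            counts[doc_id] = shared
    # Documents sharing the most sentence hashes come first.
    return counts.most_common()


repository = {
    "a": "The cat sat. Dogs bark loudly.",
    "b": "Nothing in common here.",
}
candidates = rank_candidates("Dogs bark loudly. The cat sat. Something new.", repository)
```

In this sketch only documents that share at least one sentence hash are passed on to the similarity stage; a real system would likely precompute and store the repository profiles rather than rehash them per query.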
Filtering – possible solutions • Algorithms dedicated to digital libraries, • Available search engines, e.g., Elasticsearch, • Components of the Hadoop ecosystem, • What effect do the precision and recall values have on the performance and accuracy of the plagiarism detection process?
Class of problem: detecting similarities • Unstructured text documents, • Most of the documents available in the repositories must be analyzed, • New documents are continuously added to the repositories, • Effective filtering with high recall and precision, • Finding similar sentences is more important than finding keywords.
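Precision and recall of a filter can be computed as in the sketch below. The function name and the example document identifiers are hypothetical. Roughly, low precision sends more irrelevant documents to the similarity stage (a performance cost), while low recall drops true source documents (an accuracy cost).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of filtered documents.

    retrieved: document ids the filter passed on (hypothetical ids)
    relevant:  document ids that are true sources of the suspected document
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# The filter returned four documents; two of the three true sources are among them.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```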
Models described in the article • KASKADA HashMap, • HDFS, • HDFS HashMap, • HBase
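The slides do not describe how the HashMap-based models are organized. As a rough illustration only, the sketch below assumes an in-memory map from sentence hash to the ids of the documents containing it; `build_index`, `filter_candidates`, and `min_shared` are hypothetical names and are not taken from the article.

```python
from collections import Counter, defaultdict


def build_index(profiles):
    """Map each sentence hash to the set of documents whose profile contains it (assumed layout)."""
    index = defaultdict(set)
    for doc_id, hashes in profiles.items():
        for h in hashes:
            index[h].add(doc_id)
    return index


def filter_candidates(index, suspected_hashes, min_shared=1):
    """Return documents sharing at least min_shared hashes with the suspected profile."""
    counts = Counter()
    for h in set(suspected_hashes):
        # .get avoids creating empty entries for hashes absent from the index.
        for doc_id in index.get(h, ()):
            counts[doc_id] += 1
    return {doc_id: c for doc_id, c in counts.items() if c >= min_shared}


profiles = {"a": {1, 2, 3}, "b": {3, 4}, "c": {9}}
index = build_index(profiles)
candidates = filter_candidates(index, [1, 3, 7])
```

With such a layout, a lookup touches only the documents that share a hash with the suspected profile instead of scanning every repository profile, at the price of building and storing the index up front.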
Results – documents with fixed length
Results – documents with different lengths
Results – parallel tasks
Results – scalability
Results – cost of preparing structures
Summary • Do you have any questions?