INSTITUTO POLITÉCNICO NACIONAL
Centro de Investigación en Computación
Miguel A. Sanchez-Perez, Grigori Sidorov, Alexander Gelbukh
Tuesday, 16 September 2014
1. Task
2. Methodology
3. Adaptive behavior
4. Results
5. Conclusions
6. Future Work
Text Alignment: Given a pair of documents, the task is to identify all contiguous maximal-length passages of reused text between them.
[Figure: pipeline diagram. Labels: suspicious document, collection of documents, source retrieval, source documents, text alignment, suspicious passages.]
Preprocessing Seeding Extension Filtering
Preprocessing:
1. Sentence splitting (Kiss's pretrained Punkt model)
2. Tokenizing (Treebank word tokenizer)
3. Keeping tokens starting with a letter or digit
4. Reducing to lowercase
5. Stemming (Porter algorithm)
6. Joining small sentences (1-2 words) with the next one
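The preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the regex-based `naive_split_sentences` and `naive_tokenize` are simple stand-ins for the Punkt model and the Treebank tokenizer, the `stem` argument is a hypothetical hook where a Porter stemmer would be plugged in (identity by default), and attaching a trailing short sentence to the previous one is an assumption, since the slide only covers joining with the next sentence.

```python
import re
from typing import Callable, List


def naive_split_sentences(text: str) -> List[str]:
    # Stand-in for the Punkt model: split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_tokenize(sentence: str) -> List[str]:
    # Stand-in for the Treebank tokenizer: word runs or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def preprocess(
    text: str,
    stem: Callable[[str], str] = lambda token: token,  # hypothetical stemmer hook
    min_words: int = 3,  # sentences with fewer words (1-2) are joined with the next
) -> List[List[str]]:
    sentences = []
    for raw in naive_split_sentences(text):
        # Keep tokens starting with a letter or digit, lowercase, then stem.
        tokens = [stem(t.lower()) for t in naive_tokenize(raw) if t[0].isalnum()]
        sentences.append(tokens)

    merged: List[List[str]] = []
    carry: List[str] = []
    for tokens in sentences:
        tokens = carry + tokens
        if len(tokens) < min_words:
            carry = tokens  # too short: prepend to the next sentence
        else:
            merged.append(tokens)
            carry = []
    if carry:
        # Assumption: a short final sentence has no successor, so it is
        # attached to the previous sentence (or kept alone if there is none).
        if merged:
            merged[-1].extend(carry)
        else:
            merged.append(carry)
    return merged
```

Each sentence comes back as a list of normalized tokens, which is the form the seeding stage needs for building sentence vectors.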
[Figure: histogram of sentence length (in words) in the PAN 2014 training corpus.]
Seeding:
Vector representation of sentences: TF-IDF, where sentences are the "documents," thus called TF-ISF (inverse sentence frequency).
"Documents": the union of the sentences of both documents.
Vector similarity: cosine similarity ≥ threshold th1 AND Dice similarity ≥ threshold th2.
Seeds: pairs of similar sentences.
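A minimal sketch of the seeding step, under stated assumptions: term frequency is taken as the raw count, ISF as log(N / sentence frequency), and Dice similarity is computed over the sets of terms in each sentence. The slides do not give these exact formulas, and the function names are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def tf_isf_vectors(sentences: List[List[str]]) -> List[Vector]:
    # Sentences play the role of "documents": ISF = log(N / sentence frequency).
    n = len(sentences)
    sentence_freq = Counter(term for s in sentences for term in set(s))
    return [
        {t: count * math.log(n / sentence_freq[t]) for t, count in Counter(s).items()}
        for s in sentences
    ]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dice(u: Vector, v: Vector) -> float:
    # Assumption: Dice coefficient over the sets of terms present in each sentence.
    a, b = set(u), set(v)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def find_seeds(
    susp: List[List[str]], src: List[List[str]], th1: float, th2: float
) -> List[Tuple[int, int]]:
    # The vector space is built from the union of the sentences of both documents.
    vectors = tf_isf_vectors(susp + src)
    susp_vecs, src_vecs = vectors[: len(susp)], vectors[len(susp):]
    return [
        (i, j)
        for i, u in enumerate(susp_vecs)
        for j, v in enumerate(src_vecs)
        if cosine(u, v) >= th1 and dice(u, v) >= th2
    ]
```

A term that occurs in every sentence gets an ISF of zero, so it stops contributing to the cosine without any explicit stopword list.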
Extension, grouping: seeds are grouped on the left side, then on the right side, then on the left side again.
Example: grouping with maxGap = 1 (group left, group right, group left).
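The alternating left/right grouping can be sketched as below. Several points are assumptions made for illustration: a seed is a pair of sentence indices with index 0 standing for the "left" side, two seeds belong to the same group when at most `max_gap` sentences lie between them on the side being processed, and the alternation stops once neither side splits a group any further.

```python
from typing import List, Tuple

Seed = Tuple[int, int]  # (left sentence index, right sentence index), an assumed layout


def cluster(seeds: List[Seed], side: int, max_gap: int) -> List[List[Seed]]:
    # Group seeds whose positions on `side` are separated by at most max_gap sentences.
    groups: List[List[Seed]] = []
    for seed in sorted(seeds, key=lambda s: s[side]):
        if groups and seed[side] - groups[-1][-1][side] - 1 <= max_gap:
            groups[-1].append(seed)
        else:
            groups.append([seed])
    return groups


def group_seeds(seeds: List[Seed], max_gap: int) -> List[List[Seed]]:
    # Each pending item: (group, side to cluster on, whether the other side is already stable).
    pending = [(sorted(set(seeds)), 0, False)]
    done: List[List[Seed]] = []
    while pending:
        group, side, other_stable = pending.pop()
        parts = cluster(group, side, max_gap)
        if len(parts) == 1:
            if other_stable:
                done.append(sorted(parts[0]))  # stable on both sides
            else:
                pending.append((parts[0], 1 - side, True))  # check the other side
        else:
            # The group was split: every part must be re-checked on the other side.
            pending.extend((part, 1 - side, False) for part in parts)
    return sorted(done)
```

For example, with `max_gap = 1` the seeds `(10, 3)` and `(11, 20)` are adjacent on the left side but far apart on the right side, so the second pass separates them.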
Grouping

Iteration  No plagiarism  None  Random  Translation  Summary
1          674            6803  6436    7637         3074
2          3              278   180     246          294
3          0              7     7       3            3
4          0              1     0       0            0
Example: validation with maxGap = 2.
Compute the cosine similarity. If cosine similarity < th3, regroup with maxGap - 1.
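The validation loop can be sketched as a recursion over `max_gap`. This is an illustrative outline only: `group_fn` and `fragment_similarity` are hypothetical callables supplied by the caller (the grouping routine and the cosine similarity computed for a group), and discarding a group that still fails once `max_gap` reaches 0 is an assumption, since the slides do not say what happens in that case.

```python
from typing import Callable, List, Tuple

Seed = Tuple[int, int]
Group = List[Seed]


def extend_with_validation(
    seeds: List[Seed],
    max_gap: int,
    th3: float,
    group_fn: Callable[[List[Seed], int], List[Group]],  # hypothetical grouping routine
    fragment_similarity: Callable[[Group], float],       # hypothetical cosine for a group
) -> List[Group]:
    accepted: List[Group] = []
    for group in group_fn(seeds, max_gap):
        if fragment_similarity(group) >= th3:
            accepted.append(group)  # the group passes validation
        elif max_gap > 0:
            # Validation failed: regroup the same seeds with a smaller gap tolerance.
            accepted.extend(
                extend_with_validation(
                    group, max_gap - 1, th3, group_fn, fragment_similarity
                )
            )
        # Assumption: a group that still fails at max_gap == 0 is discarded.
    return accepted
```

Lowering the gap tolerance only for groups that fail keeps well-formed groups intact while splitting loose ones into tighter candidates.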
Filtering:
1. Resolving overlapping cases A and B [formula not recoverable]
2. Removing small cases: if the number of characters in the left side OR the right side < minPlagLength, then the case is removed
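The small-case filter is simple enough to sketch directly. The `Case` layout (character offsets with exclusive ends) and the function name are assumptions made for illustration; only the rule itself comes from the slide. The overlap-resolution step is not sketched.

```python
from typing import List, NamedTuple


class Case(NamedTuple):
    # Assumed representation: character offsets, end offsets exclusive.
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def remove_small_cases(cases: List[Case], min_plag_length: int) -> List[Case]:
    # A case is removed if EITHER side is shorter than min_plag_length characters,
    # so it is kept only when both sides are long enough.
    return [
        c
        for c in cases
        if (c.left_end - c.left_start) >= min_plag_length
        and (c.right_end - c.right_start) >= min_plag_length
    ]
```

Because the removal condition is an OR, a long passage on one side does not save a case whose other side is too short.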
[Figure: cumulative histogram of the passages of plagiarism cases, for source documents and suspicious documents.]
Text alignment task: best result of all 11 participating systems, thanks to:
1. TF-ISF (inverse sentence frequency) measure for "soft" removal of stopwords
2. Recursive extension algorithm: dynamic adjustment of tolerance to gaps
3. Algorithm for resolution of overlapping cases by comparison of competing cases
4. Dynamic adjustment of parameters by type of obfuscation (summary vs. other types)
Future work:
1. Text reuse focused on paraphrase
2. Soft cosine to measure similarity between features
3. New strategy to resolve overlapping
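For reference, the soft cosine measure named above generalizes the cosine by weighting every pair of features with a feature-to-feature similarity. The sketch below uses the standard definition; the similarity matrix `sim` is a hypothetical input, and how it would be built for this task is not stated in the slides.

```python
import math
from typing import List, Sequence


def soft_cosine(
    a: Sequence[float], b: Sequence[float], sim: List[List[float]]
) -> float:
    # sim[i][j] is the similarity between features i and j (hypothetical input);
    # with the identity matrix this reduces to the ordinary cosine.
    def form(x: Sequence[float], y: Sequence[float]) -> float:
        return sum(
            sim[i][j] * x[i] * y[j] for i in range(len(x)) for j in range(len(y))
        )

    denominator = math.sqrt(form(a, a)) * math.sqrt(form(b, b))
    return form(a, b) / denominator if denominator else 0.0
```

Two vectors with no feature in common get an ordinary cosine of 0, but a nonzero soft cosine when their features are related.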
Thanks! http://www.gelbukh.com/plagiarism-detection/PAN-2014