INSTITUTO POLITÉCNICO NACIONAL, Centro de Investigación en Computación


SLIDE 1

INSTITUTO POLITÉCNICO NACIONAL

Centro de Investigación en Computación

Tuesday, 16 September 2014

Miguel A. Sanchez-Perez, Grigori Sidorov, Alexander Gelbukh

SLIDE 2

Task

Methodology

Adaptive behavior

Results

Conclusions

Future Work

SLIDE 3

[Figure: plagiarism detection pipeline. Labels: Collection of documents, Source Retrieval, Source documents, Suspicious document, Text Alignment, Suspicious passages]

Text Alignment: Given a pair of documents, the task is to identify all contiguous maximal-length passages of reused text between them.

SLIDE 4

1. Preprocessing
2. Seeding
3. Extension
4. Filtering

SLIDE 5

1. Sentence splitting (pretrained Punkt model, Kiss and Strunk)
2. Tokenizing (Treebank word tokenizer)
3. Keeping tokens starting with a letter or digit
4. Reducing to lowercase
5. Stemming (Porter algorithm)
6. Joining small sentences (1-2 words) with the next one
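
The token filtering, lowercasing, and sentence-joining steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes the sentences are already split and tokenized, it omits stemming, and the function names, the `max_short` parameter, and the choice to keep a trailing short sentence as-is are all assumptions of the sketch.

```python
def normalize_tokens(tokens):
    """Keep tokens that start with a letter or digit, reduced to lowercase."""
    return [t.lower() for t in tokens if t and t[0].isalnum()]


def join_short_sentences(sentences, max_short=2):
    """Merge every sentence of at most `max_short` tokens into the next one.

    `sentences` is a list of token lists. A short sentence at the very end
    has no successor, so this sketch keeps it unchanged (an assumption).
    """
    merged, carry = [], []
    for sent in sentences:
        current = carry + sent
        if len(current) <= max_short:
            carry = current          # still too short: carry it forward
        else:
            merged.append(current)
            carry = []
    if carry:
        merged.append(carry)
    return merged


# Hypothetical input: two already tokenized sentences.
raw = [["Yes", "."], ["It", "works", "well", "!"]]
clean = [normalize_tokens(s) for s in raw]
joined = join_short_sentences(clean)
```

Filtering runs before joining here so that punctuation tokens do not count toward the 1-2 word limit.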

SLIDE 6

[Figure: histogram of sentence length (in words), PAN 2014 training corpus]

SLIDE 7

Vector representation of sentences: TF-IDF, where sentences are “documents,” thus called TF-ISF (inverse sentence frequency).

“Documents”: union of the sentences of both docs.

Vector similarity: cosine similarity above threshold th1 AND Dice similarity above threshold th2.
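
A minimal sketch of this seeding step, under stated assumptions: ISF is taken as log(N / sf), where N is the number of sentences in both documents and sf is the number of sentences containing the term; Dice is computed over the sets of terms with non-zero weight; and the threshold values in the usage lines are illustrative only. The slides fix none of these details, and all function names are hypothetical.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """One sparse TF-ISF vector (dict term -> weight) per tokenized sentence."""
    n = len(sentences)
    sf = Counter(term for sent in sentences for term in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        vectors.append({t: tf[t] * math.log(n / sf[t]) for t in tf})
    return vectors


def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def dice(a, b):
    """Dice coefficient over terms with non-zero weight (assumed definition)."""
    ta = {t for t, w in a.items() if w != 0}
    tb = {t for t, w in b.items() if w != 0}
    return 2 * len(ta & tb) / (len(ta) + len(tb)) if ta or tb else 0.0


def find_seeds(susp, src, th1, th2):
    """Index pairs (i, j) of sentences that pass both thresholds."""
    vectors = tf_isf_vectors(susp + src)   # "documents": sentences of both docs
    sv, rv = vectors[:len(susp)], vectors[len(susp):]
    return [(i, j)
            for i, a in enumerate(sv)
            for j, b in enumerate(rv)
            if cosine(a, b) > th1 and dice(a, b) > th2]


susp = [["cat", "sat", "mat"], ["dog", "ran"]]
src = [["cat", "sat", "rug"], ["bird", "flew"]]
seeds = find_seeds(susp, src, th1=0.3, th2=0.3)   # illustrative thresholds
```

A term that occurs in every sentence gets ISF 0 and so drops out of both measures, which matches the "soft" stopword removal the deck mentions later.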

SLIDE 8

Seeds: pairs of similar sentences

SLIDE 9

Grouping

Group left

SLIDE 10

Grouping

Group left

SLIDE 11

Grouping

Group right

SLIDE 12

Grouping

Group right

SLIDE 13

Grouping

Group left

SLIDE 14

Grouping

Example: maxGap = 1

Group left

SLIDE 15

Grouping

Example: maxGap = 1

Group left

SLIDE 16

Grouping

Example: maxGap = 1

Group right

SLIDE 17

Grouping

Example: maxGap = 1

Group right

SLIDE 18

Grouping

Example: maxGap = 1

Group left

SLIDE 19

Grouping

Example: maxGap = 1

Group left
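
The alternating left/right grouping in the slides above can be sketched as follows. This is a sketch under assumptions, not the authors' implementation: a seed is a pair (left sentence index, right sentence index), two seeds stay in the same group on one side when at most `max_gap` sentences lie between them, and grouping repeats on alternating sides until no group splits further. The function names are hypothetical.

```python
def split_by_gap(seeds, side, max_gap):
    """Split seeds into runs whose indices on `side` (0 = left, 1 = right)
    are separated by at most `max_gap` skipped sentences."""
    ordered = sorted(seeds, key=lambda s: s[side])
    groups = [[ordered[0]]]
    for seed in ordered[1:]:
        if seed[side] - groups[-1][-1][side] - 1 <= max_gap:
            groups[-1].append(seed)
        else:
            groups.append([seed])
    return groups


def group_seeds(seeds, max_gap):
    """Group on the left side, then on the right, until groups are stable."""
    if not seeds:
        return []
    for side in (0, 1):
        groups = split_by_gap(seeds, side, max_gap)
        if len(groups) > 1:
            # Every subgroup is strictly smaller, so the recursion terminates.
            return [g for part in groups for g in group_seeds(part, max_gap)]
    return [sorted(seeds)]


# Hypothetical seeds: (left sentence index, right sentence index).
seeds = [(0, 0), (1, 1), (2, 10), (8, 2)]
groups = group_seeds(seeds, max_gap=1)
```

In this example the left pass separates (8, 2), and the right pass then separates (2, 10), leaving (0, 0) and (1, 1) as one group.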

SLIDE 20

Grouping

Iteration   No plagiarism   None   Random   Translation   Summary
1           674             6803   6436     7637          3074
2           3               278    180      246           294
3           7 7 3 3 [column positions not recoverable]
4           1 [column position not recoverable]

SLIDE 21

Validation

Example: maxGap = 2

SLIDE 22

Validation

Example: maxGap = 2

Cosine similarity

If cosine similarity < th3: regroup with maxGap - 1
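
The validation rule can be sketched as a recursion that retries a rejected case with a smaller gap tolerance. This is a hedged sketch: `group_fn` and `similarity_fn` are hypothetical callables supplied by the caller (a grouping routine and a cosine similarity over the two sides of a case), and dropping a case that still fails at `min_gap` is an assumption, since the slides do not say what happens there.

```python
def extend(seeds, max_gap, th3, group_fn, similarity_fn, min_gap=0):
    """Group seeds, validate each group, and regroup failures with max_gap - 1.

    group_fn(seeds, max_gap) -> list of groups (each a list of seeds)
    similarity_fn(group)     -> similarity of the two sides of the group
    """
    if not seeds:
        return []
    accepted = []
    for case in group_fn(seeds, max_gap):
        if similarity_fn(case) >= th3:
            accepted.append(case)
        elif max_gap > min_gap:
            # Too dissimilar: split the case again with less tolerance to gaps.
            accepted.extend(
                extend(case, max_gap - 1, th3, group_fn, similarity_fn, min_gap)
            )
        # else: dropped (assumption of this sketch, not stated in the slides)
    return accepted
```

Passing the grouping and similarity functions as arguments keeps the sketch independent of how seeds are clustered or how the two sides are vectorized.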

SLIDE 23

Validation

SLIDE 24

If the number of characters in the left side OR the right side is < minPlagLength, then the case is removed.

1. Resolving overlapping
2. Removing small cases
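
A minimal sketch of the second filtering step, removing small cases. It assumes a case is stored as character offsets with exclusive end positions; the `Case` type, its field names, and the length value in the usage lines are hypothetical.

```python
from typing import NamedTuple


class Case(NamedTuple):
    """Character offsets of a detected case on both sides (ends exclusive)."""
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def remove_small_cases(cases, min_plag_length):
    """Drop a case if its left OR right side is shorter than min_plag_length."""
    return [
        c for c in cases
        if (c.left_end - c.left_start) >= min_plag_length
        and (c.right_end - c.right_start) >= min_plag_length
    ]


cases = [Case(0, 400, 10, 390), Case(500, 560, 700, 1000)]
kept = remove_small_cases(cases, min_plag_length=150)   # illustrative value
```

The second case is removed because its left side spans only 60 characters, even though its right side is long enough.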

SLIDE 25

[Figure: cumulative histogram of plagiarism case passages. Panels: Source documents, Suspicious documents]

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

Text alignment task: best result of all 11 participating systems, thanks to:

TF-ISF (inverse sentence frequency) measure for “soft” removal of stopwords.

Recursive extension algorithm: dynamic adjustment of tolerance to gaps

Algorithm for resolution of overlapping cases by comparison of competing cases

Dynamic adjustment of parameters by type of obfuscation (summary vs. other types)

SLIDE 30

Text reuse focused on paraphrase

Soft cosine to measure similarity between features

New strategy to resolve overlapping
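
For the soft cosine item, a minimal sketch of the general measure: ordinary cosine with a feature-to-feature similarity matrix inserted into each dot product. The matrix `sim` and the example vectors are hypothetical, and how feature similarity would be defined for this task is not stated in the slides.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine of dense vectors a and b; sim[i][j] is the similarity
    between feature i and feature j (sim[i][i] is normally 1)."""
    def form(x, y):
        return sum(sim[i][j] * x[i] * y[j]
                   for i in range(len(x))
                   for j in range(len(y)))

    denom = math.sqrt(form(a, a)) * math.sqrt(form(b, b))
    return form(a, b) / denom if denom else 0.0


a, b = [1.0, 0.0], [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
related = [[1.0, 0.5], [0.5, 1.0]]
plain = soft_cosine(a, b, identity)   # reduces to ordinary cosine
soft = soft_cosine(a, b, related)     # related features raise the score
```

With the identity matrix the two vectors share no features and score 0; once the two features are declared 0.5 similar, the same vectors score 0.5.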

SLIDE 31

Thanks!

http://www.gelbukh.com/plagiarism-detection/PAN-2014