Chair of Network Architectures and Services
Department of Informatics
Technical University of Munich

Automated Detection of Plagiarism based on Whitespace and History
Markus Ongyerth
December 4, 2017

Contents
• The Idea
• Implementation
• Evaluation
• Further Work

M. Ongyerth – gitplag

What we want to find

struct icmp6_neighbor_solicit ␣ {
    struct ether_header ehdr;
    struct ipv6_hdr iphdr;
    struct neighbor_solicit_payload pay;
} __attribute__ (( packed ));

if(buffer[offset + 6] != (byte) 0b00111010 ){
    return false;
}
if(buffer[offset + 7] != (byte) 0b11111111 ){
    return false;
}

The GRNVS dataset

Year                  2016          2017
Assignment            3      4      2      3
Submissions           236    199    355    223
Avg. commits          29     15     27     42
Cases of plagiarism   8      4      4      1

• Automatic tests are triggered via Git

In the past
• Check for plagiarism with MOSS
• Check the results by hand
• Search for “strong” evidence by hand

Our two approaches

Whitespace errors
• Weird/broken indentation
• Multiple
• Trailing whitespace
• ^ → ␣ →
• → struct␣␣struct;
• → struct␣struct;␣$

Identifier
• Unintuitive names
• Copies of typos
• 0b1000001
• numericToTextFormat
• java.sql.Time
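
The whitespace errors listed above could be turned into comparable tokens roughly as follows. This is a minimal sketch, not the gitplag implementation: the function names `show` and `whitespace_tokens` and the token format (`indent:`, `multiple:`, `trailing:`) are assumptions made for illustration. Tabs are rendered as → and spaces as ␣, matching the notation on the slide.

```python
import re


def show(whitespace: str) -> str:
    """Render tabs as → and spaces as ␣ (notation borrowed from the slide)."""
    return whitespace.replace("\t", "→").replace(" ", "␣")


def whitespace_tokens(line: str) -> list[str]:
    """Return the whitespace anomalies of one source line.

    Hypothetical token format; the real tool may record different details.
    """
    body = line.rstrip("\r\n")
    code = body.rstrip(" \t")              # line without trailing whitespace
    trailing = body[len(code):]
    indent = re.match(r"[ \t]*", code).group(0)
    rest = code[len(indent):]

    tokens = []
    # Indentation that mixes tabs and spaces.
    if " " in indent and "\t" in indent:
        tokens.append("indent:" + show(indent))
    # Runs of two or more spaces inside the line.
    for match in re.finditer(r" {2,}", rest):
        tokens.append("multiple:" + show(match.group(0)))
    # Whitespace left at the end of the line.
    if trailing:
        tokens.append("trailing:" + show(trailing))
    return tokens
```

A cleanly formatted line yields no tokens, so only lines with unusual whitespace contribute evidence.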

Version control history
• Perpetrators try to hide
• They “destroy” evidence
• The (Git) history preserves evidence
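
Why the history helps can be shown with a small sketch. It assumes the versions of a file are already available as strings, oldest first (for example collected with `git show <commit>:<path>`), and it uses a deliberately simple, hypothetical tokenizer that only records trailing whitespace. The names `trailing_whitespace_tokens` and `tokens_in_history` are illustrative and not taken from gitplag.

```python
def trailing_whitespace_tokens(line: str) -> list[tuple[str, str]]:
    """Hypothetical tokenizer: (visible code, trailing whitespace) per line."""
    stripped = line.rstrip(" \t")
    trailing = line[len(stripped):]
    return [(stripped.strip(), trailing)] if trailing else []


def tokens_in_history(versions: list[str]) -> set[tuple[str, str]]:
    """Collect the tokens of every committed version of a file.

    A token that was cleaned up in a later commit is still reported,
    because it is present in an earlier version.
    """
    seen = set()
    for text in versions:
        for line in text.splitlines():
            seen.update(trailing_whitespace_tokens(line))
    return seen


# Two versions of the same file; the second commit removes the
# trailing spaces after "int x;".
history = ["int x;  \nreturn 0;\n", "int x;\nreturn 0;\n"]
latest_only = tokens_in_history(history[-1:])
with_history = tokens_in_history(history)
```

Looking only at the latest version finds nothing here, while the full history still contains the token for `int x;` followed by two spaces.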

Implementation
1. Read and tokenize submissions
2. Filter to viable tokens
3. Compare submissions pairwise
4. Generate report / provide interactive interface
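
The four steps above can be sketched end to end. Everything specific in this example is an assumption: the tokenizer simply extracts identifiers, a token counts as viable when it occurs in at most `max_submissions` submissions (the slides do not define the filter, and this parameter is not necessarily the "Viability" value used in the evaluation), and the report is a list ranked by the number of shared tokens.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(source: str) -> set[str]:
    """Step 1 (hypothetical tokenizer): the identifiers used in a submission."""
    return set(re.findall(r"[A-Za-z_]\w*", source))


def rank_pairs(submissions: dict[str, str], max_submissions: int = 2):
    """Return (name_a, name_b, shared_tokens), most shared tokens first."""
    tokens = {name: tokenize(src) for name, src in submissions.items()}

    # Step 2 (assumed criterion): keep only tokens that few submissions use;
    # tokens that appear everywhere carry no evidence.
    usage = Counter(tok for toks in tokens.values() for tok in toks)
    viable = {
        name: {tok for tok in toks if usage[tok] <= max_submissions}
        for name, toks in tokens.items()
    }

    # Step 3: compare every pair of submissions.
    report = []
    for a, b in combinations(sorted(viable), 2):
        shared = viable[a] & viable[b]
        if shared:
            report.append((a, b, sorted(shared)))

    # Step 4: a ranked list stands in for the report.
    report.sort(key=lambda entry: len(entry[2]), reverse=True)
    return report


# Made-up submissions; "recieve_buffer" is a copied typo.
example = {
    "alice": "int recieve_buffer(int x)",
    "bob": "int recieve_buffer(int y)",
    "carol": "int convert(int z)",
}
result = rank_pairs(example)
```

In this toy input, `int` is used by all three submissions and is filtered out, so the only reported pair is alice and bob, linked by the shared typo.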

Differences from other systems
• Whitespace ⇐ usually ignored
• Identifiers ⇐ usually ignored
• History ⇐ usually not available

ROC graphs
[Figure: ROC graph with FPF (false positive fraction) on the x-axis and sensitivity on the y-axis, both from 0 to 1. The region above the diagonal is labeled "Better than guessing", the region below it "Worse than guessing".]
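
Each point in a ROC graph is a pair of FPF and sensitivity, which can be computed from the flagged pairs and the known cases. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and FPF = FP / (FP + TN); the function name `roc_point` and the representation of submission pairs as set elements are assumptions for illustration.

```python
def roc_point(flagged: set, plagiarized: set, all_pairs: set) -> tuple[float, float]:
    """Return (FPF, sensitivity) for one detector setting.

    flagged:     pairs the detector reports
    plagiarized: pairs known to be plagiarism
    all_pairs:   every compared pair (superset of both sets above)
    """
    tp = len(flagged & plagiarized)      # reported and truly plagiarized
    fp = len(flagged - plagiarized)      # reported but innocent
    fn = len(plagiarized - flagged)      # plagiarized but missed
    tn = len(all_pairs) - tp - fp - fn   # innocent and not reported

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    fpf = fp / (fp + tn) if fp + tn else 0.0
    return fpf, sensitivity


# Ten compared pairs, identified by number; pairs 1 and 2 are plagiarism,
# and the detector flags pairs 1 and 3.
point = roc_point(flagged={1, 3}, plagiarized={1, 2}, all_pairs=set(range(10)))
```

Here one of two real cases is found (sensitivity 0.5) and one of eight innocent pairs is flagged (FPF 0.125).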

Detection rate (2016, whitespace with Git)
[Figure: two ROC plots of sensitivity over FPF (FPF axis scaled by 10^-2), with curves for All, Viability=5, Viability=15, and Viability=30. Panels: (a) Assignment 2, (b) Assignment 3.]

Detection rate (2017, identifier)
[Figure: two ROC plots of sensitivity over FPF (FPF axis scaled by 10^-3), with curves for Identifier and With Git. Panels: (c) Assignment 2, (d) Assignment 3.]

It’s not perfect
• Shared external file
• Students worked together
• Incomplete file filter

Time requirements

Assignment   Git   Whitespace   Identifier
3            No    8 s          10 s
3            Yes   18 s         24 s
4            No    3 s          4 s
4            Yes   6 s          9 s

Further work
• Improve detection of usable files
• Create and evaluate other tokenizing mechanisms
• Some implementation details

Related work
• Moss
• Gitplag
• (Docoloc)
• Measuring Whitespace Pattern Sequences as an Indication of Plagiarism (Baer et al.)