
Vandalism Detection in Wikidata



  1. Vandalism Detection in Wikidata. Stefan Heindorf, Martin Potthast, Benno Stein, Gregor Engels. CIKM 2016, October 25, 2016

  2. Motivation

  10. Figure labels: Revisions; Item head; Item body. Dates shown: Feb 22, 2013; May 13, 2013; May 30, 2013

  18. Why is it a problem?
      Patrolling, reverting, warning, protecting, blocking
      • Over 2 million manual edits per month
      • A lot of tedious work
      • Vandalism is not detected in time

  19. Research Question: How to detect damaging changes to crowdsourced knowledge bases?

  20. Our Approach
      1. Label Dataset → Vandalism Corpus [SIGIR ’15]
      2. Study Vandalism Characteristics → 47 Features
      3. Experiment with ML → Multiple-Instance Learning
      4. Compare with state of the art → 2 Baselines

  25. Corpus [SIGIR ’15]
      Figure: Revisions over time (x-axis: Month), with spans labeled Training, Validation, and Test
      • 103,000 vandalism revisions
      • 24 million manual revisions → 0.4% vandalism
      • Item head: 1.3% vandalism
      • Item body: 0.2% vandalism
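The overall vandalism rate on the corpus slide follows directly from the two counts it reports. A minimal arithmetic check, using only the rounded figures from the slide (the variable names are illustrative):

```python
# Rounded counts as reported on the corpus slide.
vandalism_revisions = 103_000
manual_revisions = 24_000_000

# Share of manual revisions that are vandalism.
vandalism_rate = vandalism_revisions / manual_revisions
print(f"Vandalism rate: {vandalism_rate:.2%}")

# Class imbalance: roughly how many manual revisions there are per vandalism revision.
revisions_per_vandalism = round(manual_revisions / vandalism_revisions)
print(f"About 1 in {revisions_per_vandalism} manual revisions is vandalism")
```

The result rounds to the 0.4% stated on the slide and shows how strongly imbalanced the classification problem is.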

  36. Features (47 in total)
      Content Features
      • 11 Character features (e.g., lowerCaseRatio, digitRatio)
      • 9 Word features (e.g., badWordRatio)
      • 4 Sentence features (e.g., commentSitelinkSimilarity)
      • 3 Statement features (e.g., propertyFrequency)
      Context Features
      • 10 User features (e.g., userCountry)
      • 2 Item features (e.g., logItemFrequency)
      • 8 Revision features (e.g., revisionTag, revisionLanguage)
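To make the content features more concrete, here is a minimal sketch of how three of the named features might be computed. The slides give only the feature names, so the definitions below (share of lowercase characters, share of digits, share of words found in a bad-word list) and the tiny word list are assumptions for illustration, not the authors' implementation.

```python
def lower_case_ratio(text: str) -> float:
    """Assumed definition: fraction of characters that are lowercase letters."""
    return sum(c.islower() for c in text) / len(text) if text else 0.0


def digit_ratio(text: str) -> float:
    """Assumed definition: fraction of characters that are digits."""
    return sum(c.isdigit() for c in text) / len(text) if text else 0.0


def bad_word_ratio(text: str, bad_words: set[str]) -> float:
    """Assumed definition: fraction of whitespace-separated words in a bad-word list."""
    words = text.lower().split()
    return sum(w in bad_words for w in words) / len(words) if words else 0.0


# Hypothetical word list; a real detector would use a curated lexicon.
BAD_WORDS = {"stupid", "idiot"}

print(lower_case_ratio("Barack Obama Aloha"))
print(digit_ratio("abc123"))
print(bad_word_ratio("You are STUPID", BAD_WORDS))
```

Each function returns a value in [0, 1] and falls back to 0.0 for empty input, which keeps the features well defined for revisions that carry no text.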

  37. Features (47 in total): revisionTag (T = thousand)

      Revisions             Vand.   Total      Prob.
      With tags             52 T    8,619 T    0.60%
        By abuse filter     49 T    122 T      39.90%
        By editing tools    3 T     8,496 T    0.03%
      Without tags          52 T    15,386 T   0.34%
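The "Prob." column of the revisionTag table is the share of vandalism among all revisions in each row. A small sketch that recomputes it from the counts on the slide follows; because those counts are rounded to thousands, the recomputed values can differ slightly from the printed percentages (most visibly for the abuse filter row). The dictionary layout and function name are illustrative.

```python
# (vandalism, total) counts in thousands, copied from the slide's table.
tag_counts = {
    "with tags": (52, 8_619),
    "by abuse filter": (49, 122),
    "by editing tools": (3, 8_496),
    "without tags": (52, 15_386),
}


def vandalism_probability(vandalism: int, total: int) -> float:
    """Empirical probability that a revision in this group is vandalism."""
    return vandalism / total


probabilities = {
    name: vandalism_probability(vandalism, total)
    for name, (vandalism, total) in tag_counts.items()
}

for name, probability in probabilities.items():
    print(f"{name:>16}: {probability:.2%}")
```

The contrast between the two kinds of tags is what makes revisionTag informative: revisions tagged by the abuse filter are far more likely to be vandalism than revisions tagged by editing tools.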

  42. Multiple-Instance Learning
      • Observation: Vandalism seldom occurs in isolation
        Session 1:
          22:35, 11 September 2013 184.19.64.111 (talk) . . (Changed English label: Barack Obama Aloha)
          22:35, 11 September 2013 184.19.64.111 (talk) . . (Added English alias: Lulu:):):):):):):))
        Session 2:
          12:05, 11 September 2013 MatmaBot (talk | contribs) . . (Changed Polish description: imported
      • Idea: Apply Multiple-Instance Learning
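The session idea can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes a session is a run of consecutive revisions of one item by the same user, and that each revision in a session receives the maximum per-revision score in that session. The slides do not state the aggregation rule, and the scores below are made-up placeholders.

```python
from itertools import groupby

# Revision history of one item, newest first, with hypothetical classifier scores.
revisions = [
    {"user": "184.19.64.111", "score": 0.7},
    {"user": "184.19.64.111", "score": 0.9},
    {"user": "MatmaBot", "score": 0.1},
]


def split_into_sessions(revs):
    """Group consecutive revisions by the same user into sessions (bags)."""
    return [list(group) for _, group in groupby(revs, key=lambda r: r["user"])]


def session_level_scores(revs):
    """Assign every revision the maximum score found in its session (assumed rule)."""
    scores = []
    for session in split_into_sessions(revs):
        bag_score = max(r["score"] for r in session)
        scores.extend([bag_score] * len(session))
    return scores


print(len(split_into_sessions(revisions)))
print(session_level_scores(revisions))
```

With this rule, one clearly suspicious edit raises the score of the other edits in the same session, which reflects the observation that vandalism seldom occurs in isolation.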

  50. WDVD vs. Baselines
      • WDVD (our approach): Wikidata Vandalism Detector
      • FILTER (baseline): Wikidata Abuse Filter
      • ORES (baseline): Objective Revision Evaluation Service
