
FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia

Oliver Ferschke, Iryna Gurevych and Marc Rittberger. CLEF 2012 Labs and Workshop, Notebook Papers. Rome, Italy, September 17-20, 2012.


  1. FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia
     Oliver Ferschke, Iryna Gurevych and Marc Rittberger
     CLEF 2012 Labs and Workshop, Notebook Papers. Rome, Italy, September 17-20, 2012

  2. Introduction
     Oliver Ferschke, Iryna Gurevych, Marc Rittberger
     19.09.2012 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Oliver Ferschke

  3. FlawFinder
     • Task-based system with multiple processing pipelines
     • [Architecture diagram. Labels: Page IDs, JWPL, Reader, Linguistic Preprocessing, Feature Extraction, Training / Classification, Writer, Datastore / Results]

  4. Data Import
     • Document retrieval via Java Wikipedia Library (JWPL) and Wikipedia Revision Toolkit
       • article text
       • revision history
       • revision metadata (authors, edit comments, timestamps)
       • links (in/out, internal/external)
     • JWPL database based on the Wikipedia data dump from January 4th, 2012
     http://jwpl.googlecode.com

  5. Preprocessing
     • UIMA-based NLP components for preprocessing from the Darmstadt Knowledge Processing Repository
     • [Diagram: Linguistic Preprocessing. Components: Wikitext Parser, Sentence Splitter, Tokenizer, Stopword Filter, Named Entity Recognizer]
     http://dkpro-core.googlecode.com

  6. Features
     • 32 feature types in 7 categories
       • NGram features
       • Structural features
       • Reference features
       • Network features
       • Named entity features
       • Revision-based features
       • Other features
     • ClearTK framework
       • "plug and play" feature extractors
       • independent from utilized ML toolkit
     • Information Gain approach for feature selection
     • Unsupervised discretization of numeric features
     http://cleartk.googlecode.com
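The feature slide names Information Gain for feature selection and unsupervised discretization of numeric features, without giving details. The following is a minimal sketch of how the two can fit together, assuming equal-width binning (the slides do not say which discretization was used) and made-up toy data. All function names are hypothetical and this is not the ClearTK API.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def equal_width_bins(values, n_bins=3):
    """Unsupervised discretization: map each number to an equal-width bin index.

    Equal-width binning is an assumption; the slides do not name the method.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data (made up): reference counts for six articles, flawed = 1.
ref_counts = [0, 1, 2, 10, 12, 15]
flawed = [1, 1, 1, 0, 0, 0]
bins = equal_width_bins(ref_counts, n_bins=2)
gain = information_gain(bins, flawed)
```

Features would then be ranked by `gain` and the top-scoring ones kept; the cut-off is a tunable parameter that the slides do not specify.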

  7. Classification Approach
     • Binary classification
       • Naive Bayes
       • AdaBoost with depth-limited C4.5 decision trees as weak classifiers
     • Negative instances
       • Random sample of untagged articles
     • Evaluation
       • 10-fold cross-validation on 1000 documents
       • Stable sampling of negative instances in one evaluation run
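The evaluation setup above can be sketched as follows. This is only an illustration of reproducible negative sampling plus 10-fold splitting, not the authors' implementation. The 500/500 split between positive and negative documents, the page IDs, the seed and the function names are all assumptions; the slide only states 1000 documents in total. "Stable sampling" is read here as reusing the same negative sample throughout one run.

```python
import random


def sample_negatives(untagged_ids, n, seed=42):
    """Draw a reproducible random sample of untagged articles as negative instances."""
    rng = random.Random(seed)
    return rng.sample(sorted(untagged_ids), n)


def k_folds(items, k=10, seed=42):
    """Shuffle once with a fixed seed and yield (train, test) pairs for k-fold cross-validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical page IDs: 500 tagged (positive) and 5000 untagged candidates.
positives = list(range(500))
untagged = list(range(10_000, 15_000))
negatives = sample_negatives(untagged, n=500)
dataset = [(pid, 1) for pid in positives] + [(pid, 0) for pid in negatives]
splits = list(k_folds(dataset, k=10))
```

Fixing the seed keeps the negative sample identical across folds and classifier configurations, so differences in scores cannot be attributed to a different draw of negatives.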

  8. Parameter Optimization
     • The overall system is a "pipeline of pipelines".
     • Individual pipelines can be parameterized.
     Parameter optimization:
     • Find the best parameter setting across all pipelines
     • Report on performance for pipeline configurations
     DKPro Lab:
     • Task-based processing
     • Parameter injection
     • Global configuration
     • Report probes gather statistics for a global report
     http://dkpro-lab.googlecode.com
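The idea of searching one parameter space across all pipelines can be sketched as an exhaustive grid. This is not the DKPro Lab API. The parameter names and values are hypothetical, and `run_pipelines` is a placeholder that returns a dummy score so that the sketch runs; a real setup would execute the preprocessing, feature extraction and classification tasks and return a measured result.

```python
from itertools import product


def parameter_space(dimensions):
    """Yield every combination of parameter values as a dict."""
    names = sorted(dimensions)
    for values in product(*(dimensions[name] for name in names)):
        yield dict(zip(names, values))


def run_pipelines(config):
    """Placeholder for the chained pipelines.

    The returned number is a dummy value, not a measured result.
    """
    score = 0.5
    score += 0.1 if config["classifier"] == "adaboost" else 0.0
    score += 0.05 if config["ngram_max"] == 3 else 0.0
    return score


# Hypothetical parameter dimensions, injected into every pipeline run.
dimensions = {
    "classifier": ["naive_bayes", "adaboost"],
    "ngram_max": [1, 2, 3],
    "min_frequency": [5, 10],
}

# One report row per configuration, then pick the best-scoring one.
report = [(run_pipelines(config), config) for config in parameter_space(dimensions)]
best_score, best_config = max(report, key=lambda row: row[0])
```

Keeping the full `report` rather than only the maximum mirrors the slide's second goal: reporting performance for every pipeline configuration.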

  9. Error Analysis and Evaluation
     Common error sources:
     • Outdated labels (positive instances)
     • Missing labels (negative instances)
     • Unclear label definitions
       • reference flaws in particular are often confused
     • Section-scope and article-scope flaws mixed

  10. Conclusions & Outlook
      • Use the article revision in which the tag was first inserted
        • Solves the outdated label problem
      • Use revision history for identifying negative instances
        • Solves the missing label problem
      • Separate treatment of section- and article-scope templates
      • Real-world application: multi-flaw classification
        • problems with overlaps in flaw definitions
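Locating the revision in which a flaw tag was first inserted could look roughly like the sketch below. It assumes the revision history is already available as (timestamp, wikitext) pairs with ISO-formatted timestamps; in practice it would come from a revision store such as the Wikipedia Revision Toolkit. The function name and the `Unreferenced` template are illustrative choices, not taken from the slides.

```python
import re


def first_tagged_revision(revisions, template_name):
    """Return the earliest (timestamp, wikitext) pair whose text contains the template.

    Revisions are sorted by timestamp here, so input order does not matter.
    Returns None if the template never appears.
    """
    pattern = re.compile(
        r"\{\{\s*" + re.escape(template_name) + r"\s*(?:\||\}\})",
        re.IGNORECASE,
    )
    for timestamp, text in sorted(revisions, key=lambda rev: rev[0]):
        if pattern.search(text):
            return timestamp, text
    return None


# Toy history (made up): the tag appears in March 2011 and is still present later.
history = [
    ("2011-03-01", "Some text. {{Unreferenced|date=March 2011}}"),
    ("2010-01-01", "Some text."),
    ("2011-09-01", "Some text.<ref>Source</ref> {{Unreferenced|date=March 2011}}"),
]
first = first_tagged_revision(history, "Unreferenced")
```

In the toy history the latest revision already contains a reference while still carrying the tag, which is the outdated-label situation; the text of the first tagged revision is the one that actually exhibits the flaw.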

  11. Thank you for your attention! Ubiquitous Knowledge Processing Lab http://www.ukp.tu-darmstadt.de


  13. Features: NGram features
      • Token unigrams, bigrams and trigrams
      • Extracted from article text without markup
      • Minimum frequency (5)
      • Stopword filtered
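The n-gram settings on this slide (token unigrams to trigrams, stopword filtering, minimum frequency 5) can be sketched as below. Several details are assumptions: the stopword list is a tiny illustrative one, tokens are split on whitespace, stopwords are removed before n-grams are built, and the frequency threshold is applied to the total count over all documents. The slides do not specify any of these.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "in", "and"}  # tiny illustrative list


def token_ngrams(tokens, n_max=3):
    """Return all token n-grams for n = 1..n_max as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def ngram_vocabulary(documents, n_max=3, min_frequency=5):
    """Keep n-grams that occur at least `min_frequency` times across all documents."""
    counts = Counter()
    for text in documents:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        counts.update(token_ngrams(tokens, n_max))
    return {gram for gram, count in counts.items() if count >= min_frequency}


# Toy corpus (made up): one sentence repeated five times plus one rare sentence.
docs = ["the best band in the world"] * 5 + ["a band of the year"]
vocab = ngram_vocabulary(docs)
```

Each surviving n-gram would become one feature; the frequency threshold keeps rare n-grams such as `("year",)` out of the feature space.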

  14. Features: Structural features
      • Empty sections
      • Number of sections
      • Mean section length
      • Markup to text ratio
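Three of the structural features can be approximated directly on raw wikitext, as in the rough sketch below. It assumes sections are delimited by `== Heading ==` lines and measures section length in characters; both are simplifications chosen for illustration, and the markup to text ratio is left out because the slides do not define it.

```python
import re

# A wikitext heading line such as "== History ==" or "=== Early years ===".
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def structural_features(wikitext):
    """Count sections, empty sections and the mean section length in characters."""
    headings = list(HEADING.finditer(wikitext))
    bodies = []
    for i, match in enumerate(headings):
        end = headings[i + 1].start() if i + 1 < len(headings) else len(wikitext)
        bodies.append(wikitext[match.end():end].strip())
    lengths = [len(body) for body in bodies]
    return {
        "sections": len(bodies),
        "empty_sections": sum(1 for n in lengths if n == 0),
        "mean_section_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Toy article (made up) with one empty section.
article = (
    "Lead text.\n"
    "== History ==\nFounded in 1990.\n"
    "== Reception ==\n\n"
    "== See also ==\n* Other\n"
)
features = structural_features(article)
```

The lead paragraph before the first heading is not counted as a section here; whether it should be is a design choice the slides leave open.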

  15. Features: Reference features
      • Number of references
      • Reference lists
      • Reference to text ratio
      • References per sentence
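A crude approximation of some reference features on raw wikitext is sketched below. It assumes references appear as `<ref>` tags and reference lists as `<references/>` or `{{reflist}}`, and it splits sentences on end punctuation, which is far simpler than a real sentence splitter. The reference to text ratio is omitted because the slides do not define it. All names are hypothetical.

```python
import re

# Matches <ref>...</ref> and self-closing <ref name="x"/>, but not <references/>.
REF_TAG = re.compile(r"<ref(?:\s[^>/]*)?(?:/>|>.*?</ref>)", re.IGNORECASE | re.DOTALL)
# Matches the two common ways of placing a reference list.
REF_LIST = re.compile(r"<references\s*/?>|\{\{\s*reflist[^}]*\}\}", re.IGNORECASE)


def reference_features(wikitext):
    """Count references, detect a reference list and relate references to sentences."""
    refs = REF_TAG.findall(wikitext)
    # Remove reference markup before counting sentences.
    plain = REF_LIST.sub("", REF_TAG.sub("", wikitext)).strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", plain) if s]
    return {
        "references": len(refs),
        "has_reference_list": bool(REF_LIST.search(wikitext)),
        "references_per_sentence": len(refs) / len(sentences) if sentences else 0.0,
    }


# Toy article (made up): three sentences, two references, one reference list.
text = (
    "Berlin is a city.<ref>Source A</ref> It is large.<ref name=\"b\"/> "
    "It has parks.\n<references/>"
)
features = reference_features(text)
```

The pattern for `<ref>` does not handle attribute values that contain a slash; a wikitext parser would be the more robust choice in a real system.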

  16. Features: Network features
      • External links
      • Inlinks
      • Outlinks

  17. Features: Named entity features
      • NER types
        • Organization
        • Person
        • Location
      • Absolute numbers and NER to text ratio

  18. Features: Revision-based features
      • Number of revisions
      • Number of unique contributors
      • Number of registered contributors
      • Article age
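Given revision metadata, the four revision-based features reduce to simple aggregations. The sketch below assumes each revision is a dict with the hypothetical keys `timestamp`, `contributor` and `registered`, and it measures article age in days relative to a reference date; the slides do not state the unit or the reference point.

```python
from datetime import datetime


def revision_features(revisions, now):
    """Aggregate revision metadata into the four revision-based features."""
    contributors = {rev["contributor"] for rev in revisions}
    registered = {rev["contributor"] for rev in revisions if rev["registered"]}
    first_edit = min(rev["timestamp"] for rev in revisions)
    return {
        "revisions": len(revisions),
        "unique_contributors": len(contributors),
        "registered_contributors": len(registered),
        "article_age_days": (now - first_edit).days,
    }


# Toy history (made up): two registered users and one anonymous IP editor.
history = [
    {"timestamp": datetime(2010, 1, 1), "contributor": "Alice", "registered": True},
    {"timestamp": datetime(2010, 6, 15), "contributor": "10.0.0.1", "registered": False},
    {"timestamp": datetime(2011, 2, 3), "contributor": "Alice", "registered": True},
    {"timestamp": datetime(2011, 8, 20), "contributor": "Bob", "registered": True},
]
features = revision_features(history, now=datetime(2012, 1, 4))
```

The reference date in the usage example is set to January 4th, 2012 only because that is the dump date mentioned for the JWPL database; any fixed date would serve.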

  19. Features: Other features
      • Number of discussions on Talk page
      • Number of sentences, tokens and characters
