FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia
Oliver Ferschke, Iryna Gurevych and Marc Rittberger
CLEF 2012 Labs and Workshop, Notebook Papers. Rome, Italy, September 17 – 20, 2012

Introduction
Oliver Ferschke, Iryna Gurevych, Marc Rittberger
19.09.2012 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Oliver Ferschke

FlawFinder
Task-based system with multiple processing pipelines.
[Architecture diagram. Labels: Page IDs, JWPL, Reader, Linguistic Preprocessing, Feature Extraction, Training / Classification, Writer, Datastore / Results]

Data Import
Document retrieval via Java Wikipedia Library and Wikipedia Revision Toolkit
• article text
• revision history
• revision meta data (authors, edit comment, timestamps)
• links (in/out, internal/external)
JWPL database based on Wikipedia data dump from January 4th, 2012.
http://jwpl.googlecode.com

Preprocessing
UIMA-based NLP components for preprocessing from the Darmstadt Knowledge Processing Repository
Linguistic Preprocessing components:
• Wikitext Parser
• Sentence Splitter
• Tokenizer
• Stopword Filter
• Named Entity Recognizer
http://dkpro-core.googlecode.com

Features
Feature categories: NGram features, Structural features, Reference features, Network features, Named entity features, Revision-based features, Other features
• 32 feature types in 7 categories
• ClearTK framework
• "plug and play" feature extractors
• independent from utilized ML toolkit
• Information Gain approach for feature selection
• Unsupervised discretization of numeric features
http://cleartk.googlecode.com

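The two selection steps named on this slide can be sketched in plain Python. This is a minimal illustration, not the ClearTK-based implementation: the slide does not say which unsupervised discretization method was used, so equal-width binning is an assumption here, and the function names and toy data are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def equal_width_bins(values, num_bins):
    """Unsupervised discretization (assumed method): equal-width bin index per value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data, invented for illustration: reference counts and flaw labels.
ref_counts = [0, 1, 2, 9, 10, 12]
labels = ["flawed", "flawed", "flawed", "ok", "ok", "ok"]
bins = equal_width_bins(ref_counts, num_bins=2)
gain = information_gain(bins, labels)
```

Features would then be ranked by their gain and only the top-ranked ones kept; the cut-off is not stated on the slide.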
Classification
Approach
• Binary classification
• Naive Bayes
• AdaBoost with depth-limited C4.5 decision trees as weak classifiers
Negative instances
• Random sample of untagged articles
Evaluation
• 10-fold cross validation on 1000 documents
• Stable sampling of negative instances in one evaluation run

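A rough sketch of the evaluation setup follows: a seeded random sample of untagged articles serves as negatives, followed by a 10-fold split. The 500/500 class balance, the seed value, the page ID ranges, and the function names are assumptions for illustration; the slide only states 1000 documents and that negative sampling is stable within one run.

```python
import random

def sample_negatives(untagged_ids, k, seed=42):
    """Draw k untagged article IDs; the fixed seed keeps the sample stable."""
    rng = random.Random(seed)
    return rng.sample(sorted(untagged_ids), k)

def k_fold_splits(instances, num_folds=10, seed=42):
    """Yield (train, test) lists for k-fold cross-validation."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled instances round-robin into num_folds folds.
    folds = [shuffled[i::num_folds] for i in range(num_folds)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical page IDs: 500 tagged (label 1) and 500 sampled untagged (label 0).
positives = [(page_id, 1) for page_id in range(500)]
negatives = [(page_id, 0) for page_id in sample_negatives(range(1000, 5000), 500)]
dataset = positives + negatives
splits = list(k_fold_splits(dataset))
```

Each of the ten splits holds out 100 documents for testing and trains on the remaining 900.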
Parameter Optimization
• The overall system is a "pipeline of pipelines".
• Individual pipelines can be parameterized
Parameter optimization:
• Find best parameter setting across all pipelines
• Report on performance for pipeline configurations
DKPro Lab:
• Task based processing
• Parameter injection
• Global configuration
• Report probes gather statistics for global report
http://dkpro-lab.googlecode.com

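The idea of searching one global parameter space across all pipelines can be illustrated with a small exhaustive search. This is not the DKPro Lab API: the parameter names (`min_frequency`, `num_bins`), the `run_pipelines` callback, and the scoring function are hypothetical stand-ins, and the score has no empirical meaning.

```python
from itertools import product

def parameter_grid(space):
    """Yield one dict per combination of parameter values."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def optimize(space, run_pipelines):
    """Run every configuration, keep a report row for each, return the best one."""
    report = [(run_pipelines(config), config) for config in parameter_grid(space)]
    best_score, best_config = max(report, key=lambda row: row[0])
    return best_config, best_score, report

def toy_run(config):
    # Placeholder for running preprocessing, feature extraction and
    # classification with the injected parameters; the score is synthetic.
    return -(abs(config["min_frequency"] - 5) + abs(config["num_bins"] - 5))

# Hypothetical global parameter space shared by all pipelines.
space = {"min_frequency": [2, 5], "num_bins": [3, 5, 10]}
best_config, best_score, report = optimize(space, toy_run)
```

The `report` list plays the role of the global report: one row per pipeline configuration, so that performance can be compared across settings.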
Error Analysis and Evaluation
Common error sources
• Outdated labels (positive instances)
• Missing labels (negative instances)
• Unclear label definitions; reference flaws in particular are often confused
• Section-scope and article-scope flaws mixed

Conclusions & Outlook
• Use article revision in which tag was first inserted
  Solves outdated label problem
• Use revision history for identifying negative instances
  Solves missing label problem
• Separate treatment of section- and article-scope templates
• Real world application: multi-flaw classification problems with overlaps in flaw definitions

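The first two proposals can be sketched as a scan over an article's revision history. This is an illustration under assumptions, not the Wikipedia Revision Toolkit API: revisions are represented as hypothetical (timestamp, wikitext) pairs, and the template match is a simple regular expression that ignores template redirects and aliases.

```python
import re

def has_template(wikitext, template_name):
    """True if the wikitext contains {{template_name}} or {{template_name|...}}."""
    pattern = r"\{\{\s*" + re.escape(template_name) + r"\s*(\||\}\})"
    return re.search(pattern, wikitext, flags=re.IGNORECASE) is not None

def first_tagged_revision(revisions, template_name):
    """Return the earliest (timestamp, text) whose text carries the tag, or None."""
    for timestamp, text in sorted(revisions, key=lambda r: r[0]):
        if has_template(text, template_name):
            return timestamp, text
    return None

def never_tagged(revisions, template_name):
    """Candidate negative instance: no revision ever carried the tag."""
    return all(not has_template(text, template_name) for _, text in revisions)

# Invented history for illustration; ISO dates sort chronologically as strings.
history = [
    ("2011-03-01", "Text. {{Unreferenced|date=March 2011}}"),
    ("2010-01-01", "Text."),
    ("2011-06-01", "Text with <ref>source</ref>."),
]
tagged = first_tagged_revision(history, "Unreferenced")
```

Training on the revision returned by `first_tagged_revision` uses the article text as it was when the flaw was flagged, and `never_tagged` restricts negatives to articles whose history never contained the tag.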
Thank you for your attention!
Ubiquitous Knowledge Processing Lab
http://www.ukp.tu-darmstadt.de

Features: NGram features
• Token unigrams, bigrams, trigrams
• Extracted from article text w/o markup
• Min. frequency (5)
• Stopword filtered

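A minimal sketch of this n-gram extraction is given below. Several details are assumptions, since the slide does not specify them: whitespace tokenization on already markup-free text, stopword removal before the n-grams are built, the minimum frequency counted over the whole corpus, and a tiny illustrative stopword list.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "in", "and"}  # tiny illustrative list

def ngrams(tokens, n):
    """All contiguous token n-grams, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_ngrams(text, max_n=3):
    """Unigrams up to max_n-grams over the stopword-filtered tokens of one text."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]

def build_vocabulary(documents, min_frequency=5):
    """Keep only n-grams that occur at least min_frequency times in the corpus."""
    counts = Counter(g for doc in documents for g in extract_ngrams(doc))
    return {g for g, c in counts.items() if c >= min_frequency}

# Toy corpus, invented for illustration.
docs = ["the history of rome"] * 5 + ["a city in italy"]
vocabulary = build_vocabulary(docs)
```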
Features: Structural features
• Empty sections
• Number of sections
• Mean section length
• Markup to text ratio

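These four features could be approximated directly on raw wikitext, as sketched below. The exact definitions used in FlawFinder are not given on the slide, so everything here is an assumption: sections are split at heading lines, section length is measured in characters, a heading directly followed by a subheading counts as empty, and only templates, HTML-like tags, link brackets and quote runs count as markup.

```python
import re

# A heading line such as "== History ==" or "=== Details ===".
HEADING = re.compile(r"^=+[^=\n]+=+[ \t]*$", flags=re.MULTILINE)
# Crude markup pattern: flat templates, HTML-like tags, link brackets, quote runs.
MARKUP = re.compile(r"\{\{[^{}]*\}\}|<[^>]+>|\[\[|\]\]|'{2,}")

def structural_features(wikitext):
    """Compute rough structural features from raw wikitext."""
    bodies = HEADING.split(wikitext)[1:]  # text following each heading
    lengths = [len(body.strip()) for body in bodies]
    plain = MARKUP.sub("", wikitext)
    markup_chars = len(wikitext) - len(plain)
    return {
        "num_sections": len(bodies),
        "num_empty_sections": sum(1 for n in lengths if n == 0),
        "mean_section_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "markup_to_text_ratio": markup_chars / len(plain) if plain else 0.0,
    }

# Invented article text for illustration.
example = (
    "Lead text.\n"
    "== History ==\n"
    "Founded in '''753 BC'''.\n"
    "== Geography ==\n"
    "== See also ==\n"
    "[[Italy]]\n"
)
features = structural_features(example)
```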
Features: Reference features
• Number of references
• Reference lists
• Reference to text ratio
• References per sentence

Features: Network features
• External links
• Inlinks
• Outlinks

Features: Named entity features
• NER types: Organization, Person, Location
• Absolute numbers and NER to text ratio

Features: Revision-based features
• Number of revisions
• Number of unique contributors
• Number of registered contributors
• Article age

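Given revision meta data of the kind listed on the Data Import slide, these features reduce to simple aggregates. The sketch below assumes a hypothetical record layout (dicts with `timestamp`, `contributor` and `registered` keys) and measures article age in days relative to a reference date; neither choice comes from the slides.

```python
from datetime import datetime

def revision_features(revisions, now):
    """Aggregate revision meta data into the four revision-based features."""
    contributors = {r["contributor"] for r in revisions}
    registered = {r["contributor"] for r in revisions if r["registered"]}
    first_edit = min(r["timestamp"] for r in revisions)
    return {
        "num_revisions": len(revisions),
        "num_unique_contributors": len(contributors),
        "num_registered_contributors": len(registered),
        "article_age_days": (now - first_edit).days,
    }

# Invented revision history; the reference date matches the dump date on the
# Data Import slide (January 4th, 2012).
revisions = [
    {"timestamp": datetime(2011, 1, 4), "contributor": "Alice", "registered": True},
    {"timestamp": datetime(2011, 6, 1), "contributor": "192.0.2.1", "registered": False},
    {"timestamp": datetime(2011, 7, 1), "contributor": "Alice", "registered": True},
]
rev_features = revision_features(revisions, now=datetime(2012, 1, 4))
```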
Features: Other features
• Number of discussions on Talk page
• Number of sentences, tokens and characters
