Featured Article Identification in Wikipedia - Thesis Defense


  1. Featured Article Identification in Wikipedia - Thesis Defense
     Christian Fricke <christian.fricke@uni-weimar.de>
     Faculty of Media / Media Systems, Bauhaus-Universität Weimar, Germany
     October 18, 2012

  2. Why is Wikipedia relevant?
     Millions of people use Wikipedia, including authors, readers, researchers, and data analysts.
     [Figure: number of academic papers about Wikipedia per year (journals and conferences), 2002-2010.
      Source: http://en.wikipedia.org/w/index.php?title=File:Growth_of_Academic_Interest_in_Wikipedia.svg]

  3. Wikipedia Statistics
     The quality assessment of articles is manually unmanageable for the ever-growing encyclopedia.
     English Wikipedia assessment classes:
     ◮ FA-Class:        4 213  (0.11%)
     ◮ A-Class:           991  (0.03%)
     ◮ GA-Class:       16 508  (0.43%)
     ◮ B-Class:        82 787  (2.15%)
     ◮ C-Class:       129 483  (3.36%)
     ◮ Start-Class:   881 813  (22.9%)
     ◮ Stub-Class:  2 169 051  (56.4%)
     ◮ FL-Class:        1 781  (0.05%)
     ◮ List-Class:    100 812  (2.62%)
     ◮ Unassessed:    461 818  (12.0%)
     ◮ Total:       3 849 257  (100%)

  4. Automated Solution
     ◮ Quality judgement of articles as indicator for improvement
     ◮ Most common method: binary classification of featured and non-featured articles
       represented as vectors of feature values
     Featured:     FA-Class
     Non-featured: all other articles
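This setup can be made concrete with a minimal sketch, assuming made-up feature values and a scikit-learn pipeline rather than the WEKA-based implementation used in the thesis:

```python
# Minimal sketch of the binary classification setup: articles as feature
# vectors, labelled featured (1) or non-featured (0). Feature values and the
# classifier configuration are illustrative assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([            # hypothetical feature vectors, e.g.
    [5200, 310, 14],      # [word count, link count, image count]
    [4800, 250, 11],
    [6100, 420, 20],
    [300,   12,  1],
    [750,   30,  2],
    [1200,  45,  3],
])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = FA-Class, 0 = all other articles

model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print(cross_val_score(model, X, y, cv=3, scoring="f1"))   # F-Measure per fold
```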

  5. Outline
     1. Motivation
     2. Quality Assessment Models
     3. Feature Implementation
     4. Article Classification
     5. Conclusion

  6. Binary Classification Approaches
     (1) Blumenstock [WWW 2008]
     (2) Dalip et al. [JDIQ 2011]
     (3) Lipka and Stein [WWW 2010]
     (4) Stvilia et al. [IQ 2005]
     Problem: the customized datasets weaken the validity and comparability of the reported results.

  7. (1) Blumenstock [WWW 2008]
     Features:   A single metric, the length (word count) of an article, as its sole representation
     Dataset:    Unbalanced, random; Featured: 1 554, Non-featured: 9 513
     Classifier: Multi-Layer Perceptron
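A rough sketch of the idea behind model (1), assuming hypothetical word counts and an arbitrary network size; Blumenstock's exact configuration is not reproduced here.

```python
# Sketch of model (1): the word count is the article's only feature and is
# fed to a multi-layer perceptron. Data and network size are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def word_count(plaintext: str) -> int:
    """Length of an article measured as the number of whitespace-separated words."""
    return len(plaintext.split())

X = np.array([[4200], [5100], [3900], [350], [800], [120]])   # hypothetical counts
y = np.array([1, 1, 1, 0, 0, 0])                              # 1 = featured

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict([[2500]]))   # classify an unseen article by its length alone
```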

  8. (2) Dalip et al. [JDIQ 2011]
     Features:   54 features ranging from simple counts to complex graph-based metrics
     Dataset:    Unbalanced, random; Featured: 549, Non-featured: 2 745
     Classifier: Support Vector Machine

  9. (3) Lipka and Stein [WWW 2010]
     Features:   Character trigram vector, mapping each substring of three characters to its frequency
     Dataset:    Balanced, domain-specific; Featured: 380, Non-featured: 380
     Classifier: Support Vector Machine
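The representation of model (3) can be sketched as follows; the lowercasing and whitespace normalization are assumptions, not necessarily the preprocessing used by Lipka and Stein.

```python
# Sketch of a character trigram vector: each substring of three consecutive
# characters is mapped to its frequency in the article text.
from collections import Counter

def char_trigram_vector(text: str) -> Counter:
    text = " ".join(text.lower().split())   # assumed normalization
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

print(char_trigram_vector("The quick brown fox").most_common(5))
```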

  10. (4) Stvilia et al. [IQ 2005]
      Features:   Seven distinct metrics based on variable groupings that contain 19 features
      Dataset:    Unbalanced, random; Featured: 236, Non-featured: 834
      Classifier: C4.5 Decision Tree

  11. Outline
      1. Motivation
      2. Quality Assessment Models
      3. Feature Implementation
      4. Article Classification
      5. Conclusion

  12. Data Preparation
      The January 2012 snapshot of the English Wikipedia constitutes 8 TB of text data and is
      processed in less than two hours using the optimized Webis Hadoop cluster.
      [Diagram: dump preprocessing pipeline: parse the XML files and import the SQL tables of the
       database dump, then extract wikitext, plaintext, and metadata.]
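A minimal sketch of the "extract wikitext" step, assuming a locally downloaded, bzip2-compressed dump and plain Python instead of the Hadoop-based implementation; the file name and tag handling are illustrative.

```python
# Sketch of streaming <page> elements out of a Wikipedia XML dump without
# loading the whole file into memory. Runs sequentially; the thesis performs
# this step in parallel on the Webis Hadoop cluster.
import bz2
import xml.etree.ElementTree as ET

def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a bzip2-compressed pages dump."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag.rsplit("}", 1)[-1] == "page":      # strip XML namespace
                title = elem.findtext("{*}title", default="")
                wikitext = elem.findtext("{*}revision/{*}text", default="")
                yield title, wikitext
                elem.clear()                               # free memory as we go

# for title, wikitext in iter_pages("enwiki-pages-articles.xml.bz2"):  # hypothetical file name
#     ...
```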

  13. Feature Categories
      Features are organized in four categories:
      Content:   Length and part-of-speech rates, readability indices, trigrams, ...
      Structure: Lead rate, section distribution, counts for categories, files, images, lists, tables, and templates, ...
      Network:   Link counts and PageRank, ...
      History:   Age, currency, counts for edits, editors, and reverts, ...
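To illustrate the kind of metrics in these categories, here is a sketch of three simple ones; the definitions below (in particular of the lead rate) are assumptions and may differ from those used in the thesis. Content features operate on plaintext, structure features on wikitext.

```python
# Illustrative implementations of a few simple features.
import re

def word_count(plaintext):
    """Content feature: length of the article in words."""
    return len(plaintext.split())

def lead_rate(wikitext):
    """Structure feature (assumed definition): share of the text that lies
    before the first section heading, i.e. the relative size of the lead."""
    parts = re.split(r"^==[^=].*==\s*$", wikitext, maxsplit=1, flags=re.MULTILINE)
    return len(parts[0]) / max(len(wikitext), 1)

def automated_readability_index(plaintext):
    """Content feature: ARI, one example of a readability index."""
    words = plaintext.split()
    sentences = max(sum(plaintext.count(c) for c in ".!?"), 1)
    characters = sum(len(w) for w in words)
    return 4.71 * characters / max(len(words), 1) + 0.5 * len(words) / sentences - 21.43
```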

  14. Feature Computation
      The runtime for the computation of each feature for all articles depends on its source and complexity.
      Category    Features    Runtime    Source
      Content         35       < 1h      plaintext
      Structure       23       < 1h      wikitext
      Network          8       < 12h     metadata
      History          9       < 12h     all
      Total:          75       ~ 1d

  15. Experiment Reconstruction
      ◮ Implemented most features to accurately replicate results in an easy-to-use framework
        incorporating data extraction, feature computation, dataset construction, and model definitions
      ◮ Employed WEKA to train and evaluate the classifiers
      ◮ Biased dataset selections made exact reproduction difficult

  16. Outline
      1. Motivation
      2. Quality Assessment Models
      3. Feature Implementation
      4. Article Classification
      5. Conclusion

  17. Evaluation Measures
      Precision: proportion of predicted positives that are correct
      Recall:    proportion of actual positives that are correctly identified
      F-Measure: harmonic mean of Precision and Recall
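Written out with the featured class as the positive class, using hypothetical counts for illustration:

```python
# Precision, Recall, and F-Measure computed from true positives (tp),
# false positives (fp), and false negatives (fn) of the featured class.
def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)   # share of articles predicted featured that really are
    recall = tp / (tp + fn)      # share of featured articles that were found
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

print(precision_recall_f(tp=90, fp=10, fn=6))   # hypothetical counts
```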

  18. Reconstruction Results
      Model   Featured                 Non-featured             Average
              Precision / Recall / F   Precision / Recall / F   F-Measure
              0.871 / 0.936 / 0.902    0.989 / 0.977 / 0.983    0.970
      (1)     0.781 / 0.877 / 0.826    0.980 / 0.960 / 0.970    0.949
              ⊥                        ⊥                        ⊥
      (2)     0.903 / 0.900 / 0.901    0.980 / 0.981 / 0.980    0.967
              0.966 / 0.961 / 0.964    ⊥                        ⊥
      (3)     0.949 / 0.939 / 0.944    0.940 / 0.950 / 0.945    0.944
              0.900 / 0.920 / 0.910    0.980 / 0.970 / 0.975    0.957
      (4)     0.859 / 0.907 / 0.882    0.973 / 0.958 / 0.965    0.947
      (1) Blumenstock  (2) Dalip et al.  (3) Lipka and Stein  (4) Stvilia et al.

  22. Uniform Dataset
      We define four datasets to fairly compare the performance of each proposed model and propose an
      additional model that combines every implemented feature.
      Dataset:    Balanced, random, corresponding to minimum word counts of 0, 800, 1600, and 2400;
                  Featured: 3 000, Non-featured: 3 000
      (5) Fricke and Anderka:
      Features:   All 75 features from every category
      Classifier: Support Vector Machine
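A sketch of how such balanced datasets can be drawn, assuming an in-memory list of articles with precomputed word counts and labels; the tuple layout and helper are hypothetical, while the sample sizes and thresholds mirror the slide.

```python
# Sketch of the uniform dataset construction: for each minimum word count
# threshold, draw a balanced random sample of featured and non-featured
# articles.
import random

def build_dataset(articles, min_words, n_per_class=3000, seed=0):
    """articles: iterable of (word_count, is_featured, feature_vector) tuples."""
    eligible = [a for a in articles if a[0] >= min_words]
    featured = [a for a in eligible if a[1]]
    non_featured = [a for a in eligible if not a[1]]
    rng = random.Random(seed)
    return rng.sample(featured, n_per_class) + rng.sample(non_featured, n_per_class)

# datasets = {t: build_dataset(all_articles, t) for t in (0, 800, 1600, 2400)}
```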

  23. Uniform Evaluation
      [Figure: average F-Measure (0.90 to 1.00) over the minimum word count (0, 800, 1600, 2400) for
       the models (1) Blumenstock, (2) Dalip et al., (3) Lipka and Stein, (4) Stvilia et al., and
       (5) Fricke and Anderka.]

  24. Conclusion and Outlook
      ◮ A framework for convenient and consistent evaluation
      ◮ A new model utilizing every implemented quality indicator
      ◮ The most comprehensive collection of article features to date
