On the Use of PU Learning for Quality Flaw Prediction in Wikipedia
Edgardo Ferretti, Donato Hernández, Rafael Guzmán, Manuel Montes, Marcelo Errecalde & Paolo Rosso
September 19th, PAN@CLEF'12, Rome
Who are we?
Paolo Rosso, Edgardo Ferretti, Manuel Montes, Donato Hernández, Marcelo Errecalde, Rafael Guzmán
Methodological Design
- Using a state-of-the-art document model
- Finding a good algorithm for classification tasks
- Exploiting the characteristics of this algorithm
Methodological Design: Using a state-of-the-art document model
73 features from the document model used in [1], selected following the guidelines in [2].
Text features:
- LENGTH: character / sentence / word count, etc.
- STRUCTURE: mandatory sections count, tables count, etc.
- STYLE: prepositions / stop words / questions rate, etc.
- READABILITY: Gunning-Fog / Kincaid indexes, etc.
Network features:
- In-link count
- Internal link count
- Inter-language link count
[1] Anderka, M., Stein, B., Lipka, N.: Predicting Quality Flaws in User-generated Content: The Case of Wikipedia. In: 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM (2012)
[2] Dalip, D., Goncalves, M., Cristo, M., Calado, P.: Automatic quality assessment of content created collaboratively by Web communities: a case study of Wikipedia. In: 9th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM (2009)
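To make the feature families concrete, here is a minimal sketch of how a few LENGTH and READABILITY features could be computed for one article. The tokenization, the syllable heuristic, and the function name are illustrative assumptions, not the feature extractor actually used in [1,2].

```python
import re

def text_features(article_text):
    """Illustrative LENGTH and READABILITY features for one article."""
    sentences = [s for s in re.split(r"[.!?]+", article_text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", article_text)
    n_sents = max(1, len(sentences))
    n_words = max(1, len(words))

    # Crude syllable heuristic (count vowel groups); an assumption, not taken from the paper.
    def syllables(word):
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    complex_words = [w for w in words if syllables(w) >= 3]

    return {
        "char_count": len(article_text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        # Gunning-Fog index: 0.4 * (avg sentence length + 100 * complex-word rate)
        "gunning_fog": 0.4 * (n_words / n_sents + 100.0 * len(complex_words) / n_words),
        # Flesch-Kincaid grade level
        "kincaid": 0.39 * n_words / n_sents + 11.8 * sum(map(syllables, words)) / n_words - 15.59,
    }

print(text_features("Wikipedia articles vary considerably in quality. Some lack references entirely."))
```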
PU Learning
This method uses as input a small labelled set P of the positive class to be predicted and a large unlabelled set U to help learning. [3]
[Diagram: two-stage process. 1st stage: Classifier 1 is trained on P and U and used to extract a set of reliable negatives (RNs) from U. 2nd stage: Classifier 2 is trained on P and the RNs and applied to the test set.]
[3] Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.: Building text classifiers using positive and unlabeled examples. In: Proceedings of the 3rd IEEE International Conference on Data Mining (2003)
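A minimal sketch of this two-stage scheme, assuming scikit-learn and NumPy arrays of feature vectors. The default classifiers shown here (Naive Bayes and SVM) anticipate the selection discussed on the next slides and are placeholders, not the authors' exact implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

def pu_two_stage(P, U, test, clf1=None, clf2=None):
    """Two-stage PU learning: extract reliable negatives (RNs), then train the final classifier."""
    clf1 = clf1 if clf1 is not None else GaussianNB()
    clf2 = clf2 if clf2 is not None else LinearSVC()

    # 1st stage: train Classifier 1 with P as positive (1) and the whole of U as negative (0).
    X1 = np.vstack([P, U])
    y1 = np.concatenate([np.ones(len(P)), np.zeros(len(U))])
    clf1.fit(X1, y1)

    # The documents of U that Classifier 1 still labels as negative are the reliable negatives.
    rns = U[clf1.predict(U) == 0]

    # 2nd stage: train Classifier 2 on P vs. RNs and apply it to the test documents.
    X2 = np.vstack([P, rns])
    y2 = np.concatenate([np.ones(len(P)), np.zeros(len(rns))])
    clf2.fit(X2, y2)
    return clf2.predict(test)
```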
Research question #1: What classifier in each stage?
Candidate techniques for the 1st stage (extracting the RNs): Spy, 1-DNF, Rocchio, NB, KNN
Candidate techniques for the 2nd stage (final classifier): EM, SVM, SVM-I, SVM-IS
Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.: Building text classifiers using positive and unlabeled examples. In: Proceedings of the 3rd IEEE International Conference on Data Mining (2003)
Zhang, B., Zuo, W.: Reliable Negative Extracting Based on kNN for Learning from Positive and Unlabeled Examples. Journal of Computers 4(1), 94-101 (2009)
Candidates compared in this work, for both stages: NB, KNN, SVM.
Our choice: NB + SVM (NB as Classifier 1 in the 1st stage, SVM as Classifier 2 in the 2nd stage).
Research question #2: Untagged sampling strategy
50,000 untagged documents; 10-fold cross-validation.
[Diagram: P and U are split into training and test sets and fed to the PU learning pipeline, with NB as Classifier 1 and SVM as Classifier 2.]
[Diagram: the untagged collection is partitioned into ten subsets U_1, ..., U_10 with |U_i| = 5000 for i = 1..10; these are combined into the incremental samples U_i.j shown next.]
For each i = 1..10, an incremental family of untagged training pools is built (subset indices wrap around modulo 10):
  U_i.0 = U_i
  U_i.1 = U_i.0 + U_(i+1)   (e.g. U_1.1 = U_1 + U_2, U_10.1 = U_10 + U_1)
  U_i.2 = U_i.1 + U_(i+2)
  U_i.3 = U_i.2 + U_(i+3)
(P + U_i.j), i = 1..10, j = 0..3 ⇒ 40 different training sets.
Training: |P| = 1000, positive-to-untagged proportions 1:5, 1:10, 1:15, 1:20. Test: |P| = 110.
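A short sketch of this incremental sampling scheme. The helper name and the initial random shuffling are assumptions; the subset size and the modulo-10 wrap-around follow the slide above.

```python
import random

def build_untagged_samples(U, n_splits=10, max_j=3, subset_size=5000, seed=0):
    """Split U into n_splits subsets and build the incremental pools U_{i.j}."""
    rng = random.Random(seed)
    docs = list(U)
    rng.shuffle(docs)
    subsets = [docs[i * subset_size:(i + 1) * subset_size] for i in range(n_splits)]

    samples = {}  # (i, j) -> list of untagged documents
    for i in range(n_splits):
        current = list(subsets[i])                           # U_{i.0} = U_i
        samples[(i, 0)] = list(current)
        for j in range(1, max_j + 1):
            current = current + subsets[(i + j) % n_splits]  # U_{i.j} = U_{i.j-1} + U_{(i+j) mod 10}
            samples[(i, j)] = list(current)
    return samples  # 10 starting subsets x 4 sizes = 40 untagged pools, each paired with P
```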
Recall per flaw:
Flaw:    Advert  Empty  No-foot  Notab  OR    Orphan  PS    Ref   Unref  Wiki
Recall:  0.58    0.98   0.57     0.99   0.30  1.00    0.74  0.61  0.99   0.97
Research question #3: Strategies to select the negative set from the RNs
[Diagram: the two-stage PU pipeline, highlighting the open choice of which negative set N to draw from the RNs for training Classifier 2.]
1. Selecting all RNs as the negative set. [3]
2. Selecting |P| documents at random from the RNs set.
3. Selecting the |P| best RNs (those assigned the highest confidence prediction values by Classifier 1).
4. Selecting the |P| worst RNs (those assigned the lowest confidence prediction values by Classifier 1).
[3] Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.: Building text classifiers using positive and unlabeled examples. In: Proceedings of the 3rd IEEE International Conference on Data Mining (2003)
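A small sketch of these four selection strategies, under the assumption that Classifier 1 provides a confidence score for the negative class (e.g. a predicted probability); the function and argument names are illustrative.

```python
import numpy as np

def select_negatives(rns, neg_confidence, p_size, strategy="all"):
    """Select the negative training set from the reliable negatives (RNs).

    rns            -- array of RN feature vectors
    neg_confidence -- Classifier 1's confidence that each RN is negative
    p_size         -- |P|, the size of the positive set
    """
    if strategy == "all":            # 1. all RNs
        return rns
    if strategy == "random":         # 2. |P| RNs chosen at random
        idx = np.random.choice(len(rns), size=min(p_size, len(rns)), replace=False)
        return rns[idx]
    order = np.argsort(neg_confidence)
    if strategy == "best":           # 3. |P| RNs with the highest negative confidence
        return rns[order[-p_size:]]
    if strategy == "worst":          # 4. |P| RNs with the lowest negative confidence
        return rns[order[:p_size]]
    raise ValueError("unknown strategy")
```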
Table 2. Recall and fn values for the RNs selection strategies
Table 3. Average recall values per flaw
Research question #4: SVM, which kernel?
- Linear SVM (WEKA's default parameters)
- RBF SVM: γ ∈ {2^-15, 2^-13, 2^-11, ..., 2^1, 2^3}, C ∈ {2^-5, 2^-3, 2^-1, ..., 2^13, 2^15}
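A sketch of the RBF parameter search with the grid given above. The slide names WEKA, so using scikit-learn's GridSearchCV here is an assumption about tooling, not the authors' setup; the scoring metric and fold count are also illustrative.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# RBF grid from the slide: gamma in {2^-15, 2^-13, ..., 2^3}, C in {2^-5, 2^-3, ..., 2^15}.
param_grid = {
    "kernel": ["rbf"],
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],
    "C": [2.0 ** e for e in range(-5, 16, 2)],
}

def tune_rbf_svm(X_train, y_train):
    """Cross-validated grid search over the RBF SVM hyper-parameters."""
    search = GridSearchCV(SVC(), param_grid, scoring="recall", cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```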