Fake News Spreader Identification in Twitter using Ensemble Modeling
8th Author Profiling Task, PAN Workshop – CLEF 2020
Ahmad Hashemi, Mohammad Reza Zarei, Mohammad Reza Moosavi, Mohammad Taheri
Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran
Introduction: Why Study Fake News? (2/16)
Negative consequences of fake news propagation:
- Political aspects
- Economic aspects
- Health-related aspects
Introduction: Profiling Fake News Spreaders (3/16)
Hypothesis: users who do not spread fake news have a different set of characteristics compared to users who tend to share fake news.
Identifying fake news spreaders is a first step towards fake news detection.
Dataset: The Provided PAN-AP-20 Corpus (4/16)
Number of authors in the competition dataset:

Language   Training   Test   Total
English    300        200    500
Spanish    300        200    500

For each author, their last 100 tweets have been retrieved.
Methodology: Overview of the Proposed Model (5/16)
Methodology: Statistical Features (6/16)
- Fraction of retweets (tweets starting with "RT")
- Average number of mentions per tweet
- Average number of URLs per tweet
- Average number of hashtags per tweet
- Average tweet length
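These features can be computed directly from an author's raw tweets. A minimal sketch follows, assuming each author's tweets are available as a list of strings; the regular expressions and the character-based tweet length are illustrative choices, not necessarily those used by the authors.

```python
# Minimal sketch (illustrative, not the authors' code): per-author
# statistical features from a list of raw tweet strings.
import re

def statistical_features(tweets):
    n = len(tweets)
    retweets = sum(1 for t in tweets if t.startswith("RT"))
    mentions = sum(len(re.findall(r"@\w+", t)) for t in tweets)
    urls     = sum(len(re.findall(r"https?://\S+", t)) for t in tweets)
    hashtags = sum(len(re.findall(r"#\w+", t)) for t in tweets)
    length   = sum(len(t) for t in tweets)
    return [
        retweets / n,   # fraction of retweets
        mentions / n,   # average mentions per tweet
        urls / n,       # average URLs per tweet
        hashtags / n,   # average hashtags per tweet
        length / n,     # average tweet length (in characters)
    ]
```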
Methodology: Implicit Features (7/16)
- Age (English dataset)
- Gender (English dataset)
- Emotional signals
  - English dataset: anger, anticipation, disgust, fear, joy, sadness, surprise, trust
  - Spanish dataset: joy, anger, fear, repulsion, surprise, sadness
- Personality (English dataset): agreeableness, conscientiousness, extraversion, neuroticism, openness
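The emotional signals can be illustrated with a simple lexicon-based count. The sketch below assumes a hypothetical `EMOTION_LEXICON` mapping each emotion to a set of words; the slides do not specify which lexicons or predictive models were used for age, gender, emotions, and personality, so this only illustrates the general idea.

```python
# Illustrative sketch only: lexicon-based emotion signals.
# EMOTION_LEXICON is a hypothetical resource; the actual models/lexicons
# behind the implicit features are not specified in the slides.
EMOTION_LEXICON = {
    "anger": {"hate", "furious", "outrage"},
    "joy":   {"happy", "glad", "wonderful"},
    # ... remaining emotion categories
}

def emotion_signals(tokens):
    """tokens: lower-cased tokens from all of one author's tweets."""
    total = max(len(tokens), 1)
    return {emotion: sum(tok in words for tok in tokens) / total
            for emotion, words in EMOTION_LEXICON.items()}
```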
Methodology: Word Embeddings (8/16)
Preprocessing:
- Omitting retweet tags, hashtags, URLs and user tags
- TweetTokenizer module from the NLTK package
Pretrained embeddings:
- English dataset: pretrained on blogs, news and comments
- Spanish dataset: pretrained on news and media contents
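As a rough illustration of this step, the sketch below cleans and tokenizes tweets with NLTK's TweetTokenizer and averages pretrained word vectors per author. The `word_vectors` lookup (e.g., a gensim KeyedVectors object) and the cleaning regexes are assumptions, not details from the slides.

```python
# Minimal sketch: average pretrained word vectors per author.
# `word_vectors` is assumed to be a dict-like token -> vector mapping.
import re
import numpy as np
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

def clean(tweet):
    tweet = re.sub(r"^RT\s+", "", tweet)        # retweet tag
    tweet = re.sub(r"https?://\S+", "", tweet)  # URLs
    tweet = re.sub(r"[#@]\w+", "", tweet)       # hashtags and user tags
    return tweet

def embedding_features(tweets, word_vectors, dim=300):
    vectors = [word_vectors[tok]
               for tweet in tweets
               for tok in tokenizer.tokenize(clean(tweet).lower())
               if tok in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)
```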
Methodology: Term Frequency – Inverse Document Frequency (TF-IDF) (9/16)
Preprocessing:
- Eliminating punctuation, numbers and stop words
- Stemming
- Omitting retweet tags, hashtags, URLs and user tags
- TweetTokenizer module from the NLTK package
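A minimal sketch of this pipeline is shown below, assuming each author's tweets are concatenated into a single document; the NLTK SnowballStemmer, stop-word list, and regular expressions are illustrative choices rather than the exact configuration used in the work.

```python
# Minimal sketch: TF-IDF features over cleaned, stemmed tweet text.
import re
from nltk.corpus import stopwords              # requires nltk.download("stopwords")
from nltk.stem import SnowballStemmer
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")           # "spanish" for the Spanish dataset
stop_words = set(stopwords.words("english"))
tweet_tok = TweetTokenizer()

def tokenize(doc):
    doc = re.sub(r"https?://\S+|[#@]\w+|\bRT\b", " ", doc)  # URLs, hashtags, user tags, RT
    doc = re.sub(r"[^\w\s]|\d", " ", doc.lower())            # punctuation and numbers
    return [stemmer.stem(tok) for tok in tweet_tok.tokenize(doc)
            if tok not in stop_words]

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False)
# X_tfidf = vectorizer.fit_transform(author_documents)  # one document per author
```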
Methodology: Ensembling the Models (10/16)
Soft classifiers are used to obtain the confidence c_i(u) of each individual model. The combined confidence is

c_out(u) = α·c_1(u) + β·c_2(u) + γ·c_3(u),   with α + β + γ = 1

- c_1(u): confidence of the classifier for TF-IDF features
- c_2(u): confidence of the classifier for word-embedding features
- c_3(u): confidence of the classifier for statistical + implicit features

The label of user u is determined from the combined confidence c_out(u), as shown in the sketch below.
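A minimal sketch of this weighted soft-voting step, assuming each of the three base classifiers exposes `predict_proba`-style class probabilities and the weights sum to 1:

```python
# Minimal sketch: weighted soft voting over the three base classifiers.
import numpy as np

def ensemble_confidence(probas, weights):
    """probas: list of three (n_samples, 2) probability arrays
    (TF-IDF, word-embedding, statistical+implicit classifiers);
    weights: [alpha, beta, gamma] with alpha + beta + gamma == 1."""
    return sum(w * p for w, p in zip(weights, probas))   # c_out(u) per class

def ensemble_predict(probas, weights):
    # Label = class with the highest combined confidence.
    return np.argmax(ensemble_confidence(probas, weights), axis=1)
```

For the English configuration reported on slide 12, `weights = [0.15, 0.45, 0.40]`.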
Experimental Results: Model Selection (11/16)
Accuracy scores of 10-fold cross-validation:

Feature group            Dataset    SVM    Random Forest   Logistic Regression
Statistical + Implicit   English    57.6   69.0            49.6
TF-IDF                   English    68.3   70.3            68.3
Embedding                English    67.6   71.3            67.6
Statistical + Implicit   Spanish    72.6   73.0            56.0
TF-IDF                   Spanish    82.0   80.0            81.6
Embedding                Spanish    74.0   76.3            76.0
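The model-selection step can be reproduced roughly as below; `X` and `y` are placeholders for one feature group's matrix and the spreader labels, and the classifiers use scikit-learn defaults rather than the authors' settings.

```python
# Minimal sketch: 10-fold cross-validation over the candidate classifiers.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

candidates = {
    "SVM": SVC(probability=True),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {100 * scores.mean():.1f}")
```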
Experimental Results: Ensembling the Models (12/16)
Determined weight parameters for merging the individual classifiers:

Language   TF-IDF (α)   Embeddings (β)   Statistical + Implicit (γ)
English    0.15         0.45             0.40
Spanish    0.65         0.10             0.25
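The slides do not state how α, β, and γ were obtained; one plausible procedure, sketched below under that assumption, is a grid search over weights constrained to sum to 1, scored on out-of-fold predictions.

```python
# Illustrative sketch (assumption): grid search for the merging weights.
import numpy as np

def search_weights(probas, y_true, step=0.05):
    """probas: list of three (n_samples, 2) out-of-fold probability arrays."""
    best_weights, best_acc = None, -1.0
    for alpha in np.arange(0.0, 1.0 + 1e-9, step):
        for beta in np.arange(0.0, 1.0 - alpha + 1e-9, step):
            gamma = 1.0 - alpha - beta
            combined = alpha * probas[0] + beta * probas[1] + gamma * probas[2]
            acc = (combined.argmax(axis=1) == y_true).mean()
            if acc > best_acc:
                best_weights, best_acc = (alpha, beta, gamma), acc
    return best_weights, best_acc
```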
Experimental Results: Local Evaluation (13/16)
10-fold cross-validation accuracy obtained with the different components:

Features                        Accuracy (en)   Accuracy (es)
TF-IDF                          70.3            82.0
Embedding                       71.3            76.3
Statistical + Implicit          69.0            73.0
Ensembled model (final model)   74.6            82.9
Experimental Results: Final Results (14/16)
Accuracy scores obtained on the local evaluation and the official test set:

Language   Cross-validation   Official test set
English    74.6               69.5
Spanish    82.9               78.5
Average    78.75              74.0
Future Work (15/16)
- Extracting more implicit features and analyzing their discriminative power
- Proposing a learning scheme for the ensemble unit
- Using the fake news spreader identification results for fake news detection
Thank You For Your Attention! s.ahmad.hmi@gmail.com mr.zarei@cse.shirazu.ac.ir