Using N-grams to detect Bots on Twitter Juan Pizarro Universitat Politècnica de València jpizarrom@gmail.com Bots and Gender Profiling, PAN at CLEF 2019 Lugano, Switzerland, September 10, 2019
Outline ● Task ● Dataset ● Methods ○ Preprocessing ○ Feature Extraction ○ Models ○ Parameter Optimization ● Results ● Other Methods ● Conclusions and Future Work
Bots and Gender Profiling ● Predict ○ Author: bot or human ○ Gender: male or female ● Languages ○ English ○ Spanish ● 100 tweets per author ● Evaluation ○ Average accuracy ● TIRA platform
Dataset Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
Preprocessing ● Concatenate tweets by author ● Replace with a single token ○ urls ○ user mentions ○ hashtags ● NLTK [1] TweetTokenizer Based on [2] [1] Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc. [2] Daneshvar, S., Inkpen, D.: Gender Identification in Twitter using N-grams and LSA: Notebook for PAN at CLEF 2018. In: CEUR Workshop Proceedings. vol. 2125 (2018)
Preprocessing
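Below is a minimal sketch of the preprocessing described above, assuming regex-based replacement and NLTK's TweetTokenizer; the placeholder tokens (<URL>, <USER>, <HASHTAG>) are assumptions, not necessarily the exact ones used in the submitted system.

```python
import re
from nltk.tokenize import TweetTokenizer

# Hypothetical placeholder tokens; the actual tokens used may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def preprocess_author(tweets):
    text = " ".join(tweets)                # concatenate the author's tweets
    text = URL_RE.sub("<URL>", text)       # replace urls with a single token
    text = MENTION_RE.sub("<USER>", text)  # replace user mentions
    text = HASHTAG_RE.sub("<HASHTAG>", text)  # replace hashtags
    return " ".join(TweetTokenizer().tokenize(text))

print(preprocess_author(["Check https://example.com", "@user hello #pan2019"]))
```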
Feature Extraction & Models ● Features: Char N-grams (1, 6), Word N-grams (1, 3), Tf-idf ● Models: SVM (LinearSVC), MultinomialNB, LogisticRegression Using [1] [1] Pedregosa et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, JMLR 12, 2825–2830 (2011)
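A sketch of how these features and models could be wired together with scikit-learn: char (1, 6) and word (1, 3) TF-IDF vectorizers combined in a FeatureUnion and fed to a LinearSVC. All parameters other than the n-gram ranges are illustrative defaults, not the tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 6))),  # char n-grams
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),  # word n-grams
])

model = Pipeline([("feats", features), ("clf", LinearSVC())])
# model.fit(train_texts, train_labels); model.predict(dev_texts)
```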
Parameter Optimization ● Hand-tuning ● Grid Search ● Random Search [1] [1] Bergstra, J., Bengio, Y.: Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13, 281–305 (2012)
Parameter Optimization ● Sequential model-based optimization (SMBO, also known as Bayesian optimization) with hyperopt [1,2] ○ Domain or Search Space ○ Objective Function ○ Optimization Algorithm [1] Bergstra, J. Hyperopt: Distributed asynchronous hyperparameter optimization in Python. http://jaberg.github.com/hyperopt, 2013. [2] Bergstra, J., Yamins, D., Cox, D. D. (2013) Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. To appear in Proc. of the 30th International Conference on Machine Learning (ICML 2013).
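A minimal hyperopt sketch showing the three ingredients listed above: a search space, an objective (negative cross-validated accuracy, since hyperopt minimizes), and TPE as the optimization algorithm. The search space, toy data, and pipeline below are illustrative only, not the configurations actually used.

```python
from hyperopt import fmin, tpe, hp, Trials
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the preprocessed author documents.
train_texts = ["<URL> follow me", "coffee with <USER>", "buy now <URL>", "watching the game"]
train_labels = ["bot", "human", "bot", "human"]

# Domain / search space (illustrative).
space = {
    "C": hp.loguniform("C", -2, 9),
    "ngram_max": hp.choice("ngram_max", [3, 5, 6]),
}

# Objective function: negative accuracy, because fmin minimizes.
def objective(params):
    model = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, params["ngram_max"]))),
        ("clf", LinearSVC(C=params["C"])),
    ])
    return -cross_val_score(model, train_texts, train_labels, cv=2, scoring="accuracy").mean()

# Optimization algorithm: Tree-structured Parzen Estimator (TPE).
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=Trials())
print(best)
```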
Parameter Optimization
Parameter Optimization ● Precision tp/(tp+fp) ● Recall tp/(tp+fn) ● F-beta score (1+β²)·precision·recall / (β²·precision + recall)
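A quick check of these metrics with scikit-learn on toy labels (a sketch; beta = 1 recovers the usual F1 score).

```python
from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(precision_score(y_true, y_pred))        # tp / (tp + fp)
print(recall_score(y_true, y_pred))           # tp / (tp + fn)
print(fbeta_score(y_true, y_pred, beta=1.0))  # weighted harmonic mean of the two
```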
Results on Dev
Results on Test Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
Other Methods: NN Preprocessing ● Concatenate tweets by author ● Replace ○ urls ○ user mentions ○ hashtags ○ numbers ○ emoji (demojize [1]) ● NLTK TweetTokenizer [1] https://github.com/carpedm20/emoji/
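A sketch of the extra NN preprocessing steps (number replacement and emoji demojization); the <NUMBER> placeholder is an assumption, and demojize comes from the emoji package cited above [1].

```python
import re
import emoji
from nltk.tokenize import TweetTokenizer

def preprocess_for_nn(text):
    text = re.sub(r"\d+", "<NUMBER>", text)  # replace numbers with a placeholder (assumed token)
    text = emoji.demojize(text)              # e.g. "👍" becomes ":thumbs_up:"
    return TweetTokenizer().tokenize(text)

print(preprocess_for_nn("Scored 100 points 👍"))
```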
Other Methods: NN Model
Other Methods: Conv+Embedding
Other Methods: Conv+Pretrained Embedding
Other Methods: Conv+Embedding ● vocab_size=max_features+1 ● embedding_dim=50 ● maxlen=maxlen ● embedding_matrix_weights=None ● trainable=False ● dropout1_rate=0.6 ● conv1_filters=128 ● conv1_kernel_size=7 ● dropout2_rate=0.0 ● dense1_units=32 ● dropout3_rate=0.0
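A Keras sketch of a Conv+Embedding model built from the hyperparameters above. The layer ordering, output layer, and loss are assumptions inferred from the parameter names; when pretrained vectors are available they could be loaded into the Embedding layer instead of None.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense

def build_model_conv_embedding(vocab_size, embedding_dim=50, trainable=False,
                               dropout1_rate=0.6, conv1_filters=128, conv1_kernel_size=7,
                               dropout2_rate=0.0, dense1_units=32, dropout3_rate=0.0,
                               n_classes=2):
    model = Sequential([
        Embedding(vocab_size, embedding_dim, trainable=trainable),
        Dropout(dropout1_rate),
        Conv1D(conv1_filters, conv1_kernel_size, activation="relu"),
        GlobalMaxPooling1D(),
        Dropout(dropout2_rate),
        Dense(dense1_units, activation="relu"),
        Dropout(dropout3_rate),
        Dense(n_classes, activation="softmax"),  # output layer/loss are assumptions
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# model = build_model_conv_embedding(vocab_size=max_features + 1)
# Pretrained vectors (embedding_matrix_weights) could be loaded afterwards with
# model.layers[0].set_weights([embedding_matrix]) once the layer is built.
```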
Conclusions ● An SVM classifier with char/word n-gram TF-IDF features obtained good results ● Hyperparameter tuning is essential
Future Work ● why ● emoji ● lexicon ● word embeddings ● NN
Q&A
Environment Setup ● NLTK [1] ● scikit-learn [2] ● hyperopt [3,4] ● Google Colaboratory [5] ● Keras [6] [1] Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc. [2] Pedregosa et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, JMLR 12, 2825–2830 (2011) [3] Bergstra, J. Hyperopt: Distributed asynchronous hyperparameter optimization in Python. http://jaberg.github.com/hyperopt, 2013. [4] Bergstra, J., Yamins, D., Cox, D. D. (2013) Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In: Proc. of the 30th International Conference on Machine Learning (ICML 2013). [5] https://colab.research.google.com [6] Chollet, F., et al.: Keras. https://keras.io (2015)
Other Methods ● build_model_emb_culstm_dense ● build_model_emb_lstm_dense ● build_model_emb_conv_maxpool_lstm_dense ● build_model_emb_conv_globmaxpool_dense_dense ● build_model_emb_sdrop_conv_maxpool_conv_maxpool_conv_maxpool_fln_dense_dense ● build_model_emb_globmaxpool_dense_dense ● build_model_emb_sdrop_fln_dense_dense ● build_model_emb_sdrop_biculstm_fln_sdrop_globmaxpool_dense ● build_model_emb_fln_dense_dense
Bayesian Optimization https://towardsdatascience.com/an-introductory-example-of-bayesian-optimization-in-python-with-hyperopt-aae40fff4ff0
en-human
{ # -0.9459677419354838 en human
  'classifier': {'name': 'LinearSVC',
                 'params': {'C': 5153.874075307478, 'class_weight': 'balanced', 'dual': False,
                            'fit_intercept': True, 'intercept_scaling': 3.5918302677809204,
                            'loss': 'squared_hinge', 'max_iter': 1000, 'multi_class': 'ovr',
                            'penalty': 'l2', 'random_state': 2, 'tol': 0.0009950531254749422,
                            'verbose': False}},
  'feats': {'name': 'word_char',
            'params': {'char': {'max_df': 0.7, 'min_df': 0.02, 'ngram_range': (1, 3)},
                       'word': {'max_df': 0.6, 'min_df': 0.1, 'ngram_range': (2, 3)}}}
}
en-gender
{ # -0.8 en gender
  'classifier': {'name': 'LinearSVC',
                 'params': {'C': 14.332165053225301, 'class_weight': None,
                            'intercept_scaling': 0.215574951334565, 'loss': 'squared_hinge',
                            'max_iter': 2000, 'random_state': 42, 'tol': 3.798724613314342e-05}},
  'feats': {'name': 'word_char',
            'params': {'char': {'max_df': 0.7, 'min_df': 0.02, 'ngram_range': (1, 3)},
                       'word': {'max_df': 0.7, 'min_df': 0.04, 'ngram_range': (1, 3)}}}
}
es-human
{ # -0.9228260869565217 es human
  'classifier': {'name': 'LinearSVC',
                 'params': {'C': 5153.874075307478, 'class_weight': 'balanced', 'dual': False,
                            'fit_intercept': True, 'intercept_scaling': 3.5918302677809204,
                            'loss': 'squared_hinge', 'max_iter': 1000, 'multi_class': 'ovr',
                            'penalty': 'l2', 'random_state': 2, 'tol': 0.0009950531254749422,
                            'verbose': False}},
  'feats': {'name': 'word_char',
            'params': {'char': {'max_df': 0.8, 'min_df': 5, 'ngram_range': (3, 5)},
                       'word': {'max_df': 0.7, 'min_df': 0.04, 'ngram_range': (1, 3)}}}
}
es-gender
{ # -0.691304347826087 es gender
  'classifier': {'name': 'LinearSVC',
                 'params': {'C': 83.52500216960948, 'class_weight': 'balanced',
                            'intercept_scaling': 0.40890443833718515, 'loss': 'hinge',
                            'max_iter': 2000, 'random_state': 42, 'tol': 0.0053996507748986814}},
  'feats': {'name': 'word_char',
            'params': {'char': {'max_df': 0.7, 'min_df': 5, 'ngram_range': (3, 5)},
                       'word': {'max_df': 0.6, 'min_df': 0.04, 'ngram_range': (1, 3)}}}
}
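A sketch of how one of these tuned configurations (the en-gender one above) could be turned into a scikit-learn pipeline: word and char TF-IDF vectorizers with the listed parameters, combined and fed to a LinearSVC. This mirrors the word_char setup but is not the exact training code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

config = {
    'classifier': {'name': 'LinearSVC',
                   'params': {'C': 14.332165053225301, 'class_weight': None,
                              'intercept_scaling': 0.215574951334565, 'loss': 'squared_hinge',
                              'max_iter': 2000, 'random_state': 42, 'tol': 3.798724613314342e-05}},
    'feats': {'name': 'word_char',
              'params': {'char': {'max_df': 0.7, 'min_df': 0.02, 'ngram_range': (1, 3)},
                         'word': {'max_df': 0.7, 'min_df': 0.04, 'ngram_range': (1, 3)}}},
}

def build_pipeline(config):
    feats = FeatureUnion([
        ("char", TfidfVectorizer(analyzer="char", **config['feats']['params']['char'])),
        ("word", TfidfVectorizer(analyzer="word", **config['feats']['params']['word'])),
    ])
    clf = LinearSVC(**config['classifier']['params'])
    return Pipeline([("feats", feats), ("clf", clf)])

model = build_pipeline(config)  # then model.fit(train_texts, train_labels)
```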
Feature Extraction ● Char N-grams (1, 6) ● Word N-grams (1, 3) ● Tf-idf Using [1] [1] Pedregosa et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, JMLR 12, 2825–2830 (2011)
Models ● SVM LinearSVC ● MultinomialNB ● LogisticRegression Using [1] [1] Pedregosa et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, JMLR 12, 2825–2830 (2011)
Parameter Optimization ● Hand-tuning ● Grid Search ● Random Search [1] ● Sequential model-based optimization (SMBO, also known as Bayesian optimization) with hyperopt [2,3] ○ Domain or Search Space ○ Objective Function ○ Optimization Algorithm [1] Bergstra, J., Bengio, Y.: Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13, 281–305 (2012). [2] Bergstra, J. Hyperopt: Distributed asynchronous hyperparameter optimization in Python. http://jaberg.github.com/hyperopt, 2013. [3] Bergstra, J., Yamins, D., Cox, D. D. (2013) Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In: Proc. of the 30th International Conference on Machine Learning (ICML 2013).
Results