  1. RECSM Summer School: Social Media and Big Data Research. Pablo Barberá, London School of Economics. www.pablobarbera.com. Course website: pablobarbera.com/social-media-upf

  2. Supervised Machine Learning Applied to Social Media Text

  3. Supervised machine learning
     Goal: classify documents into pre-existing categories, e.g. authors of documents, sentiment of tweets, ideological position of parties based on manifestos, tone of movie reviews...
     What we need:
     - A hand-coded (labeled) dataset, to be split into:
       - Training set: used to train the classifier
       - Validation/test set: used to validate the classifier
     - A method to extrapolate from the hand coding to unlabeled documents (the classifier): Naive Bayes, regularized regression, SVM, k-nearest neighbors, BART, ensemble methods...
     - An approach to validate the classifier: cross-validation
     - A performance metric to choose the best classifier and avoid overfitting: confusion matrix, accuracy, precision, recall...
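As a concrete illustration of this workflow, here is a minimal sketch in Python with scikit-learn. The toy documents, labels, and parameter choices are invented for illustration, not taken from the course materials.

```python
# Minimal sketch of the workflow above, assuming scikit-learn is installed.
# The documents and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Hand-coded (labeled) dataset: documents plus their codes.
docs = ["great debate performance", "terrible policy proposal",
        "loved the speech", "awful campaign ad",
        "fantastic town hall", "worst interview ever"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive tone, 0 = negative tone

# Bag-of-words representation (document-term matrix).
X = CountVectorizer().fit_transform(docs)

# Split the labeled data into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, stratify=labels, random_state=42)

# Classifier: regularized logistic regression here; Naive Bayes, SVM,
# k-nearest neighbors, etc. would slot in the same way.
clf = LogisticRegression().fit(X_train, y_train)

# Validation via cross-validation on the training set...
print(cross_val_score(clf, X_train, y_train, cv=2))

# ...and performance metrics on the held-out test set.
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```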

  4. Supervised v. unsupervised methods compared
     - The goal (in text analysis) is to differentiate documents from one another, treating them as “bags of words”
     - Different approaches:
       - Supervised methods require a training set that exemplifies contrasting classes, identified by the researcher
       - Unsupervised methods scale documents based on patterns of similarity in the term-document matrix, without requiring a training step
     - Relative advantage of supervised methods: you already know the dimension being scaled, because you set it in the training stage
     - Relative disadvantage of supervised methods: you must already know the dimension being scaled, because you have to feed the classifier good sample documents in the training stage
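To make the contrast concrete, here is a hedged sketch applying both approaches to the same document-term matrix. The corpus and codes are invented, and TruncatedSVD (latent semantic analysis) stands in here for unsupervised scaling methods generally.

```python
# Sketch contrasting supervised and unsupervised approaches on one
# document-term matrix; corpus and labels are hypothetical.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["lower taxes and free markets", "cut spending and deregulate",
        "expand public healthcare", "invest in social programs"]

# Both approaches start from the same bag-of-words matrix.
dtm = CountVectorizer().fit_transform(docs)

# Supervised: the researcher fixes the dimension by labeling examples
# (0 = right, 1 = left in this toy coding); the model reproduces it.
labels = [0, 0, 1, 1]
clf = LogisticRegression().fit(dtm, labels)
print(clf.predict(dtm))

# Unsupervised: documents are scaled by term co-occurrence alone; what
# the recovered dimension means must be interpreted after the fact.
svd = TruncatedSVD(n_components=1, random_state=0)
print(svd.fit_transform(dtm).ravel())
```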

  5. Supervised learning v. dictionary methods
     - Dictionary methods:
       - Advantage: not corpus-specific; the cost of applying them to a new corpus is trivial
       - Disadvantage: not corpus-specific, so performance on a new corpus is unknown (domain shift)
     - Supervised learning can be conceptualized as a generalization of dictionary methods, where the features associated with each category (and their relative weights) are learned from the data
     - By construction, supervised classifiers will outperform dictionary methods in classification tasks, as long as the training sample is large enough
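A hedged sketch of that generalization, with word lists, documents, and labels invented for illustration: a dictionary assigns fixed +1/-1 weights to hand-picked terms, while a supervised model estimates a weight for every term from the labeled data.

```python
# Dictionary scoring vs. learned weights; all word lists, documents,
# and labels here are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great win for the party", "sad day for democracy",
        "great speech great crowd", "sad and angry voters"]
labels = [1, 0, 1, 0]  # hypothetical tone codes

# Dictionary method: fixed word lists with implicit weights of +1 / -1.
positive, negative = {"great", "win"}, {"sad", "angry"}
for doc in docs:
    tokens = doc.split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    print(doc, "->", score)

# Supervised generalization: a weight for every term, learned from data.
vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)
for term, weight in zip(vec.get_feature_names_out(), clf.coef_[0]):
    print(term, round(weight, 2))
```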

  6. Dictionaries vs. supervised learning (figure). Source: González-Bailón and Paltoglou (2015)

  7. Creating a labeled set
     How do we obtain a labeled set?
     - External sources of annotation
       - Self-reported ideology in users’ profiles
       - Gender in social security records
     - Expert annotation
       - “Canonical” dataset: the Comparative Manifesto Project
       - In most projects, undergraduate students (expertise comes from training)
     - Crowd-sourced coding
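Whatever the annotation source, it is good practice to check inter-coder reliability before training on the labels. A minimal sketch, assuming two hypothetical coders and using scikit-learn's Cohen's kappa implementation:

```python
# Inter-coder agreement check; the two coders' labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

coder_a = ["pos", "neg", "pos", "neu", "neg", "pos"]
coder_b = ["pos", "neg", "neu", "neu", "neg", "pos"]
print(cohen_kappa_score(coder_a, coder_b))  # 1.0 would be perfect agreement
```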
