

1. Introduction to Text Mining
Alliance Summer School 2019
Elliott Ash

2. Social Science meets Data Science
◮ We are seeing a revolution in social science:
  ◮ new datasets: administrative data, digitization of text archives, social media
  ◮ new methods: natural language processing, machine learning
◮ In particular:
  ◮ many important human behaviors consist of text, millions and millions of lines of it.
  ◮ we cannot read these texts ourselves; somehow we must teach machines to read them for us.

3. Readings
◮ Google Developers Guide to Text Classification:
  ◮ https://developers.google.com/machine-learning/guides/text-classification/
◮ “Analyzing polarization in social media: Method and application to tweets on 21 mass shootings” (2019).
  ◮ Demszky, Garg, Voigt, Zou, Gentzkow, Shapiro, and Jurafsky
◮ Natural Language Processing in Python
◮ Hands-on Machine Learning with Scikit-learn & TensorFlow 2.0

4. Programming
◮ Python is ideal for text data and machine learning.
  ◮ I recommend the Anaconda distribution (Python 3.6): continuum.io/downloads
◮ For relatively small corpora, R is also fine:
  ◮ see the quanteda package.

5. Text as Data
◮ Text data comes as sequences of characters, called documents.
◮ The set of documents is the corpus.
◮ Text data is unstructured:
  ◮ the information we want is mixed together with (lots of) information we don't.
◮ How do we separate the two?

6. Dictionary Methods
◮ Dictionary methods use a pre-selected list of words or phrases to analyze a corpus.
◮ Corpus-specific:
  ◮ count words related to your analysis.
◮ General:
  ◮ e.g. LIWC (liwc.wpengine.com) has lists of words across categories.
◮ Sentiment analysis: count sets of positive and negative words (doesn't work very well).
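As a concrete illustration, here is a minimal Python sketch of a dictionary-method count. The positive and negative word lists are toy examples (not LIWC); a real analysis would substitute a full dictionary.

```python
# A minimal dictionary-method sketch: count positive and negative words per
# document. The word lists here are toy examples, not the LIWC dictionaries.
from collections import Counter

positive_words = {"good", "gain", "benefit", "improve"}   # hypothetical list
negative_words = {"bad", "loss", "risk", "decline"}       # hypothetical list

def sentiment_counts(doc):
    tokens = Counter(doc.lower().split())
    pos = sum(tokens[w] for w in positive_words)
    neg = sum(tokens[w] for w in negative_words)
    return {"positive": pos, "negative": neg, "net": pos - neg}

print(sentiment_counts("Profits improve despite risk of decline"))
# {'positive': 1, 'negative': 2, 'net': -1}
```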

7. Measuring uncertainty in the macroeconomy
Baker, Bloom, and Davis
◮ Baker, Bloom, and Davis measure economic policy uncertainty using a Boolean search of newspaper articles (see http://www.policyuncertainty.com/).
◮ For each paper on each day since 1985, submit the following query:
  1. Article contains "uncertain" OR "uncertainty", AND
  2. Article contains "economic" OR "economy", AND
  3. Article contains "congress" OR "deficit" OR "federal reserve" OR "legislation" OR "regulation" OR "white house".
◮ Normalize the resulting article counts by the total number of newspaper articles that month.
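A minimal sketch of how this Boolean query could be applied to raw article text. The term lists come from the slide; the matching logic is an illustrative approximation, not the authors' implementation.

```python
import re

UNCERTAINTY = ("uncertain", "uncertainty")
ECONOMY = ("economic", "economy")
POLICY = ("congress", "deficit", "federal reserve", "legislation",
          "regulation", "white house")

def contains_any(text, terms):
    # whole-word / whole-phrase match, case-insensitive
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms)

def matches_epu_query(article):
    text = article.lower()
    return (contains_any(text, UNCERTAINTY)
            and contains_any(text, ECONOMY)
            and contains_any(text, POLICY))

# Monthly index for one paper: matching article counts normalized by the
# total number of articles.
def epu_share(articles):
    return sum(matches_epu_query(a) for a in articles) / len(articles)
```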

8. Measuring uncertainty in the macroeconomy (Baker, Bloom, and Davis)

9. Goals of Featurization
◮ The goal: produce features that are
  ◮ predictive in the learning task,
  ◮ interpretable by human investigators,
  ◮ tractable enough to be easy to work with.

10. Pre-processing
◮ Standard pre-processing steps:
  ◮ drop capitalization, punctuation, numbers, and stopwords (e.g. "the", "such");
  ◮ stem words to strip inflectional endings (e.g. "taxes" and "taxed" both become "tax").
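A minimal pre-processing sketch using NLTK (one reasonable toolchain; the slides do not prescribe a specific library):

```python
# Lowercase, strip punctuation and numbers, drop stopwords, stem the rest.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)
stops = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def preprocess(doc):
    doc = doc.lower()
    doc = re.sub(r"[^a-z\s]", " ", doc)          # drop punctuation and numbers
    tokens = [t for t in doc.split() if t not in stops]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The taxes were taxed at 35% such that..."))
# ['tax', 'tax']  -> "taxes" and "taxed" both stem to "tax"
```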

11. Parts of speech
◮ Part-of-speech (POS) tags provide useful word categories corresponding to words' functions in sentences:
  ◮ Content: noun (NN), verb (VB), adjective (JJ), adverb (RB)
  ◮ Function: determiner (DT), preposition (IN), conjunction (CC), pronoun (PR)
◮ Parts of speech vary in their informativeness for different tasks:
  ◮ For categorizing topics, nouns are usually most important.
  ◮ For sentiment, adjectives are usually most important.
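A short sketch of POS tagging with NLTK's Penn Treebank tagger (one possible tool; spaCy would work just as well), keeping only content-word tags:

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The estate tax reduces large inheritances")
tagged = nltk.pos_tag(tokens)                    # Penn Treebank tags

# keep content words: nouns, verbs, adjectives, adverbs
content = [(w, t) for w, t in tagged
           if t.startswith(("NN", "VB", "JJ", "RB"))]
print(content)
# e.g. [('estate', 'NN'), ('tax', 'NN'), ('reduces', 'VBZ'),
#       ('large', 'JJ'), ('inheritances', 'NNS')]
```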

12. N-grams
◮ N-grams are phrases: sequences of words up to length N.
  ◮ bigrams, trigrams, quadgrams, etc.
  ◮ capture information and familiarity from local word order,
  ◮ e.g. "estate tax" vs. "death tax".
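A minimal sketch of building n-grams from a token list:

```python
# Build n-grams to preserve local word order ("estate tax" vs. "death tax").
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["repeal", "the", "death", "tax", "now"]
print(ngrams(tokens, 2))
# ['repeal the', 'the death', 'death tax', 'tax now']
print(ngrams(tokens, 3))
# ['repeal the death', 'the death tax', 'death tax now']
```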

13. Filtering the Vocabulary
◮ N-grams will blow up your feature space: filtering out uninformative n-grams is necessary.
◮ Google Developers recommend a vocabulary size of m = 20,000; I have gotten good performance from m = 2,000.
1. Drop phrases that appear in few documents, or in almost all documents, using tf-idf weights:
   tf-idf(w) = (1 + log(c_w)) × log(N / d_w)
   where c_w = count of phrase w in the corpus, N = number of documents, and d_w = number of documents in which w appears.
2. Filter on parts of speech (keep nouns, adjectives, and verbs).
3. Filter on pointwise mutual information to get collocations (Ash, JITE 2017, p. 2).
4. Supervised feature selection: select phrases that are predictive of the outcome.
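A minimal sketch of the corpus-level tf-idf score in item 1, computed directly from the formula above (the documents here are toy examples):

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """docs: list of token lists. Returns {phrase: tf-idf score}."""
    N = len(docs)
    c = Counter(tok for doc in docs for tok in doc)          # corpus counts c_w
    d = Counter(tok for doc in docs for tok in set(doc))     # document counts d_w
    return {w: (1 + math.log(c[w])) * math.log(N / d[w]) for w in c}

docs = [["estate", "tax"], ["death", "tax"], ["tax", "reform", "tax"]]
scores = tfidf_scores(docs)
# "tax" appears in every document, so its score is zero and it gets filtered out.
# Keep the top m phrases, e.g. m = 2,000 or 20,000:
top = sorted(scores, key=scores.get, reverse=True)[:2000]
```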

14. A decent baseline for featurization
◮ Tag parts of speech; keep nouns, verbs, and adjectives.
◮ Drop stopwords, capitalization, and punctuation.
◮ Run the Snowball stemmer to drop word endings.
◮ Make bigrams from the tokens.
◮ Take the top 10,000 bigrams by tf-idf weight.
◮ Represent documents as tf-idf frequencies over these bigrams.
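A rough scikit-learn sketch of this baseline. Note that `max_features` in TfidfVectorizer keeps terms by corpus frequency rather than by tf-idf weight, and the POS-filtering and stemming steps would need a custom tokenizer (omitted here), so this is an approximation of the recipe above, not an exact implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus; in practice, a list of raw document strings
corpus = [
    "The estate tax was repealed.",
    "Critics call the estate tax a death tax.",
    "The death tax debate continued in Congress.",
]

vectorizer = TfidfVectorizer(
    lowercase=True,           # drop capitalization
    stop_words="english",     # drop stopwords
    ngram_range=(2, 2),       # bigrams only
    max_features=10_000,      # keep the 10,000 most frequent bigrams
    sublinear_tf=True,        # 1 + log(tf) term weighting
)
X = vectorizer.fit_transform(corpus)       # n_docs x n_bigrams tf-idf matrix
print(vectorizer.get_feature_names_out())
```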

15. Cosine Similarity
   cos_sim(v_1, v_2) = (v_1 · v_2) / (||v_1|| ||v_2||)
where v_1 and v_2 are vectors representing documents (e.g. IDF-weighted frequencies).
◮ Each document is a non-negative vector in an m-dimensional space (m = size of the dictionary).
◮ Closer vectors form smaller angles: cos(0) = +1 means identical documents.
◮ The furthest vectors are orthogonal: cos(π/2) = 0 means no words in common.
◮ For n documents, this gives n × (n − 1) similarities.
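A minimal sketch using scikit-learn's cosine_similarity, first on toy vectors and then on the tf-idf matrix from the featurization sketch above:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy idf-weighted frequency vectors for two documents
v1 = np.array([[0.0, 1.2, 0.5, 0.0]])
v2 = np.array([[0.0, 0.9, 0.0, 0.3]])
print(cosine_similarity(v1, v2))   # in [0, 1] for non-negative vectors

# For a whole corpus: if X is the n x m tf-idf matrix from above,
# sims = cosine_similarity(X) gives the n x n matrix of pairwise similarities.
```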

16. Text analysis of patent innovation
Kelly, Papanikolaou, Seru, and Taddy (2018), "Measuring technological innovation over the very long run"
◮ Data:
  ◮ 9 million patents since 1840, from the U.S. Patent Office and Google Scholar Patents;
  ◮ date, inventor, backward citations;
  ◮ text (abstract, claims, and description).
◮ Text pre-processing:
  ◮ drop HTML markup, punctuation, numbers, capitalization, and stopwords;
  ◮ remove terms that appear in fewer than 20 patents;
  ◮ 1.6 million words in the vocabulary.

17. Measuring Innovation
Kelly, Papanikolaou, Seru, and Taddy (2018)
◮ Backward IDF weighting of word w in patent i:
   BIDF(w, i) = log( [# of patents prior to i] / (1 + # of patents prior to i that include w) )
  ◮ down-weights words that appeared frequently before the patent.
◮ For each patent i:
  ◮ compute the cosine similarity ρ_ij to all future patents j, using the BIDF weights of i;
  ◮ the 9m × 9m similarity matrix = 30 TB of data;
  ◮ enforce sparsity by setting similarities < 0.05 to zero (93.4% of pairs).
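A toy sketch of the backward-IDF formula (an illustration of the definition on the slide, not the authors' code); `patents` is a hypothetical list of word sets sorted by filing date:

```python
import math

def bidf(w, i, patents):
    """Backward IDF of word w for the i-th patent (patents sorted by date)."""
    prior = patents[:i]                                   # patents filed before i
    n_prior = len(prior)
    n_with_w = sum(1 for words in prior if w in words)
    return math.log(n_prior / (1 + n_with_w)) if n_prior else 0.0

patents = [{"engine", "steam"}, {"engine", "valve"}, {"circuit", "transistor"}]
print(bidf("engine", 2, patents))   # log(2 / (1 + 2)): common word, down-weighted
print(bidf("valve", 2, patents))    # log(2 / (1 + 1)) = 0.0
```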

18. Novelty, Impact, and Quality
Kelly, Papanikolaou, Seru, and Taddy (2018)
◮ "Novelty" is defined by dissimilarity (negative similarity) to previous patents:
   Novelty_j = − Σ_{i ∈ B(j)} ρ_ij
  where B(j) is the set of previous patents (e.g. those from the last 20 years).
◮ "Impact" is defined as similarity to subsequent patents:
   Impact_i = Σ_{j ∈ F(i)} ρ_ij
  where F(i) is the set of future patents (e.g. those from the next 100 years).
◮ A patent has high quality if it is both novel and impactful:
   log Quality_k = log Impact_k + log Novelty_k
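A toy sketch of the novelty and impact sums above; `rho` is a hypothetical dense n × n similarity matrix and `year` gives each patent's filing year (the real computation works on the 9m × 9m sparse matrix):

```python
import numpy as np

def novelty(j, rho, year, window=20):
    # backward set B(j): patents filed in the `window` years before patent j
    backward = [i for i in range(len(year)) if year[j] - window <= year[i] < year[j]]
    return -rho[backward, j].sum()

def impact(i, rho, year, window=100):
    # forward set F(i): patents filed in the `window` years after patent i
    forward = [j for j in range(len(year)) if year[i] < year[j] <= year[i] + window]
    return rho[i, forward].sum()

# log Quality_k = log Impact_k + log Novelty_k, per the slide's definition
```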

19. Validation
Kelly, Papanikolaou, Seru, and Taddy (2018)
◮ For pairs with higher ρ_ij, patent j is more likely to cite patent i.
◮ Within a technology class (assigned by the patent office), similarity is higher than across classes.
◮ Higher-quality patents get more citations.

20. Most Innovative Firms (Kelly, Papanikolaou, Seru, and Taddy 2018)

21. Breakthrough patents: citations vs. quality (Kelly, Papanikolaou, Seru, and Taddy 2018)

22. Breakthrough patents and firm profits (Kelly, Papanikolaou, Seru, and Taddy 2018)

23. Topic Models in Social Science
◮ Topic models were developed in computer science and statistics:
  ◮ they summarize unstructured text using the words within documents;
  ◮ useful for dimension reduction.
◮ Social scientists use topics as a form of measurement:
  ◮ how observed covariates drive trends in language;
  ◮ tell a story not just about what, but about how and why;
  ◮ topic models are more interpretable than other methods, e.g. principal components analysis.

24. Latent Dirichlet Allocation (LDA)
◮ Idea: documents exhibit each topic in some proportion.
  ◮ Each document is a distribution over topics.
  ◮ Each topic is a distribution over words.
◮ Latent Dirichlet Allocation (e.g. Blei 2012) is the most popular topic model in this vein because it is easy to use and (usually) provides great results.
◮ Maintained assumptions: bag of words/phrases; the number of topics is fixed ex ante.
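A minimal LDA sketch with gensim (one common implementation; the slides do not prescribe a library). The documents here are toy token lists; the hyperparameter values follow the Hansen, McMahon, and Prat choices discussed on the next slides (K = 40, α = 50/K, η = 0.025).

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# toy pre-processed documents; in practice, token lists for the whole corpus
docs = [
    ["inflation", "rate", "price", "increase"],
    ["unemployment", "labor", "market", "slack"],
    ["inflation", "expectation", "price", "stability"],
]

K = 40                                    # Hansen et al.'s choice of topic count
dictionary = Dictionary(docs)             # map tokens to integer ids
bow = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(
    corpus=bow,
    id2word=dictionary,
    num_topics=K,
    alpha=50 / K,     # document-topic prior
    eta=0.025,        # topic-word prior (sparser, more interpretable topics)
    passes=10,
    random_state=0,
)

print(lda.print_topics(num_topics=5, num_words=5))   # top words per topic
print(lda.get_document_topics(bow[0]))                # topic shares of doc 0
```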

25. A statistical highlighter

26. Topic modeling Federal Reserve Bank transcripts
Hansen, McMahon, and Prat (QJE 2017)
◮ Use LDA to analyze speech at the FOMC (Federal Open Market Committee):
  ◮ private discussions among committee members at the Federal Reserve (the U.S. central bank);
  ◮ transcripts: 150 meetings, 20 years, 26,000 speeches, 24,000 unique words.
◮ Pre-processing:
  ◮ drop stopwords, stem words, etc.;
  ◮ drop words with low tf-idf weight.

27. LDA Training
Hansen, McMahon, and Prat (QJE 2017)
◮ K = 40 topics, selected for interpretability / topic coherence.
  ◮ The "statistically optimal" choice was K = 70, but those topics were less interpretable.
◮ Hyperparameters α = 50/K and η = 0.025 to promote sparse word distributions (and more interpretable topics).

28. (no caption text on this slide)

29. Pro-Cyclical Topics (Hansen, McMahon, and Prat, QJE 2017)

30. Counter-Cyclical Topics (Hansen, McMahon, and Prat, QJE 2017)

31. FOMC Topics and Policy Uncertainty (Hansen, McMahon, and Prat, QJE 2017)
