RECSM Summer School: Facebook + Topic Models (Pablo Barberá)



  1. RECSM Summer School: Facebook + Topic Models
Pablo Barberá
School of International Relations, University of Southern California
pablobarbera.com
Networked Democracy Lab: www.netdem.org
Course website: github.com/pablobarbera/big-data-upf

  2–6. Collecting Facebook data
Facebook only allows access to public pages’ data through the Graph API:
1. Posts on public pages
2. Likes, reactions, comments, replies...
Some public user data (gender, location) was available through previous versions of the API (not anymore).
Access to other (anonymized) data used in published studies requires permission from Facebook.
R library: Rfacebook
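The slide points to Rfacebook as the R client for the Graph API. Below is a minimal sketch of the collection workflow with that package; the app credentials and page name are placeholders, and because these endpoints have been progressively restricted, the calls may no longer return data under current API versions.

    library(Rfacebook)

    # Authenticate with a Facebook app (placeholder credentials)
    fb_oauth <- fbOAuth(app_id = "YOUR_APP_ID", app_secret = "YOUR_APP_SECRET")

    # Download the 100 most recent posts from a public page (example page name)
    posts <- getPage(page = "humansofnewyork", token = fb_oauth, n = 100)

    # Likes and comments for the first post in the result
    post <- getPost(post = posts$id[1], token = fb_oauth, n = 500)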

  7–13. Overview of text as data methods
[Figure: Fig. 1 in Grimmer and Stewart (2013), an overview of text-as-data methods. Across these slides the figure adds labels for entity recognition (events, quotes, locations, names), cosine similarity, Naive Bayes, mixture models, other machine-learning methods, and models with covariates (sLDA, STM).]
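To make one of the methods in the figure concrete, here is a toy illustration (not from the slides) of cosine similarity between documents represented as term-frequency vectors; the vocabulary and counts are invented.

    # Cosine similarity between two term-frequency vectors
    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    # Hypothetical word counts over a shared vocabulary
    doc1 <- c(election = 3, vote = 2, soccer = 0, goal = 0)
    doc2 <- c(election = 1, vote = 1, soccer = 0, goal = 1)
    doc3 <- c(election = 0, vote = 0, soccer = 4, goal = 2)

    cosine_sim(doc1, doc2)  # high: both mostly about politics
    cosine_sim(doc1, doc3)  # near zero: different topics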

  14–16. Latent Dirichlet allocation (LDA)
◮ Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents
[Figure: example documents grouped into topics; word clusters include sports (baseball, soccer, basketball, football), politics (president, obama, washington, ethics), and religion (hindu, judaism, buddhism).]
◮ Many applications in information retrieval, document summarization, and classification
[Figure: "What is this document about?" A new document with words w_1, ..., w_N is summarized by its distribution of topics θ, e.g. weather .50, finance .49, sports .01.]
◮ LDA is one of the simplest and most widely used topic models

  17. Latent Dirichlet Allocation

  18. Latent Dirichlet Allocation
◮ Document = random mixture over latent topics
◮ Topic = distribution over n-grams
Probabilistic model with 3 steps:
1. Choose θ_i ∼ Dirichlet(α)
2. Choose β_k ∼ Dirichlet(δ)
3. For each word m in document i:
   ◮ Choose a topic z_im ∼ Multinomial(θ_i)
   ◮ Choose a word w_im ∼ Multinomial(β_k), where k = z_im
where:
α = parameter of the Dirichlet prior on the distribution of topics over documents
θ_i = topic distribution for document i
δ = parameter of the Dirichlet prior on the distribution of words over topics
β_k = word distribution for topic k
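The three generative steps can be simulated directly, which is a useful way to see what the model assumes. A minimal sketch with made-up dimensions (K topics, M vocabulary terms, N documents of 50 words each), using rdirichlet() from the MCMCpack package:

    library(MCMCpack)  # provides rdirichlet()

    set.seed(123)
    K <- 3; M <- 10; N <- 5
    alpha <- 0.5; delta <- 0.1

    beta  <- rdirichlet(K, rep(delta, M))  # step 2: K x M word distributions, one per topic
    theta <- rdirichlet(N, rep(alpha, K))  # step 1: N x K topic distributions, one per document

    # step 3: for each word in each document, draw a topic, then a word from that topic
    docs <- lapply(seq_len(N), function(i) {
      z <- sample(1:K, size = 50, replace = TRUE, prob = theta[i, ])
      sapply(z, function(k) sample(1:M, size = 1, prob = beta[k, ]))
    })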

  19–20. Latent Dirichlet Allocation
Key parameters:
1. θ = matrix of dimensions N documents by K topics, where θ_ik is the probability that document i belongs to topic k; e.g. assuming K = 5:

                T1     T2     T3     T4     T5
   Document 1   0.15   0.15   0.05   0.10   0.55
   Document 2   0.80   0.02   0.02   0.10   0.06
   ...
   Document N   0.01   0.01   0.96   0.01   0.01

2. β = matrix of dimensions K topics by M words, where β_km is the probability that word m belongs to topic k; e.g. assuming M = 6:

             W1     W2     W3     W4     W5     W6
   Topic 1   0.40   0.05   0.05   0.10   0.10   0.30
   Topic 2   0.10   0.10   0.10   0.50   0.10   0.10
   ...
   Topic K   0.05   0.60   0.10   0.05   0.10   0.10
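In practice θ and β are estimated from data rather than written down. A hedged sketch with the topicmodels package, assuming a document-term matrix dtm has already been built (for example with the tm or quanteda packages):

    library(topicmodels)

    # Fit LDA with K = 5 topics via collapsed Gibbs sampling
    fit <- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 123))

    theta <- posterior(fit)$topics  # N documents x K topics; rows sum to 1
    beta  <- posterior(fit)$terms   # K topics x M words; rows sum to 1

    terms(fit, 10)  # ten most probable words per topic, useful for labeling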

  21–29. Validation
From Quinn et al., AJPS, 2010:
1. Semantic validity
   ◮ Do the topics identify coherent groups of tweets that are internally homogeneous and related to each other in a meaningful way?
2. Convergent/discriminant construct validity
   ◮ Do the topics match existing measures where they should match?
   ◮ Do they depart from existing measures where they should depart?
3. Predictive validity
   ◮ Does variation in topic usage correspond with expected events? (a sketch of this check follows below)
4. Hypothesis validity
   ◮ Can topic variation be used effectively to test substantive hypotheses?
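For the predictive-validity check, a common approach is to aggregate the estimated topic proportions by date and look for spikes around the events a topic should track. A sketch, assuming a document-topic matrix theta from a fitted model and a vector doc_dates of posting dates (both hypothetical names here):

    # Average attention to topic 3 by day
    df <- data.frame(date = doc_dates, topic3 = theta[, 3])
    topic_by_day <- aggregate(topic3 ~ date, data = df, FUN = mean)

    # Does attention spike when we expect it to (e.g. around known events)?
    plot(topic_by_day$date, topic_by_day$topic3, type = "l",
         xlab = "Date", ylab = "Average proportion of topic 3")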
