mining the social web

Mining the Social Web Asmelash Teka Hadgu teka@l3s.de L3S Research - PowerPoint PPT Presentation



  1. Mining the Social Web
     Asmelash Teka Hadgu, teka@l3s.de
     L3S Research Center
     April 30, 2013

  2. Outline
     ◮ Introduction
     ◮ User Classification
     ◮ Network Analysis
     ◮ Content Analysis
     ◮ Privacy Issues
     L3S Web Science 1

  3. Web Science
     Definition from the Web Science Conference (http://www.websci13.org/):
     ◮ Web Science is the emergent science of the people, organizations, applications, and of policies that shape and are shaped by the Web.
     ◮ Web Science embraces the study of the Web as a vast universal information network of people and communities.
     ◮ Studying human behavior and social interaction contributes to our understanding of the Web, while Web data is transforming how social science is conducted.

  4. Social Media
     Figure: Social Media billboard (http://bit.ly/10216Jy)

  5. Twitter
     ◮ Politicians use Twitter to mobilize users.
     ◮ Companies use Twitter for marketing products.

  6. User classification in Twitter [1]
     How can we automatically construct user profiles?

  7. Applications
     ◮ Authoritative user extraction: discovering expert users for a target topic.
     ◮ Personalized web search: personalized retrieval of social media posts.
     ◮ User recommendation: suggesting new interesting users to a target user.

  8. Example tasks
     ◮ Political affiliation detection (Right vs Left)
     ◮ Ethnicity identification (African-American or not)
     ◮ Detecting affinity for a particular business (Starbucks fans)

  9. Machine Learning Model
     ◮ Feature construction: profile, tweeting behaviour, linguistic content, and social network features.
     ◮ Classification algorithm: Gradient Boosted Decision Trees (GBDT) framework.

  11. Profile features
      Profile information does not contain enough quality information to be directly used for user classification.
      ◮ Length of name, number of alphanumeric characters.
      ◮ Capitalization forms in user name.
      ◮ Use of avatar picture.
      ◮ Number of followers/friends.
      ◮ Regular expression matches in bio: (I|i)(m|am|'m)[0-9]+(yo|year old) white (man|woman|boy|girl)
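
The profile features listed above could be computed along the following lines. This is only a minimal sketch: the dictionary keys (`name`, `bio`, `avatar_url`, `followers`, `friends`) are hypothetical field names, and the regular expression is a whitespace-tolerant reading of the bio pattern on the slide, not necessarily the exact expression used in [1].

```python
import re

# Whitespace-tolerant reading of the bio pattern on the slide (an assumption:
# the transcript lost the original spacing and some parentheses).
BIO_PATTERN = re.compile(
    r"\bI\s*(?:'m|am|m)\s*[0-9]+\s*(?:yo|year\s*old)\s*white\s*(?:man|woman|boy|girl)\b",
    re.IGNORECASE,
)


def profile_features(profile):
    """Turn a profile dict (hypothetical field names) into simple features."""
    name = profile.get("name", "")
    bio = profile.get("bio", "")
    return {
        "name_length": len(name),
        "name_alnum_chars": sum(ch.isalnum() for ch in name),
        "name_is_title_case": name.istitle(),
        "name_is_upper": name.isupper(),
        "has_avatar": bool(profile.get("avatar_url")),
        "followers": profile.get("followers", 0),
        "friends": profile.get("friends", 0),
        "bio_matches_pattern": bool(BIO_PATTERN.search(bio)),
    }
```

The resulting dictionary is the kind of flat feature vector that a tree-based classifier can consume after the boolean values are cast to integers.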

  13. Tweeting behaviour features
      A set of statistics capturing the way users interact with the micro-blogging service.
      ◮ Number of tweets of a user.
      ◮ Number and fraction of retweets of a user.
      ◮ Average number of hashtags and URLs per tweet.
      ◮ Average time between tweets and its standard deviation.
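
As an illustration, the behaviour statistics above can be derived from a user's timeline roughly as follows. The tweet representation (dicts with `text`, `is_retweet`, and a numeric `timestamp` in seconds) is an assumption made for this sketch, as are the simple hashtag and URL patterns.

```python
import re
from statistics import mean, pstdev

HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")


def behaviour_features(tweets):
    """Compute per-user behaviour statistics from a non-empty list of tweets."""
    n = len(tweets)
    if n == 0:
        raise ValueError("need at least one tweet")
    retweets = sum(1 for t in tweets if t["is_retweet"])
    # Gaps between consecutive tweets, in the unit of the timestamps.
    times = sorted(t["timestamp"] for t in tweets)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "num_tweets": n,
        "num_retweets": retweets,
        "retweet_fraction": retweets / n,
        "avg_hashtags": mean([len(HASHTAG.findall(t["text"])) for t in tweets]),
        "avg_urls": mean([len(URL.findall(t["text"])) for t in tweets]),
        "avg_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
    }
```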

  15. Linguistic content features
      Linguistic content contains the user’s lexical usage and the main topics of interest to the user.
      ◮ Prototypical words and hashtags instead of a bag-of-words representation.
      ◮ Generic LDA, domain-specific LDA.
      ◮ Sentiment words.

  16. Social network features
      These features capture the social connections between a user and the accounts they follow, reply to, or retweet.
      ◮ Friend accounts: prototypical ‘friend’ accounts are generated by exploring the social network of users in the training set.
      ◮ Number and percentage of prototypical friends.
      ◮ Prototypical replied users, prototypical retweeted users.

  17. Experiments
      ◮ Political affiliation, more than 80%
      ◮ Starbucks fans
      ◮ Ethnicity

  18. Political Polarization on Twitter [2]
      How social media shape the networked public sphere and facilitate communication between communities with different political orientations.

  19. Data Set
      ◮ 250,000 politically relevant tweets from more than 45,000 users.
      ◮ Construct two networks of political communication: a retweet network and a mention network.
      ◮ Data set available at: cnets.indiana.edu/groups/nan/truthy

  20. Finding
      Figure: Political retweet network (left) and mention network (right)

  21. Framework
      ◮ Data gathering
      ◮ Identifying political content
      ◮ Political communication networks
      ◮ Network analysis

  22. Identifying Political Content
      ◮ Political communication: any tweet containing at least one politically relevant hashtag.
      ◮ Political hashtags are constructed from the seed hashtags #p2 and #tcot using Jaccard similarity.
      ◮ Let S be the set of tweets containing a seed hashtag and T the set of tweets containing another hashtag:
      $$\sigma(S, T) = \frac{|S \cap T|}{|S \cup T|}$$
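
The Jaccard step above can be sketched in a few lines. The input format (a mapping from hashtag to the set of tweet IDs containing it) and the `threshold` parameter are assumptions for illustration; the slides do not state which cut-off was used.

```python
def jaccard(s, t):
    """Jaccard similarity |S ∩ T| / |S ∪ T| of two sets (0.0 if both are empty)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def related_hashtags(tweets_by_hashtag, seed, threshold):
    """Return hashtags whose tweet sets overlap the seed's by at least `threshold`."""
    seed_tweets = tweets_by_hashtag[seed]
    related = {}
    for hashtag, tweet_ids in tweets_by_hashtag.items():
        if hashtag == seed:
            continue
        similarity = jaccard(seed_tweets, tweet_ids)
        if similarity >= threshold:
            related[hashtag] = similarity
    return related
```

For example, if #p2 appears in tweets {1, 2, 3, 4} and another hashtag in tweets {3, 4, 5}, the two share 2 of 5 distinct tweets, giving a similarity of 0.4.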

  23. Community Structure
      ◮ Community detection using a label propagation method for two communities.
      ◮ Label propagation: assign an initial arbitrary cluster membership to each node, then iteratively update each node’s label to the label shared by most of its neighbors.
      ◮ Modularity to measure segregation.
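
A generic label propagation loop, following the description above, might look like this. It is a sketch rather than the exact variant used in [2]: the update order (sorted node names), the tie-breaking rule (keep the current label when it is tied for first place), and the `max_iters` cap are all assumptions.

```python
from collections import Counter


def label_propagation(adjacency, initial_labels, max_iters=100):
    """Iteratively give each node the most common label among its neighbors."""
    labels = dict(initial_labels)
    for _ in range(max_iters):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[n] for n in adjacency[node])
            if not counts:
                continue  # isolated node keeps its label
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # Keep the current label on ties, otherwise take the smallest winner.
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # converged: no label changed in a full pass
    return labels
```

On two triangles joined by a single edge, with one node initially mislabelled, the loop pulls that node back to the label of its own triangle and then stops.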

  24. Do clusters have similar content?
      ◮ Associate each user with a profile vector of the hashtags in their tweets, weighted by frequency.
      ◮ Compute cosine similarity among users.
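
The hashtag profile vectors and their cosine similarity can be illustrated as follows. This is a minimal sketch that assumes each user's tweets are available as plain strings and that hashtags are compared case-insensitively; neither detail is specified on the slide.

```python
import math
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def hashtag_profile(tweets):
    """Frequency-weighted hashtag vector for one user's tweets."""
    return Counter(tag.lower() for text in tweets for tag in HASHTAG.findall(text))


def cosine(p, q):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(p[h] * q[h] for h in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0
```

Averaging `cosine` over pairs of users within a cluster and across clusters then gives a simple way to compare content similarity inside and between communities.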

  25. Do clusters in the retweet network correspond to groups of users of similar political alignment?
      ◮ Qualitative content analysis from social science.
      ◮ One author annotates 1,000 random users as ‘left’ or ‘right’.
      ◮ A second annotator labels 200 random users drawn from those 1,000.
      ◮ Inter-annotator agreement is measured using Cohen’s Kappa:
      $$\kappa = \frac{P(\alpha) - P(\epsilon)}{1 - P(\epsilon)}$$
      where P(α) is the observed rate of agreement between annotators and P(ε) is the expected rate of random agreement given the relative frequency of each class label.
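
Cohen's Kappa as defined above can be computed directly from two parallel label lists. The sketch below assumes both annotators labelled the same items in the same order; the function name is only illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' labels on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

As a worked case: with annotator A giving left, left, right, right and annotator B giving left, right, right, right, the observed agreement is 3/4, the expected agreement is (2·1 + 2·3)/16 = 0.5, and Kappa is (0.75 − 0.5)/(1 − 0.5) = 0.5.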

  26. Political Twitter Trends (PTT) [3]
      ◮ Analysis tool for political polarization of Twitter hashtags (http://politicalhashtagtrends.sandbox.yahoo.com/).

  27. Data Set
      ◮ Start with a set of seed political users, such as @BarackObama and @MittRomney, whose political leaning is known.
      ◮ Get their tweets.

  28. Data Set (continued)
      ◮ Collect users that retweet seed users’ tweets.

  29. Filtering Users by Location
      ◮ We want to limit our analysis to the U.S.

  30. Evaluating Data Quality
      ◮ Against Web directories.
      ◮ Precision = 0.98 and 0.93 for Wefollow and Twellow, respectively.
      ◮ Manual inspection: “greatest environmentalist. Also, despise republicans”

  31. Detecting Political Hashtags
      ◮ Look at co-occurrence with the seed political hashtags (#p2, #tcot, #gop, #ows) and with the terms ‘obama’, ‘romney’, ‘politic’, ‘liberal’, ‘conservative’, ‘democ’, or ‘republic’.
      ◮ Volume filtering to avoid rare hashtags.

  36. Computing Trending Score
      ◮ Trending: currently popular, i.e. having a higher volume than expected.
      $$\mathrm{trend}(h, w) := \frac{f(h, w) \,/\, \sum_{h' \in H} f(h', w)}{\sum_{u \le w} f(h, u) \,/\, \sum_{u \le w} \sum_{h' \in H} f(h', u)}$$
      Examples:
      ◮ #obamagotosama: 01 May 2011 to 08 May 2011.
      ◮ #ows: 25 Sep. 2011 to 2 Oct. 2011.
      ◮ Non-trending hashtags: #vote, #democracy.
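
One way to read the trending score is as the hashtag's share of volume in the current window divided by its share over all windows up to and including it. The sketch below follows that reading and assumes that f(h, w) is the count of hashtag h in window w (the slides do not define f explicitly) and that counts are stored as a mapping from window index to per-hashtag counts.

```python
def trend(counts, hashtag, window):
    """Share of `hashtag` in `window` relative to its share up to `window`."""
    current = counts[window]
    history = [c for u, c in counts.items() if u <= window]
    current_total = sum(current.values())
    hist_hashtag = sum(c.get(hashtag, 0) for c in history)
    hist_total = sum(sum(c.values()) for c in history)
    if current_total == 0 or hist_hashtag == 0:
        return 0.0  # no volume to compare against
    current_share = current.get(hashtag, 0) / current_total
    expected_share = hist_hashtag / hist_total
    return current_share / expected_share
```

Under this reading, a score above 1 means the hashtag takes a larger share of the current window than it has historically, while a steadily used hashtag stays close to 1 or below.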

  40. Assigning a Leaning to Hashtags
      Using a voting approach:
      $$\mathrm{Lean}(h, w) := \frac{\mathrm{Vote}(h, w)_L}{\sum_{l \in \mathcal{L}} \mathrm{Vote}(h, w)_l}$$
      ◮ Raw counts: $\mathrm{Vote}(h, w)_l = f(h, w)_l$
      ◮ Normalization: $\mathrm{Vote}(h, w)_l = \frac{f(h, w)_l}{\sum_{h' \in H} f(h', w)_l}$
      ◮ Smoothing (c.f. Laplace smoothing): $\mathrm{Vote}(h, w)_l = \frac{f(h, w)_l + \frac{1}{40000} \sum_{h' \in H} f(h', w)_l}{\sum_{h' \in H} f(h', w)_l}$
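
The voting approach can be sketched for the raw and normalized votes as follows. The sketch assumes that f(h, w)_l is the count of hashtag h in window w among users of leaning l, that there are two leanings keyed "L" and "R", and that the counts for a single window are passed in; none of these names come from the slides. The smoothed variant is left out here.

```python
def lean(counts_by_leaning, hashtag, side="L", normalize=True):
    """Fraction of the total vote for `hashtag` that comes from `side`.

    `counts_by_leaning` maps a leaning label to per-hashtag counts for one window.
    """
    votes = {}
    for leaning, hashtag_counts in counts_by_leaning.items():
        vote = hashtag_counts.get(hashtag, 0)
        if normalize:
            # Divide by the leaning's total volume so a larger group
            # does not dominate the vote.
            total = sum(hashtag_counts.values())
            vote = vote / total if total else 0.0
        votes[leaning] = vote
    total_votes = sum(votes.values())
    return votes[side] / total_votes if total_votes else None
```

With 30 of 100 left-leaning uses and 10 of 200 right-leaning uses of a hashtag, the raw vote gives a left leaning of 30/40 = 0.75, while the normalized vote gives 0.3/(0.3 + 0.05) ≈ 0.86, which shows how normalization corrects for unequal group volumes.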
