Mining the Social Web Asmelash Teka Hadgu teka@l3s.de L3S Research Center April 30, 2013
Outline Introduction User Classification Network Analysis Content Analysis Privacy Issues L3S Web Science 1
Web Science
Definition from the Web Science Conference (http://www.websci13.org/):
◮ Web Science is the emergent science of the people, organizations, applications, and of policies that shape and are shaped by the Web.
◮ Web Science embraces the study of the Web as a vast universal information network of people and communities.
◮ Studying human behavior and social interaction contributes to our understanding of the Web, while Web data is transforming how social science is conducted.
Social Media
Figure: Social Media billboard (http://bit.ly/10216Jy)
Twitter
◮ Politicians use Twitter to mobilize users.
◮ Companies use Twitter to market products.
User Classification in Twitter [1]
How can we automatically construct user profiles?
Applications
◮ Authoritative user extraction: discovering expert users for a target topic.
◮ Personalized web search: personalized retrieval of social media posts.
◮ User recommendation: suggesting new, interesting users to a target user.
Example tasks
◮ Political affiliation detection (Right vs Left)
◮ Ethnicity identification (African-Americans or not)
◮ Detecting affinity for a particular business (Starbucks fans)
Machine Learning Model
◮ Feature construction: profile, tweeting behaviour, linguistic content, and social network features.
◮ Classification algorithm: Gradient Boosted Decision Trees (GBDT) framework.
Profile features
Profile information does not contain enough quality information to be directly used for user classification.
◮ Length of name, number of alphanumeric characters.
◮ Capitalization forms in the user name.
◮ Use of an avatar picture.
◮ Number of followers/friends.
◮ Regular expression matches in the bio: (I|i)(m|am|'m)[0-9]+(yo|year old), white (man|woman|boy|girl)
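The bio-matching feature can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the two patterns are a reading of the expressions on this slide, the optional whitespace between their parts is an assumption, and the function name `bio_features` and the feature names are hypothetical.

```python
import re

# Patterns read off the slide; the \s* separators are an assumption.
AGE_PATTERN = re.compile(r"(I|i)\s*(m|am|'m)\s*[0-9]+\s*(yo|year old)")
ETHNICITY_PATTERN = re.compile(r"white\s*(man|woman|boy|girl)")


def bio_features(bio):
    """Return binary features telling which patterns match a profile bio."""
    return {
        "mentions_age": AGE_PATTERN.search(bio) is not None,
        "mentions_ethnicity": ETHNICITY_PATTERN.search(bio) is not None,
    }


print(bio_features("I am 25 yo and love coffee"))
```

Each match becomes one binary feature for the classifier, alongside the numeric profile features listed above.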
Tweeting behaviour features
A set of statistics capturing the way users interact with the micro-blogging service.
◮ Number of tweets of a user.
◮ Number and fraction of retweets of a user.
◮ Average number of hashtags and URLs per tweet.
◮ Average and standard deviation of the time between tweets.
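These statistics are simple to compute once a user's tweets are available. The sketch below assumes a hypothetical input format (a list of dicts with keys `text`, `is_retweet`, and `timestamp` in seconds) and uses naive whitespace tokenization. Neither assumption comes from the slides.

```python
from statistics import mean, pstdev


def behaviour_features(tweets):
    """Compute tweeting behaviour statistics for one user.

    `tweets` is a list of dicts with hypothetical keys
    'text', 'is_retweet' and 'timestamp' (seconds).
    """
    n = len(tweets)
    if n == 0:
        return {}
    tokens = [t["text"].split() for t in tweets]
    n_retweets = sum(1 for t in tweets if t["is_retweet"])
    hashtags = [sum(tok.startswith("#") for tok in toks) for toks in tokens]
    urls = [sum(tok.startswith("http") for tok in toks) for toks in tokens]
    # Time gaps between consecutive tweets, in chronological order.
    times = sorted(t["timestamp"] for t in tweets)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "n_tweets": n,
        "n_retweets": n_retweets,
        "retweet_fraction": n_retweets / n,
        "avg_hashtags": mean(hashtags),
        "avg_urls": mean(urls),
        "avg_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
    }
```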
Linguistic content features
Linguistic content captures the user’s lexical usage and the main topics of interest to the user.
◮ Prototypical words and hashtags instead of a bag-of-words representation.
◮ Generic LDA, domain-specific LDA.
◮ Sentiment words.
Social network features
These features capture the social connections between a user and the accounts they follow, reply to, or retweet.
◮ Friend accounts: prototypical ‘friend’ accounts are generated by exploring the social network of users in the training set.
◮ Number and percentage of prototypical friends.
◮ Prototypical replied users, prototypical retweeted users.
Experiments
◮ Political affiliation, more than 80%
◮ Starbucks fans
◮ Ethnicity
Political Polarization on Twitter [2]
How social media shape the networked public sphere and facilitate communication between communities with different political orientations.
Data Set
◮ 250,000 politically relevant tweets from more than 45,000 users.
◮ Two networks of political communication are constructed: a retweet network and a mention network.
◮ Data set available at: cnets.indiana.edu/groups/nan/truthy
Finding
Figure: Political retweet network (left) and mention network (right)
Framework
◮ Data gathering
◮ Identifying political content
◮ Political communication networks
◮ Network analysis
Identifying Political Content
◮ Political communication: any tweet containing at least one politically relevant hashtag.
◮ Political hashtags are constructed from the seed hashtags #p2 and #tcot using Jaccard similarity.
◮ Let S be the set of tweets containing a seed hashtag and T the set of tweets containing another hashtag. Then
$$\sigma(S, T) = \frac{|S \cap T|}{|S \cup T|}$$
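The Jaccard coefficient between a seed hashtag and every co-occurring hashtag can be sketched as below. The input format (one set of hashtags per tweet) and the function names are assumptions made for illustration. The threshold used to accept a hashtag as political is not given on the slide, so none is applied here.

```python
def jaccard(s, t):
    """Jaccard similarity |S ∩ T| / |S ∪ T| of two sets."""
    s, t = set(s), set(t)
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def hashtag_similarities(tweets, seed):
    """Score every non-seed hashtag against the seed hashtag.

    `tweets` is a list of sets of hashtags, one set per tweet.
    """
    # Map each hashtag to the ids of the tweets that contain it.
    tweet_ids = {}
    for i, tags in enumerate(tweets):
        for tag in tags:
            tweet_ids.setdefault(tag, set()).add(i)
    seed_ids = tweet_ids.get(seed, set())
    return {
        tag: jaccard(seed_ids, ids)
        for tag, ids in tweet_ids.items()
        if tag != seed
    }


tweets = [
    {"#tcot", "#gop"},
    {"#tcot", "#gop"},
    {"#tcot"},
    {"#gop", "#music"},
    {"#music"},
]
print(hashtag_similarities(tweets, "#tcot"))
```

In this toy data #gop shares two of the four tweets that contain either tag, while #music never co-occurs with the seed.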
Community Structure
◮ Community detection using a label propagation method for two communities.
◮ Label propagation: assign an initial arbitrary cluster membership to each node, then iteratively update each node’s label to the label shared by most of its neighbors.
◮ Modularity is used to measure segregation.
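A minimal sketch of label propagation and of a modularity score on an undirected graph follows. It is a generic textbook variant, not necessarily the exact procedure used in the study: the update order (sorted node ids), the tie-breaking rule, and the use of Newman's standard modularity are assumptions made here to keep the example deterministic.

```python
from collections import Counter


def label_propagation(adj, labels, max_iter=100):
    """Repeatedly give each node the label most common among its neighbors.

    `adj` maps each node to a list of neighbors; `labels` is the arbitrary
    initial assignment. Ties keep the current label when possible.
    """
    labels = dict(labels)
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            counts = Counter(labels[n] for n in adj[node])
            if not counts:
                continue
            best = max(counts.values())
            top = sorted(l for l, c in counts.items() if c == best)
            new = labels[node] if labels[node] in top else top[0]
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break
    return labels


def modularity(adj, labels):
    """Newman modularity: sum over communities of e_c/m - (d_c/2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    internal, degree = Counter(), Counter()
    for node, nbrs in adj.items():
        degree[labels[node]] += len(nbrs)
        for n in nbrs:
            if labels[n] == labels[node]:
                internal[labels[node]] += 0.5  # each edge is seen twice
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single edge (2-3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = {0: "A", 1: "A", 2: "B", 3: "A", 4: "B", 5: "B"}
result = label_propagation(adj, start)
print(result, modularity(adj, result))
```

On this toy graph the procedure separates the two triangles. A higher modularity indicates stronger segregation between the clusters.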
Do clusters have similar content?
◮ Associate each user with a profile vector of the hashtags in their tweets, weighted by frequency.
◮ Compute the cosine similarity between users.
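As an illustration, the profile vectors and their cosine similarity could be computed as below. The whitespace tokenization, the lowercasing of hashtags, and the function names are assumptions for this sketch.

```python
from collections import Counter
from math import sqrt


def hashtag_profile(tweets):
    """Frequency-weighted hashtag vector for one user's tweet texts."""
    return Counter(
        tok.lower()
        for text in tweets
        for tok in text.split()
        if tok.startswith("#")
    )


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


alice = hashtag_profile(["#p2 rocks #P2", "go #ows"])
bob = hashtag_profile(["#p2 and #tcot"])
print(cosine(alice, bob))
```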
Do clusters in the retweet network correspond to groups of users of similar political alignment?
◮ Qualitative content analysis from social science.
◮ One author annotates 1,000 random users as ‘left’ or ‘right’.
◮ A second annotator labels 200 random users drawn from those 1,000.
◮ Inter-annotator agreement is measured using Cohen’s Kappa:
$$\kappa = \frac{P(\alpha) - P(\epsilon)}{1 - P(\epsilon)}$$
where $P(\alpha)$ is the observed rate of agreement between annotators and $P(\epsilon)$ is the expected rate of random agreement given the relative frequency of each class label.
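Cohen's Kappa can be computed directly from two label lists. The sketch below uses the standard definition, where the expected agreement is derived from each annotator's own label frequencies. The toy labels are made up for illustration, and the function assumes the expected agreement is below 1 (that is, the annotators do not both use a single label throughout).

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's Kappa for two equally long lists of labels."""
    n = len(a)
    # Observed agreement P(alpha).
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement P(epsilon) from the label marginals.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


first = list("LLLLLRRRRR")   # hypothetical labels of annotator 1
second = list("LLLLRRRRRL")  # hypothetical labels of annotator 2
print(cohens_kappa(first, second))
```

Here the annotators agree on 8 of 10 users, and chance agreement is 0.5, giving a Kappa of 0.6.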
Political Twitter Trends, PTT [3]
◮ Analysis tool for the political polarization of Twitter hashtags (http://politicalhashtagtrends.sandbox.yahoo.com/).
Data Set
◮ Start with a set of seed political users such as @BarackObama and @MittRomney whose political leaning is known.
◮ Get their tweets.
Data Set (continued)
◮ Collect users that retweet seed users’ tweets.
Filtering Users by Location
◮ We want to limit our analysis to the U.S.
Evaluating Data Quality
◮ Against Web directories.
◮ Precision = 0.98 and 0.93 for Wefollow and Twellow, respectively.
◮ Manual inspection: “greatest environmentalist. Also, despise republicans”
Detecting Political Hashtags
◮ Look into co-occurrence with seed political hashtags (#p2, #tcot, #gop, #ows) and (‘obama’, ‘romney’, ‘politic’, ‘liberal’, ‘conservative’, ‘democ’, or ‘republic’).
◮ Volume filtering to avoid rare hashtags.
Computing Trending Score
◮ Trending: currently popular, i.e. having a higher volume than expected.
$$\mathrm{trend}(h, w) := \frac{f(h, w) \,/\, \sum_{h' \in H} f(h', w)}{\sum_{u \le w} f(h, u) \,/\, \sum_{u \le w} \sum_{h' \in H} f(h', u)}$$
Examples:
◮ #obamagotosama: 01 May 2011 to 08 May 2011.
◮ #ows: 25 Sep. 2011 to 2 Oct. 2011.
◮ Non-trending hashtags: #vote, #democracy.
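The trending score compares a hashtag's share of the volume in week w with its share of the volume over all weeks up to and including w. A minimal sketch, assuming the volumes f(h, w) are stored as one `Counter` per week in a dict keyed by week number (a hypothetical layout, with made-up counts):

```python
from collections import Counter


def trend(f, h, w):
    """Ratio of the current share of hashtag h to its share up to week w.

    `f` maps a week number to a Counter of hashtag volumes.
    """
    week_total = sum(f[w].values())
    past_weeks = [u for u in f if u <= w]
    hist_h = sum(f[u][h] for u in past_weeks)
    hist_total = sum(sum(f[u].values()) for u in past_weeks)
    if week_total == 0 or hist_h == 0:
        return 0.0
    current_share = f[w][h] / week_total
    historical_share = hist_h / hist_total
    return current_share / historical_share


f = {
    1: Counter({"#ows": 1, "#vote": 9}),
    2: Counter({"#ows": 1, "#vote": 9}),
    3: Counter({"#ows": 8, "#vote": 12}),
}
print(trend(f, "#ows", 3), trend(f, "#vote", 3))
```

A score above 1 means the hashtag takes a larger share of the volume than its history would suggest. In the toy data #ows scores 1.6 in week 3 while #vote scores 0.8.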
Assigning a Leaning to Hashtags
Using a voting approach:
$$\mathrm{Lean}(h, w) := \frac{\mathrm{Vote}(h, w)_L}{\sum_{l \in L} \mathrm{Vote}(h, w)_l}$$
◮ $\mathrm{Vote}(h, w)_l = f(h, w)_l$
◮ Normalization: $\mathrm{Vote}(h, w)_l = \frac{f(h, w)_l}{\sum_{h' \in H} f(h', w)_l}$
◮ Smoothing: the normalized vote with an additive smoothing term involving the constant 40000 (c.f. Laplace smoothing).
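The raw and normalized voting schemes can be sketched as follows; the smoothed variant is left out. The data layout (per week, one `Counter` of hashtag volumes for each leaning), the leaning names, and the counts are assumptions made for illustration.

```python
from collections import Counter


def vote(f, h, w, leaning, normalize=True):
    """Vote of one leaning for hashtag h in week w.

    `f[w][leaning]` is a Counter of hashtag volumes among users of that
    leaning. With normalize=True the volume is divided by that leaning's
    total volume in week w.
    """
    counts = f[w][leaning]
    if not normalize:
        return counts[h]
    total = sum(counts.values())
    return counts[h] / total if total else 0.0


def lean(f, h, w, leaning, normalize=True):
    """Share of the votes for h in week w that come from `leaning`."""
    votes = {l: vote(f, h, w, l, normalize) for l in f[w]}
    total = sum(votes.values())
    return votes[leaning] / total if total else 0.0


f = {
    1: {
        "left": Counter({"#p2": 30, "#vote": 70}),
        "right": Counter({"#p2": 5, "#tcot": 150, "#vote": 45}),
    }
}
print(lean(f, "#p2", 1, "left", normalize=False))
print(lean(f, "#p2", 1, "left", normalize=True))
```

Normalization compensates for one side simply being more active overall: with raw counts #p2 leans left with 30/35, while the normalized votes (0.3 versus 0.025) give 0.3/0.325.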