Outtwitting the Twitterers – Predicting Information Cascades in Microblogs
Wojciech Galuba, Karl Aberer (EPFL, Switzerland)
Dipanjan Chakraborty (IBM Research India)
Zoran Despotovic, Wolfgang Kellerer (Docomo Euro-Labs, Munich, Germany)
Why study information flows in OSNs?
  Where information flows: casual link sharing, breaking news, activism, viral marketing, emergencies, PR campaigns
  What modeling offers: improve how information flows, new applications, insights into underlying sociology
Information overload?
  Full-time job (reading tweets 40 h a week at 150 WPM)
  Median: 23 tw/h, 552 tw/day (Sep 2009 data)
OSN information spread modeling
  Related work: generative models
    reproduce statistical properties of info spread
    predict coarse-grained aggregates (# of nodes reached by spread, etc.)
  Our approach:
    Look at URL diffusion on Twitter
    Can we predict which user will mention which URL with what probability?
Why predict URL tweets?
  Protect from information overload
    Sort incoming URLs by probability of retweeting
  Viral marketing
    Select a subset of users that ensure successful URL propagation
  Spam detection
    Mispredictions are a sign of anomalous activity
Data
  300-hour window in Sep 2009
  22M tweets
  2.7M unique users
  15M unique URLs
  700M connections in the follower graph
  Approx. 1/15th of the Twitter traffic
Follower graph*
  Mean distance between users (directed): 3.61 hops
* Active users only: users who sent at least one URL in the 300 h window
User activity
Per-URL activity
Information cascades
  Nodes: users that mentioned a given URL
  Arcs: information flow
Re-tweeting
RT-cascade
  @bob: RT @alice http://url.com
  @alice: http://url.com
  @charlie: http://url.com
  Arcs: who retweets whom
    Irrespective of whether users follow one another
  Single parent
    Only the user name immediately after "RT" is taken into account
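The RT-cascade rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' crawler code: the regular expression, the `rt_cascade` function and the sample tweets are all assumptions made for the example. It credits only the first user name that follows "RT", which is what makes each node have a single parent.

```python
import re

# Assumed pattern: the first "RT @name" in the text is the credited source.
RT_PATTERN = re.compile(r"\bRT\s+@(\w+)")

def rt_cascade(tweets):
    """Return {child: parent} arcs for one URL.

    `tweets` is a list of (user, text) pairs that all mention the same URL.
    Follower relationships are deliberately ignored.
    """
    parent_of = {}
    for user, text in tweets:
        match = RT_PATTERN.search(text)
        if match:
            # Only the name immediately after the first "RT" counts.
            parent_of[user] = match.group(1)
    return parent_of

# Hypothetical tweets for illustration.
tweets = [
    ("alice", "http://url.com"),
    ("bob", "RT @alice http://url.com"),
    ("charlie", "RT @bob RT @alice http://url.com"),
]
arcs = rt_cascade(tweets)
```

Because every user ends up with at most one parent, the resulting structure is a tree (or a forest, if several users posted the URL independently).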
F-cascade
  @bob: http://url.com
  @alice: http://url.com
  @charlie: http://url.com
  Arc @a -> @b exists if:
    user @a mentioned the URL before user @b
    user @b follows user @a
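The two F-cascade conditions translate directly into code. The sketch below is illustrative only: the function name `f_cascade`, the data layout (timestamped mentions and a set of follower pairs) and the sample data are assumptions, and each user is assumed to mention the URL once.

```python
def f_cascade(mentions, follows):
    """Build F-cascade arcs for a single URL.

    mentions: list of (timestamp, user), one entry per user (assumed).
    follows:  set of (follower, followee) pairs from the follower graph.
    Returns a set of (a, b) arcs meaning information may have flowed a -> b.
    """
    ordered = sorted(mentions)
    arcs = set()
    for idx, (t_a, a) in enumerate(ordered):
        for t_b, b in ordered[idx + 1:]:
            # Condition 1: a mentioned the URL strictly before b.
            # Condition 2: b follows a.
            if t_a < t_b and (b, a) in follows:
                arcs.add((a, b))
    return arcs

# Hypothetical data for illustration.
mentions = [(1, "alice"), (2, "bob"), (3, "charlie")]
follows = {("bob", "alice"), ("charlie", "alice"), ("charlie", "bob")}
arcs = f_cascade(mentions, follows)
```

Here @charlie receives arcs from both @alice and @bob, so a node can have several parents. Since arcs always point forward in time, the graph cannot contain cycles.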
RT-cascades vs. F-cascades
  RT-cascades are trees
  F-cascades are DAGs
  33% of the retweets credit a source that the user does not directly follow
Cascade and subcascade
Subcascade size
Cascade fragmentation
Cascade depth
Influence of the root
Information diffusion rate
  Median: 50 min
URL tweeting prediction
  Based on the past URL retweets by users, predict the future ones
  Find the probability $p_i^u$ that user $i$ mentions URL $u$
Influence: $\alpha_{ij}$
External influence: $\beta_i$
URL virality: $\gamma_u$ (example URL: http://cnn.com/)
Per-user diffusion delay: $\mu_i, \sigma_i^2$
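The slides do not spell out how the delay parameters enter the model. One plausible reading, given that the summary reports log-normally distributed diffusion delays, is that $\mu_i$ and $\sigma_i^2$ are the parameters of a per-user log-normal delay distribution. Under that assumption, the probability that user $i$ has reacted within time $t$ of seeing a URL is the log-normal CDF, sketched below. The function name and the numeric values are hypothetical. The 50-minute figure reuses the overall median from the diffusion-rate slide purely as an illustrative per-user value.

```python
import math

def lognormal_cdf(t, mu, sigma):
    """P(delay <= t) for a log-normal delay with log-mean mu and log-std sigma."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

# Hypothetical user: median delay of 50 minutes, unit spread on the log scale.
mu_i = math.log(50.0)
sigma_i = 1.0

within_50_min = lognormal_cdf(50.0, mu_i, sigma_i)    # half of reactions by the median
within_1_day = lognormal_cdf(1440.0, mu_i, sigma_i)   # nearly all within a day
```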
Model
  Parameters: $\alpha_{ij}$, $\beta_i$, $\gamma_u$, $\mu_i, \sigma_i^2$ (example URL: http://cnn.com/)
At-Least-One (ALO) model
  $p_i^u$ = P(at least one event happens) * temporal component
  Inputs to the event probability: $\alpha_{ij} \gamma_u p_j^u$ for each $j$, and $\beta_i \gamma_u$
  Inputs to the temporal component: $\mu_i, \sigma_i^2$
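As a rough illustration of the "at least one event happens" combination, the sketch below treats each input as the probability of an independent event and combines them as one minus the probability that none occurs. The independence assumption, the function names and the way the external term is appended to the neighbour terms are assumptions of this example, not details confirmed by the slides.

```python
def at_least_one(probabilities):
    """P(at least one of several independent events happens)."""
    none_happen = 1.0
    for p in probabilities:
        none_happen *= 1.0 - p
    return 1.0 - none_happen

def alo_probability(neighbor_terms, external_term, temporal):
    """ALO-style estimate of p_i^u.

    neighbor_terms: one value per influencing user j (e.g. alpha * gamma_u * p_j^u).
    external_term:  influence from outside the follower graph (e.g. beta_i * gamma_u).
    temporal:       temporal component in [0, 1].
    """
    return at_least_one(list(neighbor_terms) + [external_term]) * temporal

# Hypothetical values: two neighbours at 0.5 each, no external influence.
p = alo_probability([0.5, 0.5], 0.0, 1.0)
```

With two independent inputs of 0.5 each, the chance that neither fires is 0.25, so the combined probability is 0.75 before the temporal component is applied.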
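Linear threshold (LT) model
  $p_i^u$ = thresholding function (sigmoid) * temporal component
  Inputs to the thresholding function: $\alpha_{ij} \gamma_u p_j^u$ for each $j$, and $\beta_i \gamma_u$
  Inputs to the temporal component: $\mu_i, \sigma_i^2$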
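The linear threshold idea can be sketched as a sum of inputs passed through a sigmoid. This is a minimal sketch under stated assumptions: the slides name a sigmoid thresholding function but give neither its steepness nor its threshold, so the `steepness` and `threshold` defaults below are invented for illustration, as is the choice to add the external term to the same sum.

```python
import math

def sigmoid(x, steepness=10.0, threshold=0.5):
    """Smooth step centred on `threshold` (both parameters are assumed values)."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

def lt_probability(neighbor_terms, external_term, temporal):
    """LT-style estimate of p_i^u: sigmoid of summed influence times temporal part."""
    total_influence = sum(neighbor_terms) + external_term
    return sigmoid(total_influence) * temporal

# Hypothetical values for illustration.
at_threshold = lt_probability([0.25, 0.25], 0.0, 1.0)     # summed influence == threshold
well_above = lt_probability([0.5, 0.5], 0.25, 1.0)        # summed influence 1.25
```

Unlike the at-least-one combination, influences add linearly here, and the sigmoid decides how sharply the summed influence turns into a predicted tweet.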
Performance metrics
  Recall: fraction of tweets predicted out of all tweets that happened
  Precision: fraction of true positives out of all tweets predicted
  F-score: harmonic mean of recall and precision
  F-score is the optimization goal
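The three metrics can be computed from two sets of (user, URL) pairs: the tweets the model predicted and the tweets that actually happened. The function name and the sample pairs below are hypothetical.

```python
def precision_recall_f(predicted, actual):
    """Return (precision, recall, F-score) for sets of (user, url) pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical predictions and ground truth.
predicted = {("bob", "u1"), ("carol", "u1"), ("dave", "u2")}
actual = {("bob", "u1"), ("dave", "u2"), ("erin", "u2"), ("frank", "u3")}
precision, recall, f_score = precision_recall_f(predicted, actual)
```

In this example two of the three predictions are correct (precision 2/3) and two of the four real tweets are found (recall 1/2), giving an F-score of 4/7.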
Learning
  Input: a time window of tweets
  Computation: gradient ascent method
  Parameter space: $\alpha_{ji}, \beta_i, \gamma_u, \mu_i, \sigma_i^2$
  Goal: maximize F-score
  Output: $p_i^u$
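To make the learning loop concrete, here is a toy gradient ascent that maximizes an F-score. It is a sketch under heavy assumptions and not the procedure from the talk: it optimizes the predicted probabilities directly instead of the model parameters, it uses a "soft" F-score (expected true positives in place of counts) so that the objective is differentiable, and it estimates gradients by finite differences. All function names, step sizes and labels are hypothetical.

```python
def soft_f_score(probs, labels):
    """F-score with expected counts: 2*TP / (predicted positives + actual positives)."""
    true_pos = sum(p * y for p, y in zip(probs, labels))
    denom = sum(probs) + sum(labels)
    return 2.0 * true_pos / denom if denom > 0 else 0.0

def gradient_ascent(objective, params, step=0.1, iterations=200, eps=1e-6):
    """Maximize `objective` with forward-difference gradients, clipping to [0, 1]."""
    params = list(params)
    for _ in range(iterations):
        base = objective(params)
        gradient = []
        for k in range(len(params)):
            bumped = params[:]
            bumped[k] += eps
            gradient.append((objective(bumped) - base) / eps)
        # Move uphill, keeping every value a valid probability.
        params = [min(1.0, max(0.0, p + step * g)) for p, g in zip(params, gradient)]
    return params

# Hypothetical ground truth: did each (user, URL) pair actually tweet?
labels = [1, 0, 1]
start = [0.5, 0.5, 0.5]
learned = gradient_ascent(lambda p: soft_f_score(p, labels), start)
initial_f = soft_f_score(start, labels)
final_f = soft_f_score(learned, labels)
```

In a full implementation the probabilities would come from the ALO or LT model, and the ascent would update the influence, virality and delay parameters instead.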
Lineup
  LT: Linear Threshold model
  LTr: Linear Threshold model with $\alpha_{ij}$ instead of $\alpha_{ji}$
  ALO: At-Least-One model
  RND: baseline, makes random guesses about $p_i^u$
* Training data: first 150 h; test data: next 150 h; results for 100 random URLs
Summary
  Log-normal degree distribution
  Small-world: 3.6 hops from user to user
  Power laws in the user activity and URL mentions
  Cascades are shallow: exponential depth falloff
  Log-normally distributed diffusion delay
  The LT model:
    predicts more than half of the URL tweets
    with less than 15% false positive rate
Ongoing work
  Investigating mispredictions
    URLs
    users
  Scaling up the real-time data mining
    continuous MapReduce
    crawler farm
  Website: personalized URL rankings for Twitter users
  Apply to other systems