Outtwitting the Twitterers: Predicting Information Cascades in Microblogs - PowerPoint PPT Presentation


  1. Outtwitting the Twitterers: Predicting Information Cascades in Microblogs
     - Wojciech Galuba, Karl Aberer (EPFL, Switzerland)
     - Dipanjan Chakraborty (IBM Research India)
     - Zoran Despotovic, Wolfgang Kellerer (Docomo Euro-Labs, Munich, Germany)

  2. Why study information flows in OSNs?
     - Flows to model: casual link sharing, breaking news, activism, viral marketing, emergencies, PR campaigns
     - Modeling them leads to:
       - improving how information flows
       - new applications
       - insights into the underlying sociology

  3. Information overload?
     - Full-time job (reading tweets 40 h a week at 150 WPM)
     - Median: 23 tw/h, 552 tw/day (Sep 2009 data)

  4. OSN information spread modeling
     - Related work:
       - generative models
       - reproduce statistical properties of info spread
       - predict coarse-grained aggregates (# of nodes reached by spread, etc.)
     - Our approach:
       - Look at URL diffusion on Twitter
       - Can we predict which user will mention which URL with what probability?

  5. Why predict URL tweets?
     - Protect from information overload: sort incoming URLs by probability of retweeting
     - Viral marketing: select a subset of users that ensures successful URL propagation
     - Spam detection: mispredictions are a sign of anomalous activity
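
The overload-protection use case on this slide amounts to ranking a user's incoming URLs by predicted retweet probability. A minimal sketch, assuming the per-URL probabilities are already available as a dictionary; the function name and the example URLs and values are hypothetical, not taken from the talk:

```python
def rank_urls(predicted):
    """Return URLs sorted by predicted mention probability, highest first."""
    return sorted(predicted, key=predicted.get, reverse=True)


# Hypothetical predicted probabilities for one user's incoming URLs.
predicted = {
    "http://a.example/": 0.02,
    "http://b.example/": 0.40,
    "http://c.example/": 0.11,
}
ranking = rank_urls(predicted)
print(ranking)  # ['http://b.example/', 'http://c.example/', 'http://a.example/']
```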

  7. Data
     - 300-hour window in Sep '09
     - 22M tweets
     - 2.7M unique users
     - 15M unique URLs
     - 700M connections in the follower graph
     - Approx. 1/15th of the Twitter traffic

  8. Follower graph (active users only: those that have sent at least one URL in the 300 h window)

  9. Follower graph (active users only: those that have sent at least one URL in the 300 h window). Mean (directed): 3.61

  10. User activity

  11. Per-URL activity

  12. Information cascades
      - Nodes: users that mentioned a given URL
      - Arcs: information flow

  13. Re-tweeting

  14. RT-cascade
      - Example tweets: "@alice: http://url.com", "@bob: RT @alice http://url.com", "@charlie: http://url.com"
      - Arcs: who retweets whom
      - Irrespective of whether users follow one another
      - Single parent: only the user name immediately after "RT" is taken into account
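
The single-parent rule on this slide can be sketched as a small parser: for every tweet of a URL, take only the user name that immediately follows "RT" and draw an arc from that user to the tweet's author. This is an illustrative sketch rather than the authors' code; the regular expression, the function name, and the nested-retweet text for `charlie` are assumptions.

```python
import re

# Captures only the user name immediately after the first "RT" (single-parent rule).
RT_PATTERN = re.compile(r"\bRT\s+@(\w+)")


def rt_cascade_arcs(tweets):
    """Return (parent, child) arcs of the RT-cascade for one URL.

    `tweets` is a list of (user, text) pairs that all mention the same URL.
    Whether the child follows the parent is deliberately not checked.
    """
    arcs = []
    for user, text in tweets:
        match = RT_PATTERN.search(text)
        if match:
            arcs.append((match.group(1), user))
    return arcs


tweets = [
    ("alice", "http://url.com"),
    ("bob", "RT @alice http://url.com"),
    ("charlie", "RT @bob RT @alice http://url.com"),  # hypothetical nested retweet
]
arcs = rt_cascade_arcs(tweets)
print(arcs)  # [('alice', 'bob'), ('bob', 'charlie')]
```

Because each tweet contributes at most one parent, the resulting structure has at most one incoming arc per user.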

  15. F-cascade
      - Example tweets: "@bob: http://url.com", "@alice: http://url.com", "@charlie: http://url.com"
      - Arc @a → @b exists if:
        - user @a mentioned the URL before user @b
        - user @b follows user @a
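
The two arc conditions on this slide translate directly into code. The sketch below assumes the mentions of one URL are given as (timestamp, user) pairs and the follower graph as a mapping from each user to the set of users they follow; both data layouts and the function name are illustrative assumptions.

```python
def f_cascade_arcs(mentions, follows):
    """Return arcs (a, b) where a mentioned the URL before b and b follows a.

    `mentions`: list of (timestamp, user) pairs for a single URL.
    `follows`: dict mapping a user to the set of users they follow.
    """
    # Keep only each user's earliest mention of the URL.
    first_mention = {}
    for timestamp, user in sorted(mentions):
        first_mention.setdefault(user, timestamp)

    arcs = []
    for b, t_b in first_mention.items():
        for a in follows.get(b, set()):
            if a in first_mention and first_mention[a] < t_b:
                arcs.append((a, b))
    return sorted(arcs)


mentions = [(1, "alice"), (2, "bob"), (3, "charlie")]
follows = {
    "alice": {"charlie"},
    "bob": {"alice"},
    "charlie": {"alice", "bob"},
}
arcs = f_cascade_arcs(mentions, follows)
print(arcs)  # [('alice', 'bob'), ('alice', 'charlie'), ('bob', 'charlie')]
```

In this toy input `charlie` ends up with two parents, which is why an F-cascade is in general a DAG rather than a tree.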

  16. RT-cascades vs. F-cascades
      - RT-cascades are trees
      - F-cascades are DAGs
      - 33% of the retweets credit a source that the user does not directly follow

  17. Cascade and subcascade

  18. Subcascade size

  19. Cascade fragmentation

  20. Cascade depth

  21. Influence of the root

  22. Information diffusion rate. Median: 50 min

  23. URL tweeting prediction
      - Based on the past URL retweets by users, predict the future ones
      - Find the probability that user $i$ mentions URL $u$: $p_i^u$

  24. Influence $\alpha_{ij}$

  25. External influence $\beta_i$

  26. URL virality $\gamma_u$ (example URL: http://cnn.com/)

  27. Per-user diffusion delay $\mu_i, \sigma_i^2$
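
The summary slide states that the diffusion delay is log-normally distributed, so one plausible reading is that $\mu_i$ and $\sigma_i^2$ are the mean and variance of the logarithm of user $i$'s delay. Under that assumption, the probability that the user has reacted within time $t$ is the log-normal CDF, sketched below. How the authors actually build their temporal component from these parameters is not shown on the slides; the function name and the example values are hypothetical.

```python
import math


def lognormal_cdf(t, mu, sigma):
    """P(delay <= t) for a log-normal delay with log-mean mu and log-std sigma."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


# Hypothetical user whose median delay is 50 minutes, i.e. mu = ln(50).
mu_i, sigma_i = math.log(50.0), 1.0
print(lognormal_cdf(50.0, mu_i, sigma_i))  # 0.5 at the median
```

Note that the function takes the standard deviation $\sigma_i$, not the variance $\sigma_i^2$.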

  28. Model: $\alpha_{ij}$, $\beta_i$, $\mu_i, \sigma_i^2$, $\gamma_u$ (example URL: http://cnn.com/)

  29. At-Least-One (ALO) model: $p_i^u = P(\text{at least one event happens}) \times \text{temporal component}$, with influence terms $\alpha_{ij} \gamma_u p_j^u$, external influence $\beta_i$, and temporal parameters $\mu_i, \sigma_i^2$
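
The slide gives only the shape of the ALO model: the probability that at least one event happens, times a temporal component. One common way to realise "at least one" is to treat the events as independent, giving $1 - (1 - \beta_i) \prod_j (1 - \alpha_{ij} \gamma_u p_j^u)$. The sketch below implements that reading. The independence assumption, the placement of $\beta_i$, and the function signature are assumptions, not the authors' stated formula.

```python
def alo_probability(influences, beta_i, temporal=1.0):
    """At-least-one combination of independent events, times a temporal factor.

    `influences`: per-neighbour terms alpha_ij * gamma_u * p_j^u, each in [0, 1].
    `beta_i`: probability of the external-influence event.
    `temporal`: value of the temporal component, assumed to lie in [0, 1].
    """
    none_happens = 1.0 - beta_i
    for term in influences:
        none_happens *= 1.0 - term
    return (1.0 - none_happens) * temporal


# Two neighbours, each contributing 0.5, and no external influence.
print(alo_probability([0.5, 0.5], beta_i=0.0))  # 0.75
```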

  30. Linear threshold (LT) model: $p_i^u = s(\cdot) \times \text{temporal component}$, where $s$ is a thresholding function (sigmoid), with inputs $\alpha_{ij} \gamma_u p_j^u$, $\beta_i$, and $\gamma_u$, and temporal parameters $\mu_i, \sigma_i^2$
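
For comparison with the ALO model, a linear-threshold variant sums the influence terms and passes the total through a sigmoid before applying the temporal component. The sketch below is one possible reading of the slide: adding $\beta_i$ to the sum, and the `slope` and `threshold` parameters, are assumptions introduced for illustration and do not appear in the talk.

```python
import math


def lt_probability(influences, beta_i, temporal=1.0, slope=1.0, threshold=0.0):
    """Sigmoid of the summed influences, times a temporal factor.

    `influences`: per-neighbour terms alpha_ij * gamma_u * p_j^u.
    `beta_i`: external influence, added to the sum (an assumption).
    `slope`, `threshold`: hypothetical shape parameters of the sigmoid.
    """
    total = sum(influences) + beta_i
    return temporal / (1.0 + math.exp(-slope * (total - threshold)))


print(lt_probability([], beta_i=0.0))  # 0.5: the sigmoid midpoint
```

Unlike the at-least-one combination, the output here grows with the sum of the inputs rather than with the product of their complements.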

  31. Performance metrics
      - Recall: fraction of tweets predicted out of all tweets that happened
      - Precision: fraction of true positives out of all tweets predicted
      - F-score: harmonic mean of recall and precision
      - F-score is the optimization goal
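
The three metrics can be computed from two sets of (user, URL) pairs: the tweets the model predicted and the tweets that actually happened. A minimal sketch; the function name and the example pairs are hypothetical.

```python
def precision_recall_f(predicted, actual):
    """Return (precision, recall, F-score) for predicted vs. actual tweets."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical (user, URL) pairs.
predicted = {("bob", "u1"), ("carol", "u1"), ("dave", "u2")}
actual = {("bob", "u1"), ("dave", "u2"), ("erin", "u2"), ("frank", "u1")}
precision, recall, f_score = precision_recall_f(predicted, actual)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

Here two of the three predictions are correct (precision 2/3) and two of the four actual tweets are found (recall 1/2), which gives an F-score of 4/7.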

  32. Learning
      - Input: a time window of tweets
      - Computation: gradient ascent method
      - Parameter space: $\alpha_{ji}$, $\beta_i$, $\gamma_u$, $\mu_i$, $\sigma_i^2$
      - Goal: maximize F-score
      - Output: $p_i^u$

  33. Lineup
      - LT: Linear Threshold model
      - LTr: Linear Threshold model with $\alpha_{ij}$ instead of $\alpha_{ji}$
      - ALO: At-Least-One model
      - RND: baseline, makes random guesses about $p_i^u$

  34. Training data: first 150 h; test data: next 150 h; results for 100 random URLs

  35. Summary
      - Log-normal degree distribution
      - Small-world: 3.6 hops from user to user
      - Power laws in the user activity and URL mentions
      - Cascades are shallow: exponential depth falloff
      - Log-normally distributed diffusion delay
      - The LT model:
        - predicts more than half of the URL tweets
        - with less than 15% false positive rate

  36. Ongoing work
      - Investigating mispredictions
        - URLs
        - users
      - Scaling up the real-time data mining
        - continuous MapReduce
        - crawler farm
      - Website: personalized URL rankings for Twitter users
      - Apply to other systems
