
CS224W: Social and Information Network Analysis. Jure Leskovec, Stanford University, http://cs224w.stanford.edu, 12/3/2013


  1. CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu


  3. Imagine you want to track the flow of information. We would like to identify cascades like this: [Figure labels: obscure tech story; small tech blog; Engadget, Wired, Slashdot; BBC, NYT, CNN]

  4. [SDM '07] Tracking hyperlinks on the blogosphere. Identify cascades: graphs induced by a time-ordered propagation of hyperlinks. [Figure labels: blogs, blog posts, time-ordered hyperlinks, information cascade]
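
The cascade definition above can be sketched in Python. This is a minimal illustration, not the authors' pipeline: it assumes hyperlinks arrive as (citing_post, cited_post) pairs with a timestamp per post, and it simplifies a cascade to a weakly connected component of the time-ordered link graph. The function name `extract_cascades` and the input format are hypothetical.

```python
from collections import defaultdict

def extract_cascades(links, timestamps):
    """Group posts into cascades.

    links: iterable of (citing_post, cited_post) pairs (assumed format).
    timestamps: dict post -> publication time (any comparable value).
    Only links that point back in time are kept; cascades are the weakly
    connected components of the remaining link graph.
    """
    neighbors = defaultdict(set)
    for citing, cited in links:
        if timestamps[cited] < timestamps[citing]:  # keep time-ordered links only
            neighbors[citing].add(cited)
            neighbors[cited].add(citing)

    seen, cascades = set(), []
    for start in sorted(neighbors, key=lambda p: timestamps[p]):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first traversal of one component
            post = stack.pop()
            component.append(post)
            for other in neighbors[post]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        # Report each cascade with its posts in time order.
        cascades.append(sorted(component, key=lambda p: timestamps[p]))
    return cascades
```

Each returned cascade lists its posts in publication order, so the first element is the earliest post in that component.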

  5. [SDM '07] Cascade shapes (ranked by frequency). The probability of observing a cascade on n nodes follows $p(n) \sim n^{-2}$. [Plot: count vs. cascade size x (number of nodes)]
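
To illustrate how an exponent such as the -2 above might be estimated from observed cascade sizes, here is a hedged sketch using the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)). The slides do not say how the exponent was fitted, so this is a generic technique and not the authors' method; it also treats integer cascade sizes as continuous values, which is an approximation. The function names and the synthetic sampler are assumptions for illustration.

```python
import math
import random

def powerlaw_mle(sizes, x_min=1.0):
    """Continuous maximum-likelihood estimate of alpha in p(x) ~ x^(-alpha)."""
    tail = [x for x in sizes if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one observation above x_min")
    return 1.0 + len(tail) / log_sum

def sample_powerlaw(alpha, n, x_min=1.0, seed=0):
    """Draw n synthetic sizes by inverse-transform sampling (illustration only)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the power below is always defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

if __name__ == "__main__":
    synthetic = sample_powerlaw(alpha=2.0, n=20000)
    print("estimated exponent:", powerlaw_mle(synthetic))
```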

  6. [SDM '07] Most cascades are trees: the number of edges is smaller than the number of nodes in a cascade. The effective diameter increases logarithmically with cascade size. [Plots: number of edges vs. cascade size; effective diameter vs. cascade size (number of nodes)]

  7. Cascade sizes follow a heavy-tailed distribution. Viral marketing: books show a steep drop-off (power-law exponent -5); DVDs show larger cascades (exponent -1.5). Blogs: power-law exponent -2. What's a good model? What role does the underlying social network play? Can we make a step towards a more realistic cascade generation (propagation) model?

  8. 1) Randomly pick a blog to infect and add it to the cascade. 2) Infect each in-linked neighbor with probability β. 3) Add the infected neighbors to the cascade. 4) Set the node infected in step 1 to uninfected. [Figure: the four steps illustrated on an example graph of blogs B1 to B4]
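
The four steps can be sketched as a small simulation. This is a minimal sketch under stated assumptions: the blog graph is a dict mapping each blog to the blogs that link to it, steps 2 to 4 are repeated for every newly infected blog until none remains infected (the slide shows a single round, so the repetition is an assumption), and the `max_edges` cap is a safeguard added here so the loop always terminates. The name `generate_cascade` and its parameters are hypothetical.

```python
import random

def generate_cascade(in_neighbors, beta, rng=None, start=None, max_edges=10_000):
    """Simulate one cascade on a blog graph.

    in_neighbors: dict blog -> list of blogs that link to it (assumed format).
    beta: probability that an infected blog infects each in-linked neighbor.
    Returns (nodes, edges); each edge is (infected_blog, infecting_blog).
    """
    if rng is None:
        rng = random.Random()
    if start is None:
        start = rng.choice(sorted(in_neighbors))   # step 1: random starting blog
    nodes, edges = [start], []
    infected = [start]
    while infected and len(edges) < max_edges:
        blog = infected.pop()                      # step 4: blog becomes uninfected
        for neighbor in in_neighbors.get(blog, []):
            if rng.random() < beta:                # step 2: infect with probability beta
                edges.append((neighbor, blog))     # step 3: add to the cascade
                if neighbor not in nodes:
                    nodes.append(neighbor)
                infected.append(neighbor)
    return nodes, edges

if __name__ == "__main__":
    graph = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
    print(generate_cascade(graph, beta=0.5, rng=random.Random(1)))
```

Because a blog returns to the uninfected state after spreading, it can be infected again later, which is why edges are recorded separately from the node list.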

  9. The generative model produces realistic cascades (β = 0.025). [Plots of count vs.: cascade node in-degree; cascade size; size of star cascade; size of chain cascade. Also shown: most frequent cascades]

  10. Advantages: an unambiguous, precise, and explicit way to trace information flow; we obtain both the times and the trace (graph) of information flow. Caveats: not all links transmit information (navigational links, templates, ads); many links are missing (mainstream media sites do not create links; bloggers "forget" to link the source). (We will later see how to identify networks/cascades based only on the times at which sites mentioned information.) [Figure labels: obscure tech story; small tech blog; Engadget, Slashdot, Wired; BBC, NYT, CNN]

  11. [KDD '09] Extract textual fragments that travel relatively unchanged through many articles: look for phrases inside quotes ("…"). There are about 1.25 quotes per document in our data. Why does it work? Quotes are integral parts of journalistic practice, tend to follow iterations of a story as it evolves, and are attributed to individuals and have a time and location.
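
Looking for phrases inside quotes can be sketched with a regular expression. This is an illustrative sketch only: the lowercasing, whitespace normalization, and the `min_words` threshold are assumptions made here, not details given on the slide, and `extract_quotes` is a hypothetical name.

```python
import re

# Matches text inside straight ("...") or curly (“...”) double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"|“([^”]+)”')

def extract_quotes(text, min_words=4):
    """Return quoted phrases with at least min_words words, lowercased."""
    phrases = []
    for straight, curly in QUOTE_RE.findall(text):
        # Exactly one of the two groups is non-empty for each match.
        phrase = " ".join((straight or curly).lower().split())
        if len(phrase.split()) >= min_words:
            phrases.append(phrase)
    return phrases

if __name__ == "__main__":
    article = 'He said "the fundamentals of our economy are strong" on Monday.'
    print(extract_quotes(article))
```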

  12. [KDD '09] Quote: "Our opponent is someone who sees America, it seems, as being so imperfect, imperfect enough that he's palling around with terrorists who would target their own country."

  13. [KDD '09] Goal: find mutational variants of a phrase. Form an approximate phrase inclusion graph: a shorter phrase is approximately included in a longer one (word edit distance = 1). Objective: in the DAG of approximate phrase inclusion, delete edges of minimum total weight such that each connected component has a single "sink". [Figure: example phrases BDXCY, ABCDEFGH, BCD, ABCD]
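
One way to test approximate inclusion is word-level approximate substring matching: compute the smallest edit distance between the shorter phrase and any contiguous run of words in the longer one. The sketch below uses the classic dynamic program with a free start and a free end in the longer phrase. Whether this matches the authors' exact inclusion test is an assumption, and the function names are hypothetical.

```python
def inclusion_distance(short, long):
    """Minimum word-level edit distance between `short` and any contiguous
    run of words in `long` (approximate substring matching)."""
    s, t = short.split(), long.split()
    # prev[j]: best cost of matching the first i words of s against a run
    # of t that ends at position j. An empty pattern matches anywhere for free.
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)      # i words of s deleted before t starts
        for j in range(1, len(t) + 1):
            substitute = prev[j - 1] + (s[i - 1] != t[j - 1])
            curr[j] = min(substitute, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)                   # the match may end anywhere in t

def approximately_included(short, long, max_distance=1):
    """True if `short` is no longer than `long` and within max_distance edits."""
    return (len(short.split()) <= len(long.split())
            and inclusion_distance(short, long) <= max_distance)

if __name__ == "__main__":
    # Letters stand for words, as in the slide's figure.
    print(approximately_included("b c d", "a b c d"))
    print(approximately_included("b c d", "b d x c y"))
```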

  14. Nodes are phrases. [Figure nodes: BDXCY, BCD, ABCDEFGH, ABCD, ABC, ABCEFG, ABCEF, CEFP, CEF, CEFPQR, UVCEXF]

  15. Nodes are phrases; edges are inclusion relations. [Figure: same phrase nodes, now connected by edges]

  16. Nodes are phrases; edges are inclusion relations; edges have weights. [Figure: same phrase nodes, with weighted edges]

  17. Objective: in a directed acyclic graph (approximate phrase inclusion), delete edges of minimum total weight such that each connected component has a single "sink" node. [Figure nodes: BDXCYZ, BCD, ABCDEFGH, ABCD, ABC, ABCEFG, ABCEF, CEFP, CEF, CEFPQR, UVCEXF]

  18. [KDD '09] DAG partitioning is NP-hard, but heuristics are effective. Observation: it is enough to know each node's parent to reconstruct the optimal solution. Heuristic: proceed right-to-left and assign each node to the strongest cluster (keeping a single edge). [Figure: nodes are phrases, edges are inclusion relations, edges have weights; nodes BDXCY, BCD, ABCDEFGH, ABCD, ABC, ABCEFG, ABXCE, CEFP, CEF, CEFPQR, UVCEXF]
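
A greedy heuristic of this kind can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes edges point from the shorter phrase to the longer phrase that includes it, treats nodes without outgoing edges as sinks, assigns all of a node's out-neighbors before the node itself, and interprets "strongest cluster" as the cluster receiving the largest total edge weight from the node. The name `greedy_dag_partition` and the input format are hypothetical.

```python
from collections import defaultdict

def greedy_dag_partition(edges):
    """Greedy heuristic: assign each phrase to one sink's cluster.

    edges: dict (shorter_phrase, longer_phrase) -> weight, forming a DAG.
    Returns dict phrase -> sink phrase of its cluster.
    """
    out_edges = defaultdict(dict)
    nodes = set()
    for (short, long), weight in edges.items():
        out_edges[short][long] = weight
        nodes.update((short, long))

    cluster = {}

    def assign(node):
        if node in cluster:
            return cluster[node]
        targets = out_edges.get(node)
        if not targets:                      # no outgoing edge: node is a sink
            cluster[node] = node
            return node
        # Total edge weight from this node into each candidate cluster.
        strength = defaultdict(float)
        for target, weight in targets.items():
            strength[assign(target)] += weight
        # Ties are broken alphabetically so the result is deterministic.
        cluster[node] = max(sorted(strength), key=strength.get)
        return cluster[node]

    for node in sorted(nodes):
        assign(node)
    return cluster
```

Keeping only the edges that stay inside a node's chosen cluster is equivalent to deleting all of its other outgoing edges, which leaves every connected component with a single sink.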

  19. Quoted text | Volume
      the fundamentals of our economy are strong | 3654
      the fundamentals of the economy are strong | 988
      fundamentals of our economy are strong | 645
      fundamentals of the economy are strong | 557
      if john mccain hadn't said that the fundamentals of our economy are strong on the day of one of our nation's worst financial crises the claim that he invented the blackberry would have been the most preposterous thing said all week | 224
      fundamentals of the economy | 172
      the fundamentals of the economy are sound | 119
      i promise you we will never put america in this position again we will clean up wall street | 83
      the fundamentals of our economy are sound | 81
      clean up wall street | 78
      our economy i think still the fundamentals of our economy are strong | 75
      fundamentals of the economy are sound | 72
      the fundamentals of our economy are strong but these are very very difficult times and i promise you we will never put america in this position again | 68
      the economy is in crisis | 66
      these are very very difficult times | 63
      the fundamentals of our economy are strong but these are very very difficult times | 62
      do you still think the fundamentals of our economy are strong genius | 62
      our economy i think still the fundamentals of our economy are strong but these are very very difficult times | 60
      mccain's first response to this crisis was to say that the fundamentals of our economy are strong then he admitted it was a crisis and then he proposed a commission which is just washington-speak for i'll get back to you later | 55
      i still believe the fundamentals of our economy are strong | 53
      i think still the fundamentals of our economy are strong | 50
      cut taxes for 95 percent of all working families | 50
      today of all days john mccain's stubborn insistence that the fundamentals of the economy are strong shows that he is … |

  20. Since 2008 we have been collecting nearly all blog posts and news articles: 6 billion documents, 20 TB of data. Solution: graph stream clustering. Phrases arrive in a stream; simultaneously cluster the graph and attach phrases to the graph; dynamically remove completed clusters.

  21. Can we extract any interesting temporal variations? … is periodic, has no trends. The "bandwidth" of the online media is constant.

  22. Volume over time of the 50 memes (phrase clusters) with the largest total volume. More at: http://snap.stanford.edu/nifty

  23. Media coverage of the current economic crisis. Main proponents of the debate: [figure annotations: speech in Congress; Dept. of Labor release; 60 Minutes interview]. The top Republican voice ranks only 14th.
