there is something beyond the twitter network Karol Węgrzycki 2016-07-11 1
modeling information diffusion 2
Application in: • sociology • critical analysis • social policy • political science • market analysis and marketing • recommender systems • routing algorithms 3
problem with rumour distribution [Figure 1: Real distribution of tweets. Log-log plot of probability against cascade size.] 4
[Figure 2: Predicted distribution. Log-log plot of probability against cascade size.] 5
goodness of fit The goodness of fit of a statistical model describes how well it fits a set of observations. Abundance of choice: • Kolmogorov–Smirnov test • Cramér–von Mises criterion • Anderson–Darling test • Shapiro–Wilk test • Chi-squared test • Akaike information criterion • Hosmer–Lemeshow test 6
ks-test 7
$\sup_x |X(x) - Y(x)|$ 8
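The statistic above can be computed directly from two samples. The following is a minimal sketch, assuming X and Y are the empirical distribution functions of two samples (for instance real and simulated cascade sizes); the function name `ks_statistic` is hypothetical and not taken from the talk.

```python
from bisect import bisect_right


def ks_statistic(sample_x, sample_y):
    """Two-sample K-S statistic: sup over x of |X(x) - Y(x)|.

    X and Y are taken to be the empirical CDFs of the two samples
    (an assumption; the slide does not say how they are built).
    """
    xs = sorted(sample_x)
    ys = sorted(sample_y)
    best = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    for point in set(xs) | set(ys):
        cdf_x = bisect_right(xs, point) / len(xs)
        cdf_y = bisect_right(ys, point) / len(ys)
        best = max(best, abs(cdf_x - cdf_y))
    return best
```

A value of 0 means the two empirical distributions coincide; a value of 1 means the samples do not overlap at all.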
other test Judging “how well” a line fits the distribution on a power-law plot is wrong! • Lots of distributions give you straight-ish lines on a log-log plot. • Abusing linear regression makes Gauss cry. • Use maximum likelihood to estimate the scaling exponent. • Use the KS test to estimate where the scaling region begins. 9
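The maximum-likelihood advice can be illustrated with the standard closed-form estimator for a continuous power law, 1 + n / Σ ln(x_i / x_min). This is only a sketch: treating cascade sizes as continuous is an assumption, and the function name and the `x_min` argument are hypothetical.

```python
import math


def powerlaw_mle_exponent(values, x_min):
    """Maximum-likelihood estimate of the scaling exponent.

    Uses the continuous power-law estimator on the tail x >= x_min.
    Cascade sizes are integers, so this is an approximation.
    """
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum
```

Repeating the fit for several candidate `x_min` values and keeping the one with the smallest K-S distance between data and fitted tail is one way to follow the last bullet.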
data and simulation technique We received 5 GB of tweets from the University of Rome: 500 million tweets, a 10% sample, from May 2013. The retweet graph has 71 million vertices and 230 million edges. And we decided to share them! (We anonymized the data, so it does not violate the Twitter policy.) 10
cgm - cascade generation model According to Leskovec et al. 2007: 1. Uniformly at random pick a starting point of the cascade and add it to the set of newly informed nodes. 2. Every newly informed node, for each of its direct neighbors, makes a separate decision to inform the neighbor with probability α. 3. Let newly informed be the set of nodes that have been informed for the first time in step 2 and add them to the generated cascade. 4. Add all newly informed nodes to the generated cascade. 5. Repeat steps 2 to 4 until the newly informed set is empty. In the CGM regime all nodes have identical impact. The final graph is called a cascade. 11
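The steps above can be sketched as a short simulation. This is an illustrative version, not the code used in the talk: the graph is assumed to be a dictionary mapping each node to a list of its neighbors, the function name is hypothetical, and only the set of informed nodes is returned rather than the cascade graph itself.

```python
import random


def simulate_cgm(neighbors, alpha, rng=None):
    """Run one CGM cascade and return the set of informed nodes.

    neighbors: dict mapping node -> list of neighbor nodes (assumed format).
    alpha: probability that an informed node informs a given neighbor.
    """
    rng = rng or random.Random()
    start = rng.choice(list(neighbors))       # step 1
    cascade = {start}
    newly_informed = {start}
    while newly_informed:                     # step 5
        next_wave = set()
        for node in newly_informed:           # step 2
            for nb in neighbors[node]:
                if nb in cascade or nb in next_wave:
                    continue
                if rng.random() < alpha:      # separate decision per neighbor
                    next_wave.add(nb)
        cascade |= next_wave                  # steps 3 and 4
        newly_informed = next_wave
    return cascade
```

Running this many times and collecting `len(simulate_cgm(graph, alpha))` gives a simulated cascade-size distribution that can be compared with the real one.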
cgm learning [Figure: K-S test statistic plotted against alpha.] 12
cgm results [Figure: log-log plot of probability against cascade size; series: model α, real.] 13
exponential model What about rumour aging? The probability that the rumour is passed on should decay over time. 1. In the first round, each neighbor of the initial vertex is informed and then, with probability α, becomes a spreader. 2. During round k, each previously uninformed neighbor of the new spreaders from round k − 1 is informed and subsequently, with probability α^k, becomes a spreader. 14
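The two rules can be sketched as follows, reading the decaying probability as α raised to the round number k. The adjacency-dictionary graph format, the function name and the optional `start` argument are assumptions made for illustration.

```python
import random


def simulate_exponential(neighbors, alpha, start=None, rng=None):
    """Run one cascade of the exponential (rumour aging) model.

    Every uninformed neighbor of a spreader gets informed; in round k
    it becomes a spreader itself with probability alpha ** k.
    """
    rng = rng or random.Random()
    if start is None:
        start = rng.choice(list(neighbors))
    informed = {start}
    spreaders = {start}
    k = 1
    while spreaders:
        new_spreaders = set()
        for spreader in spreaders:
            for nb in neighbors[spreader]:
                if nb in informed:
                    continue
                informed.add(nb)                   # always informed
                if rng.random() < alpha ** k:      # spreads further with decaying probability
                    new_spreaders.add(nb)
        spreaders = new_spreaders
        k += 1
    return informed
```

Unlike in CGM, with α = 0 the cascade still contains the start node and all of its neighbors, because being informed and becoming a spreader are separate events.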
maybe information appears randomly in the network • The real structure of social interaction is unknown • Can the information appear randomly in the network? 15
multi source model The number of spreaders that learn the information from a different source can be modeled by the binomial distribution: X ∼ B(n, p). By the law of rare events, this can be approximated by the Poisson distribution: X ∼ Pois(np). 16
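A quick numerical check shows how close the two distributions are when n is large and p is small. The values of `n` and `p` below are purely illustrative assumptions, not parameters from the talk.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Illustrative values only: many nodes, each a new source with tiny probability.
n, p = 10_000, 0.0003
max_gap = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, n * p)) for k in range(20)
)
print(f"largest pointwise difference for k < 20: {max_gap:.2e}")
```

The gap shrinks further as p decreases with np held fixed, which is the regime the law of rare events describes.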
compound poisson process This is essentially a compound Poisson process! $X_0 + Y(t) = X_0 + \sum_{i=1}^{N(t)} X_i = \sum_{i=0}^{N(t)} X_i$ And we can implement it efficiently! 17
algorithm We can model the information diffusion as follows: 1. Randomly choose the first node that will be informed. 2. Propagate the information using the α^k model from the previous section. 3. As long as there are newly informed nodes, in each round randomly choose X ∼ Pois(λ) new source nodes and propagate information from those nodes using the α^k model. With algorithmic and statistical tricks, this algorithm can be simulated in essentially the same time as CGM! 18
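A straightforward, unoptimized sketch of these three steps is given below; it does not reproduce the algorithmic and statistical tricks mentioned on the slide. Several details are assumptions: new sources are drawn uniformly from all nodes, every source starts its own round counter at 1, the graph is an adjacency dictionary, and the Poisson sampler uses Knuth's multiplication method. All names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Pois(lam) with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_multi_source(neighbors, alpha, lam, rng=None):
    """Run one multi-source cascade and return the set of informed nodes."""
    rng = rng or random.Random()
    nodes = list(neighbors)
    start = rng.choice(nodes)                      # step 1
    informed = {start}
    spreaders = {start: 1}                         # node -> round k of its own cascade
    grew = True
    while grew:                                    # step 3: stop when nothing new is informed
        before = len(informed)
        next_spreaders = {}
        for node, k in spreaders.items():          # step 2: alpha^k propagation
            for nb in neighbors[node]:
                if nb in informed:
                    continue
                informed.add(nb)
                if rng.random() < alpha ** k:
                    next_spreaders[nb] = k + 1
        for _ in range(sample_poisson(lam, rng)):  # new random sources this round
            source = rng.choice(nodes)
            if source not in informed:
                informed.add(source)
                next_spreaders[source] = 1
        spreaders = next_spreaders
        grew = len(informed) > before
    return informed
```

With `lam = 0` no extra sources appear and the sketch reduces to the single-source α^k model.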
parameters learning [Figure: K-S test statistic as a function of alpha and lambda.] 19
comparison with real distribution [Figure: log-log plot of probability against cascade size; series: multi-source, real.] 20
further improvements • Geographically close nodes might be informed through an unknown social network. Close nodes should be informed with higher probability than distant ones. • The probability of randomly informing a node may decrease over time because the information may become obsolete. • The social network structure evolves over time. 21
all data and code is available online! (social-networks.mimuw.edu.pl) 22
future work • Propose a better model of information flow • Propose a better metric for comparing data • Give a better statistical framework for information modeling 23