Estimating Clustering Coefficients and Size of Social Networks via Random Walk
Stephen J. Hardiman* (Capital Fund Management, France) and Liran Katzir (Advanced Technology Labs, Microsoft Research, Israel)
*Research was conducted while the author was unaffiliated.
Motivation: Social Networks
Qzone, Habbo, Netlog, Sonico.com, Bebo, Google+, Renren, Twitter, Flixster, Facebook, MyLife, Classmates.com, Tagged, Friendster, hi5, Sina Weibo, Orkut, Plaxo, LinkedIn, Vkontakte
Motivation: External access
[Figure: the online social network (example graph on nodes v1 to v9) and an external Social Analytics party. Labels: Privacy, Disk Space, Communication.]
Task: Estimate parameters
• Global Clustering Coefficient
• Network Average CC
• Number of Registered Users
Uses: predicting social products' potential; business development, advertisement, market size.
Global Clustering Coefficient
$\text{Global CC} = \dfrac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$
[Figure: example graph on nodes v1 to v9 with one triangle and one connected triplet highlighted.]
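The definition above can be checked by brute force on a small graph. The sketch below is a minimal illustration, assuming the graph is stored as a dictionary of neighbor sets; the toy graph is hypothetical and is not the graph drawn on the slide.

```python
from itertools import combinations

def global_cc(adj):
    """Exact global CC: 3 * triangles / connected triplets."""
    closed = 0    # triplets whose two end nodes are adjacent (equals 3 * triangles)
    triplets = 0  # connected triplets, counted once per center node
    for center, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            triplets += 1
            if b in adj[a]:
                closed += 1
    return closed / triplets if triplets else 0.0

# Hypothetical toy graph: triangle 1-2-3 plus a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cc = global_cc(adj)
print(cc)  # one triangle and five connected triplets give 3/5
```

This reads the whole graph, which is exactly what external access rules out; it is only useful as ground truth for small tests.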
Global Clustering Coefficient
Exact: [Alon et al, 1997]
Estimation – input is read at least once:
• Random Access: [Avron, 2010]
• Streaming Model: [Buriol et al, 2006]
Estimation – sampling:
• Random Access: [Schank et al, 2005]
• External Access: This work.
Local Clustering Coefficient
$C_i = \dfrac{\#\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$, where $d_i$ is the degree of node $v_i$.
Example (graph on nodes v1 to v9): $d_1 = 1$, $d_2 = 3$, $d_9 = 2$, and $C_2 = 1/3$.
Network Average CC = average local CC.
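A minimal sketch of the local coefficient and its network average, again on a hypothetical toy graph stored as neighbor sets. Treating nodes of degree below 2 as having $C_i = 0$ is an assumption made here; the slide does not say how such nodes are handled.

```python
from itertools import combinations

def local_cc(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    d = len(adj[v])
    if d < 2:
        return 0.0  # assumed convention for degree 0 or 1
    links = sum(1 for a, b in combinations(sorted(adj[v]), 2) if b in adj[a])
    return links / (d * (d - 1) / 2)

def average_cc(adj):
    """Network average CC: the mean of the local coefficients."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)

# Hypothetical toy graph: triangle 1-2-3 plus a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(local_cc(adj, 3))  # degree 3, one of three neighbor pairs connected: 1/3
print(average_cc(adj))   # (1 + 1 + 1/3 + 0) / 4 = 7/12
```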
Network Average CC
Exact: Naïve.
Estimation – input is read at least once:
• Streaming Model: [Becchetti et al, 2010]
Estimation – sampling:
• Random Access: [Schank et al, 2005]
• External Access: [Ribeiro et al 2010], [Gjoka et al, 2010], This work – Improved accuracy.
Number of Registered Users
Exact: trivial.
Estimation – sampling:
• External Access: [Hardiman et al 2009], [Katzir et al, 2011], This work – Improved accuracy.
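Random Walk
Sampled nodes: v1, v2, v3, v4, v5.
Stationary distribution: $\pi_i = d_i / D$ with $D = \sum_{j} d_j$. In the example graph (nodes v1 to v9) $D = 22$, so node $v_i$ is sampled with probability $d_i/22$; the values shown in the figure are 1/22, 2/22, 3/22 and 4/22.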
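The claim that a simple random walk visits node $v_i$ with probability proportional to $d_i$ can be checked empirically. The following is a sketch under stated assumptions: the toy graph is hypothetical, connected and non-bipartite, and `random_walk` is an illustrative helper name, not something from the talk.

```python
import random
from collections import Counter

def random_walk(adj, start, steps, rng):
    """Return `steps` nodes visited by a simple random walk starting at `start`."""
    neighbors = {v: sorted(ns) for v, ns in adj.items()}  # fixed order for reproducibility
    walk = [start]
    for _ in range(steps - 1):
        walk.append(rng.choice(neighbors[walk[-1]]))
    return walk

# Hypothetical toy graph: triangle 1-2-3 plus a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
total_degree = sum(len(ns) for ns in adj.values())  # D = 8

walk = random_walk(adj, start=1, steps=200_000, rng=random.Random(0))
visits = Counter(walk)
for v in sorted(adj):
    empirical = visits[v] / len(walk)
    stationary = len(adj[v]) / total_degree
    print(v, round(empirical, 3), round(stationary, 3))
```

The walk only ever asks for the neighbor list of the current node, which is the external-access model used throughout the talk.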
Random Walk - Summary
[Figure: the example graph on nodes v1 to v9, with legend: Sampled Nodes, Visible Nodes, Invisible Nodes, Visible Edges, Invisible Edges.]
Global CC Algorithm
The estimated global clustering coefficient: $\hat{c} = \Phi / \Psi$.
1. $\Psi$: the average of (degree − 1) over the sampled nodes.
2. $\Phi$: the average of $\phi_k d_k$ over the sampled nodes, where $\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$ (that is, iff $v_{k-1}, v_k, v_{k+1}$ is a triangle) and $\phi_k = 0$ otherwise.
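Global CC Example
Five sampled nodes with $d_k - 1 = 0, 2, 1, 3, 1$ and $\phi_2 = 0$, $\phi_3 = 1$, $\phi_4 = 0$:
$\Phi = \tfrac{1}{3}(0 + 2 + 0) = \tfrac{2}{3}$, $\Psi = \tfrac{1}{5}(0 + 2 + 1 + 3 + 1) = \tfrac{7}{5}$
$\hat{c} = \tfrac{2}{3} \cdot \tfrac{5}{7} = \tfrac{10}{21} \approx 0.476$, compared with the exact value $c = \tfrac{9}{23} \approx 0.39$.
[Figure: example graph on nodes v1 to v7.]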
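The estimator and the arithmetic of the example can be sketched as follows. The graph here is hypothetical: its edges were chosen only so that the walk 1, 2, 3, 4, 5 has degrees 1, 3, 2, 4, 2 and an edge between the second and fourth sampled nodes, matching the numbers in the example. It is not claimed to be the slide's graph, so its exact coefficient differs from 9/23. Averaging $\Phi$ over the $r - 2$ interior samples and $\Psi$ over all $r$ samples follows the 1/3 and 1/5 factors in the example.

```python
from fractions import Fraction

def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def estimate_global_cc(adj, walk):
    """Random-walk estimate Phi / Psi of the global clustering coefficient."""
    r = len(walk)
    deg = [len(adj[x]) for x in walk]
    psi = Fraction(sum(d - 1 for d in deg), r)
    # phi_k = 1 when the previous and next sampled nodes are adjacent.
    weighted = sum(deg[k] for k in range(1, r - 1) if walk[k + 1] in adj[walk[k - 1]])
    phi = Fraction(weighted, r - 2)
    return phi / psi

# Hypothetical graph reproducing the degrees used in the example.
adj = build_adj([(1, 2), (2, 3), (3, 4), (4, 5), (2, 4), (4, 6), (5, 7)])
estimate = estimate_global_cc(adj, [1, 2, 3, 4, 5])
print(estimate, float(estimate))  # 10/21, about 0.476
```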
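Expectation of $\phi_k$
$E[\phi_k d_k] = \sum_{i=1}^{n} \frac{d_i}{D}\, E[\phi_k d_k \mid x_k = v_i]$ (total expectation)
$= \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{2 l_i}{d_i^2} \cdot d_i$ (of the $d_i^2$ combinations of $(x_{k-1}, x_{k+1})$, $2 l_i$ yield $\phi_k = 1$)
$= \frac{1}{D} \sum_{i=1}^{n} 2 l_i$
Notation: $d_i$ is the degree of node $v_i$; $D = \sum_{i=1}^{n} d_i$; $l_i$ is the number of triangles containing $v_i$; $n$ is the number of nodes.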
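Global CC Proof
$E[\Phi] = E[\phi_k d_k] = \frac{2}{D} \sum_{i=1}^{n} l_i$ and $E[\Psi] = \frac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1)$.
$c = \dfrac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = \dfrac{E[\Phi]}{E[\Psi]} \cong \dfrac{\Phi}{\Psi} = \hat{c}$, where the approximation follows from concentration bounds on $\Phi$ and $\Psi$.
Notation: $d_i$ is the degree of node $v_i$; $D = \sum_{i=1}^{n} d_i$; $l_i$ is the number of triangles containing $v_i$; $n$ is the number of nodes.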
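Guarantees
For any $\epsilon \le 1/8$ and $\delta \le 1$, we have $\Pr\left[(1 - \epsilon)\, c \le \hat{c} \le (1 + \epsilon)\, c\right] \ge 1 - \delta$ when the number of samples, $r$, is at least a threshold of order $O(\text{mixing time}(\epsilon))$.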
Network Average CC Algorithm
The estimated network average CC: $\hat{c}_l = \Phi_l / \Psi_l$.
1. $\Psi_l$: the average of 1/degree over the sampled nodes.
2. $\Phi_l$: the average of $\phi_k \cdot \frac{1}{d_k - 1}$ over the sampled nodes, where $\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$ and $\phi_k = 0$ otherwise.
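A minimal sketch of this estimator, checked against the exact average on a hypothetical toy graph whose network average CC is 7/12. The walk length, seed, and the choice to average $\Phi_l$ over the $r - 2$ interior samples are assumptions for illustration.

```python
import random

def estimate_average_cc(adj, walk):
    """Random-walk estimate Phi_l / Psi_l of the network average CC."""
    r = len(walk)
    deg = [len(adj[x]) for x in walk]
    psi = sum(1 / d for d in deg) / r
    # When the previous and next samples are adjacent they are two distinct
    # neighbors of the current node, so deg[k] >= 2 and the division is safe.
    phi = sum(
        1 / (deg[k] - 1)
        for k in range(1, r - 1)
        if walk[k + 1] in adj[walk[k - 1]]
    ) / (r - 2)
    return phi / psi

# Hypothetical toy graph: triangle 1-2-3 plus a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
neighbors = {v: sorted(ns) for v, ns in adj.items()}

rng = random.Random(1)
walk = [1]
for _ in range(199_999):
    walk.append(rng.choice(neighbors[walk[-1]]))

estimate = estimate_average_cc(adj, walk)
print(estimate, 7 / 12)  # the estimate should land close to the exact value
```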
Evaluations
Network         n (size)     D/n      c_l      c_g
DBLP              977,987     8.457   0.7231   0.1868
Orkut           3,072,448    76.28    0.1704   0.0413
Flickr          2,173,370    20.92    0.3616   0.1076
Live Journal    4,843,953    17.69    0.3508   0.1179
DBLP facts: the paper with the most co-authors has 119 listed authors; the most prolific author is Vincent Poor, with 798 entries.
Global CC
[Figure: DBLP network. Relative estimation value against percentage of mined nodes (0 to 2) for Gjoka et al*, Ribeiro et al*, and this work.]
Relative improvement ranges between 300% and 500% depending on the network.
Network Average CC
[Figure: Orkut network. Relative estimation value against percentage of mined nodes (0 to 2) for Ribeiro et al, Gjoka et al, and random walk.]
Relative improvement ranges between 50% and 400% depending on the network.
Conclusions
1. New external access estimator for the Global Clustering Coefficient.
2. Improved estimator for the Network Average Clustering Coefficient.
3. Improved estimator for the number of registered users.
Estimating Sizes of Social Networks via Biased Sampling
Liran Katzir, Edo Liberty, and Oren Somekh (Yahoo! Labs, Haifa, Israel)
The Birthday “Paradox”
The expected number of collisions in a list of $r$ i.i.d. uniform samples from a set of $n$ elements is $\frac{r(r-1)}{2n}$.
A collision is a pair of identical samples.
Example: Samples: X = (d, b, b, a, b, e). Total 3 collisions: $(x_2, x_3)$, $(x_2, x_5)$, and $(x_3, x_5)$.
Cardinality estimation uniform
When $C$ collisions are observed, $n \cong \frac{r(r-1)}{2C}$.
Needs $r = O(\sqrt{n})$ samples to converge.
Used by [Ye et al, 2010] to estimate the size.
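The collision-based estimate for uniform samples can be sketched in a few lines. The function names are illustrative, and the simulated population size and sample count are arbitrary choices made here, not values from the talk. The first call reuses the six samples from the birthday example.

```python
import random
from collections import Counter

def count_collisions(samples):
    """Number of unordered pairs of identical samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())

def estimate_size_uniform(samples):
    """Birthday-paradox estimate n ~ r (r - 1) / (2 C) for uniform samples."""
    r = len(samples)
    collisions = count_collisions(samples)
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    return r * (r - 1) / (2 * collisions)

slide_samples = ["d", "b", "b", "a", "b", "e"]
slide_collisions = count_collisions(slide_samples)     # the three pairs of b's
slide_estimate = estimate_size_uniform(slide_samples)  # 6 * 5 / (2 * 3)
print(slide_collisions, slide_estimate)

# Simulated check with an assumed population of 10,000 elements.
rng = random.Random(0)
n = 10_000
simulated = estimate_size_uniform([rng.randrange(n) for _ in range(2_000)])
print(simulated)
```

With only six samples the estimate is very noisy; the simulated run uses enough samples to see roughly two hundred collisions, which is what makes the ratio stable.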
Stationary distribution sampling
Sampled nodes: v5, v2, v5, v4, v2.
Stationary distribution: $\pi_i = d_i / D$ with $D = \sum_{j} d_j$. In the example graph (nodes v1 to v9) $D = 22$, so node $v_i$ is sampled with probability $d_i/22$.
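Cardinality estimation stationary
When $C$ collisions are observed, $n \cong \dfrac{\left(\sum_{k} d_{x_k}\right)\left(\sum_{k} \frac{1}{d_{x_k}}\right)}{2C}$.
Needs $r = O(\sqrt[4]{n} \log n)$ samples to converge when $d_i \sim \mathrm{Zipf}(n, 2)$.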
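Example: samples v5, v2, v5, v4, v2, which contain $C = 2$ collisions.
$\sum d_x = 2 + 3 + 2 + 4 + 3 = 14$, $\sum \frac{1}{d_x} = \frac{1}{2} + \frac{1}{3} + \frac{1}{2} + \frac{1}{4} + \frac{1}{3} = \frac{23}{12}$
$\hat{n} = \dfrac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7$
[Figure: example graph on nodes v1 to v9.]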
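The example can be reproduced with exact rational arithmetic. This is a sketch: the function name is illustrative, and only the degrees of the three sampled nodes (2 for v5, 3 for v2, 4 for v4) are taken from the example.

```python
from collections import Counter
from fractions import Fraction

def estimate_size_stationary(samples, degree):
    """Size estimate (sum of d) * (sum of 1/d) / (2 C) for degree-biased samples."""
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    sum_deg = sum(degree[x] for x in samples)
    sum_inv_deg = sum(Fraction(1, degree[x]) for x in samples)
    return sum_deg * sum_inv_deg / (2 * collisions)

samples = ["v5", "v2", "v5", "v4", "v2"]
degree = {"v2": 3, "v4": 4, "v5": 2}
n_hat = estimate_size_stationary(samples, degree)
print(n_hat, float(n_hat))  # 161/24, about 6.7
```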
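Size Estimation Proof
For one sample $x$ from the stationary distribution: $E[d_x] = \sum_{i=1}^{n} d_i \frac{d_i}{D}$ and $E\left[\frac{1}{d_x}\right] = \sum_{i=1}^{n} \frac{1}{d_i} \frac{d_i}{D} = \frac{n}{D}$.
For one pair of independent samples, the collision probability is $\sum_{i=1}^{n} \frac{d_i}{D} \frac{d_i}{D}$.
Summing over $r$ samples and $r(r-1)/2$ pairs: $\dfrac{E\left[\sum_k d_{x_k}\right] E\left[\sum_k \frac{1}{d_{x_k}}\right]}{2 E[C]} = \dfrac{r}{r-1}\, n \approx n$.
By concentration bounds, $n \cong \dfrac{\left(\sum_k d_{x_k}\right)\left(\sum_k \frac{1}{d_{x_k}}\right)}{2C} = \hat{n}$.
Notation: $d_i$ is the degree of node $v_i$; $D = \sum_{i=1}^{n} d_i$; $n$ is the number of nodes.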
Improvements 1. Using all samples (Hardiman et al 2009). 2. Using Conditional Monte Carlo (This work).
All Samples
Restrict computation to pairs of indexes at least $m$ steps apart: $I = \{(k, l) : |k - l| \ge m\}$.
A collision is only considered within $I$: $\Phi = \sum_{(k,l) \in I} 1_{\{x_k = x_l\}}$.
The ratio of degrees is similarly defined: $\Psi = \sum_{(k,l) \in I} \dfrac{d_{x_k}}{d_{x_l}}$.
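A sketch of the two sums over the index set $I$. Taking the size estimate to be $\Psi / \Phi$ is an assumption made here: the slide defines $\Phi$ and $\Psi$ but does not show how they are combined, and this ratio is chosen because the expectations of the two summands have ratio $n$ for independent stationary samples. Both sums run over ordered pairs, so the factor of 2 cancels. The quadratic loop is for clarity, not speed.

```python
from fractions import Fraction

def estimate_size_all_pairs(walk, degree, m):
    """Psi / Phi over ordered index pairs (k, l) with |k - l| >= m (assumed combination)."""
    r = len(walk)
    phi = 0            # collisions within I
    psi = Fraction(0)  # sum of degree ratios within I
    for k in range(r):
        for l in range(r):
            if abs(k - l) >= m:
                phi += walk[k] == walk[l]
                psi += Fraction(degree[walk[k]], degree[walk[l]])
    if phi == 0:
        raise ValueError("no collisions within the index set; use a longer walk")
    return psi / phi

walk = ["v5", "v2", "v5", "v4", "v2"]
degree = {"v2": 3, "v4": 4, "v5": 2}
n_hat = estimate_size_all_pairs(walk, degree, m=1)
print(n_hat, float(n_hat))  # 131/24, about 5.46
```

With `m=1` every pair of distinct indexes is used, which gives $\Psi = (\sum d)(\sum 1/d) - r$ and so a slightly smaller value than the 6.7 obtained from the product formula on the same five samples.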
Conditional Monte Carlo
A collision between $x_k$ and $x_l$ is replaced by the conditional expectation of a collision at steps $k+1$ and $l+1$ respectively:
$E\left[1_{\{x_{k+1} = x_{l+1}\}} \mid x_k, x_l\right] = \dfrac{\text{Common Neighbors}(x_k, x_l)}{d_{x_k}\, d_{x_l}}$
Conditional Monte Carlo
• The pair $(v_4, v_7)$ is not a collision, but it contributes $\frac{1}{12}$ to the collision counter.
[Figure: example graph on nodes v1 to v9.]
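The 1/12 contribution can be reproduced with a short sketch. The edge list is hypothetical: it only encodes that v4 has degree 4, v7 has degree 3, and the two share one neighbor, which is what the value 1/12 implies; the remaining edges of the slide's graph are not known here. The brute-force function enumerates every pair of next steps to confirm the closed form.

```python
from fractions import Fraction

def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def conditional_collision(adj, a, b):
    """Probability that independent uniform steps from a and b land on the same node."""
    common = len(adj[a] & adj[b])
    return Fraction(common, len(adj[a]) * len(adj[b]))

def conditional_collision_brute(adj, a, b):
    """Same quantity by enumerating every pair of next steps."""
    hits = sum(1 for u in adj[a] for w in adj[b] if u == w)
    return Fraction(hits, len(adj[a]) * len(adj[b]))

# Hypothetical edges around v4 and v7 (one shared neighbor, v5).
adj = build_adj([
    ("v4", "v2"), ("v4", "v3"), ("v4", "v6"), ("v4", "v5"),
    ("v7", "v5"), ("v7", "v9"), ("v7", "v8"),
])
contribution = conditional_collision(adj, "v4", "v7")
print(contribution)  # 1/12
```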
Size Estimation
[Figure: DBLP network. Relative estimation value against percentage of mined nodes (0.5 to 2.5) for prior art and this work.]
Thanks