KDD 2020 Research Track
Estimating Properties of Social Networks via Random Walk considering Private Nodes
Kazuki Nakajima, Kazuyuki Shudo
Tokyo Institute of Technology

Graph Sampling on Social Networks
Purpose: Understand global social structure
Challenge: Accurate analysis of graph properties
• Most researchers have only limited access to the graph data.
• Crawling-based sampling is effective.
  • e.g., breadth-first search, random walk, …
  • Sampling introduces bias into the estimates.
[Figure: a crawler traverses nodes (= users) of the social graph]

Re-weighted Random Walk [Gjoka et al., 2010]
An effective scheme for obtaining unbiased estimators:
1. Sample nodes via a random walk.
2. Derive the sampling bias based on Markov chain analysis.
3. Re-weight each sample to correct the sampling bias.
Unbiased estimators are available for several graph properties:
• size (number of nodes), average degree, degree distribution, …
[Figure: degree distribution, true vs. estimate; after re-weighting the estimate is unbiased]

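The three steps above can be sketched for one property, the average degree. This is a minimal illustration, not the authors' code: it assumes a graph with no private nodes, stored as a hypothetical adjacency dict `adj`, and uses the standard fact that a simple random walk on a connected, non-bipartite undirected graph samples each node with probability proportional to its degree. The function names are made up for this sketch.

```python
import random

def random_walk(adj, seed, steps, rng):
    """Step 1: sample nodes by repeatedly moving to a uniformly random neighbor."""
    node = seed
    samples = []
    for _ in range(steps):
        node = rng.choice(adj[node])
        samples.append(node)
    return samples

def estimate_average_degree(adj, samples):
    """Steps 2-3: a node is sampled with probability proportional to its degree,
    so each sample gets weight 1/degree. The weighted mean of the degrees then
    reduces to (number of samples) / (sum of 1/degree)."""
    return len(samples) / sum(1.0 / len(adj[v]) for v in samples)

def naive_average_degree(adj, samples):
    """Unweighted mean of sampled degrees, biased toward high-degree nodes."""
    return sum(len(adj[v]) for v in samples) / len(samples)
```

On a small graph with a hub node, the naive mean overshoots the true average degree while the re-weighted estimate approaches it as the walk grows longer.
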
Private Node Problem
• Previous algorithms have ignored private nodes.
  • Public nodes publish their neighbors; private nodes hide their neighbors.
  • e.g., Facebook, Twitter, Pokec, …
  • In practice, 20-30% of nodes were private [Catanese et al., 2011], [Takac et al., 2012].
[Figure: example graph with nodes 1-10, a mix of public and private nodes]
• What problems arise?
1. Private nodes inhibit a simple random walk.
  • Requires a random walk that accounts for private nodes
  • Requires samples that preserve the Markov property
2. Private nodes cause estimation errors.
  • Conventional weighting corrects only the sampling bias.

Our Study: Addressing the Private Node Problem
Contributions: our study makes it possible to
1. successfully run re-weighted random walk algorithms on real social networks that include private nodes
  • Discuss how to select the neighbor to transition to
  • Derive the sampling bias of each node
  • Describe how to calculate weights that correct the sampling bias
2. accurately estimate the size and average degree of the whole social graph, including private nodes
  • Propose weighting methods that reduce not only sampling bias but also estimation errors caused by private nodes
  • Show theoretically that estimates obtained with the proposed weighting have smaller expected errors than those obtained with the previous weighting

Preliminaries (1/2)
Social graph: $G = (V, E)$
• Each node has a privacy label: public or private.
  • Public node: provides its neighbor data
  • Private node: does not provide its neighbor data
• Public-cluster $C$
  • a connected subgraph consisting of public nodes
• Public-degree $d_i^*$
  • the number of public neighbors of node $v_i$
[Figure: example graph with nodes 1-10 and public-clusters $C_1$, $C_2$, $C_3$; one node is annotated with degree 2 and public-degree 1]

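The two definitions on this slide can be made concrete with a short sketch. It assumes the graph is a hypothetical adjacency dict `adj` and the privacy labels a dict `is_public`; both names are illustrative. Here a public-cluster is taken to be a maximal connected set of public nodes, which is one reading of the slide's definition.

```python
from collections import deque

def public_degree(adj, is_public, v):
    """Number of public neighbors of node v."""
    return sum(1 for u in adj[v] if is_public[u])

def public_clusters(adj, is_public):
    """Maximal connected sets of public nodes, found by BFS that never
    steps onto a private node."""
    seen = set()
    clusters = []
    for start in adj:
        if not is_public[start] or start in seen:
            continue
        cluster = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if is_public[u] and u not in seen:
                    seen.add(u)
                    cluster.add(u)
                    queue.append(u)
        clusters.append(cluster)
    return clusters

def largest_public_cluster(adj, is_public):
    """The largest public-cluster (LPC), or an empty set if no node is public."""
    return max(public_clusters(adj, is_public), key=len, default=set())
```

On a path 0-1-2-3-4 where only node 2 is private, this yields the two public-clusters {0, 1} and {3, 4}, and node 1 has degree 2 but public-degree 1.
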
Preliminaries (2/2)
Three assumptions
1. Indices of all neighbors of a queried public node are obtained.
2. Each node independently becomes private with probability $p$, and public otherwise.
3. The seed of a random walk is on the largest public-cluster (LPC).
Two access models (e.g., when public node 0 is queried)
1. Ideal model (e.g., [Gjoka et al., 2011])
  • Obtain neighbor indices and privacy labels.
2. Hidden privacy model (e.g., Twitter API)
  • Obtain only neighbor indices.

Random Walk Sampling (1/2)
Random walk considering private nodes
• Simple random walk: move to a randomly chosen neighbor
  • Cannot simply continue once a private node is reached
  • Difficult to correct the sampling bias of sampled private nodes
Randomly traversing public neighbors [Gjoka et al., 2011]
1. Randomly select a neighbor.
2. Move to it if it is public; otherwise, randomly select again.
• Sampling bias
  • Each node is sampled with probability proportional to its public-degree.

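The select-and-retry rule above can be sketched as follows. This is a minimal illustration under the ideal model, where privacy labels are visible; `adj` and `is_public` are hypothetical dicts, and the guard against a node with no public neighbor is an addition of this sketch rather than part of the slide.

```python
import random

def public_random_walk(adj, is_public, seed, steps, rng):
    """Random walk that only ever moves to public neighbors.

    At each step a neighbor is drawn uniformly at random; a private draw is
    rejected and the draw is repeated from the same node.
    """
    node = seed
    samples = []
    for _ in range(steps):
        # Guard (assumption of this sketch): avoid retrying forever.
        if not any(is_public[u] for u in adj[node]):
            raise ValueError(f"node {node} has no public neighbor")
        while True:
            candidate = rng.choice(adj[node])
            if is_public[candidate]:
                break
        node = candidate
        samples.append(node)
    return samples
```

Because rejected draws are simply repeated, each move is uniform over the public neighbors of the current node, which is why the sampling probability ends up proportional to public-degree.
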
Random Walk Sampling (2/2)
Calculate public-degree $d_i^*$ to correct sampling bias
1. Ideal model → exact calculation
2. Hidden privacy model → proposed approximation
• Record two values via the designed random walk:
  • $a_i$ = total number of successful public-neighbor selections
  • $b_i$ = total number of neighbor selections
1. When a private neighbor is selected: $b_i \leftarrow b_i + 1$
2. When a public neighbor is selected: $a_i \leftarrow a_i + 1$, $b_i \leftarrow b_i + 1$
Theoretical result
The approximated value $\frac{a_i}{b_i} \times d_i$ converges to the true value $d_i^*$.

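The counter bookkeeping can be sketched as below. This is an illustration only: `adj` and `is_public` are hypothetical dicts, and the `is_public` lookup stands in for whatever signal tells the crawler that a selected neighbor turned out to be private under the hidden privacy model. The sketch also assumes every visited node has at least one public neighbor, so the retry loop terminates.

```python
import random
from collections import defaultdict

def walk_with_counters(adj, is_public, seed, steps, rng):
    """Public-neighbor random walk that records, per node, the number of
    successful selections (a) and the total number of selections (b)."""
    a = defaultdict(int)
    b = defaultdict(int)
    node = seed
    for _ in range(steps):
        while True:
            candidate = rng.choice(adj[node])
            b[node] += 1                 # every selection counts
            if is_public[candidate]:     # stand-in for "selection succeeded"
                a[node] += 1             # only successful selections count
                break
        node = candidate
    return a, b

def approx_public_degree(adj, a, b, v):
    """Approximate public-degree of v: success ratio times degree."""
    return a[v] / b[v] * len(adj[v])
```

The ratio of successes to selections at a node estimates the fraction of its neighbors that are public, so multiplying by the degree recovers the public-degree as the node is visited more often.
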
Properties Estimation (1/2)
Problem of existing estimators
• Conventional weighting corrects only the sampling bias, using the public-degree.
• Estimates converge to properties of the largest public-cluster (LPC).
• Private nodes cause errors in these convergence values.
• Case of size estimation: derive the expectation over the set of privacy labels, $E_{pri}[n^*]$.
[Figure: example graphs with nodes 1-10, annotated $n^* = 5$ and $n^* = 8$]
Theoretical result
Under the condition that all public nodes belong to the LPC, $E_{pri}[n^*] = (1 - p)\,n$.

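A short worked derivation of this expectation, as a sketch. It assumes $n^*$ denotes the number of nodes in the LPC, which under the stated condition equals the number of public nodes, and it uses the assumption that each node is private independently with probability $p$.

```latex
n^* = \sum_{i=1}^{n} \mathbf{1}[v_i \text{ is public}]
\quad\Longrightarrow\quad
E_{pri}[n^*] = \sum_{i=1}^{n} \Pr[v_i \text{ is public}]
             = \sum_{i=1}^{n} (1 - p)
             = (1 - p)\,n
```

For instance, with $n = 10$ and $p = 0.2$ the expected convergence value is 8, an underestimate of the true size by the fraction $p$.
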
Properties Estimation (2/2)
Proposed estimators
Goal: Reduce errors of convergence values.
• The value of $p$ is unknown and difficult to estimate.
Idea: Weighting using public-degree $d_i^*$ and degree $d_i$
• $d_i^*$ follows a binomial distribution with parameters $1 - p$ and $d_i$.
• Modify the weight for each sample so that errors of convergence values are reduced.
Theoretical result
Under the condition that all public nodes belong to the LPC, $E_{pri}[\tilde{n}] \approx n$, where $\tilde{n}$ is the convergence value of the proposed estimator.
Generality:
• Our goal and idea apply to all random walk-based estimators for social networks.

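The binomial statement above has a direct consequence that may help explain why the pair of public-degree and degree is useful when $p$ itself is unknown. The following is only a consequence of the binomial assumption; the slide does not give the exact form of the proposed weights, so this should not be read as the authors' weighting.

```latex
d_i^* \sim \mathrm{Binomial}(d_i,\, 1 - p)
\quad\Longrightarrow\quad
E_{pri}[d_i^*] = (1 - p)\,d_i,
\qquad
E_{pri}\!\left[\frac{d_i^*}{d_i}\right] = 1 - p
```

In other words, the ratio of public-degree to degree observed at each sampled node carries per-sample information about $1 - p$ without requiring a separate global estimate of $p$.
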
Experiments
We conduct four experiments:
1. Estimation accuracy of size and average degree for various probabilities $p$
2. Performance on real-world datasets that include real private nodes
3. Effectiveness of the proposed public-degree calculation
4. Number of queries performed in seed selection

Experimental Setup
• Publicly available datasets of social graphs
Network       Size        Average degree   Privacy label setting
YouTube       1,134,890   5.27             independently with probability $p$
Pokec         1,632,803   27.32            real labels
Orkut         3,072,441   76.28            independently with probability $p$
Facebook      3,097,165   15.28            independently with probability $p$
LiveJournal   3,997,962   17.35            independently with probability $p$
• Accuracy measure: normalized root mean square error
  $\mathrm{NRMSE} = \frac{1}{x} \sqrt{\frac{1}{t} \sum_{i=1}^{t} (x_i - x)^2}$
  • Number of simulations: $t = 1000$
  • True value: $x$
  • Estimate: $x_i$

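The accuracy measure can be computed with a few lines of code. This sketch assumes the standard definition of NRMSE, the root mean square error over all simulation runs divided by the true value; the function and argument names are illustrative.

```python
import math

def nrmse(estimates, true_value):
    """Normalized root mean square error of a list of estimates.

    estimates  : one estimate per simulation run
    true_value : the true (non-zero) value of the property
    """
    t = len(estimates)
    mse = sum((x_i - true_value) ** 2 for x_i in estimates) / t
    return math.sqrt(mse) / true_value
```

For example, two runs that estimate 9 and 11 for a true value of 10 give an NRMSE of 0.1, that is, a typical error of 10% of the true value.
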
Experiment 1: Estimation accuracy for several probabilities $p$
NRMSEs of each estimator for several $p$ (1% sample); lower is better.
• Size
[Figure: NRMSE vs. $p$ (0.0 to 0.3) on YouTube, Orkut, Facebook, and LiveJournal; curves: NC (existing) and Proposed]
• Average degree
[Figure: NRMSE vs. $p$ (0.0 to 0.3) on YouTube, Orkut, Facebook, and LiveJournal; curves: Smooth (existing) and Proposed; annotation: 88.1%]

Discussion on results in Experiment 1
1. The improvement in estimation errors stems from the improvement in convergence errors.
• NRMSEs of converged size
[Figure: NRMSE vs. $p$ (0.0 to 0.3) on YouTube, Orkut, LiveJournal, and Facebook; curves: NC (existing) and Proposed; lower is better]
• NRMSEs of converged average degree
[Figure: NRMSE vs. $p$ (0.0 to 0.3) on YouTube, Orkut, LiveJournal, and Facebook; curves: Smooth (existing) and Proposed]

Discussion on results in Experiment 1
2. Estimation and convergence errors are affected by the relative size of the largest public-cluster (LPC).
• Relative size of LPC
[Figure: relative size of LPC vs. $p$ (0.0 to 0.3) for Orkut and YouTube]
• NRMSEs of converged size
[Figure: NRMSE vs. $p$ (0.0 to 0.3) on Orkut and YouTube; curves: NC (existing) and Proposed]
• On Orkut, almost all public nodes belong to the LPC. The experimental results support the theoretical claims.
• On YouTube, relatively many nodes do not belong to the LPC, and the NRMSEs are correspondingly higher.