
Sampling Algorithms for Data Collection in Online Networks



  1. Sampling Algorithms for Data Collection in Online Networks
     - Carter T. Butts (1, 2), Minas Gjoka (3), Maciej Kurant (4), Athina Markopoulou (3)
     - (1) Department of Sociology, (2) Institute for Mathematical Behavioral Sciences, (3) Department of Electrical Engineering and Computer Science, University of California, Irvine
     - (4) EPFL, Lausanne
     - Prepared for the August 25, 2009 UCI MURI AHM. This work was supported by DOD ONR award N00014-8-1-1015.

  2. The Network Sampling Problem
     - Online networks of increasing interest, and obvious importance for our project
       - Dramatically enhanced data availability versus offline sources
       - Increasingly relevant to studies of behavior in developed-world context
     - The problem: online networks are harder to study than they appear
       - Many are true "population" networks without strong subgroup boundaries and with 10^6 to 10^8 nodes
       - Generally, no sampling frame; populations "hidden" from a survey point of view
     - Important frontier: principled sampling methods for online networks
     - Today, a quick look at some of our recent work in this area

  3. Extant Methods
     - Primary family of methods: link-trace sampling
       - Exploits network structure for sampling purposes
       - Basic idea: find nodes by following links from an initial seed set
       - Many, many variants
       - Some offline
     - Some examples
       - Breadth-first search (BFS)
         - Visit all nodes at distances 1, 2, ... from seed
       - Random Walk sampling (RW)
         - Choose random neighbor of a node to visit
         - Repeat above step for a "long" time
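The two link-trace examples on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the adjacency-dict representation, the function names `bfs_sample` and `rw_sample`, and the toy graph are all assumptions made for the example.

```python
import random
from collections import deque

def bfs_sample(adj, seed, n):
    """Collect up to n nodes in breadth-first order, starting from seed."""
    seen = {seed}
    order = [seed]
    queue = deque([seed])
    while queue and len(order) < n:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                order.append(w)
                queue.append(w)
                if len(order) == n:
                    break
    return order

def rw_sample(adj, seed, steps, rng=None):
    """Simple random walk: repeatedly move to a uniformly chosen neighbor."""
    rng = rng or random.Random()
    walk = [seed]
    for _ in range(steps):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

# Hypothetical toy graph: path 0-1-2-3 plus the extra edge 1-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
bfs_nodes = bfs_sample(adj, seed=0, n=3)
walk = rw_sample(adj, seed=0, steps=10, rng=random.Random(0))
```

BFS returns each node at most once, while the random walk may revisit nodes; the revisits are what give the walk a well-defined long-run visiting distribution.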

  4. Challenges to Effective Use
     - Lack of a known equilibrium distribution
       - BFS, most ad hoc methods badly biased (unless whole network is captured)
       - RW biased, but converges to a distribution proportional to d(v) in the undirected, connected case
       - Can observe d(v), and thus adjust post-hoc
       - Directed case harder: can derive in theory, but not easily measure
     - Unverified convergence
       - For methods with an equilibrium, need to verify convergence
       - These are really just MCMC methods; same issues apply
       - Methods do exist, but were not previously applied to this problem
       - One area of progress: application of MCMC diagnostics to network sampling procedures
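The post-hoc adjustment mentioned above can be illustrated with a weighted mean. A converged simple random walk on an undirected, connected graph visits nodes in proportion to their degree, so weighting each draw by 1/d(v) undoes the bias. The sketch below assumes an idealized sample in which each node appears exactly d(v) times; the function name `reweighted_mean` and the toy graph are hypothetical.

```python
def reweighted_mean(values, degrees):
    """Estimate a population mean from degree-biased draws using weights 1/d(v)."""
    weights = [1.0 / d for d in degrees]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical toy graph: path 0-1-2-3 plus the extra edge 1-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}

# Idealized converged RW sample: node v appears d(v) times.
sample = [v for v in adj for _ in adj[v]]
degs = [len(adj[v]) for v in sample]

naive = sum(degs) / len(degs)          # biased toward high-degree nodes
adjusted = reweighted_mean(degs, degs)  # corrected mean-degree estimate
true_mean = sum(len(nb) for nb in adj.values()) / len(adj)
```

Here the quantity being estimated is the mean degree itself, which makes the bias easy to see: the naive average overshoots, while the reweighted estimate matches the population value.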

  5. Avoiding Bias with MCMC Theory
     - Why not derive a link-trace design that has a uniform (or other target) equilibrium distribution?
     - Metropolis-Hastings Random Walk Sampling
       - Like simple RW, but rejects moves proportionally to ratio of old/new degrees (a proposed move is kept with probability min(1, d(old)/d(new)))
       - Equilibrium is uniform on sampled component (for version shown)
       - If converged, sample does not require reweighting for standard applications
     - MHRW algorithm:

           initialize: G, v(0) ∈ V
           Let CONVERGED := FALSE
           Let i := 0
           while !CONVERGED do
             Let i := i+1
             Draw v(i) from Unif(N(v(i-1)))
             if Unif(0,1) > d(v(i-1))/d(v(i)) then
               Let v(i) = v(i-1)
             endif
             if v(0),...,v(i) has converged then
               Let CONVERGED := TRUE
             endif
           endwhile
           return v(0),...,v(i)
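The MHRW pseudocode on this slide translates fairly directly into Python. The sketch below is an illustration under stated assumptions: the graph is an undirected adjacency dict, the function name `mhrw` is hypothetical, and the convergence test is replaced by a fixed number of steps so the example stays self-contained.

```python
import random

def mhrw(adj, seed, steps, rng=None):
    """Metropolis-Hastings random walk with a uniform target distribution."""
    rng = rng or random.Random()
    walk = [seed]
    for _ in range(steps):
        current = walk[-1]
        # Draw v(i) from Unif(N(v(i-1))).
        proposal = rng.choice(adj[current])
        # Reject if Unif(0,1) > d(v(i-1)) / d(v(i)); the walk then stays put.
        if rng.random() > len(adj[current]) / len(adj[proposal]):
            proposal = current
        walk.append(proposal)
    return walk

# Hypothetical toy graph: path 0-1-2-3 plus the extra edge 1-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
# Assumption: a fixed step count stands in for the slide's convergence check.
walk = mhrw(adj, seed=0, steps=40000, rng=random.Random(0))
freqs = {v: walk.count(v) / len(walk) for v in adj}
```

On this toy graph the visit frequencies in `freqs` should all be close to 1/4 despite the unequal degrees, which is the property that removes the need for reweighting.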

  6. Application: Probability Sampling of Facebook Users
     - Large online service (>2x10^8 users at time of study)
       - Can no longer sample directly (Could before, but few knew this!)
     - Comparative study of sampling methods, using convergence diagnostics (M. Gjoka et al., 2009)
       - Goal: probability sample of non-isolate, publicly viewable users
       - Methods: BFS, RW, MHRW, Uniform (reference sample)
       - 28 seeds from uniform sample used to launch independent parallel traces
       - Each trace continued for exactly 81K steps (except Uniform, fixed at 982K)
       - Within-chain (Geweke's z_G) and between-chain (Gelman and Rubin's R̂) metrics used to extract final samples for RW, MHRW
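The two kinds of diagnostic named on this slide can be sketched as follows. This is a simplified illustration, not the procedure used in the study: the Geweke statistic here uses plain sample variances where the standard version uses spectral density estimates, the window fractions (first 10%, last 50%) are the customary defaults, and the function names and synthetic chains are assumptions.

```python
import random
from statistics import mean, variance

def geweke_z(chain, first=0.1, last=0.5):
    """Within-chain check: compare the mean of an early and a late window."""
    a = chain[: int(first * len(chain))]
    b = chain[int((1 - last) * len(chain)):]
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

def gelman_rubin(chains):
    """Between-chain check: potential scale reduction factor (R-hat)."""
    n = len(chains[0])
    w = mean([variance(c) for c in chains])        # within-chain variance
    b = n * variance([mean(c) for c in chains])    # between-chain variance
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5

# Synthetic stand-ins for per-step node statistics (e.g. degree) from 4 traces.
rng = random.Random(0)
chains = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(4)]
z = geweke_z(chains[0])
rhat_mixed = gelman_rubin(chains)

# One trace stuck in a different region inflates R-hat.
stuck = chains[:3] + [[x + 5 for x in chains[3]]]
rhat_stuck = gelman_rubin(stuck)
```

Values of |z| within roughly 2 and R-hat close to 1 are the usual signs of acceptable convergence; the shifted chain shows how disagreement between traces pushes R-hat well above 1.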

  7. Convergence for the MHRW Algorithm
     - Overall: acceptable convergence between 500 and 3000 iterations (depending on measure) (M. Gjoka et al., 2009)

  8. Comparative Estimation of Local Properties (M. Gjoka et al., 2009)

  9. Comparative Estimation of the Degree Distribution (M. Gjoka et al., 2009)

  10. Expansion: Multigraph Sampling
      - Often, no one network on a given population supports sampling
        - May be fragmented, or clustered/heterogeneous (slowing convergence)
      - Solution: multigraph sampling
        - Walk on multiple graphs, or unions of graphs
        - Much better properties, especially if uncorrelated
        - Individual networks need not be well-connected to be useful
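One plausible reading of "walk on unions of graphs" is a simple random walk on the union multigraph, where the neighbor lists of several relations are concatenated and parallel edges are kept. The sketch below follows that reading; the relation names `friends` and `groups`, the helper `union_multigraph`, and the toy data are hypothetical, and the slides do not specify the exact design.

```python
import random

def union_multigraph(*graphs):
    """Concatenate neighbor lists across relations; parallel edges are kept."""
    nodes = set().union(*graphs)
    return {v: [w for g in graphs for w in g.get(v, [])] for v in nodes}

def rw_sample(adj, seed, steps, rng=None):
    """Simple random walk: repeatedly move to a uniformly chosen neighbor."""
    rng = rng or random.Random()
    walk = [seed]
    for _ in range(steps):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

# Two hypothetical relations on the same four users; each is fragmented alone.
friends = {0: [1], 1: [0], 2: [3], 3: [2]}
groups = {0: [], 1: [2], 2: [1], 3: []}

union = union_multigraph(friends, groups)
walk_single = rw_sample(friends, seed=0, steps=200, rng=random.Random(1))
walk_union = rw_sample(union, seed=0, steps=200, rng=random.Random(1))
```

A walk confined to `friends` never leaves the component {0, 1}, whereas the walk on the union can reach all four users, which illustrates the point that individual networks need not be well-connected to be useful.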
