Introduction | Preliminary concepts | Graph Generalization Algorithm | Experimental Set Up | Results | Conclusions

Community-Preserving Generalization of Social Networks

Jordi Casas-Roma (1) and François Rousseau (2)
(1) Universitat Oberta de Catalunya, Barcelona, Spain, jcasasr@uoc.edu
(2) École Polytechnique, Palaiseau, France, rousseau@lix.polytechnique.fr

SoMeRis '15, Paris, August 25, 2015

Overview
1. Introduction
2. Preliminary concepts
3. Graph Generalization Algorithm
4. Experimental Set Up
5. Results
6. Conclusions

Introduction
Scenario
- Release data to third parties
- Preserve the privacy of users

Simple Anonymization
Simple anonymization (replacing user names with arbitrary identifiers) does not work! User Dan can be re-identified using his structural properties.
[Figure 1: Original network, with vertices Amy, Tim, Bob, Lis, Ann, Dan, Tom, Eva and Joe]
[Figure 2: Simple anonymization, with the names replaced by identifiers 1 to 9]
[Figure 3: Dan's 1-neighbourhood]
[Figure 4: Dan is re-identified]

Anonymization methods
Goals
- Introduce noise to hinder the re-identification process:
  - Adding/removing edges.
  - Adding fake nodes.
  - Grouping nodes into clusters.
  - ...
- Preserve users' privacy vs. maximize data utility (minimize information loss).
[Figure 5: Dan's 1-neighbourhood]
[Figure 6: Noise added]

Anonymization methods
Basic approaches for anonymity on networks:
- Network modification approaches consist of modifying (adding and/or deleting) edges or vertices in a network.
  - Randomization
  - k-anonymity model
- Clustering-based approaches (also known as generalization) consist of clustering vertices and edges into groups and anonymizing each sub-network into a super-vertex, in order to publish aggregate information about its structural properties.
- Differentially private approaches guarantee that individuals are protected under the definition of differential privacy, which imposes a guarantee on the data release mechanism rather than on the data itself. The goal is to provide statistical information about the data while preserving the privacy of users.

Graph degeneracy and k-shell
- k-Core: Let $k$ be an integer. A subgraph $H_k = (V', E')$, induced by the subset of vertices $V' \subseteq V$ (and a fortiori by the subset of edges $E' \subseteq E$), is called a k-core if and only if $\forall v_i \in V', \deg_{H_k}(v_i) \geq k$ and $H_k$ is the maximal subgraph with this property.
- k-Shell: The notion of k-shell corresponds to the subgraph induced by the set of vertices that belong to the $k$-core but not to the $(k+1)$-core, denoted by $S_k$ such that $S_k = \{ v_i \in G, v_i \in H_k \land v_i \notin H_{k+1} \}$.
- Core number: The core number or shell index of a vertex $v_i$ is the highest order of a core that contains this vertex, denoted by $core(v_i)$.
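
The definitions above can be illustrated with a minimal sketch that computes core numbers by repeatedly peeling low-degree vertices. The function names `core_numbers` and `k_shells`, the adjacency-set representation and the toy graph are assumptions made for illustration; the slides do not say which implementation the authors used.

```python
from collections import defaultdict


def core_numbers(adj):
    """Return {vertex: core number} for an undirected graph.

    `adj` maps each vertex to the set of its neighbours (hypothetical format).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Vertices whose remaining degree is at most k cannot be in the (k+1)-core.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            # Every remaining vertex has degree > k: move on to the next core.
            k += 1
            continue
        for v in peel:
            core[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
    return core


def k_shells(adj):
    """Group vertices by core number: shell k holds vertices with core number k."""
    shells = defaultdict(set)
    for v, k in core_numbers(adj).items():
        shells[k].add(v)
    return dict(shells)


# Toy graph (an assumption, not the graph drawn on the slides):
# a 4-clique A-D, E attached to A and B, F attached to E, G isolated.
toy = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A", "B", "F"},
    "F": {"E"},
    "G": set(),
}
shells = k_shells(toy)
```

In this toy graph the clique forms the 3-shell, while E, F and G fall into the 2-, 1- and 0-shell respectively.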

Graph degeneracy and k-shell
Graph $G$ and its decomposition into disjoint $k$-shells.
[Figure: example graph with labelled vertices A to F, partitioned into the 0-shell, 1-shell, 2-shell and 3-shell]

Vertex similarity measures
Manhattan similarity
...measures how many neighbors the two vertices have in common, but also how many non-neighbors they share.
$$Sim_{Manhattan}(v_i, v_j) = 1 - \frac{1}{n} \sum_{k=1}^{n} \left| (v_i, v_k) - (v_j, v_k) \right| \tag{1}$$
where $(v_i, v_k) = 1$ if $(v_i, v_k) \in E$ and $(v_i, v_k) = 0$ otherwise.
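
As a rough illustration of Equation (1), the sketch below evaluates the Manhattan similarity directly from adjacency sets. The function name `manhattan_similarity` and the four-vertex path graph are assumptions chosen for the example.

```python
def manhattan_similarity(adj, vi, vj):
    """Equation (1): 1 - (1/n) * sum_k |a(vi, vk) - a(vj, vk)|.

    `adj` maps each vertex to the set of its neighbours (hypothetical format);
    a(x, y) is 1 when the edge (x, y) exists and 0 otherwise.
    """
    n = len(adj)
    # Booleans behave as 0/1, so the difference is the indicator mismatch.
    diff = sum(abs((vk in adj[vi]) - (vk in adj[vj])) for vk in adj)
    return 1 - diff / n


# Path graph 1 - 2 - 3 - 4 (toy data, not taken from the slides).
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
sim_13 = manhattan_similarity(path, 1, 3)
```

Vertices 1 and 3 disagree only on vertex 4, so one of the four terms in the sum is non-zero.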

Vertex similarity measures
2-path similarity
...measures the number of paths of length 2 between two vertices.
$$Sim_{2\text{-}path}(v_i, v_j) = \frac{1}{n} \sum_{k=1}^{n} (v_i, v_k)(v_j, v_k) \tag{2}$$
where $(v_i, v_k) = 1$ if $(v_i, v_k) \in E$ and $(v_i, v_k) = 0$ otherwise.
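
Equation (2) counts common neighbours and normalizes by the number of vertices, which the following sketch makes explicit. The function name `two_path_similarity` and the toy path graph are assumptions made for illustration.

```python
def two_path_similarity(adj, vi, vj):
    """Equation (2): (1/n) * sum_k a(vi, vk) * a(vj, vk).

    `adj` maps each vertex to the set of its neighbours (hypothetical format).
    The product is 1 exactly when vk is adjacent to both vi and vj, so the sum
    equals the number of common neighbours, i.e. paths of length 2.
    """
    n = len(adj)
    common = sum((vk in adj[vi]) * (vk in adj[vj]) for vk in adj)
    return common / n


# Path graph 1 - 2 - 3 - 4 (toy data, not taken from the slides).
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
sim_13 = two_path_similarity(path, 1, 3)
```

Here vertices 1 and 3 share the single neighbour 2, whereas the adjacent vertices 1 and 2 share none.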

Step 1 – Information gathering
Collect the information needed in the next step to define the partition groups.
1. Compute the k-shell decomposition of the original graph, so that the generalization can preserve the graph decomposition and also the clustering structure.
2. Compute vertex similarity measures in order to define groups of vertices that share some properties regarding the graph's structure:
   - Manhattan similarity
   - 2-path similarity
   - Multilevel clustering algorithm
   - Fastgreedy clustering algorithm

Step 2 – Super-vertex definition
Define super-vertices according to the previously collected information for each vertex.
1. For each k-shell in the graph, we merge vertices belonging to the same group partition into the same super-vertex.
2. Additionally, a max fusion parameter is defined to avoid merging too many vertices into one super-vertex (an oversized super-vertex is split into two independent super-vertices).
As a result of this step, a set of super-vertices is defined and each vertex is assigned to one, and only one, super-vertex.
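
The step above can be sketched as follows, under stated assumptions. The function name `define_super_vertices`, the dictionary inputs and the parameter name `max_fusion` are hypothetical. The slides say only that an oversized super-vertex is split; cutting it into consecutive chunks of at most `max_fusion` vertices is a simplification chosen here, not necessarily the authors' rule.

```python
from collections import defaultdict


def define_super_vertices(shell, group, max_fusion):
    """Assign every vertex to exactly one super-vertex.

    shell: {vertex: k-shell index}; group: {vertex: group label from Step 1}.
    Vertices sharing both the shell and the group are merged; merged sets
    larger than `max_fusion` are cut into chunks (assumed splitting rule).
    """
    buckets = defaultdict(list)
    for v in sorted(shell):
        buckets[(shell[v], group[v])].append(v)

    super_vertices = []
    for key in sorted(buckets):
        members = buckets[key]
        for start in range(0, len(members), max_fusion):
            super_vertices.append(members[start:start + max_fusion])
    return super_vertices


# Toy input (an assumption): a 3-shell of four vertices in one group,
# plus one vertex in the 2-shell and one in the 1-shell.
shell = {"A": 3, "B": 3, "C": 3, "D": 3, "E": 2, "F": 1}
group = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0, "F": 1}
svs = define_super_vertices(shell, group, max_fusion=3)
```

With `max_fusion=3`, the four vertices of the 3-shell end up in two super-vertices instead of one.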

Step 3 – Generalized graph creation
Create the new generalized graph according to the super-vertices defined in the previous step.
1. Define an empty, undirected, edge-labeled and vertex-labeled graph $\tilde{G} = (\tilde{V}, \tilde{E})$.
2. The process iterates by adding each previously defined super-vertex $sv_i \in \tilde{V}$.
3. A super-edge between two super-vertices is created if there exists an edge between two vertices contained in each of the super-vertices: $(sv_i, sv_j) \in \tilde{E} \iff \exists (v_k, v_p) \in E : v_k \in sv_i \land v_p \in sv_j$.
Each super-vertex contains information about the number of vertices that have been merged into it (IntraVertices) and also the number of edges between the vertices contained in it (IntraEdges). Super-edges contain a label indicating the number of edges between all vertices from their endpoints (InterEdges).
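
A minimal sketch of this step, assuming the original graph is given as a list of undirected edges and the super-vertices as lists of vertices. The function name `generalize` and the returned dictionaries are hypothetical; only the three labels IntraVertices, IntraEdges and InterEdges come from the slide.

```python
def generalize(edges, super_vertices):
    """Build the labels of the generalized graph.

    Returns three dicts keyed by super-vertex index (or index pair):
    IntraVertices, IntraEdges and InterEdges. A super-edge exists exactly
    for the index pairs that appear as keys of the InterEdges dict.
    """
    owner = {v: i for i, members in enumerate(super_vertices) for v in members}
    intra_vertices = {i: len(m) for i, m in enumerate(super_vertices)}
    intra_edges = {i: 0 for i in range(len(super_vertices))}
    inter_edges = {}
    for u, v in edges:
        a, b = owner[u], owner[v]
        if a == b:
            intra_edges[a] += 1
        else:
            # Undirected graph: store each super-edge under an ordered pair.
            key = (min(a, b), max(a, b))
            inter_edges[key] = inter_edges.get(key, 0) + 1
    return intra_vertices, intra_edges, inter_edges


# Toy data (an assumption): a 4-clique A-D plus E attached to A and B.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
         ("C", "D"), ("E", "A"), ("E", "B")]
super_vertices = [["A", "B", "C"], ["D"], ["E"]]
intra_v, intra_e, inter_e = generalize(edges, super_vertices)
```

Every original edge is counted exactly once, either as an intra-edge or as part of one super-edge label, so the labels sum to the original edge count.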

Example
Toy example of the generalization process.
[Figure: (a) Original graph, with its 2-shell and 3-shell marked; (b) Generalized graph, with numeric labels on super-vertices and super-edges]

Networks
Synthetic networks:
- ER-1000 – The Erdős-Rényi model [3] is a classical random graph model. It defines a random graph as $n$ vertices connected by $m$ edges that are chosen randomly from the $n(n-1)/2$ possible edges. In our experiments, $n = 1{,}000$ and $m = 5{,}000$.
- BA-1000 – The Barabási-Albert model [2], also called the scale-free model, generates a network whose degree distribution follows a power law (for degree $d$, its probability density function is $P(d) = d^{-\gamma}$; $n = 1{,}000$ and $\gamma = 1$ in our experiments).
Real networks:
- Polblogs – Political blogosphere data [1] compiles the data on the links among US political blogs.
- URV email – the email communication network at the University Rovira i Virgili in Tarragona (Spain) [4].
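
The Erdős-Rényi construction described above (m edges drawn at random from the n(n-1)/2 possible ones) can be sketched with the standard library alone. The function name `erdos_renyi_gnm` and the seed value are assumptions; the slides do not state which generator produced ER-1000.

```python
import itertools
import random


def erdos_renyi_gnm(n, m, seed=None):
    """Sample m distinct undirected edges uniformly among all n(n-1)/2 pairs."""
    rng = random.Random(seed)
    # Enumerate every unordered pair (u, v) with u < v, then sample without
    # replacement so that no edge is repeated and there are no self-loops.
    possible_edges = list(itertools.combinations(range(n), 2))
    return rng.sample(possible_edges, m)


# Same sizes as ER-1000 on the slide; the seed is arbitrary.
er_edges = erdos_renyi_gnm(1000, 5000, seed=42)
```

Enumerating all pairs is affordable at this size (499,500 candidates) but would need a different sampling strategy for much larger n.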

Generic information loss measures
Network metrics:
- Average distance ($dist$)
- Diameter ($d$)
- Harmonic mean of the shortest distance ($h$)
- Transitivity ($T$)
We compute the error on these network metrics as follows:
$$\epsilon_m(G, \tilde{G}) = | m(G) - m(\tilde{G}) | \tag{3}$$
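
To make Equation (3) concrete, the sketch below computes the error for one of the listed metrics, the diameter, on two small connected graphs. The helper names `bfs_distances`, `diameter` and `metric_error` are hypothetical, and both graphs are toy data. How a metric is evaluated on a generalized graph with super-vertices is not described on the slide, so the second graph here is simply an ordinary perturbed graph.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def diameter(adj):
    """Longest shortest path; assumes the graph is connected."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)


def metric_error(metric, g, g_tilde):
    """Equation (3): absolute difference of a metric on the two graphs."""
    return abs(metric(g) - metric(g_tilde))


# Toy graphs (assumptions): a path 1-2-3-4 and the cycle obtained by adding edge 4-1.
g = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
g_tilde = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {3, 1}}
err = metric_error(diameter, g, g_tilde)
```

Any function that maps a graph to a number can be passed as `metric`, so the same wrapper covers the other metrics in the list.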

Clustering-specific information loss measures
[Figure: the original graph $G$ is transformed by the anonymization process $p$ into $\tilde{G}$; the same clustering method $c$ is applied to both, giving the original clusters $c(G)$ and the perturbed clusters $c(\tilde{G})$, which are compared using the precision index]
$$\text{precision index}(G, \tilde{G}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{l_{tc}(v_i) = l_{pc}(v_i)} \tag{4}$$
where $\mathbb{1}_{x=y}$ equals 1 if $x = y$ and 0 otherwise.
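
Equation (4) can be sketched in a few lines. The slide does not define $l_{tc}$ and $l_{pc}$; the sketch reads them as the cluster label of a vertex in the original clustering and in the perturbed clustering, and further assumes that the two label sets have already been aligned with each other. The function name `precision_index` and the label dictionaries are hypothetical.

```python
def precision_index(original_labels, perturbed_labels):
    """Equation (4): fraction of vertices whose cluster label is unchanged.

    Both arguments map each vertex to a cluster label; labels are assumed to
    be comparable across the two clusterings (an assumption of this sketch).
    """
    n = len(original_labels)
    matches = sum(
        1 for v in original_labels
        if original_labels[v] == perturbed_labels[v]
    )
    return matches / n


# Toy labels (assumptions): vertex "b" changes cluster after anonymization.
original = {"a": 0, "b": 0, "c": 1, "d": 1}
perturbed = {"a": 0, "b": 1, "c": 1, "d": 1}
score = precision_index(original, perturbed)
```

A value of 1 means the clustering method assigns every vertex to the same cluster before and after anonymization.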

Clustering-specific information loss measures
Clustering algorithms:
- Multilevel (ML)
- Infomap (IM)
- Fast greedy modularity optimization (Fastgreedy or FG)
- Algorithm of Girvan and Newman (Girvan-Newman or GN)