Classifying Online Social Network Users Through the Social Graph


  1. Classifying Online Social Network Users Through the Social Graph. Cristina Pérez Solà and Jordi Herrera Joancomartí, Departament d'Enginyeria de la Informació i les Comunicacions, Universitat Autònoma de Barcelona. October 25th, 2012.

  2. Outline. 1: Introduction. 2: Classifier proposal. 3: The experiments. 4: Conclusions and further work.

  3. About the title: Classifying... Definition: Classification is the problem of identifying to which of a set of categories a new observation belongs. The decision is made on the basis of a training set of data containing observations whose category membership is already known.

  4. About the title: ...Online Social Network Users...

  5. About the title: ...Through the Social Graph. Definition: A social graph is a graph where nodes represent users in a social network and edges represent relationships between these users.

  6. What do we want to do? Goals: (1) Design a user (node) classifier that uses the graph structure alone (no semantic information is needed). (2) Apply that classifier to label online social network (OSN) users. (3) Demonstrate that OSN user classification is possible with naively anonymized graphs.

  7. Why is it interesting? Motivation: user classification as a privacy attack. User classification allows an attacker to infer (private) attributes of the user. Attributes may be sensitive by themselves. Attribute disclosure may have undesirable consequences for the user. In any case, the user can no longer control the disclosure of information about themselves.

  8. Outline. 1: Introduction. 2: Classifier proposal (Architecture overview; Classifier modules; Specific design details). 3: The experiments. 4: Conclusions and further work.

  9. Architecture overview: Classifier Architecture. The proposed classifier is implemented with a five-module architecture, which includes two different classifiers: an initial classifier and a relational classifier. [Figure: pipeline of the five modules: data preprocessing, initial classifier, neighborhood analysis, data preprocessing, relational classifier. Other labels in the diagram: class labels, clustering coefficient and degrees, new class labels.]

  10. Classifier modules: Initial classifier. The initial classifier analyzes the graph structure and maps each node to a 2-dimensional sample: degree and clustering coefficient. The output is an initial assignment of nodes to categories.
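
The feature extraction described on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `structural_features` and the toy edge list are hypothetical, degree is split into in-degree and out-degree (as a later slide notes for directed graphs), and the clustering coefficient is computed on the undirected version of the graph, which the slides do not specify.

```python
from itertools import combinations


def structural_features(edges):
    """Map each node to (in-degree, out-degree, clustering coefficient).

    Assumption: the clustering coefficient ignores edge direction and the
    graph has no self-loops.
    """
    succ, pred, und = {}, {}, {}
    for u, v in edges:
        for table in (succ, pred, und):
            table.setdefault(u, set())
            table.setdefault(v, set())
        succ[u].add(v)   # u follows v
        pred[v].add(u)   # v is followed by u
        und[u].add(v)    # undirected view, used for the clustering coefficient
        und[v].add(u)

    features = {}
    for node, nbrs in und.items():
        k = len(nbrs)
        # Count pairs of neighbors that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in und[a])
        cc = 2.0 * links / (k * (k - 1)) if k > 1 else 0.0
        features[node] = (len(pred[node]), len(succ[node]), cc)
    return features


# Hypothetical toy graph: a triangle a-b-c plus a pendant node d.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
features = structural_features(edges)
```

Each tuple in `features` is one sample that an initial classifier could consume.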

  11. Classifier modules: Neighborhood analysis. The neighborhood analysis module reports which kinds of nodes each node is connected to, using the labels assigned by the initial classifier.
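
A minimal sketch of such a neighborhood analysis follows. It assumes that the module's report is a per-node count of neighbors in each category, kept separately for in-neighbors and out-neighbors; the slides do not say whether counts or fractions are used, and `neighborhood_profile` and the sample data are hypothetical names for illustration.

```python
def neighborhood_profile(edges, labels, categories):
    """Count, for every node, its in- and out-neighbors per category.

    The first len(categories) entries count in-neighbors by category,
    the remaining entries count out-neighbors by category.
    """
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    profile = {node: [0] * (2 * n) for node in labels}
    for u, v in edges:
        profile[u][n + index[labels[v]]] += 1  # v is an out-neighbor of u
        profile[v][index[labels[u]]] += 1      # u is an in-neighbor of v
    return profile


# Hypothetical labels, as if produced by an initial classifier.
categories = ["individual", "company"]
labels = {"a": "individual", "b": "company", "c": "individual", "d": "company"}
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
profile = neighborhood_profile(edges, labels, categories)
```

With two categories each node gets four extra dimensions, which is one way to read the remark that adding categories just adds dimensions.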

  12. Classifier modules: Relational classifier. The relational classifier maps users to $n$-dimensional samples, using both the degree and clustering coefficient and the neighborhood information to classify users. The output is a new assignment of nodes to categories, which can differ from the initial classification.

  13. Specific design details: Some details about the classifier. The graph is directed, so we distinguish between indegree and outdegree (instead of having just degree). This distinction increases the number of dimensions in the neighborhood analysis by 2. We can have as many categories as we want: we just have to add more dimensions. Classifiers are instantiated with Support Vector Machines with soft margins. The relational classifier is applied iteratively.
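
The iterative use of the relational classifier can be sketched as the loop below. It is not the authors' implementation: to stay dependency-free, a nearest-centroid classifier stands in for the soft-margin SVM named on the slide, neighbor counts ignore edge direction, and the number of rounds, the function names and the toy data are all assumptions.

```python
def fit_centroids(samples, labels):
    """Stand-in for SVM training: one mean vector per category."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the category whose centroid is closest to x."""
    def dist(y):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[y]))
    return min(sorted(centroids), key=dist)


def iterative_classification(edges, structural, known, categories, rounds=3):
    nodes = sorted(structural)
    train = [n for n in nodes if n in known]
    test = [n for n in nodes if n not in known]
    index = {c: i for i, c in enumerate(categories)}

    # Initial classifier: structural features only.
    model = fit_centroids([structural[n] for n in train], [known[n] for n in train])
    labels = dict(known)
    for n in test:
        labels[n] = predict(model, structural[n])

    for _ in range(rounds):
        # Neighborhood analysis with the current labels.
        counts = {n: [0] * len(categories) for n in nodes}
        for u, v in edges:
            counts[u][index[labels[v]]] += 1
            counts[v][index[labels[u]]] += 1
        full = {n: list(structural[n]) + counts[n] for n in nodes}
        # Relational classifier: structural plus neighborhood features.
        model = fit_centroids([full[n] for n in train], [known[n] for n in train])
        new_labels = dict(known)
        for n in test:
            new_labels[n] = predict(model, full[n])
        labels = new_labels
    return labels


# Hypothetical one-dimensional structural features and two labeled nodes.
structural = {"a": (1.0,), "b": (1.2,), "c": (0.9,), "d": (5.0,), "e": (5.2,), "f": (4.8,)}
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")]
result = iterative_classification(edges, structural, {"a": "x", "d": "y"}, ["x", "y"])
```

Known labels are never overwritten; only the unlabeled nodes are re-classified in each round, so their labels can change as their neighbors' labels change.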

  14. Outline. 1: Introduction. 2: Classifier proposal. 3: The experiments (Experiment design; Experiment results). 4: Conclusions and further work.

  15. Experiment design: The main goal. Research question: Is an attacker able to recover attributes of OSN users knowing just the social graph structure and the attributes of a small subset of the nodes in the graph? We are facing a within-network classification problem, where nodes whose labels are unknown are linked to nodes whose labels are known.

  16. Experiment design: Data used in the experiments. We collected data from 936,423 Twitter users, who were all the neighbors of a subset of 300 nodes. We constructed two disjoint graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ with users and their relationships. We labeled the nodes of the graphs to obtain the ground truth. Binary classification: individual or company. Multiclass classification: normal user, blogger, celebrity, media and organization.

  17. Experiment design: An experiment. Each of the experiments consisted of: randomly selecting a subset of nodes ($V_{\text{train}}$) to be used as training samples (65%, 50%, 35% and 20% of the nodes); training the classifiers with those samples; classifying the rest of the nodes ($V_{\text{test}} = V \setminus V_{\text{train}}$); and evaluating the overall performance using the ground truth. We performed 100 experiments for each of the training set sizes and for both classification problems.
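
The experimental protocol above can be sketched as a small harness. The names `run_experiment`, `mean_correct_rate` and `majority_baseline` are hypothetical, the seeded random generator is an assumption added for reproducibility, and the majority-class baseline is only a placeholder for the classifier under test.

```python
import random


def run_experiment(nodes, truth, train_fraction, classify, rng):
    """One run: random split, classify V_test, return the correct rate."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    v_train, v_test = shuffled[:cut], shuffled[cut:]
    known = {n: truth[n] for n in v_train}       # labels visible to the classifier
    predicted = classify(known, v_test)
    correct = sum(1 for n in v_test if predicted[n] == truth[n])
    return correct / len(v_test)


def mean_correct_rate(nodes, truth, train_fraction, classify, runs=100, seed=0):
    """Average the correct rate over repeated random splits."""
    rng = random.Random(seed)
    rates = [run_experiment(nodes, truth, train_fraction, classify, rng)
             for _ in range(runs)]
    return sum(rates) / len(rates)


def majority_baseline(known, v_test):
    """Placeholder classifier: predict the most common training label."""
    values = list(known.values())
    top = max(sorted(set(values)), key=values.count)
    return {n: top for n in v_test}


# Hypothetical ground truth for 20 nodes.
nodes = ["u%d" % i for i in range(20)]
truth = {n: ("company" if i < 5 else "individual") for i, n in enumerate(nodes)}
rates = {f: mean_correct_rate(nodes, truth, f, majority_baseline)
         for f in (0.65, 0.50, 0.35, 0.20)}
```

Any function with the same `(known, v_test)` signature can be passed in place of `majority_baseline`, so the same harness serves both the binary and the multiclass problem.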

  18. Experiment results: Binary Classification Results. [Figure: correct rate (y-axis, 0.5 to 0.75) against iteration (x-axis, 0 to 10); series D1 and D2, each with 65%, 50%, 35% and 20% training sets.]

  19. Experiment results: Multiclass Classification Results. [Figure: correct rate (y-axis, 0.3 to 0.6) against iteration (x-axis, 0 to 10); series Cat a with 65%, 50%, 35% and 20% training sets.]

  21. Conclusions. Information found in the social graph is enough to perform classification. It is possible to classify OSN users using a naively anonymized copy of a social graph. Naive anonymization does not protect OSN users from attribute disclosure. The success rate varies depending on the training set size.

  22. Further work. Integrate both structural and semantic information to improve classification. Study the impact of different graph anonymization techniques (other than the naive anonymization) on the classification. Analyze the performance of other classification techniques for relational data.

  24. Linear SVM

  25. Non-linear SVM
