big data era



  1. Big Data Era
     https://vimeo.com/102998774

  2. The big problem: Scalability
     • Visualization
     • Algorithm
     • Hardware

  3. The big problem: Scalability
     • Visualization
     • Algorithm
     • Hardware
     Images:
     https://upload.wikimedia.org/wikipedia/commons/0/05/Sna_large.png
     https://upload.wikimedia.org/wikipedia/commons/9/9b/Social_Network_Analysis_Visualization.png
     https://c1.staticflickr.com/5/4033/4520018121_6dd39e8d7e_z.jpg
     https://c1.staticflickr.com/1/1/916142_ddc2fd0140.jpg

  4. Graph Sampling
     • Randomly pick nodes/edges to construct a subgraph that represents the original unfiltered graph

  5. Which sampling strategy to use?

  6. Graph Sampling Evaluation [Leskovec and Faloutsos, KDD 2006]
     • Random Walk (RW) vs. Forest Fire (FF)

  7. Graph Sampling Evaluation in Visualization
     • Original Graph
     • Random Walk (RW): avg. node degree 2.4, power-law degree distribution
     • Forest Fire (FF): avg. node degree 2.4, power-law degree distribution
     • Distinct visual result!
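The point of slide 7, that matching summary statistics do not guarantee matching structure, can be illustrated with a minimal sketch. The average node degree of an undirected graph is 2|E|/|N|. The adjacency-dict representation, the function names, and the two toy graphs below are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter

def average_degree(adj):
    """Mean node degree of an undirected graph stored as {node: set(neighbors)}."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def degree_histogram(adj):
    """Map each degree value to the number of nodes that have that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())

# Two hypothetical toy graphs with 5 nodes and 4 edges each.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}           # one hub, four leaves
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}           # a simple chain

print(average_degree(star), average_degree(path))   # same average degree
print(degree_histogram(star), degree_histogram(path))  # different degree distributions
```

Both toy graphs share the same average degree, yet one has a hub and the other does not, so they would look quite different in a node-link drawing.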

  8. Graph Sampling Evaluation in Visualization
     • Similarity Measurements:
       • Statistical Features (Data Mining): Hub Inclusion, Clustering Coeff., Discovery Quotient, …
       • Visualization: ?

  9. Graph Sampling Evaluation in Visualization
     • Goals and procedure:
       • G1: Identify the key visual factors that make the sampled graphs representative (Pilot Study)
       • G2: Evaluate the performance of different sampling algorithms on these visual factors (Formal Studies)
     • Similarity Measurements:
       • Statistical Features (Data Mining): Hub Inclusion, Clustering Coeff., Discovery Quotient, …
       • Visual Factors (Visualization): ?

  10. Outline
     • Selected Sampling Methods
     • Pilot Study
     • Formal Studies
       • Perception of High Degree Nodes
       • Perception of Cluster Quality
       • Perception of Coverage Area

  11. Node-Based Sampling
     • Original Graph
     • Random Node Sampling
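A minimal sketch of random node sampling, assuming the graph is stored as an adjacency dict and that the sample is the subgraph induced by the chosen nodes. The function name `random_node_sample` and the `rate` parameter are hypothetical; the slides do not specify an implementation.

```python
import random

def random_node_sample(adj, rate, seed=None):
    """Keep a uniform random fraction of the nodes and the edges among them.

    adj  : {node: set(neighbors)} for an undirected graph (assumed format)
    rate : fraction of nodes to keep, e.g. 0.2
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    k = max(1, round(rate * len(nodes)))
    kept = set(rng.sample(nodes, k))
    # Induced subgraph: an edge survives only if both endpoints were kept.
    return {u: {v for v in adj[u] if v in kept} for u in kept}

# Usage on a hypothetical 10-node ring.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
sample = random_node_sample(ring, rate=0.5, seed=1)
print(sorted(sample))
```

Because nodes are chosen independently of the edges, many kept nodes can end up isolated in sparse graphs.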

  15. Edge-Based Sampling
     • Original Graph
     • Random Edge Sampling
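A minimal sketch of random edge sampling, assuming edges are chosen uniformly at random and the sample consists of the chosen edges plus their endpoints. The function name and the choice to express `rate` as a fraction of edges are assumptions; the study may define the sampling rate differently.

```python
import random

def random_edge_sample(edges, rate, seed=None):
    """Keep a uniform random fraction of the edges and build the resulting graph.

    edges : iterable of (u, v) pairs for an undirected graph (assumed format)
    rate  : fraction of edges to keep (assumption: rate is defined on edges)
    """
    rng = random.Random(seed)
    edges = list(edges)
    k = max(1, round(rate * len(edges)))
    adj = {}
    for u, v in rng.sample(edges, k):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Usage on a hypothetical 10-node ring given as an edge list.
ring_edges = [(i, (i + 1) % 10) for i in range(10)]
sample = random_edge_sample(ring_edges, rate=0.3, seed=2)
print(sample)
```

Every node in the result has at least one edge, which is one reason edge-based samples tend to look different from node-based ones.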

  18. Traversal-Based Sampling: Random Walk
     • Original Graph
     • Random Walk
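A minimal sketch of random walk sampling, assuming the common variant in which the walker returns to its start node with a small probability at each step. The function name, the restart probability of 0.15, and the `max_steps` safety cap are assumptions; the slides do not state the parameters used.

```python
import random

def random_walk_sample(adj, n_target, restart_p=0.15, seed=None, max_steps=100_000):
    """Walk the graph from a random start node until n_target nodes are visited.

    adj       : {node: set(neighbors)} for an undirected graph (assumed format)
    restart_p : probability of returning to the start node (assumed value)
    max_steps : cap so the walk stops even if it is stuck in a small component
    """
    rng = random.Random(seed)
    start = rng.choice(sorted(adj))
    current, visited = start, {start}
    for _ in range(max_steps):
        if len(visited) >= n_target:
            break
        if rng.random() < restart_p or not adj[current]:
            current = start                      # fly back to the start node
        else:
            current = rng.choice(sorted(adj[current]))
            visited.add(current)
    # Return the subgraph induced by the visited nodes.
    return {u: {v for v in adj[u] if v in visited} for u in visited}

# Usage on a hypothetical 20-node ring.
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
sample = random_walk_sample(ring, n_target=8, seed=3)
print(sorted(sample))
```

Since the walker never leaves the component of its start node, the sample stays local, which fits the small perceived coverage area reported for RW later in the deck.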

  20. Traversal-Based Sampling: Random Jump
     • Original Graph
     • Random Jump
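A minimal sketch of random jump sampling, assuming it behaves like a random walk except that, with a small probability, the walker jumps to a uniformly random node anywhere in the graph. The function name and the jump probability of 0.15 are assumptions, not values taken from the slides.

```python
import random

def random_jump_sample(adj, n_target, jump_p=0.15, seed=None, max_steps=100_000):
    """Walk the graph, occasionally jumping to a random node, until n_target nodes are visited.

    adj    : {node: set(neighbors)} for an undirected graph (assumed format)
    jump_p : probability of jumping to a uniformly random node (assumed value)
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    current = rng.choice(nodes)
    visited = {current}
    for _ in range(max_steps):
        if len(visited) >= n_target:
            break
        if rng.random() < jump_p or not adj[current]:
            current = rng.choice(nodes)              # jump anywhere in the graph
        else:
            current = rng.choice(sorted(adj[current]))
        visited.add(current)
    # Return the subgraph induced by the visited nodes.
    return {u: {v for v in adj[u] if v in visited} for u in visited}

# Usage on a hypothetical graph made of two disconnected triangles.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
sample = random_jump_sample(two_triangles, n_target=6, seed=4)
print(sorted(sample))
```

Unlike the plain random walk, the jumps let the sampler reach disconnected or distant regions, so the sample spreads over more of the graph.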

  22. Traversal-Based Sampling: Forest Fire
     • Original Graph
     • Forest Fire
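A minimal sketch of forest fire sampling for an undirected graph, assuming the usual scheme: ignite a random node, "burn" a geometrically distributed number of its unburned neighbors, and repeat from each newly burned node. The function name, the forward burning probability of 0.7, and the re-ignition rule when a fire dies out are assumptions; the slides do not give these details.

```python
import random
from collections import deque

def forest_fire_sample(adj, n_target, p_forward=0.7, seed=None):
    """Burn outward from random seed nodes until n_target nodes are burned.

    adj       : {node: set(neighbors)} for an undirected graph (assumed format)
    p_forward : forward burning probability (assumed value)
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    burned = set()
    while len(burned) < min(n_target, len(nodes)):
        # Ignite a new fire at a random unburned node (also used when a fire dies out).
        start = rng.choice([u for u in nodes if u not in burned])
        burned.add(start)
        queue = deque([start])
        while queue and len(burned) < n_target:
            u = queue.popleft()
            # Number of neighbors to burn: successes before the first failure,
            # which has mean p_forward / (1 - p_forward).
            x = 0
            while rng.random() < p_forward:
                x += 1
            fresh = [v for v in sorted(adj[u]) if v not in burned]
            rng.shuffle(fresh)
            for v in fresh[:x]:
                if len(burned) >= n_target:
                    break
                burned.add(v)
                queue.append(v)
    # Return the subgraph induced by the burned nodes.
    return {u: {v for v in adj[u] if v in burned} for u in burned}

# Usage on a hypothetical 20-node ring.
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
sample = forest_fire_sample(ring, n_target=7, seed=5)
print(sorted(sample))
```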

  24. Outline
     • Selected Sampling Methods
     • Pilot Study
     • Formal Studies
       • Perception of High Degree Nodes
       • Perception of Cluster Quality
       • Perception of Coverage Area

  25. Pilot Study
     • Task:
       • Identify the visual factors that strongly influence the representativeness of sampled graphs
       • We also determine the sampling rate used in the formal studies
     • Dataset: 5 real-world graphs
     • Visual factor candidates

  26. Pilot Study
     • Task:
       • Identify the visual factors that strongly influence the representativeness of sampled graphs
       • We also determine the sampling rate used in the formal studies
     • Visual factor candidates
     • Results (key visual factors): High Degree Nodes, Cluster Quality, Coverage Area

  27. Outline
     • Selected Sampling Methods
     • Pilot Study
     • Formal Studies
       • Perception of High Degree Nodes
       • Perception of Cluster Quality
       • Perception of Coverage Area

  28. Formal Study I: High Degree Nodes
     • Original Graph (nodes A and B marked): 20 high degree nodes
     • Sampled Graph (nodes A and B marked): 8 high degree nodes?

  29. Formal Study I: High Degree Nodes

  30. Formal Study I: High Degree Nodes
     • Experiment Setting
     • Data Generation: 20 high degree nodes
       • N: 1024, D: S
       • N: 2048, D: S
       • N: 1024, D: L
       • N: 2048, D: L

  31. Formal Study I: High Degree Nodes Results
     • Discussions:
       • It is easier to perceive high degree nodes in the RW samples
       • It is more difficult to perceive high degree nodes in the RN samples
       • The above results hold across datasets

  32. Formal Study I: High Degree Nodes Results
     • Discussions:
       • It is easier to perceive high degree nodes in the RW samples
       • It is more difficult to perceive high degree nodes in the RN samples
       • The above results hold across datasets
     • RW vs. FF:
       • +: number of high degree nodes perceived (Visualization)
       • *: number of high degree nodes remained (Data Mining)
     • Contradiction with metric-based results!

  33. Formal Study I: High Degree Nodes Results
     • Random Walk (RW): 16 high degree nodes remained
     • Forest Fire (FF): 7 high degree nodes remained

  34. Formal Study I: High Degree Nodes Results
     • Random Walk (RW): 16 high degree nodes remained, 6 high degree nodes perceived
     • Forest Fire (FF): 7 high degree nodes remained, 3 high degree nodes perceived
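The metric-side count of "high degree nodes remained" can be sketched as follows, assuming that high degree nodes are the k highest-degree nodes of the original graph. That definition, the function names, and the tie-breaking rule are assumptions for illustration; the default k=20 mirrors the "20 high degree nodes" mentioned in the experiment setting. The perceived count, by contrast, comes from participants and cannot be computed this way.

```python
def top_k_by_degree(adj, k):
    """Return the k highest-degree nodes of {node: set(neighbors)}; ties broken by node id."""
    return sorted(adj, key=lambda u: (-len(adj[u]), u))[:k]

def hubs_remaining(original_adj, sampled_nodes, k=20):
    """Count how many of the original graph's top-k degree nodes appear in the sample."""
    hubs = top_k_by_degree(original_adj, k)
    return sum(1 for u in hubs if u in sampled_nodes)

# Usage on a hypothetical 4-node graph with degrees 3, 2, 2, 1.
original = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(top_k_by_degree(original, 2))            # the two highest-degree nodes
print(hubs_remaining(original, {1, 2, 3}, k=2))  # how many of them the sample kept
```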

  35. Outline
     • Selected Sampling Methods
     • Pilot Study
     • Formal Studies
       • Perception of High Degree Nodes (more high degree nodes are perceived in RW)
       • Perception of Cluster Quality
       • Perception of Coverage Area

  36. Formal Study II: Cluster Quality

  37. Formal Study II: Cluster Quality
     • Experiment Setting
     • Data Generation

  38. Formal Study II: Cluster Quality Results
     • Discussions:
       • RE and RJ best preserve the perceived cluster quality in samples
       • RN and FF struggle to preserve the perceived cluster quality
       • The performance of RW and FF depends on graph modularity

  39. Formal Study II: Cluster Quality Results
     • The number of clusters remaining in the sample is important for perceiving cluster quality in the visualization!

  40. Outline
     • Selected Sampling Methods
     • Pilot Study
     • Formal Studies
       • Perception of High Degree Nodes (more high degree nodes are perceived in RW)
       • Perception of Cluster Quality (cluster number is important)
       • Perception of Coverage Area

  41. Formal Study III: Coverage Area

  42. Formal Study III: Coverage Area
     • Experiment Setting
     • Data Generation:
       • N: 1024, D: S
       • N: 2048, D: S
       • N: 1024, D: L
       • N: 2048, D: L

  43. Formal Study III: Coverage Area Results
     • Discussions:
       • RE and RJ have the largest perceived coverage area
       • RW has the smallest perceived coverage area in most cases
       • The performance of RW and FF varies depending on graph properties
     • Contradiction with metric-based results!
     • Graphs:
       • BA: G1 (N: 1024, D: S), G2 (N: 1024, D: L), G3 (N: 2048, D: S), G4 (N: 2048, D: L)
       • Sah: G5 (N: 1024, M: L), G6 (N: 1024, M: H), G7 (N: 2048, M: L), G8 (N: 2048, M: H)
     • Table:

       Data   RN    REN   RW    RJ    FF
       G1     22%   29%   22%   28%   27%
       G2     23%   31%   24%   29%   29%
       G3     21%   29%   23%   28%   28%
       G4     22%   31%   24%   28%   28%
       G5     24%   39%   41%   41%   40%
       G6     25%   36%   34%   36%   33%
       G7     27%   45%   46%   46%   47%
       G8     21%   32%   29%   32%   29%
       All    23%   34%   30%   34%   33%

     [Figure: per-graph (G1 to G8) and overall comparison panels for RN, REN, RW, RJ, and FF with test statistics; panel details are not recoverable from the extracted text.]

  44. Formal Study III: Coverage Area Results
     • RW
     • RN

  45. Conclusion
     • We provided the first study of how graph sampling strategies can influence the perception of node-link visualizations
     • Important visual factors: high degree nodes, cluster quality, and coverage area
     • Recommendations for sampling network visualizations:
       • Recommend Random Edge and Random Jump for global structure and cluster quality
       • Recommend Random Walk for perceived high degree nodes
       • Use Random Node unless for specific requirements
       • Random Walk and Forest Fire are modularity sensitive
     • Graph sampling performance in visualization may VARY from previous metric-based results!

  46. Q&A
     Evaluation of Graph Sampling: A Visualization Approach
     Yanhong Wu, Nan Cao, Daniel Archambault, Qiaomu Shen, Huamin Qu, and Weiwei Cui
     yanhong.wu@ust.hk
     http://yhwu.me
