Data Mining: Concepts and Techniques, Chapter 9: Graph Mining and Social Network Analysis


  1. Data Mining: Concepts and Techniques, Chapter 9: Graph Mining and Social Network Analysis. Li Xiong. Slide credits: Jiawei Han and Micheline Kamber. April 2, 2008

  2. Graph Mining and Social Network Analysis
     - Graph mining
       - Frequent subgraph mining
     - Social network analysis
       - Social network
       - Social network analysis at different levels
       - Link analysis
     April 2, 2008. Mining and Searching Graphs in Graph Databases

  3. Graph Mining
     - Methods for mining frequent subgraphs
     - Applications:
       - Graph indexing
       - Similarity search
       - Classification and clustering
     - Summary

  4. Why Graph Mining?
     - Graphs are ubiquitous
       - Chemical compounds (cheminformatics)
       - Protein structures, biological pathways/networks (bioinformatics)
       - Program control flow, traffic flow, and workflow analysis
       - XML databases, Web, and social network analysis
     - Graph is a general model
       - Trees, lattices, sequences, and items are degenerate graphs
     - Diversity of graphs
       - Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
     - Complexity of algorithms: many problems are of high complexity

  5. Graph, Graph, Everywhere
     [Figure panels: aspirin; yeast protein interaction network; co-author network; Internet. Credit: H. Jeong et al., Nature 411, 41 (2001).]

  6. Graph Pattern Mining
     - Frequent subgraph mining
       - Finding frequent subgraphs within a single graph
       - Finding frequent (sub)graphs in a set of graphs
       - A subgraph is frequent if its support (occurrence frequency) is no less than a minimum support threshold
     - Applications of graph pattern mining
       - Mining biochemical structures, program control flow analysis, XML structures, or Web communities
       - Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
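
The support definition above can be illustrated with a minimal sketch, assuming the simplest possible pattern: a single labeled edge. The toy graphs, `edge_patterns`, and `frequent_edges` are hypothetical names introduced here for illustration; larger patterns would require subgraph isomorphism testing, which this sketch does not attempt.

```python
from collections import Counter

def edge_patterns(labels, edges):
    """Return the set of label-level edge patterns present in one graph."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def frequent_edges(graphs, min_support):
    """Count, for each one-edge pattern, how many graphs contain it."""
    support = Counter()
    for labels, edges in graphs:
        # Using a set means each graph adds at most 1 to a pattern's support.
        support.update(edge_patterns(labels, edges))
    return {p: s for p, s in support.items() if s >= min_support}

# Hypothetical toy dataset: (vertex id -> label, undirected edges between ids).
graphs = [
    ({0: "C", 1: "O", 2: "N"}, [(0, 1), (0, 2)]),
    ({0: "C", 1: "O"}, [(0, 1)]),
    ({0: "C", 1: "N", 2: "N"}, [(0, 1), (1, 2)]),
]
result = frequent_edges(graphs, min_support=2)
```

With a minimum support of 2, the C-O and C-N edges are kept, while the N-N edge, which occurs in only one graph, is discarded.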

  7. Example: Frequent Subgraph Mining in Chemical Compounds
     [Figure: a graph dataset of three chemical compounds (A), (B), (C), and the two frequent patterns (1) and (2) found with a minimum support of 2.]

  8. Graph Mining Algorithms
     - Finding interesting and frequent substructures in a single graph
       - SUBDUE
     - Finding frequent patterns in a set of independent graphs
       - Apriori-based approach
       - Pattern-growth approach

  9. SUBDUE (Holder et al. KDD’94)
     - Problem
       - Finding “interesting” and repetitive substructures (connected subgraphs) in data represented as a graph
     - Basic idea
       - Minimum description length (MDL) principle
       - Beam search algorithm
       - Start with best single vertices
       - Expand best substructures with a new edge
       - Substructures are evaluated based on their ability to compress input graphs

  10. Minimum Description Length (MDL)
     - Minimum description length (MDL) principle
       - A formalization of Occam’s Razor
       - The best hypothesis minimizes the description length of the data (largest compression)
     - Graph substructure discovery based on MDL
       - Description length (DL): represent vertices and adjacency matrix
       - Graph compression: replace substructure instances with pointers
       - Find the best substructure S in G that minimizes DL(S) + DL(G|S)
     [Figure: input database (G), substructure (S1), and compressed database (G|S1). Source: Holder et al.]
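
As a rough illustration of the objective DL(S) + DL(G|S), the sketch below uses a deliberately crude description length: one label per vertex plus a full adjacency matrix, measured in bits. This is an assumption made for illustration, not SUBDUE's actual encoding, and the function names are hypothetical. The toy counts (10 vertices, 4 labels, a 2-vertex substructure with 4 instances) mirror the figure's example.

```python
import math

def description_length(num_vertices, num_labels):
    """Crude DL in bits: a label per vertex plus an n-by-n adjacency matrix."""
    label_bits = num_vertices * math.log2(max(num_labels, 2))
    adjacency_bits = num_vertices * num_vertices
    return label_bits + adjacency_bits

def compressed_cost(n, num_labels, sub_size, instances):
    """DL(S) + DL(G|S) when each instance of S collapses to one pointer vertex."""
    dl_s = description_length(sub_size, num_labels)
    n_compressed = n - instances * (sub_size - 1)
    # The pointer to S acts as one extra vertex label in the compressed graph.
    dl_g_given_s = description_length(n_compressed, num_labels + 1)
    return dl_s + dl_g_given_s

baseline = description_length(10, 4)                        # DL(G), no compression
with_s1 = compressed_cost(10, 4, sub_size=2, instances=4)   # DL(S1) + DL(G|S1)
```

Under this proxy the compressed encoding is shorter than the uncompressed one, which is the sense in which a substructure is "good" under MDL.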

  11. Beam Search Algorithm
     - Beam search
       - An optimization of best-first search
       - Breadth-first search with a predetermined number of paths kept as candidates (beam width)
     - Subgraph discovery based on beam search
       - Start with best single vertices
       - Expand best substructures with a new edge
       - Substructures are evaluated based on their ability to compress input graphs (minimize description length)
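
The beam search idea can be sketched generically as follows. This is a minimal sketch, not SUBDUE itself: `expand` and `score` are caller-supplied placeholders (in SUBDUE they would correspond to adding an edge to a substructure and to its compression value), and the digit-tuple toy problem is purely hypothetical.

```python
import heapq

def beam_search(start_states, expand, score, beam_width, max_steps):
    """Keep only the beam_width best states at each level of the search."""
    beam = heapq.nlargest(beam_width, start_states, key=score)
    best = beam[0]  # nlargest returns states sorted best-first
    for _ in range(max_steps):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=score)
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

# Toy problem: states are digit tuples, grown one digit at a time up to length 3.
def expand(state):
    return [state + (d,) for d in range(3)] if len(state) < 3 else []

best = beam_search([(0,), (1,), (2,)], expand, score=sum,
                   beam_width=2, max_steps=10)
```

A beam width of 1 degenerates to greedy search, while an unbounded width gives plain breadth-first search; the width trades search cost against the risk of pruning the eventual best state.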

  12. Algorithm
     - Step 1. Create a substructure for each unique vertex label
     [Figure: input database (G) in graph form, with shapes linked by “on” edges, and the initial substructures (S): triangle (4), square (4), circle (1), rectangle (1). Source: Holder et al.]

  13. Algorithm (cont.)
     - Step 2. Expand best substructures by an edge, or by an edge + neighboring vertex
     [Figure: substructures (S) after expansion, each pairing two shapes through an “on” edge. Source: Holder et al.]

  14. Algorithm (cont.)
     - Step 3. Keep best beam-width substructures on queue
     - Step 4. Terminate when queue is empty or #discovered substructures >= limit
     - Step 5. Compress graph with hierarchical description
     Source: Holder et al., SRL Workshop

  15. Frequent Subgraph Mining Approaches
     - Problem: finding frequent subgraphs in a set of graphs
     - Apriori-based approach
       - AGM: Inokuchi et al. (PKDD’00)
       - FSG: Kuramochi and Karypis (ICDM’01)
       - PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
       - FFSM: Huan et al. (ICDM’03)
     - Pattern-growth approach
       - MoFa: Borgelt and Berthold (ICDM’02)
       - gSpan: Yan and Han (ICDM’02)
       - Gaston: Nijssen and Kok (KDD’04)
     - Closed pattern mining
       - CloseGraph: Yan and Han (KDD’03)

  16. Apriori-Based Approach
     - Level-wise algorithm: building candidate subgraphs from small frequent subgraphs
     [Figure: frequent subgraphs G, G’, G’’ are joined (JOIN) into candidate subgraphs G1, G2, …, Gn with an extra vertex or edge.]

  17. Apriori-Based Search
     - AGM (Apriori-based Graph Mining), Inokuchi et al., PKDD’00
       - Generates new graphs with one more node
     - FSG (Frequent SubGraph mining), Kuramochi and Karypis, ICDM’01
       - Generates new graphs with one more edge
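
The level-wise, one-more-edge growth can be sketched under a strong simplifying assumption: vertex labels are unique within each graph, so a graph is just a set of labeled edges and subgraph containment reduces to a subset test. Real AGM and FSG must handle repeated labels through canonical labeling and isomorphism checks, which this sketch omits. All names and data below are hypothetical.

```python
from itertools import combinations

def is_connected(edge_set):
    """Check that the edges form one connected component."""
    edges = list(edge_set)
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if (u in seen) != (v in seen):  # exactly one endpoint reached so far
                seen.update((u, v))
                changed = True
    return all(u in seen and v in seen for u, v in edges)

def support(pattern, graphs):
    """Number of graphs that contain every edge of the pattern."""
    return sum(1 for g in graphs if pattern <= g)

def apriori_subgraphs(graphs, min_support):
    all_edges = set().union(*graphs)
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e]), graphs) >= min_support}
    frequent = []
    while level:
        frequent.extend(level)
        # Join two frequent k-edge patterns that differ in exactly one edge.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1 and is_connected(a | b)}
        level = {c for c in candidates if support(c, graphs) >= min_support}
    return frequent

graphs = [
    frozenset([("a", "b"), ("b", "c"), ("c", "d")]),
    frozenset([("a", "b"), ("b", "c")]),
    frozenset([("a", "b"), ("c", "d")]),
]
frequent = apriori_subgraphs(graphs, min_support=2)
```

Here the three single edges are frequent, the two-edge path a-b-c survives with support 2, and the path b-c-d is pruned because only one graph contains it.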

  18. Pattern Growth Method
     [Figure: a k-edge graph G is extended to (k+1)-edge graphs G1, G2, …, Gn, which are in turn extended to (k+2)-edge graphs, and so on; the same graph can be reached along different paths (duplicate graph).]

  19. gSpan (Yan and Han, ICDM’02)
     - Depth-first search and right-most extension

  20. Graph Mining
     - Methods for mining frequent subgraphs
     - Applications:
       - Classification and clustering
       - Graph indexing
       - Similarity search

  21. Using Graph Patterns
     - Similarity measures based on graph patterns
       - Feature-based similarity measure
         - Each graph is represented as a feature vector
         - Frequent subgraphs can be used as features
         - Vector distance
       - Structure-based similarity measure
         - Maximal common subgraph
         - Graph edit distance: insertion, deletion, and relabeling
     - Frequent and discriminative subgraphs are high-quality indexing features
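
A feature-based similarity measure can be sketched as below. The sketch assumes that each graph is a set of labeled edges with unique vertex labels, so "the graph contains the pattern" becomes a subset test, and it uses Euclidean distance as one possible vector distance. The patterns and graphs are hypothetical toy data.

```python
import math

def feature_vector(graph_edges, patterns):
    """Binary vector: 1 if the graph contains the pattern, else 0."""
    return [1 if p <= graph_edges else 0 for p in patterns]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical frequent subgraphs used as features.
patterns = [
    frozenset([("a", "b")]),
    frozenset([("b", "c")]),
    frozenset([("a", "b"), ("b", "c")]),
]
g1 = frozenset([("a", "b"), ("b", "c"), ("c", "d")])
g2 = frozenset([("a", "b"), ("b", "c")])
g3 = frozenset([("a", "b"), ("c", "d")])

v1, v2, v3 = (feature_vector(g, patterns) for g in (g1, g2, g3))
d12 = euclidean(v1, v2)
d13 = euclidean(v1, v3)
```

Note that `g1` and `g2` are different graphs yet map to the same feature vector, so their distance is zero: the measure only sees what the chosen features capture, which is why discriminative features matter.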

  22. Social Network Analysis
     - Social network
     - Different levels of social network analysis
     - Common measures and methods for social network analysis
     - Link analysis

  23. Social Network
     - Social network: a social structure consisting of nodes and ties
     - Nodes are the individual actors within the networks
       - May be of different kinds
       - May have attributes, labels, or classes
     - Ties are the relationships between the actors
       - May be of different kinds
       - Links may have attributes and may be directed or undirected
     - Homogeneous networks
       - Single object type and single link type
       - Single-mode social networks (e.g., friends)
       - WWW: a collection of linked Web pages
     - Heterogeneous networks
       - Multiple object and link types
       - Medical network: patients, doctors, diseases, contacts, treatments
       - Bibliographic network: publications, authors, venues

  24. Small World Phenomenon
     - How many degrees of separation are there in actual social networks?
     - Six degrees of separation: everyone is, on average, six "steps" away from any other person on Earth
     - Empirical studies
       - Michael Gurevich, 1961. US population linked by 2 intermediaries
       - Duncan Watts, 2001. Email delivery on the Internet: average number of intermediaries is 6
       - Leskovec and Horvitz, 2007. Instant messages: average path length is 6.6
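
The "average path length" figures quoted in such studies can be computed on a small graph with breadth-first search. The sketch below assumes an unweighted, undirected, connected network stored as an adjacency dictionary; the four-person chain is a hypothetical example, not data from any of the studies above.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    total = pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

# Hypothetical chain of acquaintances: a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
apl = average_path_length(adj)
```

A path of length d passes through d - 1 intermediaries, so reports phrased as "number of intermediaries" and as "path length" differ by one.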
