Graph-based Learning
Larry Holder
Computer Science and Engineering
University of Texas at Arlington
Graph-based Learning
- Multi-relational data mining and learning
- SUBDUE graph-based relational learner
  - Discovery
  - Clustering
  - Graph grammar learning
  - Supervised learning
Multi-Relational Data Mining
- Looking for patterns involving multiple tables (relations) in a relational database

Person
ID   Last   First   Age  Income
P1   Doe    John    30   80000
P2   Doe    Sally   29   90000
P3   Smith  Robert  35   100000

Married
Person1  Person2
P1       P2
P3       P7

RichCouple(X,Y) ← Person(X,LastX,FirstX,AgeX,IncX) & Person(Y,LastY,FirstY,AgeY,IncY) & Married(X,Y) & (IncX + IncY) > 150000.
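The RichCouple rule can be read as a join of Married with two copies of Person followed by an income test. Below is a minimal Python sketch of that reading, using the rows shown on the slide; the names `persons`, `married` and `rich_couples` are illustrative and not part of any system discussed here.

```python
# Person and Married tables from the slide, as plain Python data.
# Tuple layout: (Last, First, Age, Income).
persons = {
    "P1": ("Doe", "John", 30, 80000),
    "P2": ("Doe", "Sally", 29, 90000),
    "P3": ("Smith", "Robert", 35, 100000),
}
married = [("P1", "P2"), ("P3", "P7")]


def rich_couples(persons, married, threshold=150000):
    """Return married pairs (X, Y) whose combined income exceeds threshold."""
    result = []
    for x, y in married:
        # Both partners must appear in Person, as the rule body requires.
        if x in persons and y in persons:
            if persons[x][3] + persons[y][3] > threshold:
                result.append((x, y))
    return result


couples = rich_couples(persons, married)
```

With the slide's data only (P1, P2) qualifies: their incomes sum to 170000, while P7 has no row in the Person table shown.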
Multi-Relational Data Mining
- Approaches
  - Transform to non-relational problem
  - First-order logic based
    - Inductive Logic Programming (ILP)
  - Graph based
Graph-based Data Mining
- Finding all subgraphs g within a set of graph transactions G such that
  $\frac{\mathit{freq}(g)}{|G|} > t$
  - where t is the minimum support
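To make the support test concrete, here is a small Python sketch that counts the transactions containing a pattern and compares the fraction with t. It uses brute-force subgraph matching, which is only practical for tiny graphs and is not how AGM, FSG or gSpan work; the graph encoding (a dict of vertex labels plus a set of `(src, dst, label)` edges) and the function names are assumptions made for illustration.

```python
from itertools import permutations


def contains(graph, pattern):
    """Brute-force test: does graph contain a subgraph isomorphic to pattern?"""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_nodes)
        edges_ok = all((m[s], m[d], lab) in g_edges for s, d, lab in p_edges)
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, transactions):
    """freq(g) / |G|: fraction of transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in transactions) / len(transactions)


def is_frequent(pattern, transactions, t):
    return support(pattern, transactions) > t


tri_on_sq = ({"a": "triangle", "b": "square"}, {("a", "b", "on")})
g1 = ({1: "triangle", 2: "square", 3: "circle"}, {(1, 2, "on"), (2, 3, "on")})
g2 = ({1: "triangle", 2: "square"}, {(1, 2, "on")})
g3 = ({1: "square", 2: "triangle"}, {(1, 2, "on")})  # square on triangle
transactions = [g1, g2, g3]
```

Here the pattern "triangle on square" occurs in two of the three transactions, so it is frequent for t = 0.5 but not for t = 0.7.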
Graph-based Data Mining
- Systems
  - Apriori-based Graph Mining (AGM)
    - Inokuchi, Washio and Motoda, 2003
  - Frequent Sub-Graph discovery (FSG)
    - Kuramochi and Karypis, 2001
  - Graph-based Substructure pattern mining (gSpan)
    - Yan and Han, 2002
- Focus on pruning and fast, code-based graph matching
Graph-based Relational Learning
- Finding patterns in graph(s)
  - Discovery
  - Clustering
  - Supervised learning
[Figure: the Person and Married tables drawn as a graph. Each Person vertex has Last, First, Age and Income edges to its values (Doe, John, 30, 80000; Doe, Sally, 29, 90000; Smith, Robert, 35, 100000), and Married edges connect Person vertices.]
Graph-based Relational Learning
- Graph-Based Induction (GBI)
  - Yoshida, Motoda and Indurkhya, 1994
- SUBstructure Discovery Using Examples (SUBDUE)
  - Cook and Holder, 1994
- Focus on efficient subgraph generation and compression-based heuristic search
SUBDUE Graph-based Discovery
- Graph representation
- Graph compression and MDL
- Discovery algorithm
- Inexact graph match
- Background knowledge
- Parallel/distributed discovery
Graph Representation
- Input is a labeled (vertices and edges) directed graph
- A substructure is a connected subgraph
- An instance of a substructure is an isomorphic subgraph of the input graph
- Input graph compressed by replacing instances with vertex representing substructure
[Figure: three panels. Input Database: triangles T1-T4 on squares S1-S4, with circle C1 and rectangle R1. Substructure S1 (graph form): an object with shape triangle "on" an object with shape square. Compressed Database: each instance replaced by an S1 vertex.]
Graph Representation
[Figure: graph whose vertices are labeled S1 and S2.]
Graph Compression and MDL
- Minimum Description Length (MDL) principle
  - Best theory minimizes description length of theory and the data given theory
- Best substructure S minimizes description length of substructure definition DL(S) and compressed graph DL(G|S)
  $\min_S \left( DL(S) + DL(G|S) \right)$
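As a rough worked example of the criterion, the Python sketch below replaces the bit-level description length with a much cruder proxy, the number of vertices plus edges. That proxy, and the function names, are assumptions for illustration only. The numbers come from the shapes example used in these slides: a graph with 20 vertices and 19 edges, in which the "triangle on square" substructure (4 vertices, 3 edges) has 4 non-overlapping instances.

```python
def size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def total_description_size(V, E, sv, se, n):
    """Proxy for DL(S) + DL(G|S).

    V, E: vertices and edges of the input graph.
    sv, se: vertices and edges of the substructure S.
    n: number of non-overlapping instances of S in the graph.
    Each instance collapses to a single vertex; its internal edges vanish.
    """
    dl_s = size(sv, se)
    dl_g_given_s = size(V - n * sv + n, E - n * se)
    return dl_s + dl_g_given_s


uncompressed = size(20, 19)
with_s1 = total_description_size(20, 19, sv=4, se=3, n=4)
with_object = total_description_size(20, 19, sv=1, se=0, n=10)
```

Under this proxy the uncompressed graph costs 39, compressing with the triangle-on-square substructure costs 7 + 15 = 22, and compressing with the single vertex "object" costs 40, so the larger substructure is preferred.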
Discovery Algorithm
1. Create substructure for each unique vertex label
Substructures: triangle (4), square (4), circle (1), rectangle (1)
[Figure: the example graph of triangles, squares, a circle and a rectangle connected by "on" edges.]
Discovery Algorithm
2. Expand best substructures by an edge or edge+neighboring vertex
[Figure: the expanded substructures, each a pair of shapes joined by an "on" edge, shown beside the example graph.]
Discovery Algorithm
3. Keep only best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures > limit
5. Compress graph and repeat to generate hierarchical description
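Steps 1 to 4 can be sketched as a beam search. The Python below is a simplified illustration, not SUBDUE itself, and rests on several assumptions: instances are grouped by a cheap label signature instead of a true isomorphism test, substructures are scored by a vertices-plus-edges proxy for DL(S) + DL(G|S), non-overlapping instances are counted greedily, and step 5 (compress and repeat) is left out. All names are hypothetical.

```python
from collections import defaultdict


def discover(labels, edges, beam_width=4, limit=50):
    """labels: {vertex: label}; edges: list of (src, dst, edge_label)."""

    def key(inst):
        # Cheap signature standing in for an isomorphism class (assumption).
        vertices, edge_ids = inst
        if not edge_ids:
            (v,) = vertices
            return (1, (labels[v],))
        triples = sorted((labels[edges[i][0]], edges[i][2], labels[edges[i][1]])
                         for i in edge_ids)
        return (len(vertices), tuple(triples))

    def count_disjoint(instances):
        # Greedy count of instances that share no vertex.
        used, n = set(), 0
        for vertices, _ in sorted(instances, key=lambda inst: sorted(inst[0])):
            if used.isdisjoint(vertices):
                used |= vertices
                n += 1
        return n

    def total_size(instances):
        # Size proxy for DL(S) + DL(G|S); smaller is better.
        vertices, edge_ids = next(iter(instances))
        sv, se, n = len(vertices), len(edge_ids), count_disjoint(instances)
        return (sv + se) + (len(labels) - n * sv + n) + (len(edges) - n * se)

    def top(groups):
        ranked = sorted(groups.items(), key=lambda kv: total_size(kv[1]))
        return ranked[:beam_width]  # step 3: keep the best beam_width

    # Step 1: one substructure per unique vertex label.
    groups = defaultdict(set)
    for v in labels:
        inst = (frozenset([v]), frozenset())
        groups[key(inst)].add(inst)

    best, expanded, queue = None, 0, top(groups)
    while queue and expanded <= limit:  # step 4: termination
        children = defaultdict(set)
        for sig, instances in queue:
            expanded += 1
            if best is None or total_size(instances) < total_size(best[1]):
                best = (sig, instances)
            # Step 2: extend each instance by one adjacent edge.
            for vertices, edge_ids in instances:
                for i, (s, d, _) in enumerate(edges):
                    if i not in edge_ids and (s in vertices or d in vertices):
                        inst = (vertices | {s, d}, edge_ids | {i})
                        children[key(inst)].add(inst)
        queue = top(children)
    return best


# The shapes example: ten objects, each with a shape, linked by "on" edges.
shapes = ["triangle"] * 4 + ["square"] * 4 + ["circle", "rectangle"]
labels = {v: "object" for v in range(1, 11)}
labels.update({10 + i: shape for i, shape in enumerate(shapes, start=1)})
edges = [(i, 10 + i, "shape") for i in range(1, 11)]
edges += [(1, 5, "on"), (2, 6, "on"), (3, 7, "on"), (4, 8, "on"), (5, 10, "on"),
          (9, 10, "on"), (10, 2, "on"), (10, 3, "on"), (10, 4, "on")]

best_key, best_instances = discover(labels, edges)
```

On this graph the search settles on the four-vertex "triangle on square" substructure with its four instances. Because the signature ignores edge direction and graph shape, it can merge substructures that are not isomorphic, so a real implementation would need a proper graph match at that point.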
DNA Example
Sample SUBDUE Input
[Figure: the shapes example, with triangles T1-T4, squares S1-S4, circle C1 and rectangle R1.]
sample.g:
v 1 object
v 2 object
v 3 object
v 4 object
v 5 object
v 6 object
v 7 object
v 8 object
v 9 object
v 10 object
v 11 triangle
v 12 triangle
v 13 triangle
v 14 triangle
v 15 square
v 16 square
v 17 square
v 18 square
v 19 circle
v 20 rectangle
e 1 11 shape
e 2 12 shape
e 3 13 shape
e 4 14 shape
e 5 15 shape
e 6 16 shape
e 7 17 shape
e 8 18 shape
e 9 19 shape
e 10 20 shape
e 1 5 on
e 2 6 on
e 3 7 on
e 4 8 on
e 5 10 on
e 9 10 on
e 10 2 on
e 10 3 on
e 10 4 on
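A file in this shape is simple to read. The following Python sketch parses only the two record types that appear in the sample, `v <id> <label>` and `e <source> <target> <label>`, and silently skips anything else; the function name and the returned structure are assumptions for illustration, and the real input format may allow more than is shown here.

```python
def parse_subdue(text):
    """Parse 'v id label' and 'e src dst label' lines into a labeled graph.

    Returns (vertices, edges): vertices maps id -> label, edges is a list
    of (src, dst, label) tuples in file order.
    """
    vertices, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # blank line
        if parts[0] == "v" and len(parts) >= 3:
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "e" and len(parts) >= 4:
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
        # Any other record type is ignored in this sketch.
    return vertices, edges


sample = """\
v 1 object
v 5 object
v 11 triangle
v 15 square
e 1 11 shape
e 5 15 shape
e 1 5 on
"""
vertices, edges = parse_subdue(sample)
```

The excerpt above is one instance of the triangle-on-square pattern taken from sample.g: object 1 has shape triangle, object 5 has shape square, and 1 is on 5.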
Inexact Graph Match
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold
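The matching rule can be illustrated with a brute-force Python sketch for very small graphs. It assumes unit costs (1 per relabeled vertex, per unmatched vertex, and per missing, extra or relabeled edge) and takes size to be the vertex-plus-edge count of the larger graph. Both choices are assumptions for illustration and may differ from the cost model SUBDUE uses. A graph is encoded as a dict of vertex labels plus a dict mapping `(src, dst)` to an edge label.

```python
from itertools import permutations


def match_cost(g1, g2):
    """Least unit-cost transformation of one graph into the other."""
    (l1, e1), (l2, e2) = g1, g2
    if len(l1) > len(l2):  # make g1 the smaller graph; the cost is symmetric
        (l1, e1), (l2, e2) = (l2, e2), (l1, e1)
    v1, v2 = list(l1), list(l2)
    best = None
    for image in permutations(v2, len(v1)):
        m = dict(zip(v1, image))
        cost = sum(l1[v] != l2[m[v]] for v in v1)  # relabeled vertices
        cost += len(v2) - len(v1)                  # unmatched vertices
        mapped = {(m[s], m[d]): lab for (s, d), lab in e1.items()}
        for pair in mapped.keys() | e2.keys():
            if mapped.get(pair) != e2.get(pair):   # missing, extra or relabeled
                cost += 1
        best = cost if best is None else min(best, cost)
    return best


def inexact_match(g1, g2, threshold):
    """Match if cost/size < threshold (size: larger graph, by assumption)."""
    size = max(len(g1[0]) + len(g1[1]), len(g2[0]) + len(g2[1]))
    return match_cost(g1, g2) / size < threshold


tri_on_square = ({1: "triangle", 2: "square"}, {(1, 2): "on"})
tri_on_circle = ({1: "triangle", 2: "circle"}, {(1, 2): "on"})
```

The two example graphs differ by one vertex label, so the cost is 1 and the size is 3. They match at a threshold of 0.5 (1/3 < 0.5) but not at 0.2.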
Inexact Graph Match
[Figure: two small graphs (vertices 1-2 and vertices 3-5, vertex labels A and B, edge labels a and b) and the search tree of partial vertex mappings. The first level maps vertex 1 to 3, 4, 5 or λ; each branch is extended with a mapping for vertex 2, and every node carries its cost.]
Least-cost match is {(1,4), (2,3)}
Inexact Graph Match
- Vertices considered by degree
- Polynomially constrained
  - Greedy after n^k partial mappings considered
  - Suboptimal mappings rare for k>2
Background Knowledge
- User-defined substructures
- Two alternative uses
  - Prime search queue
  - Initial graph compression
- Variant of discovery algorithm used to generate instances
Parallel/Distributed Discovery
- Divide graph into P partitions
- Distribute to P processors
- Each processor performs serial discovery on local partition
- Broadcast best substructures, evaluate on other processors
- Master processor stores best global substructures