gSpan: Graph-Based Substructure Pattern Mining

Xifeng Yan, Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign
{xyan, hanj}@uiuc.edu

Abstract

We investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based Substructure pattern mining), which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs, and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Our performance study shows that gSpan substantially outperforms previous algorithms, sometimes by an order of magnitude.

1. Introduction

Frequent substructure pattern mining has been an emerging data mining problem with many scientific and commercial applications. As a general data structure, a labeled graph can be used to model complicated substructure patterns among data. Given a graph dataset $D = \{G_0, G_1, \ldots, G_n\}$, $support(g)$ denotes the number of graphs (in $D$) in which $g$ is a subgraph. The problem of frequent subgraph mining is to find every subgraph $g$ such that $support(g) \geq minSup$ (a minimum support threshold). To reduce the complexity of the problem (while also considering the connectivity property of hidden structures in most situations), only frequent connected subgraphs are studied in this paper.

The kernel of frequent subgraph mining is the subgraph isomorphism test. Many well-known pair-wise isomorphism testing algorithms have been developed. However, the frequent subgraph mining problem has not been explored well. Recently, Inokuchi et al. [4] proposed an Apriori-based algorithm, called AGM, to discover all frequent (both connected and disconnected) substructures. Kuramochi and Karypis [5] further developed the idea using an adjacency representation of graphs and an edge-growing strategy. Their algorithm, called FSG, is able to find all frequent connected subgraphs from a chemical compound dataset in 10 minutes with 6.5% minimum support. For the same dataset, our novel algorithm can complete the same task in 10 seconds.

AGM and FSG both take advantage of the Apriori level-wise approach [1]. In the context of frequent subgraph mining, the Apriori-like algorithms meet two challenges: (1) candidate generation: the generation of size $(k+1)$ subgraph candidates from size $k$ frequent subgraphs is more complicated and costly than that of itemsets; and (2) pruning false positives: the subgraph isomorphism test is an NP-complete problem, thus pruning false positives is costly.

Contribution. In this paper, we develop gSpan, which aims to reduce or avoid the significant costs mentioned above. If the entire graph dataset can fit in main memory, gSpan can be applied directly; otherwise, one can first perform graph-based data projection as in [6], and then apply gSpan. To the best of our knowledge, gSpan is the first algorithm that explores depth-first search (DFS) in frequent subgraph mining. Two techniques, DFS lexicographic order and minimum DFS code, are introduced here, which form a novel canonical labeling system to support DFS search. gSpan discovers all the frequent subgraphs without candidate generation and false positives pruning. It combines the growing and checking of frequent subgraphs into one procedure, thus accelerating the mining process.

2. DFS Lexicographic Order

This section introduces several techniques developed in gSpan, including mapping each graph to a DFS code (a sequence), building a novel lexicographic ordering among these codes, and constructing a search tree based on this lexicographic order.

DFS Subscripting. When performing a depth-first search [3] in a graph, we construct a DFS tree. One graph can have several different DFS trees. For example, the graphs in Fig. 1(b)-(d) are isomorphic to that in Fig. 1(a). The thickened edges in Fig. 1(b)-(d) represent three different DFS trees for the graph in Fig. 1(a). The depth-first discovery of the vertices forms a linear order. We use subscripts to label this
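
As a small illustration of the observation that one graph can have several different DFS trees, the following Python sketch runs a depth-first search twice on the same undirected graph, changing only the order in which neighbours are tried. The four-vertex graph, the vertex names, and the function `dfs_subscripts` are assumptions made for this example; they are not the graph of Fig. 1 and not part of gSpan itself.

```python
def dfs_subscripts(adj, start):
    """Run a DFS that tries neighbours in the order they are listed.

    Returns (order, tree_edges): `order` maps each vertex to its discovery
    subscript (0, 1, 2, ...), and `tree_edges` lists the DFS tree edges as
    (parent, child) pairs in the order they were followed.
    """
    order = {}
    tree_edges = []

    def visit(v):
        order[v] = len(order)          # next free subscript
        for w in adj[v]:
            if w not in order:         # undiscovered vertex: tree edge
                tree_edges.append((v, w))
                visit(w)

    visit(start)
    return order, tree_edges


def edge_set(adj):
    """Undirected edge set of an adjacency-list graph."""
    return {frozenset((u, v)) for u in adj for v in adj[u]}


# Hypothetical graph: a triangle a-b-c with an extra edge c-d.
# Both adjacency lists describe the same graph; only neighbour order differs.
adj_1 = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
adj_2 = {"a": ["c", "b"], "b": ["a", "c"], "c": ["d", "b", "a"], "d": ["c"]}

order_1, tree_1 = dfs_subscripts(adj_1, "a")
order_2, tree_2 = dfs_subscripts(adj_2, "a")

print(order_1, tree_1)
print(order_2, tree_2)
```

Both runs start at the same vertex of the same graph, yet they produce different discovery subscripts and different sets of tree edges, which is why a canonical choice among the possible DFS trees is needed.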
