Discovering Frequent Topological Structures from Graph Datasets

R. Jin, C. Wang, D. Polshakov, S. Parthasarathy, G. Agrawal
Department of Computer Science and Engineering
Ohio State University, Columbus OH 43210
{jinr,wachao,polshako,srini,agrawal}@cse.ohio-state.edu

ABSTRACT

The problem of finding frequent patterns from graph-based datasets is an important one that finds applications in drug discovery, protein structure analysis, XML querying, and social network analysis, among others. In this paper we propose a framework to mine frequent large-scale structures, formally defined as frequent topological structures, from graph datasets. Key elements of our framework include fast algorithms for discovering frequent topological patterns based on the well-known notion of a topological minor, algorithms for specifying and pushing constraints deep into the mining process for discovering constrained topological patterns, and mechanisms for specifying approximate matches when discovering frequent topological patterns in noisy datasets. We demonstrate the viability and scalability of the proposed algorithms on real and synthetic datasets and also discuss the use of the framework to discover meaningful topological structures from protein structure data.

1. INTRODUCTION

Recently, there has been a lot of interest in mining frequent patterns from structured datasets, such as chemical compounds, proteins, web-logs, and XML datasets. Such patterns can effectively summarize the data, provide key insights, and often serve as a preprocessing step for further analysis. Since such datasets can often be modeled as graphs, a majority of research in this area has focused on developing efficient algorithms for mining frequently occurring (connected) subgraphs [9, 10, 18, 13].

However, in many real-world applications, such as biology, social networks, and telecommunication, large-scale structures, which provide high-level topological information of graphs, may be equally or more important than discovering the basic components. For instance, the discovery of non-local or tertiary structural information is an important problem in protein structure analysis. Similarly, in the analysis of social or communication networks, the direct connection between a pair of nodes is often not the focus; instead, the patterns in which several nodes are connected through a set of independent paths are of greater interest. Such frequent large-scale structures can be very hard to discover using current frequent subgraph mining approaches. This is not only because the subgraphs sharing these kinds of structures can be infrequent (i.e., the traditional anti-monotone property leveraged by most such algorithms does not hold), but also because the individual subgraphs are not adequately abstracted or represented.

As an example of the large-scale structures we are focusing on, consider mining a protein dataset where each protein is represented as a graph. The vertices of each graph are protein secondary structures, and an edge is associated with two protein secondary structures if their distance in the three-dimensional space is within a certain range. A frequent large-scale topological structure in such a dataset can be as follows: three α-helices that are not direct neighbors of each other, but form a triangle in the three-dimensional space. Specifically, in the graphs for different proteins, each pair of the above α-helices is connected through independent paths formed by other secondary structures, possibly including α-helices, β-sheets, or loops. The triangle information can be useful for understanding the functionalities of these proteins. For instance, two DNA-binding regulatory proteins (1ALI and 1E31), though seemingly different from the local-structure perspective, share such an α-helix triangle and perform similar functionalities [5]. In fact, both belong to the class of zinc finger proteins. However, because this kind of structure is hidden under the pair-wise relationship, it is very unlikely to be identified using the existing frequent subgraph mining approaches. In particular, even if some subgraphs which embed the three α-helices may appear to be frequent, the triangle structure can easily be missed.

The main contribution of this paper is a framework to mine frequent large-scale structures from graphs. Our work is inspired by a well-established mathematical concept, the topological minor [4]. A topological minor of a graph is an abstraction that focuses on its structural information. Intuitively, such an abstraction is achieved by replacing or contracting independent paths in a subgraph with individual edges.

An important notion in our framework is that of a relabeling function. Since real datasets are often best represented as labeled graphs, replacing independent paths in a subgraph with edges loses the label information on those paths. However, in many applications, summarized information about the contracted paths can be useful to categorize these topological structures. For example, we may prefer to distinguish the α-helix triangles of different sizes, and the length of each independent path connecting these α-helices can help to provide such a measurement. Our framework supports this notion through user-defined relabeling functions to recover some degree of information loss from the contracted paths. Such a function maps an entire labeled path to a single edge label. In other words, an edge label carries the desired information about its corresponding contracted path. For instance, in the above example, the relabeling function can use the length of each contracted path as its corresponding edge label. An additional benefit of the relabeling function is that it can be used to support the mining of constrained topological structures.

To summarize, the main contributions of this paper are as follows:

1. We introduce a novel framework for discovering frequent topological structures from graph datasets based on a vertical mining approach.

2. We study the basic properties of relabeling functions, and demonstrate their use for summarization and discovery of