SEG5010 Presentation
GRAPH COMPRESSION AND SUMMARIZATION
Wei Zhang
Dept. of Information Engineering, The Chinese University of Hong Kong
Most of the slides are borrowed from the authors' original presentations:
- http://www.cs.umd.edu/~saket/pubs/sigmod2008.ppt
- http://videolectures.net/kdd09_kumar_ocsn/
GRAPH SUMMARIZATION WITH BOUNDED ERROR
- Saket Navlakha (UMCP)
- Rajeev Rastogi (Yahoo! Labs, India)
- Nisheeth Shrivastava (Bell Labs India)
LARGE GRAPHS
[Figure: example graphs with node labels such as yahoo.com, cnn.com, jokes.com, 10.1.1.1 and 20.20.2.2; panels labeled "Social Networks" and "email"]
- Many interactions can be represented as graphs
  - Webgraphs: search engines, etc.
  - Netflow graphs (which IPs talk to each other): traffic patterns, security, worm attacks
  - Social (friendship) networks: mining user communities, viral marketing
  - Email exchanges: security, virus spread, spam detection
  - Market basket data: customer profiles, targeted advertising
- Need to compress and understand
  - Webgraph ~ 50 billion edges; social networks ~ a few million, growing quickly
  - Compression reduces size to one-tenth (webgraphs)
OUR APPROACH
- Graph compression (reference encoding)
  - Not applicable to all graphs: uses URLs and node labels for compression
  - Resulting structure is hard to visualize/interpret
- Graph clustering
  - Nice summary, works for generic graphs
  - No compression: needs the same memory to store the graph itself
- Our MDL-based representation R = (S, C)
  - S is a high-level summary graph: compact, highlights dominant trends, easy to visualize
  - C is a set of edge corrections: helps in reconstructing the graph
  - Compression based on the MDL principle: minimize the cost of S + C
    - Information-theoretic approach; parameterless; applicable to any graph
  - Novel approximate representation: reconstructs the graph with bounded error (ε); results in better compression
HOW DO WE COMPRESS?
[Figure: graph between nodes d, e, f, g and nodes a, b, c; summary with supernodes X = {d,e,f,g} and Y = {a,b,c}]
- Compression possible (S)
  - Many nodes with similar neighborhoods
    - Communities in social networks; link-copying in webpages
  - Collapse such nodes into supernodes (clusters) and the edges into superedges
    - Bipartite subgraph to two supernodes and a superedge
    - Clique to supernode with a "self-edge"
HOW DO WE COMPRESS?
[Figure: original graph on nodes a through j, cost = 14 edges; summary with supernodes X = {d,e,f,g} and Y = {a,b,c}, single nodes h, i, j, and corrections +(a,h), +(c,i), +(c,j), -(a,d); cost = 5 (1 superedge + 4 corrections)]
- Need to correct mistakes (C)
  - Most superedges are not complete
    - Nodes don't have exactly the same neighbors: friends in social networks
  - Remember edge corrections
    - Edges implied by a superedge but not present in the graph (-ve corrections)
    - Extra edges of the graph not counted in superedges (+ve corrections)
- Minimize overall storage cost = S + C
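The cost figures on this slide can be checked with a short sketch. The edge list below is an assumption: it is rebuilt from the slide's supernodes and corrections (every X-Y pair except (a,d), plus the three extra edges), and the helper name `edge` is illustrative only.

```python
from itertools import product

def edge(u, v):
    """Undirected edge stored as an order-independent pair."""
    return frozenset((u, v))

# Supernodes from the slide.
X = {"d", "e", "f", "g"}
Y = {"a", "b", "c"}

# Assumed input graph: every X-Y pair except (a,d), plus three extra edges.
graph = {edge(u, v) for u, v in product(X, Y)} - {edge("a", "d")}
graph |= {edge("a", "h"), edge("c", "i"), edge("c", "j")}

# The single superedge (X, Y) stands for all |X| * |Y| node pairs.
implied = {edge(u, v) for u, v in product(X, Y)}

minus_corrections = implied - graph  # implied by the superedge, absent from G
plus_corrections = graph - implied   # present in G, not covered by the superedge

original_cost = len(graph)
summary_cost = 1 + len(plus_corrections) + len(minus_corrections)
print(original_cost, summary_cost)  # 14 5
```

Storing the graph as 14 explicit edges costs 14, while one superedge plus four corrections costs 5, which matches the numbers in the figure.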
REPRESENTATION STRUCTURE R = (S, C)
[Figure: example graph on nodes a through j and its summary with X = {d,e,f,g}, Y = {a,b,c}, single nodes h, i, j, and C = {+(a,h), +(c,i), +(c,j), -(a,d)}]
- Summary $S(V_S, E_S)$
  - Each supernode $v$ represents a set of nodes $A_v$
  - Each superedge $(u,v)$ represents all pairs of nodes $\pi_{uv} = A_u \times A_v$
- Corrections $C$: $\{(a,b)$; $a$ and $b$ are nodes of $G\}$
- Supernodes are key; superedges/corrections are easy
  - $E_{uv}$: actual edges of $G$ between $A_u$ and $A_v$
  - Cost with $(u,v)$ = $1 + |\pi_{uv} - E_{uv}|$
  - Cost without $(u,v)$ = $|E_{uv}|$
  - Choose the minimum; this decides whether superedge $(u,v)$ is in $S$
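The superedge decision above can be sketched as a small function. This is a minimal illustration, not the authors' code: the names `superedge_costs` and `keep_superedge` are hypothetical, supernodes are assumed to be disjoint node sets, and ties are assumed to be resolved in favor of no superedge.

```python
from itertools import combinations, product

def edge(u, v):
    """Undirected edge stored as an order-independent pair."""
    return frozenset((u, v))

def superedge_costs(A_u, A_v, graph_edges):
    """Return (cost with superedge, cost without superedge) for supernodes A_u, A_v."""
    if A_u == A_v:
        # Self-edge on one supernode: all pairs inside the supernode.
        pi_uv = {edge(a, b) for a, b in combinations(sorted(A_u), 2)}
    else:
        pi_uv = {edge(a, b) for a, b in product(A_u, A_v)}
    E_uv = pi_uv & graph_edges           # edges of G between A_u and A_v
    cost_with = 1 + len(pi_uv - E_uv)    # superedge + one -ve correction per missing edge
    cost_without = len(E_uv)             # one +ve correction per actual edge
    return cost_with, cost_without

def keep_superedge(A_u, A_v, graph_edges):
    """Assumed tie-break: add the superedge only if it is strictly cheaper."""
    cost_with, cost_without = superedge_costs(A_u, A_v, graph_edges)
    return cost_with < cost_without

# Example graph rebuilt from the slide's summary (an assumption, see lead-in).
X, Y, H = {"d", "e", "f", "g"}, {"a", "b", "c"}, {"h"}
graph = {edge(u, v) for u, v in product(X, Y)} - {edge("a", "d")}
graph |= {edge("a", "h"), edge("c", "i"), edge("c", "j")}

print(superedge_costs(X, Y, graph))  # (2, 11): superedge (X, Y) is kept
print(superedge_costs(Y, H, graph))  # (3, 1): edge (a, h) stays a +ve correction
```

Once the supernodes are fixed, each pair of supernodes can be decided independently in this way, which is why the slide calls the superedges and corrections the easy part.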
REPRESENTATION STRUCTURE R = (S, C)
- Reconstructing the graph from R
  - For all superedges $(u,v)$ in $S$, insert every edge in $\pi_{uv}$
  - For all +ve corrections +(a,b), insert edge (a,b)
  - For all -ve corrections -(a,b), delete edge (a,b)
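The three reconstruction steps can be sketched as follows. The data layout (a dict of supernodes, a list of superedges, and corrections as `(sign, a, b)` tuples) and the function name `reconstruct` are assumptions made for illustration.

```python
from itertools import combinations, product

def edge(u, v):
    """Undirected edge stored as an order-independent pair."""
    return frozenset((u, v))

def reconstruct(supernodes, superedges, corrections):
    """Rebuild the edge set of G from R = (S, C)."""
    edges = set()
    # Step 1: expand every superedge (u, v) into all pairs pi_uv.
    for u, v in superedges:
        if u == v:
            pairs = combinations(sorted(supernodes[u]), 2)  # self-edge (clique)
        else:
            pairs = product(supernodes[u], supernodes[v])
        edges.update(edge(a, b) for a, b in pairs)
    # Steps 2 and 3: apply +ve corrections (insert) and -ve corrections (delete).
    for sign, a, b in corrections:
        if sign == "+":
            edges.add(edge(a, b))
        else:
            edges.discard(edge(a, b))
    return edges

# Summary and corrections from the slide's example.
supernodes = {"X": {"d", "e", "f", "g"}, "Y": {"a", "b", "c"}}
superedges = [("X", "Y")]
corrections = [("+", "a", "h"), ("+", "c", "i"), ("+", "c", "j"), ("-", "a", "d")]

graph = reconstruct(supernodes, superedges, corrections)
print(len(graph))  # 14
```

On the slide's example this recovers 14 edges from 1 superedge and 4 corrections.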
APPROXIMATE REPRESENTATION $R_\epsilon$
[Figure: example with X = {d,e,f,g}, Y = {a,b}, C = {-(a,d), -(a,f)}; for ε = 0.5, we can remove one correction of a]
- Approximate representation
  - Recreating the input graph exactly is not always necessary
  - A reasonable approximation is enough to compute communities, anomalous traffic patterns, etc.
  - Use the approximation leeway to get further cost reduction
- Generic neighbor query
  - Given node $v$, find its neighbors $N_v$ in $G$
  - The approximate neighbor set $N'_v$ estimates $N_v$ with $\epsilon$-accuracy
  - Bounded error: $\mathrm{error}(v) = |N'_v - N_v| + |N_v - N'_v| \le \epsilon |N_v|$
  - The number of neighbors added or deleted is at most an $\epsilon$-fraction of the true neighbors
- Intuition for computing $R_\epsilon$
  - If correction (a,d) is deleted, it adds error for both a and d
  - From the exact representation R for G, remove the maximum number of corrections such that the $\epsilon$-error guarantees still hold
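The intuition for computing $R_\epsilon$ can be illustrated with a naive greedy pass over the corrections. This is only a sketch of the error bookkeeping, not the APX-MDL or APX-GREEDY algorithms named in the outline; the function name `prune_corrections`, the first-come processing order, and the node degrees in the example are all hypothetical.

```python
def prune_corrections(true_degree, corrections, eps):
    """Greedily drop corrections while error(v) <= eps * |N_v| holds for every node.

    true_degree: dict mapping node v to |N_v| in the original graph G.
    corrections: list of (sign, a, b) tuples, sign is "+" or "-".
    Returns (kept, dropped).
    """
    error = {v: 0 for v in true_degree}
    kept, dropped = [], []
    for sign, a, b in corrections:
        # Dropping a correction on (a, b) makes that one edge wrong, which
        # adds exactly 1 to error(a) and 1 to error(b).
        if all(error[v] + 1 <= eps * true_degree[v] for v in (a, b)):
            error[a] += 1
            error[b] += 1
            dropped.append((sign, a, b))
        else:
            kept.append((sign, a, b))
    return kept, dropped

# Hypothetical degrees: a has 4 true neighbors; d, f and g have 2 each.
true_degree = {"a": 4, "d": 2, "f": 2, "g": 2}
corrections = [("-", "a", "d"), ("-", "a", "f"), ("-", "a", "g")]

kept, dropped = prune_corrections(true_degree, corrections, eps=0.5)
print(kept)     # [('-', 'a', 'g')]
print(dropped)  # [('-', 'a', 'd'), ('-', 'a', 'f')]
```

With ε = 0.5, node a tolerates two wrong neighbors, so the third correction must be kept. A first-come order like this one is not guaranteed to remove the maximum number of corrections.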
COMPARISON WITH EXISTING TECHNIQUES
- Webgraph compression [Adler-DCC-01]
  - Uses nodes sorted by URLs: not applicable to other graphs
  - More focus on bitwise compression: represents the sequence of neighbors (ids) using the fewest bits
- Clique stripping [Feder-pods-99]
  - Collapses edges of a complete bipartite subgraph into a single cluster
  - Only compresses very large, complete bi-cliques
- Representing webgraphs [Raghavan-icde-03]
  - Represents webgraphs as SNodes, Sedges
  - Uses URLs of nodes for compression (not applicable to other graphs)
  - No concept of approximate representation
OUTLINE
- Compressed graph
  - MDL representation R = (S, C); ε-representation
- Computing R
  - GREEDY, RANDOMIZED
- Computing $R_\epsilon$
  - APX-MDL, APX-GREEDY
- Experimental results
- Conclusions and future work