Towards Graph (Re-)Compression
Design decisions and first results
Stefan Böttcher, University of Paderborn

Why re-compression of a compressed graph?
Large graphs ⇒ it takes a "long time" to find a "good" compression.
Idea: instead,
- do any compression "fast" and in parallel on small sub-graphs ⇒ get compressed sub-graphs "fast"
- re-compress the compressed sub-graphs ⇒ re-compression time depends on the size of the compressed sub-graphs

Overview of steps towards re-compressed graphs
- string compression, string re-compression
- ordered tree compression, ordered tree re-compression
- unordered tree compression, unordered tree re-compression
- graph compression, graph re-compression
[Figure: compression and re-compression carried over step by step: strings → ordered trees → unordered trees → graphs]

Why digram-based compression?
  S → b c d e c d e c d
  S → b N e N e N,  N → c d
  S → b N M M,  M → e N  (N → c d as before)
Replacing digram occurrences uses a "look for the smallest repeated pattern first" approach.
Larger frequently occurring patterns are substituted in multiple steps.

(Re-)Compression by replacing a most frequent digram
Example:
  S → b c d b c d
  S → b N b N,  N → c d
  S → M M,  M → b N,  N → c d
  S → M M,  M → b c d
(Re-)Compression algorithm for strings / trees / graphs:
  while at least one digram occurs more than once:
  - choose a most frequent digram D (e.g. c d)
  - (if re-compression: isolate all occurrences of D by smart inlining)
  - replace each occurrence of digram D by a new nonterminal N, which is thereafter treated as a terminal, i.e. not cut off again
  - introduce a grammar rule (e.g. N → c d)
  inline rules called only once (e.g. N → c d)

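The loop above can be sketched for plain strings as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function names (`count_digrams`, `compress`, `inline_single_use`, `expand`) and the nonterminal names `N1`, `N2`, ... are hypothetical, ties between equally frequent digrams are broken arbitrarily, and the input is assumed not to contain tokens that look like those nonterminal names.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent pair in seq."""
    counts = Counter()
    last_end = {}  # digram -> index just past its last counted occurrence
    for i in range(len(seq) - 1):
        d = (seq[i], seq[i + 1])
        if last_end.get(d, 0) <= i:  # skip overlaps such as "a a a"
            counts[d] += 1
            last_end[d] = i + 2
    return counts


def replace_digram(seq, digram, nt):
    """Replace each occurrence of digram in seq (left to right) by nt."""
    out, i = [], 0
    while i < len(seq):
        if tuple(seq[i:i + 2]) == digram:
            out.append(nt)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def inline_single_use(start, rules):
    """Inline every rule that is called exactly once."""
    changed = True
    while changed:
        changed = False
        bodies = [start, *rules.values()]
        uses = Counter(s for body in bodies for s in body if s in rules)
        for nt in list(rules):
            if uses[nt] == 1:
                body = rules.pop(nt)
                for rhs in [start, *rules.values()]:
                    if nt in rhs:
                        k = rhs.index(nt)
                        rhs[k:k + 1] = body
                changed = True
                break
    return start, rules


def compress(tokens):
    """Digram-based compression of a token list into (start rule, rules)."""
    start, rules = list(tokens), {}
    while True:
        counts = count_digrams(start)
        if not counts:
            break
        digram, freq = counts.most_common(1)[0]
        if freq < 2:  # no digram occurs more than once
            break
        nt = f"N{len(rules) + 1}"  # hypothetical naming scheme
        rules[nt] = list(digram)
        start = replace_digram(start, digram, nt)
    return inline_single_use(start, rules)


def expand(seq, rules):
    """Decompress: expand all nonterminals back into terminals."""
    out = []
    for s in seq:
        out.extend(expand(rules[s], rules) if s in rules else [s])
    return out
```

For the input `b c d b c d` this sketch ends with a start rule of two identical nonterminals and one remaining rule whose right-hand side is `b c d`, which has the same shape as the final grammar S → M M, M → b c d on the slide.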
Digrams for strings and for trees
A digram is a pair of typed items (c,d) in a given relationship r.
String: b c d e c d e c d
  digram (c,d) with r is "d follows c"
Tree: [Figure: example ordered tree and rule N → c(y1, d)]
  digram (c,d) with r is "d is the second child of c"
Unordered tree: edge order does not matter, like in graphs
  [Figure: example unordered tree]
  digram (c,d) with r is "d is a child of c"

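To make the difference between ordered and unordered tree digrams concrete, here is a small counting sketch. The tree encoding `(label, [children])`, the function name `tree_digrams`, and the example tree are assumptions made for illustration; the slide's own figure is not recoverable. The function counts raw occurrences, whereas a real compressor would also have to handle overlapping occurrences.

```python
from collections import Counter


def tree_digrams(tree, ordered=True):
    """Count parent/child digrams in a tree given as (label, [children]).

    ordered=True  -> key (parent, child position, child): "d is the i-th child of c"
    ordered=False -> key (parent, child):                 "d is a child of c"
    """
    label, children = tree
    counts = Counter()
    for i, child in enumerate(children, start=1):
        key = (label, i, child[0]) if ordered else (label, child[0])
        counts[key] += 1
        counts += tree_digrams(child, ordered)
    return counts


# Hypothetical example tree: a( c(b, d), c(e, d) )
example = ("a", [("c", [("b", []), ("d", [])]),
                 ("c", [("e", []), ("d", [])])])

ordered_counts = tree_digrams(example, ordered=True)
unordered_counts = tree_digrams(example, ordered=False)
# ordered_counts[("c", 2, "d")] == 2    ("d is the second child of c")
# unordered_counts[("c", "d")] == 2     ("d is a child of c")
```

In the ordered case the child position is part of the digram, so c(b, d) and c(d, b) would yield different digrams; in the unordered case they yield the same one.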
Digrams for a graph with labeled nodes and labeled edges
A digram is a pair of typed items (c,d) in a given relationship r.
Graph: [Figure: example graph with node labels f, b, c and hyperedge labels d, e]
- digram (f,b) with r is "nodes f and b are connected by a hyperedge from f to b"
- digram (d,e) with r is "there is a node shared by an incoming hyperedge d and an outgoing hyperedge e"
- digram (b,e) with r is "node b has an outgoing hyperedge e"
- digram (d,b) with r is "node b has an incoming hyperedge d"

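The four digram kinds listed for graphs can be enumerated with a short sketch. Several things here are assumptions: edges are simplified to ordinary directed edges with one source and one target (not general hyperedges), the kind names such as `"node-node"` are hypothetical, and the example graph f -d-> b -e-> c is only chosen to be consistent with the four digrams named on the slide.

```python
from collections import Counter


def graph_digrams(nodes, edges):
    """Count digrams in a graph with labeled nodes and labeled edges.

    nodes: dict node_id -> node label
    edges: list of (source_id, edge_label, target_id)
    """
    counts = Counter()
    incoming, outgoing = {}, {}
    for src, lab, dst in edges:
        # nodes connected by an edge from src to dst
        counts[("node-node", nodes[src], nodes[dst])] += 1
        # node has an outgoing edge with this label
        counts[("node-outedge", nodes[src], lab)] += 1
        # node has an incoming edge with this label
        counts[("inedge-node", lab, nodes[dst])] += 1
        outgoing.setdefault(src, []).append(lab)
        incoming.setdefault(dst, []).append(lab)
    # a node shared by an incoming edge d and an outgoing edge e
    for n in nodes:
        for d in incoming.get(n, []):
            for e in outgoing.get(n, []):
                counts[("inedge-outedge", d, e)] += 1
    return counts


# Hypothetical example graph: f -d-> b -e-> c
nodes = {1: "f", 2: "b", 3: "c"}
edges = [(1, "d", 2), (2, "e", 3)]
digrams = graph_digrams(nodes, edges)
```

For this example, `digrams` contains among others `("node-node", "f", "b")`, `("inedge-outedge", "d", "e")`, `("node-outedge", "b", "e")` and `("inedge-node", "d", "b")`, each with count 1.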
Re-compression of a compressed string / tree / graph
A string / tree / graph
  S → d c d c d c
that has been compressed to
  S → d N N c,  N → c d
can be re-compressed to
  S → M M M,  M → d c
to get a better compression.

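A quick check of this example: all three grammars generate the same string, and the re-compressed one is smaller. The size measure used here (total number of symbols on all right-hand sides) is an assumption for illustration; the slide does not state which measure it uses. The helper names `expand` and `size` are hypothetical.

```python
def expand(symbol, rules):
    """Expand a symbol of a straight-line grammar into its terminal string."""
    if symbol not in rules:
        return [symbol]
    out = []
    for s in rules[symbol]:
        out.extend(expand(s, rules))
    return out


def size(rules):
    """Assumed size measure: total number of right-hand-side symbols."""
    return sum(len(rhs) for rhs in rules.values())


original = {"S": "d c d c d c".split()}
compressed = {"S": "d N N c".split(), "N": "c d".split()}
recompressed = {"S": "M M M".split(), "M": "d c".split()}

for g in (original, compressed, recompressed):
    print(" ".join(expand("S", g)), "| size", size(g))
# prints:
# d c d c d c | size 6
# d c d c d c | size 6
# d c d c d c | size 5
```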
Re-compress a compressed string: 1. Count digrams
  S → d N N c,  N → c d

  digram generator    generated digram
  d N                 d c
  N                   c d (occurs twice)
  N N                 d c
  N c                 d c

⇒ (d,c) with r = "c follows d" is the most frequent digram in the decompressed string

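The table can be computed directly on the grammar, without decompressing it. The sketch below assumes a straight-line string grammar with non-empty right-hand sides, given as a dict from nonterminal to symbol list; the function name `count_generated_digrams` is hypothetical. Each adjacent pair (a, b) in a rule body generates the digram (last terminal of a, first terminal of b), once per call of that rule. It counts raw occurrences and does not treat overlapping occurrences specially.

```python
from collections import Counter


def count_generated_digrams(rules, start="S"):
    """Count digrams of the generated string by inspecting only the grammar."""

    def first(x):  # first terminal generated by symbol x
        return first(rules[x][0]) if x in rules else x

    def last(x):  # last terminal generated by symbol x
        return last(rules[x][-1]) if x in rules else x

    # Topological order (reverse DFS post-order): a rule comes before the rules it calls.
    order, seen = [], set()

    def visit(x):
        if x in rules and x not in seen:
            seen.add(x)
            for s in rules[x]:
                visit(s)
            order.append(x)

    visit(start)
    order.reverse()

    # calls[X] = how often rule X is applied when deriving the full string
    calls = Counter({start: 1})
    for x in order:
        for s in rules[x]:
            if s in rules:
                calls[s] += calls[x]

    counts = Counter()
    for x in order:
        rhs = rules[x]
        for a, b in zip(rhs, rhs[1:]):
            counts[(last(a), first(b))] += calls[x]
    return counts


grammar = {"S": "d N N c".split(), "N": "c d".split()}
print(dict(count_generated_digrams(grammar)))
# prints: {('d', 'c'): 3, ('c', 'd'): 2}
```

The result agrees with the table: (d,c) is generated three times (by d N, N N and N c), and (c,d) twice, because rule N is called twice.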
2. Isolate a most frequent digram by smart inlining
Task: isolate the most frequent digram (d,c) with r = "c follows d"
  S → d c N c N c,  N → c e f g d
Needed: partial decompression of N to isolate d from N.
New rules that isolate d from the end of N (N-d denotes N without its final d):
  N → N-d d,  N-d → c e f g
  S → d c N-d d c N-d d c
Trick: inline the rewritten rule N → N-d d instead of N → c e f g d.
Finally, substitute the digrams (d,c) with a new nonterminal M:
  S → M N-d M N-d M,  M → d c

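The two steps on this slide can be sketched for the simple case shown, where the item to isolate is a terminal at the end of a rule. The function names `isolate_last_terminal` and `replace_digram` are hypothetical, and the sketch does not cover the general case (for example a nonterminal at the end of the rule, or isolation at the front).

```python
def isolate_last_terminal(rules, nt):
    """Rewrite nt -> body x as nt-x -> body and inline 'nt-x x' at every call of nt."""
    body = rules[nt]
    last = body[-1]
    assert last not in rules, "sketch only handles a terminal at the end of the rule"
    rest = f"{nt}-{last}"  # e.g. "N-d": N without its final d
    new_rules = {}
    for head, rhs in rules.items():
        if head == nt:
            new_rules[rest] = body[:-1]
        else:
            out = []
            for s in rhs:
                out.extend([rest, last] if s == nt else [s])
            new_rules[head] = out
    return new_rules


def replace_digram(rules, digram, new_nt):
    """Replace every occurrence of digram in all rule bodies by new_nt."""
    out_rules = {}
    for head, rhs in rules.items():
        out, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + 2]) == digram:
                out.append(new_nt)
                i += 2
            else:
                out.append(rhs[i])
                i += 1
        out_rules[head] = out
    out_rules[new_nt] = list(digram)
    return out_rules


grammar = {"S": "d c N c N c".split(), "N": "c e f g d".split()}
isolated = isolate_last_terminal(grammar, "N")
# isolated["S"]   == ['d', 'c', 'N-d', 'd', 'c', 'N-d', 'd', 'c']
# isolated["N-d"] == ['c', 'e', 'f', 'g']
final = replace_digram(isolated, ("d", "c"), "M")
# final["S"] == ['M', 'N-d', 'M', 'N-d', 'M'],  final["M"] == ['d', 'c']
```

Only the single symbol d is pulled out of N; the rest of N stays compressed as N-d, which is why the grammar does not have to be fully decompressed.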
Re-compress a compressed ordered tree: 1. Count digrams
How to count all digrams generated by tree grammars?
[Figure: rules A, C and D of an SLT grammar drawn as trees, and the tree generated by A; A, C and D may be called several times]
The parent node (C) does not determine a digram, but the child (D) does.
Each non-root, non-parameter node (e.g. D) in the RHS of each rule of an SLT grammar represents (a child of) a digram
⇒ count the calls of rule A for the digram represented by child node D
⇒ O(size(G)) [ICDE2016]

2. Smarter inlining needed for ordered tree grammars
To isolate a digram:
- isolate the root terminal of the tree generated by D
- isolate the parent of the 2nd parameter of the tree generated by C
[Figure: rules A, C and D drawn as trees]
Needs smarter inlining:
[Figure: rewritten rules using the new nonterminals C-r and C-e]

Tree grammar re-compression: compression ratio

                                      EXI-Weblog  EXI-Telecomp  NCBI    Medline  XMark    Treebank
  #edges                              39          39            71      13096    34649    52266
  compression ratio                   0 %         0.05 %        0.06 %  4.71 %   7.94 %   20.67 %
  compression ratio with max blow-up  0 %         0.09 %        0.11 %  4.89 %   11.38 %  21.26 %

[Figure: bar chart of intermediate grammar (max) vs. final grammar, axis from 0% to 200%]
Smarter inlining yields intermediate blow-ups of a factor of 2 at most.
Document generated from a seed by 5000 updates:
- re-compression after every 100 updates: blow-ups of a factor of 5 at most
- without re-compression: blow-up up to a factor of 400