Smallest grammar by recompression
Artur Jeż
Max Planck Institute for Informatics
17.06.2013

Grammar-based compression

Represent w as a CFG generating it.

Advantages
  – it is usually small (at most quadratic vs. LZ)
  – compression is fast
  – the compression ratio can be exponential on good data
  – extracts hierarchical structure
  – it is easy to work on
  – related to LZW and LZ

Smallest grammar

Problem
Given w, return the smallest CFG G_w such that L(G_w) = {w}.

With an O(1)-factor increase in size, G_w can be taken to be an SLP.

Definition (SLP: Straight Line Programme)
  – CFG with ordered nonterminals X_1, X_2, ...
  – Chomsky normal form
  – for X_i → X_j X_k we have j, k < i
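The SLP definition can be made concrete with a small sketch. The encoding below is an assumption made for illustration, not something from the talk: rule i is either a terminal letter or a pair (j, k) standing for X_i → X_j X_k, and the helper names `is_slp`, `expand` and `derived_length` are hypothetical.

```python
from typing import Dict, Tuple, Union

# Assumed encoding: rules[i] is a terminal letter, or (j, k) for X_i -> X_j X_k.
Rule = Union[str, Tuple[int, int]]


def is_slp(rules: Dict[int, Rule]) -> bool:
    """Check the ordering condition: X_i -> X_j X_k requires j, k < i."""
    for i, rhs in rules.items():
        if isinstance(rhs, tuple):
            j, k = rhs
            if j not in rules or k not in rules or j >= i or k >= i:
                return False
    return True


def expand(rules: Dict[int, Rule], start: int) -> str:
    """Return the single word derived from X_start (iterative, left to right)."""
    out, stack = [], [start]
    while stack:
        rhs = rules[stack.pop()]
        if isinstance(rhs, str):
            out.append(rhs)
        else:
            j, k = rhs
            stack.append(k)  # pushed first, so it is expanded after X_j
            stack.append(j)
    return "".join(out)


def derived_length(rules: Dict[int, Rule], start: int) -> int:
    """Length of the derived word, computed bottom-up without expanding it."""
    lengths: Dict[int, int] = {}
    for i in sorted(rules):  # the ordering guarantees j, k are already known
        rhs = rules[i]
        lengths[i] = 1 if isinstance(rhs, str) else lengths[rhs[0]] + lengths[rhs[1]]
    return lengths[start]


# Repeated doubling: 11 rules derive a word of length 2^10.
doubling: Dict[int, Rule] = {1: "a"}
for i in range(2, 12):
    doubling[i] = (i - 1, i - 1)
```

The `doubling` grammar shows the "exponential on good data" behaviour in miniature: each extra rule doubles the length of the derived word, and `derived_length` relies on the ordering j, k < i to process rules in a single pass.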

What is known

Best approximation ratio: O(log(n/g)), where g is the size of the optimal grammar.

Rytter
  – represent w as LZ, size ℓ ≤ g
  – translation of LZ into an SLP, size O(ℓ log(n/ℓ)) ≤ O(g log(n/g))
  – the intermediate grammar is balanced (AVL-type condition)
Charikar et al.
  – similar to Rytter
  – different balance criterion (length of word)
Sakamoto
  – local replacement rules (plus a global partition): pairs and blocks
  – analysis vs. LZ

Linear time.

This talk

Very simple linear-time algorithm, O(log(n/g)) approximation.

analysis in the recompression framework, vs. SLP
  – very robust
  – good: easier to show better approximation?
  – bad: might in fact be larger
not balanced
  – good: easier to show approximation?
  – bad: worse for further processing
height O(log n), when a_ℓ (deriving a^ℓ) has height 1

Algorithm similar to Sakamoto, different analysis.

Example

a a a b a b c a b a b b a b c b a
a_3 b a b c a b a b b a b c b a        a_3 → a^3
a_3 b a b c a b a b_2 a b c b a        a_3 → a^3, b_2 → b^2
a_3 b d c d a b_2 d c b a              a_3 → a^3, b_2 → b^2, d → ab
a_3 b d c d a b_2 d c e                a_3 → a^3, b_2 → b^2, d → ab, e → ba

Intuition
  – Phases: compress only the pairs and blocks present at the beginning of a phase.
  – Treat nonterminals as letters.
  – To speed up, perform several pair compressions simultaneously (partition Σ into Σ_ℓ and Σ_r, compress pairs from Σ_ℓ Σ_r).
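The example can be replayed step by step with two small helpers. This is only an illustrative sketch: the function names `compress_blocks` and `compress_pair` and the string names "a3", "b2", "d", "e" for the fresh symbols are assumptions chosen to mirror the slide. The pairs ab and ba are compressed one after the other here, as on the slide, rather than simultaneously through a partition.

```python
def compress_blocks(word, letter):
    """Replace every maximal block letter^l with l >= 2 by a fresh symbol."""
    out, rules, i = [], {}, 0
    while i < len(word):
        j = i
        while j < len(word) and word[j] == word[i]:
            j += 1  # word[i:j] is a maximal run of one symbol
        run = j - i
        if word[i] == letter and run >= 2:
            name = f"{letter}{run}"       # hypothetical naming: a3 stands for a_3
            rules[name] = [letter] * run  # rule a_3 -> a a a
            out.append(name)
        else:
            out.extend(word[i:j])
        i = j
    return out, rules


def compress_pair(word, left, right, name):
    """Replace every occurrence of the pair (left, right) by the symbol name."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == left and word[i + 1] == right:
            out.append(name)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


# Replay of the slide: blocks of a, blocks of b, then the pairs ab and ba.
word = list("aaababcababbabcba")
word, rules = compress_blocks(word, "a")
word, more = compress_blocks(word, "b")
rules.update(more)
word = compress_pair(word, "a", "b", "d")
rules["d"] = ["a", "b"]
word = compress_pair(word, "b", "a", "e")
rules["e"] = ["b", "a"]
final = word
```

After the four steps, `final` holds the ten symbols of the last line of the example, and `rules` holds the four rules listed next to it.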

Algorithm

while |T| > 1 do
    L ← list of letters in T
    for each a ∈ L do                        ⊲ Blocks compression
        compress maximal blocks of a         ⊲ O(|T|)
    P ← list of pairs
    find partition of Σ into Σ_ℓ and Σ_r     ⊲ Try to maximize the occurrences from Σ_ℓ Σ_r in T.
    for ab ∈ P ∩ Σ_ℓ Σ_r do                  ⊲ These pairs do not overlap
        compress pair ab                     ⊲ Pair compression
return the constructed grammar
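The loop above can be turned into a short runnable sketch. Several things here are assumptions rather than the talk's implementation: the names `recompress`, `greedy_partition` and `expand`, the rule encoding `("pow", a, n)` for a block and `("pair", a, b)` for a pair, and the use of dictionaries and a quadratic-in-alphabet greedy partition in place of RadixSort, so this version is not linear time. Terminals are characters and nonterminals are integers, which keeps the two apart.

```python
from collections import Counter
from itertools import groupby


def greedy_partition(word):
    """Split the letters of word into (left, right) so that at least 1/4 of
    its adjacent pairs lie in left x right. Assumes word has no pair aa."""
    pairs = Counter(zip(word, word[1:]))
    side = {}
    for a in dict.fromkeys(word):  # letters in order of first appearance
        weight = [0, 0]            # pair weight towards letters already on side 0 / 1
        for b, s in side.items():
            weight[s] += pairs[(a, b)] + pairs[(b, a)]
        side[a] = 0 if weight[1] >= weight[0] else 1  # go opposite the heavier side
    forward = sum(c for (a, b), c in pairs.items() if side[a] < side[b])
    backward = sum(c for (a, b), c in pairs.items() if side[a] > side[b])
    flip = backward > forward      # orient the cut in the better direction
    left = {a for a, s in side.items() if (s == 1) == flip}
    return left, set(side) - left


def recompress(text):
    """Return (start_symbol, rules) for a non-empty string text."""
    if not text:
        raise ValueError("text must be non-empty")
    word, rules, names = list(text), {}, {}

    def fresh(rhs):
        # Reuse the nonterminal when the same right-hand side appears again.
        if rhs not in names:
            names[rhs] = len(names)
            rules[names[rhs]] = rhs
        return names[rhs]

    while len(word) > 1:
        # Block compression: every maximal block a^n with n >= 2 becomes one symbol.
        runs = [(a, len(list(g))) for a, g in groupby(word)]
        word = [a if n == 1 else fresh(("pow", a, n)) for a, n in runs]
        if len(word) == 1:
            break
        # Pair compression: pairs in left x right cannot overlap.
        left, right = greedy_partition(word)
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] in left and word[i + 1] in right:
                out.append(fresh(("pair", word[i], word[i + 1])))
                i += 2
            else:
                out.append(word[i])
                i += 1
        word = out
    return word[0], rules


def expand(symbol, rules):
    """Decompress: derive the word generated by symbol."""
    if symbol not in rules:  # terminal letter
        return symbol
    kind, x, y = rules[symbol]
    if kind == "pow":
        return expand(x, rules) * y
    return expand(x, rules) + expand(y, rules)
```

A round trip such as `expand(*recompress("aaababcababbabcba"))` gives back the input. Each phase shortens the word by at least one symbol, because after block compression every adjacent pair has two distinct letters and the partition covers at least a quarter of them, so the loop terminates.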

Partition

1/4 of appearances covered
There is a partition of Σ into Σ_ℓ and Σ_r such that at least 1/4 of the pair appearances in T are covered, that is, lie in Σ_ℓ Σ_r.

  – After block compression, aa does not appear.
  – Random partition: in expectation, 1/4 of the pairs are covered.
  – Derandomise (expected value).
  – We need the number of appearances of each pair ab: RadixSort, O(|T|).
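The 1/4 bound for a random partition comes from a short expectation computation. The sketch below assumes that each letter is placed in Σ_ℓ or Σ_r independently with probability 1/2 (the usual reading of "random partition", not spelled out on the slide), and writes N for the number of adjacent pairs in T after block compression, a symbol introduced here for convenience.

```latex
\Pr\bigl[ab \in \Sigma_\ell \Sigma_r\bigr]
  = \Pr[a \in \Sigma_\ell] \cdot \Pr[b \in \Sigma_r]
  = \tfrac{1}{2} \cdot \tfrac{1}{2}
  = \tfrac{1}{4}
  \qquad (a \neq b),
\qquad
\mathbb{E}\bigl[\#\,\text{covered pairs}\bigr]
  = \sum_{i=1}^{N} \tfrac{1}{4}
  = \tfrac{N}{4}.
```

Since the expectation is N/4, some partition covers at least N/4 pairs. Independence of the two events needs a ≠ b, which is exactly what block compression guarantees.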

Size reduction

Size drop
Consider the pairs of consecutive letters ab in T. For at least 1/4 of them, one letter is compressed in a phase.
  – if a = b: it is compressed
  – if a ≠ b: 1/4 of those pairs are in Σ_ℓ Σ_r
    When we consider ab we replace it, unless one letter was already replaced.
Length drops by a constant factor.

Towards running time
It is enough to show that one round runs in O(|T|).
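Why a linear-time round is enough can be spelled out as a geometric sum. The symbols below are assumptions introduced for this sketch: T_i is the word at the start of phase i (so |T_0| = n), β < 1 is the constant factor by which the length drops (its value is not given on the slide), and c is the constant hidden in the O(|T|) cost of one round.

```latex
|T_{i+1}| \le \beta\,|T_i|
\;\Longrightarrow\;
|T_i| \le \beta^{i} n,
\qquad
\sum_{i \ge 0} c\,|T_i|
  \;\le\; c\,n \sum_{i \ge 0} \beta^{i}
  \;=\; \frac{c\,n}{1-\beta}
  \;=\; O(n),
\qquad
\#\,\text{phases} \;\le\; \bigl\lceil \log_{1/\beta} n \bigr\rceil = O(\log n).
```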

Running time

Partition: O(|T|) time.
Block compression: by RadixSort, O(|T|) time.
Pair compression: by RadixSort, O(|T|) time.
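One way to read "by RadixSort" is that the occurrences of each pair are grouped by two stable counting-sort passes. The sketch below assumes the letters of T have already been renumbered as non-negative integers smaller than some bound linear in |T|; that assumption and the names `counting_sort` and `pair_occurrences` are illustrative, not taken from the talk.

```python
def counting_sort(items, key, bound):
    """Stable bucket pass: key(item) must be an int in range(bound)."""
    buckets = [[] for _ in range(bound)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def pair_occurrences(word):
    """Group the positions of every adjacent pair in O(len(word) + bound) time.

    word is a list of non-negative ints (assumed renumbered letters).
    Returns [((a, b), [positions...]), ...] sorted by (a, b).
    """
    if len(word) < 2:
        return []
    bound = max(word) + 1
    items = [(word[i], word[i + 1], i) for i in range(len(word) - 1)]
    items = counting_sort(items, lambda t: t[1], bound)  # second letter first
    items = counting_sort(items, lambda t: t[0], bound)  # then first letter (stable)
    groups = []
    for a, b, i in items:
        if groups and groups[-1][0] == (a, b):
            groups[-1][1].append(i)  # same pair as the previous item
        else:
            groups.append(((a, b), [i]))
    return groups
```

With the occurrences of each pair sitting in one contiguous group, both counting appearances (needed for the partition) and replacing all occurrences of a chosen pair take a single pass over the groups.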

Number of nonterminals

Representation cost
  – when c replaces ab we add the rule c → ab, representation cost 1
  – when a^{ℓ_1}, a^{ℓ_2}, ..., a^{ℓ_k} are replaced with a_{ℓ_1}, a_{ℓ_2}, ..., a_{ℓ_k} (ℓ_1 < ℓ_2 < ... < ℓ_k):