Analysis of Lempel-Ziv 78 for Markov sources
Ph. Jacquet (Inria), W. Szpankowski (Purdue U)
This material is made available under the CC BY-NC-ND 4.0 license: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode
Lempel-Ziv algorithm
• Among the ten most heavily used algorithms in daily computing – Unix, GIF, PDF, etc.
Huge literature in the information-theory and algorithms communities
• D. Aldous and P. Shields, A Diffusion Limit for a Class of Random-Growing Binary Trees, Probab. Th. Rel. Fields, 1988.
• N. Merhav, Universal Coding with Minimum Probability of Codeword Length Overflow, IEEE Trans. Information Theory, 1991.
• P. Jacquet and W. Szpankowski, Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees, Theoretical Computer Science, 1995.
• W. Schachinger, On the Variance of a Class of Inductive Valuations of Data Structures for Digital Search, Theoretical Computer Science, 1995.
• N. Merhav and J. Ziv, On the Amount of Statistical Side Information Required for Lossy Data Compression, IEEE Trans. Information Theory, 1997.
• R. Neininger and L. Rüschendorf, A General Limit Theorem for Recursive Algorithms and Combinatorial Structures, The Annals of Applied Probability, 2004.
• J. Fayolle and M. D. Ward, Analysis of the Average Depth in a Suffix Tree under a Markov Model, DMTCS, 2005.
• K. Leckey, R. Neininger and W. Szpankowski, Towards More Realistic Probabilistic Models for Data Structures: The External Path Length in Tries under the Markov Model, SODA, 2013.
LZ compression process
• A text is fragmented into phrases (not grammatical ones).
• Each phrase is replaced by a short code (phrase index # + extra symbol).
Phrase breaking process
• The next phrase is the longest copy of a previously seen phrase, plus an extra symbol.
• If the copied phrase has index 2 and the extra symbol is a, the code of the new phrase is 2+a.
• Final code sequence in the example: 0+a 1+b 1+a 2+a.
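As a concrete illustration (not from the slides), here is a minimal Python sketch of this phrase-breaking step; on the hypothetical text aabaaaba it reproduces the example's code sequence 0+a 1+b 1+a 2+a:

```python
def lz78_parse(text):
    """LZ78 phrase breaking: each phrase is the longest previously seen
    phrase plus one extra symbol; emits (phrase index, extra symbol) codes."""
    phrases = {"": 0}              # phrase -> index; 0 is the empty phrase
    codes, current = [], ""
    for ch in text:
        if current + ch in phrases:
            current += ch          # keep copying a previously seen phrase
        else:
            codes.append((phrases[current], ch))  # code = index + extra symbol
            phrases[current + ch] = len(phrases)  # register the new phrase
            current = ""
    return codes

# Reproduces the slide's example: codes 0+a, 1+b, 1+a, 2+a
print(lz78_parse("aabaaaba"))      # [(0, 'a'), (1, 'b'), (1, 'a'), (2, 'a')]
```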
Breaking process via digital search trees (DST)
• Build the DST of the current phrases.
• Follow the path spelled by the remaining text to find the next phrase.
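A small sketch of the same parse driven by an explicit DST, as in the slide's picture (names and structure are mine, not the authors'):

```python
class DSTNode:
    """A node of the digital search tree; one child per symbol."""
    def __init__(self):
        self.children = {}

def next_phrase(root, text, pos):
    """Follow the path spelled by text[pos:] through existing nodes, then
    create a new node where the path falls off the tree; the traversed
    symbols plus the new one form the next phrase. Returns the new pos."""
    node = root
    while pos < len(text) and text[pos] in node.children:
        node = node.children[text[pos]]
        pos += 1
    if pos < len(text):                        # the extra symbol ends the phrase
        node.children[text[pos]] = DSTNode()
        pos += 1
    return pos

root, pos, phrases = DSTNode(), 0, 0
text = "aabaaaba"
while pos < len(text):
    pos = next_phrase(root, text, pos)
    phrases += 1
print(phrases)   # 4 phrases: a, ab, aa, aba
```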
Two models
• The DST "m" model – m independent infinite strings inserted in a DST – $L_m$: the path length.
• The LZ "n" model – a text of length n broken into LZ phrases – $M_n$: the number of phrases.
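A simulation sketch contrasting the two models (assumed parameters: binary alphabet, memoryless source with P(a) = 0.6; both are my illustrations, not the authors' code):

```python
import random

def L_m(m, p_a=0.6, rng=random):
    """DST "m" model: insert m independent memoryless(p_a) strings into a
    DST; return the total length of the m phrases (the covered text)."""
    root, total = {}, 0
    for _ in range(m):
        node, depth = root, 0
        while True:
            s = 'a' if rng.random() < p_a else 'b'
            depth += 1
            if s not in node:
                node[s] = {}       # new node: this string's phrase ends here
                break
            node = node[s]
        total += depth
    return total

def M_n(n, p_a=0.6, rng=random):
    """LZ "n" model: break a memoryless(p_a) text of length n into LZ
    phrases; return the number of complete phrases M_n."""
    text = ''.join('a' if rng.random() < p_a else 'b' for _ in range(n))
    root, pos, phrases = {}, 0, 0
    while pos < n:
        node = root
        while pos < n and text[pos] in node:
            node = node[text[pos]]
            pos += 1
        if pos == n:
            break                  # last phrase incomplete: not counted
        node[text[pos]] = {}
        pos += 1
        phrases += 1
    return phrases

print(L_m(1000), M_n(10000))
```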
Equivalence of DST m and LZ n models • When the text source is memoryless the two models are equivalent – backward independence: the current DST and the rest of the text are independent Jacquet, P., & Szpankowski, W. (1995). Asymptotic behavior of the Lempel-Ziv parsing scheme and digital search trees. Theoretical Computer Science, 144(1-2), 161-197.
The memoryless source in the m model
• The infinite strings come from a memoryless source.
• Tractable because the phrases are independent – P. Jacquet, W. Szpankowski, Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees, Theoretical Computer Science, 1995.
• For m phrases, the covered text length $L_m$
– tends to be normal when $m \to \infty$
– Mean: $E[L_m] = \ell(m) = \frac{m}{h}\big(\log m + \gamma(m)\big)$ with $\gamma(m) = O(1)$
– Variance: $\mathrm{var}(L_m) = \frac{m}{h}\, w(m)$ with $w(m) = O(\log m)$
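A Monte Carlo sanity check of the first-order term $(m/h)\log m$, reusing L_m() from the sketch above (the $O(m)$ correction carried by $\gamma(m)$ is still visible at these sizes):

```python
import math, random

p = 0.6
h = -(p * math.log(p) + (1 - p) * math.log(1 - p))  # entropy rate in nats
m = 2000
est = sum(L_m(m, p) for _ in range(50)) / 50        # empirical E[L_m]
print(round(est), round(m * math.log(m) / h))       # same leading order
```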
The probability generating function and the nonlinear differential equation
• Let $Q(z,u) = \sum_{m,n} P(L_m = n)\, u^n \frac{z^m}{m!}$
• It satisfies $\frac{\partial}{\partial z} Q(z,u) = Q(p_a u z, u)\, Q(p_b u z, u)$
• Local limit theorem: $P\big(L_m - E[L_m] \in [x, x+dx)\big) \to \frac{1}{\sqrt{2\pi\,\mathrm{var}(L_m)}}\, \exp\!\Big(-\frac{x^2}{2\,\mathrm{var}(L_m)}\Big)\, dx$
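The functional equation encodes a binomial splitting recurrence in the DST: the root stores the first string, and each of the other $m-1$ strings pays one symbol at the root before recursing into the a- or b-subtree, so $l_m = (m-1) + \sum_k \binom{m-1}{k} p_a^k p_b^{m-1-k}(l_k + l_{m-1-k})$ for the mean. A numerical sketch (assuming $p_a = 0.6$; this $L_m$ counts the path length, i.e. the covered text minus $m$):

```python
import math

p_a, p_b = 0.6, 0.4

def mean_path_length(M):
    """E[L_m] for m <= M via the binomial splitting recurrence behind
    d/dz Q(z,u) = Q(p_a u z, u) Q(p_b u z, u)."""
    l = [0.0] * (M + 1)
    for m in range(1, M + 1):
        s = 0.0
        for k in range(m):          # k of the m-1 later strings go to 'a'
            w = math.comb(m - 1, k) * p_a**k * p_b**(m - 1 - k)
            s += w * (l[k] + l[m - 1 - k])
        l[m] = (m - 1) + s
    return l

h = -(p_a * math.log(p_a) + p_b * math.log(p_b))
l = mean_path_length(400)
print(round(l[400]), round(400 * math.log(400) / h))   # same leading order
```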
From phrases to text compression
• Number of phrases $M_n$
– Using renewal duality: $P(M_n \ge m) = P(L_m < n)$
– Asymptotically normal
– Mean: $E[M_n] = \ell^{-1}(n) + O(n^{\varepsilon})$ for any $\varepsilon > 1/2$, with $E[M_n] \sim \ell^{-1}(n) \sim \frac{hn}{\log n}$
– Variance: $\mathrm{var}(M_n) \sim \frac{\mathrm{var}(L_m)}{(\ell'(m))^2}$ at $m = \ell^{-1}(n)$, i.e. $O\!\big(\frac{n}{\log^2 n}\big)$
• Compression rate: $C_n = \frac{M_n}{n}\,(\log M_n + \log A)$, where $A$ is the alphabet size
• Average redundancy: $E[C_n] - h \sim h(\log A - \gamma)\, \frac{\ell^{-1}(n)}{n} = O\!\big(\frac{1}{\log n}\big)$
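A rough numeric illustration of the slow $O(1/\log n)$ redundancy decay (assumptions: binary alphabet, $p = 0.6$, code length $M_n(\log_2 M_n + \log_2 A)$ bits), reusing lz78_parse() from the first sketch:

```python
import math, random

p = 0.6
h2 = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # entropy, bits/symbol
for n in (10_000, 100_000, 1_000_000):
    text = ''.join('a' if random.random() < p else 'b' for _ in range(n))
    M = len(lz78_parse(text))
    C = M * (math.log2(M) + 1) / n       # rate: (M_n/n)(log M_n + log A), A=2
    print(n, round(C, 3), round(h2, 3))  # the gap shrinks roughly like 1/log n
```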
DST m model and LZ n model no longer equivalent for Markovian text
• A Markovian source induces dependencies: time-forward and time-backward correlations.
Example: bbaababbaababaaababbaababababbabbaababbbbaabaababbaaaabbabbbabbbbba
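A generator sketch for such a text (the transition matrix below is a hypothetical example, not taken from the slides):

```python
import random

# hypothetical transition probabilities P(next | previous)
P = {'a': {'a': 0.8, 'b': 0.2}, 'b': {'a': 0.4, 'b': 0.6}}

def markov_text(n, start='a', rng=random):
    """Generate n symbols from the two-state Markov chain P."""
    out, prev = [], start
    for _ in range(n):
        prev = 'a' if rng.random() < P[prev]['a'] else 'b'
        out.append(prev)
    return ''.join(out)

print(markov_text(67))   # long runs of a's and b's reflect the correlations
```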
Our result on LZ compression performance for a Markovian text
• The number of phrases: for all $\varepsilon > 1/2$, $E[M_n] = \ell^{-1}(n) + O(n^{\varepsilon})$ with $\ell(m) \sim \frac{m \log m}{h}$, and $\mathrm{var}(M_n) = O(n^{2\varepsilon})$.
• The distribution of the first symbol of phrases is determined and does NOT converge to the stationary distribution of the Markov chain.
• The redundancy satisfies $E[C_n] - h = O\!\big(\frac{1}{\log n}\big)$.
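A numeric check of the first-order claim $E[M_n] \approx \ell^{-1}(n) \approx hn/\log n$, reusing markov_text() and lz78_parse() from the sketches above:

```python
import math

pi_a = 0.4 / (0.2 + 0.4)                     # stationary P(a) of the chain P
H = lambda x: -(x * math.log(x) + (1 - x) * math.log(1 - x))
h = pi_a * H(0.8) + (1 - pi_a) * H(0.4)      # entropy rate of the chain, nats
n = 500_000
M = len(lz78_parse(markov_text(n)))
print(M, round(h * n / math.log(n)))         # agree to first order
```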
The main difficulty
• The DST m model and the LZ n model are no longer equivalent.
• We need another route: push the m model as far as it goes, then come back to the n model.
How far can we go with the m model on Markovian sources?
• Classic Markovian source – one must track the initial symbol:
$P^a_{m,n} = P(L_m = n \mid \text{all strings start with } a)$
$\frac{\partial}{\partial z} Q_a(z,u) = Q_a(p_{aa} u z, u)\, Q_b(p_{ab} u z, u)$
• Path length asymptotically normal:
$E[L^a_m] = \frac{m}{h}\big(\log m + \gamma_a(m)\big)$, $\mathrm{var}(L^a_m) = m\, w_a(m) = O(m \log m)$
• Jacquet, P., Szpankowski, W., & Tang, J. (2001). Average profile of the Lempel-Ziv parsing scheme for a Markovian source. Algorithmica, 31(3), 318-360.
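The Markov functional equation likewise encodes coupled binomial recurrences for the conditional means $E[L^a_m]$, $E[L^b_m]$; a numerical sketch under the hypothetical chain used earlier:

```python
import math

p_aa, p_ba = 0.8, 0.4   # P(a|a) and P(a|b) of the hypothetical chain

def mean_path_lengths(M):
    """Coupled recurrences mirroring
    d/dz Q_c(z,u) = Q_a(p_ca u z, u) Q_b(p_cb u z, u), c in {a, b}:
    the k strings whose second symbol is 'a' recurse into the a-subtree."""
    la, lb = [0.0] * (M + 1), [0.0] * (M + 1)
    for m in range(1, M + 1):
        for pa, l in ((p_aa, la), (p_ba, lb)):
            s = 0.0
            for k in range(m):
                w = math.comb(m - 1, k) * pa**k * (1 - pa)**(m - 1 - k)
                s += w * (la[k] + lb[m - 1 - k])
            l[m] = (m - 1) + s
    return la, lb

la, lb = mean_path_lengths(300)
pi_a = 2 / 3
H = lambda x: -(x * math.log(x) + (1 - x) * math.log(1 - x))
h = pi_a * H(p_aa) + (1 - pi_a) * H(p_ba)    # entropy rate of the chain
print(round(la[300]), round(lb[300]), round(300 * math.log(300) / h))
```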
m model basic results
• Asymptotically indifferent to the first symbol:
$\gamma_a(m) = \gamma(m) + O(m^{-\varepsilon})$
$\gamma(m) = \bar{\gamma} + P_1(\log m)$, with $P_1(\cdot)$ periodic when the transition matrix is rational; $\gamma(m) = \bar{\gamma}$ otherwise.
Extended m model with tail symbol
• The tail symbol is the next symbol after an insertion in the DST
– it would be the first symbol of the next phrase in the n model
– $T_m$: number of tail symbols equal to "a"
$P^c_{m,k,n} = P(T_m = k \text{ and } L_m = n \mid \text{all strings start with } c)$, $c \in \{a,b\}$
$Q_c(z,u,w) = \sum_{m,k,n} P^c_{m,k,n}\, u^n w^k \frac{z^m}{m!}$
$\frac{\partial}{\partial z} Q_c(z,u,w) = (p_{ca} w + p_{cb})\, Q_a(p_{ca} u z, u, w)\, Q_b(p_{cb} u z, u, w)$
Extended m model analytical results
• Refining the techniques of the previous m models (limited to a binary alphabet):
– $(L_m, T_m)$ is asymptotically jointly normal
– $E[T_m] = m\, \tau_a(m)$, with $\tau_a(m) = \tau(m) + O(m^{-\varepsilon})$
• $\tau(m) = \bar{\tau} + P_1(\log m)$ with $P_1(\cdot)$ periodic when the transition matrix is rational; $\tau(m) = \bar{\tau}$ otherwise
• Notice: the asymptotic tail-symbol distribution is NOT the Markov stationary distribution
– $\mathrm{cov}(L_m, T_m) = O(m \log m)$
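An illustrative simulation of the last point (same hypothetical chain; all strings conditioned to start with 'a'), comparing the empirical tail-symbol frequency with the stationary probability $\pi_a = 2/3$:

```python
import random

p_aa, p_ba = 0.8, 0.4

def step(prev, rng=random):
    """One Markov transition of the hypothetical chain."""
    return 'a' if rng.random() < (p_aa if prev == 'a' else p_ba) else 'b'

def tail_freq(m, runs=200, rng=random):
    """Insert m Markov strings (all starting with 'a') into a DST and
    record the tail symbol: the symbol generated right after the phrase."""
    hits = total = 0
    for _ in range(runs):
        root = {}
        for _ in range(m):
            node, sym = root, 'a'
            while sym in node:
                node = node[sym]
                sym = step(sym)
            node[sym] = {}                 # phrase ends at the new node
            hits += step(sym) == 'a'       # tail symbol follows the phrase
            total += 1
    return hits / total

print(tail_freq(500), 0.4 / 0.6)   # empirical tail law vs stationary pi_a
```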
The remaining very hard nut to crack
• Coming back to the n model
– Remember: the DST m model and the LZ n model are NOT equivalent for Markov sources.
What would happen if the m and n models were equivalent for Markov sources?
• LZ n model: let $\mathcal{Q}_{m,n} = P(\text{the first } m \text{ phrases have total length } n)$
• With memoryless sources we have $\mathcal{Q}_{m,n} = P_{m,n}$, because the m and n models are equivalent.
• For a Markov source, a convolution over the initial symbol and the tail symbols?
$\mathcal{Q}_{m,n} = \sum_{m_1,k,n_1} P^a_{m_1,k,n_1}\, P^b_{m-m_1,\, m_1-k,\, n-n_1}$
• But this is wrong!
What is failing in the transition from DST to LZ?
• Carving the phrases in the text gives the tail-symbol sequence $\sigma = (a,b,a,b,b,b)$.
• Arranging the phrases in a DST splits it into per-subtree sequences $\sigma_a = (a,b,b)$ and $\sigma_b = (a,b,b)$ (see the sketch below).
[figure: the same six phrases, carved in the text and rearranged in the DST]
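A small sketch (hypothetical helper, binary text assumed) extracting the text-order tail sequence $\sigma$ and the per-subtree sequences $\sigma_a$, $\sigma_b$ from a parse, as in the picture:

```python
def tail_sequences(text):
    """Parse text by LZ78; record each phrase's first symbol (which DST
    subtree it lands in) and its tail symbol (the symbol right after it)."""
    root, pos, records = {}, 0, []
    while pos < len(text):
        node, first = root, text[pos]
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
        if pos == len(text):
            break                      # incomplete last phrase
        node[text[pos]] = {}
        pos += 1
        if pos < len(text):
            records.append((first, text[pos]))   # (subtree, tail symbol)
    sigma   = [t for _, t in records]
    sigma_a = [t for f, t in records if f == 'a']
    sigma_b = [t for f, t in records if f == 'b']
    return sigma, sigma_a, sigma_b

print(tail_sequences("abaabbbababbab"))
```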
Enumerating permutations in the n and m models
• Let $\sigma$ be a sequence of m symbols.
– $\sigma$ indicates the sequence of tail symbols in the text (n model):
$\mathcal{Q}_{\sigma,n} = P(\text{the first } m \text{ tail symbols follow } \sigma \text{ and cover length } n)$
– $\sigma_c$ indicates the tail-symbol sequence in the DST c-subtree (m model):
$P^c_{\sigma,n} = P(\text{DST tail symbols follow } \sigma \text{ and path length is } n \mid \text{sequences start with } c)$
– We have $\mathcal{Q}_{m,n} = \sum_{|\sigma|=m} \mathcal{Q}_{\sigma,n}$ and $P^c_{m,k,n} = \sum_{|\sigma|=m,\, |\sigma|_a=k} P^c_{\sigma,n}$.
– But we will see that we do NOT have the m-n convolution
$\mathcal{Q}_{m,n} = \sum_{|\sigma_a|+|\sigma_b|=m}\ \sum_{n_1} P^a_{\sigma_a,n_1}\, P^b_{\sigma_b,n-n_1}$
– In other words,
$\mathcal{Q}_{m,n} \neq \sum_{m_1,k,n_1} P^a_{m_1,k,n_1}\, P^b_{m-m_1,\, m_1-k,\, n-n_1}$
The lost permutations
• The following case is not feasible:
[figure: a pair of subtree tail-symbol sequences whose interleaving cannot be produced by carving any text]