Compression (Hebrew: דחיסה)

• Introduction
• Information theory
• Text compression
• IL compression

DL - 2004 Compression – Beeri/Feitelson 1

General:
• Compression methods depend on the data characteristics ⇒ there is no
  universal (best) method
• Requirements:
  – text, ILs: lossless
  – images: may be lossy
• Efficiency: how many bits per byte of data? (often given as a percentage)
• Coding should be fast, decoding super-fast

A general model for statistics-based compression:

  source → coder → (noisy) line / file → decoder → destination
             ↑                              ↑
           Model                          Model

The same model must be used at both sides.

Compression vs. communications – a minor difference:
communication is always on-line; compression may be on-line or off-line
(off-line: the complete file is given in advance). The model is (often)
stored in the compressed file – its size affects the compression
efficiency.

Appetizer: Huffman coding

Assume:
• source alphabet: s_1, ..., s_q
• symbol probabilities: p_1, ..., p_q  (p_i > 0, Σ p_i = 1)
• coding alphabet: binary – {0, 1}

(Standard) binary coding:
• uniquely decodable
• model = a table
• efficiency: ⌈log q⌉ bits/symbol (no/little compression)

Can do better if the symbol frequencies are known:
frequent symbol – short code, rare symbol – long code;
this minimizes the average code length.

Huffman's algorithm (eager construction of the code tree):
• Allocate a node for each symbol, with weight = symbol probability
• Enter the nodes into a priority queue Q (small weights first)
• While |Q| > 1:
  – remove the two first nodes (smallest weights)
  – create a new node, make it their parent, and assign it the sum of
    their weights
  – enter the new node into Q
• Return the single node left in Q (the root of the tree)
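The algorithm above can be sketched in Python, with the standard-library `heapq` module serving as the priority queue (a sketch only; the function and variable names are mine, not from the slides):

```python
import heapq
import itertools

def huffman_tree(probs):
    """Build a Huffman code tree for {symbol: probability}.

    A leaf is a bare symbol; an internal node is a (left, right) pair.
    """
    tie = itertools.count()  # tie-breaker so heapq never compares subtrees
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                 # |Q| > 1
        p1, _, t1 = heapq.heappop(heap)  # remove the two smallest weights
        p2, _, t2 = heapq.heappop(heap)
        # the new parent node carries the sum of its children's weights
        heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))
    return heap[0][2]                    # root of the tree

def code_lengths(tree, depth=0):
    """Path length from the root to each leaf (= codeword length)."""
    if not isinstance(tree, tuple):      # leaf
        return {tree: depth}
    left, right = tree
    return {**code_lengths(left, depth + 1),
            **code_lengths(right, depth + 1)}
```

The tie-breaking counter is one arbitrary choice among several valid ones; as the slides note later, different tie-breaks can produce different (but equally good) trees.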
Example: a: 1/2, b: 1/4, c, d: 1/8 each

Queue contents during the construction:
{1/2, 1/4, 1/8, 1/8} → {1/2, 1/4, 1/4} → {1/2, 1/2} → {1}

(resulting tree: a at depth 1, b at depth 2, c and d at depth 3)

How are the trees used?
• Coding: for each symbol s, output the binary path from the root to leaf(s)
• Decoding: read the incoming stream of bits, following a path from the
  root of the tree; when leaf(s) is reached, output s and return to the root
• Common model (stored on both sides): the tree

Expected cost in bits/symbol:
• binary: ⌈log q⌉
• Huffman: Σ p_i l_i   (l_i = length of the path from the root to leaf(s_i))

In the example:
• binary: 2
• Huffman: 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75

Q: what would be the tree and cost for 5/12, 1/3, 1/6, 1/12?

A note on Huffman trees – the algorithm is non-deterministic:
• In each step, either node can be made the left child of the new parent;
  if the two children of a node are exchanged, the result is also a
  Huffman tree (closure under rotation of nodes)
• Ties can be broken arbitrarily: consider 0.4, 0.2, 0.2, 0.1, 0.1 –
  after the 1st step, 2 out of 3 nodes (all of weight 0.2) are selected
⇒ There are many Huffman trees for a given probability distribution

Concepts:
• variable-length code (e.g. Huffman)
• uniquely decodable code: each legal code sequence is generated by a
  unique source sequence
• instantaneous/prefix code (Hebrew: מיידי, "immediate"): the end of the
  code of each symbol can be recognized as soon as it is read

Examples:
• 0, 010, 01, 10
• 10, 00, 11, 110
• 0, 10, 110, 111 (the Huffman code of the example; the comma code)
• 0, 01, 011, 111 (the inverted comma code)

A prefix code = a binary tree:
every binary tree with q leaves is a prefix code for q symbols, with the
lengths of the code words = the lengths of the paths.

Kraft inequality: there exists a q-leaf tree with path lengths
l_1, ..., l_q iff Σ 2^(-l_i) ≤ 1; equality holds iff the tree is complete.
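The coding/decoding walk and the 1.75 bits/symbol figure can be checked on the example tree, hard-coded here as nested tuples (a sketch; all names are mine):

```python
# Example from the slides: a: 1/2, b: 1/4, c, d: 1/8 each.
# A leaf is a symbol; an internal node is a (left, right) pair.
TREE = ('a', ('b', ('c', 'd')))
PROBS = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

def codewords(tree, prefix=''):
    """Assign '0' going left and '1' going right, root to leaf."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    left, right = tree
    return {**codewords(left, prefix + '0'),
            **codewords(right, prefix + '1')}

def encode(text, codes):
    """Coding: output the root-to-leaf path of each symbol."""
    return ''.join(codes[s] for s in text)

def decode(bits, tree):
    """Decoding: follow bits from the root; at a leaf, output the
    symbol and return to the root."""
    out, node = [], tree
    for b in bits:
        node = node[int(b)]
        if not isinstance(node, tuple):  # leaf reached
            out.append(node)
            node = tree
    return ''.join(out)

codes = codewords(TREE)  # a -> 0, b -> 10, c -> 110, d -> 111
cost = sum(p * len(codes[s]) for s, p in PROBS.items())  # 1.75 bits/symbol
```

Note that decoding needs no delimiters between codewords: the prefix property makes each leaf arrival unambiguous.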
Proof (of the Kraft inequality, "only if" direction): assume there exists
a tree T with path lengths l_1, ..., l_q. Take T' to be the full tree of
depth l = max(l_i) ("full": all paths have the same length). The number
of its leaves is 2^l. A leaf of T at distance l_i from the root has
2^(l - l_i) leaves of T' under it. Summing over all leaves of T:

  Σ 2^(l - l_i) ≤ 2^l  ⇒  Σ 2^(-l_i) ≤ 1

Only complete trees give equality: if T is not complete (complete: every
node has 0 or 2 children), it has a node with a single child, so it can
be "shortened". The shortened tree still satisfies Σ 2^(-l_i) ≤ 1, hence
the given tree must satisfy Σ 2^(-l_i) < 1.

Comment: in general, a prefix code that is not a complete tree is
dominated by a tree with smaller cost. From now on: trees are complete.

"If" direction: assume Σ 2^(-l_i) ≤ 1; construct a tree by induction on q.
Lemma: if Σ 2^(-l_i) = 1 and l_j = max(l_i), then there exists k ≠ j
such that l_k = l_j. Replace these two lengths by the single length
l_j - 1 (their weights sum: 2^(-l_j) + 2^(-l_k) = 2^(-(l_j - 1))), giving
q - 1 lengths, and use induction.
Q: if Σ 2^(-l_i) = 1, must the tree be complete?

McMillan's theorem: there exists a uniquely decodable code with lengths
l_1, ..., l_q iff Σ 2^(-l_i) ≤ 1.
Corollary: when there is a uniquely decodable code, there is also a
prefix code with the same lengths (same cost)
⇒ no need to consider the larger class, even though the uniquely
decodable codes strictly contain the prefix codes.

On the optimality of Huffman:

Cost of a tree/code T: L(T) = Σ p_i l_i

Claim: if a tree T does not satisfy
  (*)  p_1 ≥ ... ≥ p_q  ⇒  l_1 ≤ ... ≤ l_q
then it is dominated by a tree with smaller cost.

Claim: for any T, L(T_Huff) ≤ L(T).
Proof: by the previous claim, we may assume T satisfies (*). Use
induction on q:
• q = 2: both trees have lengths 1, 1.
• q > 2: in the Huffman tree there are two maximal paths that end in
  sibling nodes. In T, the paths for the last two symbols are longest
  (by (*)), but their ends may not be siblings. But T is complete, hence
  the leaf with length l_q has a sibling of the same length; exchange
  that sibling with the leaf corresponding to l_(q-1). Now, in both
  trees, these two longest paths can be replaced by their parents,
  reducing to the case of q - 1 (the induction hypothesis).
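The Kraft sums, and the constructive "if" direction, are easy to check with exact arithmetic (a sketch; `canonical_code` is the standard greedy construction, named by me, not taken from the slides):

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Sum of 2^(-l_i) over the codeword lengths, computed exactly."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def canonical_code(lengths):
    """Given lengths with kraft_sum(lengths) <= 1, assign binary
    codewords greedily, shortest first; the result is prefix-free."""
    words, next_val, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_val <<= (l - prev_len)   # extend the counter to the new length
        words.append(format(next_val, f'0{l}b'))
        next_val += 1
        prev_len = l
    return words
```

A Kraft sum of exactly 1 corresponds to a complete tree; a sum below 1 leaves slack, i.e. a dominated (shortenable) code.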
Summary:
• Huffman trees are optimal, hence satisfy (*)
• Any two Huffman trees for the same distribution have equal costs
• Huffman trees have minimum cost among all trees (codes)
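The second bullet can be spot-checked on the 0.4, 0.2, 0.2, 0.1, 0.1 distribution mentioned earlier. Different tie-breaks in the priority queue yield different trees; the two length profiles below are my worked-out examples of such trees (exact fractions avoid floating-point noise):

```python
from fractions import Fraction as F

P = [F(4, 10), F(2, 10), F(2, 10), F(1, 10), F(1, 10)]

# Two codeword-length profiles arising from different tie-breaks
# (assumed here; each corresponds to a valid Huffman tree for P):
lens_a = [1, 2, 3, 4, 4]   # skewed tree
lens_b = [2, 2, 2, 3, 3]   # balanced tree

def cost(lengths):
    """L(T) = sum of p_i * l_i."""
    return sum(p * l for p, l in zip(P, lengths))
```

Both profiles give the same expected cost, and both have Kraft sum 1, i.e. both trees are complete.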