Compression: Huffman’s Algorithm
Greg Plaxton
Theory in Programming Practice, Spring 2004
Department of Computer Science, University of Texas at Austin
Tree Representation of a Prefix Code

• Consider the infinite binary tree T in which each left-going edge is labeled with a 0 and each right-going edge is labeled with a 1
• We define the label of any node x in T as the concatenation of the edge labels on the unique downward path from the root to x
  – There is a one-to-one correspondence between the nodes of T and {0, 1}*
• A prefix code may be represented as the (unique, finite) subtree of T that contains the root and has leaf labels equal to the set of codewords
  – Such a representation is possible because no two codewords lie on the same downward path from the root
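To make the correspondence concrete, here is a minimal Python sketch (not from the slides; `Node`, `build_tree`, `decode`, and the example code are illustrative names) that stores a prefix code as the finite subtree described above and uses it to decode a bit string:

```python
class Node:
    def __init__(self):
        self.children = {}   # maps bit '0' or '1' to a child Node
        self.symbol = None   # set only at leaves

def build_tree(code):
    """Insert each codeword as a root-to-leaf path of 0/1 edges."""
    root = Node()
    for symbol, word in code.items():
        node = root
        for bit in word:
            node = node.children.setdefault(bit, Node())
        node.symbol = symbol  # the leaf labeled by this codeword
    return root

def decode(root, bits):
    """Walk edges from the root; emit a symbol at each leaf and restart."""
    out, node = [], root
    for bit in bits:
        node = node.children[bit]
        if node.symbol is not None:
            out.append(node.symbol)
            node = root
    return "".join(out)

print(decode(build_tree({"a": "0", "b": "10", "c": "11"}), "010110"))  # "abca"
```

Because no codeword is a prefix of another, the decoder never has to look ahead: the first leaf reached is always the right one.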
Convention Concerning Zero-Frequency Symbols

• Let A denote the input alphabet
• For each a in A, let f(a) denote the frequency of a
  – Remark: Scaling the frequencies by any positive quantity does not change the problem of computing an optimal code
  – We will adopt the convention that a symbol with frequency zero should not be assigned a codeword in an optimal code
  – For the remainder of this lecture, assume that A has been preprocessed to remove any zero-frequency symbols, so f(a) > 0 for all a in A
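As a one-line illustration of this preprocessing step (the `freq` dictionary is a hypothetical example input, not from the slides):

```python
freq = {"a": 5, "b": 0, "c": 2}                    # hypothetical frequencies
freq = {s: f for s, f in freq.items() if f > 0}    # drop zero-frequency symbols
print(freq)                                        # {'a': 5, 'c': 2}
```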
Full Binary Tree

• A (finite) binary tree is full if every internal node has exactly two children
• Theorem: Any optimal prefix code corresponds to a full binary tree
• Proof: The weight of a prefix code corresponding to a non-full binary tree strictly decreases when we contract the lone descendant edge of any single-child node
• Theorem: Any optimal prefix code tightly satisfies the McMillan inequality (i.e., the inequality holds with equality)
• Hint: Use induction on the number of leaves; for the induction step, remove two sibling leaves at the deepest level
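For reference, the inequality in question (often called the Kraft–McMillan inequality) over the codeword lengths l_1, …, l_n is shown below; the second theorem says an optimal prefix code attains it with equality:

```latex
\sum_{i=1}^{n} 2^{-l_i} \le 1,
\qquad \text{with } \sum_{i=1}^{n} 2^{-l_i} = 1 \text{ for an optimal prefix code.}
```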
Huffman’s Algorithm

• The algorithm manipulates a forest of full binary trees
  – Invariant: The total number of leaves is n = |A|, and each leaf is labeled by a distinct symbol in A
  – Initially, there are n single-node trees
  – The algorithm repeatedly merges pairs of trees
  – The algorithm terminates when there is one tree, the output tree
• To merge trees T1 and T2, we introduce a new node with left subtree T1 and right subtree T2
  – Remark: It doesn’t matter which tree becomes the left or right child of the merged tree
• It remains only to specify which pair of trees to merge in each iteration
Huffman’s Algorithm: Merging Rule

• Let us define the frequency of a tree T in the forest as the sum of the frequencies of the symbols associated with (the leaves of) T
• We merge the two lowest-frequency trees, breaking ties arbitrarily
• Remark: Due to the arbitrary tie-breaking, Huffman’s algorithm can produce different codes; in fact, it can produce codes with different distributions of codeword lengths
Huffman’s Algorithm: Recursive Formulation

• Let a and b be the two lowest-frequency symbols in A, breaking ties arbitrarily
• Let A′ = (A \ {a, b}) ∪ {c}, where c is a new symbol with f(c) = f(a) + f(b)
• Recursively compute the tree representation of an optimal prefix code for A′
• Modify this tree by adding two children labeled a and b to the leaf labeled c (which ceases to be a leaf)
  – The resulting tree corresponds to an optimal prefix code for A
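The recursion translates almost line for line into code. A minimal sketch (function and variable names are mine, not the slides’): trees are nested pairs and a bare symbol is a leaf, so representing the new symbol c by the pair (a, b) makes the final “replace the leaf c by an internal node with children a and b” step automatic.

```python
def huffman_recursive(freq):
    """freq: dict mapping symbols (or pair-trees) to positive frequencies."""
    if len(freq) == 1:
        (tree,) = freq
        return tree
    # a, b: the two lowest-frequency entries, ties broken arbitrarily.
    a, b = sorted(freq, key=freq.get)[:2]
    reduced = {s: f for s, f in freq.items() if s not in (a, b)}
    reduced[(a, b)] = freq[a] + freq[b]      # the new symbol c = (a, b)
    return huffman_recursive(reduced)

def codewords(tree, prefix=""):
    """Read codewords off the tree: left edge = 0, right edge = 1."""
    if not isinstance(tree, tuple):
        return {tree: prefix or "0"}         # degenerate one-symbol alphabet
    left, right = tree
    return {**codewords(left, prefix + "0"),
            **codewords(right, prefix + "1")}

print(codewords(huffman_recursive({"a": 5, "b": 2, "c": 1, "d": 1})))
# e.g. {'b': '00', 'c': '010', 'd': '011', 'a': '1'} up to tie-breaking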
Huffman’s Algorithm: Proof of Correctness

• It is not hard to convince yourself that the iterative and recursive versions of Huffman’s algorithm are equivalent
• Our proof is given in terms of the recursive version
Huffman’s Algorithm: Proof of Correctness (Continued)

• Let X denote the set of tree representations of all prefix codes for A in which the symbols a and b are associated with sibling leaves
  – Lemma: The set X contains the tree representation of an optimal prefix code
• Let Y denote the set of tree representations of all prefix codes for A′
  – Observation: There is a natural one-to-one correspondence between the sets X and Y; for any tree T in Y, let h(T) denote the corresponding tree in X
  – Lemma: For any tree T in Y, the weight of h(T) is equal to the weight of T plus f(a) + f(b)
  – It follows that any minimum-weight tree T in Y corresponds to a minimum-weight tree h(T) in X
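To see why the second lemma holds (a short derivation I am adding; it assumes the weight of a tree is the frequency-weighted sum of its leaf depths, W(T) = Σ_s f(s) · depth(s)): if the leaf labeled c is at depth d in T, then a and b sit at depth d + 1 in h(T), and every other leaf keeps its depth, so

```latex
W(h(T)) - W(T)
  = \bigl(f(a) + f(b)\bigr)(d + 1) - f(c)\,d
  = \bigl(f(a) + f(b)\bigr)(d + 1) - \bigl(f(a) + f(b)\bigr)d
  = f(a) + f(b).
```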
Remarks on Optimal Codes

• While there is always an optimal prefix code, not every optimal code is a prefix code
  – Example involving the alphabet {a, b, c} with equal frequencies?
• Two optimal prefix codes (for the same input) can have a different distribution of codeword lengths
  – Example involving the alphabet {a, b, c, d}?
Implementation of Huffman’s Algorithm

• Use a priority queue data structure (e.g., a binary heap) that supports the operations insert and delete-min in O(log n) time
• Initially, we insert each of the n symbols into the priority queue
• At each iteration (or level of recursion) we remove two entries from the priority queue and insert one
• The total cost of the 2n − 1 insert (resp., delete-min) operations is O(n log n)
• The total cost of all other operations is O(1) per iteration (or level of recursion), hence O(n) overall
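A compact sketch of this implementation using Python’s heapq module (all names are illustrative; the counter field breaks frequency ties so that heap entries never compare the non-comparable tree payloads):

```python
import heapq
from itertools import count

def huffman(freq):
    """freq: dict of symbol -> positive frequency; returns {symbol: codeword}."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), s) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two delete-mins per iteration...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))  # ...one insert
    _, _, tree = heap[0]
    # Walk the final tree to read off codewords (left = 0, right = 1).
    codes, stack = {}, [(tree, "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix or "0"   # degenerate single-symbol case
    return codes

print(huffman({"a": 5, "b": 2, "c": 1, "d": 1}))
```

Each of the n − 1 iterations performs two delete-mins and one insert at O(log n) apiece, matching the O(n log n) bound above.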
Linear-Time Implementation

• Assume that the “leaves” (i.e., symbols) are provided to us in a queue sorted in nondecreasing order of frequency
• As the algorithm progresses, we can find the lowest-frequency unprocessed leaf in O(1) time
• But what about the “nonleaf” nodes, one of which is created at each iteration?
  – When a nonleaf is created, append it to a second queue of nonleaves (initially empty)
  – Invariant: The nonleaf queue is sorted in nondecreasing order of frequency
  – To implement delete-min, we simply compare the frequencies of the nodes at the heads of the leaf and nonleaf queues, dequeueing whichever is lower
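A sketch of the two-queue idea in Python (names are illustrative; collections.deque plays the role of both queues). Each delete-min is a constant-time comparison of the two queue heads, so after the initial sort the whole run is O(n):

```python
from collections import deque

def huffman_linear(sorted_items):
    """sorted_items: list of (frequency, symbol) in nondecreasing frequency order."""
    leaves = deque(sorted_items)   # first queue: the presorted leaves
    nonleaves = deque()            # second queue: merged trees; they are created
                                   # in nondecreasing frequency order (invariant)

    def delete_min():
        # Compare the heads of the two queues; dequeue whichever is lower.
        if not nonleaves or (leaves and leaves[0][0] <= nonleaves[0][0]):
            return leaves.popleft()
        return nonleaves.popleft()

    for _ in range(len(sorted_items) - 1):   # n - 1 merges in all
        f1, t1 = delete_min()
        f2, t2 = delete_min()
        nonleaves.append((f1 + f2, (t1, t2)))
    return (nonleaves or leaves)[0][1]       # the single remaining tree

print(huffman_linear([(1, "c"), (1, "d"), (2, "b"), (5, "a")]))
# (('b', ('c', 'd')), 'a') -- shape may vary with tie-breaking
```

The invariant holds because each merged frequency is at least as large as the previously merged one, so appending to the tail keeps the nonleaf queue sorted.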
Optimal Prefix Codes: Summary

• For an i.i.d. source, the average number of bits used to encode each symbol is within one of the entropy lower bound
  – Except in cases where the entropy is extremely close to zero, this implies near-optimal compression
  – But we need to know the symbol probabilities ahead of time in order to encode the data in a single pass
• For more general Markov sources, the average number of bits per symbol used by an optimal prefix code can exceed the entropy bound by an arbitrarily large factor, even if the entropy of the source is high
  – Example: It is easy to construct a Markov source in which the symbol probabilities fluctuate over time; in such cases we need to use a more adaptive method in order to approach the entropy bound
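A small worked example (my own numbers, not from the slides) of the near-zero-entropy caveat: for a two-symbol i.i.d. source with probabilities 0.9 and 0.1, any prefix code must spend at least one bit per symbol, while the entropy is far smaller:

```latex
H = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.469 \text{ bits/symbol},
\qquad
L_{\mathrm{Huffman}} = 1 \text{ bit/symbol}.
```

The average length is within one bit of the entropy, as the first bullet promises, yet it exceeds the entropy by a factor of more than two.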