Huffman Codes


1. Huffman Codes

Idea:
• The straightforward way to encode an alphabet A of ASize characters in binary is to associate with each character a string of ⌈log2(ASize)⌉ bits.
• For example, the standard ASCII character set associates with each of the 128 characters (ASize = 128) a string of 7 bits.
• However, any method for encoding characters as variable-length bit strings may be acceptable in practice, as long as strings are always uniquely decodable; that is, it is always possible to determine where the code for one character ends and the code for the next one begins.
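A minimal Python sketch of this fixed-length scheme (my illustration; the function name and example alphabet are made up, not from the slides):

    import math

    def fixed_length_code(alphabet):
        # Each character gets a binary string of ceil(log2(ASize)) bits.
        bits = math.ceil(math.log2(len(alphabet)))
        return {c: format(i, f"0{bits}b") for i, c in enumerate(alphabet)}

    # Five characters need ceil(log2(5)) = 3 bits each.
    print(fixed_length_code("ABCDE"))  # {'A': '000', 'B': '001', ..., 'E': '100'}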

2. Prefix Codes

A sufficient (but not necessary) condition for an encoding method to be uniquely decodable is that the code for one character is never a prefix of the code for another; such codes are called prefix codes. For example, items {A, B, C, D, E, F} could be represented by a trie over the alphabet {a, b, c} giving the codes:

A = a
B = ba
C = bb
D = bca
E = bcb
F = c

(The slide's trie figure, with edges labelled a, b, and c, is not reproduced in this transcript.)
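The prefix condition is easy to check mechanically. A small sketch (mine, not from the slides) that verifies it for the example code; it relies on the fact that if any codeword has another as a prefix, such a pair appears adjacently in sorted order:

    def is_prefix_code(codes):
        # Sufficient for unique decodability: no codeword is a prefix of another.
        words = sorted(codes.values())
        return all(not words[i + 1].startswith(words[i])
                   for i in range(len(words) - 1))

    codes = {"A": "a", "B": "ba", "C": "bb", "D": "bca", "E": "bcb", "F": "c"}
    print(is_prefix_code(codes))  # True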

3. Here, we will be interested in tries that represent binary prefix codes, where all internal vertices have two children. For example, items {A, B, C, D, E, F} could be represented by such a trie. (The trie figure is not reproduced in this transcript.)

4. Binary prefix code encoding algorithm:

    initialize a stack to be empty
    Read a data item ("a character") c.
    p = the leaf corresponding to c
    while p is not the root do begin
        Push onto the stack the bit that connects p to its parent.
        p = PARENT(p)
    end
    while the stack is not empty do
        write POP

Binary prefix code decoding algorithm:

    p = the root
    while p is not a leaf do begin
        Read a bit b.
        p = the child of p corresponding to b
    end
    write the character corresponding to p
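A runnable Python rendering of these two loops (a sketch under my own node representation; the class and names are not from the slides):

    class Node:
        def __init__(self, char=None, left=None, right=None):
            self.char, self.left, self.right, self.parent = char, left, right, None
            for bit, child in ((0, left), (1, right)):
                if child is not None:
                    child.parent, child.bit = self, bit

    def encode(leaf_of, c):
        # Walk from the leaf for c up to the root, stacking the connecting bits.
        p, stack, out = leaf_of[c], [], []
        while p.parent is not None:
            stack.append(p.bit)
            p = p.parent
        while stack:                     # popping reverses into root-to-leaf order
            out.append(stack.pop())
        return out

    def decode(root, bits):
        # Walk from the root down, one bit per step, until reaching a leaf.
        it, p = iter(bits), root
        while p.char is None:
            p = p.left if next(it) == 0 else p.right
        return p.char

    # Trie for A = 00, B = 01, C = 1; round-trip a single character.
    A, B, C = Node("A"), Node("B"), Node("C")
    root = Node(left=Node(left=A, right=B), right=C)
    print(decode(root, encode({"A": A, "B": B, "C": C}, "B")))  # B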

5. Huffman Tries

A trie is a Huffman trie for coding with an alphabet B of BSize ≥ 2 characters if:
a. All edges are labelled by a character of B.
b. Each non-leaf vertex has exactly BSize children.
c. Positive weights (probabilities) are associated with each leaf such that the sum of these weights is 1.
d. The sum over all the leaves of the leaf's depth times its weight is minimum among all tries satisfying conditions a, b, and c with the same collection of leaf weights.

Notes:
• We will limit our attention for the moment to encoding in binary (B = {0,1} and BSize = 2), which reflects the vast majority of practical applications.
• When drawing Huffman tries, we adopt the convention that the edge to the left child of a vertex is labelled 0 and the edge to its right child is labelled 1.
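Condition (d) is just the expected code length, since a leaf's depth equals the length of its codeword. A one-function sketch (my addition, not from the slides) that evaluates it for any code table:

    def expected_code_length(codes, weights):
        # Sum over all leaves of depth (= codeword length) times weight.
        return sum(len(codes[c]) * weights[c] for c in codes)

    print(expected_code_length({"A": "0", "B": "10", "C": "11"},
                               {"A": 0.5, "B": 0.25, "C": 0.25}))  # 1.5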

6. *** Huffman tries have become so well known that prefix codes in general are often referred to as Huffman codes, even when they are not optimal. We will often refer to a Huffman trie as an optimal Huffman trie when we want to add emphasis.

7. Some examples for ASize = 8. (The slide's example tries are garbled in this transcript and not recoverable.)

8. Building an optimal Huffman trie:

If an internal vertex v is labelled with the sum w of the weights of all of its descendant leaves, then to the parent of v, v "looks" just like a leaf of weight w. So we can use a "bottom-up" greedy algorithm that starts with each leaf being a separate tree in the forest, and then repeatedly combines the two trees of lowest weight until only one tree remains (a runnable sketch follows the pseudocode below).

Huffman trie construction algorithm:

    Initialize a FOREST of single-vertex binary tries, one for each weight.
    while the FOREST has more than one tree do begin
        Let X and Y be tries in the FOREST of lowest weight (ties broken arbitrarily).
        Create a new root r and make the roots of X and Y the children of r.
        Define the weight of r to be the sum of the weights of the roots of X and Y.
    end
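A runnable sketch of this construction using a priority queue (my code; build_huffman and the tuple-based trie are illustrative choices, not the slides' notation):

    import heapq
    from itertools import count

    def build_huffman(weights):
        # weights: dict mapping character -> positive weight (need not sum to 1).
        tiebreak = count()  # ties broken arbitrarily, but deterministically
        heap = [(w, next(tiebreak), c) for c, w in weights.items()]
        heapq.heapify(heap)
        while len(heap) > 1:                      # FOREST has more than one trie
            w1, _, x = heapq.heappop(heap)        # lowest-weight trie X
            w2, _, y = heapq.heappop(heap)        # second-lowest-weight trie Y
            heapq.heappush(heap, (w1 + w2, next(tiebreak), (x, y)))  # new root r
        root, codes = heap[0][2], {}
        def walk(node, prefix):
            if isinstance(node, tuple):           # internal vertex: left 0, right 1
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"       # degenerate one-character alphabet
        walk(root, "")
        return codes

    print(build_huffman({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))
    # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}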

9. Higher-Order Huffman Codes

It may be that the probability distribution for the next character depends heavily on the previous characters. For example, if we have just received the characters elephan in a stream of English text, it is more likely that the next character will be t than s, but if we had just received the characters hippopotamu, s would be more likely than t. One way to perform better on data with higher-order dependencies is to encode more than one character at a time. For example, to compress English text, if the alphabet is taken to be all possible pairs of ASCII characters, then a Huffman trie of 128 * 128 = 16,384 leaves could be constructed based on well-known statistics for English bigrams. In general, a Huffman trie for blocks of m characters for any m ≥ 1 can be employed, but at a cost of O(ASize^m) space. A slightly more "graceful" approach based on mth-order statistics is to have a different Huffman trie for each possible string of m−1 characters. That is, use the previous m−1 characters to select one of ASize^(m−1) Huffman tries, each using O(ASize) memory; a sketch follows.
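A sketch of the "graceful" per-context variant (illustrative only; it reuses the hypothetical build_huffman from the previous sketch and estimates the statistics by simple counting):

    from collections import defaultdict, Counter

    def context_code_tables(text, m):
        # One Huffman code table per (m-1)-character context seen in text.
        counts = defaultdict(Counter)
        for i in range(m - 1, len(text)):
            counts[text[i - m + 1:i]][text[i]] += 1
        # Raw counts serve as (unnormalized) leaf weights.
        return {ctx: build_huffman(dict(c)) for ctx, c in counts.items()}

    tables = context_code_tables("the three trees", m=2)
    print(tables["t"])  # table used whenever the previous character is 't'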

10. Adaptive Huffman Codes

The first n characters of an input file define a probability distribution; that is, if a given character c occurs m times in the first n characters, its probability can be defined to be m/n. So it is natural to consider coding a string where, after each character is encoded by the encoder and decoded by the decoder, both compute a new Huffman trie based on the probability distribution of the characters seen thus far. In theory, if we assume the alphabet size is a constant independent of the number of characters encoded and decoded, a simple linear-time adaptive Huffman encoding and decoding algorithm is to rerun the Huffman trie construction algorithm after each character is processed. However, it is possible to do much better than this in practice by making only small modifications to an optimal Huffman trie for the first i characters to obtain an optimal Huffman trie for the first i+1 characters.
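A naive rebuild-every-step sketch of the encoder side (mine; it reuses the hypothetical build_huffman above, and real adaptive coders patch the trie incrementally rather than rebuilding):

    def adaptive_encode(text):
        # Assumes encoder and decoder agree on the alphabet up front; each count
        # starts at 1 so every character has a codeword before it is first seen.
        counts = {c: 1 for c in sorted(set(text))}  # sorted: both sides must
        out = []                                    # break ties identically
        for c in text:
            codes = build_huffman(counts)  # optimal trie for counts seen so far
            out.append(codes[c])
            counts[c] += 1                 # the decoder makes the same update
        return "".join(out)

    print(adaptive_encode("mississippi"))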
