Greedy Algorithms, Continued
DPV Chapter 5, Part 2
Jim Royer
February 28, 2019
(Unless otherwise credited, all images are from DPV.)
Huffman Encoding, 1

A toy example:
◮ Suppose our alphabet is {A, B, C, D}.
◮ Suppose T is a text of 130 million characters.
◮ What is a shortest binary string representing T? (A hard question.)

Encoding 1: A ↦ 00, B ↦ 01, C ↦ 10, D ↦ 11. Total: 260 megabits.

Statistics on T:
Symbol   Frequency
A        70 million
B        3 million
C        20 million
D        37 million

Idea: Use variable-length codes, with A's code ≪ D's code ≪ B's code.
Encoding 2: A ↦ 0, B ↦ 100, C ↦ 101, D ↦ 11. Total: 213 megabits (17% better).

Q: How to unambiguously decode?
Q: How to come up with the code?
Q: How good is the result?

Huffman Encoding, 2

Definition: A prefix-free code is a code in which no codeword is the prefix of another.
Prefix-free codes can be represented by full binary trees (i.e., trees in which each non-leaf node has two children).

Example:
Symbol   Codeword
A        0
B        100
C        101
D        11

[Figure: the code tree for this table, with edges labeled 0 (left) and 1 (right) and leaves A, B, C, D.]

Question: How do you use such a tree to decode a file? Sample: 01101001010

Huffman Encoding, 3

Goal: Find an optimal coding tree for the frequencies given.

cost of a tree = ∑_{i=1}^{n} f[i] · (depth of the i-th symbol in the tree)
               = ∑_{i=1}^{n} f[i] · (# of bits required for the i-th symbol)

Assigning frequencies to all tree nodes:
(a) Leaf nodes get the frequency of their character.
(b) Internal nodes get the sum of the frequencies of the leaf nodes below them.

[Figure: the same tree with frequencies: leaves A [70], B [3], C [20], D [37]; internal nodes [23] (above B and C) and [60] (above [23] and D).]
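To make the decoding question concrete, here is a small Python sketch (mine, not from the slides) that checks the 260 vs. 213 megabit totals against the frequency table and decodes a bit string under Encoding 2 by scanning left to right; the dictionaries are just one convenient representation of the codeword table.

# A minimal sketch, assuming the frequencies and codewords above.
code = {"A": "0", "B": "100", "C": "101", "D": "11"}
freq = {"A": 70_000_000, "B": 3_000_000, "C": 20_000_000, "D": 37_000_000}

fixed_total = 2 * sum(freq.values())                   # Encoding 1: 2 bits per symbol
var_total = sum(freq[s] * len(code[s]) for s in freq)  # Encoding 2

def decode(bits, code):
    """Prefix-freeness means the first codeword match is always the right one."""
    inverse = {w: s for s, w in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:            # no codeword is a prefix of another
            out.append(inverse[buf])
            buf = ""
    assert buf == "", "leftover bits: input was not a codeword sequence"
    return "".join(out)

print(fixed_total, var_total)         # 260000000 213000000
print(decode("01101001010", code))    # ADABCA

The sample string 01101001010 from the slide decodes unambiguously because the scan can commit to a symbol the moment a codeword is matched; that is exactly what prefix-freeness buys.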

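The cost formula on the "Huffman Encoding, 3" slide can be checked directly on the example tree, where A sits at depth 1, D at depth 2, and B and C at depth 3. The nested-tuple tree below is my own representation, not notation from the slides.

# Cost of a coding tree: sum over leaves of freq[leaf] * depth(leaf).
# A leaf is a symbol; an internal node is a (left, right) pair.
tree = ("A", (("B", "C"), "D"))              # the tree drawn on the slide
freq = {"A": 70, "B": 3, "C": 20, "D": 37}   # in millions of characters

def cost(node, depth=0):
    if isinstance(node, str):                # leaf: contributes freq * depth
        return freq[node] * depth
    left, right = node
    return cost(left, depth + 1) + cost(right, depth + 1)

print(cost(tree))   # 70*1 + 3*3 + 20*3 + 37*2 = 213 megabits, matching Encoding 2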
Huffman Encoding, 4

Observation: In an optimal code tree, the two lowest-frequency characters must be the children of the lowest internal node. (Why? Try a replacement argument.)

Greedy Strategy: Find these two characters, build this node, and repeat (where some nodes are groups of characters as we go along).

Example frequencies: a: 45%, b: 13%, c: 12%, d: 16%, e: 9%, f: 5%.

[Figure: the first greedy step merges the two smallest frequencies f1 and f2 into a new node of frequency f1 + f2, leaving f3, f4, f5 untouched.]

Huffman Encoding, 5

procedure Huffman(f)
// Input: An array f[1..n] of freqs
// Output: An encoding tree with n leaves
H ← a priority queue of integers, ordered by f
for i ← 1 to n do insert(H, i, f[i])
for k ← n + 1 to 2n − 1 do
    i ← deletemin(H); j ← deletemin(H)
    create a node numbered k with children i, j
    f[k] ← f[i] + f[j]; insert(H, k, f[k])
return deletemin(H)

[Trace on board]

Huffman Encoding, 6

Runtime Analysis:
◮ initializing H: Θ(n) time
◮ for-loop iterations: n − 1
◮ deletemin's & insert's: cost O(log n) each
Total: Θ(n) + (n − 1) · O(log n) = O(n log n).

Huffman Encoding, 7: Correctness

Suppose x & y are the two chars with the smallest freqs, with f[x] ≤ f[y].

Lemma (1): There is an optimal code tree in which x and y have the same length and differ only in their last bit.

Proof. Suppose T is an optimal code tree, and let a and b be characters that are max-depth siblings in T with f[a] ≤ f[b]. Let T′ be the result of swapping a ↔ x and b ↔ y. Then:

cost(T) − cost(T′) = f[x]·(d_T(x) − d_T(a)) + f[y]·(d_T(y) − d_T(b))
                     + f[a]·(d_T(a) − d_T(x)) + f[b]·(d_T(b) − d_T(y))
                   = (f[a] − f[x])·(d_T(a) − d_T(x)) + (f[b] − f[y])·(d_T(b) − d_T(y))
                   ≥ 0,

since in each product both factors are nonnegative: x and y have the smallest frequencies, and a and b are at maximum depth. So cost(T) ≥ cost(T′). Since T is optimal, so is T′.

[Figure: x and y swapped into the positions of the max-depth siblings a and b.]
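The pseudocode on the "Huffman Encoding, 5" slide maps directly onto Python's heapq module. The sketch below is one possible rendering (the node representation and tie-breaking are my choices, not part of the slides); it returns the tree as nested pairs and runs in O(n log n) time, matching the analysis on the "Huffman Encoding, 6" slide.

# One possible Python rendering of procedure Huffman(f), using a binary heap
# as the priority queue.  Each heap entry is (frequency, tiebreak, node); a
# node is either a symbol (leaf) or a (left, right) pair (internal node).
import heapq

def huffman(freq):
    """freq: dict symbol -> frequency.  Returns the root of a code tree."""
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)                        # initialize H: Theta(n)
    counter = len(heap)                        # unique tie-breaker for equal freqs
    while len(heap) > 1:                       # n - 1 iterations
        f1, _, left = heapq.heappop(heap)      # two deletemin's ...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))   # ... one insert
        counter += 1
    return heap[0][2]

def codewords(node, prefix=""):
    """Read the codeword table off the tree (left = 0, right = 1)."""
    if not isinstance(node, tuple):
        return {node: prefix or "0"}           # degenerate one-symbol alphabet
    left, right = node
    table = codewords(left, prefix + "0")
    table.update(codewords(right, prefix + "1"))
    return table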

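As a stand-in for the "[Trace on board]" note, running the huffman and codewords helpers sketched above on the a-f frequencies from the "Huffman Encoding, 4" slide (45, 13, 12, 16, 9, 5, read as percentages) gives one optimal tree. Ties could be broken differently, so the exact bit patterns may vary between runs or implementations, but the optimal cost is the same.

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
table = codewords(huffman(freq))
# This run yields codeword lengths: a -> 1 bit; b, c, d -> 3 bits; e, f -> 4 bits.
total = sum(freq[s] * len(w) for s, w in table.items())
print(table)
print(total)   # 224 bits per 100 characters, i.e. 2.24 bits/char on average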
Huffman Encoding, 8: Correctness

Suppose x & y are the two chars with the smallest freqs, with f[x] ≤ f[y].

Lemma (2): Replace x and y by a new character z with frequency f[x] + f[y]. Suppose T′ is an optimal code tree for the new character set. Then swapping the z-node for a node with children x and y results in an optimal code tree T for the old character set.

[Figure: a node z: f[x]+f[y] standing in for a parent with children x: f[x] and y: f[y].]

Proof. First, cost(T) = cost(T′) + f[x] + f[y]. (Why?)
Suppose T″ is an optimal code tree for the old character set. WLOG, T″ has x and y as siblings of max depth. (Why?)
Replace the subtree rooted at x's and y's parent with a node for z of frequency f[x] + f[y], and call the resulting tree T‴. Then

cost(T‴) = cost(T″) − f[x] − f[y] ≤ cost(T) − f[x] − f[y] = cost(T′).

But as T′ is optimal, so is T‴. ∴ cost(T) = cost(T″), and T is also optimal.

Huffman Encoding, 9: Correctness

Lemma 1: The greedy choice is safe.
Lemma 2: Optimal code trees have optimal substructure.

procedure Huffman(f)
// Input: An array f[1..n] of freqs
// Output: An encoding tree with n leaves
H ← a priority queue of integers, ordered by f
for i ← 1 to n do insert(H, i, f[i])
for k ← n + 1 to 2n − 1 do
    i ← deletemin(H); j ← deletemin(H)          // Safe by Lemma 1
    create a node numbered k with children i, j
    f[k] ← f[i] + f[j]; insert(H, k, f[k])      // Safe by Lemma 2
return deletemin(H)

Improving on Huffman: LZ Compression

◮ LZ = Abraham Lempel and Jacob Ziv
◮ The rough idea: Start with Huffman, but ...
  • Keep statistics on frequencies in a sliding window of a few K.
  • Keep readjusting the Huffman coding to fit the frequencies of the sliding window (and note the change in coding in the compressed file).
◮ Huffman ≈ LZ with the sliding window = the whole file.
◮ There are many variations on this; see http://en.wikipedia.org/wiki/LZ77_and_LZ78.

Propositional Logic

◮ The formulas of propositional logic are given by the grammar:
    P ::= Var | ¬P | P ∧ P | P ∨ P | P ⇒ P
    Var ::= standard syntax
◮ A truth assignment is a function I : Variables → {False, True}.
◮ A truth assignment I determines the value of a formula as follows:
    I[[x]] = True iff I(x) = True   (x a variable)
    I[[¬p]] = True iff I[[p]] = False
    I[[p ∧ q]] = True iff I[[p]] = I[[q]] = True
    I[[p ∨ q]] = True iff I[[p]] = True or I[[q]] = True
    I[[p ⇒ q]] = True iff I[[p]] = False or I[[q]] = True
◮ A satisfying assignment for a formula p is an I with I[[p]] = True.
◮ Finding satisfying assignments for general propositional formulas seems hard. (See Chapter 8.)
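The truth-assignment semantics above translates almost line for line into code. A minimal sketch (Python, with my own nested-tuple representation of formulas; nothing here is prescribed by the slides):

# Formulas as nested tuples: a variable is a string; otherwise
# ("not", p), ("and", p, q), ("or", p, q), or ("implies", p, q).
def value(formula, I):
    """Evaluate a formula under truth assignment I: dict var -> bool."""
    if isinstance(formula, str):            # I[[x]] = True iff I(x) = True
        return I[formula]
    op = formula[0]
    if op == "not":
        return not value(formula[1], I)     # I[[not p]] = True iff I[[p]] = False
    p, q = value(formula[1], I), value(formula[2], I)
    if op == "and":
        return p and q
    if op == "or":
        return p or q
    if op == "implies":
        return (not p) or q                 # false only when p is True and q is False
    raise ValueError("unknown connective: " + op)

# A satisfying assignment for (x or y) and (not x):
f = ("and", ("or", "x", "y"), ("not", "x"))
print(value(f, {"x": False, "y": True}))    # True

Evaluating a formula under one assignment is easy; the "seems hard" remark is about searching for a satisfying assignment, where brute force over all 2^n assignments blows up quickly (this is SAT, taken up in Chapter 8).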
