

  1. CS 3000: Algorithms & Data • Jonathan Ullman • Lecture 19: Data Compression • Greedy Algorithms: Huffman Codes • Apr 8, 2020

  2. Data Compression • How do we store strings of text compactly? • A binary code is a mapping enc: Σ → {0,1}* • Simplest code: assign numbers 1, 2, …, |Σ| to each symbol and map each number to binary using ⌈log₂ |Σ|⌉ bits (e.g., for a 26-letter alphabet: A → 00000, B → 00001, …) • Morse Code: a variable-length code (e.g., E = ●, T = –)
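
A minimal sketch of this "simplest code" in Python (the helper name fixed_length_code is my own, not from the slides):

```python
# Number the symbols 0, 1, ..., |alphabet|-1 and write each number
# in binary using ceil(log2(|alphabet|)) bits.
from math import ceil, log2

def fixed_length_code(alphabet):
    width = ceil(log2(len(alphabet)))
    return {sym: format(i, f"0{width}b") for i, sym in enumerate(alphabet)}

print(fixed_length_code("abcd"))  # {'a': '00', 'b': '01', 'c': '10', 'd': '11'}
```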

  3. Data Compression • Letters have uneven frequencies! • Want to use short encodings for frequent letters, long encodings for infrequent letters

                  a    b    c    d     avg. len.
      Frequency   1/2  1/4  1/8  1/8
      Encoding 1  00   01   10   11    2.0
      Encoding 2  0    10   110  111   1.75

    (Encoding 2: ½·1 + ¼·2 + ⅛·3 + ⅛·3 = 1.75)
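
A quick check of the table's averages (a sketch; the avg_len helper is mine, not the slides'):

```python
# Average bits per letter: sum over symbols of frequency * codeword length.
freqs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
enc1 = {"a": "00", "b": "01", "c": "10", "d": "11"}
enc2 = {"a": "0", "b": "10", "c": "110", "d": "111"}

def avg_len(freqs, code):
    return sum(f * len(code[x]) for x, f in freqs.items())

print(avg_len(freqs, enc1))  # 2.0
print(avg_len(freqs, enc2))  # 1.75
```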

  4. Data Compression • What properties would a good code have? • Easy to encode a string: Encode(KTS) = – ● – – ● ● ● • The encoding is short on average: few average bits per letter given the frequencies; here ≤ 4 bits per letter (30 symbols max!) • Easy to decode a string? Decode( – ● – – ● ● ● ) = ? Many possibilities: KTS, but also TETTEEE, and more

  5. Prefix-Free Codes • Cannot decode if there are ambiguities • e.g. enc(“6”) is a prefix of enc(“9”) • Prefix-Free Code: a binary code enc: Σ → {0,1}* such that for every x ≠ y ∈ Σ, enc(x) is not a prefix of enc(y) • Any fixed-length code is prefix-free • (figure: a prefix-free variable-length code, e.g. a → 0, b → 10, c → 110, d → 111)
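
A small prefix-freeness checker (a sketch; is_prefix_free is my name). After sorting the codewords, any prefix relation must occur between lexicographically adjacent strings, so checking neighbors suffices:

```python
def is_prefix_free(code):
    # A prefix sorts immediately before its extensions, so it is enough
    # to compare each codeword with its successor in sorted order.
    words = sorted(code.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free({"a": "0", "b": "10", "c": "110", "d": "111"}))  # True
print(is_prefix_free({"a": "0", "b": "01", "c": "11"}))               # False
```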

  6. Prefix-Free Codes • Can represent a prefix-free code as a binary tree: each leaf is labeled with a symbol, and each codeword is the sequence of 0/1 edge labels on the path to that leaf • Encode by going up the tree (or using a table): e.g. with a → 0, b → 10, c → 110, d → 111, “dab” → 111 0 10 • Decode by going down the tree: follow the bits from the root, and each time you reach a leaf, output its symbol and restart at the root
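
A sketch of tree-based decoding (the nested-tuple tree representation is my own choice, not the slides'):

```python
# Internal nodes are (left, right) pairs; leaves are symbol strings.
# This tree encodes a -> 0, b -> 10, c -> 110, d -> 111.
tree = ("a", ("b", ("c", "d")))

def decode(bits, tree):
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):  # reached a leaf:
            out.append(node)       # emit its symbol and restart at the root
            node = tree
    return "".join(out)

print(decode("111010", tree))  # "dab"
```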

  7. Huffman Codes • (An algorithm to find) an optimal prefix-free code • len(T) = Σ_{x∈Σ} f_x · len_T(x) is the average number of bits per letter; an optimal code minimizes len(T) • Note, optimality depends on what you’re compressing • H is the 8th most frequent letter in English (6.094%) but the 20th most frequent in Italian (0.636%)

                 a    b    c    d
     Frequency   1/2  1/4  1/8  1/8
     Encoding    0    10   110  111

    (len = f_a·1 + f_b·2 + f_c·3 + f_d·3 = 1.75)

  8. Huffman Codes • First Try: split letters into two sets of roughly equal frequency and recurse • Balanced binary trees should have low depth

          a    b    c    d    e
          .32  .25  .20  .18  .05

    Splitting into equal-frequency halves {a, d} (.50) and {b, c, e} (.50) gives a → 00, d → 01, b → 10, c → 110, e → 111

  9. Huffman Codes • First Try: split letters into two sets of roughly equal frequency and recurse

          a    b    c    d    e
          .32  .25  .20  .18  .05

    • first try: len = 2.25 • optimal: len = 2.23 • The optimal tree gives the shortest codeword to the most frequent letter

  10. Huffman Codes • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse

          a    b    c    d    e
          .32  .25  .20  .18  .05

    (Merge steps: {d, e} = .23; {c, {d, e}} = .43; {a, b} = .57; root = 1.0)
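
A runnable sketch of Huffman’s algorithm using Python’s heapq (the tree representation and names are mine; only the pair-up-the-two-lowest rule comes from the slide):

```python
import heapq
from itertools import count

def huffman(freqs):
    # Heap of (frequency, tiebreak, tree); the counter keeps the heap
    # from ever comparing two trees directly.
    tiebreak = count()
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two lowest-frequency trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    code = {}
    def walk(node, bits):
        if isinstance(node, str):
            code[node] = bits or "0"     # lone-symbol edge case
        else:
            walk(node[0], bits + "0")
            walk(node[1], bits + "1")
    walk(heap[0][2], "")
    return code

freqs = {"a": .32, "b": .25, "c": .20, "d": .18, "e": .05}
code = huffman(freqs)
print(code)
print(sum(f * len(code[x]) for x, f in freqs.items()))  # ~2.23, as on slide 9
```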

  11. Huffman Codes • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse • Theorem: Huffman’s Algorithm produces a prefix-free code of optimal length • We’ll prove the theorem using an exchange argument

  12. Huffman Codes • Theorem: Huffman’s Alg produces an optimal prefix-free code • (1) In an optimal prefix-free code (a tree), every internal node has exactly two children (otherwise, contracting a one-child node would shorten every codeword below it)

  13. (Handwritten sketch of (2):) In the optimal code, the two least frequent letters are siblings at the lowest depth d; if they were not, swapping them with deeper, more frequent leaves would shorten the code, which can’t happen in an optimal code

  14. Huffman Codes • Theorem: Huffman’s Alg produces an optimal prefix-free code • (2) If x, y have the lowest frequency, then there is an optimal code where x, y are siblings and are at the bottom of the tree • (Handwritten sketch: suppose someone gave you the optimal tree, but without labels; labeling the shallowest leaves with the most frequent symbols, working downward, can only help, so the two sibling leaves at the lowest depth go to the two least frequent items. Some optimal code therefore fills those siblings with x and y.)

  15. Huffman Codes • Theorem: Huffman’s Alg produces an optimal prefix-free code • Proof by Induction on the Number of Letters in Σ: • Base case (|Σ| = 2): rather obvious; the Huffman algorithm outputs the code {0, 1}, and one bit per letter is optimal • Inductive step: if Huffman is optimal for k − 1 letters, then it is optimal for k letters. Suppose we have k frequencies f_1, …, f_k and merge the two lowest

  16. (Handwritten sketch:) Let T be the Huffman code for Σ and T′ the Huffman code for Σ′, the alphabet with letters 1, 2 merged • len(T) = len(T′) + f_1 + f_2, since splitting the merged leaf back into 1 and 2 adds one bit to each, paid f_1 + f_2 times in total • By the inductive hypothesis, T′ minimizes len over prefix-free codes for Σ′ • Suppose U is an optimal code for Σ; by (2), assume 1 and 2 are siblings at the lowest level of U, and let U′ be U with them merged, so len(U′) = len(U) − f_1 − f_2

  17. Huffman Codes • Theorem: Huffman’s Alg produces an optimal prefix-free code • Proof by Induction on the Number of Letters in Σ : • Inductive Hypothesis:

  18. Huffman Codes • Theorem: Huffman’s Alg produces an optimal prefix-free code • Proof by Induction on the Number of Letters in Σ: • Inductive Hypothesis: • Without loss of generality, frequencies are f_1, …, f_k, and the two lowest are f_1, f_2 • Merge 1, 2 into a new letter k + 1 with f_{k+1} = f_1 + f_2

  19. Huffman Codes • Theorem: Huffman’s Alg produces an optimal prefix-free code • Proof by Induction on the Number of Letters in Σ: • Inductive Hypothesis: • Without loss of generality, frequencies are f_1, …, f_k, and the two lowest are f_1, f_2 • Merge 1, 2 into a new letter k + 1 with f_{k+1} = f_1 + f_2 • By induction, if T′ is the Huffman code for f_3, …, f_{k+1}, then T′ is optimal • Need to prove that T is optimal for f_1, …, f_k

  20. Huffman Codes • Theorem: Huffman’s Alg produces an optimal prefix-free code • If T′ is optimal for f_3, …, f_{k+1}, then T is optimal for f_1, …, f_k
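
The inequality chain behind this step, written out (a sketch in the slides’ notation, where U is any optimal code for f_1, …, f_k with 1, 2 as bottom-level siblings and U′ is U with them merged):

```latex
\begin{align*}
\mathrm{len}(T) &= \mathrm{len}(T') + f_1 + f_2 && \text{split letter $k+1$ back into $1, 2$}\\
                &\le \mathrm{len}(U') + f_1 + f_2 && \text{$T'$ is optimal for $f_3, \dots, f_{k+1}$}\\
                &= \mathrm{len}(U) && \text{merging siblings $1, 2$ in $U$ saves $f_1 + f_2$}
\end{align*}
```

So len(T) ≤ len(U), and T is optimal.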

  21. An Experiment • Take the Dickens novel A Tale of Two Cities • File size is 799,940 bytes • Build a Huffman code and compress • File size is now 439,688 bytes

              Raw       Huffman
      Size    799,940   439,688

  22. Huffman Codes • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse • Theorem: Huffman’s Algorithm produces a prefix-free code of optimal length • In what sense is this code really optimal? (Bonus material… will not test you on this)

  23. Length of Huffman Codes • What can we say about Huffman code length? • Suppose f_x = 2^(−ℓ_x) for every x ∈ Σ, where each ℓ_x is an integer • Then, len_T(x) = ℓ_x for the optimal Huffman code • Proof by example:

      letter   a    b    c    d
      freq     1/2  1/4  1/8  1/8
      code     0    10   110  111
      len      1    2    3    3

  24. Length of Huffman Codes • What can we say about Huffman code length? • Suppose f_x = 2^(−ℓ_x) for every x ∈ Σ • Then, len_T(x) = ℓ_x for the optimal Huffman code, and since ℓ_x = log₂(1/f_x):

      len(T) = Σ_{x∈Σ} f_x · log₂(1/f_x)

  25. Entropy • Given a set of frequencies (aka a probability distribution) the entropy is

      H(f) = Σ_{x∈Σ} f_x · log₂(1/f_x)

    (the per-letter length of the Huffman code in the dyadic case above) • Entropy is a “measure of randomness”
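
A one-function sketch (the helper name entropy is mine):

```python
from math import log2

def entropy(freqs):
    # H(f) = sum over x of f_x * log2(1 / f_x); zero-frequency terms contribute 0.
    return sum(f * log2(1 / f) for f in freqs.values() if f > 0)

print(entropy({"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}))  # 1.75 bits/letter
```

This matches the 1.75 average length of the optimal code for these frequencies from slide 7.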

  26. Entropy • Given a set of frequencies (aka a probability distribution) the entropy is

      H(f) = Σ_{x∈Σ} f_x · log₂(1/f_x)

    (how random is the text?) • Entropy is a “measure of randomness” • Entropy was introduced by Shannon in 1948 and is the foundational concept in: • Data compression • Error correction (communicating over noisy channels) • Security (passwords and cryptography)

  27. Entropy of Passwords • Your password is a specific string, so its distribution puts f = 1.0 on that one string (entropy 0) • To talk about security of passwords, we have to model them as random • Random 16-letter string: H = 16 · log₂ 26 ≈ 75.2 • Random IMDb movie: H = log₂ 1,764,727 ≈ 20.7 • Your favorite IMDb movie: H ≪ 20.7 • Entropy measures how difficult passwords are to guess “on average”
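
Checking the slide’s numbers:

```python
from math import log2

print(16 * log2(26))    # ~75.2 bits: uniform 16-letter lowercase string
print(log2(1_764_727))  # ~20.7 bits: uniform choice among IMDb titles
```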

  28. Entropy of Passwords

  29. Entropy and Compression • Given a set of frequencies (probability distribution) the entropy is

      H(f) = Σ_{x∈Σ} f_x · log₂(1/f_x)

    • Suppose that we generate a string S by choosing n random letters independently with frequencies f • Any compression scheme requires at least H(f) bits-per-letter to store S (as n → ∞) • Huffman codes are truly optimal!

  30. But Wait! • Take the Dickens novel A Tale of Two Cities • File size is 799,940 bytes • Build a Huffman code and compress • File size is now 439,688 bytes • But we can do better!

              Raw       Huffman   gzip      bzip2
      Size    799,940   439,688   301,295   220,156

  31. What do the frequencies represent? • Real data (e.g. natural language, music, images) have patterns between letters • U becomes a lot more common after a Q • Possible approach: model pairs of letters • Build a Huffman code for pairs-of-letters • Improves compression ratio, but the tree gets bigger • Can only model certain types of patterns (see the sketch below) • Zip is based on a Lempel–Ziv-style algorithm (LZ77 plus Huffman coding; LZW is a close relative) that tries to identify repeated patterns in the data
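
A rough way to see how much pair modeling could help: compare per-letter entropy with half the per-pair entropy (a sketch; the sample text and helper names are mine, not the slides'):

```python
from collections import Counter
from math import log2

def entropy_of(counts):
    # Empirical entropy of a count table: sum of p * log2(1/p).
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

text = "the quick brown fox jumps over the lazy dog " * 200
letters = Counter(text)
pairs = Counter(text[i:i + 2] for i in range(0, len(text) - 1, 2))

print(entropy_of(letters))    # bits per letter, letters modeled independently
print(entropy_of(pairs) / 2)  # bits per letter when modeling pairs
```

When adjacent letters are correlated, the second number is smaller, which is exactly the gain a pairs-of-letters Huffman code can capture.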
