CSE 417 Algorithms: Huffman Codes

Huffman Codes: An Optimal Data Compression Method
CSE 417 Algorithms, Winter 2006 (CSE 417, Wi ’06, Ruzzo)

Compression Example

Letter frequencies: a 45%, b 13%, c 12%, d 16%, e 9%, f 5%

100k file, 6 letter alphabet. File size:
  - ASCII, 8 bits/char: 800 kbits
  - 2^3 > 6, so 3 bits/char: 300 kbits
  - 00, 01, 10 for a, b, d; 11xx for c, e, f: 74%*2 + 26%*4 = 2.52 bits/char, 252 kbits
  - Optimal?

Why? Storage, transmission vs 1 GHz cpu.

Data Compression

  - Binary character code ("code"): each k-bit source string maps to a unique code word (e.g. k=8)
  - "Compression" alg: concatenate code words for successive k-bit "characters" of source
  - Fixed/variable length codes: are all code words of equal length?
  - Prefix codes: no code word is a prefix of another (simplifies decoding)

Prefix Codes = Trees

[Figure: code trees; the example bit strings 101000001 and 11000101 each decode to f a b.]
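The 2.52 bits/char figure and the "simplifies decoding" remark can be checked with a short Python sketch. The slide only gives "11xx" for c, e, f, so the three 4-bit words below are an assumed choice; `is_prefix_free`, `weighted_bits_percent`, `encode` and `decode` are illustrative helper names, not part of the lecture.

```python
# Letter frequencies from the slide, in percent (integers keep the arithmetic exact).
FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}

# 00, 01, 10 for a, b, d as on the slide; the concrete 11xx words are an assumption.
CODE = {"a": "00", "b": "01", "d": "10", "c": "1100", "e": "1101", "f": "1110"}


def is_prefix_free(code):
    """True if no code word is a prefix of another (or a duplicate of it)."""
    words = sorted(code.values())
    # In sorted order, a word that is a prefix of any other word is
    # immediately followed by a word that starts with it.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


def weighted_bits_percent(code, freq):
    """Sum of freq% * code length; divide by 100 to get bits per char."""
    return sum(freq[ch] * len(word) for ch, word in code.items())


def encode(text, code):
    """Concatenate the code words of successive characters."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Greedy left-to-right decoding; unambiguous only for prefix codes."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return "".join(out)


if __name__ == "__main__":
    total = weighted_bits_percent(CODE, FREQ)
    print("bits per char:", total / 100)
    print("100k-char file, kbits:", 100 * total / 100)
    print("prefix free:", is_prefix_free(CODE))
    print("round trip:", decode(encode("fab", CODE), CODE))
```

Because no code word is a prefix of another, the decoder never has to look ahead or backtrack: the first code word that matches is the only one that can match.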

Greedy Idea #1

  - Put most frequent under root, then recurse
  - Too greedy: unbalanced tree
  - .45*1 + .16*2 + .13*3 + ... = 2.34: not too bad, but imagine if all freqs were ~1/6: (1+2+3+4+5+5)/6 = 3.33

[Figure: unbalanced tree with root 100; a:45 hangs directly under the root, then d:16, then b:13, and so on.]

Greedy Idea #2

  - Divide letters into 2 groups, with ~50% weight in each; recurse (Shannon-Fano code)
  - Again, not terrible: 2*.5 + 3*.5 = 2.5
  - But this tree can easily be improved! (How?)

[Figure: root 100 splits into two subtrees of weight 50. One holds a:45 and f:5; the other splits into 25 (c:12, b:13) and 25 (d:16, e:9).]

Greedy Idea #3

  - Group least frequent letters near bottom

[Figure: partial tree with root 100; b:13 and c:12 are merged into a node of weight 25, and e:9 and f:5 into a node of weight 14.]
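The bits-per-char figures quoted for Greedy Ideas #1 and #2 can be re-derived from leaf depths alone. The sketch below is only a check of that arithmetic: the depth table for Idea #2 is read off the tree figure as reconstructed here, and `chain_depths` and `cost_percent` are hypothetical helper names.

```python
# Letter frequencies from the slides, in percent.
FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}


def cost_percent(depths, freq):
    """Sum of freq% * depth; divide by 100 to get bits per char."""
    return sum(freq[c] * depths[c] for c in freq)


def chain_depths(freq):
    """Idea #1: hang the most frequent remaining letter under the root, recurse.

    The result is a chain: depths 1, 2, 3, ... with the last two letters
    sharing the deepest level.
    """
    order = sorted(freq, key=freq.get, reverse=True)
    n = len(order)
    return {c: min(i + 1, n - 1) for i, c in enumerate(order)}


# Idea #2 (Shannon-Fano style split): {a, f} at depth 2, the rest at depth 3.
IDEA2_DEPTHS = {"a": 2, "f": 2, "b": 3, "c": 3, "d": 3, "e": 3}

if __name__ == "__main__":
    idea1 = chain_depths(FREQ)
    print("Idea #1 bits/char:", cost_percent(idea1, FREQ) / 100)
    print("Idea #2 bits/char:", cost_percent(IDEA2_DEPTHS, FREQ) / 100)
    # Same chain shape, but with all six frequencies equal to 1/6.
    print("Idea #1, uniform freqs:", round(sum(idea1.values()) / 6, 2))
```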

.45*1 + .41*3 + .14*4 = 2.24 bits per char

Huffman’s Algorithm (1952)

Algorithm:
    insert node for each letter into priority queue by freq
    while queue length > 1 do
        remove smallest 2; call them x, y
        make new node z from them, with f(z) = f(x) + f(y)
        insert z into queue

Analysis: O(n) heap ops: O(n log n)

Goal: Minimize $B(T) = \sum_{c \in C} \mathrm{freq}(c) \cdot \mathrm{depth}(c)$

Correctness: ???

Correctness Strategy

  - Optimal solution may not be unique, so cannot prove that greedy gives the only possible answer.
  - Instead, show that greedy’s solution is as good as any.
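The priority-queue loop above can be sketched in Python with the standard `heapq` module. This is a minimal illustration, not the lecture's own code: the tree representation (a letter, or a pair of subtrees), the tie-breaking counter, and the names `huffman_tree` and `code_words` are assumptions made for the example, and the alphabet is assumed to be non-empty.

```python
import heapq
from itertools import count

# Letter frequencies from the slides, in percent.
FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}


def huffman_tree(freq):
    """Build a Huffman tree; a node is either a letter or a (left, right) pair."""
    tiebreak = count()  # unique second field, so nodes themselves are never compared
    heap = [(f, next(tiebreak), letter) for letter, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)  # smallest
        fy, _, y = heapq.heappop(heap)  # second smallest
        # New node z with f(z) = f(x) + f(y).
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    return heap[0][2]


def code_words(tree, prefix=""):
    """Read code words off the tree: 0 for the left edge, 1 for the right edge."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}  # a one-letter alphabet still needs one bit
    left, right = tree
    codes = code_words(left, prefix + "0")
    codes.update(code_words(right, prefix + "1"))
    return codes


if __name__ == "__main__":
    codes = code_words(huffman_tree(FREQ))
    for letter in sorted(codes):
        print(letter, codes[letter])
    total = sum(FREQ[c] * len(w) for c, w in codes.items())
    print("bits per char:", total / 100)
```

Each iteration does a constant number of heap operations, and there are n - 1 iterations, which matches the O(n log n) bound stated in the analysis.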

Defn: A pair of leaves x, y is an inversion if depth(x) ≥ depth(y) and freq(x) ≥ freq(y).

Claim: If we flip an inversion, cost never increases.

Why? All other things being equal, better to give the more frequent letter the shorter code. Cost before the swap minus cost after:

$$(d(x) f(x) + d(y) f(y)) - (d(x) f(y) + d(y) f(x)) = (d(x) - d(y)) (f(x) - f(y)) \ge 0$$

I.e. non-negative cost savings.

Lemma 1: "Greedy Choice Property"

The 2 least frequent letters might as well be siblings at the deepest level.

  - Let a be least freq, b 2nd
  - Let u, v be siblings at max depth, f(u) ≤ f(v) (why must they exist?)
  - Then (a,u) and (b,v) are inversions. Swap them.

Lemma 2: "Optimal Substructure"

Let (C, f) be a problem instance: C an n-letter alphabet with letter frequencies f(c) for c in C. For any x, y in C, let C’ be the (n-1) letter alphabet C - {x,y} ∪ {z} and for all c in C’ define

$$f'(c) = \begin{cases} f(c), & \text{if } c \ne x, y, z \\ f(x) + f(y), & \text{if } c = z \end{cases}$$

Let T’ be an optimal tree for (C’,f’). Then T, obtained from T’ by replacing leaf z with an internal node whose children are x and y, is optimal for (C,f) among all trees having x,y as siblings.

Proof: With $B(T) = \sum_{c \in C} d_T(c) f(c)$,

$$B(T) - B(T') = d_T(x) (f(x) + f(y)) - d_{T'}(z) f'(z) = (d_{T'}(z) + 1) f'(z) - d_{T'}(z) f'(z) = f'(z)$$

Suppose $\hat{T}$ (having x & y as siblings) is better than T, i.e. $B(\hat{T}) < B(T)$. Collapse x & y to z, forming $\hat{T}'$; as above:

$$B(\hat{T}) - B(\hat{T}') = f'(z)$$

Then:

$$B(\hat{T}') = B(\hat{T}) - f'(z) < B(T) - f'(z) = B(T')$$

Contradicting optimality of T’.
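Both identities used above are easy to spot-check numerically. The sketch below tests the swap identity over a small grid of integer depths and frequencies, and checks B(T) - B(T') = f'(z) on the six-letter example with x = e, y = f. The depth table is the one behind the 2.24 bits/char figure, and `swap_saving` and `collapse` are hypothetical helper names.

```python
from itertools import product

# Letter frequencies from the slides, in percent.
FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}

# Leaf depths of the tree with cost .45*1 + .41*3 + .14*4.
DEPTHS = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}


def cost(depths, freq):
    """B(T) = sum over letters of freq * depth (here in percent units)."""
    return sum(freq[c] * depths[c] for c in freq)


def swap_saving(dx, fx, dy, fy):
    """Cost before minus cost after exchanging the positions of leaves x and y."""
    before = dx * fx + dy * fy
    after = dx * fy + dy * fx
    return before - after


def collapse(depths, freq, x, y, z="z"):
    """Replace sibling leaves x, y by one leaf z one level up, with f'(z) = f(x) + f(y).

    The caller is responsible for x and y actually being siblings in the tree.
    """
    new_depths = {c: d for c, d in depths.items() if c not in (x, y)}
    new_freq = {c: f for c, f in freq.items() if c not in (x, y)}
    new_depths[z] = depths[x] - 1
    new_freq[z] = freq[x] + freq[y]
    return new_depths, new_freq


# Swap identity: the saving always equals (d(x) - d(y)) * (f(x) - f(y)).
identity_holds = all(
    swap_saving(dx, fx, dy, fy) == (dx - dy) * (fx - fy)
    for dx, fx, dy, fy in product(range(1, 6), repeat=4)
)

if __name__ == "__main__":
    print("swap identity holds on the grid:", identity_holds)
    d2, f2 = collapse(DEPTHS, FREQ, "e", "f")
    print("B(T) - B(T'):", cost(DEPTHS, FREQ) - cost(d2, f2), " f'(z):", f2["z"])
```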

Theorem: Huffman gives optimal codes

Proof: induction on |C|

  - Basis (n = 1, 2): immediate
  - Induction: n > 2
      - Let x, y be least frequent
      - Form C’, f’, & z, as above
      - By induction, T’ is opt for (C’,f’)
      - By lemma 2, T’ → T is opt for (C,f) among trees with x,y as siblings
      - By lemma 1, some opt tree has x, y as siblings
      - Therefore, T is optimal.

Data Compression

  - Huffman is optimal.
  - BUT still might do better!
      - Huffman encodes fixed length blocks. What if we vary them?
      - Huffman uses one encoding throughout a file. What if characteristics change?
      - What if data has structure? E.g. raster images, video, …
      - Huffman is lossless. Necessary?
  - LZW, MPEG, …

David A. Huffman, 1925-1999
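The optimality theorem can be sanity-checked by brute force on small alphabets. The sketch below relies on a consequence of the identity B(T) = B(T') + f'(z) from Lemma 2: applied repeatedly, it says the cost of a tree equals the sum of the weights of its internal nodes, so trying every possible sequence of sibling merges covers every tree. This is an illustrative check on a few hand-picked inputs, not a proof, and the function names are hypothetical.

```python
import heapq
from itertools import combinations


def huffman_cost(weights):
    """Cost of the Huffman tree: sum of the weights of all merged (internal) nodes."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        z = heapq.heappop(heap) + heapq.heappop(heap)
        total += z
        heapq.heappush(heap, z)
    return total


def brute_force_cost(weights):
    """Minimum cost over all trees, trying every pair as the next siblings to merge."""
    if len(weights) <= 1:
        return 0
    best = None
    for i, j in combinations(range(len(weights)), 2):
        merged = weights[i] + weights[j]
        rest = [w for k, w in enumerate(weights) if k not in (i, j)]
        candidate = merged + brute_force_cost(rest + [merged])
        if best is None or candidate < best:
            best = candidate
    return best


if __name__ == "__main__":
    # The first instance is the lecture's alphabet; the others are arbitrary small cases.
    for instance in ([45, 13, 12, 16, 9, 5], [1, 1, 1, 1, 1, 1], [1, 2, 4, 8, 16]):
        print(instance, huffman_cost(instance), brute_force_cost(instance))
```

The brute-force search grows very quickly with alphabet size, so it is only practical for a handful of letters.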


