CSE 312 Foundations of Computing II
Lecture 16: Information Theory and Data Compression
Stefano Tessaro, tessaro@cs.washington.edu
Announcements
• Office hours: I am available 1-3pm.
• Please make sure to read the instructions for the midterm.
• Practice midterm solutions will be posted in the afternoon.
Today
How much can we compress data? How much information is really contained in data?
This is the central topic of information theory, a discipline based on probability that has been extremely useful across electrical engineering, computer science, statistics, physics, …
Claude Shannon, “A Mathematical Theory of Communication”, 1948
http://www.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
Encoding Scheme
An encoding scheme is a pair of maps
  enc: 𝒳 → {0,1}*    dec: {0,1}* → 𝒳
where a value x is encoded as y = enc(x).
Decodability. For all values x ∈ 𝒳: dec(enc(x)) = x.
Goal: The encoding should “compress”. [We will formalize this using the language of probability theory.]
Encoding – Example
Say we need to encode a word from the set 𝒳 = {hello, world, cse312}. Three possible codes:
           enc₁    enc₂    enc₃
  hello    0       0       0
  world    1       10      11
  cse312   11      11      100000000
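Not part of the slides: a minimal sketch of how such a scheme can be represented and checked for decodability, using the code enc₂ from the table. The Python representation and function names are my own illustration.

```python
# Sketch: represent enc2 from the example as a dictionary and check decodability.
# The dictionary contents come from the slide; everything else is illustrative.

ENC2 = {"hello": "0", "world": "10", "cse312": "11"}

def encode(word: str) -> str:
    """Map a word to its codeword (enc)."""
    return ENC2[word]

def decode(bits: str) -> str:
    """Map a codeword back to the word (dec), by inverting the dictionary."""
    inverse = {code: word for word, code in ENC2.items()}
    return inverse[bits]

# Decodability: dec(enc(x)) = x for every x in the alphabet.
assert all(decode(encode(w)) == w for w in ENC2)
```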
Better Visualization – Trees
A binary code can be drawn as a binary tree: edges are labeled 0 and 1, and each codeword corresponds to the node reached by reading its bits from the root.
[Figure: two code trees, one for {hello → 0, world → 10, cse312 → 11} and one for {hello → 0, world → 1, cse312 → 11}.]
Focus – Prefix-free Codes
A code is prefix-free if no encoding is a prefix of another one, i.e., in the tree every codeword is a leaf.
[Figure: {hello → 0, world → 1, cse312 → 11} is not prefix-free, since 1 is a prefix of 11; {hello → 0, world → 10, cse312 → 11} is prefix-free.]
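A small sketch (my own illustration, not from the slides) of a direct prefix-freeness check, applied to the two codes from the figure:

```python
def is_prefix_free(codewords) -> bool:
    """Return True if no codeword is a prefix of another (distinct) codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(is_prefix_free(["0", "1", "11"]))   # False: 1 is a prefix of 11
print(is_prefix_free(["0", "10", "11"]))  # True
```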
Random Variables – Arbitrary Values
We will consider random variables X: Ω → 𝒳 taking values in a (finite) set 𝒳. [We refer to these as random variables “over the alphabet 𝒳”.]
Example: 𝒳 = {hello, world, cse312} with p_X(hello) = 1/2, p_X(world) = 1/4, p_X(cse312) = 1/4.
The Data Compression Problem
Data = a random variable X over an alphabet 𝒳, encoded as Y = enc(X), where
  enc: 𝒳 → {0,1}*    dec: {0,1}* → 𝒳
Two goals:
1. Decodability. For all values x ∈ 𝒳: dec(enc(x)) = x.
2. Minimal length. The length |Y| of Y should be as small as possible. More formally: minimize 𝔼[|Y|].
Expected Length – Example
𝒳 = {a, b, c} with p_X(a) = 1/2, p_X(b) = 1/4, p_X(c) = 1/4, encoded with the prefix-free code a → 0, b → 10, c → 11, so that p_Y(0) = 1/2, p_Y(10) = 1/4, p_Y(11) = 1/4.
𝔼[|Y|] = (1/2)·1 + (1/4)·2 + (1/4)·2 = 3/2
Expected Length – Example
𝒳 = {a, b, c} again, but now the length-1 codeword 0 carries probability 1/4 while a length-2 codeword carries probability 1/2: p_Y(0) = 1/4, p_Y(10) = 1/2, p_Y(11) = 1/4.
𝔼[|Y|] = (1/4)·1 + (1/2)·2 + (1/4)·2 = 7/4
The expected length grows when the short codeword is not assigned to the most likely outcome.
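A short sketch (my own, not part of the slides) that computes 𝔼[|Y|] for the two examples above; the probabilities and the particular codeword assignments are taken from the reconstruction above.

```python
def expected_length(pmf, code):
    """Expected codeword length E[|enc(X)|] for a PMF and a code (both dicts keyed by symbol)."""
    return sum(p * len(code[x]) for x, p in pmf.items())

pmf = {"a": 0.5, "b": 0.25, "c": 0.25}
good_code = {"a": "0", "b": "10", "c": "11"}   # short codeword on the likely symbol
bad_code = {"a": "10", "b": "0", "c": "11"}    # short codeword on an unlikely symbol

print(expected_length(pmf, good_code))  # 1.5  (= 3/2)
print(expected_length(pmf, bad_code))   # 1.75 (= 7/4)
```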
What is the shortest encoding?
Problem. Given a random variable X, find an optimal (enc, dec), i.e., one for which 𝔼[|enc(X)|] is as small as possible.
Next: There is an inherent limit on how short the encoding can be (in expectation).
Random Variables – Arbitrary Values
Assume you are given a random variable X with the following PMF:
  x        a      b      c      d
  p_X(x)   15/16  1/32   1/64   1/64
You learn X = a; surprised? S(a) = log₂(16/15) ≈ 0.09
You learn X = d; surprised? S(d) = log₂(64) = 6
Definition. The surprise of outcome x is S(x) = log₂(1/p_X(x)).
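A tiny sketch (my own) computing the surprise values from this PMF:

```python
import math

pmf = {"a": 15/16, "b": 1/32, "c": 1/64, "d": 1/64}

def surprise(x):
    """Surprise S(x) = log2(1 / p_X(x))."""
    return math.log2(1 / pmf[x])

print(surprise("a"))  # ~0.093: a is very likely, so seeing it is barely surprising
print(surprise("d"))  # 6.0: d is rare, so seeing it is very surprising
```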
Entropy = Expected Surprise
Definition. The entropy of a discrete RV X over alphabet 𝒳 is
  ℍ(X) = 𝔼[S(X)] = Σ_{x ∈ 𝒳} p_X(x) · log₂(1/p_X(x))
Weird convention: 0 · log₂(1/0) = 0.
Intuitively: captures how surprising the outcome of the random variable is.
Entropy = Expected Surprise
Definition. ℍ(X) = 𝔼[S(X)] = Σ_{x ∈ 𝒳} p_X(x) · log₂(1/p_X(x))
  x        a      b      c      d
  p_X(x)   15/16  1/32   1/64   1/64
ℍ(X) = (15/16)·log₂(16/15) + (1/32)·5 + (1/64)·6 + (1/64)·6 ≈ 0.431
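A sketch (my own, not from the lecture) of the entropy computation, checking the ≈ 0.431 value above as well as the two extreme PMFs on the following slides:

```python
import math

def entropy(pmf):
    """H(X) = sum over x of p(x) * log2(1/p(x)), with the convention 0*log2(1/0) = 0."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

print(entropy({"a": 15/16, "b": 1/32, "c": 1/64, "d": 1/64}))  # ~0.431
print(entropy({"a": 1, "b": 0, "c": 0, "d": 0}))               # 0.0: no surprise at all
print(entropy({"a": 1/4, "b": 1/4, "c": 1/4, "d": 1/4}))       # 2.0 = log2(4): uniform
```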
Entropy = Expected Surprise
Example 1: p_X(a) = 1, p_X(b) = p_X(c) = p_X(d) = 0
  ℍ(X) = 1·log₂(1) + 3·(0·log₂(1/0)) = 0
Example 2: p_X(a) = p_X(b) = p_X(c) = p_X(d) = 1/4
  ℍ(X) = 4·(1/4)·log₂(4) = 2
Entropy = Expected Surprise
Proposition. 0 ≤ ℍ(X) ≤ log₂|𝒳|
The lower bound is attained when X takes one value with probability 1; the upper bound is attained by the uniform distribution.
Shannon’s Source Coding Theorem
Theorem (Source Coding Theorem). Let (enc, dec) be an optimal prefix-free encoding scheme for a RV X. Then
  ℍ(X) ≤ 𝔼[|enc(X)|] ≤ ℍ(X) + 1
• We cannot compress beyond the entropy.
• Corollary: “uniform” data cannot be compressed.
• We can get within one bit of the entropy.
• Example of an optimal code: the Huffman code (CSE 143?).
• The result can be extended to uniquely decodable codes (e.g., suffix-free codes).
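The slides name the Huffman code as an optimal prefix-free code but do not construct it; below is a textbook-style sketch (my own, using Python's heapq), with a check that its expected length lands between ℍ(X) and ℍ(X) + 1 for the running example.

```python
import heapq, itertools, math

def huffman_code(pmf):
    """Build a Huffman (optimal prefix-free) code: returns {symbol: bitstring}."""
    counter = itertools.count()               # tie-breaker so the heap never compares trees
    heap = [(p, next(counter), {x: ""}) for x, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)    # the two least likely subtrees...
        p1, _, code1 = heapq.heappop(heap)
        merged = {x: "0" + c for x, c in code0.items()}
        merged.update({x: "1" + c for x, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))  # ...are merged under a new node
    return heap[0][2]

pmf = {"a": 15/16, "b": 1/32, "c": 1/64, "d": 1/64}
code = huffman_code(pmf)
avg = sum(p * len(code[x]) for x, p in pmf.items())
H = sum(p * math.log2(1 / p) for p in pmf.values())
print(code)           # codeword lengths 1, 2, 3, 3 (the 0/1 labels may differ from the slides)
print(H, avg, H + 1)  # H(X) <= E[|enc(X)|] <= H(X) + 1
```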
Example
  x        a      b      c      d
  p_X(x)   15/16  1/32   1/64   1/64
Prefix-free code: a → 0, b → 10, c → 110, d → 111.
𝔼[|enc(X)|] = (15/16)·1 + (1/32)·2 + 2·(1/64)·3 = 15/16 + 10/64 = 70/64 ≤ ℍ(X) + 1
Data Compression in the Real World
Main issue: we do not know the distribution of X.
• Universal compression: Lempel/Ziv/Welch
  – See http://web.mit.edu/6.02/www/f2011/handouts/3.pdf
  – Used in GIF, UNIX compress.
  – General idea: assume the data is a sequence of symbols generated by a random process to be “estimated”.
• A whole area of computer science is dedicated to this topic.
• This is lossless compression, very different from the “lossy” compression used for images, video, audio, etc.
  – Lossy compression assumes humans can be “fooled” with some loss of data.
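The slide only names LZW; here is a compact sketch of the LZW idea (my own simplification, not the exact variant used by GIF or UNIX compress): the dictionary of previously seen substrings plays the role of the “estimated” model of the source.

```python
def lzw_compress(text: str):
    """Toy LZW: emit dictionary indices for the longest already-seen substrings."""
    # Start with a dictionary of all single characters appearing in the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    current, output = "", []
    for ch in text:
        if current + ch in dictionary:
            current += ch                               # keep extending the current match
        else:
            output.append(dictionary[current])          # emit the longest known match...
            dictionary[current + ch] = len(dictionary)  # ...and learn a new string
            current = ch
    if current:
        output.append(dictionary[current])
    return output, dictionary

codes, d = lzw_compress("abababababab")
print(codes)  # repeated patterns are replaced by short dictionary references
```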