Compression: Information Theory

Greg Plaxton
Theory in Programming Practice, Spring 2005
Department of Computer Science, University of Texas at Austin
Coding Theory

• Encoder
  – Input: a message over some finite alphabet such as {0, 1} or {a, ..., z}
  – Output: the encoded message
• Decoder
  – Input: an encoded message produced by the encoder
  – Output: (a good approximation to) the associated input message
• Motivation?
Some Applications of Coding Theory

• Compression
  – Goal: produce a short encoding of the input message
• Error detection/correction
  – Goal: produce a fault-tolerant encoding of the input message
• Cryptography
  – Goal: produce an encoding of the input message that can only be decoded by the intended recipient(s) of the message
• It is desirable for the encoding and decoding algorithms to be efficient in terms of time and space
  – Different tradeoffs are appropriate for different applications
Compression

• Lossless: the decoder recovers the original input message
• Lossy: the decoder recovers an approximation to the original input message
• The application dictates how much loss, if any, we can tolerate
  – Text compression is usually required to be lossless
  – Image/video compression is often lossy
• We will focus on techniques for lossless compression
Text Compression

• Practical question: I’m running out of disk space; how much can I compress my files?
• A (naive?) idea:
  – Any file can be compressed to the empty string: just write a decoder that outputs the file when given the empty string as input!
  – The problem with this approach is that we need to store the decoder, and the naive implementation of the decoder (which simply stores the original file in some static data structure within the decoder program) is at least as large as the original file
  – Can this idea be salvaged?
Kolmogorov Complexity

• In some cases, a large file can be generated by a very small program running on the empty string; e.g., a file containing a list of the first trillion prime numbers (see the sketch below)
• Your files can be compressed down to the size of the smallest program that (when given the empty string as input) produces them as output
  – How do I figure out this shortest program?
  – Won’t it be time-consuming to write/debug/maintain?
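To make the point concrete, here is a minimal Python sketch of such a small program: a generator for the first n primes. The function name `primes` and the choice of n = 1000 are illustrative assumptions (the slide’s example uses the first trillion); the point is only that the program is a few hundred bytes while its output can be made arbitrarily large.

```python
# A tiny program whose output can be made arbitrarily large: the
# program-as-decoder idea behind Kolmogorov complexity.
# Trial division is slow but keeps the program short.
def primes(n):
    """Yield the first n primes."""
    found = []
    candidate = 2
    while len(found) < n:
        if all(candidate % p != 0 for p in found):
            found.append(candidate)
            yield candidate
        candidate += 1

if __name__ == "__main__":
    for p in primes(1000):   # the slide's example uses a trillion primes
        print(p)
```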
Information Theory

• May be viewed as providing a practical way to (approximately) carry out the strategy suggested by Kolmogorov complexity
• Consider a file that you would like to compress
  – Assume that this file can be viewed, to a reasonable degree of approximation, as being drawn from a particular probability distribution (e.g., we will see that this is true of English text)
  – Perhaps many other people have files drawn from this distribution, or from distributions in a similar class
  – If so, a good encoder/decoder pair for that class of distributions may already exist; with luck, it will already be installed on your system
Example: English Text

• In what sense can we view English text as being (approximately) drawn from a probability distribution?
• English text is one of the example applications discussed in Shannon’s 1948 paper “A Mathematical Theory of Communication”
  – On page 7 we find a sequence of successively more accurate probabilistic models of English text
  – Claude Shannon (1916–2001) is known as the “father of information theory”
Entropy in Thermodynamics

• In thermodynamics, entropy is a measure of energy dispersal
  – The more “spread out” the energy of a system is, the higher the entropy
  – A system in which the energy is concentrated at a single point has zero entropy
  – A system in which the energy is uniformly distributed has reached its maximum possible entropy
• Second law of thermodynamics: the entropy of an isolated system can only increase
  – Bad news: the entropy of the universe can only increase as matter and energy degrade to an ultimate state of inert uniformity
  – Good news: this process is likely to take a while
Entropy in Information Theory (Shannon)

• A measure of the uncertainty associated with a probability distribution
  – The more “spread out” the distribution is, the higher the entropy
  – A probability distribution in which all of the probability is concentrated on a single outcome has zero entropy
  – For any given set of possible outcomes, the probability distribution with the maximum entropy is the uniform distribution
• Consider a distribution over a set of n outcomes in which the i-th outcome has associated probability p_i; Shannon defined the entropy of this distribution as

      H = ∑_i p_i log(1/p_i) = −∑_i p_i log p_i

• The logarithm above is normally taken base 2, in which case the units of entropy are bits (binary digits); a small computational sketch follows this slide
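The following minimal Python sketch evaluates the formula directly; the function name `entropy` and the example distributions are assumptions made for illustration.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy: -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0, 0.0]))       # point mass: 0.0 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over 4 outcomes: 2.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # in between: 1.75 bits
```

As the slide notes, the point mass has zero entropy and the uniform distribution attains the maximum (here log2 4 = 2 bits).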
Entropy of an I.I.D. Source

• Consider a message in which each successive symbol is independently drawn from the same probability distribution over n symbols, where the probability of drawing the i-th symbol is p_i
• The entropy of such a source is −∑_i p_i log p_i bits per symbol
• Example: Shannon’s first-order model of English text yields an entropy of 4.07 bits per symbol (see the sketch below)
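A rough way to see where a figure like 4.07 bits per symbol comes from is to treat a text sample as an i.i.d. source whose distribution is the sample’s own symbol frequencies. The sketch below does exactly that; the short `sample` string is an assumption for illustration, so its estimate will not match Shannon’s figure, which was computed from English letter frequencies.

```python
import math
from collections import Counter

def per_symbol_entropy(text):
    """Entropy (bits/symbol) of an i.i.d. source whose distribution is the
    empirical symbol frequency of the given sample."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(round(per_symbol_entropy(sample), 2), "bits per symbol")
```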
Discrete Markov Process

• A more general notion of a source (see the sketch below)
• Includes as special cases the k-th order processes discussed earlier in connection with Shannon’s modeling of English text
• Closely related to the concept of finite state machines to be discussed later in this course
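For concreteness, here is a minimal sketch of a discrete Markov source in Python. The two-state chain and its transition probabilities are invented for illustration; the point is only that the distribution of the next symbol depends on the current state.

```python
import random

# Transition table: current state -> list of (next symbol, probability).
TRANSITIONS = {
    "A": [("A", 0.9), ("B", 0.1)],
    "B": [("A", 0.5), ("B", 0.5)],
}

def generate(start, length, seed=0):
    """Emit `length` symbols from the Markov source, starting in `start`."""
    rng = random.Random(seed)
    state, out = start, []
    for _ in range(length):
        symbols, weights = zip(*TRANSITIONS[state])
        state = rng.choices(symbols, weights=weights)[0]
        out.append(state)
    return "".join(out)

print(generate("A", 40))
```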
Entropy of a Discrete Markov Process

• Under certain (relatively mild) technical assumptions, for any k > 0 and any X in A^k, where A denotes the set of symbols, the fraction of all sequences of length k in the output that are equal to X converges to a particular number p(X)
• We may then define H_k as

      H_k = (1/k) ∑_{X ∈ A^k} p(X) log(1/p(X))

• Theorem (Shannon): If a given discrete Markov process satisfies the technical assumptions alluded to above, then its entropy is equal to lim_{k→∞} H_k bits per symbol (a small empirical sketch follows this slide)
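The definition of H_k suggests a direct empirical estimate: count the length-k blocks of a long output sequence and plug their frequencies in as p(X). The sketch below does this; the repeating `sample` string is an assumption chosen for illustration, and the estimate is only meaningful when k is much smaller than the sample length. For this (essentially deterministic) source the estimates shrink toward zero as k grows, in line with the limit in Shannon’s theorem.

```python
import math
from collections import Counter

def h_k(text, k):
    """Estimate H_k = (1/k) * sum_{X in A^k} p(X) log2(1/p(X)) from the
    empirical frequencies of the length-k blocks of `text`."""
    blocks = [text[i:i + k] for i in range(len(text) - k + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return sum((c / total) * math.log2(total / c) for c in counts.values()) / k

sample = "ab" * 200   # output of a source that strictly alternates a and b
for k in (1, 2, 4, 8):
    print(k, round(h_k(sample, k), 3))
```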
Example: English Text

• Zero-order approximation (27 symbols: 26 letters plus the space): log 27 ≈ 4.75 bits per symbol
• First-order approximation: 4.07 bits per symbol
• Second-order approximation: 3.36 bits per symbol
• Third-order approximation: 2.77 bits per symbol
• Approximation based on experiments involving humans: 0.6 to 1.3 bits per symbol
Entropy as a Measure of Compressibility

• Fundamental Theorem for a Noiseless Channel (Shannon): Let a source have entropy H (bits per symbol) and a channel have capacity C (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate C/H − ε symbols per second, where ε is arbitrarily small. It is not possible to transmit at an average rate greater than C/H.
• What does this imply regarding how much we can hope to compress a given file containing n symbols, where n is large?
  – Suppose the file content is similar in structure to the output of a source with entropy H
  – Then we cannot hope to encode the file using fewer than about nH bits (see the sketch below)
  – Furthermore, this bound can be achieved to within an arbitrarily small factor
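As a rough sanity check on the nH bound, the sketch below estimates H for a byte string using first-order (single-symbol) frequencies and multiplies by n. The byte string `data` is a stand-in for a real file; and because this estimate ignores dependence between symbols, the true entropy of something like English text is lower, so the achievable bound is smaller still.

```python
import math
from collections import Counter

def n_h_bound(data):
    """Approximate lower bound n*H (in bits) on the encoded size of `data`,
    taking H to be the empirical first-order entropy of its symbols."""
    n = len(data)
    counts = Counter(data)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * h

data = b"to be or not to be that is the question"  # stand-in for a real file
print(len(data) * 8, "bits uncompressed; roughly", round(n_h_bound(data)), "bits = n*H")
```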