3. Information-Theoretic Foundations

- Founder: Claude Shannon, 1940s
- Gives bounds for:
  - Ultimate data compression
  - Ultimate transmission rate of communication
- Measure of symbol information:
  - Degree of surprise / uncertainty
  - Number of yes/no questions (binary decisions) needed to find out the correct symbol
  - Depends on the probability p of the symbol
Choosing the information measure

- Requirements for the information function I(p):
  - I(p) ≥ 0
  - I(p1 p2) = I(p1) + I(p2)
  - I(p) is a continuous function of p
- The solution is essentially unique: I(p) = −log p = log(1/p).
- Base of log = 2 ⇒ the unit of information is the bit.
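As a quick numerical check (a minimal Python sketch, not part of the original slides), the function I(p) = −log2 p satisfies the additivity requirement:

import math

def information(p: float) -> float:
    """Self-information I(p) = -log2(p), in bits."""
    return -math.log2(p)

# Additivity for independent events: I(p1 * p2) == I(p1) + I(p2)
p1, p2 = 0.5, 0.25
print(information(p1 * p2))                # 3.0 bits
print(information(p1) + information(p2))   # 3.0 bits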
Examples

- Tossing a fair coin: P(heads) = P(tails) = 1/2
  - Information measures for one toss: Inf(heads) = Inf(tails) = −log2 0.5 bits = 1 bit
  - Information measure for a 3-sequence: Inf(<heads, tails, heads>) = −log2 (1/2 · 1/2 · 1/2) bits = 3 bits
  - Optimal coding: heads → 0, tails → 1
- An unfair coin: P(heads) = 1/8 and P(tails) = 7/8
  - Inf(heads) = −log2 (1/8) bits = 3 bits
  - Inf(tails) = −log2 (7/8) bits ≈ 0.193 bits
  - Inf(<tails, tails, tails>) = −log2 ((7/8)^3) bits ≈ 0.578 bits
  - Improving the coding requires grouping tosses into blocks (see the sketch below).
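The coin-toss figures above can be reproduced with the same information function; a small sketch (example probabilities only):

import math

def inf(p: float) -> float:
    return -math.log2(p)            # information in bits

# Fair coin
print(inf(0.5))                     # 1.0 bit per toss
print(inf(0.5 ** 3))                # 3.0 bits for a 3-toss sequence

# Unfair coin: P(heads) = 1/8, P(tails) = 7/8
print(inf(1 / 8))                   # 3.0 bits
print(round(inf(7 / 8), 3))         # ~0.193 bits
print(round(inf((7 / 8) ** 3), 3))  # ~0.578 bits for <tails, tails, tails>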
Entropy

- Measures the average information of a symbol from alphabet S with probability distribution P:

  H(S) = \sum_{i=1}^{q} p_i I(p_i) = \sum_{i=1}^{q} p_i \log_2 \left( \frac{1}{p_i} \right)

- Noiseless source encoding theorem (C. Shannon): the entropy H(S) gives a lower bound on the average code length L of any instantaneously decodable coding system.
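A minimal Python sketch of the entropy formula (the input distributions are just example values):

import math

def entropy(probs):
    """H(S) = sum_i p_i * log2(1 / p_i), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))             # 1.0 bit  (fair coin)
print(round(entropy([1/8, 7/8]), 3))   # ~0.544 bits (unfair coin from the previous slide)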
Example case: Binary source

- Two symbols, e.g. S = {0, 1}, with probabilities p0 and p1 = 1 − p0.
- Entropy:

  H(S) = p_0 \log_2 \frac{1}{p_0} + (1 - p_0) \log_2 \frac{1}{1 - p_0}

- p0 = 0.5, p1 = 0.5 ⇒ H(S) = 1
- p0 = 0.1, p1 = 0.9 ⇒ H(S) ≈ 0.469
- p0 = 0.01, p1 = 0.99 ⇒ H(S) ≈ 0.081
- The more skewed the distribution, the smaller the entropy.
- A uniform distribution yields the maximum entropy.
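The three values above can be verified with a short sketch of the binary entropy function:

import math

def binary_entropy(p0: float) -> float:
    """H(S) for a two-symbol source with probabilities p0 and 1 - p0."""
    p1 = 1 - p0
    return p0 * math.log2(1 / p0) + p1 * math.log2(1 / p1)

for p0 in (0.5, 0.1, 0.01):
    print(p0, round(binary_entropy(p0), 3))   # 1.0, 0.469, 0.081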
Example case: Predictive model

Already processed 'context': HELLO WOR ?

  Next char | Prob | Inf (bits)          | Weighted information
  ----------|------|---------------------|---------------------------
  L         | 0.95 | −log2 0.95 ≈ 0.074  | 0.95 · 0.074 ≈ 0.070 bits
  D         | 0.04 | −log2 0.04 ≈ 4.644  | 0.04 · 4.644 ≈ 0.186 bits
  M         | 0.01 | −log2 0.01 ≈ 6.644  | 0.01 · 6.644 ≈ 0.066 bits

  Weighted sum ≈ 0.322 bits
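The weighted sum in the table is the expected information of the next character under this (hypothetical) predictive model; a small sketch that reproduces it:

import math

# Hypothetical model probabilities for the character following "HELLO WOR"
next_char_probs = {"L": 0.95, "D": 0.04, "M": 0.01}

expected_info = 0.0
for ch, p in next_char_probs.items():
    info = -math.log2(p)                      # information in bits
    expected_info += p * info
    print(ch, round(info, 3), round(p * info, 3))

print("weighted sum:", round(expected_info, 3))   # ~0.322 bits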
Code redundancy

- Average redundancy of a code (per symbol): L − H(S).
- Redundancy can be made zero if the symbol probabilities are negative powers of 2 (note that −log2(2^−i) = i).
- Generally possible:

  \log_2 \left( \frac{1}{p_i} \right) \le l_i < \log_2 \left( \frac{1}{p_i} \right) + 1

- Universal code: L ≤ c1 · H(S) + c2
- Asymptotically optimal code: c1 = 1
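A sketch of the bound: assigning each symbol a code length l_i = ⌈log2(1/p_i)⌉ keeps the redundancy below one bit per symbol and makes it zero when the probabilities are negative powers of 2 (the distributions below are assumed examples):

import math

def redundancy(probs):
    """Average code length minus entropy for lengths l_i = ceil(log2(1/p_i))."""
    lengths = [math.ceil(math.log2(1 / p)) for p in probs]
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    entropy = sum(p * math.log2(1 / p) for p in probs)
    return avg_len - entropy

print(redundancy([0.5, 0.25, 0.125, 0.125]))       # 0.0 (negative powers of 2)
print(round(redundancy([0.4, 0.3, 0.2, 0.1]), 3))  # positive, but < 1 bit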
Generalization: m-memory source

- Conditional information: \log_2 \left( 1 / P(s_i \mid s_{i_1}, \ldots, s_{i_m}) \right)
- Conditional entropy for a given context:

  H(S \mid s_{i_1}, \ldots, s_{i_m}) = \sum_{S} P(s_i \mid s_{i_1}, \ldots, s_{i_m}) \log_2 \left( \frac{1}{P(s_i \mid s_{i_1}, \ldots, s_{i_m})} \right)

- Global entropy over all contexts:

  H(S) = \sum_{S^m} P(s_{i_1}, \ldots, s_{i_m}) \sum_{S} P(s_i \mid s_{i_1}, \ldots, s_{i_m}) \log_2 \left( \frac{1}{P(s_i \mid s_{i_1}, \ldots, s_{i_m})} \right)
       = \sum_{S^{m+1}} P(s_{i_1}, \ldots, s_{i_m}, s_i) \log_2 \left( \frac{1}{P(s_i \mid s_{i_1}, \ldots, s_{i_m})} \right)
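As an illustration of the global formula for m = 1, a sketch with an assumed joint distribution (example values, not from the slides):

import math
from collections import defaultdict

# Assumed joint distribution P(previous symbol, next symbol) for a 1-memory source
joint = {
    ("a", "a"): 0.30, ("a", "b"): 0.10,
    ("b", "a"): 0.10, ("b", "b"): 0.50,
}

# Context probabilities P(previous)
context_prob = defaultdict(float)
for (ctx, _), p in joint.items():
    context_prob[ctx] += p

# Global conditional entropy:
# sum over (context, symbol) of P(context, symbol) * log2(1 / P(symbol | context))
h = 0.0
for (ctx, sym), p in joint.items():
    cond = p / context_prob[ctx]
    h += p * math.log2(1 / cond)

print(round(h, 3))   # conditional entropy in bits per symbol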
About conditional sources

- Generalized Markov process:
  - Finite-state machine
  - For an m-memory source there are q^m states
  - Transitions correspond to symbols that follow the m-block
  - Transition probabilities are state-dependent
- Ergodic source:
  - The system settles down to a limiting probability distribution.
  - Equilibrium state probabilities can be inferred from the transition probabilities.

[Figure: two-state diagram with states 0 and 1; transitions 0→0 with probability 0.2, 0→1 with 0.8, 1→0 with 0.5, 1→1 with 0.5.]
Solving the example entropy

Equilibrium equations for the two-state source above (0→0: 0.2, 0→1: 0.8, 1→0: 0.5, 1→1: 0.5):

  p_0 = 0.2 p_0 + 0.5 p_1
  p_1 = 0.8 p_0 + 0.5 p_1

Solution (eigenvector): p_0 ≈ 0.385, p_1 ≈ 0.615

  H(S) = \sum_{i=0}^{1} p_i \sum_{j=0}^{1} \Pr(j \mid i) \log_2 \frac{1}{\Pr(j \mid i)}
       = p_0 \left( 0.2 \log_2 \frac{1}{0.2} + 0.8 \log_2 \frac{1}{0.8} \right) + p_1 \left( 0.5 \log_2 \frac{1}{0.5} + 0.5 \log_2 \frac{1}{0.5} \right) \approx 0.893

Example application: compression of black-and-white images (black and white areas are highly clustered).
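A sketch that reproduces the equilibrium probabilities and the entropy of this two-state source, using power iteration instead of solving the equations by hand:

import math

# Transition probabilities from the slide: P[i][j] = Pr(next state j | current state i)
P = [[0.2, 0.8],
     [0.5, 0.5]]

# Equilibrium distribution by power iteration
p = [0.5, 0.5]
for _ in range(100):
    p = [p[0] * P[0][0] + p[1] * P[1][0],
         p[0] * P[0][1] + p[1] * P[1][1]]
print([round(x, 3) for x in p])               # [0.385, 0.615]

# Entropy: weighted average of the per-state entropies
H = sum(p[i] * sum(P[i][j] * math.log2(1 / P[i][j]) for j in range(2))
        for i in range(2))
print(round(H, 3))                            # ~0.893 bits per symbol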
Empirical observations

- Shannon's experimental value for the entropy of the English language is ≈ 1 bit per character.
- Current text compressor efficiencies:
  - gzip ≈ 2.5–3 bits per character
  - bzip2 ≈ 2.5 bits per character
  - The best predictive methods ≈ 2 bits per character
- Improvements are still possible!
- However, digital images, audio and video are more important data types from a compression point of view.
Other extensions of entropy

- Joint entropy, e.g. for two random variables X, Y:

  H(X, Y) = -\sum_{x, y} p_{x,y} \log_2 p_{x,y}

- Relative entropy: the cost of using distribution q_i instead of p_i:

  D_{KL}(P \| Q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i}

- Differential entropy for a continuous probability distribution:

  h(X) = -\int f_X(x) \log f_X(x) \, dx
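A minimal sketch of the first two quantities (the distributions are assumed example inputs):

import math

def joint_entropy(pxy):
    """H(X, Y) = -sum_{x,y} p(x, y) * log2 p(x, y)."""
    return -sum(p * math.log2(p) for p in pxy.values() if p > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log2(p_i / q_i)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pxy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(joint_entropy(pxy))                               # 2.0 bits

print(round(kl_divergence([0.5, 0.5], [0.9, 0.1]), 3))
# > 0: the extra bits per symbol paid for coding P with a code designed for Q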
Kolmogorov complexity

- Measure of message information = the length of the shortest binary program that generates the message.
- This is close to the entropy H(S) for a sequence of symbols drawn at random from the distribution of S.
- Can be much smaller than the entropy for artificially generated data: pseudo-random numbers, fractals, ...
- Problem: Kolmogorov complexity is not computable! (Cf. Gödel's incompleteness theorem and the halting problem of Turing machines.)