Lecture 2: Measures of Information
Entropy and Conditional Entropy · Mutual Information and Kullback–Leibler Divergence
I-Hsiang Wang
Department of Electrical Engineering, National Taiwan University
ihwang@ntu.edu.tw
September 16, 2015
How to measure information? Before this, we should ask: What is information?
In our daily lives, information is often obtained by learning something that we did not know before. Examples: the result of a ball game, the score of an exam, the weather, … In other words, one gets some information by learning something about which he/she was uncertain before.
Shannon: "Information is the resolution of uncertainty."
Motivating Example
Let us take the following example: suppose there is a professional basketball (NBA) final and a tennis tournament (the French Open quarterfinals) happening right now. D is an enthusiastic sports fan. He is interested in who will win the NBA final and who will win the Men's Singles. However, due to his work, he cannot access any news in 10 days. How much information can he get after 10 days when he learns the two pieces of news (the two messages)?
- For the NBA final, D will learn that one of the two teams eventually wins the final (message B).
- For the French Open quarterfinals, D will learn that one of the eight players eventually wins the gold medal (message T).
Observations
1 The amount of information is related to the number of possible outcomes: message B is a result of two possible outcomes, while message T is a result of eight possible outcomes.
2 The amount of information obtained in learning the two messages should be additive, while the number of possible outcomes is multiplicative.
Let f(·) be a function that measures the amount of information: f maps the # of possible outcomes of B to the amount of info. from learning B, and the # of possible outcomes of T to the amount of info. from learning T. The numbers of possible outcomes combine multiplicatively (×), while the amounts of information should combine additively (+).
What function produces additive outputs with multiplicative inputs? The logarithmic function.
Logarithm as the Information Measure
Initial guess of the measure of information: log(# of possible outcomes).
However, this measure does not take the likeliness into account: if some outcome occurs with very high probability, the amount of information of that outcome should be very little.
For example, suppose D knows that the Spurs were leading the Heats 3:1.
- The probability that the Heats win the final: 1/2 → 1/8. With probability 1/8, it is like out of 8 times there is only 1 time that will generate this outcome ⟹ the amount of information = log 8 = 3 bits.
- The probability that the Spurs win the final: 1/2 → 7/8. With probability 7/8, it is like out of 8/7 times there is only 1 time that will generate this outcome ⟹ the amount of information = log(8/7) = 3 − log 7 bits.
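A quick numerical check of these two values (a sketch in Python; the helper name `surprisal` is just for illustration and is not from the lecture):

```python
import math

def surprisal(p):
    """Information (in bits) carried by an outcome that occurs with probability p."""
    return math.log2(1 / p)

print(surprisal(1 / 8))  # Heats win:  log 8     = 3.0 bits
print(surprisal(7 / 8))  # Spurs win:  log (8/7) ≈ 0.193 bits
```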
Information and Uncertainty
From the motivation, we collect the following intuitions:
1 The amount of information is related to the # of possible outcomes.
2 The measure of information should be additive.
3 The measure of information should take the likeliness into account.
4 The measure of information is actually measuring the amount of uncertainty of an unknown outcome.
Hence, a plausible measure of information of a realization x drawn from a random outcome X is f(x) := log(1/P{X = x}).
Correspondingly, the measure of information of a random outcome X is the averaged value of f(x): E_X[f(X)].
Notation: in this lecture, the logarithms are of base 2 if not specified.
Outline
1 Entropy and Conditional Entropy
  - Definition of Entropy and Conditional Entropy
  - Properties of Entropy and Conditional Entropy
2 Mutual Information and Kullback–Leibler Divergence
Entropy: Measure of Uncertainty of a Random Variable
log(1/P{X = x}) ⇝ measure of information/uncertainty of an outcome x.
If the outcome has small probability, it contains higher uncertainty. However, on the average, it happens rarely. Hence, to measure the uncertainty of a random variable, we should take the expectation of the self information over all possible realizations.
Definition 1 (Entropy)
The entropy of a random variable X is defined by
H(X) := \mathbb{E}_X\!\left[\log \frac{1}{p(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}.
Note: By convention we set 0 log(1/0) = 0 since lim_{t→0} t log(1/t) = 0.
Note: Entropy can be understood as the (average) amount of information when one learns the actual outcome/realization of r.v. X.
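Definition 1 translates directly into a few lines of code. The following sketch (the helper name `entropy` and the dict-based p.m.f. representation are illustrative choices, not from the lecture) computes H(X) in bits and skips zero-probability terms, matching the convention 0 log(1/0) = 0:

```python
import math

def entropy(pmf):
    """Entropy in bits of a p.m.f. given as {outcome: probability}.
    Zero-probability outcomes are skipped (convention: 0 * log(1/0) = 0)."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

# A fair coin carries exactly 1 bit of uncertainty.
print(entropy({"heads": 0.5, "tails": 0.5}))  # 1.0
```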
Example 1 (Binary entropy function)
Let X ∼ Ber(p) be a Bernoulli random variable, that is, X ∈ {0, 1}, p_X(1) = 1 − p_X(0) = p. Then, the entropy of X is called the binary entropy function H_b(p), where (note: we follow the convention that 0 log(1/0) = 0)
H_b(p) := H(X) = −p log p − (1 − p) log(1 − p),  p ∈ [0, 1].
[Figure: plot of H_b(p) versus p on [0, 1], rising from 0 at p = 0 to its peak value 1 at p = 0.5 and falling back to 0 at p = 1.]
Exercise 1
1 Analytically check that max_{p ∈ [0,1]} H_b(p) = 1 and arg max_{p ∈ [0,1]} H_b(p) = 1/2.
2 Analytically prove that H_b(p) is concave in p.
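A minimal numerical companion to Exercise 1, assuming a simple grid search (it only checks the claims numerically; the analytic arguments are still the exercise):

```python
import math

def H_b(p):
    """Binary entropy function in bits, with the convention 0 * log(1/0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Grid search for the maximizer of H_b over [0, 1].
grid = [i / 10000 for i in range(10001)]
p_star = max(grid, key=H_b)
print(p_star, H_b(p_star))  # ≈ 0.5, 1.0
```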
Example 2
Consider a random variable X ∈ {0, 1, 2, 3} with p.m.f. defined as follows:

x      0    1    2    3
p(x)  1/6  1/3  1/3  1/6

Compute H(X) and H(Y), where Y := X mod 2.
sol: H(X) = 2 × (1/6) × log 6 + 2 × (1/3) × log 3 = 1/3 + log 3.
H(Y) = 2 × (1/2) × log 2 = 1.
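The solution can be double-checked numerically with an entropy helper like the one sketched above; the p.m.f. of Y is obtained by summing p(x) over the x's with the same parity:

```python
import math

def entropy(pmf):
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

p_X = {0: 1/6, 1: 1/3, 2: 1/3, 3: 1/6}

# p.m.f. of Y = X mod 2: add up the probabilities of x's with equal parity.
p_Y = {}
for x, p in p_X.items():
    p_Y[x % 2] = p_Y.get(x % 2, 0.0) + p

print(entropy(p_X), 1/3 + math.log2(3))  # both ≈ 1.918
print(entropy(p_Y))                      # 1.0
```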
Operational Meaning of Entropy
Besides the intuitive motivation, entropy has operational meanings. Below we take a slight deviation and look at a mathematical problem.
Problem: Consider a sequence of discrete rv's X^n := (X_1, X_2, …, X_n), where X_i ∈ 𝒳, X_i i.i.d. ∼ p_X for all i = 1, 2, …, n, and |𝒳| < ∞.
For a given ϵ ∈ (0, 1), we say A ⊆ 𝒳^n is an ϵ-high-probability set iff P{X^n ∈ A} ≥ 1 − ϵ.
We would like to find the asymptotic scaling of the smallest cardinality of ϵ-high-probability sets as n → ∞. Let s(n, ϵ) be that smallest cardinality.
Theorem 1 (Cardinality of High Probability Sets)
\lim_{n \to \infty} \frac{1}{n} \log s(n, \epsilon) = H(X), \quad \forall \epsilon \in (0, 1).
pf: Application of the Law of Large Numbers. See HW1.
Implications: H(X) is the minimum possible compression ratio. Why? Because the above theorem guarantees that, for any prescribed missed probability, (minimum # of bits required)/n → H(X) as n → ∞.
With the theorem, if one would like to describe a random length-n 𝒳-sequence with a missed probability at most ϵ, he/she only needs k ≈ nH(X) bits when n is large.
This is the saving (compression) due to the statistical structure of random source sequences! (as Shannon pointed out in his 1948 paper.)
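To get a feel for Theorem 1 without the proof, one can compute s(n, ϵ) exactly for a small i.i.d. Bernoulli(p) source: take the most probable length-n sequences first (all sequences with the same number of ones are equiprobable) until their total probability reaches 1 − ϵ. The sketch below is purely illustrative and is not the HW1 argument:

```python
import math
from math import comb

def smallest_high_prob_set_size(n, p, eps):
    """Exact s(n, eps) for an i.i.d. Bernoulli(p) source: greedily cover
    probability 1 - eps with the most probable length-n sequences.
    All sequences with k ones share the probability p**k * (1-p)**(n-k),
    so we work with groups instead of enumerating 2**n sequences."""
    groups = sorted(
        ((p**k * (1 - p)**(n - k), comb(n, k)) for k in range(n + 1)),
        reverse=True,
    )
    covered, size = 0.0, 0
    for prob, count in groups:
        if covered >= 1 - eps:
            break
        # Take only as many sequences from this group as are still needed.
        needed = min(count, math.ceil((1 - eps - covered) / prob))
        covered += needed * prob
        size += needed
    return size

p, eps = 0.2, 0.1
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # H(X) ≈ 0.722 bits
for n in (20, 100, 400):
    s = smallest_high_prob_set_size(n, p, eps)
    print(n, math.log2(s) / n)   # (1/n) log s(n, eps) slowly approaches H(X)
```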
Entropy: Definition
Initially we define entropy for a random variable. It is straightforward to extend this definition to a sequence of random variables, or, a random vector.
Definition 2 (Entropy)
The entropy of a d-dimensional random vector X := [X_1 ⋯ X_d]^T is defined by the expectation of the self information
H(\mathbf{X}) := \mathbb{E}_{\mathbf{X}}\!\left[\log \frac{1}{p(\mathbf{X})}\right] = \sum_{\mathbf{x} \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_d} p(\mathbf{x}) \log \frac{1}{p(\mathbf{x})} = H(X_1, \ldots, X_d).
In some literature, the entropy of a random vector is also called the joint entropy of the component random variables.
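The same entropy formula applies verbatim to a joint p.m.f.; a small sketch (the tuple-keyed dict is an illustrative representation, not notation from the lecture):

```python
import math

def entropy(pmf):
    """Entropy in bits; here the outcomes are tuples, i.e. realizations of a vector."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

# Joint p.m.f. of (X1, X2): two independent fair bits.
p_joint = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
print(entropy(p_joint))  # H(X1, X2) = 2.0 bits
```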