Chapter 8: Differential Entropy
Peng-Hua Wang
Graduate Inst. of Comm. Engineering, National Taipei University
Chapter Outline: Chap. 8 Differential Entropy
8.1 Definitions
8.2 AEP for Continuous Random Variables
8.3 Relation of Differential Entropy to Discrete Entropy
8.4 Joint and Conditional Differential Entropy
8.5 Relative Entropy and Mutual Information
8.6 Properties of Differential Entropy and Related Quantities
8.1 Definitions
Definitions

Definition 1 (Differential entropy) The differential entropy h(X) of a continuous random variable X with pdf f(x) is defined as

h(X) = -\int_S f(x) \log f(x)\, dx,

where S is the support set of the random variable.

Example. If X ∼ U(0, a), then

h(X) = -\int_0^a \frac{1}{a} \log\frac{1}{a}\, dx = \log a.
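As a quick sanity check (not part of the slides), the definition can be evaluated numerically. The sketch below, assuming SciPy is available and using the natural logarithm (entropy in nats), integrates -f(x) log f(x) over the support of U(0, a) and compares the result with log a:

```python
# Minimal sketch (assumption: natural log, i.e. entropy in nats):
# numerically check h(X) = log a for X ~ U(0, a).
import numpy as np
from scipy.integrate import quad

def differential_entropy(pdf, support):
    """Numerically integrate -f(x) log f(x) over the support."""
    lo, hi = support
    integrand = lambda x: -pdf(x) * np.log(pdf(x))
    h, _ = quad(integrand, lo, hi)
    return h

a = 3.0
uniform_pdf = lambda x: 1.0 / a          # f(x) = 1/a on (0, a)
print(differential_entropy(uniform_pdf, (0.0, a)))   # ≈ log(3) ≈ 1.0986
print(np.log(a))
```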
Differential Entropy of Gaussian

Example. If X ∼ N(0, σ²) with pdf \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/(2\sigma^2)}, then

h_a(X) = -\int \phi(x) \log_a \phi(x)\, dx
       = -\int \phi(x) \left[ \log_a \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2} \log_a e \right] dx
       = \frac{1}{2} \log_a(2\pi\sigma^2) + \frac{\log_a e}{2\sigma^2}\, E_\phi[X^2]
       = \frac{1}{2} \log_a(2\pi e \sigma^2). □
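The closed form can also be checked by Monte Carlo, since h(X) = E[-log φ(X)]. A minimal sketch (assumption: natural logarithms, a = e, so the entropy is in nats):

```python
# Estimate h(X) = E[-log phi(X)] for X ~ N(0, sigma^2) by Monte Carlo
# and compare with the closed form (1/2) log(2*pi*e*sigma^2).
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=1_000_000)

log_phi = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
h_mc = -log_phi.mean()                                  # Monte Carlo estimate
h_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)     # closed form
print(h_mc, h_exact)                                    # both ≈ 2.112 nats
```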
Differential Entropy of Gaussian

Remark. If a random variable with pdf f(x) has zero mean and variance σ², then

-\int f(x) \log_a \phi(x)\, dx
  = -\int f(x) \left[ \log_a \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2} \log_a e \right] dx
  = \frac{1}{2} \log_a(2\pi\sigma^2) + \frac{\log_a e}{2\sigma^2}\, E_f[X^2]
  = \frac{1}{2} \log_a(2\pi e \sigma^2).
Gaussian has Maximal Differential Entropy

Suppose that a random variable X with pdf f(x) has zero mean and variance σ². What is its maximal differential entropy? Let φ(x) be the pdf of N(0, σ²). Then

h(X) + \int f(x) \log \phi(x)\, dx
  = \int f(x) \log \frac{\phi(x)}{f(x)}\, dx
  \le \log \int f(x) \frac{\phi(x)}{f(x)}\, dx    (Jensen's inequality; the logarithm is concave)
  = \log \int \phi(x)\, dx = 0.

That is,

h(X) \le -\int f(x) \log \phi(x)\, dx = \frac{1}{2} \log(2\pi e \sigma^2),

and equality holds if f(x) = φ(x). □
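To illustrate the maximality claim, one can compare closed-form entropies of a few zero-mean densities normalized to the same variance. The uniform and Laplace formulas below are standard results, not derived in these slides; the Gaussian value comes out largest:

```python
# Among zero-mean densities with variance sigma^2, the Gaussian has the
# largest differential entropy. Closed forms (nats) for sigma^2 = 1:
import numpy as np

sigma2 = 1.0
h_gaussian = 0.5 * np.log(2 * np.pi * np.e * sigma2)   # ≈ 1.419
h_uniform  = 0.5 * np.log(12 * sigma2)                 # U(-a, a), a = sqrt(3*sigma2): h = log(2a) ≈ 1.242
h_laplace  = 1.0 + 0.5 * np.log(2 * sigma2)            # Laplace(b), 2b^2 = sigma2: h = 1 + log(2b) ≈ 1.347
print(h_gaussian, h_uniform, h_laplace)                # Gaussian is largest
```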
8.2 AEP for Continuous Random Variables
AEP

Theorem 1 (AEP) Let X_1, X_2, ..., X_n be a sequence of i.i.d. random variables with common pdf f(x). Then

-\frac{1}{n} \log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X)   in probability.

Definition 2 (Typical set) For ε > 0, the typical set A_\epsilon^{(n)} with respect to f(x) is defined as

A_\epsilon^{(n)} = \left\{ (x_1, x_2, \ldots, x_n) \in S^n : \left| -\frac{1}{n} \log f(x_1, x_2, \ldots, x_n) - h(X) \right| \le \epsilon \right\}.
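A small simulation (not part of the slides) makes the theorem concrete. The sketch below, assuming i.i.d. N(0, 1) samples and natural logarithms, shows -(1/n) log f(X_1, ..., X_n) concentrating around h(X) = ½ log(2πe) ≈ 1.419 nats as n grows:

```python
# Illustrate the AEP: the per-symbol negative log-likelihood of an i.i.d.
# N(0,1) sequence concentrates around h(X) = 0.5*log(2*pi*e).
import numpy as np

rng = np.random.default_rng(1)
h_true = 0.5 * np.log(2 * np.pi * np.e)

def neg_loglik_per_symbol(x):
    # -(1/n) * sum_i log phi(x_i) for the standard normal pdf
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * x**2)

for n in (10, 100, 1_000, 10_000):
    vals = [neg_loglik_per_symbol(rng.normal(size=n)) for _ in range(5)]
    print(n, np.round(vals, 3), "-> h(X) =", round(h_true, 3))
```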
AEP

Definition 3 (Volume) The volume Vol(A) of a set A ⊂ R^n is defined as

Vol(A) = \int_A dx_1\, dx_2 \cdots dx_n.

Theorem 2 (Properties of the typical set)
1. Pr(A_\epsilon^{(n)}) > 1 - ε for n sufficiently large.
2. Vol(A_\epsilon^{(n)}) ≤ 2^{n(h(X)+\epsilon)} for all n.
3. Vol(A_\epsilon^{(n)}) ≥ (1-\epsilon)\, 2^{n(h(X)-\epsilon)} for n sufficiently large.
8.4 Joint and Conditional Differential Entropy
Definitions

Definition 4 (Joint differential entropy) The differential entropy of jointly distributed random variables X_1, X_2, ..., X_n is defined as

h(X_1, X_2, \ldots, X_n) = -\int f(x^n) \log f(x^n)\, dx^n,

where f(x^n) = f(x_1, x_2, ..., x_n) is the joint pdf.

Definition 5 (Conditional differential entropy) The conditional differential entropy of jointly distributed random variables X, Y with joint pdf f(x, y) is defined (when the integral exists) as

h(X|Y) = -\int f(x, y) \log f(x|y)\, dx\, dy = h(X, Y) - h(Y).
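The identity h(X|Y) = h(X,Y) - h(Y) can be checked by Monte Carlo for a standard bivariate normal with correlation ρ, where h(X|Y) also has a closed form because X | Y = y ∼ N(ρy, 1 - ρ²). A minimal sketch, assuming SciPy (entropies in nats):

```python
# Check h(X|Y) = h(X,Y) - h(Y) for a bivariate normal with correlation rho.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(2)
rho = 0.8
K = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], K, size=500_000)

h_xy = -multivariate_normal(cov=K).logpdf(xy).mean()   # Monte Carlo estimate of h(X,Y)
h_y  = -norm.logpdf(xy[:, 1]).mean()                   # Monte Carlo estimate of h(Y)
# X | Y = y is N(rho*y, 1 - rho^2), so h(X|Y) has a closed form:
h_x_given_y = 0.5 * np.log(2 * np.pi * np.e * (1 - rho**2))
print(h_xy - h_y, h_x_given_y)                         # ≈ equal
```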
Multivariate Normal Distribution

Theorem 3 (Entropy of a multivariate normal) Let X_1, X_2, ..., X_n have a multivariate normal distribution with mean vector μ and covariance matrix K. Then

h(X_1, X_2, \ldots, X_n) = \frac{1}{2} \log\!\left[(2\pi e)^n |K|\right].

Proof. The joint pdf of the multivariate normal distribution is

\phi(x) = \frac{1}{(2\pi)^{n/2} |K|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^t K^{-1} (x-\mu)}.
Multivariate Normal Distribution

Therefore,

h(X_1, X_2, \ldots, X_n)
  = -\int \phi(x) \log_a \phi(x)\, dx
  = \int \phi(x) \left[ \frac{1}{2} \log_a\!\left((2\pi)^n |K|\right) + \frac{1}{2} (x-\mu)^t K^{-1} (x-\mu) \log_a e \right] dx
  = \frac{1}{2} \log_a\!\left((2\pi)^n |K|\right) + \frac{1}{2} (\log_a e)\, \underbrace{E\!\left[(x-\mu)^t K^{-1} (x-\mu)\right]}_{=\,n}
  = \frac{1}{2} \log_a\!\left((2\pi)^n |K|\right) + \frac{1}{2} n \log_a e
  = \frac{1}{2} \log_a\!\left[(2\pi e)^n |K|\right]. □
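A numerical spot check of Theorem 3 (not part of the slides): estimate -E[log φ(X)] by Monte Carlo for a randomly generated covariance matrix and compare with ½ log((2πe)^n |K|), in nats. The sketch assumes SciPy:

```python
# Verify the multivariate normal entropy formula for a random 3x3 covariance.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
n = 3
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)                       # a positive-definite covariance matrix

mvn = multivariate_normal(mean=np.zeros(n), cov=K)
x = rng.multivariate_normal(np.zeros(n), K, size=200_000)

h_mc = -mvn.logpdf(x).mean()                      # Monte Carlo estimate of -E[log phi(X)]
h_exact = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))
print(h_mc, h_exact)                              # both ≈ the same value
```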
Multivariate Normal Distribution

Lemma. Let Y = (Y_1, Y_2, ..., Y_n)^t be a random vector. If K = E[YY^t], then E[Y^t K^{-1} Y] = n.

Proof. Write K in terms of its columns and K^{-1} in terms of its rows:

K = E[YY^t] = \begin{pmatrix} | & | & & | \\ k_1 & k_2 & \cdots & k_n \\ | & | & & | \end{pmatrix},
\qquad
K^{-1} = \begin{pmatrix} a_1^t \\ a_2^t \\ \vdots \\ a_n^t \end{pmatrix}.

We have k_i = E[Y_i Y] and a_j^t k_i = δ_{ij} (since K^{-1} K = I).
Multivariate Normal Distribution

Now,

Y^t K^{-1} Y = Y^t \begin{pmatrix} a_1^t Y \\ a_2^t Y \\ \vdots \\ a_n^t Y \end{pmatrix}
             = Y_1 a_1^t Y + Y_2 a_2^t Y + \cdots + Y_n a_n^t Y,

and

E[Y^t K^{-1} Y] = a_1^t E[Y_1 Y] + a_2^t E[Y_2 Y] + \cdots + a_n^t E[Y_n Y]
                = a_1^t k_1 + a_2^t k_2 + \cdots + a_n^t k_n = n. □
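The lemma is easy to verify empirically. The sketch below (not from the slides) draws zero-mean Gaussian vectors with covariance K and averages the quadratic form Y^t K^{-1} Y; Gaussianity is only a convenient sampling choice, since the lemma uses nothing beyond the second moments:

```python
# Empirically check E[Y^t K^{-1} Y] = n for vectors with covariance K.
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.normal(size=(n, n))
K = A @ A.T + np.eye(n)                            # positive-definite covariance

Y = rng.multivariate_normal(np.zeros(n), K, size=200_000)
Kinv = np.linalg.inv(K)
quad_form = np.einsum('ij,jk,ik->i', Y, Kinv, Y)   # Y^t K^{-1} Y for each sample
print(quad_form.mean(), n)                         # ≈ n
```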
8.5 Relative Entropy and Mutual Information
Definitions

Definition 6 (Relative entropy) The relative entropy (or Kullback-Leibler distance) D(f‖g) between two densities f(x) and g(x) is defined as

D(f\|g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx.

Definition 7 (Mutual information) The mutual information I(X;Y) between two random variables with joint density f(x, y) is defined as

I(X;Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)}\, dx\, dy.
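As an illustration (not from the slides), D(f‖g) for two one-dimensional Gaussians can be computed by numerical integration of f log(f/g) and compared with the well-known closed form log(σ₂/σ₁) + (σ₁² + (μ₁-μ₂)²)/(2σ₂²) - ½ (in nats). A minimal sketch assuming SciPy:

```python
# D(f||g) for f = N(m1, s1^2) and g = N(m2, s2^2): numerical integration
# versus the Gaussian-to-Gaussian closed form.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 2.0
f = norm(m1, s1).pdf
g = norm(m2, s2).pdf

integrand = lambda x: f(x) * np.log(f(x) / g(x))
d_numeric, _ = quad(integrand, -20, 20)
d_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(d_numeric, d_closed)   # both ≈ 0.443 nats
```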
Example

Let (X, Y) ∼ N(0, K) where

K = \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}.

Then h(X) = h(Y) = \frac{1}{2}\log(2\pi e \sigma^2) and

h(X, Y) = \frac{1}{2}\log\!\left[(2\pi e)^2 |K|\right] = \frac{1}{2}\log\!\left[(2\pi e)^2 \sigma^4 (1-\rho^2)\right].

Therefore,

I(X;Y) = h(X) + h(Y) - h(X, Y) = -\frac{1}{2}\log(1-\rho^2).
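The same result can be verified by Monte Carlo (not part of the slides), writing I(X;Y) as an expectation of the log density ratio. A minimal sketch assuming SciPy:

```python
# Monte Carlo estimate of I(X;Y) = E[log f(X,Y) / (f(X) f(Y))] for the
# bivariate normal above, versus the closed form -0.5*log(1 - rho^2) (nats).
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(5)
sigma, rho = 1.5, 0.7
K = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], K, size=500_000)

log_joint = multivariate_normal(cov=K).logpdf(xy)
log_marg = norm.logpdf(xy[:, 0], scale=sigma) + norm.logpdf(xy[:, 1], scale=sigma)
i_mc = (log_joint - log_marg).mean()
i_exact = -0.5 * np.log(1 - rho**2)
print(i_mc, i_exact)         # both ≈ 0.337 nats
```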
8.6 Properties of Differential Entropy and Related Quantities
Properties

Theorem 4 (Relative entropy) D(f‖g) ≥ 0, with equality iff f = g almost everywhere.

Corollary 1
1. I(X;Y) ≥ 0, with equality iff X and Y are independent.
2. h(X|Y) ≤ h(X), with equality iff X and Y are independent.
Properties

Theorem 5 (Chain rule for differential entropy)

h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n h(X_i \mid X_1, X_2, \ldots, X_{i-1}).

Corollary 2

h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n h(X_i).

Corollary 3 (Hadamard's inequality) If K is the covariance matrix of a multivariate normal distribution, then

|K| \le \prod_{i=1}^n K_{ii}.
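Hadamard's inequality is easy to spot-check numerically for random positive-semidefinite matrices; a minimal sketch (not part of the slides):

```python
# Check |K| <= prod_i K_ii for random positive-semidefinite matrices.
import numpy as np

rng = np.random.default_rng(6)
for _ in range(5):
    A = rng.normal(size=(5, 5))
    K = A @ A.T                              # positive semidefinite
    det_K = np.linalg.det(K)
    prod_diag = np.prod(np.diag(K))
    print(det_K <= prod_diag + 1e-9, det_K, prod_diag)   # always True
```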
Properties

Theorem 6
1. h(X + c) = h(X).
2. h(aX) = h(X) + log |a|.
3. h(AX) = h(X) + log |det(A)|.
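Property 3 can be checked with closed forms (not part of the slides): if X ∼ N(0, K), then AX ∼ N(0, AKA^t), and the Gaussian entropy formula from Theorem 3 gives both sides directly. A sketch in nats:

```python
# For X ~ N(0, K), AX ~ N(0, A K A^t), so closed-form entropies let us
# check h(AX) = h(X) + log|det(A)|.
import numpy as np

rng = np.random.default_rng(7)
n = 3
B = rng.normal(size=(n, n))
K = B @ B.T + np.eye(n)                  # covariance of X
A = rng.normal(size=(n, n))              # an invertible linear map (almost surely)

def gaussian_entropy(cov):
    # (1/2) log((2*pi*e)^n |cov|), in nats
    return 0.5 * np.log((2 * np.pi * np.e) ** cov.shape[0] * np.linalg.det(cov))

h_x = gaussian_entropy(K)
h_ax = gaussian_entropy(A @ K @ A.T)
print(h_ax, h_x + np.log(abs(np.linalg.det(A))))   # equal up to rounding
```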
Gaussian has Maximal Entropy

Theorem 7 Let the random vector X ∈ R^n have zero mean and covariance K = E[XX^t]. Then

h(X) \le \frac{1}{2}\log\!\left[(2\pi e)^n |K|\right],

with equality iff X ∼ N(0, K).

Proof. Let g(x) be any density satisfying \int x_i x_j\, g(x)\, dx = K_{ij}, and let φ(x) be the density of N(0, K). Then

0 \le D(g\|\phi) = \int g \log(g/\phi) = -h(g) - \int g \log\phi = -h(g) - \int \phi \log\phi = -h(g) + h(\phi),

where the second-to-last equality holds because log φ(x) is a quadratic form in x and g and φ have the same second moments. That is, h(g) ≤ h(φ), and equality holds iff g = φ. □