Machine Learning Lecture 01-2: Basics of Information Theory. Nevin L. Zhang (lzhang@cse.ust.hk), Department of Computer Science and Engineering, The Hong Kong University of Science and Technology.
Outline: 1. Jensen’s Inequality, 2. Entropy, 3. Divergence, 4. Mutual Information.
Concave functions. A function f is concave on an interval I if for any x, y ∈ I and any λ ∈ [0, 1], λ f(x) + (1 − λ) f(y) ≤ f(λ x + (1 − λ) y). In words, a weighted average of function values is upper bounded by the function of the weighted average. f is strictly concave if, for λ ∈ (0, 1), equality holds only when x = y.
Jensen’s Inequality. Theorem (1.1): Suppose f is concave on an interval I. Then for any p_i ∈ [0, 1] with ∑_{i=1}^n p_i = 1 and any x_i ∈ I, ∑_{i=1}^n p_i f(x_i) ≤ f(∑_{i=1}^n p_i x_i). The weighted average of the function values is upper bounded by the function of the weighted average. If f is strictly concave, equality holds iff p_i p_j ≠ 0 implies x_i = x_j. Exercise: Prove this (using induction).
Logarithmic function. The logarithmic function is concave on the interval (0, ∞). Hence, for x_i > 0, ∑_{i=1}^n p_i log(x_i) ≤ log(∑_{i=1}^n p_i x_i). In words, exchanging ∑_i p_i with log increases the quantity. Equivalently, swapping expectation and logarithm increases the quantity: E[log x] ≤ log E[x].
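As a quick numerical sanity check (not part of the original slides; the variable names are illustrative), the sketch below verifies E[log x] ≤ log E[x] for random positive points and a random probability vector:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0.1, 10.0, size=5)   # positive points x_i
p = rng.dirichlet(np.ones(5))        # probability weights p_i summing to 1

lhs = np.sum(p * np.log(x))          # E[log x]: weighted average of log(x_i)
rhs = np.log(np.sum(p * x))          # log E[x]: log of the weighted average

print(lhs, "<=", rhs, lhs <= rhs)    # Jensen's inequality for the concave log
```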
2. Entropy
Entropy. The entropy of a random variable X is H(X) = ∑_X P(X) log(1/P(X)) = − E_P[log P(X)], with the convention that 0 log(1/0) = 0. The base of the logarithm is 2 and the unit is the bit. It is sometimes also called the entropy of the distribution, H(P). H(X) measures the amount of uncertainty about X. For a real-valued variable, replace ∑_X ... with ∫ ... dx.
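A minimal code sketch of the discrete entropy (the function name and interface are assumptions, not from the slides); it uses base-2 logarithms so the result is in bits:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p (array of probabilities)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # convention: 0 log(1/0) = 0
    return -np.sum(p * np.log2(p))
```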
Entropy examples: X is the result of a coin toss, Y the result of a die throw, Z the result of randomly picking a card from a deck of 54. Which one has the highest uncertainty? Entropy: H(X) = (1/2) log 2 + (1/2) log 2 = log 2 = 1 bit; H(Y) = (1/6) log 6 + ... + (1/6) log 6 = log 6; H(Z) = (1/54) log 54 + ... + (1/54) log 54 = log 54. Indeed, H(X) < H(Y) < H(Z).
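For illustration only, the three entropies can be checked numerically by reusing the entropy helper sketched above:

```python
import numpy as np

coin = np.full(2, 1 / 2)
die = np.full(6, 1 / 6)
card = np.full(54, 1 / 54)

print(entropy(coin))   # 1.0 bit
print(entropy(die))    # log2(6)  ≈ 2.585 bits
print(entropy(card))   # log2(54) ≈ 5.755 bits
```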
Entropy of a binary variable. For binary X, the chart (omitted here) shows H(X) as a function of p = P(X = 1). The higher H(X) is, the more uncertainty there is about the value of X.
Proposition (1.2): H(X) ≥ 0; H(X) = 0 iff P(X = x) = 1 for some x ∈ Ω_X, i.e., iff there is no uncertainty. H(X) ≤ log(|X|), with equality iff P(X = x) = 1/|X| for all x: uncertainty is highest for the uniform distribution. Proof: Because log is concave, by Jensen’s inequality, H(X) = ∑_X P(X) log(1/P(X)) ≤ log(∑_X P(X) · (1/P(X))) = log |X|.
Conditional entropy. The conditional entropy of Y given the event X = x is the entropy of the conditional distribution P(Y | X = x), i.e., H(Y | X = x) = ∑_Y P(Y | X = x) log(1/P(Y | X = x)). It is the uncertainty that remains about Y when X is known to be x. It is possible that H(Y | X = x) > H(Y): intuitively, X = x might contradict our prior knowledge about Y and increase our uncertainty about Y. Exercise: Give an example.
Conditional Entropy. The conditional entropy of Y given the variable X is H(Y | X) = ∑_x P(X = x) H(Y | X = x) = ∑_X P(X) ∑_Y P(Y | X) log(1/P(Y | X)) = ∑_{X,Y} P(X, Y) log(1/P(Y | X)) = − E[log P(Y | X)]. It is the average uncertainty that remains about Y when X is known.
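A small illustrative sketch (the function name and the example table are assumptions) computing H(Y | X) = − ∑_{x,y} P(x, y) log P(y | x) from a joint probability table:

```python
import numpy as np

def conditional_entropy(joint):
    """H(Y | X) in bits, given joint[x, y] = P(X = x, Y = y)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)             # marginal P(X = x)
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = np.where(p_x > 0, joint / p_x, 0.0)     # P(Y = y | X = x)
    mask = joint > 0
    return -np.sum(joint[mask] * np.log2(cond[mask]))

# Example: X is a fair coin; Y equals X with probability 0.9.
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])
print(conditional_entropy(joint))   # ≈ 0.469 bits, less than H(Y) = 1 bit
```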
3. Divergence
Kullback-Leibler divergence. The relative entropy or Kullback-Leibler (KL) divergence measures how much a distribution Q(X) differs from a "true" probability distribution P(X). The KL divergence of Q from P is defined as KL(P || Q) = ∑_X P(X) log(P(X)/Q(X)), with the conventions 0 log(0/0) = 0 and p log(p/0) = ∞ if p ≠ 0.
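An illustrative implementation sketch of the discrete KL divergence (function name, interface, and example values are mine, not from the slides):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in bits for discrete distributions over the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                     # terms with p = 0 contribute 0
    if np.any(q[mask] == 0):
        return np.inf                # p log(p/0) = infinity when p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric, both positive
```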
Theorem (1.2) (Gibbs’ inequality): KL(P || Q) ≥ 0, with equality iff P is identical to Q. Proof: ∑_X P(X) log(P(X)/Q(X)) = − ∑_X P(X) log(Q(X)/P(X)) ≥ − log(∑_X P(X) · (Q(X)/P(X))) (Jensen’s inequality) = − log ∑_X Q(X) = 0. In words, the KL divergence between P and Q is larger than 0 unless P and Q are identical.
Cross Entropy. Entropy: H(P) = ∑_X P(X) log(1/P(X)) = − E_P[log P(X)]. Cross entropy: H(P, Q) = ∑_X P(X) log(1/Q(X)) = − E_P[log Q(X)]. Relationship with KL: KL(P || Q) = ∑_X P(X) log(P(X)/Q(X)) = E_P[log P(X)] − E_P[log Q(X)] = H(P, Q) − H(P). Or, H(P, Q) = KL(P || Q) + H(P).
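A quick numerical check of the identity H(P, Q) = KL(P || Q) + H(P); the two distributions here are arbitrary illustrative examples:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

h_p = -np.sum(p * np.log2(p))            # entropy H(P)
h_pq = -np.sum(p * np.log2(q))           # cross entropy H(P, Q)
kl_pq = np.sum(p * np.log2(p / q))       # KL(P || Q)

print(np.isclose(h_pq, kl_pq + h_p))     # True: H(P, Q) = KL(P || Q) + H(P)
```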
A corollary. Corollary (1.1) (Gibbs inequality): H(P, Q) ≥ H(P), or ∑_X P(X) log Q(X) ≤ ∑_X P(X) log P(X). More generally, let f(X) be a non-negative function. Then ∑_X f(X) log Q(X) ≤ ∑_X f(X) log P*(X), where P*(X) = f(X) / ∑_X f(X).
Unsupervised Learning. There is an unknown true distribution P(x): P(x) → (sampling) D = {x_i}_{i=1}^N → (learning) Q(x). Objective: minimize KL(P || Q), which is the same as minimizing the cross entropy H(P, Q). Approximating the cross entropy using data: H(P, Q) = − ∫ P(x) log Q(x) dx ≈ − (1/N) ∑_{i=1}^N log Q(x_i) = − (1/N) log Q(D). This is the same as maximizing the log-likelihood log Q(D).
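As an illustrative sketch (the Gaussian model and all names are assumptions, not from the slides), the average negative log-likelihood of samples drawn from P estimates the cross entropy H(P, Q), so minimizing it over the model parameters is maximum likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

samples = rng.normal(loc=0.0, scale=1.0, size=10_000)   # D drawn from the true P
mu, sigma = 0.3, 1.2                                     # parameters of the model Q

def log_q(x, mu, sigma):
    """Log-density of Q, here a Gaussian with mean mu and std sigma."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

nll = -np.mean(log_q(samples, mu, sigma))   # Monte Carlo estimate of H(P, Q), in nats
print(nll)                                  # smaller for (mu, sigma) closer to (0, 1)
```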
Supervised Learning. There is an unknown true distribution P(x, y), where y is the label of input x: P(x, y) → (sampling) D = {(x_i, y_i)}_{i=1}^N → (learning) Q(y | x). Objective: minimize the cross (conditional) entropy H(P, Q) = − ∫ P(x, y) log Q(y | x) dx dy ≈ − (1/N) ∑_{i=1}^N log Q(y_i | x_i). This is the same as maximizing the log-likelihood ∑_{i=1}^N log Q(y_i | x_i), or minimizing the negative log-likelihood (NLL) − ∑_{i=1}^N log Q(y_i | x_i).
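A minimal sketch (function names, scores, and labels are illustrative assumptions) of the NLL as the familiar cross-entropy loss for classification, with Q(y | x) given by a softmax over model scores:

```python
import numpy as np

def nll_loss(scores, labels):
    """Mean negative log-likelihood, with Q(y | x) = softmax(scores)."""
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

scores = np.array([[2.0, 0.5, -1.0],   # model scores for 2 inputs, 3 classes
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])              # true class of each input
print(nll_loss(scores, labels))        # lower means Q(y_i | x_i) is higher
```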
Jensen-Shannon divergence. KL is not symmetric: KL(P || Q) is usually not equal to the reverse KL, KL(Q || P). The Jensen-Shannon divergence is one symmetrized version of KL: JS(P || Q) = (1/2) KL(P || M) + (1/2) KL(Q || M), where M = (P + Q)/2. Properties: 0 ≤ JS(P || Q) ≤ log 2; JS(P || Q) = 0 iff P = Q; JS(P || Q) = log 2 if P and Q have disjoint supports.
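An illustrative sketch of the JS divergence (function names are mine), reusing a small KL helper; with base-2 logarithms the upper bound log 2 equals 1 bit:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; always between 0 and 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                 # mixture M = (P + Q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(js_divergence(p, q))   # 1.0 bit = log 2, since the supports are disjoint
```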
4. Mutual Information
Mutual information. The mutual information of X and Y is I(X; Y) = H(X) − H(X | Y): the average reduction in uncertainty about X from learning the value of Y, or the average amount of information Y conveys about X.
Mutual information and KL divergence. Note that I(X; Y) = ∑_X P(X) log(1/P(X)) − ∑_{X,Y} P(X, Y) log(1/P(X | Y)) = ∑_{X,Y} P(X, Y) log(1/P(X)) − ∑_{X,Y} P(X, Y) log(1/P(X | Y)) = ∑_{X,Y} P(X, Y) log(P(X | Y)/P(X)) = ∑_{X,Y} P(X, Y) log(P(X, Y)/(P(X) P(Y))) (an equivalent definition) = KL(P(X, Y) || P(X) P(Y)). Because this equivalent definition is symmetric in X and Y: I(X; Y) = H(X) − H(X | Y) = I(Y; X) = H(Y) − H(Y | X).
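An illustrative sketch (names and the example table are assumptions) computing I(X; Y) as KL(P(X, Y) || P(X) P(Y)) from a joint probability table:

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in bits, given joint[x, y] = P(X = x, Y = y)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)    # marginal P(X)
    p_y = joint.sum(axis=0, keepdims=True)    # marginal P(Y)
    indep = p_x * p_y                         # product distribution P(X) P(Y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / indep[mask]))

joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])
print(mutual_information(joint))   # ≈ 0.531 bits = H(Y) − H(Y | X) = 1 − 0.469
```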
Property of mutual information. Theorem (1.3): I(X; Y) ≥ 0, with equality iff X ⊥ Y. Interpretation: X and Y are independent iff X contains no information about Y, and vice versa. Proof: Follows from the previous slide and Theorem 1.2.
Conditional entropy revisited. Theorem (1.4): H(X | Y) ≤ H(X), with equality iff X ⊥ Y. On average, observation reduces uncertainty, except in the case of independence. Proof: Follows from Theorem 1.3.
Mutual information and entropy. From the definition of mutual information, I(X; Y) = H(X) − H(X | Y), and the chain rule, H(X, Y) = H(Y) + H(X | Y), we get H(X) + H(Y) = H(X, Y) + I(X; Y), i.e., I(X; Y) = H(X) + H(Y) − H(X, Y). Consequently, H(X, Y) ≤ H(X) + H(Y), with equality iff X ⊥ Y.