CS480/680 Machine Learning, Lecture 5: January 21st, 2020
Information Theory
Zahra Sheikhbahaee, University of Waterloo, Winter 2020
Sources: Elements of Information Theory; Information Theory, Inference, and Learning Algorithms
Outline
• Information-theoretic entropy
• Mutual information
• Decision trees
• KL divergence
• Applications
Information Theory
What is information theory? It provides a quantitative measure of the information content of a message, i.e., of how much surprise there is in an event. Two of its central questions:
• What is the ultimate limit of data compression? (entropy)
• What is the ultimate transmission rate of communication? (channel capacity: the ability of a channel to carry what a given information source produces)
Information Theory
• A message saying the sun rose this morning is uninformative.
• A message saying there was a solar eclipse this morning is very informative.
• Independent events should carry additive information.
Entropy
• Definition: Entropy measures the amount of uncertainty of a random quantity. We view information as a reduction in uncertainty, i.e., as surprise: observing something unexpected yields information.
• Shannon's entropy, the average amount of information about a random variable $X$, is given by the expected value
$$H(X) = -\sum_x p(x)\log_2 p(x) = -\mathbb{E}[\log_2 p(x)].$$
• Example: a building has 4 floors with 8 apartments on each floor. The probability that a friend lives in any particular one of the 32 apartments is $p(x) = \frac{1}{32}$, so the entropy is $-\sum_{x=1}^{32}\frac{1}{32}\log_2\frac{1}{32} = 5$ bits. After a neighbour tells you that your friend lives on the top floor, the entropy drops to $-\sum_{x=1}^{8}\frac{1}{8}\log_2\frac{1}{8} = 3$ bits. The neighbour conveyed 2 bits of information.
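A minimal NumPy sketch (not part of the original slides) that reproduces the apartment example numerically; the `entropy` helper is a generic discrete-entropy function introduced here for illustration:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p (zero-probability terms contribute 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

# Apartment example from the slide: 4 floors x 8 apartments = 32 equally likely locations.
prior = np.full(32, 1 / 32)
posterior = np.full(8, 1 / 8)               # after learning the friend lives on the top floor

print(entropy(prior))                       # 5.0 bits
print(entropy(posterior))                   # 3.0 bits
print(entropy(prior) - entropy(posterior))  # 2.0 bits conveyed by the neighbour
```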
Entropy
• Definition (Conditional Entropy). Given two random variables $X$ and $Y$, the conditional entropy of $X$ given $Y$, written $H(X|Y)$, is
$$H(X|Y) = \sum_y H(X|Y=y)\,P(Y=y) = \mathbb{E}_y[H(X|Y=y)].$$
In the special case that $X$ and $Y$ are independent, $H(X|Y) = H(X)$, which captures that we learn nothing about $X$ from $Y$.
• Theorem: Let $X$ and $Y$ be random variables. Then $H(X|Y) \le H(X)$. This means that, on average, conditioning on another variable $Y$ can never increase the uncertainty about $X$.
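A short sketch (not from the slides) that computes $H(X|Y)$ directly from a joint probability table; the joint distribution used here is a made-up example:

```python
import numpy as np

def entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

def conditional_entropy(joint, base=2):
    """H(X|Y) from a joint table with rows indexed by x and columns by y."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=0)                            # marginal P(Y = y)
    h = 0.0
    for j, py in enumerate(p_y):
        if py > 0:
            h += py * entropy(joint[:, j] / py, base)  # P(Y=y) * H(X | Y=y)
    return h

# Hypothetical joint distribution over (X, Y); values are illustrative only.
joint = np.array([[0.25, 0.25],
                  [0.40, 0.10]])
print(conditional_entropy(joint))   # H(X|Y)
print(entropy(joint.sum(axis=1)))   # H(X), which is >= H(X|Y) as the theorem states
```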
Jensen's Inequality
• Definition: If $f$ is a continuous and concave function, and $p_1, \dots, p_n$ are nonnegative reals summing to 1, then for any $x_1, \dots, x_n$:
$$\sum_{i=1}^n p_i f(x_i) \le f\!\left(\sum_{i=1}^n p_i x_i\right).$$
• If we treat $(p_1, \dots, p_n)$ as a distribution $p$, and apply $f$ coordinate-wise, we can write the inequality as
$$\mathbb{E}_p[f(x)] \le f\big(\mathbb{E}_p[x]\big).$$
• With $p_i = \frac{1}{n}$ and the concave function $\ln x$, we have
$$\frac{1}{n}\sum_{i=1}^n \ln x_i \le \ln\!\left(\frac{1}{n}\sum_{i=1}^n x_i\right).$$
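A quick numerical check (an illustration added here, not from the slides) of the last inequality for the concave function $\ln x$ with uniform weights:

```python
import numpy as np

# Arbitrary positive values; the inequality should hold for any such sample.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=1000)

lhs = np.mean(np.log(x))     # (1/n) * sum of ln(x_i)
rhs = np.log(np.mean(x))     # ln of the average
print(lhs, rhs, lhs <= rhs)  # the mean of the logs is at most the log of the mean
```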
Mutual Information
Using Jensen's inequality, we can show that $H(X|Y) - H(X) \le 0$:
$$H(X|Y) - H(X) = \sum_{x,y} P(Y=y)\,P(X=x|Y=y)\log_2\frac{1}{P(X=x|Y=y)} - \sum_{x} P(X=x)\log_2\frac{1}{P(X=x)}\sum_{y} P(Y=y|X=x)$$
$$= \sum_{x,y} P(X=x \cap Y=y)\log_2\frac{P(X=x)}{P(X=x|Y=y)} = \sum_{x,y} P(X=x \cap Y=y)\log_2\frac{P(X=x)\,P(Y=y)}{P(X=x \cap Y=y)}$$
$$\le \log_2\left[\sum_{x,y} P(X=x \cap Y=y)\,\frac{P(X=x)\,P(Y=y)}{P(X=x \cap Y=y)}\right] = \log_2 1 = 0.$$
• Definition: The mutual information of two random variables $X$ and $Y$, written $I(X;Y)$, is
$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y;X).$$
• In the case that $X$ and $Y$ are independent, as noted above, $I(X;Y) = H(X) - H(X|Y) = 0$.
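A small sketch (added for illustration, reusing the same made-up joint table as before) that computes $I(X;Y)$ from a joint distribution and checks it against the equivalent expression $H(X) + H(Y) - H(X,Y)$:

```python
import numpy as np

def entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

def mutual_information(joint, base=2):
    """I(X;Y) computed directly from a joint table P(X=x, Y=y)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / (p_x[i] * p_y[j])) / np.log(base)
    return mi

# Hypothetical joint distribution (illustrative values only).
joint = np.array([[0.25, 0.25],
                  [0.40, 0.10]])
print(mutual_information(joint))                        # I(X;Y)
print(entropy(joint.sum(axis=1))                        # H(X)
      + entropy(joint.sum(axis=0))                      # + H(Y)
      - entropy(joint.ravel()))                         # - H(X,Y): same value
```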
Information Gain
• Definition: the amount of information gained about a random variable or signal from observing another random variable.
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a non-linear classifier known as a decision tree.
Decision Tree
Each internal node checks one feature $x_i$:
• Go left if $x_i <$ threshold
• Go right if $x_i \ge$ threshold
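A minimal sketch (not from the slides) of this routing rule: a hand-built tree with one split, where the `Node` structure and labels are hypothetical:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index i of the feature x_i tested at this node
    threshold: Optional[float] = None  # split threshold
    left: Optional["Node"] = None      # subtree for x_i <  threshold
    right: Optional["Node"] = None     # subtree for x_i >= threshold
    label: Any = None                  # class label if this node is a leaf

def predict(node, x):
    """Route a single example x down the tree using the rule from the slide."""
    while node.label is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label

# Tiny hand-built tree: split on feature 0 at threshold 2.5.
tree = Node(feature=0, threshold=2.5,
            left=Node(label="class A"),
            right=Node(label="class B"))
print(predict(tree, [1.0]))   # class A
print(predict(tree, [3.7]))   # class B
```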
Decision Tree
• Every binary split of a node $t$ generates two descendant nodes ($t_L$, $t_R$) with subsets ($X_{t_L}$, $X_{t_R}$) respectively.
• The tree grows from the root node down to the leaves, generating subsets that are more class-homogeneous than the ancestor's subset $X_t$.
• A measure that quantifies node impurity, so that we can choose the split that most decreases the overall impurity of the descendant nodes w.r.t. the ancestor's impurity, is
$$H(t) = -\sum_{i=1}^{M} P(c_i \mid t)\log_2 P(c_i \mid t),$$
where $P(c_i \mid t)$ is the probability that a vector in the subset $X_t$ associated with node $t$ belongs to class $c_i$.
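A short sketch (added here) that estimates this entropy impurity from the class labels of the samples reaching a node, using empirical frequencies for $P(c_i \mid t)$:

```python
import numpy as np
from collections import Counter

def node_impurity(labels, base=2):
    """Entropy impurity H(t) of a node from the class labels of the samples reaching it."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()                  # empirical P(c_i | t)
    return -np.sum(p * np.log(p)) / np.log(base)

print(node_impurity(["A", "A", "B", "B"]))     # 1.0 bit: maximally impure for 2 classes
print(node_impurity(["A", "A", "A", "A"]))     # 0.0: pure node
```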
Decision Tree
• If all class probabilities are equal to $\frac{1}{M}$, the impurity is maximal.
• If all data in the node belong to a single class, $H(t) = -1\log_2 1 = 0$.
• Information gain measures how good a split is via the decrease in node impurity:
$$\Delta H(t) = H(t) - \frac{N_{t_L}}{N_t} H(t_L) - \frac{N_{t_R}}{N_t} H(t_R),$$
where $H(t_L)$ is the impurity of $t_L$ and $N_{t_L}/N_t$ is the fraction of the node's samples sent to the left child (similarly for $t_R$).
• Goal: among the set of candidate questions, adopt the one whose split leads to the highest decrease of impurity; see the sketch below.
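A self-contained sketch of this information-gain computation (the splits below are hypothetical examples, and `node_impurity` is the same helper as in the previous sketch):

```python
import numpy as np
from collections import Counter

def node_impurity(labels, base=2):
    """Entropy impurity H(t) of a node, as in the previous sketch."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p)) / np.log(base)

def information_gain(parent, left, right):
    """Decrease in impurity Delta H(t) for a candidate binary split of a node."""
    n, n_l, n_r = len(parent), len(left), len(right)
    return (node_impurity(parent)
            - (n_l / n) * node_impurity(left)
            - (n_r / n) * node_impurity(right))

# A split that separates the two classes perfectly recovers the full 1 bit...
print(information_gain(["A", "A", "B", "B"], ["A", "A"], ["B", "B"]))   # 1.0
# ...while a split that leaves both children mixed gains nothing.
print(information_gain(["A", "A", "B", "B"], ["A", "B"], ["A", "B"]))   # 0.0
```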
Decision Tree
• Entropy is 0 if all samples are in the same class.
• Entropy is largest when $P(c_1) = \dots = P(c_M)$ (uniform class distribution).
• Among the candidate splits, choose the one that gives the maximal information gain.
KL Divergence
• Consider some unknown distribution $p(x)$, and suppose that we have modelled it using an approximate distribution $q(x)$. The average additional amount of information required to specify the value of $x$ as a result of using $q(x)$ instead of $p(x)$ is
$$\mathrm{KL}(p \parallel q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx.$$
This is known as the relative entropy or Kullback-Leibler divergence.
• The KL divergence is not a symmetric quantity: in general $\mathrm{KL}(p \parallel q) \ne \mathrm{KL}(q \parallel p)$.
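A small sketch (added for illustration) of the discrete version of this formula, using two made-up distributions to show the asymmetry; it assumes $q(x) > 0$ wherever $p(x) > 0$:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, in nats; terms with p(x)=0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                    # assumes q > 0 wherever p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))   # KL(p || q)
print(kl_divergence(q, p))   # KL(q || p): a different value, showing the asymmetry
print(kl_divergence(p, p))   # 0.0: equality holds iff the distributions match
```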
KL Divergence
• We show that $\mathrm{KL}(p \parallel q) \ge 0$, with equality if and only if $p(x) = q(x)$.
• A function $f$ is convex if every chord lies on or above the function. Any value of $x$ in the interval from $x = a$ to $x = b$ can be written in the form $\lambda a + (1-\lambda)b$, where $0 \le \lambda \le 1$.
• Convexity of $f$ is then given by
$$f(\lambda a + (1-\lambda)b) \le \lambda f(a) + (1-\lambda)f(b).$$
• Using a proof by induction, this extends to Jensen's inequality: for $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$,
$$f\!\left(\sum_i \lambda_i x_i\right) \le \sum_i \lambda_i f(x_i), \qquad \text{and, for densities,} \qquad f\!\left(\int x\,p(x)\,dx\right) \le \int f(x)\,p(x)\,dx.$$
• Applying Jensen's inequality with the convex function $-\ln u$, the KL divergence becomes
$$\mathrm{KL}(p \parallel q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx \ge -\ln\int p(x)\,\frac{q(x)}{p(x)}\,dx = -\ln\int q(x)\,dx = 0.$$
KL Divergence
• We can minimize the KL divergence with respect to the parameters $\theta$ of $q$:
$$\arg\min_{\theta} \mathrm{KL}(p \parallel q_\theta).$$
• Suppose $p(x)$ is a bimodal distribution and we try to approximate it with a Gaussian $q_\theta$ by minimizing this (forward) KL divergence. We call this mean-seeking behaviour, because the approximate distribution $q_\theta$ must cover all the modes and regions of high probability in $p$.
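A rough numerical sketch of this mean-seeking behaviour (not from the slides): the bimodal target, the grid-based KL approximation, and the optimizer settings are all illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# A bimodal target p(x): an equal mixture of two well-separated Gaussians.
xs = np.linspace(-10.0, 10.0, 2001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, loc=-3.0, scale=1.0) + 0.5 * norm.pdf(xs, loc=3.0, scale=1.0)

def forward_kl(params):
    """Numerically approximate KL(p || q_theta) for a Gaussian q_theta on the grid."""
    mu, log_sigma = params
    q = norm.pdf(xs, loc=mu, scale=np.exp(log_sigma)) + 1e-300   # avoid log(0)
    return np.sum(p * np.log(p / q)) * dx

result = minimize(forward_kl, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
# The fitted Gaussian sits between the modes (mean near 0) with a large standard
# deviation, spreading its mass to cover both modes: the mean-seeking behaviour.
print(mu_hat, sigma_hat)
```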