CS7015 (Deep Learning) : Lecture 20
Markov Chains, Gibbs Sampling for Training RBMs, Contrastive Divergence for Training RBMs
Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras
Module 20.1 : Markov Chains
Let us first begin by restating our goals.

Goal 1 : Given a random variable X ∈ R^n, we are interested in drawing samples from the joint distribution P(X).

Goal 2 : Given a function f(X) defined over the random variable X, we are interested in computing the expectation E_{P(X)}[f(X)].

We will use Gibbs Sampling (a member of the Metropolis-Hastings family of algorithms) to achieve these goals. We will first understand the intuition behind Gibbs Sampling and then the math behind it.
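As a quick illustration of why samples are enough for Goal 2: if we could draw samples x^(1), ..., x^(N) from P(X), the expectation could be approximated by the sample average (the standard Monte Carlo estimate). Below is a minimal sketch; the sampler `draw_sample` is a hypothetical stand-in for whatever procedure (eventually, Gibbs sampling) produces samples from P(X).

```python
import numpy as np

def monte_carlo_expectation(draw_sample, f, num_samples=10_000):
    """Estimate E_{P(X)}[f(X)] by averaging f over samples drawn from P(X)."""
    return np.mean([f(draw_sample()) for _ in range(num_samples)])

# Toy usage: X is a 3-dimensional binary vector with independent fair bits,
# and f counts the number of 1s. The true expectation is 1.5.
rng = np.random.default_rng(0)
estimate = monte_carlo_expectation(lambda: rng.integers(0, 2, size=3),
                                   lambda x: x.sum())
print(estimate)  # ≈ 1.5
```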
Suppose instead of a single random variable X ∈ R^n, we have a chain of random variables X_1, X_2, ..., X_k, each X_i ∈ R^n. The i here corresponds to a time step. For example, X_i could be an n-dimensional vector containing the number of customers in a given set of n restaurants on day i. In our case, X_i could be the 1024-dimensional image sent by our friend on day i. For ease of illustration we will stick to the restaurant example and assume that instead of actual counts we are interested only in binary counts (high = 1, low = 0). Thus X_i ∈ {0,1}^n.
On day 1, let X_1 take on the value x_1 (x_1 is one of the 2^n possible vectors). On day 2, let X_2 take on the value x_2 (x_2 is again one of the 2^n possible vectors). One way of looking at this is that the state has transitioned from x_1 to x_2. Similarly, on day 3, if X_3 takes on the value x_3, then we can say that the state has transitioned from x_1 to x_2 to x_3. Finally, on day k, we can say that the state has transitioned from x_1 to x_2 to x_3 to ... to x_k.
We may now be interested in knowing the most likely value that the state will take on day i, given the states on days 1 to i−1. More formally, we may be interested in the following distribution:

P(X_i = x_i | X_1 = x_1, X_2 = x_2, ..., X_{i−1} = x_{i−1})

Now suppose the chain exhibits the following Markov property:

P(X_i = x_i | X_1 = x_1, X_2 = x_2, ..., X_{i−1} = x_{i−1}) = P(X_i = x_i | X_{i−1} = x_{i−1})

In other words, given the previous state X_{i−1}, X_i is independent of all preceding states. Can we draw a graphical model to encode this independence assumption?
In this graphical model, the random variables are X_1, X_2, ..., X_k, and we will have a node corresponding to each of these random variables. What will be the edges in the graph? Well, each node only depends on its predecessor, so we will just have an edge between successive nodes: X_1 → X_2 → ... → X_k.
This property (X_i ⊥ X_1, ..., X_{i−2} | X_{i−1}) is called the Markov property, and the resulting chain X_1, X_2, ..., X_k is called a Markov chain. Further, since we are considering discrete time steps, this is called a discrete time Markov chain. And since the X_i's take on discrete values, this is a discrete time, discrete space Markov chain. Okay, but why are we interested in Markov chains? (We will get there soon! For now let us just focus on these definitions.)
Let us delve a bit deeper into Markov chains and define a few more quantities. Let us assume 2^n = l (i.e., X_i can take l values; recall that each X_i ∈ {0,1}^n). How many values do we need to specify the distribution P(X_i = x_i | X_{i−1} = x_{i−1})? We need l^2 values, which we can represent as a matrix T ∈ R^{l×l} where the entry T_{ab} denotes the probability of transitioning to state b from state a (i.e., P(X_i = b | X_{i−1} = a)). The matrix T is called the transition matrix. For example:

X_{i−1}   X_i   T_ab
1         1     0.05
1         2     0.06
...       ...   ...
1         l     0.02
2         1     0.03
2         2     0.07
...       ...   ...
2         l     0.01
...       ...   ...
l         1     0.10
l         2     0.09
...       ...   ...
l         l     0.21
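As a concrete sketch, a transition matrix is just an l × l array whose rows are probability distributions over the next state. The numbers below are the 3-state example that appears in the numerical illustration a few slides later:

```python
import numpy as np

l = 3  # number of states (in general l = 2^n)

# T[a, b] = P(X_i = b | X_{i-1} = a); every row must sum to 1
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

assert np.allclose(T.sum(axis=1), 1.0), "each row must be a distribution"
```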
We need to define this transition matrix, i.e., P(X_i = b | X_{i−1} = a) ∀ a, b ∀ i. Why do we need to define this ∀ i? Well, because these transition probabilities may be different for different time steps. For example, the transition in the number of customers may be different from Friday to Saturday (a weekend) as compared to from Sunday to Monday (a weekday). Thus, for a Markov chain X_1, X_2, ..., X_k we will have k such transition matrices T_1, T_2, ..., T_k.
However, for this discussion we will assume that the Markov chain is time homogeneous. What does that mean? It means that

T_1 = T_2 = ... = T_k = T

In other words,

P(X_i = b | X_{i−1} = a) = T_{ab} ∀ a, b ∀ i

The transition matrix does not depend on the time step i, and hence such a Markov chain is called time homogeneous.
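Time homogeneity makes simulation simple: one and the same T is reused at every step. A minimal sketch of drawing a trajectory x_0, x_1, ..., x_k, with states encoded as integers 0, ..., l−1 and the starting distribution mu0 (defined formally on the next slide) and T as assumed inputs:

```python
import numpy as np

def simulate_chain(mu0, T, k, seed=0):
    """Draw a trajectory x_0, ..., x_k from a time-homogeneous Markov chain."""
    rng = np.random.default_rng(seed)
    states = [rng.choice(len(mu0), p=mu0)]            # x_0 ~ mu^0
    for _ in range(k):
        a = states[-1]
        states.append(rng.choice(T.shape[1], p=T[a]))  # x_i ~ T[x_{i-1}, :]
    return states
```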
Now suppose the starting distribution at time step 0 is given by μ^0. Just to be clear, μ^0 is a 2^n dimensional vector such that μ^0_a = P(X_0 = a): μ^0_a is the probability that the random variable takes on the value a among all the 2^n possible values. Given μ^0 and T, how will we compute μ^k, where μ^k_a = P(X_k = a)? μ^k is again a 2^n dimensional vector whose a-th entry tells us the probability that X_k will take on the value a among all the 2^n possible values.
Let us consider P(X_1 = b):

P(X_1 = b) = Σ_a P(X_0 = a, X_1 = b)

The above sum essentially captures all the paths of reaching X_1 = b, irrespective of the value of X_0. Expanding,

P(X_1 = b) = Σ_a P(X_0 = a, X_1 = b)
           = Σ_a P(X_0 = a) P(X_1 = b | X_0 = a)
           = Σ_a μ^0_a T_{ab}
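This sum translates directly into code. A sketch that computes P(X_1 = b) for one state b with an explicit loop, where mu0 is a length-l probability vector and T is the l × l transition matrix from the earlier sketch:

```python
def prob_next_state(mu0, T, b):
    """P(X_1 = b) = sum over a of mu0[a] * T[a, b]."""
    return sum(mu0[a] * T[a, b] for a in range(len(mu0)))
```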
Let us see if there is a more compact way of writing the distribution P(X_1) (i.e., of specifying P(X_1 = b) ∀ b). Let us consider a simple case where l = 3 (as opposed to 2^n). Thus, μ^0 ∈ R^3 and T ∈ R^{3×3}. What does the product μ^0 T give us? It gives us the distribution μ^1! (The b-th entry of this vector is Σ_a μ^0_a T_{ab}, which is P(X_1 = b).) For example, with μ^0 = [0.3  0.4  0.3] and

T = [0.2  0.5  0.3]
    [0.3  0.6  0.1]
    [0.4  0.2  0.4]

we get

μ^0 T = [0.3  0.45  0.25]
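A one-liner with NumPy confirms this vector-matrix computation:

```python
import numpy as np

mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

mu1 = mu0 @ T
print(mu1)  # [0.3  0.45 0.25]
```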
Similarly, let us consider P(X_2 = b):

P(X_2 = b) = Σ_a P(X_1 = a, X_2 = b)

The above sum essentially captures all the paths of reaching X_2 = b, irrespective of the value of X_1. Expanding,

P(X_2 = b) = Σ_a P(X_1 = a, X_2 = b)
           = Σ_a P(X_1 = a) P(X_2 = b | X_1 = a)
           = Σ_a μ^1_a T_{ab}
Once again we can write P(X_2) compactly as

P(X_2) = μ^1 T = (μ^0 T) T = μ^0 T^2

In general,

P(X_k) = μ^0 T^k

Thus the distribution at any time step can be computed by picking the appropriate element from the following series:

μ^0 T^1, μ^0 T^2, μ^0 T^3, ..., μ^0 T^k, ...

Note that this is still computationally expensive because it involves a product of μ^0 (of size 2^n) and T^k (of size 2^n × 2^n) (but later on we will see that we do not need this full product).
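In code, one would never form T^k explicitly; repeatedly applying T to the current distribution vector gives the same result with k cheap vector-matrix products. A sketch:

```python
import numpy as np

def distribution_at_step(mu0, T, k):
    """Compute mu^k = mu^0 T^k via k vector-matrix products."""
    mu = np.asarray(mu0, dtype=float)
    for _ in range(k):
        mu = mu @ T   # mu^{i} = mu^{i-1} T
    return mu

# distribution_at_step([0.3, 0.4, 0.3], T, 1) reproduces [0.3, 0.45, 0.25]
```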
If at a certain time step t, μ^t reaches a distribution π such that

πT = π

then for all subsequent time steps

μ^j = π (j ≥ t)

π is then called the stationary distribution of the Markov chain. X_t, X_{t+1}, X_{t+2}, ... will all follow the same distribution π. In other words, if we have X_t = x_t, X_{t+1} = x_{t+1}, X_{t+2} = x_{t+2} and so on, then we can think of x_t, x_{t+1}, x_{t+2}, ... as samples drawn from the same distribution π (this is a crucial property and we will return to it soon).
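One simple way to find π numerically is to keep applying T until the distribution stops changing. This is a minimal sketch, assuming the chain converges to a unique stationary distribution (which holds for irreducible, aperiodic chains such as the 3-state example above):

```python
import numpy as np

def stationary_distribution(T, tol=1e-10, max_iter=10_000):
    """Find pi with pi T = pi by iterating mu <- mu T from a uniform start."""
    mu = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(max_iter):
        mu_next = mu @ T
        if np.allclose(mu_next, mu, atol=tol):
            break
        mu = mu_next
    return mu

T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])
pi = stationary_distribution(T)
print(np.allclose(pi @ T, pi))  # True: pi is stationary
```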