Chapter 11. Stochastic Methods Rooted in Statistical Mechanics
Neural Networks and Learning Machines (Haykin)
Lecture Notes on Self-learning Neural Algorithms
Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University
Version: 20170926 → 20170928 → 20171011
Contents
11.1 Introduction
11.2 Statistical Mechanics
11.3 Markov Chains
11.4 Metropolis Algorithm
11.5 Simulated Annealing
11.6 Gibbs Sampling
11.7 Boltzmann Machine
11.8 Logistic Belief Nets
11.9 Deep Belief Nets
11.10 Deterministic Annealing (DA)
11.11 Analogy of DA with EM
Summary and Discussion
11.1 Introduction

• Statistical mechanics as a source of ideas for unsupervised (self-organized) learning systems
• Statistical mechanics
  - The formal study of macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
  - The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
  - The concept of entropy plays a vital role in statistical mechanics, as it does in Shannon's information theory.
  - The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.
• Statistical mechanics for the study of neural networks
  - Cragg and Temperley (1954) and Cowan (1968)
  - Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)
11.2 Statistical Mechanics (1/2)

p_i: probability of occurrence of state i of a stochastic system
    p_i \geq 0 \ \text{(for all } i\text{)} \quad \text{and} \quad \sum_i p_i = 1
E_i: energy of the system when it is in state i

In thermal equilibrium, the probability of state i is given by the canonical (Gibbs) distribution:
    p_i = \frac{1}{Z} \exp\left(-\frac{E_i}{k_B T}\right), \qquad Z = \sum_i \exp\left(-\frac{E_i}{k_B T}\right)

\exp(-E_i / k_B T): Boltzmann factor
Z: sum over states (partition function)

1. States of low energy have a higher probability of occurrence than states of high energy.
2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.

We set k_B = 1 and view -\log p_i as "energy".
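A minimal numerical sketch of these two properties (the three state energies below are illustrative values, not from the text):

```python
import numpy as np

def gibbs_probabilities(energies, T):
    """Canonical (Gibbs) distribution p_i = exp(-E_i/T) / Z, with k_B = 1."""
    weights = np.exp(-np.asarray(energies) / T)   # Boltzmann factors
    return weights / weights.sum()                # normalize by partition function Z

energies = [0.0, 1.0, 2.0]          # three states; state 0 has the lowest energy
for T in [10.0, 1.0, 0.1]:
    print(T, gibbs_probabilities(energies, T))
# At T = 10 the distribution is nearly uniform; at T = 0.1 almost all
# probability mass sits on the lowest-energy state.
```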
11.2 Statistical Mechanics (2/2)

Helmholtz free energy:
    F = -T \log Z
Average energy:
    \langle E \rangle = \sum_i p_i E_i
Entropy:
    H = -\sum_i p_i \log p_i
Then
    \langle E \rangle - F = -T \sum_i p_i \log p_i = TH
and thus
    F = \langle E \rangle - TH

The free energy of the system, F, tends to decrease and become a minimum in an equilibrium situation. The resulting probability distribution is defined by the Gibbs distribution (the principle of minimum free energy). Nature likes to find a physical system with minimum free energy.

Consider two systems A and A' in thermal contact, with entropy changes ΔH and ΔH'. The total entropy tends to increase:
    \Delta H + \Delta H' \geq 0
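A quick numerical check of the identity F = ⟨E⟩ − TH under the Gibbs distribution (the four state energies are arbitrary illustrative values):

```python
import numpy as np

T = 0.7
E = np.array([0.0, 1.0, 2.5, 3.0])      # arbitrary state energies (k_B = 1)
Z = np.exp(-E / T).sum()                 # partition function
p = np.exp(-E / T) / Z                   # Gibbs distribution

F = -T * np.log(Z)                       # Helmholtz free energy
avg_E = (p * E).sum()                    # average energy <E>
H = -(p * np.log(p)).sum()               # entropy

print(F, avg_E - T * H)                  # the two values agree
```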
11.3 Markov Chains (1/9)

Markov property:
    P(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = P(X_{n+1} = x_{n+1} \mid X_n = x_n)

Transition probability from state i at time n to state j at time n+1:
    p_{ij} = P(X_{n+1} = j \mid X_n = i)
    \left(p_{ij} \geq 0 \ \forall i,j \quad \text{and} \quad \sum_j p_{ij} = 1 \ \forall i\right)

If the transition probabilities are fixed, the Markov chain is homogeneous. In the case of a system with a finite number of possible states K, the transition probabilities constitute a K-by-K matrix (stochastic matrix):

    P = \begin{pmatrix} p_{11} & \cdots & p_{1K} \\ \vdots & \ddots & \vdots \\ p_{K1} & \cdots & p_{KK} \end{pmatrix}
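As a sketch, a homogeneous chain can be simulated directly from its stochastic matrix; the matrix below is the two-state chain of Example 1 later in these notes, and the helper function `simulate` is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-state stochastic matrix: each row is a probability distribution.
P = np.array([[0.25, 0.75],
              [0.50, 0.50]])
assert np.allclose(P.sum(axis=1), 1.0)   # rows of a stochastic matrix sum to 1

def simulate(P, x0, n_steps):
    """Simulate a homogeneous Markov chain: the next state depends only on the current one."""
    states = [x0]
    for _ in range(n_steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(simulate(P, x0=0, n_steps=20))
```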
11.3 Markov Chains (2/9)

Generalization to the m-step transition probability:
    p_{ij}^{(m)} = P(X_{n+m} = x_j \mid X_n = x_i), \quad m = 1, 2, \ldots
    p_{ij}^{(m+1)} = \sum_k p_{ik}^{(m)} p_{kj}, \quad m = 1, 2, \ldots, \quad \text{with } p_{ik}^{(1)} = p_{ik}

We can further generalize to the Chapman-Kolmogorov identity:
    p_{ij}^{(m+n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n)}, \quad m, n = 1, 2, \ldots
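In matrix form, p_{ij}^{(m)} is the (i, j) entry of P^m, so the Chapman-Kolmogorov identity is simply P^{m+n} = P^m P^n; a quick numpy check:

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.50, 0.50]])

# Chapman-Kolmogorov in matrix form: P^(m+n) == P^m @ P^n.
m, n = 2, 3
lhs = np.linalg.matrix_power(P, m + n)
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
print(np.allclose(lhs, rhs))   # True
```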
11.3 Markov Chains (3/9)

Properties of Markov chains

• Recurrent: p_i = P(ever returning to state i) = 1
• Transient: p_i < 1
• Periodic (with period d): the states can be divided into disjoint subsets S_1, ..., S_d such that
    if i ∈ S_k and p_{ij} > 0, then j ∈ S_{k+1} for k = 1, ..., d-1, and j ∈ S_1 for k = d
• Aperiodic: not periodic (d = 1)
• Accessible: state j is accessible from state i if there is a finite sequence of transitions from i to j
• Communicate: states i and j communicate if they are accessible from each other
• If two states communicate with each other, they belong to the same class. If all the states consist of a single class, the Markov chain is indecomposable or irreducible.
11.3 Markov Chains (4/9)

Figure 11.1: A periodic recurrent Markov chain with d = 3.
11.3 Markov Chains (5/9)

Ergodic Markov chains

Ergodicity: time average = ensemble average, i.e., the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.

v_i(k): proportion of time spent in state i after k returns, where T_i(ℓ) denotes the duration of the ℓth return interval to state i:
    v_i(k) = \frac{k}{\sum_{\ell=1}^{k} T_i(\ell)}
    \lim_{k \to \infty} v_i(k) = \pi_i, \quad i = 1, 2, \ldots, K
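A sketch of this ergodic property: the time-average occupancy of one long simulated run should match the steady-state probabilities computed from the matrix. Here we use the two-state chain of Example 1 below, whose stationary distribution works out to (0.4, 0.6):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.25, 0.75],
              [0.50, 0.50]])

# Time average: fraction of steps the chain spends in each state.
state, counts, n_steps = 0, np.zeros(2), 100_000
for _ in range(n_steps):
    counts[state] += 1
    state = rng.choice(2, p=P[state])
print(counts / n_steps)         # approximately [0.4, 0.6]

# Ensemble average: stationary distribution pi solving pi = pi P,
# i.e., the left eigenvector of P with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print(pi / pi.sum())            # [0.4, 0.6]
```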
11.3 Markov Chains (6/9)

Convergence to Stationary Distributions

Consider an ergodic Markov chain with a stochastic matrix P.
π(n-1): state-transition vector of the chain at time n-1
The state-transition vector at time n is
    \pi(n) = \pi(n-1) P
By iteration we obtain
    \pi(n) = \pi(n-1) P = \pi(n-2) P^2 = \pi(n-3) P^3 = \cdots = \pi(0) P^n
where π(0) is the initial value.

Ergodic theorem:
1. \lim_{n \to \infty} p_{ij}^{(n)} = \pi_j \quad \forall i
2. \pi_j > 0 \quad \forall j
3. \sum_{j=1}^{K} \pi_j = 1
4. \pi_j = \sum_{i=1}^{K} \pi_i p_{ij} \quad \text{for } j = 1, 2, \ldots, K

Equivalently,
    \lim_{n \to \infty} P^n = \begin{pmatrix} \pi_1 & \cdots & \pi_K \\ \vdots & & \vdots \\ \pi_1 & \cdots & \pi_K \end{pmatrix}
11.3 Markov Chains (6/9)

Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states x_1 and x_2 may be identified as up-to-date and behind, respectively.

    P = \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \end{pmatrix}

    \pi(0) = \begin{pmatrix} 1/6 & 5/6 \end{pmatrix}
    \pi(1) = \pi(0) P = \begin{pmatrix} 11/24 & 13/24 \end{pmatrix}

    P^{(2)} = \begin{pmatrix} 0.4375 & 0.5625 \\ 0.3750 & 0.6250 \end{pmatrix}
    P^{(3)} = \begin{pmatrix} 0.4001 & 0.5999 \\ 0.3999 & 0.6001 \end{pmatrix}
    P^{(4)} = \begin{pmatrix} 0.4000 & 0.6000 \\ 0.4000 & 0.6000 \end{pmatrix}

The rows of P^{(n)} converge to the stationary distribution (0.4, 0.6).
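The example can be reproduced with a few lines of numpy (exact powers may differ in the last digits from the rounded values above):

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.50, 0.50]])
pi0 = np.array([1/6, 5/6])

print(pi0 @ P)                            # [11/24, 13/24] ~ [0.4583, 0.5417]
for n in [2, 4, 8, 16]:
    print(n, np.linalg.matrix_power(P, n))
# The rows of P^n converge to the stationary distribution (0.4, 0.6),
# as the ergodic theorem predicts.
```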
11.3 Markov Chains (7/9)

Figure 11.3: State-transition diagram of the Markov chain for Example 2.

    P = \begin{pmatrix} 0 & 0 & 1 \\ 1/3 & 1/6 & 1/2 \\ 3/4 & 1/4 & 0 \end{pmatrix}

Solving \pi_j = \sum_{i=1}^{K} \pi_i p_{ij}:
    \pi_1 = \tfrac{1}{3}\pi_2 + \tfrac{3}{4}\pi_3
    \pi_2 = \tfrac{1}{6}\pi_2 + \tfrac{1}{4}\pi_3
    \pi_3 = \pi_1 + \tfrac{1}{2}\pi_2

together with \sum_j \pi_j = 1 gives
    \pi_1 = 0.3953, \quad \pi_2 = 0.1395, \quad \pi_3 = 0.4652
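A sketch of solving the balance equations numerically: the K equations π = πP are linearly dependent, so we replace one of them with the normalization constraint Σ_j π_j = 1:

```python
import numpy as np

P = np.array([[0,    0,    1  ],
              [1/3,  1/6,  1/2],
              [3/4,  1/4,  0  ]])

# Stationary distribution: solve (P^T - I) pi = 0 with sum(pi) = 1,
# replacing the last (redundant) balance equation by the normalization.
K = P.shape[0]
A = np.vstack([(P.T - np.eye(K))[:-1], np.ones(K)])
b = np.zeros(K); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi)   # approximately [0.395, 0.140, 0.465]
```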
11.3 Markov Chains (8/9)

Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.
11.3 Markov Chains (9/9)

Principle of detailed balance
At thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition:
    \pi_i p_{ij} = \pi_j p_{ji}

Application: detailed balance implies that π is a stationary distribution:
    \sum_{i=1}^{K} \pi_i p_{ij} = \sum_{i=1}^{K} \pi_j p_{ji}   (detailed balance: \pi_i p_{ij} = \pi_j p_{ji})
                                = \pi_j \sum_{i=1}^{K} p_{ji}
                                = \pi_j                          (since \sum_{i=1}^{K} p_{ji} = 1)
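A minimal check on a hand-picked reversible chain (a three-state birth-death chain; the matrix and its stationary distribution are illustrative, not from the text):

```python
import numpy as np

# A birth-death chain (transitions only between neighbors) is reversible.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

# Detailed balance: pi_i p_ij == pi_j p_ji for every pair (i, j),
# i.e., the matrix of probability flows is symmetric.
flows = pi[:, None] * P
print(np.allclose(flows, flows.T))   # True

# Detailed balance implies stationarity: pi P == pi.
print(np.allclose(pi @ P, pi))       # True
```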
11.4 Metropolis Algorithm (1/3)

Metropolis Algorithm
A stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium: a modified Monte Carlo method and a Markov chain Monte Carlo (MCMC) method.

Algorithm (Metropolis):
1. X_n = x_i. Randomly generate a new state x_j.
2. Compute \Delta E = E(x_j) - E(x_i).
3. If \Delta E < 0, then X_{n+1} = x_j;
   else (\Delta E \geq 0) {
       Select a random number \xi \sim U[0, 1].
       If \xi < \exp(-\Delta E / T), then X_{n+1} = x_j (accept);
       else X_{n+1} = x_i (reject).
   }
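A minimal Python sketch of the algorithm above; the double-well energy function and the random-walk proposal are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(energy, proposal, x0, T, n_steps):
    """Metropolis sampling of the Gibbs distribution exp(-E(x)/T)/Z."""
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = proposal(x)                     # candidate from a symmetric proposal
        dE = energy(x_new) - energy(x)
        # Accept downhill moves always; uphill moves with probability exp(-dE/T).
        if dE < 0 or rng.random() < np.exp(-dE / T):
            x = x_new
        samples.append(x)
    return samples

# Example: a double-well energy on the integers (an arbitrary illustrative choice).
energy = lambda x: 0.01 * (x * x - 25) ** 2     # minima at x = -5 and x = +5
proposal = lambda x: x + rng.choice([-1, 1])    # symmetric random-walk proposal

samples = metropolis(energy, proposal, x0=0, T=1.0, n_steps=50_000)
vals, counts = np.unique(samples, return_counts=True)
print(dict(zip(vals.tolist(), (counts / len(samples)).round(3))))
# The empirical frequencies approximate the Gibbs distribution,
# concentrating near the low-energy states x = -5 and x = +5.
```

Because the proposal is symmetric, the acceptance rule makes the chain satisfy detailed balance with respect to the Gibbs distribution, tying this slide back to the principle on the previous one.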