

  1. Chapter 11. Stochastic Methods Rooted in Statistical Mechanics
     Neural Networks and Learning Machines (Haykin)
     Lecture Notes on Self-learning Neural Algorithms
     Byoung-Tak Zhang
     School of Computer Science and Engineering, Seoul National University
     Version: 20170926 → 20170928 → 20171011

  2. Contents
     11.1  Introduction ................................. 3
     11.2  Statistical Mechanics ........................ 4
     11.3  Markov Chains ................................ 6
     11.4  Metropolis Algorithm ......................... 16
     11.5  Simulated Annealing .......................... 19
     11.6  Gibbs Sampling ............................... 22
     11.7  Boltzmann Machine ............................ 24
     11.8  Logistic Belief Nets ......................... 29
     11.9  Deep Belief Nets ............................. 30
     11.10 Deterministic Annealing (DA) ................. 34
     11.11 Analogy of DA with EM ........................ 39
     Summary and Discussion ............................. 41

  3. 11.1 Introduction

     Statistical mechanics as a source of ideas for unsupervised (self-organized) learning systems.

     Statistical mechanics:
     - The formal study of macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
     - The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
     - The concept of entropy plays a vital role in statistical mechanics, as it does in Shannon's information theory.
     - The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.

     Statistical mechanics for the study of neural networks:
     - Cragg and Temperley (1954) and Cowan (1968)
     - Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)

  4. 11.2 Statistical Mechanics (1/2)

     p_i: probability of occurrence of state i of a stochastic system,
          p_i ≥ 0 (for all i)  and  Σ_i p_i = 1
     E_i: energy of the system when it is in state i

     In thermal equilibrium, the probability of state i is given by the canonical (Gibbs) distribution:

          p_i = (1/Z) exp(-E_i / k_B T)
          Z   = Σ_i exp(-E_i / k_B T)

     exp(-E / k_B T): Boltzmann factor
     Z: sum over states (partition function)

     1. States of low energy have a higher probability of occurrence than states of high energy.
     2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.

     We set k_B = 1 and view -log p_i as "energy".
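     A minimal Python sketch (not from the slides; the energy values are illustrative) of the Gibbs distribution with k_B = 1, showing observation 2 above: as T drops, probability mass concentrates on the low-energy states.

         # Gibbs distribution over a toy set of energy levels (k_B = 1).
         import numpy as np

         def gibbs(energies, T):
             """Return p_i = exp(-E_i/T) / Z for temperature T."""
             w = np.exp(-np.asarray(energies) / T)   # Boltzmann factors
             return w / w.sum()                      # divide by Z = sum of factors

         E = [0.0, 1.0, 2.0, 4.0]                    # hypothetical energy levels
         for T in (10.0, 1.0, 0.1):
             print(T, gibbs(E, T))                   # mass concentrates on E = 0 as T -> 0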

  5. 11.2 Statistical Mechanics (2/2)

     Helmholtz free energy:  F = -T log Z
     Average energy:         <E> = Σ_i p_i E_i
     Entropy:                H = -Σ_i p_i log p_i

     Then <E> - F = -T Σ_i p_i log p_i = T H, and thus we have

          F = <E> - T H

     The free energy of the system, F, tends to decrease and become a minimum in an equilibrium situation; the resulting probability distribution is defined by the Gibbs distribution (the Principle of Minimum Free Energy). Nature likes to find a physical system with minimum free energy.

     Consider two systems A and A' in thermal contact, with entropy changes ΔH and ΔH'. The total entropy tends to increase, with

          ΔH + ΔH' ≥ 0
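     A quick numerical check (illustrative, not from the slides; it reuses the toy energies above) that the Gibbs distribution indeed satisfies F = <E> - T H with F = -T log Z and k_B = 1:

         import numpy as np

         E = np.array([0.0, 1.0, 2.0, 4.0])    # same hypothetical energies as above
         T = 0.7                               # arbitrary temperature
         Z = np.exp(-E / T).sum()              # partition function
         p = np.exp(-E / T) / Z                # Gibbs probabilities
         F = -T * np.log(Z)                    # Helmholtz free energy
         avg_E = (p * E).sum()                 # <E>
         H = -(p * np.log(p)).sum()            # entropy
         print(F, avg_E - T * H)               # the two numbers agree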

  6. 11.3 Markov Chains (1/9)

     Markov property:

          P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n)

     Transition probability from state i at time n to state j at time n+1:

          p_ij = P(X_{n+1} = j | X_n = i),   p_ij ≥ 0 for all (i, j)  and  Σ_j p_ij = 1 for all i

     If the transition probabilities are fixed, the Markov chain is homogeneous. In the case of a system with a finite number of possible states K, the transition probabilities constitute a K-by-K matrix (stochastic matrix):

              | p_11 ... p_1K |
          P = |  :         :  |
              | p_K1 ... p_KK |
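     A sketch (not from the slides) of representing a homogeneous chain by its stochastic matrix and sampling a trajectory; the matrix happens to be the one used in Example 1 (Figure 11.2) later in this section.

         import numpy as np

         P = np.array([[0.25, 0.75],
                       [0.50, 0.50]])                 # rows sum to 1
         assert np.allclose(P.sum(axis=1), 1.0)       # stochastic-matrix check

         rng = np.random.default_rng(0)
         state, path = 0, [0]
         for _ in range(10):
             state = rng.choice(len(P), p=P[state])   # next state ~ row of P
             path.append(state)
         print(path)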

  7. 11.3 Markov Chains (2/9)

     Generalization to the m-step transition probability:

          p_ij^(m) = P(X_{n+m} = x_j | X_n = x_i),   m = 1, 2, ...

          p_ij^(m+1) = Σ_k p_ik^(m) p_kj,   m = 1, 2, ...,   with p_ik^(1) = p_ik

     We can further generalize to the Chapman-Kolmogorov identity:

          p_ij^(m+n) = Σ_k p_ik^(m) p_kj^(n),   m, n = 1, 2, ...
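     In matrix form the Chapman-Kolmogorov identity reads P^(m+n) = P^m P^n, which is easy to verify numerically (an illustration, not from the slides):

         import numpy as np

         P = np.array([[0.25, 0.75],
                       [0.50, 0.50]])
         m, n = 2, 3
         lhs = np.linalg.matrix_power(P, m + n)
         rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
         print(np.allclose(lhs, rhs))   # True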

  8. 11.3 Markov Chains (3/9): Properties of Markov Chains

     Recurrent: p_i = P(ever returning to state i) = 1
     Transient: p_i < 1
     Periodic (with period d): the states can be divided into d disjoint subsets S_1, ..., S_d such that
          if i ∈ S_k and p_ij > 0, then j ∈ S_{k+1} for k = 1, ..., d-1, and j ∈ S_1 for k = d
     Aperiodic: not periodic (d = 1)

     Accessible: state j is accessible from state i if there is a finite sequence of transitions from i to j.
     Communicate: states i and j communicate if they are accessible to each other.
     If two states communicate with each other, they belong to the same class. If all the states constitute a single class, the Markov chain is indecomposable or irreducible.
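     A sketch (not from the slides) of finding the communicating classes of a finite chain from its transition matrix, via transitive closure of the transition graph; it uses the matrix of Example 2 (Figure 11.3), which turns out to be irreducible.

         import numpy as np

         def communicating_classes(P):
             K = len(P)
             reach = (np.asarray(P) > 0) | np.eye(K, dtype=bool)
             for k in range(K):                       # Warshall transitive closure
                 reach |= np.outer(reach[:, k], reach[k, :])
             mutual = reach & reach.T                 # i and j accessible to each other
             classes = {tuple(np.flatnonzero(mutual[i])) for i in range(K)}
             return [list(c) for c in classes]

         P = np.array([[0.0, 0.0, 1.0],
                       [1/3, 1/6, 1/2],
                       [3/4, 1/4, 0.0]])              # Example 2's matrix
         print(communicating_classes(P))              # [[0, 1, 2]]: one class, irreducible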

  9. 11.3 Markov Chains (4/9)

     Figure 11.1: A periodic recurrent Markov chain with d = 3.

  10. 11.3 Markov Chains (5/9): Ergodic Markov Chains

      Ergodicity: time average = ensemble average; i.e., the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.

      v_i(k), the proportion of time spent in state i after k returns, is

           v_i(k) = k / Σ_{ℓ=1}^{k} T_i(ℓ)

      where T_i(ℓ) denotes the duration of the ℓth return interval, and

           lim_{k→∞} v_i(k) = π_i,   i = 1, 2, ..., K
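      A simulation sketch (not from the slides) of ergodicity: the empirical proportion of time spent in each state approaches the steady-state probabilities. It uses Example 1's matrix, whose stationary vector is π = (0.4, 0.6).

          import numpy as np

          P = np.array([[0.25, 0.75],
                        [0.50, 0.50]])
          rng = np.random.default_rng(0)
          counts = np.zeros(2)
          state = 0
          for _ in range(100_000):
              state = rng.choice(2, p=P[state])
              counts[state] += 1
          print(counts / counts.sum())   # close to [0.4, 0.6]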

  11. 11.3 Markov Chains (6/9): Convergence to Stationary Distributions

      Consider an ergodic Markov chain with stochastic matrix P, and let π(n-1) be the state-transition vector of the chain at time n-1. The state-transition vector at time n is

           π(n) = π(n-1) P

      By iteration we obtain

           π(n) = π(n-1)P = π(n-2)P^2 = π(n-3)P^3 = ... = π(0)P^n

      where π(0) is the initial value.

      Ergodic theorem:
      1. lim_{n→∞} p_ij^(n) = π_j for all i
      2. π_j > 0 for all j
      3. Σ_{j=1}^{K} π_j = 1
      4. π_j = Σ_{i=1}^{K} π_i p_ij for j = 1, 2, ..., K

           lim_{n→∞} P^n = | π_1 ... π_K |
                           |  :        :  |     (every row equals π)
                           | π_1 ... π_K |
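      A sketch (not from the slides) of the ergodic theorem's limit: P^n converges to a matrix whose rows all equal the stationary vector π.

          import numpy as np

          P = np.array([[0.25, 0.75],
                        [0.50, 0.50]])
          for n in (1, 2, 4, 8, 16):
              print(n, np.linalg.matrix_power(P, n).round(4))
          # every row approaches pi = (0.4, 0.6)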

  12. 11.3 Markov Chains (6/9): Example 1

      Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states x_1 and x_2 may be identified as up-to-date and behind, respectively.

           P = | 1/4  3/4 |
               | 1/2  1/2 |

      Starting from π(0) = [1/6  5/6]:

           π(1) = π(0) P = [1/6·1/4 + 5/6·1/2   1/6·3/4 + 5/6·1/2] = [11/24  13/24]

      Repeated multiplication of P by itself converges to a matrix with identical rows:

           P(2) = | 0.4375  0.5625 |    P(3) = | 0.4001  0.5999 |    P(4) = | 0.4000  0.6000 |
                  | 0.3750  0.6250 |           | 0.3999  0.6001 |           | 0.4000  0.6000 |
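      Reproducing Example 1's numbers (P and π(0) are taken from the slide; the power 16 is just a convenient large exponent to show the limit):

          import numpy as np

          P = np.array([[0.25, 0.75],
                        [0.50, 0.50]])
          pi0 = np.array([1/6, 5/6])
          print(pi0 @ P)                                # [11/24, 13/24] ~ [0.4583, 0.5417]
          print(np.linalg.matrix_power(P, 2))           # matches P(2) above
          print(np.linalg.matrix_power(P, 16).round(4)) # rows -> (0.4, 0.6)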

  13. 11.3 Markov Chains (7/9): Example 2

      Figure 11.3: State-transition diagram of the Markov chain for Example 2.

           P = |  0    0    1  |
               | 1/3  1/6  1/2 |
               | 3/4  1/4   0  |

      Solving π_j = Σ_{i=1}^{K} π_i p_ij:

           π_1 = (1/3)π_2 + (3/4)π_3
           π_2 = (1/6)π_2 + (1/4)π_3
           π_3 = π_1 + (1/2)π_2

      together with the normalization Σ_j π_j = 1 gives

           π_1 = 0.3953,   π_2 = 0.1395,   π_3 = 0.4652
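      The same balance equations can be solved numerically (P is from the slide): π is the left eigenvector of P with eigenvalue 1, normalized to sum to 1.

          import numpy as np

          P = np.array([[0.0, 0.0, 1.0],
                        [1/3, 1/6, 1/2],
                        [3/4, 1/4, 0.0]])
          w, v = np.linalg.eig(P.T)                 # left eigenvectors of P
          pi = np.real(v[:, np.argmin(np.abs(w - 1))])
          pi /= pi.sum()                            # normalize (fixes sign too)
          print(pi.round(4))                        # [0.3953, 0.1395, 0.4652]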

  14. 11.3 Markov Chains (8/9)

      Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.

  15. 11.3 Markov Chains (9/9): Principle of Detailed Balance

      At thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition:

           π_i p_ij = π_j p_ji

      Application: detailed balance implies that π is a stationary distribution:

           Σ_{i=1}^{K} π_i p_ij = Σ_{i=1}^{K} π_j p_ji     (detailed balance)
                                = π_j Σ_{i=1}^{K} p_ji
                                = π_j                      (since Σ_{i=1}^{K} p_ji = 1)
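      A numerical illustration (not from the slides; the 3-state chain is a made-up example constructed to satisfy detailed balance): pick symmetric "flows" F_ij = F_ji and set p_ij = F_ij / π_i, so that π_i p_ij = F_ij = π_j p_ji; stationarity of π then follows.

          import numpy as np

          pi = np.array([0.2, 0.3, 0.5])
          F = np.array([[0.0, 0.1, 0.1],
                        [0.1, 0.0, 0.2],
                        [0.1, 0.2, 0.0]])             # symmetric flows F_ij = F_ji
          P = F / pi[:, None]                          # p_ij = F_ij / pi_i
          P[np.diag_indices(3)] = 1 - P.sum(axis=1)    # self-loops fill each row to 1

          assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)  # detailed balance
          print(pi @ P)                                # [0.2, 0.3, 0.5]: pi is stationary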

  16. 11.4 Metropolis Algorithm (1/3)

      The Metropolis algorithm is a stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium. It is a modified Monte Carlo method, and a Markov chain Monte Carlo (MCMC) method.

      Algorithm (Metropolis):
      1. Given X_n = x_i, randomly generate a new state x_j.
      2. Compute ΔE = E(x_j) - E(x_i).
      3. If ΔE < 0, then X_{n+1} = x_j (accept).
         Else (ΔE ≥ 0):
             Select a random number ξ ~ U[0, 1].
             If ξ < exp(-ΔE/T), then X_{n+1} = x_j (accept);
             else X_{n+1} = x_i (reject).
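      A minimal runnable sketch of the algorithm as stated above, applied to a toy discrete energy landscape (the energies and the uniform proposal scheme are illustrative choices, not from the slides); the empirical state frequencies approach the Gibbs distribution.

          import math
          import random

          def metropolis(energy, propose, x0, T, n_steps, rng=random.Random(0)):
              """Sample states with probability ~ exp(-E(x)/T), k_B = 1."""
              x = x0
              samples = []
              for _ in range(n_steps):
                  x_new = propose(x, rng)
                  dE = energy(x_new) - energy(x)
                  # Accept downhill moves always; uphill with prob exp(-dE/T)
                  if dE < 0 or rng.random() < math.exp(-dE / T):
                      x = x_new                      # accept
                  samples.append(x)                  # on reject, x is unchanged
              return samples

          # Toy example: states 0..3 with energies E = [0, 1, 2, 4]
          E = [0.0, 1.0, 2.0, 4.0]
          prop = lambda x, rng: rng.choice([s for s in range(4) if s != x])
          s = metropolis(lambda x: E[x], prop, x0=3, T=1.0, n_steps=50_000)
          print([s.count(k) / len(s) for k in range(4)])  # ~ Gibbs probabilities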
