Simulated Annealing

input:  (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {−1, +1};  T_start, T_stop ∈ R
output: w

begin
    randomly initialize w
    T ← T_start
    repeat
        ŵ ← N(w)    // pick a neighbor of w, e.g. by adding Gaussian noise N(0, σ)
        if E(ŵ) < E(w) then
            w ← ŵ
        else if exp( −(E(ŵ) − E(w)) / T ) > rand[0, 1) then
            w ← ŵ
        decrease(T)
    until T < T_stop
    return w
end
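As a concrete illustration, here is a minimal Python sketch of this acceptance rule; the quadratic example energy, the Gaussian-noise neighbor proposal N(0, σ), and the geometric cooling schedule are illustrative assumptions rather than anything prescribed by the slide.

    import numpy as np

    def simulated_annealing(energy, w0, t_start=10.0, t_stop=1e-3,
                            cooling=0.95, sigma=0.1, rng=None):
        """Minimize `energy` with the Metropolis-style acceptance rule of the slide."""
        rng = np.random.default_rng() if rng is None else rng
        w = np.asarray(w0, dtype=float)
        T = t_start
        while T >= t_stop:
            # neighbor of w: add Gaussian noise N(0, sigma)
            w_hat = w + rng.normal(0.0, sigma, size=w.shape)
            dE = energy(w_hat) - energy(w)
            # always accept improvements; accept worse moves with probability exp(-dE / T)
            if dE < 0 or np.exp(-dE / T) > rng.random():
                w = w_hat
            T *= cooling          # decrease(T): geometric cooling, one common choice
        return w

    # usage: anneal a simple quadratic energy towards its minimum at w = (3, 3)
    w_opt = simulated_annealing(lambda w: float(np.sum((w - 3.0) ** 2)), w0=np.zeros(2))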
Continuous Hopfield Network

Let us consider our previously defined Hopfield network (identical architecture and learning rule), however with the following activity rule:

    S_i = \tanh\left( \frac{1}{T} \sum_j w_{ij} S_j \right)

Start with a large (temperature) value of T and decrease it by some magnitude whenever a unit is updated (deterministic simulated annealing).

This type of Hopfield network can approximate the probability distribution

    P(x | W) = \frac{1}{Z(W)} \exp[-E(x)] = \frac{1}{Z(W)} \exp\left( \tfrac{1}{2} x^T W x \right)
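A small Python sketch of this deterministic annealing, assuming asynchronous single-unit updates in random order and a geometric temperature decay after every update (decay factor and update order are illustrative choices, not fixed by the slide):

    import numpy as np

    def anneal_continuous_hopfield(W, s0, t_start=5.0, t_stop=0.05, decay=0.999, rng=None):
        """Iterate the activity rule S_i = tanh((1/T) * sum_j w_ij S_j) while lowering T."""
        rng = np.random.default_rng() if rng is None else rng
        s = np.array(s0, dtype=float)
        T = t_start
        while T > t_stop:
            i = rng.integers(len(s))          # pick one unit at random
            s[i] = np.tanh(W[i] @ s / T)      # the activity rule from the slide
            T *= decay                        # decrease T whenever a unit is updated
        return s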
Continuous Hopfield Network

    Z(W) = \sum_{x'} \exp(-E(x'))      (sum over all possible states)

is the partition function and ensures that P(x | W) is a probability distribution.

Idea: construct a stochastic Hopfield network that implements the probability distribution P(x | W).

• Learn a model that is capable of generating patterns from that unknown distribution.
• Quantify (classify) seen and unseen patterns by means of their probabilities.
• If needed, we can generate more patterns (generative model).
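For a handful of ±1 units, Z(W) can be evaluated by brute-force enumeration, which is a convenient way to check that P(x | W) really sums to one; this exhaustive sketch is my own illustration and is only feasible for small d:

    import itertools
    import numpy as np

    def partition_function(W):
        """Z(W) = sum over all x in {-1,+1}^d of exp(-E(x)), with E(x) = -0.5 * x^T W x."""
        d = W.shape[0]
        return sum(np.exp(0.5 * np.array(x) @ W @ np.array(x))
                   for x in itertools.product([-1.0, 1.0], repeat=d))

    def p_x_given_w(x, W):
        """P(x | W) = exp(0.5 * x^T W x) / Z(W)."""
        x = np.asarray(x, dtype=float)
        return np.exp(0.5 * x @ W @ x) / partition_function(W)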
Boltzmann Machines

Given patterns \{x^{(n)}\}_1^N, we want to learn the weights such that the generative model

    P(x | W) = \frac{1}{Z(W)} \exp\left( \tfrac{1}{2} x^T W x \right)

is well matched to those patterns.

The states are updated according to the stochastic rule:

• set x_i = +1 with probability 1 / (1 + \exp(-2 \sum_j w_{ij} x_j))
• else set x_i = -1.

Posterior probability of the weights given the data (Bayes' theorem):

    P(W | \{x^{(n)}\}_1^N) = \frac{ \left[ \prod_{n=1}^N P(x^{(n)} | W) \right] P(W) }{ P(\{x^{(n)}\}_1^N) }
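The stochastic update rule above is ordinary Gibbs sampling of the ±1 states; a minimal sketch, assuming no self-connections (w_ii = 0) so that the inner product W[i] @ x equals the sum over j ≠ i:

    import numpy as np

    def gibbs_step(W, x, rng):
        """One sweep of the stochastic rule: resample every unit given all the others."""
        for i in range(len(x)):
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (W[i] @ x)))   # P(x_i = +1 | rest), w_ii = 0 assumed
            x[i] = 1.0 if rng.random() < p_plus else -1.0
        return x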
Boltzmann Machines

Apply the maximum likelihood method to the first term in the numerator:

    \ln \prod_{n=1}^N P(x^{(n)} | W) = \sum_{n=1}^N \left[ \tfrac{1}{2} x^{(n)T} W x^{(n)} - \ln Z(W) \right]

Taking the derivative of the log likelihood gives (note that W is symmetric, i.e. w_{ij} = w_{ji}):

    \frac{\partial}{\partial w_{ij}} \tfrac{1}{2} x^{(n)T} W x^{(n)} = x_i^{(n)} x_j^{(n)}

and

    \frac{\partial}{\partial w_{ij}} \ln Z(W)
        = \frac{1}{Z(W)} \sum_x \frac{\partial}{\partial w_{ij}} \exp\left( \tfrac{1}{2} x^T W x \right)
        = \frac{1}{Z(W)} \sum_x \exp\left( \tfrac{1}{2} x^T W x \right) x_i x_j
        = \sum_x x_i x_j \, P(x | W)
        = \langle x_i x_j \rangle_{P(x|W)}
Boltzmann Machines (cont.)

    \frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_1^N | W)
        = \sum_{n=1}^N \left[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(x|W)} \right]
        = N \left[ \langle x_i x_j \rangle_{\text{Data}} - \langle x_i x_j \rangle_{P(x|W)} \right]

Empirical correlation between x_i and x_j:

    \langle x_i x_j \rangle_{\text{Data}} \equiv \frac{1}{N} \sum_{n=1}^N x_i^{(n)} x_j^{(n)}

Correlation between x_i and x_j under the current model:

    \langle x_i x_j \rangle_{P(x|W)} \equiv \sum_x x_i x_j \, P(x | W)
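Both correlation terms translate directly into code; in the sketch below the model term is estimated with Gibbs sweeps (the same stochastic rule as before, again assuming w_ii = 0), and the sweep count and burn-in are arbitrary illustrative settings:

    import numpy as np

    def data_correlations(X):
        """<x_i x_j>_Data = (1/N) sum_n x_i^(n) x_j^(n); X has shape (N, d) with +/-1 entries."""
        return X.T @ X / X.shape[0]

    def model_correlations(W, n_sweeps=2000, burn_in=500, rng=None):
        """Monte Carlo estimate of <x_i x_j>_P(x|W) from Gibbs sweeps."""
        rng = np.random.default_rng() if rng is None else rng
        d = W.shape[0]
        x = rng.choice([-1.0, 1.0], size=d)
        corr = np.zeros((d, d))
        for sweep in range(n_sweeps):
            for i in range(d):
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * (W[i] @ x)))
                x[i] = 1.0 if rng.random() < p_plus else -1.0
            if sweep >= burn_in:
                corr += np.outer(x, x)
        return corr / (n_sweeps - burn_in)

    def log_likelihood_gradient(W, X):
        """N * ( <x_i x_j>_Data - <x_i x_j>_P(x|W) ), as derived on this slide."""
        return X.shape[0] * (data_correlations(X) - model_correlations(W))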
Interpretation of Boltzmann Machine Learning

Illustrative description (MacKay's book, p. 523):

• Awake state: measure the correlation between x_i and x_j in the real world, and increase the weights in proportion to the measured correlations.
• Sleep state: dream about the world using the generative model P(x | W), measure the correlation between x_i and x_j in that model world, and use these correlations to determine a proportional decrease in the weights.

If the correlations in the dream world and the real world match, the two terms are balanced and the weights do not change.
Boltzmann Machines with Hidden Units

To model higher-order correlations, hidden units are required.

• x: states of the visible units,
• h: states of the hidden units,
• the generic state of a unit (visible or hidden) is denoted y_i, with y ≡ (x, h),
• the state of the network when the visible units are clamped in state x^{(n)} is y^{(n)} ≡ (x^{(n)}, h).

The likelihood of W given a single pattern x^{(n)} is

    P(x^{(n)} | W) = \sum_h P(x^{(n)}, h | W) = \sum_h \frac{1}{Z(W)} \exp\left( \tfrac{1}{2} y^{(n)T} W y^{(n)} \right)

where

    Z(W) = \sum_{x,h} \exp\left( \tfrac{1}{2} y^T W y \right)
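For small networks the marginal P(x^(n) | W) can again be computed exactly by summing over all hidden configurations; in the brute-force sketch below, listing the visible units first in the joint weight matrix is an assumption of the illustration:

    import itertools
    import numpy as np

    def marginal_p_visible(x_vis, W):
        """P(x | W) = sum_h exp(0.5 * y^T W y) / Z(W) with y = (x, h), by enumeration.

        W is the full (d_vis + d_hid) x (d_vis + d_hid) symmetric weight matrix,
        with the visible units listed first.
        """
        d_total = W.shape[0]
        d_hid = d_total - len(x_vis)

        def unnorm(y):
            y = np.asarray(y, dtype=float)
            return np.exp(0.5 * y @ W @ y)

        numerator = sum(unnorm(tuple(x_vis) + h)
                        for h in itertools.product([-1.0, 1.0], repeat=d_hid))
        Z = sum(unnorm(y) for y in itertools.product([-1.0, 1.0], repeat=d_total))
        return numerator / Z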
Boltzmann Machines with Hidden Units (cont.)

Applying the maximum likelihood method as before, one obtains

    \frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_1^N | W)
        = \sum_n \Big[ \underbrace{\langle y_i y_j \rangle_{P(h|x^{(n)},W)}}_{\text{clamped to } x^{(n)}} - \underbrace{\langle y_i y_j \rangle_{P(h|x,W)}}_{\text{free}} \Big]

The term \langle y_i y_j \rangle_{P(h|x^{(n)},W)} is the correlation between y_i and y_j when the Boltzmann machine is simulated with the visible variables clamped to x^{(n)} and the hidden variables freely sampling from their conditional distribution.

The term \langle y_i y_j \rangle_{P(h|x,W)} is the correlation between y_i and y_j when the Boltzmann machine generates samples from its model distribution.
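Both terms can be estimated with a single Gibbs sampler by simply never resampling the clamped units; the sketch below is an illustration of that idea (the boolean clamp mask, the sweep count and the burn-in are my own assumptions):

    import numpy as np

    def estimate_correlations(W, y0, clamp_mask, n_sweeps=2000, burn_in=500, rng=None):
        """Estimate <y_i y_j>; units with clamp_mask[i] = True keep their value from y0."""
        rng = np.random.default_rng() if rng is None else rng
        y = np.array(y0, dtype=float)
        clamp_mask = np.asarray(clamp_mask, dtype=bool)
        corr = np.zeros((len(y), len(y)))
        for sweep in range(n_sweeps):
            for i in np.flatnonzero(~clamp_mask):                  # only free units are resampled
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * (W[i] @ y)))   # w_ii = 0 assumed
                y[i] = 1.0 if rng.random() < p_plus else -1.0
            if sweep >= burn_in:
                corr += np.outer(y, y)
        return corr / (n_sweeps - burn_in)

    # clamped phase: y0 carries x^(n) on the visible units, clamp_mask is True there
    # free phase:    clamp_mask = np.zeros(len(y0), dtype=bool), so every unit is resampled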
Boltzmann Machines with Input-Hidden-Output

The Boltzmann machine considered so far is a powerful stochastic Hopfield network, but it has no ability to perform classification. Let us introduce visible input and output units:

• x ≡ (x_i, x_o)

Note that a pattern x^{(n)} consists of an input part and an output part, that is, x^{(n)} ≡ (x_i^{(n)}, x_o^{(n)}).

    \sum_n \Big[ \underbrace{\langle y_i y_j \rangle_{P(h|x^{(n)},W)}}_{\text{clamped to } (x_i^{(n)},\, x_o^{(n)})} - \underbrace{\langle y_i y_j \rangle_{P(h|x,W)}}_{\text{clamped to } x_i^{(n)}} \Big]
Boltzmann Machines: Weight Updates

Combine gradient descent and simulated annealing to update the weights:

    \Delta w_{ij} = \frac{\eta}{T} \Big[ \underbrace{\langle y_i y_j \rangle_{P(h|x^{(n)},W)}}_{\text{clamped to } (x_i^{(n)},\, x_o^{(n)})} - \underbrace{\langle y_i y_j \rangle_{P(h|x,W)}}_{\text{clamped to } x_i^{(n)}} \Big]

High computational complexity:

• present each pattern several times
• anneal several times

Mean-field version of Boltzmann learning:

• calculate approximations [y_i y_j] of the correlations entering the gradient (see the sketch below)
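One way to read the mean-field bullet is the classic deterministic approximation m_i = tanh((1/T) Σ_j w_ij m_j) together with [y_i y_j] ≈ m_i m_j; the sketch below implements that particular scheme, which is my own choice of approximation and is not spelled out on the slide:

    import numpy as np

    def mean_field_correlations(W, T, m0, clamp_mask=None, clamp_values=None, n_iter=200):
        """Approximate [y_i y_j] by m_i * m_j, with m from a mean-field fixed-point iteration."""
        m = np.array(m0, dtype=float)
        if clamp_mask is not None:
            m[clamp_mask] = clamp_values            # clamped units keep their observed +/-1 values
        for _ in range(n_iter):
            m = np.tanh(W @ m / T)                  # deterministic analogue of the stochastic rule
            if clamp_mask is not None:
                m[clamp_mask] = clamp_values
        return np.outer(m, m)                       # factorized estimate of the correlations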
Deterministic Boltzmann Learning

input:  {x^{(n)}}_1^N;  η, T_start, T_stop ∈ R
output: W

begin
    T ← T_start
    repeat
        randomly select a pattern from the sample {x^{(n)}}_1^N
        randomize states
        anneal the network with input and output clamped
        at the final, low T, calculate [y_i y_j]_{x_i, x_o clamped}
        randomize states
        anneal the network with input clamped but output free
        at the final, low T, calculate [y_i y_j]_{x_i clamped}
        w_ij ← w_ij + (η / T) · ( [y_i y_j]_{x_i, x_o clamped} − [y_i y_j]_{x_i clamped} )
    until T < T_stop
    return W
end