On the Fine-Tuning Parameters in Deep Boltzmann Machines Using Quaternions
João Paulo Papa
papa@fc.unesp.br
March 28, 2016
Talk Outline

1. Restricted Boltzmann Machines
2. Harmony Search
3. Quaternions
4. Methodology and Experiments
5. Conclusions and Future Works
Restricted Boltzmann Machines
Main concepts

RBMs are probabilistic models composed of two layers: a visible layer v ∈ {0, 1}^m (input) and a hidden layer h ∈ {0, 1}^n, which are connected by a weight matrix W_{m×n}. Additionally, we have bias units a and b attached to the visible and hidden layers, respectively.

[Figure: bipartite graph with visible units v_1, ..., v_m (biases a_1, ..., a_m), hidden units h_1, ..., h_n (biases b_1, ..., b_n), and connection weights w_{ij}.]
Restricted Boltzmann Machines
Main concepts

The energy of an RBM is given by:

E(v, h) = -\sum_{i=1}^{m}\sum_{j=1}^{n} v_i h_j w_{ij} - \sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j,   (1)

the probability of a given configuration (v, h) being computed as follows:

P(v, h) = \frac{e^{-E(v,h)}}{Z},   (2)

where Z is the so-called normalizing constant/partition function. Such a value is given by:

Z = \sum_{v,h} e^{-E(v,h)}.   (3)
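As a toy sketch (not from the talk; the array shapes and helper names are my own), Equation (1) can be evaluated directly for a small RBM:

```python
import numpy as np

# Toy RBM following the slides' notation: v in {0,1}^m, h in {0,1}^n, W is m x n.
rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.normal(scale=0.1, size=(m, n))  # weight matrix W
a = np.zeros(m)                          # visible biases a
b = np.zeros(n)                          # hidden biases b

def energy(v, h):
    """E(v, h) = -v^T W h - a^T v - b^T h, as in Equation (1)."""
    return -v @ W @ h - a @ v - b @ h

v = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 1.0, 1.0])
print(energy(v, h))
```

For a model this small, the partition function Z of Equation (3) could even be computed exactly by enumerating all 2^m · 2^n configurations, which is precisely what becomes intractable at realistic sizes.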
Restricted Boltzmann Machines
Main concepts

The probability of a data point v (visible layer) is defined as follows:

P(v) = \sum_h P(v, h) = \frac{\sum_h e^{-E(v,h)}}{Z}.   (4)

Let V = {v_1, v_2, ..., v_M} be a training set: in short, the RBM training algorithm aims at decreasing the energy of each training sample v_k ∈ ℜ^m in order to increase its probability.

[Figure: energy landscape E over v before and after a training step; the energy at the training sample is lowered.]
Restricted Boltzmann Machines
Main concepts

The training data likelihood (using just one training point for the sake of simplicity) is given by:

\phi = \log P(v) = \phi^+ - \phi^-,   (5)

where

\phi^+ = \log \sum_h e^{-E(v,h)}   (6)

and

\phi^- = \log Z = \log \sum_{v,h} e^{-E(v,h)}.   (7)

Now, the question is: how can we train an RBM?
Restricted Boltzmann Machines
Main concepts

Basically, the training step aims at updating W in order to maximize the log-likelihood of the training data until a certain convergence criterion is met (usually a fixed number of iterations/epochs). Stochastic gradient descent (here, ascent on the log-likelihood) is usually employed for this purpose, i.e.:

W^{t+1} \leftarrow W^t + \eta \left( \frac{\partial \phi^+}{\partial W} - \frac{\partial \phi^-}{\partial W} \right),   (8)

where the positive gradient (easy to compute) is given by:

\frac{\partial \phi^+}{\partial W} = v^T P(h|v).   (9)
Restricted Boltzmann Machines
Main concepts

The right-hand term of Equation (9) can be computed as follows:

P(h|v) = \prod_{j=1}^{n} P(h_j = 1 | v),   (10)

where

P(h_j = 1 | v) = \sigma\left( \sum_{i=1}^{m} w_{ij} v_i + b_j \right).   (11)

In this case, \sigma(x) = 1/(1 + \exp(-x)). However, the main problem concerns the negative gradient, which is given by:

\frac{\partial \phi^-}{\partial W} = \tilde{v}^T P(\tilde{v}|h),   (12)
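Since Equation (10) factorizes, the whole hidden conditional can be computed in one vectorized step. A sketch (helper names are my own):

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x)), as defined in the slides."""
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) = sigma(sum_i w_ij v_i + b_j), vectorized over all j."""
    return sigmoid(v @ W + b)

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 3))
b = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0])
probs = p_h_given_v(v, W, b)
print(probs)  # each entry lies strictly in (0, 1)
```

The positive gradient of Equation (9) is then simply the outer product `np.outer(v, probs)`.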
Restricted Boltzmann Machines
Main concepts

where \tilde{v} denotes the estimate (model) of the input data v, and P(\tilde{v}|h) is given by:

P(v|h) = \prod_{i=1}^{m} P(v_i = 1 | h),   (13)

where

P(v_i = 1 | h) = \sigma\left( \sum_{j=1}^{n} w_{ij} h_j + a_i \right).   (14)

The problem is to obtain a proper approximation of the model, i.e., \frac{\partial \phi^-}{\partial W}, which requires a large number of iterations to be computed.
Restricted Boltzmann Machines
Main concepts

Usually, we can model the task of estimating a conditional probability by means of the Markov Chain Monte Carlo (MCMC) approach, which models each step towards the approximation of the real data as a Markov chain. A Markov chain is basically a directed and weighted graph that obeys some properties (Ergodic Theorem).

[Figure: a small Markov chain over states A, C, and D with transition probabilities 0.1, 0.3, 0.7, and 0.9.]
Restricted Boltzmann Machines
Main concepts

One of the most famous approaches for sampling in Markov chains is the so-called Gibbs sampling, which approaches the likelihood solution when k → ∞, k being the number of iterations.

[Figure: a chain of states alternating the transitions P(B|A) and P(A|B) over k steps.]
Restricted Boltzmann Machines
Main concepts

How can we use Gibbs sampling for RBMs? Let's say we have a Markov chain C = {v, \tilde{v}^1, \tilde{v}^2, ..., \tilde{v}^k} composed of the input data v (initial state) and its reconstruction at time step t given by \tilde{v}^t.

[Figure: alternating Gibbs chain v^0 → h^0 → v^1 → ... → v^k, sampling from P(h|v^0), P(v^1|h^0), and so on, from random data towards the model approximation.]

Problem? High computational burden, since we need k → ∞.
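The alternating chain above can be sketched as block Gibbs sampling between the two layers, using Equations (11) and (14) at each step (a sketch; function and variable names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v0, W, a, b, k, rng):
    """Run k steps of alternating Gibbs sampling v -> h -> v for an RBM."""
    v = v0.copy()
    for _ in range(k):
        ph = sigmoid(v @ W + b)                 # P(h_j = 1 | v), Equation (11)
        h = (rng.random(ph.shape) < ph) * 1.0   # sample binary hidden states
        pv = sigmoid(h @ W.T + a)               # P(v_i = 1 | h), Equation (14)
        v = (rng.random(pv.shape) < pv) * 1.0   # sample binary visible states
    return v

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(4, 3))
a, b = np.zeros(4), np.zeros(3)
v0 = np.array([1.0, 0.0, 1.0, 1.0])
v_model = gibbs_chain(v0, W, a, b, k=5, rng=rng)
print(v_model)  # binary reconstruction after 5 Gibbs steps
```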
Restricted Boltzmann Machines
Main concepts

Hinton (2002) proposed Contrastive Divergence (CD), which alleviates the problem of Gibbs sampling.

[Figure: the same alternating chain, but started at a training sample and truncated after k ≪ ∞ steps.]

Usually, k = 1. Problem? Estimated models tend to stay close to training samples.

Hinton, G. E. "Training products of experts by minimizing contrastive divergence", Neural Computation, 14(8), 1771-1800, 2002.
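A single CD-1 weight update, i.e., Equation (8) with the negative phase approximated by one Gibbs step, might look as follows (a sketch under the slides' notation, not Hinton's reference implementation; names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, eta, rng):
    """One Contrastive Divergence update with k = 1, starting at training data v0."""
    ph0 = sigmoid(v0 @ W + b)                  # positive phase: P(h | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + a)                # one reconstruction step
    v1 = (rng.random(pv1.shape) < pv1) * 1.0
    ph1 = sigmoid(v1 @ W + b)                  # negative phase: P(h | v1)
    # Positive gradient minus negative gradient, cf. Equations (9) and (12):
    return W + eta * (np.outer(v0, ph0) - np.outer(v1, ph1))

rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(4, 3))
a, b = np.zeros(4), np.zeros(3)
v0 = np.array([1.0, 0.0, 1.0, 1.0])
W = cd1_update(v0, W, a, b, eta=0.1, rng=rng)
print(W.shape)  # (4, 3)
```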
Restricted Boltzmann Machines
Main concepts

After that, we have two main variations of CD:

- Persistent Contrastive Divergence (PCD)
- Fast Persistent Contrastive Divergence (FPCD)

Tieleman, T. "Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient", Proceedings of the 25th Annual International Conference on Machine Learning, 1064-1071, 2008.
Tieleman, T., Hinton, G. E. "Using Fast Weights to Improve Persistent Contrastive Divergence", Proceedings of the 26th Annual International Conference on Machine Learning, 1033-1040, 2009.
Deep Belief Networks
Main concepts

Stacked RBMs on top of each other (greedy layer-wise training).

[Figure: stack of layers v, h^1, h^2, ..., h^L connected by weight matrices W^1, W^2, ..., W^L.]
Deep Boltzmann Machines
Main concepts

For intermediate layers, inference depends on both the lower and upper layers; it usually works better than DBNs.

P(h^1_j = 1 | v, h^2) = \sigma\left( \sum_{i=1}^{m} w^1_{ij} v_i + \sum_{z=1}^{n_2} w^2_{jz} h^2_z \right).   (15)
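Equation (15) differs from the RBM case only by the extra term coming from the layer above. A sketch (layer sizes and names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h1_given_v_h2(v, h2, W1, W2):
    """Equation (15): the middle layer h1 conditions on both v (below) and h2 (above)."""
    return sigmoid(v @ W1 + h2 @ W2.T)

rng = np.random.default_rng(4)
m, n1, n2 = 4, 3, 2
W1 = rng.normal(scale=0.1, size=(m, n1))   # weights between v and h1
W2 = rng.normal(scale=0.1, size=(n1, n2))  # weights between h1 and h2
v = np.array([1.0, 0.0, 1.0, 1.0])
h2 = np.array([1.0, 0.0])
print(p_h1_given_v_h2(v, h2, W1, W2))  # n1 probabilities in (0, 1)
```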
Harmony Search
Main concepts

Harmony Search is a meta-heuristic algorithm inspired by the improvisation process of music players. Each possible solution is modelled as a harmony, and each musician corresponds to one decision variable.

Let \varphi = (\varphi_1, \varphi_2, ..., \varphi_N) be a set of harmonies that compose the so-called "Harmony Memory", such that \varphi_i ∈ ℜ^M. At each iteration, the HS algorithm generates a new harmony vector \hat{\varphi} based on memory considerations, pitch adjustments, and randomization (music improvisation). Further, the new harmony vector \hat{\varphi} is evaluated in order to be accepted into the harmony memory: if \hat{\varphi} is better than the worst harmony, the latter is replaced by the new one.
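The improvisation loop above can be sketched as follows. This is a minimal illustration, not the talk's implementation; the HMCR/PAR/bandwidth parameter values are illustrative assumptions:

```python
import numpy as np

def harmony_search(f, lo, hi, M=2, N=10, iters=200, hmcr=0.9, par=0.3, bw=0.05, seed=0):
    """Minimize f over [lo, hi]^M with a basic Harmony Search loop."""
    rng = np.random.default_rng(seed)
    hm = rng.uniform(lo, hi, size=(N, M))           # Harmony Memory: N harmonies
    fit = np.array([f(x) for x in hm])
    for _ in range(iters):
        new = np.empty(M)
        for d in range(M):                          # one musician per decision variable
            if rng.random() < hmcr:                 # memory consideration
                new[d] = hm[rng.integers(N), d]
                if rng.random() < par:              # pitch adjustment
                    new[d] += bw * rng.uniform(-1, 1)
            else:                                   # randomization (improvisation)
                new[d] = rng.uniform(lo, hi)
        new = np.clip(new, lo, hi)
        worst = np.argmax(fit)
        if f(new) < fit[worst]:                     # replace the worst harmony
            hm[worst], fit[worst] = new, f(new)
    return hm[np.argmin(fit)]

best = harmony_search(lambda x: np.sum(x ** 2), lo=-5.0, hi=5.0)
print(best)  # should end up close to the origin for the sphere function
```

In the talk's setting, each decision variable would correspond to one RBM/DBM hyper-parameter being fine-tuned rather than a coordinate of a test function.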