Geometry of Boltzmann Machines
Guido Montúfar
Max Planck Institute for Mathematics in the Sciences, Leipzig
Talk at IGAIA IV, June 17, 2016
On the occasion of Shun-ichi Amari's 80th birthday
• Boltzmann Machines
• Geometric Perspectives
• Universal Approximation (new results)
• Dimension (new results)
Boltzmann Machines

A Boltzmann machine is a network of stochastic units. It defines a set of probability vectors

p_θ(x) = exp( Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j − ψ(θ) ),   x ∈ {0,1}^N,

for all θ ∈ R^d.

[Figure: a fully connected network of units x_1, …, x_8 and a plot of the logistic activation function σ(α).]

[Ackley, Hinton, Sejnowski '85] [Geman & Geman '84]
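To make the definition concrete, here is a minimal sketch (not part of the talk) that enumerates all states of a small fully visible Boltzmann machine and computes p_θ(x) exactly; the function name and the random parameters are illustrative assumptions.

```python
import itertools
import numpy as np

def boltzmann_probs(theta, Theta):
    """Exact distribution of a fully visible Boltzmann machine.

    theta : (N,) biases theta_i
    Theta : (N, N) matrix of pairwise weights theta_ij (only entries with i < j are used)
    Returns a dict mapping each state x in {0,1}^N to p_theta(x).
    """
    N = len(theta)
    states = list(itertools.product([0, 1], repeat=N))
    # Unnormalized log-probabilities: sum_i theta_i x_i + sum_{i<j} theta_ij x_i x_j
    log_w = []
    for x in states:
        x = np.array(x)
        e = theta @ x + sum(Theta[i, j] * x[i] * x[j]
                            for i in range(N) for j in range(i + 1, N))
        log_w.append(e)
    log_w = np.array(log_w)
    psi = np.log(np.exp(log_w).sum())          # psi(theta) = log partition function
    probs = np.exp(log_w - psi)
    return dict(zip(states, probs))

# Example: 3 units with random parameters; the probabilities sum to 1.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
Theta = np.triu(rng.normal(size=(3, 3)), k=1)
p = boltzmann_probs(theta, Theta)
assert abs(sum(p.values()) - 1.0) < 1e-12
```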
Boltzmann Machines

Applications:
• Generative Models
• Modeling Temporal Sequences
• Learning Representations
• Structured Output Prediction
• Learning Modules for Deep Belief Networks
• Recommender Systems
• Classification
• Stochastic Controller [Montufar, Zahedi, Ay '15]

[Figure: a deep belief network with layers X^1, …, X^L of units X^l_1, …, X^l_{n_l}, and a stochastic controller with inputs x_1, …, x_4, hidden units h_1, …, h_k, and outputs y_1, y_2.]
Information Geometric Perspectives — Without hidden units

p_θ(x) = exp( Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j − ψ(θ) )

• The Boltzmann machine defines an e-linear manifold
• MLE is the unique m-projection of the target distribution to this manifold
• The natural gradient learning trajectory is the m-geodesic to the MLE
• Stochastic interpretation of natural parameters

η = ∇ψ(θ),   Δθ = ε G^{-1}(η_Q − η_R)

[Figure: the model manifold B with the target distribution Q, its m-projection, and points P, R with expectation parameters η.]

[Amari, Kurata, Nagaoka '92]
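As a hedged illustration of the update Δθ = ε G^{-1}(η_Q − η_R), the sketch below performs natural-gradient moment matching for a small fully visible Boltzmann machine by exact enumeration; the function names, step size, and random target are assumptions for this toy example, not the original experiments.

```python
import itertools
import numpy as np

def suff_stats(x):
    """Sufficient statistics (x_i and x_i x_j for i < j) of one state."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x))]
    return np.concatenate([x, pairs])

def natural_gradient_step(theta_vec, eta_Q, states, eps=0.5):
    """One natural-gradient step toward the target expectation parameters eta_Q."""
    T = np.array([suff_stats(x) for x in states])        # states x statistics
    logits = T @ theta_vec
    p = np.exp(logits - logits.max())
    p /= p.sum()
    eta_R = T.T @ p                                       # eta = grad psi(theta)
    G = T.T @ (p[:, None] * T) - np.outer(eta_R, eta_R)   # Fisher matrix = Cov of stats
    return theta_vec + eps * np.linalg.solve(G, eta_Q - eta_R)

N = 3
states = list(itertools.product([0, 1], repeat=N))
rng = np.random.default_rng(1)
q = rng.dirichlet(np.ones(len(states)))                   # target distribution Q
eta_Q = np.array([suff_stats(x) for x in states]).T @ q
theta = np.zeros(N + N * (N - 1) // 2)
for _ in range(200):
    theta = natural_gradient_step(theta, eta_Q, states)
```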
Information Geometric Perspectives — With hidden units

x = (x_V, x_H)

p_θ(x_V) = Σ_{x_H} exp( Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j − ψ(θ) )

• The Boltzmann machine defines a curved manifold with singularities
• MLE minimizes the KL-divergence from the m-flat data manifold to the e-flat fully observable Boltzmann manifold [Amari, Kurata, Nagaoka '92]
• Iterative optimization using m- and e-projections, EM-algorithm [Amari '16] [Amari, Kurata, Nagaoka '92]

[Image: page 155 of Ackley, Hinton & Sejnowski '85, deriving the gradient of the KL-divergence G and the local learning rule Δw_ij = ε(p_ij − p'_ij), where p_ij and p'_ij are the equilibrium pair probabilities with the visible units clamped and running freely.]

[Figure: alternating m- and e-projections P^t → Q^t → P^{t+1} → Q^{t+1} converging to (P*, Q*) on the data manifold S.]
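The local rule Δw_ij = ε(p_ij − p'_ij) quoted in the excerpt needs the clamped and free pair expectations at equilibrium. The sketch below (a toy illustration under assumed names, not the talk's code) computes both exactly for a tiny Boltzmann machine with hidden units, so the rule can be checked without sampling.

```python
import itertools
import numpy as np

def exact_pair_expectations(W, b, data, nV, nH):
    """Clamped and free expectations <x_i x_j> for the Boltzmann learning rule.

    W    : (n, n) symmetric weight matrix with zero diagonal, n = nV + nH
    b    : (n,) biases
    data : (num_data, nV) array of visible configurations (the clamped distribution)
    """
    n = nV + nH
    all_states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

    def boltzmann(states):
        e = 0.5 * np.einsum('si,ij,sj->s', states, W, states) + states @ b
        p = np.exp(e - e.max())
        return p / p.sum()

    # Free phase: expectations under the full model distribution.
    p = boltzmann(all_states)
    p_free = all_states.T @ (p[:, None] * all_states)

    # Clamped phase: for each data vector, average over the hidden states only.
    p_clamped = np.zeros((n, n))
    hidden_states = np.array(list(itertools.product([0, 1], repeat=nH)), dtype=float)
    for v in data:
        states = np.hstack([np.tile(v, (len(hidden_states), 1)), hidden_states])
        p = boltzmann(states)                 # = p(x_H | x_V = v)
        p_clamped += states.T @ (p[:, None] * states)
    p_clamped /= len(data)
    return p_clamped, p_free

rng = np.random.default_rng(0)
nV, nH = 3, 2
n = nV + nH
W = np.triu(rng.normal(size=(n, n)), 1); W = W + W.T
b = rng.normal(size=n)
data = rng.integers(0, 2, size=(10, nV)).astype(float)
pc, pf = exact_pair_expectations(W, b, data, nV, nH)
W += 0.1 * (pc - pf) * (1 - np.eye(n))        # one step of the learning rule (off-diagonal)
```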
Algebraic Geometric Perspectives

• A Boltzmann machine has a polynomial parametrization and defines a semialgebraic variety in the probability simplex
• Main invariants of interest are the expected dimension and the number of parameters of (Zariski) dense models
  (3×3 minors of 2-d flattenings [Raicu '11])
• Implicitization: find an ideal basis that cuts out the model from the probability simplex

{ p = g(θ) : θ ∈ R^d } ∩ Δ   ↔   { p ∈ Δ : f(p) = 0, f ∈ I }

One polynomial of degree 110 with >5.5 trillion monomials [Cueto, Tobis, Yu '10]

[Pistone, Riccomagno, Wynn '01] [Garcia, Stillman, Sturmfels '05] [Geiger, Meek, Sturmfels '06] [Cueto, Morton, Sturmfels '10]
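For scale contrast with the degree-110 RBM polynomial above, here is a toy implicitization sketch: the independence model of two binary variables is cut out inside the simplex by a single 2×2 minor. The symbols a, b (the two marginals) are hypothetical names introduced only for this illustration.

```python
import sympy as sp

# Toy implicitization: the independence model of two binary variables satisfies
# the single implicit equation p00*p11 - p01*p10 = 0 inside the simplex.
a, b = sp.symbols('a b', positive=True)          # marginal probabilities of x1 = 1, x2 = 1
p = {(0, 0): (1 - a) * (1 - b), (0, 1): (1 - a) * b,
     (1, 0): a * (1 - b),       (1, 1): a * b}
minor = p[(0, 0)] * p[(1, 1)] - p[(0, 1)] * p[(1, 0)]
assert sp.simplify(minor) == 0
```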
Questions

p_θ(x_V) = Σ_{x_H} exp( Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j − ψ(θ) ),   x_V ∈ {0,1}^V

• Universal Approximation. What is the smallest number of hidden units such that any distribution on {0,1}^V can be represented to within any desired accuracy?
• Dimension. What is the dimension of the set of distributions represented by a fixed network?
• Approximation errors. MLE, maximum and expected KL-divergence, etc.
• Support sets. Properties of the marginal polytopes.

[Figure: a network on units x_1, …, x_8, partitioned into visible and hidden units.]
Various Possible Hierarchies

Hierarchies indexed by the number of hidden units, for different connectivities:
• fully connected
• stack of layers
• bipartite graph

[Figure: network diagrams for each case with an increasing number of hidden units.]
Restricted Boltzmann Machine

Bipartite graph between hidden units H and visible units V; #parameters = V·H + V + H.

[Figure: bipartite graph with m = 3 hidden units h_1, h_2, h_3 and n = 5 input units x_1, …, x_5, with weights w^(2)_1, ….]

Harmony Theory [Smolensky '86] — the conditionals factorize:
p(x_V | x_H) = Π_{i∈V} p(x_i | x_H),   p(x_H | x_V) = Π_{j∈H} p(x_j | x_V)

Influence Combination Machine [Freund & Haussler '94] / Products of Experts [Hinton '02]:
p(x_V) ∝ Π_{j∈H} q_j(x_V),   q_j(x_V) = λ_j Π_{i∈V} r_{j,i}(x_i) + (1 − λ_j) Π_{i∈V} s_{j,i}(x_i)
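The factorized conditionals above are what make block Gibbs sampling in an RBM cheap. A minimal sketch, assuming the standard sigmoid conditionals and illustrative parameter names (W, b_v, b_h); it is not tied to any particular experiment from the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, W, b_v, b_h, rng):
    """One block Gibbs sweep of an RBM, using the factorized conditionals.

    p(h_j = 1 | v) = sigmoid(b_h[j] + sum_i W[i, j] v_i)
    p(v_i = 1 | h) = sigmoid(b_v[i] + sum_j W[i, j] h_j)
    """
    p_h = sigmoid(b_h + v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b_v + h @ W.T)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

# Example: 5 visible and 3 hidden units (as in the sketch above), so the model
# has V*H + V + H = 5*3 + 5 + 3 = 23 parameters.
rng = np.random.default_rng(0)
V, H = 5, 3
W, b_v, b_h = rng.normal(size=(V, H)), np.zeros(V), np.zeros(H)
v = rng.integers(0, 2, size=V).astype(float)
for _ in range(100):
    v, h = gibbs_step(v, W, b_v, b_h, rng)
```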
Universal Approximation
Universal Approximation

Let H_V := min{ H : RBM_{V,H} is a universal approximator on {0,1}^V }.

Bounds on H_V, with the resulting behaviour of the number of parameters:
• H_V ≥ (2^V − V − 1)/(V + 1), i.e. ≳ 2^V parameters — Observation (parameter counting)
• H_V ≤ 2^V — Theorem (Freund & Haussler '94)
• H_V ≤ 2^V — Theorem (Le Roux & Bengio '10)
• H_V ≤ 2^V − V − 1, i.e. ~ V·2^V parameters — Theorem (Younes '95)
• H_V ≤ ½·2^V − 1 — Theorem (M. & Ay '11)
• H_V ≤ 2(log(V)+1)/(V+1) · 2^V − 1, i.e. ~ log(V)·2^V parameters — Theorem (M. & Rauh '16)
Comparison with mixtures of product distributions

Theorem. Every distribution on {0,1}^V can be approximated arbitrarily well by a mixture of k product distributions if and only if k ≥ 2^{V−1}.  [M., Kybernetika '13]
Number of parameters: Θ(V 2^V).

Theorem. Every distribution on {0,1}^V can be approximated arbitrarily well by distributions from RBM_{V,H} whenever H ≥ 2(log(V−1)+1)/(V+1) · (2^V − (V+1) − 1) + 1.  [M. & Rauh '16]
Number of parameters: Ω(2^V), O(log(V) 2^V).
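A small numerical comparison of these bounds, using the expressions as they appear on the slides above; the base of the logarithm is not specified there, so natural log is assumed, and all helper names are hypothetical.

```python
import math

def rbm_hidden_lower_bound(V):
    # Parameter counting: V*H + V + H >= 2**V - 1 forces H >= (2**V - V - 1)/(V + 1).
    return math.ceil((2**V - V - 1) / (V + 1))

def rbm_hidden_upper_bound(V):
    # Upper bound of the form 2*(log(V) + 1)/(V + 1) * 2**V - 1 from the previous slide.
    return math.ceil(2 * (math.log(V) + 1) / (V + 1) * 2**V - 1)

def mixture_components(V):
    # Mixtures of product distributions need k >= 2**(V - 1) components.
    return 2**(V - 1)

for V in (3, 5, 10, 15):
    H_lo, H_up, k = rbm_hidden_lower_bound(V), rbm_hidden_upper_bound(V), mixture_components(V)
    rbm_params = V * H_up + V + H_up          # O(log(V) * 2**V)
    mix_params = k * (V + 1) - 1              # Theta(V * 2**V)
    print(f"V={V}: H in [{H_lo}, {H_up}], RBM params ~{rbm_params}, mixture params ~{mix_params}")
```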
Proof I — Intuition

Each hidden unit extends the RBM along some parameters of the simplex.

[Figure: the probability simplex with the models B_V, B_{V∪H}, and E_Λ and parameters θ, ϑ; comparison with the previous approach.]

[M. & Rauh '16] [M. & Ay '11] [Younes '95] [Le Roux & Bengio '08]
Proof II — Hierarchical models

Consider the set E_Λ of probability vectors

q_ϑ(x_V) = exp( Σ_{λ∈Λ} ϑ_λ Π_{i∈λ} x_i − ψ(ϑ) ),   x_V ∈ {0,1}^V,

for all ϑ ∈ R^Λ, where Λ is an inclusion-closed subset of 2^V.

Natural parameters:
q_ϑ(x_V)   ↔   −H(x) = Σ_{λ∈Λ} ϑ_λ Π_{i∈λ} x_i,   (ϑ_λ)_{λ∈Λ} ∈ R^Λ,   (ϑ_λ)_{λ∉Λ} = 0

These are coordinates for the visible probability simplex. We will use each hidden unit to model a group of monomials.
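The coordinates ϑ_λ of a strictly positive distribution can be computed by Möbius inversion over the Boolean lattice. A minimal sketch for the full case Λ = 2^V (function names are illustrative):

```python
import itertools
import numpy as np

def natural_coordinates(log_p, V):
    """Moebius inversion: coefficients theta_B with
    log p(x) = sum_{B subseteq V} theta_B * prod_{i in B} x_i.

    log_p : dict mapping 0/1 tuples of length V to log-probabilities (p > 0)
    """
    def indicator(B):
        return tuple(1 if i in B else 0 for i in range(V))

    theta = {}
    for r in range(V + 1):
        for B in itertools.combinations(range(V), r):
            theta[B] = sum((-1) ** (len(B) - len(C)) * log_p[indicator(C)]
                           for s in range(len(B) + 1)
                           for C in itertools.combinations(B, s))
    return theta

# Sanity check on a random positive distribution over {0,1}^3.
rng = np.random.default_rng(0)
V = 3
states = list(itertools.product([0, 1], repeat=V))
p = rng.dirichlet(np.ones(len(states)))
log_p = dict(zip(states, np.log(p)))
theta = natural_coordinates(log_p, V)
for x, lp in log_p.items():
    rec = sum(t for B, t in theta.items() if all(x[i] == 1 for i in B))
    assert abs(rec - lp) < 1e-10
```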
Proof III

Boltzmann Machine:
p_θ(x_V) = Σ_{x_H} exp( Σ_i θ_i x_i + Σ_{i∈V, j∈H} θ_ij x_i x_j − ψ(θ) ),   x_V ∈ {0,1}^V

Free Energy:
p_θ(x_V)   ↔   −F(x_V) = log Σ_{x_H} exp( Σ_i θ_i x_i + Σ_{i∈V, j∈H} θ_ij x_i x_j )
                       = Σ_{i∈V} θ_i x_i + Σ_{j∈H} log( 1 + exp( θ_j + Σ_{i∈V} θ_ij x_i ) )

Natural parameters in the visible probability simplex:
ϑ_B(θ) = Σ_{j∈H} Σ_{C⊆B} (−1)^{|B\C|} log( 1 + exp( θ_j + Σ_{i∈C} θ_ij ) ),   B ∈ 2^V

Sum of independent terms, one per hidden unit.
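The formula for ϑ_B(θ) can be checked numerically. The sketch below sets the visible biases to zero (an assumption made here so that the displayed formula accounts for the whole free energy) and verifies that the Möbius coefficients of the hidden contribution reproduce −F(x_V) exactly; all names are illustrative.

```python
import itertools
import math
import random

def rbm_natural_coordinates(W, c, V, H):
    """theta_B(theta) = sum_{j in H} sum_{C subseteq B} (-1)^{|B \\ C|}
                        * log(1 + exp(c_j + sum_{i in C} W[i][j])),   B in 2^V.

    W : V x H interaction weights, c : hidden biases (visible biases taken as 0).
    """
    def g(C):  # hidden part of the negative free energy at the state 1_C
        return sum(math.log(1.0 + math.exp(c[j] + sum(W[i][j] for i in C)))
                   for j in range(H))

    theta = {}
    for r in range(V + 1):
        for B in itertools.combinations(range(V), r):
            theta[B] = sum((-1) ** (len(B) - len(C)) * g(C)
                           for s in range(len(B) + 1)
                           for C in itertools.combinations(B, s))
    return theta

# Check: sum_{B subseteq supp(x)} theta_B recovers the negative free energy.
random.seed(0)
V, H = 4, 3
W = [[random.gauss(0, 1) for _ in range(H)] for _ in range(V)]
c = [random.gauss(0, 1) for _ in range(H)]
theta = rbm_natural_coordinates(W, c, V, H)
for x in itertools.product([0, 1], repeat=V):
    A = tuple(i for i in range(V) if x[i] == 1)
    lhs = sum(math.log(1.0 + math.exp(c[j] + sum(W[i][j] * x[i] for i in range(V))))
              for j in range(H))
    rhs = sum(t for B, t in theta.items() if set(B) <= set(A))
    assert abs(lhs - rhs) < 1e-9
```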