
Information Geometry and Its Applications to Machine Learning

Machine Learning SS, Kyoto U. Information Geometry and Its Applications to Machine Learning. Shun-ichi Amari, RIKEN Brain Science Institute. Information Geometry: manifolds of probability distributions, $M = \{p(x)\}$.


  1. Stochastic Reasoning. Joint distribution $p(x, y, z, r, s)$ of binary variables $x, y, z, \ldots \in \{-1, 1\}$; reasoning uses the conditional $p(x, y, z \mid r, s)$.

  2. Stochastic Reasoning. Posterior $q(x_1, x_2, x_3, \ldots \mid \text{observation})$, $X = (x_1, x_2, x_3, \ldots)$, $x_i = \pm 1$. Maximum-likelihood decoding: $\hat{X} = \arg\max q(x_1, x_2, x_3, \ldots)$. Least bit-error-rate estimator: $\hat{x}_i = \operatorname{sgn} E[x_i]$.

  3. Mean Value. Marginalization: projection to independent distributions, $q_0(x) = q_1(x_1)\, q_2(x_2) \cdots q_n(x_n)$, where $q_i(x_i) = \int q(x_1, \ldots, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n$ and $\eta_i = E_q[x_i] = E_{q_0}[x_i]$.
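As a concrete illustration of this marginalization (the projection onto independent distributions), here is a minimal numpy sketch on a made-up three-variable joint table; it checks that the mean values $\eta_i$ agree under $q$ and under the product of marginals $q_0$.

```python
import numpy as np

# Hypothetical joint distribution q(x1, x2, x3) over x_i in {-1, +1},
# stored as a 2x2x2 table (made up for illustration).
rng = np.random.default_rng(0)
q = rng.random((2, 2, 2))
q /= q.sum()

vals = np.array([-1.0, 1.0])

# Marginals q_i(x_i): sum the joint over all other variables.
q1 = q.sum(axis=(1, 2))
q2 = q.sum(axis=(0, 2))
q3 = q.sum(axis=(0, 1))

# Mean values eta_i = E_q[x_i] agree with the expectations under the
# independent distribution q0(x) = q1(x1) q2(x2) q3(x3).
eta = [vals @ q1, vals @ q2, vals @ q3]
q0 = q1[:, None, None] * q2[None, :, None] * q3[None, None, :]
grid = np.stack(np.meshgrid(vals, vals, vals, indexing="ij"))
eta0 = [(grid[i] * q0).sum() for i in range(3)]
print(np.allclose(eta, eta0))  # True: the projection preserves the means
```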

  4. $q(x) = \exp\Big\{\sum_{r=1}^{L} c_r(x) + k \cdot x - \psi_q\Big\}$, where each $c_r(x) = c_r(x_{i_1}, \ldots, x_{i_s})$ couples a small subset of variables and $x_i \in \{-1, 1\}$. Example: $q(x) = \exp\{\sum w_{ij} x_i x_j + \sum h_i x_i - \psi\}$ — Boltzmann machine, spin glass, neural networks, Turbo codes, LDPC codes.

  5. Computationally Difficult. Computing $\eta = E_q[x]$ for $q(x) = \exp\{\sum_r c_r(x) - \psi_q\}$ is computationally difficult. Approximations: mean-field approximation, belief propagation, tree propagation, CCCP (convex-concave procedure).

  6. Information Geometry of Mean Field Approximation. $M_0 = \{p_0(x)\}$: independent distributions. $D[q:p] = \sum_x q(x)\log\frac{q(x)}{p(x)}$. m-projection: $\Pi_m q = \arg\min_{p \in M_0} D[q:p]$. e-projection: $\Pi_e q = \arg\min_{p \in M_0} D[p:q]$.

  7. Information Geometry. Target: $q(x) = \exp\{\sum_r c_r(x) - \psi\}$. Independent model: $M_0 = \{p_0(x, \theta) = \exp\{\theta\cdot x - \psi_0\}\}$. Partial models: $M_r = \{p_r(x, \xi_r) = \exp\{c_r(x) + \xi_r\cdot x - \psi_r\}\}$, $r = 1, \ldots, L$.

  8. Belief Propagation. $M_r: p_r(x, \xi_r) = \exp\{c_r(x) + \xi_r\cdot x - \psi_r\}$. Iteration: $\theta_r^{t+1} = \Pi_0\, p_r(x, \xi_r^t) - \xi_r^t$, the belief for $c_r(x)$ obtained by m-projecting $p_r$ onto $M_0$, and $\theta^{t+1} = \sum_r \theta_r^{t+1}$.
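The following is a minimal sketch of this information-geometric BP iteration, $\theta_r^{t+1} = \Pi_0\, p_r(x, \xi_r^t) - \xi_r^t$, $\theta^{t+1} = \sum_r\theta_r^{t+1}$, $\xi_r^{t+1} = \theta^{t+1} - \theta_r^{t+1}$. The pairwise factors, couplings, and brute-force expectations are assumptions for illustration, so it only runs for tiny models.

```python
import itertools
import numpy as np

# Belief propagation in its information-geometric form, assuming pairwise
# factors c_r(x) = w_r * x_i * x_j on binary x_i in {-1, +1}.

n = 4
rng = np.random.default_rng(1)
factors = [(0, 1), (1, 2), (2, 3), (0, 3)]
w = {r: rng.normal(scale=0.5) for r in factors}

states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)

def project_to_M0(log_unnorm):
    """m-projection onto the independent model M0: match the means,
    eta_i = tanh(theta_i), so theta_i = arctanh(eta_i)."""
    p = np.exp(log_unnorm - log_unnorm.max())
    p /= p.sum()
    eta = states.T @ p
    return np.arctanh(np.clip(eta, -1 + 1e-9, 1 - 1e-9))

xi = {r: np.zeros(n) for r in factors}             # messages xi_r
for _ in range(100):
    theta_r = {}
    for (i, j) in factors:
        log_pr = w[(i, j)] * states[:, i] * states[:, j] + states @ xi[(i, j)]
        theta_r[(i, j)] = project_to_M0(log_pr) - xi[(i, j)]   # belief for c_r
    theta = sum(theta_r.values())                  # theta = sum_r theta_r
    xi = {r: theta - theta_r[r] for r in factors}  # xi_r = theta - theta_r

print("BP approximation of E_q[x]:", np.tanh(theta))
```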

  9. Belief Prop Algorithm (figure: iterative projections among the submanifolds $M_0$, $M_r$, $M_{r'}$).

  10. Equilibrium of BP $(\theta^*, \xi_r^*)$. 1) m-condition: $\theta^* = \Pi_0\, p_r(x, \xi_r^*)$ for all $r$; $M(\theta^*)$ is an m-flat submanifold. 2) e-condition: $\theta^* = \frac{1}{L-1}\sum_r \xi_r^*$; $q(x)$ lies in the corresponding e-flat submanifold.

  11. Free energy: $F(\theta, \zeta_1, \ldots, \zeta_L) = D[p_0 : q] - \sum_r D[p_0 : p_r]$. At a critical point, $\partial F/\partial\theta = 0$ gives the e-condition and $\partial F/\partial\zeta_r = 0$ gives the m-condition. $F$ is not convex.

  12. Belief Propagation (e-condition kept satisfied): $\theta' = \frac{1}{L-1}\sum_r \xi'_r$, updating $(\xi_1, \xi_2, \ldots, \xi_L) \to (\xi'_1, \xi'_2, \ldots, \xi'_L)$. CCCP (m-condition kept satisfied): update $\theta \to \theta'$ and determine $\xi_1(\theta'), \ldots, \xi_L(\theta')$ from $p_0(x, \theta') = \Pi_0\, p_r(x, \xi'_r)$.

  13. One BP step: $\theta_r^{t+1} = \Pi_0\, p_r(x, \xi_r^t) - \xi_r^t$, $\theta^{t+1} = \sum_r \theta_r^{t+1}$, $\xi_r^{t+1} = \theta^{t+1} - \theta_r^{t+1}$.

  14. Convex-Concave Computational Procedure (CCCP), Yuille. Write $F(\theta) = F_1(\theta) - F_2(\theta)$ as a difference of convex functions and iterate by solving $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$. This eliminates the double loops.
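A toy sketch of CCCP, assuming the made-up decomposition $F = F_1 - F_2$ below; each step solves $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$ in closed form.

```python
import numpy as np

# CCCP on F(t) = F1(t) - F2(t) with F1(t) = t**4 / 4 and F2(t) = t**2
# (both convex), so F has minima at t = +/- sqrt(2).

def cccp_step(t):
    # grad F1(t_next) = t_next**3,  grad F2(t) = 2 * t
    return np.cbrt(2.0 * t)

t = 0.3
for _ in range(50):
    t = cccp_step(t)

print(t, np.sqrt(2.0))   # converges to a minimizer of F(t) = t**4/4 - t**2
```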

  15. Boltzmann Machine. $\Pr(x_i = 1 \mid x) = \varphi\big(\sum_j w_{ij}x_j - h_i\big)$; stationary distribution $p(x; w) = \exp\{\sum w_{ij}x_i x_j + \sum h_i x_i - \psi(w)\}$. Fit the model $\hat{p}_B(x)$ to a target distribution $q(x)$.

  16. Boltzmann machine with hidden units: the EM algorithm alternates an e-projection to the data manifold $D$ and an m-projection to the model manifold $M$.

  17. EM algorithm. Hidden variables: model $p(x, y; u)$, observed data $D = \{x_1, \ldots, x_N\}$. Model manifold $M = \{p(x, y; u)\}$; data manifold $D = \{p(x, y) : p(x) = p_D(x)\}$. m-projection to $M$: $\min_{p \in M} KL[\hat{p}(x, y) : p]$. e-projection to $D$: $\min_{\hat{p} \in D} KL[\hat{p} : p(x, y; u)]$.

  18. SVM: support vector machine. Embedding $z_i = \phi(x_i)$. Decision function $f(x) = \sum_i w_i\phi_i(x) = \sum_i \alpha_i y_i K(x, x_i)$. Kernel: $K(x, x') = \sum_i \phi_i(x)\phi_i(x')$. Conformal change of kernel: $K(x, x') \to \rho(x)\rho(x')K(x, x')$ with $\rho(x) = \exp\{-\kappa|f(x)|^2\}$.
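A minimal sketch of the conformal change of kernel; the RBF base kernel and the stand-in decision function `f` are assumptions, not part of the slides (in practice `f` comes from a first SVM fit, and the modified kernel is then refit).

```python
import numpy as np

# Conformal kernel change: K~(x, x') = rho(x) rho(x') K(x, x'),
# rho(x) = exp(-kappa * |f(x)|**2).

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_kernel(X, Y, f, kappa=1.0, gamma=1.0):
    """f maps an array of points to decision values of the first-pass SVM."""
    rho_X = np.exp(-kappa * np.abs(f(X)) ** 2)
    rho_Y = np.exp(-kappa * np.abs(f(Y)) ** 2)
    return rho_X[:, None] * rho_Y[None, :] * rbf_kernel(X, Y, gamma)

# Usage with a made-up first-pass decision function:
X = np.random.default_rng(0).normal(size=(5, 2))
f = lambda Z: Z[:, 0] - 0.5 * Z[:, 1]          # stand-in for a trained SVM's f
K_tilde = conformal_kernel(X, X, f)
print(K_tilde.shape)   # (5, 5); entries are magnified near the boundary f ~ 0
```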

  19. Signal Processing. ICA: Independent Component Analysis, $x_t = As_t$, recover $s_t$ from $x_t$. Also: sparse component analysis, positive matrix factorization.

  20. Mixture and unmixture of independent signals: $x_i = \sum_{j=1}^{m} A_{ij}s_j$, i.e. $x = As$ (figure: sources $s_1, \ldots, s_m$ mixed into observations $x_1, \ldots, x_n$).

  21. Independent Component Analysis. $x = As$, $x_i = \sum_j A_{ij}s_j$; demixing $y = Wx$ with $W = A^{-1}$. Observations: $x(1), x(2), \ldots, x(t)$; recover: $s(1), s(2), \ldots, s(t)$.

  22. Space of Matrices: Lie group. $W \to W + dW$ corresponds to $I \to I + dX$ with $dX = dW\,W^{-1}$. Invariant metric: $ds^2 = \operatorname{tr}(dX\,dX^T) = \operatorname{tr}\big(dW\,W^{-1}(dW\,W^{-1})^T\big)$. Natural gradient: $\tilde\nabla l = \frac{\partial l}{\partial W}W^T W$. $dX$: non-holonomic basis.

  23. Information Geometry of ICA. $S = \{p(y)\}$; independent distributions $I = \{q(y) = q_1(y_1)q_2(y_2)\cdots q_n(y_n)\}$; demixed family $\{p(Wx)\}$. Cost $l(y; W) = KL[p(y; W) : q(y)]$, minimized by the natural gradient. Estimating functions $r(y)$; stability, efficiency.

  24. Semiparametric Statistical Model. $p(x; W, r) = |W|\,r(Wx)$, where $W = A^{-1}$ and $r(s) = \prod_i r_i(s_i)$ is unknown. Observations: $x(1), x(2), \ldots, x(t)$.

  25. Natural Gradient. $\Delta W = -\eta\,\dfrac{\partial l(y, W)}{\partial W}\,W^T W$.
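For the usual ICA likelihood this natural-gradient update takes the familiar form $\Delta W = \eta\,(I - \varphi(y)y^T)W$. The sketch below illustrates it; the tanh nonlinearity (for super-Gaussian sources) and the toy 2×2 mixing problem are assumptions for illustration.

```python
import numpy as np

# Natural-gradient ICA on a made-up mixture of two sparse (super-Gaussian) sources.
rng = np.random.default_rng(0)
n, T = 2, 20000
S = np.sign(rng.normal(size=(n, T))) * rng.exponential(size=(n, T))  # sources
A = rng.normal(size=(n, n))                                          # mixing matrix
X = A @ S                                                            # observations

W = np.eye(n)
eta = 0.05
for _ in range(300):
    Y = W @ X
    phi = np.tanh(Y)
    # natural-gradient update, averaged over the sample:
    # Delta W = eta * (I - E[phi(y) y^T]) W
    W += eta * (np.eye(n) - (phi @ Y.T) / T) @ W

# After learning, W @ A should be close to a scaled permutation matrix.
print(np.round(W @ A, 2))
```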

  26. Basis. Given $x = A\hat{s} = \sum_i \hat{s}_i a_i$ with $A$ overcomplete, there are many solutions $\hat{s}$; a sparse solution has many $\hat{s}_i = 0$. $x_t = A\hat{s}_t$.

  27. $x = A\hat{s}$, $A$: overcomplete. Generalized-inverse solution: $\min\sum\hat{s}_i^2$ ($L_2$-norm). Sparse solution: $\min\sum|\hat{s}_i|$ ($L_1$-norm).
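A small numerical illustration of the two norms on a made-up overcomplete system; the minimum-$L_1$ solution is obtained here by casting it as a linear program, which is one standard route, not necessarily the method of the slides.

```python
import numpy as np
from scipy.optimize import linprog

# Underdetermined system x = A s: 3 equations, 6 unknowns, sparse ground truth.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))
s_true = np.array([0.0, 0.0, 1.5, 0.0, -2.0, 0.0])
x = A @ s_true

# minimum L2-norm solution: generalized (Moore-Penrose) inverse
s_l2 = np.linalg.pinv(A) @ x

# minimum L1-norm solution: LP  min sum(t)  s.t.  -t <= s <= t,  A s = x
n = A.shape[1]
c = np.concatenate([np.zeros(n), np.ones(n)])        # variables [s, t]
A_ub = np.block([[np.eye(n), -np.eye(n)], [-np.eye(n), -np.eye(n)]])
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((3, n))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=x,
              bounds=[(None, None)] * n + [(0, None)] * n)
s_l1 = res.x[:n]

print(np.round(s_l2, 2))   # spreads energy over many components
print(np.round(s_l1, 2))   # typically recovers the sparse solution
```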

  28. Overcomplete Basis and Sparse Solution. $x = As = \sum_i s_i a_i$. Sparse solution: $\min\sum|s_i|$, or the penalized form $\min\|x - As\|^p + \alpha\|s\|_p$; nonlinear denoising.

  29. Sparse Solution. $\min\varphi(\beta)$ with penalty $F_p(\beta) = \sum|\beta_i|^p$ (Bayes prior). Sparsest solution: $F_0(\beta) = \#\{i : \beta_i \ne 0\}$. $L_1$ solution: $F_1(\beta) = \sum|\beta_i|$. Sparse solutions in the overcomplete case: $F_p(\beta)$, $0 \le p \le 1$. Generalized-inverse solution: $F_2(\beta) = \sum\beta_i^2$.

  30. Optimization under Sparsity Condition. $\min\varphi(\beta)$ (convex function) subject to the constraint $F(\beta) \le c$. Typical case: $\varphi(\beta) = \frac{1}{2}\|y - X\beta\|^2 = \frac{1}{2}(\beta - \beta^*)^T G(\beta - \beta^*)$; $F_p(\beta) = \sum|\beta_i|^p$ with $p = 2,\ 1,\ 1/2$.

  31. L1-constrained optimization (LASSO). Problem $P_c$: $\min\varphi(\beta)$ under $F(\beta) \le c$; solution $\beta_c^*$, with $\beta_c^* = 0$ at $c = 0$ and $\beta_c^* \to \beta^*$ as $c \to \infty$ (LARS). Problem $P_\lambda$: $\min\varphi(\beta) + \lambda F(\beta)$; solution $\beta_\lambda^*$, with $\beta_\lambda^* = 0$ as $\lambda \to \infty$ and $\beta_\lambda^* \to \beta^*$ as $\lambda \to 0$. For $p \ge 1$ the solutions $\beta_c^*$ and $\beta_\lambda^*$ coincide under a correspondence $\lambda = \lambda(c)$. For $p < 1$ the correspondence $\lambda = \lambda(c)$ is multiple-valued and non-continuous, and the stability differs.
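For the penalized problem $P_\lambda$ with $p = 1$, a standard solver is iterative soft-thresholding (ISTA); the sketch below uses made-up data and is not the LARS path algorithm discussed on the following slides.

```python
import numpy as np

# ISTA for  min 0.5 * ||y - X beta||^2 + lambda * ||beta||_1.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10); beta_true[[2, 7]] = [1.0, -2.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

print(np.round(ista(X, y, lam=5.0), 2))    # most coefficients shrunk to zero
```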

  32. Projection from $\beta^*$ onto the constraint set $F(\beta) = c$ (information geometry); figure: $\beta^*$ and its projections $\beta_c^*$.

  33. Convex Cone Programming. $P$: positive semi-definite matrices; convex potential function; dual geodesic approach. $Ax = b$, $\min c\cdot x$. Support vector machine.

  34. Fig. 1: constraint regions in $R^n$, $n = 2$: a) $p > 1$; b) $p = 1$; c) $p < 1$ (non-convex).

  35. Orthogonal projection, dual projection. $\beta_c^* = \arg\min_\beta D_\varphi[\beta^* : \beta]$ subject to $F(\beta) = c$ (dual geodesic projection): $\eta^* - \eta_c^* \propto \nabla F(\eta_c^*)$.

  36. Fig. 5: the normal (subgradient) $n = \nabla F$; at the projection, $\eta^* - \eta_c^* \propto \nabla F(\eta_c^*)$.

  37. LASSO path and LARS path (stagewise solution). Constrained form: $\min\varphi(\beta)$, $F(\beta) = c$. Penalized form: $\min\varphi(\beta) + \lambda F(\beta)$. Correspondence between solutions: $\beta_c^* \Leftrightarrow \beta_\lambda^*$, $c \leftrightarrow \lambda$.

  38. Active set and gradient. $A(\beta) = \{i : \beta_i \ne 0\}$. $\nabla_i F(\beta) = \operatorname{sgn}(\beta_i)\,|\beta_i|^{p-1}$ for $i \in A$; for $i \notin A$, $\nabla_i F \in (-\infty, \infty)$ when $p < 1$ and $\nabla_i F \in [-1, 1]$ when $p = 1$.

  39. Solution path. On the active set $A$: $\nabla_A\varphi(\beta_c^*) + \lambda_c\nabla_A F(\beta_c^*) = 0$. Differentiating along the path: $\{\nabla_A\nabla_A\varphi(\beta_c^*) + \lambda_c\nabla_A\nabla_A F(\beta_c^*)\}\dot\beta_c = -\dot\lambda_c\nabla_A F(\beta_c^*)$, so $\dot\beta_c = \frac{d}{dc}\beta_c^* = -\dot\lambda_c K_c^{-1}\nabla_A F(\beta_c^*)$, with $K_c = G(\beta_c^*) + \lambda_c\nabla\nabla F(\beta_c^*)$. For $L_1$: $\nabla\nabla F_1 = 0$, $\nabla F_1 = (\operatorname{sgn}\beta_i)$.

  40. Solution path in the subspace of the active set. $\nabla_A\varphi(\beta_\lambda^*) + \lambda\nabla_A F(\beta_\lambda^*) = 0$: active direction. $\dot\beta_\lambda^* = -K^{-1}\nabla_A F(\beta_\lambda^*)$. At a turning point the active set changes, $A \to A'$.

  41. Gradient Descent Method. $\min L(x + a)$ subject to $|a|^2 = \varepsilon^2$, $|a|^2 = g_{ij}a^i a^j$. $\nabla L = \{\frac{\partial}{\partial x_i}L(x)\}$: covariant. $\tilde\nabla L = \{\sum_j g^{ij}\frac{\partial}{\partial x_j}L(x)\}$: contravariant. Update: $x_{t+1} = x_t - c_t\tilde\nabla L(x_t)$.
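A minimal sketch of this Riemannian (natural) gradient update on a made-up quadratic loss with an assumed constant metric $G$:

```python
import numpy as np

# Riemannian gradient descent: the ordinary gradient is covariant, the update
# uses the contravariant gradient G^{-1} grad L.

G = np.array([[4.0, 1.0], [1.0, 2.0]])       # Riemannian metric g_ij (assumed)
target = np.array([1.0, -1.0])

def grad(x):                                  # covariant gradient dL/dx
    return x - target                         # L(x) = 0.5 * ||x - target||^2

x = np.zeros(2)
G_inv = np.linalg.inv(G)
for _ in range(100):
    x = x - 0.5 * G_inv @ grad(x)             # x_{t+1} = x_t - c * G^{-1} grad L

print(np.round(x, 3))                         # converges to the minimizer
```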

  42. Extended LARS ($p = 1$) and Minkovskian gradient. Norm: $\|a\|_p = \sum|a_i|^p$. Maximize $\psi(\beta + \varepsilon a)$ under $\|a\|_p = 1$, i.e. $\psi(\beta + \varepsilon a) - \lambda\|a\|_p$. For $p = 1$: $\big(\tilde\nabla_1\psi(\beta)\big)_i = \operatorname{sgn}(\eta_i)$ if $|\eta_i| = \max\{|\eta_1|, \ldots, |\eta_N|\}$, and $0$ otherwise, where $\eta = \nabla\psi(\beta)$.

  43. $i^* = \arg\max_i f_i$ (with ties $f_{i^*} = f_{j^*} = \max f$). $\tilde\nabla_i F = 1$ for $i = i^*, j^*$, and $0$ otherwise. LARS: $\beta^{t+1} = \beta^t - \eta\,\tilde\nabla F$.

  44. Euclidean case: $\tilde\nabla F = \nabla f$. General $p$: $(\tilde\nabla F)_i = c\,\operatorname{sgn}(f_i)\,|f_i|^{1/(p-1)}$. As $p \to 1$ this concentrates on the maximal coordinate: $\tilde\nabla F = c\,\operatorname{sgn}(f_{i^*})\,e_{i^*}$, a vector with a single non-zero entry at $i^*$.

  45. $L_{1/2}$ constraint: non-convex optimization; $\lambda$-trajectory and $c$-trajectory. Example, 1-dimensional: $\varphi(\beta) = \frac{1}{2}(\beta - \beta^*)^2$, $f(\beta) = \varphi + \lambda F = \frac{1}{2}(\beta - \beta^*)^2 + \lambda|\beta|^{1/2}$.

  46. $P_c$: $\min(\beta - \beta^*)^2$ subject to $F(\beta) \le c$, with active-constraint solution $\hat\beta_c$ (figure: $0$, $\hat\beta_c$, $\beta^*$). $P_\lambda$: $\nabla f = 0$ gives $\hat\beta_\lambda = R_\lambda(\beta^*)$, where $R_\lambda$ is Xu Zongben's (half-thresholding) operator, the solution of $\beta - \beta^* + \lambda\nabla F(\beta) = 0$. The $c$-path solution $\beta_c^*$ and the $\lambda$-path solution $\beta_\lambda^*$ differ.

  47. ICCN-Huangshan (黄山). Sparse Signal Analysis. Shun-ichi Amari (甘利俊一), RIKEN Brain Science Institute (Collaborator: Masahiro Yukawa, Niigata University).

  48. Solution Path: the correspondence $\lambda \leftrightarrow c$ is not continuous and not monotone; the solutions jump, $\beta_\lambda \Leftrightarrow \beta_c$.

  49. An Example of the greedy path (figure: trajectory in the $(\beta_1, \beta_2)$ plane).

  50. Linear Programming. Constraints: $\sum_j A_{ij}x_j \ge b_i$; objective: $\max\sum_i c_i x_i$. Barrier: $\psi(x) = -\sum_i\log\big(\sum_j A_{ij}x_j - b_i\big)$ (inner method).

  51. Convex Programming — Inner Method. LP: $Ax \ge b$, $x \ge 0$, $\min c\cdot x$. Barrier: $\psi(x) = -\sum_i\log\big(\sum_j A_{ij}x_j - b_i\big) - \sum_i\log x_i$; $\eta_i = \partial_i\psi(x)$. Simplex method vs. inner method.
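A small sketch of the inner (log-barrier) method on a made-up two-variable LP, following the central path by shrinking the barrier weight; a plain gradient descent with backtracking is used for the inner minimization, purely for illustration.

```python
import numpy as np

# Inner method for  min c.x  s.t.  A x >= b, x >= 0, using the barrier
# psi(x) = -sum log(Ax - b) - sum log(x).  Toy problem; optimum is x = (1, 0).

A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])

def objective(x, mu):
    s = A @ x - b
    if np.any(s <= 0) or np.any(x <= 0):
        return np.inf                     # outside the interior of the feasible set
    return c @ x - mu * (np.log(s).sum() + np.log(x).sum())

def gradient(x, mu):
    s = A @ x - b
    return c - mu * (A.T @ (1.0 / s) + 1.0 / x)

x = np.array([2.0, 2.0])                  # strictly feasible starting point
mu = 1.0
for _ in range(40):                       # shrink the barrier weight along the path
    for _ in range(100):                  # inner minimization: backtracking descent
        g = gradient(x, mu)
        step = 1.0
        while objective(x - step * g, mu) > objective(x, mu) and step > 1e-12:
            step *= 0.5
        x = x - step * g
    mu *= 0.7

print(np.round(x, 3))                     # approaches the optimal vertex (1, 0)
```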

  52. Polynomial-Time Algorithm. The curvature $(H_m)^2$ of the central path governs the step size $\delta t$ for $\min t\,c\cdot x + \psi(x)$; the path $x^*(t)$ is treated via a geodesic argument.

  53. Neural Networks: multilayer perceptron; higher-order correlations; synchronous firing.

  54. Multilayer Perceptrons. $y = \sum_i v_i\varphi(w_i\cdot x) + n$, $x = (x_1, x_2, \ldots, x_n)$. $p(y; x, \theta) = c\exp\big\{-\frac{1}{2}\big(y - f(x, \theta)\big)^2\big\}$, $f(x, \theta) = \sum_i v_i\varphi(w_i\cdot x)$, $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.

  55. Multilayer Perceptron: the neuromanifold is the space of functions $y = f(x, \theta) = \sum_i v_i\varphi(w_i\cdot x)$, parameterized by $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.

  56. singularities

  57. Geometry of the singular model $y = v\,\varphi(w\cdot x) + n$: the parameter set $W = \{v\,|w| = 0\}$ is singular.

  58. Backpropagation — gradient learning. Examples: $(x_1, y_1), \ldots, (x_t, y_t)$. $E = \frac{1}{2}\big(y - f(x, \theta)\big)^2 = -\log p(y; x, \theta)$. Steepest descent: $\Delta\theta_t = -\eta_t\frac{\partial E}{\partial\theta}$. Natural gradient (Riemannian): $\tilde\nabla E = G^{-1}\nabla E$. $f(x, \theta) = \sum_i v_i\varphi(w_i\cdot x)$.
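A minimal sketch of gradient (backpropagation) learning for $f(x, \theta) = \sum_i v_i\varphi(w_i\cdot x)$ with squared error. The tanh nonlinearity, teacher network, and learning rate are assumptions; the natural-gradient variant would premultiply the gradient by $G^{-1}$ and is omitted here.

```python
import numpy as np

# Stochastic gradient (backprop) training of a one-hidden-layer perceptron.
rng = np.random.default_rng(0)
n, m, T = 3, 4, 2000
W_true = rng.normal(size=(m, n))          # teacher weights (made up)
v_true = rng.normal(size=m)

X = rng.normal(size=(T, n))
y = np.tanh(X @ W_true.T) @ v_true + 0.01 * rng.normal(size=T)

W = rng.normal(scale=0.5, size=(m, n))    # student parameters theta = (W, v)
v = rng.normal(scale=0.5, size=m)
eta = 0.05
for _ in range(5000):
    i = rng.integers(T)
    h = np.tanh(W @ X[i])                 # hidden activations phi(w_i . x)
    err = h @ v - y[i]                    # f(x, theta) - y
    grad_v = err * h                      # dE/dv_i
    grad_W = err * np.outer(v * (1 - h ** 2), X[i])   # dE/dw_i by the chain rule
    v -= eta * grad_v
    W -= eta * grad_W

print(np.mean((np.tanh(X @ W.T) @ v - y) ** 2))       # training error has decreased
```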

  59. Conformal transformation. q-Fisher information: $g_{ij}^{(q)}(p) = F\big(h_q(p)\big)\,g_{ij}(p)$, a conformal change of the Fisher metric. q-divergence: $D_q[p(x) : r(x)] = \dfrac{1}{(1-q)\,h_q(p)}\Big(1 - \int p(x)^q\,r(x)^{1-q}\,dx\Big)$.

  60. Total Bregman Divergence and its Applications to Shape Retrieval • Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010

  61. Total Bregman Divergence. $TD[x:y] = \dfrac{D[x:y]}{\sqrt{1 + |\nabla f|^2}}$. Rotational invariance; conformal geometry.

  62. Total Bregman divergence (Vemuri). $TBD(p:q) = \dfrac{\varphi(p) - \varphi(q) - \nabla\varphi(q)\cdot(p - q)}{\sqrt{1 + |\nabla\varphi(q)|^2}}$.
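A minimal sketch of evaluating this formula, assuming the convex generator $\varphi(x) = \|x\|^2$ for illustration (any differentiable convex $\varphi$ can be plugged in):

```python
import numpy as np

# Total Bregman divergence: the ordinary Bregman divergence divided by
# sqrt(1 + |grad phi(q)|^2).

def tbd(p, q, phi, grad_phi):
    num = phi(p) - phi(q) - grad_phi(q) @ (p - q)      # ordinary Bregman divergence
    return num / np.sqrt(1.0 + grad_phi(q) @ grad_phi(q))

phi = lambda x: x @ x
grad_phi = lambda x: 2.0 * x

p = np.array([1.0, 2.0])
q = np.array([0.5, 1.0])
print(tbd(p, q, phi, grad_phi))    # ||p - q||^2 scaled by 1 / sqrt(1 + |2q|^2)
```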
