
Information Geometry and Its Applications to Machine Learning

Machine Learning SS, Kyoto U. Information Geometry and Its Applications to Machine Learning. Shun-ichi Amari, RIKEN Brain Science Institute. Information Geometry: manifolds of probability distributions, $M = \{p(x)\}$.


  1. Stochastic Reasoning. Joint distribution $p(x, y, z, r, s)$ of binary variables $x, y, z, \ldots \in \{-1, 1\}$; reasoning uses the conditional $p(x, y, z \mid r, s)$.

  2. Stochastic Reasoning. Posterior $q(x_1, x_2, x_3, \ldots \mid \text{observation})$, $X = (x_1, x_2, x_3, \ldots)$, $x_i = \pm 1$. Maximum-likelihood decoding: $\hat{X} = \arg\max q(x_1, x_2, x_3, \ldots)$. Least bit-error-rate estimator: $\hat{x}_i = \operatorname{sgn} E[x_i]$.

  3. Mean Value. Marginalization: projection to independent distributions, $q_0(x) = q_1(x_1)\, q_2(x_2) \cdots q_n(x_n)$, where $q_i(x_i) = \int q(x_1, \ldots, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n$ and $\eta_i = E_q[x_i] = E_{q_0}[x_i]$.
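As a concrete illustration of this marginalization (the projection onto independent distributions), here is a minimal numpy sketch on a made-up three-variable joint table; it checks that the mean values $\eta_i$ agree under $q$ and under the product of marginals $q_0$.

```python
import numpy as np

# Hypothetical joint distribution q(x1, x2, x3) over x_i in {-1, +1},
# stored as a 2x2x2 table (made up for illustration).
rng = np.random.default_rng(0)
q = rng.random((2, 2, 2))
q /= q.sum()

vals = np.array([-1.0, 1.0])

# Marginals q_i(x_i): sum the joint over all other variables.
q1 = q.sum(axis=(1, 2))
q2 = q.sum(axis=(0, 2))
q3 = q.sum(axis=(0, 1))

# Mean values eta_i = E_q[x_i] agree with the expectations under the
# independent distribution q0(x) = q1(x1) q2(x2) q3(x3).
eta = [vals @ q1, vals @ q2, vals @ q3]
q0 = q1[:, None, None] * q2[None, :, None] * q3[None, None, :]
grid = np.stack(np.meshgrid(vals, vals, vals, indexing="ij"))
eta0 = [(grid[i] * q0).sum() for i in range(3)]
print(np.allclose(eta, eta0))  # True: the projection preserves the means
```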

  4. $q(x) = \exp\Big\{\sum_{r=1}^{L} c_r(x) + k \cdot x - \psi_q\Big\}$, where each $c_r(x) = c_r(x_{i_1}, \ldots, x_{i_s})$ couples a small subset of variables and $x_i \in \{-1, 1\}$. Example: $q(x) = \exp\{\sum w_{ij} x_i x_j + \sum h_i x_i - \psi\}$ — Boltzmann machine, spin glass, neural networks, Turbo codes, LDPC codes.

  5. Computationally Difficult. Computing $\eta = E_q[x]$ for $q(x) = \exp\{\sum_r c_r(x) - \psi_q\}$ is computationally difficult. Approximations: mean-field approximation, belief propagation, tree propagation, CCCP (convex-concave procedure).

  6. Information Geometry of Mean Field Approximation. $M_0 = \{p_0(x)\}$: independent distributions. $D[q:p] = \sum_x q(x)\log\frac{q(x)}{p(x)}$. m-projection: $\Pi_m q = \arg\min_{p \in M_0} D[q:p]$. e-projection: $\Pi_e q = \arg\min_{p \in M_0} D[p:q]$.

  7. Information Geometry. Target: $q(x) = \exp\{\sum_r c_r(x) - \psi\}$. Independent model: $M_0 = \{p_0(x, \theta) = \exp\{\theta\cdot x - \psi_0\}\}$. Partial models: $M_r = \{p_r(x, \xi_r) = \exp\{c_r(x) + \xi_r\cdot x - \psi_r\}\}$, $r = 1, \ldots, L$.

  8. Belief Propagation. $M_r: p_r(x, \xi_r) = \exp\{c_r(x) + \xi_r\cdot x - \psi_r\}$. Iteration: $\theta_r^{t+1} = \Pi_0\, p_r(x, \xi_r^t) - \xi_r^t$, the belief for $c_r(x)$ obtained by m-projecting $p_r$ onto $M_0$, and $\theta^{t+1} = \sum_r \theta_r^{t+1}$.
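The following is a minimal sketch of this information-geometric BP iteration, $\theta_r^{t+1} = \Pi_0\, p_r(x, \xi_r^t) - \xi_r^t$, $\theta^{t+1} = \sum_r\theta_r^{t+1}$, $\xi_r^{t+1} = \theta^{t+1} - \theta_r^{t+1}$. The pairwise factors, couplings, and brute-force expectations are assumptions for illustration, so it only runs for tiny models.

```python
import itertools
import numpy as np

# Belief propagation in its information-geometric form, assuming pairwise
# factors c_r(x) = w_r * x_i * x_j on binary x_i in {-1, +1}.

n = 4
rng = np.random.default_rng(1)
factors = [(0, 1), (1, 2), (2, 3), (0, 3)]
w = {r: rng.normal(scale=0.5) for r in factors}

states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)

def project_to_M0(log_unnorm):
    """m-projection onto the independent model M0: match the means,
    eta_i = tanh(theta_i), so theta_i = arctanh(eta_i)."""
    p = np.exp(log_unnorm - log_unnorm.max())
    p /= p.sum()
    eta = states.T @ p
    return np.arctanh(np.clip(eta, -1 + 1e-9, 1 - 1e-9))

xi = {r: np.zeros(n) for r in factors}             # messages xi_r
for _ in range(100):
    theta_r = {}
    for (i, j) in factors:
        log_pr = w[(i, j)] * states[:, i] * states[:, j] + states @ xi[(i, j)]
        theta_r[(i, j)] = project_to_M0(log_pr) - xi[(i, j)]   # belief for c_r
    theta = sum(theta_r.values())                  # theta = sum_r theta_r
    xi = {r: theta - theta_r[r] for r in factors}  # xi_r = theta - theta_r

print("BP approximation of E_q[x]:", np.tanh(theta))
```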

  9. Belief Prop Algorithm (figure: iterative projections among the submanifolds $M_0$, $M_r$, $M_{r'}$).

  10. Equilibrium of BP $(\theta^*, \xi_r^*)$. 1) m-condition: $\theta^* = \Pi_0\, p_r(x, \xi_r^*)$ for all $r$; $M(\theta^*)$ is an m-flat submanifold. 2) e-condition: $\theta^* = \frac{1}{L-1}\sum_r \xi_r^*$; $q(x)$ lies in the corresponding e-flat submanifold.

  11. Free energy: $F(\theta, \zeta_1, \ldots, \zeta_L) = D[p_0 : q] - \sum_r D[p_0 : p_r]$. At a critical point, $\partial F/\partial\theta = 0$ gives the e-condition and $\partial F/\partial\zeta_r = 0$ gives the m-condition. $F$ is not convex.

  12. Belief Propagation (e-condition kept satisfied): $\theta' = \frac{1}{L-1}\sum_r \xi'_r$, updating $(\xi_1, \xi_2, \ldots, \xi_L) \to (\xi'_1, \xi'_2, \ldots, \xi'_L)$. CCCP (m-condition kept satisfied): update $\theta \to \theta'$ and determine $\xi_1(\theta'), \ldots, \xi_L(\theta')$ from $p_0(x, \theta') = \Pi_0\, p_r(x, \xi'_r)$.

  13. One BP step: $\theta_r^{t+1} = \Pi_0\, p_r(x, \xi_r^t) - \xi_r^t$, $\theta^{t+1} = \sum_r \theta_r^{t+1}$, $\xi_r^{t+1} = \theta^{t+1} - \theta_r^{t+1}$.

  14. Convex-Concave Computational Procedure (CCCP), Yuille. Write $F(\theta) = F_1(\theta) - F_2(\theta)$ as a difference of convex functions and iterate by solving $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$. This eliminates the double loops.
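A toy sketch of CCCP, assuming the made-up decomposition $F = F_1 - F_2$ below; each step solves $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$ in closed form.

```python
import numpy as np

# CCCP on F(t) = F1(t) - F2(t) with F1(t) = t**4 / 4 and F2(t) = t**2
# (both convex), so F has minima at t = +/- sqrt(2).

def cccp_step(t):
    # grad F1(t_next) = t_next**3,  grad F2(t) = 2 * t
    return np.cbrt(2.0 * t)

t = 0.3
for _ in range(50):
    t = cccp_step(t)

print(t, np.sqrt(2.0))   # converges to a minimizer of F(t) = t**4/4 - t**2
```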

  15. Boltzmann Machine. $\Pr(x_i = 1 \mid x) = \varphi\big(\sum_j w_{ij}x_j - h_i\big)$; stationary distribution $p(x; w) = \exp\{\sum w_{ij}x_i x_j + \sum h_i x_i - \psi(w)\}$. Fit the model $\hat{p}_B(x)$ to a target distribution $q(x)$.

  16. Boltzmann machine with hidden units: the EM algorithm alternates an e-projection to the data manifold $D$ and an m-projection to the model manifold $M$.

  17. EM algorithm. Hidden variables: model $p(x, y; u)$, observed data $D = \{x_1, \ldots, x_N\}$. Model manifold $M = \{p(x, y; u)\}$; data manifold $D = \{p(x, y) : p(x) = p_D(x)\}$. m-projection to $M$: $\min_{p \in M} KL[\hat{p}(x, y) : p]$. e-projection to $D$: $\min_{\hat{p} \in D} KL[\hat{p} : p(x, y; u)]$.

  18. SVM: support vector machine. Embedding $z_i = \phi(x_i)$. Decision function $f(x) = \sum_i w_i\phi_i(x) = \sum_i \alpha_i y_i K(x, x_i)$. Kernel: $K(x, x') = \sum_i \phi_i(x)\phi_i(x')$. Conformal change of kernel: $K(x, x') \to \rho(x)\rho(x')K(x, x')$ with $\rho(x) = \exp\{-\kappa|f(x)|^2\}$.
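A minimal sketch of the conformal change of kernel; the RBF base kernel and the stand-in decision function `f` are assumptions, not part of the slides (in practice `f` comes from a first SVM fit, and the modified kernel is then refit).

```python
import numpy as np

# Conformal kernel change: K~(x, x') = rho(x) rho(x') K(x, x'),
# rho(x) = exp(-kappa * |f(x)|**2).

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_kernel(X, Y, f, kappa=1.0, gamma=1.0):
    """f maps an array of points to decision values of the first-pass SVM."""
    rho_X = np.exp(-kappa * np.abs(f(X)) ** 2)
    rho_Y = np.exp(-kappa * np.abs(f(Y)) ** 2)
    return rho_X[:, None] * rho_Y[None, :] * rbf_kernel(X, Y, gamma)

# Usage with a made-up first-pass decision function:
X = np.random.default_rng(0).normal(size=(5, 2))
f = lambda Z: Z[:, 0] - 0.5 * Z[:, 1]          # stand-in for a trained SVM's f
K_tilde = conformal_kernel(X, X, f)
print(K_tilde.shape)   # (5, 5); entries are magnified near the boundary f ~ 0
```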

  19. Signal Processing. ICA: Independent Component Analysis, $x_t = As_t$, recover $s_t$ from $x_t$. Also: sparse component analysis, positive matrix factorization.

  20. Mixture and unmixture of independent signals: $x_i = \sum_{j=1}^{m} A_{ij}s_j$, i.e. $x = As$ (figure: sources $s_1, \ldots, s_m$ mixed into observations $x_1, \ldots, x_n$).

  21. Independent Component Analysis. $x = As$, $x_i = \sum_j A_{ij}s_j$; demixing $y = Wx$ with $W = A^{-1}$. Observations: $x(1), x(2), \ldots, x(t)$; recover: $s(1), s(2), \ldots, s(t)$.

  22. Space of Matrices: Lie group. $W \to W + dW$ corresponds to $I \to I + dX$ with $dX = dW\,W^{-1}$. Invariant metric: $ds^2 = \operatorname{tr}(dX\,dX^T) = \operatorname{tr}\big(dW\,W^{-1}(dW\,W^{-1})^T\big)$. Natural gradient: $\tilde\nabla l = \frac{\partial l}{\partial W}W^T W$. $dX$: non-holonomic basis.

  23. Information Geometry of ICA. $S = \{p(y)\}$; independent distributions $I = \{q(y) = q_1(y_1)q_2(y_2)\cdots q_n(y_n)\}$; demixed family $\{p(Wx)\}$. Cost $l(y; W) = KL[p(y; W) : q(y)]$, minimized by the natural gradient. Estimating functions $r(y)$; stability, efficiency.

  24. Semiparametric Statistical Model. $p(x; W, r) = |W|\,r(Wx)$, where $W = A^{-1}$ and $r(s) = \prod_i r_i(s_i)$ is unknown. Observations: $x(1), x(2), \ldots, x(t)$.

  25. Natural Gradient. $\Delta W = -\eta\,\dfrac{\partial l(y, W)}{\partial W}\,W^T W$.
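For the usual ICA likelihood this natural-gradient update takes the familiar form $\Delta W = \eta\,(I - \varphi(y)y^T)W$. The sketch below illustrates it; the tanh nonlinearity (for super-Gaussian sources) and the toy 2×2 mixing problem are assumptions for illustration.

```python
import numpy as np

# Natural-gradient ICA on a made-up mixture of two sparse (super-Gaussian) sources.
rng = np.random.default_rng(0)
n, T = 2, 20000
S = np.sign(rng.normal(size=(n, T))) * rng.exponential(size=(n, T))  # sources
A = rng.normal(size=(n, n))                                          # mixing matrix
X = A @ S                                                            # observations

W = np.eye(n)
eta = 0.05
for _ in range(300):
    Y = W @ X
    phi = np.tanh(Y)
    # natural-gradient update, averaged over the sample:
    # Delta W = eta * (I - E[phi(y) y^T]) W
    W += eta * (np.eye(n) - (phi @ Y.T) / T) @ W

# After learning, W @ A should be close to a scaled permutation matrix.
print(np.round(W @ A, 2))
```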

  26. Basis. Given $x = A\hat{s} = \sum_i \hat{s}_i a_i$ with $A$ overcomplete, there are many solutions $\hat{s}$; a sparse solution has many $\hat{s}_i = 0$. $x_t = A\hat{s}_t$.

  27. $x = A\hat{s}$, $A$: overcomplete. Generalized-inverse solution: $\min\sum\hat{s}_i^2$ ($L_2$-norm). Sparse solution: $\min\sum|\hat{s}_i|$ ($L_1$-norm).
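A small numerical illustration of the two norms on a made-up overcomplete system; the minimum-$L_1$ solution is obtained here by casting it as a linear program, which is one standard route, not necessarily the method of the slides.

```python
import numpy as np
from scipy.optimize import linprog

# Underdetermined system x = A s: 3 equations, 6 unknowns, sparse ground truth.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))
s_true = np.array([0.0, 0.0, 1.5, 0.0, -2.0, 0.0])
x = A @ s_true

# minimum L2-norm solution: generalized (Moore-Penrose) inverse
s_l2 = np.linalg.pinv(A) @ x

# minimum L1-norm solution: LP  min sum(t)  s.t.  -t <= s <= t,  A s = x
n = A.shape[1]
c = np.concatenate([np.zeros(n), np.ones(n)])        # variables [s, t]
A_ub = np.block([[np.eye(n), -np.eye(n)], [-np.eye(n), -np.eye(n)]])
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((3, n))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=x,
              bounds=[(None, None)] * n + [(0, None)] * n)
s_l1 = res.x[:n]

print(np.round(s_l2, 2))   # spreads energy over many components
print(np.round(s_l1, 2))   # typically recovers the sparse solution
```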

  28. Overcomplete Basis and Sparse Solution. $x = As = \sum_i s_i a_i$. Sparse solution: $\min\sum|s_i|$, or the penalized form $\min\|x - As\|^p + \alpha\|s\|_p$; nonlinear denoising.

  29. Sparse Solution. $\min\varphi(\beta)$ with penalty $F_p(\beta) = \sum|\beta_i|^p$ (Bayes prior). Sparsest solution: $F_0(\beta) = \#\{i : \beta_i \ne 0\}$. $L_1$ solution: $F_1(\beta) = \sum|\beta_i|$. Sparse solutions in the overcomplete case: $F_p(\beta)$, $0 \le p \le 1$. Generalized-inverse solution: $F_2(\beta) = \sum\beta_i^2$.

  30. Optimization under Sparsity Condition. $\min\varphi(\beta)$ (convex function) subject to the constraint $F(\beta) \le c$. Typical case: $\varphi(\beta) = \frac{1}{2}\|y - X\beta\|^2 = \frac{1}{2}(\beta - \beta^*)^T G(\beta - \beta^*)$; $F_p(\beta) = \sum|\beta_i|^p$ with $p = 2,\ 1,\ 1/2$.

  31. L1-constrained optimization (LASSO). Problem $P_c$: $\min\varphi(\beta)$ under $F(\beta) \le c$; solution $\beta_c^*$, with $\beta_c^* = 0$ at $c = 0$ and $\beta_c^* \to \beta^*$ as $c \to \infty$ (LARS). Problem $P_\lambda$: $\min\varphi(\beta) + \lambda F(\beta)$; solution $\beta_\lambda^*$, with $\beta_\lambda^* = 0$ as $\lambda \to \infty$ and $\beta_\lambda^* \to \beta^*$ as $\lambda \to 0$. For $p \ge 1$ the solutions $\beta_c^*$ and $\beta_\lambda^*$ coincide under a correspondence $\lambda = \lambda(c)$. For $p < 1$ the correspondence $\lambda = \lambda(c)$ is multiple-valued and non-continuous, and the stability differs.
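For the penalized problem $P_\lambda$ with $p = 1$, a standard solver is iterative soft-thresholding (ISTA); the sketch below uses made-up data and is not the LARS path algorithm discussed on the following slides.

```python
import numpy as np

# ISTA for  min 0.5 * ||y - X beta||^2 + lambda * ||beta||_1.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10); beta_true[[2, 7]] = [1.0, -2.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

print(np.round(ista(X, y, lam=5.0), 2))    # most coefficients shrunk to zero
```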

  32. Projection from $\beta^*$ onto the constraint set $F(\beta) = c$ (information geometry); figure: $\beta^*$ and its projections $\beta_c^*$.

  33. Convex Cone Programming. $P$: positive semi-definite matrices; convex potential function; dual geodesic approach. $Ax = b$, $\min c\cdot x$. Support vector machine.

  34. Fig. 1: constraint regions in $R^n$, $n = 2$: a) $p > 1$; b) $p = 1$; c) $p < 1$ (non-convex).

  35. Orthogonal projection, dual projection. $\beta_c^* = \arg\min_\beta D_\varphi[\beta^* : \beta]$ subject to $F(\beta) = c$ (dual geodesic projection): $\eta^* - \eta_c^* \propto \nabla F(\eta_c^*)$.

  36. Fig. 5: the normal (subgradient) $n = \nabla F$; at the projection, $\eta^* - \eta_c^* \propto \nabla F(\eta_c^*)$.

  37. LASSO path and LARS path (stagewise solution). Constrained form: $\min\varphi(\beta)$, $F(\beta) = c$. Penalized form: $\min\varphi(\beta) + \lambda F(\beta)$. Correspondence between solutions: $\beta_c^* \Leftrightarrow \beta_\lambda^*$, $c \leftrightarrow \lambda$.

  38. Active set and gradient. $A(\beta) = \{i : \beta_i \ne 0\}$. $\nabla_i F(\beta) = \operatorname{sgn}(\beta_i)\,|\beta_i|^{p-1}$ for $i \in A$; for $i \notin A$, $\nabla_i F \in (-\infty, \infty)$ when $p < 1$ and $\nabla_i F \in [-1, 1]$ when $p = 1$.

  39. Solution path. On the active set $A$: $\nabla_A\varphi(\beta_c^*) + \lambda_c\nabla_A F(\beta_c^*) = 0$. Differentiating along the path: $\{\nabla_A\nabla_A\varphi(\beta_c^*) + \lambda_c\nabla_A\nabla_A F(\beta_c^*)\}\dot\beta_c = -\dot\lambda_c\nabla_A F(\beta_c^*)$, so $\dot\beta_c = \frac{d}{dc}\beta_c^* = -\dot\lambda_c K_c^{-1}\nabla_A F(\beta_c^*)$, with $K_c = G(\beta_c^*) + \lambda_c\nabla\nabla F(\beta_c^*)$. For $L_1$: $\nabla\nabla F_1 = 0$, $\nabla F_1 = (\operatorname{sgn}\beta_i)$.

  40. Solution path in the subspace of the active set. $\nabla_A\varphi(\beta_\lambda^*) + \lambda\nabla_A F(\beta_\lambda^*) = 0$: active direction. $\dot\beta_\lambda^* = -K^{-1}\nabla_A F(\beta_\lambda^*)$. At a turning point the active set changes, $A \to A'$.

  41. Gradient Descent Method. $\min L(x + a)$ subject to $|a|^2 = \varepsilon^2$, $|a|^2 = g_{ij}a^i a^j$. $\nabla L = \{\frac{\partial}{\partial x_i}L(x)\}$: covariant. $\tilde\nabla L = \{\sum_j g^{ij}\frac{\partial}{\partial x_j}L(x)\}$: contravariant. Update: $x_{t+1} = x_t - c_t\tilde\nabla L(x_t)$.
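A minimal sketch of this Riemannian (natural) gradient update on a made-up quadratic loss with an assumed constant metric $G$:

```python
import numpy as np

# Riemannian gradient descent: the ordinary gradient is covariant, the update
# uses the contravariant gradient G^{-1} grad L.

G = np.array([[4.0, 1.0], [1.0, 2.0]])       # Riemannian metric g_ij (assumed)
target = np.array([1.0, -1.0])

def grad(x):                                  # covariant gradient dL/dx
    return x - target                         # L(x) = 0.5 * ||x - target||^2

x = np.zeros(2)
G_inv = np.linalg.inv(G)
for _ in range(100):
    x = x - 0.5 * G_inv @ grad(x)             # x_{t+1} = x_t - c * G^{-1} grad L

print(np.round(x, 3))                         # converges to the minimizer
```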

  42. Extended LARS ($p = 1$) and Minkovskian gradient. Norm: $\|a\|_p = \sum|a_i|^p$. Maximize $\psi(\beta + \varepsilon a)$ under $\|a\|_p = 1$, i.e. $\psi(\beta + \varepsilon a) - \lambda\|a\|_p$. For $p = 1$: $\big(\tilde\nabla_1\psi(\beta)\big)_i = \operatorname{sgn}(\eta_i)$ if $|\eta_i| = \max\{|\eta_1|, \ldots, |\eta_N|\}$, and $0$ otherwise, where $\eta = \nabla\psi(\beta)$.

  43. $i^* = \arg\max_i f_i$ (with ties $f_{i^*} = f_{j^*} = \max f$). $\tilde\nabla_i F = 1$ for $i = i^*, j^*$, and $0$ otherwise. LARS: $\beta^{t+1} = \beta^t - \eta\,\tilde\nabla F$.

  44. Euclidean case: $\tilde\nabla F = \nabla f$. General $p$: $(\tilde\nabla F)_i = c\,\operatorname{sgn}(f_i)\,|f_i|^{1/(p-1)}$. As $p \to 1$ this concentrates on the maximal coordinate: $\tilde\nabla F = c\,\operatorname{sgn}(f_{i^*})\,e_{i^*}$, a vector with a single non-zero entry at $i^*$.

  45. $L_{1/2}$ constraint: non-convex optimization; $\lambda$-trajectory and $c$-trajectory. Example, 1-dimensional: $\varphi(\beta) = \frac{1}{2}(\beta - \beta^*)^2$, $f(\beta) = \varphi + \lambda F = \frac{1}{2}(\beta - \beta^*)^2 + \lambda|\beta|^{1/2}$.

  46. $P_c$: $\min(\beta - \beta^*)^2$ subject to $F(\beta) \le c$, with active-constraint solution $\hat\beta_c$ (figure: $0$, $\hat\beta_c$, $\beta^*$). $P_\lambda$: $\nabla f = 0$ gives $\hat\beta_\lambda = R_\lambda(\beta^*)$, where $R_\lambda$ is Xu Zongben's (half-thresholding) operator, the solution of $\beta - \beta^* + \lambda\nabla F(\beta) = 0$. The $c$-path solution $\beta_c^*$ and the $\lambda$-path solution $\beta_\lambda^*$ differ.

  47. ICCN-Huangshan (黄山). Sparse Signal Analysis. Shun-ichi Amari (甘利俊一), RIKEN Brain Science Institute (Collaborator: Masahiro Yukawa, Niigata University).

  48. Solution Path: the correspondence $\lambda \leftrightarrow c$ is not continuous and not monotone; the solutions jump, $\beta_\lambda \Leftrightarrow \beta_c$.

  49. An Example of the greedy path (figure: trajectory in the $(\beta_1, \beta_2)$ plane).

  50. Linear Programming. Constraints: $\sum_j A_{ij}x_j \ge b_i$; objective: $\max\sum_i c_i x_i$. Barrier: $\psi(x) = -\sum_i\log\big(\sum_j A_{ij}x_j - b_i\big)$ (inner method).

  51. Convex Programming — Inner Method. LP: $Ax \ge b$, $x \ge 0$, $\min c\cdot x$. Barrier: $\psi(x) = -\sum_i\log\big(\sum_j A_{ij}x_j - b_i\big) - \sum_i\log x_i$; $\eta_i = \partial_i\psi(x)$. Simplex method vs. inner method.
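A small sketch of the inner (log-barrier) method on a made-up two-variable LP, following the central path by shrinking the barrier weight; a plain gradient descent with backtracking is used for the inner minimization, purely for illustration.

```python
import numpy as np

# Inner method for  min c.x  s.t.  A x >= b, x >= 0, using the barrier
# psi(x) = -sum log(Ax - b) - sum log(x).  Toy problem; optimum is x = (1, 0).

A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])

def objective(x, mu):
    s = A @ x - b
    if np.any(s <= 0) or np.any(x <= 0):
        return np.inf                     # outside the interior of the feasible set
    return c @ x - mu * (np.log(s).sum() + np.log(x).sum())

def gradient(x, mu):
    s = A @ x - b
    return c - mu * (A.T @ (1.0 / s) + 1.0 / x)

x = np.array([2.0, 2.0])                  # strictly feasible starting point
mu = 1.0
for _ in range(40):                       # shrink the barrier weight along the path
    for _ in range(100):                  # inner minimization: backtracking descent
        g = gradient(x, mu)
        step = 1.0
        while objective(x - step * g, mu) > objective(x, mu) and step > 1e-12:
            step *= 0.5
        x = x - step * g
    mu *= 0.7

print(np.round(x, 3))                     # approaches the optimal vertex (1, 0)
```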

  52. Polynomial-Time Algorithm. The curvature $(H_m)^2$ of the central path governs the step size $\delta t$ for $\min t\,c\cdot x + \psi(x)$; the path $x^*(t)$ is treated via a geodesic argument.

  53. Neural Networks: multilayer perceptron; higher-order correlations; synchronous firing.

  54. Multilayer Perceptrons. $y = \sum_i v_i\varphi(w_i\cdot x) + n$, $x = (x_1, x_2, \ldots, x_n)$. $p(y; x, \theta) = c\exp\big\{-\frac{1}{2}\big(y - f(x, \theta)\big)^2\big\}$, $f(x, \theta) = \sum_i v_i\varphi(w_i\cdot x)$, $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.

  55. Multilayer Perceptron: the neuromanifold is the space of functions $y = f(x, \theta) = \sum_i v_i\varphi(w_i\cdot x)$, parameterized by $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.

  56. singularities

  57. Geometry of the singular model $y = v\,\varphi(w\cdot x) + n$: the parameter set $W = \{v\,|w| = 0\}$ is singular.

  58. Backpropagation — gradient learning. Examples: $(x_1, y_1), \ldots, (x_t, y_t)$. $E = \frac{1}{2}\big(y - f(x, \theta)\big)^2 = -\log p(y; x, \theta)$. Steepest descent: $\Delta\theta_t = -\eta_t\frac{\partial E}{\partial\theta}$. Natural gradient (Riemannian): $\tilde\nabla E = G^{-1}\nabla E$. $f(x, \theta) = \sum_i v_i\varphi(w_i\cdot x)$.
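A minimal sketch of gradient (backpropagation) learning for $f(x, \theta) = \sum_i v_i\varphi(w_i\cdot x)$ with squared error. The tanh nonlinearity, teacher network, and learning rate are assumptions; the natural-gradient variant would premultiply the gradient by $G^{-1}$ and is omitted here.

```python
import numpy as np

# Stochastic gradient (backprop) training of a one-hidden-layer perceptron.
rng = np.random.default_rng(0)
n, m, T = 3, 4, 2000
W_true = rng.normal(size=(m, n))          # teacher weights (made up)
v_true = rng.normal(size=m)

X = rng.normal(size=(T, n))
y = np.tanh(X @ W_true.T) @ v_true + 0.01 * rng.normal(size=T)

W = rng.normal(scale=0.5, size=(m, n))    # student parameters theta = (W, v)
v = rng.normal(scale=0.5, size=m)
eta = 0.05
for _ in range(5000):
    i = rng.integers(T)
    h = np.tanh(W @ X[i])                 # hidden activations phi(w_i . x)
    err = h @ v - y[i]                    # f(x, theta) - y
    grad_v = err * h                      # dE/dv_i
    grad_W = err * np.outer(v * (1 - h ** 2), X[i])   # dE/dw_i by the chain rule
    v -= eta * grad_v
    W -= eta * grad_W

print(np.mean((np.tanh(X @ W.T) @ v - y) ** 2))       # training error has decreased
```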

  59. Conformal transformation. q-Fisher information: $g_{ij}^{(q)}(p) = F\big(h_q(p)\big)\,g_{ij}(p)$, a conformal change of the Fisher metric. q-divergence: $D_q[p(x) : r(x)] = \dfrac{1}{(1-q)\,h_q(p)}\Big(1 - \int p(x)^q\,r(x)^{1-q}\,dx\Big)$.

  60. Total Bregman Divergence and its Applications to Shape Retrieval • Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010

  61. Total Bregman Divergence. $TD[x:y] = \dfrac{D[x:y]}{\sqrt{1 + |\nabla f|^2}}$. Rotational invariance; conformal geometry.

  62. Total Bregman divergence (Vemuri). $TBD(p:q) = \dfrac{\varphi(p) - \varphi(q) - \nabla\varphi(q)\cdot(p - q)}{\sqrt{1 + |\nabla\varphi(q)|^2}}$.
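A minimal sketch of evaluating this formula, assuming the convex generator $\varphi(x) = \|x\|^2$ for illustration (any differentiable convex $\varphi$ can be plugged in):

```python
import numpy as np

# Total Bregman divergence: the ordinary Bregman divergence divided by
# sqrt(1 + |grad phi(q)|^2).

def tbd(p, q, phi, grad_phi):
    num = phi(p) - phi(q) - grad_phi(q) @ (p - q)      # ordinary Bregman divergence
    return num / np.sqrt(1.0 + grad_phi(q) @ grad_phi(q))

phi = lambda x: x @ x
grad_phi = lambda x: 2.0 * x

p = np.array([1.0, 2.0])
q = np.array([0.5, 1.0])
print(tbd(p, q, phi, grad_phi))    # ||p - q||^2 scaled by 1 / sqrt(1 + |2q|^2)
```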
