Information Geometric Optimization
How information theory sheds new light on black-box optimization
Anne Auger, Inria and CMAP
Main reference: Y. Ollivier, L. Arnold, A. Auger, N. Hansen, Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles, JMLR (accepted)
Black-Box Optimization
Optimize $f : \Omega \to \mathbb{R}$
- discrete optimization: $\Omega = \{0,1\}^n$
- continuous optimization: $\Omega \subset \mathbb{R}^n$
Only function values $f(x) \in \mathbb{R}$ for queried points $x \in \Omega$ are available; gradients are not available or not useful.
Adaptive Stochastic Black-Box Algorithm
$\theta_t$: state of the algorithm
Sample candidate solutions: $X^i_{t+1} = \mathrm{Sol}(\theta_t, U^i_{t+1})$, $i = 1, \dots, \lambda$, with $\{U_{t+1},\, t \in \mathbb{N}\}$ i.i.d.
Evaluate solutions: $f(X^1_{t+1}), \dots, f(X^\lambda_{t+1})$
Update the state of the algorithm:
$$\theta_{t+1} = F\Big(\theta_t,\, \big(X^1_{t+1}, f(X^1_{t+1})\big), \dots, \big(X^\lambda_{t+1}, f(X^\lambda_{t+1})\big)\Big)$$
Comparison-based Stochastic Algorithms
Invariance to strictly increasing transformations of $f$
Sample candidate solutions: $X^i_{t+1} = \mathrm{Sol}(\theta_t, U^i_{t+1})$, $i = 1, \dots, \lambda$
Evaluate and rank solutions: $f\big(X^{S(1)}_{t+1}\big) \le \dots \le f\big(X^{S(\lambda)}_{t+1}\big)$, where $S$ is the permutation giving the indices of the ordered solutions
Update the state of the algorithm:
$$\theta_{t+1} = F\Big(\theta_t,\, U^{S(1)}_{t+1}, \dots, U^{S(\lambda)}_{t+1}\Big)$$
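As an illustration (not on the original slides), a minimal Python sketch of this comparison-based loop; the concrete choices of $\mathrm{Sol}$ (isotropic Gaussian sampling around a mean) and of the update (move the mean to the best sample) are placeholders, not the algorithms discussed later:

```python
import numpy as np

# Minimal sketch of the comparison-based loop (placeholder Sol/update choices,
# not the algorithms of the talk): the update sees only the ranking of the
# samples, so replacing f by g(f) with g strictly increasing does not change
# the run.

def step(theta, f, lam, rng):
    U = rng.standard_normal((lam, theta["m"].size))   # U^i_{t+1}, i.i.d.
    X = theta["m"] + theta["sigma"] * U               # X^i_{t+1} = Sol(theta_t, U^i_{t+1})
    order = np.argsort([f(x) for x in X])             # permutation S from the ranking
    theta["m"] = X[order[0]]                          # update uses ranks only
    return theta

rng = np.random.default_rng(0)
theta = {"m": np.full(3, 3.0), "sigma": 0.5}
sphere = lambda x: float(np.dot(x, x))
for _ in range(100):
    theta = step(theta, sphere, lam=10, rng=rng)
print(theta["m"])   # much closer to the optimum at 0 than the initial m
```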
Overview
➊ Black-Box Optimization: typical difficulties
➋ Information Geometric Optimization
➌ Invariance
➍ Recovering well-known algorithms: CMA-ES, PBIL, cGA
Information Geometric Optimization: Setting
Family of probability distributions $(P_\theta)_{\theta \in \Theta}$ on $\Omega$, with a continuous multicomponent parameter $\theta \in \Theta$
$\Theta$: statistical manifold
Example: $\Omega = \mathbb{R}^n$, $P_\theta$ multivariate normal distribution with $\theta = (m, C)$
Changing Viewpoint I
Transform the original optimization problem on $\Omega$, $\min_{x \in \Omega} f(x)$, into an optimization problem on $\Theta$: minimize
$$F(\theta) = \int f(x)\, P_\theta(dx)$$
Minimizing $F$ ⇔ finding a Dirac delta distribution concentrated on $\operatorname{argmin}_x f(x)$ [Wierstra et al., 2014]
But $F$ is not invariant under strictly increasing transformations of $f$.
Changing Viewpoint II
Invariant under strictly increasing transformations of $f$
Transform the original optimization problem on $\Omega$, $\min_{x \in \Omega} f(x)$, into an optimization problem on $\Theta$: maximize
$$J_{\theta_t}(\theta) = \int \underbrace{W^f_{\theta_t}(x)}_{w(P_{\theta_t}[y :\, f(y) \le f(x)])} \, P_\theta(dx)$$
with $w : [0,1] \to \mathbb{R}$ a decreasing weight function.
Rationale: $f$ "small" ↔ $W^f_{\theta_t}(x)$ "large" [Ollivier et al.]
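A small numerical check of this invariance (not on the original slides): $W^f_{\theta_t}(x)$ depends on $f$ only through the quantile $P_{\theta_t}[y : f(y) \le f(x)]$, so composing $f$ with a strictly increasing $g$ leaves it unchanged. Here $P_{\theta_t}$ is taken as a standard normal on $\mathbb{R}$ and $w(q) = 1 - q$, both illustrative choices:

```python
import numpy as np

# Numerical check: W^f_{theta_t}(x) depends on f only through the quantile
# P_{theta_t}[f(y) <= f(x)], hence is unchanged under strictly increasing
# transformations g of f. P_{theta_t} = N(0,1) and w(q) = 1 - q are
# illustrative choices.

rng = np.random.default_rng(1)
w = lambda q: 1.0 - q                      # decreasing weight function
f = lambda x: (x - 2.0) ** 2               # objective
g = lambda v: np.exp(v) + 3.0 * v          # strictly increasing transformation

y = rng.standard_normal(200_000)           # samples from P_{theta_t}

def W(obj, x):
    q = np.mean(obj(y) <= obj(x))          # Monte Carlo estimate of the quantile
    return w(q)

x = 0.7
print(W(f, x), W(lambda z: g(f(z)), x))    # equal: the ranking of samples is preserved
```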
Information Geometric Optimization: Maximizing $J_{\theta_t}(\theta)$
Perform a natural gradient step on $\Theta$:
$$\theta_{t+\delta t} = \theta_t + \delta t \, \widetilde{\nabla}_\theta \int W^f_{\theta_t}(x) \, P_\theta(dx)$$
Natural Gradient: Fisher Information Metric
Natural gradient $\widetilde{\nabla}_\theta$: gradient with respect to the Fisher metric, defined via the Fisher matrix
$$I_{ij}(\theta) = \int_x \frac{\partial \ln P_\theta(x)}{\partial \theta_i} \frac{\partial \ln P_\theta(x)}{\partial \theta_j} \, P_\theta(dx) = - \int_x \frac{\partial^2 \ln P_\theta(x)}{\partial \theta_i \, \partial \theta_j} \, P_\theta(dx)$$
$$\widetilde{\nabla} = I^{-1} \frac{\partial}{\partial \theta}$$
Fisher Information Metric
Equivalently defined via a second-order expansion of the KL divergence.
Kullback-Leibler divergence: a measure of "distance" between distributions,
$$\mathrm{KL}(P_{\theta'} \,\|\, P_\theta) = \int \ln \frac{P_{\theta'}(dx)}{P_\theta(dx)} \, P_{\theta'}(dx)$$
Relation between KL divergence and the Fisher matrix:
$$\mathrm{KL}(P_{\theta+\delta\theta} \,\|\, P_\theta) = \frac{1}{2} \sum_{ij} I_{ij}(\theta)\, \delta\theta_i\, \delta\theta_j + O(\delta\theta^3)$$
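As a worked example (not on the original slides), the expansion can be checked by hand for a Bernoulli family $P_p(1) = p$, $P_p(0) = 1 - p$:
$$\mathrm{KL}(P_{p+\delta} \,\|\, P_p) = (p+\delta)\ln\frac{p+\delta}{p} + (1-p-\delta)\ln\frac{1-p-\delta}{1-p} = \frac{\delta^2}{2\,p(1-p)} + O(\delta^3)$$
$$\text{hence } I(p) = \frac{1}{p(1-p)} \quad\text{and}\quad \widetilde{\nabla}_p = p(1-p)\,\frac{\partial}{\partial p}.$$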
Natural Gradient: Fisher Information Metric
Natural gradient $\widetilde{\nabla}_\theta$: gradient with respect to the Fisher metric defined by the Fisher matrix $I(\theta)$ above.
Intrinsic: independent of the chosen parametrization $\theta$ of $P_\theta$.
The Fisher metric is essentially the only way to obtain this property [Amari, Nagaoka, 2001].
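A quick numerical sanity check (not part of the talk), assuming a one-dimensional Gaussian family parametrized by $\theta = (m, \sigma)$: the two expressions for $I(\theta)$ on the previous slide agree with the known closed form $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$:

```python
import numpy as np

# Sanity check (not from the slides): for N(m, sigma^2) parametrized by
# theta = (m, sigma), estimate the Fisher matrix both as the expected outer
# product of the score and as minus the expected Hessian of ln P_theta, and
# compare with the closed form diag(1/sigma^2, 2/sigma^2).

rng = np.random.default_rng(2)
m, sigma = 1.0, 0.5
x = m + sigma * rng.standard_normal(1_000_000)

# score: gradient of ln P_theta(x) with respect to (m, sigma)
score = np.stack([(x - m) / sigma**2,
                  -1.0 / sigma + (x - m) ** 2 / sigma**3])
I_outer = score @ score.T / x.size

# minus the Hessian of ln P_theta(x), averaged over the samples
I_hess = -np.array([[np.mean(-1.0 / sigma**2 * np.ones_like(x)),
                     np.mean(-2.0 * (x - m) / sigma**3)],
                    [np.mean(-2.0 * (x - m) / sigma**3),
                     np.mean(1.0 / sigma**2 - 3.0 * (x - m) ** 2 / sigma**4)]])

print(I_outer)
print(I_hess)
print(np.diag([1.0 / sigma**2, 2.0 / sigma**2]))   # analytic Fisher matrix
```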
Information Geometric Optimization: Maximizing $J_{\theta_t}(\theta)$
Perform a natural gradient step on $\Theta$:
$$\theta_{t+\delta t} = \theta_t + \delta t \, \widetilde{\nabla}_\theta \int W^f_{\theta_t}(x)\, P_\theta(dx)$$
$$= \theta_t + \delta t \int W^f_{\theta_t}(x)\, \frac{\widetilde{\nabla}_\theta P_\theta(x)}{P_{\theta_t}(x)}\, P_{\theta_t}(x)\, dx$$
$$= \theta_t + \delta t \int W^f_{\theta_t}(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx)$$
$$= \theta_t + \delta t \int w\big(P_{\theta_t}[y :\, f(y) \le f(x)]\big)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx)$$
The update does not depend on $\nabla f$.
➊ IGO flow: limit $\delta t \to 0$
➋ IGO algorithms: discretization of the integrals
IGO Gradient Flow
Information Geometric Optimization: set of continuous-time trajectories in $\Theta$-space defined by the ODE
$$\frac{d\theta_t}{dt} = \int W^f_{\theta_t}(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\big|_{\theta=\theta_t}\, P_{\theta_t}(dx)$$
[Ollivier et al.]
Information Geometric Optimization (IGO) Algorithm [Ollivier et al.]
Monte Carlo approximation of the integrals:
Sample $X_i \sim P_{\theta_t}$, $i = 1, \dots, N$
$$w\big(P_{\theta_t}[y :\, f(y) \le f(X_i)]\big) \approx w\!\left(\frac{\mathrm{rk}(X_i) + 1/2}{N}\right), \qquad \mathrm{rk}(X_i) = \#\{j \mid f(X_j) < f(X_i)\}$$
IGO algorithm:
$$\theta_{t+\delta t} = \theta_t + \delta t\, \frac{1}{N} \sum_{i=1}^N w\!\left(\frac{\mathrm{rk}(X_i) + 1/2}{N}\right) \widetilde{\nabla}_\theta \ln P_\theta(X_i)\big|_{\theta=\theta_t} = \theta_t + \delta t \sum_{i=1}^N \hat{w}_i\, \widetilde{\nabla}_\theta \ln P_\theta(X_i)\big|_{\theta=\theta_t}$$
The sum is a consistent estimator of the integral, with $\hat{w}_i = \frac{1}{N}\, w\!\left(\frac{\mathrm{rk}(X_i) + 1/2}{N}\right)$.
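A minimal sketch of the generic IGO step in Python (illustrative names, not library code); it assumes the user supplies the sampling routine, the natural gradient of the log-likelihood, and the weight function $w$, with $\theta$ stored as a flat vector:

```python
import numpy as np

# Minimal sketch of one generic IGO step (illustrative, not library code):
# `sample(theta, n, rng)` draws X_i ~ P_theta, and
# `nat_grad_log_p(theta, x)` returns the natural gradient of ln P_theta(x)
# at theta, as a vector in the same coordinates as theta.

def igo_step(theta, f, sample, nat_grad_log_p, w, n, delta_t, rng):
    X = sample(theta, n, rng)                          # X_i ~ P_{theta_t}
    fx = np.array([f(x) for x in X])
    rk = np.array([np.sum(fx < fi) for fi in fx])      # rk(X_i) = #{j : f(X_j) < f(X_i)}
    w_hat = w((rk + 0.5) / n) / n                      # hat w_i
    step = sum(wi * nat_grad_log_p(theta, x) for wi, x in zip(w_hat, X))
    return theta + delta_t * step
```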
Instantiation of IGO: Multivariate Normal Distributions [Akimoto et al. 2010]
$P_\theta$ multivariate normal distribution, $\theta = (m, C)$
IGO algorithm:
$$m_{t+\delta t} = m_t + \delta t \sum_{i=1}^N \hat{w}_i\, (X_i - m_t)$$
$$C_{t+\delta t} = C_t + \delta t \sum_{i=1}^N \hat{w}_i \big((X_i - m_t)(X_i - m_t)^T - C_t\big)$$
This recovers CMA-ES with rank-$\mu$ update: $N = \lambda$, $\delta t$ is the learning rate for the covariance matrix, with an additional learning rate for the mean.
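A sketch of this Gaussian instantiation (rank-$\mu$-type update only; the step-size adaptation and cumulation of the full CMA-ES are deliberately not modelled). The truncation weight $w(q) = \mathbf{1}\{q \le 1/2\}$, $\delta t$, and $\lambda$ are illustrative choices:

```python
import numpy as np

# Sketch of the Gaussian IGO update above (rank-mu-type update only; the
# step-size adaptation and cumulation of the full CMA-ES are not modelled).
# Truncation weight w(q) = 1{q <= 1/2}; delta_t and lambda are illustrative.

def gaussian_igo_step(m, C, f, lam, delta_t, rng):
    X = rng.multivariate_normal(m, C, size=lam)               # X_i ~ N(m_t, C_t)
    fx = np.array([f(x) for x in X])
    rk = np.array([np.sum(fx < v) for v in fx])               # rk(X_i)
    w_hat = ((rk + 0.5) / lam <= 0.5).astype(float) / lam     # hat w_i
    Y = X - m                                                 # X_i - m_t
    m_new = m + delta_t * w_hat @ Y
    C_new = C + delta_t * sum(wi * (np.outer(y, y) - C) for wi, y in zip(w_hat, Y))
    return m_new, (C_new + C_new.T) / 2.0                     # keep C symmetric

rng = np.random.default_rng(3)
m, C = np.full(5, 3.0), np.eye(5)
sphere = lambda x: float(np.dot(x, x))
for _ in range(300):
    m, C = gaussian_igo_step(m, C, sphere, lam=20, delta_t=0.2, rng=rng)
print(np.linalg.norm(m))   # much smaller than the initial distance to the optimum
```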
Instantiation of IGO: Bernoulli Measures
$\Omega = \{0,1\}^d$, $P_\theta(x) = p_{\theta_1}(x_1) \cdots p_{\theta_d}(x_d)$: family of Bernoulli measures
This recovers
- PBIL (Population-Based Incremental Learning) [Baluja, Caruana 1995]
- cGA (compact Genetic Algorithm) [Harik et al. 1999]
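A sketch of the Bernoulli instantiation: with $\theta_j = p_j$, the natural gradient of $\ln P_\theta(x)$ is simply $x_j - p_j$ (the Fisher information $1/(p_j(1-p_j))$ cancels the denominator of the score), giving a PBIL/cGA-style probability update. The OneMax objective and the weight function below are illustrative choices:

```python
import numpy as np

# Sketch of the Bernoulli instantiation of IGO: the natural gradient of
# ln P_theta(x) with respect to p_j is x_j - p_j, so the IGO step updates the
# probability vector toward the selected samples, as in PBIL/cGA.
# OneMax objective and truncation weight are illustrative choices.

def bernoulli_igo_step(p, f, lam, delta_t, rng):
    X = (rng.random((lam, p.size)) < p).astype(float)    # X_i ~ P_theta
    fx = np.array([f(x) for x in X])
    rk = np.array([np.sum(fx < v) for v in fx])
    w_hat = ((rk + 0.5) / lam <= 0.5).astype(float) / lam
    p = p + delta_t * w_hat @ (X - p)                     # natural-gradient step
    return np.clip(p, 0.05, 0.95)                         # keep away from the boundary

rng = np.random.default_rng(4)
p = np.full(20, 0.5)
onemax = lambda x: -float(np.sum(x))                      # minimize -sum(x)
for _ in range(200):
    p = bernoulli_igo_step(p, onemax, lam=30, delta_t=0.3, rng=rng)
print(p.round(2))   # probabilities pushed toward 1 on OneMax
```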
Conclusions
The Information Geometric Optimization framework gives
- a unified picture of discrete and continuous optimization
- theoretical foundations for existing algorithms (CMA-ES is state-of-the-art in continuous black-box optimization)
Some parts of the CMA-ES algorithm are not explained by the IGO framework: step-size adaptation, cumulation.
New algorithms: a large-scale variant of CMA-ES based on IGO, …