Stochastic Methods for Continuous Optimization
Anne Auger and Dimo Brockhoff
Paris-Saclay Master - Master 2 Informatique - Parcours Apprentissage, Information et Contenu (AIC)
anne.auger@inria.fr
2015
Overview
◮ Problem Statement
◮ Black Box Optimization and Its Difficulties
◮ Non-Separable Problems
◮ Ill-Conditioned Problems
◮ Stochastic search algorithms - basics
◮ A Search Template
◮ A Natural Search Distribution: the Normal Distribution
◮ Adaptation of Distribution Parameters: What to Achieve?
◮ Adaptive Evolution Strategies
◮ Mean Vector Adaptation
◮ Step-size control
◮ Theory
◮ Algorithms
◮ Covariance Matrix Adaptation
◮ Rank-One Update
◮ Cumulation: the Evolution Path
◮ Rank-µ Update
◮ Summary and Final Remarks
Problem Statement
Continuous Domain Search/Optimization
◮ Task: minimize an objective function (fitness function, loss function) in continuous domain
  $f : \mathcal{X} \subseteq \mathbb{R}^n \to \mathbb{R}, \quad x \mapsto f(x)$
◮ Black Box scenario (direct search scenario)
  [Figure: black box taking input x and returning f(x)]
  ◮ gradients are not available or not useful
  ◮ problem domain specific knowledge is used only within the black box, e.g. within an appropriate encoding
◮ Search costs: number of function evaluations
What Makes a Function Difficult to Solve?
Why stochastic search?
◮ non-linear, non-quadratic, non-convex
  on linear and quadratic functions much better search policies are available
◮ ruggedness
  non-smooth, discontinuous, multimodal, and/or noisy function
◮ dimensionality (size of search space)
  (considerably) larger than three
◮ non-separability
  dependencies between the objective variables
◮ ill-conditioning
[Figures: example plots illustrating these difficulties; the ill-conditioning plot is labeled with the gradient direction and the Newton direction]
Separable Problems
Definition (Separable Problem): A function f is separable if
$$\arg\min_{(x_1,\dots,x_n)} f(x_1,\dots,x_n) = \Big( \arg\min_{x_1} f(x_1,\dots), \;\dots,\; \arg\min_{x_n} f(\dots,x_n) \Big)$$
⇒ it follows that f can be optimized in a sequence of n independent 1-D optimization processes
Example: Additively decomposable functions
$$f(x_1,\dots,x_n) = \sum_{i=1}^{n} f_i(x_i)$$
Rastrigin function: $f(x) = 10n + \sum_{i=1}^{n} \big( x_i^2 - 10\cos(2\pi x_i) \big)$
[Figure: level sets of the Rastrigin function on $[-3,3]^2$]
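The definition above can be tried out numerically. The following is a minimal sketch, not part of the original slides: it minimizes the Rastrigin function one coordinate at a time with a plain grid search. The helper name `coordinate_wise_minimize`, the grid resolution, and the starting point are assumptions chosen for illustration.

```python
import math

def rastrigin(x):
    """Rastrigin function: 10 n + sum_i (x_i^2 - 10 cos(2 pi x_i))."""
    n = len(x)
    return 10 * n + sum(xi ** 2 - 10 * math.cos(2 * math.pi * xi) for xi in x)

def coordinate_wise_minimize(f, x0, grid):
    """Hypothetical helper: one pass of n independent 1-D grid searches.

    Coordinate i is set to the grid value minimizing f while all other
    coordinates are held fixed. For a separable f, one pass is enough
    to reach the best point on the grid.
    """
    x = list(x0)
    for i in range(len(x)):
        x[i] = min(grid, key=lambda v: f(x[:i] + [v] + x[i + 1:]))
    return x

# Candidate values in [-3, 3] with spacing 0.01 (an arbitrary choice).
grid = [k / 100 for k in range(-300, 301)]

x_best = coordinate_wise_minimize(rastrigin, [2.5, -1.7, 0.3], grid)
f_best = rastrigin(x_best)
print(x_best, f_best)
```

Because each 1-D search scans the whole interval, the many local optima of Rastrigin do not matter here: the cost grows linearly with n instead of exponentially.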
Non-Separable Problems
Building a non-separable problem from a separable one (1, 2)
Rotating the coordinate system
◮ $f : x \mapsto f(x)$ separable
◮ $f : x \mapsto f(Rx)$ non-separable, where R is a rotation matrix
[Figure: level sets of the separable function (left) and of its rotated version (right), both on $[-3,3]^2$]
1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann
2 Salomon (1996). "Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms." BioSystems, 39(3):263-278
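To illustrate why rotation matters, here is a small sketch under stated assumptions: a 2-D ellipsoid stands in for the separable function (the slides use Rastrigin in the figure), the rotation angle is 45 degrees, and `one_pass_coordinate_search` is a hypothetical helper. A single coordinate-wise pass solves the separable version but stalls on the rotated one.

```python
import math

def ellipsoid(x):
    """Separable test function (an assumption for this sketch)."""
    return x[0] ** 2 + 100 * x[1] ** 2

def rotate(x, angle):
    """Apply the 2-D rotation matrix R(angle) to x."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]

def rotated_ellipsoid(x):
    """x -> f(R x): same level sets, rotated by 45 degrees, non-separable."""
    return ellipsoid(rotate(x, math.pi / 4))

def one_pass_coordinate_search(f, x0, grid):
    """One sweep of independent 1-D grid searches, one per coordinate."""
    x = list(x0)
    for i in range(len(x)):
        x[i] = min(grid, key=lambda v: f(x[:i] + [v] + x[i + 1:]))
    return x

grid = [k / 100 for k in range(-300, 301)]
start = [1.0, 1.0]

x_sep = one_pass_coordinate_search(ellipsoid, start, grid)
x_rot = one_pass_coordinate_search(rotated_ellipsoid, start, grid)

print("separable:", x_sep, ellipsoid(x_sep))
print("rotated:  ", x_rot, rotated_ellipsoid(x_rot))
```

On the rotated function the optimal value of one coordinate depends on the other, so the sweep ends away from the optimum at the origin.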
Ill-Conditioned Problems
◮ If f is convex quadratic,
  $f : x \mapsto \frac{1}{2} x^T H x = \frac{1}{2} \sum_i h_{i,i} x_i^2 + \frac{1}{2} \sum_{i \neq j} h_{i,j} x_i x_j$,
  with H a symmetric positive definite matrix; H is the Hessian matrix of f
◮ ill-conditioned means a high condition number of the Hessian matrix H
  $\mathrm{cond}(H) = \dfrac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$
Example / exercise
  $f(x) = \frac{1}{2} (x_1^2 + 9 x_2^2)$
  condition number equals 9
[Figure: shape of the iso-fitness lines on $[-1,1]^2$]
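The condition number in the example can be checked directly. The sketch below is an illustration only: it uses the closed-form eigenvalues of a symmetric 2x2 matrix, and the function names are hypothetical.

```python
import math

def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius

def condition_number(a, b, d):
    """cond(H) = lambda_max / lambda_min for a positive definite 2x2 matrix."""
    lam_min, lam_max = eigenvalues_sym_2x2(a, b, d)
    if lam_min <= 0:
        raise ValueError("matrix is not positive definite")
    return lam_max / lam_min

# Hessian of f(x) = 1/2 (x_1^2 + 9 x_2^2) is diag(1, 9).
cond = condition_number(1.0, 0.0, 9.0)
print(cond)  # 9.0
```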
Ill-Conditioned Problems
consider the curvature of iso-fitness lines
ill-conditioned means "squeezed" lines of equal function value (high curvatures)
gradient direction: $-f'(x)^T$
Newton direction: $-H^{-1} f'(x)^T$
Condition number equals nine here. Condition numbers up to $10^{10}$ are not unusual in real-world problems.
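As a worked illustration of the two directions, the following sketch evaluates them for the example function $f(x) = \frac{1}{2}(x_1^2 + 9 x_2^2)$, whose Hessian is diag(1, 9). The evaluation point (1, 1) and all names are assumptions for this example.

```python
def gradient(x):
    """Gradient of f(x) = 1/2 (x_1^2 + 9 x_2^2)."""
    return [x[0], 9.0 * x[1]]

# The Hessian is diagonal here, so H^{-1} acts coordinate by coordinate.
HESSIAN_DIAGONAL = [1.0, 9.0]

def gradient_direction(x):
    """Steepest descent direction -f'(x)^T."""
    return [-g for g in gradient(x)]

def newton_direction(x):
    """Newton direction -H^{-1} f'(x)^T."""
    return [-g / h for g, h in zip(gradient(x), HESSIAN_DIAGONAL)]

x = [1.0, 1.0]
d_grad = gradient_direction(x)      # [-1.0, -9.0]
d_newton = newton_direction(x)      # [-1.0, -1.0]
x_after_newton = [xi + di for xi, di in zip(x, d_newton)]

print(d_grad, d_newton, x_after_newton)
```

The gradient direction is dominated by the steep coordinate and points away from the optimum, while the Newton direction points straight at the origin: on a quadratic, one full Newton step reaches the minimum.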
Stochastic Search
A black box search template to minimize $f : \mathbb{R}^n \to \mathbb{R}$
Initialize distribution parameters $\theta$, set population size $\lambda \in \mathbb{N}$
While not terminate
1. Sample distribution $P(x \mid \theta) \to x_1, \dots, x_\lambda \in \mathbb{R}^n$
2. Evaluate $x_1, \dots, x_\lambda$ on f
3. Update parameters $\theta \leftarrow F_\theta(\theta, x_1, \dots, x_\lambda, f(x_1), \dots, f(x_\lambda))$
Everything depends on the definition of $P$ and $F_\theta$
In Evolutionary Algorithms the distribution P is often implicitly defined via operators on a population, in particular, selection, recombination and mutation
Natural template for Estimation of Distribution Algorithms
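The template can be sketched in a few lines. This is only one possible instance, under loudly stated assumptions: P is an isotropic Gaussian with fixed step-size, and the update rule simply moves the mean to the best sample when it improves. Neither choice comes from the slides; the sphere objective and all names are hypothetical.

```python
import random

def sphere(x):
    """Hypothetical test objective: sum of squares, minimum 0 at the origin."""
    return sum(xi * xi for xi in x)

def sample(theta, rng):
    """Step 1: draw one point from P(x | theta), here an isotropic Gaussian."""
    return [mi + theta["sigma"] * rng.gauss(0.0, 1.0) for mi in theta["m"]]

def update(theta, xs, fs):
    """Step 3: a deliberately simple F_theta (an assumption, not the slides' rule).

    The mean moves to the best sample if that sample improves on the
    current mean; sigma stays fixed.
    """
    best = min(range(len(xs)), key=lambda i: fs[i])
    if fs[best] < theta["f_m"]:
        return {"m": xs[best], "sigma": theta["sigma"], "f_m": fs[best]}
    return theta

def stochastic_search(f, theta, lam, iterations, rng):
    """Black box search template: sample, evaluate, update."""
    for _ in range(iterations):
        xs = [sample(theta, rng) for _ in range(lam)]   # 1. sample
        fs = [f(x) for x in xs]                         # 2. evaluate
        theta = update(theta, xs, fs)                   # 3. update
    return theta

rng = random.Random(1)
m0 = [3.0, -2.0, 1.0]
theta0 = {"m": m0, "sigma": 0.3, "f_m": sphere(m0)}
theta_final = stochastic_search(sphere, theta0, lam=10, iterations=200, rng=rng)
print(theta_final["m"], theta_final["f_m"])
```

With a fixed step-size the search stalls once the distance to the optimum is small compared with sigma, which is one motivation for adapting the distribution parameters.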
Evolution Strategies
New search points are sampled normally distributed, as perturbations of m:
$$x_i \sim m + \sigma\, \mathcal{N}_i(0, C) \quad \text{for } i = 1, \dots, \lambda$$
where $x_i, m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $C \in \mathbb{R}^{n \times n}$
where
◮ the mean vector $m \in \mathbb{R}^n$ represents the favorite solution
◮ the so-called step-size $\sigma \in \mathbb{R}_+$ controls the step length
◮ the covariance matrix $C \in \mathbb{R}^{n \times n}$ determines the shape of the distribution ellipsoid
here, all new points are sampled with the same parameters
The question remains how to update m, C, and σ.
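The sampling step above can be sketched as follows. The approach assumed here is the standard one of factorizing $C = A A^T$ (Cholesky) and setting $x = m + \sigma A z$ with $z \sim \mathcal{N}(0, I)$; the slides do not prescribe a factorization, and the numerical values of m, σ and C are made up for the example.

```python
import math
import random

def cholesky(C):
    """Lower-triangular A with A A^T = C, for symmetric positive definite C."""
    n = len(C)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                A[i][j] = math.sqrt(C[i][i] - s)
            else:
                A[i][j] = (C[i][j] - s) / A[j][j]
    return A

def sample_population(m, sigma, C, lam, rng):
    """Draw lam points x_i = m + sigma * A z_i with z_i ~ N(0, I)."""
    A = cholesky(C)
    n = len(m)
    population = []
    for _ in range(lam):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [sum(A[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        population.append([m[i] + sigma * y[i] for i in range(n)])
    return population

rng = random.Random(42)
m = [1.0, -1.0]                      # illustrative mean vector
sigma = 0.5                          # illustrative step-size
C = [[4.0, 1.2], [1.2, 1.0]]         # illustrative covariance matrix

points = sample_population(m, sigma, C, lam=5000, rng=rng)
mean = [sum(p[i] for p in points) / len(points) for i in range(2)]
print(mean)
```

All points share the same m, σ and C, matching the remark on the slide; the empirical mean of a large sample should be close to m.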
Normal Distribution
1-D case
Standard Normal Distribution
probability density of the 1-D standard normal distribution $\mathcal{N}(0, 1)$
(expected (mean) value, variance) = (0, 1)
$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{x^2}{2} \right)$$
[Figure: probability density of $\mathcal{N}(0, 1)$ on $[-4, 4]$]
General case
◮ Normal distribution $\mathcal{N}(m, \sigma^2)$
  (expected value, variance) = $(m, \sigma^2)$
  density: $p_{m,\sigma}(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\dfrac{(x - m)^2}{2\sigma^2} \right)$
◮ A normal distribution is entirely determined by its mean value and variance
◮ The family of normal distributions is closed under linear transformations: if X is normally distributed then a linear transformation aX + b is also normally distributed
◮ Exercise: Show that $m + \sigma\, \mathcal{N}(0, 1) = \mathcal{N}(m, \sigma^2)$
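The following sketch is a numerical sanity check of the statement in the exercise, not a proof. It implements the density formula and compares the empirical mean and variance of $m + \sigma z$, with z standard normal, against $m$ and $\sigma^2$. The values of m, σ, the sample size and the seed are arbitrary assumptions.

```python
import math
import random

def normal_pdf(x, m=0.0, sigma=1.0):
    """Density of N(m, sigma^2) at x."""
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

rng = random.Random(0)
m, sigma, count = 2.0, 1.5, 50000

# Transform standard normal draws: x = m + sigma * z.
samples = [m + sigma * rng.gauss(0.0, 1.0) for _ in range(count)]

emp_mean = sum(samples) / count
emp_var = sum((s - emp_mean) ** 2 for s in samples) / count

print("empirical mean:", emp_mean, "expected:", m)
print("empirical variance:", emp_var, "expected:", sigma ** 2)
print("peak of standard normal density:", normal_pdf(0.0))
```

Matching the first two moments is consistent with the claim because, as stated above, a normal distribution is entirely determined by its mean and variance, and linear transformations preserve normality.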