Stochastic Methods for Continuous Optimization Anne Auger and Dimo - PowerPoint PPT Presentation

Stochastic Methods for Continuous Optimization Anne Auger and Dimo Brockhoff Paris-Saclay Master - Master 2 Informatique - Parcours Apprentissage, Information et Contenu (AIC) anne.auger@inria.fr 2015

Overview
◮ Problem Statement
  ◮ Black Box Optimization and Its Difficulties
  ◮ Non-Separable Problems
  ◮ Ill-Conditioned Problems
◮ Stochastic search algorithms - basics
  ◮ A Search Template
  ◮ A Natural Search Distribution: the Normal Distribution
  ◮ Adaptation of Distribution Parameters: What to Achieve?
◮ Adaptive Evolution Strategies
  ◮ Mean Vector Adaptation
  ◮ Step-size control
    ◮ Theory
    ◮ Algorithms
  ◮ Covariance Matrix Adaptation
    ◮ Rank-One Update
    ◮ Cumulation: the Evolution Path
    ◮ Rank-µ Update
◮ Summary and Final Remarks

Problem Statement: Continuous Domain Search/Optimization
◮ Task: minimize an objective function (fitness function, loss function) in continuous domain
  $f : \mathcal{X} \subseteq \mathbb{R}^n \to \mathbb{R}, \quad x \mapsto f(x)$
◮ Black Box scenario (direct search scenario): $x \to$ [black box] $\to f(x)$
  ◮ gradients are not available or not useful
  ◮ problem domain specific knowledge is used only within the black box, e.g. within an appropriate encoding
◮ Search costs: number of function evaluations
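Since the search cost is measured in function evaluations, it can help to wrap the black box in a counter. The following is a minimal sketch; the class name `CountingObjective` and the test function `sphere` are illustrative assumptions, not part of the slides.

```python
class CountingObjective:
    """Wrap a black-box function and count how often it is evaluated."""

    def __init__(self, f):
        self.f = f
        self.evaluations = 0

    def __call__(self, x):
        # The optimizer only sees x -> f(x); the counter is a side effect.
        self.evaluations += 1
        return self.f(x)


def sphere(x):
    # Hypothetical test function: sum of squares, minimum 0 at the origin.
    return sum(xi * xi for xi in x)


f = CountingObjective(sphere)
values = [f([1.0, 2.0]), f([0.0, 0.0]), f([-1.0, 0.5])]
print(values, f.evaluations)
```

Any search method can then be compared on `f.evaluations` rather than on wall-clock time.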

What Makes a Function Difficult to Solve? Why stochastic search?
◮ non-linear, non-quadratic, non-convex
  on linear and quadratic functions much better search policies are available
◮ ruggedness
  non-smooth, discontinuous, multimodal, and/or noisy function
◮ dimensionality (size of search space)
  (considerably) larger than three
◮ non-separability
  dependencies between the objective variables
◮ ill-conditioning
[Figure: plots illustrating these difficulties; labels: gradient direction, Newton direction]

Separable Problems
Definition (Separable Problem): A function $f$ is separable if
$\arg\min_{(x_1,\dots,x_n)} f(x_1,\dots,x_n) = \left( \arg\min_{x_1} f(x_1,\dots), \dots, \arg\min_{x_n} f(\dots,x_n) \right)$
⇒ it follows that $f$ can be optimized in a sequence of $n$ independent 1-D optimization processes
Example: Additively decomposable functions
$f(x_1,\dots,x_n) = \sum_{i=1}^{n} f_i(x_i)$
Rastrigin function: $f(x) = 10n + \sum_{i=1}^{n} \left( x_i^2 - 10\cos(2\pi x_i) \right)$
[Figure: Rastrigin function; both axes from −3 to 3]
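The consequence of separability can be checked numerically: minimizing the Rastrigin function one coordinate at a time, with the others held fixed, already reaches the global optimum. This is a minimal sketch; the grid, the starting point, and the helper name `coordinatewise_grid_search` are assumptions chosen for illustration.

```python
import math


def rastrigin(x):
    """Rastrigin function: 10 n + sum_i (x_i^2 - 10 cos(2 pi x_i))."""
    n = len(x)
    return 10 * n + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


def coordinatewise_grid_search(f, x0, grid):
    """Minimize f one coordinate at a time, keeping the others fixed."""
    x = list(x0)
    for i in range(len(x)):
        # 1-D search over the grid for coordinate i only.
        x[i] = min(grid, key=lambda v: f(x[:i] + [v] + x[i + 1:]))
    return x


grid = [k / 100 for k in range(-300, 301)]  # illustrative 1-D grid on [-3, 3]
x_start = [2.3, -1.7, 0.9]
x_best = coordinatewise_grid_search(rastrigin, x_start, grid)
print(x_best, rastrigin(x_best))
```

One pass over the $n$ coordinates suffices here because each term of the sum depends on a single variable.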

Non-Separable Problems
Building a non-separable problem from a separable one (1, 2)
Rotating the coordinate system
◮ $f : x \mapsto f(x)$ separable
◮ $f : x \mapsto f(Rx)$ non-separable, $R$ rotation matrix
[Figure: the separable function (left) and its rotated version (right), related by $R$; both axes from −3 to 3]
1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann
2 Salomon (1996). "Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms." BioSystems, 39(3):263-278
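The effect of the rotation can be illustrated in two dimensions. In the sketch below, the separable function $x_1^2 + 9x_2^2$, the rotation angle of 45 degrees, and the grid are assumptions chosen for illustration; the helper `best_x1` finds the best first coordinate while the second is held fixed.

```python
import math


def separable_ellipsoid(x):
    # Hypothetical separable test function: f(x) = x_1^2 + 9 x_2^2.
    return x[0] ** 2 + 9 * x[1] ** 2


def rotate(x, angle):
    """Apply the 2-D rotation matrix R(angle) to x."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]


def rotated_ellipsoid(x, angle=math.pi / 4):
    # Non-separable variant: x -> f(R x).
    return separable_ellipsoid(rotate(x, angle))


def best_x1(f, x2, grid):
    """Best first coordinate on a 1-D grid while x2 is held fixed."""
    return min(grid, key=lambda x1: f([x1, x2]))


grid = [k / 10 for k in range(-20, 21)]
for f in (separable_ellipsoid, rotated_ellipsoid):
    print(f.__name__, best_x1(f, 1.0, grid), best_x1(f, -1.0, grid))
```

For the separable function the best $x_1$ is the same whatever $x_2$ is; after rotation it changes with $x_2$, so the coordinates can no longer be optimized independently.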

Ill-Conditioned Problems
◮ If $f$ is convex quadratic, $f : x \mapsto \frac{1}{2} x^T H x = \frac{1}{2} \sum_i h_{i,i} x_i^2 + \frac{1}{2} \sum_{i \neq j} h_{i,j} x_i x_j$, with $H$ a symmetric positive definite matrix
  $H$ is the Hessian matrix of $f$
◮ ill-conditioned means a high condition number of the Hessian matrix $H$
  $\mathrm{cond}(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$
Example / exercise: $f(x) = \frac{1}{2}(x_1^2 + 9 x_2^2)$, condition number equals 9
[Figure: Shape of the iso-fitness lines; both axes from −1 to 1]
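The condition number in the example can be verified with the closed-form eigenvalues of a symmetric 2x2 matrix. This is a minimal sketch restricted to the 2-D case; the function names are illustrative assumptions.

```python
import math


def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


def condition_number(a, b, d):
    """cond(H) = lambda_max(H) / lambda_min(H) for positive definite H."""
    lam_min, lam_max = eigenvalues_sym_2x2(a, b, d)
    return lam_max / lam_min


# Hessian of f(x) = 1/2 (x_1^2 + 9 x_2^2) is diag(1, 9).
cond_axis_aligned = condition_number(1.0, 0.0, 9.0)
# The same Hessian rotated by 45 degrees is [[5, 4], [4, 5]].
cond_rotated = condition_number(5.0, 4.0, 5.0)
print(cond_axis_aligned, cond_rotated)
```

Both calls give the same value, because a rotation changes the eigenvectors of $H$ but not its eigenvalues.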

Ill-Conditioned Problems
◮ consider the curvature of iso-fitness lines
◮ ill-conditioned means "squeezed" lines of equal function value (high curvatures)
◮ gradient direction: $-f'(x)^T$
◮ Newton direction: $-H^{-1} f'(x)^T$
Condition number equals nine here. Condition numbers up to $10^{10}$ are not unusual in real-world problems.
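The difference between the two directions can be made concrete on the example $f(x) = \frac{1}{2}(x_1^2 + 9x_2^2)$ with $H = \mathrm{diag}(1, 9)$. The sketch below hard-codes this function; the evaluation point $(1, 1)$ is an arbitrary choice for illustration.

```python
def gradient(x):
    # f(x) = 1/2 (x_1^2 + 9 x_2^2)  =>  f'(x) = (x_1, 9 x_2)
    return [x[0], 9.0 * x[1]]


def newton_direction(x):
    # H = diag(1, 9), so applying H^{-1} divides each component by h_ii.
    g = gradient(x)
    return [-g[0] / 1.0, -g[1] / 9.0]


x = [1.0, 1.0]
neg_grad = [-g for g in gradient(x)]
newton = newton_direction(x)
x_after_newton = [xi + di for xi, di in zip(x, newton)]
print(neg_grad, newton, x_after_newton)
```

The negative gradient is dominated by the steep coordinate $x_2$, whereas the Newton direction points straight at the optimum at the origin.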

Stochastic Search
A black box search template to minimize $f : \mathbb{R}^n \to \mathbb{R}$
Initialize distribution parameters $\theta$, set population size $\lambda \in \mathbb{N}$
While not terminate
1. Sample distribution $P(x \mid \theta) \to x_1, \dots, x_\lambda \in \mathbb{R}^n$
2. Evaluate $x_1, \dots, x_\lambda$ on $f$
3. Update parameters $\theta \leftarrow F_\theta(\theta, x_1, \dots, x_\lambda, f(x_1), \dots, f(x_\lambda))$

Everything depends on the definition of $P$ and $F_\theta$

In Evolutionary Algorithms the distribution $P$ is often implicitly defined via operators on a population, in particular, selection, recombination and mutation.
Natural template for Estimation of Distribution Algorithms.
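The sample, evaluate, update loop of the template can be sketched as a generic function that takes $P$ and $F_\theta$ as arguments. The concrete choices below are assumptions made only to obtain a runnable example: $P$ is an isotropic normal distribution, $\theta = (m, \sigma, f(m))$, and $F_\theta$ is a simple elitist rule with fixed step size that keeps the best point seen so far. This is not one of the adaptive update rules named in the overview.

```python
import random


def stochastic_search(f, sample, update, theta, lam, iterations):
    """Generic template: sample, evaluate, update distribution parameters."""
    for _ in range(iterations):
        xs = [sample(theta) for _ in range(lam)]  # 1. sample x_1, ..., x_lambda
        fs = [f(x) for x in xs]                   # 2. evaluate on f
        theta = update(theta, xs, fs)             # 3. update theta
    return theta


def sphere(x):
    # Hypothetical test function: sum of squares.
    return sum(xi * xi for xi in x)


def sample_isotropic(theta):
    """P(x | theta): normal distribution around m with standard deviation sigma."""
    m, sigma, _ = theta
    return [mi + sigma * random.gauss(0.0, 1.0) for mi in m]


def update_elitist(theta, xs, fs):
    """F_theta: move m to the best sample only if it improves on f(m)."""
    m, sigma, f_m = theta
    best = min(range(len(xs)), key=lambda i: fs[i])
    if fs[best] < f_m:
        m, f_m = xs[best], fs[best]
    return m, sigma, f_m


random.seed(1)
m0 = [3.0, -2.0]
theta0 = (m0, 0.3, sphere(m0))
m_final, sigma_final, f_final = stochastic_search(
    sphere, sample_isotropic, update_elitist, theta0, lam=10, iterations=200)
print(f_final)
```

Swapping in other `sample` and `update` functions changes the algorithm without touching the loop, which is the sense in which everything depends on $P$ and $F_\theta$.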

Evolution Strategies
New search points are sampled normally distributed
$x_i \sim m + \sigma\, \mathcal{N}_i(0, C)$ for $i = 1, \dots, \lambda$
as perturbations of $m$, where $x_i, m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$, $C \in \mathbb{R}^{n \times n}$

where
◮ the mean vector $m \in \mathbb{R}^n$ represents the favorite solution
◮ the so-called step-size $\sigma \in \mathbb{R}_+$ controls the step length
◮ the covariance matrix $C \in \mathbb{R}^{n \times n}$ determines the shape of the distribution ellipsoid
here, all new points are sampled with the same parameters

The question remains how to update $m$, $C$, and $\sigma$.
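One common way to draw $x_i \sim m + \sigma\,\mathcal{N}(0, C)$ is to factor $C = AA^T$ and transform standard normal vectors. The sketch below does this for the 2-D case with a hand-written Cholesky factor; the values of $m$, $\sigma$, $C$ and the population size are arbitrary assumptions, and the slides do not prescribe this particular factorization.

```python
import math
import random


def cholesky_2x2(C):
    """Lower-triangular A with A A^T = C for a symmetric positive definite 2x2 C."""
    a11 = math.sqrt(C[0][0])
    a21 = C[1][0] / a11
    a22 = math.sqrt(C[1][1] - a21 * a21)
    return [[a11, 0.0], [a21, a22]]


def sample_population(m, sigma, C, lam, rng):
    """Draw lam points x_i = m + sigma * A z_i with z_i ~ N(0, I)."""
    A = cholesky_2x2(C)
    population = []
    for _ in range(lam):
        z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        y = [A[0][0] * z[0], A[1][0] * z[0] + A[1][1] * z[1]]  # y ~ N(0, C)
        population.append([m[0] + sigma * y[0], m[1] + sigma * y[1]])
    return population


rng = random.Random(0)
m, sigma, C = [1.0, -1.0], 0.5, [[4.0, 1.2], [1.2, 1.0]]
xs = sample_population(m, sigma, C, lam=20000, rng=rng)
mean = [sum(x[k] for x in xs) / len(xs) for k in range(2)]
print(mean)
```

All points share the same $m$, $\sigma$ and $C$; the empirical mean of a large sample should lie close to $m$.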

Normal Distribution
1-D case: Standard Normal Distribution
probability density of the 1-D standard normal distribution $\mathcal{N}(0, 1)$, (expected (mean) value, variance) = (0, 1)
$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right)$
[Figure: probability density of $\mathcal{N}(0, 1)$ for $x$ from −4 to 4]
General case
◮ Normal distribution $\mathcal{N}(m, \sigma^2)$, (expected value, variance) = $(m, \sigma^2)$
  density: $p_{m,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-m)^2}{2\sigma^2} \right)$
◮ A normal distribution is entirely determined by its mean value and variance
◮ The family of normal distributions is closed under linear transformations: if $X$ is normally distributed then a linear transformation $aX + b$ is also normally distributed
◮ Exercise: Show that $m + \sigma\, \mathcal{N}(0, 1) = \mathcal{N}(m, \sigma^2)$
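The exercise can be checked numerically before proving it. By a change of variables, the density of $m + \sigma Z$ with $Z \sim \mathcal{N}(0, 1)$ and $\sigma > 0$ is $\frac{1}{\sigma}\, p\!\left(\frac{x - m}{\sigma}\right)$; the sketch below compares this with $p_{m,\sigma}(x)$ at a few points. The values of $m$, $\sigma$ and the test points are arbitrary assumptions, and a numerical check is not a substitute for the proof.

```python
import math
from statistics import NormalDist


def standard_density(x):
    """Density of N(0, 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def normal_density(x, m, sigma):
    """Density p_{m,sigma} of N(m, sigma^2)."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def density_of_shifted_scaled(x, m, sigma):
    # Density of m + sigma * Z with Z ~ N(0, 1), by change of variables.
    return standard_density((x - m) / sigma) / sigma


m, sigma = 1.5, 0.7
points = [-1.0, 0.0, 1.5, 2.2, 4.0]
max_gap = max(
    abs(normal_density(x, m, sigma) - density_of_shifted_scaled(x, m, sigma))
    for x in points)
# Cross-check against the standard library's normal distribution.
max_gap_stdlib = max(
    abs(normal_density(x, m, sigma) - NormalDist(m, sigma).pdf(x))
    for x in points)
print(max_gap, max_gap_stdlib)
```

Both gaps are at the level of floating-point rounding, which is what the identity predicts.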
