Stochastic / Randomized Derivative Free Optimization

Anne Auger (Inria and CMAP, Ecole Polytechnique, IP Paris)

Class notes for Optimization Master and AMS Master, Paris Saclay

December 7, 2019

1 Preamble

These notes are intended for the students following the Derivative Free Optimization class of the Optimization Master of Paris Saclay and of the AMS (Analyse, Modélisation et Simulation) master. The material presented in the lecture does not follow a particular textbook, and these notes are meant to compensate for the lack of one. I appreciate any feedback. Bonus points will be given to students who find mistakes or typos that help me improve the notes.

Contents

1 A few Definitions, Reminders and Terminology
    1.1 Reminders and Terminology
    1.2 Definitions
        1.2.1 Argmin and argmax
        1.2.2 Level sets and sublevel sets
        1.2.3 Convex-quadratic functions
        1.2.4 Probability and Statistics
2 Introduction to Black-box Optimization
    2.1 Derivative-free and Black-box Optimization
        2.1.1 Examples
        2.1.2 What is the goal?
        2.1.3 Cost / Runtime of an algorithm
    2.2 What makes an optimization problem difficult?
        2.2.1 Curse of Dimensionality
        2.2.2 Ill-conditioning
        2.2.3 Non-separability
        2.2.4 Multi-modality
        2.2.5 Ruggedness
        2.2.6 Non-xxx

What is this class about?

This class presents derivative-free, or black-box, optimization methods for numerical problems. We will assume in most of the class an unconstrained minimization problem

    min_x f(x),   where f : Ω ⊂ R^n → R.

The class is divided into two parts. The first part is devoted to important theoretical concepts and to algorithms that are randomized or stochastic, with a strong focus on algorithms that are often considered state-of-the-art and belong to the family of Evolution Strategies. The second part is devoted to deterministic algorithms. These notes cover only the first part of the class.

Chapter 1

A few Definitions, Reminders and Terminology

1.1 Reminders and Terminology

1.2 Definitions

1.2.1 Argmin and argmax

Given a function f : R^n → R, we denote by argmin_x f(x) = A the set of points of R^n such that for all x⋆ in A, f(x⋆) ≤ f(x) for all x ∈ R^n. A similar definition holds for the argmax. We might use the somewhat ambiguous terminology of minimum to designate either the minimum function value, min { f(x) : x ∈ Ω }, or one point of R^n where this minimum is achieved, that is, a point belonging to argmin_x f(x).

1.2.2 Level sets and sublevel sets

Given a function f : x ∈ R^n → f(x) ∈ R, we define the level set of f as L_c := { x ∈ R^n | f(x) = c } for c ∈ R. We define the sublevel set of f as L_c := { x ∈ R^n | f(x) ≤ c } for c ∈ R.
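To make these definitions concrete, the following minimal sketch (my own illustration, not part of the notes) uses the sphere function f(x) = ‖x‖^2 in dimension 2: its level set L_c is the circle of radius √c, its sublevel set is the corresponding closed disk, and its argmin is {0}.

```python
import numpy as np

# Illustration (assumed example, not from the notes): sphere function in dimension 2.
def f(x):
    return float(np.dot(x, x))  # f(x) = ||x||^2

c = 4.0
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
on_level_set = np.sqrt(c) * np.stack([np.cos(theta), np.sin(theta)], axis=1)

print(all(abs(f(x) - c) < 1e-9 for x in on_level_set))  # True: points lie in the level set L_c
print(f(np.array([1.0, 0.5])) <= c)                     # True: a point of the sublevel set
print(f(np.zeros(2)))                                   # 0.0: minimum value, attained at x* = 0
```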

1.2.3 Convex-quadratic functions

Let H be a symmetric positive definite (SPD) matrix. A convex-quadratic function is defined as

    f(x) = (1/2) (x − x⋆)^T H (x − x⋆),   x ∈ R^n, x⋆ ∈ R^n.    (1.1)

Convex-quadratic functions play a central role in numerical optimization. They are simple to understand, yet they allow modeling important difficulties like ill-conditioned problems (this notion will be formalized later), so they often serve as test problems to evaluate and understand the behavior of algorithms. The exercises below, while not presenting any specific difficulty, are central for the understanding of this lecture.

Exercise 1.1 Consider a convex-quadratic function as given in (1.1). Show that
1. x⋆ is the minimum of f.
2. H is the Hessian matrix of f.
3. The function f is convex.
4. The function f has a unique optimum.

Exercise 1.2 Let f(x) = (1/2) x^T H x where x ∈ R^2 and H is the 2 × 2 matrix

    H = [ 9  0
          0  1 ].

1. Plot the level sets of f.
2. Relate the axis ratio of the level sets to the eigenvalues of H.
3. What changes in the picture of the level sets you have drawn if H is instead

    H = P^T [ 9  0
              0  1 ] P,

where P is an orthogonal matrix?
4. More generally, deduce the geometric shape of the level sets of a convex-quadratic function.

1.2.4 Probability and Statistics

Mean vector, variance, standard deviation, covariance matrix. Gaussian vectors. Global optimum, local optimum.
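Since Gaussian vectors and covariance matrices (Section 1.2.4) are central to the stochastic algorithms presented later in the class, here is a minimal sketch (my own illustration, not part of the notes) of sampling a Gaussian vector with a prescribed mean vector and covariance matrix and estimating these quantities empirically.

```python
import numpy as np

# Illustration (assumed example, not from the notes): sample a Gaussian vector
# x ~ N(m, C) in dimension 2 and estimate its mean vector and covariance matrix.
rng = np.random.default_rng(0)

m = np.array([1.0, -2.0])                  # mean vector
C = np.array([[9.0, 0.0], [0.0, 1.0]])     # covariance matrix (SPD)

# If A A^T = C and z is a standard normal vector, then m + A z has distribution N(m, C).
A = np.linalg.cholesky(C)
samples = m + rng.standard_normal((100_000, 2)) @ A.T

print(samples.mean(axis=0))                # close to m
print(np.cov(samples, rowvar=False))       # close to C
```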

Chapter 2

Introduction to Black-box Optimization

2.1 Derivative-free and Black-box Optimization

This class is centered on derivative-free optimization methods, where we are interested in minimizing a function f : Ω ⊂ R^n → R but do not have access to its derivatives or, equivalently, to the gradient of f. Those derivatives may exist in the mathematical sense (i.e., the function is differentiable) but not be easily computable, or the function may be non-differentiable. More precisely, we will assume a black-box scenario where the function f to be optimized is seen by the algorithm as a zeroth-order oracle that can be queried: we can give as input to the oracle a point x ∈ R^n and the oracle returns the function value f(x) (a first-order oracle returns f(x) and ∇f(x)). This model of the function was notably formalized for analyzing the (query) complexity of classes of algorithms depending on the information they use and on the class of problems (convex, smooth, ...). We refer to Nesterov [?] and Bubeck [?] for further details.

Remark 2.1 With this terminology of black-box optimization, Quasi-Newton methods can be seen as first-order black-box methods.

2.1.1 Examples

Many examples of (real-world) optimization problems fall into this category of black-box problems. In particular, it is very common that the function to be optimized is the result of a simulation (this is sometimes referred to as simulation-based optimization) that can involve the numerical resolution of partial differential equations, ... Those simulations are typically so complex that we do not want to look inside the simulation to extract information that could be used within the optimization algorithm; we rather handle the problem as a black-box problem.
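To make the black-box scenario of Section 2.1 concrete, here is a hypothetical sketch (the class name and structure are my own, not from the notes) of a zeroth-order oracle: the optimizer may only query function values, and the query counter models the cost / runtime notion of Section 2.1.3.

```python
import numpy as np

# Hypothetical sketch (not from the notes): a zeroth-order oracle only returns
# f(x) when queried at a point x; no gradient is available. Counting queries is
# one natural way to measure the cost of a black-box algorithm.
class ZerothOrderOracle:
    def __init__(self, f):
        self._f = f          # the black box; we never look inside it
        self.n_queries = 0   # number of function evaluations used so far

    def __call__(self, x):
        self.n_queries += 1
        return self._f(x)

# Example: wrap a convex-quadratic test function f(x) = 1/2 x^T H x with H = diag(9, 1).
oracle = ZerothOrderOracle(lambda x: 0.5 * float(x @ np.diag([9.0, 1.0]) @ x))
print(oracle(np.array([1.0, 2.0])), oracle.n_queries)   # 6.5 1
```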

The function can also be a real black-box. For instance, in the context of industry collaborations one might be asked to help optimize the design of an object, for instance part of a car, of a plane, or of a launcher, so as to optimize a certain criterion such as the production cost (while satisfying certain physical constraints) or the recurrent cost of a launcher [?]. In those cases the function to be optimized may be provided to us as executable code, but we do not have access to its source code. Hence we have to optimize a real black-box.

2.1.2 What is the goal?

So far, we have talked about optimizing a numerical optimization problem. What does this precisely mean? First of all, we should keep in mind that locating the exact solution of the problem is typically impossible because of the continuous nature of the search space. Instead, an optimization algorithm will return a sequence of points { x_t : t ∈ N } that converges to the optimum of the problem, denoted x⋆, that is,

    lim_{t→∞} x_t = x⋆.

Equivalently, given a precision ε, the algorithm will aim at returning a solution which approximates x⋆ with precision ε. For the moment we can think of precision in terms of (Euclidean) distance to the optimum, that is, the algorithm tries to find a point x_T such that ‖x_T − x⋆‖ ≤ ε. We will see later on that we should be careful in how we define precision.

Essential optimum of a function

We will see in this lecture algorithms that can handle well functions that have discontinuities. Hence, we do not assume that our underlying functions are continuous. Yet, in this context, talking about locating the optimum of a function can be rather meaningless. Consider for instance the 1-dimensional function f(x) = x^2. Its optimum is at 0. Consider now a function h which is equal to f everywhere except at x = 1, where we set h(1) = −2. Then the optimum of h is at 1. Yet, it is safe to say that every reasonable optimization algorithm will converge to x = 0 when optimizing h. We can say that the optimum of h is not robust. We can also say that if such a situation happens for a real-world problem, one will also be interested in locating x = 0 and not x = 1, which is an outlier of the function.

Exercise 2.1 Think about an alternative definition of the optimum of a function that would yield that the optimum of h is at x = 0.
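The claim above that reasonable algorithms converge to x = 0 on h can be checked empirically. The following sketch (an assumed toy random search, not one of the algorithms presented in the class, and not a solution to Exercise 2.1) optimizes h and typically ends up near 0, since sampling the isolated point x = 1 exactly has probability zero.

```python
import random

# Toy illustration (assumed, not an algorithm from the class): h equals x^2
# everywhere except at the single point x = 1, where h(1) = -2, so its global
# minimizer is 1. A simple randomized search nevertheless converges towards 0.
def h(x):
    return -2.0 if x == 1.0 else x * x

random.seed(0)
x, sigma = 3.0, 1.0                       # initial point and step size
for t in range(300):
    candidate = x + sigma * random.gauss(0.0, 1.0)
    if h(candidate) <= h(x):              # keep the candidate only if it improves h
        x = candidate
    sigma *= 0.98                         # slowly shrink the step size

print(x)                                  # typically very close to 0, not to 1
```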
