Separability

Given $f : x = (x_1, \ldots, x_n) \in \mathbb{R}^n \mapsto f(x) \in \mathbb{R}$, let us define the 1-D functions that are cuts of $f$ along the different coordinates:

$$f_i^{(x_1^i, \ldots, x_n^i)}(y) = f(x_1^i, \ldots, x_{i-1}^i, y, x_{i+1}^i, \ldots, x_n^i)$$

for $(x_1^i, \ldots, x_n^i) \in \mathbb{R}^{n-1}$, with $(x_1^i, \ldots, x_n^i) = (x_1^i, \ldots, x_{i-1}^i, x_{i+1}^i, \ldots, x_n^i)$.

Definition: A function $f$ is separable if for all $i$, for all $(x_1^i, \ldots, x_n^i) \in \mathbb{R}^{n-1}$ and all $(\hat{x}_1^i, \ldots, \hat{x}_n^i) \in \mathbb{R}^{n-1}$,

$$\operatorname{argmin}_y f_i^{(x_1^i, \ldots, x_n^i)}(y) = \operatorname{argmin}_y f_i^{(\hat{x}_1^i, \ldots, \hat{x}_n^i)}(y)$$

(a weak definition of separability)
Separability (cont.)

Proposition: Let $f$ be separable; then for all $x_j$,

$$\operatorname{argmin} f(x_1, \ldots, x_n) = \left( \operatorname{argmin}_{x_1} f_1^{(x_2, \ldots, x_n)}(x_1), \ldots, \operatorname{argmin}_{x_n} f_n^{(x_1, \ldots, x_{n-1})}(x_n) \right)$$

and $f$ can be optimized using $n$ one-dimensional minimizations along the coordinates (see the sketch below).

Exercise: prove the previous proposition.
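As a concrete illustration of the proposition (not part of the original slides), here is a minimal Python sketch that minimizes a separable function coordinate by coordinate; the helper name `coordinate_wise_argmin` and the use of `scipy.optimize.minimize_scalar` are choices made for this example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_wise_argmin(f, n, other_coords=None):
    """Minimize a separable f: R^n -> R with n independent 1-D minimizations.

    For a separable f, the argmin of each 1-D cut does not depend on the
    values chosen for the other coordinates (here fixed to `other_coords`).
    """
    if other_coords is None:
        other_coords = np.zeros(n)
    x_star = np.empty(n)
    for i in range(n):
        def cut(y, i=i):
            x = other_coords.copy()
            x[i] = y                      # vary only coordinate i
            return f(x)
        x_star[i] = minimize_scalar(cut).x   # 1-D minimization along coordinate i
    return x_star

# Example with a separable (additively decomposable) function:
f_sep = lambda x: float(np.sum((x - np.arange(len(x)))**2))
print(coordinate_wise_argmin(f_sep, 4))      # approximately [0, 1, 2, 3]
```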
Example: Additively Decomposable Functions

Exercise: Let $f(x_1, \ldots, x_n) = \sum_{i=1}^n h_i(x_i)$ with each $h_i$ having a unique argmin. Prove that $f$ is separable. We say in this case that $f$ is additively decomposable.

Example: Rastrigin function

$$f(x) = 10 n + \sum_{i=1}^n \left( x_i^2 - 10 \cos(2\pi x_i) \right)$$

[Figure: level sets of the 2-D Rastrigin function on $[-3, 3]^2$]
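For reference, a direct NumPy implementation of the Rastrigin function defined above (a small sketch, not from the slides):

```python
import numpy as np

def rastrigin(x):
    """Rastrigin function: f(x) = 10 n + sum_i (x_i^2 - 10 cos(2 pi x_i)).

    Additively decomposable, hence separable; global minimum f(0) = 0,
    but highly multimodal (many local minima on a regular grid).
    """
    x = np.asarray(x)
    n = x.size
    return 10 * n + float(np.sum(x**2 - 10 * np.cos(2 * np.pi * x)))

print(rastrigin(np.zeros(5)))   # 0.0
```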
Non-separable Problems

Separable problems are typically easy to optimize. Yet difficult real-world problems are non-separable.

One needs to be careful when evaluating optimization algorithms: not too many test functions should be separable, and if some are, the algorithms should not exploit separability. Otherwise, good performance on the test problems will not reflect good performance of the algorithm on difficult problems.

Algorithms known to exploit separability: many Genetic Algorithms (GA), most Particle Swarm Optimization (PSO) variants.
Non-separable Problems

Building a non-separable problem from a separable one
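The construction shown on this slide is not reproduced here. A standard way to do this (an assumption of this sketch, not quoted from the slide) is to compose a separable function with a rotation, $f_{\mathrm{rot}}(x) = f(Rx)$ with $R$ orthogonal; the sketch reuses the `rastrigin` function defined above.

```python
import numpy as np

def rotated(f, n, seed=0):
    """Return x -> f(R x) for a random orthogonal matrix R.

    Composing a separable f with a non-axis-aligned rotation couples the
    coordinates, which generically makes the problem non-separable.
    """
    rng = np.random.default_rng(seed)
    # QR decomposition of a Gaussian matrix yields a random orthogonal R
    R, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return lambda x: f(R @ np.asarray(x))

rastrigin_rot = rotated(rastrigin, n=10)   # non-separable variant of Rastrigin
```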
Ill-conditioned Problems - Case of Convex-Quadratic Functions

Exercise: Consider a convex-quadratic function

$$f(x) = \frac{1}{2} (x - x^\star)^\top H (x - x^\star)$$

with $H$ a symmetric positive definite (SPD) matrix.

1. Why is it called a convex-quadratic function? What is the Hessian matrix of $f$?

The condition number of the matrix $H$ (with respect to the Euclidean norm) is defined as

$$\operatorname{cond}(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$$

with $\lambda_{\max}(H)$ and $\lambda_{\min}(H)$ being respectively the largest and smallest eigenvalues of $H$.
Ill-conditioned Problems

Ill-conditioned means a high condition number of the Hessian matrix $H$.

Consider now the specific case of the function $f(x) = \frac{1}{2}(x_1^2 + 9 x_2^2)$.

1. Compute its Hessian matrix and its condition number.
2. Plot the level sets of $f$; relate the condition number to the axis ratio of the level sets of $f$.
3. Generalize to a general convex-quadratic function.

Real-world problems are often ill-conditioned.

4. Why do you think this is the case?
5. Why are ill-conditioned problems difficult? (see also Exercise 2.5)
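As a numerical illustration (not an answer key from the slides), the condition number of a Hessian can be checked with NumPy; the constant Hessian used below follows directly from the definition of $f(x) = \frac{1}{2}(x_1^2 + 9 x_2^2)$.

```python
import numpy as np

# Hessian of f(x) = 0.5 * (x1^2 + 9 x2^2) is constant: H = diag(1, 9)
H = np.diag([1.0, 9.0])

eigvals = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
cond = eigvals.max() / eigvals.min()     # condition number = lambda_max / lambda_min
print(cond)                              # 9.0
# The level sets are ellipses with axis ratio sqrt(cond) = 3.
```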
Ill-conditioned Problems
Part II: Algorithms
Landscape of Derivative-Free Optimization Algorithms

Deterministic algorithms:
- Quasi-Newton with estimation of the gradient (BFGS) [Broyden et al. 1970]
- Simplex downhill [Nelder and Mead 1965]
- Pattern search, direct search [Hooke and Jeeves 1961]
- Trust-region / model-based methods (NEWUOA, BOBYQA) [Powell 2006, 2009]

Stochastic (randomized) search methods:
- Evolutionary Algorithms (continuous domain)
  - Differential Evolution [Storn, Price 1997]
  - Particle Swarm Optimization [Kennedy and Eberhart 1995]
  - Evolution Strategies, CMA-ES [Rechenberg 1965; Hansen, Ostermeier 2001]
  - Estimation of Distribution Algorithms (EDAs) [Larrañaga, Lozano 2002]
  - Cross Entropy Method (same as EDAs) [Rubinstein, Kroese 2004]
  - Genetic Algorithms [Holland 1975, Goldberg 1989]
- Simulated Annealing [Kirkpatrick et al. 1983]
A Generic Template for Stochastic Search

Define $\{P_\theta : \theta \in \Theta\}$, a family of probability distributions on $\mathbb{R}^n$.

Generic template to optimize $f : \mathbb{R}^n \to \mathbb{R}$:

Initialize the distribution parameter $\theta$, set the population size $\lambda \in \mathbb{N}$.
While not terminate:
1. Sample $x_1, \ldots, x_\lambda$ according to $P_\theta$
2. Evaluate $f$ on $x_1, \ldots, x_\lambda$
3. Update parameters $\theta \leftarrow F(\theta, x_1, \ldots, x_\lambda, f(x_1), \ldots, f(x_\lambda))$

The update of $\theta$ should drive $P_\theta$ to concentrate on the optima of $f$ (see the sketch below).
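A minimal Python rendering of this template (the distribution family and the update rule are left as placeholders to be specified; all names are illustrative, not from the slides):

```python
import numpy as np

def stochastic_search(f, theta, sample, update, lam, n_iter=100):
    """Generic stochastic search template.

    sample(theta, lam) -> array of lam candidate solutions in R^n
    update(theta, X, fX) -> new distribution parameter theta
    """
    for _ in range(n_iter):                 # "while not terminate"
        X = sample(theta, lam)              # 1. sample x_1, ..., x_lam from P_theta
        fX = np.array([f(x) for x in X])    # 2. evaluate f on the samples
        theta = update(theta, X, fX)        # 3. update theta from solutions and f-values
    return theta
```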
To obtain an optimization algorithm we need:
➊ to define $\{P_\theta, \theta \in \Theta\}$
➋ to define the update function $F$ of $\theta$
Which probability distribution should we use to sample candidate solutions?
Normal distribution - 1D case
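The figure on this slide is not reproduced. For reference, and consistent with the normalization constants $Z_1, Z_2$ used on the next slide, the 1-D normal density is (a standard fact, added here for completeness):

$$p_{\mathcal{N}(\mu, \sigma^2)}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right)$$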
Generalization to n Variables: Independent Case

Assume $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and let $p(x_1) = \frac{1}{Z_1} \exp\left( -\frac{1}{2} \frac{(x_1 - \mu_1)^2}{\sigma_1^2} \right)$ denote its density.

Assume $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ and let $p(x_2) = \frac{1}{Z_2} \exp\left( -\frac{1}{2} \frac{(x_2 - \mu_2)^2}{\sigma_2^2} \right)$ denote its density.

Assume $X_1$ and $X_2$ are independent; then $(X_1, X_2)$ is a Gaussian vector with density

$$p(x_1, x_2) = p(x_1)\, p(x_2) = \frac{1}{Z_1 Z_2} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$

with $x = (x_1, x_2)^\top$, $\mu = (\mu_1, \mu_2)^\top$ and $\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}$.

[Figure: axis-parallel elliptic level sets centered at $(\mu_1, \mu_2)$, elongated along $x_1$ when $\sigma_1 > \sigma_2$]
Generalization to n Variables: General Case

Gaussian Vector - Multivariate Normal Distribution

A random vector $X = (X_1, \ldots, X_n) \in \mathbb{R}^n$ is a Gaussian vector (or multivariate normal) if and only if for all real numbers $a_1, \ldots, a_n$, the random variable $a_1 X_1 + \ldots + a_n X_n$ has a normal distribution.
Gaussian Vector - Multivariate Normal Distribution
Density of an n-dimensional Gaussian vector $\mathcal{N}(m, C)$:

$$p_{\mathcal{N}(m, C)}(x) = \frac{1}{(2\pi)^{n/2} |C|^{1/2}} \exp\left( -\frac{1}{2} (x - m)^\top C^{-1} (x - m) \right)$$

The mean vector $m$:
- determines the displacement
- is the value with the largest density
- the distribution is symmetric around the mean: $\mathcal{N}(m, C) = m + \mathcal{N}(0, C)$

The covariance matrix $C$: determines the geometrical shape (see next slides).
Geometry of a Gaussian Vector

Consider a Gaussian vector $\mathcal{N}(m, C)$; recall that lines of equal density are given by:

$$\{ x \mid \Delta^2 = (x - m)^\top C^{-1} (x - m) = \text{cst} \}$$

Decompose $C = U \Lambda U^\top$ with $U$ orthogonal, i.e.

$$C = \begin{pmatrix} | & | \\ u_1 & u_2 \\ | & | \end{pmatrix} \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \begin{pmatrix} -\; u_1 \;- \\ -\; u_2 \;- \end{pmatrix}$$

Let $Y = U^\top (x - m)$; then in the coordinate system $(u_1, u_2)$, the lines of equal density are given by

$$\left\{ x \;\middle|\; \Delta^2 = \frac{Y_1^2}{\sigma_1^2} + \frac{Y_2^2}{\sigma_2^2} = \text{cst} \right\}$$

[Figure: elliptic level sets centered at $(\mu_1, \mu_2)$ with principal axes along $u_1$ and $u_2$, of lengths proportional to $\sigma_1$ and $\sigma_2$]
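A short sketch (function name assumed, not from the slides) of how the decomposition $C = U \Lambda U^\top$ translates into sampling: a standard normal vector is scaled by $\Lambda^{1/2}$ along the eigenvectors, rotated by $U$, and shifted by $m$.

```python
import numpy as np

def sample_gaussian(m, C, lam, rng=None):
    """Sample lam vectors from N(m, C) using the eigendecomposition C = U diag(L) U^T."""
    rng = np.random.default_rng() if rng is None else rng
    L, U = np.linalg.eigh(C)                    # eigenvalues L (the sigma_i^2), eigenvectors U
    Z = rng.standard_normal((lam, len(m)))      # standard normal samples from N(0, I)
    # scale by sqrt of eigenvalues along each principal axis, rotate by U, shift by m
    return m + (Z * np.sqrt(L)) @ U.T

m = np.zeros(2)
C = np.array([[4.0, 1.5], [1.5, 1.0]])
X = sample_gaussian(m, C, lam=1000)
print(np.cov(X.T))                              # approximately C
```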
Evolution Strategies
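The figures of this slide are not reproduced. As used on the following slides, an Evolution Strategy samples candidate solutions from a multivariate normal distribution parameterized by a mean $m$ (the incumbent solution), a step-size $\sigma > 0$, and a covariance matrix $C$:

$$x_i = m + \sigma\, y_i, \qquad y_i \sim \mathcal{N}(0, C), \qquad i = 1, \ldots, \lambda$$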
Evolution Strategies

In fact, the covariance matrix of the sampling distribution is $\sigma^2 C$, but it is convenient to refer to $C$ as the covariance matrix (it is a covariance matrix, just not the covariance matrix of the sampling distribution).
How to update the different parameters $m$, $\sigma$, $C$?
Update the Mean: a Simple Algorithm, the (1+1)-ES

Notation and terminology: in a (1+1)-ES, one solution is kept from one iteration to the next, and one new solution is sampled at each iteration. The "+" means that we keep the best between the current solution and the new solution; we talk about elitist selection.

(1+1)-ES algorithm (update of the mean):
- sample one candidate solution from the mean $m$: $x = m + \sigma \mathcal{N}(0, C)$
- if $x$ is better than $m$ (i.e. if $f(x) \leq f(m)$), select $x$: $m \leftarrow x$
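A minimal Python sketch of this (1+1)-ES mean update, with a fixed step-size and identity covariance (a simplification for illustration; practical variants also adapt $\sigma$):

```python
import numpy as np

def one_plus_one_es(f, m, sigma=0.3, n_iter=1000, rng=None):
    """(1+1)-ES with elitist selection: keep the better of parent and offspring.

    Simplified sketch: sigma and C = I are kept fixed (no adaptation).
    """
    rng = np.random.default_rng() if rng is None else rng
    fm = f(m)
    for _ in range(n_iter):
        x = m + sigma * rng.standard_normal(m.size)   # x = m + sigma * N(0, I)
        fx = f(x)
        if fx <= fm:                                   # elitist selection
            m, fm = x, fx
    return m, fm

sphere = lambda x: float(np.sum(x**2))
m_best, f_best = one_plus_one_es(sphere, m=np.ones(5) * 3.0)
print(f_best)   # much smaller than the initial value f(m) = 45
```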
The (1+1)-ES algorithm is a simple algorithm, yet:
- the elitist selection is not robust to outliers: we cannot lose solutions accepted "by chance", for instance solutions that only look good because noise gave them a low function value
- there is no population (just a single solution is sampled), which makes it less robust

In practice, one should rather use a $(\mu/\mu, \lambda)$-ES: $\lambda$ solutions are sampled at each iteration, and the best $\mu$ solutions are selected and recombined (to form the new mean).
The $(\mu/\mu, \lambda)$-ES - Update of the Mean Vector
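A sketch of the $(\mu/\mu, \lambda)$-ES mean update with equal recombination weights (the slides may use weighted recombination; equal weights, fixed $\sigma$, and $C = I$ are assumptions made here for simplicity):

```python
import numpy as np

def mu_lambda_es(f, m, sigma=0.3, lam=10, n_iter=300, rng=None):
    """(mu/mu, lambda)-ES: sample lam offspring, recombine the mu best into the new mean.

    Simplified sketch: equal recombination weights, fixed sigma, C = I.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = lam // 2
    for _ in range(n_iter):
        X = m + sigma * rng.standard_normal((lam, m.size))   # lam samples from N(m, sigma^2 I)
        fX = np.array([f(x) for x in X])
        best = np.argsort(fX)[:mu]                           # indices of the mu best (uses ranking only)
        m = X[best].mean(axis=0)                             # new mean = average of the mu best
    return m

sphere = lambda x: float(np.sum(x**2))
print(sphere(mu_lambda_es(sphere, m=np.ones(5) * 3.0)))      # much smaller than the initial f(m)
```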
What changes in the previous slide if, instead of optimizing $f$, we optimize $g \circ f$ where $g : \operatorname{Im}(f) \to \mathbb{R}$ is strictly increasing?
Invariance Under Monotonically Increasing Functions

Comparison-based / ranking-based algorithms: the update of all parameters uses only the ranking

$$f(x_{1:\lambda}) \leq f(x_{2:\lambda}) \leq \ldots \leq f(x_{\lambda:\lambda})$$

Since

$$g(f(x_{1:\lambda})) \leq g(f(x_{2:\lambda})) \leq \ldots \leq g(f(x_{\lambda:\lambda}))$$

for all strictly increasing $g : \operatorname{Im}(f) \to \mathbb{R}$, the ranking (and hence the behavior of the algorithm) is unchanged when optimizing $g \circ f$ instead of $f$.
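A quick numerical illustration (not from the slides) that the ranking, and hence the update of a comparison-based algorithm, is unchanged under a strictly increasing transformation $g$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))                  # 10 candidate solutions in R^5
f = lambda x: float(np.sum(x**2))
g = lambda y: np.exp(y) + 3.0                     # a strictly increasing g: Im(f) -> R

fX = np.array([f(x) for x in X])
print(np.array_equal(np.argsort(fX), np.argsort(g(fX))))   # True: same ranking
```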