The Gaussian Search Distribution

The search distribution is given by p(z | θ) = N(z | x, C). We use the parameter set θ = ⟨x, A⟩, with A being the Cholesky factor of C, i.e., A is an upper triangular matrix (UTM) and C = AᵀA. There is no redundancy in θ, since C is symmetric.

∇_θ log p(z | θ) can be computed in closed form:

∇_x log p(z | θ) = C⁻¹(z − x),
∇_A log p(z | θ) = A⁻ᵀ(z − x)(z − x)ᵀC⁻¹ − diag(A)⁻¹,

where the second gradient is restricted to the upper triangle (the free elements of A), and the diag(A)⁻¹ term comes from differentiating log det A = Σₖ log aₖₖ.

The Monte Carlo gradient estimate ∇ˢ_θ J(θ) can then be computed from ∇_θ log p(z₁ | θ), …, ∇_θ log p(zₙ | θ).
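Below is a minimal NumPy sketch of these closed-form log-derivatives under the ⟨x, A⟩ parametrization above; the function name and the use of explicit matrix inverses (acceptable for small d) are illustrative choices, not from the paper.

```python
import numpy as np

def log_density_gradients(z, x, A):
    """Closed-form gradients of log N(z | x, C), with C = A^T A and
    A the upper-triangular Cholesky factor."""
    diff = z - x
    C_inv = np.linalg.inv(A.T @ A)
    g_x = C_inv @ diff                              # grad w.r.t. x: C^{-1}(z - x)
    g_A = (np.linalg.inv(A).T @ np.outer(diff, diff) @ C_inv
           - np.diag(1.0 / np.diag(A)))             # A^{-T}(z-x)(z-x)^T C^{-1} - diag(A)^{-1}
    return g_x, np.triu(g_A)                        # only upper-triangular entries are free
```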
Stochastic Gradient Ascent

The parameters are updated by stochastic gradient ascent on the expected fitness J(θ):

θ ← θ + α ∇_θ J(θ) ≈ θ + α ∇ˢ_θ J(θ), with ∇ˢ_θ J(θ) = (1/n) G f,

where G = [∇_θ log p(z₁ | θ), …, ∇_θ log p(zₙ | θ)] collects the per-sample log-derivatives and f = [f(z₁), …, f(zₙ)]ᵀ is the vector of fitness values.
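As a sketch under the same assumptions as before (reusing log_density_gradients and updating the two parameter parts jointly), one vanilla update could look like this:

```python
def vanilla_gradient_step(x, A, samples, fitness, alpha):
    """One step of theta <- theta + alpha * (1/n) * G f, for theta = (x, A)."""
    n = len(samples)
    step_x = np.zeros_like(x)
    step_A = np.zeros_like(A)
    for z, f in zip(samples, fitness):
        g_x, g_A = log_density_gradients(z, x, A)
        step_x += f * g_x / n
        step_A += f * g_A / n
    return x + alpha * step_x, A + alpha * step_A
```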
Novel Ideas in eNES

1. Use the natural gradient instead of the vanilla gradient.
2. Compute the natural gradient in an exact and efficient way.
3. Use importance mixing to reuse previously evaluated samples.
4. Introduce an optimal fitness baseline to reduce the variance of the gradient estimate.
1. Why Natural Gradient?

The vanilla gradient doesn't work well:
- Over-aggressive steps on ridges.
- Too small steps on plateaus.
- Slow or premature convergence; non-robust performance.

Basic idea of the natural gradient:
- It is the steepest ascent direction when correlations between the elements of θ are taken into account.
- Gradient elements are re-weighted according to their respective uncertainties.
- It yields isotropic convergence on ill-shaped fitness surfaces.
1. Formulation of Natural Gradient

Assume the distance between two adjacent distributions p(· | θ) and p(· | θ + δθ) is defined by their KL divergence. The natural gradient ∇̃_θ J(θ) is then given by the necessary condition

F ∇̃_θ J(θ) = ∇_θ J(θ).

F is the Fisher information matrix (FIM) of θ (intuitively, the normalized covariance of the gradient):

F = E[(∇_θ log p(z | θ))(∇_θ log p(z | θ))ᵀ].

In general, F may not be invertible. If F is invertible, we can compute the (estimated) natural gradient as

∇̃_θ J(θ) = F⁻¹ ∇_θ J(θ),  ∇̃ˢ_θ J(θ) = F⁻¹ ∇ˢ_θ J(θ).
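The definition of F suggests a straightforward Monte Carlo sketch. Note that eNES computes F exactly (next slides); the empirical estimate and the damping term below are only illustrative stand-ins.

```python
def empirical_fisher(score_list):
    """Monte Carlo estimate of F = E[g g^T] from per-sample score vectors g_i."""
    G = np.stack(score_list)            # shape (n, dim_theta)
    return G.T @ G / len(score_list)

def natural_gradient(F, vanilla_grad, damping=1e-8):
    # Solve F g_nat = g rather than forming F^{-1} explicitly; the small
    # damping guards against a (near-)singular empirical estimate.
    return np.linalg.solve(F + damping * np.eye(F.shape[0]), vanilla_grad)
```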
2. Property of FIM in the Gaussian Case

Let θ = ⟨x, A⟩. Under this setting, we find (quite luckily):
- The Fisher information matrix is indeed invertible.
- The Fisher information matrix is block diagonal:

F = diag(C⁻¹, F₁, …, F_d).

- C⁻¹ is the FIM for x.
- F_k is the FIM for the (d − k + 1 non-zero) elements in the k-th row of A.
- The FIM suggests a natural grouping of the elements of θ.
2. Efficient Inverse of FIM

The computation of the natural gradient requires the inverse of F.
- Naively, F has O(d²) rows and columns, so computing F⁻¹ requires O(d⁶) operations.
- We have already found that F is block diagonal, so computing F⁻¹ only requires O(d⁴).
- We can do better! Using the special form of each sub-block, the complexity is reduced to O(d³).
- The estimated natural gradient is then computed as

∇̃ˢ_θ J(θ) = (1/n) F⁻¹ G f,

with overall complexity O(d³).
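To illustrate why the block structure alone already helps, here is a hedged sketch that solves against each block independently. This only reaches the O(d⁴) level; the paper's O(d³) method additionally exploits the internal structure of each F_k, which this sketch does not.

```python
from scipy.linalg import solve

def blockwise_natural_gradient(blocks, grad_parts):
    """Solve F g_nat = g block by block, never forming the full F.
    `blocks` is [C_inv_block, F_1, ..., F_d]; `grad_parts` holds the
    matching slices of the vanilla gradient (names are illustrative)."""
    return [solve(B, g) for B, g in zip(blocks, grad_parts)]
```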
3. Importance Mixing

At each cycle, we need to evaluate n new samples.
- It is common that the updated θ^(t) is close to θ^(t−1).
- Problem: redundant fitness evaluations in the overlapping high-density area.
- Importance mixing: generate samples in the less explored areas, while keeping the updated batch conforming to the new search distribution.
- Reusing samples means fewer fitness evaluations.
3. Importance Mixing

Formally, importance mixing is carried out by two rejection-sampling passes.
- Forward pass: each sample z from the previous batch is accepted with probability

min{1, p(z | θ^(t)) / p(z | θ^(t−1))}.

- Backward pass: newly generated samples z are accepted with probability

max{0, 1 − p(z | θ^(t−1)) / p(z | θ^(t))},

until the batch size is reached.
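A minimal sketch of these two passes, assuming frozen SciPy multivariate_normal objects for the old and new search distributions; all names and the calling convention are illustrative, and the minimal refresh rate used in the paper is omitted here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def importance_mixing(old_batch, old_fit, mvn_old, mvn_new, n, fitness_fn, rng):
    """Refresh the batch for the new distribution, reusing old evaluations."""
    batch, fit = [], []
    # Forward pass: keep old sample z with prob min{1, p_new(z) / p_old(z)}.
    for z, f in zip(old_batch, old_fit):
        if rng.random() < min(1.0, mvn_new.pdf(z) / mvn_old.pdf(z)):
            batch.append(z)
            fit.append(f)
    # Backward pass: draw from the new distribution and accept z with
    # prob max{0, 1 - p_old(z) / p_new(z)}, until the batch is full.
    while len(batch) < n:
        z = mvn_new.rvs(random_state=rng)
        if rng.random() < max(0.0, 1.0 - mvn_old.pdf(z) / mvn_new.pdf(z)):
            batch.append(z)
            fit.append(fitness_fn(z))   # only new samples cost an evaluation
    return np.array(batch), np.array(fit)
```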
4. Optimal Fitness Baseline

A typical problem with Monte Carlo gradient estimation is that the variance is too large. A fitness baseline is introduced to reduce the variance:

∇_θ J = ∇_θ ∫ f(z) p(z | θ) dz − ∇_θ ∫ b p(z | θ) dz = ∇_θ ∫ [f(z) − b] p(z | θ) dz,

where the subtracted term vanishes because ∫ p(z | θ) dz = 1. The constant b is called the fitness baseline.
- Adding the baseline b does not affect the expectation of ∇_θ J.
- But it does affect the variance of the estimate. For the natural gradient,

V[∇̃_θ J(θ)] ∝ b² E[uᵀu] − 2b E[uᵀv] + const,

with u = F⁻¹ ∇_θ log p(z | θ) and v = f(z) u.
4. Optimal Fitness Baseline

V[∇̃_θ J(θ)] is quadratic in b, so we can minimize it. The optimal fitness baseline is given by

b* = E[uᵀv] / E[uᵀu] ≃ (Σᵢ₌₁ⁿ uᵢᵀvᵢ) / (Σᵢ₌₁ⁿ uᵢᵀuᵢ).

The natural gradient is then estimated by

∇̃ˢ_θ J(θ) = (1/n) F⁻¹ G (f − b*).

- Better: use different baselines b_j for different (groups of) parameters θ_j, further reducing the variance.
- The block-diagonal structure of F suggests using a block fitness baseline, where a different baseline value is computed for each group of parameters in θ.
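The scalar b* reduces to a two-line computation; here is a hedged sketch (a per-block variant would apply the same formula to each parameter group separately).

```python
def optimal_baseline(U, fitness):
    """Sample estimate of b* = E[u^T v] / E[u^T u], where v_i = f(z_i) u_i,
    so u_i^T v_i = f(z_i) * (u_i^T u_i). U holds one u_i per row."""
    uu = np.einsum('ij,ij->i', U, U)
    return (uu * fitness).sum() / uu.sum()
```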
Putting Things Together

Initialization, then loop (see the sketch below):
1. Update the population using importance mixing.
2. Evaluate the newly generated samples.
3. Compute the optimal baseline b* and ∇̃ˢ_θ J(θ).
4. Update: θ ← θ + α ∇̃ˢ_θ J(θ).
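A hedged end-to-end sketch wiring together the earlier pieces (log_density_gradients, empirical_fisher, optimal_baseline, importance_mixing); the exact O(d³) FIM inverse is replaced by a generic damped solve, and every hyperparameter value is illustrative rather than from the paper.

```python
def enes(fitness_fn, x0, A0, n=50, alpha=0.01, iters=200, seed=0):
    """End-to-end sketch of the eNES loop (illustrative, not the paper's exact method)."""
    rng = np.random.default_rng(seed)
    x, A = x0.astype(float).copy(), A0.astype(float).copy()
    d = len(x)
    # Initialization: draw and evaluate the first batch, z = x + A^T eps.
    batch = np.array([x + A.T @ rng.standard_normal(d) for _ in range(n)])
    fit = np.array([fitness_fn(z) for z in batch])
    for _ in range(iters):
        # Per-sample scores, flattened into rows of G.
        G = np.stack([np.concatenate([gx, gA[np.triu_indices(d)]])
                      for gx, gA in (log_density_gradients(z, x, A) for z in batch)])
        F = empirical_fisher(list(G))          # stand-in for the exact FIM
        U = np.linalg.solve(F + 1e-8 * np.eye(len(F)), G.T).T   # rows u_i = F^{-1} g_i
        b = optimal_baseline(U, fit)
        step = U.T @ (fit - b) / n             # (1/n) F^{-1} G (f - b*)
        # Gradient-ascent update of theta = (x, A).
        x_old, A_old = x, A.copy()
        x = x + alpha * step[:d]
        dA = np.zeros((d, d))
        dA[np.triu_indices(d)] = step[d:]
        A = A + alpha * dA
        # Refresh the batch for the new distribution via importance mixing.
        mvn_old = multivariate_normal(x_old, A_old.T @ A_old)
        mvn_new = multivariate_normal(x, A.T @ A)
        batch, fit = importance_mixing(batch, fit, mvn_old, mvn_new, n, fitness_fn, rng)
    return x, A
```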
Empirical Results - Standard Blackbox Benchmarks

[Figure: −fitness vs. number of evaluations on the unimodal benchmark suite in dimension 50: Cigar, DiffPow, Ellipsoid, ParabR, Schwefel, SharpR, Sphere, Tablet.]
Empirical Results - Multimodal

[Figure: search trajectory on a 2D multimodal fitness landscape.]

eNES is able to jump over deceptive local optima.
Empirical Results - Double Pole Balancing

[Figure: double pole balancing setup, with pole angles β1 and β2, applied force F, and cart position x.]

Non-Markovian double pole balancing, average number of evaluations:

Method   SANE     ESP     NEAT    CMA     CoSyNE   FEM     NES
Eval.    262,700  7,374   6,929   3,521   1,249    2,099   1,753