Covariance Matrix Adaptation
Evolution Strategies: Recalling

New search points are sampled normally distributed,

    x_i ∼ m + σ N_i(0, C)   for i = 1, ..., λ

as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, and C ∈ R^{n×n}, and where

- the mean vector m ∈ R^n represents the favorite solution,
- the so-called step-size σ ∈ R_+ controls the step length,
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid.

The remaining question is how to update C.
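The sampling step can be written down directly with numpy. Below is a minimal sketch, assuming a fixed mean m, step-size sigma, and positive definite covariance matrix C; the function name sample_population and its arguments are illustrative only.

import numpy as np

def sample_population(m, sigma, C, lam, rng=None):
    """Sample lam points x_i = m + sigma * y_i with y_i ~ N(0, C)."""
    rng = rng or np.random.default_rng()
    n = len(m)
    A = np.linalg.cholesky(C)                   # C = A A^T, so A z ~ N(0, C) for z ~ N(0, I)
    Y = rng.standard_normal((lam, n)) @ A.T     # rows are y_1, ..., y_lam
    X = m + sigma * Y                           # rows are x_1, ..., x_lam
    return X, Y

# Example: 10 samples around m = 0 in dimension 3 with C = I and sigma = 0.5
X, Y = sample_population(np.zeros(3), 0.5, np.eye(3), lam=10)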
Covariance Matrix Adaptation: Rank-One Update

    m ← m + σ y_w,   y_w = Σ_{i=1}^{µ} w_i y_{i:λ},   y_i ∼ N_i(0, C)

Illustration of one update step:

- initial distribution, C = I
- y_w, movement of the population mean m (disregarding σ)
- mixture of distribution C and step y_w,   C ← 0.8 × C + 0.2 × y_w y_w^T
- new distribution (disregarding σ)

The ruling principle: the adaptation increases the likelihood of successful steps, y_w, to appear again.
Covariance Matrix Adaptation: Rank-One Update

Initialize m ∈ R^n and C = I, set σ = 1, learning rate c_cov ≈ 2/n²

While not terminate
    x_i = m + σ y_i,   y_i ∼ N_i(0, C)
    m ← m + σ y_w   where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}
    C ← (1 − c_cov) C + c_cov µ_w y_w y_w^T   where µ_w = 1 / Σ_{i=1}^{µ} w_i² ≥ 1
        [the added term µ_w y_w y_w^T is the rank-one update]
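Read as code, the loop above might look as follows. This is a sketch only, assuming equal recombination weights, a user-supplied objective f, and no step-size control; the name cma_rank_one and all parameter defaults are illustrative.

import numpy as np

def cma_rank_one(f, m, sigma=1.0, lam=10, iters=200, rng=None):
    """Rank-one CMA sketch: adapt C from the step y_w of the mean (no step-size control)."""
    rng = rng or np.random.default_rng()
    n = len(m)
    mu = lam // 2
    w = np.full(mu, 1.0 / mu)            # equal recombination weights (an assumption)
    mu_w = 1.0 / np.sum(w**2)            # variance effective selection mass
    c_cov = 2.0 / n**2                   # learning rate, roughly 2 / n^2
    C = np.eye(n)
    for _ in range(iters):
        A = np.linalg.cholesky(C)
        Y = rng.standard_normal((lam, n)) @ A.T        # y_i ~ N(0, C)
        X = m + sigma * Y                              # x_i = m + sigma * y_i
        order = np.argsort([f(x) for x in X])          # rank candidates by objective value
        y_w = w @ Y[order[:mu]]                        # weighted mean of the best mu steps
        m = m + sigma * y_w
        C = (1 - c_cov) * C + c_cov * mu_w * np.outer(y_w, y_w)   # rank-one update
    return m

# Example: minimize the sphere function in dimension 5
m_final = cma_rank_one(lambda x: float(np.sum(x**2)), m=np.ones(5))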
- Problem Statement
- Stochastic search algorithms - basics
- Adaptive Evolution Strategies
- Mean Vector Adaptation
- Step-size control
- Covariance Matrix Adaptation
- Rank-One Update
- Cumulation: the Evolution Path
- Rank-µ Update
Cumulation: The Evolution Path

Evolution Path: conceptually, the evolution path is the search path the strategy takes over a number of iteration steps. It can be expressed as a sum of consecutive steps of the mean m.

An exponentially weighted sum of steps y_w is used,

    p_c ∝ Σ_{i=0}^{g} (1 − c_c)^{g−i} y_w^{(i)}
          [the weights (1 − c_c)^{g−i} fade exponentially]

The recursive construction of the evolution path (cumulation):

    p_c ← (1 − c_c) p_c + √(1 − (1 − c_c)²) √µ_w y_w
          [decay factor]  [normalization factor]  [input y_w = (m − m_old) / σ]

where µ_w = 1 / Σ w_i² and c_c ≪ 1. History information is accumulated in the evolution path.
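A small numpy sketch of one cumulation step, followed by a check that unrolling the recursion reproduces the exponentially weighted sum above; the function name and the chosen values of c_c, µ_w and the number of steps are arbitrary illustrations.

import numpy as np

def update_evolution_path(p_c, y_w, c_c, mu_w):
    """One cumulation step: p_c <- (1 - c_c) p_c + sqrt(1 - (1 - c_c)^2) sqrt(mu_w) y_w."""
    decay = 1.0 - c_c
    return decay * p_c + np.sqrt(1.0 - decay**2) * np.sqrt(mu_w) * y_w

# Unrolling the recursion from p_c = 0 over steps y_w^(0), ..., y_w^(g) gives the
# exponentially weighted sum  sqrt(1 - (1 - c_c)^2) sqrt(mu_w) sum_i (1 - c_c)^(g - i) y_w^(i).
rng = np.random.default_rng(0)
c_c, mu_w, n = 0.1, 3.0, 4
steps = [rng.standard_normal(n) for _ in range(5)]
p_c = np.zeros(n)
for y_w in steps:
    p_c = update_evolution_path(p_c, y_w, c_c, mu_w)
explicit = np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * sum(
    (1 - c_c)**(len(steps) - 1 - i) * y for i, y in enumerate(steps))
assert np.allclose(p_c, explicit)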
Cumulation: Utilizing the Evolution Path

We used y_w y_w^T for updating C. Because y_w y_w^T = (−y_w)(−y_w)^T, the sign of y_w is lost.

The sign information is (re-)introduced by using the evolution path:

    p_c ← (1 − c_c) p_c + √(1 − (1 − c_c)²) √µ_w y_w
          [decay factor]  [normalization factor]

    C ← (1 − c_cov) C + c_cov p_c p_c^T
                        [rank-one update]

where µ_w = 1 / Σ w_i² and c_c ≪ 1.
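A minimal sketch of this path-based rank-one update, combining the cumulation step with the covariance update; the function name and argument layout are illustrative only.

import numpy as np

def rank_one_update_with_path(C, p_c, y_w, c_c, c_cov, mu_w):
    """Rank-one update of C driven by the evolution path p_c instead of y_w alone."""
    # Cumulation: correlated consecutive steps lengthen p_c, cancelling steps shorten it,
    # so the direction (sign) information of the steps is preserved.
    p_c = (1 - c_c) * p_c + np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * y_w
    # Rank-one update with the outer product p_c p_c^T.
    C = (1 - c_cov) * C + c_cov * np.outer(p_c, p_c)
    return C, p_c

In the full algorithm this rank-one term is combined with the rank-µ term introduced below.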
Using an evolution path for the rank-one update of the covariance matrix reduces the number of function evaluations to adapt to a straight ridge from O(n²) to O(n).(3)

The overall model complexity is n², but important parts of the model can be learned in time of order n.

(3) Hansen, Müller and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18.
Rank-µ Update

    x_i = m + σ y_i,   y_i ∼ N_i(0, C)
    m ← m + σ y_w,   y_w = Σ_{i=1}^{µ} w_i y_{i:λ}

The rank-µ update extends the update rule for large population sizes λ, using µ > 1 vectors to update C at each iteration step.

The matrix

    C_µ = Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}^T

computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one.

The rank-µ update then reads

    C ← (1 − c_cov) C + c_cov C_µ

where c_cov ≈ µ_w / n² and c_cov ≤ 1.
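The rank-µ term is a weighted sum of outer products and is easy to express with numpy. A sketch under the assumption that the best µ steps are available as rows of a matrix; the names rank_mu_update, Y_best and w are illustrative.

import numpy as np

def rank_mu_update(C, Y_best, w, c_cov):
    """Rank-mu update: C <- (1 - c_cov) C + c_cov * sum_i w_i y_{i:lambda} y_{i:lambda}^T.

    Y_best holds the best mu steps y_{1:lambda}, ..., y_{mu:lambda} as rows,
    w holds the recombination weights (nonnegative, summing to one).
    """
    C_mu = (Y_best * w[:, None]).T @ Y_best   # weighted sum of outer products, rank <= min(mu, n)
    return (1 - c_cov) * C + c_cov * C_mu

# Example with mu = 3 equal weights in dimension 4
rng = np.random.default_rng(1)
C_new = rank_mu_update(np.eye(4), rng.standard_normal((3, 4)), np.full(3, 1/3), c_cov=0.2)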
Rank-µ Update, illustration with equal weights w_1 = ... = w_µ = 1/µ:

- sampling of λ = 150 solutions where C = I and σ = 1:   x_i = m + σ y_i,   y_i ∼ N(0, C)
- calculating C where µ = 50 and c_cov = 1:   C_µ = (1/µ) Σ y_{i:λ} y_{i:λ}^T,   m_new ← m + (1/µ) Σ y_{i:λ},   C ← (1 − 1) × C + 1 × C_µ
- new distribution
The rank-µ update
- increases the possible learning rate in large populations roughly from 2/n² to µ_w/n²
- can reduce the number of necessary iterations roughly from O(n²) to O(n),(4) given µ_w ∝ λ ∝ n

Therefore the rank-µ update is the primary mechanism whenever a large population size is used, say λ ≥ 3n + 10.

The rank-one update
- uses the evolution path and reduces the number of necessary function evaluations to learn straight ridges from O(n²) to O(n).

Rank-one update and rank-µ update can be combined.

(4) Hansen, Müller and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18.
Summary of Equations
The Covariance Matrix Adaptation Evolution Strategy

Input: m ∈ R^n, σ ∈ R_+, λ
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_µ ≈ µ_w/n², c_1 + c_µ ≤ 1, d_σ ≈ 1 + √(µ_w/n),
     and w_{i=1,...,λ} such that µ_w = 1 / Σ_{i=1}^{µ} w_i² ≈ 0.3 λ

While not terminate
    x_i = m + σ y_i,   y_i ∼ N_i(0, C),   for i = 1, ..., λ                          sampling
    m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ y_w   where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}    update mean
    p_c ← (1 − c_c) p_c + 1_{‖p_σ‖ < 1.5√n} √(1 − (1 − c_c)²) √µ_w y_w               cumulation for C
    p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √µ_w C^{−1/2} y_w                        cumulation for σ
    C ← (1 − c_1 − c_µ) C + c_1 p_c p_c^T + c_µ Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}^T    update C
    σ ← σ × exp( (c_σ/d_σ) (‖p_σ‖ / E‖N(0, I)‖ − 1) )                                update of σ

Not covered on this slide: termination, restarts, useful output, boundaries and encoding.
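The equations above translate almost line by line into code. Below is a minimal numpy sketch, not a production implementation: the weight profile, the E‖N(0, I)‖ approximation and the fixed iteration budget are assumptions added here, and anything beyond a demo would rather use an established implementation such as Hansen's pycma package.

import numpy as np

def cma_es(f, m, sigma, lam=None, iters=500, rng=None):
    """Minimal CMA-ES sketch following the summary equations; no termination tests,
    restarts, boundary handling or encoding."""
    rng = rng or np.random.default_rng()
    n = len(m)
    lam = lam or 4 + int(3 * np.log(n))                   # population size (assumption)
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))   # decreasing weights (a common choice)
    w /= w.sum()
    mu_w = 1.0 / np.sum(w**2)                             # roughly 0.3 * lam
    c_c, c_sigma = 4.0 / n, 4.0 / n
    c_1 = 2.0 / n**2
    c_mu = min(mu_w / n**2, 1.0 - c_1)                    # keep c_1 + c_mu <= 1
    d_sigma = 1.0 + np.sqrt(mu_w / n)
    chi_n = np.sqrt(n) * (1 - 1/(4*n) + 1/(21*n**2))      # approximation of E||N(0, I)||
    C, p_c, p_sigma = np.eye(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        # sampling: x_i = m + sigma * y_i with y_i ~ N(0, C)
        eigvals, B = np.linalg.eigh(C)
        D = np.sqrt(np.maximum(eigvals, 1e-20))
        Y = (rng.standard_normal((lam, n)) * D) @ B.T
        X = m + sigma * Y
        order = np.argsort([f(x) for x in X])
        Y_mu = Y[order[:mu]]                              # best mu steps, ranked by f
        # update mean
        y_w = w @ Y_mu
        m = m + sigma * y_w
        # cumulation for sigma, using C^{-1/2} y_w = B diag(1/D) B^T y_w
        C_inv_half_y = B @ ((B.T @ y_w) / D)
        p_sigma = (1 - c_sigma) * p_sigma + np.sqrt(1 - (1 - c_sigma)**2) * np.sqrt(mu_w) * C_inv_half_y
        # cumulation for C, stalled while ||p_sigma|| is large
        h = float(np.linalg.norm(p_sigma) < 1.5 * np.sqrt(n))
        p_c = (1 - c_c) * p_c + h * np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * y_w
        # update C: rank-one term plus rank-mu term
        C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) + c_mu * ((Y_mu * w[:, None]).T @ Y_mu)
        # update sigma
        sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m

# Example: minimize the sphere function in dimension 10
x_best = cma_es(lambda x: float(np.sum(x**2)), m=np.full(10, 3.0), sigma=1.0)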
Figures: rank-one and rank-µ updates, shown for the default and for a larger population size.