An Adaptive Pursuit Strategy for Allocating Operator Probabilities
Dirk Thierens, Department of Computer Science, Universiteit Utrecht, The Netherlands


  1. An Adaptive Pursuit Strategy for Allocating Operator Probabilities
     Dirk Thierens
     Department of Computer Science
     Universiteit Utrecht, The Netherlands

  2. Outline
     1. Adaptive Operator Allocation
     2. Probability Matching
     3. Adaptive Pursuit Strategy
     4. Experiments
     5. Conclusion

  3. Adaptive Operator Allocation: What?
     Given:
     1. A set of K operators A = {a_1, ..., a_K}
     2. A probability vector P(t) = {P_1(t), ..., P_K(t)}: operator a_i is applied at time t in proportion to probability P_i(t)
     3. The environment returns rewards R_i(t) ≥ 0
     Goal: adapt P(t) such that the expected value of the cumulative reward E[R] = E[ Σ_{t=1}^{T} R(t) ] is maximized

  4. Adaptive Operator Allocation: Why?
     1. The probability of applying an operator is difficult to determine a priori
     2. It depends on the current state of the search process
     → An adaptive allocation rule specifies how the probabilities are adapted according to the performance of the operators

  5. Adaptive Operator Allocation: Requirements
     1. Non-stationary environment ⇒ operator probabilities need to be adapted continuously
     2. Stationary environment ⇒ operator probabilities should converge to the best-performing operator
     → conflicting goals!

  6. Probability Matching: Main Idea
     An adaptive allocation rule often applied in the GA literature: the probability matching strategy
     Main idea: update P(t) such that the probability of applying operator a_i matches the proportion of its estimated reward Q_i(t) to the sum of all reward estimates Σ_{a=1}^{K} Q_a(t)

  7. Probability Matching: Reward Estimate
     The adaptive allocation rule computes an estimate of the rewards received when applying an operator
     In non-stationary environments older rewards should have less influence
     Exponential, recency-weighted average (0 < α < 1):
     Q_a(t+1) = Q_a(t) + α [R_a(t) − Q_a(t)]
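The recency-weighted update above can be sketched in a few lines of Python; the helper name `update_estimate` and the value of α are illustrative assumptions, not from the slides:

```python
# Exponential, recency-weighted average from slide 7:
#   Q_a(t+1) = Q_a(t) + alpha * (R_a(t) - Q_a(t)),  0 < alpha < 1
def update_estimate(q, reward, alpha=0.8):
    # A larger alpha weights recent rewards more heavily.
    return q + alpha * (reward - q)

q = 1.0                        # Q_a(0) = 1.0, as in the algorithm on slide 9
for r in [10, 10, 10]:         # three constant rewards
    q = update_estimate(q, r)
print(q)                       # approaches the constant reward 10 geometrically
```

With a constant reward the estimation error shrinks by a factor (1 − α) per application, which is why the estimates converge quickly in the stationary case.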

  8. Probability Matching: Probability Adaptation
     In non-stationary environments the probability of applying any operator should never fall below some minimal threshold P_min > 0
     For K operators the maximal probability is P_max = 1 − (K − 1) P_min
     Updating rule for P(t):
     P_a(t) = P_min + (1 − K · P_min) Q_a(t) / Σ_{i=1}^{K} Q_i(t)

  9. Probability Matching: Algorithm
     PROBABILITYMATCHING(P, Q, K, P_min, α)
       for i ← 1 to K
           do P_i(0) ← 1/K; Q_i(0) ← 1.0
       while NOTTERMINATED?()
           do a_s ← PROPORTIONALSELECTOPERATOR(P)
              R_as(t) ← GETREWARD(a_s)
              Q_as(t+1) ← Q_as(t) + α [R_as(t) − Q_as(t)]
              for a ← 1 to K
                  do P_a(t+1) ← P_min + (1 − K · P_min) Q_a(t+1) / Σ_{i=1}^{K} Q_i(t+1)
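A minimal Python transcription of the pseudocode above; the callback `get_reward`, the parameter defaults, and the step count are assumptions for illustration:

```python
import random

def probability_matching(get_reward, K, p_min=0.1, alpha=0.8, steps=1000):
    """Probability matching: P_a tracks Q_a / sum(Q), floored at p_min."""
    P = [1.0 / K] * K          # P_i(0) = 1/K
    Q = [1.0] * K              # Q_i(0) = 1.0
    for _ in range(steps):
        a = random.choices(range(K), weights=P)[0]  # proportional selection
        Q[a] += alpha * (get_reward(a) - Q[a])      # recency-weighted estimate
        total = sum(Q)
        P = [p_min + (1 - K * p_min) * q / total for q in Q]
    return P
```

With the constant rewards of slide 10 (R_1 = 10, R_2 = 9), `probability_matching(lambda a: [10, 9][a], K=2)` settles near P_1 ≈ 0.52 rather than the desired 0.9.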

  10. Probability Matching: Problem
      Assume one operator is consistently better
      For instance, 2 operators a_1 and a_2 with constant rewards R_1 = 10 and R_2 = 9
      If P_min = 0.1 we would like to apply operator a_1 with probability P_1 = 0.9 and operator a_2 with P_2 = 0.1
      Yet, the probability matching allocation rule will converge to P_1 ≈ 0.52 and P_2 ≈ 0.48!
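The claimed limit follows directly from the update rule on slide 8: with constant rewards, Q_1 and Q_2 converge to 10 and 9, and plugging these into the rule gives the fixed point. A quick check:

```python
# Fixed point of the probability-matching rule for Q = (10, 9), P_min = 0.1.
K, p_min = 2, 0.1
Q = [10.0, 9.0]
P = [p_min + (1 - K * p_min) * q / sum(Q) for q in Q]
print(P)  # [0.1 + 0.8*10/19, 0.1 + 0.8*9/19] ≈ [0.521, 0.479]
```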

  11. Adaptive Pursuit Strategy: Pursuit Method
      The pursuit algorithm is a rapidly converging algorithm applied in learning automata
      Main idea: update P(t) such that the operator a* that currently has the maximal estimated reward Q_a*(t) is pursued
      To achieve this, the pursuit method increases the selection probability P_a*(t) and decreases all other probabilities P_a(t), ∀a ≠ a*
      The adaptive pursuit algorithm extends the pursuit algorithm to make it applicable in non-stationary environments

  12. Adaptive Pursuit Strategy: Adaptive Pursuit Method
      Similar to probability matching:
      1. The adaptive pursuit algorithm proportionally selects an operator to execute according to the probability vector P(t)
      2. The estimated reward of the selected operator is updated with:
         Q_a(t+1) = Q_a(t) + α [R_a(t) − Q_a(t)]
      Different from probability matching:
      1. The selection probability vector P(t) is adapted in a greedy way

  13. Adaptive Pursuit Strategy: Probability Adaptation
      The selection probability of the current best operator a* = argmax_a [Q_a(t+1)] is increased (0 < β < 1):
      P_a*(t+1) = P_a*(t) + β [P_max − P_a*(t)]
      The selection probability of the other operators is decreased:
      ∀a ≠ a*: P_a(t+1) = P_a(t) + β [P_min − P_a(t)]

  14. Adaptive Pursuit Strategy: Probability Adaptation
      Note that the probabilities still sum to one:
      Σ_{a=1}^{K} P_a(t+1)
        = P_a*(t) + β [P_max − P_a*(t)] + Σ_{a=1, a≠a*}^{K} ( P_a(t) + β [P_min − P_a(t)] )
        = (1 − β) Σ_{a=1}^{K} P_a(t) + β [P_max + (K − 1) P_min]
        = (1 − β) Σ_{a=1}^{K} P_a(t) + β
        = 1

  15. Adaptive Pursuit Strategy: Algorithm
      ADAPTIVEPURSUIT(P, Q, K, P_min, α, β)
        P_max ← 1 − (K − 1) P_min
        for i ← 1 to K
            do P_i(0) ← 1/K; Q_i(0) ← 1.0
        while NOTTERMINATED?()
            do a_s ← PROPORTIONALSELECTOPERATOR(P)
               R_as(t) ← GETREWARD(a_s)
               Q_as(t+1) ← Q_as(t) + α [R_as(t) − Q_as(t)]
               a* ← ARGMAX_a(Q_a(t+1))
               P_a*(t+1) ← P_a*(t) + β [P_max − P_a*(t)]
               for a ← 1 to K
                   do if a ≠ a*
                          then P_a(t+1) ← P_a(t) + β [P_min − P_a(t)]
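A direct Python transcription of the pseudocode, again with a reward callback and illustrative parameter defaults that are not prescribed by the slides:

```python
import random

def adaptive_pursuit(get_reward, K, p_min=0.1, alpha=0.8, beta=0.8, steps=1000):
    """Adaptive pursuit: greedily push P toward P_max for the best-estimated operator."""
    p_max = 1 - (K - 1) * p_min
    P = [1.0 / K] * K          # P_i(0) = 1/K
    Q = [1.0] * K              # Q_i(0) = 1.0
    for _ in range(steps):
        a = random.choices(range(K), weights=P)[0]  # proportional selection
        Q[a] += alpha * (get_reward(a) - Q[a])      # recency-weighted estimate
        best = max(range(K), key=lambda i: Q[i])    # a* = argmax_a Q_a(t+1)
        P = [p + beta * ((p_max if i == best else p_min) - p)
             for i, p in enumerate(P)]
    return P
```

For the constant rewards R_1 = 10, R_2 = 9 of slide 16 this drives P_1 to P_max = 0.9 and P_2 to P_min = 0.1, unlike probability matching.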

  16. Adaptive Pursuit Strategy: Example
      Consider again the 2-operator stationary environment with R_1 = 10 and R_2 = 9 (P_min = 0.1)
      As opposed to the probability matching rule, the adaptive pursuit method will play the better operator a_1 with maximum probability P_max = 0.9
      It also keeps playing the poorer operator a_2 with minimal probability P_min = 0.1 in order to maintain its ability to adapt to any change in the reward distribution

  17. Experiments: Environment
      We consider an environment with 5 operators a_i, i = 1...5
      Each operator a_i receives a uniformly distributed reward R_i between the boundaries R_i = U[i − 1, i + 1]:

      Operator   R_1      R_2      R_3      R_4      R_5
      Reward     U[0,2]   U[1,3]   U[2,4]   U[3,5]   U[4,6]

      After a fixed time interval ∆T the reward distributions are randomly reassigned to the operators

  18. Experiments: Upper Bounds to Performance
      If we had full knowledge of the reward distributions and their switching pattern we could always pick the optimal operator a* and achieve an expected reward E[R_Opt] = 5
      The performance in the stationary (non-switching) environment of a correctly converged operator allocation scheme represents an upper bound to the optimal performance in the switching environment
      3 allocation strategies:
      1. Non-adaptive, equal-probability allocation rule
      2. Probability matching allocation rule (P_min = 0.1)
      3. Adaptive pursuit allocation rule (P_min = 0.1)

  19. Experiments: Non-Adaptive, Equal-Probability Allocation Rule
      The probability of choosing the optimal operator a*_Fixed:
      Prob[a_s = a*_Fixed] = 1/K = 0.2
      The expected reward:
      E[R_Fixed] = Σ_{a=1}^{K} E[R_a] Prob[a_s = a] = Σ_{a=1}^{K} E[R_a] / K = 3
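A quick numeric check of this slide, using E[R_i] = i for R_i = U[i − 1, i + 1]:

```python
E_R = [1, 2, 3, 4, 5]       # E[R_i] = i since R_i ~ U[i-1, i+1]
K = len(E_R)
prob_best = 1 / K           # optimal operator chosen with probability 1/K
E_fixed = sum(E_R) / K      # E[R_Fixed] = (1/K) * sum of E[R_a]
print(prob_best, E_fixed)   # 0.2 3.0
```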

  20. Experiments: Probability Matching Allocation Rule
      The probability of choosing the optimal operator a*_ProbMatch:
      Prob[a_s = a*_ProbMatch] = P_min + (1 − K · P_min) E[R_a*] / Σ_{a=1}^{K} E[R_a] = 0.2666...
      The expected reward:
      E[R_ProbMatch] = Σ_{a=1}^{K} E[R_a] Prob[a_s = a]
                     = Σ_{a=1}^{K} E[R_a] [ P_min + (1 − K · P_min) E[R_a] / Σ_{i=1}^{K} E[R_i] ]
                     = 3.333...
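The same check for the probability-matching numbers, assuming the estimates Q_a have converged to the expected rewards E[R_a]:

```python
E_R = [1, 2, 3, 4, 5]
K, p_min = len(E_R), 0.1
total = sum(E_R)                                            # 15
# Converged probability-matching allocation (slide 8 rule with Q_a = E[R_a]):
prob = [p_min + (1 - K * p_min) * r / total for r in E_R]
E_pm = sum(r * p for r, p in zip(E_R, prob))                # expected reward
print(prob[-1], E_pm)   # 0.2666..., 3.333...
```

The gap between 3.333 and the converged-optimum value 5 is what the adaptive pursuit rule is designed to close.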
