
Multi-agent learning: Multi-armed bandit algorithms. Gerard Vreeswijk, Intelligent Software Systems, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Thursday 30 April 2020.


  1. MAB problem: random questions. Given: an array of N slot machines.
     1. How long do you stick with a slot machine?
     2. Do you try many machines, or opt for security?
     3. Do you exploit success, or do you explore the possibilities?
     4. Is there something we can assume about the distribution of the payouts? Constant mean? Constant variance? Stationary? Does a machine “shift gears” every now and then?

  2. Experiment.

                 Yield machine 1   Yield machine 2   Yield machine 3
                        8                  7               20
                        8                 11                1
                        8                  8
                        8                  9
                        8
        Average         8                  8.75            10.5
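
     As a quick check of the column averages (a worked computation added here, not part of the original slides; the notation Ȳ_k for the average yield of machine k is mine):

         $$ \bar{Y}_1 = \tfrac{8+8+8+8+8}{5} = 8, \qquad \bar{Y}_2 = \tfrac{7+11+8+9}{4} = 8.75, \qquad \bar{Y}_3 = \tfrac{20+1}{2} = 10.5. $$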

  3. Exploration vs. exploitation.
     Problem. You are at the beginning of a new study year. Every fellow student is interesting as a possible new friend. How do you divide your time between your classmates to optimise your happiness?
     Strategies:
     1. Make friends whe{n|r}ever possible. You are an explorer.
     2. Stick to the nearest fellow-student. You are an exploiter.
     3. What most people would do: first explore, then “exploit”.
     We ignore / abstract away from:
     1. How the quality of friendships is measured.
     2. That personalities of friends may change (so-called “non-stationary search”).

  4. Other practical problems
     ■ Select a restaurant from N alternatives.
     ■ Select a movie channel from N recommendations.
     ■ Distribute load among servers.
     ■ Choose a medical treatment from N alternatives.
     ■ Adaptive routing to optimize network flow.
     ■ Financial portfolio management.
     ■ ...

  5. Computation of the quality (offline version)
     A reasonable measure for the quality of an action a after n tries, Q_n, would be its average payoff. Formula for the quality of an action after n tries:

         $$ Q_n \;=_{\mathrm{Def}}\; \frac{r_1 + \cdots + r_n}{n}. $$

     Data comes in gradually.
     ■ This formula is correct. However, every time Q_n is computed, all r_1, ..., r_n must be retrieved. This is batch learning.
     ■ It would be better to have an update formula that computes the new average based on the old average and the new incoming value. That would be online learning.
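
     A minimal sketch of what batch learning means here (illustrative code, not from the slides; the function name batch_quality is mine): every call recomputes the average from the complete reward history, so all rewards must be stored and retrieved.

     ```python
     def batch_quality(rewards):
         """Batch (offline) estimate: the average of all rewards observed so far.

         Requires the full history r_1, ..., r_n on every call.
         """
         return sum(rewards) / len(rewards)

     history = [7, 11, 8, 9]          # rewards observed for one action
     print(batch_quality(history))    # 8.75
     ```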

  6. Computation of the quality (online version)

     $$
     \begin{aligned}
     Q_n &= \frac{r_1 + \cdots + r_n}{n}
          = \frac{r_1 + \cdots + r_{n-1}}{n} + \frac{r_n}{n} \\
         &= \frac{r_1 + \cdots + r_{n-1}}{n-1} \cdot \frac{n-1}{n} + \frac{r_n}{n} \\
         &= Q_{n-1} \cdot \frac{n-1}{n} + \frac{r_n}{n} \\
         &= Q_{n-1} - \frac{1}{n} Q_{n-1} + \frac{1}{n} r_n \\
         &= Q_{n-1} + \frac{1}{n} \left( r_n - Q_{n-1} \right).
     \end{aligned}
     $$

     Read the last line as: new value = old value + learning rate × (goal value − old value). The difference r_n − Q_{n-1} is the error, and (1/n)(r_n − Q_{n-1}) is the correction.
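
     A minimal online counterpart of the batch computation (illustrative; the function name update_quality and the sample rewards are mine): the running average is updated in place from the old value and the new reward, with learning rate 1/n, and agrees with the batch average.

     ```python
     def update_quality(q_old, r_new, n):
         """Online update: Q_n = Q_{n-1} + (1/n) * (r_n - Q_{n-1})."""
         return q_old + (1.0 / n) * (r_new - q_old)

     q, n = 0.0, 0
     for r in [7, 11, 8, 9]:
         n += 1
         q = update_quality(q, r, n)
     print(q)   # 8.75, identical to the batch average, without storing the history
     ```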

  7. Progress of the quality of one action
     ■ The amplitude of the correction is determined by the learning rate.
     ■ To compute the average, the learning rate is 1/n (it decreases!).
     ■ The learning rate can also be a constant 0 ≤ λ ≤ 1 ⇒ geometric average.
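
     With a constant learning rate λ, the same update rule yields a geometric (exponentially weighted) average that gives more weight to recent rewards. A brief sketch under the slide's notation; the value λ = 0.1 and the starting estimate are arbitrary illustrative choices:

     ```python
     def update_quality_const(q_old, r_new, lam=0.1):
         """Geometric average: Q <- Q + lam * (r - Q) = (1 - lam) * Q + lam * r."""
         return q_old + lam * (r_new - q_old)

     q = 8.0                        # some initial estimate
     for r in [7, 11, 8, 9]:
         q = update_quality_const(q, r)
     print(round(q, 3))             # recent rewards weigh more than older ones
     ```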

  8. Action selection: greedy and ǫ-greedy
     ■ Greedy: exploit the action that is optimal thus far.

       $$ p_i \;=_{\mathrm{Def}}\; \begin{cases} 1 & \text{if } a_i \text{ is optimal thus far,} \\ 0 & \text{otherwise.} \end{cases} $$

     ■ ǫ-Greedy: let 0 < ǫ ≤ 1 be close to 0.
       1. Exploitation: choose an optimal action a fraction (1 − ǫ) of the time.
       2. Exploration: at other times, choose a random action.
       ● Because ∑_{i=1}^∞ ǫ is infinite, it follows from the second Borel-Cantelli lemma that every action is explored infinitely many times a.s. So, by the law of large numbers, the estimated value of an action converges to its true value.
       ● All this holds a.s. (= with probability 1). In particular, it is not certain.
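
     A sketch of ǫ-greedy action selection (my own minimal implementation, not the lecturer's code; the Q-values in the usage line are the toy averages from the experiment): with probability ǫ a uniformly random arm is explored, otherwise an arm with the highest current estimate is exploited.

     ```python
     import random

     def epsilon_greedy(q_values, epsilon=0.1):
         """Return the index of the arm to pull under epsilon-greedy selection."""
         if random.random() < epsilon:                 # explore: any arm, uniformly
             return random.randrange(len(q_values))
         best = max(q_values)                          # exploit: an optimal arm so far
         return random.choice([i for i, q in enumerate(q_values) if q == best])

     arm = epsilon_greedy([8.0, 8.75, 10.5], epsilon=0.1)
     ```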

  9. Action selection: optimistic initial values
     An alternative to ǫ-greedy is to work with optimistic initial values.
     ■ At the outset, an unrealistically high quality is attributed to every slot machine: Q_0^k = high, for 1 ≤ k ≤ N.
     ■ As usual, for every slot machine its average profit is maintained.
     ■ Without exception, always exploit machines with the highest Q-values.
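
     A minimal sketch of this scheme (illustrative code, not from the slides; the constant HIGH, the pull function, and the Gaussian reward model are placeholders I chose): estimates start unrealistically high, the average profit is maintained per machine, and selection is purely greedy. Note that with plain sample averages the optimistic value is overwritten after the first pull, so in this sketch it mainly forces every machine to be tried at least once.

     ```python
     import random

     N = 3
     HIGH = 100.0                        # unrealistically high initial quality (placeholder)
     Q = [HIGH] * N                      # optimistic estimates Q_0^k
     counts = [0] * N

     def pull(k):                        # stand-in for the unknown slot machine k
         return random.gauss([8, 8.75, 10.5][k], 1.0)

     for t in range(1000):
         k = Q.index(max(Q))             # always exploit a machine with the highest Q-value
         r = pull(k)
         counts[k] += 1
         Q[k] += (r - Q[k]) / counts[k]  # maintain the running average profit of machine k
     ```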

  10. Action selection: optimistic initial values. Some questions:
      1. Initially, many actions are tried ⇒ all actions are tried?
      2. How high should “high” be?
      3. Can we still speak of exploration?
      4. ǫ-greedy: Pr(every action is explored infinitely many times) = 1. Does this also hold with optimism?
      5. Is optimism (as a method) suitable to explore an array of (possibly) infinitely many slot machines? Why (not)?

  11. Optimistic initial values vs. ǫ-greedy. [Figure] From: “Reinforcement Learning (...)”, Sutton and Barto, Sec. 2.8, p. 41.

  12. Q-learning
      ■ Q-learning is like ǫ-greedy learning, but with a moving average. Algorithm:
        1. At round t, choose an optimal action uniformly with probability 1 − ǫ.
        2. Update arm i's estimate at round t, Q_i(t), as

           $$ Q_i(t) = \begin{cases} (1 - \lambda)\, Q_i(t-1) + \lambda\, r_i & \text{if arm } i \text{ is pulled with reward } r_i, \\ Q_i(t-1) & \text{otherwise.} \end{cases} $$

      ■ Q-learning possesses two parameters: an exploration rate, ǫ, and a learning (or adaptation) rate, λ. A practical disadvantage of having two parameters is that tuning the algorithm takes more time.
      ■ Exercise: what if ǫ is small and λ is large? And the other way around?
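
      A compact sketch of one round of this algorithm (my own illustrative code; the lambda-defined reward model and the parameter values are placeholders): ǫ-greedy selection combined with a constant-λ moving-average update of the pulled arm only.

      ```python
      import random

      def q_learning_step(Q, epsilon, lam, pull):
          """One round: epsilon-greedy choice, then Q_i(t) = (1-lam)*Q_i(t-1) + lam*r_i."""
          if random.random() < epsilon:
              i = random.randrange(len(Q))          # explore a random arm
          else:
              i = Q.index(max(Q))                   # exploit an optimal arm so far
          r = pull(i)
          Q[i] = (1 - lam) * Q[i] + lam * r         # moving average; only arm i changes
          return i, r

      Q = [0.0, 0.0, 0.0]
      pull = lambda i: random.gauss([8, 8.75, 10.5][i], 1.0)   # placeholder bandit
      for t in range(1000):
          q_learning_step(Q, epsilon=0.1, lam=0.1, pull=pull)
      ```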

  13. Action selection
      ■ Greedily: exploit the action that is optimal thus far.

        $$ p_i \;=_{\mathrm{Def}}\; \begin{cases} 1 & \text{if } a_i \text{ is optimal thus far,} \\ 0 & \text{otherwise.} \end{cases} $$

      ■ Proportional: select randomly, proportional to the expected payoff.

        $$ p_i \;=_{\mathrm{Def}}\; \frac{Q_i}{\sum_{j=1}^{n} Q_j}. $$

      ■ Through softmax (or Boltzmann, or Gibbs, or mixed logit, or quantal response):

        $$ p_i \;=_{\mathrm{Def}}\; \frac{e^{Q_i / \tau}}{\sum_{j=1}^{n} e^{Q_j / \tau}}, $$

        where the parameter τ is often called the temperature.
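
      A small sketch of softmax (Boltzmann) selection (illustrative code; the Q-values are the toy averages from the experiment and the temperatures are arbitrary). It also previews the next slide's point: a low temperature concentrates the probability on the best arm, while a high temperature approaches the uniform distribution.

      ```python
      import math, random

      def softmax_probs(q_values, tau):
          """p_i = exp(Q_i / tau) / sum_j exp(Q_j / tau)."""
          exps = [math.exp(q / tau) for q in q_values]
          total = sum(exps)
          return [e / total for e in exps]

      Q = [8.0, 8.75, 10.5]
      print(softmax_probs(Q, tau=0.5))   # nearly deterministic: the best arm dominates
      print(softmax_probs(Q, tau=50.0))  # close to uniform over the three arms
      arm = random.choices(range(len(Q)), weights=softmax_probs(Q, tau=1.0))[0]
      ```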

  14. Effect of the temperature parameter
      The softmax function:

        $$ p_i \;=_{\mathrm{Def}}\; \frac{e^{Q_i / \tau}}{\sum_{j=1}^{n} e^{Q_j / \tau}}. $$
