MAB problem: random questions

Given: an array of N slot machines.

Random questions:
1. How long do you stick with a slot machine?
2. Try many machines, or opt for security?
3. Do you exploit success, or do you explore the possibilities?
4. Is there something we can assume about the distribution of the payouts? Constant mean? Constant variance? Stationary? Does a machine "shift gears" every now and then?
Experiment   Yield Machine 1   Yield Machine 2   Yield Machine 3
    1               8                 7                20
    2               8                11                 1
    3               8                 8
    4               8                 9
    5               8
Average             8              8.75              10.5
Exploration vs. exploitation

Problem. You are at the beginning of a new study year. Every fellow student is interesting as a possible new friend. How do you divide your time between your classmates to optimise your happiness?

Strategies:
1. Make friends whe{n|r}ever possible. You are an explorer.
2. Stick to the nearest fellow-student. You are an exploiter.
3. What most people would do: first explore, then "exploit".

We ignore / abstract away from:
1. How the quality of friendships is measured.
2. That personalities of friends may change (so-called "non-stationary search").
Other practical problems

■ Select a restaurant from N alternatives.
■ Select a movie channel from N recommendations.
■ Distribute load among servers.
■ Choose a medical treatment from N alternatives.
■ Adaptive routing to optimize network flow.
■ Financial portfolio management.
■ ...
Computation of the quality (offline version)

A reasonable measure for the quality of an action a after n tries, Q_n, would be its average payoff.

Formula for the quality of an action after n tries:
\[
Q_n =_{\text{Def}} \frac{r_1 + \cdots + r_n}{n}
\]

Data comes in gradually.
■ This formula is correct. However, every time Q_n is computed, all r_1, ..., r_n must be retrieved. This is batch learning.
■ It would be better to have an update formula that computes the new average based on the old average and the new incoming value. That would be online learning.
Computation of the quality (online version)

\[
\begin{aligned}
Q_n &= \frac{r_1 + \cdots + r_n}{n}
     = \frac{r_1 + \cdots + r_{n-1}}{n} + \frac{r_n}{n} \\
    &= \frac{r_1 + \cdots + r_{n-1}}{n-1} \cdot \frac{n-1}{n} + \frac{r_n}{n} \\
    &= Q_{n-1} \cdot \frac{n-1}{n} + \frac{r_n}{n} \\
    &= \Bigl(1 - \frac{1}{n}\Bigr) Q_{n-1} + \frac{1}{n}\, r_n \\
    &= \underbrace{Q_{n-1}}_{\text{old value}}
       + \underbrace{\tfrac{1}{n}}_{\text{learning rate}}
         \bigl(\underbrace{r_n}_{\text{goal value}} - \underbrace{Q_{n-1}}_{\text{old value}}\bigr).
\end{aligned}
\]

The left-hand side Q_n is the new value; the difference r_n − Q_{n−1} is the error, and the entire last term (learning rate × error) is the correction.
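To make the batch/online contrast concrete, here is a minimal Python sketch (illustrative, not from the slides) of the incremental update Q_n = Q_{n−1} + (1/n)(r_n − Q_{n−1}), checked against the batch average.

```python
# Minimal sketch of the online (incremental) average, assuming rewards arrive
# one at a time. Function and variable names are illustrative.

def update_average(q_prev: float, reward: float, n: int) -> float:
    """Q_n = Q_{n-1} + (1/n) * (r_n - Q_{n-1})."""
    return q_prev + (reward - q_prev) / n

rewards = [8, 7, 20, 8, 11]      # some payoffs; only the running average is stored
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update_average(q, r, n)

# The online result coincides with the batch average (r_1 + ... + r_n) / n.
assert abs(q - sum(rewards) / len(rewards)) < 1e-12
print(q)   # 10.8
```

Only the previous estimate Q_{n−1} and the counter n need to be stored, which is exactly the point of the online formulation.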
Progress of the quality of one action

■ The amplitude of the correction is determined by the learning rate.
■ To compute the average, the learning rate is 1/n (it decreases!).
■ The learning rate can also be a constant 0 ≤ λ ≤ 1 ⇒ a geometric (recency-weighted) average.
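A small sketch (illustrative values, not from the slides) contrasting the two choices: the 1/n rate converges to the plain average, while a constant λ yields a geometric, recency-weighted average that can track a machine that "shifts gears".

```python
# Sketch: sample-average (rate 1/n) vs. constant learning rate (rate = lam).
# Assumes rewards arrive as a stream; names and values are illustrative.

def running_estimates(rewards, lam=0.1, q0=0.0):
    q_avg, q_geo = q0, q0
    for n, r in enumerate(rewards, start=1):
        q_avg += (r - q_avg) / n       # 1/n: converges to the overall mean
        q_geo += lam * (r - q_geo)     # constant lambda: geometric weighting,
                                       # recent rewards count more
    return q_avg, q_geo

# A machine that "shifts gears": the payoff jumps from 8 to 12 halfway through.
stream = [8] * 50 + [12] * 50
print(running_estimates(stream, lam=0.1))   # the geometric average adapts faster
```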
Action selection: greedy and epsilon-greedy

■ Greedy: exploit the action that is optimal thus far.
\[
p_i =_{\text{Def}}
\begin{cases}
1 & \text{if } a_i \text{ is optimal thus far,} \\
0 & \text{otherwise.}
\end{cases}
\]
■ ε-Greedy: let 0 < ε ≤ 1 be close to 0.
  1. Exploitation: choose an optimal action a fraction (1 − ε) of the time.
  2. Exploration: at other times, choose a random action.
  ● Because ∑_{i=1}^∞ ε is infinite, it follows from the second Borel–Cantelli lemma that every action is explored infinitely many times a.s. So, by the law of large numbers, the estimated value of an action converges to its true value.
  ● All this holds a.s. (= with probability 1). In particular, it is not certain.
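A minimal ε-greedy sketch (the slides fix no implementation; the parameter values and the Gaussian payout model are assumptions): with probability 1 − ε pull a currently optimal arm, otherwise pull a uniformly random arm, and update that arm's sample average incrementally.

```python
import random

# Sketch of epsilon-greedy bandit play with sample-average estimates.
# epsilon, n_rounds and the payout functions are illustrative assumptions.

def epsilon_greedy(payout_fns, epsilon=0.1, n_rounds=1000, seed=0):
    rng = random.Random(seed)
    n_arms = len(payout_fns)
    q = [0.0] * n_arms           # estimated value per arm
    counts = [0] * n_arms        # number of pulls per arm
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:                                  # explore
            i = rng.randrange(n_arms)
        else:                                                       # exploit, ties broken randomly
            best = max(q)
            i = rng.choice([k for k in range(n_arms) if q[k] == best])
        r = payout_fns[i](rng)
        counts[i] += 1
        q[i] += (r - q[i]) / counts[i]                              # incremental average
        total += r
    return q, total

# Example: three machines with Gaussian payouts of different means.
arms = [lambda rng: rng.gauss(8, 1),
        lambda rng: rng.gauss(8.75, 1),
        lambda rng: rng.gauss(10.5, 1)]
print(epsilon_greedy(arms))
```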
Action selection: optimistic initial values

An alternative for ε-greedy is to work with optimistic initial values.
■ At the outset, an unrealistically high quality is attributed to every slot machine: Q_0^k = high, for 1 ≤ k ≤ N.
■ As usual, for every slot machine its average profit is maintained.
■ Without exception, always exploit machines with the highest Q-values.

Some questions:
1. Initially, many actions are tried ⇒ all actions are tried?
2. How high should "high" be?
3. Can we still speak of exploration?
4. ε-greedy: Pr(every action is explored infinitely many times) = 1. Also with optimism?
5. Is optimism (as a method) suitable to explore an array of (possibly) infinitely many slot machines? Why (not)?
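For comparison, a sketch of the optimistic-initial-values strategy (the value of "high" and the payout model are illustrative assumptions): every estimate starts far above any plausible payoff, and selection is purely greedy thereafter.

```python
import random

# Sketch: greedy play with optimistic initial values.
# Q0_HIGH and the payout distributions are illustrative assumptions.

Q0_HIGH = 100.0    # "unrealistically high" initial quality

def optimistic_greedy(payout_fns, n_rounds=1000, seed=0):
    rng = random.Random(seed)
    n_arms = len(payout_fns)
    q = [Q0_HIGH] * n_arms
    counts = [0] * n_arms
    for _ in range(n_rounds):
        best = max(q)
        i = rng.choice([k for k in range(n_arms) if q[k] == best])   # always exploit
        r = payout_fns[i](rng)
        counts[i] += 1
        q[i] += (r - q[i]) / counts[i]   # early updates drag q down towards the
                                         # real payoff, so the other, still-optimistic
                                         # arms get their turn
    return q, counts

arms = [lambda rng: rng.gauss(8, 1),
        lambda rng: rng.gauss(8.75, 1),
        lambda rng: rng.gauss(10.5, 1)]
print(optimistic_greedy(arms))
```

Each pull pulls the estimate of that arm down towards its real payoff, which is how the method buys its initial exploration despite acting greedily.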
Optimistic initial values vs. ε-greedy

[Figure omitted.] From: "Reinforcement Learning (...)", Sutton and Barto, Sec. 2.8, p. 41.
Q-learning

■ Q-learning is like ε-greedy learning, but with a moving average. Algorithm:
  1. At round t, choose an optimal action (uniformly, in case of ties) with probability 1 − ε; otherwise choose a random action.
  2. Update Arm i's estimate at round t, Q_i(t), as
\[
Q_i(t) =
\begin{cases}
(1-\lambda)\, Q_i(t-1) + \lambda\, r_i & \text{if Arm } i \text{ is pulled with reward } r_i, \\
Q_i(t-1) & \text{otherwise.}
\end{cases}
\]
■ Q-learning possesses two parameters: an exploration rate, ε, and a learning (or adaptation) rate, λ. A practical disadvantage of having two parameters is that tuning the algorithm takes more time.
■ Exercise: what if ε is small and λ is large? The other way around?
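A compact sketch of this two-parameter scheme (the values of ε and λ are illustrative): selection is ε-greedy as before, but the pulled arm is updated with the constant-rate moving average from the formula above.

```python
import random

# Sketch: epsilon-greedy selection combined with a constant-rate (moving-average)
# update, i.e. the two-parameter scheme described above. epsilon and lam are
# illustrative values.

def q_bandit(payout_fns, epsilon=0.1, lam=0.2, n_rounds=1000, seed=0):
    rng = random.Random(seed)
    q = [0.0] * len(payout_fns)
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            i = rng.randrange(len(q))                                   # explore
        else:
            best = max(q)
            i = rng.choice([k for k in range(len(q)) if q[k] == best])  # exploit
        r = payout_fns[i](rng)
        q[i] = (1 - lam) * q[i] + lam * r    # moving average: tracks drifting payoffs
    return q

arms = [lambda rng: rng.gauss(8, 1), lambda rng: rng.gauss(10.5, 1)]
print(q_bandit(arms))
```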
Action selection

■ Greedily: exploit the action that is optimal thus far.
\[
p_i =_{\text{Def}}
\begin{cases}
1 & \text{if } a_i \text{ is optimal thus far,} \\
0 & \text{otherwise.}
\end{cases}
\]
■ Proportional: select randomly, proportional to the expected payoff.
\[
p_i =_{\text{Def}} \frac{Q_i}{\sum_{j=1}^{n} Q_j} .
\]
■ Through softmax (or Boltzmann, or Gibbs, or mixed logit, or quantal response):
\[
p_i =_{\text{Def}} \frac{e^{Q_i/\tau}}{\sum_{j=1}^{n} e^{Q_j/\tau}} ,
\]
where the parameter τ is often called the temperature.
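A minimal softmax-selection sketch (illustrative; subtracting max(Q) before exponentiating is a standard numerical-stability trick, not something the slides prescribe):

```python
import math
import random

# Sketch of softmax / Boltzmann action selection over estimated values q.
# tau (the temperature) is an illustrative parameter.

def softmax_probs(q, tau=1.0):
    m = max(q)                                          # shift for numerical stability
    weights = [math.exp((qi - m) / tau) for qi in q]
    z = sum(weights)
    return [w / z for w in weights]

def softmax_select(q, tau=1.0, rng=random):
    return rng.choices(range(len(q)), weights=softmax_probs(q, tau), k=1)[0]

q = [8.0, 8.75, 10.5]
print(softmax_probs(q, tau=1.0))    # strongly favours the third arm
print(softmax_probs(q, tau=10.0))   # high temperature: close to uniform
```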
Effect of the temperature parameter

The softmax function:
\[
p_i =_{\text{Def}} \frac{e^{Q_i/\tau}}{\sum_{j=1}^{n} e^{Q_j/\tau}} .
\]
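A quick worked illustration using the average yields from the table earlier (the original slide's plot is not reproduced): for Q = (8, 8.75, 10.5),
\[
\tau = 1: \quad p \approx (0.07,\ 0.14,\ 0.80), \qquad
\tau = 10: \quad p \approx (0.30,\ 0.32,\ 0.38).
\]
As τ → 0, softmax approaches greedy selection; as τ → ∞, it approaches the uniform distribution.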