
Multi-agent learning: Multi-armed bandit algorithms. Gerard Vreeswijk, Intelligent Software Systems, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Thursday 30 April 2020.


  1. MAB problem: random questions. Given: an array of N slot machines.
     1. How long do you stick with a slot machine?
     2. Do you try many machines, or opt for security?
     3. Do you exploit success, or do you explore the possibilities?
     4. Is there something we can assume about the distribution of the payouts? Constant mean? Constant variance? Stationary? Does a machine “shift gears” every now and then?

  2. Experiment.

                 Yield machine 1   Yield machine 2   Yield machine 3
                        8                  7               20
                        8                 11                1
                        8                  8
                        8                  9
                        8
        Average         8                  8.75            10.5
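
     As a quick check of the column averages (a worked computation added here, not part of the original slides; the notation Ȳ_k for the average yield of machine k is mine):

         $$ \bar{Y}_1 = \tfrac{8+8+8+8+8}{5} = 8, \qquad \bar{Y}_2 = \tfrac{7+11+8+9}{4} = 8.75, \qquad \bar{Y}_3 = \tfrac{20+1}{2} = 10.5. $$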

  3. Exploration vs. exploitation.
     Problem. You are at the beginning of a new study year. Every fellow student is interesting as a possible new friend. How do you divide your time between your classmates to optimise your happiness?
     Strategies:
     1. Make friends whe{n|r}ever possible. You are an explorer.
     2. Stick to the nearest fellow-student. You are an exploiter.
     3. What most people would do: first explore, then “exploit”.
     We ignore / abstract away from:
     1. How the quality of friendships is measured.
     2. That personalities of friends may change (so-called “non-stationary search”).

  4. Other practical problems
     ■ Select a restaurant from N alternatives.
     ■ Select a movie channel from N recommendations.
     ■ Distribute load among servers.
     ■ Choose a medical treatment from N alternatives.
     ■ Adaptive routing to optimize network flow.
     ■ Financial portfolio management.
     ■ ...

  5. Computation of the quality (offline version)
     A reasonable measure for the quality of an action a after n tries, Q_n, would be its average payoff. Formula for the quality of an action after n tries:

         $$ Q_n \;=_{\mathrm{Def}}\; \frac{r_1 + \cdots + r_n}{n}. $$

     Data comes in gradually.
     ■ This formula is correct. However, every time Q_n is computed, all r_1, ..., r_n must be retrieved. This is batch learning.
     ■ It would be better to have an update formula that computes the new average based on the old average and the new incoming value. That would be online learning.
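
     A minimal sketch of what batch learning means here (illustrative code, not from the slides; the function name batch_quality is mine): every call recomputes the average from the complete reward history, so all rewards must be stored and retrieved.

     ```python
     def batch_quality(rewards):
         """Batch (offline) estimate: the average of all rewards observed so far.

         Requires the full history r_1, ..., r_n on every call.
         """
         return sum(rewards) / len(rewards)

     history = [7, 11, 8, 9]          # rewards observed for one action
     print(batch_quality(history))    # 8.75
     ```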

  6. Computation of the quality (online version)

     $$
     \begin{aligned}
     Q_n &= \frac{r_1 + \cdots + r_n}{n}
          = \frac{r_1 + \cdots + r_{n-1}}{n} + \frac{r_n}{n} \\
         &= \frac{r_1 + \cdots + r_{n-1}}{n-1} \cdot \frac{n-1}{n} + \frac{r_n}{n} \\
         &= Q_{n-1} \cdot \frac{n-1}{n} + \frac{r_n}{n} \\
         &= Q_{n-1} - \frac{1}{n} Q_{n-1} + \frac{1}{n} r_n \\
         &= Q_{n-1} + \frac{1}{n} \left( r_n - Q_{n-1} \right).
     \end{aligned}
     $$

     Read the last line as: new value = old value + learning rate × (goal value − old value). The difference r_n − Q_{n-1} is the error, and (1/n)(r_n − Q_{n-1}) is the correction.
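
     A minimal online counterpart of the batch computation (illustrative; the function name update_quality and the sample rewards are mine): the running average is updated in place from the old value and the new reward, with learning rate 1/n, and agrees with the batch average.

     ```python
     def update_quality(q_old, r_new, n):
         """Online update: Q_n = Q_{n-1} + (1/n) * (r_n - Q_{n-1})."""
         return q_old + (1.0 / n) * (r_new - q_old)

     q, n = 0.0, 0
     for r in [7, 11, 8, 9]:
         n += 1
         q = update_quality(q, r, n)
     print(q)   # 8.75, identical to the batch average, without storing the history
     ```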

  7. Progress of the quality of one action
     ■ The amplitude of the correction is determined by the learning rate.
     ■ To compute the average, the learning rate is 1/n (it decreases!).
     ■ The learning rate can also be a constant 0 ≤ λ ≤ 1 ⇒ geometric average.
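
     With a constant learning rate λ, the same update rule yields a geometric (exponentially weighted) average that gives more weight to recent rewards. A brief sketch under the slide's notation; the value λ = 0.1 and the starting estimate are arbitrary illustrative choices:

     ```python
     def update_quality_const(q_old, r_new, lam=0.1):
         """Geometric average: Q <- Q + lam * (r - Q) = (1 - lam) * Q + lam * r."""
         return q_old + lam * (r_new - q_old)

     q = 8.0                        # some initial estimate
     for r in [7, 11, 8, 9]:
         q = update_quality_const(q, r)
     print(round(q, 3))             # recent rewards weigh more than older ones
     ```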

  8. Action selection: greedy and ǫ-greedy
     ■ Greedy: exploit the action that is optimal thus far.

       $$ p_i \;=_{\mathrm{Def}}\; \begin{cases} 1 & \text{if } a_i \text{ is optimal thus far,} \\ 0 & \text{otherwise.} \end{cases} $$

     ■ ǫ-Greedy: let 0 < ǫ ≤ 1 be close to 0.
       1. Exploitation: choose an optimal action a fraction (1 − ǫ) of the time.
       2. Exploration: at other times, choose a random action.
       ● Because ∑_{i=1}^∞ ǫ is infinite, it follows from the second Borel-Cantelli lemma that every action is explored infinitely many times a.s. So, by the law of large numbers, the estimated value of an action converges to its true value.
       ● All this holds a.s. (= with probability 1). In particular, it is not certain.
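
     A sketch of ǫ-greedy action selection (my own minimal implementation, not the lecturer's code; the Q-values in the usage line are the toy averages from the experiment): with probability ǫ a uniformly random arm is explored, otherwise an arm with the highest current estimate is exploited.

     ```python
     import random

     def epsilon_greedy(q_values, epsilon=0.1):
         """Return the index of the arm to pull under epsilon-greedy selection."""
         if random.random() < epsilon:                 # explore: any arm, uniformly
             return random.randrange(len(q_values))
         best = max(q_values)                          # exploit: an optimal arm so far
         return random.choice([i for i, q in enumerate(q_values) if q == best])

     arm = epsilon_greedy([8.0, 8.75, 10.5], epsilon=0.1)
     ```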

  9. Action selection: optimistic initial values
     An alternative to ǫ-greedy is to work with optimistic initial values.
     ■ At the outset, an unrealistically high quality is attributed to every slot machine: Q_0^k = high, for 1 ≤ k ≤ N.
     ■ As usual, for every slot machine its average profit is maintained.
     ■ Without exception, always exploit machines with the highest Q-values.
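
     A minimal sketch of this scheme (illustrative code, not from the slides; the constant HIGH, the pull function, and the Gaussian reward model are placeholders I chose): estimates start unrealistically high, the average profit is maintained per machine, and selection is purely greedy. Note that with plain sample averages the optimistic value is overwritten after the first pull, so in this sketch it mainly forces every machine to be tried at least once.

     ```python
     import random

     N = 3
     HIGH = 100.0                        # unrealistically high initial quality (placeholder)
     Q = [HIGH] * N                      # optimistic estimates Q_0^k
     counts = [0] * N

     def pull(k):                        # stand-in for the unknown slot machine k
         return random.gauss([8, 8.75, 10.5][k], 1.0)

     for t in range(1000):
         k = Q.index(max(Q))             # always exploit a machine with the highest Q-value
         r = pull(k)
         counts[k] += 1
         Q[k] += (r - Q[k]) / counts[k]  # maintain the running average profit of machine k
     ```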

  10. Action selection: optimistic initial values. Some questions:
      1. Initially, many actions are tried ⇒ all actions are tried?
      2. How high should “high” be?
      3. Can we still speak of exploration?
      4. ǫ-greedy: Pr(every action is explored infinitely many times) = 1. Does this also hold with optimism?
      5. Is optimism (as a method) suitable to explore an array of (possibly) infinitely many slot machines? Why (not)?

  11. Optimistic initial values vs. ǫ-greedy. [Figure] From: “Reinforcement Learning (...)”, Sutton and Barto, Sec. 2.8, p. 41.

  12. Q-learning
      ■ Q-learning is like ǫ-greedy learning, but with a moving average. Algorithm:
        1. At round t, choose an optimal action uniformly with probability 1 − ǫ.
        2. Update arm i's estimate at round t, Q_i(t), as

           $$ Q_i(t) = \begin{cases} (1 - \lambda)\, Q_i(t-1) + \lambda\, r_i & \text{if arm } i \text{ is pulled with reward } r_i, \\ Q_i(t-1) & \text{otherwise.} \end{cases} $$

      ■ Q-learning possesses two parameters: an exploration rate, ǫ, and a learning (or adaptation) rate, λ. A practical disadvantage of having two parameters is that tuning the algorithm takes more time.
      ■ Exercise: what if ǫ is small and λ is large? And the other way around?
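
      A compact sketch of one round of this algorithm (my own illustrative code; the lambda-defined reward model and the parameter values are placeholders): ǫ-greedy selection combined with a constant-λ moving-average update of the pulled arm only.

      ```python
      import random

      def q_learning_step(Q, epsilon, lam, pull):
          """One round: epsilon-greedy choice, then Q_i(t) = (1-lam)*Q_i(t-1) + lam*r_i."""
          if random.random() < epsilon:
              i = random.randrange(len(Q))          # explore a random arm
          else:
              i = Q.index(max(Q))                   # exploit an optimal arm so far
          r = pull(i)
          Q[i] = (1 - lam) * Q[i] + lam * r         # moving average; only arm i changes
          return i, r

      Q = [0.0, 0.0, 0.0]
      pull = lambda i: random.gauss([8, 8.75, 10.5][i], 1.0)   # placeholder bandit
      for t in range(1000):
          q_learning_step(Q, epsilon=0.1, lam=0.1, pull=pull)
      ```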

  13. Action selection
      ■ Greedily: exploit the action that is optimal thus far.

        $$ p_i \;=_{\mathrm{Def}}\; \begin{cases} 1 & \text{if } a_i \text{ is optimal thus far,} \\ 0 & \text{otherwise.} \end{cases} $$

      ■ Proportional: select randomly, proportional to the expected payoff.

        $$ p_i \;=_{\mathrm{Def}}\; \frac{Q_i}{\sum_{j=1}^{n} Q_j}. $$

      ■ Through softmax (or Boltzmann, or Gibbs, or mixed logit, or quantal response):

        $$ p_i \;=_{\mathrm{Def}}\; \frac{e^{Q_i / \tau}}{\sum_{j=1}^{n} e^{Q_j / \tau}}, $$

        where the parameter τ is often called the temperature.
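
      A small sketch of softmax (Boltzmann) selection (illustrative code; the Q-values are the toy averages from the experiment and the temperatures are arbitrary). It also previews the next slide's point: a low temperature concentrates the probability on the best arm, while a high temperature approaches the uniform distribution.

      ```python
      import math, random

      def softmax_probs(q_values, tau):
          """p_i = exp(Q_i / tau) / sum_j exp(Q_j / tau)."""
          exps = [math.exp(q / tau) for q in q_values]
          total = sum(exps)
          return [e / total for e in exps]

      Q = [8.0, 8.75, 10.5]
      print(softmax_probs(Q, tau=0.5))   # nearly deterministic: the best arm dominates
      print(softmax_probs(Q, tau=50.0))  # close to uniform over the three arms
      arm = random.choices(range(len(Q)), weights=softmax_probs(Q, tau=1.0))[0]
      ```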

  14. Effect of the temperature parameter
      The softmax function:

        $$ p_i \;=_{\mathrm{Def}}\; \frac{e^{Q_i / \tau}}{\sum_{j=1}^{n} e^{Q_j / \tau}}. $$
