15-780 Graduate Artificial Intelligence: AI and Education III

  1. 15-780 Graduate Artificial Intelligence: AI and Education III. Shayan Doroudi, May 1, 2017

  2. series overview
     Series on applications of AI to education.
     Lecture  | Application | AI Topics
     4/24/17  | Learning    | Machine Learning + Search
     4/26/17  | Assessment  | Machine Learning + Mechanism Design
     5/01/17  | Instruction | Multi-Armed Bandits

  3-6. prediction vs. intervention
     Prediction:
     • Predicting performance in a learning environment
     • Predicting performance on a test
     Intervention:
     • Changing instruction based on a refined cognitive model
     • Computerized Adaptive Testing
     • Choosing the best instruction

  7-9. randomized weighted majority and bandits
     • Recall the Randomized Weighted Majority Algorithm.
     • After each decision, we know if each expert got it right or wrong.
     • Multi-Armed Bandits: choose only one arm (expert/action); we only know if that arm was good or bad.

  10-13. multi-armed bandits
     • Set of K actions A = {a_1, ..., a_K}.
     • At each time step t, we choose one action a_t ∈ A.
     • Observe a reward for that action, coming from some unknown distribution with mean µ_a.
     • Want to minimize regret:
       R(T) = T · max_{a ∈ A} µ_a − E[ ∑_{t=1}^{T} µ_{a_t} ]
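
To make the regret definition concrete, here is a minimal sketch (not from the slides) that simulates a Bernoulli bandit and estimates the regret of a uniformly random policy; the reward means are assumptions chosen to match the averages shown on the poll slide below.

    import random

    # Assumed Bernoulli reward means for K = 3 actions (illustrative values).
    MEANS = [0.9, 0.8, 0.1]

    def pull(action):
        # Sample a 0/1 reward for the chosen action.
        return 1.0 if random.random() < MEANS[action] else 0.0

    def regret_of_uniform_policy(T=1000):
        # R(T) = T * max_a mu_a - E[ sum_{t=1}^T mu_{a_t} ], for a uniformly random policy.
        best_mean = max(MEANS)
        expected_sum = 0.0
        for _ in range(T):
            a_t = random.randrange(len(MEANS))  # policy: pick an action uniformly at random
            r_t = pull(a_t)                     # observed reward (what the learner actually sees)
            expected_sum += MEANS[a_t]          # regret is defined via the true means, not r_t
        return T * best_mean - expected_sum

    print(regret_of_uniform_policy())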

  14. poll (multi-armed bandits)
     [Bar chart of average reward per action: action 1 ≈ 0.9, action 2 ≈ 0.8, action 3 ≈ 0.1.]
     Suppose action 1 was taken 20 times, action 2 was taken 10 times, and action 3 was taken once. Which action should we take next?
     • Action 1
     • Action 2
     • Action 3
     • Some distribution over the actions.

  15. exploration vs. exploitation
     • Exploration: trying different actions to discover what's good.
     • Exploitation: doing (exploiting) what we believe to be best.

  16. explore-then-commit
     • Explore-then-Commit: take each action n times, then commit to the action with the best sample-average reward.
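
A minimal sketch of Explore-then-Commit as described above, assuming Bernoulli rewards and illustrative choices of n and horizon T (the reward means are again the assumed values from the poll slide):

    import random

    def explore_then_commit(pull, K, n, T):
        # pull(a) returns a sampled reward for action a; K actions, n pulls each, horizon T.
        totals = [0.0] * K
        history = []
        # Exploration phase: take each action n times.
        for a in range(K):
            for _ in range(n):
                r = pull(a)
                totals[a] += r
                history.append((a, r))
        # Commit phase: play the action with the best sample-average reward for the remaining steps.
        best = max(range(K), key=lambda a: totals[a] / n)
        for _ in range(T - K * n):
            history.append((best, pull(best)))
        return history

    means = [0.9, 0.8, 0.1]  # assumed reward means
    pull = lambda a: 1.0 if random.random() < means[a] else 0.0
    print(sum(r for _, r in explore_then_commit(pull, K=3, n=5, T=200)))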

  17. upper confidence bound (ucb)

  18-19. optimism in the face of uncertainty
     [Bar chart of average reward per action (actions 1-3).]
     After taking action 3 two more times and seeing 0.1 both times:
     [Bar chart of the updated average reward per action (actions 1-3).]

  20. ucb1
     UCB1 Algorithm:
     1. Take each action once.
     2. Take action
        arg max_{a_j ∈ A} [ (1/n_j) ∑_{i=1}^{n_j} r_{j,i} + √(2 ln(n) / n_j) ]
     • n is the total number of actions taken so far
     • n_j is the number of times we took a_j
     • r_{j,i} is the reward from the i-th time we took a_j
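
A minimal sketch of the UCB1 rule above, run against an assumed Bernoulli reward simulator (the means are illustrative):

    import math
    import random

    def ucb1(pull, K, T):
        # UCB1: take each action once, then take arg max of sample mean + sqrt(2 ln(n) / n_j).
        counts = [0] * K   # n_j: number of times action j has been taken
        sums = [0.0] * K   # running sum of rewards for action j
        for a in range(K):             # step 1: take each action once
            sums[a] += pull(a)
            counts[a] += 1
        for n in range(K, T):          # n: total number of actions taken so far
            ucb = [sums[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j]) for j in range(K)]
            a = max(range(K), key=lambda j: ucb[j])
            sums[a] += pull(a)
            counts[a] += 1
        return counts

    means = [0.9, 0.8, 0.1]  # assumed reward means
    print(ucb1(lambda a: 1.0 if random.random() < means[a] else 0.0, K=3, T=1000))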

  21. thompson sampling

  22-24. thompson sampling
     Thompson Sampling Algorithm: choose actions according to the probability that we think they are best.
     • Take action a_j with probability
       ∫ I( E[r | a_j, θ] = max_{a ∈ A} E[r | a, θ] ) P(θ | D) dθ
     • Can just sample θ according to P(θ | D), and take arg max_{a ∈ A} E[r | a, θ]

  25-29. thompson sampling with beta prior
     • Suppose each action a_j gives rewards according to a Bernoulli distribution with some unknown probability p_j.
     • Use a conjugate prior (Beta distribution):
       P(p_j | α, β) ∝ p_j^α (1 − p_j)^β
     • After we take a_j, if we see reward r_j:
       P(p_j | α, β, r_j) ∝ P(p_j | α, β) P(r_j | p_j) ∝ p_j^α (1 − p_j)^β · p_j^{r_j} (1 − p_j)^{1 − r_j}
                          ∝ p_j^{α + r_j} (1 − p_j)^{β + 1 − r_j}
     • After any sequence of actions, the posterior distribution will be
       P(p_j | D) ∝ p_j^{α + s_j} (1 − p_j)^{β + f_j}
       where s_j and f_j count the successes and failures observed for a_j.

  30. thompson sampling with beta prior
     Thompson Sampling Algorithm with Bernoulli Actions and Beta Prior:
     • Sample p̂_1, ..., p̂_K, with each p̂_j drawn from P(p_j | D) ∝ p_j^{α + s_j} (1 − p_j)^{β + f_j}
     • Choose arg max_{a_j ∈ A} E[r | p_j = p̂_j]
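
A minimal sketch of this Beta-Bernoulli Thompson sampling loop, assuming a uniform Beta(1, 1) prior and an assumed Bernoulli reward simulator; for Bernoulli rewards, E[r | p_j = p̂_j] is just p̂_j, so the arg max is taken over the sampled values.

    import random

    def thompson_beta(pull, K, T, alpha=1.0, beta=1.0):
        # Beta-Bernoulli Thompson sampling: sample p̂_j from each posterior, act greedily on the samples.
        successes = [0] * K  # s_j
        failures = [0] * K   # f_j
        for _ in range(T):
            # Sample p̂_j ~ Beta(alpha + s_j, beta + f_j) for each action.
            samples = [random.betavariate(alpha + successes[j], beta + failures[j]) for j in range(K)]
            a = max(range(K), key=lambda j: samples[j])  # arg max of E[r | p_j = p̂_j] = p̂_j
            if pull(a) > 0.5:
                successes[a] += 1
            else:
                failures[a] += 1
        return successes, failures

    means = [0.9, 0.8, 0.1]  # assumed reward means
    print(thompson_beta(lambda a: 1.0 if random.random() < means[a] else 0.0, K=3, T=1000))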

  31. poll (thompson sampling)
     How can we increase exploration using Thompson Sampling with a Beta prior?
     • Choose a large α
     • Choose a large β
     • Choose an equally large α and β
     • Beats me

  32-35. example: axis
     [Figure-only slides illustrating the AXIS example.]

  36. What's missing?

  37. contextual bandits

  38-39. linucb
     • Obtain some context x_{t,a}
     • Assume a linear payoff function: E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a
     • Solve for θ_a using linear regression, build confidence intervals over the mean, and apply UCB.
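
A minimal sketch of that idea, assuming one ridge-regression model per action, an illustrative exploration parameter alpha, and a 2-dimensional context (these choices are assumptions, not from the slides):

    import numpy as np

    class LinUCB:
        # Per-action ridge regression with a UCB bonus on the predicted mean reward.
        def __init__(self, K, d, alpha=1.0):
            self.alpha = alpha
            self.A = [np.eye(d) for _ in range(K)]    # X^T X + I for each action
            self.b = [np.zeros(d) for _ in range(K)]  # X^T r for each action

        def choose(self, contexts):
            # contexts[j] is the feature vector x_{t,a_j}; return the action with the highest UCB.
            scores = []
            for j, x in enumerate(contexts):
                A_inv = np.linalg.inv(self.A[j])
                theta = A_inv @ self.b[j]                    # ridge-regression estimate of θ_a
                bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence-interval width for x^T θ_a
                scores.append(x @ theta + bonus)
            return int(np.argmax(scores))

        def update(self, action, x, reward):
            self.A[action] += np.outer(x, x)
            self.b[action] += reward * x

    # Example usage with an assumed 2-dimensional context shared across 3 actions.
    bandit = LinUCB(K=3, d=2)
    contexts = [np.array([1.0, 0.5]) for _ in range(3)]
    a = bandit.choose(contexts)
    bandit.update(a, contexts[a], reward=1.0)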
