15-780 - graduate artificial intelligence: ai and education iii. Shayan Doroudi, May 1, 2017. 1
series overview
Series on applications of AI to education.
Lecture   Application   AI Topics
4/24/17   Learning      Machine Learning + Search
4/26/17   Assessment    Machine Learning + Mechanism Design
5/01/17   Instruction   Multi-Armed Bandits
2
prediction vs. intervention
Prediction:
• Predicting performance in a learning environment
• Predicting performance on a test
Intervention:
• Changing instruction based on a refined cognitive model
• Computerized Adaptive Testing
• Choosing the best instruction
3
randomized weighted majority and bandits
• Recall the Randomized Weighted Majority Algorithm.
• After each decision, we know if each expert got it right or wrong.
• Multi-Armed Bandits: Choose only one arm (expert/action); only know if that arm was good or bad.
4
multi-armed bandits
• Set of K actions A = {a_1, ..., a_K}.
• At each time step t, we choose one action a_t ∈ A.
• Observe reward for that action, coming from some unknown distribution with mean µ_a.
• Want to minimize regret:
  R(T) = T · max_{a ∈ A} µ_a − E[ ∑_{t=1}^{T} µ_{a_t} ]
5
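To make the regret definition concrete, here is a minimal simulation sketch in Python (not from the slides): it plays an arbitrary policy against K Bernoulli arms and computes R(T) against the best fixed arm. The helper name run_bandit and the arm means are illustrative assumptions.

```python
import numpy as np

def run_bandit(policy, means, T, rng=None):
    """Simulate a K-armed Bernoulli bandit and return total reward and regret.

    policy(t, history) -> index of the arm to pull at step t,
    where history is a list of (arm, reward) pairs seen so far.
    """
    rng = rng or np.random.default_rng(0)
    history, total_reward = [], 0.0
    for t in range(T):
        arm = policy(t, history)
        reward = float(rng.random() < means[arm])  # Bernoulli reward (assumed model)
        history.append((arm, reward))
        total_reward += reward
    # Regret: best fixed arm in expectation vs. the means of the arms actually pulled
    regret = T * max(means) - sum(means[a] for a, _ in history)
    return total_reward, regret

# Example: a uniformly random policy on three hypothetical arms
means = [0.9, 0.8, 0.1]
random_policy = lambda t, hist: np.random.randint(len(means))
print(run_bandit(random_policy, means, T=1000))
```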
poll (multi-armed bandits)
[Figure: bar chart of average reward per action: about 0.9 for action 1, 0.8 for action 2, and 0.1 for action 3.]
Suppose action 1 was taken 20 times, action 2 was taken 10 times, and action 3 was taken once. Which action should we take next?
• Action 1
• Action 2
• Action 3
• Some distribution over the actions.
6
exploration vs. exploitation
• Exploration: Trying different actions to discover what's good.
• Exploitation: Doing (exploiting) what we believe to be best.
7
explore-then-commit
• Explore-then-Commit: Take each action n times, then commit to the action with the best sample average reward.
8
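A minimal sketch of Explore-then-Commit under the same assumed Bernoulli-arm model; the arm means, n, and T below are illustrative values, not from the slides.

```python
import numpy as np

def explore_then_commit(means, n, T, rng=None):
    """Pull each arm n times, then commit to the best empirical arm.

    Assumes T >= n * len(means).
    """
    rng = rng or np.random.default_rng(0)
    K = len(means)
    sums = np.zeros(K)
    total = 0.0
    # Exploration phase: pull each arm exactly n times (round-robin)
    for i in range(n * K):
        arm = i % K
        r = float(rng.random() < means[arm])  # Bernoulli reward (assumed)
        sums[arm] += r
        total += r
    best = int(np.argmax(sums / n))  # commit to the best sample-average arm
    # Commitment phase: play the chosen arm for the remaining steps
    for _ in range(T - n * K):
        total += float(rng.random() < means[best])
    return best, total

# Illustrative arms with unknown means 0.9, 0.8, 0.1 (hypothetical values)
print(explore_then_commit([0.9, 0.8, 0.1], n=10, T=1000))
```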
upper confidence bound (ucb)
optimism in the face of uncertainty
[Figure: average reward per action (actions 1, 2, 3), with an uncertainty range for each estimate.]
After taking action 3 two more times and seeing 0.1 both times:
[Figure: the same plot, updated with the two additional observations of action 3.]
9
ucb1
UCB1 Algorithm:
1. Take each action once.
2. Take action
   arg max_{a_j ∈ A}  (1/n_j) ∑_{i=1}^{n_j} r_{j,i} + √( 2 ln(n) / n_j )
• n is the total number of actions taken so far
• n_j is the number of times we took a_j
• r_{j,i} is the reward from the i-th time we took a_j
10
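A minimal UCB1 sketch under the same assumed Bernoulli rewards; it follows the two steps above, taking each action once and then maximizing sample mean plus √(2 ln(n) / n_j). The simulation setup (means, horizon) is illustrative.

```python
import numpy as np

def ucb1(means, T, rng=None):
    """UCB1: take each action once, then pick the arm maximizing
    sample mean + sqrt(2 ln(n) / n_j)."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    total = 0.0
    for t in range(T):
        if t < K:
            arm = t  # initialization: take each action once
        else:
            n = t  # total number of actions taken so far
            ucb = sums / counts + np.sqrt(2 * np.log(n) / counts)
            arm = int(np.argmax(ucb))
        r = float(rng.random() < means[arm])  # Bernoulli reward (assumed)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

print(ucb1([0.9, 0.8, 0.1], T=1000))
```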
thompson sampling
thompson sampling
Thompson Sampling Algorithm: Choose actions according to the probability that we think they are best.
• Take action a_j with probability
  ∫ I( E[r | a_j, θ] = max_{a ∈ A} E[r | a, θ] ) P(θ | D) dθ
• Can just sample θ according to P(θ | D), and take arg max_{a ∈ A} E[r | a, θ]
11
thompson sampling with beta prior
• Suppose each action a_j gives rewards according to a Bernoulli distribution with some unknown probability p_j.
• Use Conjugate Prior (Beta Distribution): P(p_j | α, β) ∝ p_j^α (1 − p_j)^β
• After we take a_j, if we see reward r_j:
  P(p_j | α, β, r_j) ∝ P(p_j | α, β) P(r_j | p_j) ∝ p_j^α (1 − p_j)^β · p_j^{r_j} (1 − p_j)^{1 − r_j}
                     ∝ p_j^{α + r_j} (1 − p_j)^{β + 1 − r_j}
• After any sequence of actions, the posterior distribution will be as follows:
  P(p_j | D) ∝ p_j^{α + s_j} (1 − p_j)^{β + f_j}
  where s_j and f_j are the numbers of successes and failures observed for action a_j.
12
thompson sampling with beta prior
Thompson Sampling Algorithm with Bernoulli Actions and Beta Prior:
• Sample p̃_1, ..., p̃_K, where each p̃_j is drawn from its posterior P(p_j | D) ∝ p_j^{α + s_j} (1 − p_j)^{β + f_j}
• Choose arg max_{a_j ∈ A} E[r | p_j = p̃_j]
13
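A minimal sketch of the Beta-Bernoulli Thompson sampling algorithm above; the prior α = β = 1 (a uniform prior on each p_j) and the arm means are illustrative assumptions.

```python
import numpy as np

def thompson_sampling(means, T, alpha=1.0, beta=1.0, rng=None):
    """Thompson sampling for Bernoulli arms with a Beta(alpha, beta) prior on each p_j."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    successes = np.zeros(K)  # s_j: observed rewards of 1 for each arm
    failures = np.zeros(K)   # f_j: observed rewards of 0 for each arm
    total = 0.0
    for _ in range(T):
        # Sample one plausible p_j per arm from its Beta posterior ...
        samples = rng.beta(alpha + successes, beta + failures)
        # ... and act greedily with respect to the sampled values
        arm = int(np.argmax(samples))
        r = float(rng.random() < means[arm])  # Bernoulli reward (assumed model)
        successes[arm] += r
        failures[arm] += 1 - r
        total += r
    return total

print(thompson_sampling([0.9, 0.8, 0.1], T=1000))
```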
poll (thompson sampling)
How can we increase exploration using Thompson Sampling with Beta Prior?
• Choose a large α
• Choose a large β
• Choose an equally large α and β
• Beats me
14
example: axis
[Figures on slides 15-18.]
What's missing? 19
contextual bandits
linucb
• Obtain some context x_{t,a}
• Assume a linear payoff function: E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a
• Solve for θ_a using linear regression, build confidence intervals over the mean, and apply UCB.
20
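A minimal sketch of the LinUCB idea (one linear model per arm, ridge regression, UCB on the predicted mean). The exploration parameter alpha, the context generator, and the reward model below are illustrative assumptions, not details from the slides.

```python
import numpy as np

def linucb_choose(contexts, A_inv, b, alpha=1.0):
    """Pick an arm by upper confidence bound under a linear payoff model.

    contexts: list of d-dimensional context vectors x_{t,a}, one per arm
    A_inv[a]: inverse of the ridge-regression design matrix for arm a
    b[a]:     accumulated x * r vector for arm a
    """
    scores = []
    for a, x in enumerate(contexts):
        theta = A_inv[a] @ b[a]                    # ridge-regression estimate of theta_a
        mean = x @ theta                           # predicted payoff x^T theta_a
        width = alpha * np.sqrt(x @ A_inv[a] @ x)  # confidence width for this context
        scores.append(mean + width)
    return int(np.argmax(scores))

def linucb_update(A_inv, b, arm, x, r):
    """Sherman-Morrison rank-one update of the chosen arm's statistics."""
    Ax = A_inv[arm] @ x
    A_inv[arm] -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
    b[arm] += r * x

# Illustrative run: 3 arms, 5-dimensional contexts, hypothetical linear rewards
rng = np.random.default_rng(0)
K, d, T = 3, 5, 1000
true_theta = rng.normal(size=(K, d))
A_inv = [np.eye(d) for _ in range(K)]  # starts as identity (ridge prior)
b = [np.zeros(d) for _ in range(K)]
for t in range(T):
    contexts = [rng.normal(size=d) for _ in range(K)]
    arm = linucb_choose(contexts, A_inv, b)
    r = contexts[arm] @ true_theta[arm] + 0.1 * rng.normal()  # noisy linear reward (assumed)
    linucb_update(A_inv, b, arm, contexts[arm], r)
print("estimated theta for arm 0:", A_inv[0] @ b[0])
```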