Reinforcement Learning: A Gentle Introduction & Industrial Application
Christian Hidber
Learning: learning from children
The game: demo
The game: setup
A learner plays against a game engine, one step at a time. In each step the learner chooses an action, and the game engine answers with the new game state and a reward.
Goal: maximize the sum of rewards.
(Photo by unknown author, licensed under CC BY-NC.)
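The setup above is the usual loop between a learner and a game engine. Here is a minimal Python sketch of that loop. It assumes a hypothetical `ScriptedGame` engine that just replays a fixed list of rewards (here -1, +100, -50), because the demo game itself is not specified, and a learner that picks actions uniformly at random.

```python
import random
from typing import List, Tuple


class ScriptedGame:
    """Hypothetical game engine that replays a fixed reward sequence (illustration only)."""

    def __init__(self, rewards: List[float]):
        self.rewards = rewards
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t  # the game state is simply the step counter here

    def step(self, action: str) -> Tuple[int, float, bool]:
        """Apply an action; return (next game state, reward, episode over?)."""
        reward = self.rewards[self.t]
        self.t += 1
        return self.t, reward, self.t == len(self.rewards)


def play(game: ScriptedGame, actions: List[str], rng: random.Random) -> float:
    """Let a random learner play one episode and return the sum of rewards."""
    state = game.reset()  # a random learner ignores the state
    total, over = 0.0, False
    while not over:
        action = rng.choice(actions)              # the learner chooses an action
        state, reward, over = game.step(action)   # the game engine answers
        total += reward
    return total


print(play(ScriptedGame([-1, 100, -50]), ["left", "right", "up", "down"], random.Random(0)))
```

A real learner would use `state` to pick its action. That choice is exactly what the policy on the following slides captures.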
The game: positive feedback
The game engine answers a good action with a positive reward.
The game: negative feedback
The game engine answers a bad action with a negative reward.
What the learner learned: the policy
The policy holds the rules learned for how to play the game. The learner uses it to choose an action in each game state.
Policy improvement = learning
The learner improves its policy (the rules learned for how to play the game) based on the rewards the game engine returns.
Reinforcement learning
Key idea: continuously improve the policy to increase the total reward.
For each step the game engine returns the game state and a reward. The RL algorithm chooses actions with the policy (the rules learned for how to play the game) and updates that policy.
Episode 1: play with the 1st policy (random)
Step 1: the policy picks an action for the current state, and the game moves to the next state. Reward: -1
Step 2: reward +100
Step 3: reward -50. Episode over.
Episode 1: improve the 1st policy for the state in step 3
Future reward: -50 (episode over)
Episode 1: improve the 1st policy for the state in step 2
Future reward (the sum of all rewards from the current state until ‘game over’): 50 (= +100 - 50)
Episode 1: improve the 1st policy for the state in step 1
Future reward: 49 (= -1 + 100 - 50)
M3, OCTOBER 2018
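The future rewards of Episode 1 can be computed in one backward pass over the recorded rewards. Each step's future reward is its own reward plus the future reward of the following step. Here is a small Python sketch; the function name `future_rewards` is illustrative only.

```python
from typing import List


def future_rewards(rewards: List[float]) -> List[float]:
    """FutureReward_i = reward_i + reward_{i+1} + ... until 'game over'."""
    result = [0.0] * len(rewards)
    running = 0.0
    for i in range(len(rewards) - 1, -1, -1):  # walk backwards from the last step
        running += rewards[i]
        result[i] = running
    return result


print(future_rewards([-1, 100, -50]))  # Episode 1 -> [49.0, 50.0, -50.0]
```

Applied to Episode 2's rewards `[-1, 100, 100, -1, -50]`, the same function gives `[148.0, 149.0, 49.0, -51.0, -50.0]`, matching the values on the following slides.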
Episode 2: play with the 2nd policy
Already learned: going left is OK.
Step 1: reward -1
Step 2: reward +100
Already learned: don’t go up.
Step 3: reward +100
Step 4: reward -1
Step 5: reward -50. Episode over.
Episode 2: improve the 2nd policy for the state in step 5
Future reward (the sum of all rewards from the current state until ‘game over’): -50 (= -50)
Episode 2: improve the 2nd policy for the state in step 4
Future reward: -51 (= -1 - 50)
Episode 2: improve the 2nd policy for the state in step 3
Future reward: 49 (= +100 - 1 - 50)
Episode 2: improve the 2nd policy for the state in step 2
Future reward: 149 (= +100 + 100 - 1 - 50)
Episode 2: improve the 2nd policy for the state in step 2
Future reward: 149 (= +100 + 100 - 1 - 50)
The policy entry for this state becomes some running average of the old and the new value.
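The slide leaves the exact averaging rule open. One common choice, assumed here, is an exponential moving average with a step size `alpha` (a hypothetical parameter). The stored value moves a fraction `alpha` of the way towards each newly observed future reward. A plain incremental mean over all visits would be another option.

```python
def running_average(old: float, new: float, alpha: float = 0.1) -> float:
    """Move the stored estimate a fraction alpha towards the newly observed value."""
    return old + alpha * (new - old)


# Hypothetical: the state had a stored value of 50 from an earlier episode,
# and Episode 2 now observes a future reward of 149 for it.
print(running_average(50.0, 149.0, alpha=0.5))  # 99.5
```

A small `alpha` keeps one lucky or unlucky episode from overwriting everything learned so far. `alpha = 1.0` would simply replace the old value with the new one.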
Episode 2: improve the 2nd policy for the state in step 1
Future reward: 148 (= -1 + 100 + 100 - 1 - 50)
So far: a policy is a map from states to action probabilities, updated by the reinforcement learning algorithm.
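In code, such a policy can be a plain dictionary from states to action probabilities. Acting then means sampling an action according to those probabilities. The sketch below uses made-up state names and probabilities.

```python
import random
from typing import Dict

Policy = Dict[str, Dict[str, float]]

# Hypothetical states and probabilities, for illustration only.
policy: Policy = {
    "start": {"left": 0.7, "up": 0.1, "right": 0.2},
    "corner": {"left": 0.0, "up": 0.0, "right": 1.0},
}


def choose_action(policy: Policy, state: str, rng: random.Random) -> str:
    """Sample an action for the given state according to the policy's probabilities."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights)[0]


rng = random.Random(42)
print(choose_action(policy, "corner", rng))  # always "right"
```

Sampling instead of always taking the most likely action keeps the learner exploring. The RL algorithm then only has to change the numbers in the dictionary.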
After many, many episodes, for each state…
Algorithm sketch
Policy (the rules learned for how to play the game)
Initialize table with random action probabilities for each state
Repeat
    play episode with policy given by table
    record (state_1, action_1, reward_1), (state_2, action_2, reward_2), … for the episode
    for each step i
        compute FutureReward_i = reward_i + reward_{i+1} + …
        update table[state_i] such that
            • action_i becomes more likely for state_i if FutureReward_i is “high”
            • action_i becomes less likely for state_i if FutureReward_i is “low”
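The algorithm sketch can be turned into a small runnable Python program. Everything concrete here is an assumption for illustration: the corridor game, its rewards, the softmax preference table and the update rule. To make "more likely / less likely" precise, it uses a REINFORCE-style policy-gradient step with a per-state running-average baseline, so "high" means above that state's average future reward so far. The slides do not say which update rule the author actually uses.

```python
import math
import random
from typing import Dict, List, Tuple

# Hypothetical toy game (not the game from the demo): a corridor with cells 0..4.
# The learner starts in cell 2 and moves "left" or "right".
# An ordinary step costs -1, reaching cell 4 pays +100, reaching cell 0 costs -50.
ACTIONS = ["left", "right"]
START, LOSE, WIN = 2, 0, 4
MAX_STEPS = 20

Table = Dict[int, Dict[str, float]]
Step = Tuple[int, str, float, Dict[str, float]]


def game_step(state: int, action: str) -> Tuple[int, float, bool]:
    """Game engine: return (next state, reward, episode over?)."""
    nxt = state - 1 if action == "left" else state + 1
    if nxt == WIN:
        return nxt, 100.0, True
    if nxt == LOSE:
        return nxt, -50.0, True
    return nxt, -1.0, False


def probabilities(prefs: Dict[str, float]) -> Dict[str, float]:
    """Softmax: turn one state's stored preferences into action probabilities."""
    m = max(prefs.values())
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def play_episode(table: Table, rng: random.Random) -> List[Step]:
    """Play one episode with the policy given by the table; record every step."""
    state, record = START, []
    for _ in range(MAX_STEPS):
        probs = probabilities(table[state])
        action = rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
        nxt, reward, over = game_step(state, action)
        record.append((state, action, reward, probs))
        state = nxt
        if over:
            break
    return record


def train(episodes: int = 2000, lr: float = 0.01, avg_rate: float = 0.1,
          seed: int = 0) -> Table:
    rng = random.Random(seed)
    # Equal preferences in every non-terminal cell = uniform random first policy.
    table: Table = {s: {a: 0.0 for a in ACTIONS} for s in range(LOSE + 1, WIN)}
    baseline = {s: 0.0 for s in table}  # running average of FutureReward per state
    for _ in range(episodes):
        future, steps = 0.0, []
        for state, action, reward, probs in reversed(play_episode(table, rng)):
            future += reward  # FutureReward_i = reward_i + reward_{i+1} + ...
            steps.append((state, action, future, probs))
        for state, action, future, probs in steps:
            advantage = future - baseline[state]  # > 0 counts as "high", < 0 as "low"
            for a in ACTIONS:
                # REINFORCE step for a softmax table: shift preference towards the
                # taken action if advantage > 0, away from it if advantage < 0.
                indicator = 1.0 if a == action else 0.0
                table[state][a] += lr * advantage * (indicator - probs[a])
        for state, _, future, _ in steps:
            baseline[state] += avg_rate * (future - baseline[state])
    return table


table = train()
for s in sorted(table):
    print(s, {a: round(p, 3) for a, p in probabilities(table[s]).items()})
```

With these hypothetical settings the learned table should clearly favour "right" in every cell. The exact probabilities depend on the seed, the learning rate and the number of episodes.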
The game: demo
The bad news: nice idea, but… too many states and too many actions:
• too much memory needed
• too much time needed
(Image licensed under CC BY-SA.)
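A quick calculation shows how fast the table grows. The numbers are hypothetical: a game whose state is described by 30 on/off features, 4 actions, and one 4-byte float per (state, action) entry. Every on/off combination is a distinct state, so the table has 2^30 rows.

```python
def table_bytes(n_binary_features: int, n_actions: int, bytes_per_entry: int = 4) -> int:
    """Memory for a lookup table with one probability per (state, action) pair."""
    n_states = 2 ** n_binary_features  # every on/off combination is a distinct state
    return n_states * n_actions * bytes_per_entry


print(table_bytes(30, 4) / 2**30, "GiB")  # 16.0 GiB
```

Each additional feature doubles the table. Each state also has to be visited many times before its entry is reliable, which is where the "too much time" comes from.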
The solution
Policy (the rules learned for how to play the game)
Idea: replace the lookup table with a neural network that approximates the action probabilities contained in the table.
Instead of Table[state] = action probabilities
do NeuralNet(state) ≈ action probabilities
In the algorithm, change “play episode with policy given by table” to “play episode with policy given by NeuralNet”, and “update table” to “update weights of NeuralNet”.
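Here is a minimal pure-Python sketch of NeuralNet(state) ≈ action probabilities. It is a hypothetical network with one hidden layer (ReLU) and a softmax output, and the sizes are made up. With 30 input features, 16 hidden units and 4 actions it has 30×16 + 16 + 16×4 + 4 = 564 parameters, however many states the game has. The weights here are random and untrained. Updating them ("update weights of NeuralNet") would typically be done with a policy-gradient step in an autodiff library, which this sketch does not show.

```python
import math
import random
from typing import List


class TinyPolicyNet:
    """Hypothetical one-hidden-layer network: state features -> action probabilities."""

    def __init__(self, n_inputs: int, n_hidden: int, n_actions: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_actions)]
        self.b2 = [0.0] * n_actions

    def __call__(self, state: List[float]) -> List[float]:
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, state)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Output layer: one score (logit) per action.
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        # Softmax turns the scores into action probabilities.
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


net = TinyPolicyNet(n_inputs=30, n_hidden=16, n_actions=4)
probs = net([1.0, 0.0] * 15)  # a state encoded as 30 on/off features
print(len(probs), round(sum(probs), 6))  # 4 1.0
```

The network plays the role of the table row: the episode loop samples actions from `net(state)` exactly as it would from `Table[state]`.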
Recommendations
More recommendations