
Reinforcement Learning - Hang Su (suhangss@tsinghua.edu.cn) - PowerPoint PPT Presentation



  1. Reinforcement Learning
  Hang Su (suhangss@tsinghua.edu.cn, http://www.suhangss.me)
  State Key Lab of Intelligent Technology & Systems, Tsinghua University
  Nov 4th, 2019

  2. Sequential Decision Making
  Goal: select actions to maximize total future reward
  - Actions may have long-term consequences
  - Reward may be delayed
  - It may be better to sacrifice immediate reward to gain more long-term reward
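  To make this trade-off concrete, here is a minimal sketch (not from the slides; the reward sequences and discount factor are made up for illustration) comparing the discounted return of a greedy behavior against a patient one:

  ```python
  # Illustration only: compare discounted returns of two hypothetical reward sequences.
  def discounted_return(rewards, gamma=0.9):
      """Sum of gamma^k * r_k over the sequence."""
      return sum((gamma ** k) * r for k, r in enumerate(rewards))

  greedy  = [1, 0, 0, 0, 0]     # take a small reward now, nothing afterwards
  patient = [0, 0, 0, 0, 10]    # sacrifice immediate reward for a larger delayed one

  print(discounted_return(greedy))   # 1.0
  print(discounted_return(patient))  # 10 * 0.9**4 = 6.561
  ```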

  3. Learning and Planning
  Two fundamental problems in sequential decision making
  Reinforcement Learning:
  - The environment is initially unknown
  - The agent interacts with the environment
  - The agent improves its policy
  Planning:
  - A model of the environment is known
  - The agent performs computations with its model (without any external interaction)
  - The agent improves its policy via reasoning, search, etc.

  4. Atari Example: Planning
  Rules of the game are known
  Can query the emulator: a perfect model inside the agent's brain
  If I take action a from state s:
  - what would the next state be?
  - what would the score be?
  Plan ahead to find the optimal policy, e.g. by tree search
  [Figure: search tree over emulator states, branching on "left"/"right" actions]
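  A minimal sketch of this planning idea, assuming a hypothetical model interface `step(state, action) -> (next_state, reward)` that plays the role of the queryable emulator; this is exhaustive depth-limited lookahead, not the exact search used in practice:

  ```python
  # Illustration only: depth-limited exhaustive lookahead with a perfect emulator.
  # `step(state, action) -> (next_state, reward)` is a hypothetical model interface.
  def plan(state, step, actions=("left", "right"), depth=3, gamma=0.99):
      """Return (best_value, best_action) by searching the emulator tree."""
      if depth == 0:
          return 0.0, None
      best_value, best_action = float("-inf"), None
      for a in actions:
          next_state, reward = step(state, a)           # query the model
          future, _ = plan(next_state, step, actions, depth - 1, gamma)
          value = reward + gamma * future                # discounted lookahead value
          if value > best_value:
              best_value, best_action = value, a
      return best_value, best_action
  ```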

  5. Atari Example: Reinforcement Learning
  Rules of the game are unknown
  Learn directly from interactive game-play
  Pick actions on the joystick, see pixels and scores
  [Figure: agent-environment loop with observation O_t, action A_t, reward R_t]

  6. Reinforcement Learning
  Intelligent animals can learn from interactions to adapt to the environment
  Can computers do the same?

  7. Reinforcement Learning in a Nutshell
  RL is a general-purpose framework for decision-making
  - RL is for an agent with the capacity to act
  - Each action influences the agent's future state
  Success is measured by a scalar reward signal
  - Goal: select actions to maximize future reward

  8. Reinforcement Learning
  The history is the sequence of observations, rewards, and actions:
  H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
  - The agent chooses actions so as to maximize expected cumulative reward over a time horizon
  - Observations can be vectors or other structures
  - Actions can be multi-dimensional
  - Rewards are scalar, but the information they convey can be arbitrary

  9. Agent and Environment
  At each step t the agent:
  - receives state s_t
  - receives scalar reward r_t
  - executes action a_t
  The environment:
  - receives action a_t
  - emits the next state s_{t+1}
  - emits scalar reward r_{t+1}
  [Figure: agent-environment interaction loop]
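  The interaction loop in code, as a minimal sketch; `env` and `agent` are hypothetical objects with a Gym-style interface, not anything defined in the lecture:

  ```python
  # Illustration only: the agent-environment loop, assuming a Gym-style `env`
  # (reset/step) and an `agent` with act/observe methods. Both are hypothetical.
  def run_episode(env, agent, max_steps=1000):
      state = env.reset()
      total_reward = 0.0
      for t in range(max_steps):
          action = agent.act(state)                      # agent executes a_t
          next_state, reward, done = env.step(action)    # env emits s_{t+1}, r_{t+1}
          agent.observe(state, action, reward, next_state)
          total_reward += reward
          state = next_state
          if done:
              break
      return total_reward
  ```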

  10. State
  Experience is a sequence of observations, rewards, and actions:
  o_1, r_1, a_1, ..., a_{t-1}, o_t, r_t
  The state is a summary of experience: s_t = f(o_1, r_1, a_1, ..., a_{t-1}, o_t, r_t)
  In a fully observed environment: s_t = f(o_t)

  11. Major Components of an RL Agent
  An RL agent may include one or more of these components:
  - Policy: the agent's behavior function
  - Value function: how good is each state and/or action
  - Model: the agent's representation of the environment

  12. Policy
  A policy is the agent's behavior
  It is a map from state to action, e.g.
  - Deterministic policy: a = π(s)
  - Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
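  A minimal sketch of the two policy types on a toy state space; the states, actions, and probabilities are made up purely for illustration:

  ```python
  # Illustration only: deterministic vs. stochastic policies on a toy problem.
  import random

  def deterministic_policy(state):
      """a = pi(s): maps each state to exactly one action."""
      return {"low_battery": "recharge", "ok_battery": "explore"}[state]

  def stochastic_policy(state):
      """pi(a|s): samples an action from a state-conditional distribution."""
      probs = {
          "low_battery": {"recharge": 0.9, "explore": 0.1},
          "ok_battery":  {"recharge": 0.1, "explore": 0.9},
      }[state]
      actions, weights = zip(*probs.items())
      return random.choices(actions, weights=weights)[0]

  print(deterministic_policy("low_battery"))  # always "recharge"
  print(stochastic_policy("ok_battery"))      # usually "explore"
  ```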

  13. Value Function
  A value function is a prediction of future reward
  Used to evaluate the goodness/badness of states
  The Q-value function gives the expected total reward
  - from state s and action a
  - under policy π
  - with discount factor γ
  Q^π(s, a) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a ]
  Value functions decompose into a Bellman equation:
  Q^π(s, a) = E_{s', a'}[ r + γ Q^π(s', a') | s, a ]
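  A minimal sketch of how the Bellman equation can be turned into an iterative evaluation of Q^π; the two-state MDP and the policy below are entirely made up for illustration:

  ```python
  # Illustration only: iterative policy evaluation of Q^pi on a tiny, made-up MDP.
  # transitions[s][a] = list of (probability, next_state, reward); policy[s][a] = pi(a|s).
  transitions = {
      "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
      "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
  }
  policy = {
      "A": {"stay": 0.5, "go": 0.5},
      "B": {"stay": 0.5, "go": 0.5},
  }
  gamma = 0.9

  Q = {s: {a: 0.0 for a in transitions[s]} for s in transitions}
  for _ in range(200):  # apply the Bellman backup until (approximately) converged
      for s in transitions:
          for a in transitions[s]:
              Q[s][a] = sum(
                  p * (r + gamma * sum(policy[s2][a2] * Q[s2][a2] for a2 in Q[s2]))
                  for p, s2, r in transitions[s][a]
              )
  print(Q)
  ```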

  14. Model
  A model predicts what the environment will do next
  - P predicts the next state, e.g. P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
  - R predicts the next (immediate) reward, e.g. R^a_s = E[R_{t+1} | S_t = s, A_t = a]
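  One way to make this concrete is to estimate a tabular model from observed transitions; a minimal sketch follows, with made-up experience data (the counting/averaging estimator is the usual maximum-likelihood choice, not something prescribed by the slide):

  ```python
  # Illustration only: estimate a tabular model (P and R) from observed transitions
  # by simple counting/averaging. The transition data below is made up.
  from collections import defaultdict

  counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
  reward_sums = defaultdict(float)                  # sum of rewards seen after (s, a)
  visits = defaultdict(int)                         # N(s, a)

  experience = [("A", "go", "B", 1.0), ("A", "go", "B", 1.0), ("A", "go", "A", 0.0)]
  for s, a, s2, r in experience:
      counts[(s, a)][s2] += 1
      reward_sums[(s, a)] += r
      visits[(s, a)] += 1

  def P(s2, s, a):
      """Estimated P[S_{t+1} = s2 | S_t = s, A_t = a]."""
      return counts[(s, a)][s2] / visits[(s, a)]

  def R(s, a):
      """Estimated E[R_{t+1} | S_t = s, A_t = a]."""
      return reward_sums[(s, a)] / visits[(s, a)]

  print(P("B", "A", "go"))  # 2/3
  print(R("A", "go"))       # 2/3
  ```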

  15. Reinforcement Learning
  [Figure: the agent's internal components]
  The agent's goal: learn a policy that maximizes long-term total reward

  16. Difference between RL and SL?
  Both learn a model from data, but:
  - Supervised learning: learns from labeled data (x, y); open loop; passive data
  - Reinforcement learning: learns from delayed reward along trajectories (s, a, r, s, a, r, ...); closed loop; the agent explores the environment
  [Figure: data flow in supervised learning vs. the closed agent-environment loop in reinforcement learning]

  17. Supervised Learning Spam detection based on supervised learning

  18. Reinforcement Learning Spam detection based on reinforcement learning

  19. Characteristics of Reinforcement Learning
  What makes reinforcement learning different from other machine learning paradigms?
  - There is no supervisor, only a reward signal
  - Feedback is delayed, not instantaneous
  - Time really matters (sequential, non-i.i.d. data)
  - The agent's actions affect the subsequent data it receives

  20. RL vs SL (Supervised Learning)
  Differences from SL:
  - Learn by trial and error: need an exploration/exploitation trade-off
  - Optimize long-term reward: need temporal credit assignment
  Similarities to SL:
  - Representation
  - Generalization
  - Hierarchical problem solving
  - ...
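  The exploration/exploitation trade-off is often handled with something as simple as ε-greedy action selection; a minimal sketch follows (the Q-table and ε value are illustrative, not from the lecture):

  ```python
  # Illustration only: epsilon-greedy action selection over a tabular Q-function.
  import random

  def epsilon_greedy(Q, state, actions, epsilon=0.1):
      """With probability epsilon explore a random action, otherwise exploit."""
      if random.random() < epsilon:
          return random.choice(actions)                          # explore
      return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

  Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
  print(epsilon_greedy(Q, "s0", ["left", "right"]))  # usually "right"
  ```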

  21. Applications: The Atari Games
  DeepMind's deep Q-learning on Atari
  - Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529-533, 2015

  22. Applications: The Game of Go
  DeepMind's AlphaGo: deep neural networks and tree search for Go
  - Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484-489, 2016

  23. Application: Producing Flexible Behaviors
  NIPS 2017: Learning to Run competition

  24. More Applications
  - Search
  - Recommendation systems
  - Stock prediction
  Every decision changes the world

  25. Generality of RL
  The shortest path problem is classically solved with Dijkstra's algorithm, the Bellman-Ford algorithm, etc.
  It can also be solved by reinforcement learning:
  - every node is a state, an action is an edge out of it
  - reward function = the negative edge weight
  - the optimal policy leads to the shortest path
  [Figure: a weighted graph from s to t, and the same graph with negated edge weights as rewards for the RL formulation]
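  A minimal sketch of this reduction on a small, made-up graph: value iteration with reward equal to the negative edge weight recovers the shortest-path distances (the negated state values) and the corresponding policy.

  ```python
  # Illustration only: shortest paths via value iteration on a made-up graph.
  # Reward for traversing an edge = minus its weight; the terminal node t has value 0.
  edges = {  # node -> {neighbor: edge_weight}
      "s": {"a": 2, "b": 5},
      "a": {"b": 1, "t": 6},
      "b": {"t": 2},
      "t": {},
  }

  V = {n: 0.0 if n == "t" else float("-inf") for n in edges}
  for _ in range(len(edges)):  # enough sweeps for a graph this small
      for n in edges:
          if edges[n]:
              V[n] = max(-w + V[m] for m, w in edges[n].items())

  policy = {n: max(edges[n], key=lambda m: -edges[n][m] + V[m]) for n in edges if edges[n]}
  print({n: -v for n, v in V.items()})  # shortest-path distances to t
  print(policy)                          # optimal action (next node) at each state
  ```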

  26. More Applications
  RL also serves as a differentiable approach for structure learning and for modeling structured data:
  - Bahdanau et al. An Actor-Critic Algorithm for Sequence Prediction. arXiv:1607.07086
  - He et al. Deep Reinforcement Learning with a Natural Language Action Space. ACL 2016
  - Dhingra et al. End-to-End Reinforcement Learning of Dialogue Agents for Information Access. arXiv:1609.00777
  - Yu et al. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017

  27. (Partial) History...
  - Idea of programming a computer to learn by trial and error (Turing, 1954)
  - SNARCs (Stochastic Neural-Analog Reinforcement Calculators) (Minsky, 1951)
  - Checkers-playing program (Samuel, 1959)
  - Lots of RL in the 60s (e.g., Waltz & Fu 1965; Mendel 1966; Fu 1970)
  - MENACE (Matchbox Educable Noughts and Crosses Engine) (Michie, 1963)
  - RL-based Tic-Tac-Toe learner (GLEE) (Michie, 1968)
  - Classifier Systems (Holland, 1975)
  - Adaptive Critics (Barto & Sutton, 1981)
  - Temporal Differences (Sutton, 1988)

  28. Outline
  - Markov Decision Process
  - Value-based methods
  - Policy search
  - Model-based methods
  - Deep reinforcement learning

  29. History and State
  The history is the sequence of observations, rewards, and actions, i.e. all observable variables up to time t:
  H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
  What happens next depends on the history:
  - the agent selects actions
  - the environment selects observations/rewards
  State is the information used to determine what happens next
  Formally, state is a function of the history: S_t = f(H_t)

  30. Agent State
  The agent state S^a_t is the agent's internal representation:
  - whatever information the agent uses to pick the next action
  - the information used by reinforcement learning algorithms
  It can be any function of the history: S^a_t = f(H_t)
  [Figure: agent-environment loop with observation O_t, action A_t, reward R_t, and agent state S^a_t]
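  A minimal sketch of one common choice of f: build the agent state from a sliding window over the most recent observations in the history; the window length here is an arbitrary illustrative choice.

  ```python
  # Illustration only: one possible agent state S^a_t = f(H_t), namely a sliding
  # window over the last k observations in the history.
  from collections import deque

  class WindowedAgentState:
      def __init__(self, k=4):
          self.window = deque(maxlen=k)   # keeps only the k most recent observations

      def update(self, observation):
          self.window.append(observation)
          return tuple(self.window)       # the agent state used to pick the next action

  state_fn = WindowedAgentState(k=3)
  for obs in ["o1", "o2", "o3", "o4"]:
      s = state_fn.update(obs)
  print(s)  # ('o2', 'o3', 'o4')
  ```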

  31. Markov State
  A Markov state contains all useful information from the history.
  A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
  "The future is independent of the past given the present": H_{1:t} → S_t → H_{t+1:∞}
  - Once the state is known, the history may be thrown away
  - The state is a sufficient statistic of the future

  32. Introduction to MDPs
  Markov decision processes formally describe an environment for reinforcement learning
  where the environment is fully observable
  - i.e. the current state completely characterizes the process
  Almost all RL problems can be formalized as MDPs
  - Optimal control primarily deals with continuous MDPs
  - Partially observable problems can be converted into MDPs
  - Bandits are MDPs with one state
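  A minimal sketch of what an MDP looks like as a data structure: states, actions, a transition kernel P, a reward function R, and a discount factor γ. The two-state weather example is made up for illustration.

  ```python
  # Illustration only: an MDP (S, A, P, R, gamma) as a plain data structure,
  # with a tiny made-up two-state example.
  from dataclasses import dataclass

  @dataclass
  class MDP:
      states: list
      actions: list
      P: dict      # P[(s, a)] = {s_next: probability}
      R: dict      # R[(s, a)] = expected immediate reward
      gamma: float = 0.9

  toy = MDP(
      states=["sunny", "rainy"],
      actions=["walk", "bus"],
      P={
          ("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
          ("sunny", "bus"):  {"sunny": 0.9, "rainy": 0.1},
          ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
          ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5},
      },
      R={("sunny", "walk"): 1.0, ("sunny", "bus"): 0.5,
         ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.0},
  )
  print(toy.P[("sunny", "walk")])
  ```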

  33. Markov Property
  "The future is independent of the past given the present"
  A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
  The state captures all relevant information from the history
  Once the state is known, the history may be thrown away
  - the state is a sufficient statistic of the future
