CPSC 533 Reinforcement Learning Paul Melenchuk Eva Wong Winson - PowerPoint PPT Presentation

CPSC 533 Reinforcement Learning Paul Melenchuk Eva Wong Winson Yuen Kenneth Wong

Outline • Introduction • Passive Learning in a Known Environment • Passive Learning in an Unknown Environment • Active Learning in an Unknown Environment • Exploration • Learning an Action Value Function • Generalization in Reinforcement Learning • Genetic Algorithms and Evolutionary Programming • Conclusion • Glossary

Introduction In which we examine how an agent can learn from success and failure, reward and punishment.

Introduction Learning to ride a bicycle: The goal given to the Reinforcement Learning system is simply to ride the bicycle without falling over. The system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. Photo: http://www.roanoke.com/outdoors/bikepages/bikerattler.html

Introduction Learning to ride a bicycle: RL system turns the handle bars to the LEFT. Result: CRASH!!! Receives negative reinforcement. RL system turns the handle bars to the RIGHT. Result: CRASH!!! Receives negative reinforcement.

Introduction Learning to ride a bicycle: RL system has learned that the “state” of being tilted 45 degrees to the right is bad. Repeat the trial using 40 degrees to the right. By performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over.

Passive Learning in a Known Environment Passive Learner: A passive learner simply watches the world going by, and tries to learn the utility of being in various states. Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.

Passive Learning in a Known Environment In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below:

Passive Learning in a Known Environment Agent can move {North, East, South, West}. Terminates on reaching [4,2] or [4,3].

Passive Learning in a Known Environment Agent is provided: M_ij = a model giving the probability of a transition from state i to state j

Passive Learning in a Known Environment The objective is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i. Utilities can be learned using three approaches: 1) LMS (least mean squares) 2) ADP (adaptive dynamic programming) 3) TD (temporal difference learning)

Passive Learning in a Known Environment LMS (Least Mean Squares) Agent makes random runs (sequences of random moves) through the environment: [1,1]->[1,2]->[1,3]->[2,3]->[3,3]->[4,3] = +1 [1,1]->[2,1]->[3,1]->[3,2]->[4,2] = -1

Passive Learning in a Known Environment LMS Collect statistics on the final payoff for each state (e.g. when in [2,3], how often was +1 reached vs. -1?). The learner computes the average for each state. Provably converges to the true expected values (utilities). (Algorithm on page 602, Figure 20.3)
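
The averaging step can be sketched in Python as follows. This is a minimal illustration, not the algorithm of Figure 20.3: it assumes each training sequence is a list of states paired with its final payoff, and the function name `lms_utilities` and the data layout are hypothetical.

```python
from collections import defaultdict

def lms_utilities(sequences):
    """Estimate U(state) as the mean final payoff over all visits to that state.

    `sequences` is a list of (states, payoff) pairs; this layout is an
    assumption made for the example, not something given on the slides.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, payoff in sequences:
        for state in states:
            totals[state] += payoff
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The two sample runs from the LMS slide.
runs = [
    ([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1.0),
    ([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1.0),
]
utilities = lms_utilities(runs)
print(utilities[(1, 1)])  # visited in both runs, so the payoffs average out
print(utilities[(2, 3)])  # visited only in the +1 run
```

With only two runs the estimates are crude, which fits the slow convergence of LMS: many sequences are needed before the per-state averages settle.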

Passive Learning in a Known Environment LMS Main drawback: slow convergence. It takes the agent well over 1,000 training sequences to get close to the correct values.

Passive Learning in a Known Environment ADP (Adaptive Dynamic Programming) Uses the value or policy iteration algorithm to calculate exact utilities of states given an estimated model

Passive Learning in a Known Environment ADP In general: U(i) = R(i) + Σ_j M_ij U(j), where - R(i) is the reward of being in state i (often nonzero for only a few end states) - M_ij is the probability of a transition from state i to state j

Passive Learning in a Known Environment ADP Consider U(3,3): U(3,3) = 0.33 x U(4,3) + 0.33 x U(2,3) + 0.33 x U(3,2) = 0.33 x 1.0 + 0.33 x 0.0886 + 0.33 x (-0.4430) = 0.2152 (0.33 is shorthand for 1/3; the result uses the exact fraction)
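
The arithmetic for U(3,3) can be checked with a few lines of Python. This sketch assumes R(3,3) = 0 (the state is nonterminal) and that each of the three neighbours is reached with probability exactly 1/3; the variable names are illustrative only.

```python
# Utilities of the neighbours of state (3,3), taken from the slide.
neighbour_utilities = {(4, 3): 1.0, (2, 3): 0.0886, (3, 2): -0.4430}

reward = 0.0      # assumption: R(3,3) = 0, since (3,3) is nonterminal
prob = 1.0 / 3.0  # assumption: each neighbour is equally likely

# Reward plus the probability-weighted sum of successor utilities.
u_33 = reward + sum(prob * u for u in neighbour_utilities.values())
print(round(u_33, 4))
```

Using the rounded value 0.33 instead of 1/3 gives roughly 0.2130, so the slide's 0.2152 corresponds to the exact fraction.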

Passive Learning in a Known Environment ADP makes optimal use of the local constraints on the utilities of states imposed by the neighborhood structure of the environment. However, it is somewhat intractable for large state spaces.

Passive Learning in a Known Environment TD (Temporal Difference Learning) The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations

Passive Learning in a Known Environment TD Learning Suppose we observe a transition from state i to state j, where U(i) = -0.5 and U(j) = +0.5. This suggests that we should increase U(i) to make it agree better with its successor. This can be achieved using the following updating rule: U(i) <- U(i) + α (R(i) + U(j) - U(i)), where α is the learning rate.
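
A minimal Python sketch of a single TD update on the example values. It assumes a learning rate of 0.1 and R(i) = 0, neither of which is given on the slide, and the function name `td_update` is hypothetical. The update adds alpha times the difference between R(i) + U(j) and the current U(i).

```python
def td_update(U, i, j, reward_i, alpha=0.1):
    """Apply one temporal-difference update for an observed transition i -> j."""
    U[i] = U[i] + alpha * (reward_i + U[j] - U[i])
    return U[i]

# Utility estimates from the slide's example.
U = {"i": -0.5, "j": 0.5}

# Assumed values: R(i) = 0 and alpha = 0.1.
td_update(U, "i", "j", reward_i=0.0, alpha=0.1)
print(U["i"])  # U(i) has moved from -0.5 toward U(j)
```

Only U(i) changes; U(j) is adjusted later, when a transition out of j is observed.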

Passive Learning in a Known Environment TD Learning Performance: Runs are “noisier” than LMS but have smaller error. Deals only with the states observed during sample runs (not all instances, unlike ADP).

Passive Learning in an Unknown Environment The Least Mean Squares (LMS) approach and the Temporal-Difference (TD) approach operate unchanged in an initially unknown environment. The Adaptive Dynamic Programming (ADP) approach adds a step that updates an estimated model of the environment.

Passive Learning in an Unknown Environment ADP Approach • The environment model is learned by direct observation of transitions • The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbors
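
Keeping track of transition frequencies can be sketched as below. This is an illustrative Python example under stated assumptions: the class name `TransitionModel`, its methods, and the sample observations are hypothetical and do not come from the slides.

```python
from collections import defaultdict

class TransitionModel:
    """Estimate M[i][j] as the fraction of observed transitions i -> j."""

    def __init__(self):
        # counts[i][j] = number of times a transition i -> j was observed
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, i, j):
        self.counts[i][j] += 1

    def probability(self, i, j):
        total = sum(self.counts[i].values())
        # With no observations from i, fall back to 0.0 (an arbitrary choice).
        return self.counts[i][j] / total if total else 0.0

model = TransitionModel()
# Made-up observations of where the agent ended up after leaving (3,3).
for successor in [(4, 3), (2, 3), (4, 3), (3, 2)]:
    model.observe((3, 3), successor)
print(model.probability((3, 3), (4, 3)))
```

The estimates start out noisy and approach the true probabilities as more transitions are observed.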

Passive Learning in an Unknown Environment ADP & TD Approaches • The ADP approach and the TD approach are closely related • Both try to make local adjustments to the utility estimates in order to make each state “agree” with its successors

Passive Learning in an Unknown Environment Minor differences : • TD adjusts a state to agree with its observed successor • ADP adjusts the state to agree with all of the successors Important differences : • TD makes a single adjustment per observed transition • ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M

Passive Learning in an Unknown Environment To make ADP more efficient : • directly approximate the algorithm for value iteration or policy iteration • the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates Advantages of approximate ADP : • efficient in terms of computation • eliminates the long value iterations that occur in the early stages

Active Learning in an Unknown Environment An active agent must consider : • what actions to take • what their outcomes may be • how they will affect the rewards received

Active Learning in an Unknown Environment Minor changes to the passive learning agent : • the environment model now incorporates the probabilities of transitions to other states given a particular action • the agent chooses actions so as to maximize its expected utility • the agent needs a performance element to choose an action at each step

Active Learning in an Unknown Environment Active ADP Approach • need to learn the probability M^a_ij of a transition from i to j given action a, instead of M_ij • the input to the function will include the action taken
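
One way to picture how an action-dependent model is used is a single utility backup that takes the best action: U(i) = R(i) + max_a Σ_j M^a_ij U(j). The Python sketch below assumes this standard form; the function name `active_backup`, the nested-dictionary layout of `M`, and the toy numbers are all hypothetical.

```python
def active_backup(U, R, M, state, actions):
    """One backup for `state`: R(i) + max over a of sum_j M[a][i][j] * U[j]."""
    best = max(
        sum(p * U[j] for j, p in M[a][state].items())
        for a in actions
    )
    return R[state] + best

# Toy problem (made-up numbers): from "s", two actions lead to "good" or "bad".
U = {"s": 0.0, "good": 1.0, "bad": -1.0}
R = {"s": 0.0, "good": 1.0, "bad": -1.0}
M = {
    "left": {"s": {"good": 0.2, "bad": 0.8}},
    "right": {"s": {"good": 0.8, "bad": 0.2}},
}
new_u = active_backup(U, R, M, "s", ["left", "right"])
print(new_u)  # "right" has the higher expected utility
```

The only change from the passive case is the extra index on the model and the maximization over actions.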

Active Learning in an Unknown Environment Active TD Approach • the model acquisition problem for the TD agent is identical to that for the ADP agent • the update rule remains unchanged • the TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity

Exploration Learning also involves the exploration of unknown areas Photo:http://www.duke.edu/~icheese/cgeorge.html

Exploration An agent can benefit from actions in two ways: • immediate rewards • received percepts

Exploration Wacky Approach vs. Greedy Approach [Figure: grid of state utility estimates: -0.038, 0.089, 0.215, -0.443, -0.165, -0.418, -0.544, -0.772]

Exploration The Bandit Problem Photos: www.freetravel.net

Exploration The Exploration Function, a simple example: u = expected utility (greed); n = number of times the action has been tried (wacky); R+ = best reward possible
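
A simple exploration function of this kind can be sketched in Python. The form assumed here returns the optimistic value R+ until an action has been tried a fixed number of times and the expected utility u afterwards. The threshold `n_e` and its default value are assumptions, not taken from the slide.

```python
def exploration_value(u, n, r_plus=1.0, n_e=5):
    """Return r_plus while the action has been tried fewer than n_e times, else u.

    u      : current expected utility of the action (greed)
    n      : number of times the action has been tried (wacky)
    r_plus : best reward possible
    n_e    : hypothetical threshold controlling how long to stay optimistic
    """
    return r_plus if n < n_e else u

print(exploration_value(u=0.2, n=0))  # untried action looks as good as possible
print(exploration_value(u=0.2, n=5))  # tried often enough, so trust the estimate
```

A larger `n_e` makes the agent explore for longer before it settles on the greedy choice.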

Learning An Action Value-Function What Are Q-Values?

Learning An Action Value-Function The Q-Values Formula

Learning An Action Value-Function The Q-Values Formula Application -just an adaptation of the active learning equation

Learning An Action Value-Function The TD Q-Learning Update Equation - requires no model - calculated after each transition from state i to state j
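
The model-free update can be sketched in Python, assuming the usual form Q(a,i) <- Q(a,i) + α (R(i) + max_a' Q(a',j) - Q(a,i)). The function name `q_update`, the dictionary keyed by (action, state), the default of 0.0 for unseen entries, and the sample numbers are all assumptions made for illustration.

```python
def q_update(Q, i, a, j, reward_i, actions, alpha=0.1):
    """Update Q[(a, i)] after observing a transition i -> j under action a."""
    # Best Q-value available in the successor state; unseen entries count as 0.0.
    best_next = max(Q.get((b, j), 0.0) for b in actions)
    old = Q.get((a, i), 0.0)
    Q[(a, i)] = old + alpha * (reward_i + best_next - old)
    return Q[(a, i)]

actions = ["North", "East", "South", "West"]
Q = {("East", "j"): 1.0}  # made-up starting estimate for the successor state

# Assumed values: R(i) = 0 and alpha = 0.5.
q_update(Q, "i", "East", "j", reward_i=0.0, actions=actions, alpha=0.5)
print(Q[("East", "i")])
```

No transition probabilities appear anywhere in the update, which is what "requires no model" refers to.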

Learning An Action Value-Function The TD Q-Learning Update Equation in Practice: The TD-Gammon System (Tesauro). Program: Neurogammon - attempted to learn from self-play and implicit representation

Generalization In Reinforcement Learning Explicit Representation • we have assumed that all the functions learned by the agents (U, M, R, Q) are represented in tabular form • explicit representation involves one output value for each input tuple

Generalization In Reinforcement Learning Explicit Representation • good for small state spaces, but the time to convergence and the time per iteration increase rapidly as the space gets larger • it may be possible to handle 10,000 states or more • this suffices for 2-dimensional, maze-like environments
