Examples and Videos of Markov Decision Processes (MDPs) and Reinforcement Learning
Artificial Intelligence is interaction to achieve a goal
[Diagram: agent-environment loop: the agent sends actions to the environment; the environment returns states and rewards]
• complete agent
• temporally situated
• continual learning & planning
• object is to affect the environment
• environment is stochastic & uncertain
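A minimal sketch of this interaction loop, with a toy `Environment` and `Agent` (both hypothetical placeholders, not from any particular library):

```python
# Sketch of the agent-environment loop: states and rewards flow to the agent,
# actions flow back to the environment, continually.

class Environment:
    def reset(self):
        """Return an initial state."""
        return 0

    def step(self, action):
        """Apply the action; return (next_state, reward, done). Toy dynamics only."""
        next_state = action
        reward = 1.0 if action == 1 else 0.0
        done = False
        return next_state, reward, done

class Agent:
    def act(self, state):
        """Map the current state to an action (the policy)."""
        return 1

    def learn(self, state, action, reward, next_state):
        """Update from one step of experience (learning & planning go here)."""
        pass

env, agent = Environment(), Agent()
state = env.reset()
for t in range(100):                               # temporally situated, continual interaction
    action = agent.act(state)                      # the agent acts to affect the environment
    next_state, reward, done = env.step(action)    # the environment returns state & reward
    agent.learn(state, action, reward, next_state)
    state = env.reset() if done else next_state
```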
States, Actions, and Rewards
Hajime Kimura’s RL Robots
[Video panels: Before and After learning; New Robot, Same Algorithm; Backward]
Devilsticking: “Model-based Reinforcement Learning of Devilsticking”
Stefan Schaal & Chris Atkeson (University of Southern California)
Finnegan Southey (University of Alberta)
The RoboCup Soccer Competition
Autonomous Learning of Efficient Gait (Kohl & Stone, UTexas, 2004)
Policies
• A policy maps each state to an action to take
• Like a stimulus–response rule
• We seek a policy that maximizes cumulative reward
• The policy is a subgoal to achieving reward
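As a rough illustration, a tabular policy is just a mapping from states to actions; the state and action names below are made up:

```python
# Sketch: a policy as a lookup from state to action (a stimulus-response rule).
policy = {
    "hungry":  "eat",
    "tired":   "sleep",
    "neither": "explore",
}

def act(state):
    return policy[state]

# A policy can also be derived greedily from action values Q(s, a):
Q = {("hungry", "eat"): 1.0, ("hungry", "sleep"): -0.5}

def greedy_act(state, actions=("eat", "sleep")):
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(act("hungry"))         # -> "eat"
print(greedy_act("hungry"))  # -> "eat"
```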
The Reward Hypothesis
The goal of intelligence is to maximize the cumulative sum of a single received number: “reward”
reward = pleasure − pain
Artificial Intelligence = reward maximization
Value
• Value systems are hedonism with foresight
• We value situations according to how much reward we expect will follow them
• All efficient methods for solving sequential decision problems determine (learn or compute) “value functions” as an intermediate step
• Value systems are a means to reward, yet we care more about values than rewards
Pleasure = immediate reward ≠ Good = long-term reward
“Even enjoying yourself you call evil whenever it leads to the loss of a pleasure greater than its own, or lays up pains that outweigh its pleasures. ... Isn't it the same when we turn back to pain? To suffer pain you call good when it either rids us of greater pains than its own or leads to pleasures that outweigh them.”
– Plato, Protagoras
Backgammon
STATES: configurations of the playing board (≈ 10^20)
ACTIONS: moves
REWARDS: +1 for a win, –1 for a loss, 0 otherwise
A “big” game
TD-Gammon (Tesauro, 1992–1995)
[Diagram: a neural network maps each board position to an estimated value; action selection by 2–3 ply search; TD error = V_{t+1} − V_t]
• Start with a random network
• Play millions of games against itself
• Learn a value function from this simulated experience
• Six weeks later it is the best backgammon player in the world
The Mountain Car Problem (Moore, 1990)
[Diagram: an underpowered car in a valley, goal at the top of the hill; “gravity wins” over the engine, so the car must back up to build momentum]
SITUATIONS: the car's position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always –1 until the car reaches the goal
No discounting; a minimum-time-to-goal problem
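A sketch of the task's step function, using the commonly cited textbook dynamics (the exact constants in Moore's 1990 setup may differ):

```python
import math

# Sketch of mountain-car dynamics in the Sutton & Barto textbook formulation;
# constants are illustrative and may differ from Moore's original version.
POS_MIN, POS_MAX = -1.2, 0.5          # the goal is at position 0.5
VEL_MIN, VEL_MAX = -0.07, 0.07

def step(position, velocity, action):
    """action is -1 (reverse), 0 (no thrust), or +1 (forward thrust)."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)  # weak thrust vs. gravity
    velocity = max(VEL_MIN, min(VEL_MAX, velocity))
    position += velocity
    position = max(POS_MIN, min(POS_MAX, position))
    if position == POS_MIN:
        velocity = 0.0                 # inelastic bump against the left wall
    reached_goal = position >= POS_MAX
    reward = -1.0                      # -1 every step: minimum time to goal, no discounting
    return position, velocity, reward, reached_goal

# Example: full forward thrust from the valley floor barely moves the car.
pos, vel, r, done = step(-0.5, 0.0, +1)
```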
Value Functions Learned while Solving the Mountain Car Problem
[Plots: learned value function over position × velocity, with the goal region marked]
Minimize time to goal: value = estimated time to goal
[Video panels: Random, Learned, Hand-coded]
Temporal-difference (TD) error
Do things seem to be getting better or worse, in terms of long-term reward, at this instant in time?
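One way to make this concrete is the tabular TD(0) update, sketched below with illustrative step-size and discount parameters:

```python
# Sketch: tabular TD(0). The TD error asks "did things just get better or worse
# than expected, in terms of long-term reward?"
from collections import defaultdict

V = defaultdict(float)   # estimated long-term reward for each state
alpha, gamma = 0.1, 0.9  # step size and discount rate (illustrative values)

def td_update(state, reward, next_state):
    td_error = reward + gamma * V[next_state] - V[state]  # positive: better than expected
    V[state] += alpha * td_error
    return td_error
```

With zero intermediate rewards and no discounting, this error reduces to the V_{t+1} − V_t form used by TD-Gammon above.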
Brain reward systems
What signal does this neuron carry?
[Image: honeybee brain, VUM neuron (Hammer & Menzel)]
TD error
Brain reward systems seem to signal TD error (Wolfram Schultz et al.)
World models
The actor-critic reinforcement learning architecture
[Diagram: the actor-critic architecture, interacting with the world or a world model]
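A simplified tabular version of the idea: a critic that computes the TD error and an actor that adjusts action preferences. State and action names are illustrative, and real actor-critic variants differ in the exact actor update:

```python
# Sketch of a simple tabular actor-critic (illustrative, not a specific published variant).
import math, random
from collections import defaultdict

V = defaultdict(float)                    # critic: state values
H = defaultdict(float)                    # actor: action preferences H[(state, action)]
ACTIONS = ["left", "right"]
alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.95

def pi(state):
    """Actor: sample an action from a softmax over preferences."""
    prefs = [math.exp(H[(state, a)]) for a in ACTIONS]
    r = random.random() * sum(prefs)
    for a, p in zip(ACTIONS, prefs):
        r -= p
        if r <= 0:
            return a
    return ACTIONS[-1]

def actor_critic_step(state, action, reward, next_state):
    td_error = reward + gamma * V[next_state] - V[state]   # critic's evaluation of the step
    V[state] += alpha_v * td_error                          # critic update
    H[(state, action)] += alpha_pi * td_error               # actor: reinforce if better than expected
    return td_error

# Example usage with made-up states:
a = pi("s0")
delta = actor_critic_step("s0", a, reward=1.0, next_state="s1")
```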
“Autonomous helicopter flight via Reinforcement Learning” Ng (Stanford), Kim, Jordan, & Sastry (UC Berkeley) 2004
Reason as RL over Imagined Experience
1. Learn a predictive model of the world's dynamics: transition probabilities, expected immediate rewards
2. Use the model to generate imaginary experience: internal thought trials, mental simulation (Craik, 1943)
3. Apply RL as if the experience had really happened: vicarious trial and error (Tolman, 1932)
GridWorld Example
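A minimal sketch of the three-step loop above on a small made-up gridworld, in the spirit of Dyna-style planning (not the specific example used in the talk):

```python
# Sketch: Dyna-style planning on a 4x4 gridworld.
# 1. learn a model, 2. imagine experience from it, 3. apply ordinary RL to it.
import random
from collections import defaultdict

SIZE, GOAL = 4, (3, 3)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def real_step(state, action):
    x = min(max(state[0] + action[0], 0), SIZE - 1)
    y = min(max(state[1] + action[1], 0), SIZE - 1)
    next_state = (x, y)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

Q = defaultdict(float)
model = {}                                  # 1. predictive model: (s, a) -> (s', r)
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def q_update(s, a, r, s2):                  # 3. ordinary RL update (Q-learning here)
    best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

state = (0, 0)
for _ in range(2000):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    next_state, reward = real_step(state, action)
    q_update(state, action, reward, next_state)
    model[(state, action)] = (next_state, reward)     # 1. update the model from real experience
    for _ in range(10):                               # 2. imagined experience from the model
        s, a = random.choice(list(model))
        s2, r = model[(s, a)]
        q_update(s, a, r, s2)                         # 3. learn as if it had really happened
    state = (0, 0) if next_state == GOAL else next_state
```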
Summary: RL’s Computational Theory of Mind
[Diagram: Reward, Policy, Value Function, Predictive Model]
• Value function: a learned, time-varying prediction of imminent reward; key to all efficient methods for finding optimal policies
• This has nothing to do with either biology or computers
• It’s all created from the scalar reward signal, together with the causal structure of the world