Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration




  1. Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration. Tobias Jung and Peter Stone, Department of Computer Science, University of Texas at Austin, {tjung, pstone}@cs.utexas.edu. Outline: 1. Motivation & framework, 2. Technical implementation, 3. Experiments. (GP-RMAX, ECML, 09/21/10)

  2. Part I: Motivation & Overview. This is what we want to do (and why).

  3. Objective: dynamic programming
     Consider: a time-discrete decision process t = 0, 1, 2, ... with continuous state space X ⊂ R^D and finite action space A.
     - Transition function: x_{t+1} = f(x_t, a_t) (deterministic)
     - Reward function: r(x_t, a_t) (immediate payoff)
     Goal: for any x_0, find actions a_0, a_1, ... such that Σ_{t≥0} γ^t r(x_t, a_t) is maximized.
     Dynamic programming (value iteration): if the transitions f and the reward r are known, we can solve Q = TQ for all x, a, where (TQ)(x, a) := r(x, a) + γ max_{a'} Q(f(x, a), a'), to obtain the optimal value function Q*. Once Q* is calculated, the best action in x_t is simply argmax_a Q*(x_t, a).
     Problems: usually f and r are not known a priori ⇒ they must be learned from samples. (The state-action space may also be "too big" to do VI; we will largely ignore this.)
     Our goal: we want to improve sample efficiency.
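To make the value-iteration recursion above concrete, here is a minimal sketch for a finite (or already discretized) state space; the names states, actions, f, r, gamma, and tol are illustrative placeholders, not anything from the paper.

```python
# Minimal value-iteration sketch: repeatedly apply the Bellman operator
# (TQ)(x, a) = r(x, a) + gamma * max_a' Q(f(x, a), a') until convergence.
def value_iteration(states, actions, f, r, gamma=0.95, tol=1e-6):
    Q = {(x, a): 0.0 for x in states for a in actions}
    while True:
        delta = 0.0
        for x in states:
            for a in actions:
                backup = r(x, a) + gamma * max(Q[(f(x, a), ap)] for ap in actions)
                delta = max(delta, abs(backup - Q[(x, a)]))
                Q[(x, a)] = backup
        if delta < tol:
            return Q

def greedy_action(Q, x, actions):
    # The best action in state x is argmax_a Q*(x, a).
    return max(actions, key=lambda a: Q[(x, a)])
```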

  4. Model-based reinforcement learning
     Remark: throughout the paper we will assume that the reward function is specified a priori.
     The sample efficiency of RL wholly depends on the sample efficiency of the model learner.

  5. Overview of the talk
     Benefits of model-based RL: more sample efficient than model-free RL (however, also more computationally expensive):
     - Samples are only used to learn the model, not as "test points" in value iteration.
     - The sample efficiency of RL wholly depends on the sample efficiency of the model learner.
     - (The model can be reused to solve different tasks in the same environment.)
     Model-based RL requires us to worry about three things:
     1. How to implement the planner? Here: simple interpolation on a grid (not part of this paper).
     2. How to implement the model learner?
     3. How to implement exploration?
     Our contribution, GP-RMAX: model learner = Gaussian process regression.
     - Fully Bayesian: provides a natural (un)certainty for each prediction.
     - Automated, data-driven hyperparameter selection.
     - Framework for feature selection: find & eliminate irrelevant variables/directions:
       improves generalization & prediction performance ⇒ faster model learning;
       improves uncertainty estimates ⇒ more efficient exploration.
     Experiments indicate that highly sample-efficient online RL is possible.

  6. Motivation: GP+ARD can reduce the need for exploration
     Example: compare three approaches to model learning in a gridworld with 100 × 100 cells (Start and Goal cells marked in the figure).
     Actions: "right": x_new = x_old + 0.01, y_new = y_old; "up": x_new = x_old, y_new = y_old + 0.01.
     After observing 20 transitions, we plot how certain each model is about its predictions for "right":
     [Figure: predicted certainty over the (x, y) state space for three models: a 10 × 10 grid, a hand-tuned uniform RBF model, and a GP with ARD kernel.]
     GP+ARD detects that the y-coordinate is irrelevant ⇒ reduced exploration ⇒ faster learning.
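As a loose, stand-alone illustration of the ARD effect on this slide (not the paper's code), the sketch below fits a GP with per-dimension length scales to a few observed "right" transitions and inspects the learned length scales; the choice of scikit-learn and all parameter values are assumptions.

```python
# Fit a GP with an ARD (per-dimension length scale) RBF kernel to 20 observed
# "right" transitions and check which input dimension is judged irrelevant.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 2))   # 20 observed states (x, y)
y = X[:, 0] + 0.01                         # next x-coordinate after action "right"

# One length scale per input dimension = ARD; a large learned scale for a
# dimension means that dimension barely influences the prediction.
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0])
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, normalize_y=True)
gp.fit(X, y)

print(gp.kernel_.k2.length_scale)   # expect a much larger scale for the y dimension
```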

  7. Part II: Technical implementation. This is how we do it.

  8. a. Model learning with GPs

  9. Model learning with GPs
     General idea: we have to learn the D-dimensional transition function x' = f(x, a). To do this, we combine multiple univariate GPs.
     Training: the data consists of transitions {(x_t, a_t, x'_t)}_{t=1}^N, where x'_t = f(x_t, a_t) and x_t, x'_t ∈ R^D. We train one GP independently for each state variable and each action: GP_ij models the i-th state variable under action a = j and has hyperparameters θ_ij found by maximizing the log marginal likelihood
       L(θ_ij) = −(1/2) y^T (K_θij + σI)^{-1} y − (1/2) log det(K_θij + σI) − (n/2) log 2π.
     Prediction: once trained, GP_ij produces for any state x*
       the prediction   f̃_i(x*, a = j) := k_θij(x*)^T (K_θij + σI)^{-1} y,
       the uncertainty  c̃_i(x*, a = j) := k_θij(x*, x*) − k_θij(x*)^T (K_θij + σI)^{-1} k_θij(x*).
     At the end, the predictions of the individual state variables are stacked together.
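A rough sketch of how such a per-(state variable, action) GP model learner could be organized follows; scikit-learn and all class, method, and variable names here are assumptions for illustration, not the authors' implementation.

```python
# One GP per (state variable i, action j), trained on observed transitions,
# returning a stacked mean prediction and per-dimension uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

class GPTransitionModel:
    def __init__(self, num_state_dims, num_actions):
        self.D, self.A = num_state_dims, num_actions
        # gps[j][i] models state variable i under action j.
        self.gps = [[None] * num_state_dims for _ in range(num_actions)]

    def fit(self, states, actions, next_states):
        for j in range(self.A):
            mask = (actions == j)
            for i in range(self.D):
                kernel = (ConstantKernel(1.0) * RBF(length_scale=np.ones(self.D))
                          + WhiteKernel(1e-6))
                gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
                # Hyperparameters theta_ij are set inside fit() by marginal
                # likelihood optimization.
                gp.fit(states[mask], next_states[mask, i])
                self.gps[j][i] = gp

    def predict(self, x, a):
        means, stds = [], []
        for i in range(self.D):
            m, s = self.gps[a][i].predict(x.reshape(1, -1), return_std=True)
            means.append(m[0])
            stds.append(s[0])
        # Stack the per-dimension predictions and their uncertainties.
        return np.array(means), np.array(stds)
```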

  10. Automatic relevance determination
      Automated procedure for hyperparameter selection: we can use covariance functions with a larger number of hyperparameters (infeasible to set by hand) ⇒ better fit the regularities of the data and remove what is irrelevant.
      Covariance: we consider three variants of the form
        k_θ(x, x') = v_0 exp( −(1/2) (x − x')^T Ω (x − x') ) + b
      with scalar hyperparameters v_0, b and matrix Ω given by:
        Variant I:   Ω = h I
        Variant II:  Ω = diag(a_1, ..., a_D)
        Variant III: Ω = M_k M_k^T + diag(a_1, ..., a_D)
      Note: variants (II) and (III) contain adjustable parameters for every state variable. Setting them automatically from the data ⇒ model selection automatically determines their relevance, and likelihood scores can be used to prune irrelevant state variables.
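A small numerical sketch of the Variant II (ARD) covariance above as a plain function; the parameter names and example values are illustrative only.

```python
# Variant II (ARD) covariance:
# k(x, x') = v0 * exp(-0.5 * (x - x')^T diag(a_1, ..., a_D) (x - x')) + b
import numpy as np

def ard_covariance(x, x_prime, v0=1.0, b=0.0, a=None):
    d = np.asarray(x) - np.asarray(x_prime)
    a = np.ones_like(d) if a is None else np.asarray(a)
    return v0 * np.exp(-0.5 * np.sum(a * d * d)) + b

x1 = np.array([0.2, 0.7])
x2 = np.array([0.3, 0.1])
# A tiny a_2 means the second coordinate barely affects the covariance,
# i.e. it is treated as (nearly) irrelevant.
print(ard_covariance(x1, x2, a=[50.0, 1e-3]))
```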

  11. b. Planning (with approximate model)

  12. Value iteration in R^D
      Remember: the input to the planner is the current model. For any (x, a), the current model "produces"
      - f̃(x, a), the predicted successor state,
      - c̃(x, a), the associated uncertainty (0 = certain, 1 = uncertain).
      General idea: value iteration on a grid Γ_h + multidimensional interpolation. Instead of the true transition function, we simulate transitions with the current model. As in RMAX, we integrate "exploration" into the value updates (Nouri & Littman 2009).
      Algorithm: iterate k = 1, 2, ...: for every node ξ_i ∈ Γ_h and every action a,
        Q_{k+1}(ξ_i, a) = (1 − c̃(ξ_i, a)) · [ r(ξ_i, a) + γ max_{a'} Q_k( f̃(ξ_i, a), a' ) ] + c̃(ξ_i, a) · V_MAX,
      where V_MAX is given a priori and Q_k at f̃(ξ_i, a) is evaluated by interpolation in R^D.
      Note: if c̃(ξ_i, a) ≈ 0, there is no exploration; if c̃(ξ_i, a) ≈ 1, the state is artificially made more attractive ⇒ exploration.
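The sketch below shows one sweep of the uncertainty-weighted update above over a set of grid nodes; for brevity the R^D interpolation of Q at the predicted successor is replaced by a nearest-node lookup, and model.predict, reward, v_max, and the other names are placeholders rather than the paper's implementation.

```python
# One RMAX-like value-iteration sweep over grid nodes, using the model's
# predicted successor state and its uncertainty c in [0, 1].
import numpy as np

def vi_sweep(Q, nodes, actions, model, reward, gamma=0.95, v_max=100.0):
    """Q: array of shape (len(nodes), len(actions)); nodes: array (n, D)."""
    Q_new = np.empty_like(Q)
    for i, xi in enumerate(nodes):
        for a in range(len(actions)):
            x_next, c = model.predict(xi, a)      # predicted successor + uncertainty
            j = np.argmin(np.linalg.norm(nodes - x_next, axis=1))  # nearest grid node
            backup = reward(xi, a) + gamma * Q[j].max()
            # Convex combination: certain predictions use the Bellman backup,
            # uncertain ones are pulled toward the optimistic value V_MAX.
            Q_new[i, a] = (1.0 - c) * backup + c * v_max
    return Q_new
```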

  13. Part III: Experiments. These are the results.

  14. Experimental setup
      What we examine: the online learning performance of GP-RMAX, that is, its sample complexity and the quality of the learned behavior, in various popular benchmark domains.
      Domains:
      - Mountain car (2D state space)
      - Inverted pendulum (2D state space)
      - Bicycle balancing (4D state space)
      - Acrobot swing-up (4D state space)
      Contestants:
      - Sarsa(λ) + tile coding
      - GP-RMAXexp (exploration, with the uncertainty determined from the GP)
      - GP-RMAXnoexp (no exploration)
      - GP-RMAXgrid (exploration, with the uncertainty determined from a grid)

  15. Results: 2D domains
      [Figure: learning curves. Mountain car: steps to goal (lower is better) vs. episodes, comparing GP-RMAX exp, GP-RMAX noexp, GP-RMAX grid5, and GP-RMAX grid10 against Sarsa(λ) with tile coding (10 and 20 tilings) and the optimal performance. Inverted pendulum: total reward (higher is better) vs. episodes, same contestants (Sarsa(λ) with 10 and 40 tilings). Note the different x-axis ranges: the GP-RMAX curves are plotted over 20 episodes, the Sarsa curves over several hundred.]
