
Two Perspectives on Representation Learning - Joseph Modayil (PowerPoint presentation)



1. Two Perspectives on Representation Learning. Joseph Modayil, Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta.

2. Reasoning & Learning: Two perspectives on knowledge representation
‣ For reasoning with a model:
• Expressiveness of the model (e.g. space, objects, ...)
• Planning with the model is useful for a robot
‣ For learning to predict the consequences of a robot's behaviour:
• Semantics defined by the robot's future experience
• Online, scalable learning during normal robot operation

3. An Analogy with Scientific Knowledge
‣ Reasoning and learning have complementary strengths that are analogous to scientific theories and experiments.
• Scientific theories enable broad generalization within a limited domain, and they enable effective reasoning even when inaccurate.
• Experiments measure the world without needing model assumptions, but many experiments are needed to understand the world.
‣ Two approaches for connecting theories and experiments:
• Top-down: theories have experimentally verifiable predictions.
• Bottom-up: many verifiable predictions can generalize to a single theory.
• Note: a single prediction is a (very) partial model of the world.

4. Rich representations that support reasoning

5. Reasoning with rich representations
‣ Useful analogs to human-scale abstractions can be constructed from robot experience.
• The robot constructs models from its sensorimotor experience by searching for particular statistical structures.
• The models describe spaces and objects.
• The robot reasons within these models to achieve goals.

6. Representing sensor configurations (Modayil, 2010)
‣ Sensors in similar physical configurations yield highly correlated time-series data (e.g. a Gaussian-process assumption).
‣ Invert this: use time-series data to construct a manifold of sensor configurations.
[Figure: pipeline from original sensors → gather experience → analyze time-series → construct sensor geometry; inset plot of sensor readings over time.]

7. Learned geometry from real robot data (CoSy Localization Database)
Method:
1. Define local distances between strongly correlated sensors.
2. Use the fast maximum variance unfolding algorithm to construct a manifold.
Conclusion: A robot's experience can contain enough information to recover approximate local sensor geometry (and perhaps global geometry).
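As a rough illustration of this pipeline, here is a minimal Python sketch. It derives local distances from time-series correlations and embeds them with classical multidimensional scaling as a simple stand-in for the maximum variance unfolding algorithm named on the slide; the function name, correlation threshold, and default distance are illustrative assumptions, not from the slides.

```python
import numpy as np

def sensor_geometry(readings, corr_threshold=0.9, dim=2):
    """Recover approximate sensor geometry from time-series data.

    readings: (num_timesteps, num_sensors) array of sensor values.
    Returns a (num_sensors, dim) embedding of sensor positions.
    Classical MDS stands in here for fast maximum variance unfolding.
    """
    corr = np.corrcoef(readings.T)            # pairwise sensor correlations
    # Local distances only between strongly correlated sensor pairs;
    # weakly correlated pairs get a large, uninformative distance.
    dist = np.where(corr > corr_threshold, 1.0 - corr, 1.0)
    np.fill_diagonal(dist, 0.0)

    # Classical MDS: double-center the squared distances, then embed.
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:dim]     # largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```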

8. Representing Objects (Modayil & Kuipers, 2007)
‣ Intuition: moving objects can be distinguished from a static world.
‣ Approach: use violations of a stationary background model to perceive moving objects.

9. Objects: Background Model
The agent has a model of the static environment:
‣ Occupancy grid
‣ Observation model: (pose, map) → observation
‣ Operators to move the robot to a target pose
‣ Update of the map and robot pose at each time-step

10. Objects: Perception Method
1. Consider sensor readings that violate expectations of the static model.
2. Cluster them in space and then in time.
3. Compute new perceptual features from the clusters: distance = average sensor reading, angle = average sensor location.
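A minimal sketch of this perception step, assuming a single laser range scan and a background model that supplies expected ranges. It clusters violations in space only; the temporal clustering across scans mentioned on the slide is omitted, and the function name, violation margin, and gap-based clustering rule are illustrative assumptions.

```python
import numpy as np

def perceive_objects(angles, ranges, expected_ranges,
                     margin=0.3, gap=2):
    """Detect moving-object clusters in one range scan.

    A reading "violates" the static background model when it is
    substantially shorter than the expected range (something stands
    in front of the known static world). Violations are clustered by
    adjacency in the scan, and each cluster yields two perceptual
    features: distance (average reading) and angle (average sensor
    location), as on the slide.
    """
    violating = np.where(ranges < expected_ranges - margin)[0]
    clusters, current = [], []
    for i in violating:
        if current and i - current[-1] > gap:   # break on a spatial gap
            clusters.append(current)
            current = []
        current.append(i)
    if current:
        clusters.append(current)
    return [{"distance": float(np.mean(ranges[c])),
             "angle": float(np.mean(angles[c]))}
            for c in clusters]
```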

11. Objects: Learned Shapes
Note: the learned shape models include size information.

12. Objects: Learning Operators
Method:
1. Perform motor babbling to collect data.
2. Use batch learning to find contexts and motor outputs that reliably change an attribute every time step (one-second time steps).
3. Evaluate the learned operators.
Example, Operator 4: decrease distance to object.
• Description: distance(τ), decrease, δ < -0.19
• Context: distance(τ) ≥ 0.43, 69 ≤ angle(τ) ≤ 132
• Motor outputs: (0.2 m/s, 0.0 rad/s)
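The learned operators can be read as precondition/effect rules over perceptual features. Below is a minimal sketch of that representation in Python, instantiated with the Operator 4 values from the slide; the class layout and method names are illustrative assumptions, not the paper's data structures.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    """A learned operator: within a feature context, a fixed motor
    output reliably changes one attribute by at least delta per step."""
    attribute: str        # attribute the operator changes
    delta: float          # expected per-step change
    context: dict         # feature -> (low, high) applicability bounds
    motor_output: tuple   # (linear m/s, angular rad/s)

    def applicable(self, features):
        return all(lo <= features[k] <= hi
                   for k, (lo, hi) in self.context.items())

# Operator 4 from the slide: decrease distance to the tracked object.
op4 = Operator(
    attribute="distance",
    delta=-0.19,
    context={"distance": (0.43, float("inf")),
             "angle": (69.0, 132.0)},
    motor_output=(0.2, 0.0),
)

assert op4.applicable({"distance": 1.0, "angle": 90.0})
```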

13. Objects: Using Operators
[Figure: operator effects plotted around the robot relative to dir[robot-heading]; labeled regions include angle(τ) increasing, angle(τ) decreasing, distance(τ) decreasing, and location(τ).]

14. Learning models that support reasoning
‣ Representations that support human-scale abstract reasoning can be learned from sensorimotor experience.
• Is a robot's sensorimotor stream sufficient for learning all useful knowledge?
‣ How can the learning process be improved?
• Simple unified semantics with broad applicability
• Clarified assumptions
• Incremental learning algorithms
• Removing the need for human oversight

15. Rich representations that support learning

16. Learning to make predictions
‣ A prediction is a claim about a robot's future experience.
• Predictions verified by experiments are the foundation of scientific knowledge.
• Thus, the semantics of experimentally verifiable predictions could be a useful foundation for a robot's knowledge.
• An efficient online, incremental algorithm would enable the robot to make and learn many such predictions in parallel, e.g. temporal-difference reinforcement learning algorithms.

17. General value functions (GVF)
V^{π,γ,r,z}(s) = E[ r(s₁) + ... + r(s_k) + z(s_k) | s₀ = s, a_{0:k} ∼ π, k ∼ γ ]
These four functions define the semantics of an experimentally verifiable prediction:
• policy π : A × S → [0, 1]
• pseudo reward r : S → ℝ
• termination γ : S → [0, 1]
• terminal reward z : S → ℝ
The experimental question: by selecting actions with the policy, how much reward will be received before termination?
Note 1: A GVF is a value function, but with a generic reward and termination.
Note 2: A constant termination probability corresponds to a timescale.
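Note 2 can be made concrete. Reading γ as a per-step continuation probability, termination occurs at each step with probability 1 − γ, so the horizon k is geometrically distributed. A short derivation follows; the 0.1 s step length and γ = 0.9875 are taken from the later slides.

```latex
% Expected horizon under a constant continuation probability \gamma:
% termination occurs each step with probability (1 - \gamma), so the
% horizon k is geometric with mean
E[k] = \sum_{k=1}^{\infty} k\,(1-\gamma)\,\gamma^{k-1} = \frac{1}{1-\gamma}
% Example: \gamma = 0.9875 with 0.1 s steps gives
% E[k] = 1/(1 - 0.9875) = 80 steps = 8 s, the "8 s timescale".
```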

18. The Horde Architecture (Sutton et al., 2011)
GVF predictions can be learned in parallel and online.
‣ Sensorimotor data passes through a sparse re-coder (e.g. tile coding), producing a sparsely activated, mostly-binary feature representation φ_t (#active << #features).
‣ Many demons learn from this shared feature stream; each demon is a full RL agent estimating one general value function.
‣ Each computed prediction is a linear function of the features: p = ⟨θ_t, φ_t⟩.
‣ The weights θ can be learned incrementally in O(#features) time per step by TD(λ) or related algorithms.
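A minimal sketch of one such demon, assuming binary feature vectors and on-policy linear TD(λ) with accumulating traces; the class and parameter names are illustrative, and Horde itself uses off-policy gradient-TD updates (sketched after slide 22). The TD error includes the terminal reward z weighted by the termination probability 1 − γ, which reduces to ordinary TD(λ) when z = 0.

```python
import numpy as np

class Demon:
    """One Horde demon: learns a GVF prediction p = <theta, phi>
    by linear TD(lambda) with accumulating eligibility traces.
    Per-step cost is O(#features), or O(#active) with sparse updates."""

    def __init__(self, num_features, alpha=0.1, lam=0.9):
        self.theta = np.zeros(num_features)   # learned weights
        self.e = np.zeros(num_features)       # eligibility trace
        self.alpha = alpha
        self.lam = lam

    def predict(self, phi):
        return self.theta @ phi

    def update(self, phi, reward, z_next, gamma, gamma_next, phi_next):
        # TD error for a GVF: pseudo reward, plus terminal reward z if
        # termination occurs (probability 1 - gamma), plus the
        # discounted next prediction, minus the current prediction.
        delta = (reward + (1.0 - gamma_next) * z_next
                 + gamma_next * (self.theta @ phi_next)
                 - self.theta @ phi)
        self.e = gamma * self.lam * self.e + phi
        self.theta += self.alpha * delta * self.e
```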

19. The firehose of experience
[Figure: many normalized sensor-value traces plotted against time steps of 0.1 second.]

20. Predictions of a Light Sensor
GVF specification:
• r = Light3
• γ = 0.9875 (an 8-second timescale at 0.1 s steps)
• π = robot behaviour
• z = 0
[Figure: Light3 sensor readings (0-1024, left scale) and predictions (0-60,000, right scale) over 120 seconds; curves for the ideal 8 s prediction, the online TD(λ) prediction, and the best offline weight vector.]
The predictions learned online by TD(λ) are comparable to the ideal predictions and approach the accuracy of the best weight vector (shown after 3 hours of experience).

21. Scales to thousands of predictions (Modayil, White, Sutton, 2012)
The 2000+ predictions:
• use 6000+ shared features and shared parameters,
• cover all sensors and many state bits,
• cover 4 timescales (0.1, 0.5, 2, and 8 seconds), and
• update every 55 ms.
[Figure: cumulative mean squared error, normalized by dataset sample variance, over 180 minutes, with mean and median traces against the unit-variance line; prediction groups include Acceleration, MotorTemperature, OverheatingFlag, Light, MotorSpeed, IR, MotorCurrent, IRLight, Thermal, LastAction, RotationalVelocity, Magnetic, and MotorCommand.]
All experience & learning performed within hours!

22. Learning predictions about different policies
‣ Off-policy learning enables the robot to learn the consequences of following different policies from a single stream of experience.
‣ Gradient temporal-difference algorithms provide stable, incremental, off-policy learning (Maei & Sutton, 2009).
‣ This works at scale on robots (White, Modayil, Sutton, 2012).
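A minimal sketch of one gradient-TD learner in the same linear, binary-feature setting, following the standard GTD(λ) update form rather than any specific robot codebase; constant γ, the step sizes, and the class name are illustrative assumptions. The importance-sampling ratio ρ = π(a|s) / b(a|s) corrects for learning about the target policy π from behaviour policy b.

```python
import numpy as np

class GTDLambda:
    """Off-policy GTD(lambda) for one GVF prediction: stable gradient
    TD learning with importance-sampling corrections and an auxiliary
    weight vector w, still O(#features) per step."""

    def __init__(self, num_features, alpha=0.01, beta=0.001, lam=0.9):
        self.theta = np.zeros(num_features)   # primary weights
        self.w = np.zeros(num_features)       # auxiliary weights
        self.e = np.zeros(num_features)       # eligibility trace
        self.alpha, self.beta, self.lam = alpha, beta, lam

    def update(self, phi, reward, gamma, phi_next, rho):
        delta = reward + gamma * (self.theta @ phi_next) - self.theta @ phi
        # Importance-weighted accumulating trace.
        self.e = rho * (gamma * self.lam * self.e + phi)
        # Primary update with the gradient-correction term.
        self.theta += self.alpha * (
            delta * self.e
            - gamma * (1 - self.lam) * (self.e @ self.w) * phi_next)
        # Auxiliary weights track the expected TD error per feature.
        self.w += self.beta * (delta * self.e - (phi @ self.w) * phi)
```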

23. Summary
‣ Abstract models can be learned from sensorimotor experience.
• Learned models of sensor space and objects support goal-directed planning.
‣ A broad class of predictive knowledge can be learned at scale.
• General value function predictions express an expected consequence of a precise experiment.
• Temporal-difference algorithms can learn to make such predictions incrementally during normal robot experience.
‣ Robots could benefit from a tighter integration between learning from experience and reasoning with models.
