Recommended Supervisor Structure

ControlBasis-III action set A; the scope is a "project", i.e. a task supervisor. The aim is to automate sequential control composition via:
• reinforcement learning
• Dynamic Motion Primitives (DMPs)
• policy search
byproduct example: hierarchical walking

    double recommended_setpoint[NACTIONS][NDOF];

    Inside all supervisors:
      state = 0;
      /* collect return status and new recommended setpoints */
      for (i = 0; i < k; i++) {
        d_i = action[i];          /* discrete return status of action i */
        state += d_i * 3^i;       /* ternary encoding of the composite state */
      }
      switch (state) {
        case 0: submit recommended_setpoint[ACTION] to the motor units; break;
        case 1: ...
        ...
        case K: ...
      }

Conditioned Response

Pavlov, I. P. (1927), Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex, translated and edited by G. V. Anrep. London: Oxford University Press.

Recommended Supervisor Structure

Convention: every action should return its status and pass a recommended setpoint back through an argument; actions no longer write directly to the motor units (a minimal sketch of this convention appears below).

Standardized actions: Search(), Track(), SearchTrack(), Chase(), Touch(), ChaseTouch(), FClosure()

Goals: evaluate the impact of implicit knowledge, in terms of learning performance, for three action sets (all actions, just primitives, just macros). Define training episodes; pause every N episodes and write out the greedy policy; in post-processing, run M greedy trials and compute the mean/variance of performance for plotting.

Reward: squared wrench residual (sparse); other candidates?
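A minimal C sketch of the supervisor convention above, in which each action returns its status and passes a recommended setpoint back through an argument rather than commanding the motor units directly. The names and sizes (Status, action_fn, NACTIONS, NDOF) are illustrative assumptions, not identifiers from the ControlBasis-III codebase.

    /* Illustrative sketch only; names and sizes are assumptions. */
    #define NACTIONS 4
    #define NDOF     7

    typedef enum { UNDEFINED = 0, TRANSIENT = 1, CONVERGED = 2 } Status;

    /* Each action returns its status and writes a recommended setpoint
     * into the row passed to it; it never commands the motor units. */
    typedef Status (*action_fn)(double setpoint[NDOF]);

    void supervisor(action_fn actions[NACTIONS],
                    double recommended_setpoint[NACTIONS][NDOF])
    {
        int state = 0, weight = 1;
        /* collect return statuses; encode the composite state in ternary */
        for (int i = 0; i < NACTIONS; i++) {
            Status d_i = actions[i](recommended_setpoint[i]);
            state += (int)d_i * weight;   /* weight = 3^i */
            weight *= 3;
        }
        switch (state) {
        case 0:
            /* e.g. forward recommended_setpoint[0] to the motor units */
            break;
        /* ... one case per composite state, up to 3^NACTIONS - 1 ... */
        default:
            break;
        }
    }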
A Computational Model for Conditioned Response

• value functions: a generalization of the potential field
• Reinforcement Learning: value iteration
• "diffusion" processes
• the curse of dimensionality is diminished by exploiting neurological structure

Markov Decision Processes

MDPs describe a memory-less stochastic process: the conditional probability distribution over future states depends only on the current state, not on how the process got to that state.

M = <S, A, P, R>
  S : the set of system states
  A : the set of available actions (a subset of actions is allowed from each state)
  P : the probability that (s_k, a_k) transitions to state s_{k+1}
  R : a real-valued reward for each (state, action) pair

The Bellman Equation

Define a policy, π(s, a), to be a function that returns the probability of selecting action a ∈ A from state s ∈ S.

The value of state s under policy π, denoted V^π(s), is the expected sum of discounted future rewards when policy π is executed from state s, where 0.0 < γ ≤ 1.0 is a discount factor per decision and the scalar r_t is the reward received at time t.
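Written out, the prose above corresponds to the standard discounted-return value function and its Bellman consistency condition. A minimal LaTeX sketch using the notation defined on these slides (the shorthands P(s' | s, a) and R(s, a) for the MDP's P and R are assumed):

    V^{\pi}(s) \;=\; E_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_{0} = s \right]

    V^{\pi}(s) \;=\; \sum_{a \in A} \pi(s, a) \sum_{s' \in S} P(s' \mid s, a)\, \bigl[\, R(s, a) + \gamma\, V^{\pi}(s') \,\bigr]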
The Bellman Equation
(equation slides)

Value Iteration

Dynamic Programming (DP) algorithms compute optimal policies from complete knowledge of the underlying MDP. Typically, DP employs a full backup: a comprehensive sweep through the entire state-action space using numerical relaxation techniques (Appendix C). This provides the basis for a numerical iteration that incorporates the Bellman consistency constraints to estimate V^π(s): a recursive numerical technique that converges to V^π as k → ∞. Greedy ascent of the converged value function is an optimal policy for accumulating reward (a minimal numerical sketch appears after the Q-learning overview below).

Q-learning

Reinforcement Learning (RL) algorithms are an important subset of DP algorithms that do not require prior knowledge of the transition probabilities in the MDP. RL techniques generally estimate V^π(s) using sampled backups, at the expense of optimality guarantees. This is attractive in robotics because it focuses exploration on the portions of the state/action space most relevant to the reward/task.
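Returning to value iteration: a minimal tabular sketch of the full-backup sweep described above, for an MDP with known transition probabilities and rewards. The sizes NS and NA, the arrays P and R, and the discount GAMMA are illustrative assumptions, not the course software; P and R must be filled in for a real problem.

    #include <stdio.h>

    #define NS    4
    #define NA    2
    #define GAMMA 0.9

    double P[NS][NA][NS];   /* P[s][a][s'] : transition probabilities */
    double R[NS][NA];       /* R[s][a]     : expected one-step reward */

    int main(void)
    {
        double V[NS] = {0.0};
        for (int k = 0; k < 1000; k++) {          /* relaxation sweeps */
            double delta = 0.0;
            for (int s = 0; s < NS; s++) {
                double best = -1e9;
                for (int a = 0; a < NA; a++) {    /* full backup over all actions */
                    double q = R[s][a];
                    for (int sp = 0; sp < NS; sp++)
                        q += GAMMA * P[s][a][sp] * V[sp];
                    if (q > best) best = q;
                }
                double d = best - V[s];
                if (d < 0.0) d = -d;
                if (d > delta) delta = d;
                V[s] = best;
            }
            if (delta < 1e-6) break;              /* Bellman residual has converged */
        }
        for (int s = 0; s < NS; s++)
            printf("V[%d] = %f\n", s, V[s]);
        return 0;
    }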
Q-learning

The quality function, Q(s, a), is the value function written in state/action form.

Policy improvement:  π(s) = argmax_{a_i} Q(s, a_i)

maps each state to an optimal action by greedy ascent of the value function. The policy improvement theorem guarantees that a procedure like this leads monotonically toward optimal policies.

Q-learning is a natural paradigm for composing skills using control basis actions because it can construct policies from sequences of actions by exploring control interactions in situ (a minimal update sketch follows below).

Example: Learning to Walk (ca. 1996)

Resource Model
  sensor resources:
  • configuration of legs {0,1,2,3}
  • configuration of body (x, y, θ)
  effector resources:
  • configuration of legs {0,1,2,3}
  • configuration of body (x, y, θ)
  control types:
  • moment control (Φ_1)
  • kinematic conditioning (Φ_2^φ)

Example: Walking Gaits

13 controllers over leg subsets abc ∈ {0123}, for a total of 1885 concurrent control options; the quadruped is treated as four coordinated robots.

discrete events:
  012  : p_0 ← Φ*
  013  : p_3 ← Φ*
  023  : p_1 ← Φ*
  123  : p_2 ← Φ*
  0123 : p_4 ← Φ_φ

state/action space: 2^13 states × 1885 actions
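A minimal tabular Q-learning sketch consistent with the update and policy-improvement rule above: sampled backups from observed (s, a, r, s') transitions, with no transition model required. The sizes NS and NA and the constants ALPHA and GAMMA are illustrative assumptions and are not taken from the 1996 walking experiment.

    #define NS    16
    #define NA    8
    #define ALPHA 0.1
    #define GAMMA 0.9

    double Q[NS][NA];                 /* quality function, Q(s, a) */

    /* greedy policy improvement: pi(s) = argmax_a Q(s, a) */
    int greedy(int s)
    {
        int best = 0;
        for (int a = 1; a < NA; a++)
            if (Q[s][a] > Q[s][best]) best = a;
        return best;
    }

    /* one sampled backup after observing (s, a, r, s') in situ */
    void q_update(int s, int a, double r, int s_next)
    {
        double target = r + GAMMA * Q[s_next][greedy(s_next)];
        Q[s][a] += ALPHA * (target - Q[s][a]);
    }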
Example: ROTATE schema

Example: Behavioral Logic for Development

Propositions constrain the patterns of discrete events in the dynamical system.
• platform stability constraints: at least 1 of the 4 stable tripod stances must be true at all times (one reading is sketched below)
• kinematic constraints
• reduced model: 32 states × 157 actions
• reduced by 99.94%

Schemas from Sensorimotor Programming

Generating the control programs that evaluate context efficiently is the subject of on-going work on inferential perception.

Transfer

A schema "written" by one robot is ported to another robot.
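One way to read the platform-stability proposition above is as a predicate that filters the admissible action set before learning, which is how the state/action space can shrink so dramatically. A minimal sketch under that reading; the Stance encoding, is_stable_tripod() test, and predict() interface are illustrative assumptions, not the implementation used in the walking experiments.

    #include <stdbool.h>

    /* legs 0..3; a stance records which legs are in ground contact */
    typedef struct { bool leg_down[4]; } Stance;

    /* at least one of the four tripod stances holds: >= 3 legs in contact */
    bool is_stable_tripod(const Stance *s)
    {
        int down = 0;
        for (int i = 0; i < 4; i++)
            if (s->leg_down[i]) down++;
        return down >= 3;
    }

    /* keep only actions whose predicted successor stance remains stable */
    int prune_actions(int n_actions, Stance (*predict)(int action),
                      int admissible[])
    {
        int n = 0;
        for (int a = 0; a < n_actions; a++) {
            Stance next = predict(a);
            if (is_stable_tripod(&next))
                admissible[n++] = a;
        }
        return n;
    }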
Implications of Developmental Hierarchy

…at least one stable grasp must exist at all times… ( g(F_g1) ∨ g(F_g2) ∨ g(F_g3) )

“Objects” - Fully-Observable Case (Aspect Transition Graph)

Rob Platt

Hierarchical Commonsense Control Knowledge

[Figure: postural hierarchy from prone through three-point and four-point stances to balancing, with force/moment conditions (ΣF = 0, ΣM = 0) and grasp/"flip" actions of the arms.] The governing constraint is g(prone) ∨ g(4-point) ∨ g(balance).

“Objects” - Model-Referenced Aspect Transitions

Models record how SearchTrack options affect visual/haptic tracking events over time.
“Objects” - Model-Referenced Aspect Transitions

In general, multiple options exist for transforming existing sensor geometries into new sensor geometries (visual cues => viewing angles => new visual cues). The options vary in cost and in information content.

Affordance Modeling - Three Objects

Exploration habituates when the model stops changing (visual hue tracker, grasp, pick-and-place options).

Stephen Hart

Modeling Simple Assemblies

Modeling stable multi-body relations.

Human Tracking

• disambiguate human structure against cluttered backgrounds
• references (hands/face) for control behavior