Learning Models of Human Behavior using a Value Directed Approach
Jesse Hoey, Computer Science Department, University of Toronto
http://www.cs.toronto.edu/~jhoey/
IRIS Learning Workshop, June 9, 2004
Motivation: Modeling Human Behaviors
Figure: the cognitive vision pipeline, from VIDEO through Computer Vision to human behaviors, and through Decision Theory to ACTION.
POMDPs for Human Behavior Understanding
Figure: a Partially Observable Markov Decision Process relating Context (previous world state, e.g. hunger), the displayed behavior, the Action (steal cake / don't steal cake), the Outcome (get cake, get caught), and the utility.
Overview
➜ POMDPs for Display Understanding in Context
➜ Computer Vision: modeling video sequences
  • spatial abstraction
  • temporal abstraction
➜ Learning POMDPs
➜ Solving POMDPs
➜ Value-Directed Learning
➜ Experiments
  • Robot Control
  • Card Matching Game
➜ Conclusions, Current & Future Work
Partially Observable Markov Decision Processes (POMDPs)
A POMDP is a probabilistic temporal model of an agent interacting with its environment: a tuple ⟨S, A, T, R, O, B⟩ where
S : finite set of unobservable states
A : finite set of agent actions
T : S × A → Δ(S), the transition function
R : S × A → ℝ, the reward function
O : set of observations
B : S × A → Δ(O), the observation function
Figure: two-slice influence diagram with action A, reward R, states S at times t−1 and t, and observations O at times t−1 and t.
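To make the tuple concrete, here is a minimal sketch of how such a POMDP could be held in code. The toy sizes, NumPy array layout, and uniform distributions are assumptions for illustration only, not the model used in the talk.

```python
import numpy as np

# Illustrative POMDP container (assumed layout, not the talk's model):
# |S| unobservable states, |A| agent actions, |O| observations.
n_states, n_actions, n_obs = 3, 2, 2

# T[a, s, t] = Pr(t | s, a): transition function S x A -> distribution over S
T = np.full((n_actions, n_states, n_states), 1.0 / n_states)

# R[s, a]: reward function S x A -> real-valued reward
R = np.zeros((n_states, n_actions))
R[0, 1] = 1.0  # e.g. a reward for taking action 1 in state 0

# B[a, s, o] = Pr(o | s, a): observation function S x A -> distribution over O
B = np.full((n_actions, n_states, n_obs), 1.0 / n_obs)

pomdp = {"T": T, "R": R, "B": B}
```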
POMDPs for Human Behavior Understanding
Figure: the cake example augmented with model parameters Θ_A, Θ_D and Θ_O attached to the corresponding parts of the model.
Output Model
Figure: dynamic Bayesian network for the output model, with system action A^a_t, states S^a_{t−1} → S^a_t, displayed behavior A^{b:a}_t, observation O_t, and parameters Θ_A, Θ_D, Θ_O; the most likely behavior (e.g. A^{b:a} = 1) is selected.
Output Model
Figure: the output model instantiated on a "smile" sequence (frames 1187–1251): discrete states X and W generate the observations Z_x (optical flow ∇f) and Z_w (image I).
Output Model
The observation likelihood of a behavior A^{b:a} is computed recursively over the sequence:
P(O | A^{b:a}) = Σ_{ij} P(I_T | W_{T,i}, A^{b:a}) P(∇f_T | X_{T,j}, A^{b:a}) Σ_{kl} Θ_{X,ijkn} Θ_{W,jkln} P(X_{T−1,k}, W_{T−1,l}, {O}_{1,T−1} | A^{b:a})
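A hedged sketch of how a recursion of this form can be evaluated: a forward pass that carries the filtering term P(X_t, W_t, {O}_{1..t} | A^{b:a}) for each candidate behavior and sums it out at the final frame. The two chains are simplified here to evolve independently given the behavior, and all array names and shapes are assumptions, not the exact model of the talk.

```python
import numpy as np

def sequence_likelihood(lik_X, lik_W, A_X, A_W, prior):
    """Forward recursion for P(O | behavior), with assumed (illustrative) shapes:
      lik_X[t, j]  = P(grad f_t | X_t = j, behavior)   flow-field likelihood
      lik_W[t, i]  = P(I_t      | W_t = i, behavior)   image likelihood
      A_X[k, j]    = P(X_t = j  | X_{t-1} = k, behavior)
      A_W[l, i]    = P(W_t = i  | W_{t-1} = l, behavior)
      prior[k, l]  = P(X_0 = k, W_0 = l | behavior)
    Returns P({O}_{1..T} | behavior)."""
    # alpha[j, i] = P(X_t = j, W_t = i, {O}_{1..t} | behavior)
    alpha = prior * lik_X[0][:, None] * lik_W[0][None, :]
    for t in range(1, lik_X.shape[0]):
        # predict: sum over the previous X (rows) and W (columns) states
        pred = A_X.T @ alpha @ A_W
        # correct with the current flow and image likelihoods
        alpha = pred * lik_X[t][:, None] * lik_W[t][None, :]
    return alpha.sum()
```

Running this once per candidate behavior and taking the argmax gives the "most likely A^{b:a}" selection shown on the output-model slides.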
Learning the Model
Figure: the DBN unrolled over time slices τ−1, τ, τ+1, with actions A^a_τ, states S^a_τ, behaviors A^{b:a}_τ, observations O_τ, and parameters Θ_A, Θ_D, Θ_O.
Find parameters: Θ* = argmax_Θ P(O, S^a, A^a | Θ).
Use the expectation-maximization (EM) algorithm:
Θ* = argmax_Θ Σ_{A^{b:a}} P(A^{b:a} | O, S^a, A^a, Θ′) log P(A^{b:a}, O, S^a, A^a | Θ) + log P(Θ)
which finds a local maximum of the a-posteriori probability.
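The EM objective above can be sketched as a small loop. The callback, parameter names, and the Dirichlet smoothing standing in for log P(Θ) are all assumptions for illustration, not the exact implementation from the talk.

```python
import numpy as np

def em_behavior_model(seq_logliks_fn, init_theta, n_iters=20):
    """Hedged EM sketch.  Assumed interface:
      seq_logliks_fn(theta) -> L with L[s, n] = log P(O_s, S^a_s, A^a_s | A^{b:a} = n, theta)
      theta["pi"][n]        =  P(A^{b:a} = n), the behavior prior.
    Only pi is re-estimated here; the per-behavior output parameters would be
    updated with the same responsibilities as weights on sufficient statistics."""
    theta = dict(init_theta)
    for _ in range(n_iters):
        # E-step: responsibilities gamma[s, n] = P(A^{b:a} = n | O_s, S^a_s, A^a_s, theta)
        log_post = seq_logliks_fn(theta) + np.log(theta["pi"])[None, :]
        log_post -= log_post.max(axis=1, keepdims=True)
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step (partial): maximize expected complete log-likelihood + log P(theta),
        # here a Dirichlet(2) prior on pi, giving add-one MAP smoothing
        theta["pi"] = (gamma.sum(axis=0) + 1.0) / (gamma.shape[0] + gamma.shape[1])
        # ... update per-behavior parameters Theta_O, Theta_D weighted by gamma
    return theta
```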
Solution Techniques
Figure: a spectrum of decision-making problems ordered by difficulty (the axis is labelled with video-game difficulty levels, from "I can win!" to "Nightmare!"):
• observable state (MDPs): SPUDD
• unobservable state, discrete observations (POMDPs): incremental pruning, factored solvers
• unobservable state, continuous observations (general POMDPs): optimal solution vs. Monte Carlo, entropy, and MDP approximations
• unobservable state, continuous observations, multi-agent systems: decision-analytic approach, finding equilibria
EM for learning POMDPs is marked on the same scale.
Solving the Model
Figure: the unrolled DBN with reward R, actions A^a_τ, states S^a_τ, and behaviors A^{b:a}_τ.
MDP approximation: assume A^{b:a} is observable.
Dynamic programming (value iteration), with n stages to go:
V_0 = R,   V_{n+1}(s) = R(s) + max_{a∈A} [ Σ_{t∈S} Pr(t | a, s) · V_n(t) ]
Policy: actions that maximize expected value:
π_n(s) = argmax_{a∈A} [ Σ_{t∈S} Pr(t | a, s) · V_n(t) ]
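A minimal sketch of this value iteration, assuming dense NumPy arrays for Pr(t | a, s) and R(s); it follows the slide's undiscounted n-stage-to-go form.

```python
import numpy as np

def value_iteration(T, R, n_stages):
    """Sketch of n-stage-to-go value iteration (the MDP approximation that
    treats A^{b:a} as observable).  Assumed array shapes:
      T[a, s, t] = Pr(t | a, s)   transition probabilities
      R[s]       = R(s)           state reward
    Returns the value function V_n and greedy policy pi_n."""
    V = R.copy()                                  # V_0 = R
    for _ in range(n_stages):
        Q = np.einsum("ast,t->sa", T, V)          # Q[s, a] = sum_t Pr(t | a, s) * V_n(t)
        V = R + Q.max(axis=1)                     # V_{n+1}(s) = R(s) + max_a Q[s, a]
    Q = np.einsum("ast,t->sa", T, V)
    pi = Q.argmax(axis=1)                         # pi_n(s) = argmax_a Q[s, a]
    return V, pi
```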
Value Directed Structure Learning
Figure: value-directed merging and splitting of behavior states, illustrated on a cooking scene (cook, salami, background); states that are irrelevant to the value functions (V_0, V_1, V_2) are merged, and relevant ones are split.
Value Directed Structure Learning
State merging:
repeat
  1. learn the POMDP model
  2. compute value functions for behaviors
  3. compute distance between value functions
  4. if policies agree, merge behaviors closest in value
until number of behaviors stops changing
State splitting:
repeat
  1. learn the POMDP model
  2. examine states for predictive power - entropy?
  3. split behaviors which predict different outcomes
until number of behaviors stops changing
(A code sketch of the merging loop follows below.)
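A hedged sketch of the state-merging procedure; the splitting loop would follow the same pattern with an entropy test in step 2. The callables and the max-norm distance are assumptions for illustration.

```python
import numpy as np

def value_directed_merge(data, n_behaviors, learn_pomdp, solve_mdp):
    """learn_pomdp(data, k) and solve_mdp(model) are assumed callables:
    the first fits a model with k behavior states via EM, the second returns,
    for each behavior b, a value vector V[b, :] and policy pi[b, :] over the
    observable states (via the MDP approximation)."""
    k = n_behaviors
    while True:
        model = learn_pomdp(data, k)                 # 1. learn the POMDP model
        V, pi = solve_mdp(model)                     # 2. value functions / policies per behavior

        best, best_dist = None, np.inf               # 3. distance between value functions
        for a in range(k):
            for b in range(a + 1, k):
                if np.array_equal(pi[a], pi[b]):     # 4. merge only if the policies agree
                    d = np.abs(V[a] - V[b]).max()
                    if d < best_dist:
                        best, best_dist = (a, b), d

        if best is None:                             # no mergeable pair: the number of
            return model                             # behaviors has stopped changing
        k -= 1                                       # merge the closest pair and re-learn
                                                     # with one fewer behavior state
```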
Experiments: Robot Control Gestures
Figure: example gesture frames (40–42, 51–53).
• robot action A^a_t ∈ {"go left", "stop", "go right", "forwards"}
• operator action A^b_t ∈ {"good robot", "bad robot"}, which gives the reward R
• control command: displayed behavior A^{b:a}_t
• observation of gesture O_t
Value-Directed Structure Learning
Figure: part of the value function and policy for robot control, shown as decision trees over Acom (d1–d6) and Aact (bad/good), with actions left / right / stop / forward and values between 0.50 and 1.65.
• Some states of Acom are redundant
• detect & merge using state aggregation in the policy and value function
• re-compute policy
Value-Directed Structure Learning
Figure: after merging, the value function and policy over Acom (d1–d4) and Aact (bad/good), with actions left / right / stop / forward and values between 0.50 and 1.65.
• Leave-four-out cross validation (12 times)
• Take actions and accumulate rewards
• Success rate: 47/48 = 98%, or 11/12 correct policies
• Merges to 4 states all 12 times
Experiments: Card Matching Game
Cooperative two-player game. Goal: match cards.
Figure: the three stages of the game (stage 1, stage 2, stage 3).
Card Matching Results
3 behaviors identified: nodding, shaking, null
Predicts:
• 6/7 human actions in test data
• 19/20 human actions in training data
Errors: lack of POMDP training data, temporal segmentation problems
Handwashing Behavior Understanding
Figure: the handwashing POMDP: Action = prompt, with utility (reward); Context = previous world state; Outcome = hands washed; caregiver behavior observed through P(video | behavior), labelled "statistically significant" and "value directed".
Difficult Cases
Figure: self-occlusion, object occlusion, and objects that appear to merge.
Conclusions
• Computer Vision + Probabilistic Models + Decision Theory
• Learning purposeful human behavior models from unlabeled data
• System is general and portable - no reliance on expert knowledge
• Applications: HCI, surveillance, assisted living, driver support
• Future work
  – Spatial segmentation and representation + tracking
  – Multimodal observations
  – Temporal segmentation
  – POMDP solutions
  – Value-directed learning (Hoey & Little, CVPR 2004)