Learning from actions: Temporal structures for human action recognition
Hilde Kuehne
Computer Vision Group, Prof. Juergen Gall, Institute of Computer Science III
Deep Learning for Computer Vision, Dagstuhl Seminar 1739
Overview
• Motivation:
  – Sequence models for activity recognition
• Weak action learning
  – Weak learning of sequential data
  – Weak learning with CNNs/RNNs
• Outlook:
  – Current projects: learning of unordered action sets, mining YouTube
Deep Learning for Computer Vision, Dagstuhl Seminar 1739, Institute of Computer Science III – Computer Vision Group, 17.09.2017
Why activity recognition?
• Human machine interaction, e.g. robotics, services such as assisted living, entertainment …
• Video transcription, movie labeling and indexing
• Surveillance: who does what when?
• Scientific studies, e.g. behavior and motion analysis, sport science …
Figure labels: SFB 588 – Humanoid Robot Armar III; HMDB51 [Kuehne2011]; Project AutoTIP - GoHuman
Activity recognition - Problem Statement
Action recognition
• … means (usually) one label per image / per clip. This doesn't work for complex activities:
  – One image may not be enough for reliable recognition
  – One label per video can be too coarse
• Look for a representation that captures the structure of complex action sequences:
  – human actions as time series (robotics)
  – models of complex relations between entities (speech)
Problem: Find representations that fit the structure of human actions
Figure labels: Weizmann [Blank2005]; Pascal [Everingham2010]; BKT [Kuehne2012]
Action primitives
Action primitives (action units):
• A motion that is performed continuously and without interruption.
• The smallest entity whose order can be changed during execution.
• Complex tasks, e.g. in the household domain, consist of concatenated action primitives.
• An action primitive is usually made up of a set of motion phases: start → preparation → main action → finalize → adjust
Figure labels: start position; end position; energy compensation; preparation for following action
Action Grammar
• All tasks, as long as they have a meaningful aim, are executed in a certain order.
• The order in which the tasks are executed is not random.
• It is possible to formulate a grammar that has to be followed.
• The action grammar defines the action sequences, which are concatenations of action primitives that result in a meaningful task.
Time based modeling
1. Action units: Linear n-state model [Gehrig2008]
   e.g. action unit: "picking a bowl"
   – s_1: move hand towards the bowl
   – s_2: grasp the bowl
   – s_3: take bowl to target position
   (Figure: states s_1, s_2, s_3 connected by transitions)
2. Activity: Context-free grammar [Gehrig2008]
   (Figure: grammar graph over the units picking_bowl, picking_bottle and idle_position)
Modelling of action units
• The task of recognizing an action unit is defined by the best match of the input sequence
  $\mathbf{x} = x_1, x_2, x_3, \ldots, x_T$
• with $x_i$ representing the feature vector at frame $i$, to a set of action units
  $\mathbf{u} = u_1, u_2, u_3, \ldots, u_N$
• This corresponds to maximizing the probability of an action unit $u_i$ given the input sequence $\mathbf{x}$:
  $$\arg\max_{i} P(u_i \mid \mathbf{x}) = \arg\max_{i} \frac{P(\mathbf{x} \mid u_i)\, P(u_i)}{P(\mathbf{x})}$$
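The decision rule above can be sketched in a few lines. Since P(x) is the same for every unit, it drops out of the arg max, and working with log-probabilities turns the product into a sum. The unit names, likelihoods and priors below are hypothetical numbers chosen for illustration, not values from the talk.

```python
import math

def classify_unit(log_likelihoods, log_priors):
    """Return the unit u_i maximizing log P(x | u_i) + log P(u_i).

    P(x) is constant across units, so it is omitted from the comparison.
    """
    scores = {u: log_likelihoods[u] + log_priors[u] for u in log_likelihoods}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical per-unit log-likelihoods log P(x | u_i) for one input sequence.
log_lik = {"pour": -120.0, "stir": -118.5, "take_cup": -131.2}
# Hypothetical priors P(u_i).
log_prior = {u: math.log(p) for u, p in
             {"pour": 0.5, "stir": 0.2, "take_cup": 0.3}.items()}

best, scores = classify_unit(log_lik, log_prior)
print(best)
```

In this toy setting the prior favors "pour", but the likelihood advantage of "stir" is large enough to win the comparison.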
Modelling of action units
The joint probability of the model $M_{u_i}$ moving through the state sequence $S_x$ can be calculated as the product of transition probabilities and observation probabilities given the input $\mathbf{x}$. (The equation on the slide is not preserved in the extracted text.)
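As an illustration of that product, the standard HMM form is $P(\mathbf{x}, S \mid M) = b_{s_1}(x_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(x_t)$, with transition probabilities $a$ and observation probabilities $b$. The sketch below evaluates it in the log domain for a hypothetical 3-state left-to-right model. It assumes the model starts in the first state of the path with probability 1, and all numbers are made up for the example.

```python
import math

def log_or_neg_inf(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def log_joint(path, trans, obs):
    """Log of the joint probability of the observations and a state path.

    trans[i][j] is the transition probability from state i to state j,
    obs[t][s] is the observation probability of frame t in state s.
    Assumption: the path's first state is entered with probability 1.
    """
    total = log_or_neg_inf(obs[0][path[0]])
    for t in range(1, len(path)):
        total += log_or_neg_inf(trans[path[t - 1]][path[t]])
        total += log_or_neg_inf(obs[t][path[t]])
    return total

# Hypothetical left-to-right model: stay or advance with equal probability.
trans = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
# Hypothetical observation probabilities for 4 frames and 3 states.
obs = [[0.8, 0.1, 0.1],
       [0.6, 0.3, 0.1],
       [0.2, 0.7, 0.1],
       [0.1, 0.2, 0.7]]

path = [0, 0, 1, 2]
joint = math.exp(log_joint(path, trans, obs))
print(joint)
```

A path that skips a state, such as 0 → 2, has zero transition probability in this model and therefore a log joint of negative infinity.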
Modelling of action sequences
Action sequences are realized as a concatenation of action units:
- Computation of probabilities with a combination of Viterbi and pruning
- Can include grammar specification
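A minimal sketch of Viterbi decoding with beam pruning, under stated assumptions: decoding starts in state 0, scores are kept in the log domain, and a hypothesis is dropped when it falls more than `beam` below the best score of the current frame. The function name, the model and the numbers are hypothetical, and grammar constraints are left out.

```python
import math

NEG_INF = float("-inf")

def to_log(matrix):
    """Element-wise log of a probability matrix, with log(0) = -inf."""
    return [[math.log(p) if p > 0 else NEG_INF for p in row] for row in matrix]

def viterbi_beam(log_obs, log_trans, beam=float("inf")):
    """Return (best state path, its log score). Assumes a start in state 0."""
    n = len(log_trans)
    score = [NEG_INF] * n
    score[0] = log_obs[0][0]
    back = []
    for t in range(1, len(log_obs)):
        best = max(score)
        new = [NEG_INF] * n
        ptr = [0] * n
        for i in range(n):
            # Skip dead hypotheses and those outside the beam.
            if score[i] == NEG_INF or score[i] < best - beam:
                continue
            for j in range(n):
                cand = score[i] + log_trans[i][j]
                if cand > new[j]:
                    new[j], ptr[j] = cand, i
        score = [new[j] + log_obs[t][j] for j in range(n)]
        back.append(ptr)
    # Backtrace from the best final state.
    state = max(range(n), key=lambda j: score[j])
    final_score = score[state]
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, final_score

# Hypothetical 3-state left-to-right model and 4 frames of observations.
log_trans = to_log([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]])
log_obs = to_log([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])

path_full, score_full = viterbi_beam(log_obs, log_trans)
path_beam, score_beam = viterbi_beam(log_obs, log_trans, beam=math.log(3))
print(path_full, path_beam)
```

With a wide enough beam the pruned search returns the same path as the exhaustive one; a beam that is too narrow can discard the hypothesis that would have won later, which is the usual trade-off between speed and accuracy.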
Practical realization
Recognition of action units:
• on the level of action units = an n-state left-to-right HMM
• state = two equally likely transitions, one to the current state and one to the next state
• number of states = adaptive to mean unit length
• initialization = equal distribution of samples
Recognition of sequences:
• action sequences = defined by a context-free grammar
• built by automatic parsing of labels or definition by hand
(Figure: example grammar that describes stirring, mashing and pouring.)
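The unit-level settings listed above could be sketched as follows. The slide only says that the number of states adapts to the mean unit length, so the `frames_per_state` ratio is a hypothetical choice made for this example, as are the function names.

```python
def build_left_to_right(n_states):
    """Transition matrix with two equally likely moves per state.

    Row s holds P(stay in s) = 0.5 and P(advance to s + 1) = 0.5.
    The extra last column stands for leaving the unit.
    """
    trans = [[0.0] * (n_states + 1) for _ in range(n_states)]
    for s in range(n_states):
        trans[s][s] = 0.5
        trans[s][s + 1] = 0.5
    return trans

def num_states(unit_lengths, frames_per_state=10):
    """Hypothetical rule: one state per `frames_per_state` frames of mean length."""
    mean_len = sum(unit_lengths) / len(unit_lengths)
    return max(1, round(mean_len / frames_per_state))

def uniform_state_assignment(n_frames, n_states):
    """Initialization: distribute the frames of a sample equally over the states."""
    return [min(n_states - 1, t * n_states // n_frames) for t in range(n_frames)]

# Hypothetical training samples of one unit, lengths in frames.
n = num_states([28, 32, 30])
trans = build_left_to_right(n)
init = uniform_state_assignment(6, n)
print(n, init)
```

The uniform assignment only provides a starting point; the state boundaries are expected to move once the models are re-estimated.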
Properties
Implicit segmentation
– Output sequence contains semantic and temporal information in addition to the general label
Continuous recognition
– Hypotheses are based on beams of (theoretically) unlimited length
Temporal variations are handled by HMMs:
– Temporal flexibility without need for more training samples
– Only constrained by number of states
– Handle large variations

Ground truth:
    #!MLF!#
    "bend.lab"
    0 3800000 bend_down
    3700000 6200000 bend_up
    .
    "jack.lab"
    0 2800000 jack
    2700000 5000000 jack
    .
    "pjump.lab"
    0 2300000 pjump
    2200000 4100000 pjump
    .

Result:
    #!MLF!#
    "bend.rec"
    0 3700000 bend_down 45358.023438
    3700000 6200000 bend_up 35816.691406
    .
    "jack.rec"
    0 1700000 jack 6247.286621
    1700000 2700000 jack -544.383606
    2700000 5000000 jack 10465.790039
    .
    "pjump.rec"
    0 1400000 pjump 11971.578125
    1400000 2800000 pjump 15659.549805
    2800000 4100000 pjump -25356.494141
    .
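The label listings on this slide follow the layout of an HTK master label file: a `#!MLF!#` header, a quoted file name per entry, one `start end label [score]` line per segment, and a single `.` closing each entry. The sketch below reads that layout. It assumes HTK's convention that times are given in units of 100 ns, and the function name `parse_mlf` is hypothetical.

```python
def parse_mlf(text):
    """Parse MLF-style text into {entry name: [(start, end, label, score), ...]}.

    The score is None when a line carries no fourth field (ground-truth labels).
    """
    entries, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"'):
            current = line.strip('"')
            entries[current] = []
        elif line == ".":
            current = None
        elif current is not None:
            parts = line.split()
            start, end, label = int(parts[0]), int(parts[1]), parts[2]
            score = float(parts[3]) if len(parts) > 3 else None
            entries[current].append((start, end, label, score))
    return entries

def to_seconds(htk_time):
    """Convert an HTK time stamp (assumed 100 ns units) to seconds."""
    return htk_time / 10_000_000

sample = '''#!MLF!#
"bend.rec"
0 3700000 bend_down 45358.023438
3700000 6200000 bend_up 35816.691406
.
'''
result = parse_mlf(sample)
for start, end, label, score in result["bend.rec"]:
    print(label, to_seconds(start), to_seconds(end), score)
```

Reading the recognizer output this way makes the implicit segmentation explicit: each entry yields the recognized units together with their boundaries.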
Example
Example
Weak learning of sequential data
Segmentation from video input + transcripts
Idea: Given
• sequences of input data
• transcripts, i.e. a list of the order in which the actions occur in the videos (e.g. Take cup, Pour milk, Stir coffee, …)
infer the scripted actions and train the related action models without any boundary information.
(Figure: video segmented into Pour coffee, Pour milk, Stir coffee, …)
Usually applied for training of ASR systems:
- lots of training data (e.g. TIMIT: ~6300 sentences × ~8.2 words per sentence × ~3.9 phones per word ≈ 201474 phone samples; Breakfast: ~11000)
- well defined vocabulary
- low signal variance
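The sample-count estimate on this slide can be reproduced directly. The sketch below uses only the approximate figures quoted above; the variable names are chosen for this example.

```python
# Approximate TIMIT figures as quoted on the slide.
sentences = 6300
words_per_sentence = 8.2
phones_per_word = 3.9

timit_phone_samples = sentences * words_per_sentence * phones_per_word

# Approximate number of samples quoted for Breakfast.
breakfast_samples = 11000

ratio = timit_phone_samples / breakfast_samples
print(round(timit_phone_samples), round(ratio, 1))
```

By this estimate the speech corpus offers roughly 18 times as many training samples as the video dataset, which is one reason why transferring the ASR training scheme to video is harder than it first appears.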
Segment Annotation vs. Transcript Annotation
Full segmented annotation requires the start and end frames for each action.
Transcript annotations contain only the actions within a video and the order in which they occur.
Cost of the different annotation techniques: annotators labeled both types on 11 videos (making coffee) with 7 possible actions
• Full segmented annotation: real-time factor 3.85 (= 3.85 × video duration)
• Transcript annotation: real-time factor 1.36, about a third of the time compared to a full annotation
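To make the real-time factors concrete, the sketch below converts them into annotation hours. The two factors are the ones reported above, while the 10 hours of video is a hypothetical amount chosen for illustration.

```python
RTF_FULL = 3.85        # full segment annotation, from the slide
RTF_TRANSCRIPT = 1.36  # transcript annotation, from the slide

def annotation_hours(video_hours, real_time_factor):
    """Annotation effort = video duration × real-time factor."""
    return video_hours * real_time_factor

video_hours = 10.0  # hypothetical amount of video to annotate
full = annotation_hours(video_hours, RTF_FULL)
transcript = annotation_hours(video_hours, RTF_TRANSCRIPT)
cost_ratio = RTF_TRANSCRIPT / RTF_FULL

print(full, transcript, round(cost_ratio, 2))
```

The ratio of the two factors is about 0.35, which matches the "about a third" stated on the slide.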
Video Segmentation given the Action Transcripts
Given the action transcripts, a large sequence-HMM can be built as a concatenation of the HMMs for each action class, in the order they occur in the transcript for this sequence.
Video segmentation: finding the best alignment of video frames to the sequence-HMM (e.g. with the Viterbi algorithm)
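A simplified sketch of this alignment step: each transcript entry is modeled by a single state rather than a multi-state HMM, and frames are aligned to the resulting left-to-right chain with Viterbi. Since every state has the same stay/advance probabilities, the transition terms contribute the same constant to every path and are omitted. The function name, action names and frame scores are hypothetical.

```python
NEG_INF = float("-inf")

def align_to_transcript(frame_loglik, transcript):
    """Align frames to a transcript; return [(action, first_frame, last_frame)].

    frame_loglik[t][a] is the log-likelihood of frame t under action a.
    Simplification: one state per transcript entry, start in the first entry,
    end in the last one.
    """
    T, K = len(frame_loglik), len(transcript)
    if T < K:
        raise ValueError("need at least one frame per transcript entry")
    score = [[NEG_INF] * K for _ in range(T)]
    stay = [[True] * K for _ in range(T)]
    score[0][0] = frame_loglik[0][transcript[0]]
    for t in range(1, T):
        for k in range(K):
            s_stay = score[t - 1][k]
            s_adv = score[t - 1][k - 1] if k > 0 else NEG_INF
            best = max(s_stay, s_adv)
            if best == NEG_INF:
                continue
            stay[t][k] = s_stay >= s_adv
            score[t][k] = best + frame_loglik[t][transcript[k]]
    # Backtrace from the last transcript entry at the last frame.
    k, labels = K - 1, [0] * T
    for t in range(T - 1, -1, -1):
        labels[t] = k
        if t > 0 and not stay[t][k]:
            k -= 1
    # Collapse per-frame labels into segments.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or labels[t] != labels[start]:
            segments.append((transcript[labels[start]], start, t - 1))
            start = t
    return segments

# Hypothetical frame scores: frames 0-1, 2-3 and 4-5 each favor one action.
actions = ["take_cup", "pour_milk", "stir_coffee"]
favored = ["take_cup", "take_cup", "pour_milk", "pour_milk",
           "stir_coffee", "stir_coffee"]
frame_loglik = [{a: (-1.0 if a == f else -5.0) for a in actions} for f in favored]

segments = align_to_transcript(frame_loglik, actions)
print(segments)
```

In weakly supervised training, boundaries obtained this way can serve as the segmentation for re-estimating the action models, after which the alignment is repeated.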
System overview
Example
Example for segmentation during the training iterations (figure)
Alignment vs. Segmentation

            Alignment             Segmentation
Given:      Video + Transcript    Video
Result:     Segment boundaries    Action classes + Segment boundaries