Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition
Stefan Mathe, Cristian Sminchisescu
Presented by Mit Shah
Motivation…
● Current Computer Vision
  ○ Annotations are subjectively defined
  ○ What are the intermediate levels of computation?
Motivation…
● Lack of large-scale datasets that provide recordings of the workings of the human visual system
Previous Work...
Study of gaze patterns in humans
(example: a person browsing reddit exhibits the F-shaped pattern)
● Inter-observer consistency
● Bottom-up features
● Human fixations
● Models of saliency
● Uses of saliency maps
  ○ Action recognition
  ○ Object localization
  ○ Scene classification
● Previous datasets
  ○ At most a few hundred videos, recorded under free-viewing conditions
Contributions...
❏ (1) Extended the existing large-scale datasets Hollywood-2 and UCF Sports
Contributions...
❏ (2) Dynamic consistency and alignment measures
  ○ AOI Markov dynamics
  ○ Temporal AOI alignment
Contributions...
❏ (3) Training an end-to-end automatic visual action recognition system
Data Collection...
Hollywood-2 Movie Dataset
● Largest and most challenging dataset
● 12 classes, 69 movies, 823/884 train/test split, 487k frames, 20 hr
● Answering phone, driving a car, eating, fighting, etc.
Data Collection...
UCF Sports Action Dataset
● Broadcasts from television channels
● 150 videos covering 9 sports action classes
● Diving, golf swinging, kicking, etc.
Data Collection...
Extending the two datasets
● Recording environment: SMI iView X HiSpeed 1250 tower-mounted eye tracker
● Recording protocol: timings/durations, breaks, and many other specifications
● Tasks: 19 humans divided into 3 task groups
  ○ Action recognition
  ○ Context recognition
  ○ Free viewing
Static & Dynamic Consistency
Action Recognition by Humans
● Goal & importance
● Human errors
  ○ Co-occurring actions
  ○ False positives
  ○ Mislabeling videos
Static Consistency Among Subjects
● How well do the regions fixated by human subjects agree on a frame-by-frame basis?
● Evaluation protocol
The Influence of Task on Eye Movements
Hypothesis testing pipeline:
● Derive saliency maps from S_A \ {s}, predict the fixations of the held-out subject s, and evaluate prediction scores (repeated n_A times)
● Derive saliency maps from S_A and evaluate the average prediction score for each s' in S_B (repeated n_B times)
● Compare the two score populations with an independent 2-sample t-test with unequal variances: is the p-value >= 0.5?
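The comparison step above can be sketched with SciPy's unequal-variance (Welch's) t-test; the per-subject scores here are synthetic placeholders, since the real protocol derives them by predicting each held-out subject's fixations from the other subjects' saliency maps:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-subject prediction scores under two task conditions
# (synthetic stand-ins for the scores produced by the real protocol).
rng = np.random.default_rng(0)
scores_task = rng.normal(0.62, 0.05, size=16)   # e.g. action recognition subjects
scores_free = rng.normal(0.60, 0.05, size=16)   # e.g. free-viewing subjects

# Independent 2-sample t-test with unequal variances (Welch's test)
t_stat, p_value = ttest_ind(scores_task, scores_free, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A small p-value would indicate that the task measurably changes where subjects look.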
The Influence of Task on Eye Movements
● Results
Dynamic Consistency Among Subjects
● Spatial distribution: highly consistent
● Is there significant consistency in the temporal order as well?
● Automatic discovery of AOIs & 2 metrics
  ○ AOI Markov dynamics
  ○ Temporal AOI alignment
Scanpath representation
● Human fixations are tightly clustered
● Assign each fixation to the closest AOI
● Trace out the scan path
Automatically Finding AOIs
Clustering the fixations of all subjects in a frame:
● Start from K-means with 1 cluster
● Successively increase K until the sum of squared errors drops below a threshold
● Link centroids from successive frames into tracks
● Each resulting track becomes an AOI
● Each fixation is assigned to the closest AOI at the time of its creation
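A single-frame sketch of the K-growing step above (the `find_frame_aois` helper, the threshold, and the fixation coordinates are illustrative; the full method additionally links centroids across frames into AOI tracks):

```python
import numpy as np

def find_frame_aois(points, sse_threshold, max_k=10, n_iter=25, seed=0):
    """Grow K-means from 1 cluster until the sum of squared errors
    drops below the threshold; the surviving centroids serve as the
    frame's candidate AOIs (simplified single-frame sketch)."""
    rng = np.random.default_rng(seed)
    for k in range(1, min(max_k, len(points)) + 1):
        # initialize centroids from randomly chosen fixations
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # assign each fixation to its closest centroid
            dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
            labels = dists.argmin(axis=1)
            centroids = np.stack([
                points[labels == c].mean(axis=0) if np.any(labels == c)
                else centroids[c]
                for c in range(k)
            ])
        sse = ((points - centroids[labels]) ** 2).sum()
        if sse < sse_threshold:
            return centroids, labels
    return centroids, labels

# two tight fixation clusters -> two AOIs expected
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
aois, labels = find_frame_aois(pts, sse_threshold=1.0)
```

Growing K until the SSE falls below a threshold avoids fixing the number of AOIs per frame in advance.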
AOI Markov Dynamics
● Model transitions of human visual attention between AOIs via each subject's fixation string f_i
● Estimate the probability of transitioning to AOI "b" at time t, given that the human fixated AOI "a" at time t-1
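These transition probabilities can be estimated from the fixation strings roughly as follows (a minimal sketch; the add-one smoothing is an assumed choice, not necessarily the paper's):

```python
import numpy as np

def aoi_transition_matrix(fixation_strings, n_aois):
    """Estimate P(AOI b at time t | AOI a at time t-1) from subjects'
    fixation strings (sequences of AOI indices). Add-one smoothing
    avoids empty rows for AOIs that are never left (illustrative)."""
    counts = np.ones((n_aois, n_aois))
    for s in fixation_strings:
        for a, b in zip(s[:-1], s[1:]):  # consecutive fixation pairs
            counts[a, b] += 1
    # normalize each row into a probability distribution
    return counts / counts.sum(axis=1, keepdims=True)

# two hypothetical scanpaths over 3 AOIs
P = aoi_transition_matrix([[0, 1, 1, 2], [0, 1, 2, 2]], n_aois=3)
```

Each row of `P` is a distribution over the next AOI, so consistent subjects produce sharply peaked rows.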
Temporal AOI Alignment
● Longest common subsequence
● Able to handle gaps and missing elements
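A minimal longest-common-subsequence sketch for comparing two AOI scanpaths (the normalization in `alignment_score` is an assumed choice for illustration):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two AOI scanpaths;
    unlike exact string matching, it tolerates gaps and missing
    elements between matched fixations."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            # extend a match, or carry over the best prefix score
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def alignment_score(a, b):
    # normalize to [0, 1] by the longer scanpath (illustrative choice)
    return lcs_length(a, b) / max(len(a), len(b)) if (a or b) else 1.0
```

For example, scanpaths `[1, 2, 3]` and `[1, 3]` share the subsequence `[1, 3]` despite the missing middle fixation.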
Evaluation Pipeline
● Interest point operator: input a video, output a set of spatio-temporal coordinates
● Visual descriptors: spacetime HoG & MBH computed from optical flow
● Dictionary: cluster descriptors into 4000 visual words using K-means
● Classifiers: RBF-χ² kernel generalization within a Multiple Kernel Learning (MKL) framework
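The classifier stage can be illustrated with an RBF-χ² kernel between bag-of-words histograms, as commonly used in such MKL pipelines (the `gamma` bandwidth and the random histograms are placeholders; `gamma` is often set from the mean χ² distance on the training set):

```python
import numpy as np

def rbf_chi2_kernel(H1, H2, gamma=1.0):
    """RBF-χ² kernel between rows of two histogram matrices:
    K(x, y) = exp(-gamma * 0.5 * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    num = (H1[:, None] - H2[None]) ** 2
    den = H1[:, None] + H2[None] + 1e-12    # avoid division by zero
    d = 0.5 * (num / den).sum(axis=2)       # pairwise chi-squared distances
    return np.exp(-gamma * d)

# hypothetical 4000-word histograms for three videos
rng = np.random.default_rng(0)
H = rng.random((3, 4000))
H /= H.sum(axis=1, keepdims=True)           # normalize to sum 1
K = rbf_chi2_kernel(H, H)
```

The resulting Gram matrix `K` would then feed one channel of the MKL combination, alongside kernels from the other descriptor types.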
Human Fixation Studies
Human vs. Computer Vision Operators
● Fixations as an interest point detector
● Findings
  ○ Low correlation
  ○ Why?
Impact of Human Saliency Maps on Computer Visual Action Recognition
● Saliency maps encoding only the weak surface structure of fixations (no time ordering) can be used to boost the accuracy of contemporary methods
Saliency Map Prediction
● Static features
● Motion features
● Evaluation: AUC & spatial KL divergence
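The two evaluation metrics can be sketched as follows (`fixation_auc` and `spatial_kl` are hypothetical helpers; the paper's exact AUC protocol may differ, e.g. shuffled-AUC variants that control for center bias):

```python
import numpy as np

def fixation_auc(saliency, fixation_mask):
    """AUC: treat the saliency map as a classifier separating fixated
    from non-fixated pixels; ties count as half (simplified sketch)."""
    pos = saliency[fixation_mask].ravel()
    neg = saliency[~fixation_mask].ravel()
    greater = (pos[:, None] > neg[None]).mean()
    ties = (pos[:, None] == neg[None]).mean()
    return greater + 0.5 * ties

def spatial_kl(p, q, eps=1e-12):
    """KL divergence between two saliency maps, each normalized to a
    probability distribution over pixels."""
    p = p / p.sum()
    q = q / q.sum()
    return float((p * np.log((p + eps) / (q + eps))).sum())

# toy 2x2 map where fixated pixels received higher predicted saliency
sal = np.array([[0.9, 0.1], [0.8, 0.2]])
fix = np.array([[True, False], [True, False]])
```

A perfect predictor ranks every fixated pixel above every non-fixated one (AUC of 1), while KL measures how far the predicted distribution is from the empirical fixation map.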
Automatic Visual Action Recognition
Conclusions
● Combining human + computer vision
● Extending the datasets
● Evaluating static & dynamic consistency
● Human fixations -> saliency maps
● End-to-end action recognition system
Thanks!