End-to-end Learning of Action Detection from Frame Glimpses in Videos
BIL722 - Advanced Topics in Computer Vision
Ezgi Pekşen Soysal, Hacettepe University
Task: what is the person doing?
Input: a video from t = 0 to t = T
Output: action labels over time (e.g. Running, Talking)
Goals: Accuracy, Efficiency, Interpretability
Olga Russakovsky: The human side of computer vision
Efficient video processing
Timeline from t = 0 to t = T, with action labels Running and Talking
“Knowing the output or the final state… there is no need to explicitly store many previous states”
“Time may be represented in several ways… The intervals between ‘pulses’ need not be equal.”
[N. I. Badler. “Temporal Scene Analysis…” 1975]
Dominant paradigm: sliding windows over t = 0 to t = T
Used in all THUMOS challenge action detection entries: [OneVerSch 2014] … [WanQiaTan 2014] [KarSeiBim 2014] [YuaPeiNiMouKas 2015] …
Our model for efficient action detection [YeuRusMorFei CVPR’16]
Input: a frame, taken from a video spanning t = 0 to t = T
Frame model: convolutional neural network (frame information) feeding a recurrent neural network (time information)
Outputs at each step:
• Optional output: detection instance [start, end]
• Output: next frame to glimpse
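The loop described on this slide (observe one frame, update a recurrent state, optionally emit a detection, choose the next frame) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the weights are random rather than trained, the CNN is replaced by a stand-in that returns precomputed features, and every name and size here (`run_glimpses`, `frame_features`, `FEAT_DIM`, `HID_DIM`, the noise scale) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM, HID_DIM = 8, 16  # hypothetical sizes

# Hypothetical, randomly initialised parameters (a trained model would learn these).
W_in = rng.normal(scale=0.1, size=(HID_DIM, FEAT_DIM + 1))
W_h = rng.normal(scale=0.1, size=(HID_DIM, HID_DIM))
W_det = rng.normal(scale=0.1, size=(2, HID_DIM))  # candidate [start, end]
w_emit = rng.normal(scale=0.1, size=HID_DIM)      # whether to output a detection
w_loc = rng.normal(scale=0.1, size=HID_DIM)       # next frame to glimpse


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def frame_features(video, idx):
    """Stand-in for the CNN: each frame is already a feature vector here."""
    return video[idx]


def run_glimpses(video, num_glimpses=6):
    """Observe a few frames; return (glimpsed frame indices, emitted detections)."""
    num_frames = len(video)
    h = np.zeros(HID_DIM)
    loc = 0.0  # normalised position in [0, 1]; start at t = 0
    visited, detections = [], []
    for _ in range(num_glimpses):
        idx = int(round(loc * (num_frames - 1)))
        visited.append(idx)
        x = np.concatenate([frame_features(video, idx), [loc]])
        h = np.tanh(W_in @ x + W_h @ h)            # recurrent update (time information)
        start, end = np.sort(sigmoid(W_det @ h))   # candidate detection in [0, 1]
        if rng.random() < sigmoid(w_emit @ h):     # sampled decision: emit or not
            detections.append((float(start), float(end)))
        # Sampled decision: where to look next (may move forward or backward).
        loc = float(np.clip(sigmoid(w_loc @ h) + rng.normal(scale=0.1), 0.0, 1.0))
    return visited, detections


video = rng.normal(size=(300, FEAT_DIM))  # 300 frames of made-up features
visited, detections = run_glimpses(video)
```

The two sampled decisions (emit, next location) are the outputs that cannot be trained by plain backpropagation, which is why the talk treats them separately from the detection output.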
Training the detection instance output [YeuRusMorFei CVPR’16]
Training data: a positive video with ground-truth instances g1 and g2, and a negative video, each spanning t = 0 to t = T
Detections: d1, d2, d3, d4
Ground-truth assignment for each detection: y1 = 1, y2 = 1, y3 = 2, y4 = 0
Reward for detection: cross-entropy (classification loss) and L2 distance (localization loss)
Aside:
• effective video annotation [YeuRusJinAndMorFei UnderReview] [LiuRusDenBerFei ImageNetChallenge ’15]
• weakly supervised detection [YeuRamRusMorFei InPreparation]
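A small sketch of how a loss of this shape could be computed, under stated assumptions: each detection is assigned to its nearest ground-truth instance (index 0 when the video has none, which is one reading of y4 = 0 on the slide), the classification term is a binary cross-entropy on the detection's confidence, and the localization term is a squared L2 distance on [start, end]. The nearest-centre matching rule, the weight `gamma`, and the function names are assumptions for illustration, not details taken from the slides.

```python
import numpy as np


def match_ground_truth(det, ground_truths):
    """Return the 1-based index of the nearest ground truth, or 0 if there is none.

    Assumption: "nearest" is measured between segment centres.
    """
    if len(ground_truths) == 0:
        return 0
    centre = 0.5 * (det[0] + det[1])
    dists = [abs(centre - 0.5 * (g[0] + g[1])) for g in ground_truths]
    return int(np.argmin(dists)) + 1


def detection_loss(dets, confidences, ground_truths, gamma=1.0):
    """Cross-entropy on confidence plus L2 localisation on matched detections."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for det, p in zip(dets, confidences):
        y = match_ground_truth(det, ground_truths)
        target = 1.0 if y > 0 else 0.0
        total += -(target * np.log(p + eps) + (1.0 - target) * np.log(1.0 - p + eps))
        if y > 0:  # localisation only applies to matched detections
            g = ground_truths[y - 1]
            total += gamma * ((det[0] - g[0]) ** 2 + (det[1] - g[1]) ** 2)
    return float(total)


# Made-up segments in normalised time, arranged to mirror the slide's assignments.
positive_gts = [(0.10, 0.30), (0.60, 0.90)]                  # g1, g2
positive_dets = [(0.05, 0.25), (0.20, 0.35), (0.55, 0.80)]   # d1, d2, d3
negative_dets = [(0.40, 0.50)]                               # d4

matches = [match_ground_truth(d, positive_gts) for d in positive_dets]
matches += [match_ground_truth(d, []) for d in negative_dets]

loss = detection_loss(positive_dets, [0.9, 0.6, 0.8], positive_gts)
loss += detection_loss(negative_dets, [0.2], [])
```

With these made-up segments, `matches` reproduces the pattern on the slide: d1 and d2 go to g1, d3 goes to g2, and d4 has no match.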
Training the non-differentiable outputs [YeuRusMorFei CVPR’16]
Training data: ground-truth instances on a video from t = 0 to t = T
Detections: d1, d2, d3, labelled good, bad, bad
Model’s action sequence a:
(1) whether to predict a detection
(2) where to look next: from Frame 1 go to frame 8, from Frame 8 go to frame 6, from Frame 6 go to frame 15
Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]
Reward for an action sequence; objective; gradient; Monte-Carlo approximation (equations shown on the slide)
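The equations for the objective, gradient, and Monte-Carlo approximation did not survive in this transcript. For reference, the standard REINFORCE formulation [Williams 1992] has the form below. The notation is assumed here and may differ from the slide: θ denotes the policy parameters, a an action sequence with reward r(a), p_θ(a) its probability under the policy, K the number of sampled sequences, N the number of steps, and h the history of observations and actions up to a step.

```latex
\begin{align}
J(\theta) &= \sum_{a} p_\theta(a)\, r(a) \\
\nabla_\theta J(\theta) &= \sum_{a} p_\theta(a)\, \nabla_\theta \log p_\theta(a)\, r(a) \\
\nabla_\theta J(\theta) &\approx \frac{1}{K} \sum_{i=1}^{K} \sum_{n=1}^{N}
  \nabla_\theta \log \pi_\theta\!\left(a_n^{i} \mid h_{1:n}^{i}\right) r\!\left(a^{i}\right)
\end{align}
```

The second line follows from the identity ∇p = p ∇log p, and the third replaces the sum over all sequences with K sequences sampled from the current policy, using the factorisation of p_θ(a) into per-step probabilities π_θ.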
✓ Accuracy: detection AP at IOU 0.5 [YeuRusMorFei CVPR’16]
Dataset               State-of-the-art   Our result
THUMOS 2014           14.4               17.1
ActivityNet sports    33.2               36.7
ActivityNet work      31.1               39.9
✓ Efficiency: glimpse only 2% of video frames
Sampling        Detection AP at IOU 0.5
Uniform         9.3
Our glimpses    17.1
✓ Interpretability: example for Javelin throw, showing the ground truth, the detections, and the glimpsed frames
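The accuracy numbers are reported at IOU 0.5. As a rough illustration, and assuming IOU here means the temporal intersection over union between a predicted [start, end] segment and a ground-truth segment, the check could look like the sketch below. The function names are hypothetical, and the full AP computation (ranking detections by confidence, handling duplicates) is left out.

```python
def temporal_iou(a, b):
    """Intersection over union of two [start, end] segments on the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_correct(det, gt, threshold=0.5):
    """A detection counts as correct when its IOU reaches the threshold."""
    return temporal_iou(det, gt) >= threshold


# Made-up segments in seconds.
print(is_correct((2.0, 10.0), (0.0, 10.0)))   # overlap 8 of union 10
print(is_correct((0.0, 10.0), (5.0, 15.0)))   # overlap 5 of union 15
```

In the second call the two segments share half of each segment's length, yet the IOU is only one third, so the detection would not count at this threshold.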