End-to-end Learning of Action Detection from Frame Glimpse in Videos



  1. End-to-end Learning of Action Detection from Frame Glimpse in Videos. BIL722 - Advanced Topics in Computer Vision. Ezgi Pekşen Soysal, Hacettepe University

  2. Task: what is the person doing? Input: a video from t = 0 to t = T. Output: action labels (Running, Talking). Olga Russakovsky: The human side of computer vision

  3. Task: what is the person doing? Input: a video from t = 0 to t = T. Output: action labels (Running, Talking). Accuracy, efficiency, interpretability.

  4. Efficient video processing. [Figure: video timeline from t = 0 to t = T]

  5. Efficient video processing. [Figure: video timeline from t = 0 to t = T with the actions Running and Talking]

  6. Efficient video processing. [Figure: video timeline from t = 0 to t = T with the actions Running and Talking]

  7. Efficient video processing. [Figure: video timeline from t = 0 to t = T with the actions Running and Talking] “Knowing the output or the final state… there is no need to explicitly store many previous states” [N. I. Badler. “Temporal Scene Analysis…” 1975]

  8. Efficient video processing. [Figure: video timeline from t = 0 to t = T with the actions Running and Talking] “Knowing the output or the final state… there is no need to explicitly store many previous states” [N. I. Badler. “Temporal Scene Analysis…” 1975] Dominant paradigm: sliding windows over the video from t = 0 to t = T. Used in all THUMOS challenge action detection entries: [OneVerSch 2014] … [WanQiaTan 2014] [KarSeiBim 2014] [YuaPeiNiMouKas 2015] …

  9. Efficient video processing. [Figure: video timeline from t = 0 to t = T with the actions Running and Talking] “Knowing the output or the final state… there is no need to explicitly store many previous states” “Time may be represented in several ways… The intervals between ‘pulses’ need not be equal.” [N. I. Badler. “Temporal Scene Analysis…” 1975]

  10. Our model for efficient action detection. Input: a frame from the video (t = 0 to t = T). The frame model produces an output. [YeuRusMorFei CVPR’16]

  11. Our model for efficient action detection. Input: a frame from the video (t = 0 to t = T). The frame model produces two outputs: a detection instance [start, end] and the next frame to glimpse. [YeuRusMorFei CVPR’16]

  12. Our model for efficient action detection. The frame model is applied at two glimpses along the video (t = 0 to t = T). Outputs at each glimpse: a detection instance [start, end] and the next frame to glimpse. [YeuRusMorFei CVPR’16]

  13. Our model for efficient action detection. The frame model is applied at three glimpses and onward (…) along the video (t = 0 to t = T). Outputs at each glimpse: a detection instance [start, end] and the next frame to glimpse. [YeuRusMorFei CVPR’16]

  14. Our model for efficient action detection. The frame model consists of a convolutional neural network (frame information) and a recurrent neural network (time information). Outputs at each glimpse along the video (t = 0 to t = T): a detection instance [start, end] and the next frame to glimpse. [YeuRusMorFei CVPR’16]

  15. Our model for efficient action detection. The frame model consists of a convolutional neural network (frame information) and a recurrent neural network (time information). Optional output: a detection instance [start, end]. Output at every glimpse along the video (t = 0 to t = T): the next frame to glimpse. [YeuRusMorFei CVPR’16]
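The glimpse loop described on slides 10 to 15 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the model from the talk: the weights are random placeholders, `frame_features` is a hypothetical stand-in for the convolutional network, and the names `run_glimpses`, `W_det`, `W_emit`, and `W_loc` do not come from the slides. It shows the three per-glimpse outputs: a candidate [start, end], a sampled decision on whether to emit it, and the next frame to glimpse.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID = 8, 16  # hypothetical feature and hidden sizes

# Random placeholder weights; a trained model would learn these.
W_in = rng.normal(scale=0.1, size=(HID, FEAT))
W_h = rng.normal(scale=0.1, size=(HID, HID))
W_det = rng.normal(scale=0.1, size=(3, HID))  # start, end, confidence
W_emit = rng.normal(scale=0.1, size=HID)      # whether to emit the detection
W_loc = rng.normal(scale=0.1, size=HID)       # where to look next


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def frame_features(video, t):
    """Hypothetical stand-in for the CNN: returns the features of frame t."""
    return video[t]


def run_glimpses(video, num_glimpses=6):
    """Visit a few frames, optionally emitting a detection at each one."""
    T = len(video)
    h = np.zeros(HID)
    t = 0
    visited, detections = [], []
    for _ in range(num_glimpses):
        visited.append(t)
        # Recurrent update from the current frame and the previous state.
        h = np.tanh(W_in @ frame_features(video, t) + W_h @ h)
        start, end, conf = sigmoid(W_det @ h)
        lo, hi = sorted((start, end))  # normalised interval in [0, 1]
        # Optional output: sample whether to emit this candidate detection.
        if rng.random() < sigmoid(W_emit @ h):
            detections.append((lo * (T - 1), hi * (T - 1), conf))
        # Output at every step: sample the next frame to glimpse.
        loc = np.clip(sigmoid(W_loc @ h) + rng.normal(scale=0.1), 0.0, 1.0)
        t = int(round(loc * (T - 1)))
    return visited, detections


video = rng.normal(size=(100, FEAT))  # 100 frames of fake features
visited, detections = run_glimpses(video)
```

Both the emit decision and the next location are sampled rather than computed deterministically, which is why these two outputs are treated as non-differentiable later in the deck.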

  16. Training the detection instance output. Training data: a positive video and a negative video (each from t = 0 to t = T), with ground-truth instances marked as [ ] intervals. [YeuRusMorFei CVPR’16]

  17. Training the detection instance output. Training data: a positive video and a negative video (each from t = 0 to t = T), with ground-truth instances marked as [ ] intervals. Aside: • effective video annotation [YeuRusJinAndMorFei UnderReview] [LiuRusDenBerFei ImageNetChallenge ’15] • weakly supervised detection [YeuRamRusMorFei InPreparation] [YeuRusMorFei CVPR’16]

  18. Training the detection instance output. Training data: a positive video and a negative video (each from t = 0 to t = T), with ground-truth instances g1 and g2. Detections on the same timelines: d1, d2, d3, d4. [YeuRusMorFei CVPR’16]

  19. Training the detection instance output. Training data: a positive video and a negative video (each from t = 0 to t = T), with ground-truth instances g1 and g2. Detections on the same timelines: d1, d2, d3, d4. Reward for detection. [YeuRusMorFei CVPR’16]

  20. Training the detection instance output. Training data: a positive video and a negative video (each from t = 0 to t = T), with ground-truth instances g1 and g2. Detections on the same timelines: d1, d2, d3, d4, with assignments y1 = 1, y2 = 1, y3 = 2, y4 = 0. Reward for detection: cross-entropy classification loss. [YeuRusMorFei CVPR’16]

  21. Training the detection instance output. Training data: a positive video and a negative video (each from t = 0 to t = T), with ground-truth instances g1 and g2. Detections on the same timelines: d1, d2, d3, d4, with assignments y1 = 1, y2 = 1, y3 = 2, y4 = 0. Reward for detection: cross-entropy classification loss and L2 distance localization loss. [YeuRusMorFei CVPR’16]
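The two loss terms named on slides 20 and 21 can be illustrated with a small sketch. The slides do not give the exact formula or the matching rule, so the following are assumptions: the assignments y are taken as given (y = 0 meaning "no matching ground truth"), the classification term is a binary cross-entropy on a per-detection confidence, the localization term is a squared L2 distance on [start, end], and the weight `lam` is hypothetical.

```python
import numpy as np


def detection_loss(dets, confs, gts, y, lam=1.0):
    """Cross-entropy on confidences plus squared L2 on matched intervals.

    dets  : (N, 2) predicted [start, end] per detection
    confs : (N,)   predicted confidence in (0, 1) per detection
    gts   : (M, 2) ground-truth [start, end] instances g_1..g_M
    y     : (N,)   index of the matched ground truth, 0 = no match
    lam   : hypothetical weight balancing the two terms
    """
    dets = np.asarray(dets, dtype=float)
    confs = np.asarray(confs, dtype=float)
    gts = np.asarray(gts, dtype=float)
    y = np.asarray(y, dtype=int)

    matched = y > 0
    eps = 1e-12  # avoids log(0)
    # Classification: matched detections should be confident, others not.
    cls = -np.sum(np.where(matched,
                           np.log(confs + eps),
                           np.log(1.0 - confs + eps)))
    # Localization: only matched detections are pulled toward their g_m.
    loc = np.sum((dets[matched] - gts[y[matched] - 1]) ** 2)
    return cls, loc, cls + lam * loc


# Assignments from the slide: y1 = 1, y2 = 1, y3 = 2, y4 = 0.
gts = [[10, 20], [40, 60]]                       # made-up g1, g2
dets = [[11, 19], [12, 22], [38, 61], [70, 80]]  # made-up d1..d4
confs = [0.9, 0.8, 0.7, 0.1]
y = [1, 1, 2, 0]
cls, loc, total = detection_loss(dets, confs, gts, y)
```

The interval and confidence values are invented for illustration only. The unmatched detection d4 contributes to the classification term but not to the localization term.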

  22. Training the non-differentiable outputs. [Figure: training data timeline and detections timeline, each from t = 0 to t = T] [YeuRusMorFei CVPR’16]

  23. Training the non-differentiable outputs. [Figure: training data timeline and detections timeline, each from t = 0 to t = T, with detections d1, d2, d3] Model’s action sequence a: (1) whether to predict a detection; (2) where to look next. Example glimpses at Frame 1, Frame 8, Frame 6, Frame 15, with moves “go to frame 15”, “go to frame 8”, “go to frame 6”. [YeuRusMorFei CVPR’16]

  24. Training the non-differentiable outputs. [Figure: training data timeline and detections timeline, each from t = 0 to t = T, with detections d1, d2, d3] Model’s action sequence a: (1) whether to predict a detection; (2) where to look next. Example glimpses at Frame 1, Frame 8, Frame 6, Frame 15, with moves “go to frame 15”, “go to frame 8”, “go to frame 6”. Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]. [YeuRusMorFei CVPR’16]

  25. Training the non-differentiable outputs. [Figure: training data timeline and detections timeline, each from t = 0 to t = T, with detections d1, d2, d3 marked good, bad, bad] Model’s action sequence a: (1) whether to predict a detection; (2) where to look next. Example glimpses at Frame 1, Frame 8, Frame 6, Frame 15, with moves “go to frame 15”, “go to frame 8”, “go to frame 6”. Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]. Reward for an action sequence a: [equation not preserved]. [YeuRusMorFei CVPR’16]

  26. Training the non-differentiable outputs. [Figure: training data timeline and detections timeline, each from t = 0 to t = T, with detections d1, d2, d3 marked good, bad, bad] Model’s action sequence a: (1) whether to predict a detection; (2) where to look next. Example glimpses at Frame 1, Frame 8, Frame 6, Frame 15, with moves “go to frame 15”, “go to frame 8”, “go to frame 6”. Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]. Reward for an action sequence a, objective, gradient, and Monte-Carlo approximation: [equations not preserved]. [YeuRusMorFei CVPR’16]
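The objective, gradient, and Monte-Carlo approximation on slide 26 did not survive extraction. For orientation, the standard REINFORCE form [Williams 1992] is written out below. The notation is an assumption and may differ from the slide: a is an action sequence, p_θ(a) is its probability under the policy with parameters θ, r(a) is the reward for the sequence, and K is the number of sampled sequences.

```latex
% Assumed notation: a = action sequence, p_\theta = policy, r(a) = reward,
% K = number of action sequences sampled from the current policy.
\begin{align}
J(\theta) &= \mathbb{E}_{a \sim p_\theta}\left[ r(a) \right]
           = \sum_{a} p_\theta(a)\, r(a) \\
\nabla_\theta J(\theta) &= \sum_{a} p_\theta(a)\, \nabla_\theta \log p_\theta(a)\, r(a) \\
\nabla_\theta J(\theta) &\approx \frac{1}{K} \sum_{i=1}^{K}
           \nabla_\theta \log p_\theta(a^{i})\, r(a^{i}),
           \qquad a^{i} \sim p_\theta
\end{align}
```

The second line follows from the identity ∇p = p ∇log p, and the third replaces the sum over all action sequences with an average over K sampled ones. This makes it possible to train the "whether to predict" and "where to look next" decisions even though they are not differentiable.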

  27. Accuracy, efficiency, interpretability. [YeuRusMorFei CVPR’16]

  28. ✓ Accuracy. Detection AP at IOU 0.5 (dataset: state-of-the-art / our result): THUMOS 2014: 14.4 / 17.1; ActivityNet sports: 33.2 / 36.7; ActivityNet work: 31.1 / 39.9. Efficiency. Interpretability. [YeuRusMorFei CVPR’16]

  29. ✓ Accuracy. Detection AP at IOU 0.5 (dataset: state-of-the-art / our result): THUMOS 2014: 14.4 / 17.1; ActivityNet sports: 33.2 / 36.7; ActivityNet work: 31.1 / 39.9. ✓ Efficiency: glimpse only 2% of video frames. Interpretability. [YeuRusMorFei CVPR’16]

  30. ✓ Accuracy. Detection AP at IOU 0.5 (dataset: state-of-the-art / our result): THUMOS 2014: 14.4 / 17.1; ActivityNet sports: 33.2 / 36.7; ActivityNet work: 31.1 / 39.9. ✓ Efficiency: glimpse only 2% of video frames. Detection AP at IOU 0.5 by sampling: Uniform: 9.3; Our glimpses: 17.1. Interpretability. [YeuRusMorFei CVPR’16]

  31. ✓ Accuracy. Detection AP at IOU 0.5 (dataset: state-of-the-art / our result): THUMOS 2014: 14.4 / 17.1; ActivityNet sports: 33.2 / 36.7; ActivityNet work: 31.1 / 39.9. ✓ Efficiency: glimpse only 2% of video frames. Detection AP at IOU 0.5 by sampling: Uniform: 9.3; Our glimpses: 17.1. ✓ Interpretability. [Figure: frames and glimpses on a video timeline, with ground truth “Javelin throw” and detections “Javelin throw”] [YeuRusMorFei CVPR’16]
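The results above are reported as detection AP at IOU 0.5. As a brief sketch of what that threshold means for temporal detections, the helper below computes intersection-over-union between two [start, end] intervals. The function names `temporal_iou` and `is_correct` are illustrative, and the full AP computation (ranking by confidence, handling duplicate detections) is omitted.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_correct(det, gt, threshold=0.5):
    """A detection counts as correct if it overlaps the ground truth enough."""
    return temporal_iou(det, gt) >= threshold


print(temporal_iou((0, 10), (5, 15)))  # overlap 5, union 15
print(is_correct((0, 10), (2, 10)))    # overlap 8, union 10
```

At this threshold a detection must cover at least half of the combined extent of itself and the ground-truth instance, so a loosely placed interval does not count as correct.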
