

  1. Localization with Spatio-Temporal Selective Search and SPPnet. Ryosuke Yamamoto, Nakamasa Inoue, Koichi Shinoda, Tokyo Institute of Technology. TRECVID 2015.

  2. Outline
     ● Previous works
       – Selective Search
       – Spatial Pyramid Pooling (SPP) net
     ● Our methods
       1. Spatio-Temporal Selective Search
       2. Multi-Frame Score Fusion
       3. Neighbor-Frame Score Boosting
     ● Experiments, results, and conclusion

  3. Selective Search
     ● Selective Search produces a large number of object region proposals from an image
       – Uses several grouping strategies; the proposals include many useless regions
     [Figure from: J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders, "Selective search for object recognition," IJCV, vol. 104, pp. 154-171, 2013]
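Not the authors' code, but as a point of reference, OpenCV's contrib module ships a Selective Search implementation. A minimal sketch, assuming opencv-contrib-python is installed and "frame.jpg" is a placeholder keyframe:

```python
# Minimal sketch: Selective Search region proposals with OpenCV (opencv-contrib-python).
import cv2

img = cv2.imread("frame.jpg")                      # any BGR keyframe (placeholder path)

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchQuality()                # combine many grouping strategies

rects = ss.process()                               # (x, y, w, h) boxes, typically thousands
print(f"{len(rects)} region proposals")
for (x, y, w, h) in rects[:100]:                   # draw only the first proposals for speed
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
```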

  4. Spatial Pyramid Pooling (SPP) net
     ● An efficient method to extract CNN scores from a large number of object regions of an image
       – CNN layers shared among all regions
       – SVMs computed for each region
       – Selective Search is used for region proposals
     [Diagram: region proposals by Selective Search [2] → shared CNN → SPP layer → FC → ReLU → FC → ReLU → SVM → score, per region]
     K. He, X. Zhang, S. Ren, J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1904-1916, 2015
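A minimal PyTorch sketch of a spatial pyramid pooling layer of the kind described in the SPPnet paper: each region's feature map is pooled at several fixed grid sizes and concatenated, so regions of arbitrary size yield a fixed-length vector. The pyramid levels (1, 2, 4) and the tensor shapes are illustrative assumptions, not the configuration used in this system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pool a feature map at several grid sizes and concatenate the results."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                                 # x: (N, C, H, W) region feature maps
        pooled = []
        for k in self.levels:
            p = F.adaptive_max_pool2d(x, output_size=k)   # (N, C, k, k)
            pooled.append(p.flatten(start_dim=1))         # (N, C*k*k)
        return torch.cat(pooled, dim=1)                   # fixed length regardless of H, W

# e.g. a 256-channel conv feature map of any spatial size -> 256*(1+4+16) = 5376 dims
spp = SpatialPyramidPooling()
feat = torch.randn(8, 256, 13, 17)
print(spp(feat).shape)                                    # torch.Size([8, 5376])
```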

  5. Spatio-Temporal Region Proposals (1)
     ● Selective Search with region proposals extended along the temporal dimension
       – Produces temporally continuous regions
       – The proposals contain a large number of meaningless regions
       – Since computation time is limited, each video is split at I-frames and segmented piece by piece
     [Figure: image pixels extended to video voxels, with edges weighted by similarity across space and time]
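The slides do not give the grouping code, but a hedged sketch of the voxel-graph idea could look like the following: neighboring pixels within a frame and the same pixel position across consecutive frames are linked by edges weighted with color similarity, which would then feed a graph-based grouping step (e.g. Felzenszwalb-Huttenlocher style segmentation). The function name voxel_edges and the (T, H, W, 3) layout are assumptions for illustration.

```python
import numpy as np

def voxel_edges(video):
    """video: (T, H, W, 3) float array. Returns (weight, voxel_a, voxel_b) edges."""
    T, H, W, _ = video.shape
    edges = []

    def vid(t, y, x):                       # linear voxel index
        return (t * H + y) * W + x

    for t in range(T):
        for y in range(H):
            for x in range(W):
                here = video[t, y, x]
                # spatial neighbors (right, down) within the same frame
                if x + 1 < W:
                    edges.append((np.linalg.norm(here - video[t, y, x + 1]),
                                  vid(t, y, x), vid(t, y, x + 1)))
                if y + 1 < H:
                    edges.append((np.linalg.norm(here - video[t, y + 1, x]),
                                  vid(t, y, x), vid(t, y + 1, x)))
                # temporal neighbor: the same pixel position in the next frame
                if t + 1 < T:
                    edges.append((np.linalg.norm(here - video[t + 1, y, x]),
                                  vid(t, y, x), vid(t + 1, y, x)))
    return edges
```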

  6. Spatio-Temporal Region Proposals (2)
     [Figure: nested regions along the hierarchy axis and the time axis]
     ● Regions are hierarchical and temporally continuous

  7. Multi-Frame Score Fusion (1)
     ● Basic idea
       – Some frames contain noise or object deformation that makes detection harder
       – The ST-Region Proposals contain many meaningless regions
       ➔ Information from neighboring frames provides robustness
     ● Fuse feature maps across several frames
       – This requires the region proposals to be temporally continuous
       – ST-Region Proposals are therefore adopted

  8. Multi-Frame Score Fusion (2)
     – In experiments, we concluded that late fusion is the best
     [Diagram: an I-frame and its neighboring P-frames are each processed by CNN → SPP → FC → ReLU → FC → ReLU → SVM; the per-frame outputs are fused into a single score, with candidate fusion points after the FC layers and after the SVM. FC: fully connected layer, SPP: spatial pyramid pooling layer, ReLU: rectified linear unit]
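A minimal sketch of the late-fusion step, assuming each spatio-temporal region already has one classifier score per neighboring frame. The slides only state that late (score-level) fusion worked best, so the mean/max operators below are placeholders rather than the authors' exact choice.

```python
import numpy as np

def fuse_scores(frame_scores, op="mean"):
    """frame_scores: per-frame SVM scores for the same ST-region across neighboring frames."""
    s = np.asarray(frame_scores, dtype=float)
    return {"mean": s.mean(), "max": s.max()}[op]

# the P-frames around an I-frame vote on the region's final score
print(fuse_scores([0.31, 0.58, 0.47]))   # averages the three per-frame scores
```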

  9. Neighbor-Frame Score Boosting
     ● Basic idea
       – Based on the same idea as the score fusion above
       – Objects appear in several consecutive frames
       ➔ Information from neighboring frames provides robustness
     ● Boost the scores of I-frames that lie between positive detections by increasing them by a constant
     [Figure: a sequence of I-frames over time; the I-frames between two positives are boosted]
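A hedged sketch of the boosting rule as stated on this slide: I-frames sitting between positive detections get a constant added to their score. The threshold, the boost constant, and the exact neighborhood rule are assumptions; the slides do not give the values used.

```python
def boost_scores(scores, threshold=0.5, boost=0.1):
    """Add a constant to I-frame scores that lie between positive neighbors (placeholder values)."""
    positive = [s >= threshold for s in scores]
    boosted = list(scores)
    for i in range(1, len(scores) - 1):
        # an I-frame sandwiched between positive neighbors gets its score raised
        if positive[i - 1] and positive[i + 1] and not positive[i]:
            boosted[i] += boost
    return boosted

print(boost_scores([0.7, 0.45, 0.8, 0.2, 0.1]))   # the 0.45 frame becomes 0.55
```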

  10. Experiments – Manual Annotations
      ● Airplane, Boat_Ship, Bridges, Bus, Motorcycle, Telephones, Flags, Quadruped – annotations provided
      ● Anchorperson – annotated 12k I-frames
      ● Computers – annotated 7k I-frames

  11. Experiments – Training
      ● Deciding the threshold and the fusion method
        – Used last year's dataset and concepts
        – Train: IACC_2_A
        – Val: IACC_2_B
      ● Submitted runs
        – Train: IACC_2_A (including additional annotations) and IACC_2_B
        – Test: IACC_2_C

  12. Results
      ● Multi-Frame Score Fusion and Neighbor-Frame Score Boosting improved the score
      ● We achieved 3rd place among all teams in the harmonic mean of F-scores

      Harmonic mean of F-scores
      Run ID         Method                                            Val      Test
      (Base)         Selective Search + SPPnet                         0.4481   0.5656
      Multiple       + ST-Region Proposals, Multi-Frame Score Fusion   0.4518   0.5716
      Multiple_Aug3  + Neighbor-Frame Score Boosting                   0.4569   0.5750

  13. Results
      ● Multi-Frame Score Fusion and Neighbor-Frame Score Boosting improved the score
      ● We achieved 3rd place among all teams in the harmonic mean of F-scores
      [Chart: per-team comparison of I-frame F-score, mean pixel F-score, and harmonic mean]

  14. Results – Examples
      ● Sometimes better than the ground truth (GT)
      [Figure: example frames comparing system output with ground truth]

  15. Results – Spatial Score
      ● We achieved 1st place in mean pixel F-score by throttling the number of returned positives to reduce false positives
        – As a trade-off, the I-frame F-score is not good
      [Chart: mean pixel F-score and I-frame F-score for runs Single2, Multiple, Multiple_Aug3, Multiple_Spat]
      ● The mean pixel F-score is calculated only over true-positive and false-positive I-frames, which makes it unintuitive
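For reference, a small sketch of how the two scores above combine into the ranking metric used in the results: the run-level score appears to be the harmonic mean of the I-frame (temporal) F-score and the mean pixel (spatial) F-score, so trading positives for pixel precision moves the two components in opposite directions. The precision/recall numbers below are illustrative only.

```python
def f_score(precision, recall):
    """Standard F1 score."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0

iframe_f = f_score(precision=0.55, recall=0.60)   # temporal localization quality (illustrative)
pixel_f = f_score(precision=0.70, recall=0.65)    # spatial, pixel-level quality (illustrative)
print(round(harmonic_mean(iframe_f, pixel_f), 4))
```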

  16. Conclusion
      ● We developed a localization system using ST-Region Proposals and a CNN with SPPnet
      ● Multi-Frame Score Fusion with ST-Region Proposals and Neighbor-Frame Score Boosting improved the score
      ● Problem: the detection results strongly depend on the quality of the ST-Region Proposals
        – Improve the quality of the ST-Region Proposals
        – Localization without region candidates
