Beyond Detection: Towards Multi-Object Tracking and Segmentation - PowerPoint PPT Presentation

Beyond Detection: Towards Multi-Object Tracking and Segmentation. Andreas Geiger, Autonomous Vision Group, University of Tübingen / MPI for Intelligent Systems. June 17, 2018.


  1. Beyond Detection: Towards Multi-Object Tracking and Segmentation. Andreas Geiger, Autonomous Vision Group, University of Tübingen / MPI for Intelligent Systems. June 17, 2018

  2. MOTS: Multi-Object Tracking and Segmentation [Voigtlaender, Krause, Osep, Luiten, Sekar, Geiger & Leibe, CVPR 2019]

  5. Motivation
     ◮ Datasets for multi-object tracking
       ◮ MOTChallenges
         ◮ MOT15 [Leal-Taixe et al., 2015]
         ◮ MOT16, MOT17 [Milan et al., 2016]
         ◮ CVPR19 [Dendorfer et al., 2019]
       ◮ KITTI Tracking [Geiger et al., 2012]
       ◮ VisDrone2018 [Zhu et al., 2018]
       ◮ DukeMTMC [Ristani et al., 2016]
       ◮ UA-DETRAC [Wen et al., 2015]
       ◮ ...
     ◮ Led to great progress in the community
     ◮ But annotations are only at the bounding-box level

  6. Are bounding boxes enough?

  7. Object Tracking vs. Segmentation
     ◮ In difficult cases, bounding boxes are a very coarse approximation
     ◮ Most pixels of the bounding box belong to other objects

  8. Two Communities: Object Tracking and Semantic Segmentation / Instance Segmentation

  9. Can we unite the two?

  10. MOTS: Multi-Object Tracking and Segmentation
      ◮ Dense pixel-wise annotations are tedious, hard work ... but we did it!
      [Figure: KITTI MOTS]

  11. MOTS: Multi-Object Tracking and Segmentation
      ◮ Dense pixel-wise annotations are tedious, hard work ... but we did it!
      [Figure: MOTSChallenge]

  12. MOTS: Multi-Object Tracking and Segmentation
      ◮ How? 4 student assistants & semi-automatic annotation procedure

                                        KITTI MOTS  MOTSChallenge
                                     train     val          train
      # Sequences                       12       9              4
      # Frames                       5,027   2,981          2,862
      # Tracks Pedestrian               99      68            228
      # Masks Pedestrian (total)     8,073   3,347         26,894
      # Masks Pedestrian (annot.)    1,312     647          3,930
      # Tracks Car                     431     151              -
      # Masks Car (total)           18,831   8,068              -
      # Masks Car (annot.)           1,509     593              -

  13. Data Annotation

  17. Data Annotation
      ◮ Starting point: existing box-level tracking annotations
      ◮ Fully convolutional network (FCN) converts bounding boxes to segmentation masks
      ◮ First, 2 instances per track are manually annotated
      ◮ However, the trained segmentation model will not be perfect
      ◮ Repeat until annotations are good:
        1. Annotators fix worst errors with polygon annotations
        2. Add new annotations to training set of FCN
        3. Re-train FCN (pre-train on all, fine-tune per object)
           ⇒ Allows for adaptation to appearance and context of each object
        4. Re-generate masks using FCN
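The refinement loop on this slide can be sketched as a small driver function. This is only an illustration under stated assumptions: `train_fcn`, `predict_masks`, `fix_worst`, and `is_good` are hypothetical callables standing in for FCN training, mask generation, the annotators' polygon corrections, and the stopping check; none of them come from the talk, and `max_rounds` is an added safety limit.

```python
def refine_annotations(boxes, manual_masks, train_fcn, predict_masks,
                       fix_worst, is_good, max_rounds=10):
    """Iteratively refine masks generated from box-level annotations.

    All callables are hypothetical placeholders (assumptions):
      train_fcn(training_set)     -> model
      predict_masks(model, boxes) -> dict mapping box id to mask
      fix_worst(masks)            -> dict of manually corrected masks
      is_good(masks)              -> True when annotations are acceptable
    """
    # Start from the initial manual annotations (2 instances per track in the talk).
    training_set = dict(manual_masks)
    model = train_fcn(training_set)
    masks = predict_masks(model, boxes)

    for _ in range(max_rounds):
        if is_good(masks):
            break
        corrections = fix_worst(masks)        # 1. annotators fix the worst errors
        training_set.update(corrections)      # 2. add them to the training set
        model = train_fcn(training_set)       # 3. re-train the network
        masks = predict_masks(model, boxes)   # 4. re-generate all masks
    return masks, training_set
```

Passing the steps in as callables keeps the sketch independent of any particular network or annotation tool.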

  19. Data Annotation
      ◮ Manual corrections ensure consistency and high quality
      ◮ Large savings in annotation time
        ◮ KITTI MOTS: only 13% of car boxes / 17% of pedestrian boxes manually annotated
        ◮ MOTSChallenge: 15% of pedestrian boxes manually annotated

  20. Evaluation Metrics

  23. Evaluation Metrics
      ◮ We consider mask-based variants of the CLEAR MOT metrics [Bernardin and Stiefelhagen, 2008]
      ◮ Need to associate predictions to ground truth instances
      ◮ Box-based tracking: boxes might overlap
        ◮ Requires bipartite matching
      ◮ Mask-based tracking: masks are disjoint
        ◮ Establishing correspondences is greatly simplified
        ◮ Hypothesized and ground truth masks are matched iff mask IoU > 0.5
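To make the mask-based matching rule concrete, here is a minimal sketch. It assumes masks are represented as Python sets of (row, col) pixel coordinates, which is an illustrative choice and not the format used by the benchmark; the function names are hypothetical. Because masks within one frame are disjoint, a hypothesis with IoU > 0.5 covers more than half of its ground truth mask, so at most one hypothesis can match each ground truth mask and no bipartite matching step is needed.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_masks(hypotheses, ground_truth, threshold=0.5):
    """Return {hypothesis_id: ground_truth_id} for pairs with IoU > threshold.

    Assumes the masks in `hypotheses` are pairwise disjoint, and likewise for
    `ground_truth`, so every mask has at most one partner above 0.5.
    """
    matches = {}
    for h_id, h_mask in hypotheses.items():
        for g_id, g_mask in ground_truth.items():
            if mask_iou(h_mask, g_mask) > threshold:
                matches[h_id] = g_id
                break  # unique by the disjointness argument
    return matches
```

Unmatched hypotheses would then count as false positives and unmatched ground truth masks as false negatives.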

  24. Evaluation Metrics
      (Soft) Multi-Object Tracking and Segmentation Accuracy / Precision:

      \[ \mathrm{MOTSA} = 1 - \frac{|FN| + |FP| + |IDS|}{|M|} = \frac{|TP| - |FP| - |IDS|}{|M|} \]
      \[ \mathrm{MOTSP} = \frac{\widetilde{TP}}{|TP|} \qquad \mathrm{sMOTSA} = \frac{\widetilde{TP} - |FP| - |IDS|}{|M|} \qquad \widetilde{TP} = \sum_{h \in TP} \mathrm{IoU}(h, c(h)) \]

      ◮ c: mapping from hypotheses to ground truth
      ◮ TP: true positives, \widetilde{TP}: soft number of true positives
      ◮ FN: false negatives, FP: false positives, IDS: ID switches
      ◮ M: set of ground truth segmentation masks
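The three metrics on this slide reduce to simple arithmetic once matching is done. The sketch below assumes the caller already has the IoU of every true positive plus the false positive and ID switch counts; the function name and argument names are hypothetical, and returning 0.0 for MOTSP when there are no true positives is an assumption, since the ratio is undefined in that case.

```python
def mots_metrics(tp_ious, num_fp, num_ids, num_gt_masks):
    """Compute MOTSA, MOTSP and sMOTSA.

    tp_ious:      list with IoU(h, c(h)) for every true positive h
    num_fp:       number of false positives |FP|
    num_ids:      number of ID switches |IDS|
    num_gt_masks: number of ground truth masks |M| (assumed > 0)
    """
    tp = len(tp_ious)        # |TP|
    soft_tp = sum(tp_ious)   # soft number of true positives
    return {
        "MOTSA": (tp - num_fp - num_ids) / num_gt_masks,
        "MOTSP": soft_tp / tp if tp else 0.0,  # assumption: 0.0 when |TP| = 0
        "sMOTSA": (soft_tp - num_fp - num_ids) / num_gt_masks,
    }
```

For example, three true positives with IoUs 0.9, 0.8 and 0.7, one false positive, no ID switches and four ground truth masks give MOTSA = 0.5, MOTSP = 0.8 and sMOTSA = 0.35 (up to floating-point rounding).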

  25. TrackR-CNN Baseline

  26. TrackR-CNN
      [Figure: TrackR-CNN architecture: feature extraction on frames t-1, t, t+1 (shared weights), 2x 3D conv (temporally enhanced image features), region proposal network, classification / bounding box regression / mask generation / association embedding (128-D association vectors); losses against ground truth during training, online track association with previously tracked objects during evaluation]
      Key Idea:
      ◮ Detection, segmentation, and data association with a single ConvNet
      ◮ Extend Mask R-CNN by 3D convolutions and association head

  29. TrackR-CNN
      Association Head:
      ◮ Predict association vector for each detection
      ◮ Detections of same instance should be close in embedding space
      ◮ Detections of distinct instances should be distant from each other

  31. TrackR-CNN
      Training:
      ◮ Learned using batch-hard triplet loss [Hermans et al., 2017]:

        \[ \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \max\Big( \max_{e \in \mathcal{D}:\, id_e = id_d} \lVert a_e - a_d \rVert_2 - \min_{e \in \mathcal{D}:\, id_e \neq id_d} \lVert a_e - a_d \rVert_2 + \alpha,\; 0 \Big) \]

        where a_d is the association vector of detection d and D is the set of detections in the mini-batch
      ◮ Mini-batch: 8 consecutive frames
      ◮ Mine furthest detection of same instance and closest detection of other instance
      ◮ Require separation by at least margin α
      Inference:
      ◮ Associate detections over time based on Euclidean distance in embedding space and bipartite graph matching
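As a concrete reading of the batch-hard triplet loss on this slide, the following plain-Python sketch computes it for a small batch of association vectors. It is an illustration, not the TrackR-CNN training code: the function names, the default margin of 0.2, and the choice to contribute zero for a detection with no other instance in the batch are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, ids, margin=0.2):
    """Batch-hard triplet loss over one mini-batch of detections.

    embeddings: list of association vectors, one per detection
    ids:        track id of each detection
    margin:     alpha in the formula (0.2 is an arbitrary illustrative value)
    """
    n = len(embeddings)
    total = 0.0
    for d in range(n):
        # Distances to detections of the same instance (includes d itself, distance 0).
        pos = [euclidean(embeddings[e], embeddings[d])
               for e in range(n) if ids[e] == ids[d]]
        # Distances to detections of other instances.
        neg = [euclidean(embeddings[e], embeddings[d])
               for e in range(n) if ids[e] != ids[d]]
        if not neg:
            continue  # assumption: no other instance in the batch, contribute 0
        # Hardest positive (furthest) versus hardest negative (closest).
        total += max(max(pos) - min(neg) + margin, 0.0)
    return total / n if n else 0.0
```

The loss is zero once, for every detection, the closest detection of another instance is at least `margin` further away than the furthest detection of the same instance.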

  32. Experimental Evaluation

  33. Results of TrackR-CNN on MOTSChallenge
      ◮ Crowded scenes can lead to missing detections and ID switches

  41. Results of TrackR-CNN on KITTI MOTS
      ◮ Most objects are distinguished well, but some erroneous detections remain (red)

  45. Results of TrackR-CNN on KITTI MOTS
      ◮ Continuation of track with same ID after missing detection (red)
