Instance segmentation


Instance segmentation — CV3DST | Prof. Leal-Taixé. Semantic segmentation: label every pixel, including the background (sky, grass, road), but do not differentiate between pixels coming from different instances of the same class.


  1. YOLACT: idea. 1) Generate mask prototypes. 2) Generate mask coefficients.

  2. YOLACT: idea. 1) Generate mask prototypes. 2) Generate mask coefficients. 3) Combine (1) and (2).

  3. YOLACT: backbone. ResNet-101; features computed at different scales.

  4. YOLACT: protonet. Generate k prototype masks. k is not the number of classes, but a hyperparameter.

  5. YOLACT: protonet. • A fully convolutional network (3x3 convs followed by a 1x1 conv), similar to the mask branch in Mask R-CNN. However, no loss function is applied at this stage.

  6. YOLACT: mask coefficients. Predict a coefficient for every prototype mask.

  7. YOLACT: mask coefficients. Per anchor box, predict one class, the box regression, and k coefficients (one per prototype mask). The network is similar to RetinaNet's heads, but shallower.

  8. YOLACT: mask assembly. 1. Take a linear combination of the mask prototypes weighted by the mask coefficients. 2. Predict the masks as M = σ(PCᵀ), where P is an (H×W×k) matrix of prototype masks, C is an (n×k) matrix of mask coefficients surviving NMS, and σ is a nonlinearity (sigmoid).
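The assembly step above is a single matrix product followed by a sigmoid. A minimal numpy sketch (not the authors' code; shapes and names are illustrative):

```python
import numpy as np

def assemble_masks(prototypes, coefficients):
    """Combine k prototype masks (H, W, k) with n sets of mask
    coefficients (n, k) into n instance masks of shape (H, W, n)."""
    # Linear combination over the prototype dimension: P @ C^T
    logits = np.tensordot(prototypes, coefficients, axes=([2], [1]))
    # Sigmoid nonlinearity squashes the logits into [0, 1] mask values
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example: 2 prototypes on a 4x4 grid, 3 detections surviving NMS
P = np.random.randn(4, 4, 2)
C = np.random.randn(3, 2)
M = assemble_masks(P, C)
print(M.shape)  # (4, 4, 3)
```

Because the combination is linear, the whole assembly for all detections is one tensor contraction, which is what makes YOLACT fast.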

  9. YOLACT: loss function. Cross-entropy between the assembled masks and the ground truth, in addition to the standard losses (bounding-box regression and classification of the object/mask class).

  10. YOLACT: qualitative results.

  11. YOLACT: qualitative results. For large objects, the quality of the masks is even better than that of two-stage detectors.

  12. So, which segmenter to use? YOLACT.

  13. YOLACT: improvements. • A specially designed version of NMS, to make the procedure faster. • An auxiliary semantic segmentation loss applied to the final features of the FPN; this module is not used at inference. D. Bolya et al., "YOLACT++: Better Real-Time Instance Segmentation", arXiv:1912.06218, 2019.
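The faster NMS variant in the YOLACT papers ("Fast NMS") lets already-suppressed boxes still suppress others, so the whole procedure becomes one matrix operation. A hedged numpy sketch of that idea (simplified; box format and threshold are assumptions):

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise IoU for boxes given as an (N, 4) array of x1, y1, x2, y2."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def fast_nms(boxes, scores, iou_thresh=0.5):
    """Keep a box unless some higher-scoring box overlaps it too much.
    Unlike sequential NMS, suppressed boxes may still suppress others,
    which makes the whole step parallelizable."""
    order = np.argsort(-scores)             # sort by descending score
    iou = np.triu(iou_matrix(boxes[order]), k=1)  # higher- vs lower-scored pairs
    max_iou = iou.max(axis=0)               # best overlap with any higher-scored box
    return order[max_iou <= iou_thresh]

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(fast_nms(boxes, scores))  # [0 2] -> box 1 overlaps box 0 heavily
```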

  14. Panoptic segmentation.

  15. Panoptic segmentation = semantic segmentation + instance segmentation.

  16. Panoptic segmentation = semantic segmentation (FCN-like) + instance segmentation (Mask R-CNN).

  17. Panoptic segmentation: semantic segmentation (FCN-like) + instance segmentation (Mask R-CNN) = panoptic segmentation (UPSNet).

  18. Panoptic segmentation. It gives labels to uncountable objects called "stuff" (sky, road, etc.), similar to FCN-like networks, and it differentiates between pixels coming from different instances of the same class (countable objects) called "things" (cars, pedestrians, etc.).

  19. Panoptic segmentation. Problem: some pixels might be classified as stuff by the FCN network while at the same time being classified as instances of some class by Mask R-CNN (conflicting results)!

  20. Panoptic segmentation. Solution: a parameter-free panoptic head that combines the information from the FCN and Mask R-CNN to give the final predictions. Xiong et al., "UPSNet: A Unified Panoptic Segmentation Network", CVPR 2019.

  21. Network architecture: shared features, separate heads, putting it together.

  22. Network architecture: shared features, separate heads, putting it together.

  23. The semantic head. As with all semantic heads: a fully convolutional network. New: deformable convolutions!

  24. Recall: dilated (atrous) convolutions (2D). (a) With dilation parameter 1, each element produced by the filter has a 3x3 receptive field. (b) With dilation parameter 2, each element produced has a 7x7 receptive field. (c) With dilation parameter 4, each element produced has a 15x15 receptive field.
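The receptive-field sizes on the slide come from stacking 3x3 layers with increasing dilation; each stride-1 layer with dilation d grows the receptive field by (kernel − 1) · d. A quick arithmetic check:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of stride-1 convolutions:
    each layer with dilation d adds (kernel - 1) * d pixels."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# The three panels correspond to stacking layers with dilations 1, 2, 4:
print(receptive_field([1]))        # 3  -> 3x3 receptive field
print(receptive_field([1, 2]))     # 7  -> 7x7
print(receptive_field([1, 2, 4]))  # 15 -> 15x15
```

Doubling the dilation at every layer therefore grows the receptive field exponentially while the parameter count stays linear.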

  25. Deformable convolutions: a generalization of dilated convolutions where the offsets are learned.

  26. Deformable convolutions.

  27. Deformable convolutions. The deformable convolution picks the values at different locations for the convolution, conditioned on the input image or feature maps.
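Since the learned offsets are fractional, the values at the shifted sampling locations are read with bilinear interpolation. A minimal single-channel sketch of one output value (illustrative only; real implementations, e.g. in torchvision, are batched and learn the offsets):

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly interpolate img at a (possibly fractional) location."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    wy, wx = y - y0, x - x0
    val = 0.0
    for yy, wy_ in ((y0, 1 - wy), (y0 + 1, wy)):
        for xx, wx_ in ((x0, 1 - wx), (x0 + 1, wx)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy_ * wx_ * img[yy, xx]
    return val

def deform_conv_pixel(img, kernel, offsets, cy, cx):
    """One output value of a 3x3 deformable convolution at (cy, cx).
    offsets has shape (3, 3, 2): a learned (dy, dx) per kernel tap."""
    out = 0.0
    for i, dy in enumerate((-1, 0, 1)):
        for j, dx in enumerate((-1, 0, 1)):
            oy, ox = offsets[i, j]
            out += kernel[i, j] * bilinear(img, cy + dy + oy, cx + dx + ox)
    return out

img = np.arange(25.0).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0    # averaging kernel, for illustration
offsets = np.zeros((3, 3, 2))     # zero offsets -> ordinary 3x3 convolution
print(deform_conv_pixel(img, kernel, offsets, 2, 2))  # 12.0, the 3x3 mean
```

With all offsets equal to zero this reduces exactly to a standard convolution; a fixed offset pattern reproduces dilation, which is why deformable convolutions generalize it.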

  28. The panoptic head. Mask logits come from the instance head; object logits come from the semantic head (e.g., car); stuff logits come from the semantic head (e.g., sky).

  29. The panoptic head. The object logits need to be masked by the instance masks. Mask logits come from the instance head; object logits come from the semantic head (e.g., car); stuff logits come from the semantic head (e.g., sky) and can be evaluated directly.

  30. The panoptic head. Perform a softmax over the panoptic logits. If the maximum value falls into the first (stuff) channels, the pixel belongs to one of the stuff classes; otherwise, the index of the maximum value tells us the instance ID the pixel belongs to. Read the paper for the details on how the unknown class is used. Xiong et al., "UPSNet: A Unified Panoptic Segmentation Network", CVPR 2019.
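Since softmax is monotone, the per-pixel decision reduces to an argmax over the concatenated logits. A small numpy sketch of that decision rule (the unknown-class handling from the paper is omitted):

```python
import numpy as np

def panoptic_decision(stuff_logits, instance_logits):
    """Fuse per-pixel stuff logits (H, W, N_stuff) with per-instance
    logits (H, W, N_inst) by a channel-wise argmax.  Returns
    (is_stuff, label): label is a stuff class id or an instance id."""
    n_stuff = stuff_logits.shape[-1]
    panoptic = np.concatenate([stuff_logits, instance_logits], axis=-1)
    idx = panoptic.argmax(axis=-1)   # softmax is monotone -> argmax suffices
    is_stuff = idx < n_stuff
    label = np.where(is_stuff, idx, idx - n_stuff)
    return is_stuff, label

# Toy 1x2 image: 2 stuff classes, 2 instances
stuff = np.array([[[2.0, 0.1], [0.2, 0.3]]])
inst = np.array([[[0.5, 0.1], [0.1, 1.5]]])
is_stuff, label = panoptic_decision(stuff, inst)
print(is_stuff)  # [[ True False]]
print(label)     # [[0 1]]  -> pixel 0: stuff class 0; pixel 1: instance 1
```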

  31. Metrics.

  32. Panoptic quality. TP = true positive, FN = false negative, FP = false positive. • SQ (segmentation quality): how close the predicted segments are to the ground-truth segments (does not take bad predictions into account!).

  33. Panoptic quality. TP = true positive, FN = false negative, FP = false positive. • RQ (recognition quality): just as for detection, we want to know whether we are missing instances (FN) or predicting extra instances (FP).

  34. Panoptic quality. • As in detection, we have to match ground truth and predictions; here we have segment matching, measured by IoU between predicted and ground-truth segments. • A segment is matched if IoU > 0.5. No pixel can belong to two predicted segments, so the matching is unique.
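Putting the two factors together, PQ = SQ x RQ, with SQ the mean IoU over matched (TP) segments and RQ = |TP| / (|TP| + ½|FP| + ½|FN|). A small sketch of the computation, assuming the segment matching has already been done:

```python
import numpy as np

def panoptic_quality(tp_ious, fp, fn):
    """PQ from the IoUs of matched (TP) segments plus FP/FN counts.
    SQ = mean IoU over TP; RQ = TP / (TP + FP/2 + FN/2); PQ = SQ * RQ."""
    tp_ious = np.asarray(tp_ious, float)
    tp = len(tp_ious)
    sq = tp_ious.mean() if tp else 0.0
    rq = tp / (tp + 0.5 * fp + 0.5 * fn) if tp + fp + fn else 0.0
    return sq * rq, sq, rq

# Two matched segments (IoUs 0.8 and 0.6), one spurious and one missed segment
pq, sq, rq = panoptic_quality([0.8, 0.6], fp=1, fn=1)
print(sq)  # 0.7
print(rq)  # 0.666...
```

Note how SQ alone ignores the bad predictions (as the slide warns): the FP and FN only enter through RQ.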

  35. Panoptic segmentation: qualitative results.

  36. Panoptic segmentation: qualitative results.

  37. Object instance segmentation as voting.

  38. Sliding-window approach (DPM, R-CNN families). • Densely enumerate box proposals and classify them. • A tremendously successful, very well-engineered paradigm. • SOTA methods are still based on this paradigm.

  39. Generalized Hough transform. Before the dominance of DPM and R-CNN: detection as voting.

  40. Hough voting. • Detect analytical shapes (e.g., lines) as peaks in the dual parametric space. • Each pixel casts a vote in this dual space. • Detect peaks and "back-project" them to image space.

  41. Example: line detection. • Each edge point in image space casts a vote.

  42. Example: line detection. • Each edge point in image space casts a vote. • The vote takes the form of a line (in parameter space) crossing through that point.

  43. Example: line detection. • Accumulate votes from different points in the (discretized) parameter space. • Read out the maxima (peaks) from the accumulator.

  44. Object detection as voting. • Idea: objects are detected as consistent configurations of the observed parts (visual words). Leibe et al., "Robust Object Detection with Interleaved Categorization and Segmentation", IJCV 2008.

  45. Object detection. • Training: interest point detection (SIFT, SURF), center-point voting. Leibe et al., "Robust Object Detection with Interleaved Categorization and Segmentation", IJCV 2008.

  46. Object detection. • Inference (test time).

  47. Back to the future. • Back to 2020… • We can use pixel consensus voting for panoptic segmentation (CVPR 2020).

  48. Overview. The instance voting branch predicts, for every pixel, whether the pixel is part of an instance mask and, if so, the relative location of the instance mask centroid. H. Wang et al., "Pixel Consensus Voting for Panoptic Segmentation", CVPR 2020.

  49. In a nutshell. 1. Discretize the region around each pixel. 2. Every pixel votes for a centroid (or no centroid, for "stuff") over a set of grid cells.

  50. In a nutshell. 3. The vote probabilities at each pixel are cast into the accumulator space via (dilated) transposed convolutions. 4. Objects are detected as "peaks" in the accumulator space.

  51. In a nutshell. 5. Back-project the "peaks" to the image to get the instance masks. 6. Category information is provided by the parallel semantic segmentation head.

  52. Voting lookup table. • Discretize the region around the pixel: M × M cells converted into K = 17 indices.

  53. Voting lookup table. • The vote should be cast to the center (the red pixel), which corresponds to position 16.

  54. Voting. • At inference, the instance voting branch outputs a tensor of size [H, W, K+1]. • Votes are accumulated softly in the voting accumulator. How? Example: for the blue pixel, we get a vote for index 16 with probability 0.9 (softmax output). • Transfer the 0.9 to cell 16 — (dilated) transposed convolution. • Distribute it evenly among the pixels of the cell, each getting 0.1 — average pooling.

  55. Transposed convolutions. • Take a single value in the input. • Multiply it with a kernel and distribute the result in the output map. • The kernel defines how much of the input value is distributed to each of the output cells. • For the purpose of vote aggregation, however, the kernel parameters are fixed to be one-hot across each channel, marking the target location.

  56. Voting: implementation. • Output tensor: [H, W, K+1]. • Example: 9 inner and 8 outer bins, K = 17. • Split the output tensor into two tensors: [H, W, 9] and [H, W, 8]. • Apply two transposed convolutions, with kernels of size [3, 3, 9] (stride 1) and [3, 3, 8] (stride 3). • The kernel parameters are pre-fixed: one-hot across each channel, marking the target location. • Dilation spreads the votes to the outer ring. • Smooth the votes evenly via average pooling.
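For the 9 inner bins, the one-hot transposed convolution above is just a scatter-add of each pixel's vote mass to the voted-for neighbor. A simplified numpy sketch (inner ring only; the row-major bin layout is an assumption, and the outer-ring dilation and average pooling are omitted):

```python
import numpy as np

def scatter_votes_inner(votes):
    """Scatter the 9 inner-bin vote probabilities (H, W, 9) into a
    spatial accumulator.  Bin k of a pixel corresponds to the neighbor
    at offset (k // 3 - 1, k % 3 - 1); a one-hot transposed convolution
    performs exactly this scatter-add."""
    h, w, _ = votes.shape
    acc = np.zeros((h, w))
    for k in range(9):
        dy, dx = k // 3 - 1, k % 3 - 1   # assumed row-major bin layout
        for y in range(h):
            for x in range(w):
                ty, tx = y + dy, x + dx  # voted-for centroid location
                if 0 <= ty < h and 0 <= tx < w:
                    acc[ty, tx] += votes[y, x, k]
    return acc

# Toy 3x3 example: every pixel votes for the image center with probability 1
votes = np.zeros((3, 3, 9))
for y in range(3):
    for x in range(3):
        k = (1 - y + 1) * 3 + (1 - x + 1)   # bin index pointing at (1, 1)
        votes[y, x, k] = 1.0
acc = scatter_votes_inner(votes)
print(acc[1, 1])  # 9.0 -> a clear peak at the consensus centroid
```

All nine pixels agree on the same centroid, so the accumulator shows a single sharp peak, which is then detected and back-projected to recover the instance mask.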
