Batch construction and multitask learning in visual relationship recognition

  1. Batch construction and multitask learning in visual relationship recognition. Shane Josias (Stellenbosch University, CAIR; josias@sun.ac.za) and Willie Brink (Stellenbosch University; wbrink@sun.ac.za). 30 January 2020.

  2. Visual relationship recognition. Task: produce a (subject, predicate, object) triplet given an image. Example (visual relationship / scene graph): subject: boy; predicate: on top of; object: surfboard.

  3. Challenges. Combinatorial: with 100 subject, 70 predicate, and 100 object labels there are 700,000 possible relationships. Data distribution: typically long-tailed, making it difficult to learn rare relationships. [Figure: three histograms of the number of instances per subject, predicate, and object label index.]

  4. Our approach. Treat visual relationship recognition (VRR) as a classification problem. Input: an image, cropped around a pair of objects. Output: a (subject, predicate, object) triplet. Three tasks: predict the subject, predict the predicate, and predict the object. This avoids predicting over 700,000 classes. Obtain normalised scores over the classes in each task, then combine the scores through multiplication.
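The score combination on slide 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `softmax` and `top_k_triplets`, the toy label sets, and the raw scores are all hypothetical. A softmax is assumed as the normalisation (the slides do not say which one was used). The score of a triplet is the product of its three per-task scores, so a model needs only 100 + 70 + 100 = 270 outputs rather than 700,000.

```python
import heapq
import itertools
import math


def softmax(scores):
    """Normalise raw scores so they are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_triplets(subj, pred, obj, k):
    """Return the k highest-scoring (subject, predicate, object) triplets.

    Each argument maps label -> normalised score for one task. A triplet's
    score is the product of its three per-task scores.
    """
    combined = (
        ((s, p, o), subj[s] * pred[p] * obj[o])
        for s, p, o in itertools.product(subj, pred, obj)
    )
    return heapq.nlargest(k, combined, key=lambda item: item[1])


# Toy example with hypothetical labels and raw scores (assumptions).
subjects = dict(zip(["person", "horse"], softmax([2.0, 0.5])))
predicates = dict(zip(["on", "ride", "wear"], softmax([1.5, 1.0, -1.0])))
objects = dict(zip(["horse", "street"], softmax([2.5, 0.0])))

best = top_k_triplets(subjects, predicates, objects, k=3)
for triplet, score in best:
    print(triplet, round(score, 3))
```

Because each task's scores sum to 1, the products over all triplets also sum to 1, so the combined scores can be ranked directly when computing R@k.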

  5. Single-task learning with standard batching. [Figure: three separate networks, one per task. In each, the input image passes through a ResNet-18 conv. base and three FC layers (2,048), giving output scores over subjects, predicates, or objects respectively.]

  6. Class-selective batch construction. Select n classes from a vocabulary of N classes, uniformly at random. Sample m instances from each selected class, uniformly at random. [Figure: from the example vocabulary truck, shirt, sky, building, table, person, the classes shirt, building, and person are selected, and the batch is made up of instances containing shirt, instances containing building, and instances containing person.]
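The two sampling steps on slide 6 can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the function name `class_selective_batch`, the `instances_by_class` mapping, and the toy data are hypothetical. The slides do not say how classes with fewer than m instances are handled; sampling with replacement is assumed here.

```python
import random


def class_selective_batch(instances_by_class, n, m, rng=random):
    """Build one batch of n * m (label, instance) pairs.

    instances_by_class maps each class label to the list of training
    instances that contain that label.
    """
    # Step 1: select n of the N classes uniformly at random, without repeats.
    chosen = rng.sample(sorted(instances_by_class), n)

    batch = []
    for label in chosen:
        pool = instances_by_class[label]
        if len(pool) >= m:
            # Step 2: sample m distinct instances uniformly at random.
            picked = rng.sample(pool, m)
        else:
            # Assumption: rare classes are sampled with replacement so that
            # every selected class still contributes m instances.
            picked = [rng.choice(pool) for _ in range(m)]
        batch.extend((label, instance) for instance in picked)
    return batch


# Toy, long-tailed data (hypothetical): "person" dominates, "truck" is rare.
data = {
    "person": [f"img_{i}" for i in range(1000)],
    "shirt": [f"img_{i}" for i in range(200)],
    "building": [f"img_{i}" for i in range(50)],
    "truck": [f"img_{i}" for i in range(2)],
}
batch = class_selective_batch(data, n=3, m=4, rng=random.Random(0))
print(len(batch))  # 12: every selected class contributes 4 instances
```

Every selected class contributes the same number of instances, so a rare class is as likely to fill part of a batch as a dominant one.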

  7. Multitask learning. [Figure: a single network for all three tasks. The input image passes through one ResNet-18 conv. base and FC layers (2,048), which branch into three outputs: scores over subjects, scores over predicates, and scores over objects.]

  8. VRD dataset (Lu et al., ECCV 2016). 5,000 images and 37,987 visual relationships, but only 15,448 unique relationships. 100 labels for both subjects and objects, and 70 predicate labels in five categories. One example per category:
     action verb: (person, kick, ball)
     spatial: (person, on top of, ramp)
     preposition: (motorcycle, with, wheel)
     comparative: (elephant, taller than, person)
     non-action verb: (person, wear, shirt)

  9. Metrics. MPCA: mean per-class accuracy; used to measure performance on rare classes in the individual tasks. R@k: recall-at-k; the percentage of times the correct label occurs in the top k predictions (when ordered by output scores). Tail R@k: R@k measured on visual relationship classes that have fewer than 1,000 samples for subject, predicate, and object labels.
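The first two metrics on slide 9 can be sketched as below. This is an illustration, assuming each example's predictions are given as a list of labels ordered by decreasing score; the function names and toy data are hypothetical, and the functions return fractions rather than percentages. Tail R@k would apply the same `recall_at_k` to the subset of test relationships whose labels have fewer than 1,000 samples.

```python
from collections import defaultdict


def mean_per_class_accuracy(true_labels, predicted_labels):
    """Average the accuracy computed separately for each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    return sum(correct[c] / total[c] for c in total) / len(total)


def recall_at_k(true_labels, ranked_predictions, k):
    """Fraction of examples whose true label is among the top k predictions.

    ranked_predictions[i] lists labels for example i, best score first.
    """
    hits = sum(
        truth in ranked[:k]
        for truth, ranked in zip(true_labels, ranked_predictions)
    )
    return hits / len(true_labels)


# Toy data (hypothetical): "on" is frequent, "feed" is rare.
truth = ["on", "on", "on", "feed"]
ranked = [
    ["on", "wear"],
    ["on", "ride"],
    ["wear", "on"],
    ["on", "feed"],
]
top1 = [r[0] for r in ranked]

print(mean_per_class_accuracy(truth, top1))  # (2/3 + 0/1) / 2 = 0.333...
print(recall_at_k(truth, ranked, k=1))       # 2 of 4 correct = 0.5
print(recall_at_k(truth, ranked, k=2))       # all 4 within the top 2 = 1.0
```

In the toy data, plain accuracy at k=1 is 0.5 while MPCA is about 0.33, because the rare class counts as much as the frequent one. This is why MPCA is suited to measuring performance on rare classes.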

  10. Quantitative results: individual tasks. [Figure: two bar charts, MPCA on the test set and R@1 on the test set, each comparing standard batching with batch construction on the subject, predicate, and object tasks for single-task and multitask models.] Batch construction is performed with respect to the label on the x-axis (the same as the task being predicted).

  11. Quantitative results: visual relationship recognition. [Figure: two bar charts, R@50 on the test set and Tail R@50 on the test set, each comparing standard batching with batch construction for single-task and multitask models.] Batch construction is performed with respect to the object labels, since this performed better overall.

  12. Qualitative results. Predicted triplets with their scores for four test images, listed per model in the order shown on the slide. Models: ST-SB: single-task, standard batching. MT-SB: multitask, standard batching. ST-BC-O: single-task, batch construction from object labels. MT-BC-O: multitask, batch construction from object labels.
      Image labelled (person, on, horse):
        ST-SB: (person, on, horse) 12.0; (person, ride, horse) 7.0; (person, wear, horse) 5.3; (person, has, horse) 5.2; (person, on, person) 3.1
        ST-BC-O: (person, on, horse) 18.7; (person, has, horse) 11.8; (person, wear, horse) 7.7; (person, in front of, horse) 4.3; (person, next to, person) 3.7
        MT-SB: (person, wear, horse) 9.3; (person, on, horse) 6.8; (person, wear, person) 3.4; (person, behind, horse) 3.1; (person, has, horse) 2.6
        MT-BC-O: (person, on, horse) 13.2; (person, above, horse) 12.0; (person, behind, horse) 6.3; (person, ride, horse) 5.3; (person, has, horse) 4.8
      Image labelled (giraffe, taller than, giraffe):
        ST-SB: (giraffe, taller than, giraffe) 25.1; (giraffe, in front of, giraffe) 20.8; (giraffe, next to, giraffe) 9.5; (giraffe, above, giraffe) 7.6; (giraffe, behind, giraffe) 7.2
        ST-BC-O: (giraffe, in front of, giraffe) 98.6; (giraffe, taller than, giraffe) 0.4; (giraffe, behind, giraffe) 0.4; (giraffe, next to, giraffe) 0.1; (giraffe, beside, giraffe) 0.1
        MT-SB: (giraffe, taller than, giraffe) 45.4; (giraffe, in front of, giraffe) 18.9; (giraffe, next to, giraffe) 8.6; (giraffe, behind, giraffe) 7.3; (giraffe, under, giraffe) 2.6
        MT-BC-O: (giraffe, in front of, giraffe) 92.5; (giraffe, taller than, giraffe) 6.0; (giraffe, behind, giraffe) 0.9; (giraffe, next to, giraffe) 0.3; (giraffe, beside, giraffe) 0.07
      Image labelled (person, on, skateboard):
        ST-SB: (person, wear, person) 11.8; (person, wear, shirt) 10.5; (person, wear, skateboard) 10.0; (person, wear, shoes) 5.4; (person, wear, pants) 4.4
        ST-BC-O: (person, wear, skateboard) 25.6; (person, on, skateboard) 10.0; (person, has, skateboard) 9.6; (person, ride, skateboard) 5.2; (person, wear, shoes) 3.5
        MT-SB: (person, wear, shirt) 15.5; (person, wear, person) 9.6; (person, wear, skateboard) 6.9; (person, wear, shoes) 6.1; (person, wear, pants) 4.1
        MT-BC-O: (person, wear, skateboard) 20.0; (person, wear, shoes) 14.0; (person, wear, helmet) 12.0; (person, has, skateboard) 3.8; (person, wear, pants) 3.7
      Image labelled (person, feed, elephant):
        ST-SB: (person, above, street) 4.3; (person, on, street) 4.1; (person, under, street) 3.0; (sky, above, street) 1.7; (sky, on, street) 1.6
        ST-BC-O: (person, under, elephant) 16.4; (person, in front of, elephant) 16.0; (person, above, elephant) 10.0; (person, near, elephant) 4.7; (person, behind, elephant) 4.1
        MT-SB: (person, on, street) 4.7; (person, under, street) 3.9; (person, above, street) 3.4; (person, on, person) 2.4; (person, under, person) 1.9
        MT-BC-O: (person, in front of, elephant) 7.4; (person, near, elephant) 6.9; (person, under, elephant) 5.1; (person, on, elephant) 3.4; (person, above, elephant) 2.4

  13. Conclusion. Class-selective batch construction improves performance on the tail of the distribution, at the cost of performance on the small number of dominating classes. Multitask learning neither improves nor impedes performance. Reduced capacity can be beneficial. Predicates are difficult to model. Limitation of pretrained models? Misclassifications are often semantically similar to the ground truth. We could use a language model to incorporate semantics.
