  1. KEMAL ÇİZMECİLER (08.03.2016)

  2. PRESENTATION TOPIC: OBJECT DETECTORS EMERGE IN DEEP SCENE CNNS. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba. International Conference on Learning Representations, 2015.

  3. OUTLINE § Problem Statement and Motivation § Simplifying the Input Images § Visualizing the Receptive Fields § Identifying the Semantics of Internal Units § Connections with Other Work § Future Direction

  4. Problem Definition § Study the internal representation learned by Places-CNN, a network trained on a task other than object recognition (i.e., scene recognition). § Visualize those representations through the inner layers.

  5. ImageNet CNN and Places CNN. ImageNet CNN for object classification; Places CNN for scene classification. Same architecture: AlexNet. Slide credit: Bolei Zhou

  6. Comparison of ImageNet CNN and Places CNN § The ImageNet-CNN from Jia (2013) is trained on 1.3 million images from the 1000 object categories of ImageNet (ILSVRC 2012) and achieves a top-1 accuracy of 57.4%. § Places-CNN is trained on 2.4 million images from the 205 scene categories of the Places Database (Zhou et al., 2014) and achieves a top-1 accuracy of 50.0%.

  7. Object Representations in Computer Vision. Part-based models are used to represent objects and visual patterns: an object as a set of parts plus the relative locations between parts. Slide credit: Bolei Zhou. Figure from Fischler & Elschlager (1973)

  8. How Are Objects Represented in a CNN? Three visualization approaches: deconvolution (Zeiler, M. et al., Visualizing and Understanding Convolutional Networks, ECCV 2014); strong-activation images (Girshick, R., Donahue, J., Darrell, T., Malik, J., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014); back-propagation (Simonyan, K. et al., Deep inside convolutional networks: Visualising image classification models and saliency maps, ICLR workshop, 2014). Slide credit: Bolei Zhou
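
Of the three, the back-propagation approach is the easiest to reproduce: the saliency map is just the gradient magnitude of a class score with respect to the input pixels (Simonyan et al., 2014). A minimal sketch, assuming a recent torchvision and standard ImageNet preprocessing; the pretrained AlexNet here stands in for either of the paper's two networks:

```python
import torch
import torchvision.models as models

# AlexNet matches the architecture used by both networks in the paper.
model = models.alexnet(weights="IMAGENET1K_V1").eval()

def saliency_map(image, class_idx):
    """Gradient of an un-normalized class score w.r.t. the input pixels."""
    image = image.clone().requires_grad_(True)  # (1, 3, 224, 224), normalized
    score = model(image)[0, class_idx]
    score.backward()
    # Max absolute gradient over color channels -> single-channel saliency map.
    return image.grad.abs().amax(dim=1)
```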

  9. DeConvNet. Matthew D. Zeiler, Rob Fergus, Visualizing and Understanding Convolutional Networks.

  10. IMAGES HAVING HIGHEST ACTIVATIONS The earlier layers such as pool1 and pool2 prefer similar images for both networks, while the later layers tend to be more specialized to the specific task of scene or object categorization.
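
The highest-activation ranking behind this comparison is simple bookkeeping once per-unit responses have been extracted; a sketch assuming the activations have already been collected into an array:

```python
import numpy as np

def top_activating_images(activations, k=10):
    """activations: (num_images, num_units) max responses of one layer's units
    over a dataset. Returns, per unit, the indices of its k strongest images."""
    order = np.argsort(-activations, axis=0)  # sort images per unit, descending
    return order[:k].T                        # shape: (num_units, k)

# Comparing these top-k lists between ImageNet-CNN and Places-CNN layer by
# layer reproduces the pattern above: pool1/pool2 agree, pool5 diverges.
```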

  11. How Are Objects Represented in a CNN? The CNN uses a distributed code to represent objects. [Figure: unit visualizations from conv1 through pool5.] Slide credit: Bolei Zhou. Interactive visualization: http://people.csail.mit.edu/torralba/research/drawCNN/drawNet.html

  12. DIFFERENCE FROM OTHER WORK § Girshick et al. (2014) proposed pre-training on an auxiliary task and then fine-tuning for the target task; here, the same network does both object localization and scene recognition in a single forward pass. § Another set of recent works (Oquab et al., 2014; Bergamo et al., 2014) demonstrates the ability of deep networks trained on object classification to do localization without bounding-box supervision; however, they still require object-level supervision.

  13. UNCOVERING THE CNN § SIMPLIFYING THE INPUT IMAGES § VISUALIZING THE RECEPTIVE FIELDS § IDENTIFYING THE SEMANTICS OF INTERNAL UNITS

  14. SIMPLIFYING THE INPUT IMAGES § First approach: simplify the image so that it keeps as little visual information as possible while still having a high classification score for the same category. § Second approach: generate the minimal image representations using the fully annotated image set of the SUN Database (Xiao et al., 2014) instead of performing automatic segmentation.

  15. SUN DATABASE. J. Xiao et al., 2010.

  16. SUN DATABASE. J. Xiao et al., 2010.

  17. HOW IS SIMPLIFICATION HANDLED? § At each iteration, remove the segment that produces the smallest decrease in the correct classification score, and repeat until the image is incorrectly classified. § The last correctly classified image is the minimal representation (see the sketch below).
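
A sketch of that greedy loop; `segments`, `remove`, `score`, and `predict` are hypothetical helpers (a region list from a segmentation, an operation that grays out or inpaints a region, the correct-class score, and the argmax class):

```python
def minimal_representation(image, true_class, segments, remove, score, predict):
    """Greedy simplification: at each iteration drop the segment whose removal
    hurts the correct-class score least; stop just before the prediction flips.
    What remains is the minimal image representation."""
    current, remaining = image, list(segments)
    while remaining:
        # Segment whose removal causes the smallest decrease in the score.
        best = max(remaining, key=lambda s: score(remove(current, s), true_class))
        candidate = remove(current, best)
        if predict(candidate) != true_class:
            break                    # one more removal would misclassify
        current = candidate
        remaining.remove(best)
    return current
```

Note the loop is quadratic in the number of segments, which is acceptable at the scale of a single image's segmentation.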

  18. Related idea: Poisson Blending. A good blend should preserve the gradients of the source region without changing the background. Slide by Derek Hoiem. Pérez, Patrick, Gangnet, Michel, and Blake, Andrew, Poisson image editing, ACM Trans. Graph., 2003.
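
The gradient-preservation idea has a compact variational form; this is the standard objective from Pérez et al. (2003), restated:

```latex
% Solve for the blended region f over the pasted domain \Omega so that it
% follows the source gradients \nabla g while agreeing with the background
% f^* on the boundary \partial\Omega:
\min_{f}\; \iint_{\Omega} \lVert \nabla f - \nabla g \rVert^{2}
\quad \text{s.t.} \quad f\big|_{\partial\Omega} = f^{*}\big|_{\partial\Omega}
% Its Euler-Lagrange equation is the Poisson equation:
\Delta f = \Delta g \ \text{in } \Omega, \qquad
f\big|_{\partial\Omega} = f^{*}\big|_{\partial\Omega}
```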

  19. MINIMAL IMAGE REPRESENTATION

  20. INFERENCE FROM SIMPLIFICATION § For art gallery, the minimal image representations contained paintings (81%) and pictures (58%); in amusement parks, carousel (75%), ride (64%), and roller coaster (50%); in bookstore, bookcase (96%), books (68%), and shelves (67%). § These results suggest that object detection is an important part of the representation built by the network to obtain discriminative information for scene classification.

  21. VISUALIZING THE RECEPTIVE FIELDS § Replicate each image many times with small random occluders (image patches of size 11×11) at different locations in the image. § From the K images, center the discrepancy map around the spatial location of the unit that caused the maximum activation for the given image, then average the re-centered discrepancy maps (see the sketch below).
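
A sketch of the occlusion step for one image and one unit; `unit_activation` is a hypothetical helper returning the unit's maximum response for an image:

```python
import numpy as np

def discrepancy_map(image, unit_activation, patch=11, n_occluders=500, seed=0):
    """Drop small gray occluders at random locations and record how much each
    one changes the unit's response; large changes mark the empirical RF."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    base = unit_activation(image)
    heat = np.zeros((H, W))
    counts = np.full((H, W), 1e-8)  # avoid divide-by-zero where never occluded
    for _ in range(n_occluders):
        y = rng.integers(0, H - patch)
        x = rng.integers(0, W - patch)
        occluded = image.copy()
        occluded[y:y + patch, x:x + patch] = image.mean()
        change = abs(base - unit_activation(occluded))
        heat[y:y + patch, x:x + patch] += change
        counts[y:y + patch, x:x + patch] += 1
    return heat / counts  # per-pixel mean discrepancy for this image

# As the slide describes, the per-image maps are then re-centered on each
# image's peak-activation location and averaged over the K images.
```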

  22. VISUALIZING THE RECEPTIVE FIELDS 200K images from the scene-centric SUN database + the object-centric ImageNet.

  23. Estimating the Receptive Fields. The actual size of the RF is much smaller than the theoretical one. [Figure: estimated receptive field sizes at pool1, conv3, and pool5.] Segmentation using the RF of units (highlight the regions within the RF that have the highest value in the feature map) is more semantically meaningful.
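
For reference, the theoretical RF the slide compares against can be computed layer by layer from kernel sizes and strides; a sketch using the standard published AlexNet configuration (the layer parameters below are an assumption, not taken from the paper):

```python
def theoretical_rf(layers):
    """RF grows by (kernel - 1) * jump per layer, where jump is the product
    of the strides of all earlier layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# AlexNet up to pool5: conv1(11,4), pool1(3,2), conv2(5,1), pool2(3,2),
# conv3(3,1), conv4(3,1), conv5(3,1), pool5(3,2)
alexnet = [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1), (3, 2)]
print(theoretical_rf(alexnet))  # -> 195 pixels, far larger than the empirical RF
```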

  24. IDENTIFYING THE SEMANTICS Workers provide tags for pre-computed segments without being constrained to a dictionary of terms. Three tasks: (1) identify the concept; (2) mark the set of images that do not fall into this theme; (3) categorize the concept provided in (1) into one of 6 semantic groups ranging from low-level to high-level, from simple elements and colors to object parts and even scenes. For each unit, measure its precision as the percentage of images that were selected as fitting the labeled concept.

  25. Annotating the Semantics of Units. Top-ranked segmented images are cropped and sent to Amazon Mechanical Turk for annotation.

  26. Annotating the Semantics of Units Pool5, unit 76; Label: ocean; Type: scene; Precision: 93%

  27. Annotating the Semantics of Units Pool5, unit 13; Label: Lamps; Type: object; Precision: 84%

  28. Annotating the Semantics of Units Pool5, unit 77; Label: legs; Type: object part; Precision: 96%

  29. Annotating the Semantics of Units Pool5, unit 112; Label: pool table; Type: object; Precision: 70%

  30. Annotating the Semantics of Units Pool5, unit 22; Label: dinner table; Type: scene; Precision: 60%

  31. IDENTIFYING THE SEMANTICS Consider only units that had a precision above 75% as provided by the AMT workers; around 60% of the units in each layer were above that threshold. For both networks, units at the early layers (pool1, pool2) are more responsive to simple elements and colors, while those at later layers (conv4, pool5) carry more high-level semantics (responding more to objects and scenes).

  32. Distribution of Semantic Types at Each Layer. Object detectors emerge within a CNN trained to classify scenes, without any object supervision! Slide credit: Bolei Zhou

  33. WHAT OBJECT CLASSES EMERGE? The SUN database is used because dense object annotations are needed to study which object classes are the most informative for scene categorization and what the natural object frequencies in scene images are. The segmentation shows the regions of the image for which the unit is above a certain threshold. Each unit seems to be selective to a particular appearance of the object.

  34. WHAT OBJECT CLASSES EMERGE? § The categories found in pool5 tend to follow the target categories in ImageNet. § There are 6 units that detect lamps, each detecting a particular type of lamp and providing finer-grained discrimination. § There are 9 units selective to people, each one tuned to a different scale or to people doing different tasks.

  35. AND WHY? § The correlation between object frequency in the database and object frequency discovered by the units in pool5 is 0.54. § The correlation between the number of scene categories a particular object class is the most useful for and the discovered object class is 0.84. § Also, there are 115 units not detecting objects, which means we cannot rule out other representations being used in combination with objects.

  36. Evaluation on SUN Database. [Figure: correlation 0.53; correlation 0.84.]

  37. Evaluation on SUN Database Evaluate the performance of the emerged object detectors

  38. CONCLUSION § Object detection emerges inside a CNN trained to recognize scenes. § Many more objects can be naturally discovered in a supervised setting tuned to scene classification rather than object classification. § Object localization comes in a single forward pass. § Besides taking the output of the last layer as a feature, the inner layers can show different levels of abstraction.

  39. FUTURE DIRECTION § In what other tasks can we learn object classes or discriminative object detectors? § Can we pop up all object classes?

  40. QUESTIONS Thank You
