Deep filter banks for texture recognition and segmentation
Mircea Cimpoi, University of Oxford
Subhransu Maji, UMass Amherst
Andrea Vedaldi, University of Oxford
Texture understanding
▶ Indicator of material properties, e.g. brick vs wooden
▶ Complementary to shape
▶ Correlated with identity, but not the same
▶ Kickstarted orderless image representations (e.g. bag of words)
[Bajcsy et al. 73, Julesz 81, Ojala et al. 96, 02, Dana et al. 99, Leung and Malik 99, Varma and Zisserman 03, 05, Caputo et al. 05, Lazebnik et al. 05, 06, Timofte and Van Gool 12, Sharma et al. 12, Sifre and Mallat 13, Sharan et al. 09, 13]
Is there a relation between texture representations and deep convolutional neural networks?
Texture representations
Filters + histogramming
image x → bank of filters (F1, F2, …) → local descriptors y → VQ + histogram → ɸ(x)
[Leung and Malik 99, 01, Schmid 01, Varma and Zisserman 02, 05]
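The filters + histogramming pipeline can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the filter bank and the codebook are random placeholders (in practice the codebook would be learned, e.g. with k-means), and the function names `filter_bank_responses` and `vq_histogram` are hypothetical, not taken from the talk.

```python
import numpy as np

def filter_bank_responses(image, filters):
    """Correlate a 2-D image with each k x k filter ('valid' region only).

    Returns an array of shape (H-k+1, W-k+1, num_filters): one local
    descriptor per pixel, with one entry per filter.
    """
    k = filters.shape[1]
    # All k x k patches of the image, shape (H-k+1, W-k+1, k, k).
    patches = np.lib.stride_tricks.sliding_window_view(image, (k, k))
    return np.einsum("hwij,fij->hwf", patches, filters)

def vq_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and count (L1-normalised)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
image = rng.random((32, 32))                 # stand-in for a texture image x
filters = rng.standard_normal((4, 5, 5))     # hypothetical bank F1..F4
codebook = rng.standard_normal((8, 4))       # hypothetical; normally learned

y = filter_bank_responses(image, filters)    # local descriptors, shape (28, 28, 4)
phi = vq_histogram(y.reshape(-1, 4), codebook)   # representation, shape (8,)
```

The histogram discards where each descriptor came from, which is what makes the representation orderless.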
Texture representations
Filters may be non-linear
image x → non-linear filters, i.e. a local descriptor (SIFT, LBP, LTP, HOG, SURF, BRIEF, ORB, …) → local descriptors y → VQ + histogram → ɸ(x)
[Geusebroek et al. 03, Lowe 99, Ojala et al. 02, Dalal and Triggs 05, Bay et al. 06, Tan and Triggs 10]
Texture representations
Replace histograms with an orderless pooling encoder
image x → non-linear filters, i.e. a local descriptor (SIFT, LBP, LTP, HOG, SURF, BRIEF, ORB, …) → local descriptors y → orderless pooling encoder (bag-of-words, Fisher Vector, VLAD, sparse coding, …) → ɸ(x)
[Sivic and Zisserman 03, Csurka et al. 04, Perronnin and Dance 07, Perronnin et al. 10, Jegou et al. 10]
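As one concrete instance of an orderless pooling encoder, the sketch below implements a basic VLAD encoding and checks that shuffling the local descriptors leaves the output unchanged. The descriptors and cluster centres are random placeholders, and `vlad_encode` is a hypothetical helper name; real systems learn the centres from training descriptors.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """VLAD: per centre, sum the residuals of the descriptors assigned to it,
    then concatenate and L2-normalise."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[nearest == k]
        if len(members):
            v[k] = (members - centers[k]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((200, 16))   # stand-in local descriptors
centers = rng.standard_normal((4, 16))         # hypothetical; normally learned

phi = vlad_encode(descriptors, centers)
# Same descriptors in a different order: the encoding should not change.
phi_shuffled = vlad_encode(rng.permutation(descriptors), centers)
```

Because the encoder only sums over the set of descriptors, `phi` and `phi_shuffled` agree up to floating-point rounding.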
Texture representations vs CNNs
Stages: image → non-linear filters → feature field → encoder → representation
Texture representation: image x → handcrafted features → orderless pooling → ɸ(x)
CNN: image x → “convolutional” layers c1, c2, c3, c4, c5 → “fully-connected” (FC) layers f6, f7, f8 → ɸ(x) [Krizhevsky et al. 12]
Mix and match
Stages: image → non-linear filters → feature field → encoder → representation ɸ(x)
Local descriptors are either handcrafted or CNN; the encoder is either orderless pooling or FC pooling.
▶ Standard texture representation: handcrafted local descriptors + orderless pooling [Sivic and Zisserman 03, Csurka et al. 04, Perronnin and Dance 07, Perronnin et al. 10, Jegou et al. 10]
▶ Standard application of CNN (FC-CNN): CNN local descriptors + FC pooling [Chatfield et al. 14, Girshick et al. 2014, Gong et al. 14, Razavian et al. 14]
▶ Orderless pooling of CNN local descriptors
▶ CNN descriptors pooled by Fisher Vector (FV-CNN)
See [Perronnin and Larlus 15], Poster 2B-44
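A minimal sketch of the FV-CNN idea, pooling CNN local descriptors with a Fisher Vector, is given below. It assumes a diagonal-covariance GMM and the usual improved Fisher Vector normalisation (signed square root followed by L2). The convolutional feature map and the GMM parameters are random placeholders standing in for a real network output and an EM-trained mixture, and `fisher_vector` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def fisher_vector(X, w, mu, var):
    """Improved Fisher Vector of descriptors X (N, D) under a diagonal GMM.

    w: (K,) mixture weights, mu: (K, D) means, var: (K, D) variances.
    Returns a vector of length 2*K*D.
    """
    N = X.shape[0]
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]  # (N, K, D)
    # log of w_k * N(x | mu_k, var_k), up to a constant shared by all k.
    log_p = np.log(w) - 0.5 * np.log(var).sum(axis=1) - 0.5 * (diff ** 2).sum(axis=2)
    log_p -= log_p.max(axis=1, keepdims=True)        # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)        # posteriors, shape (N, K)

    g = gamma[:, :, None]
    # Gradients with respect to the means and the variances.
    u = (g * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    v = (g * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([u.ravel(), v.ravel()])

    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # signed square root
    return fv / max(np.linalg.norm(fv), 1e-12)       # L2 normalisation

rng = np.random.default_rng(0)
conv_map = rng.standard_normal((6, 6, 32))   # stand-in for a conv layer output (H, W, D)
X = conv_map.reshape(-1, conv_map.shape[-1]) # one local descriptor per spatial location

K, D = 4, X.shape[1]
w = np.full(K, 1.0 / K)                      # hypothetical GMM; normally fitted with EM
mu = rng.standard_normal((K, D))
var = np.ones((K, D))

phi = fisher_vector(X, w, mu, var)           # length 2 * K * D
```

The spatial layout of `conv_map` is discarded by the reshape, so the result depends only on the set of descriptors, not on their positions.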
Tested modules
[Diagram labels: SIFT, CNN, FV, FC, ɸ(x)]
Baseline CNN models
▶ Typical: AlexNet [Krizhevsky et al. 12], VGG-M [Chatfield et al. 14]
▶ Deep: VGG-VD [Simonyan and Zisserman 14]
Local image descriptors
▶ Handcrafted: SIFT [Lowe 99]
▶ Learned: convolutional layers of CNNs
Pooling encoders
▶ Classical: Bag of Visual Words [Sivic and Zisserman 03, Csurka et al. 04], Fisher Vector [Perronnin and Dance 07, Perronnin et al. 10]
▶ CNN: FC layers [Chatfield et al. 14, Girshick et al. 2014, Gong et al. 14, Razavian et al. 14]
Findings: what pooling CNNs is good for
▶ How does FV-CNN perform compared to other descriptors?
▶ How does FV-CNN handle region recognition?
▶ What is the benefit of FV-CNN in domain transfer?
Datasets and benchmarks
▶ Material recognition (FMD) [Liu et al. 10, Sharan et al. 13]
▶ Texture attribute recognition (DTD) [Cimpoi et al. 14]
▶ Fine-grained recognition (CUB) [Wah et al. 11]
▶ Scene recognition (MIT Indoors) [Quattoni and Torralba 09]
▶ Object recognition (VOC07) [Everingham et al. 07]
▶ Things and stuff (MSRC) [Criminisi 04, Shotton et al. 06]
Which feature and encoder?
[Bar chart, Material (FMD): BoVW-SIFT, Fisher vector-SIFT, BoVW-CNN and Fisher vector-CNN; bar values shown: 73.5, 67.9, 59.7, 50.5]
Finding 1) BoVW < FV
Finding 2) SIFT < CNN
CNN vs Fisher Vector pooling
[Bar chart, Material (FMD): FC-CNN (VGG-M) 70.3, FV-CNN (VGG-M) 73.5, FC-CNN (VGG-VD) 77.4, FV-CNN (VGG-VD) 79.8]
Finding 3) FV pooling ≥ CNN pooling
Finding 4) Deep ≥ shallow
CNN vs Fisher Vector pooling
[Bar chart, Scene (MIT Indoor): FC-CNN (VGG-M) 62.5, FV-CNN (VGG-M) 74.2, FC-CNN (VGG-VD) 67.6, FV-CNN (VGG-VD) 81]
Finding 3) FV pooling ≥ CNN pooling
Finding 4) Deep ≥ shallow
Breadth of applicability
Dataset (domain): Fully connected (VGG-VD) / Fisher vector (VGG-VD) / SoA
ALOT (texture, materials): 88.7 / 97.8 / 95.9
FMD (texture, materials): 77.7 / 79.8 / 57.7
DTD (textures, attributes): 62.9 / 72.3 / 58.6
VOC07 (objects): 81.7 / 85.9 / 85.2
MIT (scenes): 67.6 / 81 / 70.8
CUB+R (fine-grained): 62.8 / 73 / 76.4
Finding 5) FV + CNN applies to many diverse domains
[Cimpoi et al. 14, Sulc and Matas 14, Sharan et al. 13, Wei and Levoy 14, Zhou et al. 14, Zhang et al. 14, Burghouts and Geusebroek 09, Sharan et al. 09, Everingham et al. 08, Quattoni and Torralba 09, Wah et al. 11]
Texture recognition in the “wild” and “clutter” (OS)
[Example images labelled: metal, food, metal, wood, glass, paper]
A new texture benchmark
▶ Based on the OpenSurfaces dataset [Bell et al. 13, 15]
▶ Textures in the wild (uncontrolled conditions)
▶ Textures in clutter (do not fill the image)
First extensive evaluation of texture material/attribute recognition of this kind
Regions: the crop & describe approach
E.g. R-CNN: each region R1, R2, R3, … is cropped from the image x and described on its own, giving the representations ɸ(x; R1), ɸ(x; R2), ɸ(x; R3), …
Pros: straightforward & universal construction
[Chatfield et al. 14, Jia 13, Girshick et al. 2014, Gong et al. 14, Razavian et al. 14]
Crop & describe limitations
Each region R needs its own representation ɸ(x; R):
▶ Expensive
▶ May distort images
▶ Can only do rectangles
Regions: the pooling encoder approach
Share the local descriptors: the non-linear filters are applied to the image x once, and each region R1, R2, R3, … is pooled separately to give ɸ(x; R1), ɸ(x; R2), ɸ(x; R3), …
Cons: restricted to a convolutional representation
Pros: fast, flexible, multiscale, and often more accurate
[He et al. 2014, Cimpoi et al. 2015]
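The pooling encoder approach to regions can be sketched as follows: the local descriptors are computed once per image and each region only selects which descriptors to pool. The feature map is a random placeholder, the helper names are hypothetical, and mean pooling is used purely as a simple stand-in for an orderless encoder such as a Fisher Vector.

```python
import numpy as np

def pool_regions(feature_map, masks, encoder):
    """Encode each region from one shared feature map.

    feature_map: (H, W, D) local descriptors, computed once per image.
    masks: list of boolean (H, W) arrays; regions may have any shape.
    encoder: maps an (N, D) set of descriptors to a fixed-length vector.
    """
    return [encoder(feature_map[m]) for m in masks]

def mean_encoder(X):
    # Stand-in orderless encoder; a Fisher Vector could be used instead.
    return X.mean(axis=0)

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 8, 16))   # stand-in for conv features of image x

rows, cols = np.mgrid[0:8, 0:8]
r1 = rows < 4                                   # rectangular region
r2 = (rows - 4) ** 2 + (cols - 4) ** 2 <= 9     # disc-shaped region
r3 = cols > rows                                # triangular region

reps = pool_regions(feature_map, [r1, r2, r3], mean_encoder)
```

Adding a region costs only one more pooling step over descriptors that already exist, and the masks are not limited to rectangles.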
FV vs FC pooling for regions
[Bar chart: CNN pooling vs FV pooling on FMD, VOC07, MIT Indoor, OS+R, OSA+R, CUB+R and MSRC+R; bar values shown: 97.6, 84, 76.8, 76.4, 74.2, 73.5, 70.3, 65.5, 65.2, 62.5, 56.5, 54.3, 52.5, 41.3]
Finding 6) FV pooling ≫ CNN pooling for small, variable regions (and faster too!)