Deep TEN: Texture Encoding Network Hang Zhang, Jia Xue, Kristin Dana 1 Hang Zhang
Highlights and Overview
• Introduced Encoding-Net, a new CNN architecture
• Achieved state-of-the-art texture recognition results on MINC-2500, FMD, GTOS, KTH, and 4D-Light
• Released the arXiv paper (CVPR 17) and a Torch implementation (GPU backend)
Challenges for Texture Recognition
• Orderless
• Distributions
Classic Vision Approaches
Pipeline: Feature extraction → Dictionary learning → Encoding → Classifier
• Feature extraction: filter-bank responses or SIFT
• Encoding: bag-of-words, VQ, or VLAD
• The input image sizes are flexible
• No domain-transfer problem
Comparing to Deep Learning Framework
[Diagram: classic pipeline (Feature extraction → Dictionary learning → Encoding → SVM) shown next to a CNN ending in an FC Layer]
• Preserves spatial information
• Domain transfer
• Fixed input size
• Can we bridge the gap?
Hybrid Solution
• BoWs: SIFT / filter-bank responses → Dictionary → Histogram encoding → SVM
Hybrid Solution and Its Limitation
• BoWs (off-the-shelf): SIFT / filter-bank responses → Dictionary → Histogram encoding → SVM
• FV-CNN (off-the-shelf): Pre-trained CNNs → Dictionary → Fisher Vector encoding → SVM
• The dictionary and the encoders are fixed once built
• Feature learning and encoding do not benefit from the labeled data
End-to-end Encoding
• BoWs (off-the-shelf): SIFT / filter-bank responses → Dictionary → Histogram encoding → SVM
• FV-CNN (off-the-shelf): Pre-trained CNNs → Dictionary → Fisher Vector encoding → SVM
• Deep-TEN (end-to-end): Convolutional layers → Encoding Layer (Dictionary, Residual encoding) → FC layer
Bag-of-Words (BoW) Encoder
• Given a set of visual features $X = \{x_1, \dots, x_N\}$ and a learned codebook $C = \{c_1, \dots, c_K\}$ (the input features are $d$-dimensional, $N$ is the number of visual features, and $K$ is the number of codewords)
• The assignment weight $a_{ik}$ corresponds to the visual feature $x_i$ being assigned to the codeword $c_k$. Hard-assignment: $a_{ik} = \delta\left(\|x_i - c_k\|^2 = \min_{j \in \{1, \dots, K\}} \|x_i - c_j\|^2\right)$
• BoWs counts the occurrences of the visual words: $\sum_i a_i$
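The hard-assignment histogram can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Torch code: the function name `bow_encode` and the toy values of `X` and `C` are assumptions made for the example.

```python
import numpy as np

def bow_encode(X, C):
    """Hard-assignment bag-of-words histogram.

    X: (N, d) visual features; C: (K, d) codebook. Returns (K,) counts.
    """
    # Squared Euclidean distance from every feature to every codeword: (N, K).
    dist2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)           # index of the closest codeword
    A = np.zeros((X.shape[0], C.shape[0]))   # a_ik stored as one-hot rows
    A[np.arange(X.shape[0]), nearest] = 1.0
    return A.sum(axis=0)                     # sum over i of a_i

# Toy data (hypothetical values, chosen only for illustration).
C = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[0.1, -0.2], [9.5, 10.3], [0.4, 0.1]])
hist = bow_encode(X, C)
```

Two of the three toy features fall nearest the first codeword, so the histogram records two counts there and one for the second codeword.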
Residual Encoders
• The Fisher Vector, concatenating the gradients of the GMM with respect to the mean and standard deviation:
  $G^X_{\mu_k} = \sum_{i=1}^{N} a_{ik} (x_i - c_k)$
  $G^X_{\sigma_k} = \sum_{i=1}^{N} a_{ik} \left[ (x_i - c_k)^2 - 1 \right]$
• VLAD (1st order, hard-assignment):
  $V_k = \sum_{i:\, \mathrm{NN}(x_i) = c_k} (x_i - c_k)$
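As a rough illustration of the first-order, hard-assignment case, the sketch below sums the residuals of the features whose nearest codeword is each $c_k$. The name `vlad_encode` and the toy arrays are assumptions for the example, and the normalization steps used by practical VLAD pipelines are left out.

```python
import numpy as np

def vlad_encode(X, C):
    """First-order hard-assignment residual encoding (unnormalized VLAD).

    X: (N, d) features; C: (K, d) codewords. Returns (K, d) residual sums.
    """
    dist2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)
    V = np.zeros(C.shape, dtype=float)
    for k in range(C.shape[0]):
        members = X[nearest == k]             # features whose nearest codeword is c_k
        V[k] = (members - C[k]).sum(axis=0)   # stays zero when no feature is assigned
    return V

# Toy data (hypothetical values).
C = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[1.0, 0.0], [0.0, 2.0], [11.0, 10.0]])
V = vlad_encode(X, C)
```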
Residual Encoding Model
• Residual vector: $r_{ik} = x_i - c_k$
• Aggregating residuals with assignment weights: $e_k = \sum_i a_{ik} r_{ik}$
[Diagram: Encoding Layer: Input → Residuals (against the Dictionary) → Assign → Aggregate]
Feature Distributions and Assigning
• Soft-assignment: $a_{ik} = \dfrac{\exp(-\beta \|r_{ik}\|^2)}{\sum_{j=1}^{K} \exp(-\beta \|r_{ij}\|^2)}$
• Learnable smoothing factor: $a_{ik} = \dfrac{\exp(-s_k \|r_{ik}\|^2)}{\sum_{j=1}^{K} \exp(-s_j \|r_{ij}\|^2)}$
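Putting the residuals, the soft-assignment with per-codeword smoothing factors, and the aggregation together gives the forward pass of an encoding layer. The NumPy sketch below is an illustration only: the function name `encoding_layer`, the random inputs, and the max-subtraction for numerical stability are assumptions, not details taken from the slides.

```python
import numpy as np

def encoding_layer(X, C, s):
    """Residual encoding with learnable smoothing factors (forward pass only).

    X: (N, d) features; C: (K, d) codewords; s: (K,) smoothing factors.
    Returns E: (K, d) aggregated residuals and A: (N, K) assignment weights.
    """
    R = X[:, None, :] - C[None, :, :]                # r_ik, shape (N, K, d)
    logits = -s[None, :] * (R ** 2).sum(axis=2)      # -s_k * ||r_ik||^2
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)             # a_ik, each row sums to 1
    E = (A[:, :, None] * R).sum(axis=0)              # e_k = sum_i a_ik r_ik
    return E, A

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # hypothetical: 50 features of dimension 8
C = rng.normal(size=(4, 8))    # hypothetical: 4 codewords
s = np.ones(4)
E, A = encoding_layer(X, C, s)
```

Setting every entry of `s` to the same constant recovers the fixed-$\beta$ soft-assignment.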
End-to-end Learning
• The loss function is differentiable w.r.t. the input $X$ and the parameters (dictionary $C$ and smoothing factors $s$)
• The Encoding Layer can be trained end-to-end by standard Stochastic Gradient Descent (SGD) with backpropagation
[Diagram: Deep-TEN (end-to-end): Convolutional Layers → Encoding Layer (Dictionary, Residual Encoding) → FC Layer]
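To make the differentiability claim concrete, the sketch below takes one gradient step on the codewords for a toy scalar loss. Everything here is an assumption for illustration: the loss (the squared norm of the encoding) is made up, and the gradient is estimated by central finite differences rather than by the analytic backpropagation a real implementation would use.

```python
import numpy as np

def encode(X, C, s):
    """Residual encoding e_k = sum_i a_ik (x_i - c_k) with soft-assignment."""
    R = X[:, None, :] - C[None, :, :]
    logits = -s * (R ** 2).sum(axis=2)
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)
    return (A[:, :, None] * R).sum(axis=0)

def loss(X, C, s):
    """Toy scalar loss (hypothetical): squared norm of the encoding."""
    return float((encode(X, C, s) ** 2).sum())

def numerical_grad(X, C, s, eps=1e-5):
    """Central finite-difference estimate of d loss / d C."""
    g = np.zeros_like(C)
    for idx in np.ndindex(*C.shape):
        Cp, Cm = C.copy(), C.copy()
        Cp[idx] += eps
        Cm[idx] -= eps
        g[idx] = (loss(X, Cp, s) - loss(X, Cm, s)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
C = rng.normal(size=(3, 2))
s = np.ones(3)

loss_before = loss(X, C, s)
C_new = C - 1e-4 * numerical_grad(X, C, s)   # one small gradient step on the dictionary
loss_after = loss(X, C_new, s)
```

The same procedure applies to the smoothing factors `s`, since the loss depends on them smoothly as well.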
Relation to Dictionary Learning
• Dictionary learning is usually achieved by unsupervised grouping (e.g. K-means) or by minimizing the reconstruction error (e.g. K-SVD).
• The Encoding Layer makes the inherent dictionary differentiable w.r.t. the loss function and learns the dictionary in a supervised manner.
Relation to BoWs and Residual Encoders
• Generalizes BoWs, VLAD & Fisher Vector
• Arbitrary input sizes, fixed-length output representation
• NetVLAD decouples the codewords from their assignments: $a = f(x)$ instead of $a = f(x, c)$
[Diagram: Encoding Layer: Input → Residuals (against the Dictionary) → Assign → Aggregate]
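The fixed-length property follows from summing over the features: the output has shape $K \times d$ however many features come in. The sketch below, with made-up feature-map sizes and a hypothetical `encode` helper, flattens two feature maps of different spatial extent and checks that both produce encodings of the same shape.

```python
import numpy as np

def encode(X, C, s):
    """Residual encoding e_k = sum_i a_ik (x_i - c_k) with soft-assignment."""
    R = X[:, None, :] - C[None, :, :]
    logits = -s * (R ** 2).sum(axis=2)
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)
    return (A[:, :, None] * R).sum(axis=0)

rng = np.random.default_rng(0)
d, K = 8, 4
C = rng.normal(size=(K, d))
s = np.ones(K)

# Two hypothetical convolutional feature maps with different spatial sizes.
small_map = rng.normal(size=(6, 6, d))
large_map = rng.normal(size=(10, 7, d))

# Flatten the spatial grid into an orderless set of N = H * W features.
E_small = encode(small_map.reshape(-1, d), C, s)
E_large = encode(large_map.reshape(-1, d), C, s)
```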
Relation to Global Pooling Layer
• Sum pooling (average pooling): let $K = 1$ and $c = 0$; then $e = \sum_{i=1}^{N} x_i$ and $\frac{d\ell}{dx_i} = \frac{d\ell}{de}$
• SPP-Layer (He et al., ECCV 2014): fixes the number of bins instead of the receptive field; reshaping; arbitrary input size
• Bilinear Pooling (Lin et al., ICCV 2015): sum of the outer products across different locations
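The sum-pooling special case is easy to check numerically: with a single codeword fixed at the origin, every assignment weight equals 1 and each residual is the feature itself. The sketch below uses a hypothetical `encode` helper and random features.

```python
import numpy as np

def encode(X, C, s):
    """Residual encoding e_k = sum_i a_ik (x_i - c_k) with soft-assignment."""
    R = X[:, None, :] - C[None, :, :]
    logits = -s * (R ** 2).sum(axis=2)
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)
    return (A[:, :, None] * R).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))      # hypothetical features

C = np.zeros((1, 5))              # K = 1, codeword fixed at zero
s = np.ones(1)
E = encode(X, C, s)               # shape (1, 5)
sum_pooled = X.sum(axis=0)        # plain sum pooling over the features
```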
Methods Overview
Domain Transfer
• The residual encoding representation: $e_k = \sum_i a_{ik} r_{ik}$
• For a visual feature $x_i$ that appears frequently in the data:
  • It is likely to be close to a visual center $c_k$
  • Its contribution to $e_k$ is close to zero, since $r_{ik} = x_i - c_k \approx 0$
  • Its contribution to $e_j$ ($j \neq k$) is close to zero, since $a_{ij} = \dfrac{\exp(-s_j \|r_{ij}\|^2)}{\sum_{l=1}^{K} \exp(-s_l \|r_{il}\|^2)} \approx 0$
• Residual encoding discards the frequently appearing features, which are likely to be domain specific (useful for fine-tuning pre-trained features)
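A small numerical example illustrates the argument. The codewords, smoothing factors, and the two test features below are made-up values, and `contributions` is a hypothetical helper that returns the term $a_{ik} r_{ik}$ a single feature adds to each $e_k$.

```python
import numpy as np

def contributions(x, C, s):
    """Per-codeword contribution a_ik * r_ik of a single feature x."""
    r = x[None, :] - C                      # residuals to every codeword, (K, d)
    logits = -s * (r ** 2).sum(axis=1)
    a = np.exp(logits - logits.max())
    a = a / a.sum()                         # soft-assignment weights, (K,)
    return a[:, None] * r

C = np.array([[0.0, 0.0], [5.0, 5.0]])      # hypothetical codewords
s = np.ones(2)

# A "frequent" feature sitting almost exactly on the first codeword.
frequent = contributions(np.array([0.01, 0.0]), C, s)

# A feature that lies far from every codeword.
distant = contributions(np.array([2.5, 2.0]), C, s)
```

The feature near a codeword adds almost nothing to any $e_k$: its residual to the nearest codeword is tiny, and its assignment weight to the other codeword is negligible. The distant feature contributes a residual of substantial magnitude.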
Experiments
• Datasets
  • Gold-standard material & texture datasets: MINC-2500, KTH, FMD
  • 2 recent datasets: GTOS, Light Field
  • General recognition datasets: MIT-Indoor, Caltech-101
• Baseline approaches (off-the-shelf)
  • FV-SIFT (128 Gaussian components, 32K → 512)
  • FV-CNN (Cimpoi et al.; pre-trained VGG-VD & ResNet, 32 GMM components)
Dataset Examples
Deep-TEN Architecture
Comparing to the Baselines
Multi-size Training (using different image sizes)
• Deep-TEN ideally accepts arbitrary input sizes (larger than a constant)
• Train with predefined sizes, alternating between them in different epochs, without modifying the solver
• Adopt single-size testing for simplicity
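One simple way to alternate predefined sizes across epochs is to cycle through a list, as sketched below. The function name `size_schedule` and the two sizes are hypothetical; the slide does not state which sizes were used or how they were ordered.

```python
from itertools import cycle, islice

def size_schedule(sizes, num_epochs):
    """Return the input size to use in each epoch, cycling through `sizes`."""
    return list(islice(cycle(sizes), num_epochs))

# Hypothetical sizes; every image in a given epoch is resized to that epoch's size.
schedule = size_schedule([352, 320], num_epochs=5)
```

Because the encoding output has a fixed length regardless of input size, the layers after the encoder need no changes when the size switches between epochs.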
Multi-size Training
Comparing to State-of-the-Art
• Prior approaches
  • (1) rely on assembling features
  • (2) adopt an additional SVM classifier for classification
Extra Thoughts
• So many labeled datasets: object recognition, scene understanding, material recognition
• How can we benefit from them?
  • Simply merging datasets (different labeling strategies)
  • Sharing convolutional features (domain-transfer problem)
Joint Encoding
• Multi-task learning
  • The Encoding Layer carries the domain-specific information
  • The convolutional layers are generic
• Joint training on two datasets
  • CIFAR-10 (50,000 training images of size 36×36)
  • STL-10 (5,000 training images of size 96×96)
[Diagram: shared Conv Layers feed two Encoding Layers, E1 (CIFAR) and E2 (STL)]
Experimental Results for Joint Training
• Joint training on two datasets (simple network architecture)
  • CIFAR-10 (50,000 training images of size 36×36)
  • STL-10 (5,000 training images of size 96×96)
The state of the art for CIFAR-10 is 95.4%, using a 1,001-layer ResNet (He et al., ECCV 2016)
Summary
• Proposed a new model
  • Integrates the entire dictionary learning and encoding pipeline into a single CNN layer
  • Generalizes residual encoders (VLAD, FV), is suitable for texture recognition, and achieved state-of-the-art results
• Introduced a new CNN architecture
  • Makes the deep learning framework more flexible by allowing arbitrary input image sizes
  • Carries domain-specific information and makes the learned features easier to transfer
Thank you!
• We provide an efficient Torch implementation with a CUDA backend at https://github.com/zhanghang1989/Deep-Encoding