Deep TEN: Texture Encoding Network
Hang Zhang, Jia Xue, Kristin Dana

Highlights and Overview
• Introduced Encoding-Net, a new CNN architecture
• Achieved state-of-the-art results on texture recognition: MINC-2500, FMD, GTOS, KTH, 4D-Light
• Released the arXiv paper (CVPR 2017) and a Torch implementation (GPU backend)

Challenges for Texture Recognition
• Orderless
• Distributions

Classic Vision Approaches
• Pipeline: feature extraction (filter-bank responses or SIFT) → dictionary learning → encoding (bag-of-words, VQ or VLAD) → classifier
• The input image sizes are flexible
• No domain-transfer problem

Comparing to Deep Learning Framework
Diagram labels: feature extraction, dictionary learning, encoding, SVM; FC layer
• Preserves spatial info
• Domain transfer
• Fixed size
• Can we bridge the gap?

Hybrid Solution
• BoWs: SIFT / filter-bank responses → dictionary → histogram encoding → SVM

Hybrid Solution and Its Limitation
• BoWs: SIFT / filter-bank responses → dictionary → histogram encoding → SVM
• FV-CNN: pre-trained CNNs → dictionary → Fisher Vector encoding → SVM
• Both are off-the-shelf: the dictionary and the encoders are fixed once built
• Feature learning and encoding do not benefit from the labeled data

End-to-end Encoding
• BoWs (off-the-shelf): SIFT / filter-bank responses → dictionary → histogram encoding → SVM
• FV-CNN (off-the-shelf): pre-trained CNNs → dictionary → Fisher Vector encoding → SVM
• Deep-TEN (end-to-end): convolutional layers → Encoding Layer (dictionary, residual encoding) → FC layer

Bag-of-Words (BoW) Encoder
• Given a set of visual features $X = \{x_1, \dots, x_N\}$ and a learned codebook $C = \{c_1, \dots, c_K\}$ (the input features are $d$-dimensional, $N$ is the number of visual features and $K$ is the number of codewords)
• The assignment weight $a_{ik}$ corresponds to the visual feature $x_i$ assigned to each codeword $c_k$. Hard-assignment: $a_{ik} = \delta\left(\|x_i - c_k\|^2 = \min_{j \in \{1, \dots, K\}} \|x_i - c_j\|^2\right)$
• BoWs counts the occurrences of the visual words: $\sum_i a_i$
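The hard-assignment histogram described on this slide can be sketched in a few lines of plain Python. This is only an illustration: the helper names `sq_dist` and `bow_histogram` and the toy data are assumptions, not part of the released implementation.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def bow_histogram(features, codebook):
    """Hard-assign each feature to its nearest codeword and count occurrences."""
    counts = [0] * len(codebook)
    for x in features:
        nearest = min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))
        counts[nearest] += 1  # a_ik = 1 for the nearest codeword, 0 otherwise
    return counts


# Toy data (illustrative): N = 4 two-dimensional features, K = 2 codewords.
features = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8], [-0.2, 0.1]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
histogram = bow_histogram(features, codebook)
print(histogram)
```

The histogram has one bin per codeword, so its length depends only on K and not on the number of input features.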

Residual Encoders
• The Fisher Vector, concatenating the gradient of the GMM with respect to the mean and standard deviation: $G_{\mu_k} = \sum_{i=1}^{N} a_{ik} (x_i - c_k)$ and $G_{\sigma_k} = \sum_{i=1}^{N} a_{ik} \left[(x_i - c_k)^2 - 1\right]$
• VLAD (1st order, hard-assignment): $V_k = \sum_{i:\, \mathrm{NN}(x_i) = c_k} (x_i - c_k)$
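As a rough illustration of the VLAD bullet, the sketch below sums residuals over the features whose nearest codeword is c_k. The function name `vlad` and the toy values are assumptions made for this example; normalization steps used in practice are left out.

```python
def vlad(features, codebook):
    """Sum the residuals x_i - c_k over the features whose nearest codeword is c_k."""
    dim = len(codebook[0])
    V = [[0.0] * dim for _ in codebook]
    for x in features:
        dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
        k = dists.index(min(dists))  # hard assignment: NN(x_i) = c_k
        for j in range(dim):
            V[k][j] += x[j] - codebook[k][j]
    return V


# Toy data (illustrative): four 2-D features, two codewords.
features = [[0.5, 0.0], [1.0, 1.5], [1.0, 1.0], [-0.25, 0.0]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
V = vlad(features, codebook)
print(V)
```

Unlike the BoW histogram, each codeword keeps a d-dimensional residual sum, so the descriptor has K × d entries.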

Residual Encoding Model
Encoding-Layer diagram labels: Input, Dictionary, Residuals, Assign, Aggregate
• Residual vector: $r_{ik} = x_i - c_k$
• Aggregating residuals with assignment weights: $e_k = \sum_i a_{ik} r_{ik}$

Feature Distributions and Assigning
• Soft-assignment: $a_{ik} = \frac{\exp(-\beta \|r_{ik}\|^2)}{\sum_{j=1}^{K} \exp(-\beta \|r_{ij}\|^2)}$
• Learnable smoothing factor: $a_{ik} = \frac{\exp(-s_k \|r_{ik}\|^2)}{\sum_{j=1}^{K} \exp(-s_j \|r_{ij}\|^2)}$
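Putting the residuals, the soft assignment with per-codeword smoothing factors, and the aggregation together gives the forward pass of the layer. The following is a minimal pure-Python sketch under that reading of the slides; the function name `encode` and the toy inputs are assumptions, and the released Torch code is not reproduced here.

```python
import math


def encode(features, codewords, smoothing):
    """Residual encoding with soft assignment.

    Returns K vectors e_k = sum_i a_ik * r_ik, where r_ik = x_i - c_k and
    a_ik is a softmax over codewords of -s_k * ||r_ik||^2.
    """
    K, dim = len(codewords), len(codewords[0])
    encoded = [[0.0] * dim for _ in range(K)]
    for x in features:
        residuals = [[xj - cj for xj, cj in zip(x, c)] for c in codewords]
        logits = [-s * sum(r * r for r in res) for s, res in zip(smoothing, residuals)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        for k in range(K):
            a_ik = exps[k] / total
            for j in range(dim):
                encoded[k][j] += a_ik * residuals[k][j]
    return encoded


codewords = [[0.0, 0.0], [1.0, 1.0]]
smoothing = [1.0, 1.0]
small = encode([[0.2, 0.1], [0.9, 1.2]], codewords, smoothing)
large = encode([[0.2, 0.1], [0.9, 1.2], [0.5, 0.5], [1.5, -0.3]], codewords, smoothing)
print(len(small), len(large))  # both have K = 2 rows of dimension d = 2
```

The two calls use different numbers of input features but return encodings of the same shape, which is what lets the layer sit in front of a fixed-size FC layer.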

End-to-end Learning
Deep-TEN (end-to-end): convolutional layers → Encoding Layer (dictionary, residual encoding) → FC layer
• The loss function is differentiable w.r.t. the input $X$ and the parameters (dictionary $C$ and smoothing factors $s$)
• The Encoding Layer can be trained end-to-end by standard stochastic gradient descent (SGD) with backpropagation
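To make the differentiability claim concrete, the sketch below takes one gradient step on the codewords and smoothing factors of a toy encoder. It is not the training procedure from the talk: central finite differences stand in for backpropagation, and `toy_loss` (the squared norm of the encoding) is a hypothetical objective chosen only so the example is self-contained.

```python
import math


def encode(features, codewords, smoothing):
    """Soft-assignment residual encoding: e_k = sum_i a_ik * (x_i - c_k)."""
    K, dim = len(codewords), len(codewords[0])
    encoded = [[0.0] * dim for _ in range(K)]
    for x in features:
        residuals = [[xj - cj for xj, cj in zip(x, c)] for c in codewords]
        logits = [-s * sum(r * r for r in res) for s, res in zip(smoothing, residuals)]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        for k in range(K):
            for j in range(dim):
                encoded[k][j] += exps[k] / total * residuals[k][j]
    return encoded


def toy_loss(params, features, K, dim):
    """Hypothetical loss: squared norm of the encoding. params = codewords + smoothing."""
    codewords = [params[k * dim:(k + 1) * dim] for k in range(K)]
    smoothing = params[K * dim:]
    return sum(v * v for row in encode(features, codewords, smoothing) for v in row)


def numerical_gradient(f, params, eps=1e-5):
    """Central finite differences, used here in place of backpropagation."""
    grad = []
    for i in range(len(params)):
        up, down = params[:], params[:]
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


features = [[0.2, 0.1], [0.9, 1.2], [0.4, 0.7]]
K, dim = 2, 2
params = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]  # two codewords, then two smoothing factors
objective = lambda p: toy_loss(p, features, K, dim)

loss_before = objective(params)
grad = numerical_gradient(objective, params)
params_after = [p - 0.01 * g for p, g in zip(params, grad)]  # one gradient step
loss_after = objective(params_after)
print(loss_before, loss_after)
```

Because the encoding is a smooth function of both the codewords and the smoothing factors, a small step against the gradient lowers the toy loss; an autodiff framework would compute the same gradient analytically.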

Relation to Dictionary Learning
• Dictionary learning is usually done by unsupervised grouping (e.g. K-means) or by minimizing the reconstruction error (e.g. K-SVD).
• The Encoding Layer makes the loss function differentiable w.r.t. the inherent dictionary and learns the dictionary in a supervised manner.

Relation to BoWs and Residual Encoders
• Generalizes BoWs, VLAD and Fisher Vector
• Arbitrary input sizes, fixed-length output representation
• NetVLAD decouples the codewords from their assignments: $a = f(x)$ instead of $a = f(x, c)$

Relation to Global Pooling Layer
• Sum pooling (avg pooling): let $K = 1$ and $c = 0$, then $e = \sum_{i=1}^{N} x_i$ and $\frac{d\ell}{dx_i} = \frac{d\ell}{de}$
• SPP-Layer (He et al., ECCV 2014): fixes the number of bins instead of the receptive field (reshaping, arbitrary input size)
• Bilinear Pooling (Lin et al., ICCV 2015): sum of the outer products across different locations
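The sum-pooling case can be checked directly from the encoding formulas. The short derivation below assumes the single codeword is held fixed at zero, so the softmax has only one term.

```latex
% K = 1 and c_1 = 0 (fixed): the softmax over a single codeword is identically 1.
\begin{align*}
r_{i1} &= x_i - c_1 = x_i, \\
a_{i1} &= \frac{\exp(-s_1 \|r_{i1}\|^2)}{\exp(-s_1 \|r_{i1}\|^2)} = 1, \\
e &= \sum_{i=1}^{N} a_{i1}\, r_{i1} = \sum_{i=1}^{N} x_i, \\
\frac{\partial e}{\partial x_i} &= I
\quad\Longrightarrow\quad
\frac{d\ell}{dx_i} = \frac{d\ell}{de}.
\end{align*}
```

Dividing by N turns the same expression into average pooling.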

Methods Overview

Domain Transfer
• The Residual Encoding representation: $e_k = \sum_i a_{ik} r_{ik}$
• For a visual feature $x_i$ that appears frequently in the data:
  • It is likely to be close to a visual center $c_k$
  • $e_k$ is close to zero, since $r_{ik} = x_i - c_k \approx 0$
  • $e_j$ ($j \neq k$) is close to zero, since $a_{ij} = \frac{\exp(-s_j \|r_{ij}\|^2)}{\sum_{l=1}^{K} \exp(-s_l \|r_{il}\|^2)} \approx 0$
• The Residual Encoding discards the frequently appearing features, which are likely to be domain-specific (useful for fine-tuning pre-trained features)
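The argument on this slide can be illustrated numerically. In the sketch below, `contribution` (a name assumed for this example) returns the terms a_ik r_ik that a single feature adds to each e_k; the codewords and features are toy values.

```python
import math


def contribution(x, codewords, smoothing):
    """Per-feature terms a_ik * r_ik that one feature x adds to each e_k."""
    residuals = [[xj - cj for xj, cj in zip(x, c)] for c in codewords]
    logits = [-s * sum(r * r for r in res) for s, res in zip(smoothing, residuals)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [[e / total * r for r in res] for e, res in zip(exps, residuals)]


codewords = [[0.0, 0.0], [3.0, 3.0]]
smoothing = [1.0, 1.0]

frequent = contribution([0.01, 0.0], codewords, smoothing)  # sits next to the first codeword
rare = contribution([1.5, 1.5], codewords, smoothing)       # halfway between the codewords

largest = lambda terms: max(abs(v) for row in terms for v in row)
print(largest(frequent), largest(rare))
```

The feature next to a codeword contributes almost nothing to any e_k (small residual for its own codeword, near-zero weight for the other), while the feature far from both codewords dominates the encoding.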

Experiments
• Datasets
  • Gold-standard material & texture datasets: MINC-2500, KTH, FMD
  • 2 recent datasets: GTOS, Light Field
  • General recognition datasets: MIT-Indoor, Caltech-101
• Baseline approaches (off-the-shelf)
  • FV-SIFT (128 Gaussian components, dimension reduced 32K → 512)
  • FV-CNN (Cimpoi et al., pre-trained VGG-VD & ResNet, 32-component GMM)

Dataset Examples

Deep-TEN Architecture

Comparing to the Baselines

Multi-size Training (using different image sizes)
• Deep-TEN ideally accepts arbitrary input sizes (larger than a constant)
• Train with predefined sizes iteratively in different epochs, without modifying the solver
• Adopt single-size testing for simplicity
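One way to read "predefined sizes iteratively in different epochs" is a schedule that cycles through the sizes, one per epoch. The sketch below shows only that schedule. The function name `size_schedule`, the strict alternation, and the two sizes are illustrative assumptions, since the slides do not state them.

```python
from itertools import cycle


def size_schedule(sizes, num_epochs):
    """Assign one training image size to each epoch, cycling through `sizes`."""
    picker = cycle(sizes)
    return [next(picker) for _ in range(num_epochs)]


# Illustrative sizes only; the slides do not give the actual values.
schedule = size_schedule([352, 320], num_epochs=5)
for epoch, size in enumerate(schedule):
    print(f"epoch {epoch}: train on {size}x{size} crops")
```

Since the Encoding Layer outputs a fixed-length vector for any input size, only the data loader would need to change between epochs; the solver and the layers after the encoder stay as they are.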

Comparing to State-of-the-Art
• Prior approaches
  • (1) rely on assembling features
  • (2) adopt an additional SVM classifier for classification

Extra Thoughts
• So many labeled datasets: object recognition, scene understanding, material recognition
• How to benefit from them?
  • Simply merging datasets (different labeling strategies)
  • Sharing convolutional features (domain-transfer problem)

Joint Encoding
Diagram: shared Conv Layers with two Encoding Layers, E1 for CIFAR and E2 for STL
• Multi-task learning
• The Encoding Layer carries the domain-specific information
• Convolutional layers are generic
• Joint training on two datasets
  • CIFAR-10 (50,000 training images of size 36×36)
  • STL-10 (5,000 training images of size 96×96)

Experimental Results for Joint Training
• Joint training on two datasets (simple network architecture)
  • CIFAR-10 (50,000 training images of size 36×36)
  • STL-10 (5,000 training images of size 96×96)
• The state of the art for CIFAR-10 is 95.4%, using a 1,001-layer ResNet (He et al., ECCV 2016)

Summary
• Proposed a new model
  • Integrates the entire dictionary learning and encoding into a single CNN layer
  • Generalizes residual encoders (VLAD, FV), is suitable for texture recognition, and achieved state-of-the-art results
• Introduced a new CNN architecture
  • Makes the deep learning framework more flexible by allowing arbitrary input image sizes
  • Carries domain-specific information and makes the learned features easier to transfer

Thank you!
• We provide an efficient Torch implementation with a CUDA backend at https://github.com/zhanghang1989/Deep-Encoding
