Understanding image representations by measuring their equivariance and equivalence
Karel Lenc, Andrea Vedaldi
Visual Geometry Group, Department of Engineering Science
Representations for image understanding

[Diagram: image space → representation φ → feature space → classifier → semantic space; images x map to features φ(x) and then to labels such as "bike" or "dog"]

Ultimate goal of a representation: simplify a task such as image classification.

Many representations:
▶ Local image descriptors: SIFT [Lowe 04], HOG [Dalal et al. 05], SURF [Bay et al. 06], LBP [Ojala et al. 02], …
▶ Feature encoders: BoVW [Sivic et al. 02], Fisher Vector [Perronnin et al. 07], VLAD [Jegou et al. 10], sparse coding, …
▶ Deep convolutional neural networks: [Fukushima 1974–1982, LeCun et al. 89, Krizhevsky et al. 12, …]
Design of representations

Many designs are empirical; the main theoretical design principle is invariance.

[Diagram: image space → representation φ → feature space; x and gx map to the same feature]

Invariant: φ(x) = φ(gx)
Design of representations

However, many representations such as HOG are not invariant, even to simple transformations.

[Diagram: image space → HOG → feature space; x and gx map to different features]

Not invariant: φ(x) ≠ φ(gx)
Design of representations

But they often transform in a simple and predictable manner.

[Diagram: image space → HOG → feature space; the feature of gx is a transformed version of the feature of x]

Equivariant: ∀x: φ(gx) = M_g φ(x)
Design of representations

But what happens with more complex transformations, like affine ones?

[Diagram: image space → HOG → feature space; is there a map relating HOG(x) and HOG(gx)?]
Design of representations

What happens with more complex representations, like CNNs? Invariance of CNN representations was studied in [Goodfellow et al. 09] and [Zeiler, Fergus 13].

[Diagram: image space → CNN → feature space; is there a map relating CNN(x) and CNN(gx)?]

Contribution: transformations in CNNs
Representation properties

▶ Equivariance? How does a representation φ reflect image transformations?
When are two representations the same?

Learning representations means that there is an endless number of them: variants obtained by learning on different datasets, or from different local optima.

[Diagram: two representations of the same image, CNN-A giving φ and CNN-B giving φ′]

Equivalence: φ′(x) = E_{φ→φ′} φ(x)
Representation properties

▶ Equivariance? How does a representation φ reflect image transformations?
▶ Equivalence? Do different representations φ and φ′ have different meanings?
Finding equivariance empirically

Regularized linear regression:

φ(gx) ≈ M_g φ(x), with M_g φ(x) = A_g φ(x) + b_g (learned empirically)
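A minimal sketch of how such an affine map could be fit with ridge regression, assuming the features for the original and transformed images have already been extracted and flattened into vectors; the regularization weight and array shapes are illustrative, not the paper's exact settings:

```python
import numpy as np

def fit_equivariant_map(feats, feats_g, lam=0.1):
    """Fit an affine map M_g: phi(x) -> phi(gx) by ridge regression.

    feats, feats_g: (n_samples, d) arrays holding phi(x_i) and phi(g x_i),
    each feature flattened into a d-dimensional vector.
    Returns A_g (d, d) and b_g (d,) such that phi(gx) ~= A_g phi(x) + b_g.
    """
    n, d = feats.shape
    X = np.hstack([feats, np.ones((n, 1))])          # append a bias column
    # Closed-form ridge solution: W = (X^T X + lam I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ feats_g)
    A_g, b_g = W[:-1].T, W[-1]
    return A_g, b_g

def apply_map(A_g, b_g, feat):
    """Predict phi(gx) from phi(x)."""
    return feat @ A_g.T + b_g
```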
Finding equivariance empirically

Convolutional structure:

φ(gx) ≈ M_g φ(x), with M_g φ(x) = A_g φ(x) + b_g (learned empirically)

A_g: spatial permutation of feature cells followed by convolution with a 1 × 1 filter bank
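The structured variant constrains the unrestricted matrix A_g to a spatial permutation of feature cells followed by channel mixing, which acts like a 1 × 1 filter bank. A minimal numpy sketch, assuming the feature map is stored as an H × W × C array and that the permutation induced by the transformation g is given:

```python
import numpy as np

def structured_map(feat, perm, F, bias):
    """Apply a structured equivariant map to a spatial feature map.

    feat: (H, W, C) feature map phi(x) (e.g. HOG cells or a conv layer).
    perm: (H*W,) permutation giving, for each output cell, the index of the
          input cell it copies from (the spatial part of the transformation).
    F:    (C, C) 1x1 filter bank mixing channels at every cell.
    bias: (C,) bias term b_g.
    Returns an (H, W, C) estimate of phi(gx).
    """
    H, W, C = feat.shape
    cells = feat.reshape(H * W, C)[perm]     # spatial permutation of cells
    out = cells @ F.T + bias                 # per-cell channel mixing (1x1 conv)
    return out.reshape(H, W, C)
```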
Finding equivariance empirically

HOG features, rotation 45°

[Figure: input x, transformed input gx, HOG features φ(x) and φ(gx), and the prediction M_g φ(x)]
Finding equivariance empirically

HOG features inverted for visualization with MIT HOGgles [Vondrick et al. 13], rotation 45°

[Figure: HOGgles inversions of φ(x), φ(gx), and the prediction M_g φ(x)]
Finding equivariance empirically

HOG features inverted for visualization with MIT HOGgles [Vondrick et al. 13], 1.25× upscale

[Figure: HOGgles inversions of φ(x), φ(gx), and the prediction M_g φ(x)]
Equivariance of representations: findings

▶ Transformations: scaling, rotation, flipping, translation
▶ Equivariant representations: HOG
Finding equivariance empirically: CNN case

[Diagram: input x → convolutional layers 1–5 → fully-connected layers → label y ("dog")]

We run the same analysis on a typical CNN architecture:
▶ AlexNet [Krizhevsky et al. 12]
▶ 5 convolutional layers + fully-connected layers
▶ Trained on ImageNet ILSVRC
Learning mappings empirically: CNN case

[Diagram: original network x → conv 1–5 → FC → label y ("dog"); below, the network is cut at an intermediate layer, the features φ(x) are fed into the remaining layers, and a classification loss is applied]
Learning mappings empirically: CNN case

[Diagram: original network x → conv 1–5 → FC → label y ("dog"); below, the transformed image gx passes through conv 1–3 to give φ(gx), the map M_g^{-1} (learned empirically, here at Conv3) is applied, and the result is fed through conv 4–5 and the FC layers to a classification loss]
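A sketch of this setup for AlexNet, assuming a recent torchvision; the split index, channel count, and the parameterization of M_g^{-1} as a plain 1 × 1 convolution (omitting the spatial permutation that a flip would also require) are illustrative simplifications, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

class TransformedAlexNet(nn.Module):
    """AlexNet with a learned map M_g^{-1} spliced in after a chosen conv block.

    The transformed image gx is fed to the lower layers; M_g^{-1} (here a
    1x1 convolution) tries to undo the effect of g on the features so that
    the frozen upper layers and classifier still work.
    """
    def __init__(self, split=6, channels=192):
        super().__init__()
        base = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.lower = base.features[:split]        # conv blocks before the splice
        self.upper = base.features[split:]        # remaining conv blocks
        self.avgpool, self.classifier = base.avgpool, base.classifier
        self.m_g_inv = nn.Conv2d(channels, channels, kernel_size=1)  # learned map
        for p in list(self.lower.parameters()) + list(self.upper.parameters()) \
                + list(self.classifier.parameters()):
            p.requires_grad = False               # only M_g^{-1} is trained

    def forward(self, gx):
        z = self.m_g_inv(self.lower(gx))
        z = self.avgpool(self.upper(z))
        return self.classifier(torch.flatten(z, 1))
```

Only `m_g_inv` receives gradients, so training it with the usual cross-entropy loss on transformed images mirrors the idea on the slide: the map is learned from the task loss rather than by regressing features directly.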
Results: vertical flip

[Bar chart: top-5 error [%] with M_g^{-1} inserted at Conv1–Conv5, comparing four settings: original classifier without the transformation, original classifier with the transformation, M_g^{-1} before training, and M_g^{-1} after training]
Equivariance of representations: findings

▶ Transformations: scaling, rotation, flipping, translation
▶ Equivariant representations: HOG, early convolutional layers in CNNs
▶ Equivariant to a lesser degree: deeper convolutional layers in CNNs
Representation properties

▶ Equivariance? How does a representation φ reflect image transformations?
▶ Equivalence? Do different representations φ and φ′ have different meanings?
Equivalence: CNN transplantation crash course

AlexNet [Krizhevsky et al. 12], same training data, different parametrization.

[Diagram: CNN-A (conv 1–5 + FC) producing features φ; CNN-B (conv 1–5 + FC) producing features φ′]

Are φ and φ′ equivalent features?
Equivalence: CNN transplantation crash course

Same training data, different parametrization.

[Diagram: lower conv layers of CNN-A → stitching layer E (linear convolution) → remaining conv and FC layers of CNN-B → classification loss against label y]

Train the stitching layer with SGD.
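A sketch of such a Franken-network, assuming two AlexNet-style models that expose `.features`, `.avgpool`, and `.classifier`; the split index and the channel count at the stitch point are left as parameters rather than fixed to the paper's choices:

```python
import torch
import torch.nn as nn

class FrankenNet(nn.Module):
    """Stitch the lower layers of net_a onto the upper layers of net_b.

    `split` is the index in .features where the stitching layer E (a 1x1
    convolution) is inserted; `channels` is the number of feature channels
    at that point. Everything except the stitching layer is frozen.
    """
    def __init__(self, net_a, net_b, split, channels):
        super().__init__()
        self.lower = net_a.features[:split]
        self.stitch = nn.Conv2d(channels, channels, kernel_size=1)   # layer E
        self.upper = net_b.features[split:]
        self.avgpool, self.classifier = net_b.avgpool, net_b.classifier
        for module in (self.lower, self.upper, self.classifier):
            for p in module.parameters():
                p.requires_grad = False       # only the stitching layer is trained

    def forward(self, x):
        z = self.upper(self.stitch(self.lower(x)))
        return self.classifier(torch.flatten(self.avgpool(z), 1))

# Only the stitching layer's parameters go to the optimizer, e.g.:
# opt = torch.optim.SGD(model.stitch.parameters(), lr=1e-2, momentum=0.9)
```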
Franken-network: stitch CNN-A → CNN-B

Training data is the same, but the parametrization is entirely different.

[Bar chart: top-5 error [%] of the stitched network with the stitching layer E_{φ→φ′} placed at Conv1–Conv5, comparing the CNN-B baseline, the stitched network before training E, and after training E]
Equivalence of similar architectures

Compare training on the same or different data.

[Diagram: CNN-IMNET (conv 1–5 + FC) trained on the ILSVRC12 dataset; CNN-PLACES (conv 1–5 + FC) trained on the Places dataset]
Franken-network: stitch CNN-PLACES → CNN-IMNET

Now even the training sets differ.

[Bar chart: top-5 error [%] of the stitched network with the stitching layer E_{φ→φ′} placed at Conv1–Conv5, comparing the CNN-IMNET baseline, the stitched network before training E, and after training E]
Example application: structured-output pose detection

Equivariant maps → transform features instead of images:

g* = argmax_g ⟨w, φ(g^{-1} x)⟩ = argmax_g ⟨w, M_g^{-1} φ(x)⟩

Allows a significant speedup at test time.
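A sketch of why this is faster, with a hypothetical `extract_features` function and a precomputed dictionary of maps M_g^{-1}; the expensive feature extraction runs once per image rather than once per pose hypothesis:

```python
def detect_pose(x, w, extract_features, maps_inv):
    """Score pose hypotheses g from a single feature extraction.

    Naively one would recompute features for every warped image g^-1 x.
    With an equivariant representation, phi(g^-1 x) ~= M_g^-1 phi(x), so each
    hypothesis costs only a matrix-vector product and a dot product.

    w: linear scoring template (numpy vector).
    extract_features: x -> flattened feature vector (e.g. HOG), computed once.
    maps_inv: dict mapping each hypothesis g to its matrix M_g^-1.
    Returns the highest-scoring hypothesis g*.
    """
    feat = extract_features(x)                       # computed once per image
    scores = {g: w @ (M_inv @ feat) for g, M_inv in maps_inv.items()}
    return max(scores, key=scores.get)
```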
Conclusions

Representing geometry
▶ Beyond invariance: equivariance
▶ Transforming the image results in a simple and predictable transformation of HOG and early CNN layers
▶ Application to accelerated structured-output regression

Representation equivalence
▶ CNNs trained from different random seeds are very different, but only on the surface
▶ Early CNN layers are interchangeable even between tasks

General idea
▶ Study mathematical properties of representations empirically