Image Space Embeddings and Generalized Convolutional Neural Networks
Nate Strawn, Georgetown University
September 20th, 2019
Table of Contents
1. Introduction
2. Smooth Image Space Embeddings
3. Example: Dictionary Learning
4. Convolutional Neural Networks
5. Proofs and Conclusion
Introduction
Inspiration

“When I multiply numbers together, I see two shapes. The image starts to change and evolve, and a third shape emerges. That’s the answer. It’s mental imagery. It’s like maths without having to think.” – Daniel Tammet [6]
Idea

Idea: Embed data into spaces of “smooth” functions over graphs, thereby extending graphical processing techniques to arbitrary datasets.

Given $X = \{x_i\}_{i=1}^N \subset \mathbb{R}^d$, we construct a map
\[
\mathbb{R}^d \ni x \;\overset{\Phi_X}{\longmapsto}\; \Phi_X(x) \in \mathbb{R}^G.
\]
Implications

• With $G = I_r = \big(\{0, 1, \ldots, r-1\},\ \{(k-1, k)\}_{k=1}^{r-1}\big)$, $\Phi_X$ maps into functions over an interval
• With $G = I_r \times I_r$, $\Phi_X$ maps into $r$ by $r$ images
• Wavelet/Curvelet/Shearlet dictionaries for images induce dictionaries for arbitrary datasets
• Convolutional Neural Networks can be applied to arbitrary datasets in a principled manner
Example: Kernel Image Space Embeddings of Tumor Data

[Figure: kernel image space embeddings, panels titled “Benign Tumors” and “Malignant Tumors”]
Smooth Image Space Embeddings
Image Space Embeddings

We will call any isometry $\Phi : \mathbb{R}^d \to C^\infty([0,1]^2)$ or $\Phi : \mathbb{R}^d \to \mathbb{R}^r \otimes \mathbb{R}^r$ an image space embedding.

• $C^\infty([0,1]^2)$ is identified with the space of smooth images with the (incomplete) norm
\[
\|f\|_{L^2([0,1]^2)}^2 = \int_0^1 \int_0^1 f(x, y)^2 \, dx \, dy
\]
• $\mathbb{R}^r \otimes \mathbb{R}^r$ is identified with the space of $r$ by $r$ matrices, or $r$ by $r$ digital images, with norm
\[
\|F\|_2^2 = \operatorname{trace}(F^T F).
\]
Smoothness of Image Space Embeddings

We will let $D$ denote:

• the gradient operator on $C^1([0,1]^2)$, or
• the graph derivative $D : \mathbb{R}^V \to \mathbb{R}^E$ for a graph $G = (V, E)$, defined by $(Df)_{(i,j)} = f_i - f_j$ where $f \in \mathbb{R}^V$ and it is assumed that if $(i,j) \in E$ then $(j,i) \notin E$, or
• the discrete differential $D : \mathbb{R}^r \otimes \mathbb{R}^r \to \big(\mathbb{R}^r \otimes \mathbb{R}^{r-1}\big) \oplus \big(\mathbb{R}^{r-1} \otimes \mathbb{R}^r\big)$, which coincides with the graph derivative on a regular $r$ by $r$ grid (a matrix form of this case is sketched below)
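A minimal sketch (not from the slides) of the discrete differential on the $r$ by $r$ grid, assembled as a sparse edge-difference matrix; it assumes images are vectorized in row-major order, and the helper name is illustrative.

```python
import numpy as np
import scipy.sparse as sp

def grid_derivative(r):
    """D mapping vectorized r-by-r images to differences across grid edges."""
    # 1D difference operator on the path graph I_r: (r-1) edges, r vertices.
    d1 = sp.diags([-np.ones(r - 1), np.ones(r - 1)], offsets=[0, 1], shape=(r - 1, r))
    eye = sp.identity(r)
    horizontal = sp.kron(eye, d1)          # differences between horizontal neighbors
    vertical = sp.kron(d1, eye)            # differences between vertical neighbors
    return sp.vstack([horizontal, vertical]).tocsr()

D = grid_derivative(32)                    # discrete differential for 32-by-32 images
L = (D.T @ D).toarray()                    # graph Laplacian L = D^T D (used below)
```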
Smoothness of Image Space Embeddings

Given a dataset $X = \{x_i\}_{i=1}^N \subset \mathbb{R}^d$, we measure the smoothness of an image space embedding of $X$ by the mean quadratic variation:
\[
\operatorname{MQV}(X) = \frac{1}{N} \sum_{i=1}^N \|D(\Phi(x_i))\|^2.
\]
Optimally Smooth Image Space Embeddings

We seek the projection which minimizes the mean quadratic variation over the dataset:
\[
\min_{\Phi} \frac{1}{2N} \sum_{i=1}^N \|D(\Phi(x_i))\|^2
\]
subject to $\Phi$ being a linear isometry.
Optimally Smooth Discrete Image Space Embeddings

Theorem (S.)
Suppose $r^2 \geq d$, let $\{v_j\}_{j=1}^d \subset \mathbb{R}^d$ be the principal components of $X$ (ordered by descending singular values), and let $\{\xi_j\}_{j=1}^{r^2}$ (ordered by ascending eigenvalues) denote an orthonormal basis of eigenvectors of the graph Laplacian $L = D^T D$. Then
\[
\Phi = \sum_{j=1}^d \xi_j v_j^T
\]
solves the optimal mean quadratic variation embedding program.
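A minimal sketch of the theorem's construction under the stated assumptions ($N \geq d$, centered data, $r^2 \geq d$), reusing the `D` and `L` from the grid sketch above; the function names are illustrative.

```python
import numpy as np

def optimal_embedding(X, L):
    """X: (N, d) centered data with N >= d; L: (r^2, r^2) dense grid Laplacian."""
    d = X.shape[1]
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows v_j^T, descending singular values
    _, xi = np.linalg.eigh(L)                          # columns xi_j, ascending eigenvalues
    return xi[:, :d] @ Vt[:d]                          # Phi = sum_j xi_j v_j^T, shape (r^2, d)

def mean_quadratic_variation(X, Phi, D):
    """MQV(X) = (1/N) sum_i ||D(Phi(x_i))||^2 for a sparse edge-difference matrix D."""
    embedded = X @ Phi.T                               # row i is the vectorized image Phi(x_i)
    diffs = (D @ embedded.T).T                         # edge differences of each embedded image
    return np.mean(np.sum(diffs ** 2, axis=1))

# Example usage (hypothetical centered data matrix X_centered):
# Phi = optimal_embedding(X_centered, L)
# print(mean_quadratic_variation(X_centered, Phi, D))
```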
Observations

• The optimal isometry pairs highly variable components in $\mathbb{R}^d$ with low-frequency components in $L^2(G)$.
• On the grid graph, $x \mapsto F$ is computed by taking the PCA scores of $x$, arranging them in an $r$ by $r$ matrix, and applying the inverse discrete cosine transform (sketched below).
• If the data $x_i$ are drawn i.i.d. from a Gaussian, then $\Phi$ maps this Gaussian to a Gaussian process with minimal expected quadratic variation.
• The connection with PCA indicates that we can use kernel PCA to produce nonlinear embeddings into image spaces as well.
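A minimal sketch of the second observation. The frequency ordering below (ascending grid-Laplacian eigenvalue of the 2D DCT modes) is an assumption meant to match the theorem's ordering; ties among eigenvalues can be broken differently in practice.

```python
import numpy as np
from scipy.fft import idctn

def embed_via_idct(x, Vt, r):
    """x: (d,) centered sample; Vt: (d, d) PCA components (rows). Returns an r-by-r image."""
    scores = Vt @ x                                     # PCA scores of x
    lam = 2 - 2 * np.cos(np.pi * np.arange(r) / r)      # path-graph Laplacian eigenvalues
    order = np.argsort((lam[:, None] + lam[None, :]).ravel(), kind="stable")
    coeffs = np.zeros(r * r)
    coeffs[order[:scores.size]] = scores                # lowest frequencies receive the scores
    return idctn(coeffs.reshape(r, r), norm="ortho")    # inverse discrete cosine transform
```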
Optimally Smooth Continuous Image Space Embeddings

Theorem (S.)
Let $\{v_j\}_{j=1}^d \subset \mathbb{R}^d$ be the principal components of $X$ (ordered by descending singular values), and let $\{k_j\}_{j=1}^d$ denote the first $d$ positive integer vectors ordered by non-decreasing norm. Then
\[
\Phi(x) = \sum_{j=1}^d \big(v_j^T x\big)\, \exp\!\big(2\pi i\, (k_j^T \cdot)\big)
\]
solves the optimal mean quadratic variation embedding program
\[
\min_{\Phi} \sum_{i=1}^N \|D\Phi(x_i)\|^2_{L^2_{\mathbb{C}}([0,1]^2)}
\]
subject to $\Phi$ being a complex isometry.
Connection with Regularized PCA

Theorem (S.)
In the discrete case, the solution to the minimum quadratic variation program also provides the optimal $\Phi$ for the program
\[
\min_{C, \Phi} \frac{1}{2}\|X - C\Phi\|_2^2 + \frac{\lambda}{2}\|C D^*\|_2^2 + \frac{\gamma}{2}\|C\|_2^2
\]
subject to $\Phi$ being an isometry.
Example: Dictionary Learning
The Sparse Dictionary Learning Problem

Problem: Given a data matrix $X \in \mathbb{R}^N \otimes \mathbb{R}^d$ with $d$ large, find a linear dictionary $\Phi \in M_{k,d}$ and coefficients $C \in M_{N,k}$ such that $C\Phi \approx X$ and $C$ is sparse/compressible.
Regularized Factorization

The “relaxed” approach attempts to solve the non-convex program:
\[
\min_{C, \Phi} \frac{1}{2}\|X - C\Phi\|_2^2 + \lambda \|C\|_1.
\]
Usual Suspects

\[
\min_{C, \Phi} \frac{1}{2}\|X - C\Phi\|_2^2 + \lambda \|C\|_1
\]

• Impose $\|\phi_i\|_2^2 = 1$ for each row of
\[
\Phi = \begin{pmatrix} -\ \phi_1\ - \\ -\ \phi_2\ - \\ \vdots \\ -\ \phi_k\ - \end{pmatrix}
\]
to deal with the fact that $C\Phi = (qC)\big(\tfrac{1}{q}\Phi\big)$.
• The program has an analytic solution when $C$ is fixed, and is a convex optimization when $\Phi$ is fixed (see the alternating scheme sketched below).
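A minimal sketch (not the authors' algorithm) of the alternating scheme these bullets suggest: a LASSO step in $C$ with $\Phi$ fixed and a least-squares step in $\Phi$ with $C$ fixed, followed by row renormalization. The function name and parameters are illustrative only; practical implementations also handle dead atoms and initialization more carefully.

```python
import numpy as np
from sklearn.linear_model import Lasso

def alternating_dictionary_learning(X, k, lam=0.1, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Phi = rng.standard_normal((k, d))
    Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)        # unit-norm rows
    C = np.zeros((N, k))
    for _ in range(n_iters):
        # Sparse coding: convex LASSO per sample (sklearn scales the fit term by 1/d).
        lasso = Lasso(alpha=lam / d, fit_intercept=False, max_iter=5000)
        lasso.fit(Phi.T, X.T)                                # x_i ~ Phi^T c_i for each sample
        C = lasso.coef_                                      # (N, k) coefficient matrix
        # Dictionary update: least squares in Phi, then renormalize the rows.
        Phi, *_ = np.linalg.lstsq(C, X, rcond=None)
        Phi /= np.linalg.norm(Phi, axis=1, keepdims=True) + 1e-12
    return C, Phi
```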
Algorithms

• Optimization algorithms for supervised and online learning of dictionaries: Mairal et al. [9, 8]
• Good initialization procedures can lead to provable results: Agarwal et al. [1]
Identifiability

• Exact sparse dictionary learning, and even its approximation (up to large factors!), is NP-hard: Tillmann [16]
• Probability model-based learning: Remi and Schnass [11], Spielman et al. [14]
• If the dictionary is incoherent and the coefficients are sufficiently sparse, then the original dictionary is a local minimum: Geng and Wright [5], Schnass [12]
• A full spark matrix is also identifiable given sufficient measurements: Garfinkle and Hillar [4]
Caveats

• Many possible local solutions
• Interpretability?
• Large systems require a large amount of computation!
Tight Frame Dictionaries

Recall that $\{\psi_a\}_{a \in \mathcal{A}} \subset L^2(\mathbb{R}^2)$ is a frame if there are constants $0 < A \leq B$ such that
\[
A\|f\|^2 \leq \sum_{a \in \mathcal{A}} |\langle f, \psi_a \rangle|^2 \leq B\|f\|^2 \quad \text{for all } f \in L^2(\mathbb{R}^2),
\]
where $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ are the inner product and induced norm on $L^2(\mathbb{R}^2)$, respectively. If $A = B$, we say that the frame is tight.
Examples of Tight Frames

• Tensor product wavelet systems
• Curvelets
• Shearlets

Fact: If $\{\psi_a\}_{a \in \mathcal{A}} \subset L^2(\mathbb{R}^2)$ is a tight frame and $\Phi : \mathbb{R}^d \to L^2(\mathbb{R}^2)$ is an isometry, then $\{\Phi^* \psi_a\}_{a \in \mathcal{A}}$ is a tight frame for $\mathbb{R}^d$.
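A minimal numerical check of the discrete analog of this fact (random frame and isometry, dimensions chosen for illustration): pulling a tight frame back through an isometry gives a tight frame with the same bound.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r2, d, A = 200, 64, 10, 2.0

Q, _ = np.linalg.qr(rng.standard_normal((m, r2)))          # orthonormal columns
Psi = np.sqrt(A) * Q                                       # rows form a tight frame with bound A
Phi, _ = np.linalg.qr(rng.standard_normal((r2, d)))        # isometry R^d -> R^{r2}

pullback = Psi @ Phi                                       # rows are Phi^* psi_a
print(np.allclose(pullback.T @ pullback, A * np.eye(d)))   # True: tight frame, same bound
```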
Example: Wisconsin Breast Cancer Dataset

• 569 examples in $\mathbb{R}^{30}$ describing characteristics of cells obtained from biopsy [15]
• Each example is either benign or malignant
• Preprocess by removing medians and rescaling by the interquartile range in each variable (see the sketch below)
• The image space embedding uses $r = 32$ (images are 32 by 32)
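A minimal sketch of this preprocessing using scikit-learn's copy of the dataset; `RobustScaler` removes the per-column median and divides by the interquartile range, matching the description above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import RobustScaler

data = load_breast_cancer()
X = RobustScaler().fit_transform(data.data)   # 569 x 30, zero median and unit IQR per column
y = data.target                               # 0 = malignant, 1 = benign in sklearn's encoding
```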
Minimal Mean Quadratic Variation Behavior

[Figure: PCA scores vs. eigenvalues of the graph Laplacian vs. their product, plotted against component index]

Normalized MMQV ≈ 38
Raw Embeddings of Benign and Malignant Examples

[Figure: image space embeddings of benign tumor data and of malignant tumor data]
LASSO in the Haar Wavelet Induced Dictionary

Using the 2D Haar wavelet transform $W$, we solve
\[
\min_{C} \frac{1}{2}\|X - C W \Phi\|_2^2 + \lambda \|C\|_1,
\]
where $\Phi$ is the image space embedding matrix. Using the BCW dataset, the average MSE is $3.4 \times 10^{-3}$ when $\lambda = 1$.
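A minimal sketch of this program. The Haar matrix below is a separable orthonormal Haar construction (the slides' 2D DWT may be organized differently, but any orthogonal $W$ induces a dictionary the same way); `Phi` and the preprocessed `X` are assumed to come from the earlier sketches, and `alpha = lambda / d` accounts for sklearn's $1/(2\,n_{\text{samples}})$ scaling of the fit term with $\lambda = 1$.

```python
import numpy as np
from sklearn.linear_model import Lasso

def haar_matrix(n):
    """Orthonormal 1D Haar transform matrix; n must be a power of two."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    averages = np.kron(h, [1.0, 1.0])
    details = np.kron(np.eye(n // 2), [1.0, -1.0])
    return np.vstack([averages, details]) / np.sqrt(2.0)

r = 32
H = haar_matrix(r)
W = np.kron(H, H)                        # separable 2D Haar transform, r^2 x r^2
A = (W @ Phi).T                          # induced dictionary as a d x r^2 design matrix

lasso = Lasso(alpha=1.0 / X.shape[1], fit_intercept=False, max_iter=10000)
lasso.fit(A, X.T)                        # one LASSO problem per data point
C = lasso.coef_                          # N x r^2 sparse Haar coefficients
print(np.mean((X - C @ W @ Phi) ** 2))   # average reconstruction MSE
```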
Haar Wavelet Coefficients after LASSO
Inverse DWT of Haar Coefficients
Compression in PCA Basis and Induced Dictionary

Consider best $k$-term approximations of the first 50 members of the BCW dataset using different dictionaries. Compression in the dictionary induced by the Haar wavelet system uses orthogonal matching pursuit.

[Figure: three heatmaps of example index vs. support size. First and second image: relative SSE for $k$-term approximations using the PCA basis and the Haar-induced dictionary. Third image: the first image minus the second.]
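A minimal sketch of this comparison: relative SSE of best $k$-term approximations in the PCA basis versus greedy $k$-term approximations (OMP) in the induced dictionary `A` from the previous sketch; `Vt` holds the PCA components (rows), as in the earlier sketches, and the function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def relative_sse_pca(x, Vt, k):
    scores = Vt @ x
    keep = np.argsort(np.abs(scores))[::-1][:k]          # k largest-magnitude scores
    approx = Vt[keep].T @ scores[keep]
    return np.sum((x - approx) ** 2) / np.sum(x ** 2)

def relative_sse_omp(x, A, k):
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    omp.fit(A, x)
    return np.sum((x - A @ omp.coef_) ** 2) / np.sum(x ** 2)

# Example: the difference heatmap over the first 50 examples and k = 1, ..., 25.
# diffs = [[relative_sse_pca(x, Vt, k) - relative_sse_omp(x, A, k)
#           for k in range(1, 26)] for x in X[:50]]
```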
Comparison with Dictionary Learning

[Figure: relative SSE heatmap (example index vs. support size) for a learned dictionary]

Dictionary learning clearly does better!
Convolutional Neural Networks
Convolutional Neural Networks for Arbitrary Datasets

People already do this in insane ways!
Convolutional Neural Networks for Arbitrary Datasets

• Exploit image structure to better deal with image collections [7]
• Cutting-edge results for image classification tasks
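A minimal PyTorch sketch (hypothetical architecture, not from the slides) of the proposed pipeline: map each raw feature vector to an $r$ by $r$ image with the fixed embedding $\Phi$, then apply an ordinary small CNN.

```python
import torch
import torch.nn as nn

class EmbeddedCNN(nn.Module):
    def __init__(self, Phi, r=32, n_classes=2):
        super().__init__()
        # The image space embedding is fixed (non-trainable).
        self.register_buffer("Phi", torch.as_tensor(Phi, dtype=torch.float32))
        self.r = r
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (r // 4) * (r // 4), n_classes),
        )

    def forward(self, x):                                # x: (batch, d) raw feature vectors
        images = (x @ self.Phi.T).view(-1, 1, self.r, self.r)
        return self.net(images)

# Example usage with the embedding matrix Phi from the earlier sketch:
# model = EmbeddedCNN(Phi)
# logits = model(torch.randn(8, Phi.shape[1]))
```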