Multiscale Sparse Models in Deep Convolutional Networks
Tomás Angles, Roberto Leonarduzzi, Stéphane Mallat, Louis Thiry, John Zarka, Sixin Zhang
Collège de France, École Normale Supérieure, Flatiron Institute
www.di.ens.fr/data
Deep Convolutional Network

• A deep convolutional neural network predicts $y = f(x)$ for $x \in \mathbb{R}^d$ (Y. LeCun): a cascade of layers $L_j$ followed by $\rho$, ending with a linear map onto a low-dimensional output $\tilde f(x)$.
• Each $L_j$ computes spatial convolutions and linear combinations of channels along a scale axis; $\rho(a) = \max(a, 0)$ is the ReLU.
• The $L_j$ are learned by supervised training on n examples $\{x_i, f(x_i)\}_{i \le n}$.
• Exceptional results for images, speech, language, bio-data, quantum chemistry regressions, ...
• How does it reduce dimensionality? Multiscale structure, sparsity, invariants.
Statistical Models from 1 Example (M. Bethge et al.)

• Supervised network training (e.g. on ImageNet).
• For 1 realisation x of X, compute each layer of the network.
• Compute correlation statistics of the network coefficients.
• Synthesize $\tilde x$ having similar statistics: x has about $6 \cdot 10^4$ pixels, $\tilde x$ is constrained by about $2 \cdot 10^5$ correlations.
• What mathematical interpretation?
Learned Generative Networks

• Wasserstein autoencoder trained on n examples $\{x_i\}_{i \le n}$: an encoder $\Phi$ maps X to a code $Z = \Phi(X)$ close to a Gaussian white noise, and a decoder G, built from layers $W_j$ and $\rho$, reconstructs $\tilde X = G(Z)$.
• Network trained on bedroom images: interpolating codes $Z = \alpha Z_1 + (1 - \alpha) Z_2$ produces a linearization of deformations.
• Network trained on faces of celebrities: $G(Z)$ generates realistic faces.
• What mathematical interpretation?
Image Classification: ImageNet 2012

• 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
• Top-5 error: Alex-Net about 20%, ResNet about 10%.
Scale Separation and Interactions

• Dimension reduction: interactions of d bodies represented by x(u): particles, pixels, ...
• Multiscale regrouping of the interactions of d bodies into interactions of $O(\log d)$ groups, with interactions across scales.
• Scale separation ⇒ wavelet transforms.
• How to capture scale interactions? Critical harmonic analysis problems since the 1970's.
Overview

• Scale separation with wavelets and interactions through the phase
• Linear scale interaction models:
  – Compressive signal approximations
  – Stochastic models of stationary processes
• Non-linear scale interaction models with sparse dictionaries:
  – Generative autoencoders
  – Classification of ImageNet
• All these roads lead to convolutional neural networks...
Scale Separation with Wavelets

• Wavelet filter $\psi(u)$: complex, with real and imaginary parts (2 phases), rotated and dilated as $\psi_\lambda(u) = 2^{-2j} \psi(2^{-j} r_\theta u)$, with Fourier transform $\hat\psi_{2^j,\theta}(\omega)$ covering a band of frequencies.
• Wavelet transform: $Wx = \big( x \star \phi_{2^J},\ x \star \psi_\lambda \big)_\lambda$, invertible since $\widehat{x \star \psi_\lambda}(\omega) = \hat x(\omega)\, \hat\psi_\lambda(\omega)$.
• Problem: the coefficients are zero-mean with no correlations across scales:
  $\sum_u x \star \psi_\lambda(u)\, x \star \psi_{\lambda'}(u)^* = \sum_\omega |\hat x(\omega)|^2\, \hat\psi_\lambda(\omega)\, \hat\psi_{\lambda'}(\omega)^* \approx 0$ if $\lambda \ne \lambda'$.
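As a rough illustration of this construction, here is a minimal numpy sketch of a Morlet-like wavelet filter bank and of the coefficients $x \star \psi_\lambda$ computed in the Fourier domain; the parameters (xi, sigma) and the normalization are illustrative choices, not the filters used in the talk.

```python
import numpy as np

def morlet_filter_2d(N, scale, theta, xi=3 * np.pi / 4, sigma=0.8):
    """Fourier transform of an (approximately zero-mean) Morlet-like wavelet,
    dilated by `scale` and rotated by `theta`, on an N x N grid."""
    omega = 2 * np.pi * np.fft.fftfreq(N)            # frequencies in [-pi, pi)
    w1, w2 = np.meshgrid(omega, omega, indexing="ij")
    # Rotate and dilate the frequency variable
    wr1 = scale * (np.cos(theta) * w1 + np.sin(theta) * w2)
    wr2 = scale * (-np.sin(theta) * w1 + np.cos(theta) * w2)
    gauss = lambda a, b: np.exp(-(a ** 2 + b ** 2) * sigma ** 2 / 2)
    # Gabor atom minus a correction so that psi_hat(0) = 0 (zero mean)
    kappa = gauss(xi, 0.0) / gauss(0.0, 0.0)
    return gauss(wr1 - xi, wr2) - kappa * gauss(wr1, wr2)

def wavelet_transform(x, J=3, n_theta=4):
    """Complex wavelet coefficients x * psi_{j,theta}, computed in Fourier."""
    N = x.shape[0]
    x_hat = np.fft.fft2(x)
    coeffs = {}
    for j in range(J):
        for t in range(n_theta):
            psi_hat = morlet_filter_2d(N, 2.0 ** j, np.pi * t / n_theta)
            coeffs[(j, t)] = np.fft.ifft2(x_hat * psi_hat)   # complex field
    return coeffs
```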
Wavelet Transform Filter Cascade

• (Figure: the wavelet transform computed as a filter cascade along the scale axis, from $2^0$ to $2^J$.)
• How to capture multiscale similarities? With a ReLU and the phase.
Rectified Wavelet Coefficients

• Multiphase real wavelets: $\psi_{\alpha,\lambda} = \mathrm{Real}(e^{-i\alpha} \psi_\lambda)$.
• Rectified with $\rho(a) = \max(a, 0)$: $Ux = \big( \rho(x \star \psi_{\alpha,\lambda}),\ x \star \phi_{2^J} \big)_{\alpha,\lambda}$: convolutional network coefficients.
• Linearly invertible: $x = U^{-1} Ux$ with $U^{-1}$ linear, since $\rho(a) + \rho(-a) = a$.
• The ReLU creates non-zero means and correlations across scales:
  $\sum_u \rho(x \star \psi_{\alpha,\lambda}(u))$ and $\sum_u \rho(x \star \psi_{\alpha,\lambda}(u))\, \rho(x \star \psi_{\alpha',\lambda'}(u))$.
Linear Rectifiers Act on the Phase

• $Ux(u, \alpha, \lambda) = \rho\big(x \star \mathrm{Real}(e^{i\alpha} \psi_\lambda)\big) = \rho\big(\mathrm{Real}(e^{i\alpha}\, x \star \psi_\lambda)\big)$, with $x \star \psi_\lambda = |x \star \psi_\lambda|\, e^{i\varphi(x \star \psi_\lambda)}$.
• Since $\rho$ is homogeneous ($\rho(\alpha a) = \alpha \rho(a)$ if $\alpha > 0$): $Ux(u, \alpha, \lambda) = |x \star \psi_\lambda|\, \rho\big(\cos(\alpha + \varphi(x \star \psi_\lambda))\big)$.
• Phase harmonics: $\forall z = |z| e^{i\varphi(z)} \in \mathbb{C}$, $[z]^k := |z|\, e^{ik\varphi(z)}$.
• A ReLU computes phase harmonics. Theorem: the Fourier transform along the phase $\alpha$ satisfies
  $\widehat{U}x(u, k, \lambda) = \hat\gamma(k)\, |x \star \psi_\lambda(u)|\, e^{ik\varphi(x \star \psi_\lambda(u))}$, with $\gamma(\alpha) = \rho(\cos\alpha)$, for any homogeneous non-linearity $\rho$.
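A small numerical check of this theorem, assuming only numpy: sample the phase $\alpha$, apply the ReLU, and verify that the discrete Fourier transform along $\alpha$ reproduces $\hat\gamma(k)\, [z]^k$ up to aliasing. The number of phase samples A is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 256                                   # number of phase samples
alpha = 2 * np.pi * np.arange(A) / A

z = rng.standard_normal() + 1j * rng.standard_normal()  # one coefficient x*psi_lambda(u)
relu = lambda a: np.maximum(a, 0.0)

# Ux(alpha) = rho(Real(e^{i alpha} z)) = |z| rho(cos(alpha + phase(z)))
u = relu(np.real(np.exp(1j * alpha) * z))

# Fourier transform along the phase variable alpha
u_hat = np.fft.fft(u) / A
gamma_hat = np.fft.fft(relu(np.cos(alpha))) / A          # gamma(alpha) = rho(cos alpha)

# Theorem: u_hat[k] ~= gamma_hat[k] * |z| * e^{i k phase(z)} = gamma_hat[k] * [z]^k
k = np.fft.fftfreq(A, d=1.0 / A)                         # integer harmonic indices
phase_harmonics = np.abs(z) * np.exp(1j * k * np.angle(z))
print(np.max(np.abs(u_hat - gamma_hat * phase_harmonics)))   # small residual (aliasing)
```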
Frequency Transpositions

• Phase harmonics: $[x \star \psi_\lambda]^k = |x \star \psi_\lambda(u)|\, e^{ik\varphi(x \star \psi_\lambda(u))}$.
• The harmonic of order k performs a non-linear frequency dilation / transposition, with no time dilation.
• $\widehat{x \star \psi_\lambda}(\omega)$ and $\widehat{x \star \psi_{\lambda'}}(\omega)$ are not correlated, but the phase harmonic shifts the spectrum of $x \star \psi_\lambda$ from around $\lambda$ to around $k\lambda$ ($k = 1, 2, 3$), so it becomes correlated with $x \star \psi_{\lambda'}$ if $k\lambda \approx \lambda'$.
Scale Transposition with Harmonics

• In 2D, replacing the phase $\varphi(x \star \psi_{j,\theta}(u))$ of $|x \star \psi_{j,\theta}(u)|$ by $k\,\varphi(x \star \psi_{j,\theta}(u))$ (e.g. $k = 2$) transposes its frequency support to a different scale j (figure: frequency supports before and after the transposition).
• The transposed coefficients become correlated with the wavelet coefficients at that scale: phase harmonics perform frequency transpositions across scales.
Linear Prediction Across Scales / Frequencies

• ReLU means and correlations, invariant to translations:
  $M(\alpha, \lambda) = d^{-1} \sum_u \rho(x \star \psi_{\alpha,\lambda}(u))$
  $C(\alpha, \lambda, \alpha', \lambda') = d^{-1} \sum_u \rho(x \star \psi_{\alpha,\lambda}(u))\, \rho(x \star \psi_{\alpha',\lambda'}(u))$
• Define a linear autoregressive model from low to high frequencies: $\rho(x \star \psi_{\alpha',\lambda'})$ at the coarser scale $\lambda'$ gives a linear prediction of $\rho(x \star \psi_{\alpha,\lambda})$ at the finer scale $\lambda$.
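A minimal numpy sketch of these translation-invariant moments, taking as input complex wavelet coefficients (for instance computed as in the earlier sketch); the phase convention $e^{i\alpha}$ and the set of phases are illustrative assumptions.

```python
import numpy as np
from itertools import product

def rectified_moments(coeffs, alphas=(0, np.pi / 2, np.pi, 3 * np.pi / 2)):
    """Translation-invariant means M(alpha, lam) and correlations
    C(alpha, lam, alpha', lam') of rectified wavelet coefficients.
    `coeffs` maps a scale/orientation index lam to the complex field x*psi_lam."""
    relu = lambda a: np.maximum(a, 0.0)
    # rho(x * psi_{alpha, lam}) = rho(Real(e^{i alpha} x * psi_lam))
    rect = {(a, lam): relu(np.real(np.exp(1j * a) * c))
            for lam, c in coeffs.items() for a in alphas}
    M = {key: r.mean() for key, r in rect.items()}
    C = {(k1, k2): (rect[k1] * rect[k2]).mean()
         for k1, k2 in product(rect, rect)}
    return M, C

# Demo with random complex fields standing in for wavelet coefficients
rng = np.random.default_rng(0)
coeffs = {(j, t): rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32))
          for j in range(3) for t in range(4)}
M, C = rectified_moments(coeffs)
print(len(M), len(C))
```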
Compressive Reconstructions (Gaspar Rochette, Sixin Zhang)

• If $x \star \psi_\lambda$ is sparse, then x is recovered from $m \ll d$ phase harmonic means $Mx$ and covariances $Cx$:
  $\tilde x = \arg\min_y \big\| Cx - Cy + (Mx - My)(Mx - My)^* \big\|^2$
• (Figure: PSNR in dB versus $\log_{10}(m/d)$.)
• The approximation rate is optimal for total variation signals: $\| x - \tilde x \| \sim m^{-2}$.
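One way such a reconstruction can be computed in practice is by gradient descent on the moment-matching objective above. Here is a minimal sketch assuming PyTorch, with a fixed random filter bank standing in for the phase harmonic wavelet moments and a toy step signal as the target; it is not the implementation used in the talk.

```python
import math
import torch

def moments(x, filters, alphas=(0.0, math.pi / 2, math.pi, 3 * math.pi / 2)):
    """Stand-in for the phase harmonic moments: means and correlations of
    rectified responses to a bank of fixed (here random) complex filters."""
    resp = torch.fft.ifft(torch.fft.fft(x) * filters)          # complex "wavelet" responses
    rect = torch.cat([torch.relu(torch.real(
        torch.exp(torch.tensor(1j * a, dtype=torch.complex64)) * resp)) for a in alphas])
    M = rect.mean(dim=-1)                                       # means
    C = (rect[:, None, :] * rect[None, :, :]).mean(dim=-1)      # correlations
    return M, C

torch.manual_seed(0)
d, n_filters = 256, 8
filters = torch.randn(n_filters, d, dtype=torch.complex64)     # fixed random filter bank
x = torch.zeros(d); x[d // 3: 2 * d // 3] = 1.0                # toy target signal (a step)
Mx, Cx = moments(x, filters)

y = torch.randn(d, requires_grad=True)                          # start from white noise
opt = torch.optim.Adam([y], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    My, Cy = moments(y, filters)
    diff = Mx - My
    loss = ((Cx - Cy + diff[:, None] * diff[None, :]) ** 2).sum()
    loss.backward()
    opt.step()
print(float(loss))   # reconstruction error shrinks as more moments are matched
```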
Compressive Reconstructions

• (Figure: PSNR in dB versus $\log_{10}(m/d)$.)
• The approximation rate is optimal for total variation signals: $\| x - \tilde x \| \sim m^{-1}$.
Gaussian Models of Stationary Processes

• What stochastic models for turbulence? A realisation x has $d = 6 \cdot 10^4$ pixels.
• Kolmogorov model: from the d empirical second-order moments $d^{-1} \sum_u x(u)\, x(u - \tau)$, build a Gaussian model $\tilde x$ with the same power spectrum.
• No correlation is captured across scales and frequencies: the phases are random.
• How to capture non-Gaussianity and long range interactions?
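A minimal numpy sketch of this random-phase construction: keep the empirical spectrum $|\hat x|$ and replace the phases by those of a white noise, which destroys all correlations across scales and frequencies. The random image standing in for turbulence data is a placeholder.

```python
import numpy as np

def gaussian_same_spectrum(x, rng=np.random.default_rng(0)):
    """Random-phase model: a stationary field with the same empirical power
    spectrum as x, but with the (Hermitian-symmetric) phases of a white noise."""
    x_hat = np.fft.fft2(x - x.mean())
    noise_hat = np.fft.fft2(rng.standard_normal(x.shape))
    surrogate = np.fft.ifft2(np.abs(x_hat) * np.exp(1j * np.angle(noise_hat))).real
    return x.mean() + surrogate

x = np.random.default_rng(1).standard_normal((64, 64))   # stand-in for a turbulence image
x_tilde = gaussian_same_spectrum(x)
# x_tilde has the same power spectrum but random phases, so correlations
# across scales and frequencies are not reproduced.
```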
Models of Stationary Processes (Sixin Zhang)

• If X is ergodic then the empirical moments converge as $d \to \infty$:
  $d^{-1} \sum_u \rho(X \star \psi_{\alpha,\lambda}(u)) \to \mathbb{E}\big(\rho(X \star \psi_{\alpha,\lambda})\big)$
  $d^{-1} \sum_u \rho(X \star \psi_{\alpha,\lambda}(u))\, \rho(X \star \psi_{\alpha',\lambda'}(u)) \to \mathbb{E}\big(\rho(X \star \psi_{\alpha,\lambda})\, \rho(X \star \psi_{\alpha',\lambda'})\big)$
• Stationary processes are conditioned by these translation-invariant moments.
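A toy numpy illustration of this convergence, assuming a simple ergodic process (a moving average of white noise) and a Haar-like difference filter standing in for $\psi_{\alpha,\lambda}$: the empirical moment stabilizes as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = np.array([1.0, -1.0])              # toy real "wavelet" (difference filter)
relu = lambda a: np.maximum(a, 0.0)

for d in (10 ** 2, 10 ** 4, 10 ** 6):
    # A simple ergodic stationary process: moving average of white noise
    w = rng.standard_normal(d + 4)
    X = np.convolve(w, np.ones(5) / 5, mode="valid")
    X_psi = np.convolve(X, psi, mode="valid")
    print(d, relu(X_psi).mean())          # empirical moment converges as d grows
```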
Ergodic Stationary Processes (Sixin Zhang)

• Turbulence image x with $d = 6 \cdot 10^4$ pixels; model $\tilde x$ synthesised from $m = 3 \cdot 10^3$ moments.
• Phase coherence is captured.
• Same quality as with learned deep networks, with far fewer moments.
Multifractal Models without High-Order Moments (Roberto Leonarduzzi)

• Financial time series: S&P 500 returns.
• Multifractal properties: $\mathbb{E}[|X \star \psi_\lambda|^q] \sim 2^{j\zeta(q)}$: the model reproduces high-order moments.
• Probability distribution $P(|x|)$.
• Leverage correlation $L(\tau) = \mathbb{E}[|X(t + \tau)|^2 X(t)]$: time asymmetry.
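A minimal numpy sketch of how the exponents $\zeta(q)$ can be estimated from the scaling of the moments across scales, using increments as a crude stand-in for wavelet coefficients; the demo uses Brownian motion, for which $\zeta(q) = q/2$.

```python
import numpy as np

def zeta_estimate(x, qs=(1, 2, 3, 4), J=6):
    """Estimate multifractal exponents zeta(q) from E[|X * psi_j|^q] ~ 2^{j zeta(q)},
    with increments at scale 2^j standing in for wavelet coefficients."""
    moments = np.empty((len(qs), J))
    for j in range(J):
        detail = x[2 ** j:] - x[:-2 ** j]            # increments at scale 2^j
        for iq, q in enumerate(qs):
            moments[iq, j] = np.mean(np.abs(detail) ** q)
    # Linear regression of log2 E[|X*psi_j|^q] against j gives zeta(q)
    js = np.arange(J)
    return {q: np.polyfit(js, np.log2(moments[iq]), 1)[0] for iq, q in enumerate(qs)}

# Demo on Brownian motion, where zeta(q) = q/2
rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(2 ** 16))
print(zeta_estimate(x))
```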
Learned Generative Networks

• Variational autoencoder trained on n examples $\{x_i\}_{i \le n}$: an encoder $\Phi$ maps X to a nearly Gaussian white code $Z = \Phi(X)$, and a decoder G, built from layers $W_j$ and $\rho$, reconstructs $\tilde X = G(Z)$.
• Network trained on bedroom images: interpolating codes $Z = \alpha Z_1 + (1 - \alpha) Z_2$ produces a linearization of deformations.
• The encoder must be Lipschitz continuous to the action of deformations.
• How to build such autoencoders?
Averaged Rectified Wavelets

• Spatial averaging at a large scale $2^J$:
  $S_J x = Ux \star \phi_J = \big( \rho(x \star \psi_{\alpha,\lambda}) \star \phi_J(2^J n),\ x \star \phi_{2^J}(2^J n) \big)_{\alpha,\lambda}$
• Scale separation and spatial averaging with $\phi_J$: Gaussianization, and linearization of small deformations.
• Theorem: if $D_\tau x(u) = x(u - \tau(u))$ then $\lim_{J \to \infty} \| S_J D_\tau x - S_J x \| \le C\, \|\nabla\tau\|_\infty\, \|x\|$.
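A minimal numpy sketch of $S_J x$: rectify the wavelet coefficients at several phases, average them with a Gaussian low-pass standing in for $\phi_J$, and subsample at intervals $2^J$. The filter parameters and the random coefficients in the demo are illustrative placeholders.

```python
import numpy as np

def gaussian_lowpass(N, sigma):
    """Fourier transform of a Gaussian low-pass phi on an N x N grid."""
    omega = 2 * np.pi * np.fft.fftfreq(N)
    w1, w2 = np.meshgrid(omega, omega, indexing="ij")
    return np.exp(-(w1 ** 2 + w2 ** 2) * sigma ** 2 / 2)

def averaged_rectified_wavelets(x, coeffs, J=3,
                                alphas=(0, np.pi / 2, np.pi, 3 * np.pi / 2)):
    """S_J x: rectified wavelet coefficients rho(x * psi_{alpha,lam}),
    averaged by phi_J and subsampled at intervals 2^J.
    `coeffs` maps lam to the complex field x * psi_lam (computed elsewhere)."""
    N, s = x.shape[0], 2 ** J
    phi_hat = gaussian_lowpass(N, sigma=0.8 * s)
    blur = lambda y: np.fft.ifft2(np.fft.fft2(y) * phi_hat).real[::s, ::s]
    out = [blur(x)]                                            # x * phi_{2^J}
    for lam, c in coeffs.items():
        for a in alphas:
            out.append(blur(np.maximum(np.real(np.exp(1j * a) * c), 0.0)))
    return np.stack(out)                                       # channels of S_J x

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
coeffs = {(j, t): rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
          for j in range(3) for t in range(4)}
print(averaged_rectified_wavelets(x, coeffs).shape)            # (1 + 12 * 4, 8, 8)
```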
Multiscale Autoencoder (Tomás Angles)

• Encoder: a convolutional network maps x, of dimension $d = 10^4$, to a code Z of dimension $d' = 10^2$: compute $Ux \star \phi_J$, keep the innovation $Ix$ with an operator $Id - Pr$ that removes what is linearly predictable across scales, then apply a linear dimension reduction L; the code Z behaves like a nearly Gaussian white noise.
• Innovations: the prediction errors are decorrelated across scales.
• Spatial decorrelation and dimension reduction.
• Generator: a sparse deconvolution inverts each step: $L^{-1}$ recovers $\tilde I x + \epsilon$, $(Id - Pr)^{-1}$ recovers $Ux \star \phi_J + \epsilon'$, and a CNN computes the pseudo-inverse of U as a non-linear dictionary model, recovering x.
Progressive Sparse Deconvolution (Tomás Angles)

• Progressive sparse deconvolution of $x \star \phi_j$ for j decreasing: a CNN maps $Ux \star \phi_{j+1} + \epsilon$ to $Ux \star \phi_j + \epsilon'$.
• It learns a dictionary $D_j$ in which $Ux \star \phi_j$ is sparse: the CNN computes a sparse code $\alpha$ such that $Ux \star \phi_j + \epsilon' = D_j \alpha$.
• The CNN is learned jointly with $D_j$ by minimising the average error $\|\epsilon'\|^2$ over a data basis.
• What sparse code is computed by the CNN? Could it be an $\ell^1$ sparse code?
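If the code were an $\ell^1$ sparse code, it could be computed by iterative soft thresholding (ISTA). Here is a minimal numpy sketch with a random dictionary standing in for the learned $D_j$; it illustrates the $\ell^1$ sparse coding question, not the actual CNN used in the talk.

```python
import numpy as np

def ista(D, y, lam=0.1, n_iter=500):
    """Iterative soft thresholding: minimize 0.5 ||y - D a||^2 + lam ||a||_1."""
    L = np.linalg.norm(D, 2) ** 2                    # Lipschitz constant of the gradient
    soft = lambda a, t: np.sign(a) * np.maximum(np.abs(a) - t, 0.0)
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft(a + D.T @ (y - D @ a) / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256)) / np.sqrt(64)     # overcomplete dictionary (stand-in for D_j)
a_true = np.zeros(256)
a_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
y = D @ a_true + 0.01 * rng.standard_normal(64)      # noisy observation (stand-in for Ux * phi_j)
a = ista(D, y, lam=0.05)
print(np.count_nonzero(np.abs(a) > 1e-3))            # the recovered code is sparse
```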
Training Reconstruction (Tomás Angles)

• Reconstructions $G(S_J(x_i))$ of training images $x_i$, with $2^J = 16$.
• Data bases: polygons and celebrity faces.
Testing Reconstruction (Tomás Angles)

• Reconstructions $G(S_J(x_t))$ of test images $x_t$.
Generative Interpolations (Tomás Angles)

• Images $G(Z)$ generated from interpolated codes $Z = \alpha Z_1 + (1 - \alpha) Z_2$, for celebrities and polygons.
Random Sampling (Tomás Angles)

• Images synthesised from a Gaussian white noise.
Classification by Dictionary Learning (Louis Thiry, John Zarka)

• ImageNet: 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
• Baseline pipeline: phase harmonic wavelet coefficients $Ux \star \phi_J$, spatial pooling by averaging at scale $2^J$ ($\approx 10^3$ coefficients), then a linear logistic classifier W producing the class.
• Top-5 error: Alex-Net about 20%, phase harmonic wavelets with a linear classifier about 70%.
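A minimal PyTorch sketch of this baseline pipeline: spatially average fixed (not learned) invariant channels and train only the linear logistic classifier W. The shapes and the random features are placeholders, not the actual ImageNet setup.

```python
import torch

# Placeholder sizes: n_channels stands for the phase harmonic wavelet channels of Ux * phi_J
n_images, n_channels, h, w, n_classes = 32, 1000, 14, 14, 1000

features = torch.randn(n_images, n_channels, h, w)       # stand-in for Ux * phi_J channels
pooled = features.mean(dim=(2, 3))                       # spatial pooling (averaging)

W = torch.nn.Linear(n_channels, n_classes)               # linear (logistic) classifier
logits = W(pooled)
labels = torch.randint(n_classes, (n_images,))
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()                                           # only W is trained; the wavelet
                                                          # features are fixed, not learned
```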
Classification by Dictionary Learning (Louis Thiry, John Zarka)

• ImageNet: 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
• Pipeline with multiscale sparse invariants: phase harmonic wavelet coefficients $Ux \star \phi_J$; a sparse dictionary expansion, where a CNN computes an $\ell^1$ sparse code $\alpha$ in a learned dictionary D; a linear dimension reduction L; spatial pooling by averaging at scale $2^J$ ($\approx 10^3$ coefficients); and a linear logistic classifier W producing the class.