Multiscale Sparse Models in Deep Convolutional Networks

  1. Multiscale Sparse Models in Deep Convolutional Networks — Tomas Anglès, Roberto Leonarduzzi, Stéphane Mallat, Louis Thiry, John Zarka, Sixin Zhang. Collège de France, École Normale Supérieure, Flatiron Institute. www.di.ens.fr/data

  2. Deep Convolutional Network
  • A deep convolutional neural network predicts y = f(x) (Y. LeCun): the input x ∈ R^d goes through a cascade of operators L_j (spatial convolutions and linear combinations of channels) interleaved with the rectifier ρ(a) = max(a, 0), along a scale axis, ending with a low-dimensional linear estimate of f(x).
  • Supervised learning of the L_j from n examples {x_i, f(x_i)}_{i≤n}.
  • Exceptional results for images, speech, language, bio-data, quantum chemistry regressions, ...
  • How does it reduce dimensionality? Multiscale structure, sparsity, invariants.
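A minimal sketch of such a cascade, assuming PyTorch; depth and channel counts are purely illustrative placeholders and do not come from the talk:

```python
import torch
import torch.nn as nn

# Cascade of linear convolutional operators L_j followed by the rectifier
# rho(a) = max(a, 0), ending with a low-dimensional linear output.
net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),             # L_1, rho
    nn.Conv2d(32, 64, kernel_size=3, padding=1, stride=2), nn.ReLU(),  # L_2, rho
    nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2), nn.ReLU(), # L_3, rho
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 10),                                                 # final linear layer
)

x = torch.randn(1, 3, 32, 32)   # x in R^d with d = 3 * 32 * 32
y = net(x)                      # low-dimensional prediction of f(x)
print(y.shape)                  # torch.Size([1, 10])
```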

  3. Statistical Models from 1 Example (M. Bethge et al.)
  • Supervised network training (e.g. on ImageNet).
  • For one realisation x of X, compute each layer of the network.
  • Compute correlation statistics of the network coefficients.
  • Synthesize x̃ having similar statistics: here x has 6·10^4 pixels and x̃ is constrained by 2·10^5 correlations.
  • What mathematical interpretation?

  4. Learned Generative Networks
  • Wasserstein autoencoder trained on n examples {x_i}_{i≤n}: an encoder Φ maps X to a nearly Gaussian white code Z = Φ(X); a decoder G, a cascade of linear operators W_j and rectifiers ρ, outputs X̃ = G(Z).
  • Network trained on bedroom images: interpolating codes Z = α Z_1 + (1 − α) Z_2 produces a linearization of deformations.
  • Network trained on faces of celebrities: samples G(Z).
  • What mathematical interpretation?

  5. Image Classification: ImageNet 2012
  • 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
  • [Chart: top-5 error of AlexNet versus ResNet.]

  6. Scale Separation and Interactions
  • Dimension reduction: interactions of d bodies represented by x(u): particles, pixels, ... with interactions across scales.
  • Multiscale regrouping of the interactions of d bodies into interactions of O(log d) groups.
  • Scale separation ⇒ wavelet transforms.
  • How to capture scale interactions? Critical harmonic analysis problems since the 1970s.

  7. Overview
  • Scale separation with wavelets and interactions through the phase.
  • Linear scale interaction models: compressive signal approximations; stochastic models of stationary processes.
  • Non-linear scale interaction models with sparse dictionaries: generative autoencoders; classification of ImageNet.
  • All these roads lead to convolutional neural networks…

  8. Scale Separation with Wavelets
  • Wavelet filter ψ(u) with 2 phases (real and imaginary parts), rotated and dilated: ψ_λ(u) = 2^{-2j} ψ(2^{-j} r_θ u), with Fourier transform ψ̂_{2^j,θ}(ω).
  • Wavelet transform, invertible: Wx = ( x ⋆ φ_{2^J} , x ⋆ ψ_λ )_λ with \widehat{x ⋆ ψ_λ}(ω) = x̂(ω) ψ̂_λ(ω).
  • Problem: coefficients have zero mean and no correlations across scales: ∑_u x ⋆ ψ_λ(u) (x ⋆ ψ_{λ'}(u))^* = ∑_ω |x̂(ω)|² ψ̂_λ(ω) ψ̂_{λ'}(ω)^* ≈ 0 if λ ≠ λ'.
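A 1-D numerical illustration of this cross-scale decorrelation, assuming NumPy and Gaussian band-pass filters as stand-ins for the rotated and dilated 2-D wavelets of the slide:

```python
import numpy as np

# 1-D analogue of the scale separation above (the talk uses 2-D rotated and
# dilated wavelets; Gaussian band-pass filters are an assumption here).
N = 2 ** 12
omega = np.fft.fftfreq(N) * 2 * np.pi            # frequency grid

def bandpass_hat(omega, xi, sigma):
    """Fourier transform of an (approximately) analytic band-pass wavelet."""
    return np.exp(-((omega - xi) ** 2) / (2 * sigma ** 2)) * (omega > 0)

# Two wavelets psi_lambda, psi_lambda' at well-separated dyadic scales.
psi1_hat = bandpass_hat(omega, xi=np.pi / 2, sigma=np.pi / 8)
psi2_hat = bandpass_hat(omega, xi=np.pi / 8, sigma=np.pi / 32)

x = np.random.randn(N)
x_hat = np.fft.fft(x)
c1 = np.fft.ifft(x_hat * psi1_hat)   # x * psi_lambda
c2 = np.fft.ifft(x_hat * psi2_hat)   # x * psi_lambda'

# Cross-scale correlation is small because the Fourier supports of
# psi_hat_lambda and psi_hat_lambda' barely overlap.
cross = np.abs(np.vdot(c1, c2)) / (np.linalg.norm(c1) * np.linalg.norm(c2))
print(f"normalized cross-scale correlation: {cross:.3e}")
```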

  9. Wavelet Transform Filter Cascade
  • [Diagram: filter cascade across scales from 2^0 to 2^J.]
  • How to capture multiscale similarities? ReLU & phase.

  10. Rectified Wavelet Coefficients
  • Multiphase real wavelets: ψ_{α,λ} = Real(e^{-iα} ψ_λ).
  • Rectified with ρ(a) = max(a, 0): Ux = ( ρ(x ⋆ ψ_{α,λ}) , x ⋆ φ_{2^J} )_{α,λ} : convolutional network coefficients.
  • Linearly invertible: x = U^{-1} Ux with U^{-1} linear, since ρ(a) + ρ(-a) = a.
  • The ReLU creates non-zero means ∑_u ρ(x ⋆ ψ_{α,λ}(u)) and correlations across scales ∑_u ρ(x ⋆ ψ_{α,λ}(u)) ρ(x ⋆ ψ_{α',λ'}(u)).
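A small NumPy check of the two facts above, with random numbers standing in for real wavelet coefficients:

```python
import numpy as np

# The rectifier rho(a) = max(a, 0) loses no information when applied with two
# opposite phases, since rho(a) + rho(-a) = a, and it creates a non-zero mean.
rho = lambda a: np.maximum(a, 0.0)

a = np.random.randn(10_000)               # stand-in for x * psi_{alpha,lambda}(u)
assert np.allclose(rho(a) + rho(-a), a)   # linear invertibility of U

print("mean before rectification:", a.mean())        # ~ 0
print("mean after rectification :", rho(a).mean())   # > 0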

  11. Linear Rectifiers Act on Phase
  • With x ⋆ ψ_λ = |x ⋆ ψ_λ| e^{iφ(x ⋆ ψ_λ)} and ρ homogeneous (ρ(βa) = β ρ(a) if β > 0): Ux(u, α, λ) = ρ(x ⋆ Real(e^{-iα} ψ_λ)) = ρ(Real(e^{-iα} (x ⋆ ψ_λ))) = |x ⋆ ψ_λ| ρ(cos(φ(x ⋆ ψ_λ) − α)).
  • Phase harmonics: for any z = |z| e^{iφ(z)} ∈ C, define [z]^k := |z| e^{ikφ(z)}.
  • A ReLU computes phase harmonics. Theorem: for any homogeneous non-linearity ρ, the Fourier transform along the phase α gives Ûx(u, k, λ) = γ̂(k) |x ⋆ ψ_λ(u)| e^{ikφ(x ⋆ ψ_λ(u))} with γ(α) = ρ(cos α).
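A NumPy sketch checking the theorem numerically (not the authors' code): sampling the phase α, rectifying, and taking a discrete Fourier transform along α yields coefficients whose phase is exactly kφ. The value of z and the sampling grid are arbitrary choices.

```python
import numpy as np

# Fourier coefficients along the phase alpha of |z| rho(cos(phi(z) - alpha))
# should equal gamma_hat(k) |z| e^{ik phi(z)}, with gamma(alpha) = rho(cos(alpha)).
rho = lambda a: np.maximum(a, 0.0)

K = 64                                    # phase samples
alpha = 2 * np.pi * np.arange(K) / K
z = 1.7 * np.exp(1j * 0.9)                # a complex wavelet coefficient x * psi_lambda
phi = np.angle(z)

u = np.abs(z) * rho(np.cos(phi - alpha))  # Ux(u, alpha, lambda) at a fixed point u
coeffs = np.fft.ifft(u)                   # transform along the phase variable alpha

for k in (1, 2, 4):                       # gamma_hat(k) vanishes for odd k > 1
    harmonic = np.abs(z) * np.exp(1j * k * phi)   # phase harmonic [z]^k = |z| e^{ik phi}
    ratio = coeffs[k] / harmonic                  # ~ gamma_hat(k): nearly real, independent of phi
    print(k, np.round(ratio, 4))
```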

  12. Frequency Transpositions
  • Phase harmonics [x ⋆ ψ_λ]^k = |x ⋆ ψ_λ(u)| e^{ikφ(x ⋆ ψ_λ(u))} perform a non-linear frequency dilation / transposition, with no dilation in time.
  • x ⋆ ψ_λ and x ⋆ ψ_{λ'} are not correlated, but the harmonic [x ⋆ ψ_λ]^k, whose spectrum is transposed from λ to kλ (k = 1, 2, 3, ...), is correlated with x ⋆ ψ_{λ'} if kλ ≈ λ'.

  13. Scale Transposition with Harmonics
  • For k = 2, the harmonic [x ⋆ ψ_{j,θ}(u)]^2 = |x ⋆ ψ_{j,θ}(u)| e^{i 2 φ(x ⋆ ψ_{j,θ}(u))} has its spectrum transposed to the next finer scale, where it becomes correlated with the wavelet coefficients at that scale j.
  • Phase harmonics perform frequency transpositions across scales.

  14. Linear Prediction Across Scales/Frequencies
  • ReLU means and correlations, invariant to translations: M(α, λ) = d^{-1} ∑_u ρ(x ⋆ ψ_{α,λ}(u)) and C(α, λ, α', λ') = d^{-1} ∑_u ρ(x ⋆ ψ_{α,λ}(u)) ρ(x ⋆ ψ_{α',λ'}(u)).
  • Define a linear autoregressive model from low to high frequencies: ρ(x ⋆ ψ_{α',λ'}) at the finer scale λ' is predicted linearly from ρ(x ⋆ ψ_{α,λ}) at the coarser scale λ.
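An illustrative NumPy computation of these moments, with random arrays standing in for the rectified wavelet coefficients (shapes are placeholders):

```python
import numpy as np

# Translation-invariant moments of rectified wavelet coefficients:
# means M(alpha, lambda) and correlations C(alpha, lambda, alpha', lambda').
rho = lambda a: np.maximum(a, 0.0)

n_alpha, n_lambda, d = 4, 8, 4096
# coeffs[alpha, lambda, u] stands in for x * psi_{alpha, lambda}(u).
coeffs = np.random.randn(n_alpha, n_lambda, d)

R = rho(coeffs)                          # rho(x * psi_{alpha, lambda})
M = R.mean(axis=-1)                      # M(alpha, lambda) = d^{-1} sum_u ...
flat = R.reshape(n_alpha * n_lambda, d)
C = flat @ flat.T / d                    # C over all pairs (alpha, lambda), (alpha', lambda')

print(M.shape, C.shape)                  # (4, 8), (32, 32)
```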

  15. Compressive Reconstructions (Gaspar Rochette, Sixin Zhang)
  • If x ⋆ ψ_λ is sparse then x is recovered from m ≪ d phase harmonic means Mx and covariances Cx: x̃ = arg min_y ‖ Cx − Cy + (Mx − My)(Mx − My)^* ‖².
  • [Plot: PSNR (dB) versus log_10(m/d).]
  • Approximation rate optimal for total variation signals: ‖x − x̃‖ ∼ m^{-2}.
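A toy moment-matching reconstruction in PyTorch, sketching the objective above. A random bank of circular filters replaces the phase harmonic operators, so this only illustrates the optimization, not the exact operator or the reported results:

```python
import torch

# Minimize || Cx - Cy + (Mx - My)(Mx - My)^* ||^2 over y by gradient descent,
# with generic rectified filter moments standing in for phase harmonic moments.
torch.manual_seed(0)
d, n_filt = 512, 16
filt_hat = torch.fft.fft(torch.randn(n_filt, d))

def moments(y):
    # circular filtering, rectification, then translation-invariant moments
    r = torch.relu(torch.fft.ifft(torch.fft.fft(y) * filt_hat).real)
    return r.mean(dim=1), r @ r.T / d            # means M, covariances C

x = torch.randn(d)                               # target signal (synthetic here)
Mx, Cx = moments(x)

y = torch.randn(d, requires_grad=True)           # reconstruction variable
opt = torch.optim.Adam([y], lr=0.05)
for step in range(500):
    My, Cy = moments(y)
    dM = Mx - My
    loss = ((Cx - Cy + torch.outer(dM, dM)) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

print(float(loss))                               # moment mismatch after optimization
```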

  16. Compressive Reconstructions
  • [Plot: PSNR (dB) versus log_10(m/d).]
  • Approximation rate optimal for total variation signals: ‖x − x̃‖ ∼ m^{-1}.

  17. Gaussian Models of Stationary Processes
  • What stochastic models for turbulence? A realisation x with d = 6·10^4 pixels.
  • Kolmogorov model: from the d empirical second-order moments d^{-1} ∑_u x(u) x(u − τ), build a Gaussian model x̃ with the same power spectrum.
  • No correlation is captured across scales and frequencies: the phases are random.
  • How to capture non-Gaussianity and long-range interactions?
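A NumPy sketch of this Gaussian model: keep the empirical power spectrum and randomize the Fourier phases. The input here is synthetic noise, not turbulence data:

```python
import numpy as np

# Random-phase surrogate: same power spectrum as x, all phase alignments
# across scales destroyed, hence a Gaussian stationary model.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                 # stand-in for the data

x_hat = np.fft.rfft(x)
random_phase = np.exp(2j * np.pi * rng.random(x_hat.shape))
random_phase[0] = 1.0                         # keep the zero-frequency bin real
random_phase[-1] = 1.0                        # keep the Nyquist bin real
x_tilde = np.fft.irfft(np.abs(x_hat) * random_phase, n=x.size)

# Same power spectrum (up to numerical precision), different realization.
print(np.allclose(np.abs(np.fft.rfft(x_tilde)), np.abs(x_hat)))
```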

  18. Models of Stationary Processes (Sixin Zhang)
  • If X is ergodic then the empirical moments converge as d → ∞: d^{-1} ∑_u ρ(x ⋆ ψ_{α,λ}(u)) → E( ρ(X ⋆ ψ_{α,λ}(u)) ) and d^{-1} ∑_u ρ(x ⋆ ψ_{α,λ}(u)) ρ(x ⋆ ψ_{α',λ'}(u)) → E( ρ(X ⋆ ψ_{α,λ}) ρ(X ⋆ ψ_{α',λ'}) ).
  • Stationary processes conditioned by these translation-invariant moments.

  19. Ergodic Stationary Processes (Sixin Zhang)
  • [Images: a realisation x with d = 6·10^4 pixels and a synthesis x̃ from m = 3·10^3 moments.]
  • Phase coherence is captured.
  • Same quality as with learned deep networks, with far fewer moments.

  20. Multifractal Models without High-Order Moments (Roberto Leonarduzzi)
  • Financial S&P 500 returns.
  • Multifractal properties: E[ |X ⋆ ψ_j|^q ] ∼ 2^{jζ(q)} reproduce high-order moments.
  • Probability distribution P(|x|).
  • Leverage correlation L(τ) = E[ |X(t + τ)|² X(t) ]: time asymmetry.
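An illustrative NumPy computation of both diagnostics on synthetic Gaussian data, with increments of the integrated process standing in for X ⋆ ψ_j; real returns would give a non-trivial ζ(q) curve and a negative leverage correlation:

```python
import numpy as np

# Structure functions E[|X * psi_j|^q] ~ 2^{j zeta(q)} and the leverage
# correlation L(tau) = E[|X(t + tau)|^2 X(t)], on synthetic data.
rng = np.random.default_rng(0)
returns = rng.standard_normal(2 ** 16)        # stand-in for S&P 500 returns
walk = np.cumsum(returns)                     # integrated process

js = range(1, 9)
for q in (1, 2, 4):
    moments = [np.mean(np.abs(walk[2 ** j:] - walk[:-2 ** j]) ** q) for j in js]
    zeta_q = np.polyfit(list(js), np.log2(moments), 1)[0]   # slope of log2-moment vs j
    print(f"zeta({q}) ~ {zeta_q:.2f}")        # ~ q/2 for this Gaussian walk

tau = 5                                       # leverage correlation at lag tau
L = np.mean(np.abs(returns[tau:]) ** 2 * returns[:-tau])
print(f"L({tau}) ~ {L:.4f}")                  # ~ 0 here; negative for real returns
```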

  21. Learned Generative Networks
  • Variational autoencoder trained on n examples {x_i}_{i≤n}: an encoder Φ maps X to a Gaussian white code Z = Φ(X); a decoder G, a cascade of linear operators W_j and rectifiers ρ, outputs X̃ = G(Z).
  • Network trained on bedroom images: interpolating codes Z = α Z_1 + (1 − α) Z_2 produces a linearization of deformations.
  • The encoder is Lipschitz-continuous to the action of deformations.
  • How to build such autoencoders?

  22. Averaged Rectified Wavelets
  • Spatial averaging at a large scale 2^J: S_J x = Ux ⋆ φ_J(2^J n) = ( ρ(x ⋆ ψ_{α,λ}) ⋆ φ_J(2^J n) , x ⋆ φ_{2^J}(2^J n) )_{α,λ}.
  • Scale separation and spatial averaging with φ_J: Gaussianization, and linearization of small deformations.
  • Theorem: if D_τ x(u) = x(u − τ(u)) then lim_{J→∞} ‖ S_J D_τ x − S_J x ‖ ≤ C ‖∇τ‖_∞ ‖x‖.
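A 1-D NumPy sketch of the averaged rectified coefficients ρ(x ⋆ ψ_{α,λ}) ⋆ φ_J(2^J n), with illustrative Gaussian filters in place of the talk's 2-D wavelets:

```python
import numpy as np

# One channel of S_J: band-pass filter, rectify, low-pass average with phi_J,
# then subsample at intervals 2^J.
rho = lambda a: np.maximum(a, 0.0)

N, J = 2 ** 12, 5
omega = np.fft.fftfreq(N) * 2 * np.pi
x = np.random.randn(N)
x_hat = np.fft.fft(x)

# One band-pass wavelet psi_lambda (Gaussian bump at positive frequencies).
psi_hat = np.exp(-((omega - np.pi / 4) ** 2) / (2 * (np.pi / 16) ** 2))
coef = rho(np.fft.ifft(x_hat * psi_hat).real)            # rho(x * psi_{alpha,lambda})

phi_hat = np.exp(-((omega * 2 ** J) ** 2) / 2)           # Gaussian low-pass phi_J in Fourier
SJ = np.fft.ifft(np.fft.fft(coef) * phi_hat).real[:: 2 ** J]   # averaged, sampled at 2^J n
print(SJ.shape)                                          # (N / 2^J,) invariant coefficients
```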

  23. Multiscale Autoencoder (Tomas Anglès)
  • Encoder: a convolutional network x → Ux ⋆ φ_J → (Id − Pr) → Ix → L → Z, reducing d = 10^4 to d' = 10^2.
  • Innovations Ix: prediction errors, decorrelated across scales; after spatial decorrelation and dimension reduction, Z is approximately Gaussian white noise.
  • Generator: sparse deconvolution Z̃ → L^{-1} → Ĩx + ε → (Id − Pr)^{-1} → Ũx ⋆ φ_J + ε' → deconvolution by a CNN pseudo-inverse (non-linear dictionary model) → Ũx → U^{-1} → x̃.

  24. Progressive Sparse Deconvolution (Tomas Anglès)
  • Progressive sparse deconvolution of x ⋆ φ_j for decreasing j: a CNN maps Ux ⋆ φ_{j+1} + ε to a sparse code α such that Ux ⋆ φ_j + ε' = D_j α.
  • It learns a dictionary D_j in which Ux ⋆ φ_j is sparse; the CNN is learned jointly with D_j by minimising the average error ‖ε'‖² over a data basis.
  • What sparse code is computed by the CNN? Could it be an ℓ¹ sparse code?
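For reference, a minimal ISTA sketch of ℓ¹ sparse coding in a dictionary, with random stand-ins for D_j and for Ux ⋆ φ_j; this is the ℓ¹ code the question refers to, not the talk's learned CNN:

```python
import numpy as np

# ISTA: minimize ||D alpha - b||^2 + lam * ||alpha||_1 by gradient steps
# followed by soft-thresholding.
rng = np.random.default_rng(0)
d, p = 64, 256
D = rng.standard_normal((d, p)) / np.sqrt(d)    # dictionary D_j (random stand-in)
b = rng.standard_normal(d)                       # coefficients to encode (stand-in for Ux * phi_j)

lam = 0.1
step = 1.0 / np.linalg.norm(D, 2) ** 2           # 1 / Lipschitz constant of the gradient
alpha = np.zeros(p)
for _ in range(200):
    grad = D.T @ (D @ alpha - b)
    alpha = alpha - step * grad
    alpha = np.sign(alpha) * np.maximum(np.abs(alpha) - step * lam, 0.0)  # soft-threshold

print(np.count_nonzero(alpha), "nonzero coefficients out of", p)
```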

  25. Training Reconstruction (Tomas Anglès)
  • [Images: training examples x_i and their reconstructions G(S_J(x_i)) at scale 2^J = 16, on the Polygons and Celebrities data bases.]

  26. Testing Reconstruction (Tomas Anglès)
  • [Images: test examples x_t and their reconstructions G(S_J(x_t)).]

  27. Generative Interpolations (Tomas Anglès)
  • [Images: interpolations G(Z) for Z = α Z_1 + (1 − α) Z_2 between two codes Z_1 and Z_2, on Celebrities and Polygons.]

  28. Random Sampling (Tomas Anglès)
  • Images synthesised from Gaussian white noise.

  29. Classification by Dictionary Learning (Louis Thiry, John Zarka)
  • 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
  • Pipeline: phase harmonic wavelet coefficients Ux ⋆ φ_J → spatial averaging pooling at 2^J → linear logistic classifier W → one of 10^3 classes.
  • [Chart: top-5 error of AlexNet (≈ 20%) versus averaged phase harmonic wavelets (≈ 70%).]

  30. Classification by Dictionary Learning (Louis Thiry, John Zarka)
  • 1000 classes, 1.2 million labeled training images of 224 × 224 pixels.
  • Pipeline: Ux ⋆ φ_J → sparse dictionary expansion (a CNN computing an ℓ¹ sparse code α in a learned dictionary D) → linear dimension reduction L → spatial averaging pooling at 2^J → linear logistic classifier W → one of 10^3 classes.
  • Sparse multiscale invariants through ℓ¹ sparse coding.
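A schematic PyTorch version of this pipeline, with placeholder sizes and a single rectified 1 × 1 convolution as a crude stand-in for the learned sparse code in D:

```python
import torch
import torch.nn as nn

# Sparse expansion -> linear dimension reduction -> spatial averaging -> logistic classifier.
# All sizes are illustrative placeholders, not the talk's configuration.
n_wavelet_channels, n_atoms, n_reduced, n_classes = 128, 2048, 256, 1000

classifier = nn.Sequential(
    nn.Conv2d(n_wavelet_channels, n_atoms, kernel_size=1), nn.ReLU(),  # stand-in for the sparse code in D
    nn.Conv2d(n_atoms, n_reduced, kernel_size=1),                      # linear dimension reduction L
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),                             # spatial averaging at 2^J
    nn.Linear(n_reduced, n_classes),                                   # logistic classifier W
)

u = torch.randn(1, n_wavelet_channels, 14, 14)   # stand-in for Ux * phi_J
logits = classifier(u)
print(logits.shape)                               # torch.Size([1, 1000])
```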
