  1. Understanding (or not) Deep Convolutional Networks
  Stéphane Mallat, École Normale Supérieure, www.di.ens.fr/data

  2. Deep Neural Networks
  • Approximations of high-dimensional functions from examples, for classification and regression.
  • Applications: computer vision, audio and music classification, natural language analysis, bio-medical data, unstructured data...
  • Related to: neurophysiology of vision and audition, quantum and statistical physics, linguistics, ...
  • Mathematics: statistics, probability, harmonic analysis, geometry, optimization. Little is understood.

  3. High Dimensional Learning
  • High-dimensional data x = (x(1), ..., x(d)) ∈ R^d.
  • Classification: estimate a class label f(x) given n sample values {x_i, y_i = f(x_i)}_{i ≤ n}.
  • Image classification: d = 10^6, with huge variability inside classes (e.g. Anchor, Joshua Tree, Beaver, Lotus, Water Lily). The problem is to find invariants.

  4. Curse of Dimensionality
  • f(x) can be approximated from examples {x_i, f(x_i)}_i by local interpolation if f is regular and there are close examples.
  • One needs ε^(−d) points to cover [0, 1]^d at a Euclidean distance ε ⇒ in high dimension ‖x − x_i‖ is always large.
  • Huge variability inside classes.
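
  A minimal numerical sketch of this effect (not from the slides), assuming only NumPy; the sample count n and the dimensions tried are arbitrary choices for illustration. With a fixed budget of examples, the distance from a query point to its nearest neighbour grows quickly with d, which is why local interpolation breaks down:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000  # fixed number of training examples

    for d in (2, 10, 100, 1000):
        X = rng.uniform(0.0, 1.0, size=(n, d))   # training samples in [0, 1]^d
        x = rng.uniform(0.0, 1.0, size=d)        # a query point
        nearest = np.min(np.linalg.norm(X - x, axis=1))
        print(f"d = {d:4d}   distance to nearest example = {nearest:.3f}")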

  5. Linearisation by Change of Variable
  • Change of variable Φ(x) = {φ_k(x)}_{k ≤ d'} to nearly linearize f(x), which is approximated by
    f̃(x) = ⟨Φ(x), w⟩ = Σ_k w_k φ_k(x).
  • (Figure: data x ∈ R^d is mapped by Φ to Φ(x) ∈ R^d', then classified by a 1D projection on w, i.e. a linear classifier.)
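
  A hedged sketch of this idea, assuming only NumPy: the particular change of variable below (squared coordinates) is an illustrative choice, not the feature map discussed in the talk. A decision function that is non-linear in x (inside/outside the unit disc) becomes linear in Φ(x), so a linear classifier on Φ(x) separates the classes:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-2.0, 2.0, size=(2000, 2))
    y = (np.linalg.norm(X, axis=1) < 1.0).astype(float)  # f(x): inside/outside the unit disc

    def phi(X):
        """Change of variable Phi(x) = (x1^2, x2^2): the circle becomes a line."""
        return X ** 2

    # Least-squares fit of w so that <Phi(x), w> (+ bias) approximates f(x).
    F = np.hstack([phi(X), np.ones((len(X), 1))])
    w = np.linalg.lstsq(F, y, rcond=None)[0]
    pred = (F @ w > 0.5).astype(float)
    print("training accuracy of the linear classifier on Phi(x):", (pred == y).mean())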

  6. Deep Convolution Networks
  • The revival of an old (1950s) idea: Y. LeCun, G. Hinton.
  • Architecture (figure): x → L_1 linear convolution → ρ(u) = |u| (non-linear scalar: neuron) → L_2 linear convolution → ρ → ... → Φ(x) → linear classification.
  • Optimize the L_j with architecture constraints: over 10^9 parameters.
  • Exceptional results for images, speech, bio-data classification. Products by Facebook, IBM, Google, Microsoft, Yahoo...
  • Why does it work so well?

  7. ImageNet Database
  • Database with 1 million images and 2000 classes.

  8. AlexNet Deep Convolutional Network (A. Krizhevsky, I. Sutskever, G. Hinton)
  • ImageNet supervised training: 1.2 × 10^6 examples, 10^3 classes; 15.3% test error in 2012.
  • Newer networks reach about 5% error, with up to 150 layers.
  • (Figure: the learned first-layer filters resemble wavelets.)

  9. Image Classification

  10. Scene Labeling / Car Driving

  11. Overview
  • Linearisation of symmetries
  • Deep convolutional network architectures
  • Simplified convolutional trees: wavelet scattering
  • Deep networks: contractions, linearization and separations

  12. Separation and Linearization with Φ
  • The change of variable defines f(x) = f̃(Φ(x)).
  • Separation: Φ(x) ≠ Φ(x') if f(x) ≠ f(x'); for f̃(z) to be Lipschitz one needs the stronger condition
    ‖Φ(x) − Φ(x')‖ ≥ ε |f(x) − f(x')|.
  • Linearization: f̃(z) = ⟨w, z⟩ linearizes the level sets Ω_t = {x : f(x) = t}:
    ∀ x ∈ Ω_t, f(x) = ⟨Φ(x), w⟩ = t,
    so the images Φ(Ω_t) for all t lie in parallel (affine) hyperplanes orthogonal to w.

  13. Linearization of Symmetries
  • No local estimation is possible because of the curse of dimensionality.
  • A symmetry is an operator g which preserves the level sets: f(g.x) = f(x) for all x (a global property).
    If g_1 and g_2 are symmetries then g_1.g_2 is also a symmetry ⇒ symmetries form groups G, typically high dimensional.
  • A change of variable Φ(x) must linearize the orbits {g.x}_{g ∈ G}.
    Problem: find the symmetries and linearise them.

  14. Contract to Linearize Symmetries
  • A change of variable Φ(x) must linearize the orbits {g.x}_{g ∈ G}.
    Problem: find the symmetries and linearise them.
  • Regularize the orbit and remove its high curvature: linearisation.

  15. Translation and Deformations
  • Digit classification: x(u) and x'(u) are two instances of the same digit.
    - Globally invariant to the translation group: a small group.
    - Locally invariant to small diffeomorphisms: a huge group.
  • (Video of Philipp Scott Johnson.)
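
  A hedged sketch, assuming only NumPy and SciPy (not part of the talk): it applies a small diffeomorphism D_τ x(u) = x(u − τ(u)) to a toy image, the smooth displacement field τ being an arbitrary illustrative choice:

    import numpy as np
    from scipy.ndimage import map_coordinates

    N = 64
    u1, u2 = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    x = ((u1 - N / 2) ** 2 + (u2 - N / 2) ** 2 < (N / 4) ** 2).astype(float)  # a toy "digit"

    # A small, smooth displacement field tau(u), amplitude about 2 pixels.
    tau1 = 2.0 * np.sin(2 * np.pi * u2 / N)
    tau2 = 2.0 * np.cos(2 * np.pi * u1 / N)

    # D_tau x(u) = x(u - tau(u)), evaluated by bilinear interpolation.
    x_deformed = map_coordinates(x, [u1 - tau1, u2 - tau2], order=1, mode="nearest")
    print("relative deformation ||x - D_tau x|| / ||x|| =",
          np.linalg.norm(x - x_deformed) / np.linalg.norm(x))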

  16. Deep Convolutional Networks
  • Architecture (figure): x(u) → ρL_1 → x_1(u, k_1) → ρL_2 → x_2(u, k_2) → ... → ρL_J → x_J(u, k_J) → classification, with up to J = 150 layers; each layer computes x_j = ρ L_j x_{j−1}.
  • ρ is a pointwise contractive non-linearity: ∀ (α, α') ∈ R^2, |ρ(α) − ρ(α')| ≤ |α − α'|. Examples: ρ(u) = max(u, 0) or ρ(u) = |u|.
  • The L_j are optimised to minimise the training error with stochastic gradient descent and back-propagation.
  • What is the role of the linear operators L_j and of ρ?
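
  A minimal NumPy sketch of this cascade (not from the slides): the layer sizes and random operators are placeholders rather than a trained network, and the dense matrices stand in for convolutions and subsamplings. It also checks the pointwise contraction property of ρ numerically:

    import numpy as np

    rng = np.random.default_rng(0)

    def rho(a):
        """Pointwise contractive non-linearity (here the ReLU max(a, 0))."""
        return np.maximum(a, 0.0)

    # Check |rho(a) - rho(a')| <= |a - a'| on random scalars.
    a, a2 = rng.standard_normal(10000), rng.standard_normal(10000)
    assert np.all(np.abs(rho(a) - rho(a2)) <= np.abs(a - a2) + 1e-12)

    # A depth-J cascade x_j = rho(L_j x_{j-1}) with arbitrary linear operators L_j.
    sizes = [64, 48, 32, 16]
    L = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]

    x = rng.standard_normal(sizes[0])
    for Lj in L:
        x = rho(Lj @ x)          # x_j = rho(L_j x_{j-1})
    print("output x_J has dimension", x.shape[0])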

  17. Deep Convolutional Networks
  • Same architecture (figure): x(u) → ρL_1 → x_1(u, k_1) → ... → ρL_J → x_J(u, k_J) → classification, up to J = 150 layers, with x_j = ρ L_j x_{j−1}.
  L_j has several roles:
  • L_j eliminates useless linear variables: dimension reduction.
  • L_j computes appropriate variables contracted by ρ: it linearizes and computes invariants to groups of symmetries.
  • L_j is a linear preprocessing for the next layers.

  18. Deep Convolutional Networks
  • Same architecture (figure), with x_j = ρ L_j x_{j−1}.
  • L_j is a linear combination of convolutions and subsamplings:
    x_j(u, k_j) = ρ( Σ_k x_{j−1}(·, k) ⋆ h_{k_j,k}(u) )   (sum across channels).
  • The filters h_{k_j,k}(u) are optimised to minimise the training error.
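
  A hedged NumPy/SciPy sketch of one such layer (not from the slides): filter sizes and channel counts are arbitrary placeholders, and a real network would also subsample the output:

    import numpy as np
    from scipy.signal import fftconvolve

    rng = np.random.default_rng(0)
    rho = np.abs                      # pointwise non-linearity, here the modulus

    K_in, K_out, N = 3, 8, 32
    x_prev = rng.standard_normal((K_in, N, N))          # x_{j-1}(u, k), k = 1..K_in
    h = rng.standard_normal((K_out, K_in, 5, 5)) / 5.0  # filters h_{k_j, k}(u)

    x_next = np.empty((K_out, N, N))
    for kj in range(K_out):
        acc = np.zeros((N, N))
        for k in range(K_in):
            acc += fftconvolve(x_prev[k], h[kj, k], mode="same")  # sum across channels
        x_next[kj] = rho(acc)                                     # pointwise non-linearity

    print("layer output shape (k_j, u):", x_next.shape)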

  19. Simplified Convolutional Networks
  • No channel combination: L_j is a linear combination of convolutions and subsamplings with no channel interaction,
    x_j(u, k_j) = ρ( x_{j−1}(·, k_{j−1}) ⋆ h_{k_j,k_{j−1}}(u) ).
  • If α ≥ 0 then ρ(α) = α. Since the previous layer is non-negative after ρ, if h_{k_j,k_{j−1}} is an averaging filter then ρ is transparent:
    x_j(u, k_j) = x_{j−1}(·, k_{j−1}) ⋆ h_{k_j,k_{j−1}}(u).
  (A small numerical check of this remark is sketched below.)
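
  A small check of the remark above, assuming NumPy and SciPy: with a non-negative previous layer (as produced by ρ = |·| or the ReLU) and a non-negative averaging filter, the convolution stays non-negative, so ρ acts as the identity:

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(0)
    x_prev = np.abs(rng.standard_normal((32, 32)))   # non-negative previous layer
    h = np.ones((4, 4)) / 16.0                       # averaging filter

    y = convolve2d(x_prev, h, mode="same")
    print("rho is transparent:", np.allclose(np.maximum(y, 0.0), y))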

  20. Convolution Tree Network
  • No channel combination: the cascade x → ρL_1 → x_1 → ρL_2 → x_2 → ... → ρL_J → x_J becomes a tree of convolutions.
  • (Figure legend: averaging filters / band-pass filters.)

  21. Wavelet Transform
  • (Figure: the tree of convolutions applied to x, built from averaging filters and band-pass filters.)
  • W_1: a cascade of low-pass filters and a band-pass filter.

  22. Wavelet Filter Bank
  • With ρ(α) = |α|, cascading |W_1| computes |x ⋆ ψ_{2^j,θ}| at scales 2^1, 2^2, ..., 2^J, where ψ_{2^j,θ} is the equivalent filter at scale 2^j and orientation θ.
  • Sparse representation.

  23. Scale separation with Wavelets
  • Complex wavelet: ψ(u) = g(u) exp(iξ·u), u ∈ R^2, rotated and dilated:
    ψ_{2^j,θ}(u) = 2^(−j) ψ(2^(−j) r_θ u).
    (Figure: real and imaginary parts of the rotated and dilated wavelets.)
  • Wavelet transform: Wx = ( x ⋆ φ_{2^J}(u) , x ⋆ ψ_{2^j,θ}(u) )_{j ≤ J, θ}: the average and the higher frequencies.
  • |x ⋆ ψ_{2^j,θ}(u)| eliminates the phase, which encodes local translation.
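
  A hedged NumPy/SciPy sketch of such a wavelet and of the modulus of its coefficients (not from the slides): the Gaussian envelope width and the frequency ξ are illustrative choices, and a true Morlet wavelet would also subtract a small constant so that its integral is exactly zero:

    import numpy as np
    from scipy.signal import fftconvolve

    def wavelet(size, j, theta, xi=3.0, sigma=0.8):
        """psi_{2^j, theta}(u) = 2^(-j) psi(2^(-j) r_theta u) on a size x size grid."""
        half = size // 2
        u1, u2 = np.meshgrid(np.arange(-half, half), np.arange(-half, half), indexing="ij")
        r1 = (np.cos(theta) * u1 + np.sin(theta) * u2) * 2.0 ** (-j)   # rotate, then dilate
        r2 = (-np.sin(theta) * u1 + np.cos(theta) * u2) * 2.0 ** (-j)
        g = np.exp(-(r1 ** 2 + r2 ** 2) / (2 * sigma ** 2))            # Gaussian envelope g(u)
        return 2.0 ** (-j) * g * np.exp(1j * xi * r1)                  # g(u) exp(i xi . u)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 64))                      # a toy image

    psi = wavelet(size=16, j=1, theta=np.pi / 4)
    coeff = np.abs(fftconvolve(x, psi, mode="same"))       # |x * psi_{2^j, theta}|
    print("modulus of wavelet coefficients, shape:", coeff.shape)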

  24. Wavelet Scattering Network
  • With ρ(α) = |α|, the cascade of wavelet transforms x_J = ρW_J ... ρW_2 ρW_1 x (with averaging filters at the output) computes the scattering coefficients
    Sx = { ||| x ⋆ ψ_{2^{j_1},θ_1} | ⋆ ψ_{2^{j_2},θ_2} | ⋆ ... | ⋆ ψ_{2^{j_m},θ_m} | ⋆ φ_J }_{j_k, θ_k}.
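
  A minimal sketch of this cascade truncated to order two, assuming NumPy and SciPy; the tiny Gabor-like filter bank below is a stand-in for proper scattering wavelets and only serves to show the structure of the computation:

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor(j, theta, size=16, xi=3.0, sigma=0.8):
        half = size // 2
        u1, u2 = np.meshgrid(np.arange(-half, half), np.arange(-half, half), indexing="ij")
        r1 = (np.cos(theta) * u1 + np.sin(theta) * u2) * 2.0 ** (-j)
        r2 = (-np.sin(theta) * u1 + np.cos(theta) * u2) * 2.0 ** (-j)
        return 2.0 ** (-j) * np.exp(-(r1**2 + r2**2) / (2 * sigma**2)) * np.exp(1j * xi * r1)

    J, thetas = 3, [k * np.pi / 4 for k in range(4)]
    psis = {(j, th): gabor(j, th) for j in range(J) for th in thetas}
    phi_J = np.ones((2 ** J, 2 ** J)) / 4.0 ** J       # crude averaging (low-pass) filter

    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 64))

    def conv(a, h):
        return fftconvolve(a, h, mode="same")

    S = [conv(x, phi_J)]                               # order 0: x * phi_J
    for lam1, psi1 in psis.items():
        m1 = np.abs(conv(x, psi1))                     # |x * psi_{lam1}|
        S.append(conv(m1, phi_J))                      # order 1
        for lam2, psi2 in psis.items():
            if lam2[0] > lam1[0]:                      # keep increasing scales j2 > j1
                S.append(conv(np.abs(conv(m1, psi2)), phi_J))   # order 2
    print("number of scattering maps:", len(S))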

  25. Scattering Properties
  • S_J x = ( x ⋆ φ_{2^J} , |x ⋆ ψ_{λ_1}| ⋆ φ_{2^J} , ||x ⋆ ψ_{λ_1}| ⋆ ψ_{λ_2}| ⋆ φ_{2^J} , |||x ⋆ ψ_{λ_1}| ⋆ ψ_{λ_2}| ⋆ ψ_{λ_3}| ⋆ φ_{2^J} , ... )_{λ_1, λ_2, λ_3, ...}, computed by cascading |W_1|, |W_2|, |W_3|, ...
  • Since ‖ |Wx| − |Wx'| ‖ ≤ ‖x − x'‖ and ‖Wx‖ = ‖x‖:
    Lemma: ‖[W, D_τ]‖ = ‖W D_τ − D_τ W‖ ≤ C ‖∇τ‖_∞.
    Theorem: For appropriate wavelets, a scattering is contractive, ‖S_J x − S_J y‖ ≤ ‖x − y‖ (L^2 stability), translation invariant, and it linearizes small deformations: if D_τ x(u) = x(u − τ(u)) then
    lim_{J→∞} ‖S_J D_τ x − S_J x‖ ≤ C ‖∇τ‖_∞ ‖x‖.

  26. Digit Classification: MNIST (Joan Bruna)
  • Pipeline: x → S_J x (no learning) → supervised linear classifier → y = f(x).
  • S_J x is invariant to translations, linearises small deformations and separates different patterns; the linear classifier provides invariants to class-specific deformations.
  • Classification errors (training size 50000): Conv. Net. (LeCun et al.) 0.5%, Scattering 0.4%.
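
  A hedged sketch of this pipeline, assuming the third-party kymatio package for the scattering transform and scikit-learn for the linear classifier (neither is part of the talk, and the exact API is an assumption); scikit-learn's small 8x8 digits set is used as a quick stand-in for MNIST:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from kymatio.numpy import Scattering2D   # assumed third-party scattering implementation

    digits = load_digits()
    X = digits.images.astype(np.float32)          # (n, 8, 8) images
    y = digits.target

    scattering = Scattering2D(J=1, shape=(8, 8))  # S_J with a small J for 8x8 images
    SX = scattering(X).reshape(len(X), -1)        # scattering coefficients, no learning

    X_tr, X_te, y_tr, y_te = train_test_split(SX, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    print("test accuracy of a linear classifier on S_J x:", clf.score(X_te, y_te))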

  27. Classification of Textures (J. Bruna)
  • CUReT database, 61 texture classes.
  • Pipeline: x → S_J x with 2^J = image size → supervised linear classifier → y = f(x).
  • Classification errors with 46 training samples per class: Fourier spectrum 1%, histogram features 1%, Scattering 0.2%.

  28. Reconstruction from Scattering
  • Second order scattering:
    S_J x = { x ⋆ φ_J , |x ⋆ ψ_{2^{j_1},θ_1}| ⋆ φ_J , ||x ⋆ ψ_{2^{j_1},θ_1}| ⋆ ψ_{2^{j_2},θ_2}| ⋆ φ_J }.
    If x has N^2 pixels and J = log_2 N (translation invariant), then S_J x has O((log_2 N)^2) coefficients.
  • If x(u) is a stationary process,
    S_J x ≈ { E(x) , E(|x ⋆ ψ_{2^{j_1},θ_1}|) , E(||x ⋆ ψ_{2^{j_1},θ_1}| ⋆ ψ_{2^{j_2},θ_2}|) }.
  • Gradient descent reconstruction: given a random initialisation x_0, iteratively update x_n to minimise ‖S_J x − S_J x_n‖. (A sketch of such a descent follows.)
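
  A hedged sketch of this reconstruction loop, assuming PyTorch for automatic differentiation (not part of the talk); the representation Phi below is a crude first-order stand-in for the scattering transform S_J, built from fixed random band-pass filters followed by a global average:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    filters = torch.randn(8, 1, 7, 7)                   # fixed "wavelet-like" filters

    def phi(x):
        """Stand-in representation: spatial average of |x * h_k| for each filter h_k."""
        u = torch.abs(F.conv2d(x, filters, padding=3))  # modulus of filter responses
        return u.mean(dim=(2, 3))                       # global averaging (invariance)

    x_target = torch.rand(1, 1, 32, 32)                 # the image to reconstruct (toy)
    target = phi(x_target).detach()

    x_n = torch.rand(1, 1, 32, 32, requires_grad=True)  # random initialisation x_0
    opt = torch.optim.Adam([x_n], lr=0.05)
    for step in range(500):
        opt.zero_grad()
        loss = torch.sum((phi(x_n) - target) ** 2)      # ||Phi(x) - Phi(x_n)||^2
        loss.backward()
        opt.step()
    print("final representation mismatch:", loss.item())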

  29. Translation Invariant Models (Joan Bruna)
  • (Figure: original textures, including 2D turbulence, compared with syntheses from a Gaussian process model with the same second-order moments and with syntheses from the O((log_2 N)^2) scattering coefficients of order 2.)

  30. Complex Image Classification (Edouard Oyallon)
  • (Figure: example classes: Joshua Tree, Beaver, Anchor, Metronome, Water Lily, Boat.)
  • Pipeline: x → S_J x (no learning) → supervised linear classifier → y = f(x).
  • Classification errors on CIFAR-10: Deep-Net 7%, Scattering/Unsupervised 20%.

  31. Generation with Deep Networks (A. Radford, L. Metz, S. Chintala)
  • Unsupervised generative models with convolutional networks.
  • Trained on a database of faces: linearization.
  • On a database including bedrooms: interpolations.
