Organizing Deep Networks
Edouard Oyallon, advisor: Stéphane Mallat


  1. Organizing Deep Networks. Edouard Oyallon, advisor: Stéphane Mallat. Following the works of Laurent Sifre, Joan Bruna, … Collaborators: Eugene Belilovsky, Sergey Zagoruyko, Bogdan Cirstea, Jörn Jacobsen, …

  2. Classification of signals
  • Let (X, Y) ∈ R^n × 𝒴 be random variables, n > 0.
  • Problem: estimate ŷ such that ŷ = arg inf_ỹ E(|ỹ(X) − Y|).
  • We are given a training set (x_i, y_i) ∈ R^n × 𝒴 to build ŷ.
  • Say one can write ŷ = Classifier(Φx), the Classifier being built with (Φx_i, y_i).
  • Three ways to build Φ: supervised (from (x_i, y_i)_i), unsupervised (from (x_i)_i only), or predefined (geometric priors).
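
A minimal sketch of this pipeline in Python, assuming scikit-learn and purely synthetic data; PCA stands in for any of the three choices of Φ, and a logistic regression plays the role of the simple classifier built on (Φx_i, y_i).

```python
# Sketch: build a representation Phi, then fit a simple classifier on (Phi x_i, y_i).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 256                                        # signal dimension (placeholder)
x = rng.normal(size=(1000, n))                 # signals x_i (placeholder data)
y = (x[:, :10].sum(axis=1) > 0).astype(int)    # labels y_i (placeholder rule)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)
model = make_pipeline(PCA(n_components=20),               # Phi : R^n -> R^20 (unsupervised here)
                      LogisticRegression(max_iter=1000))  # simple classifier on Phi x
model.fit(x_tr, y_tr)
print("test accuracy:", model.score(x_te, y_te))
```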

  3. High-dimensional classification
  • Estimation problem: a training set (x_i, y_i) ∈ R^(224²) × {1, …, 1000}, i < 10^6, is used to predict labels ŷ(x).
  • Examples: "Rhino" vs. not a "rhino"; Caltech 101, etc.

  4. High-dimensional variabilities
  • Claim: in R^n, n ≫ 1, the variance is huge. Ex.: if X ∼ N(0, I_n) then E(X) = 0, yet ∃C > 0, ∀n, P(‖X‖ ≥ t) ≤ 2 e^(−t²/(Cn)): the norm ‖X‖ is typically of order √n.
  • Claim: small (non-parametric) deformations can have huge effects. Ex.: for x ∈ L²(R^n), define L_τ x(u) = x(u − τ(u)) with τ ∈ C^∞. Even a tiny constant shift τ(u) = ε of a thin set C ⊂ R² gives ‖1_C − L_τ 1_C‖² = 2‖1_C‖², since the two indicators have disjoint supports.
  • The variance is high, and the bias is difficult to estimate. There are also few available samples. How to handle that?
  • (Figure: two visually similar images x and y with ‖x − y‖² = 2.)
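
A small numerical illustration of both claims, assuming only NumPy and toy data: a standard Gaussian vector has zero mean yet a norm of order √n, and shifting a thin indicator image by a single pixel already makes it orthogonal to the original in L².

```python
# Illustrates the two claims of this slide with NumPy (toy sizes, nothing from the talk).
import numpy as np

rng = np.random.default_rng(0)

# Claim 1: X ~ N(0, I_n) has E(X) = 0, but ||X|| concentrates around sqrt(n).
n = 10_000
X = rng.normal(size=n)
print(np.linalg.norm(X), np.sqrt(n))            # both close to 100

# Claim 2: a one-pixel shift of a thin indicator 1_C gives ||1_C - L_tau 1_C||^2 = 2 ||1_C||^2.
img = np.zeros((64, 64))
img[:, 32] = 1.0                                # C: a one-pixel-wide vertical line
shifted = np.roll(img, 1, axis=1)               # L_tau with a constant one-pixel shift
print(np.sum((img - shifted) ** 2), 2 * np.sum(img ** 2))   # equal: the supports are disjoint
```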

  5. Image variabilities
  • Geometric variability: groups acting on images (translation, rotation, scaling) and small deformations L_τ x(u) = x(u − τ(u)), τ ∈ C^∞; other sources: luminosity, occlusion. This is intraclass variability, hence not informative.
  • Class variability: the extraclass variability is the informative part.
  • High variance: how to reduce it?
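
A sketch of the deformation operator L_τ x(u) = x(u − τ(u)) as an image warp, assuming SciPy; the image and the smooth displacement field τ are invented for the example.

```python
# Sketch of L_tau x(u) = x(u - tau(u)) on a 2D image, via interpolation (SciPy).
import numpy as np
from scipy.ndimage import map_coordinates

def deform(x, tau_rows, tau_cols):
    """Sample x at the displaced coordinates u - tau(u) (bilinear interpolation)."""
    rows, cols = np.meshgrid(np.arange(x.shape[0]), np.arange(x.shape[1]), indexing="ij")
    coords = np.stack([rows - tau_rows, cols - tau_cols])
    return map_coordinates(x, coords, order=1, mode="nearest")

x = np.zeros((64, 64)); x[24:40, 24:40] = 1.0          # a simple square image
_, cols = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
tau = 2.0 * np.sin(2 * np.pi * cols / 64)              # small, smooth displacement field
x_def = deform(x, tau, np.zeros_like(tau))
print(np.linalg.norm(x - x_def))                       # small tau, yet a sizeable L2 change
```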

  6. Fighting the curse of dimensionality
  • Objective: build a representation Φx of x such that a simple (say Euclidean) classifier can estimate the label ŷ, with Φ : R^D → R^d and d ≪ D.
  • Designing Φ consists of building an approximation of a low-dimensional space which is regular with respect to the class: ‖Φx − Φx′‖ ≪ 1 ⇒ ŷ(x′) = ŷ(x).
  • A dimensionality reduction is necessary.


  7. Translation and rotation invariance
  • (Figure: translated and rotated versions x, y of the same image, with ‖x − y‖² = 2 in pixel space.)
  • Averaging is the key to get invariants.
  • Averaging makes the Euclidean distance meaningful in high dimension.

  8. An example: invariance to translation
  • Translation operator: L_a x(u) = x(u − a).
  • In many cases, one wishes to be globally invariant to translation; a simple way is to perform an averaging: Ax = ∫ L_a x da = ∫ x(u) du. It is the 0 frequency! And A L_a = A.
  • Even if it can be localized, the averaging A keeps only the low-frequency structures: the invariance brings a loss of information!
  • Bias issue! How do we recover the missing information?
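
A quick NumPy check of these statements on a toy signal: the global average equals the 0-frequency Fourier coefficient (up to normalization), it is unchanged by any shift (A L_a = A), and very different signals can share the same average, which is the loss of information mentioned above.

```python
# Averaging as a translation invariant: Ax = mean(x) is the 0 frequency, and A L_a = A.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)

Ax = x.mean()
print(np.allclose(Ax, np.fft.fft(x)[0].real / len(x)))   # Ax is the 0-frequency coefficient
print(np.allclose(Ax, np.roll(x, 17).mean()))            # A L_a = A for any shift a

# Loss of information: a very different signal with exactly the same average.
x2 = np.zeros_like(x)
x2[0] = x.sum()
print(np.allclose(x2.mean(), Ax), np.linalg.norm(x - x2))
```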

  9. Necessary mechanism: separation and contraction
  • In high dimension, typical distances are huge, thus an appropriate representation must contract the space: ‖Φx − Φx′‖ ≤ ‖x − x′‖.
  • While avoiding a collapse of the different classes: ∃ε > 0, y(x) ≠ y(x′) ⇒ ‖Φx − Φx′‖ ≥ ε.
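
A toy check of the two requirements, assuming NumPy, with Φ taken to be a simple block averaging (a non-expansive linear map standing in for a learned representation): the contraction inequality then holds for every pair, while the separation margin ε has to be measured on the data.

```python
# Contraction and separation, with Phi = averaging over blocks of 4 coefficients.
import numpy as np

rng = np.random.default_rng(0)

def phi(x, block=4):
    """Block averaging along the last axis: a linear, non-expansive map."""
    return x.reshape(*x.shape[:-1], -1, block).mean(axis=-1)

# Contraction: ||Phi x - Phi x'|| <= ||x - x'|| for any pair.
x, xp = rng.normal(size=256), rng.normal(size=256)
print(np.linalg.norm(phi(x) - phi(xp)) <= np.linalg.norm(x - xp))   # True

# Separation: the smallest distance between the two classes in Phi space is the margin eps.
class0 = rng.normal(loc=0.0, size=(50, 256))
class1 = rng.normal(loc=1.0, size=(50, 256))
dists = np.linalg.norm(phi(class0)[:, None, :] - phi(class1)[None, :, :], axis=-1)
print("empirical margin eps:", dists.min())
```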

  10. Deep learning: a technical breakthrough
  • Deep learning has made it possible to solve a large number of tasks that were considered extremely challenging for a computer.
  • The technique is generic, and its success implies that it reduces those sources of variability.
  • The previous properties hold for deep learning. How, and why?

  11. Deep convolutional networks
  • A deep network alternates linear operators W_j and non-linear operators ρ: x_{j+1} = ρ W_j x_j, with x_0 the input and x_J = Φx fed to the classifier.
  • With convolutions along the spatial variable u and channels λ: x_{j+1}(u, λ) = ρ( Σ_λ̃ x_j(·, λ̃) ⋆ w_{j,λ,λ̃}(u) ). The kernels w_{j,λ,λ̃} are learned.
  • Ref.: ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky et al.
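
A minimal sketch of the cascade x_{j+1} = ρ W_j x_j with convolutional operators W_j and ρ = ReLU, assuming PyTorch; the depth, channel widths and kernel size below are arbitrary placeholders, not the architecture of Krizhevsky et al.

```python
# Sketch of x_{j+1} = rho W_j x_j, with W_j a learned convolution and rho = ReLU (PyTorch).
import torch
import torch.nn as nn

channels = [3, 16, 32, 64]                # placeholder widths; x_0 has 3 channels
layers = []
for j in range(len(channels) - 1):
    layers += [nn.Conv2d(channels[j], channels[j + 1], kernel_size=3, padding=1),  # W_j: learned kernels
               nn.ReLU()]                                                          # rho
phi = nn.Sequential(*layers)              # x_J = Phi x

x0 = torch.randn(1, 3, 32, 32)            # a toy input x_0
xJ = phi(x0)
classifier = nn.Linear(xJ.numel(), 10)    # a simple classifier on top of Phi x
logits = classifier(xJ.flatten(start_dim=1))
print(xJ.shape, logits.shape)
```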

  12. Why mathematics about deep learning matters
  • Pure black box: few mathematical results are available. Many rely on a "manifold hypothesis", which is clearly wrong. Ex.: stability to diffeomorphisms.
  • No stability results: "small" variations of the input can have a large impact on the output, and this happens in practice. Ref.: Intriguing properties of neural networks, C. Szegedy et al.
  • No generalisation result: Rademacher complexity cannot explain the generalisation properties. Ref.: Understanding deep learning requires rethinking generalization, C. Zhang et al.
  • Shall we learn each layer from scratch (or use geometric priors)? The deep cascade makes the features hard to interpret. Ref.: Deep Roto-Translation Scattering for Object Classification, E. Oyallon and S. Mallat.

  13. Organization is a key
  • Consider a questionnaire problem: people answer 0 or 1 to each question. What does structuring this data mean? Ref.: Harmonic Analysis of Digital Data Bases, R. Coifman et al.
  • One can organize the questions, organize the answers, or both; in general, existing works tackle only one of these aspects.
  • When both questions and answers are organized, neighbours become meaningful: local metrics.
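
A rough sketch of organizing both questions and answers, assuming SciPy and random toy data: the rows (people) and columns (questions) of a 0/1 matrix are each reordered by hierarchical clustering. This is only a crude stand-in for the diffusion-geometry organization of Coifman et al., meant to show what "organizing both" looks like in code.

```python
# Crude co-organization of a 0/1 questionnaire: reorder people and questions so that
# neighbours are similar (a stand-in for the method of Coifman et al.).
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
answers = rng.integers(0, 2, size=(40, 25)).astype(float)    # people x questions (toy data)

row_order = leaves_list(linkage(answers, method="average", metric="hamming"))     # organize answers
col_order = leaves_list(linkage(answers.T, method="average", metric="hamming"))   # organize questions
organized = answers[np.ix_(row_order, col_order)]

print(organized.shape)   # same data; neighbouring rows/columns now carry a local metric
```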

  14. Organization permits the creation of invariance
  • Once (all) the sources of regularity are obtained, interpolating new points is possible (in statistical terms: a generalisation property!).
  • In the previous case, one can build a discriminative and invariant representation: for example, Haar wavelets on graphs (the +/− and 0 patterns over the organized questions and answers). Ref.: Harmonic Analysis of Digital Data Bases, R. Coifman et al.
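
A minimal 1D Haar transform along the organized question axis, assuming NumPy: the recursive average/difference pattern matches the +/− coefficients of the slide, and it is only a stand-in for Haar wavelets on a general graph or tree.

```python
# 1D Haar transform of one (organized) answer vector: averages and differences per scale.
import numpy as np

def haar_1d(v):
    """Orthonormal Haar coefficients; len(v) must be a power of 2."""
    v = np.asarray(v, dtype=float)
    coeffs = []
    while len(v) > 1:
        coeffs.append((v[0::2] - v[1::2]) / np.sqrt(2))   # detail (difference) coefficients
        v = (v[0::2] + v[1::2]) / np.sqrt(2)              # averages go to the coarser scale
    coeffs.append(v)                                      # global average: the invariant part
    return np.concatenate(coeffs[::-1])

answers = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)  # one person's organized answers
print(haar_1d(answers))   # a few large coefficients carry the discriminative structure
```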

  15. Organising the CNN representation: local support vectors
  • Consider a CNN of depth J. The local dimension is intractable! Ref.: Building a Regular Decision Boundary with Deep Networks, E. Oyallon.
  • Local support vectors (LSV) of order k at depth j: representations at depth j that are well classified by a k-NN but not by an l-NN for l < k.
  • They give a measure of the separation-contraction via the sets Γ_j^{k+1} = { x_j : card{ l ≤ k + 1 : y(x_j^{(l)}) ≠ y(x_j) } > k }, where x_j^{(l)} is the l-th nearest neighbour of x_j at depth j.
  • (Figure: examples of 0-LSV, 2-LSV, 4-LSV and k-LSV with k > 6.)
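
A sketch of computing LSV orders at a given depth, assuming scikit-learn for the neighbour search and synthetic representations: for each x_j we look for the smallest k such that a k-NN vote over the other points recovers its label, following the verbal definition above (the exact convention of the paper may differ). The resulting histogram of orders is exactly the complexity measure of the next slide.

```python
# Order of each local support vector at depth j: smallest k such that a k-NN
# majority vote over the other representations recovers its label.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lsv_orders(x_j, y, k_max=10):
    """x_j: (N, d) representations at depth j, y: (N,) integer labels."""
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(x_j)
    _, idx = nn.kneighbors(x_j)               # idx[:, 0] is the point itself
    orders = np.full(len(x_j), k_max + 1)     # k_max + 1 means "never well classified"
    for i in range(len(x_j)):
        for k in range(1, k_max + 1):
            votes = y[idx[i, 1:k + 1]]        # labels of the k nearest neighbours
            if np.bincount(votes).argmax() == y[i]:
                orders[i] = k
                break
    return orders

rng = np.random.default_rng(0)
x_j = np.vstack([rng.normal(0, 1, (100, 32)), rng.normal(2, 1, (100, 32))])   # toy "depth-j" features
y = np.repeat([0, 1], 100)
print(np.bincount(lsv_orders(x_j, y)))   # histogram of k-LSV counts across k
```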

  16. A complexity measure
  • Count the number of k-local support vectors at different depths.
  • A slow decay to the stationary regime indicates a high complexity (separation); a small amount indicates contraction.

  17. An organisation of the representation
  • There is a progressive localisation, which explains why a 1-NN (or a Gaussian SVM) works better with depth: linear metrics are more meaningful in low dimension.
  • How does the representation get localized? A necessary variability reduction.

  18. Identifying the variabilities?
  • Several works showed that a deep net exhibits some covariance. Ref.: Understanding deep features with computer-generated imagery, M. Aubry and B. Russell.
  • A manifold of faces appears at a certain depth. Ref.: Unsupervised Representation Learning with Deep Convolutional GANs, Radford, Metz & Chintala.
  • Can we use these?
