DATA 1 Organizing Deep Networks
Edouard Oyallon. Advisor: Stéphane Mallat.
Following the works of Laurent Sifre, Joan Bruna, …
Collaborators: Eugene Belilovsky, Sergey Zagoruyko, Bogdan Cirstea, Jörn Jacobsen, …
DATA 2 Classification of signals
• Let (X, Y) ∈ R^n × Y be random variables, n > 0.
• Problem: estimate ŷ such that ŷ = arg inf_{ỹ} E(|ỹ(X) − Y|).
• We are given a training set (x_i, y_i) ∈ R^n × Y to build ŷ.
• Say one can write ŷ = Classifier(Φx), the Classifier being built from (Φx_i, y_i).
• 3 ways to build Φ: supervised (from (x_i, y_i)_i), unsupervised (from (x_i)_i), or predefined (geometric priors).
(figure: a 2D toy example, n = 2, with a linear classifier w separating the two classes of Y)
A minimal sketch of this setup is given below.
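To make the setup concrete, here is a minimal sketch (not from the slides, with placeholder choices) of ŷ = Classifier(Φx): a representation Φ, here simply the raw signal standing in for a predefined map, and a simple classifier trained on the pairs (Φx_i, y_i).

```python
# Minimal sketch of y_hat = Classifier(Phi x) on a toy training set (x_i, y_i).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def phi(x):
    # Placeholder "predefined" representation: here simply the raw signal.
    # In practice Phi could be supervised (learned end-to-end), unsupervised,
    # or predefined from geometric priors.
    return x.reshape(len(x), -1)

# Toy training set (x_i, y_i) in R^n x {0, 1}
n = 64
x_train = rng.normal(size=(500, n))
y_train = (x_train.mean(axis=1) > 0).astype(int)

clf = LogisticRegression().fit(phi(x_train), y_train)   # Classifier built on (Phi x_i, y_i)
y_hat = clf.predict(phi(rng.normal(size=(5, n))))        # estimate labels of new samples
```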
DATA 3 High-dimensional classification
(x_i, y_i) ∈ R^{224²} × {1, …, 1000}, i < 10^6 → ŷ(x)?
• Estimation problem: use the training set to predict labels ("rhino" vs. not a "rhino").
• Datasets: Caltech 101, etc.
(figure: sample training images of rhinos and non-rhinos)
DATA 4 High-dimensional variabilities
• Claim: in R^n, n ≫ 1, the variance is huge.
  Ex.: if X ∼ N(0, I_n), then E(X) = 0 and ∃ C > 0, ∀ n, P(‖X‖ ≥ t) ≤ 2 e^{−t²/(Cn)}.
• Claim: small deformations (not parametric) can have huge effects.
  Ex.: for x ∈ L²(R^n), define L_τ x(u) = x(u − τ(u)) with τ ∈ C^∞. Taking τ(u) = ε and C ⊂ R² a thin set, ‖1_C − L_τ 1_C‖² = 2‖1_C‖².
• The variance is high, and the bias is difficult to estimate. There are also few available samples… How to handle that?
(figure: two nearly identical images x, y with ‖x − y‖² = 2)
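Both claims can be checked numerically. The sketch below (assuming only numpy) draws Gaussian vectors to see ‖X‖ grow like √n, and shifts a thin indicator image by one pixel to obtain a squared distance of 2‖1_C‖².

```python
# Numerical check of the two claims above.
import numpy as np

rng = np.random.default_rng(0)

# (i) For X ~ N(0, I_n), ||X|| is of order sqrt(n): typical distances blow up with n.
for n in (10, 100, 10000):
    X = rng.normal(size=(500, n))
    print(n, np.mean(np.linalg.norm(X, axis=1)) / np.sqrt(n))  # concentrates near 1

# (ii) Indicator of a thin set C in R^2, shifted by a single pixel (tau(u) = eps).
img = np.zeros((128, 128))
img[:, 64] = 1.0                      # 1_C : a thin vertical line
shifted = np.roll(img, 1, axis=1)     # L_tau 1_C
d2 = np.sum((img - shifted) ** 2)
print(d2, 2 * np.sum(img ** 2))       # squared distance equals 2 ||1_C||^2 : supports are disjoint
```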
DATA 5 Image variabilities
• Geometric variability: groups acting on images (translation, rotation, scaling); small deformations L_τ x(u) = x(u − τ(u)), τ ∈ C^∞ (warp I − τ); other sources: luminosity, occlusion.
• Class variability: intraclass variability (not informative) and extraclass variability.
• High variance: how to reduce it?
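As an illustration of the geometric variability, the following sketch (assuming scipy is available; the helper name `warp` is ours) implements the warping operator L_τ x(u) = x(u − τ(u)) for a small, smooth displacement field τ.

```python
# Sketch of the warping operator L_tau x(u) = x(u - tau(u)) on a 2D image.
import numpy as np
from scipy.ndimage import map_coordinates

def warp(x, tau_u, tau_v):
    """Apply L_tau to a 2D image x, where (tau_u, tau_v) is a smooth displacement field."""
    u, v = np.meshgrid(np.arange(x.shape[0]), np.arange(x.shape[1]), indexing="ij")
    coords = np.stack([u - tau_u, v - tau_v])          # sample x at u - tau(u)
    return map_coordinates(x, coords, order=1, mode="nearest")

x = np.zeros((64, 64)); x[24:40, 24:40] = 1.0          # a toy image
u, v = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
tau_u = 2.0 * np.sin(2 * np.pi * v / 64)               # small, smooth, non-parametric deformation
tau_v = np.zeros_like(tau_u)
x_warped = warp(x, tau_u, tau_v)
```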
DATA 6 Fighting the curse of dimensionality
• Objective: build a representation Φx of x such that a simple (say Euclidean) classifier can estimate the label ŷ.
(diagram: x ↦ Φx ∈ R^D ↦ w(Φx) ∈ R^d, with D ≫ d)
• Designing Φ consists of building an approximation of a low-dimensional space which is regular with respect to the class: ‖Φx − Φx'‖ ≪ 1 ⇒ ŷ(x) = ŷ(x').
• Necessary dimensionality reduction.
DATA 7
(figure: "Translation" and "Rotation" panels, pairs x, y of translated/rotated copies with ‖x − y‖² = 2)
• Averaging is the key to get invariants.
• Averaging makes the Euclidean distance meaningful in high dimension.
DATA 8 An example: invariance to translation
• Translation operator: L_a x(u) = x(u − a).
• In many cases, one wishes to be globally invariant to translation; a simple way is to perform an averaging: Ax = ∫ L_a x da = ∫ x(u) du. It's the 0 frequency! AL_a = A.
• Even if the averaging A can be localized, it keeps only the low-frequency structures: the invariance brings a loss of information!
• Bias issue! How do we recover the missing information?
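A quick numerical illustration of the two points above (a sketch, nothing more): the global average is invariant to circular shifts, and very different images can share the same average, so the invariant loses information.

```python
# Sketch: the global average A x (the 0 frequency) is translation invariant but not informative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32))

Ax = x.mean()                                        # A x
Ax_shift = np.roll(x, (5, -3), axis=(0, 1)).mean()   # A L_a x
print(np.isclose(Ax, Ax_shift))                      # True: A L_a = A

y = rng.normal(size=(32, 32))
y += Ax - y.mean()                                   # a very different image with the same average
print(np.isclose(y.mean(), Ax), np.linalg.norm(x - y) > 0)   # the invariant loses information
```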
DATA 9 Necessary mechanism: separation and contraction
• In high dimension, typical distances are huge, thus an appropriate representation Φ must contract the space: ‖Φx − Φx'‖ ≤ ‖x − x'‖.
• While avoiding collapsing the different classes: ∃ ε > 0, y(x) ≠ y(x') ⇒ ‖Φx − Φx'‖ ≥ ε.
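These two properties can be probed empirically for any candidate representation. The sketch below (phi is a placeholder argument, not a specific method) computes the worst-case contraction ratio and the smallest between-class distance ε on a batch of labelled samples.

```python
# Sketch: empirical check of contraction and separation for a representation Phi.
import numpy as np

def contraction_and_margin(phi, x, y):
    z = phi(x)
    dx = np.linalg.norm(x[:, None] - x[None, :], axis=-1)   # ||x - x'||
    dz = np.linalg.norm(z[:, None] - z[None, :], axis=-1)   # ||Phi x - Phi x'||
    off = ~np.eye(len(x), dtype=bool)
    contraction = np.max(dz[off] / dx[off])                 # want <= 1 (the space is contracted)
    diff_class = off & (y[:, None] != y[None, :])
    margin = dz[diff_class].min()                           # want >= eps > 0 (classes do not collapse)
    return contraction, margin

x = np.random.default_rng(0).normal(size=(100, 20))
y = (x[:, 0] > 0).astype(int)
print(contraction_and_margin(lambda v: v / 10.0, x, y))     # a trivially contracting Phi
```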
DATA 10 Deep learning: technical breakthrough
• Deep learning has made it possible to solve a large number of tasks that were considered extremely challenging for a computer.
• The technique is generic, and its success implies that it reduces those sources of variability.
• The previous properties (invariance, separation, contraction) hold for deep learning.
• How, why?
DATA 11
• x_{j+1} = ρ W_j x_j, with W_j a linear operator and ρ a non-linear operator.
• Cascade: x_0 → ρW_0 → x_1 → ρW_1 → x_2 → … → ρW_{J−1} → x_J = Φx → classifier.
• In channel form: x_{j+1}(u, λ) = ρ( Σ_{λ̃} x_j(·, λ̃) ⋆ w_{j,λ,λ̃}(u) ); the kernels w_{j,λ,λ̃} are learned.
Ref.: ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky et al.
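A minimal PyTorch sketch of this cascade (an assumed toy architecture, not the network of the reference): each layer applies a learned convolution W_j followed by a pointwise non-linearity ρ, and the final representation x_J = Φx feeds a linear classifier.

```python
# Sketch of the cascade x_{j+1} = rho(W_j x_j) followed by a linear classifier.
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, depth=4, width=32, n_classes=10):
        super().__init__()
        layers, c_in = [], 3
        for _ in range(depth):                         # x_{j+1} = rho(W_j x_j)
            layers += [nn.Conv2d(c_in, width, 3, padding=1), nn.ReLU()]
            c_in = width
        self.features = nn.Sequential(*layers)         # Phi
        self.classifier = nn.Linear(width, n_classes)  # classifier on top of x_J

    def forward(self, x):
        x = self.features(x)                           # x_J = Phi x
        x = x.mean(dim=(2, 3))                         # spatial averaging before the classifier
        return self.classifier(x)

logits = SmallConvNet()(torch.randn(8, 3, 32, 32))
```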
DATA 12 Why mathematics about deep learning is important
• Pure black box. Few mathematical results are available; many rely on a "manifold hypothesis", which is clearly wrong (e.g., stability to diffeomorphisms). Ref.: Deep Roto-Translation Scattering for Object Classification, E. Oyallon and S. Mallat.
• No stability results. "Small" variations of the input can have a large impact on the output, and this happens. Ref.: Intriguing properties of neural networks, C. Szegedy et al.
• No generalization result. Rademacher complexity cannot explain the generalization properties. Ref.: Understanding deep learning requires rethinking generalization, C. Zhang et al.
• Shall we learn each layer from scratch (geometric priors)? The deep cascade makes the features hard to interpret.
DATA 13 Organization is a key
• Consider a questionnaire problem: people answer 0 or 1 to some questions. What does structuration mean? Ref.: Harmonic Analysis of Digital Data Bases, R. Coifman et al.
• One can organize the questions, organize the answers, or both; in general, structuration works tackle only one of the two aspects.
(figure: answers × questions matrices reordered along rows, columns, or both)
• With both organizations, neighbours become meaningful: local metrics.
DATA 14 Organization permits creation of invariance
• Once (all) the sources of regularity are captured, interpolating new points is possible (in statistical terms: the generalization property!).
• In the previous case, one can build a discriminative and invariant representation: for example, Haar wavelets on graphs. Ref.: Harmonic Analysis of Digital Data Bases, R. Coifman et al.
(figure: a Haar wavelet, with values +, +, −, 0, 0, supported on the organized questions × answers matrix)
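A toy sketch of the idea (simplified: a plain 1D Haar transform on reordered questions rather than true wavelets on a graph): once similar questions are made neighbours, the Haar average captures the invariant part and the detail coefficients capture the discriminative part.

```python
# Toy sketch: organize the questions, then take a 1D Haar transform of each answer vector.
import numpy as np

def haar_1d(v):
    """One full 1D Haar transform of a vector whose length is a power of two."""
    v = v.astype(float).copy()
    out, n = [], len(v)
    while n > 1:
        avg = (v[0:n:2] + v[1:n:2]) / 2.0
        diff = (v[0:n:2] - v[1:n:2]) / 2.0
        out.append(diff)          # detail (discriminative) coefficients
        v[: n // 2] = avg
        n //= 2
    out.append(v[:1])             # global average (invariant part)
    return np.concatenate(out[::-1])

answers = np.random.default_rng(0).integers(0, 2, size=(5, 16))   # 5 people, 16 questions (0/1)
order = np.argsort(answers.mean(axis=0))   # crude "organization" of the questions
coeffs = np.array([haar_1d(a[order]) for a in answers])
```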
DATA 15 Organising the CNN representation: local support vectors
Ref.: Building a Regular Decision Boundary with Deep Networks, E. Oyallon
• Consider a CNN of depth J. The local dimension is intractable!
• Local support vectors of order k at depth j: representations at depth j that are well classified by a k-NN but not by an l-NN for l < k (illustrated: 0-LSV, 2-LSV, 4-LSV, k-LSV with k > 6).
• They give a measure of the separation-contraction via the nested sets Γ_j^k, built from the counts card{ l ≤ k+1 : y(x_j^{(l)}) ≠ y(x_j) }, where x_j^{(l)} denotes the l-th nearest neighbour of x_j at depth j; the k-LSVs are the points of Γ_j^k ∖ Γ_j^{k+1}.
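The sketch below computes, under one possible reading of the definition above, the LSV order of each training representation at a given depth: the smallest k such that a k-NN vote over the other points predicts its label.

```python
# Sketch (assumed reading of the definition): order of each local support vector at depth j.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lsv_order(z, y, k_max=15):
    """z: (N, d) representations at depth j, y: (N,) integer labels. Returns the LSV order per point."""
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(z)
    _, idx = nn.kneighbors(z)                 # idx[:, 0] is the point itself
    orders = np.full(len(z), k_max + 1)
    for i in range(len(z)):
        for k in range(1, k_max + 1):
            votes = y[idx[i, 1:k + 1]]        # labels of the k nearest neighbours
            if np.bincount(votes).argmax() == y[i]:
                orders[i] = k                 # well classified by a k-NN, not by any smaller one
                break
    return orders
```

Counting how many points have a given order at each depth j yields the complexity measure of the next slide.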
DATA 16 Complexity measure
(plot: number of k-local support vectors at different depths)
• A slow decay to the stationary regime indicates high complexity (separation).
• A small amount indicates contraction.
DATA 17 An organisation of the representation
• There is a progressive localisation, which explains why a 1-NN (or a Gaussian SVM) works better with depth: linear metrics are more meaningful in low dimension.
• How does the representation get localized? Necessary variability reduction.
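One way to quantify this progressive localisation (a sketch with hypothetical helper names; the per-depth representations are assumed to be precomputed) is to score a nearest-neighbour classifier on the representations x_j extracted at each depth j and watch the accuracy increase.

```python
# Sketch: 1-NN accuracy as a function of depth, on precomputed representations.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def knn_accuracy_per_depth(representations, y, k=1):
    """representations: list of (N, d_j) arrays, one per depth j (assumed precomputed)."""
    scores = []
    for z in representations:
        clf = KNeighborsClassifier(n_neighbors=k)
        scores.append(cross_val_score(clf, z, y, cv=5).mean())   # proxy for leave-one-out
    return scores
```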
DATA 18 Identifying the variabilities?
• Several works showed that a deep net exhibits some covariance. Ref.: Understanding deep features with computer-generated imagery, M. Aubry and B. Russell.
• Manifold of faces at a certain depth (figure).
• Can we use these? Ref.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, A. Radford, L. Metz and S. Chintala.