Deep Neural Networks: Mathematical Mysteries for High Dimensional Learning
Stéphane Mallat, École Normale Supérieure, www.di.ens.fr/data
High Dimensional Learning
• High-dimensional data: $x = (x(1), \dots, x(d)) \in \mathbb{R}^d$.
• Classification: estimate a class label $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i)\}_{i \le n}$.
• Image classification: $d = 10^6$, huge variability inside classes, find invariants.
• [Figure: example classes: Anchor, Joshua Tree, Beaver, Lotus Water Lily]
High Dimensional Learning
• High-dimensional data: $x = (x(1), \dots, x(d)) \in \mathbb{R}^d$.
• Regression: approximate a functional $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i) \in \mathbb{R}\}_{i \le n}$.
• Physics: energy $f(x)$ of a state vector $x$ (astronomy, quantum chemistry). Importance of symmetries.
Curse of Dimensionality
• $f(x)$ can be approximated from examples $\{x_i, f(x_i)\}_i$ by local interpolation if $f$ is regular and there are close examples.
• [Figure: a grid of samples in $[0,1]^2$ around a query point $x$]
• Need $\epsilon^{-d}$ points to cover $[0,1]^d$ at a Euclidean distance $\epsilon$.
• Problem: in high dimension, $\|x - x_i\|$ is always large.
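A minimal numerical sketch of this last point (assuming NumPy; the sample size and dimensions are illustrative): with a fixed number of random samples in $[0,1]^d$, the distance from a query point to its nearest neighbour grows quickly with $d$, so local interpolation has nothing nearby to interpolate from.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of reference samples x_i, kept fixed across dimensions

for d in [1, 2, 10, 100, 1000]:
    X = rng.random((n, d))        # n points drawn uniformly in [0, 1]^d
    x = rng.random(d)             # a query point
    dists = np.linalg.norm(X - x, axis=1)
    print(f"d = {d:5d}   nearest-neighbour distance = {dists.min():.3f}")

# With n fixed, min_i ||x - x_i|| grows with d: local interpolation breaks down.
```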
Multiscale Separation
• Variables $x(u)$ indexed by a low-dimensional $u$: time/space, pixels in images, particles in physics, words in text...
• Multiscale interactions of $d$ variables: from $d^2$ pairwise interactions to $O(\log_2 d)$ multiscale interactions.
• Multiscale analysis: wavelets on groups of symmetries, hierarchical architecture.
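A minimal sketch of the multiscale idea (assuming NumPy and a signal length that is a power of two; the Haar-type cascade is an illustrative choice, not the construction used later in the talk): a hierarchical cascade of pairwise averages and differences summarizes $d$ samples with $\log_2 d$ scales instead of considering all $d^2$ pairwise interactions.

```python
import numpy as np

def haar_scales(x):
    """Cascade of pairwise averages/differences: log2(d) scales for d samples."""
    averages, details = x.astype(float), []
    while len(averages) > 1:
        a = (averages[0::2] + averages[1::2]) / 2   # coarser approximation
        w = (averages[0::2] - averages[1::2]) / 2   # wavelet-like detail at this scale
        details.append(w)
        averages = a
    return averages, details

x = np.arange(16.0)
approx, details = haar_scales(x)
print(len(details), "scales for d =", len(x))       # 4 = log2(16)
```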
Overview
• One-hidden-layer networks, approximation theory, and the curse of dimensionality
• Kernel learning
• Dimension reduction with a change of variables
• Deep neural networks and symmetry groups
• Wavelet scattering transforms
• Applications and many open questions
Reference: Understanding Deep Convolutional Networks, arXiv 2016.
Learning as an Approximation
• To estimate $f(x)$ from a sampling $\{x_i, y_i = f(x_i)\}_{i \le M}$ we must build an $M$-parameter approximation $f_M$ of $f$.
• Precise sparse approximation requires some "regularity".
• For binary classification: $f(x) = 1$ if $x \in \Omega$ and $f(x) = -1$ if $x \notin \Omega$, so $f(x) = \mathrm{sign}(\tilde f(x))$ where $\tilde f$ is potentially regular.
• What type of regularity? How to compute $f_M$?
1 Hidden Layer Neural Networks
• One-hidden-layer network (a sum of ridge functions):
$$f_M(x) = \sum_{n=1}^{M} \alpha_n \, \rho(w_n \cdot x + b_n), \qquad w_n \cdot x = \sum_k w_{k,n}\, x(k).$$
• The $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned: a non-linear approximation with $M$ terms.
• Theorem (Cybenko; Hornik, Stinchcombe, White): for "reasonable" bounded $\rho(u)$ and appropriate choices of $w_{n,k}$ and $\alpha_n$, for all $f \in L^2[0,1]^d$, $\lim_{M \to \infty} \|f - f_M\| = 0$.
• No big deal: the curse of dimensionality is still there.
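A minimal NumPy sketch of the one-hidden-layer approximation $f_M(x) = \sum_n \alpha_n \rho(w_n \cdot x + b_n)$ with $\rho(u) = \max(u, 0)$; the random weights stand in for learned ones and are purely illustrative.

```python
import numpy as np

def f_M(x, W, b, alpha):
    """One hidden layer: x in R^d, W of shape (M, d), b and alpha of shape (M,)."""
    hidden = np.maximum(W @ x + b, 0.0)   # rho(w_n . x + b_n) for n = 1..M
    return alpha @ hidden                  # sum_n alpha_n * hidden_n

d, M = 10, 50
rng = np.random.default_rng(0)
W, b, alpha = rng.normal(size=(M, d)), rng.normal(size=M), rng.normal(size=M)
x = rng.random(d)
print(f_M(x, W, b, alpha))
```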
1 Hidden Layer Neural Networks
• One-hidden-layer network: $f_M(x) = \sum_{n=1}^{M} \alpha_n \, \rho(w_n \cdot x + b_n)$ with $w_n \cdot x = \sum_k w_{k,n}\, x(k)$; the $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned (non-linear approximation).
• Fourier series: $\rho(u) = e^{iu}$ gives $f_M(x) = \sum_{n=1}^{M} \alpha_n \, e^{i w_n \cdot x}$.
• For nearly all $\rho$: essentially the same approximation results.
Piecewise Linear Approximation
• Piecewise linear approximation with $\rho(u) = \max(u, 0)$:
$$\tilde f(x) = \sum_n a_n \, \rho(x - n\epsilon), \qquad x \in [0, 1].$$
• If $f$ is Lipschitz, $|f(x) - f(x')| \le C|x - x'|$, then $|f(x) - \tilde f(x)| \le C\epsilon$.
• Need $M = \epsilon^{-1}$ points to cover $[0,1]$ at a distance $\epsilon$, hence $\|f - f_M\| \le C\, M^{-1}$.
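A minimal sketch of this 1D construction (assuming NumPy; the function $f$ and step $\epsilon$ are illustrative): least-squares coefficients $a_n$ on the translated ReLU basis $\rho(x - n\epsilon)$ give a piecewise-linear approximation whose error is at most of order $\epsilon$ for a Lipschitz $f$.

```python
import numpy as np

f = lambda x: np.abs(np.sin(3 * x))        # a Lipschitz function on [0, 1], with f(0) = 0
eps = 0.05
knots = np.arange(0, 1, eps)               # breakpoints n * eps, M = 1/eps basis functions
x = np.linspace(0, 1, 2000)

basis = np.maximum(x[:, None] - knots[None, :], 0.0)   # rho(x - n * eps)
a, *_ = np.linalg.lstsq(basis, f(x), rcond=None)       # fit the coefficients a_n
err = np.max(np.abs(basis @ a - f(x)))
print(f"M = {len(knots)} ReLU atoms, max error = {err:.4f}")   # at most of order eps
```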
Linear Ridge Approximation
• Piecewise linear ridge approximation with $\rho(u) = \max(u, 0)$, for $x \in [0,1]^d$:
$$\tilde f(x) = \sum_n a_n \, \rho(w_n \cdot x - n\epsilon).$$
• If $f$ is Lipschitz, $|f(x) - f(x')| \le C\|x - x'\|$, then sampling at a distance $\epsilon$ gives $|f(x) - \tilde f(x)| \le C\epsilon$.
• Need $M = \epsilon^{-d}$ points to cover $[0,1]^d$ at a distance $\epsilon$, hence $\|f - f_M\| \le C\, M^{-1/d}$: curse of dimensionality!
Approximation with Regularity
• What prior condition makes learning possible?
• Approximation of regular functions in $C^s[0,1]^d$: $|f(x) - p_u(x)| \le C|x - u|^s$ for all $x, u$, with $p_u(x)$ a polynomial.
• If $|x - u| \le \epsilon^{1/s}$ then $|f(x) - p_u(x)| \le C\epsilon$; need $M = \epsilon^{-d/s}$ points to cover $[0,1]^d$ at a distance $\epsilon^{1/s}$, hence $\|f - f_M\| \le C\, M^{-s/d}$.
• One cannot do better in $C^s[0,1]^d$; this is not good because $s \ll d$: failure of classical approximation theory.
Kernel Learning
• Change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$ to nearly linearize $f(x)$, which is approximated by a 1D projection:
$$\tilde f(x) = \langle \Phi(x), w \rangle = \sum_k w_k \, \phi_k(x).$$
• Data: $x \in \mathbb{R}^d \mapsto \Phi(x) \in \mathbb{R}^{d'}$, then a linear classifier $w$.
• Metric: $\|x - x'\|$ becomes $\|\Phi(x) - \Phi(x')\|$.
• How and when is it possible to find such a $\Phi$? What "regularity" of $f$ is needed?
Increase Dimensionality
• Proposition: there exists a hyperplane separating any two subsets of $N$ points $\{\Phi x_i\}_i$ in dimension $d' > N + 1$ if the $\{\Phi x_i\}_i$ are not in an affine subspace of dimension $< N$.
• So one can choose $\Phi$ to increase dimensionality; the problem is generalisation (overfitting).
• Example: the Gaussian kernel $\langle \Phi(x), \Phi(x') \rangle = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$, for which $\Phi(x)$ has dimension $d' = \infty$.
• If $\sigma$ is small, this behaves like a nearest-neighbour classifier.
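A minimal sketch of the Gaussian kernel (assuming NumPy; the toy data, labels, $\sigma$ and regularization are illustrative): the infinite-dimensional $\Phi$ is never built explicitly, since $\langle \Phi(x), \Phi(x') \rangle$ is computed directly, and kernel ridge regression gives a linear classifier in feature space.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """K[i, j] = exp(-||X_i - Y_j||^2 / (2 sigma^2)) = <Phi(X_i), Phi(Y_j)>."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.random((200, 5))                       # training points in [0, 1]^5
y = np.sign(X[:, 0] - 0.5)                     # a toy binary label
sigma, lam = 0.5, 1e-3
K = gaussian_gram(X, X, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # kernel ridge coefficients
x_test = rng.random((10, 5))
f_tilde = gaussian_gram(x_test, X, sigma) @ alpha      # <Phi(x), w> via the kernel trick
print(np.sign(f_tilde))
```

With a very small $\sigma$, each test point is dominated by its closest training point, which is the nearest-neighbour behaviour mentioned on the slide.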
Reduction of Dimensionality
• Discriminative change of variable $\Phi(x)$: $\Phi(x) \ne \Phi(x')$ if $f(x) \ne f(x')$, so there exists $\tilde f$ with $f(x) = \tilde f(\Phi(x))$.
• If $\tilde f$ is Lipschitz, $|\tilde f(z) - \tilde f(z')| \le C\|z - z'\|$ with $z = \Phi(x)$, then $|f(x) - f(x')| \le C\|\Phi(x) - \Phi(x')\|$.
• Discriminative: $\|\Phi(x) - \Phi(x')\| \ge C^{-1} |f(x) - f(x')|$.
• For $x \in \Omega$, if $\Phi(\Omega)$ is bounded and of low dimension $d'$ then $\|f - f_M\| \le C\, M^{-1/d'}$.
Deep Convolutional Networks
• The revival of neural networks (Y. LeCun): a cascade $x \to L_1$ (linear convolution) $\to \rho \to L_2$ (linear convolution) $\to \rho \to \dots \to \Phi(x) \to$ linear classification $y = \tilde f(\Phi(x))$, with the non-linear scalar "neuron" $\rho(u) = \max(u, 0)$.
• The cascade computes hierarchical invariants and a linearization.
• Optimize the $L_j$ with architecture constraints: over $10^9$ parameters.
• Exceptional results for images, speech, language, bio-data... Why does it work so well? A difficult problem.
ImageNet Database
• A database with 1 million images and 2000 classes.
AlexNet Deep Convolutional Network (A. Krizhevsky, I. Sutskever, G. Hinton)
• ImageNet supervised training: $1.2 \times 10^6$ examples, $10^3$ classes; 15.3% testing error in 2012.
• Newer networks reach 5% error, with up to 150 layers!
• [Figure: wavelets]
Image Classification
Scene Labeling / Car Driving
Why Understanding? (Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus)
• Adversarial examples: $\tilde x = x + \epsilon$ with $\|\epsilon\| < 10^{-2}\, \|x\|$; $x$ is correctly classified while $\tilde x$ is classified as an ostrich.
• Trial-and-error testing cannot guarantee reliability.
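A minimal sketch of this instability, using a plain high-dimensional linear classifier rather than a trained deep network (the model and data are illustrative assumptions, not the experiment cited above): a perturbation aligned with the gradient sign, of relative norm $10^{-2}$, flips the predicted class.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1_000_000                           # image-sized input, d = 10^6
w = rng.normal(size=d)                  # a fixed linear classifier: sign(w . x)
x = rng.normal(size=d)
if w @ x < 0:
    x = -x                              # make x "correctly" classified as +1

delta = -np.sign(w)                     # gradient-sign direction against the class
delta *= 1e-2 * np.linalg.norm(x) / np.linalg.norm(delta)   # ||delta|| = 1e-2 ||x||
x_adv = x + delta
print("relative perturbation:", np.linalg.norm(delta) / np.linalg.norm(x))
print("class of x:", np.sign(w @ x), "  class of x + delta:", np.sign(w @ x_adv))
```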
Deep Convolutional Networks
• A cascade of layers $x_j = \rho\, L_j\, x_{j-1}$, from $x(u)$ to $x_1(u, k_1)$, $x_2(u, k_2)$, ..., $x_J(u, k_J)$, followed by a classification.
• $L_j$ is a linear combination of convolutions and subsampling, with a sum across channels:
$$x_j(u, k_j) = \rho\Big( \sum_k x_{j-1}(\cdot, k) \star h_{k_j, k}(u) \Big).$$
• $\rho$ is contractive: $|\rho(u) - \rho(u')| \le |u - u'|$, e.g. $\rho(u) = \max(u, 0)$ or $\rho(u) = |u|$.
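A minimal NumPy sketch of one such layer, written in 1D for brevity (the random filters and sizes are illustrative): convolve each input channel, sum across channels, apply the contractive $\rho(u) = \max(u, 0)$, then subsample.

```python
import numpy as np

def conv_layer(x_prev, h, subsample=2):
    """x_prev: (K_in, N) channels, h: (K_out, K_in, S) filters -> (K_out, N // subsample)."""
    K_out, K_in, _ = h.shape
    N = x_prev.shape[1]
    out = np.zeros((K_out, N))
    for kj in range(K_out):
        for k in range(K_in):          # sum of convolutions across input channels
            out[kj] += np.convolve(x_prev[k], h[kj, k], mode="same")
    return np.maximum(out, 0.0)[:, ::subsample]   # rho = ReLU, then subsampling

rng = np.random.default_rng(0)
x0 = rng.random((3, 64))               # 3 input channels of length 64
h = rng.normal(size=(8, 3, 5))         # 8 output channels, filters of length 5
x1 = conv_layer(x0, h)
print(x1.shape)                        # (8, 32)
```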
Linearisation in Deep Networks (A. Radford, L. Metz, S. Chintala)
• Trained on a database of faces: linearization.
• On a database including bedrooms: interpolations.
Many Questions
• Network: $x(u) \to \rho L_1 \to x_1(u, k_1) \to \rho L_2 \to x_2(u, k_2) \to \dots \to \rho L_J \to x_J(u, k_J) \to$ classification.
• Why convolutions? Translation covariance.
• Why no overfitting? Contractions, dimension reduction.
• Why a hierarchical cascade?
• Why introduce non-linearities?
• How and what to linearise?
• What are the roles of the multiple channels in each layer?
Linear Dimension Reduction
• Classes $\Omega_1, \Omega_2, \Omega_3$; level sets of $f(x)$: $\Omega_t = \{x : f(x) = t\}$.
• If level sets (classes) are parallel to a linear space, then variables are eliminated by linear projections: invariants.
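A minimal sketch of this statement (the choice $f(x) = g(\langle x, w \rangle)$ is an illustrative assumption): when every level set is parallel to the hyperplane orthogonal to $w$, the one-dimensional linear projection $\langle x, w \rangle$ carries all the information about $f$, and the remaining $d - 1$ directions are invariants that can be projected away.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
w = rng.normal(size=d)
w /= np.linalg.norm(w)
f = lambda x: np.tanh(x @ w)               # level sets are hyperplanes orthogonal to w

x = rng.normal(size=d)
v = rng.normal(size=d)
v -= (v @ w) * w                            # a direction orthogonal to w (an invariant)
print(np.allclose(f(x), f(x + 3.0 * v)))    # True: moving along v does not change f
print(np.isclose(f(x), f(x + 0.1 * w)))     # False: moving along w does change f
```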
Linearise for Dimensionality Reduction
• If the level sets $\Omega_t = \{x : f(x) = t\}$ are not parallel to a linear space: linearise them with a change of variable $\Phi(x)$, then reduce dimension with linear projections.
• Difficult because the $\Omega_t$ are high-dimensional, irregular, and known only from few samples.
Level Set Geometry: Symmetries
• The curse of dimensionality forces us to look not at local but at global geometry of the level sets (classes), characterised by their global symmetries.
• A symmetry is an operator $g$ which preserves the level sets: $\forall x$, $f(g.x) = f(x)$ (a global property).
• If $g_1$ and $g_2$ are symmetries then $g_1.g_2$ is also a symmetry: $f(g_1.g_2.x) = f(g_2.x) = f(x)$.
Groups of Symmetries
• $G = \{\text{all symmetries}\}$ is a (generally unknown) group: $\forall (g, g') \in G^2$, $g.g' \in G$; inverse: $\forall g \in G$, $g^{-1} \in G$; associative: $(g.g').g'' = g.(g'.g'')$.
• If commutative, $g.g' = g'.g$: Abelian group.
• A group of dimension $n$ has $n$ generators: $g = g_1^{p_1} g_2^{p_2} \cdots g_n^{p_n}$.
• Lie group: infinitesimal generators (a Lie algebra).
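A minimal sketch of the closure property stated above (the invariant $f(x) = \|x\|$ and the planar rotation group are illustrative choices): if $g_1$ and $g_2$ preserve the level sets of $f$, then so does their composition $g_1.g_2$.

```python
import numpy as np

def rotation(theta):
    """A planar rotation: an element of a commutative, one-parameter Lie group."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

f = lambda x: np.linalg.norm(x)                # f is invariant under every rotation
g1, g2 = rotation(0.3), rotation(1.2)

x = np.array([2.0, -1.0])
print(np.isclose(f(g1 @ x), f(x)))             # g1 is a symmetry
print(np.isclose(f(g2 @ x), f(x)))             # g2 is a symmetry
print(np.isclose(f(g1 @ g2 @ x), f(x)))        # so is the composition g1.g2
```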
Translation and Deformations
• Digit classification: deformed digits $x'(u) = x(u - \tau(u))$ (classes $\Omega_3$, $\Omega_5$).
• Globally invariant to the translation group: a small group.
• Locally invariant to small diffeomorphisms: a huge group.
• [Video of Philipp Scott Johnson]
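A minimal 1D sketch of the warping $x'(u) = x(u - \tau(u))$ (assuming NumPy; the signal and displacement field $\tau$ are illustrative): a small smooth $\tau$ produces a deformation that stays close to the original signal yet does not coincide with any single global translation.

```python
import numpy as np

u = np.linspace(0, 1, 256)
x = np.exp(-((u - 0.5) ** 2) / 0.01)            # a smooth bump, standing in for a digit
tau = 0.02 * np.sin(2 * np.pi * u)              # a small, smooth displacement field

x_deformed = np.interp(u - tau, u, x)           # x'(u) = x(u - tau(u)) by interpolation
x_translated = np.interp(u - 0.02, u, x)        # a plain global translation, for comparison

print(np.max(np.abs(x_deformed - x)))               # small: a small diffeomorphism
print(np.max(np.abs(x_deformed - x_translated)))    # non-zero: not a global translation
```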