Harmonic Analysis of Deep Convolutional Networks
Yuan Yao, HKUST
Based on talks by Mallat, Bolcskei, etc.
Acknowledgement: a follow-up course at HKUST: https://deeplearning-math.github.io/
High Dimensional Natural Image Classification
• High-dimensional data $x = (x(1), \ldots, x(d)) \in \mathbb{R}^d$.
• Classification: estimate a class label $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i)\}_{i \le n}$.
• Image classification: $d = 10^6$. Huge variability inside classes (figure: example classes Anchor, Joshua Tree, Beaver, Lotus Water Lily). Goal: find invariants.
Curse of Dimensionality
• Analysis in high dimension: $x \in \mathbb{R}^d$ with $d \ge 10^6$.
• Points are far away in high dimension $d$:
  - 10 points cover $[0,1]$ at a distance $10^{-1}$;
  - 100 points are needed for $[0,1]^2$;
  - $10^d$ points are needed for $[0,1]^d$: impossible if $d \ge 20$.
• Points are concentrated in the $2^d$ corners: $\lim_{d \to \infty} \dfrac{\mathrm{volume}(\text{sphere of radius } r)}{\mathrm{volume}([0,r]^d)} = 0$.
⇒ Euclidean metrics are not appropriate on raw data.
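A minimal numerical sketch (assuming numpy and scipy are available) of the volume-concentration fact above: the ratio between the volume of a ball of radius $r$ and the cube $[0,r]^d$ tends to 0 as $d$ grows.

```python
import numpy as np
from scipy.special import gammaln

def log_ball_to_cube_ratio(d):
    # volume(ball of radius r) = pi^(d/2) r^d / Gamma(d/2 + 1)
    # volume([0, r]^d)         = r^d
    # The r^d factors cancel, so the ratio depends only on d.
    return 0.5 * d * np.log(np.pi) - gammaln(d / 2 + 1)

for d in [2, 5, 10, 20, 50, 100]:
    print(d, np.exp(log_ball_to_cube_ratio(d)))  # tends to 0 very fast as d grows
```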
A Blessing from the Physical World? Multiscale "Compositional" Sparsity
• Variables $x(u)$ indexed by a low-dimensional $u$: time/space... pixels in images, particles in physics, words in text...
• Multiscale interactions of $d$ variables: from $d^2$ pairwise interactions to $O(\log^2 d)$ multiscale interactions.
• Multiscale analysis: wavelets on groups of symmetries; hierarchical architecture.
Learning as an Approximation
• To estimate $f(x)$ from a sampling $\{x_i, y_i = f(x_i)\}_{i \le M}$ we must build an $M$-parameter approximation $f_M$ of $f$.
• Precise sparse approximation requires some "regularity".
• For binary classification: $f(x) = 1$ if $x \in \Omega$ and $f(x) = -1$ if $x \notin \Omega$, so $f(x) = \mathrm{sign}(\tilde f(x))$ where $\tilde f$ is potentially regular.
• What type of regularity? How to compute $f_M$?
1 Hidden Layer Neural Networks
• One-hidden-layer neural network: $f_M(x) = \sum_{n=1}^{M} \alpha_n \, \rho(w_n \cdot x + b_n)$, with $w_n \cdot x = \sum_k w_{k,n} x_k$ and $\rho$ a non-linear scalar "neuron".
• Input $x \in \mathbb{R}^d$, $M$ hidden units; the weights $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned: non-linear approximation.
• Fourier series: $\rho(u) = e^{iu}$ gives $f_M(x) = \sum_{n=1}^{M} \alpha_n e^{i w_n \cdot x}$.
• For nearly all $\rho$: essentially the same approximation results.
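A minimal sketch (assuming numpy) of the one-hidden-layer model $f_M(x) = \sum_n \alpha_n \rho(w_n \cdot x + b_n)$ with $\rho(u) = \max(u,0)$; the parameters below are random placeholders, whereas in practice they are learned.

```python
import numpy as np

def f_M(x, W, b, alpha, rho=lambda u: np.maximum(u, 0)):
    # W: (M, d) hidden weights, b: (M,) biases, alpha: (M,) output weights
    return alpha @ rho(W @ x + b)

rng = np.random.default_rng(0)
d, M = 10, 50
x = rng.standard_normal(d)
W = rng.standard_normal((M, d))
b = rng.standard_normal(M)
alpha = rng.standard_normal(M)
print(f_M(x, W, b, alpha))  # a scalar prediction
```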
Piecewise Linear Approximation
• Piecewise linear approximation with $\rho(u) = \max(u, 0)$: $\tilde f(x) = \sum_n a_n \, \rho(x - n\epsilon)$ (figure: $f$ and its samples on the grid $n\epsilon$).
• If $f$ is Lipschitz, $|f(x) - f(x')| \le C |x - x'|$, then $|f(x) - \tilde f(x)| \le C\epsilon$.
• Need $M = \epsilon^{-1}$ points to cover $[0,1]$ at a distance $\epsilon$
  ⇒ $\|f - f_M\| \le C M^{-1}$.
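A minimal sketch (assuming numpy) of this construction on $[0,1]$: the coefficients $a_n$ chosen below (changes of slope at the knots) are one simple interpolating choice, used purely for illustration.

```python
import numpy as np

f = np.sin                        # target function; f(0) = 0, so no constant term is needed
eps = 0.05
knots = np.arange(0.0, 1.0 + eps, eps)
vals = f(knots)

slopes = np.diff(vals) / eps                          # slope of f on each interval
a = np.concatenate(([slopes[0]], np.diff(slopes)))    # a_n = change of slope at knot n

def f_tilde(x):
    # f_tilde(x) = sum_n a_n * max(x - n*eps, 0)
    return np.sum(a * np.maximum(x - knots[:-1], 0.0))

xs = np.linspace(0, 1, 1000)
err = max(abs(f_tilde(x) - f(x)) for x in xs)
print(err)   # bounded by C * eps, as stated on the slide
```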
Linear Ridge Approximation
• Piecewise linear ridge approximation for $x \in [0,1]^d$, with $\rho(u) = \max(u,0)$: $\tilde f(x) = \sum_n a_n \, \rho(w_n \cdot x - n\epsilon)$.
• If $f$ is Lipschitz, $|f(x) - f(x')| \le C \|x - x'\|$, sampling at a distance $\epsilon$ gives $|f(x) - \tilde f(x)| \le C\epsilon$.
• Need $M = \epsilon^{-d}$ points to cover $[0,1]^d$ at a distance $\epsilon$
  ⇒ $\|f - f_M\| \le C M^{-1/d}$: curse of dimensionality!
Approximation with Regularity
• What prior condition makes learning possible?
• Approximation of regular functions in $C^s[0,1]^d$: $|f(x) - p_u(x)| \le C |x - u|^s$ for all $x, u$, with $p_u(x)$ a polynomial (figure: local polynomial $p_u$ approximating $f$ near $u$).
• If $|x - u| \le \epsilon^{1/s}$ then $|f(x) - p_u(x)| \le C\epsilon$.
• Need $M = \epsilon^{-d/s}$ points to cover $[0,1]^d$ at a distance $\epsilon^{1/s}$
  ⇒ $\|f - f_M\| \le C M^{-s/d}$.
• Cannot do better in $C^s[0,1]^d$; not good because $s \ll d$: failure of classical approximation theory.
Kernel Learning
• Change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$ to nearly linearize $f(x)$, which is approximated by a 1D projection: $\tilde f(x) = \langle \Phi(x), w \rangle = \sum_k w_k \phi_k(x)$.
• Data: $x \in \mathbb{R}^d \mapsto \Phi(x) \in \mathbb{R}^{d'}$, followed by a linear classifier $w$.
• Metric: $\|x - x'\|$ becomes $\|\Phi(x) - \Phi(x')\|$.
• How and when is it possible to find such a $\Phi$? What "regularity" of $f$ is needed?
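A minimal sketch (assuming numpy) of this pipeline: map $x$ to $\Phi(x)$, then fit a linear model on the features. The random Fourier feature map below is one concrete illustrative choice of $\Phi$, not the one assumed on the slide, and the target $f$ is a synthetic placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime, n = 5, 200, 500
W = rng.standard_normal((d_prime, d))        # random frequencies
b = rng.uniform(0, 2 * np.pi, d_prime)       # random phases

def Phi(X):
    return np.cos(X @ W.T + b)               # Phi(x) in R^{d'}

X = rng.standard_normal((n, d))
y = np.sign(np.sin(X[:, 0]) + X[:, 1])       # illustrative target f(x)

# Linear classifier in feature space: f~(x) = <Phi(x), w>
w, *_ = np.linalg.lstsq(Phi(X), y, rcond=None)
pred = np.sign(Phi(X) @ w)
print((pred == y).mean())                    # training accuracy
```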
Spirit of Fisher's Linear Discriminant Analysis: Reduction of Dimensionality
• Discriminative change of variable $\Phi(x)$: $\Phi(x) \ne \Phi(x')$ if $f(x) \ne f(x')$ ⇒ $\exists \tilde f$ with $f(x) = \tilde f(\Phi(x))$.
• If $\tilde f$ is Lipschitz, $|\tilde f(z) - \tilde f(z')| \le C \|z - z'\|$ with $z = \Phi(x)$, then $|f(x) - f(x')| \le C \|\Phi(x) - \Phi(x')\|$.
• Discriminative: $\|\Phi(x) - \Phi(x')\| \ge C^{-1} |f(x) - f(x')|$.
• For $x \in \Omega$, if $\Phi(\Omega)$ is bounded and of low dimension $d'$, then $\|f - f_M\| \le C M^{-1/d'}$.
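A minimal sketch (assuming numpy) of Fisher's linear discriminant, the classical example of a discriminative, dimension-reducing change of variable $\Phi(x) = w \cdot x$; the two-class data below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.standard_normal((100, 5))                               # class 0
X1 = rng.standard_normal((100, 5)) + np.array([2.0, 1.0, 0, 0, 0])  # class 1, shifted mean

m0, m1 = X0.mean(0), X1.mean(0)
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)   # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)                           # Fisher direction

Phi = lambda x: x @ w                                      # 1D change of variable
# The two classes are well separated after the projection:
print(Phi(X0).mean(), Phi(X1).mean())
```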
Deep Convolutional Networks
• The revival of neural networks (Y. LeCun): a cascade of layers, each a linear convolution $L_j$ followed by a non-linear scalar "neuron" $\rho(u) = \max(u, 0)$:
  $x \to L_1 \to \rho \to L_2 \to \rho \to \cdots \to \Phi(x) \to$ linear classification, $y = \tilde f(\Phi(x))$.
• The cascade computes hierarchical invariants and a linearization.
• Optimize the $L_j$ with architecture constraints: over $10^9$ parameters.
• Exceptional results for images, speech, language, bio-data... Why does it work so well? A difficult problem.
Deep Convolutional Networks
• Layers $x(u) \to x_1(u, k_1) \to x_2(u, k_2) \to \cdots \to x_J(u, k_J) \to$ classification, with $x_j = \rho L_j x_{j-1}$.
• $L_j$ is a linear combination of convolutions and subsampling, with a sum across channels:
  $x_j(u, k_j) = \rho\Big( \sum_k x_{j-1}(\cdot, k) \star h_{k_j, k}(u) \Big)$.
• $\rho$ is contractive: $|\rho(u) - \rho(u')| \le |u - u'|$, e.g. $\rho(u) = \max(u, 0)$ or $\rho(u) = |u|$.
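A minimal sketch (assuming numpy and scipy) of one layer of the formula above: convolve each input channel with a filter, sum across channels, then apply the contractive non-linearity $\rho(u) = \max(u,0)$. The filter values are random placeholders; in a trained network they are learned.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_layer(x_prev, h, rho=lambda u: np.maximum(u, 0)):
    # x_prev: (K_in, H, W) input channels; h: (K_out, K_in, s, s) filters
    K_out = h.shape[0]
    out = []
    for k_j in range(K_out):
        acc = sum(convolve2d(x_prev[k], h[k_j, k], mode="same")
                  for k in range(x_prev.shape[0]))   # sum across channels
        out.append(rho(acc))                          # pointwise non-linearity
    return np.stack(out)                              # (K_out, H, W)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))          # e.g. an RGB image
h1 = rng.standard_normal((8, 3, 5, 5)) * 0.1   # 8 output channels
x1 = conv_layer(x0, h1)
print(x1.shape)                                # (8, 32, 32)
```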
Many Questions
• Why convolutions? Translation covariance.
• Why no overfitting? Contractions, dimension reduction.
• Why a hierarchical cascade?
• Why introduce non-linearities?
• How and what to linearise?
• What are the roles of the multiple channels in each layer?
Linear Dimension Reduction
• Classes $\Omega_1, \Omega_2, \Omega_3, \ldots$ are level sets of $f(x)$: $\Omega_t = \{x : f(x) = t\}$.
• If level sets (classes) are parallel to a linear space, then variables are eliminated by linear projections $\Phi(x)$: invariants.
Linearise for Dimensionality Reduction
• If the level sets $\Omega_t = \{x : f(x) = t\}$ are not parallel to a linear space:
  - linearise them with a change of variable $\Phi(x)$;
  - then reduce dimension with linear projections.
• Difficult because the $\Omega_t$ are high-dimensional, irregular, and known only from few samples.
Level Set Geometry: Symmetries
• Curse of dimensionality ⇒ not local but global geometry. Level sets (classes) are characterised by their global symmetries.
• A symmetry is an operator $g$ which preserves level sets: $f(g.x) = f(x)$ for all $x$ (a global property).
• If $g_1$ and $g_2$ are symmetries, then $g_1.g_2$ is also a symmetry: $f(g_1.g_2.x) = f(g_2.x) = f(x)$.
Groups of Symmetries
• $G = \{\text{all symmetries}\}$ is a group (unknown):
  - closure: $g.g' \in G$ for all $(g, g') \in G^2$;
  - inverse: $g^{-1} \in G$ for all $g \in G$;
  - associativity: $(g.g').g'' = g.(g'.g'')$.
• If commutative, $g.g' = g'.g$: Abelian group.
• A group has dimension $n$ if it has $n$ generators: $g = g_1^{p_1} g_2^{p_2} \cdots g_n^{p_n}$.
• Lie group: infinitely small generators (Lie algebra).
Translation and Deformations
• Digit classification (classes $\Omega_3$, $\Omega_5$): $x'(u) = x(u - \tau(u))$.
  - Globally invariant to the translation group: a small group.
  - Locally invariant to small diffeomorphisms: a huge group.
• (Video of Philipp Scott Johnson.)
Rotation and Scaling Variability
• Rotation and deformations. Group: $SO(2) \times \mathrm{Diff}(SO(2))$.
• Scaling and deformations. Group: $\mathbb{R} \times \mathrm{Diff}(\mathbb{R})$.
Linearize Symmetries
• A change of variable $\Phi(x)$ must linearize the orbits $\{g.x\}_{g \in G}$ (figure: the orbits $g_1.x, \ldots, g_1^p.x$ and $g_1.x', \ldots, g_1^p.x'$ are curved in the input space and become straight lines $\Phi(x), \ldots, \Phi(g_1^p.x)$ after the change of variable).
• Lipschitz: $\forall x, g: \ \|\Phi(x) - \Phi(g.x)\| \le C \|g\|$.
Translation and Deformations
• Digit classification: $x'(u) = x(u - \tau(u))$.
  - Globally invariant to the translation group.
  - Locally invariant to small diffeomorphisms ⇒ linearize small diffeomorphisms: Lipschitz regular.
• (Video of Philipp Scott Johnson.)
Translations and Deformations
• Invariance to translations: $g.x(u) = x(u - c)$ ⇒ $\Phi(g.x) = \Phi(x)$.
• Small diffeomorphisms: $g.x(u) = x(u - \tau(u))$, with metric $\|g\| = \|\nabla \tau\|_\infty$ (maximum scaling).
• Linearisation by Lipschitz continuity: $\|\Phi(x) - \Phi(g.x)\| \le C \|\nabla \tau\|_\infty$.
• Discriminative change of variable: $\|\Phi(x) - \Phi(x')\| \ge C^{-1} |f(x) - f(x')|$.
Fourier Deformation Instability
• Fourier transform: $\hat x(\omega) = \int x(t)\, e^{-i\omega t}\, dt$.
• Translation $x_c(t) = x(t - c)$ ⇒ $\hat x_c(\omega) = e^{-ic\omega}\, \hat x(\omega)$, so the modulus is invariant to translations: $\Phi(x) = |\hat x| = |\hat x_c|$.
• Instabilities to small deformations $x_\tau(t) = x(t - \tau(t))$: $\big|\, |\hat x_\tau(\omega)| - |\hat x(\omega)| \,\big|$ is big at high frequencies (figure: $|\hat x(\omega)|$ and $|\hat x_\tau(\omega)|$ for the dilation $\tau(t) = \epsilon t$).
• Hence $\big\| |\hat x| - |\hat x_\tau| \big\| \gg \|\nabla \tau\|_\infty \, \|x\|$.
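A minimal sketch (assuming numpy) illustrating this slide: the Fourier modulus is invariant to translations but unstable to a small dilation $\tau(t) = \epsilon t$ of a high-frequency signal. The signal and parameters are illustrative placeholders.

```python
import numpy as np

N = 4096
t = np.arange(N) / N
# High-frequency wave packet
x = np.cos(2 * np.pi * 400 * t) * np.exp(-((t - 0.5) ** 2) / 0.01)

def fmod(sig):
    return np.abs(np.fft.fft(sig))

# Translation: the Fourier modulus barely changes.
x_shift = np.roll(x, 37)
print(np.linalg.norm(fmod(x) - fmod(x_shift)) / np.linalg.norm(fmod(x)))   # ~ 0

# Small dilation x_tau(t) = x((1 - eps) t): large change at high frequencies.
eps = 0.05
x_dil = np.interp((1 - eps) * t, t, x)
print(np.linalg.norm(fmod(x) - fmod(x_dil)) / np.linalg.norm(fmod(x)))     # of order 1, not small
```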
Wavelet Transform
• Complex wavelet: $\psi(t) = \psi^a(t) + i\,\psi^b(t)$, dilated: $\psi_\lambda(t) = 2^{-j}\psi(2^{-j}t)$ with $\lambda = 2^{-j}$ (figure: the supports of $|\hat\psi_\lambda(\omega)|^2$, $|\hat\psi_{\lambda'}(\omega)|^2$ and $|\hat\phi(\omega)|^2$ cover the frequency axis).
• Convolution: $x \star \psi_\lambda(t) = \int x(u)\, \psi_\lambda(t - u)\, du$.
• Wavelet transform: $Wx = \big( x \star \phi(t),\ x \star \psi_\lambda(t) \big)_{t, \lambda}$.
• Unitary: $\|Wx\|^2 = \|x\|^2$.
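A minimal sketch (assuming numpy) of a complex wavelet filter bank applied in the Fourier domain: dilated band-pass wavelets $\psi_\lambda$ plus a low-pass $\phi$, applied by pointwise multiplication of FFTs. The Gaussian-shaped filters below are illustrative stand-ins for an actual Morlet design and are only approximately unitary.

```python
import numpy as np

N, J = 1024, 5
omega = np.fft.fftfreq(N) * 2 * np.pi                 # frequency grid in rad/sample

def psi_hat(omega, j, xi=3.0, sigma=0.8):
    # band-pass filter centred at xi * 2^{-j} (analytic: positive frequencies only)
    return np.exp(-((omega - xi * 2.0 ** (-j)) ** 2) / (2 * (sigma * 2.0 ** (-j)) ** 2))

filters = [psi_hat(omega, j) for j in range(J)]
filters.append(np.exp(-(omega ** 2) / (2 * (0.4 * 2.0 ** (-J)) ** 2)))   # low-pass phi

rng = np.random.default_rng(0)
x = rng.standard_normal(N)
x_hat = np.fft.fft(x)

# Wx = { x * phi, x * psi_lambda }: one complex coefficient signal per filter
Wx = [np.fft.ifft(x_hat * h) for h in filters]
energy = sum(np.sum(np.abs(c) ** 2) for c in Wx)
# Equals ||x||^2 only for an exactly unitary (Littlewood-Paley) design;
# the illustrative filters here only approximate that normalisation.
print(energy / np.sum(np.abs(x) ** 2))
```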