On Dropout and Nuclear Norm Regularization
Poorya Mianjy and Raman Arora
Johns Hopkins University
June 10, 2019
Motivation
◮ Algorithmic approaches endow deep learning systems with certain inductive biases that help generalization.
◮ In this paper we study dropout, one of the most popular algorithmic heuristics for training deep neural nets.
Problem Setup
◮ Deep linear networks with $k$ hidden layers: $f_w : x \mapsto W_{k+1} \cdots W_1 x$, with $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$, where $w = \{W_i\}_{i=1}^{k+1}$ is the set of weight matrices.
◮ $x \in \mathbb{R}^{d_0}$, $y \in \mathbb{R}^{d_{k+1}}$, $(x, y) \sim \mathcal{D}$. Assume $\mathbb{E}[xx^\top] = I$.
◮ Learning problem: minimize the population risk $L(w) := \mathbb{E}_{(x,y)\sim\mathcal{D}}[\|y - f_w(x)\|^2]$ based on i.i.d. samples from the distribution.
[Figure: fully connected linear network mapping input layer $x[1], \dots, x[d_0]$ to output layer $y[1], \dots, y[d_{k+1}]$]
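As a concrete sketch of this setup (all widths and the data model are illustrative assumptions, not the paper's experiment), note that a deep linear network collapses to a single matrix product, so $f_w$ can be evaluated layer by layer or as $(W_{k+1}\cdots W_1)x$:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2, d3 = 5, 8, 8, 4   # illustrative widths d_0, ..., d_{k+1} with k = 2

# Weight matrices W_i in R^{d_i x d_{i-1}}.
W1 = rng.standard_normal((d1, d0))
W2 = rng.standard_normal((d2, d1))
W3 = rng.standard_normal((d3, d2))

def f_w(x):
    """Deep linear network: x -> W_{k+1} ... W_1 x."""
    return W3 @ (W2 @ (W1 @ x))

# The whole network computes one linear map M = W3 W2 W1.
M = W3 @ W2 @ W1
x = rng.standard_normal(d0)
assert np.allclose(f_w(x), M @ x)

# Empirical version of the population risk E ||y - f_w(x)||^2, here on
# synthetic data generated by the network itself plus small noise.
X = rng.standard_normal((1000, d0))   # rows x_i with E[x x^T] = I
Y = X @ M.T + 0.1 * rng.standard_normal((1000, d3))
risk = np.mean(np.sum((Y - X @ M.T) ** 2, axis=1))
print(risk)                           # ~ noise level 0.01 * d3 = 0.04
```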
Problem Setup
◮ Network perturbed by dropping hidden nodes at random, computing $\bar f_w(x) = W_{k+1} B_k W_k \cdots B_1 W_1 x$, where $B_i(j,j) = 0$ with probability $1-\theta$, and $\frac{1}{\theta}$ with probability $\theta$.
◮ Dropout boils down to SGD on the dropout objective $L_\theta(w) := \mathbb{E}_{\{B_i\},(x,y)}\,\|y - \bar f_w(x)\|^2$.
[Figure: the same network, with dropout applied to every hidden node]
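A minimal sketch of the perturbed forward pass (sizes and $\theta$ are illustrative). The $1/\theta$ scaling makes each mask unbiased, $\mathbb{E}[B_i] = I$, so the dropout network equals $f_w$ in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, theta = 8, 0.75   # keep probability theta (illustrative value)

def dropout_mask(d, theta, rng):
    """Diagonal B with B[j,j] = 1/theta w.p. theta and 0 w.p. 1 - theta."""
    return np.diag(rng.binomial(1, theta, d) / theta)

W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
x = rng.standard_normal(d)

# One stochastic forward pass of a one-hidden-layer (k = 1) network.
out = W2 @ dropout_mask(d, theta, rng) @ W1 @ x

# Unbiasedness check: mask diagonals average to ~1 over many draws,
# hence E[f̄_w(x)] = f_w(x).
avg = rng.binomial(1, theta, (200_000, d)).mean(axis=0) / theta
assert np.allclose(avg, 1.0, atol=0.02)
```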
Empirical Observation
◮ 3-layer network with width/input/output dimensionality = 20.
[Figure: singular values of the learned map (index 1–16) for the true model, plain SGD, and dropout with rates $1-\theta \in \{0.05, 0.10, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85\}$; larger dropout rates give faster-decaying spectra]
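The experiment behind the plot can be sketched in a few lines of NumPy: train a 3-layer linear network with scaled Bernoulli dropout by SGD on data from a linear teacher, then inspect the singular values of the learned product $W_3 W_2 W_1$. All hyperparameters below (teacher spectrum, step size, gradient clipping) are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, theta, lr, steps = 20, 0.8, 0.01, 3000

# Linear teacher M with a decaying spectrum (scale is an illustrative choice).
Q1, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q2, _ = np.linalg.qr(rng.standard_normal((d, d)))
M = Q1 @ np.diag(np.linspace(2.0, 0.05, d)) @ Q2.T

# Three weight matrices W_1, W_2, W_3 of a deep linear network, small init.
W = [0.3 / np.sqrt(d) * rng.standard_normal((d, d)) for _ in range(3)]

for _ in range(steps):
    x = rng.standard_normal((d, 1))                  # E[xx^T] = I
    y = M @ x
    # Scaled Bernoulli dropout masks on the two hidden layers.
    B1 = np.diag(rng.binomial(1, theta, d) / theta)
    B2 = np.diag(rng.binomial(1, theta, d) / theta)
    h1 = B1 @ W[0] @ x
    h2 = B2 @ W[1] @ h1
    e = W[2] @ h2 - y                                # residual of the dropout loss
    grads = [(B1 @ W[1].T @ B2 @ W[2].T @ e) @ x.T,  # backprop through the masks
             (B2 @ W[2].T @ e) @ h1.T,
             e @ h2.T]
    for Wi, g in zip(W, grads):
        Wi -= lr * g / max(1.0, np.linalg.norm(g))   # clipped step, for stability

# Spectrum of the learned linear map; larger dropout rates (smaller theta)
# typically yield a faster-decaying tail, as in the figure on the slide.
spec = np.linalg.svd(W[2] @ W[1] @ W[0], compute_uv=False)
print(np.round(spec, 3))
```

Rerunning with several values of `theta` and overlaying the resulting spectra reproduces the qualitative trend in the figure.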
Main Results
Explicit Regularizer: give a full characterization of $R(w) := L_\theta(w) - L(w)$.
[Figure: a path through the network with pivot nodes $i_1, i_2, i_3$ in layers $j_1, j_2, j_3$, labeled by $\alpha_{j_1,i_1} := \|W_{j_1 \to 1}(i_1,:)\|$, $\beta_1 := W_{j_2 \to j_1+1}(i_2,i_1)$, $\beta_2 := W_{j_3 \to j_2+1}(i_3,i_2)$, $\gamma_{j_3,i_3} := \|W_{k+1 \to j_3+1}(:,i_3)\|$]
Main Results
Explicit Regularizer: give a full characterization of $R(w) := L_\theta(w) - L(w)$.
Induced Regularizer: $\Theta(M) := \min_{f_w = M} R(w)$.
◮ Multi-dimensional output: $\Theta^{**}(f_w) = \nu_{\{d_i\}} \|f_w\|_*^2$
◮ One-dimensional output: $\Theta(f_w) = \Theta^{**}(f_w) = \nu_{\{d_i\}} \|f_w\|^2$
Effective Regularization Parameter: $\nu_{\{d_i\}}$ increases with depth and decreases with width, so deeper and narrower networks are more biased towards low-rank solutions.
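For the shallow case ($k = 1$) the explicit regularizer has a simple closed form, $R(w) = \frac{1-\theta}{\theta}\sum_j \|W_2(:,j)\|^2 \, \|W_1(j,:)\|^2$, using $\mathbb{E}[xx^\top] = I$. A quick numerical sanity check against a Monte Carlo estimate (all sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2, theta = 6, 10, 4, 0.8

W1 = rng.standard_normal((d1, d0))
W2 = rng.standard_normal((d2, d1))

# Closed form: R(w) = ((1-theta)/theta) * sum_j ||W2[:, j]||^2 * ||W1[j, :]||^2.
closed = (1 - theta) / theta * np.sum(
    np.sum(W2 ** 2, axis=0) * np.sum(W1 ** 2, axis=1))

# Monte Carlo estimate: taking the expectation over x analytically
# (using E[xx^T] = I) leaves R(w) = E_B ||W2 (B - I) W1||_F^2.
N = 20_000
deltas = rng.binomial(1, theta, (N, d1)) / theta - 1.0   # diagonals of B - I
est = np.mean([np.linalg.norm((W2 * m) @ W1) ** 2 for m in deltas])

print(closed, est)   # the two agree to within Monte Carlo error
assert abs(est - closed) / closed < 0.05
```

Minimizing the closed form over all factorizations $W_2 W_1 = M$ is what connects this penalty to the (squared) nuclear norm of $M$.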
Thanks for your attention! Stop by Poster 79 for more information.