On Dropout and Nuclear Norm Regularization
Poorya Mianjy and Raman Arora
Johns Hopkins University
June 10, 2019
Motivation
◮ Algorithmic approaches endow deep learning systems with certain inductive biases that help generalization.
◮ In this paper, we study dropout, one of the most popular algorithmic heuristics for training deep neural nets.
Problem Setup
◮ Deep linear networks with $k$ hidden layers:
$$f_w : x \mapsto W_{k+1} \cdots W_1 x, \qquad W_i \in \mathbb{R}^{d_i \times d_{i-1}},$$
where $w = \{W_i\}_{i=1}^{k+1}$ is the set of weight matrices.
◮ $x \in \mathbb{R}^{d_0}$, $y \in \mathbb{R}^{d_{k+1}}$, $(x, y) \sim \mathcal{D}$. Assume $\mathbb{E}[xx^\top] = I$.
◮ Learning problem: minimize the population risk
$$L(w) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\|y - f_w(x)\|^2\big]$$
based on i.i.d. samples from the distribution.
[Figure: a fully connected deep linear network mapping inputs $x[1], \ldots, x[d_0]$ to outputs $y[1], \ldots, y[d_{k+1}]$.]
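To make the setup concrete, below is a minimal NumPy sketch of a deep linear network together with a Monte Carlo estimate of the population risk. The widths, the planted low-rank teacher `M_star`, and the Gaussian inputs (which satisfy $\mathbb{E}[xx^\top] = I$) are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed architecture: k = 2 hidden layers, all dimensions equal to 20.
dims = [20, 20, 20, 20]  # d_0, d_1, d_2, d_3
weights = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
           for i in range(len(dims) - 1)]

def f_w(weights, x):
    """Deep linear network f_w : x -> W_{k+1} ... W_1 x."""
    h = x
    for W in weights:
        h = W @ h
    return h

# Assumed data model: x ~ N(0, I), so E[x x^T] = I, and y = M* x for a
# planted rank-3 teacher M* (hypothetical, for illustration only).
M_star = rng.standard_normal((dims[-1], 3)) @ rng.standard_normal((3, dims[0]))

def population_risk(weights, n=100_000):
    """Monte Carlo estimate of L(w) = E ||y - f_w(x)||^2."""
    X = rng.standard_normal((dims[0], n))   # columns are i.i.d. samples
    residual = M_star @ X - f_w(weights, X)
    return np.mean(np.sum(residual ** 2, axis=0))

print(f"L(w) ≈ {population_risk(weights):.2f}")
```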
Problem Setup
◮ The network is perturbed by dropping hidden nodes at random, computing
$$\bar{f}_w(x) = W_{k+1} B_k W_k \cdots B_1 W_1 x,$$
where each $B_i$ is diagonal with $B_i(j,j) = 0$ with probability $1-\theta$, and $B_i(j,j) = \frac{1}{\theta}$ with probability $\theta$.
◮ Dropout then boils down to SGD on the dropout objective
$$L_\theta(w) := \mathbb{E}_{\{B_i\},(x,y)}\big[\|y - \bar{f}_w(x)\|^2\big].$$
[Figure: the same network, with dropout applied to every hidden node.]
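The $\frac{1}{\theta}$ rescaling makes each $B_i$ equal to the identity in expectation, so the perturbed network is unbiased: $\mathbb{E}_{\{B_i\}}[\bar f_w(x)] = f_w(x)$. Continuing the sketch above (reusing `f_w`, `weights`, `dims`, and `rng`), a Monte Carlo check of this unbiasedness; the retain probability and sample count are arbitrary choices:

```python
def f_w_dropout(weights, x, theta, rng):
    """Dropout-perturbed network: x -> W_{k+1} B_k W_k ... B_1 W_1 x,
    where B_i(j,j) = 1/theta w.p. theta and 0 w.p. 1 - theta."""
    h = weights[0] @ x
    for W in weights[1:]:
        b = (rng.random(h.shape) < theta) / theta  # diagonal of B_i
        h = W @ (b * h)                            # drop and rescale hidden nodes
    return h

# Unbiasedness: averaging over many masks recovers f_w(x), since E[B_i] = I.
theta = 0.9
x = rng.standard_normal(dims[0])
avg = np.mean([f_w_dropout(weights, x, theta, rng) for _ in range(50_000)], axis=0)
print(np.allclose(avg, f_w(weights, x), atol=0.1))  # True, up to Monte Carlo error
```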
Empirical Observation
◮ 3-layer network with width, input, and output dimensionality all equal to 20.
[Figure: singular values of the learned end-to-end map (y-axis, 0–35) against singular value index (x-axis, 1–16), comparing the true model, plain SGD, and dropout with rates $1-\theta \in \{0.05, 0.10, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85\}$. Larger dropout rates push more of the trailing singular values toward zero, i.e. toward lower-rank solutions.]
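An experiment of this kind can be sketched as follows: run SGD on the dropout objective for a 3-layer linear network and inspect the singular values of the learned end-to-end map $W_3 W_2 W_1$. The planted rank-3 teacher, learning rate, and step count below are guesses for illustration, not the authors' settings, and may need tuning:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20                                # width = input dim = output dim = 20
# Hypothetical rank-3 teacher; the slides do not specify the true model.
M_true = rng.standard_normal((d, 3)) @ rng.standard_normal((3, d)) / np.sqrt(d)

def train(theta, steps=200_000, lr=1e-3):
    """SGD on the dropout objective for a 3-layer (2-hidden-layer) linear net."""
    W1, W2, W3 = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    for _ in range(steps):
        x = rng.standard_normal(d)
        y = M_true @ x
        b1 = (rng.random(d) < theta) / theta   # diagonal of B_1
        b2 = (rng.random(d) < theta) / theta   # diagonal of B_2
        a1 = b1 * (W1 @ x)                     # B_1 W_1 x
        a2 = b2 * (W2 @ a1)                    # B_2 W_2 B_1 W_1 x
        r = 2.0 * (W3 @ a2 - y)                # d(loss)/d(output)
        d2 = b2 * (W3.T @ r)                   # backprop through B_2
        d1 = b1 * (W2.T @ d2)                  # backprop through B_1
        W3 -= lr * np.outer(r, a2)
        W2 -= lr * np.outer(d2, a1)
        W1 -= lr * np.outer(d1, x)
    return W3 @ W2 @ W1

for rate in (0.0, 0.25, 0.5):                  # dropout rates 1 - theta
    svals = np.linalg.svd(train(theta=1.0 - rate), compute_uv=False)
    print(f"1-theta={rate:.2f}:", np.round(svals[:6], 2))
```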
Main Results
Explicit Regularizer
◮ We give a full characterization of
$$R(w) := L_\theta(w) - L(w).$$
[Figure: the characterization is in terms of paths through "pivot" hidden nodes $(j_1, i_1)$, $(j_2, i_2)$, $(j_3, i_3)$, with
$\alpha_{j_1,i_1} := \|W_{j_1 \to 1}(i_1,:)\|$, $\beta_1 := W_{j_2 \to j_1+1}(i_2,i_1)$, $\beta_2 := W_{j_3 \to j_2+1}(i_3,i_2)$, $\gamma_{j_3,i_3} := \|W_{k+1 \to j_3+1}(:,i_3)\|$.]
Induced Regularizer
$$\Theta(M) := \min_{f_w = M} R(w)$$
◮ Multi-dimensional output: $\Theta^{**}(f_w) = \nu_{\{d_i\}} \|f_w\|_*^2$, where $\Theta^{**}$ is the convex envelope of $\Theta$ and $\|\cdot\|_*$ is the nuclear norm.
◮ One-dimensional output: $\Theta(f_w) = \Theta^{**}(f_w) = \nu_{\{d_i\}} \|f_w\|^2$.
Effective Regularization Parameter
◮ $\nu_{\{d_i\}}$ increases with depth and decreases with width: deeper and narrower networks are more biased towards low-rank solutions.
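For intuition, the single-hidden-layer case ($k = 1$, width $d_1$) can be worked out directly. The sketch below, compilable as a standalone LaTeX note, derives the explicit regularizer from the variance of the dropout mask and indicates how minimizing over factorizations yields a squared nuclear norm; it illustrates the shallow case only, not the paper's general $k$-layer characterization.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
For $k = 1$, $\bar f_w(x) = W_2 B_1 W_1 x$ with $\mathbb{E}[B_1] = I$ and
$\operatorname{Var}\!\big(B_1(j,j)\big) = \frac{1-\theta}{\theta}$. Since the
cross term vanishes ($\mathbb{E}[B_1 - I] = 0$) and $\mathbb{E}[xx^\top] = I$,
\begin{align*}
L_\theta(w)
  &= \mathbb{E}\,\|y - W_2 W_1 x\|^2
   + \mathbb{E}\,\|W_2 (B_1 - I) W_1 x\|^2 \\
  &= L(w) + \frac{1-\theta}{\theta}
     \sum_{j=1}^{d_1} \|W_2(:,j)\|^2 \,\|W_1(j,:)\|^2 ,
\end{align*}
so $R(w) = \frac{1-\theta}{\theta} \sum_{j} \|W_2(:,j)\|^2 \|W_1(j,:)\|^2$.
By Cauchy--Schwarz,
$\sum_j \|W_2(:,j)\|^2 \|W_1(j,:)\|^2
 \ge \tfrac{1}{d_1}\big(\sum_j \|W_2(:,j)\|\,\|W_1(j,:)\|\big)^2
 \ge \tfrac{1}{d_1}\,\|W_2 W_1\|_*^2$,
with equality attained by a balanced factorization when the width suffices;
hence $\Theta(M) = \frac{1-\theta}{\theta\, d_1}\,\|M\|_*^2$, i.e.\
$\nu_{\{d_i\}} = \frac{1-\theta}{\theta d_1}$ in this shallow case.
\end{document}
```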
Thanks for your attention! Stop by Poster 79 for more information.