  1. On Dropout and Nuclear Norm Regularization. Poorya Mianjy and Raman Arora, Johns Hopkins University. June 10, 2019.

  2. Motivation
  ◮ Algorithmic approaches endow deep learning systems with certain inductive biases that help generalization.
  ◮ In this paper, we study dropout, one of the most popular algorithmic heuristics for training deep neural nets.

  3. Problem Setup
  ◮ Deep linear networks with $k$ hidden layers: $f_w : x \mapsto W_{k+1} \cdots W_1 x$, with $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$, where $w = \{W_i\}_{i=1}^{k+1}$ is the set of weight matrices.
  ◮ $x \in \mathbb{R}^{d_0}$, $y \in \mathbb{R}^{d_{k+1}}$, $(x, y) \sim \mathcal{D}$. Assume $\mathbb{E}[xx^\top] = I$.
  ◮ Learning problem: minimize the population risk $L(w) := \mathbb{E}_{(x,y) \sim \mathcal{D}}[\|y - f_w(x)\|^2]$ based on i.i.d. samples from the distribution (a minimal numerical sketch follows below).
  [Figure: fully connected network diagram from input layer $x[1], \ldots, x[d_0]$ to output layer $y[1], \ldots, y[d_{k+1}]$.]
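To ground the notation, here is a minimal NumPy sketch (ours, not the authors'; the sizes and the linear teacher $M^\star$ are illustrative assumptions) of the deep linear network and an empirical estimate of the population risk $L(w)$, drawing $x \sim \mathcal{N}(0, I)$ so that $\mathbb{E}[xx^\top] = I$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Deep linear network with k = 2 hidden layers; all sizes illustrative.
dims = [5, 4, 4, 3]                                # d_0, d_1, d_2, d_3
ws = [rng.standard_normal((dims[i + 1], dims[i]))  # W_i in R^{d_i x d_{i-1}}
      for i in range(len(dims) - 1)]

# Data: x ~ N(0, I) gives E[xx^T] = I; y comes from a fixed linear teacher.
M_star = rng.standard_normal((dims[-1], dims[0]))
X = rng.standard_normal((100_000, dims[0]))
Y = X @ M_star.T

# Batched forward pass f_w(x) = W_{k+1} ... W_1 x (one row per sample).
H = X
for W in ws:
    H = H @ W.T

# Empirical estimate of the population risk L(w) = E ||y - f_w(x)||^2.
L = np.mean(np.sum((Y - H) ** 2, axis=1))
print(f"empirical risk L(w) ~= {L:.3f}")
```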

  4. Problem Setup
  ◮ The network is perturbed by dropping hidden nodes at random, computing $\bar{f}_w(x) = W_{k+1} B_k W_k \cdots B_1 W_1 x$, where each diagonal entry $B_i(j,j)$ equals $0$ with probability $1 - \theta$ and $\frac{1}{\theta}$ with probability $\theta$.
  ◮ Dropout boils down to SGD on the dropout objective $L_\theta(w) := \mathbb{E}_{\{B_i\},\,(x,y)}\left[\|y - \bar{f}_w(x)\|^2\right]$ (see the sketch after this slide).
  [Figure: the same network diagram, with dropout applied to every hidden node.]
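A minimal sketch of the perturbed forward pass, in the same hypothetical setup as before: the diagonal masks $B_i$ take value $1/\theta$ with probability $\theta$ and $0$ otherwise, and the dropout objective $L_\theta(w)$ is estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.8                                   # keep probability (illustrative)
dims = [5, 4, 4, 3]
ws = [rng.standard_normal((dims[i + 1], dims[i]))
      for i in range(len(dims) - 1)]

def f_bar(ws, x, theta, rng):
    """One stochastic pass W_{k+1} B_k W_k ... B_1 W_1 x."""
    h = x
    for W in ws[:-1]:
        # Diagonal of B_i: 1/theta with probability theta, else 0.
        mask = (rng.random(W.shape[0]) < theta) / theta
        h = mask * (W @ h)
    return ws[-1] @ h

# Monte Carlo estimate of the dropout objective at a single (x, y) pair;
# averaging over (x, y) ~ D is analogous.
x = rng.standard_normal(dims[0])
y = rng.standard_normal(dims[-1])
L_theta = np.mean([np.sum((y - f_bar(ws, x, theta, rng)) ** 2)
                   for _ in range(200_000)])
print(f"L_theta estimate at (x, y): {L_theta:.3f}")
```

The $1/\theta$ scaling keeps $\mathbb{E}[B_i] = I$, so the perturbed pass is unbiased, $\mathbb{E}_{\{B_i\}}[\bar{f}_w(x)] = f_w(x)$; hence $L_\theta(w) \ge L(w)$, and the gap between the two is exactly the explicit regularizer studied in the main results.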

  5. Empirical Observation
  ◮ 3-layer network with width, input, and output dimensionality all equal to 20 (a sketch of the experiment follows below).
  [Figure: singular values of the learned end-to-end map, plotted by index (1 to 16) on a 0-to-35 scale, for the true model, plain SGD, and dropout training with rates $1 - \theta \in \{0.05, 0.10, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85\}$.]
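A minimal PyTorch sketch of one such run, not the authors' code: we interpret "3-layer" as three weight matrices (an assumption), hyperparameters are guesses, and PyTorch's inverted dropout with drop probability $p = 1 - \theta$ implements exactly the 0-or-$1/\theta$ masks above. The spectrum of the learned product $W_3 W_2 W_1$ can then be inspected directly:

```python
import torch

torch.manual_seed(0)
d = 20        # width = input = output dimensionality, as in the slide
p = 0.25      # dropout rate 1 - theta (one of the rates in the legend)

# Synthetic "true model" and Gaussian inputs with E[xx^T] = I.
M_true = torch.randn(d, d)
X = torch.randn(5000, d)
Y = X @ M_true.T

# Three weight matrices, trained by SGD on the dropout objective.
Ws = [torch.nn.Parameter(0.1 * torch.randn(d, d)) for _ in range(3)]
opt = torch.optim.SGD(Ws, lr=0.01)

for step in range(5000):
    idx = torch.randint(0, X.shape[0], (64,))      # minibatch
    h = X[idx]
    for W in Ws[:-1]:
        # Inverted dropout: zero w.p. p, scale survivors by 1/(1 - p).
        h = torch.nn.functional.dropout(h @ W.T, p=p, training=True)
    loss = ((Y[idx] - h @ Ws[-1].T) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Spectrum of the learned end-to-end map W_3 W_2 W_1.
with torch.no_grad():
    print(torch.linalg.svdvals(Ws[2] @ Ws[1] @ Ws[0]))
```

Sweeping $p$ over the rates in the legend and overlaying the resulting spectra should qualitatively reproduce the figure: larger dropout rates push more of the trailing singular values toward zero.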

  6. Main Results
  Explicit Regularizer. We give a full characterization of $R(w) := L_\theta(w) - L(w)$, in terms of products of weights along paths through the network.
  [Figure: a path through the network with pivot nodes $i_1, i_2, i_3$ in layers $j_1, j_2, j_3$, annotated with $\alpha_{j_1, i_1} := \|W_{j_1 \to 1}(i_1, :)\|$, $\beta_1 := W_{j_2 \to j_1+1}(i_2, i_1)$, $\beta_2 := W_{j_3 \to j_2+1}(i_3, i_2)$, and $\gamma_{j_3, i_3} := \|W_{k+1 \to j_3+1}(:, i_3)\|$, where $W_{j \to i}$ denotes the product $W_j \cdots W_i$.]
  Induced Regularizer. $\Theta(M) := \min_{f_w = M} R(w)$.
  ◮ Multi-dimensional output: $\Theta^{**}(f_w) = \nu_{\{d_i\}} \|f_w\|_*^2$, i.e., the convex envelope of the induced regularizer is a scaled squared nuclear norm.
  ◮ One-dimensional output: $\Theta(f_w) = \Theta^{**}(f_w) = \nu_{\{d_i\}} \|f_w\|^2$.
  Effective Regularization Parameter. $\nu_{\{d_i\}}$ increases with depth and decreases with width: deeper and narrower networks are more biased towards low-rank solutions. (A numerical sanity check for the shallow case follows below.)
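As a sanity check on $R(w)$, consider the shallow case $k = 1$, where a direct computation (using $\mathbb{E}[xx^\top] = I$ and $\mathrm{Var}(B(j,j)) = \frac{1-\theta}{\theta}$) gives the closed form $R(w) = \frac{1-\theta}{\theta} \sum_j \|W_2(:,j)\|^2 \|W_1(j,:)\|^2$. The following NumPy sketch (illustrative sizes, not from the paper) compares this with a Monte Carlo estimate of $L_\theta(w) - L(w)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d1, d2 = 6, 8, 4          # input, hidden, output dims (illustrative)
theta = 0.7
W1 = rng.standard_normal((d1, d0))
W2 = rng.standard_normal((d2, d1))

# Data model: x ~ N(0, I) (so E[xx^T] = I), y from a fixed linear map.
M_star = rng.standard_normal((d2, d0))
n = 200_000
X = rng.standard_normal((n, d0))
Y = X @ M_star.T

# L(w): risk of the unperturbed network f_w(x) = W2 W1 x.
res = Y - X @ (W2 @ W1).T
L = np.mean(np.sum(res ** 2, axis=1))

# L_theta(w): one fresh dropout mask per sample.
B = (rng.random((n, d1)) < theta) / theta
res_drop = Y - (B * (X @ W1.T)) @ W2.T
L_theta = np.mean(np.sum(res_drop ** 2, axis=1))

# Closed form: (1 - theta)/theta * sum_j ||W2(:,j)||^2 ||W1(j,:)||^2.
closed = (1 - theta) / theta * np.sum(
    np.sum(W2 ** 2, axis=0) * np.sum(W1 ** 2, axis=1))
print(f"Monte Carlo R(w): {L_theta - L:.3f}   closed form: {closed:.3f}")
```

The two numbers should agree up to Monte Carlo error, illustrating that the dropout objective is the plain risk plus a data-independent penalty on the weights.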

  7. Thanks for your attention! Stop by Poster 79 for more information.
