Understanding Priors in Bayesian Neural Networks at the Unit Level
Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, Julyan Arbel
Inria, Grenoble, France
mariia.vladimirova@inria.fr
International Conference on Machine Learning, June 13, 2019
Outline
• Sub-Weibull distributions
• Main result: Prior on units gets heavier-tailed with depth
• Regularization interpretation
Distribution families with respect to tail behavior

For all k ∈ N, the k-th raw moment norm is ‖X‖_k = (E|X|^k)^(1/k). With F̄ the survival function:

Distribution     Tail                     Moments
Sub-Gaussian     F̄(x) ≤ e^(−λx²)          ‖X‖_k ≤ C√k
Sub-Exponential  F̄(x) ≤ e^(−λx)           ‖X‖_k ≤ Ck
Sub-Weibull      F̄(x) ≤ e^(−λx^(1/θ))     ‖X‖_k ≤ Ck^θ

• θ > 0 is called the tail parameter
• ‖X‖_k ≍ k^θ  ⟹  X ∼ subW(θ), with θ called optimal
• subW(1/2) = subG, subW(1) = subE
• A larger θ implies a heavier tail
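As a rough empirical illustration of the table (a sketch in Python with NumPy, not part of the paper), the snippet below estimates the moment norms ‖X‖_k for three reference distributions; the heavier the tail, the faster the norms grow with k. High-order empirical moments are noisy, so only the growth trend is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def moment_norm(x, k):
    # Empirical k-th moment norm ||X||_k = (E|X|^k)^(1/k).
    return np.mean(np.abs(x) ** k) ** (1.0 / k)

n = 1_000_000
samples = {
    "Gaussian      (subW(1/2))": rng.standard_normal(n),
    "Exponential   (subW(1))  ": rng.exponential(size=n),
    "Weibull(1/2)  (subW(2))  ": rng.weibull(0.5, size=n),
}

# ||X||_k grows like sqrt(k) for sub-Gaussian, like k for sub-exponential,
# and like k^theta for sub-Weibull(theta): heavier tails, faster growth.
for name, x in samples.items():
    norms = [moment_norm(x, k) for k in (2, 4, 8)]
    print(name, ["%.2f" % v for v in norms])
```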
Outline
• Sub-Weibull distributions
• Main result: Prior on units gets heavier-tailed with depth
• Regularization interpretation
Assumptions on the neural network

We analyze Bayesian neural networks satisfying the following assumptions.

(A1) Parameters. The weights w have i.i.d. Gaussian priors, w ∼ N(µ, σ²).
(A2) Nonlinearity. ReLU-like, with the envelope property: there exist c₁, c₂, d₂ ≥ 0 and d₁ > 0 such that
     |φ(u)| ≥ c₁ + d₁|u| for all u ∈ R₊ or for all u ∈ R₋,
     |φ(u)| ≤ c₂ + d₂|u| for all u ∈ R.

• Examples: ReLU, ELU, PReLU, etc.; bounded nonlinearities such as sigmoid and tanh are excluded.
• The nonlinearity does not harm the distributional tail: ‖φ(X)‖_k ≍ ‖X‖_k for all k ∈ N.
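A quick numerical check of the last point (an illustrative sketch assuming NumPy, not the authors' code): for ReLU applied to a Gaussian input, the moment norms before and after the nonlinearity stay within constant factors of each other.

```python
import numpy as np

rng = np.random.default_rng(1)

def moment_norm(x, k):
    return np.mean(np.abs(x) ** k) ** (1.0 / k)

# ReLU satisfies the envelope property with c1 = c2 = 0 and d1 = d2 = 1:
# |phi(u)| = |u| on R+, and |phi(u)| <= |u| everywhere.
relu = lambda u: np.maximum(u, 0.0)

x = rng.standard_normal(1_000_000)
for k in (2, 4, 8):
    ratio = moment_norm(relu(x), k) / moment_norm(x, k)
    print(f"k = {k}: ||phi(X)||_k / ||X||_k = {ratio:.3f}")
# The ratios stay bounded away from 0 and infinity (for symmetric X they
# equal (1/2)^(1/k)), illustrating ||phi(X)||_k ≍ ||X||_k.
```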
Main theorem

Consider a Bayesian neural network with (A1) i.i.d. Gaussian priors on the weights and (A2) a nonlinearity satisfying the envelope property. Then, conditional on the input x, the marginal prior distribution of a unit u^(ℓ) of the ℓ-th hidden layer is sub-Weibull with optimal tail parameter θ = ℓ/2:

    π^(ℓ)(u) ∼ subW(ℓ/2).

[Figure: network diagram with input x and hidden units u^(1) ∼ subW(1/2), u^(2) ∼ subW(1), u^(3) ∼ subW(3/2), …, u^(ℓ) ∼ subW(ℓ/2); log-scale plot of the survival function P(X ≥ x) for subW(1/2), subW(1), subW(3/2), subW(5) and subW(50), showing heavier tails as θ grows.]
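A minimal Monte Carlo sketch of the theorem (assuming NumPy; the width, input, sample size and the kurtosis diagnostic are illustrative choices, not the paper's experiment). Deeper units should have heavier-tailed priors, which shows up here as an empirical kurtosis that grows with depth (it is about 3 at layer 1, where the unit is exactly Gaussian).

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_unit(x, depth, width=50, sigma=1.0, n_samples=5000):
    """Draw n_samples values of one pre-activation unit in the depth-th
    hidden layer, under i.i.d. zero-mean N(0, sigma^2) priors on all
    weights and a ReLU nonlinearity, for a fixed input x."""
    out = np.empty(n_samples)
    for s in range(n_samples):
        h = x
        for _ in range(depth):
            W = sigma * rng.standard_normal((width, h.shape[0]))
            g = W @ h                 # pre-activation of this layer
            h = np.maximum(g, 0.0)    # ReLU
        out[s] = g[0]                 # keep one unit of the depth-th layer
    return out

x = np.ones(10) / np.sqrt(10.0)       # arbitrary normalized input
for depth in (1, 2, 3):
    u = sample_unit(x, depth)
    # Kurtosis is scale-free; heavier tails give larger values.
    kurt = np.mean(u ** 4) / np.mean(u ** 2) ** 2
    print(f"layer {depth}: empirical kurtosis ~ {kurt:.1f}")
```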
Outline
• Sub-Weibull distributions
• Main result: Prior on units gets heavier-tailed with depth
• Regularization interpretation
Interpretation: shrinkage effect

Maximum a posteriori (MAP) estimation is a regularized optimization problem:

    max_W π(W | D) ∝ L(D | W) π(W)
    ⟺ min_W −log L(D | W) − log π(W)
    ⟺ min_W L(W) + λ R(W),

where L(W) is a loss function and R(W) is a regularizer (a norm on R^p).

Weight distribution vs. ℓ-th layer unit distribution:

    π(w) ≈ e^(−w²)  ⟹  π^(ℓ)(u) ≈ e^(−u^(2/ℓ))

Layer   Penalty on W                      Penalty on U
1       ‖W^(1)‖²₂  (L2, weight decay)     ‖U^(1)‖²₂            (L2)
2       ‖W^(2)‖²₂  (L2)                   ‖U^(2)‖₁             (L1, Lasso)
ℓ       ‖W^(ℓ)‖²₂  (L2)                   ‖U^(ℓ)‖^(2/ℓ)_(2/ℓ)  (L_(2/ℓ))

[Figure: unit balls of the unit penalty in the (U₁, U₂) plane for layers 1, 2, 3 and 10; the balls become increasingly star-shaped as depth grows.]
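To make the unit-level penalty concrete, here is a small helper (an illustrative sketch, not the authors' code) that evaluates the L_(2/ℓ) penalty from the table for a few layers; the exponent 2/ℓ shrinks with depth, which is the shrinkage effect pictured in the unit-ball figure.

```python
import numpy as np

def unit_penalty(u, layer):
    # L_(2/layer) penalty from the table above: sum_i |u_i|^(2/layer).
    # layer = 1 gives the squared L2 norm, layer = 2 the L1 (Lasso) norm.
    q = 2.0 / layer
    return np.sum(np.abs(u) ** q)

u = np.array([0.5, -1.5, 0.0, 2.0])
for layer in (1, 2, 3, 10):
    print(f"layer {layer}: L_(2/{layer}) penalty = {unit_penalty(u, layer):.3f}")
# As the layer index grows, the exponent 2/layer shrinks and the penalty's
# unit balls become increasingly star-shaped, i.e. more sparsity-inducing
# on the units.
```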
Conclusion

(i) We define the notion of sub-Weibull distributions, characterized by tails lighter than, or as light as, Weibull tails.
(ii) We prove that the marginal prior distributions of the units become heavier-tailed as depth increases.
(iii) We offer an interpretation from a regularization viewpoint.

Future directions:
• Prove a Gaussian process limit for these sub-Weibull unit priors in the wide regime;
• Investigate whether the described regularization mechanism induces sparsity at the unit level.