The Problem – reformulated
We need informed or systematic design strategies for choosing the network structure.
The Solution strategy – This work
What is the best possible network for the given task? We need informed design strategies.
Part I: Construct the relevant bounds
• Gradient convergence + Learning mechanism + Network/Data statistics
Part II: Construct design procedures using the bounds
• For a given dataset and a pre-specified convergence level, find the depth, the hidden layer lengths, etc.
The Interplay
Gradient convergence + Learning mechanism + Network/Data statistics
→ The depth parameter L
→ The layer lengths (d_0, d_1, ..., d_{L−1}, d_L)
→ The activation functions (σ_1, ..., σ_L): bounded and smooth; the focus here is on the sigmoid
→ Average first moments of the data: µ_x = (1/d_0) Σ_j E[x_j],  τ_x = (1/d_0) Σ_j E[x_j²]
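A tiny helper makes these data statistics concrete; this is a sketch that assumes the reconstruction above (coordinate-averaged first and second moments, with expectations replaced by empirical means over a sample matrix):

```python
import numpy as np

def data_first_moments(X):
    """Coordinate-averaged empirical moments of the inputs.
    X: data matrix with one sample per row and d_0 columns.
    mu_x averages the per-coordinate means E[x_j] over j = 1..d_0;
    tau_x does the same for the second moments E[x_j^2]."""
    mu_x = float(np.mean(X))        # (1/d_0) * sum_j (mean of column j)
    tau_x = float(np.mean(X ** 2))  # (1/d_0) * sum_j (mean of column j squared)
    return mu_x, tau_x
```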
The Interplay
    min_W f(W) := E_{(x,y) ∼ X} [ L(x, y; W) ]
→ L := the ℓ2 loss
→ Either W ∈ R^d (stochastic gradients), or W ∈ Ω := the box constraint [−w, w]^d (projected gradients)
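The two update rules differ only in a projection step. A minimal sketch of a single update, with the box constraint treated as optional and the stochastic gradient passed in as a placeholder argument:

```python
import numpy as np

def gradient_step(W, grad_W, gamma, w_box=None):
    """One stochastic-gradient update. If a box constraint [-w_box, w_box]^d
    is supplied (the set Omega above), the iterate is projected back onto it;
    otherwise the step is unconstrained."""
    W_new = W - gamma * grad_W
    if w_box is not None:
        W_new = np.clip(W_new, -w_box, w_box)  # Euclidean projection onto the box
    return W_new
```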
The Interplay – Gradient Convergence
Ideally, we are interested in generalization.
Convergence instead? This gives control on the last/stopping iterate; R denotes the stopping iteration. In general, the training time is fixed a priori.
The expected gradients:
    ∆ := E_{R,x,y} ‖∇_W f(W^R)‖²
Under mild assumptions, ∆ can be bounded whenever R is chosen randomly [Ghadimi and Lan, 2013].
The Interplay – Gradient Convergence
Gradient backpropagation + a random stop after some number of iterations.
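A minimal sketch of this randomized-stopping procedure, in the style of randomized stochastic gradient methods [Ghadimi and Lan, 2013]. The gradient oracle `grad_fn` is a placeholder, and the stopping distribution P_R(k) ∝ γ_k (1 − 0.75 γ_k) is taken from the single-layer result stated next:

```python
import numpy as np

def randomized_stopping_sgd(grad_fn, w0, N, gamma, rho, seed=0):
    """Run stochastic gradient steps and return the iterate at a randomly
    drawn stopping iteration R <= N.

    grad_fn(w) -- placeholder stochastic-gradient oracle
    N          -- maximum allowable number of iterations
    gamma, rho -- stepsize schedule gamma_k = gamma / k**rho
    """
    rng = np.random.default_rng(seed)
    gammas = gamma / np.arange(1, N + 1) ** rho
    # Stopping distribution proportional to gamma_k * (1 - 0.75 * gamma_k).
    p_stop = np.clip(gammas * (1.0 - 0.75 * gammas), 0.0, None)
    p_stop /= p_stop.sum()
    R = int(rng.choice(np.arange(1, N + 1), p=p_stop))

    w = np.array(w0, dtype=float)
    for k in range(1, R + 1):
        w = w - gammas[k - 1] * grad_fn(w)
    return w, R
```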
Problem Single-layer Networks The Interplay – Gradient Convergence Single-layer Network (UW-Madison) Network structure vs. convergence Sep 28, 2016 21 / 41
Problem Single-layer Networks The Interplay – Gradient Convergence Single-layer Network Expected Gradients For 1-layer network with stepsizes γ k = γ k ρ ( ρ > 0) and P R ( k ) = γ k ( 1 − 0 . 75 γ k ) , we have � D f � ∆ ≤ + Ψ H N (UW-Madison) Network structure vs. convergence Sep 28, 2016 21 / 41
Problem Single-layer Networks The Interplay – Gradient Convergence Decreasing stepsizes Single-layer Network Expected Gradients For 1-layer network with stepsizes γ k = γ k ρ ( ρ > 0) and P R ( k ) = γ k ( 1 − 0 . 75 γ k ) , we have � D f � ∆ ≤ + Ψ H N (UW-Madison) Network structure vs. convergence Sep 28, 2016 21 / 41
Problem Single-layer Networks The Interplay – Gradient Convergence Decreasing stepsizes Single-layer Network Expected Gradients For 1-layer network with stepsizes γ k = γ k ρ ( ρ > 0) and P R ( k ) = γ k ( 1 − 0 . 75 γ k ) , we have � D f � ∆ ≤ + Ψ H N N : Maximum allowable iterations the stopping distribution R ∈ [ 1 , N ] ( N ≫ R ) (UW-Madison) Network structure vs. convergence Sep 28, 2016 21 / 41
Problem Single-layer Networks The Interplay – Gradient Convergence Decreasing stepsizes Single-layer Network Expected Gradients For 1-layer network with stepsizes γ k = γ k ρ ( ρ > 0) and P R ( k ) = γ k ( 1 − 0 . 75 γ k ) , we have � D f � ∆ ≤ + Ψ H N N : Maximum allowable iterations the stopping distribution R ∈ [ 1 , N ] ( N ≫ R ) ∆ : Expected gradients (UW-Madison) Network structure vs. convergence Sep 28, 2016 21 / 41
The terms in the bound ∆ ≤ D_f / H_N + Ψ:
D_f ≈ f(W^1): the goodness of fit — the influence of the starting point W^1.
H_N ≈ 0.2 γ · GenHar(N, ρ): the first term therefore decays sublinearly in N, the maximum allowable number of iterations.
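GenHar(N, ρ) is not expanded on the slide; assuming it denotes the generalized harmonic number Σ_{k=1}^{N} k^{−ρ}, a one-line helper:

```python
def gen_har(N, rho):
    """Generalized harmonic number sum_{k=1}^{N} 1 / k**rho
    (assumed meaning of GenHar(N, rho) above)."""
    return sum(1.0 / k ** rho for k in range(1, N + 1))
```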
The remaining term:
Ψ ≈ q · d_0 d_1 γ / B with 0.05 < q < 0.25, where d_0 d_1 is the number of unknowns — the influence of the number of free parameters (degrees of freedom) and the bias from the mini-batch size B.
Consequences of the bound:
• Ideal scenario: a large number of samples and a small network.
• Realistic scenario: a reasonably sized network, with a large mini-batch size B and a long training time.
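To make these trade-offs concrete, here is a sketch that evaluates the bound numerically with the terms as reconstructed above (H_N ≈ 0.2 γ · GenHar(N, ρ), Ψ ≈ q d_0 d_1 γ / B); all numeric values, including the choice q = 0.15 inside the stated interval, are illustrative and not from the talk:

```python
def expected_gradient_bound(D_f, N, gamma, rho, d0, d1, B, q=0.15):
    """Evaluate Delta <= D_f / H_N + Psi for a single-layer network, with
    H_N ~ 0.2 * gamma * GenHar(N, rho) and Psi ~ q * d0 * d1 * gamma / B.
    q is only known to lie in (0.05, 0.25); 0.15 is an arbitrary midpoint."""
    H_N = 0.2 * gamma * sum(1.0 / k ** rho for k in range(1, N + 1))
    Psi = q * d0 * d1 * gamma / B
    return D_f / H_N + Psi

# Larger mini-batches shrink Psi; more iterations shrink the D_f / H_N term.
print(expected_gradient_bound(D_f=1.0, N=50_000, gamma=0.1, rho=0.5,
                              d0=128, d1=64, B=32))
print(expected_gradient_bound(D_f=1.0, N=50_000, gamma=0.1, rho=0.5,
                              d0=128, d1=64, B=512))
```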
Special cases of the stopping distribution:
For small ρ, i.e., a slow stepsize decay, P_R(k) approaches a uniform distribution:
    ∆ ≲ 5 D_f / (N γ) + Ψ
When ρ = 0, i.e., a constant stepsize, P_R(k) := UNIF[1, N]:
    ∆ ≤ D_f / (N γ) + Ψ
Uniform stopping may not be interesting!
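Read as a design rule (Part II of the talk), the constant-stepsize bound can be solved for the training length: given a target value of ∆ and the Ψ term, N must satisfy D_f / (N γ) + Ψ ≤ ∆. A sketch, with all inputs treated as known quantities:

```python
import math

def min_iterations_for_target(D_f, gamma, Psi, delta_target):
    """Smallest N with D_f / (N * gamma) + Psi <= delta_target, read off the
    constant-stepsize (rho = 0) bound. Returns None when Psi alone already
    exceeds the target, i.e., no training length can reach it."""
    if Psi >= delta_target:
        return None
    return math.ceil(D_f / (gamma * (delta_target - Psi)))

# Illustrative numbers only.
print(min_iterations_for_target(D_f=1.0, gamma=0.1, Psi=0.02, delta_target=0.05))
```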
The Interplay – Gradient Convergence: Single-layer Network + a customized P_R(k)
Idea: push R to be as close as possible to N. The slide's example (shown as a plot): P_R(k) = 0 on the early iterations and P_R(k) = ν/N towards the end; for ν ≫ 1, R → N, although the bound can become loose.
Expected Gradients, with the P_R(·) from the example above. For a single-layer network with a constant stepsize γ, we have
    ∆ ≤ 5 D_f / (ν N γ) + Ψ
(The construction requires P_R(k) ≤ P_R(k + 1).)
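A sketch of such a concentrated stopping distribution. The slide does not spell out the support, so the assumption here is that the mass ν/N sits on the last ⌈N/ν⌉ iterations, which makes the probabilities sum to one and keeps P_R non-decreasing:

```python
import numpy as np

def concentrated_stopping_distribution(N, nu):
    """Stopping distribution that pushes R toward N: zero mass early,
    mass nu/N on the tail. Assumed support: the last ceil(N/nu) iterations
    (not specified on the slide)."""
    p = np.zeros(N)
    tail = int(np.ceil(N / nu))
    p[-tail:] = nu / N
    p /= p.sum()                       # exact renormalization
    assert np.all(np.diff(p) >= 0.0)   # P_R(k) <= P_R(k+1), as required
    return p

# Draw a stopping iteration; for nu >> 1 it concentrates near N.
rng = np.random.default_rng(0)
R = int(rng.choice(np.arange(1, 10_001), p=concentrated_stopping_distribution(10_000, nu=50)))
```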
The Interplay – Gradient Convergence: Single-layer Network
Using T independent random stopping iterations gives a large-deviation estimate.
Let ε > 0 and 0 < δ ≪ 1. An (ε, δ)-solution guarantees
    Pr( min_t ‖∇_W f(W^{R_t})‖² ≤ ε ) ≥ 1 − δ.
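A sketch of the corresponding post-processing step: launch T independent randomized-stopping runs and keep the candidate whose estimated squared gradient norm is smallest. Both `run_once` and `grad_norm_sq` are placeholder callables, not part of the talk:

```python
import numpy as np

def best_of_T_runs(run_once, grad_norm_sq, T, seed=0):
    """Run T independent randomized-stopping trials and return the candidate
    with the smallest estimated squared gradient norm -- the quantity the
    (eps, delta) guarantee above refers to.

    run_once(rng)   -- placeholder: one randomized-stopping run, returns W^{R_t}
    grad_norm_sq(W) -- placeholder: estimate of ||grad_W f(W)||^2
    """
    rng = np.random.default_rng(seed)
    candidates = [run_once(np.random.default_rng(rng.integers(2**31))) for _ in range(T)]
    return min(candidates, key=grad_norm_sq)
```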
The Interplay
Gradient convergence + Learning mechanism + Network/Data statistics
Multi-layer Neural Network: L − 1 single-layer networks put together.
Typical mechanism
• Initialize (or warm-start, or pretrain) each of the layers sequentially (a sketch follows this list):
    x → x̃ (w.p. 1 − ζ, the j-th unit is set to 0)
    h¹ = σ(W¹ x̃), with L(x, W) = ‖x − h¹‖² and W ∈ [−w, w]^d
  This is referred to as a Denoising Autoencoder (DA).
• L − 1 such DAs are learned: x → h¹ → ... → h^{L−2} → h^{L−1}
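A minimal sketch of pretraining one such DA layer with projected stochastic gradient steps. Two assumptions not made explicit on the slide: the hidden layer is given the same width as the input, so that ‖x − h¹‖² is well defined without a decoder, and the corruption keeps a unit with probability ζ (zeroing it with probability 1 − ζ, as stated above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_denoising_autoencoder(X, zeta, w_box, gamma, N, seed=0):
    """Sequential-pretraining step for one layer: projected SGD on the
    reconstruction loss ||x - sigmoid(W x_tilde)||^2 with W kept inside
    the box [-w_box, w_box].

    X     -- data matrix, one sample per row (d_0 columns)
    zeta  -- a unit of x is kept w.p. zeta and zeroed w.p. 1 - zeta
    """
    rng = np.random.default_rng(seed)
    d0 = X.shape[1]
    W = rng.uniform(-w_box, w_box, size=(d0, d0))   # assumed square, see lead-in
    for _ in range(N):
        x = X[rng.integers(len(X))]
        x_tilde = x * (rng.random(d0) < zeta)       # corrupted input
        h = sigmoid(W @ x_tilde)
        # Gradient of ||x - h||^2 through the sigmoid: 2(h - x) * h * (1 - h).
        delta = 2.0 * (h - x) * h * (1.0 - h)
        W = np.clip(W - gamma * np.outer(delta, x_tilde), -w_box, w_box)
    return W
```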
Typical mechanism, continued
• Bring in the y's and perform backpropagation: use stochastic gradients, starting at the L-th layer and propagating the gradients backwards.
  → Dropout: update only a fraction (ζ) of all the parameters (see the sketch below).
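A sketch of a single parameter update under this reading of dropout, i.e., only a random fraction ζ of the entries of W is touched in a given step. This follows the slide's description rather than the usual activation-masking implementation:

```python
import numpy as np

def dropout_style_update(W, grad_W, gamma, zeta, rng):
    """Update only a random fraction zeta of the parameters; the remaining
    entries of W are left unchanged for this step."""
    mask = rng.random(W.shape) < zeta   # True w.p. zeta: this entry gets updated
    return W - gamma * grad_W * mask
```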
The Interplay – Learning Mechanism: Multi-layer Neural Network
The new mechanism – a randomized stopping strategy at all stages:
• The L − 1 layers are initialized to (α, δ_α) solutions; α is the goodness of the pretraining.
• Gradient backpropagation is then performed to an (ε, δ) solution.
The Interplay – The most general result
Multi-layer Neural Network. For an L-layered network with dropout rate ζ and constant stepsize γ, pretrained to (α, δ_α) solutions, we have
    ∆ ≤ D_f / (N e) + Π