Global Convergence of Block Coordinate Descent in Deep Learning
Jinshan Zeng 1,2,*   Tim Tsz-Kit Lau 3,*   Shaobo Lin 4   Yuan Yao 2
1 Jiangxi Normal Univ.   2 HKUST   3 Department of Statistics, Northwestern University   4 CityU HK
* Equal contribution
ICML 2019, Long Beach, CA
INTRODUCTION
MOTIVATION OF BLOCK COORDINATE DESCENT (BCD) IN DEEP LEARNING
◦ Gradient-based methods are commonly used in training deep neural networks
◦ But gradient-based methods may suffer from various problems for deep networks
◦ Gradients of the loss function w.r.t. parameters of earlier layers involve those of later layers
  ⇒ Gradient vanishing
  ⇒ Gradient exploding
◦ First-order gradient-based methods do not work well
◦ Gradient-free methods have recently been adapted to training DNNs:
  – Block Coordinate Descent (BCD)
  – Alternating Direction Method of Multipliers (ADMM)
◦ Advantages of gradient-free methods:
  – Deal with non-differentiable nonlinearities
  – Potentially avoid vanishing gradients
  – Can be easily implemented in a distributed and parallel fashion
BLOCK COORDINATE DESCENT IN DEEP LEARNING
BLOCK COORDINATE DESCENT IN DEEP LEARNING
◦ View parameters of hidden layers and the output layer as variable blocks
◦ Variable splitting: split the highly coupled network layer-wise to compose a surrogate loss function
◦ Notations:
  – W := {W_ℓ}_{ℓ=1}^L : the set of layer parameters
  – L : R^k × R^k → R_+ ∪ {0} : loss function
  – Φ(x_i; W) := σ_L(W_L σ_{L−1}(W_{L−1} ⋯ W_2 σ_1(W_1 x_i))) : the neural network
◦ Empirical risk minimization:
  min_W  R_n(Φ(X; W), Y) := (1/n) Σ_{i=1}^n L(Φ(x_i; W), y_i)
◦ Two ways of variable splitting appear in the literature
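As a running illustration, below is a minimal NumPy sketch of the network map Φ and the empirical risk R_n. The squared loss and ReLU activation used here are just one admissible choice (see the Proposition later); all function names are illustrative and not from the authors' code.

```python
import numpy as np

def relu(z):
    # ReLU activation, applied entry-wise
    return np.maximum(z, 0.0)

def forward(X, Ws, sigmas):
    # Phi(X; W) = sigma_L(W_L sigma_{L-1}(... sigma_1(W_1 X) ...)),
    # with data stored column-wise: X has shape (d, n).
    V = X
    for W, sigma in zip(Ws, sigmas):
        V = sigma(W @ V)
    return V

def empirical_risk(X, Y, Ws, sigmas):
    # R_n(Phi(X; W), Y) = (1/n) * sum_i L(Phi(x_i; W), y_i),
    # instantiated with the squared loss L(a, b) = (1/2) * ||a - b||^2.
    n = X.shape[1]
    return 0.5 * np.linalg.norm(forward(X, Ws, sigmas) - Y) ** 2 / n
```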
BCD IN DEEP LEARNING: TWO-SPLITTING FORMULATION
◦ Introduce one set of auxiliary variables V := {V_ℓ}_{ℓ=1}^L
  min_{W,V}  L_0(W, V) := R_n(V_L; Y) + Σ_{ℓ=1}^L r_ℓ(W_ℓ) + Σ_{ℓ=1}^L s_ℓ(V_ℓ)
  subject to  V_ℓ = σ_ℓ(W_ℓ V_{ℓ−1}),  ℓ ∈ {1, …, L}
◦ The functions r_ℓ and s_ℓ are regularizers
◦ Rewritten as unconstrained optimization:
  min_{W,V}  L(W, V) := L_0(W, V) + (γ/2) Σ_{ℓ=1}^L ‖V_ℓ − σ_ℓ(W_ℓ V_{ℓ−1})‖_F²
◦ γ > 0 is a hyperparameter
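A sketch of the unconstrained two-splitting objective L(W, V), again assuming the squared loss and defaulting the regularizers r_ℓ, s_ℓ to zero; σ_L may be the identity for a linear output layer. Names are illustrative.

```python
def two_splitting_objective(Ws, Vs, X, Y, sigmas, gamma, rs=None, ss=None):
    # L(W, V) = R_n(V_L; Y) + sum_l r_l(W_l) + sum_l s_l(V_l)
    #           + (gamma/2) * sum_l ||V_l - sigma_l(W_l V_{l-1})||_F^2
    L = len(Ws)
    rs = rs or [lambda W: 0.0] * L       # no regularization by default
    ss = ss or [lambda V: 0.0] * L
    n = X.shape[1]
    obj = 0.5 * np.linalg.norm(Vs[-1] - Y) ** 2 / n   # R_n(V_L; Y), squared loss
    V_prev = X                                        # V_0 := X
    for l in range(L):
        obj += rs[l](Ws[l]) + ss[l](Vs[l])
        obj += 0.5 * gamma * np.linalg.norm(Vs[l] - sigmas[l](Ws[l] @ V_prev)) ** 2
        V_prev = Vs[l]
    return obj
```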
TWO-SPLITTING FORMULATION: GRAPHICAL ILLUSTRATION
[Figure: a network with four inputs, one hidden layer, and an output layer; the input X ∈ R^{4×n}, the hidden-layer output σ_1(W_1 X) =: V_1, and the output Ŷ = W_2 V_1]
◦ Jointly minimize the distances (in terms of squared Frobenius norms) between the input and the output of hidden layers
◦ E.g., define V_0 := X and penalize ‖V_1 − σ_1(W_1 V_0)‖_F²
BCD IN DEEP LEARNING: THREE-SPLITTING FORMULATION
◦ Introduce two sets of auxiliary variables U := {U_ℓ}_{ℓ=1}^L, V := {V_ℓ}_{ℓ=1}^L
  min_{W,V,U}  L_0(W, V)
  subject to  U_ℓ = W_ℓ V_{ℓ−1},  V_ℓ = σ_ℓ(U_ℓ),  ℓ ∈ {1, …, L}
◦ Rewritten as unconstrained optimization:
  min_{W,V,U}  L(W, V, U) := L_0(W, V) + (γ/2) Σ_{ℓ=1}^L ( ‖V_ℓ − σ_ℓ(U_ℓ)‖_F² + ‖U_ℓ − W_ℓ V_{ℓ−1}‖_F² )
◦ Variables are more loosely coupled than those in the two-splitting formulation
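The corresponding sketch for the three-splitting objective L(W, V, U), under the same assumptions (squared loss, zero regularizers by default; names illustrative). The extra U-block separates the linear map W_ℓ V_{ℓ−1} from the nonlinearity σ_ℓ.

```python
def three_splitting_objective(Ws, Vs, Us, X, Y, sigmas, gamma, rs=None, ss=None):
    # L(W, V, U) = L_0(W, V)
    #   + (gamma/2) * sum_l ( ||V_l - sigma_l(U_l)||_F^2 + ||U_l - W_l V_{l-1}||_F^2 )
    L = len(Ws)
    rs = rs or [lambda W: 0.0] * L
    ss = ss or [lambda V: 0.0] * L
    n = X.shape[1]
    obj = 0.5 * np.linalg.norm(Vs[-1] - Y) ** 2 / n   # R_n(V_L; Y), squared loss
    V_prev = X                                        # V_0 := X
    for l in range(L):
        obj += rs[l](Ws[l]) + ss[l](Vs[l])
        obj += 0.5 * gamma * (np.linalg.norm(Vs[l] - sigmas[l](Us[l])) ** 2
                              + np.linalg.norm(Us[l] - Ws[l] @ V_prev) ** 2)
        V_prev = Vs[l]
    return obj
```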
THREE-SPLITTING FORMULATION: GRAPHICAL ILLUSTRATION
[Figure: the same four-input network with one hidden layer; the input X ∈ R^{4×n}, the pre-activation W_1 X =: U_1, the post-activation σ_1(U_1) =: V_1, and the output Ŷ = W_2 V_1]
◦ Jointly minimize the distances (in terms of squared Frobenius norms) between
  1. the input and the pre-activation output of hidden layers
  2. the pre-activation output and the post-activation output of hidden layers
◦ E.g., define V_0 := X and penalize ‖U_1 − W_1 V_0‖_F² + ‖V_1 − σ_1(U_1)‖_F²
BLOCK COORDINATE DESCENT (BCD) ALGORITHMS
BLOCK COORDINATE DESCENT (BCD) ALGORITHMS
◦ Devise algorithms for training DNNs based on the two splitting formulations
◦ Update all the variable blocks cyclically while fixing the remaining blocks
◦ Update in a backward order as in backpropagation
◦ Adopt proximal update strategies
BCD ALGORITHM (TWO-SPLITTING)

Algorithm 1: Two-splitting BCD for DNN training
Data: X ∈ R^{d×n}, Y ∈ R^{k×n}
Initialization: {W_ℓ^{(0)}, V_ℓ^{(0)}}_{ℓ=1}^L, V_0^{(t)} ≡ V_0 := X
Hyperparameters: γ > 0, α > 0
for t = 1, … do
  V_L^{(t)} = argmin_{V_L} { s_L(V_L) + R_n(V_L; Y) + (γ/2)‖V_L − W_L^{(t−1)} V_{L−1}^{(t−1)}‖_F² + (α/2)‖V_L − V_L^{(t−1)}‖_F² }
  W_L^{(t)} = argmin_{W_L} { r_L(W_L) + (γ/2)‖V_L^{(t)} − W_L V_{L−1}^{(t−1)}‖_F² + (α/2)‖W_L − W_L^{(t−1)}‖_F² }
  for ℓ = L−1, …, 1 do
    V_ℓ^{(t)} = argmin_{V_ℓ} { s_ℓ(V_ℓ) + (γ/2)‖V_ℓ − σ_ℓ(W_ℓ^{(t−1)} V_{ℓ−1}^{(t−1)})‖_F² + (γ/2)‖V_{ℓ+1}^{(t)} − σ_{ℓ+1}(W_{ℓ+1}^{(t)} V_ℓ)‖_F² + (α/2)‖V_ℓ − V_ℓ^{(t−1)}‖_F² }
    W_ℓ^{(t)} = argmin_{W_ℓ} { r_ℓ(W_ℓ) + (γ/2)‖V_ℓ^{(t)} − σ_ℓ(W_ℓ V_{ℓ−1}^{(t−1)})‖_F² + (α/2)‖W_ℓ − W_ℓ^{(t−1)}‖_F² }
  end for
end for
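For concreteness, a sketch of the two output-layer updates of Algorithm 1 in the special case of a squared loss with r_L = s_L = 0: both subproblems are then quadratic and admit closed forms. The hidden-layer subproblems involve σ_ℓ and are not reproduced here; all names are illustrative.

```python
def two_splitting_output_updates(Y, W_L_prev, V_L_prev, V_Lm1_prev, gamma, alpha):
    n = Y.shape[1]
    # V_L update: argmin_V (1/(2n))||V - Y||_F^2                        (squared loss, s_L = 0)
    #                    + (gamma/2)||V - W_L^(t-1) V_{L-1}^(t-1)||_F^2
    #                    + (alpha/2)||V - V_L^(t-1)||_F^2
    A = W_L_prev @ V_Lm1_prev
    V_L = (Y / n + gamma * A + alpha * V_L_prev) / (1.0 / n + gamma + alpha)

    # W_L update: argmin_W (gamma/2)||V_L^(t) - W V_{L-1}^(t-1)||_F^2   (r_L = 0)
    #                    + (alpha/2)||W - W_L^(t-1)||_F^2
    # Normal equations: W (gamma B B^T + alpha I) = gamma V_L B^T + alpha W_L^(t-1)
    B = V_Lm1_prev
    G = gamma * (B @ B.T) + alpha * np.eye(B.shape[0])
    W_L = np.linalg.solve(G, (gamma * V_L @ B.T + alpha * W_L_prev).T).T
    return V_L, W_L
```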
BCD ALGORITHM (THREE-SPLITTING)

Algorithm 2: Three-splitting BCD for DNN training
Samples: X ∈ R^{d×n}, Y ∈ R^{k×n}
Initialization: {W_ℓ^{(0)}, V_ℓ^{(0)}, U_ℓ^{(0)}}_{ℓ=1}^L, V_0^{(t)} ≡ V_0 := X
Hyperparameters: γ > 0, α > 0
for t = 1, … do
  V_L^{(t)} = argmin_{V_L} { s_L(V_L) + R_n(V_L; Y) + (γ/2)‖V_L − U_L^{(t−1)}‖_F² + (α/2)‖V_L − V_L^{(t−1)}‖_F² }
  U_L^{(t)} = argmin_{U_L} { (γ/2)‖V_L^{(t)} − U_L‖_F² + (γ/2)‖U_L − W_L^{(t−1)} V_{L−1}^{(t−1)}‖_F² }
  W_L^{(t)} = argmin_{W_L} { r_L(W_L) + (γ/2)‖U_L^{(t)} − W_L V_{L−1}^{(t−1)}‖_F² + (α/2)‖W_L − W_L^{(t−1)}‖_F² }
  for ℓ = L−1, …, 1 do
    V_ℓ^{(t)} = argmin_{V_ℓ} { s_ℓ(V_ℓ) + (γ/2)‖V_ℓ − σ_ℓ(U_ℓ^{(t−1)})‖_F² + (γ/2)‖U_{ℓ+1}^{(t)} − W_{ℓ+1}^{(t)} V_ℓ‖_F² }
    U_ℓ^{(t)} = argmin_{U_ℓ} { (γ/2)‖V_ℓ^{(t)} − σ_ℓ(U_ℓ)‖_F² + (γ/2)‖U_ℓ − W_ℓ^{(t−1)} V_{ℓ−1}^{(t−1)}‖_F² + (α/2)‖U_ℓ − U_ℓ^{(t−1)}‖_F² }
    W_ℓ^{(t)} = argmin_{W_ℓ} { r_ℓ(W_ℓ) + (γ/2)‖U_ℓ^{(t)} − W_ℓ V_{ℓ−1}^{(t−1)}‖_F² + (α/2)‖W_ℓ − W_ℓ^{(t−1)}‖_F² }
  end for
end for
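To illustrate why the looser coupling helps, here is a sketch of the hidden-layer V_ℓ and W_ℓ subproblems of Algorithm 2 in the special case s_ℓ = r_ℓ = 0: both reduce to least-squares problems with closed-form solutions, so only the U_ℓ subproblem touches σ_ℓ (for ReLU it can be solved entry-wise). Names are illustrative.

```python
def three_splitting_V_update(U_l_prev, U_lp1, W_lp1, sigma_l):
    # argmin_V (gamma/2)||V - sigma_l(U_l^(t-1))||_F^2
    #        + (gamma/2)||U_{l+1}^(t) - W_{l+1}^(t) V||_F^2              (s_l = 0; gamma cancels)
    # Normal equations: (I + W_{l+1}^T W_{l+1}) V = sigma_l(U_l^(t-1)) + W_{l+1}^T U_{l+1}^(t)
    M = np.eye(W_lp1.shape[1]) + W_lp1.T @ W_lp1
    return np.linalg.solve(M, sigma_l(U_l_prev) + W_lp1.T @ U_lp1)

def three_splitting_W_update(U_l, V_lm1_prev, W_l_prev, gamma, alpha):
    # argmin_W (gamma/2)||U_l^(t) - W V_{l-1}^(t-1)||_F^2 + (alpha/2)||W - W_l^(t-1)||_F^2   (r_l = 0)
    # Normal equations: W (gamma B B^T + alpha I) = gamma U_l^(t) B^T + alpha W_l^(t-1)
    B = V_lm1_prev
    G = gamma * (B @ B.T) + alpha * np.eye(B.shape[0])
    return np.linalg.solve(G, (gamma * U_l @ B.T + alpha * W_l_prev).T).T
```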
GLOBAL CONVERGENCE ANALYSIS
ASSUMPTIONS OF THE FUNCTIONS FOR CONVERGENCE GUARANTEES

Assumption 1
Suppose that
(a) the loss function L is a proper lower semicontinuous¹ and nonnegative function,
(b) the activation functions σ_ℓ (ℓ = 1, …, L−1) are Lipschitz continuous on any bounded set,
(c) the regularizers r_ℓ and s_ℓ (ℓ = 1, …, L−1) are nonnegative lower semicontinuous convex functions, and
(d) all these functions L, σ_ℓ, r_ℓ and s_ℓ (ℓ = 1, …, L−1) are either real analytic or semialgebraic, and continuous on their domains.

¹ A function f : X → R is called lower semicontinuous if lim inf_{x→x_0} f(x) ≥ f(x_0) for any x_0 ∈ X.
EXAMPLES OF THE FUNCTIONS

Proposition
Examples satisfying Assumption 1 include:
(a) L is the squared, logistic, hinge, or cross-entropy loss;
(b) σ_ℓ is the ReLU, leaky ReLU, sigmoid, hyperbolic tangent, linear, polynomial, or softplus activation;
(c) r_ℓ and s_ℓ are the squared ℓ_2 norm, the ℓ_1 norm, the elastic net, the indicator function of some nonempty closed convex set (such as the nonnegative closed half-space, a box set, or a closed interval [0, 1]), or 0 if no regularization.
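In the BCD updates, the listed regularizers enter through their proximal operators; a small sketch for two of the examples in (c), the ℓ_1 norm (soft-thresholding) and the indicator of the nonnegative half-space (projection), both applied entry-wise. Names are illustrative.

```python
def prox_l1(Z, tau):
    # prox of tau * ||.||_1: entry-wise soft-thresholding
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def prox_nonneg_indicator(Z):
    # prox of the indicator of the nonnegative half-space: projection onto {Z >= 0}
    return np.maximum(Z, 0.0)
```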