Principled Deep Neural Network Training through Linear Programming

Daniel Bienstock (IEOR, Columbia University), Gonzalo Muñoz (IVADO, Polytechnique Montréal), Sebastian Pokutta (ISyE, Georgia Tech)

January 9, 2019
“...I’m starting to look at machine learning problems”
— Oktay Günlük’s research interests, Aussois 2019
Goal of this talk

• Deep Learning is receiving significant attention due to its impressive performance.
• Unfortunately, only recent results regarding the complexity of training deep neural networks have been obtained.
• Our goal: to show that large classes of Neural Networks can be trained to near optimality using linear programs whose size is linear in the data.
Empirical Risk Minimization problem

Given:
• D data points (x̂_i, ŷ_i), i = 1, …, D
• x̂_i ∈ R^n, ŷ_i ∈ R^m
• A loss function ℓ : R^m × R^m → R (not necessarily convex)

Compute f : R^n → R^m to solve

  min_{f ∈ F}  ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f)),

where F is some class of functions.
Empirical Risk Minimization problem

  min_{f ∈ F}  ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))

Examples:
• Linear Regression. f(x) = Ax + b with ℓ₂-loss.
• Binary Classification. Varying f architectures and cross-entropy loss: ℓ(p, y) = −y log(p) − (1 − y) log(1 − p).
• Neural Networks with k layers. f(x) = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ … ∘ σ ∘ T_1(x), each T_j affine.
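To make the objective concrete, here is a minimal sketch (not from the talk; the function names are illustrative) that evaluates the empirical risk for the linear-regression example with the ℓ₂-loss, with the cross-entropy loss included for comparison.

```python
import numpy as np

def empirical_risk(theta, X, Y, loss):
    """Sum of loss(f(x_i; theta), y_i) over the D data points."""
    return sum(loss(linear_model(theta, x), y) for x, y in zip(X, Y))

def linear_model(theta, x):
    """Linear-regression example: f(x) = Ax + b with theta = (A, b)."""
    A, b = theta
    return A @ x + b

def l2_loss(p, y):
    return float(np.sum((p - y) ** 2))

def cross_entropy(p, y):
    """Binary cross-entropy loss, assuming p is already in (0, 1)."""
    return float(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Toy data: D = 50 points with x_i in R^3 and y_i in R^2.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 3))
Y = rng.uniform(-1, 1, size=(50, 2))
theta = (np.zeros((2, 3)), np.zeros(2))
print(empirical_risk(theta, X, Y, l2_loss))
```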
Function parameterization

We assume family F (statisticians’ hypothesis) is parameterized: there exists f such that

  F = { f(x, θ) : θ ∈ Θ ⊆ [−1, 1]^N }.

Thus, THE problem becomes

  min_{θ ∈ Θ}  ∑_{i=1}^{D} ℓ(f(x̂_i, θ), ŷ_i)
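A minimal sketch of the parameterized problem, under the illustrative assumption that f(x, θ) = θᵀx with Θ = [−1, 1]^N; a generic local solver is used here only to make the formulation concrete, and of course it carries no global-optimality guarantee.

```python
import numpy as np
from scipy.optimize import minimize

def f(x, theta):
    # Hypothetical parameterization for illustration: f(x, theta) = theta . x, so N = n.
    return theta @ x

def empirical_risk(theta, X, Y):
    # sum_i l(f(x_i, theta), y_i) with the squared loss
    return sum((f(x, theta) - y) ** 2 for x, y in zip(X, Y))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 4))
Y = X @ np.array([0.5, -0.3, 0.2, 0.1])   # noiseless planted model

# min_{theta in Theta}  sum_i l(f(x_i, theta), y_i),  Theta = [-1, 1]^N
N = X.shape[1]
res = minimize(empirical_risk, x0=np.zeros(N), args=(X, Y),
               method="L-BFGS-B", bounds=[(-1.0, 1.0)] * N)
print(res.x.round(3), res.fun)
```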
What we know for Neural Nets
Neural Networks

• D data points (x̂_i, ŷ_i), 1 ≤ i ≤ D, x̂_i ∈ R^n, ŷ_i ∈ R^m
• f = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ … ∘ σ ∘ T_1
• Each T_i affine: T_i(y) = A_i y + b_i
• A_1 is n × w, A_{k+1} is w × m, A_i is w × w otherwise.
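A minimal sketch of this architecture (assuming σ is the ReLU, which the slide leaves generic), using the row-vector convention so that the matrix shapes match the slide: A_1 is n × w, the intermediate A_i are w × w, and A_{k+1} is w × m.

```python
import numpy as np

def relu(z):                          # sigma(t) = max{0, t}
    return np.maximum(z, 0.0)

def make_network(n, m, w, k, rng):
    """Affine maps T_1, ..., T_{k+1}: A_1 is n x w, A_i is w x w, A_{k+1} is w x m."""
    dims = [n] + [w] * k + [m]
    return [(rng.uniform(-1, 1, size=(dims[j], dims[j + 1])),
             rng.uniform(-1, 1, size=dims[j + 1])) for j in range(k + 1)]

def forward(x, layers):
    """f(x) = T_{k+1} o sigma o T_k o ... o sigma o T_1 (x), row-vector convention."""
    y = x
    for A, b in layers[:-1]:
        y = relu(y @ A + b)           # sigma applied after every hidden affine map
    A, b = layers[-1]
    return y @ A + b                  # no activation after the last affine map

rng = np.random.default_rng(2)
layers = make_network(n=3, m=2, w=5, k=2, rng=rng)
print(forward(rng.uniform(-1, 1, size=3), layers))
```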
Hardness Results

Theorem (Blum and Rivest 1992)
Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ ∈ {absolute value, 2-norm squared} and σ a threshold function. Then training is NP-hard even in this simple network: …

Theorem (Boob, Dey and Lan 2018)
Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ a norm and σ(t) = max{0, t} a ReLU activation. Then training is NP-hard in the same network.
Exact Training Complexity

Theorem (Arora, Basu, Mianjy and Mukherjee 2018)
If k = 1 (one “hidden layer”), m = 1 and ℓ is convex, there is an exact training algorithm of complexity

  O(2^w D^{nw} poly(D, n, w)).

Polynomial in the size of the data set, for fixed n, w.

Also in that paper: “we are not aware of any complexity results which would rule out the possibility of an algorithm which trains to global optimality in time that is polynomial in the data size”

“Perhaps an even better breakthrough would be to get optimal training algorithms for DNNs with two or more hidden layers and this seems like a substantially harder nut to crack”
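For intuition, instantiating the bound at hypothetical fixed values n = 1 and w = 4 (chosen only for illustration):

```latex
\[
  O\!\left(2^{w}\, D^{nw}\, \mathrm{poly}(D, n, w)\right)\Big|_{\,n = 1,\; w = 4}
  \;=\; O\!\left(2^{4}\, D^{4}\, \mathrm{poly}(D)\right)
  \;=\; O\!\left(D^{4}\, \mathrm{poly}(D)\right),
\]
```

which is polynomial in D, but the bound grows exponentially once the input dimension n and width w are allowed to grow.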
What we’ll prove

There exists a polytope:
• whose size depends linearly on D
• that encodes approximately all possible training problems coming from (x̂_i, ŷ_i)_{i=1}^{D} ⊆ [−1, 1]^{(n+m)D}.

Spoiler: Theory-only results
Our Hammer
Treewidth

Treewidth is a parameter that measures how tree-like a graph is.

Definition
Given a chordal graph G, we say its treewidth is ω if its clique number is ω + 1.

• Trees have treewidth 1
• Cycles have treewidth 2
• K_n has treewidth n − 1
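These small cases can be checked with the min-degree treewidth heuristic in NetworkX (a sketch, not from the talk); the heuristic only returns an upper bound, which happens to be tight on these families.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# The min-degree heuristic gives an upper bound on treewidth; on these
# simple families the bound matches the exact values quoted on the slide.
examples = {
    "path (a tree) on 10 nodes": nx.path_graph(10),     # treewidth 1
    "cycle on 10 nodes":         nx.cycle_graph(10),    # treewidth 2
    "complete graph K_6":        nx.complete_graph(6),  # treewidth 5 = n - 1
}
for name, G in examples.items():
    width, _decomposition = treewidth_min_degree(G)
    print(f"{name}: treewidth upper bound = {width}")
```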
Approximate optimization of well-behaved functions

Toolset:

Prototype problem:
  min  c^T x
  s.t. f_i(x) ≤ 0,  i = 1, …, m
       x ∈ [0, 1]^n

• Each f_i is “well-behaved”: Lipschitz constant L_i over [0, 1]^n
• Intersection graph: an edge whenever two variables appear in the same f_i

For example: the intersection graph is: (figure: intersection graph on variables x_1, …, x_6)
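A minimal sketch (with hypothetical constraint supports, since the slide’s example is a figure) of building the intersection graph: one node per variable and an edge whenever two variables appear in the same f_i.

```python
import itertools
import networkx as nx

def intersection_graph(n_vars, constraint_supports):
    """constraint_supports[i] is the set of variable indices appearing in f_i."""
    G = nx.Graph()
    G.add_nodes_from(range(n_vars))
    for support in constraint_supports:
        # every pair of variables sharing a constraint gets an edge
        G.add_edges_from(itertools.combinations(sorted(support), 2))
    return G

# Hypothetical constraint supports on six variables (0-indexed);
# the actual example on the slide is shown only as a figure.
constraint_supports = [{0, 1}, {1, 2}, {2, 3, 4}, {4, 5}]
G = intersection_graph(6, constraint_supports)
print(sorted(G.edges()))
```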