Generalized Derivatives: Automatic Evaluation & Implications for Algorithms
Paul I. Barton, Kamil A. Khan & Harry A. J. Watson
Process Systems Engineering Laboratory, Massachusetts Institute of Technology

Nonsmooth Equation Solving
◆ Semismooth Newton method:
$$G(x^k)(x - x^k) = -f(x^k)$$
◆ Linear programming (LP) Newton method:
$$\min_{\gamma,\,x}\ \gamma \quad \text{s.t.} \quad \|f(x^k) + G(x^k)(x - x^k)\|_\infty \le \gamma\,\|f(x^k)\|_\infty^2, \quad \|x - x^k\|_\infty \le \gamma\,\|f(x^k)\|_\infty, \quad x \in X$$
Ø $X$: polyhedral set
◆ $G(x^k)$: some element of a generalized derivative
Kojima & Shindo (1986), Qi & Sun (1993), Facchinei, Fischer & Herrich (2014).

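As a minimal sketch of the semismooth Newton iteration on this slide, the Python snippet below solves one scalar nonsmooth equation. The test function $f(x) = x + |x| + e^x - 2$, the starting point, and the names `f`, `G`, and `semismooth_newton` are illustrative assumptions, not taken from the slides. `G` returns one element of the B-subdifferential, picking slope +1 for $|x|$ at the kink.

```python
import math


def f(x):
    # Hypothetical nonsmooth test equation f(x) = 0 (not from the slides).
    return x + abs(x) + math.exp(x) - 2.0


def G(x):
    # One element of the B-subdifferential of f at x.
    # At the kink x = 0 we arbitrarily pick the slope +1 for |x|.
    s = 1.0 if x >= 0.0 else -1.0
    return 1.0 + s + math.exp(x)


def semismooth_newton(x, tol=1e-12, max_iter=50):
    # Iterate G(x_k) (x - x_k) = -f(x_k) until the residual is small.
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) <= tol:
            break
        x = x - fx / G(x)
    return x


root = semismooth_newton(-1.0)
print(root, f(root))
```

The first step jumps across the kink from $x = -1$ to a positive iterate; after that the iteration behaves like Newton's method on the smooth piece $2x + e^x - 2$.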
Generalized Derivatives
◆ Suppose $f$ is locally Lipschitz => differentiable on a set $S$
◆ B-subdifferential:
$$\partial_B f(x) := \{ H : H = \lim_{i \to \infty} Jf(x^{(i)}),\ x = \lim_{i \to \infty} x^{(i)},\ x^{(i)} \in S \}$$
◆ Clarke Jacobian: $\partial f(x) := \operatorname{conv} \partial_B f(x)$
◆ Example, $f(x) = |x|$:
Ø $\partial f(x) = \{1\}$ for $x > 0$, $\partial f(x) = \{-1\}$ for $x < 0$
Ø $\partial_B f(0) = \{-1, 1\}$, $\partial f(0) = [-1, 1]$
◆ Useful properties of $\partial f(x)$:
Ø Nonempty, convex, and compact
Ø Satisfies mean-value theorem, implicit/inverse function theorems
Ø Reduces to subdifferential/derivative when $f$ is convex/strictly differentiable
Clarke (1973).

Convergence Properties
◆ Suppose generalized derivative contains no singular matrices at the solution
◆ Semismooth Newton method:
Ø local Q-superlinear convergence if $G(x^k) \in \partial f(x^k)$
Ø local Q-quadratic convergence if $f$ is strongly semismooth
◆ Semismooth Newton & LP-Newton methods for $PC^1$ or strongly semismooth functions:
Ø local Q-quadratic convergence if $G(x^k) \in \partial_B f(x^k)$
◆ Automatic/Algorithmic Differentiation (AD)
Ø Automatic methods for computing derivatives in complex settings
Ø Automatic method for computing elements of generalized derivatives?
Ø Computationally relevant generalized derivatives

All generalized derivatives are equal… But, some are more equal than others.

Obstacles to Automatic Gen. Derivative Evaluation 1
◆ Automatically evaluating Clarke Jacobian elements is difficult
◆ Lack of sharp calculus rules:
$$g(x) = \max\{0, x\}, \qquad h(x) = \min\{0, x\}, \qquad f(x) = g(x) + h(x) = x$$
Ø $0 \in \partial g(0) = [0, 1]$ and $0 \in \partial h(0) = [0, 1]$, but $(0 + 0) \notin \partial f(0) = \{1\}$
Ø $\partial f(0) \subset \partial g(0) + \partial h(0)$, and the inclusion is strict

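The failure on this slide can be reproduced in a few lines. The sketch below assumes a naive AD tool that, at a kink, returns some valid Clarke subgradient of each elemental function and then adds them; the helper names `dg` and `dh` are hypothetical.

```python
def dg(x):
    # One valid element of the Clarke subdifferential of g(x) = max(0, x).
    # At x = 0 this picks 0 from [0, 1].
    return 1.0 if x > 0.0 else 0.0


def dh(x):
    # One valid element of the Clarke subdifferential of h(x) = min(0, x).
    # At x = 0 this picks 0 from [0, 1].
    return 1.0 if x < 0.0 else 0.0


# Naive "sum rule": add the subgradients chosen for g and h at x = 0.
naive = dg(0.0) + dh(0.0)

# f(x) = g(x) + h(x) = x, so the only element of the Clarke Jacobian is 1.
true_derivative = 1.0

print(naive, true_derivative)
```

Each elemental choice is individually correct, yet their sum is not a generalized derivative of `f`, which is exactly why sharp calculus rules are needed.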
Directional Derivatives & $PC^1$ Functions
◆ Directional derivative:
$$f'(x; d) = \lim_{t \to 0^+} \frac{f(x + t d) - f(x)}{t}$$
◆ Sharp chain rule for locally Lipschitz functions:
$$[f \circ g]'(x; d) = f'(g(x); g'(x; d))$$
◆ AD gives the directional derivative
◆ $PC^1$ functions: there is a finite collection $F_f(x)$ of $C^1$ functions for which
$$f(y) \in \{ \phi(y) : \phi \in F_f(x) \}, \quad \forall y \in N(x)$$
◆ 2-norm is not $PC^1$
Griewank (1994), Scholtes (2012).

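A small forward-mode sketch shows how the sharp chain rule lets AD propagate directional derivatives through `max` and `min`. The class `Dir` and the helpers `dmax` and `dmin` are hypothetical names introduced for illustration. The test function is $f(x) = \max\{0, x\} + \min\{0, x\}$, which equals $x$.

```python
class Dir:
    """Pair (value, directional derivative) propagated in forward mode."""

    def __init__(self, val, dot):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        return Dir(self.val + other.val, self.dot + other.dot)


def dmax(a, b):
    # Directional derivative of max: follow the active argument;
    # on a tie, take the max of the two directional derivatives.
    if a.val > b.val:
        return Dir(a.val, a.dot)
    if a.val < b.val:
        return Dir(b.val, b.dot)
    return Dir(a.val, max(a.dot, b.dot))


def dmin(a, b):
    # Same idea for min: on a tie, take the min of the derivatives.
    if a.val < b.val:
        return Dir(a.val, a.dot)
    if a.val > b.val:
        return Dir(b.val, b.dot)
    return Dir(a.val, min(a.dot, b.dot))


def f(x):
    zero = Dir(0.0, 0.0)
    return dmax(zero, x) + dmin(zero, x)


up = f(Dir(0.0, 1.0)).dot     # f'(0; +1)
down = f(Dir(0.0, -1.0)).dot  # f'(0; -1)
print(up, down)
```

Both directions give the correct values for $f(x) = x$, namely $f'(0; 1) = 1$ and $f'(0; -1) = -1$, even though every intermediate quantity sits on a kink.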
Obstacles 2
◆ $PC^1$ functions have piecewise linear directional derivatives
[Figure: the $(d_1, d_2)$ direction plane split into three regions, on which $f'(x; d) = B^{(1)} d$, $f'(x; d) = B^{(2)} d$, and $f'(x; d) = B^{(3)} d$, respectively]
◆ Directional derivatives in the coordinate directions do not necessarily give B-subdifferential elements
◆ Also defeats finite differences

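A concrete instance of this obstacle, using an example chosen here rather than taken from the slides: for $f(x_1, x_2) = \max\{x_1, x_2\}$ at the origin, the B-subdifferential is $\{[1\ 0], [0\ 1]\}$, but stacking the coordinate directional derivatives (here approximated by forward differences) gives $[1\ 1]$. The helper name `forward_fd` is hypothetical.

```python
def f(x):
    return max(x[0], x[1])


def forward_fd(f, x, d, t=1e-6):
    # Forward finite-difference approximation of f'(x; d).
    shifted = [xi + t * di for xi, di in zip(x, d)]
    return (f(shifted) - f(x)) / t


x0 = [0.0, 0.0]
coordinate_dirs = ([1.0, 0.0], [0.0, 1.0])

# Assemble a "Jacobian" row from coordinate directional derivatives.
row = [forward_fd(f, x0, e) for e in coordinate_dirs]

# B-subdifferential of f at the origin: gradients of the two linear pieces.
b_subdifferential = [[1.0, 0.0], [0.0, 1.0]]
is_b_element = any(
    all(abs(r - b) < 1e-9 for r, b in zip(row, B)) for B in b_subdifferential
)
print(row, is_b_element)
```

The assembled row is not in the B-subdifferential, and it is not in its convex hull (the Clarke Jacobian) either, since its entries sum to 2 rather than 1.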
Obstacles 3
◆ $\partial f(x)$ may be a strict subset of $\prod_{i=1}^{m} \partial f_i(x)$
$$f : (x_1, x_2) \mapsto \begin{bmatrix} x_1 + |x_2| \\ x_1 - |x_2| \end{bmatrix}$$
$$\partial f(0) = \left\{ \begin{bmatrix} 1 & 2s - 1 \\ 1 & 1 - 2s \end{bmatrix} : s \in [0, 1] \right\}$$
$$\partial f_1(0) \times \partial f_2(0) = \left\{ \begin{bmatrix} 1 & 2s_1 - 1 \\ 1 & 2s_2 - 1 \end{bmatrix} : (s_1, s_2) \in [0, 1]^2 \right\}$$
[Figure: $\pi_2\,\partial f(0)$ compared with $\pi_2(\partial f_1(0) \times \partial f_2(0))$]

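The strict inclusion can be checked numerically for the function on this slide. In the sketch below, `jac`, `in_clarke`, and `in_product` are hypothetical helpers; the membership tests hard-code the two set descriptions given above, and the candidate matrix is an illustrative choice.

```python
def jac(x2):
    """Jacobian of f(x1, x2) = (x1 + |x2|, x1 - |x2|) at a point with x2 != 0."""
    s = 1.0 if x2 > 0.0 else -1.0
    return ((1.0, s), (1.0, -s))


def in_clarke(A, tol=1e-12):
    # Clarke Jacobian at 0: second column is (a, -a) with -1 <= a <= 1.
    (p, a), (q, b) = A
    return (abs(p - 1.0) <= tol and abs(q - 1.0) <= tol
            and abs(a + b) <= tol and -1.0 - tol <= a <= 1.0 + tol)


def in_product(A, tol=1e-12):
    # Product of componentwise Clarke gradients: the two second-column
    # entries vary independently in [-1, 1].
    (p, a), (q, b) = A
    return (abs(p - 1.0) <= tol and abs(q - 1.0) <= tol
            and -1.0 - tol <= a <= 1.0 + tol
            and -1.0 - tol <= b <= 1.0 + tol)


# The Jacobian depends only on the sign of x2, so these are the two
# B-subdifferential elements at the origin.
b_elements = {jac(-1e-3), jac(1e-3)}

candidate = ((1.0, 1.0), (1.0, 1.0))  # s1 = s2 = 1 in the product set
print(in_product(candidate), in_clarke(candidate))
```

The candidate lies in the product set but not in $\partial f(0)$, so a row-by-row evaluation of Clarke gradients can return a matrix that is not a Clarke Jacobian element.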
L-smooth Functions
$f : X \subset \mathbb{R}^n \to \mathbb{R}^m$
◆ The following functions are L-smooth:
Ø Continuously differentiable functions
Ø Convex functions (e.g. abs, 2-norm)
Ø $PC^1$ functions
Ø Compositions of L-smooth functions: $x \mapsto h(g(x))$
Ø Integrals of L-smooth functions: $x \mapsto \int_a^b g(t, x)\,dt$
Ø Solutions of ODEs with L-smooth right-hand sides: $c \mapsto x(b, c)$, where
$$\frac{dx}{dt}(t, c) = g(t, x(t, c)), \qquad x(0, c) = c$$
Nesterov (1987), Khan and Barton (2014), Khan and Barton (2015).

Lexicographic Derivatives
◆ L-subdifferential:
$$\partial_L f(x) = \{ J_L f(x; M) : \det M \ne 0 \}$$
Ø Contains L-derivatives $J_L f(x; M)$ in directions $M$, $\det M \ne 0$
◆ Useful properties:
Ø L-derivatives are the classical derivative wherever $f$ is strictly differentiable
Ø L-derivatives are elements of the Clarke gradient
Ø Contains only subgradients when $f$ is convex
Ø Contained in plenary hull of Clarke Jacobian, and can be used in place of Clarke Jacobian in numerical methods:
$$\{ A d : A \in \partial_L f(x) \} \subset \{ A d : A \in \partial f(x) \} \quad \text{for each } d \in \mathbb{R}^n$$
Ø For $PC^1$ functions, L-derivatives are elements of the B-subdifferential
Ø Satisfies sharp chain rule, expressed naturally using LD-derivatives
Nesterov (1987), Khan and Barton (2014), Khan and Barton (2015).

Lexicographic Directional (LD)-Derivatives
◆ Extension of classical directional derivative
◆ LD-derivative: for any $M := [m^{(1)} \cdots m^{(p)}] \in \mathbb{R}^{n \times p}$,
$$f'(x; M) = [\, f^{(0)}_{x,M}(m^{(1)}) \ \cdots \ f^{(p-1)}_{x,M}(m^{(p)}) \,]$$
◆ If $M$ is square and nonsingular: $f'(x; M) = J_L f(x; M)\,M$
◆ If $f$ is differentiable at $x$: $f'(x; M) = Jf(x)\,M$
◆ Sharp LD-derivative chain rule:
$$[f \circ g]'(x; M) = f'(g(x); g'(x; M))$$
Khan and Barton (2015).

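The slide does not spell out the functions $f^{(j)}_{x,M}$. In the lexicographic differentiation literature cited here they are built recursively from ordinary directional derivatives, as recalled below. The worked example for $f(x) = |x|$ at $x = 0$ with $M = [-2 \ \ 3]$ is an illustration added for this note, not part of the slides.

```latex
\begin{align*}
f^{(0)}_{x,M}(d) &:= f'(x; d), \\
f^{(j)}_{x,M}(d) &:= \bigl[ f^{(j-1)}_{x,M} \bigr]'(m^{(j)}; d),
  \qquad j = 1, \ldots, p-1. \\[6pt]
\intertext{Example: $f(x) = |x|$, $x = 0$, $m^{(1)} = -2$, $m^{(2)} = 3$.}
f^{(0)}_{0,M}(d) &= |d|
  &&\Rightarrow\quad f^{(0)}_{0,M}(m^{(1)}) = 2, \\
f^{(1)}_{0,M}(d) &= \bigl[\,|\cdot|\,\bigr]'(-2; d) = -d
  &&\Rightarrow\quad f^{(1)}_{0,M}(m^{(2)}) = -3, \\
f'(0; M) &= [\, 2 \ \ {-3} \,].
\end{align*}
```

The first direction resolves the kink (it points into the branch where $|x| = -x$), so every later column is differentiated along that branch.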
Vector Forward AD Mode for LD-derivatives
◆ Sharp chain rule immediately implies that, given the "seed directions" $M$, forward-mode AD can compute $f'(x; M)$
◆ Need calculus rules for "elementary functions":
Ø abs, min, max, mid, $\|\cdot\|_2$, etc.
Ø algorithm for "elemental $PC^1$ functions"
Ø linear programs and lexicographic linear programs parameterized by their RHSs
Ø implicit function $h(w(z), z) = 0$: $w'(\hat{z}; M)$ is the unique solution $N$ of
$$h'((\hat{y}, \hat{z}); (N, M)) = 0$$
Khan and Barton (2015), Khan and Barton (2013), Hoeffner et al. (2015).

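A minimal sketch of the vector forward mode follows. The class `LD` and the helpers `first_sign`, `ld_abs`, `ld_max`, and `seed` are hypothetical names. The rules for `abs` and `max` are written as I understand them from the cited LD-derivative literature (sign of the first nonzero entry, and lexicographic comparison); treat them as an assumption to check against Khan and Barton (2015). The test function $\max\{x_1, x_2\}$ is an illustrative choice.

```python
class LD:
    """A value plus one row of the LD-derivative (one entry per seed direction)."""

    def __init__(self, val, row):
        self.val = val
        self.row = list(row)

    def __add__(self, other):
        return LD(self.val + other.val,
                  [a + b for a, b in zip(self.row, other.row)])

    def __sub__(self, other):
        return LD(self.val - other.val,
                  [a - b for a, b in zip(self.row, other.row)])


def first_sign(u):
    # Sign of the first nonzero entry of (val, row); 0 if all entries vanish.
    for v in [u.val] + u.row:
        if v != 0:
            return 1 if v > 0 else -1
    return 0


def ld_abs(u):
    s = first_sign(u)
    return LD(abs(u.val), [s * r for r in u.row])


def ld_max(u, v):
    # Lexicographic comparison of (u.val, u.row) with (v.val, v.row).
    return u if first_sign(u - v) >= 0 else v


def seed(x, M):
    # Independent variable i carries row i of the seed matrix M.
    return [LD(xi, Mi) for xi, Mi in zip(x, M)]


def f(x1, x2):
    return ld_max(x1, x2)


identity = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

row_identity = f(*seed([0.0, 0.0], identity)).row  # f'(0; I)
row_swapped = f(*seed([0.0, 0.0], swapped)).row    # f'(0; P)
abs_row = ld_abs(LD(0.0, [-2.0, 3.0])).row
print(row_identity, row_swapped, abs_row)
```

With the identity seed the L-derivative is $[1\ 0]$. With the swapped seed $P$, $f'(0; P) = [1\ 0]$ and $J_L f(0; P) = f'(0; P)\,P^{-1} = [0\ 1]$. Both are B-subdifferential elements, and the order of the directions decides which one is returned.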
Semismooth Inexact Newton Method
◆ Inexact Newton method: uses the products $J(x)\,d_i$, $i = 1, 2, \ldots$
◆ Solve iteratively: $J_L f(x; M)\,\Delta x = -f(x)$
◆ But, the directional derivative is not a linear function of the directions…
◆ Let $M = [d_1, d_2, \ldots]$, $M$ nonsingular. Then: $f'(x; M) = J_L f(x; M)\,M$
◆ But, $M$ is not known in advance
◆ Compute columns of $f'(x; M)$ one at a time
Ø computation of a column affects subsequent columns
Ø automatic code can be "locked" to record influence of earlier columns
◆ Local Q-superlinear & Q-quadratic convergence rates can be achieved

Approximation of LD-derivatives using FDs
◆ LD-derivative: for $M := [m^{(1)} \cdots m^{(p)}] \in \mathbb{R}^{n \times p}$,
$$f'(x; M) = [\, f^{(0)}_{x,M}(m^{(1)}) \ \cdots \ f^{(p-1)}_{x,M}(m^{(p)}) \,]$$
◆ FD approx. of $f'(x; M)$ using $p + 1$ function evaluations:
$$f^{(0)}_{x,M}(m^{(1)}) \approx \alpha^{-1} [ f(x + \alpha m^{(1)}) - f(x) ] =: D^{\alpha}_{m^{(1)}}[f](x)$$
$$f^{(1)}_{x,M}(m^{(2)}) \approx D^{\alpha}_{m^{(2)}}[f^{(0)}_{x,M}](m^{(1)}) = D^{\alpha}_{m^{(2)}} D^{\alpha}_{m^{(1)}}[f](x)$$
$$\vdots$$
$$f^{(p-1)}_{x,M}(m^{(p)}) \approx D^{\alpha}_{m^{(p)}}[f^{(p-2)}_{x,M}](m^{(p-1)}) = D^{\alpha}_{m^{(p)}} \cdots D^{\alpha}_{m^{(2)}} D^{\alpha}_{m^{(1)}}[f](x)$$
◆ Case $p = 2$, $f'(x; M) = [\, f^{(0)}_{x,M}(m^{(1)}) \ \ f^{(1)}_{x,M}(m^{(2)}) \,]$:
$$f^{(0)}_{x,M}(m^{(1)}) \approx \alpha^{-1} [ f(x + \alpha m^{(1)}) - f(x) ]$$
$$f^{(1)}_{x,M}(m^{(2)}) \approx \alpha^{-2} [ f(x + \alpha m^{(1)} + \alpha^2 m^{(2)}) - f(x + \alpha m^{(1)}) ]$$
[Figure: the evaluation points $x$, $x + \alpha m^{(1)}$, and $x + \alpha m^{(1)} + \alpha^2 m^{(2)}$]

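Unrolling the nested difference operators gives column $j \approx \alpha^{-j} [ f(x + \sum_{i \le j} \alpha^i m^{(i)}) - f(x + \sum_{i < j} \alpha^i m^{(i)}) ]$, which matches the $p = 2$ case on the slide. The sketch below implements that unrolled form for a scalar-valued function. The name `ld_fd`, the step size, and the test function $\max\{x_1, x_2\}$ are assumptions made for illustration.

```python
def ld_fd(f, x, directions, alpha=1e-3):
    """Finite-difference approximation of the LD-derivative f'(x; M).

    `directions` lists the columns m^(1), ..., m^(p) of M.
    Uses p + 1 evaluations of the scalar-valued function f.
    """
    point = list(x)
    f_prev = f(point)
    columns = []
    for j, m in enumerate(directions, start=1):
        # Step alpha**j along m^(j), on top of all earlier steps.
        point = [pi + alpha**j * mi for pi, mi in zip(point, m)]
        f_curr = f(point)
        columns.append((f_curr - f_prev) / alpha**j)
        f_prev = f_curr
    return columns


def f(x):
    return max(x[0], x[1])


approx = ld_fd(f, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(approx)
```

For this example the result is approximately $[1\ 0]$, a B-subdifferential element, whereas plain coordinate-wise forward differences at the origin would give $[1\ 1]$.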
Sparse Accumulation for L-derivatives
◆ Cost of AD can be reduced when the Jacobian is sparse
Ø Find structurally orthogonal columns
Ø Perform vector forward pass with seed matrix $M \in \mathbb{R}^{n \times p}$ rather than $I \in \mathbb{R}^{n \times n}$
$$\begin{bmatrix} a & b & 0 & 0 \\ c & 0 & d & 0 \\ 0 & e & 0 & f \\ 0 & 0 & g & h \end{bmatrix}, \qquad M = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}$$
◆ AD for LD-derivatives → order of the directions matters
Ø Corresponding to $M$ is an uncompressed (permutation) matrix $Q$:
$$Q = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$
» $M = QD$ for some matrix $D$
Ø Procedure:
» Identify matrices $Q$, $D$, and $M$
» Perform vector forward pass to calculate $f'(x; M)$
» Copy entries of $f'(x; M)$ into entries of sparse data structure for $f'(x; Q)$
» Calculate $J_L f(x; Q) = f'(x; Q)\,Q^{-1}$ (i.e. by sparse permutation)
◆ Done based on assumption that $f'(x; M) = f'(x; Q)\,D$
Ø $f'(x; M) = f'(x; Q)\,D$ is not true in general

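For the smooth case only, where Jacobian-vector products are linear in the directions, the compression on this slide can be checked directly. In the sketch below the numeric values standing in for $a, \ldots, h$ are hypothetical, `matmul` is a small helper written for this note, and `D` is one matrix satisfying $M = QD$ that I chose to match the $Q$ and $M$ shown.

```python
def matmul(A, B):
    # Plain dense matrix product for small list-of-lists matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


# Hypothetical values for the nonzeros a, ..., h of the sparse Jacobian.
a, b, c, d, e, f_, g, h = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
J = [[a, b, 0, 0],
     [c, 0, d, 0],
     [0, e, 0, f_],
     [0, 0, g, h]]

# Seed matrix: columns 1 and 4, and columns 2 and 3, are structurally orthogonal.
M = [[1, 0], [0, 1], [0, 1], [1, 0]]

# Permutation matrix Q from the slide and an assumed D with Q D = M.
Q = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
D = [[1, 0], [1, 0], [0, 1], [0, 1]]

compressed = matmul(J, M)  # two directions instead of four
print(compressed)
print(matmul(Q, D) == M)
```

Every nonzero of `J` appears exactly once in `compressed`, so the smooth Jacobian is recovered from two forward directions. The slide's point is that this linear bookkeeping does not carry over to LD-derivatives in general.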
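Generalized Derivatives of Algorithms: MHEX model
[Figure: MHEX block. Hot streams $i = 1, \ldots, |H|$ enter at $(F_i, T_i^{\mathrm{in}})$ and leave at $(F_i, T_i^{\mathrm{out}})$; cold streams $j = 1, \ldots, |C|$ enter at $(f_j, t_j^{\mathrm{in}})$ and leave at $(f_j, t_j^{\mathrm{out}})$]
$$\sum_{i \in H} F_i \left( T_i^{\mathrm{in}} - T_i^{\mathrm{out}} \right) = \sum_{j \in C} f_j \left( t_j^{\mathrm{out}} - t_j^{\mathrm{in}} \right)$$
$$\min_{p \in P} \left( EBP_C^p - EBP_H^p \right) = 0$$
$$UA - \sum_{k \in K,\ k \ne |K|} \frac{\Delta Q^k}{\Delta T_{LM}^k} = 0$$
Watson et al. (2015).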