Basics of Numerical Optimization: Preliminaries
Ju Sun
Computer Science & Engineering, University of Minnesota, Twin Cities
February 11, 2020
Supervised learning as function approximation
– Underlying true function: f_0
– Training data: {x_i, y_i} with y_i ≈ f_0(x_i)
– Choose a family of functions H, so that ∃ f ∈ H with f and f_0 close
– Find f, i.e., optimization:
    min_{f ∈ H}  ∑_i ℓ(y_i, f(x_i)) + Ω(f)
– Approximation capacity: universal approximation theorems (UAT) ⟹ replace H by DNN_W, i.e., a deep neural network with weights W
– Optimization (a minimal code sketch follows below):
    min_W  ∑_i ℓ(y_i, DNN_W(x_i)) + Ω(W)
– Generalization: how to avoid over-complicated DNN_W in view of UAT
Now we start to focus on optimization.
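A minimal sketch of this training problem, min_W ∑_i ℓ(y_i, DNN_W(x_i)) + Ω(W), assuming a toy 1-D regression task in PyTorch with squared loss as ℓ and weight decay playing the role of Ω(W); the choice of f_0, the two-layer network, and all hyperparameters are illustrative, not from the slides.

```python
# Minimize the regularized empirical risk over the weights W of a small network.
import torch

torch.manual_seed(0)
f0 = lambda x: torch.sin(3 * x)                 # underlying true function f0
x = torch.rand(200, 1) * 2 - 1                  # training inputs x_i in [-1, 1]
y = f0(x) + 0.05 * torch.randn_like(x)          # noisy labels y_i ≈ f0(x_i)

dnn = torch.nn.Sequential(                      # DNN_W: a small hypothesis class
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
loss = torch.nn.MSELoss()                       # ℓ(y, f(x)) = (y - f(x))^2
opt = torch.optim.SGD(dnn.parameters(), lr=0.1,
                      weight_decay=1e-4)        # weight decay acts as Ω(W)

for _ in range(2000):                           # gradient-based minimization of the objective
    opt.zero_grad()
    loss(dnn(x), y).backward()
    opt.step()

print("final training loss:", loss(dnn(x), y).item())
```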
Outline
– Elements of multivariate calculus
– Optimality conditions of unconstrained optimization
Recommended references: [Munkres, 1997, Zorich, 2015, Coleman, 2012]
Our notation
– scalars: x, vectors: x, matrices: X, tensors: X, sets: S
– vectors are always column vectors, unless stated otherwise
– x_i: i-th element of x; x_{ij}: (i, j)-th element of X; x_i: i-th row of X as a row vector; x_j: j-th column of X as a column vector (see the NumPy illustration below)
– R: real numbers, R_+: positive reals, R^n: space of n-dimensional vectors, R^{m×n}: space of m×n matrices, R^{m×n×k}: space of m×n×k tensors, etc.
– [n] := {1, ..., n}
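A small NumPy illustration of the indexing conventions above; note NumPy is 0-indexed, while the slides index entries starting from 1, and the array shapes here are arbitrary examples.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # a (column) vector x in R^3
X = np.arange(12.0).reshape(3, 4)      # a matrix X in R^{3x4}
T = np.arange(24.0).reshape(2, 3, 4)   # a tensor in R^{2x3x4}

print(x[0])       # x_1: first entry of x
print(X[1, 2])    # x_{23}: (2, 3)-th entry of X
print(X[1, :])    # second row of X (returned as a 1-D array)
print(X[:, 2])    # third column of X (returned as a 1-D array)
print(T.shape)    # (2, 3, 4)
```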
Differentiability — first order
Consider f(x): R^n → R^m.
– Definition: f is first-order differentiable at a point x if there exists a matrix B ∈ R^{m×n} such that
    (f(x + δ) − f(x) − Bδ) / ‖δ‖_2 → 0 as δ → 0,
  i.e., f(x + δ) = f(x) + Bδ + o(‖δ‖_2) as δ → 0 (a numerical check of this expansion is sketched below).
– B is called the (Fréchet) derivative. When m = 1, the column vector B^⊺ is called the gradient, denoted ∇f(x). For general m, B is also called the Jacobian matrix, denoted J_f(x).
– Calculation: b_{ij} = ∂f_i/∂x_j (x)
– Sufficient condition: if all partial derivatives exist and are continuous at x, then f(x) is differentiable at x.
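A small numerical check of the definition, assuming the concrete map f(x) = (x_1 x_2, sin x_1) from R^2 to R^2 (chosen only for illustration): with B the analytic Jacobian, the remainder f(x + δ) − f(x) − Bδ should vanish faster than ‖δ‖_2.

```python
import numpy as np

def f(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian(x):                 # analytic Jacobian B = J_f(x), entries b_ij = ∂f_i/∂x_j
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([0.7, -1.3])
B = jacobian(x)
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    d = t * np.array([1.0, 2.0])                     # shrink the perturbation δ
    rem = np.linalg.norm(f(x + d) - f(x) - B @ d)    # first-order remainder
    print(f"||d|| = {np.linalg.norm(d):.1e},  remainder/||d|| = {rem / np.linalg.norm(d):.2e}")
```

The printed ratio decreases roughly linearly in ‖δ‖_2, consistent with the o(‖δ‖_2) remainder.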
Calculus rules
Assume f, g: R^n → R^m are differentiable at a point x ∈ R^n.
– Linearity: λ_1 f + λ_2 g is differentiable at x and ∇[λ_1 f + λ_2 g](x) = λ_1 ∇f(x) + λ_2 ∇g(x)
– Product: assume m = 1; fg is differentiable at x and ∇[fg](x) = f(x) ∇g(x) + g(x) ∇f(x)
– Quotient: assume m = 1 and g(x) ≠ 0; f/g is differentiable at x and
    ∇[f/g](x) = (g(x) ∇f(x) − f(x) ∇g(x)) / g²(x)
– Chain rule: let f: R^n → R^m and h: R^m → R^k, with f differentiable at x, y = f(x), and h differentiable at y. Then h ∘ f: R^n → R^k is differentiable at x, and
    J[h ∘ f](x) = J_h(f(x)) J_f(x).
  When k = 1, ∇[h ∘ f](x) = J_f^⊺(x) ∇h(f(x)) (numerical check below).
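A quick numerical check of the k = 1 chain rule, ∇[h ∘ f](x) = J_f^⊺(x) ∇h(f(x)), using the illustrative (not from the slides) choices f(x) = (x_1², x_1 x_2, sin x_2) and h(y) = y_1 + y_2 y_3.

```python
import numpy as np

def f(x):
    return np.array([x[0]**2, x[0]*x[1], np.sin(x[1])])

def Jf(x):                                   # Jacobian of f, shape (3, 2)
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    np.cos(x[1])]])

def h(y):
    return y[0] + y[1]*y[2]

def grad_h(y):                               # gradient of h, shape (3,)
    return np.array([1.0, y[2], y[1]])

x = np.array([0.4, 1.1])
chain = Jf(x).T @ grad_h(f(x))               # gradient of h ∘ f via the chain rule

eps = 1e-6                                   # central finite differences of h ∘ f
numeric = np.array([(h(f(x + eps*e)) - h(f(x - eps*e))) / (2*eps) for e in np.eye(2)])
print(chain, numeric)                        # the two vectors should agree to ~1e-9
```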
Differentiability — second order
Consider f(x): R^n → R and assume f is 1st-order differentiable in a small ball around x.
– Write ∂²f/∂x_j∂x_i (x) := ∂/∂x_j ( ∂f/∂x_i ) (x), provided the right-hand side is well defined.
– Symmetry: if both ∂²f/∂x_j∂x_i (x) and ∂²f/∂x_i∂x_j (x) exist and are continuous at x, then they are equal.
– Hessian (matrix):
    ∇²f(x) := [ ∂²f/∂x_j∂x_i (x) ]_{j,i} ∈ R^{n×n},    (1)
  whose (j, i)-th element is ∂²f/∂x_j∂x_i (x) (a finite-difference symmetry check is sketched below).
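A minimal sketch that builds the Hessian of an illustrative function f(x) = x_1² x_2 + exp(x_1 x_3) by central differences of its (analytic) gradient, and checks that the result is numerically symmetric, as the slide states; the function and evaluation point are arbitrary examples.

```python
import numpy as np

def grad_f(x):                       # analytic gradient of f(x) = x1^2 * x2 + exp(x1 * x3)
    return np.array([2*x[0]*x[1] + x[2]*np.exp(x[0]*x[2]),
                     x[0]**2,
                     x[0]*np.exp(x[0]*x[2])])

def hessian_fd(x, eps=1e-6):         # H[j, i] ≈ ∂/∂x_j (∂f/∂x_i)(x), by central differences
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        H[j, :] = (grad_f(x + e) - grad_f(x - e)) / (2*eps)
    return H

H = hessian_fd(np.array([0.5, -1.0, 0.3]))
print(np.round(H, 4))
print("symmetry gap:", np.abs(H - H.T).max())   # should be ~1e-8 or smaller
```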