Optimization Problems for Neural Networks
Chih-Jen Lin, National Taiwan University
Last updated: May 25, 2020
Outline
1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion
Regularized linear classification
Minimizing Training Errors
Basically a classification method starts with minimizing the training errors:
$$\min_{\text{model}} \ (\text{training errors})$$
That is, all or most training data with labels should be correctly classified by our model.
A model can be a decision tree, a neural network, or other types.
Minimizing Training Errors (Cont'd)
For simplicity, let's consider the model to be a vector $\mathbf{w}$. That is, the decision function is $\operatorname{sgn}(\mathbf{w}^T\mathbf{x})$.
For any data $\mathbf{x}$, the predicted label is
$$\begin{cases} 1 & \text{if } \mathbf{w}^T\mathbf{x} \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
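As a tiny illustration of this decision rule, here is a minimal NumPy sketch; the values of $\mathbf{w}$ and $\mathbf{x}$ below are made up for demonstration and are not from the slides.

```python
import numpy as np

# Made-up model vector w and feature vector x for illustration.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.2])

# Decision rule sgn(w^T x): predict +1 if w^T x >= 0, and -1 otherwise.
predicted_label = 1 if w @ x >= 0 else -1
print(predicted_label)  # 1, because w^T x = 0.6 >= 0
```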
Minimizing Training Errors (Cont'd)
[Figure: a two-dimensional example in which the line $\mathbf{w}^T\mathbf{x} = 0$ separates the circles from the triangles]
This seems to be quite restricted, but practically $\mathbf{x}$ is in a much higher dimensional space.
Minimizing Training Errors (Cont'd)
To characterize the training error, we need a loss function $\xi(\mathbf{w}; y, \mathbf{x})$ for each instance $(y, \mathbf{x})$, where $y = \pm 1$ is the label and $\mathbf{x}$ is the feature vector.
Ideally we should use the 0-1 training loss:
$$\xi(\mathbf{w}; y, \mathbf{x}) = \begin{cases} 1 & \text{if } y\,\mathbf{w}^T\mathbf{x} < 0, \\ 0 & \text{otherwise} \end{cases}$$
Minimizing Training Errors (Cont'd)
However, this function is discontinuous. The optimization problem becomes difficult.
[Figure: the 0-1 loss $\xi(\mathbf{w}; y, \mathbf{x})$ plotted as a function of $-y\,\mathbf{w}^T\mathbf{x}$]
We need continuous approximations.
Common Loss Functions
Hinge loss (l1 loss):
$$\xi_{L1}(\mathbf{w}; y, \mathbf{x}) \equiv \max(0,\, 1 - y\,\mathbf{w}^T\mathbf{x}) \qquad (1)$$
Logistic loss:
$$\xi_{LR}(\mathbf{w}; y, \mathbf{x}) \equiv \log(1 + e^{-y\,\mathbf{w}^T\mathbf{x}}) \qquad (2)$$
Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2).
SVM and LR are two very fundamental classification methods.
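As a quick sketch, the 0-1, hinge, and logistic losses above can be compared on a single instance; the values of $\mathbf{w}$, $\mathbf{x}$, and $y$ below are arbitrary illustrations.

```python
import numpy as np

# Arbitrary instance (y, x) and model w for illustration.
w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
y = -1

margin = y * (w @ x)                        # y * w^T x
zero_one = 1.0 if margin < 0 else 0.0       # 0-1 training loss
hinge = max(0.0, 1.0 - margin)              # Eq. (1), used by SVM
logistic = np.log(1.0 + np.exp(-margin))    # Eq. (2), used by logistic regression
print(zero_one, hinge, logistic)            # 0.0 0.0 ~0.20
```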
Common Loss Functions (Cont'd)
[Figure: $\xi_{L1}$ and $\xi_{LR}$ plotted as functions of $-y\,\mathbf{w}^T\mathbf{x}$]
Logistic regression is closely related to SVM. Their performance is usually similar.
Common Loss Functions (Cont'd)
However, minimizing training losses may not give a good model for future prediction. Overfitting occurs.
Overfitting
See the illustration in the next slide.
For classification, you can easily achieve 100% training accuracy. This is useless.
When training a data set, we should
avoid underfitting: small training error
avoid overfitting: small testing error
[Figure: overfitting illustration] Filled markers (● and ▲): training data; open markers (◦ and △): testing data.
Regularization
To minimize the training error we manipulate the $\mathbf{w}$ vector so that it fits the data.
To avoid overfitting we need a way to make $\mathbf{w}$'s values less extreme. One idea is to make $\mathbf{w}$'s values closer to zero.
We can add, for example,
$$\frac{\mathbf{w}^T\mathbf{w}}{2} \quad \text{or} \quad \|\mathbf{w}\|_1$$
to the function that is minimized.
General Form of Linear Classification
Training data $\{(y_i, \mathbf{x}_i)\}$, $\mathbf{x}_i \in R^n$, $i = 1, \ldots, l$, $y_i = \pm 1$.
$l$: # of data, $n$: # of features
$$\min_{\mathbf{w}} f(\mathbf{w}), \qquad f(\mathbf{w}) \equiv \frac{\mathbf{w}^T\mathbf{w}}{2} + C \sum_{i=1}^{l} \xi(\mathbf{w}; y_i, \mathbf{x}_i)$$
$\mathbf{w}^T\mathbf{w}/2$: regularization term
$\xi(\mathbf{w}; y, \mathbf{x})$: loss function
$C$: regularization parameter (chosen by users)
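As a sketch only, the objective $f(\mathbf{w})$ above can be evaluated directly, here with the logistic loss; the data X, y and the value of C below are illustrative placeholders.

```python
import numpy as np

def f(w, X, y, C):
    # Regularized objective: w^T w / 2 + C * (sum of logistic losses).
    margins = y * (X @ w)                    # y_i * w^T x_i for all i
    loss = np.log1p(np.exp(-margins)).sum()  # sum of logistic losses
    return 0.5 * (w @ w) + C * loss

# Tiny made-up data set: l = 3 instances, n = 2 features.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y = np.array([1.0, -1.0, 1.0])
print(f(np.zeros(2), X, y, C=1.0))  # 3 * log(2) at w = 0
```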
Optimization problem for fully-connected networks
Multi-class Classification I
Our training set includes $(\mathbf{y}_i, \mathbf{x}_i)$, $i = 1, \ldots, l$. $\mathbf{x}_i \in R^{n_1}$ is the feature vector. $\mathbf{y}_i \in R^K$ is the label vector.
As the label is now a vector, we change (label, instance) from $(y_i, \mathbf{x}_i)$ to $(\mathbf{y}_i, \mathbf{x}_i)$.
$K$: # of classes
If $\mathbf{x}_i$ is in class $k$, then
$$\mathbf{y}_i = [\,\underbrace{0, \ldots, 0}_{k-1},\, 1,\, 0, \ldots, 0\,]^T \in R^K$$
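For illustration only, such a label vector is just the $k$th standard basis vector of $R^K$; the values of K and k below are made up.

```python
import numpy as np

K, k = 4, 3       # made-up number of classes and class index (1-based)
y = np.zeros(K)
y[k - 1] = 1.0    # the single 1 is preceded by k - 1 zeros
print(y)          # [0. 0. 1. 0.]
```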
Multi-class Classification II
A neural network maps each feature vector to one of the class labels by the connection of nodes.
Fully-connected Networks
Between two layers, a weight matrix maps inputs (the previous layer) to outputs (the next layer).
[Figure: a small fully-connected network with three layers of nodes]
Operations Between Two Layers I
The weight matrix $W^m$ at the $m$th layer is
$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}_{n_{m+1} \times n_m}$$
$n_m$: # input features at layer $m$
$n_{m+1}$: # output features at layer $m$, or # input features at layer $m+1$
$L$: number of layers
Operations Between Two Layers II
$n_1$ = # of features, $n_{L+1}$ = # of classes
Let $\mathbf{z}^m$ be the input of the $m$th layer, $\mathbf{z}^1 = \mathbf{x}$, and $\mathbf{z}^{L+1}$ be the output.
From the $m$th layer to the $(m+1)$th layer:
$$\mathbf{s}^m = W^m \mathbf{z}^m, \qquad z^{m+1}_j = \sigma(s^m_j), \quad j = 1, \ldots, n_{m+1}$$
$\sigma(\cdot)$ is the activation function.
Operations Between Two Layers III
Usually people add a bias term
$$\mathbf{b}^m = \begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix}_{n_{m+1} \times 1}$$
so that
$$\mathbf{s}^m = W^m \mathbf{z}^m + \mathbf{b}^m$$
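A minimal sketch of these two equations for a single layer is given below; the sigmoid activation and the layer sizes are assumptions made only for illustration (the slides do not fix a particular $\sigma$).

```python
import numpy as np

def sigma(s):
    # Assumed activation function (sigmoid), applied componentwise.
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
n_m, n_m1 = 3, 2                      # made-up sizes n_m and n_{m+1}
W = rng.standard_normal((n_m1, n_m))  # W^m: n_{m+1} x n_m
b = rng.standard_normal(n_m1)         # b^m: n_{m+1} x 1 bias term
z = rng.standard_normal(n_m)          # z^m: input of the m-th layer

s = W @ z + b                         # s^m = W^m z^m + b^m
z_next = sigma(s)                     # z^{m+1}_j = sigma(s^m_j)
print(z_next.shape)                   # (2,)
```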
Operations Between Two Layers IV
The activation function is usually an $R \to R$ transformation. As we are interested in optimization, let's not worry about why it's needed.
We collect all variables:
$$\theta = \begin{bmatrix} \operatorname{vec}(W^1) \\ \mathbf{b}^1 \\ \vdots \\ \operatorname{vec}(W^L) \\ \mathbf{b}^L \end{bmatrix} \in R^n$$
Operations Between Two Layers V
$n$: total # of variables $= (n_1 + 1) n_2 + \cdots + (n_L + 1) n_{L+1}$
The $\operatorname{vec}(\cdot)$ operator stacks the columns of a matrix into a vector.
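A small sketch of $\operatorname{vec}(\cdot)$ and of stacking one layer's variables into $\theta$; the matrix values are made up, and in NumPy stacking columns corresponds to flattening in column-major ('F') order.

```python
import numpy as np

W1 = np.arange(6.0).reshape(3, 2)  # a made-up 3 x 2 weight matrix
b1 = np.array([1.0, 2.0, 3.0])     # its bias term

theta = np.concatenate([W1.flatten(order="F"), b1])  # [vec(W1); b1]
print(theta)  # [0. 2. 4. 1. 3. 5. 1. 2. 3.]
```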
Optimization Problem I
We solve the following optimization problem:
$$\min_{\theta} f(\theta), \quad \text{where } f(\theta) = \frac{1}{2}\theta^T\theta + C \sum_{i=1}^{l} \xi(\mathbf{z}^{L+1,i}(\theta); \mathbf{y}_i, \mathbf{x}_i)$$
$C$: regularization parameter
$\mathbf{z}^{L+1}(\theta) \in R^{n_{L+1}}$: last-layer output vector of $\mathbf{x}$.
$\xi(\mathbf{z}^{L+1}; \mathbf{y}, \mathbf{x})$: loss function. Example:
$$\xi(\mathbf{z}^{L+1}; \mathbf{y}, \mathbf{x}) = \|\mathbf{z}^{L+1} - \mathbf{y}\|^2$$
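To make the objective concrete, here is a minimal sketch of $f(\theta)$ for a one-hidden-layer network with the squared loss above; the sigmoid activation, the layer sizes, and the data are all made-up assumptions, not taken from the slides.

```python
import numpy as np

def sigma(s):
    return 1.0 / (1.0 + np.exp(-s))   # assumed activation function

def f(params, X, Y, C):
    W1, b1, W2, b2 = params
    reg = 0.5 * sum(np.sum(p * p) for p in params)   # (1/2) theta^T theta
    total_loss = 0.0
    for x, y in zip(X, Y):            # loop over the l training instances
        z2 = sigma(W1 @ x + b1)       # layer 1 -> layer 2
        z3 = W2 @ z2 + b2             # last-layer output z^{L+1}(theta)
        total_loss += np.sum((z3 - y) ** 2)  # squared loss ||z^{L+1} - y||^2
    return reg + C * total_loss

rng = np.random.default_rng(1)
params = (rng.standard_normal((4, 3)), np.zeros(4),   # W^1, b^1
          rng.standard_normal((2, 4)), np.zeros(2))   # W^2, b^2
X = rng.standard_normal((5, 3))               # l = 5 instances, n_1 = 3
Y = np.eye(2)[rng.integers(0, 2, size=5)]     # one-hot label vectors, K = 2
print(f(params, X, Y, C=0.1))
```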
Optimization Problem II
The formulation is the same as in linear classification. However, the loss function is more complicated. Further, it is non-convex.
Note that in the earlier discussion we considered a single instance. In the training process we actually have, for $i = 1, \ldots, l$,
$$\mathbf{s}^{m,i} = W^m \mathbf{z}^{m,i}, \qquad z^{m+1,i}_j = \sigma(s^{m,i}_j), \quad j = 1, \ldots, n_{m+1}$$
This makes the training more complicated.
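As a sketch of why this matters in implementations, the per-instance products above can be batched: stacking the $\mathbf{z}^{m,i}$ as columns of a matrix turns them into a single matrix product. All sizes below are made up for illustration.

```python
import numpy as np

def sigma(s):
    return 1.0 / (1.0 + np.exp(-s))   # assumed activation function

rng = np.random.default_rng(2)
l, n_m, n_m1 = 5, 3, 2                # made-up numbers of instances/features
W = rng.standard_normal((n_m1, n_m))  # W^m
b = rng.standard_normal((n_m1, 1))    # b^m, broadcast over all instances
Z = rng.standard_normal((n_m, l))     # column i is z^{m,i}

S = W @ Z + b                         # column i is s^{m,i} = W^m z^{m,i} + b^m
Z_next = sigma(S)                     # column i is z^{m+1,i}
print(Z_next.shape)                   # (2, 5)
```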
Optimization problem for convolutional neural networks (CNN)
Why CNN? I
There are many types of neural networks. They are suitable for different types of problems.
While deep learning is hot, it's not always better than other learning methods. For example, fully-connected networks were evaluated on general classification data (e.g., data from the UCI machine learning repository).
They are not consistently better than random forests or SVM; see the comparisons in Meyer et al. (2003), Fernández-Delgado et al. (2014), and Wang et al. (2018).