Neural Networks

Petr Pošík (petr.posik@fel.cvut.cz)
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics
© 2020, Artificial Intelligence
Introduction and Rehearsal
Notation

In supervised learning, we work with
■ an observation described by a vector x = (x_1, ..., x_D),
■ the corresponding true value of the dependent variable y, and
■ the prediction of a model ŷ = f_w(x), where the model parameters are in vector w.

Very often, we use homogeneous coordinates and matrix notation, and represent the whole training data set as T = (X, y), where

X = \begin{pmatrix} 1 & x^{(1)} \\ \vdots & \vdots \\ 1 & x^{(|T|)} \end{pmatrix}, \qquad
y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(|T|)} \end{pmatrix}.

Learning then amounts to finding such model parameters w* which minimize a certain loss (or energy) function:

w^* = \arg\min_w J(w, T)
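A minimal NumPy sketch (not from the slides; the data values are illustrative) of building the design matrix X in homogeneous coordinates from a small training set:

```python
import numpy as np

# A toy training set with |T| = 4 observations and D = 2 features
# (the values are illustrative, not taken from the lecture).
observations = np.array([[0.5, 1.2],
                         [1.0, 0.7],
                         [1.5, 2.3],
                         [2.0, 1.8]])
y = np.array([1.1, 0.9, 2.5, 2.2])

# Homogeneous coordinates: each row of X is (1, x_1, ..., x_D),
# so the bias weight can be handled like any other weight.
X = np.hstack([np.ones((observations.shape[0], 1)), observations])
```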
Multiple linear regression

Multiple linear regression model:

\hat{y} = f_w(x) = w_1 x_1 + w_2 x_2 + \ldots + w_D x_D = x w^T

The minimum of

J_{MSE}(w) = \frac{1}{|T|} \sum_{i=1}^{|T|} \left( y^{(i)} - \hat{y}^{(i)} \right)^2

is given by

w^* = (X^T X)^{-1} X^T y,

or found by numerical optimization.

Multiple regression as a linear neuron:
[Figure: inputs x_1, x_2, x_3 with weights w_d feeding a summation unit that outputs ŷ.]
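As a hedged illustration of the closed-form solution above, the following NumPy sketch solves the normal equations for an illustrative data set (X_raw and y are made up for the example):

```python
import numpy as np

# Illustrative data: 4 observations, 2 features, homogeneous coordinates.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 2.5]])
y = np.array([2.0, 1.0, 3.0, 4.0])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Closed-form solution w* = (X^T X)^{-1} X^T y, computed by solving
# the normal equations (numerically safer than inverting X^T X explicitly).
w_star = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w_star   # predictions of the fitted linear model
```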
Logistic regression

Logistic regression model:

\hat{y} = f(w, x) = g(x w^T),

where

g(z) = \frac{1}{1 + e^{-z}}

is the sigmoid (a.k.a. logistic) function.

■ No explicit equation for the optimal weights.
■ The only option is to find the optimum numerically, usually by some form of gradient descent.

Logistic regression as a non-linear neuron:
[Figure: inputs x_1, x_2, x_3 with weights w_d feeding a unit that computes g(x w^T) and outputs ŷ.]
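A minimal sketch of evaluating the logistic regression model for one observation; the weights and inputs below are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights and one observation in homogeneous coordinates.
w = np.array([-0.5, 1.2, 0.8])     # (w_0, w_1, w_2)
x = np.array([1.0, 0.3, 1.5])      # (1, x_1, x_2)

y_hat = sigmoid(x @ w)             # predicted probability of the positive class
```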
Question

Logistic regression uses the sigmoid function to transform the result of the linear combination of inputs:

g(z) = \frac{1}{1 + e^{-z}}

[Figure: plot of the sigmoid g(z) for z from −3 to 3.]

What is the value of the derivative of g(z) at z = 0?

A: g'(0) = −1/2
B: g'(0) = 0
C: g'(0) = 1/4
D: g'(0) = 1/2
Gradient descent algorithm

■ Given a function J(w) that should be minimized,
■ start with a guess of w, and change it so that J(w) decreases, i.e.
■ update our current guess of w by taking a step in the direction opposite to the gradient:

w \leftarrow w - \eta \nabla J(w), \quad \text{i.e.} \quad
w_d \leftarrow w_d - \eta \frac{\partial}{\partial w_d} J(w),

where all w_d are updated simultaneously and η is a learning rate (step size).

■ For cost functions given as a sum across the training examples,

J(w) = \sum_{i=1}^{|T|} E(w, x^{(i)}, y^{(i)}),

we can concentrate on a single training example, because

\frac{\partial}{\partial w_d} J(w) = \sum_{i=1}^{|T|} \frac{\partial}{\partial w_d} E(w, x^{(i)}, y^{(i)}),

and we can drop the indices over the training data set: E = E(w, x, y).
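The update rule above translates directly into a short loop; the following is a generic sketch (the cost function and its gradient are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, w_init, eta=0.1, n_steps=1000):
    """Repeatedly apply w <- w - eta * grad_J(w).

    grad_J: function returning the gradient of the cost J at w
    eta:    learning rate (step size)
    """
    w = np.array(w_init, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_J(w)    # all components of w updated simultaneously
    return w

# Illustrative use on J(w) = (w_1 - 1)^2 + (w_2 + 2)^2, minimized at (1, -2).
grad = lambda w: np.array([2 * (w[0] - 1.0), 2 * (w[1] + 2.0)])
w_star = gradient_descent(grad, w_init=[0.0, 0.0])
```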
Example: Gradient for multiple regression and squared loss

[Figure: linear neuron with inputs x_1, x_2, x_3, weights w_d, and output ŷ.]

Assuming the squared error loss

E(w, x, y) = \frac{1}{2} (y - \hat{y})^2 = \frac{1}{2} (y - x w^T)^2,

we can compute the derivatives using the chain rule as

\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_d}, \quad \text{where}

\frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \frac{1}{2} (y - \hat{y})^2 = -(y - \hat{y}), \quad \text{and} \quad
\frac{\partial \hat{y}}{\partial w_d} = \frac{\partial}{\partial w_d} x w^T = x_d,

and thus

\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_d} = -(y - \hat{y}) x_d.
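The derived gradient −(y − ŷ) x_d can be plugged directly into a (stochastic) gradient-descent step; a sketch with illustrative numbers:

```python
import numpy as np

def squared_loss_gradient(w, x, y):
    """Gradient of E(w, x, y) = 1/2 (y - x w^T)^2 for a single example:
    dE/dw_d = -(y - y_hat) * x_d."""
    y_hat = x @ w
    return -(y - y_hat) * x

# One gradient-descent step on a single (illustrative) training example.
w = np.zeros(3)                    # (w_0, w_1, w_2), homogeneous coordinates
x = np.array([1.0, 2.0, -1.0])     # (1, x_1, x_2)
y = 1.5
eta = 0.05                         # learning rate

w = w - eta * squared_loss_gradient(w, x, y)
```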
Example: Gradient for logistic regression and cross-entropy loss

[Figure: non-linear neuron with inputs x_1, x_2, x_3, weights w_d, activation g(a), and output ŷ.]

Nonlinear activation function:

g(a) = \frac{1}{1 + e^{-a}}

Note that

g'(a) = g(a) (1 - g(a)).
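The derivative identity g'(a) = g(a)(1 − g(a)) can be checked numerically; a small sketch (the test point a is arbitrary):

```python
import numpy as np

def g(a):
    """Sigmoid activation g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Compare the identity g'(a) = g(a) * (1 - g(a)) with a central finite difference.
a = 0.7                            # arbitrary test point
eps = 1e-6
numeric = (g(a + eps) - g(a - eps)) / (2 * eps)
analytic = g(a) * (1 - g(a))
assert abs(numeric - analytic) < 1e-8

# At a = 0: g(0) = 0.5, hence g'(0) = 0.5 * (1 - 0.5) = 0.25.
```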