Neural Network Learning
Looking behind the scenes: a mathematical perspective

Textbook reference: Sections 11.1-11.2
Additional reference: Nilsson, N. Artificial Intelligence: A New Synthesis. San Francisco: Morgan Kaufmann, 1998. (Chapter 2; Chapter 3, Sections 3.1-3.2)
http://en.wikipedia.org/wiki/Sigmoid_function
The learning problem
We are given a set, E, of n-dimensional vectors, X, with components x_i, i = 0, ..., n. These vectors are feature vectors computed by a perceptual processing component. The values can be real or Boolean. For each X in E, we also know the appropriate action or classification, y. These associated actions are sometimes called the labels or the classes of the vectors.
The learning problem (cont'd)
The set E and the associated labels are called the examples, or the training set. The machine learning problem is to find a function, say f(X), that responds "acceptably" to the members of the training set. Note that this type of learning is supervised. We would like the action computed by f to agree with the label for as many vectors in E as possible.
Training a single neuron
[Figure: the decision boundary of a single neuron is the hyperplane X · W − θ = 0. Inputs with X · W − θ > 0 lie on one side; inputs with X · W − θ < 0 lie on the other side, which contains the origin. The unit vector W/|W| is normal to the hyperplane.]
Adjusting the threshold θ changes the position of the hyperplane boundary with respect to the origin; adjusting the weights changes the orientation of the hyperplane.
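A small sketch of this geometry (the function and variable names below are illustrative, not from the slides):

    import numpy as np

    # On which side of the hyperplane X . W - theta = 0 does an input fall?
    def side_of_hyperplane(x, w, theta):
        s = np.dot(x, w) - theta
        return 1 if s > 0 else 0   # thresholded (step) output of the neuron

    w = np.array([1.0, 1.0])       # the weights fix the orientation
    theta = 1.5                    # the threshold fixes the offset from the origin
    print(side_of_hyperplane(np.array([1.0, 1.0]), w, theta))  # 1: X.W - theta > 0
    print(side_of_hyperplane(np.array([0.0, 0.0]), w, theta))  # 0: the origin's side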
Gradient descent method
Define an error function that can be minimized by adjusting the weight values. A commonly used error function is the squared error:

    ε = Σ_{X_i ∈ E} (d_i − f_i)^2

where f_i is the actual response for input X_i, and d_i is the desired response. For fixed E, we see that the error depends on the weight values through the f_i.
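A minimal sketch of this error function in code (X_train, d_train, and predict are illustrative names, not from the slides):

    # Squared error over the training set E: the sum of (d_i - f_i)^2.
    def squared_error(X_train, d_train, predict):
        return sum((d - predict(x)) ** 2 for x, d in zip(X_train, d_train))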
Gradient descent method (cont'd)
A gradient descent process is useful for finding the minimum of ε: calculate the gradient of ε in weight space and move the weight vector along the negative gradient (downhill). Note that ε, as defined, depends on all the input vectors in E. Instead of using them all at once, we use one vector at a time, incrementally. This incremental process is only an approximation of the "batch" process; nevertheless, it works well in practice.
Gradient descent method (cont'd)
[Figure: a hypothetical error surface over a two-dimensional weight space. A gradient descent step moves the weight vector from W_old downhill to W_new, toward a local minimum of the error surface.]
The constant c dictates the size of the learning step.
The procedure
Take one member of E. Adjust the weights if needed. Repeat, either a predefined number of times or until ε is sufficiently small. A skeleton of this loop is sketched below.
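An illustrative sketch of the procedure; adjust_weights stands in for whichever update rule is derived on the following slides, and predict for the neuron's output function:

    # Incremental training: present one example at a time and update the weights.
    def train(X_train, d_train, w, adjust_weights, predict, epochs=100, tol=1e-6):
        for _ in range(epochs):                 # repeat a predefined number of times...
            for x, d in zip(X_train, d_train):  # take one member of E
                w = adjust_weights(w, x, d)     # adjust the weights if needed
            eps = sum((d - predict(x, w)) ** 2 for x, d in zip(X_train, d_train))
            if eps < tol:                       # ...or until ε is sufficiently small
                break
        return w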
How to adjust the weights
The squared error for a single input vector, X, evoking an output of f when the desired output is d, is ε = (d − f)^2. The gradient of ε with respect to the weights is

    ∂ε/∂W = [∂ε/∂w_0, ..., ∂ε/∂w_i, ..., ∂ε/∂w_n]
How to adjust the weights (cont'd)
Since ε's dependence on W is entirely through the dot product s = X · W, we can use the chain rule to write

    ∂ε/∂W = ∂ε/∂s × ∂s/∂W

Because ∂s/∂W = X,

    ∂ε/∂W = ∂ε/∂s × X

Note that ∂ε/∂s = −2(d − f) ∂f/∂s. Thus

    ∂ε/∂W = −2(d − f) ∂f/∂s × X
How to adjust the weights (cont'd)
The remaining problem is to compute ∂f/∂s. The perceptron output, f, is not continuously differentiable with respect to s because of the presence of the threshold function: most small changes in the dot product do not change f at all, and when f does change, it changes abruptly from 1 to 0 or vice versa. We will look at two methods to compute the differential.
Computing the differential
Ignore the threshold function and let f = s (the Widrow-Hoff procedure).
Replace the threshold function with another nonlinear function that is differentiable (the generalized Delta procedure).
The Widrow-Hoff procedure
Suppose we attempt to adjust the weights so that every training vector labeled with a 1 produces a dot product of exactly 1, and every vector labeled with a 0 produces a dot product of exactly −1. In that case, with f = s,

    ε = (d − f)^2 = (d − s)^2,  and  ∂f/∂s = 1

Now the gradient is

    ∂ε/∂W = −2(d − f) X
The Widrow-Hoff procedure (cont'd)
Moving the weight vector along the negative gradient, and incorporating the factor 2 into a learning rate parameter, c, the new value of the weight vector is given by

    W ← W + c(d − f) X

All we need to do now is plug this formula into the "adjust the weights" step of the training procedure, as sketched below.
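A sketch of the Widrow-Hoff update in code (assuming desired outputs d of 1 and −1, as above):

    import numpy as np

    # Widrow-Hoff (delta rule) update: f = s = X . W, no threshold function.
    def widrow_hoff_update(w, x, d, c=0.1):
        f = np.dot(x, w)             # linear output
        return w + c * (d - f) * x   # move along the negative gradient

This would be passed as the adjust_weights argument of the train skeleton shown earlier.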
The Widrow-Hoff procedure (cont'd)
We have W ← W + c(d − f) X. Whenever (d − f) is positive, we add a fraction of the input vector to the weight vector. This addition makes the dot product larger and (d − f) smaller. Similarly, when (d − f) is negative, we subtract a fraction of the input vector from the weight vector.
The Widrow-Hoff procedure (cont'd)
This procedure is also known as the Delta rule. After finding a set of weights that minimizes the squared error (using f = s), we are free to revert to the threshold function for f.
The generalized Delta procedure
Another way of dealing with the nondifferentiable threshold function: replace the threshold function by an S-shaped differentiable function called a sigmoid. Usually, the sigmoid function used is the logistic function, defined as

    f(s) = 1 / (1 + e^{−s})

where s is the input and f is the output.
A sigmoid function
[Figure: plot of the logistic sigmoid, rising from near 0 to near 1 as s goes from −6 to 6.]
It is possible to get sigmoid functions of different "flatness" by adjusting the exponent.
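A minimal sketch in code; the steepness factor k is my own addition to illustrate the "flatness" remark, not part of the slides' definition:

    import numpy as np

    # Logistic sigmoid; larger k gives a steeper (less flat) curve.
    def sig(t, k=1.0):
        return 1.0 / (1.0 + np.exp(-k * t))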
Differentiating a sigmoid function
Sigmoid functions are popular in neural networks because they are a convenient approximation to the threshold function and they yield the following differential:

    d/dt sig(t) = sig(t) × (1 − sig(t))
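The identity follows in one line from the chain rule (a worked step, not in the slides):

    \frac{d}{dt}\,\mathrm{sig}(t)
      = \frac{d}{dt}\left(1 + e^{-t}\right)^{-1}
      = \frac{e^{-t}}{\left(1 + e^{-t}\right)^{2}}
      = \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}}
      = \mathrm{sig}(t)\left(1 - \mathrm{sig}(t)\right)

since e^{−t}/(1 + e^{−t}) = 1 − 1/(1 + e^{−t}) = 1 − sig(t).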
The generalized Delta procedure (cont'd)
With the sigmoid function,

    ∂f/∂s = f(1 − f)

Substituting into ∂ε/∂W = −2(d − f) ∂f/∂s × X gives

    ∂ε/∂W = −2(d − f) f(1 − f) × X

The new weight change rule is:

    W ← W + c(d − f) f(1 − f) X

This is equivalent to the weight change rule included in the learning algorithm:

    W_j ← W_j + c × Err × g′(in) × x_j[e]
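A sketch of the generalized Delta update, with the same shape as the Widrow-Hoff sketch plus the extra f(1 − f) factor:

    import numpy as np

    # Generalized Delta update: sigmoid output instead of the raw dot product.
    def generalized_delta_update(w, x, d, c=0.1):
        f = 1.0 / (1.0 + np.exp(-np.dot(x, w)))      # sigmoid output
        return w + c * (d - f) * f * (1.0 - f) * x   # note the extra f(1 - f)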
Fuzzy hyperplane
In the generalized Delta procedure, there is the added term f(1 − f), due to the presence of the sigmoid function. When f = 0, f(1 − f) is 0; when f = 1, f(1 − f) is also 0; when f = 1/2, f(1 − f) reaches its maximum value of 1/4. Thus weight changes are largest where they have the most effect on f. For an input vector far away from this fuzzy hyperplane, f(1 − f) is close to 0, and the generalized Delta rule makes little or no change to the weight values, regardless of the desired output.
The error-correction procedure
Keep the threshold function. Adjust the weight vector only when the perceptron responds in error, i.e., when (d − f) is 1 or −1. As before, the change is in the direction that helps correct the error; whether the error is corrected fully depends on c.
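A sketch of the error-correction update (theta and the other names are illustrative):

    import numpy as np

    # Error-correction update: keep the thresholded output and change the
    # weights only when the response is wrong, i.e., (d - f) is +1 or -1.
    def error_correction_update(w, x, d, theta=0.0, c=0.1):
        f = 1 if np.dot(x, w) - theta > 0 else 0   # thresholded output
        return w + c * (d - f) * x                 # no change when d == f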
The error-correction procedure (cont'd)
It can be proven that if there is some weight vector, W, that produces a correct output for all of the input vectors in E, then after a finite number of input vector presentations the error-correction procedure will find such a weight vector and thus make no more weight changes. Remember that a single perceptron can only learn linearly separable sets of input vectors.
Linearly non-separable inputs
When the input vectors in the training set are not linearly separable, the error-correction procedure will never terminate; thus it cannot be used to find a "good enough" answer. The Widrow-Hoff and generalized Delta procedures, on the other hand, can find minimum-squared-error solutions even when the minimum error is not zero.