Supervised Learning in Neural Networks

Keith L. Downing
The Norwegian University of Science and Technology (NTNU)
Trondheim, Norway
keithd@idi.ntnu.no

March 7, 2011
Supervised Learning

- Constant feedback from an instructor, indicating not only right/wrong, but also the correct answer for each training case.
- Many cases (i.e., input-output pairs) to be learned.
- Weights are modified by a complex procedure (back-propagation) based on output error.
- Feed-forward networks with back-propagation learning are the standard implementation; 99% of neural network applications use this.
- Typical usage: problems with (a) lots of input-output training data, and (b) the goal of a mapping (function) from inputs to outputs.
- Not biologically plausible, although the cerebellum appears to exhibit some aspects. However, the result of backprop (a trained ANN that performs some function) can be very useful to neuroscientists as a sufficiency proof.
Backpropagation Overview

Training/test cases: {(d1, r1), (d2, r2), (d3, r3), ...}

[Figure: training case (d3, r3): d3 is encoded as network input, the decoded network output is r*, the error is E = r3 - r*, and the gradient dE/dW drives the weight updates.]

- Feed-forward phase: inputs are sent through the ANN to compute outputs.
- Feedback phase: error is passed back from the output to the input layers and used to update weights along the way.
Training -vs- Testing Cases

[Figure: training cases are presented N times, with learning; test cases are presented 1 time, without learning.]

- Generalization: correctly handling test cases (that the ANN has not been trained on).
- Over-training: weights become so fine-tuned to the training cases that generalization suffers, i.e., failure on many test cases.
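To make the protocol concrete, here is a minimal Python sketch (the toy dataset, the 30/10 split, and the single linear node trained with the delta rule are all illustrative choices, not from the slides): training cases are presented for many epochs with learning, then the held-out test cases are run once with learning switched off.

import numpy as np

rng = np.random.default_rng(0)
# Toy dataset: the target is a noisy linear function of two inputs.
X = rng.uniform(-1, 1, size=(40, 2))
T = X @ np.array([0.7, -0.3]) + rng.normal(0, 0.05, size=40)
train_X, train_T = X[:30], T[:30]     # training cases: presented N times, with learning
test_X, test_T = X[30:], T[30:]       # test cases: presented one time, without learning

w, eta = np.zeros(2), 0.1
for epoch in range(50):               # N = 50 training epochs
    for x, t in zip(train_X, train_T):
        y = w @ x                     # single linear node
        w += eta * (t - y) * x        # learning is on (delta rule, next slide)

test_sse = 0.5 * np.sum((test_T - test_X @ w) ** 2)   # no weight changes during testing
print("test SSE:", test_sse)          # a low value indicates good generalization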
Widrow-Hoff (a.k.a. Delta) Rule

[Figure: input X feeds node N through weight w; N computes the sum S of its weighted inputs and produces output Y via its transfer function; T is the target output value.]

\[ \delta = T - Y \]
\[ \Delta w = \eta\,\delta\,X \]
Delta ($\delta$) = error; eta ($\eta$) = learning rate.

- Goal: change w so as to reduce $|\delta|$.
- Intuition: if $\delta > 0$, we want to decrease it, so we must increase Y. Thus, we must increase the sum of weighted inputs to N, and we do that by increasing w if X is positive (decreasing w if X is negative). Similarly for $\delta < 0$.
- Assumes the derivative of N's transfer function is everywhere non-negative.
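A minimal Python sketch of the delta rule for a single node with an identity transfer function (the names w, x, target, and eta are illustrative):

import numpy as np

def delta_rule_step(w, x, target, eta=0.1):
    """One delta-rule update: w <- w + eta * (T - Y) * X."""
    y = np.dot(w, x)          # node output Y = sum of weighted inputs
    delta = target - y        # error term delta = T - Y
    return w + eta * delta * x

# Example: learn weights that map x = [1, 2] to the target 3.
w = np.zeros(2)
for _ in range(100):
    w = delta_rule_step(w, np.array([1.0, 2.0]), 3.0)
print(w, np.dot(w, [1.0, 2.0]))   # the output approaches the target 3.0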
Gradient Descent

- Goal: minimize the total error across all output nodes.
- Method: modify weights throughout the network (i.e., at all levels) so as to follow the route of steepest descent in error space.

[Figure: error E plotted against the weight vector W; gradient descent steps toward the minimum.]

\[ \Delta w_{ij} = -\eta \frac{\partial E_i}{\partial w_{ij}} \]
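A minimal sketch of gradient descent on a single weight, assuming a squared-error surface E(w) = (t - w*x)^2 / 2 for one training pair; the numbers are illustrative:

eta = 0.1
x, t = 2.0, 1.0
w = 5.0                       # arbitrary starting weight
for _ in range(50):
    grad = -(t - w * x) * x   # dE/dw for the squared error above
    w -= eta * grad           # steepest-descent step: delta_w = -eta * dE/dw
print(w, w * x)               # w*x converges toward the target t = 1.0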
Computing $\frac{\partial E_i}{\partial w_{ij}}$

[Figure: output node i receives inputs $x_{1d}, \dots, x_{nd}$ through weights $w_{i1}, \dots, w_{in}$, forming $sum_{id}$; the transfer function $f_T$ gives the output $o_{id}$, which is compared to the target $t_{id}$ to give the error $E_{id}$.]

Sum of Squared Errors (SSE):
\[ E_i = \frac{1}{2} \sum_{d \in D} (t_{id} - o_{id})^2 \]
\[ \frac{\partial E_i}{\partial w_{ij}} = \frac{1}{2} \sum_{d \in D} 2\,(t_{id} - o_{id}) \frac{\partial (t_{id} - o_{id})}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id}) \frac{\partial (-o_{id})}{\partial w_{ij}} \]
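A one-function sketch of the SSE measure (the argument names are illustrative):

import numpy as np

def sse(targets, outputs):
    """E_i = 1/2 * sum over cases d of (t_id - o_id)^2."""
    targets, outputs = np.asarray(targets), np.asarray(outputs)
    return 0.5 * np.sum((targets - outputs) ** 2)

print(sse([1.0, 0.0, 1.0], [0.8, 0.2, 0.9]))   # 0.5 * (0.04 + 0.04 + 0.01) = 0.045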
Computing $\frac{\partial (-o_{id})}{\partial w_{ij}}$

[Figure: the same output node i as on the previous slide.]

Since output = f(sum of weighted inputs):
\[ \frac{\partial E_i}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id}) \frac{\partial (-f_T(sum_{id}))}{\partial w_{ij}} \]
where
\[ sum_{id} = \sum_{k=1}^{n} w_{ik}\,x_{kd} \]
Using the chain rule, $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g(x)} \times \frac{\partial g(x)}{\partial x}$:
\[ \frac{\partial f_T(sum_{id})}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times \frac{\partial sum_{id}}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times x_{jd} \]
Computing $\frac{\partial sum_{id}}{\partial w_{ij}}$ - Easy!!

\[ \frac{\partial sum_{id}}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_{k=1}^{n} w_{ik}\,x_{kd} = \frac{\partial}{\partial w_{ij}} \left( w_{i1} x_{1d} + w_{i2} x_{2d} + \dots + w_{ij} x_{jd} + \dots + w_{in} x_{nd} \right) \]
\[ = \frac{\partial (w_{i1} x_{1d})}{\partial w_{ij}} + \frac{\partial (w_{i2} x_{2d})}{\partial w_{ij}} + \dots + \frac{\partial (w_{ij} x_{jd})}{\partial w_{ij}} + \dots + \frac{\partial (w_{in} x_{nd})}{\partial w_{ij}} \]
\[ = 0 + 0 + \dots + x_{jd} + \dots + 0 = x_{jd} \]
Computing $\frac{\partial f_T(sum_{id})}{\partial sum_{id}}$ - Harder for some $f_T$

$f_T$ = Identity function: $f_T(sum_{id}) = sum_{id}$
\[ \frac{\partial f_T(sum_{id})}{\partial sum_{id}} = 1 \]
Thus:
\[ \frac{\partial f_T(sum_{id})}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times \frac{\partial sum_{id}}{\partial w_{ij}} = 1 \times x_{jd} = x_{jd} \]

$f_T$ = Sigmoid: $f_T(sum_{id}) = \frac{1}{1 + e^{-sum_{id}}}$
\[ \frac{\partial f_T(sum_{id})}{\partial sum_{id}} = o_{id}(1 - o_{id}) \]
Thus:
\[ \frac{\partial f_T(sum_{id})}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times \frac{\partial sum_{id}}{\partial w_{ij}} = o_{id}(1 - o_{id}) \times x_{jd} = o_{id}(1 - o_{id})\,x_{jd} \]
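A short sketch of the sigmoid and its derivative expressed in terms of the node output, with a finite-difference check (names are illustrative):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_deriv_from_output(o):
    """d f_T / d sum = o * (1 - o), where o = sigmoid(sum)."""
    return o * (1.0 - o)

s = 0.3
o = sigmoid(s)
analytic = sigmoid_deriv_from_output(o)
numeric = (sigmoid(s + 1e-6) - sigmoid(s - 1e-6)) / 2e-6   # finite-difference check
print(analytic, numeric)   # the two values agree to several decimal places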
The only non-trivial calculation

\[ \frac{\partial f_T(sum_{id})}{\partial sum_{id}} = \frac{\partial (1 + e^{-sum_{id}})^{-1}}{\partial sum_{id}} = (-1)(1 + e^{-sum_{id}})^{-2}\,\frac{\partial (1 + e^{-sum_{id}})}{\partial sum_{id}} \]
\[ = (-1)(-1)\,e^{-sum_{id}}\,(1 + e^{-sum_{id}})^{-2} = \frac{e^{-sum_{id}}}{(1 + e^{-sum_{id}})^2} \]
But notice that:
\[ \frac{e^{-sum_{id}}}{(1 + e^{-sum_{id}})^2} = f_T(sum_{id})\,(1 - f_T(sum_{id})) = o_{id}(1 - o_{id}) \]
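The last identity, stated but not derived on the slide, follows by factoring the fraction (writing $s$ for $sum_{id}$):
\[ f_T(s)\,(1 - f_T(s)) = \frac{1}{1+e^{-s}}\left(1 - \frac{1}{1+e^{-s}}\right) = \frac{1}{1+e^{-s}} \cdot \frac{e^{-s}}{1+e^{-s}} = \frac{e^{-s}}{(1+e^{-s})^2}. \]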
Putting it all together

\[ \frac{\partial E_i}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id}) \frac{\partial (-f_T(sum_{id}))}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id}) \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times \frac{\partial sum_{id}}{\partial w_{ij}} \]
So for $f_T$ = Identity:
\[ \frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,x_{jd} \]
and for $f_T$ = Sigmoid:
\[ \frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd} \]
Weight Updates ($f_T$ = Sigmoid)

Batch: update weights after each training epoch
\[ \Delta w_{ij} = -\eta \frac{\partial E_i}{\partial w_{ij}} = \eta \sum_{d \in D} (t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd} \]
The weight changes are actually computed after each training case, but $w_{ij}$ is not updated until the end of the epoch.

Incremental: update weights after each training case
\[ \Delta w_{ij} = -\eta \frac{\partial E_i}{\partial w_{ij}} = \eta\,(t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd} \]
- A lower learning rate ($\eta$) is recommended here than for the batch method.
- Results can depend on case-presentation order, so randomly reorder the cases after each epoch.
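A minimal sketch contrasting the two update schemes for a single sigmoid output node (the toy dataset and names are illustrative):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Toy dataset D: each row of X is a case x_d, each entry of T its target t_d.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([1.0, 0.0, 1.0])

def batch_epoch(w, eta=0.5):
    """Accumulate the weight changes over all cases, apply them at the epoch's end."""
    dw = np.zeros_like(w)
    for x, t in zip(X, T):
        o = sigmoid(np.dot(w, x))
        dw += eta * (t - o) * o * (1 - o) * x
    return w + dw

def incremental_epoch(w, eta=0.1, rng=np.random.default_rng(0)):
    """Update w after every case, with the cases in a random order each epoch."""
    for idx in rng.permutation(len(X)):
        x, t = X[idx], T[idx]
        o = sigmoid(np.dot(w, x))
        w = w + eta * (t - o) * o * (1 - o) * x
    return w

w_batch, w_incr = np.zeros(2), np.zeros(2)
for _ in range(200):
    w_batch = batch_epoch(w_batch)
    w_incr = incremental_epoch(w_incr)
print("batch:", w_batch, "incremental:", w_incr)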
Backpropagation in Multi-Layered Neural Networks

[Figure: node j, with input $sum_{jd}$ and output $o_{jd}$, feeds downstream nodes 1..n via weights $w_{1j}, \dots, w_{nj}$; the error $E_d$ depends on $o_{jd}$ through each downstream $sum_{kd}$.]

For each node j and each training case d, backpropagation computes an error term:
\[ \delta_{jd} = -\frac{\partial E_d}{\partial sum_{jd}} \]
by calculating the influence of $sum_{jd}$ along each connection from node j to the next downstream layer.
Computing $\delta_{jd}$

[Figure: the same network fragment as on the previous slide.]

Along the upper path (through downstream node 1), the contribution to $\frac{\partial E_d}{\partial sum_{jd}}$ is:
\[ \frac{\partial o_{jd}}{\partial sum_{jd}} \times \frac{\partial sum_{1d}}{\partial o_{jd}} \times \frac{\partial E_d}{\partial sum_{1d}} \]
So, summing along all paths:
\[ \frac{\partial E_d}{\partial sum_{jd}} = \frac{\partial o_{jd}}{\partial sum_{jd}} \sum_{k=1}^{n} \frac{\partial sum_{kd}}{\partial o_{jd}}\,\frac{\partial E_d}{\partial sum_{kd}} \]
Computing $\delta_{jd}$

Just as before, most terms in the derivative of the sum are 0, so:
\[ \frac{\partial sum_{kd}}{\partial o_{jd}} = w_{kj} \]
Assuming $f_T$ = a sigmoid:
\[ \frac{\partial o_{jd}}{\partial sum_{jd}} = \frac{\partial f_T(sum_{jd})}{\partial sum_{jd}} = o_{jd}(1 - o_{jd}) \]
Thus:
\[ \delta_{jd} = -\frac{\partial E_d}{\partial sum_{jd}} = -\frac{\partial o_{jd}}{\partial sum_{jd}} \sum_{k=1}^{n} \frac{\partial sum_{kd}}{\partial o_{jd}}\,\frac{\partial E_d}{\partial sum_{kd}} = -o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}(-\delta_{kd}) = o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}\,\delta_{kd} \]
Computing $\delta_{jd}$

Note that $\delta_{jd}$ is defined recursively in terms of the $\delta$ values in the next downstream layer:
\[ \delta_{jd} = o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}\,\delta_{kd} \]
So all $\delta$ values in the network can be computed by moving backwards, one layer at a time.
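A minimal sketch of this recursion for one hidden layer of sigmoid nodes (shapes and names are illustrative; W_down[k, j] holds the weight w_kj from hidden node j to downstream node k):

import numpy as np

def hidden_deltas(hidden_outputs, W_down, downstream_deltas):
    """delta_j = o_j * (1 - o_j) * sum_k w_kj * delta_k, for one training case."""
    backprop_sum = W_down.T @ downstream_deltas          # sum_k w_kj * delta_k for each j
    return hidden_outputs * (1.0 - hidden_outputs) * backprop_sum

# Example: 2 hidden nodes feed 1 output node whose delta has already been computed.
o_hidden = np.array([0.6, 0.3])
W_down = np.array([[0.5, -0.4]])        # 1 downstream node x 2 hidden nodes
delta_out = np.array([0.12])            # e.g. (t - o) * o * (1 - o) at the output
print(hidden_deltas(o_hidden, W_down, delta_out))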
Computing $\frac{\partial E_d}{\partial w_{ij}}$ from $\delta_{jd}$ - Easy!!

[Figure: node j's output $o_{jd}$ feeds node i through weight $w_{ij}$, contributing to $sum_{id}$.]

The only effect of $w_{ij}$ upon the error is via its effect upon $sum_{id}$, which is:
\[ \frac{\partial sum_{id}}{\partial w_{ij}} = o_{jd} \]
So:
\[ \frac{\partial E_d}{\partial w_{ij}} = \frac{\partial sum_{id}}{\partial w_{ij}} \times \frac{\partial E_d}{\partial sum_{id}} = \frac{\partial sum_{id}}{\partial w_{ij}} \times (-\delta_{id}) = -o_{jd}\,\delta_{id} \]
Computing $\Delta w_{ij}$

Given an error term $\delta_{id}$ (for node i on training case d), the update of $w_{ij}$ for all nodes j that feed into i is:
\[ \Delta w_{ij} = -\eta \frac{\partial E_d}{\partial w_{ij}} = -\eta\,(-o_{jd}\,\delta_{id}) = \eta\,\delta_{id}\,o_{jd} \]
So given $\delta_i$, you can easily calculate $\Delta w_{ij}$ for all incoming arcs.
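A minimal sketch of this update for all incoming arcs at once, using an outer product (names are illustrative):

import numpy as np

def weight_updates(deltas_i, outputs_j, eta=0.1):
    """Return the matrix of delta_w[i, j] = eta * deltas_i[i] * outputs_j[j]."""
    return eta * np.outer(deltas_i, outputs_j)

deltas_i = np.array([0.12, -0.05])       # error terms of the receiving layer
outputs_j = np.array([0.6, 0.3, 1.0])    # outputs of the feeding layer
W = np.zeros((2, 3))
W += weight_updates(deltas_i, outputs_j)
print(W)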
Learning XOR

[Figure: three training runs, each plotting sum-squared error (0.0 to 1.2) against epoch (0 to 1000).]

- Epoch = all 4 entries of the XOR truth table.
- 2 (inputs) x 2 (hidden) x 1 (output) network.
- Random initialization of all weights in [-1, 1].
- Not linearly separable, so it takes a while!
- Each run is different due to the random weight initialization.
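A minimal end-to-end sketch of incremental backpropagation on XOR with a 2-2-1 sigmoid network; the bias inputs, learning rate, and epoch count are my additions, so this is not the exact code behind the plots on the slide:

import numpy as np

rng = np.random.default_rng()
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# The four XOR cases; one epoch = one pass through all of them.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

# Random weights in [-1, 1]; the extra column in each matrix holds bias weights.
W_hid = rng.uniform(-1, 1, size=(2, 3))   # 2 hidden nodes, 2 inputs + bias
W_out = rng.uniform(-1, 1, size=(1, 3))   # 1 output node, 2 hidden + bias
eta = 0.5

for epoch in range(5000):
    sse = 0.0
    for idx in rng.permutation(4):        # randomly reorder the cases each epoch
        x = np.append(X[idx], 1.0)        # inputs plus bias
        o_hid = sigmoid(W_hid @ x)
        h = np.append(o_hid, 1.0)         # hidden outputs plus bias
        o_out = sigmoid(W_out @ h)
        delta_out = (T[idx] - o_out) * o_out * (1 - o_out)              # output-layer delta
        delta_hid = o_hid * (1 - o_hid) * (W_out[:, :2].T @ delta_out)  # hidden deltas
        W_out += eta * np.outer(delta_out, h)    # delta_w = eta * delta_i * o_j
        W_hid += eta * np.outer(delta_hid, x)
        sse += 0.5 * np.sum((T[idx] - o_out) ** 2)
    if epoch % 1000 == 0:
        print(epoch, sse)                 # the quantity plotted against epoch in the figure

for x, t in zip(X, T):
    out = sigmoid(W_out @ np.append(sigmoid(W_hid @ np.append(x, 1.0)), 1.0))
    print(x, t, out)   # in most runs the outputs approach 0 or 1; some runs get stuck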
Learning to Classify Wines

Class   Properties
1       14.23  1.71  2.43  15.6  127  2.8   ...
1       13.2   1.78  2.14  11.2  100  2.65  ...
2       13.11  1.01  1.7   15    78   2.98  ...
3       13.17  2.59  2.37  20    120  1.65  ...
...

[Figure: a network with 13 input nodes (one per wine property), a 5-node hidden layer, and 1 output node giving the wine class.]
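A minimal sketch of preparing the wine data for such a 13-5-1 network. The slides do not say how the data were preprocessed; this assumes the UCI Wine dataset (whose values match the table above) as provided by sklearn.datasets.load_wine, min-max scaling of the 13 properties, and the three classes mapped onto a single sigmoid output as 0.0 / 0.5 / 1.0:

import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
X = data.data                                                # 178 cases x 13 properties
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))    # scale each property to [0, 1]
T = data.target / 2.0                                        # classes 0, 1, 2 -> targets 0.0, 0.5, 1.0

# Random initial weights for a 13-5-1 network (the bias terms are my addition).
rng = np.random.default_rng()
W_hid = rng.uniform(-1, 1, size=(5, 14))                     # 5 hidden nodes, 13 inputs + bias
W_out = rng.uniform(-1, 1, size=(1, 6))                      # 1 output node, 5 hidden + bias
# Training would then proceed exactly as in the XOR sketch above.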