Neural networks for supervised learning
Ricco Rakotomalala
Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/
Biological metaphor
How the human brain works: transmission of information and the learning process.
Important things to retain:
• Receiving information (signal)
• Activation and processing by a neuron
• Transmission to other neurons (if the signal is strong enough)
• In the long run: strengthening of some connections = LEARNING
McCulloch and Pitts' model: the single-layer perceptron
Binary problem (positive vs. negative). The network has an input layer (the variables X1, X2, X3, plus the bias X0 = 1), an output layer with a single neuron, a linear transfer function d(X), and the Heaviside step function as activation function.
Prediction model and classification rule (the aj are the weights):
d(X) = a0 + a1 x1 + a2 x2 + a3 x3
IF d(X) ≥ 0 THEN Y = 1 ELSE Y = 0
The single-layer perceptron is a linear classifier.
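To make the transfer and activation steps concrete, here is a minimal Python sketch of this model; the weight values and the helper names (d, predict) are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of the single-layer perceptron:
# transfer function d(X), then the Heaviside activation. Weights are illustrative.

def d(x, a):
    """d(X) = a0 + a1*x1 + a2*x2 + a3*x3, with a[0] the bias weight."""
    return a[0] + sum(aj * xj for aj, xj in zip(a[1:], x))

def predict(x, a):
    """Heaviside step: Y = 1 if d(X) >= 0, else Y = 0."""
    return 1 if d(x, a) >= 0 else 0

a = [0.1, 0.2, 0.05, -0.3]      # illustrative weights a0..a3
print(predict([1, 0, 1], a))    # -> 1, since 0.1 + 0.2 - 0.3 = 0.0 >= 0
```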
Learning algorithm for the single-layer perceptron
How to calculate the weights a0, a1, a2, a3 from a data set (Y ; X1, X2, X3)?
Draw a parallel with least squares regression: a neural network can also be used for regression (linear transfer function).
(1) Which criterion to optimize? Minimization of the prediction error.
(2) Which optimization process? An error-correction learning procedure.
Example: learning the logical AND function
An instructive example; the first applications came from the computer science area.
Dataset (with its 2D representation as a scatter plot):
X1  X2  Y
0   0   0
0   1   0
1   0   0
1   1   1
Main steps (a code sketch follows the list):
1. Mix up the instances of the learning set randomly
2. Initialize the weights (small random values)
3. For each instance of the training set:
   • Calculate the output of the perceptron
   • If the prediction is wrong, update the weights
4. Repeat until convergence (termination condition is satisfied)
This is a sequential learning procedure: an instance may be processed several times!
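Here is a minimal Python sketch of these main steps on the AND dataset; the learning rate (0.1), the initialization range, and the maximum number of passes are assumptions consistent with the worked example that follows.

```python
import random

# Sequential (online) perceptron learning on the logical AND dataset.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # ((x1, x2), y)
eta = 0.1                                                     # learning rate (assumed)
a = [random.uniform(-0.1, 0.1) for _ in range(3)]             # step 2: a0 (bias), a1, a2

for epoch in range(100):                  # step 4: repeat until convergence
    random.shuffle(data)                  # step 1: mix up the instances
    errors = 0
    for (x1, x2), y in data:              # step 3: process each instance
        v = a[0] + a[1] * x1 + a[2] * x2
        y_hat = 1 if v >= 0 else 0        # output of the perceptron
        if y_hat != y:                    # wrong prediction -> update the weights
            a[0] += eta * (y - y_hat) * 1
            a[1] += eta * (y - y_hat) * x1
            a[2] += eta * (y - y_hat) * x2
            errors += 1
    if errors == 0:                       # termination: no correction in a full pass
        break

print(a)
```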
Example AND (1)
Initialize the weights (randomly): a0 = 0.1 ; a1 = 0.2 ; a2 = 0.05
Decision boundary: 0.1 + 0.2 x1 + 0.05 x2 = 0, i.e. x2 = -4.0 x1 - 2.0 (see the scatter plot).
Update rule of the weights, for each processed instance:
aj ← aj + Δaj  with  Δaj = η (y - ŷ) xj
(y - ŷ) is the error: it determines whether we correct the weights or not; xj is the strength of the signal.
η is the learning rate parameter: it determines the intensity of the correction. What is a good value? Too small: the processing is too slow. Too high: oscillation. A rule of thumb: about 0.05 ~ 0.15 (0.1 for our example).
Example AND (2)
One instance of the dataset: (x1 = 0 ; x2 = 0), y = 0.
Calculate the output: d(x) = 0.1 × 1 + 0.2 × 0 + 0.05 × 0 = 0.1 ≥ 0, so ŷ = 1.
Error (ŷ = 1 but y = 0) => update the weights:
Δa0 = 0.1 × (0 - 1) × 1 = -0.1  →  a0 = 0.1 - 0.1 = 0.0
Δa1 = 0.1 × (0 - 1) × 0 = 0     →  a1 = 0.2
Δa2 = 0.1 × (0 - 1) × 0 = 0     →  a2 = 0.05
New decision boundary: 0.0 + 0.2 x1 + 0.05 x2 = 0, i.e. x2 = -4.0 x1 + 0.0 (see the scatter plot).
Example AND (3)
Another instance: (x1 = 1 ; x2 = 0), y = 0.
Calculate the output: d(x) = 0.0 × 1 + 0.2 × 1 + 0.05 × 0 = 0.2 ≥ 0, so ŷ = 1.
Error (ŷ = 1 but y = 0) => update the weights:
Δa0 = 0.1 × (0 - 1) × 1 = -0.1  →  a0 = 0.0 - 0.1 = -0.1
Δa1 = 0.1 × (0 - 1) × 1 = -0.1  →  a1 = 0.2 - 0.1 = 0.1
Δa2 = 0.1 × (0 - 1) × 0 = 0     →  a2 = 0.05
New decision boundary: -0.1 + 0.1 x1 + 0.05 x2 = 0, i.e. x2 = -2.0 x1 + 2.0 (see the scatter plot).
Example AND (4) - Termination condition
Another instance: (x1 = 0 ; x2 = 1), y = 0.
Calculate the output: d(x) = -0.1 × 1 + 0.1 × 0 + 0.05 × 1 = -0.05 < 0, so ŷ = 0.
Good prediction (ŷ = y) => no update. Why is no correction needed here? See the decision boundary in the scatter plot.
Decision boundary (unchanged): -0.1 + 0.1 x1 + 0.05 x2 = 0, i.e. x2 = -2.0 x1 + 2.0
Note: what happens if we process the instance (x1 = 1 ; x2 = 0) again?
Convergence? Possible termination conditions:
(1) No correction is made, whatever the instance handled
(2) The error rate no longer decreases "significantly"
(3) The weights are "stable"
(4) We set a maximum number of iterations
(5) We set a minimum error to achieve
Estimation of the conditional probability P(Y/X): sigmoid transfer function
The perceptron provides a classification rule, but in some circumstances we need an estimate of P(Y/X).
Instead of the Heaviside step function, we use the sigmoid function as transfer function:
g(v) = 1 / (1 + e^(-v))  with  v = d(X)
The decision rule becomes: IF g(v) > 0.5 THEN Y = 1 ELSE Y = 0
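A minimal sketch of this sigmoid output and the resulting decision rule; the weight vector passed in the usage example is simply the initial weights of the AND example, used here for illustration.

```python
import math

def sigmoid(v):
    """g(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def predict_proba(x, a):
    """Estimate of P(Y=1 / X): apply the sigmoid to v = d(X)."""
    v = a[0] + sum(aj * xj for aj, xj in zip(a[1:], x))
    return sigmoid(v)

def predict(x, a):
    """IF g(v) > 0.5 THEN Y = 1 ELSE Y = 0."""
    return 1 if predict_proba(x, a) > 0.5 else 0

print(predict([1, 1], [0.1, 0.2, 0.05]))   # sigmoid(0.35) ≈ 0.59 > 0.5 -> 1
```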
Consequence of using a differentiable real-valued activation function: modification of the optimization criterion
Output of the network: ŷ = g(v) = g[d(x)]
Least mean squares criterion:
E = (1/2) Σ_ω [y(ω) - ŷ(ω)]²
But we still use the sequential learning procedure!
Gradient descent optimization algorithm
The derivative of the sigmoid function: g'(v) = g(v) [1 - g(v)]
Optimization: derivative of the objective function (criterion) with respect to the weights:
∂E/∂aj = -Σ_ω [y(ω) - ŷ(ω)] g'[v(ω)] xj(ω)
Update rule of the weights, for each processed instance (Widrow-Hoff learning rule, or delta rule):
aj ← aj + η (y - ŷ) g'(v) xj
Gradient: the weights are updated in the direction which minimizes E.
• The convergence toward the minimum is good in practice
• Ability to handle correlated input variables
• Ability to handle large datasets (rows and columns)
• Updating the model is easy if new labeled instances are available
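The delta rule above can be sketched as a single online update; the function name, the starting weights, and the default learning rate below are assumptions for illustration.

```python
import math

# One online gradient descent step with the Widrow-Hoff / delta rule
# for the sigmoid perceptron.
def delta_rule_update(x, y, a, eta=0.1):
    """Update the weight vector a (a[0] is the bias) for one instance (x, y)."""
    v = a[0] + sum(aj * xj for aj, xj in zip(a[1:], x))   # v = d(x)
    y_hat = 1.0 / (1.0 + math.exp(-v))                    # g(v), the network output
    delta = eta * (y - y_hat) * y_hat * (1.0 - y_hat)     # eta * (y - y_hat) * g'(v)
    a[0] += delta * 1                                     # bias input x0 = 1
    for j, xj in enumerate(x, start=1):
        a[j] += delta * xj
    return a

# Usage example: one update on the instance (x1=1, x2=0), y=0, with assumed starting weights
weights = [0.1, 0.2, 0.05]
print(delta_rule_update([1, 0], 0, weights))
```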
Multiclass perceptron (K = number of classes, K > 2)
(1) Dummy coding of the output: yk = 1 iff y = yk, 0 otherwise
(2) Output for each neuron of the output layer: ŷk = g[vk]  with  vk = a0,k + a1,k x1 + … + aJ,k xJ
(3) Estimate of the conditional probability: P(Y = yk / X) ≈ g[vk]
(4) Classification rule: ŷ = yk* iff k* = arg max_k ŷk
Learning = minimizing the mean squared error, by processing K perceptrons in parallel:
E = (1/2) Σ_ω Σ_{k=1..K} [yk(ω) - ŷk(ω)]²
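A minimal sketch of this classification rule: K output neurons, each with its own weight vector, and an arg max over the K sigmoid outputs. The weight matrix A below is an illustrative assumption.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict_class(x, A):
    """A[k] = [a0k, a1k, ..., aJk] for class k; returns k* = arg max_k y_hat_k."""
    y_hat = []
    for a_k in A:
        v_k = a_k[0] + sum(aj * xj for aj, xj in zip(a_k[1:], x))
        y_hat.append(sigmoid(v_k))            # estimate of P(Y = y_k / X)
    return max(range(len(y_hat)), key=lambda k: y_hat[k])

# Usage example with K = 3 classes and 2 input variables (arbitrary weights)
A = [[0.2, -0.5, 0.1], [-0.1, 0.3, 0.4], [0.0, 0.1, -0.2]]
print(predict_class([1.0, 0.5], A))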
Example on the "breast cancer" dataset (SIPINA tool)
The tool displays the evolution of the error rate during learning and the learned weights.
• Set the input variables on the same scale (standardization, normalization, etc.)
• Sometimes it is useful to partition the data set into three parts: a training set (learning of the weights), a validation set (to monitor the error rate), and a test set (to estimate the generalization performance)
• The settings must be handled with care (learning rate, stopping rule, etc.)
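As an assumption, the same workflow can be sketched with scikit-learn rather than the SIPINA tool used in the slides: the Wisconsin breast cancer data bundled with scikit-learn may differ from the exact file used here, and the model settings below are illustrative defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)       # illustrative breast cancer data

# Three-way partition: training / validation / test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Put the inputs on the same scale (fit the scaler on the training set only)
scaler = StandardScaler().fit(X_train)

clf = MLPClassifier(max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_train), y_train)

print("validation accuracy:", clf.score(scaler.transform(X_valid), y_valid))
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```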