  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms. dataminingbook.info. Mohammed J. Zaki (1) and Wagner Meira Jr. (2). (1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA. (2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil. Chapter 25: Neural Networks.

  2. Artificial Neural Networks. Artificial neural networks, or simply neural networks, are inspired by biological neuronal networks. A real biological neuron, or nerve cell, comprises dendrites, a cell body, and an axon that leads to synaptic terminals. A neuron transmits information via electrochemical signals. When there is a sufficient concentration of ions at the dendrites of a neuron, it generates an electric pulse along its axon called an action potential, which in turn activates the synaptic terminals, releasing more ions and thus causing the information to flow to the dendrites of other neurons. A human brain has on the order of 100 billion neurons, with each neuron having between 1,000 and 10,000 connections to other neurons. Artificial neural networks consist of abstract neurons that mimic real neurons at a very high level. They can be described via a weighted directed graph G = (V, E), with each node v_i ∈ V representing a neuron, and each directed edge (v_i, v_j) ∈ E representing a synaptic-to-dendritic connection from v_i to v_j. The weight w_ij of the edge denotes the synaptic strength.

  3. Artificial neuron: aggregation and activation. [Figure: inputs x_1, ..., x_d and a bias neuron x_0 = 1 feed into neuron z_k with weights w_1k, ..., w_dk and bias b_k; the neuron computes \sum_{i=1}^d w_ik · x_i + b_k.]

  4. Artificial Neuron. An artificial neuron acts as a processing unit that first aggregates the incoming signals via a weighted sum, and then applies some function to generate an output. For example, a binary neuron will output a 1 whenever the combined signal exceeds a threshold, or 0 otherwise. The net input at neuron z_k is net_k = b_k + \sum_{i=1}^d w_ik · x_i = b_k + w_k^T x (1), where w_k = (w_1k, w_2k, ..., w_dk)^T ∈ R^d and x = (x_1, x_2, ..., x_d)^T ∈ R^d is an input point. Notice that x_0 is a special bias neuron whose value is always fixed at 1, and the weight from x_0 to z_k is b_k, which specifies the bias term for the neuron. Finally, the output value of z_k is given as some activation function f(·) applied to the net input at z_k: z_k = f(net_k).
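The net input and activation of Eq. (1) translate directly into code. Below is a minimal sketch, not code from the book, of a single artificial neuron in NumPy; the particular weights, bias, and the default sigmoid activation are illustrative assumptions.

```python
import numpy as np

def neuron_output(x, w, b, f=lambda net: 1.0 / (1.0 + np.exp(-net))):
    """Compute z_k = f(net_k), where net_k = b_k + w_k^T x as in Eq. (1).

    x : (d,) input point, w : (d,) weight vector, b : scalar bias,
    f : activation function (sigmoid here is just one illustrative choice).
    """
    net = b + w @ x   # weighted aggregation of the incoming signals
    return f(net)     # activation applied to the net input

# Example: a binary (step-activation) neuron that fires only when net_k > 0
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.4, 0.1, 0.3])
z = neuron_output(x, w, b=-0.2, f=lambda net: float(net > 0))
```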

  5. Linear Function. Function: f(net_k) = net_k. Derivative: ∂f(net_j)/∂net_j = 1. [Plot: z_k versus w^T x; a straight line from −∞ to +∞, passing through 0 at w^T x = −b_k.]

  6. Step Function. Function: f(net_k) = 0 if net_k ≤ 0, and 1 if net_k > 0. Derivative: ∂f(net_j)/∂net_j = 0. [Plot: z_k versus w^T x; a step from 0 to 1 at w^T x = −b_k.]

  7. Rectified Linear Function. Function: f(net_k) = 0 if net_k ≤ 0, and net_k if net_k > 0. Derivative: ∂f(net_j)/∂net_j = 0 if net_j ≤ 0, and 1 if net_j > 0. [Plot: z_k versus w^T x; zero up to w^T x = −b_k, then increasing linearly to +∞.]

  8. Sigmoid Function. Function: f(net_k) = 1 / (1 + exp{−net_k}). Derivative: ∂f(net_j)/∂net_j = f(net_j) · (1 − f(net_j)). [Plot: z_k versus w^T x; an S-shaped curve from 0 to 1, with value 0.5 at w^T x = −b_k.]

  9. Hyperbolic Tangent Function. Function: f(net_k) = (exp{net_k} − exp{−net_k}) / (exp{net_k} + exp{−net_k}) = (exp{2·net_k} − 1) / (exp{2·net_k} + 1). Derivative: ∂f(net_j)/∂net_j = 1 − f(net_j)^2. [Plot: z_k versus w^T x; an S-shaped curve from −1 to +1, with value 0 at w^T x = −b_k.]

  10. Softmax Function. Function: f(net_k | net) = exp{net_k} / \sum_{i=1}^p exp{net_i}. Derivative: ∂f(net_j | net)/∂net_j = f(net_j) · (1 − f(net_j)). [Plot: z_k as a function of net_k and net_j.]
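For reference, the activation functions on slides 5–10 and the derivatives quoted there can be written out in a few lines. This is a NumPy sketch, not code from the book; the softmax derivative below gives only the diagonal term ∂f(net_j | net)/∂net_j stated on the slide, not the full Jacobian.

```python
import numpy as np

def linear(net):        return net                          # slide 5
def linear_deriv(net):  return np.ones_like(net)

def step(net):          return np.where(net > 0, 1.0, 0.0)  # slide 6
def step_deriv(net):    return np.zeros_like(net)

def relu(net):          return np.maximum(0.0, net)         # slide 7
def relu_deriv(net):    return np.where(net > 0, 1.0, 0.0)

def sigmoid(net):       return 1.0 / (1.0 + np.exp(-net))   # slide 8
def sigmoid_deriv(net):
    s = sigmoid(net)
    return s * (1.0 - s)

def tanh_deriv(net):    return 1.0 - np.tanh(net) ** 2      # slide 9

def softmax(net):
    """f(net_k | net) = exp(net_k) / sum_i exp(net_i); slide 10."""
    e = np.exp(net - np.max(net))   # subtract the max for numerical stability
    return e / e.sum()

def softmax_diag_deriv(net):
    """Diagonal terms f(net_j) * (1 - f(net_j)) only, as on slide 10."""
    s = softmax(net)
    return s * (1.0 - s)
```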

  11. Linear and Logistic Regression via Neural Networks. [Figure: two single-layer architectures. Left: inputs x_1, ..., x_d plus bias neuron x_0 = 1 connect to a single output neuron o with weights w_1, ..., w_d and bias b. Right: the same inputs connect to p output neurons o_1, ..., o_p with weights w_11, ..., w_dp and biases b_1, ..., b_p.]

  12. ANN for Multiple and Multivariate Regression: Example. Consider the multiple regression of sepal length and petal length on the dependent attribute petal width for the Iris dataset with n = 150 points. The optimal solution is given as ŷ = −0.014 − 0.082 · x_1 + 0.45 · x_2, with a squared error of 6.179 on the training data. Using the neural network presented, with linear activation for the output and minimizing the squared error via gradient descent, results in the following learned parameters: b = 0.0096, w_1 = −0.087 and w_2 = 0.452, yielding the regression model o = 0.0096 − 0.087 · x_1 + 0.452 · x_2 with a squared error of 6.18, which is very close to the optimal solution.
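A minimal sketch of the gradient-descent training described above is given below. It is not the book's code: the Iris loader, the column indices for sepal length and petal length, the step size, and the iteration count are all assumptions, so the learned values only approximate the b, w_1, w_2 reported on the slide.

```python
import numpy as np
from sklearn.datasets import load_iris   # assumed source for the Iris data

iris = load_iris()
X = iris.data[:, [0, 2]]   # sepal length, petal length (independent attributes)
y = iris.data[:, 3]        # petal width (dependent attribute), n = 150
n = len(y)

w, b = np.zeros(2), 0.0    # single output neuron with linear activation
eta = 0.01                 # illustrative step size

for _ in range(50_000):                   # batch gradient descent on squared error
    o = X @ w + b                         # o = b + w^T x for every point
    grad_w = (2.0 / n) * (X.T @ (o - y))  # gradient of the mean squared error
    grad_b = (2.0 / n) * np.sum(o - y)
    w -= eta * grad_w
    b -= eta * grad_b

sse = np.sum((X @ w + b - y) ** 2)        # squared error on the training data
print(b, w, sse)                          # should approach 0.0096, (-0.087, 0.452), 6.18
```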

  13. ANN for Multiple and Multivariate Regression: Example (Multivariate Linear Regression). For multivariate regression, we use the neural network architecture presented to learn the weights and biases for the Iris dataset, where we use sepal length and sepal width as the independent attributes, and petal length and petal width as the response or dependent attributes. Therefore, each input point x_i is 2-dimensional, and the true response vector y_i is also 2-dimensional. That is, d = 2 and p = 2 specify the size of the input and output layers. Minimizing the squared error via gradient descent yields the parameters (b_1, b_2) = (−1.83, −1.47) and W = [w_11 w_12; w_21 w_22] = [1.72 0.72; −1.46 −0.50], giving the model o_1 = −1.83 + 1.72 · x_1 − 1.46 · x_2 and o_2 = −1.47 + 0.72 · x_1 − 0.50 · x_2. The squared error on the training set is 84.9. Optimal least-squares multivariate regression yields a squared error of 84.16 with the model ŷ_1 = −2.56 + 1.78 · x_1 − 1.34 · x_2 and ŷ_2 = −1.59 + 0.73 · x_1 − 0.48 · x_2.
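The multivariate case only changes the weight vector into a d × p weight matrix and the bias into a p-vector. The sketch below mirrors the previous one under the same assumptions (sklearn Iris loader, illustrative step size and iteration count), so the learned W, b, and squared error only approximate the values on the slide.

```python
import numpy as np
from sklearn.datasets import load_iris   # assumed source for the Iris data

iris = load_iris()
X = iris.data[:, [0, 1]]   # sepal length, sepal width (d = 2 inputs)
Y = iris.data[:, [2, 3]]   # petal length, petal width (p = 2 outputs)
n = len(X)

W = np.zeros((2, 2))       # weight matrix: rows index inputs, columns index outputs
b = np.zeros(2)            # one bias per output neuron
eta = 0.01                 # illustrative step size

for _ in range(100_000):
    O = X @ W + b                         # linear outputs o_1, o_2 for all points
    grad_W = (2.0 / n) * (X.T @ (O - Y))  # gradient of the mean squared error
    grad_b = (2.0 / n) * (O - Y).sum(axis=0)
    W -= eta * grad_W
    b -= eta * grad_b

sse = np.sum((X @ W + b - Y) ** 2)        # roughly 85 on the training set, per the slide
print(b, W, sse)
```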

  14. Neural networks for multiclass logistic regression: Iris principal components data. [Figure: 3-D plot over the principal component axes X_1 and X_2, showing the three Iris classes and the predicted probability surfaces π_1(x), π_2(x), and π_3(x) along the Y axis. Misclassified points are shown in dark gray. Points in classes c_1 and c_2 are shown displaced with respect to the base class c_3 only for illustration.]

  15. Logistic Regression: Binary and Multiclass: Example. We applied the neural network presented, with logistic activation at the output neuron and cross-entropy error function, to the Iris principal components dataset. The output is a binary response indicating Iris-virginica (Y = 1) or one of the other Iris types (Y = 0). As expected, the neural network learns an identical set of weights and bias as shown for the logistic regression model, namely o = −6.79 − 5.07 · x_1 − 3.29 · x_2. Next, we applied the neural network presented, using a softmax activation and cross-entropy error function, to the Iris principal components data with three classes: Iris-setosa (Y = 1), Iris-versicolor (Y = 2) and Iris-virginica (Y = 3). Thus, we need K = 3 output neurons, o_1, o_2, and o_3. Further, to obtain the same model as in the multiclass logistic regression example, we fix the incoming weights and bias for output neuron o_3 to be zero. The model is given as o_1 = −3.49 + 3.61 · x_1 + 2.65 · x_2, o_2 = −6.95 − 5.18 · x_1 − 3.40 · x_2, and o_3 = 0 + 0 · x_1 + 0 · x_2, which is essentially the same as presented before.
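A sketch of the softmax (multiclass) variant is shown below. It is not the book's code: the use of scikit-learn's PCA to produce the principal components data, the step size, and the iteration count are assumptions, and PCA sign conventions may flip x_1 or x_2, so the learned coefficients may differ in sign or scale from those on the slide. The weights and bias into o_3 are pinned at zero, as described above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA    # assumed way to obtain the principal components data

iris = load_iris()
X = PCA(n_components=2).fit_transform(iris.data)   # x_1, x_2: first two principal components
Y = np.eye(3)[iris.target]                         # one-hot targets for K = 3 classes
n = len(X)

W = np.zeros((2, 3))       # weights into the K = 3 output neurons o_1, o_2, o_3
b = np.zeros(3)
eta = 0.01                 # illustrative step size

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))   # row-wise stable softmax
    return E / E.sum(axis=1, keepdims=True)

for _ in range(50_000):
    P = softmax_rows(X @ W + b)         # class probabilities pi_k(x) for every point
    W -= eta * (X.T @ (P - Y)) / n      # gradient step on the mean cross-entropy error
    b -= eta * (P - Y).mean(axis=0)
    W[:, 2] = 0.0                       # keep the incoming weights and bias of o_3
    b[2] = 0.0                          # fixed at zero, as on the slide

print(b, W)   # columns give o_1, o_2, o_3; compare with the model above
```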
