  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

Chapter 26: Deep Learning

  2. Recurrent Neural Networks

Multilayer perceptrons are feed-forward networks in which information flows in only one direction, namely from the input layer to the output layer via the hidden layers. In contrast, recurrent neural networks (RNNs) are dynamically driven (e.g., temporal), with a feedback loop between two (or more) layers, which makes them ideal for learning from sequence data.

The task of an RNN is to learn a function that predicts the target sequence Y given the input sequence X. That is, the predicted output o_t on input x_t should be similar or close to the target response y_t, for each time point t.

To learn dependencies between elements of the input sequence, an RNN maintains a sequence of m-dimensional hidden state vectors h_t ∈ R^m, where h_t captures the essential features of the input sequence up to time t. The hidden vector h_t at time t depends on the input vector x_t at time t and the previous hidden state vector h_{t−1} from time t−1, and it is computed as follows:

    h_t = f^h( W_i^T x_t + W_h^T h_{t−1} + b_h )    (1)

Here, f^h is the hidden state activation function, typically tanh or ReLU.
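
A minimal sketch of the hidden state update in Eq. (1), assuming tanh as the activation f^h; the dimensions, the NumPy-based helper name, and the random weights below are illustrative placeholders, not taken from the book.

    import numpy as np

    d, m = 4, 3                       # hypothetical input and hidden dimensions
    rng = np.random.default_rng(0)
    W_i = rng.normal(size=(d, m))     # input-to-hidden weights W_i
    W_h = rng.normal(size=(m, m))     # hidden-to-hidden (recurrent) weights W_h
    b_h = np.zeros(m)                 # hidden bias b_h

    def hidden_step(x_t, h_prev):
        """Eq. (1): h_t = f^h(W_i^T x_t + W_h^T h_{t-1} + b_h), with f^h = tanh."""
        return np.tanh(W_i.T @ x_t + W_h.T @ h_prev + b_h)

    h_1 = hidden_step(rng.normal(size=d), np.zeros(m))   # one step, with h_0 = 0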

  3. Recurrent Neural Network

[Figure: RNN architecture. The input x_t feeds into the hidden state h_t via W_i; h_t feeds back into itself via W_h, b_h through a one-step delay; h_t produces the output o_t via W_o, b_o.]

  4. Recurrent Neural Networks

It is important to note that all the weight matrices and bias vectors are independent of the time t. For example, for the hidden layer, the same weight matrix W_h and bias vector b_h are used and updated while training the model, over all time steps t. This is an example of parameter sharing or weight tying between different layers or components of a neural network. Likewise, the input weight matrix W_i, the output weight matrix W_o and the bias vector b_o are all shared across time. This greatly reduces the number of parameters that need to be learned by the RNN, but it also relies on the assumption that all relevant sequential features can be captured by the shared parameters.
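
As a rough illustration of the savings from weight tying (the dimensions d, m, and p below are generic placeholders, not values from the slides): with d-dimensional inputs, an m-dimensional hidden state, and p-dimensional outputs, the shared parameters number

    |W_i| + |W_h| + |b_h| + |W_o| + |b_o| = dm + m^2 + m + mp + p

independent of the sequence length τ; without sharing, a separate copy of these parameters would be needed at each of the τ time steps.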

  5. RNN unfolded in time

[Figure: the RNN unfolded over time steps t = 0, 1, 2, ..., τ. The initial state h_0 feeds into h_1, and each hidden state h_{t−1} feeds into h_t via the shared W_h, b_h; each input x_t feeds into h_t via the shared W_i; each hidden state h_t produces the output o_t via the shared W_o, b_o.]

  6. Training an RNN

For training the network, we compute the error or loss between the predicted and response vectors over all time steps. For example, the squared error loss is given as

    E_X = Σ_{t=1}^{τ} E_{x_t} = (1/2) · Σ_{t=1}^{τ} ‖ y_t − o_t ‖^2

On the other hand, if we use a softmax activation at the output layer, then we use the cross-entropy loss, given as

    E_X = Σ_{t=1}^{τ} E_{x_t} = − Σ_{t=1}^{τ} Σ_{i=1}^{p} y_{ti} · ln(o_{ti})

where y_t = (y_{t1}, y_{t2}, ..., y_{tp})^T ∈ R^p and o_t = (o_{t1}, o_{t2}, ..., o_{tp})^T ∈ R^p.
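
The following is a small sketch of the two losses above; the array names and the (τ × p) layout, stacking the targets y_t and predictions o_t row by row, are assumptions made for illustration.

    import numpy as np

    def squared_error_loss(Y, O):
        """E_X = (1/2) * sum_t ||y_t - o_t||^2 over a (tau x p) target/prediction pair."""
        return 0.5 * np.sum((Y - O) ** 2)

    def cross_entropy_loss(Y, O, eps=1e-12):
        """E_X = -sum_t sum_i y_ti * ln(o_ti); eps guards against log(0)."""
        return -np.sum(Y * np.log(O + eps))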

  7. Feed-forward in Time

The feed-forward process starts at time t = 0, taking as input the initial hidden state vector h_0, which is usually set to 0, or it can be user-specified, say from a previous prediction step. Given the current set of parameters, we predict the output o_t at each time step t = 1, 2, ..., τ:

    o_t = f^o( W_o^T h_t + b_o )
        = f^o( W_o^T f^h( W_i^T x_t + W_h^T h_{t−1} + b_h ) + b_o )
        ...
        = f^o( W_o^T f^h( W_i^T x_t + W_h^T f^h( ··· f^h( W_i^T x_1 + W_h^T h_0 + b_h ) ··· ) + b_h ) + b_o )

We can observe that the RNN implicitly makes a prediction for every prefix of the input sequence, since o_t depends on all the previous input vectors x_1, x_2, ..., x_t, but not on any future inputs x_{t+1}, ..., x_τ.
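
A sketch of the feed-forward recursion above, assuming a tanh hidden activation and a softmax output activation; the function names and the (τ × d) layout of the input sequence X are illustrative choices, and the weight shapes match the sketch after slide 2.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_forward(X, W_i, W_h, b_h, W_o, b_o, h0=None):
        """Compute h_t and o_t for t = 1, ..., tau from a (tau x d) input sequence X."""
        tau, m = X.shape[0], W_h.shape[0]
        h = np.zeros(m) if h0 is None else h0             # h_0, usually the zero vector
        H, O = [], []
        for t in range(tau):
            h = np.tanh(W_i.T @ X[t] + W_h.T @ h + b_h)   # Eq. (1)
            o = softmax(W_o.T @ h + b_o)                  # o_t = f^o(W_o^T h_t + b_o)
            H.append(h)
            O.append(o)
        return np.array(H), np.array(O)

Each o_t uses only the prefix x_1, ..., x_t, exactly as noted above.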

  8. Backpropagation in Time

Once the output sequence O = ⟨ o_1, o_2, ..., o_τ ⟩ is generated, we can compute the error in the predictions using the squared error (or cross-entropy) loss function, which can in turn be used to compute the net gradient vectors that are backpropagated from the output layers to the input layers for each time step.

Let E_{x_t} denote the loss on input vector x_t from the input sequence X = ⟨ x_1, x_2, ..., x_τ ⟩. Define δ_{o_t} as the net gradient vector for the output vector o_t, i.e., the derivative of the error function E_{x_t} with respect to the net value at each neuron in o_t, given as

    δ_{o_t} = ( ∂E_{x_t}/∂net_{o_{t1}}, ∂E_{x_t}/∂net_{o_{t2}}, ..., ∂E_{x_t}/∂net_{o_{tp}} )^T

where o_t = (o_{t1}, o_{t2}, ..., o_{tp})^T ∈ R^p is the p-dimensional output vector at time t, and net_{o_{ti}} is the net value at output neuron o_{ti} at time t.
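
As a worked instance (a standard result, assumed here rather than stated on this slide): if the output layer uses a softmax activation together with the cross-entropy loss from slide 6, the net gradient at the output simplifies to

    δ_{o_t} = o_t − y_t

which is the form used in the backpropagation sketch after slide 12.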

  9. Backpropagation in Time

Likewise, let δ_{h_t} denote the net gradient vector for the hidden state neurons h_t at time t:

    δ_{h_t} = ( ∂E_{x_t}/∂net_{h_{t1}}, ∂E_{x_t}/∂net_{h_{t2}}, ..., ∂E_{x_t}/∂net_{h_{tm}} )^T

where h_t = (h_{t1}, h_{t2}, ..., h_{tm})^T ∈ R^m is the m-dimensional hidden state vector at time t, and net_{h_{ti}} is the net value at hidden neuron h_{ti} at time t.

  10. RNN: Feed-forward step

[Figure: the unfolded RNN viewed as a feed-forward graph with layers l = 0, 1, 2, 3, ..., τ, τ+1: the hidden states h_0, h_1, ..., h_τ connected by the shared W_h, b_h, the inputs x_1, ..., x_τ entering via the shared W_i, and the outputs o_1, ..., o_τ produced via the shared W_o, b_o.]

  11. RNN: Backpropagation step

[Figure: the same unfolded graph with the flow reversed for backpropagation. Each output o_t contributes the net gradient δ_{o_t}, which is sent back to h_t as W_o · δ_{o_t}, and each hidden state h_{t+1} sends W_h · δ_{h_{t+1}} back to h_t, yielding the net gradients δ_{h_t}.]

  12. Computing Net Gradients

The key step in backpropagation is to compute the net gradients in reverse order, starting from the output neurons to the input neurons via the hidden neurons. The backpropagation step reverses the flow direction for computing the net gradients δ_{o_t} and δ_{h_t}, as shown in the backpropagation graph. In particular, the net gradient vector at the output o_t can be computed as follows:

    δ_{o_t} = ∂f_{o_t} ⊙ ∂E_{x_t}    (2)

where ∂f_{o_t} denotes the vector of derivatives of the output activation function, ∂E_{x_t} the vector of partial derivatives of the loss with respect to the outputs o_t, and ⊙ is the element-wise or Hadamard product. On the other hand, the net gradients at each of the hidden layers need to account for the incoming net gradients from o_t and from h_{t+1}. Thus, the net gradient vector for h_t (for t = 1, 2, ..., τ − 1) is given as

    δ_{h_t} = ∂f_{h_t} ⊙ ( W_o · δ_{o_t} + W_h · δ_{h_{t+1}} )    (3)

Note that for h_τ, the net gradient depends only on δ_{o_τ}, since there is no h_{τ+1}. Finally, note that the net gradients do not have to be computed for h_0 or for any of the input neurons x_t, since these are leaf nodes in the backpropagation graph, and thus do not backpropagate the gradients beyond those neurons.
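
A sketch of backpropagation in time under the softmax/cross-entropy assumption from slide 8, reusing rnn_forward from the sketch after slide 7; the accumulation of weight gradients via outer products is an implementation choice, while the net gradients follow Eqs. (2) and (3) with the tanh derivative 1 − h_t^2.

    import numpy as np

    def rnn_bptt(X, Y, W_i, W_h, b_h, W_o, b_o):
        """Return loss gradients for all shared parameters of the RNN."""
        H, O = rnn_forward(X, W_i, W_h, b_h, W_o, b_o)
        tau = X.shape[0]
        dW_i, dW_h, db_h = np.zeros_like(W_i), np.zeros_like(W_h), np.zeros_like(b_h)
        dW_o, db_o = np.zeros_like(W_o), np.zeros_like(b_o)
        delta_h_next = np.zeros_like(b_h)        # no h_{tau+1}, so start from zero
        for t in reversed(range(tau)):           # t = tau, ..., 1
            delta_o = O[t] - Y[t]                # Eq. (2) for softmax + cross-entropy
            # Eq. (3): combine gradients arriving from o_t and from h_{t+1}
            delta_h = (1.0 - H[t] ** 2) * (W_o @ delta_o + W_h @ delta_h_next)
            h_prev = H[t - 1] if t > 0 else np.zeros_like(b_h)   # h_0 = 0
            dW_o += np.outer(H[t], delta_o)
            db_o += delta_o
            dW_i += np.outer(X[t], delta_h)
            db_h += delta_h
            dW_h += np.outer(h_prev, delta_h)
            delta_h_next = delta_h
        return dW_i, dW_h, db_h, dW_o, db_o

The gradients are summed over all τ time steps because the same W_i, W_h, b_h, W_o, b_o are shared across time (slide 4).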

  13. Reber grammar automata

[Figure: the Reber grammar automaton with states 0–7. It starts at state 0 with symbol B and ends at state 7 with symbol E; the intermediate transitions are labeled with the symbols T, S, X, P, and V.]

  14. RNN: Reber grammar

We use an RNN to learn the Reber grammar, which is generated according to the automaton. Let Σ = {B, E, P, S, T, V, X} denote the alphabet comprising the seven symbols. Further, let $ denote a terminal symbol. Starting from the initial node, we can generate strings that follow the Reber grammar by emitting the symbols on the edges. If there are two transitions out of a node, each one can be chosen with equal probability.

The sequence ⟨B, T, S, S, X, X, T, V, V, E⟩ is a valid Reber sequence (with the corresponding state sequence ⟨0, 1, 2, 2, 2, 4, 3, 3, 5, 6, 7⟩). On the other hand, the sequence ⟨B, P, T, X, S, E⟩ is not a valid Reber sequence, since there is no edge out of state 3 with the symbol X.
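
A sketch of a Reber string generator, assuming the standard Reber automaton; the edge list below is reconstructed from this slide's examples and the usual form of the grammar, so it should be checked against the book's figure.

    import random

    # States 0-7; each transition out of a state is chosen with equal probability.
    REBER_EDGES = {
        0: [('B', 1)],
        1: [('T', 2), ('P', 3)],
        2: [('S', 2), ('X', 4)],
        3: [('T', 3), ('V', 5)],
        4: [('X', 3), ('S', 6)],
        5: [('P', 4), ('V', 6)],
        6: [('E', 7)],
    }

    def generate_reber():
        """Walk the automaton from state 0 to the final state 7, emitting symbols."""
        state, symbols = 0, []
        while state != 7:
            sym, state = random.choice(REBER_EDGES[state])
            symbols.append(sym)
        return symbols

    # e.g. ['B', 'T', 'S', 'S', 'X', 'X', 'T', 'V', 'V', 'E'] is one possible output,
    # matching the valid sequence on this slide.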
