Lecture 5: Representation Learning


  1. Lecture 5: Representation Learning Kai-Wei Chang CS @ UCLA kw@kwchang.net Course webpage: https://uclanlp.github.io/CS269-17/ ML in NLP 1

  2. This lecture v Review: Neural Network v Recurrent NN v Representation learning in NLP ML in NLP 2

  3. Neural Network 10 Based on slide by Andrew Ng

  4. Neural Network (feed forward) 12 Slide by Andrew Ng

  5. Feed-Forward Process v Input layer units are features (in NLP, e.g., words) v Usually a one-hot vector or word embedding v Working forward through the network, the input function is applied to compute each unit's input value v E.g., a weighted sum of the inputs v The activation function transforms this input value into a final output value v Typically a nonlinear function (e.g., sigmoid) 13 Based on slide by T. Finin, M. desJardins, L. Getoor, R.Par
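To make the feed-forward process concrete, here is a minimal NumPy sketch of the steps described above: a one-hot input, a weighted sum at each layer, and a sigmoid activation. The dimensions and weights below are illustrative, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions (illustrative): vocabulary of 5 words, 3 hidden units, 2 output units.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)   # hidden -> output

x = np.zeros(5); x[2] = 1.0        # one-hot input vector for "word 2"
h = sigmoid(W1 @ x + b1)           # input function (weighted sum) + activation
o = sigmoid(W2 @ h + b2)           # output layer
print(h, o)
```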

  6. 14 Slide by Andrew Ng

  7. Vector Representation 15 Based on slide by Andrew Ng

  8. Can extend to multi-class Pedestrian Car Motorcycle Truck 17 Slide by Andrew Ng

  9. Why staged predictions? 21 Based on slide and example by Andrew Ng

  10. Representing Boolean Functions 22

  11. Combining Representations to Create Non-Linear Functions 23 Based on example by Andrew Ng
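The classic construction behind these two slides builds a non-linear function (XNOR) by layering threshold-like units for AND, OR, and (NOT x1) AND (NOT x2). A small sketch, assuming the usual hand-set weights from Andrew Ng's example (the specific weight values are an assumption, not copied from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-set weights (bias first) from the classic construction:
AND      = np.array([-30, 20, 20])   # fires only when x1 = x2 = 1
OR       = np.array([-10, 20, 20])
NOT_BOTH = np.array([ 10, -20, -20]) # (NOT x1) AND (NOT x2)

def unit(w, x1, x2):
    return sigmoid(w @ np.array([1, x1, x2])) > 0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = unit(AND, x1, x2)                 # hidden unit 1
        a2 = unit(NOT_BOTH, x1, x2)            # hidden unit 2
        xnor = unit(OR, int(a1), int(a2))      # output layer: OR of the two hidden units
        print(x1, x2, "->", int(xnor))         # prints the XNOR truth table
```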

  12. Layering Representations x 1 ... x 20 x 21 ... x 40 x 41 ... x 60 ... x 381 ... x 400 20 × 20 pixel images, d = 400, 10 classes. Each image is “unrolled” into a vector x of pixel intensities. 24

  13. Layering Representations x 1 x 2 “0” x 3 “1” x 4 x 5 “9” Output Layer Hidden Layer x d Input Layer Visualization of Hidden Layer 25

  14. This lecture v Review: Neural Network v Learning NN v Recursive and Recurrent NN v Representation learning in NLP ML in NLP 14

  15. Stochastic Sub-gradient Descent Given a training set D = {(x, y)}: 1. Initialize w ← 0 ∈ ℝᵈ 2. For epoch 1…T: 3. For (x, y) in D: 4. Update w ← w − γ ∇ℓ(w; x, y) 5. Return w ML in NLP 15
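A minimal Python sketch of the update loop on this slide, with a logistic-loss gradient plugged in as an illustrative per-example (sub)gradient; the learning rate, dimensions, and data are made up.

```python
import numpy as np

def sgd(data, grad, lr=0.1, epochs=10, dim=2):
    """Stochastic (sub)gradient descent as on the slide: loop over epochs,
    loop over examples, step against the per-example gradient."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            w -= lr * grad(w, x, y)
    return w

# Illustrative plug-in: logistic loss gradient for labels y in {-1, +1}.
def logistic_grad(w, x, y):
    return -y * x / (1.0 + np.exp(y * (w @ x)))

data = [(np.array([1.0, 2.0]), +1), (np.array([2.0, -1.0]), -1)]
print(sgd(data, logistic_grad))
```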

  16. Recap: Logistic regression min_θ (λ/2n) θᵀθ + (1/n) Σᵢ log(1 + exp(−yᵢ θᵀxᵢ)) Let h_θ(xᵢ) = 1/(1 + exp(−θᵀxᵢ)) (probability yᵢ = 1 given xᵢ) Equivalent form: (λ/2n) θᵀθ − (1/n) Σᵢ [ yᵢ log(h_θ(xᵢ)) + (1 − yᵢ) log(1 − h_θ(xᵢ)) ] ML in NLP 16
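The two forms of the objective can be checked numerically. This sketch assumes labels in {−1, +1} for the margin form and {0, 1} for the cross-entropy form; both functions should print the same value on the toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective_margin(theta, X, y, lam):
    """(lambda/2n) theta'theta + (1/n) sum log(1 + exp(-y_i theta'x_i)), y in {-1,+1}."""
    n = len(y)
    return lam / (2 * n) * theta @ theta + np.mean(np.log1p(np.exp(-y * (X @ theta))))

def objective_xent(theta, X, y01, lam):
    """Equivalent cross-entropy form with labels in {0, 1}."""
    n = len(y01)
    h = sigmoid(X @ theta)
    return lam / (2 * n) * theta @ theta - np.mean(y01 * np.log(h) + (1 - y01) * np.log(1 - h))

theta = np.array([0.5, -0.3])
X = np.array([[1.0, 2.0], [2.0, -1.0], [0.5, 0.5]])
y = np.array([+1, -1, +1])
print(objective_margin(theta, X, y, lam=0.1))
print(objective_xent(theta, X, (y + 1) // 2, lam=0.1))
```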

  17. Cost Function r(θ) = λ θᵀθ, f(θ) = J(θ) + r(θ) 32 Based on slide by Andrew Ng

  18. Optimizing the Neural Network 33 Based on slide by Andrew Ng

  19. Forward Propagation 34 Based on slide by Andrew Ng

  20. Backpropagation: Compute Gradient 36 Based on slide by Andrew Ng
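A small NumPy sketch of backpropagation for a one-hidden-layer network with sigmoid units and cross-entropy loss: forward pass first, then the "delta" terms and weight gradients computed layer by layer via the chain rule. Sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 3))
x, y = rng.normal(size=4), 1.0

# Forward propagation.
z1 = W1 @ x;  a1 = sigmoid(z1)
z2 = W2 @ a1; a2 = sigmoid(z2)
loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

# Backpropagation: output layer to hidden layer.
d2 = a2 - y                              # dLoss/dz2 for sigmoid + cross-entropy
d1 = (W2.T @ d2) * a1 * (1 - a1)         # dLoss/dz1 via the chain rule
grad_W2 = np.outer(d2, a1)               # dLoss/dW2
grad_W1 = np.outer(d1, x)                # dLoss/dW1
print(loss, grad_W1.shape, grad_W2.shape)
```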

  21. This lecture v Review: Neural Network v Recurrent NN v Representation learning in NLP ML in NLP 21

  22. How to deal with input of variable size? v Use the same parameters Today is a … </S> <S> Today is … day Advanced ML: Inference 22

  23. Recurrent Neural Networks

  24. Recurrent Neural Networks

  25. Unroll RNNs U V
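A minimal sketch of an unrolled vanilla RNN, reusing the same parameters (here called U, V, and an output matrix W) at every time step. The update h_t = tanh(U x_t + V h_{t−1}) is the standard formulation, assumed here rather than copied from the slide.

```python
import numpy as np

# A vanilla RNN unrolled over time; the same U, V (and W) are reused at every step.
rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 2
U = rng.normal(size=(d_h, d_in))    # input  -> hidden
V = rng.normal(size=(d_h, d_h))     # hidden -> hidden (recurrence)
W = rng.normal(size=(d_out, d_h))   # hidden -> output

xs = [rng.normal(size=d_in) for _ in range(5)]   # a length-5 input sequence
h = np.zeros(d_h)
for x_t in xs:
    h = np.tanh(U @ x_t + V @ h)    # shared parameters at every time step
    y_t = W @ h
print(y_t)
```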

  26. RNN training v Back-propagation through time

  27. Vanishing Gradients v For traditional activation functions, each gradient term has a value in the range (−1, 1). v Computing the gradient multiplies n of these small numbers together. v The longer the sequence, the more severe the problem.
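A quick numerical illustration of the point: each backprop-through-time factor contains a sigmoid derivative σ(z)(1 − σ(z)), which is at most 0.25, so the product over a long sequence shrinks towards zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Product of n sigmoid-derivative factors for increasing sequence length n.
for n in (5, 20, 50, 100):
    z = rng.normal(size=n)
    factors = sigmoid(z) * (1 - sigmoid(z))   # each factor is <= 0.25
    print(n, factors.prod())                  # shrinks geometrically with n
```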

  28. RNNs characteristics v Model dependencies among hidden states (inputs) v Errors are “back-propagated through time” v A feature learning method v Vanishing gradient problem: cannot model long-distance dependencies between hidden states.

  29. Long Short-Term Memory Networks (LSTMs) Use gates to control the information to be added from the input, forgotten from the previous memory, and outputted. σ and f are the sigmoid and tanh functions, which map values to [0, 1] and [−1, 1] respectively.
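A sketch of the standard LSTM gate equations (input, forget, and output gates plus a tanh candidate), assuming the usual formulation; the parameter names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step with the standard gate equations (P holds the weight matrices)."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(P["Wi"] @ z + P["bi"])      # input gate: how much new content to add
    f = sigmoid(P["Wf"] @ z + P["bf"])      # forget gate: how much old memory to keep
    o = sigmoid(P["Wo"] @ z + P["bo"])      # output gate: how much memory to expose
    g = np.tanh(P["Wg"] @ z + P["bg"])      # candidate memory content, in [-1, 1]
    c = f * c_prev + i * g                  # new memory cell
    h = o * np.tanh(c)                      # new hidden state
    return h, c

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
P = {f"W{k}": rng.normal(size=(d_h, d_in + d_h)) for k in "ifog"}
P.update({f"b{k}": np.zeros(d_h) for k in "ifog"})
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), P)
print(h, c)
```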

  30. Another Visualization Capable of modeling long-distance dependencies between states. Figure credit: Christopher Olah

  31. Bidirectional LSTMs

  32. How to deal with sequence output? v Idea 1: combine DL with CRF v Idea 2: introduce structure in DL ML in NLP 32

  33. LSTMs for Sequential Tagging ŷ_t = W h_t + b, trained with min Σ_t ℓ(ŷ_t, y_t) A sophisticated model of the input + local (per-token) predictions.
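A sketch of this local tagging objective: project each hidden state with ŷ_t = W h_t + b and sum an independent per-token cross-entropy loss. The hidden states here are random stand-ins for BiLSTM outputs; all sizes are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d_h, n_tags = 6, 3, 4
hs = rng.normal(size=(T, d_h))           # stand-ins for LSTM hidden states h_1..h_T
gold = rng.integers(0, n_tags, size=T)   # gold tag indices
W, b = rng.normal(size=(n_tags, d_h)), np.zeros(n_tags)

loss = 0.0
for h_t, y_t in zip(hs, gold):
    scores = W @ h_t + b                    # y_hat_t = W h_t + b
    loss += -np.log(softmax(scores)[y_t])   # cross-entropy, independent per position
print(loss)
```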

  34. Recall CRFs for Sequential Tagging Arbitrary features on the input side Markov assumption on the output side

  35. LSTMs for Sequential Tagging v This completely ignores the interdependencies among the outputs v Will this work? Yes. v Liang et al. (2008), Structure Compilation: Trading Structure for Features v Is this the best model? Not necessarily.

  36. Combining CRFs with LSTMs

  37. Traditional CRFs vs. LSTM-CRFs v Traditional CRFs: P(Y | X; θ) = (1/Z(X)) ∏_{i=1}^{n} exp(θ · f(y_i, y_{i−1}, x_{1:n})), with Z(X) = Σ_Y ∏_{i=1}^{n} exp(θ · f(y_i, y_{i−1}, x_{1:n})) v LSTM-CRFs: P(Y | X; Θ) = (1/Z(X)) ∏_{i=1}^{n} exp(λ · f(y_i, y_{i−1}, LSTM(x_{1:n}))), where Θ = {λ, Ω} and Ω are the LSTM parameters
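A sketch of the linear-chain CRF probability above, with per-position emission scores standing in for the (LSTM-derived) features f: the score of a tag sequence plus the partition function, computed with the forward algorithm, gives log P(Y | X). The scores and tags below are random, purely for illustration.

```python
import numpy as np

def log_sum_exp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def crf_log_prob(emit, trans, tags):
    """log P(Y | X) for a linear-chain CRF.
    emit[t, k]: score of tag k at position t (in an LSTM-CRF, derived from BiLSTM(x_{1:n})).
    trans[j, k]: score of transitioning from tag j to tag k."""
    T, K = emit.shape
    # Score of the given tag sequence.
    score = emit[0, tags[0]] + sum(
        trans[tags[t - 1], tags[t]] + emit[t, tags[t]] for t in range(1, T))
    # Partition function Z(X) via the forward algorithm.
    alpha = emit[0].copy()
    for t in range(1, T):
        alpha = np.array([log_sum_exp(alpha + trans[:, k]) + emit[t, k] for k in range(K)])
    return score - log_sum_exp(alpha)

rng = np.random.default_rng(0)
T, K = 5, 3
emit = rng.normal(size=(T, K))
trans = rng.normal(size=(K, K))
tags = rng.integers(0, K, size=T)
print(np.exp(crf_log_prob(emit, trans, tags)))   # P(Y | X) for this tag sequence
```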

  38. Combining Two Benefits ● Directly model output dependencies with CRFs. ● Powerful automatic feature learning using biLSTMs. ● Jointly train all the parameters to “share the modeling responsibilities”.

  39. Transfer Learning with LSTM-CRFs v Neural networks as the feature learner v Share the feature learner across different tasks v Jointly train the feature learner so that it learns the common features v Use different CRFs for different tasks to encode task-specific information v Going forward, one can imagine using other graphical models besides linear-chain CRFs.
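A sketch of the shared-feature-learner idea in PyTorch, assuming a BiLSTM encoder shared between two tasks (e.g., CWS and NER, as on the next slide) with task-specific output layers. Plain linear scorers stand in for the per-task CRF layers; all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class SharedTagger(nn.Module):
    """Shared BiLSTM feature learner with task-specific output layers (a sketch;
    simple linear scorers stand in for the per-task CRFs described above)."""
    def __init__(self, vocab=1000, emb=50, hidden=64, tags_cws=4, tags_ner=9):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.head_cws = nn.Linear(2 * hidden, tags_cws)   # task 1: word segmentation
        self.head_ner = nn.Linear(2 * hidden, tags_ner)   # task 2: NER

    def forward(self, token_ids, task):
        feats, _ = self.encoder(self.embed(token_ids))    # shared features
        head = self.head_cws if task == "cws" else self.head_ner
        return head(feats)                                # per-token tag scores

model = SharedTagger()
x = torch.randint(0, 1000, (2, 7))         # batch of 2 sentences, length 7
print(model(x, "ner").shape)               # torch.Size([2, 7, 9])
```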

  40. Transfer Learning CWS + NER Shared

  41. Joint Training v Simply linearly combine the two objectives. v Alternate updates of each module’s parameters.
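A toy sketch of the alternating-update scheme: one shared parameter vector, two task losses, and gradient steps that alternate between the tasks. The quadratic losses are purely illustrative stand-ins for the two task objectives.

```python
import numpy as np

# Toy joint training: one shared parameter vector, two task losses.
w = np.zeros(2)
grad_task1 = lambda w: 2 * (w - np.array([1.0, 0.0]))   # gradient of ||w - t1||^2
grad_task2 = lambda w: 2 * (w - np.array([0.0, 1.0]))   # gradient of ||w - t2||^2

lr = 0.1
for step in range(200):
    g = grad_task1(w) if step % 2 == 0 else grad_task2(w)   # alternate between tasks
    w -= lr * g
print(w)   # settles near a compromise between the two task optima
```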

  42. How to deal with sequence output? v Idea 1: combine DL with CRF v Idea 2: introduce structure in DL ML in NLP 42

  43. Advanced ML: Inference 43

  44. Advanced ML: Inference 44

  45. Advanced ML: Inference 45

  46. Advanced ML: Inference 46

  47. Advanced ML: Inference 47
