
CSE 517 Natural Language Processing, Winter 2019: Deep Learning

Yejin Choi, University of Washington. The next several slides are from Carlos Guestrin and Luke Zettlemoyer. Human neurons: switching time ~0.001 second; number of neurons: 10


  1. LSTMs (Long Short-Term Memory Networks)
     • Forget gate (forget the past or not): $f_t = \sigma(U^{(f)} x_t + W^{(f)} h_{t-1} + b^{(f)})$
     • Input gate (use the input or not): $i_t = \sigma(U^{(i)} x_t + W^{(i)} h_{t-1} + b^{(i)})$
     • Output gate (output from the new cell or not): $o_t = \sigma(U^{(o)} x_t + W^{(o)} h_{t-1} + b^{(o)})$
     • New cell content (temp): $\tilde{c}_t = \tanh(U^{(c)} x_t + W^{(c)} h_{t-1} + b^{(c)})$
     • New cell content (mix the old cell with the new temp cell): $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
     • Hidden state: $h_t = o_t \odot \tanh(c_t)$
     Figure by Christopher Olah (colah.github.io)

  2. LSTMs (Long Short-Term Memory Networks)
     • The same gate and cell equations as above (forget gate $f_t$, input gate $i_t$, output gate $o_t$, temp cell $\tilde{c}_t$, cell $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, hidden state $h_t = o_t \odot \tanh(c_t)$), shown alongside a diagram annotated with $c_{t-1}$, $c_t$, $h_{t-1}$, $h_t$.
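To make the gating equations above concrete, here is a minimal NumPy sketch of a single LSTM step. It mirrors the slide's notation (U, W, b per gate), but the parameter names, shapes, and the toy setup are assumptions for illustration, not the course's reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the slide's equations.

    params holds (U, W, b) triples for the forget (f), input (i),
    output (o) gates and the candidate cell (c). Assumed shapes:
    U_*: (d_hid, d_in), W_*: (d_hid, d_hid), b_*: (d_hid,).
    """
    f_t = sigmoid(params["U_f"] @ x_t + params["W_f"] @ h_prev + params["b_f"])      # forget gate
    i_t = sigmoid(params["U_i"] @ x_t + params["W_i"] @ h_prev + params["b_i"])      # input gate
    o_t = sigmoid(params["U_o"] @ x_t + params["W_o"] @ h_prev + params["b_o"])      # output gate
    c_tilde = np.tanh(params["U_c"] @ x_t + params["W_c"] @ h_prev + params["b_c"])  # temp cell
    c_t = f_t * c_prev + i_t * c_tilde   # mix old cell with new temp cell
    h_t = o_t * np.tanh(c_t)             # hidden state
    return h_t, c_t

# Tiny usage example with random parameters (d_in=3, d_hid=4).
rng = np.random.default_rng(0)
d_in, d_hid = 3, 4
params = {}
for g in ("f", "i", "o", "c"):
    params[f"U_{g}"] = rng.normal(scale=0.1, size=(d_hid, d_in))
    params[f"W_{g}"] = rng.normal(scale=0.1, size=(d_hid, d_hid))
    params[f"b_{g}"] = np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), params)
```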

  3. Vanishing gradient problem for RNNs
     • The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
     • The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network 'forgets' the first inputs.
     Example from Graves 2012

  4. Preservation of gradient information by LSTM
     • For simplicity, all gates are either entirely open ('O') or closed ('—').
     • The memory cell 'remembers' the first input as long as the forget gate is open and the input gate is closed.
     • The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.
     Example from Graves 2012

  5. Recurrent Neural Networks (RNNs)
     • Generic RNNs: $h_t = f(x_t, h_{t-1})$
     • Vanilla RNNs: $h_t = \tanh(U x_t + W h_{t-1} + b)$
     • GRUs (Gated Recurrent Units) (Cho et al., 2014):
       $z_t = \sigma(U^{(z)} x_t + W^{(z)} h_{t-1} + b^{(z)})$  (Z: update gate)
       $r_t = \sigma(U^{(r)} x_t + W^{(r)} h_{t-1} + b^{(r)})$  (R: reset gate)
       $\tilde{h}_t = \tanh(U^{(h)} x_t + W^{(h)} (r_t \odot h_{t-1}) + b^{(h)})$
       $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
     • Fewer parameters than LSTMs. Easier to train for comparable performance!
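A matching minimal NumPy sketch of one GRU step, following the slide's update/reset equations; parameter names and shapes are assumptions (same conventions as the LSTM sketch above).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step (Cho et al., 2014) as written on the slide.

    p holds (U, W, b) triples for the update gate (z), reset gate (r),
    and candidate hidden state (h). Assumed shapes:
    U_*: (d_hid, d_in), W_*: (d_hid, d_hid), b_*: (d_hid,).
    """
    z_t = sigmoid(p["U_z"] @ x_t + p["W_z"] @ h_prev + p["b_z"])              # update gate
    r_t = sigmoid(p["U_r"] @ x_t + p["W_r"] @ h_prev + p["b_r"])              # reset gate
    h_tilde = np.tanh(p["U_h"] @ x_t + p["W_h"] @ (r_t * h_prev) + p["b_h"])  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                                # interpolate
    return h_t
```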

  6. RNN Learning: Backprop Through Time (BPTT)
     • Similar to backprop with non-recurrent NNs.
     • But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters.
     • Backprop gradients to the parameters of each unit as if they were different parameters.
     • When updating the parameters using the gradients, use the average of the gradients throughout the entire chain of units.
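A minimal NumPy sketch of BPTT for the vanilla RNN from slide 5, assuming (as an illustration) a squared-error loss on the final hidden state only. Per-copy gradients are accumulated over the unrolled chain and then averaged, matching the slide's description; summing instead of averaging is also common in practice.

```python
import numpy as np

def bptt_vanilla_rnn(xs, target, U, W, b):
    """BPTT for h_t = tanh(U x_t + W h_{t-1} + b) with loss
    L = 0.5 * ||h_T - target||^2 on the final hidden state.
    xs: list of input vectors; U: (d_hid, d_in); W: (d_hid, d_hid); b: (d_hid,).
    """
    T = len(xs)
    hs = [np.zeros(W.shape[0])]                 # h_0
    for x_t in xs:                              # forward pass
        hs.append(np.tanh(U @ x_t + W @ hs[-1] + b))
    gU, gW, gb = np.zeros_like(U), np.zeros_like(W), np.zeros_like(b)
    delta = hs[-1] - target                     # dL/dh_T
    for t in range(T, 0, -1):                   # backward pass through the chain
        da = delta * (1.0 - hs[t] ** 2)         # backprop through tanh
        gU += np.outer(da, xs[t - 1])           # gradient of the t-th "copy" of U
        gW += np.outer(da, hs[t - 1])           # gradient of the t-th "copy" of W
        gb += da
        delta = W.T @ da                        # pass the gradient to h_{t-1}
    return gU / T, gW / T, gb / T               # average over the unrolled copies
```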

  7. Gates
     • Gates contextually control information flow.
     • They open/close with a sigmoid.
     • In LSTMs and GRUs, they are used to (contextually) maintain longer-term history.

  8. Bi-directional RNNs
     • Can incorporate context from both directions.
     • Generally improve over uni-directional RNNs.

  9. Google NMT (Oct 2016)

  10. Tree LSTMs
     • Are tree LSTMs more expressive than sequence LSTMs? I.e., recursive vs. recurrent.
     • When Are Tree Structures Necessary for Deep Learning of Representations? Jiwei Li, Minh-Thang Luong, Dan Jurafsky, and Eduard Hovy. EMNLP, 2015.

  11. Recursive Neural Networks
     • Sometimes, inference over a tree structure makes more sense than a sequential structure.
     • An example of compositionality in ideological bias detection (red → conservative, blue → liberal, gray → neutral), in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree.
     Example from Iyyer et al., 2014

  12. Recursive Neural Networks
     • NNs connected as a tree.
     • The tree structure is fixed a priori.
     • Parameters are shared, similarly to RNNs.
     Example from Iyyer et al., 2014

  13. Neural Probabilistic Language Model (Bengio 2003)

  14. Neural Probabilistic Language Model (Bengio 2003)
     • Each word prediction is a separate feedforward neural network.
     • The feedforward NNLM is a Markovian language model.
     • Dashed lines show optional direct connections.
     $NN_{DMLP1}(x) = [\tanh(x W^1 + b^1), x] W^2 + b^2$
     • $W^1 \in \mathbb{R}^{d_{in} \times d_{hid}}$, $b^1 \in \mathbb{R}^{1 \times d_{hid}}$: first affine transformation
     • $W^2 \in \mathbb{R}^{(d_{hid} + d_{in}) \times d_{out}}$, $b^2 \in \mathbb{R}^{1 \times d_{out}}$: second affine transformation
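A minimal NumPy sketch of the scorer above, with a softmax on top to turn scores into a next-word distribution. The function names and the toy dimensions are assumptions; the concatenated second block of W2 implements the optional direct connections.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nnlm_logits(x, W1, b1, W2, b2):
    """NN_DMLP1(x) = [tanh(x W1 + b1), x] W2 + b2.
    x: (1, d_in) row vector of concatenated context-word embeddings."""
    hidden = np.tanh(x @ W1 + b1)                    # (1, d_hid)
    combined = np.concatenate([hidden, x], axis=1)   # (1, d_hid + d_in)
    return combined @ W2 + b2                        # (1, d_out) vocabulary scores

# Toy usage: 10-dim context input, hidden size 8, vocabulary of 7 words.
rng = np.random.default_rng(1)
d_in, d_hid, d_out = 10, 8, 7
x = rng.normal(size=(1, d_in))
W1, b1 = rng.normal(scale=0.1, size=(d_in, d_hid)), np.zeros((1, d_hid))
W2, b2 = rng.normal(scale=0.1, size=(d_hid + d_in, d_out)), np.zeros((1, d_out))
p_next_word = softmax(nnlm_logits(x, W1, b1, W2, b2))
```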

  15. ATTENTION!

  16. Encoder-Decoder Architecture (Sequence-to-Sequence)
     (Diagram: the source "the red dog" ($x_1, x_2, x_3$) is encoded into states $s^s_1, s^s_2, s^s_3$; the decoder, starting from <s>, produces target states $s^t_1, s^t_2, s^t_3$ and predictions $\hat{y}_1, \hat{y}_2, \hat{y}_3$.)
     Diagram borrowed from Alex Rush

  17. Trial: Hard Attention
     • At each step, when generating the target word from decoder state $s^t_i$:
     • compute the best alignment to a source word (state $s^s_j$),
     • and incorporate that source word to generate the target word: $y^t_i = \mathrm{argmax}_y\, O(y, s^t_i, s^s_j)$
     • Contextual hard alignment. How?
       $z_j = \tanh([s^t_i, s^s_j] W + b)$
       $j = \mathrm{argmax}_j\, z_j$
     • Problem?

  18. Attention: Soft Alignments
     • At each step, when generating the target word from decoder state $s^t_i$:
     • compute the attention $c$ over the source sequence states $s^s$,
     • and incorporate the attention to generate the target word: $y^t_i = \mathrm{argmax}_y\, O(y, s^t_i, c)$
     • Contextual attention as soft alignment. How?
       Step 1: compute the attention weights: $z_j = \tanh([s^t_i, s^s_j] W + b)$, $\alpha = \mathrm{softmax}(z)$
       Step 2: compute the attention vector as an interpolation: $c = \sum_j \alpha_j s^s_j$
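A minimal NumPy sketch of the two steps above, using the feedforward scorer from the slide; the function signature and shapes are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_attention(s_t_i, S_src, W, b):
    """Soft alignment as on the slide.

    s_t_i : current decoder (target) state, shape (d_t,)
    S_src : source states stacked row-wise, shape (n_src, d_s)
    W, b  : scorer parameters, W: (d_t + d_s,), b: scalar
    Returns the attention weights alpha and context c = sum_j alpha_j * s_s_j.
    """
    z = np.array([np.tanh(np.concatenate([s_t_i, s_j]) @ W + b) for s_j in S_src])
    alpha = softmax(z)    # step 1: attention weights over source positions
    c = alpha @ S_src     # step 2: interpolation of source states
    return alpha, c
```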

  19. Attention
     Diagram borrowed from Alex Rush

  20. Attention parameterization
     • Feedforward NNs: $z_j = \tanh([s^t_i; s^s_j] W + b)$ or $z_j = \tanh([s^t_i; s^s_j; s^t_i \odot s^s_j] W + b)$
     • Dot product: $z_j = s^t_i \cdot s^s_j$
     • Cosine similarity: $z_j = \dfrac{s^t_i \cdot s^s_j}{\|s^t_i\| \, \|s^s_j\|}$
     • Bi-linear models: $z_j = {s^t_i}^{\top} W s^s_j$
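The four parameterizations as tiny NumPy functions; names and shapes are assumptions, each returning a scalar score for one decoder state and one source state.

```python
import numpy as np

def score_concat(s_t, s_s, W, b):
    """Feedforward scorer: z = tanh([s_t; s_s] W + b); W: (d_t + d_s,), b: scalar."""
    return np.tanh(np.concatenate([s_t, s_s]) @ W + b)

def score_dot(s_t, s_s):
    """Dot product: z = s_t . s_s (assumes equal dimensions)."""
    return s_t @ s_s

def score_cosine(s_t, s_s, eps=1e-8):
    """Cosine similarity: z = (s_t . s_s) / (||s_t|| ||s_s||)."""
    return (s_t @ s_s) / (np.linalg.norm(s_t) * np.linalg.norm(s_s) + eps)

def score_bilinear(s_t, s_s, W):
    """Bi-linear model: z = s_t^T W s_s; W: (d_t, d_s)."""
    return s_t @ W @ s_s
```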

  21. Learned Attention!
     Diagram borrowed from Alex Rush

  22. Qualitative results
     Figure 2. Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image. "Soft" (top row) vs. "hard" (bottom row) attention. (Note that both models generated the same captions in this example.)
     Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word).
     M. Malinowski

  23. BiDAF

  24. LEARNING: TRAINING DEEP NETWORKS

  25. Vanishing / Exploding Gradients
     • Deep networks are hard to train.
     • Gradients go through multiple layers.
     • The multiplicative effect tends to lead to exploding or vanishing gradients.
     • Practical solutions w.r.t.:
       – network architecture
       – numerical operations

  26. Vanishing / Exploding Gradients
     • Practical solutions w.r.t. network architecture:
       – Add skip connections to reduce distance
         • Residual networks, highway networks, …
       – Add gates (and memory cells) to allow longer-term memory
         • LSTMs, GRUs, memory networks, …

  27. Highway Network (Srivastava et al., 2015)
     • A plain feedforward neural network: $y = H(x, W_H)$
       – $H$ is a typical affine transformation followed by a non-linear activation.
     • Highway network: $y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)$
       – $T$ is a "transform gate"
       – $C$ is a "carry gate"
       – Often $C = 1 - T$ for simplicity
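A minimal NumPy sketch of one highway layer with the common C = 1 - T simplification. The choice of ReLU inside H and the parameter names are assumptions; input and output dimensions must match so the carry path can add x directly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)).

    H: affine transform + nonlinearity (ReLU here, an assumption).
    T: transform gate (sigmoid of an affine transform).
    x: (batch, d); W_H, W_T: (d, d); b_H, b_T: (d,).
    """
    H = np.maximum(0.0, x @ W_H + b_H)   # H(x, W_H)
    T = sigmoid(x @ W_T + b_T)           # transform gate
    return H * T + x * (1.0 - T)         # carry gate C = 1 - T
```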

  28. Residual Networks
     • Plain net: any two stacked weight layers (with ReLU) directly compute the desired mapping.
     • Residual net: the stacked layers compute a residual $F(x)$, and an identity skip connection adds the input back, so the block outputs $F(x) + x$.
     • ResNet (He et al. 2015): the first very deep (152 layers) network successfully trained for object recognition.

  29. Residual Networks
     • $F(x)$ is a residual mapping with respect to the identity.
     • The direct input connection ($+x$) leads to a nice property w.r.t. backpropagation: more direct influence from the final loss to any deep layer.
     • In contrast, LSTMs and Highway networks allow for long-distance input connections only through "gates".
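A minimal NumPy sketch of a residual block in the spirit of He et al. (2015), simplified to fully-connected layers (an assumption; the original uses convolutions and batch normalization). The key point is the identity term added to F(x).

```python
import numpy as np

def residual_block(x, W1, b1, W2, b2):
    """out = ReLU(F(x) + x), where F is two affine layers with a ReLU in between.
    The identity skip connection (+x) gives gradients a direct path back to
    earlier layers. x: (batch, d); W1, W2: (d, d); b1, b2: (d,).
    """
    F = np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # residual mapping F(x)
    return np.maximum(0.0, F + x)                # add identity, then nonlinearity
```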

  30. Residual Networks: Revolution of Depth
     (Figure: layer-by-layer architecture diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014).)
     Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

  31. Residual Networks: Revolution of Depth
     (Figure: layer-by-layer architecture diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015).)
     Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

  32. Residual Networks: Revolution of Depth
     ImageNet classification top-5 error (%):
     • ILSVRC'10 (shallow): 28.2
     • ILSVRC'11: 25.8
     • ILSVRC'12 (AlexNet, 8 layers): 16.4
     • ILSVRC'13 (8 layers): 11.7
     • ILSVRC'14 (VGG, 19 layers): 7.3
     • ILSVRC'14 (GoogleNet, 22 layers): 6.7
     • ILSVRC'15 (ResNet, 152 layers): 3.57
     Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

  33. Vanishing / Exploding Gradients
     • Practical solutions w.r.t. numerical operations:
       – Gradient clipping: bound gradients by a max value
       – Gradient normalization: renormalize gradients when they are above a fixed norm
       – Careful initialization, smaller learning rates
       – Avoid saturating nonlinearities (like tanh, sigmoid)
         • Use ReLU or hard tanh instead
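A minimal NumPy sketch of the first two numerical remedies; the threshold values are illustrative assumptions, and `grads` is a list of gradient arrays for all parameters.

```python
import numpy as np

def clip_gradients(grads, max_value=1.0):
    """Gradient clipping: bound every gradient entry to [-max_value, max_value]."""
    return [np.clip(g, -max_value, max_value) for g in grads]

def normalize_gradients(grads, max_norm=5.0):
    """Gradient normalization: if the global norm of all gradients exceeds
    max_norm, rescale them so the global norm equals max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads
```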

  34. Sigmoid
     • Often used for gates.
     $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, $\quad \sigma'(x) = \sigma(x)(1 - \sigma(x))$
     • Pro: neuron-like, differentiable.
     • Con: gradients saturate to zero almost everywhere except for $x$ near zero => vanishing gradients.
     • Batch normalization helps.

  35. Tanh
     • Often used for hidden states and cells in RNNs and LSTMs.
     $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$, $\quad \tanh'(x) = 1 - \tanh^2(x)$, $\quad \tanh(x) = 2\sigma(2x) - 1$
     • Pro: differentiable, often converges faster than sigmoid.
     • Con: gradients easily saturate to zero => vanishing gradients.

  36. Hard Tanh
     $\mathrm{hardtanh}(t) = \begin{cases} -1 & t < -1 \\ t & -1 \le t \le 1 \\ 1 & t > 1 \end{cases}$
     • Pro: computationally cheaper.
     • Con: saturates to zero easily; not differentiable at $-1$ and $1$.

  37. ReLU
     $\mathrm{ReLU}(x) = \max(0, x)$, $\quad \dfrac{d\,\mathrm{ReLU}(x)}{dx} = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ 1 \text{ or } 0 & \text{otherwise} \end{cases}$
     • Pro: doesn't saturate for $x > 0$, computationally cheaper, induces sparse NNs.
     • Con: non-differentiable at 0.
     • Used widely in deep NNs, but not as much in RNNs.
     • We informally use subgradients at 0.
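The four activations from slides 34-37 and their derivatives as one small NumPy sketch; the convention of using 0 for the ReLU subgradient at x = 0 is an assumption (the slide allows 1 or 0).

```python
import numpy as np

def sigmoid(x):            # saturates away from 0 -> vanishing gradients
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):               # note tanh(x) = 2*sigmoid(2x) - 1
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def hard_tanh(x):          # piecewise linear: -1 below -1, x on [-1, 1], 1 above 1
    return np.clip(x, -1.0, 1.0)

def relu(x):               # does not saturate for x > 0
    return np.maximum(0.0, x)

def d_relu(x):             # subgradient convention: 0 at x == 0 (an assumption)
    return (x > 0).astype(float)
```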

  38. Vanishing / Exploding Gradients
     • Practical solutions w.r.t. numerical operations:
       – Gradient clipping: bound gradients by a max value
       – Gradient normalization: renormalize gradients when they are above a fixed norm
       – Careful initialization, smaller learning rates
       – Avoid saturating nonlinearities (like tanh, sigmoid)
         • Use ReLU or hard tanh instead
       – Batch normalization: add intermediate input normalization layers

  39. Batch Normalization
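A minimal NumPy sketch of batch normalization in training mode; the learned scale/shift names (gamma, beta) follow common usage and are assumptions here. At test time, running averages of the batch statistics are typically used instead of per-batch statistics.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Batch normalization (training mode) for X of shape (batch, features):
    normalize each feature to zero mean / unit variance using the batch
    statistics, then apply a learned scale (gamma) and shift (beta)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta
```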

  40. Regularization
     • Regularization by an objective term: modify the loss with L1 or L2 norms, e.g.
       $L(\theta) = \sum_{i=1}^{n} \max\{0, 1 - (\hat{y}_c - \hat{y}_{c'})\} + \lambda \|\theta\|^2$
     • Less depth, smaller hidden states, early stopping.
     • Dropout:
       – Randomly delete parts of the network during training.
       – Each node (and its corresponding incoming and outgoing edges) is dropped with a probability $p$.
       – $p$ is higher for internal nodes, lower for input nodes.
       – The full network is used for testing.
       – Faster training, better results.
       – Vs. bagging.
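A minimal NumPy sketch of dropout using the inverted-dropout convention (rescale survivors at training time so test time needs no rescaling); the convention and function names are assumptions, consistent with the slide's point that the full network is used for testing.

```python
import numpy as np

def dropout_train(h, p_drop, rng):
    """Inverted dropout at training time: drop each unit with probability
    p_drop and rescale the survivors by 1 / (1 - p_drop)."""
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

def dropout_test(h):
    """At test time, dropout is a no-op under the inverted-dropout convention."""
    return h

# Usage: drop hidden units with p = 0.5 during training.
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8))
h_train = dropout_train(h, p_drop=0.5, rng=rng)
```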

  41. Convergence of backprop
     • Without non-linearity or hidden layers, learning is convex optimization
       – Gradient descent reaches the global minimum.
     • Multilayer neural nets (with nonlinearity) are not convex
       – Gradient descent gets stuck in local minima.
       – Selecting the number of hidden units and layers = fuzzy process.
       – NNs have made a HUGE comeback in the last few years.
     • Neural nets are back with a new name
       – Deep belief networks
       – Huge error reduction when trained with lots of data on GPUs

  42. SUPPLEMENTARY TOPICS

  43. POINTER NETWORKS

  44. Pointer Networks! (Vinyals et al. 2015)
     • NNs with attention: content-based attention to the input.
     • Pointer networks: location-based attention to the input.
     • Applications: convex hull, Delaunay triangulation, traveling salesman.

  45. Pointer Networks
     (a) Sequence-to-Sequence vs. (b) Ptr-Net

  46. Pointer Networks: Attention Mechanism vs. Pointer Networks
     • In the Ptr-Net, the softmax normalizes the score vector $e_{ij}$ to be an output distribution over the dictionary of inputs.
     Diagram borrowed from Keon Kim
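A minimal NumPy sketch of the pointer-style output distribution, using the additive scoring form from Vinyals et al. (2015); the parameter names and shapes here are assumptions. Unlike content-based attention, the softmax over input positions is itself the output.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pointer_distribution(d_i, encoder_states, W1, W2, v):
    """Score each input position j with e_ij = v . tanh(W1 e_j + W2 d_i),
    then softmax over positions to get a distribution over input locations.

    d_i: decoder state (d_dec,); encoder_states: list/array of (d_enc,) states.
    Assumed shapes: W1: (d_att, d_enc), W2: (d_att, d_dec), v: (d_att,).
    """
    scores = np.array([v @ np.tanh(W1 @ e_j + W2 @ d_i) for e_j in encoder_states])
    return softmax(scores)   # probability of "pointing" to each input position
```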

  47. CopyNet (Gu et al. 2016)
     • Conversation
       – I: Hello Jack, my name is Chandralekha.
       – R: Nice to meet you, Chandralekha.
       – I: This new guy doesn't perform exactly as expected.
       – R: What do you mean by "doesn't perform exactly as expected"?
     • Translation

  48. CopyNet (Gu et al. 2016)
     (Figure: (a) attention-based encoder-decoder (RNNSearch); (b) generate-mode & copy-mode, where Prob("Jebara") = Prob("Jebara", g) + Prob("Jebara", c) under a softmax over the vocabulary plus the source words; (c) state update with an attentive read and a selective read for "Tony".)

  49. CopyNet (Gu et al. 2016)
     • Key idea: interpolation between a generation model and a copy model:
       $p(y_t \mid s_t, y_{t-1}, c_t, M) = p(y_t, g \mid s_t, y_{t-1}, c_t, M) + p(y_t, c \mid s_t, y_{t-1}, c_t, M)$   (4)
     • Generate-mode: the same scoring function as in the generic RNN encoder-decoder (Bahdanau et al., 2014) is used:
       $p(y_t, g \mid \cdot) = \begin{cases} \frac{1}{Z} e^{\psi_g(y_t)} & y_t \in V \\ 0 & y_t \in X \cap \bar{V} \\ \frac{1}{Z} e^{\psi_g(\mathrm{UNK})} & y_t \notin V \cup X \end{cases}$   (5)
       $\psi_g(y_t = v_i) = v_i^{\top} W_o s_t, \quad v_i \in V \cup \{\mathrm{UNK}\}$   (7)
       where $W_o \in \mathbb{R}^{(N+1) \times d_s}$ and $v_i$ is the one-hot indicator vector for $v_i$.
     • Copy-mode: the score for "copying" the word $x_j$ is calculated as
       $p(y_t, c \mid \cdot) = \begin{cases} \frac{1}{Z} \sum_{j: x_j = y_t} e^{\psi_c(x_j)} & y_t \in X \\ 0 & \text{otherwise} \end{cases}$   (6)
       $\psi_c(y_t = x_j) = \sigma(h_j^{\top} W_c)\, s_t$   (8)
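An illustrative NumPy sketch of the shared-normalization mixture in Eq. (4)-(6): generate-mode scores over the vocabulary and copy-mode scores over source positions share one normalizer, and a word's total probability sums copy mass from every source position bearing that word. The function name and inputs are assumptions; the actual scoring functions psi_g and psi_c are as defined on the slide.

```python
import numpy as np

def copynet_output_distribution(score_gen, score_copy, source_tokens, vocab):
    """score_gen: unnormalized generate-mode scores, one per vocabulary word.
    score_copy: unnormalized copy-mode scores, one per source position.
    Returns a dict mapping each candidate word to p(y_t) = p(y_t, g) + p(y_t, c)."""
    all_scores = np.concatenate([score_gen, score_copy])
    exp_scores = np.exp(all_scores - all_scores.max())
    Z = exp_scores.sum()                               # shared normalizer
    p_gen = exp_scores[: len(score_gen)] / Z           # generate-mode mass per word
    p_copy_pos = exp_scores[len(score_gen):] / Z       # copy-mode mass per position
    p = {w: p_gen[k] for k, w in enumerate(vocab)}
    for j, tok in enumerate(source_tokens):            # add copy mass for matching tokens
        p[tok] = p.get(tok, 0.0) + p_copy_pos[j]       # OOV source words get copy mass only
    return p
```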

  50. CONVOLUTIONAL NEURAL NETWORKS
     Next several slides borrowed from Alex Rush

  51. Models with Sliding Windows
     • Classification/prediction with sliding windows
       – E.g., a neural language model
     • Feature representations with a sliding window
       – E.g., sequence tagging with CRFs or structured perceptron
     [ w1 w2 w3 w4 w5 ] w6 w7 w8
     w1 [ w2 w3 w4 w5 w6 ] w7 w8
     w1 w2 [ w3 w4 w5 w6 w7 ] w8
     ...

  52. Sliding Windows w/ Convolution
     • Let our input be the embeddings of the full sentence, $X \in \mathbb{R}^{n \times d_0}$:
       $X = [v(w_1), v(w_2), v(w_3), \ldots, v(w_n)]$
     • Define a window model as $NN_{window}: \mathbb{R}^{1 \times (d_{win} d_0)} \mapsto \mathbb{R}^{1 \times d_{hid}}$:
       $NN_{window}(x_{win}) = x_{win} W^1 + b^1$
     • The convolution is defined as $NN_{conv}: \mathbb{R}^{n \times d_0} \mapsto \mathbb{R}^{(n - d_{win} + 1) \times d_{hid}}$:
       $NN_{conv}(X) = \tanh \begin{bmatrix} NN_{window}(X_{1:d_{win}}) \\ NN_{window}(X_{2:d_{win}+1}) \\ \vdots \\ NN_{window}(X_{n-d_{win}+1:n}) \end{bmatrix}$
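A minimal NumPy sketch of NN_conv as defined above: the same window model is applied to every length-d_win span, then tanh is taken row-wise. The function name and the flattening convention are assumptions.

```python
import numpy as np

def nn_conv(X, W1, b1, d_win):
    """Sliding-window convolution from the slide.

    X: (n, d0) sentence matrix of word embeddings.
    W1: (d_win * d0, d_hid), b1: (d_hid,) -- shared window model NN_window.
    Returns a (n - d_win + 1, d_hid) matrix of window features.
    """
    n, d0 = X.shape
    rows = []
    for start in range(n - d_win + 1):
        x_win = X[start:start + d_win].reshape(1, d_win * d0)  # flatten the window
        rows.append(x_win @ W1 + b1)                           # NN_window(x_win)
    return np.tanh(np.vstack(rows))
```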

  53. Pooling Operations
     • Pooling "over-time" operations $f: \mathbb{R}^{n \times m} \mapsto \mathbb{R}^{1 \times m}$:
       1. $f_{max}(X)_{1,j} = \max_i X_{i,j}$
       2. $f_{min}(X)_{1,j} = \min_i X_{i,j}$
       3. $f_{mean}(X)_{1,j} = \sum_i X_{i,j} / n$

  54. Convolution + Pooling
     $\hat{y} = \mathrm{softmax}(f_{max}(NN_{conv}(X)) W^2 + b^2)$
     • $W^2 \in \mathbb{R}^{d_{hid} \times d_{out}}$, $b^2 \in \mathbb{R}^{1 \times d_{out}}$
     • The final linear layer $W^2$ uses the learned window features.
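A minimal NumPy sketch of the pooling-plus-classification step above. H stands for the (n - d_win + 1, d_hid) output of NN_conv (e.g., from the nn_conv sketch earlier); the function names are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def conv_pool_classify(H, W2, b2):
    """y_hat = softmax(f_max(H) W2 + b2).

    H: (n_windows, d_hid) window features; W2: (d_hid, d_out); b2: (1, d_out).
    """
    pooled = H.max(axis=0, keepdims=True)   # f_max: pool over time -> (1, d_hid)
    return softmax(pooled @ W2 + b2)        # (1, d_out) class distribution
```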

  55. Multiple Convolutions
     $\hat{y} = \mathrm{softmax}([f(NN^1_{conv}(X)), f(NN^2_{conv}(X)), \ldots, f(NN^f_{conv}(X))] W^2 + b^2)$
     • Concatenate several convolutions together.
     • Each $NN^1_{conv}$, $NN^2_{conv}$, etc. uses a different $d_{win}$.
     • Allows for different window sizes (similar to multiple n-grams).

  56. Convolution Diagram (Kim 2014)
     • $n = 9$, $d_{hid} = 4$, $d_{out} = 2$
     • red: $d_{win} = 2$, blue: $d_{win} = 3$ (ignore the back channel)

  57. Text Classification (Kim 2014)

  58. AlexNet (Krizhevsky et al., 2012)

  59. Discussion Points
     • Strengths and challenges of deep learning? … what do NNs think about this?

  60. Discussion Points
     • Strengths and challenges of deep learning?
     • Representation learning
       – Less effort on feature engineering (at the cost of more hyperparameter tuning!)
       – In computer vision: NN-learned representations are significantly better than human-engineered features.
       – In NLP: often an NN-induced representation is concatenated with additional human-engineered features.
     • Data
       – Most successes come from massive amounts of clean (expensive) data.
       – Recent surge of data-creation-type papers (especially AI-challenge-type tasks).
       – This significantly limits the domains & applications.
       – Need stronger models for unsupervised & distantly supervised approaches.

  61. Discussion Points
     • Strengths and challenges of deep learning?
     • Architecture
       – Allows for flexible, expressive, and creative modeling.
     • Easier entry to the field
       – Recent breakthroughs come more from engineering advancements than theoretical advancements.
       – Several NN platforms, code-sharing culture.

  62. LEARNING: BACKPROPAGATION

  63. Inside-outside and forward-backward algorithms are just backprop. Jason Eisner (2016). In EMNLP Workshop on Structured Prediction for NLP.

