Recurrent Neural Networks (RNNs)
• Each RNN unit computes a new hidden state from the previous state and a new input: h_t = f(x_t, h_{t-1})
• Each RNN unit (optionally) produces an output from the current hidden state: y_t = softmax(V h_t)
• Hidden states h_t ∈ R^D are continuous vectors
  – Can represent very rich information
  – Possibly the entire history from the beginning
• Parameters are shared (tied) across all RNN units (unlike feedforward NNs)
[Figure: unrolled RNN with hidden states h_1 … h_4 and outputs y_1 … y_4]
Recurrent Neural Networks (RNNs)
• Generic RNN:
  h_t = f(x_t, h_{t-1})
  y_t = softmax(V h_t)
• Vanilla RNN:
  h_t = tanh(U x_t + W h_{t-1} + b)
  y_t = softmax(V h_t)
[Figure: unrolled RNN with hidden states h_1 … h_4 and outputs y_1 … y_4]
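A minimal NumPy sketch of one vanilla-RNN step following the equations above; the shapes, initialization, and toy sequence are illustrative assumptions, not part of the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def vanilla_rnn_step(x_t, h_prev, U, W, b, V):
    """h_t = tanh(U x_t + W h_{t-1} + b);  y_t = softmax(V h_t)"""
    h_t = np.tanh(U @ x_t + W @ h_prev + b)
    y_t = softmax(V @ h_t)
    return h_t, y_t

# Toy usage: the same (shared) parameters are applied at every time step.
D_in, D_hid, D_out = 4, 8, 5
rng = np.random.default_rng(0)
U = rng.normal(size=(D_hid, D_in)) * 0.1
W = rng.normal(size=(D_hid, D_hid)) * 0.1
b = np.zeros(D_hid)
V = rng.normal(size=(D_out, D_hid)) * 0.1

h = np.zeros(D_hid)
for x in rng.normal(size=(3, D_in)):    # a toy sequence of 3 inputs
    h, y = vanilla_rnn_step(x, h, U, W, b, V)
```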
Recurrent Neural Networks (RNNs)
• Generic RNN: h_t = f(x_t, h_{t-1})
• Vanilla RNN: h_t = tanh(U x_t + W h_{t-1} + b)
• LSTMs (Long Short-Term Memory Networks):
  i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
  f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
  o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
  c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
  h_t = o_t ⊙ tanh(c_t)
  (c_t: cell state, h_t: hidden state)
  There are many known variations to this set of equations!
[Figure: unrolled LSTM with cell states c_1 … c_4, hidden states h_1 … h_4, and outputs y_1 … y_4]
Many uses of RNNs
1. Classification (seq to one)
• Input: a sequence
• Output: one label (classification)
• Example: sentiment classification
  h_t = f(x_t, h_{t-1})
  y = softmax(V h_n)
[Figure: unrolled RNN; a single output is produced from the final hidden state]
Many uses of RNNs
2. One to seq
• Input: one item
• Output: a sequence
• Example: image captioning ("Cat sitting on top of …")
  h_t = f(x_t, h_{t-1})
  y_t = softmax(V h_t)
[Figure: a single input unrolled into a sequence of outputs]
Many uses of RNNs
3. Sequence tagging
• Input: a sequence
• Output: a sequence (of the same length)
• Example: POS tagging, Named Entity Recognition
• How about language models?
  – Yes! RNNs can be used as LMs!
  – RNNs make the Markov assumption: T/F?
  h_t = f(x_t, h_{t-1})
  y_t = softmax(V h_t)
[Figure: unrolled RNN with one output per input]
Many uses of RNNs
4. Language models
• Input: a sequence of words
• Output: the next word, or a sequence of next words
• During training, x_t is the actual word in the training sentence.
• During testing, x_t is the word predicted at the previous time step.
• Do RNN LMs make the Markov assumption?
  – i.e., does the next word depend only on the previous N words?
  h_t = f(x_t, h_{t-1})
  y_t = softmax(V h_t)
[Figure: unrolled RNN language model]
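A sketch of the test-time loop described above, where the predicted word is fed back in as the next input; `step_fn`, `embed`, and the greedy decoding choice are hypothetical names and assumptions for illustration.

```python
import numpy as np

def generate(step_fn, embed, h0, bos_id, eos_id, max_len=20):
    """step_fn(x_emb, h) -> (h_new, probability distribution over the vocabulary)."""
    h, token, output = h0, bos_id, []
    for _ in range(max_len):
        h, probs = step_fn(embed[token], h)   # one RNN step on the previous word
        token = int(np.argmax(probs))         # greedy choice (sampling also works)
        if token == eos_id:
            break
        output.append(token)
    return output
```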
Many uses of RNNs
5. Seq2seq (aka "encoder-decoder")
• Input: a sequence
• Output: a sequence (of a different length)
• Examples?
  h_t = f(x_t, h_{t-1})
  y_t = softmax(V h_t)
[Figure: encoder RNN over the input sequence, followed by a decoder RNN producing the output sequence]
Many uses of RNNs
5. Seq2seq (aka "encoder-decoder")
• Conversation and dialogue
• Machine translation
Figure from http://www.wildml.com/category/conversational-agents/
Many uses of RNNs
5. Seq2seq (aka "encoder-decoder")
• Parsing! — "Grammar as a Foreign Language" (Vinyals et al., 2015)
[Figure: encoder-decoder reading "John has a dog" and emitting a linearized parse tree]
Recurrent Neural Networks (RNNs)
• Generic RNN:
  h_t = f(x_t, h_{t-1})
  y_t = softmax(V h_t)
• Vanilla RNN:
  h_t = tanh(U x_t + W h_{t-1} + b)
  y_t = softmax(V h_t)
[Figure: unrolled RNN]
Recurrent Neural Networks (RNNs)
• Generic RNN: h_t = f(x_t, h_{t-1})
• Vanilla RNN: h_t = tanh(U x_t + W h_{t-1} + b)
• LSTMs (Long Short-Term Memory Networks):
  i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
  f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
  o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
  c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
  h_t = o_t ⊙ tanh(c_t)
  (c_t: cell state, h_t: hidden state)
  There are many known variations to this set of equations!
[Figure: unrolled LSTM with cell states, hidden states, and outputs]
LSTMs (Long Short-Term Memory Networks)
[Figure: one LSTM cell, with cell state c_{t-1} → c_t and hidden state h_{t-1} → h_t]
Figure by Christopher Olah (colah.github.io)
LSTMs (Long Short-Term Memory Networks)
• Forget gate: forget the past or not (sigmoid: [0,1])
  f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
Figure by Christopher Olah (colah.github.io)
LSTMs (Long Short-Term Memory Networks)
• Forget gate: forget the past or not (sigmoid: [0,1])
  f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
• Input gate: use the input or not (sigmoid: [0,1])
  i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
• New cell content (temp) (tanh: [-1,1]):
  c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
Figure by Christopher Olah (colah.github.io)
LSTMs (Long Short-Term Memory Networks)
• Forget gate: forget the past or not (sigmoid: [0,1])
  f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
• Input gate: use the input or not (sigmoid: [0,1])
  i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
• New cell content (temp) (tanh: [-1,1]):
  c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
• New cell content: mix the old cell with the new temp cell
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
Figure by Christopher Olah (colah.github.io)
LSTMs (Long Short-Term Memory Networks)
• Forget gate: forget the past or not
  f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
• Input gate: use the input or not
  i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
• New cell content (temp):
  c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
• New cell content: mix the old cell with the new temp cell
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
• Output gate: output from the new cell or not
  o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
• Hidden state:
  h_t = o_t ⊙ tanh(c_t)
Figure by Christopher Olah (colah.github.io)
LSTMs (Long Short-Term Memory Networks)
• Forget gate: forget the past or not
  f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
• Input gate: use the input or not
  i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
• Output gate: output from the new cell or not
  o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
• New cell content (temp):
  c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
• New cell content: mix the old cell with the new temp cell
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
• Hidden state:
  h_t = o_t ⊙ tanh(c_t)
[Figure: one LSTM cell, with c_{t-1} → c_t and h_{t-1} → h_t]
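A minimal NumPy sketch of one LSTM step following the slide equations; the parameter layout, shapes, and toy usage are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """params holds a (U, W, b) triple for each of the 'i', 'f', 'o', 'c' blocks."""
    gate = lambda k: params[k][0] @ x_t + params[k][1] @ h_prev + params[k][2]
    i_t = sigmoid(gate('i'))              # input gate
    f_t = sigmoid(gate('f'))              # forget gate
    o_t = sigmoid(gate('o'))              # output gate
    c_tilde = np.tanh(gate('c'))          # new cell content (temp)
    c_t = f_t * c_prev + i_t * c_tilde    # mix old cell with new content
    h_t = o_t * np.tanh(c_t)              # hidden state
    return h_t, c_t

# Toy usage
D_in, D_hid = 4, 8
rng = np.random.default_rng(0)
params = {k: (rng.normal(size=(D_hid, D_in)) * 0.1,
              rng.normal(size=(D_hid, D_hid)) * 0.1,
              np.zeros(D_hid)) for k in 'ifoc'}
h, c = np.zeros(D_hid), np.zeros(D_hid)
for x in rng.normal(size=(3, D_in)):
    h, c = lstm_step(x, h, c, params)
```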
Vanishing gradient problem for RNNs
• The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
• The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network 'forgets' the first inputs.
Example from Graves 2012
Preservation of gradient information by LSTM
• For simplicity, all gates are either entirely open ('O') or closed ('—').
• The memory cell 'remembers' the first input as long as the forget gate is open and the input gate is closed.
• The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.
Example from Graves 2012
Recurrent Neural Networks (RNNs)
• Generic RNN: h_t = f(x_t, h_{t-1})
• Vanilla RNN: h_t = tanh(U x_t + W h_{t-1} + b)
• GRUs (Gated Recurrent Units):
  z_t = σ(U^(z) x_t + W^(z) h_{t-1} + b^(z))   (z: update gate)
  r_t = σ(U^(r) x_t + W^(r) h_{t-1} + b^(r))   (r: reset gate)
  h̃_t = tanh(U^(h) x_t + W^(h) (r_t ⊙ h_{t-1}) + b^(h))
  h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
• Fewer parameters than LSTMs; easier to train for comparable performance!
[Figure: unrolled GRU]
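A NumPy sketch of one GRU step mirroring the slide equations (parameter layout and shapes are assumptions, in the same style as the LSTM sketch above).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """params holds a (U, W, b) triple for each of the 'z', 'r', 'h' blocks."""
    U_z, W_z, b_z = params['z']
    U_r, W_r, b_r = params['r']
    U_h, W_h, b_h = params['h']
    z_t = sigmoid(U_z @ x_t + W_z @ h_prev + b_z)              # update gate
    r_t = sigmoid(U_r @ x_t + W_r @ h_prev + b_r)              # reset gate
    h_tilde = np.tanh(U_h @ x_t + W_h @ (r_t * h_prev) + b_h)  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # new hidden state
```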
RNN Learning: Backprop Through Time (BPTT)
• Similar to backprop with non-recurrent NNs
• But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters…
• Backprop gradients to the parameters of each unit as if they were different parameters
• When updating the parameters, use the average of the gradients accumulated throughout the entire chain of units
[Figure: unrolled RNN]
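A didactic NumPy sketch of BPTT for a vanilla RNN. The loss (squared error on the final hidden state) and shapes are assumptions for illustration; the key point is that gradients for the shared U, W, b accumulate across every time step before the update.

```python
import numpy as np

def bptt_vanilla(xs, target, U, W, b):
    T = len(xs)
    hs = [np.zeros(W.shape[0])]
    for x in xs:                                    # forward pass: store all states
        hs.append(np.tanh(U @ x + W @ hs[-1] + b))
    loss = 0.5 * np.sum((hs[-1] - target) ** 2)

    dU, dW, db = np.zeros_like(U), np.zeros_like(W), np.zeros_like(b)
    dh = hs[-1] - target                            # dL/dh_T
    for t in range(T, 0, -1):                       # backward through time
        dpre = dh * (1.0 - hs[t] ** 2)              # through tanh
        dU += np.outer(dpre, xs[t - 1])             # gradients accumulate because
        dW += np.outer(dpre, hs[t - 1])             #   U, W, b are shared across steps
        db += dpre
        dh = W.T @ dpre                             # pass back to h_{t-1}
    # The slide's convention averages the per-step gradients; summing is
    # equivalent up to a constant factor absorbed by the learning rate.
    return loss, dU / T, dW / T, db / T

# Toy usage
D_in, D_hid = 3, 5
rng = np.random.default_rng(0)
U = rng.normal(size=(D_hid, D_in)) * 0.1
W = rng.normal(size=(D_hid, D_hid)) * 0.1
b = np.zeros(D_hid)
loss, dU, dW, db = bptt_vanilla(list(rng.normal(size=(4, D_in))), np.ones(D_hid), U, W, b)
```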
Gates
• Gates contextually control information flow
• Open/close with a sigmoid
• In LSTMs and GRUs, gates are used to (contextually) maintain longer-term history
Bi-directional RNNs
• Can incorporate context from both directions
• Generally improves over uni-directional RNNs
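A sketch of a bidirectional RNN layer in NumPy: run one RNN left-to-right, another right-to-left, and combine the per-position states. Concatenation is an assumed (but common) way to combine the two directions.

```python
import numpy as np

def rnn_states(xs, U, W, b):
    h, out = np.zeros(W.shape[0]), []
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)
        out.append(h)
    return out

def birnn(xs, fwd_params, bwd_params):
    fwd = rnn_states(xs, *fwd_params)
    bwd = rnn_states(xs[::-1], *bwd_params)[::-1]   # reversed pass, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```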
Google NMT (Oct 2016)
Tree LSTMs
• Are tree LSTMs more expressive than sequence LSTMs?
• I.e., recursive vs. recurrent
• "When Are Tree Structures Necessary for Deep Learning of Representations?" Jiwei Li, Minh-Thang Luong, Dan Jurafsky and Eduard Hovy. EMNLP, 2015.
Recursive Neural Networks
• Sometimes, inference over a tree structure makes more sense than a sequential structure
• An example of compositionality in ideological bias detection (red → conservative, blue → liberal, gray → neutral), in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree
Example from Iyyer et al., 2014
Recursive Neural Networks
• NNs connected as a tree
• Tree structure is fixed a priori
• Parameters are shared, similarly to RNNs
Example from Iyyer et al., 2014
Neural Probabilistic Language Model (Bengio 2003)
Neural Probabilistic Language Model (Bengio 2003)
• Each word prediction is a separate feedforward neural network
• The feedforward NNLM is a Markovian language model
• Dashed lines show optional direct connections:
  NN_DMLP1(x) = [tanh(x W^1 + b^1), x] W^2 + b^2
  W^1 ∈ R^{d_in × d_hid}, b^1 ∈ R^{1 × d_hid}: first affine transformation
  W^2 ∈ R^{(d_hid + d_in) × d_out}, b^2 ∈ R^{1 × d_out}: second affine transformation
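A NumPy sketch of the NN_DMLP1 layer above: the input x is concatenated with tanh(x W^1 + b^1) before the second affine map, implementing the "direct connection". Variable names and the toy initialization are assumptions.

```python
import numpy as np

def nn_dmlp1(x, W1, b1, W2, b2):
    hidden = np.tanh(x @ W1 + b1)                   # first affine + nonlinearity
    concat = np.concatenate([hidden, x], axis=-1)   # direct connection to x
    return concat @ W2 + b2                         # second affine transformation

d_in, d_hid, d_out = 6, 10, 4
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_in, d_hid)) * 0.1, np.zeros(d_hid)
W2, b2 = rng.normal(size=(d_hid + d_in, d_out)) * 0.1, np.zeros(d_out)
scores = nn_dmlp1(rng.normal(size=d_in), W1, b1, W2, b2)
```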
ATTENTION!
Encoder – Decoder Architecture (Sequence-to-Sequence)
[Figure: encoder states s^s_1, s^s_2, s^s_3 over the source "the red dog"; decoder states s^t_1, s^t_2, s^t_3 producing ŷ_1, ŷ_2, ŷ_3 from inputs "<s> the red dog"]
Diagram borrowed from Alex Rush
Trial: Hard Attention
• At each step of generating the target word, with decoder state s^t_i:
• Compute the best alignment to a source word (state s^s_j)
• And incorporate that source word to generate the target word:
  y^t_i = argmax_y O(y, s^t_i, s^s_j)
• Contextual hard alignment. How?
  z_j = tanh([s^t_i, s^s_j] W + b)
  j = argmax_j z_j
• Problem?
Attention: Soft Alignments
• At each step of generating the target word, with decoder state s^t_i:
• Compute the attention over the source sequence s^s
• And incorporate the attention vector c to generate the target word:
  y^t_i = argmax_y O(y, s^t_i, c)
• Contextual attention as soft alignment. How?
  – Step 1: compute the attention weights
    z_j = tanh([s^t_i, s^s_j] W + b)
    α = softmax(z)
  – Step 2: compute the attention vector as an interpolation
    c = Σ_j α_j s^s_j
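A NumPy sketch of the two steps above: score each source state against the decoder state with the concatenation-based scorer z_j = tanh([s^t_i, s^s_j] W + b), softmax the scores, and take the weighted sum. Shapes and the toy data are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def soft_attention(s_t, source_states, W, b):
    # Step 1: attention weights over the source states
    scores = np.array([np.tanh(np.concatenate([s_t, s_j]) @ W + b)
                       for s_j in source_states])
    alpha = softmax(scores)
    # Step 2: attention vector as an interpolation of source states
    c = (alpha[:, None] * source_states).sum(axis=0)
    return alpha, c

d = 5
rng = np.random.default_rng(0)
source = rng.normal(size=(4, d))            # 4 encoder states
W, b = rng.normal(size=2 * d) * 0.1, 0.0    # maps the concatenation to a scalar score
alpha, c = soft_attention(rng.normal(size=d), source, W, b)
```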
Attention
Diagram borrowed from Alex Rush
Seq-to-Seq with Attention
Diagram from http://distill.pub/2016/augmented-rnns/
Attention function parameterization
• Feedforward NNs:
  z_j = tanh([s^t_i; s^s_j] W + b)
  z_j = tanh([s^t_i; s^s_j; s^t_i ⊙ s^s_j] W + b)
• Dot product:
  z_j = s^t_i · s^s_j
• Cosine similarity:
  z_j = (s^t_i · s^s_j) / (||s^t_i|| ||s^s_j||)
• Bi-linear models:
  z_j = (s^t_i)^T W s^s_j
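A NumPy sketch of the alternative scorers above, applied to one decoder state s_t and one encoder state s_j; the square W for the bilinear case and the toy values are assumptions.

```python
import numpy as np

def score_dot(s_t, s_j):
    return s_t @ s_j

def score_cosine(s_t, s_j):
    return (s_t @ s_j) / (np.linalg.norm(s_t) * np.linalg.norm(s_j))

def score_bilinear(s_t, s_j, W):
    return s_t @ W @ s_j

d = 5
rng = np.random.default_rng(0)
s_t, s_j = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d)) * 0.1
print(score_dot(s_t, s_j), score_cosine(s_t, s_j), score_bilinear(s_t, s_j, W))
```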
Learned Attention!
Diagram borrowed from Alex Rush
Qualitative results
Figure 2. Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image. "Soft" (top row) vs. "hard" (bottom row) attention. (Note that both models generated the same captions in this example.)
Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word)
Figures: M. Malinowski
BiDAF
LEARNING: TRAINING DEEP NETWORKS
Vanishing / Exploding Gradients
• Deep networks are hard to train
• Gradients go through multiple layers
• The multiplicative effect tends to lead to exploding or vanishing gradients
• Practical solutions w.r.t.
  – network architecture
  – numerical operations
Vanishing / Exploding Gradients
• Practical solutions w.r.t. network architecture
  – Add skip connections to reduce distance
    • Residual networks, highway networks, …
  – Add gates (and memory cells) to allow longer-term memory
    • LSTMs, GRUs, memory networks, …
Gradients of deep networks
NN_layer(x) = ReLU(x W^1 + b^1)
[Figure: a stack of layers x → h_1 → h_2 → … → h_{n-1} → h_n]
• Can have similar issues with vanishing gradients:
  ∂L/∂h_{n-1, j_{n-1}} = Σ_{j_n} 1(h_{n, j_n} > 0) · W_{j_{n-1}, j_n} · ∂L/∂h_{n, j_n}
Diagram borrowed from Alex Rush
Effects of Skip Connections on Gradients
• Thought experiment: additive skip connections
  NN_sl1(x) = (1/2) ReLU(x W^1 + b^1) + (1/2) x
  ∂L/∂h_{n-1, j_{n-1}} = (1/2) (Σ_{j_n} 1(h_{n, j_n} > 0) · W_{j_{n-1}, j_n} · ∂L/∂h_{n, j_n}) + (1/2) ∂L/∂h_{n, j_{n-1}}
Diagram borrowed from Alex Rush
Effects of Skip Connections on Gradients
• Thought experiment: dynamic skip connections
  NN_sl2(x) = (1 − t) ⊙ ReLU(x W^1 + b^1) + t ⊙ x
  t = σ(x W^t + b^t)
  W^1 ∈ R^{d_hid × d_hid}, W^t ∈ R^{d_hid × 1}
Diagram borrowed from Alex Rush
Highway Network (Srivastava et al., 2015)
• A plain feedforward neural network: y = H(x, W_H)
  – H is a typical affine transformation followed by a non-linear activation
• Highway network: y = H(x, W_H) · T(x, W_T) + x · C(x, W_C)
  – T is a "transform gate"
  – C is a "carry gate"
  – Often C = 1 − T for simplicity
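A NumPy sketch of a highway layer with the common C = 1 − T simplification; the tanh choice for H, the negative transform-gate bias, and the shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    H = np.tanh(x @ W_H + b_H)        # transformed input H(x)
    T = sigmoid(x @ W_T + b_T)        # transform gate in [0, 1]
    return H * T + x * (1.0 - T)      # carry gate C = 1 - T

d = 8
rng = np.random.default_rng(0)
W_H, b_H = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W_T, b_T = rng.normal(size=(d, d)) * 0.1, -1.0 * np.ones(d)  # bias toward carrying x
y = highway_layer(rng.normal(size=d), W_H, b_H, W_T, b_T)
```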
Residual Networks
• Plain net: any two stacked weight layers (each followed by ReLU) directly fit a desired mapping
• Residual net: the stacked layers fit a residual F(x); an identity connection adds the input back, so the block outputs F(x) + x
• ResNet (He et al. 2015): first very deep (152 layers) network successfully trained for object recognition
[Figure: a plain block vs. a residual block with an identity skip connection]
Residual Networks
• Plain net vs. residual net (as above): the residual block computes F(x) + x
• F(x) is a residual mapping with respect to the identity
• The direct input connection (+x) leads to a nice property w.r.t. backpropagation: more direct influence from the final loss on any deep layer
• In contrast, LSTMs & highway networks allow a long-distance input connection only through "gates"
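A NumPy sketch of a basic residual block: two affine+ReLU layers compute the residual F(x), and the identity connection adds x back. This is a simplified fully-connected stand-in for the convolutional blocks used in ResNet.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, b1, W2, b2):
    F = relu(x @ W1 + b1) @ W2 + b2       # residual mapping F(x)
    return relu(F + x)                    # identity skip connection, then ReLU

d = 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W2, b2 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
y = residual_block(rng.normal(size=d), W1, b1, W2, b2)
```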
Residual Networks: Revolution of Depth
• AlexNet, 8 layers (ILSVRC 2012)
• VGG, 19 layers (ILSVRC 2014)
• GoogleNet, 22 layers (ILSVRC 2014)
[Figure: layer-by-layer diagrams of the three architectures]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Residual Networks: Revolution of Depth
• AlexNet, 8 layers (ILSVRC 2012)
• VGG, 19 layers (ILSVRC 2014)
• ResNet, 152 layers (ILSVRC 2015)
[Figure: layer-by-layer diagrams of the three architectures]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Residual Networks: Revolution of Depth
ImageNet classification top-5 error (%):
  ILSVRC'10 (shallow):             28.2
  ILSVRC'11 (shallow):             25.8
  ILSVRC'12, AlexNet, 8 layers:    16.4
  ILSVRC'13, 8 layers:             11.7
  ILSVRC'14, VGG, 19 layers:        7.3
  ILSVRC'14, GoogleNet, 22 layers:  6.7
  ILSVRC'15, ResNet, 152 layers:    3.57
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Highway Network (Srivastava et al., 2015)
• A plain feedforward neural network: y = H(x, W_H)
  – H is a typical affine transformation followed by a non-linear activation
• Highway network: y = H(x, W_H) · T(x, W_T) + x · C(x, W_C)
  – T is a "transform gate"
  – C is a "carry gate"
  – Often C = 1 − T for simplicity
Vanishing / Exploding Gradients
• Practical solutions w.r.t. numerical operations
  – Gradient clipping: bound gradients by a max value
  – Gradient normalization: renormalize gradients when they are above a fixed norm
  – Careful initialization, smaller learning rates
  – Avoid saturating nonlinearities (like tanh, sigmoid)
    • ReLU or hard-tanh instead
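A NumPy sketch of the first two tricks above: element-wise clipping to a maximum value, and renormalizing the whole gradient when its norm exceeds a fixed threshold. The threshold values are arbitrary assumptions.

```python
import numpy as np

def clip_by_value(grad, max_val=5.0):
    return np.clip(grad, -max_val, max_val)          # gradient clipping

def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)                       # gradient normalization
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([3.0, -7.0, 0.5])
print(clip_by_value(g), clip_by_norm(g))
```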
Sigmoid
  σ(x) = 1 / (1 + e^{-x})
  σ'(x) = σ(x)(1 − σ(x))
• Often used for gates
• Pro: neuron-like, differentiable
• Con: gradients saturate to zero almost everywhere except for x near zero ⇒ vanishing gradients
• Batch normalization helps
Tanh
  tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x})
  tanh'(x) = 1 − tanh²(x)
  tanh(x) = 2σ(2x) − 1
• Often used for hidden states & cells in RNNs, LSTMs
• Pro: differentiable, often converges faster than sigmoid
• Con: gradients easily saturate to zero ⇒ vanishing gradients
Hard Tanh
  hardtanh(t) = −1 if t < −1;  t if −1 ≤ t ≤ 1;  1 if t > 1
• Pro: computationally cheaper
• Con: saturates to zero easily, not differentiable at −1 and 1
ReLU
  ReLU(x) = max(0, x)
  d/dx ReLU(x) = 1 if x > 0;  0 if x < 0;  1 or 0 otherwise
• Pro: doesn't saturate for x > 0, computationally cheaper, induces sparse NNs
• Con: non-differentiable at 0
• Used widely in deep NNs, but not as much in RNNs
• We informally use subgradients
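A NumPy sketch of the four activations above and their (sub)gradients, matching the formulas on the slides; the subgradient 0 is taken at the non-differentiable points.

```python
import numpy as np

def sigmoid(x):      return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x):    s = sigmoid(x); return s * (1.0 - s)

def tanh(x):         return np.tanh(x)
def d_tanh(x):       return 1.0 - np.tanh(x) ** 2

def hardtanh(x):     return np.clip(x, -1.0, 1.0)
def d_hardtanh(x):   return ((x > -1.0) & (x < 1.0)).astype(float)

def relu(x):         return np.maximum(0.0, x)
def d_relu(x):       return (x > 0).astype(float)   # subgradient 0 at x = 0

x = np.linspace(-3, 3, 7)
for f, df in [(sigmoid, d_sigmoid), (tanh, d_tanh), (hardtanh, d_hardtanh), (relu, d_relu)]:
    print(f.__name__, f(x).round(2), df(x).round(2))
```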
Vanishing / Exploding Gradients
• Practical solutions w.r.t. numerical operations
  – Gradient clipping: bound gradients by a max value
  – Gradient normalization: renormalize gradients when they are above a fixed norm
  – Careful initialization, smaller learning rates
  – Avoid saturating nonlinearities (like tanh, sigmoid)
    • ReLU or hard-tanh instead
  – Batch normalization: add intermediate input normalization layers
Batch Normalization
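A sketch of batch normalization at training time: normalize each feature over the mini-batch, then scale and shift with learned gamma/beta. The epsilon constant and the omission of running statistics for test time are simplifying assumptions.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                    # per-feature batch mean
    var = X.var(axis=0)                    # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalize
    return gamma * X_hat + beta            # learned scale and shift

X = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 4))  # batch of 32
out = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ≈ 0 and ≈ 1 per feature
```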
Regularization
• Regularization by an objective term
  L(θ) = Σ_{i=1}^n max{0, 1 − (ŷ_c − ŷ_{c'})} + λ||θ||²
  – Modify the loss with L1 or L2 norms
• Less depth, smaller hidden states, early stopping
• Dropout
  – Randomly delete parts of the network during training
  – Each node (and its corresponding incoming and outgoing edges) is dropped with a probability p
  – p is higher for internal nodes, lower for input nodes
  – The full network is used for testing
  – Faster training, better results
  – Vs. bagging
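A sketch of dropout applied to a layer's activations. This uses the "inverted" variant, which rescales the surviving activations at training time so the full network can be used unchanged at test time; that implementation choice is an assumption, not specified on the slide.

```python
import numpy as np

def dropout(h, p, training=True, rng=np.random.default_rng(0)):
    if not training or p == 0.0:
        return h                                      # full network at test time
    mask = (rng.random(h.shape) >= p).astype(float)   # drop each unit with prob. p
    return h * mask / (1.0 - p)                       # inverted scaling

h = np.ones(10)
print(dropout(h, p=0.5))                   # roughly half the units zeroed, rest scaled by 2
print(dropout(h, p=0.5, training=False))   # unchanged at test time
```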
Convergence of backprop
• Without non-linearity or hidden layers, learning is convex optimization
  – Gradient descent reaches the global minimum
• Multilayer neural nets (with nonlinearity) are not convex
  – Gradient descent gets stuck in local minima
  – Selecting the number of hidden units and layers = fuzzy process
  – NNs have made a HUGE comeback in the last few years
• Neural nets are back with a new name
  – Deep belief networks
  – Huge error reduction when trained with lots of data on GPUs
SUPPLEMENTARY TOPICS