Outline

Morning program
  Preliminaries
    Feedforward neural network
    Back propagation
    Distributed representations
    Recurrent neural networks
    Sequence-to-sequence models
    Convolutional neural networks
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
  Modeling user behavior
  Generating responses
Wrap up
Preliminaries: Multi-layer perceptron, a.k.a. feedforward neural network

[Figure: a feedforward network. The input is x = (x_1, x_2, x_3, x_4); x_{i,j} denotes node j at layer i; w_{1,4} and the other w are the weights of the hidden and output layers. The activation function φ (e.g. the sigmoid 1 / (1 + e^{−o})) turns each node's weighted input o into its output. The network's predictions ŷ = (ŷ_1, ŷ_2, ŷ_3) are compared to the target y = (y_1, y_2, y_3) by a cost function, e.g. ½ (y − ŷ)².]
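To make the figure concrete, here is a minimal NumPy sketch of what a single node j computes; the weights, inputs, and target below are made-up values, not taken from the slides.

```python
import numpy as np

# One node j at layer i: a weighted sum of the previous layer's
# activations, followed by the activation function phi (here a sigmoid).
def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

x_prev = np.array([0.5, -1.0, 0.25, 2.0])   # activations x_{i-1,1..4}
w = np.array([0.1, 0.4, -0.3, 0.8])         # weights into node j
o = np.dot(w, x_prev)                       # pre-activation o_{i,j}
x_ij = sigmoid(o)                           # activation x_{i,j} = phi(o_{i,j})

y = 1.0                                     # target for this output node
cost = 0.5 * (y - x_ij) ** 2                # squared-error cost 1/2 (y - y_hat)^2
```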
Preliminaries: Back propagation

Until convergence:
- do a forward pass
- compute the cost/error
- adjust the weights ← how?

Adjust every weight w_{i,j} by

Δw_{i,j} = −α ∂cost / ∂w_{i,j}

where α is the learning rate.
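As a minimal sketch of this update rule, consider a one-parameter toy model ŷ = w · x; the model, data, and learning rate are illustrative choices, not from the slides.

```python
# Weight update Delta w = -alpha * d cost / d w on a toy one-parameter model.
def cost(w, x, y):
    y_hat = w * x
    return 0.5 * (y - y_hat) ** 2

alpha = 0.1                          # learning rate
w, x, y = 0.0, 2.0, 1.0
for _ in range(50):
    grad = -(y - w * x) * x          # d cost / d w, derived by hand
    w += -alpha * grad               # Delta w = -alpha * d cost / d w
# w approaches 0.5, for which y_hat = w * x equals the target y
```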
Preliminaries: Back propagation

Applying the chain rule to the update rule:

Δw_{i,j} = −α ∂cost / ∂w_{i,j}
         = −α (∂cost / ∂x_{i,j}) (∂x_{i,j} / ∂w_{i,j})                          ← chain rule
         = −α (∂cost / ∂x_{i,j}) (∂x_{i,j} / ∂o_{i,j}) (∂o_{i,j} / ∂w_{i,j})    ← chain rule again
         = −α (∂cost / ∂x_{i,j}) (∂x_{i,j} / ∂o_{i,j}) x_{i−1,j}

using the definitions

cost(ŷ, y) = ½ (y − ŷ)²
ŷ_j = x_{i,j} = φ(o_{i,j}), e.g. σ(o_{i,j})
x_{i,j} = σ(o) = 1 / (1 + e^{−o})
o_{i,j} = Σ_{k=1}^{K} w_{i,k} · x_{i−1,k}

so that ∂o_{i,j} / ∂w_{i,j} = x_{i−1,j}.
Preliminaries: Back propagation

The remaining two factors follow from the definitions above:

∂x_{i,j} / ∂o_{i,j} = σ′(o_{i,j}) = σ(o_{i,j})(1 − σ(o_{i,j})) = x_{i,j}(1 − x_{i,j})
∂cost / ∂x_{i,j} = x_{i,j} − y_j        (since ŷ_j = x_{i,j} and cost(ŷ, y) = ½ (y − ŷ)²)

Putting everything together:

Δw_{i,j} = −α (∂cost / ∂x_{i,j}) x_{i,j}(1 − x_{i,j}) x_{i−1,j}
         = −α (x_{i,j} − y_j) x_{i,j}(1 − x_{i,j}) x_{i−1,j}

i.e., the update is the product of the learning rate, the error term, the slope of the activation, and the input to the weight.

[Plot: the sigmoid σ(o) and its derivative σ′(o) for o between −6 and 6.]
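As a quick sanity check (not part of the slides), the derivative used above, σ′(o) = σ(o)(1 − σ(o)), can be compared against a finite-difference approximation:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

o = np.linspace(-6, 6, 13)
analytic = sigmoid(o) * (1 - sigmoid(o))                      # sigma'(o) = sigma(o)(1 - sigma(o))
eps = 1e-6
numeric = (sigmoid(o + eps) - sigmoid(o - eps)) / (2 * eps)   # central finite difference
assert np.allclose(analytic, numeric, atol=1e-6)
```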
Preliminaries: Back propagation

Writing δ_j for the shared factor ∂cost / ∂o_{i,j} = (∂cost / ∂x_{i,j}) x_{i,j}(1 − x_{i,j}), the update becomes

Δw_{i,j} = −α δ_j x_{i−1,j}

with

δ_output = (x_{i,j} − y_j) x_{i,j}(1 − x_{i,j})                   ← previous slide
δ_hidden = (Σ_{n ∈ nodes} δ_n w_{n,j}) x_{i,j}(1 − x_{i,j})

The deltas of a hidden layer are computed from the deltas of the layer above it, which is why the procedure is called back propagation.
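The delta rules translate almost directly into code. Below is a minimal NumPy sketch of one training step for a network with a single hidden layer, sigmoid activations, and squared-error cost; the layer sizes, data, and learning rate are illustrative choices.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def backprop_step(x, y, W1, W2, alpha=0.5):
    # forward pass
    h = sigmoid(x @ W1)          # hidden activations
    y_hat = sigmoid(h @ W2)      # output activations

    # delta at the output layer: (y_hat - y) * y_hat * (1 - y_hat)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    # delta at the hidden layer: back-propagate delta_out through W2
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # weight updates: Delta w = -alpha * delta * (input to that weight)
    W2 += -alpha * h.T @ delta_out
    W1 += -alpha * x.T @ delta_hid
    return W1, W2

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
y = np.array([[0.0, 1.0, 0.0]])
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 3))
for _ in range(100):
    W1, W2 = backprop_step(x, y, W1, W2)
```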
Preliminaries: Network representation

The same network in matrix form. The input is a row vector x_1 = [x_1[1], x_1[2], x_1[3], x_1[4]]:

x_1 · W_1 = o_1        [1 × 4] [4 × 4] = [1 × 4]
x_2 = σ(o_1)           activation of the hidden layer, [1 × 4]
x_2 · W_2 = o_2        [1 × 4] [4 × 3] = [1 × 3]
x_3 = σ(o_2)           activation of the output layer, i.e. the predictions ŷ = (ŷ_1, ŷ_2, ŷ_3)
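The same matrix-form forward pass as a NumPy sketch; the weights and input are random and only the shapes match the slide.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(1, 4))   # input, shape [1 x 4]
W1 = rng.normal(size=(4, 4))   # first weight matrix, [4 x 4]
W2 = rng.normal(size=(4, 3))   # second weight matrix, [4 x 3]

o1 = x1 @ W1                   # [1 x 4] @ [4 x 4] = [1 x 4]
x2 = sigmoid(o1)               # hidden-layer activation
o2 = x2 @ W2                   # [1 x 4] @ [4 x 3] = [1 x 3]
x3 = sigmoid(o2)               # network output y_hat, shape [1 x 3]
```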
Preliminaries: Distributed representations

◮ Represent units, e.g. words, as vectors
◮ Goal: words that are similar, e.g. in terms of meaning, should get similar embeddings

newspaper = <0.08, 0.31, 0.41>
magazine  = <0.09, 0.35, 0.36>
biking    = <0.59, 0.25, 0.01>

Cosine similarity is used to determine how similar two vectors are:

cosine(v, w) = (v⊤ · w) / (||v||_2 ||w||_2) = ( Σ_{i=1}^{|v|} v_i · w_i ) / ( sqrt(Σ_{i=1}^{|v|} v_i²) · sqrt(Σ_{i=1}^{|w|} w_i²) )
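A small sketch computing the cosine similarity of the example vectors above:

```python
import numpy as np

def cosine(v, w):
    # dot product divided by the product of the L2 norms
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

newspaper = np.array([0.08, 0.31, 0.41])
magazine  = np.array([0.09, 0.35, 0.36])
biking    = np.array([0.59, 0.25, 0.01])

cosine(newspaper, magazine)   # close to 1: similar vectors
cosine(newspaper, biking)     # noticeably smaller
```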
Preliminaries: Distributed representations

How do we get these vectors?
◮ You shall know a word by the company it keeps [Firth, 1957]
◮ The vector of a word should be similar to the vectors of the words surrounding it, e.g. for the sentence "all you need is love": →all, →you, →need, →is, →love
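One way to make "the company it keeps" operational is to pair every word with the words in a small window around it; such pairs are the training signal for embedding models. A minimal sketch, where the window size is an arbitrary choice:

```python
# Build (target, context) pairs from a sentence using a symmetric window.
sentence = "all you need is love".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# e.g. ('need', 'all'), ('need', 'you'), ('need', 'is'), ('need', 'love'), ...
```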
Preliminaries: Embedding methods

[Figure: a feedforward embedding network. The input is a one-hot vector of vocabulary size (here over a toy vocabulary: all, amtrak, answer, is, love, need, what, you, zorro, ...). A vocabulary size × embedding size weight matrix maps it to a hidden layer of embedding size; an embedding size × vocabulary size weight matrix maps the hidden layer back to a vector of vocabulary size, which is turned into a probability distribution and compared against a one-hot target distribution.]
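A minimal NumPy sketch of the architecture in the figure, assuming the toy vocabulary above and an illustrative embedding size of 3:

```python
import numpy as np

# A one-hot input selects a row of the (vocabulary x embedding) matrix;
# the (embedding x vocabulary) matrix maps that embedding to logits over
# the vocabulary, which a softmax turns into a probability distribution.
vocab = ["all", "amtrak", "answer", "is", "love", "need", "what", "you", "zorro"]
V, E = len(vocab), 3

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, E))    # vocabulary_size x embedding_size
W_out = rng.normal(size=(E, V))   # embedding_size x vocabulary_size

one_hot = np.zeros(V)
one_hot[vocab.index("need")] = 1.0

hidden = one_hot @ W_in           # the embedding of "need" (a row of W_in)
logits = hidden @ W_out           # one score per vocabulary word
probs = np.exp(logits) / np.exp(logits).sum()   # probability distribution
```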
Preliminaries: Probability distributions

[Figure: the predicted probability distribution ŷ over the vocabulary is compared to the target distribution y by the cost.]

softmax = normalize the logits into a probability distribution:

softmax(logits)[i] = e^{logits[i]} / Σ_{j=1}^{|logits|} e^{logits[j]}

cost = cross-entropy loss
     = − Σ_x p(x) log p̂(x)
     = − Σ_i p_ground truth(word = vocabulary[i]) log p_predictions(word = vocabulary[i])
     = − Σ_i y_i log ŷ_i
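Both formulas in a short NumPy sketch; the logits and the one-hot target are made-up values.

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; does not change the result
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    # - sum_i y_i * log(y_hat_i); y_true is typically one-hot
    return -np.sum(y_true * np.log(y_pred))

logits = np.array([2.0, 0.5, -1.0, 0.1])
y_hat = softmax(logits)
y = np.array([1.0, 0.0, 0.0, 0.0])       # ground-truth word is vocabulary[0]
loss = cross_entropy(y, y_hat)
```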
Preliminaries: Recurrent neural networks

◮ Lots of information is sequential and requires a memory for successful processing
◮ Sequences as input, sequences as output
◮ Recurrent neural networks (RNNs) are called recurrent because they perform the same task for every element of a sequence, with the output dependent on previous computations
◮ RNNs have a memory that captures information about what has been computed so far
◮ RNNs can make use of information in arbitrarily long sequences; in practice they are limited to looking back only a few steps

Image credits: http://karpathy.github.io/assets/rnn/diags.jpeg
Preliminaries: Recurrent neural networks

◮ An RNN can be unrolled (or unfolded) into a full network
◮ Unrolling: write out the network for the complete sequence
◮ Formulas governing the computation:
  ◮ x_t is the input at time step t
  ◮ s_t is the hidden state at time step t, the memory of the network, calculated based on the previous hidden state and the input at the current step: s_t = f(U x_t + W s_{t−1}), where f is usually a nonlinearity, e.g. tanh or ReLU, and s_{−1} is typically initialized to all zeroes
  ◮ o_t is the output at step t, e.g., if we want to predict the next word in a sentence, a vector of probabilities across the vocabulary: o_t = softmax(V s_t)

Image credits: Nature
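A minimal NumPy sketch of this recurrence, assuming tanh for f and illustrative dimensions:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# s_t = tanh(U x_t + W s_{t-1}),  o_t = softmax(V s_t)
input_size, hidden_size, vocab_size = 8, 16, 100
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

xs = [rng.normal(size=input_size) for _ in range(5)]   # a sequence of 5 inputs
s = np.zeros(hidden_size)                              # s_{-1}: all zeroes
outputs = []
for x_t in xs:
    s = np.tanh(U @ x_t + W @ s)       # hidden state / memory
    outputs.append(softmax(V @ s))     # distribution over the vocabulary
```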