
Practical Neural Networks for NLP (Part 2)
Chris Dyer, Yoav Goldberg, Graham Neubig

Previous part: DyNet, feed-forward networks, RNNs. All pretty standard; you can do very similar things in TF / Theano / Keras.
This part: where DyNet shines.


  1. # Model parameters: word and character embeddings, word- and char-level BiLSTM builders
     WORDS_LOOKUP = model.add_lookup_parameters((nwords, 128))
     CHARS_LOOKUP = model.add_lookup_parameters((nchars, 20))
     fwdRNN = dy.LSTMBuilder(1, 128, 50, model)
     cFwdRNN = dy.LSTMBuilder(1, 20, 64, model)
     bwdRNN = dy.LSTMBuilder(1, 128, 50, model)
     cBwdRNN = dy.LSTMBuilder(1, 20, 64, model)

     def word_rep(w):
         w_index = vw.w2i[w]
         return WORDS_LOOKUP[w_index]

  2. # same parameters as slide 1; word_rep is replaced by a version that
     # falls back to a char-level BiLSTM for rare words (count <= 5)
     WORDS_LOOKUP = model.add_lookup_parameters((nwords, 128))
     CHARS_LOOKUP = model.add_lookup_parameters((nchars, 20))
     fwdRNN = dy.LSTMBuilder(1, 128, 50, model)
     cFwdRNN = dy.LSTMBuilder(1, 20, 64, model)
     bwdRNN = dy.LSTMBuilder(1, 128, 50, model)
     cBwdRNN = dy.LSTMBuilder(1, 20, 64, model)

     def word_rep(w, cf_init, cb_init):
         if wc[w] > 5:
             w_index = vw.w2i[w]
             return WORDS_LOOKUP[w_index]
         else:
             char_ids = [vc.w2i[c] for c in w]
             char_embs = [CHARS_LOOKUP[cid] for cid in char_ids]
             fw_exps = cf_init.transduce(char_embs)
             bw_exps = cb_init.transduce(reversed(char_embs))
             return dy.concatenate([fw_exps[-1], bw_exps[-1]])

  3. def build_tagging_graph(words):
         dy.renew_cg()
         # initialize the RNNs
         f_init = fwdRNN.initial_state()
         b_init = bwdRNN.initial_state()
         cf_init = cFwdRNN.initial_state()
         cb_init = cBwdRNN.initial_state()
         wembs = [word_rep(w, cf_init, cb_init) for w in words]
         fws = f_init.transduce(wembs)
         bws = b_init.transduce(reversed(wembs))
         # biLSTM states
         bi = [dy.concatenate([f, b]) for f, b in zip(fws, reversed(bws))]
         # MLPs (pH, pO are MLP parameters added to the model elsewhere)
         H = dy.parameter(pH)
         O = dy.parameter(pO)
         outs = [O * (dy.tanh(H * x)) for x in bi]
         return outs

  4. def tag_sent(words):
         vecs = build_tagging_graph(words)
         vecs = [dy.softmax(v) for v in vecs]
         probs = [v.npvalue() for v in vecs]
         tags = []
         for prb in probs:
             tag = np.argmax(prb)
             tags.append(vt.i2w[tag])
         return zip(words, tags)

  5. def sent_loss(words, tags):
         vecs = build_tagging_graph(words)
         losses = []
         for v, t in zip(vecs, tags):
             tid = vt.w2i[t]
             loss = dy.pickneglogsoftmax(v, tid)
             losses.append(loss)
         return dy.esum(losses)

  6. num_tagged = cum_loss = 0
     for ITER in xrange(50):
         random.shuffle(train)
         for i, s in enumerate(train, 1):
             if i > 0 and i % 500 == 0:    # print status
                 trainer.status()
                 print cum_loss / num_tagged
                 cum_loss = num_tagged = 0
             if i % 10000 == 0:            # eval on dev
                 good = bad = 0.0
                 for sent in dev:
                     words = [w for w, t in sent]
                     golds = [t for w, t in sent]
                     tags = [t for w, t in tag_sent(words)]
                     for go, gu in zip(golds, tags):
                         if go == gu: good += 1
                         else: bad += 1
                 print good / (good + bad)
             # train on sent
             words = [w for w, t in s]
             golds = [t for w, t in s]
             loss_exp = sent_loss(words, golds)
             cum_loss += loss_exp.scalar_value()
             num_tagged += len(golds)
             loss_exp.backward()
             trainer.update()

  7. (Same training loop as slide 6, shown with callouts highlighting its three parts: progress reporting, dev-set evaluation, and per-sentence training.)

  8. To summarize this part
     • We've seen an implementation of a BiLSTM tagger
     • ... where some words are represented as char-level LSTMs
     • ... and other words are represented as word-embedding vectors
     • ... and the representation choice is determined at run time
     • This is a rather dynamic graph structure.

  9. Up next
     • Even more dynamic graph structure (shift-reduce parsing)
     • Extending the BiLSTM tagger to use global inference

  10. Transition-Based Parsing

  11. [Worked example: a table of Stack / Buffer / Action states for parsing "I saw her duck"; the action sequence is SHIFT, SHIFT, REDUCE-L, SHIFT, SHIFT, REDUCE-L, REDUCE-R.]

  12. Transition-based parsing
      • Build trees by pushing words ("shift") onto a stack and combining elements at the top of the stack into a syntactic constituent ("reduce")
      • Given the current stack and buffer of unprocessed words, what action should the algorithm take? Let's use a neural network!

  13.–21. Transition-based parsing (code walkthrough)
      tokens is the sentence to be parsed. oracle_actions is a list of { SHIFT, REDUCE_L, REDUCE_R }.
      (The parser code stepped through on these slides is sketched below.)

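The parser code itself is not reproduced in this transcript, so here is a minimal sketch of what an oracle-driven shift-reduce loop in DyNet can look like. All names (WEMB, pEMPTY, pW_act, pb_act, pW_comp, pb_comp, parse_loss) and the decision to summarize the state by just the stack top and buffer front are assumptions for illustration; the tutorial's actual implementation keeps richer state using the stack RNNs introduced on the next slides.

      import dynet as dy

      SHIFT, REDUCE_L, REDUCE_R = 0, 1, 2

      # Hypothetical parameters (dimensions chosen arbitrarily for this sketch):
      #   WEMB:    (nwords, 64) word-embedding lookup
      #   pEMPTY:  (64,) placeholder vector for an empty stack/buffer slot
      #   pW_act:  (3, 128) action-classifier weights,  pb_act: (3,)
      #   pW_comp: (64, 128) composition weights,       pb_comp: (64,)

      def parse_loss(tokens, oracle_actions):
          dy.renew_cg()
          W_act, b_act = dy.parameter(pW_act), dy.parameter(pb_act)
          W_comp, b_comp = dy.parameter(pW_comp), dy.parameter(pb_comp)
          empty = dy.parameter(pEMPTY)

          buffer = [WEMB[t] for t in reversed(tokens)]   # buffer.pop() yields the next word
          stack = []
          losses = []
          for act in oracle_actions:
              # summarize the state by the top of the stack and the front of the buffer
              s_top = stack[-1] if stack else empty
              b_top = buffer[-1] if buffer else empty
              scores = W_act * dy.concatenate([s_top, b_top]) + b_act
              losses.append(dy.pickneglogsoftmax(scores, act))
              # apply the oracle action to the parser state
              if act == SHIFT:
                  stack.append(buffer.pop())
              else:
                  right, left = stack.pop(), stack.pop()
                  head, mod = (right, left) if act == REDUCE_L else (left, right)
                  # composed representation of the new constituent (cf. slide 38)
                  stack.append(dy.tanh(W_comp * dy.concatenate([head, mod]) + b_comp))
          return dy.esum(losses)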
  22. Transition-based parsing 
      • This is a good problem for dynamic networks!
      • Different sentences trigger different parsing states
      • The state that needs to be embedded is complex (sequences, trees, sequences of trees)
      • The parsing algorithm has fairly complicated flow control and data structures

  23. Transition-based parsing 
      Challenges: unbounded stack depth, unbounded sentence length, arbitrarily complex trees (illustrated with tree fragments from "I saw her duck").

  24. Transition-based parsing 
      State embeddings
      • We can embed words
      • Assume we can embed tree fragments
      • The contents of the buffer are just a sequence, which we periodically "shift" from
      • The contents of the stack are just a sequence, which we periodically pop from and push to
      • Sequences → use RNNs to get an encoding!
      • But running an RNN for each state will be expensive. Can we do better?

  25. Transition-based parsing 
      Stack RNNs
      • Augment an RNN with a stack pointer
      • Three constant-time operations:
        • push: read input, add to top of stack
        • pop: move stack pointer back
        • embedding: return the RNN state at the location of the stack pointer (which summarizes its current contents)

  26.–31. Transition-based parsing: Stack RNNs in DyNet (animation over the states y0 … y3 and inputs x1 … x3)
      s = [rnn.initial_state()]
      s.append(s[-1].add_input(x1))
      s.pop()
      s.append(s[-1].add_input(x2))
      s.pop()
      s.append(s[-1].add_input(x3))
      Each push extends the RNN from the state under the stack pointer; each pop just moves the pointer back, so earlier states remain available.
  32. Transition-based parsing 
 DyNet wrapper implementation:
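The wrapper code is not included in this transcript; the following is a minimal sketch of what such a wrapper can look like, implementing the three operations from slide 25. The class name StackRNN and its interface are assumptions for illustration, not the tutorial's verbatim code.

      import dynet as dy

      class StackRNN(object):
          """A stack of RNN states with constant-time push / pop / embedding."""
          def __init__(self, rnn_builder, empty_embedding):
              # keep (RNN state, input embedding) pairs; the last entry is the stack pointer
              self.s = [(rnn_builder.initial_state(), empty_embedding)]
          def push(self, expr):
              # extend the RNN from the state under the stack pointer
              self.s.append((self.s[-1][0].add_input(expr), expr))
          def pop(self):
              # move the stack pointer back; return the embedding that was on top
              return self.s.pop()[1]
          def embedding(self):
              # the RNN output at the stack pointer summarizes the stack contents
              state, emb = self.s[-1]
              return state.output() if len(self.s) > 1 else emb
          def __len__(self):
              return len(self.s) - 1

Because a push always extends from the state under the pointer and a pop only moves the pointer back, no RNN state is ever recomputed, which is what keeps all three operations constant-time.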

  33.–35. Transition-based parsing: representing the state
      [Figure: for the partial parse of "an overhasty decision was made", the parser state p_t is built from an embedding S of the stack (whose top holds the constituent "an overhasty decision", built with amod arcs) and an embedding B of the buffer ("was made … root"), and is used to score the next action (SHIFT, REDUCE_L, REDUCE_R).]

  36.–38. Transition-based parsing: syntactic compositions
      A head h and a modifier m are composed into a representation of the new constituent:
          c = tanh( W [h; m] + b )

  39. Transition-based parsing 
      Syntactic compositions
      It is very easy to experiment with different composition functions.
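As an illustration of that point, here is a minimal sketch of two interchangeable composition functions in DyNet. The parameter names (pW_comp, pb_comp, pW_h, pW_m, pb2) are assumed for this sketch, not taken from the tutorial code.

      import dynet as dy

      def compose_mlp(h, m):
          # the composition from slide 38: c = tanh(W [h; m] + b)
          W, b = dy.parameter(pW_comp), dy.parameter(pb_comp)
          return dy.tanh(W * dy.concatenate([h, m]) + b)

      def compose_split(h, m):
          # an alternative: separate transforms for head and modifier
          W_h, W_m, b2 = dy.parameter(pW_h), dy.parameter(pW_m), dy.parameter(pb2)
          return dy.tanh(W_h * h + W_m * m + b2)

Swapping one for the other only changes the line that builds the new constituent's representation; the rest of the parser is untouched.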

  40. Code Tour

  41.–42. Transition-based parsing: representing the state (continued)
      [Figure: the same state representation as slides 33–35, now extended with A, an embedding of the history of past actions (…, SHIFT, REDUCE-LEFT(amod), SHIFT).]

  43. Transition-based parsing 
      Pop quiz
      • How should we add this functionality (the action-history embedding A from the previous slide)?
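One possible answer, sketched under assumed names (ACT_LOOKUP, actRNN, nactions, START_ACT); the tutorial's own answer is not in this transcript: run an LSTM over embeddings of the actions taken so far and concatenate its output into the parser-state summary.

      ACT_LOOKUP = model.add_lookup_parameters((nactions, 16))
      actRNN = dy.LSTMBuilder(1, 16, 32, model)

      def initial_history():
          # seed the history with a start-of-parse action so .output() is defined
          return actRNN.initial_state().add_input(ACT_LOOKUP[START_ACT])

      def extend_history(a_state, act):
          # called once per parser action with the chosen action id
          return a_state.add_input(ACT_LOOKUP[act])

      def summarize_state(s_top, b_top, a_state):
          # stack top, buffer front, and action history together represent the state
          return dy.concatenate([s_top, b_top, a_state.output()])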

  44. Structured Training

  45. What do we Know So Far?
      • How to create relatively complicated models
      • How to optimize them given an oracle action sequence

  46. Local vs. Global Inference
      • What if optimizing local decisions doesn't lead to good global decisions?
        Example: tagging "time flies like an arrow", where
        P(NN VBZ PRP DET NN) = 0.4,  P(NN NNP VB DET NN) = 0.3,  P(VB NNP PRP DET NN) = 0.3;
        picking the locally most likely tag at each position yields NN NNP PRP DET NN, a sequence with zero probability under this distribution.
      • Simple solution: feed in the last predicted label (e.g. RNNLM)
        → but modeling search is difficult and can lead down garden paths
      • Better solutions:
        • local consistency parameters (e.g. CRF: Lample et al. 2016)
        • global training (e.g. globally normalized NNs: Andor et al. 2016)

  47. BiLSTM Tagger w/ Tag Bigram Parameters
      [Architecture diagram: word embeddings for "the brown fox ..." feed forward and backward LSTMs (LSTM_F, LSTM_B); each position's states are concatenated and passed through an MLP to produce tag scores, and adjacent tags, including the boundary tag <s> at both ends, are linked by tag-bigram (transition) parameters.]

  48. From Local to Global
      • Standard BiLSTM loss function, using log emission probabilities as scores:
            log P(y | x) = Σ_i log P(y_i | x)
      • With transition features, emission scores s_e and transition scores s_t are combined under a global normalizer Z:
            P(y | x) = (1/Z) exp Σ_i ( s_e(y_i, x) + s_t(y_{i−1}, y_i) )

  49. How do We Train?
      • Cannot simply enumerate all possibilities and do backprop
      • In easily decomposable cases, can use DP to calculate gradients (CRF)
      • More generally applicable solutions: structured perceptron, margin-based methods

  50. Structured Perceptron Overview
          ŷ = argmax_y score(y | x; θ)
      Example: for "time flies like an arrow", the hypothesis and reference tag sequences differ (NN VBZ PRP DET NN vs. NN NNP VB DET NN) → Update!
      Perceptron loss:
          ℓ_percep(x, y, θ) = max( score(ŷ | x; θ) − score(y | x; θ), 0 )

  51. Structured Perceptron in DyNet
      def viterbi_sent_loss(words, tags):
          vecs = build_tagging_graph(words)
          vit_tags, vit_score = viterbi_decoding(vecs, tags)
          if vit_tags != tags:
              ref_score = forced_decoding(vecs, tags)
              return vit_score - ref_score
          else:
              return dy.scalarInput(0)
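forced_decoding is used above but its body is not shown in this transcript. A minimal sketch of what it can look like, assuming the transition parameters TRANS_LOOKUP from slide 66, the tag vocabulary vt, and a sentence-boundary tag id S_T (so that, like the Viterbi pass, the score includes the transitions from and to "<s>"):

      def forced_decoding(vecs, tags):
          # score of the reference tag sequence: emission + transition terms
          score = dy.scalarInput(0)
          prev_tag = S_T                       # start from the "<s>" tag
          for vec, tag in zip(vecs, tags):
              tid = vt.w2i[tag]
              score = score + dy.pick(vec, tid) + dy.pick(TRANS_LOOKUP[tid], prev_tag)
              prev_tag = tid
          # transition into the final "<s>" tag
          return score + dy.pick(TRANS_LOOKUP[S_T], prev_tag)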

  52.–58. Viterbi Algorithm
      [Animation: a tagging lattice for "time flies like an arrow" with one node per (position, tag) pair over the tags NN, NNP, VB, VBZ, DET, PRP, …, plus <s> nodes at the start and end. Each node carries a forward score s_{i,tag}; the frames fill in the columns left to right until the final <s> node, from which the best path is read off.]

  59. Code

  60. Viterbi Initialization Code
      All forward scores start at −∞ except for the sentence-start tag <s>:
          s_{0,<s>} = 0,   s_{0,NN} = s_{0,NNP} = s_{0,VB} = … = −∞,   i.e.   s_0 = [0, −∞, −∞, …]^T

      init_score = [SMALL_NUMBER] * ntags
      init_score[S_T] = 0
      for_expr = dy.inputVector(init_score)

  61.–65. Viterbi Forward Step
      For time step i, previous tag j and next tag k, the forward score combines the previous forward score, the emission score, and the transition score:
          s_{f,i,j,k} = s_{f,i−1,j} + s_{e,i,k} + s_{t,j,k}
          (e.g. i = 2 (time step), j = NNP (previous POS), k = NN (next POS))
      Vectorize over the previous tag j:
          s_{f,i,k} = s_{f,i−1} + s_{e,i,k} + s_{t,k}
      Take the max over previous tags:
          s_{f,i,k} = max( s_{f,i,k} )
      Concatenate over next tags k and recurse:
          s_{f,i} = concat( s_{f,i,1}, s_{f,i,2}, … )

  66. Transition Matrix in DyNet
      Add additional parameters:
          TRANS_LOOKUP = model.add_lookup_parameters((ntags, ntags))
      Initialize at sentence start:
          trans_exprs = [TRANS_LOOKUP[tid] for tid in range(ntags)]

  67. Viterbi Forward in DyNet
      # Perform the forward pass through the sentence
      # (for_expr comes from the initialization on slide 60; best_ids starts as an empty list)
      for i, vec in enumerate(vecs):
          my_best_ids = []
          my_best_exprs = []
          for next_tag in range(ntags):
              # Calculate vector for single next tag
              next_single_expr = for_expr + trans_exprs[next_tag]
              next_single = next_single_expr.npvalue()
              # Find and save the best score
              my_best_id = np.argmax(next_single)
              my_best_ids.append(my_best_id)
              my_best_exprs.append(dy.pick(next_single_expr, my_best_id))
          # Concatenate vectors and add emission probs
          for_expr = dy.concatenate(my_best_exprs) + vec
          # Save the best ids
          best_ids.append(my_best_ids)
      ... and do similar for the final "<s>" tag.
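The final "<s>" step and the backward pass are not spelled out on the slide. As a minimal sketch (not the tutorial's verbatim code), reusing for_expr, trans_exprs, best_ids and the boundary tag id S_T from above:

      # final transition from the last word into the "<s>" tag
      next_single_expr = for_expr + trans_exprs[S_T]
      next_single = next_single_expr.npvalue()
      best_id = np.argmax(next_single)                 # best tag at the last position
      best_expr = dy.pick(next_single_expr, best_id)   # Viterbi score of the best path

      # follow the backpointers to recover the best tag sequence
      best_path = [best_id]
      for my_best_ids in reversed(best_ids[1:]):       # best_ids[0] points back to "<s>"
          best_path.append(my_best_ids[best_path[-1]])
      best_path.reverse()
      best_tags = [vt.i2w[tid] for tid in best_path]
      # viterbi_decoding (slide 51) can then return best_tags together with best_expr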
