Sequence(s)-to-Sequence Transformations in Text Processing


  1. Sequence(s)-to-Sequence Transformations in Text Processing. Narada Warakagoda

  2. Seq2seq Transformation. [Diagram: a variable-length input goes into a Model that produces a variable-length output.]

  3. Example Applications ● Summarization (extractive/abstractive) ● Machine translation ● Dialog systems/chatbots ● Text generation ● Question answering ● ...

  4. Seq2seq Transformation. The input and output have variable length, but the model size should be constant. Solution: apply a constant-sized neural net module repeatedly over the data.

  5. Possible Approaches ● Recurrent networks: apply the NN module in a serial fashion ● Convolutional networks: apply the NN modules in a hierarchical fashion ● Self-attention: apply the NN module in a parallel fashion

  6. Processing Pipeline: variable-length input → Encoder → intermediate representation → Decoder → variable-length output.

  7. Processing Pipeline (for text): variable-length text → Embedding → variable-length input → Encoder → intermediate representation → Attention → Decoder → variable-length output.

  8. Architecture Variants (Encoder / Decoder / Attention):
  ● Recurrent net / Recurrent net / No
  ● Recurrent net / Recurrent net / Yes
  ● Convolutional net / Convolutional net / No
  ● Convolutional net / Recurrent net / Yes
  ● Convolutional net / Convolutional net / Yes
  ● Fully connected net with self-attention / Fully connected net with self-attention / Yes

  9. Possible Approaches ● Recurrent networks: apply the NN module in a serial fashion ● Convolutional networks: apply the NN modules in a hierarchical fashion ● Self-attention: apply the NN module in a parallel fashion

  10. RNN-decoder with RNN-encoder. [Diagram: an RNN encoder reads the source "Tusen Takk <end>"; starting from <start>, an RNN decoder with a softmax at each step emits "Thanks Very Much <end>".]

  11. RNN-dec with RNN-enc, Training. [Diagram: the encoder reads "Tusen Takk <end>"; the decoder is fed the ground-truth words "<start> Thanks Very Much" and is trained so that each softmax predicts the next ground-truth word, "Thanks Very Much <end>".]
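
A minimal sketch of the training setup on slides 10-11 in PyTorch; the GRU cells, toy sizes, and random batch are illustrative assumptions, not the presenter's exact model. The decoder is fed the ground-truth previous words (teacher forcing) and each output position is scored against the next ground-truth word with cross-entropy:

```python
import torch
import torch.nn as nn

# Toy sizes; real systems use much larger vocabularies and dimensions.
SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)   # logits, fed to softmax/cross-entropy

    def forward(self, src, tgt_in):
        # Encoder consumes the whole source; its final state initialises the decoder.
        _, h = self.encoder(self.src_emb(src))
        # Teacher forcing: the decoder sees the ground-truth previous words (tgt_in).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_out)               # logits for every target position

model = Seq2Seq()
loss_fn = nn.CrossEntropyLoss()

# Dummy batch standing in for source "Tusen Takk <end>" and the shifted target.
src = torch.randint(0, SRC_VOCAB, (2, 3))
tgt_in = torch.randint(0, TGT_VOCAB, (2, 4))    # e.g. <start> Thanks Very Much
tgt_out = torch.randint(0, TGT_VOCAB, (2, 4))   # e.g. Thanks Very Much <end>

logits = model(src, tgt_in)
loss = loss_fn(logits.reshape(-1, TGT_VOCAB), tgt_out.reshape(-1))
loss.backward()
```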

  12. RNN-dec with RNN-enc, Decoding. [Diagram: greedy decoding; at each step the most probable word from the softmax is fed back as the next decoder input, starting from <start>, until <end> is generated.]

  13. Decoding Approaches ● Optimal decoding ● Greedy decoding: easy, but not optimal ● Beam search: closer to the optimal decoder; choose the top N candidates instead of only the best one at each step (see the code sketch below)

  14. Beam Search Decoding. [Diagram: search tree with beam width = 3.]
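
A compact sketch of the beam search described on slide 13, assuming a hypothetical decoder interface log_prob_next(prefix) that returns (word_id, log_probability) pairs for the next step; everything else is plain Python:

```python
import heapq

def beam_search(log_prob_next, start_id, end_id, beam_width=3, max_len=20):
    """Keep the `beam_width` best partial hypotheses at every step
    instead of only the single best word (greedy decoding)."""
    # Each hypothesis is (cumulative log-probability, word id sequence).
    beams = [(0.0, [start_id])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_id:                       # complete hypothesis
                finished.append((score, seq))
                continue
            for word_id, logp in log_prob_next(seq):    # assumed decoder interface
                candidates.append((score + logp, seq + [word_id]))
        if not candidates:
            break
        # Prune back to the beam width.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    finished.extend(beams)
    return max(finished, key=lambda c: c[0])[1]
```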

  15. Straightforward Extensions. [Diagrams: the basic RNN cell (current state + current input → next state), the LSTM cell (with an additional control/cell state), the bidirectional cell, and the stacked cell.]

  16. RNN-decoder with RNN-encoder with Attention. [Diagram: the same encoder-decoder as slide 10, but at each decoder step a context vector computed from the encoder states is added before the softmax.]

  17. Attention ● The context is a weighted sum of the encoder states (formula below) ● Attention weights are dynamic ● They are generally defined by a scoring function f between the decoder state and each encoder state, where f can be defined in several ways: ● Dot product ● Weighted dot product ● Another MLP (e.g. 2 layers)
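
The formulas missing from slide 17 are, in the standard attention formulation this slide appears to describe (the symbol names s_i for decoder states and h_j for encoder states are assumptions):

```latex
% Context: a dynamic weighted sum of the encoder states h_j
c_i = \sum_j \alpha_{i,j}\, h_j,
\qquad
\alpha_{i,j} = \frac{\exp\!\bigl(f(s_{i-1}, h_j)\bigr)}{\sum_k \exp\!\bigl(f(s_{i-1}, h_k)\bigr)}

% Common choices for the scoring function f
f(s, h) = s^\top h                          % dot product
f(s, h) = s^\top W h                        % weighted dot product
f(s, h) = v^\top \tanh(W_1 s + W_2 h)       % a small MLP (e.g. 2 layers)
```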

  18. Attention. [Diagram: the attention context vector is added to the input of the RNN cell at each decoder step.]

  19. Example: Google Neural Machine Translation

  20. Possible Approaches ● Recurrent networks: apply the NN module in a serial fashion ● Convolutional networks: apply the NN modules in a hierarchical fashion ● Self-attention: apply the NN module in a parallel fashion

  21. Why Convolution ● Recurrent networks are serial ● They cannot be parallelized ● The "distance" between a feature vector and the different inputs is not constant ● Convolutional networks ● They can be parallelized (faster) ● The "distance" between a feature vector and the different inputs is constant

  22. Long-range dependency capture with conv nets. [Diagram: stacked convolutions with kernel size k whose receptive field grows to cover an input sequence of length n.]

  23. Conv net encoder, recurrent net decoder with attention. [Diagram: convolutional encoder networks CNN-a and CNN-c produce outputs z_j and attention weights a_{i,j}; the resulting context c_i is combined with the decoder state h_i.] Gehring et al., A Convolutional Encoder Model for Neural Machine Translation, 2016

  24. Two conv nets with attention. [Diagram: a fully convolutional encoder (outputs z_j, input embeddings e_j) and a convolutional decoder (states h_i, d_i); attention weights a_{i,j} give context vectors c_i that are added to the decoder states.] Gehring et al., Convolutional Sequence to Sequence Learning, 2017
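
The equation fragments scattered through slides 23-24 appear to be the decoder-to-encoder attention of the cited Gehring et al. papers; reconstructed in that notation (the indices and bias term are my best reading, not copied from the slides):

```latex
% Combine the decoder state h_i with the embedding g_i of the previous target word
d_i = W_d h_i + b_d + g_i

% Attention weights over the encoder outputs z_j
a_{i,j} = \frac{\exp\!\bigl(d_i^\top z_j\bigr)}{\sum_{k} \exp\!\bigl(d_i^\top z_k\bigr)}

% Context vector; in the 2017 ConvS2S model the input embeddings e_j are added to z_j
c_i = \sum_j a_{i,j}\,\bigl(z_j + e_j\bigr)
```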

  25. Possible Approaches ● Recurrent networks: apply the NN module in a serial fashion ● Convolutional networks: apply the NN modules in a hierarchical fashion ● Self-attention: apply the NN module in a parallel fashion

  26. Why Self-attention ● Recurrent networks are serial ● They cannot be parallelized ● The "distance" between a feature vector and the different inputs is not constant ● Self-attention networks ● They can be parallelized (faster) ● The "distance" between a feature vector and the different inputs does not depend on the input length

  27. FCN with self-attention. [Diagram: the network takes the inputs and the previous words and outputs the probability of the next word.] Vaswani et al., Attention is all you need, 2017

  28. Scaled dot product attention. [Diagram: the attention block takes a Query, Keys, and Values.]
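
A small NumPy sketch of the scaled dot-product attention on slide 28, following the formula from the cited Vaswani et al. paper; the optional boolean mask is what the decoder self-attention on slide 31 uses to hide rightward positions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # query-key similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
    return weights @ V

# Decoder self-attention: a lower-triangular mask so position i only sees j <= i.
n, d = 4, 8
x = np.random.randn(n, d)
causal_mask = np.tril(np.ones((n, n), dtype=bool))
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)
```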

  29. Multi-Head Attention
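
Multi-head attention (slide 29) runs the scaled dot-product attention above h times on learned linear projections and concatenates the results; this is the formulation from the cited paper:

```latex
\mathrm{head}_i = \mathrm{Attention}\bigl(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\bigr)

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\bigl(\mathrm{head}_1, \ldots, \mathrm{head}_h\bigr)\, W^{O}
```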

  30. Encoder Self-attention. [Diagram: self attention over the encoder inputs.]

  31. Decoder Self-attention • Almost the same as encoder self-attention • But only leftward positions are considered.

  32. Encoder-decoder attention. [Diagram: the decoder state attends over the encoder states.]

  33. Overall Operation. [Diagram: the previous words go in, the next word comes out.] Neural Machine Translation, Philipp Koehn

  34. Reinforcement Learning ● Machine Translation/Summarization ● Dialog Systems ● ...

  35. Reinforcement Learning ● Machine Translation/Summarization ● Dialog Systems ● ...

  36. Why Reinforcement Learning ● Exposure bias ● In training, ground truths are used; in testing, the word generated in the previous step is used to generate the next word ● Using generated words in training requires sampling: non-differentiable ● The maximum likelihood criterion is not directly related to the evaluation metrics ● BLEU (machine translation) ● ROUGE (summarization) ● Using BLEU/ROUGE in training: non-differentiable

  37. Sequence Generation as Reinforcement Learning ● Agent: the recurrent net ● State: hidden layers, attention weights, etc. ● Action: next word ● Policy: generate the next word (action) given the current hidden layers and attention weights (state) ● Reward: score computed using the evaluation metric (e.g. BLEU)

  38. Maximum Likelihood Training (Revisited). Minimize the negative log likelihood (written out below).
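
Written out, the negative log likelihood on slide 38 (with y*_1 ... y*_T the ground-truth words and x the input sequence; the notation is assumed):

```latex
L_{\mathrm{ML}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\bigl(y_t^{*} \mid y_1^{*}, \ldots, y_{t-1}^{*},\, x\bigr)
```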

  39. Reinforcement Learning Formulation. Minimize the expected negative reward, using the REINFORCE algorithm.

  40. Reinforcement Learning Details ● Expected reward ● We need its gradient ● We need to write the gradient as an expectation so that it can be evaluated using samples; use the log-derivative trick ● The result is an expectation ● Approximate it with the sample mean ● In practice we use only one sample (see the derivation below)
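
The chain of steps on slide 40, written out (y^s is a sequence sampled from the model and r(y^s) its BLEU/ROUGE reward; the notation is assumed):

```latex
L(\theta) = -\,\mathbb{E}_{y^{s} \sim p_\theta}\bigl[r(y^{s})\bigr]

\nabla_\theta L(\theta)
  = -\sum_{y} r(y)\,\nabla_\theta p_\theta(y)
  = -\sum_{y} r(y)\, p_\theta(y)\,\nabla_\theta \log p_\theta(y)        % log-derivative trick
  = -\,\mathbb{E}_{y^{s} \sim p_\theta}\bigl[r(y^{s})\,\nabla_\theta \log p_\theta(y^{s})\bigr]
  \approx -\, r(y^{s})\,\nabla_\theta \log p_\theta(y^{s})              % one sample in practice
```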

  41. Reinforcement Learning Details ● Gradient ● This estimate has high variance; use a baseline to combat this problem ● The baseline can be anything independent of the sampled sequence ● It can, for example, be estimated as the reward of the word sequence generated using argmax at each cell (see below)
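
With the baseline b of slide 41 (subtracting any quantity independent of the sampled sequence leaves the expectation unchanged but lowers its variance), and using the greedily decoded sequence as that baseline, the estimate becomes:

```latex
\nabla_\theta L(\theta) \approx -\,\bigl(r(y^{s}) - b\bigr)\,\nabla_\theta \log p_\theta(y^{s}),
\qquad b = r(\hat{y}),\ \ \hat{y} = \text{the sequence obtained by argmax at each cell}
```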

  42. Reinforcement Learning ● Machine Translation/Summarization ● Dialog Systems ● ...

  43. Maximum Likelihood Dialog Systems. [Diagram: an encoder reads "How Are You?" and a decoder, starting from <start>, generates "I Am Fine".]

  44. Why Reinforcement Learning ● The maximum likelihood criterion is not directly related to successful dialogs ● Dull responses ("I don't know") ● Repetitive responses ● Need to integrate developer-defined rewards relevant to the longer-term goals of the dialog

  45. Dialog Generation as Reinforcement Learning ● Agent: the recurrent net ● State: previous dialog turns ● Action: next dialog utterance ● Policy: generate the next dialog utterance (action) given the previous dialog turns (state) ● Reward: score computed from relevant factors such as ease of answering, information flow, semantic coherence, etc.

  46. Training Setup. [Diagram: two agents, each with its own encoder and decoder, converse with each other.]

  47. Training Procedure ● From the viewpoint of a given agent, the procedure is similar to that of sequence generation ● REINFORCE algorithm ● Appropriate rewards must be calculated based on the current and previous dialog turns ● Can be initialized with maximum-likelihood-trained models.

  48. Adversarial Learning ● Use a discriminator, as in GANs, to calculate the reward ● Same training procedure based on REINFORCE for the generator ● [Diagram: the discriminator compares generated dialog with human dialog.]

  49. Question Answering ● Slightly different from the sequence-to-sequence model ● Inputs: a variable-length question/query and a passage/document/context ● Output: fixed length, either a single-word answer or the start-end points of the answer
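
One common way to realize the "start-end points of the answer" output on slide 49 is to score every passage position with two learned vectors and take softmaxes over positions. This NumPy sketch assumes a hypothetical matrix H of question-aware per-token passage representations; all names are illustrative, not the presenter's model:

```python
import numpy as np

def answer_span(H, w_start, w_end):
    """H: (passage_len, d) token representations from some question-aware encoder.
    Returns the most probable (start, end) answer span with start <= end."""
    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    p_start = softmax(H @ w_start)   # probability of each token being the span start
    p_end = softmax(H @ w_end)       # probability of each token being the span end
    # Pick the span (i, j), i <= j, maximising p_start[i] * p_end[j].
    best, best_score = (0, 0), -1.0
    for i in range(len(p_start)):
        for j in range(i, len(p_end)):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Toy usage with random representations and weights.
H = np.random.randn(30, 16)
start_idx, end_idx = answer_span(H, np.random.randn(16), np.random.randn(16))
```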

  50. QA, Naive Approach ● Combine the question and the passage into one variable-length input and use an RNN to classify it into the single-word answer or start-end points ● This will not work, because the relationship between the passage and the question is not adequately captured.
