Memory Networks and Neural Turing Machines


  1. Memory Networks and Neural Turing Machines. Diego Marcheggiani, ILLC, University of Amsterdam. Unsupervised Language Learning, 2016.

  2. Outline Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

  3. Outline Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

  4. Motivation ◮ Neural networks have a hard time capturing long-range dependencies.

  5. Motivation ◮ Neural networks have a hard time capturing long-range dependencies. Yes, even LSTMs.

  6. Motivation ◮ Neural networks have a hard time capturing long-range dependencies. Yes, even LSTMs. ◮ Memory networks (MNs) and Neural Turing machines (NTMs) try to overcome this problem using an external memory. ◮ MNs are mainly motivated by the difficulty of capturing long-range dependencies, ◮ while NTMs are devised to perform program induction.

  7. Outline Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

  8. General idea ◮ We have a neural network, RNN, MLP, ...

  9. General idea ◮ We have a neural network, RNN, MLP, ... ◮ An external memory.

  10. General idea ◮ We have a neural network: RNN, MLP, it does not matter. ◮ An external memory that the neural network can write to and read from.

  11. General idea ◮ We have a neural network: RNN, MLP, it does not matter. ◮ An external memory that the neural network can write to and read from.

  12. Memory network components ◮ Input feature map (I): transforms the input into a feature representation, e.g., a bag of words

  13. Memory network components ◮ Input feature map (I): transforms the input into a feature representation, e.g., a bag of words ◮ Generalization (G): writes the input, or a function of it, to the memory

  14. Memory network components ◮ Input feature map (I): transforms the input into a feature representation, e.g., a bag of words ◮ Generalization (G): writes the input, or a function of it, to the memory ◮ Output feature map (O): reads the most relevant memory slots

  15. Memory network components ◮ Input feature map (I): transforms the input into a feature representation, e.g., a bag of words ◮ Generalization (G): writes the input, or a function of it, to the memory ◮ Output feature map (O): reads the most relevant memory slots ◮ Response (R): given the information read from the memory, returns the output

  16. Memory network components ◮ Input feature map (I): transforms the input into a feature representation, e.g., a bag of words ◮ Generalization (G): writes the input, or a function of it, to the memory ◮ Output feature map (O): reads the most relevant memory slots ◮ Response (R): given the information read from the memory, returns the output. This is an extremely general framework that can be instantiated in several ways.
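
The I/G/O/R decomposition is easiest to see in code. The following is a minimal sketch of the interface only, not the authors' implementation: the class name `MemoryNetwork`, the bag-of-words encoder, and the plain dot-product scorer are illustrative assumptions.

```python
import numpy as np

class MemoryNetwork:
    """Bare-bones I/G/O/R skeleton; encoders and scorers are placeholders."""

    def __init__(self, vocab):
        self.vocab = vocab   # word -> index
        self.memory = []     # list of stored sentence vectors

    def I(self, text):
        """Input feature map: encode raw text as a bag-of-words vector."""
        x = np.zeros(len(self.vocab))
        for w in text.lower().strip(".?").split():
            if w in self.vocab:
                x[self.vocab[w]] += 1.0
        return x

    def G(self, x):
        """Generalization: write the encoded input into the next free memory slot."""
        self.memory.append(x)

    def O(self, q):
        """Output feature map: read the most relevant memory slot for query q."""
        scores = [float(q @ m) for m in self.memory]  # placeholder similarity (dot product)
        return self.memory[int(np.argmax(scores))]

    def R(self, q, o):
        """Response: map query + retrieved memory to an output (left abstract here)."""
        return q + o
```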

  17. Question answering ◮ Input text: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

  18. Question answering ◮ Input text: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. ◮ Input question: Where is Dan now?

  19. Question answering ◮ Input text: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. ◮ Input question: Where is Dan now? ◮ Output answer: bedroom. Let's see a simple instantiation of memory networks for QA.

  20. I component. Each raw text sentence is transformed into its vector representation, e.g., a bag of words, with the component I. Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

  21. I component. Each raw text sentence is transformed into its vector representation, e.g., a bag of words, with the component I. Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

  22. G component. The sentences are then written to the memory sequentially, via the component G. Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. Notice that in this approach the memory is fixed: once it is written, it is not changed during either learning or testing.
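
As a toy illustration of the I and G steps, reusing the `MemoryNetwork` sketch above (the vocabulary construction is an assumption for the example), filling the memory with the four story sentences might look like this:

```python
story = ["Fred moved to the bedroom.",
         "Joe went to the kitchen.",
         "Joe took the milk.",
         "Dan journeyed to the bedroom."]
question = "Where is Dan now?"

# Toy vocabulary over the story and the question.
words = sorted({w for s in story + [question] for w in s.lower().strip(".?").split()})
vocab = {w: i for i, w in enumerate(words)}

mn = MemoryNetwork(vocab)
for sentence in story:
    mn.G(mn.I(sentence))      # I encodes the sentence, G appends it to the memory
assert len(mn.memory) == 4    # one write-once slot per sentence
```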

  23. I component. The question is transformed into its vector representation with the component I. Query: Where is Dan now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

  24. O component. The best matching memory (supporting fact), according to the question, is retrieved with the component O. Query: Where is Dan now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.
  $o_1 = O_1(q, m) = \arg\max_{i=1,\dots,N} s_O(q, m_i)$
  where the similarity function is defined as $s_O(x, y) = x^\top U_O^\top U_O \, y$.
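
A direct transcription of the O step, assuming the slides' bilinear similarity with a learned embedding matrix U_O (here a random stand-in for trained parameters):

```python
import numpy as np

def s_bilinear(x, y, U):
    """s(x, y) = x^T U^T U y: embed both bag-of-words vectors with U, then dot product."""
    return float((U @ x) @ (U @ y))

def O1(q, memory, U_O):
    """First supporting fact: index of the memory slot that best matches the query."""
    return int(np.argmax([s_bilinear(q, m_i, U_O) for m_i in memory]))

# Toy usage: V = vocabulary size, d = embedding size, parameters are random stand-ins.
rng = np.random.default_rng(0)
V, d = 20, 8
U_O = rng.normal(size=(d, V))
memory = [rng.normal(size=V) for _ in range(4)]
q = rng.normal(size=V)
best = O1(q, memory, U_O)
```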

  25. R component. Given the supporting fact and the query, the best matching word in the dictionary is retrieved. Query: Where is Dan now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. Answer: bedroom.
  $r = \arg\max_{w \in W} s_R(q + m_{o_1}, w)$
  where the similarity function is defined as $s_R(x, y) = x^\top U_R^\top U_R \, y$.

  26. R component. Given the supporting fact and the query, the best matching word in the dictionary is retrieved. Query: Where is Dan now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. Answer: bedroom.
  $r = \arg\max_{w \in W} s_R(q + m_{o_1}, w)$
  where the similarity function is defined as $s_R(x, y) = x^\top U_R^\top U_R \, y$.
  What about harder questions?
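
A sketch of the R step, reusing `s_bilinear` and the toy variables from the previous snippet and representing dictionary words as one-hot vectors (an illustrative assumption):

```python
def R_word(q, m_o1, word_vectors, U_R):
    """Return the index of the dictionary word that best matches q + m_{o_1}."""
    x = q + m_o1
    return int(np.argmax([s_bilinear(x, w, U_R) for w in word_vectors]))

U_R = rng.normal(size=(d, V))
word_vectors = list(np.eye(V))     # one one-hot vector per dictionary word
answer_idx = R_word(q, memory[best], word_vectors, U_R)
```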

  27. Question Answering. Query: Where is the milk now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

  28. Question Answering. The best matching memory (supporting fact), according to the question, is retrieved with the component O. Query: Where is the milk now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.
  $o_1 = O_1(q, m) = \arg\max_{i=1,\dots,N} s_O(q, m_i)$
  where the similarity function is defined as $s_O(x, y) = x^\top U_O^\top U_O \, y$.

  29. Question Answering. We also need another supporting fact. Query: Where is the milk now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.
  $o_2 = O_2(q + m_{o_1}, m) = \arg\max_{i=1,\dots,N} s_O(q + m_{o_1}, m_i)$
  where the similarity function is defined as $s_O(x, y) = x^\top U_O^\top U_O \, y$.

  30. Question Answering. Given the supporting facts and the query, the best matching word in the dictionary is retrieved. Query: Where is the milk now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. Answer: kitchen.
  $r = \arg\max_{w \in W} s_R(q + m_{o_1} + m_{o_2}, w)$
  where the similarity function is defined as $s_R(x, y) = x^\top U_R^\top U_R \, y$.
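
For the harder question the same machinery is applied twice: the second hop conditions on the query plus the first retrieved fact, and the response then conditions on all three. A sketch continuing the toy setup above (reusing `s_bilinear` and `O1`):

```python
def O2(q, m_o1, memory, U_O):
    """Second supporting fact: best memory for the query plus the first fact."""
    x = q + m_o1
    return int(np.argmax([s_bilinear(x, m_i, U_O) for m_i in memory]))

def answer(q, memory, word_vectors, U_O, U_R):
    """Full two-hop pipeline: O1, then O2, then R over the dictionary."""
    o1 = O1(q, memory, U_O)
    o2 = O2(q, memory[o1], memory, U_O)
    x = q + memory[o1] + memory[o2]
    return int(np.argmax([s_bilinear(x, w, U_R) for w in word_vectors]))
```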

  31. Training. Training is then performed with a hinge loss and stochastic gradient descent (SGD).
  $\sum_{f \neq m_{o_1}} \max(0, \gamma - s_O(q, m_{o_1}) + s_O(q, f)) \, +$

  32. Training. Training is then performed with a hinge loss and stochastic gradient descent (SGD).
  $\sum_{f \neq m_{o_1}} \max(0, \gamma - s_O(q, m_{o_1}) + s_O(q, f)) \, +$
  $\sum_{f' \neq m_{o_2}} \max(0, \gamma - s_O(q + m_{o_1}, m_{o_2}) + s_O(q + m_{o_1}, f')) \, +$

  33. Training. Training is then performed with a hinge loss and stochastic gradient descent (SGD).
  $\sum_{f \neq m_{o_1}} \max(0, \gamma - s_O(q, m_{o_1}) + s_O(q, f)) \, +$
  $\sum_{f' \neq m_{o_2}} \max(0, \gamma - s_O(q + m_{o_1}, m_{o_2}) + s_O(q + m_{o_1}, f')) \, +$
  $\sum_{\hat{r} \neq r} \max(0, \gamma - s_R(q + m_{o_1} + m_{o_2}, r) + s_R(q + m_{o_1} + m_{o_2}, \hat{r}))$
  In practice, negative sampling is used instead of the full sums.
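
A sketch of the per-example margin loss, using a single sampled negative per term rather than the full sums (as the slide suggests) and, for brevity, the same sampled negative fact for the first two terms; the SGD updates of U_O and U_R are omitted:

```python
def hinge(pos, neg, gamma=0.1):
    """max(0, gamma - s(correct) + s(negative))"""
    return max(0.0, gamma - pos + neg)

def example_loss(q, memory, o1, o2, r_vec, neg_fact, neg_word, U_O, U_R, gamma=0.1):
    """Sum of the three margin terms for one (question, supporting facts, answer) example."""
    x1, x2 = q, q + memory[o1]
    x3 = q + memory[o1] + memory[o2]
    loss  = hinge(s_bilinear(x1, memory[o1], U_O), s_bilinear(x1, neg_fact, U_O), gamma)
    loss += hinge(s_bilinear(x2, memory[o2], U_O), s_bilinear(x2, neg_fact, U_O), gamma)
    loss += hinge(s_bilinear(x3, r_vec, U_R),      s_bilinear(x3, neg_word, U_R), gamma)
    return loss
```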

  34. Experiments. Question answering with artificially generated data.
  model      accuracy
  RNN        17.8%
  LSTM       29.0%
  MN k=1     44.4%
  MN k=2     99.9%

  35. How many problems can you spot?

  36. How many problems can you spot? ◮ Single word as answer.

  37. How many problems can you spot? ◮ Single word as answer. ◮ Need to iterate over the entire memory.

  38. How many problems can you spot? ◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive.

  39. How many problems can you spot? ◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly ...

  40. How many problems can you spot? ◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly, fully ...

  41. How many problems can you spot? ◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly, fully, extremely ...

  42. How many problems can you spot? ◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly, fully, extremely supervised.

  43. Outline Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

  44. Introduction ◮ The argmax is replaced by a soft attention mechanism. ◮ Less supervision: there is no need for annotated supporting facts.
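
The key change is that the hard argmax over memories becomes a differentiable, softmax-weighted read. A minimal sketch of this attention read, assuming the query and memory embeddings are already computed (the actual end-to-end model uses separate input and output embedding matrices and multiple hops):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)           # numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_read(q_emb, memory_embs):
    """Soft attention over memories: weighted sum instead of picking a single slot."""
    scores = np.array([float(q_emb @ m) for m in memory_embs])
    p = softmax(scores)                        # attention weights, fully differentiable
    return sum(w * m for w, m in zip(p, memory_embs))
```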

  45. QA example. Transform the sentences into vector representations, write the representations to memory, and transform the query into its vector representation. Query: Where is Dan now? Memory: Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.
