slide-1
SLIDE 1

Memory Networks and Neural Turing Machines

Diego Marcheggiani

University of Amsterdam ILLC

Unsupervised Language Learning 2016

slide-2
SLIDE 2

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-3
SLIDE 3

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-4
SLIDE 4

Motivation

◮ Neural networks have a hard time capturing long-range dependencies.

slide-5
SLIDE 5

Motivation

◮ Neural networks have a hard time capturing long-range dependencies. Yes, even LSTMs.
slide-6
SLIDE 6

Motivation

◮ Neural networks have a hard time capturing long-range dependencies. Yes, even LSTMs.

◮ Memory networks (MN) and Neural Turing machines (NTM) try to overcome this problem using an external memory.

◮ MN are mainly motivated by the difficulty of capturing long-range dependencies,

◮ while NTM are devised to perform program induction.

slide-7
SLIDE 7

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-8
SLIDE 8

General idea

◮ We have a neural network, RNN, MLP, ...

slide-9
SLIDE 9

General idea

◮ We have a neural network, RNN, MLP, ... ◮ An external memory.

slide-10
SLIDE 10

General idea

◮ We have a neural network, RNN, MLP, it does not matter. ◮ An external memory that the neural network can write to and read from.

(diagram: the network writing to the memory)

slide-11
SLIDE 11

General idea

◮ We have a neural network, RNN, MLP, it does not matter. ◮ An external memory that the neural network can write to and read from.

(diagram: the network writing to and reading from the memory)

slide-12
SLIDE 12

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

slide-13
SLIDE 13

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

◮ Generalization (G): writes the input, or a function of it, to the memory

slide-14
SLIDE 14

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

◮ Generalization (G): writes the input, or a function of it, to the memory

◮ Output feature map (O): reads the most relevant memory slots

slide-15
SLIDE 15

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

◮ Generalization (G): writes the input, or a function of it, to the memory

◮ Output feature map (O): reads the most relevant memory slots ◮ Response (R): given the info read from the memory, returns the output
slide-16
SLIDE 16

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

◮ Generalization (G): writes the input, or a function of it, to the memory

◮ Output feature map (O): reads the most relevant memory slots ◮ Response (R): given the info read from the memory, returns the output

An extremely general framework, which can be instantiated in several ways.

slide-17
SLIDE 17

Question answering

◮ Input text:
Fred moved to the bedroom.
Joe went to the kitchen.
Joe took the milk.
Dan journeyed to the bedroom.

slide-18
SLIDE 18

Question answering

◮ Input text:
Fred moved to the bedroom.
Joe went to the kitchen.
Joe took the milk.
Dan journeyed to the bedroom.

◮ Input question: Where is Dan now?

slide-19
SLIDE 19

Question answering

◮ Input text:
Fred moved to the bedroom.
Joe went to the kitchen.
Joe took the milk.
Dan journeyed to the bedroom.

◮ Input question: Where is Dan now? ◮ Output answer: bedroom

Let’s see a simple instantiation of memory networks for QA.

slide-20
SLIDE 20

I component

The raw text sentence is transformed into its vector representation, e.g., bag of words, with the component I.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

slide-21
SLIDE 21

I component

The raw text sentence is transformed into its vector representation, e.g., bag of words, with the component I.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

slide-22
SLIDE 22

G component

The sentences are then written to the memory sequentially, via the component G.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m

Notice that in this approach the memory is fixed: once written, it is not changed, neither during learning nor during testing.

slide-23
SLIDE 23

I component

The question is transformed into its vector representation with the component I.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now?

slide-24
SLIDE 24

O component

The best matching memory (supporting fact) according to the question is retrieved with the component O.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now?

o1 = O1(q, m) = argmax_{i=1,...,N} s_O(q, m_i)

where the similarity function is defined as: s_O(x, y) = x^T · U_O^T · U_O · y

slide-25
SLIDE 25

R component

Given the supporting fact and the query, the best matching word in the dictionary is retrieved.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? Answer: bedroom

r = argmax_{w ∈ W} s_R(q + m_o1, w)

where the similarity function is defined as: s_R(x, y) = x^T · U_R^T · U_R · y
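
To make the O and R steps concrete, here is a minimal numpy sketch of this one-hop inference, assuming bag-of-words vectors and randomly initialised embedding matrices U_O, U_R; the shapes and helper names are illustrative, not from the original code:

```python
import numpy as np

def s(x, y, U):
    """Bilinear similarity s(x, y) = x^T U^T U y, used by both O and R."""
    return (U @ x) @ (U @ y)

def answer_one_hop(q, memories, vocab_vecs, U_O, U_R):
    """Retrieve one supporting fact, then the best matching dictionary word."""
    # O component: index of the best matching memory (supporting fact)
    o1 = max(range(len(memories)), key=lambda i: s(q, memories[i], U_O))
    # R component: score every word vector against q + m_o1
    scores = [s(q + memories[o1], w, U_R) for w in vocab_vecs]
    return int(np.argmax(scores))

# toy example with random bag-of-words vectors (input dim d=10, embedding dim n=5)
rng = np.random.default_rng(0)
d, n = 10, 5
U_O, U_R = rng.normal(size=(n, d)), rng.normal(size=(n, d))
memories = [rng.normal(size=d) for _ in range(4)]   # four stored sentences
vocab = [rng.normal(size=d) for _ in range(20)]     # twenty dictionary words
q = rng.normal(size=d)                              # the question vector
print(answer_one_hop(q, memories, vocab, U_O, U_R))
```

With trained matrices, the same two argmax calls return the supporting fact and the answer word.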

slide-26
SLIDE 26

R component

Given the supporting fact and the query, the best matching word in the dictionary is retrieved.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? Answer: bedroom

r = argmax_{w ∈ W} s_R(q + m_o1, w)

where the similarity function is defined as: s_R(x, y) = x^T · U_R^T · U_R · y

What about harder questions?

slide-27
SLIDE 27

Question Answering

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now?

slide-28
SLIDE 28

Question Answering

The best matching memory (supporting fact) according to the question is retrieved with the component O.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now?

o1 = O1(q, m) = argmax_{i=1,...,N} s_O(q, m_i)

where the similarity function is defined as: s_O(x, y) = x^T · U_O^T · U_O · y

slide-29
SLIDE 29

Question Answering

We also need another supporting fact.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now?

o2 = O2(q + m_o1, m) = argmax_{i=1,...,N} s_O(q + m_o1, m_i)

where the similarity function is defined as: s_O(x, y) = x^T · U_O^T · U_O · y

slide-30
SLIDE 30

Question Answering

Given the supporting facts and the query, the best matching word in the dictionary is retrieved.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? Answer: kitchen

r = argmax_{w ∈ W} s_R(q + m_o1 + m_o2, w)

where the similarity function is defined as: s_R(x, y) = x^T · U_R^T · U_R · y
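
A hedged sketch of the two-hop retrieval with the same bilinear similarity; only the query used at each step changes. Function and variable names are placeholders, and the setup (memories, vocab_vecs, U_O, U_R) is the same toy setup as in the earlier sketch:

```python
import numpy as np

def s(x, y, U):
    """Bilinear similarity s(x, y) = x^T U^T U y."""
    return (U @ x) @ (U @ y)

def answer_two_hops(q, memories, vocab_vecs, U_O, U_R):
    """Two supporting facts, then the best matching dictionary word."""
    # first supporting fact: score memories against the question alone
    o1 = max(range(len(memories)), key=lambda i: s(q, memories[i], U_O))
    # second supporting fact: score memories against q + m_o1
    o2 = max(range(len(memories)),
             key=lambda i: s(q + memories[o1], memories[i], U_O))
    # R component: best word given q + m_o1 + m_o2
    ctx = q + memories[o1] + memories[o2]
    return int(np.argmax([s(ctx, w, U_R) for w in vocab_vecs]))
```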

slide-31
SLIDE 31

Training

Training is then performed with a hinge loss and stochastic gradient descent (SGD).

Σ_{f ≠ m_o1} max(0, γ − s_O(q, m_o1) + s_O(q, f)) +

slide-32
SLIDE 32

Training

Training is then performed with a hinge loss and stochastic gradient descent (SGD).

Σ_{f ≠ m_o1} max(0, γ − s_O(q, m_o1) + s_O(q, f)) +

Σ_{f′ ≠ m_o2} max(0, γ − s_O(q + m_o1, m_o2) + s_O(q + m_o1, f′)) +

slide-33
SLIDE 33

Training

Training is then performed with a hinge loss and stochastic gradient descent (SGD).

Σ_{f ≠ m_o1} max(0, γ − s_O(q, m_o1) + s_O(q, f)) +

Σ_{f′ ≠ m_o2} max(0, γ − s_O(q + m_o1, m_o2) + s_O(q + m_o1, f′)) +

Σ_{r̂ ≠ r} max(0, γ − s_R(q + m_o1 + m_o2, r) + s_R(q + m_o1 + m_o2, r̂))

Negative sampling is used instead of the full sums.
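
A sketch of this objective with one sampled negative per term, assuming the same bilinear similarity s as in the earlier sketches; the margin value and the function names are placeholders:

```python
import numpy as np

def s(x, y, U):
    """Bilinear similarity s(x, y) = x^T U^T U y."""
    return (U @ x) @ (U @ y)

def hinge(pos, neg, gamma):
    """max(0, gamma - pos + neg): penalise a negative scoring within the margin."""
    return max(0.0, gamma - pos + neg)

def mn_loss(q, m_o1, m_o2, r_vec, f, f2, r_neg, U_O, U_R, gamma=0.1):
    """The three margin terms with one sampled negative each
    (f != m_o1, f2 != m_o2, r_neg != r), minimised with SGD over U_O and U_R."""
    loss  = hinge(s(q, m_o1, U_O), s(q, f, U_O), gamma)
    loss += hinge(s(q + m_o1, m_o2, U_O), s(q + m_o1, f2, U_O), gamma)
    ctx = q + m_o1 + m_o2
    loss += hinge(s(ctx, r_vec, U_R), s(ctx, r_neg, U_R), gamma)
    return loss
```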

slide-34
SLIDE 34

Experiments

Question answering with artificially generated data.

model     accuracy
RNN       17.8 %
LSTM      29.0 %
MN k=1    44.4 %
MN k=2    99.9 %

slide-35
SLIDE 35

How many problems can you spot?

slide-36
SLIDE 36

How many problems can you spot?

◮ Single word as answer.

slide-37
SLIDE 37

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory.

slide-38
SLIDE 38

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive.

slide-39
SLIDE 39

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly

slide-40
SLIDE 40

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly fully

slide-41
SLIDE 41

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly fully extremely

slide-42
SLIDE 42

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly fully extremely supervised.

slide-43
SLIDE 43

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-44
SLIDE 44

Introduction

◮ The argmax is replaced by a soft attention mechanism ◮ Less supervision: no need for annotated supporting facts

slide-45
SLIDE 45

QA example

Transform the sentences into vector representations, write the representations to the memory, and transform the query into its vector representation.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now?

slide-46
SLIDE 46

QA example

For each sentence in memory, calculate the "level of compatibility" between the sentence and the query (soft attention).

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? B A softmax(u^T A m)

u = B · q
p_i = softmax(u^T · A · m_i)

slide-47
SLIDE 47

QA example

Calculate the weighted output representation.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? B A softmax(u^T A m) C

c_i = C · m_i

where c is the output memory representation,

o = Σ_i p_i · c_i

and o is the weighted output representation.

slide-48
SLIDE 48

QA example

Calculate the most likely answer given the query and the output memory representation.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? B A softmax(u^T A m) C

Answer: bedroom

r = argmax_{w ∈ W} (w · (o + u))
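
A minimal numpy sketch of one end-to-end hop, assuming bag-of-words inputs and randomly initialised matrices A, B, C, W; the shapes and toy data are illustrative only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_one_hop(q_bow, mem_bow, A, B, C, W):
    """Single-hop end-to-end memory network forward pass."""
    u = B @ q_bow                                            # query embedding
    p = softmax(np.array([u @ (A @ m) for m in mem_bow]))    # attention over memories
    o = sum(p_i * (C @ m) for p_i, m in zip(p, mem_bow))     # weighted output memory
    return int(np.argmax(W @ (o + u)))                       # most likely answer word

rng = np.random.default_rng(0)
V, d = 30, 8                                    # vocabulary size, embedding size
A, B, C = (rng.normal(size=(d, V)) for _ in range(3))
W = rng.normal(size=(V, d))
memories = [rng.integers(0, 2, size=V).astype(float) for _ in range(4)]
q = rng.integers(0, 2, size=V).astype(float)
print(memn2n_one_hop(q, memories, A, B, C, W))
```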

slide-49
SLIDE 49

QA example

As in the fully supervised case, we can perform multiple readings of the memory given the previous result.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now?

slide-50
SLIDE 50

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u^T A1 m)

u1 = B · q
p1_i = softmax(u1^T · A1 · m_i)

slide-51
SLIDE 51

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

c1_i = C1 · m_i

where c1 is the output memory representation at the first hop,

o1 = Σ_i p1_i · c1_i

and o1 is the weighted output representation at the first hop.

slide-52
SLIDE 52

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

u2 = u1 + o1

slide-53
SLIDE 53

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

p2_i = softmax(u2^T · A2 · m_i)

slide-54
SLIDE 54

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

c2_i = C2 · m_i

where c2 is the output memory representation at the second hop,

o2 = Σ_i p2_i · c2_i

and o2 is the weighted output representation at the second hop.

slide-55
SLIDE 55

QA example

As in the fully supervised case, we can perform multiple readings of the memory given the previous result.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

Answer: kitchen

r = argmax_{w ∈ W} (w · (o2 + u2))
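
A sketch of the multi-hop forward pass under the same assumptions as the single-hop snippet, keeping per-hop matrices A^k, C^k in lists and updating the query with u^{k+1} = u^k + o^k:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_k_hops(q_bow, mem_bow, B, A_list, C_list, W):
    """K-hop end-to-end memory network forward pass (K = len(A_list))."""
    u = B @ q_bow
    for A, C in zip(A_list, C_list):
        p = softmax(np.array([u @ (A @ m) for m in mem_bow]))   # attention at this hop
        o = sum(p_i * (C @ m) for p_i, m in zip(p, mem_bow))    # weighted output memory
        u = u + o                                               # u^{k+1} = u^k + o^k
    return int(np.argmax(W @ u))    # after the last hop u = o^K + u^K, as on the slide
```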

slide-56
SLIDE 56

Experiments

Question answering toy tasks, Weston et al. (2016).

model             mean error
Strongly sup. MN  6.7 %
LSTM              51.3 %
EEMN k=1          25.8 %
EEMN k=2          15.6 %
EEMN k=3          13.3 %

slide-57
SLIDE 57

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-58
SLIDE 58

Introduction

◮ Like Turing machines, NTMs have a controller, a memory, a write head, and a read head. NTMs are differentiable.

◮ Differently from Memory Networks,

◮ the attention mechanism of NTMs is more sophisticated. ◮ NTMs are already equipped for rewriting the memory.

slide-59
SLIDE 59

Read head

◮ The memory can be updated during training and testing; at each time t we have a memory M_t.

slide-60
SLIDE 60

Read head

◮ The memory can be updated during training and testing; at each time t we have a memory M_t.

◮ w^r_t is the weighting vector over the memory at time t; it is constrained to be a probability distribution.

slide-61
SLIDE 61

Read head

◮ The memory can be updated during training and testing; at each time t we have a memory M_t.

◮ w^r_t is the weighting vector over the memory at time t; it is constrained to be a probability distribution.

◮ The read vector is calculated as:

r_t = Σ_i w^r_t(i) · M_t(i)

w^r_t is emitted by the controller.
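
In code, reading is just a weighted sum of the memory rows; a small sketch assuming M_t is an N×M numpy array and w^r_t a length-N distribution:

```python
import numpy as np

def ntm_read(M, w):
    """r_t = sum_i w^r_t(i) * M_t(i): a weighted sum of the memory rows."""
    return w @ M                               # (N,) @ (N, M) -> (M,)

M = np.arange(12, dtype=float).reshape(4, 3)   # 4 memory locations of width 3
w = np.array([0.1, 0.6, 0.2, 0.1])             # read weighting from the controller
print(ntm_read(M, w))
```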

slide-62
SLIDE 62

Write head

◮ w^w_t is the write weighting vector (emitted by the controller). ◮ M_t(i) represents memory location i at time step t.

slide-63
SLIDE 63

Write head

◮ w^w_t is the write weighting vector (emitted by the controller). ◮ M_t(i) represents memory location i at time step t. ◮ The write operation is composed of two parts: ◮ erase part ◮ the controller emits an erase vector e_t with values in the range (0, 1)

M̃_t(i) = M_{t−1}(i) · [1 − w^w_t(i) · e_t]

slide-64
SLIDE 64

Write head

◮ w^w_t is the write weighting vector (emitted by the controller). ◮ M_t(i) represents memory location i at time step t. ◮ The write operation is composed of two parts: ◮ erase part ◮ the controller emits an erase vector e_t with values in the range (0, 1)

M̃_t(i) = M_{t−1}(i) · [1 − w^w_t(i) · e_t]

◮ add part ◮ the controller emits an add vector a_t

M_t(i) = M̃_t(i) + w^w_t(i) · a_t
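
A sketch of the two-step write, assuming w^w_t, e_t and a_t are numpy vectors emitted by the controller; the values below are toy inputs:

```python
import numpy as np

def ntm_write(M_prev, w, e, a):
    """Erase then add: M_t(i) = M_{t-1}(i) * (1 - w(i) * e) + w(i) * a."""
    M_tilde = M_prev * (1.0 - np.outer(w, e))   # erase part
    return M_tilde + np.outer(w, a)             # add part

M = np.ones((4, 3))                    # 4 memory locations of width 3
w = np.array([0.0, 1.0, 0.0, 0.0])     # write weighting: focus on location 1
e = np.array([1.0, 1.0, 1.0])          # erase everything at the focused location
a = np.array([0.5, 0.5, 0.5])          # then add this vector
print(ntm_write(M, w, e, a))
```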

slide-65
SLIDE 65

Addressing mechanism

How do we get w_t?

◮ content-based addressing ◮ location-based addressing

slide-66
SLIDE 66

Content-based addressing

◮ the controller emits a key vector k_t ◮ the key vector is compared to the memory via cosine similarity K[·, ·]

◮ w^c_t(i) = exp(β_t · K[k_t, M_t(i)]) / Σ_j exp(β_t · K[k_t, M_t(j)])

◮ β_t is a scalar emitted by the controller that attenuates or amplifies the precision of the focus
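
A sketch of content-based addressing with cosine similarity and a softmax sharpened by β_t; the small epsilon guarding against division by zero is my addition:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def content_addressing(M, k, beta):
    """w^c_t(i) proportional to exp(beta * cosine_similarity(k, M_t(i)))."""
    cos = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    return softmax(beta * cos)

M = np.random.default_rng(0).normal(size=(4, 3))   # 4 memory locations of width 3
k = M[2] + 0.01                                    # a key close to location 2
print(content_addressing(M, k, beta=5.0))          # most of the mass on location 2
```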

slide-67
SLIDE 67

Location-based addressing

The interpolation gate g_t decides how much of the content-based weighting is preserved:

w^g_t = g_t · w^c_t + (1 − g_t) · w_{t−1}

slide-68
SLIDE 68

Location-based addressing

Convolutional shift (as in Turing machines):

w̃_t(i) = Σ_{j=0}^{N−1} w^g_t(j) · s_t(i − j)

The shift weighting s_t is emitted by the controller and is a distribution over the possible shifts.
slide-69
SLIDE 69

Location-based addressing

Sharpening:

w_t(i) = w̃_t(i)^{γ_t} / Σ_j w̃_t(j)^{γ_t}

This operation is useful when the shift weighting is not sharp.
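
Putting the three location-based steps together, a sketch where the shift distribution s_t is indexed modulo N (an assumption about how the shifts are laid out, not stated on the slides):

```python
import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    """Interpolation gate, circular convolutional shift, then sharpening."""
    w_g = g * w_c + (1.0 - g) * w_prev                  # interpolation
    N = len(w_g)
    # circular convolution: w~(i) = sum_j w_g(j) * s((i - j) mod N)
    w_shift = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                        for i in range(N)])
    w_sharp = w_shift ** gamma                          # sharpening
    return w_sharp / w_sharp.sum()

w_c = np.array([0.1, 0.7, 0.1, 0.1])          # content-based weighting
w_prev = np.array([0.25, 0.25, 0.25, 0.25])   # previous weighting
s = np.array([0.0, 1.0, 0.0, 0.0])            # all mass on a shift of +1
print(location_addressing(w_c, w_prev, g=1.0, s=s, gamma=2.0))
```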

slide-70
SLIDE 70

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at

slide-71
SLIDE 71

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et

slide-72
SLIDE 72

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt

slide-73
SLIDE 73

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt

slide-74
SLIDE 74

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt, gt

slide-75
SLIDE 75

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt, gt, st

slide-76
SLIDE 76

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt, gt, st, γt

slide-77
SLIDE 77

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt, gt, st, γt}r

slide-78
SLIDE 78

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ It takes as input a vector x_t and the memory M_t. ◮ The output is

y_t = (a_t, e_t, {k_t, β_t, g_t, s_t, γ_t}^r, {k_t, β_t, g_t, s_t, γ_t}^w)

◮ The emissions for the write and read heads, and the erase and add vectors, are adjusted to meet their constraints.
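
A sketch of how a flat controller output for one head could be split and squashed to meet these constraints; the layout and the choice of activations (softplus for β_t and γ_t, sigmoid for g_t, softmax for s_t) are assumptions, not prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def split_head_params(raw, mem_width, n_shifts):
    """Split one head's raw outputs into (k, beta, g, s, gamma) and apply constraints."""
    k = raw[:mem_width]                              # key vector, unconstrained
    beta = np.log1p(np.exp(raw[mem_width]))          # softplus: beta >= 0
    g = sigmoid(raw[mem_width + 1])                  # interpolation gate in (0, 1)
    s = softmax(raw[mem_width + 2:mem_width + 2 + n_shifts])   # shift distribution
    gamma = 1.0 + np.log1p(np.exp(raw[-1]))          # sharpening exponent >= 1
    return k, beta, g, s, gamma

raw = np.random.default_rng(0).normal(size=3 + 1 + 1 + 3 + 1)  # mem_width=3, n_shifts=3
print(split_head_params(raw, mem_width=3, n_shifts=3))
```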

slide-79
SLIDE 79

Experiments

Can a neural network learn procedures/programs?

◮ copy task ◮ repeat copy ◮ associative recall ◮ sorting

On all these tasks NTM outperforms LSTM.

slide-80
SLIDE 80

Copy task demo

https://thumbs.gfycat.com/WelllitInferiorAndeancondor-mobile.mp4

slide-81
SLIDE 81

Extensions

Program induction papers:

◮ Neural Programmer: Inducing Latent Programs with Gradient Descent
◮ Neural Programmer-Interpreters
◮ Reinforcement Learning Neural Turing Machines - Revised
◮ Neural Random-Access Machines
◮ Neural GPUs Learn Algorithms

Memory networks extensions:

◮ Ask Me Anything: Dynamic Memory Networks for Natural Language Processing
◮ The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations

slide-82
SLIDE 82

Lecture recap

◮ Memory networks ◮ End-to-end memory networks ◮ Neural Turing machines