slide-1
SLIDE 1

Memory Networks and Neural Turing Machines

Diego Marcheggiani

University of Amsterdam ILLC

Unsupervised Language Learning 2016

slide-2
SLIDE 2

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-3
SLIDE 3

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-4
SLIDE 4

Motivation

◮ Neural networks have a hard time capturing long-range dependencies.

slide-5
SLIDE 5

Motivation

◮ Neural networks have a hard time capturing long-range dependencies. Yes, even LSTMs.
slide-6
SLIDE 6

Motivation

◮ Neural networks have a hard time capturing long-range dependencies. Yes, even LSTMs.

◮ Memory networks (MN) and Neural Turing machines (NTM) try to overcome this problem using an external memory.

◮ MN are mainly motivated by the difficulty of capturing long-range dependencies,

◮ while NTM are devised to perform program induction.

slide-7
SLIDE 7

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-8
SLIDE 8

General idea

◮ We have a neural network, RNN, MLP, ...

slide-9
SLIDE 9

General idea

◮ We have a neural network, RNN, MLP, ... ◮ An external memory.

slide-10
SLIDE 10

General idea

◮ We have a neural network, RNN, MLP, it does not matter. ◮ An external memory that the neural network can write to and read from.

(diagram: the network writing to the memory)

slide-11
SLIDE 11

General idea

◮ We have a neural network, RNN, MLP, it does not matter. ◮ An external memory that the neural network can write to and read from.

(diagram: the network writing to and reading from the memory)

slide-12
SLIDE 12

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

slide-13
SLIDE 13

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

◮ Generalization (G): writes the input, or a function of it, to the memory

slide-14
SLIDE 14

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

◮ Generalization (G): writes the input, or a function of it, to the memory

◮ Output feature map (O): reads the most relevant memory slots

slide-15
SLIDE 15

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

◮ Generalization (G): writes the input, or a function of it, to the memory

◮ Output feature map (O): reads the most relevant memory slots ◮ Response (R): given the info read from the memory, returns the output
slide-16
SLIDE 16

Memory network components

◮ Input feature map (I): transforms the input into a feature representation, e.g., bag of words

◮ Generalization (G): writes the input, or a function of it, to the memory

◮ Output feature map (O): reads the most relevant memory slots ◮ Response (R): given the info read from the memory, returns the output

An extremely general framework, which can be instantiated in several ways.

slide-17
SLIDE 17

Question answering

◮ Input text:
Fred moved to the bedroom.
Joe went to the kitchen.
Joe took the milk.
Dan journeyed to the bedroom.

slide-18
SLIDE 18

Question answering

◮ Input text:
Fred moved to the bedroom.
Joe went to the kitchen.
Joe took the milk.
Dan journeyed to the bedroom.

◮ Input question: Where is Dan now?

slide-19
SLIDE 19

Question answering

◮ Input text:
Fred moved to the bedroom.
Joe went to the kitchen.
Joe took the milk.
Dan journeyed to the bedroom.

◮ Input question: Where is Dan now? ◮ Output answer: bedroom

Let’s see a simple instantiation of memory networks for QA.

slide-20
SLIDE 20

I component

The raw text sentence is transformed into its vector representation, e.g., bag of words, with the component I.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

slide-21
SLIDE 21

I component

The raw text sentence is transformed into its vector representation, e.g., bag of words, with the component I.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom.

slide-22
SLIDE 22

G component

The sentences are then written to the memory sequentially, via the component G.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m

Notice that in this approach the memory is fixed: once written, it is not changed, neither during learning nor during testing.

slide-23
SLIDE 23

I component

The question is transformed into its vector representation with the component I.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now?

slide-24
SLIDE 24

O component

The best matching memory (supporting fact) according to the question is retrieved with the component O.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now?

o1 = O1(q, m) = argmax_{i=1,...,N} s_O(q, m_i)

where the similarity function is defined as: s_O(x, y) = x^T · U_O^T · U_O · y

slide-25
SLIDE 25

R component

Given the supporting fact and the query, the best matching word in the dictionary is retrieved.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? Answer: bedroom

r = argmax_{w ∈ W} s_R(q + m_o1, w)

where the similarity function is defined as: s_R(x, y) = x^T · U_R^T · U_R · y
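
To make the O and R steps concrete, here is a minimal numpy sketch of this one-hop inference, assuming bag-of-words vectors and randomly initialised embedding matrices U_O, U_R; the shapes and helper names are illustrative, not from the original code:

```python
import numpy as np

def s(x, y, U):
    """Bilinear similarity s(x, y) = x^T U^T U y, used by both O and R."""
    return (U @ x) @ (U @ y)

def answer_one_hop(q, memories, vocab_vecs, U_O, U_R):
    """Retrieve one supporting fact, then the best matching dictionary word."""
    # O component: index of the best matching memory (supporting fact)
    o1 = max(range(len(memories)), key=lambda i: s(q, memories[i], U_O))
    # R component: score every word vector against q + m_o1
    scores = [s(q + memories[o1], w, U_R) for w in vocab_vecs]
    return int(np.argmax(scores))

# toy example with random bag-of-words vectors (input dim d=10, embedding dim n=5)
rng = np.random.default_rng(0)
d, n = 10, 5
U_O, U_R = rng.normal(size=(n, d)), rng.normal(size=(n, d))
memories = [rng.normal(size=d) for _ in range(4)]   # four stored sentences
vocab = [rng.normal(size=d) for _ in range(20)]     # twenty dictionary words
q = rng.normal(size=d)                              # the question vector
print(answer_one_hop(q, memories, vocab, U_O, U_R))
```

With trained matrices, the same two argmax calls return the supporting fact and the answer word.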

slide-26
SLIDE 26

R component

Given the supporting fact and the query, the best matching word in the dictionary is retrieved.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? Answer: bedroom

r = argmax_{w ∈ W} s_R(q + m_o1, w)

where the similarity function is defined as: s_R(x, y) = x^T · U_R^T · U_R · y

What about harder questions?

slide-27
SLIDE 27

Question Answering

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now?

slide-28
SLIDE 28

Question Answering

The best matching memory (supporting fact) according to the question is retrieved with the component O.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now?

o1 = O1(q, m) = argmax_{i=1,...,N} s_O(q, m_i)

where the similarity function is defined as: s_O(x, y) = x^T · U_O^T · U_O · y

slide-29
SLIDE 29

Question Answering

We also need another supporting fact.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now?

o2 = O2(q + m_o1, m) = argmax_{i=1,...,N} s_O(q + m_o1, m_i)

where the similarity function is defined as: s_O(x, y) = x^T · U_O^T · U_O · y

slide-30
SLIDE 30

Question Answering

Given the supporting facts and the query, the best matching word in the dictionary is retrieved.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? Answer: kitchen

r = argmax_{w ∈ W} s_R(q + m_o1 + m_o2, w)

where the similarity function is defined as: s_R(x, y) = x^T · U_R^T · U_R · y
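
A hedged sketch of the two-hop retrieval with the same bilinear similarity; only the query used at each step changes. Function and variable names are placeholders, and the setup (memories, vocab_vecs, U_O, U_R) is the same toy setup as in the earlier sketch:

```python
import numpy as np

def s(x, y, U):
    """Bilinear similarity s(x, y) = x^T U^T U y."""
    return (U @ x) @ (U @ y)

def answer_two_hops(q, memories, vocab_vecs, U_O, U_R):
    """Two supporting facts, then the best matching dictionary word."""
    # first supporting fact: score memories against the question alone
    o1 = max(range(len(memories)), key=lambda i: s(q, memories[i], U_O))
    # second supporting fact: score memories against q + m_o1
    o2 = max(range(len(memories)),
             key=lambda i: s(q + memories[o1], memories[i], U_O))
    # R component: best word given q + m_o1 + m_o2
    ctx = q + memories[o1] + memories[o2]
    return int(np.argmax([s(ctx, w, U_R) for w in vocab_vecs]))
```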

slide-31
SLIDE 31

Training

Training is then performed with a hinge loss and stochastic gradient descent (SGD).

Σ_{f ≠ m_o1} max(0, γ − s_O(q, m_o1) + s_O(q, f)) +

slide-32
SLIDE 32

Training

Training is then performed with a hinge loss and stochastic gradient descent (SGD).

Σ_{f ≠ m_o1} max(0, γ − s_O(q, m_o1) + s_O(q, f)) +

Σ_{f′ ≠ m_o2} max(0, γ − s_O(q + m_o1, m_o2) + s_O(q + m_o1, f′)) +

slide-33
SLIDE 33

Training

Training is then performed with a hinge loss and stochastic gradient descent (SGD).

Σ_{f ≠ m_o1} max(0, γ − s_O(q, m_o1) + s_O(q, f)) +

Σ_{f′ ≠ m_o2} max(0, γ − s_O(q + m_o1, m_o2) + s_O(q + m_o1, f′)) +

Σ_{r̂ ≠ r} max(0, γ − s_R(q + m_o1 + m_o2, r) + s_R(q + m_o1 + m_o2, r̂))

Negative sampling is used instead of the full sums.
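
A sketch of this objective with one sampled negative per term, assuming the same bilinear similarity s as in the earlier sketches; the margin value and the function names are placeholders:

```python
import numpy as np

def s(x, y, U):
    """Bilinear similarity s(x, y) = x^T U^T U y."""
    return (U @ x) @ (U @ y)

def hinge(pos, neg, gamma):
    """max(0, gamma - pos + neg): penalise a negative scoring within the margin."""
    return max(0.0, gamma - pos + neg)

def mn_loss(q, m_o1, m_o2, r_vec, f, f2, r_neg, U_O, U_R, gamma=0.1):
    """The three margin terms with one sampled negative each
    (f != m_o1, f2 != m_o2, r_neg != r), minimised with SGD over U_O and U_R."""
    loss  = hinge(s(q, m_o1, U_O), s(q, f, U_O), gamma)
    loss += hinge(s(q + m_o1, m_o2, U_O), s(q + m_o1, f2, U_O), gamma)
    ctx = q + m_o1 + m_o2
    loss += hinge(s(ctx, r_vec, U_R), s(ctx, r_neg, U_R), gamma)
    return loss
```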

slide-34
SLIDE 34

Experiments

Question answering with artificially generated data.

model     accuracy
RNN       17.8 %
LSTM      29.0 %
MN k=1    44.4 %
MN k=2    99.9 %

slide-35
SLIDE 35

How many problems can you spot?

slide-36
SLIDE 36

How many problems can you spot?

◮ Single word as answer.

slide-37
SLIDE 37

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory.

slide-38
SLIDE 38

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive.

slide-39
SLIDE 39

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly

slide-40
SLIDE 40

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly fully

slide-41
SLIDE 41

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly fully extremely

slide-42
SLIDE 42

How many problems can you spot?

◮ Single word as answer. ◮ Need to iterate over the entire memory. ◮ The write component is somewhat naive. ◮ Strongly fully extremely supervised.

slide-43
SLIDE 43

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-44
SLIDE 44

Introduction

◮ The argmax is replaced by a soft attention mechanism ◮ Less supervision: no need for annotated supporting facts

slide-45
SLIDE 45

QA example

Transform the sentences into vector representations, write the representations to the memory, and transform the query into its vector representation.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now?

slide-46
SLIDE 46

QA example

For each sentence in memory, calculate the "level of compatibility" between the sentence and the query (soft attention).

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? B A softmax(u^T A m)

u = B · q
p_i = softmax(u^T · A · m_i)

slide-47
SLIDE 47

QA example

Calculate the weighted output representation.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? B A softmax(u^T A m) C

c_i = C · m_i

where c is the output memory representation,

o = Σ_i p_i · c_i

and o is the weighted output representation.

slide-48
SLIDE 48

QA example

Calculate the most likely answer given the query and the output memory representation.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is Dan now? B A softmax(u^T A m) C

Answer: bedroom

r = argmax_{w ∈ W} (w · (o + u))
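
A minimal numpy sketch of one end-to-end hop, assuming bag-of-words inputs and randomly initialised matrices A, B, C, W; the shapes and toy data are illustrative only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_one_hop(q_bow, mem_bow, A, B, C, W):
    """Single-hop end-to-end memory network forward pass."""
    u = B @ q_bow                                            # query embedding
    p = softmax(np.array([u @ (A @ m) for m in mem_bow]))    # attention over memories
    o = sum(p_i * (C @ m) for p_i, m in zip(p, mem_bow))     # weighted output memory
    return int(np.argmax(W @ (o + u)))                       # most likely answer word

rng = np.random.default_rng(0)
V, d = 30, 8                                    # vocabulary size, embedding size
A, B, C = (rng.normal(size=(d, V)) for _ in range(3))
W = rng.normal(size=(V, d))
memories = [rng.integers(0, 2, size=V).astype(float) for _ in range(4)]
q = rng.integers(0, 2, size=V).astype(float)
print(memn2n_one_hop(q, memories, A, B, C, W))
```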

slide-49
SLIDE 49

QA example

As in the fully supervised case, we can perform multiple readings of the memory given the previous result.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now?

slide-50
SLIDE 50

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u^T A1 m)

u1 = B · q
p1_i = softmax(u1^T · A1 · m_i)

slide-51
SLIDE 51

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

c1_i = C1 · m_i

where c1 is the output memory representation at the first hop,

o1 = Σ_i p1_i · c1_i

and o1 is the weighted output representation at the first hop.

slide-52
SLIDE 52

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

u2 = u1 + o1

slide-53
SLIDE 53

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

p2_i = softmax(u2^T · A2 · m_i)

slide-54
SLIDE 54

QA example

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

c2_i = C2 · m_i

where c2 is the output memory representation at the second hop,

o2 = Σ_i p2_i · c2_i

and o2 is the weighted output representation at the second hop.

slide-55
SLIDE 55

QA example

As in the fully supervised case, we can perform multiple readings of the memory given the previous result.

Fred moved to the bedroom. Joe went to the kitchen. Joe took the milk. Dan journeyed to the bedroom. m Where is the milk now? B A1 softmax(u1^T A1 m) C1

Answer: kitchen

r = argmax_{w ∈ W} (w · (o2 + u2))
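
A sketch of the multi-hop forward pass under the same assumptions as the single-hop snippet, keeping per-hop matrices A^k, C^k in lists and updating the query with u^{k+1} = u^k + o^k:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_k_hops(q_bow, mem_bow, B, A_list, C_list, W):
    """K-hop end-to-end memory network forward pass (K = len(A_list))."""
    u = B @ q_bow
    for A, C in zip(A_list, C_list):
        p = softmax(np.array([u @ (A @ m) for m in mem_bow]))   # attention at this hop
        o = sum(p_i * (C @ m) for p_i, m in zip(p, mem_bow))    # weighted output memory
        u = u + o                                               # u^{k+1} = u^k + o^k
    return int(np.argmax(W @ u))    # after the last hop u = o^K + u^K, as on the slide
```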

slide-56
SLIDE 56

Experiments

Question answering toy tasks, Weston et al. (2016).

model             mean error
Strongly sup. MN  6.7 %
LSTM              51.3 %
EEMN k=1          25.8 %
EEMN k=2          15.6 %
EEMN k=3          13.3 %

slide-57
SLIDE 57

Outline

Motivation Memory Networks End-to-end Memory Networks Neural Turing Machines

slide-58
SLIDE 58

Introduction

◮ Like Turing machines, NTMs have a controller, a memory, a write head, and a read head. NTMs are differentiable.

◮ Differently from Memory Networks,

◮ the attention mechanism of NTMs is more sophisticated. ◮ NTMs are already equipped for rewriting the memory.

slide-59
SLIDE 59

Read head

◮ The memory can be updated during training and testing; at each time t we have a memory M_t.

slide-60
SLIDE 60

Read head

◮ The memory can be updated during training and testing; at each time t we have a memory M_t.

◮ w^r_t is the weighting vector over the memory at time t; it is constrained to be a probability distribution.

slide-61
SLIDE 61

Read head

◮ The memory can be updated during training and testing; at each time t we have a memory M_t.

◮ w^r_t is the weighting vector over the memory at time t; it is constrained to be a probability distribution.

◮ The read vector is calculated as:

r_t = Σ_i w^r_t(i) · M_t(i)

w^r_t is emitted by the controller.
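
In code, reading is just a weighted sum of the memory rows; a small sketch assuming M_t is an N×M numpy array and w^r_t a length-N distribution:

```python
import numpy as np

def ntm_read(M, w):
    """r_t = sum_i w^r_t(i) * M_t(i): a weighted sum of the memory rows."""
    return w @ M                               # (N,) @ (N, M) -> (M,)

M = np.arange(12, dtype=float).reshape(4, 3)   # 4 memory locations of width 3
w = np.array([0.1, 0.6, 0.2, 0.1])             # read weighting from the controller
print(ntm_read(M, w))
```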

slide-62
SLIDE 62

Write head

◮ w^w_t is the write weighting vector (emitted by the controller). ◮ M_t(i) represents memory location i at time step t.

slide-63
SLIDE 63

Write head

◮ w^w_t is the write weighting vector (emitted by the controller). ◮ M_t(i) represents memory location i at time step t. ◮ The write operation is composed of two parts: ◮ erase part ◮ the controller emits an erase vector e_t with values in the range (0, 1)

M̃_t(i) = M_{t−1}(i) · [1 − w^w_t(i) · e_t]

slide-64
SLIDE 64

Write head

◮ w^w_t is the write weighting vector (emitted by the controller). ◮ M_t(i) represents memory location i at time step t. ◮ The write operation is composed of two parts: ◮ erase part ◮ the controller emits an erase vector e_t with values in the range (0, 1)

M̃_t(i) = M_{t−1}(i) · [1 − w^w_t(i) · e_t]

◮ add part ◮ the controller emits an add vector a_t

M_t(i) = M̃_t(i) + w^w_t(i) · a_t
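
A sketch of the two-step write, assuming w^w_t, e_t and a_t are numpy vectors emitted by the controller; the values below are toy inputs:

```python
import numpy as np

def ntm_write(M_prev, w, e, a):
    """Erase then add: M_t(i) = M_{t-1}(i) * (1 - w(i) * e) + w(i) * a."""
    M_tilde = M_prev * (1.0 - np.outer(w, e))   # erase part
    return M_tilde + np.outer(w, a)             # add part

M = np.ones((4, 3))                    # 4 memory locations of width 3
w = np.array([0.0, 1.0, 0.0, 0.0])     # write weighting: focus on location 1
e = np.array([1.0, 1.0, 1.0])          # erase everything at the focused location
a = np.array([0.5, 0.5, 0.5])          # then add this vector
print(ntm_write(M, w, e, a))
```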

slide-65
SLIDE 65

Addressing mechanism

How do we get w_t?

◮ content-based addressing ◮ location-based addressing

slide-66
SLIDE 66

Content-based addressing

◮ the controller emits a key vector k_t ◮ the key vector is compared to the memory via cosine similarity K[·, ·]

◮ w^c_t(i) = exp(β_t · K[k_t, M_t(i)]) / Σ_j exp(β_t · K[k_t, M_t(j)])

◮ β_t is a scalar emitted by the controller that attenuates or amplifies the precision of the focus
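
A sketch of content-based addressing with cosine similarity and a softmax sharpened by β_t; the small epsilon guarding against division by zero is my addition:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def content_addressing(M, k, beta):
    """w^c_t(i) proportional to exp(beta * cosine_similarity(k, M_t(i)))."""
    cos = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    return softmax(beta * cos)

M = np.random.default_rng(0).normal(size=(4, 3))   # 4 memory locations of width 3
k = M[2] + 0.01                                    # a key close to location 2
print(content_addressing(M, k, beta=5.0))          # most of the mass on location 2
```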

slide-67
SLIDE 67

Location-based addressing

The interpolation gate g_t decides how much of the content-based weighting is preserved:

w^g_t = g_t · w^c_t + (1 − g_t) · w_{t−1}

slide-68
SLIDE 68

Location-based addressing

Convolutional shift (as in Turing machines):

w̃_t(i) = Σ_{j=0}^{N−1} w^g_t(j) · s_t(i − j)

The shift weighting s_t is emitted by the controller and is a distribution over the possible shifts.
slide-69
SLIDE 69

Location-based addressing

Sharpening:

w_t(i) = w̃_t(i)^{γ_t} / Σ_j w̃_t(j)^{γ_t}

This operation is useful when the shift weighting is not sharp.
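
Putting the three location-based steps together, a sketch where the shift distribution s_t is indexed modulo N (an assumption about how the shifts are laid out, not stated on the slides):

```python
import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    """Interpolation gate, circular convolutional shift, then sharpening."""
    w_g = g * w_c + (1.0 - g) * w_prev                  # interpolation
    N = len(w_g)
    # circular convolution: w~(i) = sum_j w_g(j) * s((i - j) mod N)
    w_shift = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                        for i in range(N)])
    w_sharp = w_shift ** gamma                          # sharpening
    return w_sharp / w_sharp.sum()

w_c = np.array([0.1, 0.7, 0.1, 0.1])          # content-based weighting
w_prev = np.array([0.25, 0.25, 0.25, 0.25])   # previous weighting
s = np.array([0.0, 1.0, 0.0, 0.0])            # all mass on a shift of +1
print(location_addressing(w_c, w_prev, g=1.0, s=s, gamma=2.0))
```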

slide-70
SLIDE 70

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at

slide-71
SLIDE 71

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et

slide-72
SLIDE 72

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt

slide-73
SLIDE 73

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt

slide-74
SLIDE 74

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt, gt

slide-75
SLIDE 75

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt, gt, st

slide-76
SLIDE 76

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt, gt, st, γt

slide-77
SLIDE 77

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ it takes as input a vector xt and the memory Mt ◮ the output is

yt = (at, et, {kt, βt, gt, st, γt}r

slide-78
SLIDE 78

Controller network

◮ It can be a recurrent or a feedforward neural network. ◮ It takes as input a vector x_t and the memory M_t. ◮ The output is

y_t = (a_t, e_t, {k_t, β_t, g_t, s_t, γ_t}^r, {k_t, β_t, g_t, s_t, γ_t}^w)

◮ The emissions for the write and read heads, and the erase and add vectors, are adjusted to meet their constraints.
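
A sketch of how a flat controller output for one head could be split and squashed to meet these constraints; the layout and the choice of activations (softplus for β_t and γ_t, sigmoid for g_t, softmax for s_t) are assumptions, not prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def split_head_params(raw, mem_width, n_shifts):
    """Split one head's raw outputs into (k, beta, g, s, gamma) and apply constraints."""
    k = raw[:mem_width]                              # key vector, unconstrained
    beta = np.log1p(np.exp(raw[mem_width]))          # softplus: beta >= 0
    g = sigmoid(raw[mem_width + 1])                  # interpolation gate in (0, 1)
    s = softmax(raw[mem_width + 2:mem_width + 2 + n_shifts])   # shift distribution
    gamma = 1.0 + np.log1p(np.exp(raw[-1]))          # sharpening exponent >= 1
    return k, beta, g, s, gamma

raw = np.random.default_rng(0).normal(size=3 + 1 + 1 + 3 + 1)  # mem_width=3, n_shifts=3
print(split_head_params(raw, mem_width=3, n_shifts=3))
```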

slide-79
SLIDE 79

Experiments

Can a neural network learn procedures/programs?

◮ copy task ◮ repeat copy ◮ associative recall ◮ sorting

On all these tasks NTM outperforms LSTM.

slide-80
SLIDE 80

Copy task demo

https://thumbs.gfycat.com/WelllitInferiorAndeancondor-mobile.mp4

slide-81
SLIDE 81

Extensions

Program induction papers:

◮ Neural Programmer: Inducing Latent Programs with Gradient Descent
◮ Neural Programmer-Interpreters
◮ Reinforcement Learning Neural Turing Machines - Revised
◮ Neural Random-Access Machines
◮ Neural GPUs Learn Algorithms

Memory networks extensions:

◮ Ask Me Anything: Dynamic Memory Networks for Natural Language Processing
◮ The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations

slide-82
SLIDE 82

Lecture recap

◮ Memory networks ◮ End-to-end memory networks ◮ Neural Turing machines