

1. Soft Attention Models in Deep Networks
Praveen Krishnan, CVIT, IIIT Hyderabad. June 21, 2017.
"Everyone knows what attention is. It is the taking possession of the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought. Focalization, concentration of consciousness are of its essence." (William James, 1890)

2. Outline
◮ Motivation
◮ Primer on prediction using RNNs
◮ Handwriting prediction [Read]
◮ Handwriting synthesis [Write]
◮ Deep Recurrent Attentive Writer [DRAW]

4. Motivation: A Few Questions to Begin
◮ How do we perceive an image and start interpreting it?
◮ Given knowledge of two languages, how do we manually translate sentences between them?
◮ . . .
Figure 1: Left: source Wikipedia. Right: Bahdanau et al., ICLR'15.

5. Motivation: Why Attention?
◮ You don't see every pixel!
◮ You remove the clutter and process the salient parts.
◮ You process one step at a time and aggregate the information in your memory.

6. Attention Mechanism: Definition
In cognitive neuroscience, attention is viewed as a neural system for the selection of information, similar in many ways to the visual, auditory, or motor systems [Posner 1994].
Visual attention components [Tsotsos et al. 1995]:
◮ Selection of a region of interest in the visual field.
◮ Selection of feature dimensions and values of interest.
◮ Control of information flow through the visual cortex.
◮ Shifting from one selected region to the next in time.

7. Attention Mechanism: Attention in Neural Networks
An architecture-level feature that lets a neural network attend to different parts of an image sequentially and aggregate information over time.
Types of attention:
◮ Hard: pick a discrete location to attend to. However, the model is non-differentiable.
◮ Soft: spread the attention weights over the entire image.
In this talk, we limit ourselves to soft attention models, which are differentiable and can be trained with standard back-propagation (a minimal sketch of the soft case follows below). Before we dig deeper, let's brush up on RNNs.
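A minimal sketch (not from the talk; all names and shapes are illustrative) of soft attention as a differentiable weighted sum over feature vectors, with the weights produced by a softmax:

    import numpy as np

    def soft_attention(features, scores):
        """features: (N, D) feature vectors; scores: (N,) relevance scores."""
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()          # softmax: weight spread over all locations
        context = weights @ features      # convex combination, fully differentiable
        return context, weights

    features = np.random.randn(5, 8)      # e.g. 5 image regions, 8-dim features each
    context, weights = soft_attention(features, np.random.randn(5))

A hard-attention variant would instead sample a single index from `weights`, which is what makes that model non-differentiable.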

8. Recurrent Neural Networks
Figure 2: An unrolled recurrent neural network [1].
◮ A neural network with loops.
◮ Captures temporal information.
◮ The issue of long-term dependencies is addressed by gated units (LSTMs, GRUs, . . . ).
◮ Wide range of applications in image captioning, speech processing, language modelling, . . .
[1] colah's blog, Understanding LSTM Networks.

9. LSTM: Long Short-Term Memory Cell

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t tanh(c_t)

◮ Uses memory cells to store information.
◮ The above version uses peephole connections (a NumPy sketch of one step follows below).
Let us now see how we can use LSTMs for prediction.
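A small NumPy sketch of one peephole-LSTM step following these equations; the dictionary of weights, the shapes, and the sigmoid helper are assumptions for illustration, and the peephole weights are taken to be diagonal (hence elementwise):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # W['x*'], W['h*'] are weight matrices; W['c*'] are diagonal peephole weights.
        f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])
        i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])
        c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])
        o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c_t + b['o'])
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t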

10. LSTMs as a Prediction Network
Given an input sample x_t, predict the next sample x_{t+1}.
Prediction problem: learn a distribution P(x_{t+1} | y_t), where x = (x_1, . . . , x_T) is the input sequence, passed through N hidden layers h^n = (h^n_1, . . . , h^n_T) to produce an output sequence y = (y_1, . . . , y_T).

11. LSTMs as a Prediction Network
Choice of predictive distribution (density modelling). The probability the network assigns to the input sequence x is

P(x) = \prod_{t=1}^{T} P(x_{t+1} | y_t)

and the sequence loss is

L(x) = − \sum_{t=1}^{T} log P(x_{t+1} | y_t)

Training is done through back-propagation through time. For example, in text prediction one can parameterise the output distribution using a softmax function.
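For that text-prediction case, a short sketch (illustrative names, softmax output layer assumed) of the sequence loss L(x) = − Σ_t log P(x_{t+1} | y_t):

    import numpy as np

    def sequence_loss(logits, targets):
        """logits: (T, V) unnormalised outputs y_t; targets: (T,) indices of x_{t+1}."""
        loss = 0.0
        for y_t, x_next in zip(logits, targets):
            log_z = y_t.max() + np.log(np.sum(np.exp(y_t - y_t.max())))  # log-sum-exp
            loss -= (y_t[x_next] - log_z)     # -log softmax(y_t)[x_{t+1}]
        return loss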

12. Handwriting Prediction: Problem
Given online handwriting data (recorded pen-tip locations x_1, x_2) at time step t, predict the location of the pen at time step t+1, along with an end-of-stroke variable.
Figure 3: Left: samples of online handwritten data from multiple authors. Right: Demo [2].
[2] Carter et al., Experiments in Handwriting with a Neural Network, Distill, 2016.

13. Handwriting Prediction: Mixture Density Outputs [Graves, arXiv'13]
A mixture of bivariate Gaussians is used to predict (x_1, x_2), while a Bernoulli distribution is used for x_3.

x_t ∈ R × R × {0, 1}
y_t = (e_t, {π^j_t, μ^j_t, σ^j_t, ρ^j_t}_{j=1}^{M})

◮ e_t ∈ (0, 1) is the end-of-stroke probability,
◮ π^j_t ∈ (0, 1) are the mixture weights,
◮ μ^j_t ∈ R^2 are the mean vectors,
◮ σ^j_t > 0 are the standard deviations, and
◮ ρ^j_t ∈ (−1, 1) are the correlations.

Note that x_1, x_2 are now offsets from the previous location, and the above parameters are obtained by normalising the network outputs.
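One way the raw network outputs could be squashed into valid mixture parameters, following the constraints listed above (a sketch; the flat layout of the output vector is an assumption, not the paper's exact ordering):

    import numpy as np

    def mdn_params(raw, M):
        """raw: flat output vector of length 1 + 6*M -> (e, pi, mu, sigma, rho)."""
        e = 1.0 / (1.0 + np.exp(raw[0]))                    # end-of-stroke prob in (0, 1)
        pi_hat, mu, sig_hat, rho_hat = np.split(raw[1:], [M, 3 * M, 5 * M])
        pi = np.exp(pi_hat - pi_hat.max()); pi /= pi.sum()  # softmax mixture weights
        mu = mu.reshape(M, 2)                               # 2-D means (offsets)
        sigma = np.exp(sig_hat).reshape(M, 2)               # sigma > 0
        rho = np.tanh(rho_hat)                              # correlation in (-1, 1)
        return e, pi, mu, sigma, rho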

14. Handwriting Prediction: Mixture Density Outputs [Graves, arXiv'13]
The probability of the next input is given as

P(x_{t+1} | y_t) = [ \sum_{j=1}^{M} π^j_t N(x_{t+1} | μ^j_t, σ^j_t, ρ^j_t) ] × e_t          if (x_{t+1})_3 = 1
P(x_{t+1} | y_t) = [ \sum_{j=1}^{M} π^j_t N(x_{t+1} | μ^j_t, σ^j_t, ρ^j_t) ] × (1 − e_t)    otherwise

As shown earlier, the sequence loss is

L(x) = − \sum_{t=1}^{T} log( \sum_{j=1}^{M} π^j_t N(x_{t+1} | μ^j_t, σ^j_t, ρ^j_t) ) − \sum_{t=1}^{T} { log e_t if (x_{t+1})_3 = 1; log(1 − e_t) otherwise }
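A toy NumPy version of the bivariate Gaussian mixture and the per-step probability above (a sketch, not the original implementation; function and variable names are illustrative):

    import numpy as np

    def bivariate_normal(x, mu, sigma, rho):
        """x: (2,); mu, sigma: (M, 2); rho: (M,) -> density of each component at x."""
        zx = (x[0] - mu[:, 0]) / sigma[:, 0]
        zy = (x[1] - mu[:, 1]) / sigma[:, 1]
        z = zx**2 + zy**2 - 2 * rho * zx * zy
        denom = 2 * np.pi * sigma[:, 0] * sigma[:, 1] * np.sqrt(1 - rho**2)
        return np.exp(-z / (2 * (1 - rho**2))) / denom

    def step_prob(x_next, e, pi, mu, sigma, rho):
        p_xy = np.sum(pi * bivariate_normal(x_next[:2], mu, sigma, rho))
        p_eos = e if x_next[2] == 1 else (1 - e)
        return p_xy * p_eos               # summing -log of this over t gives L(x)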

15. Handwriting Prediction: Visualization
Figure 4: Heat map showing the mixture density outputs for handwriting prediction.

16. Handwriting Prediction: Demo
Available at: Link
Figure 5: Carter et al., Experiments in Handwriting with a Neural Network, Distill, 2016.

18. Handwriting synthesis [Graves, arXiv'13]
HW synthesis: generation of handwriting conditioned on an input text.
Key question: how do we resolve the alignment problem between two sequences of varying length?
Solution: add "attention" as a soft window that is convolved with the input text and given as input to the prediction network, so the model learns to decide which character to write next.

19. Handwriting synthesis

20. Handwriting synthesis
The soft window w_t into the text c at timestep t is defined as

φ(t, u) = \sum_{k=1}^{K} α^k_t exp(−β^k_t (κ^k_t − u)^2)
w_t = \sum_{u=1}^{U} φ(t, u) c_u

φ(t, u) acts as the window weight on c_u (a one-hot encoding) at time t. The soft attention is modelled by a mixture of K Gaussians, where κ_t gives the location, β_t the width, and α_t the importance of each Gaussian.
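A sketch of this soft window computation over the one-hot character sequence c (names and shapes are illustrative assumptions):

    import numpy as np

    def soft_window(alpha, beta, kappa, c):
        """alpha, beta, kappa: (K,) window parameters at time t; c: (U, V) one-hot text."""
        u = np.arange(1, c.shape[0] + 1)                   # character positions 1..U
        phi = np.sum(alpha[:, None]
                     * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2), axis=0)
        w_t = phi @ c                                      # weighted sum of one-hot rows
        return w_t, phi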

21. Handwriting synthesis: Window Parameters

(α̂_t, β̂_t, κ̂_t) = W_{h^1 p} h^1_t + b_p
α_t = exp(α̂_t)
β_t = exp(β̂_t)
κ_t = κ_{t−1} + exp(κ̂_t)

Figure 6: Alignment between the text sequence and the handwriting.
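A sketch of how these window parameters could be produced from the first hidden layer, following the equations above; W_p and b_p are assumed names for the output weights and biases:

    import numpy as np

    def window_params(h1_t, kappa_prev, W_p, b_p, K):
        p = W_p @ h1_t + b_p                    # (3K,) raw (alpha-hat, beta-hat, kappa-hat)
        alpha = np.exp(p[:K])
        beta = np.exp(p[K:2 * K])
        kappa = kappa_prev + np.exp(p[2 * K:])  # running sum, so the window only moves forward
        return alpha, beta, kappa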

25. Handwriting synthesis: Qualitative Results
Questions:
◮ How is stochasticity induced in the generation of different samples?
◮ How does the network decide it has finished writing the text?
◮ How can the quality of the writing be controlled?
◮ How can handwriting be generated in a particular style?

26. Handwriting synthesis: Biased Sampling vs. Primed Sampling

27. Deep Recurrent Attentive Writer (DRAW)
DRAW combines a spatial attention mechanism with a sequential variational auto-encoding framework for the iterative construction of complex images.
Figure 7: MNIST digits drawn using the recurrent attention model.

28. Deep Recurrent Attentive Writer (DRAW)
Major contributions:
◮ Progressive refinement (temporal): suppose C is the canvas on which the image is drawn. The distribution P(C) can be split across the latent intermediate canvases C_1, C_2, . . . , C_{T−1}, with C_T the observed variable:

P(C) = P(C_T | C_{T−1}) P(C_{T−1} | C_{T−2}) . . . P(C_1 | C_0) P(C_0)    (1)

◮ Spatial attention: drawing one part of the canvas at a time, which simplifies the drawing process by deciding "where to look" and "where to write".
Figure 8: Recurrence relation.

29. Deep Recurrent Attentive Writer (DRAW)
Figure 9: Left: traditional VAE network. Right: DRAW network.
◮ The encoder and decoder are recurrent networks.
◮ The encoder sees the decoder's previous output and can tailor its current output accordingly, while the decoder's outputs are successively added to the distribution that generates the image.
◮ A dynamic attention mechanism decides "where to read" and "where to write".

30. Deep Recurrent Attentive Writer (DRAW): Training

x̂_t = x − σ(c_{t−1})
r_t = read(x_t, x̂_t, h^{dec}_{t−1})
h^{enc}_t = RNN^{enc}(h^{enc}_{t−1}, [r_t, h^{dec}_{t−1}])
z_t ∼ Q(Z_t | h^{enc}_t)
h^{dec}_t = RNN^{dec}(h^{dec}_{t−1}, z_t)
c_t = c_{t−1} + write(h^{dec}_t)

Here σ(x) = 1 / (1 + exp(−x)) is the logistic sigmoid function, and the latent distribution is taken to be a diagonal Gaussian N(Z_t | μ_t, σ_t), where

μ_t = W(h^{enc}_t)
σ_t = exp(W(h^{enc}_t))
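A toy sketch of one DRAW timestep following these recurrences; it uses the paper's simpler "no attention" read (concatenation) and write (linear map) variants, plain tanh RNN cells, and illustrative parameter names, so it only shows the data flow, not the full attentive model:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rnn_step(W, h_prev, inp):
        return np.tanh(W @ np.concatenate([h_prev, inp]))

    def draw_step(x, c_prev, h_enc_prev, h_dec_prev, params, rng):
        W_enc, W_dec, W_mu, W_sigma, W_write = params
        x_hat = x - sigmoid(c_prev)                        # error image
        r_t = np.concatenate([x, x_hat])                   # read without attention
        h_enc = rnn_step(W_enc, h_enc_prev, np.concatenate([r_t, h_dec_prev]))
        mu, sigma = W_mu @ h_enc, np.exp(W_sigma @ h_enc)  # diagonal Gaussian Q(Z_t | h_enc)
        z_t = mu + sigma * rng.standard_normal(mu.shape)   # reparameterised sample
        h_dec = rnn_step(W_dec, h_dec_prev, z_t)
        c_t = c_prev + W_write @ h_dec                     # write without attention
        return c_t, h_enc, h_dec, mu, sigma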
