Attention Is All You Need Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin From: Google Brain, Google Research Presented by: Hsuan-Yu Chen
RNN • Advantages: • State of the art for variable-length representations such as sequences • RNNs are considered the core of Seq2Seq (with attention) • Problems: • Sequential computation prohibits parallelization • Long-range dependencies are hard to learn • Sequence-aligned states make it hard to model hierarchical domains, e.g., language
CNN • Better than RNN: the path length between two positions grows linearly, and can be made logarithmic with dilated convolutions • Drawback: requires many stacked layers to capture long-range dependencies
Attention and Self-Attention • Attention: • Removes the bottleneck of the encoder-decoder model • Focuses on the important parts of the input • Self-Attention: • All the variables (queries, keys, and values) come from the same sequence (sketched below)
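A minimal sketch of the "same sequence" idea, not the paper's exact implementation: the queries, keys, and values are all projections of one input sequence x. The names W_q, W_k, W_v and the identity projections are illustrative assumptions to keep the example tiny.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) -> (seq_len, d_model) context vectors."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # all three come from x itself
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # scaled dot products
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # row-wise softmax
    return weights @ V                           # weighted sum of the values

d_model = 8
x = np.random.randn(5, d_model)                  # one toy sequence of 5 tokens
I = np.eye(d_model)                              # identity "projections" for illustration
print(self_attention(x, I, I, I).shape)          # (5, 8)
```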
Why Self-Attention • Lower computational complexity per layer for typical sequence lengths • Computation can be parallelized (constant number of sequential operations) • Shorter maximum path length between any two positions, which makes long-range dependencies easier to learn
Transformer Architecture • Encoder: 6 layers, each with self-attention + a feed-forward network • Decoder: 6 layers, each with masked self-attention, attention over the encoder output, and a feed-forward network
Encoder • N = 6 identical layers • All layers output dimension d_model = 512 • Embedding • Positional Encoding • Multi-Head Attention • Residual Connection • Position-wise Feed-Forward
Positional Encoding • Positional encoding injects the relative or absolute position of each token • PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) • PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) • where pos is the position and i is the dimension (see the sketch below)
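A small sketch of the sinusoidal positional encoding above, assuming an even d_model; the array layout and function name are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]              # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]           # dimension index i
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); added to the token embeddings before the first layer
```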
Scaled Dot-Product and Multi-Head Attention • Attention(Q, K, V) = softmax(QK^T / √d_k) V • Multi-head: run h = 8 attention heads in parallel on linear projections of Q, K, V, then concatenate and project the results (sketched below)
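A hedged sketch of scaled dot-product attention and multi-head attention; h = 8 and d_model = 512 follow the paper, but the random weight initialization and variable names are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # row-wise softmax
    return weights @ V

def multi_head_attention(x, h=8, d_model=512, rng=np.random.default_rng(0)):
    """Split d_model into h heads, attend in each, then concatenate and project."""
    d_k = d_model // h
    heads = []
    for _ in range(h):
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v))
    W_o = rng.standard_normal((h * d_k, d_model))   # output projection
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.randn(10, 512)           # a toy sequence of 10 tokens
print(multi_head_attention(x).shape)   # (10, 512)
```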
Residual Connection • LayerNorm(x + Sublayer(x))
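A minimal sketch of the residual connection wrapped around every sub-layer, LayerNorm(x + Sublayer(x)); the epsilon value and the omission of learned scale/shift parameters are simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    """Apply a sub-layer (attention or feed-forward), add the input, then normalize."""
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)
out = residual_block(x, lambda h: h * 0.5)   # stand-in sub-layer for illustration
print(out.shape)                             # (10, 512)
```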
Position-wise Feed-Forward • Two linear transformations with a ReLU activation in between: FFN(x) = max(0, xW1 + b1)W2 + b2 • Applied identically at every position (sketched below)
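A sketch of the position-wise feed-forward network above; the inner dimension d_ff = 2048 follows the paper, while the random weights and zero biases are placeholders for illustration.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear layer + ReLU
    return hidden @ W2 + b2                 # second linear layer back to d_model

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
x = rng.standard_normal((10, d_model))      # 10 positions, processed independently
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (10, 512)
```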
Decoder • N = 6 identical layers • All layers output dimension 512 • Embedding • Positional Encoding • Residual Connection: LayerNorm(x + Sublayer(x)) • Masked Multi-Head Attention over previous decoder positions • Multi-Head Attention over the encoder output • Position-wise Feed-Forward • Linear + softmax to produce next-token probabilities (causal mask sketched below)
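A sketch of the causal mask used in the decoder's masked self-attention: position i may only attend to positions ≤ i. The -1e9 fill value is a common implementation convention, not a detail taken from the paper.

```python
import numpy as np

def masked_attention_weights(scores):
    """scores: (seq_len, seq_len) raw attention scores; returns masked softmax weights."""
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # future positions
    scores = np.where(mask, -1e9, scores)                          # block attention to the future
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    return weights / weights.sum(-1, keepdims=True)

w = masked_attention_weights(np.random.randn(5, 5))
print(np.allclose(np.triu(w, k=1), 0))   # True: no weight on future tokens
```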
Q, K, V • In encoder-decoder attention, the queries (Q) come from the previous decoder layer, while the memory keys (K) and values (V) come from the output of the encoder • In self-attention layers, all three come from the previous layer's hidden states (sketched below)
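A short sketch of encoder-decoder ("cross") attention as described above: Q comes from decoder states, K and V from the encoder output. The sequence lengths, shapes, and variable names are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

encoder_out = np.random.randn(12, 512)    # source sentence, 12 tokens
decoder_state = np.random.randn(7, 512)   # target prefix, 7 tokens
cross = attention(Q=decoder_state, K=encoder_out, V=encoder_out)
print(cross.shape)                        # (7, 512): one context vector per target position
```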
Training • Datasets: • WMT 2014 English-German: 4.5 million sentence pairs, ~37K-token shared vocabulary • WMT 2014 English-French: 36M sentence pairs, 32K word-piece vocabulary • Hardware: • 8 NVIDIA P100 GPUs (base model: 12 hours, big model: 3.5 days)
Results
More Results
Summary • Introduces a new model, the Transformer • In particular, introduces the multi-head attention mechanism • Follows a classical encoder + decoder structure • The model is autoregressive • Achieves new state-of-the-art results in NMT