Attention-based Models and Their Application in Scene Text Recognition
Baoguang Shi (石葆光)
Huazhong University of Science and Technology
March 23, 2016
Part 1: Introduction to Attention-Based Models
Introduction
• The problem to solve: predicting a sequence given an input, with deep neural networks.
• The input can be an image, speech, a sentence, etc.
• Why it matters:
  • Speech recognition: speech signal sequence => transcription sequence
  • Image captioning: image => word sequence
  • Machine translation: word sequence (in one language) => word sequence (in another)
  • …
Main Difficulties
• Outputs are variable-length sequences
• Inputs may also have a variable size (e.g. images of arbitrary widths)
Attention-based models [1-2]
• An encoder-decoder framework:
  Input => Encoder (RNN, CNN, etc.) => Representation => Decoder (RNN) => Output sequence
• At each step:
  • Select relevant contents in the representation (attend)
  • Generate a token
[1] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
[2] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, KyungHyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. CoRR abs/1506.07503 (2015)
First Look
• ℎ: input sequence
• 𝑧: output sequence
[figure: model diagram from [1]]
[1] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, KyungHyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. CoRR abs/1506.07503 (2015)
Key: Attention Weights
• At each step, the decoder:
  • calculates a vector of attention weights (non-negative, summing to 1)
  • linearly combines the input vectors ℎ_1, …, ℎ_𝑈 into a glimpse vector
  • thus converts the variable-length input into a fixed-dimensional vector
[diagram: the attention weights combine the input contents ℎ_1 … ℎ_𝑈 into a fixed-dimensional vector that feeds the decoder RNN, which emits the output sequence]
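As a concrete illustration, here is a minimal NumPy sketch of one attention step. The dot-product scoring function is a simplifying assumption; the papers cited above score inputs with a small learned network instead.

```python
import numpy as np

def attend(h, s):
    """One attention step.
    h: (U, D) array of input contents h_1..h_U
    s: (D,) current decoder state
    """
    scores = h @ s                 # one relevance score per input vector
    scores -= scores.max()         # for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum()           # attention weights: non-negative, sum to 1
    glimpse = alpha @ h            # fixed-dimensional glimpse vector, shape (D,)
    return alpha, glimpse

# Whatever the input length U, the glimpse has a fixed dimension D.
h = np.random.randn(7, 4)          # U = 7 input vectors of dimension D = 4
s = np.random.randn(4)
alpha, g = attend(h, s)
print(alpha.sum(), g.shape)        # ~1.0, (4,)
```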
Different Weights at Every Step
• The decoder attends to different contents at each step 𝑢 = 1, 2, 3, …
• At the final step, it emits <EOS> to terminate the sequence
[diagram: at each step, an LSTM decoder places its attention weights on a different part of the input contents ℎ_1 … ℎ_𝑈 and emits the next output token]
Detailed architecture
The Attention Mechanism
• Allows us to predict a sequence from input contents
• Allows the model to be trained end-to-end
What do attention weights tell us?
• They indicate the importance of each input to each output token
• They provide a soft alignment between inputs and outputs
Attention weights (2D)
[figure: 2D attention maps from image captioning, highlighting the image regions attended to for the words “woman”, “Frisbee”, and “park”]
[1] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
Part 2: Attention-based Models for Scene Text Recognition
Scene Text Recognition
• A problem of image-to-sequence learning
[examples: word images reading “Tango”, “ATM”, “Hotel”, “BLACK”]
Traditional Approaches
• Character detection
• Character recognition
• Word generation from the recognized characters
Our Previous Work: CRNN
• Convolutional Recurrent Neural Network
  • Convolutional layers
  • Bidirectional LSTM
  • CTC layer
• Code & model released at https://github.com/bgshih/crnn
Our Approach
• Scheme: sequence-to-sequence learning
  • from a sequence-based image representation
  • to a character sequence, e.g. [‘S’, ‘A’, ‘L’, ‘E’, <EOS>]
• Encoder: convolutional layers + LSTM; extracts the sequence-based representation
• Decoder: attention-based RNN; generates the character sequence
Encoder
• Extracts a sequence-based representation of the image
• Structure: convolutional layers + bidirectional LSTM
  • The convolutional layers extract feature maps of size 𝐷 × 𝐻 × 𝑊
  • The feature maps are split along their columns into 𝑊 vectors of 𝐷𝐻 dimensions (map-to-sequence conversion); see the sketch below
  • A bidirectional LSTM models the context within the sequence
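A minimal NumPy sketch of the map-to-sequence conversion; the concrete shape values here are illustrative assumptions, not the network's actual configuration.

```python
import numpy as np

# Assumed conv output: depth D = 512, height H = 3, width W = 25
D, H, W = 512, 3, 25
feature_maps = np.random.randn(D, H, W)

# Map-to-sequence: take each column of the feature maps as one frame,
# yielding W vectors of dimension D*H, ordered left to right.
sequence = feature_maps.transpose(2, 0, 1).reshape(W, D * H)
print(sequence.shape)  # (25, 1536)
```

Because only the width of the feature maps varies with the image, this conversion is what lets the encoder accept images of arbitrary widths.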
Decoder
• An attention-based RNN whose cells are Gated Recurrent Units (GRUs)
• At each step, the attention mechanism selects the relevant contents; a sketch of one decoder step follows
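A sketch of one decoder step, reusing attend() from the earlier sketch. The GRU equations are standard; conditioning the GRU input on the previous label embedding concatenated with the glimpse is an assumption about the exact wiring, so treat this as illustrative rather than as the paper's precise formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, s, p):
    """Standard GRU update; p is a dict of weight matrices and biases."""
    z = sigmoid(p['Wz'] @ x + p['Uz'] @ s + p['bz'])         # update gate
    r = sigmoid(p['Wr'] @ x + p['Ur'] @ s + p['br'])         # reset gate
    c = np.tanh(p['Wc'] @ x + p['Uc'] @ (r * s) + p['bc'])   # candidate state
    return (1 - z) * s + z * c

def decode_step(h, s, y_prev_emb, p, W_out):
    """One decoder step: attend over the encoder contents h, update the
    GRU state, and produce logits over the character set (plus <EOS>)."""
    _, glimpse = attend(h, s)                    # fixed-dim summary of h
    x = np.concatenate([y_prev_emb, glimpse])    # previous label + glimpse
    s_new = gru_step(x, s, p)
    logits = W_out @ s_new
    return s_new, logits
```

Decoding runs this step repeatedly, feeding back the predicted character, until <EOS> is emitted.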
Sequence Recognition Network: The Whole Structure
• Components:
  • Convolutional layers
  • Bidirectional LSTM
  • Attention-based decoder
However…
• This scheme does not work well on irregular text
Rectification + Recognition
• Rectify images using a Spatial Transformer Network (STN) [1]
• Recognize the rectified images using the Sequence Recognition Network (SRN) described above
• The STN and SRN are trained jointly
[1] Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu: Spatial Transformer Networks. CoRR abs/1506.02025 (2015)
Rectification with STN
• Given an input image:
  • regress the locations of 20 fiducial points on the input image
  • calculate a Thin-Plate-Spline (TPS) transformation from them
  • transform (rectify) the input image accordingly
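For intuition, here is a compact NumPy sketch of fitting and applying a plain TPS transformation. It maps points in the rectified image back to input-image coordinates; the STN then bilinearly samples the input at those coordinates. The control-point layout and the absence of regularization are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def tps_kernel(r2):
    # U(r) = r^2 log r^2, with U(0) = 0
    return np.where(r2 == 0, 0.0, r2 * np.log(r2 + 1e-12))

def fit_tps(base_pts, fiducial_pts):
    """Solve TPS parameters mapping base control points (K, 2) in the
    rectified image to the regressed fiducial points (K, 2) in the input."""
    K = base_pts.shape[0]
    d2 = ((base_pts[:, None, :] - base_pts[None, :, :]) ** 2).sum(-1)
    L = np.zeros((K + 3, K + 3))
    L[:K, :K] = tps_kernel(d2)      # radial-basis part
    L[:K, K] = 1.0                  # affine part: constant ...
    L[:K, K + 1:] = base_pts        # ... plus linear terms
    L[K, :K] = 1.0
    L[K + 1:, :K] = base_pts.T
    rhs = np.zeros((K + 3, 2))
    rhs[:K] = fiducial_pts
    return np.linalg.solve(L, rhs)  # (K + 3, 2) parameters

def tps_map(params, base_pts, pts):
    """Map points (N, 2) in the rectified image to input-image coords."""
    K = base_pts.shape[0]
    d2 = ((pts[:, None, :] - base_pts[None, :, :]) ** 2).sum(-1)
    return tps_kernel(d2) @ params[:K] + params[K] + pts @ params[K + 1:]

# Example: 20 base points in two rows; fiducial points stand in for the
# localization network's regressed output (hypothetical values here).
base = np.stack(np.meshgrid(np.linspace(0.1, 0.9, 10), [0.1, 0.9]), -1).reshape(-1, 2)
fiducial = base + 0.05 * np.random.randn(20, 2)
params = fit_tps(base, fiducial)
grid = np.stack(np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 32)), -1).reshape(-1, 2)
src_coords = tps_map(params, base, grid)   # where to sample the input image
```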
End-to-end trainable
• No need to label the fiducial points manually; the STN learns to predict them by itself
Performance
• Significant improvement on datasets that focus on irregular text:
  • SVT-Perspective (perspective text)
  • CUTE80 (curved text)
Performance
• State-of-the-art or highly competitive results on general text recognition datasets
Some Results
Recognition & Character Localization
• Row 𝑢 is the vector of attention weights at step 𝑢
[figure: attention-weight maps localizing the characters of the recognized words “billiards”, “hertz”, “door”, “restaurant”, “everest”, “central”]
Advantages of the Proposed Model
• A globally trainable learning system
  • Learns the representation from data
  • End-to-end trainable
• Handles images of arbitrary sizes and text of arbitrary lengths
  • The encoder accepts images of arbitrary widths
  • For the decoder, both input and output sequences can have arbitrary lengths
• Robust to irregular text
Takeaways
• Attention-based models predict sequences given input images/speech/sentences/etc.
• Attention weights provide a soft alignment between inputs and outputs
• The rectification + recognition scheme is effective for scene text recognition
Thanks!
• Paper: Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, Xiang Bai: Robust Scene Text Recognition with Automatic Rectification. Accepted to CVPR 2016.
  • Preprint available at http://arxiv.org/abs/1603.03915
• CRNN code & model: https://github.com/bgshih/crnn