S9824: Surpassing State-of-the-Art VQA with Deep Learning Optimization Techniques under Limited GPU Resources
Quang D. Tran, Head of AI, AIOZ Pte Ltd
Erman Tjiputra, CEO, AIOZ Pte Ltd
Introduction
Photo credit: Vietnamtourism. Concept credit: Devi Parikh (Georgia Tech)
The kids are watching an old master writing letters.
It is Tet holiday in Vietnam, with a warm and fragrant floral atmosphere. The kids are very attentive, eagerly waiting for the old master to draw the traditional words.
Q: How many people are there? A: 5
Q: What is the old man doing? A: Writing
Q: Where is it? A: On the street
Human: What a nice picture! What event is this?
AI: It is Tet holiday in Vietnam. You can see lots of flowers and the atmosphere is pretty warm.
Human: Wow, that's great. What are they doing?
AI: The kids are watching an old master drawing the traditional letters.
Human: Awesome, what are the kids wearing?
AI: It is Ao Dai, a traditional Vietnamese dress. …
Vision → AI → See
Vision + Language → AI → Understand
Vision + Language → AI → Reasoning
Words & Pictures
• Vision → visual stream → pictures
• Language → text/speech → words
• Pictures are everywhere
• Words are how we communicate
Measuring & demonstrating AI capabilities:
o Image understanding
o Language understanding
Words & Pictures
• Beyond visual recognition
• Language is compositional: "Two steeds are racing against two brave little dogs."
Image Captioning
• Image captions tend to be generic
• Coarse understanding of the image + simple language models can suffice
• Passive
Credit: Karpathy (Stanford)
Introduction: Visual Question Answering (VQA)
• Input = {image/video, question}
• Output = answer
• Question: asks about details of the corresponding image
• Question types: Yes/No, Counting, Multiple Choice, Others
• Datasets:
§ VQA-1.0, VQA-2.0, TDIUC, DAQUAR, Visual Genome, Visual-7W, Flickr-30, etc.
Visual Question Answering
• Effective use of vast amounts of visual data
• Improving human-computer interaction
• Challenging multi-modal AI research problem
"When a person understands a story, [they] can demonstrate [their] understanding by answering questions about the story. Since questions can be devised to query any aspect of text comprehension, the ability to answer questions is the strongest possible demonstration of understanding." - Wendy Lehnert (PhD, 1977)
Visual Question Answering
Credit: https://visualqa.org
• Details of the image
• Common sense + knowledge base
• Task-driven
• Holy grail of semantic image understanding
Introduction: Visual Question Answering (VQA)
• VQA on images uses an image-question pair with an answer label as an example → supervised learning
• Each answer belongs to a predefined list → a classification task
• Features are extracted from both image & question to determine the answer → an intersection of Computer Vision & NLP
Introduction: Visual Question Answering (VQA)
VQA Challenge
Dataset: VQA 1.0 - 2.0
• >0.25 million images
• ~1.1 million questions
• ~11 million answers
• Human performance
Yash Goyal, et al., Making the V in VQA Matter…, CVPR 2017
Aishwarya Agrawal, et al., VQA: Visual Question Answering, ICCV 2015
VQA Challenge: Leaderboard
VQA: General Solution & Targets
A modern approach to the VQA task usually includes 4 main steps:
1. Feature Extraction
2. Joint Semantic Representation
3. Attention Mechanism
4. VQA Classifier
Targets: resource optimization; accuracy → ensemble models
VQA Challenges: First Glance
VQA Challenges: Question Identification and Model Combination
VQA Decomposition
VQA Feature Extraction: Visual & Question Embedding
• Visual features: apply Bottom-Up attention
§ Use Faster R-CNN to get candidate objects & their bounding boxes.
§ Use ResNet-101 to extract features, giving the final vectors W = {W_1, W_2, …, W_K}, where K is the number of proposals. In this step, we find that K, the number of object proposals, plays an important role in overall performance.
• Question features: GloVe word embeddings.
Reference: Bottom-Up and Top-Down Attention, CVPR 2018
VQA Feature Extraction: Visual & Question Embedding
• K = 50 proposals is shown to give better overall performance.
• The value of K determines the number of bounding boxes we store → reducing K helps decrease resource consumption and training time.
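Since Faster R-CNN returns a variable number of detections per image, fixing the number of stored proposals at K is what makes batching and the memory trade-off possible. A minimal sketch of that truncate-or-pad step (the helper name and shapes are assumptions; 2048 is the ResNet-101 feature dimension from the slide):

```python
import numpy as np

def fix_num_proposals(features, k=50, dim=2048):
    """Truncate or zero-pad per-image object features to exactly K proposals.

    Storing a fixed K lets us batch tensors of shape (K, dim); lowering K
    trades a little accuracy for less disk, memory, and training time.
    """
    n = features.shape[0]
    if n >= k:
        return features[:k]                  # keep the K top-scoring boxes
    pad = np.zeros((k - n, dim), dtype=features.dtype)
    return np.vstack([features, pad])        # zero-pad up to K

# Example: an image with 37 detected objects, padded to K = 50
feats = np.random.randn(37, 2048).astype(np.float32)
fixed = fix_num_proposals(feats, k=50)
```

The same helper also truncates images with more than K detections, so every image contributes a (50, 2048) tensor.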
VQA Attention Mechanism: Bilinear Attention Network (BAN)
[Pipeline: 14-word question → GloVe word embedding (14 × 300) → GRU question embedding; image → Bottom-Up attention features (K × 2048) → low-rank bilinear attention pooling + counter → classifier over 3,129 candidate answers (1 × 3129).]
• Inspired by the co-attention mechanism [1].
• Finds a bilinear attention distribution → considers the interaction between the two groups of input channels.
• High resource consumption: trained on 4 GPUs [2].
[1] Jiasen Lu, et al., Hierarchical Question-Image Co-Attention, NIPS 2016
[2] Jin-Hwa Kim, et al., Bilinear Attention Networks, NIPS 2018
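The core of the bilinear attention idea can be sketched in a few lines of numpy. This is a simplified, single-glimpse illustration under assumed shapes (K regions × 2048, 14 words × 300, rank-64 projections), not the full BAN architecture, which uses learned projections, multiple glimpses, and residual connections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_attention(V, Q, U, W, p):
    """Low-rank bilinear attention between K image regions and T words.

    Score A[k, t] = p^T (U^T v_k * W^T q_t): both inputs are projected to a
    shared rank-r space, combined elementwise, and reduced by p, so every
    region-word pair gets a joint attention weight.
    """
    Vp = V @ U                       # (K, r) projected image features
    Qp = Q @ W                       # (T, r) projected question features
    logits = (Vp * p) @ Qp.T         # (K, T) bilinear scores
    return softmax(logits.reshape(-1)).reshape(logits.shape)

rng = np.random.default_rng(0)
K, T, dv, dq, r = 50, 14, 2048, 300, 64
A = bilinear_attention(rng.standard_normal((K, dv)),
                       rng.standard_normal((T, dq)),
                       rng.standard_normal((dv, r)) * 0.01,
                       rng.standard_normal((dq, r)) * 0.01,
                       rng.standard_normal(r))
```

The resulting (K × T) map is a proper distribution over all region-word pairs, which is what lets the model attend to both input channels jointly rather than to each one separately.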
VQA Counting Module
• Turn the attention map a into an attention graph A = a aᵀ to represent the relations between objects.
• Objects with high attention scores (black circles) are connected by edges.
• To get the count matrix, we eliminate intra-object edges (red edges) and inter-object edges (blue edges) → the number of remaining vertices is the count result.
VQA Counting Module
• To guarantee that objects are either fully overlapping or fully distinct, we apply a normalization function to the attention graph A and the distance matrix D before removing the intra-object and inter-object edges.
• The normalization function: g(y) = y^(2(1−y))
• This function increases values above 0.5 and decreases values below 0.5. The main objective is to widen the gap between low and high values, pushing object pairs toward being fully distinct or fully overlapping.
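A quick numerical check of the normalization function's claimed behavior (the closed form g(y) = y^(2(1−y)) is my reconstruction of the garbled equation from the slide's description, so treat it as an assumption):

```python
import numpy as np

def g(y):
    """Sharpening function for attention-graph entries: g(y) = y**(2*(1-y)).

    On (0, 1): values above 0.5 grow, values below 0.5 shrink, and 0.5 is a
    fixed point, widening the gap so object pairs look either fully
    overlapping or fully distinct before edge removal.
    """
    return y ** (2.0 * (1.0 - y))

edges = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
sharpened = g(edges)
```

For example, g(0.9) ≈ 0.98 and g(0.1) ≈ 0.016, so ambiguous mid-range scores are pushed toward a clean 0/1 decision.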
VQA Counting Module
Evaluation results with the proposed counting module.
VQA Model Optimization: Activation & Dropout
• The classifier in VQA is designed to be simple; however, it is one of the most important modules for overall performance.
→ We find that optimizing the single activation function in the classifier matters. Thus, we recommend:
§ Replace the ReLU activation function with another one (e.g., Swish).
§ Tune the dropout value to the local optimum of the chosen activation function.
Pros (of ReLU):
• Resolves the vanishing gradient problem.
• Provides sparsity in the representation.
• Simple to implement.
Cons: no derivative at the zero point.
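The two activations differ exactly at the points the pros/cons list names. A minimal side-by-side sketch:

```python
import numpy as np

def relu(x):
    """Standard ReLU: max(0, x). Sparse and cheap, but not differentiable at 0."""
    return np.maximum(x, 0.0)

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x). Smooth everywhere and non-monotonic,
    letting small negative activations through instead of zeroing them."""
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

x = np.linspace(-3.0, 3.0, 7)
y_relu, y_swish = relu(x), swish(x)
```

For large positive inputs Swish approaches ReLU, while for small negative inputs it stays slightly negative, which is the behavioral difference that makes the activation (and its matching dropout rate) worth tuning in the classifier.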
VQA Classifier
Ensemble Method
Ensemble Method: Proposal
• Step 1: Train the member models for ensembling
• Step 2: Get a predicted answer from each member model
• Step 3: Predict the question type from the answer-question map learnt from the data
• Step 4: Re-vote the answer
• Step 5: Return the final ensemble model
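The re-voting step (Step 4) can be sketched as a weighted majority vote. This is a simplified illustration: the per-question-type weights stand in for the learnt answer-question map, and all names and shapes here are assumptions, not the talk's exact implementation:

```python
from collections import Counter

def ensemble_vote(member_answers, question_type=None, type_weights=None):
    """Re-vote the final answer from member-model predictions.

    member_answers: list of answers, one per member model.
    type_weights:   optional {question_type: [weight per model]} map, so a
                    model known to be strong on e.g. counting questions
                    gets a heavier vote on that question type.
    """
    weights = [1.0] * len(member_answers)
    if question_type is not None and type_weights and question_type in type_weights:
        weights = type_weights[question_type]
    scores = Counter()
    for ans, w in zip(member_answers, weights):
        scores[ans] += w
    return scores.most_common(1)[0][0]

# Plain majority vote across three member models
final = ensemble_vote(["5", "5", "4"])

# Weighted vote: models 2 and 3 are trusted more on counting questions
final_count = ensemble_vote(["5", "4", "4"], "counting",
                            {"counting": [1.0, 2.0, 2.0]})
```

Weighting by question type is what lets a model trained for a specific question type dominate the vote where it is strongest, addressing the "no emphasis on good models" weakness of plain voting.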
Ensemble Method: Pros & Cons of Voting
Pros:
• Simple & easy to implement
• No architecture restriction → identifies the question type without training a classification model
• Reduces bias
• Maximizes the performance of each model trained for a specific question type
Cons:
• Cannot break ties when vote counts are equal
• No extra emphasis on particularly strong models
Resource Consumption Optimization
Resource Consumption Optimization: Processing Power & Computing Speed
Resource Consumption Optimization
• Fast half-precision floating point (FP16) for deep learning training
• Delayed updates (gradient accumulation)
Resource Consumption Optimization: Mixed Precision Training
• ML models are usually trained in FP32.
o FP64 (double precision): expensive but highly accurate.
o FP32 (single precision): less expensive, also less accurate.
o FP16 (half precision): cheap but low accuracy.
• ML rule of thumb: balance speed & accuracy.
• Expectation: "run with FP16 while having accuracy comparable to FP32."
Resource Consumption Optimization: Mixed Precision Training
Solution
• Baidu Research & NVIDIA successfully trained in FP16 with accuracy comparable to FP32, a 2x speed-up, and 1.5x lower memory consumption.
• Reference: Micikevicius et al., Mixed Precision Training, ICLR 2018.
Pros
• Speeds up training
• Allows training larger models
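The reason FP16 alone loses accuracy, and the loss-scaling trick that fixes it, can be shown directly with numpy's half-precision type. A minimal demonstration (the scale factor 1024 is an arbitrary example; in practice it is chosen or adapted dynamically):

```python
import numpy as np

# FP16 underflows below ~6e-8, so small gradients vanish in half precision.
tiny_grad = np.float32(1e-8)
lost = np.float16(tiny_grad)             # rounds to 0.0: the update is gone

# Loss scaling, the core trick of mixed precision training: scale the loss,
# and therefore every gradient, before the FP16 backward pass, then unscale
# in FP32 just before the weight update.
scale = np.float32(1024.0)
scaled = np.float16(tiny_grad * scale)   # ~1.0e-5 is representable in FP16
recovered = np.float32(scaled) / scale   # back near 1e-8 in FP32
```

Keeping an FP32 master copy of the weights for the update step, while running the forward/backward math in FP16, is what delivers the speed and memory savings without the accuracy loss.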
Resource Consumption Optimization: Delayed Updates
• Reference: Ott et al., Scaling Neural Machine Translation, ACL 2018
• We divide the data into mini-batches and, on each mini-batch, run forward (compute outputs) and backward (compute gradients from the loss), then update the parameters (learning). With delayed updates, the gradients of several mini-batches are accumulated before a single parameter update, simulating a larger batch size on the same hardware.
Evaluation results of the delayed-updates technique.
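For a mean-based loss and a frozen set of weights during accumulation, one delayed update over N mini-batches is numerically identical to one update on the N-times-larger batch. A self-contained sketch on a linear least-squares model (the model and shapes are illustrative assumptions):

```python
import numpy as np

def grad_step(w, X, y):
    """Gradient of mean squared error for a linear model y_hat = X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 4)), rng.standard_normal(32)
lr, n_accum = 0.1, 4

# One update on the full batch of 32 samples...
w_big = np.zeros(4)
w_big -= lr * grad_step(w_big, X, y)

# ...equals one *delayed* update accumulated over 4 mini-batches of 8:
# run backward on each mini-batch, sum the gradients, and apply a single
# averaged update, so the effective batch grows without holding all 32
# samples' activations in memory at once.
w_acc = np.zeros(4)
accum = np.zeros(4)
for Xb, yb in zip(np.split(X, n_accum), np.split(y, n_accum)):
    accum += grad_step(w_acc, Xb, yb)    # backward only, no update yet
w_acc -= lr * accum / n_accum            # delayed parameter update
```

This is why delayed updates trade a little extra compute time per update for a large reduction in peak GPU memory: only one mini-batch's activations are live at any moment.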