  1. Deep Learning in Computer Vision Yikang Li MMLab, The Chinese University of Hong Kong Sep 22nd, 2017 @Microsoft Research Asia, China

  2. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  3. Introduction - DL in the press [1] http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/ [2] https://www.cnbc.com/2017/09/08/a-i-can-detect-the-sexual-orientation-of-a-person-based-on-one-photo-research-shows.html

  4. Introduction - DL in the press CVPR 2017 (2600+ submissions, 4200+ registrants, 120+ sponsors) http://cvpr2017.thecvf.com/

  5. Introduction - Investment in AI http://business.financialpost.com/technology/federal-and-ontario-governments-invest-up-to-100-million-in-new-artificial-intelligence-vector-institute/wcm/ceb9218f-cbaf-4968-a6a6-cceff5ec3754

  6. Renowned Researchers/Groups - Trevor Darrell, BAIR, UC Berkeley - Recognition, detection - Yangqing Jia (Caffe), Jeff Donahue (DeepMind), Ross Girshick (Fast R-CNN) - Fei-Fei Li, Stanford University - ImageNet, emerging topics - Jia Li (Snapchat, Google), Jia Deng (UMich), Andrej Karpathy (Tesla, OpenAI) - Antonio Torralba, CSAIL, MIT - Scene understanding, multimodality-based computer vision - Facebook Artificial Intelligence Research (FAIR) - DeepMind, Google Brain, Google Research - Microsoft Research

  7. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  8. Roadmap of Deep Learning - Depth

  9. Basic Block - Convolution. Convolution operation: f(x) = Wx + b, where the output f is called a feature map and W is the kernel (also called filter); the same weights are shared across all spatial locations. Convolution is applied in a sliding-window fashion, which saves parameters and gives translation invariance, a property that is very important for vision tasks. A deep convolutional network is essentially a stack of such convolutional layers. Rule of thumb: deeper usually means better.
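  As a concrete illustration of the sliding-window, weight-sharing idea above, here is a minimal NumPy sketch of a single-channel 2D convolution (stride 1, no padding); the function name and the toy kernel are illustrative, not from the slides.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Slide one shared kernel over the image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same weights W are reused at every location: f(x) = Wx + b
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out

# Example: a 3x3 vertical-edge kernel applied to a random 8x8 "image"
feature_map = conv2d(np.random.rand(8, 8), np.array([[1, 0, -1]] * 3, dtype=float))
print(feature_map.shape)  # (6, 6)
```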

  10. Roadmap of Deep Learning - Network Structure (cont’d)

  11. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  12. What is object detection? http://mp7.watson.ibm.com/ICCV2015/slides/iccv15_tutorial_training_rbg.pdf

  13. Why object detection? It is a fundamental task in vision + Detection of general object classes + Face detection, crowd analysis + Car/signal detection

  14. RCNN -> Fast RCNN -> Faster RCNN

  15. RCNN -> Fast RCNN -> Faster RCNN

  16. RCNN -> Fast RCNN -> Faster RCNN

  17. RCNN -> Fast RCNN -> Faster RCNN
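  To give a practical feel for the end of this progression, below is a hedged usage sketch of a pretrained Faster R-CNN detector from torchvision (not the exact models or weights discussed in the slides; newer torchvision versions replace `pretrained=True` with a `weights=` argument).

```python
import torch
import torchvision

# Load a Faster R-CNN detector (ResNet-50 FPN backbone) pretrained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Run detection on one dummy RGB image (a list of CHW tensors with values in [0, 1]).
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])

# Each prediction holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
print(predictions[0]['boxes'].shape, predictions[0]['scores'][:5])
```

  The returned boxes, labels and scores are the per-image outputs that detection benchmarks such as PASCAL VOC and COCO evaluate.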

  18. Detection Results

  19. Back into the General Picture: Deep Learning for Computer Vision

  20. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  21. Dataset: Image Captioning - describe an image with a natural sentence
  - PASCAL SENTENCE DATASET: 1,000 images & 5 sentences/image; designed for image classification, object detection and segmentation; no filtering, so complex scenes, varied scales and viewpoints of different objects.
  - FLICKR 8K: 8,108 images & 5 sentences/image; collected from the Flickr website by the University of Illinois at Urbana-Champaign.
  - FLICKR 30K: extension of Flickr 8K.
  - MS COCO: largest captioning dataset; includes captions & object annotations; 328,000 images & 5 sentences/image.
  - Visual Genome: densely-annotated dataset; includes objects, scene graphs, region captions (grounded), Q&As (grounded) and attributes; 108,077 images with full annotations; not very clean, needs a little pre-processing.
  Example captions for one image: "Two gentlemen talking in front of a propeller plane." / "Two men are conversing next to a small airplane." / "Two men talking in front of a plane." / "Two men talking in front of a small plane." / "Two men talk while standing next to a small passenger plane at an airport."
  Metrics: BLEU, METEOR, ROUGE, CIDEr, human-based measurement.
  Exploring Image Captioning Datasets: http://sidgan.me/technical/2016/01/09/Exploring-Datasets
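  Among the metrics listed above, BLEU is the simplest to reproduce; the sketch below scores one hypothetical caption against the five reference sentences from this slide using NLTK (corpus-level BLEU is the usual report for captioning; sentence-level is shown only for brevity).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Five reference captions for one image (as on this slide) and one model output.
references = [
    "two gentlemen talking in front of a propeller plane".split(),
    "two men are conversing next to a small airplane".split(),
    "two men talking in front of a plane".split(),
    "two men talking in front of a small plane".split(),
    "two men talk while standing next to a small passenger plane at an airport".split(),
]
hypothesis = "two men stand talking near a small plane".split()

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```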

  22. A simple Baseline: NeuralTalk A simple NeuralTalk Demo: https://github.com/karpathy/neuraltalk
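  The NeuralTalk demo linked above is a CNN-encoder / RNN-decoder captioner. Below is a rough, hedged PyTorch sketch of that encoder-decoder idea (not Karpathy's actual implementation; all layer sizes are illustrative assumptions).

```python
import torch
import torch.nn as nn
import torchvision

class CaptionBaseline(nn.Module):
    """NeuralTalk-style baseline: a CNN image embedding conditions an LSTM language model."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = torchvision.models.resnet18(pretrained=True)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # global pooled feature
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.img_proj(self.encoder(images).flatten(1))     # (B, E) image embedding
        words = self.embed(captions)                                # (B, T, E) word embeddings
        inputs = torch.cat([feats.unsqueeze(1), words], dim=1)      # image acts as the first "word"
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                     # per-step word logits
```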

  23. Attention Mechanism: Show, Attend and Tell Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: https://arxiv.org/abs/1502.03044
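  A minimal sketch of the soft-attention step from Show, Attend and Tell, assuming the decoder already has k spatial CNN feature vectors per image and an LSTM hidden state; dimensions and layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """At each decoding step, score the k spatial features against the hidden
    state and take their softmax-weighted average as the visual context."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, attn_dim)
        self.hid_fc = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, k, feat_dim) spatial features; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_fc(feats) + self.hid_fc(hidden).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)              # (B, k, 1) attention weights
        context = (alpha * feats).sum(dim=1)     # (B, feat_dim) weighted average
        return context, alpha.squeeze(-1)
```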

  24. Modified Attention Mechanism: Know when to look - Adaptive Attention module. Determines how to mix visual and linguistic information via a visual sentinel (a single softmax over the k feature-map vectors plus 1 linguistic sentinel vector). Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning: https://arxiv.org/abs/1612.01887
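  A small sketch of just the mixing step this slide describes (one softmax over the k visual scores plus the sentinel score), assuming those scores have already been produced by the attention module:

```python
import torch
import torch.nn.functional as F

def adaptive_attention(feats, sentinel, scores_v, score_s):
    """Mix k visual vectors and one 'visual sentinel' (linguistic) vector by a
    single softmax over their k+1 scores.
    feats: (B, k, D); sentinel: (B, D); scores_v: (B, k); score_s: (B, 1)."""
    alpha = F.softmax(torch.cat([scores_v, score_s], dim=1), dim=1)   # (B, k+1)
    candidates = torch.cat([feats, sentinel.unsqueeze(1)], dim=1)     # (B, k+1, D)
    return (alpha.unsqueeze(-1) * candidates).sum(dim=1)              # (B, D) mixed context
```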

  25. Concept-driven Image Captioning Semantic Compositional Networks for Visual Captioning: https://arxiv.org/abs/1611.08002

  26. Dense Captioning Localize and describe salient region with a natural sentence DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

  27. DenseCap: Fully Convolutional Localization Networks for Dense Captioning DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

  28. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  29. Dataset: Visual Q&A - DAQUAR Answer an image-based question - first dataset and benchmark released for the VQA task - Images are from NYU Depth V2 dataset with semantic segmentations - 1449 images (795 training, 654 test), 12468 question (auto-generated & human-annotated) - COCO-QA - Automatically generated from image captions. - 123287 images, 78736 train questions, 38948 test questions - 4 types of questions: object, number, color, location - Answers are all one-word - VQA - Most widely-used VQA dataset - two parts: one contains images from COCO, the other contains abstract scenes - 204,721 COCO and 50,000 abstract images with ~5.4 questions/im - CLEVR - A Diagnostic Dataset for the reasoning ability of VQA models - rendered images and automatically-generated questions with functional programs and scene graphs - 100,000 images (70,000 train & 15,000 val & 15,000 test) with ~10 questions/im - Visual Genome Question : What color is the man's tie? - Densely-annotated dataset Answer : Brown - Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes - 108,077 images with 1.7M grounded Q&A pairs - Not very clean, need a little pre-processing Survey of Visual Question Answering: Datasets and Techniques: https://arxiv.org/abs/1705.03865

  30. Simple Baseline Method Simple Baseline for Visual Question Answering: https://arxiv.org/abs/1512.02167
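  The cited baseline concatenates a bag-of-words question representation with a CNN image feature and trains a linear softmax classifier over candidate answers. A hedged PyTorch sketch of that idea (the embedding size and image feature extractor are assumptions, not the paper's exact settings):

```python
import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    """Bag-of-words question features concatenated with a pooled CNN image
    feature, followed by a single linear softmax classifier over answers."""
    def __init__(self, vocab_size, img_dim, num_answers, word_dim=300):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.classifier = nn.Linear(word_dim + img_dim, num_answers)

    def forward(self, question_tokens, img_feat):
        # question_tokens: (B, T) word ids; img_feat: (B, img_dim) pooled CNN feature
        q_bow = self.word_embed(question_tokens).sum(dim=1)   # bag of words: sum of embeddings
        return self.classifier(torch.cat([q_bow, img_feat], dim=1))  # answer logits
```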

  31. A Strong Baseline: Attention (1) Where To Look: Focus Regions for Visual Question Answering: https://arxiv.org/abs/1511.07394

  32. A Strong Baseline: Attention (2) Multiple glimpse Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering: https://arxiv.org/abs/1704.03162

  33. Co-Attention Mechanism for Image & Question Parallel Co-Attention Alternating Co-Attention Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061
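  A simplified sketch of the parallel co-attention idea: an affinity matrix between question words and image regions is pooled into attention weights for both modalities at once. This is a rough approximation of the paper's formulation, not its exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Affinity between word and region features drives attention over both."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, Q, V):
        # Q: (B, T, d) question word features; V: (B, N, d) image region features
        C = torch.tanh(Q @ self.W @ V.transpose(1, 2))      # (B, T, N) affinity matrix
        attn_q = F.softmax(C.max(dim=2).values, dim=1)      # attend over words
        attn_v = F.softmax(C.max(dim=1).values, dim=1)      # attend over regions
        q_hat = (attn_q.unsqueeze(-1) * Q).sum(dim=1)       # attended question vector
        v_hat = (attn_v.unsqueeze(-1) * V).sum(dim=1)       # attended image vector
        return q_hat, v_hat
```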

  34. Hierarchical Question Encoding Hierarchical Question Encoding Scheme Encoding for Answer prediction Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

  35. Multimodal Fusion: Bilinear interaction modeling MUTAN: Multimodal Tucker Fusion for Visual Question Answering: https://arxiv.org/abs/1705.06676
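  MUTAN constrains the full bilinear interaction between question and image features with a Tucker decomposition; the sketch below shows only the simpler low-rank "project, multiply elementwise, project" idea behind such bilinear fusion, not MUTAN's exact factorization.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Rank-limited approximation of the bilinear interaction q^T W v."""
    def __init__(self, q_dim, v_dim, out_dim, rank=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, rank)
        self.v_proj = nn.Linear(v_dim, rank)
        self.out = nn.Linear(rank, out_dim)

    def forward(self, q, v):
        # Elementwise product of the projections models the multimodal interaction.
        return self.out(self.q_proj(q) * self.v_proj(v))
```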

  36. Duality of Question Answering and Question Generation Visual Question Generation as Dual Task of Visual Question Answering

  37. Duality of Question Answering and Question Generation: Dual MUTAN Visual Question Generation as Dual Task of Visual Question Answering

  38. Learning to Reason: Compositional Network End-to-End Training with policy gradient Learning to Reason: End-to-End Module Networks for Visual Question Answering: https://arxiv.org/abs/1704.05526

  39. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  40. Visual Relations Describe the Image with object nodes and their interactions Scene Graph Generation from Objects, Phrases and Region Captions: https://arxiv.org/abs/1707.09700
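  A scene graph is simply object nodes plus (subject, predicate, object) relation edges; a minimal, hypothetical example of the structure being generated:

```python
# Hypothetical scene graph for one image: object nodes plus relation triples.
scene_graph = {
    "objects": ["man", "horse", "hat", "field"],
    "relations": [
        ("man", "riding", "horse"),
        ("man", "wearing", "hat"),
        ("horse", "standing on", "field"),
    ],
}

# Relation triples double as phrases ("man riding horse") and ground region
# captions, which is why the cited paper generates objects, phrases and
# region captions jointly.
print(scene_graph["relations"][0])
```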

  41. Baseline: Visual Relationship Detection with Language Prior Using word2vec as extra information for predicate recognition Visual Relationship Detection with Language Priors: http://cs.stanford.edu/people/ranjaykrishna/vrd/
