Joint Caption Detection and Inpainting using Generative Network
Anubha Pandey, Vismay Patel
Indian Institute of Technology Madras
cs16s023@cse.iitm.ac.in
19th September 2018
Track2: Video decaptioning

Overview
1 Problem Statement
2 Introduction
3 Proposed Solution
4 Network Architecture
5 Training
6 Results
7 Conclusion
8 Future Work

Problem Statement
ChaLearn LAP Inpainting Competition, Track 2: Video Decaptioning
Objective
To develop algorithms that can inpaint video frames containing text overlays of varying size, background, color, and location.
Dataset
A set of (X, Y) pairs, where X is a 5-second captioned video clip and Y is the corresponding target (decaptioned) clip.
150 hours of diverse video clips at 128x128 RGB pixels, in both captioned and decaptioned versions, taken from YouTube.
70,000 training samples, and 5,000 samples in the validation and test sets.

Introduction
Video decaptioning involves two tasks: caption detection and general video inpainting.
Existing patch-based video inpainting methods search for complete spatio-temporal patches to copy into the missing area.
Despite recent advances in machine learning, fast (real-time) and accurate automatic text removal from video sequences remains challenging.

Related Works
More recently, "Globally and Locally Consistent Image Completion" [1] (SIGGRAPH 2017) has shown promising results for image inpainting.
It improves results by introducing local and global discriminators.
In addition, it uses dilated convolutions to increase the receptive field, replacing the fully connected layers adopted in Context Encoders.
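To illustrate why dilated convolutions enlarge the receptive field, here is a minimal sketch of the standard receptive-field arithmetic for stacked stride-1 convolutions. The kernel size and dilation rates below are illustrative assumptions, not the values used in [1] or in the network presented here.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions."""
    field = 1
    for d in dilations:
        # Each stride-1 layer widens the field by (kernel_size - 1) * dilation.
        field += (kernel_size - 1) * d
    return field


# Four 3x3 layers without dilation versus with (hypothetical) rates 1, 2, 4, 8.
plain = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8])
print(plain, dilated)  # 9 31
```

With the same number of layers and parameters, the dilated stack sees a 31-pixel context instead of 9, which is the effect the slide refers to.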

Proposed Solution: Frame Level Inpainting and Caption Detection
We propose a generative CNN that performs caption detection and decaptioning jointly, end to end.
Our network is inspired by Globally and Locally Consistent Image Completion [1].
The network has two branches, one for image generation and one for mask generation.
The two branches share parameters up to the first three convolution layers; the layers after that are trained independently.
Inputs
Frames from the captioned videos.
The caption masks, extracted by taking the difference between corresponding frames of the ground-truth decaptioned videos and the input captioned videos.
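The mask extraction described under Inputs can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how the frame difference is turned into a mask, so the per-channel maximum and the `threshold` value are hypothetical choices, as is the function name `caption_mask`.

```python
import numpy as np


def caption_mask(captioned, clean, threshold=0.1):
    """Binary caption mask from a captioned frame and its ground-truth frame.

    Both frames are float arrays in [0, 1] with shape (H, W, 3).
    `threshold` is a hypothetical value; the slides do not state one.
    """
    # Largest per-pixel absolute difference over the colour channels.
    diff = np.abs(captioned.astype(np.float32) - clean.astype(np.float32)).max(axis=-1)
    return (diff > threshold).astype(np.float32)


# Toy example: a grey 128x128 frame with a white "caption" strip.
clean = np.full((128, 128, 3), 0.5, dtype=np.float32)
captioned = clean.copy()
captioned[100:110, 20:108] = 1.0
mask = caption_mask(captioned, clean)
print(mask.shape, int(mask.sum()))  # (128, 128) 880
```

On real clips, compression noise would make small differences appear outside the caption, which is why some threshold is assumed here.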

Network Architecture
Figure: Building blocks of the network.
Figure: Architecture of the discriminator module of the inpainting network. Each building block is shown in Figure 7.
Figure: Architecture of the generator module of the inpainting network. Each building block is shown in Figure 7.

Loss Functions
The following loss functions are used to train the network.
Reconstruction Loss [2]
$$L_r = \frac{1}{K}\sum_{i=1}^{K} \left| I_y^i - I_{imitation}^i \right| + \alpha \cdot \frac{1}{K}\sum_{i=1}^{K} \left( I_{Mask}^i - O_{Mask}^i \right)^2$$
where K is the batch size and $\alpha = 0.000001$.
Adversarial Loss [2]
$$L_{real} = -\log(p), \qquad L_{fake} = -\log(1 - p)$$
$$L_d = L_{real} + \beta \cdot L_{fake}$$
where p is the output probability of the discriminator module and $\beta = 0.01$ (hyperparameter).
Perceptual Loss [3]
$$L_p = \frac{1}{K}\sum_{i=1}^{K} \left( \phi(I_y^i) - \phi(I_{imitation}^i) \right)^2$$
where $\phi$ represents features from a VGG16 network pretrained on the Microsoft COCO dataset.
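The three losses can be sketched in NumPy as below. Several points are assumptions rather than statements from the slides: the symbols are read as I_y = ground-truth frame, I_imitation = generated frame, I_Mask = ground-truth mask, and O_Mask = predicted mask; per-sample terms are summed over pixels before averaging over the batch; p is evaluated on real frames for L_real and on generated frames for L_fake; and `phi` is any feature extractor passed in by the caller (a toy stand-in here, not VGG16).

```python
import numpy as np

ALPHA = 1e-6  # weight of the mask term, from the slides
BETA = 0.01   # weight of the fake term, from the slides


def reconstruction_loss(target, generated, mask_true, mask_pred, alpha=ALPHA):
    """L_r: L1 frame error plus alpha times squared mask error, averaged over the batch."""
    k = target.shape[0]  # batch size K
    frame_term = np.abs(target - generated).reshape(k, -1).sum(axis=1).mean()
    mask_term = ((mask_true - mask_pred) ** 2).reshape(k, -1).sum(axis=1).mean()
    return frame_term + alpha * mask_term


def discriminator_loss(p_real, p_fake, beta=BETA, eps=1e-12):
    """L_d = L_real + beta * L_fake; eps guards against log(0)."""
    l_real = -np.log(p_real + eps).mean()
    l_fake = -np.log(1.0 - p_fake + eps).mean()
    return l_real + beta * l_fake


def perceptual_loss(target, generated, phi):
    """L_p: squared distance between feature maps phi(target) and phi(generated)."""
    k = target.shape[0]
    diff = phi(target) - phi(generated)
    return (diff ** 2).reshape(k, -1).sum(axis=1).mean()


# Toy check with a stand-in feature extractor (channel mean), not VGG16.
rng = np.random.default_rng(0)
target = rng.random((4, 8, 8, 3))
generated = np.clip(target + 0.05, 0.0, 1.0)
mask_true = np.zeros((4, 8, 8))
mask_pred = np.full((4, 8, 8), 0.1)
toy_phi = lambda x: x.mean(axis=-1)

print(reconstruction_loss(target, generated, mask_true, mask_pred))
print(discriminator_loss(np.array([0.9, 0.8]), np.array([0.2, 0.1])))
print(perceptual_loss(target, generated, toy_phi))
```

With alpha this small, the mask term contributes little to the numerical value of L_r, so the frame error dominates the reconstruction loss.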

Training
The network is trained using the Adam optimizer with learning rate 0.006 and batch size 20.
For the first 8 epochs, only the generator module is trained, minimizing the reconstruction and perceptual losses.
For the next 12 epochs, the entire GAN [2] is trained end to end, minimizing all three losses: reconstruction, adversarial, and perceptual.
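The two-phase schedule can be expressed as a small helper that reports which losses are active in a given epoch. This is a sketch only: the function name `active_losses` and the 0-based epoch indexing are assumptions, and no optimizer or network code is shown.

```python
GENERATOR_ONLY_EPOCHS = 8   # from the slides
JOINT_EPOCHS = 12           # from the slides


def active_losses(epoch):
    """Return the losses minimised at a 0-based epoch index."""
    if epoch < 0 or epoch >= GENERATOR_ONLY_EPOCHS + JOINT_EPOCHS:
        raise ValueError("epoch outside the 20-epoch schedule")
    if epoch < GENERATOR_ONLY_EPOCHS:
        # Phase 1: generator only, no discriminator involved.
        return ("reconstruction", "perceptual")
    # Phase 2: generator and discriminator trained together.
    return ("reconstruction", "adversarial", "perceptual")


for epoch in (0, 7, 8, 19):
    print(epoch, active_losses(epoch))
```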

Results
With our proposed solution we secured 3rd position in the competition.
To evaluate reconstruction quality, the metrics listed on the competition's website are used for pairwise frame comparison.

Evaluation Metric   Training Phase   Testing Phase
PSNR                30.5311          32.0021
MSE                 0.0016           0.0012
DSSIM               0.0610           0.0499
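For reference, the three metrics can be computed for a pair of frames roughly as follows. This is a simplified sketch, not the competition's evaluation code: frames are assumed to be floats in [0, 1], DSSIM is taken as (1 - SSIM) / 2, and SSIM is computed from global image statistics in a single window rather than with the usual sliding window. The competition's exact settings are not given in the slides.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two frames."""
    return float(np.mean((a - b) ** 2))


def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(a, b)
    return float("inf") if err == 0 else float(10.0 * np.log10(max_value ** 2 / err))


def dssim(a, b, max_value=1.0):
    """Structural dissimilarity (1 - SSIM) / 2 using global statistics (simplification)."""
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )
    return float((1.0 - ssim) / 2.0)


# Toy pair: a random frame and a uniformly brightened copy.
rng = np.random.default_rng(0)
frame = rng.random((128, 128, 3)) * 0.8
shifted = frame + 0.1
print(mse(frame, shifted), psnr(frame, shifted), dssim(frame, shifted))  # PSNR is about 20 dB
```

Higher PSNR and lower MSE and DSSIM indicate a better reconstruction, which matches the direction of the numbers in the table above.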

Conclusion
We have proposed an end-to-end network for decaptioning that performs frame-level caption detection and inpainting simultaneously.
However, the method operates on individual frames of the clip, so it lacks the temporal context required to produce the desired result.
