Video Synthesis from the StyleGAN Latent Space

A Project Presented to
The Faculty of Department of Computer Science
San José State University

In Partial Fulfillment
Of the Requirements for the Degree of Science

By
Lei Zhang
April, 2020
The Designated Project Committee Approves the Master’s Project Titled
Video Synthesis from the StyleGAN Latent Space
By Lei Zhang

APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE
SAN JOSÉ STATE UNIVERSITY
April 2020

Dr. Christopher Pollett, Department of Computer Science
Dr. Leonard Wesley, Department of Computer Science
Dr. Philip Heller, Department of Computer Science
Acknowledgement

I would like to express my sincere gratitude and appreciation to Dr. Chris Pollett for his guidance and support throughout the entire project. It was my honor to have him as my advisor. I would also like to take this opportunity to thank the committee members, Dr. Philip Heller and Dr. Leonard Wesley, for their suggestions and time. Finally, I am grateful to my friends and family for always being a strong support for me throughout my career.
ABSTRACT

Generative models have shown impressive results in generating synthetic images. However, video synthesis is still difficult to achieve, even for these generative models. The best videos that generative models can currently create are a few seconds long, distorted, and low resolution. For this project, I propose and implement a model to synthesize videos at 1024x1024x32 resolution that include human facial expressions by using static images generated from a Generative Adversarial Network trained on human facial images. To the best of my knowledge, this is the first work that generates realistic videos larger than 256x256 resolution from single starting images. This model improves video synthesis both quantitatively and qualitatively compared to two state-of-the-art models: TGAN and MocoGAN. In a quantitative comparison, this project reaches a best Average Content Distance (ACD) score of 0.167, compared to 0.305 and 0.201 for TGAN and MocoGAN, respectively.

Keywords - Generative Adversarial Network (GAN), video generation, StyleGAN, 3D convolutions
TABLE OF CONTENTS

1 INTRODUCTION
2 BACKGROUND
2.1 Convolutional Neural Networks (CNNs)
2.2 Image GANs
2.2.1 Progressive growing GAN
2.2.2 StyleGAN
2.2.3 StyleGAN2
2.2.4 Face Pose Synthesis
2.3 Video GANs
2.4 Sequence Prediction
2.4.1 Recurrent Neural Networks (RNNs)
2.4.2 Long Short-Term Memory (LSTM)
2.5 Very Deep Convolutional Networks (VGG)
2.6 Deep Residual Learning (ResNet)
2.7 Embed Images into Latent Space
2.7.1 Precise Recovery of Latent Vectors from GANs
2.7.2 Image2StyleGAN
2.8 Noise Vector Arithmetic and Video Interpolation
3 IMPLEMENTATION
3.1 Datasets
3.1.1 FFHQ
3.1.2 IMPA-FACE3D
3.1.3 MUG Facial Expression Database
3.1.4 CelebA Dataset
3.1.5 YouTube Movie Trailers
3.2 Model Overview
3.3 Face Alignment
3.4 YouTube Video Preprocessing
3.5 Emotions Prediction
3.6 LSTM Emotion Sequence Prediction Model
3.7 VGG16
3.8 Image Reconstruction from the Latent Space
3.9 Generate Keyframes with Latent Space Manipulation
3.10 Interpolation in the Latent Space
4 EXPERIMENTS AND RESULTS
4.1 Coarse Image Recovery from Latent Space
4.2 Fine Images Recovery from the Pre-trained Latent Space
4.3 Mimic Face Pose without Training Directions
4.4 Transfer Facial Attributes to Another Person
4.5 Predict Emotions from YouTube Video Clips
4.6 Video Synthesis