VIDEO-TO-VIDEO SYNTHESIS
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro
GENERATIVE ADVERSARIAL NETWORKS
Unconditional GANs: a Generator maps random noise (~) to samples, while a Discriminator learns to classify samples as True (real) or False (generated); the two are trained against each other (a minimal training-step sketch follows).
Image credit: celebrity dataset; Jensen Huang, Founder and CEO of NVIDIA; Ian Goodfellow, father of GANs.
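For concreteness, here is a minimal PyTorch sketch of one adversarial training step. `G`, `D`, and the optimizers are hypothetical stand-ins, and the plain cross-entropy losses are the textbook GAN objective, not necessarily those used in the works shown.

```python
# Minimal unconditional GAN training step (a textbook sketch, not the
# exact losses of the works shown). G maps noise to images; D returns
# one real/fake logit per sample. Both are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_g, opt_d, z_dim=128):
    n = real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)
    z = torch.randn(n, z_dim)

    # Discriminator: push real images toward "True", samples toward "False".
    opt_d.zero_grad()
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones) +
              F.binary_cross_entropy_with_logits(D(G(z).detach()), zeros))
    d_loss.backward()
    opt_d.step()

    # Generator: try to make D label its samples "True".
    opt_g.zero_grad()
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```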
After training for a while on NVIDIA DGX-1 machines, the fun sampling time begins: the Generator alone produces novel images.
Image credit: NVIDIA StyleGAN
CONDITIONAL GANS
Allow the user more control over the sampling process.
Modeling (training): given info (e.g. an image or text), learn to produce the generated result.
Sampling (testing): given info (e.g. an image or text) and an output style, produce the output.
SKETCH-CONDITIONAL GANS
[Figure: sketch input fed to the Generator]
Image credit: NVIDIA pix2pixHD
IMAGE-CONDITIONAL GANS
Image credit: NVIDIA MUNIT
MASK-CONDITIONAL GANS
Semantic Image Synthesis
LIVE DEMO
Running live at GTC on an RTX-ready laptop (https://www.nvidia.com/en-us/geforce/gaming-laptops/20-series/).
It will be online for everyone to try out on the NVIDIA AI Playground website (https://www.nvidia.com/en-us/research/ai-playground/).
Interface
PROBLEM WITH PREVIOUS METHODS
[Figure: input label map vs. generated result]
PROBLEM WITH PREVIOUS METHODS
Batch Norm (Ioffe et al. 2015): z = (y − μ) / σ · γ + β
First a normalization (subtract the mean μ, divide by the standard deviation σ), then a learned affine de-normalization (scale by γ, shift by β).
Normalization removes label information: two different uniform label maps produce the same output! (A small demo follows.)
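A tiny demonstration of the problem, assuming instance-norm-style statistics (per sample, per channel): any uniform label map normalizes to all zeros, so two different labels become indistinguishable.

```python
# Any uniform label map normalizes to all zeros, so the label value is
# destroyed. Instance-norm-style statistics (per sample, per channel).
import torch

def normalize(y, eps=1e-5):
    mu = y.mean(dim=(2, 3), keepdim=True)
    sigma = y.std(dim=(2, 3), keepdim=True)
    return (y - mu) / (sigma + eps)

sky   = torch.full((1, 1, 4, 4), 3.0)  # every pixel carries label 3
grass = torch.full((1, 1, 4, 4), 7.0)  # every pixel carries label 7

# Different labels in, identical (all-zero) features out.
print(torch.allclose(normalize(sky), normalize(grass)))  # True
```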
PROBLEM WITH PREVIOUS METHODS
• Do not feed the label map directly to the network
• Use the label map to generate the normalization layers instead
SPADE (SPatially Adaptive DEnormalization)
z = (y − μ) / σ ⊙ γ(m) + β(m)
The normalization itself is a parameter-free (label-free) Batch Norm; the modulation maps γ(m) and β(m) are produced by two conv layers from the label map m and applied element-wise, so they vary spatially. (A minimal layer sketch follows.)
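A minimal SPADE layer sketch in PyTorch. The parameter-free `BatchNorm2d(affine=False)` and the two conv heads follow the slide's equation; the kernel sizes and hidden width are assumptions.

```python
# Minimal SPADE layer sketch. The normalization is parameter-free
# (affine=False); gamma and beta come from convs over the label map and
# are applied element-wise. Kernel sizes / hidden width are assumptions.
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)  # label-free
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, y, label_map):
        # Resize the label map to the current feature resolution.
        m = F.interpolate(label_map, size=y.shape[2:], mode='nearest')
        h = self.shared(m)
        # Spatially varying, element-wise de-normalization.
        return self.norm(y) * self.gamma(h) + self.beta(h)
```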
SPADE RESIDUAL BLOCKS
SPADE ResBlk: two stacks of SPADE → ReLU → 3×3 Conv, wrapped in a residual connection (see the sketch below).
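A matching SPADE ResBlk sketch, reusing the `SPADE` class above; the identity skip connection is the standard residual form and an assumption about the exact layout.

```python
# SPADE ResBlk sketch following the slide's layout (SPADE -> ReLU ->
# 3x3 Conv, twice); the identity skip is an assumption. Reuses the
# SPADE class from the previous sketch.
import torch.nn as nn
import torch.nn.functional as F

class SPADEResBlk(nn.Module):
    def __init__(self, channels, label_channels):
        super().__init__()
        self.spade1 = SPADE(channels, label_channels)
        self.spade2 = SPADE(channels, label_channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, y, label_map):
        h = self.conv1(F.relu(self.spade1(y, label_map)))
        h = self.conv2(F.relu(self.spade2(h, label_map)))
        return y + h  # residual connection
```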
SPADE GENERATOR
A random vector (~) is decoded through a stack of SPADE ResBlks, each conditioned on the label map, with upsampling in between (see the sketch below).
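A compact generator sketch stacking the blocks above; the depth, constant channel width, and final tanh output are assumptions (the real generator varies widths per resolution).

```python
# Generator sketch: project a noise vector to a 4x4 feature map, then
# alternate upsampling with SPADE ResBlks that all consume the label
# map. Depth and the constant width are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEGenerator(nn.Module):
    def __init__(self, z_dim, label_channels, base=64, n_blocks=4):
        super().__init__()
        self.base = base
        self.fc = nn.Linear(z_dim, base * 4 * 4)
        self.blocks = nn.ModuleList(
            SPADEResBlk(base, label_channels) for _ in range(n_blocks))
        self.to_rgb = nn.Conv2d(base, 3, 3, padding=1)

    def forward(self, z, label_map):
        h = self.fc(z).view(-1, self.base, 4, 4)
        for blk in self.blocks:
            h = F.interpolate(h, scale_factor=2)  # 4 -> 8 -> 16 -> 32 -> 64
            h = blk(h, label_map)
        return torch.tanh(self.to_rgb(h))
```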
PROBLEM WITH PREVIOUS METHODS
[Figure: input label map; results w/o SPADE vs. w/ SPADE]
IMAGE RESULTS
Multimodal results on Flickr
VIDEO-TO-VIDEO SYNTHESIS
IMAGE-TO-IMAGE SYNTHESIS
[Figure: a semantic map with labels Building, Tree, Car, Sidewalk, Road rendered as a photorealistic street image]
MOTIVATION
• AI-based rendering
Traditional graphics: geometry, texture, and lighting in, image out.
Machine learning graphics: data in, image out.
MOTIVATION
• AI-based rendering
• High-level semantic manipulation
Mapping an image to a high-level representation (segmentation, keypoint detection, etc.) is largely explored; the reverse direction, editing the high-level representation ("edit here!") and synthesizing a new image or video from it, is little explored and is the focus of this work.
PREVIOUS WORK
• Image translation: pix2pix [2017], CRN [2017], pix2pixHD [2018]
• Unconditional synthesis: VGAN [2016], TGAN [2017], MoCoGAN [2018]
• Video style transfer: ArtST [2016], COVST [2017]
• Video prediction: MCNet [2017], PredNet [2017]
PREVIOUS WORK: FRAME-BY-FRAME RESULT
OUR METHOD • Sequential generator • Multi-scale temporal discriminator • Spatio-temporal progressive training procedure 43
OUR METHOD
Sequential generator: each output frame is conditioned on the current input map and the previously generated frames, and a warping step (W) carries content forward from the last frame (see the sketch below).
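A structural sketch of the sequential generator; every sub-module (`backbone`, `flow_head`, `image_head`, `mask_head`) and the `warp_fn` helper are hypothetical stand-ins for the flow-warping decomposition described above.

```python
# Structural sketch of a sequential generator: predict flow to warp the
# previous output, hallucinate new content, and blend with a soft mask.
# All sub-modules and warp_fn are hypothetical stand-ins.
import torch
import torch.nn as nn

class SequentialGenerator(nn.Module):
    def __init__(self, backbone, flow_head, image_head, mask_head):
        super().__init__()
        self.backbone = backbone    # encodes (current map, past frames)
        self.flow = flow_head       # flow W for warping the last frame
        self.image = image_head     # hallucinated new content
        self.mask = mask_head       # soft blending weight in [0, 1]

    def forward(self, semantic_map, past_frames, warp_fn):
        feat = self.backbone(torch.cat([semantic_map, *past_frames], dim=1))
        warped = warp_fn(past_frames[-1], self.flow(feat))  # reuse old pixels
        m = torch.sigmoid(self.mask(feat))
        return m * warped + (1 - m) * torch.tanh(self.image(feat))
```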
OUR METHOD
Multi-scale discriminators: an image discriminator and a video discriminator, each applied at three scales (D1, D2, D3). See the sketch below.
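A sketch of the multi-scale idea: one discriminator copy per level of an average-pooled input pyramid. `make_d` is a hypothetical factory for a single discriminator, and a video discriminator would receive stacked consecutive frames instead of a single image.

```python
# One discriminator copy per scale of an average-pooled pyramid.
# make_d() is a hypothetical factory building a single discriminator;
# the video version would see stacked consecutive frames.
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, make_d, num_scales=3):
        super().__init__()
        self.discriminators = nn.ModuleList(make_d() for _ in range(num_scales))

    def forward(self, x):
        outputs = []
        for d in self.discriminators:      # D1: full res, D2: 1/2, D3: 1/4
            outputs.append(d(x))
            x = F.avg_pool2d(x, 2)
        return outputs
```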
OUR METHOD
Spatio-temporally progressive training: grow spatially (S: higher resolutions via added residual blocks) and temporally (T: longer sequences), alternating S and T stages (see the schedule sketch below).
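A schedule skeleton for the alternating S/T stages; `train_epoch` is a hypothetical stand-in for one adversarial training pass, and the specific resolutions and clip lengths are made-up placeholders.

```python
# Alternating S/T schedule skeleton. train_epoch is a hypothetical
# stand-in for one adversarial training pass; resolutions and clip
# lengths below are made-up placeholders.
def progressive_schedule(train_epoch, resolutions=(256, 512, 1024),
                         seq_lens=(4, 8, 16), epochs_per_stage=10):
    stages = []
    for r, t in zip(resolutions, seq_lens):
        stages += [('S', r), ('T', t)]     # S, T, S, T, ...
    res, seq = resolutions[0], seq_lens[0]
    for kind, value in stages:
        if kind == 'S':
            res = value                    # grow resolution (add res-blocks)
        else:
            seq = value                    # train on longer clips
        for _ in range(epochs_per_stage):
            train_epoch(resolution=res, seq_len=seq)
```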
RESULTS
RESULTS
• Semantic maps → Street view scenes
• Edges → Human faces
• Poses → Human bodies
STREET VIEW: CITYSCAPES
[Video: semantic map input; side-by-side comparison of pix2pixHD, COVST (video style transfer), and ours]
STREET VIEW: BOSTON
STREET VIEW: NYC
RESULTS
• Semantic maps → Street view scenes
• Edges → Human faces
• Poses → Human bodies
FACE SWAPPING (FACE → EDGE → FACE)
[Video: input, extracted edges, synthesized output]
FACE SWAPPING (SLIMMER FACE)
[Video: slimmed input, slimmed edges, synthesized output]
MULTI-MODAL EDGE → FACE
[Video: the same edge input rendered in Style 1, Style 2, and Style 3]
RESULTS
• Semantic maps → Street view scenes
• Edges → Human faces
• Poses → Human bodies
MOTION TRANSFER (BODY → POSE → BODY)
[Video: input subject, extracted poses, synthesized output]
MOTION TRANSFER
EXTENSION: FRAME PREDICTION
• Goal: predict future frames given past frames
• Our method: decompose prediction into two steps (see the sketch below)
  1. Predict the semantic map for the next frame
  2. Synthesize the frame based on the semantic map
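The two-step decomposition as a sketch; `semantic_predictor` and `synthesizer` are hypothetical stand-ins for the two trained networks.

```python
# Two-step prediction sketch; semantic_predictor and synthesizer are
# hypothetical stand-ins for the two trained networks.
def predict_next_frame(past_frames, past_maps, semantic_predictor, synthesizer):
    next_map = semantic_predictor(past_maps)   # step 1: future layout
    return synthesizer(next_map, past_frames)  # step 2: render the frame
```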
EXTENSION: FRAME PREDICTION
[Video: ground truth compared with PredNet, MCNet, and ours]
INTERACTIVE GRAPHICS
PATH TO INTERACTIVE GRAPHICS
• Real-time inference
• Combining with the existing graphics pipeline
• Domain gap between real input and synthetic input
PATH TO INTERACTIVE GRAPHICS
• Real-time inference
  • FP16 + TensorRT → ~5× speedup
  • 36 ms (27.8 fps) for 1080p inference
  • Overall: 15~25 fps
(A rough FP16 timing sketch follows.)
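A rough way to measure the per-frame latency of an FP16 model in plain PyTorch (the TensorRT part is not shown); the 35 input channels are an assumed Cityscapes-style label map.

```python
# Rough FP16 latency check in plain PyTorch (the TensorRT conversion is
# not shown). 35 input channels assume a Cityscapes-style label map.
import time
import torch

def time_inference(model, h=1080, w=1920, label_channels=35, runs=50):
    model = model.half().cuda().eval()
    x = torch.randn(1, label_channels, h, w, device='cuda', dtype=torch.half)
    with torch.no_grad():
        for _ in range(5):                 # warm-up
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    return (time.time() - start) / runs * 1000  # ms per frame
```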
PATH TO INTERACTIVE GRAPHICS
• Real-time inference
• Combining with the existing graphics pipeline
  • CARLA: an open-source simulator for autonomous driving research
  • Make the game engine render semantic maps
  • Pass the maps to the network and display the inference result (see the sketch below)
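A sketch of pulling semantic maps out of CARLA's Python API and handing them to the network; the client and sensor calls are the real CARLA API, while `run_network` is a hypothetical stand-in for converting the image and running the generator, and we assume a vehicle already exists in the world.

```python
# Feeding CARLA semantic maps to the network. The client/sensor calls
# are the real CARLA Python API; run_network is a hypothetical stand-in.
import carla

client = carla.Client('localhost', 2000)
world = client.get_world()

bp = world.get_blueprint_library().find('sensor.camera.semantic_segmentation')
bp.set_attribute('image_size_x', '1024')
bp.set_attribute('image_size_y', '512')

vehicle = world.get_actors().filter('vehicle.*')[0]  # assumes one exists
camera = world.spawn_actor(
    bp, carla.Transform(carla.Location(x=1.5, z=2.0)), attach_to=vehicle)

def on_frame(image):
    # CARLA encodes the semantic label in the red channel of the buffer.
    run_network(image)  # hypothetical: convert to a tensor, run the generator

camera.listen(on_frame)
```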
PATH TO INTERACTIVE GRAPHICS
• Real-time inference
• Combining with the existing graphics pipeline
• Domain gap between real input and synthetic input
  • The network is trained on real data but tested on synthetic data
  • Things that differ: object shapes/edges, density of objects, camera viewpoints, etc.
  • Ongoing work
ORIGINAL CARLA IMAGE
RENDERED SEMANTIC MAPS
RECORDED DEMO RESULTS
CONCLUSION
CONCLUSION
• What can we achieve?
• What can it be used for?
CONCLUSION
• What can we achieve?
  • Synthesize high-resolution realistic images
  • Produce temporally smooth videos
  • Reinvent interactive graphics
CONCLUSION
• What can it be used for?
  • AI-based rendering: from traditional graphics to machine learning graphics
  • High-level semantic manipulation: original image → high-level representation → new image
THANK YOU https://github.com/NVIDIA/vid2vid