VIDEO-TO-VIDEO SYNTHESIS

  1. VIDEO-TO-VIDEO SYNTHESIS Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro

  2. GENERATIVE ADVERSARIAL NETWORKS: Unconditional GANs. [Diagram: noise ~ feeds the Generator; the Discriminator labels generated samples "False" and real samples "True".] Image credit: Celebrity dataset; Jensen Huang, Founder and CEO of NVIDIA; Ian Goodfellow, Father of GANs.
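
To make the generator/discriminator game concrete, here is a minimal sketch of one unconditional-GAN training step in PyTorch. It is illustrative only: `G`, `D`, the optimizers, and `z_dim` are assumed placeholders, not code from the presentation.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real_images, opt_G, opt_D, z_dim=128):
    """One training step: D learns real -> True / fake -> False, G learns to fool D."""
    z = torch.randn(real_images.size(0), z_dim, device=real_images.device)

    # Discriminator update: push real samples toward "True", generated toward "False".
    opt_D.zero_grad()
    logits_real = D(real_images)
    logits_fake = D(G(z).detach())  # detach so no gradient flows into G here
    loss_D = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    loss_D.backward()
    opt_D.step()

    # Generator update: make D label the generated samples "True".
    opt_G.zero_grad()
    logits_fake = D(G(z))
    loss_G = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```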

  3. After training for a while on NVIDIA DGX-1 machines, the fun sampling time begins: feed random noise to the Generator to sample new images. Image credit: NVIDIA StyleGAN

  4. CONDITIONAL GANS: Allow the user more control over the sampling process. Modeling (training): learn the mapping from given info (e.g. image, text) to the generated result. Sampling (testing): supply the given info (e.g. image, text) and an output style.
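
Structurally, the only change from the unconditional setup is that both networks see the given information. A hedged sketch; the channel-wise concatenation scheme and the `c_as_map` helper are illustrative assumptions, not the presentation's architecture.

```python
import torch

def c_as_map(c, image):
    # Broadcast a condition vector (N, K) to a spatial map (N, K, H, W) so it
    # can be concatenated channel-wise with an image tensor.
    return c[:, :, None, None].expand(-1, -1, image.size(2), image.size(3))

def conditional_forward(G, D, z, c, real_image):
    fake = G(torch.cat([z, c], dim=1))  # sample conditioned on the given info
    d_fake = D(torch.cat([fake, c_as_map(c, fake)], dim=1))
    d_real = D(torch.cat([real_image, c_as_map(c, real_image)], dim=1))
    return fake, d_fake, d_real
```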

  5. SKETCH-CONDITIONAL GANS: a sketch is fed to the Generator. Image credit: NVIDIA pix2pixHD

  6. IMAGE-CONDITIONAL GANS. Image credit: NVIDIA MUNIT

  7.–8. MASK-CONDITIONAL GANS: Semantic Image Synthesis

  9. LIVE DEMO: running live at GTC on an RTX-ready laptop (https://www.nvidia.com/en-us/geforce/gaming-laptops/20-series/). It will be online for everyone to try out on the NVIDIA AI Playground website (https://www.nvidia.com/en-us/research/ai-playground/).

  10. Interface

  12. PROBLEM WITH PREVIOUS METHODS: input vs. result

  13. PROBLEM WITH PREVIOUS METHODS: Batch Norm (Ioffe et al. 2015) computes z = (y − μ)/σ · γ + β, a normalization step (y − μ)/σ followed by a de-normalization affine transform (· γ + β). Two different uniform label maps, e.g. y = all zeros and y = all ones, normalize to the same output: normalization removes the label information.
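
A tiny numeric check of the claim (a sketch; the 2x2 maps stand in for uniform label maps):

```python
import torch

def normalize(y, eps=1e-5):
    # The normalization step of Batch Norm: z = (y - mean) / std.
    return (y - y.mean()) / (y.std(unbiased=False) + eps)

zeros = torch.zeros(1, 1, 2, 2)  # uniform label map of class 0
ones = torch.ones(1, 1, 2, 2)    # uniform label map of class 1

print(normalize(zeros))  # all zeros
print(normalize(ones))   # also all zeros: same output!
# A constant map has zero variance, so y - mean collapses to zero either way.
# The de-normalization gamma * z + beta then sees identical inputs, and the
# label information cannot be recovered.
```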

  14. PROBLEM WITH PREVIOUS METHODS: input vs. result

  15. PROBLEM WITH PREVIOUS METHODS: • Do not feed the label map directly to the network • Use the label map to generate the normalization layers instead

  16. SPADE (SPatially Adaptive DE-normalization): the label map (network input) is passed through conv layers to produce spatially varying modulation maps γ and β. A parameter-free (label-free) Batch Norm normalizes the activation y, and an element-wise de-normalization produces the output: z = (y − μ)/σ ⊙ γ + β.
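
A minimal sketch of a SPADE layer in PyTorch following the slide's recipe; the channel sizes, kernel sizes, and hidden width are illustrative assumptions, not NVIDIA's reference implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, norm_channels, label_channels, hidden=128):
        super().__init__()
        # Parameter-free normalization: affine=False means BN only does (y - mu) / sigma.
        self.bn = nn.BatchNorm2d(norm_channels, affine=False)
        # The label map generates spatially varying gamma and beta via convs.
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.conv_gamma = nn.Conv2d(hidden, norm_channels, 3, padding=1)
        self.conv_beta = nn.Conv2d(hidden, norm_channels, 3, padding=1)

    def forward(self, y, label_map):
        normalized = self.bn(y)
        # Resize the label map to this layer's feature resolution.
        label_map = F.interpolate(label_map, size=y.shape[2:], mode='nearest')
        h = self.shared(label_map)
        # Element-wise de-normalization: z = normalized * gamma + beta.
        return normalized * self.conv_gamma(h) + self.conv_beta(h)
```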

  17. [Diagram, repeated: SPADE (SPatially Adaptive DE-normalization); label map → conv → γ, β; parameter-free Batch Norm; element-wise de-normalization]

  18. SPADE RESIDUAL BLOCKS: a SPADE ResBlk stacks two SPADE → ReLU → 3x3 Conv units with a residual skip connection.
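
Building on the SPADE layer sketched above, a hedged sketch of the residual block; the 1x1 skip convolution for channel changes is an assumption to keep the block self-contained.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADEResBlk(nn.Module):
    """Two (SPADE -> ReLU -> 3x3 Conv) units plus a residual skip connection."""
    def __init__(self, in_ch, out_ch, label_ch):
        super().__init__()
        self.spade1 = SPADE(in_ch, label_ch)   # SPADE layer from the sketch above
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.spade2 = SPADE(out_ch, label_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        # 1x1 conv on the skip path when channel counts differ (an assumption).
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, label_map):
        h = self.conv1(F.relu(self.spade1(x, label_map)))
        h = self.conv2(F.relu(self.spade2(h, label_map)))
        return h + self.skip(x)
```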

  19. SPADE GENERATOR: random noise ~ is fed through a stack of SPADE ResBlks, each conditioned on the label map, with upsampling in between.
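
And the generator, sketched as noise mapped through a stack of SPADE ResBlks with upsampling; the depth, widths, and starting resolution are placeholders (the real model is larger).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEGenerator(nn.Module):
    def __init__(self, label_ch, z_dim=256, base=4):
        super().__init__()
        self.base = base
        self.fc = nn.Linear(z_dim, 512 * base * base)
        self.blocks = nn.ModuleList([
            SPADEResBlk(512, 256, label_ch),   # ResBlk from the sketch above
            SPADEResBlk(256, 128, label_ch),
            SPADEResBlk(128, 64, label_ch),
            SPADEResBlk(64, 32, label_ch),
        ])
        self.to_rgb = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, z, label_map):
        x = self.fc(z).view(z.size(0), 512, self.base, self.base)
        for blk in self.blocks:
            x = blk(x, label_map)                 # every block sees the label map
            x = F.interpolate(x, scale_factor=2)  # upsample between blocks
        return torch.tanh(self.to_rgb(x))
```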

  20. PROBLEM WITH PREVIOUS METHODS: input vs. w/o SPADE vs. w/ SPADE

  24.–25. IMAGE RESULTS: Multimodal results on Flickr

  30. VIDEO-TO-VIDEO SYNTHESIS

  31. IMAGE-TO-IMAGE SYNTHESIS: a semantic map (Building, Tree, Car, Sidewalk, Road) is translated into an image.

  32.–35. VIDEO-TO-VIDEO SYNTHESIS: [video results]

  36. MOTIVATION: AI-based rendering. Traditional graphics renders from geometry, texture, and lighting; machine learning graphics renders from data.

  37. MOTIVATION: AI-based rendering; high-level semantic manipulation. Going from an original image to a high-level representation (segmentation, keypoint detection, etc.) is largely explored; editing that representation ("Edit here!") and synthesizing a new image/video from it is little explored (this work).

  38. PREVIOUS WORK: Image translation: pix2pixHD [2018], CRN [2017], pix2pix [2017]. Unconditional synthesis: MoCoGAN [2018], TGAN [2017], VGAN [2016]. Video prediction: MCNet [2017], PredNet [2017]. Video style transfer: COVST [2017], ArtST [2016].

  39. PREVIOUS WORK: FRAME-BY-FRAME RESULT

  40. OUR METHOD: • Sequential generator • Multi-scale temporal discriminator • Spatio-temporally progressive training procedure

  41. OUR METHOD: Sequential Generator. Each frame is generated from the current semantic map and previously generated frames (the W in the diagram denotes flow warping of the previous frame).
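
The sequential generation can be sketched as a loop that conditions each frame on the current semantic map and the recent generated frames. This is a simplification: the actual generator also predicts optical flow and warps the previous frame; `G` is a placeholder network.

```python
import torch

def generate_video(G, semantic_maps, num_past=2):
    """semantic_maps: tensor (T, C, H, W); returns a list of generated (3, H, W) frames."""
    frames = []
    for t, label in enumerate(semantic_maps):
        # The most recent generated frames; zeros before any exist.
        past = [frames[i] if i >= 0 else torch.zeros(3, *label.shape[1:])
                for i in range(t - num_past, t)]
        # Condition on the current map plus the past frames, stacked channel-wise.
        x = torch.cat([label] + past, dim=0).unsqueeze(0)
        frames.append(G(x).squeeze(0))
    return frames
```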

  42. OUR METHOD: Sequential Generator plus Multi-scale Discriminators: an Image Discriminator and a Video Discriminator, each applied at three scales (D1, D2, D3).
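
A hedged sketch of the multi-scale idea: the image discriminator sees single frames at three spatial scales, and the video discriminator sees short clips at three temporal subsampling rates (D1, D2, D3 as on the slide; the discriminator networks themselves are placeholders).

```python
import torch
import torch.nn.functional as F

def image_d_pyramid(D_img, frame, num_scales=3):
    """Apply the image discriminator at three spatial scales (D1, D2, D3)."""
    logits = []
    for _ in range(num_scales):
        logits.append(D_img(frame))
        frame = F.avg_pool2d(frame, 2)  # halve the resolution for the next scale
    return logits

def video_d_pyramid(D_vid, frames, num_scales=3, clip_len=3):
    """Apply the video discriminator to clips subsampled at strides 1, 2, 4.
    frames: list of (N, C, H, W) tensors."""
    logits = []
    for s in range(num_scales):
        clip = frames[::2 ** s][:clip_len]            # temporally subsampled clip
        logits.append(D_vid(torch.cat(clip, dim=1)))  # frames stacked channel-wise
    return logits
```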

  43. OUR METHOD: Spatio-temporally Progressive Training. Spatially progressive: grow the generator with residual blocks toward higher resolutions. Temporally progressive: train on progressively longer sequences. The two are interleaved by alternating training phases (S, T, S, T, ...); a schedule sketch follows.
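
A rough sketch of the alternation only; the exact growing recipe (which residual blocks are added, how sequence lengths increase) follows the paper, and these resolutions and lengths are illustrative assumptions.

```python
def progressive_schedule(resolutions=(512, 1024, 2048), seq_lengths=(4, 8, 16)):
    """Yield (phase, resolution, seq_len), alternating spatial and temporal phases."""
    res, seq = resolutions[0], seq_lengths[0]
    for i in range(max(len(resolutions), len(seq_lengths))):
        if i < len(resolutions):
            res = resolutions[i]
            yield ('S', res, seq)  # spatial phase: grow to a higher resolution
        if i < len(seq_lengths):
            seq = seq_lengths[i]
            yield ('T', res, seq)  # temporal phase: train on longer sequences

for phase, res, seq in progressive_schedule():
    print(phase, res, seq)  # a train_one_stage(...) hook would be called here
```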

  44. RESULTS

  45.–46. RESULTS: • Semantic → Street view scenes • Edges → Human faces • Poses → Human bodies

  47. STREET VIEW: CITYSCAPES: comparison of the semantic map input, pix2pixHD, COVST (video style transfer), and Ours.

  48. STREET VIEW: BOSTON

  49. STREET VIEW: NYC

  50. RESULTS: • Semantic → Street view scenes • Edges → Human faces • Poses → Human bodies

  51. FACE SWAPPING (FACE → EDGE → FACE): input, edges, output.
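
The pipeline can be sketched as follows; Canny stands in for whatever edge extractor was actually used, and `vid2vid_generator` is a placeholder for the trained model. The same pattern applies to body → pose → body, with a pose estimator in place of the edge detector.

```python
import cv2
import numpy as np

def face_to_edge_to_face(frames_bgr, vid2vid_generator):
    """frames_bgr: list of HxWx3 uint8 frames of the source face video."""
    outputs = []
    for frame in frames_bgr:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)  # face -> edge map
        # Edge map -> face rendered in the target person's appearance.
        x = edges[None, None].astype(np.float32) / 255.0
        outputs.append(vid2vid_generator(x))
    return outputs
```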

  52.–53. FACE SWAPPING (SLIMMER FACE): input, (slimmed) edges, output.

  54. MULTI-MODAL EDGE → FACE: Style 1, Style 2, Style 3

  55. RESULTS: • Semantic → Street view scenes • Edges → Human faces • Poses → Human bodies

  56.–59. MOTION TRANSFER (BODY → POSE → BODY): input, poses, output.

  60. MOTION TRANSFER

  61. EXTENSION: FRAME PREDICTION • Goal: predict future frames given past frames • Our method decomposes prediction into two steps: 1. predict the semantic map for the next frame; 2. synthesize the frame based on that semantic map
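
A sketch of the two-step decomposition; `predict_next_map` and `synthesize_frame` are placeholder models for the two steps, not the paper's exact modules.

```python
def predict_future(past_frames, past_maps, predict_next_map, synthesize_frame, steps=5):
    """Roll the two-step predictor forward: semantic maps first, then pixels."""
    frames, maps = list(past_frames), list(past_maps)
    for _ in range(steps):
        next_map = predict_next_map(maps, frames)        # step 1: next semantic map
        next_frame = synthesize_frame(next_map, frames)  # step 2: render the frame
        maps.append(next_map)
        frames.append(next_frame)
    return frames[len(past_frames):]
```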

  62. EXTENSION: FRAME PREDICTION: comparison of Ground truth, PredNet, MCNet, and Ours.

  63. INTERACTIVE GRAPHICS

  64.–65. PATH TO INTERACTIVE GRAPHICS: • Real-time inference • Combining with the existing graphics pipeline • Domain gap between real input and synthetic input

  66. PATH TO INTERACTIVE GRAPHICS: Real-time inference. FP16 + TensorRT gives a ~5x speed-up: 36 ms (27.8 fps) for 1080p inference; overall: 15–25 fps.
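
As a hedged illustration of half-precision inference timing (plain PyTorch FP16 rather than TensorRT, so numbers will differ from the slide; the model and input shape are placeholders):

```python
import time
import torch

@torch.no_grad()
def time_fp16_inference(model, iters=50):
    model = model.cuda().half().eval()               # FP16 weights
    x = torch.randn(1, 3, 1080, 1920).cuda().half()  # one 1080p input (placeholder shape)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    ms = (time.time() - start) / iters * 1000
    print(f"{ms:.1f} ms/frame ({1000 / ms:.1f} fps)")
```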

  67. PATH TO INTERACTIVE GRAPHICS: Combining with the existing graphics pipeline. CARLA is an open-source simulator for autonomous driving research: make the game engine render semantic maps, then pass the maps to the network and display the inference result.
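
A sketch of the CARLA side using its Python API; the semantic-segmentation camera sensor is part of CARLA, while the network inference and display steps are left as placeholders.

```python
import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn a vehicle with a semantic-segmentation camera attached.
bp_lib = world.get_blueprint_library()
vehicle = world.spawn_actor(bp_lib.filter('vehicle.*')[0],
                            world.get_map().get_spawn_points()[0])
cam_bp = bp_lib.find('sensor.camera.semantic_segmentation')
camera = world.spawn_actor(cam_bp,
                           carla.Transform(carla.Location(x=1.5, z=2.4)),
                           attach_to=vehicle)

def on_semantic_map(image):
    # image.raw_data holds per-pixel class labels (encoded in the red channel);
    # convert it, run the vid2vid network, and display the synthesized frame.
    pass  # network inference + display go here (placeholders)

camera.listen(on_semantic_map)
```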

  68. PATH TO INTERACTIVE GRAPHICS: Domain gap between real input and synthetic input. The network is trained on real data but tested on synthetic data; things that differ include object shapes/edges, density of objects, camera viewpoints, etc. On-going work.

  69. ORIGINAL CARLA IMAGE

  70. RENDERED SEMANTIC MAPS

  71.–72. RECORDED DEMO RESULTS

  73. CONCLUSION

  74. CONCLUSION: • What can we achieve? • What can it be used for?

  75.–77. CONCLUSION: What can we achieve? • Synthesize high-res realistic images • Produce temporally-smooth videos • Reinvent interactive graphics

  78. CONCLUSION: What can it be used for? • AI-based rendering: machine learning graphics driven by data rather than by traditional geometry, texture, and lighting • High-level semantic manipulation: original image → high-level representation → new image

  79. THANK YOU https://github.com/NVIDIA/vid2vid
