FloWaveNet: A Generative Flow for Raw Audio

Jul 17, 2023


SLIDE 1

FloWaveNet: A Generative Flow for Raw Audio

Sungwon Kim¹, Sang-gil Lee¹, Jongyoon Song¹, Jaehyeon Kim², Sungroh Yoon¹,³

¹Seoul National University, ²Kakao Corporation, ³ASRI, INMC, Institute of Engineering Research, Seoul National University

ICML 2019

Poster 6/12 6:30 PM @Pacific Ballroom #2

SLIDE 2

WaveNet

$$\log P_X(x_{1:T}) = \sum_{t=1}^{T} \log P_X(x_t \mid x_{<t})$$

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

SLIDE 3

WaveNet

$$\log P_X(x_{1:T}) = \sum_{t=1}^{T} \log P_X(x_t \mid x_{<t})$$

Sequential sampling

https://deepmind.com/blog/wavenet-generative-model-raw-audio/
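The autoregressive factorization on this slide, and the reason sampling from it is sequential, can be sketched in a few lines. This is a toy illustration only: `cond_params` is a hypothetical stand-in for the WaveNet network, and the Gaussian conditional is an assumption made to keep the example short, not the output distribution WaveNet actually uses.

```python
import math
import random


def cond_params(history):
    """Hypothetical stand-in for WaveNet: mean and std of x_t given x_{<t}."""
    mean = 0.9 * history[-1] if history else 0.0
    return mean, 0.1


def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian (assumed conditional distribution)."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def log_likelihood(x):
    """log P_X(x_{1:T}) = sum over t of log P_X(x_t | x_{<t})."""
    total = 0.0
    for t in range(len(x)):
        mean, std = cond_params(x[:t])
        total += gaussian_logpdf(x[t], mean, std)
    return total


def sample(T, rng):
    """Sequential sampling: x_t cannot be drawn before x_{<t} exists."""
    x = []
    for _ in range(T):
        mean, std = cond_params(x)
        x.append(rng.gauss(mean, std))
    return x


if __name__ == "__main__":
    x = sample(16, random.Random(0))
    print(log_likelihood(x))
```

The likelihood of a given waveform can be evaluated for all timesteps independently because every `x[:t]` is already known, whereas the loop in `sample` has a true data dependency from one step to the next.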

SLIDE 4

Previous parallel speech synthesis models

[Diagram: distillation from the pre-trained WaveNet into the IAF]
Pre-trained WaveNet
Inverse Autoregressive Flows (IAFs)
Probability Density Distillation

Oord, Aaron, et al. "Parallel WaveNet: Fast High-Fidelity Speech Synthesis." International Conference on Machine Learning. 2018.

SLIDE 5

Previous parallel speech synthesis models

[Diagram: distillation from the pre-trained WaveNet into the IAF]
Pre-trained WaveNet
Inverse Autoregressive Flows (IAFs)
Probability Density Distillation
Parallel sampling

Oord, Aaron, et al. "Parallel WaveNet: Fast High-Fidelity Speech Synthesis." International Conference on Machine Learning. 2018.

SLIDE 6

Previous parallel speech synthesis models

[Diagram: distillation from the pre-trained WaveNet into the IAF]
Pre-trained WaveNet
Inverse Autoregressive Flows (IAFs)
Probability Density Distillation + Power Loss, Perceptual Loss, Contrastive Loss, Frame Loss
Parallel sampling

Oord, Aaron, et al. "Parallel WaveNet: Fast High-Fidelity Speech Synthesis." International Conference on Machine Learning. 2018.
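Probability Density Distillation, as described in the Parallel WaveNet paper cited above, trains the IAF student by minimizing a KL divergence from the student distribution to the pre-trained WaveNet teacher distribution. The sketch below estimates such a KL term by Monte Carlo. Both distributions are replaced by fixed univariate Gaussians purely for illustration; the names `kl_estimate`, `student`, and `teacher` are hypothetical and do not come from either paper's code.

```python
import math
import random


def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def kl_estimate(samples, student, teacher):
    """Monte Carlo estimate of KL(P_S || P_T) = E_{x ~ P_S}[log P_S(x) - log P_T(x)].

    `samples` must be drawn from the student; `student` and `teacher` are
    (mean, std) pairs standing in for the IAF and the pre-trained WaveNet.
    """
    diffs = [gaussian_logpdf(x, *student) - gaussian_logpdf(x, *teacher) for x in samples]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(0)
    student, teacher = (0.0, 1.0), (1.0, 1.0)
    xs = [rng.gauss(*student) for _ in range(20000)]
    # For these two Gaussians the exact KL divergence is 0.5.
    print(kl_estimate(xs, student, teacher))
```

The estimate needs a teacher that already assigns good densities to speech, which is why this training route depends on a well pre-trained WaveNet.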

SLIDE 7

Our Objectives

Simplify the training procedure for parallel sampling
Maintain the quality of speech samples

SLIDE 8

Our Objectives

Simplify the training procedure for parallel sampling
Maintain the quality of speech samples

Flow-based generative models for raw audio!

SLIDE 9

FloWaveNet

$$\log P_X(x_{1:T}) = \log P_Z\big(f_\theta(x_{1:T})\big) + \log \det\left(\frac{\partial f_\theta(x)}{\partial x}\right)$$

Training phase: raw audio ($P_X$) → $f_\theta$ → Gaussian noise ($P_Z$)

SLIDE 10

FloWaveNet

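$$\log P_X(x_{1:T}) = \log P_Z\big(f_\theta(x_{1:T})\big) + \log \det\left(\frac{\partial f_\theta(x)}{\partial x}\right)$$

Training phase: raw audio ($P_X$) → $f_\theta$ → Gaussian noise ($P_Z$)

Sampling phase: $z = z_{1:T} \sim P_Z(z) = \mathcal{N}(0, I)$, $\hat{x} = f_\theta^{-1}(z)$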

SLIDE 11

FloWaveNet

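$$\log P_X(x_{1:T}) = \log P_Z\big(f_\theta(x_{1:T})\big) + \log \det\left(\frac{\partial f_\theta(x)}{\partial x}\right)$$

Training phase: raw audio ($P_X$) → $f_\theta$ → Gaussian noise ($P_Z$)

Sampling phase: $z = z_{1:T} \sim P_Z(z) = \mathcal{N}(0, I)$, $\hat{x} = f_\theta^{-1}(z)$

Both transformations $f_\theta$ and $f_\theta^{-1}$ are designed to be computed efficiently

→ Efficient training & parallel sampling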
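The change-of-variables objective and the two phases can be illustrated with the simplest possible invertible map. The sketch below assumes a single elementwise affine flow with hypothetical parameters `MU` and `LOG_SIGMA`; the real model uses a far richer transformation, so this only shows the mechanics: the training objective is the log-density of the mapped noise plus a log-determinant, and sampling is one inverse pass over all timesteps.

```python
import math
import random

MU, LOG_SIGMA = 0.5, -1.0  # hypothetical flow parameters, chosen for illustration


def f(x):
    """Forward flow (training direction): raw audio -> Gaussian noise."""
    return [(xt - MU) * math.exp(-LOG_SIGMA) for xt in x]


def f_inv(z):
    """Inverse flow (sampling direction): Gaussian noise -> raw audio."""
    return [zt * math.exp(LOG_SIGMA) + MU for zt in z]


def log_det_jacobian(x):
    """The Jacobian of f is diagonal with entries exp(-LOG_SIGMA)."""
    return -LOG_SIGMA * len(x)


def standard_normal_logpdf(z):
    """log N(z; 0, I) for a list of independent coordinates."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * zt * zt for zt in z)


def log_likelihood(x):
    """log P_X(x) = log P_Z(f(x)) + log det(df/dx)."""
    return standard_normal_logpdf(f(x)) + log_det_jacobian(x)


def sample(T, rng):
    """Draw all of z at once, then invert; no step depends on an earlier output."""
    z = [rng.gauss(0.0, 1.0) for _ in range(T)]
    return f_inv(z)


if __name__ == "__main__":
    print(log_likelihood([0.1, 0.7, -0.3]))
    print(sample(4, random.Random(0)))
```

For this toy flow the implied data distribution is a Gaussian with mean `MU` and standard deviation `exp(LOG_SIGMA)`, which gives a closed form to check `log_likelihood` against.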

SLIDE 12

FloWaveNet

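$$\log P_X(x_{1:T}) = \log P_Z\big(f(x_{1:T})\big) + \sum_{n} \log \det\left(\frac{\partial \big(f_{AC}^{n} \cdot f_{AN}^{n}\big)(x)}{\partial x}\right)$$

[Diagram: the flow $f$ is built from blocks $f_{AN}^{n}$ and $f_{AC}^{n}$]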
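The sum over log-determinants on this slide can be made concrete with a toy stack of flow steps. The sketch reads $f_{AN}$ as activation normalization and $f_{AC}$ as affine coupling, which is an interpretation of the slide's subscripts. Everything else is assumed for illustration: `shift_and_log_scale` is a hypothetical stand-in for the network inside the coupling layer, the input is split into first and second halves, the input length is assumed even, and a simple reversal plays the role of reordering between steps.

```python
import math


def actnorm(x, log_scale, bias):
    """Elementwise affine map; returns the output and its log-determinant."""
    return [(xt + bias) * math.exp(log_scale) for xt in x], log_scale * len(x)


def actnorm_inverse(y, log_scale, bias):
    return [yt * math.exp(-log_scale) - bias for yt in y]


def shift_and_log_scale(xa, w):
    """Hypothetical stand-in for the network predicting shift m and log-scale s."""
    return [w * v for v in xa], [math.tanh(w * v) for v in xa]


def affine_coupling(x, w):
    """Leave the first half unchanged; transform the second half given the first."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    m, s = shift_and_log_scale(xa, w)
    yb = [(b - mi) * math.exp(-si) for b, mi, si in zip(xb, m, s)]
    return xa + yb, -sum(s)


def affine_coupling_inverse(y, w):
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    m, s = shift_and_log_scale(ya, w)
    return ya + [b * math.exp(si) + mi for b, mi, si in zip(yb, m, s)]


def flow_forward(x, params):
    """Apply each (actnorm, coupling) step and accumulate the log-determinants."""
    total = 0.0
    for log_scale, bias, w in params:
        x, ld_an = actnorm(x, log_scale, bias)
        x, ld_ac = affine_coupling(x, w)
        x = x[::-1]  # reorder so the other half is transformed next (log-det 0)
        total += ld_an + ld_ac
    return x, total


def flow_inverse(z, params):
    """Undo the steps in reverse order; no sequential loop over timesteps."""
    for log_scale, bias, w in reversed(params):
        z = affine_coupling_inverse(z[::-1], w)
        z = actnorm_inverse(z, log_scale, bias)
    return z
```

Because each step has a triangular or diagonal Jacobian, its log-determinant is a plain sum, and the total is the sum over steps as in the equation on this slide.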

SLIDE 13

Mean Opinion Scores

FloWaveNet ≥ Gaussian IAF

SLIDE 14

Sampling speed

FloWaveNet ≅ Gaussian IAF ≅ Parallel WaveNet >> Autoregressive WaveNet

Thousands of times faster than autoregressive WaveNet

SLIDE 15

Conclusion

FloWaveNet produces high-quality audio samples, on par with previous distilled models.

FloWaveNet synthesizes audio samples in parallel
– without a well pre-trained WaveNet (no distillation!)
– without auxiliary loss terms

Demo page Code

Poster 6/12 6:30 PM @Pacific Ballroom #2
