Deep learning for speech synthesis
The good news, the bad news, and the fake news
Scott Stevenson
scott@faculty.ai
The fake news
The effect of “hot mic” incidents
● “Hot mic” incidents can bring huge negative publicity
● Politicians are particularly at risk
● Incidents can change voting intentions and the outcome of elections
● This has been demonstrated on multiple occasions
“Ugh everything! She’s just a sort of bigoted woman that said she used to be Labour. I mean it’s just ridiculous.”
The bad news
● If adversaries can generate realistic audio they can fabricate “hot mic” recordings
● Traditionally requires domain expertise
● Modern deep learning makes this imminently possible
● We can’t stop technology, but we can inoculate people against it
text → Frontend → linguistic representation → Backend → audio
Example linguistic representation: L IH NG G W IH S T IH K . R EH P R AH Z EH N T EY SH AH N .
Frontend
Frontend: tokenisation and normalisation
“IBM was founded in 1911” → “i b m was founded in nineteen eleven”
“Apple is valued at $1 trillion” → “apple is valued at one trillion dollars”
“He lives on St Paul’s St.” → “he lives on saint paul’s street”
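The first example above can be reproduced with a few rules. This is only a minimal sketch, not a production normaliser: the function names are hypothetical, it assumes all-caps tokens are acronyms and four-digit tokens are years read as two pairs, and it does not attempt currency amounts or the two meanings of “St”.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_to_words(year: str) -> str:
    """Read a four-digit year as two pairs, e.g. 1911 -> nineteen eleven.

    Assumption: years such as 1900 or 2005, which are read differently,
    are out of scope for this sketch.
    """
    first, second = int(year[:2]), int(year[2:])
    return f"{two_digits_to_words(first)} {two_digits_to_words(second)}"


def normalise(text: str) -> str:
    """Lowercase the text, spell out acronyms and expand years."""
    tokens = []
    for token in text.split():
        if re.fullmatch(r"[A-Z]{2,}", token):
            # Assumed rule: all-caps tokens are acronyms, read letter by letter.
            tokens.extend(token.lower())
        elif re.fullmatch(r"\d{4}", token):
            tokens.append(year_to_words(token))
        else:
            tokens.append(token.lower())
    return " ".join(tokens)


print(normalise("IBM was founded in 1911"))
# i b m was founded in nineteen eleven
```

Even this small case needs three hand-written rules, which hints at why a conventional frontend takes so much feature engineering.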
Frontend: phonetic transcriptions
“apple is valued at one trillion dollars” → AE P AH L . IH Z . V AE L Y UW D . AE T . W AH N . T R IH L Y AH N . D AA L ER Z .
The CMU Pronouncing Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
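At its simplest, the transcription step is a dictionary lookup. The sketch below hard-codes only the seven words from the example, with the phonemes copied from the slide; the `PRONUNCIATIONS` table and `transcribe` function are hypothetical names, and a real frontend would load the full dictionary (which also marks lexical stress) and need a fallback for unknown words.

```python
# Tiny hand-copied excerpt in the style of the CMU Pronouncing Dictionary.
# Only the words from the example sentence are included.
PRONUNCIATIONS = {
    "apple": "AE P AH L",
    "is": "IH Z",
    "valued": "V AE L Y UW D",
    "at": "AE T",
    "one": "W AH N",
    "trillion": "T R IH L Y AH N",
    "dollars": "D AA L ER Z",
}


def transcribe(text: str) -> str:
    """Map normalised text to phonemes, ending each word with ' .'."""
    phones = []
    for word in text.split():
        if word not in PRONUNCIATIONS:
            # A real system would fall back to a grapheme-to-phoneme model.
            raise KeyError(f"no pronunciation for {word!r}")
        phones.append(PRONUNCIATIONS[word] + " .")
    return " ".join(phones)


print(transcribe("apple is valued at one trillion dollars"))
# AE P AH L . IH Z . V AE L Y UW D . AE T . W AH N . T R IH L Y AH N . D AA L ER Z .
```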
Backend
Backend: concatenative
● Synthesise waveform from linguistic representation
● Most commonly concatenative systems
● Prerecorded database of audio samples (“units”)
● Typically 10 ms to 1 s long
● Picks best units to concatenate
Problems with concatenative systems
● Require large database of high quality recordings
● Can’t change speaker or emotion without new database
● High intelligibility and naturalness
● Distinguishable by prosody (intonation, tone, stress, rhythm)
  ○ “My latest project is to learn how to project my voice”: two pronunciations of project
  ○ Liaison in French: final consonant no longer silent if following word begins with vowel
Backend: parametric
● Don’t use pre-recorded units
● Mathematical model contains information to synthesise speech
● Speaker and emotion stored in params
● Speech contents controlled by input
● Model outputs passed to vocoder
● Less natural than concatenative systems because of DSP artefacts
WaveNet (arXiv 1609.03499)
● Change of paradigm for parametric speech synthesis
● Don’t feed model output to vocoder to generate waveform
● Instead, sample waveform directly from neural network
● Sample at ≥16 kHz to generate audio
Causal convolutions
[Figure: causal convolutions; axis: time]
Causal convolutions
● Problem: causal convolutions require huge depth to give sufficiently large receptive field for good prosody
● Such a depth is computationally infeasible to train
● Chosen solution is to dilate convolutions
● Skip input values with interval to increase receptive field
● Receptive field grows exponentially with depth
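The difference between linear and exponential growth can be checked numerically. The sketch below is an illustration in NumPy, not WaveNet itself: it assumes kernels of size 2, a dilation that doubles with each layer, and all-ones weights, and the function names are hypothetical. It computes the receptive field from the formula 1 + Σ (kernel_size − 1) × dilation, then confirms it by pushing a unit impulse through the stack and counting how many output samples it reaches.

```python
import numpy as np


def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_conv(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k * dilation], zero-padded on the left.

    The output at time t never depends on inputs later than t.
    """
    pad = (len(weights) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        start = pad - k * dilation
        y += w * padded[start:start + len(x)]
    return y


def impulse_response_length(dilations, n=64):
    """Push a unit impulse through the stack and count the non-zero outputs."""
    x = np.zeros(n)
    x[0] = 1.0
    for d in dilations:
        x = causal_conv(x, np.ones(2), dilation=d)
    return int(np.count_nonzero(x))


plain = [1, 1, 1, 1]      # four undilated layers
dilated = [1, 2, 4, 8]    # dilation doubles with each layer

print(receptive_field(plain), impulse_response_length(plain))      # 5 5
print(receptive_field(dilated), impulse_response_length(dilated))  # 16 16
```

With four layers the undilated stack sees 5 samples while the dilated one sees 16. At a 16 kHz sampling rate, an undilated stack would need roughly one layer per sample of context, which is why depth alone is not a practical route to a large receptive field.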
Dilated causal convolutions
[Figure: dilated causal convolutions; axis: time]
Activation function (arXiv 1606.05328)
● Use gated activation taken from PixelCNN
[Equation: gated activation, with “Filter” and “Gate” terms labelled]
● Empirical choice: performs better than ReLU activation
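In the WaveNet paper the gated activation is written z = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x), where the tanh term is the filter and the sigmoid term is the gate. The sketch below assumes the two convolutions have already been computed and only shows how the branches are combined; `filter_out` and `gate_out` are hypothetical names for those pre-activations.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def gated_activation(filter_out, gate_out):
    """tanh of the filter branch, scaled elementwise by a gate in (0, 1)."""
    return np.tanh(filter_out) * sigmoid(gate_out)


# Hypothetical pre-activations for three time steps.
filter_out = np.array([2.0, 2.0, -2.0])
gate_out = np.array([10.0, -10.0, 10.0])

# An open gate passes tanh(filter) through; a closed gate suppresses it.
print(gated_activation(filter_out, gate_out))
```

Unlike ReLU, the output is bounded in (−1, 1) and the gate lets each unit decide how much of the filter signal to pass on.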
Activation function (arXiv 1609.03499)
● Need to condition locally on the input text sequence
● Have a second time series h (i.e. from linguistic frontend)
● Learned upsampling y = f(h) to same frequency as x
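With local conditioning, the WaveNet paper adds the upsampled series to both branches of the gated activation: z = tanh(W_f ∗ x + V_f ∗ y) ⊙ σ(W_g ∗ x + V_g ∗ y). The sketch below is a simplification under stated assumptions: the learned upsampling f is replaced by plain repetition with `np.repeat`, the 1×1 convolutions V_f and V_g are reduced to scalars, and all names and values are hypothetical.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def upsample(h, factor):
    """Stand-in for the learned upsampling f: repeat each frame `factor` times."""
    return np.repeat(h, factor)


def conditioned_gated_activation(filter_x, gate_x, y, v_filter, v_gate):
    """Gated activation with the conditioning series y added to both branches."""
    return np.tanh(filter_x + v_filter * y) * sigmoid(gate_x + v_gate * y)


rng = np.random.default_rng(0)

h = np.array([0.0, 1.0, -1.0, 0.5])   # 4 linguistic frames (made-up values)
y = upsample(h, factor=4)             # 16 values, one per audio sample

filter_x = rng.normal(size=16)        # stand-in for W_f * x
gate_x = rng.normal(size=16)          # stand-in for W_g * x

z = conditioned_gated_activation(filter_x, gate_x, y, v_filter=0.5, v_gate=0.5)
print(y.shape, z.shape)
```

Because y has one value per audio sample, the linguistic features can shift the filter and gate at every time step, which is how the text steers what the network says.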
WaveNet limitations
● WaveNet can generate very human-sounding waveforms
● But how do we tell WaveNet what to say?
● Still requires an extensive feature-engineering frontend
● The frontend needs time and linguistic expertise, and is brittle
● How do we improve on the conventional frontend?
deep learning
Tacotron 2
“Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.”
The good news
Classification
Generative adversarial networks
Via backpropagation, the Generator learns to produce better audio, while the Discriminator learns to better distinguish synthetic from real.
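The two competing objectives can be written as a pair of loss functions. The talk does not say which losses were used, so the sketch below assumes the standard binary cross-entropy discriminator loss and the non-saturating generator loss; the function names and probabilities are made up for illustration. In practice a deep learning framework would backpropagate these losses through both networks.

```python
import numpy as np


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy with real clips labelled 1 and synthetic clips 0.

    d_real and d_fake are the discriminator's probabilities that each clip
    is real.
    """
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: low when synthetic clips look real."""
    return -np.mean(np.log(d_fake))


# Hypothetical discriminator outputs early in training: fakes are easy to spot.
early_real, early_fake = np.array([0.9, 0.8]), np.array([0.1, 0.2])

# Hypothetical outputs later on: the discriminator can no longer tell.
late_real, late_fake = np.array([0.5, 0.5]), np.array([0.5, 0.5])

print(discriminator_loss(early_real, early_fake), generator_loss(early_fake))
print(discriminator_loss(late_real, late_fake), generator_loss(late_fake))
```

As the generator improves, its loss falls and the discriminator's loss rises, which is the tug of war the slide describes. A trained discriminator is also a classifier for synthetic audio.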
We’re hiring!
We are hiring data scientists and machine learning engineers at all levels. If you’re interested in finding out more about Faculty and our work, get in touch!
scott@faculty.ai