Non-parametric duration modelling for speech synthesis with a joint model of acoustics and duration


  1. Non-parametric duration modelling for speech synthesis with a joint model of acoustics and duration
     Gustav Eje Henter¹, Srikanth Ronanki², Oliver Watts², and Simon King²
     ¹ Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo
     ² The Centre for Speech Technology Research (CSTR), The University of Edinburgh, UK
     Henter et al. (NII & UEDIN) Non-parametric TTS duration modelling 2017-01-20 1 / 40

  2. Graphical overview
     [Diagram: two pipelines laid out across a phone level and a frame level]
     • Conventional: text features → duration modelling → phone durations → acoustic modelling → speech parameters
     • Proposed: text features; duration modelling producing transition probabilities; acoustic modelling; speech parameters

  4. Key takeaways
     • Innovations
       1. Train an RNN/DNN to predict per-frame transition probabilities
       2. Generate durations using median or other distribution quantiles
     • Advantages
       • Non-parametric: can model any duration distribution shape!
       • Predicts acoustics and durations in tandem
       • Is a proper hidden semi-Markov model

  5. Outline
     1. Background
     2. Formal specification
     3. Experiments
     4. Extensions

  6. Motivation
     • Prosody remains a major shortcoming of TTS
     • Duration is an important prosodic component
     • State-of-the-art (Gaussian) duration models:
       • Allow non-positive durations
       • Do not sum to one on the integers (unnormalised)
       • Do not account for skewness
       • Are separate from the acoustic model

  7. Real durations
     Forced-aligned durations from dataset vs. fitted Gaussian
     [Figure: probability (0.00 to 0.07) against phone duration $d_p$ (frames, 0 to 100). Legend: Gaussian fit; mean = 19.83; median = 16.]

  8. Statistical TTS
     Statistical parametric speech synthesis requires three components:
     1. A stochastic distribution family $f_D(d; \theta)$ for durations $D$
     2. A machine-learning predictor $\theta(l)$
        • $l$ are text-derived linguistic features
        • Predicts how duration distributions depend on text
        • Is learned from training data (statistical)
     3. A duration-generation principle
        • Mean-based generation $\hat{d} = E(D \mid l)$

  9. HMM-based TTS
     Speech is generated by a hidden Markov model (HMM)
     • Hidden-state models specified by:
       • Emissions: $f_{O \mid S}(o \mid s)$ (acoustic observations $o$)
       • Transition probability: $P(S_{t+1} = s + 1 \mid S_t = s)$ (durations)
     • State transitions follow a Markov process
     • State $S_t$ tracks sub-phone time evolution
     • Training (EM algorithm) is linear in sequence length

  10. HMM-based durations
      1. Geometric duration distribution $f_D(d; a) = a (1 - a)^{d - 1}$
         • Implicit consequence of fixed HMM transition probability $a$
         • Memoryless (unrealistic)
      2. Regression tree (RT) predictor $a(l)$
      3. Mean-based generation $\hat{d} = E(D \mid l) \propto \frac{1}{a(l)}$
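
The geometric case can be checked numerically. The sketch below is only an illustration, not code from the talk: the exit probability `a = 0.05` and all function names are assumptions chosen for the example. It confirms that the distribution sums to one, that its mean is $1/a$, and that it is memoryless, i.e. the chance of staying another $n$ frames does not depend on how long the state has already lasted.

```python
def geometric_pmf(d: int, a: float) -> float:
    """P(D = d) when the state is left with fixed probability a at every frame."""
    if d < 1:
        return 0.0
    return a * (1.0 - a) ** (d - 1)


def survival(d: int, a: float) -> float:
    """P(D > d) = (1 - a)^d for the geometric distribution."""
    return (1.0 - a) ** d


a = 0.05  # hypothetical per-frame exit probability

# Truncated sums; the neglected tail mass (1 - a)^1999 is negligible.
total = sum(geometric_pmf(d, a) for d in range(1, 2000))
mean = sum(d * geometric_pmf(d, a) for d in range(1, 2000))

# Memorylessness: P(D > m + n | D > m) equals P(D > n).
m, n = 10, 7
conditional = survival(m + n, a) / survival(m, a)

print(total, mean, 1.0 / a)
print(conditional, survival(n, a))
```

Because the probability mass decreases monotonically from $d = 1$, the single most likely duration under this model is always one frame, which is one way to see why the slide calls it unrealistic.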

  11. HSMM-based TTS
      Change to a hidden semi-Markov model (HSMM) (Zen et al., 2004)
      • Model specified by:
        • Emissions: $f_{O \mid S}(o \mid s)$
          • Unchanged
        • Transition probability: $P(S_{t+1} = s + 1 \mid S_t = s, n_t)$
          • Can now depend on $n_t$, the time spent in the current state
          • This is a semi-Markov process
      • Training complexity is now quadratic in sequence length

  12. HSMM-based durations
      1. Any parametric distribution $f_D(d; \theta)$ possible!
         • Gaussian distribution $f_D(d; \theta) = f_N(d; \mu, \sigma^2)$ standard in HTS (Zen et al., 2007)
         • Log-normal (Campbell, 1989) or gamma (Huber, 1990)
      2. Regression tree (RT) predictor $\theta(l)$
         • Unchanged
      3. Mean-based generation $\hat{d} = E(D \mid l) = \hat{\mu}(l)$
         • Unchanged
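
To make the drawbacks of a Gaussian duration model concrete, the following sketch evaluates a Gaussian on the integers. The parameters `mu = 5.0` and `sigma = 4.0` are hypothetical values picked to make the effect visible; they are not taken from the talk or its data.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


mu, sigma = 5.0, 4.0  # hypothetical duration mean and standard deviation (frames)

# Probability that the continuous Gaussian assigns to durations d <= 0.
mass_nonpositive = normal_cdf(0.0, mu, sigma)

# Treating the density as a pmf on d = 1, 2, ... does not give total mass one.
sum_on_positive_integers = sum(normal_pdf(d, mu, sigma) for d in range(1, 200))

print(mass_nonpositive, sum_on_positive_integers)
```

With these assumed parameters a noticeable share of the probability falls on non-positive durations, and the values on the positive integers sum to clearly less than one. The Gaussian is also symmetric around `mu`, so it cannot represent a skewed duration distribution.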

  13. NN-based durations
      1. Gaussian distribution $f_D(d; \theta) = f_N(d; \mu, \sigma^2)$
         • Unchanged
      2. Deep or recurrent neural network $\mu(l)$
         • DNNs/RNNs are more successful practical predictors
         • Typically, only $\mu$ is predicted (minimum MSE)
      3. Mean-based generation $\hat{d} = E(D \mid l) = \hat{\mu}(l)$
         • Unchanged
      Note: Data is forced-aligned using HMM/HSMM before training

  15. Approaches in review

      | TTS type | $f_D(d; \theta)$ | Level   | Pred. $\theta(l)$ | Generation |
      |----------|------------------|---------|-------------------|------------|
      | Formant  | -                | Phone   | -                 | Rule       |
      | Concat.  | -                | Phone   | -                 | Exemplar   |
      | HMM      | Geom.            | State   | RT                | Mean       |
      | HSMM     | Param.           | State   | RT                | Mean       |
      | NN       | Gauss.           | State   | NN                | Mean       |
      | Proposed | Non-par.         | ≤ Frame | NN                | Quantile   |

  16. Proposed approach
      1. General categorical distribution $f_D(d)$
         • Not restricted to a specific parametric form
      2. Deep or recurrent neural network
         • Predicts a transition probability for each time unit (e.g., frame)
         • Runs in tandem with acoustic model
      3. Quantile-based generation
         • Can be computed using $P(D \le d)$, the left tail of $f_D$, only
         • Median duration: special case, more probable than the mean
         • Benefits from statistical robustness (Henter et al., 2016)
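
Quantile-based generation from a categorical duration distribution can be sketched as follows. This is a minimal illustration under stated assumptions: the probability table `pmf` is made up for the example (a right-skewed shape), and the function names are hypothetical. The quantile is found by accumulating probability from the left, so only the left tail of the distribution is ever needed.

```python
def quantile_duration(pmf, q=0.5):
    """Smallest duration d with P(D <= d) >= q, where pmf[i] is P(D = i + 1)."""
    cdf = 0.0
    for i, p in enumerate(pmf):
        cdf += p
        if cdf >= q:
            return i + 1
    raise ValueError("pmf does not reach the requested quantile")


def mean_duration(pmf):
    """Expected duration E(D) under the same indexing convention."""
    return sum((i + 1) * p for i, p in enumerate(pmf))


# Hypothetical right-skewed distribution over durations 1..10 frames.
pmf = [0.10, 0.35, 0.20, 0.10, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03]

median = quantile_duration(pmf, 0.5)
rounded_mean = round(mean_duration(pmf))

print(median, pmf[median - 1])
print(rounded_mean, pmf[rounded_mean - 1])
```

In this assumed example the long right tail pulls the mean above the median, and the median lands on a duration with higher probability than the rounded mean does.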

  17. Outline
      1. Background
      2. Formal specification
      3. Experiments
      4. Extensions

  18. Preliminaries
      • $p \in \{1, \ldots, P\}$ is a phone/state index
      • $t \in \{1, \ldots, T\}$ is a time-step (frame) index
      • $D_p$ is the (stochastic) duration of phone/state $p$
        • Outcome values $d_p \in \mathbb{Z}_{>0}$
      • $l_p$ collects the per-phone linguistic features
      • The task is to generate durations: $(l_1, \ldots, l_P) \to (\hat{d}_1, \ldots, \hat{d}_P)$

  19. Conventional setup
      • Phone-level dataset $\mathcal{D}_p = ((l_1, \ldots, l_P), (d_1, \ldots, d_P))$
      • $L_p$ denotes the linguistic information influencing predictor at $p$
        • $L_p = (l_1, \ldots, l_p)$ for a unidirectional RNN
      • Phone-level DNN/RNN $d(L_p; W)$ predicts duration directly
      • NN weights $W$ chosen to minimise MSE prediction error
        $$\hat{W}(\mathcal{D}_p) = \operatorname*{argmin}_W \sum_{p \in \mathcal{D}_p} (d_p - d(L_p; W))^2$$
      • The theoretical MSE minimiser is the expected duration
      • Frame-level acoustic modelling is a separate stage
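
The statement that the MSE minimiser is the expected duration can be illustrated with a constant predictor. The sketch below uses a small made-up list of durations (an assumption, not data from the talk) and a simple grid search. For contrast it also minimises the mean absolute error, whose minimiser is the median.

```python
from statistics import mean, median

# Hypothetical right-skewed phone durations in frames.
durations = [3, 4, 4, 5, 6, 7, 9, 14, 30]


def mse(c, data):
    """Mean squared error of predicting the constant c for every item."""
    return sum((d - c) ** 2 for d in data) / len(data)


def mae(c, data):
    """Mean absolute error of predicting the constant c for every item."""
    return sum(abs(d - c) for d in data) / len(data)


# Candidate constant predictions on a 0.1-frame grid from 0 to 40.
candidates = [k / 10 for k in range(0, 401)]
best_mse = min(candidates, key=lambda c: mse(c, durations))
best_mae = min(candidates, key=lambda c: mae(c, durations))

print(best_mse, mean(durations))
print(best_mae, median(durations))
```

On skewed data like this, the squared-error criterion is pulled towards the few long durations, while the absolute-error criterion stays with the bulk of the data.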

  20. Frame-level data
      • Frame-level sequence of linguistic features $L_t = (l_1, \ldots, l_t) = (l_{p(1)}, \ldots, l_{p(t)})$
        • $p(t)$ is the current phone at frame $t$
        • $t_0$ is the end frame of the previous phone
        • The current phone has lasted $n_t = t - t_0$ frames
      • Define per-frame indicator variables $x_t = I(n_t = d_{p(t)})$
        • Equal to one if $t$ is the last frame of phone $p(t)$, and zero otherwise
      • Frame-level dataset $\mathcal{D}_t = (L_T, (x_1, \ldots, x_T))$
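
The conversion from phone-level durations to the frame-level quantities defined above can be sketched in a few lines. The function names and the example durations are illustrative assumptions; the logic follows the definitions of $n_t$ and $x_t$ directly.

```python
def durations_to_indicators(durations):
    """x_t: 1 on the last frame of each phone, 0 on all other frames."""
    x = []
    for d in durations:
        x.extend([0] * (d - 1) + [1])
    return x


def frames_in_phone(durations):
    """n_t: how many frames the current phone has lasted at frame t."""
    n = []
    for d in durations:
        n.extend(range(1, d + 1))
    return n


def indicators_to_durations(x):
    """Invert durations_to_indicators by counting frames between ones."""
    durations, run = [], 0
    for xt in x:
        run += 1
        if xt == 1:
            durations.append(run)
            run = 0
    return durations


example = [3, 1, 2]  # hypothetical durations of three phones, in frames
print(durations_to_indicators(example))
print(frames_in_phone(example))
```

The indicator sequence carries the same information as the duration list, which is why it can serve as the training target for a frame-level predictor.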

  21. Example
      Example of binary $x_t$ sequence from database utterance
      [Figure: binary transition indicator $x_t$ (0.0 to 1.0) against time $t$ (frames, 0 to 350)]

  22. Transition probabilities
      • Idea: Consider the transition probability
        $$\pi_t = \pi(L_t) = P(D_p = n_t \mid D_p \ge n_t, L_t)$$
        • $1 - \pi_t$ is the probability of remaining in the same phone/state
      • This defines an unambiguous, proper duration distribution
        $$P(D_p = n_t \mid L_t) = \pi(L_t) \prod_{t' = t_0 + 1}^{t_0 + n_t - 1} (1 - \pi(L_{t'}))$$
        if and only if
        • $\pi_t \in [0, 1] \;\forall t$
        • $\prod_{t' = t_0 + 1}^{\infty} (1 - \pi_{t'}) = 0$ when $p(t')$ is constant
      • All distributions on the positive integers can be written like this
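
The correspondence between per-frame transition probabilities and a duration distribution can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the lists of probabilities are made up, and in the proposed system the values $\pi_t$ would come from the neural network rather than a fixed list. The last function shows that a quantile can be generated frame by frame, stopping as soon as the accumulated probability reaches the target.

```python
def hazards_to_pmf(pi):
    """P(D = n) = pi_n * prod_{k < n} (1 - pi_k), for n = 1, 2, ..."""
    pmf, stay = [], 1.0
    for p in pi:
        pmf.append(p * stay)
        stay *= 1.0 - p
    return pmf


def pmf_to_hazards(pmf):
    """pi_n = P(D = n) / P(D >= n), the inverse mapping."""
    hazards, remaining = [], 1.0
    for p in pmf:
        hazards.append(p / remaining if remaining > 0.0 else 1.0)
        remaining -= p
    return hazards


def quantile_from_hazards(pi, q=0.5):
    """First n with P(D <= n) = 1 - prod_{k <= n} (1 - pi_k) >= q."""
    stay = 1.0
    for n, p in enumerate(pi, start=1):
        stay *= 1.0 - p
        if 1.0 - stay >= q:
            return n
    raise ValueError("transition probabilities do not reach the quantile")


pi = [0.1, 0.2, 0.5, 1.0]  # hypothetical per-frame transition probabilities
pmf = hazards_to_pmf(pi)
print(pmf, sum(pmf))
print(quantile_from_hazards(pi, 0.5))
```

A constant sequence of transition probabilities recovers the geometric distribution of the plain HMM, for example `hazards_to_pmf([0.5, 0.5, 0.5])` gives probabilities 0.5, 0.25 and 0.125 for durations of one, two and three frames.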
