A study of hypo- and hyper-articulated synthesized speech
Mauro Nicolao
Speech and Hearing Research Group - Department of Computer Science, The University of Sheffield
SCALE - Speech Communication with Adaptive Learning, 2nd Winter School, Aachen, February 15, 2011
Outline
a) The “Speech Synthesis by Analysis” project
b) Complete project architecture
c) TTS prototype with control on speech quality (towards H&H)
   a) Weighted MLLR transformation
   b) Global Variance model manipulation
   c) Dynamic- vs static-feature weight control in speech generation
d) Next steps
Mauro Nicolao A study of hypo- and hyper-articulated synthesised speech Aachen, February 15, 2011
Speech Synthesis by Analysis Project
• Modifications of human speech:
  ‒ voice intensity increasing
  ‒ speech rate adjustments
  ‒ noise rhythm adaptation
  ‒ signal processing (i.e. Lombard effect)
  ‒ change of word vocabulary
• Success in communication:
  ‒ to produce intelligible speech
  ‒ to satisfy the listener’s needs
  ‒ to transfer a concept from the talker’s to the listener’s mind
Lindblom (1990), Lane et al. (2007), Levelt et al. (1999)
Speech Synthesis by Analysis Project
• Automatic TTS systems ignore environmental effects on speech and any feedback from the listener.
• Many researchers in different disciplines are investigating models to describe human behaviour.
• A new way of thinking about automatic speech synthesis.
Moore (2007), Casserly and Pisoni (2010)
Complete project architecture
[Figure: project architecture diagram; labels: FEEDFORWARD, FEEDBACK, SII]
TTS prototype with control on speech quality
• Control function:
  ‒ none
• Synthesis:
  ‒ HTS + SAT synthesis
  ‒ STRAIGHT parameters
  ‒ GV control
• Control actions:
  ‒ Phoneme substitution
  ‒ MLLR transformation
  ‒ GV Gaussian model manipulation
  ‒ Dynamic feature weight control
TTS prototype with control on speech quality
[Figure: line spanning hyper-articulated speech (intelligible but unnatural), HTS-Demo speech and hypo-articulated speech (muttered but “friendly”)]
• Aim:
  ‒ Manipulate HTS model parameters to shift the speech quality along this line
  ‒ Act on generation parameters
  ‒ Only acoustic model manipulation
• Strategies:
  ‒ Weighted MLLR transformation
  ‒ Global Variance model manipulation
  ‒ Dynamic- vs static-feature weight control in speech generation
Weighted MLLR transformation
Idea: hypo-articulation can be obtained by reducing all the normally articulated vowels to the minimally articulated schwa. A CMLLR transformation can be trained to perform this change. Ideally, the “opposite” CMLLR transformation should define a transformation from the standard to the hyper-articulated acoustic space.
[Figure: diagram of the transformations; labels: T1, T2, T’1, T’2]
Weighted MLLR transformation
1. Substitute each vowel in the generation label files with a schwa, because this is the least articulated of the vowels.
2. Generate a small corpus of hypo-articulated speech (about 1100 utterances).
   [Audio examples: HTS-Demo speech, Hypo speech]
3. Train a CMLLR transformation from the standard to the hypo acoustic model.
4. New observation vectors (spectrum, F0 and duration):
   $\hat{o} = A o + b$
   o: observation vector generated by the standard model. A, b: parameters of the transformation.
5. This transformation can be weighted by using a scalar α:
   $\hat{o} = (\alpha A + (1 - \alpha) I)\, o + (\alpha b + (1 - \alpha)\, 0)$
   I: identity matrix. 0: all-zero matrix.
6. Ideally, the “opposite” CMLLR transformation should define a transformation from the standard to the hyper-articulated acoustic space.
   [Audio examples: HTS-Demo speech, Hyper speech]
7. The inverse transformation has been computed:
   $o = (\alpha A + (1 - \alpha) I)^{-1}\, \hat{o} - (\alpha A + (1 - \alpha) I)^{-1} (\alpha b + (1 - \alpha)\, 0)$
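The weighted transformation and its inverse can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `A` and `b` stand in for an already trained CMLLR transform, and the function names `weighted_params`, `apply_weighted` and `apply_inverse` are hypothetical, not part of HTS.

```python
import numpy as np

def weighted_params(A, b, alpha):
    """Interpolate between the identity (alpha = 0) and the full transform (alpha = 1)."""
    dim = A.shape[0]
    A_w = alpha * A + (1.0 - alpha) * np.eye(dim)
    b_w = alpha * b  # the (1 - alpha) * 0 term vanishes
    return A_w, b_w

def apply_weighted(o, A, b, alpha):
    """Forward direction: o_hat = A_w o + b_w."""
    A_w, b_w = weighted_params(A, b, alpha)
    return A_w @ o + b_w

def apply_inverse(o_hat, A, b, alpha):
    """Inverse direction: o = A_w^{-1} (o_hat - b_w), solved without forming the inverse."""
    A_w, b_w = weighted_params(A, b, alpha)
    return np.linalg.solve(A_w, o_hat - b_w)
```

With α = 0 the observation is left unchanged, and with α = 1 the full transform is applied. The inverse assumes that αA + (1 − α)I is non-singular for the chosen α.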
Global Variance model manipulation
Idea: change the Global Variance (GV) model parameters either to reduce or to amplify the range of variation in the generated feature vectors.
‒ Generation of c vectors with the Global Variance term:
  $P(c \mid \lambda, \lambda_\nu) = \sum_{\text{all } Q} P(Wc, Q \mid \lambda)^{\omega}\, P(\nu(c) \mid \lambda_\nu)$
  Toda and Tokuda (2007)
‒ Manipulating the GV model means manipulating the range of variance values of the observation vectors.
‒ Scaling factors are used to control the transformation (none for F0).
‒ This allows the variance to be increased, but the mean of the observation vectors still leads the feature generation.
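The effect of a GV scaling factor can be illustrated with a simplified sketch. Note the assumption: instead of the likelihood-based generation of Toda and Tokuda (2007), the code below only rescales an already generated trajectory around its own mean so that its per-dimension variance matches a scaled GV mean. The names `global_variance`, `rescale_to_gv`, `gv_mean` and `scale` are hypothetical.

```python
import numpy as np

def global_variance(c):
    """Per-dimension variance over time of a (T, D) feature trajectory."""
    return c.var(axis=0)

def rescale_to_gv(c, gv_mean, scale):
    """Stretch or shrink a trajectory around its mean to reach scale * gv_mean.

    scale > 1 widens the range of variation, scale < 1 narrows it.
    The utterance-level mean of each dimension is left untouched.
    """
    target = scale * gv_mean
    mean = c.mean(axis=0)
    current = np.maximum(global_variance(c), 1e-12)  # guard against division by zero
    gain = np.sqrt(target / current)
    return mean + gain * (c - mean)
```

Because only the deviation from the mean is rescaled, the mean of each feature dimension is preserved, which mirrors the remark that the mean still leads the feature generation.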
Dynamic- vs static-feature weight control
Idea: give more importance to dynamic vs. static features in the speech generation process.
1. Increasing (decreasing) the window weights in the generation process selects, among the possible realizations of a phoneme, the one with low (high) variation:
   $c = (W^{T} \hat{U}^{-1} W)^{-1} W^{T} \hat{U}^{-1} \hat{\mu}$
2. A different weight is used for each dynamic feature. The transformation is defined by the vector $[\alpha_1\ \alpha_2\ \alpha_3]$:
   $$
   \underbrace{\begin{bmatrix} \vdots \\ c_t \\ \Delta c_t \\ \Delta^2 c_t \\ \vdots \end{bmatrix}}_{o}
   =
   \underbrace{\begin{bmatrix}
   \cdots & \cdots & \cdots \\
   \alpha_1 0 & \alpha_1 I & \alpha_1 0 \\
   -\alpha_2 I/2 & \alpha_2 0 & \alpha_2 I/2 \\
   \alpha_3 I & -2\alpha_3 I & \alpha_3 I \\
   \cdots & \cdots & \cdots
   \end{bmatrix}}_{W}
   \underbrace{\begin{bmatrix} \vdots \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}}_{c}
   $$
3. $\alpha_1$ is usually set to 1 for F0 (pitch shifting).
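The generation formula with weighted windows can be sketched for a one-dimensional feature stream. Several choices here are assumptions made for illustration: the windows are the common [−1/2, 0, 1/2] delta and [1, −2, 1] delta-delta windows, the covariance is diagonal, frames at the utterance edges are replicated, and the mean vector is laid out as static, delta, delta-delta per frame. The function names are hypothetical.

```python
import numpy as np

def build_window_matrix(T, alphas):
    """Build the (3T, T) matrix W for scalar features with weights (a1, a2, a3)."""
    a1, a2, a3 = alphas
    W = np.zeros((3 * T, T))
    for t in range(T):
        prev, nxt = max(t - 1, 0), min(t + 1, T - 1)  # replicate edge frames
        W[3 * t, t] += a1                  # static row
        W[3 * t + 1, prev] += -0.5 * a2    # delta row
        W[3 * t + 1, nxt] += 0.5 * a2
        W[3 * t + 2, prev] += a3           # delta-delta row
        W[3 * t + 2, t] += -2.0 * a3
        W[3 * t + 2, nxt] += a3
    return W

def generate(mu, var, alphas):
    """Solve c = (W^T U^-1 W)^-1 W^T U^-1 mu for a diagonal covariance U.

    mu and var have length 3T, ordered as [static, delta, delta-delta] per frame.
    """
    T = mu.shape[0] // 3
    W = build_window_matrix(T, alphas)
    WtUinv = W.T * (1.0 / var)  # scales column j of W^T by 1 / var[j]
    return np.linalg.solve(WtUinv @ W, WtUinv @ mu)
```

With zero means for the dynamic features, larger α2 and α3 penalize frame-to-frame changes more strongly, so the generated trajectory becomes flatter, while small values let it follow the static means closely.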
Dynamic- vs static-feature weight control
[Figure: F1 trajectories for the phone sequence ae, l, ax, s; formant frequency (Hz, 0 to 1000) against time (s, 0.1 to 0.4631); three panels: α1=1, α2=0.2, α3=0.2; α1=1, α2=1, α3=1; α1=1, α2=10, α3=10]
Audio examples
[Figure: line spanning hyper-articulated speech, HTS-Demo speech and hypo-articulated speech]
‒ Vowel Reduction
‒ GV weight
‒ Dynamic control
‒ Dynamic + reduction
‒ Dynamic + reduction in noise
‒ GUI
Outline
a) The “Speech Synthesis by Analysis” project
b) Complete project architecture
c) First realizations:
   a) TTS prototype with extended Speech Intelligibility Index (SII) feedback
   b) TTS prototype with control on speech quality (towards H&H)
d) Next steps
Next steps …
• Add articulatory constraints
• Find new parameters to control feature generation
• Complete the control feedback by:
  ‒ defining an optimization function
  ‒ adding a recognition function
  ‒ real-time reactions
• Investigate a formant synthesiser as a possible vocoder
• Add more generalization in the parameter generation process:
  ‒ Multiple phonetization activated by the same word
  ‒ Bayesian synthesiser (ref. Zen, H.)
Thank you