Text-to-Speech Synthesis Bernd Möbius Language Science and Technology Saarland University Lecture 4 June 4, 2020 Diphone Synthesis B Möbius Concatenative synthesis
Concatenative synthesis: general procedure
▪ Data-based, concatenative synthesis
▪ offline:
  ▪ extract units from recordings of natural speech
  ▪ store one (the best) token of each unit in acoustic unit inventory (corpus)
▪ online:
  ▪ retrieve required units from inventory
  ▪ concatenate units sequentially and smoothly
  ▪ impose prosody (F0, duration, (amplitude))

Concatenative synthesis: basic unit
▪ Which acoustic units are appropriate?
  ▪ allophones? [Eng/Ger: 45; Hawaiian: 13; !Xóõ: 159]
  ▪ diphones? [Eng/Ger: 2,025]
  ▪ triphones? [Eng/Ger: 91,125]
  ▪ syllables? [Eng/Ger: 12,500+; Jap: 110]
▪ Default case in these slides, unless noted otherwise: diphone as basic unit

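The diphone and triphone figures quoted for English/German are the upper bounds obtained by counting every ordered pair or triple of the 45 allophones. A minimal sketch of that arithmetic (the function name `max_units` is an illustrative assumption, not part of the lecture):

```python
def max_units(n_phones: int, span: int) -> int:
    """Upper bound on unit types: ordered sequences of `span` phones
    drawn from an inventory of `n_phones` symbols."""
    return n_phones ** span


n = 45  # allophone count quoted on the slide for English/German
print("diphones: ", max_units(n, 2))
print("triphones:", max_units(n, 3))
```

These are theoretical maxima; phonotactic constraints mean that not every combination actually occurs in the language.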
Allophone synthesis (visited again)

Basic unit: diphone
[Figure: speech signal segmented into the diphones ə-v, v-ɛ, ɛ-s]

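To make the unit definition concrete, the following sketch turns a phone sequence into the diphone labels that would have to be retrieved from the inventory. The function name `to_diphones` and the boundary symbol `#` (standing in for silence at the utterance edges) are assumptions for illustration only.

```python
def to_diphones(phones, boundary="#"):
    """Return the diphone labels needed to synthesize `phones`.

    Each diphone spans the transition from one phone to the next;
    `boundary` is a hypothetical symbol for utterance-initial and
    utterance-final silence.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["ə", "v", "ɛ", "s"]))
```

A sequence of n phones yields n + 1 diphones when both edges are padded, so every phone-to-phone transition is covered by exactly one unit.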
Acoustic inventory construction
▪ Steps involved in constructing acoustic unit inventories for concatenative speech synthesis
  ▪ inventory design: list of required units (types)
  ▪ selection or construction of text material
  ▪ speaker selection
  ▪ recordings
  ▪ selection of best candidate (token) of each unit (type)
  ▪ unit extraction ('cutting' = indexing)
    ▪ fixed or flexible cut points?

Acoustic inventory design
▪ Comprise all relevant phonemic/allophonic variants (spectral properties, individual vowel space)
▪ Cover all well-formed sound sequences of target language (phonotactics, also across word boundaries)
▪ Model the most important coarticulatory effects (devoicing, rounding, nasalization, …)
▪ Concatenate units without audible discontinuities ('cuttability', unit candidate selection)
▪ Reasonable inventory size (recording time, quality control)

Individual vowel space
[Figure: vowel space plot, axes F2 and F1]

Individual vowel space
[Figure]

Coarticulation (voicing)
[Figure: segment labels ǝ v ɛ and k v̥ ɛ]

Coarticulation (voicing)
[Figure]

'Cuttability'
▪ Hard cuts in locations of minimal spectral change

Required units
▪ Why is the prediction of required units (types) difficult?
  ▪ speaker-specific properties of spoken language
  ▪ individual vowel space
  ▪ coarticulation and context-sensitivity
  ▪ sounds from foreign languages
▪ Criteria
  ▪ language-specific phonotactic constraints
  ▪ acoustic properties of speech sounds: some diphone types may not be required
  ▪ textbook vs. phonetic reality (cf. vowel space)

Text materials
▪ Selection or construction of text material for recordings, covering required units
▪ "natural" sentences
  ▪ large phonetic variation
  ▪ selection by greedy algorithm
  ▪ relatively small number of sentences
▪ carrier sentences
  ▪ controlled segmental and prosodic context
  ▪ constructed nonsense sentences or words
    /I-m/: "Er hatte Timmerei gesagt." / "He said timmy again."
  ▪ relatively large number of sentences

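Selecting "natural" sentences by a greedy algorithm can be read as a set-cover heuristic: repeatedly pick the sentence that contributes the most units still missing from the inventory. The sketch below assumes each sentence has already been transcribed into a set of diphone labels; the function name, the sentence ids, and the toy data are hypothetical.

```python
def greedy_select(sentences, required):
    """Greedy cover of `required` units.

    sentences: dict mapping a sentence id to the set of units it contains.
    required:  set of unit types the inventory needs.
    Returns (chosen sentence ids in pick order, units that no sentence covers).
    """
    uncovered = set(required)
    chosen = []
    while uncovered and sentences:
        # Pick the sentence that adds the most still-missing units.
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        gain = sentences[best] & uncovered
        if not gain:
            break  # remaining units occur in no sentence
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Toy example with made-up sentence ids and diphone sets.
corpus = {
    "s1": {"t-I", "I-m", "m-i"},
    "s2": {"I-m"},
    "s3": {"k-i", "i-t"},
}
needed = {"t-I", "I-m", "k-i", "i-t", "g-i"}
print(greedy_select(corpus, needed))
```

Units left uncovered at the end (here `g-i`) are the ones for which additional material, for instance a carrier sentence, would have to be constructed.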
Speaker selection
▪ Criteria for selecting a good voice ("voice talent")
  ▪ professional or "naïve" speaker?
  ▪ longer-term availability
  ▪ Is the voice pleasant (auditive-aesthetical)?
  ▪ Is the voice robust against signal processing?
  ▪ Does the voice remain pleasant after resynthesis?

Speaker selection
▪ Formal procedure [Syrdal et al. 1997, 1998; Schweitzer et al. 2006]
  ▪ "mini" TTS
  ▪ perception test with 3 voices, 15 sentences each
  ▪ intelligibility and pleasantness judgments (5-point scale)
  ▪ comparison for several factors
    ▪ signal processing method (e.g. PSOLA, HNM)
    ▪ RMS energy in voiceless regions
    ▪ spectral balance
    ▪ F0 variability
  ▪ different results for male vs. female voices

Recordings
▪ Recording conditions and practical considerations
  ▪ anechoic booth, or at least sound-treated studio
  ▪ professional microphone and headset
  ▪ parallel recording of speech and laryngograph signals
  ▪ auditory monitoring of extraneous noises
  ▪ phonetic monitoring of target units
  ▪ automatic recording regime, parallel back-up device
  ▪ monotonous or flat speaking style (?)
  ▪ all recordings in one session (?)
  ▪ make-up sessions for bad units

Unit candidate selection
▪ Selection of best candidate (token) of each unit (type)
▪ Objectives [Olive et al. 1998; Möbius 2001]
  ▪ find optimal cut and concatenation points
  ▪ cause minimal inter-segmental discontinuities
  ▪ optimal representation of target speech sounds
▪ Problem: phonetic variability
  ▪ systematic variation (coarticulation)
  ▪ random variability

Unit candidate selection: coarticulation
▪ Effects of prevocalic consonants on vowel formants (early, mid, late in vowel)

Unit candidate selection: coarticulation
▪ Effects of postvocalic consonants on vowel formants (early, mid, late in vowel)

Unit candidate selection: procedure
▪ Selection of best candidate (token) of each unit (type)
▪ Globally optimal selection, minimizing spectral discrepancies between any two diphones that can be concatenated (e.g. /t-i/ followed by /i-m/)
▪ Search for ideal point in F1, F2, F3 space [Olive et al. 1998; Möbius 2001]
  ▪ exhaustive search
  ▪ iterative grid search

Optimal cut and concatenation point

diph.   R[1]     R[2]
k-i     ✓✓✓✓✓    ✓✓
i-t     ✓✓✓✓✓    ✓✓
g-i     x        ✓
i-m     x        ✓
d-i     x        ✓
i-n     x        ✓
l-i     ✓        ✓
i-k     ✓        ✓
m-i     x        ✓
i-d     x        ✓

▪ region [1] covers 12 diphones (tokens), 4 types
▪ region [2] covers all 10 diphone types (ideal point)

Optimal cut and concatenation point
▪ Evaluating spectral discrepancies at concatenation point:
  $$\mathrm{DMAX} = \max_{i \in \{1,2,3\}} \frac{|T_i - F_i|}{B_i}$$
  ▪ $T_i$ = target formant values (data-based)
  ▪ $F_i$ = actual formant values (measured)
  ▪ $B_i$ = formant bandwidths (postulated)
▪ DMAX: maximal acceptable formant discrepancy
  ▪ here: threshold set by expert
  ▪ desired: perceptually motivated threshold

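The discrepancy measure and the search for an ideal point can be sketched as follows. This is a simplified reading under stated assumptions, not the procedure of Olive et al. or Möbius: the exhaustive search here simply tries every measured formant triple as the ideal point and keeps the one that covers the most diphone types. All function names, formant values, bandwidths, and the threshold are hypothetical.

```python
def formant_discrepancy(target, measured, bandwidths):
    """Largest bandwidth-normalized deviation |T_i - F_i| / B_i over F1..F3."""
    return max(abs(t - f) / b for t, f, b in zip(target, measured, bandwidths))


def covered_types(ideal, candidates, bandwidths, threshold):
    """Diphone types with at least one candidate cut point within `threshold`
    of the proposed ideal point."""
    return {
        dtype
        for dtype, points in candidates.items()
        if any(formant_discrepancy(ideal, p, bandwidths) <= threshold for p in points)
    }


def exhaustive_search(candidates, bandwidths, threshold):
    """Try every measured point as the ideal point; return the one that
    covers the most diphone types (assumed search strategy)."""
    all_points = [p for points in candidates.values() for p in points]
    return max(
        all_points,
        key=lambda p: len(covered_types(p, candidates, bandwidths, threshold)),
    )


# Made-up (F1, F2, F3) values in Hz at candidate cut points in the vowel /i/.
bandwidths = (50.0, 100.0, 150.0)   # postulated, illustrative only
threshold = 1.0                      # expert-set acceptance limit, illustrative
candidates = {
    "k-i": [(300, 2300, 3000), (280, 2200, 2900)],
    "i-t": [(290, 2250, 2950)],
    "g-i": [(350, 2000, 2800)],
}
ideal = exhaustive_search(candidates, bandwidths, threshold)
print(ideal, covered_types(ideal, candidates, bandwidths, threshold))
```

Types that fall outside the threshold for the chosen point (here `g-i`) correspond to the "x" entries in the coverage table: they need a different candidate token or a make-up recording.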
Unit candidate selection: problems
▪ Choice of appropriate speech representation (formants?)
▪ Choice of distance measure (perceptually motivated?)
  ▪ absolute distance vs. change of direction
▪ What to do if no suitable candidate is available?
▪ Need for diagnostic tools
▪ Criteria for selecting consonant candidates?
  ▪ e.g. amplitude profile, spectral balance
▪ Weighting of vocalic vs. consonantal features

Final selection for inventory
▪ Selection of best candidate for each required diphone
  ▪ final selection of best candidate (if more than one meets the DMAX criterion)
  ▪ final selection of cut point (if more than one meets the DMAX criterion)
  ▪ automatically (objectively best candidate/cut point)
  ▪ interactively (subjective decision by expert)
▪ build inventory
  ▪ extract speech signal intervals of selected diphones
  ▪ produce index file with diphone start and end points in corpus (preferred)

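An index file leaves the recordings untouched and stores only where each selected diphone starts and ends. The sketch below assumes a simple CSV layout with times in seconds; the column names, recording ids, time values, and function names are all hypothetical, since the slides do not specify a file format.

```python
import csv
import io

# Hypothetical index: one row per selected diphone token.
INDEX_TEXT = """diphone,recording,start_s,end_s
k-i,rec_001,0.412,0.538
i-t,rec_001,0.538,0.655
"""


def load_index(text):
    """Parse the index into {diphone: (recording id, start, end)}."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        index[row["diphone"]] = (
            row["recording"],
            float(row["start_s"]),
            float(row["end_s"]),
        )
    return index


def cut(samples, sample_rate, start_s, end_s):
    """Return the samples between two time points (the 'cutting' step)."""
    return samples[round(start_s * sample_rate):round(end_s * sample_rate)]


index = load_index(INDEX_TEXT)
print(index["k-i"])
```

Keeping cut points in an index rather than in pre-cut audio files makes it cheap to revise a cut point later: only one row changes, not the stored signal.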
Concatenative synthesis: Summary
▪ Synthesis by re-sequencing and concatenating selected units of natural speech (typically: diphones)
  + units comprise dynamic phone-to-phone transitions
  + units cover local coarticulatory effects
  − longer-range coarticulation not covered
  − signal processing at least for smoothing concatenation
  − signal processing for prosodic modifications
  − compromise between coverage and inventory size
▪ Standard synthesis technique in the 1990s
  ▪ suboptimal naturalness
  ▪ stable, predictable quality

Essential content: diphone synthesis
▪ What is a diphone?
▪ What is the motivation for using the diphone as the basic synthesis unit rather than phones?
▪ Which procedures can be used to ensure that the concatenation between any two diphones is maximally smooth or, in other words, that the discontinuities caused by concatenation are minimized?