FLST: Prosodic Models for Speech Technology (Bernd Möbius)

FLST: Prosodic Models for Speech Technology Bernd Möbius moebius@coli.uni-saarland.de http://www.coli.uni-saarland.de/courses/FLST/2014/ FLST: Prosodic Models

Prosody: Duration and intonation
- Temporal and tonal structure in speech synthesis
  - all synthesis methods
    • use models to predict duration and F0
    • models are trained on observed duration and F0 data
  - Unit Selection:
    • phone duration and phone-level F0 used in target specification
    • F0 smoothness considered
  - HMM synthesis: duration modeled by probability of remaining in the same state
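The HMM bullet above can be illustrated with a short sketch. It assumes a single state with a fixed self-transition probability, in which case the number of frames spent in the state follows a geometric distribution. The function names, the value 0.9, and the 5 ms frame shift are illustrative assumptions, not taken from the slides.

```python
def state_duration_pmf(self_loop_prob: float, n_frames: int) -> float:
    """Probability of staying exactly n_frames in a state: a^(n-1) * (1 - a)."""
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self_loop_prob must be in [0, 1)")
    if n_frames < 1:
        raise ValueError("n_frames must be >= 1")
    return self_loop_prob ** (n_frames - 1) * (1.0 - self_loop_prob)


def expected_duration_frames(self_loop_prob: float) -> float:
    """Mean of the geometric duration distribution: 1 / (1 - a)."""
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self_loop_prob must be in [0, 1)")
    return 1.0 / (1.0 - self_loop_prob)


FRAME_SHIFT_MS = 5.0   # assumed frame shift, for illustration only
SELF_LOOP_PROB = 0.9   # assumed self-transition probability

expected_ms = expected_duration_frames(SELF_LOOP_PROB) * FRAME_SHIFT_MS
print(f"expected state duration: {expected_ms:.1f} ms")
```

A higher self-transition probability gives a longer expected duration, but the most likely duration under this distribution is always a single frame. That is one reason practical systems may model state durations explicitly instead.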

[Figure: Duration prediction]

Duration prediction
- Task of duration model in TTS:
  - predict duration of speech sound as precisely as possible, based on factors affecting duration
  - factors must be computable/inferrable from text
- Why is this task difficult?
  - extremely context-dependent durations, e.g. [ɛ] = 35 ms in jetzt, 252 ms in Herren
  - factors: accent status of word, syllabic stress, position in utterance, segmental context, …
  - factors define a huge feature space

Duration models
- Automatic construction of duration models
  - general-purpose statistical prediction systems
    • Classification and Regression Trees [Breiman et al. 1984; e.g. Riley 1992]
    • Multiple regression [e.g. Iwahashi and Sagisaka 1993]
    • Neural Nets [e.g. Campbell 1992]
  - statistically accurate for training data
  - but often insufficient performance on new data

[Figure: Data sparsity: word frequencies]

Data sparsity
- Why is this a problem?
  - data sparsity: feature space (>10k vectors) cannot be covered exhaustively by training data
  - LNRE distribution (large number of rare events): rare vectors must not be ignored, because there are so many of them that the probability of encountering at least one in any sentence is very high
  - vectors unseen in training data must be predicted by extrapolation and generalization
  - general-purpose prediction systems extrapolate poorly and are not robust w.r.t. missing data
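The LNRE argument can be checked with simple arithmetic. The sketch below assumes that segments in a sentence are independent draws and that a share `p_rare` of all tokens carries a rare feature vector. The toy counts, the threshold for "rare", and the sentence length of 40 segments are made-up illustrations, not data from the lecture.

```python
from collections import Counter


def rare_mass(counts: Counter, max_count: int = 1) -> float:
    """Share of all tokens whose feature vector occurs at most max_count times."""
    total = sum(counts.values())
    rare = sum(c for c in counts.values() if c <= max_count)
    return rare / total


def p_at_least_one_rare(p_rare: float, n_segments: int) -> float:
    """1 - (1 - p)^n, assuming segments are independent of each other."""
    return 1.0 - (1.0 - p_rare) ** n_segments


# Toy frequency table: two frequent vectors and ten vectors seen once each.
toy_counts = Counter({"frequent_a": 60, "frequent_b": 30})
toy_counts.update({f"rare_{i}": 1 for i in range(10)})

p_rare = rare_mass(toy_counts)               # each rare type is tiny on its own
p_hit = p_at_least_one_rare(p_rare, 40)      # assumed 40 segments per sentence
print(f"rare mass: {p_rare:.2f}, P(at least one rare vector): {p_hit:.3f}")
```

Even though every individual rare vector accounts for only 1% of the toy data, together they make it very likely that a sentence contains at least one of them.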

Sum-of-products model
- Current best practice: Sum-of-products model [van Santen 1993, 1998; Möbius and van Santen 1996]
  - exploits expert knowledge and well-behaved properties of speech (e.g. directional invariance, monotonicity)
  - uses well-behaved mathematical operations (addition, multiplication)
  - estimates parameters even for unbalanced frequency distributions of features in training data

Sum-of-products model
- Sum-of-products model: general form [van Santen 1993, 1998]

  $$\mathrm{DUR}(\mathbf{f}) = \sum_{i \in K} \prod_{j \in I_i} S_{i,j}(f_j)$$

  - $K$: set of indices of product terms
  - $I_i$: set of indices of factors occurring in the $i$-th product term
  - $S_{i,j}$: set of parameters, each corresponding to a level on the $j$-th factor
  - $f_j$: feature on the $j$-th factor (e.g., $f_1$ = Vowel_ID, $f_2$ = ±stress, ...)
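As a minimal sketch of how such a model is evaluated: the predicted duration is a sum over product terms, and each product term multiplies one parameter per participating factor, selected by the level that the feature vector has on that factor. The data layout, the factor names, and all parameter values below are invented for illustration; they are not van Santen's fitted model.

```python
from math import prod
from typing import List, Mapping, Tuple

# One product term: a list of (factor_name, {level: parameter}) pairs.
ProductTerm = List[Tuple[str, Mapping[str, float]]]


def sop_duration(terms: List[ProductTerm], features: Mapping[str, str]) -> float:
    """Sum over product terms of the product of the selected parameters."""
    return sum(
        prod(params[features[factor]] for factor, params in term)
        for term in terms
    )


# Hypothetical model with two product terms (values are made up, in ms):
#   term 1: a vowel-specific contribution
#   term 2: voicing of the next consonant times phrase position
TOY_TERMS: List[ProductTerm] = [
    [("vowel", {"a": 100.0, "i": 70.0})],
    [
        ("next_voiced", {"yes": 1.2, "no": 1.0}),
        ("position", {"medial": 10.0, "final": 50.0}),
    ],
]

features = {"vowel": "a", "next_voiced": "yes", "position": "final"}
print(f"predicted duration: {sop_duration(TOY_TERMS, features):.1f} ms")
```

Because each factor level has its own parameter, a prediction exists for any combination of levels that occur somewhere in the training data, even if that particular combination was never observed.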

Sum-of-products model
- Sum-of-products model: specific form [van Santen 1993, 1998]
  [Equation: specific form in terms of V, C, P]
  - V: vowel identity (15 levels)
  - C: consonant after V (2 levels: ±voiced)
  - P: position in phrase (2 levels: medial/final)
  - here: 21 parameters to estimate (2 + 2 + 2 + 15)

Sum-of-products model
- SoP model requires:
  - definition of factors affecting duration (literature, pilot)
  - segmented and annotated speech corpus
  - greedy algorithm to optimize coverage: select from a large text corpus a small subset with the same coverage
- SoP model yields:
  - complete picture of temporal characteristics of speaker
  - homogeneous, consistent results for set of factors
  - best performance: r = 0.9 for observed vs. predicted phone durations (Engl., Ger., Fr., Dutch, Chin., Jap., …)
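The greedy coverage step can be sketched as a standard greedy set-cover loop: repeatedly pick the sentence that covers the most feature vectors not yet covered. This is a generic formulation under the assumption that each sentence has already been mapped to the set of feature vectors it contains; the sentence IDs and vector labels are hypothetical, and the slides do not specify the exact selection criterion used.

```python
from typing import Dict, List, Set


def greedy_cover(sentences: Dict[str, Set[str]]) -> List[str]:
    """Select sentences until every feature vector in the corpus is covered."""
    uncovered: Set[str] = set().union(*sentences.values())
    chosen: List[str] = []
    while uncovered:
        # Sentence contributing the most not-yet-covered vectors
        # (ties go to the sentence that comes first in the dict).
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        chosen.append(best)
        uncovered -= sentences[best]
    return chosen


# Toy corpus: sentence ID -> set of feature vectors it contains.
corpus = {
    "s1": {"A", "B", "C"},
    "s2": {"A", "B"},
    "s3": {"C", "D"},
    "s4": {"D"},
}
print(greedy_cover(corpus))  # ['s1', 's3']
```

The greedy choice keeps the full coverage of the original corpus but does not guarantee the globally smallest subset; it is a common approximation because finding the true minimum is computationally hard.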

[Figure: SoP model: phonetic tree]

[Figure: Intonation (F0)]

Intonation prediction
- Task of intonation model in TTS
  - compute a continuous acoustic parameter (F0) from a symbolic representation of intonation inferred from text
- Intonation models commonly applied in TTS systems:
  - phonological tone-sequence models (Pierrehumbert)
  - acoustic-phonetic superposition models (Fujisaki)
  - acoustic stylization models (Tilt, PaIntE, INTSINT)
  - perception-based models (IPO)
  - function-oriented models (KIM)

Tone sequence model
- Autosegmental-metrical theory of intonation [Pierrehumbert 1980]
  - intonation is represented by a sequence of high (H) and low (L) tones
  - H and L are members of a primary phonological contrast
  - hierarchy of intonational domains
    • IP – Intonation Phrase; boundary tones: H%, L%
    • ip – intermediate phrase; phrase tones: H-, L-
    • pw – prosodic word; pitch accents: H*, H*L, L*H, …

Pierrehumbert's model
- Finite-state grammar of well-formed tone sequences (pw, ip, IP)
- Example [adapted from Pierrehumbert 1980, p. 276]
    That's a remarkably clever suggestion.
    %H H* H*L L- L%
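A finite-state grammar of this kind can be sketched as a small acceptor. The version below is a deliberately simplified assumption: one optional initial boundary tone, one or more pitch accents, one phrase tone, and one final boundary tone, with a single intermediate phrase per intonation phrase. The tone inventories are limited to the symbols that appear in this lecture (plus `%L` as an assumed counterpart of `%H`), so this is not the full grammar of Pierrehumbert (1980).

```python
from typing import Iterable

# Simplified inventories (assumptions; the original inventory is larger).
INITIAL_BOUNDARY = {"%H", "%L"}
PITCH_ACCENTS = {"H*", "H*L", "L*H"}
PHRASE_TONES = {"H-", "L-"}
FINAL_BOUNDARY = {"H%", "L%"}


def is_well_formed(tones: Iterable[str]) -> bool:
    """Accept: [initial boundary] pitch_accent+ phrase_tone final_boundary."""
    state = "start"
    for tone in tones:
        if state == "start" and tone in INITIAL_BOUNDARY:
            state = "after_initial"
        elif state in {"start", "after_initial", "accent"} and tone in PITCH_ACCENTS:
            state = "accent"
        elif state == "accent" and tone in PHRASE_TONES:
            state = "phrase"
        elif state == "phrase" and tone in FINAL_BOUNDARY:
            state = "final"
        else:
            return False  # no transition defined for this tone in this state
    return state == "final"


print(is_well_formed("%H H* H*L L- L%".split()))  # True
print(is_well_formed("H* L%".split()))            # False: phrase tone missing
```

The tone sequence from the example above is accepted, while a sequence that skips the phrase tone is rejected.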

[Figure: Pierrehumbert's model: finite-state graph (pw, ip, IP)]

ToBI: Tones and Break Indices
- Formalization of intonation model as transcription system [Pitrelli et al. 1992]
  - phonemic (= broad phonetic) transcription
  - originally designed for American English
  - limited applicability to other varieties/languages
    • language-specific inventory of phonological units
    • language-specific details of F0 contours
  - adapted to many languages (e.g. GToBI, JToBI, KToBI)
  - implemented in many TTS systems
    • abstract tonal representation converted to F0 contours by means of phonetic realization rules

[Figure: Fujisaki's model [Fujisaki 1983, 1988; Möbius 1993]]

Fujisaki's model
- Properties:
  - superpositional
  - physiological basis and interpretation of components and control parameters
  - linguistic interpretation of components
  - applied to many (typologically diverse) languages
- Origins:
  - Öhman and Lindqvist (1966), Öhman (1967)
  - Fujisaki et al. (1979), Fujisaki (1983, 1988), …
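The superpositional idea can be sketched with the commonly cited formulation of the model: log F0 is the sum of a constant base value, phrase components (impulse responses of a second-order system), and accent components (differences of step responses with a ceiling). The time constants `alpha`, `beta`, `gamma` and all command values in the example are illustrative assumptions, not parameters from the lecture.

```python
import math
from typing import Sequence, Tuple


def phrase_response(t: float, alpha: float = 2.0) -> float:
    """Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t: float, beta: float = 20.0, gamma: float = 0.9) -> float:
    """Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0, else 0."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(
    t: float,
    base_hz: float,
    phrases: Sequence[Tuple[float, float]],        # (amplitude, onset time)
    accents: Sequence[Tuple[float, float, float]],  # (amplitude, onset, offset)
) -> float:
    """F0 in Hz at time t (seconds): components are added in the log domain."""
    log_f0 = math.log(base_hz)
    log_f0 += sum(ap * phrase_response(t - t0) for ap, t0 in phrases)
    log_f0 += sum(
        aa * (accent_response(t - t1) - accent_response(t - t2))
        for aa, t1, t2 in accents
    )
    return math.exp(log_f0)


# Toy utterance: one phrase command at 0 s, one accent command from 0.2 to 0.5 s.
phrase_cmds = [(0.5, 0.0)]
accent_cmds = [(0.4, 0.2, 0.5)]
for time_s in (0.0, 0.3, 1.5):
    f0 = fujisaki_f0(time_s, 100.0, phrase_cmds, accent_cmds)
    print(f"t = {time_s:.1f} s  F0 = {f0:.1f} Hz")
```

Adding the components in the log domain means that each command scales F0 multiplicatively, so the same accent command produces the same relative excursion regardless of where the phrase component currently is.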

[Figure: Fujisaki's model: Components [Möbius 1993]]

[Figure: Fujisaki's model: Example [Möbius 1993]. Approximation of natural F0 by optimal parameter values within linguistic constraints (accents, phrase structure)]

Comparison of models
- Tone sequence (TS) or superposition (SP)?
  - intonation
    • TS: consists of linear sequence of tonal elements
    • SP: overlay of components of longer/shorter domain
  - F0 contour
    • TS: generated from sequences of phonological tones
    • SP: complex patterns from superimposed components
  - interaction
    • TS: tones locally determined, non-interactive
    • SP: simultaneous, highly interactive components

F0 as a complex phenomenon
- Main problem for intonation models: linguistic, paralinguistic, extralinguistic factors – all conveyed by F0
  - lexical tones
  - syllabic stress, word accent
  - stress groups, accent groups
  - prosodic phrasing
  - sentence mode
  - discourse intonation
  - pitch range, register
  - phonation type, voice quality
  - microprosody: intrinsic and coarticulatory F0

More on prosody in speech technology: ASR (Wed Jan 28)
Thanks!
