Speech synthesis Marc Schrder, DFKI schroed@dfki.de 28 January - PowerPoint PPT Presentation

Foundations of Language Science and Technology Speech synthesis Marc Schröder, DFKI schroed@dfki.de 28 January 2009

What is text-to-speech synthesis? “You have one message from Dr. Johnson.” TTS Marc Schröder, DFKI 2

Applications of TTS Texts readers for the blind in eyes-free environments (e.g., while driving) Telephone-based voice portals Multi-modal interactive systems talking heads “embodied conversational agents” (ECAs) Marc Schröder, DFKI 3

Telephone-based voice portals Example: Synthesising a phone number monotonous 0-6-8-1-3-0-2-5-3-0-3 unnatural (SMS-to-speech example) 0. 6. 8. 1. 3. 0. 2. 5. 3. 0. 3. optimal (Baumann & Trouvain, 2001) 0681 - 302 - 53 - 03 Marc Schröder, DFKI 4

A Talking Head “Hello, nice to meet you.” Information on timing TTS and mouth shapes Facial Animation Model, Computer Graphics Group, MPI Saarbrücken Marc Schröder, DFKI 5

An instrumented Poker game: “AI Poker” user is playing against two virtual characters user shuffles and deals (RFID) game events trigger emotions in characters emotion is expressed in synthetic voices Marc Schröder, DFKI 6

Structure of a TTS system Text or Speech synthesis markup Either plain text or SSML document text analysis natural language processing techniques phonetic transcription + prosodic parameters Intonation specification Pausing & speech timing audio generation signal processing techniques Wave or mp3 Marc Schröder, DFKI 7

Structure of a TTS system: MARY Input markup parser Shallow NLP Phonemisation Prosody Physical realisation

System structure: Input markup parser System-internal XML representation MaryXML => speech synthesis markup parsing is simple XML transformation Use XSLT => easily adaptable to new markup language Marc Schröder, DFKI 9

Speech Synthesis Markup: SSML Author (human or machine) provides additional information to the speech synthesis engine: Er hat sich in München <emphasis> verlaufen </emphasis> Im Jahr <say-as interpret-as="date" format="y">1999</say-as> wurden <say-as interpret-as="cardinal">1999</say-as> Aufträge zur Bestellnummer <say-as interpret-as="digits">1999</say-as> erteilt. <prosody pitch=”high” rate=”fast”> Das müssen wir ganz schnell in Ordnung bringen! </prosody> <prosody pitch=”low” rate=”slow”> Immer mit der Ruhe! <prosody> Marc Schröder, DFKI 10

System structure: Shallow NLP Marc Schröder, DFKI 11

Preprocessing / Text normalisation schroed@dfki.de Net patterns (email, web addresses) Date patterns 23.07.2001 12:24 h, 12:24 Uhr Time patterns 12:24 h, 12:24 Std. Duration patterns Currency patterns 12,95 € 123,09 km Measure patterns Telephone number patterns 0681/302-5303 3 3. III Number patterns (cardinal, ordinal, roman) engl. Abbreviations Special characters & Marc Schröder, DFKI 12

System structure: Phonemisation lexicon lookup letter-to-sound conversion morphological decomposition letter-to-sound rules syllabification word stress assignment Marc Schröder, DFKI 13

System structure: Prosody “Prosody” intonation (accented syllables; high or low phrase boundaries) rhythmic effects (pauses, syllable durations) loudness, voice quality assign prosody by rule, based on punctuation part-of-speech modelled using “Tones and Break Indices” (ToBI) tonal targets: accents, boundary tones phrase breaks Marc Schröder, DFKI 14

Prosody and meaning Example: contrast and accentuation No, I said it's a blue MOON (not a blue horse) No, I said it's a BLUE moon (not a yellow moon) Prosody can express contrast getting it wrong will make communication more difficult Marc Schröder, DFKI 15

System structure: Calculation of acoustic parameters timing: segment duration predicted by rules or by decision trees intonation: fundamental frequency curve predicted by rules or by decision trees Marc Schröder, DFKI 16

System structure: Waveform synthesis Marc Schröder, DFKI 17

Creating sound: Waveform synthesis technologies (1) Formant synthesis acoustic model of speech generate acoustic structure by rule robotic sound Marc Schröder, DFKI 18

Creating sound: Waveform synthesis technologies (2) Concatenative synthesis diphone synthesis glue pre-recorded “diphones” together adapt prosody through signal processing unit selection synthesis glue units from a large corpus of speech together prosody comes from the corpus, (nearly) no signal processing Marc Schröder, DFKI 19

Creating sound: Waveform synthesis technologies (3) Statistical-parametric speech synthesis with Hidden Markov Models models trained on speech corpora no data needed at runtime => small footprint Marc Schröder, DFKI 20

Examples of various speech synthesis systems unit selection systems: HMM-based systems: L&H RealSpeak MARY AT&T Natural Voices (others exist: HTS, USTC, Festival, ...) Loquendo ACTOR MARY diphone systems: Elan TTS MBROLA-based (MARY ) formant synthesis systems: SpeechWorks Infovox Marc Schröder, DFKI 21

Concatenative synthesis: Isolated phones don't work target: w I n t r= d eI w I eI d n a T t r= acoustic unit database (units = phone segments recorded in isolation) Marc Schröder, DFKI 22

Concatenative synthesis: Diphones target: w I n t r= d eI _-w w-I I-n n-t t-r= r=-d d-eI eI-_ _-w (wonder) t-r= (water) w-I (will) r=-d (nerdy) I-n (spin) d-eI (date) n-t (fountain) eI-_ (away) Diphones = sound segments acoustic unit database from the middle of one phone units = diphone segments recorded in carrier words to the middle of the next phone (flat intonation) Marc Schröder, DFKI 23

Concatenative synthesis: Diphones (2) target: w I n t r= d eI _-w w-I I-n n-t t-r= r=-d d-eI eI-_ PSOLA pitch manipulation Marc Schröder, DFKI 24

Concatenative synthesis Unit selection target: w I n t r= d eI “Which of these?” “Let's discuss the question of interchanges another day.” acoustic unit database units = (di-)phone segments recorded in natural sentences (natural intonation) Marc Schröder, DFKI 25

AI Poker: The voices of Sam and Max Sam: Max: Unit Selection Synthesis HMM-based synthesis Voice specifically Sound quality is limited recorded for AI Poker but constant with any Natural sound within text poker domain Marc Schröder, DFKI 26

Sam's voice: Unit selection syntheis “Ich habe zwei Paare.” + + + ... Unit selection corpus several hours of speech recordings => very good quality within the poker domain! Marc Schröder, DFKI 27

Sam's voice: Unit selection syntheis “Ich kann auch ganz andere Sachen...” + + + ... Unit selection corpus several hours of speech recordings reduced quality with arbitrary text Marc Schröder, DFKI 28

Max's voice: HMM-based synthesis “Ich habe zwei Paare.” Hidden Markov Models acoustic feature vectors statistical vocoder models Marc Schröder, DFKI 29

Max's voice: HMM-based synthesis “Ich kann auch ganz andere Sachen...” Hidden Markov Models acoustic feature vectors statistical vocoder models constant quality with arbitrary text Marc Schröder, DFKI 30

Speech synthesis Marc Schrder, DFKI schroed@dfki.de 28 January - PowerPoint PPT Presentation

Foundations of Language Science and Technology Speech synthesis Marc Schrder, DFKI schroed@dfki.de 28 January 2009 What is text-to-speech synthesis? You have one message from Dr. Johnson. TTS Marc Schrder, DFKI 2 Applications of

6-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept Speech Naturalness Phone

Speech Processing Speech Processing Using Speech with Computers Overview Overview Speech vs

Speech Processing 11-492/18-492 Speech Processing 11-492/18-492 Speech Synthesis Evaluation

Speech Processing 15-492/18-492 Speech Synthesis Overview Text processing Speech Synthesis

Speech Processing 15- -492/18 492/18- -492 492 Speech Processing 15 Speech Synthesis Prosody

Speech Processing 11-492/18-492 Speech Synthesis Overview Text processing Speech Synthesis

11-752: Speech Synthesis Objectives Understand basic processing in speech synthesis

Speech Processing 15-492/18-492 Speech Synthesis Evaluation Evaluating Speech Synthesis How

Speech Processing 15-492/18-492 Speech Synthesis Waveform generation 2 Speech Synthesis Text

Speech Processing 15-492/18-492 Speech Synthesis Pronunciation Letter to Sound rules Speech

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 25: Speech

SYNTHESIS OF SUPER SYNTHESIS OF SUPER NANOPOROUS SYNTHESIS OF SUPER SYNTHESIS OF

Text-to-Speech Synthesis Bernd Mbius Language Science and Technology Saarland University

Speech Processing 15-492/18-492 Speech Synthesis Talking heads Singing Synthesis More

EE E6820: Speech & Audio Processing & Recognition Lecture 5: Speech modeling and

Speech and Language CS 188: Artificial Intelligence Speech technologies Automatic

simon Open-Source Speech Recognition Developed by the non profit organization Simon Listens in

Focusing Language Models For Automatic Speech Recognition Daniele Falavigna, Roberto Gretter

EXEMPLAR-BASED SPEECH RECOGNITION IN A RESCORING APPROACH Georg Heigold, Google, USA Joint work

Missing data speech recognition in rever- berant acoustic conditions Kalle, Guy and Jon ONE SIX

SSML for Indian Languages Text to Speech Synthesis Presented by: Vibhu Agarwal President and co-

Speech Technology Using in Wechat FENG RAO Powered by WeChat Outline Introduce Algorithm of

CS 523: Multimedia Systems Angus Forbes creativecoding.evl.uic.edu/courses/cs523 Image-to-Image

Speech Synthesis, Reinforcement Learning Milan Straka May 13, 2019 Charles University in Prague

Sambuz

Useful Links

Newsletter

Mail Us

Speech synthesis Marc Schrder, DFKI schroed@dfki.de 28 January - PowerPoint PPT Presentation

Foundations of Language Science and Technology Speech synthesis Marc Schrder, DFKI schroed@dfki.de 28 January 2009 What is text-to-speech synthesis? You have one message from Dr. Johnson. TTS Marc Schrder, DFKI 2 Applications of

6-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept Speech Naturalness Phone

Speech Processing Speech Processing Using Speech with Computers Overview Overview Speech vs

Speech Processing 11-492/18-492 Speech Processing 11-492/18-492 Speech Synthesis Evaluation

Speech Processing 15-492/18-492 Speech Synthesis Overview Text processing Speech Synthesis

Speech Processing 15- -492/18 492/18- -492 492 Speech Processing 15 Speech Synthesis Prosody

Speech Processing 11-492/18-492 Speech Synthesis Overview Text processing Speech Synthesis

11-752: Speech Synthesis Objectives Understand basic processing in speech synthesis

Speech Processing 15-492/18-492 Speech Synthesis Evaluation Evaluating Speech Synthesis How

Speech Processing 15-492/18-492 Speech Synthesis Waveform generation 2 Speech Synthesis Text

Speech Processing 15-492/18-492 Speech Synthesis Pronunciation Letter to Sound rules Speech

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 25: Speech

SYNTHESIS OF SUPER SYNTHESIS OF SUPER NANOPOROUS SYNTHESIS OF SUPER SYNTHESIS OF

Text-to-Speech Synthesis Bernd Mbius Language Science and Technology Saarland University

Speech Processing 15-492/18-492 Speech Synthesis Talking heads Singing Synthesis More

EE E6820: Speech &amp; Audio Processing &amp; Recognition Lecture 5: Speech modeling and

Speech and Language CS 188: Artificial Intelligence Speech technologies Automatic

simon Open-Source Speech Recognition Developed by the non profit organization Simon Listens in

Focusing Language Models For Automatic Speech Recognition Daniele Falavigna, Roberto Gretter

EXEMPLAR-BASED SPEECH RECOGNITION IN A RESCORING APPROACH Georg Heigold, Google, USA Joint work

Missing data speech recognition in rever- berant acoustic conditions Kalle, Guy and Jon ONE SIX

SSML for Indian Languages Text to Speech Synthesis Presented by: Vibhu Agarwal President and co-

Speech Technology Using in Wechat FENG RAO Powered by WeChat Outline Introduce Algorithm of

CS 523: Multimedia Systems Angus Forbes creativecoding.evl.uic.edu/courses/cs523 Image-to-Image

Speech Synthesis, Reinforcement Learning Milan Straka May 13, 2019 Charles University in Prague

Sambuz

Useful Links

Newsletter

Mail Us

EE E6820: Speech & Audio Processing & Recognition Lecture 5: Speech modeling and