Speech synthesis Marc Schröder, DFKI schroed@dfki.de 20 January 2010

Foundations of Language Science and Technology Speech synthesis Marc Schröder, DFKI schroed@dfki.de 20 January 2010

What is text-to-speech synthesis? “You have one message from Dr. Johnson.” TTS Marc Schröder, DFKI 2

Applications of TTS. Text readers: for the blind; in eyes-free environments (e.g., while driving). Telephone-based voice portals. Multi-modal interactive systems: talking heads; “embodied conversational agents” (ECAs). Marc Schröder, DFKI 3

Telephone-based voice portals. Example: synthesising a phone number. Monotonous: 0-6-8-1-3-0-2-5-3-0-3. Unnatural (SMS-to-speech example): 0. 6. 8. 1. 3. 0. 2. 5. 3. 0. 3. Optimal (Baumann & Trouvain, 2001): 0681 - 302 - 53 - 03. Marc Schröder, DFKI 4
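
The grouped rendering of the phone number can be sketched as a simple chunking step. This is only an illustration: the function name `chunk_digits` is hypothetical, and the group sizes (4, 3, 2, 2) are read off the slide's "optimal" example rather than derived from any rule in Baumann & Trouvain (2001).

```python
def chunk_digits(number: str, group_sizes: list[int]) -> list[str]:
    """Split a digit string into consecutive groups of the given sizes."""
    digits = [c for c in number if c.isdigit()]  # ignore separators such as "/" or "-"
    if sum(group_sizes) != len(digits):
        raise ValueError("group sizes must cover all digits")
    groups, pos = [], 0
    for size in group_sizes:
        groups.append("".join(digits[pos:pos + size]))
        pos += size
    return groups


# Group sizes are an assumption taken from the slide's example.
print(" - ".join(chunk_digits("0681/302-5303", [4, 3, 2, 2])))
```

A synthesiser would then place a short pause or phrase boundary between groups instead of after every digit.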

A Talking Head. “Hello, nice to meet you.” → TTS → information on timing and mouth shapes → Facial Animation Model (Computer Graphics Group, MPI Saarbrücken). Marc Schröder, DFKI 5

An instrumented Poker game: “AI Poker”. User is playing against two virtual characters; user shuffles and deals (RFID); game events trigger emotions in characters; emotion is expressed in synthetic voices. Marc Schröder, DFKI 6

Structure of a TTS system. Input: either plain text (TEXT) or a speech synthesis markup document (SSML). Text analysis (natural language processing techniques) → phonetic transcription + prosodic parameters (ACOUSTPARAMS): intonation specification, pausing & speech timing. Audio generation (signal processing techniques) → AUDIO (wave or mp3). Marc Schröder, DFKI 7

Structure of a TTS system: MARY TTS. Text analysis: Input markup parser (TEXT or SSML → RAWMARYXML); Shallow NLP (RAWMARYXML → PARTSOFSPEECH); Phonemiser (PARTSOFSPEECH → ALLOPHONES); Symbolic prosody (ALLOPHONES → INTONATION); Acoustic parameters (INTONATION → ACOUSTPARAMS). Audio generation: waveform synthesis (ACOUSTPARAMS → AUDIO). Marc Schröder, DFKI 8
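
The chain of data types on this slide can be sketched as a small table of stages, each with an input and an output type. This is a toy model, not MARY's actual code: the `STAGES` table and the `plan` function are hypothetical names, and SSML input is left out for brevity (only TEXT is modelled as the entry point).

```python
# (stage name, input type, output type), in the order given on the slide.
STAGES = [
    ("Input markup parser", "TEXT", "RAWMARYXML"),
    ("Shallow NLP", "RAWMARYXML", "PARTSOFSPEECH"),
    ("Phonemiser", "PARTSOFSPEECH", "ALLOPHONES"),
    ("Symbolic prosody", "ALLOPHONES", "INTONATION"),
    ("Acoustic parameters", "INTONATION", "ACOUSTPARAMS"),
    ("Waveform synthesis", "ACOUSTPARAMS", "AUDIO"),
]


def plan(source: str, target: str) -> list[str]:
    """Return the names of the stages needed to get from source to target."""
    names, current = [], source
    for name, in_type, out_type in STAGES:
        if current == target:
            break
        if in_type == current:
            names.append(name)
            current = out_type
    if current != target:
        raise ValueError(f"no path from {source} to {target}")
    return names


print(plan("TEXT", "AUDIO"))
print(plan("PARTSOFSPEECH", "INTONATION"))
```

Because every stage consumes exactly the type the previous one produces, processing can start or stop at any intermediate type, which is convenient for inspecting partial results.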

System structure: Input markup parser (TEXT or SSML → RAWMARYXML). System-internal XML representation: MaryXML => speech synthesis markup parsing is a simple XML transformation. Use XSLT => easily adaptable to a new markup language. Marc Schröder, DFKI 9

Speech Synthesis Markup: SSML. Author (human or machine) provides additional information to the speech synthesis engine: Er hat sich in München <emphasis> verlaufen </emphasis> Im Jahr <say-as interpret-as="date" format="y">1999</say-as> wurden <say-as interpret-as="cardinal">1999</say-as> Aufträge zur Bestellnummer <say-as interpret-as="digits">1999</say-as> erteilt. <prosody pitch="high" rate="fast"> Das müssen wir ganz schnell in Ordnung bringen! </prosody> <prosody pitch="low" rate="slow"> Immer mit der Ruhe! </prosody> Marc Schröder, DFKI 10
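
As a minimal sketch of how an engine might read these hints, the fragment below parses the `say-as` example with Python's standard `xml.etree.ElementTree`. The wrapping `<speak>` root element is added here so that the fragment is well-formed, and namespaces are omitted; this is not how MARY itself processes SSML (the slides describe an XSLT transformation).

```python
import xml.etree.ElementTree as ET

# The say-as sentence from the slide, wrapped in a <speak> root element.
SSML = (
    "<speak>Im Jahr "
    '<say-as interpret-as="date" format="y">1999</say-as> wurden '
    '<say-as interpret-as="cardinal">1999</say-as> Aufträge zur Bestellnummer '
    '<say-as interpret-as="digits">1999</say-as> erteilt.</speak>'
)

root = ET.fromstring(SSML)

# Collect (interpretation hint, raw text) pairs in document order.
hints = [(el.get("interpret-as"), el.text) for el in root.iter("say-as")]
print(hints)
```

The same string "1999" carries three different hints, so text normalisation can expand it as a year, a cardinal number, or a digit sequence.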

System structure: Shallow NLP. Tokeniser (RAWMARYXML → TOKENS): sentence boundaries, “tokens” = word-like units. Text normalisation (TOKENS → WORDS): expanded, pronounceable forms (see next slide). Part-of-speech tagger (WORDS → PARTSOFSPEECH). Marc Schröder, DFKI 11

Preprocessing / Text normalisation. Net patterns (email, web addresses): schroed@dfki.de. Date patterns: 23.07.2001. Time patterns: 12:24 h, 12:24 Uhr. Duration patterns: 12:24 h, 12:24 Std. Currency patterns: 12,95 €. Measure patterns: 123,09 km. Telephone number patterns: 0681/302-5303. Number patterns (cardinal, ordinal, roman): 3, 3., III. Abbreviations & Special characters: engl. Marc Schröder, DFKI 12
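
A rough sketch of pattern-based token classification, using regular expressions for a few of the categories listed above. The expressions, the label names, and the `classify` function are illustrative assumptions, not MARY's rules; a real normaliser also needs context, since the slide shows that "12:24 h" can be either a time or a duration, and roman numerals are not handled here.

```python
import re

# Ordered list: the first matching pattern wins.
PATTERNS = [
    ("net", re.compile(r"^[\w.-]+@[\w.-]+\.\w+$")),
    ("date", re.compile(r"^\d{1,2}\.\d{1,2}\.\d{4}$")),
    ("time", re.compile(r"^\d{1,2}:\d{2}$")),
    ("currency", re.compile(r"^\d+,\d{2} ?€$")),
    ("telephone", re.compile(r"^\d+/\d+(-\d+)*$")),
    ("ordinal", re.compile(r"^\d+\.$")),
    ("cardinal", re.compile(r"^\d+$")),
]


def classify(token: str) -> str:
    """Return the label of the first pattern matching the token, else 'word'."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "word"


for example in ["schroed@dfki.de", "23.07.2001", "12:24", "12,95 €",
                "0681/302-5303", "3.", "3", "III"]:
    print(example, "->", classify(example))
```

Once a token is classified, a category-specific expansion step would turn it into the pronounceable word sequence.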

System structure: Phonemisation. Phonemiser (PARTSOFSPEECH → PHONEMES): lexicon lookup; letter-to-sound conversion (morphological decomposition, letter-to-sound rules, syllabification, word stress assignment). Custom pronunciation (PHONEMES → ALLOPHONES): slurring, non-standard pronunciation; potentially trainable from annotated data of a given person. Marc Schröder, DFKI 13
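
The lookup-then-fallback order can be sketched as follows. Everything here is a toy: the two lexicon entries reuse the phone symbols of the target string shown later in the slides (the spellings "winter" and "day" are my assumption), and the one-letter-per-phone rule table is a hypothetical stand-in for real letter-to-sound rules, which would also involve morphological decomposition, syllabification, and stress assignment.

```python
# Toy pronunciation lexicon (assumed entries, phone symbols as in the slides).
LEXICON = {"winter": "w I n t r=", "day": "d eI"}

# Toy letter-to-sound table: one phone per letter, far simpler than real rules.
LETTER_TO_SOUND = {"a": "a", "d": "d", "n": "n", "t": "t", "w": "w"}


def phonemise(word: str) -> tuple[str, str]:
    """Return (transcription, method), trying the lexicon before the rules."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    # Unknown word: fall back to letter-to-sound conversion; "?" marks gaps.
    phones = [LETTER_TO_SOUND.get(ch, "?") for ch in key]
    return " ".join(phones), "rules"


print(phonemise("Winter"))
print(phonemise("want"))
```

Keeping the method alongside the transcription makes it easy to log which words were not covered by the lexicon.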

System structure: Prosody. “Prosody”? Intonation (accented syllables; high or low phrase boundaries); rhythmic effects (pauses, syllable durations); loudness, voice quality. Symbolic prosody prediction (ALLOPHONES → INTONATION): assign prosody by rule, based on punctuation and part-of-speech; modelled using “Tones and Break Indices” (ToBI): tonal targets (accents, boundary tones) and phrase breaks. Marc Schröder, DFKI 14
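
A very small sketch of rule-based boundary tone assignment from punctuation. The mapping below uses common ToBI-style boundary labels (L-L% for a final fall, H-H% for a question rise, L-H% for a continuation rise), but the specific rule table and the function `boundary_tones` are assumptions for illustration, not the rules used in MARY; part-of-speech based accent placement is not modelled.

```python
# Assumed punctuation-to-boundary-tone rules (illustrative only).
BOUNDARY_BY_PUNCTUATION = {".": "L-L%", "!": "L-L%", "?": "H-H%", ",": "L-H%"}


def boundary_tones(tokens: list[str]) -> list[tuple[str, str]]:
    """Attach a boundary tone to each word directly followed by punctuation."""
    result = []
    for i, tok in enumerate(tokens):
        if tok in BOUNDARY_BY_PUNCTUATION and i > 0:
            result.append((tokens[i - 1], BOUNDARY_BY_PUNCTUATION[tok]))
    return result


tokens = ["No", ",", "I", "said", "it's", "a", "blue", "moon", "."]
print(boundary_tones(tokens))
```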

Prosody and meaning. Example: contrast and accentuation. “No, I said it's a blue MOON” (not a blue horse). “No, I said it's a BLUE moon” (not a yellow moon). Prosody can express contrast; getting it wrong will make communication more difficult. Marc Schröder, DFKI 15

System structure: Calculation of acoustic parameters. Duration prediction (INTONATION → DURATIONS): segment duration predicted by rules or by decision trees. Contour generation (DURATIONS → ACOUSTPARAMS): fundamental frequency curve predicted by rules or by decision trees. Marc Schröder, DFKI 16

System structure: Waveform synthesis Waveform synthesis ACOUSTPARAMS → AUDIO several waveform generation technologies Marc Schröder, DFKI 17

Creating sound: Waveform synthesis technologies (1). Formant synthesis: acoustic model of speech; generate acoustic structure by rule; robotic sound. Marc Schröder, DFKI 18

Creating sound: Waveform synthesis technologies (2). Concatenative synthesis. Diphone synthesis: glue pre-recorded “diphones” together; adapt prosody through signal processing. Unit selection synthesis: glue units from a large corpus of speech together; prosody comes from the corpus, (nearly) no signal processing. Marc Schröder, DFKI 19

Creating sound: Waveform synthesis technologies (3). Statistical-parametric speech synthesis with Hidden Markov Models: models trained on speech corpora; no data needed at runtime => small footprint. Marc Schröder, DFKI 20

Examples of various speech synthesis systems. Unit selection systems: L&H RealSpeak, AT&T Natural Voices, Loquendo ACTOR, MARY. HMM-based systems: MARY (others exist: HTS, USTC, Festival, ...). Diphone systems: Elan TTS, MBROLA-based (MARY). Formant synthesis systems: SpeechWorks, Infovox. Marc Schröder, DFKI 21

Concatenative synthesis: Isolated phones don't work. Target: w I n t r= d eI. Acoustic unit database (units = phone segments recorded in isolation): w, I, eI, d, n, a, T, t, r=. Marc Schröder, DFKI 22

Concatenative synthesis: Diphones. Target: w I n t r= d eI → _-w w-I I-n n-t t-r= r=-d d-eI eI-_. Diphones = sound segments from the middle of one phone to the middle of the next phone. Acoustic unit database (units = diphone segments recorded in carrier words, flat intonation): _-w (wonder), w-I (will), I-n (spin), n-t (fountain), t-r= (water), r=-d (nerdy), d-eI (date), eI-_ (away). Marc Schröder, DFKI 23
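
The mapping from the target phone string to the diphone sequence on this slide can be sketched in a few lines. The function name `to_diphones` is hypothetical; "_" stands for silence, as in the slide's notation.

```python
def to_diphones(phones: list[str], silence: str = "_") -> list[str]:
    """Pair each phone with its successor, padding with silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(to_diphones("w I n t r= d eI".split()))
```

A phone string of n phones always yields n + 1 diphones, each of which must then be found in the acoustic unit database.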

Concatenative synthesis: Diphones (2) target: w I n t r= d eI _-w w-I I-n n-t t-r= r=-d d-eI eI-_ PSOLA pitch manipulation Marc Schröder, DFKI 24

Concatenative synthesis: Unit selection. Target: w I n t r= d eI. Source sentences: “Which of these?” “Let's discuss the question of interchanges another day.” Acoustic unit database: units = (di-)phone segments recorded in natural sentences (natural intonation). Marc Schröder, DFKI 25

AI Poker: The voices of Sam and Max. Sam: Unit Selection Synthesis; voice specifically recorded for AI Poker; natural sound within poker domain. Max: HMM-based synthesis; sound quality is limited but constant with any text. Marc Schröder, DFKI 26

Sam's voice: Unit selection synthesis. “Ich habe zwei Paare.” (“I have two pairs.”) + + + ... Unit selection corpus: several hours of speech recordings => very good quality within the poker domain! Marc Schröder, DFKI 27

Sam's voice: Unit selection synthesis. “Ich kann auch ganz andere Sachen...” (“I can also do quite different things...”) + + + ... Unit selection corpus: several hours of speech recordings; reduced quality with arbitrary text. Marc Schröder, DFKI 28

Max's voice: HMM-based synthesis. “Ich habe zwei Paare.” (“I have two pairs.”) Hidden Markov Models → acoustic feature vectors → vocoder; statistical models. Marc Schröder, DFKI 29

Max's voice: HMM-based synthesis. “Ich kann auch ganz andere Sachen...” (“I can also do quite different things...”) Hidden Markov Models → acoustic feature vectors → vocoder; statistical models; constant quality with arbitrary text. Marc Schröder, DFKI 30

MARY TTS: New language support workflow. [Workflow diagram] Wikipedia text import: Wikipedia XML dump → Dump splitter → Markup cleaner → clean text. Feature maker (with allophones.xml) → sentences w/ diphone+prosody features. Transcription GUI: most frequent words in the language → pronunciation lexicon, letter-to-sound for unknown words, list of function words. Script selection optimising coverage → selected sentences → manual check, exclude unsuitable sentences → script. Redstart: record speech db → audio files. Voice Import Tools → speaker-specific pronunciation, acoustic models for F0 + duration, unit selection voice files, HMM-based voice files. Basic NLP components enable conversion TEXT->ALLOPHONES in new locale: rudimentary POS tagger, Phonemiser, and generic implementations with basic functionality (Tokeniser, Symbolic prosody). Synthesis components enable conversion ALLOPHONES->Audio in new voice.

Hands-on TTS: MARY TTS 4.0 Get it from http://mary.dfki.de either download onto your machine (~32 MB min download) or use online demo Marc Schröder, DFKI 32
