
Text-to-Speech synthesis using OpenMARY: An introduction and practical tutorial
Marc Schröder, DFKI
marc.schroeder@dfki.de
eNTERFACE Amsterdam, 14 July 2010

Overview
Some Text-to-Speech (TTS) basics
  Natural Language Processing
  Generating the sound
    diphone synthesis
    unit selection synthesis
    HMM-based synthesis
OpenMARY
  existing system MARY 4.0
  toolkit for adding new languages and voices
Tutorial overview
  what you will learn to do in the tutorial

What is text-to-speech synthesis?
Text: “You have one message from Dr Johnson.” → TTS → spoken audio

Applications of TTS
Text readers
  for the blind
  in eyes-free environments (e.g., while driving)
Telephone-based voice portals
Multi-modal interactive systems
  talking heads
  “embodied conversational agents” (ECAs)

A Talking Head
“Hello, nice to meet you.” → TTS → audio, plus information on timing and mouth shapes

Structure of a TTS system
TEXT: either plain text or a speech synthesis markup (SSML) document
  text analysis: natural language processing techniques
ACOUSTPARAMS: phonetic transcription + prosodic parameters
  intonation specification
  pausing & speech timing
  audio generation: signal processing techniques
AUDIO: wave file

Structure of a TTS system: MARY TTS
Text analysis
  Input markup parser: TEXT or SSML → RAWMARYXML
  Shallow NLP: RAWMARYXML → PARTSOFSPEECH
  Phonemiser: PARTSOFSPEECH → ALLOPHONES
  Symbolic prosody: ALLOPHONES → INTONATION
  Acoustic parameters: INTONATION → ACOUSTPARAMS
Audio generation
  Waveform synthesis: ACOUSTPARAMS → AUDIO
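The module chain on this slide can be read as a lookup from one data type to the next. The sketch below is only an illustration of that reading: the table layout and the function name `modules_between` are assumptions, not MARY's actual API, and the SSML entry point is left out for brevity.

```python
# Data-type chain from the slide: each module converts one MARY data type
# into the next. The list layout and function name are illustrative only;
# the input markup parser also accepts SSML, which is omitted here.
MODULES = [
    ("Input markup parser", "TEXT", "RAWMARYXML"),
    ("Shallow NLP", "RAWMARYXML", "PARTSOFSPEECH"),
    ("Phonemiser", "PARTSOFSPEECH", "ALLOPHONES"),
    ("Symbolic prosody", "ALLOPHONES", "INTONATION"),
    ("Acoustic parameters", "INTONATION", "ACOUSTPARAMS"),
    ("Waveform synthesis", "ACOUSTPARAMS", "AUDIO"),
]


def modules_between(input_type, output_type):
    """Return the module names needed to get from input_type to output_type."""
    by_input = {src: (name, dst) for name, src, dst in MODULES}
    path, current = [], input_type
    while current != output_type:
        if current not in by_input:
            raise ValueError(f"no module accepts {current}")
        name, current = by_input[current]
        path.append(name)
    return path


print(modules_between("ALLOPHONES", "ACOUSTPARAMS"))
```

Asking for a partial conversion (for example TEXT to ALLOPHONES) returns only the front part of the chain, which mirrors how intermediate data types can be requested as output.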

System structure: Input markup parser
TEXT or SSML → RAWMARYXML
System-internal XML representation: MaryXML
  => speech synthesis markup parsing is a simple XML transformation
Use XSLT
  => easily adaptable to new markup languages

System structure: Shallow NLP
Tokeniser: RAWMARYXML → TOKENS
  sentence boundaries, “tokens” = word-like units
Text normalisation: TOKENS → WORDS
  expanded, pronounceable forms (see next slide)
Part-of-speech tagger: WORDS → PARTSOFSPEECH

Preprocessing / Text normalisation
Net patterns (email, web addresses): info@dfki.de
Date patterns: 23/07/2001
Time patterns: 12:24 h, 12:24
Duration patterns: 12:24 h, 12 h 24 min
Currency patterns: 12.95 €
Measure patterns: 123.09 km
Telephone number patterns: +49-681-85775-5303
Number patterns (cardinal, ordinal, roman): 3, 3rd, III.
Abbreviations & special characters: engl.
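A first step in text normalisation is deciding which pattern a token belongs to. The sketch below classifies the slide's examples with regular expressions; the patterns, the labels, and the function name `classify` are all hypothetical simplifications, not MARY's rules, and expanding each match into words is left out.

```python
import re

# Hypothetical, deliberately simplified patterns. A real normaliser needs
# language-specific rules and would also expand each match into words.
# Order matters: the first matching pattern wins.
PATTERNS = [
    ("net", re.compile(r"^[\w.-]+@[\w-]+(\.[\w-]+)+$|^(https?://|www\.)\S+$")),
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")),
    ("duration", re.compile(r"^\d{1,2} h \d{1,2} min$")),
    ("time", re.compile(r"^\d{1,2}:\d{2}( h)?$")),
    ("currency", re.compile(r"^\d+([.,]\d{2})? ?[€$£]$")),
    ("measure", re.compile(r"^\d+([.,]\d+)? ?(km|m|cm|kg|g)$")),
    ("ordinal", re.compile(r"^\d+(st|nd|rd|th)$")),
    ("roman", re.compile(r"^[IVXLCDM]+\.?$")),
    ("cardinal", re.compile(r"^\d+$")),
]


def classify(token):
    """Return the label of the first pattern that matches, else 'word'."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "word"


for example in ["info@dfki.de", "23/07/2001", "12 h 24 min", "12.95 €", "3rd"]:
    print(example, "->", classify(example))
```

The slide lists "12:24 h" under both time and duration, so surface patterns alone cannot settle every case; this sketch simply returns the first match. It would also label the English pronoun "I" as a roman numeral, which is the kind of ambiguity a real system resolves from context.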

System structure: Phonemisation
Phonemiser: PARTSOFSPEECH → PHONEMES
  lexicon lookup
  letter-to-sound conversion
    morphological decomposition
    letter-to-sound rules
    syllabification
    word stress assignment
Custom pronunciation: PHONEMES → ALLOPHONES
  slurring, non-standard pronunciation
  potentially trainable from annotated data of a given person
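The lookup-then-fallback order of the phonemiser can be sketched in a few lines. Everything below is a toy: the lexicon entries, the one-letter-one-phone table, and the function name `phonemise` are invented for illustration, and the fallback skips morphological decomposition, syllabification and stress assignment entirely.

```python
# Toy lexicon and letter-to-sound table, invented for illustration with
# SAMPA-like symbols as used on the slides. They are not MARY's data.
LEXICON = {
    "winter": ["w", "I", "n", "t", "r="],
    "day": ["d", "eI"],
}
LETTER_TO_SOUND = {"a": "{", "d": "d", "e": "E", "i": "I", "n": "n", "t": "t", "w": "w"}


def phonemise(word):
    """Return (phones, source): lexicon lookup first, letter rules as fallback."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    # Crude fallback: map each known letter to one phone, drop the rest.
    phones = [LETTER_TO_SOUND[ch] for ch in key if ch in LETTER_TO_SOUND]
    return phones, "letter-to-sound"


print(phonemise("Winter"))
print(phonemise("tin"))
```

Keeping the source label next to the phones makes it easy to collect the words that missed the lexicon, which is the list a transcriber would want to check by hand.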

System structure: Prosody
“Prosody”?
  intonation (accented syllables; high or low phrase boundaries)
  rhythmic effects (pauses, syllable durations)
  loudness, voice quality
Symbolic prosody prediction: ALLOPHONES → INTONATION
  assign prosody by rule, based on
    punctuation
    part-of-speech
  modelled using “Tones and Break Indices” (ToBI)
    tonal targets: accents, boundary tones
    phrase breaks

System structure: Calculation of acoustic parameters
Duration prediction: INTONATION → DURATIONS
  segment duration predicted by rules or by decision trees
Contour generation: DURATIONS → ACOUSTPARAMS
  fundamental frequency curve predicted by rules or by decision trees

System structure: Waveform synthesis
Waveform synthesis: ACOUSTPARAMS → AUDIO
  several waveform generation technologies

Creating sound: Waveform synthesis technologies (1)
Formant synthesis
  acoustic model of speech
  generate acoustic structure by rule
  robotic sound

Creating sound: Waveform synthesis technologies (2)
Concatenative synthesis
  diphone synthesis
    glue pre-recorded “diphones” together
    adapt prosody through signal processing
  unit selection synthesis
    glue units from a large corpus of speech together
    prosody comes from the corpus, (nearly) no signal processing

Creating sound: Waveform synthesis technologies (3)
Statistical-parametric speech synthesis with Hidden Markov Models
  models trained on speech corpora
  no data needed at runtime => small footprint

Examples of speech synthesis technologies
MARY TTS
  unit selection
  HMM-based
  MBROLA diphones
  expressive unit selection
Commercial
  unit selection: IVONA, Loquendo
  formant synthesis: DecTalk

Concatenative synthesis: Isolated phones don't work
target: w I n t r= d eI
acoustic unit database (units = phone segments recorded in isolation): w, I, eI, d, n, a, T, t, r=

Concatenative synthesis: Diphones
target: w I n t r= d eI
diphone sequence: _-w w-I I-n n-t t-r= r=-d d-eI eI-_
Diphones = sound segments from the middle of one phone to the middle of the next phone
acoustic unit database: units = diphone segments recorded in carrier words (flat intonation)
  _-w (wonder)
  w-I (will)
  I-n (spin)
  n-t (fountain)
  t-r= (water)
  r=-d (nerdy)
  d-eI (date)
  eI-_ (away)
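Turning the target phone string into the diphone sequence is a matter of padding it with silence on both sides and pairing neighbours. A minimal sketch, assuming "_" stands for silence as on the slide (the function name `to_diphones` is illustrative):

```python
def to_diphones(phones, silence="_"):
    """Pair each phone with its successor, with silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


target = ["w", "I", "n", "t", "r=", "d", "eI"]
print(" ".join(to_diphones(target)))
# prints: _-w w-I I-n n-t t-r= r=-d d-eI eI-_
```

A target of n phones always yields n + 1 diphones, each of which must exist in the recorded diphone database.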

Concatenative synthesis: Diphones (2)
target: w I n t r= d eI
diphone sequence: _-w w-I I-n n-t t-r= r=-d d-eI eI-_
PSOLA pitch manipulation

Concatenative synthesis: Unit selection
target: w I n t r= d eI
acoustic unit database: units = (di-)phone segments recorded in natural sentences (natural intonation)
  “Which of these?”
  “Let's discuss the question of interchanges another day.”
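The slide does not say how units are chosen from the corpus. One common formulation, assumed here and not taken from the slides, scores each candidate with a target cost (how well it fits the wanted unit) and a join cost (how well it connects to its neighbour), then finds the cheapest sequence by dynamic programming. The cost functions and the (diphone, sentence, position) unit tuples below are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target, minimising summed target and join costs.

    targets: non-empty list of target specifications
    candidates: list of the same length, each entry a non-empty candidate list
    """
    # best[i][j] = (cheapest total cost of a path ending in candidates[i][j],
    #               index of the previous candidate on that path)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev_cost, prev_idx = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_idx))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path)), total


# Hypothetical units: (diphone, source sentence, position in that sentence).
targets = ["w-I", "I-n", "n-t"]
candidates = [
    [("w-I", "s1", 4), ("w-I", "s2", 0)],
    [("I-n", "s2", 1), ("I-n", "s3", 7)],
    [("n-t", "s3", 8), ("n-t", "s2", 2)],
]


def target_cost(target, unit):
    return 0.0 if unit[0] == target else 1.0


def join_cost(left, right):
    # Units that were adjacent in the recording join for free.
    adjacent = left[1] == right[1] and left[2] + 1 == right[2]
    return 0.0 if adjacent else 1.0


path, total = select_units(targets, candidates, target_cost, join_cost)
print(path, total)
```

With these toy costs the search prefers the three units that were recorded back to back in sentence s2, which is the intuition behind "prosody comes from the corpus": long contiguous stretches need no joins at all.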

AI Poker: The voices of Sam and Max
Sam: Unit Selection Synthesis
  Voice specifically recorded for AI Poker
  Natural sound within poker domain
Max: HMM-based synthesis
  Sound quality is limited but constant with any text

Sam's voice: Unit selection synthesis
“Ich habe zwei Paare.” (“I have two pairs.”)
Unit selection corpus: several hours of speech recordings
=> very good quality within the poker domain!

Sam's voice: Unit selection synthesis
“Ich kann auch ganz andere Sachen...” (“I can also do quite different things...”)
Unit selection corpus: several hours of speech recordings
reduced quality with arbitrary text

Max's voice: HMM-based synthesis
“Ich habe zwei Paare.” (“I have two pairs.”)
Hidden Markov Models (statistical models) → acoustic feature vectors → vocoder

Max's voice: HMM-based synthesis
“Ich kann auch ganz andere Sachen...” (“I can also do quite different things...”)
Hidden Markov Models (statistical models) → acoustic feature vectors → vocoder
constant quality with arbitrary text

MARY TTS 4.0
Pure Java
  Runs on any platform with Java 5
Client-server architecture
  http interface: your browser is a MARY client
Multilingual, with UTF-8 support
  English (US and GB)
  German: Willkommen
  Turkish: Konuşma
  Telugu: స్చ స్నసస

Audio effects in MARY 4.0
Some can be applied to any voice
  vocal tract length (longer – shorter)
  Robot effect
  Whisper effect
  Jet pilot
More effects for HMM-based voices
  pitch level (higher – lower)
  pitch range (wider – narrower)
  speaking rate (faster – slower)
Can be parameterised & combined to create characteristic voices

MARY TTS: New language support workflow
[Workflow diagram; recoverable elements listed below]
Wikipedia text import: Wikipedia XML dump → Dump splitter → Markup cleaner → clean text
Transcription GUI: most frequent words in the language → pronunciation lexicon, letter-to-sound function for unknown words
Feature maker: allophones (.xml) → sentences w/ diphone+prosody features
Script selection, optimising coverage → list of selected sentences
Manual check, exclude unsuitable sentences → script
Redstart: record speech db → audio files
Voice Import Tools → acoustic models for F0 + duration, unit selection voice files, HMM-based voice files
Basic NLP components enable conversion TEXT->ALLOPHONES in new locale
  Tokeniser, rudimentary POS tagger, Phonemiser, Symbolic prosody: generic implementations with basic functionality
  speaker-specific pronunciation
Synthesis components enable conversion ALLOPHONES->Audio in new voice

What you will learn to do in the MARY Tutorial
Installing the MARY system
  languages and voices
Interacting with MARY using the web client
  basic experimentation
  interactive test of audio effects
  interactive documentation of http interface
Triggering TTS from your own software
  http interface
  Java client code
  selecting language, voice and effects in requests
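Triggering TTS over the http interface comes down to sending a request with the text, the input and output types, and the locale or voice. The sketch below only builds such a request URL. The `/process` path, the parameter names and port 59125 are assumptions based on how MARY 4.x servers are commonly configured, so they should be checked against the server's own interactive http documentation; the function name `mary_process_url` is made up for this example.

```python
from urllib.parse import urlencode


def mary_process_url(text, locale="en_US", voice=None,
                     host="localhost", port=59125):
    """Build a synthesis request URL (parameter names are assumptions)."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    if voice is not None:
        params["VOICE"] = voice  # voice names depend on what is installed
    return f"http://{host}:{port}/process?{urlencode(params)}"


print(mary_process_url("Hello world"))
```

With a server running, passing the URL to `urllib.request.urlopen(...).read()` should return the audio bytes, which can then be written to a .wav file. Changing `OUTPUT_TYPE` to one of the intermediate data types would request that stage's output instead of audio.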

What you will learn to do in the MARY Tutorial (2)
Using timing information: REALISED_ACOUSTPARAMS and REALISED_DURATIONS
Performance: caching
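The slide does not say where caching happens. One possible reading is a client-side cache, so that repeated prompts are synthesised only once. The class below is a sketch under that assumption: `SynthesisCache` and its `synthesise` callback are hypothetical names, and the callback stands in for whatever actually contacts the server.

```python
class SynthesisCache:
    """Memoise synthesis results per (text, locale, voice, effects) request."""

    def __init__(self, synthesise):
        self._synthesise = synthesise  # callable doing the real (slow) request
        self._store = {}
        self.hits = 0

    def get(self, text, locale="en_US", voice=None, effects=()):
        # Effects are part of the key because they change the audio.
        key = (text, locale, voice, tuple(effects))
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._synthesise(text, locale, voice, effects)
        return self._store[key]


# Stand-in for a real request: records each call and returns fake bytes.
calls = []


def fake_synthesise(text, locale, voice, effects):
    calls.append(text)
    return text.encode("utf-8")


cache = SynthesisCache(fake_synthesise)
cache.get("Hello, nice to meet you.")
cache.get("Hello, nice to meet you.")
print(len(calls), cache.hits)
```

This pays off for fixed prompts in a dialogue system; for text that rarely repeats, the cache only costs memory, so an upper bound on its size would be a sensible addition.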
