Text-to-Speech synthesis using OpenMARY
An introduction and practical tutorial
Marc Schröder, DFKI
marc.schroeder@dfki.de
eNTERFACE Amsterdam, 14 July 2010
Overview
  - Some Text-to-Speech (TTS) basics
    - Natural Language Processing
    - Generating the sound
      - diphone synthesis
      - unit selection synthesis
      - HMM-based synthesis
  - OpenMARY
    - existing system MARY 4.0
    - toolkit for adding new languages and voices
  - Tutorial overview
    - what you will learn to do in the tutorial
What is text-to-speech synthesis?
  “You have one message from Dr Johnson.” → TTS → speech audio
Applications of TTS
  - Text readers
    - for the blind
    - in eyes-free environments (e.g., while driving)
  - Telephone-based voice portals
  - Multi-modal interactive systems
    - talking heads
    - “embodied conversational agents” (ECAs)
A Talking Head
  “Hello, nice to meet you.” → TTS → information on timing and mouth shapes
Structure of a TTS system
  TEXT: either plain text or a Speech Synthesis Markup Language (SSML) document
    ↓ text analysis (natural language processing techniques)
  ACOUSTPARAMS: phonetic transcription + prosodic parameters
    - intonation specification
    - pausing & speech timing
    ↓ audio generation (signal processing techniques)
  AUDIO: wave file
Structure of a TTS system: MARY TTS
  Text analysis
    - Input markup parser: TEXT or SSML → RAWMARYXML
    - Shallow NLP: RAWMARYXML → PARTSOFSPEECH
    - Phonemiser: PARTSOFSPEECH → ALLOPHONES
    - Symbolic prosody: ALLOPHONES → INTONATION
    - Acoustic parameters: INTONATION → ACOUSTPARAMS
  Audio generation
    - Waveform synthesis: ACOUSTPARAMS → AUDIO
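The module chain on this slide can be read as a path through a graph of data types. The following is a minimal sketch of that idea, not MARY's actual implementation: the module and type names are taken from the slide, while the `MODULES` table and the `plan_chain` function are hypothetical names introduced for illustration.

```python
# Hypothetical sketch: each module converts one data type into the next.
# Module and type names come from the slide; plan_chain is an assumption,
# not part of the MARY API.
MODULES = [
    ("Input markup parser", "TEXT", "RAWMARYXML"),
    ("Shallow NLP", "RAWMARYXML", "PARTSOFSPEECH"),
    ("Phonemiser", "PARTSOFSPEECH", "ALLOPHONES"),
    ("Symbolic prosody", "ALLOPHONES", "INTONATION"),
    ("Acoustic parameters", "INTONATION", "ACOUSTPARAMS"),
    ("Waveform synthesis", "ACOUSTPARAMS", "AUDIO"),
]


def plan_chain(input_type, output_type):
    """Return the ordered module names leading from input_type to output_type."""
    by_input = {src: (name, dst) for name, src, dst in MODULES}
    chain, current = [], input_type
    while current != output_type:
        if current not in by_input:
            raise ValueError(f"no module accepts {current}")
        name, current = by_input[current]
        chain.append(name)
    return chain
```

Because every intermediate type is a valid input and output, a request can start or stop anywhere in the chain, for example `plan_chain("ALLOPHONES", "ACOUSTPARAMS")`.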
System structure: Input markup parser
  Input markup parser: TEXT or SSML → RAWMARYXML
  - MaryXML is the system-internal XML representation
    => parsing speech synthesis markup is a simple XML transformation
  - Uses XSLT
    => easily adaptable to a new markup language
System structure: Shallow NLP
  Shallow NLP
  - Tokeniser: RAWMARYXML → TOKENS
    - sentence boundaries, “tokens” = word-like units
  - Text normalisation: TOKENS → WORDS
    - expanded, pronounceable forms (see next slide)
  - Part-of-speech tagger: WORDS → PARTSOFSPEECH
Preprocessing / Text normalisation
  - Net patterns (email, web addresses): info@dfki.de
  - Date patterns: 23/07/2001
  - Time patterns: 12:24 h, 12:24
  - Duration patterns: 12:24 h, 12 h 24 min
  - Currency patterns: 12.95 €
  - Measure patterns: 123.09 km
  - Telephone number patterns: +49-681-85775-5303
  - Number patterns (cardinal, ordinal, roman): 3, 3rd, III.
  - Abbreviations & special characters: engl.
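As an illustration of pattern-based preprocessing, the sketch below classifies the example tokens from this slide with regular expressions. It is an assumption-laden toy, not MARY's normalisation code: the pattern table, the category labels, and the `classify` function are all hypothetical, and a real system would go on to expand each match into a pronounceable form.

```python
import re

# Hypothetical pattern table; order matters because the first match wins.
# "12:24 h" could be a time or a duration (the slide lists it under both);
# this sketch simply reports it as a time.
PATTERNS = [
    ("net", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("date", r"\d{1,2}/\d{1,2}/\d{4}"),
    ("time", r"\d{1,2}:\d{2}(?:\s?h)?"),
    ("currency", r"\d+(?:\.\d+)?\s?€"),
    ("measure", r"\d+(?:\.\d+)?\s?km"),
    ("telephone", r"\+\d+(?:-\d+)+"),
    ("ordinal", r"\d+(?:st|nd|rd|th)"),
    ("roman", r"[IVXLCDM]+\.?"),  # ambiguous with the English word "I"
    ("cardinal", r"\d+"),
]


def classify(token):
    """Return the name of the first pattern matching the whole token, or None."""
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return None
```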
System structure: Phonemisation
  Phonemiser: PARTSOFSPEECH → PHONEMES
  - lexicon lookup
  - letter-to-sound conversion
    - morphological decomposition
    - letter-to-sound rules
    - syllabification
    - word stress assignment
  Custom pronunciation: PHONEMES → ALLOPHONES
  - slurring, non-standard pronunciation
  - potentially trainable from annotated data of a given person
System structure: Prosody
  “Prosody”?
  - intonation (accented syllables; high or low phrase boundaries)
  - rhythmic effects (pauses, syllable durations)
  - loudness, voice quality
  Symbolic prosody prediction: ALLOPHONES → INTONATION
  - assign prosody by rule, based on
    - punctuation
    - part-of-speech
  - modelled using “Tones and Break Indices” (ToBI)
    - tonal targets: accents, boundary tones
    - phrase breaks
System structure: Calculation of acoustic parameters
  Duration prediction: INTONATION → DURATIONS
  - segment duration predicted by rules or by decision trees
  Contour generation: DURATIONS → ACOUSTPARAMS
  - fundamental frequency curve predicted by rules or by decision trees
System structure: Waveform synthesis
  Waveform synthesis: ACOUSTPARAMS → AUDIO
  - several waveform generation technologies
Creating sound: Waveform synthesis technologies (1)
  Formant synthesis
  - acoustic model of speech
  - generate acoustic structure by rule
  - robotic sound
Creating sound: Waveform synthesis technologies (2)
  Concatenative synthesis
  - diphone synthesis
    - glue pre-recorded “diphones” together
    - adapt prosody through signal processing
  - unit selection synthesis
    - glue units from a large corpus of speech together
    - prosody comes from the corpus, (nearly) no signal processing
Creating sound: Waveform synthesis technologies (3)
  Statistical-parametric speech synthesis with Hidden Markov Models
  - models trained on speech corpora
  - no data needed at runtime => small footprint
Examples of speech synthesis technologies
  MARY TTS
  - unit selection
  - HMM-based
  - MBROLA diphones
  - expressive unit selection
  Commercial
  - unit selection: IVONA, Loquendo
  - formant synthesis: DecTalk
Concatenative synthesis: Isolated phones don't work
  target: w I n t r= d eI
  acoustic unit database (units = phone segments recorded in isolation):
    w  I  eI  d  n  a  T  t  r=
Concatenative synthesis: Diphones
  target:   w I n t r= d eI
  diphones: _-w w-I I-n n-t t-r= r=-d d-eI eI-_
  Diphones = sound segments from the middle of one phone to the middle of the next phone
  acoustic unit database (units = diphone segments recorded in carrier words, flat intonation):
    _-w  (wonder)      t-r=  (water)
    w-I  (will)        r=-d  (nerdy)
    I-n  (spin)        d-eI  (date)
    n-t  (fountain)    eI-_  (away)
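The mapping from the target phone sequence to the diphone sequence on this slide can be sketched in a few lines. This is an illustrative helper only; the function name `to_diphones` is hypothetical, and `_` stands for silence as on the slide.

```python
def to_diphones(phones, silence="_"):
    """Turn a phone sequence into the diphones that span it.

    Each diphone runs from the middle of one phone to the middle of the
    next, so the sequence is padded with silence at both ends.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# The target from the slide, split into phones.
target = "w I n t r= d eI".split()
diphones = to_diphones(target)
```

A sequence of n phones always yields n + 1 diphones, which is why the seven-phone target needs eight units from the database.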
Concatenative synthesis: Diphones (2)
  target:   w I n t r= d eI
  diphones: _-w w-I I-n n-t t-r= r=-d d-eI eI-_
  PSOLA pitch manipulation
Concatenative synthesis: Unit selection
  target: w I n t r= d eI
  example corpus sentences:
    “Which of these?”
    “Let's discuss the question of interchanges another day.”
  acoustic unit database
  - units = (di-)phone segments
  - recorded in natural sentences (natural intonation)
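Unit selection is commonly formulated as a search for the candidate sequence that minimises the sum of target costs (how well a unit fits the requested position) and join costs (how well two adjacent units concatenate). The sketch below shows that general dynamic-programming idea under stated assumptions; it is not MARY's implementation, and `select_units`, the unit tuples, and both cost functions are hypothetical.

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per target position, minimising target + join costs.

    candidates: non-empty list with one non-empty list of units per position.
    target_cost(i, unit) and join_cost(prev_unit, unit) return numbers.
    """
    # best[i][j] = (cheapest path cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(0, unit), None) for unit in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, unit), back))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


# Toy units: (diphone, source sentence id, position in that sentence).
def adjacency_join_cost(prev, unit):
    """Free join if the units were adjacent in the same recording, else 1."""
    return 0 if prev[1] == unit[1] and prev[2] + 1 == unit[2] else 1


toy_candidates = [
    [("w-I", "A", 0), ("w-I", "B", 3)],
    [("I-n", "B", 4), ("I-n", "C", 0)],
]
chosen = select_units(toy_candidates, lambda i, u: 0, adjacency_join_cost)
```

With this join cost the search prefers units that were already neighbours in the corpus, which mirrors the slide's point that prosody comes from the recordings rather than from signal processing.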
AI Poker: The voices of Sam and Max
  Sam: Unit Selection Synthesis
  - Voice specifically recorded for AI Poker
  - Natural sound within poker domain
  Max: HMM-based synthesis
  - Sound quality is limited but constant with any text
Sam's voice: Unit selection synthesis
  “Ich habe zwei Paare.” (“I have two pairs.”)
  [unit] + [unit] + [unit] + ...
  Unit selection corpus: several hours of speech recordings
  => very good quality within the poker domain!
Sam's voice: Unit selection synthesis
  “Ich kann auch ganz andere Sachen...” (“I can also do completely different things...”)
  [unit] + [unit] + [unit] + ...
  Unit selection corpus: several hours of speech recordings
  => reduced quality with arbitrary text
Max's voice: HMM-based synthesis
  “Ich habe zwei Paare.” (“I have two pairs.”)
  Hidden Markov Models (statistical models) → acoustic feature vectors → vocoder
Max's voice: HMM-based synthesis
  “Ich kann auch ganz andere Sachen...” (“I can also do completely different things...”)
  Hidden Markov Models (statistical models) → acoustic feature vectors → vocoder
  => constant quality with arbitrary text
MARY TTS 4.0
  - Pure Java
    - Runs on any platform with Java 5
  - Client-server architecture
    - HTTP interface: your browser is a MARY client
  - Multilingual, with UTF-8 support
    - English (US and GB)
    - German: “Willkommen”
    - Turkish: “Konuşma”
    - Telugu: “స్చ స్నసస”
Audio effects in MARY 4.0
  - Some can be applied to any voice
    - vocal tract length (longer / shorter)
    - Robot effect
    - Whisper effect
    - Jet pilot
  - More effects for HMM-based voices
    - pitch level (higher / lower)
    - pitch range (wider / narrower)
    - speaking rate (faster / slower)
  - Can be parameterised & combined to create characteristic voices
MARY TTS: New language support workflow
  [Workflow diagram; recoverable steps and outputs:]
  - Wikipedia text import: Wikipedia XML dump → Dump splitter → Markup cleaner → clean text
  - Feature maker: sentences with diphone + prosody features
  - Script selection (optimising coverage): selected sentences
  - Manual check, exclude unsuitable sentences: script
  - Transcription GUI (most frequent words in the language): allophones .xml, pronunciation lexicon, letter-to-sound for unknown words, list of function words
  - Basic NLP components enable conversion TEXT → ALLOPHONES in new locale
    - Phonemiser, rudimentary POS tagger
    - generic implementations with basic functionality: Tokeniser, Symbolic prosody
  - Redstart: record speech db → audio files
  - Voice Import Tools: speaker-specific pronunciation, acoustic models for F0 + duration, unit selection voice files, HMM-based voice files
  - Synthesis components enable conversion ALLOPHONES → AUDIO in new voice
What you will learn to do in the MARY Tutorial
  - Installing the MARY system
    - languages and voices
  - Interacting with MARY using the web client
    - basic experimentation
    - interactive test of audio effects
    - interactive documentation of HTTP interface
  - Triggering TTS from your own software
    - HTTP interface
    - Java client code
    - selecting language, voice and effects in requests
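To make "triggering TTS from your own software" concrete, here is a minimal sketch that builds a request URL for a locally running MARY server. Everything specific in it is an assumption to be checked against the interactive documentation of the HTTP interface: the default port 59125, the `/process` path, and the parameter names `INPUT_TEXT`, `INPUT_TYPE`, `OUTPUT_TYPE`, `AUDIO`, `LOCALE` and `VOICE`. The function name `build_request_url` is hypothetical, and effect selection is left out because its parameter syntax is not given in the slides.

```python
from urllib.parse import urlencode


def build_request_url(text, locale="en_US", voice=None,
                      host="localhost", port=59125):
    """Build a synthesis request URL (parameter names are assumptions)."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",    # data type names as used in the slides
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",    # assumed name of the audio format option
        "LOCALE": locale,
    }
    if voice is not None:
        params["VOICE"] = voice  # must be a voice installed on the server
    return f"http://{host}:{port}/process?{urlencode(params)}"


url = build_request_url("Hello, nice to meet you.")
```

With a server running, the resulting URL could be fetched with `urllib.request.urlopen(url)` and the response body written to a `.wav` file; changing `OUTPUT_TYPE` to another data type from the processing chain would return that intermediate representation instead.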
What you will learn to do in the MARY Tutorial (2)
  - Using timing information: REALISED_ACOUSTPARAMS and REALISED_DURATIONS
  - Performance: caching
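The slide does not say where caching happens, so the following is only one possible reading: a client-side cache that avoids re-synthesising a sentence that was already requested with the same language and voice. The class `SynthesisCache` and the `synthesise` callable it wraps are hypothetical.

```python
class SynthesisCache:
    """Remember synthesis results per (text, locale, voice) request.

    synthesise is any callable taking (text, locale, voice) and returning
    audio bytes, for example a function that queries a TTS server.
    """

    def __init__(self, synthesise):
        self._synthesise = synthesise
        self._store = {}
        self.misses = 0  # number of times the wrapped callable was invoked

    def get(self, text, locale, voice):
        key = (text, locale, voice)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesise(text, locale, voice)
        return self._store[key]
```

This pays off in applications with recurring prompts; for unbounded input the store would need an eviction policy, which this sketch omits.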