Text-to-Speech Synthesis
Bernd Möbius
Language Science and Technology, Saarland University
Lecture 1, May 7, 2020
Introduction: Synthesis methods
B Möbius TTS: Introduction 1

Speech synthesis: Ambition and dilemma
▪ Ambition of speech synthesis:
  ▪ modeling the production side of the most complex human cognitive ability
▪ Dilemma of speech synthesis:
  ▪ emulate a human speaker or reader, without
    ▪ world knowledge
    ▪ language comprehension
    ▪ speech organs
  ▪ achieve optimal intelligibility and naturalness
▪ Speech synthesis: an impossible task!?

Human-machine dialog (1)

End-to-end synthesis (TACOTRON)
▪ Tacotron 2: Generating Human-like Speech from Text
▪ Tacotron 2: Audio samples

Human-machine dialog (2)

Course details
▪ Offered for:
  ▪ M.Sc. Language Science and Technology, LCT
  ▪ B.Sc. Computerlinguistik
  ▪ M.Sc./B.Sc. Computer- und Kommunikationstechnik
  ▪ M.Sc./B.Sc. Computer Science
▪ Coordinates, contact:
  ▪ Lecture, Thu 10-12, C7.4/1.17, 2 SWS, 3 LP/ECTS, LSF #121407
  ▪ http://www.coli.uni-saarland.de/~moebius/ → Teaching
  ▪ moebius@lst.uni-saarland.de

"Speaking" statues
Devices designed by Heron of Alexandria (1st cent. AD)
Colossi of Memnon, Thebes, Egypt (cf. Terra X, ZDF, 6-2-2011)

Mechanical systems
Wolfgang von Kempelen (1791): speaking machine
https://www.youtube.com/watch?v=k_YUB_S6Gpo

Mechanical systems
Wolfgang von Kempelen (1770)

Mechanical systems
Kratzenstein (1779): isolated sounds
Wheatstone (1838): connected sounds

Electrical systems
Dudley (1939): the Voder

Formant synthesis
Gunnar Fant (1953): OVE I, serial filters
John Holmes (1973): parallel filters

Formant synthesis
▪ Acoustic-parametric synthesis
  ▪ modeling the acoustic properties of speech sounds
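
The difference between serial and parallel filter arrangements can be illustrated with a small sketch. It is not the design of OVE I or of Holmes's synthesizer: it assumes a standard second-order digital resonator per formant, an impulse train as a crude voice source, and illustrative formant frequencies, bandwidths, and branch gains for an /a/-like vowel. All function names are hypothetical.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator (one formant) with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2  # difference equation of the resonator
        out.append(y)
        y1, y2 = y, y1
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voice source: one unit impulse per pitch period."""
    period = round(sample_rate / f0_hz)
    n_samples = int(duration_s * sample_rate)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def serial_synthesis(source, formants, sample_rate):
    """Serial (cascade) arrangement: each resonator filters the previous output."""
    out = source
    for freq_hz, bandwidth_hz in formants:
        out = resonator(out, freq_hz, bandwidth_hz, sample_rate)
    return out


def parallel_synthesis(source, formants, gains, sample_rate):
    """Parallel arrangement: each resonator filters the source; outputs are summed."""
    branches = [resonator(source, f, bw, sample_rate) for f, bw in formants]
    return [
        sum(g * branch[n] for g, branch in zip(gains, branches))
        for n in range(len(source))
    ]


# Illustrative values only (assumed, not taken from the lecture).
SAMPLE_RATE = 16000
FORMANTS = [(700.0, 60.0), (1200.0, 90.0), (2500.0, 150.0)]  # (frequency, bandwidth) in Hz
source = impulse_train(100.0, 0.05, SAMPLE_RATE)
serial_out = serial_synthesis(source, FORMANTS, SAMPLE_RATE)
parallel_out = parallel_synthesis(source, FORMANTS, [1.0, 0.5, 0.25], SAMPLE_RATE)
```

In the serial arrangement the relative formant amplitudes follow from the filter chain itself, whereas the parallel arrangement needs an explicit gain per branch.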

Formant synthesis
▪ http://www.youtube.com/watch?v=J-8a55jeR-A (1:13 – 1:32)
▪ http://www.youtube.com/watch?v=wlrOKpQ6UBI
Prof. Stephen Hawking † and speech synthesizer (DECtalk DTC01)
DECtalk, Infovox

Articulatory synthesis
▪ Articulatory synthesis
  ▪ modeling components of the speech production system
  ▪ voice source, articulators, 3D vocal tract, etc.
Vocal Tract Lab (2007), http://www.vocaltractlab.de/
IP Köln (1995)

Synthesis methods
▪ Acoustic-parametric synthesis
  ▪ a.k.a. formant synthesis
  ▪ modeling the acoustic properties of speech sounds
▪ Articulatory synthesis
  ▪ modeling components of the speech production system
  ▪ voice source, articulators, 3D vocal tract, etc.
▪ Concatenative synthesis
  ▪ uses segments of natural speech, concatenated and resequenced to synthesize the intended utterance
  ▪ e.g. diphone synthesis, unit selection synthesis

Concatenative synthesis
▪ Data-based, concatenative synthesis
  ▪ offline: extraction of units from recordings of natural speech
  ▪ online: selection and sequential concatenation of units
▪ Which units are appropriate?
  ▪ allophones? [Ger: 45]

Allophone synthesis

Concatenative synthesis
▪ Data-based, concatenative synthesis
  ▪ offline: extraction of units from recordings of natural speech
  ▪ online: selection and sequential concatenation of units
▪ Which units are appropriate?
  ▪ allophones? [Ger: 45]
  ▪ diphones? [Ger: 2,025]

Diphone synthesis
Hadifix, Festival, SVOX, Bell Labs

Concatenative synthesis
▪ Data-based, concatenative synthesis
  ▪ offline: extraction of units from recordings of natural speech
  ▪ online: selection and sequential concatenation of units
▪ Which units are appropriate?
  ▪ (allo)phones? [Ger: 45]
  ▪ diphones? [Ger: 2,025]
  ▪ triphones? [Ger: 91,125]
  ▪ syllables? [Ger: 12,500+]
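
The diphone and triphone figures for German are the theoretical upper bounds for an inventory of 45 phones, i.e. 45² and 45³. A minimal sketch of that arithmetic follows; the function and variable names are illustrative, and the syllable count is an empirical figure that is not derived this way.

```python
# Upper bounds on unit inventory size for a given number of phones.
# 45 is the German (allo)phone count quoted on the slide.
N_PHONES = 45


def inventory_size(n_phones: int, unit_length: int) -> int:
    """Number of possible sequences of `unit_length` phones."""
    return n_phones ** unit_length


sizes = {
    "phones": inventory_size(N_PHONES, 1),
    "diphones": inventory_size(N_PHONES, 2),
    "triphones": inventory_size(N_PHONES, 3),
}

for name, size in sizes.items():
    print(f"{name}: {size:,}")
```

Not every phone sequence occurs in a language, so the number of units that must actually be recorded is typically smaller than these bounds.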

Concatenative synthesis
▪ Unit Selection: dynamic selection of units at synthesis run-time
▪ "The best solution to the synthesizer problem is to avoid it." [Carlson & Granström, 1991]
▪ sound inventory: large, phonetically rich speech database
▪ selection of the smallest number of the longest units from a large corpus (2 – 10+) of recorded natural speech
▪ variable unit size (phones, syllables, words, ...)
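
The idea of covering an utterance with the smallest number of the longest units can be sketched as a small dynamic program. This is a simplification under stated assumptions: units are words, the corpus is a hypothetical list of tokenized utterances, and only the number of units counts, with no acoustic costs. The function name `fewest_units` is illustrative.

```python
def fewest_units(target, corpus):
    """Cover `target` with the fewest contiguous word runs found in `corpus`."""
    n = len(target)
    # Every contiguous word run of every corpus utterance is a usable unit.
    runs = {
        tuple(utt[i:j])
        for utt in corpus
        for i in range(len(utt))
        for j in range(i + 1, len(utt) + 1)
    }
    inf = float("inf")
    # cost[k]: fewest units needed to cover target[:k]
    # split[k]: start index of the last unit in that cover
    cost = [0] + [inf] * n
    split = [-1] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            if cost[start] + 1 < cost[end] and tuple(target[start:end]) in runs:
                cost[end] = cost[start] + 1
                split[end] = start
    if cost[n] == inf:
        raise ValueError("target contains a word that is not in the corpus")
    # Walk the split points backwards to recover the units.
    units, end = [], n
    while end > 0:
        units.append(target[split[end]:end])
        end = split[end]
    return units[::-1]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "I have a meeting on Monday".split(),
    "do you have time on Friday".split(),
    "I will see you on Monday".split(),
]
target = "I have time on Monday".split()
units = fewest_units(target, corpus)
print(units)
```

Fewer units means fewer concatenation points, which is the motivation for preferring long units.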

Unit Selection: units=words
▪ Target utterance: I have time on Monday.
▪ Step 1: list all candidate words for target sentence
[Figure: candidate instances per target word: I (4), have (3), time (2), on (4), Monday (3)]

Unit Selection: units=words
▪ Target utterance: I have time on Monday.
▪ Step 2: connect all units
[Figure: candidate lattice (I ×4, have ×3, time ×2, on ×4, Monday ×3) between start node S and end node E; horizontal axis: concatenation (time)]

Unit Selection: units=words
▪ Target utterance: I have time on Monday.
▪ Step 3: selection of units along optimal path
[Figure: candidate lattice (I ×4, have ×3, time ×2, on ×4, Monday ×3) between start node S and end node E; horizontal axis: concatenation (time)]

Unit Selection synthesis
▪ best path minimizes 2 cost functions
  ▪ target costs: how similar is the candidate unit to the target unit?
  ▪ concatenation costs: how smoothly does the unit connect to its neighbors?
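
The search for the best path can be sketched as a Viterbi-style dynamic program over the candidate lattice. The sketch below is not the implementation of any of the systems named in this lecture. It assumes word-sized units represented as hypothetical `(word, utterance_id, position)` tuples, a toy corpus, a target cost of zero for every matching candidate, and a concatenation cost of zero only for units that were adjacent in the same recording.

```python
from typing import Callable, List, Sequence, Tuple

# (word, utterance_id, position): an assumed layout for this sketch.
Unit = Tuple[str, int, int]


def select_units(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    concat_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Return the candidate sequence with minimal summed target and concatenation cost."""
    if not candidates or any(len(column) == 0 for column in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[i][j]: minimal cost of any path that ends in candidates[i][j]
    best: List[List[float]] = [[target_cost(0, u) for u in candidates[0]]]
    back: List[List[int]] = [[-1] * len(candidates[0])]

    for i in range(1, len(candidates)):
        row_cost, row_back = [], []
        for unit in candidates[i]:
            # Cheapest predecessor, including the cost of joining it to this unit.
            prev_j = min(
                range(len(candidates[i - 1])),
                key=lambda j: best[i - 1][j] + concat_cost(candidates[i - 1][j], unit),
            )
            join = concat_cost(candidates[i - 1][prev_j], unit)
            row_cost.append(best[i - 1][prev_j] + join + target_cost(i, unit))
            row_back.append(prev_j)
        best.append(row_cost)
        back.append(row_back)

    # Trace back from the cheapest unit in the last column.
    j = min(range(len(candidates[-1])), key=lambda k: best[-1][k])
    total = best[-1][j]
    path: List[Unit] = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Hypothetical toy corpus, for illustration only.
corpus = [
    "I have a meeting on Monday".split(),
    "do you have time on Friday".split(),
    "I will see you on Monday".split(),
]
target = "I have time on Monday".split()
candidates = [
    [(w, u, p) for u, utt in enumerate(corpus) for p, w in enumerate(utt) if w == word]
    for word in target
]


def target_cost(index: int, unit: Unit) -> float:
    # Every candidate already matches the target word, so none is penalized here.
    return 0.0


def concat_cost(left: Unit, right: Unit) -> float:
    # Units that were adjacent in the same recording join for free.
    adjacent = left[1] == right[1] and right[2] == left[2] + 1
    return 0.0 if adjacent else 1.0


path, total = select_units(candidates, target_cost, concat_cost)
print([word for word, _, _ in path], total)
```

In a real system both cost functions would compare phonetic and prosodic features and spectral continuity at the joins; the binary costs here only make the path search visible.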

Unit Selection: variable-size units

Unit Selection: demos
▪ example speech output from several systems:
  ▪ CHATR (1996)
  ▪ AT&T (2001)
  ▪ Festival (2004)
  ▪ SmartKom (2005)
  ▪ Loquendo (2010)
  ▪ BOSS (pol., 2009)

Statistical Parametric synthesis

DNN synthesis (WaveNet)

End-to-end synthesis (Tacotron)

TTS: Audio demos
System     Method          Interactive  Lang.
DECtalk    formant         no           Eng
Infovox    formant         no           Ger
IP Köln    articulatory    no           Ger
Hadifix    diphones        yes          Ger
SVOX       diphones        yes          Ger
Bell Labs  diphones        yes          Ger
Festival   diphones        yes          Ger
AT&T       unit selection  yes          Eng
▪ "Welcome to the Cocosda / LDC interactive TTS comparison site."
▪ "Willkommen auf der interaktiven Seite von Cocosda und LDC für den Vergleich von Sprachsynthesesystemen."

Essential content: Speech synthesis methods
▪ expert systems, rule-based approaches
  ▪ formant synthesis
  ▪ articulatory synthesis
▪ concatenative approaches
  ▪ diphone synthesis
  ▪ unit selection synthesis
▪ statistical approaches
  ▪ statistical-parametric (HMM) synthesis
  ▪ neural network based synthesis

The tone of voice
