6-Text To Speech (TTS): Speech Synthesis

6-Text To Speech (TTS)
Speech Synthesis
Speech Synthesis Concept
Speech Naturalness
Phone Sequence To Speech
– Articulatory Approaches
– Concatenative Approaches
– HMM-based Approaches
– Rule-Based Approaches

Speech Synthesis Concept
[Figure: two views of the pipeline. Overall: Text → Speech waveform. In detail: Text → Text to Phone Sequence (Natural Language Processing, NLP) → Phone Sequence to Speech (Speech Processing) → Speech waveform]

Speech Naturalness
Removal of undesirable noise, distortion, and discontinuities from the speech
Prosody generation
– Speech energy
– Duration
– Pitch
– Intonation
– Stress

Speech Naturalness (Cont’d)
Intonation and stress contribute strongly to speech naturalness
Intonation: variation of the pitch frequency over the course of an utterance
Stress: a rise in the pitch frequency at a specific point in time

Which word receives an intonational accent?
It depends on the context.
– The ‘new’ information in the answer to a question is often accented, while the ‘old’ information usually is not.
– Q1: What types of foods are a good source of vitamins?
– A1: LEGUMES are a good source of vitamins.
– Q2: Are legumes a source of vitamins?
– A2: Legumes are a GOOD source of vitamins.
– Q3: I’ve heard that legumes are healthy, but what are they a good source of?
– A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti

Same ‘tune’, different alignment
[Figure: pitch contour for “LEGUMES are a good source of vitamins”, vertical axis from 50 to 400]
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti

Same ‘tune’, different alignment
[Figure: pitch contour for “Legumes are a GOOD source of vitamins”, vertical axis from 50 to 400]
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti

Same ‘tune’, different alignment
[Figure: pitch contour for “Legumes are a good source of VITAMINS”, vertical axis from 50 to 400]
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti

Types of Waveform Synthesis
Articulatory Synthesis:
– Model movements of articulators and acoustics of the vocal tract
Concatenative Synthesis:
– Use databases of stored speech to assemble new utterances
– Diphone
– Unit Selection
Statistical (HMM) Synthesis:
– Trains parameters on databases of speech
Rule-Based (Formant) Synthesis:
– Start with acoustics, create rules/filters to create the waveform

Articulatory Synthesis
Simulation of the physical processes of human articulation
Wolfgang von Kempelen (1734-1804) and others used bellows, reeds and tubes to construct mechanical speaking machines
Modern versions “simulate” electronically the effect of articulator positions, vocal tract shape, etc. on air flow.

Concatenative Approaches
Two main approaches:
1- Concatenating phone units
– Example: concatenating samples of recorded diphones or syllables
2- Unit selection
– Uses several samples for each phone unit and selects the most appropriate one when synthesizing

Phone Units
– Paragraph ( )
– Sentence ( )
– Word (depends on the language; usually more than 100,000)
– Syllable
– Diphone & Triphone
– Phoneme (between 10 and 100)

Phone Units (Cont’d)
Diphone: we model the transitions between two phonemes
[Figure: phoneme sequence p1 p2 p3 p4 p5, with the span of a phoneme and the span of a diphone marked]

Phone Units (Cont’d)
Farsi phonemes: 30
Farsi diphones: 30*30 = 900
Farsi triphones: 30*30*30 = 27000 in theory
Not all of the triphones are used
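
The counts above are simple powers of the phoneme inventory size. A minimal sketch of that arithmetic, assuming every ordered combination of phonemes is counted as a possible unit (the function name is illustrative, not from the slides):

```python
def inventory_sizes(num_phonemes: int) -> dict:
    """Theoretical unit counts when every ordered phoneme combination is allowed."""
    return {
        "phonemes": num_phonemes,
        "diphones": num_phonemes ** 2,   # ordered pairs of phonemes
        "triphones": num_phonemes ** 3,  # ordered triples of phonemes
    }


if __name__ == "__main__":
    # 30 is the Farsi phoneme count given on the slide.
    for unit, count in inventory_sizes(30).items():
        print(f"{unit}: {count}")
```

These are upper bounds: as the slide notes, a language does not actually use every triphone, so a real inventory is smaller.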

Phone Units (Cont’d)
Syllable = Onset (Consonant) + Rhyme
A syllable is a set of phonemes that contains exactly one vowel
Syllables in Farsi: CV, CVC, CVCC
Farsi has about 4000 syllables
Syllables in English: V, CV, CVC, CCVC, CCVCC, CCCVC, CCCVCC, . . .
The number of syllables in English is very large

Phone Sequence To Speech (Cont’d)
[Figure: pipeline Text → Text to Phone Sequence (NLP) → Phone Sequence to primitive utterance → primitive utterance to Natural Speech (Speech Processing) → Speech]

Concatenative Approaches
In these approaches we store units of natural speech and use them to reconstruct the desired speech
We select the appropriate phone unit for speech synthesis
We can store compressed parameters instead of the main waveform

Concatenative Approaches (Cont’d)
Benefits of storing compressed parameters instead of the main waveform
– Less memory use
– A general representation rather than one specific stored utterance
– Prosody is easier to generate

Concatenative Approaches (Cont’d)
Phone Unit    Type of Storing
Paragraph     Main Waveform
Sentence      Main Waveform
Word          Main Waveform
Syllable      Coded/Main Waveform
Diphone       Coded Waveform
Phoneme       Coded Waveform

Concatenative Approaches (Cont’d)
– Pitch Synchronous Overlap-Add (PSOLA) is a well-known method for smoothing the transitions between phonemes
– Overlap-add is a standard DSP method
– PSOLA is a basic operation in voice conversion
– In the analysis stage of this method we select frames that are synchronized with the pitch marks
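
The analysis and overlap-add steps can be illustrated with a heavily simplified sketch. It assumes a constant pitch period, pitch marks that are already known, and Hann-windowed frames two periods long; the function names are illustrative. A full PSOLA implementation would also track a varying period and repeat or drop frames to preserve duration, which this sketch does not do.

```python
import math


def hann(n):
    """Periodic Hann window; copies shifted by n/2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]


def extract_frames(signal, marks, period):
    """Analysis: cut one windowed frame, two periods long, centred on each pitch mark."""
    window = hann(2 * period)
    return [
        [signal[m - period + k] * window[k] for k in range(2 * period)]
        for m in marks
    ]


def resynthesize(frames, period, new_period):
    """Synthesis: overlap-add the frames at a new spacing.

    new_period < period raises the pitch, new_period > period lowers it.
    In this simplified version the duration changes too.
    """
    out = [0.0] * (new_period * (len(frames) - 1) + 2 * period)
    for i, frame in enumerate(frames):
        start = i * new_period
        for k, value in enumerate(frame):
            out[start + k] += value
    return out


if __name__ == "__main__":
    period = 8
    signal = [math.sin(2 * math.pi * n / period) for n in range(80)]
    marks = list(range(period, 72, period))  # assumed pitch marks, one per period
    frames = extract_frames(signal, marks, period)
    unchanged = resynthesize(frames, period, period)
    raised = resynthesize(frames, period, 6)
    print(len(unchanged), len(raised))
```

With the spacing left unchanged, the overlapping windows sum to one and the interior of the signal is reproduced exactly, which is a convenient sanity check before changing the spacing.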

Diphone Architecture Example
Training:
– Choose units (kinds of diphones)
– Record 1 speaker saying 1 example of each diphone
– Mark the boundaries of each diphone, cut each diphone out and create a diphone database
Synthesizing an utterance:
– Grab the relevant sequence of diphones from the database
– Concatenate the diphones, doing slight signal processing at the boundaries
– Use signal processing to change the prosody (F0, energy, duration) of the selected sequence of diphones
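
The lookup-and-concatenate part of the synthesis steps can be sketched as follows. This is a toy illustration under stated assumptions: the database is a plain dictionary from diphone names to sample lists, utterances are padded with a silence label `sil`, and both the naming scheme `a-b` and the function names are hypothetical. Boundary smoothing and prosody modification are left out.

```python
def to_diphones(phones, silence="sil"):
    """Turn a phone sequence into the diphone names that span its transitions."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def synthesize(phones, database):
    """Fetch each diphone from the database and concatenate the samples."""
    samples = []
    for name in to_diphones(phones):
        samples.extend(database[name])  # KeyError means the diphone was never recorded
    return samples


if __name__ == "__main__":
    # Toy database: each "recording" is just a short list of numbers.
    toy_database = {
        "sil-k": [0.0, 0.1],
        "k-ae": [0.2, 0.3],
        "ae-t": [0.4, 0.5],
        "t-sil": [0.6, 0.0],
    }
    print(to_diphones(["k", "ae", "t"]))
    print(synthesize(["k", "ae", "t"], toy_database))
```

A phone sequence of length n yields n + 1 diphones once the silence padding is added, which is why the database needs entries for transitions into and out of silence.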

Unit Selection
Same idea as concatenative synthesis, but the database contains a wider variety of “phone units”, from diphones to sentences
Multiple examples of each phone unit (under different prosodic conditions) are recorded
Selecting the appropriate unit therefore becomes more complex, because the database contains competing candidates

Unit Selection
Unlike diphone concatenation, little or no signal processing is applied to each unit
Natural data solves problems with diphones
– Diphone databases are carefully designed, but:
  Speaker makes errors
  Speaker doesn’t speak the intended dialect
  Require the database design to be right
– If it’s automatic:
  Labeled with what the speaker actually said
  Coarticulation, schwas, flaps are natural
“There’s no data like more data”
– Lots of copies of each unit mean you can choose just the right one for the context
– Larger units mean you can capture wider effects

Unit Selection Issues
Given a big database
For each segment (diphone) that we want to synthesize
– Find the unit in the database that is the best to synthesize this target segment
What does “best” mean?
– “Target cost”: closest match to the target description, in terms of
  Phonetic context
  F0, stress, phrase position
– “Join cost”: best join with neighboring units
  Matching formants + other spectral characteristics
  Matching energy
  Matching F0
Total cost of choosing units u_1 ... u_n for targets t_1 ... t_n:
$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{\text{target}}(t_i, u_i) + \sum_{i=2}^{n} C^{\text{join}}(u_{i-1}, u_i)$$

Unit Selection Search
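
One common way to carry out this search is dynamic programming (a Viterbi-style pass) over the candidate units, minimizing the sum of target costs and join costs. The sketch below assumes that approach; the slide's own figure is not recoverable, so this is an illustration and not necessarily the exact procedure shown there. The toy units are (name, F0) pairs and both cost functions are hypothetical stand-ins.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (best unit sequence, total cost).

    candidates[i] is the list of database units that could realise targets[i].
    """
    # cost[i][j]: cheapest total cost of any path that ends in candidates[i][j]
    cost = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row_cost, row_back = [], []
        for unit in candidates[i]:
            options = [
                cost[i - 1][k] + join_cost(prev, unit)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[k_best] + target_cost(targets[i], unit))
            row_back.append(k_best)
        cost.append(row_cost)
        back.append(row_back)

    # Trace back from the cheapest final unit.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


if __name__ == "__main__":
    # Toy example: targets are desired F0 values, units are (name, F0) pairs.
    targets = [100, 120]
    candidates = [[("a1", 100), ("a2", 130)], [("b1", 140), ("b2", 115)]]
    path, total = select_units(
        targets,
        candidates,
        target_cost=lambda t, u: abs(t - u[1]),     # F0 mismatch with the target
        join_cost=lambda p, u: abs(p[1] - u[1]),    # F0 jump at the join
    )
    print([name for name, _ in path], total)  # ['a1', 'b2'] 20
```

The dynamic program evaluates each pair of adjacent candidates once, so its cost grows with the square of the candidates per target and not with the number of complete unit sequences.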

Joining Units
Unit selection, just like diphone synthesis, needs to join the units
– Pitch-synchronously
For diphone synthesis, F0 and duration need to be modified
– For unit selection, in principle the F0 and duration of the selected units also need to be modified
– But in practice, if the unit-selection database is big enough (commercial systems), no prosodic modifications are made (the selected units may already be close to the desired prosody)

Unit Selection Summary
Advantages:
– Quality is far superior to diphones
– Natural prosody selection sounds better
Disadvantages:
– Quality can be very bad in some places
  HCI problem: a mix of very good and very bad is quite annoying
– Synthesis is computationally expensive
– Needs more memory than diphone synthesis

Rule-Based Approach Stages
– Determine the speech model and the model parameters
– Determine the type of phone units
– Determine the parameter values for each phone unit
– Replace the sequence of phone units with its equivalent parameter sequence
– Feed the parameter sequence into the speech model
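
To make the last stage concrete, here is a minimal sketch of feeding formant parameters into a speech model. It assumes the model is a cascade of second-order digital resonators of the kind used in Klatt-style formant synthesizers, each set by a formant frequency and a bandwidth. The formant values in the usage part are hypothetical illustration values, and a real synthesizer adds voicing and noise sources, nasal and tracheal branches, and time-varying parameters.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] for one formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Run a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, sample_rate):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


if __name__ == "__main__":
    sample_rate = 10000
    # Impulse train at 100 Hz as a crude voicing source (an assumption for the demo).
    source = [1.0 if n % 100 == 0 else 0.0 for n in range(1000)]
    # Hypothetical (frequency, bandwidth) pairs in Hz for three formants.
    vowel = cascade(source, [(500, 60), (1500, 90), (2500, 150)], sample_rate)
    print(len(vowel))
```

In a rule-based system the rules would supply a new set of (frequency, bandwidth) values for each phone unit, and the synthesizer would update the resonators over time instead of holding them fixed as this sketch does.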

KLATT 80 Model

KLATT 88 Model

The KLSYN88 Cascade/Parallel Formant Synthesizer
[Figure: block diagram of the synthesizer. Recoverable block labels:
– Laryngeal sound sources: filtered impulse train, KLGLOTT88 model (default), modified LF model, spectral tilt low-pass resonator, aspiration noise generator
– Cascade vocal tract model, laryngeal sound sources: nasal pole-zero pair, tracheal pole-zero pair, first through fifth formant resonators
– Parallel vocal tract model, laryngeal sound sources (normally not used): nasal formant resonator, first through fourth formant resonators, tracheal formant resonator, first difference preemphasis
– Frication sound sources: frication noise generator
– Parallel vocal tract model, frication sound sources: second through sixth formant resonators, bypass path
Parameter labels include F0, AV, TL, AH, AF, AB, F1 to F6, B1 to B5, B2F to B6F, FNP, FNZ, FTP, FTZ, BNP, BNZ, BTP, BTZ, DF1, DB1, A1V to A4V, A2F to A6F, ANV, ATV]
