6-Text To Speech (TTS)
Speech Synthesis
Speech Synthesis Concept
Speech Naturalness
Phone Sequence To Speech
– Articulatory Approaches
– Concatenative Approaches
– HMM-based Approaches
– Rule-Based Approaches
Speech Synthesis Concept
[Diagram: Text → Text to Phone Sequence (Natural Language Processing, NLP) → Phone Sequence to Speech (Speech Processing) → Speech waveform]
Speech Naturalness
Avoiding undesirable noise, distortion, and discontinuities in the speech
Prosody generation
– Speech energy
– Duration
– Pitch
– Intonation
– Stress
Speech Naturalness (Cont’d)
Intonation and stress strongly affect speech naturalness
Intonation: variation of the pitch frequency over the course of an utterance
Stress: a local increase of the pitch frequency at a specific point in time
Which word receives an intonational accent?
It depends on the context.
– The ‘new’ information in the answer to a question is often accented while the ‘old’ information is usually not.
– Q1: What types of foods are a good source of vitamins?
– A1: LEGUMES are a good source of vitamins.
– Q2: Are legumes a source of vitamins?
– A2: Legumes are a GOOD source of vitamins.
– Q3: I’ve heard that legumes are healthy, but what are they a good source of?
– A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti
Same ‘tune’, different alignment
[Figure: three pitch contours, vertical axis from 50 to 400, one each for “LEGUMES are a good source of vitamins”, “Legumes are a GOOD source of vitamins”, and “Legumes are a good source of VITAMINS”]
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti
Types of Waveform Synthesis
Articulatory Synthesis:
– Model movements of articulators and acoustics of vocal tract
Concatenative Synthesis:
– Use databases of stored speech to assemble new utterances
– Diphone
– Unit Selection
Statistical (HMM) Synthesis:
– Trains parameters on databases of speech
Rule-Based (Formant) Synthesis:
– Start with acoustics, create rules/filters to create waveform
Articulatory Synthesis
Simulation of the physical processes of human articulation
Wolfgang von Kempelen (1734-1804) and others used bellows, reeds and tubes to construct mechanical speaking machines
Modern versions “simulate” electronically the effect of articulator positions, vocal tract shape, etc. on air flow.
Concatenative Approaches
Two main approaches:
1- Concatenating Phone Units
– Example: concatenating samples of recorded diphones or syllables
2- Unit Selection
– Uses several samples for each phone unit and selects the most appropriate one when synthesizing
Phone Units
Paragraph ( )
Sentence ( )
Word (depends on the language; usually more than 100,000)
Syllable
Diphone & Triphone
Phoneme (between 10 and 100)
Phone Units (Cont’d)
Diphone: models the transition between two phonemes
[Figure: phoneme sequence p1 to p5 with the phoneme and diphone spans marked]
Phone Units (Cont’d)
Farsi phonemes: 30
Farsi diphones: 30*30 = 900
Farsi triphones: 30*30*30 = 27000 in theory
Not all of the triphones are used
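The counts above follow from treating a diphone as an ordered pair of phonemes and a triphone as an ordered triple. A minimal sketch of that arithmetic (the function name `unit_inventory_sizes` is only illustrative, and these are theoretical upper bounds, not the number of units that actually occur in a language):

```python
def unit_inventory_sizes(num_phonemes):
    """Theoretical upper bounds on the unit inventory for a phoneme set."""
    return {
        "phonemes": num_phonemes,
        "diphones": num_phonemes ** 2,   # ordered pairs of phonemes
        "triphones": num_phonemes ** 3,  # ordered triples of phonemes
    }

# Farsi, using the 30 phonemes quoted on the slide.
farsi = unit_inventory_sizes(30)
print(farsi["diphones"], farsi["triphones"])
```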
Phone Units (Cont’d)
Syllable = Onset (Consonant) + Rhyme
A syllable is a set of phonemes that contains exactly one vowel
Syllables in Farsi: CV, CVC, CVCC
Farsi has about 4000 syllables
Syllables in English: V, CV, CVC, CCVC, CCVCC, CCCVC, CCCVCC, . . .
The number of syllables in English is very large
Phone Sequence To Speech (Cont’d)
[Diagram: Text → Text to Phone Sequence (NLP) → Phone Sequence to primitive utterance → primitive utterance to Natural Speech (Speech Processing) → Speech]
Concatenative Approaches
In these approaches, units of natural speech are stored and used to reconstruct the desired speech
The appropriate phone unit is selected for speech synthesis
Compressed parameters can be stored instead of the main (uncompressed) waveform
Concatenative Approaches (Cont’d)
Benefits of storing compressed parameters instead of the main waveform
– Less memory use
– A general representation instead of one specific stored utterance
– Easier prosody generation
Concatenative Approaches (Cont’d)
Phone Unit    Type of Storing
Paragraph     Main Waveform
Sentence      Main Waveform
Word          Main Waveform
Syllable      Coded/Main Waveform
Diphone       Coded Waveform
Phoneme       Coded Waveform
Concatenative Approaches (Cont’d)
Pitch Synchronous Overlap-Add (PSOLA) is a well-known method for smoothing the transitions between phone units
Overlap-add is a standard DSP method
PSOLA is a basic operation in voice conversion
In the analysis stage, frames are selected synchronously with the pitch markers
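The pitch-synchronous analysis and overlap-add steps can be sketched as follows. This is a heavily simplified illustration, not a production PSOLA: it assumes the pitch marks are already given, that the pitch period is constant, and that the names `hann` and `psola` and the test signal are hypothetical. Frames are two periods long, Hann-weighted, centred on each pitch mark, and then overlap-added at a new spacing to change the pitch.

```python
import math

def hann(length):
    """Periodic Hann window; two copies offset by length / 2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]

def psola(signal, pitch_marks, period, new_period):
    """Overlap-add pitch-synchronous frames at a new spacing (simplified sketch)."""
    window = hann(2 * period)
    # Analysis: cut one two-period, Hann-weighted frame centred on each pitch mark.
    frames = [
        [signal[mark - period + n] * window[n] for n in range(2 * period)]
        for mark in pitch_marks
    ]
    # Synthesis: place new marks at the requested spacing over the same time span.
    out = [0.0] * len(signal)
    t = pitch_marks[0]
    while t <= pitch_marks[-1]:
        # Reuse the analysis frame whose original mark is closest to the new mark.
        nearest = min(range(len(pitch_marks)), key=lambda k: abs(pitch_marks[k] - t))
        for n, value in enumerate(frames[nearest]):
            idx = t - period + n
            if 0 <= idx < len(out):
                out[idx] += value
        t += new_period
    return out

# Assumed toy input: a sine wave with a 50-sample period and one mark per period.
period = 50
signal = [math.sin(2 * math.pi * n / period) for n in range(1000)]
marks = list(range(period, len(signal) - period + 1, period))
same = psola(signal, marks, period, period)   # unchanged spacing
higher = psola(signal, marks, period, 40)     # shorter spacing, higher pitch
```

With the spacing unchanged, the overlapping Hann windows sum to one, so the interior of the signal is reconstructed. With a different spacing the window sum is no longer constant, which a real system would compensate for.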
Diphone Architecture Example
Training:
– Choose units (kinds of diphones)
– Record 1 speaker saying 1 example of each diphone
– Mark the boundaries of each diphone, cut each diphone out and create a diphone database
Synthesizing an utterance:
– Grab the relevant sequence of diphones from the database
– Concatenate the diphones, doing slight signal processing at the boundaries
– Use signal processing to change the prosody (F0, energy, duration) of the selected sequence of diphones
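The lookup-and-concatenate part of the synthesis steps above can be sketched as follows. The sketch assumes a database that maps hypothetical diphone names such as `"s-a"` to lists of samples, and it pads the utterance with an assumed silence symbol `"sil"`. Boundary smoothing and prosody modification are deliberately left out.

```python
def to_diphones(phones, boundary="sil"):
    """Turn a phone sequence into the diphones that span its transitions."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

def synthesize(phones, database):
    """Concatenate the stored samples of each required diphone, in order."""
    samples = []
    for name in to_diphones(phones):
        samples.extend(database[name])  # raises KeyError if a diphone is missing
    return samples

print(to_diphones(["s", "a", "l"]))
```

A sequence of n phones needs n + 1 diphones once the silence at both ends is included, which is why every diphone in the inventory has to be recorded during training.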
Unit Selection
Same idea as concatenative synthesis, but the database contains a wider variety of “phone units”, from diphones to sentences
Multiple examples of each phone unit (under different prosodic conditions) are recorded
Selecting the appropriate unit therefore becomes more complex, as there are competing candidates for selection in the database
Unit Selection
Unlike diphone concatenation, little or no signal processing is applied to each unit
Natural data solves problems with diphones
– Diphone databases are carefully designed but:
  Speaker makes errors
  Speaker doesn’t speak intended dialect
  Require database design to be right
– If it’s automatic
  Labeled with what the speaker actually said
  Coarticulation, schwas, flaps are natural
“There’s no data like more data”
– Lots of copies of each unit mean you can choose just the right one for the context
– Larger units mean you can capture wider effects
Unit Selection Issues
Given a big database
For each segment (diphone) that we want to synthesize
– Find the unit in the database that is the best to synthesize this target segment
What does “best” mean?
– “Target cost”: closest match to the target description, in terms of
  Phonetic context
  F0, stress, phrase position
– “Join cost”: best join with neighboring units
  Matching formants + other spectral characteristics
  Matching energy
  Matching F0
$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{\text{target}}(t_i, u_i) + \sum_{i=2}^{n} C^{\text{join}}(u_{i-1}, u_i)$$
Unit Selection Search
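One common way to carry out this search is dynamic programming (a Viterbi-style search) over the candidate units, minimizing the sum of target costs and join costs. The sketch below is an assumption about how such a search could look, not the slide's own algorithm. The function `select_units`, the toy costs, and the use of a single F0 value to stand in for a whole unit are all hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return the unit sequence with the lowest total target + join cost."""
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            # Cost of reaching u from every candidate of the previous segment.
            reach = [best[i - 1][k][0] + join_cost(prev, u)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(reach)), key=reach.__getitem__)
            row.append((reach[k_best] + target_cost(targets[i], u), k_best))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

# Toy example: each unit is described only by its F0 value.
targets = [100, 100]
candidates = [[98, 110], [125, 112]]
path, total = select_units(
    targets, candidates,
    target_cost=lambda t, u: abs(t - u),
    join_cost=lambda prev, u: abs(prev - u),
)
print(path, total)
```

In this toy case the unit closest to the first target (98) is not part of the best sequence, because joining it to either following candidate costs more than it saves. This is the reason the search has to consider whole sequences rather than picking each unit greedily.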
Joining Units
Unit selection, just like diphone synthesis, needs to join the units
– Pitch-synchronously
For diphone synthesis, F0 and duration need to be modified
– For unit selection, in principle the F0 and duration of the selected units also need to be modified
– But in practice, if the unit-selection database is big enough (commercial systems), no prosodic modifications are made (the selected units may already be close to the desired prosody)
Unit Selection Summary
Advantages:
– Quality is far superior to diphones
– Natural prosody selection sounds better
Disadvantages:
– Quality can be very bad in some places
  HCI problem: mix of very good and very bad is quite annoying
– Synthesis is computationally expensive
– Needs more memory than diphone synthesis
Rule-Based Approach Stages
Determine the speech model and the model parameters
Determine the type of phone units
Determine the parameter values for each phone unit
Substitute the sequence of phone units with its equivalent parameter sequence
Feed the parameter sequence into the speech model
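The stages above can be sketched with a very small cascade formant model. The resonator is the standard two-pole digital filter used in Klatt-style synthesizers. Everything else is an assumption made for illustration: the `FORMANTS` table holds rough textbook-style formant frequencies and guessed bandwidths for three vowels, the source is a plain impulse train, and the filter state is reset for each phone, so there is no smoothing between phones.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] for one formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bw_hz * t)
    b = 2 * math.exp(-math.pi * bw_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1 - b - c  # gives unity gain at 0 Hz
    return a, b, c

def resonate(x, freq_hz, bw_hz, sample_rate):
    """Filter the signal x through a single formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, num_samples, sample_rate):
    """Simplest voiced source: one impulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

# Illustrative (frequency Hz, bandwidth Hz) values per phone unit; assumed, not measured.
FORMANTS = {
    "a": [(730, 90), (1090, 110), (2440, 170)],
    "i": [(270, 60), (2290, 100), (3010, 200)],
    "u": [(300, 60), (870, 100), (2240, 170)],
}

def synthesize(phones, f0_hz=120, duration_s=0.1, sample_rate=16000):
    """Replace each phone by its parameters and run them through the model."""
    num_samples = round(duration_s * sample_rate)
    out = []
    for phone in phones:
        signal = impulse_train(f0_hz, num_samples, sample_rate)
        for freq_hz, bw_hz in FORMANTS[phone]:  # cascade of formant resonators
            signal = resonate(signal, freq_hz, bw_hz, sample_rate)
        out.extend(signal)
    return out

waveform = synthesize(["a", "i", "u"])
```

A full rule-based system would also interpolate the parameters between phones and add noise sources for unvoiced sounds.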
KLATT 80 Model
KLATT 88 Model
The KLSYN88 Cascade/Parallel Formant Synthesizer
[Figure: block diagram. Laryngeal sound sources (impulse train, KLGLOTT88 model (default), modified LF model, spectral tilt low-pass resonator, aspiration noise generator) feed a cascade vocal tract model (nasal pole-zero pair, tracheal pole-zero pair, first to fifth formant resonators) and a parallel vocal tract model for laryngeal sound sources (normally not used). A frication noise generator feeds a parallel vocal tract model for frication sound sources (second to sixth formant resonators and a bypass path).]