Speech Synthesis
Lecture 19, CS 753
Instructor: Preethi Jyothi
Project Preliminary Report

• The preliminary project report will contribute 5% towards your final grade. The deadline is 27th October, 2019.
• Define the following for your project: (5 points)
  1) Input-output behaviour of your system
  2) Evaluation metric
  3) At least two existing (or related) approaches to your problem
• Propose a model and an algorithm for the problem you're tackling and give detailed descriptions of both. Do not provide generic descriptions of the model; describe precisely how it applies to your problem. (5 points)
• Describe how much of your algorithm has been implemented. If you are using existing APIs/libraries, clearly demarcate which parts you will be implementing and for which parts you will rely on existing implementations. (5 points)
• Describe the experiments you are planning to run. If you have already run any preliminary experiments, describe them and report your initial results. (5 points)
Text-To-Speech (TTS) Systems: A Storied History

• Von Kempelen's speaking machine (1791)
  - Bellows simulated the lungs
  - Rubber mouth and nose; the nostrils had to be covered with two fingers for non-nasal sounds
• Homer Dudley's VODER (1939)
  - First device to synthesize speech sounds by electrical means
• Gunnar Fant's OVE formant synthesizer (1960s)
  - Formant synthesizer for vowels
• Computer-aided speech synthesis (1970s onwards)
  - Concatenative (unit selection)
  - Parametric (HMM-based and NN-based)

All images from http://www2.ling.su.se/staff/hartmut/kemplne.htm
Speech synthesis or TTS systems

• Goal of a TTS system: produce a natural-sounding, high-quality speech waveform for a given word sequence
• TTS systems are typically divided into two parts:
  A. Linguistic specification
  B. Waveform generation
Current TTS systems

• Constructed using a large amount of speech data; referred to as corpus-based TTS systems
• Two prominent instances of corpus-based TTS:
  1. Unit selection and concatenation
  2. Statistical parametric speech synthesis
Unit Selection Synthesis
Unit selection synthesis, or concatenative speech synthesis

• Synthesize new sentences by selecting sub-word units from a database of speech
• Optimal size of units? Diphones? Half-phones?

[Figure: all candidate segments in the database, linked by target costs and concatenation costs. Image from Zen et al., "Statistical Parametric Speech Synthesis", Speech Communication 2009]
Unit selection synthesis

• Target cost between a candidate unit $u_i$ and a target unit $t_i$:

$$C^{(t)}(t_i, u_i) = \sum_{j=1}^{p} w_j^{(t)} C_j^{(t)}(t_i, u_i)$$

• Concatenation cost between consecutive candidate units:

$$C^{(c)}(u_{i-1}, u_i) = \sum_{k=1}^{q} w_k^{(c)} C_k^{(c)}(u_{i-1}, u_i)$$

• Find the string of units that minimises the overall cost (a dynamic-programming search over the candidate lattice; see the sketch below):

$$\hat{u}_{1:n} = \arg\min_{u_{1:n}} C(t_{1:n}, u_{1:n})$$

$$C(t_{1:n}, u_{1:n}) = \sum_{i=1}^{n} C^{(t)}(t_i, u_i) + \sum_{i=2}^{n} C^{(c)}(u_{i-1}, u_i)$$
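This minimisation has the same structure as Viterbi decoding, so it can be solved exactly by dynamic programming over per-target candidate lists. Below is a minimal sketch in Python, assuming the caller supplies `candidates` (one list of database units per target) and the two cost functions; all names here are illustrative, not from the slides:

```python
import numpy as np

def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search for the unit sequence minimising the sum of
    target costs plus concatenation costs between adjacent units."""
    n = len(targets)
    # best[i][j]: cheapest cost of any path ending in candidates[i][j]
    best = [np.array([target_cost(targets[0], u) for u in candidates[0]])]
    back = []
    for i in range(1, n):
        tc = np.array([target_cost(targets[i], u) for u in candidates[i]])
        cc = np.array([[concat_cost(up, u) for u in candidates[i]]
                       for up in candidates[i - 1]])
        total = best[-1][:, None] + cc       # every prev -> cur transition
        back.append(total.argmin(axis=0))    # best predecessor per unit
        best.append(total.min(axis=0) + tc)
    # Trace back the optimal path from the cheapest final unit
    path = [int(best[-1].argmin())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

In practice the candidate lists are heavily pruned first; otherwise the per-step transition matrix becomes prohibitively large.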
Unit selection synthesis

• The target cost is pre-calculated using a clustering method

[Figure: clustered segments, with target costs and concatenation costs between units]
Statistical Parametric Speech Synthesis
Parametric Speech Synthesis Framework

[Figure: pipeline diagram — speech → speech analysis → O; text → text analysis → W; O and W are used to train the model λ; at synthesis time, text analysis, parameter generation and speech synthesis produce the waveform]

Training
• Estimate the acoustic model λ given speech utterances (O) and word sequences (W)*:

$$\hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)$$

* Here W could refer to any textual features relevant to the input text
Parametric Speech Synthesis Framework

Training
• Estimate the acoustic model (HMMs!) given speech utterances (O) and word sequences (W):

$$\hat{\lambda} = \arg\max_{\lambda} p(O \mid W, \lambda)$$

Synthesis
• Find the most probable ô from λ̂ and a given word sequence w to be synthesised:

$$\hat{o} = \arg\max_{o} p(o \mid w, \hat{\lambda})$$

• Synthesize speech from ô
HMM-based speech synthesis

[Figure: system overview. Training part: speech database → excitation parameter extraction and spectral parameter extraction → training of context-dependent HMMs and duration models from labels. Synthesis part: text → text analysis → labels → parameter generation from HMMs → excitation and spectral parameters → excitation generation and synthesis filter → synthesized speech]
Speech parameter generation

Generate the most probable observation vectors given the HMM and w:

$$\begin{aligned}
\hat{o} &= \arg\max_{o} p(o \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \\
&\approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda}) \, p(q \mid w, \hat{\lambda})
\end{aligned}$$

Determine the best state sequence and outputs sequentially:

$$\hat{q} = \arg\max_{q} p(q \mid w, \hat{\lambda})$$
$$\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda}) \quad \text{(let's explore this first)}$$
Determining state outputs

$$\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda}) = \arg\max_{o} \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$$

where $o = [o_1^\top, \ldots, o_T^\top]^\top$ is the state-output vector sequence to be generated, $\hat{q} = \{\hat{q}_1, \ldots, \hat{q}_T\}$ is a state sequence, $\mu_{\hat{q}} = [\mu_{\hat{q}_1}^\top, \ldots, \mu_{\hat{q}_T}^\top]^\top$ is the mean vector for $\hat{q}$, and $\Sigma_{\hat{q}} = \mathrm{diag}[\Sigma_{\hat{q}_1}, \ldots, \Sigma_{\hat{q}_T}]$ is the covariance matrix for $\hat{q}$.

What would ô look like? A Gaussian is maximised at its mean, so ô is simply the state means stacked in order: a piecewise-constant trajectory that jumps at every state boundary (illustrated in the sketch below).
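A tiny sketch of that step-wise behaviour (the state means and durations below are made-up numbers for illustration):

```python
import numpy as np

def mean_trajectory(state_seq, state_means):
    # argmax_o N(o; mu, Sigma) is attained at the mean, so the ML output
    # is just each frame's state mean, independent of the covariances.
    return np.stack([state_means[q] for q in state_seq])

# Three states with 1-dim outputs, occupying 4, 3 and 5 frames:
means = {0: np.array([1.0]), 1: np.array([3.0]), 2: np.array([2.0])}
seq = [0] * 4 + [1] * 3 + [2] * 5
print(mean_trajectory(seq, means).ravel())
# [1. 1. 1. 1. 3. 3. 3. 2. 2. 2. 2. 2.]  <- discontinuous at state boundaries
```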
Adding dynamic features to state outputs

State-output vectors contain both static ($c_t$) and dynamic ($\Delta c_t$) features:

$$o_t = [c_t^\top, \Delta c_t^\top]^\top, \quad \text{where the dynamic feature is calculated as } \Delta c_t = c_t - c_{t-1}$$

o and c can then be arranged in matrix form as o = Wc:

$$\begin{bmatrix} \vdots \\ c_{t-1} \\ \Delta c_{t-1} \\ c_t \\ \Delta c_t \\ c_{t+1} \\ \Delta c_{t+1} \\ \vdots \end{bmatrix}
=
\begin{bmatrix}
 & \vdots & \vdots & \vdots & \vdots & \\
\cdots & 0 & I & 0 & 0 & \cdots \\
\cdots & -I & I & 0 & 0 & \cdots \\
\cdots & 0 & 0 & I & 0 & \cdots \\
\cdots & 0 & -I & I & 0 & \cdots \\
\cdots & 0 & 0 & 0 & I & \cdots \\
\cdots & 0 & 0 & -I & I & \cdots \\
 & \vdots & \vdots & \vdots & \vdots &
\end{bmatrix}
\begin{bmatrix} \vdots \\ c_{t-2} \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}$$
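A minimal sketch of constructing W for this delta definition ($\Delta c_t = c_t - c_{t-1}$), assuming the static vector before the first frame is zero — a boundary convention the slides leave unspecified:

```python
import numpy as np

def build_window_matrix(T, D):
    """Return W such that o = W c, where c stacks T static D-dim vectors
    and o stacks [c_t; delta c_t] for every frame t."""
    W = np.zeros((2 * T * D, T * D))
    I = np.eye(D)
    for t in range(T):
        r = 2 * t * D                                   # row block of frame t
        W[r:r + D, t * D:(t + 1) * D] = I               # static row:  c_t
        W[r + D:r + 2 * D, t * D:(t + 1) * D] = I       # delta row:  +c_t
        if t > 0:
            W[r + D:r + 2 * D, (t - 1) * D:t * D] = -I  # delta row:  -c_{t-1}
    return W
```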
Speech parameter generation

• Introducing dynamic feature constraints:

$$\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda}) \quad \text{where } o = Wc$$

• If the output distributions are single Gaussians:

$$p(o \mid \hat{q}, \hat{\lambda}) = \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})$$

• Then, by setting $\partial \log \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0$, we get a linear system in the static features c:

$$W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$$
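Solving this linear system is the core of maximum-likelihood parameter generation. A minimal dense-matrix sketch, assuming diagonal state covariances (so Σ⁻¹ reduces to per-dimension precisions); real implementations exploit the band structure of W⊤Σ⁻¹W rather than doing a dense solve:

```python
import numpy as np

def mlpg(means, variances, W):
    """Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the static
    trajectory c. `means`/`variances` stack the state-output statistics
    (mu, diagonal of Sigma) frame by frame, matching W's rows."""
    prec = 1.0 / variances                 # diagonal Sigma^{-1}
    A = W.T @ (prec[:, None] * W)          # W^T Sigma^{-1} W
    b = W.T @ (prec * means)               # W^T Sigma^{-1} mu
    return np.linalg.solve(A, b)           # smooth static trajectory c

# Usage, together with the window matrix from the previous sketch:
#   c = mlpg(mu_qhat, sigma_qhat_diag, build_window_matrix(T, D))
```

Unlike the step-wise mean trajectory from before, the solution now trades off matching the static means against matching the delta means, yielding a smooth trajectory.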
Synthesis overview

[Figure: clustered states → merged states → sentence HMM; static and delta mean/variance sequences → Gaussian ML trajectory generation]
Speech parameter generation

Generate the most probable observation vectors given the HMM and w:

$$\begin{aligned}
\hat{o} &= \arg\max_{o} p(o \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \\
&\approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda}) \, p(q \mid w, \hat{\lambda})
\end{aligned}$$

Determine the best state sequence and outputs sequentially:

$$\hat{q} = \arg\max_{q} p(q \mid w, \hat{\lambda}) \quad \text{(let's explore this next)}$$
$$\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})$$
Duration modeling

• How are durations modelled within an HMM?
• Implicitly, by the state self-transition probabilities: the PMF of the duration d of state k is geometric,

$$p_k(d) = a_{kk}^{d-1} \, (1 - a_{kk})$$

[Figure: state duration probability $p_k(d)$ for $a_{kk} = 0.6$, plotted over d = 1, ..., 10 frames]

• State durations are determined by maximising:

$$\log P(d \mid \lambda) = \sum_{j=1}^{N} \log p_j(d_j)$$

• What would this solution look like if the PMFs of state durations are geometric distributions? Since $a_{kk}^{d-1}(1 - a_{kk})$ decreases monotonically in d, each term is maximised at $d_j = 1$: every state would last exactly one frame (checked numerically below), which is one motivation for explicit duration models.
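A quick numerical check of that claim, using the same $a_{kk} = 0.6$ as the slide's plot:

```python
import numpy as np

a_kk = 0.6
d = np.arange(1, 11)                  # durations 1..10 frames
p = a_kk ** (d - 1) * (1 - a_kk)      # geometric duration PMF
print(d[p.argmax()])                  # -> 1: the mode is always d = 1
# p decays monotonically with d, so maximising sum_j log p_j(d_j) term by
# term assigns every state a one-frame duration -- a poor duration model.
```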
Explicit modeling of state durations

• Each state duration is explicitly modelled as a single Gaussian. The mean ξ(i) and variance σ²(i) of the duration density of state i are:

$$\xi(i) = \frac{\displaystyle\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)}{\displaystyle\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)}$$

$$\sigma^2(i) = \frac{\displaystyle\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)\,(t_1 - t_0 + 1)^2}{\displaystyle\sum_{t_0=1}^{T} \sum_{t_1=t_0}^{T} \chi_{t_0,t_1}(i)} - \xi^2(i)$$

where

$$\chi_{t_0,t_1}(i) = (1 - \gamma_{t_0-1}(i)) \cdot \prod_{t=t_0}^{t_1} \gamma_t(i) \cdot (1 - \gamma_{t_1+1}(i))$$

and $\gamma_t(i)$ is the probability of being in state i at time t.
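These are just occupancy-weighted first and second moments of the interval length $t_1 - t_0 + 1$. A minimal sketch, assuming $\chi$ has been precomputed as a (T, T) array for one state (0-indexed here; entries with $t_1 < t_0$ are ignored):

```python
import numpy as np

def duration_stats(chi):
    """Mean and variance of one state's duration density, given
    chi[t0, t1] = prob. the state is occupied exactly over frames
    t0..t1 inclusive."""
    T = chi.shape[0]
    t0, t1 = np.meshgrid(np.arange(T), np.arange(T), indexing="ij")
    dur = (t1 - t0 + 1).astype(float)
    w = np.where(t1 >= t0, chi, 0.0)          # keep only valid intervals
    z = w.sum()                               # total occupancy mass
    xi = (w * dur).sum() / z                  # mean duration  xi(i)
    var = (w * dur ** 2).sum() / z - xi ** 2  # variance       sigma^2(i)
    return xi, var
```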