Lecture 14: LPC speech synthesis and autocorrelation-based pitch tracking
ECE 417, Multimedia Signal Processing
October 10, 2019
Outline
• The LPC-10 speech synthesis model
• Autocorrelation-based pitch tracking
• Inter-frame interpolation of pitch and energy contours
• The LPC-10 excitation model: white noise, pulse train
• Linear predictive coding: how to find the coefficients
• Linear predictive coding: how to make sure the coefficients are stable
The LPC-10 speech synthesis model
The LPC-10 Speech Coder: Transmitted Parameters
Each frame is 54 bits, and is used to synthesize 22.5 ms of speech: (54 bits/frame)/(0.0225 seconds/frame) = 2400 bits/second.
• Pitch: 7 bits/frame (127 distinguishable non-zero pitch periods)
• Energy: 5 bits/frame (32 levels, on a log RMS scale)
• 10 linear predictive coefficients (LPC): 41 bits/frame
• Synchronization: 1 bit/frame
The LPC-10 speech synthesis model
A binary switch, controlled by the voiced (P > 0) vs. unvoiced (P = 0) decision, selects the excitation $e[n]$:
• Voiced speech, pitch period P: $e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$
• Unvoiced speech: $e[n] \sim \mathcal{N}(0, 1)$ (white Gaussian noise)
The excitation is scaled by a gain G and passed through the vocal tract, modeled by an LPC synthesis filter $H(e^{j\omega})$, to produce the synthesized speech $s[n]$.
Outline
• The LPC-10 speech synthesis model
• Autocorrelation-based pitch tracking
• Inter-frame interpolation of pitch and energy contours
• The LPC-10 excitation model: white noise, pulse train
• Linear predictive coding: how to find the coefficients
• Linear predictive coding: how to make sure the coefficients are stable
Autocorrelation is maximum at n=0
$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-n]$$
Autocorrelation is maximum at n=0
$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-n] = x[n] * x[-n] = \mathcal{F}^{-1}\left\{|X(\omega)|^2\right\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\, d\omega$$
Notice that, for n = 0, this becomes just Parseval's theorem:
$$r_{xx}[0] = \sum_{m=-\infty}^{\infty} x^2[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\, d\omega$$
But since $|X(\omega)|^2$ is positive and real, any value of $e^{j\omega n}$ that is NOT positive and real will reduce the value of the integral:
$$r_{xx}[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\, d\omega \le \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\, d\omega = r_{xx}[0]$$
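A quick numerical sanity check of this inequality, as a numpy sketch (the frame x here is just synthetic noise, not data from the lecture):

```python
import numpy as np

# Check that the autocorrelation peaks at lag 0, where it equals the
# signal energy (Parseval's theorem).
rng = np.random.default_rng(0)
x = rng.standard_normal(400)                  # stand-in for one speech frame

r = np.correlate(x, x, mode="full")           # r_xx at lags -(N-1) ... N-1
lag0 = len(x) - 1                             # index of lag 0
print(np.argmax(r) == lag0)                   # True: maximum is at lag 0
print(np.isclose(r[lag0], np.sum(x ** 2)))    # True: r_xx[0] = sum_n x^2[n]
```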
Example of an autocorrelation function computed from file0.wav, “Four score and seven years ago…”
Autocorrelation of a periodic signal
Suppose x[n] is periodic, $x[n] = x[n-P]$. Then the autocorrelation is also periodic:
$$r_{xx}[P] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-P] = \sum_{m=-\infty}^{\infty} x^2[m] = r_{xx}[0]$$
Autocorrelation of a periodic signal is periodic
(Figure: waveform and its autocorrelation, each with pitch period = 9 ms = 99 samples.)
Autocorrelation pitch tracking
• Compute the autocorrelation
• Find the pitch period: $P = \arg\max_{P_{\min} \le n \le P_{\max}} r_{xx}[n]$
• The search limits $P_{\min}$ and $P_{\max}$ are important for good performance:
  • $P_{\min}$ corresponds to a high woman's pitch, about $F_s / P_{\min} \approx 250$ Hz
  • $P_{\max}$ corresponds to a low man's pitch, about $F_s / P_{\max} \approx 80$ Hz
The LPC-10 speech synthesis model
A binary switch, controlled by the voiced (P > 0) vs. unvoiced (P = 0) decision, selects the excitation $e[n]$:
• Voiced speech, pitch period P: $e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$
• Unvoiced speech: $e[n] \sim \mathcal{N}(0, 1)$ (white Gaussian noise)
The excitation is scaled by a gain G and passed through the vocal tract, modeled by an LPC synthesis filter $H(e^{j\omega})$, to produce the synthesized speech $s[n]$.
The voiced/unvoiced decision
• If $x[n]$ is voiced: $x[n+P] \approx x[n]$, so $r_{xx}[P] \approx r_{xx}[0]$
• If $x[n]$ is unvoiced (white noise): $E\left[x[m]\, x[m-n]\right] \approx \delta[n]$, which means that $r_{xx}[P] \ll r_{xx}[0]$
So a reasonable V/UV decision is:
• $\frac{r_{xx}[P]}{r_{xx}[0]} \ge$ threshold: say the frame is voiced.
• $\frac{r_{xx}[P]}{r_{xx}[0]} <$ threshold: say the frame is unvoiced.
Setting threshold ≈ 0.25 works reasonably well.
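Putting the pitch search and the V/UV decision together, a minimal per-frame sketch could look like this (the function name, argument names, and defaults are my own, not part of the LPC-10 standard):

```python
import numpy as np

def pitch_track(x, fs, f_high=250.0, f_low=80.0, threshold=0.25):
    """Autocorrelation pitch tracker for one frame x sampled at fs Hz.
    Returns the pitch period P in samples (P > 0 voiced, P = 0 unvoiced).
    Assumes the frame is longer than fs / f_low samples."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r_xx[n] for n >= 0
    p_min = int(fs / f_high)    # shortest allowed period: high pitch, ~250 Hz
    p_max = int(fs / f_low)     # longest allowed period: low pitch, ~80 Hz
    P = p_min + int(np.argmax(r[p_min:p_max + 1]))     # argmax over the search range
    if r[P] / r[0] >= threshold:                       # V/UV test on r_xx[P] / r_xx[0]
        return P
    return 0
```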
Outline
• The LPC-10 speech synthesis model
• Autocorrelation-based pitch tracking
• Inter-frame interpolation of pitch and energy contours
• The LPC-10 excitation model: white noise, pulse train
• Linear predictive coding: how to find the coefficients
• Linear predictive coding: how to make sure the coefficients are stable
Inter-frame interpolation of pitch contours
We don't want the pitch period to change suddenly at frame boundaries; it sounds weird.
(Figure: pitch period vs. sample number n, with frame boundaries marked.)
Inter-frame interpolation of pitch contours
Linear interpolation sounds much better. We can accomplish linear interpolation using a formula like
$$P[n] = (1-c)\, P_t + c\, P_{t+1}$$
where
• $P_t$ is the pitch period in frame t
• $c = \frac{n - tS}{S}$ is how far sample n is from the beginning of frame t
• S is the frame skip.
Inter-frame interpolation of energy
Linear interpolation is also useful for energy, EXCEPT: it sounds better if we interpolate log energy, not energy.
$$\log \mathrm{RMS}_t = \log \sqrt{\frac{1}{L} \sum_{n=tS}^{tS+L-1} x^2[n]}$$
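As an illustration, here is a sketch of both interpolations in one pass (the names are mine, and a real LPC-10 synthesizer would also handle voiced/unvoiced boundaries, which this ignores):

```python
import numpy as np

def interpolate_contours(P_frame, logrms_frame, S):
    """Expand per-frame pitch periods and log-RMS values into per-sample
    contours by linear interpolation (frame skip S in samples)."""
    T = len(P_frame)
    P = np.zeros((T - 1) * S)
    rms = np.zeros((T - 1) * S)
    for t in range(T - 1):
        for k in range(S):
            c = k / S                      # fraction of the way through frame t
            n = t * S + k
            P[n] = (1 - c) * P_frame[t] + c * P_frame[t + 1]
            # interpolate in the log domain, then convert back to linear RMS
            rms[n] = np.exp((1 - c) * logrms_frame[t] + c * logrms_frame[t + 1])
    return P, rms
```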
Outline
• The LPC-10 speech synthesis model
• Autocorrelation-based pitch tracking
• Inter-frame interpolation of pitch and energy contours
• The LPC-10 excitation model: white noise, pulse train
• Linear predictive coding: how to find the coefficients
• Linear predictive coding: how to make sure the coefficients are stable
The LPC-10 speech synthesis model
A binary switch, controlled by the voiced vs. unvoiced decision, selects the excitation $e[n]$:
• Voiced speech, pitch period P: $e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$
• Unvoiced speech: $e[n] \sim \mathcal{N}(0, 1)$ (white Gaussian noise)
The excitation is scaled by a gain G and passed through the vocal tract, modeled by an LPC synthesis filter $H(e^{j\omega})$, to produce the synthesized speech $s[n]$.
Unvoiced speech: e[n] = white noise
• Use zero-mean, unit-variance Gaussian white noise
• The choice to use "unvoiced speech" is communicated by the special code word "P=0"
(Figure: Gaussian white noise. By Morn, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24084756)
Voiced speech: e[n] = pulse train
• The basic idea:
$$e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$$
• Modification #1: in order for the RMS to equal 1.0, we need to scale each pulse by $\sqrt{P}$:
$$e[n] = \sqrt{P} \sum_{p=-\infty}^{\infty} \delta[n - pP]$$
Modification #2: the first pulse is not at n=0
(Figure: pitch period = 80 samples ⇒ the first pulse in frame 31 can't occur until the 70th sample of the frame.)
A mechanism for keeping track of pitch phase from one frame to the next
• Start out, at the beginning of the speech, with a pitch phase equal to zero: $\theta[0] = 0$
• For every sample thereafter:
  • If the sample is unvoiced (P[n] = 0), don't increment the pitch phase
  • If the sample is voiced (P[n] > 0), then increment the pitch phase: $\theta[n] = \theta[n-1] + \frac{2\pi}{P[n]}$
• Every time the phase passes a multiple of $2\pi$, output a pitch pulse:
$$e[n] = \begin{cases} \sqrt{P[n]} & \left\lfloor \frac{\theta[n]}{2\pi} \right\rfloor > \left\lfloor \frac{\theta[n-1]}{2\pi} \right\rfloor \\ 0 & \text{else} \end{cases}$$
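A sketch of this bookkeeping (function and argument names are mine; the unvoiced branch uses the white-noise excitation from the earlier slide):

```python
import numpy as np

def make_excitation(P, seed=0):
    """Generate the excitation from a per-sample pitch contour P[n]
    (P[n] = 0 means unvoiced): white noise when unvoiced, sqrt(P)-scaled
    pulses at 2*pi crossings of the pitch phase when voiced."""
    rng = np.random.default_rng(seed)
    e = np.zeros(len(P))
    theta = 0.0                                # pitch phase, theta[0] = 0
    for n in range(len(P)):
        if P[n] == 0:                          # unvoiced: unit-variance white noise
            e[n] = rng.standard_normal()
        else:                                  # voiced: advance the pitch phase
            theta_prev = theta
            theta += 2 * np.pi / P[n]
            # emit a pulse whenever the phase crosses a multiple of 2*pi
            if np.floor(theta / (2 * np.pi)) > np.floor(theta_prev / (2 * np.pi)):
                e[n] = np.sqrt(P[n])           # scaled so the RMS is about 1
    return e
```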
The pitch phase method: generate an excitation pulse whenever the pitch phase crosses a $2\pi$-level
(Figure: phase $\theta[n]$ rising past the levels $2\pi, 4\pi, 6\pi, 8\pi$ vs. sample number n, with the corresponding excitation pulses e[n] shown below.)
Outline
• The LPC-10 speech synthesis model
• Autocorrelation-based pitch tracking
• Inter-frame interpolation of pitch and energy contours
• The LPC-10 excitation model: white noise, pulse train
• Linear predictive coding: how to find the coefficients
• Linear predictive coding: how to make sure the coefficients are stable
Speech is predictable
• Speech is not just white noise and pulse train. In fact, each sample is highly predictable from the previous samples:
$$x[n] \approx \sum_{m=1}^{10} a_m x[n-m]$$
• In fact, the pitch pulses are the only major exception to this predictability!
Linear predictive coding (LPC)
The LPC idea:
1. Model the excitation as the prediction error:
$$e[n] = x[n] - \sum_{m=1}^{10} a_m x[n-m]$$
2. Force the coefficients $a_m$ to explain as much as they can, so that $e[n]$ is as close to zero as possible.
Linear predictive coding (LPC)
$$\varepsilon = E\left[e^2[n]\right] = E\left[\left(x[n] - \sum_{j=1}^{10} a_j x[n-j]\right)^2\right]$$
$$\frac{\partial \varepsilon}{\partial a_k} = -2\, E\left[x[n-k]\left(x[n] - \sum_{j=1}^{10} a_j x[n-j]\right)\right]$$
Setting $\frac{\partial \varepsilon}{\partial a_k} = 0$ gives
$$E\left[x[n-k]\, x[n]\right] = \sum_{j=1}^{10} a_j\, E\left[x[n-k]\, x[n-j]\right]$$
i.e., $R_{xx}[k] = \sum_{j=1}^{10} a_j R_{xx}[|j-k|]$.
Linear predictive coding (LPC)
So we have a set of linked equations, for $1 \le k \le 10$:
$$R_{xx}[k] = \sum_{j=1}^{10} a_j R_{xx}[|j-k|]$$
• We can write these 10 equations as a 10×10 matrix equation: $\vec{\gamma} = R\,\vec{a}$
• …which immediately gives the solution: $\vec{a} = R^{-1}\vec{\gamma}$
• …where
$$\vec{\gamma} = \begin{bmatrix} R_{xx}[1] \\ \vdots \\ R_{xx}[10] \end{bmatrix}, \quad
R = \begin{bmatrix} R_{xx}[0] & R_{xx}[1] & \cdots \\ R_{xx}[1] & R_{xx}[0] & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}, \quad
\vec{a} = \begin{bmatrix} a_1 \\ \vdots \\ a_{10} \end{bmatrix}$$
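A direct-solve sketch of these normal equations, using sample autocorrelations of one windowed frame in place of the expectations (names and order argument are mine):

```python
import numpy as np
from scipy.linalg import toeplitz

def lpc_coefficients(x, order=10):
    """Estimate a_1 ... a_order for one (windowed) frame x by solving
    the normal equations R a = gamma built from the frame's autocorrelation."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r[k] = R_xx[k], k >= 0
    R = toeplitz(r[:order])            # R[k, j] = R_xx[|j - k|]   (order x order)
    gamma = r[1:order + 1]             # gamma[k] = R_xx[k + 1]
    return np.linalg.solve(R, gamma)   # a = R^{-1} gamma
```

In practice the Toeplitz structure of R means this system is usually solved with the Levinson-Durbin recursion rather than a general matrix inverse.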
Outline
• The LPC-10 speech synthesis model
• Autocorrelation-based pitch tracking
• Inter-frame interpolation of pitch and energy contours
• The LPC-10 excitation model: white noise, pulse train
• Linear predictive coding: how to find the coefficients
• Linear predictive coding: how to make sure the coefficients are stable