Lecture 14: LPC speech synthesis and autocorrelation-based pitch tracking
  1. Lecture 14: LPC speech synthesis and autocorrelation- based pitch tracking ECE 417, Multimedia Signal Processing October 10, 2019

  2. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  3. The LPC-10 speech synthesis model

  4. The LPC-10 Speech Coder: Transmitted Parameters
Each frame is 54 bits and is used to synthesize 22.5 ms of speech: (54 bits/frame)/(0.0225 seconds/frame) = 2400 bits/second.
• Pitch: 7 bits/frame (127 distinguishable non-zero pitch periods)
• Energy: 5 bits/frame (32 levels, on a log-RMS scale)
• 10 linear predictive coefficients (LPC): 41 bits/frame
• Synchronization: 1 bit/frame
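A quick arithmetic check of the bit budget (a minimal Python sketch; the numbers come straight from the list above):

```python
# Sanity check of the LPC-10 bit budget and bit rate described above.
bits_per_frame = 7 + 5 + 41 + 1          # pitch + energy + 10 LPC coefficients + sync
frame_duration = 0.0225                  # seconds of speech synthesized per frame

assert bits_per_frame == 54
print(bits_per_frame / frame_duration)   # 2400.0 bits/second
```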

  5. The LPC-10 speech synthesis model
[Block diagram] A binary control switch selects the excitation, voiced (P > 0) vs. unvoiced (P = 0):
• Voiced speech, pitch period P: a pulse train, $e[n] = \sum_{m=-\infty}^{\infty} \delta[n - mP]$
• Unvoiced speech: white noise, $e[n] \sim \mathcal{N}(0,1)$
The excitation is scaled by the gain $G$ and filtered by the vocal tract, modeled by an LPC synthesis filter $H(e^{j\omega})$, to produce the speech signal $s[n]$.
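To make the block diagram concrete, here is a minimal per-frame sketch in Python, assuming NumPy/SciPy; `synthesize_frame` is an illustrative name, and this version ignores the inter-frame continuity (pitch phase, filter state) that later slides address:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(P, gain, lpc_coeffs, frame_len, rng=np.random.default_rng(0)):
    """One frame of the LPC-10 model: excitation -> gain -> LPC synthesis filter."""
    if P > 0:                                   # voiced: impulse train with period P
        e = np.zeros(frame_len)
        e[::P] = np.sqrt(P)                     # sqrt(P) scaling gives unit RMS
    else:                                       # unvoiced: unit-variance white noise
        e = rng.standard_normal(frame_len)
    # Vocal tract: all-pole synthesis filter 1 / (1 - sum_k a_k z^-k)
    a = np.concatenate(([1.0], -np.asarray(lpc_coeffs)))
    return lfilter([gain], a, e)
```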

  6. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  7. Autocorrelation is maximum at n = 0
$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-n]$$

  8. Autocorrelation is maximum at n = 0
$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-n] = x[n] * x[-n] = \mathcal{F}^{-1}\left\{|X(\omega)|^2\right\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\,d\omega$$
Notice that, for n = 0, this becomes just Parseval's theorem:
$$r_{xx}[0] = \sum_{m=-\infty}^{\infty} x^2[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\,d\omega$$
But since $|X(\omega)|^2$ is positive and real, any value of $e^{j\omega n}$ that is NOT positive and real will reduce the value of the integral:
$$r_{xx}[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\,d\omega \;\le\; \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\,d\omega = r_{xx}[0]$$
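A short NumPy sketch of this property (`autocorrelation` is an assumed helper name); the maximum lands at lag 0 even for random noise:

```python
import numpy as np

def autocorrelation(x):
    """r_xx[n] = sum_m x[m] x[m-n], computed for lags n = 0 .. len(x)-1."""
    x = np.asarray(x, dtype=float)
    full = np.correlate(x, x, mode="full")   # lags -(N-1) .. (N-1)
    return full[len(x) - 1:]                 # keep the non-negative lags

x = np.random.default_rng(0).standard_normal(1000)
r = autocorrelation(x)
assert np.argmax(r) == 0                     # the maximum is always at lag 0
```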

  9. Example of an autocorrelation function computed from file0.wav, “Four score and seven years ago…”

  10. Autocorrelation of a periodic signal
Suppose $x[n]$ is periodic, $x[n] = x[n-P]$. Then the autocorrelation is also periodic:
$$r_{xx}[P] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-P] = \sum_{m=-\infty}^{\infty} x^2[m] = r_{xx}[0]$$
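A tiny demonstration on a synthetic periodic signal (a 100 Hz sinusoid at an assumed 8 kHz sampling rate), showing the autocorrelation peak recurring at the period:

```python
import numpy as np

fs, f0 = 8000, 100                          # 100 Hz tone -> period P = 80 samples
n = np.arange(2048)
x = np.sin(2 * np.pi * f0 * n / fs)
r = np.correlate(x, x, mode="full")[len(x) - 1:]
print(np.argmax(r[40:400]) + 40)            # strongest peak away from lag 0: 80
```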

  11. Autocorrelation of a periodic signal is periodic
[Figure: pitch period = 9 ms = 99 samples.]

  12. Autocorrelation pitch tracking
• Compute the autocorrelation
• Find the pitch period: $P = \arg\max_{P_{\min} \le n \le P_{\max}} r_{xx}[n]$
• The search limits, $P_{\min}$ and $P_{\max}$, are important for good performance:
  • $P_{\min}$ corresponds to a high woman's pitch, about $F_s / P_{\min} \approx 250$ Hz
  • $P_{\max}$ corresponds to a low man's pitch, about $F_s / P_{\max} \approx 80$ Hz
[Figure: autocorrelation of a speech frame, with the search range from $P_{\min}$ to $P_{\max}$ marked.]
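A minimal sketch of this search, assuming `frame` is a NumPy array and `fs` is the sampling rate in Hz; `estimate_pitch_period` and the default limits are illustrative:

```python
import numpy as np

def estimate_pitch_period(frame, fs, f_min=80.0, f_max=250.0):
    """Pick the lag that maximizes the autocorrelation within [P_min, P_max]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    p_min = int(fs / f_max)      # shortest period = highest pitch (~250 Hz)
    p_max = int(fs / f_min)      # longest period = lowest pitch (~80 Hz)
    lag = p_min + np.argmax(r[p_min:p_max + 1])
    return lag, r[lag] / r[0]    # period in samples, and normalized peak height
```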

  13. The LPC-10 speech synthesis model
[Block diagram] A binary control switch selects the excitation, voiced (P > 0) vs. unvoiced (P = 0):
• Voiced speech, pitch period P: a pulse train, $e[n] = \sum_{m=-\infty}^{\infty} \delta[n - mP]$
• Unvoiced speech: white noise, $e[n] \sim \mathcal{N}(0,1)$
The excitation is scaled by the gain $G$ and filtered by the vocal tract, modeled by an LPC synthesis filter $H(e^{j\omega})$, to produce the speech signal $s[n]$.

  14. The voiced/unvoiced decision
• $x[n]$ voiced: $x[n+P] \approx x[n]$, so $r_{xx}[P] \approx r_{xx}[0]$
• $x[n]$ unvoiced (white noise): $E\left[x[m]\,x[m-n]\right] \approx \delta[n]$, which means that $r_{xx}[P] \ll r_{xx}[0]$
So a reasonable V/UV decision is:
• $\frac{r_{xx}[P]}{r_{xx}[0]} \ge$ threshold: say the frame is voiced.
• $\frac{r_{xx}[P]}{r_{xx}[0]} <$ threshold: say the frame is unvoiced.
Setting threshold ≈ 0.25 works reasonably well.
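The decision then reduces to one comparison of the normalized autocorrelation peak (a sketch; the helper name and default threshold are taken from the slide's suggestion):

```python
def voiced_unvoiced(r, P, threshold=0.25):
    """Voiced if the normalized autocorrelation peak r[P]/r[0] exceeds the threshold."""
    return (r[P] / r[0]) >= threshold   # True -> voiced, False -> unvoiced
```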

  15. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  16. Inter-frame interpolation of pitch contours
We don't want the pitch period to change suddenly at frame boundaries; it sounds weird.
[Figure: piecewise-constant pitch period vs. sample number n, with jumps at each frame boundary.]

  17. Inter-frame interpolation of pitch contours
Linear interpolation sounds much better. We can accomplish linear interpolation using a formula like
$$P[n] = (1-f)\,P_t + f\,P_{t+1}$$
where
• $P_t$ is the pitch period in frame t
• $f = \frac{n - tS}{S}$ is how far sample n is from the beginning of frame t
• $S$ is the frame skip.
[Figure: linearly interpolated pitch period vs. sample number n.]
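A sketch of the per-sample interpolation, assuming one pitch value per frame and frame skip `S`; it does not handle voiced/unvoiced transitions (frames with P = 0), which need special casing:

```python
import numpy as np

def interpolate_pitch(P_frames, S):
    """Linearly interpolate per-frame pitch periods P_t to a per-sample contour P[n]."""
    P_frames = np.asarray(P_frames, dtype=float)
    P = np.empty(S * (len(P_frames) - 1))
    for t in range(len(P_frames) - 1):
        f = np.arange(S) / S                          # fraction into frame t
        P[t * S:(t + 1) * S] = (1 - f) * P_frames[t] + f * P_frames[t + 1]
    return P
```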

  18. Inter-frame interpolation of energy
Linear interpolation is also useful for energy, EXCEPT: it sounds better if we interpolate log energy, not energy.
$$\log \mathrm{RMS}_t = \log \sqrt{\frac{1}{N}\sum_{n=tS}^{tS+N-1} x^2[n]}$$
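A corresponding sketch for energy: compute per-frame log-RMS, interpolate in the log domain, then exponentiate back to a per-sample gain (the function name and the small floor constant are assumptions):

```python
import numpy as np

def interpolate_log_rms(x, S, N):
    """Per-frame log-RMS, then linear interpolation in the log domain."""
    n_frames = (len(x) - N) // S + 1
    log_rms = np.array([np.log(np.sqrt(np.mean(x[t * S:t * S + N] ** 2)) + 1e-12)
                        for t in range(n_frames)])
    f = np.arange(S) / S
    gain = np.concatenate([(1 - f) * log_rms[t] + f * log_rms[t + 1]
                           for t in range(n_frames - 1)])
    return np.exp(gain)       # back to a linear amplitude for synthesis
```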

  19. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  20. The LPC-10 speech synthesis model
[Block diagram] A binary control switch selects the excitation, voiced vs. unvoiced:
• Voiced speech, pitch period P: a pulse train, $e[n] = \sum_{m=-\infty}^{\infty} \delta[n - mP]$
• Unvoiced speech: white noise, $e[n] \sim \mathcal{N}(0,1)$
The excitation is scaled by the gain $G$ and filtered by the vocal tract, modeled by an LPC synthesis filter $H(e^{j\omega})$, to produce the speech signal $s[n]$.

  21. Unvoiced speech: e[n] = white noise
• Use zero-mean, unit-variance Gaussian white noise
• The choice to use "unvoiced speech" is communicated by the special code word "P = 0"
[Image credit: By Morn - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24084756]

  22. Voiced speech: e[n] = pulse train
• The basic idea:
$$e[n] = \sum_{m=-\infty}^{\infty} \delta[n - mP]$$
• Modification #1: in order for the RMS to equal 1.0, we need to scale each pulse by $\sqrt{P}$:
$$e[n] = \sqrt{P} \sum_{m=-\infty}^{\infty} \delta[n - mP]$$
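A quick numerical check that the $\sqrt{P}$ scaling gives unit RMS:

```python
import numpy as np

P = 80
e = np.zeros(10 * P)
e[::P] = np.sqrt(P)                 # one pulse per period, scaled by sqrt(P)
print(np.sqrt(np.mean(e ** 2)))     # 1.0 -> unit RMS, as intended
```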

  23. Modification #2: the first pulse is not at n = 0
[Figure: excitation around the boundary between frames 30 and 31.] Pitch period = 80 samples ⇒ the first pulse in frame 31 can't occur until the 70th sample of the frame.

  24. A mechanism for keeping track of pitch phase from one frame to the next
• Start out, at the beginning of the speech, with a pitch phase equal to zero, $\phi[0] = 0$
• For every sample thereafter:
  • If the sample is unvoiced (P[n] = 0), don't increment the pitch phase
  • If the sample is voiced (P[n] > 0), then increment the pitch phase:
$$\phi[n] = \phi[n-1] + \frac{2\pi}{P[n]}$$
• Every time the phase passes a multiple of $2\pi$, output a pitch pulse:
$$e[n] = \begin{cases} \sqrt{P[n]} & \left\lfloor \frac{\phi[n]}{2\pi} \right\rfloor > \left\lfloor \frac{\phi[n-1]}{2\pi} \right\rfloor \\ 0 & \text{otherwise} \end{cases}$$
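A minimal sketch of this bookkeeping, assuming a per-sample pitch-period contour `P` (zero for unvoiced samples); the returned phase is what gets carried into the next frame:

```python
import numpy as np

def pulse_excitation(P, phase=0.0):
    """Generate e[n] from a per-sample pitch-period contour P[n] using a running phase."""
    e = np.zeros(len(P))
    for n, p in enumerate(P):
        if p > 0:                                # voiced: advance the pitch phase
            new_phase = phase + 2 * np.pi / p
            if np.floor(new_phase / (2 * np.pi)) > np.floor(phase / (2 * np.pi)):
                e[n] = np.sqrt(p)                # phase crossed a multiple of 2*pi
            phase = new_phase
        # unvoiced samples (p == 0): phase is held, no pulse
    return e, phase                              # carry phase into the next frame
```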

  25. The pitch phase method: generate an excitation pulse whenever the pitch phase crosses a $2\pi$ level
[Figure: pitch phase $\phi[n]$ vs. sample number n, with the levels $2\pi, 4\pi, 6\pi, 8\pi$ marked, and the excitation $e[n]$ showing a pulse each time $\phi[n]$ crosses one of those levels.]

  26. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  27. Speech is predictable
• Speech is not just white noise and a pulse train. In fact, each sample is highly predictable from previous samples:
$$x[n] \approx \sum_{k=1}^{10} a_k\, x[n-k]$$
• In fact, the pitch pulses are the only major exception to this predictability!

  28. Linear predictive coding (LPC)
The LPC idea:
1. Model the excitation as the prediction error:
$$e[n] = x[n] - \sum_{k=1}^{10} a_k\, x[n-k]$$
2. Force the coefficients $a_k$ to explain as much as they can, so that $e[n]$ is as close to zero as possible.

  29. Linear predictive coding (LPC)
$$\mathcal{E} = E\left[e^2[n]\right] = E\left[\left(x[n] - \sum_{j=1}^{10} a_j\, x[n-j]\right)^2\right]$$
$$\frac{\partial \mathcal{E}}{\partial a_k} = -2\,E\left[x[n-k]\left(x[n] - \sum_{j=1}^{10} a_j\, x[n-j]\right)\right]$$
Setting $\frac{\partial \mathcal{E}}{\partial a_k} = 0$ gives
$$E\left[x[n-k]\,x[n]\right] = \sum_{j=1}^{10} a_j\, E\left[x[n-k]\,x[n-j]\right]$$
i.e.,
$$R_{xx}[k] = \sum_{j=1}^{10} a_j\, R_{xx}\left[|j-k|\right]$$

  30. Linear predictive coding (LPC)
So we have a set of linked equations, for $1 \le k \le 10$:
$$R_{xx}[k] = \sum_{j=1}^{10} a_j\, R_{xx}\left[|j-k|\right]$$
• We can write these 10 equations as a 10×10 matrix equation: $\vec{\gamma} = R\,\vec{a}$
• …which immediately gives the solution: $\vec{a} = R^{-1}\vec{\gamma}$
• …where
$$\vec{\gamma} = \begin{bmatrix} R_{xx}[1] \\ \vdots \\ R_{xx}[10] \end{bmatrix}, \quad R = \begin{bmatrix} R_{xx}[0] & R_{xx}[1] & \cdots \\ R_{xx}[1] & R_{xx}[0] & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}, \quad \vec{a} = \begin{bmatrix} a_1 \\ \vdots \\ a_{10} \end{bmatrix}$$
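A sketch of the solution using frame autocorrelations and `scipy.linalg.solve_toeplitz` in place of an explicit matrix inverse; treat this as illustrative of the normal equations above rather than the LPC-10 standard's exact procedure:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(frame, order=10):
    """Solve the order-by-order Toeplitz system R a = gamma built from frame autocorrelations."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    gamma = r[1:order + 1]                   # right-hand side [R_xx[1], ..., R_xx[order]]
    return solve_toeplitz(r[:order], gamma)  # R has entries R_xx[|j-k|] (symmetric Toeplitz)

# Quick check on a synthetic predictable signal (an AR(2) process):
rng = np.random.default_rng(0)
x = lfilter([1.0], [1.0, -1.5, 0.7], rng.standard_normal(2000))
a = lpc_coefficients(x)
e = x - lfilter(np.concatenate(([0.0], a)), [1.0], x)   # e[n] = x[n] - sum_k a_k x[n-k]
print(np.var(e) / np.var(x))                            # far below 1: x is predictable
```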

  31. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable
