

SLIDE 1

EE E6820: Speech & Audio Processing & Recognition

Lecture 5: Speech modeling and synthesis

1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis

Dan Ellis <dpwe@ee.columbia.edu> http://www.ee.columbia.edu/~dpwe/e6820/

SLIDE 2

The speech signal

  • Speech sounds in the spectrogram
  • Elements of the speech signal:
    • spectral resonances (formants, moving)
    • periodic excitation (voicing, pitched) + pitch contour
    • noise excitation (fricatives, unvoiced, no pitch)
    • transients (stop-release bursts)
    • amplitude modulation (nasals, approximants)
    • timing!

[Figure: wideband spectrogram of the phrase "watch thin as a dime", labeled with phone boundaries]

SLIDE 3

The source-filter model

  • Notional separation of:
    • source: excitation, fine time-frequency structure
    • filter: resonance, broad spectral structure
  • More a modeling approach than a model

[Block diagram: a glottal pulse train and frication noise (the source, controlled by voiced/unvoiced and pitch) drive the vocal tract resonances and radiation characteristic (the filter, controlled by formants) to produce speech]

SLIDE 4

Signal modeling

  • Signal models are a kind of representation:
    • to make some aspect explicit
    • for efficiency
    • for flexibility
  • Nature of model depends on goal:
    • classification: remove irrelevant details
    • coding/transmission: remove perceptual irrelevance
    • modification: isolate control parameters
  • But commonalities emerge:
    • perceptually irrelevant detail (coding) will also be irrelevant for classification
    • the modification domain will usually reflect 'independent' perceptual attributes
    • getting at the abstract information in the signal

SLIDE 5

Different influences for signal models

  • Receiver:
    • see how the signal is treated by listeners
      → cochlea-style filterbank models
  • Transmitter (source):
    • the physical apparatus can generate only a limited range of signals
      → LPC models of vocal tract resonances
  • Making explicit particular aspects:
    • compact, separable resonance correlates → cepstrum
    • modeling prominent features of the narrowband spectrogram → sinusoid models
    • addressing unnaturalness in synthesis → Harmonic+Noise model

SLIDE 6

Applications of (speech) signal models

  • Classification / matching (goal: highlight important information)
    • speech recognition (lexical content)
    • speaker recognition (identity or class)
    • other signal classification
    • content-based retrieval
  • Coding / transmission / storage (goal: represent just enough information)
    • real-time transmission, e.g. mobile phones
    • archive storage, e.g. voicemail
  • Modification / synthesis (goal: change certain parts independently)
    • speech synthesis / text-to-speech (change the words)
    • speech transformation / disguise (change the speaker)

SLIDE 7

Outline

1 Modeling speech signals
2 Spectral and cepstral models
  • Auditorily-inspired spectra
  • The cepstrum
  • Feature correlation
3 Linear predictive models (LPC)
4 Other models
5 Speech synthesis

SLIDE 8

Spectral and cepstral models

  • The spectrogram seems like a good representation:
    • long history
    • satisfying in use
    • experts can 'read' the speech
  • What is the information?
    • intensity in time-frequency cells; typically 5 ms x 200 Hz x 50 dB
    → discarded information: phase, fine-scale timing
  • The starting point for other representations

SLIDE 9

The filterbank interpretation of the short-time Fourier transform (STFT)

  • Can regard spectrogram rows as coming from separate bandpass filters
  • Mathematically:

$$X[k, n_0] = \sum_n x[n] \cdot w[n - n_0] \cdot \exp\left(-j\,\frac{2\pi k (n - n_0)}{N}\right) = \sum_n x[n] \cdot h_k[n_0 - n]$$

where

$$h_k[n] = w[-n] \cdot \exp\left(j\,\frac{2\pi k n}{N}\right)$$

so each STFT row is the output of a bandpass filter whose frequency response is the window response shifted up to the bin center: $H_k(e^{j\omega}) = W(e^{j(\omega - 2\pi k/N)})$ (a numeric check follows below).
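The equivalence of the two views can be checked numerically. A minimal sketch, assuming an arbitrary test signal, a Hann window, and illustrative sizes (none of these values are from the lecture):

```python
import numpy as np

N, k = 256, 16                      # DFT size and bin index
x = np.random.randn(4096)           # arbitrary test signal
w = np.hanning(N)                   # analysis window w[m], m = 0..N-1

# Windowed-DFT view: X[k, n0] = sum_m x[n0+m] w[m] exp(-j 2 pi k m / N)
n0 = 1000
X_k = np.sum(x[n0:n0+N] * w * np.exp(-2j*np.pi*k*np.arange(N)/N))

# Filterbank view: h_k[n] = w[-n] exp(j 2 pi k n / N) is an anti-causal bandpass
# filter; store it over n = -(N-1)..0, so array index i holds h_k[i-(N-1)]
h = w[::-1] * np.exp(2j*np.pi*k*(np.arange(N) - (N-1))/N)
y = np.convolve(x, h)               # bandpass-filter the whole signal

print(np.allclose(y[n0 + N - 1], X_k))   # True: both views give the same number
```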

SLIDE 10

Spectral models: Which bandpass filters?

  • Constant bandwidth? (analog / FFT)
  • But: cochlea physiology & critical bandwidths
    → use actual bandpass filters in ear models & choose bandwidths by e.g. critical-band estimates
  • Auditory frequency scales:
    • constant 'Q' (center frequency / bandwidth), mel, Bark...
SLIDE 11

Gammatone filterbank

  • Given bandwidths, which filter shapes?
    • match the inferred temporal integration window
    • match the inferred spectral shape (sharp high-frequency slope)
    • keep it simple (since it's only approximate)
  → Gammatone filters:

$$h[n] = n^{N-1} \cdot \exp(-bn) \cdot \cos(\omega_i n)$$

  • 2N poles, 2 zeros, low complexity
  • reasonable linear match to the cochlea (a sketch of the impulse response follows below)

[Figures: gammatone impulse response vs. time; z-plane pole-zero plot; filterbank magnitude responses, dB vs. frequency on a log axis, 100 Hz to 5 kHz]
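A minimal sketch of the impulse response above. The ERB bandwidth formula and the 1.019 factor are standard published values (Glasberg & Moore; Patterson), but the sample rate, center frequency, duration, and normalization here are assumed for illustration:

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, dur=0.05):
    """Gammatone impulse response h[n] = n^(N-1) * exp(-b n) * cos(w_i n)."""
    n = np.arange(int(dur * fs))
    erb = 24.7 * (4.37 * fc / 1000 + 1)       # equivalent rectangular bandwidth at fc
    b = 2 * np.pi * 1.019 * erb / fs          # decay rate per sample
    w_i = 2 * np.pi * fc / fs                 # center frequency in radians/sample
    h = n ** (order - 1) * np.exp(-b * n) * np.cos(w_i * n)
    return h / np.max(np.abs(h))              # crude peak normalization

h = gammatone_ir(fc=1000.0, fs=16000.0)       # one channel of a filterbank
```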

SLIDE 12

Constant-BW vs. cochlea model

  • Magnitude smoothed over a 5-20 ms time window
  • Spectrograms: FFT-based wideband spectrogram (N = 128, linear frequency axis, 0-8 kHz) vs. a 4-pole, 2-zero cochlea model (Q = 4, downsampled @ 64, log frequency axis, 100 Hz-5 kHz)
  • Frequency responses: effective FFT filterbank vs. gammatone filterbank (gain in dB over 0-8 kHz)

[Figures: the two spectrograms and the two filterbank frequency-response plots described above]

SLIDE 13

Limitations of spectral models

  • Not much data thrown away:
    • just fine phase/time structure (smoothing)
    • little actual 'modeling'
    • still a large representation!
  • Little separation of features:
    • e.g. formants and pitch
  • Highly correlated features:
    • modifications affect multiple parameters
  • But, quite easy to reconstruct:
    • iterative reconstruction of lost phase
SLIDE 14

The cepstrum

  • Original motivation: assume a source-filter model
  • Define 'homomorphic deconvolution':
    • source-filter convolution: g[n] * h[n]
    • FT → product: G(e^{jω}) · H(e^{jω})
    • log → sum: log G(e^{jω}) + log H(e^{jω})
    • IFT → separate fine structure: c_g[n] + c_h[n] = deconvolution
  • Definition (real cepstrum, computed below):

$$c_n = \mathrm{idft}\big(\log\big|\mathrm{dft}(x[n])\big|\big)$$

[Figure: excitation source g[n], a pulse train in time, convolved with resonance filter H(e^{jω})]
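The definition translates directly to code. A minimal sketch, assuming a toy 200 Hz 'voiced' frame; the small floor inside the log is a numerical guard, not part of the definition:

```python
import numpy as np

def real_cepstrum(x):
    """c_n = IDFT(log |DFT(x)|): real cepstrum of one windowed frame."""
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)   # floor avoids log(0)
    return np.fft.ifft(log_mag).real                  # real, since log-mag is even

fs = 8000
n = np.arange(512)
frame = np.hanning(512) * np.sin(2 * np.pi * 200 * n / fs)  # toy voiced frame
c = real_cepstrum(frame)
# c shows a 'pitch pulse' near quefrency fs/200 = 40 samples (the pitch period)
```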

SLIDE 15

Stages in cepstral deconvolution

  • The original waveform has excitation fine structure convolved with resonances
  • The DFT shows harmonics modulated by resonances
  • The log DFT is the sum of a harmonic 'comb' and resonant bumps
  • The IDFT separates out the resonant bumps (low quefrency) from the regular fine structure (the 'pitch pulse')
  • Selecting the low-n cepstrum separates the resonance information (deconvolution / 'liftering')

[Figures: waveform and minimum-phase impulse response; |DFT| and liftered version; log |DFT| in dB and liftered version, 0-3 kHz; real cepstrum vs. quefrency with the lifter window and pitch pulse marked]

SLIDE 16

Properties of the cepstrum

  • Separate source (fine) & filter (broad structure):
    • smooth the log magnitude spectrum to get the resonances
  • Smoothing the spectrum is filtering along frequency:
    • i.e. convolution applied in the Fourier domain
    → multiplication in the IFT domain ('liftering', sketched below)
  • Periodicity in time → harmonics in the spectrum
    → 'pitch pulse' in the high-n cepstrum
  • Low-n cepstral coefficients are a DCT of the broad filter / resonance shape:

$$c_n = \frac{1}{2\pi} \int \log\big|X(e^{j\omega})\big| \,(\cos n\omega + j \sin n\omega)\, d\omega$$

(the sine term integrates to zero because log |X| is even, leaving a cosine-series expansion of the log spectrum)

[Figure: 5th-order cepstral reconstruction of the log spectrum from cepstral coefficients 0..5]
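A minimal liftering sketch: keep only the low-quefrency coefficients and transform back to get the smoothed spectral envelope. The frame length and lifter cutoff are assumed values:

```python
import numpy as np

def cepstral_envelope(frame, n_keep=20):
    """Smoothed log-magnitude spectrum from the low-n ('liftered') cepstrum."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    c = np.fft.ifft(log_mag).real
    lifter = np.zeros_like(c)
    lifter[:n_keep] = 1.0
    lifter[-(n_keep - 1):] = 1.0           # mirror: keep negative quefrencies too
    return np.fft.fft(c * lifter).real     # smooth resonance envelope of log|X|
```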

SLIDE 17

Aside: Correlation of elements

  • The cepstrum is popular in speech recognition:
    • feature vector elements are decorrelated
    • c0 'normalizes out' the average log energy
  • Decorrelated pdfs fit diagonal Gaussians:
    • modeling simple correlation is a waste of parameters
    • DCT is close to PCA for spectra?

[Figures: auditory spectrum vs. cepstral coefficients over frames; covariance matrices for both feature sets; example joint distribution of cepstral dimensions (10, 15)]

SLIDE 18

Outline

1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
  • The LPC model
  • Interpretation & application
  • Formant tracking
4 Other models
5 Speech synthesis

SLIDE 19

Linear predictive modeling (LPC)

  • LPC is a very successful speech model:
    • it is mathematically efficient (IIR filters)
    • it is remarkably successful for voice (fits the source-filter distinction)
    • it has a satisfying physical interpretation (resonances)
  • Basic math: model the output as a linear function of previous outputs (pth order), hence "linear prediction":

$$s[n] = \sum_{k=1}^{p} a_k \cdot s[n-k] + e[n]$$

  • e[n] is the excitation (input), a/k/a the prediction error; in the z-domain:

$$\frac{S(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$

  → all-pole modeling, 'autoregression' (AR) model (see the code sketch below)
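A minimal sketch of the AR relation with toy second-order coefficients (assumed values, chosen stable): filtering white noise through 1/A(z) synthesizes the signal, and filtering by A(z) recovers the excitation exactly:

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([1.3, -0.8])            # toy predictor coefficients a_1, a_2
A = np.concatenate(([1.0], -a))      # A(z) = 1 - a_1 z^-1 - a_2 z^-2

e = np.random.randn(1000)            # white excitation
s = lfilter([1.0], A, e)             # all-pole synthesis: S(z) = E(z) / A(z)

residual = lfilter(A, [1.0], s)      # inverse (whitening) filter A(z)
print(np.allclose(residual, e))      # True: prediction error == excitation
```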

SLIDE 20

Vocal tract motivation for LPC

  • Direct expression of the source-filter model:

$$s[n] = \sum_{k=1}^{p} a_k \cdot s[n-k] + e[n]$$

  • Acoustic tube models suggest an all-pole model for the vocal tract
  • Relatively slowly changing:
    • update A(z) every 10-20 ms
  • Not perfect: nasals introduce zeros

[Diagram: pulse/noise excitation e[n] drives the vocal tract filter H(z) = 1/A(z) to give s[n]; z-plane pole plot and the corresponding |H(e^{jω})| response]

SLIDE 21

Estimating LPC parameters

  • Minimize the short-time squared prediction error:

$$E = \sum_{n=1}^{m} e^2[n] = \sum_n \left( s[n] - \sum_{k=1}^{p} a_k s[n-k] \right)^2$$

  • Differentiate w.r.t. a_k and set to zero:

$$\sum_n 2 \left( s[n] - \sum_{j=1}^{p} a_j s[n-j] \right) \cdot \big( -s[n-k] \big) = 0$$

giving

$$\sum_n s[n]\,s[n-k] = \sum_j a_j \sum_n s[n-j]\,s[n-k] \quad\Leftrightarrow\quad \phi(0,k) = \sum_j a_j\, \phi(j,k)$$

where $\phi(j,k) = \sum_{n=1}^{m} s[n-j]\,s[n-k]$ are correlation coefficients

  • p linear equations to solve for all the a_j's...

SLIDE 22

Evaluating parameters

  • Linear equations: $\phi(0,k) = \sum_{j=1}^{p} a_j\, \phi(j,k)$
  • If s[n] is assumed zero outside some window, then $\phi(j,k) = \sum_n s[n-j]\,s[n-k] = r(|j-k|)$ depends only on |j - k|, and the equations become:

$$\begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{bmatrix} = \begin{bmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}$$

  • Toeplitz matrix (equal antidiagonals)
    → can use the Durbin recursion to solve (sketched below)
  • (solve the full $\phi(j,k)$ system via Cholesky decomposition)
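A sketch of the autocorrelation method solved by the Levinson-Durbin recursion. The algorithm itself is standard; the frame and model order below are assumed toy values:

```python
import numpy as np

def lpc(frame, p):
    """Predictor coefficients a_1..a_p, with A(z) = 1 - sum_k a_k z^-k."""
    # Autocorrelation r(0)..r(p); s[n] assumed zero outside the frame
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p)
    err = r[0]                                   # prediction-error energy
    for i in range(p):                           # Durbin recursion, order i -> i+1
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]      # update lower-order coefficients
        a = a_new
        err *= 1 - k * k                         # error energy shrinks each order
    return a

frame = np.hanning(200) * np.random.randn(200)   # hypothetical 25 ms frame @ 8 kHz
a = lpc(frame, p=10)
```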

SLIDE 23

LPC illustration

  • Actual poles shown in the z-plane

[Figures: windowed original waveform and LPC residual vs. time (samples); original spectrum, LPC spectrum, and residual spectrum in dB vs. frequency, 0-7 kHz; z-plane plot of the fitted poles]

SLIDE 24

Interpreting LPC

  • Picking out resonances:
    • if the signal really was a source driving all-pole resonances, LPC should find those resonances
  • Least-squares fit to the spectrum:
    • minimizing e²[n] in the time domain is the same as minimizing |E(e^{jω})|² (by Parseval)
    → close fit at spectral peaks; valleys don't matter
  • Removing smooth variation in the spectrum:
    • 1/A(z) is a low-order approximation to S(z), since S(z) = E(z) · 1/A(z)
    • hence the residual E(z) = A(z)·S(z) is a 'flat' version of S
  • Signal whitening:
    • white noise (independent x[n]s) has a flat spectrum
    → whitening removes temporal correlation
SLIDE 25

Alternative LPC representations

  • Many alternative p-dimensional representations:
    • coefficients {a_i}
    • roots {λ_i}: $A(z) = 1 - \sum_i a_i z^{-i} = \prod_i \big(1 - \lambda_i z^{-1}\big)$
    • line spectrum frequencies...
    • reflection coefficients {k_i} from the lattice form
    • tube-model log area ratios: $g_i = \log\dfrac{1 - k_i}{1 + k_i}$
  • Choice depends on:
    • mathematical convenience/complexity
    • quantization sensitivity
    • ease of guaranteeing stability
    • what is made explicit
    • distributions as statistics

SLIDE 26

LPC Applications

  • Analysis-synthesis (coding, transmission): S(z) = E(z) / A(z)
    • hence can reconstruct by filtering e[n] with the {a_i}s
    • the whitened, decorrelated, minimized e[n]s are easy to quantize
    • ...or can model e[n] e.g. as a simple pulse train
  • Recognition/classification:
    • the LPC fit responds to spectral peaks (formants)
    • can use for recognition (convert to cepstra?)
  • Modification:
    • separating source and filter supports cross-synthesis
    • the pole/resonance model supports 'warping' (e.g. male → female)
SLIDE 27

Aside: Formant tracking

  • Formants carry (most?) linguistic information
  • Why not classify formants directly → speech recognition?
    • e.g. local maxima in the cepstrally-liftered spectrum, or pole frequencies in the LPC fit
  • But: recognition needs to work in all circumstances
    • formants can be obscure or undefined
    → need more graceful, robust parameters

[Figures: original spectrogram (mpgr1_sx419) and noise-excited LPC resynthesis with pole frequencies overlaid, 0-4 kHz]

SLIDE 28

Outline

1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other models
  • Sinewave modeling
  • Harmonic+Noise model (HNM)
5 Speech synthesis

SLIDE 29

Other models: Sinusoid modeling

  • Early signal models required low complexity, e.g. LPC
  • Advances in hardware open new possibilities...
  • The narrowband spectrogram suggests a harmonics model:
    • the 'important' info in the 2-D surface is a set of tracks?
    • harmonic tracks have ~smooth properties
    • straightforward resynthesis

[Figure: narrowband spectrogram, 0-4 kHz, showing smooth harmonic tracks]

SLIDE 30

Sine wave models

  • Model sound as a sum of AM/FM sinusoids:

$$s[n] = \sum_{k=1}^{N[n]} A_k[n] \cdot \cos\big(n\,\omega_k[n] + \phi_k[n]\big)$$

    • A_k, ω_k, φ_k piecewise linear or constant
    • can enforce harmonicity: ω_k = k·ω_0
  • Extract parameters directly from STFT frames:
    • find local maxima of |S[k,n]| along frequency
    • track birth/death & correspondence (resynthesis is sketched below)

[Figure: sinusoid tracks in time-frequency, with magnitude]
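A minimal resynthesis sketch for one track, with piecewise-linear amplitude and frequency; phase continuity comes from integrating frequency, while matching measured phases (which needs the cubic interpolation discussed later) is ignored. All parameter values are assumed:

```python
import numpy as np

def synth_track(amps, freqs, fs, hop):
    """Resynthesize one partial from per-frame amplitude and frequency (Hz)."""
    frames = np.arange(len(amps)) * hop
    t = np.arange(frames[-1])
    a = np.interp(t, frames, amps)              # piecewise-linear amplitude
    f = np.interp(t, frames, freqs)             # piecewise-linear frequency
    phase = 2 * np.pi * np.cumsum(f) / fs       # integrate frequency -> phase
    return a * np.cos(phase)

fs, hop = 16000, 160                            # 10 ms frame hop
y = synth_track(np.linspace(1, 0, 50), np.linspace(440, 660, 50), fs, hop)
```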

SLIDE 31

Finding sinusoid peaks

  • Look for local maxima along a DFT frame:
    • i.e. |S[k-1,n]| < |S[k,n]| > |S[k+1,n]|
  • Want the exact frequency of the implied sinusoid:
    • the DFT is normally quantized quite coarsely, e.g. 4000 Hz / 256 bins = 15.6 Hz
    • interpolate at peaks via a quadratic fit? (sketched below)
    • may also need interpolated, unwrapped phase
  • Or, use the differential of phase along time (phase vocoder):

$$\omega = \frac{a\dot{b} - b\dot{a}}{a^2 + b^2}, \qquad \text{where } S[k,n] = a + jb$$

[Figure: three spectral samples around a local maximum with a quadratic fit giving the interpolated frequency and magnitude]
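A sketch of the quadratic (parabolic) peak fit on three log-magnitude samples; the fit formulas are standard, while the test tone and DFT size are assumed:

```python
import numpy as np

def interp_peak(mag, k):
    """Refine peak bin k of a magnitude spectrum via a parabolic fit in dB."""
    a, b, c = 20 * np.log10(mag[k-1:k+2])     # three points around the maximum
    delta = 0.5 * (a - c) / (a - 2*b + c)     # fractional-bin offset in (-.5, .5)
    height = b - 0.25 * (a - c) * delta       # interpolated peak magnitude (dB)
    return k + delta, height

fs, N = 8000, 256
x = np.hanning(N) * np.cos(2 * np.pi * 1003.7 * np.arange(N) / fs)
mag = np.abs(np.fft.rfft(x))
k = int(np.argmax(mag))
k_hat, _ = interp_peak(mag, k)
print(k_hat * fs / N)    # ~1003.7 Hz, much finer than the 31.25 Hz bin spacing
```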
SLIDE 32

Sinewave modeling applications

  • Modification (interpolation) & synthesis:
    • connecting arbitrary ω & φ requires cubic phase interpolation (because ω = φ̇)
  • Types of modification:
    • time & frequency scale modification, with or without changing the formant envelope
    • concatenation / smoothing of boundaries
    • phase realignment (for crest reduction)
  • Non-harmonic signals? OK-ish

[Figure: spectrogram with modified sinusoid tracks, 0-4 kHz]

SLIDE 33

Harmonics + noise model

  • Motivation to modify the sinusoid model because of:
    • problems with analysis of real (noisy) signals
    • problems with synthesis quality (esp. noise)
    • perceptual suspicions
  • Model:
    • sinusoids are forced to be harmonic
    • the remainder is filtered & time-shaped noise

$$s[n] = \sum_{k=1}^{N[n]} A_k[n] \cdot \cos\big(n \cdot k\,\omega_0[n]\big) \;+\; e[n] \cdot \big(h_n[n] \otimes b[n]\big)$$

  • A 'break frequency' Fm[n] divides the spectrum between harmonics (below) and noise (above)

[Figure: spectrum in dB, 0-3 kHz, split at the harmonicity limit Fm[n] into a harmonics region and a noise region]

SLIDE 34

HNM analysis and synthesis

  • Dynamically adjust Fm[n] based on a 'harmonic test' (one possible form is sketched below)
  • Noise has envelopes in time, e[n], and in frequency, Hn[k]:
    • reconstruct bursts / synchronize to pitch pulses

[Figures: harmonic tracks on a spectrogram, 0-4 kHz; noise frequency envelope Hn[k] in dB, 0-3 kHz; noise time envelope e[n] over ~30 ms]
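One plausible form of such a harmonic test, sketched under assumptions (an illustration, not the exact test from the HNM literature): for each harmonic of a known f0, compare the energy concentrated at the predicted harmonic bin against the total energy in the surrounding band, and stop raising Fm once the harmonic fraction drops:

```python
import numpy as np

def break_frequency(frame, f0, fs, thresh=0.5):
    """Highest k*f0 whose band energy is still dominated by the harmonic."""
    mag2 = np.abs(np.fft.rfft(frame)) ** 2
    bin_hz = fs / len(frame)
    fm = 0.0
    for k in range(1, int((fs / 2) / f0)):
        lo = int((k - 0.5) * f0 / bin_hz)          # band around harmonic k
        hi = int((k + 0.5) * f0 / bin_hz)
        band = mag2[lo:hi]
        if band.sum() == 0:
            break
        c = int(round(k * f0 / bin_hz))            # predicted harmonic bin
        harm = mag2[max(c - 1, lo):min(c + 2, hi)].sum()
        if harm / band.sum() < thresh:             # band has gone 'noisy'
            break
        fm = k * f0                                # harmonic k passes the test
    return fm

fs, L, f0 = 8000, 512, 250
n = np.arange(L)
frame = np.cos(2*np.pi*f0*n/fs) + 0.5*np.cos(2*np.pi*2*f0*n/fs) + 0.1*np.random.randn(L)
print(break_frequency(frame, f0, fs))   # ~500: above 2*f0 this frame is noise
```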

SLIDE 35

Outline

1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other models
5 Speech synthesis
  • Phone concatenation
  • Diphone synthesis

SLIDE 36

Speech synthesis

  • One thing you can do with models
  • Easier than recognition?
    • listeners do the work
    • ...but listeners are very critical
  • Overview of synthesis:
    • normalization disambiguates the text (abbreviations)
    • phonetic realization from a pronouncing dictionary
    • prosodic synthesis by rule (timing, pitch contour)
    • ...all of which controls the waveform generation

[Block diagram: text → text normalization → phoneme generation → prosody generation → synthesis algorithm → speech]

SLIDE 37

Source-filter synthesis

  • The flexibility of the source-filter model is ideal for speech synthesis
  • Excitation source issues:
    • voiced / unvoiced / mixture ([th] etc.)
    • pitch cycle of voiced segments
    • glottal pulse shape → voice quality?

[Diagram: a glottal pulse source and a noise source, selected by voiced/unvoiced and pitch info, drive a vocal tract filter controlled by phoneme info; example phone sequence "th ax k ae t"]

SLIDE 38

Vocal tract modeling

  • Simplest idea: store a single vocal-tract model for each phoneme
    • but: discontinuities are very unnatural
  • Improve by smoothing between templates:
    • the trick is finding the right domain

[Figure: phone sequence "th ax k ae t" rendered as stepwise spectral templates vs. time, then with smoothed transitions between templates]

SLIDE 39

Cepstrum-based synthesis

  • The low-n cepstrum is a compact model of the target spectrum:

$$c_n = \mathrm{idft}\big(\log\big|\mathrm{dft}(x[n])\big|\big)$$

  • Can invert it to get an actual vocal-tract impulse-response waveform (sketched below):

$$h[n] = \mathrm{idft}\big(\exp\big(\mathrm{dft}(c_n)\big)\big)$$

  • All-zero (FIR) vocal-tract response
    → can pre-convolve with glottal pulses
    • cross-fading between templates is OK

[Figure: glottal pulse inventory convolved with templates (ee, ae, ah), placed at pitch pulse times from the pitch contour]
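A minimal sketch of the inversion: lifter the cepstrum, exponentiate back to a magnitude spectrum, and take the IDFT to get a (zero-phase) FIR response. The frame, lifter cutoff, and centering are assumed details:

```python
import numpy as np

N = 512
frame = np.hanning(N) * np.random.randn(N)   # stand-in analysis frame

c = np.fft.ifft(np.log(np.abs(np.fft.fft(frame)) + 1e-12)).real
lifter = np.zeros(N)
lifter[:20] = lifter[-19:] = 1.0             # keep only low-quefrency terms

log_mag = np.fft.fft(c * lifter).real        # smoothed log target spectrum
h = np.fft.ifft(np.exp(log_mag)).real        # h[n] = idft(exp(dft(c_n)))
h = np.roll(h, N // 2)                       # center the zero-phase response
```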

SLIDE 40

LPC-based synthesis

  • Very compact representation of target spectra:
    • 3 or 4 pole pairs per template
  • Low-order IIR filter → very efficient synthesis (sketched below)
  • How to interpolate?
    • cannot just interpolate the a_i in a running filter
    • but: the lattice filter form has better-behaved interpolation
  • What to use for excitation:
    • the residual from the original analysis
    • a reconstructed periodic pulse train
    • parameterized residual resynthesis

[Diagrams: direct-form IIR synthesis filter e[n] → s[n] with coefficients a_1..a_3, and the equivalent lattice form with reflection coefficients k_0..k_{p-1}]
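A sketch of the classic pulse-train-excited LPC synthesizer, s[n] = e[n] filtered by 1/A(z); the coefficients and pitch here are assumed toy values rather than the result of a real analysis:

```python
import numpy as np
from scipy.signal import lfilter

fs, f0 = 8000, 100
a = np.array([1.2, -0.9])             # toy predictor coefficients (stable pair)
A = np.concatenate(([1.0], -a))       # A(z) = 1 - a_1 z^-1 - a_2 z^-2

e = np.zeros(fs // 2)                 # 0.5 s of excitation
e[::fs // f0] = 1.0                   # periodic pulses at f0 = 100 Hz

s = lfilter([1.0], A, e)              # all-pole synthesis: S(z) = E(z) / A(z)
```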

SLIDE 41

Diphone synthesis

  • Problems in phone-concatenation synthesis:
    • phonemes are context-dependent
    • coarticulation is complex
    • transitions are critical to perception
    → store transitions instead of just phonemes
  • ~40 phones → 800 diphones
    • or even more context if you have a larger database
  • How to splice diphones together?
    • TD-PSOLA: align pitch pulses and cross-fade
    • MBROLA: normalized, multiband

[Figure: the "watch thin as a dime" spectrogram segmented two ways: into phones, and into diphone segments spanning each transition]

SLIDE 42

HNM synthesis

  • High-quality resynthesis of real diphone units, plus a parametric representation for modifications:
    • pitch and timing modifications
    • removal of discontinuities at unit boundaries
  • Synthesis procedure:
    • linguistic processing gives phones, pitch, timing
    • database search gives the best-matching units
    • use HNM to fine-tune pitch & timing
    • cross-fade the Ak and ω0 parameters at boundaries
  • Careful preparation of the database is key:
    • sine models allow phase alignment of all units
    • a larger database improves the unit match

[Figure: harmonic tracks cross-faded across a unit boundary in time-frequency]

SLIDE 43

Generating prosody

  • The real factor limiting speech synthesis?
  • Waveform synthesizers have inputs for:
    • intensity (stress)
    • duration (phrasing)
    • fundamental frequency (pitch)
  • Curves are produced by superposition of (many) inferred linguistic rules:
    • phrase-final lengthening, unstressed shortening...
  • Or learn the rules from transcribed examples
SLIDE 44

Summary

  • Range of models:
    • spectral
    • cepstral
    • LPC
    • sinusoid
    • HNM
  • Range of applications:
    • general spectral shape (filterbank) → ASR
    • precise description (LPC + residual) → coding
    • pitch & time modification (HNM) → synthesis
  • Issues:
    • performance vs. computational complexity
    • generality vs. accuracy
    • representation size vs. quality