Automatic Speech Recognition (CS753)
Lecture 12: Acoustic Feature Extraction for ASR
Instructor: Preethi Jyothi
Feb 13, 2017
Speech Signal Analysis
[Figure: discrete samples grouped into "a frame"]
• Need to focus on short segments of speech (speech frames) that more or less correspond to a subphone and are stationary
• Each speech frame is typically 20-50 ms long
• Use overlapping frames with a frame shift of around 10 ms
Frame-wise processing
[Figure: frame size (25 ms) and frame shift (10 ms)]
Speech Signal Analysis
• Generate acoustic features corresponding to each speech frame
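The framing step can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `split_into_frames`, the 16 kHz sampling rate, and the choice to drop a trailing partial frame are all assumptions; the 25 ms frame size and 10 ms shift are the values from the frame-wise processing slide.

```python
def split_into_frames(samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a list of samples into overlapping frames.

    frame_ms / shift_ms default to the 25 ms / 10 ms values on the slides.
    Dropping the trailing partial frame is an assumption of this sketch.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of silent dummy audio; 16 kHz is an assumed sampling rate.
signal = [0.0] * 16000
frames = split_into_frames(signal, sample_rate=16000)
print(len(frames), len(frames[0]))
```

With these assumed settings each frame holds 400 samples and consecutive frames start 160 samples apart, so neighbouring frames overlap by 15 ms.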
Acoustic feature extraction for ASR
Desirable feature characteristics:
• Capture essential information about underlying phones
• Compress information into compact form
• Factor out information that's not relevant to recognition, e.g. speaker-specific information such as vocal-tract length, channel characteristics, etc.
• Would be desirable to find features that can be well-modelled by known distributions (Gaussian models, for example)
• Feature widely used in ASR: Mel-frequency Cepstral Coefficients (MFCCs)
MFCC Extraction
[Pipeline diagram: Sampled speech signal x(j) → Pre-emphasis → Windowing → DFT → Mel Filterbank → log → iDFT → y_t(j); with energy: (y_t(j), e_t); Derivatives: Δy_t(j), Δe_t, Δ²y_t(j), Δ²e_t]
Pre-emphasis
• Pre-emphasis increases the amount of energy in the high frequencies compared with lower frequencies
• Why? Because of spectral tilt
• In voiced speech, signal has more energy at low frequencies
• Due to the glottal source
• Boosting high frequency energy improves phone detection accuracy
Image credit: Jurafsky & Martin, Figure 9.9
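The slide does not give a formula for pre-emphasis. A common choice is a first-order high-pass filter y[n] = x[n] − α·x[n−1]; the sketch below assumes that form, and both the function name `pre_emphasize` and the coefficient α = 0.97 are assumptions rather than values from the lecture.

```python
def pre_emphasize(samples, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly used value (not from the slides).
    The first sample is passed through unchanged.
    """
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


# A constant (zero-frequency) signal is almost cancelled,
# while a rapidly alternating signal is boosted.
print(pre_emphasize([1.0, 1.0, 1.0, 1.0]))
print(pre_emphasize([1.0, -1.0, 1.0, -1.0]))
```

The two printed examples show the intended effect: slowly varying content is attenuated while rapidly varying content is amplified.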
Windowing
• Speech signal is modelled as a sequence of frames (assumption: stationary across each frame)
• Windowing: multiply the value of the signal at time n, s[n], by the value of the window at time n, w[n]: y[n] = w[n] s[n]
Rectangular: $w[n] = \begin{cases} 1 & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases}$
Hamming: $w[n] = \begin{cases} 0.54 - 0.46\cos\left(\frac{2\pi n}{L}\right) & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases}$
Windowing: Illustration
[Figure: rectangular window vs. Hamming window]
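The two window definitions translate directly into code. The sketch below follows the windowing slide's formulas, including L (not L − 1) in the Hamming denominator; the function names are illustrative assumptions.

```python
import math


def rectangular_window(length):
    """w[n] = 1 for 0 <= n <= L-1."""
    return [1.0] * length


def hamming_window(length):
    """w[n] = 0.54 - 0.46 * cos(2*pi*n / L) for 0 <= n <= L-1.

    This follows the slide's formula with L in the denominator;
    some libraries divide by L-1 instead.
    """
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]


def apply_window(frame, window):
    """y[n] = w[n] * s[n], sample by sample."""
    return [w * s for w, s in zip(window, frame)]


frame = [1.0] * 8
print(apply_window(frame, rectangular_window(8)))
print(apply_window(frame, hamming_window(8)))
```

Unlike the rectangular window, the Hamming window shrinks the samples near the start of the frame toward 0.08 of their value, which softens the discontinuity at the frame boundary.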
Discrete Fourier Transform (DFT)
Extract spectral information from the windowed signal: compute the DFT of the sampled signal
$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j \frac{2\pi}{N} kn}$
Input: windowed signal x[0], …, x[N−1]
Output: complex number X[k] giving magnitude/phase for the kth frequency component
Image credit: Jurafsky & Martin, Figure 9.12
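The DFT sum can be evaluated literally as written. The sketch below is a direct O(N²) evaluation for illustration only (a practical system would use an FFT); the function name `dft` and the test signal are assumptions.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of X[k] = sum_n x[n] * exp(-j * 2*pi * k * n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


# A cosine that completes exactly one cycle over 8 samples:
# its energy should land in bin k = 1 (and the mirrored bin k = 7).
x = [math.cos(2.0 * math.pi * n / 8) for n in range(8)]
X = dft(x)
print([round(abs(v), 6) for v in X])
```

Each X[k] is a complex number; `abs(X[k])` gives the magnitude and `cmath.phase(X[k])` the phase of the kth frequency component.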
Mel Filter Bank
• DFT gives energy at each frequency band
• However, human hearing is not equally sensitive at all frequencies: it is less sensitive at higher frequencies
• Warp the DFT output to the mel scale: mel is a unit of pitch such that sounds which are perceptually equidistant in pitch are separated by the same number of mels
Mels vs Hertz
Mel filterbank
• Mel frequency can be computed from the raw frequency f as: $\mathrm{mel}(f) = 1127 \ln\left(1 + \frac{f}{700}\right)$
• 10 filters spaced linearly below 1 kHz and remaining filters spread logarithmically above 1 kHz
[Figure: mel filterbank; amplitude vs. frequency (Hz), 0 to 4000 Hz; filter outputs form the mel spectrum]
Image credit: Jurafsky & Martin, Figure 9.13
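The mel formula and its inverse are easy to check numerically. In this sketch the function names `hz_to_mel` and `mel_to_hz` are assumptions; the inverse is obtained by solving the slide's formula for f.

```python
import math


def hz_to_mel(f):
    """mel(f) = 1127 * ln(1 + f / 700), as on the slide."""
    return 1127.0 * math.log(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel: f = 700 * (exp(m / 1127) - 1)."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)


for f in (0.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f, round(hz_to_mel(f), 1))
```

The printed values show the compression at high frequencies: going from 2000 Hz to 4000 Hz adds fewer mels than going from 0 Hz to 2000 Hz.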
Mel filterbank inspired by speech perception
Mel filterbank
• Take log of each mel spectrum value: 1) human sensitivity to signal energy is logarithmic; 2) log makes features robust to input variations
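A rough sketch of the filterbank-plus-log step follows. It departs from the slide in one labelled way: instead of 10 linearly spaced filters below 1 kHz and logarithmically spaced filters above, it assumes triangular filters whose edges are evenly spaced on the mel scale, a common simplification. The function names, the number of filters, the number of spectrum bins, and the log floor are all assumptions.

```python
import math


def hz_to_mel(f):
    return 1127.0 * math.log(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1127.0) - 1.0)


def triangular_filters(num_filters, num_bins, sample_rate):
    """Triangular filters with edges evenly spaced in mel (an assumption).

    Spectrum bin i is taken to sit at i * (sample_rate / 2) / (num_bins - 1) Hz.
    """
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = [i * nyquist / (num_bins - 1) for i in range(num_bins)]
    filters = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for f in bin_hz:
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        filters.append(weights)
    return filters


def log_mel_energies(power_spectrum, filters, floor=1e-10):
    """Weighted sum per filter, then log; floor avoids log(0) (an assumption)."""
    return [math.log(max(sum(w * p for w, p in zip(weights, power_spectrum)), floor))
            for weights in filters]


filters = triangular_filters(num_filters=10, num_bins=129, sample_rate=16000)
flat = [1.0] * 129
quiet = log_mel_energies(flat, filters)
loud = log_mel_energies([2.0 * p for p in flat], filters)
print([round(b - a, 4) for a, b in zip(quiet, loud)])
```

In this example, doubling the input power shifts every log energy by the same constant, ln 2, which is one way to see why the log makes the features less sensitive to input gain.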
Cepstrum: Inverse DFT
• Recall speech signals are created when a glottal source of a particular fundamental frequency passes through the vocal tract
• Most useful information for phone detection is the vocal tract filter (and not the glottal source)
• How do we deconvolve the source and filter to retrieve information about the vocal tract filter? Cepstrum
Cepstrum
• Cepstrum: spectrum of the log of the spectrum
[Figure panels: magnitude spectrum, log magnitude spectrum, cepstrum]
Image credit: Jurafsky & Martin, Figure 9.14
Cepstrum
• For MFCC extraction, we use the first 12 cepstral values
• The different cepstral coefficients tend to be uncorrelated
• Useful property when modelling using GMMs in the acoustic model: diagonal covariance matrices will suffice
• Cepstrum is formally defined as the inverse DFT of the log magnitude of the DFT of a signal:
$c[n] = \sum_{k=0}^{N-1} \log\left( \left| \sum_{m=0}^{N-1} x[m]\, e^{-j \frac{2\pi}{N} km} \right| \right) e^{j \frac{2\pi}{N} kn}$
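The formal definition can be evaluated directly: DFT, log magnitude, inverse DFT. The sketch below is a naive O(N²) illustration; the function name `real_cepstrum` is an assumption, and so are two details that the slide's formula does not include: the conventional 1/N factor in the inverse DFT, and a small floor that avoids log(0).

```python
import cmath
import math


def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log magnitude of the DFT of `frame`.

    Assumptions of this sketch: the inverse DFT carries the conventional 1/N
    factor, and magnitudes are floored at `floor` before taking the log.
    """
    N = len(frame)
    spectrum = [sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / N)
                    for m in range(N))
                for k in range(N)]
    log_mag = [math.log(max(abs(X), floor)) for X in spectrum]
    cepstrum = [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * n / N)
                    for k in range(N)) / N
                for n in range(N)]
    # The log magnitude spectrum of a real signal is symmetric,
    # so the imaginary parts are numerically zero and are dropped.
    return [c.real for c in cepstrum]


# A scaled impulse has a flat magnitude spectrum |X[k]| = 2,
# so all of its cepstral energy sits at n = 0.
c = real_cepstrum([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print([round(v, 6) for v in c])
```

For MFCC extraction only the first few values of such a sequence are kept; the slides use the first 12.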
MFCC Extraction
[Pipeline diagram: Sampled speech signal x(j) → Pre-emphasis → Windowing → DFT → Mel Filterbank → log → DCT → y_t(j); with energy: (y_t(j), e_t); Time derivatives: Δy_t(j), Δe_t, Δ²y_t(j), Δ²e_t]
Deltas and double-deltas
• From the cepstrum, use 12 cepstral coefficients for each frame
• 13th feature represents energy from the frame, computed as the sum of the power of the samples in the frame
• Also add features related to change in cepstral features over time to capture speech dynamics:
$\Delta_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}$
• Typical value for N is 2. c_{t+n} and c_{t-n} are static cepstral coefficients
• Add 13 delta features (Δ_t) and 13 double-delta features (Δ²_t)
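The delta formula can be applied frame by frame, and applying it a second time to the deltas gives the double-deltas. In the sketch below the function name `delta` and the handling of the first and last frames (repeating the edge frame) are assumptions; the slides do not say how edges are treated.

```python
def delta(features, N=2):
    """Delta features: sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2).

    `features` is a list of frames, each a list of coefficients.
    Frames beyond either end are replaced by the nearest edge frame
    (an assumption of this sketch).
    """
    T = len(features)
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    deltas = []
    for t in range(T):
        row = []
        for j in range(len(features[t])):
            num = 0.0
            for n in range(1, N + 1):
                ahead = features[min(t + n, T - 1)][j]
                behind = features[max(t - n, 0)][j]
                num += n * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas


# Six dummy frames of 13 static features (12 cepstral + energy),
# each feature rising by 1 per frame.
static = [[float(t)] * 13 for t in range(6)]
d = delta(static)
dd = delta(d)
full = [s + a + b for s, a, b in zip(static, d, dd)]
print(len(full[0]), d[2][0])
```

Concatenating the 13 static, 13 delta, and 13 double-delta values gives one 39-dimensional vector per frame.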
Recap: MFCCs
• Motivated by human speech perception and speech production
• For each speech frame
  ‣ Compute frequency spectrum and apply Mel binning
  ‣ Compute cepstrum using inverse DFT on the log of the mel-warped spectrum
  ‣ 39-dimensional MFCC feature vector: first 12 cepstral coefficients + energy + 13 delta + 13 double-delta coefficients
Other features
• Neural network-based: "Bottleneck features" (saw this in lecture 10)
• Train deep NN using conventional acoustic features
• Introduce a narrow hidden layer (e.g. 40 hidden units) referred to as the bottleneck layer
• Force neural network to encode relevant information in the bottleneck layer
• Use hidden unit activations in the bottleneck layer as features