Training neural network acoustic models on (multichannel) waveforms
Ron Weiss, SANE 2015, 2015-10-22
Joint work with Tara Sainath, Kevin Wilson, Andrew Senior, Arun Narayanan, Michiel Bacchiani, Oriol Vinyals, Yedid Hoshen
View this talk on YouTube: https://youtu.be/sI_8EA0_ha8
Outline

1. Review: Filterbanks
2. Waveform CLDNN
3. What do these things learn
4. Multichannel waveform CLDNN

References:
- Sainath, T. N., Weiss, R. J., Senior, A., Wilson, K. W., and Vinyals, O. (2015b). Learning the speech front-end with raw waveform CLDNNs. In Proc. Interspeech.
- Sainath, T. N., Weiss, R. J., Wilson, K. W., Narayanan, A., Bacchiani, M., and Senior, A. (2015c). Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms. In Proc. ASRU. To appear.
Acoustic modeling in 2015

[Figure: log-mel spectrogram (mel band vs. time in seconds) of the utterance "his captain was thin and haggard", with its phoneme-level alignment]

- Classify each 10ms audio frame into a context-dependent phoneme state
- Log-mel filterbank features are passed into a neural network
- Modern vision models are trained directly from the pixels; can we train an acoustic model directly from the samples?
Frequency domain filterbank: log-mel

Each waveform window passes through the pipeline:
  window (localization in time) -> FFT -> |.| (pointwise nonlinearity) -> mel (bandpass filtering) -> log (dynamic range compression) -> feature frame

- Bandpass filtering implemented using the FFT and mel warping
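The pipeline above can be sketched in a few lines of numpy. This is a minimal illustration, not the talk's exact front end: the HTK mel formula, 512-point FFT, Hann window, and log floor are all assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """HTK mel scale (an assumption; other mel variants exist)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters mapping |FFT| bins to n_mels bands."""
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - lo) / (ctr - lo)      # rising edge of triangle
        down = (hi - fft_freqs) / (hi - ctr)    # falling edge of triangle
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

def log_mel(waveform, sample_rate=16000, win_ms=25, hop_ms=10, n_mels=40):
    """window -> FFT -> |.| -> mel -> log, one feature frame per hop."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    n_fft = 512
    fb = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(waveform) - win + 1, hop):
        seg = waveform[start:start + win] * np.hanning(win)  # localization in time
        mag = np.abs(np.fft.rfft(seg, n_fft))                # FFT + pointwise |.|
        frames.append(np.log(fb @ mag + 1e-6))               # mel + log compression
    return np.array(frames)  # shape: (num_frames, n_mels)
```

One second of 16 kHz audio with 25ms windows hopped by 10ms yields 98 frames of 40 log-mel coefficients.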
Time-domain filterbank

Each band p of the waveform passes through:
  BP filter p -> nonlinearity -> smoothing/decimation -> log (or cube root) compression -> feature band p
(fine time structure is removed by the smoothing/decimation stage)

- Swaps the order of filtering and decimation, but basically the same thing as log-mel
- Cochleagrams, gammatone features for ASR (Schlüter et al., 2007)
Time-domain filterbank as a neural net layer

For each filter p applied to a windowed waveform segment:
  conv p -> ReLU -> max pool -> stabilized log -> feature f_p[n]

These are common neural network operations:
- (FIR) filter -> convolution
- nonlinearity -> rectified linear (ReLU) activation
- smoothing/decimation -> pooling

- Window the waveform into short (< 300ms) overlapping segments
- Pass each segment into the FIR filterbank to generate a feature frame
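The operations above can be sketched directly in numpy. In this sketch the filter taps are random rather than learned, each filter's output is max-pooled down to a single value per frame, and the 0.01 log stabilizer is an assumption:

```python
import numpy as np

def tconv_frame(segment, filters, stabilizer=0.01):
    """conv -> ReLU -> max pool -> stabilized log, one value per filter."""
    feats = []
    for h in filters:
        y = np.convolve(segment, h, mode='valid')  # FIR filtering as convolution
        y = np.maximum(y, 0.0)                     # ReLU nonlinearity
        pooled = y.max()                           # max pooling = smoothing/decimation
        feats.append(np.log(pooled + stabilizer))  # stabilized log compression
    return np.array(feats)

def tconv_features(waveform, filters, win, hop):
    """Window into overlapping segments; one feature frame per segment."""
    return np.array([tconv_frame(waveform[s:s + win], filters)
                     for s in range(0, len(waveform) - win + 1, hop)])
```

With 40 filters, a 25ms window, and a 10ms hop this produces the same (num_frames, 40) feature array shape as the log-mel front end it replaces.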
Previous work: Representation learning from waveforms

- Jaitly and Hinton (2011): unsupervised representation learning using a time-convolutional RBM; supervised DNN training on the learned features for phone recognition
- Tüske et al. (2014), Bhargava and Rose (2015): supervised training of a fully connected DNN, which learns similar filter shapes at different shifts
- Palaz et al. (2013, 2015b,a), Hoshen et al. (2015), Golik et al. (2015): supervised training, using convolution to share parameters across time shifts
- No improvement over a log-mel baseline on large vocabulary tasks in the above work
Deep waveform DNN (Hoshen et al., 2015)

Architecture: input (275ms) -> convolution (F filters, 25ms window; output F x 4401) -> max pooling (25ms window, 10ms step) -> nonlinearity (log(ReLU(...)); output F x 26) -> fully connected (4 layers, 640 units, ReLU activations) -> softmax (13568 classes)

- Parameters chosen to match the log-mel DNN: 40 filters, 25ms impulse response, 10ms hop
- Stack 26 frames of context using strided pooling: a 40x26 "brainogram"
- Adding stabilized log compression gave a 3-5% relative WER decrease
- Overall 5-6% relative WER increase compared to the log-mel DNN
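The shapes above can be checked with a little arithmetic; 16 kHz sampling and the 0.01 log stabilizer are assumptions here, not values stated on the slide:

```python
import numpy as np

# Strided-pooling frame count: a 25 ms window hopped by 10 ms over the
# F x 4401 convolution output yields the 26 frames of the 40x26 "brainogram".
sample_rate = 16000
conv_out_len = 4401                  # convolution output samples per input
pool_win = int(0.025 * sample_rate)  # 25 ms pooling window = 400 samples
pool_step = int(0.010 * sample_rate) # 10 ms pooling step = 160 samples
n_frames = (conv_out_len - pool_win) // pool_step + 1

def stabilized_log(x, stabilizer=0.01):
    """log(ReLU(...)) made numerically safe; the 0.01 floor is an assumption."""
    return np.log(np.maximum(x, 0.0) + stabilizer)
```

The stabilizer keeps the compression finite when a filter output is zero, which is what makes the log(ReLU(...)) nonlinearity trainable.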
CLDNN (Sainath et al., 2015a)

Combine all the neural net tricks: CLDNN = Convolution + LSTM + DNN
- Frequency convolution gives some pitch/vocal tract length invariance
- LSTM layers model long term temporal structure
- DNN learns a linearly separable function of the LSTM state
- 4-6% improvement over an LSTM baseline
- No need for extra frames of context in the input: the LSTM's memory can remember previous inputs
Waveform CLDNN (Sainath et al., 2015b)

Layers, from the raw waveform input (M samples) to the output targets:
- Time convolution (tConv) produces a 40-dim frame x_t ∈ R^P from a 35ms window (M = 561 samples), hopped by 10ms
- Frequency convolution (fConv) layer: 8x1 filter, 256 outputs, pooled by 3 without overlap; the 8x256 output is fed into a linear dimensionality reduction layer
- 3 LSTM layers: 832 cells/layer with a 512-dim projection layer
- DNN layer: 1024 nodes, ReLU activations, followed by a linear dimensionality reduction layer with 512 outputs

CLDNN similar to (Sainath et al., 2015a). Total of 19M parameters, 16K in tConv. All trained jointly with the tConv filterbank.
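The "16K in tConv" figure is consistent with 40 FIR filters of 25ms impulse response, the filter length carried over from the deep waveform DNN; that filter length at 16 kHz is an assumption, since the slide only gives the 35ms analysis window:

```python
# Parameter-count check for the tConv layer (assumed 25 ms filters at 16 kHz).
n_filters = 40
taps = int(0.025 * 16000) + 1   # 401 taps per 25 ms FIR filter
tconv_params = n_filters * taps # ~16K, a tiny fraction of the 19M total
```

So nearly all of the model's 19M parameters sit above the learned filterbank, in the fConv/LSTM/DNN stack.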
Experiments

US English Voice Search task, 13522 context-dependent state outputs
- Clean dataset: 3M utterances (~2k hours) train, 30k utterances (~20 hours) test
- MTR20 multicondition dataset: simulated noise and reverberation
  - SNR between 5-25dB (average ~20dB)
  - RT60 between 0-400ms (average ~160ms)
  - Target-to-mic distance between 0-2m (average ~0.75m)
- Asynchronous SGD training, optimizing a cross-entropy loss
Compared to log-mel (Sainath et al., 2015b)

Train/test set   Feature            WER
Clean            log-mel            14.0
Clean            waveform           13.7
MTR20            log-mel            16.2
MTR20            waveform           16.2
MTR20            waveform+log-mel   15.7

- Matches the performance of the log-mel baseline in clean and moderate noise
- 3% relative improvement from stacking log-mel features with the tConv output
How important are LSTM layers? (Sainath et al., 2015b)

MTR20 WER
Architecture   log-mel   waveform
D6             22.3      23.2
F1L1D1         17.3      17.8
F1L2D1         16.6      16.6
F1L3D1         16.2      16.2

- Fully connected DNN: waveform is 4% worse than log-mel
- Log-mel outperforms waveform with one or zero LSTM layers
- The time convolution layer gives short-term shift invariance, but seems to need recurrence to model longer time scales
Bring on the noise (Sainath et al., 2015c)

MTR12: noisier version of MTR20; 12dB average SNR, 600ms average RT60, more farfield

MTR12 WER
Num filters   log-mel   waveform
40            25.2      24.7
84            25.0      23.7
128           24.4      23.5

- Waveform consistently outperforms log-mel in high noise
- Larger improvements with more filters
Filterbank magnitude responses

[Figure: magnitude responses (frequency in kHz vs. filter index) of the mel filterbank (left) and the trained filterbank (right)]

- Sort filters by the index of the frequency band with peak magnitude
- Looks mostly like an auditory filterbank: mostly bandpass filters, with bandwidth increasing with center frequency
- Consistently higher resolution in low frequencies: 20 filters below 1kHz vs ~10 in mel
- Somewhat consistent with an ERB auditory frequency scale
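One way to see the ERB consistency: space 40 filter centers uniformly on each auditory scale up to 8 kHz (the Nyquist frequency for 16 kHz audio, an assumption) and count how many land below 1 kHz. The exact counts depend on the assumed frequency range and scale variant, so only the qualitative comparison matters: ERB spacing allocates noticeably more filters to low frequencies than mel spacing, as the trained filterbank does.

```python
import numpy as np

def mel(f):
    """HTK mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def erb(f):
    """ERB-rate scale (Glasberg & Moore)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def centers_below(scale, cutoff_hz, n_filters=40, fmax_hz=8000.0):
    """Count filters whose center lies below cutoff_hz when n_filters
    centers are spaced uniformly on the given scale from 0 to fmax_hz."""
    centers = np.linspace(0.0, scale(fmax_hz), n_filters + 2)[1:-1]
    return int(np.sum(centers < scale(cutoff_hz)))

mel_low = centers_below(mel, 1000.0)  # filters below 1 kHz on the mel scale
erb_low = centers_below(erb, 1000.0)  # filters below 1 kHz on the ERB scale
```

Under these assumptions the ERB count comes out close to the 20 low-frequency filters observed in the trained filterbank, while the mel count is substantially smaller.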