Phoneme state posteriorgram features for speech based automatic classification of speakers in cold and healthy condition
Akshay Kalkunte Suresh, Srinivasa Raghavan K M, Dr. Prasanta Kumar Ghosh
SPIRE LAB, Electrical Engineering, Indian Institute of Science (IISc), Bangalore, India
SPIRE LAB, IISc, Bangalore
1 January 2017
Overview
1 Introduction
2 Our hypothesis
3 Phoneme state posteriorgram features
4 Observation
5 Experiments
   Effect of feature selection
   e2e model
   Effect of corpora
   Decision fusion
6 Conclusion
Introduction
Problem statement
Given a database of speech samples recorded from speakers who are either healthy or suffering from a common cold, the task is to automatically classify each sample as cold-affected or healthy speech.
Why do we need to do this?
Detecting the presence of a common cold from speech could find applications in healthcare. It could also help improve the accuracy of automatic speech and speaker recognition systems.
Introduction
Illustration
Identifying whether a speaker has a cold from the speech signal
"Design your own wardrobe": example 1 (non-cold), example 2 (cold)
"They wandered away": example 3 (non-cold), example 4 (cold)
Introduction
Frequency domain perspective
[Figure: frequency-domain plots of "Design your own wardrobe" for (a) example 1, non-cold and (b) example 2, cold; and of "They wandered away" for (a) example 3, non-cold and (b) example 4, cold.]
Introduction
Speech signal production perspective
Congestion in the nasal and vocal cavities during a cold could affect speech.
Introduction
Previous works
Studies by Tull et al. [1] reveal differences in formant patterns, nasality parameters and mel-cepstral coefficients between normal and cold speech.
Shan et al. [2] observed variations in the energy levels at lower and higher frequency bands and, using mel-frequency cepstral coefficients (MFCCs), found improvement in speaker recognition systems.
P. Rose [3] pointed out that a cold is often accompanied by inflammation and swelling of the nasal cavity, which changes its volume and shape, affects the nasal modulation of the sound source excitation signal, and causes the speaker's voice to change.
[1] Tull, "Investigating The Common Cold To Improve Speech Technology"
[2] Shan and Zhu, "Speaker Identification Under The Changed Sound Environment"
[3] Rose, Forensic Speaker Identification
Our hypothesis
We hypothesize that the change in voice quality caused by a common cold could result in lower likelihoods under a model built from normal, healthy speech.
We also hypothesize that some phonemes are affected to a greater extent than others.
Phoneme state posteriorgram features
Steps to compute phoneme state posteriorgram (PSP) features
Computing the PSP features involves the following stages:
- Acoustic feature (MFCC) extraction.
- Training Gaussian mixture models (GMMs) from non-cold speech.
- Computing likelihoods of the features under the GMMs.
- Computing functionals.
Phoneme state posteriorgram features
Feature Extraction
Each speech utterance l is divided into $N_l$ frames using a 25 ms window shifted by 10 ms.
A 13-dim MFCC vector is obtained for each frame.
Velocity and acceleration features are appended to obtain a 39-dim feature vector.
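The framing and stacking arithmetic above can be sketched as follows. This is a minimal illustration, not the authors' front end: the 16 kHz sample rate, the absence of padding, the two-point central-difference deltas, and the placeholder MFCC values are all assumptions made here for the example.

```python
def num_frames(num_samples, sample_rate, win_ms=25.0, shift_ms=10.0):
    """Number of full analysis frames N_l (no padding assumed)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def deltas(frames):
    """Central-difference deltas with edge replication (illustrative choice)."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_velocity_acceleration(mfcc):
    """Stack static, velocity and acceleration features: 13 -> 39 dims per frame."""
    vel = deltas(mfcc)
    acc = deltas(vel)
    return [s + v + a for s, v, a in zip(mfcc, vel, acc)]


# One second of audio at an assumed 16 kHz sample rate.
n_l = num_frames(16000, 16000)
# Placeholder 13-dim "MFCC" vectors; a real front end would compute these.
mfcc = [[float(t)] * 13 for t in range(n_l)]
features = add_velocity_acceleration(mfcc)
print(n_l, len(features[0]))
```

Toolkits usually compute deltas with a regression window over several frames; the two-point difference is used here only to keep the sketch short.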
Phoneme state posteriorgram features
Gaussian Mixture Models from non-cold speech
We train phonetic three-state hidden Markov models (HMMs) on the non-cold speech data.
The GMMs for the HMM states are denoted by $G_1, G_2, \dots, G_n$.
Phoneme state posteriorgram features
Likelihoods of features from Gaussian Mixture Models
The parameters of the i-th GMM ($G_i$) are given by $\lambda_i = \{ w_{ij}, \mu_{ij}, \Sigma_{ij} : j = 1, \dots, 256 \}$, where $w_{ij}$ is the weight of the j-th component, and $\mu_{ij}$ and $\Sigma_{ij}$ are its mean vector and diagonal covariance matrix.
Given a 39-dim acoustic feature vector $x_k$, the log likelihoods under $G_1, G_2, \dots, G_n$ are computed as follows:
\[
L_i(k) = \log P(x_k \mid G_i) = \log \sum_{j=1}^{256} w_{ij} \, \mathcal{N}(x_k; \mu_{ij}, \Sigma_{ij}), \quad 1 \le i \le n
\]
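The log likelihood of one frame under a diagonal-covariance GMM can be sketched as below. The function names and the toy one-dimensional, two-component mixture are assumptions for illustration; the slides use 256 components and 39-dim vectors. The log-sum-exp step is a standard numerical safeguard and is not mentioned in the source.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (variances in var)."""
    total = 0.0
    for xd, md, vd in zip(x, mean, var):
        total += math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd
    return -0.5 * total


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_j w_j N(x; mu_j, Sigma_j), using log-sum-exp for stability.

    Assumes every weight is strictly positive.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def frame_log_likelihoods(x, gmms):
    """Return [L_1(k), ..., L_n(k)] for one frame x_k.

    Each entry of gmms is a (weights, means, variances) tuple.
    """
    return [gmm_log_likelihood(x, *g) for g in gmms]


# Toy example: two identical standard-normal components in one dimension.
toy_gmm = ([0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
scores = frame_log_likelihoods([0.0], [toy_gmm, toy_gmm])
print(scores)
```

Applying `frame_log_likelihoods` to every frame of an utterance would give the sequence of n-dim log likelihood vectors that later stages summarize.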
Phoneme state posteriorgram features
Computing functionals
The n-dim log likelihood vectors computed for all frames of utterance l are passed through the functionals block to obtain a single (n × 43)-dim vector.
The functionals block computes 43 openSMILE [4] functionals over all frames for each of the n dimensions.
[4] Eyben, Wöllmer, and Schuller, "Opensmile: the munich versatile and fast open-source audio feature extractor"
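The way functionals collapse a variable number of frames into a fixed-length vector can be sketched as follows. The five statistics used here are stand-ins chosen for illustration, not the 43 openSMILE functionals; with F functionals per dimension the output has n × F entries, so n = 120 and F = 43 would give 5160.

```python
import statistics


def functionals(track):
    """A few summary statistics over one log-likelihood trajectory.

    Stand-ins for illustration only; not the openSMILE functional set.
    """
    return [
        statistics.fmean(track),
        statistics.pstdev(track),
        min(track),
        max(track),
        max(track) - min(track),
    ]


def utterance_vector(loglik_frames):
    """Map an (N_l x n) list of per-frame vectors to one (n * F)-dim vector."""
    n = len(loglik_frames[0])
    vector = []
    for i in range(n):
        track = [frame[i] for frame in loglik_frames]
        vector.extend(functionals(track))
    return vector


# Toy utterance: N_l = 2 frames, n = 2 phonetic classes.
frames = [[-1.0, -4.0], [-3.0, -6.0]]
psp = utterance_vector(frames)
print(len(psp), psp[:5])
```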
Observation
Average Likelihood Plot
We plot the average log likelihoods of all cold and non-cold utterances from the training set of the URTIC speech corpus across the GMMs of 120 phonetic classes, taken from an acoustic model trained on TIMIT + Boston University Radio News (BN).
The number of phonetic classes is n = 3 × m = 3 × 40 = 120, where m is the number of phonemes.
On average, cold speech features yield lower likelihoods under the GMM of each phoneme state than non-cold speech features.
Observation
We mark the top 10 phonetic classes in the average likelihood plot.
The phonemes with the ten largest differences in likelihood are AA, EH, V, DH, IY, AX, JH, W, T, NG.
The nasal sound NG appears among the ten most discriminating phonemes, which is consistent with the change in the nasal cavity caused by a cold.
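Ranking phonetic classes by the gap between average non-cold and cold log likelihoods can be sketched as below. The class labels and numbers are made up for illustration, and the exact ranking procedure behind the plot is not stated in the slides, so this is only one plausible reading.

```python
def mean_per_class(utterances):
    """Average the per-class log likelihoods over a list of utterance vectors."""
    n = len(utterances[0])
    return [sum(u[i] for u in utterances) / len(utterances) for i in range(n)]


def top_classes(non_cold, cold, labels, k=10):
    """Return the k classes with the largest (non-cold minus cold) gap."""
    gaps = [h - c for h, c in zip(mean_per_class(non_cold), mean_per_class(cold))]
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return [(labels[i], gaps[i]) for i in order[:k]]


# Hypothetical labels and values: three classes, two utterances per condition.
labels = ["AA_1", "NG_2", "T_3"]
non_cold = [[-50.0, -60.0, -55.0], [-52.0, -62.0, -55.0]]
cold = [[-58.0, -70.0, -56.0], [-56.0, -72.0, -56.0]]
ranked = top_classes(non_cold, cold, labels, k=2)
print(ranked)
```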
Experiments
Overview of Experimental Results
We report results obtained using the proposed 5160-dim PSP features and an end-to-end (e2e) model, and discuss the effect of feature selection, the effect of corpora, and decision fusion.
We use unweighted average recall (UAR) as the metric to compare models because, unlike accuracy, it is not dominated by the majority class when classes are imbalanced.
We also consider the baseline results of the Interspeech 2017 Cold Sub-Challenge.
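UAR is the mean of the per-class recalls, so each class counts equally regardless of its size. A minimal sketch follows; the labels "NC" and "C" and the toy predictions are assumptions for illustration.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class contributes equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Imbalanced toy set: 8 non-cold ("NC") and 2 cold ("C") samples.
y_true = ["NC"] * 8 + ["C"] * 2
# A classifier that always predicts the majority class.
y_pred = ["NC"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
print(accuracy, uar)
```

In this toy case the majority-class predictor reaches 80% accuracy but only 50% UAR, which is chance level for two classes.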
Experiments
Effect of feature selection
Scores for cold speech classification (UAR %)

Model                            Dev     Test
ComParE functionals (baseline)   64.00   70.20
PSP (5160-dim)                   64.00   61.09
PSP (473-dim)                    63.60   -
PSP (500-dim)                    63.50   -
Experiments
Effect of feature selection
We divide the ComParE features into 27 categories, C1 to C27.
Among the 27 categories, pcm_fftMag_mfcc performs the best.
The remaining categories perform similarly to one another, and all perform worse than pcm_fftMag_mfcc.
Experiments
e2e model
A baseline e2e model with 8 convolutional and 2 LSTM layers is trained on raw audio files.
We hypothesize that the e2e classification approach could learn distinctive time-frequency representations through its convolutional and LSTM layers, potentially revealing new representations in the data.
Scores for cold speech classification (UAR %)

Model                            Dev
ComParE functionals (baseline)   64.00
e2e                              66.50
Experiments
Effect of corpora