harmonic structure transform for speaker recognition

  1. Harmonic Structure Transform for Speaker Recognition
     Kornel Laskowski & Qin Jin
     Carnegie Mellon University, Pittsburgh PA, USA
     KTH Speech Music & Hearing, Stockholm, Sweden
     29 August, 2011
     INTERSPEECH 2011, Firenze, Italy
     Sections: Prolegomena, Baseline, Filterbank Design, Score-Level Fusion, Generalization, Conclusions

  3. Spectral Transforms in General
     Given x ≡ the energy spectrum of a speech frame,
     $y = F^{-1}\left( \log\left( M^{T} x \right) - \text{normalization term} \right)$
     The matrix M is a filterbank [figure: example columns of M]. M defines the number of filters, and their central frequencies, widths, and general shapes.
     Importantly here, the filters of all such filterbanks integrate energy across frequencies related by adjacency.
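The general transform above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the triangular filter shape, the use of a DCT-II as the decorrelating transform F⁻¹, mean subtraction as the normalization term, and all function names are hypothetical choices, not details given on the slide.

```python
import math

def triangular_filterbank(n_bins, centers, half_width):
    """Columns of M: one triangular filter per center bin (assumed shape).
    Each filter weights only bins adjacent to its center."""
    return [[max(0.0, 1.0 - abs(k - c) / half_width) for k in range(n_bins)]
            for c in centers]

def dct_ii(v):
    """Unnormalized DCT-II, standing in for the decorrelating transform F^-1 (assumption)."""
    n = len(v)
    return [sum(v[k] * math.cos(math.pi * (k + 0.5) * i / n) for k in range(n))
            for i in range(n)]

def spectral_transform(x, columns, floor=1e-12):
    """y = F^-1( log(M^T x) - normalization ), with mean removal as the
    normalization term (assumption). `floor` avoids log(0)."""
    log_e = [math.log(sum(w * e for w, e in zip(col, x)) + floor) for col in columns]
    mean = sum(log_e) / len(log_e)
    return dct_ii([v - mean for v in log_e])

# Usage: a flat 16-bin energy spectrum through three filters.
M = triangular_filterbank(16, centers=[4, 8, 12], half_width=2)
y = spectral_transform([1.0] * 16, M)
```

Because every filter sees the same energy in a flat spectrum, the normalized log energies are all zero and so is every output coefficient.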

  4. The Harmonic Structure Transform (HST)
     In contrast, the HST is implemented by a matrix H [figure: example columns of H]. Each filter integrates energy across frequencies related by harmonicity (not adjacency).
     - this is novel (Laskowski & Jin, 2010) for speaker recognition
     - related to (Liénard, Barras & Signol, 2008) for pitch detection
     - unknown: number of filters, and their fundamental frequencies, tooth widths, and individual tooth shapes
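To make the contrast with adjacency-based filters concrete, here is a minimal sketch of one comb-filter column of H. It assumes triangular teeth centered on integer multiples of a fundamental, with a tooth half-width smaller than half the fundamental so that teeth do not overlap; the function names and parameter values are hypothetical.

```python
def comb_filter(n_bins, bin_hz, f0_hz, tooth_half_width_hz):
    """One column of H: triangular teeth at f0, 2*f0, 3*f0, ... (assumed shape).
    Assumes tooth_half_width_hz < f0_hz / 2 so neighbouring teeth do not overlap."""
    col = [0.0] * n_bins
    for k in range(n_bins):
        f = k * bin_hz
        h = round(f / f0_hz)              # index of the nearest harmonic
        if h >= 1:                        # no tooth at 0 Hz
            distance = abs(f - h * f0_hz)
            col[k] = max(0.0, 1.0 - distance / tooth_half_width_hz)
    return col

def filter_energy(col, x):
    """Energy integrated by one filter: the inner product of a column with x."""
    return sum(w * e for w, e in zip(col, x))

# Usage: a toy spectrum with peaks at multiples of 100 Hz (10 Hz per bin).
x = [1.0 if k % 10 == 0 and k > 0 else 0.0 for k in range(100)]
e_match = filter_energy(comb_filter(100, 10.0, 100.0, 20.0), x)
e_other = filter_energy(comb_filter(100, 10.0, 130.0, 20.0), x)
```

The comb whose fundamental matches the spacing of the spectral peaks collects energy from every harmonic, while a mismatched comb catches only the few peaks that happen to fall under its teeth.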

  5. Outline of this Talk
     1. Baseline Performance: what is known?
     2. Experiments in HSCC Filterbank Design
        - linear spacing in fundamental frequency
        - piecewise linear spacing in fundamental frequency
        - logarithmic spacing in fundamental frequency
        - fundamental frequency range and density
     3. Score-level Fusion with Standard MFCCs
     4. Generalization
     5. Conclusions

  6. HST Processing
     - FFT analysis every 8 ms
     - frames 32 ms wide
     - comb filter teeth triangular (global width parameter)
     - 400 filters, linearly spanning from 50 Hz to 450 Hz
     - logarithm at each filter output, then normalization
     - decorrelation using LDA
     - yields harmonic structure cepstral coefficients (HSCCs)
     [Figure: a frame FFT $x_t$ shown against an idealized FFT (comb filter $h$), with labels $f_h[i-1]$, $f_h[i]$, $f_h[i+1]$ plotted as a function of $i$.]
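The framing and filter-spacing parameters listed above translate into simple arithmetic. The sketch below assumes a 16 kHz sample rate and that both endpoints of the 50 Hz to 450 Hz range are included; neither detail is stated on the slide, and the function names are hypothetical.

```python
def frame_starts(n_samples, sample_rate, width_ms=32, hop_ms=8):
    """Start indices of full analysis frames: 32 ms wide, one every 8 ms."""
    frame_len = sample_rate * width_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return list(range(0, n_samples - frame_len + 1, hop))

def linear_fundamentals(n_filters=400, lo_hz=50.0, hi_hz=450.0):
    """Fundamental frequencies spaced linearly over [lo_hz, hi_hz].
    Including both endpoints is an assumption."""
    step = (hi_hz - lo_hz) / (n_filters - 1)
    return [lo_hz + i * step for i in range(n_filters)]

# Usage: one second of audio at an assumed 16 kHz sample rate.
starts = frame_starts(16000, 16000)      # 512-sample frames, 128-sample hop
fundamentals = linear_fundamentals()     # 400 values, roughly 1 Hz apart
```

With these settings each comb filter's fundamental differs from its neighbour's by about 1 Hz, which gives a sense of how densely the filterbank samples the range of plausible pitch values.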

  7. HSCC Modeling for Classification
     As simple as possible.
     - one GMM per speaker
     - assume one Gaussian element
       1. determine optimal number $N_D$ of LDA dimensions
       2. hold $N_D$ fixed
       3. determine optimal number $N_G$ of Gaussians
       4. maximum likelihood closed-set classification (MAP under uniform prior)
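Maximum likelihood closed-set classification can be illustrated for the single-Gaussian case mentioned above. This is a sketch only: it assumes diagonal covariances and independent frames, uses toy two-dimensional features, and every name in it is hypothetical rather than taken from the authors' system.

```python
import math

def fit_gaussian(frames):
    """Fit one diagonal-covariance Gaussian (assumption) to a speaker's feature frames."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    var = [max(sum((f[j] - mean[j]) ** 2 for f in frames) / n, 1e-6)  # variance floor
           for j in range(d)]
    return mean, var

def log_likelihood(frames, model):
    """Total log-likelihood of a trial, treating frames as independent."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def classify(frames, models):
    """Closed-set decision: the enrolled speaker whose model scores the trial highest.
    Under a uniform prior this coincides with the MAP decision."""
    return max(models, key=lambda speaker: log_likelihood(frames, models[speaker]))

# Usage with toy 2-dimensional features for two enrolled speakers.
models = {
    "spk_a": fit_gaussian([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]),
    "spk_b": fit_gaussian([[5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]),
}
decision = classify([[4.8, 5.2], [5.1, 5.0]], models)
```

Extending this to $N_G$ Gaussians per speaker would replace the per-frame term with a log-sum over weighted mixture components; the decision rule itself stays the same.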

  8. Available Results (Laskowski & Jin, ODYSSEY 2010)
     - Wall Street Journal data, mostly read speech
     - 100-way closed-set classification, per gender
     - ≈ 1500 10-second trials, per gender and dataset
     - matched channel and matched multi-session conditions

     | System  | Female ♀ Dev | Female ♀ Test | Male ♂ Dev | Male ♂ Test |
     |---------|--------------|---------------|------------|-------------|
     | F0      | 17.6         | 18.4          | 26.2       | 27.4        |
     | HST/LDA | 99.7         | 99.9          | 99.7       | 99.7        |
     | MEL/DCT | 98.7         | 99.3          | 99.3       | 98.6        |
     | MEL/LDA | 98.7         | 99.3          | 99.3       | 98.9        |

  9. Session Mismatch
     - MIXER5 data, various speaking styles
     - 66-way closed-set classification
     - ≈ 3000 10-second trials, per dataset
     - matched channel and matched session: accuracies of 100%
     - matched channel but mismatched session:

     | System  | Dev accuracy (%) | Test accuracy (%) |
     |---------|------------------|-------------------|
     | F0      | 14.1             | 16.2              |
     | HST/LDA | 59.8             | 68.1              |
     | MEL/DCT | 74.4             | 84.4              |
     | MEL/LDA | 81.5             | 87.8              |

  10. Linear Spacing of Fundamental Frequencies
      [Figure: filterbank configurations laid out along a fundamental-frequency axis marked 50 Hz, 150 Hz, 250 Hz, 350 Hz, 450 Hz, 550 Hz, 650 Hz, 750 Hz, 850 Hz. Each configuration carries three numbers, listed here in source order.]
      400 34.4 56.7; 200 26.8 51.6; 400 27.9 56.7; 200 27.2 49.6; 400 38.0 59.8; 200 26.5 48.2;
      400 28.5 56.7; 200 28.9 48.5; 400 42.2 63.9; 200 28.4 52.5; 400 30.3 60.4; 200 37.1 59.4;
      400 42.4 67.7; 200 26.8 54.5; 400 33.3 64.7; 200 41.6 65.0; 400 42.0 66.5

