

  1. LID-senone Extraction via Deep Neural Networks for End-to-End Language Identification Ma Jin 1 Yan Song 1 Ian McLoughlin 2 Li-Rong Dai 1 Zhong-Fu Ye 1,3 1 National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, China 2 School of Computing, University of Kent, Medway, UK 3 State Key Laboratory of Mathematical Engineering and Advanced Computing, China Presented by Professor Ian McLoughlin, 2016.06.22

  2. Outline • Introduction • Proposed Method • Experiments and Analysis • Conclusion and Future Work

  3. Introduction – background • What is Language Identification? • Extract an utterance-level representation from a given speech signal • State-of-the-art Method • GMM/i-vector • Trained in an unsupervised fashion • Deep Learning Method • Natural advantage of supervised training

  4. Introduction – existing methods • Improved i-vector Method via Deep Learning • Deep bottleneck network based i-vector representation for language identification (Song et al.) • Study of senone-based deep neural network approaches for spoken language recognition (Ferrer et al.) • End-to-End Neural Network • Automatic language identification using deep neural networks (Lopez-Moreno et al.) • Automatic language identification using long short-term memory recurrent neural networks (Gonzalez-Dominguez et al.) • An end-to-end approach to language identification in short utterances using convolutional neural networks (Lozano-Diez et al.)

  5. Outline • Introduction • Proposed Method • Experiments and Analysis • Conclusion and Future Work

  6. Proposed Method – motivation and structure • Convolutional Neural Network • convolutional layers: frame-level feature extractor • pooling layers: map frame-level features to an utterance representation • Structure • DNN layer: transforms acoustic features into a compact bottleneck (BN) representation, frame by frame • convolutional layer: transforms BN features into units discriminative between languages
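The DNN-then-convolution-then-pooling structure can be sketched as a minimal NumPy forward pass. The dimensions here (39-dim acoustic frames, a 50-dim bottleneck, 256 filters of size 50x21) and the random weights are illustrative assumptions, not the trained network from the paper, and mean pooling stands in for the pooling layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 39-dim acoustic frames, 50-dim bottleneck,
# 256 LID-feature channels, conv kernel spanning 21 frames.
T, ACOUSTIC, BN, CHANNELS, KERNEL = 100, 39, 50, 256, 21

W_bn = rng.standard_normal((ACOUSTIC, BN)) * 0.1            # DNN bottleneck weights
W_conv = rng.standard_normal((CHANNELS, BN, KERNEL)) * 0.1  # conv filters

def relu(x):
    return np.maximum(x, 0.0)

frames = rng.standard_normal((T, ACOUSTIC))  # one toy utterance

# 1) DNN layer: frame-by-frame compact bottleneck representation
bn_feats = relu(frames @ W_bn)               # (T, BN)

# 2) Convolutional layer: slide each 50x21 filter along time,
#    producing (hopefully language-discriminative) LID-features
valid = T - KERNEL + 1
lid_feats = np.empty((valid, CHANNELS))
for t in range(valid):
    window = bn_feats[t:t + KERNEL].T        # (BN, KERNEL)
    lid_feats[t] = relu(np.tensordot(W_conv, window, axes=([1, 2], [0, 1])))

# 3) Pooling layer: map frame-level features to one utterance vector
utt_vec = lid_feats.mean(axis=0)             # (CHANNELS,)
print(utt_vec.shape)                         # (256,)
```

The key point the sketch shows is that only the pooling step depends on utterance length; everything before it is strictly frame-level.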

  7. Proposed Method – structure details • LID-feature • general acoustic features contain much irrelevant information, which may degrade performance • deep bottleneck features (DBFs) are discriminative for phones, not for languages • LID-features are discriminative between languages, with little correlation across dimensions (hence the large conv kernel) • Spatial Pyramid Pooling • spans features from frame level to utterance level • handles arbitrary input sizes • obtains statistical information at different time scales
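Spatial pyramid pooling over the time axis can be sketched as follows. The pyramid levels (1, 2, 4 bins) and the use of average pooling are assumptions for illustration, not necessarily the paper's configuration; the point is that the output size is fixed regardless of input length:

```python
import numpy as np

def spatial_pyramid_pool(feats, levels=(1, 2, 4)):
    """Average-pool frame-level features at several time scales and
    concatenate, yielding a fixed-size vector for any input length."""
    T, D = feats.shape
    pooled = []
    for n_bins in levels:
        # split the time axis into n_bins roughly equal segments
        edges = np.linspace(0, T, n_bins + 1).astype(int)
        for b in range(n_bins):
            pooled.append(feats[edges[b]:edges[b + 1]].mean(axis=0))
    return np.concatenate(pooled)  # (1 + 2 + 4) * D values

rng = np.random.default_rng(1)
short = spatial_pyramid_pool(rng.standard_normal((30, 8)))   # 30 frames
long_ = spatial_pyramid_pool(rng.standard_normal((300, 8)))  # 300 frames
print(short.shape, long_.shape)  # (56,) (56,) -- same size either way
```

The finer levels capture statistics at shorter time scales, while the single-bin level summarizes the whole utterance.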

  8. Proposed Method – incremental training strategy • LID-features cannot be extracted directly from general acoustic features • lack of training data • features must be tied to phones at the frame level, so the training target cannot be languages • Incremental Training Strategy • transfer learning from a large-scale corpus • incremental training with the language corpus
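The two-stage strategy can be illustrated with a toy softmax network in place of the real DNN: pretrain a shared feature layer on an abundant auxiliary task (standing in for the senone-labelled corpus), then keep it as initialisation and train only a new output head on the scarce target-task data. All data, targets, and hyperparameters here are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Stage 1: transfer learning from a large-scale corpus --
# learn a shared feature layer W1 on an abundant auxiliary task.
X_big = rng.standard_normal((2000, 20))
y_big = (X_big[:, 0] + X_big[:, 1] > 0).astype(int)  # toy targets
W1 = rng.standard_normal((20, 16)) * 0.1
V = np.zeros((16, 2))
for _ in range(300):
    H = np.maximum(X_big @ W1, 0)
    G = softmax(H @ V) - np.eye(2)[y_big]
    V -= 0.5 * H.T @ G / len(X_big)
    W1 -= 0.5 * X_big.T @ ((G @ V.T) * (H > 0)) / len(X_big)

# Stage 2: incremental training with the (small) target corpus --
# reuse W1 and train a fresh head on scarce data.
X_small = rng.standard_normal((100, 20))
y_small = (X_small[:, 0] + X_small[:, 1] > 0).astype(int)
W2 = np.zeros((16, 2))
for _ in range(300):
    H = np.maximum(X_small @ W1, 0)  # W1 kept fixed in this sketch
    W2 -= 0.5 * H.T @ (softmax(H @ W2) - np.eye(2)[y_small]) / len(X_small)

acc = ((np.maximum(X_small @ W1, 0) @ W2).argmax(1) == y_small).mean()
print(acc)
```

With only 100 target examples, the pretrained feature layer does the heavy lifting; training the whole stack from scratch on that little data would be prone to over-fitting, which is the motivation for the incremental strategy.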

  9. Proposed Method – LID-senone and its statistics • LID-senones are discriminative at the frame level; their statistics are discriminative at the utterance level • only a few LID-senones are activated at any frame

  10. Proposed Method – hybrid temporal evaluation • 30s/10s/3s neural networks are trained independently • a 30s utterance can be segmented into 10s or 3s pieces and scored by the corresponding networks • a 10s utterance can be segmented into 3s pieces and scored by the corresponding network
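The segment-and-rescore idea can be sketched in a few lines: cut a long utterance into fixed-length pieces, score each piece with the matching duration-specific network, and combine the scores. Score averaging and the toy scoring function are assumptions for illustration; the paper does not specify them here:

```python
import numpy as np

def score_segments(utterance, seg_len, score_fn):
    """Split an utterance into fixed-length segments, score each with
    the matching duration-specific network, and average the scores."""
    n = len(utterance) // seg_len
    segs = [utterance[i * seg_len:(i + 1) * seg_len] for i in range(n)]
    return np.mean([score_fn(s) for s in segs], axis=0)

# Toy stand-in for a duration-specific network's language posteriors.
def toy_net_3s(seg):
    return np.array([seg.mean(), 1.0 - seg.mean()])

utt_30s = np.linspace(0.0, 1.0, 3000)  # pretend 30 s of frame values
# a 30s test utterance can also be scored as ten 3s segments:
scores = score_segments(utt_30s, 300, toy_net_3s)
print(scores)
```

This lets every test utterance contribute scores at more than one time scale, combining the duration-matched networks rather than picking just one.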

  11. Outline • Introduction • Proposed Method • Experiments and Analysis • Conclusion and Future Work

  12. Experiments and Analysis • Dataset • six most confusable languages from NIST LRE 09 (Dari, Farsi, Russian, Ukrainian, Hindi and Urdu) • training data of about 150 hours • evaluation on 30s/10s/3s conditions • Performance indicator: Equal Error Rate (EER) • Systems • baseline 1: BN-GMM/i-vector • baseline 2: BN-DNN/i-vector • proposed network 1: LID-net • proposed network 2: LID-HT-net, i.e. LID-net with hybrid temporal evaluation

  13. Experiments and Analysis • Evaluation of Different Convolutional Filter Sizes • As a consequence, a filter size of 50x21 is selected for all of the following experiments.

  14. Experiments and Analysis • Evaluation of Convolutional Layer Complexity • Performance improves as the complexity of the conv. layer increases

  15. Experiments and Analysis • Hybrid Temporal Evaluation • the final LID-net performs well compared with the two baseline systems • i-vector uses both zeroth-order and first-order Baum-Welch statistics, while the SPP layer in LID-net uses only zeroth-order statistics
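The distinction between the two kinds of statistics can be made concrete. Given per-frame posteriors over C units, the zeroth-order statistics are the occupation counts (what a pooling layer over unit activations captures), while the first-order statistics additionally weight the frame features by those posteriors. The dimensions and random data below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
T, C, D = 200, 8, 5
post = rng.dirichlet(np.ones(C), size=T)  # frame posteriors over C units
feats = rng.standard_normal((T, D))       # frame-level features

# Zeroth-order statistics: per-unit occupation counts --
# what an SPP-style pooling over unit activations retains.
N = post.sum(axis=0)                      # (C,)

# First-order statistics: posterior-weighted feature sums,
# which the i-vector front end uses in addition to the counts.
F = post.T @ feats                        # (C, D)

print(N.shape, F.shape, round(N.sum()))   # counts sum to T
```

Since each row of `post` sums to one, the counts `N` sum to the number of frames; the first-order matrix `F` is the extra information the SPP layer currently discards, which motivates the future-work question at the end.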

  16. Outline • Introduction • Proposed Method • Experiments and Analysis • Conclusion and Future Work

  17. Conclusion and Future Work • Conclusion • we have proposed a comprehensive task-aware network spanning frame to utterance level • an incremental training strategy has been introduced to address over-fitting issues in the deep structure • hybrid temporal evaluation is proposed to handle the different time scales in the same test dataset • Future Work • consider a single comprehensive network rather than relying on three independent networks • can we incorporate first-order B-W statistics?
