
Deep Convolution Neural Networks for Dialect Classification of Spectrogram Images



  1. Deep Convolution Neural Networks for Dialect Classification of Spectrogram Images
     Nigel Cannings, Chase Information Technology Services Limited
     www.intelligentvoice.com

  2. Convolution Networks: Brief History
     • Inspired by receptive fields in the visual cortex
     • Notable implementations:
       • Fukushima's Neocognitron (1980)
       • Explicit parallel implementations (1988)
       • LeCun's LeNet-5 (1998)
       • Ciresan's GPU implementation (2011)
       • GoogLeNet (2014)
     Reference: Fukushima, Kunihiko, 'Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,' Biological Cybernetics 36 (4): 193-202, 1980
     Image: LeNet-5 (1998), source: http://yann.lecun.com/exdb/lenet/

  3. Deep Learning
     • Sigmoidal activation functions have now been largely replaced with rectified linear units (ReLU)
     • The 'vanishing error' problem (Hochreiter, 1991) is largely avoided with ReLU
     • Now we can do 'deep' learning, i.e. networks with more than 2 hidden layers
     • This discovery, together with GPU computing, has resulted in much recent activity in the neural network community
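
The effect of the activation function on the backpropagated error can be illustrated with a small sketch. It assumes a toy chain of 10 layers in which every weight is 1 and every pre-activation value is 0.5; both are hypothetical choices made only for this example. The sigmoid derivative never exceeds 0.25, so the product across layers shrinks quickly, while the ReLU derivative is 1 for positive inputs (and 0 for negative ones).

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # Derivative of the sigmoid; its maximum value is 0.25 (at x = 0).
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn, pre_activations):
    """Product of per-layer activation derivatives (all weights assumed to be 1)."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


# Hypothetical pre-activation values for a 10-layer chain.
pre_activations = [0.5] * 10
sigmoid_chain = chained_gradient(sigmoid_grad, pre_activations)
relu_chain = chained_gradient(relu_grad, pre_activations)
print(f"sigmoid: {sigmoid_chain:.3e}, ReLU: {relu_chain:.1f}")
```

In this toy setting the ReLU chain passes the error through unchanged, whereas the sigmoid chain attenuates it by several orders of magnitude.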

  4. GoogLeNet
     • State-of-the-art winner of the ImageNet 2014 competition: classifying 1.2M images into 1K classes
     • Convolution neural network inspired by LeCun's LeNet-5
     • Has 9 'Inception' modules, with multiple convolution sizes and pooling in each module
     • Stochastic Gradient Descent used to train the network, with 'dropout', which helps prevent overfitting
     Reference: Szegedy, 'Going deeper with convolutions,' arXiv, 2014
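
Dropout can be sketched in a few lines. This is a minimal illustration of the common 'inverted' variant, not the code used to train the network in these slides; the drop probability of 0.4 and the all-ones activation vector are assumptions chosen for the example.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob during
    training and rescale the survivors so the expected value is unchanged."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)      # fixed seed so the example is repeatable
hidden = [1.0] * 10000      # hypothetical layer activations
dropped = dropout(hidden, drop_prob=0.4, rng=rng)
mean_after = sum(dropped) / len(dropped)
print(f"mean activation after dropout: {mean_after:.3f}")
```

At test time the function is called with `training=False`, which returns the activations unchanged.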

  5. GoogLeNet Structure
     Topology consists of 'Inception' modules consisting of:
     • Convolutions: filters for extracting features; filter size tends to be small in the early layers, bigger in later layers
     • Pooling: dimensionality reduction
     • Softmax loss for predicting classes at 3 progressive stages of the network
     • Other: concatenations for combining convolutions
     'Rinse and Repeat' 9 times
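
The softmax loss at three stages of the network can be sketched as follows. The sketch assumes the weighting described in the GoogLeNet paper, where the two auxiliary losses are scaled by 0.3 and added to the final loss during training; whether the network in these slides kept that weighting is not stated. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Softmax loss for one example: negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def combined_loss(aux1_logits, aux2_logits, final_logits, target, aux_weight=0.3):
    """Training loss summed over the two auxiliary classifiers and the final one."""
    return (aux_weight * cross_entropy(aux1_logits, target)
            + aux_weight * cross_entropy(aux2_logits, target)
            + cross_entropy(final_logits, target))


# Hypothetical logits for a 3-class problem; the true class is index 2.
loss = combined_loss([0.1, 0.2, 0.3], [0.0, 0.5, 1.0], [0.2, 0.1, 2.0], target=2)
print(f"combined training loss: {loss:.4f}")
```

Only the final classifier is normally used for prediction; the auxiliary outputs serve to inject gradient into the earlier layers during training.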

  6. NIST LRE Competition
     • 6 language clusters, 20 dialects:
       • Arabic (Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard)
       • Chinese (Cantonese, Mandarin, Min, Wu)
       • English (British, General American, Indian)
       • French (West African, Haitian Creole)
       • Iberian (Caribbean Spanish, European Spanish, Latin American Spanish, Brazilian Portuguese)
       • Slavic (Polish, Russian)
     • 500+ hours of speech data
     • Data set very unbalanced
     Reference: 2015 NIST Language Recognition Evaluation, http://www.nist.gov/itl/iad/lre15.cfm

  7. Spectrogram Convolution Network
     • Based on Nvidia's Digits implementation of GoogLeNet
     • Converted speech to 256x256 pixel spectrograms
     • Tried different spectral representations and coding…
     [Figure: example spectrograms labelled MATLAB, RASTA, RASTA 12, SoX, Python]
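
Turning speech into a spectrogram image can be sketched as a short-time Fourier transform followed by a mapping to grey levels. This is a minimal standard-library illustration, not the MATLAB, RASTA, SoX or Python pipelines compared in the slides; the frame length, hop size, dB floor and the synthetic sine-wave input are all assumptions, and the resizing to 256x256 pixels is left out.

```python
import cmath
import math


def hann(n):
    # Hann window to reduce spectral leakage at the frame edges.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins of a naive DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, frame_len=64, hop=32, floor_db=-80.0):
    """Log-magnitude spectrogram: one row per frame, one column per bin."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append([max(20 * math.log10(m), floor_db) if m > 0 else floor_db
                       for m in dft_magnitudes(frame)])
    return frames


def to_grey(frames):
    """Scale dB values linearly to 8-bit grey levels (0 to 255)."""
    lo = min(min(row) for row in frames)
    hi = max(max(row) for row in frames)
    span = (hi - lo) or 1.0
    return [[round(255 * (v - lo) / span) for v in row] for row in frames]


# Synthetic test tone that falls exactly on frequency bin 8 of a 64-sample frame.
signal = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
frames = spectrogram(signal)
pixels = to_grey(frames)
```

A real pipeline would use an FFT library rather than the naive DFT above, which is quadratic in the frame length and only practical for short frames.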

  8. GoogLeNet Processing

  9. GoogLeNet Processing

  10. GoogLeNet Processing
      Database:
      • 501248 spectrograms for training
      • 24352 spectrograms for validation
      • 51501 spectrograms for testing

  11. GoogLeNet Processing
      • Apply convolutions to extract primitives such as edges
      Database:
      • 501248 spectrograms for training
      • 24352 spectrograms for validation
      • 51501 spectrograms for testing

  12. GoogLeNet Processing
      • Apply convolutions to extract primitives such as edges
      • Object parts extracted
      Database:
      • 501248 spectrograms for training
      • 24352 spectrograms for validation
      • 51501 spectrograms for testing

  13. GoogLeNet Processing
      • Apply convolutions to extract primitives such as edges
      • Object parts extracted
      • Full spectral features, e.g. phones, words
      Database:
      • 501248 spectrograms for training
      • 24352 spectrograms for validation
      • 51501 spectrograms for testing

  14. GoogLeNet Processing
      • Apply convolutions to extract primitives such as edges
      • Object parts extracted
      • Full spectral features, e.g. phones, words
      • Refinement of accuracy
      Database:
      • 501248 spectrograms for training
      • 24352 spectrograms for validation
      • 51501 spectrograms for testing

  15. GoogLeNet Processing
      • Apply convolutions to extract primitives such as edges
      • Object parts extracted
      • Full spectral features, e.g. phones, words
      • Refinement of accuracy
      • Dialect classification
      • Loss outputs: Loss1, Loss2, Loss3
      Database:
      • 501248 spectrograms for training
      • 24352 spectrograms for validation
      • 51501 spectrograms for testing

  16. Preliminary Results
      • Accuracy: 83.99% (Top-1), 98.89% (Top-5)
      [Figure: bar chart per dialect, axis from 0 to 100. Dialect labels: English-South_Asian_(Indian), Portuguese-Brazilian, Spanish-…, Spanish-European, Chinese-Min_Dong, Arabic-Modern_Standard, Chinese-Cantonese, Arabic-Egyptian, English-British, Spanish-Caribbean, Slavic-Russian, Arabic-Maghrebi, Chinese-Mandarin, Arabic-Iraqi, English-American, French-West_African, Chinese-Wu, Slavic-Polish, French-Haitian, Arabic-Levantine]
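
Top-1 and Top-5 accuracy count a prediction as correct when the true class is among the 1 or 5 highest-scoring classes. A minimal sketch of that calculation follows; the scores and labels are made-up values for a 3-class toy problem, not results from the evaluation.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of examples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Hypothetical classifier scores (one row per example) and true labels.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.1, 0.7],
]
labels = [0, 1, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

With 20 dialect classes, the same function called with `k=5` would give the Top-5 figure.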

  17. Still to be investigated…
      • Many of the scaling, cropping and rotation operations commonly used in image classification to balance data and improve generalisation are not appropriate for spectrograms
      • Dynamic frequency warping techniques to balance the data sets and improve generalisation
      • Taxonomy of languages: investigation of the similarity of classification results across dialects
        • David Cameron: Arabic?
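
The slides do not say which frequency warping technique is meant. As a rough illustration of the general idea only, the sketch below applies a plain linear warp to the frequency axis of a spectrogram, similar in spirit to vocal tract length perturbation; it is a stand-in, not the dynamic technique proposed above. The function names, the warp factor range and the toy spectrogram are hypothetical.

```python
import random


def warp_frequency(column, alpha):
    """Linearly warp one spectrogram frame along the frequency axis.

    Output bin k is read from source position k / alpha using linear
    interpolation, so alpha > 1 shifts spectral features upward."""
    n = len(column)
    warped = []
    for k in range(n):
        src = min(k / alpha, n - 1)   # clamp to the last available bin
        lo = int(src)
        hi = min(lo + 1, n - 1)
        frac = src - lo
        warped.append((1 - frac) * column[lo] + frac * column[hi])
    return warped


def augment(frames, rng, low=0.9, high=1.1):
    """Warp every frame of a spectrogram with one randomly drawn factor."""
    alpha = rng.uniform(low, high)
    return [warp_frequency(frame, alpha) for frame in frames]


# Toy frame with a single spectral peak at bin 4.
frame = [0.0] * 16
frame[4] = 1.0
shifted = warp_frequency(frame, alpha=2.0)

rng = random.Random(0)
augmented = augment([frame, frame], rng)
```

Unlike rotation or cropping, a warp of this kind keeps the time axis intact and moves energy only along frequency, which is why it is a candidate for generating extra training examples for under-represented dialects.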

  18. Questions
      Thank you
