
Convolutional Neural Networks in Speech Lecture 20 CS 753 - PowerPoint PPT Presentation



  1. Convolutional Neural Networks in Speech Lecture 20 CS 753 Instructor: Preethi Jyothi

  2. Convolutional Neural Networks (CNNs)
  • Fully connected (dense) layers have no awareness of spatial information
  • Key concept behind convolutional layers is that of kernels or filters
  • Filters slide across an input space to detect spatial patterns (translation invariance) in local regions (locality)
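The sliding-filter idea can be written in a few lines of numpy. This is an illustrative sketch (the function name and example filter are not from the slides): a small edge-detecting kernel slides over an image, computing a dot product at every position, so the same pattern is detected wherever it occurs.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1).

    Returns an activation map of size (H-kH+1, W-kW+1).
    """
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the kernel and the local image region
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

# A vertical-edge filter responds wherever the pattern appears,
# regardless of position (translation invariance):
img = np.zeros((5, 5))
img[:, 2] = 1.0                          # a vertical line at column 2
k = np.array([[-1.0, 0.0, 1.0]] * 3)     # 3x3 vertical-edge filter
print(conv2d_single(img, k))             # strongest response next to the line
```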

  3. Fully Connected Layers: a 32x32x3 image is stretched to a 3072x1 input vector; a 10x3072 weight matrix maps it to 10 output activations. Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf (Fei-Fei Li, Justin Johnson & Serena Yeung, CS231n Lecture 5, April 16, 2019)

  4. Convolution Layer: convolve a 5x5x3 filter with a 32x32x3 image, i.e. “slide over the image spatially, computing dot products”. Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf

  5. Convolution Layer: sliding the 5x5x3 filter over all spatial locations of the 32x32x3 image yields a 28x28x1 activation map. Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf

  6. Convolution Layer: a second (green) 5x5x3 filter produces a second 28x28x1 activation map. Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf

  7. Convolution Layer: for example, if we had 6 5x5 filters, we’ll get 6 separate activation maps. We stack these up to get a “new image” of size 28x28x6! Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf
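The shape arithmetic above can be checked directly: each 5x5x3 filter produces one 28x28 map (32 - 5 + 1 = 28), and six filters stack into a 28x28x6 volume. A minimal numpy sketch (illustrative, not the lecture's code):

```python
import numpy as np

H = W = 32; C = 3                 # input volume 32x32x3
kH = kW = 5; num_filters = 6      # six 5x5x3 filters

x = np.random.randn(H, W, C)
filters = np.random.randn(num_filters, kH, kW, C)

# Each filter yields one (H-kH+1) x (W-kW+1) activation map;
# stacking the maps gives the 28x28x6 output volume.
out = np.zeros((H - kH + 1, W - kW + 1, num_filters))
for f in range(num_filters):
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j, f] = np.sum(x[i:i+kH, j:j+kW, :] * filters[f])

print(out.shape)  # (28, 28, 6)
```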

  8. Convolutional Neural Network. Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions, e.g. 32x32x3 input -> 6 5x5x3 filters + ReLU -> 28x28x6 -> 10 5x5x6 filters + ReLU -> 24x24x10 -> … Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf

  9. What do these layers learn? Visualization of VGG-16 (architecture from [Simonyan and Zisserman 2014]) by Lane McIntosh, using the technique of [Zeiler and Fergus 2013]. Image from: Simonyan and Zisserman, 2014

  10.–14. Convolutional Neural Networks (CNNs): animations illustrating convolution arithmetic, one per slide. All animations are from: https://github.com/vdumoulin/conv_arithmetic

  15. Convolution Layers: Summary Summary from: http://cs231n.github.io/convolutional-networks/
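The linked cs231n summary boils the layer geometry down to the output-size formula (N - F + 2P)/S + 1 for input size N, filter size F, padding P, and stride S. A small helper (illustrative, not from the slides) to sanity-check layer shapes:

```python
def conv_output_size(n, f, pad=0, stride=1):
    """Spatial output size of a conv layer: (N - F + 2P) / S + 1."""
    assert (n - f + 2 * pad) % stride == 0, "filter does not tile the input evenly"
    return (n - f + 2 * pad) // stride + 1

print(conv_output_size(32, 5))                  # 28, as in the slides above
print(conv_output_size(32, 5, pad=2))           # 32: "same" padding preserves size
print(conv_output_size(7, 3, pad=0, stride=2))  # 3
```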

  16. Pooling Layer Image from: http://cs231n.github.io/convolutional-networks/

  17. Pooling Layer Summary from: http://cs231n.github.io/convolutional-networks/
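Pooling downsamples each activation map independently by summarizing small windows; the common case is 2x2 max pooling with stride 2, which halves each spatial dimension. A minimal sketch (the function name and example are illustrative):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Downsample a 2D map by taking the max in each size x size window."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(x))  # [[6. 8.] [3. 4.]]
```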

  18. CNNs for Speech

  19. Speech features to be fed to a CNN Image from Abdel-Hamid et al., “Convolutional Neural Networks for Speech Recognition”, TASLP 2014
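To feed speech to a CNN, frame-level filterbank features are arranged as a 2D time x frequency map, with static, delta, and delta-delta features as input channels (analogous to an image's RGB channels), as in Abdel-Hamid et al. The sketch below is a hedged illustration: the 40 mel bands, 15-frame context window, and simple delta computation are my assumptions, not taken from the paper.

```python
import numpy as np

num_frames, num_mel = 100, 40                 # assumed utterance length and mel bands
fbank = np.random.randn(num_frames, num_mel)  # stand-in for log-mel filterbank features

def deltas(x):
    """Simple first-order delta features (difference of neighboring frames)."""
    d = np.zeros_like(x)
    d[1:-1] = (x[2:] - x[:-2]) / 2.0
    return d

d1 = deltas(fbank)   # delta
d2 = deltas(d1)      # delta-delta

# One CNN input: a 15-frame context window with 3 channels
# (static, delta, delta-delta), like a small 3-channel image.
t, context = 50, 15
window = np.stack([f[t:t+context] for f in (fbank, d1, d2)], axis=-1)
print(window.shape)  # (15, 40, 3): time x frequency x channels
```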

  20. Illustrating a CNN layer: a convolution layer followed by a pooling layer. Image from Abdel-Hamid et al., “Convolutional Neural Networks for Speech Recognition”, TASLP 2014

  21. Convolution operations involve a large sparse matrix Image from Abdel-Hamid et al., “Convolutional Neural Networks for Speech Recognition”, TASLP 2014
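The point of this slide can be verified in code: a convolution is equivalent to multiplying by a large, sparse, banded matrix in which the same small set of filter weights repeats on every row (weight sharing). An illustrative 1D sketch, not the paper's notation:

```python
import numpy as np

def conv_matrix(kernel, n):
    """Dense view of the sparse, banded matrix that implements
    a valid 1D convolution with `kernel` on a length-n input."""
    k = len(kernel)
    m = n - k + 1
    A = np.zeros((m, n))
    for i in range(m):
        A[i, i:i+k] = kernel   # the same shared weights on every row
    return A

x = np.arange(6, dtype=float)
w = np.array([1.0, 0.0, -1.0])
A = conv_matrix(w, len(x))

# Matrix-vector product equals sliding the filter over x:
direct = np.array([np.dot(w, x[i:i+3]) for i in range(len(x) - 2)])
print(np.allclose(A @ x, direct))  # True
```

Most entries of A are zero, which is why convolution is far cheaper than a fully connected layer of the same input/output sizes.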

  22. CNN architecture used in a hybrid ASR system. Image from Abdel-Hamid et al., “Convolutional Neural Networks for Speech Recognition”, TASLP 2014

  23. Performance on TIMIT of different CNN architectures (Comparison with DNNs)

  24. More recent ASR system: Deep Speech 2. Architecture: spectrogram input -> 1D or 2D invariant convolution -> uni- or bidirectional RNN layers (vanilla or GRU) -> lookahead convolution -> fully connected -> CTC. Effect of the convolutional front-end (WER):

      Architecture   Channels        Filter dimension      Stride         Regular Dev   Noisy Dev
      1-layer 1D     1280            11                    2              9.52          19.36
      2-layer 1D     640, 640        5, 5                  1, 2           9.67          19.21
      3-layer 1D     512, 512, 512   5, 5, 5               1, 1, 2        9.20          20.22
      1-layer 2D     32              41x11                 2x2            8.94          16.22
      2-layer 2D     32, 32          41x11, 21x11          2x2, 2x1       9.06          15.71
      3-layer 2D     32, 32, 96      41x11, 21x11, 21x11   2x2, 2x1, 2x1  8.61          14.74

      WER comparison with human transcribers:

      Test set                       Ours    Human
      Read:
        WSJ eval'92                  3.10    5.03
        WSJ eval'93                  4.42    8.08
        LibriSpeech test-clean       5.15    5.83
        LibriSpeech test-other       12.73   12.69
      Accented:
        VoxForge American-Canadian   7.94    4.85
        VoxForge Commonwealth        14.85   8.15
        VoxForge European            18.44   12.76
        VoxForge Indian              22.89   22.15
      Noisy:
        CHiME eval real              21.59   11.84
        CHiME eval sim               42.55   31.33

      Image from Amodei et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin”, ICML 2016

  25. TTS: Wavenet
  • Speech synthesis using an auto-regressive generative model
  • Generates waveform sample-by-sample: 16kHz sampling rate
  Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

  26. Causal Convolutions
  • Fully convolutional
  • Prediction at timestep t cannot depend on any future timesteps
  [Figure: stacked causal convolutions from Input through Hidden Layers to Output] Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

  27. Dilated Convolutions
  • Wavenet uses “dilated convolutions”
  • Enables the network to have very large receptive fields
  Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
  https://techcrunch.com/2017/10/04/googles-wavenet-machine-learning-based-speech-synthesis-comes-to-assistant/
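Why dilation gives large receptive fields: each layer with dilation d extends the receptive field by (kernel_size - 1) * d, so doubling the dilation at every layer makes the receptive field grow exponentially with depth while the number of layers grows only logarithmically. A sketch of the arithmetic (the specific dilation schedule 1, 2, 4, …, 512 is the WaveNet-style example; treat it as illustrative):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d   # each layer adds (k-1)*d past samples
    return rf

# WaveNet-style stack: dilation doubles each layer, 1 through 512
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))  # 1024 samples from only 10 layers
```

With stride-1 undilated convolutions, covering 1024 samples with kernel size 2 would instead take 1023 layers.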

  28. Convolutional Neural Networks (CNNs) All animations are from: https://github.com/vdumoulin/conv_arithmetic

  29. Conditional Wavenet
  • Condition the model on input variables to generate audio with the required characteristics
  • Global (same representation used to influence all timesteps)
  • Local (use a second timeseries for conditioning)
  Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

  30. Tacotron. Character embeddings -> pre-net -> CBHG encoder; attention RNN and decoder RNNs produce the seq2seq target with r=3, starting from a <GO> frame, with attention applied to all decoder steps; a CBHG post-processing net yields a linear-scale spectrogram, followed by Griffin-Lim reconstruction. Image from Wang et al., “Tacotron: Towards end-to-end speech synthesis”, 2017. https://arxiv.org/pdf/1703.10135.pdf

  31. Tacotron: CBHG Module. Conv1D bank + stacking -> max-pool along time (stride=1) -> Conv1D projections -> residual connection -> highway layers -> bidirectional RNN. Image from Wang et al., “Tacotron: Towards end-to-end speech synthesis”, 2017. https://arxiv.org/pdf/1703.10135.pdf

  32. Grapheme to phoneme (G2P) conversion

  33. Grapheme to phoneme (G2P) conversion
  • Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)
  • Learn G2P mappings from a pronunciation dictionary
  • Useful for:
    • ASR systems in languages with no pre-built lexicons
    • Speech synthesis systems
    • Deriving pronunciations for out-of-vocabulary (OOV) words

  34. G2P conversion (I)
  • One popular paradigm: joint sequence models [BN12]
  • Grapheme and phoneme sequences are first aligned using an EM-based algorithm
  • This results in a sequence of graphones (joint G-P tokens)
  • Ngram models are trained on these graphone sequences
  • WFST-based implementation of such a joint graphone model [Phonetisaurus]
  [BN12]: Bisani & Ney, “Joint sequence models for grapheme-to-phoneme conversion”, Specom 2012
  [Phonetisaurus]: J. Novak, Phonetisaurus Toolkit

  35. G2P conversion (II)
  • Neural network based methods are the new state-of-the-art for G2P
  • Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to Ngram models.
  • Incorporating alignment information [Yao15]. Beats Ngram models.
  • No alignment; encoder-decoder with attention. Beats the above systems [Toshniwal16].
