Convolutional Neural Networks in Speech
Lecture 20, CS 753
Instructor: Preethi Jyothi
Convolutional Neural Networks (CNNs)
• Fully connected (dense) layers have no awareness of spatial information
• The key concept behind convolutional layers is that of kernels or filters
• Filters slide across an input space to detect spatial patterns (translation invariance) in local regions (locality), as the sketch below illustrates
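To make the sliding-filter idea concrete, here is a minimal NumPy sketch of a 2D convolution (strictly, cross-correlation); the array sizes are illustrative, not from the slides:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D cross-correlation ('valid' mode): slide the kernel over
    every spatial location and take a dot product with the local patch."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]   # local region (locality)
            out[i, j] = np.sum(patch * kernel)  # same weights at every position
    return out

image = np.random.randn(32, 32)   # a single-channel 32x32 input
kernel = np.random.randn(5, 5)    # a 5x5 filter
print(conv2d_valid(image, kernel).shape)  # (28, 28)
```

Because the same kernel weights are applied at every position, a pattern is detected wherever it occurs: this is the translation invariance the bullet points refer to.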
Fully Connected Layers
32x32x3 image -> stretched to a 3072 x 1 input; a 10 x 3072 weight matrix produces a 10 x 1 activation, where each of the 10 outputs is the dot product of one weight row with the full input (1 number)
Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf
Convolution Layer
32x32x3 image, 5x5x3 filter: convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"
Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf
Convolution Layer
Convolving (sliding) one 5x5x3 filter over all spatial locations of the 32x32x3 image yields a 28x28x1 activation map
Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf
Convolution Layer
Considering a second (green) 5x5x3 filter, convolved over all spatial locations, gives a second 28x28x1 activation map
Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf
Convolution Layer
For example, if we had 6 5x5 filters, we get 6 separate activation maps; we stack these up to get a "new image" of size 28x28x6
Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf
Convolutional Neural Network
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions, e.g. 32x32x3 input -> CONV (6 5x5x3 filters) + ReLU -> 28x28x6 -> CONV (10 5x5x6 filters) + ReLU -> 24x24x10 -> ... (the shape arithmetic is checked in the sketch below)
Image from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture05.pdf
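A minimal PyTorch sketch of this stack, just to verify the shapes on the slide (the filter counts and sizes come from the slide; everything else is illustrative):

```python
import torch
import torch.nn as nn

# CONV -> ReLU -> CONV -> ReLU, as in the slide.
net = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5),   # 6 filters of 5x5x3
    nn.ReLU(),
    nn.Conv2d(in_channels=6, out_channels=10, kernel_size=5),  # 10 filters of 5x5x6
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
print(net[0](x).shape)          # torch.Size([1, 6, 28, 28])
print(net(x).shape)             # torch.Size([1, 10, 24, 24])
```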
What do these layers learn?
Visualization of VGG-16 by Lane McIntosh; visualization technique from [Zeiler and Fergus 2013]; VGG-16 architecture from [Simonyan and Zisserman 2014]
Image from: Simonyan and Zisserman, 2014
Convolutional Neural Networks (CNNs)
All animations are from: https://github.com/vdumoulin/conv_arithmetic
Convolution Layers: Summary
Summary from: http://cs231n.github.io/convolutional-networks/
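The core of the cs231n summary is the output-size arithmetic: for input width W, filter size F, stride S, and zero padding P, the output width is (W - F + 2P)/S + 1. A small helper to sanity-check it (the numbers below are illustrative):

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size of a convolution layer:
    input width w, filter size f, stride s, zero padding p."""
    assert (w - f + 2 * p) % s == 0, "filter does not tile the input evenly"
    return (w - f + 2 * p) // s + 1

print(conv_output_size(32, 5))            # 28: the 32x32 -> 28x28 case above
print(conv_output_size(32, 5, s=1, p=2))  # 32: 'same' padding preserves size
```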
Pooling Layer
Image from: http://cs231n.github.io/convolutional-networks/
Pooling Layer
Summary from: http://cs231n.github.io/convolutional-networks/
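A quick PyTorch sketch of max pooling: it downsamples each activation map spatially and leaves the depth unchanged (sizes are illustrative):

```python
import torch
import torch.nn as nn

# 2x2 max pooling with stride 2 halves each spatial dimension.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 6, 28, 28)   # e.g. the 28x28x6 stack from earlier
print(pool(x).shape)            # torch.Size([1, 6, 14, 14]); depth unchanged
```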
CNNs for Speech
Speech features to be fed to a CNN
Image from Abdel-Hamid et al., "Convolutional Neural Networks for Speech Recognition", TASLP 2014
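A sketch of how filterbank features can be arranged as CNN input, in the spirit of Abdel-Hamid et al. (2014): log-mel energies lie along a "frequency" axis, a context window of frames gives a "time" axis, and static/delta/delta-delta coefficients act as input channels. All sizes here are illustrative, and simple first differences stand in for proper delta features:

```python
import numpy as np

def deltas(feats):
    """First-order differences along time, a simple stand-in for deltas."""
    return np.diff(feats, axis=0, prepend=feats[:1])

T, n_mel, context = 100, 40, 11
static = np.random.randn(T, n_mel)   # pretend log-mel filterbank features
d1 = deltas(static)
d2 = deltas(d1)

t = 50                               # center frame of the context window
window = slice(t - context // 2, t + context // 2 + 1)
x = np.stack([static[window], d1[window], d2[window]])  # (3, 11, 40)
print(x.shape)  # (channels, time, frequency), ready for a 2D conv layer
```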
Illustrating a CNN layer these n s, and - ithin con- both - s. umber Convolution Layer Pooling Layer Image from Abdel-Hamid et al., “Convolutional Neural Networks for Speech Recognition”, TASLP 2014
Convolution operations involve a large sparse matrix
Image from Abdel-Hamid et al., "Convolutional Neural Networks for Speech Recognition", TASLP 2014
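To see why, note that a 1D convolution can be written as multiplication by a sparse banded (Toeplitz-like) matrix: each row holds the same filter weights, shifted by one position, and is zero everywhere else. A small NumPy sketch with illustrative sizes:

```python
import numpy as np

n, f = 8, 3
x = np.random.randn(n)
w = np.random.randn(f)

W = np.zeros((n - f + 1, n))      # mostly zeros: the "large sparse matrix"
for i in range(n - f + 1):
    W[i, i:i + f] = w             # shared weights on each shifted band

# Matrix-vector product equals the 'valid' 1D convolution (cross-correlation).
assert np.allclose(W @ x, np.correlate(x, w, mode='valid'))
print(W)
```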
CNN Architecture used in a hybrid ASR system
Image from Abdel-Hamid et al., "Convolutional Neural Networks for Speech Recognition", TASLP 2014
Performance on TIMIT of different CNN architectures (Comparison with DNNs)
More recent ASR system: Deep Speech 2
Architecture stack (bottom to top): Spectrogram -> 1D or 2D Invariant Convolution -> Uni- or Bi-directional RNN (vanilla or GRU) -> Lookahead Convolution -> Fully Connected -> CTC

Effect of the convolutional front-end (WER):

Architecture | Channels      | Filter dimension    | Stride        | Regular Dev | Noisy Dev
1-layer 1D   | 1280          | 11                  | 2             | 9.52        | 19.36
2-layer 1D   | 640, 640      | 5, 5                | 1, 2          | 9.67        | 19.21
3-layer 1D   | 512, 512, 512 | 5, 5, 5             | 1, 1, 2       | 9.20        | 20.22
1-layer 2D   | 32            | 41x11               | 2x2           | 8.94        | 16.22
2-layer 2D   | 32, 32        | 41x11, 21x11        | 2x2, 2x1      | 9.06        | 15.71
3-layer 2D   | 32, 32, 96    | 41x11, 21x11, 21x11 | 2x2, 2x1, 2x1 | 8.61        | 14.74

WER comparison with human transcribers:

Test set                             | Ours  | Human
Read: WSJ eval'92                    | 3.10  | 5.03
Read: WSJ eval'93                    | 4.42  | 8.08
Read: LibriSpeech test-clean         | 5.15  | 5.83
Read: LibriSpeech test-other         | 12.73 | 12.69
Accented: VoxForge American-Canadian | 7.94  | 4.85
Accented: VoxForge Commonwealth      | 14.85 | 8.15
Accented: VoxForge European          | 18.44 | 12.76
Accented: VoxForge Indian            | 22.89 | 22.15
Noisy: CHiME eval real               | 21.59 | 11.84
Noisy: CHiME eval sim                | 42.55 | 31.33

Image from Amodei et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin", ICML 2016
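A sketch of the best-performing front-end from the first table: three 2D convolution layers over (frequency x time) with channels 32, 32, 96, filters 41x11, 21x11, 21x11, and strides 2x2, 2x1, 2x1. The padding and input sizes below are my own illustrative choices, not specified on the slide:

```python
import torch
import torch.nn as nn

frontend = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
    nn.ReLU(),
    nn.Conv2d(32, 96, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
    nn.ReLU(),
)

spec = torch.randn(1, 1, 161, 300)   # (batch, channel, freq bins, time frames)
print(frontend(spec).shape)          # torch.Size([1, 96, 21, 150])
```

Note how the strides downsample aggressively along frequency (2, 2, 2) but only once along time (2, 1, 1), keeping enough temporal resolution for the RNN layers above.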
TTS: WaveNet
• Speech synthesis using an auto-regressive generative model
• Generates the waveform sample by sample: 16kHz sampling rate
Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Causal Convolutions
• Fully convolutional
• Prediction at timestep t cannot depend on any future timesteps
(Figure: stacked causal convolution layers, Input -> Hidden Layer -> Hidden Layer -> Hidden Layer -> Output)
Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
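A minimal sketch of causality in a 1D convolution: pad only on the left, so the output at timestep t never sees inputs from t+1 onwards. The channel and kernel sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution made causal via left-only padding."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                 # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))       # left-pad only: no future leakage
        return self.conv(x)

x = torch.randn(1, 16, 100)
print(CausalConv1d(16, kernel_size=2)(x).shape)  # torch.Size([1, 16, 100])
```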
Dilated Convolutions
• WaveNet uses "dilated convolutions"
• Enables the network to have very large receptive fields
Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
https://techcrunch.com/2017/10/04/googles-wavenet-machine-learning-based-speech-synthesis-comes-to-assistant/
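Reusing the CausalConv1d sketch above: stacking causal convolutions with dilations 1, 2, 4, 8 (a pattern like WaveNet's) grows the receptive field exponentially with depth while the parameter count grows only linearly. The layer sizes are illustrative:

```python
import torch

dilations = (1, 2, 4, 8)
layers = [CausalConv1d(16, kernel_size=2, dilation=d) for d in dilations]

# Receptive field of the stack: 1 + sum over layers of (kernel_size - 1) * dilation.
receptive_field = 1 + sum((2 - 1) * d for d in dilations)
print(receptive_field)   # 16: each output sample sees 16 input timesteps

x = torch.randn(1, 16, 100)
for layer in layers:
    x = torch.relu(layer(x))
print(x.shape)           # torch.Size([1, 16, 100]); sequence length preserved
```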
Conditional WaveNet
• Condition the model on input variables to generate audio with the required characteristics
• Global conditioning: the same representation is used to influence all timesteps
• Local conditioning: use a second timeseries for conditioning
Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
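A sketch of global conditioning in WaveNet's gated activation unit, z = tanh(W_f * x + V_f h) * sigmoid(W_g * x + V_g h), where h is a single vector (e.g. a speaker identity embedding) broadcast over all timesteps. This reuses the CausalConv1d sketch above; all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class GloballyConditionedGatedUnit(nn.Module):
    """Gated activation with a global conditioning vector h."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv_f = CausalConv1d(channels, kernel_size=2)   # filter branch
        self.conv_g = CausalConv1d(channels, kernel_size=2)   # gate branch
        self.proj_f = nn.Linear(cond_dim, channels)
        self.proj_g = nn.Linear(cond_dim, channels)

    def forward(self, x, h):             # x: (B, C, T), h: (B, cond_dim)
        f = self.conv_f(x) + self.proj_f(h).unsqueeze(-1)  # broadcast over time
        g = self.conv_g(x) + self.proj_g(h).unsqueeze(-1)
        return torch.tanh(f) * torch.sigmoid(g)

x, h = torch.randn(1, 16, 100), torch.randn(1, 8)
print(GloballyConditionedGatedUnit(16, 8)(x, h).shape)  # torch.Size([1, 16, 100])
```

For local conditioning, h would instead be a second timeseries (e.g. linguistic features upsampled to the audio rate), added per timestep rather than broadcast.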
Tacotron
Architecture (bottom to top): character embeddings -> Pre-net -> CBHG encoder -> attention RNN and decoder RNNs (attention is applied to all decoder steps; decoding starts from a <GO> frame, with seq2seq targets of r=3 frames per step) -> CBHG -> linear-scale spectrogram -> Griffin-Lim reconstruction
Image from Wang et al., "Tacotron: Towards end-to-end speech synthesis", 2017. https://arxiv.org/pdf/1703.10135.pdf
Tacotron: CBHG Module
Components (bottom to top): Conv1D bank + stacking -> max-pool along time (stride=1) -> Conv1D projections -> residual connection -> highway layers -> bidirectional RNN
Image from Wang et al., "Tacotron: Towards end-to-end speech synthesis", 2017. https://arxiv.org/pdf/1703.10135.pdf
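A simplified sketch of that stack, not the paper's exact hyperparameters: for brevity the conv bank uses only odd kernel sizes (so 'same' padding is straightforward) and a single highway layer stands in for the full stack:

```python
import torch
import torch.nn as nn

class CBHGSketch(nn.Module):
    def __init__(self, dim, bank_ks=(1, 3, 5, 7)):
        super().__init__()
        # Conv1D bank: parallel convs with increasing kernel sizes, stacked.
        self.bank = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in bank_ks)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
        self.proj = nn.Conv1d(dim * len(bank_ks), dim, 3, padding=1)
        self.highway = nn.Linear(dim, 2 * dim)
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x):                   # x: (B, T, dim)
        y = x.transpose(1, 2)               # (B, dim, T) for Conv1d
        y = torch.cat([torch.relu(c(y)) for c in self.bank], dim=1)
        y = self.pool(y)[:, :, :x.size(1)]  # stride-1 max-pool keeps length
        y = self.proj(y).transpose(1, 2)    # Conv1D projection, back to (B, T, dim)
        y = y + x                           # residual connection
        h, t = self.highway(y).chunk(2, dim=-1)
        gate = torch.sigmoid(t)
        y = gate * torch.relu(h) + (1 - gate) * y   # one highway layer
        out, _ = self.rnn(y)                # bidirectional RNN
        return out                          # (B, T, dim)

print(CBHGSketch(128)(torch.randn(2, 50, 128)).shape)  # torch.Size([2, 50, 128])
```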
Grapheme to phoneme (G2P) conversion
Grapheme to phoneme (G2P) conversion
• Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)
• Learn G2P mappings from a pronunciation dictionary
• Useful for:
  • ASR systems in languages with no pre-built lexicons
  • Speech synthesis systems
  • Deriving pronunciations for out-of-vocabulary (OOV) words
G2P conversion (I)
• One popular paradigm: joint sequence models [BN08]
• Grapheme and phoneme sequences are first aligned using an EM-based algorithm
• This results in a sequence of graphones (joint G-P tokens)
• N-gram models are trained on these graphone sequences
• WFST-based implementation of such a joint graphone model [Phonetisaurus]
[BN08]: Bisani & Ney, "Joint-sequence models for grapheme-to-phoneme conversion", Speech Communication 2008
[Phonetisaurus]: J. Novak, Phonetisaurus Toolkit
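A toy sketch of the graphone idea: given grapheme-phoneme alignments (which the joint sequence model learns with EM; they are hand-written here for illustration), each word becomes a sequence of joint G-P tokens, and a standard n-gram model is trained over those tokens. Pronunciations below use ARPAbet-style symbols:

```python
from collections import Counter

aligned_lexicon = [
    [("c", "K"), ("a", "AE"), ("t", "T")],              # cat   -> K AE T
    [("c", "K"), ("a", "AA"), ("r", "R")],              # car   -> K AA R
    [("ph", "F"), ("o", "OW"), ("n", "N"), ("e", "")],  # phone -> F OW N
]

def graphones(alignment):
    """Turn an aligned (grapheme chunk, phoneme) list into joint G-P tokens."""
    return [f"{g}:{p}" for g, p in alignment]

# Count bigrams over graphone tokens, a stand-in for n-gram LM training.
bigrams = Counter()
for alignment in aligned_lexicon:
    tokens = ["<s>"] + graphones(alignment) + ["</s>"]
    bigrams.update(zip(tokens, tokens[1:]))

print(bigrams.most_common(3))   # ('<s>', 'c:K') occurs twice in this toy lexicon
```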
G2P conversion (II)
• Neural network based methods are the new state of the art for G2P:
• Bidirectional LSTM-based networks using a CTC output layer [Rao15]; comparable to n-gram models
• Incorporating alignment information [Yao15]; beats n-gram models
• Encoder-decoder with attention, requiring no alignment; beats the above systems [Toshniwal16]