

  1. EN. 601.467/667 Introduction to Human Language Technology, Deep Learning I. Shinji Watanabe

  2. Today’s agenda • Introduction to deep neural networks • Basics of neural networks

  3. Short bio • Research interests • Automatic speech recognition (ASR), speech enhancement, application of machine learning to speech processing • Around 20 years of ASR experience since 2001

  4. Speech recognition evaluation metric • Word error rate (WER) • Using word-level edit distance: Reference) I want to go to the Johns Hopkins campus Recognition result) I want to go to the 10 top kids campus • # insertion errors = 1, # substitution errors = 2, # deletion errors = 0 ➡ Edit distance = 3 • Word error rate (%): edit distance (= 3) / # reference words (= 9) × 100 = 33.3% • How to compute WERs for languages that do not have word boundaries? • Chunking or using character error rate
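The WER arithmetic above is easy to reproduce. Below is a minimal sketch (not from the slides; the function name is my own) that scores the slide's example with a standard dynamic-programming edit distance:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance by dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[-1][-1]

ref = "I want to go to the Johns Hopkins campus".split()
hyp = "I want to go to the 10 top kids campus".split()
print(100.0 * edit_distance(ref, hyp) / len(ref))  # 33.3...: 3 errors / 9 words
```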

  5. 2001: when I started speech recognition… (Pallett’03, Saon’15, Xiong’16) [Figure: word error rate (WER) on the Switchboard task (telephone conversation speech), log scale from 100% down to 1%, over 1995-2016]

  6. A really bad era… • No application • No breakthrough technologies • Everyone outside speech research criticized it… • The general public didn’t know what speech recognition was

  7. Now we are at (Pallett’03, Saon’15, Xiong’16) [Figure: the same Switchboard WER plot; deep learning drives the WER down to 5.9% by 2016]

  8. Everything has changed • No application • No breakthrough technologies • Everyone outside speech research criticized it… • The general public didn’t know what speech recognition was

  9. Everything has changed • No application → voice search, smart speakers • No breakthrough technologies • Everyone outside speech research criticized it… • The general public didn’t know what speech recognition was

  10. Everything has changed • No application → voice search, smart speakers • No breakthrough technologies → deep neural networks • Everyone outside speech research criticized it… • The general public didn’t know what speech recognition was

  11. Everything has changed • No application → voice search, smart speakers • No breakthrough technologies → deep neural networks • Everyone outside speech research criticized it… → many people outside speech research know/respect it • The general public didn’t know what speech recognition was

  12. Everything has changed • No application → voice search, smart speakers • No breakthrough technologies → deep neural networks • Everyone outside speech research criticized it… → many people outside speech research know/respect it • The general public didn’t know what speech recognition was → now my wife knows what I’m doing

  13. Acoustic model • From Bayes decision theory to the acoustic model, introducing the phoneme sequence L as a hidden variable:

     argmax_W p(W | O) = argmax_W Σ_L p(W, L | O)
                       = argmax_W Σ_L p(O | L, W) p(L, W)     (Bayes' rule; p(O) does not depend on W)
                       = argmax_W Σ_L p(O | L) p(L | W) p(W)  (assuming p(O | L, W) = p(O | L))

     where p(O | L) is the acoustic model, p(L | W) the lexicon, and p(W) the language model

  14. GMM/HMM • Given HMM state j, we can represent the likelihood function as a Gaussian mixture model: p(o_t | j) = Σ_k w_{j,k} N(o_t; μ_{j,k}, Σ_{j,k}) • A deep neural network acoustic model only replaces this GMM representation with a neural network
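As a concrete picture of that replacement, here is a minimal sketch (my own function names, not the lecture's code): a GMM state likelihood next to the standard hybrid DNN/HMM trick, which is not spelled out in the slides, of turning the network posterior into a scaled likelihood via p(o | j) ∝ q(j | o) / p(j):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_state_likelihood(o, weights, means, covs):
    """p(o | HMM state j) = sum_k w_{j,k} N(o; mu_{j,k}, Sigma_{j,k})."""
    return sum(w * multivariate_normal.pdf(o, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def dnn_scaled_likelihood(posteriors, state_priors):
    """Hybrid DNN/HMM: the DNN posterior q(j | o), divided by the state
    prior p(j), drops into the HMM decoder in place of the GMM likelihood."""
    return posteriors / state_priors
```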

  15. Problem • Input MFCC vector: p_t • Output phoneme (or HMM state): t_t ∈ {/a/, /k/, …} • How do we find a probability distribution q(t_t | p_t)? • We use a large amount of paired data {(p_t, t_t)}_{t=1}^{T} to train the parameters of the distribution

  16. Very easy case • We can use a linear classifier with decision boundary b_1 p_1 + b_2 p_2 + c = 0: • /a/: b_1 p_1 + b_2 p_2 + c ≥ 0 • /k/: b_1 p_1 + b_2 p_2 + c < 0 • We can also obtain a probability with the sigmoid function σ(·): • q(/a/ | p) = σ(b_1 p_1 + b_2 p_2 + c) • q(/k/ | p) = 1 - σ(b_1 p_1 + b_2 p_2 + c) • Sigmoid function: σ(y) = 1 / (1 + e^(-y)) [Figure: a line separating /a/ from /k/ in the (p_1, p_2) plane] from http://cs.jhu.edu/~kevinduh/a/deep2014/140114-ResearchSeminar.pdf
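A minimal sketch of that classifier (the weight and feature values below are made up for illustration):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

b1, b2, c = 1.5, -2.0, 0.3         # illustrative parameters

def q_a(p1, p2):
    """q(/a/ | p) = sigmoid(b1*p1 + b2*p2 + c); q(/k/ | p) is its complement."""
    return sigmoid(b1 * p1 + b2 * p2 + c)

p1, p2 = 0.8, 0.1                  # one 2-D feature point
print(q_a(p1, p2), 1.0 - q_a(p1, p2))   # probabilities of /a/ and /k/
```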

  17. Very easy case • We can also use GMMs (although not so suitable here): q(p | /a/) = Σ_k w_k N(p; μ_k, Σ_k), q(p | /k/) = Σ_k w′_k N(p; μ′_k, Σ′_k) • Posterior: q(/a/ | p) ≈ q(p | /a/) / (q(p | /a/) + q(p | /k/)) [Figure: Gaussian contours for /a/ and /k/ in the (p_1, p_2) plane]
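And a sketch of the GMM version, with made-up parameters and equal class priors assumed so that the posterior reduces to the normalized likelihood ratio shown on the slide:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(p, weights, means, covs):
    """q(p | class) = sum_k w_k N(p; mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(p, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Illustrative 2-component GMM for /a/, 1-component GMM for /k/.
gmm_a = ([0.6, 0.4], [np.zeros(2), np.ones(2)], [np.eye(2), np.eye(2)])
gmm_k = ([1.0], [np.array([3.0, -1.0])], [np.eye(2)])

p = np.array([0.5, 0.5])
lik_a, lik_k = gmm_pdf(p, *gmm_a), gmm_pdf(p, *gmm_k)
print(lik_a / (lik_a + lik_k))     # q(/a/ | p) under equal priors
```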

  18. Getting more difficult with the GMM classifier or linear classifier [Figure: /a/ vs. /k/ samples in the (p_1, p_2) plane]

  19. Getting more difficult with the GMM classifier or linear classifier [Figure: an even harder /a/ vs. /k/ pattern in the (p_1, p_2) plane]

  20. Neural network • Combination of linear classifiers to classify complicated patterns • More layers, more complicated patterns (see the sketch below)
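A tiny hand-built example of that idea (weights chosen by hand, no training): two sigmoid units each draw one line, and a third unit combines them to separate an XOR-style pattern that no single line can:

```python
import numpy as np

sigmoid = lambda y: 1.0 / (1.0 + np.exp(-y))

# Hand-picked weights: hidden layer = two linear classifiers (roughly OR and NAND).
W1 = np.array([[ 20.0,  20.0],
               [-20.0, -20.0]])
b1 = np.array([-10.0, 30.0])
W2 = np.array([20.0, 20.0])       # output unit combines the two decisions (AND)
b2 = -30.0

def classify(p):
    h = sigmoid(W1 @ p + b1)      # two linear decisions
    return sigmoid(W2 @ h + b2)   # their combination

for p in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(p, round(float(classify(np.array(p, dtype=float)))))
# -> 0, 1, 1, 0: an XOR pattern no single linear classifier can separate
```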

  21. Neural network used in speech recognition • A very large combination of linear classifiers • Output: HMM state or phoneme, 30 ~ 10,000 units (a, i, u, w, N, …) • ~7 hidden layers, 2048 units each • Input: speech features, log mel filterbank + 11 context frames
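For scale, here is a rough forward-pass sketch of that topology in plain NumPy (sizes taken from the slide where given, otherwise assumed: 40 log-mel bins, 11 stacked frames, 7 hidden layers of 2048 sigmoid units, a 10,000-way softmax over HMM states). Real systems train these weights; the random values here only show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_mel, n_frames, n_hidden, n_layers, n_states = 40, 11, 2048, 7, 10000

dims = [n_mel * n_frames] + [n_hidden] * n_layers + [n_states]
layers = [(0.01 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(dims[:-1], dims[1:])]   # ~47M weights (~0.4 GB)

def forward(x):
    for W, b in layers[:-1]:
        x = 1.0 / (1.0 + np.exp(-(W @ x + b)))    # sigmoid hidden layers
    W, b = layers[-1]
    y = W @ x + b
    e = np.exp(y - y.max())
    return e / e.sum()                            # softmax over HMM states

x = rng.standard_normal(n_mel * n_frames)         # one stacked context window
print(forward(x).shape)                           # (10000,) state posteriors
```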

  22. Why neural networks were not the focus (same network diagram as the previous slide) 1. Very difficult to train • Batch? Online? Mini-batch? • Stochastic gradient descent • Learning rate? Scheduling? • What kind of topologies? • Large computational cost 2. The amount of training data is very critical 3. CPU -> GPU
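To make the first point concrete, a minimal mini-batch SGD loop on a toy logistic-regression problem (all data and hyperparameters invented); the batch size, learning rate, and decay schedule are exactly the knobs the slide worries about:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))               # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # toy labels
w, b = np.zeros(20), 0.0

lr, batch = 0.5, 32                               # learning rate, mini-batch size
for epoch in range(10):
    order = rng.permutation(len(X))               # shuffle each epoch
    for i in range(0, len(X), batch):
        idx = order[i:i + batch]
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))
        g = p - y[idx]                            # cross-entropy gradient
        w -= lr * (X[idx].T @ g) / len(idx)       # stochastic gradient descent
        b -= lr * g.mean()
    lr *= 0.8                                     # a simple decay schedule
```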

  23. Before deep learning (2002 – 2009) • The successes of neural networks dated from a much earlier period • People believed that GMMs were better • But gains over standard GMMs were very small

  24. [Image] from https://en.wikipedia.org/wiki/Geoffrey_Hinton

  25. When I noticed deep learning (2010) • A. Mohamed, G. E. Dahl, and G. E. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009 • Using a deep belief network for pre-training, then fine-tuning the deep neural network → provides stable estimation • This still did not fully convince me (I introduced it at NTT’s reading group)

  26. Pre-training and fine-tuning • First train the neural network parameters with a deep belief network or an autoencoder, layer by layer • Then run standard deep neural network training (fine-tuning), as sketched below [Diagram: the same topology, with 1,000 ~ 10,000 output units]
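A conceptual sketch of that two-stage recipe (a linear autoencoder stands in for the original deep-belief-network pre-training; all sizes and step counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(H, n_hidden, lr=0.01, steps=200):
    """Fit a one-layer linear autoencoder on H; keep the encoder weights."""
    W = 0.01 * rng.standard_normal((H.shape[1], n_hidden))   # encoder
    V = 0.01 * rng.standard_normal((n_hidden, H.shape[1]))   # decoder
    for _ in range(steps):
        Z = H @ W                        # encode
        G = 2.0 * (Z @ V - H) / len(H)   # gradient of reconstruction error
        dV = Z.T @ G
        dW = H.T @ (G @ V.T)
        V -= lr * dV
        W -= lr * dW
    return W, H @ W

X = rng.standard_normal((500, 440))      # e.g., stacked log-mel frames
weights, H = [], X
for n_hidden in (256, 256, 256):         # greedy layer-wise pre-training
    W, H = pretrain_layer(H, n_hidden)
    weights.append(W)
# `weights` now initializes the deep network before supervised fine-tuning.
```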

  27. Interspeech 2011 at Florence • The following three papers convinced me: • Feature extraction: F. Valente, M. Magimai-Doss, and W. Wang, “Analysis and comparison of recent MLP features for LVCSR systems,” in INTERSPEECH, 2011, pp. 1245-1248 • Acoustic model: F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in INTERSPEECH, 2011, pp. 437-440 • Language model: T. Mikolov, A. Deoras, S. Kombrink, L. Burget, and J. Černocký, “Empirical evaluation and combination of advanced language modeling techniques,” in INTERSPEECH, 2011, pp. 605-608 • I discussed this potential with my NLP colleagues at NTT, but they did not believe it (SVMs, log-linear models)

  28. Late 2012 • My first deep learning (Kaldi nnet) • Kaldi has supported DNNs since 2012 (mainly developed by Karel Vesely) • Deep belief network based pre-training • Feed-forward neural network • Sequence-discriminative training

     WER (%)                                     Hub5 ’00 (SWB)   WSJ
     GMM                                         18.6             5.6
     DNN                                         14.2             3.6
     DNN + sequence-discriminative training      12.6             3.2

