Who Can Understand Your Speech Better -- Deep Neural Network or Gaussian Mixture Model?
Dong Yu, Microsoft Research
Keynote at IWSLT 2012, 12/7/2012
Thanks to my collaborators: Li Deng, Frank Seide, Gang Li, Mike Seltzer, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Adam Eversole, George Dahl, Abdel-rahman Mohamed, Xie Chen, Hang Su, Ossama Abdel-Hamid, Eric Wang, Andrew Maas, and many more.
Demo: Real-Time Speech-to-Speech Translation
http://youtu.be/Nu-nlQqFCKg
Microsoft Chief Research Officer Dr. Rick Rashid demoed the real-time speech-to-speech translation technique at the 14th Computing in the 21st Century Conference, held in Tianjin, China, on Oct. 25, 2012.
Speech-to-Speech Translation
Three components: Speech Recognition, Machine Translation, Personalized Speech Synthesis.
Team: Frank Seide, Gang Li, Dong Yu, Li Deng, Xiaodong He, Dongdong Zhang, Mei-Yuh Hwang, Mu Li, Mohamed Abdel-Hady, Ming Zhou, Yao Qian, Frank Soong, Lijuan Wang.
Project management: Noelle Sophy, Chris Wendt.
Speech-to-Speech Translation
The speech recognition component uses an SI DNN trained with 2000 hours of SWB data:
◦ 180 million parameters
◦ 7 hidden layers, each with 2048 neurons
◦ 32k tied triphone states
◦ 11 frames of 52-dim PLP features as input
DNN-HMM Performs Very Well (Dahl, Yu, Deng, Acero 2012; Seide, Li, Yu 2011; Chen et al. 2012)

Table: Voice Search SER (24 hours training)
AM      | Setup                | Test
GMM-HMM | MPE (760 24-mixture) | 36.2%
DNN-HMM | 5 layers x 2048      | 30.1% (-17%)

Table: Switchboard WER (309 hours training)
AM      | Setup                | Hub5'00-SWB  | RT03S-FSH
GMM-HMM | BMMI (9K 40-mixture) | 23.6%        | 27.4%
DNN-HMM | 7 x 2048             | 15.8% (-33%) | 18.5% (-33%)

Table: Switchboard WER (2000 hours training)
AM          | Setup                 | Hub5'00-SWB              | RT03S-FSH
GMM-HMM (A) | BMMI (18K 72-mixture) | 21.7%                    | 23.0%
GMM-HMM (B) | BMMI + fMPE           | 19.6%                    | 20.5%
DNN-HMM     | 7 x 3076              | 14.4% (A: -34%, B: -27%) | 15.6% (A: -32%, B: -24%)
DNN-HMM Performs Very Well
Microsoft audio video indexing service (Knies, 2012):
◦ "It's a big deal. The benefits, says Behrooz Chitsaz, director of Intellectual Property Strategy for Microsoft Research, are improved accuracy and faster processor timing. He says that tests have demonstrated that the algorithm provides a 10- to 20-percent relative error reduction and uses about 30 percent less processing time than the best-of-breed speech-recognition algorithms based on so-called Gaussian Mixture Models."
Google voice search (Simonite, 2012):
◦ "Google is now using these neural networks to recognize speech more accurately, a technology increasingly important to Google's smartphone operating system, Android, as well as the search app it makes available for Apple devices (see 'Google's Answer to Siri Thinks Ahead'). 'We got between 20 and 25 percent improvement in terms of words that are wrong,' says Vincent Vanhoucke, a leader of Google's speech-recognition efforts. 'That means that many more people will have a perfect experience without errors.'"
Outline
◦ CD-DNN-HMM
◦ Invariant Features
◦ Once Considered Obstacles
◦ Other Advances
◦ Summary
Deep Neural Network
A fancy name for a multi-layer perceptron (MLP) with many hidden layers.
◦ Each sigmoidal hidden neuron follows a Bernoulli distribution.
◦ The last layer (softmax layer) follows a multinomial distribution:

$$p(l = k \mid \mathbf{h}; \theta) = \frac{\exp\left(\sum_{i=1}^{H} \lambda_{ik} h_i + a_k\right)}{Z(\mathbf{h})}$$

Training can be difficult and tricky; the optimization algorithm and strategy can be important.
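A minimal NumPy sketch of this forward computation (sigmoid hidden layers followed by a softmax output over senones); the layer sizes, random initialization, and the 9304-senone output dimension are illustrative assumptions, not the configuration used in the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)           # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def dnn_forward(x, weights, biases):
    """Sigmoid hidden layers followed by a softmax output over senone classes."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                       # Bernoulli-style hidden activations
    return softmax(h @ weights[-1] + biases[-1])     # multinomial output layer

# Illustrative sizes only: 572 = 11 frames x 52-dim features; 2048-unit hidden layers.
rng = np.random.default_rng(0)
sizes = [572, 2048, 2048, 9304]                      # 9304 senone outputs is an assumption
weights = [rng.normal(0.0, 0.05, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
posteriors = dnn_forward(rng.normal(size=(1, 572)), weights, biases)
```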
Restricted Boltzmann Machine (Hinton, Osindero, Teh 2006)
◦ Hidden layer: no within-layer connections
◦ Visible layer: no within-layer connections
The joint distribution $p(\mathbf{v}, \mathbf{h}; \theta)$ is defined in terms of an energy function $E(\mathbf{v}, \mathbf{h}; \theta)$:

$$p(\mathbf{v}, \mathbf{h}; \theta) = \frac{\exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right)}{Z}$$

$$p(\mathbf{v}; \theta) = \frac{\sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right)}{Z} = \frac{\exp\left(-F(\mathbf{v}; \theta)\right)}{Z}$$

Conditional independence:

$$p(\mathbf{h} \mid \mathbf{v}) = \prod_{j=0}^{H-1} p(h_j \mid \mathbf{v}), \qquad p(\mathbf{v} \mid \mathbf{h}) = \prod_{i=0}^{V-1} p(v_i \mid \mathbf{h})$$
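A small NumPy sketch of these factorized conditionals and one contrastive-divergence (CD-1) parameter update for a Bernoulli-Bernoulli RBM; the learning rate and the names `W`, `b_v`, `b_h` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.01):
    """One CD-1 step for a Bernoulli-Bernoulli RBM, using the factorized p(h|v) and p(v|h)."""
    ph0 = sigmoid(v0 @ W + b_h)                       # p(h_j = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden units
    pv1 = sigmoid(h0 @ W.T + b_v)                     # p(v_i = 1 | h0)
    ph1 = sigmoid(pv1 @ W + b_h)                      # p(h_j = 1 | v1)
    # Approximate gradient of log p(v): data statistics minus one-step model statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h
```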
Generative Pretraining a DNN
First learn with all the weights tied:
◦ equivalent to learning an RBM
Then freeze the first layer of weights and learn the remaining weights (still tied together):
◦ equivalent to learning another RBM, using the aggregated conditional probabilities on $\mathbf{h}^0$ as the data
◦ continue the process to train the next layer
Intuitively, $\log p(\mathbf{v})$ improves as each new layer is added and trained.
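A sketch of this greedy layer-wise procedure: train an RBM on the data, then feed its hidden-unit probabilities forward as the "data" for the next RBM. It reuses the hypothetical `cd1_update` helper from the RBM sketch above; the epoch count, learning rate, and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_stack(data, layer_sizes, epochs=10, lr=0.01):
    """Greedy generative pretraining: one RBM per hidden layer.
    `data` is (num_frames, visible_dim); relies on cd1_update() from the RBM sketch."""
    v = data
    stack = []
    for n_hidden in layer_sizes:
        n_visible = v.shape[1]
        W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            W, b_v, b_h = cd1_update(v, W, b_v, b_h, lr)
        stack.append((W, b_h))                         # initializes one DNN layer
        v = 1.0 / (1.0 + np.exp(-(v @ W + b_h)))       # h^0 probabilities become the next "data"
    return stack
```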
Discriminative Pretraining
◦ Train a single-hidden-layer DNN using BP (without running to convergence).
◦ Insert a new hidden layer and train it using BP (again without convergence).
◦ Repeat until the predefined number of layers is reached.
◦ Jointly fine-tune all layers until convergence.
Can reduce the gradient diffusion problem; guaranteed to help if done right.
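A runnable sketch of this recipe under simplifying assumptions (full-batch gradient descent, cross-entropy loss, and a fresh softmax layer created each time a hidden layer is inserted); all function names, sizes, and hyperparameters are illustrative, not the ones used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def bp_epochs(Ws, bs, x, labels, lr=0.1, epochs=2):
    """A few epochs of plain backprop (full-batch, cross-entropy) -- deliberately not to convergence."""
    for _ in range(epochs):
        acts = [x]                                    # forward pass, keeping activations
        for W, b in zip(Ws[:-1], bs[:-1]):
            acts.append(sigmoid(acts[-1] @ W + b))
        probs = softmax(acts[-1] @ Ws[-1] + bs[-1])
        delta = probs.copy()                          # dLoss/dlogits for softmax + cross-entropy
        delta[np.arange(len(labels)), labels] -= 1.0
        delta /= len(labels)
        for i in range(len(Ws) - 1, -1, -1):          # backward pass
            grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:
                delta = (delta @ Ws[i].T) * acts[i] * (1.0 - acts[i])
            Ws[i] -= lr * grad_W
            bs[i] -= lr * grad_b
    return Ws, bs

def discriminative_pretrain(x, labels, hidden_sizes, n_out):
    """Grow the net one hidden layer at a time, training briefly after each insertion,
    then jointly fine-tune all layers."""
    Ws, bs, in_dim = [], [], x.shape[1]
    for h in hidden_sizes:
        # drop the old softmax layer, add the new hidden layer plus a fresh softmax layer
        Ws, bs = Ws[:-1], bs[:-1]
        Ws += [rng.normal(0.0, 0.05, (in_dim, h)), rng.normal(0.0, 0.05, (h, n_out))]
        bs += [np.zeros(h), np.zeros(n_out)]
        Ws, bs = bp_epochs(Ws, bs, x, labels, epochs=2)   # without convergence
        in_dim = h
    return bp_epochs(Ws, bs, x, labels, epochs=20)        # joint fine-tuning

# Toy usage with random data; a real system uses senone labels from a GMM-HMM alignment.
x = rng.normal(size=(64, 572))
labels = rng.integers(0, 10, size=64)
Ws, bs = discriminative_pretrain(x, labels, hidden_sizes=[128, 128, 128], n_out=10)
```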
CD-DNN-HMM: Three Key Components (Dahl, Yu, Deng, Acero 2012)
◦ Model senones (tied triphone states) directly
◦ Many layers of nonlinear feature transformation
◦ Long window of frames
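Because the DNN models senones directly, the hybrid decoder typically converts the senone posteriors into scaled likelihoods by dividing by the senone priors (as in Dahl, Yu, Deng, Acero 2012). A minimal sketch, assuming posteriors and prior counts are already available:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, senone_counts, floor=1e-10):
    """Convert DNN senone posteriors p(s|o) into scaled likelihoods for HMM decoding:
    log p(o|s) ~ log p(s|o) - log p(s), with p(s) estimated from training-alignment counts."""
    priors = senone_counts / senone_counts.sum()
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(priors, floor))
```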
Modeling Senones is Critical

Table: 24-hr Voice Search SER (760 24-mixture senones)
Model          | monophone | senone
GMM-HMM MPE    | -         | 36.2
DNN-HMM 1 x 2K | 41.7      | 31.9
DNN-HMM 3 x 2K | 35.8      | 30.4

Table: 309-hr SWB WER (9k 40-mixture senones)
Model          | monophone | senone
GMM-HMM BMMI   | -         | 23.6
DNN-HMM 7 x 2K | 34.9      | 17.1

An ML-trained CD-GMM-HMM generated the alignment used to produce the senone and monophone labels for training the DNNs.
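A tiny sketch of that label-generation step, assuming the forced alignment is available as one tied-state id per frame plus NumPy lookup arrays mapping each tied state to its senone id and to its center monophone; all array names here are hypothetical.

```python
import numpy as np

def alignment_to_targets(frame_states, state_to_senone, state_to_monophone):
    """Map a forced alignment (one tied-state id per frame) to frame-level DNN targets."""
    frame_states = np.asarray(frame_states)
    senone_targets = state_to_senone[frame_states]         # labels for the senone DNN
    monophone_targets = state_to_monophone[frame_states]   # labels for the monophone DNN
    return senone_targets, monophone_targets
```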
Exploiting Neighbor Frames

Table: 309-hr SWB WER (GMM-HMM BMMI = 23.6%)
Model               | 1 frame | 11 frames
CD-DNN-HMM 1 x 4634 | 26.0    | 22.4
CD-DNN-HMM 7 x 2K   | 23.2    | 17.1

An ML-trained CD-GMM-HMM generated the alignment used to produce the senone labels for training the DNNs.
It may seem that 23.2% is only slightly better than 23.6%, but note that the DNN is not trained with a sequential criterion while the GMM is.
To exploit the information in neighboring frames, GMM systems need fMPE, region-dependent transformations, or a tandem structure.
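A small sketch of the 11-frame input splicing (the current frame plus 5 frames of context on each side); replicating the edge frames for padding is an assumption.

```python
import numpy as np

def splice_frames(features, context=5):
    """Stack each frame with its +/- `context` neighbors: (T, D) -> (T, (2*context+1)*D).
    Edge frames are replicated so every frame gets a full window."""
    T, D = features.shape
    padded = np.concatenate([np.repeat(features[:1], context, axis=0),
                             features,
                             np.repeat(features[-1:], context, axis=0)], axis=0)
    return np.concatenate([padded[t:t + T] for t in range(2 * context + 1)], axis=1)

# Example: 100 frames of 52-dim PLP features -> (100, 572) spliced DNN inputs.
spliced = splice_frames(np.random.default_rng(0).normal(size=(100, 52)))
```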