
EN.601.467/667 Introduction to Human Language Technology: Deep Learning II



  1. EN. 601.467/667 Introduction to Human Language Technology Deep Learning II Shinji Watanabe 1

  2. Today’s agenda • Basics of (deep) neural network • How to integrate DNN with HMM • Recurrent neural network • Language modeling • Attention based encoder-decoder • Machine translation or speech recognition 2

  3. Deep neural network • Very large combination of linear classifiers • Output: HMM state or phoneme (e.g., a, i, u, w, N), $t_n = k$, with ~30 to ~10,000 output units • ~7 hidden layers, 2048 units each • Input: speech features $\mathbf{p}_n$, ~50 to ~1000 dim 3

  4. Feed-forward neural networks • Affine transformation and non-linear activation function (sigmoid function) • Apply the above transformation L times • Softmax operation to get the probability distribution 4
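
As an illustration of the three operations above, here is a minimal NumPy sketch of the forward pass: L affine + sigmoid layers followed by an affine + softmax output. The layer sizes, initialization, and variable names are illustrative assumptions, not taken from the course materials.

```python
import numpy as np

def sigmoid(y):
    # Elementwise sigmoid: maps R -> (0, 1)
    return 1.0 / (1.0 + np.exp(-y))

def softmax(y):
    # Subtract the max for numerical stability; the output sums to one
    e = np.exp(y - np.max(y))
    return e / e.sum()

def forward(p, weights, biases):
    """Forward pass: L affine + sigmoid layers, then an affine + softmax output."""
    h = p
    for W, c in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + c)                        # affine transformation + non-linearity
    return softmax(weights[-1] @ h + biases[-1])      # class posterior distribution

# Illustrative sizes: 40-dim speech feature, two hidden layers, 100 output classes
rng = np.random.default_rng(0)
sizes = [40, 2048, 2048, 100]
weights = [rng.normal(0, 0.01, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

q = forward(rng.normal(size=40), weights, biases)
print(q.shape, q.sum())   # (100,) and ~1.0 -- a valid probability distribution
```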

  5. Linear operation • Transforms an $E^{(l-1)}$-dimensional input into an $E^{(l)}$-dimensional output: $g(\mathbf{i}^{(l-1)}) = \mathbf{X}^{(l)} \mathbf{i}^{(l-1)} + \mathbf{c}^{(l)}$ • $\mathbf{X}^{(l)} \in \mathbb{R}^{E^{(l)} \times E^{(l-1)}}$: linear transformation matrix • $\mathbf{c}^{(l)} \in \mathbb{R}^{E^{(l)}}$: bias vector • Derivatives of the $j$-th output $g_j = \sum_{j'} x_{jj'}\, i_{j'}^{(l-1)} + c_j$: $\partial g_j / \partial c_{j'} = \varepsilon(j, j')$ and $\partial g_j / \partial x_{j'k} = \varepsilon(j, j')\, i_k^{(l-1)}$ 5

  6. DNN model size • Mainly determined by the number of dimensions (units) $E$ per layer and the number of layers $L$ • Which one makes the model larger, and by how much? 6
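
To make the question concrete, here is a quick parameter count (a sketch; the 2048-unit / 7-layer numbers echo the earlier slide, while the input and output sizes are assumptions). Each layer contributes $E^{(l)} \times E^{(l-1)}$ weights plus $E^{(l)}$ biases, so widening grows the hidden-layer cost quadratically while deepening grows it only linearly.

```python
def num_params(sizes):
    # Each layer l contributes E(l) x E(l-1) weights plus E(l) biases
    return sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))

base   = [40] + [2048] * 7  + [10000]   # ~7 hidden layers, 2048 units each
wider  = [40] + [4096] * 7  + [10000]   # double the units per layer
deeper = [40] + [2048] * 14 + [10000]   # double the number of layers

for name, sizes in [("base", base), ("2x wider", wider), ("2x deeper", deeper)]:
    print(f"{name:9s}: {num_params(sizes):,} parameters")
```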

  7. Sigmoid function • Sigmoid function: $\tau(y) = 1 / (1 + e^{-y})$ • Converts the domain from $\mathbb{R}$ to $[0, 1]$ • Elementwise sigmoid function: applied to each component of a vector • No trainable parameters in general • Derivative: $\partial \tau(y) / \partial y = \tau(y)(1 - \tau(y))$ 7

  8. Sigmoid function cont’d • Plot of $\tau(y)$ • Plot of its derivative $\tau'(y) = \tau(y)(1 - \tau(y))$ 8
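
A quick numerical check of the derivative above (an illustrative sketch using a central finite difference; the test point y = 0.7 is arbitrary):

```python
import numpy as np

def tau(y):
    return 1.0 / (1.0 + np.exp(-y))

def tau_prime(y):
    # Analytic derivative from the slide: tau(y) * (1 - tau(y))
    return tau(y) * (1.0 - tau(y))

y, eps = 0.7, 1e-6
numeric = (tau(y + eps) - tau(y - eps)) / (2 * eps)   # central finite difference
print(tau_prime(y), numeric)                          # both ~ 0.2217
```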

  9. Softmax function • Softmax function: $q(k \mid \mathbf{i}) = \exp(i_k) / \sum_{k'=1}^{K} \exp(i_{k'})$ • Converts the domain from $\mathbb{R}^K$ to $[0, 1]^K$ (makes a multinomial dist. → classification) • Satisfies the sum-to-one condition, i.e., $\sum_{k=1}^{K} q(k \mid \mathbf{i}) = 1$ • $K = 2$: sigmoid function • Derivative: for $j = k$, $\partial q(k \mid \mathbf{i}) / \partial i_j = q(k \mid \mathbf{i})(1 - q(k \mid \mathbf{i}))$; for $j \neq k$, $\partial q(k \mid \mathbf{i}) / \partial i_j = -q(j \mid \mathbf{i})\, q(k \mid \mathbf{i})$ • Or, written compactly, $\partial q(k \mid \mathbf{i}) / \partial i_j = q(k \mid \mathbf{i})(\varepsilon(j, k) - q(j \mid \mathbf{i}))$, where $\varepsilon(j, k)$ is Kronecker’s delta 9
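
The compact form of the derivative can be verified numerically. Below is an illustrative sketch (the 5-dimensional random input is an assumption) comparing the analytic Jacobian $q(k \mid \mathbf{i})(\varepsilon(j, k) - q(j \mid \mathbf{i}))$ with finite differences.

```python
import numpy as np

def softmax(i):
    e = np.exp(i - i.max())
    return e / e.sum()

rng = np.random.default_rng(0)
i = rng.normal(size=5)
q = softmax(i)

# Analytic Jacobian: dq(k|i)/di_j = q(k|i) * (delta(j, k) - q(j|i))
J_analytic = np.diag(q) - np.outer(q, q)

# Finite-difference Jacobian
eps = 1e-6
J_numeric = np.empty((5, 5))
for j in range(5):
    d = np.zeros(5)
    d[j] = eps
    J_numeric[:, j] = (softmax(i + d) - softmax(i - d)) / (2 * eps)

print(np.abs(J_analytic - J_numeric).max())  # close to zero, up to numerical error
```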

  10. Why is it used for the probability distribution? 10

  11. Which functions/operations can we use, and which can we not? • Most elementary functions: $+, -, \times, \div, \log, \exp, \sin, \cos, \tan$ • Functions/operations whose derivative we cannot take, including some discrete operations • $\arg\max_X q(X \mid P)$: a basic ASR operation, but we cannot take its derivative… • Discretization 11

  12. Objective function design • We usually use the cross entropy as an objective function • Since the Viterbi sequence is a hard assignment, the summation over states is simplified 12
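
A minimal sketch of this simplification, assuming one-hot (Viterbi) state targets; the posterior values below are made up:

```python
import numpy as np

def cross_entropy(q, t_onehot):
    # Full form: -sum_k t_k * log q_k
    return -np.sum(t_onehot * np.log(q))

q = np.array([0.1, 0.7, 0.2])    # network posterior over K = 3 states
t = np.array([0.0, 1.0, 0.0])    # hard (Viterbi) assignment to state k* = 1

# Because t is one-hot, the summation over states collapses to -log q[k*]
print(cross_entropy(q, t), -np.log(q[1]))  # both ~ 0.3567
```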

  13. Other objective functions • Square error: $\|\mathbf{i}^{\text{ref}} - \mathbf{i}\|_2^2$ • We could also use a $p$-norm, e.g., the L1 norm • Binary cross entropy • Again, this is a special case of the cross entropy when the number of classes is two 13
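
A short sketch of these alternatives (all values are illustrative):

```python
import numpy as np

i_ref = np.array([1.0, 0.0, 0.5])   # reference output
i_out = np.array([0.8, 0.1, 0.4])   # network output

square_error = np.sum((i_ref - i_out) ** 2)     # squared L2 norm
l1_error     = np.sum(np.abs(i_ref - i_out))    # L1 norm as an alternative p-norm

# Binary cross entropy: cross entropy with K = 2 classes
t, q = 1.0, 0.8                                  # target and predicted probability
bce = -(t * np.log(q) + (1 - t) * np.log(1 - q))

print(square_error, l1_error, bce)  # ~0.06, ~0.4, ~0.223
```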

  14. Building blocks • Output: $t_n \in \{1, \dots, K\}$ • Softmax activation • Sigmoid activation • Linear transformation • $+, -, \exp, \log$, etc. • Input: $\mathbf{p}_n \in \mathbb{R}^D$ 14

  15. Building blocks • Output: $t_n \in \{1, \dots, K\}$ • Softmax activation • Linear transformation • Input: $\mathbf{p}_n \in \mathbb{R}^D$ 15

  16. Building blocks • Output: $t_n \in \{1, \dots, K\}$ • Softmax activation • Linear transformation • Sigmoid activation • Linear transformation • Input: $\mathbf{p}_n \in \mathbb{R}^D$ 16

  17. Building blocks • Output: $t_n \in \{1, \dots, K\}$ • Softmax activation • Sigmoid activation • Linear transformation • $+, -, \exp, \log$, etc. • Input: $\mathbf{p}_n \in \mathbb{R}^D$ 17

  18. How to optimize? Gradient descent and its variants • Take the derivative of the objective function and update the parameters with this derivative • Chain rule • Learning rate $\rho$ 19

  19. Chain rule • Chain rule: $\partial g(h(y)) / \partial y = (\partial g / \partial h)(\partial h / \partial y)$ • For example, $g(h(\iota)) = \tau(by + c)$ with parameters $\iota = \{b, c\}$, where $g(y) = \tau(y)$, $h(y) = by + c$, and $z = by + c$ • $\partial g(h(\iota)) / \partial c = \frac{\partial \tau(z)}{\partial z} \frac{\partial (by + c)}{\partial c} = \tau(z)(1 - \tau(z)) = \tau(by + c)(1 - \tau(by + c))$ • $\partial g(h(\iota)) / \partial b = \frac{\partial \tau(z)}{\partial z} \frac{\partial (by + c)}{\partial b} = \tau(by + c)(1 - \tau(by + c))\, y$ 20
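
The worked example can be checked numerically; in this sketch the values of b, c, and y are arbitrary, and the finite differences should match the chain-rule results for both parameters.

```python
import numpy as np

def tau(y):
    return 1.0 / (1.0 + np.exp(-y))

b, c, y = 0.5, -0.2, 1.3
z = b * y + c

# Chain rule: d tau(by + c)/dc = tau(z)(1 - tau(z));  d tau(by + c)/db = tau(z)(1 - tau(z)) * y
grad_c = tau(z) * (1 - tau(z))
grad_b = tau(z) * (1 - tau(z)) * y

eps = 1e-6
print(grad_c, (tau(b * y + c + eps) - tau(b * y + c - eps)) / (2 * eps))
print(grad_b, (tau((b + eps) * y + c) - tau((b - eps) * y + c)) / (2 * eps))
```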

  20. Deep neural network: nested function • Chain rule to get the derivative recursively • Each transformation (affine, sigmoid, and softmax) has an analytical derivative, and we just combine these derivatives • We can obtain the derivative with the back-propagation algorithm • Figure: linear transformation → sigmoid activation → linear transformation → softmax activation 21
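
To make the recursion concrete, here is a minimal back-propagation sketch for the pictured stack (linear → sigmoid → linear → softmax) with a cross-entropy loss; the sizes, initialization, and target index are illustrative assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, H, K = 40, 64, 10                       # feature, hidden, and output sizes (illustrative)
W1, c1 = rng.normal(0, 0.1, (H, D)), np.zeros(H)
W2, c2 = rng.normal(0, 0.1, (K, H)), np.zeros(K)
p, k_true = rng.normal(size=D), 3          # one input frame and its target class

# Forward pass, keeping intermediate values for the backward pass
a1 = W1 @ p + c1            # linear transformation
h1 = sigmoid(a1)            # sigmoid activation
a2 = W2 @ h1 + c2           # linear transformation
q = softmax(a2)             # softmax activation
loss = -np.log(q[k_true])   # cross entropy with a hard target

# Backward pass: combine the analytical derivative of each block via the chain rule
t = np.zeros(K)
t[k_true] = 1.0
d_a2 = q - t                       # derivative of softmax + cross entropy w.r.t. a2
d_W2 = np.outer(d_a2, h1)
d_c2 = d_a2
d_h1 = W2.T @ d_a2                 # back through the second linear transformation
d_a1 = d_h1 * h1 * (1 - h1)        # back through the sigmoid
d_W1 = np.outer(d_a1, p)
d_c1 = d_a1

print(loss, d_W1.shape, d_W2.shape)   # scalar loss, gradients shaped (64, 40) and (10, 64)
```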

  21. Minibatch processing • Batch processing: effective computation, but slow convergence (one update $\Theta^{\text{old}} \rightarrow \Theta^{\text{new}}$ per pass over the whole data) • Online processing: fast convergence, but very inefficient computation • Minibatch processing: something between batch and online processing; split the whole data into minibatches and update after each one ($\Theta^{\text{old}} \rightarrow \Theta^{\text{new}_1} \rightarrow \Theta^{\text{new}_2} \rightarrow \Theta^{\text{new}_3}$) 22
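
A sketch of the split (the data, batch size, and update step are placeholders): batch processing corresponds to batch_size = len(data), online processing to batch_size = 1, and minibatch processing to anything in between.

```python
import numpy as np

def minibatches(data, batch_size, rng):
    """Split the whole data into shuffled minibatches."""
    idx = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[idx[start:start + batch_size]]

data = np.arange(10)                  # stand-in for the whole training set
rng = np.random.default_rng(0)

# batch_size = len(data) -> batch processing (one update per pass over the data)
# batch_size = 1         -> online processing (one update per example)
# in between             -> minibatch processing
for batch in minibatches(data, batch_size=4, rng=rng):
    print(batch)                      # compute the minibatch gradient and update Theta here
```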

  22. How to set $\rho$? $\Theta^{\text{new}} = \Theta^{\text{old}} - \rho\, \Delta^{\text{old}}_{\text{loss}}$ • Stochastic gradient descent (SGD): use a constant value (hyper-parameter); can add some heuristic tuning (e.g., $\rho \leftarrow 0.5 \times \rho$ when the validation loss starts to degrade; the decay factor then becomes another hyperparameter) • Adam, AdaDelta, RMSProp, etc.: adaptively update $\rho(\Delta^{\text{old}_1}_{\text{loss}}, \Delta^{\text{old}_2}_{\text{loss}}, \dots)$ using current and previous gradient information; these still have hyperparameters that balance current and previous gradients • The choice of an appropriate optimizer and its hyperparameters is critical 23
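
A sketch of the halving heuristic mentioned above, on a toy one-dimensional problem; the loss, decay factor, and epoch count are illustrative, and a real recipe would monitor a held-out validation set rather than the training objective.

```python
def sgd_with_halving(theta, grad_fn, val_loss_fn, rho=0.1, decay=0.5, epochs=10):
    """Plain SGD; halve the learning rate when the validation loss degrades."""
    best_val = float("inf")
    for epoch in range(epochs):
        theta = theta - rho * grad_fn(theta)   # Theta_new = Theta_old - rho * gradient
        val = val_loss_fn(theta)
        if val > best_val:                     # validation loss started to degrade
            rho *= decay                       # the decay factor: another hyperparameter
        best_val = min(best_val, val)
    return theta

# Toy example: minimize (theta - 3)^2, using the training loss as a stand-in validation loss
theta = sgd_with_halving(0.0, grad_fn=lambda t: 2 * (t - 3), val_loss_fn=lambda t: (t - 3) ** 2)
print(theta)  # approaches 3
```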

  23. Difficulties of training • Blue: accuracy of training data (higher is better) • Orange: accuracy of validation data (higher is better) 24

  24. Today’s agenda • Basics of (deep) neural network • How to integrate DNN with HMM • Recurrent neural network • Language modeling • Attention based encoder-decoder • Machine translation or speech recognition 25
