EN.601.467/667 Introduction to Human Language Technology
Deep Learning II
Shinji Watanabe
Today’s agenda
• Basics of (deep) neural networks
• How to integrate a DNN with an HMM
• Recurrent neural networks
  • Language modeling
• Attention-based encoder-decoder
  • Machine translation or speech recognition
Deep neural network
• A very large combination of linear classifiers
(Figure: a stack of layers mapping speech features to class posteriors. Input: speech features 𝐨_t, roughly 50 to 1,000 dimensions. About 7 hidden layers with 2048 units each. Output: HMM state or phoneme s_t = j, roughly 30 to 10,000 units, e.g., a, i, u, …, w, N)
Feed-forward neural networks
• Affine transformation followed by a non-linear activation function (e.g., the sigmoid function)
• Apply the above transformation L times
• Softmax operation at the output to obtain a probability distribution
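To make the three bullets above concrete, here is a minimal sketch (not from the slides) of the forward pass, assuming NumPy; the layer sizes and helper names are illustrative only.

```python
# A minimal sketch (not from the slides), assuming NumPy: an L-layer
# feed-forward pass of affine transforms + sigmoid, finished by a softmax.
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def softmax(y):
    e = np.exp(y - np.max(y))          # shift by the max for numerical stability
    return e / e.sum()

def forward(o, weights, biases):
    """weights/biases: lists of (W^(l), b^(l)); the last pair feeds the softmax."""
    h = o
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)         # affine transform + elementwise sigmoid
    return softmax(weights[-1] @ h + biases[-1])

# Toy example: a 40-dim speech feature frame, two hidden layers, 10 output classes.
rng = np.random.default_rng(0)
dims = [40, 128, 128, 10]
Ws = [rng.normal(0, 0.1, (dims[l + 1], dims[l])) for l in range(3)]
bs = [np.zeros(dims[l + 1]) for l in range(3)]
q = forward(rng.normal(size=40), Ws, bs)
print(q.sum())                          # sums to one: a probability distribution
```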
Linear operation
• Transforms the D^(l−1)-dimensional input into a D^(l)-dimensional output:
  f(𝐡^(l−1)) = 𝐖^(l) 𝐡^(l−1) + 𝐛^(l)
• 𝐖^(l) ∈ ℝ^(D^(l) × D^(l−1)): linear transformation matrix
• 𝐛^(l) ∈ ℝ^(D^(l)): bias vector
• Derivatives (writing the i-th output as ∑_j w_ij h_j + b_i):
  • ∂(∑_j w_ij h_j + b_i) / ∂b_i′ = δ(i, i′)
  • ∂(∑_j w_ij h_j + b_i) / ∂w_i′j′ = δ(i, i′) h_j′
  (δ(·, ·): Kronecker delta)
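A small numerical check (not from the slides), assuming NumPy, of the two derivative identities above; W, b, and h are the weight matrix, bias vector, and layer input.

```python
# A small numerical check (not from the slides), assuming NumPy, of
# ∂(Σ_j w_ij h_j + b_i)/∂b_i' = δ(i, i') and
# ∂(Σ_j w_ij h_j + b_i)/∂w_i'j' = δ(i, i') · h_j'.
import numpy as np

rng = np.random.default_rng(1)
D_in, D_out, eps = 3, 4, 1e-6
W, b, h = rng.normal(size=(D_out, D_in)), rng.normal(size=D_out), rng.normal(size=D_in)

y = W @ h + b            # the affine (linear) operation
i, ip, jp = 2, 2, 1      # output index i, perturbed indices i', j'

# Bias derivative: perturb b_i' and watch output y_i.
b_pert = b.copy(); b_pert[ip] += eps
num_db = ((W @ h + b_pert)[i] - y[i]) / eps
print(num_db, float(i == ip))             # ≈ δ(i, i'), which is 1.0 here

# Weight derivative: perturb w_i'j' and watch output y_i.
W_pert = W.copy(); W_pert[ip, jp] += eps
num_dw = ((W_pert @ h + b)[i] - y[i]) / eps
print(num_dw, float(i == ip) * h[jp])     # ≈ δ(i, i') · h_j'
```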
DNN model size
• Mainly determined by the number of dimensions (units) D per layer and the number of layers L
• Which one makes the model larger? How much?
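A quick worked example (not from the slides) of the question above: counting the parameters of a hypothetical 7-layer, 2048-unit network with a 10,000-class output; all sizes are illustrative.

```python
# A quick worked example (not from the slides): does adding units or adding
# layers grow the model faster?
def dnn_params(d_in, d_hidden, n_layers, n_classes):
    """Count weights + biases: input->hidden, (n_layers-1) hidden->hidden, hidden->output."""
    n = d_in * d_hidden + d_hidden
    n += (n_layers - 1) * (d_hidden * d_hidden + d_hidden)
    n += d_hidden * n_classes + n_classes
    return n

base = dnn_params(40, 2048, 7, 10000)
print(base)                                   # about 46 million parameters
print(dnn_params(40, 4096, 7, 10000) / base)  # doubling the units:  about 3.1x larger
print(dnn_params(40, 2048, 14, 10000) / base) # doubling the layers: about 1.6x larger
```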
Sigmoid function
• σ(x) = 1 / (1 + exp(−x))
• Converts the domain from ℝ to [0, 1]
• Elementwise sigmoid function: applied independently to each element of a vector
• No trainable parameters in general
• Derivative: ∂σ(x)/∂x = σ(x)(1 − σ(x))
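A small check (not from the slides), assuming NumPy, of the derivative identity above, which also shows how the gradient saturates for large |x|.

```python
# A small check (not from the slides): σ'(x) = σ(x)(1 − σ(x)), and the
# saturation of the gradient for large |x|.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central difference
analytic = sigmoid(x) * (1.0 - sigmoid(x))
print(np.allclose(numeric, analytic, atol=1e-6))   # True
print(analytic)   # near zero at x = ±10: saturated units pass almost no gradient
```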
Sigmoid function cont’d
(Figure: plots of σ(x) and its derivative σ′(x) = σ(x)(1 − σ(x)))
Softmax function
• p(j|𝐡) = exp(h_j) / ∑_{j′} exp(h_{j′})
• Converts the domain from ℝ^J to [0, 1]^J (makes a multinomial distribution → classification)
• Satisfies the sum-to-one condition, i.e., ∑_{j=1}^{J} p(j|𝐡) = 1
• J = 2: reduces to the sigmoid function
• Derivative
  • For i = j: ∂p(j|𝐡)/∂h_i = p(j|𝐡)(1 − p(j|𝐡))
  • For i ≠ j: ∂p(j|𝐡)/∂h_i = −p(i|𝐡) p(j|𝐡)
  • Or, in one expression: ∂p(j|𝐡)/∂h_i = p(j|𝐡)(δ(i, j) − p(i|𝐡)), where δ(i, j) is the Kronecker delta
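A minimal sketch (not from the slides), assuming NumPy: a numerically stable softmax plus a finite-difference check of the combined derivative expression p(j|𝐡)(δ(i, j) − p(i|𝐡)).

```python
# A minimal sketch (not from the slides): a numerically stable softmax and its
# Jacobian, entry (j, i) = p(j|h)(δ(i, j) − p(i|h)).
import numpy as np

def softmax(h):
    e = np.exp(h - np.max(h))     # subtracting the max changes nothing but avoids overflow
    return e / e.sum()

def softmax_jacobian(h):
    p = softmax(h)
    return np.diag(p) - np.outer(p, p)   # entry (j, i) = p(j)(δ(i, j) − p(i))

h = np.array([2.0, -1.0, 0.5])
p = softmax(h)
print(p, p.sum())                 # non-negative entries that sum to one

# Finite-difference check: perturb h_i and watch how each p(j) changes.
eps, jac = 1e-6, softmax_jacobian(h)
num = np.column_stack([(softmax(h + eps * np.eye(3)[i]) - p) / eps for i in range(3)])
print(np.allclose(jac, num, atol=1e-5))   # True
```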
Why is it used for the probability distribution?
• (The exponential makes every output positive and the normalization makes the outputs sum to one, while the whole function stays differentiable.)
Which functions/operations can we use, and which can we not?
• Most elementary functions can be used:
  • +, −, ×, ÷, log, exp, sin, cos, tan(), …
• Functions/operations whose derivative we cannot take, including some discrete operations, cannot be used:
  • argmax of p(W|O): the basic ASR decoding operation, but we cannot take its derivative…
  • Discretization
Objective function design
• We usually use the cross entropy as an objective function
• Since the Viterbi sequence is a hard assignment, the summation over states is simplified (only the assigned state contributes to each frame’s loss)
Other objective functions
• Squared error ‖𝐡_ref − 𝐡‖² between the reference and the network output
• We could also use a p-norm, e.g., the L1 norm
• Binary cross entropy
• Again, this is a special case of the cross entropy when the number of classes is two
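A minimal sketch (not from the slides), assuming NumPy, of the objective functions mentioned on these two slides, evaluated for a single frame with a hard (Viterbi-style) target.

```python
# A minimal sketch (not from the slides): cross entropy with a hard target,
# squared error, and binary cross entropy for one frame.
import numpy as np

def cross_entropy(p, j_ref):
    """Hard assignment: only the reference class j_ref contributes."""
    return -np.log(p[j_ref])

def squared_error(p, t_ref):
    return np.sum((p - t_ref) ** 2)          # L2; np.abs(p - t_ref).sum() gives the L1 version

def binary_cross_entropy(p1, t):
    """Two-class special case; p1 is the probability of class 1, t in {0, 1}."""
    return -(t * np.log(p1) + (1 - t) * np.log(1 - p1))

p = np.array([0.7, 0.2, 0.1])                # softmax output for one frame
print(cross_entropy(p, j_ref=0))             # about 0.357
print(squared_error(p, np.array([1.0, 0.0, 0.0])))
print(binary_cross_entropy(0.9, t=1))        # about 0.105
```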
Building blocks
(Figure: the network drawn as a stack of building blocks between the input 𝐨_t and the output s_t ∈ {1, …, J}: linear transformations, sigmoid activations, a softmax activation, and elementary operations such as +, −, exp, log)
Building blocks cont’d
(Figure: input 𝐨_t → linear transformation → softmax activation → output s_t ∈ {1, …, J}, i.e., the simplest configuration of the blocks)
Building blocks cont’d
(Figure: input 𝐨_t → linear transformation → sigmoid activation → linear transformation → softmax activation → output s_t ∈ {1, …, J}, i.e., one hidden layer added)
How to optimize? Gradient descent and its variants
• Take the derivative of the objective function and update the parameters with this derivative
• Chain rule
• Learning rate ρ
Chain rule
• Chain rule: the derivative of a composite function is the product of the derivatives of its parts
• For example, f(θ) = σ(ax + b) with parameters θ = (a, b)
  • Building pieces: f(x) = σ(x) and y = ax + b, so that f = σ(y)
  • ∂f(θ)/∂b = (∂σ(y)/∂y) · (∂(ax + b)/∂b) = σ(y)(1 − σ(y)) = σ(ax + b)(1 − σ(ax + b))
  • ∂f(θ)/∂a = (∂σ(y)/∂y) · (∂(ax + b)/∂a) = σ(ax + b)(1 − σ(ax + b)) · x
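A small numerical check (not from the slides), assuming NumPy, of the two chain-rule derivatives worked out above.

```python
# A small numerical check (not from the slides) of the chain-rule derivatives
# of f(a, b) = σ(ax + b) against finite differences.
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def f(a, b, x):
    return sigmoid(a * x + b)

a, b, x, eps = 0.8, -0.3, 1.5, 1e-6
s = sigmoid(a * x + b)

df_db = s * (1 - s)          # ∂f/∂b = σ(y)(1 − σ(y)), with y = ax + b
df_da = s * (1 - s) * x      # ∂f/∂a = σ(y)(1 − σ(y)) · x

print(np.isclose(df_db, (f(a, b + eps, x) - f(a, b, x)) / eps, atol=1e-5))  # True
print(np.isclose(df_da, (f(a + eps, b, x) - f(a, b, x)) / eps, atol=1e-5))  # True
```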
Deep neural network: nested function
• Apply the chain rule to obtain the derivative recursively
• Each transformation (affine, sigmoid, and softmax) has an analytical derivative, and we just combine these derivatives
• The derivative is obtained with the back-propagation algorithm
(Figure: linear transformation → sigmoid activation → linear transformation → softmax activation)
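A minimal sketch (not from the slides), assuming NumPy, of back-propagation through the pictured linear → sigmoid → linear → softmax stack with a cross-entropy objective; the sizes are illustrative, and the combined softmax-plus-cross-entropy gradient step is the standard shortcut.

```python
# A minimal sketch (not from the slides): back-propagation through
# linear -> sigmoid -> linear -> softmax with a cross-entropy loss,
# combining the analytical derivatives of each block via the chain rule.
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def forward_backward(o, j_ref, W1, b1, W2, b2):
    # Forward pass, keeping intermediate values for the backward pass.
    z1 = W1 @ o + b1
    h1 = sigmoid(z1)
    p = softmax(W2 @ h1 + b2)
    loss = -np.log(p[j_ref])

    # Backward pass (chain rule, block by block).
    dz2 = p.copy(); dz2[j_ref] -= 1.0        # d loss / d logits for softmax + cross entropy
    dW2, db2 = np.outer(dz2, h1), dz2
    dh1 = W2.T @ dz2                          # back through the second linear block
    dz1 = dh1 * h1 * (1.0 - h1)               # back through the sigmoid
    dW1, db1 = np.outer(dz1, o), dz1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
o, j_ref = rng.normal(size=5), 2
W1, b1 = rng.normal(0, 0.1, (8, 5)), np.zeros(8)
W2, b2 = rng.normal(0, 0.1, (4, 8)), np.zeros(4)
loss, grads = forward_backward(o, j_ref, W1, b1, W2, b2)
print(loss, grads[0].shape)                   # scalar loss, gradient shaped like W1
```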
Minibatch processing
• Batch processing: compute the gradient over the whole data, then update Θ^old → Θ^new
  • Slow convergence
  • Effective computation
• Online processing: update after every single sample
  • Fast convergence
  • Very inefficient computation
• Minibatch processing: split the whole data into minibatches and update once per minibatch (Θ^old → Θ^new,1 → Θ^new,2 → Θ^new,3 → …)
  • Something between batch and online processing
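A minimal sketch (not from the slides), assuming NumPy, of splitting the data into minibatches and updating once per minibatch; the gradient here is a placeholder, not a real model gradient.

```python
# A minimal sketch (not from the slides): one epoch of minibatch updates.
import numpy as np

def minibatches(X, T, batch_size, rng):
    idx = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], T[sel]

# Toy data: 1000 frames of 40-dim features with integer class targets.
rng = np.random.default_rng(0)
X, T = rng.normal(size=(1000, 40)), rng.integers(0, 10, size=1000)

theta = np.zeros(40)                           # stand-in for the model parameters
rho = 0.1                                      # learning rate
for x_mb, t_mb in minibatches(X, T, batch_size=32, rng=rng):
    grad = x_mb.mean(axis=0)                   # placeholder for the real gradient
    theta = theta - rho * grad                 # theta_new = theta_old - rho * gradient
```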
How to set ρ?   Θ^(new) = Θ^(old) − ρ · Δ_grad^(old)
• Stochastic Gradient Descent (SGD)
  • Use a constant value (hyper-parameter)
  • Can have some heuristic tuning (e.g., ρ ← 0.5 × ρ when the validation loss starts to degrade; the decay factor then becomes another hyperparameter)
• Adam, AdaDelta, RMSProp, etc.
  • Adaptively update ρ using the current and previous gradient information, ρ(Δ_grad^(old), Δ_grad^(old−1), …)
  • Still have hyperparameters to balance the current and previous gradient information
• The choice of an appropriate optimizer and its hyperparameters is critical
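A minimal sketch (not from the slides) of plain SGD with the halving heuristic mentioned above; the model, gradient, and validation loss are hypothetical placeholders just to show the mechanics.

```python
# A minimal sketch (not from the slides): SGD with the heuristic
# "halve rho whenever the validation loss stops improving".
import numpy as np

def sgd_with_halving(theta, grad_fn, val_loss_fn, rho=0.1, decay=0.5, epochs=20):
    best_val = np.inf
    for epoch in range(epochs):
        theta = theta - rho * grad_fn(theta)        # theta_new = theta_old - rho * gradient
        val = val_loss_fn(theta)
        if val > best_val:                          # validation loss degraded
            rho *= decay                            # heuristic: halve the learning rate
        best_val = min(best_val, val)
    return theta

# Toy quadratic "model" just to show the mechanics.
target = np.array([1.0, -2.0, 0.5])
theta = sgd_with_halving(
    np.zeros(3),
    grad_fn=lambda th: 2 * (th - target),
    val_loss_fn=lambda th: float(np.sum((th - target) ** 2)),
)
print(theta)    # approaches the target
```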
Difficulties of training
(Figure: accuracy curves during training; blue: accuracy on the training data (higher is better), orange: accuracy on the validation data (higher is better))
Today’s agenda
• Basics of (deep) neural networks
• How to integrate a DNN with an HMM
• Recurrent neural networks
  • Language modeling
• Attention-based encoder-decoder
  • Machine translation or speech recognition