ASR Chapter 9: Feature Representation Learning in Deep Neural Networks
조성재 (Interdisciplinary Program in Cognitive Science), 강기천 (Interdisciplinary Program in Cognitive Science)
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Abstract for Chapter 9
Deep neural networks jointly learn the feature representation and the classifier. Through many layers of nonlinear processing, DNNs transform the raw input feature into a more invariant and discriminative representation that can be better classified by the log-linear model. DNNs learn a hierarchy of features: the lower-level features typically capture local patterns, which are very sensitive to changes in the raw feature, while the higher-level features are built upon the low-level features and are more abstract and invariant to variations in the raw feature. We demonstrate that the learned high-level features are robust to speaker and environment variations.
ASR Chapter 9: Feature Representation Learning in Deep Neural Networks (Part 1)
조성재 (Interdisciplinary Program in Cognitive Science)
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Contents
9.1 Joint Learning of Feature Representation and Classifier
9.2 Feature Hierarchy
9.3 Flexibility in Using Arbitrary Input Features
9.1 Joint Learning of Feature Representation and Classifier
Deep vs. shallow models
- Deep models: DNNs
- Shallow models: GMM, SVM
Comparing the performance of these models in speech recognition
- DNN > GMM
- DNN > SVM
Why? Because DNNs are able to learn complicated feature representations and classifiers jointly.
9.1 Joint Learning of Feature Representation and Classifier
Feature engineering
- In conventional shallow models (GMMs, SVMs), feature engineering is the key to the success of the system.
- The practitioner's main job is to construct features that perform well.
- Better features often come from someone who has great domain knowledge.
Examples of feature sets from feature engineering
- SIFT (scale-invariant feature transform), in image recognition
- MFCC (mel-frequency cepstral coefficients), in speech recognition
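As a concrete illustration of such a hand-engineered front end, the sketch below computes MFCCs with the librosa library. This is only a minimal example: the library choice, the file name "utterance.wav", and the parameter values are assumptions for illustration, not part of the slides.

```python
# Minimal sketch of hand-engineered feature extraction (MFCCs) with librosa.
# Assumptions: librosa is installed and "utterance.wav" is a hypothetical audio file.
import librosa

# Load the waveform at a 16 kHz sampling rate (typical for ASR front ends).
waveform, sample_rate = librosa.load("utterance.wav", sr=16000)

# Compute 13 mel-frequency cepstral coefficients per frame.
# Result shape: (13, num_frames); each column is the feature vector of one frame.
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)
print(mfcc.shape)
```

In the shallow-model pipeline, a feature matrix like this (often with delta and delta-delta features appended) is what the GMM or SVM classifier actually sees.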
9.1 Joint Learning of Feature Representation and Classifier
Deep models such as DNNs, however, do not require hand-crafted high-level features.
- Good raw features still help, though, since the existing DNN learning algorithms may otherwise produce an underperforming system.
DNNs automatically learn the feature representations and classifiers jointly.
9.1 Joint Learning of Feature Representation and Classifier
In the DNN, the combination of all hidden layers can be considered a feature learning module.
- The composition of simple nonlinear transformations results in a very complicated nonlinear transformation.
The last layer = a softmax layer = a simple log-linear classifier = a maximum entropy (MaxEnt) model
Fig. 9.2 DNN: a joint feature representation and classifier learning view
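A minimal NumPy sketch of this view, assuming sigmoid hidden layers and randomly initialized weights (all sizes and names here are illustrative, not taken from the slides): the hidden layers act as the feature-learning module, and the final softmax layer is the log-linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    a = a - a.max()               # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

# Hypothetical sizes: 39-dim input (e.g., MFCC + deltas), two hidden layers, 10 classes.
sizes = [39, 512, 512, 10]
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]

p = rng.normal(size=39)           # an observation vector

# Feature-learning module: the hidden layers transform p into the learned feature.
feature = p
for W, b in zip(weights[:-1], biases[:-1]):
    feature = sigmoid(W @ feature + b)

# Log-linear classifier: the softmax layer maps the learned feature to class posteriors.
q = softmax(weights[-1] @ feature + biases[-1])
print(q.sum())                    # posterior over the 10 classes, sums to 1
```

In training, both parts are updated together by backpropagation, which is what "joint learning of feature representation and classifier" means in practice.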
9.1 Joint Learning of Feature Representation and Classifier
In a DNN, the estimation of the posterior probability $q(z = t \mid \mathbf{p})$, where $t$ is the target class and $\mathbf{p}$ is the observation vector, can be considered (i.e., interpreted) as a two-step nonstochastic process:
- Step 1: Transformation $\mathbf{p} \to \mathbf{w}^{1} \to \mathbf{w}^{2} \to \mathbf{w}^{3} \to \cdots \to \mathbf{w}^{M-1}$
- Step 2: $q(z = t \mid \mathbf{p})$ is estimated from $\mathbf{w}^{M-1}$ using the log-linear model.
Log-linear model (https://en.wikipedia.org/wiki/Log-linear_model):
  $g \leftarrow \exp\bigl(d + \sum_{j} x_j f_j(Y)\bigr)$
  - $Y$: variables
  - $f_j(Y)$: quantities that are functions of the variable $Y$
  - $g$: the modeled quantity
  - $d, x_j$: model parameters
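To connect the two steps with the generic log-linear form above, the block below writes the DNN posterior as a normalized log-linear model over the learned feature $\mathbf{w}^{M-1}$. The class-indexed parameters $d_t$ and $x_{tj}$ are assumed names extending the slide's notation; they are not defined in the slide itself.

```latex
% How the softmax output layer instantiates the generic log-linear form.
% d_t (class bias) and x_{tj} (class weights) are assumed symbols.
\begin{align*}
  q(z = t \mid \mathbf{p})
    &= \frac{\exp\bigl(d_t + \sum_{j} x_{tj}\, w_j^{M-1}\bigr)}
            {\sum_{l=1}^{L} \exp\bigl(d_l + \sum_{j} x_{lj}\, w_j^{M-1}\bigr)}
\end{align*}
% Mapping to the generic form g = exp(d + sum_j x_j f_j(Y)):
%   Y       -> the observation vector p
%   f_j(Y)  -> the learned feature w_j^{M-1}, computed by the first M-1 layers
%   d, x_j  -> the bias and weights of the softmax (top) layer
% followed by normalization over the L classes.
```

The key difference from a conventional log-linear model is that the feature functions $f_j$ are not hand-designed; they are the outputs of the first $M-1$ layers and are learned from data.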
9.1 Joint Learning of Feature Representation and Classifier
A MaxEnt model [25] estimates the optimal model $q^{*}$ with the following scheme:
  $q^{*} = \arg\max_{q \in Q} I(q)$   (maximization of entropy)
  where the entropy is $I(q) = -\sum_{y} q(y) \log q(y)$.
The last layer of the DNN becomes a MaxEnt model because of the softmax layer:
  $q(y) \leftarrow q(z = t \mid \mathbf{p}) = \dfrac{\exp(A_t)}{\sum_{l=1}^{L} \exp(A_l)}$
In the conventional MaxEnt model, the features are manually designed; in the DNN's MaxEnt model, the features are learned automatically.
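A small NumPy check of the log-linear property of the softmax layer: the log-posterior differs from the top-layer activation $A_t$ only by a normalizer shared across classes, so it is linear in the activations. The activation values below are made up for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())       # shift by max for numerical stability
    return e / e.sum()

A = np.array([1.2, -0.3, 0.7, 2.1])       # hypothetical top-layer activations A_l
q = softmax(A)                             # q(z = t | p) for each class t

log_normalizer = np.log(np.exp(A).sum())   # log sum_l exp(A_l)
# log q(z = t | p) = A_t - log_normalizer, i.e., linear in A_t up to a shared constant.
print(np.allclose(np.log(q), A - log_normalizer))   # True
```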
9.1 Joint Learning of Feature Representation and Classifier
Manual feature construction works fine for tasks (group 1) where people can easily inspect the data and know what features to use, but not for tasks (group 2) whose raw features are highly variable.
- Group 2 example: speech recognition
In DNNs, however, the features are defined by the first $M-1$ layers and are jointly learned with the MaxEnt model from the data automatically.
- DNNs eliminate tedious manual feature construction.
- DNNs have the potential to extract good (i.e., invariant and discriminative) features, which are impossible to construct manually.
9.1 Joint Learning of Feature Representation and Classifier – Questions (O)
Q. Could you recommend a paper that theoretically shows that deep neural networks are good at feature representation learning? – 변석현
The universal approximation theorem (Hornik et al., 1989; Cybenko, 1989): "Regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function." → DNNs are good at learning a complex function.
A deep neural network has a hierarchical composition of features.
- In a neural net, weighted sums and activation functions combine features to generate the features of the next layer.
- Using deeper models can reduce the number of units required to represent the desired function.
[References]
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). "Universal Approximation Properties and Depth." Deep Learning. The MIT Press. pp. 197–200.
- Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.
- Yu, D., Seltzer, M. L., Li, J., Huang, J.-T., and Seide, F. (2013). Feature learning in deep neural networks: studies on speech recognition tasks. In: Proceedings of the ICLR.
9.2 Feature Hierarchy
DNNs learn feature representations that are suitable for the classifier.
DNNs learn a feature hierarchy. What is a feature hierarchy?
- Feature hierarchy: raw input feature → low-level features → higher-level features
- Low-level features capture local patterns, which are very variant/sensitive to changes in the input features.
- Higher-level features are built on the low-level features; they are more abstract than the low-level features and invariant/robust to input feature variations.
9.2 Feature Hierarchy
Figure: the feature hierarchy learned from the ImageNet dataset
9.2 Feature Hierarchy
A neuron is saturated if its activation $w_j^{m}$ satisfies $w_j^{m} < 0.01$ or $w_j^{m} > 0.99$.
- The lower layers have a small percentage of saturated neurons; the higher layers have a large percentage of saturated neurons (most of them with $w_j^{m} < 0.01$).
The training label is a one-hot vector → the training label is sparse → the associated features are sparse → the majority of the saturated neurons are deactivated.
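The saturation statistic above can be computed directly from sigmoid activations. The sketch below uses randomly generated activations as a stand-in; in a real analysis they would come from forward passes of a trained DNN over a dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def saturation_rate(activations, low=0.01, high=0.99):
    """Fraction of sigmoid outputs that are saturated (< low or > high)."""
    saturated = (activations < low) | (activations > high)
    return saturated.mean()

# Stand-in activations for a few hidden layers (rows: frames, columns: neurons).
# Larger pre-activation spread in higher layers mimics the trend described above.
fake_layers = {m: 1.0 / (1.0 + np.exp(-rng.normal(0, m, (1000, 512)))) for m in (1, 3, 5)}

for m, acts in fake_layers.items():
    print(f"layer {m}: {100 * saturation_rate(acts):.1f}% saturated")
```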
9.2 Feature Hierarchy
The magnitude of the majority of the weights is typically very small: in all layers except the input layer (layer 1), the magnitude of 98% of the weights is less than 0.5.
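Likewise, the weight-magnitude statistic can be measured per layer. A hedged sketch over randomly initialized stand-in weight matrices (in practice, the matrices would be taken from a trained DNN):

```python
import numpy as np

rng = np.random.default_rng(0)

def small_weight_fraction(weight_matrix, threshold=0.5):
    """Fraction of weights whose magnitude is below the threshold."""
    return (np.abs(weight_matrix) < threshold).mean()

# Stand-in weight matrices; the input layer is given a wider spread to mimic
# the observation that it behaves differently from the higher layers.
layers = {1: rng.normal(0, 1.0, (512, 39)),
          2: rng.normal(0, 0.2, (512, 512)),
          3: rng.normal(0, 0.2, (512, 512))}

for idx, W in layers.items():
    print(f"layer {idx}: {100 * small_weight_fraction(W):.1f}% of |w| < 0.5")
```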