Feature Representation Learning in Deep Learning Networks


  1. ASR Chapter 9: Feature Representation Learning in Deep Learning Networks. Presenters: 조성재 (Interdisciplinary Program in Cognitive Science), 강기천 (Interdisciplinary Program in Cognitive Science). SNU Spoken Language Processing Lab, Seoul National University.

  2. Abstract for Chapter 9
     Deep neural networks jointly learn the feature representation and the classifier. Through many layers of nonlinear processing, DNNs transform the raw input feature into a more invariant and discriminative representation that can be better classified by the log-linear model. DNNs learn a hierarchy of features. The lower-level features typically catch local patterns; these patterns are very sensitive to changes in the raw feature. The higher-level features are built upon the low-level features and are more abstract and invariant to the variations in the raw feature. We demonstrate that the learned high-level features are robust to speaker and environment variations.

  3. ASR Chapter 9: Feature Representation Learning in Deep Learning Networks (Part 1). Presenter: 조성재 (Interdisciplinary Program in Cognitive Science), SNU Spoken Language Processing Lab, Seoul National University.

  4. Contents
     9.1 Joint Learning of Feature Representation and Classifier
     9.2 Feature Hierarchy
     9.3 Flexibility in Using Arbitrary Input Features

  5. 9.1 Joint Learning of Feature Representation and Classifier
     Deep vs. shallow models
     - Deep models: DNNs
     - Shallow models: GMMs, SVMs
     Comparing the performance of the models in speech recognition:
     - DNN > GMM
     - DNN > SVM
     Why? Because DNNs are able to learn complicated feature representations and classifiers jointly.

  6. 9.1 Joint Learning of Feature Representation and Classifier
     Feature engineering
     - In the conventional shallow models (GMMs, SVMs), feature engineering is the key to the success of the system.
     - The practitioner's main job is to construct features that perform well.
     - Better features often come from someone who has great domain knowledge.
     Examples of feature sets from feature engineering (see the MFCC sketch below):
     - SIFT: scale-invariant feature transform, used in image recognition
     - MFCC: mel-frequency cepstral coefficients, used in speech recognition
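As a concrete example of a hand-engineered feature set, the short sketch below extracts MFCCs from a waveform. It is a minimal illustration, not from the chapter, and assumes the librosa package is available and "speech.wav" is a placeholder file path.

```python
# Minimal sketch: extracting hand-engineered MFCC features with librosa.
# Assumes librosa is installed; "speech.wav" is a placeholder path.
import librosa

# Load the waveform at a 16 kHz sampling rate (common for ASR).
waveform, sample_rate = librosa.load("speech.wav", sr=16000)

# Compute 13 mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)

print(mfcc.shape)  # (13, number_of_frames)
```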

  7. 9.1 Joint Learning of Feature Representation and Classifier
     Deep models such as DNNs, however, do not require hand-crafted high-level features.
     - Good raw features still help, though, since the existing DNN learning algorithms may otherwise produce an underperforming system.
     DNNs automatically learn the feature representations and classifiers jointly.

  8. 9.1 Joint Learning of Feature Representation and Classifier
     In the DNN, the combination of all hidden layers can be considered a feature learning module.
     The composition of simple nonlinear transformations results in a very complicated nonlinear transformation.
     The last layer is
     - a softmax layer,
     - a simple log-linear classifier,
     - a maximum entropy (MaxEnt) model.
     Fig. 9.2 DNN: a joint feature representation and classifier learning view
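To make the "hidden layers = feature extractor, last layer = log-linear classifier" view concrete, here is a minimal numpy sketch of a DNN forward pass. The layer sizes, sigmoid activation, and randomly initialized weights are illustrative assumptions, not the chapter's trained model.

```python
# Minimal sketch of the view in Fig. 9.2: the hidden layers transform the raw
# input into a learned feature vector, and the last (softmax) layer is a
# log-linear classifier on top of that feature. All numbers are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    a = a - a.max()                # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

# Raw observation vector p (e.g., a stacked window of acoustic frames).
p = rng.standard_normal(40)

# Hidden layers 1..M-1: the feature learning module.
layer_sizes = [40, 256, 256, 256]
feature = p
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.standard_normal((fan_out, fan_in)) * 0.1
    b = np.zeros(fan_out)
    feature = sigmoid(W @ feature + b)   # next hidden-layer feature

# Last layer M: a log-linear (softmax / MaxEnt) classifier over the learned feature.
num_classes = 10
W_out = rng.standard_normal((num_classes, layer_sizes[-1])) * 0.1
b_out = np.zeros(num_classes)
posterior = softmax(W_out @ feature + b_out)   # q(z = t | p)

print(posterior.sum())  # 1.0: a proper posterior distribution over classes
```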

  9. 9.1 Joint Learning of Feature Representation and Classifier
     In the DNN, the estimation of the posterior probability $q(z = t \mid \mathbf{p})$, where $t$ is the target class and $\mathbf{p}$ is the observation vector, can be interpreted as a two-step nonstochastic process:
     - Step 1: Transformation $\mathbf{p} \to \mathbf{w}^{2} \to \mathbf{w}^{3} \to \cdots \to \mathbf{w}^{M-1}$ through the hidden layers.
     - Step 2: $q(z = t \mid \mathbf{p})$ is estimated from $\mathbf{w}^{M-1}$ using the log-linear model.
     "Log-linear model" (https://en.wikipedia.org/wiki/Log-linear_model): $\exp\bigl(d + \sum_j x_j\, g_j(Y)\bigr)$, where $Y$ denotes the variables, $g_j(Y)$ are quantities that are functions of $Y$, and $d, x_j$ are model parameters.
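Writing out step 2 explicitly, with the last hidden layer output $\mathbf{w}^{M-1}$ playing the role of $g_j(Y)$: the form below is a reconstruction consistent with the slide's notation, using $d_t$ and $x_{tj}$ for the output-layer bias and weights of class $t$, and $L$ for the number of classes.

```latex
q(z = t \mid \mathbf{p})
  = \frac{\exp\!\Bigl(d_t + \sum_j x_{tj}\, w^{M-1}_j\Bigr)}
         {\sum_{l=1}^{L} \exp\!\Bigl(d_l + \sum_j x_{lj}\, w^{M-1}_j\Bigr)}
```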

  10. 9.1 Joint Learning of Feature Representation and Classifier
      A MaxEnt model [25] estimates the optimal model $q^*$ by maximizing entropy:
      - $q^* = \arg\max_{q \in Q} I(q)$
      - $I(q) = -\sum_y q(y) \log q(y)$
      The last layer of the DNN becomes a MaxEnt model because of the softmax layer:
      - $q(y) \leftarrow q(z = t \mid \mathbf{p}) = \dfrac{\exp(A_t)}{\sum_{l=1}^{L} \exp(A_l)}$, where $A_t$ is the activation of output unit $t$.
      In the conventional MaxEnt model, features are manually designed; in the DNN, the MaxEnt model and its features are obtained automatically.
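A tiny numpy illustration of these two quantities, the softmax posterior and its entropy $I(q)$; the output-layer activation values are made up, not taken from the chapter.

```python
# Illustration of the softmax posterior and its entropy I(q) = -sum_y q(y) log q(y).
# The output-layer activations A are placeholder numbers.
import numpy as np

A = np.array([2.0, 0.5, -1.0, 0.1])    # activations A_l of the output layer

q = np.exp(A - A.max())                 # softmax with a stability shift
q /= q.sum()                            # q(z = t | p) for each class t

entropy = -np.sum(q * np.log(q))        # I(q)
print(q, entropy)
```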

  11. 9.1 Joint Learning of Feature Representation and Classifier
      Manual feature construction works fine
      - for tasks (group 1) that people can easily inspect to decide which features to use,
      - but not for tasks (group 2) whose raw features are highly variable, e.g., speech recognition.
      In DNNs, however, the features
      - are defined by the first $M-1$ layers and
      - are jointly learned with the MaxEnt model from the data automatically.
      DNNs eliminate tedious manual feature construction and have the potential of extracting good (invariant and discriminative) features that are impossible to construct manually.

  12. 9.1 Joint Learning of Feature Representation and Classifier – Questions
      Q (변석현): Could you introduce a paper that theoretically shows that deep neural networks are good at feature representation learning?
      A: The universal approximation theorem (Hornik et al., 1989; Cybenko, 1989): "Regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function." So DNNs are good at learning complex functions (a small empirical sketch follows this slide).
      - A deep neural net has a hierarchical composition of features.
      - In a neural net, weighted sums and activation functions combine features to generate the features of the next layer.
      - Using deeper models can reduce the number of units required to represent the desired function.
      References:
      - Goodfellow, I., Bengio, Y., and Courville, A. (2016). "Universal Approximation Properties and Depth." Deep Learning. The MIT Press. pp. 197-200.
      - Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.
      - Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303-314.
      - Yu, D., Seltzer, M. L., Li, J., Huang, J.-T., and Seide, F. (2013). Feature learning in deep neural networks - studies on speech recognition tasks. In Proceedings of the ICLR.
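As a quick empirical companion to the universal approximation claim, the sketch below fits a small MLP to sin(x). It is an illustration under stated assumptions (scikit-learn available, arbitrary layer sizes), not part of the chapter or the cited papers.

```python
# Illustration of universal approximation in practice: a small MLP fit to sin(x).
# Assumes scikit-learn is installed; architecture and data are arbitrary choices.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y = np.sin(x).ravel()

mlp = MLPRegressor(hidden_layer_sizes=(64, 64), activation="tanh",
                   max_iter=2000, random_state=0)
mlp.fit(x, y)

x_test = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
error = np.max(np.abs(mlp.predict(x_test) - np.sin(x_test).ravel()))
print(f"max absolute error on [-pi, pi]: {error:.3f}")
```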

  13. 9.2 Feature Hierarchy
      DNNs learn feature representations that are suitable for the classifier.
      DNNs learn a feature hierarchy. What is a feature hierarchy?
      Feature hierarchy: raw input feature → low-level features → higher-level features
      - Low-level features catch local patterns. Local patterns are very sensitive to changes in the input features.
      - Higher-level features are built on the low-level features; they are more abstract than the low-level features and invariant/robust to the input feature variations.
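One simple way to look at the hierarchy is to read off the activations of every layer for the same input. The numpy sketch below does that; the random weights and sigmoid units are placeholders standing in for a trained DNN, not the chapter's model.

```python
# Sketch: collecting per-layer activations so low-level and high-level features
# of the hierarchy can be inspected side by side. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

layer_sizes = [40, 256, 256, 256]
weights = [rng.standard_normal((n_out, n_in)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

p = rng.standard_normal(layer_sizes[0])   # raw input feature

activations = [p]                         # level 0: the raw input
for W in weights:
    activations.append(sigmoid(W @ activations[-1]))

for level, h in enumerate(activations):
    print(f"layer {level}: {h.shape[0]} features")
```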

  14. 9.2 Feature Hierarchy
      [Figure: the feature hierarchy learned from the ImageNet dataset]

  15. 9.2 Feature Hierarchy
      A neuron is saturated if its activation satisfies $w_j^m < 0.01$ or $w_j^m > 0.99$.
      - The lower layers have a small percentage of saturated neurons.
      - The higher layers have a large percentage of saturated neurons.
      - The training label is a one-hot vector, so the training label is sparse, so the associated features are sparse, so the majority of the saturated neurons are deactivated (activation < 0.01).
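A sketch of how one might measure these saturation percentages, given sigmoid activations collected over a batch of inputs. The activation arrays here are random placeholders standing in for values collected from a trained DNN on real data.

```python
# Sketch: percentage of saturated sigmoid neurons per layer, using the
# definition activation < 0.01 or activation > 0.99. Placeholder activations.
import numpy as np

rng = np.random.default_rng(2)

# activations[m] has shape (num_frames, num_neurons) for hidden layer m.
activations = {m: rng.uniform(0.0, 1.0, size=(1000, 256)) for m in range(1, 6)}

for m, h in activations.items():
    saturated = (h < 0.01) | (h > 0.99)
    deactivated = h < 0.01
    pct_saturated = 100.0 * saturated.mean()
    pct_deactivated = 100.0 * deactivated.sum() / max(saturated.sum(), 1)
    print(f"layer {m}: {pct_saturated:.1f}% saturated, "
          f"{pct_deactivated:.1f}% of those deactivated")
```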

  16. 9.2 Feature Hierarchy
      The magnitude of the majority of the weights is typically very small.
      In all layers except the input layer (layer 1), the magnitude of 98% of the weights is less than 0.5.
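The corresponding measurement for weights is just as simple; in the sketch below the weight matrices are random placeholders for a trained DNN's parameters.

```python
# Sketch: fraction of weights whose magnitude is below 0.5, per layer.
# Weight matrices are random placeholders for trained parameters.
import numpy as np

rng = np.random.default_rng(3)
weights = {m: rng.standard_normal((256, 256)) * 0.2 for m in range(1, 6)}

for m, W in weights.items():
    pct_small = 100.0 * (np.abs(W) < 0.5).mean()
    print(f"layer {m}: {pct_small:.1f}% of weights have magnitude < 0.5")
```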
