  1. Deep Learning Basics Lecture 8: Autoencoder & DBM Princeton University COS 495 Instructor: Yingyu Liang

  2. Autoencoder

  3. Autoencoder • Neural networks trained to attempt to copy their input to their output • Consist of two parts: • Encoder: maps the input to a hidden representation • Decoder: maps the hidden representation back to the output

  4. Autoencoder [Diagram: input x → hidden representation h (the code) → reconstruction r]

  5. Autoencoder [Diagram: encoder f(⋅) maps the input x to the code h; decoder g(⋅) maps h to the reconstruction r] • h = f(x), r = g(h) = g(f(x))
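
A minimal sketch of this encoder/decoder structure, written in PyTorch (the framework, layer sizes, and ReLU nonlinearity are illustrative choices, not part of the lecture):

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, input_dim=784, code_dim=32):
            super().__init__()
            # Encoder f: maps the input x to the hidden representation h
            self.f = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
            # Decoder g: maps the code h back to a reconstruction r
            self.g = nn.Linear(code_dim, input_dim)

        def forward(self, x):
            h = self.f(x)   # h = f(x)
            r = self.g(h)   # r = g(h) = g(f(x))
            return r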

  6. Why would we want to copy the input to the output? • We do not really care about the copying itself • Interesting case: the autoencoder is NOT able to copy exactly but strives to do so • It is then forced to select which aspects of the input to preserve, and thus hopefully learns useful properties of the data • Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994)

  7. Undercomplete autoencoder • Constrain the code to have smaller dimension than the input • Training: minimize a loss function L(x, r) = L(x, g(f(x))) [Diagram: x → h → r]

  8. Undercomplete autoencoder • Constrain the code to have smaller dimension than the input • Training: minimize a loss function L(x, r) = L(x, g(f(x))) • Special case: f, g linear, L the mean square error • Reduces to Principal Component Analysis
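
A minimal training loop for this objective, reusing the Autoencoder sketch above (the random data, Adam optimizer, and hyperparameters are placeholders for illustration):

    import torch
    import torch.nn.functional as F

    model = Autoencoder(input_dim=784, code_dim=32)   # code smaller than the input
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.randn(256, 784)        # stand-in for a batch of real data
    for step in range(100):
        r = model(x)                 # r = g(f(x))
        loss = F.mse_loss(r, x)      # L(x, g(f(x))), mean squared error
        opt.zero_grad()
        loss.backward()
        opt.step()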

  9. Undercomplete autoencoder • What about a nonlinear encoder and decoder? • The capacity should not be too large • Pathological example: suppose we are given data x^1, x^2, …, x^n • The encoder maps x^i to the index i • The decoder maps i back to x^i • A one-dimensional code h then suffices for perfect reconstruction, without learning anything useful about the data

  10. Regularization • Typically NOT done by • keeping the encoder/decoder shallow, or • using a small code size • Regularized autoencoders: add a regularization term that encourages the model to have other properties • Sparsity of the representation (sparse autoencoder) • Robustness to noise or to missing inputs (denoising autoencoder) • Smallness of the derivative of the representation

  11. Sparse autoencoder • Constrain the code to have sparsity • Training: minimize a loss function L_R = L(x, g(f(x))) + R(h) [Diagram: x → h → r]

  12. Probabilistic view of regularizing h • Suppose we have a probabilistic model p(h, x) • MLE on x: log p(x) = log Σ_{h'} p(h', x) • Hard to sum over h'

  13. Probabilistic view of regularizing h • Suppose we have a probabilistic model p(h, x) • MLE on x: max log p(x) = max log Σ_{h'} p(h', x) • Approximation: suppose h = f(x) gives the most likely hidden representation, and Σ_{h'} p(h', x) can be approximated by p(h, x)

  14. Probabilistic view of regularizing h • Suppose we have a probabilistic model p(h, x) • Approximate MLE on x, with h = f(x): max log p(h, x) = max [ log p(x|h) + log p(h) ] • The first term plays the role of the loss, the second of the regularization
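
The chain of approximations in slides 12–14, restated compactly in LaTeX (same notation as above):

    \begin{align*}
    \log p(x) &= \log \sum_{h'} p(h', x)
        && \text{(marginalize over the code)} \\
    &\approx \log p(h, x), \quad h = f(x)
        && \text{(keep only the most likely code)} \\
    &= \underbrace{\log p(x \mid h)}_{\text{reconstruction loss}}
       + \underbrace{\log p(h)}_{\text{regularization}}
        && \text{(chain rule of probability)}
    \end{align*}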

  15. Sparse autoencoder • Constrain the code to have sparsity • Laplacian prior: p(h) = (λ/2) exp(−(λ/2) ‖h‖_1) • Training: minimize a loss function L_R = L(x, g(f(x))) + λ ‖h‖_1
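
A sketch of this sparse objective as a loss function, reusing the Autoencoder sketch above (the name sparse_ae_loss and the default weight lam are illustrative):

    import torch
    import torch.nn.functional as F

    def sparse_ae_loss(model, x, lam=1e-3):
        # L_R = L(x, g(f(x))) + lam * ||h||_1
        h = model.f(x)                          # code h = f(x)
        r = model.g(h)                          # reconstruction r = g(h)
        recon = F.mse_loss(r, x)                # reconstruction term L
        sparsity = h.abs().sum(dim=1).mean()    # L1 norm of the code, averaged over the batch
        return recon + lam * sparsity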

  16. Denoising autoencoder • A traditional autoencoder encourages g(f(⋅)) to be the identity • Denoising autoencoder: minimize a loss function L(x, r) = L(x, g(f(x̃))), where x̃ is x + noise
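
A sketch of this denoising objective, again reusing the Autoencoder above (Gaussian corruption and its scale are illustrative choices; the slide only says "x + noise"):

    import torch
    import torch.nn.functional as F

    def denoising_loss(model, x, noise_std=0.1):
        x_tilde = x + noise_std * torch.randn_like(x)   # corrupted input x~
        r = model(x_tilde)                              # reconstruct from the corrupted input
        return F.mse_loss(r, x)                         # compare against the clean x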

  17. Boltzmann machine

  18. Boltzmann machine • Introduced by Ackley et al. (1985) • A general “connectionist” approach to learning arbitrary probability distributions over binary vectors • Special case of energy model: p(x) = exp(−E(x)) / Z

  19. Boltzmann machine • Energy model: p(x) = exp(−E(x)) / Z • Boltzmann machine: special case of energy model with E(x) = −xᵀUx − bᵀx, where U is the weight matrix and b is the bias parameter
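
A tiny NumPy illustration of this distribution over binary vectors x ∈ {0,1}^d (the dimension, random parameters, and brute-force enumeration are purely for illustration; Z sums over 2^d states, which is why it is intractable at realistic sizes):

    import itertools
    import numpy as np

    d = 4
    rng = np.random.default_rng(0)
    U = rng.normal(size=(d, d))
    U = (U + U.T) / 2                      # symmetric weight matrix (illustrative)
    b = rng.normal(size=d)

    def energy(x):
        return -x @ U @ x - b @ x          # E(x) = -x^T U x - b^T x

    states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=d)]
    Z = sum(np.exp(-energy(x)) for x in states)             # partition function
    p = {tuple(x): np.exp(-energy(x)) / Z for x in states}  # p(x) = exp(-E(x)) / Z
    print(sum(p.values()))                 # sanity check: probabilities sum to 1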

  20. Boltzmann machine with latent variables • Some variables are not observed: x = (x_v, x_h), with x_v visible and x_h hidden • E(x) = −x_vᵀ R x_v − x_vᵀ W x_h − x_hᵀ S x_h − bᵀ x_v − cᵀ x_h • Universal approximator of probability mass functions

  21. Maximum likelihood • Suppose we are given data X = {x_v^1, x_v^2, …, x_v^n} • Maximum likelihood maximizes log p(X) = Σ_i log p(x_v^i), where p(x_v) = Σ_{x_h} p(x_v, x_h) = Σ_{x_h} (1/Z) exp(−E(x_v, x_h)) • Z = Σ exp(−E(x_v, x_h)): the partition function, difficult to compute
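
To make these sums concrete, here is a brute-force computation of p(x_v) that reuses the toy energy function above, splitting its 4 units into 2 visible and 2 hidden (a sketch for illustration only; both sums grow exponentially with the number of units):

    import itertools
    import numpy as np

    d_v, d_h = 2, 2                                    # visible/hidden split of the 4 units above

    def all_states(d):
        return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=d)]

    def marginal(x_v):
        # p(x_v) = sum_{x_h} exp(-E(x_v, x_h)) / Z, with Z summing over all (x_v, x_h)
        Z = sum(np.exp(-energy(np.concatenate([v, h])))
                for v in all_states(d_v) for h in all_states(d_h))
        return sum(np.exp(-energy(np.concatenate([x_v, h]))) for h in all_states(d_h)) / Z

    print(marginal(np.array([1.0, 0.0])))              # marginal probability of one visible pattern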

  22. Restricted Boltzmann machine • Invented under the name harmonium (Smolensky, 1986) • Popularized by Hinton and collaborators under the name Restricted Boltzmann machine

  23. Restricted Boltzmann machine • Special case of a Boltzmann machine with latent variables: p(v, h) = exp(−E(v, h)) / Z, where the energy function is E(v, h) = −vᵀWh − bᵀv − cᵀh, with the weight matrix W and the biases b, c • Partition function Z = Σ_v Σ_h exp(−E(v, h))
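
A direct NumPy transcription of this energy function and its partition function (a sketch; v and h are binary vectors, and the brute-force sum is feasible only for tiny models):

    import itertools
    import numpy as np

    def rbm_energy(v, h, W, b, c):
        # E(v, h) = -v^T W h - b^T v - c^T h
        return -(v @ W @ h) - (b @ v) - (c @ h)

    def rbm_partition(W, b, c):
        # Z = sum_v sum_h exp(-E(v, h)), enumerating all binary configurations
        d_v, d_h = W.shape
        states = lambda d: itertools.product([0, 1], repeat=d)
        return sum(np.exp(-rbm_energy(np.array(v, dtype=float), np.array(h, dtype=float), W, b, c))
                   for v in states(d_v) for h in states(d_h))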

  24. Restricted Boltzmann machine [Figure from Deep Learning, Goodfellow, Bengio and Courville]

  25. Restricted Boltzmann machine • The conditional distribution is factorial: p(h|v) = p(v, h) / p(v) = Π_j p(h_j|v), and p(h_j = 1|v) = σ(c_j + vᵀW_{:,j}), where σ is the logistic function

  26. Restricted Boltzmann machine • Similarly, p(v|h) = p(v, h) / p(h) = Π_i p(v_i|h), and p(v_i = 1|h) = σ(b_i + W_{i,:} h), where σ is the logistic function
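
These factorial conditionals make block Gibbs sampling straightforward; a sketch (the sampler itself is not spelled out on the slides, and the number of steps and the shapes are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gibbs_sample(W, b, c, v0, steps=100, seed=0):
        # Alternate h ~ p(h|v) and v ~ p(v|h) using the conditionals above.
        rng = np.random.default_rng(seed)
        v = v0.astype(float).copy()
        for _ in range(steps):
            p_h = sigmoid(c + v @ W)                        # p(h_j = 1 | v) for all j
            h = (rng.random(p_h.shape) < p_h).astype(float)
            p_v = sigmoid(b + W @ h)                        # p(v_i = 1 | h) for all i
            v = (rng.random(p_v.shape) < p_v).astype(float)
        return v, h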

  27. Deep Boltzmann machine • Special case of energy model. Take 3 hidden layers and ignore the biases: p(v, h^1, h^2, h^3) = exp(−E(v, h^1, h^2, h^3)) / Z • Energy function: E(v, h^1, h^2, h^3) = −vᵀW^1 h^1 − (h^1)ᵀW^2 h^2 − (h^2)ᵀW^3 h^3, with the weight matrices W^1, W^2, W^3 • Partition function Z = Σ_{v, h^1, h^2, h^3} exp(−E(v, h^1, h^2, h^3))
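
The same three-layer energy written in NumPy (a sketch; the layer sizes are implicit in the shapes of W1, W2, W3):

    import numpy as np

    def dbm_energy(v, h1, h2, h3, W1, W2, W3):
        # E(v, h1, h2, h3) = -v^T W1 h1 - (h1)^T W2 h2 - (h2)^T W3 h3  (biases ignored)
        return -(v @ W1 @ h1) - (h1 @ W2 @ h2) - (h2 @ W3 @ h3)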

  28. Deep Boltzmann machine [Figure from Deep Learning, Goodfellow, Bengio and Courville]
