Deep Learning Basics Lecture 8: Autoencoder & DBM Princeton University COS 495 Instructor: Yingyu Liang
Autoencoder
Autoencoder • A neural network trained to attempt to copy its input to its output • Contains two parts: • Encoder: maps the input to a hidden representation • Decoder: maps the hidden representation back to the output
Autoencoder • Diagram: input x → hidden representation h (the code) → reconstruction r
Autoencoder • Encoder f(⋅): h = f(x) • Decoder g(⋅): r = g(h) = g(f(x))
Why copy the input to the output? • We do not really care about the copying itself • Interesting case: the autoencoder is NOT able to copy exactly but strives to do so • It is then forced to select which aspects of the input to preserve, and thus can hopefully learn useful properties of the data • Historical note: goes back to LeCun (1987), Bourlard and Kamp (1988), and Hinton and Zemel (1994)
Undercomplete autoencoder • Constrain the code h to have smaller dimension than the input x • Training: minimize a loss function L(x, r) = L(x, g(f(x)))
Undercomplete autoencoder • Constrain the code h to have smaller dimension than the input x • Training: minimize a loss function L(x, r) = L(x, g(f(x))) • Special case: f, g linear, L the mean squared error • Reduces to Principal Component Analysis
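A minimal sketch (not from the lecture) of an undercomplete linear autoencoder trained with mean squared error; with linear encoder/decoder this recovers the same subspace as PCA. All names (enc, dec, code_dim) and sizes are illustrative assumptions.

```python
import torch

n, input_dim, code_dim = 1000, 20, 5           # toy sizes, chosen arbitrarily
x = torch.randn(n, input_dim)                  # stand-in data

enc = torch.nn.Linear(input_dim, code_dim, bias=False)   # f: x -> h
dec = torch.nn.Linear(code_dim, input_dim, bias=False)   # g: h -> r
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)

for step in range(2000):
    h = enc(x)                      # h = f(x)
    r = dec(h)                      # r = g(f(x))
    loss = ((x - r) ** 2).mean()    # L(x, r) = mean squared error
    opt.zero_grad()
    loss.backward()
    opt.step()
```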
Undercomplete autoencoder • What about a nonlinear encoder and decoder? • Capacity should not be too large • Suppose we are given data x^(1), x^(2), …, x^(n) • Encoder maps x^(i) to i • Decoder maps i back to x^(i) • A one-dimensional code h then suffices for perfect reconstruction, yet nothing useful about the data is learned
Regularization • Typically NOT by • keeping the encoder/decoder shallow, or • using a small code size • Regularized autoencoders: add a regularization term that encourages the model to have other properties • Sparsity of the representation (sparse autoencoder) • Robustness to noise or to missing inputs (denoising autoencoder) • Smallness of the derivative of the representation
Sparse autoencoder • Constrain the code to be sparse • Training: minimize a loss function L_R = L(x, g(f(x))) + R(h)
Probabilistic view of regularizing h • Suppose we have a probabilistic model p(h, x) • MLE on x: log p(x) = log Σ_{h′} p(h′, x) • Hard to sum over h′
Probabilistic view of regularizing h • Suppose we have a probabilistic model p(h, x) • MLE on x: max log p(x) = max log Σ_{h′} p(h′, x) • Approximation: suppose h = f(x) gives the most likely hidden representation, and Σ_{h′} p(h′, x) can be approximated by p(h, x)
Probabilistic view of regularizing h • Suppose we have a probabilistic model p(h, x) • Approximate MLE on x, with h = f(x): max log p(h, x) = max [ log p(x|h) + log p(h) ] • log p(x|h) gives the loss term; log p(h) gives the regularization term
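To spell out how a concrete prior turns into a concrete regularizer (a standard calculation added here, not taken verbatim from the slides): with a factorized Laplacian prior over the code, the negative log-prior is exactly an L1 penalty, which is the case instantiated on the next slide.

```latex
% Factorized Laplacian prior on the code h = (h_1, ..., h_d):
p(h) = \prod_{k=1}^{d} \frac{\lambda}{2}\, e^{-\lambda |h_k|}
\quad\Longrightarrow\quad
-\log p(h) = \lambda \sum_{k=1}^{d} |h_k| - d \log \frac{\lambda}{2}
           = \lambda \|h\|_1 + \text{const}.
% Using -log p(h) as R(h) therefore yields the sparsity penalty \lambda \|h\|_1.
```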
Sparse autoencoder • Constrain the code to be sparse • Laplacian prior: p(h) = Π_k (λ/2) exp(−λ|h_k|), i.e. p(h) ∝ exp(−λ‖h‖₁) • Training: minimize a loss function L_R = L(x, g(f(x))) + λ‖h‖₁
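A minimal sketch (our own, not the lecture's code) of the sparse autoencoder objective: reconstruction error plus the L1 penalty λ‖h‖₁ on the code. The architecture, λ value, and names (f, g, lam) are illustrative assumptions.

```python
import torch

input_dim, code_dim, lam = 20, 50, 1e-3        # overcomplete code; lam is an assumed value
f = torch.nn.Sequential(torch.nn.Linear(input_dim, code_dim), torch.nn.ReLU())  # encoder
g = torch.nn.Linear(code_dim, input_dim)                                        # decoder
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x = torch.randn(128, input_dim)                # stand-in batch
for step in range(1000):
    h = f(x)
    r = g(h)
    # L(x, g(f(x))) + lam * ||h||_1  (L1 averaged over the batch)
    loss = ((x - r) ** 2).mean() + lam * h.abs().sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```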
Denoising autoencoder • Traditional autoencoder: encourages g(f(⋅)) to learn the identity map • Denoising autoencoder: minimize a loss function L(x, r) = L(x, g(f(x̃))), where x̃ is x + noise
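A minimal sketch (assumptions: Gaussian corruption with standard deviation 0.3, MSE loss, a small MLP) of the denoising objective: reconstruct the clean x from a corrupted input x̃.

```python
import torch

input_dim, code_dim = 20, 10
f = torch.nn.Sequential(torch.nn.Linear(input_dim, code_dim), torch.nn.ReLU())  # encoder
g = torch.nn.Linear(code_dim, input_dim)                                        # decoder
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x = torch.randn(128, input_dim)                    # stand-in clean data
for step in range(1000):
    x_tilde = x + 0.3 * torch.randn_like(x)        # x_tilde = x + noise
    r = g(f(x_tilde))                              # reconstruct from the corrupted input
    loss = ((x - r) ** 2).mean()                   # L(x, g(f(x_tilde))): the target is the clean x
    opt.zero_grad(); loss.backward(); opt.step()
```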
Boltzmann machine
Boltzmann machine • Introduced by Ackley et al. (1985) • A general “connectionist” approach to learning arbitrary probability distributions over binary vectors • Special case of energy model: p(x) = exp(−E(x)) / Z
Boltzmann machine • Energy model: p(x) = exp(−E(x)) / Z • Boltzmann machine: special case of energy model with E(x) = −xᵀUx − bᵀx, where U is the weight matrix and b is the bias parameter
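A small numerical illustration (our own, with arbitrary toy sizes) of the Boltzmann machine energy E(x) = −xᵀUx − bᵀx and the corresponding unnormalized probability exp(−E(x)).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                            # number of binary units (toy size)
U = rng.normal(size=(d, d)); U = (U + U.T) / 2   # symmetric weight matrix
np.fill_diagonal(U, 0.0)                         # no self-connections
b = rng.normal(size=d)                           # bias vector

def energy(x):
    return -x @ U @ x - b @ x                    # E(x) = -x^T U x - b^T x

x = np.array([1, 0, 1, 1], dtype=float)
print(energy(x), np.exp(-energy(x)))             # energy and unnormalized probability
```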
Boltzmann machine with latent variables • Some variables are not observed: x = (x_v, x_h), with x_v visible and x_h hidden • E(x) = −x_vᵀ R x_v − x_vᵀ W x_h − x_hᵀ S x_h − bᵀ x_v − cᵀ x_h • Universal approximator of probability mass functions
Maximum likelihood • Suppose we are given data X = { x_v^(1), x_v^(2), …, x_v^(n) } • Maximum likelihood is to maximize log p(X) = Σ_i log p(x_v^(i)), where p(x_v) = Σ_{x_h} p(x_v, x_h) = (1/Z) Σ_{x_h} exp(−E(x_v, x_h)) • Z = Σ_{x_v, x_h} exp(−E(x_v, x_h)): the partition function, difficult to compute
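For intuition only, a brute-force sketch (our own, with tiny toy sizes and symmetry constraints on R, S omitted) that computes log p(x_v) exactly by enumerating every binary configuration. Both Z and the sum over x_h range over exponentially many terms, which is exactly why they become intractable for realistic model sizes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 3, 2                                    # toy sizes: 2**(nv+nh) configurations total
R = rng.normal(size=(nv, nv)); S = rng.normal(size=(nh, nh))
W = rng.normal(size=(nv, nh))
b = rng.normal(size=nv); c = rng.normal(size=nh)

def energy(xv, xh):
    return -xv @ R @ xv - xv @ W @ xh - xh @ S @ xh - b @ xv - c @ xh

def all_binary(n):
    return [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=n)]

# Partition function: sum over all (x_v, x_h) pairs.
Z = sum(np.exp(-energy(xv, xh)) for xv in all_binary(nv) for xh in all_binary(nh))

# Marginal likelihood of one visible vector: sum over all hidden configurations.
xv = np.array([1, 0, 1], dtype=float)
log_p = np.log(sum(np.exp(-energy(xv, xh)) for xh in all_binary(nh)) / Z)
print(log_p)                                     # log p(x_v)
```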
Restricted Boltzmann machine • Invented under the name harmonium (Smolensky, 1986) • Popularized by Hinton and collaborators under the name restricted Boltzmann machine
Restricted Boltzmann machine • Special case of Boltzmann machine with latent variables: p(v, h) = exp(−E(v, h)) / Z, where the energy function is E(v, h) = −vᵀWh − bᵀv − cᵀh, with weight matrix W and biases b, c • Partition function: Z = Σ_v Σ_h exp(−E(v, h))
Restricted Boltzmann machine • Figure from Deep Learning, Goodfellow, Bengio and Courville
Restricted Boltzmann machine • The conditional distribution is factorial: p(h|v) = p(v, h) / p(v) = Π_k p(h_k | v), with p(h_k = 1 | v) = σ(c_k + vᵀ W_{:,k}), where σ is the logistic function
Restricted Boltzmann machine • Similarly, p(v|h) = p(v, h) / p(h) = Π_i p(v_i | h), with p(v_i = 1 | h) = σ(b_i + W_{i,:} h), where σ is the logistic function
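A sketch (our own, with arbitrary toy sizes) of these two factorial conditionals and one step of block Gibbs sampling: sample all of h given v, then all of v given h, using p(h_k = 1 | v) = σ(c_k + vᵀ W_{:,k}) and p(v_i = 1 | h) = σ(b_i + W_{i,:} h).

```python
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 6, 4                                    # toy sizes
W = 0.1 * rng.normal(size=(nv, nh))
b = np.zeros(nv); c = np.zeros(nh)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v):
    p = sigmoid(c + v @ W)                       # factorial conditional over all h_k at once
    return (rng.random(nh) < p).astype(float), p

def sample_v_given_h(h):
    p = sigmoid(b + W @ h)                       # factorial conditional over all v_i at once
    return (rng.random(nv) < p).astype(float), p

v = (rng.random(nv) < 0.5).astype(float)         # random initial visible vector
h, _ = sample_h_given_v(v)                       # one block Gibbs step: v -> h -> v'
v_new, _ = sample_v_given_h(h)
```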
Deep Boltzmann machine • Special case of energy model. Take 3 hidden layers and ignore biases: p(v, h^1, h^2, h^3) = exp(−E(v, h^1, h^2, h^3)) / Z • Energy function: E(v, h^1, h^2, h^3) = −vᵀ W^1 h^1 − (h^1)ᵀ W^2 h^2 − (h^2)ᵀ W^3 h^3, with weight matrices W^1, W^2, W^3 • Partition function: Z = Σ_{v, h^1, h^2, h^3} exp(−E(v, h^1, h^2, h^3))
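A sketch (our own, with illustrative layer widths) of this 3-hidden-layer DBM energy with biases ignored; only adjacent layers interact, through W^1, W^2, W^3.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 2]                                   # toy layer widths: v, h1, h2, h3
W1, W2, W3 = (rng.normal(size=(sizes[i], sizes[i + 1])) for i in range(3))

def energy(v, h1, h2, h3):
    # E = -v^T W1 h1 - h1^T W2 h2 - h2^T W3 h3 (biases omitted, as on the slide)
    return -(v @ W1 @ h1) - (h1 @ W2 @ h2) - (h2 @ W3 @ h3)

v, h1, h2, h3 = (rng.integers(0, 2, size=s).astype(float) for s in sizes)
print(energy(v, h1, h2, h3), np.exp(-energy(v, h1, h2, h3)))   # unnormalized probability (up to 1/Z)
```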
Deep Boltzmann machine • Figure from Deep Learning, Goodfellow, Bengio and Courville