
Machine Learning for Computational Linguistics: Autoencoders + deep learning summary (PowerPoint PPT presentation)

  1. Machine Learning for Computational Linguistics: Autoencoders + deep learning summary. Çağrı Çöltekin, University of Tübingen, Seminar für Sprachwissenschaft, July 12, 2016

  2. (Deep) neural networks so far
     • So far, we only studied supervised models
     • x is the input vector
     • y is the output vector
     • h1 . . . hm are the hidden layers (learned/useful representations)
     • The network can be fully connected, or may use sparse connectivity
     • The connections can be feed-forward, or may include recurrent links
     [Figure: a network with input units x1 . . . xm, hidden layers h1 . . . h4, and output y]

  3. Unsupervised learning in ANNs
     • Restricted Boltzmann machines (RBM): similar to latent variable models (e.g., Gaussian mixtures), consider the representations learned by hidden layers as hidden variables (h), and learn p(x, h) so as to maximize the probability of the (unlabeled) data
     • Autoencoders: train a constrained feed-forward network to reproduce its input at its output

  4. Restricted Boltzmann machines (RBMs)
     • RBMs are unsupervised latent variable models; they learn only from unlabeled data
     • They are generative models of the joint probability p(h, x)
     • They correspond to undirected graphical models
     • No links within layers
     • The aim is to learn useful features (h)
     [Figure: a bipartite graph with visible units x1 . . . x4 connected to hidden units h1 . . . h4 by weights W]
     * As usual, biases are omitted from the diagrams and the formulas.

  5. The distribution defined by RBMs
     p(h, x) = e^{h^T W x} / Z
     which is intractable (Z is difficult to calculate). But the conditional distributions are easy to calculate:
     p(h | x) = ∏_j p(h_j | x),   with p(h_j = 1 | x) = 1 / (1 + e^{−W_j x})
     p(x | h) = ∏_k p(x_k | h),   with p(x_k = 1 | h) = 1 / (1 + e^{−W_k^T h})
     (A small numerical sketch of these conditionals follows below.)
     [Figure: the same RBM, visible units x1 . . . x4 and hidden units h1 . . . h4]
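
A minimal NumPy sketch of these conditionals, assuming a toy RBM with 4 binary visible and 4 binary hidden units, random weights, and biases omitted as noted on the slide; all function and variable names here are my own choices for illustration, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy RBM: 4 visible and 4 hidden binary units, biases omitted as on the slide.
W = rng.normal(scale=0.1, size=(4, 4))   # W[j, k] connects hidden h_j and visible x_k

def p_h_given_x(x):
    # p(h_j = 1 | x) = 1 / (1 + exp(-W_j x)), computed for all j at once
    return sigmoid(W @ x)

def p_x_given_h(h):
    # p(x_k = 1 | h) = 1 / (1 + exp(-W_k^T h))
    return sigmoid(W.T @ h)

def sample(p):
    # Draw each binary unit from its Bernoulli probability
    return (rng.random(p.shape) < p).astype(float)

x = np.array([1.0, 0.0, 1.0, 1.0])
h = sample(p_h_given_x(x))          # sample a hidden configuration given x
x_recon = sample(p_x_given_h(h))    # sample a visible configuration given h
print(h, x_recon)
```

Because the hidden units are conditionally independent given x (and vice versa), a whole layer can be sampled in one vectorized call; this is what makes the conditionals easy even though Z is intractable.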

  6. Learning in RBMs: the contrastive divergence algorithm
     • We want to maximize the probability that the model assigns to the input, p(x), or equivalently minimize −log p(x)
     • In general, this is not tractable, but efficient approximate algorithms exist
     Contrastive divergence algorithm (sketched in code below):
     1. Given a training example x, calculate the probabilities of the hidden units, and sample a hidden activation h from this distribution
     2. Sample a reconstruction x′ from p(x | h), and re-sample h′ using x′
     3. Set the update rule to ∆w_ij = (x_i h_j − x′_i h′_j) ϵ
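
A sketch of one CD-1 step under the same toy assumptions as above (binary units, no biases, random weights); `eps` stands in for the learning rate ϵ, and a real implementation would loop over many examples and often use hidden probabilities rather than samples in the positive phase.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

n_hidden, n_visible = 4, 4
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
eps = 0.1                                # learning rate (epsilon in the update rule)

def cd1_update(W, x):
    """One contrastive divergence (CD-1) update for a single training example x."""
    # 1. Probabilities of the hidden units given x; sample a hidden activation h
    h = bernoulli(sigmoid(W @ x))
    # 2. Sample a reconstruction x' from p(x | h), then re-sample h' using x'
    x_prime = bernoulli(sigmoid(W.T @ h))
    h_prime = bernoulli(sigmoid(W @ x_prime))
    # 3. Update rule: delta w_ij = (x_i h_j - x'_i h'_j) * eps, for all i, j at once
    return W + eps * (np.outer(h, x) - np.outer(h_prime, x_prime))

x = np.array([1.0, 0.0, 1.0, 1.0])
W = cd1_update(W, x)
```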

  7. Autoencoders
     • Autoencoders are standard feed-forward networks
     • The main difference is that they are trained to predict their input (they try to learn the identity function)
     • The aim is to learn useful representations of the input at the hidden layer
     • Typically the weights are tied (W* = W^T); a minimal sketch follows below
     [Figure: an encoder maps inputs x1 . . . x5 to hidden units h1 . . . h3 with weights W; a decoder maps them back to reconstructions x̂1 . . . x̂5 with weights W*]
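
A minimal tied-weight autoencoder sketch in NumPy, assuming 5 inputs and 3 hidden units as in the figure; only the forward pass and the reconstruction loss are shown, and the names (`forward`, `reconstruction_loss`) are illustrative rather than from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_inputs, n_hidden = 5, 3                             # as in the figure
W = rng.normal(scale=0.1, size=(n_hidden, n_inputs))  # encoder weights; decoder reuses W.T

def forward(x, W):
    h = sigmoid(W @ x)         # encoder: hidden representation of the input
    x_hat = sigmoid(W.T @ h)   # decoder with tied weights, W* = W^T
    return h, x_hat

def reconstruction_loss(x, x_hat):
    # Training would minimize this squared reconstruction error
    return 0.5 * np.sum((x - x_hat) ** 2)

x = rng.random(n_inputs)
h, x_hat = forward(x, W)
print("reconstruction loss:", reconstruction_loss(x, x_hat))
```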

  8. Under-complete autoencoders
     • An autoencoder is said to be under-complete if there are fewer hidden units than inputs
     • The network is forced to learn a more compact (compressed) representation of the input
     • An autoencoder with a single (linear) hidden layer is equivalent to PCA (see the sketch below)
     • We need multiple layers for learning non-linear features
     [Figure: an autoencoder with inputs x1 . . . x5, hidden units h1 . . . h3, and reconstructions x̂1 . . . x̂5]
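
To make the PCA connection concrete: for linear units and squared error, the best possible under-complete reconstruction is the projection onto the top principal components. The sketch below computes that projection directly via the SVD; the made-up data and the choice k = 3 are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))       # 100 made-up examples with 5 input dimensions
Xc = X - X.mean(axis=0)             # centre the data

k = 3                               # number of "hidden units" (k < 5: under-complete)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:k]                        # top-k principal directions

H = Xc @ V_k.T                      # "encoder": compact k-dimensional codes
X_hat = H @ V_k                     # "decoder": best linear reconstruction from k codes
print("mean squared reconstruction error:", np.mean((Xc - X_hat) ** 2))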

  9. Over-complete autoencoders
     • An autoencoder is said to be over-complete if there are more hidden units than inputs
     • The network can normally memorize the input perfectly
     • This type of network is useful if trained with a regularization term resulting in sparse hidden units (e.g., L1 regularization), as sketched below
     [Figure: an autoencoder with inputs x1 . . . x3, hidden units h1 . . . h5, and reconstructions x̂1 . . . x̂3]
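
A sketch of how an L1 sparsity penalty could enter the objective of an over-complete, tied-weight autoencoder; the penalty weight `lam` and the helper names are assumptions, and only the loss computation (not the optimization) is shown.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_inputs, n_hidden = 3, 5            # more hidden units than inputs: over-complete
W = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
lam = 0.01                           # strength of the L1 sparsity penalty

def sparse_ae_loss(x, W, lam):
    h = sigmoid(W @ x)               # over-complete hidden code
    x_hat = sigmoid(W.T @ h)         # tied-weight reconstruction
    reconstruction = 0.5 * np.sum((x - x_hat) ** 2)
    sparsity = lam * np.sum(np.abs(h))   # L1 term pushes hidden activations towards 0
    return reconstruction + sparsity

x = rng.random(n_inputs)
print("regularized loss:", sparse_ae_loss(x, W, lam))
```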

  10. Denoising autoencoders
     • Instead of providing the exact input, we introduce noise by
       – randomly setting some inputs to 0 (dropout)
       – adding random (Gaussian) noise
     • The network is still expected to reconstruct the original input (without noise); both corruption schemes are sketched below
     [Figure: a corrupted input x̃ (some units set to 0) is encoded to h and decoded to a reconstruction x̂ of the clean input x]
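
The two corruption schemes can be written in a few lines of NumPy; the masking probability and noise scale below are arbitrary illustrative choices. The clean x remains the reconstruction target, while the corrupted x̃ is what the encoder sees.

```python
import numpy as np

rng = np.random.default_rng(5)

def corrupt_dropout(x, p=0.3):
    """Randomly set a fraction p of the inputs to 0."""
    mask = (rng.random(x.shape) >= p).astype(float)
    return x * mask

def corrupt_gaussian(x, sigma=0.1):
    """Add random Gaussian noise to the inputs."""
    return x + rng.normal(scale=sigma, size=x.shape)

x = rng.random(5)                 # clean input: this stays the reconstruction target
x_tilde = corrupt_dropout(x)      # corrupted input: this is what the encoder sees
print(x, x_tilde)
```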

  11. Learning manifolds
     [Figure: Goodfellow et al. (2016)]

  12. Unsupervised pre-training
     Deep belief networks or stacked autoencoders
     • A common use case for RBMs and autoencoders is as pre-training methods for supervised networks
     • Autoencoders or RBMs are trained using unlabeled data
     • The weights learned during the unsupervised learning are used for initializing the weights of a supervised network
     • This approach has been one of the reasons for the success of deep networks

  13. Deep unsupervised learning
     • Both autoencoders and RBMs can be ‘stacked’
     • Learn the weights of the first hidden layer from the data
     • Freeze the weights, and using the hidden layer activations as input, train another hidden layer, …
     • This approach is called greedy layer-wise training (see the sketch below)
     • In case of RBMs the resulting networks are called deep belief networks
     • Deep autoencoders are called stacked autoencoders
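
A sketch of greedy layer-wise pre-training with stacked autoencoders, under simplifying assumptions (sigmoid encoders, linear untied decoders, full-batch gradient descent on toy data); all names and hyperparameters are mine, not from the slides. The returned weight matrices would then initialize the hidden layers of a supervised network, as described on the previous slide.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(X, n_hidden, lr=0.1, n_epochs=200):
    """Train one autoencoder layer (sigmoid encoder, linear decoder)
    with full-batch gradient descent; return the encoder weights."""
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(n_hidden, d))   # encoder
    V = rng.normal(scale=0.1, size=(d, n_hidden))   # decoder (untied here, for simplicity)
    for _ in range(n_epochs):
        H = sigmoid(X @ W.T)                   # hidden activations, shape (n, n_hidden)
        X_hat = H @ V.T                        # reconstructions, shape (n, d)
        err = X_hat - X
        grad_V = err.T @ H / n
        grad_pre = (err @ V) * H * (1.0 - H)   # gradient at the hidden pre-activations
        grad_W = grad_pre.T @ X / n
        V -= lr * grad_V
        W -= lr * grad_W
    return W

def greedy_pretrain(X, layer_sizes):
    """Greedy layer-wise pre-training: train a layer, freeze it, and feed
    its activations to the next layer; collect the encoder weights."""
    weights, inputs = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder_layer(inputs, n_hidden)
        weights.append(W)
        inputs = sigmoid(inputs @ W.T)   # frozen layer's activations become the new input
    return weights                       # use these to initialize a supervised network

X = rng.random((200, 10))                # unlabeled toy data
pretrained = greedy_pretrain(X, [8, 5])  # two stacked hidden layers of sizes 8 and 5
print([W.shape for W in pretrained])
```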

  14. Why use pre-training?
     • Pre-training does not require labeled data
     • It can be considered a form of regularization
     • Unsupervised methods may reduce the dimensionality, allowing efficient computation in the supervised phase
     • Unsupervised learning on large-scale data may find the manifold that contains the input data, counteracting the curse of dimensionality
