photo by unsplash user @tuvaloland CMP784 DEEP LEARNING Lecture #12 – Self-Supervised Learning Aykut Erdem // Hacettepe University // Spring 2020
latent by Tom White Previously on CMP784 • Motivation for Variational Autoencoders (VAEs) • Mechanics of VAEs • Separability of VAEs • Training of VAEs • Evaluating representations • Vector Quantized Variational Autoencoders (VQ-VAEs) 2
Lecture Overview • Predictive / Self-supervised learning • Self-supervised learning in NLP • Self-supervised learning in vision Disclaimer: Much of the material and slides for this lecture were borrowed from — Andrej Risteski's CMU 10707 class — Jimmy Ba's UToronto CSC413/2516 class 3
Unsupervised Learning • Learning from data without labels. • What can we hope to do: – Task A: Fit a parametrized structure (e.g. clustering, low-dimensional subspace, manifold) to data to reveal something meaningful about the data (Structure learning) – Task B: Learn a (parametrized) distribution close to the data-generating distribution. (Distribution learning) – Task C: Learn a (parametrized) distribution that implicitly reveals an “embedding”/“representation” of data for downstream tasks. (Representation/feature learning) • Entangled! The “structure” and “distribution” often reveal an embedding. 4
Self-Supervised/Predictive Learning • Given unlabeled data, design supervised tasks that induce a good representation for downstream tasks. • No good mathematical formalization, but the intuition is to “force” the predictor used in the task to learn something “semantically meaningful” about the data. 5
Self-Supervised/Predictive Learning ► Predict any part of the input from any other part. ► Predict the future from the past. ► Predict the future from the recent past. ► Predict the past from the present. ► Predict the top from the bottom. ► Predict the occluded from the visible. ► Pretend there is a part of the input you don’t know and predict that. Slide by Yann LeCun 6
How Much Information Does the Machine Need to Predict? Y. LeCun • “Pure” Reinforcement Learning (cherry): The machine predicts a scalar reward given once in a while. A few bits for some samples. • Supervised Learning (icing): The machine predicts a category or a few numbers for each input. Predicting human-supplied data. 10→10,000 bits per sample. • Unsupervised/Predictive Learning (cake): The machine predicts any part of its input for any observed part. Predicts future frames in videos. Millions of bits per sample. (Yes, I know, this picture is slightly offensive to RL folks. But I’ll make it up.) • LeCun’s original cake analogy slide, presented at his keynote speech in NIPS 2016. 7
How Much Information is the Machine Given during Learning? Y. LeCun • “Pure” Reinforcement Learning (cherry): The machine predicts a scalar reward given once in a while. A few bits for some samples. • Supervised Learning (icing): The machine predicts a category or a few numbers for each input. Predicting human-supplied data. 10→10,000 bits per sample. • Self-Supervised Learning (cake génoise): The machine predicts any part of its input for any observed part. Predicts future frames in videos. Millions of bits per sample. • Updated version at ISSCC 2019, where he replaced “unsupervised learning” with “self-supervised learning”. 8
Self-Supervised Learning in NLP 9
Word Embeddings ations of words • Semantically meaningful ve vect ctor represe sentat Tiger Tiger Lion Lion Ex Example : Inner product (possibly scaled, i.e. cosine similarity) correlates with word si similarity. Table Table 10
Word Embeddings • Semantically meaningful vector representations of words. Example: Can use embeddings to do sentiment classification by training a simple (e.g. linear) classifier. (Figure: “The service is great, fast and friendly!”) 11
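A hedged sketch of this setup, not the lecture's own code: represent a sentence by averaging its word embeddings and fit scikit-learn's LogisticRegression as the simple linear classifier. The vocabulary, random stand-in vectors, and two-sentence training set are placeholders; with real pre-trained embeddings and a real dataset the same plumbing gives a usable sentiment baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-trained embeddings for a toy vocabulary; real ones would be
# loaded from word2vec/GloVe and be much higher-dimensional.
rng = np.random.default_rng(0)
vocab = ["great", "fast", "friendly", "slow", "rude", "terrible", "service", "the", "is", "and"]
embeddings = {w: rng.normal(size=5) for w in vocab}

def sentence_vector(sentence):
    """Represent a sentence by averaging the embeddings of its known words."""
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(5)

# Tiny toy training set (1 = positive, 0 = negative); real datasets are far larger.
train_sentences = ["great fast friendly service", "slow rude terrible service"]
train_labels = [1, 0]

clf = LogisticRegression().fit(
    np.stack([sentence_vector(s) for s in train_sentences]), train_labels)

print(clf.predict([sentence_vector("the service is fast and friendly")]))
```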
Word Embeddings • Semantically meaningful vector representations of words. Example: Can train a “simple” network that, if fed word embeddings for two languages, can effectively translate. (Figure: English: “It’s raining outside.” ↔ German: “Es regnet draussen.”) 12
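The slide does not specify the network, so the sketch below assumes one classic, especially simple instantiation: a linear map fitted between the two monolingual embedding spaces from a small seed dictionary, with translation by nearest neighbour. The vocabularies and random vectors are placeholders.

```python
import numpy as np

# Hypothetical monolingual embeddings for tiny English and German vocabularies;
# random stand-ins here, real ones would come from separate word2vec/GloVe runs.
rng = np.random.default_rng(0)
en_vocab = ["rain", "outside", "dog", "house"]
de_vocab = ["regen", "draussen", "hund", "haus"]
en_emb = {w: rng.normal(size=8) for w in en_vocab}
de_emb = {w: rng.normal(size=8) for w in de_vocab}

# Small seed dictionary of known translation pairs used to fit the mapping.
pairs = [("rain", "regen"), ("outside", "draussen"), ("dog", "hund")]
X = np.stack([en_emb[e] for e, _ in pairs])  # source-language vectors
Y = np.stack([de_emb[g] for _, g in pairs])  # target-language vectors

# The "simple network": a linear map W minimizing ||XW - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translate(word):
    """Map an English vector into German space, return the nearest German word."""
    q = en_emb[word] @ W
    return max(de_vocab,
               key=lambda g: np.dot(de_emb[g], q) /
                             (np.linalg.norm(de_emb[g]) * np.linalg.norm(q) + 1e-9))

print(translate("house"))  # with real embeddings this would typically return "haus"
```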
Word Embeddings via Predictive Learning • Basic task: predict the next word, given a few previous ones. (Figure: “I am running a little ????” → Late: 0.9, Early: 0.05, Tired: 0.04, Table: 0.01) • In other words, optimize for $\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-L})$ 13
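A minimal PyTorch sketch of this objective (the framework, window length L, and architecture are assumptions, not something the lecture prescribes): a fixed-window model scores the next word and is trained with cross-entropy, which is exactly minimizing $-\sum_t \log p_\theta(x_t \mid x_{t-1}, \ldots, x_{t-L})$; the word embeddings are learned as a byproduct.

```python
import torch
import torch.nn as nn

# Toy fixed-window neural language model; sizes are arbitrary placeholders.
vocab_size, embed_dim, context_L = 1000, 64, 4

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),              # word embeddings are learned here
    nn.Flatten(),                                     # concatenate the L context embeddings
    nn.Linear(context_L * embed_dim, 256), nn.ReLU(),
    nn.Linear(256, vocab_size),                       # logits over the next word
)

# Toy batch: random token ids standing in for windows of real text.
context = torch.randint(0, vocab_size, (32, context_L))  # x_{t-L}, ..., x_{t-1}
target = torch.randint(0, vocab_size, (32,))             # x_t

logits = model(context)
loss = nn.functional.cross_entropy(logits, target)  # = -mean log p_theta(x_t | context)
loss.backward()  # gradients w.r.t. theta, ready for an SGD/Adam step
```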