Theory for representation learning

Theory for representation learning Sanjeev Arora Princeton - PowerPoint PPT Presentation

http://www.cs.princeton.edu/~arora/ | Support: NSF, ONR, Simons Foundation, Schmidt Foundation, Amazon Research, Mozilla Research, DARPA/SRC | Group website: unsupervised.cs.princeton.edu | Blog: www.offconvex.org | Twitter: @prfsanjeevarora


  1. Theory for representation learning. Sanjeev Arora, Princeton University and Institute for Advanced Study. With Hrishi (Khandeparkar), Nikunj (Saunshi), Misha (Khodak), Orestis (Plevrakis). Paper 1: "A theoretical analysis of contrastive unsupervised representation learning (CURL)" [A., Hrishikesh Khandeparkar, Mikhail Khodak (CMU), Orestis Plevrakis, Nikunj Saunshi; ICML'19]. Paper 2: "A graph-theoretic analysis of CURL" [A., Plevrakis, Saunshi; 2019 manuscript]. 5/31/2019. Theoretically understanding CURL.

  2. Big motivation for GANs/VAEs etc.: semantic embeddings f: {images} ➔ embeddings, such that f(x) is a good representation of x for classification tasks, preferably as good as those from a "headless well-trained net." Can we bypass generative models and learn semantic embeddings directly?

  3. Conceptual hurdle: why does learning to do Task A help you do Task B later on? Example: A = learn embeddings; B = use them in new classification tasks. Surprisingly, this is hard to capture* for machine learning theory. (*Except if you go hardcore, fully Bayesian; but even then there are many conceptual difficulties, e.g., bits of precision.)

  4. Conceptual difficulties with the generative-model approach (or related ones, e.g., information-theoretic). There is evidence these models don't actually learn the distribution, but suppose they did, sort of… Let x = image and h = seed = "semantic embedding" of x; then p_θ(x | h) generates x from h, and p_θ(h | x) is the way to generate the semantic embedding of x. [A., Risteski, blog post 2017]: if we want linear classification on h to work with accuracy ε, then we must learn p_θ(h | x) with accuracy ε² (follows from Pinsker's inequality).
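The ε vs. ε² claim can be sketched via Pinsker's inequality; this is my reading of the slide's argument, and the exact constants are assumptions:

```latex
% Pinsker's inequality bounds total variation distance by KL divergence:
\[
  \mathrm{TV}\big(p_\theta(h \mid x),\, \hat{p}(h \mid x)\big)
  \;\le\; \sqrt{\tfrac{1}{2}\,
    \mathrm{KL}\big(p_\theta(h \mid x)\,\|\,\hat{p}(h \mid x)\big)}.
\]
% A classifier's accuracy under the learned posterior can shift by as
% much as the total variation distance, so guaranteeing accuracy within
% eps of the true posterior's requires learning p_theta(h | x) to KL
% accuracy on the order of eps^2.
```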

  5. Contrastive Unsupervised Representation Learning (CURL). Example: QuickThoughts [Logeswaran & Lee, ICLR'18], "like word2vec," "self-supervised." Using a text corpus, train a deep representation function f to minimize

  E[ log( 1 + exp( f(x)ᵀ f(x⁻) − f(x)ᵀ f(x⁺) ) ) ]

  where x, x⁺ are adjacent sentences and x⁻ is a random sentence from the corpus. ("High inner product for adjacent sentences; low inner product for random pairs of sentences.") Similar ideas work for embedding molecules, genes, social nets. [For image embeddings, Wang-Gupta'15 use video.]
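As a sanity check, the logistic contrastive loss above can be written in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' code; the toy embedding vectors are made up:

```python
import numpy as np

def contrastive_loss(f_x, f_pos, f_neg):
    """Logistic contrastive loss from the slide:
    log(1 + exp(f(x)^T f(x^-) - f(x)^T f(x^+))).
    Small when the anchor's inner product with the positive
    exceeds its inner product with the negative."""
    return np.log1p(np.exp(f_x @ f_neg - f_x @ f_pos))

# Toy check: aligning the anchor with its positive gives lower loss
# than aligning it with the negative.
anchor = np.array([1.0, 0.0])
pos    = np.array([0.9, 0.1])   # stand-in for an adjacent sentence
neg    = np.array([-0.8, 0.5])  # stand-in for a random sentence
good = contrastive_loss(anchor, pos, neg)
bad  = contrastive_loss(anchor, neg, pos)
print(good < bad)  # True
```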

  6. CURL learns representations by leveraging the contrast between "similar" and "dissimilar" (e.g., random) pairs of datapoints. A graph-based framework for understanding CURL: "Why do learnt representations help in downstream classification tasks?" (Doing Task A later helps in Task B??)

  7. Graph G = (V, E). V = all possible datapoints (e.g., sentences with < 30 words). E = "similar" pairs. Nature's sampling process: repeat M times: reveal an edge e = (x, x⁺) from some distribution on E, and reveal a node x⁻ from some distribution on V. Task A: run CURL on the M samples. Task B: Nature picks datapoints from two classes T₁, T₂, represents each via f, and trains a logistic classifier to separate them. Note: CURL may not have seen any data from T₁ or T₂.
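Task B (fit a logistic classifier on fixed representations) can be sketched as follows. The representation function f and the two classes below are hypothetical stand-ins I made up for illustration, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Stand-in for a pretrained representation function; here we
    # simply assume the classes are already linearly separated in
    # representation space.
    return x

# Simulated representations of datapoints from classes T1 and T2.
reps_1 = rng.normal(loc=[+2.0, 0.0], size=(100, 2))
reps_2 = rng.normal(loc=[-2.0, 0.0], size=(100, 2))
X = np.vstack([f(reps_1), f(reps_2)])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Plain logistic regression trained by gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # predicted probabilities
    w -= 0.5 * (X.T @ (p - y)) / len(y)        # gradient step on weights
    b -= 0.5 * np.mean(p - y)                  # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == y)
print(acc)  # high accuracy, since the representations are separable
```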

  8. Conceptual framework (setup). Graph G = (V, E); V = all possible datapoints; E = "similar" pairs. Nature's sampling process: repeat M times: reveal e = (x, x⁺) from some distribution on E, and reveal node x⁻ from some distribution on V. Underlying model: ρ_c = probability of picking class c; classes c⁺, c⁻ ~ ρ; then x, x⁺ ~ D_{c⁺} and x⁻ ~ D_{c⁻}. Test time: Nature picks datapoints from two classes T₁, T₂ and asks the algorithm to learn to classify them using a logistic classifier. Reminiscent of multiview/co-training.
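Nature's sampling process can be made concrete with a toy instantiation; the Gaussian class-conditional distributions and the class distribution ρ below are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instantiation: classes are Gaussians in R^2, rho is the
# distribution over classes, D_c is the class-conditional distribution.
class_means = {"c1": np.array([2.0, 0.0]), "c2": np.array([-2.0, 0.0])}
rho = {"c1": 0.5, "c2": 0.5}

def sample_from_class(c):
    """Draw one datapoint from D_c."""
    return class_means[c] + rng.normal(size=2)

def natures_sample():
    """One round of Nature's process: a similar pair (x, x+) from a
    common class c+ ~ rho, plus a negative x- from a fresh draw c- ~ rho
    (which may coincide with c+)."""
    classes = list(rho)
    probs = [rho[c] for c in classes]
    c_pos = rng.choice(classes, p=probs)
    c_neg = rng.choice(classes, p=probs)
    x, x_pos = sample_from_class(c_pos), sample_from_class(c_pos)
    x_neg = sample_from_class(c_neg)
    return x, x_pos, x_neg

M = 5
samples = [natures_sample() for _ in range(M)]
print(len(samples))  # 5
```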

  9. (Same graph setup and sampling process as the previous slide.) Unpacking a little: "similarity" ≈ "tend to go together (or not) for a random class." (We will later relax this.)

  10. The analysis. Part 1: why CURL makes sense even though the graph is humongous, even infinite. Part 2: why CURL representations can solve the classification tasks. (NB: we will ignore computational cost and just analyse the quality of representations that achieve low training loss.)
