Contrastive learning, multi-view redundancy, and linear models
Daniel Hsu (Columbia University)
Joint work with: Akshay Krishnamurthy (Microsoft Research), Christopher Tosh (Columbia University)
Johns Hopkins MINDS & CIS Seminar, October 6th, 2020
Learning representations of data
Probabilistic modeling / Deep learning
Image credit: stats.stackexchange.com; bdtechtalks.com
Goal of representation learning: a feature map learned from data
Image credit: towardsdatascience.com
Deep neural networks: Already doing it?
[Figure: multi-task network in which an input feeds a shared subset of factors, which feed task-specific outputs y1, y2, y3 for Tasks A, B, C]
Image credit: [Bengio, Courville, Vincent, 2014]
Unsupervised / semi-supervised learning
• Self-supervised learning: Unlabeled data → Feature map
• Down-stream supervised task: Labeled data → Predictor
"Self-supervised learning" • Idea : Learn to solve self-derived predic3on problems, then introspect. • Example: Images • Predict color channel from grayscale channel [Zhang, Isola, Efros, 2017] • Example: Text documents • Predict missing word in a sentence from context [Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Dhillon, Foster, Ungar, 2011] • Example: Dynamical systems • Predict future observa3ons from past observa3ons [Yule, 1927; Langford, Salakhutdinov, Zhang, 2009]
Self-supervised learning problem with text documents
[Figure: 2 positive examples, 2 negative examples]
Positive examples: Documents from a natural corpus
Negative examples: First half of a document, randomly paired with second half of another document
Can create training data from unlabeled documents!
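As a concrete illustration (not the authors' code), here is a minimal Python sketch of building such training pairs from unlabeled documents; the toy documents and the even 50/50 split point are my own choices.

```python
# Build contrastive training pairs from unlabeled documents: positives are the
# two halves of the same document, negatives pair the first half of one
# document with the second half of another (illustrative sketch).
import random

def make_pairs(documents, seed=0):
    """documents: list of token lists. Returns ((first, second), label) examples."""
    rng = random.Random(seed)
    halves = []
    for doc in documents:
        mid = len(doc) // 2
        halves.append((doc[:mid], doc[mid:]))        # arbitrary split into two halves

    examples = []
    for first, second in halves:
        examples.append(((first, second), 1))        # positive: same document
        _, other_second = rng.choice(halves)
        examples.append(((first, other_second), 0))  # negative: mismatched halves
    rng.shuffle(examples)
    return examples

docs = [
    "the new mascot appears to have bushier eyebrows".split(),
    "the s&p 500 fell more than three percent today".split(),
    "european markets recorded their worst session since 2016".split(),
]
print(make_pairs(docs)[:2])
```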
Representations from self-supervised learning
φ("The new mascot appears to have bushier eyebrows")
φ("The S&P 500 fell more than 3.3 percent")
φ("European markets recorded their worst session since 2016")
Improves down-stream supervised learning performance in many cases [Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Logeswaran & Lee, 2018; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019]
Q: For what problems can we prove these representations are useful?
What's in the representation?
To understand the representations, we look to probabilistic modeling…
Our focus: Representations φ derived from "Contrastive Learning"
Our theoretical results (informally)
1. Assume unlabeled data follow a topic model (e.g., LDA). Then: representation φ(x) = linear transform of topic posterior moments (of order up to document length).
E.g., words drawn i.i.d. from the mixture (1/5)·sports + (2/5)·science + (2/5)·politics + (0/5)·business
Our theoretical results (informally)
2. Assume unlabeled data has two views x and x′, each with near-optimal MSE for predicting a target variable y (possibly using non-linear functions). Then: a linear function of φ(x) can achieve near-optimal MSE.
Our theoretical results (informally)
3. Error transform theorem:
Excess error in down-stream supervised learning task with linear functions of φ̂(x) ≤ Excess error in self-supervised learning problem
i.e., better solutions to the self-supervised learning problem yield better representations φ̂ for the down-stream supervised learning task
Rest of the talk
1. Representation learning method & topic model analysis
2. Multi-view redundancy analysis
3. Experimental study
1. Representation learning method & topic model analysis
The plan
a. Formalize the contrastive learning problem and representation
b. Interpret the representation in context of topic models
[Diagram: Unlabeled data → Feature map φ; Labeled data → Predictor]
Self-supervised learning problem with text documents (recap)
[Figure: 2 positive examples, 2 negative examples]
Positive examples: Documents from a natural corpus
Negative examples: First half of a document, randomly paired with second half of another document
Can create training data from unlabeled documents!
"Contrastive learning" [Steinwart, Hush, Scovel, 2005; Abe, Zadrozny, Langford, 2006; Gutmann & Hyvärinen, 2010; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019; …]
• Learn a predictor to discriminate between (x, x′) ∼ P_{X,X′} [positive example] and (x, x′) ∼ P_X ⊗ P_{X′} [negative example]
• Specifically, estimate the odds ratio
  f*(x, x′) = Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)]
  by training a neural network (or whatever) using a loss function like logistic loss on random positive & negative examples (which are, WLOG, evenly balanced: 0.5·P_{X,X′} + 0.5·P_X ⊗ P_{X′}).
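A hedged sketch of this estimation step: a plain logistic-regression discriminator on outer-product bag-of-words features stands in for the "neural network (or whatever)", and the classifier's odds Pr(positive)/Pr(negative) then estimates f*. The featurization and vocabulary construction are illustrative choices of mine, not the paper's.

```python
# Estimate the odds ratio f*(x, x') by logistic-loss training on balanced
# positive/negative pairs (sketch only; real systems use a neural network).
import numpy as np
from sklearn.linear_model import LogisticRegression

def bow(tokens, vocab):
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            v[vocab[t]] += 1.0
    return v

def pair_features(x, x_prime, vocab):
    # Outer product of the two halves' count vectors captures their interactions.
    return np.outer(bow(x, vocab), bow(x_prime, vocab)).ravel()

def fit_odds_ratio(labeled_pairs, vocab):
    """labeled_pairs: list of ((x, x_prime), label) with label 1 = positive."""
    X = np.stack([pair_features(x, xp, vocab) for (x, xp), _ in labeled_pairs])
    y = np.array([label for _, label in labeled_pairs])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    def f_hat(x, x_prime):
        p = clf.predict_proba(pair_features(x, x_prime, vocab)[None, :])[0, 1]
        p = float(np.clip(p, 1e-6, 1 - 1e-6))
        return p / (1.0 - p)   # odds-ratio estimate (positives/negatives balanced)

    return f_hat

# Usage, assuming `docs` and `make_pairs` from the previous sketch:
# vocab = {w: i for i, w in enumerate(sorted({t for d in docs for t in d}))}
# f_hat = fit_odds_ratio(make_pairs(docs), vocab)
```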
Constructing the representation
• Given an estimate f̂ of f*, construct an embedding function for document halves:
  φ̂(x) ≔ ( f̂(x, z_i) : i = 1, …, m ) ∈ ℝ^m
  where z_1, …, z_m are "landmark documents"
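A minimal sketch of this construction, assuming an estimate f̂ such as the one fit above; the choice of landmark halves here is arbitrary and purely illustrative.

```python
# Landmark embedding: phi_hat(x) = (f_hat(x, z_1), ..., f_hat(x, z_m)).
import numpy as np

def phi_hat(x, landmarks, f_hat):
    """Embed half-document x into R^m via landmark halves z_1..z_m and f_hat."""
    return np.array([f_hat(x, z) for z in landmarks])

# Usage, assuming `docs` and `f_hat` from the previous sketches:
# landmarks = [doc[len(doc) // 2:] for doc in docs]   # e.g., second halves as landmarks
# features = np.stack([phi_hat(doc[:len(doc) // 2], landmarks, f_hat) for doc in docs])
```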
Topic model [Hofmann, 1999; Blei, Ng, Jordan, 2003; …]
• k topics, each specifies a distribution over the vocabulary
• A document is associated with its own distribution θ over the k topics
• Words in the document (bag-of-words): i.i.d. from the induced mixture distribution
• Assume they are arbitrarily partitioned into two halves, x and x′
E.g., words ∼ i.i.d. (1/5)·sports + (2/5)·science + (2/5)·politics + (0/5)·business
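A small sketch of this generative process (Dirichlet document-topic draws as in LDA; the sizes and hyperparameters are illustrative), producing a topic distribution θ and the two halves x, x′:

```python
# Generate a document from a k-topic model and split it into two halves.
import numpy as np

rng = np.random.default_rng(0)
k, vocab_size, doc_len = 4, 50, 20
topics = rng.dirichlet(np.ones(vocab_size) * 0.1, size=k)      # k topic-word distributions

def sample_document(alpha=0.5):
    theta = rng.dirichlet(np.full(k, alpha))                   # this document's topic distribution
    word_dist = theta @ topics                                 # induced mixture over the vocabulary
    words = rng.choice(vocab_size, size=doc_len, p=word_dist)  # i.i.d. bag of words
    return theta, words[: doc_len // 2], words[doc_len // 2:]  # halves x and x'

theta, x, x_prime = sample_document()
```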
Simple case: One topic per document
• Suppose θ ∈ {e_1, …, e_k} (i.e., document is about only one topic)
• Fact: Odds ratio = density ratio:
  f*(x, x′) ≔ Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)] = p_{X,X′}(x, x′) / ( p_X(x) · p_{X′}(x′) )
  (Left side: estimated using contrastive learning. Right side: interpreted via the data-generating distribution.)
Interpreting the density ratio…
Using the bag-of-words assumption, the density ratio is
  p_{X,X′}(x, x′) / ( p_X(x) · p_{X′}(x′) ) = Σ_{t=1}^{k} Pr[t] · Pr[x ∣ t] · Pr[x′ ∣ t] / ( p_X(x) · p_{X′}(x′) )
  = Σ_{t=1}^{k} Pr[t ∣ x] · Pr[x′ ∣ t] / p_{X′}(x′)
  = μ(x)ᵀ ℓ(x′) / p_{X′}(x′)
where μ(x) is the posterior over topics given x and ℓ(x′) is the vector of likelihoods of the topics given x′.
Inside the embedding
• Embedding: φ*(x) = ( f*(x, z_i) : i = 1, …, m ), where f*(x, x′) = μ(x)ᵀ ℓ(x′) / p_{X′}(x′)
• Therefore
  φ*(x) = [ ℓ(z_1)ᵀ / p_{X′}(z_1) ; … ; ℓ(z_m)ᵀ / p_{X′}(z_m) ] · μ(x)
  i.e., the (scaled) likelihoods of the topics given the z_i's, applied to the posterior over topics given x
Upshot in the simple case
• In the "one topic per document" case, the document embedding is a linear transformation of the posterior over topics: φ*(x) = A μ(x)
• Theorem: If A is full-rank, every linear function of the topic posterior can be expressed as a linear function of φ*(⋅)
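A numerical sketch of this case, assuming a uniform prior over the k pure topics (my assumption for the illustration): it computes the exact posterior μ(x), the matrix A whose rows are the scaled landmark likelihoods ℓ(z_i)ᵀ / p_{X′}(z_i), and checks φ*(x) = A μ(x); a full-rank A then lets any linear function of the posterior be written as a linear function of φ*.

```python
# One-topic-per-document case: the exact embedding is a linear transform of
# the topic posterior (illustrative check, uniform prior assumed).
import numpy as np

rng = np.random.default_rng(1)
k, vocab_size, half_len = 4, 50, 10
topics = rng.dirichlet(np.ones(vocab_size) * 0.1, size=k)   # k topic-word distributions
prior = np.full(k, 1.0 / k)                                 # assumed uniform prior over topics

def half_likelihoods(x):
    """l(x)[t] = Pr[x | topic t] for a half-document x (array of word ids)."""
    return np.array([np.prod(topics[t, x]) for t in range(k)])

def posterior(x):                                           # mu(x): topic posterior given x
    p = half_likelihoods(x) * prior
    return p / p.sum()

def marginal(x):                                            # p_{X'}(x)
    return float(half_likelihoods(x) @ prior)

def f_star(x, x_prime):                                     # odds ratio = density ratio
    return float(posterior(x) @ half_likelihoods(x_prime)) / marginal(x_prime)

# Landmarks z_1..z_m; A has rows l(z_i)^T / p_{X'}(z_i), so phi*(x) = A mu(x).
landmarks = [rng.choice(vocab_size, size=half_len, p=topics[t]) for t in range(k)]
A = np.stack([half_likelihoods(z) / marginal(z) for z in landmarks])

x = rng.choice(vocab_size, size=half_len, p=topics[0])
phi_star = np.array([f_star(x, z) for z in landmarks])
assert np.allclose(phi_star, A @ posterior(x))              # phi*(x) = A mu(x)
print(np.linalg.matrix_rank(A))                             # full rank => posterior recoverable linearly
```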
General case: Exploit bag-of-words structure
• In general, the posterior distribution over θ (the topic distribution) given x is not summarized by just a k-dimensional vector.
• If x and x′ each have n words:
  • Let v_n(w) ≔ ( w^S : |S| ≤ n ), where w^S = ∏_{j∈S} w_j for w ∈ ℝ^k
  • Let μ(x) ≔ E[ v_n(θ) ∣ x ] (order-n multivariate conditional moments of θ)
• There is a corresponding ℓ(⋅) (that depends on the topic model parameters) such that f*(x, x′) = μ(x)ᵀ ℓ(x′) / p_{X′}(x′)
• Theorem: There is a choice of landmark documents such that φ*(x) yields a (linear transform of) the conditional moments of θ of orders ≤ n.
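A small sketch of the moment featurization v_n(·) from this slide: the vector of monomials w^S over multisets S of at most n topics (the exact indexing convention used in the paper may differ; the degree-0 entry here is my choice).

```python
# Monomial / moment features v_n(w) of a topic vector w, up to degree n.
import itertools
import numpy as np

def v_n(w, n):
    """All monomials w^S = prod_{j in S} w_j over multisets S with |S| <= n."""
    w = np.asarray(w, dtype=float)
    feats = [1.0]                                            # |S| = 0: empty product
    for size in range(1, n + 1):
        for S in itertools.combinations_with_replacement(range(len(w)), size):
            feats.append(float(np.prod(w[list(S)])))
    return np.array(feats)

theta = np.array([0.2, 0.4, 0.4, 0.0])
print(v_n(theta, 2))   # 1, the theta_j's, and all pairwise products theta_i * theta_j
```

Then μ(x) = E[ v_n(θ) ∣ x ] stacks the conditional moments of θ of orders up to the half-document length n.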
2. Multi-view redundancy analysis
The plan
a. Recap multi-view prediction setting
b. How contrastive learning fares in the multi-view setting
[Diagram: Unlabeled data → Feature map φ; Labeled data → Predictor]
Setting for multi-view prediction
• Assume (unlabeled) data provides two "views" x and x′, each equally good at predicting a target y
• Example: topic identification
  • y = topic of article
  • x = text of abstract
  • x′ = text of article
• Example: web page classification
  • y = web page type
  • x = text of web page
  • x′ = text of hyper-links pointing to the page
Multi-view learning methods
• Co-training [Blum & Mitchell, 1998]:
  • If x ⊥ x′ ∣ y, then bootstrapping methods "work"
• Canonical Correlation Analysis [Kakade & Foster, 2007]:
  • Suppose there is redundancy of views via linear predictors: for each v ∈ {x, x′}, the best linear predictor of y from v alone is within ε of the best linear predictor of y from both views
  • Then CCA-based (linear) dimension reduction doesn't hurt much
  • (No assumption of conditional independence!)
Q: What if views are redundant only via non-linear predictors?
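For concreteness, a hedged sketch of the CCA route in the spirit of [Kakade & Foster, 2007], on synthetic two-view data (the data-generating choices are mine, not from the talk): fit CCA on paired views, then feed the projected first view to a downstream linear predictor.

```python
# CCA-based dimension reduction across two views, followed by a linear predictor.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d = 500, 20
z = rng.normal(size=(n, 3))                                        # shared latent signal
X1 = z @ rng.normal(size=(3, d)) + 0.5 * rng.normal(size=(n, d))   # view x
X2 = z @ rng.normal(size=(3, d)) + 0.5 * rng.normal(size=(n, d))   # view x'
y = z[:, 0] + 0.1 * rng.normal(size=n)                             # target predictable from either view

cca = CCA(n_components=3).fit(X1, X2)                              # fit on (unlabeled) paired views
X1_reduced, _ = cca.transform(X1, X2)                              # low-dimensional CCA features of view x
print(Ridge().fit(X1_reduced, y).score(X1_reduced, y))             # downstream linear predictor
```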
Surrogate predictor via multi-view redundancy
Define g(x) ≔ E[ E[y ∣ x′] ∣ x ], where E[y ∣ x′] is the best (possibly non-linear) prediction of y using x′.
Lemma: If E[ ( E[y ∣ v] − E[y ∣ x, x′] )² ] ≤ ε for each v ∈ {x, x′}, then E[ ( g(x) − E[y ∣ x, x′] )² ] ≤ 4ε.
Our strategy: Learn a representation φ(x) such that g(x) ≈ linear function of φ(x).
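A quick numerical check of this lemma on a tiny synthetic discrete model (the joint distribution here is invented for illustration): compute g(x) = E[E[y ∣ x′] ∣ x] exactly and compare its excess MSE against the 4ε bound.

```python
# Verify the surrogate-predictor lemma numerically on a small discrete model.
import numpy as np

rng = np.random.default_rng(0)
nx, nxp = 4, 5
P = rng.random((nx, nxp))
P /= P.sum()                                          # joint distribution of (x, x')
Y = rng.random((nx, nxp))                             # E[y | x, x'] on each (x, x') cell

Px, Pxp = P.sum(axis=1), P.sum(axis=0)
E_y_given_x = (P * Y).sum(axis=1) / Px                # E[y | x]
E_y_given_xp = (P * Y).sum(axis=0) / Pxp              # E[y | x']
g = (P / Px[:, None]) @ E_y_given_xp                  # g(x) = E[E[y | x'] | x]

def excess_mse(pred):                                 # E[(pred - E[y | x, x'])^2]
    return float((P * (pred - Y) ** 2).sum())

eps = max(excess_mse(E_y_given_x[:, None]), excess_mse(E_y_given_xp[None, :]))
print(excess_mse(g[:, None]), "<=", 4 * eps)          # the lemma's guarantee
```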