Contrastive learning, multi-view redundancy, and linear models
Daniel Hsu (Columbia University)
Joint work with: Akshay Krishnamurthy (Microsoft Research), Christopher Tosh (Columbia University)
Johns Hopkins MINDS & CIS Seminar, October 6th, 2020
Learning representations of data
Probabilistic modeling / Deep learning
Image credit: stats.stackexchange.com; bdtechtalks.com
Goal of representation learning: a feature map learned from data
Image credit: towardsdatascience.com
Deep neural networks: Already doing it?
[Figure: multi-task network in which an input feeds a shared subset of factors, which feed task-specific outputs y1, y2, y3 for Tasks A, B, C]
Image credit: [Bengio, Courville, Vincent, 2014]
Unsupervised / semi-supervised learning
• Self-supervised learning: Unlabeled data → Feature map
• Down-stream supervised task: Labeled data → Predictor
"Self-supervised learning" • Idea : Learn to solve self-derived predic3on problems, then introspect. • Example: Images • Predict color channel from grayscale channel [Zhang, Isola, Efros, 2017] • Example: Text documents • Predict missing word in a sentence from context [Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Dhillon, Foster, Ungar, 2011] • Example: Dynamical systems • Predict future observa3ons from past observa3ons [Yule, 1927; Langford, Salakhutdinov, Zhang, 2009]
Self-supervised learning problem with text documents
[Figure: 2 positive examples, 2 negative examples]
Positive examples: Documents from a natural corpus
Negative examples: First half of a document, randomly paired with second half of another document
Can create training data from unlabeled documents!
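As a concrete illustration (not the authors' code), here is a minimal Python sketch of building such training pairs from unlabeled documents; the toy documents and the even 50/50 split point are my own choices.

```python
# Build contrastive training pairs from unlabeled documents: positives are the
# two halves of the same document, negatives pair the first half of one
# document with the second half of another (illustrative sketch).
import random

def make_pairs(documents, seed=0):
    """documents: list of token lists. Returns ((first, second), label) examples."""
    rng = random.Random(seed)
    halves = []
    for doc in documents:
        mid = len(doc) // 2
        halves.append((doc[:mid], doc[mid:]))        # arbitrary split into two halves

    examples = []
    for first, second in halves:
        examples.append(((first, second), 1))        # positive: same document
        _, other_second = rng.choice(halves)
        examples.append(((first, other_second), 0))  # negative: mismatched halves
    rng.shuffle(examples)
    return examples

docs = [
    "the new mascot appears to have bushier eyebrows".split(),
    "the s&p 500 fell more than three percent today".split(),
    "european markets recorded their worst session since 2016".split(),
]
print(make_pairs(docs)[:2])
```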
Representations from self-supervised learning
φ("The new mascot appears to have bushier eyebrows")
φ("The S&P 500 fell more than 3.3 percent")
φ("European markets recorded their worst session since 2016")
Improves down-stream supervised learning performance in many cases [Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Logeswaran & Lee, 2018; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019]
Q: For what problems can we prove these representations are useful?
What's in the representation?
To understand the representations, we look to probabilistic modeling…
Our focus: Representations φ derived from "Contrastive Learning"
Our theoretical results (informally)
1. Assume unlabeled data follow a topic model (e.g., LDA). Then: representation φ(x) = linear transform of topic posterior moments (of order up to document length).
E.g., words drawn i.i.d. from the mixture (1/5)·sports + (2/5)·science + (2/5)·politics + (0/5)·business
Our theoretical results (informally)
2. Assume unlabeled data has two views x and x′, each with near-optimal MSE for predicting a target variable y (possibly using non-linear functions). Then: a linear function of φ(x) can achieve near-optimal MSE.
Our theoretical results (informally)
3. Error transform theorem:
Excess error in down-stream supervised learning task with linear functions of φ̂(x) ≤ Excess error in self-supervised learning problem
i.e., better solutions to the self-supervised learning problem yield better representations φ̂ for the down-stream supervised learning task
Rest of the talk
1. Representation learning method & topic model analysis
2. Multi-view redundancy analysis
3. Experimental study
1. Representation learning method & topic model analysis
The plan
a. Formalize the contrastive learning problem and representation
b. Interpret the representation in context of topic models
[Diagram: Unlabeled data → Feature map φ; Labeled data → Predictor]
Self-supervised learning problem with text documents (recap)
[Figure: 2 positive examples, 2 negative examples]
Positive examples: Documents from a natural corpus
Negative examples: First half of a document, randomly paired with second half of another document
Can create training data from unlabeled documents!
"Contrastive learning" [Steinwart, Hush, Scovel, 2005; Abe, Zadrozny, Langford, 2006; Gutmann & Hyvärinen, 2010; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019; …]
• Learn a predictor to discriminate between (x, x′) ∼ P_{X,X′} [positive example] and (x, x′) ∼ P_X ⊗ P_{X′} [negative example]
• Specifically, estimate the odds ratio
  f*(x, x′) = Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)]
  by training a neural network (or whatever) using a loss function like logistic loss on random positive & negative examples (which are, WLOG, evenly balanced: 0.5·P_{X,X′} + 0.5·P_X ⊗ P_{X′}).
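A hedged sketch of this estimation step: a plain logistic-regression discriminator on outer-product bag-of-words features stands in for the "neural network (or whatever)", and the classifier's odds Pr(positive)/Pr(negative) then estimates f*. The featurization and vocabulary construction are illustrative choices of mine, not the paper's.

```python
# Estimate the odds ratio f*(x, x') by logistic-loss training on balanced
# positive/negative pairs (sketch only; real systems use a neural network).
import numpy as np
from sklearn.linear_model import LogisticRegression

def bow(tokens, vocab):
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            v[vocab[t]] += 1.0
    return v

def pair_features(x, x_prime, vocab):
    # Outer product of the two halves' count vectors captures their interactions.
    return np.outer(bow(x, vocab), bow(x_prime, vocab)).ravel()

def fit_odds_ratio(labeled_pairs, vocab):
    """labeled_pairs: list of ((x, x_prime), label) with label 1 = positive."""
    X = np.stack([pair_features(x, xp, vocab) for (x, xp), _ in labeled_pairs])
    y = np.array([label for _, label in labeled_pairs])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    def f_hat(x, x_prime):
        p = clf.predict_proba(pair_features(x, x_prime, vocab)[None, :])[0, 1]
        p = float(np.clip(p, 1e-6, 1 - 1e-6))
        return p / (1.0 - p)   # odds-ratio estimate (positives/negatives balanced)

    return f_hat

# Usage, assuming `docs` and `make_pairs` from the previous sketch:
# vocab = {w: i for i, w in enumerate(sorted({t for d in docs for t in d}))}
# f_hat = fit_odds_ratio(make_pairs(docs), vocab)
```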
Constructing the representation
• Given an estimate f̂ of f*, construct an embedding function for document halves:
  φ̂(x) ≔ ( f̂(x, z_i) : i = 1, …, m ) ∈ ℝ^m
  where z_1, …, z_m are "landmark documents"
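A minimal sketch of this construction, assuming an estimate f̂ such as the one fit above; the choice of landmark halves here is arbitrary and purely illustrative.

```python
# Landmark embedding: phi_hat(x) = (f_hat(x, z_1), ..., f_hat(x, z_m)).
import numpy as np

def phi_hat(x, landmarks, f_hat):
    """Embed half-document x into R^m via landmark halves z_1..z_m and f_hat."""
    return np.array([f_hat(x, z) for z in landmarks])

# Usage, assuming `docs` and `f_hat` from the previous sketches:
# landmarks = [doc[len(doc) // 2:] for doc in docs]   # e.g., second halves as landmarks
# features = np.stack([phi_hat(doc[:len(doc) // 2], landmarks, f_hat) for doc in docs])
```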
Topic model [Hofmann, 1999; Blei, Ng, Jordan, 2003; …]
• k topics, each specifies a distribution over the vocabulary
• A document is associated with its own distribution θ over the k topics
• Words in the document (bag-of-words): i.i.d. from the induced mixture distribution
• Assume they are arbitrarily partitioned into two halves, x and x′
E.g., words ∼ i.i.d. (1/5)·sports + (2/5)·science + (2/5)·politics + (0/5)·business
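A small sketch of this generative process (Dirichlet document-topic draws as in LDA; the sizes and hyperparameters are illustrative), producing a topic distribution θ and the two halves x, x′:

```python
# Generate a document from a k-topic model and split it into two halves.
import numpy as np

rng = np.random.default_rng(0)
k, vocab_size, doc_len = 4, 50, 20
topics = rng.dirichlet(np.ones(vocab_size) * 0.1, size=k)      # k topic-word distributions

def sample_document(alpha=0.5):
    theta = rng.dirichlet(np.full(k, alpha))                   # this document's topic distribution
    word_dist = theta @ topics                                 # induced mixture over the vocabulary
    words = rng.choice(vocab_size, size=doc_len, p=word_dist)  # i.i.d. bag of words
    return theta, words[: doc_len // 2], words[doc_len // 2:]  # halves x and x'

theta, x, x_prime = sample_document()
```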
Simple case: One topic per document
• Suppose θ ∈ {e_1, …, e_k} (i.e., document is about only one topic)
• Fact: Odds ratio = density ratio:
  f*(x, x′) ≔ Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)] = p_{X,X′}(x, x′) / ( p_X(x) · p_{X′}(x′) )
  (Left side: estimated using contrastive learning. Right side: interpreted via the data-generating distribution.)
Interpreting the density ratio…
Using the bag-of-words assumption, the density ratio is
  p_{X,X′}(x, x′) / ( p_X(x) · p_{X′}(x′) ) = Σ_{t=1}^{k} Pr[t] · Pr[x ∣ t] · Pr[x′ ∣ t] / ( p_X(x) · p_{X′}(x′) )
  = Σ_{t=1}^{k} Pr[t ∣ x] · Pr[x′ ∣ t] / p_{X′}(x′)
  = μ(x)ᵀ ℓ(x′) / p_{X′}(x′)
where μ(x) is the posterior over topics given x and ℓ(x′) is the vector of likelihoods of the topics given x′.
Inside the embedding
• Embedding: φ*(x) = ( f*(x, z_i) : i = 1, …, m ), where f*(x, x′) = μ(x)ᵀ ℓ(x′) / p_{X′}(x′)
• Therefore
  φ*(x) = [ ℓ(z_1)ᵀ / p_{X′}(z_1) ; … ; ℓ(z_m)ᵀ / p_{X′}(z_m) ] · μ(x)
  i.e., the (scaled) likelihoods of the topics given the z_i's, applied to the posterior over topics given x
Upshot in the simple case
• In the "one topic per document" case, the document embedding is a linear transformation of the posterior over topics: φ*(x) = A μ(x)
• Theorem: If A is full-rank, every linear function of the topic posterior can be expressed as a linear function of φ*(⋅)
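A numerical sketch of this case, assuming a uniform prior over the k pure topics (my assumption for the illustration): it computes the exact posterior μ(x), the matrix A whose rows are the scaled landmark likelihoods ℓ(z_i)ᵀ / p_{X′}(z_i), and checks φ*(x) = A μ(x); a full-rank A then lets any linear function of the posterior be written as a linear function of φ*.

```python
# One-topic-per-document case: the exact embedding is a linear transform of
# the topic posterior (illustrative check, uniform prior assumed).
import numpy as np

rng = np.random.default_rng(1)
k, vocab_size, half_len = 4, 50, 10
topics = rng.dirichlet(np.ones(vocab_size) * 0.1, size=k)   # k topic-word distributions
prior = np.full(k, 1.0 / k)                                 # assumed uniform prior over topics

def half_likelihoods(x):
    """l(x)[t] = Pr[x | topic t] for a half-document x (array of word ids)."""
    return np.array([np.prod(topics[t, x]) for t in range(k)])

def posterior(x):                                           # mu(x): topic posterior given x
    p = half_likelihoods(x) * prior
    return p / p.sum()

def marginal(x):                                            # p_{X'}(x)
    return float(half_likelihoods(x) @ prior)

def f_star(x, x_prime):                                     # odds ratio = density ratio
    return float(posterior(x) @ half_likelihoods(x_prime)) / marginal(x_prime)

# Landmarks z_1..z_m; A has rows l(z_i)^T / p_{X'}(z_i), so phi*(x) = A mu(x).
landmarks = [rng.choice(vocab_size, size=half_len, p=topics[t]) for t in range(k)]
A = np.stack([half_likelihoods(z) / marginal(z) for z in landmarks])

x = rng.choice(vocab_size, size=half_len, p=topics[0])
phi_star = np.array([f_star(x, z) for z in landmarks])
assert np.allclose(phi_star, A @ posterior(x))              # phi*(x) = A mu(x)
print(np.linalg.matrix_rank(A))                             # full rank => posterior recoverable linearly
```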
General case: Exploit bag-of-words structure
• In general, the posterior distribution over θ (the topic distribution) given x is not summarized by just a k-dimensional vector.
• If x and x′ each have n words:
  • Let v_n(w) ≔ ( w^S : |S| ≤ n ), where w^S = ∏_{j∈S} w_j for w ∈ ℝ^k
  • Let μ(x) ≔ E[ v_n(θ) ∣ x ] (order-n multivariate conditional moments of θ)
• There is a corresponding ℓ(⋅) (that depends on the topic model parameters) such that f*(x, x′) = μ(x)ᵀ ℓ(x′) / p_{X′}(x′)
• Theorem: There is a choice of landmark documents such that φ*(x) yields a (linear transform of) the conditional moments of θ of orders ≤ n.
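A small sketch of the moment featurization v_n(·) from this slide: the vector of monomials w^S over multisets S of at most n topics (the exact indexing convention used in the paper may differ; the degree-0 entry here is my choice).

```python
# Monomial / moment features v_n(w) of a topic vector w, up to degree n.
import itertools
import numpy as np

def v_n(w, n):
    """All monomials w^S = prod_{j in S} w_j over multisets S with |S| <= n."""
    w = np.asarray(w, dtype=float)
    feats = [1.0]                                            # |S| = 0: empty product
    for size in range(1, n + 1):
        for S in itertools.combinations_with_replacement(range(len(w)), size):
            feats.append(float(np.prod(w[list(S)])))
    return np.array(feats)

theta = np.array([0.2, 0.4, 0.4, 0.0])
print(v_n(theta, 2))   # 1, the theta_j's, and all pairwise products theta_i * theta_j
```

Then μ(x) = E[ v_n(θ) ∣ x ] stacks the conditional moments of θ of orders up to the half-document length n.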
2. Multi-view redundancy analysis
The plan
a. Recap multi-view prediction setting
b. How contrastive learning fares in the multi-view setting
[Diagram: Unlabeled data → Feature map φ; Labeled data → Predictor]
Setting for multi-view prediction
• Assume (unlabeled) data provides two "views" x and x′, each equally good at predicting a target y
• Example: topic identification
  • y = topic of article
  • x = text of abstract
  • x′ = text of article
• Example: web page classification
  • y = web page type
  • x = text of web page
  • x′ = text of hyper-links pointing to the page
Multi-view learning methods
• Co-training [Blum & Mitchell, 1998]:
  • If x ⊥ x′ ∣ y, then bootstrapping methods "work"
• Canonical Correlation Analysis [Kakade & Foster, 2007]:
  • Suppose there is redundancy of views via linear predictors: for each v ∈ {x, x′}, the best linear predictor of y from v alone is within ε of the best linear predictor of y from both views
  • Then CCA-based (linear) dimension reduction doesn't hurt much
  • (No assumption of conditional independence!)
Q: What if views are redundant only via non-linear predictors?
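For concreteness, a hedged sketch of the CCA route in the spirit of [Kakade & Foster, 2007], on synthetic two-view data (the data-generating choices are mine, not from the talk): fit CCA on paired views, then feed the projected first view to a downstream linear predictor.

```python
# CCA-based dimension reduction across two views, followed by a linear predictor.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d = 500, 20
z = rng.normal(size=(n, 3))                                        # shared latent signal
X1 = z @ rng.normal(size=(3, d)) + 0.5 * rng.normal(size=(n, d))   # view x
X2 = z @ rng.normal(size=(3, d)) + 0.5 * rng.normal(size=(n, d))   # view x'
y = z[:, 0] + 0.1 * rng.normal(size=n)                             # target predictable from either view

cca = CCA(n_components=3).fit(X1, X2)                              # fit on (unlabeled) paired views
X1_reduced, _ = cca.transform(X1, X2)                              # low-dimensional CCA features of view x
print(Ridge().fit(X1_reduced, y).score(X1_reduced, y))             # downstream linear predictor
```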
Surrogate predictor via multi-view redundancy
Define g(x) ≔ E[ E[y ∣ x′] ∣ x ], where E[y ∣ x′] is the best (possibly non-linear) prediction of y using x′.
Lemma: If E[ ( E[y ∣ v] − E[y ∣ x, x′] )² ] ≤ ε for each v ∈ {x, x′}, then E[ ( g(x) − E[y ∣ x, x′] )² ] ≤ 4ε.
Our strategy: Learn a representation φ(x) such that g(x) ≈ linear function of φ(x).
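A quick numerical check of this lemma on a tiny synthetic discrete model (the joint distribution here is invented for illustration): compute g(x) = E[E[y ∣ x′] ∣ x] exactly and compare its excess MSE against the 4ε bound.

```python
# Verify the surrogate-predictor lemma numerically on a small discrete model.
import numpy as np

rng = np.random.default_rng(0)
nx, nxp = 4, 5
P = rng.random((nx, nxp))
P /= P.sum()                                          # joint distribution of (x, x')
Y = rng.random((nx, nxp))                             # E[y | x, x'] on each (x, x') cell

Px, Pxp = P.sum(axis=1), P.sum(axis=0)
E_y_given_x = (P * Y).sum(axis=1) / Px                # E[y | x]
E_y_given_xp = (P * Y).sum(axis=0) / Pxp              # E[y | x']
g = (P / Px[:, None]) @ E_y_given_xp                  # g(x) = E[E[y | x'] | x]

def excess_mse(pred):                                 # E[(pred - E[y | x, x'])^2]
    return float((P * (pred - Y) ** 2).sum())

eps = max(excess_mse(E_y_given_x[:, None]), excess_mse(E_y_given_xp[None, :]))
print(excess_mse(g[:, None]), "<=", 4 * eps)          # the lemma's guarantee
```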