Fast, Provable Algorithms for Learning Structured Dictionaries and Autoencoders

Chinmay Hegde, Iowa State University
Collaborators: Thanh Nguyen (ISU), Raymond Wong (Texas A&M), Akshay Soni (Yahoo! Research)
Flavors of machine learning

Supervised learning:
◮ Classification
◮ Regression
◮ Categorization
◮ Search
◮ ...

Unsupervised learning:
◮ Representation learning
◮ Clustering
◮ Dimensionality reduction
◮ Density estimation
◮ ...

In the landscape of ML research:
◮ Supervised ML dominates not only practice ...
◮ ... but also theory
Learning data representations

PCA was among the first attempts.
[Figure: PCA on 12 × 12 patches of natural images — components are not localized and visually difficult to interpret.]

Sparse coding (Olshausen and Field, '96):
[Figure: sparse coding on the same patches — components are local, oriented, and interpretable.]
Sparse coding

Sparse coding (a.k.a. dictionary learning): learn an over-complete, sparse representation for a set of data points.

y ≈ A x,  with data y ∈ R^n (e.g. images), dictionary A ∈ R^{n×m}, and code x ∈ R^m

◮ the dictionary is overcomplete (n < m)
◮ the representation (code) is sparse
Mathematical formulation

Input: p data samples Y = [y^{(1)}, y^{(2)}, ..., y^{(p)}] ∈ R^{n×p}

Goal: find a dictionary A and codes X = [x^{(1)}, x^{(2)}, ..., x^{(p)}] ∈ R^{m×p} that sparsely represent Y:

  \min_{A,X} \; L(A,X) = \frac{1}{2}\|Y - AX\|_F^2 \quad \text{s.t.} \quad \|x^{(j)}\|_0 \le k \;\; \text{for all } j
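As a point of reference, here is a minimal numpy sketch of the objective above together with the hard-thresholding operator that enforces the k-sparsity constraint; the function names are placeholders of ours, not notation from the talk.

```python
import numpy as np

def sparse_coding_loss(Y, A, X):
    """L(A, X) = 1/2 * ||Y - A X||_F^2, the objective above."""
    residual = Y - A @ X
    return 0.5 * np.sum(residual ** 2)

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v (enforces ||v||_0 <= k)."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out
```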
Challenges

  \min_{A,X} \; L(A,X) = \frac{1}{2}\|Y - AX\|_F^2 \quad \text{s.t.} \quad \|x^{(j)}\|_0 \le k

Two major obstacles:

1. Theory
◮ highly non-convex in both the objective and the constraints
◮ few provably correct algorithms (barring recent breakthroughs)

2. Practice
◮ even heuristics face memory and running-time issues
◮ merely storing an estimate of A requires mn = Ω(n²) memory
This talk

Overview of our recent algorithmic work on sparse coding:
◮ computational challenges
◮ dealing with missing data
◮ autoencoder training
Structured dictionaries

Y ≈ AX

Key idea: impose additional structure on A.

One type of structure is double-sparsity:
◮ the dictionary is itself sparse in some fixed basis Φ

y ≈ Φ A x,  with y ∈ R^n, a sparse component matrix A ∈ R^{n×m}, and a sparse code x ∈ R^m
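To make the structure concrete, here is a small numpy sketch that builds a doubly-sparse dictionary Φ A with r-sparse, unit-norm columns; the toy dimensions and the choice of Φ are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 64, 128, 8            # toy sizes, chosen for illustration only

Phi = np.eye(n)                 # fixed basis (the talk later takes Phi = I)
A = np.zeros((n, m))            # sparse component: each column has r nonzeros
for j in range(m):
    rows = rng.choice(n, size=r, replace=False)
    A[rows, j] = rng.standard_normal(r)
    A[:, j] /= np.linalg.norm(A[:, j])

D = Phi @ A                     # effective dictionary; data is modeled as y ~ D x
```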
Double-sparsity

[Figure: learned dictionaries — double-sparse coding with sym8 wavelets vs. regular sparse coding; figures reproduced using Trainlets (Sulam et al. '16).]
Previous work

Y ≈ AX + noise

Setting       | Approach                      | Sample comp. (w/o noise) | Sample comp. (w/ noise) | Run. time
--------------|-------------------------------|--------------------------|-------------------------|-----------
Regular       | K-SVD (Aharon et al. '06)     | ✗                        | ✗                       | ✗
Regular       | ER-SpUD (Spielman et al. '12) | Õ(n² log n)              | ✗                       | Ω(n⁴)
Regular       | Arora et al. '15              | Õ(mk)                    | ✗                       | Õ(mn²p)
Double sparse | Rubinstein et al. '10         | ✗                        | ✗                       | ✗
Double sparse | Gribonval et al. '15          | Õ(mr)                    | Õ(mr)                   | ✗
Double sparse | Trainlets (Sulam et al. '16)  | ✗                        | ✗                       | ✗

(r: sparsity of the columns of A; k: sparsity of the columns of X; ✗ means no guarantee.)

But no provable, tractable algorithms had been reported to date for the double-sparse setting.
Our contributions (I)

Adding our result to the comparison above:

Setting       | Approach    | Sample comp. (w/o noise) | Sample comp. (w/ noise)  | Run. time
--------------|-------------|--------------------------|--------------------------|----------
Double sparse | Our method* | Õ(mr)                    | Õ(mr + σ² mnr / (ε k))   | Õ(mnp)

*T. Nguyen, R. Wong, C. Hegde, "A Provable Approach for Double-Sparse Coding", AAAI 2018.
Setup

We assume the following generative model. Suppose that p samples are generated as

  y^{(i)} = A^* x^{*(i)},   i = 1, 2, ..., p

◮ A^* is the unknown, true dictionary with r-sparse columns
◮ x^* has a uniformly random k-sparse support with independent nonzeros

(For simplicity, assume Φ = I and no noise.)

Goal: provably learn A^* with low sample complexity and running time.
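A toy data generator matching this model (Φ = I, no noise); the ±1 nonzeros are an illustrative choice of ours, beyond the stated independence and k-sparsity.

```python
import numpy as np

def generate_samples(A_star, k, p, rng):
    """Draw p samples y^(i) = A* x*^(i), where each x* has a uniformly
    random k-sparse support and independent nonzero entries."""
    n, m = A_star.shape
    X = np.zeros((m, p))
    for j in range(p):
        S = rng.choice(m, size=k, replace=False)    # uniform k-sparse support
        X[S, j] = rng.choice([-1.0, 1.0], size=k)   # independent +/-1 nonzeros
    return A_star @ X, X
```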
Approach overview

[Figure: one-dimensional sketch of a non-convex objective f(z), with the minimizer z* and a radius δ around it marked.]

1. Spectral initialization to obtain a coarse estimate A_0
2. Gradient descent to refine this estimate
Approach overview

  \min_{A,X} \; L(A,X) = \frac{1}{2}\|Y - AX\|_F^2 \quad \text{s.t.} \quad \|x^{(j)}\|_0 \le k, \;\; \|A_{\bullet i}\|_0 \le r

1. Spectral initialization to obtain a coarse estimate A_0
2. Gradient descent to refine the initial estimate

Two key elements in our (double-sparse coding) setup (see the skeleton below):
1. Identify atom supports during initialization (à la sparse PCA)
2. Use projected gradient descent onto these supports
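A high-level skeleton of the two-stage scheme, with the two stages sketched on the following slides; `spectral_init` and `descent_step` are placeholder names of ours, not functions from the paper.

```python
def double_sparse_coding(Y, m, k, r, n_iters, eta):
    """Two-stage scheme: spectral initialization, then projected
    (approximate) gradient descent restricted to the estimated supports."""
    A, supports = spectral_init(Y, m, k, r)        # coarse estimate A_0 + atom supports
    for _ in range(n_iters):
        A = descent_step(Y, A, supports, k, eta)   # refine within the supports
    return A
```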
Initialization

Intuition: fix samples u, v such that u = A^* α and v = A^* α', and consider a third sample y = A^* x^*; then

  \langle y, u\rangle \langle y, v\rangle
    = \langle x^*, {A^*}^T A^* \alpha \rangle \, \langle x^*, {A^*}^T A^* \alpha' \rangle
    \approx \langle x^*, \alpha \rangle \, \langle x^*, \alpha' \rangle

The weight ⟨y, u⟩⟨y, v⟩ is large only if y shares an atom with both u and v.
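In code, these weights are just products of correlations; a one-line numpy sketch (our notation):

```python
import numpy as np

def pair_weights(Y, u, v):
    """Return <y, u> * <y, v> for every sample y (the columns of Y).
    Large weights flag samples that likely share an atom with both u and v."""
    return (Y.T @ u) * (Y.T @ v)
```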
Init: Key lemma (I)

Lemma (1). Fix samples u and v. Then

  e_\ell \triangleq \mathbb{E}\big[\langle y,u\rangle \langle y,v\rangle \, y_\ell^2\big]
    = \sum_{i \in U \cap V} q_i \, c_i \, \beta_i \beta'_i \, (A^*_{\ell i})^2 + o\!\left(\frac{k}{m \log n}\right),

where q_i = P[i ∈ S], q_{ij} = P[i, j ∈ S], and c_i = E[x_i^4 | i ∈ S].

When U ∩ V = {i}, we can guess the support R of A^*_{•i}:
◮ |e_ℓ| > Ω(k/(mr)) for ℓ ∈ supp(A^*_{•i})
◮ |e_ℓ| < o(k/(m log n)) otherwise

This lets us "isolate" samples which share exactly one atom.
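An empirical sketch of this support guess: average the weighted squared coordinates over samples and keep the largest entries. Keeping the top r entries (rather than thresholding between the two regimes above) is a simplification of ours.

```python
import numpy as np

def estimate_support(Y, u, v, r):
    """Empirical e_l = (1/p) * sum_j <y_j,u><y_j,v> * (y_j)_l^2,
    then keep the r largest |e_l| as the guessed support R."""
    w = (Y.T @ u) * (Y.T @ v)          # per-sample weights
    e = (Y ** 2) @ w / Y.shape[1]      # length-n vector of weighted averages
    return np.argsort(np.abs(e))[-r:]
```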
Init: Key lemma (II)

A similar idea lets us (coarsely) estimate the atoms themselves.

Lemma (2). Define the truncated weighted covariance matrix

  M_{u,v} \triangleq \mathbb{E}\big[\langle y,u\rangle \langle y,v\rangle \, y_R y_R^T\big]
    = \sum_{i \in U \cap V} q_i \, c_i \, \beta_i \beta'_i \, A^*_{R,i} {A^*_{R,i}}^T + o\!\left(\frac{k}{m \log n}\right),

where q_i = P[i ∈ S], q_{ij} = P[i, j ∈ S], and c_i = E[x_i^4 | i ∈ S].

When U ∩ V = {i}:
◮ the top singular value of M_{u,v} satisfies σ_1 > Ω(k/m)
◮ the second singular value satisfies σ_2 < o(k/(m log n))

so the top singular vector of M_{u,v} gives a coarse estimate of A^*_{R,i}.
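A matching sketch for the atom estimate: form the empirical weighted covariance restricted to R and take its leading eigenvector (the matrix is symmetric, so eigh suffices). This is our reading of the lemma, not code from the paper.

```python
import numpy as np

def estimate_atom(Y, u, v, R):
    """Empirical M_{u,v} restricted to rows R; its leading eigenvector is a
    coarse estimate of the shared atom, embedded back into R^n."""
    w = (Y.T @ u) * (Y.T @ v)              # per-sample weights
    YR = Y[R, :]
    M = (YR * w) @ YR.T / Y.shape[1]       # (1/p) * sum_j w_j * y_R y_R^T
    _, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    atom = np.zeros(Y.shape[0])
    atom[R] = vecs[:, -1]                  # leading eigenvector
    return atom / np.linalg.norm(atom)
```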
Descent stage

Projected approximate gradient descent. Given A_0 from the initialization stage:

1) Encode:  x^{(i)} = \mathrm{threshold}(A^T y^{(i)})
2) Update:  A \leftarrow A - \eta \, \underbrace{\mathcal{P}_k\big((AX - Y)\,\mathrm{sgn}(X)^T\big)}_{g}

Note: g is a (biased) approximation of the true gradient

  \nabla_A L = -\sum_{i=1}^{p} (y^{(i)} - A x^{(i)}) (x^{(i)})^T = -(Y - AX) X^T
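One descent iteration, as a sketch. Two caveats: the encoding step keeps the top-k entries per sample (a stand-in for the slide's threshold(·)), and the projection is taken onto the supports estimated at initialization, per the earlier slide; the 1/p normalization of the gradient is our choice.

```python
import numpy as np

def descent_step(Y, A, supports, k, eta):
    """A <- A - eta * P((A X - Y) sgn(X)^T), with P projecting each column
    of the approximate gradient onto that atom's estimated support."""
    # 1) Encode: keep the k largest-magnitude entries of A^T y for each sample
    X = A.T @ Y
    for j in range(X.shape[1]):
        drop = np.argsort(np.abs(X[:, j]))[:-k]
        X[drop, j] = 0.0
    # 2) Approximate gradient and projected update
    g = (A @ X - Y) @ np.sign(X).T / Y.shape[1]
    for i in range(A.shape[1]):
        off = np.setdiff1d(np.arange(A.shape[0]), supports[i])
        g[off, i] = 0.0                     # projection onto the estimated support
    return A - eta * g
```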
Convergence analysis

Intuition: if initialized well, then the gradient approximation "points" in the right direction.

Lemma (Descent). Suppose that A is column-wise δ-close to A^* and R = supp(A^*_{•i}). Then

  2\,\langle g_{R,i}, \, A_{R,i} - A^*_{R,i} \rangle
    \ge \alpha \, \|A_{R,i} - A^*_{R,i}\|^2 + \frac{1}{2\alpha}\,\|g_{R,i}\|^2 - \frac{\epsilon^2}{\alpha}

for α = O(k/m) and ε² = O(αk²/n²).
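For completeness, here is the standard way such a descent property yields convergence: a sketch of the one-step argument, under the assumption that the step size satisfies η ≤ 1/(2α).

```latex
% One projected gradient step A' = A - \eta g, column-wise on the support R:
\begin{align*}
\|A'_{R,i} - A^*_{R,i}\|^2
  &= \|A_{R,i} - A^*_{R,i}\|^2
     - 2\eta \langle g_{R,i},\, A_{R,i} - A^*_{R,i} \rangle
     + \eta^2 \|g_{R,i}\|^2 \\
  &\le (1 - \eta\alpha)\,\|A_{R,i} - A^*_{R,i}\|^2
     + \eta\Big(\eta - \tfrac{1}{2\alpha}\Big)\|g_{R,i}\|^2
     + \tfrac{\eta\,\epsilon^2}{\alpha} \\
  &\le (1 - \eta\alpha)\,\|A_{R,i} - A^*_{R,i}\|^2 + \tfrac{\eta\,\epsilon^2}{\alpha}
  \qquad \text{for } \eta \le \tfrac{1}{2\alpha},
\end{align*}
```

so the column-wise error contracts geometrically, down to a neighborhood of radius O(ε/α) around A^*.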