Fast, Provable Algorithms for Learning Structured Dictionaries and Autoencoders

Chinmay Hegde, Iowa State University
Collaborators: Thanh Nguyen (ISU), Raymond Wong (Texas A&M), Akshay Soni (Yahoo! Research)
Flavors of machine learning

Supervised learning:
◮ Classification
◮ Regression
◮ Categorization
◮ Search
◮ ...

Unsupervised learning:
◮ Representation learning
◮ Clustering
◮ Dimensionality reduction
◮ Density estimation
◮ ...

In the landscape of ML research:
◮ Supervised ML dominates not only practice ...
◮ ... but also theory
Learning data representations

PCA was among the first attempts.
[Figure: PCA on 12 × 12 patches of natural images — components are not localized and visually difficult to interpret.]

Sparse coding (Olshausen and Field, '96):
[Figure: sparse coding on the same patches — components are local, oriented, and interpretable.]
Sparse coding

Sparse coding (a.k.a. dictionary learning): learn an over-complete, sparse representation for a set of data points.

y ≈ A x,  with data y ∈ R^n (e.g. images), dictionary A ∈ R^{n×m}, and code x ∈ R^m

◮ the dictionary is overcomplete (n < m)
◮ the representation (code) is sparse
Mathematical formulation

Input: p data samples Y = [y^{(1)}, y^{(2)}, ..., y^{(p)}] ∈ R^{n×p}

Goal: find a dictionary A and codes X = [x^{(1)}, x^{(2)}, ..., x^{(p)}] ∈ R^{m×p} that sparsely represent Y:

  \min_{A,X} \; L(A,X) = \frac{1}{2}\|Y - AX\|_F^2 \quad \text{s.t.} \quad \|x^{(j)}\|_0 \le k \;\; \text{for all } j
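As a point of reference, here is a minimal numpy sketch of the objective above together with the hard-thresholding operator that enforces the k-sparsity constraint; the function names are placeholders of ours, not notation from the talk.

```python
import numpy as np

def sparse_coding_loss(Y, A, X):
    """L(A, X) = 1/2 * ||Y - A X||_F^2, the objective above."""
    residual = Y - A @ X
    return 0.5 * np.sum(residual ** 2)

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v (enforces ||v||_0 <= k)."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out
```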
Challenges

  \min_{A,X} \; L(A,X) = \frac{1}{2}\|Y - AX\|_F^2 \quad \text{s.t.} \quad \|x^{(j)}\|_0 \le k

Two major obstacles:

1. Theory
◮ highly non-convex in both the objective and the constraints
◮ few provably correct algorithms (barring recent breakthroughs)

2. Practice
◮ even heuristics face memory and running-time issues
◮ merely storing an estimate of A requires mn = Ω(n²) memory
This talk

Overview of our recent algorithmic work on sparse coding:
◮ computational challenges
◮ dealing with missing data
◮ autoencoder training
Structured dictionaries

Y ≈ AX

Key idea: impose additional structure on A.

One type of structure is double-sparsity:
◮ the dictionary is itself sparse in some fixed basis Φ

y ≈ Φ A x,  with y ∈ R^n, a sparse component matrix A ∈ R^{n×m}, and a sparse code x ∈ R^m
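To make the structure concrete, here is a small numpy sketch that builds a doubly-sparse dictionary Φ A with r-sparse, unit-norm columns; the toy dimensions and the choice of Φ are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 64, 128, 8            # toy sizes, chosen for illustration only

Phi = np.eye(n)                 # fixed basis (the talk later takes Phi = I)
A = np.zeros((n, m))            # sparse component: each column has r nonzeros
for j in range(m):
    rows = rng.choice(n, size=r, replace=False)
    A[rows, j] = rng.standard_normal(r)
    A[:, j] /= np.linalg.norm(A[:, j])

D = Phi @ A                     # effective dictionary; data is modeled as y ~ D x
```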
Double-sparsity

[Figure: learned dictionaries — double-sparse coding with sym8 wavelets vs. regular sparse coding; figures reproduced using Trainlets (Sulam et al. '16).]
Previous work

Y ≈ AX + noise

Setting       | Approach                      | Sample comp. (w/o noise) | Sample comp. (w/ noise) | Run. time
--------------|-------------------------------|--------------------------|-------------------------|-----------
Regular       | K-SVD (Aharon et al. '06)     | ✗                        | ✗                       | ✗
Regular       | ER-SpUD (Spielman et al. '12) | Õ(n² log n)              | ✗                       | Ω(n⁴)
Regular       | Arora et al. '15              | Õ(mk)                    | ✗                       | Õ(mn²p)
Double sparse | Rubinstein et al. '10         | ✗                        | ✗                       | ✗
Double sparse | Gribonval et al. '15          | Õ(mr)                    | Õ(mr)                   | ✗
Double sparse | Trainlets (Sulam et al. '16)  | ✗                        | ✗                       | ✗

(r: sparsity of the columns of A; k: sparsity of the columns of X; ✗ means no guarantee.)

But no provable, tractable algorithms had been reported to date for the double-sparse setting.
Our contributions (I)

Adding our result to the comparison above:

Setting       | Approach    | Sample comp. (w/o noise) | Sample comp. (w/ noise)  | Run. time
--------------|-------------|--------------------------|--------------------------|----------
Double sparse | Our method* | Õ(mr)                    | Õ(mr + σ² mnr / (ε k))   | Õ(mnp)

*T. Nguyen, R. Wong, C. Hegde, "A Provable Approach for Double-Sparse Coding", AAAI 2018.
Setup

We assume the following generative model. Suppose that p samples are generated as

  y^{(i)} = A^* x^{*(i)},   i = 1, 2, ..., p

◮ A^* is the unknown, true dictionary with r-sparse columns
◮ x^* has a uniformly random k-sparse support with independent nonzeros

(For simplicity, assume Φ = I and no noise.)

Goal: provably learn A^* with low sample complexity and running time.
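A toy data generator matching this model (Φ = I, no noise); the ±1 nonzeros are an illustrative choice of ours, beyond the stated independence and k-sparsity.

```python
import numpy as np

def generate_samples(A_star, k, p, rng):
    """Draw p samples y^(i) = A* x*^(i), where each x* has a uniformly
    random k-sparse support and independent nonzero entries."""
    n, m = A_star.shape
    X = np.zeros((m, p))
    for j in range(p):
        S = rng.choice(m, size=k, replace=False)    # uniform k-sparse support
        X[S, j] = rng.choice([-1.0, 1.0], size=k)   # independent +/-1 nonzeros
    return A_star @ X, X
```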
Approach overview

[Figure: one-dimensional sketch of a non-convex objective f(z), with the minimizer z* and a radius δ around it marked.]

1. Spectral initialization to obtain a coarse estimate A_0
2. Gradient descent to refine this estimate
Approach overview

  \min_{A,X} \; L(A,X) = \frac{1}{2}\|Y - AX\|_F^2 \quad \text{s.t.} \quad \|x^{(j)}\|_0 \le k, \;\; \|A_{\bullet i}\|_0 \le r

1. Spectral initialization to obtain a coarse estimate A_0
2. Gradient descent to refine the initial estimate

Two key elements in our (double-sparse coding) setup (see the skeleton below):
1. Identify atom supports during initialization (à la sparse PCA)
2. Use projected gradient descent onto these supports
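A high-level skeleton of the two-stage scheme, with the two stages sketched on the following slides; `spectral_init` and `descent_step` are placeholder names of ours, not functions from the paper.

```python
def double_sparse_coding(Y, m, k, r, n_iters, eta):
    """Two-stage scheme: spectral initialization, then projected
    (approximate) gradient descent restricted to the estimated supports."""
    A, supports = spectral_init(Y, m, k, r)        # coarse estimate A_0 + atom supports
    for _ in range(n_iters):
        A = descent_step(Y, A, supports, k, eta)   # refine within the supports
    return A
```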
Initialization

Intuition: fix samples u, v such that u = A^* α and v = A^* α', and consider a third sample y = A^* x^*; then

  \langle y, u\rangle \langle y, v\rangle
    = \langle x^*, {A^*}^T A^* \alpha \rangle \, \langle x^*, {A^*}^T A^* \alpha' \rangle
    \approx \langle x^*, \alpha \rangle \, \langle x^*, \alpha' \rangle

The weight ⟨y, u⟩⟨y, v⟩ is large only if y shares an atom with both u and v.
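In code, these weights are just products of correlations; a one-line numpy sketch (our notation):

```python
import numpy as np

def pair_weights(Y, u, v):
    """Return <y, u> * <y, v> for every sample y (the columns of Y).
    Large weights flag samples that likely share an atom with both u and v."""
    return (Y.T @ u) * (Y.T @ v)
```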
Init: Key lemma (I)

Lemma (1). Fix samples u and v. Then

  e_\ell \triangleq \mathbb{E}\big[\langle y,u\rangle \langle y,v\rangle \, y_\ell^2\big]
    = \sum_{i \in U \cap V} q_i \, c_i \, \beta_i \beta'_i \, (A^*_{\ell i})^2 + o\!\left(\frac{k}{m \log n}\right),

where q_i = P[i ∈ S], q_{ij} = P[i, j ∈ S], and c_i = E[x_i^4 | i ∈ S].

When U ∩ V = {i}, we can guess the support R of A^*_{•i}:
◮ |e_ℓ| > Ω(k/(mr)) for ℓ ∈ supp(A^*_{•i})
◮ |e_ℓ| < o(k/(m log n)) otherwise

This lets us "isolate" samples which share exactly one atom.
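An empirical sketch of this support guess: average the weighted squared coordinates over samples and keep the largest entries. Keeping the top r entries (rather than thresholding between the two regimes above) is a simplification of ours.

```python
import numpy as np

def estimate_support(Y, u, v, r):
    """Empirical e_l = (1/p) * sum_j <y_j,u><y_j,v> * (y_j)_l^2,
    then keep the r largest |e_l| as the guessed support R."""
    w = (Y.T @ u) * (Y.T @ v)          # per-sample weights
    e = (Y ** 2) @ w / Y.shape[1]      # length-n vector of weighted averages
    return np.argsort(np.abs(e))[-r:]
```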
Init: Key lemma (II)

A similar idea lets us (coarsely) estimate the atoms themselves.

Lemma (2). Define the truncated weighted covariance matrix

  M_{u,v} \triangleq \mathbb{E}\big[\langle y,u\rangle \langle y,v\rangle \, y_R y_R^T\big]
    = \sum_{i \in U \cap V} q_i \, c_i \, \beta_i \beta'_i \, A^*_{R,i} {A^*_{R,i}}^T + o\!\left(\frac{k}{m \log n}\right),

where q_i = P[i ∈ S], q_{ij} = P[i, j ∈ S], and c_i = E[x_i^4 | i ∈ S].

When U ∩ V = {i}:
◮ the top singular value of M_{u,v} satisfies σ_1 > Ω(k/m)
◮ the second singular value satisfies σ_2 < o(k/(m log n))

so the top singular vector of M_{u,v} gives a coarse estimate of A^*_{R,i}.
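A matching sketch for the atom estimate: form the empirical weighted covariance restricted to R and take its leading eigenvector (the matrix is symmetric, so eigh suffices). This is our reading of the lemma, not code from the paper.

```python
import numpy as np

def estimate_atom(Y, u, v, R):
    """Empirical M_{u,v} restricted to rows R; its leading eigenvector is a
    coarse estimate of the shared atom, embedded back into R^n."""
    w = (Y.T @ u) * (Y.T @ v)              # per-sample weights
    YR = Y[R, :]
    M = (YR * w) @ YR.T / Y.shape[1]       # (1/p) * sum_j w_j * y_R y_R^T
    _, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    atom = np.zeros(Y.shape[0])
    atom[R] = vecs[:, -1]                  # leading eigenvector
    return atom / np.linalg.norm(atom)
```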
Descent stage

Projected approximate gradient descent. Given A_0 from the initialization stage:

1) Encode:  x^{(i)} = \mathrm{threshold}(A^T y^{(i)})
2) Update:  A \leftarrow A - \eta \, \underbrace{\mathcal{P}_k\big((AX - Y)\,\mathrm{sgn}(X)^T\big)}_{g}

Note: g is a (biased) approximation of the true gradient

  \nabla_A L = -\sum_{i=1}^{p} (y^{(i)} - A x^{(i)}) (x^{(i)})^T = -(Y - AX) X^T
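One descent iteration, as a sketch. Two caveats: the encoding step keeps the top-k entries per sample (a stand-in for the slide's threshold(·)), and the projection is taken onto the supports estimated at initialization, per the earlier slide; the 1/p normalization of the gradient is our choice.

```python
import numpy as np

def descent_step(Y, A, supports, k, eta):
    """A <- A - eta * P((A X - Y) sgn(X)^T), with P projecting each column
    of the approximate gradient onto that atom's estimated support."""
    # 1) Encode: keep the k largest-magnitude entries of A^T y for each sample
    X = A.T @ Y
    for j in range(X.shape[1]):
        drop = np.argsort(np.abs(X[:, j]))[:-k]
        X[drop, j] = 0.0
    # 2) Approximate gradient and projected update
    g = (A @ X - Y) @ np.sign(X).T / Y.shape[1]
    for i in range(A.shape[1]):
        off = np.setdiff1d(np.arange(A.shape[0]), supports[i])
        g[off, i] = 0.0                     # projection onto the estimated support
    return A - eta * g
```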
Convergence analysis

Intuition: if initialized well, then the gradient approximation "points" in the right direction.

Lemma (Descent). Suppose that A is column-wise δ-close to A^* and R = supp(A^*_{•i}). Then

  2\,\langle g_{R,i}, \, A_{R,i} - A^*_{R,i} \rangle
    \ge \alpha \, \|A_{R,i} - A^*_{R,i}\|^2 + \frac{1}{2\alpha}\,\|g_{R,i}\|^2 - \frac{\epsilon^2}{\alpha}

for α = O(k/m) and ε² = O(αk²/n²).
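For completeness, here is the standard way such a descent property yields convergence: a sketch of the one-step argument, under the assumption that the step size satisfies η ≤ 1/(2α).

```latex
% One projected gradient step A' = A - \eta g, column-wise on the support R:
\begin{align*}
\|A'_{R,i} - A^*_{R,i}\|^2
  &= \|A_{R,i} - A^*_{R,i}\|^2
     - 2\eta \langle g_{R,i},\, A_{R,i} - A^*_{R,i} \rangle
     + \eta^2 \|g_{R,i}\|^2 \\
  &\le (1 - \eta\alpha)\,\|A_{R,i} - A^*_{R,i}\|^2
     + \eta\Big(\eta - \tfrac{1}{2\alpha}\Big)\|g_{R,i}\|^2
     + \tfrac{\eta\,\epsilon^2}{\alpha} \\
  &\le (1 - \eta\alpha)\,\|A_{R,i} - A^*_{R,i}\|^2 + \tfrac{\eta\,\epsilon^2}{\alpha}
  \qquad \text{for } \eta \le \tfrac{1}{2\alpha},
\end{align*}
```

so the column-wise error contracts geometrically, down to a neighborhood of radius O(ε/α) around A^*.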