Learning sparsely used overcomplete dictionaries

Alekh Agarwal (Microsoft Research)
Joint work with Anima Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon
Motivation I: Feature learning

[Figure: documents mapped to a numeric feature matrix via feature engineering]

- Feature engineering takes considerable time and skill
- Typically critical to good performance
- Can we learn good features from data?
Motivation II: Signal compression

- Expensive to store high-dimensional signals
- Sparse signals have compact representations
- Can we learn a representation in which signals of interest are sparse?
Dictionary learning in practice

- Image compression (Bruckstein et al., 2009)
- Similar successes in image denoising, inpainting, super-resolution, ...
- Non-convex optimization; limited theoretical understanding
Dictionary learning setup

Goal: Find a dictionary with r elements such that each data point is a combination of only s dictionary elements.

    Y = A* X*,  where Y ∈ R^{d×n} (examples), A* ∈ R^{d×r} (dictionary), X* ∈ R^{r×n} (coefficients)

- Encode faces using dictionary elements rather than pixel values
- Sparsity for compression, signal processing, ...
- Applications: topic models, overlapping clustering, image representation
- Overcomplete setting, r ≫ d, is the relevant regime in practice
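The generative model Y = A* X* above can be sketched with synthetic data. The sizes d, r, n, s below are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n, s = 50, 100, 2000, 3   # illustrative sizes; r >> d is the overcomplete regime

# Dictionary A* with unit-norm columns
A_star = rng.standard_normal((d, r))
A_star /= np.linalg.norm(A_star, axis=0)

# Coefficients X*: each column is s-sparse (each sample uses only s elements)
X_star = np.zeros((r, n))
for j in range(n):
    support = rng.choice(r, size=s, replace=False)
    X_star[support, j] = rng.standard_normal(s)

Y = A_star @ X_star   # observed examples, one per column
```

Each column of Y is then a combination of exactly s of the r dictionary elements, matching the goal above.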
Alternating minimization

Objective:

    min_{A,X} ||X||_1  subject to  Y = AX,   where ||X||_1 = Σ_{i,j} |X_{ij}|

Dominant approach in practice:
- Start with an initial dictionary A^(0)
- Sparse regression for coefficients given the dictionary:
      X_i^(t+1) = argmin_{x ∈ R^r} ||x||_1  s.t.  ||Y_i − A^(t) x||_2 ≤ ε_t
- Least squares for the dictionary given the coefficients:
      A^(t+1) = Y (X^(t+1))^+,  i.e. Y ≈ A^(t+1) X^(t+1)
- Similar to EM for this problem
- Does not converge to the global optimum from an arbitrary A^(0)
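A minimal numpy sketch of the two alternating steps, assuming the sparsity level s is known. Greedy orthogonal matching pursuit stands in here for the l1-constrained sparse regression step, and the dictionary update is the pseudoinverse step A^(t+1) = Y (X^(t+1))^+ from the slide:

```python
import numpy as np

def omp(A, y, s):
    """Greedy sparse regression (orthogonal matching pursuit), a stand-in
    for the l1-constrained sparse coding step described above."""
    resid, support = y.copy(), []
    x = np.zeros(A.shape[1])
    for _ in range(s):
        support.append(int(np.argmax(np.abs(A.T @ resid))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        resid = y - A[:, support] @ coef
    x[support] = coef
    return x

def alternating_minimization(Y, A0, s, iters=10):
    A = A0.copy()
    for _ in range(iters):
        # Sparse regression step: code each sample against the current dictionary
        X = np.column_stack([omp(A, Y[:, j], s) for j in range(Y.shape[1])])
        # Least-squares dictionary update: A <- Y X^+ (Moore-Penrose pseudoinverse)
        A = Y @ np.linalg.pinv(X)
        # Renormalize dictionary columns, rescaling X so that Y ~ A X is preserved
        norms = np.maximum(np.linalg.norm(A, axis=0), 1e-12)
        A /= norms
        X *= norms[:, None]
    return A, X
```

Started from a good initial dictionary, a few iterations typically drive the reconstruction error ||Y − AX|| close to zero on noiseless synthetic data; from an arbitrary A0, as the slide notes, there is no such guarantee.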
Alternating minimization goal

    (Â, X̂) = argmin_{A,X} ||X||_1  subject to  Y = AX

- Y = AX is a non-convex constraint
- The average of two solutions is not a solution:
      Y = AX and Y = (−A)(−X), but Y ≠ ((A + (−A))/2) · ((X + (−X))/2) = 0
- Non-convex optimization, NP-hard in general
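The sign-flip argument above can be checked numerically on a toy instance (the matrices here are illustrative):

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.array([[2.0], [-3.0]])
Y = A @ X

# Both (A, X) and (-A, -X) satisfy the constraint Y = AX ...
assert np.allclose(Y, (-A) @ (-X))

# ... but their average does not: the feasible set is non-convex
A_avg, X_avg = (A + (-A)) / 2, (X + (-X)) / 2
print(np.allclose(Y, A_avg @ X_avg))  # False: the average is the zero matrix
```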
Previous theory work

- Exact recovery in the undercomplete setting by Spielman et al. via linear programming
- This work: alternating minimization combined with a novel initialization
- Reaches the global optimum despite non-convexity, in the overcomplete setting
Initialization: Key ideas

[Figure: 2D scatter of samples clustered around a shared direction]

- Find several samples with a common dictionary element
- The top singular vector of these samples is an estimate of that element
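The SVD step can be sketched as follows: generate samples that all contain a common element a plus small contributions from other elements, and check that the top left singular vector recovers a up to sign. All sizes and noise levels below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 30
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                  # the shared dictionary element
others = rng.standard_normal((d, 4))    # other dictionary elements

# Each sample contains `a` with a large coefficient plus small cross-terms
samples = np.column_stack(
    [2.0 * a + 0.05 * others @ rng.standard_normal(4) for _ in range(200)]
)

# Top left singular vector of the stacked samples estimates `a` (up to sign)
u, _, _ = np.linalg.svd(samples, full_matrices=False)
estimate = u[:, 0]
error = min(np.linalg.norm(estimate - a), np.linalg.norm(estimate + a))
```

The sign ambiguity is inherent: a and −a span the same direction, which is all the initialization needs.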
Correlation graph

Definition (Correlation graph):
- One node for each example
- Edge {Y_i, Y_j} if |⟨Y_i, Y_j⟩| ≥ ρ

[Figure: correlation graph with a good clique S_1 and a bad clique S_2]

- Large correlation ⇒ common dictionary element
- Samples in a clique contain a common dictionary element
- Cliques are easy to construct
Initialization algorithm

1. Construct the correlation graph G_ρ given a threshold ρ
2. For each edge (Y_i, Y_j) in G_ρ
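Step 1 above can be sketched directly. Below, the samples are assumed to be the columns of Y, and the threshold rho is a free parameter:

```python
import numpy as np

def correlation_graph(Y, rho):
    """Correlation graph: one node per sample (column of Y),
    an edge {i, j} whenever |<Y_i, Y_j>| >= rho."""
    G = np.abs(Y.T @ Y) >= rho       # n x n boolean adjacency matrix
    np.fill_diagonal(G, False)       # no self-loops
    n = G.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if G[i, j]]
    return G, edges
```

For example, with three unit-norm samples where only the first two share a dictionary element, the graph has the single edge {Y_1, Y_2}.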