Learning sparsely used overcomplete dictionaries
  1. Learning sparsely used overcomplete dictionaries
     Alekh Agarwal, Microsoft Research
     Joint work with Anima Anandkumar, Prateek Jain, Praneeth Netrapalli and Rashish Tandon
     (slide footer: Agarwal, Anandkumar, Jain, Netrapalli, Tandon. Overcomplete Dictionary Learning)

  2–6. Motivation I: Feature learning
     [figure: in practice, papers are mapped to numeric feature vectors, e.g. (1.2, 0.8, 0, 1.5), (0, 3.5, 2, 0.1), (0.7, 0, 0.3, 0.8), via feature engineering]
     Feature engineering takes considerable time and skill, and is typically critical to good performance.
     Can we learn good features from data?

  7–9. Motivation II: Signal compression
     It is expensive to store high-dimensional signals.
     Sparse signals have compact representations.
     Can we learn a representation in which the signals of interest are sparse?

  10–12. Dictionary learning in practice
     Image compression (Bruckstein et al., 2009).
     Similar successes in image denoising, inpainting, superresolution, ...
     Non-convex optimization; limited theoretical understanding.

  13–16. Dictionary learning setup
     Goal: find a dictionary with r elements such that each data point is a combination of only s dictionary elements.
     Y = A* X*, where Y (d × n) holds the examples, A* (d × r) is the dictionary, and X* (r × n) holds the coefficients.
     Encode faces using dictionary elements rather than pixel values; sparsity helps compression, signal processing, ...
     Applications: topic models, overlapping clustering, image representation.
     The overcomplete setting, r ≫ d, is the one relevant in practice.
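The setup above can be made concrete with a small synthetic instance (a minimal numpy sketch, not from the talk; the dimensions and sparsity level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n, s = 16, 32, 200, 3   # signal dim d, dictionary size r (r > d), samples n, sparsity s

# Random overcomplete dictionary with unit-norm columns (A* in the slides).
A_star = rng.standard_normal((d, r))
A_star /= np.linalg.norm(A_star, axis=0)

# Each coefficient column X*_i has at most s non-zero entries.
X_star = np.zeros((r, n))
for i in range(n):
    support = rng.choice(r, size=s, replace=False)
    X_star[support, i] = rng.standard_normal(s)

# Observed examples: each is a combination of only s dictionary elements.
Y = A_star @ X_star
```

Dictionary learning asks for the reverse direction: recover A* and the sparse X* from Y alone.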

  17–20. Alternating minimization
     Objective: min_{A,X} ||X||_1 subject to Y = AX, where ||X||_1 = Σ_{i,j} |X_ij|.
     The dominant approach in practice:
       Start with an initial dictionary A^(0).
       Sparse regression for the coefficients given the dictionary:
         X_i^(t+1) = argmin_{x ∈ R^r} ||x||_1  s.t.  ||Y_i − A^(t) x||_2 ≤ ε_t.
       Least squares for the dictionary given the coefficients:
         A^(t+1) = Y (X^(t+1))^+,  i.e.  Y ≈ A^(t+1) X^(t+1).
     Similar to EM for this problem.
     Does not converge to the global optimum from an arbitrary A^(0).
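The two alternating steps can be sketched in numpy. This is a stand-in, not the talk's exact procedure: the constrained ℓ1 step is replaced by an unconstrained lasso solved with ISTA, and the penalty `lam` and iteration counts are illustrative assumptions:

```python
import numpy as np

def ista(A, y, lam=0.1, iters=200):
    """Minimize 0.5*||y - A x||^2 + lam*||x||_1 by proximal gradient descent
    (a stand-in for the constrained l1 sparse-regression step on the slide)."""
    L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - A.T @ (A @ x - y) / L              # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x

def alt_min(Y, A0, steps=10, lam=0.1):
    """Alternate sparse regression (coefficients) with least squares (dictionary)."""
    A = A0.copy()
    for _ in range(steps):
        # Sparse regression for coefficients given the current dictionary.
        X = np.column_stack([ista(A, Y[:, i], lam) for i in range(Y.shape[1])])
        # Least squares for the dictionary: A <- Y X^+ (Moore-Penrose pseudoinverse).
        A = Y @ np.linalg.pinv(X)
        A /= np.maximum(np.linalg.norm(A, axis=0), 1e-12)   # renormalize columns
    return A, X
```

As the slides note, nothing in this loop guarantees convergence to the global optimum from an arbitrary `A0`; the initialization described next is what makes the analysis go through.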

  21–22. Alternating minimization goal
     (Â, X̂) = argmin_{A,X} ||X||_1 subject to Y = AX.
     Y = AX is a non-convex constraint: the average of two solutions is not a solution!
     Y = AX and Y = (−A)(−X), but Y ≠ ((A + (−A))/2) ((X + (−X))/2).
     Non-convex optimization, NP-hard in general.

  23–24. Previous theory work
     Exact recovery in the undercomplete setting by Spielman et al. via linear programming.
     We combine alternating minimization with a novel initialization: the global optimum despite non-convexity, in the overcomplete setting.

  25–26. Initialization: Key ideas
     [figure: samples sharing a dictionary element, plotted on the unit circle]
     Find several samples with a common dictionary element.
     The top singular vector of these samples is an estimate of this element.
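The singular-vector step can be sketched as follows (a minimal illustration, assuming the samples are stacked as columns; the noise level in the usage below is an arbitrary choice):

```python
import numpy as np

def estimate_element(samples):
    """Top left-singular vector of a d x m matrix of samples believed to share
    one dictionary element: the shared direction dominates, while each sample's
    remaining s-1 elements act as incoherent noise around it."""
    U, _, _ = np.linalg.svd(samples, full_matrices=False)
    return U[:, 0]
```

The estimate is only defined up to sign, since −u is an equally valid top singular vector.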

  27–32. Correlation graph
     Definition (correlation graph): one node for each example; edge {Y_i, Y_j} if |⟨Y_i, Y_j⟩| ≥ ρ.
     Large correlation ⇒ common dictionary element.
     [figure: a good clique S_1 and a bad clique S_2 in the correlation graph]
     Samples in a clique contain a common dictionary element.
     Cliques are easy to construct.
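Constructing the correlation graph is a direct translation of the definition (a minimal sketch, assuming examples are columns of `Y`; the threshold `rho` is the slide's ρ):

```python
import numpy as np
from itertools import combinations

def correlation_graph(Y, rho):
    """Edges {i, j} whenever |<Y_i, Y_j>| >= rho, for columns Y_i of Y."""
    n = Y.shape[1]
    adj = np.abs(Y.T @ Y) >= rho        # n x n boolean adjacency matrix
    np.fill_diagonal(adj, False)        # drop self-correlations
    return [(i, j) for i, j in combinations(range(n), 2) if adj[i, j]]
```

The quadratic cost of forming `Y.T @ Y` is the dominant expense; for large n one would threshold in blocks, but the logic is the same.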

  33. Initialization algorithm
     1. Construct the correlation graph G_ρ given a threshold ρ.
     2. For each edge (Y_i, Y_j) in G_ρ
