Part 3: Latent representations and unsupervised learning
Dale Schuurmans, University of Alberta


  1. Part 3: Latent representations and unsupervised learning. Dale Schuurmans, University of Alberta.

  2. Supervised versus unsupervised learning. Prominent training principles: discriminative training (predict y from x), typical for supervised learning, versus generative training (reconstruct x from a latent y), typical for unsupervised learning. [Figure: the two graphical models relating x and y.]

  3. Unsupervised representation learning. Consider generative training: learn a latent representation φ from which the data x is reconstructed. [Figure: graphical model with latent φ generating x.]

  4. Unsupervised representation learning. Examples:
     • dimensionality reduction (PCA, exponential family PCA)
     • sparse coding
     • independent component analysis
     • deep learning
     • ...
     Usually involves learning both a latent representation for the data and a data reconstruction model. Context could be: unsupervised, semi-supervised, or supervised.

  5. Challenge: optimal feature discovery appears to be generally intractable. Have to jointly train
     • the latent representation
     • the data reconstruction model
     Usually resort to alternating minimization (sole exception: PCA).

  6. First consider unsupervised feature discovery

  7. Unsupervised feature discovery. Single layer case = matrix factorization:
     $X \approx B\Phi$, where $X$ ($n \times t$) is the original data, $B$ ($n \times m$) is the learned dictionary, and $\Phi$ ($m \times t$) is the new representation; $t$ = # training examples, $n$ = # original features, $m$ = # new features.
     Choose $B$ and $\Phi$ to minimize the data reconstruction loss $L(B\Phi; X) = \sum_{i=1}^{t} L(B\Phi_{:i}; X_{:i})$.
     Seek desired structure in the latent feature representation $\Phi$:
     • $\Phi$ low rank: dimensionality reduction
     • $\Phi$ sparse: sparse coding
     • $\Phi$ rows independent: independent component analysis
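As a concrete illustration of the single-layer factorization, here is a minimal NumPy sketch of the alternating minimization mentioned on slide 5, specialized to the squared reconstruction loss; the function name, ridge term, and iteration count are illustrative choices, not part of the tutorial.

```python
import numpy as np

def alternating_factorization(X, m, iters=50, ridge=1e-6):
    """Minimal sketch of min_{B, Phi} ||B Phi - X||_F^2 by alternating
    least squares. X is n x t, B is n x m, Phi is m x t."""
    n, t = X.shape
    rng = np.random.default_rng(0)
    B = rng.standard_normal((n, m))
    for _ in range(iters):
        # Phi-step: least squares in Phi with B fixed (convex subproblem)
        Phi = np.linalg.solve(B.T @ B + ridge * np.eye(m), B.T @ X)
        # B-step: least squares in B with Phi fixed (convex subproblem)
        B = np.linalg.solve(Phi @ Phi.T + ridge * np.eye(m), Phi @ X.T).T
    return B, Phi

# Usage sketch:
# X = np.random.default_rng(1).standard_normal((20, 100))
# B, Phi = alternating_factorization(X, m=5)
```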

  8. Generalized matrix factorization. Assume the reconstruction loss $L(\hat{x}; x)$ is convex in its first argument.
     Bregman divergence: $L(\hat{x}; x) = D_F(\hat{x} \,\|\, x) = D_{F^*}(f(x) \,\|\, f(\hat{x}))$ ($F$ a strictly convex potential with transfer $f = \nabla F$). Tries to make $\hat{x} \approx x$.
     Matching loss: $L(\hat{x}; x) = D_F(\hat{x} \,\|\, f^{-1}(x)) = D_{F^*}(x \,\|\, f(\hat{x}))$. Tries to make $f(\hat{x}) \approx x$. (A nonlinear predictor, but the loss is still convex in $\hat{x}$.)
     Regular exponential family: $L(\hat{x}; x) = -\log p_B(x \mid \phi) = D_F(\hat{x} \,\|\, f^{-1}(x)) - F^*(x) - \text{const}$, with $\hat{x} = B\phi$.
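To make the loss families concrete, here is a small sketch (an illustration added here, not from the slides) that evaluates a Bregman divergence $D_F(\hat{x} \,\|\, x) = F(\hat{x}) - F(x) - \nabla F(x)^\top(\hat{x} - x)$ for two standard potentials: the squared potential, which recovers the squared loss, and the unnormalized entropy potential, which gives the generalized KL divergence.

```python
import numpy as np

def bregman(F, grad_F, x_hat, x):
    """D_F(x_hat || x) = F(x_hat) - F(x) - <grad F(x), x_hat - x>."""
    return F(x_hat) - F(x) - grad_F(x) @ (x_hat - x)

x_hat, x = np.array([1.0, 2.0]), np.array([0.5, 1.5])

# Squared potential F(x) = 0.5 ||x||^2  ->  D_F is the squared loss.
sq = lambda v: 0.5 * v @ v
assert np.isclose(bregman(sq, lambda v: v, x_hat, x),
                  0.5 * np.sum((x_hat - x) ** 2))

# Unnormalized entropy potential  ->  D_F is the generalized KL divergence.
ent = lambda v: np.sum(v * np.log(v) - v)
kl = bregman(ent, np.log, x_hat, x)
```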

  9. Training problem: $\min_{B \in \mathbb{R}^{n \times m}} \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X)$. How to impose desired structure on $\Phi$?

  10. Training problem: $\min_{B \in \mathbb{R}^{n \times m}} \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X)$. How to impose desired structure on $\Phi$?
      Dimensionality reduction: fix the number of features $m < \min(n, t)$.
      • But this is only known to be tractable if $L(\hat{X}; X) = \|\hat{X} - X\|_F^2$ (PCA)
      • No known efficient algorithm for other standard losses
      Problem: the $\mathrm{rank}(\Phi) = m$ constraint is too hard.
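For the one tractable case noted above (squared loss, i.e. PCA), the rank-$m$ problem is solved globally by a truncated SVD. A brief sketch, with data centering omitted for simplicity:

```python
import numpy as np

def pca_factorization(X, m):
    """Rank-m factorization minimizing ||B Phi - X||_F^2 (the PCA case):
    solved globally by a truncated SVD, with B = U_m and Phi = Sigma_m V_m'."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    B = U[:, :m]                      # learned dictionary, n x m
    Phi = s[:m, None] * Vt[:m, :]     # new representation, m x t
    return B, Phi

# Usage sketch
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))
B, Phi = pca_factorization(X, m=5)
err = np.linalg.norm(B @ Phi - X, "fro") ** 2   # best achievable rank-5 error
```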

  11. Training problem: $\min_{B \in \mathcal{B}_2^m} \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{2,1}$. How to impose desired structure on $\Phi$?
      Relaxed dimensionality reduction (subspace learning): add the rank-reducing regularizer $\|\Phi\|_{2,1} = \sum_{j=1}^{m} \|\Phi_{j:}\|_2$, which favors null rows in $\Phi$.
      But need to add a constraint to $B$: $B_{:j} \in \mathcal{B}_2 = \{ b : \|b\|_2 \le 1 \}$ (otherwise $\Phi$ can be made small just by making $B$ large).

  12. Training problem: $\min_{B \in \mathcal{B}_q^m} \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{1,1}$. How to impose desired structure on $\Phi$?
      Sparse coding: use the sparsity-inducing regularizer $\|\Phi\|_{1,1} = \sum_{j=1}^{m} \sum_{i=1}^{t} |\Phi_{ji}|$, which favors sparse entries in $\Phi$.
      Need to add a constraint to $B$: $B_{:j} \in \mathcal{B}_q = \{ b : \|b\|_q \le 1 \}$ (otherwise $\Phi$ can be made small just by making $B$ large).
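For intuition, here is a hedged sketch of the $\Phi$ subproblem of sparse coding with $B$ fixed (taking $q = 2$, i.e. unit-2-norm dictionary columns) under the squared loss, solved by ISTA (proximal gradient with soft thresholding). The solver choice and step-size rule are illustrative, not prescribed by the tutorial.

```python
import numpy as np

def soft_threshold(Z, tau):
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def sparse_code(X, B, alpha, iters=200):
    """Phi-step of sparse coding: min_Phi 0.5 ||B Phi - X||_F^2 + alpha ||Phi||_{1,1}
    with the dictionary B held fixed, via ISTA (proximal gradient descent)."""
    m = B.shape[1]
    Phi = np.zeros((m, X.shape[1]))
    step = 1.0 / np.linalg.norm(B, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        grad = B.T @ (B @ Phi - X)
        Phi = soft_threshold(Phi - step * grad, step * alpha)
    return Phi
```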

  13. Training problem: $\min_{B \in \mathbb{R}^{n \times m}} \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha D(\Phi)$. How to impose desired structure on $\Phi$?
      Independent component analysis: usually enforces $B\Phi = X$ as a constraint,
      • but interpolation is generally a bad idea;
      • instead, just minimize the reconstruction loss plus a dependence measure $D(\Phi)$ as a regularizer.
      Difficulty: formulating a reasonable convex dependence penalty.

  14. Training problem. Consider subspace learning and sparse coding:
      $\min_{B \in \mathcal{B}^m} \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|$
      The choice of $\|\Phi\|$ and $\mathcal{B}$ determines the type of representation recovered.

  15. Training problem. Consider subspace learning and sparse coding:
      $\min_{B \in \mathcal{B}^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$
      The choice of $\|\Phi\|$ and $\mathcal{B}$ determines the type of representation recovered.
      Problem: still have a rank constraint imposed by the number of new features $m$.
      Idea: just relax $m \to \infty$ and rely on the sparsity-inducing norm $\|\Phi\|$ to select features.

  16. Training problem. Consider subspace learning and sparse coding:
      $\min_{B \in \mathcal{B}^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$
      Still have a problem: the optimization is not jointly convex in $B$ and $\Phi$.

  17. Training problem. Consider subspace learning and sparse coding:
      $\min_{B \in \mathcal{B}^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$
      Still have a problem: the optimization is not jointly convex in $B$ and $\Phi$.
      Idea 1: Alternate!
      • convex in $B$ given $\Phi$
      • convex in $\Phi$ given $B$
      Could also use any other form of local training.

  18. Training problem. Consider subspace learning and sparse coding:
      $\min_{B \in \mathcal{B}^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$
      Still have a problem: the optimization is not jointly convex in $B$ and $\Phi$.
      Idea 2: Boost!
      • Implicitly fix $B$ to a universal dictionary
      • Keep a row-wise sparse $\Phi$
      • Incrementally select a column of $B$ ("weak learning problem")
      • Update the sparse $\Phi$
      Convergence can be proved under broad conditions.

  19. Training problem. Consider subspace learning and sparse coding:
      $\min_{B \in \mathcal{B}^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$
      Still have a problem? The optimization is not jointly convex in $B$ and $\Phi$.
      Idea 3: Solve!
      • Can easily solve for the globally optimal joint $B$ and $\Phi$
      • But this requires a significant reformulation

  20. A useful observation

  21. Equivalent reformulation. Theorem:
      $\min_{B \in \mathcal{B}^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} = \min_{\hat{X} \in \mathbb{R}^{n \times t}} L(\hat{X}; X) + \alpha \,|\!|\!| \hat{X} |\!|\!|$
      where $|\!|\!| \cdot |\!|\!|$ is an induced matrix norm on $\hat{X}$ determined by $\mathcal{B}$ and $\|\cdot\|_{p,1}$.
      Important fact: norms are always convex.
      Computational strategy:
      1. Solve for the optimal response matrix $\hat{X}$ first (a convex minimization).
      2. Then recover the optimal $B$ and $\Phi$ from $\hat{X}$.

  22. Example: subspace learning.
      $\min_{B \in \mathcal{B}_2^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{2,1} = \min_{\hat{X} \in \mathbb{R}^{n \times t}} L(\hat{X}; X) + \alpha \|\hat{X}\|_{tr}$
      Recovery:
      • Let $U \Sigma V' = \mathrm{svd}(\hat{X})$
      • Set $B = U$ and $\Phi = \Sigma V'$
      Preserves optimality:
      • $\|B_{:j}\|_2 = 1$, hence $B \in \mathcal{B}_2^n$
      • $\|\Phi\|_{2,1} = \|\Sigma V'\|_{2,1} = \sum_j \sigma_j \|V_{:j}\|_2 = \sum_j \sigma_j = \|\hat{X}\|_{tr}$
      Thus $L(B\Phi; X) + \alpha \|\Phi\|_{2,1} = L(\hat{X}; X) + \alpha \|\hat{X}\|_{tr}$.
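Under the squared loss, the convex reformulation on this slide has a closed-form solution: singular value soft-thresholding. A minimal sketch, including the $B = U$, $\Phi = \Sigma V'$ recovery step; for other losses one would instead run a generic convex solver (e.g. proximal gradient) on $\hat{X}$.

```python
import numpy as np

def subspace_learning(X, alpha):
    """Solve min_Xhat 0.5 ||Xhat - X||_F^2 + alpha ||Xhat||_tr, which for squared
    loss is exactly singular value soft-thresholding, then recover (B, Phi)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - alpha, 0.0)       # soft-threshold the singular values
    keep = s_shrunk > 0
    B = U[:, keep]                              # columns have unit 2-norm
    Phi = s_shrunk[keep, None] * Vt[keep, :]    # Sigma V'; row norms sum to ||Xhat||_tr
    return B, Phi

# Usage sketch:
# B, Phi = subspace_learning(X, alpha=1.0); X_hat = B @ Phi
```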

  23. Example: sparse coding.
      $\min_{B \in \mathcal{B}_q^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{1,1} = \min_{\hat{X} \in \mathbb{R}^{n \times t}} L(\hat{X}; X) + \alpha \|\hat{X}'\|_{q,1}$
      Recovery:
      • $B = \left[ \frac{1}{\|\hat{X}_{:1}\|_q} \hat{X}_{:1}, \ldots, \frac{1}{\|\hat{X}_{:t}\|_q} \hat{X}_{:t} \right]$ (rescaled columns)
      • $\Phi = \mathrm{diag}\!\left( \|\hat{X}_{:1}\|_q, \ldots, \|\hat{X}_{:t}\|_q \right)$ (diagonal matrix)
      Preserves optimality:
      • $\|B_{:j}\|_q = 1$, hence $B \in \mathcal{B}_q^t$
      • $\|\Phi\|_{1,1} = \sum_j \|\hat{X}_{:j}\|_q = \|\hat{X}'\|_{q,1}$
      Thus $L(B\Phi; X) + \alpha \|\Phi\|_{1,1} = L(\hat{X}; X) + \alpha \|\hat{X}'\|_{q,1}$.
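A small sketch of this recovery map: rescale each column of the optimal $\hat{X}$ to unit $q$-norm to form $B$ and place the norms on the diagonal of $\Phi$, so that $B\Phi = \hat{X}$ and $\|\Phi\|_{1,1} = \|\hat{X}'\|_{q,1}$ (assuming no all-zero columns).

```python
import numpy as np

def recover_sparse_coding(X_hat, q=2):
    """Recover (B, Phi) from an optimal response matrix X_hat:
    B gets the columns of X_hat rescaled to unit q-norm, Phi is diagonal
    with those norms, so B @ Phi == X_hat and ||Phi||_{1,1} == ||X_hat'||_{q,1}."""
    norms = np.linalg.norm(X_hat, ord=q, axis=0)
    B = X_hat / norms                 # each column of B lies on the q-norm unit sphere
    Phi = np.diag(norms)              # one latent feature per retained example
    return B, Phi

# Sanity check
rng = np.random.default_rng(0)
X_hat = rng.standard_normal((5, 8))
B, Phi = recover_sparse_coding(X_hat)
assert np.allclose(B @ Phi, X_hat)
assert np.isclose(np.abs(Phi).sum(), np.linalg.norm(X_hat, axis=0).sum())
```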

  24. Example: sparse coding. Outcome: sparse coding with $\|\cdot\|_{1,1}$ regularization = vector quantization. It
      • drops some examples, and
      • memorizes the remaining examples.
      The optimal solution is not overcomplete. These observations could not have been made using local solvers.

  25. Simple extensions:
      • missing observations in $X$
      • robustness to outliers in $X$
      $\min_{S \in \mathbb{R}^{n \times t}} \min_{\hat{X} \in \mathbb{R}^{n \times t}} L\big( (\hat{X} + S)_\Omega; X_\Omega \big) + \alpha \,|\!|\!| \hat{X} |\!|\!| + \beta \|S\|_{1,1}$
      where $\Omega$ = observed indices in $X$ and $S$ = speckled outlier noise (jointly convex in $\hat{X}$ and $S$).
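For concreteness, a sketch of evaluating this extended objective under the squared loss, with the trace norm standing in for $|\!|\!|\hat{X}|\!|\!|$; the mask encodes $\Omega$, and the function is illustrative only.

```python
import numpy as np

def robust_masked_objective(X_hat, S, X, mask, alpha, beta):
    """Objective L((X_hat + S)_Omega; X_Omega) + alpha ||X_hat||_tr + beta ||S||_{1,1}
    for squared loss; `mask` is a boolean array marking the observed entries Omega."""
    residual = np.where(mask, X_hat + S - X, 0.0)     # loss only on observed entries
    loss = 0.5 * np.sum(residual ** 2)
    trace_norm = np.linalg.svd(X_hat, compute_uv=False).sum()
    return loss + alpha * trace_norm + beta * np.abs(S).sum()
```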

  26. Explaining the useful result. Theorem:
      $\min_{B \in \mathcal{B}^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} = \min_{\hat{X} \in \mathbb{R}^{n \times t}} L(\hat{X}; X) + \alpha \,|\!|\!| \hat{X} |\!|\!|$
      where $|\!|\!| \hat{X} |\!|\!| = \|\hat{X}'\|^*_{(\mathcal{B}, p^*)}$ for an induced matrix norm $\|\cdot\|_{(\mathcal{B}, p^*)}$.

  27. Explaining the useful result (continued). In the theorem, $|\!|\!| \hat{X} |\!|\!| = \|\hat{X}'\|^*_{(\mathcal{B}, p^*)}$ for an induced matrix norm $\|\cdot\|_{(\mathcal{B}, p^*)}$.
      A dual norm: $\|\hat{X}'\|^*_{(\mathcal{B}, p^*)} = \max_{\|\Lambda'\|_{(\mathcal{B}, p^*)} \le 1} \mathrm{tr}(\Lambda' \hat{X})$ (the standard definition of a dual norm).

  28. Explaining the useful result (continued).
      A dual norm: $\|\hat{X}'\|^*_{(\mathcal{B}, p^*)} = \max_{\|\Lambda'\|_{(\mathcal{B}, p^*)} \le 1} \mathrm{tr}(\Lambda' \hat{X})$ (the standard definition of a dual norm)
      of a vector-norm induced matrix norm: $\|\Lambda'\|_{(\mathcal{B}, p^*)} = \max_{b \in \mathcal{B}} \|\Lambda' b\|_{p^*}$ (easy to prove this yields a norm on matrices).
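A quick numeric illustration (not from the slides) of the induced-norm definition for the special case $\mathcal{B} = \{ b : \|b\|_2 \le 1 \}$ and $p^* = 2$, where $\max_{b \in \mathcal{B}} \|\Lambda' b\|_2$ is just the spectral norm of $\Lambda'$; the Monte Carlo estimate below approaches the exact value from below.

```python
import numpy as np

rng = np.random.default_rng(0)
Lam = rng.standard_normal((6, 4))        # Lambda, so Lam.T plays the role of Lambda'

def induced_norm_estimate(LamT, samples=200000):
    """Estimate max_{b in B_2} ||Lambda' b||_2 by sampling unit vectors b."""
    b = rng.standard_normal((LamT.shape[1], samples))
    b /= np.linalg.norm(b, axis=0)       # random points on the unit sphere (in B_2)
    return np.linalg.norm(LamT @ b, axis=0).max()

est = induced_norm_estimate(Lam.T)
exact = np.linalg.norm(Lam.T, 2)         # spectral norm of Lambda'
# `est` approaches `exact` from below as more directions b are sampled
```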

  29. Proof outline:
      $\min_{B \in \mathcal{B}^\infty} \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1}$
      $= \min_{\hat{X} \in \mathbb{R}^{n \times t}} \min_{B \in \mathcal{B}^\infty} \min_{\Phi : B\Phi = \hat{X}} L(\hat{X}; X) + \alpha \|\Phi\|_{p,1}$
      $= \min_{\hat{X} \in \mathbb{R}^{n \times t}} L(\hat{X}; X) + \alpha \min_{B \in \mathcal{B}^\infty} \min_{\Phi : B\Phi = \hat{X}} \|\Phi\|_{p,1}$
