A Stochastic PCA Algorithm with an Exponential Convergence Rate


  1. A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence 1/19

  2. Principal Component Analysis

     PCA

     Input: $x_1, \ldots, x_n \in \mathbb{R}^d$

     Goal: Find $k$ directions with most variance

     $$\max_{U \in \mathbb{R}^{d \times k} :\, U^\top U = I} \; \frac{1}{n} \sum_{i=1}^{n} \left\| U^\top x_i \right\|^2$$

     For $k = 1$: Find leading eigenvector of covariance matrix

     $$\max_{w \in \mathbb{R}^d :\, \|w\| = 1} \; w^\top \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) w$$

  4. Existing Approaches

     $$\max_{w \in \mathbb{R}^d :\, \|w\| = 1} \; w^\top \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) w$$

     Regime: $n, d$ “large”, non-sparse matrix

     Approach 1: Eigendecomposition

     Compute leading eigenvector of $\frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$ exactly (e.g. via QR decomposition)

     Runtime: $O(d^3)$
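
The exact approach above can be sketched in a few lines. This is only an illustration: it uses NumPy's symmetric eigensolver `numpy.linalg.eigh` rather than an explicit QR iteration, and the helper names `leading_eigenvector` and `objective` and the synthetic data are assumptions made for this example, not part of the slides.

```python
import numpy as np

def leading_eigenvector(X):
    """Unit-norm leading eigenvector of A = (1/n) * sum_i x_i x_i^T.

    X is an (n, d) array whose rows are the data points x_i.
    """
    n, d = X.shape
    A = X.T @ X / n                        # d x d covariance-like matrix
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    return eigvecs[:, -1]                  # column for the largest eigenvalue

def objective(X, w):
    """Value of w^T A w, computed as (1/n) * sum_i <w, x_i>^2."""
    return np.mean((X @ w) ** 2)

# Synthetic data (illustrative): one direction with larger variance.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) * np.array([3.0, 1.0, 1.0, 1.0, 1.0])
v1 = leading_eigenvector(X)
```

Any other unit vector should give an objective value no larger than `objective(X, v1)`, which is what makes `v1` the solution of the $k = 1$ problem.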

  6. Existing Approaches

     Approach 2: Power Iterations

     Initialize $w_1$ randomly on unit sphere
     For $t = 1, 2, \ldots$
         $w'_{t+1} := \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) w_t = \frac{1}{n} \sum_{i=1}^{n} \langle w_t, x_i \rangle x_i$
         $w_{t+1} := w'_{t+1} / \| w'_{t+1} \|$

     $O\left( \frac{1}{\lambda} \log\left( \frac{1}{\epsilon} \right) \right)$ iterations for $\epsilon$-optimality ($\lambda$: eigengap)

     $O(nd)$ runtime per iteration

     Overall runtime $O\left( \frac{nd}{\lambda} \log\left( \frac{d}{\epsilon} \right) \right)$

     Approach 2.5: Lanczos Iterations

     More complex algorithm, but roughly similar iteration runtime and only $O\left( \frac{1}{\sqrt{\lambda}} \log\left( \frac{1}{\epsilon} \right) \right)$ iterations [Kuczyński and Woźniakowski 1989]

     Overall runtime $O\left( \frac{nd}{\sqrt{\lambda}} \log\left( \frac{d}{\epsilon} \right) \right)$
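
The power iterations above can be sketched as follows. The function name `power_iterations`, the iteration count, and the synthetic data are illustrative assumptions. The product is computed as `X.T @ (X @ w)`, so each iteration costs $O(nd)$ without forming the $d \times d$ matrix.

```python
import numpy as np

def power_iterations(X, num_iters, rng):
    """Power iterations on A = (1/n) * sum_i x_i x_i^T, with rows of X as x_i."""
    n, d = X.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                 # w_1: random point on the unit sphere
    for _ in range(num_iters):
        w_next = X.T @ (X @ w) / n         # (1/n) * sum_i <w_t, x_i> x_i
        w = w_next / np.linalg.norm(w_next)
    return w

# Synthetic data (illustrative) with a clear eigengap.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) * np.array([3.0] + [1.0] * 9)
w = power_iterations(X, num_iters=200, rng=rng)
```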

  8. Existing Approaches

     Approach 3: Stochastic/Incremental Algorithms

     Example (Oja’s algorithm)

     Initialize $w_1$ randomly on unit sphere
     For $t = 1, 2, \ldots$
         Pick $i_t \in \{1, \ldots, n\}$ (randomly or otherwise)
         $w'_{t+1} := w_t + \eta_t x_{i_t} x_{i_t}^\top w_t$
         $w_{t+1} := w'_{t+1} / \| w'_{t+1} \|$

     Also Krasulina 1969; Arora, Cotter, Livescu, Srebro 2012; Mitliagkas, Caramanis, Jain 2013; De Sa, Olukotun, Ré 2014...

     $O(d)$ runtime per iteration

     Iteration bounds:
     Balsubramani, Dasgupta, Freund 2013: $\tilde{O}\left( \frac{d}{\lambda^2} \left( \frac{1}{\epsilon} + d \right) \right)$
     De Sa, Olukotun, Ré 2014: For a different SGD method, $\tilde{O}\left( \frac{d}{\lambda^2 \epsilon} \right)$

     Runtime: $\tilde{O}\left( \frac{d^2}{\lambda^2 \epsilon} \right)$
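
A minimal sketch of Oja's algorithm as stated above, with $i_t$ drawn uniformly at random. The slides leave the step sizes $\eta_t$ unspecified; the schedule $\eta_t = \eta_0 / t$ used here, the function name `oja`, and the synthetic data are assumptions made only for this illustration.

```python
import numpy as np

def oja(X, num_iters, eta0, rng):
    """Oja's algorithm on the rows x_i of X, using O(d) work per iteration."""
    n, d = X.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                 # w_1: random point on the unit sphere
    for t in range(1, num_iters + 1):
        x = X[rng.integers(n)]             # pick i_t uniformly at random
        eta = eta0 / t                     # assumed step-size schedule
        w = w + eta * (x @ w) * x          # w_t + eta_t * x x^T w_t
        w /= np.linalg.norm(w)             # renormalize to the unit sphere
    return w

# Synthetic data (illustrative) with a clear eigengap.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) * np.array([3.0] + [1.0] * 9)
w = oja(X, num_iters=20000, eta0=1.0, rng=rng)
```

Because the step size decays, the iterate keeps improving only slowly, which is the polynomial dependence on $\epsilon$ reflected in the iteration bounds above.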

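  9. Existing Approaches

     Up to constants/log-factors:

     | Algorithm     | Time per iter. | # iter.                        | Runtime                          |
     |---------------|----------------|--------------------------------|----------------------------------|
     | Exact         |                |                                | $d^3$                            |
     | Power/Lanczos | $nd$           | $\frac{1}{\lambda^p}$          | $\frac{nd}{\lambda^p}$           |
     | Incremental   | $d$            | $\frac{d}{\lambda^2 \epsilon}$ | $\frac{d^2}{\lambda^2 \epsilon}$ |

     Here $p = 1$ for power iterations and $p = 1/2$ for Lanczos, matching the iteration bounds given for Approaches 2 and 2.5.

     Main Question

     Can we get the best of both worlds? $O(d)$ time per iteration and fast convergence (logarithmic dependence on $\epsilon$?)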

  10. Convex Optimization to the Rescue?

      Our problem is equivalent to:

      $$\min_{w :\, \|w\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( - \langle w, x_i \rangle^2 \right)$$

      Much recent progress in solving strongly convex + smooth problems with finite-sum structure

      $$\min_{w \in \mathcal{W}} \; \frac{1}{n} \sum_{i=1}^{n} f_i(w)$$

      Stochastic algorithms with $O(d)$ runtime per iteration and exponential convergence [Le Roux, Schmidt, Bach 2012; Shalev-Shwartz and Zhang 2012; Johnson and Zhang 2013; Zhang, Mahdavi, Jin 2013; Konečný and Richtárik 2013; Xiao and Zhang 2014; Zhang and Xiao, 2014...]

  11. Convex Optimization to the Rescue?

      $$\min_{w :\, \|w\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( - \langle w, x_i \rangle^2 \right)$$

      Unfortunately:
      Function not strongly convex, or even convex (in fact, concave everywhere)
      Has > 1 global optima, plateaus...
      ⇒ Existing results don’t work as-is

      But: Maybe we can borrow some ideas...

  12. Algorithm

      $$\min_{w :\, \|w\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( - \langle w, x_i \rangle^2 \right)$$

      Oja Iteration

      Choose $i_t \in \{1, \ldots, n\}$ at random
      $w'_{t+1} = w_t + \eta_t \langle w_t, x_{i_t} \rangle x_{i_t}$
      $w_{t+1} := w'_{t+1} / \| w'_{t+1} \|$

      Essentially projected stochastic gradient descent

  14. Algorithm

      Letting $A = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$, update step is

      $$w'_{t+1} = w_t + \eta_t x_{i_t} x_{i_t}^\top w_t = \underbrace{w_t + \eta_t A w_t}_{\text{power/gradient step}} + \underbrace{\eta_t \left( x_{i_t} x_{i_t}^\top - A \right) w_t}_{\text{zero-mean noise}}$$

      Main idea: Replace by

      $$w'_{t+1} = \underbrace{w_t + \eta A w_t}_{\text{power/gradient step}} + \underbrace{\eta \left( x_{i_t} x_{i_t}^\top - A \right) (w_t - \tilde{u})}_{\text{zero-mean noise}}$$

      where $\tilde{u}$ “close” to $w_t$ (similar to SVRG of Johnson and Zhang (2013))
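
A small numerical check of the two noise terms above, as a sketch under assumed synthetic data. It averages each noise term over all $i$ to confirm it is zero-mean, and compares their typical magnitudes when $\tilde{u}$ is close to $w_t$. The helper `noise_terms` and the perturbation size 0.01 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 8
X = rng.standard_normal((n, d))            # rows are the data points x_i
A = X.T @ X / n

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
u = w + 0.01 * rng.standard_normal(d)      # reference point "close" to w

def noise_terms(X, A, z):
    """Row i is (x_i x_i^T - A) z."""
    return (X @ z)[:, None] * X - A @ z

oja_noise = noise_terms(X, A, w)           # noise in the plain Oja update
vr_noise = noise_terms(X, A, w - u)        # noise in the modified update

# Averages over i: both should vanish up to floating-point error.
mean_oja = np.linalg.norm(oja_noise.mean(axis=0))
mean_vr = np.linalg.norm(vr_noise.mean(axis=0))

# Root-mean-square norms: the modified noise scales with ||w - u||.
rms_oja = np.sqrt((oja_noise ** 2).sum(axis=1).mean())
rms_vr = np.sqrt((vr_noise ** 2).sum(axis=1).mean())
```

Both terms average to zero, but the second shrinks as $w_t$ approaches $\tilde{u}$, which is what allows a constant step size $\eta$ in place of a decaying $\eta_t$.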

  16. Algorithm

      VR-PCA

      Parameters: Step size $\eta$, epoch length $m$
      Input: Data set $\{x_i\}_{i=1}^{n}$, Initial unit vector $\tilde{w}_0$
      For $s = 1, 2, \ldots$
          $\tilde{u} = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \tilde{w}_{s-1}$
          $w_0 = \tilde{w}_{s-1}$
          For $t = 1, 2, \ldots, m$
              Pick $i_t \in \{1, \ldots, n\}$ uniformly at random
              $w'_t = w_{t-1} + \eta \left( x_{i_t} x_{i_t}^\top (w_{t-1} - \tilde{w}_{s-1}) + \tilde{u} \right)$
              $w_t = \frac{1}{\| w'_t \|} w'_t$
          $\tilde{w}_s = w_m$

      To get $k > 1$ directions: Either repeat, or perform orthogonal-like iterations:
      Replace all vectors by $k \times d$ matrices
      Replace normalization step by orthogonalization step
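
The VR-PCA pseudocode above for $k = 1$ can be sketched as follows. The parameter choices in the usage lines (epoch length $m = n$ and step size $\eta = 1 / (\bar{r} \sqrt{n})$ with $\bar{r}$ the average squared norm of the data) are assumptions picked for this illustration, as are the function name `vr_pca` and the synthetic data. The slides only require $\eta$ and $m$ to satisfy the conditions of the theorem.

```python
import numpy as np

def vr_pca(X, eta, m, num_epochs, w0, rng):
    """VR-PCA for the leading direction; rows of X are the data points x_i."""
    n, d = X.shape
    w_tilde = w0 / np.linalg.norm(w0)              # initial unit vector
    for _ in range(num_epochs):
        u_tilde = X.T @ (X @ w_tilde) / n          # full pass over the data, O(nd)
        w = w_tilde.copy()
        for _ in range(m):
            x = X[rng.integers(n)]                 # pick i_t uniformly at random
            # x x^T (w_{t-1} - w_tilde) + u_tilde, computed in O(d)
            w_prime = w + eta * (x * (x @ (w - w_tilde)) + u_tilde)
            w = w_prime / np.linalg.norm(w_prime)
        w_tilde = w                                # snapshot for the next epoch
    return w_tilde

# Synthetic data (illustrative) with a clear eigengap.
rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.standard_normal((n, d)) * np.array([3.0] + [1.0] * 9)
r_bar = np.mean(np.sum(X ** 2, axis=1))            # average squared norm
eta = 1.0 / (r_bar * np.sqrt(n))                   # assumed step size
w0 = rng.standard_normal(d)
w_hat = vr_pca(X, eta=eta, m=n, num_epochs=15, w0=w0, rng=rng)
```

Each epoch costs one full pass to compute `u_tilde` plus `m` cheap $O(d)$ updates, so with `m = n` an epoch is comparable to two passes over the data.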

  17. Analysis

      Theorem

      Suppose $\max_i \|x_i\|^2 \le r$, and $A$ has leading eigenvector $v_1$. Assuming $\langle \tilde{w}_0, v_1 \rangle \ge \frac{1}{\sqrt{2}}$, then for any $\delta, \epsilon \in (0, 1)$, if

      $$\eta \le \frac{c_1 \delta^2}{r^2} \lambda, \qquad m \ge \frac{c_2 \log(2/\delta)}{\eta \lambda}, \qquad m \eta^2 r^2 + r \sqrt{m \eta^2 \log(2/\delta)} \le c_3$$

      (where $c_1, c_2, c_3$ are constants) and we run $T = \left\lceil \frac{\log(1/\epsilon)}{\log(2/\delta)} \right\rceil$ epochs, then

      $$\Pr\left( \langle \tilde{w}_T, v_1 \rangle^2 \ge 1 - \epsilon \right) \ge 1 - 2 \log(1/\epsilon) \, \delta$$

      Corollary

      Picking $\eta, m$ appropriately, $\epsilon$-convergence w.h.p. in $O\left( d \left( n + \frac{1}{\lambda^2} \right) \log\left( \frac{1}{\epsilon} \right) \right)$ runtime

      Exponential convergence with $O(d)$-time iterations
      Proportional to # examples plus eigengap
      Proportional to single data pass if $\lambda \ge 1/\sqrt{n}$
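
One way to read the corollary's runtime off the theorem is sketched below. It assumes $r$ and $\delta$ are treated as constants, and that $\eta$ and $m$ are taken at the boundary of the theorem's first two conditions. This is a rough accounting, not the proof.

```latex
% Take \eta = \Theta(\lambda) and m = \Theta(1/(\eta\lambda)) = \Theta(1/\lambda^2).
% Then m\eta^2 = \Theta(1), so the third condition holds for small enough constants.
% Each epoch costs one full pass, O(nd), to compute \tilde{u},
% plus m stochastic iterations at O(d) each.
\[
\underbrace{T}_{O(\log(1/\epsilon))}
\cdot
\Big(
  \underbrace{O(nd)}_{\text{compute } \tilde{u}}
  +
  \underbrace{m \cdot O(d)}_{m = O(1/\lambda^2)}
\Big)
= O\!\left( d \left( n + \frac{1}{\lambda^2} \right) \log\frac{1}{\epsilon} \right)
\]
```

When $\lambda \ge 1/\sqrt{n}$ the $1/\lambda^2$ term is at most $n$, so the cost per epoch is dominated by the single data pass.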

  18. Proof Idea

      Track decay of $F(w_t) = 1 - \langle w_t, v_1 \rangle^2$

      Key Lemma

      Assuming $\eta = \alpha \lambda$ and $F(w_t) \le 3/4$,

      $$\mathbb{E}\left[ F(w_{t+1}) \mid w_t \right] \le \left( 1 - \Theta(\alpha \lambda^2) \right) F(w_t) + O\left( \alpha^2 \lambda^2 F(\tilde{w}_{s-1}) \right).$$

  22. Proof Idea

      Assume $\eta = \alpha \lambda$ ($\alpha \ll 1$)

      Using martingale arguments: W.h.p., never reach “flat” region in $m \le O\left( \frac{1}{\alpha^2 \lambda^2} \right)$ iterations

      ⇒ For all $t \le m$

      $$\mathbb{E}\left[ F(w_{t+1}) \mid w_t \right] \le \left( 1 - \Theta(\alpha \lambda^2) \right) F(w_t) + O\left( \alpha^2 \lambda^2 F(\tilde{w}_{s-1}) \right).$$
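
To see why this recursion gives a constant-factor decrease per epoch, it can be unrolled heuristically. In the sketch below, $c$ and $C$ are hypothetical constants standing in for the $\Theta(\cdot)$ and $O(\cdot)$ terms, $a_t$ denotes $\mathbb{E}[F(w_t)]$, and the conditioning on the high-probability event is ignored.

```latex
% Write \rho = 1 - c\,\alpha\lambda^2 and note a_0 = F(\tilde{w}_{s-1}).
\begin{align*}
a_m
&\le \rho^m a_0 + C \alpha^2 \lambda^2 F(\tilde{w}_{s-1}) \sum_{j=0}^{m-1} \rho^j \\
&\le \rho^m F(\tilde{w}_{s-1})
   + \frac{C \alpha^2 \lambda^2}{c\,\alpha \lambda^2} F(\tilde{w}_{s-1}) \\
&= \left( \rho^m + \frac{C}{c}\,\alpha \right) F(\tilde{w}_{s-1})
\end{align*}
```

Since $\rho^m \le e^{-c \alpha \lambda^2 m}$, taking $\alpha$ small and $m$ on the order of $1/(\alpha \lambda^2)$ makes the bracket a constant below 1. That choice is compatible with $m \le O(1/(\alpha^2 \lambda^2))$ because $\alpha \ll 1$, so $F$ shrinks geometrically across epochs.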
