  1. Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems
     Speaker: Yang You, PhD student at UC Berkeley, advised by James Demmel
     With James Demmel (Professor, UC Berkeley), Cho-Jui Hsieh (Assistant Professor, UCLA), and Richard Vuduc (Associate Professor, Georgia Tech)

  2. Outline
     Introduction
     Existing Approaches
     Our Approach
     Analysis and Results

  3. Kernel Ridge Regression (KRR)
     Given n samples (x_1, y_1), ..., (x_n, y_n), find the empirical minimizer [4]
         α̂ = argmin_α (1/n) Σ_{i=1}^{n} ||f_i − y_i||_2^2 + λ ||f||_H^2,
     where f_i = Σ_{j=1}^{n} α_j Φ(x_j, x_i) = Σ_{j=1}^{n} α_j exp(−||x_i − x_j||^2 / (2σ^2)).
     This problem has a closed-form solution [5]: (K + λnI) α = y
     Here f ∈ R^n, x_i ∈ R^d, y_i ∈ R, α ∈ R^n, λ ∈ R, and Φ : R^d × R^d → R.
     [4] H is a Reproducing Kernel Hilbert Space
     [5] K is an n-by-n matrix with K[i][j] = Φ(x_j, x_i); I is the identity matrix
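A minimal single-node sketch of this direct method, assuming the Gaussian kernel defined above; NumPy/SciPy and the helper names are illustrative stand-ins, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import solve

def krr_fit(X, y, sigma=1.0, lam=1e-3):
    """Solve (K + lambda*n*I) alpha = y for a Gaussian kernel."""
    n = X.shape[0]
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))   # n-by-n kernel
    alpha = solve(K + lam * n * np.eye(n), y, assume_a='pos')   # SPD system
    return alpha

def krr_predict(X_train, X_test, alpha, sigma=1.0):
    """Predict f(x) = sum_j alpha_j * Phi(x_j, x) for each test point."""
    K_test = np.exp(-cdist(X_test, X_train, 'sqeuclidean') / (2 * sigma**2))
    return K_test @ alpha
```

These two helpers are reused in the DC-KRR and KKRR sketches on later slides.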

  4. KRR by Direct Method
     MSE: the correctness metric, lower is better; it measures the (squared) difference between the predicted labels and the true labels.

  5. Bottleneck: solving a large linear system (K + λnI) α = y
     K is an n-by-n dense kernel matrix.
     The machine-learning input dataset is an n-by-d matrix:
       n: number of samples (e.g., number of users on Facebook: ~2.2 billion)
       d: number of features (e.g., number of movies a user rated: ~1,000)
     Since n >> d, even a small input dataset generates a huge kernel matrix:
       a 357 MB dataset (a 520,000 × 90 matrix) becomes a 2 TB kernel matrix.
     Solving the linear system directly costs Θ(n^3), which is very expensive in practice.
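A quick back-of-the-envelope check of those sizes, assuming double-precision (8-byte) entries:

```python
n, d = 520_000, 90
print(n * d * 8 / 2**20)   # input matrix: ~357 MiB
print(n * n * 8 / 2**40)   # dense n-by-n kernel: ~1.97 TiB (~2.16e12 bytes)
```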

  6. Weak Scaling Issue
     1 million users → 2 million users → 4 million users → 8 million users
     Weak scaling is the primary interest for machine learning at scale: keep each machine fully loaded (more users, buy more servers).
     Keep d and n/p fixed as p grows (p is the number of nodes).
     KRR: per-node memory grows as Θ(p) and per-node flops grow as Θ(p^2) (a quick count is sketched below).
     Perfect scaling: memory and flops stay constant per node.
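A small sketch of that count, assuming the dense n-by-n kernel matrix and its Θ(n^3) solve are split evenly over p nodes; the numbers and variable names are illustrative.

```python
m = 1_000_000                          # samples per node (n/p), held fixed
base_mem, base_flops = m**2, m**3      # per-node cost at p = 1
for p in (1, 2, 4, 8):
    n = m * p
    mem_per_node = n * n // p          # = m^2 * p   -> grows like Theta(p)
    flops_per_node = n**3 // p         # = m^3 * p^2 -> grows like Theta(p^2)
    print(p, mem_per_node // base_mem, flops_per_node // base_flops)
# prints 1 1 1, 2 2 4, 4 4 16, 8 8 64: per-node cost keeps growing with p
```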

  7. Outline
     Introduction
     Existing Approaches
     Our Approach
     Analysis and Results

  8. Bottleneck: solving a large linear system (K + λnI) α = y
     Low-rank matrix approximation:
       Kernel PCA (Schölkopf et al., 1998)
       Incomplete Cholesky Decomposition (Fine and Scheinberg, 2002)
       Nyström Sampling (Williams and Seeger, 2001) — see the sketch after this list
     Iterative optimization algorithms:
       Gradient Descent (Raskutti et al., 2011)
       Conjugate Gradient methods (Blanchard and Krämer, 2010)
     None of these methods achieves the same level of accuracy as the direct method [6]; we reserve them for future study.
     [6] Y. Zhang, J. Duchi, M. Wainwright, Divide and Conquer Kernel Ridge Regression, COLT'13
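For concreteness, a hedged sketch of one of the listed alternatives, Nyström sampling: approximate K from a random subset of m columns as K ≈ C W⁺ Cᵀ. This only illustrates the listed idea and is not evaluated in the talk; the function name and SciPy helpers are my choices.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import pinvh

def nystrom_approx(X, m, sigma=1.0, seed=0):
    """Rank-m Nystrom approximation: K ~ C @ pinvh(W) @ C.T (Gaussian kernel)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)              # m landmark points
    C = np.exp(-cdist(X, X[idx], 'sqeuclidean') / (2 * sigma**2))    # n-by-m block
    W = C[idx, :]                                                    # m-by-m block
    return C, pinvh(W)
```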

  9. DKRR: Straightforward Implementation with ScaLAPACK
     K + λnI is symmetric positive definite, so it is solved with a Cholesky decomposition.
     Weak-scaling efficiency drops to 0.32% when we increase to 64 nodes.
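A single-node stand-in for that solver: the slides use ScaLAPACK's distributed Cholesky, while SciPy's cho_factor/cho_solve below is my shared-memory analogue, shown only to make the step explicit.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def spd_solve(K, y, lam):
    """Solve (K + lambda*n*I) alpha = y via Cholesky, exploiting SPD structure."""
    n = K.shape[0]
    c, low = cho_factor(K + lam * n * np.eye(n))   # O(n^3 / 3) factorization
    return cho_solve((c, low), y)                  # two triangular solves
```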

  10. Divide-and-Conquer KRR (DC-KRR) [7]
      Communication overhead is low, good scaling!
      [7] Y. Zhang, J. Duchi, M. Wainwright, Divide and Conquer Kernel Ridge Regression, COLT'13

  11. DC-KRR key idea: block-diagonal matrix approximation
      Figure from the DC-KRR authors (Y. Zhang, J. Duchi, M. Wainwright)
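A hedged sketch of the divide-and-conquer idea, following Zhang, Duchi, and Wainwright's description: randomly partition the data, solve a small KRR per partition, and average the local models' predictions. It reuses krr_fit/krr_predict from the slide-3 sketch; the helper names are mine.

```python
import numpy as np

def dckrr_fit(X, y, p, sigma, lam, seed=0):
    """Split the data into p random parts and fit one local KRR model per part."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), p)
    return [(X[idx], krr_fit(X[idx], y[idx], sigma, lam)) for idx in parts]

def dckrr_predict(models, X_test, sigma):
    """Average the p local models' predictions for every test point."""
    preds = [krr_predict(Xi, X_test, alpha_i, sigma) for Xi, alpha_i in models]
    return np.mean(preds, axis=0)
```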

  12. DC-KRR beats previous methods on tens of nodes
      Figure from the DC-KRR authors (Y. Zhang, J. Duchi, M. Wainwright), based on a music-recommendation dataset.

  13. Weak scaling in accuracy (MSE)
      Table 1: MSE, lower is better; 2k samples per node.
      Methods           8k samples   32k samples   128k samples
      DKRR (baseline)   90.9         85.0          0.002
      DC-KRR            88.9         85.5          81.0
      When we scale DC-KRR to many nodes, it no longer matches the accuracy of the baseline.

  14. Outline
      Introduction
      Existing Approaches
      Our Approach
      Analysis and Results

  15. Why doesn't DC-KRR work at scale?
      It is not safe to ignore the off-diagonal parts: there are many nonzero entries in the off-diagonal blocks.
      (A 5k-by-5k Gaussian kernel matrix from the UCI Covertype dataset, visualized with Matlab's spy.)

  16. How to diagonalize the kernel matrix?
      Use the k-means clustering algorithm: cluster the samples by Euclidean distance.
        If x_i and x_j are in the same cluster, ||x_i − x_j|| is small.
        If x_i and x_j are in different clusters, ||x_i − x_j|| is large.
      Since K[i][j] = Φ(x_j, x_i) = exp(−||x_i − x_j||^2 / (2σ^2)), ||x_i − x_j|| → ∞ means K[i][j] → 0.
      [Figures 1.1 (Original Kernel) and 1.2 (After K-means); nonzero threshold: entries larger than 10^−6]
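A hedged sketch of the reordering those figures illustrate: cluster with k-means, then permute rows and columns so that samples from the same cluster are adjacent, which makes the Gaussian kernel matrix close to block-diagonal. scikit-learn's KMeans is my stand-in for the clustering step.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def cluster_and_permute_kernel(X, k, sigma):
    """Return the kernel matrix permuted so same-cluster samples are adjacent."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    order = np.argsort(labels, kind='stable')                  # group samples by cluster
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))
    return K[np.ix_(order, order)], labels[order]              # near block-diagonal
```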

  17. K-means KRR (KKRR)
      We expect KKRR to achieve a low MSE!

  18. KKRR performs poorly
      Our system tries different hyperparameters iteratively until it reaches the lowest MSE.
      Music-recommendation dataset, on 96 CPU processors.

  19. Why does KKRR perform poorly?
      The clusters are very different from each other, so they generate very different models: averaging those models is a bad idea.

  20. KKRR2
      We expect KKRR2 to achieve a low MSE!
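This slide does not spell KKRR2 out, so the following is my hedged reading based on the previous slide's point that averaging dissimilar cluster models is a bad idea: route each test point only to the model of its nearest k-means center instead of averaging. It reuses krr_predict from the slide-3 sketch; treat the details as assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kkrr2_predict(models, centers, X_test, sigma):
    """Predict each test point with the model of its nearest cluster center only."""
    nearest = cdist(X_test, centers).argmin(axis=1)        # route, do not average
    y_pred = np.empty(len(X_test))
    for c, (X_c, alpha_c) in enumerate(models):
        mask = nearest == c
        if mask.any():
            y_pred[mask] = krr_predict(X_c, X_test[mask], alpha_c, sigma)
    return y_pred
```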

  21. KKRR2 performs much better than KKRR
      Our system tries different hyperparameters iteratively until it reaches the lowest MSE.
      Music-recommendation dataset, on 96 CPU processors.

  22. How good can this be in the best case?
      Suppose we can select the best model (trying each one, one by one).

  23. KKRR3: an error lower bound for the block-diagonal method
      We believe KKRR3 will achieve the lowest MSE!

  24. The block-diagonal approach works well given an optimal selection algorithm
      Our system tries different hyperparameters iteratively until it reaches the lowest MSE.
      Music-recommendation dataset, on 96 CPU processors.

  25. However, the KKRR family is slow
      Our system tries different hyperparameters iteratively until it reaches the lowest MSE.
      Music-recommendation dataset, on 96 CPU processors.

  26. K-means clustering: imbalanced partitioning
      The sizes of the different blocks are different.

  27. K-means clustering: imbalanced partitioning
      [Figures 1.3 (Load Balance for Data Size) and 1.4 (Load Balance for Time)]
      Different nodes get different numbers of samples (n); per-node memory is Θ(n^2) and flops are Θ(n^3).

  28. Basic idea of the K-balance algorithm
      Run k-means to get all the cluster centers.
      Find the closest center (CC) for a given sample.
      If CC is already balanced (full), go on to the next-closest center.
      When every center has n/p samples, we are done.
      (See the sketch right after this list.)
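A hedged sketch of that idea, assuming n is divisible by p and that samples are processed one by one; the processing order and tie-breaking are my assumptions, not details given on the slide.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_balance(X, centers):
    """Assign each sample to its closest center that still has room (n/p slots)."""
    n, p = len(X), len(centers)
    cap = n // p                              # target samples per center (assumes p divides n)
    d = cdist(centers, X)                     # d[i][j]: distance from center i to sample j
    counts = np.zeros(p, dtype=int)
    assign = np.empty(n, dtype=int)
    for j in range(n):                        # process samples one by one
        for c in np.argsort(d[:, j]):         # try centers from closest to farthest
            if counts[c] < cap:               # skip centers that are already full
                assign[j] = c
                counts[c] += 1
                break
    return assign
```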

  29. K-balance distance matrix: 8 samples and 4 centers
      d[i][j] = the distance between the i-th center and the j-th sample
      Balanced case: each center holds 2 samples

  30. The center for S0 ⇒ C2
      Underloaded: C0, C1, C2, C3; balanced: none

  31. The center for S1 ⇒ C3
      Underloaded: C0, C1, C2, C3; balanced: none

  32. The center for S2 ⇒ C0
      Underloaded: C0, C1, C2, C3; balanced: none
