Sketched Ridge Regression: Optimization and Statistical Perspectives


  1. Sketched Ridge Regression: Optimization and Statistical Perspectives. Shusen Wang (UC Berkeley), Alex Gittens (RPI), Michael Mahoney (UC Berkeley)

  2. Overview

  3. Ridge Regression • Objective: min_w f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂² • X ∈ ℝ^{n×d} is over-determined: n ≫ d

  4. Ridge Regression • f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂², X ∈ ℝ^{n×d} • Efficient and approximate solution? • Use only part of the data?

  5. Ridge Regression • f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂² • Matrix sketching: • Random selection • Random projection

  6. Approximate Ridge Regression • f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂² • Sketched solution: ŵ

  7. Approximate Ridge Regression • f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂² • Sketched solution: ŵ • Sketch size s (figure omitted; horizontal axis: sketch size)

  8. Approximate Ridge Regression • f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂² • Sketched solution: ŵ • Sketch size s • Optimization perspective: f(ŵ) ≤ (1 + ε) min_w f(w) (figure omitted; horizontal axis: sketch size)

  9. Approximate Ridge Regression • f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂² • Statistical perspective: • Bias • Variance

  10. Related Work • Least squares regression: min_w ‖Xw − y‖₂²
      • Drineas, Mahoney, and Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In SODA, 2006.
      • Drineas, Mahoney, Muthukrishnan, and Sarlós. Faster least squares approximation. Numerische Mathematik, 2011.
      • Clarkson and Woodruff. Low rank approximation and regression in input sparsity time. In STOC, 2013.
      • Ma, Mahoney, and Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 2015.
      • Pilanci and Wainwright. Iterative Hessian sketch: fast and accurate solution approximation for constrained least squares. Journal of Machine Learning Research, 2015.
      • Raskutti and Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, 2016.
      • Etc.

  11. Sketched Ridge Regression

  12. Matrix Sketching • Turn a big matrix into a smaller one: X ∈ ℝ^{n×d} → SᵀX ∈ ℝ^{s×d} • S ∈ ℝ^{n×s} is called the sketching matrix, e.g.: • Uniform sampling • Leverage score sampling • Gaussian projection • Subsampled randomized Hadamard transform (SRHT) • Count sketch (sparse embedding) • Etc.
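
The two families above (random selection and random projection) can be illustrated in a few lines of NumPy. This is a minimal sketch of our own, not code from the talk; the function names and the scaling convention E[SSᵀ] = Iₙ are illustrative assumptions.

```python
import numpy as np

def uniform_sampling_sketch(X, s, rng):
    """Random selection: sample s rows of X uniformly with replacement,
    rescaled by sqrt(n/s) so that E[S S^T] = I_n."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=s)
    return np.sqrt(n / s) * X[idx]               # S^T X, shape (s, d)

def gaussian_projection_sketch(X, s, rng):
    """Random projection: S has i.i.d. N(0, 1/s) entries, so E[S S^T] = I_n."""
    n = X.shape[0]
    S = rng.normal(0.0, 1.0 / np.sqrt(s), size=(n, s))
    return S.T @ X                               # S^T X, shape (s, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
print(uniform_sampling_sketch(X, 500, rng).shape)     # (500, 50)
print(gaussian_projection_sketch(X, 500, rng).shape)  # (500, 50)
```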

  13. Matrix Sketching • Some matrix sketching methods are efficient: the time cost is o(nds), lower than computing SᵀX by matrix multiplication. • Examples: • Leverage score sampling: O(nd log n) time • SRHT: O(nd log s) time
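
Leverage score sampling, used repeatedly below, can be sketched as follows. Note that this illustrative version computes exact leverage scores via a thin SVD, which costs O(nd²); the O(nd log n) figure on the slide refers to fast approximation algorithms for the scores, which are not shown here.

```python
import numpy as np

def leverage_score_sampling_sketch(X, s, rng):
    """Row sampling with probabilities proportional to the leverage scores of X,
    rescaled so that E[S S^T] = I_n. Exact scores via a thin SVD cost O(n d^2);
    fast approximate leverage scores (not shown) bring this down to ~O(nd log n)."""
    n = X.shape[0]
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    lev = np.sum(U ** 2, axis=1)                 # leverage scores, summing to d
    p = lev / lev.sum()                          # sampling probabilities
    idx = rng.choice(n, size=s, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])            # importance-sampling rescaling
    return scale[:, None] * X[idx]               # S^T X, shape (s, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 30))
print(leverage_score_sampling_sketch(X, 200, rng).shape)  # (200, 30)
```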

  14. Ridge Regression • Objective function: f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂² • Optimal solution: w* = argmin_w f(w) = (XᵀX + nγI_d)⁻¹Xᵀy • Time cost: O(nd²)
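
The closed form above takes only a few lines; here is a minimal sketch (our own illustrative code) that solves the d×d normal equations rather than forming an explicit inverse:

```python
import numpy as np

def ridge_exact(X, y, gamma):
    """Optimal solution w* = (X^T X + n*gamma*I_d)^{-1} X^T y,
    obtained by solving the d x d system; overall cost O(n d^2)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * gamma * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 20))
y = X @ rng.normal(size=20) + rng.normal(size=1_000)
print(ridge_exact(X, y, gamma=0.1)[:5])          # first few coordinates of w*
```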

  15. Sketched Ridge Regression • Goal: efficiently and approximately solve argmin_w f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂².

  16. Sketched Ridge Regression • Goal: efficiently and approximately solve argmin_w f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂². • Approach: reduce the size of X and y by matrix sketching.

  17. Sketched Ridge Regression • Sketched solution: ŵ = argmin_w (1/n)‖SᵀXw − Sᵀy‖₂² + γ‖w‖₂² = (XᵀSSᵀX + nγI_d)⁻¹XᵀSSᵀy

  18. Sketched Ridge Regression • Sketched solution: ŵ = argmin_w (1/n)‖SᵀXw − Sᵀy‖₂² + γ‖w‖₂² = (XᵀSSᵀX + nγI_d)⁻¹XᵀSSᵀy • Time: O(sd² + T_s), where T_s is the cost of forming SᵀX • E.g. T_s = O(nd log s) for SRHT • E.g. T_s = O(nd log n) for leverage score sampling
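
A minimal implementation of the sketched estimator, again our own illustration rather than the authors' code; uniform row sampling stands in for any sketch with E[SSᵀ] = Iₙ, and X and y are sketched with the same S by sketching [X | y] jointly.

```python
import numpy as np

def uniform_sketch(M, s, rng):
    """Uniform row sampling with replacement, rescaled so that E[S S^T] = I_n."""
    n = M.shape[0]
    idx = rng.integers(0, n, size=s)
    return np.sqrt(n / s) * M[idx]

def ridge_sketched(X, y, gamma, s, rng, sketch=uniform_sketch):
    """Sketched solution (X^T S S^T X + n*gamma*I_d)^{-1} X^T S S^T y."""
    n, d = X.shape
    Zs = sketch(np.column_stack([X, y]), s, rng)   # S^T [X | y], shape (s, d+1)
    Xs, ys = Zs[:, :d], Zs[:, d]
    return np.linalg.solve(Xs.T @ Xs + n * gamma * np.eye(d), Xs.T @ ys)

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 50))
y = X @ rng.normal(size=50) + rng.normal(size=20_000)
print(ridge_sketched(X, y, gamma=1e-3, s=2_000, rng=rng)[:5])
```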

  19. Theory: Optimization Perspective

  20. Optimization Perspective • Recall the objective function f(w) = (1/n)‖Xw − y‖₂² + γ‖w‖₂². • Goal: bound f(ŵ) − f(w*). • Note that (1/n)‖Xŵ − Xw*‖₂² ≤ f(ŵ) − f(w*), since f is quadratic and ∇f(w*) = 0.

  21. Optimization Perspective
      • Notation: X ∈ ℝ^{n×d} is the design matrix, γ is the regularization parameter, β = ‖X‖₂² / (‖X‖₂² + nγ) ∈ (0, 1], and μ ∈ [1, n/d] is the row coherence of X.
      • For SRHT or leverage score sampling with sketch size s = O(βd/ε), or uniform sampling with s = O(μβd log(d)/ε):
      • f(ŵ) − f(w*) ≤ ε f(w*) holds w.p. 0.9.

  22. Optimization Perspective
      • Notation: X ∈ ℝ^{n×d} is the design matrix, γ is the regularization parameter, β = ‖X‖₂² / (‖X‖₂² + nγ) ∈ (0, 1], and μ ∈ [1, n/d] is the row coherence of X.
      • For SRHT or leverage score sampling with sketch size s = O(βd/ε), or uniform sampling with s = O(μβd log(d)/ε):
      • f(ŵ) − f(w*) ≤ ε f(w*) holds w.p. 0.9.
      • Consequently, (1/n)‖Xŵ − Xw*‖₂² ≤ ε f(w*).
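
The guarantee can be checked empirically on synthetic data. The snippet below (an illustrative experiment with arbitrary sizes, not the paper's) computes the relative suboptimality (f(ŵ) − f(w*)) / f(w*) for a uniform-sampling sketch; it should be small when s is large enough.

```python
import numpy as np

def f(w, X, y, gamma):
    """Ridge objective f(w) = (1/n)||Xw - y||^2 + gamma*||w||^2."""
    return np.sum((X @ w - y) ** 2) / X.shape[0] + gamma * np.sum(w ** 2)

rng = np.random.default_rng(0)
n, d, s, gamma = 20_000, 50, 2_000, 1e-3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

# exact solution
w_star = np.linalg.solve(X.T @ X + n * gamma * np.eye(d), X.T @ y)

# sketched solution (uniform row sampling, rescaled so that E[S S^T] = I_n)
idx = rng.integers(0, n, size=s)
Xs, ys = np.sqrt(n / s) * X[idx], np.sqrt(n / s) * y[idx]
w_skt = np.linalg.solve(Xs.T @ Xs + n * gamma * np.eye(d), Xs.T @ ys)

# relative suboptimality (f(w_skt) - f(w_star)) / f(w_star)
print((f(w_skt, X, y, gamma) - f(w_star, X, y, gamma)) / f(w_star, X, y, gamma))
```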

  23. Theory: Statistical Perspective

  24. Statistical Model • X ∈ ℝ^{n×d}: fixed design matrix • w0 ∈ ℝ^d: the true and unknown model • y = Xw0 + e: observed response vector • e₁, …, eₙ are random noise with E[e] = 0 and E[eeᵀ] = ξ²Iₙ
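
A minimal simulation of this model (all sizes and the noise level ξ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, xi = 5_000, 30, 0.5

X = rng.normal(size=(n, d))           # fixed design matrix
w0 = rng.normal(size=d)               # true, unknown model
e = rng.normal(0.0, xi, size=n)       # noise: E[e] = 0, E[e e^T] = xi^2 * I_n
y = X @ w0 + e                        # observed responses
```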

  25. Bias-Variance Decomposition • Risk: R(w) = (1/n) E‖Xw − Xw0‖₂² • The expectation E is taken w.r.t. the random noise e.

  26. Bias-Variance Decomposition • Risk: R(w) = (1/n) E‖Xw − Xw0‖₂² • The expectation E is taken w.r.t. the random noise e. • Risk measures prediction error.

  27. Bias-Variance Decomposition • Risk: R(w) = (1/n) E‖Xw − Xw0‖₂² • R(w) = bias²(w) + var(w)

  28. Bias-Variance Decomposition
      • Risk: R(w) = (1/n) E‖Xw − Xw0‖₂², with R(w) = bias²(w) + var(w). Here X = UΣVᵀ is the SVD.
      • Optimal solution: bias(w*) = γ√n ‖(Σ² + nγI_d)⁻¹ Σ Vᵀ w0‖₂ and var(w*) = (ξ²/n) ‖(I_d + nγΣ⁻²)⁻¹‖_F².
      • Sketched solution: bias(ŵ) = γ√n ‖Σ (Σ Uᵀ S Sᵀ U Σ + nγI_d)⁻¹ Vᵀ w0‖₂ and var(ŵ) = (ξ²/n) ‖(Uᵀ S Sᵀ U + nγΣ⁻²)⁻¹ Uᵀ S Sᵀ‖_F².
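
These quantities can also be computed numerically without the SVD formulas: both w* and ŵ are linear in y, say w = F·y, and under the model above bias(w) = (1/√n)‖XFXw0 − Xw0‖₂ and var(w) = (ξ²/n)‖XF‖_F². The snippet below (our own illustrative check, with arbitrary sizes and a uniform-sampling sketch) compares the two estimators and shows the variance inflation of roughly n/s discussed on the next slide.

```python
import numpy as np

def bias_var(F, X, w0, xi):
    """Bias and variance of a linear estimator w = F y under y = X w0 + e,
    E[e] = 0, E[e e^T] = xi^2 I_n, so that the risk is bias^2 + var."""
    n = X.shape[0]
    H = X @ F                                        # maps y to predictions X w
    bias = np.linalg.norm(H @ (X @ w0) - X @ w0) / np.sqrt(n)
    var = xi ** 2 * np.linalg.norm(H, 'fro') ** 2 / n
    return bias, var

rng = np.random.default_rng(0)
n, d, s, gamma, xi = 2_000, 30, 400, 1e-3, 0.5
X, w0 = rng.normal(size=(n, d)), rng.normal(size=d)

# optimal solution: w* = F_opt y with F_opt = (X^T X + n*gamma*I)^{-1} X^T
F_opt = np.linalg.solve(X.T @ X + n * gamma * np.eye(d), X.T)

# sketched solution (uniform sampling): F_skt = (X^T S S^T X + n*gamma*I)^{-1} X^T S S^T
idx = rng.integers(0, n, size=s)
St = np.zeros((s, n))
St[np.arange(s), idx] = np.sqrt(n / s)               # S^T with E[S S^T] = I_n
Xs = St @ X
F_skt = np.linalg.solve(Xs.T @ Xs + n * gamma * np.eye(d), Xs.T @ St)

b_opt, v_opt = bias_var(F_opt, X, w0, xi)
b_skt, v_skt = bias_var(F_skt, X, w0, xi)
print("optimal : bias =", b_opt, " var =", v_opt)
print("sketched: bias =", b_skt, " var =", v_skt)
print("variance ratio =", v_skt / v_opt, " vs  n/s =", n / s)
```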

  29. Statistical Perspective
      • For SRHT or leverage score sampling with sketch size s = O(d/ε²), or uniform sampling with s = O(μd log(d)/ε²), where X ∈ ℝ^{n×d} is the design matrix and μ ∈ [1, n/d] is the row coherence of X, the following hold w.p. 0.9:
      • 1 − ε ≤ bias(ŵ)/bias(w*) ≤ 1 + ε. Good!
      • (1 − ε)(n/s) ≤ var(ŵ)/var(w*) ≤ (1 + ε)(n/s). Bad, because n ≫ s.

  30. Statistical Perspective
      • For SRHT or leverage score sampling with sketch size s = O(d/ε²), or uniform sampling with s = O(μd log(d)/ε²), where X ∈ ℝ^{n×d} is the design matrix and μ ∈ [1, n/d] is the row coherence of X, the following hold w.p. 0.9:
      • 1 − ε ≤ bias(ŵ)/bias(w*) ≤ 1 + ε.
      • (1 − ε)(n/s) ≤ var(ŵ)/var(w*) ≤ (1 + ε)(n/s).
      • If y is noisy, the variance dominates the bias, so R(ŵ) ≫ R(w*).

  31. Conclusions • Optimization perspective: Xŵ is close to Xw*. • Use the sketched solution to initialize numerical optimization.

  32. Conclusions
      • Optimization perspective: Xŵ is close to Xw*; use the sketched solution to initialize numerical optimization.
      • Let w^(t) be the output of the t-th iteration of the conjugate gradient (CG) algorithm. Then ‖Xw^(t) − Xw*‖₂ ≤ 2 ((√κ − 1)/(√κ + 1))^t ‖Xw^(0) − Xw*‖₂, where κ is the condition number of XᵀX.
      • Initialization is important.
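
The effect of initialization can be seen with a small hand-rolled conjugate gradient solver applied to the ridge normal equations (XᵀX + nγI_d)w = Xᵀy. This is illustrative code with arbitrary problem sizes, not the authors' experiment; it compares a zero start with a warm start at a sketched solution.

```python
import numpy as np

def cg(A, b, x0, iters):
    """Plain conjugate gradient for A x = b (A symmetric positive definite),
    started from x0; returns the iterate after `iters` steps."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
n, d, s, gamma = 20_000, 50, 2_000, 1e-3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

A, b = X.T @ X + n * gamma * np.eye(d), X.T @ y
w_star = np.linalg.solve(A, b)

# warm start: sketched solution from uniform row sampling
idx = rng.integers(0, n, size=s)
Xs, ys = np.sqrt(n / s) * X[idx], np.sqrt(n / s) * y[idx]
w_skt = np.linalg.solve(Xs.T @ Xs + n * gamma * np.eye(d), Xs.T @ ys)

for t in (0, 5, 10):
    for name, x0 in (("zero start    ", np.zeros(d)), ("sketched start", w_skt)):
        w_t = cg(A, b, x0, t)
        print(t, name, np.linalg.norm(X @ (w_t - w_star)))
```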

  33. Conclusions • Optimization perspective: Xŵ is close to Xw*; use the sketched solution to initialize numerical optimization. • Statistical perspective: never use the sketched solution to replace the optimal solution; its much higher variance leads to bad generalization.

  34. Model Averaging

  35. Model Averaging • Independently draw g sketching matrices S₁, …, S_g. • Compute the sketched solutions ŵ₁, …, ŵ_g. • Model averaging: ŵ_avg = (1/g)(ŵ₁ + ⋯ + ŵ_g).
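
A direct transcription of the averaging procedure (our own illustrative code, with uniform row sampling standing in for any of the sketches above):

```python
import numpy as np

def ridge_sketched(X, y, gamma, s, rng):
    """One sketched solution using uniform row sampling (E[S S^T] = I_n)."""
    n, d = X.shape
    idx = rng.integers(0, n, size=s)
    Xs, ys = np.sqrt(n / s) * X[idx], np.sqrt(n / s) * y[idx]
    return np.linalg.solve(Xs.T @ Xs + n * gamma * np.eye(d), Xs.T @ ys)

def ridge_model_averaging(X, y, gamma, s, g, rng):
    """Average of g independently sketched solutions: (1/g) * (w_1 + ... + w_g)."""
    return np.mean([ridge_sketched(X, y, gamma, s, rng) for _ in range(g)], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 30))
y = X @ rng.normal(size=30) + rng.normal(size=10_000)
w_avg = ridge_model_averaging(X, y, gamma=1e-3, s=500, g=10, rng=rng)
```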

  36. Optimization Perspective • For a sufficiently large sketch size s: • Without model averaging, (f(ŵ) − f(w*)) / f(w*) ≤ ε holds w.h.p.

  37. Optimization Perspective • For a sufficiently large sketch size s: • Without model averaging, (f(ŵ) − f(w*)) / f(w*) ≤ ε holds w.h.p. • Using the same matrix sketching and the same s, with model averaging, (f(ŵ_avg) − f(w*)) / f(w*) ≤ ε/g + ε² holds w.h.p.

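
A quick synthetic comparison of a single sketched solution against the average of g of them, in the spirit of the model-averaging bound above (arbitrary sizes, uniform sampling, our own illustration). The averaged solution's relative suboptimality is typically much smaller than the single-sketch one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s, g, gamma = 50_000, 50, 1_000, 20, 1e-3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

def f(w):
    """Ridge objective f(w) = (1/n)||Xw - y||^2 + gamma*||w||^2."""
    return np.sum((X @ w - y) ** 2) / n + gamma * np.sum(w ** 2)

def sketched():
    """One sketched solution from uniform row sampling (E[S S^T] = I_n)."""
    idx = rng.integers(0, n, size=s)
    Xs, ys = np.sqrt(n / s) * X[idx], np.sqrt(n / s) * y[idx]
    return np.linalg.solve(Xs.T @ Xs + n * gamma * np.eye(d), Xs.T @ ys)

w_star = np.linalg.solve(X.T @ X + n * gamma * np.eye(d), X.T @ y)
ws = [sketched() for _ in range(g)]
w_avg = np.mean(ws, axis=0)

print("single sketch:", (f(ws[0]) - f(w_star)) / f(w_star))
print("averaged     :", (f(w_avg) - f(w_star)) / f(w_star))
```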
