Sketched Ridge Regression: Optimization and Statistical Perspectives
Shusen Wang (UC Berkeley), Alex Gittens (RPI), Michael Mahoney (UC Berkeley)
Overview
Ridge Regression
• Objective: $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n}\|\mathbf{A}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$, where $\mathbf{A} \in \mathbb{R}^{n \times d}$.
• Over-determined: $n \gg d$.
• Can we compute an efficient, approximate solution, using only part of the data?
• Matrix sketching:
  • Random selection
  • Random projection
Approximate Ridge Regression
• Sketched solution: $\mathbf{w}^s$, computed from a sketch of size $s \ll n$.
• Optimization perspective: for what sketch size $s$ does $f(\mathbf{w}^s) \le (1+\epsilon)\min_{\mathbf{w}} f(\mathbf{w})$ hold?
• Statistical perspective: how do the bias and variance of $\mathbf{w}^s$ compare to those of the optimal solution?
Related Work
• Least squares regression: $\min_{\mathbf{w}} \|\mathbf{A}\mathbf{w} - \mathbf{y}\|_2^2$.
References:
• Drineas, Mahoney, and Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In SODA, 2006.
• Drineas, Mahoney, Muthukrishnan, and Sarlos. Faster least squares approximation. Numerische Mathematik, 2011.
• Clarkson and Woodruff. Low rank approximation and regression in input sparsity time. In STOC, 2013.
• Ma, Mahoney, and Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 2015.
• Pilanci and Wainwright. Iterative Hessian sketch: fast and accurate solution approximation for constrained least squares. Journal of Machine Learning Research, 2015.
• Raskutti and Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, 2016.
• Etc.
Sketched Ridge Regression
Matrix Sketching
• Turn a big matrix into a smaller one: $\mathbf{A} \in \mathbb{R}^{n \times d} \Longrightarrow \mathbf{S}^T\mathbf{A} \in \mathbb{R}^{s \times d}$.
• $\mathbf{S} \in \mathbb{R}^{n \times s}$ is called the sketching matrix, e.g.,
  • Uniform sampling
  • Leverage score sampling
  • Gaussian projection
  • Subsampled randomized Hadamard transform (SRHT)
  • Count sketch (sparse embedding)
  • Etc.
Matrix Sketching
• Some matrix sketching methods are efficient: the time cost is $o(nds)$, lower than the cost of computing $\mathbf{S}^T\mathbf{A}$ by general matrix multiplication. Examples (a minimal code sketch of two methods follows below):
  • Leverage score sampling: $O(nd \log n)$ time
  • SRHT: $O(nd \log s)$ time
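As a concrete illustration, here is a minimal numpy sketch of two of these methods (uniform row sampling and Gaussian projection). The function names and rescaling conventions are our own, not from the slides; `rng` is assumed to be a `numpy.random.Generator`.

```python
import numpy as np

def uniform_sampling_sketch(A, s, rng):
    """Uniformly sample s rows of A with replacement, rescaled by sqrt(n/s)
    so that E[(S^T A)^T (S^T A)] = A^T A."""
    n = A.shape[0]
    idx = rng.integers(0, n, size=s)
    return A[idx] * np.sqrt(n / s)

def gaussian_sketch(A, s, rng):
    """Gaussian projection: compute S^T A where S has i.i.d. N(0, 1/s) entries."""
    St = rng.standard_normal((s, A.shape[0])) / np.sqrt(s)
    return St @ A
```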
Ridge Regression
• Objective function: $f(\mathbf{w}) = \frac{1}{n}\|\mathbf{A}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$.
• Optimal solution: $\mathbf{w}^\star = \operatorname{argmin}_{\mathbf{w}} f(\mathbf{w}) = (\mathbf{A}^T\mathbf{A} + n\gamma\mathbf{I}_d)^{-1}\mathbf{A}^T\mathbf{y}$.
• Time cost: $O(nd^2)$ directly, or $O(ndt)$ for $t$ iterations of an iterative solver.
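In code, the closed-form solution is a few lines (continuing the numpy setup above; `ridge_exact` is our name for this hypothetical helper):

```python
def ridge_exact(A, y, gamma):
    """Optimal ridge solution w* = (A^T A + n*gamma*I_d)^{-1} A^T y."""
    n, d = A.shape
    return np.linalg.solve(A.T @ A + n * gamma * np.eye(d), A.T @ y)
```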
Sketched Ridge Regression
• Goal: efficiently and approximately solve $\operatorname{argmin}_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n}\|\mathbf{A}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$.
• Approach: reduce the size of $\mathbf{A}$ and $\mathbf{y}$ by matrix sketching.
Sketched Ridge Regression
• Sketched solution:
  $\mathbf{w}^s = \operatorname{argmin}_{\mathbf{w}} \frac{1}{n}\|\mathbf{S}^T\mathbf{A}\mathbf{w} - \mathbf{S}^T\mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2 = (\mathbf{A}^T\mathbf{S}\mathbf{S}^T\mathbf{A} + n\gamma\mathbf{I}_d)^{-1}\mathbf{A}^T\mathbf{S}\mathbf{S}^T\mathbf{y}$.
• Time: $O(sd^2) + T_s$, where $T_s$ is the cost of forming $\mathbf{S}^T\mathbf{A}$.
  • E.g., $T_s = O(nd \log s)$ for SRHT.
  • E.g., $T_s = O(nd \log n)$ for leverage score sampling.
A code sketch of this procedure follows below.
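A minimal sketch of the procedure, under the same assumptions as the helpers above. Note that $\mathbf{A}$ and $\mathbf{y}$ must be sketched with the same $\mathbf{S}$, so they are stacked and sketched jointly:

```python
def ridge_sketched(A, y, gamma, s, sketch=uniform_sampling_sketch, rng=None):
    """Sketched ridge solution: solve the ridge problem on (S^T A, S^T y)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = A.shape
    C = sketch(np.hstack([A, y[:, None]]), s, rng)  # sketch [A, y] with one S
    SA, Sy = C[:, :d], C[:, d]
    return np.linalg.solve(SA.T @ SA + n * gamma * np.eye(d), SA.T @ Sy)
```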
Theory: Optimization Perspective
Optimization Perspective
• Recall the objective function $f(\mathbf{w}) = \frac{1}{n}\|\mathbf{A}\mathbf{w} - \mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_2^2$.
• Goal: bound $f(\mathbf{w}^s) - f(\mathbf{w}^\star)$.
• Note that $\frac{1}{n}\|\mathbf{A}\mathbf{w}^s - \mathbf{A}\mathbf{w}^\star\|_2^2 \le f(\mathbf{w}^s) - f(\mathbf{w}^\star)$.
Optimization Perspective
For the sketching methods
• SRHT or leverage score sampling with $s = \tilde{O}(d/\epsilon)$,
• uniform sampling with $s = \tilde{O}(\mu d/\epsilon)$, where $\mu \in [1, n/d]$ is the row coherence of $\mathbf{A}$,
with $\mathbf{A} \in \mathbb{R}^{n \times d}$ the design matrix, $\gamma$ the regularization parameter, and $\epsilon \in (0, 1]$:
• $f(\mathbf{w}^s) - f(\mathbf{w}^\star) \le \epsilon\, f(\mathbf{w}^\star)$ holds w.p. 0.9.
• Consequently, $\frac{1}{n}\|\mathbf{A}\mathbf{w}^s - \mathbf{A}\mathbf{w}^\star\|_2^2 \le \epsilon\, f(\mathbf{w}^\star)$.
A toy numerical check of this guarantee follows below.
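An illustrative (not rigorous) toy check of the relative-error guarantee, reusing the hypothetical helpers defined earlier; the problem sizes here are arbitrary choices of ours:

```python
def f_obj(A, y, w, gamma):
    """Ridge objective f(w) = (1/n)*||A w - y||^2 + gamma*||w||^2."""
    return np.mean((A @ w - y) ** 2) + gamma * np.sum(w ** 2)

rng = np.random.default_rng(0)
n, d, s, gamma = 100_000, 50, 2_000, 1e-3
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

w_star = ridge_exact(A, y, gamma)
w_s = ridge_sketched(A, y, gamma, s, rng=rng)
print("relative excess objective:",
      f_obj(A, y, w_s, gamma) / f_obj(A, y, w_star, gamma) - 1)
```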
Theory: Statistical Perspective
Statistical Model
• $\mathbf{A} \in \mathbb{R}^{n \times d}$: fixed design matrix.
• $\mathbf{w}_0 \in \mathbb{R}^d$: the true and unknown model.
• $\mathbf{y} = \mathbf{A}\mathbf{w}_0 + \boldsymbol{\delta}$: observed response vector.
• $\delta_1, \dots, \delta_n$ are random noise with $\mathbb{E}[\boldsymbol{\delta}] = \mathbf{0}$ and $\mathbb{E}[\boldsymbol{\delta}\boldsymbol{\delta}^T] = \xi^2\mathbf{I}_n$.
Bias-Variance Decomposition
• Risk: $R(\mathbf{w}) = \frac{1}{n}\mathbb{E}\|\mathbf{A}\mathbf{w} - \mathbf{A}\mathbf{w}_0\|_2^2$.
• $\mathbb{E}$ is taken w.r.t. the random noise $\boldsymbol{\delta}$.
• Risk measures prediction error.
• $R(\mathbf{w}) = \mathrm{bias}^2(\mathbf{w}) + \mathrm{var}(\mathbf{w})$.
Bias-Variance Decomposition
• Optimal solution:
  • $\mathrm{bias}(\mathbf{w}^\star) = \sqrt{n}\,\gamma\,\|(\boldsymbol{\Sigma}^2 + n\gamma\mathbf{I}_d)^{-1}\boldsymbol{\Sigma}\mathbf{V}^T\mathbf{w}_0\|_2$
  • $\mathrm{var}(\mathbf{w}^\star) = \frac{\xi^2}{n}\|(\mathbf{I}_d + n\gamma\boldsymbol{\Sigma}^{-2})^{-1}\|_F^2$
• Sketched solution:
  • $\mathrm{bias}(\mathbf{w}^s) = \sqrt{n}\,\gamma\,\|\boldsymbol{\Sigma}(\boldsymbol{\Sigma}\mathbf{U}^T\mathbf{S}\mathbf{S}^T\mathbf{U}\boldsymbol{\Sigma} + n\gamma\mathbf{I}_d)^{-1}\mathbf{V}^T\mathbf{w}_0\|_2$
  • $\mathrm{var}(\mathbf{w}^s) = \frac{\xi^2}{n}\|(\mathbf{U}^T\mathbf{S}\mathbf{S}^T\mathbf{U} + n\gamma\boldsymbol{\Sigma}^{-2})^{-1}\mathbf{U}^T\mathbf{S}\mathbf{S}^T\|_F^2$
• Here $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ is the SVD.
These formulas translate directly into code; see the sketch below.
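A direct (unoptimized) translation of these formulas into numpy, under the fixed-design model above. `bias_var` is our name for this hypothetical helper; `S` is the realized $n \times s$ sketching matrix, and passing `S=None` gives the unsketched quantities:

```python
def bias_var(A, w0, gamma, xi, S=None):
    """Bias and variance terms of the risk, via the SVD A = U Sigma V^T."""
    n, d = A.shape
    U, sig, Vt = np.linalg.svd(A, full_matrices=False)
    if S is None:  # optimal solution w*
        bias = np.sqrt(n) * gamma * np.linalg.norm(
            sig * (Vt @ w0) / (sig ** 2 + n * gamma))
        var = xi ** 2 / n * np.sum((sig ** 2 / (sig ** 2 + n * gamma)) ** 2)
        return bias, var
    SU = S.T @ U  # S^T U, shape s x d
    M = sig[:, None] * (SU.T @ SU) * sig[None, :] + n * gamma * np.eye(d)
    bias = np.sqrt(n) * gamma * np.linalg.norm(sig * np.linalg.solve(M, Vt @ w0))
    B = np.linalg.solve(SU.T @ SU + n * gamma * np.diag(sig ** -2.0), SU.T @ S.T)
    var = xi ** 2 / n * np.linalg.norm(B) ** 2  # squared Frobenius norm
    return bias, var
```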
Statistical Perspective
For the sketching methods
• SRHT or leverage score sampling with $s = \tilde{O}(d/\epsilon)$,
• uniform sampling with $s = \tilde{O}(\mu d/\epsilon)$, where $\mu \in [1, n/d]$ is the row coherence of the design matrix $\mathbf{A} \in \mathbb{R}^{n \times d}$,
the following hold w.p. 0.9:
• $1 - \epsilon \le \frac{\mathrm{bias}(\mathbf{w}^s)}{\mathrm{bias}(\mathbf{w}^\star)} \le 1 + \epsilon$. Good!
• $(1 - \epsilon)\frac{n}{s} \le \frac{\mathrm{var}(\mathbf{w}^s)}{\mathrm{var}(\mathbf{w}^\star)} \le (1 + \epsilon)\frac{n}{s}$. Bad, because $n \gg s$!
• If $\mathbf{y}$ is noisy, the variance dominates the bias, so $R(\mathbf{w}^s) \gg R(\mathbf{w}^\star)$.
Conclusions
Optimization perspective:
• Use the sketched solution to initialize numerical optimization: $\mathbf{A}\mathbf{w}^s$ is close to $\mathbf{A}\mathbf{w}^\star$.
• Let $\mathbf{w}^{(t)}$ be the output of the $t$-th iteration of the conjugate gradient (CG) algorithm. Then
  $\frac{\|\mathbf{A}\mathbf{w}^{(t)} - \mathbf{A}\mathbf{w}^\star\|_2}{\|\mathbf{A}\mathbf{w}^{(0)} - \mathbf{A}\mathbf{w}^\star\|_2} \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^t$,
  where $\kappa$ is the condition number of the Hessian. So initialization is important.
Statistical perspective:
• Never use the sketched solution to replace the optimal solution: much higher variance → bad generalization.
A minimal warm-started CG refinement is sketched below.
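A minimal warm-start sketch using scipy's CG on the ridge normal equations; `ridge_cg` is our name for this hypothetical helper, not something from the slides:

```python
from scipy.sparse.linalg import LinearOperator, cg

def ridge_cg(A, y, gamma, w_init):
    """Refine a solution by CG on (A^T A + n*gamma*I) w = A^T y."""
    n, d = A.shape
    H = LinearOperator((d, d), matvec=lambda v: A.T @ (A @ v) + n * gamma * v)
    w, info = cg(H, A.T @ y, x0=w_init)  # x0 = sketched solution as warm start
    return w
```

Initializing at $\mathbf{w}^s$ makes $\|\mathbf{A}\mathbf{w}^{(0)} - \mathbf{A}\mathbf{w}^\star\|_2$ small, so by the bound above fewer CG iterations are needed.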
Model Averaging
Model Averaging
• Independently draw $g$ sketching matrices $\mathbf{S}_1, \dots, \mathbf{S}_g$.
• Compute the sketched solutions $\mathbf{w}^s_1, \dots, \mathbf{w}^s_g$.
• Model averaging: $\bar{\mathbf{w}} = \frac{1}{g}\sum_{i=1}^{g} \mathbf{w}^s_i$.
A minimal implementation is sketched below.
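A minimal sketch reusing the hypothetical `ridge_sketched` helper; drawing from one shared generator makes the $g$ sketches independent:

```python
def ridge_averaged(A, y, gamma, s, g, rng=None):
    """Average g independently sketched ridge solutions."""
    if rng is None:
        rng = np.random.default_rng(0)
    ws = [ridge_sketched(A, y, gamma, s, rng=rng) for _ in range(g)]
    return np.mean(ws, axis=0)
```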
Optimization Perspective
• For sufficiently large $s$, without model averaging:
  $\frac{f(\mathbf{w}^s) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \le \epsilon$ holds w.h.p.
• Using the same matrix sketching and the same $s$, with model averaging:
  $\frac{f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)}{f(\mathbf{w}^\star)} \le \frac{\epsilon}{g} + \epsilon^2$ holds w.h.p.