Optimization
MS Maths Big Data
Alexandre Gramfort
alexandre.gramfort@telecom-paristech.fr
Telecom ParisTech, M2 Maths Big Data
Plan
1. Notations
2. Ridge regression and quadratic forms
3. SVD
4. Woodbury
5. Dense Ridge
6. Sparse Ridge
Optimization problem

Definition (Optimization problem (P))
    min f(x),  x ∈ C,
where
- f : R^n → R ∪ {+∞} is called the objective function,
- C = {x ∈ R^n : g(x) ≤ 0 and h(x) = 0} is the feasible set,
- g(x) ≤ 0 represents the inequality constraints, with g(x) = (g_1(x), ..., g_p(x)), i.e. p constraints,
- h(x) = 0 represents the equality constraints, with h(x) = (h_1(x), ..., h_q(x)), i.e. q constraints.
An element x ∈ C is said to be feasible.
Taylor at order 2

Assuming f is twice differentiable, the Taylor expansion at order 2 of f at x reads:
    ∀ h ∈ R^n,  f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇²f(x) h + o(‖h‖²)
- ∇f(x) ∈ R^n is the gradient.
- ∇²f(x) ∈ R^{n×n} is the Hessian matrix.
Remark: this provides a local quadratic approximation of f around x.
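A quick numerical illustration of the remark (not from the original slides; the test function and point are arbitrary choices): the gap between f(x + h) and its order-2 expansion should vanish faster than ‖h‖².

```python
import numpy as np

# Illustrative function (an arbitrary choice): f(x) = exp(x0) + x0 * x1^2
def f(x):
    return np.exp(x[0]) + x[0] * x[1] ** 2

def grad_f(x):  # gradient of f
    return np.array([np.exp(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]])

def hess_f(x):  # Hessian of f
    return np.array([[np.exp(x[0]), 2.0 * x[1]],
                     [2.0 * x[1], 2.0 * x[0]]])

x = np.array([0.3, -1.2])
d = np.array([1.0, 0.5])
for t in [1e-1, 1e-2, 1e-3]:
    h = t * d
    taylor2 = f(x) + grad_f(x) @ h + 0.5 * h @ hess_f(x) @ h
    # the error divided by ‖h‖² should go to 0, i.e. the error is o(‖h‖²)
    print(t, abs(f(x + h) - taylor2) / (h @ h))
```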
Ridge regression

We consider problems with n samples (observations) and p features (variables).

Definition (Ridge regression)
Let y ∈ R^n be the n targets to predict and (x_i)_i the n samples in R^p. Ridge regression consists in solving the following problem:
    min_{w, b}  (1/2) ‖y − Xw − b 1_n‖² + (λ/2) ‖w‖²,  λ > 0,
where w ∈ R^p is called the weight vector, b ∈ R is the intercept (a.k.a. bias), 1_n ∈ R^n is the all-ones vector, and the i-th row of X is x_i.
Remark: the intercept b is not penalized by λ.
Taking care of the intercept

There are different ways to deal with the intercept.
Option 1: center the target y and each feature column. After centering the problem reads:
    min_w  (1/2) ‖y − Xw‖² + (λ/2) ‖w‖²,  λ > 0
Option 2: add a column of ones to X and try not to penalize it (too much).

Exercise
Denote by ȳ ∈ R the mean of y and by x̄ ∈ R^p the vector of column means of X. Show that the optimal intercept satisfies b̂ = ȳ − x̄^T ŵ. (A numerical check is sketched below.)
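A minimal numerical check of Option 1 and of the exercise (assuming scikit-learn is available; the data and λ are illustrative). scikit-learn's Ridge also leaves the intercept unpenalized, so both approaches should agree.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
n, p, lam = 50, 5, 1.0
X = rng.randn(n, p)
y = X @ rng.randn(p) + 3.0 + 0.1 * rng.randn(n)  # data with a nonzero intercept

# Option 1: center y and the columns of X, then solve the problem without intercept
y_mean, X_mean = y.mean(), X.mean(axis=0)
Xc, yc = X - X_mean, y - y_mean
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
b = y_mean - X_mean @ w  # the formula of the exercise

# Reference: scikit-learn's Ridge with an (unpenalized) intercept
ridge = Ridge(alpha=lam, fit_intercept=True).fit(X, y)
print(np.allclose(w, ridge.coef_), np.isclose(b, ridge.intercept_))
```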
Ridge regression

Definition (Quadratic form)
A quadratic form reads
    f(x) = (1/2) x^T A x + b^T x + c,
where x ∈ R^p, A ∈ R^{p×p}, b ∈ R^p and c ∈ R.
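To make the definition concrete, here is a small sketch (the random A and b are illustrative): for a symmetric positive definite A, the gradient of the quadratic form is Ax + b, so its unique minimizer solves the linear system A x* = −b.

```python
import numpy as np

rng = np.random.RandomState(0)
p = 4
M = rng.randn(p, p)
A = M @ M.T + np.eye(p)      # symmetric positive definite (illustrative)
b, c = rng.randn(p), 0.5

def f(x):
    return 0.5 * x @ A @ x + b @ x + c

# grad f(x) = A x + b, hence the minimizer solves A x* = -b
x_star = np.linalg.solve(A, -b)

# sanity check: perturbing x_star can only increase the objective
for _ in range(3):
    assert f(x_star) <= f(x_star + 0.1 * rng.randn(p))
print(f(x_star))
```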
Ridge regression

Questions
- Show that ridge regression boils down to the minimization of a quadratic form.
- Propose a closed-form solution.
- Show that the solution is obtained by solving a linear system.
- Is the objective function strongly convex?
- Assuming n < p, what is the value of the constant of strong convexity?
Singular value decomposition (SVD)

The SVD is a factorization of a (here real) matrix:
    M = U Σ V^T
where M ∈ R^{n×p}, U ∈ R^{n×n}, Σ ∈ R^{n×p}, V ∈ R^{p×p}:
- U^T U = U U^T = I_n (orthogonal matrix)
- V^T V = V V^T = I_p (orthogonal matrix)
- Σ is a (rectangular) diagonal matrix
- the Σ_{i,i} are called the singular values
- the columns of U are the left-singular vectors
- the columns of V are the right-singular vectors
Singular value decomposition (SVD)

- U contains the eigenvectors of M M^T, associated with the eigenvalues Σ_{i,i}²
- V contains the eigenvectors of M^T M, associated with the eigenvalues Σ_{i,i}²
- with the convention Σ_{i,i} = 0 for i > min(n, p)
- the SVD is particularly useful to find the rank, null space, image and pseudo-inverse of a matrix
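A short NumPy illustration of the decomposition and of the eigenvalue relation (the random matrix is only an example):

```python
import numpy as np

rng = np.random.RandomState(0)
n, p = 6, 4
M = rng.randn(n, p)

# full_matrices=True returns U (n x n) and V^T (p x p); s holds the min(n, p) singular values
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# rebuild the rectangular "diagonal" Sigma and check M = U Sigma V^T
Sigma = np.zeros((n, p))
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(M, U @ Sigma @ Vt))

# the squared singular values are the eigenvalues of M^T M (and of M M^T)
eigvals = np.linalg.eigvalsh(M.T @ M)[::-1]  # descending order
print(np.allclose(eigvals, s ** 2))
```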
Matrix inversion lemma

Proposition (Matrix inversion lemma)
The matrix inversion lemma, also known as the Sherman–Morrison–Woodbury formula, states that:
    (A + U C V)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1},
where A ∈ R^{n×n}, U ∈ R^{n×k}, C ∈ R^{k×k}, V ∈ R^{k×n} (with A and C invertible).
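A quick numerical check of the identity (the random matrices are illustrative; in practice the payoff comes when A is easy to invert and k ≪ n):

```python
import numpy as np

rng = np.random.RandomState(0)
n, k = 8, 2
A = np.diag(rng.rand(n) + 1.0)   # an easy-to-invert A (diagonal here)
U = rng.randn(n, k)
C = np.diag(rng.rand(k) + 1.0)
V = rng.randn(k, n)

lhs = np.linalg.inv(A + U @ C @ V)

A_inv, C_inv = np.linalg.inv(A), np.linalg.inv(C)
rhs = A_inv - A_inv @ U @ np.linalg.inv(C_inv + V @ A_inv @ U) @ V @ A_inv

print(np.allclose(lhs, rhs))  # both sides agree up to numerical precision
```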
Matrix inversion lemma (proof)

Just check that (A + UCV) times the right-hand side of the Woodbury identity gives the identity matrix:
    (A + UCV) (A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1})
        = I + U C V A^{-1} − (U + U C V A^{-1} U)(C^{-1} + V A^{-1} U)^{-1} V A^{-1}
        = I + U C V A^{-1} − U C (C^{-1} + V A^{-1} U)(C^{-1} + V A^{-1} U)^{-1} V A^{-1}
        = I + U C V A^{-1} − U C V A^{-1}
        = I

Questions
Using the matrix inversion lemma, show that if n < p the ridge regression problem can be solved by inverting a matrix of size n × n rather than p × p.
Primal and dual implementation

The solution of the ridge regression problem (without intercept) is obtained by solving the problem in the primal form:
    ŵ = (X^T X + λ I_p)^{-1} X^T y
or in the dual form:
    ŵ = X^T (X X^T + λ I_n)^{-1} y
In the dual formulation the matrix to invert is in R^{n×n}.
What if X is sparse, n is 1e5 and p is 1e6?
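A small sketch comparing the two forms on dense random data (sizes and λ are illustrative); with n < p the dual form only requires solving an n × n system:

```python
import numpy as np

rng = np.random.RandomState(0)
n, p, lam = 50, 200, 1.0   # n < p
X = rng.randn(n, p)
y = rng.randn(n)

# primal form: solve a p x p system
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# dual form (matrix inversion lemma): solve an n x n system
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(w_primal, w_dual))  # same ridge solution either way
```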
Conjugate gradient: solve Ax = b, with A ∈ R^{n×n} and b ∈ R^n

    x_0 ∈ R^n, g_0 = A x_0 − b
    for k = 0 to n do
        if g_k = 0 then
            break
        end if
        if k = 0 then
            w_k = g_0
        else
            α_k = −⟨g_k, A w_{k−1}⟩ / ⟨w_{k−1}, A w_{k−1}⟩
            w_k = g_k + α_k w_{k−1}
        end if
        ρ_k = ⟨g_k, w_k⟩ / ⟨w_k, A w_k⟩
        x_{k+1} = x_k − ρ_k w_k
        g_{k+1} = A x_{k+1} − b
    end for
    return x_{k+1}
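A direct Python transcription of this pseudocode (a sketch; the random test system is only for checking):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Solve A x = b for a symmetric positive definite A, following the
    pseudocode above: g is the gradient (residual), w the search direction."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    g = A @ x - b
    w = None
    for k in range(n + 1):
        if np.linalg.norm(g) <= tol:
            break
        if k == 0:
            w = g.copy()
        else:
            alpha = -(g @ (A @ w)) / (w @ (A @ w))
            w = g + alpha * w
        rho = (g @ w) / (w @ (A @ w))
        x = x - rho * w
        g = A @ x - b
    return x

# quick check on a random SPD system (illustrative)
rng = np.random.RandomState(0)
M = rng.randn(30, 30)
A = M @ M.T + np.eye(30)
b = rng.randn(30)
print(np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b)))
```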
Sparse ridge with CG

cf. Notebook
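The notebook itself is not reproduced here. A minimal sketch of the idea, assuming SciPy is available: solve the primal normal equations (X^T X + λ I) w = X^T y with CG through a matrix-free LinearOperator, so the p × p matrix is never formed (sizes, density and λ are illustrative).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.RandomState(0)
n, p, lam = 1000, 5000, 1.0
X = sp.random(n, p, density=1e-3, format="csr", random_state=rng)
y = rng.randn(n)

# matrix-free operator for X^T X + lam * I: only matrix-vector products with X
def matvec(w):
    return X.T @ (X @ w) + lam * w

A = LinearOperator((p, p), matvec=matvec, dtype=X.dtype)
w, info = cg(A, X.T @ y, maxiter=1000)
print(info)  # 0 means CG converged
```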
Logistic regression with CG

cf. Notebook
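Again the notebook is not reproduced here. A minimal sketch of the idea using SciPy's nonlinear conjugate gradient on the L2-regularized logistic loss (data and λ are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
n, p, lam = 200, 20, 1.0
X = rng.randn(n, p)
y = np.sign(rng.randn(n))  # labels in {-1, +1}

def loss_and_grad(w):
    """L2-regularized logistic loss and its gradient."""
    z = y * (X @ w)
    loss = np.sum(np.logaddexp(0.0, -z)) + 0.5 * lam * w @ w
    sigma = 1.0 / (1.0 + np.exp(z))        # derivative factor of log(1 + exp(-z))
    grad = -X.T @ (y * sigma) + lam * w
    return loss, grad

res = minimize(loss_and_grad, np.zeros(p), jac=True, method="CG")
print(res.success, res.fun)
```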