Regularization Paths

Regularization Paths: PowerPoint PPT presentation by Trevor Hastie, Stanford Statistics, June 2006.



  1. Regularization Paths
     Trevor Hastie, Stanford University
     Drawing on collaborations with Brad Efron, Mee-Young Park, Saharon Rosset, Rob Tibshirani, Hui Zou and Ji Zhu.

     Theme
     • Boosting fits a regularization path toward a max-margin classifier. Svmpath does as well.
     • In neither case is this endpoint always of interest; somewhere along the path is often better.
     • Having efficient algorithms for computing entire paths facilitates this selection.
     • A mini industry has emerged for generating regularization paths, covering a broad spectrum of statistical problems.

     Adaboost Stumps for Classification / Boosting Stumps for Regression
     [Figures: test misclassification error vs. number of iterations for Adaboost stumps, with and without shrinkage 0.1; and MSE (squared error loss) vs. number of trees for GBM stumps, with and without shrinkage 0.1.]
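The two figures track test error as boosting proceeds, with and without shrinkage. A minimal sketch of that kind of experiment in Python (scikit-learn, with the diabetes data standing in for the slides' datasets; the split, step sizes, and iteration counts are illustrative assumptions, not the slides' settings):

```python
# Least squares boosting with stumps, tracking test MSE by iteration.
# Sketch only: dataset, split, and iteration counts are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def boost_stumps(eps, n_iter=500):
    """Repeatedly fit a depth-1 tree (stump) to the residuals and add eps times it."""
    F_tr = np.zeros(len(y_tr))              # current fit F(x) on the training set
    F_te = np.zeros(len(y_te))              # current fit on the test set
    test_mse = []
    for _ in range(n_iter):
        r = y_tr - F_tr                     # residuals
        stump = DecisionTreeRegressor(max_depth=1).fit(X_tr, r)
        F_tr += eps * stump.predict(X_tr)   # F(x) <- F(x) + eps * f(x)
        F_te += eps * stump.predict(X_te)
        test_mse.append(np.mean((y_te - F_te) ** 2))
    return test_mse

mse_no_shrinkage = boost_stumps(eps=1.0)
mse_shrink_01 = boost_stumps(eps=0.1)       # shrinkage 0.1, as in the figures
```

Plotting the two test_mse sequences against iteration number gives curves of the kind shown in the figures.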

  2. Least Squares Boosting
     Friedman, Hastie & Tibshirani; see Elements of Statistical Learning (chapter 10).
     Supervised learning: response y, predictors x = (x1, x2, ..., xp).
     1. Start with function F(x) = 0 and residual r = y.
     2. Fit a CART regression tree to r, giving f(x).
     3. Set F(x) ← F(x) + ε·f(x), r ← r − ε·f(x), and repeat steps 2 and 3 many times.

     Linear Regression
     Here is a version of least squares boosting for multiple linear regression (assume the predictors are standardized).
     (Incremental) Forward Stagewise
     1. Start with r = y and β1 = β2 = ... = βp = 0.
     2. Find the predictor xj most correlated with r.
     3. Update βj ← βj + δj, where δj = ε · sign⟨r, xj⟩.
     4. Set r ← r − δj · xj, and repeat steps 2 and 3 many times.
     Taking δj = ⟨r, xj⟩ gives the usual forward stagewise procedure; this is different from forward stepwise. The procedure is analogous to least squares boosting, with predictors playing the role of trees (a code sketch follows this item).

     Linear Regression via the Lasso (Tibshirani, 1995)
     • Assume ȳ = 0, x̄j = 0, Var(xj) = 1 for all j.
     • Minimize Σi (yi − Σj xij βj)² subject to ||β||1 ≤ t.
     • Similar to ridge regression, which has the constraint ||β||2 ≤ t.
     • The lasso does variable selection and shrinkage, while ridge only shrinks.

     Example: Prostate Cancer Data
     [Figure: Lasso and Forward Stagewise coefficient profiles for lcavol, svi, lweight, pgg45, lbph, gleason, age, and lcp, plotted against t = Σj |βj| for the lasso and against the iteration number for forward stagewise.]
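A minimal sketch of the incremental forward stagewise procedure above (Python; the step size eps and the number of steps are illustrative choices, and the predictors are standardized inside the function):

```python
# Incremental forward stagewise for linear regression (the algorithm above).
# Sketch: eps and n_steps are illustrative choices.
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
    y = y - y.mean()                            # center the response
    beta = np.zeros(X.shape[1])
    r = y.copy()                                # 1. start with r = y and beta = 0
    path = []
    for _ in range(n_steps):
        corr = X.T @ r
        j = np.argmax(np.abs(corr))             # 2. predictor most correlated with r
        delta = eps * np.sign(corr[j])          # 3. delta_j = eps * sign(<r, x_j>)
        beta[j] += delta
        r -= delta * X[:, j]                    # 4. update the residual and repeat
        path.append(beta.copy())
    return np.array(path)                       # coefficient profile, one row per step
```

Each pass moves only the single most correlated coefficient by ±eps, so the profile is built up in many tiny steps.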

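For comparison with the stagewise profile, the lasso coefficients can be traced as a function of t = Σj |βj| using scikit-learn's lasso_path. A sketch (the prostate data are not bundled with scikit-learn, so any centered, standardized (X, y) can be substituted):

```python
# Lasso coefficient profile against the L1 norm t = sum_j |beta_j|.
# Sketch: the prostate data are not bundled with scikit-learn, so any centered,
# standardized (X, y) can be substituted.
import numpy as np
from sklearn.linear_model import lasso_path

def lasso_profile(X, y):
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = y - y.mean()
    alphas, coefs, _ = lasso_path(X, y)       # coefs has shape (p, n_alphas)
    t = np.abs(coefs).sum(axis=0)             # t = sum_j |beta_j| at each alpha
    return t, coefs                           # plot coefs[j] against t for each j
```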
  3. Diabetes Data
     [Figure: Lasso and Stagewise coefficient profiles β̂j for the 10 diabetes predictors, plotted against t = Σj |β̂j|.]

     Why are Forward Stagewise and Lasso so similar?
     • Are they identical?
     • In the orthogonal-predictor case: yes.
     • In the hard-to-verify case of monotone coefficient paths: yes.
     • In general: almost!
     • Least angle regression (LAR) provides answers to these questions, and an efficient way to compute the complete lasso sequence of solutions.

     Least Angle Regression (LAR)
     Like a "more democratic" version of forward stepwise regression.
     1. Start with r = y and β̂1 = β̂2 = ... = β̂p = 0. Assume the xj are standardized.
     2. Find the predictor xj most correlated with r.
     3. Increase βj in the direction of sign(corr(r, xj)) until some other competitor xk has as much correlation with the current residual as does xj.
     4. Move (β̂j, β̂k) in the joint least squares direction for (xj, xk) until some other competitor xℓ has as much correlation with the current residual.
     5. Continue in this way until all predictors have been entered. Stop when corr(r, xj) = 0 for all j, i.e. the OLS solution.
     (A sketch using scikit-learn's lars_path follows this item.)

     df for LAR
     [Figure: LAR standardized coefficients vs. |β|/max|β|, with the df labeled along the top of the plot.]
     • The df are labeled at the top of the figure.
     • At the point a competitor enters the active set, the df are incremented by 1.
     • This is not true, for example, for stepwise regression.
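A sketch of the LAR path computation using scikit-learn's lars_path on the diabetes data (used here as a convenient stand-in for the slide's example):

```python
# Full LAR path on the diabetes data with scikit-learn's lars_path.  (Sketch.)
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize, as the algorithm assumes
y = y - y.mean()

alphas, active, coefs = lars_path(X, y, method="lar")
print("order of entry into the active set:", active)
# After k steps, k coefficients are nonzero (the df of the k-step LAR fit is k).
print("nonzero coefficients at each step:",
      [int(np.sum(coefs[:, k] != 0)) for k in range(coefs.shape[1])])
```

The printed counts illustrate the df claim above: after k steps, exactly k coefficients are nonzero.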

  4. LARS
     [Figures: the LARS geometry in two dimensions (ȳ2 projected onto x1 and x2, with fits µ̂0, µ̂1 and the equiangular direction u2), and the absolute current correlations |ĉkj| with their maximum Ĉk decreasing with step k.]
     The LAR direction u2 at step 2 makes an equal angle with x1 and x2.

     Relationship between the 3 algorithms
     • Lasso and forward stagewise can be thought of as restricted versions of LAR.
     • Lasso: start with LAR. If a coefficient crosses zero, stop; drop that predictor, recompute the best direction, and continue. This gives the lasso path.
       Proof: use the KKT conditions for the appropriate Lagrangian. Informally,
       ∂/∂βj [ (1/2)||y − Xβ||² + λ Σj |βj| ] = 0  ⇔  ⟨xj, r⟩ = λ · sign(β̂j) if β̂j ≠ 0 (active).
     • Forward Stagewise: compute the LAR direction, but constrain the sign of the coefficients to match the correlations corr(r, xj).
     • The incremental forward stagewise procedure approximates these steps, one predictor at a time. As the step size ε → 0, one can show that it coincides with this modified version of LAR.
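A numerical check (a sketch, not part of the slides) of the stationarity condition above: at an active coordinate, ⟨xj, r⟩ equals λ·sign(β̂j). Note that scikit-learn's Lasso minimizes (1/(2n))||y − Xβ||² + α||β||₁, so λ corresponds to n·α here.

```python
# Numerical check of the stationarity condition: at an active coordinate,
# <x_j, r> = lambda * sign(beta_j).  scikit-learn's Lasso minimizes
# (1/(2n))||y - X beta||^2 + alpha*||beta||_1, so lambda = n * alpha here.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()
n = len(y)

alpha = 1.0                                    # an arbitrary illustrative penalty
fit = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
r = y - X @ fit.coef_                          # residual at the lasso solution
for j in np.flatnonzero(fit.coef_):            # active coordinates only
    print(j, X[:, j] @ r, n * alpha * np.sign(fit.coef_[j]))  # agree up to solver tolerance
```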

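Relatedly, the lasso modification of LAR described above can be seen by computing both paths with lars_path and comparing them; they differ only where a coefficient crosses zero and its predictor is dropped (a sketch):

```python
# Compare the pure LAR path with its lasso modification; extra breakpoints on the
# lasso path correspond to a coefficient crossing zero and being dropped. (Sketch.)
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)
_, _, coefs_lar = lars_path(X, y, method="lar")
_, _, coefs_lasso = lars_path(X, y, method="lasso")
print("LAR breakpoints:", coefs_lar.shape[1], " lasso breakpoints:", coefs_lasso.shape[1])
```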
  5. lars package
     • The LARS algorithm computes the entire Lasso/FS/LAR path in the same order of computation as one full least squares fit.
     • When p ≫ N, the solution has at most N non-zero coefficients. It works efficiently for micro-array data (p in the thousands).
     • Cross-validation is quick and easy.

     Cross-Validation Error Curve
     • 10-fold CV error curve using the lasso on some diabetes data (64 inputs, 442 samples).
     • The thick curve is the CV error curve.
     • The shaded region indicates the standard error of the CV estimate.
     • The curve shows the effect of overfitting: errors start to increase above s = 0.2.
     • This shows a trade-off between bias and variance.
     (A sketch with scikit-learn's LassoCV appears after this item.)
     [Figure: CV error vs. tuning parameter s ∈ [0, 1], with the standard-error band shaded.]

     Forward Stagewise and the Monotone Lasso
     • Expand the variable set to include the negative versions −xj of the variables.
     • The original lasso corresponds to a positive lasso in this enlarged space.
     • Forward stagewise corresponds to a monotone lasso. The L1 norm ||β||1 in this enlarged space is arc-length.
     • Forward stagewise produces the maximum decrease in loss per unit arc-length in coefficients.
     [Figure: Lasso and Forward Stagewise coefficient paths, split into positive and negative parts, plotted against the standardized L1 norm.]

     Degrees of Freedom of Lasso
     • The df, or effective number of parameters, give us an indication of how much fitting we have done.
     • Stein's Lemma: if the yi are independent N(µi, σ²), then
       df(µ̂) := Σi=1..n cov(µ̂i, yi)/σ² = E[ Σi=1..n ∂µ̂i/∂yi ].
     • Degrees of freedom formula for LAR: after k steps, df(µ̂k) = k exactly (amazing! with some regularity conditions).
     • Degrees of freedom formula for the lasso: let d̂f(µ̂λ) be the number of non-zero elements in β̂λ. Then E[ d̂f(µ̂λ) ] = df(µ̂λ).
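A Monte Carlo check (a sketch under simulated Gaussian data, not from the slides) of the covariance definition of degrees of freedom, df(µ̂) = Σi cov(µ̂i, yi)/σ², applied to the k-step LAR fit; by the result above it should come out close to k:

```python
# Monte Carlo check of df(mu_hat) = sum_i cov(mu_hat_i, y_i) / sigma^2 for the
# k-step LAR fit, which the result above says should be close to k.
# Sketch with simulated Gaussian data (fixed X); all settings are illustrative.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 10, 4, 1.0
X = rng.standard_normal((n, p))
mu = X @ rng.standard_normal(p)               # true mean vector

def lar_fit(y, k):
    """k-step LAR fit mu_hat = X beta_hat_k."""
    _, _, coefs = lars_path(X, y, method="lar", max_iter=k)
    return X @ coefs[:, -1]

B = 2000                                       # number of simulated responses
Y = mu + sigma * rng.standard_normal((B, n))
fits = np.array([lar_fit(y, k) for y in Y])
cov_i = np.mean((fits - fits.mean(axis=0)) * (Y - mu), axis=0)   # cov(mu_hat_i, y_i)
print("estimated df:", cov_i.sum() / sigma**2, " (should be close to k =", k, ")")
```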

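The cross-validation curve on the earlier slide can be reproduced in outline with scikit-learn's LassoCV. A sketch (the slide's data have 64 inputs, presumably the diabetes data augmented with interactions, so the 10-input version here is only a stand-in, and the path is indexed by α rather than by s):

```python
# 10-fold cross-validation over the lasso path with scikit-learn's LassoCV.  (Sketch.)
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV

X, y = load_diabetes(return_X_y=True)
cv_fit = LassoCV(cv=10).fit(X, y)
cv_curve = cv_fit.mse_path_.mean(axis=1)       # mean CV error at each alpha on the path
cv_se = cv_fit.mse_path_.std(axis=1) / np.sqrt(cv_fit.mse_path_.shape[1])
print("alpha minimizing CV error:", cv_fit.alpha_)
```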