
Big Data, Big Bias, Small Surprise
S. Ejaz Ahmed, Faculty of Math and Science, Brock University, ON, Canada
sahmed5@brocku.ca, www.brocku.ca/sahmed
Fields Workshop, May 23, 2014. Joint work with X. Gao.


Executive Summary
Bancroft (1944) suggested two problems based on a preliminary test strategy: a data-pooling problem based on a preliminary test, and a model selection problem in the linear regression model based on a preliminary test. This stream was followed by a host of researchers. Stein (1956, 1961) developed highly efficient shrinkage estimators in balanced designs; most statisticians have ignored these, perhaps due to a lack of understanding. Modern regularization strategies based on penalized least squares extend Stein's procedures powerfully.


Big Data Analysis: Penalty Estimation Strategy
The penalty estimators are members of the penalized least squares (PLS) family; they are obtained by optimizing a quadratic function subject to a penalty. PLS estimation generalizes both nonparametric least squares and weighted projection estimators. A popular version of PLS is Tikhonov (1963) regularization. A generalized version of the penalty estimator is bridge regression (Frank and Friedman, 1993).


Big Data Analysis: Penalty Estimation Strategy
For a given penalty function π(·) and regularization parameter λ, the general form of the objective function can be written as

\phi(\beta) = (y - X\beta)^{\top}(y - X\beta) + \lambda\,\pi(\beta),

where the penalty function is of the form

\pi(\beta) = \sum_{j=1}^{p} |\beta_j|^{\gamma}, \quad \gamma > 0.    (2)

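As a small illustration of the objective above (not part of the slides; the function name, toy data, and choice of γ are assumptions made here for demonstration), the bridge objective in (2) can be evaluated directly:

```python
import numpy as np

def bridge_objective(beta, X, y, lam, gamma):
    """Penalized least squares: residual sum of squares + lam * sum_j |beta_j|^gamma."""
    resid = y - X @ beta
    return resid @ resid + lam * np.sum(np.abs(beta) ** gamma)

# Toy data: gamma = 2 gives the ridge criterion, gamma = 1 the LASSO criterion.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=50)
print(bridge_objective(np.zeros(5), X, y, lam=1.0, gamma=1.0))
```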

Big Data Analysis: Penalty Estimation Strategy
For γ = 2 we obtain the ridge estimates, found by minimizing the penalized residual sum of squares

\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \Big\{ \big\| y - \sum_{j=1}^{p} X_j \beta_j \big\|^2 + \lambda \sum_{j=1}^{p} \|\beta_j\|^2 \Big\},    (3)

where λ is the tuning parameter which controls the amount of shrinkage and ||·|| = ||·||_2 is the L_2 norm.
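For concreteness, a minimal sketch of the closed-form solution implied by (3), assuming centered data with no intercept (the helper name is an assumption, not from the slides):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator: (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```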

Big Data Analysis: Penalty Estimation Strategy
For γ < 2 the penalty shrinks the coefficients towards zero and, depending on the value of λ, sets some of the coefficients exactly to zero; the procedure thus combines variable selection with shrinkage of the coefficients in a penalized regression. An important member of the penalized least squares family is the L_1-penalized least squares estimator, obtained when γ = 1. This is known as the Least Absolute Shrinkage and Selection Operator (LASSO); Tibshirani (1996).


Big Data Analysis: Penalty Estimation Strategy
LASSO is closely related to ridge regression; its solutions are obtained by replacing the squared penalty ||β_j||^2 in the ridge criterion (3) with the absolute penalty ||β_j||_1:

\hat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta} \Big\{ \big\| y - \sum_{j=1}^{p} X_j \beta_j \big\|^2 + \lambda \sum_{j=1}^{p} \|\beta_j\|_1 \Big\}.    (4)

A good strategy if the model is sparse.

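A minimal sketch of (4) in practice, using scikit-learn's Lasso (the library choice, the toy data, and the value of alpha are assumptions made here for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]           # sparse truth: only three active signals
y = X @ beta_true + rng.normal(scale=0.5, size=100)

fit = Lasso(alpha=0.1).fit(X, y)           # alpha plays the role of lambda in (4)
print(np.round(fit.coef_, 2))              # several coefficients are set exactly to zero
```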

Penalty Estimation: Algorithm, Algorithm, Algorithm
Efron et al. (2004, Annals of Statistics, 32) proposed an efficient algorithm, Least Angle Regression (LARS), that produces the entire LASSO solution path in only p steps; in comparison, the classical LASSO computation requires hundreds or thousands of steps. LARS provides a clever and very efficient way of computing the complete sequence of LASSO solutions as the bound s is varied from 0 to ∞. Friedman et al. (2007, 2008) and Wu and Lange developed the coordinate descent (CD) algorithm for penalized linear regression and penalized logistic regression, which was shown to be computationally superior. For a review, we refer to Zhang et al. (2010).

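The flavor of the coordinate descent idea can be conveyed with a short sketch for the LASSO criterion in (4) (an illustrative cyclic implementation with names chosen here, not the published glmnet code):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual excluding j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta
```

Each coordinate update has a closed form via soft-thresholding, which is what makes the sweep so cheap compared with refitting the whole model.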

Penalty Estimation: A Family Ever Growing!
Adaptive LASSO, elastic net penalty, minimax concave penalty (MCP), SCAD.

Penalty Estimation: Extension and Comparison with Non-penalty Estimators
Ahmed et al. (2008, 2009): penalty estimation for partially linear models. Fallahpour, Ahmed and Doksum (2010): partially linear models with random coefficient autoregressive errors. Ahmed and Fallahpour (2012): quasi-likelihood models. Ahmed et al. (2012): Weibull censored regression models. The relative performance of penalty, shrinkage, and pretest estimators was showcased.


Penalty Estimation: Extension and Comparison with Non-penalty Estimators
S. E. Ahmed (2014). Penalty, Pretest and Shrinkage Estimation: Variable Selection and Estimation. Springer.
S. E. Ahmed (Editor). Perspectives on Big Data Analysis: Methodologies and Applications. To appear in Contemporary Mathematics, a co-publication of the American Mathematical Society and CRM, 2014.


Innate Difficulties: Can Signals Be Separated from Noise?
Not all penalty estimators provide both estimation consistency and variable selection consistency simultaneously. Adaptive LASSO, SCAD, and MCP are oracle (asymptotically). The asymptotic properties rest on assumptions about both the true model and the design covariates: sparsity in the model (most coefficients are exactly 0, only a few are not), and nonzero coefficients large enough to be separated from the zero ones.


Innate Difficulties: Ultrahigh-Dimensional Features
In genetic microarray studies, n is measured in hundreds while the number of features p per sample can exceed millions. Penalty estimators are not efficient when the dimension p becomes extremely large compared with the sample size n. There are still challenging problems when p grows at a non-polynomial rate with n, and non-polynomial dimensionality poses substantial computational challenges. Developments in this arena of penalty estimation are still in their infancy.


Shrinkage Estimation for Big Data
The classical shrinkage estimation methods are limited to fixed p. The asymptotic results depend heavily on a full maximum likelihood estimator with component-wise consistency at rate √n. When p_n > n, a component-wise consistent estimator of β_n is not available, since β_n is not identifiable: there always exist two different values β_n^{(1)} and β_n^{(2)} such that x_i' β_n^{(1)} = x_i' β_n^{(2)} for 1 ≤ i ≤ n.

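A quick numerical illustration of this non-identifiability when p_n > n (entirely illustrative; the construction via the null space of X is an assumption made here, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 25                                # p_n > n
X = rng.normal(size=(n, p))
beta1 = rng.normal(size=p)

# Add any vector from the null space of X: the fitted values are unchanged.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]                            # direction with X @ null_vec ~ 0
beta2 = beta1 + 5.0 * null_vec

print(np.allclose(X @ beta1, X @ beta2))     # True: x_i' beta1 = x_i' beta2 for all i
```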

Shrinkage Estimation for Big Data
We write the p_n-dimensional coefficient vector as β_n = (β_{1n}', β_{2n}')', where β_{1n} is the coefficient vector for the main covariates and β_{2n} collects all nuisance parameters. The sub-vectors β_{1n} and β_{2n} have dimensions p_{1n} and p_{2n}, respectively, with p_{1n} ≤ n and p_{1n} + p_{2n} = p_n. Let X_{1n} and X_{2n} be the sub-matrices of X_n corresponding to β_{1n} and β_{2n}, respectively. Assume the true parameter vector is β_0 = (β_{01}, ..., β_{0 p_n})' = (β_{10}', β_{20}')'.

Shrinkage Estimator for High-Dimensional Data
Let S_{10} and S_{20} denote the index sets corresponding to β_{10} and β_{20}, respectively. S_{10} contains the important predictors, while S_{20} contains the sparse and weak signals satisfying the following assumption.
(A0) |β_{0j}| = O(n^{-ς}) for all j ∈ S_{20}, where ς > 1/2 does not change with n.
Condition (A0) expresses the sparsity of the model. A simpler representation for a finite sample is β_{0j} = 0 for all j ∈ S_{20}, that is, most coefficients are exactly 0.

Shrinkage Estimator for High-Dimensional Data: A Class of Submodels
Predictors indexed by S_{10} are used to construct a submodel. However, other predictors, especially ones in S_{20}, may also make some contribution to the response and cannot be ignored. Consider the UPI or AI: β_{20} = 0_{p_{2n}}.

A Candidate Submodel Estimator
We make the following assumptions on the random error and the design matrix of the true model:
(A1) The random errors ε_i are independent and identically distributed with mean 0 and variance 0 < σ^2 < ∞. Further, E(ε_i^m) < ∞ for an even integer m not depending on n.
(A2) ρ_{1n} > 0 for all n, where ρ_{1n} is the smallest eigenvalue of C_{12n}.
Under (A1)-(A2) and the UPI/AI, the submodel estimator (SME) of β_{1n} is defined as

\hat{\beta}^{SM}_{1n} = (X_{1n}' X_{1n})^{-1} X_{1n}' y.
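A minimal sketch of the SME computation under these assumptions (the function name and the use of a generic least-squares solver are choices made here, not from the slides):

```python
import numpy as np

def submodel_estimator(X1, y):
    """SME: least squares of y on the candidate submodel covariates X1 only."""
    beta_sm, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta_sm
```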

A Candidate Full Model Estimator: Weighted Ridge Estimation
We obtain an estimator of β_n by minimizing a partially penalized objective function,

\hat{\beta}(r_n) = \arg\min_{\beta} \big\{ \| y - X_{1n}\beta_{1n} - X_{2n}\beta_{2n} \|^2 + r_n \|\beta_{2n}\|^2 \big\},

where ||·|| is the ℓ_2 norm and r_n > 0 is a tuning parameter.
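A sketch of solving this partially penalized criterion by penalizing only the nuisance block in the normal equations (a direct-solve illustration assuming X1 has full column rank; names are ours):

```python
import numpy as np

def partial_ridge(X1, X2, y, r_n):
    """Minimize ||y - X1 b1 - X2 b2||^2 + r_n ||b2||^2, with the penalty on b2 only."""
    X = np.hstack([X1, X2])
    p1, p2 = X1.shape[1], X2.shape[1]
    D = np.diag(np.concatenate([np.zeros(p1), np.ones(p2)]))   # penalize the b2 block only
    beta = np.linalg.solve(X.T @ X + r_n * D, X.T @ y)         # (X'X + r_n D) beta = X'y
    return beta[:p1], beta[p1:]
```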

Weighted Ridge Estimation
Since p_n >> n, and under the sparsity assumption, define a_n = c_1 n^{-ω}, 0 < ω ≤ 1/2, c_1 > 0. The weighted ridge estimator of β_n is denoted by

\hat{\beta}^{WR}(r_n, a_n) = \big( \hat{\beta}^{WR}_{1n}(r_n)',\ \hat{\beta}^{WR}_{2n}(r_n, a_n)' \big)',

where \hat{\beta}^{WR}_{1n}(r_n) = \hat{\beta}_{1n}(r_n), and for j ∉ S_{10},

\hat{\beta}^{WR}_{j}(r_n, a_n) = \hat{\beta}_j(r_n) if |\hat{\beta}_j(r_n)| > a_n, and 0 otherwise.
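The thresholding step can be sketched as a small helper applied to the output of the partially penalized fit above (the function name and argument layout are assumptions for illustration):

```python
import numpy as np

def hard_threshold_nuisance(beta, p1, a_n):
    """Zero out nuisance coefficients (positions p1 onward) whose magnitude is <= a_n."""
    beta_wr = beta.copy()
    tail = beta_wr[p1:]
    beta_wr[p1:] = np.where(np.abs(tail) > a_n, tail, 0.0)
    return beta_wr
```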

Weighted Ridge Estimation
We call \hat{β}(r_n, a_n) a weighted ridge estimator for two reasons. We use a weighted ridge penalty instead of an ordinary ridge penalty for the high-dimensional shrinkage estimation strategy because we do not want to generate additional bias, caused by an extra penalty on β_{1n}, when we already have a candidate submodel. Here \hat{β}^{WR}_{1n}(r_n) changes with r_n, and \hat{β}^{WR}_{2n}(r_n, a_n) changes with both r_n and a_n. For notational convenience, we denote the weighted ridge estimators by \hat{β}^{WR}_{1n} and \hat{β}^{WR}_{2n}.

A Candidate HD Shrinkage Estimator
A high-dimensional shrinkage estimator (HD-SE) \hat{β}^{S}_{1n} is

\hat{\beta}^{S}_{1n} = \hat{\beta}^{WR}_{1n} - (h - 2)\, T_n^{-1} (\hat{\beta}^{WR}_{1n} - \hat{\beta}^{SM}_{1n}),

where h > 2 is the number of nonzero elements in \hat{β}^{WR}_{2n},

T_n = (\hat{\beta}^{WR}_{2})' (X_2' M_1 X_2)\, \hat{\beta}^{WR}_{2} / \hat{\sigma}^2,    (5)

M_1 = I_n - X_{1n}(X_{1n}' X_{1n})^{-1} X_{1n}',

and \hat{σ}^2 is a consistent estimator of σ^2. For example, under UPI or AI we can choose \hat{σ}^2 = \sum_{i=1}^{n} (y_i - x_i' \hat{β}^{SM})^2 / (n - 1).
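A sketch of the HD-SE computation as reconstructed above (illustrative only; the function name, argument layout, and the assumption that the estimator inputs come from the earlier sketches are ours):

```python
import numpy as np

def hd_shrinkage(b1_wr, b1_sm, b2_wr, X1, X2, sigma2_hat):
    """HD-SE: shrink the weighted ridge estimate of beta_1 towards the submodel estimate."""
    n = X1.shape[0]
    M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)    # projection off the span of X1
    T_n = b2_wr @ (X2.T @ M1 @ X2) @ b2_wr / sigma2_hat       # statistic in (5)
    h = int(np.sum(b2_wr != 0))                               # number of nonzero nuisance estimates
    return b1_wr - (h - 2) / T_n * (b1_wr - b1_sm)
```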

A Candidate HD Positive Shrinkage Estimator
A high-dimensional positive shrinkage estimator (HD-PSE) is

\hat{\beta}^{PSE}_{1n} = \hat{\beta}^{WR}_{1n} - \big( (h - 2)\, T_n^{-1} \big)_1 (\hat{\beta}^{WR}_{1n} - \hat{\beta}^{SM}_{1n}),

where (a)_1 equals 1 for a > 1 and equals a for a ≤ 1, that is, (a)_1 = min(a, 1).
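The positive-part rule amounts to capping the shrinkage factor at one; a short sketch (again illustrative, with names chosen here):

```python
def hd_positive_shrinkage(b1_wr, b1_sm, h, T_n):
    """HD-PSE: as HD-SE, but with the shrinkage factor (h - 2)/T_n capped at 1."""
    factor = min((h - 2) / T_n, 1.0)
    return b1_wr - factor * (b1_wr - b1_sm)
```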

Consistency and Asymptotic Normality: Weighted Ridge Estimation
Let s_n^2 = σ^2 d_n' Σ_n^{-1} d_n for any p_{12n} × 1 vector d_n satisfying ||d_n|| ≤ 1. Then

n^{1/2} s_n^{-1} d_n' (\hat{\beta}^{WR}_{12n} - \beta_{120}) = n^{-1/2} s_n^{-1} \sum_{i=1}^{n} d_n' \Sigma_n^{-1} z_i \epsilon_i + o_P(1) \xrightarrow{d} N(0, 1).

Asymptotic Distributional Risk
Define

\Sigma_{n11} = \lim_{n\to\infty} X_{1n}' X_{1n} / n, \quad \Sigma_{n22} = \lim_{n\to\infty} X_{2n}' X_{2n} / n,
\Sigma_{n12} = \lim_{n\to\infty} X_{1n}' X_{2n} / n, \quad \Sigma_{n21} = \lim_{n\to\infty} X_{2n}' X_{1n} / n,
\Sigma_{n22.1} = \lim_{n\to\infty} n^{-1} \big[ X_{2n}' X_{2n} - X_{2n}' X_{1n} (X_{1n}' X_{1n})^{-1} X_{1n}' X_{2n} \big],
\Sigma_{n11.2} = \lim_{n\to\infty} n^{-1} \big[ X_{1n}' X_{1n} - X_{1n}' X_{2n} (X_{2n}' X_{2n})^{-1} X_{2n}' X_{1n} \big].

Asymptotic Distributional Risk
Consider the sequence of local alternatives

K_n : \beta_{20} = n^{-1/2} \delta, \quad \beta_{30} = 0_{p_{3n}}, \quad \delta = (\delta_1, \delta_2, \ldots, \delta_{p_{2n}})' \in \mathbb{R}^{p_{2n}}, with each δ_j fixed.

Define Δ_n = δ' Σ_{n22.1} δ. Then n^{1/2} s_{1n}^{-1} d_{1n}' (β*_{1n} - β_{10}) is asymptotically normal under {K_n}, where s_{1n}^2 = σ^2 d_{1n}' Σ_{n11.2}^{-1} d_{1n}. The asymptotic distributional risk (ADR) of d_{1n}' β*_{1n} is

ADR(d_{1n}' \beta^{*}_{1n}) = \lim_{n\to\infty} E\big\{ [\, n^{1/2} s_{1n}^{-1} d_{1n}' (\beta^{*}_{1n} - \beta_{10}) \,]^2 \big\}.

Asymptotic Distributional Risk Analysis: Mathematical Proof
Under regularity conditions and K_n, and supposing there exists 0 ≤ c ≤ 1 such that c = \lim_{n\to\infty} s_{1n}^{-2} d_{1n}' \Sigma_{n11}^{-1} d_{1n}, we have

ADR(d_{1n}' \hat{\beta}^{WR}_{1n}) = 1,    (6a)
ADR(d_{1n}' \hat{\beta}^{SM}_{1n}) = 1 - (1 - c)(1 - \Delta_{d_{1n}}),    (6b)
ADR(d_{1n}' \hat{\beta}^{S}_{1n}) = 1 - E[g_1(z_2 + \delta)],    (6c)
ADR(d_{1n}' \hat{\beta}^{PSE}_{1n}) = 1 - E[g_2(z_2 + \delta)],    (6d)

where

\Delta_{d_{1n}} = \frac{ d_{1n}' \big( \Sigma_{n11}^{-1} \Sigma_{n12}\, \delta\delta'\, \Sigma_{n21} \Sigma_{n11}^{-1} \big) d_{1n} }{ d_{1n}' \big( \Sigma_{n11}^{-1} \Sigma_{n12} \Sigma_{n22.1}^{-1} \Sigma_{n21} \Sigma_{n11}^{-1} \big) d_{1n} },

and s_{2n}^{-1} d_{2n}' z_2 → N(0, 1), with d_{2n} = Σ_{n21} Σ_{n11}^{-1} d_{1n} and s_{2n}^2 = d_{2n}' Σ_{n22.1}^{-1} d_{2n}.

Asymptotic Distributional Risk Analysis: Mathematical Proof

g_1(x) = \lim_{n\to\infty} (1 - c)\, \frac{p_{2n} - 2}{x' \Sigma_{n22.1} x} \Big[ 2 - \frac{ x' \big( (p_{2n} + 2)\, d_{2n} d_{2n}' \big) x }{ s_{2n}^2\, x' \Sigma_{n22.1} x } \Big],

g_2(x) = \lim_{n\to\infty} (1 - c)\, \frac{p_{2n} - 2}{x' \Sigma_{n22.1} x} \Big[ 2 - \frac{ x' \big( (p_{2n} + 2)\, d_{2n} d_{2n}' \big) x }{ s_{2n}^2\, x' \Sigma_{n22.1} x } \Big] I\big( x' \Sigma_{n22.1} x \ge p_{2n} - 2 \big)
+ \lim_{n\to\infty} \big[ (2 - s_{2n}^{-2}\, x' \delta_{2n} \delta_{2n}' x)(1 - c) \big] I\big( x' \Sigma_{n22.1} x \le p_{2n} - 2 \big).

Moral of the Story
By ignoring the bias, it will not go away! Submodel estimators provided by some existing variable selection techniques when p_n ≫ n are subject to bias. The prediction performance can be improved by the shrinkage strategy, particularly when an under-fitted submodel is selected by an aggressive penalty parameter.
