Advanced Econometrics #3: Model & Variable Selection
A. Charpentier (Université de Rennes 1)
Université de Rennes 1, Graduate Course, 2018
@freakonometrics
“Great plot. Now need to find the theory that explains it” Deville (2017), http://twitter.com
Preliminary Results: Numerical Optimization

Problem: $x^\star \in \text{argmin}\{f(x);\ x\in\mathbb{R}^d\}$

Gradient descent: $x_{k+1} = x_k - \eta\,\nabla f(x_k)$, starting from some $x_0$.

Problem: $x^\star \in \text{argmin}\{f(x);\ x\in\mathcal{X}\subset\mathbb{R}^d\}$

Projected descent: $x_{k+1} = \Pi_{\mathcal{X}}\big(x_k - \eta\,\nabla f(x_k)\big)$, starting from some $x_0$.

A constrained problem is said to be convex if
$$\min\{f(x)\}\ \text{with } f \text{ convex, s.t. }\ g_i(x)=0,\ \forall i=1,\dots,n,\ g_i \text{ linear},\qquad h_i(x)\le 0,\ \forall i=1,\dots,m,\ h_i \text{ convex}.$$

Lagrangian:
$$\mathcal{L}(x,\lambda,\mu) = f(x) + \sum_{i=1}^n \lambda_i g_i(x) + \sum_{i=1}^m \mu_i h_i(x)$$
where $x$ are the primal variables and $(\lambda,\mu)$ are the dual variables.

Remark: $\mathcal{L}$ is an affine function in $(\lambda,\mu)$.
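As an illustration, here is a minimal Python sketch of both updates, assuming a toy quadratic objective $f(x)=\|x-c\|^2$ and the non-negative orthant as constraint set (both chosen for illustration only):

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, n_iter=100):
    """Plain gradient descent: x_{k+1} = x_k - eta * grad f(x_k)."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x - eta * grad(x)
    return x

def projected_descent(grad, proj, x0, eta=0.1, n_iter=100):
    """Projected descent: x_{k+1} = Pi_X(x_k - eta * grad f(x_k))."""
    x = x0.copy()
    for _ in range(n_iter):
        x = proj(x - eta * grad(x))
    return x

c = np.array([1.0, -2.0])
grad = lambda x: 2 * (x - c)          # gradient of f(x) = ||x - c||^2
proj = lambda x: np.maximum(x, 0.0)   # projection onto X = {x : x >= 0}

print(gradient_descent(grad, np.zeros(2)))         # -> (1, -2), unconstrained minimum
print(projected_descent(grad, proj, np.zeros(2)))  # -> (1, 0), constrained minimum
```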
Preliminary Results: Numerical Optimization

Karush–Kuhn–Tucker conditions: a convex problem has a solution $x^\star$ if and only if there are $(\lambda^\star,\mu^\star)$ such that the following conditions hold
• stationarity: $\nabla_x\mathcal{L}(x,\lambda,\mu) = 0$ at $(x^\star,\lambda^\star,\mu^\star)$
• primal admissibility: $g_i(x^\star)=0$ and $h_i(x^\star)\le 0$, $\forall i$
• dual admissibility: $\mu^\star \ge 0$
• complementary slackness: $\mu_i^\star h_i(x^\star) = 0$, $\forall i$

Let $\mathcal{L}^\star$ denote the associated dual function, $\mathcal{L}^\star(\lambda,\mu) = \min_x\{\mathcal{L}(x,\lambda,\mu)\}$. As a pointwise minimum of affine functions, $\mathcal{L}^\star$ is a concave function in $(\lambda,\mu)$, and the dual problem is $\max_{\lambda,\mu}\{\mathcal{L}^\star(\lambda,\mu)\}$.
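A quick numerical check of these conditions, on a hypothetical toy problem (not from the course): minimize $f(x)=x_1^2+x_2^2$ subject to $h(x)=1-x_1-x_2\le 0$, whose analytical solution is $x^\star=(1/2,1/2)$ with multiplier $\mu^\star=1$:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + x[1]**2
# scipy's "ineq" convention is fun(x) >= 0, i.e. x1 + x2 - 1 >= 0 here
res = minimize(f, x0=[2.0, 2.0],
               constraints=[{"type": "ineq", "fun": lambda x: x[0] + x[1] - 1}])
x_star, mu_star = res.x, 1.0   # mu* = 1 follows from stationarity: 2 x1 = mu

print(2 * x_star + mu_star * np.array([-1.0, -1.0]))  # stationarity: ~ (0, 0)
print(1 - x_star.sum() <= 1e-8)                       # primal admissibility
print(mu_star >= 0, mu_star * (1 - x_star.sum()))     # dual adm., slackness ~ 0
```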
References

Motivation
Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip: Identifying Central Individuals in a Social Network.

References
Belloni, A. & Chernozhukov, V. (2009). Least Squares after Model Selection in High-Dimensional Sparse Models.
Hastie, T., Tibshirani, R. & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
Preamble

Assume that $y = m(x) + \varepsilon$, where $\varepsilon$ is some idiosyncratic, unpredictable noise. The error $\mathbb{E}[(y-\widehat{m}(x))^2]$ is the sum of three terms
• variance of the estimator: $\mathbb{E}\big[(\widehat{m}(x) - \mathbb{E}[\widehat{m}(x)])^2\big]$
• squared bias of the estimator: $\big(\mathbb{E}[\widehat{m}(x)] - m(x)\big)^2$
• variance of the noise: $\mathbb{E}[(y - m(x))^2]$
(the latter exists even with a ‘perfect’ model).
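A small simulation sketch of this decomposition, under assumed choices ($m(x)=x^2$, Gaussian noise, a deliberately misspecified linear fit) that are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
m = lambda x: x**2
sigma, n, x0 = 0.5, 50, 1.0      # noise sd, sample size, evaluation point

preds = []
for _ in range(2000):
    x = rng.uniform(-2, 2, n)
    y = m(x) + rng.normal(0, sigma, n)
    coef = np.polyfit(x, y, 1)            # (misspecified) linear fit
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

variance = preds.var()                    # variance of the estimator
bias2 = (preds.mean() - m(x0))**2         # squared bias of the estimator
noise = sigma**2                          # variance of the noise
print(variance, bias2, noise)
```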
Preamble

Consider a parametric model with true (unknown) parameter $\theta$, then
$$\text{mse}(\widehat{\theta}) = \mathbb{E}\big[(\widehat{\theta}-\theta)^2\big] = \underbrace{\mathbb{E}\big[(\widehat{\theta}-\mathbb{E}[\widehat{\theta}])^2\big]}_{\text{variance}} + \underbrace{\big(\mathbb{E}[\widehat{\theta}]-\theta\big)^2}_{\text{bias}^2}$$

Let $\widetilde{\theta}$ denote an unbiased estimator of $\theta$. Then the shrunken estimator
$$\widehat{\theta} = \frac{\theta^2}{\theta^2+\text{mse}(\widetilde{\theta})}\cdot\widetilde{\theta} = \widetilde{\theta} - \underbrace{\frac{\text{mse}(\widetilde{\theta})}{\theta^2+\text{mse}(\widetilde{\theta})}}_{\text{penalty}}\cdot\widetilde{\theta}$$
satisfies $\text{mse}(\widehat{\theta}) \le \text{mse}(\widetilde{\theta})$.
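Note that the shrinkage factor involves the true $\theta$, so $\widehat{\theta}$ is an infeasible "oracle" estimator; the point is only that accepting some bias can lower the mse. A quick Monte Carlo check, with assumed values $\theta=1$, $\sigma=2$, $n=10$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 1.0, 2.0, 10
mse_tilde = sigma**2 / n                   # mse of the unbiased sample mean

tilde = rng.normal(theta, sigma / np.sqrt(n), 100000)   # draws of theta_tilde
hat = theta**2 / (theta**2 + mse_tilde) * tilde         # shrunken estimator

print(((tilde - theta)**2).mean())   # ~ mse(theta_tilde) = 0.4
print(((hat - theta)**2).mean())     # ~ 0.29 < 0.4: shrinkage wins
```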
Bayes vs. Frequentist, inference on heads/tails

Consider some Bernoulli sample $\boldsymbol{x} = \{x_1,x_2,\dots,x_n\}$, where $x_i\in\{0,1\}$. The $X_i$'s are i.i.d. $\mathcal{B}(p)$ variables, with $f_X(x) = p^x[1-p]^{1-x}$, $x\in\{0,1\}$.

Standard frequentist approach:
$$\widehat{p} = \frac{1}{n}\sum_{i=1}^n x_i = \text{argmax}\Big\{\underbrace{\prod_{i=1}^n f_X(x_i)}_{\mathcal{L}(p;\boldsymbol{x})}\Big\}$$

From the central limit theorem
$$\sqrt{n}\,\frac{\widehat{p}-p}{\sqrt{p(1-p)}} \xrightarrow{\mathcal{L}} \mathcal{N}(0,1)\ \text{as } n\to\infty$$
we can derive an approximated 95% confidence interval
$$\Big[\widehat{p} \pm \frac{1.96}{\sqrt{n}}\sqrt{\widehat{p}(1-\widehat{p})}\Big]$$
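For instance, with the claims data of the next slide ($x=159$ losses out of $n=1047$ contracts), a short sketch of the computation:

```python
import numpy as np

x, n = 159, 1047
p_hat = x / n                                   # MLE of p: the sample mean
half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)  # CLT-based half-width
print(p_hat, (p_hat - half, p_hat + half))      # ~0.152, CI ~ (0.130, 0.174)
```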
Bayes vs. Frequentist, inference on heads/tails

Example: out of 1,047 contracts, 159 claimed a loss.

[Figure: probability of the number of insured claiming a loss, comparing the (true) binomial distribution with its Poisson and Gaussian approximations.]
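A sketch reproducing the comparison in the figure, using scipy's standard distributions (the plotting code itself is omitted; only the three curves are computed):

```python
import numpy as np
from scipy import stats

n, p = 1047, 159 / 1047
k = np.arange(100, 221)               # number of insured claiming a loss

binom = stats.binom.pmf(k, n, p)      # (true) binomial distribution
poisson = stats.poisson.pmf(k, n * p)                        # Poisson approximation
gauss = stats.norm.pdf(k, n * p, np.sqrt(n * p * (1 - p)))   # Gaussian approximation

print(k[binom.argmax()], binom.max())   # all three curves peak near k = 159
```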
Bayes's theorem

Consider some hypothesis $H$ and some evidence $E$, then
$$\mathbb{P}_E(H) = \mathbb{P}(H\mid E) = \frac{\mathbb{P}(H\cap E)}{\mathbb{P}(E)} = \frac{\mathbb{P}(H)\cdot\mathbb{P}(E\mid H)}{\mathbb{P}(E)}$$

Bayes rule: prior probability $\mathbb{P}(H)$ versus posterior probability after receiving evidence $E$, $\mathbb{P}_E(H) = \mathbb{P}(H\mid E)$.

In Bayesian (parametric) statistics, $H = \{\theta\in\Theta\}$ and $E = \{\boldsymbol{X}=\boldsymbol{x}\}$.

Bayes' Theorem:
$$\pi(\theta\mid\boldsymbol{x}) = \frac{\pi(\theta)\cdot f(\boldsymbol{x}\mid\theta)}{f(\boldsymbol{x})} = \frac{\pi(\theta)\cdot f(\boldsymbol{x}\mid\theta)}{\int f(\boldsymbol{x}\mid\theta)\,\pi(\theta)\,d\theta} \propto \pi(\theta)\cdot f(\boldsymbol{x}\mid\theta)$$
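As a toy numerical illustration (the numbers below are hypothetical, not from the slides), take $\mathbb{P}(H)=0.01$, $\mathbb{P}(E\mid H)=0.95$ and $\mathbb{P}(E\mid \bar{H})=0.05$:

```python
p_H, p_E_given_H, p_E_given_notH = 0.01, 0.95, 0.05
p_E = p_H * p_E_given_H + (1 - p_H) * p_E_given_notH   # total probability
posterior = p_H * p_E_given_H / p_E                    # P(H | E), Bayes' rule
print(posterior)   # ~0.161: the evidence raises P(H) from 1% to ~16%
```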
Small Data and Black Swans

Consider the sample $\boldsymbol{x} = \{0,0,0,0,0\}$. Here the likelihood is
$$f(x_i\mid\theta) = \theta^{x_i}[1-\theta]^{1-x_i},\qquad f(\boldsymbol{x}\mid\theta) = \theta^{\boldsymbol{x}^\top\boldsymbol{1}}[1-\theta]^{n-\boldsymbol{x}^\top\boldsymbol{1}}$$
and we need an a priori distribution $\pi(\cdot)$, e.g. a beta distribution
$$\pi(\theta) = \frac{\theta^{\alpha-1}[1-\theta]^{\beta-1}}{B(\alpha,\beta)}$$
so that the posterior is
$$\pi(\theta\mid\boldsymbol{x}) = \frac{\theta^{\alpha+\boldsymbol{x}^\top\boldsymbol{1}-1}[1-\theta]^{\beta+n-\boldsymbol{x}^\top\boldsymbol{1}-1}}{B(\alpha+\boldsymbol{x}^\top\boldsymbol{1},\,\beta+n-\boldsymbol{x}^\top\boldsymbol{1})}$$
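A sketch of this conjugate update for the all-zeros sample, with an assumed Beta(2, 2) prior (any $\alpha,\beta>0$ would do):

```python
from scipy import stats

alpha, beta, n, s = 2, 2, 5, 0                   # s = x'1, the number of ones
posterior = stats.beta(alpha + s, beta + n - s)  # Beta(alpha + s, beta + n - s)

print(posterior.mean())           # ~0.222: theta is not estimated as 0,
print(posterior.interval(0.95))   # despite five zeros in a row ("black swan")
```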
On Bayesian Philosophy, Confidence vs. Credibility

For frequentists, a probability is a measure of the frequency of repeated events → parameters are fixed (but unknown), and data are random.

For Bayesians, a probability is a measure of the degree of certainty about values → parameters are random and data are fixed.

“Bayesians: Given our observed data, there is a 95% probability that the true value of θ falls within the credible region. vs. Frequentists: There is a 95% probability that when I compute a confidence interval from data of this sort, the true value of θ will fall within it.” in Vanderplas (2014)

Example: see Jaynes (1976), e.g. the truncated exponential.
On Bayesian Philosophy, Confidence vs. Credibility

Example: What is a 95% confidence interval of a proportion? Here $x = 159$ and $n = 1047$.
1. draw sets $(\tilde{x}_1,\dots,\tilde{x}_n)_k$ with $X_i\sim\mathcal{B}(x/n)$
2. compute for each set of values a confidence interval
3. determine the fraction of these confidence intervals that contain $x$
→ the parameter is fixed, and we guarantee that 95% of the confidence intervals will contain it (see the simulation sketch below).

[Figure: simulated confidence intervals for the number of claims, ranging from 140 to 200.]
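A simulation sketch of this frequentist experiment, run here on the proportion scale, with $p_0 = x/n$ playing the role of the fixed parameter:

```python
import numpy as np

rng = np.random.default_rng(42)
x, n = 159, 1047
p0 = x / n
covered = 0
for _ in range(10000):
    x_sim = rng.binomial(n, p0)                        # step 1: draw a sample
    p_hat = x_sim / n
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)     # step 2: its 95% CI
    covered += (p_hat - half <= p0 <= p_hat + half)    # step 3: contains p0?
print(covered / 10000)   # ~0.95
```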
On Bayesian Philosophy, Confidence vs. Credibility

Example: What is a 95% credible region of a proportion? Here $x = 159$ and $n = 1047$.
1. draw random parameters $p_k$ from the posterior distribution $\pi(\cdot\mid x)$
2. sample sets $(\tilde{x}_1,\dots,\tilde{x}_n)_k$ with $X_{i,k}\sim\mathcal{B}(p_k)$
3. compute for each set of values the mean $\bar{x}_k$
4. look at the proportion of those $\bar{x}_k$ that are within the credible region $[\Pi^{-1}(.025\mid x);\,\Pi^{-1}(.975\mid x)]$
→ the credible region is fixed, and we guarantee that 95% of possible values of $\bar{x}$ will fall within it (see the simulation sketch below).

[Figure: simulated means, ranging from 140 to 200.]
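A simulation sketch of this Bayesian experiment, assuming a flat Beta(1, 1) prior so that $\pi(\theta\mid x)$ is Beta$(1+x,\,1+n-x)$; the region below is built from posterior-predictive draws of the count:

```python
import numpy as np

rng = np.random.default_rng(42)
x, n = 159, 1047
p_k = rng.beta(1 + x, 1 + n - x, 10000)      # step 1: posterior draws of p
x_k = rng.binomial(n, p_k)                   # steps 2-3: new simulated counts
lo, hi = np.quantile(x_k, [0.025, 0.975])    # the 95% credible region
print((lo, hi), ((x_k >= lo) & (x_k <= hi)).mean())   # ~0.95 of counts inside
```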
Occam's Razor

The “law of parsimony”: “pluralitas non est ponenda sine necessitate” (plurality should not be posited without necessity).

Penalize overly complex models.
James & Stein Estimator

Let $\boldsymbol{X}\sim\mathcal{N}(\boldsymbol{\mu},\sigma^2\mathbb{I})$. We want to estimate $\boldsymbol{\mu}$.
$$\widehat{\boldsymbol{\mu}}_{\text{mle}} = \overline{\boldsymbol{x}}_n \sim \mathcal{N}\Big(\boldsymbol{\mu},\frac{\sigma^2}{n}\mathbb{I}\Big)$$

From James & Stein (1961), Estimation with Quadratic Loss,
$$\widehat{\boldsymbol{\mu}}_{JS} = \Big(1-\frac{(d-2)\,\sigma^2}{\|\boldsymbol{y}\|^2}\Big)\boldsymbol{y}$$
where $\|\cdot\|$ is the Euclidean norm.

One can prove that if $d\ge 3$,
$$\mathbb{E}\big[\|\widehat{\boldsymbol{\mu}}_{JS}-\boldsymbol{\mu}\|^2\big] < \mathbb{E}\big[\|\widehat{\boldsymbol{\mu}}_{\text{mle}}-\boldsymbol{\mu}\|^2\big]$$

Samworth (2015), Stein's Paradox: “one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne”.
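A Monte Carlo sketch of this inequality, with assumed values $d=10$, $\sigma=1$ and one observation $\boldsymbol{y}$ per replication:

```python
import numpy as np

rng = np.random.default_rng(7)
d, sigma = 10, 1.0
mu = rng.normal(0, 1, d)                    # an arbitrary true mean vector

y = rng.normal(mu, sigma, (100000, d))      # one d-dimensional draw per run
mle = y                                     # maximum likelihood estimator
js = (1 - (d - 2) * sigma**2 / (y**2).sum(axis=1, keepdims=True)) * y

print(((mle - mu)**2).sum(axis=1).mean())   # ~ d * sigma^2 = 10
print(((js - mu)**2).sum(axis=1).mean())    # strictly smaller, for d >= 3
```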
James & Stein Estimator

Heuristics: accept some bias in the estimator in order to decrease its variance.

See Efron (2010), Large-Scale Inference.