STK-IN4300 Statistical Learning Methods in Data Science
Riccardo De Bin (debin@math.uio.no)
Lecture 5
Outline of the lecture: Kernel Smoothing Methods

- One-dimensional kernel smoothers
- Selecting the width of a kernel
- Local linear regression
- Local polynomial regression
- Local regression in $\mathbb{R}^p$
- Structured local regression models in $\mathbb{R}^p$
- Kernel density estimation
- Mixture models for density estimation
- Nonparametric density estimation with a parametric start
One-dimensional kernel smoothers: from kNN to kernel smoothers

When we introduced the kNN algorithm,
$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)),$$
- justified as an estimate of $E[Y \mid X = x]$.

Drawbacks:
- ugly discontinuities;
- the same weight is given to all points regardless of their distance to $x$.
One-dimensional kernel smoothers: definition

Alternative: weight the effect of each point according to its distance,
$$\hat{f}(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)},$$
where
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right). \qquad (1)$$
Here:
- $D(\cdot)$ is called the kernel;
- $\lambda$ is the bandwidth or smoothing parameter.
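A minimal sketch of this estimator (the Nadaraya–Watson form above) in Python; the data and the Epanechnikov kernel choice are illustrative assumptions, not part of the slides:

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| < 1, 0 otherwise."""
    return np.where(np.abs(t) < 1, 0.75 * (1 - t**2), 0.0)

def kernel_smoother(x0, x, y, lam, D=epanechnikov):
    """f_hat(x0) = sum_i K_lambda(x0, x_i) y_i / sum_i K_lambda(x0, x_i)."""
    w = D(np.abs(x - x0) / lam)      # K_lambda(x0, x_i)
    if w.sum() == 0:                 # no observation inside the bandwidth
        return np.nan
    return np.sum(w * y) / np.sum(w)

# illustrative data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, size=100)
print(kernel_smoother(0.5, x, y, lam=0.2))
```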
One-dimensional kernel smoothers: comparison [figure]
One-dimensional kernel smoothers: typical kernels

We need to choose $D(\cdot)$:
- symmetric around $x_0$;
- decaying smoothly with the distance.

Typical choices:

Kernel         D(t)                                       Support
Normal         $\frac{1}{\sqrt{2\pi}} \exp\{-\frac{1}{2} t^2\}$   $\mathbb{R}$
Rectangular    $\frac{1}{2}$                              $(-1, 1)$
Epanechnikov   $\frac{3}{4}(1 - t^2)$                     $(-1, 1)$
Biquadratic    $\frac{15}{16}(1 - t^2)^2$                 $(-1, 1)$
Tricubic       $\frac{70}{81}(1 - |t|^3)^3$               $(-1, 1)$
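As a sketch, the kernels in the table can be written directly as vectorized Python functions (the function names are illustrative):

```python
import numpy as np

def normal(t):       return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
def rectangular(t):  return np.where(np.abs(t) < 1, 0.5, 0.0)
def epanechnikov(t): return np.where(np.abs(t) < 1, 0.75 * (1 - t**2), 0.0)
def biquadratic(t):  return np.where(np.abs(t) < 1, 15/16 * (1 - t**2)**2, 0.0)
def tricubic(t):     return np.where(np.abs(t) < 1, 70/81 * (1 - np.abs(t)**3)**3, 0.0)
```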
One-dimensional kernel smoothers: comparison [figure]
One-dimensional kernel smoothers: choice of the smoothing parameter

Choice of the bandwidth $\lambda$:
- it controls how large the interval around $x_0$ is:
  - for the Epanechnikov, biquadratic and tricubic kernels, it is the radius of the support;
  - for the Gaussian kernel, it is the standard deviation;
- large values imply lower variance but higher bias:
  - $\lambda$ small → $\hat{f}(x_0)$ based on few points, the $y_i$'s whose $x_i$ is closest to $x_0$;
  - $\lambda$ large → more points → stronger averaging effect;
- alternatively, adapt to the local density (fix $k$ as in kNN), as sketched below:
  - obtained by replacing $\lambda$ with $h_\lambda(x_0)$ in (1);
  - the bias stays constant, while the variance is inversely proportional to the local density.
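A sketch of such an adaptive, k-nearest-neighbour bandwidth (illustrative Python; the function name is an assumption):

```python
import numpy as np

def knn_bandwidth(x0, x, k):
    """Adaptive width h_lambda(x0): distance from x0 to its k-th nearest x_i."""
    return np.sort(np.abs(x - x0))[k - 1]

# used in place of the fixed lambda in (1), e.g.
# w = epanechnikov(np.abs(x - x0) / knn_bandwidth(x0, x, k=10))
```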
One-dimensional kernel smoothers: effect of the smoothing parameter [figure]
Selecting the width of a kernel: bias and variance

Assume $y_i = f(x_i) + \epsilon_i$, with the $\epsilon_i$ i.i.d. such that $E[\epsilon_i] = 0$ and $\mathrm{Var}[\epsilon_i] = \sigma^2$. Then
$$E[\hat{f}(x)] \approx f(x) + \frac{\lambda^2}{2}\, \sigma_D^2\, f''(x)$$
and
$$\mathrm{Var}[\hat{f}(x)] \approx \frac{\sigma^2 R_D}{N \lambda\, g(x)}$$
for $N$ large and $\lambda$ sufficiently close to 0 (Azzalini & Scarpa, 2012). Here:
- $\sigma_D^2 = \int t^2 D(t)\, dt$;
- $R_D = \int D(t)^2\, dt$;
- $g(x)$ is the density from which the $x_i$ were sampled.
Selecting the width of a kernel: bias and variance

Note:
- the bias is a multiple of $\lambda^2$;
  - $\lambda \to 0$ reduces the bias;
- the variance is a multiple of $\frac{1}{N\lambda}$;
  - $\lambda \to \infty$ reduces the variance.

The quantities $g(x)$ and $f''(x)$ are unknown; otherwise
$$\lambda_{\mathrm{opt}} = \left( \frac{\sigma^2 R_D}{\sigma_D^4\, f''(x)^2\, g(x)\, N} \right)^{1/5};$$
note that $\lambda$ must tend to 0 at rate $N^{-1/5}$ (i.e., very slowly).
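For context (not on the slide), the rate follows from minimizing the approximate mean squared error built from the bias and variance expressions above:
$$\mathrm{MSE}(\lambda) \approx \left(\frac{\lambda^2}{2}\,\sigma_D^2 f''(x)\right)^2 + \frac{\sigma^2 R_D}{N\lambda\, g(x)}, \qquad \frac{d\,\mathrm{MSE}}{d\lambda} = 0 \;\Rightarrow\; \lambda^5 = \frac{\sigma^2 R_D}{\sigma_D^4\, f''(x)^2\, g(x)\, N},$$
which gives $\lambda_{\mathrm{opt}} \propto N^{-1/5}$.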
Selecting the width of a kernel: AIC

In any case, local smoothers are linear estimators,
$$\hat{f}(x) = S_\lambda y,$$
since $S_\lambda$, the smoothing matrix, does not depend on $y$. Therefore, an Akaike information criterion can be implemented,
$$\mathrm{AIC} = \log \hat{\sigma}^2 + 2\, \mathrm{trace}\{S_\lambda\},$$
where $\mathrm{trace}\{S_\lambda\}$ is the effective number of degrees of freedom. Alternatively, it is always possible to implement a cross-validation procedure.
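A sketch of how the smoother matrix and its effective degrees of freedom could be computed for the one-dimensional kernel smoother (illustrative Python; the Epanechnikov kernel is an assumption):

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) < 1, 0.75 * (1 - t**2), 0.0)

def smoother_matrix(x, lam, D=epanechnikov):
    """S_lambda, whose rows hold the weights l_i(x_j), so that f_hat = S_lambda @ y."""
    W = D(np.abs(x[:, None] - x[None, :]) / lam)
    return W / W.sum(axis=1, keepdims=True)

def effective_df(x, lam, D=epanechnikov):
    """Effective degrees of freedom: trace of the smoother matrix."""
    return np.trace(smoother_matrix(x, lam, D))
```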
One-dimensional kernel smoothers: other issues

Other points to consider:
- boundary issues:
  - estimates are less accurate close to the boundaries;
  - fewer observations;
  - asymmetry in the kernel;
- ties in the $x_i$'s:
  - possibly more weight on a single $x_i$;
  - there can be different $y_i$'s for the same $x_i$.
Local linear regression: problems at the boundaries [figure]
Local linear regression: problems at the boundaries

By fitting a straight line, we solve the problem to the first order.
↓
Local linear regression

Locally weighted linear regression solves, at each target point $x_0$,
$$\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \right]^2.$$
The estimate is $\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0)\, x_0$:
- the model is fit on all the data in the support of $K_\lambda$;
- it is only evaluated at $x_0$.
Local linear regression: estimation

Estimation:
$$\hat{f}(x_0) = b(x_0)^T \left( B^T W(x_0) B \right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^N l_i(x_0)\, y_i,$$
where:
- $b(x_0)^T = (1, x_0)$;
- $B = (\mathbf{1}, X)$;
- $W(x_0)$ is an $N \times N$ diagonal matrix with $i$-th diagonal element $K_\lambda(x_0, x_i)$;
- $\hat{f}(x_0)$ is linear in $y$ ($l_i(x_0)$ does not depend on $y_i$);
- the weights $l_i(x_0)$ are sometimes called the equivalent kernel,
  - they combine the weighting kernel $K_\lambda(x_0, \cdot)$ and the least-squares operator.
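A minimal sketch of this weighted least-squares computation in Python (illustrative; the tricubic kernel is an assumption, not prescribed by the slide):

```python
import numpy as np

def tricubic(t):
    return np.where(np.abs(t) < 1, (70/81) * (1 - np.abs(t)**3)**3, 0.0)

def local_linear(x0, x, y, lam, D=tricubic):
    """f_hat(x0) = b(x0)^T (B^T W(x0) B)^{-1} B^T W(x0) y."""
    B = np.column_stack([np.ones_like(x), x])         # B = (1, X)
    W = np.diag(D(np.abs(x - x0) / lam))              # diagonal kernel weights
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)  # (alpha_hat, beta_hat)
    return np.array([1.0, x0]) @ coef                 # evaluate the line at x0
```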
Local linear regression: bias correction due to asymmetry [figure]
Local linear regression: bias

Using a Taylor expansion of $f(x_i)$ around $x_0$,
$$E[\hat{f}(x_0)] = \sum_{i=1}^N l_i(x_0) f(x_i) = f(x_0) \sum_{i=1}^N l_i(x_0) + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \dots \qquad (2)$$
For local linear regression,
- $\sum_{i=1}^N l_i(x_0) = 1$;
- $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$.
Therefore,
$$E[\hat{f}(x_0)] - f(x_0) = \frac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \dots$$
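A quick numerical check of the two identities above, using the equivalent-kernel weights $l_i(x_0)$ (an illustrative sketch; the data and kernel are assumptions):

```python
import numpy as np

def tricubic(t):
    return np.where(np.abs(t) < 1, (70/81) * (1 - np.abs(t)**3)**3, 0.0)

def equivalent_kernel(x0, x, lam, D=tricubic):
    """Weights l_i(x0) such that f_hat(x0) = sum_i l_i(x0) y_i."""
    B = np.column_stack([np.ones_like(x), x])
    W = np.diag(D(np.abs(x - x0) / lam))
    return np.array([1.0, x0]) @ np.linalg.solve(B.T @ W @ B, B.T @ W)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 50))
l = equivalent_kernel(0.3, x, lam=0.2)
print(l.sum())                 # ~ 1
print(((x - 0.3) * l).sum())   # ~ 0
```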
Local polynomial regression: bias

Why limit ourselves to a linear fit?
$$\min_{\alpha(x_0),\, \beta_1(x_0), \dots, \beta_d(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j \right]^2,$$
with solution $\hat{f}(x_0) = \hat{\alpha}(x_0) + \sum_{j=1}^d \hat{\beta}_j(x_0)\, x_0^j$.
- it can be shown, using (2), that the bias only involves components of degree $d + 1$ and higher;
- in contrast to local linear regression, it tends to be closer to the true function in regions of high curvature,
  - no "trimming the hills and filling the valleys" effect.
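A sketch extending the local linear fit to a degree-$d$ local polynomial (illustrative Python; the function name and kernel choice are assumptions):

```python
import numpy as np

def tricubic(t):
    return np.where(np.abs(t) < 1, (70/81) * (1 - np.abs(t)**3)**3, 0.0)

def local_poly(x0, x, y, lam, d=2, D=tricubic):
    """Local polynomial regression of degree d, evaluated at x0."""
    B = np.vander(x, N=d + 1, increasing=True)        # columns 1, x, x^2, ..., x^d
    W = np.diag(D(np.abs(x - x0) / lam))
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)  # (alpha_hat, beta_1, ..., beta_d)
    return (x0 ** np.arange(d + 1)) @ coef
```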
Local polynomial regression: regions with high curvature [figure]
Local polynomial regression: bias-variance trade-off

Not surprisingly, there is a price to pay for the smaller bias. Assuming the model $y_i = f(x_i) + \epsilon_i$, where the $\epsilon_i$ are i.i.d. with mean 0 and variance $\sigma^2$,
$$\mathrm{Var}(\hat{f}(x_0)) = \sigma^2\, \| l(x_0) \|^2.$$
It can be shown that $\| l(x_0) \|$ increases with $d$ ⇒ bias-variance trade-off in the choice of $d$.
Local polynomial regression: variance [figure]
Local polynomial regression: final remarks

Some final remarks:
- local linear fits help dramatically in alleviating boundary issues;
- quadratic fits do a little better, but increase the variance;
- quadratic fits fix the issues in regions of high curvature;
- asymptotic analyses suggest that polynomials of odd degree should be preferred to those of even degree,
  - the MSE is asymptotically dominated by boundary effects;
- in any case, the choice of $d$ is problem specific.