
Computational Statistics Lectures 10-13: Smoothing and Nonparametric Inference
Dr Jennifer Rogers
Hilary Term 2017

Background: smoothing and nonparametric methods fit an approximating function that attempts to capture the important patterns in the data.


  1. Local least squares
  We can rewrite the kernel regression estimator as
  $$\hat{m}(x) = \operatorname{argmin}_{m_x \in \mathbb{R}} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)(Y_i - m_x)^2.$$
  Exercise: can be verified by solving $\frac{d}{dm_x} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)(Y_i - m_x)^2 = 0$.
  ◮ Thus, for every fixed x, we have to search for the best local constant $m_x$ such that the localized sum of squares is minimized
  ◮ Localization is here described by the kernel, which gives a large weight to those observations $(x_i, Y_i)$ where $x_i$ is close to the point x of interest. The choice of the bandwidth h is crucial
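
  To make this concrete, here is a minimal R sketch (not from the lectures) of the local-constant estimator: the kernel-weighted mean at each point solves the localized least-squares problem above. The helper name nw_smooth, the Gaussian kernel, the simulated data and the bandwidth are all illustrative assumptions.

  # Local-constant (Nadaraya-Watson) smoother with a Gaussian kernel
  nw_smooth <- function(x0, x, y, h) {
    w <- dnorm((x0 - x) / h)        # kernel weights K((x0 - x_i)/h)
    sum(w * y) / sum(w)             # weighted mean = best local constant m_x
  }

  # Illustration on simulated data (data and bandwidth are arbitrary choices)
  set.seed(1)
  x <- sort(runif(200)); y <- sin(2*pi*x) + rnorm(200, sd = 0.2)
  grid <- seq(0, 1, length = 200)
  plot(x, y); lines(grid, sapply(grid, nw_smooth, x = x, y = y, h = 0.05), col = 2)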

  2. Example: FTSE stock market index

  3. Example: Cosmic microwave background data

  4. Choosing the bandwidth

  5. Choosing the bandwidth
  ◮ Measure "success" of the fit using the mean squared error on new observations (MSE),
  $$\mathrm{MSE}(h) = E\left[(Y - \hat{m}_h(x))^2\right]$$
  ◮ Splitting into noise, bias and variance: $\mathrm{MSE}(h) = \text{Noise} + \text{Bias}^2 + \text{Variance}$
  ◮ Bias decreases as h ↓ 0
  ◮ Variance increases as h ↓ 0
  ◮ Choosing the bandwidth is a bias-variance trade-off

  6. Choosing the bandwidth
  ◮ But... we don't know MSE(h), as we just have the n observations and cannot generate new random observations Y
  ◮ A first idea is to compute, for various values of the bandwidth h:
    ◮ the estimator $\hat{m}_h$ on a training sample
    ◮ the training error
  $$n^{-1} \sum_{i=1}^n (Y_i - \hat{m}_h(x_i))^2 = n^{-1} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2,$$
  ◮ Choose the bandwidth with the smallest training error

  7. CMB data

  8. Choosing the bandwidth
  ◮ We would choose a bandwidth close to h = 0, giving near-perfect interpolation of the data, that is $\hat{m}(x_i) \approx Y_i$
  ◮ This is unsurprising
  ◮ Parametric context
    ◮ The shape of the model is fixed
    ◮ Minimising the MSE makes the parametric model as close as possible to the data
  ◮ Nonparametric setting
    ◮ We don't have a fixed shape; the value of h dictates the model
    ◮ Minimising the MSE → fitted model as close as possible to the data
    ◮ This leads us to choose h as small as possible → interpolation of the data
    ◮ Misleading result → only noise is fitted for very small bandwidths

  9. Cross-validation
  ◮ Solution... don't use $X_i$ to construct $\hat{m}(X_i)$
  ◮ This is the idea behind cross-validation
    ◮ Leave-one-out cross-validation
    ◮ Least squares cross-validation
  ◮ For each value of h:
    ◮ For each $i = 1, \dots, n$, compute the estimator $\hat{m}_h^{(-i)}(x)$, where $\hat{m}_h^{(-i)}(x)$ is computed without using observation i
    ◮ The estimated MSE is then given by
  $$\widehat{\mathrm{MSE}}(h) = n^{-1} \sum_i (Y_i - \hat{m}_h^{(-i)}(x_i))^2$$
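
  A rough R sketch of leave-one-out CV for the bandwidth, reusing the nw_smooth helper and the simulated (x, y) from the earlier sketch; the bandwidth grid is arbitrary.

  loocv_mse <- function(h, x, y) {
    errs <- sapply(seq_along(y), function(i)
      y[i] - nw_smooth(x[i], x[-i], y[-i], h))   # fit without observation i
    mean(errs^2)
  }
  hs <- seq(0.01, 0.2, length = 40)
  cv <- sapply(hs, loocv_mse, x = x, y = y)
  hs[which.min(cv)]                              # bandwidth minimising the estimated MSE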

  10. CMB data

  11. CMB data A bandwidth of 44 minimises the estimated MSE

  12. Cross-validation
  ◮ A drawback of LOO-CV is that it is expensive to compute
  ◮ The fit has to be recalculated n times (once for each left-out observation)
  ◮ We can avoid needing to calculate $\hat{m}^{(-i)}(x)$ for all i
  ◮ For some n × n matrix S, the linear smoother fulfills $\hat{Y} = SY$
  ◮ The risk (MSE) under LOO-CV can subsequently be written as
  $$\widehat{\mathrm{MSE}}(h) = n^{-1} \sum_{i=1}^n \left( \frac{Y_i - \hat{m}_h(x_i)}{1 - S_{ii}} \right)^2$$

  13. Cross-validation
  ◮ We do not need to recompute $\hat{m}_h$ while leaving out each of the n observations in turn
  ◮ The same results can be obtained much faster by rescaling the residuals $Y_i - \hat{m}_h(x_i)$ by the factor $1/(1 - S_{ii})$
  ◮ $S_{ii}$ is the i-th diagonal entry of the smoothing matrix

  14. Generalized Cross-Validation
  $$\widehat{\mathrm{MSE}}(h) = n^{-1} \sum_{i=1}^n \left( \frac{Y_i - \hat{m}_h(x_i)}{1 - S_{ii}} \right)^2,$$
  Replace $S_{ii}$ by its average $\nu/n$ (where $\nu = \sum_i S_{ii}$) and choose the bandwidth h that minimizes
  $$\mathrm{GCV}(h) = n^{-1} \sum_{i=1}^n \left( \frac{Y_i - \hat{m}_h(x_i)}{1 - \nu/n} \right)^2.$$
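
  A sketch of the shortcut formulas in R: for a linear smoother, $\hat{Y} = SY$, so LOO-CV and GCV need only the (diagonal of the) smoother matrix S, with no refitting. S is built here for the Nadaraya-Watson smoother at the observed x's; names and settings are illustrative.

  nw_smoother_matrix <- function(x, h) {
    W <- outer(x, x, function(a, b) dnorm((a - b) / h))
    W / rowSums(W)                   # row i holds the weights defining m_hat(x_i)
  }
  gcv <- function(h, x, y) {
    S   <- nw_smoother_matrix(x, h)
    res <- y - as.vector(S %*% y)
    nu  <- sum(diag(S))              # nu = tr(S)
    mean((res / (1 - nu / length(y)))^2)
  }
  hs <- seq(0.01, 0.2, length = 40)
  hs[which.min(sapply(hs, gcv, x = x, y = y))]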

  15. Local polynomial regression

  16. Nadaraya-Watson kernel estimator
  ◮ A major disadvantage of the Nadaraya-Watson kernel estimator is boundary bias
  ◮ The bias is of large order at the boundaries

  17. Local polynomial regression
  ◮ Even when a curve doesn't look like a polynomial globally, we can:
    ◮ restrict to a small neighbourhood of a given point, x
    ◮ approximate the curve by a polynomial in that neighbourhood
    ◮ fit its coefficients using only observations $X_i$ close to x (or rather, putting more weight on observations close to x)
    ◮ repeat this procedure at every point x where we want to estimate m(x)

  18. Local polynomial regression
  ◮ The kernel estimator approximates the data by taking local averages within small bandwidths
  ◮ Instead, use local linear regression to obtain the local approximation

  19. Kernel regression estimator
  Recall that the kernel regression estimator is the solution to
  $$\hat{m}(x) = \operatorname{argmin}_{m(x) \in \mathbb{R}} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)(Y_i - m(x))^2.$$
  This is given by
  $$\hat{m}(x) = \frac{\sum_{i=1}^n Y_i K\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^n K\left(\frac{x - X_j}{h}\right)}.$$
  Thus estimation corresponds to the solution of a weighted sum of squares problem

  20. Local polynomial regression
  Using a Taylor series, we can approximate m(x), where x is close to a point $x_0$, using the following polynomial:
  $$m(x) \approx m(x_0) + m^{(1)}(x_0)(x - x_0) + \frac{m^{(2)}(x_0)}{2!}(x - x_0)^2 + \dots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p$$
  $$= m(x_0) + \beta_1 (x - x_0) + \beta_2 (x - x_0)^2 + \dots + \beta_p (x - x_0)^p,$$
  where $m^{(k)}(x_0) = k!\,\beta_k$, provided that all the required derivatives exist

  21. Local polynomial regression
  ◮ Use the data to estimate the polynomial of degree p which best approximates $m(x_i)$ in a small neighbourhood around the point x
  ◮ Minimise with respect to $\beta_0, \beta_1, \dots, \beta_p$ the function
  $$\sum_{i=1}^n K\left(\frac{x - x_i}{h}\right) \{ Y_i - \beta_0 - \beta_1(x_i - x) - \dots - \beta_p(x_i - x)^p \}^2$$
  ◮ This is a weighted least squares problem, where the weights are given by the kernel functions $K((x - x_i)/h)$
  ◮ As $m^{(k)}(x) = k!\,\beta_k$, we then have that $m(x) = \beta_0$, or $\hat{m}(x) = \hat{\beta}_0$
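
  A sketch in R of local polynomial regression at a single point x0 as a weighted least squares problem (degree p, Gaussian kernel); the estimate is the fitted intercept $\beta_0$. The helper name locpoly_point and the settings are illustrative, and the simulated (x, y) from the earlier sketch are reused.

  locpoly_point <- function(x0, x, y, h, p = 1) {
    w <- dnorm((x - x0) / h)                 # kernel weights
    X <- outer(x - x0, 0:p, `^`)             # columns: 1, (x_i - x0), ..., (x_i - x0)^p
    beta <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
    beta[1]                                  # m_hat(x0) = beta_0
  }
  grid <- seq(0, 1, length = 200)
  fit_ll <- sapply(grid, locpoly_point, x = x, y = y, h = 0.05, p = 1)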

  22. CMB data Red: Kernel smoother ( p = 0) Green: Local linear regression ( p = 1)

  23. Boundary bias: kernel estimator
  Let $\ell_i(x) = \omega_i(x) / \sum_j \omega_j(x)$, so that
  $$\hat{m}(x) = \sum_i \ell_i(x) Y_i$$
  For the kernel smoother (p = 0), the bias of the linear smoother is thus
  $$\mathrm{Bias} = E(\hat{m}(x)) - m(x) = m'(x) \sum_i (x_i - x)\ell_i(x) + \frac{m''(x)}{2} \sum_i (x_i - x)^2 \ell_i(x) + R,$$

  24. Boundary bias: kernel estimator
  The first term in the expansion is equal to
  $$m'(x)\, \frac{\sum_i (x_i - x) K\left(\frac{x - x_i}{h}\right)}{\sum_i K\left(\frac{x - x_i}{h}\right)}$$
  ◮ It vanishes if the design points $x_i$ are centred symmetrically around x
  ◮ It does not vanish if x sits at the boundary (all $x_i - x$ will have the same sign)

  25. Boundary bias: polynomial estimator
  ◮ Suppose m(x) is truly a local polynomial of degree p
  ◮ and there are at least p + 1 points with positive weights in the neighbourhood of x
  The bias will hence be of order
  $$\mathrm{Bias} = E(\hat{m}(x)) - m(x) = \frac{m^{(p+1)}(x)}{(p+1)!} \sum_j (x_j - x)^{p+1} \ell_j(x) + R$$
  Why not choose p = 20?

  26. Boundary bias: polynomial estimator
  ◮ $Y_i = m(x_i) + \sigma\varepsilon_i$ with $\varepsilon_i \sim N(0, 1)$
  ◮ The variance of the linear smoother, $\hat{m}(x) = \sum_j \ell_j(x) Y_j$, is
  $$\mathrm{Var}(\hat{m}(x)) = \sigma^2 \sum_j \ell_j^2(x) = \sigma^2 \|\ell(x)\|^2$$
  ◮ $\|\ell(x)\|^2$ tends to be large if p is large
  ◮ In practice, p = 1 is a good choice

  27. Example: Doppler function
  $$m(x) = \sqrt{x(1 - x)}\, \sin\left(\frac{2.1\pi}{x + 0.05}\right), \qquad 0 \le x \le 1.$$

  28. Example: Doppler function
  > library(KernSmooth)   # provides locpoly() and dpill()
  > n <- 1000
  > x <- seq(0, 1, length = n)
  > m <- sqrt(x*(1-x)) * sin(2.1*pi/(x+0.05))
  > plot(x, m, type='l')
  > y <- m + rnorm(n)*0.075
  > plot(x, y)
  > fit <- locpoly(x, y, bandwidth = dpill(x,y)*2, degree = 1)
  > lines(fit, col=2)
  > plot(x, y)
  > fit2 <- locpoly(x, y, bandwidth = dpill(x,y)/2, degree = 1)
  > lines(fit2, col=2)
  > plot(x, y)
  > fit3 <- locpoly(x, y, bandwidth = dpill(x,y)/4, degree = 1)
  > lines(fit3, col=2)

  29. Example: Doppler function

  30. Penalised regression

  31. Penalised regression
  Regression model:
  $$Y_i = m(x_i) + \varepsilon_i, \qquad i = 1, \dots, n, \qquad E(\varepsilon_i) = 0$$
  Estimating m by choosing $\hat{m}(x)$ to minimize
  $$\sum_{i=1}^n (Y_i - \hat{m}(x_i))^2$$
  leads to
  ◮ the linear regression estimate if minimizing over all linear functions
  ◮ an interpolation of the data if minimizing over all functions.

  32. Penalised regression
  Estimate m by choosing $\hat{m}(x)$ to minimize
  $$\sum_{i=1}^n (Y_i - \hat{m}(x_i))^2 + \lambda J(m),$$
  where $J(m)$ is a roughness penalty, typically
  $$J(m) = \int (m''(x))^2\, dx.$$
  The parameter λ controls the trade-off between fit and penalty:
  ◮ For λ = 0: interpolation
  ◮ For λ → ∞: the linear least squares line
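
  A hedged sketch of the role of λ using smooth.spline on the simulated (x, y) from the earlier sketch. Note that smooth.spline interprets lambda on its own internal scale, so the particular values below are only illustrative: a tiny penalty nearly interpolates, a very large one approaches the least-squares line.

  plot(x, y)
  lines(smooth.spline(x, y, lambda = 1e-9), col = 2)   # near-interpolation
  lines(smooth.spline(x, y, lambda = 10),   col = 4)   # close to a straight line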

  33. Example: Doppler function

  34. Splines

  35. Splines
  ◮ Kernel regression
    ◮ The researcher isn't usually interested in actually calculating m(x) for a single location x
    ◮ m(x) is calculated on a sufficiently fine grid of x-values
    ◮ The curve is obtained by interpolation
  ◮ Local polynomial regression: the unknown mean function was assumed to be locally well approximated by a polynomial
  ◮ Alternative approach
    ◮ Represent the fit as a piecewise polynomial
    ◮ Pieces connect at points called knots
    ◮ Once the knots are selected, an estimator can be computed globally
    ◮ in a manner similar to that for a parametrically specified mean function
  This is the idea behind splines

  36. Splines
  IID sample $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$ coming from the model
  $$Y_i = m(X_i) + \varepsilon_i$$
  We want to estimate the mean of the variable Y with $m(x) = E(Y \mid X = x)$.
  A very naive estimator of $E(Y \mid X = x)$ would be the sample mean of Y:
  $$\hat{m}(x) = \frac{\sum_{i=1}^n Y_i}{n}$$
  Not very good (it is the same for all x)

  37. Splines
  Approximate m by piecewise polynomials, each on a small interval:
  $$\hat{m}(x) = \begin{cases} c_1 & \text{if } x < \xi_1 \\ c_2 & \text{if } \xi_1 \le x < \xi_2 \\ \vdots & \\ c_k & \text{if } \xi_{k-1} \le x < \xi_k \\ c_{k+1} & \text{if } x \ge \xi_k \end{cases}$$

  38. Splines
  Use more general lines, which join at the ξs:
  $$\hat{m}(x) = \begin{cases} a_1 + b_1 x & \text{if } x < \xi_1 \\ a_2 + b_2 x & \text{if } \xi_1 \le x < \xi_2 \\ \vdots & \\ a_k + b_k x & \text{if } \xi_{k-1} \le x < \xi_k \\ a_{k+1} + b_{k+1} x & \text{if } x \ge \xi_k \end{cases}$$
  The a's and b's are such that the lines join at each ξ

  39. Splines
  Approximate m(x) by polynomials:
  $$\hat{m}(x) = \begin{cases} \sum_{j=0}^p \beta_{1,j} x^j & \text{if } x < \xi_1 \\ \sum_{j=0}^p \beta_{2,j} x^j & \text{if } \xi_1 \le x < \xi_2 \\ \vdots & \\ \sum_{j=0}^p \beta_{k,j} x^j & \text{if } \xi_{k-1} \le x < \xi_k \\ \sum_{j=0}^p \beta_{k+1,j} x^j & \text{if } x \ge \xi_k \end{cases}$$
  The $\beta_j$s are such that the polynomials join at each ξ and the approximation has p − 1 continuous derivatives.
  A spline which is piecewise polynomial of degree p is called
  ◮ a spline of order p + 1, or
  ◮ a spline of degree p
  ◮ the ξ are the knots
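
  A minimal R sketch of a regression spline along these lines: piecewise cubic polynomials joined at a handful of knots, fitted by ordinary least squares through the B-spline basis (bs from the splines package). The knot positions here are arbitrary, and the simulated (x, y) from the earlier sketch are reused.

  library(splines)
  knots <- quantile(x, probs = c(0.25, 0.5, 0.75))
  fit_bs <- lm(y ~ bs(x, knots = knots, degree = 3))
  plot(x, y); lines(x, fitted(fit_bs), col = 2)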

  40. Splines

  41. Piecewise constant splines

  42. Knots
  How many knots should we have?
  ◮ Choose a lot of knots, well spread over the data range → reduces the bias of the estimator
  ◮ If we make the fit too local → the estimator will be too wiggly
  ◮ To overcome the bias problem without increasing the variance → take a lot of knots, but constrain their influence
  ◮ We can do this using penalised regression

  43. Spline order
  What order spline should we use?
  ◮ Increasing the value of p → makes the estimator $\hat{m}_p$ smoother (since it has p − 1 continuous derivatives)
  ◮ If we take p too large → we increase the number of parameters to estimate
  ◮ In practice it is rarely useful to take p > 3
    ◮ p = 2: splines of order three, or quadratic splines
    ◮ p = 3: splines of order 4, or cubic splines
  ◮ A p-th order spline is a piecewise polynomial of degree p − 1 with p − 2 continuous derivatives at the knots

  44. Natural splines
  ◮ A spline that is constrained to be linear beyond the boundary knots is called a natural spline
  ◮ Why this constraint?
    ◮ We usually have very few observations beyond the two extreme knots
    ◮ We still want to obtain an estimator of the regression curve there
    ◮ but cannot reasonably estimate anything reliable there
    ◮ Rather use a simplified model (e.g. linear)
    ◮ This often gives more or less reasonable results

  45. Natural cubic splines
  Let $\xi_1 < \xi_2 < \dots < \xi_n$ be a set of ordered points, the so-called knots, contained in an interval (a, b).
  A cubic spline is a continuous function m such that
  (i) m is a cubic polynomial over $(\xi_1, \xi_2), \dots$, and
  (ii) m has continuous first and second derivatives at the knots.
  The solution to minimizing
  $$\sum_{i=1}^n (y_i - \hat{m}(x_i))^2 + \lambda \int (\hat{m}''(x))^2\, dx$$
  is a natural cubic spline with knots at the data points; $\hat{m}(x)$ is called a smoothing spline

  46. Natural cubic splines
  ◮ Take a sequence of values $f_1, \dots, f_n$ at specified locations $x_1 < x_2 < \dots < x_n$
  ◮ Find a smooth curve g(x) that passes through the points $(x_1, f_1), (x_2, f_2), \dots, (x_n, f_n)$
  ◮ The natural cubic spline g is an interpolating function that satisfies the following conditions:
  (i) $g(x_j) = f_j$, $j = 1, \dots, n$,
  (ii) g(x) is cubic on each subinterval $(x_k, x_{k+1})$, $k = 1, \dots, (n-1)$,
  (iii) g(x) is continuous and has continuous first and second derivatives,
  (iv) $g''(x_1) = g''(x_n) = 0$.
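
  A quick R sketch of natural cubic spline interpolation through points $(x_j, f_j)$ using base R's splinefun with method = "natural" (zero second derivative at the end points, matching conditions (i)-(iv)); the interpolation points are made up for illustration.

  xs <- c(0, 0.2, 0.5, 0.7, 1)
  fs <- c(1, 0.4, -0.3, 0.2, 0.8)
  g  <- splinefun(xs, fs, method = "natural")
  g(xs)                                  # reproduces f_j exactly at the knots
  curve(g(x), 0, 1); points(xs, fs)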

  47. B-splines
  ◮ We need a basis for natural polynomial splines
  ◮ A convenient choice is the so-called B-spline basis
  ◮ Take data points $a = \xi_0 < \xi_1 < \xi_2 < \dots < \xi_n \le \xi_{n+1} = b$ in (a, b)
    ◮ These are n + 2 real values
    ◮ The n ≥ 0 interior points are called 'interior knots' or 'control points'
    ◮ And there are always two endpoints, $\xi_0$ and $\xi_{n+1}$
  ◮ When the knots are equidistant they are said to be 'uniform'

  48. B-splines
  Now define new knots τ as:
  ◮ $\tau_1 \le \dots \le \tau_p = \xi_0 = a$
  ◮ $\tau_{j+p} = \xi_j$
  ◮ $b = \xi_{n+1} = \tau_{n+p+1} \le \tau_{n+p+2} \le \dots \le \tau_{n+2p}$
  ◮ p: order of the polynomial; p + 1 is the order of the spline
  ◮ Append the lower and upper boundary knots $\xi_0$ and $\xi_{n+1}$ p times each
  ◮ This is needed due to the recursive nature of B-splines

  49. B-splines
  Define recursively:
  ◮ For k = 0 and $i = 1, \dots, n + 2p$:
  $$B_{i,0}(x) = \begin{cases} 1 & \tau_i \le x < \tau_{i+1} \\ 0 & \text{otherwise} \end{cases}$$
  ◮ For $k = 1, 2, \dots, p$ and $i = 1, \dots, n + 2p$:
  $$B_{i,k}(x) = \frac{x - \tau_i}{\tau_{i+k-1} - \tau_i} B_{i,k-1}(x) + \frac{\tau_{i+k} - x}{\tau_{i+k} - \tau_{i+1}} B_{i+1,k-1}(x)$$
  The support of $B_{i,k}(x)$ is $[\tau_i, \tau_{i+k}]$
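
  A hedged R sketch of evaluating a cubic B-spline basis (p = 3) on a uniform knot grid, with the boundary knots repeated as described above, via splines::splineDesign (ord = p + 1). Exact conventions for indexing and the number of repeated boundary knots vary between references, so this is illustrative rather than a literal transcription of the recursion.

  library(splines)
  p   <- 3
  xi  <- seq(0, 1, by = 0.2)                  # knots in [0, 1]
  tau <- c(rep(0, p), xi, rep(1, p))          # boundary knots appended p times
  xg  <- seq(0, 1, length = 200)
  B   <- splineDesign(knots = tau, x = xg, ord = p + 1)
  matplot(xg, B, type = "l", lty = 1)         # one curve per basis function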

  50. B-splines

  51. Solving
  ◮ The solution depends on the regularization parameter λ
  ◮ λ determines the amount of "roughness"
  ◮ Choosing λ isn't necessarily intuitive
  ◮ Degrees of freedom = trace of the smoothing matrix S (the sum of its eigenvalues), where
  $$S = B(B^T B + \lambda\Omega)^{-1} B^T$$
  ◮ There is a monotone relationship between df and λ
  ◮ Search for the value of λ that gives the desired df
    ◮ df = 2 → linear regression
    ◮ df = n → interpolate the data exactly
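
  In R, smooth.spline lets us request the equivalent degrees of freedom directly instead of λ; a brief sketch on the simulated (x, y) from the earlier sketch, with df values chosen purely for illustration.

  fit_df5  <- smooth.spline(x, y, df = 5)     # heavily penalised, very smooth
  fit_df25 <- smooth.spline(x, y, df = 25)    # lighter penalty, wigglier fit
  plot(x, y); lines(fit_df5, col = 2); lines(fit_df25, col = 4)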

  52. Example: Doppler function

  53. Example: Doppler function
  We could of course choose λ by LOO-CV or GCV

  54. Cross validation
  > plot(x, y)
  > fitcv <- smooth.spline(x, y, cv=T)
  > lines(fitcv, col=2)
  > fitcv
  Call:
  smooth.spline(x = x, y = y, cv = T)
  Smoothing Parameter  spar= 0.157514  lambda= 2.291527e-08 (16 iterations)
  Equivalent Degrees of Freedom (Df): 124.738
  Penalized Criterion: 6.071742
  PRESS: 0.007898575

  55. Generalised cross validation
  > plot(x, y)
  > fitgcv <- smooth.spline(x, y, cv=F)
  > lines(fitgcv, col=4)
  > fitgcv
  Call:
  smooth.spline(x = x, y = y, cv = F)
  Smoothing Parameter  spar= 0.1597504  lambda= 2.378386e-08 (15 iterations)
  Equivalent Degrees of Freedom (Df): 124.2353
  Penalized Criterion: 6.078626
  GCV: 0.007925571

  56. Multivariate smoothing

  57. Multivariate smoothing
  ◮ So far we have only considered univariate functions
  ◮ Suppose there are several predictors that we would like to treat nonparametrically
  ◮ Most 'interesting' statistical problems nowadays are high-dimensional, with easily p > 1000
    ◮ Biology: microarrays, gene maps, network inference
    ◮ Finance: prediction from multivariate time series
    ◮ Physics: climate models
  ◮ Can we just extend the methods and model functions $\mathbb{R}^p \mapsto \mathbb{R}$ nonparametrically?

  58. Curse of dimensionality
  ◮ One might consider multidimensional smoothers aimed at estimating
  $$Y = m(x_1, x_2, \dots, x_p)$$
  ◮ The methods considered so far rely on 'local' approximations
  ◮ They examine the behaviour of data points in the neighbourhood of the point of interest
  ◮ But what is 'local' and what is a 'neighbourhood' if p → ∞ with n constant?

  59. Curse of dimensionality
  Let $x = (x^{(1)}, x^{(2)}, \dots, x^{(p)}) \in [0, 1]^p$. To get 5% of all n sample points into a cube-shaped neighbourhood of x, we need a cube with side length $0 < \ell < 1$ such that $\ell^p \ge 0.05$:

  Dimension p   Side length ℓ
  1             0.05
  2             0.22
  5             0.54
  10            0.74
  1000          0.997
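
  The table can be reproduced in one line of R, since the required side length is $\ell = 0.05^{1/p}$:

  p <- c(1, 2, 5, 10, 1000)
  round(0.05^(1 / p), 3)
  # 0.050 0.224 0.549 0.741 0.997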

  60. Additive models
  Require the function $m: \mathbb{R}^p \mapsto \mathbb{R}$ to be of the form
  $$m_{\mathrm{add}}(x) = \mu + m_1(x^{(1)}) + m_2(x^{(2)}) + \dots + m_p(x^{(p)}) = \mu + \sum_{j=1}^p m_j(x^{(j)}), \qquad \mu \in \mathbb{R},$$
  where each $m_j(\cdot): \mathbb{R} \mapsto \mathbb{R}$ is just a univariate nonparametric function with $E[m_j(x^{(j)})] = 0$, $j = 1, \dots, p$.
  ◮ The choice of smoother is left open
  ◮ Avoids the curse of dimensionality → but is 'less flexible'
  ◮ The functions can be estimated by 'backfitting'
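
  A hedged R sketch of fitting an additive model with mgcv's gam: each s(.) term is a univariate smooth $m_j$. Note that mgcv uses penalised regression splines rather than the generic backfitting described on the following slides, but the fitted model has the same additive form. The data are simulated here purely for illustration.

  library(mgcv)
  set.seed(2)
  n  <- 500
  x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
  yy <- 1 + sin(2*pi*x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)   # x3 has no effect
  fit_add <- gam(yy ~ s(x1) + s(x2) + s(x3))
  plot(fit_add, pages = 1)      # estimated centred m_1, m_2, m_3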

  61. Backfitting
  Data $x_i^{(j)}$, $1 \le i \le n$ and $1 \le j \le p$.
  A linear smoother for variable j can be described by an n × n matrix $S^{(j)}$, so that
  $$\hat{m}_j = S^{(j)} Y,$$
  ◮ $Y = (Y_1, \dots, Y_n)^T$: the observed vector of responses
  ◮ $\hat{m}_j = (\hat{m}_j(x_1^{(j)}), \dots, \hat{m}_j(x_n^{(j)}))$: the regression fit
  ◮ $S^{(j)}$: smoother with bandwidth estimated by LOO-CV or GCV

  62. Backfitting
  $$\hat{m}_{\mathrm{add}}(x) = \hat{\mu} + \sum_{j=1}^p \hat{m}_j(x^{(j)}).$$
  Suppose $\hat{\mu}$ and $\hat{m}_k$ are given for all $k \ne j$:
  $$\hat{m}_{\mathrm{add}}(x_i) = \hat{\mu} + \hat{m}_j(x_i^{(j)}) + \sum_{k \ne j} \hat{m}_k(x_i^{(k)}).$$
  The idea is now to apply the smoother $S^{(j)}$ to
  $$Y - \Big( \hat{\mu} + \sum_{k \ne j} \hat{m}_k \Big)$$
  and cycle through all $j = 1, \dots, p$ to get
  $$\hat{m}_{\mathrm{add}}(x_i) = \hat{\mu} + \sum_{j=1}^p \hat{m}_j(x_i^{(j)}).$$

  63. Backfitting
  Start with $\hat{\mu} \leftarrow n^{-1} \sum_{i=1}^n Y_i$.
  1. Use $\hat{m}_j \equiv 0$ for all $j = 1, \dots, p$.
  2. Cycle through the indices $j = 1, 2, \dots, p, 1, 2, \dots, p, \dots$:
     $$\hat{m}_j \leftarrow S^{(j)}\Big( Y - \hat{\mu}\mathbf{1} - \sum_{k \ne j} \hat{m}_k \Big).$$
     Also normalize
     $$\hat{m}_j(\cdot) \leftarrow \hat{m}_j(\cdot) - n^{-1} \sum_{i=1}^n \hat{m}_j(x_i^{(j)})$$
     and update $\hat{\mu} \leftarrow n^{-1} \sum_{i=1}^n \big( Y_i - \sum_k \hat{m}_k(x_i^{(k)}) \big)$.
     Stop the iterations if the functions do not change very much.
  3. Return
     $$\hat{m}_{\mathrm{add}}(x_i) \leftarrow \hat{\mu} + \sum_{j=1}^p \hat{m}_j(x_i^{(j)}).$$
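
  A rough R transcription of this backfitting algorithm, with smooth.spline (smoothing parameter chosen by GCV) playing the role of the univariate smoother $S^{(j)}$; the function and variable names are illustrative, and a fixed number of sweeps stands in for the stopping rule.

  backfit <- function(X, y, n_iter = 20) {
    n <- nrow(X); p <- ncol(X)
    mu   <- mean(y)
    mhat <- matrix(0, n, p)                        # column j holds m_j(x_i^(j))
    for (it in 1:n_iter) {
      for (j in 1:p) {
        partial   <- y - mu - rowSums(mhat[, -j, drop = FALSE])
        fit       <- smooth.spline(X[, j], partial)
        mhat[, j] <- predict(fit, X[, j])$y
        mhat[, j] <- mhat[, j] - mean(mhat[, j])   # normalise m_j to mean zero
      }
      mu <- mean(y - rowSums(mhat))                # update the intercept
    }
    list(mu = mu, m = mhat, fitted = mu + rowSums(mhat))
  }
  # e.g. with the simulated data from the additive-model sketch above:
  # bf <- backfit(cbind(x1, x2, x3), yy)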
