Local least squares
We can rewrite the kernel regression estimator as
$$\hat m(x) = \operatorname*{argmin}_{m_x \in \mathbb{R}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) (Y_i - m_x)^2.$$
Exercise: Can be verified by solving $\frac{d}{dm_x} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)(Y_i - m_x)^2 = 0$
◮ Thus, for every fixed x, we have to search for the best local constant $m_x$ such that the localized sum of squares is minimized
◮ Localization is here described by the kernel and gives a large weight to those observations $(x_i, Y_i)$ where $x_i$ is close to the point x of interest. The choice of the bandwidth h is crucial
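As a minimal illustration, the local constant minimiser can be coded directly: the closed-form solution is just a kernel-weighted mean of the responses. The Gaussian kernel, the bandwidth h = 0.05 and the simulated data below are illustrative choices.

# Local constant (Nadaraya-Watson) fit at a single point x0:
# the minimiser of sum_i K((x0 - x_i)/h) * (Y_i - m)^2 over m
# is the kernel-weighted mean of the Y_i
kernreg <- function(x0, x, y, h) {
  w <- dnorm((x0 - x) / h)        # Gaussian kernel weights
  sum(w * y) / sum(w)             # closed-form weighted least squares solution
}

# illustration on simulated data
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
xgrid <- seq(0, 1, length = 100)
mhat <- sapply(xgrid, kernreg, x = x, y = y, h = 0.05)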
Example: FTSE stock market index
Example: Cosmic microwave background data
Choosing the bandwidth
Choosing the bandwidth
◮ Measure “success” of fit using the mean squared error (MSE) on new observations,
$$MSE(h) = E\left[(Y - \hat m_h(x))^2\right],$$
◮ Splitting into noise, bias and variance: $MSE(h) = \text{Noise} + \text{Bias}^2 + \text{Variance}$
◮ Bias decreases if h ↓ 0
◮ Variance increases if h ↓ 0
◮ Choosing the bandwidth is a bias-variance trade-off.
Choosing the bandwidth
◮ But... we don’t know MSE(h), as we just have the n observations and cannot generate new random observations Y
◮ First idea is to compute, for various values of the bandwidth h:
◮ The estimator $\hat m_h$ for a training sample
◮ The training error
$$n^{-1}\sum_{i=1}^{n} (Y_i - \hat m_h(x_i))^2 = n^{-1}\sum_{i=1}^{n} (Y_i - \hat Y_i)^2,$$
◮ Choose the bandwidth with the smallest training error
CMB data
Choosing the bandwidth
◮ We would choose a bandwidth close to h = 0, giving near-perfect interpolation of the data, that is $\hat m(x_i) \approx Y_i$
◮ This is unsurprising
◮ Parametric context
◮ Shape of the model is fixed
◮ Minimising the MSE makes the parametric model as close as possible to the data
◮ Nonparametric setting
◮ Don’t have a fixed shape
◮ Value of h dictates the model
◮ Minimising the MSE → fitted model as close as possible to the data
◮ Leads us to choose h as small as possible
◮ Interpolation of the data
◮ Misleading result → only noise is fitted for very small bandwidths
Cross-validation
◮ Solution... don’t use $X_i$ to construct $\hat m(X_i)$
◮ This is the idea behind cross-validation
◮ Leave-one-out cross-validation
◮ Least squares cross-validation
◮ For each value of h
◮ For each i = 1, ..., n, compute the estimator $\hat m_h^{(-i)}(x)$, where $\hat m_h^{(-i)}(x)$ is computed without using observation i
◮ The estimated MSE is then given by
$$\widehat{MSE}(h) = n^{-1}\sum_i \left(Y_i - \hat m_h^{(-i)}(x_i)\right)^2$$
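A direct (and deliberately slow) sketch of leave-one-out cross-validation for a Nadaraya-Watson fit with a Gaussian kernel; the candidate bandwidth grid and the data x, y are the illustrative ones simulated above.

# naive LOO-CV: refit without observation i, predict Y_i, average the squared errors
loocv_mse <- function(h, x, y) {
  n <- length(y)
  pred <- sapply(seq_len(n), function(i) {
    w <- dnorm((x[i] - x[-i]) / h)      # kernel weights of the remaining points
    sum(w * y[-i]) / sum(w)             # leave-one-out Nadaraya-Watson prediction
  })
  mean((y - pred)^2)
}

hs <- seq(0.01, 0.2, length = 30)       # illustrative candidate bandwidths
cv <- sapply(hs, loocv_mse, x = x, y = y)
h_cv <- hs[which.min(cv)]               # bandwidth minimising the estimated MSE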
CMB data
CMB data A bandwidth of 44 minimises the estimated MSE
Cross-validation
◮ A drawback of LOO-CV is that it is expensive to compute
◮ Fit has to be recalculated n times (once for each left-out observation)
◮ We can avoid needing to calculate $\hat m^{(-i)}(x)$ for all i
◮ For some n × n matrix S, the linear smoother fulfills $\hat Y = SY$
◮ The risk (MSE) under LOO-CV can subsequently be written as
$$\widehat{MSE}(h) = n^{-1}\sum_{i=1}^{n} \left(\frac{Y_i - \hat m_h(x_i)}{1 - S_{ii}}\right)^2$$
Cross-validation
◮ Do not need to recompute $\hat m_h$ while leaving out each of the n observations in turn
◮ Results can be obtained much faster by rescaling the residuals $Y_i - \hat m_h(x_i)$ by the factor $1/(1 - S_{ii})$
◮ $S_{ii}$ is the i-th diagonal entry of the smoothing matrix S
Generalized Cross-Validation
$$\widehat{MSE}(h) = n^{-1}\sum_{i=1}^{n} \left(\frac{Y_i - \hat m_h(x_i)}{1 - S_{ii}}\right)^2,$$
Replace $S_{ii}$ by its average $\nu/n$ (where $\nu = \sum_i S_{ii}$)
Choose the bandwidth h that minimizes
$$GCV(h) = n^{-1}\sum_{i=1}^{n} \left(\frac{Y_i - \hat m_h(x_i)}{1 - \nu/n}\right)^2.$$
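A sketch of how the smoother-matrix shortcut and GCV can be computed with a single fit per bandwidth, again for a Nadaraya-Watson smoother with Gaussian kernel and the illustrative data x, y from the earlier sketches.

# smoother matrix S at the design points: row i holds the weights forming mhat(x_i)
smoother_matrix <- function(x, h) {
  W <- outer(x, x, function(a, b) dnorm((a - b) / h))
  W / rowSums(W)                              # rows sum to 1
}

gcv_score <- function(h, x, y) {
  S <- smoother_matrix(x, h)
  res <- y - S %*% y                          # residuals from a single fit
  nu <- sum(diag(S))                          # trace of S
  loocv <- mean((res / (1 - diag(S)))^2)      # LOO-CV via the 1 - S_ii rescaling
  gcv   <- mean((res / (1 - nu / length(y)))^2)
  c(loocv = loocv, gcv = gcv)
}

# evaluate over a grid of candidate bandwidths and pick the GCV minimiser
hs <- seq(0.01, 0.2, length = 30)
scores <- sapply(hs, gcv_score, x = x, y = y)
h_gcv <- hs[which.min(scores["gcv", ])]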
Local polynomial regression
Nadaraya-Watson kernel estimator
◮ Major disadvantage of the Nadaraya-Watson kernel estimator → boundary bias
◮ The bias is of larger order at the boundaries
Local polynomial regression
◮ Even when a curve doesn’t look like a polynomial
◮ Restrict to a small neighbourhood of a given point, x
◮ Approximate the curve by a polynomial in that neighbourhood
◮ Fit its coefficients using only observations $X_i$ close to x
◮ (or rather, putting more weight on observations close to x)
◮ Repeat this procedure at every point x where we want to estimate m(x)
Local polynomial regression ◮ Kernel estimator approximates the data by taking local averages within small bandwidths ◮ Use local linear regression to obtain an approximation
Kernel regression estimator
Recall that the kernel regression estimator is the solution to:
$$\hat m(x) = \operatorname*{argmin}_{m(x) \in \mathbb{R}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)(Y_i - m(x))^2$$
This is given by
$$\hat m(x) = \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h}\right)}.$$
Thus estimation corresponds to the solution of a weighted sum of squares problem
Local polynomial regression
Using a Taylor series, we can approximate m(x), where x is close to a point $x_0$, by the following polynomial:
$$m(x) \approx m(x_0) + m^{(1)}(x_0)(x - x_0) + \frac{m^{(2)}(x_0)}{2!}(x - x_0)^2 + \dots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p$$
$$= m(x_0) + \beta_1 (x - x_0) + \beta_2 (x - x_0)^2 + \dots + \beta_p (x - x_0)^p$$
where $m^{(k)}(x_0) = k!\,\beta_k$, provided that all the required derivatives exist
Local polynomial regression
◮ Use the data to estimate that polynomial of degree p which best approximates $m(x_i)$ in a small neighbourhood around the point x
◮ Minimise with respect to $\beta_0, \beta_1, \dots, \beta_p$ the function:
$$\sum_{i} K\!\left(\frac{x - x_i}{h}\right)\{Y_i - \beta_0 - \beta_1(x_i - x) - \dots - \beta_p (x_i - x)^p\}^2$$
◮ Weighted least squares problem, where the weights are given by the kernel functions $K((x - x_i)/h)$
◮ As $m^{(k)}(x) = k!\,\beta_k$, we then have that $m(x) = \beta_0$, or $\hat m(x) = \hat\beta_0$
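A sketch of the local polynomial fit at a single point as a kernel-weighted least squares problem, using base R's lm.wfit; the Gaussian kernel, the bandwidth, the degree and the data x, y from the earlier sketches are illustrative assumptions.

# local polynomial fit of degree p at x0: weighted least squares with
# kernel weights K((x0 - x_i)/h); the intercept estimate is mhat(x0)
locpoly_at <- function(x0, x, y, h, p = 1) {
  w <- dnorm((x0 - x) / h)
  X <- outer(x - x0, 0:p, "^")      # columns 1, (x_i - x0), ..., (x_i - x0)^p
  fit <- lm.wfit(X, y, w)
  fit$coefficients[1]               # beta0-hat = mhat(x0)
}

xgrid <- seq(0, 1, length = 100)
mhat_ll <- sapply(xgrid, locpoly_at, x = x, y = y, h = 0.05, p = 1)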
CMB data Red: Kernel smoother ( p = 0) Green: Local linear regression ( p = 1)
Boundary bias: kernel estimator
Let $\ell_i(x) = \omega_i(x) / \sum_j \omega_j(x)$, so that
$$\hat m(x) = \sum_i \ell_i(x)\, Y_i$$
For the kernel smoother (p = 0), the bias of the linear smoother is thus
$$\text{Bias} = E(\hat m(x)) - m(x) = m'(x)\sum_i (x_i - x)\,\ell_i(x) + \frac{m''(x)}{2}\sum_i (x_i - x)^2\,\ell_i(x) + R,$$
Boundary bias: kernel estimator
The first term in the expansion is equal to
$$m'(x)\,\frac{\sum_i (x_i - x)\, K\!\left(\frac{x - x_i}{h}\right)}{\sum_i K\!\left(\frac{x - x_i}{h}\right)}$$
◮ vanishes if the design points $x_i$ are centred symmetrically around x
◮ does not vanish if x sits at the boundary (all $x_i - x$ will have the same sign)
Boundary bias: polynomial estimator
◮ Suppose m(x) is truly a local polynomial of degree p
◮ and there are at least p + 1 points with positive weights in the neighbourhood of x
The bias will hence be of order
$$\text{Bias} = E(\hat m(x)) - m(x) = \frac{m^{(p+1)}(x)}{(p+1)!}\sum_j (x_j - x)^{p+1}\,\ell_j(x) + R$$
Why not choose p = 20?
Boundary bias: polynomial estimator
◮ $Y_i = m(x_i) + \sigma\varepsilon_i$ with $\varepsilon_i \sim N(0, 1)$
◮ Variance of the linear smoother, $\hat m(x) = \sum_j \ell_j(x)\, Y_j$, is
$$\operatorname{Var}(\hat m(x)) = \sigma^2 \sum_j \ell_j^2(x) = \sigma^2 \|\ell(x)\|^2$$
◮ $\|\ell(x)\|^2$ tends to be large if p is large
◮ In practice, p = 1 is a good choice
Example: Doppler function
$$m(x) = \sqrt{x(1 - x)}\,\sin\!\left(\frac{2.1\,\pi}{x + 0.05}\right), \qquad 0 \le x \le 1.$$
Example: Doppler function
> library(KernSmooth)   # provides dpill() and locpoly()
> n <- 1000
> x <- seq(0, 1, length = n)
> m <- sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
> plot(x, m, type = "l")
> y <- m + rnorm(n) * 0.075
> # local linear fits with twice, half and a quarter of the plug-in bandwidth
> plot(x, y)
> fit <- locpoly(x, y, bandwidth = dpill(x, y) * 2, degree = 1)
> lines(fit, col = 2)
> plot(x, y)
> fit2 <- locpoly(x, y, bandwidth = dpill(x, y) / 2, degree = 1)
> lines(fit2, col = 2)
> plot(x, y)
> fit3 <- locpoly(x, y, bandwidth = dpill(x, y) / 4, degree = 1)
> lines(fit3, col = 2)
Example: Doppler function
Penalised regression
Penalised regression
Regression model
$$Y_i = m(x_i) + \varepsilon_i, \qquad E(\varepsilon_i) = 0, \qquad i = 1, \dots, n$$
Estimating m by choosing $\hat m(x)$ to minimize
$$\sum_{i=1}^{n} (Y_i - \hat m(x_i))^2$$
leads to
◮ the linear regression estimate if minimizing over all linear functions
◮ an interpolation of the data if minimizing over all functions.
Penalised regression
Estimate m by choosing $\hat m(x)$ to minimize
$$\sum_{i=1}^{n} (Y_i - \hat m(x_i))^2 + \lambda J(m),$$
where $J(m)$ is a roughness penalty, typically
$$J(m) = \int (m''(x))^2\, dx.$$
The parameter λ controls the trade-off between fit and penalty
◮ For λ = 0: interpolation
◮ For λ → ∞: linear least squares line
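A small illustration of the two extremes of this trade-off with smooth.spline, which fits a penalised criterion with the $\int (m'')^2$ roughness penalty; the simulated data and the two λ values below are illustrative.

set.seed(1)
x <- seq(0, 1, length = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit_small <- smooth.spline(x, y, lambda = 1e-10)  # near-zero penalty: close to interpolation
fit_large <- smooth.spline(x, y, lambda = 1e4)    # huge penalty: essentially the least squares line

plot(x, y)
lines(fit_small, col = 2)
lines(fit_large, col = 4)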
Example: Doppler function
Splines
Splines
◮ Kernel regression
◮ Researcher isn’t interested in actually calculating m(x) for a single location x
◮ m(x) calculated on a sufficiently fine grid of x-values
◮ Curve obtained by interpolation
◮ Local polynomial regression: the unknown mean function was assumed to be locally well approximated by a polynomial
◮ Alternative approach
◮ Represent the fit as a piecewise polynomial
◮ Pieces connecting at points called knots
◮ Once the knots are selected, an estimator can be computed globally
◮ In a manner similar to that for a parametrically specified mean function
This is the idea behind splines
Splines
IID sample $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$ coming from the model
$$Y_i = m(X_i) + \varepsilon_i$$
Want to estimate the mean of the variable Y with $m(x) = E(Y \mid X = x)$
A very naive estimator of $E(Y \mid X = x)$ would be the sample mean of the $Y_i$:
$$\hat m(x) = \frac{\sum_{i=1}^{n} Y_i}{n}$$
Not very good (same for all x)
Splines
Approximate m by piecewise polynomials, each on a small interval:
$$\hat m(x) = \begin{cases} c_1 & \text{if } x < \xi_1 \\ c_2 & \text{if } \xi_1 \le x < \xi_2 \\ \;\vdots \\ c_k & \text{if } \xi_{k-1} \le x < \xi_k \\ c_{k+1} & \text{if } x \ge \xi_k \end{cases}$$
Splines
Use more general lines, which join at the ξs:
$$\hat m(x) = \begin{cases} a_1 + b_1 x & \text{if } x < \xi_1 \\ a_2 + b_2 x & \text{if } \xi_1 \le x < \xi_2 \\ \;\vdots \\ a_k + b_k x & \text{if } \xi_{k-1} \le x < \xi_k \\ a_{k+1} + b_{k+1} x & \text{if } x \ge \xi_k \end{cases}$$
The a’s and b’s are such that the lines join at each ξ
Splines
Approximate m(x) by polynomials:
$$\hat m(x) = \begin{cases} \sum_{j=0}^{p} \beta_{1,j}\, x^j & \text{if } x < \xi_1 \\ \sum_{j=0}^{p} \beta_{2,j}\, x^j & \text{if } \xi_1 \le x < \xi_2 \\ \;\vdots \\ \sum_{j=0}^{p} \beta_{k,j}\, x^j & \text{if } \xi_{k-1} \le x < \xi_k \\ \sum_{j=0}^{p} \beta_{k+1,j}\, x^j & \text{if } x \ge \xi_k \end{cases}$$
The $\beta_{\cdot,j}$s are such that the polynomials join at each ξ and the approximation has p − 1 continuous derivatives
Splines which are piecewise polynomials of degree p are called
◮ splines of order p + 1, or
◮ splines of degree p
◮ ξ: knots
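A sketch of fitting such a piecewise polynomial (regression spline) by ordinary least squares on a B-spline basis, using the splines package; the knot locations, the degree and the simulated x, y from the earlier sketch are illustrative choices.

library(splines)

# cubic regression spline: piecewise degree-3 polynomials joining smoothly
# at the interior knots, fitted globally by ordinary least squares
knots <- seq(0.1, 0.9, by = 0.1)                 # illustrative knot locations
fit <- lm(y ~ bs(x, knots = knots, degree = 3))

plot(x, y)
xgrid <- seq(0, 1, length = 200)
lines(xgrid, predict(fit, newdata = data.frame(x = xgrid)), col = 2)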
Splines
Piecewise constant splines
Knots
How many knots should we have?
◮ Choose a lot of knots, well spread over the data range → reduce the bias of the estimator
◮ If we make it too local → the estimator will be too wiggly
◮ Overcome the bias problem without increasing the variance → take a lot of knots, but constrain their influence
◮ We can do this using penalised regression
Spline order
What order spline should we use?
◮ Increasing the value of p → makes the estimator $\hat m_p$ smoother (since it has p − 1 continuous derivatives)
◮ If we take p too large → increases the number of parameters to estimate
◮ In practice rarely useful to take p > 3
◮ p = 2
◮ splines of order 3, or quadratic splines
◮ p = 3
◮ splines of order 4, or cubic splines
◮ A p-th order spline is a piecewise polynomial of degree p − 1 with p − 2 continuous derivatives at the knots
Natural splines
◮ A spline that is constrained to be linear beyond the boundary knots is called a natural spline
◮ Why this constraint?
◮ We usually have very few observations beyond the two extreme knots
◮ Want to obtain an estimator of the regression curve there
◮ Cannot reasonably estimate anything reliable there
◮ Rather use a simplified model (e.g. linear)
◮ Often gives more or less reasonable results
Natural cubic splines
Let $\xi_1 < \xi_2 < \dots < \xi_n$ be a set of ordered points, so-called knots, contained in an interval (a, b)
A cubic spline is a continuous function m such that
(i) m is a cubic polynomial over $(\xi_1, \xi_2), \dots$, and
(ii) m has continuous first and second derivatives at the knots.
The minimizer of
$$\sum_{i=1}^{n} (y_i - \hat m(x_i))^2 + \lambda \int (\hat m''(x))^2\, dx$$
is a natural cubic spline with knots at the data points
$\hat m(x)$ is called a smoothing spline
Natural cubic splines
◮ Sequence of values $f_1, \dots, f_n$ at specified locations $x_1 < x_2 < \dots < x_n$
◮ Find a smooth curve g(x) that passes through the points $(x_1, f_1), (x_2, f_2), \dots, (x_n, f_n)$
◮ The natural cubic spline g is an interpolating function that satisfies the following conditions:
(i) $g(x_j) = f_j$, j = 1, ..., n,
(ii) g(x) is cubic on each subinterval $(x_k, x_{k+1})$, k = 1, ..., n − 1,
(iii) g(x) is continuous and has continuous first and second derivatives,
(iv) $g''(x_1) = g''(x_n) = 0$.
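Base R can build this interpolating natural cubic spline directly via splinefun with method = "natural", which enforces the zero second derivative at the two end points; the data points below are illustrative.

# natural cubic spline interpolation through (x_j, f_j)
xs <- c(0, 0.2, 0.5, 0.7, 1)
fs <- c(0, 1.2, 0.4, 0.9, 0.1)

g <- splinefun(xs, fs, method = "natural")   # g(x_j) = f_j, g'' = 0 at the ends

curve(g(x), from = 0, to = 1)
points(xs, fs)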
B-splines
◮ Need a basis for natural polynomial splines
◮ Convenient is the so-called B-spline basis
◮ Data points $a = \xi_0 < \xi_1 < \xi_2 < \dots < \xi_n \le \xi_{n+1} = b$ in [a, b]
◮ There are n + 2 real values
◮ The n ≥ 0 points $\xi_1, \dots, \xi_n$ are called ‘interior knots’ or ‘control points’
◮ And there are always two endpoints, $\xi_0$ and $\xi_{n+1}$
◮ When the knots are equidistant they are said to be ‘uniform’
B-splines
Now define new knots τ as
◮ $\tau_1 \le \dots \le \tau_p = \xi_0 = a$
◮ $\tau_{j+p} = \xi_j$
◮ $b = \xi_{n+1} = \tau_{n+p+1} \le \tau_{n+p+2} \le \dots \le \tau_{n+2p}$
◮ p: degree of the polynomial
◮ p + 1 is the order of the spline
◮ Append the lower and upper boundary knots $\xi_0$ and $\xi_{n+1}$ p times
◮ Needed due to the recursive nature of B-splines
B-splines
Define recursively (the Cox-de Boor recursion)
◮ For k = 0 and i = 1, ..., n + 2p:
$$B_{i,0}(x) = \begin{cases} 1 & \tau_i \le x < \tau_{i+1} \\ 0 & \text{otherwise} \end{cases}$$
◮ For k = 1, 2, ..., p and i = 1, ..., n + 2p − k:
$$B_{i,k}(x) = \frac{x - \tau_i}{\tau_{i+k} - \tau_i}\, B_{i,k-1}(x) + \frac{\tau_{i+k+1} - x}{\tau_{i+k+1} - \tau_{i+1}}\, B_{i+1,k-1}(x)$$
The support of $B_{i,k}(x)$ is $[\tau_i, \tau_{i+k+1}]$
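A sketch of the recursion in R, checked against splines::splineDesign, which constructs the same basis directly. The degree, the interior knots and the evaluation grid are illustrative; each boundary knot is repeated p + 1 times here, the usual choice for a complete cubic basis on [0, 1].

# Cox-de Boor recursion for the B-spline basis of degree k on knot vector tau
bspline <- function(x, tau, i, k) {
  if (k == 0) return(as.numeric(tau[i] <= x & x < tau[i + 1]))
  d1 <- tau[i + k] - tau[i]
  d2 <- tau[i + k + 1] - tau[i + 1]
  a <- if (d1 > 0) (x - tau[i]) / d1 * bspline(x, tau, i, k - 1) else 0
  b <- if (d2 > 0) (tau[i + k + 1] - x) / d2 * bspline(x, tau, i + 1, k - 1) else 0
  a + b
}

library(splines)
p   <- 3                                            # cubic B-splines
tau <- c(rep(0, p + 1), 0.25, 0.5, 0.75, rep(1, p + 1))
xg  <- seq(0, 0.999, length = 200)                  # half-open intervals, so stop just below 1
B   <- sapply(seq_len(length(tau) - p - 1), function(i) bspline(xg, tau, i, p))

# splineDesign builds the same basis matrix directly
B2 <- splineDesign(tau, xg, ord = p + 1)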
B-splines
Solving
◮ Solution depends on the regularization parameter λ
◮ Determines the amount of “roughness”
◮ Choosing λ isn’t necessarily intuitive
◮ Degrees of freedom = trace of the smoother matrix
$$S = B(B^T B + \lambda\,\Omega)^{-1} B^T$$
◮ Sum of the eigenvalues of S
◮ Monotone relationship between df and λ
◮ Search for the value of λ that gives the desired df
◮ df = 2 → linear regression
◮ df = n → interpolate the data exactly
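In R's smooth.spline the desired degrees of freedom can be supplied directly and the corresponding λ is found internally; a quick illustration on the simulated x, y from the earlier penalised-regression sketch (the df values are arbitrary).

fit_df5  <- smooth.spline(x, y, df = 5)    # smooth fit, lambda chosen to match df = 5
fit_df25 <- smooth.spline(x, y, df = 25)   # wigglier fit, smaller lambda

c(fit_df5$lambda, fit_df25$lambda)         # larger df corresponds to smaller lambda

plot(x, y)
lines(fit_df5, col = 2)
lines(fit_df25, col = 4)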
Example: Doppler function
Example: Doppler function Could of course choose λ by LOO-CV or GCV
Cross validation
> plot(x,y)
> fitcv <- smooth.spline(x,y,cv=T)
> lines(fitcv,col=2)
> fitcv
Call:
smooth.spline(x = x, y = y, cv = T)

Smoothing Parameter  spar= 0.157514  lambda= 2.291527e-08 (16 iterations)
Equivalent Degrees of Freedom (Df): 124.738
Penalized Criterion: 6.071742
PRESS: 0.007898575
Generalised cross validation
> plot(x,y)
> fitgcv <- smooth.spline(x,y,cv=F)
> lines(fitgcv,col=4)
> fitgcv
Call:
smooth.spline(x = x, y = y, cv = F)

Smoothing Parameter  spar= 0.1597504  lambda= 2.378386e-08 (15 iterations)
Equivalent Degrees of Freedom (Df): 124.2353
Penalized Criterion: 6.078626
GCV: 0.007925571
Multivariate smoothing
Multivariate smoothing ◮ So far we have only considered univariate functions ◮ Suppose there are several predictors that we would like to treat nonparametrically ◮ Most ‘interesting’ statistical problems nowadays are high-dimensional with, easily, p > 1000 ◮ Biology: Microarrays, Gene maps, Network inference ◮ Finance: Prediction from multi-variate time-series ◮ Physics: Climate models ◮ Can we just extend the methods and model functions R p �→ R nonparametrically?
Curse of dimensionality
◮ One might consider multidimensional smoothers aimed at estimating
$$Y = m(x_1, x_2, \dots, x_p)$$
◮ The methods considered so far rely on ‘local’ approximations
◮ Examine the behaviour of data points in the neighbourhood of the point of interest
◮ But what is ‘local’, and what is a ‘neighbourhood’, if p → ∞ while n stays constant?
Curse of dimensionality
$x = (x^{(1)}, x^{(2)}, \dots, x^{(p)}) \in [0, 1]^p$.
To get 5% of all n sample points into a cube-shaped neighbourhood of x, we need a cube with side-length $0 < \ell < 1$ such that $\ell^p = 0.05$

Dimension p   side length ℓ
          1            0.05
          2            0.22
          5            0.54
         10            0.74
       1000            0.997
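The side lengths in the table are simply $\ell = 0.05^{1/p}$; a one-line check in R:

p <- c(1, 2, 5, 10, 1000)
round(0.05^(1 / p), 3)   # side length needed to capture 5% of the unit cube's volume
# 0.050 0.224 0.549 0.741 0.997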
Additive models
Require the function $m: \mathbb{R}^p \to \mathbb{R}$ to be of the form
$$m_{add}(x) = \mu + m_1(x^{(1)}) + m_2(x^{(2)}) + \dots + m_p(x^{(p)}) = \mu + \sum_{j=1}^{p} m_j(x^{(j)}), \qquad \mu \in \mathbb{R}$$
$m_j(\cdot): \mathbb{R} \to \mathbb{R}$ is just a univariate nonparametric function, with $E[m_j(x^{(j)})] = 0$, $j = 1, \dots, p$
◮ Choice of smoother is left open
◮ Avoids curse of dimensionality → ‘less flexible’
◮ Functions can be estimated by ‘backfitting’
Backfitting
Data $x_i^{(j)}$, 1 ≤ i ≤ n and 1 ≤ j ≤ p
A linear smoother for variable j can be described by an n × n matrix $S^{(j)}$, so that
$$\hat m_j = S^{(j)} Y,$$
◮ $Y = (Y_1, \dots, Y_n)^T$: observed vector of responses
◮ $\hat m_j = (\hat m_j(x_1^{(j)}), \dots, \hat m_j(x_n^{(j)}))$: regression fit
◮ $S^{(j)}$: smoother with bandwidth estimated by LOO-CV or GCV
Backfitting
$$\hat m_{add}(x) = \hat\mu + \sum_{j=1}^{p} \hat m_j(x^{(j)}),$$
Suppose $\hat\mu$ and $\hat m_k$ are given for all $k \ne j$:
$$\hat m_{add}(x_i) = \hat\mu + \Big(\sum_{k \ne j} \hat m_k(x_i^{(k)})\Big) + \hat m_j(x_i^{(j)})$$
Now apply the smoother $S^{(j)}$ to
$$Y - \Big(\hat\mu + \sum_{k \ne j} \hat m_k\Big)$$
Cycle through all j = 1, ..., p to get
$$\hat m_{add}(x) = \hat\mu + \sum_{j=1}^{p} \hat m_j(x^{(j)}).$$
Backfitting
Start with $\hat\mu \leftarrow n^{-1}\sum_{i=1}^{n} Y_i$.
1. Use $\hat m_j \equiv 0$ for all j = 1, ..., p
2. Cycle through the indices j = 1, 2, ..., p, 1, 2, ..., p, ...,
$$\hat m_j \leftarrow S^{(j)}\Big(Y - \hat\mu\,\mathbf{1} - \sum_{k \ne j} \hat m_k\Big).$$
Also normalize
$$\hat m_j(\cdot) \leftarrow \hat m_j(\cdot) - n^{-1}\sum_{i=1}^{n} \hat m_j(x_i^{(j)})$$
and update $\hat\mu \leftarrow n^{-1}\sum_{i=1}^{n}\big(Y_i - \sum_k \hat m_k(x_i^{(k)})\big)$
Stop the iterations if the functions do not change very much
3. Return
$$\hat m_{add}(x_i) \leftarrow \hat\mu + \sum_{j=1}^{p} \hat m_j(x_i^{(j)})$$
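A compact backfitting sketch in R, using smooth.spline (with its default GCV smoothing) as the univariate smoother $S^{(j)}$; the two-predictor data-generating model, the sample size and the fixed number of sweeps are illustrative assumptions.

# backfitting for an additive model Y = mu + m1(x1) + m2(x2) + eps
set.seed(1)
n <- 500
X <- matrix(runif(2 * n), n, 2)
Y <- sin(2 * pi * X[, 1]) + (X[, 2] - 0.5)^2 + rnorm(n, sd = 0.2)

p    <- ncol(X)
mu   <- mean(Y)
mhat <- matrix(0, n, p)                       # column j holds mhat_j at the data points

for (iter in 1:20) {                          # fixed number of sweeps for simplicity
  for (j in 1:p) {
    partial <- Y - mu - rowSums(mhat[, -j, drop = FALSE])
    fit     <- smooth.spline(X[, j], partial) # GCV-chosen smoothing by default
    mhat[, j] <- predict(fit, X[, j])$y
    mhat[, j] <- mhat[, j] - mean(mhat[, j])  # centre so that mean(mhat_j) = 0
  }
  mu <- mean(Y - rowSums(mhat))               # update the intercept
}

fitted_add <- mu + rowSums(mhat)              # additive fit at the observations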