Current Trends in Small Area Estimation Research Partha Lahiri JPSM, University of Maryland, College Park, USA Paper to be presented at Q2008, Rome, Italy, July 10, 2008
What is a Small Area?
• A subpopulation of interest for which the sample size is not adequate to produce reliable direct estimates.
• Example:
  Geographic Region      Small Area
  Nation                 State
  State                  County, school district
  Demographic Group      Small Domain
  Broad group            Narrow groups by sex/race/ethnicity
Examples
• Survey of drug use in Nebraska, N = 4300. Boone County has n = 14, and only one white female aged 25-44 was sampled.
• In SAIPE, about one-third of the counties are in the sample.
• In NHANES III, a majority of US states have no sample.
A Historical Note
• 11th century England and 17th century Canada
  – Based on census or administrative records.
• Recent 3 decades
  – Increasing demand for small area statistics, due to their growing use in formulating policies and programs, in the allocation of government funds, and in regional planning.
Design Issues
Ref: Singh et al. (1994), Marker (2001), Rao (2003)
• Stratification
  – Use a large number of smaller strata
• Degree of clustering
  – Minimize clustering
• Sample allocation
  – Reallocate sample from large planned domains to smaller planned domains
• Rolling samples (ACS), multiple frames
• In the Canadian LFS, max(CV) for UI regions was reduced by about half using compromise allocation.
Planned Domains:
• Minimize a weighted sum of sampling variances of direct small area estimators subject to a fixed overall sample size. Ref: Longford (2006)
• Minimize total sample size (or cost) subject to desired tolerances on the area sampling variances and on the aggregate sampling variance. Ref: Rao (2007)
• Achieve (approximately) equal RRMSE of GREG for the planned domains subject to a fixed cost. Ref: Gabler, Ganninger, Münnich, and others
• Achieve equal RRMSE of the EBP (or of whichever estimator is to be used) for the planned domains subject to a fixed cost.
However, “the client will always require more than specified at the design stage” (Fuller, 1999).
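The first planned-domain criterion (minimizing a weighted sum of sampling variances subject to a fixed overall sample size) has a closed-form solution in the simplest setting. The sketch below assumes simple random sampling within each domain with variance S_i^2 / n_i and no finite population correction; these assumptions, the function name `allocate_fixed_total` and the numbers are illustrative and are not taken from the slides. Under them, a Lagrange multiplier argument gives n_i proportional to sqrt(w_i) * S_i.

```python
import math

def allocate_fixed_total(total_n, weights, unit_sds):
    """Minimize sum_i w_i * S_i^2 / n_i subject to sum_i n_i = total_n.

    Assumes (hypothetically) simple random sampling in each domain and no
    finite population correction, so the optimum is n_i ∝ sqrt(w_i) * S_i.
    Returns real-valued allocations; rounding is left to the user.
    """
    scores = [math.sqrt(w) * s for w, s in zip(weights, unit_sds)]
    total_score = sum(scores)
    return [total_n * score / total_score for score in scores]

# Hypothetical example: two domains, the second weighted four times as much
# but with half the unit-level standard deviation.
allocation = allocate_fixed_total(1000, weights=[1.0, 4.0], unit_sds=[2.0, 1.0])
print(allocation)
```

Larger weights pull sample toward the domains whose precision matters most, which is one way to express the "reallocate sample to smaller planned domains" idea numerically.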
Issues in Small Area Estimation
1. Definition of small areas
2. Identification of relevant sources of information
3. Method of combining information
4. Small area estimates
5. Accuracy of the SAE method
6. Robust validation
7. Computer programming
8. Presentation of SAE statistics
Borrowing Strength:
• Relevant sources of information
  – Census data
  – Administrative information
  – Related surveys
• Method of combining information
  – Choices of good small area models
  – Use of a good statistical methodology
Synthetic Estimators
1944 Radio Listening Survey, Hansen, Hurwitz and Madow (1953, pp. 483-486): To estimate the median number of radio stations heard during the day for over 500 counties (small areas), the following explicit regression equation, based on data for 85 counties, was used:
$$\hat{y}_i = 0.52 + 0.74\,x_i,$$
where, for county i,
$y_i$: estimate obtained from the personal interview survey
$x_i$: estimate obtained from the mail survey
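As a minimal sketch, the fitted line above can be applied to any county for which only the auxiliary estimate x_i is available. The function name and the input values are hypothetical; only the coefficients 0.52 and 0.74 come from the slide.

```python
def radio_synthetic_estimate(x):
    """Predict the county median from its auxiliary estimate x using the slide's fitted line."""
    return 0.52 + 0.74 * x

# Hypothetical auxiliary estimates for three counties without interview data.
auxiliary_estimates = [2.0, 3.5, 5.0]
predictions = [radio_synthetic_estimate(x) for x in auxiliary_estimates]
print(predictions)
```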
County Crop Production (Stasny et al., 1991)
To estimate wheat production for each county of Kansas:
$$\hat{y}_{ij} = \hat\beta_0 + \hat\beta_1 x_{1ij} + \cdots + \hat\beta_p x_{pij},$$
where
$y_{ij}$: wheat production of the jth farm in the ith county
$x_{ij} = (1, x_{1ij}, \ldots, x_{pij})'$: a vector of auxiliary variables
Regression-synthetic estimator:
$$\hat{Y}_i = \sum_j \hat{y}_{ij} = N_i\hat\beta_0 + X_{i1}\hat\beta_1 + \cdots + X_{ip}\hat\beta_p$$
The total number of farms $N_i$ and the totals of the auxiliary variables $X_{il}$ ($l = 1, \ldots, p$) are known.
Ratio adjustment:
$$\hat{Y}_{i,\mathrm{adj}} = \frac{\hat{Y}_i}{\sum_i \hat{Y}_i}\,\hat{Y},$$
where $\hat{Y}$ is the direct design-based estimate for the state from a large probability sample.
NCHS synthetic state estimates for health variables: assume homogeneity within carefully constructed post-strata.
More refined synthetic estimation: SPREE.
World Bank method: Elbers et al. (2003), Haslett-Jones (2005)
Off-the-shelf methods: Schirm and Zaslavsky (1997)
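The regression-synthetic county totals and their ratio adjustment to a state-level direct estimate can be sketched as follows. The coefficient values, farm counts, auxiliary totals and the state estimate are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def regression_synthetic_totals(beta_hat, n_farms, aux_totals):
    """County totals N_i*beta_0 + sum_l X_il*beta_l from known N_i and X_il."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    # Each row is (N_i, X_i1, ..., X_ip), matching (beta_0, beta_1, ..., beta_p).
    design_totals = np.column_stack([n_farms, aux_totals])
    return design_totals @ beta_hat

def ratio_adjust(county_totals, state_direct_estimate):
    """Scale county estimates so that they add up to the state direct estimate."""
    county_totals = np.asarray(county_totals, dtype=float)
    return county_totals / county_totals.sum() * state_direct_estimate

# Hypothetical inputs: two counties, one auxiliary variable.
synthetic = regression_synthetic_totals(
    beta_hat=[10.0, 2.0], n_farms=[100, 50], aux_totals=[[1000.0], [400.0]]
)
adjusted = ratio_adjust(synthetic, state_direct_estimate=4500.0)
print(synthetic, adjusted)
```

After adjustment the county figures sum exactly to the state estimate, while their relative sizes are unchanged.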
Basic Area Level Model
To estimate small area means $Y_i$ using direct design-based estimates $y_i$ and area-level auxiliary variables $x_i$.
A basic area-level model:
Level 1: $\hat\theta_i = g(y_i) \sim \text{ind. } N(\theta_i, \psi_i)$
Level 2: $\theta_i = g(Y_i) \sim \text{ind. } N(x_i^T\beta, \tau^2)$
Fay and Herriot (1979): $g(Y_i) = \log(Y_i)$
Carter and Rolph (1974), Efron and Morris (1975): $g(Y_i) = \arcsin\sqrt{Y_i}$
SAIPE: $g(Y_i) = Y_i$ for state-level estimation of the proportion of poor school-age children, and $g(Y_i) = \log(Y_i)$ for county-level poverty counts of school-age children.
The model can be written as a simple linear mixed normal model:
$$\hat\theta_i = \theta_i + e_i = x_i^T\beta + v_i + e_i,$$
where
$e_i$: sampling error; $e_i \sim \text{ind. } N(0, \psi_i)$
$v_i$: area-specific random effects; $v_i \sim \text{iid } N(0, \tau^2)$
Supplementary Information Used
• Per-capita income for the county
• Value of housing for the place
• Value of housing for the county
• IRS-adjusted gross income per exemption for the place
• IRS-adjusted gross income per exemption for the county
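The BP:
$$\hat\theta_i^{BP} = E(\theta_i \mid \hat\theta_i; \beta, \tau^2) = x_i^T\beta + \gamma_i(\hat\theta_i - x_i^T\beta) = \gamma_i\hat\theta_i + (1-\gamma_i)\,x_i^T\beta,$$
where $\gamma_i = \dfrac{\tau^2}{\tau^2 + \psi_i}$.
EBP (or EBLUP): $\hat\theta_i^{EBP} = E(\theta_i \mid \hat\theta_i; \hat\beta, \hat\tau^2)$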
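A minimal numerical sketch of the shrinkage form of the BP follows. It treats beta and tau^2 as known (plugging in estimates would give the EBP); the function name and all input values are hypothetical.

```python
import numpy as np

def best_predictor(theta_hat, x, beta, tau2, psi):
    """gamma_i * theta_hat_i + (1 - gamma_i) * x_i^T beta for every area i."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    x = np.asarray(x, dtype=float)
    psi = np.asarray(psi, dtype=float)
    synthetic = x @ np.asarray(beta, dtype=float)  # regression-synthetic part
    gamma = tau2 / (tau2 + psi)                    # weight on the direct estimate
    return gamma * theta_hat + (1.0 - gamma) * synthetic

# Hypothetical data: two areas, intercept plus one covariate.
theta_hat = [10.0, 20.0]
x = [[1.0, 2.0], [1.0, 4.0]]
beta = [1.0, 3.0]
psi = [1.0, 3.0]
bp = best_predictor(theta_hat, x, beta, tau2=1.0, psi=psi)
bp_no_area_effect = best_predictor(theta_hat, x, beta, tau2=0.0, psi=psi)
print(bp, bp_no_area_effect)
```

The area with the larger sampling variance is shrunk more strongly toward the regression prediction, and with tau^2 = 0 the predictor collapses to the regression-synthetic value.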
Different MSE of EBP:
(i) $E(\hat\theta_i^{EBP} - \theta_i)^2$
(ii) $E[(\hat\theta_i^{EBP} - \theta_i)^2 \mid \theta_i]$
(iii) $E[(\hat\theta_i^{EBP} - \theta_i)^2 \mid \hat\theta_i]$
(iv) $E[(\hat\theta_i^{EBP} - \theta_i)^2 \mid \hat\theta_i,\ i = 1, \ldots, m]$
The majority of research has focused on estimation of the unconditional MSE (i).
$$MSE(\hat\theta_i^{EBP}) \approx g_{1i}(\tau^2) + g_{2i}(\tau^2) + g_{3i}(\tau^2)$$
$g_{1i}(\tau^2) = MSE(\hat\theta_i^{BP})$
$g_{2i}(\tau^2)$: the extra variability due to the estimation of $\beta$
$g_{3i}(\tau^2)$: the extra variability due to the estimation of $\tau^2$
Ref: Prasad and Rao (1990) and Datta and Lahiri (2000)
The terms $g_{2i}(\tau^2)$ and $g_{3i}(\tau^2)$ are of the same order, which is lower than that of the leading term $g_{1i}(\tau^2)$. PR and DL obtained a second-order unbiased (or nearly unbiased) estimator of the unconditional MSE using the above approximation and correcting the bias of the plug-in estimate of $g_{1i}(\tau^2)$.
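The three terms can be computed explicitly for the basic area-level model. The slide does not spell out the formulas, so the sketch below uses the closed forms commonly given for the Prasad-Rao moment estimator of tau^2, namely g1 = gamma_i * psi_i, g2 = (1 - gamma_i)^2 x_i^T [sum_j x_j x_j^T / (tau^2 + psi_j)]^{-1} x_i, g3 = psi_i^2 (tau^2 + psi_i)^{-3} * 2 m^{-2} sum_j (tau^2 + psi_j)^2, and the estimator g1 + g2 + 2*g3. Treat these as an assumption about which variant is meant; the function name and inputs are hypothetical.

```python
import numpy as np

def prasad_rao_mse(x, psi, tau2_hat):
    """Return (mse, g1, g2, g3) per area under the assumed Prasad-Rao closed forms."""
    x = np.asarray(x, dtype=float)
    psi = np.asarray(psi, dtype=float)
    m = len(psi)
    total_var = tau2_hat + psi
    gamma = tau2_hat / total_var
    # g1: MSE of the BP when beta and tau^2 are known.
    g1 = gamma * psi
    # g2: extra variability from estimating beta by weighted least squares.
    information = (x / total_var[:, None]).T @ x
    quad_form = np.einsum("ij,jk,ik->i", x, np.linalg.inv(information), x)
    g2 = (1.0 - gamma) ** 2 * quad_form
    # g3: extra variability from estimating tau^2 (moment-estimator variance).
    avar_tau2 = 2.0 * np.sum(total_var ** 2) / m ** 2
    g3 = psi ** 2 / total_var ** 3 * avar_tau2
    # Doubling g3 corrects the bias of the plug-in g1.
    return g1 + g2 + 2.0 * g3, g1, g2, g3

# Hypothetical balanced case: 10 areas, intercept only, psi_i = 1, tau^2 estimate = 1.
mse, g1, g2, g3 = prasad_rao_mse(np.ones((10, 1)), np.ones(10), tau2_hat=1.0)
print(mse[0], g1[0], g2[0], g3[0])
```

In this balanced example the leading term g1 dominates, while g2 and g3 shrink as the number of areas m grows, matching the order statement on the slide.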
Longford (2007): The PR MSE estimator did not perform well in estimating the design-based MSE for the EURAREA project.
Zhang (2007): The PR MSE estimator, averaged over areas, tracks the average of the design-based MSE for large m, if the model holds.
Different resampling methods [jackknife and parametric bootstrap] have been proposed by Butar and Lahiri (2003), Jiang and Lahiri (2002), Wan (2002), Hall and Maiti (2006), Pfeffermann and Glickmann (2004), and Chatterjee and Lahiri (2007). Compared to the Taylor series method, they performed well in simulations; see Fabrizi et al. (2007) and Pereira and Pedro (2008).
Issues:
The method uses a simple model and results in an EBP that is design-consistent.
Normality: The EBP method is extendable to specified non-normal distributions for the sampling and random effects. For unspecified non-normality of the sampling and random effects, one can use EBLUP [Lahiri and Rao, 1995] or certain adaptive [Lahiri, 2002; Fabrizi and Trivisano, 2007] or linear EB [Ghosh and Lahiri, 1987; Cocchi and Mouchart] methods.
Known sampling variances $\psi_i$: GVF-type methods are generally used. The method usually does not consider the small area effect, and the uncertainty in estimating the sampling variances is not included in the EBP.
In some situations, standard estimates [REML, ML, ANOVA, etc.] of the model variance $\tau^2$ can be zero. When $\hat\tau^2$ is zero, the EBLUP reduces to the regression-synthetic estimate. One way to avoid the problem is to use the ADM or AML estimates [Morris, 1987; Li and Lahiri, 2007].
A simple back transformation is often used to obtain the estimate of $Y_i$. The optimum property of the BP is lost by such a back transformation.
The BP of $Y_i = g^{-1}(\theta_i)$: $E\left(g^{-1}(\theta_i) \mid \hat\theta_i; \beta, \tau^2\right)$
An EBP of $Y_i$: $E\left(g^{-1}(\theta_i) \mid \hat\theta_i; \hat\beta, \hat\tau^2\right)$
The rationale behind the transformation $g(\cdot)$ rests on the Taylor series argument, and it is used primarily to stabilize the variance. A direct modeling of the direct estimates is possible, but this is likely to lead to a non-linear, non-normal mixed model.
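For the log transformation the difference between the naive back transformation and the BP of Y_i can be written down directly. Under the normal area-level model with known beta and tau^2, theta_i given the direct estimate is normal with mean equal to the BP of theta_i and variance gamma_i * psi_i, so the lognormal mean formula gives E[exp(theta_i) | direct estimate] = exp(BP + gamma_i * psi_i / 2). The sketch below illustrates this one case; the function name and numbers are hypothetical.

```python
import math

def back_transform_log(theta_hat, synthetic, tau2, psi):
    """Return (naive, bp) estimates of Y = exp(theta) for a single area.

    naive: exp of the BP of theta (simple back transformation).
    bp:    E[exp(theta) | theta_hat] under the normal model with known parameters.
    """
    gamma = tau2 / (tau2 + psi)
    theta_bp = gamma * theta_hat + (1.0 - gamma) * synthetic
    conditional_variance = gamma * psi
    naive = math.exp(theta_bp)
    bp = math.exp(theta_bp + conditional_variance / 2.0)
    return naive, bp

# Hypothetical area on the log scale.
naive, bp = back_transform_log(theta_hat=2.0, synthetic=1.0, tau2=1.0, psi=1.0)
print(naive, bp)
```

The naive value is always smaller than the BP here, which shows concretely how the simple back transformation loses the optimality of the BP.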
Confidence Interval: The intuitive interval [Cox, 1976]
$$\hat\theta_i^{EBP} \pm 1.96\sqrt{g_{1i}(\hat\tau^2)}$$
has an undercoverage problem. The correction
$$\hat\theta_i^{EBP} \pm 1.96\sqrt{mse_i^{PR}}$$
does not solve the problem: it has either an undercoverage or an overcoverage problem.
Parametric bootstrap interval:
$$\left(\hat\theta_i^{EBP} - L\sqrt{g_{1i}(\hat\tau^2)},\ \ \hat\theta_i^{EBP} - U\sqrt{g_{1i}(\hat\tau^2)}\right),$$
where L and U are obtained from the parametric bootstrap histogram of
$$\frac{\hat\theta_i^{*EBP} - \theta_i^{*}}{\sqrt{g_{1i}(\hat\tau^{*2})}}$$
[Ref: Chatterjee, Lahiri and Li, 2008]
Hall and Maiti (2006) have an alternative parametric bootstrap method, but the method is synthetic (Rao, 2005).
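The following is a rough sketch of a pivot-based parametric bootstrap interval for the basic area-level model, not the authors' implementation. It assumes a Prasad-Rao type moment estimator of tau^2 with an arbitrary positive floor (`tau2_floor`, a hypothetical device to keep the pivot defined), weighted least squares for beta, and simulated data. The endpoints are ordered by inverting the pivot, so the upper bootstrap quantile is subtracted to obtain the lower limit.

```python
import numpy as np

def fit_area_model(theta_hat, x, psi, tau2_floor=1e-3):
    """Moment estimate of tau^2 (floored, an assumption) and WLS estimate of beta."""
    m, p = x.shape
    beta_ols, *_ = np.linalg.lstsq(x, theta_hat, rcond=None)
    resid = theta_hat - x @ beta_ols
    leverage = np.einsum("ij,jk,ik->i", x, np.linalg.inv(x.T @ x), x)
    tau2 = (np.sum(resid ** 2) - np.sum(psi * (1.0 - leverage))) / (m - p)
    tau2 = max(tau2, tau2_floor)
    weighted_x = x / (tau2 + psi)[:, None]
    beta = np.linalg.solve(weighted_x.T @ x, weighted_x.T @ theta_hat)
    return beta, tau2

def ebp_and_scale(theta_hat, x, psi, beta, tau2):
    """EBP and sqrt(g1) evaluated at the supplied parameter estimates."""
    gamma = tau2 / (tau2 + psi)
    ebp = gamma * theta_hat + (1.0 - gamma) * (x @ beta)
    return ebp, np.sqrt(gamma * psi)

def bootstrap_intervals(theta_hat, x, psi, n_boot=500, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    m = len(psi)
    beta, tau2 = fit_area_model(theta_hat, x, psi)
    ebp, scale = ebp_and_scale(theta_hat, x, psi, beta, tau2)
    pivots = np.empty((n_boot, m))
    for b in range(n_boot):
        # Generate bootstrap truths and direct estimates from the fitted model.
        theta_star = x @ beta + rng.normal(0.0, np.sqrt(tau2), m)
        theta_hat_star = theta_star + rng.normal(0.0, np.sqrt(psi))
        beta_star, tau2_star = fit_area_model(theta_hat_star, x, psi)
        ebp_star, scale_star = ebp_and_scale(theta_hat_star, x, psi, beta_star, tau2_star)
        pivots[b] = (ebp_star - theta_star) / scale_star
    alpha = 1.0 - level
    q_low, q_high = np.quantile(pivots, [alpha / 2.0, 1.0 - alpha / 2.0], axis=0)
    return ebp - q_high * scale, ebp - q_low * scale

# Simulated (hypothetical) data: 30 areas, intercept plus one covariate.
data_rng = np.random.default_rng(1)
m = 30
x = np.column_stack([np.ones(m), data_rng.normal(size=m)])
psi = data_rng.uniform(0.5, 1.5, m)
theta = x @ np.array([1.0, 2.0]) + data_rng.normal(0.0, 1.0, m)
theta_hat = theta + data_rng.normal(0.0, np.sqrt(psi))
lower, upper = bootstrap_intervals(theta_hat, x, psi)
print(lower[:3], upper[:3])
```

When the floored tau^2 estimate is hit in many replicates the pivot becomes heavy-tailed, so a production version would need a more careful variance estimator than this floor.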
Estimation of Small Area Proportions: Two Basic Area Models
Ref: Liu, Lahiri and Kalton (2007)
Model 1:
Level 1: $p_{iw} \mid P_i \sim \text{ind. } N\!\left(P_i,\ \dfrac{P_i(1-P_i)}{n_i}\,\mathrm{deff}_i\right)$
Level 2: $\mathrm{logit}(P_i) \sim \text{ind. } N(x_i^T\beta, \tau^2)$
Model 2:
Level 1: $p_{iw} \mid P_i \sim \text{ind. } \mathrm{Beta}\!\left(P_i,\ \dfrac{P_i(1-P_i)}{n_i}\,\mathrm{deff}_i\right)$
Level 2: $\mathrm{logit}(P_i) \sim \text{ind. } N(x_i^T\beta, \tau^2)$
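In Model 2 the Beta distribution is indexed by its mean and variance. To simulate from it or evaluate its density one usually needs the standard shape parameters. Assuming that reading of the notation, matching the mean P_i and the variance P_i(1 - P_i) deff_i / n_i gives a = P_i (n_i / deff_i - 1) and b = (1 - P_i)(n_i / deff_i - 1), which requires n_i / deff_i > 1. The sketch below applies this conversion; the function name and inputs are hypothetical.

```python
def beta_shape_parameters(p_true, n, deff):
    """Shape parameters (a, b) of a Beta with mean p_true and variance p_true*(1-p_true)*deff/n."""
    effective_n = n / deff
    if effective_n <= 1.0:
        raise ValueError("effective sample size n/deff must exceed 1")
    concentration = effective_n - 1.0  # equals a + b
    return p_true * concentration, (1.0 - p_true) * concentration

# Hypothetical area: true proportion 0.2, sample size 100, design effect 2.
a, b = beta_shape_parameters(0.2, 100, 2.0)
mean = a / (a + b)
variance = a * b / ((a + b) ** 2 * (a + b + 1.0))
print(a, b, mean, variance)
```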