Improved Estimation for Complex Surveys Using Modern Regression Techniques
Kelly McConville, STAT-PHD Candidate, Colorado State University
Tuesday, June 21, 2011
4:00 p.m., room 006, Statistics Bldg
In survey statistics, finite population quantities are often estimated from complex survey data. This talk considers estimation of the finite population total of a study variable. The study variable is assumed to be observed for the sample and is supplemented by auxiliary information that is available for every element in the finite population. Following a model-assisted framework, estimators are constructed that exploit any relationship that may exist between the study variable and the auxiliary data. These estimators have good design properties regardless of model accuracy.
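As an illustration of the model-assisted idea, the sketch below computes a generalized regression estimator of a population total: fitted values are summed over the whole population, and a design-weighted sum of sample residuals is added as a correction. It is a minimal sketch assuming a linear working model fit by survey-weighted least squares; the function name `greg_total` and its arguments are hypothetical and are not taken from the talk.

```python
import numpy as np

def greg_total(y_s, X_s, pi_s, X_U):
    """Model-assisted (generalized regression) estimate of a population total.

    y_s  : study variable for the sampled elements
    X_s  : auxiliary variable(s) for the sampled elements
    pi_s : first-order inclusion probabilities of the sampled elements
    X_U  : auxiliary variable(s) for every element of the population
    """
    w = 1.0 / np.asarray(pi_s, dtype=float)          # design weights
    Xs = np.column_stack([np.ones(len(y_s)), X_s])   # add an intercept
    XU = np.column_stack([np.ones(len(X_U)), X_U])
    # Survey-weighted least squares via a square-root-weight transformation.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xs * sw[:, None], y_s * sw, rcond=None)
    fitted_U = XU @ beta                             # predictions for all of U
    resid_s = y_s - Xs @ beta                        # residuals on the sample
    # Sum of predictions plus design-weighted residual correction.
    return fitted_U.sum() + np.sum(w * resid_s)
```

The residual correction term is what keeps the estimator well behaved under the sampling design when the working model is poor; when the model fits well, the residuals are small and the estimate is driven by the population-level predictions.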
Nonparametric survey regression estimation is applicable in natural resource surveys where the relationship between the auxiliary information and the study variable is complex or unknown. A penalized spline regression estimator is studied, and its asymptotic properties are considered as the number of knots goes to infinity and the locations of the knots are allowed to change. The estimator is shown to be design consistent and asymptotically design unbiased. In the course of the proof, a result is established on the uniform convergence in probability of the survey-weighted quantile estimators. This result is obtained by deriving a survey-weighted Hoeffding inequality for bounded random variables. A variance estimator is proposed and shown to be design consistent for the asymptotic mean squared error. Simulation results demonstrate the usefulness of the asymptotic approximations.
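To make the penalized spline estimator more concrete, here is a minimal sketch for a single auxiliary variable. It assumes a truncated linear spline basis, knots placed at survey-weighted sample quantiles, and a ridge-type penalty on the knot coefficients only. These choices, and the names `weighted_quantiles`, `pspline_total`, `n_knots` and `lam`, are assumptions made for illustration; the abstract does not specify the basis or penalty used in the talk.

```python
import numpy as np

def weighted_quantiles(x, w, probs):
    """Inverse of the survey-weighted empirical distribution function."""
    order = np.argsort(x)
    x_sorted, w_sorted = x[order], w[order]
    cdf = np.cumsum(w_sorted) / np.sum(w_sorted)
    idx = np.searchsorted(cdf, probs)          # first index with cdf >= p
    return x_sorted[np.minimum(idx, len(x) - 1)]

def pspline_total(x_s, y_s, pi_s, x_U, n_knots=5, lam=1.0):
    """Penalized spline, model-assisted estimate of a population total."""
    w = 1.0 / np.asarray(pi_s, dtype=float)    # design weights
    probs = np.arange(1, n_knots + 1) / (n_knots + 1)
    knots = weighted_quantiles(x_s, w, probs)  # knot locations depend on the sample

    def basis(x):
        # Columns: intercept, linear term, one truncated line per knot.
        cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
        return np.column_stack(cols)

    B_s, B_U = basis(x_s), basis(x_U)
    # Penalize only the knot coefficients, not the intercept or slope.
    D = np.diag([0.0, 0.0] + [1.0] * n_knots)
    A = B_s.T @ (w[:, None] * B_s) + lam * D
    beta = np.linalg.solve(A, B_s.T @ (w * y_s))
    # Population sum of fitted values plus design-weighted residual correction.
    return np.sum(B_U @ beta) + np.sum(w * (y_s - B_s @ beta))
```

Because the knots are computed from the sample, their locations vary from sample to sample, which is one way to read the abstract's remark that the knot locations are allowed to change.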
Also in natural resource surveys, there is often a substantial amount of auxiliary information, typically derived from remotely sensed imagery and organized in the form of spatial layers in a geographic information system (GIS). Some of this auxiliary data may be extraneous, so a sparse model would be appropriate and model selection methods are warranted. The "least absolute shrinkage and selection operator" (lasso) conducts model selection and parameter estimation simultaneously by penalizing the sum of the absolute values of the model coefficients. A survey-weighted lasso criterion, which accounts for the sampling design, is derived and a survey-weighted lasso regression estimator is presented. The root-n design consistency of the estimator and a central limit theorem result are proved. Several variants of the survey-weighted lasso regression estimator are constructed. In particular, a calibration estimator and a ridge regression approximation estimator are constructed to produce lasso weights that can be applied to several study variables. Simulation studies show the lasso estimators are more efficient than the regression estimator when the true model is sparse. The lasso estimators are used to estimate the proportion of tree canopy cover for a region of Utah. Under a joint design-model framework, the survey-weighted lasso coefficients are shown to be root-N consistent for the parameters of the superpopulation model and a central limit theorem result is found. The methodology is applied to estimate the risk factors for the Zika virus from an epidemiological survey on the island of Yap. A logistic survey-weighted lasso regression model is fit to the data and important covariates are identified.
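The idea of a lasso criterion that accounts for the sampling design can be sketched as follows. The exact criterion derived in the talk is not given in this abstract, so the code assumes one natural form: squared errors weighted by inverse inclusion probabilities, plus a penalty `lam` on the sum of absolute coefficient values, minimized by coordinate descent with an unpenalized intercept. The names `survey_weighted_lasso` and `soft_threshold` are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning 0 when |z| <= lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def survey_weighted_lasso(X, y, pi, lam, n_iter=500, tol=1e-10):
    """Minimize 0.5 * sum_i w_i (y_i - b0 - x_i' beta)^2 + lam * sum_j |beta_j|,
    where w_i = 1 / pi_i (assumed form of the criterion)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)      # design weights
    n, p = X.shape
    beta = np.zeros(p)
    b0 = np.sum(w * y) / np.sum(w)             # start at the weighted mean
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            # Residual with the contribution of coordinate j removed.
            partial = y - b0 - X @ beta + X[:, j] * beta[j]
            rho = np.sum(w * X[:, j] * partial)
            z = np.sum(w * X[:, j] ** 2)
            beta[j] = soft_threshold(rho, lam) / z
        # The intercept is not penalized: weighted mean of the residuals.
        b0 = np.sum(w * (y - X @ beta)) / np.sum(w)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return b0, beta
```

Coefficients whose weighted correlation with the residuals stays below `lam` are set exactly to zero, which is how the lasso performs selection and estimation at once. The fitted coefficients could then be used in a model-assisted estimator of the total in place of ordinary survey-weighted regression coefficients.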
This talk will focus on the survey-weighted lasso regression estimator.
Dr. F. Jay Breidt, Advisor
Dr. Thomas Lee, Co-advisor
Dr. Myung-Hee Lee, Committee Member
Dr. Jean Opsomer, Committee Member
Dr. Paul Doherty, FWCB, Outside Member