ASSIGNING CATEGORICAL PREDICTOR SCORES IN LOGISTIC REGRESSION USING A BETA-BINOMIAL MODEL
Daniel Edstrom
Master's Candidate
Department of Statistics
Colorado State University
Wednesday, September 21, 2005
3:10 p.m.
E206 Engineering Building
ABSTRACT
Students who inquire at a college or university provide various pieces of information at the time of inquiry. Often, this information is categorical in nature, such as the student's intended major or the zip code in which the student resides. The ForecastPlus model developed by Noel-Levitz uses this categorical information in a logistic regression model to estimate the probability that an inquiring student will enroll at an institution. These models include six to eight categorical predictors, which would result in thousands of indicator variables, making estimation of the logistic regression parameters impractical. ForecastPlus currently solves this problem by creating a continuous variable for each categorical variable. The values of the continuous variables are category scores, computed using the marginal enrollment rate for each category when the number of students in the category exceeds a threshold, and computed using the average enrollment rate otherwise. The models are currently validated using a data-splitting method, which involves testing the model on the same responses from which the logistic regression parameters are estimated. This circularity may result in an optimistic assessment of predictive accuracy.
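The current scoring rule described above can be illustrated with a minimal sketch. This is not the ForecastPlus implementation: the function name, the default threshold of 30, and the reading of "average enrollment rate" as the overall rate across all students are assumptions made only for illustration.

```python
from collections import defaultdict

def threshold_scores(categories, enrolled, min_count=30):
    """Score each category of one predictor (hypothetical helper).

    categories: one category label per inquiring student
    enrolled:   matching 0/1 enrollment outcomes
    min_count:  assumed threshold; the source does not give its value
    """
    counts = defaultdict(int)   # students per category
    enrolls = defaultdict(int)  # enrolled students per category
    for cat, y in zip(categories, enrolled):
        counts[cat] += 1
        enrolls[cat] += y

    # Assumption: the fallback "average enrollment rate" is the overall rate.
    overall = sum(enrolls.values()) / sum(counts.values())

    # Use the category's own rate only when its count exceeds the threshold.
    return {
        cat: (enrolls[cat] / n if n > min_count else overall)
        for cat, n in counts.items()
    }
```

The resulting score replaces the categorical predictor as a single continuous variable in the logistic regression, so each predictor contributes one parameter rather than one per category.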
This paper evaluates two potential changes to the ForecastPlus method: a new method for producing category scores and a new method for validating the resulting logistic parameter estimates. The new category scores are computed using an empirical Bayes approach under the assumptions of a beta-binomial model. This method estimates the beta-binomial prior parameters by maximum likelihood and estimates each category score by its posterior mean. The new model validation method eliminates the circularity of the current evaluation process, but at the expense of producing potentially less valid logistic regression parameter estimates. The methods are applied to enrollment data from four universities across two years.
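The empirical Bayes scoring step described above can be sketched as follows, assuming each category i has n_i inquiring students, of whom y_i enroll, and that the category enrollment rates share a Beta(a, b) prior. The function names and the simple derivative-free search are illustrative assumptions, not the procedure used in the paper; any numerical optimizer could be substituted.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def neg_log_lik(a, b, data):
    """Negative beta-binomial log-likelihood for data = [(n_i, y_i), ...].

    The binomial coefficients are omitted because they do not depend on a, b.
    """
    return -sum(log_beta(a + y, b + n - y) - log_beta(a, b) for n, y in data)

def fit_beta_binomial(data, max_iter=10000):
    """Maximum likelihood estimates of (a, b) by compass search on (log a, log b).

    Illustrative optimizer only: starts at a = b = 1 and accepts only moves
    that lower the negative log-likelihood.
    """
    u, v = 0.0, 0.0  # log a, log b
    step = 1.0
    best = neg_log_lik(exp(u), exp(v), data)
    for _ in range(max_iter):
        if step < 1e-8:
            break
        improved = False
        for du, dv in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            nu, nv = u + du, v + dv
            if abs(nu) > 20 or abs(nv) > 20:
                continue  # keep a and b in a numerically safe range
            cand = neg_log_lik(exp(nu), exp(nv), data)
            if cand < best:
                u, v, best = nu, nv, cand
                improved = True
                break
        if not improved:
            step /= 2  # no neighbor improves, so refine the search
    return exp(u), exp(v)

def posterior_mean_scores(data, a, b):
    """Category scores: the posterior mean (a + y_i) / (a + b + n_i)."""
    return [(a + y) / (a + b + n) for n, y in data]
```

Each posterior mean is a weighted average of the category's observed rate y_i / n_i and the prior mean a / (a + b), so categories with few students are pulled toward the prior mean smoothly rather than switched to it at a fixed threshold.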
