Adaptive Regression by Mixing: an alternative to model selection demonstrated on a capture-recapture study.
Lihua Chen, Panayotis Giannakouros & Yuhong Yang
Assessing how many clusters (an MDL criterion)
Arta Doçi & Peter Bryant
One of the major challenges in cluster analysis is the estimation of the
appropriate number of clusters in a dataset. Many approaches have been
proposed including, but not limited to, the elbow phenomenon, within-cluster
dispersion, and viewing the estimation of the number of clusters as a model
selection problem. In this presentation, we adopt the third approach,
i.e., we view the estimation of the number of clusters in the K-Means Algorithm
as a model selection problem. We use a criterion based on Rissanen's Minimum
Description Length (MDL) Principle to assess the number of clusters in a
dataset. We present the criterion, provide results from the analysis of a
number of data sets, and suggest some avenues for future research.
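As a rough illustration of this style of criterion (the abstract does not give the authors' exact MDL formula), the sketch below scores 1-D K-means fits with a generic two-part description length: a data-fit term plus a model cost that grows with the number of clusters. All names, constants, and the toy data are illustrative choices, not the paper's.

```python
import math
import random

def kmeans_1d(xs, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 1-D data; returns centers and SSE."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    sse = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return centers, sse

def mdl_score(xs, k):
    """Generic two-part MDL-style score: a code length for the
    residuals plus a model cost growing with k (illustrative
    constants, not the criterion of the presentation)."""
    n = len(xs)
    _, sse = kmeans_1d(xs, k)
    return 0.5 * n * math.log(sse / n + 1e-12) + 0.5 * k * math.log(n)

# Two well-separated groups: compare the score across candidate k.
data = [0.1, 0.2, 0.15, 0.05, 5.0, 5.1, 4.9, 5.2]
scores = {k: mdl_score(data, k) for k in (1, 2, 3)}
```

For this toy dataset the two-cluster score falls well below the one-cluster score, which is the qualitative behavior any such criterion should show.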
Persistence of the Plug-in Rule in Classification of High-Dimensional Binary Data
We consider classification when the predictors are multivariate binary random variables. Given the class, each variable is modeled as an independent Bernoulli(p_ij) for 1≤i≤d and j=1,2, where p_ij represents the parameter of the i th variable in the j th class. A triangular array of parameters is assumed, allowing the parameters to change and the number of variables, d, to increase, so that more flexible models can be adopted as the sample size, n, increases. Difficulties under the triangular array are pointed out, and so moderate conditions are assumed. We use maximum likelihood estimators for the parameters and plug them into the Bayes classifier, yielding the so-called plug-in rule. Under these conditions, using the linearity of the plug-in rule, we show persistence of the plug-in rule when the variance of the plug-in rule is divergent; otherwise, we give an example of non-persistence of the plug-in rule. We consider sparsity of the variables in the case of non-persistence, and under a sparsity condition we overcome the non-persistence by subset selection of the variables. This shows that the plug-in rule with selected variables may achieve better performance than the full classifier, especially in high-dimensional data. We briefly discuss the convergence rate.
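For intuition, here is a minimal sketch of a plug-in rule of this kind: fit each class's Bernoulli parameters by maximum likelihood and classify by the sign of the estimated log-likelihood ratio, which is linear in x. The small smoothing offset (to keep log-odds finite) and the toy data are my additions, not part of the abstract.

```python
import math

def fit_bernoulli(X):
    """Per-coordinate Bernoulli parameter estimates; the 0.5/1.0
    smoothing offset is an implementation convenience, not part of
    the abstract's plain MLE."""
    n, d = len(X), len(X[0])
    return [(sum(row[i] for row in X) + 0.5) / (n + 1.0) for i in range(d)]

def plug_in_rule(x, p1, p2):
    """Plug-in Bayes rule for two classes with independent Bernoulli
    coordinates: the estimated log-likelihood ratio is linear in x;
    classify to class 1 when it is positive."""
    score = 0.0
    for xi, a, b in zip(x, p1, p2):
        score += xi * math.log(a / b) + (1 - xi) * math.log((1 - a) / (1 - b))
    return 1 if score > 0 else 2

# Toy data: class 1 tends to put 1s in the first coordinate.
X1 = [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
X2 = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]
p1, p2 = fit_bernoulli(X1), fit_bernoulli(X2)
```

Subset selection, as in the abstract, would simply restrict the sum in `plug_in_rule` to the selected coordinates.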
Improving Polar Cloud Detection by fusing MISR and MODIS information
Tao Shi, Bin Yu, Eugene E. Clothiaux, & Amy J. Braverman
Clouds play a major role in controlling Earth's climate. A key to predicting climate change is to observe and understand the global distribution of clouds, their physical properties, and their relationship to regional and global climate. NASA's Earth Observing System (EOS) is designed for studying the Earth from space using a multiple-instrument, multiple-satellite approach. However, clouds above snow- and ice-covered surfaces over polar regions are especially difficult to detect from satellite data because their temperature and reflectivity are similar to those of the surface.
The Multi-angle Imaging SpectroRadiometer (MISR) and the Moderate Resolution Imaging Spectroradiometer (MODIS), two instruments on the first EOS satellite, TERRA, were launched in 1999 to provide scientists with data for global cloud study. Fusing the information from different instruments is one of NASA's high priorities in the multiple-instrument, multiple-satellite EOS system, since the combined information of two or more sources of complementary data can validate and improve the results from each part. In this paper, we make an effort to improve polar cloud detection by fusing MISR and MODIS data. Compared to expert labels, the pixels on which the MISR ELCMC cloud detection algorithm (Shi et al., 2004) and the MODIS operational cloud mask (Ackerman et al., 2002) agree are highly accurate. Therefore, those pixels may serve as a good source of accurate labels for training other classifiers. A Quadratic Discriminant Analysis (QDA) classifier is trained on all MISR and MODIS features using the agreed pixels, and the classifier is then applied to the full data. The QDA classifier achieves an error rate much lower than those obtained using either MISR or MODIS data alone.
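A hedged sketch of the training step, with full QDA simplified to a diagonal-covariance (per-feature Gaussian) classifier for brevity. The feature values and class names below are hypothetical stand-ins for MISR/MODIS features and for labels taken from pixels where the two masks agree.

```python
import math

def fit_gaussian_diag(X):
    """Per-feature mean and variance for one class (a diagonal-
    covariance simplification of full QDA, chosen for brevity)."""
    n, d = len(X), len(X[0])
    mu = [sum(r[i] for r in X) / n for i in range(d)]
    var = [sum((r[i] - mu[i]) ** 2 for r in X) / n + 1e-9 for i in range(d)]
    return mu, var

def loglik(x, mu, var):
    """Gaussian log-likelihood of x under a diagonal covariance."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mu, var))

def classify(x, params):
    """Assign x to the class with the highest log-likelihood."""
    return max(params, key=lambda c: loglik(x, *params[c]))

# Hypothetical two-feature "pixels"; real training labels would come
# from the pixels on which the MISR and MODIS masks agree.
cloud = [[0.9, 0.2], [0.8, 0.3], [1.0, 0.25]]
clear = [[0.2, 0.8], [0.1, 0.9], [0.15, 0.85]]
params = {"cloud": fit_gaussian_diag(cloud), "clear": fit_gaussian_diag(clear)}
```

Full QDA would replace the per-feature variances with a full covariance matrix per class; the training-then-apply flow is the same.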
Multitarget Tracking with Application to Convective Systems
A statistical approach to multiple target tracking is presented which allows for birth, death, splitting, and merging of targets. Targets are also allowed to go undetected for several frames. The splitting and merging of targets is a novel addition for a statistically based tracking algorithm. This addition is essential for the tracking of storms, which is the motivation for this work. The utility of this tracker extends well beyond the tracking of storms, however; it can be valuable in other tracking applications that involve splitting or merging, such as vortices, radar/sonar signals, or groups of people. The method assumes that the location of a target behaves like a Gaussian process when it is observable. A Markov state model decides when the birth, death, splitting, or merging of targets takes place. The tracking estimate is obtained by an algorithm that finds the paths maximizing the likelihood of the assumed model. Some theoretical properties of the tracking estimates are also developed, such as sufficient conditions for consistency. The problem of how to quantify the confidence in a tracking estimate is addressed as well. The properties of the proposed method are demonstrated on simulated data. Finally, the method is applied to the problem for which it was designed: tracking storms from radar reflectivity data.
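A much-reduced sketch of the likelihood-maximizing idea: with a fixed number of targets and a Gaussian displacement model, the frame-to-frame association can be chosen to maximize the total log-likelihood. Birth, death, splitting, merging, and missed detections, the substance of the abstract's method, are all omitted here, and the brute-force search is purely illustrative.

```python
import math
from itertools import permutations

def link_frames(prev, curr, sigma=1.0):
    """Associate equal-sized target sets across two frames by the
    assignment that maximizes the Gaussian log-likelihood of the
    displacements (no birth/death/split/merge handling here)."""
    def ll(p, c):
        return -((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) / (2 * sigma ** 2)
    best, best_ll = None, -math.inf
    for perm in permutations(range(len(curr))):
        total = sum(ll(prev[i], curr[j]) for i, j in enumerate(perm))
        if total > best_ll:
            best, best_ll = perm, total
    return list(best)

# Two storm cells drifting right; each should keep its own label.
prev = [(0.0, 0.0), (10.0, 0.0)]
curr = [(1.0, 0.2), (11.0, -0.1)]
```

A practical tracker would replace the permutation search with an assignment algorithm and extend the state space with the Markov model the abstract describes.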
Human-Guided, Ultra-Adaptive Learning for Local Regression
Benjamin Tyner, Purdue University
Our goal is to enhance model selection in local regression by providing an effective way to study the space of model, or tuning, parameters. These parameters include the local polynomial degree (including mixtures of consecutive degrees), the smoothing parameter (span), the depth of an interpolating kd-tree (nc), and the relative scaling of the independent variables.
Using ideas and methods from the study of response surfaces in designed experiments, we study a model fit criterion, a model complexity metric, and an estimate of the variance of the noise, all as functions of the tuning parameters. A loess surface is fitted for each of a set of values of the tuning parameters. The tuning set is the design space of our experiment, and the fit, complexity, and variance estimate are the experimental responses. These meta-data are then studied using response surface methods.
Here we will take the model selection criterion to be the cross-validation sum of squares divided by a strawman noise variance estimate, and take the complexity measure to be a way of measuring the equivalent number of parameters of the fit. To effectively study the meta-data we use about a dozen tools of data visualization, many of which exploit the framework of trellis display.
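A toy version of this criterion, assuming a k-nearest-neighbor local average in place of loess and a first-difference estimate as the strawman noise variance; the span grid plays the role of the tuning design. All of these simplifications are mine, not the authors'.

```python
import math

def loo_cv_press(x, y, span):
    """Leave-one-out prediction sum of squares for a k-nearest-
    neighbor local average (a deliberately simple stand-in for
    loess; span sets the neighborhood fraction)."""
    n = len(x)
    k = max(1, int(span * (n - 1)))
    press = 0.0
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: abs(x[j] - x[i]))[:k]
        pred = sum(y[j] for j in others) / len(others)
        press += (y[i] - pred) ** 2
    return press

def noise_variance(y):
    """Strawman noise estimate from first differences of the
    responses (assumes the data are sorted by x)."""
    d = [y[i + 1] - y[i] for i in range(len(y) - 1)]
    return sum(v * v for v in d) / (2 * len(d))

# Tuning "design": evaluate the criterion over a small grid of spans.
xs = [i / 20 for i in range(21)]
ys = [math.sin(2 * math.pi * v) for v in xs]
grid = {s: loo_cv_press(xs, ys, s) / (noise_variance(ys) * len(xs))
        for s in (0.1, 0.3, 0.6, 0.9)}
best_span = min(grid, key=grid.get)
```

For this smooth test function, large spans oversmooth and inflate the criterion, so small spans score better; response surface methods would study such a grid jointly over all tuning parameters.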
It can happen that the analysis indicates that no points in the design space provide a good fit. In this case we move to a new region of the tuning-parameter space, in effect carrying out evolutionary operation.
We have explored our ideas initially with 15 test data sets: five univariate functions suggested by L. Brown, across three levels of noise.
Discrimination and classification based on nonparametric hypothesis testing
In this paper we are concerned with the clustering and classification problem in high dimension, low sample size data. This is an increasingly important topic that can be applied in a wide range of practical contexts, including gene microarray analysis, chemometrics, medical image analysis, etc. Classical multivariate methods, which often need to sphere the data by multiplying by the root inverse of the covariance matrix, cannot be applied to such cases because there are more parameters than observations. Support vector machines and distance weighted discrimination (Marron and Todd, 2002) work well when the dimension is moderately large, but the misclassification rate increases significantly as the dimension increases. Here we will present a clustering and classification method based on nonparametric hypothesis testing developed specifically for this setting. This is an appealing alternative to techniques like inverse regression since it does not require constant variance or normality of the predictors. A simulation study will be given to evaluate the performance, and an application to a microarray dataset will be illustrated.
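As one hypothetical instance of testing-based analysis in this setting (the abstract does not specify its test), each feature can be scored by how far its two-sample rank-sum statistic falls from its null mean, keeping only the most discriminating features before classification. The function names and toy data are illustrative.

```python
def rank_sum(a, b):
    """Wilcoxon rank-sum statistic for one feature (ties ignored
    for brevity; assumes distinct values)."""
    pooled = sorted(a + b)
    ranks = {v: r + 1 for r, v in enumerate(pooled)}
    return sum(ranks[v] for v in a)

def screen_features(X1, X2, top=2):
    """Keep the features whose rank-sum statistic deviates most from
    its null mean n1*(n1+n2+1)/2 -- an illustrative stand-in for the
    abstract's unspecified nonparametric test."""
    n1, n2 = len(X1), len(X2)
    null_mean = n1 * (n1 + n2 + 1) / 2
    d = len(X1[0])
    stats = [abs(rank_sum([r[i] for r in X1], [r[i] for r in X2]) - null_mean)
             for i in range(d)]
    return sorted(range(d), key=lambda i: -stats[i])[:top]

# Feature 0 separates the classes; features 1-2 are pure noise.
X1 = [[5.0, 0.1, 0.9], [6.0, 0.7, 0.2], [5.5, 0.4, 0.5]]
X2 = [[1.0, 0.6, 0.3], [2.0, 0.2, 0.8], [1.5, 0.5, 0.6]]
```

Because rank statistics use no covariance estimate, a screen like this remains well defined even when the dimension far exceeds the sample size.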
Simulating red flour beetle movement by an agent-based model
Kurt Zhang, Kansas State University; Susan Romero, Kansas State University; Paul Flinn, Grain Marketing and Production Research Center, USDA; Jim Campbell, Grain Marketing and Production Research Center, USDA
We developed a spatially explicit, agent-based model to better understand and predict beetle population dynamics in spatially complex landscapes. Because an agent-based model allows for differences in agent behavior and competition between agents, these models can be more realistic than deterministic models, which have a number of limitations, such as oversimplification of the system and their inability to portray stochastic factors and behavioral interactions. An object-oriented language was used to simulate the behavior of beetles. The model allows movement of individual beetles to be traced. After analyzing the circular data from biological experiments, a von Mises distribution was found to fit the turning angles of walking adult beetles well. A wrapped normal distribution also fit the data well, and was faster to compute than the von Mises distribution. Moving speed was shown to be negatively correlated with turning angles. The autocorrelation and cross-correlation within and between beetle paths were investigated.
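A minimal sketch of the movement model: a correlated random walk whose turning angles are drawn from a von Mises distribution centered on zero (a tendency to keep heading), which Python's standard library can sample directly. The concentration kappa and the step length below are illustrative values, not the fitted ones.

```python
import math
import random

def simulate_path(steps, kappa=4.0, speed=1.0, seed=1):
    """Trace one beetle's path as a correlated random walk with
    von Mises turning angles (kappa and speed are illustrative)."""
    rng = random.Random(seed)
    x, y, heading = 0.0, 0.0, 0.0
    path = [(x, y)]
    for _ in range(steps):
        # vonmisesvariate returns an angle in [0, 2*pi); recentre to
        # (-pi, pi] so it acts as a turn relative to the heading.
        turn = rng.vonmisesvariate(0.0, kappa)
        if turn > math.pi:
            turn -= 2 * math.pi
        heading += turn
        x += speed * math.cos(heading)
        y += speed * math.sin(heading)
        path.append((x, y))
    return path

path = simulate_path(200)
```

Coupling the step length to the sampled turn (speed falling as turns sharpen) would reproduce the negative correlation between moving speed and turning angle reported in the abstract.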