Clustering High Dimension, Low Sample Size Data Using the Maximal Data Piling Distance
Jeongyoun Ahn, University of Georgia
Monday, October 12, 2009
4:00 p.m., 223 Weber
ABSTRACT
We present a new hierarchical clustering method for high dimension, low sample size
(HDLSS) data. The method utilizes the fact that each individual data vector accounts
for exactly one dimension in the subspace generated by HDLSS data. The linkage
that is used for measuring the distance between clusters is the orthogonal distance
between the affine subspaces generated by each cluster. The ideal implementation would
be to consider all possible binary splits of the data and choose the one that maximizes
the distance between the two resulting clusters. Since this is not computationally feasible
in general, we approximate it using the singular value decomposition. We provide theoretical
justification of the method by studying high dimensional asymptotics. We also obtain
the probability distribution of the distance measure under the null hypothesis of no split,
which we use to propose a criterion for determining the number of clusters. Simulations
and analyses of real microarray data show that the proposed method achieves competitive
clustering performance.
(Joint work with Myung Hee Lee and Youngju Yoon)
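
The linkage and the "ideal implementation" described in the abstract can be illustrated with a minimal sketch. This is not the speakers' code: the function names `affine_subspace_distance` and `best_binary_split` are hypothetical, and the least-squares formulation is just one standard way to compute the orthogonal distance between two affine hulls. The exhaustive search visits 2^(n-1) - 1 splits for n observations, which is why the abstract calls it infeasible in general; the SVD-based approximation mentioned in the abstract is not specified there and is not reproduced here.

```python
from itertools import combinations

import numpy as np


def affine_subspace_distance(A, B):
    """Orthogonal distance between the affine hulls of two clusters.

    A and B are arrays of shape (m, d) and (n, d), one observation per row.
    (Hypothetical helper; the name is an assumption for illustration.)
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Direction vectors of both affine hulls, stacked as columns of D.
    D = np.vstack([A[1:] - A[0], B[1:] - B[0]]).T  # shape (d, m + n - 2)
    t = A[0] - B[0]  # any vector joining the two hulls
    if D.shape[1] == 0:
        # Both clusters are single points: plain Euclidean distance.
        return float(np.linalg.norm(t))
    # Remove from t everything that lies in the span of the directions;
    # the norm of what remains is the distance between the two hulls.
    coef, *_ = np.linalg.lstsq(D, t, rcond=None)
    return float(np.linalg.norm(t - D @ coef))


def best_binary_split(X):
    """Exhaustive search over all binary splits of the rows of X.

    Returns (mask, distance), where mask marks the rows in the first group.
    Only practical for very small n. (Hypothetical helper.)
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    best_dist, best_mask = -np.inf, None
    # Keep observation 0 in the first group so each split is visited once.
    for k in range(0, n - 1):
        for extra in combinations(range(1, n), k):
            mask = np.zeros(n, dtype=bool)
            mask[list((0,) + extra)] = True
            dist = affine_subspace_distance(X[mask], X[~mask])
            if dist > best_dist:
                best_dist, best_mask = dist, mask
    return best_mask, best_dist
```

For example, with 5 observations in 50 dimensions (an HDLSS setting), `best_binary_split(np.random.default_rng(0).normal(size=(5, 50)))` evaluates all 15 splits and returns the one with the largest distance.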