"Everything should be made as simple as possible, but not simpler." - Albert Einstein

Seminar Announcement

Clustering High Dimension, Low Sample Size Data Using the Maximal Data Piling Distance

Jeongyoun Ahn, University of Georgia

Monday, October 12, 2009

4:00 p.m., 223 Weber

ABSTRACT

We present a new hierarchical clustering method for high dimension, low sample size
(HDLSS) data. The method utilizes the fact that each individual data vector accounts
for exactly one dimension in the subspace generated by HDLSS data. The linkage
that is used for measuring the distance between clusters is the orthogonal distance
between affine subspaces generated by each cluster. The ideal implementation would
be to consider all possible binary splits of the data and choose the one that maximizes
the distance between the two resulting groups. Since this is not computationally
feasible in general, we approximate it using singular value decomposition. We provide
theoretical justification of the method by studying high dimensional asymptotics. We
also obtain the probability distribution of the distance measure under the null
hypothesis of no split, which we use to propose a criterion for determining the number
of clusters. Simulation and real data analyses with microarray data show competitive
clustering performance of the proposed method.
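
As a rough illustration of the linkage described above, the following Python sketch
computes the orthogonal distance between the affine subspaces generated by two groups
of observations, and then runs the "ideal" exhaustive search over binary splits on a
very small sample. The function names affine_subspace_distance and best_binary_split
are hypothetical, chosen here for illustration; this is not the speaker's
implementation. The exhaustive search is only practical for a handful of observations,
and the singular value decomposition approximation is not shown because the abstract
gives no details of it.

```python
import itertools
import numpy as np


def affine_subspace_distance(A, B):
    """Orthogonal distance between the affine hulls of the rows of A and of B."""
    a0, b0 = A[0], B[0]
    # Direction vectors spanning each affine hull (differences from a base point).
    directions = np.vstack([A[1:] - a0, B[1:] - b0])
    gap = a0 - b0
    if directions.shape[0] == 0:
        # Two single points: the distance is just the Euclidean distance.
        return float(np.linalg.norm(gap))
    # Remove from the gap its projection onto the combined direction space;
    # the norm of what remains is the distance between the two affine hulls.
    coef, *_ = np.linalg.lstsq(directions.T, gap, rcond=None)
    return float(np.linalg.norm(gap - directions.T @ coef))


def best_binary_split(X):
    """Exhaustive search over binary splits of the rows of X (tiny n only)."""
    n = X.shape[0]
    best_distance, best_split = -1.0, None
    # Keep observation 0 in the first group so each split is visited once.
    for size in range(0, n - 1):
        for rest in itertools.combinations(range(1, n), size):
            left = (0,) + rest
            right = tuple(i for i in range(n) if i not in left)
            d = affine_subspace_distance(X[list(left)], X[list(right)])
            if d > best_distance:
                best_distance, best_split = d, (left, right)
    return best_distance, best_split


if __name__ == "__main__":
    # Synthetic HDLSS-style data (an assumption for the demo): 6 observations
    # in 50 dimensions, with the first three shifted along one coordinate.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 50))
    X[:3, 0] += 100.0
    distance, split = best_binary_split(X)
    print(distance, split)
```

The number of binary splits grows as 2^(n-1) - 1, which is why the abstract describes
the exhaustive search as computationally infeasible in general.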

(Joint work with Myung Hee Lee and Youngju Yoon)