| Sequential Low-Storage Methods for the Analysis of Massive Datasets |
Jim McDermott
Nonclinical Biostatistics Division
Bristol-Myers Squibb
Monday, November 28, 2005
4:10 p.m.
E203 Engineering Building
ABSTRACT
Even simple analysis of streaming data becomes problematic when the size of the dataset becomes too large, both in terms of storage required and computation time. For example, to compute the median sequentially as new data arrives we must store and resort the entire stream of data for each new data point. To solve this problem, we consider a very simple sequential algorithm for the estimation of the median for a data stream with no fixed length. The method processes the data sequentially as it arrives and stores a very small, fixed number of points that are chosen by carefully constructed probabilistic decision rules. The computation time required for the execution of the proposed method is linear as a function of the sample size.
|