Estimating Simple Statistical Parameters on High Speed Data Streams
In a number of applications, input arrives very rapidly and there is
limited memory to store the input. One needs to monitor simple statistical
quantities of such ``streams''.
In the past few years, researchers in Theoretical Computer Science
have developed new estimation algorithms that work within these space and
time constraints. The methods rely on metric embeddings, pseudo-random
computations and sparse approximation theory. The applications include IP
network traffic analysis, mining text message streams for Homeland
Security and processing massive data sets in general.
I will present an overview of the algorithmic principles, and discuss issues
in building data stream systems that work at IP line speeds.
I will also discuss open problems, in particular, in performing more
sophisticated statistical analyses on data streams.
This talk is based on an updated
version of the survey at http://www.cs.rutgers.edu/~muthu/stream-1-1.ps