Estimating the Number of Structural Breaks in Non-Stationary Time Series
Stacey Hancock
Ph.D. Candidate, Department of Statistics, Colorado State University
Tuesday, May 8, 2007
Many time series exhibit structural breaks in a variety of ways, the most obvious being a mean level shift. In this case, the mean level of the process is constant over periods of time, jumping to different levels at times called “changepoints”. These jumps may be due to outside influences such as changes in government policy or manufacturing regulations. Structural breaks may also be a result of changes in variability. Financial data often have time periods where prices fall quickly (a crash) as well as periods of stability. Other changes in a time series may occur in the spectrum of the process. For example, sound and speech data can change from a high frequency (rough) signal to a low frequency (smooth) signal when the type of sound or spoken word changes. A seismic signal changes from a period of low frequency to a period of high frequency when a seismic event occurs. The goal of this research is to estimate where these structural breaks occur and to provide a model for the data within each stationary segment. In some time series, such as global climate measurements, the process changes gradually, rather than at distinct breaks. Segmenting the data in this case can provide useful approximations to the underlying slowly varying process.
The program AutoPARM (Automatic Piecewise AutoRegressive Modeling procedure), developed by Davis, Lee, and Rodriguez-Yam (2005), estimates the number and locations of changepoints in a time series by fitting autoregressive models to each segment.
The setup is as follows: for $k = 1, \dots, m$, denote the changepoint between the $k$th and $(k+1)$st segments as $t_k$, and set $t_0 = 1$ and $t_{m+1} = n + 1$. Let $\{\varepsilon_{t,k}\}$, $k = 1, \dots, m + 1$, be independent and identically distributed with mean 0 and variance 1. Then for given initial values, the $k$th segment follows the AR($p_k$) process
$$X_t = \gamma_k + \phi_{k,1} X_{t-1} + \cdots + \phi_{k,p_k} X_{t-p_k} + \sigma_k \varepsilon_{t,k}, \qquad t_{k-1} \le t < t_k,$$
with parameters $(\gamma_k, \phi_{k,1}, \dots, \phi_{k,p_k}, \sigma_k^2)$, $k = 1, \dots, m + 1$. AutoPARM applies the minimum description length (MDL) criterion of Rissanen (1989) to define a best-fitting model. Estimates for the number of changepoints, the locations of the changepoints, and the orders of the AR processes are obtained by finding those values of $(m, t_1, \dots, t_m, p_1, \dots, p_{m+1})$ that minimize the MDL. The MDL principle is an important concept from information theory and learning theory that naturally extends to statistical model selection. It defines the best-fitting model as the one that describes the data in the least amount of space, i.e., with the shortest code length. Codes that compress the data include a description of the model plus a description of the data under that model. Thus, minimizing the code length is similar to a penalized maximum likelihood approach.
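To make the penalized-likelihood reading of the code length concrete, the following Python sketch simulates a piecewise AR series and compares the code length of a segmented fit with that of an unsegmented fit. It is an illustration under stated assumptions, not AutoPARM itself: the function names (`simulate_piecewise_ar`, `ar_residual_variance`, `code_length`) are hypothetical, the AR fits use ordinary least squares, the penalty terms are a simplified MDL-style choice rather than the exact criterion of Davis et al., and no search over changepoints or orders is attempted.

```python
import numpy as np


def simulate_piecewise_ar(n, changepoints, coefs, sigmas, seed=0):
    """Simulate a zero-intercept piecewise AR series of length n.

    changepoints: sorted 0-based indices at which a new segment starts.
    coefs[k]: AR coefficients of segment k; sigmas[k]: its noise scale.
    """
    rng = np.random.default_rng(seed)
    bounds = [0, *changepoints, n]
    x = np.zeros(n)
    for k in range(len(bounds) - 1):
        phi = np.asarray(coefs[k], dtype=float)
        for t in range(bounds[k], bounds[k + 1]):
            # Lagged values; anything before the start of the series counts as 0.
            lags = np.array([x[t - j] if t >= j else 0.0
                             for j in range(1, len(phi) + 1)])
            x[t] = phi @ lags + sigmas[k] * rng.standard_normal()
    return x


def ar_residual_variance(segment, p):
    """Least-squares AR(p) fit with intercept; returns the mean squared residual."""
    y = segment[p:]
    lagged = [segment[p - j:len(segment) - j] for j in range(1, p + 1)]
    design = np.column_stack([np.ones(len(y)), *lagged])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.mean((y - design @ beta) ** 2))


def code_length(x, changepoints, orders):
    """Simplified MDL-style code length (assumed form, for illustration only).

    Model cost: terms for m, the segment lengths, and each segment's order
    and parameters. Data cost: Gaussian negative log-likelihood per segment.
    """
    n = len(x)
    bounds = [0, *changepoints, n]
    m = len(changepoints)
    total = np.log(m + 1) + (m + 1) * np.log(n)
    for k, p in enumerate(orders):
        seg = x[bounds[k]:bounds[k + 1]]
        n_k = len(seg)
        sigma2 = ar_residual_variance(seg, p)
        total += np.log(p + 1) + 0.5 * (p + 2) * np.log(n_k)
        total += 0.5 * n_k * np.log(2 * np.pi * sigma2)
    return float(total)


if __name__ == "__main__":
    x = simulate_piecewise_ar(600, [300], coefs=[[0.9], [-0.9]], sigmas=[1.0, 1.0])
    print("one changepoint at 300:", code_length(x, [300], [1, 1]))
    print("no changepoint:        ", code_length(x, [], [1]))
```

When the AR coefficient flips sign halfway through the series, a single AR(1) fit leaves a much larger residual variance than two separate fits, so the segmented model should have the shorter code length despite its larger model cost.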
Empirical studies show that AutoPARM works well in a variety of situations. Assuming the number of changepoints is known, Davis et al. showed that the MDL criterion produces consistent estimates of the locations of the changepoints under the assumption that $t_j = \lambda_j n$, where $\lambda_j \in (0,1)$. It is not difficult to show that the estimate of $m$ obtained by minimizing the MDL is greater than or equal to the true $m$. A key objective of this research is to show consistency of the estimate of $m$ as well as of the estimated orders of the AR processes.
Even when the observations are independent, the consistency of the estimator for the number of changepoints is known in only some special cases. The consistency in an independent normal sequence based on minimizing the Schwarz criterion was proved by Yao (1988). While the means of adjacent segments were different, a common variance over all segments was assumed. Lee (1995) gave a consistency proof for the same situation, but used a different model selection criterion. A nonparametric approach to estimating the number of changepoints was proposed by Lee (1996). In this case, the sequence of random variables was still assumed to be independent, but no distributional assumptions were made on each segment. Chen and Gupta (1997) showed consistency of an estimator for the number of changepoints in an independent normal sequence where the means were assumed to be constant, but the variances were allowed to change. The number of changepoints was estimated using a modified version of Schwarz's criterion. Lee (1997) used a penalized likelihood criterion to prove consistency of an estimator for the number of changepoints for a sequence of independent exponential family random variables where the parameters of the exponential family were allowed to change between segments.
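For the simplest case reviewed above (independent observations, changing means, common variance), the idea of choosing the number of changepoints by minimizing a Schwarz-type criterion can be sketched in Python as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the criterion is taken to be (n/2) log(RSS/n) + m log n, and the best segmentation for each m is found by dynamic programming over residual sums of squares. It is not claimed to reproduce the exact procedure of any of the cited papers.

```python
import numpy as np


def best_segmentations(x, max_m):
    """Dynamic programme for least-squares mean-shift segmentation.

    best[m, j] is the minimal residual sum of squares of x[:j] when it is
    split into m + 1 constant-mean segments; back[m, j] stores the start of
    the last segment in that optimal split.
    """
    n = len(x)
    s1 = np.concatenate([[0.0], np.cumsum(x)])
    s2 = np.concatenate([[0.0], np.cumsum(x ** 2)])

    def rss(i, j):
        # Residual sum of squares of x[i:j] about its own mean.
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    best = np.full((max_m + 1, n + 1), np.inf)
    back = np.zeros((max_m + 1, n + 1), dtype=int)
    best[0, 1:] = rss(0, np.arange(1, n + 1))
    for m in range(1, max_m + 1):
        for j in range(m + 1, n + 1):
            starts = np.arange(m, j)  # candidate starts of the last segment
            costs = best[m - 1, m:j] + rss(starts, j)
            i = int(np.argmin(costs))
            best[m, j] = costs[i]
            back[m, j] = starts[i]
    return best, back


def estimate_changepoints(x, max_m):
    """Pick m minimising the assumed Schwarz-type criterion; return (m, changepoints)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    best, back = best_segmentations(x, max_m)
    sigma2 = np.maximum(best[:, n] / n, 1e-12)  # guard against log(0)
    criterion = 0.5 * n * np.log(sigma2) + np.arange(max_m + 1) * np.log(n)
    m_hat = int(np.argmin(criterion))
    # Trace back the starts of segments 2, ..., m_hat + 1 (0-based indices).
    cps, j = [], n
    for m in range(m_hat, 0, -1):
        j = int(back[m, j])
        cps.append(j)
    return m_hat, cps[::-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = np.repeat([0.0, 3.0, 1.0], 100)
    x = means + rng.standard_normal(len(means))
    print(estimate_changepoints(x, max_m=5))
```

The dynamic programme costs on the order of max_m times n squared operations, which is adequate for a few hundred observations; the criterion itself is what the consistency results concern, not the search strategy.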
In order to attack consistency for AutoPARM's estimate of the number of changepoints, we intend to utilize the functional law of the iterated logarithm. The basic law of the iterated logarithm (LIL) describes the order of convergence for sums of independent, identically distributed random variables. Hannan and Quinn (1979) used the LIL to prove consistency when estimating the order of an autoregressive process. The functional law of the iterated logarithm extends the LIL to functions of sums of random variables. Rio (1995) gives sufficient conditions for the functional LIL to hold for stationary strongly mixing sequences, relaxing the assumption of independence. If we consider the difference in MDLs between fitting a model with the correct number of changepoints and a model with more changepoints than necessary, the functional LIL allows us to show that for large sample sizes, the MDL with the correct number is smaller than the MDL with extra changepoints with probability one.
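For reference, the classical law of the iterated logarithm mentioned above can be stated as follows. The notation ($X_i$, $S_n$, $\sigma^2$) is introduced here for illustration and is not the author's; the closing remark on penalties is an informal interpretation of why the LIL is useful for model selection, not a summary of the intended proof.

```latex
% Classical LIL for i.i.d. X_1, X_2, ... with E[X_i] = 0 and Var(X_i) = \sigma^2 < \infty,
% where S_n = X_1 + ... + X_n:
\[
  \limsup_{n \to \infty} \frac{S_n}{\sqrt{2 n \log\log n}} = \sigma
  \quad \text{and} \quad
  \liminf_{n \to \infty} \frac{S_n}{\sqrt{2 n \log\log n}} = -\sigma
  \quad \text{almost surely.}
\]
% Consequently, for every \delta > 0, with probability one
\[
  \frac{S_n^2}{n} \le 2 (1 + \delta)\, \sigma^2 \log\log n
  \quad \text{for all sufficiently large } n .
\]
```

Since $\log\log n$ grows more slowly than $\log n$, quadratic statistics of this type are eventually dominated by any penalty of order $\log n$, which is the kind of comparison a consistency argument for a penalized criterion needs.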
The consistency proof for the estimated number of changepoints requires us to assume that the true underlying process is piecewise autoregressive, and that consecutive changepoints are at least $\epsilon n$ units apart for some $\epsilon > 0$. We would like to show that the estimated number of changepoints and the estimated locations of the changepoints obtained by AutoPARM are consistent even if the underlying process is not autoregressive. We will assume some change in the mean or covariance structure between segments, but the underlying process may be a segmented autoregressive moving average (ARMA), a general linear time series, or even a nonlinear time series model. Examples of the latter include the generalized autoregressive conditional heteroscedastic (GARCH) model, a commonly used model for financial time series. Other generalizations will be considered. These include relaxing the assumptions on the driving noise of these processes and the requirement that changepoints are at least $\epsilon n$ units apart.
The primary goal of this dissertation is to study some of the theoretical issues involved with estimating the number of changepoints using AutoPARM, and to demonstrate the robustness of the technique to different types of nonstationary or nonlinear series. We hope to apply this methodology and theory to the problem of analyzing natural sound data collected over four years by the National Park Service (NPS) in about 20 of the 388 National Parks. The goal of this analysis is to estimate the proportion of manmade sound in the National Parks. Thus, we will use our method to estimate the changepoints in the sound, and then apply a classification technique to determine whether each segment is natural or manmade. Preliminary analyses using ideas from machine learning show promise.