
Sequential (Quickest) Change Detection: Classical Results and New Directions

Publisher: IEEE

Abstract:

Online detection of changes in stochastic systems, referred to as sequential change detection or quickest change detection, is an important research topic in statistics, signal processing, and information theory, and has a wide range of applications. This survey starts with the basics of sequential change detection, and then moves on to generalizations and extensions of sequential change detection theory and methods. We also discuss some new dimensions that emerge at the intersection of sequential change detection with other areas, along with a selection of modern applications and remarks on open questions.
Published in: IEEE Journal on Selected Areas in Information Theory (Volume: 2, Issue: 2, June 2021)
Page(s): 494 - 514
Date of Publication: 13 April 2021
Electronic ISSN: 2641-8770

SECTION I.

Introduction

The efficient detection of abrupt changes in the statistical behavior of streaming data is a classical and fundamental problem in signal processing and statistics. The abrupt change-point usually corresponds to a triggering event that could have catastrophic consequences if it is not detected in a timely manner. Therefore, the goal is to detect the change as quickly as possible, subject to false alarm constraints. Such problems have been studied under the theoretical framework of sequential (or quickest) change detection [160], [194], [215]. With an increasing availability of high-dimensional streaming data, sequential change detection has become a centerpiece for many real-world applications, ranging from monitoring power networks [37], Internet traffic [100], cyber-physical systems [142], sensor networks [164], social networks [152], [165], epidemic detection [17], scientific imaging [162], genomic signal processing [179], seismology [7], video surveillance [109], and wireless communications [95].

In various applications, the streaming data is high-dimensional and collected over networks, such as social networks, sensor networks, and cyber-physical systems. For this reason, the modern sequential change detection problem’s scope has been extended far beyond its traditional setting, often challenging the assumptions made by classical methods. These challenges include complex spatial and temporal dependence of the data streams, transient and dynamic changes, high-dimensionality, and structured changes, as explained below. They have fostered new advances in sequential change detection theory and methods in recent years.

  1. Complex data distributions. In modern applications, sequential data could have a complex spatial and temporal dependency, for instance, induced by the network structure [16], [68], [167]. In social networks, dependencies are usually due to interaction and information diffusion [116]: users in the social network have behavior patterns influenced by their past, while at the same time, each user in the network will be influenced by friends and connections. In sensor networks for river contamination monitoring [34], sensor observations tend to be spatially and temporally correlated.

  2. Data dynamics. The statistical behavior of sequential data is often non-stationary, particularly in the post-change regime due to the dynamic behavior of the anomaly that causes the change. For example, after a line outage in a power system, the system’s transient behavior is dominated by the generators’ inertial response, and the post-change statistical behavior can be modeled using a sequence of temporally cascaded transient phases [171].

  3. High-dimensionality. Sequential data in modern applications is usually high-dimensional. For example, in sensor networks, the Long Beach 3D seismic array consists of approximately 5300 seismic sensors that record data continuously for seismic activity detection and analysis. Changes in high-dimensional time series usually exhibit low-dimensional structures in the form of sparsity, low-rankness, and subset structures, which can be exploited to enhance the capability to detect weak signals quickly.

In this tutorial, our aim is to introduce standard methods and fundamental results in sequential change detection, along with recent advances. We also present new dimensions at the intersection of sequential change detection with other areas, as well as a selection of modern applications. We emphasize that our focus is on sequential change detection, where the goal is to detect the change from sequential data in real time and as soon as possible. Another important line of related research is offline change detection (e.g., [59], [188]), where the goal is to identify and localize changes in a data sequence retrospectively; this is not our focus here. Prior books and surveys on related topics include, for instance, change detection for dynamic systems [97], sequential analysis [98], [194], sequential change detection [19], [160], [215], Bayesian change detection [201], change detection assuming known pre- and post-change distributions [159] and using likelihood-based approaches [186], as well as time-series change detection [6].

The rest of the survey is organized as follows. In Section II, we present the basic problem setup and classical results. In Section III, we discuss several extensions and generalizations of the classical methods. In Section IV, we discuss new dimensions which intersect with sequential change detection, with some remarks on open questions. In Section V, we present some modern applications of sequential change detection. In Section VI, we make some concluding remarks.

SECTION II.

Classical Results

A. Problem Definition

In the sequential change detection problem, also known as the quickest change detection (QCD) problem [131], [160], [215], the aim is to detect a possible change in the data generating distribution of a sequence of observations $\{X_n, n = 1, 2, \ldots\}$. The initial distribution of the observations is the one corresponding to normal system operation. At some unknown time $\gamma$ (referred to as the change-point), due to some event, the distribution of the random observations changes. The goal is to detect the change as quickly as possible, subject to false-alarm constraints. We start by assuming that the observations are independent and identically distributed (i.i.d.) with probability density function (pdf) $f_0$ before and pdf $f_1$ after the change-point, respectively. We discuss generalizations to non-i.i.d. observations in Section III.

To motivate the design of algorithms for sequential change detection, we consider the example of detecting a change in the mean of the data generating distribution. In Fig. 1(a), we plot a sample path of observations that are distributed according to a normal distribution with zero mean and unit variance, $\mathcal N(0,1)$, before the change-point of 500, and $\mathcal N(0.1,1)$ after the change-point. As can be seen in Fig. 1(a), such a small mean shift cannot be detected through manual inspection of the samples. In Fig. 1(b), we plot the evolution path of a sequential change detection procedure, the CUSUM algorithm (which is discussed in detail in Section II-C2), applied to the observations in Fig. 1(a). It can be seen that the test statistic stays close to zero before the change and has a positive drift after the change. Therefore, the change can be detected by comparing the test statistic to a positive threshold $b$ (for instance, $b=2$) and raising an alarm when the test statistic exceeds the threshold for the first time. For the sample path in Fig. 1(a), this approach incurs a detection delay of 60 samples (if we take samples daily, this means a detection delay of 2 months; if the sampling rate is 60 samples per second, this means a detection delay of one second). A natural question is: can we do better, at least on average? Clearly, if we set a lower threshold, for instance $b=1$, we can detect the change much more quickly. However, this would result in a false alarm at $k=112$. This example illustrates the tradeoff between false alarms and detection delay, which is a central problem when designing sequential change detection procedures. The goal in sequential change detection theory is to find detection procedures that have guaranteed optimality properties in terms of this tradeoff.

Fig. 1. To motivate the need for sequential change detection procedures, we plot a sample path with samples distributed according to $\mathcal N(0, 1)$ before the change and $\mathcal N(0.1, 1)$ after the change. We set the change-point $\gamma = 500$. As illustrated in (a), such a small mean shift cannot be detected through manual inspection of the samples. In (b), we plot the evolution of the CUSUM algorithm (detailed in Section II-C2) corresponding to the observations in (a), which can detect the change quickly.

B. Mathematical Preliminaries

Sequential change detection is closely related to the problem of statistical hypothesis testing, in which observations, whose distribution depends on the hypothesis, are used to decide which of the hypotheses is true. For the special case of binary hypothesis testing, we have two hypotheses, the null hypothesis and the alternative hypothesis. The classic Neyman-Pearson Lemma [136] establishes the form of the optimal test for this problem. In particular, consider the case of a single observation $X$, and suppose the pdfs of $X$ under the null and alternative hypotheses are $f_0$ and $f_1$, respectively. Then, the test that minimizes the false negative error (Type-II error), subject to a constraint on the false positive error (Type-I error), compares the likelihood ratio $f_1(X)/f_0(X)$ to a threshold to decide which hypothesis is true. The likelihood ratio test is also optimal under other criteria such as Bayesian and minimax [131]. As we will see, the likelihood ratio also plays a key role in the development of sequential change detection algorithms.

The goal of sequential change detection is to design a stopping time on the observation sequence at which it is declared that a change has occurred. A stopping time is formally defined as follows:

Definition 1 (Stopping Time):

A stopping time with respect to a random sequence $\{X_n, n = 1, 2, \ldots\}$ is a random variable $\tau$ such that for each $n$, the event $\{\tau = n\} \in \sigma(X_1, \ldots, X_n)$, where $\sigma(X_1, \ldots, X_n)$ denotes the sigma-algebra generated by $(X_1, \ldots, X_n)$. Equivalently, the event $\{\tau = n\}$ is a function of only $X_1, \ldots, X_n$.

The main results on stopping times that are most useful for sequential change detection problems include Doob’s Optional Stopping Theorem [43] and Wald’s Identity [185].

A quantity that plays an important role in the performance of sequential change detection algorithms is the Kullback-Leibler (KL) divergence between two distributions.

Definition 2 (KL Divergence):

The KL divergence between two pdfs $f_1$ and $f_0$ is defined as $D(f_1 || f_0) = \int f_1(x) \log (f_1(x)/f_0(x)) \, dx$.

Note that $D(f_1 || f_0) \geq 0$, with equality if and only if $f_1 = f_0$ almost surely. It is usually assumed that $0 < D(f_1 || f_0) < \infty$.

Define the log-likelihood ratio for an observation $X$:
\begin{equation*} \ell(X) := \log \frac{f_1(X)}{f_0(X)}. \tag{1}\end{equation*}
A fundamental property of the log-likelihood ratio, which is useful for constructing sequential change detection algorithms, is that before the change, $n < \gamma$, the expected value of $\ell(X_n)$ is equal to $-D(f_0 || f_1) < 0$; and after the change, $n \geq \gamma$, the expected value of $\ell(X_n)$ is equal to $D(f_1 || f_0) > 0$. As will be seen later, the KL divergence between the pre- and post-change distributions is an important quantity that characterizes the tradeoff between the average detection delay and the false-alarm rate.
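As a worked illustration of this sign property, consider the Gaussian mean-shift setting of Fig. 1, with $f_0 = \mathcal N(0,1)$ and $f_1 = \mathcal N(\mu,1)$. The symbol $\mu$ for the post-change mean is introduced here for illustration only. A short calculation gives:

```latex
\begin{align*}
\ell(x) &= \log\frac{f_1(x)}{f_0(x)}
         = -\frac{(x-\mu)^2}{2} + \frac{x^2}{2}
         = \mu x - \frac{\mu^2}{2}, \\
\mathbb{E}_{f_0}[\ell(X)] &= -\frac{\mu^2}{2} = -D(f_0 || f_1), \\
\mathbb{E}_{f_1}[\ell(X)] &= \mu^2 - \frac{\mu^2}{2} = \frac{\mu^2}{2} = D(f_1 || f_0).
\end{align*}
```

For $\mu = 0.1$, the drift of the log-likelihood ratio is only $\pm 0.005$ per sample, which is small relative to the unit noise standard deviation. This is consistent with the observation that the shift in Fig. 1(a) is not visible by inspection.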

C. Common Sequential Change Detection Procedures

We now present several commonly used sequential change detection procedures, including the Shewhart chart, CUSUM, and Shiryaev-Roberts procedure, which enjoy certain optimality properties that we will make more precise later in Section II-D. These algorithms can be efficiently implemented in an online setting, which makes them useful in practice. We also briefly discuss some other sequential change detection procedures.

1) Shewhart Chart:

One of the earliest sequential change detection procedures is the Shewhart chart [180], [181], which is widely used in industrial quality control [130]. The Shewhart chart was first introduced for the Gaussian model and is based on comparing the current observation to a threshold. We consider the log-likelihood-based modification and generalization of the standard Shewhart chart, where we compute the log-likelihood ratio based on the current observation (or the current batch of observations) and compare it with a threshold (called the control limit) to make a decision about the change. This construction exploits the property of the log-likelihood ratio discussed in Section II-B and leads to the stopping time:
\begin{equation*} \tau_{\scriptscriptstyle \text{Sh}} = \inf \{n \geq 1: \ell(X_n) > b\},\end{equation*}
where $b$ is a pre-specified threshold. The Shewhart chart is widely used in practice due to its simplicity. In [155], Pollak and Krieger showed that the Shewhart chart enjoys the optimality property that it maximizes the probability of detecting the change at the time it occurs, subject to false-alarm constraints. However, the Shewhart chart may suffer from “information loss” because it ignores past observations in making a decision about the change. This leads to a performance loss under other criteria, such as the tradeoff between detection delay and false-alarm rate (see Section II-D).
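The log-likelihood-based Shewhart chart can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: it assumes a Gaussian mean shift with known variance, and the function names `gaussian_llr` and `shewhart_stopping_time` are hypothetical names chosen for this example.

```python
def gaussian_llr(x, mu1, mu0=0.0, sigma=1.0):
    """Log-likelihood ratio log f1(x)/f0(x) for N(mu1, sigma^2) vs N(mu0, sigma^2)."""
    return ((mu1 - mu0) * x - (mu1 ** 2 - mu0 ** 2) / 2.0) / sigma ** 2


def shewhart_stopping_time(xs, llr, b):
    """Return the first 1-indexed n with llr(x_n) > b, or None if no alarm is raised.

    Only the current observation is used, which is the source of the
    "information loss" discussed in the text.
    """
    for n, x in enumerate(xs, start=1):
        if llr(x) > b:
            return n
    return None


# Example (assumed data): with mu1 = 1 the statistic is x - 0.5,
# so only the fourth sample (3.0 -> 2.5) exceeds b = 2.
alarm = shewhart_stopping_time(
    [0.1, -0.3, 0.2, 3.0, 0.0], lambda x: gaussian_llr(x, mu1=1.0), b=2.0
)
```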

2) Cumulative Sum (CUSUM) Procedure:

The CUSUM procedure, first introduced by Page [149], addresses the problem of “information loss” in the Shewhart chart. The CUSUM procedure uses past observations and thus can achieve a significant performance gain, especially when the change is small. Although the CUSUM procedure was developed heuristically, it was later shown in [96], [122], [132], [170] that it has very strong optimality properties, which we will discuss further in Section II-D1.

The CUSUM procedure utilizes the properties of the cumulative log-likelihood ratio sequence (with $S_0 = 0$):
\begin{equation*} S_n = \sum_{k=1}^{n} \ell(X_k).\end{equation*}
Before the change occurs, the statistic has a negative drift because the expected value of $\ell(X_k)$ before the change is negative. After the change, it has a positive drift because the expected value of $\ell(X_k)$ after the change is positive. Thus, $S_n$ roughly attains its minimum at the change-point $\gamma$. The CUSUM procedure is then constructed to detect this change in the drift of $S_n$. Specifically, the exceedance of $S_n$ with respect to its past minimum is taken and compared with a threshold $b > 0$:
\begin{equation*} \tau_{\scriptscriptstyle \text{C}} = \inf \left\{ n \geq 1: W_n = \left( S_n - \min_{0 \leq k \leq n} S_k \right) \geq b \right\}. \tag{2}\end{equation*}
The CUSUM statistic can be rewritten as:
\begin{equation*} W_n = \max_{0 \leq k \leq n} \sum_{i=k+1}^{n} \ell(X_i) = \max_{1 \leq k \leq n+1} \sum_{i=k}^{n} \ell(X_i). \tag{3}\end{equation*}
Note that the maximization over all possible $\gamma = k$ corresponds to plugging in a maximum likelihood estimate of the unknown change-point location in the log-likelihood ratio of the observations to form the CUSUM statistic. It can be shown that $W_n$ can be computed recursively:
\begin{equation*} W_n = \left( W_{n-1} + \ell(X_n) \right)^{+}, \quad W_0 = 0,\end{equation*}
where $(x)^{+} = \max\{x, 0\}$. This recursion enables the efficient online implementation of the CUSUM procedure in practice.
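The equivalence between the definition in (2) and the recursion can be checked numerically. The sketch below takes a list of precomputed log-likelihood ratio values as input; the function names are hypothetical, and the input values are made up for illustration.

```python
def cusum_path(llrs):
    """CUSUM statistics W_1..W_n via the recursion W_n = max(W_{n-1} + l_n, 0), W_0 = 0."""
    w, path = 0.0, []
    for l in llrs:
        w = max(w + l, 0.0)
        path.append(w)
    return path


def cusum_path_from_definition(llrs):
    """Same statistics computed as S_n - min_{0 <= k <= n} S_k with S_0 = 0, as in (2)."""
    s, s_min, path = 0.0, 0.0, []
    for l in llrs:
        s += l
        s_min = min(s_min, s)
        path.append(s - s_min)
    return path


def cusum_stopping_time(llrs, b):
    """Return the first 1-indexed n with W_n >= b, or None if the threshold is never reached."""
    for n, w in enumerate(cusum_path(llrs), start=1):
        if w >= b:
            return n
    return None


# Assumed toy input: both forms give [0.0, 1.0, 0.0, 1.5, 2.5].
example = [-0.5, 1.0, -2.0, 1.5, 1.0]
recursive = cusum_path(example)
direct = cusum_path_from_definition(example)
```

The recursive form needs constant memory and constant work per sample, whereas evaluating the maximum in (3) directly requires revisiting all past samples.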

3) Shiryaev-Roberts Procedure:

The maximum likelihood interpretation of the CUSUM procedure is closely related to another popular algorithm in the literature, called the Shiryaev-Roberts (SR) procedure. In the SR procedure, the maximum in (3) is replaced by a sum and the log-likelihood ratio is replaced by the likelihood ratio. The detection statistic for the SR procedure is then defined as:
\begin{equation*} T_n := \sum_{1 \leq k \leq n} \prod_{i=k}^{n} e^{\ell(X_i)}, \tag{4}\end{equation*}
and the corresponding stopping time is defined as
\begin{equation*} \tau_{\scriptscriptstyle \text{SR}} = \inf \left\{ n \geq 1: T_n \geq b \right\}.\end{equation*}
The SR statistic can also be computed recursively:
\begin{equation*} T_n = \left( 1 + T_{n-1} \right) e^{\ell(X_n)}, \quad T_0 = 0.\end{equation*}
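A minimal sketch of the SR statistic follows, again operating on precomputed log-likelihood ratio values. The function names are hypothetical. The direct evaluation of (4) is included only to check the recursion and is not meant for practical use.

```python
import math


def sr_path(llrs):
    """SR statistics T_1..T_n via T_n = (1 + T_{n-1}) * exp(l_n), T_0 = 0."""
    t, path = 0.0, []
    for l in llrs:
        t = (1.0 + t) * math.exp(l)
        path.append(t)
    return path


def sr_path_from_definition(llrs):
    """T_n = sum_{k=1}^{n} prod_{i=k}^{n} exp(l_i), evaluated directly as in (4)."""
    path = []
    for n in range(1, len(llrs) + 1):
        path.append(sum(math.exp(sum(llrs[k - 1:n])) for k in range(1, n + 1)))
    return path


def sr_stopping_time(llrs, b):
    """Return the first 1-indexed n with T_n >= b, or None if no alarm is raised."""
    for n, t in enumerate(sr_path(llrs), start=1):
        if t >= b:
            return n
    return None


# With uninformative samples (log-likelihood ratio 0), T_n simply counts: 1, 2, 3.
flat = sr_path([0.0, 0.0, 0.0])
```

Note that $T_n$ grows exponentially after a change, so a long-running implementation may prefer to track $\log T_n$; that variant is not shown here.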

D. Optimality

We now briefly summarize optimality results in the existing literature for the above procedures. We begin by considering the non-Bayesian setting, where we do not assume a prior on the change-point $\gamma$, and then consider the Bayesian setting, where the change-point is assumed to follow a certain distribution.

A fundamental problem in sequential change detection is to optimize the tradeoff between the false-alarm rate and the average detection delay, as illustrated in Section II-A using the example in Fig. 1. Controlling the false-alarm rate is commonly achieved by setting an appropriate threshold on a test statistic such as the one in (2). But the threshold also affects the average detection delay. A larger threshold incurs fewer false alarms but leads to a larger detection delay, and vice versa.

1) Minimax Optimality:

In non-Bayesian settings, the change-point is assumed to be a deterministic and unknown variable. In this case, the average run length ($\mathsf{ARL}$) to false alarm is generally used as a performance measure for false alarms:
\begin{equation*} \mathsf{ARL}(\tau) = \mathbb{E}_{\infty}[\tau], \tag{5}\end{equation*}
where $\mathbb P_{\infty}$ is the probability measure on the sequence of observations when the change never occurs, and $\mathbb{E}_{\infty}$ is the corresponding expectation. Its reciprocal, the false-alarm rate ($\mathsf{FAR}$), is also commonly used:
\begin{equation*} \mathsf{FAR}(\tau) = \frac{1}{\mathsf{ARL}(\tau)} = \frac{1}{\mathbb{E}_{\infty}[\tau]}. \tag{6}\end{equation*}
$\mathsf{FAR}$ can also be interpreted as the rate at which false alarms occur in the pre-change regime if we repeat the change detection procedure after each false alarm. Denote the set of stopping times that satisfy a constraint $\alpha$ on the $\mathsf{FAR}$ by:
\begin{equation*} \mathcal{D}_{\alpha} = \left\{ \tau: \mathsf{FAR}(\tau) \leq \alpha \right\}. \tag{7}\end{equation*}
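When no closed form is available, $\mathsf{ARL}$ and $\mathsf{FAR}$ can be estimated by simulation. The following Monte Carlo sketch does this for the CUSUM procedure under an assumed Gaussian mean-shift model with $f_0 = \mathcal N(0,1)$ and $f_1 = \mathcal N(\mu_1, 1)$. The function names, the number of runs, the per-run seeding scheme, and the truncation limit `max_steps` are all choices made for this illustration.

```python
import random


def cusum_false_alarm_time(mu1, b, rng, max_steps=1_000_000):
    """Run CUSUM on pre-change N(0, 1) data and return the (false) alarm time."""
    w = 0.0
    for n in range(1, max_steps + 1):
        x = rng.gauss(0.0, 1.0)
        # Gaussian log-likelihood ratio for N(mu1, 1) versus N(0, 1).
        w = max(w + mu1 * x - mu1 ** 2 / 2.0, 0.0)
        if w >= b:
            return n
    return max_steps  # truncated run; biases the estimate downward


def estimate_arl(mu1, b, runs=200, seed=0):
    """Monte Carlo estimate of ARL = E_infinity[tau] for the CUSUM procedure."""
    total = 0
    for r in range(runs):
        # One generator per run, so different thresholds see the same sample paths.
        rng = random.Random(seed * 1_000_003 + r)
        total += cusum_false_alarm_time(mu1, b, rng)
    return total / runs


arl_low = estimate_arl(mu1=0.75, b=1.0)
arl_high = estimate_arl(mu1=0.75, b=2.0)
far_high = 1.0 / arl_high  # estimated false-alarm rate, as in (6)
```

Because both thresholds are evaluated on identical sample paths, the estimate for the larger threshold is never smaller than the estimate for the smaller one, which mirrors the monotone effect of the threshold on false alarms.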

Finding a uniformly powerful test that minimizes the delay over all possible values of the change-point $\gamma$, subject to a $\mathsf{FAR}$ constraint, is generally intractable. Therefore, it is more tractable to pose the problem in the so-called minimax setting. There are two essential measures of the detection delay in the minimax setting, due to Lorden [122] and Pollak [154], respectively.

Lorden considers the supremum of the average detection delay conditioned on the worst possible realizations. In particular, Lorden defines:
\begin{equation*} \mathsf{WADD}(\tau) = \underset{n \geq 1}{\sup}~\mathop{\mathrm{ess\,sup}}~\mathbb{E}_{n}\left[ (\tau - n)^{+} | X_1, \dots, X_{n-1} \right], \tag{8}\end{equation*}
where $\mathbb P_{n}$ denotes the probability measure on the observations when the change occurs at time $n$, and $\mathbb{E}_{n}$ denotes the corresponding expectation. This leads to Lorden’s formulation:
\begin{equation*} \text{minimize } \mathsf{WADD}(\tau) ~\text{subject to } \mathsf{FAR}(\tau) \leq \alpha. \tag{9}\end{equation*}
For the i.i.d. setting, Lorden showed that Page’s CUSUM procedure given in (2) is asymptotically optimal as $\alpha \rightarrow 0$. It was later shown in [132] and [170] that a slight modification of the CUSUM procedure, with $W_n = (W_{n-1})^{+} + \ell(X_n)$, is exactly optimal for (9) for all $\alpha > 0$.

Although the CUSUM procedure is exactly optimal under Lorden’s formulation, $\mathsf{WADD}(\tau)$ is a pessimistic measure of detection delay since it considers the worst-case pre-change samples. An alternative measure of detection delay was suggested by Pollak [154]:
\begin{equation*} \mathsf{CADD}(\tau) = \underset{n \geq 1}{\sup}~\mathbb{E}_{n}\left[ \tau - n | \tau \geq n \right], \tag{10}\end{equation*}
for all stopping times $\tau$ for which the expectation is well-defined. It is easy to see that for any stopping time $\tau$, $\mathsf{WADD}(\tau) \geq \mathsf{CADD}(\tau)$, and therefore, Pollak’s formulation is less pessimistic.

In general, it may be challenging to exactly solve the problem in (9) and the corresponding problem defined using $\mathsf{CADD}$ in (10). For this reason, asymptotically optimal solutions for the above problems are often investigated in the literature. Specifically, a stopping time $\tau$ is said to be first-order asymptotically optimal if it satisfies:
\begin{equation*} \frac{\mathsf{CADD}(\tau)}{\inf_{\tau \in \mathcal{D}_{\alpha}} \mathsf{CADD}(\tau)} \rightarrow 1, ~~\text{as } \alpha \rightarrow 0;\end{equation*}
it is second-order asymptotically optimal if $\mathsf{CADD}(\tau)$ is within a constant of the best possible delay over the class $\mathcal{D}_{\alpha}$:
\begin{equation*} \mathsf{CADD}(\tau) - \inf_{\tau \in \mathcal{D}_{\alpha}} \mathsf{CADD}(\tau) = O(1);\end{equation*}
and it is third-order asymptotically optimal if such a constant goes to 0 as $\alpha \to 0$:
\begin{equation*} \mathsf{CADD}(\tau) - \inf_{\tau \in \mathcal{D}_{\alpha}} \mathsf{CADD}(\tau) = o(1).\end{equation*}
These notions can also be defined similarly for the problem in (9) defined using $\mathsf{WADD}$.

Pollak’s formulation has been studied for i.i.d. data in [154] and [197]. The first-order asymptotic optimality for Lorden’s formulation can also be extended to Pollak’s formulation. To show this, Lorden in [122] established a universal lower bound for $\mathsf{WADD}$ and Lai in [96] proved the corresponding lower bound for $\mathsf{CADD}$:

Theorem 1 (Lower Bound for $\mathsf{CADD}$ [96]):

As $\alpha \rightarrow 0$,
\begin{equation*} \inf_{\tau \in \mathcal{D}_{\alpha}} \mathsf{CADD}(\tau) \geq \frac{|\log \alpha|}{D(f_1 || f_0)} \left( 1 + o(1) \right).\end{equation*}
It can be shown that the CUSUM procedure with a threshold $b = |\log \alpha|$ is first-order asymptotically optimal for both Lorden’s and Pollak’s formulations. In particular, as $\alpha \to 0$,
\begin{equation*} \mathsf{CADD}(\tau_{\scriptscriptstyle \text{C}}) = \mathsf{WADD}(\tau_{\scriptscriptstyle \text{C}}) \sim \frac{|\log \alpha|}{D(f_1 || f_0)},\end{equation*}
where $\sim$ means the ratio of the quantities on its two sides approaches 1 as $\alpha \rightarrow 0$.
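To give a feel for the magnitudes involved, the leading term $|\log \alpha| / D(f_1 || f_0)$ can be evaluated for a Gaussian mean shift, where $D(f_1 || f_0) = (\mu_1 - \mu_0)^2 / (2\sigma^2)$. The sketch below is illustrative only: the function names and the value $\alpha = 10^{-3}$ are assumptions, and the result is a first-order approximation that ignores the $o(1)$ correction.

```python
import math


def kl_gaussian_mean_shift(mu0, mu1, sigma=1.0):
    """D(f1 || f0) for f1 = N(mu1, sigma^2) and f0 = N(mu0, sigma^2)."""
    return (mu1 - mu0) ** 2 / (2.0 * sigma ** 2)


def first_order_delay(alpha, kl):
    """Leading term |log alpha| / D(f1 || f0) of the delay lower bound in Theorem 1."""
    return abs(math.log(alpha)) / kl


alpha = 1e-3                      # assumed FAR constraint
b = abs(math.log(alpha))          # CUSUM threshold b = |log alpha|, about 6.91
kl_large = kl_gaussian_mean_shift(0.0, 0.75)   # 0.28125
kl_small = kl_gaussian_mean_shift(0.0, 0.1)    # 0.005
delay_large = first_order_delay(alpha, kl_large)  # roughly 24.6 samples
delay_small = first_order_delay(alpha, kl_small)  # roughly 1382 samples
```

The comparison shows how strongly the achievable delay depends on the KL divergence: shrinking the mean shift from 0.75 to 0.1 increases the first-order delay by a factor of about 56.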

The SR procedure is also asymptotically optimal, and it was shown in [197] that by setting the threshold $b = 1/\alpha$,
\begin{equation*} \mathsf{CADD}(\tau_{\scriptscriptstyle \text{SR}}) = \frac{|\log \alpha|}{D(f_1 || f_0)} + \xi + o(1),\end{equation*}
where $\xi$ is a constant that can be characterized using nonlinear renewal theory [230] (details omitted here).

Finally, results in [133], [155], [196] show that the Shewhart chart is optimal for the criterion of maximizing the probability of detecting the change upon its occurrence subject to the $\mathsf{FAR}$ constraint. A more precise statement of this optimality property is as follows. Let the post-change density be denoted by $f_{\theta}(x)$, where $\theta \in \Theta$ is the post-change parameter. The Shewhart chart as defined earlier becomes the following stopping time:
\begin{equation*} \tau_{\scriptscriptstyle \text{Sh}} = \inf \left\{ n \geq 1: \frac{f_{\theta}(X_n)}{f_0(X_n)} > b \right\},\end{equation*}
where $b$ is a pre-specified threshold. It is shown that when the threshold $b$ is selected such that $\mathsf{FAR}(\tau_{\scriptscriptstyle \text{Sh}}) = \alpha$, then $\tau_{\scriptscriptstyle \text{Sh}}$ is the optimal solution to the following optimization problem:
\begin{equation*} \text{maximize } \inf_{1 \leq n < \infty} \mathbb P_{n}^{\theta}(\tau = n | \tau \geq n) ~\text{subject to } \mathsf{FAR}(\tau) \leq \alpha, \tag{11}\end{equation*}
where $\mathbb P_{n}^{\theta}$ denotes the probability when the change happens at $n$ with $\theta$ being the post-change parameter. Moreover, it was shown in [196] that if the likelihood ratio $f_{\theta}(X)/f_0(X)$ is a monotone non-decreasing function of a statistic $S(X)$, then the Shewhart chart is equivalent to $\tau_{\scriptscriptstyle \text{Sh}} = \inf\{n \geq 1: S(X_n) > b\}$, and when $b$ is selected such that $\mathsf{FAR}(\tau_{\scriptscriptstyle \text{Sh}}) = \alpha$, the Shewhart chart is uniformly optimal in $\theta \in \Theta$ in the sense of solving (11) for all $\theta \in \Theta$.

In summary, both the CUSUM and SR procedures are asymptotically optimal with respect to Lorden’s formulation and Pollak’s formulation. As a function of the detection delay, the $\mathsf{FAR}$ decays to zero exponentially with exponent $D(f_1 || f_0)$. We demonstrate the theory using an example in Fig. 2 by plotting the tradeoff curve between the $\mathsf{CADD}$ and $-\log(\mathsf{FAR})$ for the CUSUM procedure. Note that the curve has a slope of approximately $1/D(f_1 || f_0)$, which is consistent with the theory.

Fig. 2. Tradeoff curve between $\mathsf{CADD}$ and $-\log(\mathsf{FAR})$ for the CUSUM algorithm. The pre-change distribution is $f_0 = \mathcal{N}(0,1)$, and the post-change distribution is $f_1 = \mathcal{N}(0.75,1)$. The slope of the curve is approximately $1/D(f_1 || f_0)$.

Some more optimality results are summarized as follows. Under Pollak’s criterion, it was shown in [197] that the SR algorithm is second-order asymptotically optimal, and that the SRP algorithm (Pollak’s version of the SR algorithm that starts from a quasi-stationary distribution of the SR statistic) is third-order asymptotically optimal (as was first established in [154]). More importantly, in [197], it was proved that the SR-r procedure that starts from a specially selected fixed point $r$ is third-order optimal. In [158], it was shown that SR-r is strictly optimal for $\mathsf{CADD}$ in some special cases. We also note that the (generalized) Shewhart chart is optimal for the criterion of maximizing the probability of detecting a change subject to false alarm constraints.

2) Bayesian Optimality:

In the Bayesian setting, it is assumed that the change-point is a random variable $\Gamma$ taking values on the non-negative integers, with probability mass function $\pi_n = \mathbb{P}\{\Gamma = n\}$. For a stopping time $\tau$, define the average detection delay ($\mathsf{ADD}$) and the probability of false alarm ($\mathsf{PFA}$) as follows:
\begin{align*} \mathsf{ADD}(\tau) &= \mathbb{E}\left[ (\tau - \Gamma)^{+} \right] = \sum_{n=0}^{\infty} \pi_n \mathbb{E}_{n}\left[ (\tau - \Gamma)^{+} \right], \tag{12}\\ \mathsf{PFA}(\tau) &= \mathbb{P}(\tau < \Gamma) = \sum_{n=0}^{\infty} \pi_n \mathbb{P}_{n}(\tau < \Gamma). \tag{13}\end{align*}
In Bayesian sequential change detection, the goal is to minimize $\mathsf{ADD}$ subject to a constraint on $\mathsf{PFA}$. Shiryaev [183] formulated the Bayesian sequential change detection problem as follows:
\begin{equation*} \text{minimize } \mathsf{ADD}(\tau) ~\text{subject to } \mathsf{PFA}(\tau) \leq \alpha. \quad (\text{Shiryaev}) \tag{14}\end{equation*}
The prior on the change-point $\Gamma$ is usually assumed to be a geometric distribution with parameter $0 < \rho < 1$,
\begin{equation*} \pi_n = \mathbb{P}\{\Gamma = n\} = \rho (1-\rho)^{n-1} \mathbb I_{\{n \geq 1\}}, \quad \pi_0 = 0, \tag{15}\end{equation*}
where $\mathbb I$ is the indicator function. The justification for this assumption is that the geometric distribution is memoryless. Moreover, it leads to a tractable formulation and convenient optimal solutions to the Bayesian problem in (14), as we discuss next.

The detection statistic of the Shiryaev algorithm is the posterior probability that the change has taken place given the observations so far. Denote by $X_1^n = (X_1, \ldots, X_n)$ the observations up to time $n$, and by
\begin{equation*} p_n = \mathbb{P}\left( \Gamma \leq n \; | \; X_1^n \right) \tag{16}\end{equation*}
the a posteriori probability at time $n$ that the change has taken place given the observations up to time $n$. It then follows from Bayes’ rule that $p_n$ can be updated recursively:
\begin{equation*} p_{n+1} = \frac{\tilde{p}_n e^{\ell(X_{n+1})}}{\tilde{p}_n e^{\ell(X_{n+1})} + \left( 1 - \tilde{p}_n \right)}, \tag{17}\end{equation*}
where $\tilde{p}_n = p_n + (1 - p_n) \rho$, and $p_0 = 0$. Then the Shiryaev algorithm is defined by comparing $p_n$ with a given threshold $b_{\alpha}$:
\begin{equation*} \tau_{\scriptscriptstyle \text{S}} = \inf \left\{ n \geq 1: p_n \geq b_{\alpha} \right\}, \tag{18}\end{equation*}
where $b_{\alpha} \in (0,1)$ is chosen such that the false alarm constraint, $\mathsf{PFA}(\tau_{\scriptscriptstyle \text{S}}) \leq \alpha$, is satisfied.
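The posterior recursion in (17) translates directly into code. The sketch below operates on precomputed log-likelihood ratio values and assumes the geometric prior in (15); the function names are hypothetical, and the choice of threshold is left to the caller rather than calibrated to a target $\mathsf{PFA}$.

```python
import math


def shiryaev_posterior_path(llrs, rho):
    """Posterior probabilities p_1..p_n from recursion (17), with p_0 = 0."""
    p, path = 0.0, []
    for l in llrs:
        # Prior probability that the change has occurred by the next time step.
        p_tilde = p + (1.0 - p) * rho
        num = p_tilde * math.exp(l)
        p = num / (num + (1.0 - p_tilde))
        path.append(p)
    return path


def shiryaev_stopping_time(llrs, rho, threshold):
    """Return the first 1-indexed n with p_n >= threshold, or None if no alarm is raised."""
    for n, p in enumerate(shiryaev_posterior_path(llrs, rho), start=1):
        if p >= threshold:
            return n
    return None


# Sanity check: uninformative samples (log-likelihood ratio 0) leave only the prior,
# so p_n = 1 - (1 - rho)^n, i.e. 0.5 and 0.75 for rho = 0.5.
prior_only = shiryaev_posterior_path([0.0, 0.0], rho=0.5)
```

The sanity check reflects the structure of the update: each step first propagates the geometric prior and then reweights by the likelihood ratio of the new sample.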

Theorem 2 (Optimal Bayesian Procedure [183], [184]):

When the threshold $b_{\alpha}$ is selected such that $\mathsf{PFA}(\tau_{\scriptscriptstyle \text{S}}) = \alpha$, the Shiryaev algorithm in (18) is Bayesian optimal for (14).

An equivalent form of the Shiryaev statistic can be developed using the idea of the likelihood ratio test. This builds a connection to the earlier SR statistic defined in (4), and it reveals useful insights about the nature of the procedure. Consider two hypotheses: “H_{1}: \Gamma \leq n ” and “H_{0}: \Gamma > n ”. Denote by R_{n,\rho } = p_{n}/[\rho (1-p_{n})] the scaled likelihood ratio between the two hypotheses averaged over the change-point. It then follows that R_{n,\rho } can be updated recursively as:\begin{equation*} R_{n+1,\rho } = \frac {1+R_{n,\rho }}{1-\rho } e^{\ell (X_{n+1})}, \quad R_{0,\rho } = 0.\tag{19}\end{equation*} The Shiryaev stopping time \tau _{\scriptscriptstyle \text {S}} in (18) can then be rewritten as a comparison of R_{n,\rho } with a threshold. We remark here that if we set \rho = 0 in the recursion (19), then the Shiryaev statistic reduces to the SR statistic in (4).
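For completeness, the step from (17) to (19) can be sketched as follows, using only the definitions of \tilde {p}_{n} and R_{n,\rho } given above. The last line gives the equivalent threshold on R_{n,\rho } .

```latex
\begin{align*}
\frac{p_{n+1}}{1-p_{n+1}}
  &= \frac{\tilde{p}_{n}}{1-\tilde{p}_{n}}\, e^{\ell(X_{n+1})}
  && \text{(from (17))} \\
\tilde{p}_{n} = p_{n} + (1-p_{n})\rho, \quad
1-\tilde{p}_{n} &= (1-p_{n})(1-\rho) \\
\frac{\tilde{p}_{n}}{1-\tilde{p}_{n}}
  &= \frac{1}{1-\rho}\left(\frac{p_{n}}{1-p_{n}} + \rho\right)
   = \frac{\rho\,(R_{n,\rho} + 1)}{1-\rho} \\
R_{n+1,\rho} = \frac{p_{n+1}}{\rho\,(1-p_{n+1})}
  &= \frac{1+R_{n,\rho}}{1-\rho}\, e^{\ell(X_{n+1})} \\
p_{n} \geq b_{\alpha}
  &\iff R_{n,\rho} \geq \frac{b_{\alpha}}{\rho\,(1-b_{\alpha})}
\end{align*}
```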

A generalized Shewhart chart is also Bayesian optimal, as shown in [155], in the sense that it minimizes the expected loss where the loss function is \mathbb I_{\{\tau \neq \Gamma \}} , assuming that the change-point \Gamma follows a geometric prior and the parameter \theta of the post-change distribution follows a known prior G . This result was generalized in [196, Th.5.1]. Moreover, both the CUSUM and SR procedures are first-order asymptotically optimal for the Bayesian setting when the prior has a heavy tail, or when the change-point is geometrically distributed with a small enough parameter.

3) Evaluating the Performance Metrics:

In the definition of the {\mathsf {WADD}} metric (8) and the {\mathsf {CADD}} metric (10), it appears that we need to take the supremum over all possible past observations and over all possible change-points. However, for the CUSUM and SR procedures, and for some other algorithms, one can show that the supremum over all possible change-points in {\mathsf {WADD}} and {\mathsf {CADD}} is achieved when the change occurs at time 1:\begin{align*} {\mathsf {CADD}}(\tau _{\scriptscriptstyle \text {C}})=&{\mathsf {WADD}}(\tau _{\scriptscriptstyle \text {C}}) = \mathbb {E}_{1} \left [{ \tau _{\scriptscriptstyle \text {C}}- 1}\right],\\ {\mathsf {CADD}}(\tau _{\scriptscriptstyle \text {SR}})=&{\mathsf {WADD}}(\tau _{\scriptscriptstyle \text {SR}}) = \mathbb {E}_{1} \left [{ \tau _{\scriptscriptstyle \text {SR}}- 1}\right].\end{align*} Therefore, the {\mathsf {CADD}} and the {\mathsf {WADD}} can be conveniently evaluated by setting the change-point to 1, without “taking the supremum”.
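The identity above suggests a simple Monte Carlo recipe: simulate data that are post-change from the first sample and average \tau _{\scriptscriptstyle \text {C}} - 1 . The sketch below does this for a Gaussian mean shift. The recursion W_{n} = \max (W_{n-1}, 0) + \ell (X_{n}) with W_{0} = 0 is assumed here as the form of the CUSUM statistic, and the function name, the shift size, and the threshold are hypothetical choices for illustration.

```python
import numpy as np

def cusum_delay_change_at_one(mu, b, n_runs=2000, seed=0):
    """Monte Carlo estimate of E_1[tau_C - 1] for a N(0,1) -> N(mu,1) shift."""
    rng = np.random.default_rng(seed)
    delays = np.empty(n_runs)
    for r in range(n_runs):
        w, n = 0.0, 0
        while True:
            n += 1
            x = rng.normal(mu, 1.0)                 # change at time 1: all samples post-change
            w = max(w, 0.0) + mu * x - 0.5 * mu**2  # assumed CUSUM recursion
            if w >= b:
                break
        delays[r] = n - 1
    return delays.mean()

estimated_delay = cusum_delay_change_at_one(mu=1.0, b=3.0)
```

A single simulated scenario is enough here because, for these procedures, no search over change-points or past observations is needed.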

E. Other Sequential Change Detection Procedures

1) Mixture and Generalized Likelihood Ratio (GLR) Statistics:

The CUSUM and SR procedures require full knowledge of pre- and post-change distributions to obtain the log-likelihood ratio \ell (X) used in computing the test statistics. In practice, the post-change distribution f_{1} may be unknown. In the parametric setting, the post-change distribution can be parameterized using f_{\theta } , where \theta \in \Theta is the unknown parameter. Two commonly used methods for the situation here, which corresponds to the problem of composite hypothesis testing, are the generalized likelihood ratio (GLR) approach and the mixture approach. In the GLR approach, a supremum over \theta \in \Theta is taken in constructing the test statistic. In particular, the test statistic for the GLR-CUSUM algorithm is given by:\begin{equation*} W_{n}^{\text {G}} = \max _{1\leq k \leq n+1 } \sup _{\theta \in \Theta } \sum _{i=k}^{n} \ell _{\theta } (X_{i}),\tag{20}\end{equation*} where \ell _{\theta } (X) = \log (f_{\theta }(X)/f_{0} (X)) . Performance analysis of the GLR-CUSUM algorithm for one-parameter exponential families can be found in [122], [123]. A major drawback of the GLR approach is that the corresponding GLR statistic (e.g., the one given in (20)) cannot be computed recursively in time, except in some special cases (e.g., when the parameter set \Theta has finite cardinality). To reduce the computational cost, a window-limited GLR approach was developed in [229] and generalized in [96], [99]. Window-limited versions of the GLR algorithm can be shown to be asymptotically optimal in certain cases if the window size is carefully chosen as a function of {\mathsf {FAR}} .
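To make the window-limited idea concrete, consider the special case f_{0} = N(0,1) and f_{\theta } = N(\theta ,1) with \Theta = \mathbb {R} (an assumption made only for this sketch). The inner supremum in (20) then has the closed form \sup _{\theta } \sum _{i=k}^{n} \ell _{\theta }(X_{i}) = (\sum _{i=k}^{n} X_{i})^{2} / (2(n-k+1)) , and the outer maximum is restricted to the most recent `window` candidate change-points. The function name and the windowing convention are hypothetical.

```python
import numpy as np

def window_limited_glr(x, window):
    """Window-limited GLR statistic for a N(0,1) -> N(theta,1) shift, theta unknown.

    For each time n, maximizes (sum_{i=k}^n X_i)^2 / (2 (n-k+1)) over the
    candidate change-points k with n - window < k <= n.
    """
    x = np.asarray(x, dtype=float)
    stats = np.zeros(len(x))
    for n in range(len(x)):
        start = max(0, n - window + 1)
        seg = x[start:n + 1][::-1]             # newest sample first
        sums = np.cumsum(seg)                  # sums[m-1] = sum of the last m samples
        m = np.arange(1, len(seg) + 1)
        stats[n] = np.max(sums**2 / (2.0 * m)) # closed-form sup over theta
    return stats

stats = window_limited_glr([0.1, -0.3, 0.2, 1.4, 0.9, 1.7], window=4)
```

Each update costs O(window) operations, in contrast to the O(n) cost of evaluating (20) over all candidate change-points.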

The mixture method replaces the supremum over \theta \in \Theta by a weighted average. For example, the mixture-CUSUM statistic is computed as:\begin{equation*} W_{n}^{\text {m}} = \max _{1\leq k \leq n+1 } \log \int _{\Theta }\prod _{i=k}^{n}\frac {f_{\theta } (X_{i})}{f_{0}(X_{i})} \omega (\theta) d\theta,\tag{21}\end{equation*} where \omega (\theta) is a weight function that integrates (sums) to 1 over \Theta . Note that, like the GLR test statistic, the mixture test statistic cannot be computed recursively in general. It was shown in [196] that the mixture approach can result in first-order asymptotically optimal tests for practically any prior for both the i.i.d. and non-i.i.d. cases. In [189], the optimal prior was established such that the resultant mixture SR procedure is asymptotically optimal in a certain stronger sense.

2) EWMA:

Note that the CUSUM and SR procedures can achieve a significant gain in performance when compared to the Shewhart chart by making use of past observations, i.e., CUSUM and SR have memory. The exponentially weighted moving average (EWMA) chart is another type of sequential change detection procedure that employs past observations. The EWMA detection statistic was originally defined as Z_{n} = \lambda X_{n} + (1-\lambda) Z_{n-1} , where \lambda \in (0,1] is a pre-specified constant, with the aim of detecting a mean shift. The EWMA can be generalized to Z_{n} = \lambda \ell (X_{n}) + (1-\lambda) Z_{n-1} to detect more general shifts in distribution. Thus, Z_{n} is a weighted moving average of all past information with weights decreasing exponentially in time. In its original form, the EWMA chart is simple to implement and does not require any prior knowledge of the pre- and post-change distributions. A performance comparison of the EWMA chart and the CUSUM and SR procedures is given in [157].
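The EWMA recursion is short enough to write out directly. In the sketch below, the initial value `z0` is an assumption, since the starting point Z_{0} is not specified above, and the function name is hypothetical. Passing raw observations gives the original chart, while passing the values \ell (X_{n}) gives the generalized version.

```python
import numpy as np

def ewma(values, lam, z0=0.0):
    """Return Z_1, ..., Z_N with Z_n = lam * v_n + (1 - lam) * Z_{n-1}, Z_0 = z0."""
    z = z0
    out = []
    for v in values:
        z = lam * v + (1.0 - lam) * z
        out.append(z)
    return np.array(out)

# Unrolling the recursion shows the exponentially decaying weights:
# Z_n = sum_{j=1}^{n} lam * (1 - lam)^(n - j) * v_j + (1 - lam)^n * z0.
z = ewma([0.2, -0.1, 0.4, 2.1, 1.8], lam=0.3)
```

An alarm is then raised the first time Z_{n} exceeds a control limit, which has to be calibrated separately.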

SECTION III.

Generalizations and Extensions

A. General Asymptotic Theory for Non-i.i.d. Data

There has been a considerable amount of effort to generalize the optimality results for sequential change detection to the non-i.i.d. setting. Lai [96] initiated the development of a general minimax asymptotic theory for both Lorden’s and Pollak’s formulations, while Tartakovsky and Veeravalli [206] initiated the development of a general Bayesian asymptotic theory.

1) General Minimax Asymptotic Theory:

Under the minimax setting, Lai in [96] obtained a general lower bound for non-i.i.d. data on the {\mathsf {CADD}} (and hence on the {\mathsf {WADD}} ) for any stopping time that satisfies the constraint that {\mathsf {FAR}} is no larger than \alpha . It was then shown that an extension of the CUSUM procedure (2) to the non-i.i.d. setting achieves this lower bound asymptotically as \alpha \to 0 . There are also works investigating non-i.i.d. data under some specific settings, e.g., multi-sensor slope change detection [28], linear regression models [63], [221], generalized autoregressive conditional heteroskedasticity (GARCH) models [22], non-stationary time series [42], general stochastic models [195], [200], and hidden Markov models [62]. We refer to [196] for more recent developments on this topic.

We now present a generalized CUSUM procedure for non-i.i.d. data. In this setting, conditional distributions are used to compute the likelihood ratios. In the pre- and post-change regimes, the conditional distribution of X_{n} given X_{1}^{n-1} is given by f_{0,n}(X_{n}|X_{1}^{n-1}) and f_{1,n}(X_{n}|X_{1}^{n-1}) , respectively. Define the conditional log-likelihood ratio and the CUSUM statistic, respectively, as:\begin{equation*} Y_{i} = \log \frac {f_{1,i}\left ({X_{i}|X_{1}^{i-1}}\right)}{f_{0,i}\left ({X_{i}|X_{1}^{i-1}}\right)},\quad \text {and } C_{n} = \max _{1\leq k\leq n+1} \sum _{i=k}^{n} Y_{i}.\end{equation*} Then the stopping time for the generalized CUSUM is defined as:\begin{equation*} \tau _{\scriptscriptstyle \text {G}}= \inf \left \{{ n\geq 1: C_{n} \geq b }\right \}.\tag{22}\end{equation*} Note that the generalized CUSUM (for non-i.i.d. data) takes a form similar to the original CUSUM (for i.i.d. data), except that the log-likelihood ratio is replaced with the conditional log-likelihood ratio.
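Because the maximum defining C_{n} includes the empty sum (k = n+1), the statistic satisfies C_{n} = \max (0, C_{n-1} + Y_{n}) with C_{0} = 0 . The sketch below applies this to a hypothetical first-order autoregressive model, chosen only for illustration: X_{n} = a X_{n-1} + \varepsilon _{n} before the change and X_{n} = a X_{n-1} + \mu + \varepsilon _{n} after it, with standard Gaussian noise and X_{0} = 0 , so that Y_{n} = \mu (X_{n} - a X_{n-1}) - \mu ^{2}/2 .

```python
def generalized_cusum(x, a, mu, b):
    """Generalized CUSUM stopping time (22) for an assumed AR(1) mean-shift model.

    Returns the first n >= 1 with C_n >= b, or None if no alarm is raised.
    """
    c = 0.0      # C_0 = 0
    prev = 0.0   # X_0 = 0 (assumed initial condition)
    for n, xn in enumerate(x, start=1):
        resid = xn - a * prev                # innovation under the pre-change model
        y = mu * resid - 0.5 * mu**2         # conditional log-likelihood ratio Y_n
        c = max(0.0, c + y)                  # C_n = max(0, C_{n-1} + Y_n)
        prev = xn
        if c >= b:
            return n
    return None

alarm_time = generalized_cusum([2.0, 3.0, 3.5, 3.75], a=0.5, mu=1.0, b=4.0)
```

The only difference from the i.i.d. recursion is that each increment depends on the past through the conditional densities.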

The minimax optimality of the generalized CUSUM for non-i.i.d. data was established in [96]. Under some regularity conditions, by setting the threshold b=|\log \alpha | , we have \tau _{\scriptscriptstyle \text {G}}\in { \mathcal {D}}_{\alpha } . If there exists I such that \begin{equation*} \underset {m\leq t }{\mathrm {\max }} \frac {1}{t}\sum _{i=n}^{n+m} Y_{i} \to I \quad \text {a.s. } \mathbb {P}_{n}, ~~\text {as } t \to \infty \quad \forall n,\tag{23}\end{equation*} and the convergence is complete in the sense that \sum _{n=1}^{\infty }\mathbb P_{1}\left({|(1/n) \sum _{i=1}^{n} Y_{i} - I| \geq \epsilon }\right) < \infty for all \epsilon >0 , then as \alpha \to 0 :\begin{align*} {\mathsf {CADD}}(\tau _{\scriptscriptstyle \text {G}})\sim&{\mathsf {WADD}}(\tau _{\scriptscriptstyle \text {G}}) \sim \underset {\tau \in { \mathcal {D}}_{\alpha }}{\mathrm {\inf }}~{\mathsf {WADD}}(\tau) \\\sim&\underset {\tau \in { \mathcal {D}}_{\alpha }}{\mathrm {\inf }}~{\mathsf {CADD}}(\tau) \sim \frac {|\log \alpha |}{I},\tag{24}\end{align*} where the constant I>0 plays a role similar to that of the KL divergence in the i.i.d. setting.

2) General Bayesian Asymptotic Theory:

Under the Bayesian setting, when the samples conditioned on the change-point are non-i.i.d., it is generally difficult to find an exact solution to the Shiryaev problem in (14). Tartakovsky and Veeravalli [206] showed that the Shiryaev algorithm is asymptotically optimal as \alpha \to 0 , under some regularity conditions on the pre- and post-change distributions.

Similar to the i.i.d. case, we can define the posterior probability p_{n} that the change has occurred by time n given all previous samples, using the same expression as (16). The Shiryaev algorithm for the non-i.i.d. setting is then defined in the same way as in (18). Note that the recursion in (17) may not hold for a general distribution for \Gamma . However, if the change-point \Gamma is geometrically distributed, a recursive expression for p_{n} can still be derived. Define \begin{equation*} d = - \lim _{n\to \infty } \frac {\log \mathbb {P} (\Gamma > n)}{n},\end{equation*} which captures the decay rate of the tail probability of change-point \Gamma ’s prior distribution as the sample size n increases. When \Gamma is “heavy-tailed”, d=0 , and when \Gamma has an “exponential tail”, d>0 . For example, when the prior distribution is geometric with parameter \rho as defined in (15), d=|\log (1-\rho)| . If there exists I such that \begin{equation*} \frac {1}{t}\sum _{i=n}^{n+t} Y_{i} \to I \quad \text {a.s. } \mathbb {P}_{n} ~~\text {as } t \to \infty \quad \forall n,\tag{25}\end{equation*} and some additional conditions on the rates of convergence are satisfied (see [206] for the details), then the Shiryaev algorithm in (18) with a threshold b_{\alpha }=1-\alpha is asymptotically optimal for the Bayesian optimization problem in (14) as \alpha \to 0 [206]:\begin{equation*} {\mathsf {ADD}}(\tau _{\scriptscriptstyle \text {S}}) \sim \inf _{\tau: \mathsf {PFA}(\tau)\leq \alpha } {\mathsf {ADD}}(\tau)\sim \frac {|\log \alpha |}{I + d}.\tag{26}\end{equation*} Note that in [206], a general result for the m -th moment of the delay was developed. Here, for simplicity, we only presented the result for m=1 .
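The value of d quoted for the geometric prior can be checked in two lines from (15), and substituting it into (26) shows how the prior enters the first-order delay. The approximation in the last line is the standard Taylor expansion for small \rho and is included only as an illustration.

```latex
\begin{align*}
\mathbb{P}(\Gamma > n)
  &= \sum_{k=n+1}^{\infty} \rho\,(1-\rho)^{k-1} = (1-\rho)^{n}, \\
d &= -\lim_{n\to\infty} \frac{n \log(1-\rho)}{n} = |\log(1-\rho)|, \\
\mathsf{ADD}(\tau_{\mathrm{S}})
  &\sim \frac{|\log \alpha|}{I + |\log(1-\rho)|}
  \approx \frac{|\log \alpha|}{I + \rho}
  \quad \text{for small } \rho .
\end{align*}
```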

B. Change-of-Measure for Accurate ARL Approximations

For CUSUM and SR procedures with i.i.d. samples, it may be relatively easy to evaluate their performance (such as the {\mathsf {ARL}} ) both theoretically and numerically, as discussed in Section II-D3. However, in many settings such as those involving non-i.i.d. observations, GLR statistics [188], and non-parametric statistics [115], it may be challenging to develop exact analytical expressions for the {\mathsf {ARL}} (or its reciprocal, the {\mathsf {FAR}} ). In these situations, one has to resort to onerous numerical simulations to obtain a threshold for a target {\mathsf {ARL}} . To tackle this problem, techniques based on extremes in random fields have been developed [238], from which one can obtain accurate approximations to the {\mathsf {ARL}} for many problems.

1) Using Change-of-Measure to Analyze the {\mathsf{ARL}} :

The main idea here is to relate finding {\mathsf {ARL}} to finding the tail probability of the maximum of a random field. To obtain a more accurate approximation of the {\mathsf {ARL}} , an alternative probability measure is considered, under which false alarms are more likely to occur. This is analogous to “importance sampling”, but it is more involved since the alternative probability measure is usually a mixture of distributions.

The analysis usually involves two steps. First, we aim to find the probability \mathbb P_{\infty }\{\tau \leq m\} , for a large constant m>0 and stopping time \tau (the first time the detection statistic exceeds the threshold b ). Finding this probability is challenging because \{\tau \leq m\} is a rare event under the pre-change regime, especially when the threshold b is large (this is the asymptotic scenario that we are interested in). Therefore, the change-of-measure technique plays an important role by considering an alternative measure under which \{\tau \leq m\} happens with a much higher probability. More specifically, we choose the alternative measure such that the expectation of the detection statistic equals the threshold b . Then, using the local central limit theorem and the local behavior of the correlated random field, we can obtain an analytical expression for the probability of \{\tau \leq m\} under the alternative measure. The probability under the alternative measure is then converted back to the probability under the original measure through Mill’s ratio. The rigorous mathematical derivations can be found in [238].

Second, we will relate the above probability to the {\mathsf {ARL}} , leveraging the fact that the stopping time \tau as threshold b\rightarrow \infty is asymptotically exponentially distributed [4], [187]. Although this fact only holds strictly for stopping times for algorithms such as the CUSUM and SR when observations are i.i.d. [156], this method has been widely used and is verified to be highly accurate in practice (see examples in [28], [115], [188], [236]). Thus, for a large m , \mathbb {P}_{\infty } \{\tau \leq m\} \sim 1-e^{-\lambda _{b} m} , where \lambda _{b} is the parameter of the exponential distribution. By definition, the mean of the exponential distribution is 1/\lambda _{b} , which corresponds to the {\mathsf {ARL}} .
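The second step amounts to inverting the exponential approximation: given an estimate of \mathbb {P}_{\infty }\{\tau \leq m\} , solve for \lambda _{b} and take its reciprocal. A minimal sketch follows; the function name and the numerical values are hypothetical.

```python
import math

def arl_from_tail_probability(p_m, m):
    """Approximate the ARL as 1/lambda_b, where P(tau <= m) = 1 - exp(-lambda_b * m).

    p_m : estimate of the pre-change probability that an alarm occurs by time m
    m   : horizon over which that probability was evaluated
    """
    if not 0.0 < p_m < 1.0:
        raise ValueError("p_m must lie strictly between 0 and 1")
    lam = -math.log1p(-p_m) / m   # lambda_b = -log(1 - p_m) / m
    return 1.0 / lam

# When p_m is small, the ARL is close to m / p_m.
approx_arl = arl_from_tail_probability(0.02, 200)
```

This is only as accurate as the exponential approximation itself, which, as noted above, is not guaranteed in general.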

2) Example: Analyzing MMD-Based Sequential Change Detection Procedure:

Below, we illustrate the change-of-measure technique by analyzing the non-parametric kernel-based maximum mean discrepancy (MMD) statistics (details can be found in [115]). The kernel MMD divergence, which measures the distance between two arbitrary distributions, is widely adopted in signal processing and machine learning. Given two sets of samples X := \{x_{1}, \ldots, x_{n}\} generated i.i.d. from a distribution f_{0} and Y := \{y_{1}, \ldots, y_{n}\} generated i.i.d. from a distribution f_{1} , an unbiased estimator of the MMD between f_{0} and f_{1} is the following:\begin{align*} \text {MMD}(X, Y)=&\frac {1}{n(n-1)} \sum _{i\neq j} \left \{{ k(x_{i}, x_{j}) + k\left ({y_{i}, y_{j}}\right)}\right. \\&\qquad \qquad \qquad \left.{-\,\, k\left ({x_{i}, y_{j}}\right) - k\left ({x_{j}, y_{i}}\right)}\right \},\end{align*} where k(\cdot, \cdot) is the kernel associated with a reproducing kernel Hilbert space (RKHS), e.g., the Gaussian kernel. Intuitively, the MMD statistic is small when f_{0} is similar to f_{1} , and is large otherwise.
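The estimator above can be written compactly with kernel matrices. The sketch below assumes a Gaussian kernel k(a,b) = \exp (-\|a-b\|^{2}/(2h^{2})) with a hypothetical bandwidth h , and samples stored as arrays of shape (n, d); the function names are illustrative.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth**2))

def mmd_unbiased(x, y, bandwidth=1.0):
    """Unbiased MMD estimate between two samples of equal size n >= 2."""
    n = len(x)
    if len(y) != n or n < 2:
        raise ValueError("x and y must have the same size n >= 2")
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # h[i, j] = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    h = kxx + kyy - kxy - kxy.T
    np.fill_diagonal(h, 0.0)                 # the sum runs over i != j only
    return h.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
same = mmd_unbiased(rng.normal(0, 1, (200, 1)), rng.normal(0, 1, (200, 1)))
shifted = mmd_unbiased(rng.normal(0, 1, (200, 1)), rng.normal(3, 1, (200, 1)))
```

Because the estimator is unbiased, it can take small negative values when the two samples come from the same distribution.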

The sequential change detection procedure based on the MMD statistic is then defined as follows [115]. At each time t , we treat the most recent B samples, denoted by X_{t-B+1}^{t} := \{X_{t-B+1}, \ldots, X_{t}\} , as the test block (B > 0 is a pre-specified parameter). Then we sample N blocks of size B from the “reference” data generated from the pre-change distribution, denoted by \{\tilde X_{1},\ldots,\tilde X_{N}\} . We compute an average MMD statistic of all the reference blocks with respect to the test block:\begin{equation*} U_{t} = \frac {1} N \sum _{i=1}^{N} \text {MMD}\left ({\tilde X_{i}, X_{t-B+1}^{t}}\right).\end{equation*} Define Z_{t}' := U_{t}/\sqrt {\text {var}[U_{t}]} as the standardized detection statistic, where the variance \text {var}[U_{t}] can be found in closed-form and can be estimated conveniently from data [115]. The MMD-based procedure stops when the standardized MMD statistic exceeds a threshold b :\begin{equation*} \tau _{\scriptscriptstyle \text {M}}= \inf \left \{{t~:~ Z_{t}'> b}\right \}.\end{equation*} This corresponds to a generalized type of Shewhart chart.

Theorem 3 ({\mathsf {ARL}} of MMD-based Procedure [115]):

Let B > 0 . When b\rightarrow \infty , the {\mathsf {ARL}} of the stopping time \tau _{\scriptscriptstyle \text {M}} , \mathbb E_{\infty }[\tau _{\scriptscriptstyle \text {M}}] , is given by:\begin{equation*} \frac {e^{b^{2}/2}}{b} \left \{{\frac {2B-1}{\sqrt {2\pi } B(B-1)} \nu \left ({b \sqrt {\frac {2(2B-1)}{B(B-1)}}}\right)}\right \}^{-1}(1+o(1)),\end{equation*} where \nu (\cdot) is a special function whose definition can be found in [185].

We present the main step of the proof of Theorem 3 to illustrate the change-of-measure technique. First, note that the event \{\tau \leq m\} is the same as the event that the maximum of the detection statistic has exceeded the threshold b at some point before m , i.e., \{\sup _{2\leq t\leq m}Z_{t}' \geq b\} , and \begin{align*}&\mathbb P_{\infty }\left \{{\sup _{2\leq t\leq m}Z_{t}' \geq b}\right \} \\&\;= \mathbb E_{\infty }\left \{{ \frac {\sum _{t=2}^{m} e^{\xi _{t}}}{\sum _{s=2}^{m}e^{\xi _{s}}}\mathbb I_{\left \{{ \sup _{2\leq t\leq m}Z_{t}' \geq b }\right \}} }\right \} \\&\;= e^{-b^{2}/2}\sum _{t=2}^{m} \mathbb E_{t} \left \{{R_{t} e^{- \left [{\xi _{t} -b^{2}/2+ \log M_{t}}\right] }\mathbb I_{\left \{{\xi _{t} - b^{2}/2+ \log M_{t} \geq 0}\right \}}}\right \},\end{align*} where \xi _{t} = bZ_{t}' - b^{2}/2 is the log-likelihood ratio between the changed measure \mathbb E_{t}[X] = \mathbb E_{\infty }[X e^{\xi _{t}}] and the original measure \mathbb E_{\infty } , M_{t} = \max _{s} e^{\xi _{s} - \xi _{t}} , S_{t} = \sum _{s} e^{\xi _{s} - \xi _{t}} , and the so-called Mill’s ratio R_{t} = M_{t}/S_{t} . The result in Theorem 3 is then obtained by characterizing the properties of the local field \{\xi _{s} - \xi _{t}\} and the global term \xi _{t} - b^{2}/2 (details omitted here).

The numerical example in Fig. 3 demonstrates that the threshold b (to achieve a target {\mathsf {ARL}} ) obtained using the theoretical approximation in Theorem 3 is consistent with that obtained from simulations, especially after a skewness correction. This example demonstrates that the theoretical approximation of the {\mathsf {ARL}} obtained using the change-of-measure technique is of high accuracy, and thus can help avoid computationally expensive simulations to calibrate the procedure.

Fig. 3. Accuracy of {\mathsf {ARL}} approximations, obtained by “change-of-measure”, for the sequential MMD-based procedure: comparison of the thresholds obtained by simulation and from Theorem 3.

C. Non-Stationary and Multiple Changes

In various modern applications, for instance, line outage detection in power systems [171] and stochastic power supply control in data centers [173], the change is not stationary. There can be a sequence of multiple changes: one followed by another. Below, we review some recent advances in sequential detection of dynamic changes.

1) Sequential Change Detection Under Transient Dynamics:

In classical sequential change detection formulations [19], [160], [194], [215], the statistical behavior of the observations is characterized by one pre-change distribution and one post-change distribution (known or unknown). In other words, the statistical behavior after the change is stationary. This assumption may be too restrictive for many practical applications with more involved statistical behavior after the change-point.

An example of a problem in which the observations are non-stationary after the change is sequential change detection under transient dynamics, which was studied in [171], [173], [174], [251]. Specifically, the pre-change distribution does not change to a persistent post-change distribution instantaneously; instead, the data pass through several transient phases, each associated with a distinct data-generating distribution, before the persistent phase begins. The goal is to detect the change as quickly as possible, either during the transient phases or during the persistent phase. This problem is fundamentally different from detecting a transient change (see, e.g., [51], [52], [64]), where the system goes back to the pre-change mode after a single transient phase, and the goal is to detect the change within the transient phase. The problem is also related to sequential change detection in the presence of a nuisance change, where the nuisance change can be modeled as a transient phase. However, in that setting an alarm should be raised only if the critical change occurs [103].

Two algorithms were proposed and investigated in [171], [251] for the minimax setting: the dynamic-CUSUM (D-CUSUM) and the weighted dynamic-CUSUM (WD-CUSUM), where the change-point and the transient durations are assumed to be unknown and deterministic. The basic idea is to construct a generalized-likelihood-based algorithm by taking the supremum over the unknown change-point and the durations of the transient phases. It was shown in [171], [251] that the D-CUSUM and WD-CUSUM test statistics can be updated recursively, and thus are computationally efficient. In [251], it was demonstrated that both algorithms are adaptive to the unknown transient dynamics, even though the durations of the transient phases are unknown and are not used in implementing the algorithms. Moreover, both the D-CUSUM (under certain conditions) and the WD-CUSUM algorithms were shown to be first-order asymptotically optimal in [251]. The Bayesian setting was investigated in [174], where the change-point and the durations of transient phases are assumed to be geometrically distributed. The optimal test was constructed, and a computationally efficient alternative test based on thresholding the posterior probability that the change has occurred was also proposed.

2) Sequential Detection of Moving Anomaly:

Existing studies on sequential change detection in networks usually assume that the change is persistent once it affects a node. However, there are scenarios where the change may not necessarily be persistent at a particular node; instead, it is persistent across the network as a whole, e.g., a moving anomaly in a sensor network. In this case, existing approaches using CUSUM statistics from each node, e.g., [55], [66], [126], [255], cannot be applied. Recently, the problem of sequential moving anomaly detection in networks was studied in [175], [176]. Specifically, after an anomaly emerges in the network, one node is affected by the anomaly at each time instant and receives data from a post-change distribution. The anomaly dynamically moves across the network with an unknown trajectory, and the node that it affects changes with time. Two approaches have been proposed to model the trajectory of the anomaly: the hidden Markov model [176], and the worst-case approach [175], which we discuss in the following.

The first approach (hidden Markov model) [176] models the anomaly’s trajectory as a Markov chain, and thus the samples are generated according to a hidden Markov model. The advantage of this model is that it takes into consideration the network’s topology, i.e., that the anomaly only moves from a node to one of its neighbors. In [176], a windowed GLR based algorithm was constructed and was shown to be first-order asymptotically optimal. Alternative algorithms were also designed with performance guarantees, including the dynamic SR procedure, recursive change-point estimation, and a mixture CUSUM algorithm.

The second approach (worst-case approach) [175] assumes that the anomaly’s trajectory is unknown but deterministic and considers the worst-case performance over all possible trajectories. A CUSUM-type procedure was constructed. The main idea is to use the mixture likelihood to construct a test statistic, which is further used to build a procedure of the CUSUM-type. This procedure was shown to be exactly optimal in [175] when the sensors are homogeneous. This idea has been further generalized to solve the sequential moving anomaly detection problem with heterogeneous sensors and has been shown to be first-order asymptotically optimal [172].

3) Multiple Change Detection:

A related line of research is multiple change detection in the offline setting, which aims to estimate multiple change-points from observations in a retrospective study. Various methods were proposed to estimate the number and locations of change-points, including hierarchical-clustering-based methods [125], binary segmentation type methods [9], [40], [41], [60], [61], [220], (penalized) least-squares methods [23], [106], [107], [108], [240], the Schwarz criterion [239], kernel-based algorithms [8], [69], and so on. Another line of work aims to reduce the computational complexity of the multiple change detection methods, such as [71], [89], [169]. We refer to [210] for a recent review on multiple change detection. Some offline multiple change detection algorithms can motivate the development of their online versions.

4) Decentralized and Asynchronous Change Detection in Networks:

When the information for detection is distributed across a network of sensors, detection problems fall under the umbrella of distributed (or decentralized) detection [31], [212], [214], [216]. In the decentralized setting, each sensor sends messages to the fusion center based on the observations it has received so far. The fusion center may provide feedback to sensors and make the final decision. The problem of decentralized sequential change detection in distributed sensor systems was introduced in [217], considering the observation model where all sensors are affected by the change at the same time. There have been a number of papers on the topic since then, see, e.g., [127], [205], [207]. A more recent (and practical) perspective is that the change may affect sensors with delay, i.e., different sensors may observe the change at different times, which we will present in the following.

In the case of multiple data streams, the change may happen asynchronously for different sensors. When we desire to detect the first onset of change, it is proposed in [66] to monitor each data stream by local CUSUM procedures and raise the alarm when any sensor raises an alarm. The sum of local CUSUM statistics has been considered in [126] and was shown to be asymptotically optimal. The problem where the change propagates from one sensor to the next with known Markov dynamics after the change was studied in [164], and an asymptotically optimal test was developed. A recent procedure proposed in [231] finds an optimal combination of local data streams accounting for their delays in being affected by the change, which can boost the signal-to-noise ratio and reduce the detection delay especially when the signal is weak.

In [255], the problem of sequentially detecting a significant change (i.e., when at least \eta sensors are affected by the change) was investigated. The event is dynamic, i.e., different nodes are affected at different times. Instead of using a scan statistic, which is computationally costly, a spartan-CUSUM (S-CUSUM) algorithm was constructed, which compares the sum of the smallest N-\eta +1 local CUSUM statistics to a threshold, where N is the total number of nodes. For the case where the change propagates along network edges, a network-CUSUM (N-CUSUM) algorithm was further constructed based on the idea that the affected nodes shall induce a connected subgraph. The N-CUSUM algorithm was also shown to be first-order asymptotically optimal, and performs much better than the S-CUSUM numerically. The decentralized setting where there is no fusion center and nodes can only communicate with their neighbors was studied in [111], [253], and the approach is based on a novel combination of the alternating direction method of multipliers (ADMM) and average consensus approaches. In [94], a Bayesian approach is used to model the dynamic change with an unknown propagation pattern, where the goal is to detect the change when it first emerges in the network; an optimal solution structure is derived using a dynamic programming framework.
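The S-CUSUM combining rule described above is easy to state in code. The sketch below covers only the fusion step, taking the current local CUSUM statistics as given; the function name is hypothetical, and how the local statistics are updated is left to the local procedures. One way to read the rule: if fewer than \eta nodes are affected, at least N-\eta +1 nodes are unaffected and their statistics stay small, whereas once \eta or more nodes are affected, any N-\eta +1 nodes must include an affected one.

```python
import numpy as np

def s_cusum_statistic(local_stats, eta):
    """Sum of the smallest N - eta + 1 local CUSUM statistics (N = number of nodes)."""
    w = np.asarray(local_stats, dtype=float)
    n_nodes = len(w)
    if not 1 <= eta <= n_nodes:
        raise ValueError("eta must satisfy 1 <= eta <= N")
    return np.sort(w)[: n_nodes - eta + 1].sum()

# Four nodes; an alarm is raised when the statistic exceeds a chosen threshold.
value = s_cusum_statistic([5.0, 0.0, 3.0, 1.0], eta=2)
```

With \eta = 1 the rule reduces to the sum of all local statistics, and with \eta = N it reduces to the minimum.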

D. Robust Sequential Change Detection

Many classical procedures (for instance, CUSUM and SR) require exact knowledge of the pre- and post-change distributions. However, in real-world scenarios, the actual data distributions may be complex and different from what we have assumed. There can be adversarial attacks that significantly perturb the data distributions. This can lead to performance degradation of the optimal procedures. How to make the procedures more robust in the presence of model mismatch is the topic of robust sequential change detection.

1) Robustness to Model Uncertainties:

There have been many efforts to make the detection procedure more robust to model uncertainties. One approach is to treat the pre- and post-change distributions as belonging to some parametric family with unknown parameters in uncertainty sets and then form the GLR-based test, as we discussed earlier in Section II-E. Another approach to developing good tests in the presence of model uncertainties is to use minimax robustness as the criterion, as is done in the seminal work of Huber on robust hypothesis testing [82], [83]. The solution to the robust hypothesis testing problem usually relies on finding the least favorable distributions (LFDs) within the uncertainty classes, with the likelihood ratio of these distributions used in constructing the robust tests. It can be shown that LFDs exist for uncertainty classes satisfying a certain joint stochastic boundedness (JSB) condition [218]. The problem of minimax robust sequential change detection was explored in [213], in which an exactly optimal solution was obtained for uncertainty classes satisfying the JSB condition under a generalized Lorden criterion. An extension of this result to asymptotic minimax robust sequential change detection is studied in [129], where a weaker notion of stochastic boundedness is introduced.

A robust CUSUM algorithm is developed in [27] by making a connection to convex optimization, which is particularly useful for the high-dimensional setting and leads to a tractable formulation. For instance, assuming the covariance matrix lies in an uncertainty set centered around a nominal value, the problem of finding LFDs can be cast as solving a semidefinite program and can be solved efficiently.

2) Robustness to Adversarial Attacks:

The problem of sequential change detection in sensor networks in the presence of adversarial attacks [102] was investigated in [20], [56]. In the presence of Byzantine attacks, an adversary may modify observations arbitrarily to delay the detection of a change and to increase the false alarm rate. In [20], it is assumed that the change affects all but one compromised sensor, and the detection strategy is to raise a global alarm once two local CUSUMs exceed the threshold. In [56], a more general setting was investigated, where an unknown subset of sensors can be compromised. Sequential detection strategies were designed by waiting until L local CUSUM statistics exceed the threshold (simultaneously or not) or by comparing the sum of the L smallest CUSUM statistics to a threshold. With a proper choice of L , the above approaches are robust to Byzantine attacks.
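
As a rough illustration of the fusion rules described above, the following Python sketch runs a local CUSUM at each sensor and combines the local statistics either by counting how many exceed the threshold simultaneously or by summing the L smallest ones. The Gaussian mean-shift model, the function names, and the parameter values are assumptions made for illustration; they are not taken from [20] or [56].

```python
def gaussian_llr(x, mu0=0.0, mu1=1.0, sigma=1.0):
    """Log-likelihood ratio of N(mu1, sigma^2) against N(mu0, sigma^2) at x (assumed model)."""
    return (mu1 - mu0) / sigma ** 2 * (x - 0.5 * (mu0 + mu1))


def cusum_step(w, x):
    """One step of the CUSUM recursion W_t = max(0, W_{t-1} + LLR(x_t))."""
    return max(0.0, w + gaussian_llr(x))


def count_rule(stats, L, threshold):
    """Alarm when at least L local CUSUM statistics exceed the threshold at the same time."""
    return sum(w > threshold for w in stats) >= L


def sum_smallest_rule(stats, L, threshold):
    """Alarm when the sum of the L smallest local CUSUM statistics exceeds the threshold."""
    return sum(sorted(stats)[:L]) > threshold


def first_alarm(streams, L, threshold, rule):
    """streams: one list of observations per sensor, all of equal length.
    Returns the first alarm time (1-indexed) or None if no alarm is raised."""
    stats = [0.0] * len(streams)
    for t in range(len(streams[0])):
        stats = [cusum_step(w, s[t]) for w, s in zip(stats, streams)]
        if rule(stats, L, threshold):
            return t + 1
    return None


# Sensor 0 is compromised and always reports a large value; sensors 1 and 2
# see a mean shift after time 5.
attacker = [10.0] * 10
honest = [0.0] * 5 + [2.0] * 5
streams = [attacker, honest, honest]
```

In this toy setting, `count_rule` with L = 1 is triggered by the compromised sensor alone, whereas L = 2 requires at least one honest sensor to agree; the sum of the two smallest statistics ignores the largest (possibly corrupted) one.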

E. Data-Efficient Sequential Change Detection

There is usually a cost associated with making observations in practical engineering applications, e.g., the power consumption in sensor networks. An extension of Shiryaev’s formulation (Section II-D2) was investigated in [11] by including an additional constraint on the average number of observations taken before the change. The cost of observations after the change is included in the detection delay. Specifically, whether to take an observation at time t is controlled by an on-off binary control variable S_{t} , and S_{t} is a function of all the information available up to time t-1 . A data-efficient Shiryaev (DE-Shiryaev) algorithm was constructed in [11], and was shown to be asymptotically optimal as \mathsf {PFA} goes to zero. The DE-Shiryaev algorithm is also shown to have good observation cost-delay tradeoff curves: for moderate values of \mathsf {PFA} , for Gaussian observations, the delay of the algorithm is within 10% of the Shiryaev delay even when the observation cost is reduced by more than 50%. Furthermore, the DE-Shiryaev algorithm is substantially better than the standard approach of fractional sampling scheme, where the Shiryaev algorithm is used and where the observations to be skipped are determined a priori in order to meet the observation constraint. A minimax formulation was further proposed in [13] to address the scenario when a prior on the change-point is not available. The DE-CUSUM algorithm developed in [13] is shown to be asymptotically optimal as {\mathsf {FAR}} goes to zero, and significantly outperforms fractional sampling in simulations. Extensions to composite post-change distributions were studied in [14], and generalizations to distributed sensor networks were explored in [15].
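
The on-off observation control can be illustrated with a CUSUM-type recursion in which the statistic is allowed to become negative and samples are skipped while it is below zero. The sketch below is written in the spirit of the data-efficient idea only: the undershoot floor `h`, the recovery rate `mu`, the Gaussian model, and the function names are assumptions for illustration, and the exact algorithms and their optimality conditions should be taken from [11], [13].

```python
def gaussian_llr(x, mu0=0.0, mu1=1.0, sigma=1.0):
    """Log-likelihood ratio of N(mu1, sigma^2) against N(mu0, sigma^2) at x (assumed model)."""
    return (mu1 - mu0) / sigma ** 2 * (x - 0.5 * (mu0 + mu1))


def data_efficient_cusum(xs, threshold, mu=0.25, h=2.0):
    """Returns (alarm_time, observations_used); alarm_time is None if no alarm.

    The decision to observe at time t depends only on the statistic computed
    from information up to time t-1, matching the role of the control S_t.
    """
    d = 0.0
    used = 0
    for t, x in enumerate(xs, start=1):
        if d >= 0.0:
            # S_t = 1: take the observation; the statistic may undershoot down to -h.
            used += 1
            d = max(d + gaussian_llr(x), -h)
        else:
            # S_t = 0: skip the observation and let the statistic recover toward 0.
            d = min(d + mu, 0.0)
        if d > threshold:
            return t, used
    return None, used
```

With `xs = [0.0] * 8 + [2.0] * 6` and `threshold = 4.0`, only three of the eight pre-change samples are observed in this sketch, while all samples are used once the statistic starts to grow after the change.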

F. High-Dimensional Streaming Data

High-dimensional data usually have low-dimensional structures, such as sparsity and low-rankness, which can be leveraged to achieve improved detection performance and computational efficiency. Meanwhile, missing data is very common for high-dimensional streaming data. In this section, we review recent advances in these directions.

1) Sparse Change in Multiple Data Streams:

For multiple independent streams of data, a mixture procedure was developed in [236] to monitor parallel streams for a change-point that affects only a subset of them (usually sparse). Both the subset being affected and the post-change distribution are unknown. The mixture model hypothesizes that each sensor is affected with a small probability \varrho \in (0, 1) by the change, where \varrho is pre-specified. The mixture detection statistic at time t is defined as \begin{equation*} \sum _{n=1}^{N} \log \left [{1-\varrho + \varrho f_{1}\left ({X_{t}^{(n)}}\right)/f_{0}\left ({X_{t}^{(n)}}\right)}\right],\end{equation*} where X_{t}^{(n)} denotes the observation at the n -th sensor and at time t , and N is the number of sensors. Another efficient global monitoring scheme was proposed in [227] by combining hard thresholding with linear shrinkage estimator for the post-change parameters. In recent works [121], [247], a similar problem was tackled by running local detection procedures and using the sum of the shrinkage transformation of local detection statistics as a global detection statistic. This sum-shrinkage framework was further extended in [248] to be more robust to outliers using the Box-Cox transformation. Recent work [53] studied change detection in regimes where the dimension tends to infinity and the length of the sequence grows with the dimension.
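
The mixture statistic above is straightforward to evaluate once f_{0} and f_{1} are specified. The following sketch assumes, purely for illustration, that f_{0} is the standard normal density and f_{1} is a unit-variance normal density with a known mean `mu1`; the function name and these modeling choices are not from [236].

```python
import math


def mixture_statistic(xs, rho, mu1=1.0):
    """Evaluate sum_n log(1 - rho + rho * f1(x_n) / f0(x_n)) at a single time t.

    xs  : observations X_t^(n) across the N sensors at time t
    rho : assumed fraction of affected sensors (the paper takes it in (0, 1))
    mu1 : assumed post-change mean, with f0 = N(0, 1) and f1 = N(mu1, 1)
    """
    total = 0.0
    for x in xs:
        likelihood_ratio = math.exp(mu1 * x - 0.5 * mu1 ** 2)
        total += math.log(1.0 - rho + rho * likelihood_ratio)
    return total
```

A small `rho` damps the contribution of sensors whose likelihood ratio is below one, so unaffected streams add little noise to the global statistic; at the boundary value `rho = 1` the expression reduces to the plain sum of log-likelihood ratios.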

2) Subspace Change Detection:

In many applications, the change in high-dimensional data covariance structure can be represented as a low-rank change. For instance, in seismic signal detection [232], a similar waveform is observed at a subset of sensors after the change. Such a change can be modeled as the covariance matrix shifts from an identity matrix to a “spiked” covariance model [88]. The subspace-CUSUM procedure was developed in [232], in which the unknown subspace in the post-change spiked model is estimated sequentially and further used to obtain the log-likelihood ratio statistic. A CUSUM procedure for detecting switching subspace (from a known subspace to another target subspace) was studied in [86].
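
To make the spiked model concrete, consider zero-mean Gaussian data whose covariance shifts from I to I + \theta u u^{T} with \|u\| = 1. The log-likelihood ratio of one sample x is then \frac{\theta}{2(1+\theta)}(u^{T}x)^{2} - \frac{1}{2}\log(1+\theta). The sketch below plugs this into a CUSUM recursion under the simplifying assumption that u and \theta are known; in [232] the subspace is estimated sequentially, which is not reproduced here, and the function names are hypothetical.

```python
import math


def spiked_llr(x, u, theta):
    """Log-likelihood ratio of N(0, I + theta * u u^T) against N(0, I) at x, with ||u|| = 1."""
    proj = sum(ui * xi for ui, xi in zip(u, x))
    return 0.5 * theta / (1.0 + theta) * proj ** 2 - 0.5 * math.log(1.0 + theta)


def cusum_known_subspace(samples, u, theta, threshold):
    """CUSUM with the exact spiked-model log-likelihood ratio (u assumed known).
    Returns the first alarm time (1-indexed) or None."""
    w = 0.0
    for t, x in enumerate(samples, start=1):
        w = max(0.0, w + spiked_llr(x, u, theta))
        if w > threshold:
            return t
    return None
```

Only the energy of the sample along u enters the statistic, so samples with large energy in orthogonal directions do not move it upward.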

3) Missing Data:

In high-dimensional time series, it is common that we cannot observe all the entries at each time. The missing components in the observed data handicap conventional approaches. In [234], a mixture type of approach was proposed by combining subspace tracking with missing data to model the underlying dynamic of data geometry (submanifold). Specifically, streaming data is used to track a submanifold approximation, to measure deviations from this approximation, and to calculate a series of statistics of the deviations for detecting when the underlying manifold has changed.

4) Sketching to Conquer High-Dimensionality:

To detect changes quickly over high-dimensional data, we may need to conquer the challenges presented by the data’s high dimensionality. Sketching is a commonly used strategy to reduce data dimensionality, which performs linear projections of high-dimensional data into a small number of sketches. A GLR procedure based on data sketches was studied in [237], with the precise characterization of performance metrics and the minimum number of sketches needed to achieve good performance. Multiple types of sketching matrices can be used, such as Gaussian random matrices, expander graphs, and network topology constrained matrices. The sketching procedure is relevant to large power networks where we cannot place a sensor on each node or edge. Instead, each sensor will measure aggregates of the network states at a few edges or nodes. In [237], the mean-shift detection problem in power networks is studied, where each measurement corresponds to a linear combination of the state at an edge, e.g., real power flow. This leads to a sketching matrix determined by the network topology.
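
A minimal sketch of a sketch-based GLR statistic for a mean shift is given below. It assumes that the pre-change data are N(0, I) and that the sketching matrix A has orthonormal rows, so that the sketches y_{t} = A x_{t} are N(0, I) before the change; for a general A (e.g., a Gaussian random matrix) the sketches would first need to be whitened by (AA^{T})^{-1}. Maximizing the log-likelihood ratio over the unknown post-change mean of the k most recent sketches gives (k/2)\|\bar{y}\|^{2}. The function names and the window-limited maximization are illustrative assumptions, not the exact procedure of [237].

```python
def sketch(A, x):
    """Linear sketch y = A x, with A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def glr_fixed_window(A, window):
    """GLR statistic assuming the mean shift started at the first sample of `window`.
    Equals (k / 2) * ||mean of sketches||^2 under the assumptions stated above."""
    k = len(window)
    ys = [sketch(A, x) for x in window]
    ybar = [sum(col) / k for col in zip(*ys)]
    return 0.5 * k * sum(v * v for v in ybar)


def window_limited_glr(A, xs, w):
    """Maximize the GLR statistic over candidate change times within the last w samples."""
    return max(glr_fixed_window(A, xs[-k:]) for k in range(1, min(w, len(xs)) + 1))


# Two sketches of a three-dimensional state (rows are orthonormal).
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
xs = [[0.0, 0.0, 0.0]] * 3 + [[1.0, 2.0, 3.0]] * 4
```

In this example the third coordinate is never observed, so the statistic responds only to the part of the mean shift that is visible through A.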

G. Joint Detection and Estimation

It is common that the distribution after the change is unknown. For instance, before the change in industrial process monitoring applications, the production line is in-control and well-calibrated (thus the distribution before the change is known). However, after the change, an anomaly shifts the operation into an unknown state. Therefore, it is of interest to incorporate estimates of the possible post-change state into the detection statistic when performing detection; this problem is related to robust sequential change detection, as discussed in Section III-D1. In other situations, we need to estimate the post-change distribution in retrospect for identifying the change. There has been much work establishing the theoretical foundation for joint detection and estimation. For instance, [135] combines the Bayesian formulations of estimation and detection and develops an optimal procedure that achieves a tradeoff between detection power and estimation quality. In another context, the problem is also referred to as sequential change diagnosis [46]. Quickest search for the change-point (e.g., quickest search for rare events) has been studied in [78], [79], [193].

H. Spatio-Temporal Change Detection

When modeling discrete event data, the point process model [45] is frequently used due to its capability of modeling the time intervals between events directly. A point process is characterized by its conditional intensity, which determines the distribution of the time intervals between events. For example, in homogeneous Poisson processes the intervals are independent and exponentially distributed, whereas in Hawkes processes the intervals are dependent because the intensity depends on the events that occurred in the past [54]. The “autoregressive” nature of Hawkes processes makes them attractive for modeling temporal dependence and causal relationships, with applications including market models [209], earthquake event prediction [144], inferring leadership in e-mail networks [57], and topic models [75]. The multi-dimensional Hawkes process model over networks can model highly correlated discrete event data [168] and capture dependence over networks and propagation of the signal in such settings.
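
For concreteness, a commonly used form of the Hawkes conditional intensity is \lambda(t) = \mu + \alpha \sum_{t_{i} < t} \beta e^{-\beta (t - t_{i})}, where \mu is a baseline rate and each past event temporarily raises the intensity. The exponential kernel, the parameter names, and the default values in the sketch below are assumptions chosen for illustration rather than the models used in the cited works.

```python
import math


def hawkes_intensity(t, events, mu=0.5, alpha=0.8, beta=1.0):
    """Conditional intensity of a one-dimensional Hawkes process with an
    exponential kernel (assumed form): mu + alpha * sum_i beta * exp(-beta * (t - t_i)),
    where the sum runs over past events t_i < t."""
    return mu + alpha * sum(beta * math.exp(-beta * (t - s)) for s in events if s < t)
```

With no past events the intensity equals the baseline `mu`, which is the Poisson special case; each event then adds a bump that decays at rate `beta`.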

Detection of changes for point processes has attracted much attention for both single event stream and multiple streams over networks (or over multiple locations). For example, there are works focusing on Poisson processes [76], [179], [246], and some recent work on one-dimensional [124], [153] and multi-dimensional (network) point processes [116], [226]. In particular, [116] studied the change detection for networked streaming event data and constructed GLR type procedures; [226] developed the penalized dynamic programming algorithm to detect coefficient changes in discrete-time high-dimensional self-exciting Poisson processes in an offline setting.

This topic is also related to the multisource quickest detection problem, mostly assuming independence between multiple data streams. For instance, the quickest detection of the minimum of change-points for two independent compound Poisson processes was considered in [21] and optimal Bayesian sequential detection procedures were developed.

I. Change Detection-Isolation-Identification

In addition to detecting the change quickly after it occurs, sometimes we are also interested in identifying the post-change model and/or isolating a subset of nodes within a large network affected by the change. In [47], an asymptotically optimal Bayesian detection-isolation scheme was proposed assuming the post-change model is one of the finitely many distinct alternatives. In a series of works, Nikiforov introduced a minimax optimal detection-isolation algorithm for stochastic dynamical systems [137], developed a recursive variant of the algorithm that achieves better computational efficiency [138], and provided an asymptotic lower bound for the mean detection-isolation delay with constraints on the probability of false isolation and the average time before a false alarm [139]. Natural generalizations of CUSUM and SR procedures for detection-isolation problems were discussed in [198]. See [194], [196] for more detailed overviews.

J. Alternative Performance Metrics

Beyond the performance metrics presented in this survey, many alternatives have also been considered. For instance, [161] investigated an exponential penalty of delay rather than a linear penalty (as used in the definition of {\mathsf {CADD}} , for instance). Such performance measures can be more appropriate in some settings, for example in financial applications. In these cases, the change-point may not represent a time at which a fundamental shift in the performance occurs, but the compounding of investment growth can be a more suitable measure of the cost of delay. Similarly, in the health monitoring of components in aircraft systems, communication networks, and power grids, the effects of undetected faults can grow exponentially with time. For problems involving estimation, the performance measures can also involve estimation accuracy, for instance, of the change-point location and of other parameters involved in the problem. With many parallel data streams, the error metric can be the false discovery rate ({\mathsf {FDR}} ), which is the expected ratio of the number of falsely declared data streams to the total number of declared data streams [33].

SECTION IV.

New Dimensions

A. Machine Learning and Change Detection

Modern machine learning approaches can be adopted for solving sequential change detection problems, which we will review in this subsection.

1) Density Ratio Estimation:

Instead of estimating the post-change density f_{1} as in the GLR procedure, we may estimate the density ratio f_{1}/f_{0} directly (referred to as density ratio estimation [192]), based on which we develop sequential change detection procedures. A data-driven framework using neural networks was developed in [134]. More specifically, given two sets of data sampled from the densities of interest, an optimization problem is defined so that the solution, specified through neural networks, will correspond to the desired likelihood ratio function or its transformations and can then be used for sequential change detection.

2) Anomaly Detection:

Change detection is closely related to anomaly detection, which is a popular topic in machine learning and data mining, and for which many machine learning techniques have been developed. In particular, a recurrent neural network (RNN) based approach computes the detection statistic (referred to as the anomaly score) in an online fashion and compares it with a threshold for anomaly detection [177]. The RNN-based approach can be beneficial in certain situations since RNNs are known to capture complex temporal dependencies in multivariate time series. We refer to [30] for a recent survey on deep learning techniques for anomaly detection. Developing mathematical theory for RNN-based sequential change detection is still an open question.

3) Online Learning and Change Detection:

Online implementation is one of the most critical aspects of sequential change detection algorithms in practice. Although many algorithms enjoy a recursive structure, such as the CUSUM and SR procedures, some sequential detection procedures face a significant hurdle in online implementation due to their non-recursive nature. For instance, the window-limited GLR statistic, although enjoying robust performance in the presence of unknown post-change distributions, is not recursive since the parameters need to be continuously re-estimated as new samples arrive. To tackle this challenge, inspired by online learning, [26] develops an online mirror descent-based GLR procedure to update the estimate of the unknown post-change parameter with new data. Another highly cited work [2] develops an online change detection procedure based on Bayesian computing. In recent work, [208] develops a framework for joint sequential change detection and online model fitting, which is particularly suitable for parameterized models. A GLR procedure is developed in this framework using estimates of the unknown high-dimensional parameter obtained by the gradient descent update.
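
The idea of replacing the inner maximization of the GLR statistic with an online parameter update can be sketched as follows. The example assumes unit-variance Gaussian observations with pre-change mean zero and an unknown post-change mean; the estimate is updated by a plain gradient step on the negative log-likelihood (which is what mirror descent reduces to under the Euclidean mirror map), and the likelihood ratio at time t uses only the estimate from time t-1. The step size `eta`, the initial guess `theta0`, and the absence of any restart of the estimator are simplifying assumptions; this is not the exact procedure of [26] or [208].

```python
def adaptive_cusum(xs, threshold, eta=0.5, theta0=1.0):
    """Recursive CUSUM-type statistic with an online estimate of the post-change mean.
    Returns the first alarm time (1-indexed) or None."""
    theta = theta0  # current estimate of the unknown post-change mean
    w = 0.0         # detection statistic
    for t, x in enumerate(xs, start=1):
        # Log-likelihood ratio of N(theta, 1) against N(0, 1), using the
        # estimate formed before x was seen (non-anticipating).
        llr = theta * x - 0.5 * theta ** 2
        w = max(0.0, w + llr)
        if w > threshold:
            return t
        # Gradient step on the loss 0.5 * (x - theta)^2.
        theta += eta * (x - theta)
    return None
```

Each step costs a constant amount of computation and memory, in contrast to a window-limited GLR statistic that re-estimates the parameter over every candidate change time in the window.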

4) Tracking Data Dynamics:

Many sequential data are dynamic even before the change has happened; for instance, solar flare detection from satellite video streaming [233], [234]. To build methods that work with real-world scenarios, we need to develop robust methods that can adapt to normal data dynamics without mislabeling them as change-points. A possible strategy is to combine tracking with detection. For instance, [233], [234] developed a procedure to detect sparse changes when the pre-change high-dimensional data is time-varying. The data dynamic is captured by tracking a time-varying manifold using variants of subspace tracking (e.g., GROUSE [245], PETRELS [39], or MOUSSE algorithm [234]). Another instance is the network Hawkes process model, where we may track the Hawkes process through online learning techniques [67].

5) Active Learning and Change Detection:

For certain applications such as materials science and recovering seafloor depth, data acquisition is expensive. Thus, it is desirable to collect the most useful data in a sequential fashion, which is the theme of active learning (see, e.g., [29], [190]). The combination of active learning and change detection was introduced as the active change-point detection (ACPD) problem in [74]. The task is to adaptively determine the next input to detect the change-point in a black-box expensive-to-evaluate function, with as few evaluations as possible. The method utilizes an existing change detection method to compute change scores and a Bayesian optimization method to determine the next input. A CUSUM procedure with an adaptive sampling strategy to detect mean shifts was developed in [120].

6) Detection With Data Privacy:

As data privacy is of growing importance in modern applications in social settings, it has also motivated the development of private change detection algorithms. Both offline and online change detection methods through the lens of differential privacy have been developed in [44]. A different privacy-aware sequential change detection method was studied in [104], using maximal leakage as the privacy metric, which is a weaker form of privacy compared with [44].

7) Change Detection for Reinforcement Learning:

Reinforcement learning is a major type of sequential decision-making methodology in the era of artificial intelligence. How to implement reinforcement learning in a non-stationary and changing environment is still a mostly unexplored area. Recently, there have been some attempts to combine sequential change detection and reinforcement learning [147], where change detection algorithms are utilized to detect the transition of the environment and trigger transitions of reinforcement learning algorithms.

B. Distribution-Free Methods

Distribution-free methods aim to detect the change without making explicit distributional assumptions on the data. Such methods, for example the kernel MMD-based method discussed in Section III-B2, are particularly attractive in machine learning due to their flexibility in working with complex data. Kernel-based non-parametric methods have been developed for change detection, both for the offline setting [8], [70], [72] and the online setting [115]. MMD statistics have also been used for anomalous sequence detection, for instance, [252], [254]. Besides MMD, other distribution-free methods have been developed for change detection. For instance, dissimilarity measures based on the kernel support vector machine (SVM) were built in [50], and generalized likelihood tests that directly use the empirical distributions of the data, when the true distributions are supported on a finite alphabet, were constructed in [24], [105], [140], [141].

There are many other types of distribution-free non-parametric tests for change detection developed in various contexts. For instance, the maximal k -largest sample coherence between columns of each observed random matrix was developed to detect change for large-scale random matrices [12]. A nearest-neighbors-based statistic was proposed in [32] to detect the change in sequences of multivariate observations or non-Euclidean data objects such as network data. The weighted moving averages were studied in [58] to detect univariate drifts. A non-parametric approach was developed in [150] to detect departure from the reference signal with non-i.i.d. underlying time series. The spectral scan statistic for change detection over graphs was considered in [178]. Wasserstein distance was used to detect segments of time series in [38]. In [10], test statistics were constructed using martingales under the null hypothesis, and the rejection threshold is determined using a uniform non-asymptotic law of the iterated logarithm.

C. Non-Stationary Multi-Armed Bandits With Changes

Multi-armed bandit is a class of fundamental problems in online learning and sequential decision-making. A learning agent aims to maximize its expected cumulative reward by repeatedly selecting to pull one arm at each time step. Change detection can play a role in the scenario where the reward distributions may change in a piece-wise-stationary fashion at unknown time steps. To handle dynamic multi-armed bandit problems, various change detection methods were considered, including the Page-Hinkley test [73], a windowed mean-shift detection [243], CUSUM test [119], and sample mean based test [25]. Usually, the algorithm will reset once a change is detected. From a Bayesian perspective, the Thompson sampling strategy equipped with a Bayesian change-point mechanism was considered in [128]. The adversarial multi-armed bandit problem with change points was also considered in [5].
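
The detect-and-reset pattern can be sketched with a simple per-arm detector attached to a UCB index. The detector below compares the means of the two halves of a sliding window of recent rewards, which is only one of many possible choices; the class and function names, the window length, and the detection gap are hypothetical, and resetting only the affected arm is a simplification relative to schemes that restart the whole algorithm.

```python
import math


class ArmWithDetector:
    """Reward statistics of one arm with a windowed mean-shift detector (illustrative)."""

    def __init__(self, window=10, gap=0.5):
        self.window = window  # number of recent rewards inspected by the detector
        self.gap = gap        # threshold on the difference between half-window means
        self.rewards = []

    def update(self, r):
        """Record a reward; return True if a change is declared, in which case
        the statistics of this arm are reset."""
        self.rewards.append(r)
        recent = self.rewards[-self.window:]
        if len(recent) == self.window:
            half = self.window // 2
            old_mean = sum(recent[:half]) / half
            new_mean = sum(recent[half:]) / (self.window - half)
            if abs(new_mean - old_mean) > self.gap:
                self.rewards = []
                return True
        return False

    def ucb_index(self, t):
        """UCB1-style index at round t >= 1; unexplored (or reset) arms get priority."""
        if not self.rewards:
            return float("inf")
        n = len(self.rewards)
        return sum(self.rewards) / n + math.sqrt(2.0 * math.log(t) / n)


def select_arm(arms, t):
    """Pull the arm with the largest index."""
    return max(range(len(arms)), key=lambda i: arms[i].ucb_index(t))
```

After a reset the arm has an infinite index, so it is pulled again immediately and its reward estimate is rebuilt from post-change samples only.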

D. Optimization for Change Detection and Estimation

Optimization is becoming a centerpiece in developing modern machine learning algorithms. Recent advances in convex optimization have enabled solving many large-scale problems. A line of research aims to cast (offline) change detection and estimation (of the change-point locations) as an optimization problem. The benefits of this optimization-based approach typically include computational efficiency (when the optimization problem is convex) and theoretical performance guarantees based on optimization theory. Below we give some examples.

Univariate change detection for a mean shift using an optimization approach has been studied in [117], and performance guarantees were established by relating the \ell _{2} recovery error to detection performance. An \ell _{0} -penalized least squares method was considered in [224]. By connecting to binary segmentation methods, change detection and localization for univariate data in the non-parametric settings was studied in [148].
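
As an illustration of the optimization viewpoint, an \ell _{0} -penalized least squares problem for a piecewise-constant mean can be solved exactly by dynamic programming over the position of the last change-point. The sketch below is a generic O(n^{2}) implementation with hypothetical names; it minimizes the within-segment sum of squared errors plus `penalty` times the number of change-points, and it is not claimed to be the specific algorithm analyzed in [224].

```python
def l0_changepoints(y, penalty):
    """Return the sorted start indices of new segments that minimize
    sum of squared deviations from segment means + penalty * (number of change-points)."""
    n = len(y)
    # Prefix sums give the squared-error cost of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(y):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        """Squared error of y[i:j] around its own mean."""
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    best = [0.0] * (n + 1)  # best[j]: optimal objective value for y[:j]
    best[0] = -penalty      # offsets the penalty so the first segment is free
    last = [0] * (n + 1)    # last[j]: start index of the final segment of y[:j]
    for j in range(1, n + 1):
        best[j], last[j] = min((best[i] + cost(i, j) + penalty, i) for i in range(j))

    # Backtrack through the stored segment starts.
    changepoints = []
    j = n
    while j > 0:
        i = last[j]
        if i > 0:
            changepoints.append(i)
        j = i
    return sorted(changepoints)
```

The penalty controls the tradeoff between fit and the number of change-points: a larger value suppresses small jumps, and a sufficiently large value returns no change-point at all.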

Multivariate change detection using an optimization approach has also been studied. For instance, a dynamic programming approach was developed for recovering an unknown number of change-points from multivariate autoregressive models [225]. A network binary segmentation method for change detection was proposed in [223], which has been extended for covariance matrix change detection in [222]. Finally, the work [191] combined the filtered derivative with convex optimization methods to estimate change-points for multi-dimensional data.

SECTION V.

Modern Applications

Sequential change detection has traditionally been used in industrial process monitoring applications, which was probably the original motivation for change detection procedures to be developed in the early days. The wide adoption of change detection in industrial quality engineering and manufacturing initiates the field of statistical process control (SPC) (see, e.g., [143], [182]). Recently, there have been many more modern applications for sequential change detection, and we present a selection of them here.

A. Smart Grids

The sequential change detection methodology has recently been applied successfully to sequential line outage detection in power transmission systems. In modern smart grids, high-speed synchronized voltage phase angle measurements are taken from phasor measurement units (PMUs). Based on PMU measurements, a linearized incremental small-signal power system model was developed in [37]. Once a line outage occurs, there is a change in the covariance matrix of the incremental phases; by monitoring this covariance matrix, line outages can be detected and identified using sequential change detection algorithms. In [171], the transient dynamics of the power system following a line outage are further incorporated. The D-CUSUM algorithm was then developed to incorporate the dynamic nature of the line outage in [171] (see Section III-C1 for more details).

There have been other works on sequential change detection for smart grids. The generalized local likelihood ratio test was applied to voltage quality monitoring [114], photovoltaic systems [36], attack detection in multi-agent reputation systems [113], wide-area monitoring [112], and cyber-attack detection in discrete-time linear dynamic systems [92], [93]. Decentralized detection with level-triggered sampling was considered in [241]. In [77], a general stochastic graphical framework for modeling the bus measurements was developed, and data-adaptive data-acquisition and decision-making processes were designed for the quickest search and localization of anomalies in power grids.

B. Cybersecurity

Cybersecurity has become a critical problem with the development of wireless communication, networking, and the Internet of Things. It is of practical importance to detect attacks and intrusions in real-time from network streaming data, e.g., denial-of-service attacks, worm-based attacks, port-scanning, and man-in-the-middle attacks. The sequential change detection approach is a natural fit since the attacks usually change network traffic distribution. In [203], multi-channel generalizations of the CUSUM procedure and non-parametric tests were proposed. In [204], adaptive sequential methods were proposed for early detection of subtle network attacks, utilizing data from multiple layers of the network protocol. In [202], a multi-cyclic detection procedure based on the SR procedure was proposed. In [199], score-based CUSUM and SR procedures were exploited for network anomaly detection, and a hybrid detection system was proposed. The application to cybersecurity was also discussed in books [194], [196], and recent reviews [84], [85].

C. Sensor Networks

Sensor networks collecting sequential data have been widely used for geophysical, environmental, traffic, and Internet traffic monitoring applications, which we will briefly summarize in this subsection.

Seismology is experiencing rapid growth in the quantity of data. Earthquake detection aims to identify seismic events in continuous data – a fundamental operation for seismology [242]. Modern ultra-dense seismic sensor arrays have obtained a massive amount of continuous data for seismic studies, and many such data are publicly available through IRIS [1]. In the old days, network seismology treated seismic signals individually - one sensor at a time - and detected an earthquake upon multiple impulsive arrivals consistent with a source within the Earth [87]. Recently, with advances in sensor technology, which bring densely sampled data, high-performance computing, and high-speed communication, we are able to perform network-based detection by exploiting correlations between sensors to extract coherent signals. This will enhance the systematic detection of weak and unusual events that currently go undetected using individual sensors. Detecting such weak events is crucial for earthquake prediction [145], [219], oil field exploration, volcano monitoring, and deeper earth studies [80]. Towards this goal, in [232], a subspace-CUSUM procedure was developed for network-based detection by exploiting the low-rank subspace structure induced by waveform similarity.

Sensor networks have also been deployed to monitor drinking water safety from the water tower to private residences. Sequential change detection using residual chlorine concentration measurements from the sensor network was developed in [65]. Methods have also been developed for monitoring river contamination [34], [35], which specifically consider the spatio-temporal correlation in observations along the sensor network due to water dynamics.

Sequential monitoring of traffic flow using traffic sensors has been considered in [166], and a distributed, online, sequential algorithm for detecting multiple faults in a sensor network was presented therein. Recently, Hawkes processes models for correlated traffic anomalies using data collected by inductive-loop traffic detectors were developed in [250].

D. Wireless Communications

Sequential change detection has been used for wireless communications, including online user activity detection in multi-user direct-sequence/code-division multiple-access (DS-CDMA) environments [146] and detection of “spectrum opportunities” in the cognitive radio setting by identifying occupied and idle channels from the primary users’ activities [95], [235]. More recently, [81] established a change detection framework for low probability of detection (LPD) communication, where a transmitter, Alice, wants to hide her transmission to a receiver, Bob, from an adversary, Willie; three different sequential tests were considered, including Shewhart, CUSUM, and SR procedures, to model Willie’s detection process.

E. Video Processing and Computer Vision

Change detection is one of the most commonly encountered low-level tasks in computer vision and video processing [163], and many such problems are essentially sequential. A plethora of practical algorithms have been developed to date; for instance, scene change detection [110], street-view change detection [3], and change detection in video sequences [211]. In [48], a pixel-based weightless neural network (WNN) method was developed to detect changes in the field of view of a camera. In [118], multiple images from reference and mission passes of a scene of interest were used to improve detection performance. There are still many open questions regarding how to leverage the power of statistical sequential change detection for computer vision and video processing. We present an example of solar flare detection from video sequences in Fig. 4, which has been considered in several works along this line including [234].

Fig. 4. Solar flare detection with the mixture procedure as considered in [234]; the first minor solar flare at t = 142 is hardly visible, and it is missed entirely by a baseline detection statistic (sum of CUSUM at each data dimension without exploiting sparsity in the change: “solar flare”). This also illustrates the importance of exploiting sparsity in the change.

F. Social Networks

The widespread use of social networks and the great availability of information networks (e.g., Twitter, Facebook, blogs) lead to a large amount of user-generated data [91], which are quite valuable in studying many social phenomena. One important aspect is to detect change-points in streaming social network data [90], which may represent the collective anticipation of or response to external events or system “shocks” [152]. Detecting such changes could provide a better understanding of the patterns of social life. In other cases, early detection of change-points can predict or even prevent social stress due to disease or international threat, for instance, detecting self-exciting changes (modeled by network Hawkes processes) in social networks [116]. A related topic is distributed hypothesis testing in social networks: Reference [101] showed the exponential convergence rate of a Bayesian update scheme of nodal belief (distribution estimate) in the social learning setting.

G. Epidemiology

Sequential change detection can potentially play an important role in public health and disease surveillance. Early detection of epidemics is a very important topic. In [17], [18], Baron cast the early detection of epidemics as a Bayes sequential change detection problem and proposed an asymptotically pointwise optimal stopping rule, which is computationally efficient for complicated prior distributions arising in epidemiology. In [244], a modified CUSUM procedure was proposed for the susceptible-infected-recovered (SIR) epidemic model to detect a change-point in the infection rate parameter. Moreover, change detection has been incorporated into studying the effectiveness of interventions, based on the premise that the underlying epidemiological model may change over time due to interventions. Evaluating the effectiveness of intervention measures requires detecting underlying change-points, which becomes even more important in the COVID-19 era [249]. Such works include [151], [228], which estimate the change-points in time series to assess the effectiveness of interventions such as lock-down and mask usage; in [49], the problem of detecting the growth rate change for the COVID-19 spread in Germany was studied, where results were further incorporated into forecasting. There are still many open questions in this area regarding developing effective sequential change detection procedures suitable for infectious disease early detection.

SECTION VI.

Conclusion

Our goal in this survey was to provide a glimpse of the past and recent advances in sequential change detection, and its application in various domains. We have covered different types of sequential change detection procedures, both theoretically optimal and practical. We also discussed how the intersection of sequential change detection with other areas has created interesting new directions for research.

ACKNOWLEDGMENT

The authors are grateful to the Guest Editor and the anonymous reviewers for their helpful comments.
