Many time series models, including models with trends and seasonal components and even nonstationary models, can be written as components of a vector series \(\mathbf{Y} = \{\mathbf{Y}_t\}\) satisfying \[ \begin{array}{l} \mathbf{Y}_t = G_t \mathbf{X}_t + \mathbf{W}_t,\\ \mathbf{X}_{t+1} = F_t \mathbf{X}_t + \mathbf{V}_t, \end{array} \] where \(\{\mathbf{W}_t\}\sim {\rm WN}(0,R_t)\), \(\{\mathbf{V}_t\}\sim {\rm WN}(0,Q_t)\), and \(G_t\), \(F_t\), \(R_t\), \(Q_t\) are the model “parameters”. Such systems are known as state-space models; they can be estimated, predicted, and so on, through computationally efficient Kalman recursions. Here the \(\mathbf{Y}_t\) are observable but the \(\mathbf{X}_t\) are latent (not observable).
In the above sense, the model trend and/or seasonal component can be estimated simultaneously with the other model parameters. Such a modeling approach is referred to as structural. This is one of the more powerful techniques, with extensions and connections to hidden Markov models (HMMs), Bayesian analysis, factor models, etc., and also, as will be seen below, to the EM algorithm and the handling of missing data.
On the theory side, the goal is to gain a basic understanding of how a state-space model is fitted, how its latent variables are estimated, and what the underlying Kalman recursions are. From the data (modeling) perspective, the goal is to get a sense of the advantages of state-space modeling and of what it gives one in practice.
Recall the state-space model of a vector series \(\mathbf{Y} = \{\mathbf{Y}_t\}\) satisfying \[ \begin{array}{l} \mathbf{Y}_t = G_t \mathbf{X}_t + \mathbf{W}_t,\\ \mathbf{X}_{t+1} = F_t \mathbf{X}_t + \mathbf{V}_t, \end{array} \] where \(\{\mathbf{W}_t\}\sim {\rm WN}(0,R_t)\), \(\{\mathbf{V}_t\}\sim {\rm WN}(0,Q_t)\), and \(G_t\), \(F_t\), \(R_t\), \(Q_t\) are the model “parameters”.
The Kalman recursions involve three different problems: prediction, filtering and smoothing, all concerning the estimation of \(\mathbf{X}_t\) in terms of the observations, as made precise below.
We would also like to be able to measure the variability of these estimates. By convention, \(\mathbf{Y}_0\) is a vector of ones.
Definition 3.1: For a random vector \(\mathbf{X} = (X_1,\ldots,X_v)'\), set \[ P_t(\mathbf{X}) = (P_t(X_1),\ldots,P_t(X_v))', \] where \(P_t(X_i) := P(X_i|\mathbf{Y}_0,\ldots,\mathbf{Y}_t)\) is the best linear predictor of \(X_i\) in terms of all the components of \(\mathbf{Y}_0,\ldots,\mathbf{Y}_t\).
Prediction: Want \(P_{t-1}(\mathbf{X}_t) = : \widehat{\mathbf{X}}_t\)
Filtering: Want \(P_{t}(\mathbf{X}_t) = : \mathbf{X}_{t|t}\)
Smoothing: Want \(P_{T}(\mathbf{X}_t) = : \mathbf{X}_{t|T}\) (with \(T\) replaced by \(n\) below, \(P_{n}(\mathbf{X}_t) = : \mathbf{X}_{t|n}\))
Prediction: compute \(\widehat{\mathbf{X}}_t\) and its error covariance matrix recursively, forwards in \(t\).
Filtering: compute \(\mathbf{X}_{t|t}\) and its error covariance matrix recursively, forwards in \(t\).
Smoothing: compute \(\mathbf{X}_{t|n}\) and its error covariance matrix by a backward recursion in \(t\), after a forward filtering pass. (A sketch of the standard forms of all three recursions is given below.)
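For reference, here is a compact sketch of the standard forms of these recursions, assuming \(\{\mathbf{W}_t\}\) and \(\{\mathbf{V}_t\}\) are uncorrelated; the symbols \(\Omega_t\) and \(\Omega_{t|t}\) (error covariance matrices of \(\widehat{\mathbf{X}}_t\) and \(\mathbf{X}_{t|t}\)) and \(J_t\) (smoothing gain) are introduced here only for this sketch, and precise statements can be found in Brockwell and Davis (2016), Chapter 8, or Shumway and Stoffer (2011), Chapter 6: \[ \begin{array}{ll} \mbox{Innovations:} & \mathbf{I}_t = \mathbf{Y}_t - G_t \widehat{\mathbf{X}}_t, \qquad \Delta_t = G_t \Omega_t G_t' + R_t,\\ \mbox{Filtering:} & \mathbf{X}_{t|t} = \widehat{\mathbf{X}}_t + \Omega_t G_t' \Delta_t^{-1} \mathbf{I}_t, \qquad \Omega_{t|t} = \Omega_t - \Omega_t G_t' \Delta_t^{-1} G_t \Omega_t,\\ \mbox{Prediction:} & \widehat{\mathbf{X}}_{t+1} = F_t \mathbf{X}_{t|t}, \qquad \Omega_{t+1} = F_t \Omega_{t|t} F_t' + Q_t,\\ \mbox{Smoothing:} & \mathbf{X}_{t|n} = \mathbf{X}_{t|t} + J_t (\mathbf{X}_{t+1|n} - \widehat{\mathbf{X}}_{t+1}), \qquad J_t = \Omega_{t|t} F_t' \Omega_{t+1}^{-1}. \end{array} \] The prediction and filtering recursions are run forwards in \(t\) from initial values \(\widehat{\mathbf{X}}_1\), \(\Omega_1\); the smoothing recursion is run backwards in \(t\) starting from \(\mathbf{X}_{n|n}\), with error covariances \(\Omega_{t|n} = \Omega_{t|t} + J_t(\Omega_{t+1|n} - \Omega_{t+1})J_t'\).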
Suppose the state-space model is parametrized by a parameter vector \(\mathbf{\theta}\). The likelihood can be written as: \[ L(\mathbf{\theta}; \mathbf{Y}_1,\ldots,\mathbf{Y}_T) = \prod_{t=1}^T f_t(\mathbf{Y}_t|\mathbf{Y}_1,\ldots,\mathbf{Y}_{t-1}) \] where \(f_t(\cdot|\mathbf{y}_1,\ldots,\mathbf{y}_{t-1})\) is the conditional density of \(\mathbf{Y}_t\) given \(\mathbf{Y}_1=\mathbf{y}_1,\ldots,\mathbf{Y}_{t-1}=\mathbf{y}_{t-1}\).
Assuming joint Gaussianity of the state-space model variables, we have \[ f_t(\mathbf{Y}_t|\mathbf{Y}_1,\ldots,\mathbf{Y}_{t-1}) = \frac{1}{(2\pi)^{w/2} \mbox{det}(\Delta_t)^{1/2}} \exp\Big\{ - \frac{1}{2} \mathbf{I}_t'\Delta_t^{-1} \mathbf{I}_t \Big\}, \] where \(w\) is the dimension of \(\mathbf{Y}_t\), \(\mathbf{I}_t = \mathbf{Y}_t - P_{t-1}\mathbf{Y}_t = \mathbf{Y}_t - G_t\widehat{\mathbf{X}}_t\), and \(P_{t-1}\mathbf{Y}_t\) and \(\Delta_t\) are the one-step predictors and their error covariance matrices found from the Kalman prediction recursions. The likelihood above can therefore be expressed as \[ L(\mathbf{\theta}; \mathbf{Y}_1,\ldots,\mathbf{Y}_T) = \frac{1}{(2\pi)^{wT/2} \prod_{t=1}^T \mbox{det}(\Delta_t)^{1/2}} \exp\Big\{ - \frac{1}{2} \sum_{t=1}^T \mathbf{I}_t'\Delta_t^{-1} \mathbf{I}_t \Big\} \] and computed efficiently using the Kalman prediction recursions.
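In practice one minimizes the negative log-likelihood, which by the expression above is \[ -\log L(\mathbf{\theta}; \mathbf{Y}_1,\ldots,\mathbf{Y}_T) = \frac{wT}{2}\log(2\pi) + \frac{1}{2}\sum_{t=1}^T \Big( \log \mbox{det}(\Delta_t) + \mathbf{I}_t'\Delta_t^{-1}\mathbf{I}_t \Big). \] The \(\mathbf{\theta}\)-dependent part is exactly the quantity accumulated in the like output of the function Kfilter0 shown below; it is then minimized numerically over \(\mathbf{\theta}\), e.g. with optim in R.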
A look back at the Johnson and Johnson data set and the code: e.g., what is the function Kfilter0?
Kfilter0
## function (num, y, A, mu0, Sigma0, Phi, cQ, cR)
## {
## Q = t(cQ) %*% cQ
## R = t(cR) %*% cR
## Phi = as.matrix(Phi)
## pdim = nrow(Phi)
## y = as.matrix(y)
## qdim = ncol(y)
## xp = array(NA, dim = c(pdim, 1, num))
## Pp = array(NA, dim = c(pdim, pdim, num))
## xf = array(NA, dim = c(pdim, 1, num))
## Pf = array(NA, dim = c(pdim, pdim, num))
## innov = array(NA, dim = c(qdim, 1, num))
## sig = array(NA, dim = c(qdim, qdim, num))
## x00 = as.matrix(mu0, nrow = pdim, ncol = 1)
## P00 = as.matrix(Sigma0, nrow = pdim, ncol = pdim)
## xp[, , 1] = Phi %*% x00
## Pp[, , 1] = Phi %*% P00 %*% t(Phi) + Q
## sigtemp = A %*% Pp[, , 1] %*% t(A) + R
## sig[, , 1] = (t(sigtemp) + sigtemp)/2
## siginv = solve(sig[, , 1])
## K = Pp[, , 1] %*% t(A) %*% siginv
## innov[, , 1] = y[1, ] - A %*% xp[, , 1]
## xf[, , 1] = xp[, , 1] + K %*% innov[, , 1]
## Pf[, , 1] = Pp[, , 1] - K %*% A %*% Pp[, , 1]
## sigmat = as.matrix(sig[, , 1], nrow = qdim, ncol = qdim)
## like = log(det(sigmat)) + t(innov[, , 1]) %*% siginv %*%
## innov[, , 1]
## for (i in 2:num) {
## if (num < 2)
## break
## xp[, , i] = Phi %*% xf[, , i - 1]
## Pp[, , i] = Phi %*% Pf[, , i - 1] %*% t(Phi) + Q
## sigtemp = A %*% Pp[, , i] %*% t(A) + R
## sig[, , i] = (t(sigtemp) + sigtemp)/2
## siginv = solve(sig[, , i])
## K = Pp[, , i] %*% t(A) %*% siginv
## innov[, , i] = y[i, ] - A %*% xp[, , i]
## xf[, , i] = xp[, , i] + K %*% innov[, , i]
## Pf[, , i] = Pp[, , i] - K %*% A %*% Pp[, , i]
## sigmat = as.matrix(sig[, , i], nrow = qdim, ncol = qdim)
## like = like + log(det(sigmat)) + t(innov[, , i]) %*%
## siginv %*% innov[, , i]
## }
## like = 0.5 * like
## list(xp = xp, Pp = Pp, xf = xf, Pf = Pf, like = like, innov = innov,
## sig = sig, Kn = K)
## }
## <environment: namespace:astsa>
Note: The Gaussian likelihood is used even for non-Gaussian models.
Local level model: \[ Y_t = \mu_t + V_t,\quad \mu_t = \mu_{t-1} + W_t, \] where \(\{V_t\}\sim \mbox{WN}(0,\sigma^2_V)\) and \(\{W_t\}\sim \mbox{WN}(0,\sigma^2_W)\). Or the local linear trend model: \[ Y_t = \mu_t + V_t,\quad \mu_t = \mu_{t-1} + b_t + W_t,\quad b_t = b_{t-1} + Z_t, \] where, in addition, \(\{Z_t\}\sim \mbox{WN}(0,\sigma^2_Z)\).
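As an illustration (a minimal sketch, not from the notes), the local level model can be fitted by Gaussian maximum likelihood by passing the like output of Kfilter0 to optim. The data below are simulated; the function name Linn, the true values \(\sigma_W = 1\), \(\sigma_V = 2\), and the optimizer's starting values are arbitrary choices made only for this illustration.
library(astsa)
set.seed(1)
num = 100
w = rnorm(num, 0, 1)                # state noise, true sigma_W = 1
v = rnorm(num, 0, 2)                # observation noise, true sigma_V = 2
mu = cumsum(w)                      # local level: mu_t = mu_{t-1} + W_t
y = mu + v                          # observations: Y_t = mu_t + V_t
# negative log-likelihood (up to a constant), as returned by Kfilter0
Linn = function(para) {
  kf = Kfilter0(num, y, A=1, mu0=0, Sigma0=1, Phi=1, cQ=para[1], cR=para[2])
  return(kf$like)
}
init = c(.5, 1)                     # starting values for (sigma_W, sigma_V)
est = optim(init, Linn, NULL, method="BFGS", hessian=TRUE)
abs(est$par)                        # ML estimates of (sigma_W, sigma_V); the sign is not identified
sqrt(diag(solve(est$hessian)))      # approximate standard errors from the numerical Hessian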
Extension to one “factor” local level model: \[ Y_{1t} = \mu_t + V_{1t},\quad Y_{2t} = \mu_t + V_{2t},\quad \mu_t = \mu_{t-1} + W_t. \] Shumway and Stoffer (2011) use this for two series of global temperatures, one obtained as averages from land-based stations and the other from marine-based stations.
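For concreteness (this identification is not spelled out in the notes), in the general state-space notation above this one-factor model has the scalar state \(\mu_t\), with the roles of the letters \(V\) and \(W\) swapped relative to the general formulation: \[ \mathbf{Y}_t = \left(\begin{array}{c} Y_{1t}\\ Y_{2t}\end{array}\right) = \left(\begin{array}{c} 1\\ 1\end{array}\right)\mu_t + \left(\begin{array}{c} V_{1t}\\ V_{2t}\end{array}\right), \qquad \mu_{t+1} = \mu_t + W_{t+1}, \] that is, \(G_t = (1\ \ 1)'\), \(F_t = 1\), \(R_t = \mbox{Cov}\big((V_{1t},V_{2t})'\big)\) and \(Q_t = \sigma^2_W\).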
The usual (vector) ARIMA models can be written in state-space form.
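For example (one of several possible embeddings, written out here as an illustration), a causal AR(2) series \(X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + Z_t\) with \(\{Z_t\}\sim{\rm WN}(0,\sigma^2)\), observed without noise, fits the framework with state vector \((X_t, X_{t-1})'\): \[ Y_t = (1\ \ 0)\left(\begin{array}{c} X_t\\ X_{t-1}\end{array}\right), \qquad \left(\begin{array}{c} X_{t+1}\\ X_{t}\end{array}\right) = \left(\begin{array}{cc} \phi_1 & \phi_2\\ 1 & 0\end{array}\right) \left(\begin{array}{c} X_t\\ X_{t-1}\end{array}\right) + \left(\begin{array}{c} Z_{t+1}\\ 0\end{array}\right), \] that is, \(G_t = (1\ \ 0)\), \(R_t = 0\), \(F_t = \left(\begin{array}{cc}\phi_1 & \phi_2\\ 1 & 0\end{array}\right)\) and \(Q_t = \left(\begin{array}{cc}\sigma^2 & 0\\ 0 & 0\end{array}\right)\); MA and differencing parts are accommodated by enlarging the state vector.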
Many more interesting models can be considered when the state-space models are nonlinear (the so-called hidden Markov models).
Thus, if we had the complete data (both the states \(\mathbf{X}_t\) and the observations \(\mathbf{Y}_t\)), we could use results from multivariate normal theory to obtain the MLEs of \(\theta\) easily. We do not have the complete data; however, the EM algorithm provides an iterative method for finding the MLEs of \(\theta\) based on the incomplete data \(\mathbf{Y}_t\), by successively maximizing the conditional expectation of the complete-data likelihood.
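For concreteness, with time-invariant parameters \(F\), \(G\), \(Q\), \(R\), Gaussian errors, and initial state \(\mathbf{X}_1 \sim N(\mathbf{\mu}_1, \Sigma_1)\) (the symbols \(\mathbf{\mu}_1\), \(\Sigma_1\) are introduced here for the initial state), the complete-data likelihood appearing below satisfies, up to an additive constant, \[ \begin{array}{ll} -2\log L_{\mathbf{X},\mathbf{Y}}(\theta) = & \log \mbox{det}(\Sigma_1) + (\mathbf{X}_1-\mathbf{\mu}_1)'\Sigma_1^{-1}(\mathbf{X}_1-\mathbf{\mu}_1)\\ & +\ (n-1)\log \mbox{det}(Q) + \sum_{t=2}^{n} (\mathbf{X}_t - F\mathbf{X}_{t-1})'Q^{-1}(\mathbf{X}_t - F\mathbf{X}_{t-1})\\ & +\ n\log \mbox{det}(R) + \sum_{t=1}^{n} (\mathbf{Y}_t - G\mathbf{X}_t)'R^{-1}(\mathbf{Y}_t - G\mathbf{X}_t) + \mbox{const}, \end{array} \] a sum of Gaussian quadratic forms, so that its minimizers in \(\theta\) have closed, regression-type expressions (cf. Shumway and Stoffer (2011), Section 6.3).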
To describe the EM method for our state-space model, let \[ Q(\theta|\theta^{(j-1)}) = E_{\mathbf{X}|\mathbf{Y},\theta^{(j-1)}}(-2 \log L_{\mathbf{X},\mathbf{Y}}(\theta) ). \] The EM algorithm is then the iterative procedure alternating between the following two steps:
E step (“E” for Expectation): Calculate \(Q(\theta|\theta^{(j-1)})\). It turns out that this can be done explicitly using the Kalman (smoothing) recursions (e.g. Shumway and Stoffer (2011), pp. 343-344).
M step (“M” for Maximization): Minimize \(Q(\theta|\theta^{(j-1)})\) over \(\theta\); since \(Q\) is defined through \(-2\log L\), this corresponds to maximizing the expected log-likelihood. The minimization also turns out to be explicit (e.g. Shumway and Stoffer (2011), pp. 343-344). Set \(\theta^{(j)}\) to be the minimizer.
Iterate until convergence.
Some advantages over direct numerical optimization of the likelihood: the EM algorithm does not require derivatives of the likelihood (in contrast to other iterative optimization procedures), and it generally works better when the dimension of the state vector \(\mathbf{X}_t\) is large. But the real interest here is that it can be adapted to deal with missing data.
Longitudinal series of blood parameter levels monitored after a bone marrow transplant (\(T = 91\) days): log(white blood count) (WBC) [top], log(platelet) (PLT) [middle], and hematocrit (HCT) [bottom]. Approximately 40% of the values are missing, with missing values occurring primarily after the 35th day. (Shumway and Stoffer (2011), Chapter 6.)
library(astsa)   # Kalman filter/smoother/EM functions and the data used below
plot(blood, type="o", pch=19, xlab="day", main="")
y = cbind(WBC, PLT, HCT)
num = nrow(y)
A = array(0, dim=c(3,3,num)) # creates num 3x3 zero matrices
for(k in 1:num) if (y[k,1] > 0) A[,,k]= diag(1,3)  # A_t = I3 on days with observations; A_t = 0 on missing days
# Initial values
mu0 = matrix(0,3,1)
Sigma0 = diag(c(.1,.1,1) ,3)
Phi = diag(1,3)
cQ = diag(c(.1,.1,1), 3)   # Cholesky-type factor of Q
cR = diag(c(.1,.1,1), 3)   # Cholesky-type factor of R
(em = EM1(num, y, A, mu0, Sigma0, Phi, cQ, cR, 100, .001))  # EM: at most 100 iterations, tolerance .001
## iteration -loglikelihood
## 1 68.28328
## 2 -183.9361
## 3 -194.2051
## 4 -197.5444
## 5 -199.7442
## 6 -201.6431
## 7 -203.4226
## 8 -205.1253
## 9 -206.7595
## 10 -208.3251
## 11 -209.8209
## 12 -211.2464
## 13 -212.602
## 14 -213.8891
## 15 -215.1094
## 16 -216.2651
## 17 -217.3589
## 18 -218.3931
## 19 -219.3705
## 20 -220.2935
## 21 -221.1649
## 22 -221.9869
## 23 -222.762
## 24 -223.4924
## 25 -224.1805
## 26 -224.8282
## 27 -225.4377
## 28 -226.0109
## 29 -226.5495
## 30 -227.0555
## 31 -227.5305
## 32 -227.9762
## 33 -228.3941
## 34 -228.7857
## 35 -229.1524
## 36 -229.4956
## 37 -229.8166
## 38 -230.1166
## 39 -230.3967
## 40 -230.6582
## 41 -230.9019
## 42 -231.1289
## $Phi
## [,1] [,2] [,3]
## [1,] 0.98052698 -0.03494377 0.008287009
## [2,] 0.05279121 0.93299479 0.005464917
## [3,] -1.46571679 2.25780951 0.795200344
##
## $Q
## [,1] [,2] [,3]
## [1,] 0.013786772 -0.001724166 0.01882951
## [2,] -0.001724166 0.003032109 0.03528162
## [3,] 0.018829510 0.035281625 3.61897901
##
## $R
## [,1] [,2] [,3]
## [1,] 0.007124671 0.0000000 0.0000000
## [2,] 0.000000000 0.0168669 0.0000000
## [3,] 0.000000000 0.0000000 0.9724247
##
## $mu0
## [,1]
## [1,] 2.119269
## [2,] 4.407390
## [3,] 23.905038
##
## $Sigma0
## [,1] [,2] [,3]
## [1,] 4.553949e-04 -5.249215e-05 0.0005877626
## [2,] -5.249215e-05 3.136928e-04 -0.0001199788
## [3,] 5.877626e-04 -1.199788e-04 0.1677365489
##
## $like
## [1] 68.28328 -183.93608 -194.20509 -197.54440 -199.74425 -201.64313
## [7] -203.42258 -205.12530 -206.75951 -208.32511 -209.82091 -211.24639
## [13] -212.60202 -213.88906 -215.10935 -216.26514 -217.35887 -218.39311
## [19] -219.37048 -220.29354 -221.16485 -221.98686 -222.76196 -223.49243
## [25] -224.18049 -224.82824 -225.43771 -226.01085 -226.54953 -227.05552
## [31] -227.53054 -227.97621 -228.39410 -228.78569 -229.15242 -229.49563
## [37] -229.81661 -230.11659 -230.39674 -230.65816 -230.90189 -231.12893
##
## $niter
## [1] 42
##
## $cvg
## [1] 0.0009832656
# Graph smoother
ks = Ksmooth1(num, y, A, em$mu0, em$Sigma0, em$Phi, 0, 0, chol(em$Q), chol(em$R), 0)  # smoother at the EM estimates
y1s = ks$xs[1,,]           # smoothed state: WBC
y2s = ks$xs[2,,]           # smoothed state: PLT
y3s = ks$xs[3,,]           # smoothed state: HCT
p1 = 2*sqrt(ks$Ps[1,1,])   # +/- 2 root MSE bands
p2 = 2*sqrt(ks$Ps[2,2,])
p3 = 2*sqrt(ks$Ps[3,3,])
par(mfrow=c(3,1))
plot(WBC, type='p', pch=19, ylim=c(1,5), xlab='day')
lines(y1s)
lines(y1s+p1, lty=2, col=4)
lines(y1s-p1, lty=2, col=4)
plot(PLT, type='p', ylim=c(3,6), pch=19, xlab='day')
lines(y2s)
lines(y2s+p2, lty=2, col=4)
lines(y2s-p2, lty=2, col=4)
plot(HCT, type='p', pch=19, ylim=c(20,40), xlab='day')
lines(y3s)
lines(y3s+p3, lty=2, col=4)
lines(y3s-p3, lty=2, col=4)
Q0: Supplement your analysis of the time series data from Q1, Lecture I, by state-space modeling, if it seems appropriate. You can also omit some of the data so that some values are missing.
Q1: Why are the confidence intervals associated with trend and seasonality decreasing in time for the jj data? Is this feature general or specific to the jj data?
Q2: Derive Kalman recursions for prediction, filtering and smoothing.
Q3: Write down the details behind the use of the EM algorithm for state-space models when it is applied with missing data.
Q4: Outline the Bayesian treatment of state-space models.
Q5: Write down and examine Kalman recursions explicitly for the local level model.
Q6: Write down how an ARIMA\((p,d,q)\) model is embedded into a state-space model, and explain how this can be used to handle missing data.
Q7: The so-called AR\((1)\) model with observational noise, written in state-space form, turns out to be equivalent to an ARMA\((1,1)\) model (Shumway and Stoffer (2011), Example 6.3). Is this a more general phenomenon?
Brockwell, P. J., and R. A. Davis. 2016. Introduction to Time Series and Forecasting. Third. Springer Texts in Statistics. Springer.
Durbin, J., and S. J. Koopman. 2012. Time Series Analysis by State Space Methods. Second. Vol. 38. Oxford Statistical Science Series. Oxford University Press, Oxford.
Gómez, V. 2016. Multivariate Time Series with Linear State Space Structure. Springer.
Petris, G., S. Petrone, and P. Campagnoli. 2009. Dynamic Linear Models with R. Use R! Springer, New York.
Prado, R., and M. West. 2010. Time Series. Texts in Statistical Science Series. CRC Press, Boca Raton, FL.
Shumway, R. H., and D. S. Stoffer. 2011. Time Series Analysis and Its Applications. Third. Springer Texts in Statistics. Springer, New York.