Simply put, Hidden Markov Models (HMMs) are models in which your data is generated from an underlying Markov process. The theory of HMMs is incredibly rich and has been derived (often independently) in different areas of math and statistics. For example, Bayesian Networks and HMMs overlap quite a bit. These types of models have a plethora of different application areas including genetics, medicine, climate science/weather prediction, and finance. The literature on the theory of these models is incredibly dense. In addition, the computational techniques used to fit these models and make predictions can become quite complicated. One could (and people do) get a PhD in these topics. The pupose of this lecture is to just give you a basic feel for the way these models work and the settings in which they are useful so we will omit most of the mathematical and computational details, mentioning them where we feel appropriate.
Let’s get started!
Consider a model with the following dependence structure:
I.e. each observation \(y_t\) is generated by an underlying (unobservable) state \(x_t\). State \(x_{t+1}\) is then dependent only on the previous state. Note that, conditional on \(x_{t+1}\), \(y_{t+1}\) is independent of all previous observations. Mathematically, we can write this model as follows: \[X_{t} = AX_{t-1}+W_{t}\] \[Y_{t} = CX_{t}+V_{t}\] where \(W_t\) and \(V_{t}\) are (usually independent) white noise processes. The first equation is known as the State-Space Equation while the second is called the Observation Equation. (Let’s think of some examples!) Note that \(A\) and \(C\) may be nonlinear operators. However, the computation complexity of estimation and inference becomes more complicated given nonlinearity. In addition, some of the most common models assume the white noise processes are Gaussian which immensely simplifies computation.
In the HMM case we assume that \(X_t\) satisfies the Markov property, i.e. \[P(X_t|X_{t-1}, X_{t-2},...,X_1) = P(X_t|X_{t-1})\] and \(X_t\) is causal variable to \(Y_t\). The observant student may notice that the way we have written our state space equation already implies that \(\{X_t\}\) satifies the Markov property. This is true and state space models are, indeed, HMMs. As was mentioned in the introduction these models have been derived in parallel in various different types of research and often can be referred to by many different names. Whether something is called a “Hidden Markov Model” is usually dependent on the background of the author. In the past, the majority of model referred to as HMMs have an underlying Markov chain with discrete state space. However, continuous state processes are becoming more and more in vogue due to the increase in computational power and advances in simulation techniques. Such models are often useful in Speech recognition, Robot localization, User attention, Medical monitoring, to name a few.
One can think of the simple example of a grad student who is stuck in the basement of Hanes Hall for a week at his basement office, and wants to know if it is rainy or sunny outside. The student has no windows so cannot see for himself if its raining but can observe whether his officemate has an umbrella or not. He decides to model his situation with an HMM. The state process \(X_t\) can take values in {rainy, sunny} and the observations \(Y_t\) can take values in {umbrella, no umbrella}. The student assumes that the underlying Markov chain \(\{X_t\}\) has the following transition probabilities:Consider the following conditional probability: \[P(X_{s:s'}|Y_{k:k'}) = P(X_s,\ldots,X_{s'}|Y_k,\ldots,Y_{k'}),\qquad s<s',k<k'\]
For example, consider the simple example above. Suppose the students wants to estimate the probability that it is currently raining. This would correspond to a filtering problem. In this case, given the observations, one can explicitly compute the corresponding filtered probability. This filtering procedure consists of 2 parts: Prediction and Bayesian update.
First, the prediction step is: \[P(X_s|Y_{1:s-1}) = \sum_{i=1}^sP(X_i|X_{i-1})P(X_{i-1}|Y_{1:i-1})\]
Second, the Bayesian update is: \[P(X_s|Y_{1:s}) = \frac{P(X_s|Y_s)P(X_s|Y_{1:s-1})}{\sum_{i=1}^s P(Y_s|X_s)P(X_t|Y_{1:s-1})}\]
By iterating this process for \(s=1,\ldots,t\) one can compute the filtered probability at time \(t\). This is called the Forward algorithm.
Suppose the student wishes to know if it will rain tomorrow or, perhaps, whether it rained each day this week. These would be referred to as Prediction and Smoothing problems, respectively. Using similar techniques to the one described above, one can obtain prediction and smoothing probabilities explicitly as well.
Before we dive into some examples let’s take a minute to gather our thoughts. The first thing that you may ask yourself is “what is the goal of an HMM?”. Well, there are are several potential area’s of interest.
The first is parameter estimation. Suppose you are simply interested in knowing the structure behind your data. The second is forecasting. Suppose you have a sequence of observations \(\{Y_t\}\) and you wish to forecast what future observations will be. Finally you may be interested in estimating the latent variables \(\{X_t\}\) in your model. Your end goal will depend entirely on the data you have and the particular situation you are in. The nice thing is that HMMs are incredibly flexible.
In addition, there are a ton of ways to accomplish these goals, almost all of them computational. Models can be fit, and values/intervals of interest can be estimated using Maximum Likelihood, Expectation Maximization, Importance Sampling, Particle Filtering, and a variety Bayesian Methods (just to name a few). Each of these methods comes with its own advantages and disadvantages but the one thing in common is that they all become complicated very quickly so we will gloss over almost all of the details in the examples to come.
We consider an example from Zucchini and MacDonald (2009) and Douc, Moulines, and Stoffer (2014). Consider the following plot of number of major earthquakes (magnitude 7 or greater) in the world, 1900-2006. Below is a time-series plot and the ACF/PACF plots for the counts.
library(nltsa)
library(ggplot2)
data("EQcount")
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))
layout(matrix(c(1,2, 1,3), nc=2))
plot(EQcount, type='h')
acf(EQcount)
pacf(EQcount)
The simplest way to model this data might be to assume that each is a random draw from a Poisson distribution. We can already see that there is significant autocorrelation. In addition, we can see a good degree of over dispersion. I.e. the sample variance is much greater than the sample mean.
ggplot()+
geom_bar(mapping = aes(x=EQcount, y = (..count..)/sum(..count..))) +
geom_line(mapping = aes(x= 0:40,y=dpois(0:40,lambda = mean(EQcount))),colour = 'red') +
ylab('Proportion of Earthquakes')
mean(EQcount)
## [1] 19.36449
var(EQcount)
## [1] 51.57344
Based on the time series data it appears that there may be periods with higher rates of earthquakes. Let’s fit a HMM! The underlying state process \(\{X_t, t \in\mathbb{N}\}\) is a two-state Markov chain. Then the obervations (counts ) \(\{Y_t,t\in\mathbb{N}\}\) are generated from a Poisson(\(\lambda_{X_t}\)). I.e. the mean of the observation distribution \(\lambda_{x},x\in\{1,2\}\) is dependent on the state process. The idea is that earthquakes occur more frequently in one of the states than the other (e.g. \(\lambda_1<\lambda_2\)). Now let’s first fit out model.
library(depmixS4)
## estimation ##
set.seed(90210)
model <- depmix(EQcount ~1, nstates=2, data=data.frame(EQcount), family=poisson())
(fm <- fit(model)) # fm contains results from estimation
## iteration 0 logLik: -384.3701
## iteration 5 logLik: -342.4233
## iteration 10 logLik: -341.897
## iteration 15 logLik: -341.8801
## iteration 20 logLik: -341.8788
## iteration 25 logLik: -341.8787
## converged at iteration 26 with logLik: -341.8787
## Convergence info: Log likelihood converged to within tol. (relative change)
## 'log Lik.' -341.8787 (df=5)
## AIC: 693.7574
## BIC: 707.1216
u <-as.vector(getpars(fm))
if (u[7] <= u[8]) { para.mle = c(u[3:6], exp(u[7]), exp(u[8]))
} else { para.mle = c(u[6:3], exp(u[8]), exp(u[7]))
}
mtrans = matrix(para.mle[1:4], byrow=TRUE, nrow=2)
lams = para.mle[5:6]
lams
## [1] 15.41901 26.01445
mtrans
## [,1] [,2]
## [1,] 0.9283489 0.07165115
## [2,] 0.1189548 0.88104517
I.e. when in the first state, earthquakes occur at a rate of 15.4 per year while in the second state earthquakes occur at a rate of 26.0 per year (i.e. \(\lambda_1 = 15.4\) and \(\lambda_2=26.0\)). We can also see the estimated transition rate matrix is:
\[\begin{equation}
M = \left(
\begin{matrix}
M(1,1) & M(1,2)\\
M(2,1) & M(2,2)
\end{matrix}\right)
=\left(
\begin{matrix}
.93 & .07\\
.12 & .88
\end{matrix}\right)
\end{equation}\]
We can now take a look at the ACF/PACF functions by simulating from this model:
# function to generate data
pois.HMM.generate_sample = function(n,m,lambda ,Mtrans,StatDist=NULL){
# n = data length, m = # of states, Mtrans = transition matrix, StatDist = stationary distn
if(is.null(StatDist))StatDist =solve(t(diag(m)-Mtrans +1),rep(1,m))
mvect = 1:m
state = numeric(n)
state [1] = sample(mvect ,1,prob=StatDist)
for (i in 2:n)
state[i]=sample(mvect ,1,prob=Mtrans[state[i-1] ,])
y = rpois(n,lambda=lambda[state ])
list(y= y, state= state)
}
nobs = length(EQcount)
x.star = pois.HMM.generate_sample(n=nobs, m=2, lambda=lams, Mtrans=mtrans)$y
ACFSim = acf(x.star, plot=FALSE)$acf
PACFSim = pacf(x.star, plot=FALSE)$acf
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))
layout(matrix(c(1,2, 1,3), nc=2))
plot(x.star, type='h')
acf(EQcount)
points(0:20,ACFSim, type = 'h', col='red')
pacf(EQcount)
points(PACFSim, type = 'h', col='red')
Althought it’s not perfect, we are fitting a very basic HMM and we are already capturing a lot of temporal dependance. In addition, we can see below that we are also capturing much of the overdispersion:
mean(x.star)
## [1] 21.1215
var(x.star)
## [1] 58.69265
Finally, we can take a look at our estimated states based on the smoothed probabilities:
pi1 = mtrans[2,1]/(2 - mtrans[1,1] - mtrans[2,2])
pi2 = 1 - pi1
## graphics ##
layout(matrix(c(1,2,1,3), 2))
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))
# data and states
plot(EQcount, main="", ylab='EQcount', type='h', col='#808080')
text(EQcount, col=posterior(fm)[,1], labels=posterior(fm)[,1], cex=.9)
# prob of state 2
plot(ts(posterior(fm)[,3], start=1900), ylab=expression(hat(pi)[~t]*' (2 | data)'))
abline(h=.5, lty=2)
# histogram
hist(EQcount, breaks=30, prob=TRUE, main="")
xvals = seq(1,45)
u1 = pi1*dpois(xvals, lams[1])
u2 = pi2*dpois(xvals, lams[2])
lines(xvals, u1, col=3)
lines(xvals, u2, col=4)
Now consider the sequence of residuals:
resid <- EQcount - lams[posterior(fm)[,1]]
layout(matrix(c(1,2,1,3), 2))
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))
plot(resid, main="", ylab='Residuals', type='h', col='#808080')
acf(resid)
pacf(resid)
We now present an example from Shumway and Stoffer (2010). The time series consists of monthly influenza deaths per 10,000 in the U.S. for the 11 years spanning 1968 to 1978.
library(astsa)
plot.ts(flu)
The researchers are concerned with predicting when there are flu epidemics. This can be modeled using HMMs with two underlying states, epidemic vs. non-epidemic. Suppose that normally we can model influenza mortality (\(y_t\)) as follows: \[y_t = x_{t1}+x_{t3}+v_t\] where \(x_{t1}\) is an AR(2) processes representing seasonal fluctuations, \(x_{t3}\) represents a fixed trend component, and \(v_t\) is white noise with \(var(v_t)=\sigma^2_v\). More explicitly we write, \[x_{t1} = \alpha_1x_{t-1,1}+\alpha_2x_{t-2,1}+w_{t1}\] and \[x_{t3} = x_{t-1,3}+w_{t3}\] where \(w_{t1}\) is white noise with \(var(w_{t1})=\sigma^2_1\) and \(w_{t3}\) is NOT white noise and \(var(w_{t3})=0\) (this is a fancy way of saying \(w_{t3}\) is a constant).
Now suppose that in an epidemic there is a third component \(x_{t2}\) which represents the sharp rise in motality during an epidemic. We model this component as the AR(1) process \[x_{t2}=\beta_0+\beta_1 x_{t-1,2}+w_{t2}\] where \(w_{t2}\) is white noise with \(var(w_{t2})=\sigma^2_2\). Then, during an epidemic, we express our observation equation as \[y_t = x_{t1}+x_{t2}+x_{t3}+v_t.\]
Great! Now let’s write this model in its full state space representation. The state equation should look like this: \[x_t = \Phi x_{t-1} + \Upsilon u_t + w_t.\] TO THE BOARD!!!!
As we have shown: \[\begin{equation} x_t = \left( \begin{matrix} x_{t1}\\ x_{t-1,1}\\ x_{t2}\\ x_{t3} \end{matrix}\right),\quad \Phi = \left( \begin{matrix} \alpha_1 & \alpha_2 & 0 & 0\\ 0 & 1 & 0 & 0 \\ 0 & 0 & \beta_1 & 0\\ 0 & 0 & 0 & 1 \end{matrix}\right),\quad \Upsilon = \left( \begin{matrix} 0\\ 0\\ \beta_0\\ 0 \end{matrix}\right),\quad u_t=1,\quad w_t = \left( \begin{matrix} w_{t1}\\ 0\\ w_{t2}\\ 0 \end{matrix}\right) \end{equation}\] I.e. the state equation is: \[\begin{equation} \left( \begin{matrix} x_{t1}\\ x_{t-1,1}\\ x_{t2}\\ x_{t3} \end{matrix}\right)= \left( \begin{matrix} \alpha_1 & \alpha_2 & 0 & 0\\ 0 & 1 & 0 & 0 \\ 0 & 0 & \beta_1 & 0\\ 0 & 0 & 0 & 1 \end{matrix}\right) \left( \begin{matrix} x_{t-1,1}\\ x_{t-2,1}\\ x_{t-1,2}\\ x_{t-1,3} \end{matrix}\right)+ \left( \begin{matrix} 0\\ 0\\ \beta_0\\ 0 \end{matrix}\right)+ \left( \begin{matrix} w_{t1}\\ 0\\ w_{t2}\\ 0 \end{matrix}\right) \end{equation}\] Then the observation equation is simply \[y_t=A_tx_t+v_t\] where \(A_t = M_1 = [1,0,0,1]\) normally and \(A_t=M_2 = [1,0,1,1]\) during an epidemic. We will assume this sequence of states is a Markov Chain with transition probability matrix \[\begin{equation} \Pi = \left( \begin{matrix} .75 & .25 \\ .25 & .75 \end{matrix}\right) \end{equation}\]M1 = as.matrix(cbind(1,0,0,1)) # obs matrix normal
M2 = as.matrix(cbind(1,0,1,1)) # obs matrix flu epi
There are now several goals ahead of us:
Estimate our parameters \(\Theta = (\alpha_1,\alpha_2,\beta_0,\beta_1,\sigma_1,\sigma_2,\sigma_v)\). This is accomplished using computational techniques to maximize the likelihood function (we well not go into detail on this).
Obtain one-step-ahead predictions for when there will be an epidemic. I.e. try to forecast, one month in advance, whether there will be an epidemic.
Note that in this case \(m=2\) because there are two states of the system (normal and epidemic). \(\pi_j(t|t-1)\) will represent the one-step-ahead probabilities. I.e. \(\pi_2(t|t-1)\) represents the probability that the system will be in an epidemic at time \(t\) given data up to time \(t-1\). There is much more structure/theory going on behind the scenes than presented here but the important thing to keep in mind is that \(\pi_j(t|t-1)\) depends on \(\Pi\).
# Function to Calculate Likelihood
y = as.matrix(flu)
num = length(y)
nstate = 4;
prob = matrix(0,num,1); yp = y # to store pi2(t|t-1) & y(t|t-1)
xfilter = array(0, dim=c(nstate,1,num)) # to store x(t|t)
Linn = function(para){
alpha1=para[1]; alpha2=para[2]; beta0=para[3]
sQ1=para[4]; sQ2=para[5]; like=0
xf=matrix(0, nstate, 1) # x filter
xp=matrix(0, nstate, 1) # x predict
Pf=diag(.1, nstate) # filter covar
Pp=diag(.1, nstate) # predict covar
pi11 <- .75 -> pi22; pi12 <- .25 -> pi21; pif1 <- .5 -> pif2
phi=matrix(0,nstate,nstate)
phi[1,1]=alpha1; phi[1,2]=alpha2; phi[2,1]=1; phi[4,4]=1
Ups = as.matrix(rbind(0,0,beta0,0))
Q = matrix(0,nstate,nstate)
Q[1,1]=sQ1^2; Q[3,3]=sQ2^2; R=0 # R=0 in final model
# begin filtering
for(i in 1:num){
xp = phi%*%xf + Ups; Pp = phi%*%Pf%*%t(phi) + Q
sig1 = as.numeric(M1%*%Pp%*%t(M1) + R)
sig2 = as.numeric(M2%*%Pp%*%t(M2) + R)
k1 = Pp%*%t(M1)/sig1; k2 = Pp%*%t(M2)/sig2
e1 = y[i]-M1%*%xp; e2 = y[i]-M2%*%xp
pip1 = pif1*pi11 + pif2*pi21; pip2 = pif1*pi12 + pif2*pi22;
den1 = (1/sqrt(sig1))*exp(-.5*e1^2/sig1);
den2 = (1/sqrt(sig2))*exp(-.5*e2^2/sig2);
denom = pip1*den1 + pip2*den2;
pif1 = pip1*den1/denom; pif2 = pip2*den2/denom;
pif1=as.numeric(pif1); pif2=as.numeric(pif2)
e1=as.numeric(e1); e2=as.numeric(e2)
xf = xp + pif1*k1*e1 + pif2*k2*e2
eye = diag(1, nstate)
Pf = pif1*(eye-k1%*%M1)%*%Pp + pif2*(eye-k2%*%M2)%*%Pp
like = like - log(pip1*den1 + pip2*den2)
prob[i]<<-pip2; xfilter[,,i]<<-xf; innov.sig<<-c(sig1,sig2)
yp[i]<<-ifelse(pip1 > pip2, M1%*%xp, M2%*%xp)
}
return(like)
}
Step 2: Estimation. Spoiler alert: when the model was ran for the first time the parameters \(\beta_1\) and \(\sigma_v\) were insignificant so they were removed from the model. Note this means that during an epidemic there is only a level shift in the mortality rate and there is also no measurement error. The likelihood function is maximized using a quasi-Newton-Raphson method:
alpha1=1.4; alpha2=-.5; beta0=.3; sQ1=.1; sQ2=.1
init.par = c(alpha1, alpha2, beta0, sQ1, sQ2)
(est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE, control=list(trace=1,REPORT=1)))
## initial value -236.860522
## iter 2 value -313.882451
## iter 3 value -320.373891
## iter 4 value -322.538581
## iter 5 value -326.037522
## iter 6 value -326.220434
## iter 7 value -328.165001
## iter 8 value -337.802054
## iter 9 value -339.102455
## iter 10 value -339.278478
## iter 11 value -339.295742
## iter 12 value -339.295935
## iter 13 value -339.295949
## iter 13 value -339.295949
## final value -339.295949
## converged
## $par
## [1] 1.40570967 -0.62198715 0.21049042 0.02310306 0.11217287
##
## $value
## [1] -339.2959
##
## $counts
## function gradient
## 69 13
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2] [,3] [,4] [,5]
## [1,] 781.6937 779.66092 112.58312 1956.837 -32.89720
## [2,] 779.6609 1000.99490 82.68173 -107.393 -53.07502
## [3,] 112.5831 82.68173 1786.32460 -3798.730 -594.69751
## [4,] 1956.8370 -107.39297 -3798.72986 418827.665 7721.44658
## [5,] -32.8972 -53.07502 -594.69751 7721.447 3892.96236
SE = sqrt(diag(solve(est$hessian)))
u = cbind(estimate=est$par, SE)
rownames(u)=c("alpha1","alpha2","beta0","sQ1","sQ2");
u
## estimate SE
## alpha1 1.40570967 0.078587727
## alpha2 -0.62198715 0.068733109
## beta0 0.21049042 0.024625302
## sQ1 0.02310306 0.001635291
## sQ2 0.11217287 0.016684663
YAYAYAYAYAY!!!! We’ve fit out model! Note not only have we fit our parameters but also computed our one-step-ahead probabilities. In plot (a) below we have plotted the original data (solid-line) along with the estimated one-step-ahead probabilities. In (b) we round the probabilities to 1 or 0.
predepi = ifelse(prob<.5,0,1); k = 6:length(y)
Time = time(flu)[k]
par(mfrow=c(2,1), mar=c(2,3,1,1)+.1, cex=.9)
plot(Time, y[k], type="o", ylim=c(0,1),ylab="")
lines(Time, prob[k], lty="dashed", lwd=1.2)
text(1979,.75,"(a)")
plot(Time, y[k], type="o", ylim=c(0,1),ylab="")
lines(Time, predepi[k], lty="dashed", lwd=1.2)
text(1979,.75,"(b)")
Below, the dotted lines show confidence intervals for the mortality rates. Note that these are one-step-ahead CI’s.
plot(Time, y[k], type="p", pch=1, ylim=c(.1,.9),ylab="")
prde1 = 2*sqrt(innov.sig[1]); prde2 = 2*sqrt(innov.sig[2])
prde = ifelse(predepi[k]<.5, prde1,prde2)
lines(Time, yp[k]+prde, lty=2, lwd=1.5)
lines(Time, yp[k]-prde, lty=2, lwd=1.5)
Finally we show a decomposition of the three filtered components \(x_{t1},x_{t2},x_{t3}\).
plot(Time, xfilter[1,,k], type="l", ylim=c(-.1,.4), ylab="")
lines(Time, xfilter[3,,k]); lines(Time, xfilter[4,,k])
Just for fun, we refitted the model with the transition probability as parameters. I’m omitting the code but the analogous output is as follows:
## initial value -236.860522
## iter 2 value -313.915875
## iter 3 value -320.395110
## iter 4 value -322.589054
## iter 5 value -326.300102
## iter 6 value -326.645930
## iter 7 value -339.498612
## iter 8 value -345.527929
## iter 9 value -350.415329
## iter 10 value -351.176512
## iter 11 value -351.756986
## iter 12 value -353.050129
## iter 13 value -353.113650
## iter 14 value -353.119098
## iter 15 value -353.120880
## iter 16 value -353.121849
## iter 17 value -353.122026
## iter 17 value -353.122026
## final value -353.122026
## converged
## $par
## [1] 1.36328558 -0.59576499 0.20542276 0.02326422 0.11489695 2.70259894
## [7] 1.04261253
##
## $value
## [1] -353.122
##
## $counts
## function gradient
## 75 17
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2] [,3] [,4] [,5]
## [1,] 745.2237426 724.365135 -17.926333 1346.41611 155.193104
## [2,] 724.3651354 918.775224 4.365372 15.85268 68.345003
## [3,] -17.9263332 4.365372 1765.663749 -1811.42447 -465.898955
## [4,] 1346.4161085 15.852675 -1811.424466 397867.00311 6339.956190
## [5,] 155.1931040 68.345003 -465.898955 6339.95619 3115.962601
## [6,] 4.3705429 1.717474 7.594773 37.15773 -4.151962
## [7,] 0.4882843 -6.508209 -10.149621 363.52564 13.917451
## [,6] [,7]
## [1,] 4.3705429 0.4882843
## [2,] 1.7174741 -6.5082092
## [3,] 7.5947732 -10.1496209
## [4,] 37.1577327 363.5256372
## [5,] -4.1519625 13.9174511
## [6,] 6.8156936 -0.1866078
## [7,] -0.1866078 3.8459440
## estimate SE
## alpha1 1.36328558 0.078832534
## alpha2 -0.59576499 0.070913541
## beta0 0.20542276 0.024454206
## sQ1 0.02326422 0.001685588
## sQ2 0.11489695 0.018704894
## pi11 2.70259894 0.386351399
## pi22 1.04261253 0.550438995
## [1] 0.9371798
## [1] 0.7393538
Douc, Randal, Eric Moulines, and David Stoffer. 2014. Nonlinear Time Series: Theory, Methods and Applications with R Examples. CRC Press.
Shumway, Robert, and David Stoffer. 2010. Time Series Analysis and Its Applications: With R Examples. Springer Science & Business Media.