Vector Autoregressions (VARs) are canonical models for (stationary) multivariate time series data \(\mathbf{X}_t = (X_{1,t},\ldots,X_{d,t})'\), \(t=1,\ldots,T\). Note that the number of VAR\((p)\) model parameters is (assuming zero mean) \[ pd^2 + d(d-1)/2. \] This can be quite large even for moderate \(d\). For example:
p <- 2
d <- 5
p*d^2 + d*(d-1)/2
## [1] 60
p <- 3
d <- 10
p*d^2 + d*(d-1)/2
## [1] 345
# In the example used below:
p <- 2
d <- 51
p*d^2 + d*(d-1)/2
## [1] 6477
# While T = 281
Many of these coefficients are expected to be “non-significant” (i.e. zero). Choosing “significant” (i.e. non-zero) coefficients and “significant” variables/components (and their lags) is known as variable selection.
The goal is to review some variable selection methods. (A related question is choosing the order \(p\).) As the dimension \(d\) increases, these variable selection methods become computationally expensive and/or stop working. In this case, penalized (regularized) estimation methods such as LASSO and its variants come in handy. We shall discuss some of these penalized estimation methods as well. The methods will be illustrated on the following Google flu trends time series.
The following is a data set of Google flu trends that attracted quite a bit of attention (see Ginsberg et al. (2009), Dugas et al. (2013), Lazer et al. (2014)). The data are weekly measurements for 51 US states (the 50 states plus DC) over the period September 2003 - April 2013, derived from flu-related Google searches per state. Several states have records starting only a number of years after 2003.
The states are separated below by the CDC (Centers for Disease Control and Prevention) regions, as designated on their website. For example, "CT", "ME", "MA", "NH", "RI" and "VT" form one region designated by the CDC.
stateflu <- read.csv("states_abbrev.csv", sep=",", header=TRUE)
stateflu <- as.matrix(stateflu[,c(-1)])
stateflu <- log(na.omit(stateflu))
#plot data by state
x <- 1:dim(stateflu)[1]
matplot(x, stateflu, type = "l", lty = 1)
library(vars)
library(corrplot)
lag <- 2
cols <- dim(stateflu)[2]
rows <- dim(stateflu)[1]
fitvar <- VAR(stateflu, p = lag, type="const")
#get intercept estimate, coefficient matrices
mu <- matrix(NA, nrow = 1, ncol = cols)
states <- names(fitvar$varresult)
count <- 1
for (s in states) {
  temp <- fitvar$varresult[s]
  end <- length(coef(temp[[1]]))
  mu[count] <- coef(temp[[1]])[end]  # the intercept is the last coefficient
  count <- count + 1
}
A1var <- Acoef(fitvar)[[1]]
A2var <- Acoef(fitvar)[[2]]
#change labels to state abbreviations
stateabb <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID",
              "IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO",
              "MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA",
              "RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
colnames(A1var) <- rownames(A1var) <- stateabb
colnames(A2var) <- rownames(A2var) <- stateabb
#reorder A1, A2 by CDC regions
regnorder <- c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA",
               "WV","AL","FL","GA","KY","MS","NC","SC","TN","IL","IN","MI","MN",
               "OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT",
               "ND","SD","UT","WY","AZ","CA","HI","NV","AK","ID","OR","WA")
A1varbyregn <- A1var[regnorder, regnorder]
A2varbyregn <- A2var[regnorder, regnorder]
corrplot(round(A1varbyregn,1), mar=c(1,1,2,1), method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="VAR(2), A_1")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")
corrplot(round(A2varbyregn,1), mar=c(1,1,2,1), method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="VAR(2), A_2")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")
Note a number of small coefficients. Numerical "instability" (larger variability) of the estimates is also expected, since a large number of coefficients needs to be estimated from a relatively small sample of size \(T\), and the regression matrix "\(X'X\)" might be ill-conditioned.
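As a rough check of the latter for the fit above, one can look at the condition number of the stacked regressor matrix (lags 1 and 2 of stateflu); this is a quick sketch of our own using base R's kappa().
# 2-norm condition number (estimate) of X'X for the VAR(2) regressors
X <- cbind(stateflu[2:(rows - 1), ], stateflu[1:(rows - 2), ])
kappa(crossprod(X))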
Question: What is the point of selecting a smaller number of variables?
A number of approaches have been suggested, and work well for smaller dimensions (number of parameters). The following is an approach based on \(t\)-statistics: Let \(\widehat \Phi_r(j,k) = \widehat \Phi_{r,jk}\), \(r=1,\ldots,p\), \(j,k=1,\ldots,d\), be the OLS/MLE estimators of the VAR coefficients.
Compute the \(t\)-statistics: \[ t_{r,j,k} = \frac{\widehat \Phi_r(j,k)}{\mbox{s.e.}(\widehat \Phi_r(j,k))}. \]
Create a sequence \(Q\) of the \(d^2p\) triplets \((r,j,k)\) by ranking the absolute values of the above \(t\)-ratios from highest to lowest.
For each \(m \in \{0, 1,\ldots, d^2p\}\), consider the model whose \(m\) non-zero VAR coefficients correspond to the top \(m\) triplets in the sequence \(Q\). With the remaining coefficients constrained to zero, carry out the constrained parameter estimation (see the notes and the sketch below) and compute the corresponding BIC, \(\mbox{BIC}(m) = -2 \log L + \log T \cdot m\).
Choose \(m^*\) that gives the minimum BIC value.
The method seems to work quite well in lower-dimensional settings. See several illustrations below.
Notes
library(vars)
library(fpp)
data(usconsumption)
var_consumption <- VAR(usconsumption,p=1,type="const")
# rows: equations (consumption, income); columns: regressors
# (consumption.l1, income.l1, const); a 0 restricts that coefficient to zero
restrict <- matrix(c(1, 0, 1,
                     1, 1, 1),
                   nrow=2, ncol=3, byrow=TRUE)
restrict(var_consumption, method = "manual", resmat = restrict)
##
## VAR Estimation Results:
## =======================
##
## Estimated coefficients for equation consumption:
## ================================================
## Call:
## consumption = consumption.l1 + const
##
## consumption.l1 const
## 0.3573904 0.4865219
##
##
## Estimated coefficients for equation income:
## ===========================================
## Call:
## income = consumption.l1 + income.l1 + const
##
## consumption.l1 income.l1 const
## 0.5633422 -0.2316257 0.4840611
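Combining the \(t\)-ratio ranking with restricted estimation as above, a minimal sketch of the selection procedure could look as follows. The function svar.tstat.sketch and its details are our own; the saved results in the illustrations below come from sVAR.tstat() in VAR-library.R, which implements the same idea.
svar.tstat.sketch <- function(y, p) {
  d <- ncol(y)
  Tn <- nrow(y)
  fit <- VAR(y, p = p, type = "const")
  # |t|-ratios of the d^2*p lag coefficients (rows: regressors, cols: equations)
  tmat <- sapply(fit$varresult,
                 function(eq) summary(eq)$coefficients[1:(d*p), "t value"])
  Q <- order(abs(tmat), decreasing = TRUE)   # ranked sequence of triplets (r,j,k)
  best <- list(bic = Inf, m = 0)
  for (m in seq_along(Q)) {                  # fits d^2*p models: slow for large d
    keep <- matrix(0, d*p, d)
    keep[Q[1:m]] <- 1                        # top-m coefficients left unrestricted
    rfit <- restrict(fit, method = "manual", resmat = cbind(t(keep), 1))
    bic <- -2*as.numeric(logLik(rfit)) + log(Tn)*m
    if (bic < best$bic) best <- list(bic = bic, m = m, fit = rfit)
  }
  best
}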
The following are a few examples, for simulated and real data, illustrating the variable selection method described above.
source("VAR-library.R")
# # Generating sVAR(1) with dim=3
# Tt <- 500
# delta <- 1
# Sigma <- matrix(c(1, delta/4, delta/4, delta/4, 1, 0, delta/4, 0, 1), ncol=3)
# A <- matrix(c(.9, 0, .1, 0, -.3, 0, 0, 0, .2), ncol=3)
# nrep <- 100
#
# Asum <- matrix(0,3,3)
# Afre <- matrix(0,3,3)
# for(r in 1:nrep){
# # print(r)
# y <- VAR.sim(Tt, A, Sigma=Sigma)
# sVar.out <- sVAR.tstat(y, p=1)
# Asum <- Asum + sVar.out$hatA
# Afre <- Afre + 1*(sVar.out$hatA != 0)
# # print(sVar.out$hatA)
# }
# Amean <- Asum/nrep
# Afre <- 100*Afre/nrep
# # print(Amean)
# # print(Afre)
# save(A,Amean,Afre,file="VarSelectSimul.RData")
load("VarSelectSimul.RData")
library(corrplot)
par(mfrow=c(1,3))
corrplot(A, mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
title("True")
corrplot(Amean, mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
title("Average estimated A_1")
corrplot(Afre, mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
title("Frequency of non-zero estimates")
# ptm <- proc.time()
# stateflu.some <- stateflu[,c("CT","ME","MA","NH","RI","VT")]
# sVar.out <- sVAR.tstat(t(stateflu.some), p=2)
# ltm <- proc.time() - ptm
# save(sVar.out,ltm,file="VarSelect6States.RData")
load("VarSelect6States.RData")
ltm
## user system elapsed
## 10.833 0.410 11.274
par(mfrow=c(1,2))
corrplot(sVar.out$hatA[,1:6], mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
corrplot(sVar.out$hatA[,7:12], mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
# ptm <- proc.time()
# stateflu.some2 <- stateflu[,c("CT","ME","MA","NH","RI","VT","NJ","NY")]
# sVar.out2 <- sVAR.tstat(t(stateflu.some2), p=2)
# ltm <- proc.time() - ptm
# save(sVar.out2,ltm,file="VarSelect8States.RData")
load("VarSelect8States.RData")
ltm
## user system elapsed
## 44.440 2.743 47.432
par(mfrow=c(1,2))
corrplot(sVar.out2$hatA[,1:8], mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
corrplot(sVar.out2$hatA[,9:16], mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
Note: The method is time-consuming for larger \(d\); e.g., running it for 14 states took almost one hour.
Question: What can be done, then, for larger dimension \(d\)? At least two approaches are discussed below.
LASSO (Least Absolute Shrinkage and Selection Operator) of Tibshirani (1996) estimates the coefficients through \[ (\widehat \Phi_1,\ldots,\widehat \Phi_p) = \mbox{argmin}_{\Phi_1,\ldots,\Phi_p} \frac{1}{2T} \sum_{t=p+1}^T \| \mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p}\|^2_2 + \lambda \sum_{r=1}^p \|\Phi_r\|_1, \] where \(\|\Phi_r\|_1 = \sum_{j,k=1}^d |\Phi_{r,jk}|\) is the \(\ell_1\)-penalty and \(\lambda\) is a tuning regularization/penalty parameter. The LASSO estimator has no closed-form solution, but the above optimization problem can be solved efficiently numerically. This is implemented in the R package glmnet, which allows for the more general elastic-net penalty, as in \[ (\widehat \Phi_1,\ldots,\widehat \Phi_p) = \mbox{argmin}_{\Phi_1,\ldots,\Phi_p} \frac{1}{2T} \sum_{t=p+1}^T \| \mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p}\|^2_2 + \lambda \Big(\alpha \sum_{r=1}^p \|\Phi_r\|_1 + \frac{(1-\alpha)}{2} \sum_{r=1}^p \|\Phi_r\|_2^2 \Big), \] where \(\alpha =1\) (the default) corresponds to LASSO and \(\alpha=0\) to ridge regression.
In fact, there is an efficient algorithm (LARS; see Efron et al. (2004)) that produces the LASSO solutions for all \(\lambda\)'s at once, referred to as the LASSO solution path. glmnet, on the other hand, uses coordinate descent.
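To see how the VAR fits into this regression setup, stack the lagged values into \(\mathbf{x}_t = (\mathbf{X}_{t-1}',\ldots,\mathbf{X}_{t-p}')'\) and use the identity \[ [\Phi_1 \ \ldots \ \Phi_p]\, \mathbf{x}_t = (\mathbf{x}_t' \otimes I_d)\, \mbox{vec}([\Phi_1 \ \ldots \ \Phi_p]), \] so that the VAR least-squares problem becomes one large linear regression of the stacked responses \((\mathbf{X}_{p+1}',\ldots,\mathbf{X}_T')'\) on a design matrix with row blocks \(\mathbf{x}_t' \otimes I_d\). This is exactly the kronecker construction used in the (commented) code further below.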
Notes
set.seed(2)
T <- 1000
p1 <- 0.4
p2 <- -0.3
p3 <- -0.1
y <- arima.sim(list(order = c(3,0,0), ar = c(p1,p2,p3)), n = T)
plot(y)
response <- y[5:T]
input <- cbind(y[4:(T-1)],y[3:(T-2)],y[2:(T-3)],y[1:(T-4)])
response <- (response - mean(response))/sqrt(var(response))
input <- scale(input)  # standardize each column (center and scale)
library(glmnet)
fit <- glmnet(input, response)
plot(fit,xvar = "lambda",label = TRUE)
#print(fit)
#coef(fit,s=0.1)
cvfit <- cv.glmnet(input, response)
plot(cvfit)
cvfit$lambda.min
## [1] 0.01339785
coef(cvfit, s = "lambda.min")
## 5 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.001057139
## V1 0.396988092
## V2 -0.225247532
## V3 -0.110871611
## V4 .
# nf <- 10
# p <- 2
# data <- t(stateflu)
#
# y <- data - rowMeans(data)
# T1 <- dim(y)[2]-p
# k <- dim(y)[1]
# # create vector X1 and Y1
# X1 <- matrix(0, k*p, T1)
# Y1 <- matrix(0, k, T1)
# for(j in 1:T1){
# # ar term
# id <- seq(from= j+p-1, to = j, by=-1)
# x <- as.vector(y[,id])
# X1[,j] <- x
# Y1[,j] <- y[,(j+p)]
# }
# ty <- t(y)
# ty <- data.frame(ty)
#
# y1 <- as.vector(Y1)
# x1 <- kronecker(t(X1), diag(1, k))
#
# cvfit <- cv.glmnet(x1, y1, alpha = 1, intercept=TRUE, standardize=TRUE, type.measure = "mse", nfolds=nf)
# save(cvfit,k,file="LassoStand.RData")
# # runs 10 minutes
# # standardize=FALSE gives slightly different results visually
load("LassoStand.RData")
#load("LassoStandno.RData")
cf.cv <- coef(cvfit, s = "lambda.min")
cf.cv <- cf.cv[-1]
hA1 <- matrix(cf.cv, nrow=k, byrow=FALSE)
A1svar <- hA1[,1:51]
A2svar <- hA1[,52:102]
colnames(A1svar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
colnames(A2svar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
rownames(A1svar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
rownames(A2svar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
A1svarbyregn <- A1svar[c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA"),c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA")]
A2svarbyregn <- A2svar[c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA"),c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA")]
corrplot(round(A1svarbyregn,1), mar=c(1,1,2,1), method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="sVAR(2), A_1")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")
corrplot(round(A2svarbyregn,1), mar=c(1,1,2,1), method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="sVAR(2), A_2")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")
Note, e.g., the cross-dependence within the first CDC region (the "CT", "ME", "MA", "NH", "RI", "VT" block in the top-left corner).
lag <- 2
cols <- dim(stateflu)[2]
rows <- dim(stateflu)[1]
holdoutsize <- 10
trainsize <- rows - holdoutsize
stateflu.train <- stateflu[1:trainsize,]
stateflu.mean <- colMeans(stateflu.train)
fitvar <- VAR(stateflu.train - stateflu.mean, p = lag, type="none")
# nf <- 10
# p <- 2
# data <- t(stateflu.train)
# y <- data - rowMeans(data)
# T1 <- dim(y)[2]-p
# k <- dim(y)[1]
# # create vector X1 and Y1
# X1 <- matrix(0, k*p, T1)
# Y1 <- matrix(0, k, T1)
# for(j in 1:T1){
# # ar term
# id <- seq(from= j+p-1, to = j, by=-1)
# x <- as.vector(y[,id])
# X1[,j] <- x
# Y1[,j] <- y[,(j+p)]
# }
# ty <- t(y)
# ty <- data.frame(ty)
#
# y1 <- as.vector(Y1)
# x1 <- kronecker(t(X1), diag(1, k))
#
# cvfit <- cv.glmnet(x1, y1, alpha = 1, intercept=TRUE, standardize=TRUE, type.measure = "mse", nfolds=nf)
# save(cvfit,k,file="LassoStandNoTrain.RData")
load("LassoStandNoTrain.RData")
hatA0 <- Bcoef(fitvar)
cf.cv <- coef(cvfit, s = "lambda.min")
cf.cv <- cf.cv[-1]
hatA1 <- matrix(cf.cv, nrow=k, byrow=FALSE)
sum(hatA0 != 0)
## [1] 5202
sum(hatA1 != 0)
## [1] 1974
ff0 <- ff1 <- matrix(0, 51, holdoutsize)
for( i in 1:holdoutsize){
train <- t(stateflu[1:(trainsize+i-1),])
mf <- rowMeans(train)
train <- train - mf
ff0[,i] <- VAR.forecast(train, h=1, hatA0) + mf
ff1[,i] <- VAR.forecast(train, h=1, hatA1) + mf
}
## 1-step out-of-sample forecast MSPE
YTF <- t(stateflu[-(1:trainsize),])
sum((YTF - ff0)^2)
## [1] 14.22553
sum((YTF - ff1)^2)
## [1] 7.913374
plot.ts(YTF[1,], col=1)
lines(ff0[1,], col=2)
lines(ff1[1,], col=3)
legend("bottomleft", c("TRUE", "hatA0", "hatA1"), lty=c(1,1,1), col= c(1,2,3))
title("AL")
It is quite well known that regular LASSO overestimates the number of non-zero coefficients. Adaptive LASSO (A-LASSO) tends to correct this tendency. A-LASSO estimation of Zou (2006) is defined as \[ (\widehat \Phi_1,\ldots,\widehat \Phi_p) = \mbox{argmin}_{\Phi_1,\ldots,\Phi_p} \frac{1}{2T} \sum_{t=p+1}^T \| \mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p}\|^2_2 + \lambda\sum_{r=1}^p \sum_{j,k=1}^d \widehat w_{r,jk} |\Phi_{r,jk}|, \] where, for example, when a VAR model can be fitted through OLS (i.e. the number of parameters is smaller than the number of observations), \[ \widehat w_{r,jk} = \frac{1}{|\widehat \Phi_{r,jk}^{ols}|}. \] In fact, the A-LASSO version which I typically use in my work is based on \[ (\widehat \Phi_1,\ldots,\widehat \Phi_p) = \mbox{argmin}_{\Phi_1,\ldots,\Phi_p} \frac{1}{2T} \sum_{t=p+1}^T (\mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p})' \Sigma_{\mathbf Z}^{-1} (\mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p}) \] \[ + \ \lambda\sum_{r=1}^p \sum_{j,k=1}^d \widehat w_{r,jk} |\Phi_{r,jk}|, \] which is implemented numerically by iterating between estimating the coefficients and plugging in the updated \(\widehat \Sigma_{\mathbf Z}\) until convergence. (The same or similar objective functions are used for regular LASSO as well.) It is known that in "lower dimensions" and for VARs, A-LASSO estimates the correct model asymptotically as the sample size increases; in particular, no additional assumptions of the kind needed for LASSO are required.
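A quick way to experiment with adaptive weights in glmnet is through its penalty.factor argument. The following is a minimal sketch, assuming the stacked regression variables x1 and y1 constructed as in the commented code above, and using a ridge pilot fit in place of OLS (which is infeasible here, since the number of parameters exceeds the sample size). As noted below, the saved estimates were obtained with the A-LASSO function in VAR-library.R, not with this sketch.
# adaptive weights w = 1/|pilot estimate|, as in Zou (2006); ridge pilot
# (alpha = 0) replaces OLS since the latter is infeasible in this dimension
pilot <- coef(cv.glmnet(x1, y1, alpha = 0), s = "lambda.min")[-1]
w <- 1/pmax(abs(pilot), 1e-6)   # guard against division by zero
cvfit.a <- cv.glmnet(x1, y1, alpha = 1, penalty.factor = w)
hatA.a <- matrix(coef(cvfit.a, s = "lambda.min")[-1], nrow = k)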
# The saved A-LASSO estimates were obtained using the A-LASSO function in the library VAR-library.R
load("LassoAStandNoTrain.RData")
hatA2 <- fitsvar$hatA
sum(hatA2 != 0)
## [1] 545
A1savar <- hatA2[,1:51]
A2savar <- hatA2[,52:102]
colnames(A1savar) <- rownames(A1savar) <- stateabb
colnames(A2savar) <- rownames(A2savar) <- stateabb
A1savarbyregn <- A1savar[regnorder, regnorder]
A2savarbyregn <- A2savar[regnorder, regnorder]
corrplot(round(A1savarbyregn,1), mar=c(1,1,2,1), method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="saVAR(2), A_1")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")
corrplot(round(A2savarbyregn,1), mar=c(1,1,2,1), method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="saVAR(2), A_2")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")
ff2 <- matrix(0, 51, holdoutsize)
for( i in 1:holdoutsize){
train <- t(stateflu[1:(trainsize+i-1),])
mf <- rowMeans(train)
train <- train - mf
ff2[,i] <- VAR.forecast(train, h=1, hatA2) + mf
}
sum((YTF - ff2)^2)
## [1] 7.138015
plot.ts(YTF[1,], col=1)
lines(ff0[1,], col=2)
lines(ff1[1,], col=3)
lines(ff2[1,], col=4)
legend("bottomleft", c("TRUE","hatA0","hatA1","hatA2"), lty=c(1,1,1,1), col= c(1,2,3,4))
title("AL")
Q0: Supplement your analysis of multivariate time series with variable selection, LASSO and its variants where they seem appropriate.
Q1: Explain how exactly restricted VARs are estimated.
Q2: Compare numerically the variable selection method described above for smaller dimensions with stepwise regression methods.
Q3: Look into some of the conditions (irrepresentable condition, restricted eigenvalue condition, etc.) for LASSO to work.
Q4: Explain how the coordinate descent method is used to obtain the LASSO and related estimators.
Baek, C., R. A. Davis, and V. Pipiras. 2017. “Sparse Seasonal and Periodic Vector Autoregressive Modeling.” Computational Statistics & Data Analysis 106: 103–26. doi:10.1016/j.csda.2016.09.005.
Davis, R. A., P. Zang, and T. Zheng. 2016. “Sparse Vector Autoregressive Modeling.” Journal of Computational and Graphical Statistics 25 (4): 1077–96. doi:10.1080/10618600.2015.1092978.
Dugas, A. F., M. Jalalpour, Y. Gel, S. Levin, F. Torcaso, T. Igusa, and R. E. Rothman. 2013. “Influenza Forecasting with Google Flu Trends.” PloS One 8 (2): e56176.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. 2004. “Least Angle Regression.” The Annals of Statistics 32 (2): 407–99.
Ginsberg, J., M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, and L. Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457: 1012–4. http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html.
Hamilton, J. D. 1994. Time Series Analysis. Princeton University Press, Princeton, NJ.
Lazer, D., R. Kennedy, G. King, and A. Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March).
Lütkepohl, H. 2005. New Introduction to Multiple Time Series Analysis. Springer-Verlag, Berlin.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B. Methodological 58 (1): 267–88.
Wasserman, L. 2004. All of Statistics. Springer Texts in Statistics. Springer-Verlag, New York. doi:10.1007/978-0-387-21736-9.
Zheng, X., and W.-Y. Loh. 1995. “Consistent Variable Selection in Linear Models.” Journal of the American Statistical Association 90 (429): 151–56.
Zou, H. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101 (476): 1418–29.