Abstract:
Methods able to identify and estimate a condensed latent structure that summarizes a large set of variables with few "factors" attracted attention early on in Statistics (Hotelling, 1933 [40]). The Economics and Econometrics literature caught on with investigations ranging from the estimation of the underlying factors that drive the economy, as in Geweke (1977) [34] and Sargent and Sims (1977) [51], to the description of asset prices, as in Chamberlain and Rothschild (1983) [16], to applications in labor markets, as in Engle and Watson (1981) [31]. The common thread in this body of work is the use of factor models, and the focus is on identifying the common latent sources of correlation of a given set of variables without explicit reference to a target variable.
In part due to the availability of richer datasets and also to the seminal work of Stock and Watson (2002a) [54], factor models in the form of Dynamic Factor Models (DFM) have resurfaced in the Econometrics literature in the past 15 years and are now a standard tool for both measuring comovement and forecasting time series. In contrast to the early applications of factor models and their use to measure comovement, a distinctive feature of applying DFM in forecasting is the inherent targeted nature of the process. Departing from being simply a tool to identify and estimate a latent structure in multivariate data, DFMs aim to 1) reduce the dimension of a large panel of data to a sufficiently "lean" factor structure and 2) exploit such structure to forecast a target variable y.
Targeting enters the picture only after a condensed latent structure has been estimated, and it is accomplished by postulating a linear relationship between the target variable y and the factors. The reduction step has so far been largely disconnected from the targeting step, likely a legacy of the origin of factor models.
Sufficient Dimension Reduction (SDR), a parallel, yet so far unrelated, collection of novel tools for reducing the dimension of multivariate data in regression problems without losing inferential information on the distribution of a target variable y, has emerged in Statistics over the last twenty-five years. SDR focuses on finding sufficient (in a statistical sense) reductions of a potentially large set of explanatory variables with the aim of modeling a given target response y. In SDR, the reduction and the targeting are obtained simultaneously by exploiting the concept of statistical sufficiency. SDR aims to identify and estimate a sufficient function of the regressors, $$ {\mathbf{R}}(\mathbf{x})$$ , that preserves the information in the conditional distribution of $$ y\vert\mathbf{x}$$ . SDR also offers a powerful toolbox to analyze the link between the target y and the panel of regressors $$ \mathbf{x}$$ , in contrast to the potentially unjustified assumption that y depends linearly on some factors, as in the DFM literature.
Since SDR methods preserve the information in the conditional distribution of $$ y\vert\mathbf{x}$$ , they should prove superior to current practice in producing density forecasts, although we do not explore this aspect in this paper.
SDR is not the only approach that allows for simultaneous reduction and targeting, and the potential gains from linking the two modeling steps have already been acknowledged in the Econometrics literature. Bai and Ng (2008) [4] and De Mol, Giannone and Reichlin (2008) [25] explore the effectiveness of RIDGE regression, a penalty-based regression, and other shrinkage estimators including the LASSO, and execute their papers on the canvas laid out by Stock and Watson (2002a and 2002b) [54] [55]. In more recent contributions, Kelly and Pruitt (2014) [43] and Groen and Kapetanios (2014) [35] explore variants of partial least squares (PLS) in order to obtain simultaneous reduction and targeting. In contrast to the innovative (in the Econometrics literature) statistical learning tools explored by these authors, SDR methods ensure the preservation of all the statistical information contained in the data as encapsulated in the conditional distribution of $$ y\vert\mathbf{x}$$ . SDR methods also allow the forecaster to selectively preserve targeted statistical information regarding the conditional distribution of $$ y\vert\mathbf{x}$$ , such as the conditional mean, the conditional variance, or both, and they clearly specify the assumptions required for the proper extraction of such information. In contrast, RIDGE regression is constrained by the forecasting functional form of the model, whereas PLS builds predictors that are little understood and whose reliability ultimately depends on the particular application at hand, since it relies, quite arbitrarily, on the covariance between the response and the original predictors.
Although the SDR methodology offers a potentially powerful source of new methods for the econometrician, several hurdles need to be overcome. SDR has been developed and tested with statistical modeling in mind, and its effectiveness has not yet been proven in macro-forecasting. Most importantly, SDR has been developed for cross-sectional applications, and adaptations are necessary for a successful application in forecasting and econometrics, analogously to the contributions of Stock and Watson (2002a and 2002b) [54] [55], Bai and Ng (2003) [2] and Forni, Hallin, Lippi and Reichlin (2005) [32] that enabled the application of principal components to a time series setting.
This paper takes a first stab at introducing SDR techniques to macro-forecasting by establishing a connection between the SDR and DFM bodies of literature, by extending some key SDR results to a time series setting and by offering a first assessment of the effectiveness of SDR methods in a real-world forecasting application. In order to draw analogies and highlight differences with results in the DFM literature we conduct our real-world forecasting experiment with a large panel of macro variables as in Stock and Watson (2002a) [54] and Stock and Watson (2002b) [55], although our data source is the novel repository FRED-MD maintained by the St. Louis Fed and documented by McCracken and Ng (2015) [49]. The task of conducting extensive Monte Carlo simulations drawing from Stock and Watson (2002a) [54] and Doz et al. (2012) [28] in order to compare the performance of competing methods is deferred to a companion paper (see Barbarino and Bura (2015) [7]).
In Section 2 we list the challenges of macro-forecasting in a data-rich environment and describe the specific solution adopted in the DFM framework. We next propose an alternative forecasting framework based on SDR methods and contrast it with the DFM framework, providing a connection between the two. In Section 3 we review shrinkage estimators that have been proposed within the DFM literature and that are tested in the empirical application. Section 4 contains the conceptual motivation for targeted reductions. Section 5 is a detailed exposition of the principles of SDR and our proposal for an SDR-based forecasting framework, including extensions to a time-series setting of sliced inverse regression (SIR), the SDR method we choose to present and apply in the empirical Section. Finally, as in Stock and Watson (2002a and 2002b) [54] [55], Section 6 contains the description and results of a horserace between the estimators reviewed in the paper on a large panel of macro variables, in which the focus is on forecasting accuracy in predicting inflation and industrial production in an out-of-sample forecasting experiment. We find that SIR has performance similar or superior to PCR and always superior to PLS. The last Section concludes. Coverage of likelihood-based SDR methods is outside the scope of this exploratory paper and deferred to future work.
A large set of p explanatory variables $$ \underset{\left( p\times 1\right) }{{\mathbf{x}}_{t}}$$ is available to forecast a single variable $$ \underset{\left( 1\times 1\right) }{y_{t}}$$ using a sample of size T. The most immediate approach to the problem, as described, for instance, in the survey of Stock and Watson (2006) [57], is to consider all regressors to be potentially useful in modeling yt and use OLS to estimate the model
$$\displaystyle y_{t+h}=\mathbf{\beta }^{\prime }{\mathbf{x}}_{t}+\boldsymbol{\gamma }^{\prime }\mathbf{w}_{t}+\varepsilon _{t+h}$$ | (2.1) |
Estimation in (2.1) via OLS can be problematic when p is large relative to T, or when variables in $$ {\mathbf{x}}_{t}$$ are nearly collinear, as is the case in the macro forecasting literature (see, e.g., Stock and Watson (2006) [57]). The variance of the prediction is of order $$ p/T,$$ so when p is large other estimation methods may dominate OLS, even under assumptions that guarantee that the OLS estimator is unbiased. Moreover, when $$ p\geq 3$$ , the OLS estimator is not even admissible under the mean square error (MSE) criterion, a striking result by James and Stein (1961) that inaugurated research in shrinkage estimation.
When the number of predictors exceeds the number of observations ($$ p>T$$ ), the OLS parameter estimates are not identifiable and multiple estimators of the parameter vector are solutions to the OLS problem. In this case, different shrinkage estimators, such as RIDGE and LASSO, can be viewed as specific criteria-based choices in the space of OLS-consistent solutions.
Furthermore, for typical datasets encountered in macro and finance forecasting, even if $$ p\ll T$$ , collinearity or ill-conditioning of the $$ \underset{\left( T\times p\right) }{\mathbf{X}}$$ matrix, which collects all the observations on the vector $$ {\mathbf{x}}_{t}$$ , makes OLS predictions very unstable. This is likely to occur if the set of explanatory variables contains an index and several sub-indexes, e.g. industrial production (IP) along with sub-indexes such as manufacturing IP or mining IP, or when variables are linked by identities or tight relationships, such as the inclusion of assets linked by arbitrage conditions.
Pioneering results by Stock and Watson in a series of papers (1999, 2002a, 2002b) [53] [54] [55] re-launched the idea of working around the impossibility and/or lack of desirability of using OLS for estimation when $$ p $$ is large by assuming that the large panel $$ {\mathbf{x}}_{t}$$ admits a linear factor structure (2.2), in which each predictor is driven by a small number r of common latent factors $$ {\mathbf{f}}_{t}$$ plus an idiosyncratic error, and by forecasting the target through the link equation
$$\displaystyle y_{t+h}=\boldsymbol{\lambda }^{\prime }\mathbf{f}_{t}+\boldsymbol{\gamma }^{\prime }\mathbf{w}_{t}+\varepsilon _{t+h}$$ | (2.3) |
The primary goal of considering the factors $$ {\mathbf{f}}_{t}$$ is to introduce structure that reduces the high dimensionality of the problem. The linear factor structure implies that although p is potentially large, the information content of the regressors is drastically reduced to r. However, the factors are latent variables and hence need to be estimated. The complexity induced by the high dimensionality of the problem is traded off against the need to estimate additional unobserved variables $$ {\mathbf{f}}_{t}$$ . Stock and Watson [54] provide conditions under which the factors can be consistently estimated via Principal Component Analysis (PCA) of the variance-covariance matrix of the observed predictors.
In fact, most estimators of the factors proposed in the literature turn out to be linear functions of the explanatory variables, $$ {\mathbf{v}}^{\prime }{\mathbf{x}}_{t}$$ , where the matrix $$ {\mathbf{v}}$$ of coefficients or weights depends on the particular method, e.g. PCR, RIDGE or PLS. The rank of the column space of $$ {\mathbf{v}}$$ is the dimension of the problem detected by that particular estimation method. In this sense there is no unique dimension of the information contained in the explanatory variables. Rather, the dimension of the problem is estimator-dependent.
When the factors are estimated via PCA, as in Stock and Watson (2002a) [54], $$ {\mathbf{v}}$$ is a matrix whose columns are the leading principal components of $$ \boldsymbol{\Sigma }=({\mathbf{X}}_{T}-\bar{{\mathbf{X}}}_{T})^{\prime }({\mathbf{X}}_{T}-\bar{{\mathbf{X}}}_{T})/T$$ . If the dynamic principal components of Forni, Hallin, Lippi and Reichlin (2005) [32] are used, $$ {\mathbf{v}}$$ also captures a summary of the dynamics in $$ {\mathbf{x}}_{t}$$ as it contains the eigenvectors of the autocovariances $$ \boldsymbol{\Sigma }^{\left( k\right) }=({\mathbf{X}}_{T-k}-\bar{{\mathbf{X}}}_{T-k})^{\prime }({\mathbf{X}}_{T}-\bar{\mathbf{X}}_{T})/(T-k)$$ . In (quasi-) maximum likelihood methods, as in Doz et al. (2012) [28] or Banbura and Modugno (2014) [6], the algorithm used to compute the likelihood heavily exploits the linearity of the underlying system and $$ \mathbf{v}$$ is derived from the Kalman Filter. A main theoretical objective in the DFM literature has been to show that asymptotically, as both T and p $$ \rightarrow \infty ,$$ the chosen $$ \mathbf{v}$$ is such that $$ \mathbf{v}^{\prime }{\mathbf{x}}_{t}$$ consistently estimates the factors.
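As an illustration of this reduction step, the following sketch (a minimal Python example; the function name and interface are ours and not taken from any of the cited papers) extracts r static factor estimates as the leading principal components of the sample covariance of a transformed predictor panel.

```python
import numpy as np

def pca_factors(X, r):
    """Estimate r static factors by PCA of the sample covariance of X.

    X : (T, p) array of (already transformed/standardized) predictors.
    Returns the (T, r) factor estimates and the (p, r) weight matrix v.
    """
    Xc = X - X.mean(axis=0)             # demean each predictor
    Sigma = Xc.T @ Xc / Xc.shape[0]     # sample covariance, p x p
    eigval, eigvec = np.linalg.eigh(Sigma)
    v = eigvec[:, np.argsort(eigval)[::-1][:r]]   # leading r eigenvectors
    return Xc @ v, v                    # factors f_t = v' x_t and weights v
```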
The forecasting equation (2.3) is secondary since attention is shifted to reducing the dimension of the set of explanatory variables, a process assumed to unveil the data generating process driving $$ {\mathbf{x}}_{t}$$ . The practical goal of forecasting the target yt via link equation (2.3) relies on the assumption that the same factors that determine the marginal distribution of $$ {\mathbf{x}}_{t}$$ also determine the conditional distribution of yt and in the same functional form; that is, linearly.
The assumption of a factor structure and the ancillary role of the forecasting equation in Stock and Watson (2002a) [54] were adopted in the ensuing literature. For instance, Doz et al. (2012) [28] focus on the performance of different estimators in identifying the factors and show no interest in the forecasting accuracy of their estimators.
The assumption that the underlying DGP has a linear factor structure, while convenient, imposes restrictions on the conditional distribution of y given $$ {\mathbf{x}}$$ that are difficult to pin down. In a follow-up to this paper (Barbarino and Bura (2015) [7]), we show, by means of Monte Carlo simulations and formally by exploiting recent results in Leeb (2013) [44] and, more importantly, the extension in Steinberger and Leeb (2015) [52], that a linear factor structure underlying the generation of both y and $$ \mathbf{x}$$ , coupled with the assumption of joint normality of the factors, implies that a linear model is the correct specification for the conditional mean of y given $$ \mathbf{x}$$ , which is a rather restrictive model.
The formal result is stated in the following proposition.
Then, y can be decomposed into the sum of a linear function of $$ \mathbf{x}$$ and a remainder or error term, as follows,
A key point to note is that conditions (i) and (ii) in Proposition 1 are satisfied for any $$ \boldsymbol{\beta }$$ if $$ \mathbf{f}$$ is normally distributed, the assumption made in practice in the DFM literature, in likelihood-based as well as non-parametric approaches. Moreover, the implied model (2.4) is a standard linear model with the error term uncorrelated with the predictors, so that the solution to the normal equations in the population is the OLS coefficient $$ \mathbf{c}_{OLS}=\left[ \mathrm{E}(\mathbf{x}\mathbf{x}^{\prime })\right] ^{-1}\mathrm{E}(\mathbf{x}y)$$ . For a random sample, regressing $$ \underset{\left( T\times 1\right) }{\mathbf{y}}$$ on $$ \underset{\left( T\times p\right) }{\mathbf{X}}$$ using OLS yields $$ \hat{\mathbf{c}}_{OLS}=(\mathbf{X}^{\prime }\mathbf{X})^{-1}\mathbf{X}^{\prime }\mathbf{y},$$ which is optimal, in terms of statistical efficiency, provided the sample size is larger than the number of predictors, i.e. $$ p<T$$ . Therefore the DFM model assumptions (2.2) and (2.3) imply that the forecasting model is approximately linear in the explanatory variables with errors uncorrelated with the predictors. As a consequence, it should not come as a surprise that shrinkage methods such as RIDGE perform well in our simulations. Although the result in Steinberger and Leeb (2015) [52] is valid when $$ (y_{t},\mathbf{x}_{t})$$ is a random sample for $$ t=1,\ldots ,T$$ , on the basis of our simulations we conjecture that the results in Proposition 1 are approximately true in a factor model, under the set of assumptions commonly made in the DFM literature, where both the response and the predictors are potentially autocorrelated time series.
In contrast to DFM, sufficient dimension reduction (SDR) methods depart from the pervasive linear factor assumption. As we show in greater detail in a companion paper (Barbarino and Bura (2015) [7]) and summarize in the following paragraph, SDR methods thrive when the relationship between the response and the predictors contains non-linearities, whereas they do not have a comparative advantage when a linear factor structure is the true DGP.
SDR works directly with observables and sidesteps the assumption of a factor structure. Thus, instead of imposing an artificial latent factor structure on the panel $$ {\mathbf{x}}_{t}$$ , SDR methods directly seek to identify how many and which functions of the explanatory variables are needed to fully describe the conditional distribution function $$ F\left( y_{t+h}\vert{\mathbf{x}}_{t}\right) $$ , or its features, such as the conditional mean. SDR aims to identify and estimate functions of the predictors, $$ {\mathbf{R}}\left( {\mathbf{x}}_{t}\right) ,$$ that are called reductions because they preserve all the information that $$ {\mathbf{x}}_{t}$$ carries about $$ y_{t+h}$$ in the sense that $$ F\left( y_{t+h}\vert{\mathbf{x}}_{t}\right) =F\left( y_{t+h}\vert{\mathbf{R}}({\mathbf{x}}_{t})\right) $$ . Obviously, such functions represent proper reductions only if they are fewer than p.
The reductions can be either linear or nonlinear functions. In this paper we focus on moment-based and linear SDR methods that obtain linear reductions, in order to draw a more pertinent comparison with the DFM literature. Linear SDR methods lay down conditions under which it is possible to identify the number and the coefficients of the linear combinations of the explanatory variables needed to "adequately" model yt. They also provide estimation algorithms. More formally, linear SDR methods provide the means to estimate a matrix $$ {\mathbf{v}}:p\times d$$ , $$ 0\leq d\leq p$$ , such that $$ {\mathbf{R}}\left( {\mathbf{x}}_{t}\right) ={\mathbf{v}}^{\prime }{\mathbf{x}}_{t}$$ .
Our proposed SDR-based forecasting framework is based on the following two conditions:
$$\displaystyle E\left[ {\mathbf{x}}_{t}\vert{\mathbf{v}}^{\prime }{\mathbf{x}}_{t}\right] ={\mathbf{A}}\mathbf{v}^{\prime }{\mathbf{x}}_{t}$$ | (2.5) |
Moment-based SDR methods place conditions on the marginal distribution of $$ {\mathbf{x}}_{t}$$ , such as the linearity assumption (2.5), which is a first moment condition and analogous to the linear factor structure (2.2) of DFM. However, in contrast with the DFM setup, no dependence on underlying factors is postulated.
The second equation (2.6) specifies the forward model and is analogous to the link equation in DFM, although the SDR framework allows more flexibility by admitting a general $$ g(\cdot )$$ instead of a linear function. Linear SDR methods are powerful tools that can determine the number of linear combinations of the explanatory variables $$ {\mathbf{x}}_{t}$$ needed to model the response yt and provide consistent estimators without the need to specify the functional form of the forward model; that is, without specifying the exact relationship between yt and $$ {\mathbf{v}}^{\prime }{\mathbf{x}}_{t}$$ . Linear SDR replaces a large number of explanatory variables by a few of their linear combinations without loss of inferential information; their number $$ d(=\mathrm{rank}({\mathbf{v}}))$$ indicates the dimension of the regression problem. In our experience, fewer SDR directions than principal components are needed in order to generate a comparable MSFE, which is expected as the SDR estimator is targeted to y. As a result, the forecaster can concentrate on the estimation of $$ g(\cdot )$$ , with the option of also using non-parametric regression since the number of predictors is significantly reduced.
How restrictive is the linearity condition? - The linearity condition is satisfied for any $$ \mathbf{v}$$ by any elliptically contoured vector of predictors. Importantly, Hall and Li (1993) [36] showed that, as the cross-section gets large and $$ p\rightarrow \infty $$ , the condition is ever more likely to be satisfied provided that the dimension of the problem d remains small relative to p. More recently, Steinberger and Leeb (2015) [52] showed that as $$ d/\log p$$ becomes smaller, the discrepancy between $$ E\left[ {\mathbf{x}}_{t}\vert{\mathbf{v}}^{\prime }{\mathbf{x}}_{t}\right] $$ and $$ {\mathbf{A}}\mathbf{v}^{\prime }{\mathbf{x}}_{t}$$ in (2.5) goes to zero and the linearity condition holds approximately.
Comparison of SDR and DFM Assumptions - As outlined in Proposition 1, the assumption of a linear factor structure along with normality implies that a linear model is the correctly specified model. By contrast, moment-based SDR only requires that the linearity condition (2.5) hold for the linear projections $$ \mathbf{v}^{\prime }{\mathbf{x}}$$ that satisfy the general forward regression model $$ F(y\vert\mathbf{x})=F(y\vert\mathbf{v}^{\prime }{\mathbf{x}})$$ . Therefore, DFM assumes more restrictive conditions than SDR on the marginal distribution of $$ {\mathbf{x}}$$ .
Dimension of the Regression - The SDR framework allows for more general models to describe the relationship between y and $$ \mathbf{x}$$ . As a consequence, for example, the finding that on the same dataset four PCA-estimated factors produce the same MSFE as two SDR predictors is not contradictory. The PC-based DFM framework ignores non-linearities in the DGP, so that a larger number of factors is required to approximate non-linearities in an incorrectly specified forward regression. This is analogous to the effect of dynamic misspecification shown in Bai and Ng (2007) [3], where one needs a larger set of static factors in order to approximate a given set of (true) dynamic factors. In their setting, a larger set of static factors approximates a polynomial in the lag operator. In our setting, a larger number of static factors is needed to approximate non-linearities, which are captured by SDR methods.
Robustness to Model Misspecification - SDR techniques frequently yield very few derived predictors, which allows non-parametric estimation of the relationship between the response yt and the (linear combinations of the) explanatory variables $$ {\mathbf{x}}_{t}$$ . As a result, the prohibitive curse of dimensionality problem in non-parametric regression is circumvented.
Issues in Large p Problems - SDR techniques are data intensive relative to shrinkage estimators such as Principal Components (PC), RIDGE or Partial Least Squares (PLS). As we will see, when the number of predictors p is larger than the sample size, direct application of SIR is not possible since $$ \mathrm{rank}(\boldsymbol{\Sigma })\leq \min (T-1,p)$$ and $$ \boldsymbol{\Sigma }$$ is not invertible. Moreover, when T and p are of the same order, or when the columns of $$ {{\mathbf{X}}}$$ are highly correlated, $$ \boldsymbol{\Sigma }$$ is ill-conditioned and its inverse is numerically unstable, resulting in non-robust estimation. In this sense SDR methods suffer from the same limitations as OLS. Therefore we also evaluate a regularized version of our estimators, following Chiaromonte and Martinelli (2002) [22] and Li and Li (2004) [46], who used principal components as an intermediate step in order to eliminate PC directions in which the random vector $$ {{\mathbf{x}}}$$ is degenerate. Bernard-Michel et al. (2011) show that preprocessing the data with PC in order to eliminate degenerate projections of $$ {{\mathbf{x}}}$$ and then applying SIR is a special case of their Gaussian Regularized SIR, where a Gaussian prior is introduced on the unknown parameters of the inverse regression. In a companion paper (Barbarino and Bura (2015) [7]), we show that even though regularization can be a useful approach in large p settings, a more appealing work-around to the ill-conditioning or non-invertibility of $$ \boldsymbol{\Sigma }$$ is the extension of SDR methods using Krylov subspaces.
Difficulties in Estimating the Forward Model - SDR approaches obtain optimal results in an interactive modeling setting. That is, the sequence of modeling steps in SDR is to (1) reduce and estimate the $$ d(<p)$$ SDR-derived predictors, and (2) obtain visual assessments via scatterplots of the response versus the SDR predictors to form a forward regression model $$ g(\cdot )$$ that best describes the data at hand. This process cannot be carried out in an automatic fashion, so $$ g(\cdot )$$ cannot be recursively re-estimated prior to the computation of each out-of-sample forecast. Instead, we simply use the linear SDR predictors, which are linear combinations of $$ {\mathbf{x}}_{t},$$ with no further transformations as input to a linear forward model for the response. As a result, when the estimated dimension of the problem turns out to be greater than one, the forecasting horserace penalizes the SDR estimator through misspecification of $$ g(\cdot )$$ . Such difficulty is not faced by the applied forecaster, who only needs the SDR predictors for a one-shot forecast (even when repeated over time), in which case the presence of non-linearities can be easily evaluated and the appropriate modeling of $$ g(\cdot )$$ decided upon.
Several estimators in the literature form linear combinations of the explanatory variables $$ {\mathbf{v}}^{\prime }{\mathbf{x}}_{t}$$ as a data reduction step before fitting the model used for prediction. In this Section, we provide a brief review of some such widely used estimators that are also used in the empirical Section. We start with OLS, move on to principal component regression (PCR), which is the prevalent method in dynamic factor analysis, and RIDGE regression. The last leg of our tour provides the fundamentals of the partial least squares (PLS) algorithm. We cast these methods in a shared optimization framework, in which the objective function being minimized is what distinguishes individual methods, and discuss how the resulting estimators exploit the eigen-structure of the data matrix $$ \mathbf{X}$$ . A more in-depth treatment under a common unifying framework is given in Barbarino and Bura (2015) [7]. To motivate the discussion of the features and relative drawbacks and advantages of the different methods, we start off from a simple data generating model, the multivariate normal distribution.
The Normal Data Generating Process - The simplest DGP that implies 1-dimensional linear reduction is a Normal DGP where the predictors and the response are jointly normal:
$$\displaystyle \left( \begin{array}{c} \mathbf{x} \\ y\end{array}\right) \sim N\left( \left( \begin{array}{c} \boldsymbol{\mu }_{\mathbf{x}} \\ \mu _{y}\end{array}\right) ,\left( \begin{array}{cc} \mathbf{\Sigma } & \boldsymbol{\sigma }_{xy} \\ \boldsymbol{\sigma }_{xy}^{\prime } & \sigma _{y}^{2}\end{array}\right) \right)$$ |
$$\displaystyle E\left( y\vert\mathbf{x}\right) =\mu _{y}+\boldsymbol{\beta }^{\prime }\left( \mathbf{x}-\mu _{\mathbf{x}}\right)$$ |
We consider various competing ways of estimating the population parameter $$ \boldsymbol{\beta }$$ next.
Ordinary Least Squares - The OLS coefficient is the solution to the following maximization problem:
$$\displaystyle \max_{\left\{ \beta \right\} }Corr^{2}\left( y,\mathbf{x}^{\prime }\boldsymbol{\beta }\right)$$ | (3.1) |
$$\displaystyle y_{OLS}=\mathbf{x}_{t_{0}}^{\prime }\boldsymbol{\beta }_{OLS}$$ |
Principal Component Regression (PCR) - PCR operates in two steps. First, the linear combinations that maximize the variance of $$ \mathbf{x}$$ and that are mutually orthogonal are the solution to the following maximization problem
$$\displaystyle \max_{\substack{ \left\{ \boldsymbol{\beta }_{k}\right\} \\ \boldsymbol{\beta }_{k}^{\prime }\boldsymbol{\beta }_{k}=1 \\ \left\{ \boldsymbol{\beta }_{k}^{\prime }\boldsymbol{\Sigma }\boldsymbol{\beta }_{i}=0\right\} _{i=1}^{k-1}}}Var(\mathbf{x}^{\prime }\boldsymbol{\beta }_{k})$$ | (3.3) |
$$\displaystyle \boldsymbol{\beta }_{PCR}\left( M\right) =\mathbf{\Sigma }_{PCR}^{-}(M)\mathbf{\sigma }_{xy}$$ | (3.4) |
$$\displaystyle y_{PCR}=\mathbf{x}_{t_{0}}^{\prime }\boldsymbol{\beta }_{PCR}$$ |
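As a sketch of the estimator in (3.4) (our own minimal Python illustration with a hypothetical function name, not code from any published implementation):

```python
import numpy as np

def pcr_beta(X, y, M):
    """PCR coefficient: beta_PCR(M) = Sigma^-(M) sigma_xy, where Sigma^-(M)
    is the rank-M spectral pseudo-inverse of the sample covariance of x."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    T = X.shape[0]
    Sigma = Xc.T @ Xc / T                    # sample var(x)
    sigma_xy = Xc.T @ yc / T                 # sample cov(x, y)
    eigval, eigvec = np.linalg.eigh(Sigma)
    keep = np.argsort(eigval)[::-1][:M]      # leading M principal directions
    V, lam = eigvec[:, keep], eigval[keep]
    Sigma_pinv_M = V @ np.diag(1.0 / lam) @ V.T
    return Sigma_pinv_M @ sigma_xy           # beta used in y_PCR = x' beta
```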
RIDGE Regression - RIDGE regression has been reviewed in the macro-forecasting literature by De Mol et al. (2008) [25] in connection with Bayesian regression. The RIDGE estimator picks one and only one linear combination of the data, as it can be shown that RIDGE regression minimizes the least squares criterion subject to a bound a on the squared norm of the coefficient vector:
$$\displaystyle \min_{\left\{ \beta \right\} }\mathrm{E}\left( y-\mathbf{x}^{\prime }\boldsymbol{\beta }\right) ^{2}$$ subject to$$\displaystyle \ \sum_{j=1}^{p}\beta _{j}^{2}\leq a$$ | (3.5) |
$$\displaystyle \boldsymbol{\beta }_{RR}\left( \lambda \right) =\left( \boldsymbol{\Sigma }+\lambda \mathbf{I}\right) ^{-1}\mathbf{\sigma }_{xy}$$ | (3.6) |
$$\displaystyle \boldsymbol{\beta }_{RR}\left( \lambda \right) =\left( \mathbf{I}+\lambda \boldsymbol{\Sigma }^{-1}\right) ^{-1}\boldsymbol{\Sigma }^{-1}\mathbf{\sigma }_{xy}=\left( \mathbf{I}+\lambda \boldsymbol{\Sigma }^{-1}\right) ^{-1}\boldsymbol{\beta }_{OLS}$$ |
$$\displaystyle y_{RR}=\mathbf{x}_{t_{0}}^{\prime }\boldsymbol{\beta }_{RR}\left( \lambda \right)$$ |
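A corresponding sketch of the RIDGE coefficient in (3.6), again a minimal Python illustration rather than the glmnet implementation used later in the paper:

```python
import numpy as np

def ridge_beta(X, y, lam):
    """RIDGE coefficient: beta_RR(lambda) = (Sigma + lambda I)^{-1} sigma_xy."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    T, p = X.shape
    Sigma = Xc.T @ Xc / T
    sigma_xy = Xc.T @ yc / T
    return np.linalg.solve(Sigma + lam * np.eye(p), sigma_xy)
```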
Partial Least Squares - PLS, an increasingly popular method of dimension reduction, followed a peculiar trajectory in Econometrics. Originally developed by H. Wold [63] in the mid-1970s, it did not gain much traction in Econometrics and swiftly fell into oblivion. By contrast, it garnered a lot of attention in Chemometrics, a field that produced a large volume of PLS studies in the late 80s and early 90s (see, for example, the instructive work of Helland (1988) [38]). Only recently has the method resurfaced in Econometrics with the work of Kelly and Pruitt (2014) [43] and Groen and Kapetanios (2014) [35] within macro-forecasting applications, as PLS handles well cases in which $$ p>T$$ .
The PLS algorithm induces a simultaneous bi-linear decomposition of both the target variable and the panel of regressors. That is, factors $$ \underset{\left( T\times 1\right) }{\mathbf{f}_{i}}$$ and loadings $$ \underset{\left( p\times 1\right) }{\mathbf{q}_{i}}$$ and $$ \underset{\left( 1\times 1\right) }{p_{i}}$$ are generated at each step so that the factors are orthogonal and the following decompositions are carried out concurrently
$$\displaystyle \mathbf{X}=\mathbf{f}_{1}\mathbf{q}_{1}^{\prime }+\mathbf{f}_{2}\mathbf{q}_{2}^{\prime }+\ldots +\mathbf{f}_{u}\mathbf{q}_{u}^{\prime }+\mathbf{E}_{u}$$
$$\displaystyle \mathbf{y}=p_{1}\mathbf{f}_{1}+p_{2}\mathbf{f}_{2}+\ldots +p_{u}\mathbf{f}_{u}+\mathbf{e}_{u}$$
The subscript u denotes the last step of the procedure. The algorithm always converges in the sense that after at most p steps the factors become identically zero. Using the recursive formulas for the factors and the loadings, one can show that the PLS prediction at $$ {\mathbf{x}}_{t_{0}}$$ admits a linear form similar to the predictions of the other estimators:
$$\displaystyle y_{PLS}=\mathbf{x}_{t_{0}}^{\prime }\boldsymbol{\beta }_{PLS}\left( u\right)$$ |
$$\displaystyle \boldsymbol{\beta }_{PLS}\left( u\right) =\mathbf{W}_{u}\left( \mathbf{W}_{u}^{\prime }\boldsymbol{\Sigma }\mathbf{W}_{u}\right) ^{-1}\mathbf{W}_{u}^{\prime }\boldsymbol{\sigma }_{xy}$$ | (3.7) |
$$\displaystyle \mathbf{w}_{u}=\boldsymbol{\sigma }_{xy}-\boldsymbol{\Sigma }\mathbf{W}_{u-1}\left( \mathbf{W}_{u-1}^{\prime }\boldsymbol{\Sigma }\mathbf{W}_{u-1}\right) ^{-1}\mathbf{W}_{u-1}^{\prime }\boldsymbol{\sigma }_{xy}$$
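The following sketch (our own minimal Python illustration of the weight recursion above and of (3.7), not a reference PLS implementation) builds the weight matrix column by column and returns the implied coefficient:

```python
import numpy as np

def pls_beta(X, y, u):
    """PLS coefficient via the weight recursion and beta_PLS(u) in (3.7)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    T = X.shape[0]
    Sigma = Xc.T @ Xc / T
    sigma_xy = Xc.T @ yc / T
    W = sigma_xy[:, None]                      # first weight w_1 = sigma_xy
    for _ in range(1, u):
        proj = np.linalg.solve(W.T @ Sigma @ W, W.T @ sigma_xy)
        w_next = sigma_xy - Sigma @ W @ proj   # part of sigma_xy not yet captured
        W = np.column_stack([W, w_next])
    return W @ np.linalg.solve(W.T @ Sigma @ W, W.T @ sigma_xy)
```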
In order to set the stage for the sequel, we now focus on the different targets of the eigen-decompositions the methods in Section 3 entail.
Eigen-Decompositions Underpinning PCR and PLS - PCR targets the extraction of derived inputs that maximize the variance of the explanatory variables. The PCs are left eigenvectors from the eigen-decomposition of $$ \mathrm{var}\left( {\mathbf{x}}\right) $$ . PLS has (initial) target $$ \mathrm{cov}\left( \mathbf{x},y\right) $$ , which reveals its targeted nature and that, by focusing on the principal directions of $$ \mathrm{cov}\left( \mathbf{x},y\right) $$ , it is hard-wired to extract linear signals.
Eigen-Decomposition Underpinning SDR - SDR methods that will be introduced in the next Section carry out an eigen-decomposition of a target, called the seed or kernel, in order to isolate directions of principal variation in relevance to y.
The SDR method used in the empirical applications, sliced inverse regression (SIR), is based on the eigen-decomposition of $$ \mathrm{var}\left[ E\left( {\mathbf{x}}\vert y\right) \right] $$ . To gain intuition about why this target works, we resort to a classical probabilistic identity satisfied by any random vector $$ {\mathbf{x}}$$ with finite second moment and conditioning random variable or vector y,
$$\displaystyle \underset{\text{identified by PCA}}{\underbrace{\mathrm{var}\left( {\mathbf{x}}\right) }}=\underset{\text{identified by linear SDR}}{\underbrace{\mathrm{var}\left[ \mathrm{E}\left( {\mathbf{x}}\vert y\right) \right] }}+\underset{\text{noise}}{\underbrace{\mathrm{E}\left[ \mathrm{var}({\mathbf{x}}\vert y)\right] }}$$ | (4.1) |
In ANOVA terms, the first summand is the signal that y carries about $$ {\mathbf{x}}$$ , since it represents the variation of the average value of $$ \mathbf{x}$$ , across different values of y, around the overall $$ \mathbf{x}$$ mean. The second summand represents noise, i.e. the average within-bin variation of $$ \mathbf{x}$$ around its conditional mean, which is unrelated to y.
From this perspective, since PCA performs an eigenanalysis of $$ \mathrm{var}\left( {\mathbf{x}}\right) $$ , noise in $$ \mathrm{E}\left[ \mathrm{var}({\mathbf{x}}\vert y)\right] $$ may attenuate or suppress the signal in $$ \mathrm{var}\left[ \mathrm{E}\left( {\mathbf{x}}\vert y\right) \right] $$ and result in PCs that are little related to y. PLS targets $$ \mathrm{cov}\left( \mathbf{x},y\right) $$ and can potentially suppress non-linear signal. By contrast, a method that focuses on the eigen-analysis of $$ \mathrm{var}\left[ E\left( {\mathbf{x}}\vert y\right) \right] $$ produces derived inputs ordered according to their importance with respect to y and has the capacity to preserve non-linear signals. As we will see next, centering on the signal and ignoring the noise is what sufficient dimension reduction in general and, in particular, sliced inverse regression is designed to do.
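A minimal numerical illustration of decomposition (4.1), using slices of y as the conditioning bins (the simulated DGP, slice count and variable names below are ours and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, J = 500, 4, 10
y = rng.normal(size=T)
X = rng.normal(size=(T, p))
X[:, 0] += np.sin(2 * y)                      # non-linear signal in one component only

Xbar = X.mean(axis=0)
total = (X - Xbar).T @ (X - Xbar) / T         # var(x)

edges = np.quantile(y, np.linspace(0, 1, J + 1))
labels = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, J - 1)
between = np.zeros((p, p))                    # analogue of var[E(x|y)] (signal)
within = np.zeros((p, p))                     # analogue of E[var(x|y)] (noise)
for j in range(J):
    Xj = X[labels == j]
    dj = Xj.mean(axis=0) - Xbar
    between += Xj.shape[0] / T * np.outer(dj, dj)
    within += (Xj - Xj.mean(axis=0)).T @ (Xj - Xj.mean(axis=0)) / T

print(np.allclose(total, between + within))   # the decomposition holds exactly
```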
$$\displaystyle F\left( y\vert{\mathbf{x}}\right) =F\left( y\vert{\mathbf{R}}\left( {\mathbf{x}}\right) \right)$$ | (4.2) |
A consequence of the definition of sufficiency is that, since (4.2) can be written as $$ F\left( y\vert{\mathbf{x}},{\mathbf{R}}\left( {\mathbf{x}}\right) \right) =F\left( y\vert{\mathbf{R}}\left( {\mathbf{x}}\right) \right) $$ we have
$$\displaystyle y {\perp}{\mathbf{x}}\vert{\mathbf{R}}$$ |
Consequently, $$ {\mathbf{R}}\left( {\mathbf{x}}\right) $$ is called a forward reduction. Although the term "sufficient" was originally coined to highlight the information preserving role of $$ \mathbf{R}\left( {\mathbf{x}}\right) $$ , it turns out that there is a specific link with the Fisherian concept of statistical sufficiency (see Cook (2007) [19]), as we will see shortly. Before doing so, we introduce the concept of inverse reduction.
$$\displaystyle {\mathbf{x}}\vert\left( {\mathbf{R}}\left( {\mathbf{x}}\right) ,y\right) \overset{d}{=}{\mathbf{x}}\vert{\mathbf{R}}\left( {\mathbf{x}}\right)$$ | (4.3) |
If one treats y as a parameter, (4.3) states that $$ {\mathbf{R}}\left( {\mathbf{x}}\right) $$ is a sufficient statistic for y and it contains all information $$ {\mathbf{x}}$$ contains about y. Thus, it is a sufficient reduction for the forward regression of y on $$ {\mathbf{x}}$$ . Proposition 2 provides the formal statement and proof of this fact.
$$\displaystyle F\left( y\vert{\mathbf{x}}\right) =F\left( y\vert{\mathbf{R}}\left( {\mathbf{x}}\right) \right)$$ iff $$\displaystyle {\mathbf{x}}\vert\left( {\mathbf{R}}\left( {\mathbf{x}}\right) ,y\right) \overset{d}{=}{\mathbf{x}}\vert{\mathbf{R}}\left( {\mathbf{x}}\right)$$ |
$$\displaystyle F\left( {\mathbf{x}}\vert{\mathbf{R}}\right) =\frac{F\left( y,{\mathbf{x}}\vert{\mathbf{R}}\right) }{F\left( y\vert{\mathbf{R}}\right) }=\frac{F\left( y,{\mathbf{x}},{\mathbf{R}}\right) }{F\left( y\vert{\mathbf{R}}\right) F\left( {\mathbf{R}}\right) }=\frac{F\left( y,{\mathbf{x}},{\mathbf{R}}\right) }{F\left( y,{\mathbf{R}}\right) }=F\left( {\mathbf{x}}\vert y,{\mathbf{R}}\right)$$ |
$$\displaystyle F(y\vert{\mathbf{x}},{\mathbf{R}})=\frac{F(y,{\mathbf{x}},{\mathbf{R}})}{F({\mathbf{x}},{\mathbf{R}})}=\frac{F(\mathbf{x}\vert y,{\mathbf{R}})F(y,{\mathbf{R}})}{F({\mathbf{x}}\vert{\mathbf{R}})F({\mathbf{R}})}$$ |
Proposition 2 sheds light on why inverse regression is a powerful tool for the identification of sufficient reductions of the predictors: if a function $$ {\mathbf{R}}\left( {\mathbf{x}}\right) $$ is a sufficient statistic for the inverse regression, it is also a sufficient reduction for the forward regression. This implies that the econometrician is free to choose the most convenient way to determine a sufficient reduction, either from the forward or the inverse regression. An immediate advantage of inverse regression is that it treats each predictor separately instead of treating the panel as a block. That is, a large p-dimensional (potentially non-linear) forward regression problem is split into p univariate regression problems, which are easily modeled if y is univariate (or has small dimension) even if p is large. Furthermore, inverse regression admits a plethora of estimation methods, including non-parametric ones, in settings where the curse of dimensionality would make modeling of the forward regression practically impossible. Therefore, the method can result in significantly more accurate estimation than a linear forward regression model.
Most importantly, inverse regression accomplishes another goal in connecting sufficient reductions with the classical concept of a sufficient statistic: the "parameter" to be estimated and predicted is the whole time-series yt.
Even though sufficient reductions need not be linear, moment-based SDR was developed under the requirement that $$ {\mathbf{R}}\left( {\mathbf{x}}_{t}\right) $$ be linear. In linear SDR, $$ {\mathbf{R}}\left( {\mathbf{x}}_{t}\right) $$ is a projection $$ \mathbf{P}_{{\mathcal{S}}}{\mathbf{x}}_{t}$$ onto a lower-dimensional subspace $$ {\mathcal{S}}$$ of $$ {\mathbb{R}}^{p}$$ that incurs no loss of information about the conditional distribution $$ F\left( y_{t+h}\vert{\mathbf{x}}_{t}\right) $$ or selected features thereof. If the mean squared error loss is used to evaluate the accuracy of the forecast, the conditional mean $$ \mathrm{E}\left( y_{t+h}\vert{\mathbf{x}}_{t}\right) $$ is of interest and the goal is to find a reduction $$ \mathbf{R}\left( {\mathbf{x}}_{t}\right) $$ such that $$ \mathrm{E}\left( y_{t+h}\vert{\mathbf{x}}_{t}\right) =\mathrm{E}\left( y_{t+h}\vert{\mathbf{R}}\left( {\mathbf{x}}_{t}\right) \right) $$ . In density forecasting, the whole conditional distribution is the target and a reduction $$ {\mathbf{R}}\left( {\mathbf{x}}_{t}\right) $$ such that $$ F\left( y_{t+h}\vert{\mathbf{x}}_{t}\right) =F\left( y_{t+h}\vert{\mathbf{R}}\left( {\mathbf{x}}_{t}\right) \right) $$ is sought. In the rest of this Section we suppress subscripts, keeping in mind that y is used in place of $$ y_{t+h}$$ and $$ {\mathbf{x}}$$ in place of $$ {\mathbf{x}}_{t}$$ .
In this Section we focus the discussion on the identification (and, peripherally, the existence and uniqueness) of linear sufficient reductions and show how to exploit inverse regression to identify them. We require at the outset that the reduction be linear:
$$\displaystyle F\left( y\vert{\mathbf{x}}\right) =F\left( y\vert{\mathbf{R}}\left( {\mathbf{x}}\right) \right)$$ | (4.4) |
$$\displaystyle \mathbf{R}\left( {\mathbf{x}}\right) =\mathbf{a}^{\prime }{\mathbf{x}}$$ |
Notice that the definition of sufficiency implies that we can only identify the subspace spanned by a linear reduction, $$ \mathrm{span}(\mathbf{a})$$ , rather than $$ \mathbf{a}$$ per se, since $$ F\left( y\vert\mathbf{a}^{\prime }{\mathbf{x}}\right) =F\left( y\vert\mathbf{b}^{\prime }\mathbf{x}\right) $$ for all matrices $$ \mathbf{a}$$ and $$ \mathbf{b}$$ such that $$ \mathrm{span}\left( \mathbf{a}\right) =\mathrm{span}\left( \mathbf{b}\right) $$ . A subspace spanned by the columns of a matrix $$ \mathbf{a}$$ with $$ F\left( y\vert{\mathbf{x}}\right) =F\left( y\vert{\mathbf{a}}^{\prime }{\mathbf{x}} \right)$$ is called a dimension reduction subspace (DRS).
Existence and Uniqueness - A linear reduction, although a trivial one, always exists, since one can always set $$ {\mathbf{R}}\left( {\mathbf{x}}\right) ={\mathbf{x}}={\mathbf{I}}_{p}{\mathbf{x}}$$ . For the same reason a DRS is not generally unique. SDR's objective is to identify a minimal reduction, that is, a DRS with minimum dimension, as well as conditions that ensure existence and uniqueness. Uniqueness and minimality are jointly guaranteed by focusing on the intersection of all DRSs; such intersection, if it is itself a DRS, is called the central subspace. The latter exists under reasonably mild conditions on the marginal distribution of $$ {\mathbf{x}}$$ , such as convexity of its support. We refer to Cook (1998) [17] for more details and henceforth restrict attention to those regressions for which a central subspace exists.
The identification of a minimal sufficient reduction or, equivalently, the identification of a basis for the central subspace requires moment conditions on the marginal distribution of the predictor vector $$ {\mathbf{x}}$$ .
$$\displaystyle \mathrm{E}\left[ {\mathbf{x}}\vert\mathbf{v}^{\prime }{\mathbf{x}}\right] ={\mathbf{A}}\mathbf{v}^{\prime }{\mathbf{x}}$$ | (4.5) |
In general, the linearity condition (4.5) on the marginal distribution of the predictors is difficult to verify as it requires knowledge of $$ {\mathbf{v}}$$ . Nevertheless, it is satisfied for all $$ \mathbf{v}\in {\mathbb{R}}^{p\times d}$$ if the predictors have an elliptically contoured distribution (see Eaton (1986) [30]). The elliptically contoured family of distributions includes the multivariate normal and Student's t distributions. Moreover, Steinberger and Leeb (2015) [52] showed that under comparatively mild conditions on the distribution of $$ {\mathbf{x}}$$ the condition (4.5) is likely to be satisfied as p grows and the cross-section becomes large. Specifically, they showed that if a random p-vector $$ \mathbf{x}$$ has a Lebesgue density, the mean of certain functions of $$ \mathbf{x}$$ is bounded, and certain moments of $$ \mathbf{x}$$ are close to what they would be in the Gaussian case [see the bounds (b1) and (b2) in Th. 2.1 of Steinberger and Leeb (2015) [52]], then the conditional mean of $$ \mathbf{x}$$ given $$ \mathbf{v}^{\prime }\mathbf{x}$$ is linear in $$ \mathbf{v}^{\prime }\mathbf{x}$$ for a $$ p\times d$$ matrix $$ \mathbf{v}$$ , as $$ p\rightarrow \infty $$ and d is either fixed or grows very slowly at the rate $$ d/\log p\rightarrow 0$$ . An appealing feature of these results is that they rely on bounds that can be estimated from data.
Steinberger and Leeb's result is of fundamental importance in SDR since it ascertains that the linearity condition (4.5) is satisfied by a large class of predictor distributions. Thus, first-moment SDR estimators, such as SIR in the ensuing Section 4.3, can be widely used to estimate basis elements of the column space of $$ \mathbf{v}$$ in the reduction $$ \mathbf{R}({\mathbf{x}}_{t})=\mathbf{v}^{\prime }\mathbf{x}_{t}$$ .
The following lemma links the linearity condition with inverse regression and points to a means to find the reduction.
$$\displaystyle \mathbf{\Sigma }^{-1}\left[ \mathrm{E}\left( {\mathbf{x}}\vert y\right) -\mathrm{E}\left( {\mathbf{x}}\right) \right] \in \mathrm{span}\left( \mathbf{v}\right)$$ |
$$\displaystyle \mathrm{span}\left( \boldsymbol{\Sigma }^{-1}\left[ \mathrm{E}\left( {\mathbf{x}}\vert y\right) -\mathrm{E}\left( {\mathbf{x}}\right) \right] \right) \subseteq \mathrm{span}\left( \mathbf{v}\right)$$ |
Lemma 1 establishes that the centered and scaled inverse regression function "lives" in a subspace, the inverse regression subspace, spanned by the columns of $$ \mathbf{v}$$ . That is, as y varies in $$ \mathbb{R}$$ , the random vector $$ \boldsymbol{\Sigma }^{-1}\left[ \mathrm{E}\left( {\mathbf{x}}\vert y\right) -\mathrm{E}\left( {\mathbf{x}}\right) \right] $$ is contained in a subspace that is spanned by the columns of $$ \mathbf{v}$$ . Therefore, in order to identify the sufficient reduction we need to identify a basis that generates the subspace containing $$ \boldsymbol{\Sigma }^{-1}\left[ \mathrm{E}\left( {\mathbf{x}}\vert y\right) -\mathrm{E}\left( {\mathbf{x}}\right) \right] $$ as y varies. The following proposition provides the answer.
$$\displaystyle \mathrm{span}\left( \boldsymbol{\Sigma }^{-1}\mathrm{Var}\left( \mathrm{E}({\mathbf{x}}\vert y)\right) \right) =\mathrm{span}\left( \boldsymbol{\Sigma }^{-1}\left[ \mathrm{E}({\mathbf{x}}\vert y)-\mathrm{E}({\mathbf{x}})\right] \right) \subseteq \mathrm{span}\left( \mathbf{v}\right)$$
Lemma 1 and Proposition 3 draw a link between the distribution of the data and the subspace that we wish to identify. Notice that in general the column space of $$ \boldsymbol{\Sigma }^{-1}\mathrm{var}\left( \mathrm{E}({\mathbf{x}}\vert y)\right) $$ provides only partial coverage of the central subspace since the inverse regression subspace can be a proper subset of the central subspace.
Under additional conditions one can show that a more exhaustive capture of the central subspace is possible. Other inverse regression moments, such as $$ \mathrm{E}\left[ \left( \boldsymbol{\Sigma }-\mathrm{var}({\mathbf{x}}\vert y)\right) ^{2}\right] ,$$ also live in the central subspace under additional conditions on the marginal distribution of the predictors (Cook and Weisberg (1991) [23]). In order not to clutter the present exposition, we focus only on the first inverse regression moment $$ \mathrm{E}({\mathbf{x}}\vert y)$$ in order to introduce SDR methodology to the econometrics literature via the simple and widely used Sliced Inverse Regression (SIR, Li (1991) [45]). In general, linear moment-based SDR methods provide a way of identifying the number and coefficients (up to rotations, as in the DFM literature) of the linear combinations of the predictors in the forward forecasting equation. A feature of SDR methods, which can be viewed both as an advantage and a downside, is that they are silent regarding the functional form of the forward regression. They obtain the linear combinations of $$ \mathbf{x}_{t}$$ that are needed in the forecasting equation in order to adequately reduce $$ {\mathbf{x}}_{t}$$ . When the number of SDR directions is 1 or 2, a plot of the response versus the reduction(s) can visually inform forward regression modeling. A dimension of 2 or larger indicates that the forward model involves non-linear functions of the reductions. It is important to note that SDR methods do not remove the need to model the response but rather significantly reduce the complexity of modeling and uncover the structural dimension of the forward regression problem, i.e. how many derived linear combinations of the original predictors suffice to completely explain y.
Several estimators have been proposed in order to estimate the central subspace. We focus on the first and most widely used: Sliced Inverse Regression (SIR), proposed by Li (1991) [45]. SIR is a semiparametric method for finding dimension reduction subspaces in regression. It is based on the results of Section 4.3 and uses a sample counterpart to $$ \boldsymbol{\Sigma }^{-1}\mathrm{Var}\left( \mathrm{E}\left( {\mathbf{x}}\vert y\right) \right) $$ , the population object that lives in the subspace generated by the coefficient matrix of the reduction. The name derives from using the inverse regression of $$ {\mathbf{x}}$$ on the sliced response y to estimate the reduction. For a univariate y, the method is particularly easy to implement, SIR's step functions being a very simple nonparametric approximation to $$ \mathrm{E}({\mathbf{x}}\vert y)$$ .
Implementation of SIR - In order to estimate $$ {\mathbf{M}}=\mathrm{var}(\mathrm{E}(\mathbf{x}\vert y))$$ , the range of the observed responses $$ \mathbf{Y}=(y_{1},\ldots ,y_{T})^{\prime }$$ is divided into J disjoint slices $$ S_{1},\ldots ,S_{J}$$ whose union is the range of $$ \mathbf{Y}$$ . We denote the overall sample mean of the sample predictor matrix $$ {{\mathbf{X}}}$$ by $$ \bar{{{\mathbf{X}}}}=(\sum_{t=1}^{T}x_{t1}/T,\ldots ,\sum_{t=1}^{T}x_{tp}/T)^{\prime }$$ , and for $$ j=1,\ldots ,J$$ , we let $$ \bar{{{\mathbf{X}}}}_{j}=\sum_{y_{t}\in S_{j}}{\mathbf{X}}_{t}/n_{j}$$ , where nj is the number of yt's in slice Sj. The covariance matrix of $$ {{\mathbf{x}}}$$ is estimated by the sample covariance matrix $$ \widehat{\boldsymbol{\Sigma }}=\sum_{t=1}^{T}({{\mathbf{X}}}_{t}-\bar{{{\mathbf{X}}}})({{\mathbf{X}}}_{t}-\bar{{{\mathbf{X}}}})^{\prime }/T$$ , and the SIR seed $$ {\mathbf{M}}$$ by
$$\displaystyle \widehat{{\mathbf{M}}}=\sum_{j=1}^{J}\frac{n_{j}}{T}(\bar{{{\mathbf{X}}}}_{j}-\bar{{{\mathbf{X}}}})(\bar{{{\mathbf{X}}}}_{j}-\bar{{{\mathbf{X}}}})^{\prime }$$ |
How SIR works: SIR finds the directions of maximum variance between slices, with the T data points collapsed into J slice means clustered according to the y labels (slices). In the extreme case of J=T, i.e. when each slice corresponds to a single y observation, $$ \widehat{{\mathbf{M}}}$$ becomes $$ \widehat{\boldsymbol{\Sigma }}$$ , the sample covariance of $$ {{\mathbf{x}}}$$ , and SIR is identical to PCA. However, for $$ J<T$$ , the variance (noise) of the components within the same slice is suppressed in favor of their signal, which makes SIR much more efficient in identifying $$ {{\mathbf{x}}}$$ components (projections) targeted to y.
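A minimal sketch of the SIR steps just described, in Python (the function name, slicing rule and use of a plain eigen-solver are our own choices, not a reference implementation):

```python
import numpy as np

def sir_directions(X, y, d, J=10):
    """Sliced Inverse Regression: estimate d directions spanning the reduction.

    Slices the observed range of y into J bins, forms the between-slice
    covariance M_hat of the slice means of x, and returns the leading d
    eigenvectors of Sigma_hat^{-1} M_hat as the weight matrix v.
    """
    T, p = X.shape
    Xbar = X.mean(axis=0)
    Xc = X - Xbar
    Sigma = Xc.T @ Xc / T

    edges = np.quantile(y, np.linspace(0, 1, J + 1))       # equal-count slices
    labels = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, J - 1)

    M = np.zeros((p, p))
    for j in range(J):
        idx = labels == j
        if idx.any():
            dj = X[idx].mean(axis=0) - Xbar
            M += idx.sum() / T * np.outer(dj, dj)

    # eigenvalues of Sigma^{-1} M are real in theory; drop tiny imaginary noise
    eigval, eigvec = np.linalg.eig(np.linalg.solve(Sigma, M))
    order = np.argsort(eigval.real)[::-1][:d]
    return eigvec[:, order].real               # SIR predictors are X @ v
```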
In Proposition 4, we show that the SIR directions are consistent estimators of directions in the central subspace for all $$ {\mathbf{x}}_{t}$$ satisfying the linear design condition (4.5) and conditional distributions of $$ {\mathbf{x}}_{t}\vert y_{t+h}$$ , h=1,2,... with finite second moments.
$$\displaystyle \mathrm{cov}(\mathrm{E}({\mathbf{x}}_{t}\vert y_{t+h}))=\sum_{s=1}^{J}p_{s}(\mathbf{m}_{s}-\boldsymbol{\mu })(\mathbf{m}_{s}-\boldsymbol{\mu })^{\prime }$$ |
$$\displaystyle \widehat{{\mathbf{M}}}_{h}=\sum_{s=1}^{J}\hat{p}_{s}(\widehat{\mathbf{m}}_{s}-\bar{{{\mathbf{X}}}})(\widehat{\mathbf{m}}_{s}-\bar{{{\mathbf{X}}}})^{\prime }\overset{p}{\rightarrow }{\mathbf{M}}_{h}$$ |
When y is continuous, it is replaced with a discrete version $$ \tilde{y}$$ based on partitioning the observed range of y into J fixed, non-overlapping slices. Since $$ y{\perp}{\mathbf{x}}\vert{\mathbf{v}}^{\prime }{\mathbf{x}}$$ yields that $$ \tilde{y}{\perp}{\mathbf{x}}\vert{\mathbf{v}}^{\prime }{\mathbf{x}}$$ , we have $$ S_{{\tilde{y}}\vert{\mathbf{x}}}\subseteq S_{y\vert{\mathbf{x}}}$$ . In particular, provided that J is sufficiently large, $$ S_{{\tilde{y}}\vert{\mathbf{x}}}\approx S_{y\vert{\mathbf{x}}}$$ , and there is no loss of information when y is replaced by $$ {\tilde{y}}$$ .
Under more restrictive assumptions on the processes $$ {\mathbf{x}}_{t}$$ and $$ {\mathbf{x}}_{t}\vert(y_{t+h}=s)$$ , $$ s=1,\ldots ,J$$ , $$ t=1,\ldots $$ , $$ h=0,1,\ldots $$ , it can also be shown that their sample means are approximately normally distributed for large T (see Appendix D). Under the same assumptions we can then obtain that $$ \widehat{{\mathbf{M}}}_{h}$$ is asymptotically normal, following arguments similar to those of Bura and Yang (2011) [15], who obtained the asymptotic distribution of $$ \widehat{{\mathbf{M}}}$$ when the data are iid draws from the joint distribution of $$ (y,\mathbf{x})$$ .
Most SDR methodology is based on inverse regression. In general, inverse regression focuses attention on the set of p univariate inverse regressions of each predictor on the response, $$ \mathrm{E}(x_{j}\vert y)$$ , $$ j=1,\ldots ,p$$ .
Next we put our estimators to work, comparing them in a classical pseudo out-of-sample macro-forecasting horserace against the parsimonious AR(4). We pick inflation (CPIAUCSL) and industrial production (INDPRO) as our targets. Then we look for a large set of regressors in order to feed into our models as much information on the macroeconomy as possible. We would like to adopt a "standard" dataset containing a large number of US macro variables; however, the various forecasting studies that use large panels of US macro variables, aside from a core set of agreed-upon macro variables, are quite inhomogeneous regarding the specific set of non-core variables used to form the initial dataset from which to choose or combine the variables. Given that one of the tasks of the forecasting exercise is choosing the most useful variables to forecast a given target variable, the choice of the initial dataset seems indeed crucial. The next paragraphs highlight the rugged landscape of data sources used in some of the most important studies in the DFM macro-forecasting literature and our final choice.
Settling on a Shared Data Source - A problem faced by any researcher in the macro-forecasting field is the lack of comparability across studies due to the multitude of data sources and data vintages used in the literature, a phenomenon that has been pervasive until recently. Luckily, an initiative spearheaded by McCracken and Ng and documented in McCracken and Ng (2015) [49] has set out to impose some discipline on the current and future production of macro-forecasting studies. One outcome of the project has been the creation of a dataset of 132 macro variables called FRED-MD, updated in real time by the same staff that maintains the popular FRED database. We embraced their initiative and adopted FRED-MD as our dataset of choice, although the dataset comes with some limitations that we discuss briefly below and in more detail in the appendix.
Alternative Data Sources in Other DFM Studies - FRED-MD has fewer variables than the quarterly dataset of 144 variables used by Stock and Watson in [58], apparently the most exploited dataset in quarterly studies. It also has fewer variables than the dataset of 143 variables used by Stock and Watson in [60]; the latter is a quarterly study, but the dataset posted by Mark Watson contains monthly variables. Finally, our dataset contains fewer variables than the 149 regressors used by Stock and Watson in [54] or the 215 series used by Stock and Watson in [55]. In turn, both studies draw upon the work in the seminal NBER working paper [53] by Stock and Watson. The recently published quarterly study [60] by Stock and Watson on shrinkage methods uses 143 series, but in the dataset posted online by Mark Watson one can find only 108 monthly variables and 79 quarterly variables. Although we have chosen the most up-to-date dataset available online, at the cost of omitting some variables given the data intensity of our statistical techniques, we also tried to run our pseudo-out-of-sample forecasts using some of the richer datasets mentioned above. We noticed a slight deterioration of forecasting performance when using our dataset of choice, signalling either that some of the variables that we do not include might be marginally helpful in improving the forecast or that data revisions played a role. The inclusion of the Great Recession and the subsequent recovery, a period not covered by the studies mentioned above, appears to impart a substantive deterioration in the forecasting performance of the estimators that we review. The following table summarizes several relevant studies and the salient characteristics of their datasets and statistical methods; McCracken and Ng (2015) [49] provide an informative chronology of the evolution of large macro panel datasets.
Study | # Var. | Freq. | Online | Data Span | Meth.
---|---|---|---|---|---
Stock and Watson (1998) [53] | 224 | m | no | 1959-1997 | PCR, VAR
Stock and Watson (2002a) [54] | 149 | m | no | 1959-1999 | PCR, VAR
Stock and Watson (2002b) [55] | 215 | m | no | 1959-1999 | PCR, VAR
Bai and Ng (2008) [4] | 132 | m | no | 1959-2003 | SPC, EN
Stock and Watson (2005) [56] | 132 | m | yes | 1959-2003 | VAR, SVAR, PCR
Boivin and Ng (2006) [10] | 147 | m | no | 1960-1998 | WPCR
Stock and Watson (2008) [58] | 144 | q | yes | 1959-2006 | Split-sample PCR: PCR, IC, PT, BMA, EB, B
Stock and Watson (2012) [60] | 108 | m, q | yes | 1960-2008 | Split-sample PCR: PCR, IC, PT, BMA, EB, B
Jurado et al. (2015) [42] | 135 | m | yes | 1959-2010 | PCR
FRED-MD - The dataset is called FRED-MD (where MD stands for monthly dataset) and it contains a balanced panel with monthly data from 1960m1 to 2014m12, covering 54 years and totaling 648 monthly observations. We choose to work with monthly data since the companion quarterly dataset FRED-QD is not available yet. Moreover, our SDR forecasting procedure is quite data intensive and monthly data seem to be a better workbench to test our estimators at this juncture. The dataset is described in McCracken and Ng (2015) [49], along with a discussion of some data adjustments needed to construct the panel. We note that a major shortcoming of the dataset is that core CPI and non-farm payroll employment, two of the series most watched by forecasters and Fed officials, have not been included so far. FRED-MD contains very limited real-time vintages, making real-time forecasting infeasible at the moment.
Manipulation of FRED-MD - Some variables in the dataset have a large number of missing observations. Rather than running an EM algorithm to fill in the missing data and achieve a balanced panel, as done by McCracken and Ng (2015) 49, who in turn follow Stock and Watson (2002b) 55, we exclude them. The five excluded variables are: ACOGNO="New Orders for Consumer Goods", ANDENOx="New Orders for Nondefense Capital Goods", OILPRICE="Crude Oil, spliced WTI and Cushing", TWEXMMTH="Trade Weighted U.S. Dollar Index: Major Currencies" and UMCSENT="Consumer Sentiment Index". We do not apply any cleaning of outliers.
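As an illustration of this step, a minimal R sketch (the file name and the position of the transformation-code row reflect our assumptions about the FRED-MD csv layout, and some vintages append an "x" to a few of the mnemonics listed above):

```r
# Read a FRED-MD vintage, set aside the row of transformation codes and drop
# the five series with extensive missing data (assumed csv layout).
fred   <- read.csv("2015-01.csv", stringsAsFactors = FALSE)
tcodes <- as.numeric(fred[1, -1])            # transformation codes (used in the next subsection)
names(tcodes) <- colnames(fred)[-1]
panel  <- fred[-1, ]                         # monthly observations; first column is the date

drop  <- c("ACOGNO", "ANDENOx", "OILPRICE", "TWEXMMTH", "UMCSENT")
panel <- panel[, !(colnames(panel) %in% drop)]
```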
Data Transformations and Forecast Targets - We adopt the transformations suggested by McCracken and Ng (2015) 49 and coded in the second row of the original downloaded dataset. We present our results for a "nominal" forecast target, CPI inflation with mnemonic CPIAUCSL, and a "real" target, total industrial production with mnemonic INDPRO. Following the literature, instead of forecasting the chosen target variable h months ahead we forecast the average realization of the variable over the h months ahead. Hence the transformation of the target variable dictates the forecast target. For instance, inflation is marked as I(2) and transformed as
$$\displaystyle y_{t}=\Delta ^{2}\log \left( CPI_{t}\right) ,$$ |
with h-step-ahead forecast target
$$\displaystyle y_{t+h}^{h}=\frac{1200}{h}\log \left( \frac{CPI_{t+h}}{CPI_{t}}\right) -1200\log \left( \frac{CPI_{t}}{CPI_{t-1}}\right) ,$$ |
while industrial production, marked as I(1), is transformed as
$$\displaystyle y_{t}=\Delta \log \left( IP_{t}\right) ,$$ |
with h-step-ahead forecast target
$$\displaystyle y_{t+h}^{h}=\frac{1200}{h}\log \left( \frac{IP_{t+h}}{IP_{t}}\right) .$$ |
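For concreteness, a minimal R sketch of these target transformations (illustrative only; cpi and ip are assumed to be vectors of monthly levels of CPIAUCSL and INDPRO, and h is the forecast horizon):

```r
# h-step-ahead forecast targets as defined above (annualized, in percent).
make_targets <- function(cpi, ip, h) {
  T   <- length(cpi)
  idx <- 2:(T - h)                     # one lag of CPI is needed for the inflation target
  y_cpi <- 1200 / h * log(cpi[idx + h] / cpi[idx]) - 1200 * log(cpi[idx] / cpi[idx - 1])
  y_ip  <- 1200 / h * log(ip[idx + h] / ip[idx])
  list(cpi = y_cpi, ip = y_ip)
}
```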
The Pseudo Out-of-Sample Forecasting Scheme - We align the data as shown in the following Figure by placing the target on the same line as the regressors in an ideal h-step-ahead OLS scheme. The transformed and aligned data are available on request. We conduct our forecasting exercise at horizons h=1,3,6,12,24. These are relevant horizons in practice and they are enough to allow exploration of possible variation across horizons within each forecasting method. For practical reasons, as is common in the literature, we adopt direct h-step-ahead regressions rather than iterated ones in order to avoid the simulation and feeding of exogenous regressors. As indicated in Figure 1, an advantage of PCR, a non-supervised method, is that the principal components can be re-computed as soon as a new line of observations becomes available in the recursive scheme. The superscript of the PC component in the table highlights this point: for instance, when estimating PCR to forecast 3 steps ahead, a 3-step-ahead regression is run in t=1984m01 regressing $$ y_{t+3}^{3}$$ on the PCs for $$ 1984m01$$ but computed using data through $$ t+1=1984m02$$ . Then new data through $$ 1984m02$$ are used to forecast $$ y_{t+3+1}^{3}$$ and the prediction is formed as
$$\displaystyle \widehat{y}_{t+3+1}^{3}=\alpha _{3}+\gamma _{3}\left( L\right) y_{t}+\beta _{3}\left( L\right) \widehat{PC}^{t+3+1}\left( t\right)$$ |
We report results for different sub-samples; however, our out-of-sample forecasting exercise uses a recursive window rather than a moving one.
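A schematic R sketch of this recursive scheme (illustrative only; recursive_forecasts, fit_fun and the data objects X, y and yh are hypothetical names, and the alignment of yh follows Figure 1):

```r
# Sketch of the recursive (expanding-window) pseudo out-of-sample exercise.
# yh[t] holds the h-step-ahead target aligned on the same row as the regressors
# dated t (as in Figure 1); fit_fun is a placeholder that estimates a direct
# h-step regression on the supplied estimation sample and returns the forecast
# formed from the regressors observed at the forecast origin.
recursive_forecasts <- function(X, y, yh, h, t_start, fit_fun) {
  origins <- t_start:length(yh)
  fc <- rep(NA_real_, length(origins))
  for (i in seq_along(origins)) {
    t0  <- origins[i]
    est <- 1:(t0 - h)                 # rows whose targets are already observed at time t0
    fc[i] <- fit_fun(X[est, , drop = FALSE], y[est], yh[est],
                     X[t0, , drop = FALSE], y[t0])
  }
  data.frame(origin = origins, forecast = fc, realized = yh[origins])
}
```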
The practical implementation of the estimators summarized in an earlier Section requires settling several details. In this Section we present these choices by estimation procedure, along with some sensitivity analysis of the results.
RIDGE - To implement RIDGE we used the R package glmnet by Friedman et al. (2015) 33. The shrinkage parameter $$ \lambda $$ needs to be chosen prior to estimation. De Mol et al. (2008) 25 resorted to fitting a grid of values and reporting the MSFE for each. We tried some of the values in their grid and we discuss their forecast performance below. However, we also opted to exploit the full functionality of glmnet and let the data suggest a value for $$ \lambda $$ using 8-fold cross-validation. As a result of the cross-validation we obtain a sequence of $$ \lambda $$ s over the forecasting steps, which we plot in Figure 2. The top plot shows the values of $$ \lambda _{min}$$ that minimize the cross-validation error. The bottom plot contains the values of $$ \lambda _{1se}$$ , the largest $$ \lambda $$ whose cross-validation error is within one standard error of the minimum (the glmnet default). A striking pattern is revealed in the plot: forecasting at any horizon appears to become increasingly difficult after the great recession, with a sudden surge in both the volatility and the average level of $$ \lambda $$ , a sign of RIDGE attempting to shrink more in reaction to the increased difficulty of prediction. The discouraging results in terms of MSFE that we report in the next Section appear to be the flip side of this feature.
Figure 2: CPIAUCSL: $$ \lambda _{min}$$ and $$ \lambda _{1se}$$ Selected by Cross-Validated RIDGE over the Recursive Forecasting Window.
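A minimal sketch of this cross-validated RIDGE step with cv.glmnet (alpha = 0 gives the RIDGE penalty; X, yh and X_new are illustrative names for the aligned regressors, the h-step target and the forecast-origin regressor row):

```r
library(glmnet)

# 8-fold cross-validated RIDGE at one forecast origin.
cv_ridge <- cv.glmnet(x = as.matrix(X), y = yh, alpha = 0, nfolds = 8)

lambda_min <- cv_ridge$lambda.min    # lambda minimizing the CV error (top panel of Figure 2)
lambda_1se <- cv_ridge$lambda.1se    # largest lambda within one SE of the minimum (bottom panel)
yhat <- predict(cv_ridge, newx = as.matrix(X_new), s = "lambda.1se")
```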
PCR - We estimate many different versions of PCR in order to cover a wide menu of choices for the truncation meta-parameter M. First, we follow Stock and Watson (2002) 55 and estimate a series of PCRs (in this context known as diffusion index models, with acronym DIAR) with a constant M throughout the forecasting experiment. We looked at values of M=1,2,3,4,5,6,8,10,20,30, generating models PCRn1 through PCRn30. When forecasting inflation we find that 8 components explain about 45% of the variance of the panel, matching the findings in McCracken and Ng (2015) 49. We note that in simulations from an approximate factor model in Barbarino and Bura (2015) 7, in which the true number of factors is 8, at least 75% of the variance is explained on average across simulations.
We use the R package pls by Mevik et al. (2013) 50, which also implements PCR with the number of components M selected via cross-validation. The cross-validation in-sample MSFE has an interesting contour over the number of components, with one or two local minima and a global minimum, all depending on the forecast horizon. On average the global minimum is achieved only with a rather large number of components, especially toward the end of the sample, as shown in the first panel of Figure 3. That is a striking feature considering that in the simulations reported in our companion paper, Barbarino and Bura (2015) 7, cross-validation tends to gravitate around a number of components close to the number of true factors. The results of our empirical investigation in this paper suggest either that the true number of factors generating FRED-MD is indeed very large or that some non-linearities (such as a change in regime) or other features of the data disrupt the effectiveness of cross-validation after the early 2000s. Cross-validation generates a quite volatile choice for the number of components over the experiment at all forecast horizons. The evolution of the first local minimum is shown in the second panel of Figure 3 (model PCRmin1), which, although less volatile, is still quite unstable. Another possibility is implementing best subset selection (BSS) in the choice of the components: model PCRbss is allowed to pick any component, whereas PCRbss30 can choose only among the first 30. To implement BSS we use the leaps R package by Thomas Lumley (2009) 47. BSS does not impose a hierarchical ordering of the components (whereby if component #3 is used, components #1 and #2 are used as well). Rather, it uses a backward search, running a large number of regressions from larger models to smaller ones and eliminating regressors using the BIC criterion. The last two panels of Figure 3 highlight that forecasting at longer horizons seems more difficult and requires more components. Also for these models, forecasting in the last part of the window requires more components on balance.
Figure 3: CPIAUCSL: Number of Components Selected by Cross-Validated and Best Subset PCR over the Recursive Forecasting Window.
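The two selection devices just described can be sketched as follows, using the pls and leaps packages (X and yh are illustrative names for the standardized panel and the aligned h-step target at one forecast origin; this is not our full recursive code):

```r
library(pls)
library(leaps)

# PCR with the number of components selected by cross-validation (model PCRcv).
dat    <- data.frame(yh = yh, X = I(as.matrix(X)))
pcr_cv <- pcr(yh ~ X, data = dat, ncomp = 30, validation = "CV", scale = TRUE)
m_cv   <- which.min(RMSEP(pcr_cv, estimate = "CV")$val[1, 1, -1])   # drop the intercept-only entry

# Best subset selection over the first 30 principal components (model PCRbss30):
# backward search, retaining the subset of components with the smallest BIC.
pcs  <- prcomp(as.matrix(X), scale. = TRUE)$x[, 1:30]
bss  <- regsubsets(x = pcs, y = yh, nvmax = 30, method = "backward")
best <- which.min(summary(bss)$bic)
keep <- summary(bss)$which[best, -1]        # logical vector flagging the selected components
```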
We also run PCR with selection of the number of components according to Bai and Ng (2002) 2, picking their favorite PCp2 criterion and the alternative ICp2 criterion. In this exercise too we find instabilities worthy of note. We can more or less reproduce the results in McCracken and Ng (2015) 49, where it is shown that such criteria select about 8 to 10 components (as remarked above, this number of components explains less than half of the variance in the panel). However, PCp2 in particular is very sensitive to the maximum number of allowed components that has to be specified in its computation (a meta-parameter that we denote by $$ k_{\max }$$ ). Setting $$ k_{max}=30$$ , a natural choice since 30 components explain about 90% of the variance of the dataset (model PCRpcp2, top plot in the Figure below), we obtain that PCp2 selects many more components than with $$ k_{max}=15$$ (model PCRpcp2b, third panel below). For $$ k_{\max }=50$$ , PCp2 selects far more components than cross-validation, very frequently all of them, about 120 on average. By contrast, ICp2 appears to be more stable when $$ k_{\max }$$ moves from 15 to 30, becoming very unstable only when $$ k_{\max }$$ is above 70. The selection of the number of components by these criteria is shown in Figure 4. Notice that these criteria, being untargeted, are not affected by the forecast horizon at hand.
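For reference, a compact R sketch of how the two Bai and Ng (2002) criteria can be computed from principal components of the standardized panel (our reading of the PCp2 and ICp2 formulas; Xs and kmax are illustrative names, with kmax playing the role of $$ k_{\max }$$ ):

```r
# Bai and Ng (2002) PCp2 and ICp2 criteria for the number of factors.
# Xs: T x N standardized panel; kmax: maximum number of factors considered.
bai_ng <- function(Xs, kmax) {
  T  <- nrow(Xs); N <- ncol(Xs)
  pc <- prcomp(Xs, center = FALSE, scale. = FALSE)
  V  <- sapply(1:kmax, function(k) {
    F_hat <- pc$x[, 1:k, drop = FALSE]            # estimated factors
    L_hat <- pc$rotation[, 1:k, drop = FALSE]     # loadings
    mean((Xs - F_hat %*% t(L_hat))^2)             # V(k): average squared residual
  })
  penalty <- (1:kmax) * ((N + T) / (N * T)) * log(min(N, T))
  sigma2  <- V[kmax]                              # residual variance estimated at kmax
  list(PCp2 = which.min(V + sigma2 * penalty),
       ICp2 = which.min(log(V) + penalty))
}
```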
PLS - We implement PLS using the pls package 50. We choose the truncation meta-parameter u with cross-validation and implement two models. In PLSRcv we include $$ y_{t}$$ and its lags directly in the $$ \mathbf{X}$$ matrix, whereas in model PLSRcvd we do not include them and, in a second step, we run OLS of the target on $$ y_{t}$$ , its lags and the PLS components. In contrast with PCR, at short horizons u is not very large. At the longer horizons h=12 and h=24, PLS needs a large number of components, similar to PCR. The great recession appears to require more components also at shorter horizons.
Figure 4: CPIAUCSL: Number of Components Selected by PCp2 and ICP2 Criteria over Recursive Forecasting Window ( $$ k_{max}=30$$ in top 2 panels, $$ k_{max}=15$$ in bottom 2 panels).
Figure 5: CPIAUCSL: Number of Components Selected by Cross-Validation of PLS over Recursive Forecasting Window.
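The corresponding sketch for the two PLS variants (plsr from the same pls package; X, yh and y_lags are illustrative names, y_lags being the matrix of $$ y_{t}$$ and its lags):

```r
library(pls)

# PLSRcv: y_t and its lags enter the predictor matrix directly.
dat_cv <- data.frame(yh = yh, XW = I(as.matrix(cbind(X, y_lags))))
pls_cv <- plsr(yh ~ XW, data = dat_cv, ncomp = 30, validation = "CV", scale = TRUE)
u_cv   <- which.min(RMSEP(pls_cv, estimate = "CV")$val[1, 1, -1])

# PLSRcvd: PLS components are extracted from X alone; a second-step OLS then
# regresses the target on y_t, its lags and the selected PLS components.
dat_d  <- data.frame(yh = yh, X = I(as.matrix(X)))
pls_d  <- plsr(yh ~ X, data = dat_d, ncomp = 30, validation = "CV", scale = TRUE)
u_d    <- which.min(RMSEP(pls_d, estimate = "CV")$val[1, 1, -1])
comps  <- scores(pls_d)[, 1:u_d, drop = FALSE]
fit_d  <- lm(yh ~ y_lags + comps)
```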
Table 2 compares PCR and PLS on the basis of explained variance with 10 components. Despite using only 10 components, the superior targeting nature of PLS relative to PCR is evident: the PLS components explain about 70% of the variance of CPIAUCSL and 60% of the variance of INDPRO. Not only does PLS explain a larger fraction of the variance of the target, it also explains a fraction of the variance of the panel similar to that explained by PCR.
Explained variance | CPIAUCSL | INDPRO |
---|---|---|
PCR10 %var(X) | 52% | 52% |
PCR10 %var($$ y_{t+h}$$ ) | 33% | 40% |
PLS10 %var(X) | 50% | 46% |
PLS10 %var($$ y_{t+h}$$ ) | 70% | 60% |
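The shares in Table 2 can be computed along the following lines (a sketch; Z denotes the standardized panel, comp a T x 10 matrix of PCR or PLS components, and yh the h-step target):

```r
# Fraction of the panel variance and of the target variance captured by the
# first 10 components, as reported in Table 2.
var_share_X <- function(Z, comp) {
  fit <- lm(Z ~ comp)                  # project every standardized series on the components
  1 - sum(residuals(fit)^2) / sum(scale(Z, scale = FALSE)^2)
}
var_share_y <- function(yh, comp) summary(lm(yh ~ comp))$r.squared
```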
Tables 3 and 4 summarize the average number of components selected by our estimators and the volatility of these choices.
Model | CPIAUCSL h=1 | CPIAUCSL h=3 | CPIAUCSL h=6 | CPIAUCSL h=12 | CPIAUCSL h=24 | INDPRO h=1 | INDPRO h=3 | INDPRO h=6 | INDPRO h=12 | INDPRO h=24 |
---|---|---|---|---|---|---|---|---|---|---|
RIDGE | 1.43 | 0.91 | 0.54 | 0.73 | 0.47 | 14.78 | 9.23 | 5.43 | 7.41 | 3.18 |
RIDGE1se | 69.94 | 15.19 | 26.05 | 6.31 | 2.70 | 192.24 | 70.51 | 35.10 | 25.39 | 12.89 |
PCRcv | 47.90 | 46.93 | 46.18 | 48.39 | 46.57 | 117.38 | 117.06 | 117.00 | 117.13 | 117.11 |
PCRmin1 | 11.38 | 11.57 | 11.56 | 11.35 | 11.21 | 34.30 | 35.75 | 36.21 | 35.79 | 36.37 |
PCRmin2 | 23.37 | 23.43 | 23.68 | 23.02 | 23.19 | 57.72 | 59.60 | 60.95 | 60.26 | 60.16 |
PCRpcp2 | 23.73 | 23.73 | 23.73 | 23.73 | 23.73 | 23.53 | 23.53 | 23.53 | 23.53 | 23.53 |
PCRpcp2b | 10.37 | 10.37 | 10.37 | 10.37 | 10.37 | 10.37 | 10.37 | 10.37 | 10.37 | 10.37 |
PCRicp2 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 |
PCRicp2b | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 | 6.74 |
PCRbss | 9.30 | 7.06 | 12.10 | 22.23 | 24.61 | 10.53 | 12.41 | 18.49 | 17.02 | 17.94 |
PCRbss30 | 3.15 | 2.98 | 3.53 | 5.97 | 7.67 | 3.88 | 4.12 | 4.40 | 7.28 | 8.73 |
PLSRcv | 9.01 | 10.97 | 27.85 | 70.86 | 63.99 | 1.39 | 2.37 | 13.18 | 2.06 | 4.17 |
PLSRcvd | 3.23 | 5.24 | 8.37 | 39.31 | 44.79 | 1.26 | 2.47 | 4.58 | 2.18 | 3.94 |
Model | CPIAUCSL h=1 | CPIAUCSL h=3 | CPIAUCSL h=6 | CPIAUCSL h=12 | CPIAUCSL h=24 | INDPRO h=1 | INDPRO h=3 | INDPRO h=6 | INDPRO h=12 | INDPRO h=24 |
---|---|---|---|---|---|---|---|---|---|---|
RIDGE | 3.88 | 3.15 | 2.95 | 2.98 | 2.07 | 9.86 | 9.23 | 5.43 | 7.41 | 3.18 |
RIDGE1se | 241.51 | 115.36 | 192.18 | 26.19 | 16.66 | 91.86 | 70.51 | 35.10 | 25.39 | 12.89 |
PCRcv | 20.91 | 21.15 | 21.06 | 20.88 | 20.80 | 2.55 | 117.06 | 117.00 | 117.13 | 117.11 |
PCRmin1 | 6.50 | 6.67 | 6.36 | 5.83 | 6.16 | 25.76 | 35.75 | 36.21 | 35.79 | 36.37 |
PCRmin2 | 8.18 | 7.77 | 7.46 | 7.41 | 7.35 | 27.51 | 59.60 | 60.95 | 60.26 | 60.16 |
PCRpcp2 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.91 | 23.53 | 23.53 | 23.53 | 23.53 |
PCRpcp2b | 0.49 | 0.49 | 0.49 | 0.49 | 0.49 | 0.48 | 10.37 | 10.37 | 10.37 | 10.37 |
PCRicp2 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 6.74 | 6.74 | 6.74 | 6.74 |
PCRicp2b | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 6.74 | 6.74 | 6.74 | 6.74 |
PCRbss | 2.34 | 2.74 | 3.27 | 3.02 | 1.18 | 3.96 | 12.41 | 18.49 | 17.02 | 17.94 |
PCRbss30 | 1.18 | 1.71 | 1.79 | 1.26 | 1.77 | 1.14 | 4.12 | 4.40 | 7.28 | 8.73 |
PLSRcv | 6.98 | 5.78 | 23.09 | 37.27 | 26.39 | 0.68 | 2.37 | 13.18 | 2.06 | 4.17 |
PLSRcvd | 3.49 | 6.27 | 11.29 | 36.88 | 34.18 | 0.44 | 2.47 | 4.58 | 2.18 | 3.94 |
SIR - As mentioned, SIR requires a large sample to yield reliable estimates. In our companion paper, Barbarino and Bura (2015) 7, we develop a Krylov-subspace version of SIR that can handle cases where $$ p>T$$ . However, in line with the exploratory nature of this paper, we wanted to test the performance of standard SIR, which necessitated pre-processing of the data over samples where $$ p>T$$ or p is of an order close to T. When we pre-condition the data we use 20 or 30 PCs, a number sufficient to preserve between 75% and 90% of the total variance in the panel. The idea is that, by retaining a sufficient number of components, not too much information on the conditional predictive density of $$ y_{t+h}\vert x_{t}$$ is lost, and the application of SIR to the reduced data can still identify and estimate an SDR subspace. We describe in detail the algorithm used to compute our pre-processed SIR in the Appendix. Although we are experimenting with non-parametric regression techniques to model the forward regression, at this point we report only results obtained using a linear forward regression that models the dynamics of $$ y_{t}$$ and includes the SIR components. This solution is suboptimal and likely hurts the forecasting accuracy of our estimator, as it omits the non-linear terms implied whenever the dimension of the SIR subspace is larger than one. Despite this, our results are encouraging. Regarding the choice of the dimension of the SDR subspace, we tried to apply both the asymptotic weighted chi-square test in Bura and Cook (2001b) 13 and the permutation test in Cook and Yin (2001) 24; however, both proved very unstable and unreliable in our time-series setting. While we are working on the development of a test appropriate for our time-series environment, in this paper we set the dimension to the number of SIR predictors resulting in the most accurate forecasts under several scenarios of constructing the SIR predictors and forward forecasting models. We verify that a dimension larger than two is almost never beneficial in the forecasting exercise. Notice that in the factor literature the information criteria used to select the number of components are somewhat cumbersome, involving some "model-mining" (for instance, see the preceding discussion on $$ k_{\max }$$ ). Our two-step procedure can be viewed as an alternative way of selecting the PC components, whereby optimal selection is achieved by SDR techniques, achieving, as we will see shortly, extreme parsimony.12 We will also show that for large samples standard SIR applied to the raw variables has competitive performance using only one or two SIR predictors.
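To fix ideas, here is a stylized R version of the two-step pre-conditioned estimator just described (a sketch only; the exact regularized algorithm we use is documented in the Appendix, and names such as sir_forecast_fit, y_lags and nslices are ours for exposition). It slices the response, forms the between-slice covariance of the standardized PCs, and uses its leading eigenvectors as SIR predictors in a linear forward model:

```r
# Stylized pre-conditioned SIR: (1) reduce X to its first npc principal
# components, (2) run SIR on the standardized PCs by slicing the response and
# forming the between-slice covariance M of the slice means (the PCs are
# uncorrelated, so marginal standardization is the usual SIR standardization),
# (3) use the leading d SIR predictors, plus y_t and its lags, in a linear
# forward forecasting model.
sir_forecast_fit <- function(X, yh, y_lags, npc = 20, nslices = 10, d = 2) {
  Zs    <- scale(prcomp(scale(X))$x[, 1:npc])               # standardized PCs
  slice <- cut(rank(yh, ties.method = "first"),
               breaks = nslices, labels = FALSE)            # slice the response
  M <- matrix(0, npc, npc)
  for (j in unique(slice)) {
    nj   <- sum(slice == j)
    mbar <- colMeans(Zs[slice == j, , drop = FALSE])        # slice mean (overall mean is zero)
    M    <- M + (nj / nrow(Zs)) * outer(mbar, mbar)
  }
  B    <- eigen(M, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]
  sirp <- Zs %*% B                                          # SIR predictors
  lm(yh ~ y_lags + sirp)                                    # linear forward model
}
```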
We now turn to the analysis of the forecasting performance of the estimators lined up in this study. We concentrate on the mean square forecast error (MSFE) as a measure of performance, although broadly similar results are obtained using the mean absolute error criterion. Forecasting performance can depend on the sample range it is based on; this is particularly true since the "great recession" is covered in our data. To study this effect, and also to draw inference unencumbered by it, we consider several sample ranges.
Estimators that Use One Component - AR(4), OLS and RIDGE use only one linear combination of the regressors to form their forecast. In Table 5 we report MSFEs for these three methods at five different horizons. OLS, as expected, is greatly affected by the relatively small sample size. This said, in simulations conducted in a companion paper 7 we found that OLS is much more competitive when the data are generated by an exact factor model, and only large deviations from such a DGP or extreme paucity of observations disrupt its forecast efficiency. We show in that paper that an exact factor model indeed implies that OLS is the correct model to use. We were surprised by the sub-par performance of RIDGE, as it performs well in very few instances, in contrast with the results in De Mol et al. (2008) 25. We also fed into our RIDGE estimator the parameters suggested in De Mol et al. (2008) 25 (models RIDGE141 through RIDGE3532), with little success. Data revisions, sample range and estimation procedures may explain the difference.
Model | CPIAUCSL h=1 | CPIAUCSL h=3 | CPIAUCSL h=6 | CPIAUCSL h=12 | CPIAUCSL h=24 | INDPRO h=1 | INDPRO h=3 | INDPRO h=6 | INDPRO h=12 | INDPRO h=24 |
---|---|---|---|---|---|---|---|---|---|---|
OLS | 2.11 | 6.07 | 8.92 | 7.77 | 1.07 | 1.73 | 1.33 | 4.61 | 32.47 | 7.33 |
AR4 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
RIDGE | 0.92 | 2.05 | 5.57 | 7.66 | 1.51 | 0.97 | 0.87 | 1.11 | 6.04 | 6.99 |
RIDGE1se | 1.00 | 1.68 | 5.33 | 6.66 | 1.47 | 0.94 | 0.95 | 0.99 | 1.70 | 2.55 |
RIDGE141 | 1.10 | 1.56 | 2.04 | 2.36 | 2.73 | 0.92 | 0.98 | 1.00 | 1.05 | 1.02 |
RIDGE288 | 1.11 | 1.59 | 2.09 | 2.43 | 2.84 | 0.96 | 1.08 | 1.08 | 1.05 | 1.01 |
RIDGE292 | 1.11 | 1.59 | 2.09 | 2.43 | 2.84 | 0.96 | 1.08 | 1.08 | 1.05 | 1.01 |
RIDGE528 | 1.12 | 1.60 | 2.11 | 2.47 | 2.89 | 1.01 | 1.17 | 1.15 | 1.06 | 1.01 |
RIDGE582 | 1.12 | 1.60 | 2.12 | 2.47 | 2.89 | 1.01 | 1.18 | 1.16 | 1.07 | 1.01 |
RIDGE949 | 1.12 | 1.61 | 2.13 | 2.49 | 2.92 | 1.05 | 1.24 | 1.19 | 1.07 | 1.01 |
RIDGE3532 | 1.12 | 1.62 | 2.15 | 2.52 | 2.96 | 1.11 | 1.34 | 1.27 | 1.09 | 1.01 |
Principal Components Regression - We now turn to the performance of the diffusion index models in Table 6. PCR appears to be effective when forecasting one month ahead for both targets; however, forecasting inflation with PCR hits a wall at longer horizons. Including enough components appears to be key in general. Industrial production appears to be easier to forecast, and methods that select a relatively stable number of components are the most successful, such as fixing 8 or 10 components or using PCp2 or ICp2. BSS restricted to 30 components also works consistently well across horizons. PCR with 10 PCs has consistently good performance for CPI but not for industrial production.
Model | CPIAUCSL h=1 | CPIAUCSL h=3 | CPIAUCSL h=6 | CPIAUCSL h=12 | CPIAUCSL h=24 | INDPRO h=1 | INDPRO h=3 | INDPRO h=6 | INDPRO h=12 | INDPRO h=24 |
---|---|---|---|---|---|---|---|---|---|---|
PCRcv | 0.97 | 1.02 | 0.96 | 1.31 | 1.83 | 1.17 | 7.67 | 15.88 | 7.66 | 3.77 |
PCRmin1 | 0.95 | 1.07 | 1.11 | 1.18 | 1.37 | 1.00 | 1.06 | 1.04 | 1.00 | 0.90 |
PCRmin2 | 0.99 | 1.05 | 1.06 | 1.22 | 1.29 | 1.02 | 1.12 | 1.12 | 1.03 | 0.95 |
PCRpcp2 | 0.96 | 1.01 | 1.02 | 1.10 | 1.25 | 1.04 | 1.01 | 0.99 | 0.93 | 0.89 |
PCRicp2 | 0.95 | 1.06 | 1.11 | 1.19 | 1.38 | 0.92 | 0.89 | 0.94 | 0.96 | 1.06 |
PCRpcp2b | 0.93 | 1.07 | 1.12 | 1.20 | 1.40 | 0.94 | 0.87 | 0.93 | 0.92 | 0.93 |
PCRicp2b | 0.95 | 1.06 | 1.11 | 1.19 | 1.38 | 0.92 | 0.89 | 0.94 | 0.96 | 1.06 |
PCRbss | 0.97 | 1.09 | 1.04 | 1.03 | 1.10 | 1.05 | 1.03 | 1.05 | 1.01 | 0.89 |
PCRbss30 | 0.97 | 1.06 | 1.06 | 1.16 | 1.30 | 0.96 | 0.93 | 0.95 | 0.91 | 0.90 |
PCRn1 | 1.02 | 1.05 | 1.08 | 1.13 | 1.34 | 0.96 | 0.96 | 0.98 | 0.99 | 1.01 |
PCRn2 | 1.03 | 1.07 | 1.08 | 1.11 | 1.22 | 0.95 | 0.94 | 1.00 | 1.01 | 1.08 |
PCRn3 | 1.03 | 1.07 | 1.09 | 1.14 | 1.27 | 0.94 | 0.93 | 0.98 | 0.99 | 1.07 |
PCRn4 | 0.98 | 1.07 | 1.09 | 1.16 | 1.33 | 0.93 | 0.93 | 0.98 | 0.99 | 1.08 |
PCRn5 | 0.97 | 1.07 | 1.10 | 1.18 | 1.39 | 0.93 | 0.94 | 0.99 | 1.00 | 1.07 |
PCRn6 | 0.95 | 1.05 | 1.10 | 1.19 | 1.40 | 0.93 | 0.93 | 1.00 | 1.01 | 1.08 |
PCRn8 | 0.95 | 1.06 | 1.11 | 1.19 | 1.38 | 0.92 | 0.85 | 0.92 | 0.94 | 1.05 |
PCRn10 | 0.93 | 1.07 | 1.12 | 1.21 | 1.41 | 0.94 | 0.87 | 0.93 | 0.93 | 0.94 |
Partial Least Squares Regression - The general impression from Table 7 is that cross-validation is very effective when forecasting industrial production at all horizons, whereas fixing the number of components causes a deterioration of the MSFEs. The opposite seems to be true when forecasting inflation, in which case fixing the number of components at 6 or 8 seems the most appropriate choice. On balance, PLS comes out of the horse race as one of the best performers, especially for inflation.
Model | CPIAUCSL h=1 | CPIAUCSL h=3 | CPIAUCSL h=6 | CPIAUCSL h=12 | CPIAUCSL h=24 | INDPRO h=1 | INDPRO h=3 | INDPRO h=6 | INDPRO h=12 | INDPRO h=24 |
---|---|---|---|---|---|---|---|---|---|---|
PLSRcv | 1.02 | 1.09 | 1.06 | 1.01 | 0.87 | 0.93 | 0.87 | 0.95 | 0.89 | 0.86 |
PLSRcvd | 0.96 | 1.00 | 1.02 | 1.09 | 1.09 | 0.96 | 0.90 | 0.97 | 0.95 | 0.87 |
PLSRn2 | 0.99 | 0.98 | 1.00 | 1.05 | 1.19 | 0.94 | 0.92 | 0.96 | 0.92 | 0.91 |
PLSRn4 | 0.90 | 0.94 | 0.94 | 1.00 | 1.14 | 1.09 | 1.08 | 1.02 | 0.93 | 0.85 |
PLSRn6 | 0.95 | 0.96 | 0.94 | 0.99 | 1.11 | 1.11 | 1.18 | 1.12 | 1.02 | 0.92 |
PLSRn8 | 0.97 | 0.98 | 0.95 | 0.98 | 1.07 | 1.08 | 1.11 | 1.08 | 1.00 | 0.91 |
PLSRn10 | 0.98 | 0.96 | 0.93 | 0.95 | 1.02 | 1.11 | 1.14 | 1.09 | 1.04 | 0.93 |
Sliced Inverse Regression - Table 8 reports SIR MSFEs relative to the AR(4). Pre-conditioned SIR on 30 or 20 PCs (the first four lines of the table) delivers some good results; the last two lines of the table report the results of principal components regression. SIR is capable of summarizing in just one or two components the information encapsulated in 20 or 30 PCs, and it improves on the dismal performance of PCR in forecasting inflation at medium horizons. We view the gain in parsimony and modeling as the major advantage of using SIR in this instance. As mentioned earlier, our results are certainly adversely affected by misspecification of the forward model, given that two SIR components appear to capture all relevant information on the conditional distribution of $$ y_{t+h}^{h}\vert x_{t}$$ . In this case the SDR literature relies on graphical regression diagnostics, which are not easily ported to a pseudo out-of-sample forecasting experiment in which estimation is repeated hundreds of times; we are working on an efficient implementation of non-parametric methods to solve this problem. Finally, we also report estimates for SIR on the raw data, without pre-conditioning. The sample is already long enough to deliver a performance that slightly beats the AR(4) at short horizons, although the estimation has been carried out using an inverse regression of the regressors on $$ y_{t}$$ rather than on $$ y_{t+h}^{h}$$ , in order to preserve as much information as possible in the estimation algorithm, an additional source of misspecification.
Model | CPIAUCSL h=1 | CPIAUCSL h=3 | CPIAUCSL h=6 | CPIAUCSL h=12 | CPIAUCSL h=24 | INDPRO h=1 | INDPRO h=3 | INDPRO h=6 | INDPRO h=12 | INDPRO h=24 |
---|---|---|---|---|---|---|---|---|---|---|
PC30SIRdr1OLS | 0.99 | 0.97 | 0.95 | 1.01 | 1.13 | 1.07 | 1.04 | 1.12 | 0.95 | 0.94 |
PC30SIRdr2OLS | 1.00 | 0.97 | 0.95 | 1.01 | 1.13 | 1.03 | 1.00 | 1.05 | 0.93 | 0.91 |
PC20SIRdr1OLS | 0.99 | 0.97 | 1.02 | 1.07 | 1.18 | 0.99 | 0.97 | 1.03 | 0.94 | 0.94 |
PC20SIRdr2OLS | 0.97 | 0.99 | 1.04 | 1.07 | 1.20 | 0.98 | 0.93 | 0.97 | 0.91 | 0.90 |
SIRdr1OLS | 0.98 | 1.00 | 1.01 | 1.01 | 1.03 | 0.97 | 1.01 | 1.02 | 1.02 | 1.04 |
SIRdr2OLS | 0.97 | 0.99 | 1.01 | 1.00 | 1.05 | 0.96 | 1.00 | 1.01 | 1.03 | 1.04 |
PCRn20 | 0.96 | 1.07 | 1.11 | 1.20 | 1.41 | 0.99 | 0.97 | 0.97 | 0.92 | 0.88 |
PCRn30 | 0.99 | 1.03 | 1.04 | 1.12 | 1.26 | 1.05 | 1.05 | 1.02 | 0.94 | 0.88 |
A Sub-Sample Comparison of SIR and PCR - In Tables 10-13, we report MSFEs relative to the AR(4), focusing attention on PCR, as the most widely used dimension reduction method in macro-economic forecasting, and on SIR-based forecasting models, whose regularized version also uses the PCs of the $$ \mathbf{x}$$ variables. Because the SIR predictors are driven by the inverse regression of the predictors on the response, in a time-series context, where the contemporaneous target variable and its lags can be used as predictors, different choices of which variables to treat as predictors and as response lead to different SIR models. The four SIR-based models we use are defined in Table 9. The second column describes how the SIR predictors are formed. For example, in SIRa the SIR predictors are obtained by using all $$ \mathbf{x}$$ -predictors together with $$ y_{t}$$ and its 4 lags as the $$ \mathbf{X}$$ -predictor matrix and $$ y_{t+h}$$ as the response that is sliced in the SIR algorithm of Section 4.3. When the PCs are used, the regularized SIR algorithm in Appendix B is applied. The third column defines the forward forecasting model: for SIRa, for example, $$ y_{t+h}$$ is regressed on a linear model whose inputs are the corresponding SIR predictors from the second column. In Tables 10 and 12, only the regularized version of SIR is used, as the starting sample is small relative to the number of predictors.
Model | SIR Predictors (inverse regression) | Forward Model |
---|---|---|
SIRa | X or PCs and $$ y_{t}+4$$ lags on $$ y_{t+h}$$ | $$ y_{t+h}$$ on SIR predictors |
SIRb | X or PCs and $$ y_{t}+4$$ lags on $$ y_{t+h}$$ | $$ y_{t+h}$$ on $$ y_{t}+4$$ lags and SIR predictors |
SIRc | X or PCs on $$ y_{t+h}$$ | $$ y_{t+h}$$ on $$ y_{t}+4$$ lags and SIR predictors |
SIRd | X or PCs on yt | $$ y_{t+h}$$ on $$ y_{t}+4$$ lags and SIR predictors |
For both inflation and industrial production, the general pattern across forecasting windows and horizons is that SIR, either standard or regularized, has performance similar to PCR. For the longest horizon of 24 months, SIR has better performance. The only exception is industrial production over the period 2003:01-2014:12 for models SIRa, SIRb and SIRc (see the relative MSFEs in Table 13), where the SIR predictors are based on all 129 $$ \mathbf{x}$$ variables. In contrast, over 2010:01-2014:12 the performance is on par with the other methods. This finding confirms that the sample size has a dramatic impact on SIR performance. Notably, SIRd, where the SIR predictors are built using only $$ y_{t}$$ , does not appear to be affected by the size of the sample. In effect, for these econometric series, SIRd exhibits overall the best performance, for both PC-based and standard SIR, across periods and horizons. PCR typically needs 10 components to achieve its best performance across horizons and time windows. In sum, SIR is shown to achieve what it is designed to do; that is, significantly reduce the dimension of the forecasting problem.
CPIAUCSL | 1971:01-2014:12, h=1 | 1971:01-2014:12, h=3 | 1971:01-2014:12, h=6 | 1971:01-2014:12, h=12 | 1971:01-2014:12, h=24 | 1982:01-2014:12, h=1 | 1982:01-2014:12, h=3 | 1982:01-2014:12, h=6 | 1982:01-2014:12, h=12 | 1982:01-2014:12, h=24 |
---|---|---|---|---|---|---|---|---|---|---|
PCRn4 | 0.96 | 0.974 | 0.95 | 0.894 | 0.767 | 0.97 | 1.07 | 1.092 | 1.153 | 1.33 |
PCRn5 | 0.96 | 0.98 | 0.959 | 0.909 | 0.791 | 0.97 | 1.07 | 1.102 | 1.178 | 1.4 |
PCRn10 | 0.94 | 0.982 | 0.976 | 0.913 | 0.763 | 0.93 | 1.07 | 1.119 | 1.202 | 1.41 |
PCRn20 | 1 | 0.996 | 0.989 | 0.934 | 0.779 | 0.96 | 1.07 | 1.117 | 1.202 | 1.41 |
PC20SIRa | 0.99 | 1.008 | 1.022 | 1.01 | 0.805 | 0.935 | 1.07 | 0.935 | 1.3 | 1.45 |
PC20SIRb | 0.954 | 0.983 | 0.999 | 0.938 | 0.763 | 0.947 | 1.05 | 0.947 | 1.194 | 1.35 |
PC20SIRc | 0.972 | 1.001 | 1.015 | 0.933 | 0.768 | 0.972 | 1 | 0.972 | 1.067 | 1.2 |
PC20SIRd | 0.984 | 1.004 | 0.982 | 0.953 | 0.939 | 0.977 | 1.01 | 0.977 | 0.995 | 1.05 |
CPIAUCSL | 2003:01-2014:12, h=1 | 2003:01-2014:12, h=3 | 2003:01-2014:12, h=6 | 2003:01-2014:12, h=12 | 2003:01-2014:12, h=24 | 2010:01-2014:12, h=1 | 2010:01-2014:12, h=3 | 2010:01-2014:12, h=6 | 2010:01-2014:12, h=12 | 2010:01-2014:12, h=24 |
---|---|---|---|---|---|---|---|---|---|---|
PCRn4 | 0.967 | 1.08 | 1.144 | 1.229 | 1.372 | 0.962 | 1.091 | 1.121 | 1.136 | 1.124 |
PCRn5 | 0.961 | 1.083 | 1.161 | 1.267 | 1.469 | 0.963 | 1.078 | 1.105 | 1.188 | 1.204 |
PCRn10 | 0.901 | 1.079 | 1.183 | 1.311 | 1.516 | 1.001 | 1.134 | 1.185 | 1.293 | 1.3 |
PCRn20 | 0.918 | 1.065 | 1.174 | 1.311 | 1.493 | 1.065 | 1.134 | 1.161 | 1.264 | 1.221 |
PC20SIRa | 0.921 | 1.063 | 1.217 | 1.414 | 1.537 | 1.1 | 1.136 | 1.21 | 1.508 | 1.406 |
PC20SIRb | 0.935 | 1.05 | 1.185 | 1.291 | 1.408 | 1.073 | 1.102 | 1.157 | 1.24 | 1.192 |
PC20SIRc | 0.963 | 0.998 | 1.079 | 1.068 | 1.162 | 1.117 | 1.023 | 1.053 | 1.057 | 1.314 |
PC20SIRd | 0.968 | 1.005 | 0.993 | 0.99 | 1 | 1.045 | 1.04 | 0.986 | 0.973 | 1.208 |
SIRa | 57.416 | 30.219 | 6.788 | 7.059 | 8.975 | 1.597 | 1.345 | 1.39 | 1.339 | 1.917 |
SIRb | 25.035 | 18.929 | 4.659 | 5.897 | 8.183 | 1.209 | 1.122 | 1.192 | 1.238 | 1.78 |
SIRc | 20.473 | 25.865 | 2.388 | 11.89 | 1.902 | 1.057 | 1.145 | 1.062 | 1.378 | 2.225 |
SIRd | 0.948 | 0.981 | 0.991 | 0.989 | 1.012 | 0.982 | 0.997 | 1.023 | 0.999 | 1.012 |
INDPRO | 1971:01-2014:12, h=1 | 1971:01-2014:12, h=3 | 1971:01-2014:12, h=6 | 1971:01-2014:12, h=12 | 1971:01-2014:12, h=24 | 1982:01-2014:12, h=1 | 1982:01-2014:12, h=3 | 1982:01-2014:12, h=6 | 1982:01-2014:12, h=12 | 1982:01-2014:12, h=24 |
---|---|---|---|---|---|---|---|---|---|---|
PCRn4 | 0.904 | 0.8 | 0.761 | 0.717 | 0.75 | 0.943 | 0.947 | 0.996 | 0.989 | 1.078 |
PCRn5 | 0.897 | 0.8 | 0.769 | 0.718 | 0.74 | 0.937 | 0.946 | 0.999 | 0.996 | 1.069 |
PCRn10 | 0.914 | 0.83 | 0.807 | 0.713 | 0.65 | 0.932 | 0.884 | 0.94 | 0.927 | 0.946 |
PCRn20 | 0.954 | 0.9 | 0.868 | 0.741 | 0.64 | 0.987 | 0.986 | 0.983 | 0.922 | 0.882 |
PC20SIRa | 0.946 | 0.86 | 0.867 | 0.742 | 0.64 | 0.977 | 0.93 | 0.977 | 0.921 | 0.91 |
PC20SIRb | 0.94 | 0.88 | 0.872 | 0.741 | 0.64 | 0.996 | 0.95 | 0.996 | 0.917 | 0.901 |
PC20SIRc | 0.914 | 0.89 | 0.819 | 0.693 | 0.62 | 0.975 | 0.95 | 0.975 | 0.912 | 0.907 |
PC20SIRd | 1.008 | 1.02 | 1.002 | 0.973 | 1 | 1.035 | 1.036 | 1.035 | 1.002 | 0.976 |
INDPRO | 2003:01-2014:12, h=1 | 2003:01-2014:12, h=3 | 2003:01-2014:12, h=6 | 2003:01-2014:12, h=12 | 2003:01-2014:12, h=24 | 2010:01-2014:12, h=1 | 2010:01-2014:12, h=3 | 2010:01-2014:12, h=6 | 2010:01-2014:12, h=12 | 2010:01-2014:12, h=24 |
---|---|---|---|---|---|---|---|---|---|---|
PCRn4 | 0.972 | 1.015 | 1.024 | 0.942 | 1.032 | 1.058 | 1.449 | 1.675 | 2.092 | 11.594 |
PCRn5 | 0.987 | 1.009 | 1.022 | 0.925 | 0.994 | 1.043 | 1.34 | 1.38 | 0.875 | 4.311 |
PCRn10 | 0.994 | 0.946 | 0.965 | 0.849 | 0.814 | 1.086 | 1.629 | 1.937 | 2.405 | 4.521 |
PCRn20 | 1.038 | 0.986 | 0.959 | 0.783 | 0.657 | 1.037 | 1.619 | 1.962 | 2.78 | 4.701 |
PC20SIRa | 1.008 | 0.912 | 0.965 | 0.77 | 0.685 | 0.986 | 1.423 | 1.594 | 1.863 | 6.566 |
PC20SIRb | 1.02 | 0.952 | 0.976 | 0.768 | 0.673 | 1.006 | 1.517 | 1.626 | 1.868 | 6.835 |
PC20SIRc | 1.019 | 0.961 | 0.954 | 0.774 | 0.679 | 1.062 | 1.597 | 1.653 | 1.985 | 7.596 |
PC20SIRd | 1.027 | 1.043 | 1.012 | 0.973 | 0.922 | 1.045 | 1.183 | 1.085 | 1.203 | 3.425 |
SIRa | 18.339 | 2.83 | 9.592 | 9.422 | 10.488 | 1.966 | 2.729 | 2.769 | 3.659 | 12.07 |
SIRb | 15.967 | 2.668 | 9.295 | 8.723 | 10.446 | 1.815 | 2.652 | 2.648 | 3.6 | 11.782 |
SIRc | 10.115 | 1.312 | 7.347 | 5.717 | 6.459 | 1.697 | 2.37 | 2.672 | 4.013 | 11.232 |
SIRd | 0.93 | 1.008 | 1.033 | 1.044 | 1.043 | 0.956 | 1.125 | 1.149 | 1.29 | 1.378 |
Evolution of the MSFE - Inflation appears definitely harder to forecast than industrial production, consistent with findings in the forecasting literature. Are there specific periods in which the forecast performance of our estimators deteriorates? Figure 6 portrays the evolution of the MSFE in forecasting inflation 12 months ahead for selected estimators (each point represents the MSFE computed up to that point). There are definitely periods in which forecasting inflation is harder; however, these periods seem to vary by estimator, with SIR computed from 20 PCs being the first to react negatively to bad data entering the estimation sample as the window expands. Using the height of the spikes as a metric, PC20SIRdr performs on par with, if not better than, the other estimators, except in the very last portion of the window.
In-Sample Fit - As is evident from Figure 7, there is no obvious relationship between in-sample fit and the out-of-sample forecasting performance discussed above. For instance, OLS, thanks to the very large number of regressors, some of which are subcomponents of the target variable itself, produces a very high R-squared and poor forecasting results.
In this paper we (1) introduced sufficient dimension reduction methodology to econometric forecasting and focused on linear moment-based SDR, (2) derived properties of the SIR SDR estimator for covariance stationary series, (3) cast OLS, PCR, RIDGE and PLS in a common framework, and (4) studied the forecasting performance of these four methods, as well as SIR, using the FRED-MD dataset put together by McCracken and Ng (2015) 49. The empirical results indicate that PCR, PLS and SIR do not exhibit drastically different performance. The competitive edge of SIR is its parsimony: it attains practically the same forecasting accuracy using only one or two linear combinations of the predictors, whereas both PLS and PCR require many components, in many cases more than ten. OLS and RIDGE are not found to be competitive for these data and the time periods considered in our forecasting exercise.
There are several issues that impede the performance of SIR and that can be improved upon. A SIR dimension of two or higher indicates the presence of nonlinear relationships between the response and the SIR predictors. In such cases, plots of the response versus the SIR predictors would inform the construction of a more appropriate forward model. As the forecasting experiment was carried out in an automatic fashion, we could not visually assess the nature of the nonlinearities, and nonlinear functions of the SIR predictors were not included in the forecasting model. Gains in forecasting accuracy can potentially be realized by the inclusion of nonlinear SIR terms in the forward model.
SDR in general, and SIR in particular, require a large sample size to yield reliable results. The sample size of the FRED-MD dataset is not large enough for SIR to be used optimally. For some periods in the forecasting exercise, the SIR predictors were extremely unstable, as the sample covariance matrix of the raw predictors was nearly singular. We develop a Krylov-subspace version of SIR to address this issue in a separate paper. Both issues amount to limitations that need to be addressed for SIR, or SDR in general, to be properly applied to such data, and they deserve future empirical and theoretical research.
The following set of tables summarizes the variables in FRED-MD. Each table collects variables by statistical data release, an organization slightly rearranged relative to the tables in McCracken and Ng (2015) 49. Grouping by statistical data release is more useful both because the timing of the releases differs (although timing is not exploited in the present study) and because some variables are aggregates of more detailed information within a single release and therefore share the information, and possible biases, of that release. Our reordering allows a better bird's-eye view of the sources of information. We briefly describe each statistical data release below. In addition, each table reports, under the column T, the transformation used.13 The G column denotes the grouping chosen by McCracken and Ng (2015) 49, in turn not too dissimilar from the groupings used in other DFM studies. The FRED-MD column reports the variable mnemonics in the original FRED-MD dataset, and the Description column identifies the series. The remaining columns report the Global Insight code and description; the GSI description allows mapping the individual series to the datasets used in older papers. In some papers not all variables are used to compute principal components, a strategy followed by Stock and Watson (2005) 56, who add an additional column containing a dummy denoting whether the variable was used in the computation of the PCs. We include all variables when computing PCs, hence we do not need such an additional column. Asterisked series are adjusted by McCracken and Ng (2015) (see 49 for details).
Variables Directly Measuring Output - The most reliable and widely used monthly measures of output come from the IP system within the G.17 statistical release produced at the Federal Reserve Board, covering industrial production. The IP system contains information on about 200 sectors at the 4-digit NAICS level and covers the manufacturing, mining and utilities sectors. The last variable is capacity utilization in manufacturing, one of the few observable measures of slack, also from the G.17, computed as $$ \frac{\text{manufacturing IP}}{\text{manufacturing capacity}}$$ ; manufacturing capacity is estimated by staff at the FRB using the quarterly survey of capacity (in turn run by the BLS) and included in the G.17 publication. The G.17 publication contains information on about 94 subaggregates at the 4-digit NAICS level, whereas the IP system used to produce it is based on 200+ atoms. Apart from the top aggregate INDPRO, the next seven rows represent the splitting and regrouping of the 200+ atoms into so-called "market" groups. The last market group is split into two subaggregates, durable and non-durable materials. Manufacturing IP is a subaggregate of IP at the same level as Utilities. Fuels IP is an odd series to include in this dataset given its idiosyncratic pattern and its higher level of detail. Notice that 25% of final industrial production data (that is, after all revisions have taken place) are estimated from employment data (in the second table of this Section), implying that this set of variables and the set in the second table might be strongly linked or have a factor in common.
id | T | G | FRED-MD | Description | GSI Description |
---|---|---|---|---|---|
6 | 5 | 1 | INDPRO | IP Index | IP: total |
7 | 5 | 1 | IPFPNSS | IP: Final Products and Nonindustrial Supplies | IP: products |
8 | 5 | 1 | IPFINAL | IP: Final Products (Market Group) | IP: final prod |
9 | 5 | 1 | IPCONGD | IP: Consumer Goods | IP: cons gds |
10 | 5 | 1 | IPDCONGD | IP: Durable Consumer Goods | IP: cons dble |
11 | 5 | 1 | IPNCONGD | IP: Nondurable Consumer Goods | IP: cons nondble |
12 | 5 | 1 | IPBUSEQ | IP: Business Equipment | IP: bus eqpt |
13 | 5 | 1 | IPMAT | IP: Materials | IP: matls |
14 | 5 | 1 | IPDMAT | IP: Durable Materials | IP: dble matls |
15 | 5 | 1 | IPNMAT | IP: Nondurable Materials | IP: nondble matls |
16 | 5 | 1 | IPMANSICS | IP: Manufacturing (SIC) | IP: mfg |
17 | 5 | 1 | IPB51222s | IP: Residential Utilities | IP: res util |
18 | 5 | 1 | IPFUELS | IP: Fuels | IP: fuels |
20 | 2 | 1 | CUMFNS | Capacity Utilization: Manufacturing | Cap util |
Variables Measuring Income and Consumption - Personal Income, personal consumption expenditures and PCE deflators are released monthly by the BEA. Retail sales are released by the Census Bureau.
id | T | G | FRED-MD | Description | GSI Description |
---|---|---|---|---|---|
1 | 5 | 1 | RPI | Real Personal Income | PI |
2 | 5 | 1 | W875RX1 | Real personal income ex transfer receipts | PI less transfers |
3 | 5 | 4 | DPCERA3M086SBEA | Real personal consumption expenditures | Real Consumption |
4* | 5 | 4 | CMRMTSPLx | Real Manu. and Trade Industries Sales | MT sales |
5* | 5 | 4 | RETAILx | Retail and Food Services Sales | Retail sales |
123 | 6 | 7 | PCEPI | Personal Cons. Expend.: Chain Price Index | PCE defl |
124 | 6 | 7 | DDURRG3M086SBEA | Personal Cons. Expend: Durable goods | PCE defl: dlbes |
125 | 6 | 7 | DNDGRG3M086SBEA | Personal Cons. Expend: Nondurable goods | PCE defl: nondble |
126 | 6 | 7 | DSERRG3M086SBEA | Personal Cons. Expend: Services | PCE defl: service |
Variables Measuring Employment and Unemployment - The second table contains variables measuring employment, produced by the Bureau of Labor Statistics (BLS). The first two rows refer to data from the Current Population Survey (CPS). The rest of the table refers to variables from the Current Employment Statistics (CES), a monthly program that surveys approximately 143,000 businesses and government agencies, representing approximately 588,000 individual worksites. The last three variables contain miscellaneous information on the labor market. CLAIMS (initial unemployment claims) is originally released at weekly frequency and comes from the states' unemployment insurance systems. HWI, the Help-Wanted Index for the United States, is assembled by the Conference Board and has recently been corrected by Barnichon (2010) [8]. Obvious candidates missing from the dataset are labor market indicators that are part of the Fed's labor market dashboard, such as data from the JOLTS survey.
id | T | G | FRED-MD | Description | GSI Description |
---|---|---|---|---|---|
23 | 5 | 2 | CLF16OV | Civilian Labor Force | Emp CPS total |
24 | 5 | 2 | CE16OV | Civilian Employment | Emp CPS nonag |
25 | 2 | 2 | UNRATE | Civilian Unemployment Rate | U: all |
26 | 2 | 2 | UEMPMEAN | Average Duration of Unemployment (Weeks) | U: mean duration |
27 | 5 | 2 | UEMPLT5 | Civilians Unemployed - Less Than 5 Weeks | U < 5 wks |
28 | 5 | 2 | UEMP5TO14 | Civilians Unemployed for 5-14 Weeks | U 5-14 wks |
29 | 5 | 2 | UEMP15OV | Civilians Unemployed - 15 Weeks Over | U 15+ wks |
30 | 5 | 2 | UEMP15T26 | Civilians Unemployed for 15-26 Weeks | U 15-26 wks |
31 | 5 | 2 | UEMP27OV | Civilians Unemployed for 27 Weeks and Over | U 27+ wks |
33 | 5 | 2 | PAYEMS | All Employees: Total nonfarm | Emp: total |
34 | 5 | 2 | USGOOD | All Employees: Goods-Producing Industries | Emp: gds prod |
35 | 5 | 2 | CES1021000001 | All Employees: Mining and Logging: Mining | Emp: mining |
36 | 5 | 2 | USCONS | All Employees: Construction | Emp: const |
37 | 5 | 2 | MANEMP | All Employees: Manufacturing | Emp: mfg |
38 | 5 | 2 | DMANEMP | All Employees: Durable goods | Emp: dble gds |
39 | 5 | 2 | NDMANEMP | All Employees: Nondurable goods | Emp: nondbles |
40 | 5 | 2 | SRVPRD | All Employees: Service-Providing Industries | Emp: services |
41 | 5 | 2 | USTPU | All Employees: Trade, Transportation Utilities | Emp: TTU |
42 | 5 | 2 | USWTRADE | All Employees: Wholesale Trade | Emp: wholesale |
43 | 5 | 2 | USTRADE | All Employees: Retail Trade | Emp: retail |
44 | 5 | 2 | USFIRE | All Employees: Financial Activities | Emp: FIRE |
45 | 5 | 2 | USGOVT | All Employees: Government | Emp: Govt |
46 | 1 | 2 | CES0600000007 | Avg Weekly Hours : Goods-Producing | Avg hrs |
47 | 2 | 2 | AWOTMAN | Avg Weekly Overtime Hours : Manufacturing | Overtime: mfg |
48 | 1 | 2 | AWHMAN | Avg Weekly Hours : Manufacturing | Avg hrs: mfg |
49 | 1 | 2 | NAPMEI | ISM Manufacturing: Employment Index | NAPM empl |
127 | 6 | 2 | CES0600000008 | Avg Hourly Earnings : Goods-Producing | AHE: goods |
128 | 6 | 2 | CES2000000008 | Avg Hourly Earnings : Construction | AHE: const |
129 | 6 | 2 | CES3000000008 | Avg Hourly Earnings : Manufacturing | AHE: mfg |
32* | 5 | 2 | CLAIMSx | Initial Claims | UI claims |
21* | 2 | 2 | HWI | Help-Wanted Index for United States | Help wanted indx |
22* | 2 | 2 | HWIURATIO | Ratio of Help Wanted/No. Unemployed | Help wanted/unemp |
Variables Measuring Construction Activity - The third table collects the variables that have leading properties in signaling changes in activity in the construction sector. The permits variables come from the Census Bureau's monthly building permits survey of 9,000 selected permit-issuing places, adjusted once a year with an annual census of an additional 11,000 permit places that are not in the monthly sample. Housing starts come from the Survey of Construction, a multi-stage stratified random sample that selects approximately 900 building permit-issuing offices and a sample of more than 70 land areas not covered by building permits. Data from the national association of home builders, such as existing home sales, were not included in the dataset.
id | T | G | FRED-MD | Description | GSI Descr |
---|---|---|---|---|---|
50 | 4 | 3 | HOUST | Housing Starts: Total New Privately Owned | Starts: nonfarm |
51 | 4 | 3 | HOUSTNE | Housing Starts, Northeast | Starts: NE |
52 | 4 | 3 | HOUSTMW | Housing Starts, Midwest | Starts: MW |
53 | 4 | 3 | HOUSTS | Housing Starts, South | Starts: South |
54 | 4 | 3 | HOUSTW | Housing Starts, West | Starts: West |
55 | 4 | 3 | PERMIT | New Private Housing Permits (SAAR) | BP: total |
56 | 4 | 3 | PERMITNE | New Private Housing Permits, Northeast (SAAR) | BP: NE |
57 | 4 | 3 | PERMITMW | New Private Housing Permits, Midwest (SAAR) | BP: MW |
58 | 4 | 3 | PERMITS | New Private Housing Permits, South (SAAR) | BP: South |
59 | 4 | 3 | PERMITW | New Private Housing Permits, West (SAAR) | BP: West |
Variables Measuring Orders and Inventories - These variables are from the M3 survey run by the U.S. Census Bureau. The M3 is based upon data reported by manufacturing establishments with $500 million or more in annual shipments. Units may be divisions of diversified large companies, large homogeneous companies, or single-unit manufacturers in 89 industry categories. The M3 provides statistics on manufacturers' value of shipments, new orders (net of cancellations), end-of-month order backlog (unfilled orders), end-of-month total inventory, materials and supplies, work-in-process, and finished goods inventories (at current cost or market value). Data are collected and tabulated predominantly at the 6-digit NAICS (North American Industry Classification System) level. The most watched series from this survey is ANDENO="New Orders for Nondefense Capital Goods", since it excludes from new orders certain highly volatile goods that are not very informative about the business cycle. Unfortunately this series has a short history and is excluded from our estimation.
id | T | G | FRED-MD | Description | GSI Description |
---|---|---|---|---|---|
3 | 5 | 4 | DPCERA3M086SBEA | Real personal consumption expenditures | Real Consumption |
4* | 5 | 4 | CMRMTSPLx | Real Manu. and Trade Industries Sales | MT sales |
5* | 5 | 4 | RETAILx | Retail and Food Services Sales | Retail sales |
64 | 5 | 4 | ACOGNO | New Orders for Consumer Goods | Orders: cons gds |
65* | 5 | 4 | AMDMNOx | New Orders for Durable Goods | Orders: dble gds |
66* | 5 | 4 | ANDENOx | New Orders for Nondefense Capital Goods | Orders: cap gds |
67* | 5 | 4 | AMDMUOx | Unfilled Orders for Durable Goods | Unf orders: dble |
68* | 5 | 4 | BUSINVx | Total Business Inventories | MT invent |
69* | 2 | 4 | ISRATIOx | Total Business: Inventories to Sales Ratio | MT invent/sales |
Variables Measuring the Money Stock and Reserves - These data come mainly from the FRB H.6 statistical release.
id | T | G | FRED-MD | Description | GSI Description |
---|---|---|---|---|---|
70 | 6 | 5 | M1SL | M1 Money Stock | M1 |
71 | 6 | 5 | M2SL | M2 Money Stock | M2 |
72 | 5 | 5 | M2REAL | Real M2 Money Stock | M2 (real) |
73 | 6 | 5 | AMBSL | St. Louis Adjusted Monetary Base | MB |
74 | 6 | 5 | TOTRESNS | Total Reserves of Depository Institutions | Reserves tot |
75 | 7 | 5 | NONBORRES | Reserves Of Depository Institutions, Nonborrowed | Reserves nonbor |
Variables Measuring Credit - These variables are mainly drawn from various FRB statistical releases such as G.19 and G.20.
id | T | G | FRED-MD | Description | GSI Descr |
---|---|---|---|---|---|
76 | 6 | 5 | BUSLOANS | Commercial and Industrial Loans, All Commercial Banks | CI loan plus |
77 | 6 | 5 | REALLN | Real Estate Loans at All Commercial Banks | DCI loans |
78 | 6 | 5 | NONREVSL | Total Nonrevolving Credit Owned and Securitized Outstanding | Cons credit |
79* | 2 | 5 | CONSPI | Nonrevolving consumer credit to Personal Income | Inst credit/PI |
131 | 6 | 5 | MZMSL | MZM Money Stock | N.A. |
132 | 6 | 5 | DTCOLNVHFNM | Consumer Motor Vehicle Loans Outstanding | N.A. |
133 | 6 | 5 | DTCTHFNM | Total Consumer Loans and Leases Outstanding | N.A. |
134 | 6 | 5 | INVEST | Securities in Bank Credit at All Commercial Banks | N.A. |
Variables Measuring Interest Rates - The following table contains variables measuring interest rates, yields and spreads. Most variables are from statistical releases by the FRB such as the H.15.
id | T | G | FRED-MD | Description | GSI Descr |
---|---|---|---|---|---|
84 | 2 | 6 | FEDFUNDS | Effective Federal Funds Rate | Fed Funds |
85* | 2 | 6 | CP3Mx | 3-Month AA Financial Commercial Paper Rate | Comm paper |
86 | 2 | 6 | TB3MS | 3-Month Treasury Bill: | 3 mo T-bill |
87 | 2 | 6 | TB6MS | 6-Month Treasury Bill: | 6 mo T-bill |
88 | 2 | 6 | GS1 | 1-Year Treasury Rate | 1 yr T-bond |
89 | 2 | 6 | GS5 | 5-Year Treasury Rate | 5 yr T-bond |
90 | 2 | 6 | GS10 | 10-Year Treasury Rate | 10 yr T-bond |
91 | 2 | 6 | AAA | Moody's Seasoned Aaa Corporate Bond Yield | Aaa bond |
92 | 2 | 6 | BAA | Moody's Seasoned Baa Corporate Bond Yield | Baa bond |
93* | 1 | 6 | COMPAPFFx | 3-Month Commercial Paper Minus FEDFUNDS | CP-FF spread |
94 | 1 | 6 | TB3SMFFM | 3-Month Treasury C Minus FEDFUNDS | 3 mo-FF spread |
95 | 1 | 6 | TB6SMFFM | 6-Month Treasury C Minus FEDFUNDS | 6 mo-FF spread |
96 | 1 | 6 | T1YFFM | 1-Year Treasury C Minus FEDFUNDS | 1 yr-FF spread |
97 | 1 | 6 | T5YFFM | 5-Year Treasury C Minus FEDFUNDS | 5 yr-FF spread |
98 | 1 | 6 | T10YFFM | 10-Year Treasury C Minus FEDFUNDS | 10 yr-FF spread |
99 | 1 | 6 | AAAFFM | Moody's Aaa Corporate Bond Minus FEDFUNDS | Aaa-FF spread |
100 | 1 | 6 | BAAFFM | Moody's Baa Corporate Bond Minus FEDFUNDS | Baa-FF spread |
101 | 5 | 6 | TWEXMMTH | Trade Weighted U.S. Dollar Index: Major Currencies | Ex rate: avg |
102* | 5 | 6 | EXSZUSx | Switzerland / U.S. Foreign Exchange Rate | Ex rate: Switz |
103* | 5 | 6 | EXJPUSx | Japan / U.S. Foreign Exchange Rate | Ex rate: Japan |
104* | 5 | 6 | EXUSUKx | U.S. / U.K. Foreign Exchange Rate | Ex rate: UK |
105* | 5 | 6 | EXCAUSx | Canada / U.S. Foreign Exchange Rate | EX rate: Canada |
Variables Measuring Prices - The variables are from the BLS CPI and PPI statistical releases. For the PPI, more than 100,000 price quotations per month are organized into three sets of PPIs: (1) Final Demand-Intermediate Demand (FD-ID) indexes, (2) commodity indexes, and (3) indexes for the net output of industries and their products. The CPIs are based on prices of food, clothing, shelter, fuels, transportation fares, charges for doctors' and dentists' services, drugs, and other goods and services that people buy for day-to-day living. Prices are collected each month in 87 urban areas across the country from about 6,000 housing units and approximately 24,000 retail establishments: department stores, supermarkets, hospitals, filling stations, and other types of stores and service establishments.
id | T | G | FRED-MD | Description | GSI Descr |
---|---|---|---|---|---|
106 | 6 | 7 | PPIFGS | PPI: Finished Goods | PPI: fin gds |
107 | 6 | 7 | PPIFCG | PPI: Finished Consumer Goods | PPI: cons gds |
108 | 6 | 7 | PPIITM | PPI: Intermediate Materials | PPI: int matls |
109 | 6 | 7 | PPICRM | PPI: Crude Materials | PPI: crude matls |
110* | 6 | 7 | OILPRICEx | Crude Oil, spliced WTI and Cushing | Spot market price |
111 | 6 | 7 | PPICMM | PPI: Metals and metal products: | PPI: nonferrous |
113 | 6 | 7 | CPIAUCSL | CPI : All Items | CPI-U: all |
114 | 6 | 7 | CPIAPPSL | CPI : Apparel | CPI-U: apparel |
115 | 6 | 7 | CPITRNSL | CPI : Transportation | CPI-U: transp |
116 | 6 | 7 | CPIMEDSL | CPI : Medical Care | CPI-U: medical |
117 | 6 | 7 | CUSR0000SAC | CPI : Commodities | CPI-U: comm. |
118 | 6 | 7 | CUUR0000SAD | CPI : Durables | CPI-U: dbles |
119 | 6 | 7 | CUSR0000SAS | CPI : Services | CPI-U: services |
120 | 6 | 7 | CPIULFSL | CPI : All Items Less Food | CPI-U: ex food |
121 | 6 | 7 | CUUR0000SA0L2 | CPI : All items less shelter | CPI-U: ex shelter |
122 | 6 | 7 | CUSR0000SA0L5 | CPI : All items less medical care | CPI-U: ex med |
Variables Measuring the Stock Market - These data are compiled by Standard & Poor's.
id | T | G | FRED-MD | Description | GSI Descr |
---|---|---|---|---|---|
80* | 5 | 8 | SP 500 | SP's Common Stock Price Index: Composite | SP 500 |
81* | 5 | 8 | SP: indust | SP's Common Stock Price Index: Industrials | SP: indust |
82* | 2 | 8 | SP div yield | SP's Composite Common Stock: Dividend Yield | SP div yield |
83* | 5 | 8 | SP PE ratio | SP's Composite Common Stock: Price-Earnings Ratio | SP PE ratio |
Diffusion Indexes from Manufacturing and Consumer Surveys - The last table mostly collects the diffusion indexes from the Institute for Supply Management (ISM).14 These variables are released on the first day of the month following the reference month, hence they are quite timely and are used by several institutions in the production of their high-frequency forecasts. However, since the literature on large panels of macro variables carries out monthly pseudo-forecast experiments and we follow that tradition, we do not exploit the full potential of these variables. The variable likely to have the most forecasting power is "new orders", a natural measure of future activity. Notice that these variables are diffusion indexes, that is, they essentially capture the fraction of respondents reporting that activity is up.15 Given that they are diffusion indexes (fractions), they are stable and are left in levels in the estimation. The dataset does not include data from other manufacturing surveys used in the construction of activity indexes and in nowcasting, such as the Philly Fed BOS survey or the Richmond Fed survey, nor information from the services surveys; most likely the choice of the authors was dictated by the span of available data. The last variable in the table is the closely watched Consumer Sentiment Index from the University of Michigan, used in the forecasting of consumption expenditures. Other informative sub-indexes of the Michigan survey were not included in the dataset.
id | T | G | FRED-MD | Description | GSI Descr |
---|---|---|---|---|---|
19 | 1 | 1 | NAPMPI | ISM Manufacturing: Production Index | NAPM prodn |
29 | 1 | 2 | NAPMEI | ISM Manufacturing: Employment Index | NAPM empl |
60 | 1 | 4 | NAPM | ISM : PMI Composite Index | PMI |
61 | 1 | 4 | NAPMNOI | ISM : New Orders Index | NAPM new ordrs |
62 | 1 | 4 | NAPMSDI | ISM : Supplier Deliveries Index | NAPM vendor del |
63 | 1 | 4 | NAPMII | ISM : Inventories Index | NAPM Invent |
112 | 1 | 7 | NAPMPRI | ISM Manufacturing: Prices Index | NAPM com price |
130* | 2 | 4 | UMCSENTx | Consumer Sentiment Index | Consumer expect |
In reference to the forecasting model (2.1), the response is $$ y_{t+h}$$ , $$ t=1,\ldots ,T,\ldots $$ , and the predictors consist of a group of p exogenous variables $$ \mathbf{x}_{t}=(x_{t1},\ldots ,x_{tp})^{\prime }$$ and the current response value $$ y_{t}$$ along with L of its lags, denoted by $$ \mathbf{W}_{t}=(y_{t},y_{t-1},\ldots ,y_{t-L})^{\prime }$$ .
$$\displaystyle \widehat{\mathbf{M}}=\sum_{j=1}^{J}\frac{n_{j}}{T}(\bar{\tilde{\mathbf{X}}}_{j}-\bar{\tilde{\mathbf{X}}})(\bar{\tilde{\mathbf{X}}}_{j}-\bar{\tilde{\mathbf{X}}})^{\prime }$$ |
A sequence of random variables $$ x_{jt}$$ is covariance stationary or weakly stationary if and only if
$$\displaystyle \exists \mu _{j}\in {\mathbb{R}\colon }\mathrm{E}(x_{jt})=\mu _{j},\forall t>0$$ |
$$\displaystyle \forall h\geq 0,\ \exists \gamma _{j}(h)\in {\mathbb{R}}\colon \mathrm{cov}(x_{jt},x_{j,t-h})=\mathrm{E}[(x_{jt}-\mu _{j})(x_{j,t-h}-\mu _{j})]=\gamma _{j}(h),\quad \forall t>h$$ |
A multivariate time series $$ \mathbf{x}_{t}=(x_{1t},x_{2t},...,x_{pt})^{\prime }$$ is covariance stationary and ergodic if all of its component time series are stationary and ergodic. The mean of $$ \mathbf{x}_{t}$$ is defined as the $$ (p\times 1)$$ vector $$ \mathrm{E}(\mathbf{x}_{t})=\boldsymbol{\mu }=\left( \mathrm{E}(x_{1t}),\mathrm{E}(x_{2t}),...,\mathrm{E}(x_{pt})\right) ^{\prime }=\left( \mu _{1},\mu _{2},...,\mu _{p}\right) ^{\prime }$$ and the variance/covariance matrix
$$\displaystyle \boldsymbol{\Sigma }(0)=\mathrm{var}(\mathbf{x}_{t})=\mathrm{E}\left( \left( \mathbf{x}_{t}-\boldsymbol{\mu }\right) \left( \mathbf{x}_{t}-\boldsymbol{\mu }\right) ^{\prime }\right) =\mathrm{E}\left( \mathbf{x}_{t}\mathbf{x}_{t}^{\prime }\right) -\boldsymbol{\mu }\boldsymbol{\mu }^{\prime }=\begin{bmatrix} \mathrm{var}(x_{1t}) & \mathrm{cov}(x_{1t},x_{2t}) & \cdots & \mathrm{cov}(x_{1t},x_{pt}) \\ \mathrm{cov}(x_{2t},x_{1t}) & \mathrm{var}(x_{2t}) & \cdots & \mathrm{cov}(x_{2t},x_{pt}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(x_{pt},x_{1t}) & \mathrm{cov}(x_{pt},x_{2t}) & \cdots & \mathrm{var}(x_{pt}) \end{bmatrix}$$
$$\displaystyle \boldsymbol{\Sigma }(h)=\mathrm{cov}(\mathbf{x}_{t+h},\mathbf{x}_{t})=\mathrm{E}\left( \left( \mathbf{x}_{t+h}-\boldsymbol{\mu }\right) \left( \mathbf{x}_{t}-\boldsymbol{\mu }\right) ^{\prime }\right) =\mathrm{E}\left( \mathbf{x}_{t+h}\mathbf{x}_{t}^{\prime }\right) -\boldsymbol{\mu }\boldsymbol{\mu }^{\prime }$$
If $$ x_{jt}$$ is a stationary time series with mean $$ \mu _{j}$$ and autocovariance function $$ \gamma _{j}(h)$$ , then $$ \bar{X}_{j}=\sum_{t=1}^{T}X_{jt}/T $$ converges in mean square to $$ \mu _{j}$$ if $$ \gamma _{j}(T)\rightarrow 0$$ as $$ T\rightarrow \infty $$ (Prop. 2.4.1, p. 58 in Brockwell and Davis (2002); Prop. 10.5, p. 279 in Hamilton (1994)). The consistency of the estimator $$ \mathbf{\bar{X}}$$ is established by applying the proposition to each of the component time series $$ x_{jt}$$ , $$ j=1,\ldots ,p $$ (Prop. 7.3.1, p. 234, Brockwell and Davis (2002)). A sufficient condition for ergodicity for second moments, and hence for consistency of the sample autocovariances, is $$ \sum_{h=-\infty }^{\infty }\vert\gamma _{j}(h)\vert<\infty $$ (Prop. 7.3.1, p. 234, Brockwell and Davis (2002)).
The parameters $$ \boldsymbol{\mu }$$ , $$ \boldsymbol{\Sigma }(0)$$ , and $$ \boldsymbol{\Sigma }(h)$$ are estimated from $$ \mathbf{X}_{1},\mathbf{X}_{2},\ldots ,\mathbf{X}_{T}$$ using the sample moments:
$$\displaystyle \mathbf{\bar{X}}=\frac{1}{T}\sum_{t=1}^{T}\mathbf{X}_{t}$$
$$\displaystyle \boldsymbol{\hat{\Sigma}}(0)=\frac{1}{T}\sum_{t=1}^{T}\left( \mathbf{X}_{t}-\mathbf{\bar{X}}\right) \left( \mathbf{X}_{t}-\mathbf{\bar{X}}\right) ^{\prime }$$
$$\displaystyle \boldsymbol{\hat{\Sigma}}(h)=\begin{cases} \frac{1}{T}\sum_{t=1}^{T-h}\left( \mathbf{X}_{t+h}-\mathbf{\bar{X}}\right) \left( \mathbf{X}_{t}-\mathbf{\bar{X}}\right) ^{\prime } & \text{if }0\leq h\leq T-1 \\ \boldsymbol{\hat{\Sigma}}(-h)^{\prime } & \text{if }-T+1\leq h<0\end{cases}$$
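A minimal sketch (illustrative, with assumed array names) of these sample moments for a $$ T\times p$$ data matrix, following the 1/T convention above:

```python
# Sketch: sample mean, Sigma_hat(0), and Sigma_hat(h) for a T x p data matrix X.
import numpy as np

def sample_moments(X, h):
    """Return (Xbar, Sigma_hat(0), Sigma_hat(h)) with the 1/T normalization."""
    T, p = X.shape
    Xbar = X.mean(axis=0)
    Xc = X - Xbar                                # centered observations
    Sigma0 = Xc.T @ Xc / T                       # Sigma_hat(0)
    if h >= 0:
        Sigma_h = Xc[h:].T @ Xc[:T - h] / T      # sum_t (X_{t+h}-Xbar)(X_t-Xbar)' / T
    else:
        Sigma_h = sample_moments(X, -h)[2].T     # Sigma_hat(h) = Sigma_hat(-h)'
    return Xbar, Sigma0, Sigma_h
```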
The ergodic theorem states that if $$ \mathbf{x}_{t}$$ is a strictly stationary and ergodic time series, then as $$ T\rightarrow \infty $$
$$\displaystyle \mathbf{\bar{X}}\overset{p}{\rightarrow }\boldsymbol{\mu }$$ (8.1)
$$\displaystyle \boldsymbol{\hat{\Sigma}}(0)\overset{p}{\rightarrow }\boldsymbol{\Sigma }(0)$$ (8.2)
$$\displaystyle \boldsymbol{\hat{\Sigma}}(h)\overset{p}{\rightarrow }\boldsymbol{\Sigma }(h)$$ (8.3)
Under more restrictive assumptions on the process $$ {\mathbf{x}}_{t}$$ it can also be shown that $$ \bar{{\mathbf{X}}}_{T}$$ is approximately normally distributed for large T. Determination of the covariance matrix of this distribution is quite complicated. For example, the following is a CLT for a covariance stationary m-dependent vector process (Villegas (1976), Thm. 5.1). A stochastic vector process $$ {\mathbf{x}}_{1},{\mathbf{x}}_{2},\ldots $$ is m-dependent if the two sets of random vectors $$ {\mathbf{x}}_{1},\ldots ,{\mathbf{x}}_{r}$$ and $$ {\mathbf{x}}_{s},\ldots ,{\mathbf{x}}_{n}$$ are independent whenever $$ s-r>m$$ .
$$\displaystyle \sqrt{T}\left( \mathbf{\bar{X}}_{T}-\boldsymbol{\mu }\right) \overset{d}{\rightarrow }N\left( \mathbf{0},\mathbf{V}\right) ,\qquad \mathbf{V}=\sum_{h=-m}^{m}\boldsymbol{\Sigma }(h)$$
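As an illustration (not from the text), the matrix $$ \mathbf{V}$$ can be estimated by the truncated sum of the sample autocovariances defined above:

```python
# Sketch: estimate V = sum_{h=-m}^{m} Sigma(h) for an m-dependent process
# using the sample autocovariances Sigma_hat(h) with the 1/T convention.
import numpy as np

def long_run_variance(X, m):
    T, p = X.shape
    Xc = X - X.mean(axis=0)
    V = Xc.T @ Xc / T                        # h = 0 term
    for h in range(1, m + 1):
        S_h = Xc[h:].T @ Xc[:T - h] / T      # Sigma_hat(h)
        V += S_h + S_h.T                     # add Sigma_hat(h) + Sigma_hat(-h)
    return V
```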
$$\displaystyle \mathrm{MSE}\left( \widehat{\theta },\theta \right) \leq \mathrm{MSE}\left( \widehat{\theta }_{OLS},\theta \right)$$