More on "Naturally Orthogonal"

I realize that not all CA readers are interested in multivariate methods and that dendroclimatologists want to “forget the math”, but I find it interesting to try to relate dendro and paleoclimate recipes to known statistical methodologies that you can read about in texts.

I commented the other day on the form of Principal Components Regression used in Wilson et al 2007 and prior articles by Luckman’s students: their reconstructions are a form of inverse OLS regression but on 3 or so PCs rather than 20 proxies. I thought that it might be worthwhile to write up some of my thoughts on multivariate methods. The approach here is based on an interesting article and tutorial by Magnus Borga entitled “A Unified Approach to PCA, PLS (Partial Least Squares), MLR (Multiple Linear Regression) and CCA (Canonical Correspondence Analysis)” . I use OLS and MLR interchangeably below.

In my opinion, Borga provides a very elegant way of keeping track of these different methods, such that some insight is actually shed on their relationships. I’ve added my own twist to his analysis by fitting Ridge Regression, Canonical Ridge Regression and Principal Components Regression into this “unified approach” – I don’t think my analysis is particularly profound, but I haven’t seen it anywhere and it seems pretty to me.

Let’s start with the relationship between OLS and PLS regression (accept for now that the regression step in MBH98 is a form of Partial Least Squares regression although the authors did not recognize this in their linear algebra, nor for that did Bürger and Cubasch in their recent consideration of the matter). PLS regression is implicitly used in Hegerl et al 2006 and the Juckes et al submission, again without being aware of it. The difference between PLS regression and OLS regression can be viewed as a rotation of the coefficients, as follows:

(1) $\hat{\beta}_{OLS} = C_{xx}^{-1} \hat{\beta}_{PLS}$

The idea that the $C_{xx}^{-1}$ matrix is a rotation of the coefficients is somewhat of a linear algebra approach (thinking here of the coefficients $\beta$ as simply being abstract in the dual space. Stone and Brooks (1990), a famous article, showed that there is a “continuum” connecting PLS and OLS, which can be illustrated in rather a nice way using Borga’s method which I will now proceed to describe.

Borga groups all the above multivariate methods as solutions to one eigenvalue equation shown below using different combinations of covariance and identity matrices:

(2) $B^{-1} A w = \lambda w$

He summarizes the different combinations of covariance and identity matrices in the following table. The different multivariate methods are encompassed in different $B$ matrices: with Canonical Correspondence Analysis (CCA) using both the $C_{yy}$ and $C_{xx}$ matrices, OLS regression using the $C_{xx}$ matrix and the identity and PLS regression using both identity matrices. Mann interestingly said that he considered CCA, but did not use it.

Borga didn’t discuss Ridge Regression, but I think that it fits rather elegantly into this formulation. Hoerl and Kennard 1976 define the ridge regression (RR) coefficient as follows:

(3) $\beta_{RR} = (C_{xx}+ kI)^{-1} C_{xy}$

In effect, the ridge regression coefficient k “blends” the PLS B-matrix $I$ and the OLS/MLR B-matrix $C_{xx}$. I prefer to write mixing coefficients by defining $\lambda = k/(1-k)$ and the precisely equivalent equation to show the “mixing” of the PLS rotation matrix and the OLS rotation matrix as follows:

(4) $\beta_{RR} = (1-\lambda)^{-1} ((1-\lambda) C_{xx}+ \lambda I)^{-1} C_{xy}$

Canonical ridge regression can be viewed in precisely the same way – with the canonical ridge regression coefficient acting as a mixing coefficient blending the $C_{xx}$ and $I$ matrices on the one hand and the $C_{yy}$ and $I$ matrices on the other hand.

I’ve represented these methods in the following diagram that I find helpful. The three corners of my diagram are simply a visual representation of lines 2,3 and 4 of Borga’s Table 1, with Ridge Regression being the path along the “X” axis of the mixing coefficient and Canonical Ridge Regression following the line from PLS to CCA.

What made me think about this again was the form of Principal Components Regression used by Rob Wilson and other Luckman students. We observed the other day that the coefficients using Principal Components Regression were given by (changing the nomenclature slightly):

(5) $\beta_{PCR}= \hat{V} \hat{S}^{-2} \hat{V} ^T C_{xy}$

If one expands the $C_{xx}^{-1}$ matrix in the above expressions in terms of the svd decomposition of X ($X= USV^T$, then we have

(6) $\beta_{OLS}= V S^{-2} V^T C_{xy}$

In effect, as you go from 0 to m retained PCs, you define a path from PLS to OLS in which each step is generated by $\hat{V}_{1}, \hat{V}_{2},$ …, the truncated eigenvector matrix with 1,2,..,m eigenvectors. In a sense, retaining more PCs is like moving the ridge regression coefficient from PLS to OLS. This makes sense when you think about it.

Now Hoerl and Kennard 1976 proposed a method for selecting a ridge regression coefficient. In the context of an inverse regression, you have to think long and hard about whether a procedure for regression of effect upon causes (tree ring ~ temperature + precipitation) where you want orthogonality can be transmogrified into an inverse regression of cause upon effect in the style of dendroclimatologists ( temperature ~ bristlecones+ Gasp” + …), where you actually want multicollinearity (i.e. a signal). At this point, I’m not prepared to say that any procedure for selecting a mixing coefficient is “right”. Indeed, my own preference, in the absence of any reason to prefer one series to another, is for averaging where the issue of mixing coefficients doesn’t arise. Note that some of the recent work of Mann and Rutherford involves ridge regression, although the methods themselves are not related to methodology as it is known off the island. (Plus as CA readers have observed, there are some seriously weird things in their RegEM implementation.)

The above analysis does not include simple averaging (or similar things like CVM). It looks to me like averaging is an even more trivial implementation in which the $A$ matrix is changed by replacing the $C_{xy}$ matrix in the PLS formulation by a matrix 1/n $1$ where $1$ is a matrix of 1s. Perhaps some meaning can be attributed to formulations with mixing coefficients between 1/n $1$ and $C_{xy}$. Just a thought.

Von Storch and Zorita observed that the low-frequency variation in regression-based reconstructions was attenuated. I don’t think that they explored the reasons for this effect very fully and some of my notes last year on VZ pseudoproxies are relevant. I discussed OLS reconstructions, not because they were used in present-day reconstructions (although it was used in Groveman and Landsberg 1979), but because it was a type of extreme case that sharpened the contrast between methodologies. In terms of preserving low-frequency variability, OLS had the worst performance in a tame network, as it introduced irrelevant variation in the regression coefficients. PLS (Mannian regression) although by no means the worst possible technique, were worse than CVM in preserving low-frequency variability. If the proxies are “naturally orthogonal”, PLS and OLS are the same.

Back to the Wilson-Luckman type reconstruction: with 3 or so PCs, they are moving part-way from a PLS to an OLS reconstruction, losing low-frequency variability as we’ve already observed, which is perceptible in comparing the Wilson-Luckman reconstruction to the Esper chronology using only RW.

Rob Wilson and JMS have purported to explain the “natural orthogonality” of RW and MXD series by suggesting that they measure different seasons. “Target” seasons run the risk of being a deus ex machina, but that’s a story for another day. In one of the articles on Rocky Mountain climate, I’ve seen a reference to a very high correlation (0.92 – I’ll try to find the reference) between July and August temperatures. If so, this would not explain the natural orthogonality of the RW and MXD series.

However, there’s an interesting twist on this in terms of the multivariate alternatives discussed here. If the covariances between the different months matter, then they need to articulate their methods in terms of a $Y$ matrix of monthly temperatures (perhaps with multiple sites) and an $X$ matrix of chronologies, both RW and MXD. From their point of view, wouldn’t it make more sense to do some sort of CCA than to do PC on the tree ring chronologies and inverse regression of average temperature on the chronology PCs?

Does it “matter”? Who knows? The fit will definitely be artificially increased; the 20th century trend will be increased; the low-frequency variance will be decreased. How much is impossible to assess without the data.

References:
Magnus Borga, Canonical Correlation: A Tutorial http://www.imt.liu.se/~magnus/cca/tutorial/tutorial.pdf

M. Borga, T. Landelius, H. Knutsson, 1997 A Unified Approach to PCA, PLS, MLR and CCA, Report LiTH-ISY-R-1992, ISYhttp://www.imt.liu.se/mi/Publications/Papers/LiTHISYR1992.pdf

1. Posted Apr 8, 2007 at 5:23 PM | Permalink

One case that would be interesting is the relationship between the PLS where n=1 and the simple average. The Wilson-Luckman results look like there is only one really dominant PC. Lots of ‘natural’ data look like that. If the PLS and n=1 reduces to a simple average then it would show simply that PC analysis is simply ‘overkill’.

2. Eduardo Zorita
Posted Apr 9, 2007 at 10:49 AM | Permalink

Steve,

I think CCA means in this context Canonical Correlation Analysis, and not Canonical Correspondence Analysis. This latter method is a different method mostly used in ecology, and it is only very loosely related to Canonical Correlation.

A comment on the natural orthogonality. There has been some debate on the difference between the Artic Oscillation and North Atlantic Oscillation, which is related to this question. It boils down to the situation in which one has 3 time series, A,B,C. A pathological example would be that A and B are correlated only for even timesteps, A and C are correlated only for odd time steps; B and C are uncorrelated when considering all timesteps. If one performes a PCA of these three series, one would obtain an eigenvector where all three series are contributing, and yet B and C are uncorrelated.
I have not looked into the data, but something similar could be happening here if, for instance, the correlation between T and RW and T and MXD is only present in subsegments of the record, RW and MXD being uncorrelated for the whole record.

3. Steve McIntyre
Posted Sep 30, 2008 at 7:48 AM | Permalink

Equation (5) here looks like it is connected to the EIV approach in Mann et al 2008, which inverts the (X^T X)^{-1} matrix by using a truncated SVD re-combination. For the reasons implied in the above thread, it’s not at all clear that this represents an improvement over the regression methods of MBH.