Von Storch et al 2004 advocated using climate models to generate pseudoproxies to test the properties of proposed multivariate methods. Hardly unreasonable. I might argue that these are long-winded ways of generating proxy series with certain kinds of temporal and spatial covariance structures, but there’s much to be said for testing methods on some standard data. Their own networks of pseudoproxies is much too "tame" to be adequately realistic, but, if you can’t understand what the tame networks do, you’ll never understand what the "wild" networks do – a lack of understanding presently being demonstrated by Wahl and Ammann and other Hockey Team supporters.
Also for all the huffing and puffing about "moving on", there’ s nothing wrong with testing multivariate methods against the MBH98 proxy data set as an example of a "wild" network. I don’t think there’s much in it that’s usable for climate studies, but it’s an interesting statistical collection and, at this point, people interested in the field can use it as a benchmark.
For people that have not studied multivariate statistical literature, it is hard to communicate the variety of interconnected multivariate methods. In fields other than climate science, you usually have to prove the validity of a method through benchmarking and explain its statistical properties before using the method. In climate science, these steps do not seem to be required by major journals, such as Science or Nature, despite, in the latter case, having seemingly appropriate policies on paper.
I’ve done reconstructions under various multivariate methods using pseudoproxies from "Region 1" (55 series) from the erik167 run (kindly provided by Eduardo Zorita). I’m still experimenting, but the results seem pretty interesting and one of many things that I should work up further. The cases here are all direct reconstructions of the NH average, without using "climate field reconstructions". The Mannian temperature PC1 closely resembles the NH average. I’ve looked at the following methods:
1. Scaled Composite (average after scaling in the calibration period – 1902-1980 used here after MBH);
2. Ordinary Least Squares (OLS) – here this is inverse multivariate regression (which is what Groveman and Landsberg did)
3. Partial Least Squares – I’m using this label according to my characterization of the actual MBH method as seet out in my Linear Algebra posts. Unless modified, this is undetrended. The method as applied here includes rescaling.
4. Partial Least Squares Detrended. This is the much criticized VZ implementation of MBH.
5. PC1- covariance.
6. PC1-correlation. Despite the use of PC methods in the tree ring networks, MBH did NOT apply PC methods to their proxy networks formed from collating tree ring PCs with other proxies (the 22 series in the AD1400 has 3 PC series; the 112 series in the AD1820 has 31 PCs).
7. Principal Components Regression (PCR). This is regression on the principal components, as opposed to reconstructing from the principal components.
I’ve not checked out Ridge Regression here, much less RegEM. Stone and Brooks 1990 prove that Ridge Regression forms a continuum (with one parameter) between OLS and PLS, so in a general sense one can conclude that Ridge Regression results will be intermediate between OLS and PLS results. Canonical Correlation Analysis; Lasso methods – there are numerous other plausible multivariate methods.
Decomposition by Scale
There’s a lot of talk about low-frequency and high-frequency results in the current debate. However, I don’t think that the examples to date are very helpful. I like looking at wavelet decompositions of variance by scale , aggregating low-, medium- and high- frequency scales together. I used a "la8" wavelet here, but I don’t think that much turns on it. I think that the decomposition shown here is more useful than what has been put forward so far in the literature.
I’m not going to show the decompositions for all cases, but will show a couple of extreme cases. The first figure shows the OLS reconstruction. You will see that it provides a remarkable fit in the calibration period, that it has much less low-frequency variance than the target and generally poor out-of-sample performance even in this very tame network. In a later graphic, I’ll show the behavior of the coefficients in this fit. Groveman and Landsberg use this methodology and I can see some of the same features in coefficient patterns there. It looks to me like an unusable reconstruction and any of the solar results based on correlations of solar proxies to Groveman and Landsberg would accordingly be questionable. This example nicely illustrates "overfitting".
Figure 1. Wavelet decomposition of Reconstruction from Multiple Linear Regression from erik167 pseudoproxies in Region 1. Top 4 panels – NH reconstruction and decomposition by scales ( high: 1-8 years; medium 16-64 years; low – 128+ years). Red – Echo-G NH. Bottom panel – Proportion of variance by scale. Dark cyan is NH target , used as reference in other decompositions.
At the other extreme is a reconstruction using an unweighted average of scaled pseudoproxies. In this case, there is much better recovery of low-frequency information, as you can see by comparing the proportion of variance in each scale (BTW the proportion of high-frequency variance in the underlying Echo-G target seems unduly low, but that’s a spearate issue.) One of the reasons that this method performs so well is the very "tameness" of the underlying network – the noise is white, also the proportion of noise is constant, about 50% in the erik167 example, yielding very high (0.7) correlations to gridcell temperature.
Figure 2. Wavelet decomposition of Scaled Composite from erik167 pseudoproxies. Top 4 panels – NH reconstruction and decomposition by scales ( high: 1-8 years; medium 16-64 years; low – 128+ years). Red – Echo-G NH. Very top panel is 2x scale of next 3 panels. Bottom panel – Proportion of variance by scale. Dark cyan is NH target , used as reference in other decompositions.
I’ve done these for all the cases, illustrated in the next graphs, but you should get the idea.
Next here is a spaghetti graph of results from 3 multivariate methods: OLS, re-scaled PLS (MBH98) and scaled composite. You see that MBH-style PLS is intermediate between OLS and scaled composite. The better performance of a scaled composite in recovering low-frequency information was observed inpassing in von Storch et al 2004, but, IMHO, they did not pay sufficient attention to this result. In each case, the attenuation of variance comes from a greater proportion of high-frequency variance in the reconstruction than in the original. This is a property of the multivariate method. As a method, Moberg tries to avoid this by ensuring that the reconstruction has similar proportions of low-frequency and high-frequency variance – which doesn’t necessarily make this reconstruction "right", but it is at least attentive to the problem and a direction worth pursuing.
Figure 3. Spaghetti graph of selected multivariate methods. The blow-up is not because of intrinsic interest to the period, but just to show detail a little better in a different scale.
The next graph shows similar results for other multivariate methods, including detrended and nondetrended Mannian PLS. Given the amount of hyper-ventilating by the Goon Line about trending versus de-trending in Mannian methods, there is surprisingly little impact in this particular case. It’s actually not even clear that undetrended performs better; I expected to see more difference and don’t entirely understand why I’m not seeing more difference. I’m wondering whether there is still some discrepancy in MBH implementation by VZ which actually leads them to overstate the differences. Wouldn’t that be ironic? I’ll chase Eduardo for code on how he did this step. In any event, the impact of trending versus not-detrending is microscopic under my calculations. The more salient issue is the markedly inferior performance of Mannian PLS to some simple alternatives such as unweighted average or simple principal components. (I realize that further issues arise in a less tame network, but the hyperventilation is all about the tame network.)
Figure 4. Another spaghetti graph of selected multivariate methods.
The next graphic compares some common verification statistics for the different methods, again with extremely interesting results. Here I think that there is a great benefit from considering a broader range of multivariate methods. Look at the OLS statistics: a very strong calibration r2, an RE statistic of just under 0.5 (about the range of early period MBH reconstructions) and negligible verification r2 (the reason that you can’t see it in the graph is because, it’s 0.002.).
The "good" methods have RE statistics up in the 0.7 range, rather than the 0.4 range. In our GRL article (Reply to Huybers is our most recent position), we suggested that a 0.51 benchmark for RE significance. B’rger and Cubasch is the first article to try to come to grips with this conundrum, although their suggested benchmark of 0.25 hardly follows from their example. In passing, Mann’s review of B’rger and Cubasch is typical of Hockey Team dreck on this topic.
The CE statistics, praised by the NAS Panel, are all negative for fairly decent reconstructions.
Despite recent hyper-ventilation about r2 being a "high-frequency" measure, it contains both low and high-frequency information. I’m not advocating r2 as a magic bullet, merely not looking at one statistic. In this case, based on these particular networks, I’d be inclined to say that verification r2 is probably more helpful than CE, but I’m not tied to this position. My objection to Mann’s handling of the r2 was that MBH98 said that this was one of the verification statistics that they calculated and, in IPCC TAR, he claimed "skill" in multiple verification statistics. If they didn’t want to use it, argue that; just don’t claim skill if it isn’t there.
Notice the fifth column which is the explained variance for the entire period – after all, by using RE statistics in a short verification period, one is hopefully estimating RE statistics for the process or at least the longer period. The best performers in this tame network are the unweighted composite and the principal components. Mannian PLS shows poor performance in comparison. All methods under-capture low-frequency variance. Detrending or not in a PLS method makes negligible contribution.
Figure 5. Verification statistics for selected multivariate methods.
Finally, here are plots of the weighting factors for some different methods. These "weighting factors" have different terminologies – in one case they are called "regression coefficients"; in another case, they are the first eigenvector (re-scaled). A scaled composite has roughly equal weighting factors. In this "tame" netowrk, the principal components methods have virtually identical coefficients and come close to recovering a simple average.
Figure 6. Weighting Factors for Several Multivariate Methods
One can prove theoretically that, in a white noise network with noise mixed in in equal amounts, the optimum result comes from assigning exactly equal weights to each series, so that the white noise cancels out. In such a case, the variance of the white noise reduces as . The more variability in the weights, the worse the performance. For exxample, if you load your weights on just one series, you get no white noise cancelling and retain the original variance proportion.
If you have both positive and negative weights in this very tame network, then the weights cause the proportion of signal to be reduced without a gain in noise cancellation. That’s one thing that caught my eye in the Groveman and Landsberg situation – you are not regressing onto causes and so you don’t want opposite signed coefficients. Ridge regression and even Mannian implicit PLS are less bad than OLS in this context, but simple averages or simple principal components trump both. You’d think that this would have been done before Mann proposed his method – that’s what’s done in econometrics or any other discipline except climate science, but, hey, they’re the Hockey Team.
I think that the best way to proceed to a more complicated test case is to use a linear mixed effects method in which you allow for heteroskedasticity. I’ve experimented a little with this on tree ring networks but have run into memory problems. (It’s not that the memory requriements are huge; it’s that I really need to get a new computer- which I’ve been talking about for a year now; I’m always worried about the amount of time to convert and the time has never seemed right. Maybe this will prompt me to do it.)
Reference script (as an aide-memoire): zorita/make.residuals.txt