I quickly mentioned Bürger and Cubasch, which was published in December, and meant to post up some further comments at the time, but forgot to do so. I was reminded of this by the question of a reader at Daily Kos. Let me mention first that Bürger and Cubasch is a really interesting and well-conceived paper and that Cubasch was an IPCC TAR heavyweight. Here’s a tease – the following diagram is in their Supplementary Information and is explained later in this post.
Anyway, Jack asked Mann at Daily Kos:
Response to Burger and Cubasch 2005 needed
Thanks for the response here and the direct response on RealClimate! I have just posted a reply on RealClimate that is summarized in the Subject line of this post. Outdated? Not when a critical (i.e., important) paper was published two months ago.
I looked for the post about Burger and Cubasch at realclimate that Jack mentioned and, surprise, surprise, it was censored. (The first post by Jack was allowed, but not the one on Bürger and Cubasch.) I wonder what Gavin’s excuse is this time. They really are pathetic. Anyway, they couldn’t censor at Daily Kos, so Jack’s question survived there and prompted the following reply from Mann:
Burger and Cubasch: Would have been a useful contribution to the literature about 10 years ago, when Mann et al ("MBH98") and other groups were using simple EOF-based approaches. The primary criticism of Burger and Cubasch is that such approaches lack regularization or an explicit model of the error covariance structure of the data. This is fair enough. However, the method used by Mann and Colleagues for roughly the past 6 years now, Regularized Expectation-Maximization, is not subject to either criticism. This method yields essentially the same reconstruction when applied to the same proxy data [Rutherford, S., Mann, M.E., Osborn, T.J., Bradley, R.S., Briffa, K.R., Hughes, M.K., Jones, P.D., Proxy-based Northern Hemisphere Surface Temperature Reconstructions: Sensitivity to Methodology, Predictor Network, Target Season and Target Domain, Journal of Climate, 18, 2308-2329, 2005], http://www.meteo.psu.edu/~mann/shared/articles/RuthetalJClimate05.pdf, indicating that the original approach was robust in practice, despite the legitimate theoretical limitations of using a truncated EOF basis.
This method furthermore has been demonstrated to accurately reconstruct multi-century timescale variability based on applications to model simulation data (which rebuts another criticism that has been leveled against the MBH98 method).
Mann, M.E., Rutherford, S., Wahl, E., Ammann, C., Testing the Fidelity of Methods Used in Proxy-based Reconstructions of Past Climate, Journal of Climate, 18, 4097-4107, 2005.
So Burger and Cubasch is effectively pre-empted by this more recent work, which was highlighted by Science last November and in the February issue to appear of the Bulletin of the American Meteorological Society. More information can be found here.
Once again, the nomads appear to have de-camped. Whenever anyone criticizes one of their papers, they’ve moved on. Never a place to lay one’s head even for one night. In this case, the nomads must have moved on just before Bürger and Cubasch arrived. Rutherford, Mann et al. was not published 10 years ago, but in mid-2005. How would anyone know that the nomads had already decamped? I might mention here that, in early 2005, I requested details from Journal of Climate on the RegEM method described there, which they refused to provide prior to final publication, which seems to have occurred in mid-summer. As I mentioned before, I’m blocked from Rutherford’s website, but have been provided with the contents of the website, though I haven’t analyzed them in detail yet.
First, as we’ve become used to with respect to our articles, no matter how many problems are itemized in a critique, the Hockey Team always picks one point, which they label the "primary criticism", and deals only with that. Here Mann argues that the "primary criticism" of Bürger and Cubasch is that "simple EOF-based approaches…lack regularization or an explicit model of the covariance structure of the data" – defects which Mann seems to admit for MBH, but not for the "regularized expectation-maximization" methodology of Rutherford et al. I’ll return to this supposed "primary criticism" after giving a little exposition of the actual Bürger and Cubasch claims, as it doesn’t seem to me that the criticism that Mann chooses to mention is "primary" in Bürger and Cubasch. So let’s look at what Bürger and Cubasch actually said (the paper is posted up here.)
The main empirical work carried out by Bürger and Cubasch is to analyze the cumulative impact of 6 seemingly innocuous methodological decisions, both in the context of their emulation of MBH98 and in climate simulations (discussed primarily in a companion article which I have not seen yet). They pose the problem as follows:
For instance, assertions made by MBH98 and later about certain steps (such as rescaling) being “insensitive” to the method were hard to quantify and thus of little help. Bürger et al. showed that the method is, on the contrary, highly sensitive to the variation of 5 independent standard criteria (as we call the steps here), resulting in an entire spectrum of possible climate histories. Those experiments were conducted in the synthetic world of a climate model, with noise-disturbed temperature grid points serving as pseudo-proxies, and it turned out that the amplitude of the reconstructions ranged between about 20% and 100% of the true (simulated) millennial history. Whether or not these results extend to the real-world case, i.e. whether or not the MBH98 and related approaches are robust, including the predictor selection issues as argued by McIntyre and McKitrick [2005a], is the subject of the current study.
It’s nice to be cited in such an interesting article. The "independent standard criteria" – 5 in their simulations and 6 in their MBH emulation – are all seemingly innocuous methodological choices, which yield 2^6 different reconstructions. They point out that:
No a priori, purely theoretical argument allows us to select one out of the 64 as being the “true” reconstruction.
and argue that "if it [the MBH98 reconstruction] is robust certain refinements such as rescaling should not affect the essence of the final result."
Music to our ears. The 6 binary choices pertain to: detrending; use of PCR in the regression step; using global or temperature PCs as a target; inverse or direct regression; re-scaling; centering of tree ring PCs. Here’s a quick synopsis, edited slightly:
The following 6 criteria were considered, all belonging to the standard toolbox of empirical climatology. The model nomenclature is binary (1/0).
TRD – trended – 1 or detrended – 0 data in calibration period
PCR – Before estimating the regression model, the proxy predictors undergo a PC transformation (PC regression). PCR – 1; no PCR – 0. SM note: this is different from the PC steps applied to tree ring networks or to temperature gridcells. It would be a further PC step once the network of 22-112 proxies is assembled.
GLB – One can use either the single predictand NHT (1) or, alternatively, a set of leading principal components (0) so that spatial detail is simulated as well. But note that like MBH98 we use just one PC. (SM note – this presumably refers to the AD1400 step in controversy (or the AD1000 step in MBH99), which is the step with only one PC.)
INV – Direct (0) or inverse (1) regression. "Direct regression is the kind of regression that is normally applied, here, as a regression of the instrumental temperature fields (predictand) on the proxies (predictor). Inverse regression goes vice versa, first, by regressing the proxies on temperature and, second, by finding for a given proxy the temperature field with the closest (in a least squares sense) image to the proxy under the regression map. This is the same as inverting the regression map using the pseudo inverse. It is noteworthy that the simulated amplitudes of a multiple direct regression are scaled by the canonical correlations between predictor and predictand field, while the inverse form is scaled by the inverse of those correlations [see Bürger et al., 2005]."
RSC – Rescaling (1) or not re-scaled (0). "To match simulated and original variability, rescaling of the predictand is sometimes applied with scaling factors taken from the calibration period. This ensures adequate variability at least for that period, but introduces uncontrollable results if that domain is left. RSC is frequently encountered in statistical downscaling under the name inflation [cf. Karl et al., 1990]. Note that if either one of INV and RSC is applied the simulated amplitude is increased relative to observations; this is in conflict with the damping arguments given in [von Storch et al., 2004]. We have not found any reference regarding the effect of rescaling on model uncertainty."
CNT – Tree ring PCs centered (1) or uncentered (0). "The MBH98 choice of calculating the PCs of some proxy clusters from anomalies of the 20th century climate has been criticized for reducing off-calibration amplitudes and favoring hockey stick shaped results [cf. McIntyre and McKitrick, 2005a, 2005b]. Under the CNT criterion those PCs are determined from the full period to temper the impact of a strong positive 20th-century trend. We applied Preisendorfer’s rule N for selecting the PCs." SM note – it would be a little more precise to show 0 – uncentered; 1 – correlation; 2 – covariance. I’m not sure which they used (although it doesn’t matter much in terms of the argument.)
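To see why the centering convention in the CNT criterion matters, here is a minimal sketch contrasting a leading PC computed with full-period centering against one computed by centering only on a short late sub-period, as in the MBH98 convention. Everything here is synthetic and purely illustrative – the data, dimensions, and trend have nothing to do with the actual proxy network:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical proxy network: 581 "years" by 50 series of white noise,
# with a trend added to a few series (illustrative only).
X = rng.standard_normal((581, 50))
X[:, :5] += np.linspace(0, 2, 581)[:, None]  # trended subset

def leading_pc(data, center_rows):
    """Leading principal component after centering on the given rows only."""
    anomalies = data - data[center_rows].mean(axis=0)
    # SVD of the anomaly matrix; scale the first left singular vector
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    return u[:, 0] * s[0]

pc_full = leading_pc(X, slice(None))        # centered on the full period
pc_short = leading_pc(X, slice(-79, None))  # "short-centered" on the last 79 rows
```

Even on this toy network, the two centering conventions generally yield different leading PCs, which is the point of treating CNT as a binary criterion in the first place.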
They summarize the situation as follows:
Note that each single criterion is a priori sound, with numerous applications elsewhere, and can hardly be dismissed purely on theoretical grounds. Note further that all of the above criteria are independent, mutually consistent and can thus arbitrarily be mixed, so that any combination thereof defines one of 2^6 = 64 reasonable “flavors” of the regression model. Following Table 1 we identify a flavor using a binary code of length 6, indicating whether any of the 6 criteria is valid or not. For example, 100110 refers to an inverse regression with rescaling, trend, and spatially explicit predictands, and without using PCR; this is the variant used by MBH98, and we denote it by MBH.
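The flavor bookkeeping is easy to sketch. A minimal illustration in Python (the criterion ordering is an assumption for illustration – Bürger and Cubasch’s Table 1 defines the actual order):

```python
from itertools import product

# The six binary criteria, in an assumed order for illustration.
CRITERIA = ["TRD", "PCR", "GLB", "INV", "RSC", "CNT"]

def all_flavors():
    """Enumerate all 2^6 = 64 combinations of the six methodological choices."""
    return ["".join(str(b) for b in bits) for bits in product((1, 0), repeat=6)]

flavors = all_flavors()
assert len(flavors) == 64
assert "100110" in flavors  # the MBH variant, per Bürger and Cubasch
```

Each 6-bit string is one "reasonable" reconstruction recipe; the paper’s Figure 1 is simply the result of running all 64 of them.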
I disagree with their acquiescence in the MBH uncentered principal components method as "being a priori sound, with numerous applications elsewhere", but this is not central. In light of other disputes, one could probably identify 3 PC alternatives, with both covariance and correlation PCs. In the above binary nomenclature, the case that we illustrated in our E&E article would be a further variation of 100111. Since the handling of the Gaspé series is a 7th choice and has a material impact, this would increase the population of reconstructions to 3 × 2^6.
Their first empirical finding is that the spread of reconstructions is immense. Bürger and Cubasch:
Figure 1 shows the 64 variants of reconstructed millennial NHT as simulated by the regression flavors. Their spread about MBH is immense, especially around the years 1450, 1650, and 1850. No a priori, purely theoretical argument allows us to select one out of the 64 as being the “true” reconstruction. One would therefore check the calibration performance, e.g. in terms of the reduction of error (RE) statistic. [SM note – the "calibration RE" statistic is really just the "calibration r2" statistic; when I use "RE" statistic, I’m nearly always referring to the "verification RE" statistic. Watch what he’s doing here – he’s using the calibration statistic to pick the model and the verification statistic to check what he calls "over-fitting", what I’d call "spurious regression".] But even when confined to variants better than MBH a remarkable spread remains; the best variant, with an RE of 79% (101001; see supplementary material ftp://ftp.agu.org/apend/gl/2005GL024155.), is, strangely, the variant that most strongly deviates from MBH.
Figure 1. 2^6 = 64 variants of millennial NH temperature, distinguished by smaller (light grey) and larger (dark grey) calibration RE than the MBH98 analogue (MBH, black). Instrumental observations are dashed. All curves are smoothed using a 30y filter.
I draw attention to the much better statistical practice of Bürger and Cubasch, as compared with the Hockey Team, in carefully distinguishing calibration and verification. They state the following:
It may be important to stress the following: On the basis of the validation RE one might be tempted to prefer the (most simple) variants 100000 or 101000, or also MBH, to the others. But that statistic must not be used to select a model; it can only serve as a check of a model, e.g. for overfitting, after it has been selected. To do otherwise amounts to extend the calibration over to the validation period. In that case, i.e. using the calibration 1854–1980, the simulations look remarkably different (not shown).
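For readers wanting the arithmetic, the RE statistic is one minus the ratio of the squared reconstruction error to the squared departure of the observations from the calibration-period mean. A sketch with wholly synthetic data (the series, split points, and "reconstruction" are all hypothetical, chosen only to show the calibration/verification distinction):

```python
import numpy as np

def reduction_of_error(obs, est, cal_mean):
    """RE statistic: skill relative to predicting the calibration-period mean."""
    obs, est = np.asarray(obs, float), np.asarray(est, float)
    return 1.0 - np.sum((obs - est) ** 2) / np.sum((obs - cal_mean) ** 2)

rng = np.random.default_rng(1)
obs = np.sin(np.linspace(0, 6, 120)) + 0.1 * rng.standard_normal(120)
est = np.sin(np.linspace(0, 6, 120))       # a stand-in "reconstruction"
cal, ver = slice(60, 120), slice(0, 60)    # later segment calibrates, earlier verifies
cal_mean = obs[cal].mean()

re_cal = reduction_of_error(obs[cal], est[cal], cal_mean)  # may guide model choice
re_ver = reduction_of_error(obs[ver], est[ver], cal_mean)  # only a post-hoc check
```

The point Bürger and Cubasch insist on is precisely the last two lines: the verification RE must not be fed back into model selection, or the verification period has effectively been absorbed into calibration.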
It would be very interesting to see the results for the 1854-1980 calibration period and it’s too bad that they are not shown. I’ll see if I can get them. I sent Bürger my reconstruction code early last year, so he should be receptive to the request.
Bürger and Cubasch go on to show a large spread in their simulation results as well, reported as follows for the AD1600 network – a network for which MBH claimed much tighter confidence intervals. Note that the figure below is for only 32 cases and only uses uncentered PCA:
We have analyzed the influence of each of the criteria on the overall behavior of the simulation…. We have nevertheless conducted the same experiments under the setting of the AD 1600 step where more proxies (57) are available. The variations are comparable to those seen in Figure 1. The spread is particularly large in the earliest part of the simulations, especially among those with a calibration RE higher than MBH (cf. SM). But they have a negative validation RE, which indicates overfitting.
Supplementary Figure 2 (1600.eps). The 32 variants from combining criteria 1-5 (grey, with CNT=0), distinguished by worse (light grey) or better (dark grey) performance than the MBH98-analogue MBH (10011, black). Note the remarkable spread in the early 16th and late 19th century. ftp://ftp.agu.org/apend/gl/2005GL024155/2005GL024155-fs02.jpg
Bürger and Cubasch attempt to explain this remarkable lack of robustness by noting that proxy values are being extrapolated well outside the range of the proxies in the calibration period, sometimes far outside the calibration range. They say:
But as Fritts [1976, p. 15] continues: “In order to make this kind of inference, however, it is important that the entire range of variability in climate that occurred in the past is included in the present-day sampling of environment.” This is, in fact, the basic condition of statistical regression – but only one half of it. The other half applies to the tree ring variations: They also must lie in a range that is dictated by the calibrating sample. This, however, is not the case here. For almost all of the 24 proxies, the range of the millennial variation is considerably larger than the sampled one, with numerous cases of proxies exceeding 7 and more calibration standard deviations (cf. SM). As a consequence, the regression model is extrapolated beyond the domain for which it was defined and where the error is limited.
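The extrapolation check behind that "7 and more calibration standard deviations" claim is simple to express: measure a proxy’s largest departure from its calibration-period mean, in units of its calibration-period standard deviation. A sketch with a synthetic proxy (the values, seed, and window are illustrative assumptions, not drawn from the MBH98 network):

```python
import numpy as np

def max_excursion_in_cal_sd(proxy, cal):
    """Largest absolute departure from the calibration mean, in calibration SDs."""
    mu, sd = proxy[cal].mean(), proxy[cal].std(ddof=1)
    return np.max(np.abs(proxy - mu)) / sd

rng = np.random.default_rng(2)
# Hypothetical proxy: quiet in the calibration window, one large early swing.
proxy = 0.2 * rng.standard_normal(581)
proxy[100] = 3.0                      # a pre-instrumental excursion
cal = slice(-79, None)                # calibration window at the end of the series

z = max_excursion_in_cal_sd(proxy, cal)  # far beyond the calibrated range
```

Any value of z much above 2 or 3 means the fitted regression is being asked to predict in territory it never saw, which is exactly the failure mode Bürger and Cubasch describe.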
Bürger and Cubasch conclude with some very strong caveats about applying models outside their calibration range:
Any robust, regression-based method of deriving past climatic variations from proxies is therefore inherently trapped by variations seen at the training stage, that is, in the instrumental period. The more one leaves that scale and the farther the estimated regression laws are extrapolated the less robust the method is. The described error growth is particularly critical for parameter-intensive, multi-proxy climate field reconstructions of the MBH98 type. Here, for example, colinearity and overfitting induce considerable error already in the estimation phase. To salvage such methods, two things are required: First, a sound mathematical derivation of the model error and, second, perhaps more sophisticated regularization schemes that can keep this error small. This might help to select the best among the 64, and certainly many more possible variants. In view of the relatively short verifiable period not much room is left.
So after this devastating critique of MBH98-type reconstructions (entirely in the spirit of our articles), Mann concluded that the "primary criticism" of Bürger and Cubasch was that "EOF-based approaches … lack regularization or an explicit model of the error covariance structure of the data" and, conveniently, claims that Rutherford, Mann et al., which is hot off the press, somehow avoids these seemingly intractable problems. (BTW, I was unable to locate in Bürger and Cubasch any use of the term "explicit model of the error covariance structure of the data" or any apparent synonym. As so often, the Hockey Team does not quote the critic, but re-writes the claim into one that is perhaps easier to respond to, regardless of the original criticism.)
As I mentioned before, I’ve not gone through Rutherford, Mann et al. to see if they’ve avoided the Bürger and Cubasch criticisms – it would amaze me if they have. For example, Rutherford, Mann et al. astonishingly use the original MBH98 tree ring principal component series in their primary analyses. In fact, I doubt that Rutherford, Mann et al. actually avoid any of the Bürger and Cubasch critiques, but, hey, it’s in Journal of Climate. Andrew Weaver edited it, so what further proof would anyone require?