I hope that you are following the lively discussion about Burger and Cubasch at Climate of the Past here, where Mann, aka Anonymous Referee #2, is carrying on in a quite extraordinary way. I’ll probably try to weigh in over there at some point. The dialogue has exploded fairly quickly and I’ve collated some of the discussion in the post here.
First of all, I’m glad that the question of RE significance has been picked up. We raised the issue of spurious RE significance in our GRL article, but the only discussion of it so far has been Huybers 2005 and Wahl and Ammann (in their rejected GRL submission) both of which were critical and sought to restore the old benchmark of 0. Burger and Cubasch, together with the Zorita review, are the second discussion to appreciate that there is a real issue with RE significance.
The first, interestingly, was the NAS Panel, which stated:
Regarding metrics used in the validation step in the reconstruction exercise, two issues have been raised (McIntyre and McKitrick 2003, 2005a,b). One is that the choice of “significance level” for the reduction of error (RE) validation statistic is not appropriate. The other is that different statistics, specifically the coefficient of efficiency (CE) and the squared correlation (r2), should have been used (the various validation statistics are discussed in Chapter 9). Some of these criticisms are more relevant than others, but taken together, they are an important aspect of a more general finding of this committee, which is that uncertainties of the published reconstructions have been underestimated. Methods for evaluation of uncertainties are discussed in Chapter 9.
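For readers who want the three statistics named in this passage side by side, here are the usual textbook definitions (my summary, not a quotation from the NAS report). I write y_t for the observed series, \hat{y}_t for the reconstruction, V for the verification period, \bar{y}_C for the calibration-period mean of the observations and \bar{y}_V for their verification-period mean; the symbols are my own notation.

```latex
\begin{align}
RE  &= 1 - \frac{\sum_{t \in V} (y_t - \hat{y}_t)^2}{\sum_{t \in V} (y_t - \bar{y}_C)^2} \\
CE  &= 1 - \frac{\sum_{t \in V} (y_t - \hat{y}_t)^2}{\sum_{t \in V} (y_t - \bar{y}_V)^2} \\
r^2 &= \left[ \operatorname{corr}_{t \in V}\left( y_t, \hat{y}_t \right) \right]^2
\end{align}
```

The only difference between RE and CE is the reference mean in the denominator. Since the sum of squares about the verification mean is the smallest possible, CE can never exceed RE, and a large shift in mean between the calibration and verification periods inflates the RE denominator without touching the CE denominator.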
I think that there are two separable issues in the Burger and Cubasch paper which you have to pay attention to -
1) the impact of a spurious regression on RE statistics;
2) the impact of calibration/verification period on RE statistics.
Before I comment on Bürger and Cubasch, I’m going to quote from Eduardo Zorita’s posted review, in which he cites discussions of "spurious regression" from Phillips 1998 and Granger and Newbold 1974 (previously discussed on this blog). "Spurious" in this context is not merely a term of disapproval (e.g. Mann’s generic use of the term "spurious criticism"), but a technical term referring to high values of a statistic for unrelated series. Zorita provides a lengthy quote from Phillips as follows:
Spurious regression, or nonsense correlation as they were originally called, have a long history in statistics, dating back at least to Yule (1926). Textbooks and the literature of statistics and econometrics abound with interesting examples, many of them quite humorous. One is the high correlation between the number of ordained ministers and the rate of alcoholism in Britain in the nineteenth century. Another is that of Yule (1926), reporting a correlation of 0.95 between the proportion of Church of England marriages to all marriages and the mortality rate of the period 1866-1911. Yet another is the econometric example of alchemy reported by Hendry (1980) between the price level and cumulative rainfall in the UK. The latter “relation” proved resilient to many econometric diagnostic test and was humorously advanced by its author as a new theory of inflation. With so many well known examples like these, the pitfalls of regression and correlation studies are now common knowledge even to nonspecialists. The situation is especially difficult in cases where the data are trending- as indeed they are in the examples above- because third factors that drive the trends come into play in the behavior of the regression, although these factors may not be at all evident in the data. Phillips (1998).
Eduardo goes on to say:
Phillips is alluding at the possibility that correlations between timeseries may arise just because a third factor is driving the trend in both timeseries. Is this possible in paleo-climatological applications? I think it is very well possible. For instance, a proxy indicator sensitive to rainfall, nutrients or other environmental factors may exhibit a trend that can be erroneously interpreted as an indication of temperature trend. A regression analysis between a proxy and temperature that oversees this possibility takes the risk of not properly estimating regression parameters. If the validation of the regression model is performed in periods where this trend is continued, a high, but perhaps spurious, validation skill is likely to be the answer, specially when using measures of skill that do not hedge against such risks.
Without directly saying so, the issue posed by Eduardo here, in a Mannian context, is whether the calibrated relationship between the North American PC1 (bristlecones) and the NH temperature PC1, which has a positive correlation even though there is negligible correlation to gridcell temperature, is simply a result of two trends – increased temperature and CO2 or other fertilization, or whether it is a bona fide relationship.
Spurious RE Examples
As an exercise, I located digital versions of the data used in the original Yule (1926) article – Church of England marriages and mortality per 1000. I fitted the relationship in a calibration period and ended up with an RE statistic of 0.97.
Bürger and Cubasch report a very high RE statistic for the "nonsense" relationship between the number of reporting gridcells and NH temperature.
I’m quite confident that most "classical" spurious regressions will have high RE statistics. I think that one can prove that this particular statistic has negligible power against common unrelated trends. We showed examples of spurious RE statistics in MM05a and Reply to Huybers, but both sets of examples were framed in MBH contexts and, in retrospect, the argument would have benefited by simply showing "spurious" RE statistics in garden variety contexts of "classical" spurious regressions.
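To make the point concrete without relying on any particular dataset, here is a minimal sketch in Python. It is not the Yule data and not anyone's published code: the series length, trend slope, noise level, seed and the 50/50 calibration/verification split are all arbitrary assumptions of mine. Two series share nothing but a linear trend; the "predictor" is calibrated against the "target" at one end and verified at the other.

```python
import random

def ols(x, y):
    """Ordinary least squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def skill(obs, pred, ref_mean):
    """1 - SSE / (sum of squared departures of obs from ref_mean).
    ref_mean = calibration mean gives RE; verification mean gives CE."""
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_ref = sum((o - ref_mean) ** 2 for o in obs)
    return 1.0 - sse / ss_ref

def pearson_r2(u, v):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    svv = sum((vi - mv) ** 2 for vi in v)
    return suv ** 2 / (suu * svv)

# Hypothetical data: a common linear trend plus independent white noise.
rng = random.Random(0)
n = 100
trend = [0.01 * t for t in range(n)]
x = [v + rng.gauss(0, 0.05) for v in trend]  # "nonsense" predictor
y = [v + rng.gauss(0, 0.05) for v in trend]  # target series

cal, ver = slice(50, 100), slice(0, 50)      # calibrate late, verify early
a, b = ols(x[cal], y[cal])
pred = [a + b * xi for xi in x[ver]]
obs = y[ver]

re_stat = skill(obs, pred, sum(y[cal]) / 50)  # reference: calibration mean
ce_stat = skill(obs, pred, sum(obs) / 50)     # reference: verification mean

# Year-to-year skill: squared correlation of first differences.
d_obs = [obs[t + 1] - obs[t] for t in range(49)]
d_pred = [pred[t + 1] - pred[t] for t in range(49)]
r2_diff = pearson_r2(d_obs, d_pred)

print(f"RE = {re_stat:.2f}, CE = {ce_stat:.2f}, r2 of differences = {r2_diff:.2f}")
```

With these settings the RE comes out high even though the year-to-year wiggles of the two series are independent by construction, which is what the first-difference r2 is there to show. The RE is rewarding the model for getting the shift in mean between the two periods roughly right, and any pair of trending series will do that.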
In their abstract, they say:
This setting is characterized by very few degrees of freedom and leaves the regression susceptible to nonsense predictors.
Note that they do not say that the regression is necessarily contaminated by nonsense predictors, merely that it is susceptible. As you’ll see below, Mann, in one of his usual debating tricks, shifts from one to the other, in replying to this. In their running text, they go on to say:
The scores are estimated from strongly autocorrelated (trended) time series in the instrumental period, along with a special partitioning into calibration and validation samples. This setting is characterized by a rather limited number of degrees of freedom. It is easy to see that calibrating a model in one end of a trended series and validating it in the other yields fairly high RE values, no matter how well other variability (such as annual) is reproduced…
Obviously, I agree with the last sentence. It seems that this point, while "easy to see" for Gerd and me, is not widely accepted and, as noted above, showing RE statistics for some classic nonsense regressors like the Yule 1926 example will be worthwhile to illustrate the pitfalls of relying exclusively on this measure. They go on:
The problem is that few degrees of freedom are easily adjusted in a regression. Therefore, the described feature will occur with any trended series, be it synthetic or natural (trends are ubiquituous): regressing it on NHT using that special calibration/validation partition returns a high RE. McIntyre and McKitrick, 2005b demonstrate this with suitably filtered red noise series. We picked as a nonsense NHT regressor the annual number of available grid points and, in fact, were rewarded with an RE of almost 50% (see below)!
Their citation here is a bit wonky, as the citation seems to be to our E&E article, whereas the topic was discussed in our GRL article and our Reply to Huybers (not cited here, although perhaps intended in McIntyre and McKitrick 2005c, not listed in the references). In our article, we tried to show that spurious RE statistics could be expected to arise under Mannian methods, adding these examples to the traditional ones. However, these red noise examples were not intended as a universal inventory of all possible situations – merely an example and a highly pertinent example.
David Stockwell’s experiments are also relevant here. There are a variety of situations under which spurious RE statistics can arise and it’s long overdue for climate scientists relying on this measure to understand its properties. B and C go on:
If even such nonsense models score that high the reported 51% of validation RE of MBH98 are not very meaningful. The low CE and R2 values moreover point to a weakness in predicting the shorter time scales. Therefore, the reconstruction skill needs a re-assessment, also on the background of the McIntyre and McKitrick, 2005b claim that it is not even significantly nonzero.
The nonsense predictor mentioned above (number of available grid points) scores RE=46% (and CE=-23%), which is more than any of the flavors ever approaches in the 100 random samples. And it is not unlikely that other nonsense predictors score even higher. On this background, the originally reported 51% of verification RE are hardly significant. This has already been claimed by McIntyre and McKitrick (2005b) in a slightly different context. In addition to that study, we have derived more stable estimates of verifiable scores for a whole series of model variants, the optimum of which (1120) scoring with RE=25%±7% (90% confidence).
I haven’t quite figured out how they attribute a RE of 0.25 to MBH. It looks like it arises from various permutations and combinations of methods applied to the MBH network. However, if the network itself contains spurious regressors (bristlecones), none of the choices are really applicable. Their point is more that the calibration/verification period selection of MBH yields an exceptionally high RE statistic relative to random sampling – still a worthwhile point. In their conclusion, they point out that the confidence intervals obtainable from an RE of 0.25 leave an amplitude error of which is insufficient to resolve the medieval-modern problem.
Mann’s reply to this (his "review") depends on Mann et al 2005 (J Clim 2005 available at his website) and Mann et al 2006 unpublished. (Isn’t it amazing how often the Team’s supposed refutation of a point is in the still unpublished article?)
Let us next consider the erroneous discussion by the authors of statistical significance of verification scores. The authors state the following: The nonsense predictor mentioned above (number of available grid points) scores RE=46% (and CE=-23%), which is more than any of the flavors ever approaches in the 100 random samples. And it is not unlikely that other nonsense predictors score even higher. On this background, the originally reported 57% of verification are hardly significant. This has already been claimed by McIntyre and McKitrick 2005b in a slightly different context.
There are several problems here. First, the authors’ statistical significance estimates are meaningless, as they are based on random permutations of subjective “flavors” that are simply erroneous in the context of the correctly implemented RegEM procedure, as detailed in section “1” above. Mann et al (2005) provide a true rigorous significance estimation procedure and their code is available as supplementary information to that paper.
Let’s pause here for a minute. First, the high RE results from nonsense predictors and the proposed B-C statistic of 0.25 are different issues. If you can get a high RE statistic from unrelated information, then the RE statistic by itself does not ensure that the model is meaningful. End of story. As to Mann’s statement that "Mann et al 2005 provide a true rigorous significance estimation procedure", I am unable to locate this calculation in the publication – maybe someone else can locate it for me. Mann describes the "true rigorous significance estimation procedure" as follows:
The procedure is based on the null hypothesis of AR(1) red noise predictions over the validation interval, using the variance and lag-one autocorrelation coefficient of the actual NH series over the calibration interval to provide surrogate AR(1) red noise reconstructions. From the ensemble of surrogates, a null distribution for RE and CE is developed. In other words, the appropriate significance estimation procedure requires the use appropriate AR(1) red noise surrogates against which the performance of the correct RegEM procedure as implemented by Schneider (2001)/Rutherford et al (2005)/Mann et al (2005) can be diagnosed.
This hardly deals with the nonsense predictor problem. The question is whether a proxy (arguably, the bristlecones) is affected by an unrelated trend or by long persistence. Low AR1-coefficient red noise behaves a lot like white noise in that the effects attenuate quickly ("short memory"), so the Mannian null test is irrelevant to that question. The null test described here was described in MBH98 (not in Mann et al 2005 as far as I can tell).
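To see what this kind of benchmark does and does not test, here is a rough Python sketch of an AR(1) null distribution for RE as I read the description quoted above. It is not Mann's code. The function names, the number of surrogates, the 99th percentile, the centering of the surrogates on the calibration mean and the toy input series are all my own assumptions.

```python
import random

def lag1_autocorr(z):
    """Sample lag-one autocorrelation coefficient."""
    n = len(z)
    m = sum(z) / n
    num = sum((z[i] - m) * (z[i + 1] - m) for i in range(n - 1))
    den = sum((v - m) ** 2 for v in z)
    return num / den

def ar1_series(n, phi, sd, rng):
    """Zero-mean AR(1) series with stationary standard deviation sd."""
    innov_sd = sd * (1.0 - phi ** 2) ** 0.5
    z = [rng.gauss(0.0, sd)]
    for _ in range(n - 1):
        z.append(phi * z[-1] + rng.gauss(0.0, innov_sd))
    return z

def re_stat(obs, pred, cal_mean):
    """Reduction of error relative to the calibration-period mean."""
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_ref = sum((o - cal_mean) ** 2 for o in obs)
    return 1.0 - sse / ss_ref

def re_null_threshold(cal, ver, n_sim=1000, q=0.99, seed=0):
    """q-quantile of RE for AR(1) surrogate 'reconstructions' whose variance
    and lag-one autocorrelation match the calibration-period target."""
    rng = random.Random(seed)
    m = sum(cal) / len(cal)
    sd = (sum((v - m) ** 2 for v in cal) / len(cal)) ** 0.5
    phi = lag1_autocorr(cal)
    null = sorted(
        re_stat(ver, [m + e for e in ar1_series(len(ver), phi, sd, rng)], m)
        for _ in range(n_sim)
    )
    return null[int(q * (n_sim - 1))]

# Hypothetical target series: warmer calibration period, cooler verification.
demo_rng = random.Random(1)
cal = [0.3 + e for e in ar1_series(79, 0.3, 0.15, demo_rng)]
ver = [0.0 + e for e in ar1_series(48, 0.3, 0.15, demo_rng)]
threshold = re_null_threshold(cal, ver)
print(f"AR(1) benchmark for RE: {threshold:.2f}")

# Short memory: the lag-k autocorrelation of an AR(1) process is phi**k.
print([round(0.3 ** k, 4) for k in range(1, 6)])
```

Every surrogate in this null is stationary noise scattered about the calibration mean, so none of them can follow the target into a period with a different mean. A predictor that merely shares a trend with the target is a different animal from anything in this ensemble, which is why passing a benchmark of this type says nothing about the nonsense predictor problem.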
Mann goes on:
Instead, Burger and Cubasch have simply analyzed the sensitivity of the procedure to the introduction of subjective, and erroneous alterations. Their results are consequently statistically meaningless. This view was recently endorsed by the U.S. National Academy of Sciences in their report "Surface Temperature Reconstructions for the Last 2000 Years" which specifically took note of the inappropriateness of the putative significance estimation procedures used by Cubasch and collaborators in their recent work.
This last sentence appears to me to be a complete fabrication (and is repeated in his Second Reply). I’ve searched the NAS Panel report for "Cubasch" and found no reference that remotely supports this claim. If anything, the NAS quote highlighted at the beginning endorsed the view exactly opposite to what Mann espouses here.
Mann goes on:
Burger and Cubasch (2005) attempt to bolster their claims based on a reference to claims by "McIntyre and McKitrick (2005b)", published in a social science journal “Energy and Environment” that is not even in the ISI database. McIntyre and McKitrick made essentially the same claim in a 2005 GRL article. Huybers (2005), in a comment on that article, demonstrated that McIntyre and McKitrick’s unusually high claimed thresholds for significant were purely an artifact of an error in their time series standardization. Huybers (2005), after correcting the mistake by McIntyre and McKitrick, verified the original RE significance thresholds indicated by Mann et al (1998). It is extremely surprising that Burger and Cubasch appear unaware of all of this.
Now Huybers 2005 demonstrated nothing of the sort. There was no "error" in our time series standardization. Our simulations reported in MM05a used a regression fit (which was in keeping with what was known about MBH methods at the time) and did not include a re-scaling step identified by Bürger and Cubasch as a methodological "flavor". However, they showed the existence of spurious RE statistics in an MBH-type context. In our Reply to Huybers, ignored here by Mann, we responded fully to Huybers by showing an identical effect with re-scaling in the context of a network of 22 noisy proxies. It’s too bad, in my opinion, that B and C did not discuss Reply to Huybers in more detail as it’s still ahead of the curve.
Finally we come to “nonsense variables” argument made by Burger and Cubasch. Their basic argument here appears to be that CFR methods such as those used by Mann et al (1998), Rutherford et al (2005) and dozens of other groups of climate and paleoclimate researchers, are somehow prone to statistical overfitting in the presence of trends, which leads to the calibration of false relationships between predictors and predictand. So their above claim can be restated as follows: calibration using CFR methods such as Rutherford et al (2005) and multiproxy networks with the attributes (e.g. spatial locations and signal-vs-noise characteristics) similar to those of the proxy network used by Rutherford et al (2005) will produce unreliable reconstructions if the calibration period includes a substantial trend. They believe that such conditions will necessarily lead to the selection of “nonsense predictors” in reconstructing past variations from the available predictor network.
I warned at the start of the post that we’d see the usual Mannian debating trick. B-C said that the Mannian method was "susceptible" to nonsense regressors, a view that I share. Watch the pea under the thimble. Mann re-states this as being "necessarily" and then tries to show a situation where the problem doesn’t occur – thereby supposedly refuting the original claim.
The real “nonsense” however is associated with their claim, which has already been tested and falsified by Mann et al (2005). Burger and Cubasch (2005) cite and discuss this paper, yet they appear unfamiliar with its key conclusions. Mann et al (2005) tested the RegEM CFR method using synthetic “pseudoproxy” proxy datasets derived from a 1150 year forced simulation of the NCAR CSM1.4 coupled model, with SNR values even lower (SNR=0.25 or “94% noise”) than Burger and Cubasch estimate for the actual Mann et al (1998)/Rutherford et al (2005) proxy data network. Mann et al (2005) demonstrate that even at these very low SNR values, and calibrating over precisely the same interval (1856-1980) as Rutherford et al (2005) (over which the model contains an even greater warming trend than the actual surface temperature observations), a highly skillful and unbiased reconstruction of the prior history of the model surface temperature field (and NH mean series) is produced. In other words, the RegEM CFR method applied to multiproxy networks that are even “noisier” than Burger and Cubasch estimate for the actual Mann et al (1998)/Rutherford et al (2005) proxy network, and calibrated over intervals in which the surface temperature field exhibits even greater trends than in the actual data, yields faithful long-term reconstructions of the past climate history (this can be confirmed in the model, because the 1000 year evolution of the temperature field prior to the calibration interval is precisely known).
Certainly, if the method were–as Burger and Cubasch claim–prone to using “nonsense predictors” when applied to multiproxy network with the signal-to-noise attributes of the Rutherford et al (2005) proxy network, and calibrating over intervals with trends similar to those in the actual observations, there would be at least some hint of this in tests using networks with even lower signal-to-noise ratios, and calibrated over intervals with even larger trends? But there is no such evidence at all in the experiments detailed by Mann et al (2005).
In their Reply to Mann, B and C will point out that Mann et al 2005 deals with pseudoproxies in a tame network (using my terminology here).
11."Mann et al., 2005". – The rev. tries to disprove our conclusions citing Mann et al., 2005. That study nicely demonstrates a successful reconstruction of a simulated climate history from sufficiently many pseudoproxies (104, representing the AD 1780 network), which obviously cannot contain any nonsense predictors and which has never been questioned by us. Our focus was on the AD 1400 network with 22 real proxies.
I’ll show below some actual results from Mann et al 2005 to confirm the validity of this reply.
Mann’s Second Comment
This time, Mann relies on the unpublished Mann et al 2006. I presume that this will be published in Hockey Team house organ Journal of Climate.
11. This is especially disappointing. Based on the above, the authors appear to have gotten little at all out of their reading of Mann et al (2005). Moreover, Mann et al (2006) have already dispelled the specious claim that the findings for the AD 1400 sparse network are any different for those for the full network. They are not. Moreover, what can the authors possibly mean by "non-sense" predictors if not predictors that are composed entirely or almost entirely of noise. At SNR=0.25, for which Mann et al (2005) show a skillful reconstruction is still produced, the pseudoproxies are composed of 94% noise by variance. In more recent work Mann et al (2006) have shown this is true even if the noise is substantially more ‘red’ than is supported for actual proxy records. Mann et al (2006) show that the performance for a fixed SNR=0.4 (86% noise by variance) are very similar to that for a multiproxy data set with the same average SNR (0.4), but for which the SNR for individual pseudoproxies ranges from SNR=0.1 to SNR=0.7. Would BC try to seriously argue now that pseudoproxies with SNR=0.1 (essentially, entirely composed of noise) are not ‘nonsense predictors’ by their definition. Let’s think a bit about what the real "nonsense" is here.
It really is frustrating wading through these debates. A series with SNR of 0.1 (with white noise) is not "essentially, entirely composed of noise". It actually doesn’t take all that many series for the noise to cancel out and to recover the signal. The spurious regression problem is the problem of an unrelated trending series. Arguing from a high white noise situation is totally irrelevant.
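The arithmetic here is simple enough to write down. In the Python sketch below, SNR is a ratio of amplitudes (as in the Mann et al 2005 setup), so the noise share of total variance is 1/(1 + SNR^2). The averaging function assumes an idealized tame network in which every series carries the same signal plus independent white noise of equal variance; that idealization and the function names are mine.

```python
import math

def noise_fraction(snr):
    """Share of total variance that is noise, for an amplitude SNR."""
    return 1.0 / (1.0 + snr ** 2)

def averaged_snr(snr, n):
    """Amplitude SNR of the simple mean of n series that share one signal
    and carry independent white noise of equal variance (idealized)."""
    return snr * math.sqrt(n)

for snr in (0.25, 0.4, 0.1):
    print(f"SNR={snr}: {noise_fraction(snr):.0%} noise by variance")

for n in (1, 25, 100):
    print(f"mean of {n} series at SNR=0.1 -> SNR={averaged_snr(0.1, n):.2f}")
```

The first loop reproduces the 94% and 86% figures quoted for SNR=0.25 and SNR=0.4. The second shows why white noise is the benign case: the signal adds coherently while the noise averages down as the square root of the number of series. None of that averaging helps if the contamination is a common trend rather than independent noise.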
Mann et al 2005
If you actually go to Mann et al 2005 to see how it supposedly supports the claims here, you’ll not find much that actually supports any of the claims. The context is a "tame network" like the one that I discussed in connection with the VZ pseudoproxies a little while ago. Like VZ erik167, they constructed pseudoproxies from gridcell information from a GCM mixed with equal amounts of white noise – a "tame" network. Most of the article is about "CPS" methods, i.e. averaging scaled versions of the series. If you know that the proxies all have equal amounts of white noise, this system works pretty well and is probably as good as you can get. Here’s the description of their setup in Mann et al 2005:
We investigate here both the CFR and CPS approaches, using networks of synthetic pseudoproxy data (see Mann and Rutherford 2002) constructed to have attributes similar to actual proxy networks used in past CFR and CPS studies, respectively. The pseudoproxy data are derived from a simulation of the climate of the past millennium (A.D. 850–1999) using the National Center for Atmospheric Research (NCAR) Climate System Model (CSM) version 1.4 coupled ocean–atmosphere model… Pseudoproxy time series (see supplementary Fig. 1 at http://dx.doi.org/10.1175/jcli3564.s1) were formed through summing each grid box annual mean temperature series with a realization of white noise (a reasonable assumption for representing observational error; the effects of the noise “color” were investigated by Mann and Rutherford 2002), allowing for various relative amplitudes of noise variance (expressed as a signal-to-noise ratio (SNR) of amplitudes; see Mann and Rutherford 2002). Experiments with SNR 1.0 were performed for all three networks. For the first network (A), experiments were performed for four different values of SNR: 0.25, 0.5, 1.0, and infinite (i.e., no added noise).
An SNR of 1 appears to be exactly the same as the VZ erik167 network that I’ve experimented on. In any of these pseudoproxy networks – if you’re adding equal amounts of white noise – you can usually get your signal back using simple averages. From the above networks, they calculated RE, r2 and CE results:
We calculated the reduction of error (RE) reconstruction skill diagnostics during both the calibration period and an independent, precalibration “verification” period (see, e.g., Cook et al. 1994; Rutherford et al. 2005, and references therein). We also calculated alternative coefficient of efficiency (CE) and squared Pearson correlation (r2) verification skill metrics. While RE is considered a superior metric for evaluating statistical reconstruction skill (see Rutherford et al. 2005), verification RE values tend to be enhanced in these particular experiments as a result of the unusually large pre-twentieth-century mean temperature changes that occur in the model output and are captured by the reconstructions. These mean changes are larger than those observed in the actual instrumental data. We therefore also used a highly conservative estimate of unresolved variance provided by 1- r2 (along with the more conventional 1- RE) to estimate statistical uncertainties as conservatively as possible.
It’s interesting to actually look at the table of results. For the MBH98 network (here they use 104 locations), the RE score is very high for all 5 cases (as are the CE and verification r2 scores). Given that the issue with MBH is negligible verification r2 and CE scores, this hardly proves the point that is supposedly being made.
All I can determine from this article is that if you have a tame network, you usually get a decent reconstruction. Also note the apples and oranges comparison for the MBH method – for the simple average, they use 12 pseudoproxies, while for MBH98 they use 104 pseudoproxies. Obviously 104 pseudoproxies will do better with white noise. I don’t see any results here for 12 pseudoproxies using MBH methods. Another odd point: they don’t report calibration RE results, saying that they are "not available in RegEM". I don’t understand that at all. I presume that the values will be about 0.99999, but you could still report it.
From Mann et al 2005.
Mann, M. E., Rutherford, S., Wahl, E., and Ammann, C.: Testing the Fidelity of Methods Used in Proxy-Based Reconstructions of Past Climate, J. Climate, 18, 4097–4107, 2005.
Mann, M.E. et al, Robustness of proxy-based climate field reconstruction methods, 2006 (accepted).