Climate Audit

The RE Benchmark in A&W

There’s an interesting irony in the GRL rejection of the Ammann & Wahl Comment and it will be interesting to see how this gets handled. It turns out that the A&W Climatic Change submission depends on their GRL submission for their test of statistical significance for the RE statistic. So even though the GRL referee thought that there was nothing new in the GRL submission, it turns out that it contained something that is essential for their Climatic Change submission. The issue is substantive and not just formal. I’ll show the connection.

It’s all about RE significance. Unlike r2 statistics which have tables, there are no tables for RE significance. A rule of thumb for regression models was that an RE greater than 0 had "some significance". (Actually in these contexts e.g. Wilks [1995], it is presumed that the RE statistic is necessarily less than the r2 statistic – a point we made in MM05a, but which seems to be ignored in the huffing about the supposed primacy of the RE statistic.)
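
For readers who want the two statistics side by side, here is a minimal Python sketch. It assumes the usual definition of RE as one minus the ratio of the verification-period squared error to the squared error obtained by substituting the calibration-period mean. The function names are hypothetical and none of this is taken from MBH98, MM05a or A&W code.

```python
import numpy as np


def reduction_of_error(obs_ver, est_ver, cal_mean):
    """RE over the verification period, relative to the calibration-period mean."""
    obs_ver = np.asarray(obs_ver, dtype=float)
    est_ver = np.asarray(est_ver, dtype=float)
    sse = np.sum((obs_ver - est_ver) ** 2)       # error of the reconstruction
    ss_ref = np.sum((obs_ver - cal_mean) ** 2)   # error of "just use the calibration mean"
    return 1.0 - sse / ss_ref


def r_squared(obs_ver, est_ver):
    """Squared Pearson correlation between observed and estimated values."""
    r = np.corrcoef(obs_ver, est_ver)[0, 1]
    return r ** 2


# Made-up numbers, for illustration only.
obs = np.array([-0.30, -0.10, -0.20, -0.40])
cal_mean = 0.0

# A "reconstruction" that is just the calibration mean scores RE = 0 by construction.
re_of_mean = reduction_of_error(obs, np.full(obs.shape, cal_mean), cal_mean)

# A perfect reconstruction scores RE = 1.
re_perfect = reduction_of_error(obs, obs, cal_mean)
```

The point of the sketch is only that RE = 0 marks "no better than the calibration mean"; whether 0 is also the right significance threshold is the question at issue.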

But the rules of thumb are only for linear regression models – which MBH obviously isn’t. MBH98 recognized this and based their claimed benchmark (still 0.0) on simulations. In MM05a, we pointed out that the simulations in MBH98 did not replicate essential aspects of the MBH98 algorithm and argued that a more realistic 99% benchmark in the context of MBH98 methodology was 0.56.

Huybers [2005] criticized our RE benchmarking on the basis that our simulations did not incorporate a re-scaling step in MBH98 (which could be confirmed in the source code released last summer, although it was not mentioned in the original text.) Huybers said that, with re-scaling, the RE benchmark was once again 0.0. In our Reply to Huybers [2005], we pointed out that Huybers’ own simulations did not replicate the network aspects of MBH98 and we showed new simulations incorporating networks of noise, which yielded a 99% benchmark of 0.51.
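
To make concrete what a simulated RE benchmark is, here is a heavily simplified Python sketch. It is not the MM05a, Huybers or A&W procedure: it assumes AR(1) red noise rather than the proxies' full autocorrelation structure, uses a single noise series with no PC step and no proxy network, and fits a synthetic target rather than the instrumental record. All names, the AR(1) coefficient and the series lengths are illustrative assumptions. The optional `rescale` flag only mimics the idea of a variance re-scaling step.

```python
import numpy as np


def ar1_noise(n, rho, rng):
    """AR(1) red noise of length n with lag-one coefficient rho (an assumption)."""
    eps = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = eps[0]
    for t in range(1, n):
        x[t] = rho * x[t - 1] + eps[t]
    return x


def verification_re(noise, target, n_ver, rescale):
    """Calibrate target on one noise series and return the verification RE.

    The first n_ver values are the verification period; the rest are calibration.
    """
    cal = slice(n_ver, None)
    ver = slice(0, n_ver)
    slope, intercept = np.polyfit(noise[cal], target[cal], 1)  # OLS fit in calibration
    est = intercept + slope * noise
    cal_mean = target[cal].mean()
    if rescale:
        # Match the calibration-period variance of the estimate to that of the target.
        ratio = target[cal].std() / est[cal].std()
        est = cal_mean + (est - est[cal].mean()) * ratio
    sse = np.sum((target[ver] - est[ver]) ** 2)
    ss_ref = np.sum((target[ver] - cal_mean) ** 2)
    return 1.0 - sse / ss_ref


def re_benchmark(target, n_ver, n_sims=500, rho=0.9, rescale=False, q=0.99, seed=0):
    """Quantile q of verification RE scores obtained from pure noise."""
    rng = np.random.default_rng(seed)
    scores = [
        verification_re(ar1_noise(len(target), rho, rng), target, n_ver, rescale)
        for _ in range(n_sims)
    ]
    return float(np.quantile(scores, q))


# Synthetic stand-in target (trend plus noise); not real temperature data.
rng = np.random.default_rng(1)
target = np.linspace(-0.3, 0.3, 127) + 0.1 * rng.standard_normal(127)

bench_plain = re_benchmark(target, n_ver=48)
bench_rescaled = re_benchmark(target, n_ver=48, rescale=True)
```

The numbers this toy produces say nothing about the actual MBH98 benchmark. The sketch only shows where the modelling choices under dispute (noise model, re-scaling, network structure) would enter such a simulation.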

In my opinion, we completely answered Huybers’ point about RE benchmarking. Our Reply to Huybers was peer reviewed, just as much as Huybers. The onus is on anyone seeking to carry out significance tests using an RE statistic in an MBH98 context either to start with that 99% benchmark or to prove that a lower benchmark can be used.

In their original GRL Comment, A&W did not refer to the RE benchmarking issue. However, in their revised GRL Comment, they reported on new simulations (which look to be incorrect) which purported to restore a 99% benchmark of 0.0. Their GRL submission said:

We have also examined the MM approach for benchmarking the RE statistic presented in MM05a. Although the MM method generates realistic pseudoproxy series with autocorrelation (AC) structures like those of the original proxy data, these time series have nearly uniform variances, unlike those of the original proxies. PCs derived from such data generally have AC structures unlike those derived from the original proxies, and thus they should not be used as equivalent to the original PCs. Restoring the variances of the original proxy data to the pseudoproxy series yields PCs with AC structures like those of the original PCs. But more importantly for the benchmarking, we confirm Huybers’ (2005) correction to the MM RE calculations, which rescales the variance of the fitted NH temperatures to match that of the observed values regressed against the simulated PC1s. This approach more accurately mimics the actual MBH procedure, which applies a parallel rescaling to the fitted instrumental PCs that drive the MBH climate field reconstruction process. Using our AC-correct PC1s, RE = 0.0 occurs at the 0.985 level of significance.

A&W (CC) acknowledged our point about simulations in a backhanded sort of way as follows:

In MM05a/b, the authors also examine two issues concerning validation statistics and their application by MBH. The first issue concerns which statistics should be applied as validation measures; the second issue concerns estimating appropriate threshold values for significance for the reduction of error (RE) statistic, which is commonly used as a validation measure in paleoclimatology (Fritts, 1976; Cook et al., 1994). …We consider the issue of appropriate thresholds for the RE statistic in Appendix 2, based on analysis and results reported elsewhere (Ammann, C.M. and E.R. Wahl, ‘Comment on “Hockey sticks, principal components, and spurious significance” by S. McIntyre and R. McKitrick’, in review with Geophysical Research Letters). p. 10

… statistical tests are done during the calibration and verification periods and their results are employed to infer the possible quality of pre-verification reconstructions. Often, these examinations are formalized by the use of null-hypothesis testing, in which a threshold of a selected validation measure is established representing a low likelihood that a value at or above the threshold would occur in the reconstruction process purely by chance. When theoretical distributions are not available for this purpose, Monte Carlo experiments with randomly-created data containing no climatic information have been used to generate approximations of the true threshold values (Fritts, 1976; cf. MM05a; Huybers, 2005; Ammann and Wahl, in review–note that the latter two references correct errors in implementation and results in MM05a) (A&W, p. 45)

The bolded sentence becomes absolutely critical to their CC submission. Appendix 2 of A&W (CC) re-states the above (but cites and relies on it):

In implementing this procedure, we found a technical problem that we reported in Ammann and Wahl (in review, and supplemental material there referenced). The method presented in MM05a generates apparently realistic pseudo tree ring series with autocorrelation (AC) structures like those of the original MBH proxy data (focusing on the 1400-onward set of proxy tree ring data), using red noise series generated by employing the original proxies’ complete AC structure. However, one byproduct of the approach is that these time series have nearly uniform variances, unlike those of the original proxies, and the PCs derived from them generally have AC structures unlike those of the original proxies’ PCs. Generally, the simulated PCs (we examined PCs 1-5) have significant spurious power on the order of 100 years and approximate harmonics of this period. When the original relative variances are restored to the pseudoproxies before PC extraction, the AC structures of the resultant PCs are much like those of the original proxy PCs. Following MM05a, the first PCs of this process were then used as regressors in a calibration with the Northern Hemisphere mean from the MBH verification data grid and the RE of verification determined, for each Monte Carlo iteration. Approximate RE significance levels can then be determined, assuming this process represents an appropriate null hypothesis model. Using the AC-correct PC1s in the RE benchmarking algorithm had little effect on the original MM benchmark results, but does significantly improve the realism of the method’s representation of the real-world proxy-PC AC structure. (A&W, p53) [my italics and bold]

Now there are a couple of things about the A&W simulations in this paragraph that I think are wrong and that certainly fail to consider relevant aspects of the exchange with Huybers. First, I don’t understand their point about scaling. (And although A&W have commendably put up a lot of code, they haven’t put up the code for the argument discussed here so it can’t be clarified.) Mannian PC methodology divides each series by its standard deviation in the calibration period. Thus, under a Mannian method, even if the variances of the simulated proxies were re-scaled to match those of the original proxy series, the MBH98 standardization would undo this re-scaling (there would be a difference between the standard deviation in the calibration period and the standard deviation of the entire period, but this is a secondary effect in this context; I haven’t specifically tested this, but I’ve got a pretty good feel for these things and don’t see how this would have enough impact to affect the PCs.)
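
The cancellation argument can be illustrated with a minimal Python sketch. It makes two assumptions: that "restoring the variances" means multiplying each simulated series by a constant, and that the standardization is a plain centring and scaling on the calibration period. The actual MBH98 code has further details that are not reproduced here, A&W's own rescaling code is not available, and the array sizes and names are illustrative.

```python
import numpy as np


def calibration_standardize(proxies, cal):
    """Centre and scale each column by its calibration-period mean and standard deviation."""
    mean_cal = proxies[cal].mean(axis=0)
    sd_cal = proxies[cal].std(axis=0, ddof=1)
    return (proxies - mean_cal) / sd_cal


rng = np.random.default_rng(0)
n_years, n_series = 581, 70                     # illustrative sizes only
noise = rng.standard_normal((n_years, n_series))
cal = slice(n_years - 79, n_years)              # last 79 rows play the calibration period

# Hypothetical "restored" standard deviations, one positive constant per series.
scales = rng.uniform(0.1, 10.0, size=n_series)
rescaled = noise * scales

# Under these assumptions the standardized series are identical either way.
same = np.allclose(
    calibration_standardize(noise, cal),
    calibration_standardize(rescaled, cal),
)
```

Under these assumptions the per-series constant divides out before any PCs are computed. A rescaling that is not a per-series constant would not cancel in this way.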

I’m pretty sure that A&W have confused this matter (which cancels out) with the impact of setting up networks in the simulation, as described in Reply to Huybers, which does not cancel out. In our Reply to Huybers, we pointed out that he had failed to simulate the effect of having a network of 22 proxies (using only one PC) and when one did simulations with a 22-proxy network of noise, we got a 99% RE benchmark of 0.51. As one can see in the above paragraph, A&W appear to have omitted this step.

In passing, I repeat a point made before – trying to reconcile detailed results seems to be beyond the capabilities of journal peer reviewers. Such reconciliations are infinitely better dealt with in a joint paper of the type that I proposed to A&W. I’m quite happy to reconcile code and let the chips fall where they may. Obviously the Hockey Team has decided that their better course of action is not to reconcile code, to say that their various errors "do not matter" and to try to win a public relations campaign.

Be that as it may, in their own obscure way, in their GRL article, A&W actually purported to present a new result, which they applied in their CC article – the bolded sentence above:

Using our AC-correct PC1s, RE = 0.0 occurs at the 0.985 level of significance.

After the GRL rejection – regardless of the reason – they no longer have this result (which they shouldn’t have, as their argument is incorrect). Thus, the most recent peer-reviewed statement on the topic of RE significance is our Reply to Huybers, which sets the bar at 0.51 in an MBH98 context.

Now let’s look at exactly how A&W report their RE benchmark. They said that the 99% benchmark is 0.0 based on their GRL article as follows:

Numerically, we consider successful validation to have occurred if RE scores are positive, and failed validation to have occurred if RE scores are negative (Ammann and Wahl, in review; Appendix 2). This threshold also has the empirical interpretation that reconstructions with positive RE scores possess skill in relation to substituting the calibration period mean for the reconstructed climate values. (A&W, p.17)

That’s it. They’ve got nothing else to establish a benchmark for RE significance. They relied on their GRL submission to establish that they could use a benchmark of 0.0, but their GRL submission got rejected. They’ve done dozens of calculations and reported dozens of RE statistics, but they have no peer-reviewed standard of RE significance. In fact, virtually all of their results fail the more onerous test set out in MM05a and re-stated in Reply to Huybers. (These results are a way of reconciling the r2 statistical failure with the seeming RE significance. They are both insignificant.)

I’m still learning academic protocols. In a business situation, let’s suppose that separate audits were being done on a parent company and a related company and that the statements of the parent depended materially on the statements of the related company. First, it’s impossible that the auditors of the parent company would sign off before the audit of the related company, if the relationship was material. But let’s say that they’d done so, on the assumption that there would be no problem with the audit of the subsidiary (but hadn’t published the statements of the parent company.) What would happen if the auditors refused to sign off on the statements of the subsidiary? The auditors of the parent company would pull the statements of the parent so fast that it would make your head spin. Not just the auditors, but the management of the parent company. If there were problems with the statements of the subsidiary, they would be obliged on their own account to notify the auditors of the parent company and pull the statements.

What will happen here? Hard to say. I expect that A&W will try to drive on and hope that CC doesn’t notice or doesn’t care.