In MM2005a,b,c, we observed that the RE statistic has no theoretical distribution. We noted that MBH had purported to establish a benchmark by simulations using AR1 red noise series with an AR1 coefficient of 0.2, yielding an RE benchmark of 0. In MM2005a (GRL), we originally showed that high RE statistics could be obtained from PC operations on red noise simply by doing a regression fit of simulated PC1s on NH temperature. Huybers 2005 criticized these simulations as incompletely emulating MBH procedures, since they did not emulate a re-scaling step used in MBH (but not mentioned in the original article or SI). In our Reply to Huybers, we amended our simulations to include this re-scaling step, but noted that MBH also included the formation of a network of 22 proxies in the AD1400 step; if white noise was inserted as the other 21 proxies, we once again obtained a high 99% significance benchmark (0.54) for the RE statistic. Our Reply fully responded to the Huybers criticism.
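To fix ideas on what an RE benchmark from Monte Carlo simulation looks like, here is a minimal sketch (series lengths, the trending stand-in for NH temperature, and all parameter choices are my illustrative assumptions, not an emulation of the actual MBH network). With plain AR1(0.2) noise regressed directly on the target – no PC step – the 99th-percentile RE comes out near zero, which is precisely why the PC operations matter:

```python
import numpy as np

def re_statistic(obs, pred, cal_mean):
    """RE = 1 - SSE / sum of squared deviations of verification
    observations from the calibration-period mean."""
    sse = np.sum((obs - pred) ** 2)
    ssm = np.sum((obs - cal_mean) ** 2)
    return 1.0 - sse / ssm

def ar1(n, rho, rng):
    """Simulate an AR1 red-noise series with coefficient rho."""
    x = np.zeros(n)
    e = rng.normal(0, 1, n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + e[t]
    return x

rng = np.random.default_rng(0)
n_cal, n_ver = 79, 48                 # hypothetical calibration/verification split
n = n_cal + n_ver
# Trending stand-in for the NH temperature target (illustrative only)
target = np.linspace(0.0, 1.0, n) + rng.normal(0, 0.25, n)

res = []
for _ in range(1000):
    proxy = ar1(n, 0.2, rng)
    # OLS fit of target on proxy over the calibration period only
    b, a = np.polyfit(proxy[:n_cal], target[:n_cal], 1)
    pred = a + b * proxy[n_cal:]
    res.append(re_statistic(target[n_cal:], pred, target[:n_cal].mean()))

benchmark_99 = np.quantile(res, 0.99)  # near zero for plain AR1(0.2) noise
```

The benchmark rises dramatically once the simulated predictor is a PC1 produced by Mannian short-centered PC operations on a red-noise network, which was the point of MM2005a.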
Wahl and Ammann/Ammann and Wahl consider this exchange and, as so often in their opus, misrepresent the research record.
Before I discuss these particular misrepresentations, I’d like to observe that, if I were re-doing the exposition of RE benchmarking today, I would discuss RE statistics in the context of the classic spurious regression literature, rather than getting involved with tree ring simulations. This is not to say that the simulations are incorrect – but I don’t think that they illuminate the point nearly as well as the classic spurious regressions (which I used in my Georgia Tech presentation).
The classic spurious regression is in Yule 1926, where he reported a very high correlation between mortality and the proportion of Church of England marriages, shown in the figure below. Another classic spurious regression is Hendry 1980’s model of inflation, in which he used cumulative rainfall to model the UK consumer price index. In both cases, if the data sets are divided into “calibration” and “verification” subsets, one gets an extremely “significant” RE statistic – well above 0.5. Does this “prove” that there is a valid model connecting these seemingly unrelated data sets? Of course not. It simply means that one cannot blindly rely on a single statistic as “proof” of a statistical relationship.
In effect, the RE statistic has negligible power to reject a classic spurious regression.
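The point is easy to demonstrate numerically. Here is a minimal sketch in the spirit of the Yule and Hendry examples (the drift, noise levels, and calibration split are my assumptions): two statistically independent random walks with drift, fit by OLS on a calibration period and scored by RE on a verification period. The RE statistic is typically high even though the series are unrelated by construction:

```python
import numpy as np

def re_statistic(obs, pred, cal_mean):
    # RE = 1 - SSE / squared deviations from the calibration-period mean
    sse = np.sum((obs - pred) ** 2)
    ssm = np.sum((obs - cal_mean) ** 2)
    return 1.0 - sse / ssm

rng = np.random.default_rng(1)
n, n_cal = 120, 80
res = []
for _ in range(500):
    # Two INDEPENDENT random walks with drift -- no real relationship,
    # in the spirit of Yule's marriages-vs-mortality example
    x = np.cumsum(0.3 + rng.normal(0, 0.5, n))
    y = np.cumsum(0.3 + rng.normal(0, 0.5, n))
    b, a = np.polyfit(x[:n_cal], y[:n_cal], 1)
    res.append(re_statistic(y[n_cal:], a + b * x[n_cal:], y[:n_cal].mean()))

median_re = np.median(res)  # typically high despite no real relationship
```

Because both series trend, the verification observations sit far from the calibration mean, and any fit that captures the common trend direction scores a "significant" RE – exactly the failure of power described above.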
This seems like a pretty elementary point and I don’t know why it’s so hard for so many climate scientists to grasp in the case of the proxy literature.
Wahl and Ammann/Ammann and Wahl wade blindly into this. They don’t spend any time examining primary literature on bristlecones to prove that Graybill’s chronologies are somehow valid. They don’t discuss the important contrary results from the Ababneh 2006 thesis, although you’d think that they or one of the reviewers would have known of these results.
Instead, they try to resuscitate an RE benchmark of 0, by making several important misrepresentations of both our results and MBH results.
Wahl and Ammann 2007 states (Appendix 1):
When theoretical distributions are not available for this purpose, Monte Carlo experiments with randomly-created data containing no climatic information have been used to generate approximations of the true threshold values (Fritts, 1976; cf. Ammann and Wahl, 2007; Huybers, 2005; MM05a, MM05c—note that the first two references correct problems in implementation and results in MM05a and MM05c).
MM05c is our Reply to Huybers. Obviously Huybers 2005 could not “correct problems in implementation and results in MM05a and MM05c” since MM05c was a reply to Huybers 2005. In fact, in my opinion, MM05c completely superseded Huybers 2005, as it obtained high RE values with re-scaling in the context of an MBH network – a more complete emulation of MBH methods than Huybers 2005. Ammann and Wahl seem almost constitutionally incapable of making accurate statements in respect to our work.
Ammann and Wahl 2007 is later than MM05c. Did it or Wahl and Ammann 2007 “correct” any “errors in implementation” in MM05c?
Wahl and Ammann 2007 Appendix 2 makes the following criticism of our method of simulating synthetic tree ring series:
one byproduct of the approach is that these time series have nearly uniform variances, unlike those of the original proxies, and the PCs derived from them generally have AC structures unlike those of the original proxies’ PCs. Generally, the simulated PCs (we examined PCs 1–5) have significant spurious power on the order of 100 years and approximate harmonics of this period. When the original relative variances are restored to the pseudoproxies before PC extraction, the AC structures of the resultant PCs are much like those of the original proxy PCs.
Here I don’t exactly understand what they did and, as I presently understand this sentence, it doesn’t make much sense in an MBH context. In Mannian PCs (or correlation PCs), the time series are standardized to have uniform standard deviations (and thus variances) in the calibration period (or entire period, respectively). So even if the variances of the time series in our network were too uniform (and I haven’t yet analyzed whether this is so), I don’t see how this could affect the downstream calculations for Mannian pseudo-PCs or correlation PCs. I don’t see the relevance of this point even if it were valid.
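The standardization point can be seen in a few lines. This is a simplified sketch (the matrix dimensions and calibration period are hypothetical, and I use a plain rather than detrended calibration standard deviation, so this is not a full Mannian implementation): whatever the raw column variances, scaling by the calibration-period standard deviation makes them uniform before the PC step, so uniformity of the raw variances is irrelevant downstream:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_cal, k = 581, 79, 22
# Proxy matrix with deliberately non-uniform column variances
X = rng.normal(0, 1, (n, k)) * rng.uniform(0.1, 10.0, k)

# Mannian-style scaling: center on the calibration-period mean and divide
# by the calibration-period standard deviation (correlation PCs would
# instead standardize over the full period)
cal = X[-n_cal:]                      # calibration rows at the end of the record
Xs = (X - cal.mean(axis=0)) / cal.std(axis=0, ddof=1)

# Every scaled series now has unit calibration-period variance,
# regardless of how uniform or non-uniform the raw variances were
print(Xs[-n_cal:].std(axis=0, ddof=1))  # all 1.0
```

Since the scaled matrix is identical whether the raw variances were uniform or wildly different, a criticism about "nearly uniform variances" in the simulated series cannot, on its face, affect the PCs computed from the standardized data.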
Later in the paragraph in Appendix 2, they seem to concede this:
Using the AC-correct PC1s in the RE benchmarking algorithm had little effect on the original MM benchmark results, but does significantly improve the realism of the method’s representation of the real-world proxy-PC AC structure.
So if this observation – whatever it is – had “little effect” on the original results, so what?
Even though Ammann and Wahl 2007 acknowledged that MM2005c (Reply to Huybers) contained the most detailed exposition of RE simulation results:
Particularly, MM05c (cf. Huybers 2005) have evaluated the extent to which random red-noise pseudoproxy series can generate spurious verification significance when propagated through the MBH reconstruction algorithm.
Wahl and Ammann 2007 totally failed to consider the methods described in MM05c, instead merely repeating the analysis of Huybers 2005 using a network of one PC1 rather than a network of 22 proxies (a PC1 plus 21 white noise series, as in MM05c). They purported to once again get an RE benchmark of 0.0:
When we applied the Huybers’ variance rescaled RE calculation to our AC-correct pseudoproxy PC1s, we generated a 98.5% significance RE benchmark of 0.0.
But note the sleight of hand. MM2005c is mentioned, but they fail to show any defect in the results. They misrepresent the research record by claiming that Huybers 2005 had refuted MM2005c – which was impossible – and then they themselves simply replicate Huybers’ results on a regression network of one series and not a full network of 22 series. Also it’s not as though these matters weren’t raised previously. They were. It’s just that Ammann and Wahl didn’t care.
They also make an important misrepresentation of MBH. Ammann and Wahl 2007 (s4) asserts:
MBH and WA argue for use of the Reduction of Error (RE) metric as the most appropriate validation measure of the reconstructed Northern Hemisphere temperature within the MBH framework, because of its balance of evaluating both interannual and long-term mean reconstruction performance and its ability thereby to avoid false negative (Type II) errors based on interannual-focused measures (WA; see also below).
In respect to MBH, this claim, as so often in Ammann’s articles about Mann, is completely untrue. MBH did not argue for the use of the RE statistic as the “most appropriate” validation measure “because of its balance of evaluating both interannual and long-term mean reconstruction performance…”. These issues did not darken the door of MBH. As reported on many occasions, MBH Figure 3 illustrated the verification r2 statistic in the AD1820 step, where they say that it passed. If MBH had reported the failed verification r2 in other steps and attempted to argue a case for preferring the RE statistic, as Wahl and Ammann are now doing, then one would have more sympathy for them. But that’s not what they did. They failed to report the failed verification r2 statistic. And now Ammann is simply adding more disinformation to the mix by falsely asserting that MBH had argued for a justification that was nowhere presented in the four corners of MBH.
In discussing these particular misrepresentations, please don’t take this as a complete inventory. It’s hard to pick all the spitballs off the wall, and these are merely a couple of them. I’ll discuss more on other occasions.
As I noted elsewhere, I’ve written to Ammann asking him for a statistical reference supporting the statement:
Standard practice in climatology uses the red-noise persistence of the target series (here hemispheric temperature) in the calibration period to establish a null-model threshold for reconstruction skill in the independent verification period, which is the methodology used by MBH in a Monte Carlo framework to establish a verification RE threshold of zero at the > 99% significance level.
So far no support has been provided for this claim.