I asked KNMI what were the studies that had “refuted” our work. It seems to be Wahl and Ammann. I’ve never understood the traction of Wahl and Ammann with climate scientists. I doubt that any of them have worked through the details, but Wahl and Ammann issued a press release that all our claims were “unfounded” and that seems to be enough to settle things in climate world. Of course, they proved no such thing, but press releases seem to be what people pay attention to – this is true in mining promotions as well.
When climate scientists say that they want to “move on”, you notice that they don’t use that against Wahl and Ammann. They want to “move on”, but they want to get in the last word. Thus, Wahl and Ammann. But it’s not like Wahl and Ammann are independent – they are Mann’s recent co-authors and Ammann was a PhD student under Mann and Bradley. Their article simply fleshes out points that Mann himself had already made in his correspondence with Nature – although Mann is nowhere mentioned in the acknowledgements – presumably in an attempt to make the study seem “independent” – a claim that Wegman rejected sarcastically.
Anyway I’ve finally sucked in my gut and begun the turgid task of seeing exactly what Wahl and Ammann did. As I mentioned in May 2005, our codes reconciled almost exactly (and I made a very small modification to my emulation code so that they now reconcile exactly.) Based on this and the very real possibility of listing a number of agreed findings, in December 2005, as I reported earlier, I proposed that we attempt to produce a joint statement of agreed results (failing which we could continue controversy). Had this happened, people would have saved a lot of time and energy – indeed both the NAS panel and Wegman reports would have been very different and more constructive. However, Wahl and Ammann chose to engage in more controversy.
Ultimately, I don’t think that anyone on the Team side will find that this was a wise decision. As I’ve started parsing through WA in more detail, there are problems that, in many ways, are worse than MBH – if only, because the second time around is a bit of a farce. Anyway on to details.
The key scenarios in WA for principal component discussions are Scenarios 5 and 6. Scenario 5 considers reconstructions under the following 4 scenarios: a) Mannian PCs; b) two correlation PCs; c) two covariance PCs; d) four covariance PCs. All scenarios are without the Gaspé extrapolation. This is fair enough. These correspond to the cases in play (with 5b being the Huybers case.) As a reminder, “full standardization” in a PC context simply means using correlation PCs rather than covariance PCs – a method not endorsed by the NAS panel in this context. But that’s a different story and ultimately not relevant to anything. Figure 1 below is a direct paste of WA Figure 5, which shows these scenarios for the 1400-1500 detail period. The case with 2 covariance PCs (the high 15th century case in MM05b) is shown in two colors – pink for the 1400 step; green for the 1450 step. The split between the 1400 and 1450 step will be discussed later.
Figure 1. WA Figure 5. Red – two Mannian PCs (WA); pink/green – two covariance PCs, pink for 1400-1450; green for 1450-1500; blue – range of 2 correlation PCs and 4 covariance PCs.
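The difference between covariance PCs and correlation PCs discussed above can be illustrated with a small synthetic sketch. This is not the MBH, WA or MM05b code and uses no tree ring data; the group sizes, correlations, scales and the helper names `make_group` and `rank_of_pattern` are all assumptions chosen for illustration. It builds a network in which a small group of low-variance series shares a strong common pattern, and then checks which PC that pattern lands in under each convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years = 500  # hypothetical record length


def make_group(n_series, rho, scale, factor):
    """Series sharing `factor`, with pairwise correlation ~rho and std ~scale."""
    noise = rng.standard_normal((len(factor), n_series))
    return scale * (np.sqrt(rho) * factor[:, None] + np.sqrt(1.0 - rho) * noise)


def rank_of_pattern(data, pattern, use_correlation):
    """1-based rank of the PC whose time series best matches `pattern`."""
    centered = data - data.mean(axis=0)
    if use_correlation:
        # Correlation PCs: rescale every series to unit variance first
        centered = centered / centered.std(axis=0, ddof=1)
    # Columns of u are the PC time series, ordered by explained variance
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    match = [abs(np.corrcoef(u[:, k], pattern)[0, 1]) for k in range(u.shape[1])]
    return int(np.argmax(match)) + 1


factor_a = rng.standard_normal(n_years)
factor_c = rng.standard_normal(n_years)
group_a = make_group(10, 0.3, 2.0, factor_a)  # many series, high variance
group_c = make_group(6, 0.9, 0.3, factor_c)   # few series, low variance, tightly related
network = np.hstack([group_a, group_c])

rank_cov = rank_of_pattern(network, factor_c, use_correlation=False)
rank_corr = rank_of_pattern(network, factor_c, use_correlation=True)
print("covariance PCs: group C pattern sits in PC", rank_cov)
print("correlation PCs: group C pattern sits in PC", rank_corr)
```

In a setup like this the same pattern moves to a higher-ranked PC once the series are standardized, so whether it survives a "retain the first k PCs" rule depends on the convention and on k. The specific ranks here say nothing about the actual North American network.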
The next figure, Figure 2, shows my replication of Figure 1 above to show that the algorithms are apples and apples. There is one slight difference in parameters. WA annoyingly did their calculations on a slightly different set of temperature PCs than Mann used and with unweighted proxies (while Mann had assigned weights in some unknown method.) There’s not much difference in the results (as evidenced by the similarity of the emulation to WA.) However there are some differences in detail. WA proclaimed the similarity under these parameter changes as evidence of MBH98 “robustness”, although these particular steps have never been in issue.
The orange line added here is the MBH98 reconstruction itself. You will notice that it is lower than the WA emulation. WA stated that these differences are due to their use of different PCs and weights. This is simply false. At best, WA recklessly made the assertion without checking. I’ve reconciled all calculations exactly; similar differences remain using WA methods on MBH temperature PCs and weights. As a reviewer of the first draft, I asked that they benchmark using actual MBH PCs and weights to avoid annoying reconciliation problems, but this logical request – which had no controversial angle – was ignored by WA and Climatic Change. The pale grey lines are various cases involving absence of bristlecones – ignore that here. Anyway, you can see that we’re talking apples and apples in this discussion.
Figure 2. Emulation of WA Figure 5 with MBH weights and PCs. Red – two Mannian PCs (WA); pink/green – two covariance PCs, pink for 1400-1450; green for 1450-1500; blue – range of 2 correlation PCs and 4 covariance PCs; orange- archived MBH98 reconstruction.
I mentioned above that our results in MM05b reconcile closely to WA results. This is shown very clearly in Figure 3 below. On the right are smoothed versions of WA Scenario 5 results (as I’ve emulated the calculations); on the left are our archived results from MM05b Figure 1, all shown in a consistent color code. Obviously, the results for the MBH emulation (red) and 2 covariance PCs (pink) are consistent. Further, in addition to the results illustrated in MM05b, MM05b stated that results using 2 correlation PCs were about halfway between “MM”-type results and MBH-type results and that using 4 covariance PCs was more like MBH-type results – findings which are surely consistent with WA results (blue on right). MM05b stated:
If the data are not transformed (MM), but the principal components are calculated on the correlation matrix rather than the covariance matrix, the results move part way from MM to MBH, with bristlecone pine data moving up from the PC4 to influence the PC2. If a centered PC calculation on the North American network is carried out …
MBH-type results occur if the NOAMER network is expanded to 5 PCs in the AD1400 segment (as proposed in Mann et al., 2004b, 2004d). Specifically, MBH-type results occur as long as the PC4 is retained, while MM-type results occur in any combination which excludes the PC4.
There is virtually no difference between what we said in MM05b and the WA Scenario 5. It is exceedingly annoying that WA did not discuss the close relationship between what we had said in MM05b and what they said. Indeed, their failure to reconcile the results arguably rises to a distortion of the record – a point that I made as a reviewer, but which Climatic Change and WA ignored.
Figure 3. MBH-style reconstructions. Left: Archived results from MM05(EE) Figure 1. Right – WA Scenario 5 results (emulation). Pink – 2 covariance PCs; blue – 2 correlation PCs; also 4 covariance PCs; red – WA case with Mannian PCs; orange – MBH98.
WA Scenario 6 explores the effect of not using bristlecones in the North American networks in MBH-type calculations. Again, there is nothing material in the actual calculations with which I disagree, although their conclusions do not follow. Figure 4 below compares the reconstructions with and without 15 bristlecone sites. The left panel repeats the Scenario 5 diagram already shown, while the right panel shows corresponding results for networks without 15 bristlecone sites. Obviously, reconstructions without bristlecones – regardless of PC method – yield virtually identical results to calculations using 2 covariance PCs – a point acknowledged by WA.
Figure 4. Left – WA Scenario 5 as previously described. Right – WA Scenario 6 with xxx bristlecone series excluded. Orange – MBH98 for reference. Red – with two Mannian PCs (WA Scenario 6a); magenta – with 2 covariance PCs (WA Scenario 6c); blue – one graph with 2 correlation PCs (WA Scenario 6b); one graph with 5 covariance PCs.
WA acknowledged in multiple places that the PC issues corresponded to how much weight was placed on bristlecones:
When two or three PCs are used, the resulting reconstructions (represented by scenario 5d, the pink (1400-1449) and green (1450-1499) curve in Fig. 3) are highly similar (supplemental information). As reported below, these reconstructions are functionally equivalent to reconstructions in which the bristlecone/foxtail pine records are directly excluded (cf. pink/blue curve for scenarios 6a/b in Fig. 4).
and in the Legend to their Figure 3 describing Scenario 5d with 2 covariance PCs:
Pink (1400-1449) and green (1450-1499) reconstruction is same as scenario 5c, except with too few PC series retained to capture information dynamic structure of ITRDB data (acting in effect as exclusion of bristlecone/foxtail pine records from PC calculations) (scenario 5d)
and in the Legend to their Figure 5 which states:
The emulations directly exclude the bristlecone/foxtail pine records from calculation of PC summaries of N. American tree ring data (which are indirectly excluded by MM05a/b, cf. “Results” in text). [implicitly referring to the 2 covariance PC case]
Statistical and “Climatological” Significance
So far there’s nothing in the WA calculations where I have a material disagreement. As opposed to showing that our calculations were “unfounded”, they pretty much replicate our results. WA report a variety of verification statistics for the various permutations. Other than the verification statistics for the MBH reconstruction, we did not report verification statistics for other permutations since we were not proposing any of them as an alternative reconstruction – a point acknowledged by realclimate and understood by Wegman but misrepresented by WA (despite my objecting to this in person to Ammann). WA reported verification r2 statistics grudgingly. You may recall our difficulties in getting WA to admit that MBH reconstructions failed verification r2 tests. They did not report these results in their first draft and refused to provide the results to me as an anonymous reviewer. In San Francisco, Ammann said that he would not report these results and refused to answer when I asked him the values at the AGU meeting. I filed a misconduct complaint against Ammann at UCAR about distorting the record by not reporting these results. In the final WA version, they grudgingly reported the verification r2 buried in the Appendix – which showed that MBH results are ~0, as we had previously reported. (They did not withdraw their press release that all our claims were “unfounded”.) The NAS panel cited this finding when we drew it to their attention – without citing our prior observation of this in MM05a. The negligible r2 values apply to all of these variations – not just to the MBH reconstruction.
In WA Table 2 and in the discussion, WA report negative RE scores (-0.13; -0.20; -0.56) for the three cases in Scenario 6 (without bristlecones). I got somewhat different (but negative) RE results in my emulation of these cases. As a result of these negative RE scores, WA conclude that none of the Scenario 6 results have “climatological meaning” (a finding with which I agree). They say:
Although the highest temperatures in this scenario for the early 15th century are similar to those reported in MM05b (max 0.53°), which would, on face value, suggest the possibility of a double bladed hockey stick result, these values once again cannot be ascribed climatological meaning
I don’t disagree with this. We agree that an MBH-type reconstruction without bristlecones lacks statistical significance. We have frequently said that we never proposed an alternative reconstruction and certainly have never suggested that the simple expedient of removing bristlecones from an MBH method would solve all the problems.
So let’s review the bidding – WA agree that reconstructions without bristlecones lack significance – a position perilously close to agreeing with MM. How do they attempt to avoid this precipice? The argument is on page 33 of their preprint:
1) they say that the AD1450 reconstruction without bristlecones is similar to the AD1450 reconstruction with bristlecones. I haven’t checked the 1450 step in detail yet. There are not very many proxies introduced in the 1450 step and I’m not sure why the results would be different, other than the system sometimes responds strongly to presence/absence of a couple of proxies. Maybe the re-introduction of Gaspé in the 1450 step is what makes a difference. But let’s grant this point for the sake of argument.
2) from this, they state: “thus, from a strictly statistical perspective inclusion of the bristlecone/foxtail pine data in the proxy PC calculations neither enhances nor degrades reconstruction performance during the second half of the 15th century.” I’m not sure of this, but again, let’s grant it for the sake of argument.
3) they conclude: “from a climate reconstruction point of view one can argue that, in general, the bristlecone/foxtail pine records do not introduce spurious noise and their inclusion is justifiable; or said more strongly, their elimination is not objectively justifiable”
A little later, they re-state this:
Over 1450-1499, the bristlecone/foxtail pine proxies neither enhance nor degrade reconstruction performance when PC summaries are used. Thus, in this situation, it is logically appropriate to retain these proxies over the entire 15th century, since they are necessary for verification skill in the first half of this period and have no impact on calibration and verification performance in the later half.
A little underwhelming as a statistical argument, to say the least. They go on to say:
These results enhance the validity of the MBH assumption that proxies used in the reconstruction process do not necessarily need to be closely related to local/regional surface temperatures, as long as they register climatic variations that are linked to the empirical patterns of the global temperature field that the MBH method (and other climate field reconstructions) target.
The NAS and Wegman Panels
While there might be a certain perverse amusement in dissecting the above WA argument, this is made unnecessary by the finding of the NAS Panel that strip-bark samples should be “avoided” in temperature reconstructions. In MM05a and MM05b, we reported specialist views that 20th century bristlecone growth could not be related to temperature [Lamarche et al 1984; Graybill and Idso, 1993; Hughes and Funkhouser 2003; to which can be added Biondi et al 1999.]
The Wegman Report noted problems with bristlecones as follows:
Although we have not addressed the Bristlecone Pines issue extensively in this report except as one element of the proxy data, there is one point worth mentioning. Graybill and Idso (1993) specifically sought to show that Bristlecone Pines were CO2 fertilized. Bondi et al. (1999) suggest [Bristlecones] “are not a reliable temperature proxy for the last 150 years as it shows an increasing trend in about 1850 that has been attributed to atmospheric CO2 fertilization.” It is not surprising therefore that this important proxy in MBH98/99 yields a temperature curve that is highly correlated with atmospheric CO2. We also note that IPCC 1996 stated that “the possible confounding effects of carbon dioxide fertilization need to be taken into account when calibrating tree ring data against climate variations.” In addition, as use of fossil fuels has risen, so does the release of oxides of nitrogen into the atmosphere, some of which are deposited as nitrates, that are fertilizer for biota. Thus tree ring growth would be correlated with the deposition of nitrates, which, in turn, would be correlated with carbon dioxide release. There are clearly confounding factors for using tree rings as temperature signals. (p 49)
In his testimony on July 27, Wegman re-stated this, adding
“At the very least, the effect of these proxies on temperature reconstruction should be examined.”
Unfortunately, neither Wegman nor the NAS panel waded through the swamps of WA 2006. The NAS panel recognized the relationship between PC methodology and bristlecones, noting as follows that the MBH98 reconstruction was “strongly dependent on data from the Great Basin region in the western United States” – which in this context means bristlecones/foxtails.
The NAS panel went on to state about the PC criticism:
The more important aspect of this criticism [principal components methodology] is the issue of robustness with respect to the choice of proxies used in the reconstruction. For periods prior to the 16th century, the Mann et al. (1999) reconstruction that uses this particular principal component analysis technique is strongly dependent on data from the Great Basin region in the western United States.
Here they seem to nicely follow the link between PC weightings and the impact of bristlecones on MBH reconstructions, illustrated above. However, here’s where the NAS panel did not follow through. In an earlier chapter, they clearly stated that “strip-bark samples” – a specific form collected by Graybill – should be “avoided for temperature reconstructions”:
The possibility that increasing tree ring widths in modern times might be driven by increasing atmospheric carbon dioxide (CO2) concentrations, rather than increasing temperatures, was first proposed by LaMarche et al. (1984) for bristlecone pines (Pinus longaeva) in the White Mountains of California. In old age, these trees can assume a “stripbark” form, characterized by a band of trunk that remains alive and continues to grow after the rest of the stem has died. Such trees are sensitive to higher atmospheric CO2 concentrations (Graybill and Idso 1993), possibly because of greater water-use efficiency (Knapp et al. 2001, Bunn et al. 2003) or different carbon partitioning among tree parts (Tang et al. 1999). … “strip-bark” samples should be avoided for temperature reconstructions (p. 50)
This recommendation goes straight to the heart of the Graybill collection that is the heart of the MBH hockey stick (and not just bristlecones, but also foxtails and even a couple of limber pine sites.) We discussed this in MM05b (astonishingly not cited by the NAS panel). In Graybill and Idso 1993, Graybill reported the following criteria that were used in collecting:
Another tree selection factor that is crucial to our findings involves tree form. Experience has indicated that many of the oldest five-needled pines have experienced cambial dieback to varying degrees. This appears to begin after several hundred years of growth and is progressive. These so-called strip-bark trees can have active cambium that is only a few centimetres in width. Foliage and cones are also accordingly limited. Trees of this nature [strip-bark] were the primary focus of investigation whenever possible. They were most commonly found and sampled in stands of bristlecone pine and limber pine in the Great Basin and in stands of foxtail pine in the Sierra Nevada.
Obviously, every Graybill site includes strip-bark samples to a greater or lesser degree (probably greater) and, in order to implement the strip-bark recommendation of the NAS panel, all of the Graybill sites should be “avoided” in a temperature reconstruction. These are essentially the sites in the MBH “CENSORED” directory and it was already studied by MBH (yielding PC series with no HS shape.) This is very similar to the network in WA Scenario 6 (although the Scenario 6 network includes several Graybill strip-bark limber pine sites).
Conclusion
WA reported that reconstructions without bristlecones (their Scenario 6) lack “skill” in reconstruction and “climatological meaning”, a finding with which we concur. The NAS Panel says that bristlecones should be avoided in temperature reconstructions. Thus, MBH-type reconstructions (with PC networks) with or without bristlecones are both eliminated. So much for the “refutation” of our criticisms.
Now WA, following Rutherford et al 2005, have proposed that they can “get” a hockey stick without using PC calculations. In my opinion, PC methodology applied to tree ring networks was integral to MBH98 as a paper. If they cannot get to their answer using PC methodologies applied to tree ring networks using valid proxies, then the logical course of action would be to say so and “move on”.
I’m working on some notes on their “no PC” calculations. I’ve been posting up from time to time on MBH regression methodologies and there is much to say about them as well. These matters actually become easier to deal with without the involvement of prior PC calculations.
23 Comments
Nothing in this area surprises me anymore. WA use a 15th century argument to say that bristlecones do not significantly alter the results for that century, then say that there is no statistical reason for excluding them regardless of their demonstrated failure as a temperature proxy for the last 150 years.
How can this be reconciled with the result that the presence or absence of bristlecones is critical to the conclusion that current temperatures are unprecedented?
Are you going to address this in a sequel?
Thanks, Steve.
I’ll probably work this up into a reply for submission to Climatic Change or CPD.
I think that the fundamental illogicality of the WA position is evident on these materials and the findings of the NAS panel on non-use of bristlecones save a LOT of arguing. For example, the WA argument that bristlecones add “necessary skill” and this means that they should be used – is an argument that Wegman would roll his eyes at, but climate scientists consider convincing.
Most civilians understand these issues better than climate scientists. If I presented this argument to a CPA, he’d understand immediately that, on this set of facts, none of the models were established. You have to be a climate scientist to not understand this reasoning.
Didn’t WA admit that the r2 was poor as you have always claimed? I suppose they should revise their specious claim to read “most of M&M’s claims are unfounded.”
Mark
Teleconnections again. Proxies are better than thermometers.
Why even worry about “teleconnections”: maybe they should say:
re: #4-5
I wonder if the team recall the “ether field” from a hundred years ago? It seemed that there had to be some link between electromagnetic radiation and space and ether was postulated to be what was vibrating to produce the waves. Now I’m sure there’s no real “field” when it comes to global temperatures which can be teleconnected. But there might be something else, like water in the air (in whatever state) which could affect local temperatures at low frequencies while still not affecting it on a high-frequency basis. But if so, it should be made explicit and the signs of such a connection should be predicted and measured.
How about
Who says teleconnections are stationary?
Steve M., would you be willing to comment on the following (from p.51-52 of the preprint)?
Wahl and Ammann are a big hit with HMG in the UK. This is what Paul Monro of DEFRA (anyone know who he is?) wrote to me recently:
“Your claim that Stephen McIntyre has succeeded in establishing that MBH98 was wrong is not substantiated by recent analysis. Mistakes were certainly made by MBH98 but their overall result has been upheld. The current arguments being deployed by MBH98 critics would seem to have found their ultimate expression in a ‘conspiracy theory’ argument expressed in Dr Wegman’s report about a social network existing among leading climatologists. This ‘insight’ says nothing about the inherent validity of paleoclimate reconstructions, hence we regard it as irrelevant.
In his report Dr Wegman also discusses in detail the issue of non centred PCA analysis. However, the hockey stick pattern emerges whether non centred or centred PCA analysis is applied (not mentioned in Wegman’s report) provided that the correct number of statistically significant principal components are employed. McIntyre and McKitrick (MM) failed to do this, hence their analysis censored out a great deal of data. Their reconstruction showing a 15th century warming excursion exceeding the late 20th century warming is shown in reference 1 to have no reconstruction skill [1]. This paper also demonstrates that the MBH98 results are reproducible and verifiable, therefore we do not agree that Dr Mann’s paper lacks full disclosure.”
Ref 1 is WA.
Steve, maybe you should stop by the UK on your trip and tell these guys a few things.
Does it not say something about the statistical independence of the multi-proxy approach upon which the papers are based? Or about the scientific independence of peer review which guided the publication process? Indeed it does.
#9. If they are trying to invoke statistical authority for this proposition, I would be more impressed by a quote from Draper and Smith than Lytle and Wahl.
An obvious reason for holding them accountable for the verification r2 statistic is simply that MBH said that they considered it in MBH98 and, in the AD1820 step where the values were favorable, provided a full-color illustration. Regardless of whether the verification r2 is a good thing or bad thing, Mann said that their study passed this statistic and presumably this added to its initial credibility in the eyes of anyone who took this claim at face value. Whether any climate scientists actually read the article prior to citing it is a different matter – perhaps no climate scientists read it closely enough to actually be misled.
Secondly, both Mann and themselves use verification r2 when it is to their advantage.
Third, in pseudoproxy studies, including studies by Mann, Ammann and Wahl, any model that successfully captures a signal and has a high RE value also passes a verification r2 statistic. While one can envisage toy examples (and WA give one) where an r2 might reject a decent reconstruction, they have not demonstrated any pseudoproxy cases where this is a relevant concern.
Fourth, they ignore the fact that in the calibration period there is a mathematical relationship between the RE and r2 statistics – I posed this as a puzzle a while ago. It’s an interesting puzzle.
Fifth, they assume that they have established that they’ve shown a statistically meaningful RE value. Our position is that their RE benchmarking is inappropriate. High RE values can be thrown up by any trends. The classic Yule spurious regression – alcoholism against Anglican marriages – has a very high RE statistic. The RE statistic has negligible power against spurious trends – which is the $64 question here if one is worried about bristlecones. They cite Ammann and Wahl for authority that an RE of 0 is significant (supposedly overturning our argument in MM05a and Reply to Huybers) – a rejected GRL submission. There are no significance tables for RE statistics.
I’ll probably get to a detailed consideration of this section of A&W now that I’ve sucked in my gut and gotten to it. I’m getting some pretty amusing results in the no PC section. The trouble that they’re having is that they’re trying to salvage MBH on the run and stepping out of the frying pan into the fire and they’re going to look pretty hopeless when all is said and done.
They like to talk about the “correct number of statistically significant principal components are employed” but selectively forget about the bristlecones.
In business, you learn that your first loss is generally your best loss. If there’s a problem, it nearly always gets worse. Whatever they think the pain will be in jettisoning MBH, they’d be better off taking the loss and jettisoning MBH.
If they now tell people that WA is validation of some kind, they are going to look like they really don’t know what they’re doing – which is probably true, but they shouldn’t be proving it quite so convincingly.
They fail r2 outside of the calibration and training periods simply because the statistics are not stationary (among other glaring problems, including components that are highly correlated). Nearly every text I’ve read clearly states that varying statistics require online (adaptive) methods for calculating components. There are also calculations that can be done to determine the level of “non-stationarity” in the data, and thresholds for determining if even online methods will work properly. I have citations if necessary (may take a day to revive them all).
Mark
#14. No, you’re looking for too conventional an explanation. Look at my next post on overfitting. It’s not that the relationship is varying – it’s just that you have overfitting on an unimaginable scale through their “inverse regression” method which nobody bothered figuring out.
#10:
Whoever Paul Monro is, he obviously has not read our E&E05 paper, nor has he grasped the real message of W&A, so maybe he didn’t read it either. (We were just recently told about another European scientist who declared to a questioner a similar refusal to read E&E05, even while chastising us for not doing the very analysis undertaken therein.) As Steve says above, W&A showed, as did we, that the hockey stick = the bristlecones. The PC error makes it seem legit to rely on them because it bounces them up to the PC1. That being the case, it’s a wonder why MBH even bother keeping all the rest of the data, since the bcp’s force the opposite conclusion than what the rest of the data would indicate.
His bristling dismissal of the social network analysis only serves to illustrate the Wegman panel’s point about the ‘self-reinforcing feedback mechanism’ and the isolation of that community.
The Wegman panel did study E&E05, and address the issues Monro raises in his summary of it on pp79-80 of his report. It’s too bad they didn’t discuss it more in the front matter of their report, but they obviously grasped the issues:
Re: RE
If one is trying to reconstruct temperature anomalies and is using a data set of temperatures from 1850-2000, then the RE statistics are inherently designed to be closer to one than if one were using a stationary data set. The RE, at least as I understand it (and in 40 years of statistics and econometrics I never ran into it until I started reading climate papers), is 1 minus the ratio of the sum of squared differences between predicted and actual values (both in the verification period) to the sum of squared deviations of the actual values (in the verification period) from the mean of the __calibration period__. [I don’t know how to produce bold or underline in this format.] Thus, to make an RE statistic bigger simply pick a data set for which the means of the verification and calibration period differ by a significant amount. [Steve, this may be a point that you made earlier, but I missed it.]
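Written out, the verbal definition above reads as follows. The notation is introduced here for illustration only: $V$ is the verification period with $n_V$ years, $C$ the calibration period, $y_t$ the observed value, $\hat{y}_t$ the reconstructed value, and $\bar{y}_V$, $\bar{y}_C$ the observed means over each period.

```latex
RE = 1 - \frac{\sum_{t \in V} (y_t - \hat{y}_t)^2}{\sum_{t \in V} (y_t - \bar{y}_C)^2},
\qquad
\sum_{t \in V} (y_t - \bar{y}_C)^2
  = \sum_{t \in V} (y_t - \bar{y}_V)^2 + n_V \,(\bar{y}_V - \bar{y}_C)^2 .
```

The second identity is ordinary algebra (the cross term vanishes because deviations from $\bar{y}_V$ sum to zero over $V$). It shows that the denominator grows with the squared gap between the two period means, which is the mechanism described in this comment.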
For the CRU anomalies the 1856-1902 mean is something like -0.32 while the 1902 to 1980 is something like -0.16, i.e. half the size. Run a Monte Carlo on these two with errors that average out with a calibration period R-squared, and you will get RE statistics bigger than the R-squared. Because the calibration period mean used in the denominator of the RE ratio is greatly different from the mean of the verification period, the deviations from that calibration period mean are very large and hence the ratio is very small. Thus the RE moves toward one when there is a significant (relative to the standard errors of the prediction) difference in the means of the variable being predicted in the two periods.
I will further speculate regarding the benchmarking of the RE statistic that it may have been done using a “no relationship” null hypothesis but not specifying the change in the means of the dependent variable. Note: how one does this change with a no-relationship relationship in a simulation is a bit tricky. For instance, what is the empirical average RE stat for the subset of “no-relationship” experiments that have an OLS significant relationship in the calibration period and a doubling or halving of the average values between the two periods? But if you, Steve, did that in your benchmarking you might want to start making a big deal about it and show the arithmetic of the incremental effect on the RE when there are differing means for the verification and calibration periods.
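A minimal Monte Carlo sketch of this effect follows. It is one simple way to set the experiment up, not necessarily the design the comment has in mind. The two period means are the approximate values quoted above; the noise level `sigma = 0.1`, the 47-year verification window and the trial count are assumptions. The "prediction" gets the verification-period level right but has no year-to-year skill at all.

```python
import numpy as np


def re_stat(actual, predicted, calibration_mean):
    """RE: 1 - SSE / sum of squared deviations from the calibration mean."""
    sse = np.sum((actual - predicted) ** 2)
    ss_ref = np.sum((actual - calibration_mean) ** 2)
    return 1.0 - sse / ss_ref


def r2_stat(actual, predicted):
    """Squared Pearson correlation over the verification period."""
    return np.corrcoef(actual, predicted)[0, 1] ** 2


rng = np.random.default_rng(1)
n_ver = 47                          # 1856-1902 inclusive
ver_mean, cal_mean = -0.32, -0.16   # approximate means quoted in the comment
sigma = 0.1                         # assumed year-to-year scatter (hypothetical)

re_vals, r2_vals = [], []
for _ in range(2000):
    actual = ver_mean + sigma * rng.standard_normal(n_ver)
    # Right mean level, but noise independent of the actual series
    predicted = ver_mean + sigma * rng.standard_normal(n_ver)
    re_vals.append(re_stat(actual, predicted, cal_mean))
    r2_vals.append(r2_stat(actual, predicted))

mean_re = float(np.mean(re_vals))
mean_r2 = float(np.mean(r2_vals))
print(f"mean RE = {mean_re:.2f}, mean verification r2 = {mean_r2:.2f}")
```

Under these assumptions the average RE comes out clearly positive while the average verification r2 stays near zero, because the RE credits the prediction for the shift in mean between the two periods.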
My simple Matlab simulations indicate that RE rewards heavy low-pass filtering (here). Comparison of the 2nd and 3rd figures is quite interesting: the RMSE and RE are smaller in the 3rd figure, but the residuals are correlated (is it a better reconstruction than the original measurement?). When I have more time I can write the equations down. The S/N ratio is a very important issue here.
In MM05a and improved in Reply to Huybers, we argued that an appropriate benchmark for the RE statistic in an MBH context was over 0.5, as compared to Mann and dendro benchmark for significance of 0.
Our modeling in these articles was probably needlessly complicated because you get very high RE statistics in “classical” spurious regressions e.g. Yule’s alcoholism versus C of E marriages.
The history of the RE score deserves a separate paper. I saw the same statistic (under a different name) mentioned for economic forecasting in Granger and Newbold 1973 – 1973 not the famous 1974 paper, citing earlier use by Theil.
My take on it is that the statistic has very low power (or no power) against spurious co-trending as evidenced by its values in the Yule case – which is the very problem of concern.
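The co-trending point can be sketched with synthetic data. This does not use Yule’s series; the trend slope, noise level, period lengths and trial count below are all assumptions. Two series are generated independently except for a shared linear trend, one is regressed on the other over a late calibration period, and the fit is scored over an early verification period.

```python
import numpy as np


def re_stat(actual, predicted, calibration_mean):
    """RE: 1 - SSE / sum of squared deviations from the calibration mean."""
    sse = np.sum((actual - predicted) ** 2)
    ss_ref = np.sum((actual - calibration_mean) ** 2)
    return 1.0 - sse / ss_ref


rng = np.random.default_rng(3)
years = np.arange(150)
ver = slice(0, 50)     # early verification period
cal = slice(50, 150)   # late calibration period

re_vals, r2_vals = [], []
for _ in range(1000):
    # Same deterministic trend, independent noise: no year-to-year link
    x = 0.01 * years + 0.3 * rng.standard_normal(years.size)
    y = 0.01 * years + 0.3 * rng.standard_normal(years.size)
    slope, intercept = np.polyfit(x[cal], y[cal], 1)  # OLS fit in calibration
    y_hat = intercept + slope * x[ver]
    re_vals.append(re_stat(y[ver], y_hat, y[cal].mean()))
    r2_vals.append(np.corrcoef(y[ver], y_hat)[0, 1] ** 2)

mean_re = float(np.mean(re_vals))
mean_r2 = float(np.mean(r2_vals))
print(f"mean RE = {mean_re:.2f}, mean verification r2 = {mean_r2:.2f}")
```

With these settings the RE is comfortably above zero on average even though the verification r2 is small, which is the pattern described for the Yule case: the RE responds to the shared trend rather than to any year-to-year relationship.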
#18. UC, you need to add some text explaining what you’re doing in these calculations. They look very interesting and I’d like to understand them, but I can’t follow it. Can you expand the description of the calculations?
#20:
I’ll clarify the presentation later (no paychecks from oil industry, so I need to do other work as well 🙂 ). But to put it shortly:
y=s+n, y is measurement, s is signal and n is additive Gaussian(0,4/12) noise. s is AR1 process with p=0.99 and driving noise var q=0.01. Optimal linear filter is used to obtain estimate of s (see e.g. Gelb, Applied Optimal Estimation). But the trick is that the filter uses underestimated p, 0.77. As a result, 2-sigma bars are broken, and the variability is clearly underestimated. Yet, RE and RMSE tell that ‘good work fella!’ Disclaimer: Haven’t double-checked the computations, colors of first figure are confusing, etc.
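The setup described above can be sketched in Python rather than Matlab. Several things here are assumptions: a scalar Kalman filter stands in for the “optimal linear filter”, Gaussian(0,4/12) is read as a variance of 4/12, and the series length and seed are arbitrary. The sketch checks only the 2-sigma coverage and the variability of the estimate; it does not reproduce the RE and RMSE figures mentioned in the comment.

```python
import numpy as np


def kalman_ar1(y, phi, q, r, p0):
    """Scalar Kalman filter for s_t = phi*s_{t-1} + w_t, y_t = s_t + n_t."""
    est = np.empty(len(y))
    var = np.empty(len(y))
    x, p = 0.0, p0
    for t, obs in enumerate(y):
        # Predict one step ahead with the assumed AR(1) coefficient
        x_pred = phi * x
        p_pred = phi * phi * p + q
        # Update with the new measurement
        gain = p_pred / (p_pred + r)
        x = x_pred + gain * (obs - x_pred)
        p = (1.0 - gain) * p_pred
        est[t], var[t] = x, p
    return est, var


rng = np.random.default_rng(2)
n, phi_true, q, r = 5000, 0.99, 0.01, 4.0 / 12.0
stationary_var = q / (1.0 - phi_true ** 2)

# Simulate the AR(1) signal and the noisy measurement y = s + n
s = np.empty(n)
s[0] = np.sqrt(stationary_var) * rng.standard_normal()
for t in range(1, n):
    s[t] = phi_true * s[t - 1] + np.sqrt(q) * rng.standard_normal()
y = s + np.sqrt(r) * rng.standard_normal(n)


def coverage(est, var):
    """Fraction of time steps where the true signal lies inside the 2-sigma bars."""
    return float(np.mean(np.abs(s - est) <= 2.0 * np.sqrt(var)))


est_ok, var_ok = kalman_ar1(y, phi_true, q, r, stationary_var)  # matched filter
est_bad, var_bad = kalman_ar1(y, 0.77, q, r, stationary_var)    # underestimated p

cov_ok, cov_bad = coverage(est_ok, var_ok), coverage(est_bad, var_bad)
print(f"2-sigma coverage: matched {cov_ok:.2f}, mis-specified {cov_bad:.2f}")
print(f"std of signal {np.std(s):.2f}, std of mis-specified estimate {np.std(est_bad):.2f}")
```

With the underestimated coefficient the filter reports error bars that are too narrow and returns an estimate with much less variability than the true signal, consistent with the description above.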
Steve,
It appears to me that WA and others on the team are responding to criticism of their work in ways that are very similar to the methods used by politicians when they wish to silence their detractors. Both groups make use of logical fallacies (Strawman, Appeal to Authority, Ad Homs, etc.) to “prove” their points. I am afraid that if this continues, certain branches of science will become indistinguishable from politics.
Wahl and Ammann “accepted March 1 2006” is finally online at Climatic Change. It will be interesting to see whether (a) the status of “Ammann and Wahl, (under review)” [at GRL] has been changed; or (b) whether the text of the paper supposedly “in press” for IPCC on Feb. 28, 2006 has been altered and how it has been altered to accommodate the rejection of the GRL paper.