An article by Peter Huybers has been accepted at GRL together with our Reply. I’m going to give a preview of this. This will take a few posts.
Our original article is here. (Copyright 2005 American Geophysical Union. Further reproduction or electronic distribution is not permitted.) The Huybers Comment is posted here. Our reply is here. (Accepted for publication in GRL. Copyright 2005 American Geophysical Union. Further reproduction or electronic distribution is not permitted.)
Before discussing Huybers, I’d like to re-post the Abstract of our GRL article entitled “Hockey Sticks, Principal Components and Spurious Significance”. There has been so much disinformation about the article — especially the supposed “MM reconstruction”, that it is useful to occasionally remind oneself of what we actually said. Our concerns about MBH98 were the biased PC methodology, robustness, statistical significance and proxy selection. Here’s our abstract:
The “hockey stick” shaped temperature reconstruction of Mann et al. [1998, 1999] has been widely applied. However it has not been previously noted in print that, prior to their principal components (PCs) analysis on tree ring networks, they carried out an unusual data transformation which strongly affects the resulting PCs. Their method, when tested on persistent red noise, nearly always produces a hockey stick shaped first principal component (PC1) and overstates the first eigenvalue. In the controversial 15th century period, the MBH98 method effectively selects only one species (bristlecone pine) into the critical North American PC1, making it implausible to describe it as the “dominant pattern of variance”. Through Monte Carlo analysis, we show that MBH98 benchmarks for significance of the Reduction of Error (RE) statistic are substantially under-stated and, using a range of cross-validation statistics, we show that the MBH98 15th century reconstruction lacks statistical significance.
On Apr. 28, 2005, Peter Huybers wrote a pleasant letter to me, inquiring about our work. (Our subsequent correspondence has mostly been cordial, although I marvel at the centrifugal tendencies of academic discourse.) Unlike the Hockey Team, I actually like inquiries and I sent him back a nice and detailed letter the same day, summarizing our work and giving some updated thoughts on these issues. This prompted an immediate response from Huybers. We’ve had lengthy correspondence and I do not propose to recapitulate it all. (I’ve done so with Crowley and Mann because of their egregious public misrepresentations of the actual correspondence.)
I provide a limited discussion of this correspondence to clarify points that are not clear in the article itself, especially where the correspondence evidences points of agreement, which are either unstated in the article or stated in very obscure terms.
In my opinion, our Reply is a complete response to Huybers’ points and, in one aspect, improves our original article. But you can decide for yourselves. The following headings correspond to the key points of our original abstract.
Our first point was obviously that the MBH PC method was different than represented in the original article and was strongly biased. We had reported that, when applied to persistent red noise, the MBH98 PC method yielded hockey sticks over 99% of the time. Subsequent to the GRL article, we’ve done new (unreported) studies on the effect of 1-2 flawed proxies (e.g. proxies with nonclimatic trends) on the MBH98 PC method and found that 1-2 "bad apples" have an even more profound effect than a pure red noise situation. I mentioned this unpublished work to Huybers in my first letter as follows:
Our main point with the MBH98 method (in statistical terms) was that it was biased – it mined for hockey stick shaped series. This has been confirmed by a few other commentators (e.g. von Storch, Zwiers) although the ultimate impact of this bias is controversial. Intuitively, if your conclusion is that climate has a hockey stick shaped history, using such a biased method is a pretty risky way of supporting that conclusion. …
In some other experiments, arising out of this discussion, we’ve experimented with simulations in which some proxies have an added non-climatic fertilization trend – this demonstrates the effect a little more clearly.
Huybers replied that the existence of the bias in the MBH98 PC method was not an issue as far as he’s concerned (with similar comments in later correspondence):
I thought the PDF you showed of the hockey-stick index was highly convincing regarding the bias of un-centered PCs.
In his article, Huybers credits us with pointing out a bias in the MBH reconstruction and later reports that "this bias was checked using a Monte Carlo algorithm independent of MM05’s". His opening paragraph mentions "having reproduced the statistical results of MM05".
We thought that Huybers’ comment was not very clear as to what he had replicated, so, in our Reply, we made a very clear statement on his agreement as follows:
McIntyre and McKitrick [2005a, “MM05”] showed that the actual MBH98 PC method used an unreported short-centering method, which was biased towards producing a hockey stick shaped PC1 with an inflated eigenvalue. Huybers concurs with these particular findings…
Our editor requested that we take this statement out on the basis that we were trying to divert attention away from the "real" differences of opinion. We argued vehemently against this on the basis that the eventual community wanted to know what we agreed on as well as what we disagreed on. I had raised this issue with Huybers directly, when I learned, some time after his first letter, that he was planning a Comment to GRL. I said:
While Comments in journals tend to be biased towards negative comments, in this case, many people are seeking guidance on what to think. To the extent that you agree with many or even some of our points, as indicated below, and have verified at least some of the points in dispute, I think that it would be very constructive to submit a Comment reporting on such verification and I think that GRL would probably welcome something like that. That’s not to say that you shouldn’t also submit on points of disagreement. However, under the circumstances, it’s such a contentious issue that comments about R2 versus RE, on robustness etc. would itself probably attract a lot of attention.
Huybers replied to this a few days later as follows:
Steve, as you noted earlier, comments tend to be rather negative, but it can also be useful to point out where results corroborate what you initially found. To that end, the comment both starts and concludes by calling attention to how short-centering biases the PCA results and hence the MBH98 reconstruction. I also made note of the possible non-temperature effects on the tree rings and the R2 statistical results you published….
We quoted this paragraph to our editor and he agreed that we could say that Huybers "concurred" with the finding of bias in the MBH PC methodology. But it seemed like an odd point to have to fight to get into print. I’ll refer to the paragraph above again in connection with the R2 results. Thus, whatever else may be in dispute, the data-mining bias of the MBH PC method is not on the table as far as Huybers was concerned. In fact, in one of his later emails, he said:
As I mentioned earlier, it seems to me the "short-centered" PCA does affect the results and this is a bias that should be accounted for. Efforts would seem better applied at correcting for the bias as opposed to arguing for its insignificance.
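To make the short-centering point concrete, here is a minimal Python sketch. It is not the MM05 code. It assumes AR(1) noise with a lag-one coefficient of 0.9 as a stand-in for persistent red noise, a 70-series network over 581 years, and centering on the final 79 years; the function names and all of these parameters are illustrative assumptions. It compares the first PC under full-period centering with the first PC under short-segment centering, using a hockey stick index defined here as the recent-period mean minus the full-period mean, in units of the full-period standard deviation.

```python
import numpy as np

def ar1_noise(n_years, n_series, rho, rng):
    """Simulate AR(1) noise, one series per column (assumed noise model)."""
    shocks = rng.standard_normal((n_years, n_series))
    series = np.zeros_like(shocks)
    series[0] = shocks[0]
    for t in range(1, n_years):
        series[t] = rho * series[t - 1] + shocks[t]
    return series

def first_pc(data, center_rows):
    """Return PC1 scores and PC1's share of the total sum of squares.

    Each column is centered on its mean over `center_rows` only, so passing
    a short segment reproduces the idea of short-segment centering.
    """
    centered = data - data[center_rows].mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0], s[0] ** 2 / np.sum(s ** 2)

def hockey_stick_index(series, n_recent):
    """Recent-period mean minus full-period mean, in full-period std units."""
    return (series[-n_recent:].mean() - series.mean()) / series.std()

rng = np.random.default_rng(0)
n_years, n_series, n_recent, rho = 581, 70, 79, 0.9  # illustrative assumptions
full_period = slice(None)
recent_period = slice(n_years - n_recent, n_years)

results = {"full": [], "short": []}
for _ in range(50):
    noise = ar1_noise(n_years, n_series, rho, rng)
    for label, rows in (("full", full_period), ("short", recent_period)):
        scores, share = first_pc(noise, rows)
        # The sign of a PC is arbitrary, so compare absolute index values.
        results[label].append((abs(hockey_stick_index(scores, n_recent)), share))

mean_hsi_full, mean_share_full = np.mean(results["full"], axis=0)
mean_hsi_short, mean_share_short = np.mean(results["short"], axis=0)
print("mean |HSI|     full-centered:", mean_hsi_full, " short-centered:", mean_hsi_short)
print("mean PC1 share full-centered:", mean_share_full, " short-centered:", mean_share_short)
```

The inputs here contain no signal at all, so any systematic difference between the two columns of output comes from the centering choice alone. The exact numbers depend on the assumed noise model and should not be read as a replication of the published figures.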
Non-Robustness to Bristlecones
The second big issue is the non-robustness of MBH98 to bristlecones. In our article, we said that (1) the biased MBH method preferentially selected bristlecones into their PC1 and that this series, which MBH had identified in previous controversy as the "dominant component of variance", consisted almost entirely of bristlecones, and (2) there were serious questions in the specialist literature [Graybill and Idso, 1993] about the validity of bristlecones as a temperature proxy due to potential CO2 fertilization. We expanded considerably on this issue in our EE article, where, in addition to CO2 fertilization, we noted other possible non-temperature factors including increased precipitation, phosphate fertilization, nitrate fertilization etc.
The non-robustness of MBH results to bristlecones is notably avoided by realclimate. Their defence is now that any methodology which either does not use the flawed bristlecones (or which reduces their weight in the final reconstruction) amounts to "throwing out" data. They don’t face up to exactly how they propose to reconcile this defence with their claims that their reconstruction is robust to the presence/absence of all dendroclimatic indicators (which presumably includes bristlecones).
I raised these issues in my April 28 letter to Huybers as follows:
we have tried to emphasize the relationship between data and methods. We then followed the biased method into the critical North American tree ring network to see what it did: it picked out bristlecone pines which have an anomalous 20th century growth spurt, explicitly said by specialists not to be temperature related. If you remove the bristlecone pines from the dataset, the hockey stick disappears…
We pointed out that using centered PC calculations and 2 retained PC series (as in MBH98) the bristlecone pattern goes down to the PC4. Mann et al. have argued that they can get high 15th century results if they use 5 PCs. (We discuss this in E&E). This goes to robustness – now climate history turns on the PC4 or alternatively on the bristlecones (which are a flawed proxy.) In effect, the biased method searches for and overweights the worst proxies. Another unattractive property of the biased method is illustrated in our E&E article, where we show that the MBH98 method will invert increased ring widths in 15th century non-bristlecones and show lower 15th century results.
Huybers seemed to agree with our concerns about the dependence of MBH results on bristlecones with the following:
I have followed this discussion. Results should be made as robust as possible, and I agree that it is unsatisfying when results are sensitive to a small subset of the data.
In our article, we pointed out that the bristlecones were weighted very differently under the MBH PC methodology and under a conventional PC calculation. We illustrated the differences in Figure 3 of our GRL article, showing that the differences were not just trivial. There are two options in PC algorithms – using a covariance matrix and using a correlation matrix. MBH only said that they used a "conventional" calculation. If a network is in common units, the conventional methodology is a covariance PC calculation, which is what we used to make Figure 3 and what we used in our emulation of MBH98 under centered PC calculations presented in our EE article (no such calculations were presented in our GRL article). We were not saying that covariance PCs would yield a meaningful indicator out of the swamp of MBH tree ring chronologies – only that this is what a reasonable person implementing MBH methodology would do. Huybers asked about this in his first letter. I replied:
In our calculations, we used the covariance matrix rather than the correlation matrix. Tree ring chronologies are already standardized to dimensionless ratios. So I think that the use of the covariance matrix rather than the correlation matrix is the preferred route and can be justified in standard texts. The use of a correlation matrix (i.e. re-normalizing) is certainly an option, but climate history should not stand or fall on this choice. The bristlecones do get promoted higher with a correlation matrix than with a covariance matrix. In our recent debates with Mann, they’ve also used the covariance matrix – their PC4 under centered calculations (shown at realclimate) matches our own calculations. The big issues are robustness and proxy selection – these remain unchanged.
Huybers’ response seemed to agree with this position, replying:
Yes, it is again rather unsatisfying that answers are so sensitive to seemingly small changes in technique.
However, this ended up being one of the two main issues in his Comment. I’ll discuss this at length in my next post.
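For readers who want to see how much the covariance/correlation choice can matter in principle, here is a small Python sketch on a made-up network. Everything in it is a toy assumption: five low-variance series share a common pattern and fifteen high-variance series are independent noise. It is built only to show that the leading PC can load on entirely different series depending on whether the columns are re-normalized; it does not model the NOAMER network or the bristlecones.

```python
import numpy as np

def pc1_weights(data, use_correlation):
    """Loadings of the first PC from a centered SVD.

    use_correlation=False -> covariance PCA (columns centered only)
    use_correlation=True  -> correlation PCA (columns centered and scaled to unit std)
    """
    centered = data - data.mean(axis=0)
    if use_correlation:
        centered = centered / centered.std(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(1)
n_years = 400
shared = rng.standard_normal(n_years)  # hypothetical common pattern

# Five low-variance series that share the common pattern.
group = 0.5 * (shared[:, None] + 0.5 * rng.standard_normal((n_years, 5)))
# Fifteen high-variance series of independent noise.
others = 2.0 * rng.standard_normal((n_years, 15))
network = np.hstack([group, others])

def group_share(weights):
    """Fraction of PC1's squared loadings carried by the five shared-pattern series."""
    return float(np.sum(weights[:5] ** 2))

cov_share = group_share(pc1_weights(network, use_correlation=False))
corr_share = group_share(pc1_weights(network, use_correlation=True))
print("shared-group share of PC1 loadings, covariance PCA: ", cov_share)
print("shared-group share of PC1 loadings, correlation PCA:", corr_share)
```

In this toy case the covariance PC1 is dominated by the high-variance noise series, while the correlation PC1 is dominated by the shared-pattern group. Which behaviour is appropriate depends on whether the series are already in common units, which is exactly the point in dispute.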
As to the impact of bristlecones, Huybers did not directly address our finding. He acknowledged that the validity of bristlecones as a proxy would be a valid question for "future work" as follows:
"Another point raised by MM05 is that many of the strongest trends in tree ring chronologies may be unrelated to temperature change [Graybill and Idso, 1993] – in future studies, this may warrant the exclusion or down-weighting of certain records, but this is an additional step which would have to be explicitly stated."
In our Reply, we pointed out that the net effect of the correlation PC1 was to increase the weight of the bristlecones and that it is a little late in the day to defer the determination of bristlecone validity to "future studies". IPCC 2AR had already taken a position that CO2 fertilization was an issue that needed to be handled with caution prior to relying on tree ring chronologies. In our opinion, it is unacceptable that the Graybill and Idso bristlecones – the sites most identified with the issue of CO2 fertilization – should have come to dominate the canonical temperature reconstruction through the back door. If climate reconstruction is to depend on proxies which may be affected by fertilization, this issue should have been articulated and argued in the plain light of day back in 1998, rather than being decoded some years later. MBH98 warranted that the proxies had been carefully selected. You can’t now say that the study of their validity should be deferred to "future studies".
Thus, aside from all the technical issues about correlation versus covariance PCs – and we think that our position is impeccable on these issues – the fact that the correlation PC1 increases the impact of bristlecones is not itself a result that should be automatically accepted. Any unsupervised algorithm like PCA requires a little supervision prior to making a climate history.
The third leg of our argument — “spurious significance” — is a term in the title of our article that is seldom discussed in the controversy by the realclimate side (for obvious reasons). In our article, we pointed out that, in the controversial 15th century, the MBH reconstruction failed (catastrophically) the most standard cross-validation test (R2), a test used by the Hockey Team in other studies and for the AD1820 step of MBH98 itself, as well as failing other standard cross-validation statistics (CE etc), used by the Hockey Team elsewhere.
While it was then only a surmise that MBH had calculated cross-validation R2 statistics for the 15th century (and not reported), the recent source code provided to the Barton Committee proves beyond a doubt that the cross-validation R2 statistic was calculated (and not reported). As reported on this blog, the source code shows the calculation of the cross-validation R2 very clearly together with other statistics. Then when one examines the SI to MBH98, the other statistics are collated into a table but the cross-validation R2 is left out. (There’s one other statistic that’s left out: the RE for their El Nino reconstruction, although in this case, they inconsistently reported the cross-validation R2 statistic. One is left to guess that the El Nino cross-validation RE statistics might also be pretty bad.)
If the MBH temperature index has a true underlying relationship to temperature, it is impossible for it to have a cross-validation R2 of 0.01. It is not the only statistic that should be looked at, but it is a very important one to look at. If a reconstruction fails a very simple test like this, then it’s not a valid reconstruction. The realclimate response, such as it is, is to present a bizarre and hypothetical synthetic case where a reconstruction passes an R2 test with flying colors, but does not pass an RE test. They then accuse us of proposing the exclusive use of the R2 test. Obviously, we did no such thing. We commented on these issues to Huybers in our original letter as follows:
What we observed is that the PC1s from the simulations yielded a spuriously high RE statistic and negligible R2. This is what we observed in the MBH98 15th century calculation. It’s my view that if they are recovering a "temperature signal", they should have both a high RE and R2 statistic…
The take-home point is really that you can have spuriously high RE statistics and that a "skilful" reconstruction should pass several verification statistics. Mann’s response to this has been a diatribe against the R2 statistic.
Again Huybers seemed to agree as follows:
It would be more convincing that the reconstruction had skill if both the R2 and RE statistics were significant. I believe this is a valid point which you made clearly in the GRL paper. The MBH98 methodology is unfortunately difficult to understand – climate reconstructions would benefit both from greater accuracy and clarity.
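The point about needing more than one verification statistic can be illustrated with a short Python sketch using the standard definitions of RE, CE and the verification R2. The data are invented: a 48-year verification period in which both the observations and the "reconstruction" sit about 0.3 below the calibration mean, with year-to-year wiggles that are independent of one another. The numbers and variable names are assumptions for illustration and are not taken from MBH98 or from our article.

```python
import numpy as np

def reduction_of_error(observed, estimated, calibration_mean):
    """RE: skill relative to always predicting the calibration-period mean."""
    resid = np.sum((observed - estimated) ** 2)
    base = np.sum((observed - calibration_mean) ** 2)
    return 1.0 - resid / base

def coefficient_of_efficiency(observed, estimated):
    """CE: skill relative to always predicting the verification-period mean."""
    resid = np.sum((observed - estimated) ** 2)
    base = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - resid / base

def verification_r2(observed, estimated):
    """Squared Pearson correlation over the verification period."""
    return np.corrcoef(observed, estimated)[0, 1] ** 2

rng = np.random.default_rng(2)
n_verif = 48            # assumed length of the verification period
calibration_mean = 0.0  # assumed mean of observations in the calibration period

# Both series share a mean offset of -0.3 but have unrelated annual variation.
observed = -0.3 + 0.1 * rng.standard_normal(n_verif)
estimated = -0.3 + 0.1 * rng.standard_normal(n_verif)

re_stat = reduction_of_error(observed, estimated, calibration_mean)
ce_stat = coefficient_of_efficiency(observed, estimated)
r2_stat = verification_r2(observed, estimated)
print("RE:", re_stat, " CE:", ce_stat, " R2:", r2_stat)
```

RE rewards the estimate for getting the shift in mean level right, even though it tracks none of the year-to-year variation, which is what R2 and CE measure. A series that scores well on RE alone has therefore not demonstrated skill.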
In the comment quoted above, Huybers said that in his article:
I also made note of the possible non-temperature effects on the tree rings and the R2 statistical results you published….
In his first version of his article, Huybers stated:
Note that unlike RE, the cross-correlation statistic [R2] is insensitive to changes in variance and thus the MM05 estimate of the cross-correlation critical value (indicating that the observed cross-correlation is insignificant) is not biased.
Somewhat perversely, this statement was left out of the final version. (This statement is slightly incorrect: the R2 statistic has a known distribution. We concluded that it was insignificant based on tables, rather than simulations, although the simulations gave results that were more or less consistent with the tables.) We initially interpreted this comment, together with Huybers’ correspondence, as indicating that he had replicated our R2 results and we felt that it was important to report this apparent agreement. Again our editor felt that this was an attempt to "divert" attention from the "real" issues. Then it turned out that Huybers did not appear to have calculated the R2 statistic. (Note that Wahl and Ammann also failed to report the cross-validation R2 statistic on their webpage. Given our specific focus on this issue, this reluctance to calculate a simple statistic seems incomprehensible – you’d almost think that they were all afraid of the answer.) Anyway we negotiated back and forth on what we could say about Huybers’ position. Eventually Huybers agreed that we could say that he "did not dispute" our calculation of the R2 statistic, so we went with that and our editor agreed.
So if Huybers agrees or doesn’t disagree about these things, what is Huybers’ article actually about? It’s about two things.
First, he argues that Figure 3 of our article, which illustrated the impact of the biased MBH method on the NOAMER network, “exaggerated” the bias by using a covariance PC1 instead of a correlation PC1, another possible PC methodology, although one that we do not believe can be justified. We will show that both of Huybers’ references recommend use of covariance PC1s when the networks are in common units (as here, where the networks are already standardized to dimensionless units). Huybers provides a figure supposedly showing this exaggeration. But when you examine the figure, you find that the supposed "bias" depends on short-segment centering – a bizarre and ironic situation to say the least. Huybers also described the covariance PC1 as the “MM05 normalization”, which we supposedly "proposed" as a means of "removing the bias" in MBH98. Of course we did no such thing. We protested vehemently to GRL about this mischaracterization of what we did, but had no success whatever in getting this language altered. Any reader of our works knows that we endorse nothing in MBH methodology and have not proposed any methods of "removing a bias". Merely demonstrating the result of a covariance PC calculation on the NOAMER network does not transmogrify that calculation into an “MM normalization convention”. More on this over the weekend.
The second issue pertains to the benchmarking of the RE statistic. Huybers observed that variances of the simulated PC1s used in our explanation of the spurious RE statistic were less than the corresponding target variances and proposed that these should be re-scaled so that the variances matched the target variances. When Huybers re-scaled the variances, he got an RE benchmark of 0, so that the MBH result once again seemed to have statistical “significance”. Readers of this blog will be attentive to issues of “spurious” significance, i.e. where a statistic is high even without a valid model. Spurious relationships can sometimes be discovered by looking at other statistics, e.g. the Durbin-Watson in some of the discussions we’ve had here. Even if Huybers were right, all that would happen is that the spurious RE statistic would be unexplained.
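The mechanics of the re-scaling step can be sketched in a few lines of Python. This is only an illustration of what "re-scaling so that the variances match" means and why it changes RE. It assumes the re-scaling is done about the estimate's own calibration-period mean, uses invented series, and reproduces neither our benchmark nor Huybers'. The period lengths, names and coefficients are all hypothetical.

```python
import numpy as np

def rescale_to_target_variance(estimate, target, calibration):
    """Scale `estimate` so its calibration-period std matches that of `target`.

    Assumption: scaling is applied about the estimate's calibration-period mean,
    so that mean is left unchanged.
    """
    est_cal = estimate[calibration]
    scale = target[calibration].std() / est_cal.std()
    return est_cal.mean() + scale * (estimate - est_cal.mean())

def reduction_of_error(observed, estimated, calibration_mean):
    """RE: skill relative to always predicting the calibration-period mean."""
    resid = np.sum((observed - estimated) ** 2)
    base = np.sum((observed - calibration_mean) ** 2)
    return 1.0 - resid / base

rng = np.random.default_rng(3)
n_verif, n_cal = 48, 79                       # hypothetical period lengths
verification = slice(0, n_verif)              # earlier years
calibration = slice(n_verif, n_verif + n_cal) # later years

trend = np.linspace(-0.3, 0.3, n_verif + n_cal)
target = trend + 0.15 * rng.standard_normal(n_verif + n_cal)
# A deliberately low-variance estimate of the target (invented for illustration).
estimate = 0.4 * (trend + 0.15 * rng.standard_normal(n_verif + n_cal))

rescaled = rescale_to_target_variance(estimate, target, calibration)
cal_mean = target[calibration].mean()

re_raw = reduction_of_error(target[verification], estimate[verification], cal_mean)
re_rescaled = reduction_of_error(target[verification], rescaled[verification], cal_mean)
print("RE before re-scaling:", re_raw, " after re-scaling:", re_rescaled)
```

Because RE depends on squared errors, it is sensitive to the amplitude of the estimate, so the same series can give different RE values before and after re-scaling. The verification R2 is unaffected by a re-scaling of this kind.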
However, we had a really interesting response, which built on some prior discussions on this blog. You may recall John Hunter grinding me in the spring about whether the hockey stick-ness observed in the simulated PC1s would carry through to a NH reconstruction under MBH methods. (Having raised a good question, he rather spoiled the effect by asking me for an answer about every 4 hours, as though I had nothing else to do.) In a reasonable length of time (a few days), I checked the results by making up a proxy network consisting of 21 white noise series and 1 simulated PC1. The hockey stick-ness carried through, which I reported on the blog.
I used the same approach in replying to Huybers. I used the simulated PC1s saved from our GRL simulations, made up networks of 22 series (with 21 white noise series) and did an MBH-type calculation. Bingo. We got high RE statistics together with matching variances. Our explanation that the RE statistic was spurious held together. I’m going to describe these results in detail in a follow-up post, because I found them interesting and because they show some interesting statistical aspects of the MBH model. I’ll show some very pretty (to me) diagrams illustrating what’s going on. I’ll also show why the "other 20" MBH proxies model like white noise.
As you will see, our view is that our simulations using a pseudoproxy network (rather than Huybers’ simple re-scaling) completely outflank Huybers. In my opinion, there should be no residual point of dispute between competent people (as Huybers is). However, the affair is very partisan, and, rather than these new outflanking results putting an end to the issue, I was (and remain) concerned that partisans will pick on some of Huybers’ claims as prolonging the controversy, even if completely refuted.
Given what I perceive as being an underlying interest in resolving these and similar matters, I made the following offer to Huybers:
1) that he could review our simulations replying to him with no obligation;
2) if he agreed with our results — and only if he fully agreed with our results — then we would submit a joint paper to GRL reporting on agreed results.
3) if he did not agree fully with our Reply, then we would proceed on the course that we were pursuing.
I suggested to him that, while this might be unorthodox in academic terms, it was something that I thought GRL would welcome and that certainly the broader community would welcome. Huybers showed no interest whatever in this offer. So readers will be left one more time to try to sort this stuff out. I thought that we made a good suggestion and I’m sorry that the opportunity was missed. Having said that, I’m satisfied that our response is accurate and thorough, that Huybers has made no points which are not completely responded to, and I am quite content to let the chips fall where they may.
More on this over the next few days.