I’m going to give a fairly brief account of previous attempts to get the residual series and/or cross-validation R2 from Mann, including inquiries to Mann, N.S.F., through Nature, by Climatic Change, by Natuurwetenschap & Techniek and by the House Energy and Commerce Committee. As you will see, no one has been able to get Mann to disclose the information – even with a very direct question by the House Committee.
Do residuals and cross-validation statistics "matter" and should Mann have to disclose them? Well, they are vital to consideration of any statistical model. They should be every bit as important to a climate scientist as DNA fingerprints and stem cell colony photos are to stem cell researchers. Imagine if the SI to the Hwang article had not contained this information? Without the detailed SI, Hwang would still be in business. But there are many reasons short of fraud to examine the residuals; concern about fraud is probably the last reason. But none of these other reasons have so far prevailed. Preparing this review has reminded me just how determined Mann has been in avoiding disclosure of this information and the dangerous line that he is treading with respect to the House Energy and Commerce Committee.
Mann and NSF
To reprise briefly my recent discussion of this stage: on December 17, 2003, I requested the residual series from Mann, copying David Verardo. David Verardo had replied here that the source code was Mann’s personal property. This was a highly questionable assertion, since Mann’s terms of employment appear to provide that the source code was university property, a point discussed at Title to Source Code and The Tort of Conversion here.
After being rebuffed by Mann in the request for residuals, on Dec. 17, 2003, we added this request to our existing Materials Complaint to Nature here (item 3). Nature promised to seek "external independent advice" on these matters, but failed to do so as is evident in the correspondence file.
In February 2004, since Mann’s response to the Materials Complaint had not satisfied them, Nature advised me that they would require a Corrigendum, saying that they "trust that the responses answer all your queries, and that you find this resolution of the matter satisfactory". When we expressed concerns, they said "The authors have assured us that the data sets and methods are revealed completely and accurately, and we are confident that they are as keen as yourself to resolve the matter."
When we saw the draft Corrigendum, we pointed out many problems, including our concern about whether the residual series were in the proposed SI (see heading Supplementary Information item 3). We saw the draft SI only in June 2004 after being directed there in connection with our review of a submission by MBH to Climatic Change. We noticed that the requested information on cross-validation statistics and residuals was not in the draft SI and immediately notified Nature. I received a temporizing reply: "We do hope that this will provide you with the information that you are after, but please do not hesitate to get in touch if further problems remain."
When we saw the referee comments in August 2004, we realized that the referees had not been involved in refereeing the Corrigendum. For example, one of the referees said:
For instance, questions that seem to be quite critical, such as the sensitivity of the MBH98 reconstructions in more remote periods to changes or omissions in the proxy network or the dependency of the final results to the rescaling of the reconstructed PCs, have become clearer to me now. From the reply in MBH04 I am now afraid that they were not sufficiently described in the original MBH98 work. In particular the PCs renormalization, could have been included as clarification in the recent Corrigendum in Nature by MBH.
He also said that our investigations should not be "hampered" as follows:
I would encourage them to pursue their testing of MMB98,and by the way other reconstructions. As I wrote in my first evaluation, this should be a normal and sound scientific process that should not hampered.
On August 10, 2004, we re-iterated our longstanding requests for the residuals and source code. These were referred to the Editor himself. On Sep. 7, 2004, the tortuous process reached a dead end as follows:
And with regard to the additional experimental results that you request, our view is that this too goes beyond an obligation on the part of the authors, given that the full listing of the source data and documentation of the procedures used to generate the final findings are provided in the corrected Supplementary Information. (This is the most that we would normally require of any author.)
Obviously not every author had been required to issue a completely new SI. Further, MBH had certainly not received a clean bill of health from the referees. In my opinion, regardless of a publication decision on our comment, this position by Nature – taken directly by Philip Campbell – was completely unreasonable.
Climatic Change 2004
Late in 2003, MBH had submitted an article excoriating our 2003 article to Climatic Change. You can see a reference to it on Stephen Schneider’s reference list here. Mann and others had vehemently objected to our publishing at E&E without their having a chance to review (notwithstanding Mann’s prior statement to us that he was too busy; see #16), a position reported at Schneider’s website as follows:
Mann and his colleagues and other members of the scientific community were outraged when they learned of the publication of the McIntyre/McKitrick article. Most credible scientific journals receiving criticism of previously published work typically give the authors under fire the chance to review and respond to an article challenging their claims.
In fairness to this position, in late 2003, Schneider offered me a chance to review the MBH submission criticizing our 2003 paper. (I’ve found Schneider to always be an engaging and cheerful correspondent.) In my capacity as a reviewer, I promptly requested the source code and residual series, in this case also specifically requesting the Durbin-Watson statistic. (I wasn’t then thinking about anything as mundane as total failure of cross-validation R2 statistics.) This occasioned a lengthy and very interesting correspondence. Schneider advised me that no one had ever requested source code in his 28 years of editing the journal and that even making the request required an editorial decision. Eventually I was informed that Climatic Change had adopted a policy requiring authors to provide supporting data, but not source code. I re-iterated my request for the residual series as supporting data and Schneider duly requested the information from Mann. Mann provided a URL for the SI being prepared for the Nature Corrigendum, without mentioning to Schneider that this was its purpose; Schneider interpreted it as Mann merely being helpful. But Mann refused the request for the residuals and the Durbin-Watson statistic in no uncertain terms, as follows:
It is not our responsibility to provide [the residual series], we have neither the time nor the inclination to do so. These can be readily produced by anyone seeking to reproduce our analysis, based on the data we have made available, and our method which we have described in detail
For the Durbin-Watson statistic, they said "We did not describe such statistics in our study." Can you imagine an econometrician satisfying an editor with such a statement?
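For readers who have not run into the Durbin-Watson statistic, it is computed directly from the residual series: the sum of squared first differences of the residuals divided by the sum of squared residuals. Here is a minimal sketch in Python. The function name and the synthetic residual series are my own illustrations and have nothing to do with the MBH98 residuals, which were not provided.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared first differences of the residuals over their sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Synthetic residuals for illustration only (not real data).
rng = np.random.default_rng(0)
n = 500
white = rng.standard_normal(n)        # uncorrelated residuals

ar1 = np.empty(n)                     # strongly autocorrelated residuals
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = 0.9 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
print(f"white noise DW = {dw_white:.2f}, AR(1) DW = {dw_ar1:.2f}")
```

Values near 2 indicate little first-order autocorrelation in the residuals; values well below 2 indicate positive autocorrelation, the usual warning sign in a time-series regression.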
Schneider sent the MBH response to me with the comment:
I am hopeful that you will now be better able to complete your review, though not all items you requested–in particular source code–are included.
I dutifully wrote a review, pointing out that MBH had just stuck a finger in the eye of Climatic Change’s policy on providing supporting data and had thereby disqualified themselves. In my opinion, Schneider should have dealt with the matter editorially as soon as Mann refused to provide the supporting data. Why did he need me to write a review after such an overt refusal to supply supporting data? He’d already seen the breach of policy for himself and should have reacted accordingly right away.
I never heard any more about the submission, but, in any event, the MBH submission was never published. However, by then, Jones and Mann had cited MBH [submitted to Climatic Change] as authority for statements hyper-ventilating against us. (They did not withdraw these statements when the submission was not published.) In terms of our getting the residuals, Mann et al. had withdrawn their paper rather than provide the residual series. (I re-iterate that we did not know that there were problems as elementary as the cross-validation R2 when we started asking for the residuals; however, Mann did. This undoubtedly accounts for the almost hysterical comments about the R2 statistics when we first mentioned them in our revised Nature submission.)
Natuurwetenschap & Techniek
In our submission at Nature, notwithstanding our explicit statements that we were not offering an "alternative" reconstruction, one of the reviewers asked us for cross-validation statistics. Grudgingly, we did the calculations for our re-submission and first noticed low R2 values for the 15th century step in our emulation of MBH98; we reported this, but rather as an afterthought. Our main focus was on the remarkable lack of "robustness" of results to a few series and to the methodological fingers on the scale through the PC method and the "editing" of the Gaspé series. In our 800-word version, we even dropped this point. However, the passing mention of the low R2 provoked a hysterical response from Mann in his Reply to Referees – he fulminated against the R2 statistic on no fewer than 5 occasions in the Reply. (These fulminations ultimately resulted in the curious diatribe against the R2 statistic in Rutherford et al., a diatribe which is completely inconsistent with prior positions of the parties on cross-validation R2 statistics. The only motivation for this diatribe was our pending Nature submission, which theoretically was governed by confidentiality restrictions prohibiting the use of the material by the responding authors for their own purposes. However, that’s another story.)
On our side, the issue of cross-validation statistics focussed a little more when we saw the referee comments in August. They spent a lot of time on RE versus R2 – far beyond anything in our revised submission, where the R2 comment had been exported to the SI. I suspect that Mann brought the attention upon himself by his fulminations against the R2 statistic in his Reply to Referees.
One of the referees (#2 here), the one who also said "I am particularly unimpressed by the MBH style of ‘shouting louder and longer so they must be right’", gave short shrift to Mann’s argument for the supremacy of RE over R2, and his comments are very perspicacious given subsequent MBH positions:
The advocacy of RE in preference to r by MBH is a bit extreme. The correlation coefficient certainly has drawbacks, but no verification measure is perfect, and I see no evidence in the verification literature (or Wilks) that RE is the standard preferred measure. Indeed the only one of the 3 references (7) cited in the revised response that was available to me is somewhat critical of RE. My preference would be not to rely on a single measure, but to look at contributions form bias, differences in variances and departures from linear dependence.
This referee was a statistician who specialized in principal components. (A reader has written to me that he is in fact a very eminent specialist if you’re trying to guess.) However, we lost ground with Referee #2 of our original article (#3 in the re-submission), who was strongly influenced by RE arguments. (Frustratingly, he benchmarked the RE versus R2 against the AD1820 step, where the R2 is favorable. This step was illustrated in a map in MBH98 – obvious proof that they used the R2 statistic when it was to their advantage.) If you read the comments of this referee, you’ll notice the mis-spelling McKritik, which is a Germanic mis-spelling that I’ve noticed elsewhere, giving some clues as to who the referee might be.
While the referee position was frustrating in terms of getting published at Nature, the comments do not give MBH98 a clean bill of health and, as I mentioned above, should have occasioned a fresh re-refereeing of MBH98 itself, which did not take place. For our purposes, we realized that we had to deal directly with RE statistics in a way that we’d not done in our Nature submission. This led directly to a complete re-thinking of the topic expressed in our GRL article, which was far more than a regurgitation of the Nature submission.
We improved our MBH emulation using the new data at the Corrigendum SI (available in July 2004) and felt confident enough in our emulation to assert that cross-validation R2 for the 15th century step was approximately 0.02 – obviously a damning result (as were other standard cross-validation statistics used in paleoclimate such as the CE, product mean test and sign test.) We reported this in our GRL article and no one to date has denied this, although we’ve been accused of many things. With such a lousy verification R2 statistic, we argued that it was impossible for the underlying model to have statistical significance and thus that the seemingly significant RE statistic was in fact "spurious", using the term in a statistical sense [Granger and Newbold, 1974; Phillips 1986] rather than in the Mannian sense of merely being a term of disapproval.
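To make the distinction between these statistics concrete, here is a minimal sketch of the verification r2, RE and CE as they are commonly defined in the paleoclimate literature: RE benchmarks the reconstruction errors against the calibration-period mean, CE against the verification-period mean. The function name, variable names and synthetic series are illustrative assumptions only; this is not the MBH98 data or our emulation. The synthetic "reconstruction" matches the shift in mean between the calibration and verification periods but carries no year-to-year information.

```python
import numpy as np

def verification_stats(obs, rec, calib_mean):
    """Verification-period r2, RE and CE for observed vs. reconstructed series."""
    obs = np.asarray(obs, dtype=float)
    rec = np.asarray(rec, dtype=float)
    sse = np.sum((obs - rec) ** 2)                      # squared reconstruction errors
    r = np.corrcoef(obs, rec)[0, 1]                     # Pearson correlation
    re = 1.0 - sse / np.sum((obs - calib_mean) ** 2)    # benchmark: calibration mean
    ce = 1.0 - sse / np.sum((obs - obs.mean()) ** 2)    # benchmark: verification mean
    return {"r2": float(r ** 2), "RE": float(re), "CE": float(ce)}

# Synthetic illustration only: the calibration-period mean is 0, the
# verification-period observations sit about 0.3 lower, and the
# "reconstruction" gets that offset right but is otherwise independent noise.
rng = np.random.default_rng(1)
n = 200
obs = -0.3 + 0.1 * rng.standard_normal(n)
rec = -0.3 + 0.1 * rng.standard_normal(n)

stats = verification_stats(obs, rec, calib_mean=0.0)
print({name: round(value, 3) for name, value in stats.items()})
```

In this contrived case the RE comes out high while the r2 is negligible and the CE is negative, which is why it is worth looking at more than one statistic.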
Further, it seemed impossible to us that Mann would not have calculated the R2 statistic (especially since the R2 statistic for the AD1820 step was shown in a map by gridcell). Since it was not reported in the original SI, the only conclusion was that Mann had withheld the information. We eventually commented on this in very sharp terms in our E&E article.
So this topic was very fresh in my mind when, in late 2004, I was interviewed by Marcel Crok, a reporter for Natuurwetenschap & Techniek. Like others, he initially viewed the story as an unlikely curiosity, but gradually got very intrigued and wrote a lengthy article. As part of his due diligence, he asked me for some questions to ask Mann through which he could try to differentiate our positions. I suggested the question about the cross-validation R2. So the question from NWT to Mann directly asking about it was really the first direct inquiry for the cross-validation R2 (as opposed to the inquiry for the residual series which would have led to it). I’ve excerpted the full dialogue from NWT since it is really quite provocative, but I urge interested parties to re-read the full letter at NWT since it gives such a nice flavor to Mann’s efforts to block any criticism.
[[2) There is a severe debate between you and MM about the skill of the calculation. You claim a high RE-statistic. MM show that their simulated hockey sticks also give a high RE-statistic but a very low R^2 statistic. ]]
We showed in our reply to the REJECT MM comment to Nature, that they incorrectly calculated all of their verification statistics, because they didn’t account for the changing spatial sampling of the Northern Hemisphere temperature record back in time. See the attached supplementary information ("supplementary3.pdf"–read page 2) that was provided to the reviewers of the rejected comment by McIntyre and McKitrick. Keep in mind that the reviewers of their Nature comment, who had the expertise and full available material to judge whether or not MM’s claims were plausible, decided that they were not.
Our reconstruction passes both RE and R^2 verification statistics if calculated correctly. Wahl and Ammann (in press) reproduce our RE results (which are twice as high as those estimated by MM), and cannot reproduce MM’s results. There is little, if anything correct, in what MM have published or claimed. Again, none of their claims have passed a legitimate scientific peer review process!
See also Rutherford et al (in press–see above) for an extensive discussion of cross-validation, and the relative merits of different metrics (RE vs CE vs r2). It is well known to any scientists in meteorology or climatology that RE is the preferred metric for skill validation because it accounts for changes in mean and variance prior to the calibration interval (which R^2 does not!). The preferred use of RE dates back to the famous paper by Lorenz in evaluated skill in meteorological forecasts.
It must be stated that McKitrick has been shown to be prone to making major errors in his published work. You should refer to the discussions here:
particularly interesting, in the context of this discussion, is his failure in an independent context (the Michaels and McKitrick paper discussed in the first link) to understand the issue of cross-validation! That is, in both the McIntyre and McKitrick ’03 paper, and the Michaels and McKitrick ’05 paper, the authors failed to even understand the importance of performing cross-validation! Such papers could never be published in a respected scientific journal.
[[In MBH98 you didn’t calculate the R^2 statistic, but in Mann and Jones (2003) you did. I asked Eduardo Zorita questions about this and he said he would calculate both. Why didn’t you calculate the R^2 in MBH98? ]]
Repeating what I said above, see Rutherford et al (in press–see above) for an extensive discussion of cross-validation, and the relative merits of different metrics (RE vs CE vs R^2). It is well known to any scientists in meteorology or climatology that RE is the preferred metric for skill validation because it accounts for changes in mean and variance prior to the calibration interval (which R^2 does not!). The preferred use of RE dates back to the famous paper by Lorenz in evaluated skill in meteorological forecasts.
It’s interesting to re-read Mann’s answer, especially in light of subsequent access to Mann’s source code in the summer of 2005, which provided incontrovertible evidence that Mann had in fact calculated the cross-validation R2 statistic (which was then not reported.) See for example Cross-Validation R2 and More on Cross-Validation R2. Note that Mann told NWT that his reconstruction "passes both RE and R^2 verification statistics if calculated correctly." This is obviously a different claim than saying that the RE statistic should be preferred over the R2.
In any event, while Mann fulminated at length, you will note that he did not provide the cross-validation R2 statistic in question to NWT.
House Energy and Commerce Committee
Now comes a remarkable twist to the story. The House Energy and Commerce Committee became intrigued with the matter when Mann, who had testified to Congress, injudiciously told the Wall Street Journal that he would not be intimidated into disclosing his algorithm (unofficial online copy here). The House Committee asked Mann (and Bradley and Hughes) straight out:
7 c. Did you calculate the R2 statistic for the temperature reconstruction, particularly for the 15th Century proxy record calculations and what were the results?
d. What validation statistics did you calculate for the reconstruction prior to 1820, and what were the results?
You wouldn’t think that that left much wiggle room. But do you think that they got a straight answer? Neither Bradley nor Hughes even answered the question as I noted here. I’ve provided below Mann’s full answer as it is rather delicious:
A(7C): The Committee inquires about the calculation of the R2 statistic for temperature reconstruction, especially for the 15th Century proxy calculations.
In order to answer this question it is important to clarify that I assume that what is meant by the “R2” statistic is the squared Pearson dot-moment correlation, or r2 (i.e., the square of the simple linear correlation coefficient between two time series) over the 1856-1901 “verification” interval for our reconstruction. My colleagues and I did not rely on this statistic in our assessments of “skill” (i.e., the reliability of a statistical model, based on the ability of a statistical model to match data not used in constructing the model) because, in our view, and in the view of other reputable scientists in the field, it is not an adequate measure of “skill.” The statistic used by Mann et al. 1998, the reduction of error, or “RE” statistic, is generally favored by scientists in the field. See, e.g., Luterbacher, J.D., et al., European Seasonal and Annual Temperature Variability, Trends and Extremes Since 1500, Science 303, 1499-1503 (2004).
RE is the preferred measure of statistical skill because it takes into account not only whether a reconstruction is “correlated” with the actual test data, but also whether it can closely reproduce the mean and standard deviation of the test data. If a reconstruction cannot do that, it cannot be considered statistically valid (i.e., useful or meaningful). The linear correlation coefficient (r) is not a sufficient diagnostic of skill, precisely because it cannot measure the ability of a reconstruction to capture changes that occur in either the standard deviation or mean of the series outside the calibration interval. This is well known. See Wilks, D.S., STATISTICAL METHODS IN ATMOSPHERIC SCIENCE, chap. 7 (Academic Press 1995); Cook, et al., Spatial Regression Methods in Dendroclimatology: A Review and Comparison of Two Techniques, International Journal of Climatology, 14, 379-402 (1994). The highest possible attainable value of r2 (i.e., r2 = 1) may result even from a reconstruction that has no statistical skill at all. See, e.g., Rutherford, et al., Proxy-based Northern Hemisphere Surface Temperature Reconstructions: Sensitivity to Methodology, Predictor Network, Target Season and Target Domain, Journal of Climate (2005) (in press, to appear in July issue)(available at ftp://holocene.evsc.virginia.edu/pub/mann/RuthetalJClimate-inpress05.pdf). For all of these reasons, we, and other researchers in our field, employ RE and not r2 as the primary measure of reconstructive skill.
As noted above, in contrast to the work of Mann et al. 1998, the results of the McIntyre and McKitrick analyses fail verification tests using the accepted metric RE. This is a key finding of the Wahl and Ammann study cited above. This means that the reconstructions McIntyre and McKitrick produced are statistically inferior to the simplest possible statistical reconstruction: one that simply assigns the mean over the calibration period to all previous reconstructed values. It is for these reasons that Wahl and Ammann have concluded that McIntyre and McKitrick’s results are “without statistical and climatological merit.”
A(7D): The Committee asks “[w]hat validation statistics did you calculate for the reconstruction prior to 1820, and what were the results?”
Our validation statistics were described in detail in a table provided in the supplementary information on Nature’s website accompanying our original nature article, Mann, M.E., Bradley, R.S., Hughes, M.K., Global-Scale Temperature Patterns and Climate Forcing Over the Past Six Centuries, Nature, 392, 779-787 (1998). These statistics remain on Nature’s website (see http://www.nature.com/nature/journal/v392/n6678/suppinfo/392779a0.html) and on our own website. See ftp:holocene.evsc.virginia.edu/pub/Mannetal98.
Interestingly, the link to Nature does not contain the said statistics nor does the UVA website. (The statistics are still up at the UMass website, not listed here.) You’d think that he’d try to get things like this right once in a while.
I won’t go in detail over the many mis-statements and mischaracterizations here. The main point is: did the House Committee get the requested information about the cross-validation statistics? The answer is obviously that they didn’t. Maybe they’d have done better if they’d asked about steroids.
Anyway I’ll tell you tomorrow whether I was able to get the R2 information from Ammann at AGU, which Mann had so stoutly withheld even from the House Energy and Commerce Committee.