Here is what Reviewer #3 submitted in his review of the Bürger CPD submission (I’ve taken down the draft review.) He regrets agreeing to do the review (for reasons that will become clear). Even though Reviewer #3 has much goodwill towards Bürger, he is rather a stickler for detail and it seems rather unfair that Bürger should draw Reviewer #3 for this submission and Michael Mann for his last submission, while other authors get cheerleaders. Reviewer #3 hopes that he has managed to separate his comments from his POV, although the separation is not always easy.

He appreciated the thoughtful response from CA readers, especially TCO. The online submission of reviews was turned off when Reviewer #3 sent in his review, so he sent it to the editor by email. We’ll see what happens. Reviewer #3 writes:

I have benefited from reading the reviews of the two other referees. I have written on these matters and am cited in this article; my comments should be weighed in that context.

I see two fairly distinct objectives within this paper:

– the proposal that multicalibration provides an advance in statistical methodology for evaluating temperature reconstructions;

– the estimation of a “model skill” for MBH98 of around 25%

I think that these two objectives fit together very uncomfortably within one article. If one wants to propose multicalibration of RE and CE calculations as a novel application for testing climate reconstructions, the utility and benefits of this method should be demonstrated on uncontentious pseudoproxy networks where one knows the properties (perhaps the von Storch-Zorita pseudoproxies could be used). Perhaps multicalibration will yield new insights relative to present methodologies; perhaps not. But one should not use the same dataset (in this case MBH98) to develop the method, which is then used to evaluate the dataset. In this case, one could hardly pick a worse data set than MBH98 to benchmark a novel application of a statistical method, since the dataset is controversial.

Bürger correctly observes that very high RE statistics can arise from “nonsense”/“spurious” regressions. In addition to the example provided by Bürger, the stereotypical examples of spurious regression (alcoholism rate versus Church of England marriages [Yule 1926]; cumulative rainfall versus money supply [Hendry 1984]) have high RE statistics. This makes it imprudent for paleoclimatologists to rely exclusively, or to over-rely, on RE statistics, a point that might well be made more forcefully by the author.

Econometricians have pondered the spurious regression problem for many years, as Eduardo Zorita observed in a CPD review last summer. There has been much searching for a statistic that could serve as a magic bullet for separating spurious relationships from valid relationships, but thus far, there is no such single statistic. Granger and Newbold 1974 observed that spurious correlations arose between random walks and that many such spurious relationships failed the Durbin-Watson test, and they urged practitioners to check this statistic before jumping to conclusions from high r2 statistics. The NAS Panel last summer also urged that this statistic be consulted. Most of the well-known millennium reconstructions fail the Durbin-Watson statistic in the calibration period and the verification r2 statistic (Bürger’s cross-validity) in the verification period. However, the Durbin-Watson statistic primarily watches out for first-order autocorrelation, and other sources of spurious relationships, such as uncontrolled or mis-specified variables, must be watched for as well in any attempt to evaluate the validity of a model. At present, in the absence of any universal method of identifying spurious correlations, applied practitioners must look at the data from multiple perspectives – something that is very much in the spirit of the NAS panel recommendations.
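The Granger and Newbold observation can be illustrated with a minimal sketch. It assumes independent Gaussian random walks of length 100 and a one-predictor least-squares fit; the function names, the sample sizes and the thresholds printed at the end are illustrative choices, not anything taken from the paper under review or from the script mentioned in this review.

```python
import random


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(y, fitted):
    """Share of the variance of y accounted for by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def durbin_watson(resid):
    """Durbin-Watson statistic: near 2 for uncorrelated residuals,
    near 0 for strongly positively autocorrelated residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def random_walk(n, rng):
    """Cumulative sum of n independent standard normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def simulate(trials=300, n=100, seed=0):
    """Regress one random walk on another, unrelated one, many times.
    Returns a list of (r2, durbin_watson) pairs."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        x, y = random_walk(n, rng), random_walk(n, rng)
        a, b = ols_fit(x, y)
        fitted = [a + b * xi for xi in x]
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        results.append((r_squared(y, fitted), durbin_watson(resid)))
    return results


if __name__ == "__main__":
    res = simulate()
    share_high_r2 = sum(1 for r2, _ in res if r2 > 0.5) / len(res)
    share_low_dw = sum(1 for _, dw in res if dw < 1.0) / len(res)
    print(f"share of unrelated pairs with r2 > 0.5: {share_high_r2:.2f}")
    print(f"share of fits with Durbin-Watson < 1:  {share_low_dw:.2f}")
```

In a run of this kind, a non-trivial share of unrelated pairs shows a high r2 while the residuals are strongly autocorrelated, which is the pattern the Durbin-Watson check is meant to flag. It says nothing about mis-specification, which needs other diagnostics.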

While Referee #1 says that multicalibration brings a “rigorous assessment” of the quality of a reconstruction, it is still not the “magic bullet” that economists have sought. In order to show this, I tested the method of Bürger’s Figure 3 against a classical spurious relationship: Yule’s marriage-alcoholism relationship mentioned above. In this particular case, very high RE statistics are returned for all permutations and combinations. In Bürger’s terms, the “model skill” here is well over 80% and yet one remains unconvinced that there is an underlying relationship. (I have posted a script for this and some related calculations at http://www.climateaudit.org/scripts/proxy/burger.txt)

Figure 1. RE statistics from multicalibration of Yule’s relationship between alcoholism and Church of England marriages.
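The kind of calculation behind such a figure can be sketched as follows. This is a hedged illustration only: it assumes that multicalibration means repeating the calibration/verification exercise over many random halvings of the record, it uses the usual definitions of RE (reference = calibration mean) and CE (reference = verification mean), and it runs on two synthetic series that share nothing but a linear trend. It does not use Yule's data and is not the posted script; all names and parameters are hypothetical.

```python
import random


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def re_ce(y_ver, y_hat, cal_mean):
    """RE and CE over the verification set.
    RE compares errors with those of the calibration mean,
    CE with those of the verification mean."""
    sse = sum((o - p) ** 2 for o, p in zip(y_ver, y_hat))
    ver_mean = sum(y_ver) / len(y_ver)
    re_stat = 1.0 - sse / sum((o - cal_mean) ** 2 for o in y_ver)
    ce_stat = 1.0 - sse / sum((o - ver_mean) ** 2 for o in y_ver)
    return re_stat, ce_stat


def multicalibration(x, y, n_splits=200, seed=0):
    """Repeat calibration on a random half and verification on the rest.
    Returns a list of (RE, CE) pairs, one per split."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    half = len(idx) // 2
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cal, ver = idx[:half], idx[half:]
        a, b = fit_line([x[i] for i in cal], [y[i] for i in cal])
        cal_mean = sum(y[i] for i in cal) / len(cal)
        y_ver = [y[i] for i in ver]
        y_hat = [a + b * x[i] for i in ver]
        scores.append(re_ce(y_ver, y_hat, cal_mean))
    return scores


def trending_series(n, slope, rng):
    """Synthetic stand-in: linear trend plus independent Gaussian noise."""
    return [slope * t + rng.gauss(0.0, 1.0) for t in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    x = trending_series(100, 0.1, rng)  # e.g. one trending quantity
    y = trending_series(100, 0.1, rng)  # an unrelated trending quantity
    re_values = sorted(r for r, _ in multicalibration(x, y))
    print(f"median RE over splits: {re_values[len(re_values) // 2]:.2f}")
```

Because the only thing the two synthetic series have in common is a trend, a consistently high RE across splits in this setting reflects the shared trend rather than any underlying relationship.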

In respect to Bürger’s other enterprise, the estimation of a “model skill” for MBH, Bürger has presented many interesting graphs, which might well be the topic of a very focused paper. However, without a full, previous qualification of multicalibration as a statistical method, it is inappropriate to use it here to estimate “model skill” in a contentious situation. While I share Referee #2’s concern that the term “skill” (which is used 50 times in the article) is never defined, I view the more serious problem as being the fluidity in use of the term, which includes, as Bürger said in his reply to Referee #2, a colloquial usage on the one hand and on the other hand a seemingly technical use as a property of models that can be estimated.

If Bürger proposes to estimate “model skill”, he needs to establish that the model is even valid, which requires a considerable amount of statistical analysis additional to what is provided here.

In view of the criticisms in our articles, endorsed on this point by both the NAS Panel and the Wegman Report, I do not believe that the MBH98 PC1, as calculated, can continue to be used as a proxy even for the exercise undertaken by Bürger. Any skill in a model which includes the use of this particular series will necessarily have a fortuitous element to it. As a result, I’m afraid that there has been a considerable amount of wasted effort in analysing permutations and combinations of reconstructions involving this particular proxy (or red noise models based on this proxy) if the ultimate objective is to estimate “model skill”.

In addition, if “model skill” is to be estimated, the “robustness” of the model (using “robust” in the sense of say, Frank Hampel) needs to be assessed. This requires the examination of “leverage points”; in many cases, as observed by Hampel, the retention or exclusion of “leverage points” needs to be decided on scientific grounds rather than statistical grounds (a statistic may identify data as a leverage point, but may not be able to say whether the underlying data is usable or not.) In the case of the MBH98 data set, bristlecones have been identified as having great leverage. Any attempt to attribute “model skill” to MBH98 needs to discuss these leverage points, which would also require the author to address the NAS Panel recommendation that these series be “avoided”. Given that there are only 22 series in the MBH98 network, I would urge that any attempt to estimate MBH98 “model skill” include a plot of the 22 series in a concise format, plus a plot and discussion of the residuals from, say, the 222-flavor.

There are any number of details which should be tidied up. The article needs to be spell-checked in English. It should probably be broken into distinct papers. The exposition needs to be improved.

Following are some of the points that I’ve noted, but the list is by no means exhaustive and I’d be happy to provide further comments offline.

251, 9: The NAS Panel Report and the Wegman Report should both be added to this list, as should McIntyre and McKitrick [EE 2005b], which is complementary to McIntyre and McKitrick [GRL 2005] which is cited.

251: 9-11. “the aspect of verification has not found a comparable assessment”. I suggest that this wording be modified as it is inconsistent with the discussion of page 266.

251 29: The term “robust” has a specific meaning in statistics that is inconsistent with the usage here. If the term is to be used, it should be consistent with usage by (say) Hampel.

252 10: Gordon in Hughes et al, eds. Climate from Tree Rings 1981, p 60, discusses the splitting of periods, citing McCarthy 1976 for what seems to be what Bürger calls multi-calibration as follows:

Of course there are a great many different ways that the data may be halved and the selection of a proper subset of these possibilities has been discussed in some detail by McCarthy (1976)

253 17: MBH98 and MM05 do not use the term “level of no skill”.

254 9: Briffa et al 2001 uses a different network that is relatively distinct from the others which do overlap a lot.

254 10: While some of the cited studies calculated verification statistics, many did not.

Briffa et al 1992 reports all the tests listed in Fritts 1976; Briffa et al 2001 report verification r and RE statistics, but, based on a quick inspection, I don’t think that any verification statistics are reported in Jones et al 1998; Crowley and Lowery 2000; Esper et al 2002.

254 14: In the context of Wilks 1995, it might be noted here (or in the discussion of the relationship between verification statistics) that they show an equation linking RE to verification r2 under the assumption of constant climatology, citing Murphy (1988). Under that equation, the RE is necessarily lower than the verification r2 statistic.
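For reference, the relationship alluded to here is usually written as the Murphy (1988) decomposition of the mean-squared-error skill score. The notation below is an assumption made for illustration: f is the reconstruction, o the observations, overbars are verification-period means, s denotes standard deviations and r their correlation. The reference forecast is taken to be the verification-period mean, which coincides with the calibration mean under the constant-climatology assumption. The exact form and equation number should be checked against Wilks 1995.

```latex
\mathrm{SS}
  = 1 - \frac{\mathrm{MSE}}{s_o^{2}}
  = r_{fo}^{2}
    - \left( r_{fo} - \frac{s_f}{s_o} \right)^{2}
    - \left( \frac{\bar{f} - \bar{o}}{s_o} \right)^{2}
```

Both subtracted terms are squares, so under this assumption the score cannot exceed the verification r2, with equality only when the conditional and unconditional bias terms both vanish.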

254 15-17: Do Fritts 1976 and Briffa et al 1988 actually say this exactly?

254ff: I don’t get how this discussion is necessary to the development of the conclusions announced in the Abstract. Many readers will find it distracting. I skipped over the section, and even when I came back to it, I had trouble figuring out the point. If the section is retained, shouldn’t the relationships between verification statistics presented in Murphy (1988), cited in Wilks 1995, be mentioned as well?

255 20: Wahl and Ammann 2007 do report MBH98 scores from their emulation, which are virtually identical to scores previously reported in MM05a. Would it be unreasonable to cite the earlier report as well?

256 7: Are you sure that Lorenz 1956 is really the “first” article to estimate this dependency? The citation in Raju et al 1997 mentions some articles from the 1930s that seem to be on this topic.

256 15-20: I couldn’t replicate this figure or this calculation. I’ve posted up my script and maybe the author can point out why I’m having trouble.

256 21: It’s certainly possible that equation (4) applies to say Partial Least Squares regression, but shouldn’t a citation be provided for this claim? If no citation can be found, I don’t see that it’s necessary for the development of the article.

257 1: the simple trend is “a” trivial predictor, but I wouldn’t call it “the” trivial predictor which has mathematical connotations not present here.

257 1-16: the existence of spurious RE statistics is important. However I don’t see how the term “calibration mean bias” contributes to the discussion.

257 13: I understand the point here, but why is it a “discrepancy”? And what is the value of saying that there is an “enormous bias in the calibration mean”?

259 13: I’ve presented results from the classic Yule spurious regression, which show a high RE value regardless of the temporal separation of the calibration and verification data sets. So I think that there’s a lot of heavy lifting to be done before one can conclude that a general property has been established.

265 1-10: Bürger misses an important aspect of the concept of “spurious regression” in this discussion. I would urge him to read or re-read Granger and Newbold 1974 and Phillips 1986, before re-writing this section. The problem, as posed in econometrics literature, is how a regression between unrelated series can yield a statistic that indicates “strong significance”. The strategy of Phillips 1986 was to see if the conventional distributions accurately described the circumstances being tested, looking, in particular, at the random walks of Granger and Newbold 1974. Phillips observed that conventional distributions for t-statistics did not apply for this type of model and, when the distributions were re-calculated, the seemingly significant statistic proved not to be so. This was our strategy in MM05a, where we sought to explain the seeming discrepancy between the failed verification r2 (and CE statistics) of MBH98 and the high RE statistic. While I may disagree with some nuances of the calculation in Bürger’s Figure 6, I endorse the approach of examining actual distributions. My disagreement with the characterization of this paragraph is that, in my opinion, it fails to adequately record the problem thrown up by failure of one important verification statistic. However, the above comments very much reflect a POV; while I would urge the author to consider the POV, it would be inappropriate for me to insist that this POV be adopted.

265 8: The “trivial” model may have a high RE statistic, but, as the author obviously realizes, it is not “significantly skillful” and this wording should be changed.

265 11-18: This paragraph is poorly worded and particularly hard to understand.

265, 17-18. The proposed significance benchmark of RE=54% was relevant to the analysis as carried out in MBH98, which was what was being discussed at the time.

265 20-25: as noted in paragraph 6 of this review, evaluating models requires not only a consideration of autocorrelation properties, but of potential misspecification, which requires a different suite of methods, which are too quickly assumed out of existence here.

266 10. Bürger’s characterization of the methodology of MM05a, MM05c is incorrect. Bürger stated that the red noise predictor was estimated from “one of the 22 proxies”. In fact, the red noise predictor was simulated by doing a principal components analysis under MBH methodology on a network of 70 red noise series (constructed using the algorithm described by Bürger at 267 5-10.) The properties of these time series differ somewhat from the time series generated by a univariate model of the MBH North American PC1. This difference in procedure probably accounts for some differences in our red noise results and Bürger’s (although both simulations yield a substantial population of positive RE values from red noise simulations.)

266 13: where does the figure of 36% come from? I read Huybers as arguing for a value of 0.

266 14: the same point applies as 266 10 in respect to the simulated PC1s. I’m uncertain what Bürger’s question is in respect to “uncorrelated?” here. White noise series are by definition uncorrelated. I would be happy to provide clarification for any remaining question.

266 22: In MM05a, we described our simulations as establishing a “lower limit” for a benchmark since our distributions were entirely obtained from operations on noise. I agree with Bürger’s observation about “lower bound” on line 22, but this was also our approach, which is not noted here.

266 21. I don’t understand what is meant by “convergence” here or the utility of this concept under these circumstances. My key objection to Bürger’s simulation approach here, insofar as it is an attempt to simulate salient particulars of MBH, is excessive reliance on univariate red noise modeling of the MBH North American PC1 (as opposed to modeling of a PC network).

266 19: The viewpoint of Wahl and Ammann 2007 needs to be added to this recital and discussed.

267 8: Our simulations were done using Brandon Whitcher’s algorithm stated to be an implementation of the method of Hosking 1984 (as cited in our text and code). A correspondent has said to me that the Whitcher algorithm has somewhat different properties than Hosking 1984 and accordingly the Whitcher algorithm is the more precise citation here.

267 24: Bürger reports that the MBH-type reconstructions have somewhat higher typical realizations than red noise networks as he modeled them. In order to claim that this difference is “significant”, a significance test should be performed.
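One simple way to frame such a test is a Monte Carlo comparison, sketched below under stated assumptions: the RE values obtained from red noise simulations are treated as a sample from the null distribution, and the score of the reconstruction is compared with that sample. The function names and the one-sided form of the test are illustrative choices, not a procedure taken from the paper or from MM05a.

```python
import math


def empirical_p_value(observed, null_sample):
    """One-sided Monte Carlo p-value: share of null values at least as
    large as the observed score, with the usual +1 correction so the
    estimate is never exactly zero."""
    exceed = sum(1 for v in null_sample if v >= observed)
    return (exceed + 1) / (len(null_sample) + 1)


def benchmark(null_sample, level=0.95):
    """Empirical quantile of the null sample, usable as a significance
    benchmark (the smallest value with at least `level` of the sample
    at or below it)."""
    ordered = sorted(null_sample)
    k = min(len(ordered) - 1, int(math.ceil(level * len(ordered))) - 1)
    return ordered[max(k, 0)]


if __name__ == "__main__":
    # Placeholder null sample for illustration only; in practice this
    # would hold RE values from the red noise simulations.
    null_re = [i / 100.0 for i in range(100)]
    print(benchmark(null_re, level=0.5))
    print(empirical_p_value(0.99, null_re))
```

The difference between two sets of realizations would then be called “significant” only if the reconstruction's score falls in the upper tail of the null sample at a stated level.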

269 10-15: Some (no doubt Wahl and Ammann) might argue that the RE and CE statistics with the record split in half in calendar years is to be preferred because it provides information on “low frequency” that is unavailable from multicalibration. Whether one agrees or disagrees with this argument, it should be dealt with.

269 29: this is an interesting argument that appears out of nowhere. In order to be stated as a conclusion, it needs to be proven in the text.

I have benefited from reading the reviews of the two other referees. I have written on these

matters and am cited in this article, my comments should be weighed in that context.

I see two fairly distinct objectives within this paper:

– the proposal that multicalibration be used as a statistical test for reconstructions;

– the estimation of a “model skill” for MBH98 of around 25%

I think that these two objectives fit together very uncomfortably within one article. If one

wants to propose multicalibration of RE and CE calculations as a novel application for testing

climate reconstructions, the utility and benefits of this method should be demonstrated on

uncontentious pseudoproxy networks where one knows the properties (perhaps the von Storch-Zorita

pseudoproxies could be used). Perhaps multicalibration will yield new insights relative to

present methodologies; perhaps not. But one should not use datasets that themselves are under

examination (e.g. MBH) to develop a method, which is then used to evaluate the same dataset.

In addition to the example provided by Bürger, the stereotypical examples of spurious high r2

(citations) have high RE as well.

Bürger correctly observes that very high RE statistics can arise from “nonsense”/”spurious”

regressions. This is a point that makes it imprudent for paleoclimatologists to rely exclusively

on RE statistics. Econometricians have pondered the spurious regression problem for many years,

as Eduardo Zorita observed in a CPD review last summer. There has been much searching for a

statistic that could serve as a magic bullet for separating spurious relationships from valid

relationships, but thus far, there is no such single statistic. Granger and Newbold 1974 observed

that spurious correlations arose between random walks; that many such spurious relationships had

failed Durbin-Watson statistics and urged practitioners to check this statistic before jumping to

conclusions from high r2 statistics. The NAS Panel last summer also urged that this statistic be

consulted. Most of the canonical millennium reconstructions fail the Durbin-Watson statistic in

the calibration period and the verification r2 statistic (Bürger’s cross-validity) in the

verification period. However, this statistic primarily watches out for autocorrelation-enhanced

matches. Other sources of spurious relationships, such as uncontrolled variables or even chance

exist as well. At present, in the absence of any universal method of identifying spurious

correlations, applied practitioners must look at the data from multiple perspectives – something

that is very much in the spirit of the NAS panel recommendations.

While Referee #1 states that multicalibration brings a “rigorous assessment” of the quality of a

reconstruction, it is still not the “magic bullet” that economists have sought for. In order to

show this, I tested the method of Bürger’s Figure 3 against a classical spurious relationship:

Yule’s marriage-alcoholism relationship mentioned above. In this particular case, very high RE

statistics are returned for all permutations and combinations. In Bürger’s terms, the “model

skill” here is well over 80% and yet one remains unconvinced that there is an underlying

relationship. (I have posted a script for this and related calculations at

http://www.climateaudit.org/scripts/proxy/burger.txt)

Figure 1. RE statistics from multicalibration of Yule’s relationship between alcoholism and

Church of England marriages.

(YOU MAY WANT TO WRITE A FORMAL COMMENT IF THIS IS AN IMPORTANT CALCULATION–I DON’T. ALSO, NOT

SURE THAT THIS IS REALLY RELEVANT TO BURGER’S PAPER, IF BURGER NEVER MAKES THE “RIGOROUS” CLAIM.)

In respect to Bürger’s other enterprise, the estimation of a “model skill” for MBH, Bürger has

presented many interesting graphs, which might well be the topic of a focused paper. However,

without a full, previous qualification of multicalibration as a statistical method, it is

inappropriate to use it at present to estimate “model skill” in an uncertain situation.

While I share Referee #2’s concern that the term “skill” (which is used 50 times in the article)

is never defined, I view the more serious problem as being the fluidity in use of the term, which

includes, as Bürger said in his reply to Referee #2, a colloquial usage on the one hand and on

the other hand a seemingly technical use as a property of models that can be estimated.

There are any number of details which should be tidied up:

-The article needs to be spell-checked in English.

-It needs to be broken into appropriate papers (for full explication) or a considerably longer

paper needs to be written (preferably the former).

-The writing is unclear in overall story and in specific sections. I recommend that Professor

Bürger consult with a technical writer to help with getting a clear story and clear prose; this

is the second submission of this paper and the problem remains.

Following are some of the points that I’ve noted, but the list is by no means exhaustive and I’d

be happy to provide further comments offline.

251, 9: Recommend to add McIntyre and McKitrick [EE 2005b] as it is complementary to McIntyre

and McKitrick [GRL 2005] which is cited. (STEVE, I AGREE WITH LEAVING THE PUBLIC COMMISSIONS OUT.

THERE’S NO ORIGINAL WORK IN THEM. BURGER’S PAPER IS ON RE/CE SCIENCE, NOT ON PUBLIC COMMISSIONS

OR CONTROVERSY OR WHAT PANEL OF EXPERTS ENDORSES WHO.)

251: 9-11. “the aspect of verification has not found a comparable assessment”. I suggest you

modify this wording. This is inconsistent with the discussion of page 266.

251 29: The term “robust” has a specific meaning in statistics that is inconsistent with the

usage here. Per reference, BLUE BOOK TUKEY, it means BLABLABLA. A better way to discuss the

colloquial issue of robustness is BLEHBLEH.

252 10: Gordon in Hughes et al, eds. Climate from Tree Rings 1981, discusses the splitting of

periods, citing McCarthy 1976 for what seems to be what Bürger calls multi-calibration.

(GOOD!!!)

253 17: MBH98 and MM05 do not use the term “level of no skill”.

254 9: Briffa et al 2001 uses a different network that is relatively distinct from the others

which do overlap a lot.

254 10: While some of the cited studies calculated verification statistics, many did not.

Briffa et al 1992 is the most comprehensive, reporting all the tests listed in Fritts 1976;

Briffa et al 2001 report verification r and RE statistics, but, based on a quick inspection, I

don’t think that any verification statistics are reported in Jones et al 1998; Crowley and Lowery

2000; Esper et al 2002.

254 14: In the context of Wilks 1995, it might be noted here (or in the discussion of the

relationship between verification statistics) that they show an equation linking RE to

verification r2 under the assumption of constant climatology. Under that equation, the RE is

necessarily lower than the verification r2 statistic.

254 15-17: Do Fritts 1976 and Briffa et al 1988 actually say this exactly?


254ff: I don’t get how this discussion is necessary to the development of the conclusions

announced in the Abstract. Many readers will find it distracting. I skipped over the section, and

even when I came back to it, I had trouble figuring out the point. If the section is retained,

shouldn’t the relationships between verification statistics presented in Murphy (1988), cited in

Wilks 1995, be mentioned as well?

255 20: Wahl and Ammann 2007 do report MBH98 scores from their emulation, which are virtually

identical to scores previously reported in MM05a. Would it be unreasonable to cite the earlier

report as well?

256 7: Are you sure that Lorenz 1956 is really the “first” article to estimate this dependency? The

citation in Raju et al mentions some articles from the 1930s that seem to be on this topic.

256 15-20: I couldn’t replicate this figure or this calculation. I’ve posted up my script and

maybe the author can point out why I’m having trouble.

256 21: It’s certainly possible that equation (4) applies to say Partial Least Squares

regression, but shouldn’t a citation be provided for this claim? I don’t see that it’s necessary

for the development of the article if no citation can be found.

257 1: the simple trend is “a” trivial predictor, but I wouldn’t call it “the” trivial predictor

which has mathematical connotations not present here.

257 1-16: the existence of spurious RE statistics is important. However I don’t see how the term

“calibration mean bias” contributes to the discussion.

257 13: I understand the point here, but why is it a “discrepancy”? And what is the value of

saying that there is an “enormous bias in the calibration mean”?

259 13: I’ve presented results from the classic Yule spurious regression, which show a high RE

value regardless of the temporal separation of the calibration and verification data sets. So I

think that there’s a lot of heavy lifting to be done before one can conclude that a general

property has been established.

265 11-18: This paragraph is poorly worded and hard to understand. Is the point that

different settings for significance (e.g. the Huybers, MM discrepancy) depend on the level of

challenge of the pseudoproxy test set?

265, 17-18. The proposed significance level of RE=54% is relevant to the analysis as carried out

by MBH, which is what was originally being discussed.

265 27: “red” is meant, “white” is written.

266 10. Bürger’s characterization of the methodology of MM05a, MM05c is incorrect. Bürger stated

that the red noise predictor was estimated from “one of the 22 proxies”. Details of the exact MM

methodology are best understood by looking at the code (archived HERE) and at the subroutine,

(available HERE). Actually, those experiments used a network of 70 series which had the red

noise properties of the MBH North American network, from a 5-parameter (ARMA) model (OR WHATEVER

IT WAS THAT YOU DID).

266 13: where does the figure of 36% come from? I read Huybers as arguing for a value of 0.

266 14: “White” is said, “red” is meant. I’m uncertain what Bürger’s question is with the

“uncorrelated?”, but this should be figured out looking at the method details (which are public).

266 16-19: Given the lack of understanding of the actual MM method, Bürger needs to clarify what

his desired (level 5) method really is.

266 21. I don’t understand what is meant by “convergence”.

267 8: MM05 did not use “the method of Hosking84”; it used a public subroutine titled “Hosking”,

which does different things.

267 24: Bürger reports that the MBH-type reconstructions have higher typical realizations than

red noise networks as he modeled them. This should be framed in terms of a significance test.

269 10-15: Some might argue that the RE and CE statistics with the record split in half in

calendar years is to be preferred because it provides information on “low frequency” that is

unavailable from multicalibration. Whether one agrees or disagrees with this argument, it should

be dealt with.

269 29: This inference should be omitted or the paper should be rescoped. (It comes in out of

nowhere.)

Here’s a quick question. I’ve seen references to a repository for IMTRB (something like that) for the dendro crowd. But does there exist, or is there even any push for, a centralized repository for scripts, data files, etc., for all climate scientists to use? It’s great that you make things available, but computers crash, go down, etc. It would be nice if the NSF or some other entity provided funding for a central repository, where one could search for articles and download all relevant scripts, data, etc. Any paper published with NSF funds would be required to provide all said data, scripts, etc., to such a repository. People make it out to sound so difficult, but it’s a simple task, storage is cheap, and it ought to be done.