CA readers have followed some of the interesting reviews at Climate of the Past, where some of the exchanges have been lively (although most papers don’t seem to attract much comment.) Two reviews are in on Bürger’s most recent submission and a 3rd reviewer has been invited to give an opinion. However, he’s slow about these things. (CPD reviewers are allowed to identify themselves – Joel Guiot has submitted a review under his own name, and the third reviewer will as well.) The third reviewer is sometimes distracted by interesting by-ways in an article and this is no exception. So I thought that I’d explore some of these by-ways on CA for him. It’s possible that Reviewer #3 will draw on some of this material, although it’s too lengthy and ambling for a review.
Bürger discusses the topic of RE significance – a topic which I think is important and which is in a very unsettled state, as Bürger observes, noting that MM05a and the MM Reply to Huybers had reported that very high RE values could be generated from red noise treated in an MBH manner, resulting in a 95% benchmark of over 0.5 (he expressed it differently), while MBH and Huybers had argued for a benchmark of 0. He didn’t discuss Ammann and Wahl 2007, which argued for and used 0.0 as a benchmark (based on their rejected GRL submission). I’ll discuss this further, but first I got intrigued by a point made by Referee #2 about Bürger’s use of the term “skill”, which had echoes, in my opinion, of a broader issue.
Referee #2 observed that the paper was “difficult to read”, an observation that TCO would doubtless have agreed with and I agree with. I’ve spent a lot of time on the paper and it’s taken me an unduly long time to understand what he’s doing. One more reason for archiving code. If I could read his code, I could sort out what was explained poorly. Anyway Referee #2 observed:
in this paper, the term “skill” is never clearly defined.
The term “skill” is not incidental. Bürger uses the term “skill” 50 times – 8 times in his abstract, 16 times in his introduction and 9 times in his conclusion. Indeed, his main conclusions are about “skill” – he argues that MBH claims of skill in the range of 50% are over-estimates and that “more realistic estimates” of MBH98 skill are about 25%. Thus, without ever defining skill, Bürger nonetheless manages to estimate it. You can sort of figure out what he means in any given context, but he moves rather fluidly from a colloquial understanding of the term to its use to denote a very particular statistic/skill score. Referee #2 illustrated the fluid usage through the following sentence (just one possible example):
“Calibrating is done,…, by optimizing the model skill for a selected sample (the calibration set) and is always affected by the presence of sampling noise. This renders the model imperfect, and its true skill is bound to shrink. But it is this skill that is relevant when independent data are to be predicted”.
The referee observed (correctly in my opinion):
The expression “by optimizing the model skill” is rather puzzling because (1) there is no “the skill” but “a” skill among many possible skills, and (2) the parameters of a given statistical model are estimated by optimizing a given criterion (Mean Squared Error, likelihood function, etc) that can be very different for a chosen skill score that has been calculated after the parameter estimation step.
The referee then proceeds to make sensible observations about the need to carefully define terms, citing Thornes and Stephenson 2001 as an example of how skill measures can be clearly defined. Instead of responding to the referee’s concern about the failure to define the term “skill” and the fluidity of its use, Bürger, in my opinion, missed the point, choosing instead to debate incidental aspects of how the referee illustrated it. Bürger:
1. (“skill”): – For the introductory statements, “skill” was intentionally used in a more colloquial and general sense. For any skill score measuring the correspondence of a modeled quantity to observations the following is almost axiomatic: If a model, calibrated by whatever method, is applied to independent data the skill is expected to shrink. I should be very surprised if the rev disagrees on this point. (If he/she is able to provide a counterexample it will of course be considered accordingly.) The actual law of shrinkage certainly depends on the chosen model and score. Eq. (4), for example, applies to cross validity, Rc, for model estimates based on least-squares. The scores mentioned by the rev, such as odds ratio (ORSS) and Peirce (PSS), are binary and as such not really appropriate for climate reconstructions.
I didn’t think that the referee was contesting the argument that the value of a statistic will typically decline in a verification sample, but rather Bürger’s use of the phrase “model skill” in the context of “optimizing the model skill”. Similarly, the referee didn’t propose the Odds Ratio as a specific test, but merely illustrated that there can be many ways to measure skill. While Bürger studied RE and CE statistics in an MBH context (and the results, expressed narrowly, are of interest), I think that his fluid terminology has allowed him to reify the colloquial idea of skill with a particular statistic without properly distinguishing the two.
The concern over fluid use of the term “skill” reminded me of two interesting exchanges about the topic – one at the House Energy and Committee hearings last summer; the other at the blogs of Pielke Sr and James Annan about a year ago.
House Energy and Commerce Committee Hearings
Oddly enough, the House Energy and Commerce Committee hearings had a very illuminating exchange involving Wegman and Mann, illustrating a communication problem between statistical and meteorological communities.
when I read the paper [MBH98] originally, it took me probably 10 times to read it to really understand what he was trying to say. He uses phrases that are not standard in the literature I am familiar with. He uses, for example, the phrase “statistical skill” and I floated that phrase by a lot of my statistical colleagues and nobody had ever heard of that phrase, statistical skill. He uses measures of quality of fit that are not focused on the kind of things typically we do.
Of course, as Mann observed at the NAS Panel hearings, he is “not a statistician”. Mann later had an opportunity to comment on Wegman’s point and stated:
That was another very odd statement on his part, and I found his lack of familiarity with that term somewhat astonishing. The American Meteorological Society considers it such an important term in the context of statistical weather forecasting verification that they specifically define that term on their website and in their official literature. And in fact it is defined by the American Meteorological Society in the following manner: “A statistical evaluation of the accuracy of forecasts or the effectiveness of detection techniques.” Several simple formulations are commonly used in meteorology. The skill score is useful for evaluating predictions of temperatures, pressures, et cetera, et cetera, so I was very surprised by that statement.
Mann’s observation about the American Meteorological Society can be confirmed, as it does indeed define skill as follows:
“A statistical evaluation of the accuracy of forecasts or the effectiveness of detection techniques.” Several simple formulations are commonly used in meteorology. The skill score (SS) is useful for evaluating predictions of temperatures, pressures, or the numerical values of other parameters. It compares a forecaster’s root-mean-squared or mean-absolute prediction errors, Ef, over a period of time, with those of a reference technique, Erefr, such as forecasts based entirely on climatology or persistence, which involve no analysis of synoptic weather conditions: If SS > 0, the forecaster or technique is deemed to possess some skill compared to the reference technique.
To which, Wegman later replied:
My own sense is that if you look at, for example, this matter of statistical skill, it doesn’t matter that the American Meteorological Society says what statistical skill is. Statisticians do not recognize that term. I went around to a whole dozen or so of my statisticians network and asked them if they knew what they were talking about. It is my contention that there is a gulf between the meteorological community and the statistical community.
We examined, for example, this committee that is on probability and statistics of the American Meteorological Society. We found only two of the nine people in that committee are actually members of the American Statistical Association, and in fact one of those people is an assistant professor in the medical school whose specialty is bio-statistics. The assertion I have been making is that although this community, the meteorological community in general and the paleoclimate community in particular, use statistical methods, they are substantially isolated.
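Setting the terminological dispute aside, the AMS definition itself is simple arithmetic. Here is a minimal sketch of a skill score in the AMS sense – the forecast, observations and persistence reference below are invented numbers purely for illustration:

```python
import numpy as np

def skill_score(obs, forecast, reference):
    """AMS-style skill score: the forecast's root-mean-squared error (Ef)
    compared against a reference technique's (Eref). SS > 0 means the
    forecast is deemed to possess some skill relative to the reference."""
    obs = np.asarray(obs, dtype=float)
    e_f = np.sqrt(np.mean((np.asarray(forecast, dtype=float) - obs) ** 2))
    e_ref = np.sqrt(np.mean((np.asarray(reference, dtype=float) - obs) ** 2))
    return 1.0 - e_f / e_ref

# Toy daily temperatures: the persistence reference is simply each day's
# preceding observation ("tomorrow will be like today"), which involves no
# analysis of synoptic conditions.
obs         = [10.0, 12.0, 15.0, 14.0, 11.0, 9.0]
persistence = [11.0, 10.0, 12.0, 15.0, 14.0, 11.0]  # yesterday's value
forecast    = [10.5, 12.5, 14.0, 14.5, 11.5, 9.5]   # a hypothetical forecast

ss = skill_score(obs, forecast, persistence)  # positive: skill vs. persistence
```

Note that the score is purely comparative: a perfect forecast scores 1, a forecast no better than the reference scores 0, and the same forecast can score differently depending on which reference technique is chosen – which is exactly the hinge of the Pielke–Annan exchange below.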
Pielke and Annan
While the exchange between Wegman and Mann illustrates, at a minimum, a communication problem, it doesn’t exclude the possibility that the two communities might be doing similar things using different terminology. Econometricians use somewhat different language than signal processors for multivariate analyses, but one can often identify common features under the terminology.
A Pielke Sr comment defined “skill” in terms of how closely the model fits observations.
I interpret skill as an absolute measure of forecast error with respect to observed data. I agree that skill of different models can be compared with each other, as long as observed data is the reference.
James Annan argued vehemently that “skill” is properly measured against a reference forecast rather than against actual data, both in terms of the logic and in terms of the AMS definition:
In saying “I interpret skill as an absolute measure of forecast error with respect to observed data”, you are demonstrating that your use of the term “skill” differs from the AMS reference which you cited and other similar references…They all explicitly define skill as a comparative measure in which the data are used to determine which of two competing hypotheses (generally a model forecast and a simple reference such as persistence) is more accurate. as I said previously, using your own idiosyncratic definition is liable to mislead your readers. Your measure is simply not “skill” as understood in the forecasting field.
Incredibly, it turns out that Roger is claiming it is appropriate to use the data themselves as the reference technique! If the model fails to predict the trend, or variability, or nonlinear transitions, shown by the data (to within the uncertainty of the data themselves) then in his opinion it has no skill. …I repeat again my challenge to Roger, or anyone else, to find any example from the peer-reviewed literature where the target data has been used as the reference hypothesis for the purposes of determining whether a forecast has skill.
After this exchange, Henk Tennekes observed:
To both of you, I suggest that your quest for an objective and universally valid metric for the measurement of skill is unlikely to succeed. Skill, however defined, is ultimately a qualitative judgment, not a quantitative one. More precisely, it is a judgment, not a calculation. Being an old man, I want to remind you that I have written more than once about the added value of human skills. You might wish to delve into my Illusions paper, and into Michael McIntyre’s response (Weather 43, 294-298, 1988).
Apart from that I want to apologize for the loose way I fooled around with the skill concept in my blog and in earlier speeches and opinion pieces. Flowery language has its own problems, unfortunately.
The last sentence here closes the circle back to Bürger’s article – “flowery language has its own problems”. Quite so.
Annan’s challenge, to find a peer-reviewed article, any article, in which the “target data has been used as the reference hypothesis for the purposes of determining whether a forecast has skill” contrasts rather sharply with the routine use of actual data in hindcasts to claim “skill”. Whether meteorological forecasting terminology involving, as Annan says, a comparison to a reference forecast, should be applied to hindcasting reconstructions, is, I think, an open question.
Prior Paleoclimate Usage of the Term “Skill”
I also became a bit intrigued at exactly when the term “skill” started being used in paleoclimate literature. All the current dendrochronological verification tests except the CE statistic are described in Fritts 1976, which uses the term Reduction of Error (RE). The CE statistic seems to have entered tree ring literature with Briffa et al 1988 (cited by Bürger). About the RE statistic, Fritts 1976 merely says:
any positive value indicates that there is some information in the reconstruction. (p.24)…
Jacoby et al 1985 say:
Although not very well known outside the atmospheric science literature, the RE statistic is a rigorous estimate of statistical significance. An RE greater than 0 is regarded as proof that the estimates are an improvement over using just the mean (Gordon and Leduc, 1981).
The term “skill” is used in Preisendorfer 1988, who speaks of “hindcast skill”, where, as I read his equation 9.48, it is equivalent to what we would call the calibration r2 statistic. Preisendorfer, in his usual interesting bibliographic notes, discusses linear regression going back to Wiener and, in a meteorological context, refers to a couple of articles by Davis in the mid-1970s which use the term “skill”.
The transfer of the term “skill” to dendroclimatology seems to have been done by Jacoby and D’Arrigo in connection with their NH temperature reconstruction in the late 1980s. Jacoby and D’Arrigo Clim Chg 1989 substantially increases the properties attributed to the RE statistic from the earlier publications (even though no new statistical texts demonstrated the upgrade):
The reduction of error is a rigorous method of testing significance (Gordon and Leduc 1981). Any value above 0 indicates significant predictive skill. (p 47)
Fritts 1991 uses the term skill with less forcefulness as follows:
Any positive value of RE indicates that the regression model, on the average, has some skill and that the reconstruction made with the particular model is of some value.
D’Arrigo and Jacoby 1992 (in Climate since 1500) made a similar claim, with the first hint of potential problems resulting from the presence of trends:
Any positive RE value shows that there is considerable skill in the verification period estimates as compared to the calibration period mean (Gordon and LeDuc 1981). The strength of the RE in this case may however partly reflect a difference in the mean of the calibration and verification periods (Gordon and LeDuc 1981).
The influential study Cook et al 1994 re-stated the various Fritts’ tests, together with the CE statistic. About the RE statistic, they observed:
Thus RE>0 indicates that the reconstruction is better than the calibration period mean. However there is no way of testing the RE for statistical significance. Note that x in the denominator will not produce the true corrected sum-of-squares unless it is identical to the verification period mean. Consequently a large difference in the calibration and verification means can lead to an RE greater than the square of the product-moment correlation coefficient. This occasional odd behaviour suggests that they should be interpreted cautiously when the data contains trends or are highly autocorrelated.
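The Cook et al caution about differing period means is easy to illustrate numerically. The sketch below (synthetic numbers, not actual proxy or temperature data) compares RE, which benchmarks squared errors against the calibration-period mean, with CE, which benchmarks against the verification-period mean:

```python
import numpy as np

def re_ce(obs_ver, pred_ver, cal_mean):
    """RE benchmarks the prediction's squared error against the
    calibration-period mean; CE against the verification-period mean.
    Since the verification mean minimizes the benchmark sum of squares,
    CE <= RE always, and the gap widens as the two period means diverge."""
    obs_ver, pred_ver = np.asarray(obs_ver), np.asarray(pred_ver)
    sse = np.sum((obs_ver - pred_ver) ** 2)
    re = 1.0 - sse / np.sum((obs_ver - cal_mean) ** 2)
    ce = 1.0 - sse / np.sum((obs_ver - obs_ver.mean()) ** 2)
    return re, ce

# Verification-period "observations" sitting well below a warm
# calibration-period mean, and a trivial prediction that just hovers
# around the verification mean with some noise.
rng = np.random.default_rng(0)
obs_ver = -0.5 + 0.1 * rng.standard_normal(50)
pred_ver = obs_ver.mean() + 0.1 * rng.standard_normal(50)

re, ce = re_ce(obs_ver, pred_ver, cal_mean=0.3)
# re comes out strongly positive purely from the mean offset between the
# two periods; ce, which removes that offset, comes out negative.
```

In other words, a large difference between calibration and verification means can manufacture a high RE for a prediction with no tracking ability at all – the trend/autocorrelation caution quoted above.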
I’ll check this reference for its use of the term “skill” (but can’t find the article right now – it’s somewhere here).
The RE statistic has also been discussed here in the context of MBH98 claims about the verification r2 statistic (together with the actual verification r2 values of ~0).
Back to Bürger
Let’s grant Bürger a little license for terminological inconsistency and then examine actual usage to see if there are some distinct usages that we can follow. He cites a number of articles from psychometric literature (Raju et al 1997; Cattin 1980) in which he says:
skill is accordingly being referred to as crossvalidity
In examining the underlying references, “cross-validity” in Raju et al is equivalent to what we would call “verification r2”. Bürger’s equation (4) says that the verification r2 in a multiple regression study is necessarily less than the calibration r2. He goes on to say:
Equation (4) applies to models estimated by ordinary least squares and thus to all reconstructions (“predictions”) that are based on some form of multiple regression.
I suspect that this is true, but there are differences between Partial Least Squares Regression and OLS regression, and you shouldn’t just say this sort of thing unless you have a citation. However, the salient point here is that “skill” is said to be defined by a statistic that is equivalent to a verification r2 statistic. This, however, is the last we hear of this statistic, as Bürger moves into a lengthy series of RE and CE calculations for MBH98 flavors. These calculations are by far the most focused part of the paper. As a referee, I’m not sure whether I should be saying what should be in or out of a paper – I find many referee suggestions of this type annoying – but, in this case, as a blogger, I think that Bürger should ditch most of the theoretical material and focus on simply reporting the flavor calculations, staying away from metaphors and unfocused terminology.
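The shrinkage that equation (4) formalizes is easy to see in a simulation. This is only a sketch of the general OLS phenomenon, not Bürger’s actual calculation: regress a pure-noise target on pure-noise predictors and compare the in-sample (calibration) r2 with the out-of-sample (verification) r2, averaged over many trials.

```python
import numpy as np

def sq_corr(a, b):
    """Squared Pearson correlation - the 'r2' verification statistic."""
    return np.corrcoef(a, b)[0, 1] ** 2

def one_trial(rng, n_cal=79, n_ver=49, p=10):
    """Fit an OLS regression of a pure-noise target on p pure-noise
    predictors over the calibration sample, then score the same
    coefficients on an independent verification sample."""
    y_cal, y_ver = rng.standard_normal(n_cal), rng.standard_normal(n_ver)
    X_cal = rng.standard_normal((n_cal, p))
    X_ver = rng.standard_normal((n_ver, p))
    A = np.column_stack([np.ones(n_cal), X_cal])     # intercept + predictors
    coef, *_ = np.linalg.lstsq(A, y_cal, rcond=None)
    pred_cal = A @ coef
    pred_ver = np.column_stack([np.ones(n_ver), X_ver]) @ coef
    return sq_corr(y_cal, pred_cal), sq_corr(y_ver, pred_ver)

# Sample sizes chosen to be MBH-like (79-year calibration, 49-year
# verification); the 10 noise predictors are purely illustrative.
rng = np.random.default_rng(3)
trials = np.array([one_trial(rng) for _ in range(200)])
mean_cal, mean_ver = trials.mean(axis=0)
# mean_cal is inflated by fitting noise (roughly p/(n_cal - 1) in
# expectation); mean_ver shrinks back toward zero on independent data.
```

The gap between the two averages is the “shrinkage” at issue: calibration fit buys nothing out of sample when the predictors carry no signal.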
After doing hundreds of RE and CE calculations under various permutations, Bürger proposes that these simulations enable him to estimate the skill of the MBH98 reconstruction, which he places at 25%, less than MBH claims of about 50%.
Here there is an interesting difference between how he handles the RE statistic and how a statistician would view it. The viewpoint of a statistician is always that one is using a statistic (verification r2, RE, or for that matter Durbin-Watson or more exotic statistics) to test against a null hypothesis. On the other hand, Bürger’s viewpoint is that “skill scores” are estimates of the “skill” of the model, with the term “skill” somewhere morphing into an identification with percentage explained variance (with RE and CE as two variants). One quickly gets into deep epistemological waters here.
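The statistician’s viewpoint can be made concrete with a toy Monte Carlo. This is only a sketch of the general idea, not the MM05a procedure (which ran persistent red noise through MBH’s principal components step): simulate red-noise “proxies”, calibrate each one against a trending target by OLS, and take the 95th percentile of the resulting RE values as the significance benchmark. All parameter values below are illustrative.

```python
import numpy as np

def re_stat(obs, pred, cal_mean):
    """Reduction of Error: 1 - SSE(pred) / SSE(calibration-mean benchmark)."""
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - cal_mean) ** 2)

def ar1(n, phi, rng):
    """An AR(1) 'red noise' series with lag-one autocorrelation phi."""
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

def null_re_distribution(target, n_cal, phi=0.9, n_sims=1000, seed=0):
    """Under the null, each 'proxy' is pure red noise. Fit it to the target
    by OLS over a late calibration period, then score RE over the early
    verification period. The 95th percentile of the resulting REs is the
    benchmark a real reconstruction must beat."""
    rng = np.random.default_rng(seed)
    n = len(target)
    cal, ver = slice(n - n_cal, n), slice(0, n - n_cal)
    cal_mean = target[cal].mean()
    res = []
    for _ in range(n_sims):
        proxy = ar1(n, phi, rng)
        b, a = np.polyfit(proxy[cal], target[cal], 1)  # slope, intercept
        res.append(re_stat(target[ver], a + b * proxy[ver], cal_mean))
    return np.array(res)

# A temperature-like target: a trend plus noise over 128 "years", with the
# final 79 used for calibration (an MBH-like split).
rng = np.random.default_rng(1)
target = np.linspace(-0.3, 0.3, 128) + 0.2 * rng.standard_normal(128)
res = null_re_distribution(target, n_cal=79)
benchmark = np.percentile(res, 95)  # the 95% RE benchmark under this null
```

The point is not the particular number this toy setup produces, but the framing: an observed RE means nothing until it is compared against the distribution of RE achievable by skill-free series under the relevant null.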
It seems to me that the problem is almost identical to what we saw with Juckes’ attempts to claim 99.99% significance. In my comment on Juckes, I observed:
There is a large economics literature on the topic of “spurious” or “nonsense” correlations between unrelated series [Yule 1926; Granger and Newbold 1974; Hendry 1980; Phillips 1986 and a large literature]. Yule’s classic example of spurious correlation was between alcoholism and Church of England marriages. Hendry showed a spurious correlation between rainfall and inflation. The simulations performed in Juckes et al have virtually no “power” (in the statistical sense) as a test against possible spurious correlation between the Union reconstruction and temperature. A common, but not especially demanding, test is the Durbin-Watson test [Granger and Newbold 1974], whose use was encouraged by the NRC Panel (p. 87). According to my calculations, the Union Reconstruction failed even this test (see http://www.climateaudit.org/?p=945 ).
I also observed that it was trivial to develop an alternative chronology with different medieval-modern relationships that was also 99.99% significant through slight variations in proxy selection. Obviously both reconstructions could not be 99.99% significant — a point which seems interesting to me, but which Juckes chose not to respond to.
IMHO the problem with Juckes’ simulations is that, in null hypothesis terms, they simply tested the Union reconstruction against a null hypothesis of randomly generated AR1 red noise. I agree that the time series properties of the Union reconstruction are inconsistent with it being randomly selected AR1 noise. The trouble is that that is not the issue: the issue for these reconstructions is twofold: (1) do “key” proxies contain non-climatic trends, which have a spurious correlation with temperature in the calibration period? (2) has the selection of proxies into subsets been biased towards the inclusion of HS-shaped proxies (regardless of whether the reasons are “good”)?
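The spurious-regression worry can be illustrated with the classic Granger and Newbold setup, sketched here with synthetic data: regress one independent random walk on another and check the Durbin-Watson statistic of the residuals.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic of a residual series. Values near 2 indicate
    little residual autocorrelation; values near 0 indicate strong positive
    autocorrelation - the classic symptom of a spurious trend-on-trend fit."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Two independent random walks (integrated white noise). Any apparent
# relationship between them is 'nonsense correlation' in Yule's sense.
rng = np.random.default_rng(4)
x = np.cumsum(rng.standard_normal(500))
y = np.cumsum(rng.standard_normal(500))

b, a = np.polyfit(x, y, 1)     # OLS fit of y on x (slope, intercept)
resid = y - (a + b * x)
dw = durbin_watson(resid)
# dw lands far below 2: the residuals are heavily autocorrelated, flagging
# the regression as spurious regardless of how "significant" the slope looks.
```

This is the sense in which the Durbin-Watson test, though not especially demanding, has power against exactly the failure mode that randomly-generated AR1 null simulations do not probe.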
Despite all of Bürger’s work, he doesn’t seem to have grasped either nettle. In particular, he doesn’t mention bristlecone pines. The $64 question for MBH98 is whether there is a spurious relationship with bristlecones in a complicated multivariate setting with other proxies functioning as white noise or low-order red noise – very much the viewpoint of MM05b and our Reply to Huybers (MM05 EE curiously is not cited by Bürger, although he makes many citations to us and not disparagingly). I don’t know how you can do a statistical evaluation of MBH98 without taking note of the non-robustness of its results to the presence/absence of bristlecones. Thus, doing thousands of simulations on the basis that bristlecones are a valid proxy and hoping that this will shed some light on the “skill” of MBH98 seems a pointless exercise to me.
Having said that, Bürger is a thoughtful guy and there’s interesting material in the article. I didn’t enjoy reviewing it at all. I’ve only written up about half my notes here. At every step, I wanted to act as a friendly editor and send a cleaned-up version to him focusing on what is salvageable, rather than criticizing things. In this case, I think that would be more constructive and save a long song-and-dance. On the other hand, Referee #1 (Joel Guiot), an editor of CP, liked the article as it is.