Gerd Bürger published an interesting comment in Science (29 June 2007) on cherry-picking in Osborn and Briffa 2006. A few CA readers have noticed the exchange and brought it to my attention. Eduardo Zorita (who I was glad to hear from after our little dust-up at the Nature blog) sent me the info, as did Geoff Smith. I started on a summary yesterday, but quickly got distracted into one of the many possible thickets. So here’s Geoff’s summary:
There’s a pretty hot exchange (at least for CA readers) in last Friday’s Science magazine. Gerd Bürger (lead chapter author and contributor for the TAR) writes about Osborn and Briffa’s 2006 hockey stick (“The Spatial Extent of 20th-Century Warmth in the Context of the Past 1200 Years”) commenting critically on site selection and statistics. He writes “…given the large number of candidate proxies and the relatively short temporal overlap with instrumental temperature records, statistical testing of the reported correlations is mandatory. Moreover, the reported anomalous warmth of the 20th century is at least partly based on a circularity of the method, and similar results could be obtained for any proxies, even random-based proxies. This is not reflected in the reported significance levels”.
In commenting on the proxies (most of them well known to CA readers) he says that this “method of selecting proxies by screening a potentially large number of candidates for positive correlations runs the danger of choosing a proxy by chance. This is aggravated if the time series show persistence, which reduces the degrees of freedom for calculating correlations (6) and, accordingly, enhances random fluctuations of the estimates. Persistence, in the form of strong trends, is seen in almost all temperature and many proxy time series of the instrumental period. Therefore, there is a considerable likelihood of a type I error, that is, of incorrectly accepting a proxy as being temperature sensitive”.
He goes on to say: “This effect can only be avoided, or at least mitigated, if the proxies undergo stringent significance testing before selection. Osborn and Briffa did not apply such criteria”.
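For readers who want to see the persistence effect concretely, here is a minimal simulation – my own sketch, not Bürger’s code, and all parameters are invented for illustration. Two AR(1) series that are independent by construction get flagged as “significantly correlated” far more often than the nominal 5% rate unless the degrees of freedom are deflated for persistence, e.g. with the standard effective sample size n_eff = n(1 − r₁ʳ₁ʸ)/(1 + r₁ʳ₁ʸ) based on the estimated lag-1 autocorrelations:

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1(n, phi, rng):
    """Standardized AR(1) ("red noise") series."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return (x - x.mean()) / x.std()

n, phi, trials = 100, 0.9, 2000
naive_crit = 1.96 / np.sqrt(n - 3)   # 5% two-sided critical |r| under the iid assumption

naive_rej = adj_rej = 0
for _ in range(trials):
    x, y = ar1(n, phi, rng), ar1(n, phi, rng)   # independent by construction
    r = np.corrcoef(x, y)[0, 1]
    # estimated lag-1 autocorrelations drive the effective-sample-size adjustment
    r1x = np.corrcoef(x[:-1], x[1:])[0, 1]
    r1y = np.corrcoef(y[:-1], y[1:])[0, 1]
    n_eff = n * (1 - r1x * r1y) / (1 + r1x * r1y)
    adj_crit = 1.96 / np.sqrt(max(n_eff - 3, 1.0))
    naive_rej += abs(r) > naive_crit
    adj_rej += abs(r) > adj_crit

naive_rate, adj_rate = naive_rej / trials, adj_rej / trials
print(f"naive type I rate: {naive_rate:.2f}, persistence-adjusted: {adj_rate:.2f}")
```

With persistence this strong, the naive test rejects the (true) null of no relationship roughly half the time – exactly the inflated type I error Bürger is warning about.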
Bürger indicates the more serious problem is the series screening process, which only looked at proxies with positive correlations. “The majority of those random series would not even have been considered, having failed the initial screening for positive temperature correlations. Taking this effect into account, the independence of the series shrinks for the instrumental period”. This means in Bürger’s opinion that the “results described by Osborn and Briffa are therefore at least partly an effect of the screening, and the significance levels depicted in figure 3 in (1) have to be adjusted accordingly”.
Bürger repeats the analysis with the appropriate adjustments for temperature sensitivity and finds that, as a result, “the ‘highly significant’ occurrences of positive anomalies during the 20th century disappear. The 99th percentile is almost never exceeded, except for the very last years for θ = 1, 2. The 95th percentile is exceeded mostly in the early 20th century, but also about the year 1000”.
There is a reply by Osborn and Briffa, which gives a number of justifications of their procedures (which some will find unconvincing) but concludes: “we agree with Bürger that the selection process should be simulated as part of the significance testing process in this and related work and that this is an interesting new avenue that has not been given sufficient attention until now”.
Progress.
Refs:
1) Gerd Bürger, Comment on “The Spatial Extent of 20th-Century Warmth in the Context of the Past 1200 Years”, Science, 29 June 2007: Vol. 316, No. 5833, p. 1844. [DOI: 10.1126/science.1140982] available here (but only to subscribers)
2) Timothy J. Osborn and Keith R. Briffa, Response to Comment on “The Spatial Extent of 20th-Century Warmth in the Context of the Past 1200 Years”, Science, 29 June 2007: Vol. 316, No. 5833, p. 1844b. [DOI: 10.1126/science.1141446] available here
3) Timothy J. Osborn and Keith R. Briffa, The Spatial Extent of 20th-Century Warmth in the Context of the Past 1200 Years, Science, 10 February 2006: Vol. 311, No. 5762, pp. 841–844. [DOI: 10.1126/science.1120514] available here
A couple of quick points. The inter-relationship of persistence and spurious correlation was discussed in an econometrics context by Ferson et al 2003 (which I’ve written about here) and is a concept that has animated much of my thinking. David Stockwell has also been very attentive to the effects of picking from red noise based on correlations. One of the early exercises that I did was to see what happened if, like Jacoby, you picked the 10 “most temperature-sensitive” chronologies from synthetic red noise series with more than AR1 persistence and averaged them. Like Bürger, I did the exercise with persistent series and found that the Jacoby HS was not exceptional relative to red noise selections. (Jacoby only archived the 10 “most temperature-sensitive” series and failed to archive the others. He also refused to provide me the rejected series, referring, “as an ex-marine”, to a “few good men”.)
The Jacoby case was one of the few cases where one could quantify the picking activity and benchmark it against biased selection from red noise.
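The exercise is easy to replicate in a few lines. The sketch below is my own illustration, with an invented pool size and a synthetic linear “instrumental temperature”: it screens 50 red-noise series for correlation with a calibration-period trend, keeps the 10 “most temperature-sensitive”, and averages them. The composite acquires a 20th-century-style uptrend even though every series is pure noise:

```python
import numpy as np

rng = np.random.default_rng(42)

def red_noise(n, phi, rng):
    """AR(1) ("red noise") series with persistence phi."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

years, calib = 1000, 100        # 1000 "years", last 100 with "instrumental" data
pool_size, n_pick = 50, 10      # invented pool size; keep the top 10, Jacoby-style

temperature = np.linspace(0.0, 1.0, calib)   # synthetic warming trend

pool = [red_noise(years, 0.95, rng) for _ in range(pool_size)]
corrs = np.array([np.corrcoef(p[-calib:], temperature)[0, 1] for p in pool])

picked = np.argsort(corrs)[-n_pick:]         # the 10 "most temperature-sensitive"
composite = np.mean([pool[i] for i in picked], axis=0)

# the screened composite trends up in the calibration era despite being pure noise
slope = np.polyfit(np.arange(calib), composite[-calib:], 1)[0]
print(f"selected correlations: {corrs[picked].min():.2f}..{corrs[picked].max():.2f}, "
      f"calibration-era slope: {slope:.4f}")
```

Every selected “proxy” correlates positively with the temperature trend by construction of the screen, so their average inherits the trend – a hockey stick from nothing.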
Geoff spent more space on the Bürger comment than on the Osborn and Briffa reply. Its main response is that picking-by-correlation had a relatively minor impact on the selections at their stage, because they picked 14 from a universe of supposedly only 16 series available from Mann et al 2003 (EOS), Esper et al 2002 and Mann and Jones 2003. They said:
The 14 series used in (2 – Osborn and Briffa 2006) were selected from three previous studies (3—5: Mann et al EOS 2003; Esper et al 2002; Mann and Jones 2003), although this set also encompasses almost all the proxies with high temporal resolution used in the other Northern Hemisphere temperature reconstructions cited in (2) [MBH98, MBH99, Jones et al 1998, Crowley and Lowery 2000, Briffa 2000, Briffa et al 2001, Esper et al 2002, Mann et al 2003, Mann and Jones 2003, Moberg et al 2005, Rutherford et al 2005].
This statement is untrue even if Osborn and Briffa are granted the one stated qualifier and another unstated qualifier. The “high temporal resolution” qualifier is not defined; this qualifier excludes several series from Crowley and Lowery 2000 and Moberg et al 2005 and prefers tree rings. The second (unstated) qualifier is that the series go back to AD1000. This excludes the majority of the series. (However, Briffa et al 2001 has a very large population of series and a serious “divergence problem”. The Briffa et al 2001 network is one of two networks used in Rutherford et al 2005. The above statement is obviously false in respect of this population.) It is also false even for the long series used in these studies. There are many more than 14 series that cumulatively occur in these studies: there are Moroccan series used in MBH99 and a number of oddball series in Crowley and Lowery 2000. I’m in the process of making a definitive count, but it is far more than 14.
Osborn and Briffa also gloss over the impact of cumulative data snooping within the literature, by which biased selections accumulate. CA readers are familiar with this. Consider Briffa’s substitution of Yamal for Polar Urals: updated results for Polar Urals show a very elevated MWP. Even though Briffa had made his name in Nature (1995) for showing a cold MWP in the Polar Urals series, he did not publish the updated information and, in Briffa 2000, substituted Yamal (with a HS) for the Polar Urals series. This substitution was followed in all subsequent Team studies except, surprisingly, Esper et al 2002. The Polar Urals update is excluded from Osborn and Briffa 2006 on some pretext, even though they use both a foxtail and a bristlecone series (Mann’s PC1) from sites about 30 miles apart – closer than Polar Urals and Yamal. These individual substitutions are not trivial, as this one substitution affects medieval-modern levels in several studies.
Osborn and Briffa observe that
“it is difficult to quantify exactly the size of the pool of potential records from which the 14 series used in (2) were selected, because there is implicit and explicit selection at various stages, from the decision to publish original data to the decision to include data in large-scale climate reconstructions.”
Quite so.
They go on to say:
in our study (2), only two series were excluded on the basis of negative correlations with their local temperature, and no further series had been explicitly excluded by the three studies from which we obtained our data. We cannot be certain that prior knowledge of temperature correlations did not influence previous selection decisions, and there are more levels in the hierarchy of work upon which our study depends at which some selection decisions may have been made on the basis of correlations between proxy records and their local temperature. However, the degree of selectivity is unlikely to be much greater than that for which we have explicit information. Simply, there is not a large number of records of millennial length that have relatively high temporal resolution and an a priori expectation of a dominant temperature signal.
They argue that Bürger has created too large a universe for comparison and that the appropriate simulation is to check cherry-picking of 14 out of 16 – and, surprise, surprise, they emerge with seemingly significant results:
The assessment of the statistical significance of the results of (2) is modified so that, rather than comparing the real proxy results with a similar analysis based on 14 random synthetic proxy series, we now generate 16 synthetic series and select for analysis the 14 that exhibit the strongest correlations with their local temperature records.
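The point in dispute is how much the assumed pool size matters. A quick sketch (mine, with invented parameters) shows why: screening the best 14 correlations out of 50 random series yields a much more “temperature-sensitive” selection than screening 14 out of 16, so a null distribution built on a 16-series universe will understate the selection effect whenever the true universe was larger:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_pick, trials = 100, 14, 500
temperature = np.linspace(0.0, 1.0, n)   # stand-in "instrumental" trend

def mean_selected_corr(pool_size):
    """Average correlation with temperature of the 14 best-correlated
    white-noise series, as a function of how many candidates were screened."""
    out = []
    for _ in range(trials):
        pool = rng.standard_normal((pool_size, n))
        corrs = np.array([np.corrcoef(p, temperature)[0, 1] for p in pool])
        out.append(np.sort(corrs)[-n_pick:].mean())
    return float(np.mean(out))

small, large = mean_selected_corr(16), mean_selected_corr(50)
print(f"mean selected correlation, 14 of 16: {small:.3f}; 14 of 50: {large:.3f}")
```

Dropping only the 2 worst of 16 barely biases the selection; keeping the top 14 of a larger universe biases it substantially. The whole argument therefore turns on how big the true candidate universe was.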
Nowhere in either article are bristlecones mentioned, and yet they feature prominently in the differing results. Osborn and Briffa use Mann’s discredited PC1 as one proxy and nearby foxtails as another – two out of 14 in a “random” sample! As observed here previously, these do not have a significant correlation with temperature. Under Bürger’s slightly more stringent hurdle, series 1 and 3 are excluded as having too low a correlation (these are the PC1 and foxtails, both of which are HS-shaped and important to elevated 20th century results). Osborn and Briffa say that they use a low correlation hurdle for the following reason:
Our decision to use a weak criterion for selecting proxy records was intended to reduce the probability of erroneous exclusion of good proxies.
Well, one of the “good proxies” that they are working hard not to exclude is Mann’s PC1. 😈 In addition, it is obviously ludicrous that the Team should continue presenting permutations of bristlecones and foxtails as new studies, like the dead parrot in Monty Python. And if they have to lower the hurdle to get these series in, then they shouldn’t lower the hurdle. If the results are any good, they should survive the presence/absence of bristlecones/foxtails.
Their modeling of the cherry-picking process is ludicrous. It’s not even true that they only excluded two series. What about the Polar Urals update that was in Esper et al 2002 (uniquely)? Why wasn’t that used? Well, they had a pretext for using Yamal instead – yeah, there’s always a pretext. Briffa knows all about this substitution – he was the one who originally made it back in Briffa 2000. Instead of reporting the updated Polar Urals results (with a high MWP), as even a mining promoter would have had to do, Briffa substituted his own version of the Yamal series (which is now often attributed in Team articles to Hantemirov, even though Hantemirov’s reconstruction is different from Briffa’s). This substitution has a major impact on a couple of reconstructions – altering the medieval-modern level in Briffa 2000 and D’Arrigo et al 2006. So there’s at least one more series that they excluded. The pretext is that they already have a series from that area (Yamal) – but then what about the doubling up of bristlecone/foxtail series?
Osborn and Briffa falsely claim that the 14 series selected constitute “almost all the proxies with high temporal resolution” used in a range of Team studies:
The 14 series used in (2) were selected from three previous studies (3—5), although this set also encompasses almost all the proxies with high temporal resolution used in the other Northern Hemisphere temperature reconstructions cited in (2).
This claim is simply false, as any competent reviewer would have pointed out – jeez, any reader of CA could have pointed this out. The studies cited are: MBH98, MBH99, Jones et al 1998, Crowley and Lowery 2000, Esper et al 2002, Briffa 2000, Briffa et al 2001, Rutherford et al 2005, Mann and Jones 2003, Mann et al 2003.
Briffa et al 2001 contains hundreds of series and has a big “divergence problem” not shared by the cherry-picked series. Indeed, the bias of the picking is proven by the lack of divergence. Their retort would be that they meant to limit the matter to series that go back to AD1000. Fine, but then they should say that.
Second, it’s not true even for the series that go back to AD1000. I need to do a count, but at a minimum there are at least double that number within the listed studies: there are a couple of Morocco series in MBH99, a French tree ring series, and several oddball series in Crowley and Lowery 2000. By the time we get to Osborn and Briffa, there has already been a lot of data snooping.
Beyond that, there is obviously data snooping before this. For example, the Mount Logan dO18 series goes back to AD1000 and has the same sort of resolution as other ice core series. Why isn’t it used? Well, it has a depressed 20th century, which is attributed to wind circulation. But then how can you say that Dunde and Guliya aren’t affected by circulation as well, merely yielding different results? The use of Dunde (via the Yang composite) and not Mount Logan is classic cherry-picking that Osborn and Briffa have totally ignored. And BTW, another year has gone by without Lonnie Thompson reporting the Bona Churchill dO18 results. I’m standing by my prediction of last year that, if and when these results are ever published, they will not have elevated 20th century dO18. (And another year has gone by without Hughes reporting the Sheep Mountain update. What a swamp this is.)
And what about series like Rein’s offshore Peru lithics with a strong MWP anomaly? Is this excluded because it is not of sufficiently high resolution? Well, the Chesapeake Mg-Ca series has resolution of no more than 10 years in the MWP portion and has a couple of weird splices. And what of Mangini’s speleothems with high MWP? Or Biondi’s Idaho tree ring reconstruction? BTW: if the temporal resolution of the Chesapeake Mg-Ca series is used as a benchmark, there are quite a few ocean sediment series that qualify (e.g. Julie Richey’s series).
I need to make a systematic catalog of series going back to the MWP with resolution at least as high as the Chesapeake Mg-Ca series, but off the cuff, I’d say that there are at least 50 series, probably more.
So when Osborn and Briffa say that the universe from which they’ve selected can be represented by selecting 14 of 16, this is completely absurd. There has probably been cherry-picking from at least three times that population. But aside from all that, the active ingredients in the 20th century anomaly remain the same old whores: bristlecones, foxtails, Yamal. They keep trotting them out in new costumes, but really it’s time to get them off the street.