Reflecting on then-current scandals in psychology arising from non-replicable research, E.J. Wagenmakers, a prominent psychologist, blamed many of the problems on “data torture”, attributing them in large part to ex post selection of methods. In today’s post, I’ll show an extraordinary example of data torture in the PAGES2K Australasian reconstruction.
Wagenmakers on Data Torture
In the first article, Wagenmakers observed that psychologists did not define their statistical methods before examining the data, creating a temptation to tune the results to obtain a “desired result”:
we discuss an uncomfortable fact that threatens the core of psychology’s academic enterprise: almost without exception, psychologists do not commit themselves to a method of data analysis before they see the actual data. It then becomes tempting to fine tune the analysis to the data in order to obtain a desired result—a procedure that invalidates the interpretation of the common statistical tests. The extent of the fine tuning varies widely across experiments and experimenters but is almost impossible for reviewers and readers to gauge.
Some researchers succumb to this temptation more easily than others, and from presented work it is often completely unclear to what degree the data were tortured to obtain the reported confession.
It is obvious that Wagenmakers’ concerns are relevant to paleoclimate, where ad hoc and post hoc methods abound and where some results are more attractive to researchers.
Gergis et al 2012
As is well known to CA readers, Gergis et al did ex post screening of their network by correlation against their target Australasian regional summer temperature, reducing the network from 62 series to 27. Climate blogs have long criticized ex post screening as a bias-inducing procedure - a bias that is obvious, but which has been largely neglected in the academic literature, where the issue has for the most part been either ignored or denied by specialists.
Gergis et al 2012, very unusually for the field, stated that they intended to avoid screening bias by screening on detrended data, describing their screening process as follows:
For predictor selection, both proxy climate and instrumental data were linearly detrended over the 1921-1990 period to avoid inflating the correlation coefficient due to the presence of the global warming signal present in the observed temperature record. Only records that were significantly (p<0.05) correlated with the detrended instrumental target over the 1921-1990 period were selected for analysis. This process identified 27 temperature-sensitive predictors for the SONDJF warm season.
Unfortunately for Gergis and coauthors, that’s not what they actually did: their screening was done on undetrended data. When screening was done in the described way, only 8 or so proxies survived. Jean S discovered this a few weeks after the article’s publication on May 17, 2012. Two hours after Jean S’ comment at CA, coauthor Neukom notified Gergis and Karoly of the problem.
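To see why the detrended/undetrended distinction matters so much, here is a minimal sketch (with synthetic, hypothetical data - not the actual proxy series) of why a shared warming trend inflates a raw correlation relative to a detrended one:

```python
import math
import random

def detrend(xs):
    """Remove an OLS linear trend, returning the residuals."""
    n = len(xs)
    t = list(range(n))
    tbar = sum(t) / n
    xbar = sum(xs) / n
    slope = sum((ti - tbar) * (xi - xbar) for ti, xi in zip(t, xs)) / \
            sum((ti - tbar) ** 2 for ti in t)
    return [xi - xbar - slope * (ti - tbar) for ti, xi in zip(t, xs)]

def pearson(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n = 70  # 1921-1990 calibration period
trend = [0.02 * i for i in range(n)]
# temperature and "proxy" share the trend but have independent noise,
# i.e. the proxy has no real interannual temperature signal
temp  = [t + random.gauss(0, 0.3) for t in trend]
proxy = [t + random.gauss(0, 0.3) for t in trend]

r_raw = pearson(proxy, temp)                       # inflated by the common trend
r_det = pearson(detrend(proxy), detrend(temp))     # near zero
```

Here a trend-only "proxy" shows a strong raw correlation but essentially none after detrending - which is precisely why a detrended test is the stricter screen.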
Gergis and coauthors, encouraged by Gavin Schmidt and Michael Mann, attempted to persuade the Journal of Climate editors that they should be allowed to change the description of their methodology to what they had actually done. However, the editors did not agree, challenging the Gergis coauthors to show the robustness of their results. The article was not retracted. The University of Melbourne press statement continues to say that it was published on May 17, 2012, but has been submitted for re-review (and has apparently been under review for over two years now.)
The PAGES2K Australasian network is the product of the same authors. Its methodological description is taken almost verbatim from Gergis et al 2012, and its network is substantially identical to the Gergis 2012 network: 20 of 27 Gergis proxies carry forward to the P2K network. Several of the absent series are from Antarctica, which is covered separately in P2K. The new P2K network has 28 series, now including 8 series that had previously been screened out. The effort to maintain continuity even extended to keeping proxies in the same order in the listing, with new series inserted into the precise empty spaces left by departing series.
Once again, the authors claimed to have done their analysis using detrended data:
All data were linearly detrended over the 1921-1990 period and AR(1) autocorrelation was taken into account for the calculation of the degrees of freedom.
This raises an obvious question: in the earlier test using detrended data, only a fraction of the proxies passed. So how did they pass a detrended test this time?
Read their description of P2K screening and watch the pea:
The proxy data were correlated against the grid cells of the target (HadCRUT3v SONDJF average). To account for proxies with different seasonal definitions than our target SONDJF season (for example calendar year averages) we calculate the correlations after lagging the proxies for -1, 0 and 1 years. Records with significant (p < 0.05) correlations with at least one grid-cell within a search radius of 500 km from the proxy site were included in the reconstruction. All data were linearly detrended over the 1921-1990 period and AR(1) autocorrelation was taken into account for the calculation of the degrees of freedom. For coral record with multiple proxies (Sr/Ca and δ18O) with significant correlations, only the proxy record with the higher absolute correlation was selected to ensure independence of the proxy records.
Gergis et al 2012 had calculated one correlation for each proxy, but the above paragraph describes ~27 correlations per proxy: three lag periods (+1, 0, -1) by nine gridcells (not just the host gridcell, but also the W, NW, N, NE, E, SE, S and SW gridcells, all of which would be within 500 km on my reading of the above text). The other important change is from testing against a regional average to testing against individual gridcells, which, in some cases, are not even in the target region.
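The effect of taking the best of ~27 tests can be illustrated by simulation. The sketch below idealises the 27 correlations as independent null test statistics (an assumption on my part - neighbouring gridcells are spatially correlated, so the true effective number of tests is smaller), and asks how often a pure-noise "proxy" passes a nominal one-sided 95% screen:

```python
import random

random.seed(1)
N_PROXIES, N_TESTS = 2000, 27   # 9 gridcells x 3 lags per proxy
CRIT = 1.645                    # one-sided 95% point of a normal test statistic

passes = 0
for _ in range(N_PROXIES):
    # pure-noise proxy: 27 independent null test statistics
    stats = [random.gauss(0, 1) for _ in range(N_TESTS)]
    if max(stats) > CRIT:       # pass if ANY of the 27 tests is "significant"
        passes += 1

rate = passes / N_PROXIES
# under independence, roughly 1 - 0.95**27 ≈ 75% of noise proxies pass
```

Even allowing for spatial correlation among the gridcells, a screen that accepts the best of many tries passes noise far more often than the nominal 5%.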
Gergis’ test against multiple gridcells takes the peculiar Mann et al 2008 pick-two methodology to even more baroque lengths. Thinking back to Wagenmakers’ prescription of ex ante methods, it is hard to imagine Gergis and coauthors proposing ex ante that they test each proxy against nine different gridcells for “statistical significance”. Nor does it seem plausible that much “significance” can be placed on a higher correlation from a contiguous gridcell, as compared to the actual gridcell. It seems evident that Gergis and coauthors were doing whatever they could to salvage as much of their network as possible, and that this elaborate multiple screening procedure was simply a means to that end. Nor does it seem reasonable to data mine after the fact for “significant” correlations across three different lag periods, including one in which the proxy leads temperature.
Had the PAGES2K coauthors fully discussed the background and development of this procedure from its origin in Gergis et al 2012, it is hard to believe that a competent reviewer would not have challenged them on this peculiar screening procedure. Even if such data torture were acquiesced in (which is dubious), it should have been mitigated by adjusting the t-statistic benchmark to account for the repeated tests: with 27 draws, the odds of obtaining a value that is “95% significant” obviously change dramatically. When the draws are independent, there are well-known procedures for doing so. Under a Bonferroni correction with 27 “independent” tests, the t-statistic for each individual test would have to exceed qt(1-0.05/27, df) rather than qt(1-0.05, df). For typical detrended autocorrelations, the df is ~55, which changes the benchmark t-statistic from ~1.7 to ~3.0. The effective number of independent tests is less than 27 because of spatial correlation, but even if it were as few as 10, the benchmark t-statistic still rises to ~2.7. All this is without accounting for the initial consideration of 62 proxies - something else that ought to be reflected in the t-test.
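The Bonferroni arithmetic is easy to verify. The sketch below uses a normal approximation from the Python standard library rather than the Student-t quantile `qt(..., df=55)` quoted above, so its values run slightly below the t-based benchmarks (~2.90 versus ~3.0 for 27 tests):

```python
from statistics import NormalDist

def bonferroni_threshold(alpha=0.05, m=27):
    """One-sided critical value after a Bonferroni correction for m tests.
    Normal approximation; with ~55 degrees of freedom the Student-t
    values are slightly higher (~3.0 for m=27, ~2.7 for m=10)."""
    return NormalDist().inv_cdf(1 - alpha / m)

single  = NormalDist().inv_cdf(1 - 0.05)  # one test: ~1.64
multi   = bonferroni_threshold()          # 27 tests: ~2.90
multi10 = bonferroni_threshold(m=10)      # 10 effective tests: ~2.58
```

The point survives any reasonable allowance for spatial correlation: once the screen involves dozens of tries per proxy, the benchmark for "significance" nearly doubles.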
While all of these are real problems, the largest problem with the Neukom-Gergis network is grounded in the data itself: the long ice core and tree ring series don’t have a HS shape, but there is a very strong trend in coral δ18O data after the Little Ice Age, and especially in the 20th century. Splicing the two dissimilar proxy datasets yields hockey sticks even without screening. Such splicing of unlike data under the guise of a “multiproxy” reconstruction has been endemic in paleoclimate since Jones et al 1998 and is underdiscussed - a topic that I plan to return to.
There are other peculiarities in the Gergis dataset. Between Gergis et al 2012, PAGES2K and Neukom et al 2014, numerous proxies are assigned to inconsistent calendar years. If a proxy is assigned to a calendar year inconsistent with that of its corresponding temperature series, the calculated correlation will understate the true correlation. Some of the low detrended correlations of Gergis et al 2012 appear to have arisen from such errors in proxy year assignment. I noticed this with Oroko, which I analysed in detail: given its splicing of instrumental data, it ought to pass a detrended correlation test, and its failure to do so therefore requires close examination.
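A small synthetic sketch (hypothetical data, not the Oroko series) shows how severe the effect of a one-year misassignment can be on an interannual (detrended-style) correlation:

```python
import random

def pearson(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    sxx = sum((x - xb) ** 2 for x in xs)
    syy = sum((y - yb) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(2)
n = 70
# interannual temperature variability (little year-to-year persistence)
temp = [random.gauss(0, 1) for _ in range(n)]
# a proxy that genuinely tracks temperature in the correct year
proxy = [t + random.gauss(0, 0.5) for t in temp]

r_aligned = pearson(proxy, temp)
# mis-assigning the proxy's calendar years by one year offsets the series
r_shifted = pearson(proxy[1:], temp[:-1])
```

With little year-to-year persistence in the target, a one-year offset collapses an otherwise strong correlation toward zero - enough to make a genuinely temperature-sensitive proxy fail a detrended screen.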