“Unprecedented” Model Discrepancy

Judy Curry recently noted that Phil Jones’ 2014 temperature index (recently the subject of major adjustments in methodology) might be a couple of hundredths of a degree higher than a few years ago and alerted her readers to potential environmental NGO triumphalism. Unsurprisingly, it has also been observed in response that the hiatus continues in full force for the satellite records, with 1998 remaining the warmest satellite year by a considerable margin.

Equally noteworthy, however – and of greater interest to CA readers, where there has been more focus on model-observation discrepancy – is that the overheating discrepancy between models and surface temperatures in 2014 was the fourth highest in “recorded” history and that the 5 largest warm discrepancies have occurred in the past 6 years. The cumulative discrepancy between models and observations is far beyond any previous precedent. This is true for both surface and satellite comparisons.
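As a rough sketch of the bookkeeping behind these statements, the snippet below computes a yearly model-minus-observation discrepancy, ranks the years by it, and accumulates it over time. The anomaly values are invented placeholders, not CMIP5 or HadCRUT4 data, and the function names are hypothetical.

```python
from itertools import accumulate

def discrepancy(model_mean, observed):
    """Yearly model-minus-observation difference (deg C), keyed by year."""
    return {yr: model_mean[yr] - observed[yr] for yr in model_mean if yr in observed}

def rank_of(year, disc):
    """1-based rank of `year` when years are sorted from largest to smallest discrepancy."""
    ordered = sorted(disc, key=disc.get, reverse=True)
    return ordered.index(year) + 1

def cumulative(disc):
    """Running sum of the discrepancy in chronological order."""
    years = sorted(disc)
    return dict(zip(years, accumulate(disc[y] for y in years)))

# Placeholder anomalies (deg C), invented purely for illustration.
model = {2009: 0.60, 2010: 0.63, 2011: 0.66, 2012: 0.69, 2013: 0.72, 2014: 0.75}
obs = {2009: 0.50, 2010: 0.56, 2011: 0.42, 2012: 0.47, 2013: 0.49, 2014: 0.56}

disc = discrepancy(model, obs)
print("rank of 2014:", rank_of(2014, disc))
print("cumulative discrepancy to 2014:", round(cumulative(disc)[2014], 2))
```

With real inputs, `model` would hold the ensemble mean and `obs` the observational index, both on the same anomaly basis.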

In the figure below, I’ve compared CMIP5 RCP4.5 models to updated surface observations (updating a graphic used here previously), adding a lower panel showing the discrepancy between observations and CMIP5 RCP4.5 model mean.



Figure 1. Top panel. CMIP5 RCP4.5 model mean (black) and 5-95 percentile envelope (grey) compared to HadCRUT4 (red). Dotted blue – the projection of the hiatus/slowdown (1997-2014) to 2030; dotted red – a projection in which observations catch up to CMIP5 RCP4.5 model mean by 2030. Bottom panel – discrepancy between CMIP5 RCP4.5 model mean and HadCRUT4 observations. All values basis 1961-1990.


During the hiatus/slowdown, HadCRU changed their methodology: the changes in methodology contribute more to the slight resulting trend in HadCRUT4 than the trend in common with the older methodology. Even stipulating the change in method, 2014 observed surface temperatures are somewhat up from 2013 but still only at the bottom edge of the confidence interval envelope for CMIP5 models. Because the CMIP5 model mean goes up relentlessly, the 2014 uptick in HadCRUT4 is far too little to close the discrepancy, which remains at near-record levels. I’ve also shown two scenarios out to 2030. The dotted blue line continues the lower trend during the hiatus, while the dotted red line shows a catch-up to model mean by 2030. Reasonable people can disagree over which of the two scenarios is more likely. In either scenario, the cumulative discrepancy continues to build and reaches unprecedented levels.

In the second graphic, I’ve done an identical plot for satellite temperature (RSS TLT), centering over 1979-1990 since satellite records did not start until 1979. The discrepancy between model TLT and observed TLT is increasingly dramatic.

Figure 2. As above, but for TLT satellite records.


Reasonable people can disagree on why the satellite record differs from the surface record, but the discrepancy between models and observations ought not to be sloughed off because the 2014 value of Phil Jones’ temperature index is a couple of hundredths higher than a few years ago.

The “warmest year”, to its shame, neglected Toronto, which experienced a bitter winter and cool summer last year. For now, we can perhaps take some small comfort in the fact that human civilization has apparently continued to exist, perhaps even thrive, even in the face of the “warmest year”.

Some readers wondered why I showed RSS, but not UAH. In past controversies, RSS has been preferred by people who dislike the analysis here, so I used it to be accommodating. Here is the same graphic using UAH.

Figure 3. As Figure 2, but with UAH.



Sheep Mountain Update

Several weeks ago, a new article (open access) on Sheep Mountain (Salzer et al 2014, Env Res Lett) was published, based on updated (to 2009) sampling at Sheep Mountain.

One of the longstanding Climate Audit challenges to the paleoclimate community, dating back to the earliest CA posts, was to demonstrate out-of-sample validity of proxy reconstructions, by updating inputs subsequent to 1980. Because Graybill’s bristlecone chronologies were so heavily weighted in the Mann reconstruction,  demonstrating out-of-sample validity at Sheep Mountain and other key Graybill sites is essential to validating the Mann reconstruction out of sample.

The new information shows dramatic failure of the Sheep Mountain chronology as an out-of-sample temperature proxy: it diverges sharply from NH temperature after 1980, the end of the Mann et al (and many other) reconstructions. While the issue is very severe for the Mann reconstructions, it affects numerous other reconstructions, including PAGES2K.

Anti-SLAPP Hearing Today

Mann v CEI, National Review, Simberg, Steyn and their amici is being argued today. Amici for Steyn, CEI, Simberg and NR include: American Civil Liberties Union, the Reporters Committee for Freedom of the Press, American Society of News Editors, the Association of Alternative Newsmedia, the Association of American Publishers, Inc., Bloomberg L.P., the Center for Investigative Reporting, the First Amendment Coalition, First Look Media Inc., Fox News Network, Gannett Co. Inc., the Investigative Reporting Workshop, the National Press Club, the National Press Photographers Association, Comcast Corporation, the Newspaper Association of America, the North Jersey Media Group Inc., the Online News Association, the Radio Television Digital News Association, the Seattle Times Company, the Society of Professional Journalists, Stephens Media LLC, Time Inc., Tribune Publishing, the Tully Center for Free Speech, D.C. Communications, Inc. and the Washington Post.

Disappointingly, Scott Mandia and the costumed vigilantes of the Climate Response Team elected not to appear as Mann amici. (Nor anyone else.)

New Data and Upside-Down Moberg

I’ve been re-examining SH proxies for some time now, both in connection with PAGES2K and out of intrinsic relevance.  In today’s post, I’ll report on a new (relatively) high-resolution series from  the Arabian Sea offshore Pakistan (Boll et al 2014, Late Holocene primary productivity and sea surface temperature variations in the northeastern Arabian Sea: implications for winter monsoon variability, pdf).  The series has considerable ex ante interest on a couple of counts. Alkenones yield temperature proxies that have a couple of important advantages relative to nearly all other temperature “proxies”: they are calibrated in absolute temperature (not by anomalies); and they yield glacial-interglacial patterns that make “sense”. No post hoc screening or trying to figure out which way is up.  In the extratropics, their useful information is limited to summer season, but so are nearly all other proxies. Though more or less ignored in IPCC AR5, the development of alkenone series has arguably been one of the most important paleoclimate developments in the past 10 years and is something that I pay attention to.

But there is a big conundrum in trying to use them for 20th century comparisons: all of the very high resolution alkenone series to date are from upwelling zones and show a precipitous decline (downward HS) in 20th century temperatures. See discussion here.  These precipitous declines have been very closely examined by specialists, who conclude, according to my reading, that this is not a “divergence” breakdown of the proxy-temperature relationship, but rather an actual decrease in local SST in the upwelling zone, attributed (plausibly) to increased upwelling.

Because upwelling zones form only a small fraction of the ocean (though an important fraction due to biological productivity), it is important to obtain corresponding high-resolution alkenone series from non-upwelling zones. The Boll et al 2014 series is the first such example that I’ve seen and, in my opinion, it sheds very interesting new light on the vexed issue of two-millennium temperature.

Data Torture in Gergis2K

Reflecting on then-current scandals in psychology arising from non-replicable research, E. Wagenmakers, a prominent social psychologist, blamed many of the problems on “data torture”. Wagenmakers attributed many data torture problems to ex post selection of methods. In today’s post, I’ll show an extraordinary example of data torture in the PAGES2K Australasian reconstruction.

Wagenmakers on Data Torture


Two accessible Wagenmakers articles on data torture are An Agenda for Purely Confirmatory Research pdf and A Year of Horrors pdf.

In the first article, Wagenmakers observed that psychologists did not define their statistical methods before examining the data, creating a temptation to tune the results to obtain a “desired result”:

we discuss an uncomfortable fact that threatens the core of psychology’s academic enterprise: almost without exception, psychologists do not commit themselves to a method of data analysis before they see the actual data. It then becomes tempting to fine tune the analysis to the data in order to obtain a desired result—a procedure that invalidates the interpretation of the common statistical tests. The extent of the fine tuning varies widely across experiments and experimenters but is almost impossible for reviewers and readers to gauge.

Wagenmakers added:

Some researchers succumb to this temptation more easily than others, and from presented work it is often completely unclear to what degree the data were tortured to obtain the reported confession.

It is obvious that Wagenmakers’ concerns are relevant to paleoclimate, where ad hoc and post hoc methods abound and where some results are more attractive to researchers.

Gergis et al 2012

As is well-known to CA readers, Gergis et al did ex post screening of their network by correlation against their target Australasian region summer temperature. Screening reduced the network from 62 series to 27. For a long time, climate blogs have criticized ex post screening as a bias-inducing procedure – a bias that is obvious, but which has been neglected in academic literature. For the most part, the issue has been either ignored or denied by specialists.

Gergis et al 2012, very unusually for the field, stated that they intended to avoid screening bias by screening on detrended data, describing their screening process as follows:

For predictor selection, both proxy climate and instrumental data were linearly detrended over the 1921-1990 period to avoid inflating the correlation coefficient due to the presence of the global warming signal present in the observed temperature record. Only records that were significantly (p<0.05) correlated with the detrended instrumental target over the 1921-1990 period were selected for analysis. This process identified 27 temperature-sensitive predictors for the SONDJF warm season.

Unfortunately for Gergis and coauthors, that’s not what they actually did. Their screening was done on undetrended data. When screening was done in the described way, only 8 or so proxies survived.  Jean S discovered this a few weeks after publication of the Gergis et al article on May 17, 2012.  Two hours after Jean S’ comment at CA, coauthor Neukom notified Gergis and Karoly of the problem.
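To make the distinction concrete, here is a minimal sketch of why the two screening variants can give different answers. It uses two synthetic series that share a linear trend but have unrelated wiggles; the series, the 70-point length (standing in for 1921-1990) and the function names are assumptions for illustration only, and the significance test itself is omitted.

```python
import math

def detrend(y):
    """Subtract the ordinary least-squares straight line from a series."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    slope = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y)) / sxx
    return [v - (y_mean + slope * (i - x_mean)) for i, v in enumerate(y)]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Synthetic stand-ins: a common upward trend plus unrelated oscillations.
n_years = 70
proxy = [0.05 * i + math.sin(i) for i in range(n_years)]
target = [0.03 * i + math.cos(2.3 * i) for i in range(n_years)]

r_raw = pearson(proxy, target)
r_detrended = pearson(detrend(proxy), detrend(target))
print("undetrended r:", round(r_raw, 2))
print("detrended r:  ", round(r_detrended, 2))
```

In this toy case the undetrended correlation is driven almost entirely by the shared trend, which is the inflation that detrended screening is meant to avoid.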

Gergis and coauthors, encouraged by Gavin Schmidt and Michael Mann, attempted to persuade the Journal of Climate editors that they should be allowed to change the description of their methodology to what they had actually done. However, the editors did not agree, challenging the Gergis coauthors to show the robustness of their results. The article was not retracted. The University of Melbourne press statement continues to say that it was published on May 17, 2012, but has been submitted for re-review (and has apparently been under review for over two years now.)


The PAGES2K Australasian network is the product of the same authors. Its methodological description is taken almost verbatim from Gergis et al 2012.  Its network is substantially identical to the Gergis 2012 network: 20 of 27 Gergis proxies carry forward to the P2K network. Several of the absent series are from Antarctica, covered separately in P2K.  The new P2K network has 28 series, now including 8 series that had been previously screened out.  The effort to maintain continuity even extended to keeping proxies in the same order in the listing, even inserting new series in the precise empty spaces left by vacating series.

Once again, the authors claimed to have done their analysis using detrended data:

All data were linearly detrended over the 1921-1990 period and AR(1) autocorrelation was taken into account for the calculation of the degrees of freedom [55].

This raises an obvious question:  in the previous test using detrended data, only a fraction passed.  So how did they pass the detrended test this time?

Read their description of P2K screening and watch the pea:

The proxy data were correlated against the grid cells of the target (HadCRUT3v SONDJF average). To account for proxies with different seasonal definitions than our target SONDJF season (for example calendar year averages) we calculate the correlations after lagging the proxies for -1, 0 and 1 years. Records with significant (p < 0.05) correlations with at least one grid-cell within a search radius of 500 km from the proxy site were included in the reconstruction. All data were linearly detrended over the 1921-1990 period and AR(1) autocorrelation was taken into account for the calculation of the degrees of freedom [55]. For coral record with multiple proxies (Sr/Ca and δ18O) with significant correlations, only the proxy record with the higher absolute correlation was selected to ensure independence of the proxy records.

Gergis et al 2012 had calculated one correlation for each proxy, but the above paragraph describes ~27 correlations: three lag periods (+1, 0, -1) by nine gridcells (not just the host gridcell, but the W, NW, N, NE, E, SE, S and SW gridcells, all of which would be within 500 km according to my reading of the above text). The other important change is the change from testing against a regional average to testing against individual gridcells, which, in some cases, are not even in the target region.


Gergis’  test against multiple gridcells takes the peculiar Mann et al 2008 pick-two methodology to even more baroque lengths.  Thinking back to Wagenmakers’ prescription of ex ante methods, it is hard to imagine Gergis and coauthors ex ante proposing that they test each proxy against nine different gridcells for “statistical significance”. Nor does it seem plausible that much “significance” can be placed on higher correlations from a contiguous gridcell, as compared to the actual gridcell.  It seems evident that Gergis and coauthors were doing whatever they could to salvage as much of their network as they could and that this elaborate multiple screening procedure was simply a method of accomplishing that end.  Nor does it seem reasonable to data mine after the fact for “significant” correlations between three different lag periods, including one in which the proxy leads temperature.

Had the PAGES2K coauthors fully discussed the background and development of this procedure from its origin in Gergis et al 2012, it seems hard to believe that a competent reviewer would not have challenged them on this peculiar screening procedure. Even if such data torture were acquiesced in (which is dubious), it should have been mitigated by requiring adjustment of the t-statistic standard to account for the repeated tests: with 27 draws, the odds of a value that is “95% significant” obviously change dramatically. When the draws are independent, there are well-known procedures for doing so. Using the Bonferroni correction with 27 “independent” tests, the t-statistic for each individual test would have to be qt(1-0.05/27, df) rather than qt(1-0.05, df). For typical detrended autocorrelations, the df is ~55. This changes the benchmark t-statistic from ~1.7 to ~3.0. The effective number of independent tests would be less than 27 because of spatial correlation, but even if the effective number of independent tests was as few as 10, it increases the benchmark t-statistic to ~2.7. All this is without accounting for their initial consideration of 62 proxies – something else that ought to be accounted for in the t-test.
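The arithmetic can be checked with a short sketch. The `qt` in the text is R's Student-t quantile function; the Python below is a standard-library stand-in that integrates the t density with Simpson's rule and inverts it by bisection, so the helper names (`t_cdf`, `t_quantile`, `bonferroni_t`) are assumptions of this sketch, as is treating the tests as one-sided and independent.

```python
import math

def t_cdf(x, df, steps=2000):
    """Student-t CDF via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: norm * (1 + u * u / df) ** (-(df + 1) / 2)
    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, df):
    """Invert t_cdf by bisection; plays the role of R's qt(p, df)."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def bonferroni_t(alpha, n_tests, df):
    """One-sided benchmark t-statistic after a Bonferroni correction."""
    return t_quantile(1 - alpha / n_tests, df)

def familywise_error(alpha, n_tests):
    """Chance of at least one false 'significant' result among independent tests."""
    return 1 - (1 - alpha) ** n_tests

df = 55  # approximate degrees of freedom quoted in the text
print("single test:", round(bonferroni_t(0.05, 1, df), 2))
print("27 tests:   ", round(bonferroni_t(0.05, 27, df), 2))
print("10 tests:   ", round(bonferroni_t(0.05, 10, df), 2))
print("P(at least one false pass in 27):", round(familywise_error(0.05, 27), 2))
```

The printed thresholds should land close to the ~1.7, ~3.0 and ~2.7 figures quoted above; the last line shows how likely it is that an uncorrected battery of 27 independent tests yields at least one spurious pass.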

While all of these are real problems, the largest problem with the Neukom-Gergis network is grounded in the data: the long ice core and tree ring series don’t have a HS shape. However, there is a very strong trend in coral d18O data after the Little Ice Age and especially in the 20th century. Splicing the two dissimilar proxy datasets results in hockey sticks even without screening. Such splicing of unlike data in the guise of “multiproxy” has been endemic in paleoclimate since Jones et al 1998 and is underdiscussed. It’s something that I plan to discuss.

There are other peculiarities in the Gergis dataset.  Between Gergis et al 2012, PAGES2K and Neukom et al 2014,  numerous proxies are assigned to inconsistent calendar years.  If a proxy is assigned to a calendar year that is inconsistent with the calendar year of its corresponding temperature series, the calculated correlation will be less than it really is.  Some of the low detrended correlations of Gergis et al 2012 appear to have arisen from errors in proxy year assignment. I noticed this with Oroko which I analysed in detail: it ought to pass a detrended correlation test given the splicing of instrumental data and therefore failure of a detrended correlation test requires close examination.

PAGES2K and Nature’s Policy against Self-Plagiarism

Nature’s policies on plagiarism state:

Duplicate publication, sometimes called self-plagiarism, occurs when an author reuses substantial parts of his or her own published work without providing the appropriate references.

The description of the Australasian network of PAGES2K (coauthors Gergis, Neukom, Phipps and Lorrey) is almost entirely lifted in verbatim or near-verbatim chunks from Gergis et al, 2012 (withdrawn and under re-review), in apparent violation of Nature’s policy against self-plagiarism.


Gergis2K and the Oroko “Disturbance-Corrected” Blade

Only two Gergis proxies (both tree ring) go back to the medieval period: Oroko Swamp, New Zealand and Mt Read, Tasmania, both from Ed Cook.  Although claims of novelty have been made for the Gergis reconstruction, neither of these proxies is “new”, with both illustrated in AR4 and Mt Read being used as early as Mann et al 1998 and Jones et al 1998.

In today’s post, I’ll look in more detail at the Oroko tree ring chronology, which was used in three technical articles by Ed Cook (Cook et al 2002 Glob Plan Chg; Cook et al 2002 GRL; Cook et al 2006) to produce temperature reconstructions.  In Cook’s earliest article (2002 Glob Plan Chg), Cook showed a tree ring chronology which declined quite dramatically after 1957.  Cook reported that there was a very high correlation to instrumental summer temperature (Hokitika, South Island NZ) between 1860 and 1957, followed by a “collapse” in correlation after 1957 – a decline attributed by Cook to logging at the site.  For his reconstruction of summer temperature, Cook  “accordingly” replaced the proxy estimate with instrumental temperature after 1957, an artifice clearly marked in Cook’s original articles, but not necessarily in downstream multiproxy uses.

Gergis et al 2012 (which corresponds to PAGES2K up to a puzzling one year offset) said that they used “disturbance-corrected” data for Oroko:

“for consistency with published results, we use the final temperature reconstructions provided by the original authors that includes disturbance-corrected data for the 213 Silver Pine record…” (E. Cook, personal communication)

By “disturbance correction”, do they mean the replacement of proxy data after 1957 by instrumental data? Or have they employed some other method of “disturbance correction”?

Assessment of this question is unduly complicated because Cook never archived Oroko measurement data or, for that matter, any of the chronology versions or reconstructions appearing in the technical articles.  Grey versions of the temperature reconstruction (but not chronology) have circulated in connection with multiproxy literature (including Mann and Jones 2003, Mann et al 2008, Gergis et al 2012 and PAGES2K 2013).  In addition, two different grey versions occur digitally in Climategate letters from 2000 and 2005, with the later version clearly labeled as containing a splice of proxy and instrumental data.  The Gergis version is clearly related to the earlier grey versions, but, at present, I am unable to determine whether the “disturbance correction” included an instrumental splice or not.

There’s another curiosity. As noted above, Cook originally claimed a high correlation to instrumental temperature up to at least 1957, and, based on their figures, the correlation to 1999 would still have been positive, even if attenuated, but Mann and Jones 2003 reported a negative correlation (-0.25) to instrumental temperature. However, Gergis et al 2012 obtained opposite results, once again asserting a statistically significant positive correlation to temperature. To the extent that there had been splicing of instrumental data into the Gergis version, one feels that claims of statistical significance ought to be qualified. Nonetheless, the negative correlation claimed in Mann and Jones 2003 is puzzling: how did they obtain an opposite sign to Cook’s original study?

As to the Oroko proxy itself,  it does not have anything like a HS-shape. It has considerable centennial variability. Its late 20th century values are somewhat elevated (smoothed 1 sigma basis 1200-1965), but nothing like the Gergis 4-sigma anomaly.  It has no marked LIA or MWP. It has elevated values in the 13th century, but it has low values in the 11th century, the main rival to the late 20th century, and these low 11th century values  attenuate reconstructions where 11th and 20th century values are close.  The HS-ness of the Gergis2K reconstruction does not derive from this series.

The Oroko Swamp site is on the west (windward) coast of South Island, New Zealand at 43S at low altitude (110 m).   In December 2012, during family travel to New Zealand South Island, we visited a (scenic) fjord on the west coast near Manapouri (about 45S).   These are areas of constant wind and very high precipitation. They are definitely nowhere near altitude or latitude treelines. Cook himself expressed surprise that a low-altitude chronology would be correlated to temperature, but was convinced by the relationship (see below).

In today’s post, I’ll parse the various versions far more closely than will interest most (any reasonable) readers. I got caught up trying to figure out the data and want to document the versions while it’s still fresh in my mind.

Gergis and the PAGES2K Regional Average

The calculation of the PAGES2K regional average contains a very odd procedure that thus far has escaped commentary. The centerpiece of the PAGES2K program was the calculation of regional reconstructions in deg C anomalies. Given these calculations, most readers would presume that the area-weighted average (deg C) would be the weighted average of these regional reconstructions already expressed in deg C.

But this isn’t what they did. Instead, they first smoothed by taking 30-year averages, then converted the smoothed deg C regional reconstructions to SD units (basis 1200-1965) and took an average in SD units, converting the result back to deg C by “visual scaling”.
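The effect of averaging in SD units rather than deg C can be seen in a toy sketch. The two "regional" series below are invented, already-binned values that end at the same deg C anomaly but differ tenfold in variability; the use of the population standard deviation, equal weights and the name `to_sd_units` are assumptions of the sketch, and the final "visual scaling" step is not reproduced.

```python
from statistics import mean, pstdev

def to_sd_units(series, ref_idx):
    """Standardize a series against the mean and SD of its reference-period bins."""
    ref = [series[i] for i in ref_idx]
    mu, sd = mean(ref), pstdev(ref)
    return [(v - mu) / sd for v in series]

# Invented binned anomalies (deg C); the last bin is the closing value.
quiet_region = [0.00, 0.05, -0.05, 0.05, -0.05, 0.20]
noisy_region = [0.00, 0.50, -0.50, 0.50, -0.50, 0.20]
reference_bins = range(5)  # stand-in for the reference period

quiet_sd = to_sd_units(quiet_region, reference_bins)
noisy_sd = to_sd_units(noisy_region, reference_bins)

# Equal-weight averages of the closing bin, computed both ways.
closing_degc = mean([quiet_region[-1], noisy_region[-1]])
closing_sd = mean([quiet_sd[-1], noisy_sd[-1]])

print("closing values in SD units:", round(quiet_sd[-1], 2), round(noisy_sd[-1], 2))
print("average in deg C:", round(closing_degc, 2))
print("average in SD units:", round(closing_sd, 2))
```

Both toy regions close at the same deg C anomaly, yet the low-variance region closes at roughly ten times as many SD units, so it dominates an average taken in SD units.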

This procedure had a dramatic impact on the Gergis reconstruction. Expressed in deg C and as illustrated in the SI, it has a very mild blade. But the peculiar PAGES2K procedure amplified the relatively small amplitude reconstruction into a monster blade with a 4 sigma closing value. Following the Arctic2K non-corrigendum correction, it is the largest blade in the reconstruction (and has the greatest area weight).

I’ll show this procedure in today’s post.


The Kaufman Tautology

The revised PAGES2K Arctic reconstruction used 56 proxies (down three from the original 59). Although McKay and Kaufman 2014 didn’t mention the elephant-in-the-room changes in their reconstruction (as discussed at CA here here), they reported with some satisfaction that “decadal-scale variability in the revised [PAGES2K] reconstruction is quite similar to that determined by Kaufman et al. (2009)”, presumably thinking that this replication in the larger dataset was evidence of robustness of at least this property of the data. However, while the decadal scale similarity is real enough, this is more a tautology than evidence of robustness, as 16 of the most highly weighted PAGES2K proxies come from the Kaufman et al 2009 network (the 22 Kaufman 2009 proxies being assigned over 80% of the total weight and the other 34 proxies under 20%).

Warmest since, uh, the Medieval Warm Period

The money quote in the PAGES2K abstract was that there wasn’t any worldwide Little Ice Age or Medieval Warm Period and that AD1971-2000 temperatures were the highest in nearly 1400 years, long before the Medieval Period:

There were no globally synchronous multi-decadal warm or cold intervals that define a worldwide Medieval Warm Period or Little Ice Age … during the period ad 1971–2000, the area-weighted average reconstructed temperature was higher than any other time in nearly 1,400 years.

In today’s post, I’ll show that the knock-on impact of changes to the Arctic reconstruction on the area-weighted average also makes the latter claim untrue. Incorporating the revised Arctic reconstruction, one can however say that, during the period AD1971–2000, the area-weighted average reconstructed temperature was higher than any other time since, uh, the Medieval Warm Period.

