Also see Brandon S’s recent posts here and here.

The re-appearance of Gergis’ Journal of Climate article was accompanied by an untrue account at Conversation of the withdrawal/retraction of the 2012 version. Gergis’ fantasies and misrepresentations drew fulsome praise from the academics and other commenters at Conversation. Gergis named me personally as having stated in 2012 that there were “fundamental issues” with the article, claims which she (falsely) said were “incorrect” and supposedly initiated a “concerted smear campaign aimed at discrediting [their] science”. Their subsequent difficulty in publishing the article, a process that took over four years, seems to me to be as eloquent a confirmation of my original diagnosis as one could expect.

I’ve drafted up lengthy notes on Gergis’ false statements about the incident, in particular, about false claims by Gergis and Karoly that the original authors had independently discovered the original error “two days” before it was diagnosed at Climate Audit. These claims were disproven several years ago by emails provided in response to an FOI request. Gergis characterized the FOI requests as “an attempt to intimidate scientists and derail our efforts to do our job”, but they arose only because of the implausible claims by Gergis and Karoly to priority over Climate Audit.

Although not made clear in Gergis et al 2016 (to say the least), its screened network turns out to be **identical** to the Australasian proxy network in PAGES2K (Nature 2013), while the reconstructions are nearly identical. PAGES2K was published in April 2013, and one cannot help but wonder why it took more than three years and nine rounds of revision to publish something so similar.

In addition, one of the expectations of the PAGES2K program was that it would identify and expand available proxy data covering the past two millennia. In this respect, Gergis and the AUS2K working group failed miserably. The lack of progress from the AUS2K working group is both astonishing and dismal, a failure unreported in Gergis et al 2016 which purported to “evaluate the Aus2k working group’s regional consolidation of Australasian temperature proxies”.

**Detrended and Non-detrended Screening**

The following discussion of data torture in Gergis et al 2016 draws on my previous and similar criticism of data torture in PAGES2K.

Responding to then-recent scandals in social psychology, Wagenmakers (2011 pdf, 2012 pdf) connected them to academics tuning their analyses to obtain a “desired result”, which he classified as a form of “data torture”:

we discuss an uncomfortable fact that threatens the core of psychology’s academic enterprise: almost without exception, psychologists do not commit themselves to a method of data analysis before they see the actual data. It then becomes tempting to fine tune the analysis to the data in order to obtain a desired result—a procedure that invalidates the interpretation of the common statistical tests. The extent of the fine tuning varies widely across experiments and experimenters but is almost impossible for reviewers and readers to gauge… Some researchers succumb to this temptation more easily than others, and from presented work it is often completely unclear to what degree the data were tortured to obtain the reported confession.

As I’ll show below, it is hard to contemplate a better example of data torture, as described by Wagenmakers, than Gergis et al 2016.

The controversy over Gergis et al, 2012 arose over *ex post screening* of data, a wildly popular technique among IPCC climate scientists, but one that I’ve strongly criticized over the years. Jeff Id and Lucia have also written lucidly on the topic (e.g. Lucia here and, in connection with Gergis et al, here). I had raised the issue in my first post on Gergis et al on May 31, 2012. Closely related statistical issues arise in other fields under different terminology e.g. sample selection bias, conditioning on post-treatment variable, endogenous selection bias. The potential bias of ex post screening seems absurdly trivial if one considers the example of a drug trial, but, for some reason, IPCC climate scientists continue to obtusely deny the bias. (As a caveat, objecting to the statistical bias of ex post screening does not entail that opposite results are themselves proven. I am making the narrow statistical point that biased methods should not be used.)
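The bias is easy to demonstrate with a toy simulation (my own illustration, not code from any of the papers discussed): generate pure red-noise “proxies” containing no climate signal at all, screen them on correlation with a trending “temperature” series, and average the survivors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_proxies = 200, 1000
years = np.arange(n_years)

# "Temperature": trendless until year 150, then a linear rise.
temp = np.where(years < 150, 0.0, (years - 150) * 0.02)

# Red-noise proxies with no climate signal whatsoever (AR(1), phi = 0.7).
proxies = np.zeros((n_proxies, n_years))
for t in range(1, n_years):
    proxies[:, t] = 0.7 * proxies[:, t - 1] + rng.normal(size=n_proxies)

# Ex post screening: keep proxies that happen to correlate with temperature.
r = np.array([np.corrcoef(p, temp)[0, 1] for p in proxies])
screened = proxies[r > 0.2]

# The mean of the screened proxies inherits the temperature trend,
# even though every single proxy is pure noise.
all_mean = proxies.mean(axis=0)
scr_mean = screened.mean(axis=0)
print(f"{len(screened)} of {n_proxies} proxies pass screening")
print(f"late-period mean, all proxies:      {all_mean[150:].mean():+.3f}")
print(f"late-period mean, screened proxies: {scr_mean[150:].mean():+.3f}")
```

The unscreened average stays near zero; the screened average tracks the trend it was selected on. This is the entire point of the drug-trial analogy: selecting on the outcome guarantees the outcome.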

Despite the public obtuseness of climate scientists about the practice, shortly after my original criticism of Gergis et al 2012, Karoly *privately* recognized the bias associated with ex post screening as follows in an email to Neukom (June 7, 2012; FOI K,58):

If the selection is done on the proxies without detrending ie the full proxy records over the 20th century, then records with strong trends will be selected and that will effectively force a hockey stick result. Then Stephen Mcintyre criticism is valid. I think that it is really important to use detrended proxy data for the selection, and then choose proxies that exceed a threshold for correlations over the calibration period for either interannual variability or decadal variability for detrended data… The criticism that the selection process forces a hockey stick result will be valid if the trend is not excluded in the proxy selection step.

Gergis et al 2012 had purported to avoid this bias by screening on **detrended** data, even advertising this technique as a method of “avoid[ing] inflating the correlation coefficient”:

For predictor selection, both proxy climate and instrumental data were linearly detrended over the 1921-1990 period to avoid inflating the correlation coefficient due to the presence of the global warming signal present in the observed temperature record. Only records that were significantly (p<0.05) correlated with the detrended instrumental target over the 1921-1990 period were selected for analysis. This process identified 27 temperature-sensitive predictors for the SONDJF warm season.

As is now well known, they didn’t actually perform the claimed calculation. Instead, they calculated correlation coefficients on undetrended data. This error was first reported by CA commenter Jean S on June 5, 2012 (here). Two hours later (nearly 2 a.m. Swiss time), Gergis coauthor Raphi Neukom notified Gergis and Karoly of the error (FOI 2G, page 77). Although Karoly later (falsely) claimed that his coauthors were unaware of the Climate Audit thread, emails obtained through FOI show that Gergis had sent an email to her coauthors (FOI 2G, page 17) drawing attention to the CA thread, that Karoly himself had written to Myles Allen (FOI 2K, page 11) about comments attributed to him on the thread (linking to the thread), and that multiple other contemporary emails mention Climate Audit and/or me (FOI 2G).
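The distinction between the stated and actual procedures is small in code but large in consequence. A generic sketch (my own, not the authors’ script; the 70-year window stands in for the 1921-1990 calibration period):

```python
import numpy as np
from scipy import stats

def passes_screening(proxy, target, detrend=True, alpha=0.05):
    """Correlation screening over a calibration window, with optional
    linear detrending of both series first (the method Gergis et al 2012
    described but did not perform)."""
    if detrend:
        t = np.arange(len(proxy))
        proxy = proxy - np.polyval(np.polyfit(t, proxy, 1), t)
        target = target - np.polyval(np.polyfit(t, target, 1), t)
    r, p = stats.pearsonr(proxy, target)
    return p < alpha

# A proxy sharing only a trend with the target, no year-to-year signal.
rng = np.random.default_rng(1)
t = np.arange(70)                           # stand-in 1921-1990 window
target = 0.02 * t + rng.normal(0, 0.3, 70)
proxy = 0.05 * t + rng.normal(0, 1.0, 70)

print(passes_screening(proxy, target, detrend=False))
print(passes_screening(proxy, target, detrend=True))
```

A proxy sharing nothing with the target but a trend passes the undetrended test comfortably, while the detrended correlation collapses toward noise level.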

When correlation coefficients were re-calculated according to the stated method, only a handful actually passed screening, a point reported at Climate Audit by Jean S on June 5 and written up by me as a head post on June 6. According to my calculations, only six of the 27 proxies in the G12 network passed detrended screening. On June 8 (FOI 2G, page 112), Neukom reported to Karoly and Gergis that eight proxies passed detrended screening (with the difference between his results and mine perhaps due to drawing from the prescreened network or to difference in algorithm) and sent them a figure (not presently available) comparing the reported reconstruction with the reconstruction using the stated method:

Dashed reconstruction below is using only the 8 proxies that pass detrended screening. solid is our original one.

This figure was unfortunately not included in the FOI response. It would be extremely interesting to see.

As more people online began to be aware of the error, senior author Karoly decided that they needed to notify Journal of Climate. Gergis notified the journal of a “data processing error” on June 8, and their editor, John Chiang, rescinded acceptance of the paper the following day, stating his understanding that they would redo the analysis to conform with their described methodology:

After consulting with the Chief Editor regarding your situation, my decision is to rescind the acceptance of your manuscript for publication. My understanding is that you will be redoing your analysis to conform to your original description of the predictor selection, in which case you may arrive at a different conclusion from your original manuscript. Given this, I request that you withdraw the manuscript from consideration.

Contrary to her recent story at Conversation, Gergis tried to avoid redoing the analysis; instead, she tried to persuade the editor that the error was purely semantic (“error in words”) rather than a programming error, invoking support for undetrended screening from Michael Mann, who was egging Gergis on behind the scenes:

Just to clarify, there was an error in the words describing the proxy selection method and not flaws in the entire analysis as suggested by amateur climate skeptic bloggers… People have argued that detrending proxy records when reconstructing temperature is in fact undesirable (see two papers attached provided courtesy of Professor Michael Mann).

The Journal of Climate editors were unpersuaded and pointedly asked Gergis to explain the difference between the first email in which the error was described as a programming error and the second email describing the error as semantic:

Your latest email to John characterizes the error in your manuscript as one of wording. But this differs from the characterization you made in the email you sent reporting the error. In that email (dated June 7) you described it as “an unfortunate data processing error,” suggesting that you had intended to detrend the data. That would mean that the issue was not with the wording but rather with the execution of the intended methodology. would you please explain why your two emails give different impressions of the nature of the error?

Gergis tried to deflect the question. She continued to try to persuade the Journal of Climate to acquiesce in her changing the description of the methodology, as opposed to redoing the analysis with the described methodology, offering only to describe the differences in a short note in the Supplementary Information:

The message sent on 8 June was a quick response when we realised there was an inconsistency between the proxy selection method described in the paper and actually used. The email was sent in haste as we wanted to alert you to the issue immediately given the paper was being prepared for typesetting. Now that we have had more time to extensively liaise with colleagues and review the existing research literature on the topic, there are reasons why detrending prior to proxy selection may not be appropriate. The differences between the two methods will be described in the supplementary material, as outlined in my email dated 14 June. As such, the changes in the manuscript are likely to be small, with details of the alternative proxy selection method outlined in the supplementary material.

The Journal of Climate editor resisted, but reluctantly gave Gergis a short window of time (to end July 2012) to revise the article, but required that she directly address the sensitivity of the reconstruction to proxy selection method and “demonstrate the robustness” of her conclusions:

In the revision, I strongly recommend that the issue regarding the sensitivity of the climate reconstruction to the choice of proxy selection method (detrend or no detrend) be addressed. My understanding that this is what you plan to do, and this is a good opportunity to demonstrate the robustness of your conclusions.

Chiang’s offer was very generous under the circumstances. Gergis grasped at this opportunity and promised to respond by July 27 with a revised article showing the influence of this decision on the resultant reconstructions:

Our team would be very pleased to submit a revised manuscript on or before the 27 July 2012 for reconsideration by the reviewers. As you have recommended below, we will extensively address proxy selection based on detrended and non detrended data and the influence on the resultant reconstructions.

Remarkably, this topic is touched on only in passing in Gergis et al 2016 and the only relevant diagram conceals, rather than addresses, its effect.

**Gergis’ Trick to Hide the Discrepancy**

We know that Neukom had sent Gergis a comparison of the “original” reconstruction to the reconstruction using the stated method as early as June 8, 2012. It would have been relatively easy to add such a figure to Gergis et al 2012 and include a discussion, if the comparison “demonstrate[d] the robustness of your conclusions”. This obviously didn’t happen and one has to ask why not.

Nor is the issue prominent in Gergis et al 2016. The only relevant figure is Figure S1.3 in the Supplementary Information. Gergis et al asserted that this figure suggested that “decadal-scale temperature variability is not highly sensitive to the predictor screening methods”. (In the following text, “option #4”, with “nine predictors”, is a variation of the network calculated using the stated G12 methodology.)

Figures S1.3-S1.5 compare the R28 reconstruction for just the PCR method presented in the main text, with the results based on the range of alternative proxy screening methods. They show that the variations reconstructed for each option are similar. The results always lie within the 2 SE uncertainty range of our final reconstruction (option #1), except for a few years for option #4 (Figure S1.3), which only uses nine predictors. This suggests that decadal-scale temperature variability is not highly sensitive to the predictor screening methods.

If, as Gergis et al say here, their results were “not highly sensitive” to predictor screening method (even the difference between detrended and non-detrended screening), then Gergis’ failure to comply with editor Chiang’s deadline of end July 2012 is all the more surprising.

However, there’s a trick in Gergis’ Figure S1.3. On the left is Gergis’ original Figure S1.3. It gives a strong rhetorical impression of coherence between the four illustrated reconstructions. (“AR1 detrending fieldmean” corresponds to the reconstruction using the stated method of Gergis et al 2012). On the right is a blowup showing that one of the four reconstructions (“AR1 detrending fieldmean”) has been truncated prior to AD1600 when it is well outside the supposed confidence interval.

CA readers are familiar with this sort of truncation in connection with the trick to “hide the decline” in the IPCC AR4 chapter edited by Mann. One can only presume that earlier values were also outside the confidence interval on the high side and that Gergis truncated the series at AD1600 in order to “hide” the discrepancy.

Although I haven’t seen the “dashed” reconstruction in Neukom’s email of June 8, I can only assume that it also diverged upward before AD1600 and that Gergis et al had been unable to resolve the discrepancy within editor Chiang’s deadline of July 2012.

**Torturing and Waterboarding the Data**

In the second half of 2012, Gergis and coauthors embarked on a remarkable program of data torture in order to salvage a network of approximately 27 proxies, while still supposedly using “detrended” screening. Their eventual technique for ex post screening bore no resemblance to the simplistic screening of (say) Mann and Jones, 2003.

One of their key data torture techniques was to compare proxy data correlations not simply to temperatures in the same year, but to temperatures of the preceding year and following year.

To account for proxies with seasonal definitions other than the target SONDJF season (e.g., calendar year averages), the comparisons were performed using lags of -1, 0, and +1 years for each proxy (Appendix A).

This mainly impacted tree ring proxies. In their practice, a lag of -1 year means that a tree ring series is assigned one year earlier than the chronology (a lag of +1 is assigned one year later). For a series with a lag of -1 year (e.g. Celery Top East), ring width in the summer of (say) 1989-90 is said to correlate with summer temperatures of the previous year. There is precedent for correlation to previous year temperatures in specialist studies. For example, Brookhouse et al (2008) (abstract here) says that the Baw Baw tree ring data (a Gergis proxy) correlates positively with spring temperatures from the preceding year. In this case, however, Gergis assigned zero lag to this series, as well as a negative orientation.
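Mechanically, the lag comparison amounts to “pick the best of three”. A generic sketch of the procedure as described (my own code, not the authors’):

```python
import numpy as np

def best_lag_correlation(proxy, target, lags=(-1, 0, 1)):
    """Correlate a proxy with the target at several lags and keep the best.
    A lag of -1 pairs a proxy value with the previous year's target;
    a lag of +1 pairs it with the following year's target."""
    results = {}
    for lag in lags:
        if lag < 0:
            r = np.corrcoef(proxy[-lag:], target[:lag])[0, 1]
        elif lag > 0:
            r = np.corrcoef(proxy[:-lag], target[lag:])[0, 1]
        else:
            r = np.corrcoef(proxy, target)[0, 1]
        results[lag] = r
    best = max(results, key=lambda k: abs(results[k]))
    return best, results[best]

# Demo: a proxy that actually tracks the *previous* year's target.
rng = np.random.default_rng(2)
target = rng.normal(size=100)
proxy = np.concatenate([[0.0], target[:-1]]) + rng.normal(0, 0.3, 100)
lag, r = best_lag_correlation(proxy, target)
print(f"best lag: {lag}, r = {r:.2f}")
```

With three lags available per proxy, even a noise series gets three chances to clear the significance bar, before the gridcell multiplicity discussed below is even considered.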

The lag of +1 years assigned to 5 sites is very hard to interpret in physical terms. Such a lag requires that (for example) Mangawhera ring widths assigned to the summer of 1989-1990 correlate to temperatures of the following summer (1990-1991), the ring widths in effect acting as a predictor of next year’s temperature. Gergis’ supposed justification in the text was nothing more than arm-waving, but the referees do not seem to have cared.

Of the 19 tree ring series in the 51-series G16 network, an (unphysical) +1 lag was assigned to five series, a -1 lag to two series and a 0 lag to seven series, with five series being screened out. Of the seven series with 0 lag, two had inverse orientation in PAGES2K. In detail, there is little consistency for trees and sites of the same species. For example, New Zealand LIBI composite-1 had a +1 lag, while New Zealand LIBI composite-2 had a 0 lag. Another LIBI series (Urewara) is assigned an inverse orientation in the (identical) PAGES2K network and thus presumably in the CPS version of G16. Two LIBI series (Takapari and Flanagan’s Hut) are screened out in G16, though Takapari was included in G12. Because the assignment of lags is nothing more than an ad hoc after-the-fact attempt to rescue the network, it is impossible to assign meaning to the results.

In addition, Gergis also borrowed from and expanded a data torture technique pioneered in Mann et al 2008. Mann et al 2008 had been dissatisfied with the number of proxies passing a screening test based on correlation to the local gridcell, a commonly used criterion (e.g. Mann and Jones 2003). So Mann instead compared results to the two “nearest” gridcells, picking the higher of the two correlations but without modifying the significance test to reflect the “pick two” procedure. (See here for a contemporary discussion.) Instead of comparing only to the two nearest gridcells, Gergis expanded the comparison to all gridcells “within 500 km of the proxy’s location”, a technique which permitted comparisons to 2-6 gridcells depending both on the latitude and the closeness of the proxy to the edge of its gridcell:

As detailed in appendix A, only records that were significantly (p < 0.05) correlated with temperature variations in at least one grid cell within 500 km of the proxy’s location over the 1931-90 period were selected for further analysis.

As described in the article, both factors were crossed in the G16 comparisons. Multiplying three lags by 2-6 gridcells, Gergis appears to have made 6-18 *detrended* comparisons, retaining those proxies for which there was a “statistically significant” correlation. It doesn’t appear that any allowance was made in the benchmark for the multiplicity of tests. In any event, using this “detrended” comparison, they managed to arrive at a network of 28 proxies, one more than the network of Gergis et al 2012. Most of the longer proxies are the same in both networks, with a shuffling of about seven shorter proxies. No ice core data is included in the revised network and only one short speleothem. It consists almost entirely of tree ring and coral data.
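The effect of leaving the benchmark unadjusted is easy to quantify. Under the textbook assumption of independent tests (real gridcell and lag comparisons are correlated, so treat these figures as an upper bound), the chance that a pure-noise proxy passes at least one of k tests at p < 0.05 is 1 − (1 − 0.05)^k:

```python
# Family-wise false-positive rate for k independent tests at p < 0.05.
alpha = 0.05
for k in (1, 6, 18):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} comparisons: chance of a spurious pass = {fwer:.0%}")
# 1 comparison -> 5%; 6 comparisons -> ~26%; 18 comparisons -> ~60%
```

Even at the bottom of the 6-18 range, roughly a quarter of pure-noise series would be expected to “pass”; at the top, more than half.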

Obviously, Gergis et al’s original data analysis plan did not include a baroque screening procedure. It is evident that they concocted this bizarre screening procedure in order to populate the screened network with a similar number of proxies to Gergis et al 2012 (28 versus 27) and to obtain a reconstruction that looked like the original reconstruction, rather than the divergent version that they did not report. Who knows how many permutations, combinations and iterations were tested before they eventually settled on the final screening technique.

It is impossible to contemplate a clearer example of “data torture”; even Mann et al 2008 pales in comparison.

Nor does this fully exhaust the elements of data torture in the study, as torture techniques previously used in Gergis et al 2012 were carried forward to Gergis et al 2016. Using original and (still) mostly unarchived measurement data, Gergis et al 2012 had re-calculated all tree ring chronologies, except two, using an opaque method developed by the University of East Anglia. The two exceptions were the two long tree ring chronologies reaching back to the medieval period:

All tree ring chronologies were developed based on raw measurements using the signal-free detrending method (Melvin et al., 2007; Melvin and Briffa, 2008) …The only exceptions to this signal-free tree ring detrending method was the New Zealand Silver Pine tree ring composite (Oroko Swamp and Ahaura), which contains logging disturbance after 1957 (D’Arrigo et al., 1998; Cook et al., 2002a; Cook et al., 2006) and the Mount Read Huon Pine chronology from Tasmania which is a complex assemblage of material derived from living trees and sub-fossil material. For consistency with published results, we use the final temperature reconstructions provided by the original authors that includes disturbance-corrected data for the Silver Pine record and Regional Curve Standardisation for the complex age structure of the wood used to develop the Mount Read temperature reconstruction (E. Cook, personal communication, Cook et al., 2006).

This raises the obvious question of why “consistency with published results” is an overriding concern for Mt Read and Oroko, but not for the other series, which also have published results. For example, Allen et al (2001), the reference for Celery Top East, shows the chronology at left for Blue Tier, while Gergis et al 2016 used the chronology at right for a combination of Blue Tier and a nearby site. Under the East Anglia techniques, the chronology showed a sharp increase in the 20th century, and “consistency” with the results shown in Allen et al (2001) was not a concern of the authors. One presumes that Gergis et al had done similar calculations for Mount Read and Oroko, but had decided not to use them. One can hardly avoid suspecting that the discarded calculations did not support the desired story.

Nor is this the only ad hoc selection involving these two important proxies. Gergis et al said that their proxy inventory was a 62-series subset taken from the inventory of Neukom and Gergis, 2011. (I have been unable to exactly reconcile this number and no list of 62 series is given in Gergis et al 2016.) They then excluded records that “were still in development at the time of the analysis” (though elsewhere they say that the dataset was frozen as of July 2011 due to the “complexity of the extensive multivariate analysis”) or “with an issue identified in the literature or through personal communication”:

Of the resulting 62 records we also exclude records that were still in development at the time of the analysis … and records with an issue identified in the literature or through personal communication

However, this criterion was applied inconsistently. Gergis et al acknowledge that the Oroko site was impacted by “logging disturbance after 1957” – a clear example of an “issue identified in the literature” but used the data nonetheless. In some popular Oroko versions (see CA discussion here), proxy data after 1957 was even replaced by instrumental data. Gergis et al 2016 added a discussion of this problem, arm-waving that the splicing of instrumental data into the proxy record didn’t matter:

Note that the instrumental data used to replace the disturbance-affected period from 1957 in the silver pine [Oroko] tree-ring record may have influenced proxy screening and calibration procedures for this record. However, given that our reconstructions show skill in the early verification interval, which is outside the disturbed period, and our uncertainty estimates include proxy resampling (detailed below), we argue that this irregularity in the silver pine record does not bias our conclusions.

There’s a sort of blind man’s buff in Gergis’ analysis here, since it looks to me like G16 may have used an Oroko version which did not splice instrumental data. However, because no measurement data has ever been archived for Oroko and a key version only became available through inclusion in a Climategate email, it’s hard to sort out such details.

**PAGES2K**

The precise timing of Gergis’ data torture can be constrained by the publication of the PAGES2K compilation of regional chronologies used in IPCC AR5. The IPCC First Order Draft had included a prominent graphic with seven regional reconstructions, one of which was the Australian reconstruction of Gergis et al, 2012 (cited as under review). The AR5 Second Order Draft, published in July 2012 after the withdrawal of Gergis et al 2012, included a more or less identical reconstruction, this time cited to PAGES2K, under review.

The PAGES2K compilation had been submitted to Science in July 2012, barely meeting the deadline. Remarkably, it was rejected. Mann, one of the reviewers, argued that it was impossible to review so many novel regional reconstructions and that they should be individually reviewed in specialist journals before attempting a compilation. This left IPCC in rather a jam. However, Nature stepped in and agreed to publish the rejected article. Keith Briffa, one of the Nature reviewers, “solved” the problem of trying to review so many novel reconstructions by suggesting that the article be published as a “Progress Article”, a type of article which had negligible peer review requirements. Everyone readily agreed to this diplomatic solution and thus the sausage was made (also see discussion by Hilary Ostrov here).

The Gergis contribution to PAGES2K screened the AUS2K proxy network down to 28 proxies – exactly the same selection as Gergis et al 2016, published three years later. The PAGES2K Paico reconstruction is identical to the G16 Paico reconstruction up to a slight rescaling: the correlation between the two versions is exactly 1. Their “main” reconstruction used principal components regression, a technique harking back to Mann et al 1998, which is commonly defended on the grounds that later articles use different techniques. The G16 version is nearly identical to the PAGES2K version, as shown below.
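A correlation of exactly 1 is the signature of a linear rescaling: Pearson correlation is invariant under any positive affine transformation of either series. A one-line check (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(3)
recon = rng.normal(size=500)       # stand-in for any reconstruction series
rescaled = 2.5 * recon + 0.7       # a linear rescaling of the same series

# Pearson correlation between a series and any positive affine
# transform of itself is exactly 1 (up to floating-point rounding).
r = np.corrcoef(recon, rescaled)[0, 1]
print(r)
```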

The PAGES2K article was mentioned on a variety of occasions in Gergis et al 2016, but I’m not sure how a reader of G16 could become aware of the identity of the networks and reconstructions.

Given that the PAGES2K network was accepted with no more than cursory peer review, it’s interesting that it took nine rounds of revision for the Journal of Climate to accept Gergis et al 2016 with its identical network and virtually identical reconstruction.

**The Dismal Lack of Progress by the AUS2K Working Group**

Despite the long-standing desire for more “long” SH proxies, the AUS2K working group provided Gergis with only three records (Law Dome d18O, Mt Read Tasmania tree rings, Oroko NZ tree rings) in the target geographical area which began prior to AD1100, with the Law Dome series being screened out. None of these are new records.

Closely related versions of all three series were used in Mann and Jones (2003), which also selected series by screening against gridcell temperatures, but with different results. Mann and Jones screened according to “decadal correlations”, resulting in selection of Tasmania (r=0.79) and Law Dome (r=0.76) and exclusion of Oroko (r=-0.25) – a different screening result than Gergis et al.
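For readers unfamiliar with the distinction, screening on “decadal correlations” smooths both series to decadal resolution before correlating. The moving-average smoother below is my own simplification, not necessarily the Mann and Jones filter, but it shows how the same proxy can pass one screen and fail another:

```python
import numpy as np

def decadal_correlation(proxy, target, window=10):
    """Correlate decadally smoothed versions of two annual series
    (simple moving average as a stand-in low-pass filter)."""
    kernel = np.ones(window) / window
    ps = np.convolve(proxy, kernel, mode="valid")
    ts = np.convolve(target, kernel, mode="valid")
    return np.corrcoef(ps, ts)[0, 1]

# A proxy sharing only the slow, multidecadal component of the target.
rng = np.random.default_rng(4)
t = np.arange(140)
slow = np.sin(2 * np.pi * t / 70)          # multidecadal component
target = slow + rng.normal(0, 1.0, 140)
proxy = slow + rng.normal(0, 1.0, 140)

print(f"annual correlation:  {np.corrcoef(proxy, target)[0, 1]:.2f}")
print(f"decadal correlation: {decadal_correlation(proxy, target):.2f}")
```

A proxy sharing only the slow component of the target scores far better decadally than annually, which is one way the same candidate series can survive one group’s screening and be excluded by another’s.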

All three series have been discussed at Climate Audit from time to time over the years (see the tasmania, oroko and lawdome tags). Two of the three series (Mt Read, Oroko) were illustrated in AR4 (which didn’t show Oroko values after 1957), but AR4 lead authors snickered at my request that they also show Law Dome (see here). The authors realized that the Law Dome series had very elevated values in the late first millennium (see figure below from Jones and Mann, 2004), and there was no way that they were going to show a series which “diluted the message”. Compare the two series used in Mann and Jones 2003 in the first figure below with the two series shown in AR4 in the second figure below.

*Figure ^. Excerpt from Mann and Jones 2003, showing Law Dome and Mount Read series.*

*Figure ^. Excerpt from IPCC AR4, showing Oroko and Mount Read series.*

Thus, despite any aspirations for AR5, Gergis et al 2016 contained no long series which had not been used in Mann and Jones 2003.

It is also obvious that long results from combining Law Dome and Mt Read will have a considerably different appearance than long results from combining Mt Read and Oroko. Although Gergis et al claimed that screening had negligible impact on results, Law Dome was screened out in every case.

Nor did Gergis et al actually use the Tasmania “Regional Curve Standardisation” series, as claimed. Cook archived two versions of his Tasmania chronology in 1998, one of which (“original”) was the RCS chronology, while the other (“arfilt”) was a filtered version of the RCS chronology. Gergis used the “arfilt” rather than “original” version – perhaps inheriting this version from Mann et al 2008, which also used the arfilt version. Cook’s original article (Cook et al 2000) also contained an interesting figure showing mean ring widths in Tasmania prior to adjustment (for juvenile growth). This data is plotted below (showing Cook’s figure as an inset). It shows a noticeable increase in 20th century ring widths, which, however, are merely returning to levels achieved earlier in the millennium and surpassed in the first millennium. High late first millennium values are also present in the Law Dome data.

Many of the Gergis series are very short, with coral series nearly all starting in the 18th and even late 19th centuries. To the extent that the Gergis reconstruction shows a 20th century hockey stick, it is not because this is a characteristic feature of the long data, but because of the splicing of short, strongly trending coral data with the longer tree ring data. The visual result will depend on how the coral data is scaled relative to the tree ring data.
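The splicing effect can be illustrated with a toy composite (simple averaging of whatever is available each year; my own sketch, whereas G16 actually used PCR and CPS): a trendless long series combined with short, strongly trending, late-starting series yields a composite whose modern uptick comes entirely from the short data.

```python
import numpy as np

rng = np.random.default_rng(5)
years = np.arange(1000, 2001)

# One long, basically trendless "tree ring" series...
long_series = rng.normal(0, 1.0, len(years))

# ...and several short "coral" series starting in 1850 with strong trends,
# each scaled differently relative to the long series.
def short_trending(start, scale):
    s = np.full(len(years), np.nan)
    idx = years >= start
    s[idx] = scale * np.linspace(0, 1, idx.sum()) + rng.normal(0, 0.3, idx.sum())
    return s

corals = [short_trending(1850, scale) for scale in (1.0, 2.0, 3.0)]

# Composite: mean of whatever series are available in each year.
stack = np.vstack([long_series] + corals)
composite = np.nanmean(stack, axis=0)

print(f"pre-1850 mean:  {np.nanmean(composite[years < 1850]):+.2f}")
print(f"post-1950 mean: {np.nanmean(composite[years >= 1950]):+.2f}")
```

The size of the modern uptick depends directly on how the short series are scaled relative to the long one, which is exactly the scaling sensitivity noted above.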

While Gergis and coauthors made no useful contribution to understanding past climate change in the Australasian region, on a more positive note, large and interesting speleothem datasets have recently been published (though not considered by Gergis et al), including very long d18O series from Borneo and Indonesia (Liang Luar), both located in the extended G16 region. I find the speleothem data particularly interesting since some series provide data both on a Milankovitch scale and through the 20th century. For example, the Borneo series (developed by Jud Partin, Kim Cobb and associates) has very pronounced Milankovitch variability and comes right to the present. In ice age extremes, d18O values are less depleted (more depleted in warm periods). Modern values do not appear exceptional. Results at Liang Luar are similar.

**Conclusions**

Gergis has received much credulous praise from academics at Conversation, but none of them appear to have taken the trouble to actually evaluate the article before praising it. Rather than the 2016 version being a confirmation of or improvement on the 2012 article, it constitutes as clear an example of data torture as one could ever wish. We know Gergis’ ex ante data analysis plan, because it was clearly stated in Gergis et al 2012. Unfortunately, they made a mistake in their computer script and were unable to replicate their results using the screening methodology described in Gergis et al 2012.

In order to get a reasonably populated network and a reconstruction resembling the Gergis et al 2012 reconstruction, Gergis and coauthors concocted a baroque and ad hoc screening system, requiring a complicated and implausible combination of lags and adjacent gridcells. A more convincing example of “fine tun[ing] the analysis to the data in order to obtain a desired result” (data torture) is impossible to imagine. None of the supposed statistical tests have any significance under the weight of such extreme data torture.
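The statistical point can be illustrated with a toy simulation: if a pure-noise “proxy” is allowed to pass screening against any of several lags and several neighbouring gridcells, the pass rate balloons far beyond the nominal significance level. All parameters below are illustrative choices for the sketch, not Gergis et al’s actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_years = 80          # length of the instrumental overlap (illustrative)
n_proxies = 60        # candidate "proxies", pure noise by construction
target = rng.standard_normal(n_years)   # stand-in instrumental temperature

def best_screened_r(proxy, target, lags=(-1, 0, 1), n_cells=9):
    """Best |r| over all lag / neighbouring-gridcell combinations."""
    best = 0.0
    for _ in range(n_cells):              # stand-in for 9 adjacent gridcells
        cell = target + rng.standard_normal(len(target))  # noisy neighbour temps
        for lag in lags:
            r = abs(np.corrcoef(np.roll(proxy, lag), cell)[0, 1])
            best = max(best, r)
    return best

# Critical |r| for p<0.05 (two-sided) with one test and ~80 points is ~0.22
naive = sum(abs(np.corrcoef(rng.standard_normal(n_years), target)[0, 1]) > 0.22
            for _ in range(n_proxies))
screened = sum(best_screened_r(rng.standard_normal(n_years), target) > 0.22
               for _ in range(n_proxies))
# The screened pass-rate ends up far above the nominal 5%, even though
# every proxy is noise: that is the multiple-comparisons effect of allowing
# lags and adjacent gridcells into the screening.
```

The supposed significance tests in the paper assume a single pre-specified comparison; once many lag/gridcell combinations are tried, the nominal thresholds no longer apply.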

Because IPCC AR5 had used results of Gergis et al 2012 in a prominent diagram that it was committed to using, and continued to use the results even after Journal of Climate rescinded acceptance of Gergis et al 2012 (see here), Gergis et al had considerable motivation, to say the least, to “obtain” a result that looked as much like Gergis et al 2012 as possible. The degree to which they subsequently tortured the data is somewhat breathtaking.

One wonders whether the editors and reviewers of Journal of Climate fully understood the extreme data torture that they were asked to approve. Clearly, there seems to have been some resistance from editors and reviewers – otherwise there would not have been nine rounds of revision and 21 reviews. Since the various rounds of review left the network unchanged even one iota from the network used in the PAGES2K reconstruction (April 2013), one can only assume that Gergis et al eventually wore out a reluctant Journal of Climate, who, after four years of submission and re-submission, finally acquiesced.

As noted above, Wagenmakers defined data torture as succumbing to the temptation to “fine tune the analysis to the data in order to obtain a desired result” and diagnosed the phenomenon as being particularly likely when the authors had not “commit[ted] themselves to a method of data analysis *before* they see the actual data”. In this case, Gergis et al had, ironically, committed themselves to a method of data analysis not just privately but in the text of an accepted article – yet they obviously didn’t like the results.

One can understand why Gergis felt relief at finally getting approval for such a tortured manuscript, but, at the same time, the problems were entirely of her own making. Gergis took particular umbrage at my original claim that there were “fundamental issues” with Gergis et al 2012, a claim that she called “incorrect”. But there is nothing “incorrect” about the actual criticism:

One of the underlying mysteries of Gergis-style analysis is why one seemingly equivalent proxy can be “significant” while another isn’t. Unfortunately, these fundamental issues are never addressed in the “peer reviewed literature”.

This comment remains as valid today as it was in 2012.

In her Conversation article, Gergis claimed that her “team” discovered the errors in Gergis et al 2012 independently of and “two days” before the errors were reported at Climate Audit. These claims are untrue. They did not discover the errors “independently” of Climate Audit or before Climate Audit. I will review their appropriation of credit in a separate post.


*Introduction and Summary*

In a recently published paper (REA16),[1] Mark Richardson et al. claim that recent observation-based energy budget estimates of the Earth’s transient climate response (TCR) are biased substantially low, with the true value some 24% higher. This claim is based purely on simulations by CMIP5 climate models. As I shall show, observational evidence points to any bias actually being small. Moreover, the related claims made by Kyle Armour, in an accompanying “news & views” opinion piece,[2] fall apart upon examination.

The main claim in REA16 is that, in models, surface air-temperature warming over 1861-2009 is 24% greater than would be recorded by HadCRUT4 because it preferentially samples slower-warming regions and water warms less than air. About 15 percentage points of this excess result from masking to HadCRUT4v4 geographical coverage. The remaining 9 percentage points are due to HadCRUT4 blending air and sea surface temperature (SST) data, and arise partly from water warming less than air over the open ocean and partly from changes in sea ice redistributing air and water measurements.

REA16 infer an observation-based best estimate for TCR of 1.66°C, 24% higher than the value of 1.34°C if based on HadCRUT4v4. Since the scaling factor used is based purely on simulations by CMIP5 models, rather than on observations, the estimate is only valid if those simulations realistically reproduce the spatiotemporal pattern of actual warming for both SST and near-surface air temperature (*tas*), and changes in sea-ice cover. It is clear that they fail to do so. For instance, the models simulate fast warming, and retreating sea ice, in the sparsely observed southern high latitudes. The available evidence indicates that, on the contrary, warming in this region has been slower than average, pointing to the bias due to sparse observations over it being in the opposite direction to that estimated from model simulations. Nor is there good observational evidence that air over the open ocean warms faster than SST. Therefore, the REA16 model-based bias figure cannot be regarded as realistic for observation-based TCR estimates.

It should also be noted that the 1.66°C TCR estimate ignores the fact that the method used overestimates canonical CMIP5 model TCRs (those per AR5 WG1 Table 9.5) by ~5% (Supplementary Information, page 4). Including this scaling factor along with the temperature measurement scaling factor reduces the estimate to 1.57°C (Supplementary Table 11).

*Relevant details of and peculiarities in REA16*

REA16 focus on energy-budget TCR estimates using the ratio of the changes in global temperature and in forcing, measuring both changes as the difference between the mean over an early baseline period and the mean over a recent final period. They refer to this variously as the difference method and as the Otto et al.[3] method; it was introduced over a decade earlier by Gregory et al.[4] and copied by both Otto et al. (2013) and Lewis and Curry (2015).[5] The primary baseline and final periods used by REA16 are 1861–80 and 2000–09, almost matching those used in the best-constrained Otto et al. estimate. Lewis and Curry, taking longer 1859–82 base and 1995-2011 final periods, obtained the same 1.33°C best estimate for TCR as Otto et al., using the same HadCRUT4v2 global temperature dataset.

REA16 estimate the TCR of each CMIP5 model by comparing its global warming with forcing estimated in the same way as in Otto et al., using model-specific data where available and multimodel mean forcing otherwise. The method is somewhat circular, since forcing for each model is calculated each year as the product of its estimated climate feedback parameter and its simulated global warming, adjusted by the change in its radiative imbalance (heat uptake). Each model’s climate feedback parameter is derived by regressing the model’s radiative imbalance response against its global temperature response over the 150 years following an abrupt quadrupling of CO_{2} concentration.
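As a rough numerical sketch of the two-step procedure described above – a Gregory-style regression to obtain the feedback parameter, followed by a difference-method TCR estimate – the series and parameter values below are entirely synthetic illustrations, not actual CMIP5 diagnostics:

```python
import numpy as np

# --- Step 1: feedback parameter lambda from an abrupt-4xCO2 run ---
# Regress top-of-atmosphere imbalance N against warming T over 150 years:
# N = F_4x - lambda*T, so the slope of N on T is -lambda.
years = np.arange(150)
T4x = 6.0 * (1 - np.exp(-years / 20.0))   # synthetic warming response (K)
N4x = 7.4 - 1.2 * T4x                     # synthetic imbalance: F_4x=7.4, lambda=1.2
lam = -np.polyfit(T4x, N4x, 1)[0]         # recovers lambda = 1.2 W m-2 K-1

# --- Step 2: diagnose historical forcing from lambda, warming and heat uptake ---
dT = 0.75            # 1861-80 to 2000-09 warming (K), illustrative
dN = 0.45            # change in radiative imbalance (W m-2), illustrative
dF = lam * dT + dN   # diagnosed forcing change

# --- Step 3: difference-method TCR estimate ---
F_2x = 3.71          # forcing for doubled CO2 (W m-2)
TCR = F_2x * dT / dF
```

The circularity noted in the text is visible here: the diagnosed forcing `dF` is itself built from the model’s own warming and feedback parameter.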

In model historical simulations the weighted average period from when each forcing increment arose to 2000–09 is only ~30 years, not 150 years. Accordingly, the forcing estimation method relies upon a model exhibiting a fairly linear climate response, and hence having a climate feedback parameter (and an effective climate sensitivity) that does not vary with time (in addition to having a temperature response that is proportional to forcing). In this context, the statement in REA16 that they do not calculate equilibrium climate sensitivity (ECS) “to avoid the assumption of linear climate response” is peculiar: they have already made this assumption in deriving model forcings.

Although REA16 is based on simulations by all CMIP5 models for which relevant data are available, the weighting given to each model in determining the median estimates that are given varies over a range of ten to one. That is because, unlike for most IPCC model-based estimates, each available model-simulation – rather than each model – is given an equal weighting. Whilst only one simulation is included for most models, almost 60% of the simulations that determine the median estimates come from the 25% of models with four or more simulation runs.

REA16 do not appear to state the estimated median TCR applicable to the 84 historical-RCP8.5 CMIP5 simulations used. Dividing the primary periods *tas*-only difference method figure of 1.98°C per Supplementary Table 6 by 1.05, to allow for the stated overestimation by the difference method, implies a median estimate for true TCR of 1.89°C. Back-calculating TCR from the difference method bias values in Supplementary Table 5 instead gives an estimate of 1.90°C. These figures are rather higher than the median TCR of 1.80°C that I calculate to apply to the subset of 68 simulations by models for which the canonical TCR is known.
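The back-calculation is simple arithmetic:

```python
tcr_blend = 1.98      # tas-only difference-method median (Supplementary Table 6)
overestimate = 1.05   # stated ~5% overestimation by the difference method
true_tcr = tcr_blend / overestimate
print(round(true_tcr, 2))   # 1.89, the figure quoted in the text
```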

There seem to be inconsistencies in REA16 between different estimates of the bias resulting from use of the difference method and blended air and SST temperature data. The top right-hand panel of Supplementary Figure 4 shows that the median TCR estimate when doing so, with 2000–09 as the final decade, is ~2.00°C. This is a 5% overestimate of the apparent actual value of ~1.90°C rather than (as stated in Supplementary Table 5) an underestimate of 8%. Moreover, contrary to Supplementary Figure 4, Supplementary Table 6 gives a median TCR estimate in this case of 1.81°C, implying an underestimate of 4%, not 8%. Something appears to be wrong here.

REA16 also claim that energy budget TCR estimates are sensitive to the analysis period(s), particularly when using a trend method. However, Supplementary Figure 4 shows that the chosen difference method provides stable estimation of model TCRs provided that the final decade has, like the 1861–80 base period, low volcanic forcing – that is, for final decades ending from the late 2000s on. As discussed in some detail in Lewis and Curry (2015), sensitivity estimation using an energy budget difference method is sensitive to variations between the base and final periods in volcanic forcing, due to its very low apparent efficacy, so periods with matching volcanism should be used. The sensitivity, shown in Supplementary Table 6, of TCR estimation using the difference method to the choice of base period when using a 2000–09 final period is explicable primarily by poor matching of volcanic forcing when base periods other than 1861–80 are used. Good matching of the mean state of the Atlantic Multidecadal Oscillation (AMO) between the base and final period is also necessary for reliable observational estimation of TCR.

*The effect of blending air and SST data*

I question whether using SST as a proxy for *tas* over the open ocean has caused any downward bias in estimation of TCR in the real climate system, or even (to any significant extent) in CMIP5 models.

The paper REA16 primarily cite to support faster warming of *tas* over open water than SST,[6] which is also model-based, attributes this effect to the thermal inertia of the ocean causing a lag in ocean warming. This argument appears to be unsound. Another paper,[7] which they also cite, instead derives an equilibrium air–sea surface warming differential from a theoretical model based on an assumed relative humidity height profile, with thermal inertia playing no role. This is a better argument. However, it depends on the assumed relative humidity profile being realistic, which it may not be. The first paper cited notes (caveating that observational uncertainties are considerable) that models do not match observed changes in subtropical relative humidity or in global precipitation.

For CMIP5 models, REA16 state that the *tas* vs SST warming differential is about 9% on the RCP8.5 scenario and is broadly consistent between models historically and over the 21st century. However, the differential I calculate is far smaller than that. I compared the increases in *tas* and ‘*ts*‘ between the means for the first two decades of the RCP8.5 simulation (2006–25) and the last two decades of the 21st century, using ensemble mean data for each of the 36 CMIP5 models for which data were available. CMIP5 variable ‘*ts*‘ is surface temperature, stated to be SST for the open ocean and skin temperature elsewhere. The excess of the global mean increase in *tas* over that in *ts*, averaged across all models, was only 2%. Whilst *ts* is not quite the same as *tas* over land and sea ice, there is little indication from a latitudinal analysis that the comparison is biased by any differences in their behaviour over land and sea ice. Consistent with this, Figures 2 and S2 of Cowtan et al. 2015[8] (which use respectively *tas* and *ts* over land and sea ice) show very similar changes over time (save in the case of one model). Accordingly, I conclude that the stated 9% differential greatly overstates the mean difference in model warming between *tas* and blended air–sea temperatures. To a large extent that is because the 9% figure also includes an effect, when anomaly temperatures are used, from changes in sea ice extent. However, Figure 2 of Cowtan et al. 2015 shows, based on essentially the same set of CMIP5 RCP8.5 simulations as REA16 and excluding sea-ice related effects, a mean differential of ~5% (range 1%–7%), over double the 2% I estimate.
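The differential computation itself is straightforward; the sketch below uses invented per-model warming figures purely to show the calculation, not the actual CMIP5 diagnostics:

```python
import numpy as np

# Illustrative global-mean warming (K), 2006-25 to 2081-2100, for four
# hypothetical models; the real analysis used 36 CMIP5 ensemble means.
d_tas = np.array([3.10, 2.85, 3.40, 2.60])   # near-surface air temperature
d_ts  = np.array([3.04, 2.80, 3.33, 2.55])   # surface (skin/SST) temperature

# Percentage excess of tas warming over ts warming, averaged across models
excess = 100 * np.mean(d_tas / d_ts - 1)     # ~2% with these invented values
```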

So, models exhibit a range of behaviours. What do observations show? Unfortunately, there is limited evidence as to whether and to what extent differential air–sea surface warming occurs in the real climate system. However, in the deep tropics, where the theoretical effects on the surface energy budget of temperature-driven changes in evaporation and water vapour are particularly strong, there is a near quarter-century record of both SST and *tas* from the Tropical Atmosphere Ocean array of fixed buoys in the Pacific Ocean. With averages over the full array extent based on a minimum of 40% valid data points, *tas* and SST data are available for 1993-2015. The trend increase in SST over that period is 0.078°C/decade, considerably higher than the 0.047°C/decade for *tas*, not lower. If the required minimum is reduced to 20%, trends can be calculated over 1992-2015, for which they are 0.029°C/decade for SST, and 1.5% higher at 0.030°C/decade for *tas*. This evidence, although limited both spatially and temporally, does not suggest that *tas* increases faster than SST.
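The trends quoted above are ordinary least-squares slopes expressed per decade. A minimal sketch, with synthetic series standing in for the TAO-array averages (real use would first mask years with insufficient valid buoy data):

```python
import numpy as np

def trend_per_decade(years, values):
    """OLS trend of an annual anomaly series, in degC per decade."""
    return 10 * np.polyfit(years, values, 1)[0]

years = np.arange(1993, 2016)
# Synthetic anomalies: deterministic trends of 0.078 and 0.047 degC/decade
# plus an identical wiggle term, purely for illustration.
sst = 0.0078 * (years - 1993) + 0.02 * np.sin(years)
tas = 0.0047 * (years - 1993) + 0.02 * np.sin(years)

sst_trend = trend_per_decade(years, sst)
tas_trend = trend_per_decade(years, tas)   # lower than the SST trend here
```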

*The effect of sea ice changes*

The separation in REA16 of the effect of masking from that of sea ice changes on blending air and water temperature changes is somewhat artificial, since HadCRUT4 has limited coverage in areas where sea ice occurs. However, I will follow the REA16 approach. Their model-based estimate of the effect of sea ice changes appears to be ~4%, the difference between the 9 percentage points bias due to blending and the 5 percentage points (per Cowtan et al. 2015) due purely to the use of SST rather than *tas* for the open ocean. Changes in sea ice make a difference only when temperatures are measured as anomalies relative to a reference period; however, I can find no mention in REA16 of what reference period is used.

CMIP5 models have generally simulated decreases in sea ice extent since 1900, accelerating over recent decades, around both poles (AR5 WG1 Figure 9.42). In reality, Antarctic sea ice extent has increased, not decreased, over the satellite era. Its behaviour prior to 1979 is unknown. On the other hand, since ~2005 Arctic sea ice has declined more rapidly than per mean CMIP5 projections. Differences in air temperatures above affected sea ice in the two regions, and the use of widely varying model weightings in REA16, complicate the picture. It is difficult to tell to what extent REA16’s implicit 4 percentage point estimate is biased. Nevertheless, based on sea ice data from 1979 on and unrealistically high long term warming by CMIP5 models in high southern latitudes (as discussed below), it seems to me likely to be an overestimate for changes between the baseline 1861–80 and final 2000–09 periods used in REA16.

*The effect of masking to HadCRUT4 coverage*

I turn now to the claims about incomplete, and changing, data coverage biasing down HadCRUT4 warming by 15 percentage points. The reduction in global warming from masking to HadCRUT4 coverage is based on fast CMIP5 model historical period warming in southern high latitudes as well as northern; see REA16 Supplementary Fig. 6, LH panel and Supplementary Table 8. But this is the opposite of what has happened; high southern latitudes have warmed more slowly than average, over the period for which data are available.

Based on HadCRUT4 data with a minimum of 20% grid cells with data, warming over 60S–90S averaged 0.05°C/decade from 1934 to 2015. The trend is similar using a 10% or 25% minimum; higher minima result in no pre-WW2 data. This trend is much lower than the 0.08°C/decade global mean trend over the period. For the larger 50S–90S region a trend over 1880–2015 can be calculated, at 0.03°C/decade, if a minimum of 15% of valid data points is accepted. Again, this is much lower than the global mean trend of 0.065°C/decade over the same period. An infilled spatial plot of warming since 1960 per BEST (http://berkeleyearth.org/wp-content/uploads/2015/03/land-and-ocean-trend-comparison-map-large.png) likewise shows slower than average warming in southern high latitudes. And UAH (v6.0beta5) and RSS (v03_3) lower-troposphere datasets show very low warming south of 60S over 1979–2015: respectively 0.01 and –0.02°C/decade.
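Zonal averages of the kind quoted above weight gridcells by the cosine of latitude and exclude cells falling below the minimum-coverage threshold. A minimal sketch with invented values (not actual HadCRUT4 data):

```python
import numpy as np

def zonal_mean(anoms, lats, valid_frac, min_frac=0.20):
    """Cos-latitude-weighted mean over gridcells meeting a coverage minimum."""
    keep = valid_frac >= min_frac
    w = np.cos(np.deg2rad(lats))[keep]
    return np.sum(w * anoms[keep]) / np.sum(w)

# Illustrative 60S-90S band: cell-centre latitudes, trend anomalies, coverage
lats = np.array([-62.5, -67.5, -72.5, -77.5, -82.5, -87.5])
anoms = np.array([0.30, 0.25, 0.15, 0.10, np.nan, np.nan])
frac = np.array([0.80, 0.60, 0.30, 0.25, 0.05, 0.00])  # sparse near the pole

m = zonal_mean(anoms, lats, frac)   # poleward cells with no data are excluded
```

Note how the coverage minimum silently drops the most poleward cells, which is exactly why the choice of minimum (10%, 15%, 20%, 25%) matters for how far back the zonal series can be extended.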

It follows that the real effect of masking to HadCRUT4 coverage over the historical period is, in the southern extra-tropics, almost certainly the opposite of that simulated by CMIP5 models. Therefore, in the real world the global effect of masking is likely to be far smaller than the ~15% bias claimed by REA16.

In an article earlier this year updating the Lewis and Curry results,[9] I addressed the key claims about the effects of masking to HadCRUT4 coverage made in Cowtan et al. 2015 and repeated in REA16, writing:

“It has been claimed (apparently based on HadCRUT4v1) that incomplete coverage of high-latitude zones in the HadCRUT4 dataset biases down its estimate of recent rates of increase in GMST [Cowtan and Way 2014].[10] Representation in the Arctic has improved in subsequent versions of HadCRUT4. Even for HadCRUT4v2, used in [Lewis and Curry], the increase in GMST over the period concerned actually exceeds the area-weighted average of increases for ten separate latitude zones, so underweighting of high-latitude zones does not seem to cause a downwards bias. The issue appears to relate more to interpolation over sea ice than to coverage over land and open ocean in high latitudes.

The possibility of coverage bias in HadCRUT4 has since been independently examined by ECMWF using their well-regarded ERA-Interim reanalysis dataset. They found no reduction in that dataset’s 1979-2014 trend in 2 m near-surface air temperature when the globally-complete coverage was reduced to match that of HadCRUT4v4.[11] Since the ERA-interim reanalysis combines observations from multiple sources and of multiple atmospheric variables, based on a model that is well-proven for weather forecasting, it should in principle provide a more reliable infilling of areas where surface data [are] very sparse, such as high-latitude zones, than mechanistic methods such as kriging. Moreover, during the early decades of the HadCRUT4 record (which includes the 1859-1882 base period) data [were] sparse over much of the globe, and global infilling may introduce significant errors.”

Thus, the claim by Cowtan and Way (2014) that the ERA-interim analysis shows a rapidly increasing cold bias in HadCRUT4 after 1998 does not apply to HadCRUT4v4 over the longer post 1978 period. Focussing first on this period, the performance of the ERA-Interim and six other reanalyses in the Arctic was examined by Lindsay et al.[12] Although the accuracy of reanalyses in the fast warming but sparsely observed Arctic region has been questioned, the authors found that ERA-interim had a very high correlation with monthly temperature anomalies at 449 Arctic land stations. They reckoned ERA-interim to be the most accurate reanalysis for surface air temperature both in absolute terms and as to (post 1979) trend.

Lindsay et al. found GISTEMP to have a higher post-1978 trend in the Arctic than ERA-interim, but GISTEMP uses a crude interpolation and extrapolation based infilling method. Moreover, the ERA-interim version used by ECMWF to investigate possible coverage bias differs from the main dataset. It incorporates a homogeneity adjustment to its post 2001 SST data that significantly increases its temperature trend over that of the main ERA-interim reanalysis. Taking account of that might well eliminate the Arctic trend shortfall compared with GISTEMP. Certainly, over 1979-2015 both the adjusted ERA-interim and HadCRUT4v4 datasets showed a slightly higher trend in global temperature (of respectively 0.166 and 0.165 °C/decade) than did GISTEMP (0.162°C/decade).

Another recent study, Dodd et al,[13] stated that “ERA-Interim has been found to be consistent with independent observations of Arctic [surface air temperatures] and provides realistic estimates of Arctic temperatures and temperature trends that outperform, or are comparable to, other currently available reanalyses for all areas of the Arctic so far investigated.” In her PhD thesis, Dodd also noted that “The issues arising from using drifting platforms in this study highlight the difficulty of investigating [surface air temperatures] over Arctic sea ice.” All this suggests that mechanistic infilling methods are unlikely to outperform the ERA-interim reanalysis in the Arctic, or indeed the Antarctic.

Prior to 1979, there is very little evidence as to the actual effects of incomplete observational coverage, or of blending air and SST measurements, on estimated trends in global temperature. However, there are two well known long-term surface temperature datasets that are based (on a decadal timescale upwards) on air temperature over the ocean as well as land, and which moreover infill to obtain complete or near complete global coverage: NOAAv4.01 and GISTEMP. Cowtan et al. (2015) accept that the new NOAA dataset “incorporates adjustments to SSTs to match night-time marine air temperatures and so may be more comparable to model air temperatures”. GISTEMP uses the NOAAv4.01 SST dataset (ERSST4). Both NOAAv4.01 and GISTEMP show almost identical changes in mean GMST to that per HadCRUT4v4 from 1880-1899, the first two decades they cover, to 1995-2015, the final period used in the update of Lewis and Curry. This suggests that any downwards bias in TCR estimation arising from use of HadCRUT4v4 is likely to be very small. Moreover, whilst some downwards bias in HadCRUT4v4 warming may exist, there are also possible sources of upwards bias, particularly over land, such as the effects of urbanisation and of destabilisation by greenhouse gases of the night-time boundary layer.

*A way to resolve some of the uncertainties arising from poor early observational coverage*

It is doubtful that any method of global infilling of temperatures based on the limited observational coverage available in the second half of the 19th century or (to a decreasing extent) during the first half of the 20th century is very reliable.

Fortunately, there is no need to use the full historical period when estimating TCR. Uncertainty regarding ocean heat uptake in the middle of the historical period, although a problem for ECS estimation, is not relevant to TCR estimation. Lewis and Curry gave an estimate of TCR based on changes from 1930–50 to 1995–2011, periods that were well matched for mean volcanic activity and AMO state, and which delineate a period over which forcing approximated a 70-year ramp. That TCR estimate was 1.33°C, the same as the primary TCR estimate using 1859–82 as the base period. Updating the final period to 1995–2015 and using HadCRUT4v4 left the estimate using the 1930–50 base period unchanged at 1.33°C. The infilling of HadCRUT4 by Cowtan and Way is less error-prone when using a 1930–50 base period rather than 1859–82 (or 1861–80 as in REA16), since observational coverage was less sparse during 1930–50. Accordingly, estimating TCR using an infilled temperature dataset makes more sense when the later base period is used.

So does use of the infilled Cowtan and Way dataset increase the 1930–50 to 1995–2015 TCR estimate by anything like 15%, the coverage bias for CMIP5 models reported in REA16 for the full historical period? No. The bias is an insignificant 3%, with TCR estimated at 1.37°C. Small additional biases, discussed above, from changes in sea ice and differences in warming rates of SST and air just above the open ocean (which it appears the Cowtan and Way dataset does not adjust for) might push up the bias marginally. However, ~80% of the total warming involved occurred after 1979, and as noted earlier since 1979 the trend in HadCRUT4v4 matches that in the (adjusted) ERA-interim dataset, which estimates purely surface air temperature, not a blend with SST, and has complete coverage. That suggests the bias from estimating TCR from 1930–50 to 1995–2015 using HadCRUT4v4 data is very minor, and that observation based estimates of TCR of ~1.33°C need to be revised up by, at most, a small fraction of the 24% claimed in REA16.
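The 3% figure follows directly from the two TCR estimates quoted above:

```python
tcr_hadcrut = 1.33    # 1930-50 to 1995-2015, HadCRUT4v4
tcr_infilled = 1.37   # same periods, Cowtan and Way infilled dataset
bias_pct = 100 * (tcr_infilled / tcr_hadcrut - 1)
print(round(bias_pct))   # 3, versus the ~15% coverage bias claimed for models
```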

*Claims by Kyle Armour*

In an opinion piece related to REA16 in the same issue of Nature Climate Change, “Climate sensitivity on the rise”, Kyle Armour made three claims:

- That, as a result of REA16’s findings, observation-based estimates of climate sensitivity and TCR must be revised upwards by 24%.

- That the findings in Marvel et al (2015)[14] about various other types of forcing having differing effects on global temperature from CO_{2} (different efficacies) call for multiplying observational estimates of climate sensitivity and TCR by a further factor of 1.30.

- That a robust behaviour in models of apparent (effective) climate sensitivity being lower in the early years after a forcing is imposed than subsequently, rather than remaining constant, requires multiplying estimates of climate sensitivity by a further factor of ~1.25 in order to convert what they actually estimate (effective climate sensitivity) to ECS.

I will show that each of these claims is very wrong. Taking them in turn:

- REA16’s findings are purely model based and do not reflect behaviour in the real climate system. There is little evidence for any major bias when TCR is estimated using observed changes from early in the historical period to the recent past, but limited observational coverage in the early part makes it difficult to quantify bias. However, TCR can also validly be estimated from observed warming since 1930–1950, most of which occurred during the well observed post-1978 satellite era. Doing so produces an identical TCR estimate to when using the long period, and any downwards bias in the estimate appears to be very small. An adjustment factor in the range 1.01x to 1.05x, not 1.24x, appears warranted.

- As I have pointed out elsewhere,[15] Marvel et al has a number of serious faults, only two of which have to date been corrected.[16] Nonetheless, for what it is worth, after correcting those two errors Marvel et al.’s primary (iRF) estimate of the effect on global temperature of the mix of forcings acting during the historical period is the same as if the forcing had been, as per the definition of TCR, solely due to CO_{2}. That is, historical forcing has an estimated transient efficacy of 1.0 (actually 0.99). That would, ignoring the other problems with Marvel et al., justify a multiplicative adjustment to TCR estimates of 1.01x, not 1.30x.

- It is not true that increasing effective sensitivity is a “robust” feature of models. The shortfall of climate sensitivity estimated using the first 35 years’ data following an abrupt CO_{2} increase (roughly corresponding to the weighted average duration of forcing increments over the historical period), compared to that estimated using the standard 150 year regression method, is negligible (2% or less) for six models; for three of those the short period estimate is actually higher. The average shortfall over all CMIP5 models for which I have data is only 7%. Moreover, there is little evidence that the principal causes of estimated ECS exceeding multidecadal effective climate sensitivity in many CMIP5 models (in particular, weakening of the Pacific Walker circulation) are occurring in the real world. So any adjustment to observational estimates of climate sensitivity on account of effective climate sensitivity being, in many models, below ECS (a) does not appear to be well supported by observations; and (b) if based on the average behaviour of CMIP5 models, should be 1.08x rather than 1.25x.
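The contrast between 35-year and 150-year Gregory-regression estimates of effective sensitivity can be illustrated with a synthetic two-timescale response. The mode amplitudes, timescales and feedback values below are invented for illustration; real model behaviour differs model by model:

```python
import numpy as np

F4x, F2x = 7.4, 3.71   # illustrative 4xCO2 and 2xCO2 forcings (W m-2)

def effcs(years_used):
    """Effective sensitivity from a Gregory regression of N on T over the
    first `years_used` years of a synthetic abrupt-4xCO2 response."""
    t = np.arange(1, years_used + 1)
    T1 = 3.0 * (1 - np.exp(-t / 4.0))     # fast mode, feedback 1.4 W m-2 K-1
    T2 = 2.0 * (1 - np.exp(-t / 300.0))   # slow mode, feedback 0.8 W m-2 K-1
    N = F4x - 1.4 * T1 - 0.8 * T2         # TOA imbalance
    lam = -np.polyfit(T1 + T2, N, 1)[0]   # blended feedback from the regression
    return F2x / lam

short, full = effcs(35), effcs(150)
# full > short: the short window is dominated by the higher-feedback fast
# mode, so the 35-year estimate understates the 150-year one.
```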

**Nicholas Lewis**

*References*

[1] Mark Richardson, Kevin Cowtan, Ed Hawkins and Martin Stolpe. Reconciled climate response estimates from climate models and the energy budget of Earth. Nature Clim Chng (2016) doi:10.1038/nclimate3066

[2] Kyle Armour. Projection and prediction: Climate sensitivity on the rise. Nature Clim Chng (2016) doi:10.1038/nclimate3079

[3] Otto, A. et al. Energy budget constraints on climate response. Nature Geosci. 6, 415-416 (2013).

[4] Gregory, J. M., Stouffer, R. J., Raper, S. C. B., Stott, P. A. & Rayner, N. A. An Observationally Based Estimate of the Climate Sensitivity. J. Clim. 15, 3117–3121 (2002).

[5] Lewis, N. & Curry, J. A. The implications for climate sensitivity of AR5 forcing and heat uptake estimates. Clim. Dynam. 45, 1009–1023 (2015).

[6] Richter, I. & Xie, S.-P. Muted precipitation increase in global warming simulations: a surface evaporation perspective. J. Geophys. Res. 113, D24118 (2008).

[7] Ramanathan, V. The role of ocean-atmosphere interactions in the CO2 climate problem. J. Atmos. Sci. 38, 918–930 (1981).

[8] Cowtan, K. et al. Robust comparison of climate models with observations using blended land air and ocean sea surface temperatures. Geophys. Res. Lett. 42, 6526–6534 (2015).

[9] https://niclewis.files.wordpress.com/2016/04/ar5_ebstudy_update_article1b.pdf

[10] Cowtan, K. & Way, R. G. Coverage bias in the HadCRUT4 temperature series and its impact on recent temperature trends. Q. J. R. Meteorol. Soc. 140, 1935–1944 (2014).

[11] See http://www.ecmwf.int/en/about/media-centre/news/2015/ecmwf-releases-global-reanalysis-data-2014-0. The data graphed in the final figure shows the same 1979-2014 trend whether or not coverage is reduced to match HadCRUT4.

[12] Lindsay, R et al. Evaluation of Seven Different Atmospheric Reanalysis Products in the Arctic. J Clim 27, 2588–2606 (2014)

[13] Dodd, MA, C Merchant, NA Rayner and CP Morice. An Investigation into the Impact of using Various Techniques to Estimate Arctic Surface Air Temperature Anomalies. J Clim 28, 1743-1763 (2015).

[14] Kate Marvel, Gavin A. Schmidt, Ron L. Miller and Larissa S. Nazarenko: Implications for climate sensitivity from the response to individual forcings. Nature Climate Change DOI: 10.1038/NCLIMATE2888 (2015).

[15] https://climateaudit.org/2016/01/08/appraising-marvel-et-al-implications-of-forcing-efficacies-for-climate-sensitivity-estimates/

[16] https://climateaudit.org/2016/03/11/marvel-et-al-giss-did-omit-land-use-forcing/


**Background**

Much public commentary about the Deflategate controversy has been about the Ideal Gas Law. However, contrary to many misconceptions, Exponent fully accounted for deflation according to the Ideal Gas Law. They nonetheless observed that there was “additional” loss of pressure in Patriot balls for which they could identify “no set of credible environmental or physical factors”. The Wells Report said that this “tends to support a finding that human intervention may account for the additional loss of pressure”:

This absence of a credible scientific explanation for the Patriots halftime measurements tends to support a finding that human intervention may account for the additional loss of pressure exhibited by the Patriots balls.

While Exponent did not expressly quantify the additional pressure loss – a very peculiar omission – it was approximately 0.35 psi, as compared to an observed pressure loss of approximately 1.4 psi due to changes in temperature and the balls becoming wet, offset by slight warming during intermission. Drawing on information in the Exponent Report (see article for details), for Patriot balls (left two columns) and Colt balls (right two columns), the figure below compares estimates of the impacts of cooling (Ideal Gas Law) and wet footballs to observed deflation during the intermission, plus an allowance for warming during intermission. Information for Colt balls reconciles almost exactly, but there is a discrepancy of about 0.38 psi for Patriot balls. This discrepancy is almost exactly equal to the bias of referee Anderson’s Logo Gauge (orange) – a coincidence that should alarm any analyst of this data (including Exponent, Marlow and Wells).

*Figure 1. Reconciliation of Patriot and Colt pressure drops. In the right column of each pair, estimated warming through the intermission is added to the observed pressure drop to estimate the pressure drop at the start of the half-time intermission. In the left column of each pair are shown the pressure drop for dry balls (limegreen), an estimated average additional drop for wet balls and, for Patriot balls, the additional deflation arising from re-setting pressure after gloving. *
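As a rough check on the reconciliation arithmetic, the Ideal Gas Law calculation can be sketched in a few lines. The temperatures below (a ~67 °F locker room, a ~48 °F field) are assumptions chosen for illustration, not figures taken from the Exponent Report:

```python
def psi_after_cooling(p_set_psig, t_set_f, t_cold_f, atm=14.7):
    """Constant-volume Ideal Gas Law: absolute pressure scales with absolute
    temperature. Converts gauge psi to absolute psi and Fahrenheit to Kelvin."""
    t_set_k = (t_set_f - 32.0) * 5.0 / 9.0 + 273.15
    t_cold_k = (t_cold_f - 32.0) * 5.0 / 9.0 + 273.15
    return (p_set_psig + atm) * (t_cold_k / t_set_k) - atm

# Hypothetical values: ball set to 12.5 psig at 67 F, measured on a 48 F field
p_field = psi_after_cooling(12.5, 67.0, 48.0)
dry_drop = 12.5 - p_field
```

On these assumed temperatures the dry-ball drop comes out near 1 psi from temperature alone; wetting and gauge bias account for the remainder of the observed ~1.4 psi drop.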

**Simulation of Patriot Ball Preparation**

The newly identified error pertains to Exponent’s simulation of Patriot ball preparation for the AFC Championship Game – an issue originally pointed to by Patriot coach Bill Belichick at his press conference of January 24, 2015.

In their simulation of Patriot ball preparation, Exponent set football pressures to 12.5 psi, then vigorously rubbed the balls (“gloving”) for 7 to 15 minutes before stopping. They observed that pressures increased about 0.7 psi, but that the effect wore off after 15-20 minutes – effects shown in their Figure 16 (shown below). From this analysis, they excluded Patriot ball preparation as a potential contributor to the additional pressure loss.

*Figure 2. Exponent Figure 16. Original Caption: The pressure as a function of time while a football is being vigorously rubbed.*

However, Exponent’s simulation neglected an essential element of Patriot ball preparation. According to the Wells Report (WR, 50), Patriot equipment manager Jastremski set football pressures to 12.6 psi **after** the vigorous rubbing.

Jastremski told us that he set the pressure level to 12.6 psi **after** each ball was gloved and then placed the ball on a trunk in the equipment room for Brady to review. [my bold]

The detail that Jastremski set pressure **after** gloving is not mentioned in the Exponent Report, only in passing in the Wells Report. Exponent’s failure to mention this detail makes one wonder whether Paul, Weiss might have failed to transmit it to Exponent. The detail is critical: footballs so processed will have pressures of 12.1-12.2 psi at room temperature, about 0.3-0.4 psi below the NFL minimum of 12.5 psi. This is illustrated in the re-statement of Exponent Figure 16 shown below, which depicts the setting of pressure to 12.6 psi for footballs warmed by gloving, with the subsequent loss of pressure to 12.1-12.2 psi as the balls return to room temperature. The resulting amount of under-inflation is almost exactly equal to the amount of “unexplained” Patriot loss in pressure.

*Figure 3. Re-statement of Exponent Figure 16. Red: Exponent transient shows effect of rubbing to increase pressure, together with a decline after rubbing stopped. Black: shows reduction in pressure from re-setting to 12.6 psi, followed by transients as ball temperature returns to room temperature. Twenty seconds allowed for setting gauge in above transients. Dotted vertical lines show 7-15 minutes from start of rubbing reported by Jastremski. Logo gauge values of 12.5 and 12.6 psi are shown on right axis, deducting the bias of ~0.38 psi.*
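The effect of re-setting pressure on a gloving-warmed ball can be sketched with the same gas-law relation. The residual warming of ~5 K at the moment the gauge was applied is an assumption chosen for illustration, not a value from the Exponent Report:

```python
def room_temp_pressure(p_set_psig, t_set_k, t_room_k, atm=14.7):
    """Gay-Lussac's law at fixed volume: gauge pressure after the ball's air
    cools from t_set_k back to t_room_k."""
    return (p_set_psig + atm) * (t_room_k / t_set_k) - atm

T_ROOM_K = 293.15          # ~68 F room temperature (assumption)
RESIDUAL_WARMING_K = 5.0   # warmth left over from gloving when gauged (assumption)

p_room = room_temp_pressure(12.6, T_ROOM_K + RESIDUAL_WARMING_K, T_ROOM_K)
```

Under these assumptions the ball settles in the low 12s at room temperature, below the 12.5 psi NFL minimum, consistent with the 12.1-12.2 psi range described above.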

**“Logical Inferences” on Gauges**

The battleground issue in scientific analysis of Deflategate has been which gauge referee Anderson used for pre-game measurement. Exponent argued that, “despite the remaining uncertainty, logical inferences can be made according to the data collected to establish the likelihood of which gauge was used [by referee Anderson]” – a form of reasoning with which I agree, though the above analysis reverses their conclusion.

Referee Anderson had two gauges, one of which (the Logo Gauge) measured about 0.38 psi too high, while the other gauge (the Non-Logo Gauge) was accurate. It was observed almost immediately (MacKinnon, 2015; Hassett et al 2015) that the additional deflation of Patriot balls could be explained if Anderson had used the Logo Gauge for pre-game measurement of Patriot balls, a hypothesis that was consistent with Anderson’s own recollection that he had used the Logo Gauge. However, Exponent argued that other information led to the “logical inference” that Anderson had used the Non-Logo Gauge for measuring both Patriot and Colt balls. They stated:

Walt Anderson recalled that according to the gauge he used (which is either the Logo or Non-Logo Gauge), all of the Patriots and Colts footballs measured at or near 12.5 psig and 13.0 psig, respectively, when he first tested them (with two Patriots balls slightly below 12.5 psig). This means that the gauges used by the Patriots and the Colts each read similarly to the gauge used by Walt Anderson during his pregame inspection.

Exponent had obtained dozens of gauges (all new gauges similar to the Non-Logo Gauge), none of which had a bias similar to the Logo Gauge. From this information, they argued that it was “very unlikely” that both the Patriots and Colts could have had gauges that were “out of whack” (the term used by Wells) similarly to the Logo Gauge and therefore concluded that Anderson had used the Non-Logo Gauge for pre-game measurements. This conclusion was endorsed in the Wells Report, with Wells being particularly vehement about the conclusion at the Appeal Hearing, comparing the possibility to a “lightning strike” – a term that he liked and used twice.

Wells was particularly emphatic that use of the Non-Logo Gauge was a “scientific” finding (rather than a conclusion from circumstantial evidence). Wells told Goodell:

The scientists, the Exponent people say they believe based on their scientific tests that the non-logo gauge was used.

Wells invoked “science” to explain away Anderson’s recollection of using the Logo Gauge as follows:

Look, this is no different than a case where somebody has a recollection of X happening and then you play a tape and the tape says Y happened. Now, the person could keep saying, well, darn it, I remember it was X. But the people are going to go with the tape. I went with the science and the logic that I had three data points. And that’s what I based my decision on.

Goodell was swayed by Wells’ vehemence and his decision expressly included the following finding on gauges:

There was argument at the hearing about which of two pressure gauges Mr. Anderson used to measure the pressure in the game balls prior to the game. The NFLPA contended, and Dean Snyder opined, that Mr. Anderson had used the so-called logo gauge. On this issue, I find unassailable the logic of the Wells Report and Mr. Wells’ testimony that the non-logo gauge was used because otherwise neither the Colts’ balls nor the Patriots’ balls, when tested by Mr. Anderson prior to the game, would have measured consistently with the pressures at which each team had set their footballs prior to delivery to the game officials, 13 and 12.5 psi, respectively. Mr Wells’s testimony was confirmed by that of Dr. Caligiuri and Professor Marlow. As Professor Marlow testified, “There’s ample evidence that the non-logo gauge was used”.

This reasoning is valid if the pressures were set without rubbing (the Colt balls) but leads to exactly opposite conclusions for Patriot balls.

Because the Patriot rubbing protocol resulted in the balls being under-inflated by approximately 0.35 psi at room temperature, the only way in which Anderson could have measured them above 12.5 psi was if he used the **Logo Gauge**. This is based on exactly the same form of logical inference used by Exponent, but without their erroneous interpretation of Patriot ball preparation.

The corollary is that Anderson inattentively switched gauges between measuring Patriot and Colt balls. While this seems peculiar, NFL officials did exactly the same thing at half-time – switching gauges between measuring Patriot and Colt balls – despite heightened scrutiny. If Anderson put the gauge in his pocket after measuring one set of footballs, it would be entirely random whether he used the same gauge for the other set of footballs.

Although the possibility of Anderson inattentively switching gauges for pre-game measurements was an important possibility (suggested, for example, by Hassett et al, 2015 pdf), at the appeal hearing (*Hearing,* 369:11), Exponent made the remarkable statement that they had been “told” not to consider such a possibility, which was not raised or analysed in the Exponent Report. Surprisingly, this admission wasn’t pursued by Brady’s counsel at the Appeal Hearing and it is therefore unknown who gave these instructions or why.

**Transients**

An essential element of Exponent’s report was its comparison of observed pressures at half-time to modeled transients of pressure changes through the half-time intermission as footballs warmed up, simulations illustrated in a series of figures (Figures 24-30). Remarkably, these simulations contained another error. The Exponent Report stated that the Logo Gauge had been used to set the pressure of footballs to 12.5 psi in Figure 27 and the right panel of Figure 28:

The Logo Gauge was used to set the pressure of two balls to 12.50 psig (representative of the Patriots) and two balls to 13.00 psig (representative of the Colts).

However, Exponent actually used a different gauge (the unbiased Master Gauge) to set pressures to 12.5 psi, resulting in transients that were approximately 0.38 psi higher than under the stated procedure. In the figure below, I’ve re-stated results from their simulations to show transients based on Colt pressures being set with the Non-Logo Gauge and Patriot pressures with the Logo Gauge. In each case, there is plenty of time during which there is an overlap between observations and modeled transients, contradicting Exponent:

*Figure 4. Re-statement of transients from Figures 25 (Non-Logo) and 27 (Logo), basis 70 deg F, for Colt balls set to 13 psig using Non-Logo Gauge and for Patriot balls set to 12.5 psig using the Logo Gauge. *
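The shape of such a warming transient can be sketched by combining Newton’s law of cooling (for the ball’s air temperature) with the Ideal Gas Law (for pressure). The time constant and temperatures below are assumptions for illustration, not Exponent’s fitted values:

```python
import math

def pressure_transient(p_start_psig, t_start_k, t_ambient_k, tau_min, minutes,
                       atm=14.7):
    """Ball air temperature relaxes exponentially toward ambient with time
    constant tau_min; gauge pressure follows via the Ideal Gas Law."""
    p_abs0 = p_start_psig + atm
    pressures = []
    for m in minutes:
        t = t_ambient_k + (t_start_k - t_ambient_k) * math.exp(-m / tau_min)
        pressures.append(p_abs0 * (t / t_start_k) - atm)
    return pressures

# Assumed: ball at ~48 F (282 K) warming in a ~70 F (294.3 K) room, tau ~ 8 min
curve = pressure_transient(11.1, 282.0, 294.3, 8.0, [0, 2, 4, 8, 13])
```

The pressure rises steeply at first and flattens toward its room-temperature value, which is the qualitative behavior of the Figures 24-30 transients.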

**Conclusion**

The “unexplained” additional loss of pressure can be unequivocally seen to occur as a result of Jastremski setting pressure **after** gloving, rather than before. This is a complete explanation, which precludes tampering after referee measurement.

Previous scientific critiques of the Wells Report had observed that Patriot deflation could be explained if Anderson used the Logo Gauge, but had been unable to overcome Exponent’s argument about the improbability of the Patriot Gauge being “out of whack” similarly to the Logo Gauge. That weakness is overcome in the present analysis. Correct modeling of Patriot ball preparation yields the “logical inferences” that the Patriot gauge was relatively accurate and that Anderson used the Logo Gauge.

Indeed, it is the Wells Report itself that requires an implausible “lightning strike”. Wells’ analysis requires that, out of all possible target deflations, the amount of Patriot deflation was almost exactly equal to the bias of Anderson’s Logo Gauge. Any self-respecting analyst should have examined and cross-examined his data when asked to arrive at that conclusion.

Exponent expressly stated that their procedures were based on information provided by Paul, Weiss (Ted Wells’ legal firm). While descriptions in the Exponent Report generally track descriptions in the Wells Report, the information that Jastremski set pressure **after** gloving appears only in the Wells Report and is conspicuously absent from the Exponent Report. With present documents, it is impossible to tell whether Exponent was in possession of this information and neglected to include it in their simulation or whether Paul, Weiss neglected to transmit this information to Exponent. Either way, it is an error that has no place in a professional report.

Without these errors, Exponent could not have stated that there was “no set of credible environmental or physical factors” explaining the additional pressure loss.

Appeal courts are poorly suited to resolve such errors. There is another way to resolve the controversy. The scientific community takes considerable pride in the concept of science being “self-correcting”. When a scientist has inadvertently made an error, the most honorable and effective method of correcting the scientific record is to issue a corrected report, and, if such is not possible, retraction. The Deflategate controversy originated in scientific and technical errors and the responsible scientists and investigators should take responsibility. Even at this late stage, Paul, Weiss and/or Exponent and/or Marlow should man up, acknowledge the errors and either re-issue corrected reports or retract. If any of them do so, it is hard to envisage the Deflategate case continuing much further.

Advocates and even policy-makers often like to say that their conclusions are supported by “science”, but Wells’ enthusiastic use of the terms “science” and “scientific tests” as rhetoric to validate incorrect analysis should serve as a caveat.

The complete paper is online pdf.


**Background**

The proximate cause of Schmidt’s bilious tweets was Curry’s proposed use of the tropical troposphere spaghetti graph from Christy’s more recent congressional testimony in her planned presentation to NARUC. In that testimony, Christy had reported that “models over-warm the tropical atmosphere by a factor of approximately 3, (Models +0.265, Satellites +0.095, Balloons +0.073 °C/decade)”.

The Christy diagram has long been criticized by warmist blogs for its baseline – an allegation that I examined in my most recent post, in which I showed that the baselining prescribed by Schmidt and/or Verheggen was guilty of the very offences alleged in Schmidt’s accusation and that, ironically, Christy’s nemesis, Carl Mears, had used a nearly identical baseline, but had not been excoriated by Schmidt or others.

I had focused first on baselining, because that had been the main issue at warmist blogs relating to the Christy diagram. However, in twitter followup to my post, Schmidt pretended not to recognize the baselining issue, instead saying that the issue was merely “uncertainties”, but did not expand on exactly how “uncertainties” discomfited the Christy graphic. Even though I had shown that Christy’s baselining was equivalent to Carl Mears’, Schmidt refused to disassociate himself from Verheggen’s offensive accusations.

One clue to Schmidt’s invocation of “uncertainties” comes from the histogram diagram which he proposed to Judy Curry, shown below. It was accompanied by a second diagram, which represented the spaghetti distribution of model runs as a grey envelope – an iconographical technique that I will discuss on another occasion. The histogram diagram consisted of two main elements: (1) a histogram of the 102 CMIP5 runs (32 models); (2) five line segments, each representing the confidence interval for a different satellite measurement.

Schmidt did not provide a statistical interpretation or commentary on this graphic, apparently thinking that the diagram somehow refuted Christy on its face. However, it does nothing of the sort. CA reader JamesG characterized it as “the daft argument that because the obs uncertainties clip the model uncertainties then the models ain’t so bad.” In fact, to anyone with a grounded understanding of joint statistical distributions, Schmidt’s diagram actually supports Christy’s claim of inconsistency.

**TRP vs GLB Troposphere**

Alert readers may have already noticed that whereas the Christy figure in controversy depicted trends in the **tropical** troposphere – a zone that has long been especially in dispute – Schmidt’s histogram depicts trends in the **global** troposphere.

In the figure below, I’ve closely emulated Schmidt’s diagram and shown the effect of the difference. In the left panel, I’ve shown the Schmidt histogram (GLB TMT) with horizontal and vertical axes transposed for graphical convenience. The second panel shows my emulation of the Schmidt diagram using GLB TMT (mid-troposphere) from CMIP5. The third and fourth panels show identically constructed diagrams for **tropical** TMT and tropical **TLT** (lower troposphere), each derived from the Christy compilation of 102 CMIP5 runs (also, I believe, used by Schmidt.) Discussion below the figure.

*Figure 1. Histograms of 1979-2015 trends versus satellite observations. Left – Gavin Schmidt; second panel – GLB TMT; third panel – TRP TMT; fourth panel: TRP TLT. The black triangle shows average of model runs. All calculated from annualized data. *

The histograms and observations in panels 2-4 were all calculated from annualizations of monthly data (following indications of Schmidt’s method.) The resulting panel for Global TMT (second panel) corresponds reasonably to the Schmidt diagram, though there are some puzzling differences of detail. The lengths of the line segments for each satellite observation series were calculated as the standard error of the trend coefficient using OLS on *annualized* data, closely replicating the Schmidt segments (and corresponding to information from a Schmidt tweet.) This yields higher uncertainty than the same calculation on *monthly* data, but less than assuming *AR1* errors with monthly data. The confidence intervals are also somewhat larger than the corresponding confidence intervals in the RSS simulations of structural uncertainty, a detail that I can discuss on another occasion.
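For readers wishing to replicate the line segments, the calculation of a decadal trend and its OLS standard error from annualized data can be sketched as follows. The series here is synthetic, generated with an assumed trend and noise level, not actual satellite data:

```python
import math
import random

def trend_and_se(x, y):
    """OLS slope (per unit of x) and its standard error for a single series."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)  # residual variance, n-2 df
    return slope, math.sqrt(s2 / sxx)

# Synthetic annualized anomalies, 1979-2015: 0.01 deg C/yr trend plus noise
random.seed(123)
years = list(range(1979, 2016))
anoms = [0.01 * (yr - 1979) + random.gauss(0.0, 0.1) for yr in years]
slope, se = trend_and_se(years, anoms)
decadal_trend, decadal_se = 10.0 * slope, 10.0 * se  # deg C/decade, as plotted
```

Each plotted line segment is then roughly the decadal trend plus or minus two of these standard errors.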

In the third panel, I did the same calculation with *tropical* (TRP) TMT data, thus corresponding to the Christy diagram at which Schmidt had taken offence. The trends in this panel are noticeably higher than for the GLB panel (this is the well known “hot spot” in models of the tropical troposphere). In my own previous discussions of this topic, I’ve considered the lower troposphere (*TLT*) rather than the mid-troposphere and, for consistency, I’ve shown this in the right panel. Tropical TLT model runs are slightly warmer than tropical TMT model runs, but only a little. In each case, I’ve extracted the available satellite data. Tropical TLT data from RSS 4.0 and NOAA is not yet available (and thus not shown in the fourth panel.)

The average tropical TMT model trend was 0.275 deg C/decade, about 30% higher than the corresponding GLB trend (0.211 deg C/decade), shown in the Schmidt diagram. The difference between the mean of the model runs and observations was about 55% higher in the tropical diagram than in the GLB diagram.

So Schmidt’s use of the global mid-troposphere shown in his initial tweet to Curry had the effect of materially reducing the discrepancy. **Update (May 6):** In a later tweet, Schmidt additionally showed the corresponding graphic for tropical TMT. I’ll update this post to reflect this.

**The Model Mean: Back to Santer et al 2008**

In response to my initial post about baselining, Chris Colose purported to defend Schmidt (tweet) stating:

“re-baselining is not the only issues. large obs uncertainty, model mean not appropriate, etc.”

I hadn’t said that “re-baselining” was the “only” issue. I had opened with it as an issue because it had been the most prominent in warmist critiques and had occasioned offensive allegations, originally from Verheggen, but repeated recently by others. So I thought that it was important to take it off the table. I invited Gavin Schmidt to disassociate himself from Verheggen’s unwarranted accusations about re-baselining, but Schmidt refused.

Colose’s assertion that the “model mean [is] not appropriate” ought to raise questions, since differences in means are assessed all the time in all branches of science. Ironically, a comparison of observations to the model mean was one of the key comparisons in Santer et al 2008, of which Schmidt was a co-author. So Santer, Schmidt et al had no issue at the time with the principle of comparing observations to the model mean. Unfortunately (as Ross and I observed in a contemporary submission), Santer et al used obsolete data (ending in 1999) and their results (purporting to show no statistically significant difference) were invalid using then up-to-date data. (The results are even more offside with the addition of data to the present.)

For their comparison of the difference between means, Santer et al used a t-statistic, in which their formula for the standard error of the model mean was the standard deviation of the model trends divided by the square root of the number of models. I show this formula since Schmidt and others had argued vehemently against inclusion of the n_m divisor for the number of models.
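Since the formula itself appears only as an image in the original post, here is my reconstruction of the Santer et al 2008 test statistic (their eq. 12, as I read it); treat the notation as an approximation of theirs:

$$ d^{*} \;=\; \frac{\langle b_{m}\rangle - b_{o}}{\sqrt{\, s_{\langle b_{m}\rangle}^{2} + s_{b_{o}}^{2} \,}}, \qquad s_{\langle b_{m}\rangle} \;=\; \frac{s_{b_{m}}}{\sqrt{n_{m}}} $$

where ⟨b_m⟩ is the mean of the model trends, b_o the observed trend, s_{b_m} the inter-model standard deviation and n_m the number of models; the √n_m divisor is the term in dispute.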

The above formula for the standard error of the model mean, as Santer himself realized – mentioning the point in several Climategate emails – was identical to that used in Douglass et al 2008. Their formula differed from Douglass et al in the other term of the denominator – the standard error of observations s{b_o}.

In December 2007, Santer et al 2008 coauthor Schmidt had ridiculed this formula for the standard error of models as an “egregious error”, claiming that division of the standard deviation by the (square root of the) number of models resulted in the “absurd” situation where some runs contributing to the model mean were outside the confidence interval for the model mean.

Schmidt’s December 2007 post relied on rhetoric rather than statistical references and his argument was not adopted in Santer et al 2008, which divided the standard deviation by the square root of the number of models.

Schmidt’s December 2007 argument caused some confusion in October 2008 when Santer et al 2008 was released, on which thus far undiscussed Climategate emails shed interesting light. Gavin Cawley, commenting at Lucia’s and Climate Audit in October 2008 as “beaker”, was so persuaded by Schmidt’s December 2007 post that he argued that there must have been a misprint in Santer et al 2008. Cawley purported to justify his claimed misprint with a variety of arid arguments that made little sense to either me or Lucia. We lost interest in Cawley’s arguments once we were able to verify from tables in Santer et al 2008 that there was no misprint and were able to establish that Santer et al 2008 had used the same formula for standard error of *models* as Douglass et al (differing, as noted above, in the term for standard error of *observations*.)

Cawley pursued the matter in emails to Santer that later became part of the Climategate record. Cawley pointed to Schmidt’s earlier post at Real Climate and asked Santer whether there was a misprint in Santer et al 2008. Santer forwarded Cawley’s inquiry to Tom Wigley, who told Santer that Schmidt’s Real Climate article was “simply wrong” and warned Santer that Schmidt was “not a statistician” – points on which a broad consensus could undoubtedly have been achieved. Unfortunately, Wigley never went public with his rejection of Schmidt’s statistical claims, which remain uncorrected to this day. Santer reported back to Cawley that the formula in the article was correct and was conventional statistics, citing von Storch and Zwiers as authority. Although Cawley had been very vehement in his challenges to Lucia and myself, he did not close the circle when he heard back from Santer by conceding that Lucia and I had been correct in our interpretation.

**Bayesian vs Frequentist **

In recent statistical commentary, there has been a very consistent movement to de-emphasize “statistical significance” as a sort of talisman of scientific validity, while increasing emphasis on descriptive statistics and showing distributions – a move that is associated with the increasing prominence of Bayesianism and something that is much easier with modern computers. As someone who treats data very descriptively, I’m comfortable with the movement.

Rather than worry about whether something is “statistically significant”, the more modern approach is to look at its “posterior distribution”. Andrew Gelman’s text (Bayesian Data Analysis, p 95) specifically recommended this in connection with the *difference in means*:

In problems involving a continuous parameter θ (say the difference between two means), the hypothesis that θ is exactly zero is rarely reasonable, and it is of more interest to estimate a posterior distribution or a corresponding interval estimate of θ. For a continuous parameter θ, the question ‘Does θ equal 0?’ can generally be rephrased more usefully as ‘What is the posterior distribution for θ?’ (text, p 95)

In the diagram below, I show how the information in a Schmidt-style histogram can be translated into a posterior distribution, and why such a distribution is helpful and relevant to someone trying to understand the data in a practical way. The techniques below do not use full Bayesian apparatus of MCMC simulations (which I have not mastered), but I would be astonished if such technique would result in any material difference. (I’m somewhat reassured that this was my very first instinct when confronted with this issue: see October 2008 CA post here and Postscript below.)

On the left, I’ve shown the Schmidt-style diagram for tropical TMT (third panel above). In the middle, I’ve shown approximate distributions for model runs (pink) and observations (light blue) – explained below, and in the right panel, the distribution of the difference between model mean and observations. From the diagram in the right panel, one can draw conclusions about the t-statistic for the difference in means, but, for me, the picture is more meaningful than a t-statistic.

*Figure 2. Tropical TMT trends. Left: as in third panel of Figure 1. Middle: pink – distribution of model trends corresponding to histogram; lightblue – implied distribution of observed trends. Right: distribution of difference of model and observed trends. In the data used in panel three above (TRP TMT), I got indistinguishable results (models +0.272 deg C/decade; satellites +0.095 deg C/decade). *

The left panel histogram of trends for tropical TMT is derived from the Christy collation (also used by Schmidt) of the 102 CMIP5 runs (with taz) at KNMI. The line segments represent 95% confidence intervals for five satellite series based on the method used in Schmidt’s diagram (see Figure 1 for color code).

In the middle panel, I’ve used normal distributions for the approximations, since their properties are tractable, but the results of this post would apply for other distributions as well. For models, I’ve used the mean and standard deviation of the 102 CMIP5 runs (0.272 and 0.058 deg C/decade, respectively). For observations, I presumed that each satellite was associated with a normal distribution with the standard deviation being the standard error of the trend coefficient in the regression calculation; for each of the five series, I simulated 1000 realizations. From the composite of 5000 realizations, I calculated the mean and standard deviation (0.095 and 0.049 deg C/decade respectively) and used that for the normal distribution for observations shown in light blue. There are other reasonable ways of doing this, but this seemed to me to be the most consistent with Schmidt’s graphic. Note that this technique yields a somewhat wider envelope than the envelope of realizations representing structural uncertainty in the RSS ensemble.

In the right panel, I’ve shown the distribution of the difference of means, calculated following Jaynes’ formula (discussed at CA previously here). In an analysis following Jaynes’ technique, the issue is not whether the difference in means was *“statistically significant”*, but assessing the odds/probability that a draw from models would be higher than a draw from observations, fully accounting for uncertainties of both, calculated according to the following formula from Jaynes:
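The Jaynes formula referred to appears as an image in the original; in the form I understand it, it gives the probability that a draw from the model distribution A exceeds a draw from the observation distribution B:

$$ P(A > B) \;=\; \int_{-\infty}^{\infty} p_{B}(b) \left[ \int_{b}^{\infty} p_{A}(a)\, da \right] db $$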

By specifying the two distributions in the middle panel as normal distributions, the distribution of the difference of means is also normal, with its mean being the difference between the two means and the standard deviation being the square root of the sum of squares of the two standard deviations in the middle panel (mean 0.177 and sd 0.076 deg C/decade respectively). For more complicated distributions, the distribution could be calculated using simulations to effect the integration.
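As a quick check of the quoted parameters of the difference distribution, a sketch using only the means and standard deviations reported above (no simulation required for the normal case):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Values reported above: models 0.272 +/- 0.058, observations 0.095 +/- 0.049
mu_diff = 0.272 - 0.095                       # mean of model-minus-obs trends
sd_diff = math.sqrt(0.058 ** 2 + 0.049 ** 2)  # sd of the difference
p_model_warmer = 1.0 - normal_cdf(0.0, mu_diff, sd_diff)
```

This reproduces the quoted mean (0.177) and sd (~0.076) of the difference distribution, and `p_model_warmer` is the probability that a model draw exceeds an observation draw.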

**Conclusion**

In the present case, from the distribution in the right panel:

- a model run will be warmer than an observed trend **more than 99.5%** of the time;
- will be warmer than an observed trend by **more than 0.1 deg C/decade** approximately **88% of the time**;
- and will be warmer than an observed trend by **more than 0.2 deg C/decade** **more than 41% of the time**.

These values demonstrate a very substantial warm bias in models, as reported by Christy, a bias which cannot be dismissed by mere arm-waving about “uncertainties” in Schmidt style. As an editorial comment about why the “uncertainties” have a relatively negligible impact on “bias”: it is important to recognize that the uncertainties work in both directions, a trivial point seemingly neglected in Schmidt’s “daft argument”. Schmidt’s “argument” relied almost entirely on the rhetorical impact of the upper tail of the observation distributions nicking the lower tail of the model distributions. But the wider upper tail is accompanied by a wider lower tail and, for these measurements, the discrepancy is even larger than the mean discrepancy.

Unsurprisingly, using up-to-date data, the t-test used in Santer et al 2008 is even more offside than it was in early 2009. The t-value under Santer’s equation 12 is 3.835, far outside usual confidence limits. Ironically, it fails even using the incorrect formula for standard error of models, which Schmidt had previously advocated.
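A back-of-envelope version of that t-statistic can be computed from the rounded trend statistics quoted earlier in this post. Since the exact inputs to Santer’s eq. 12 differ from these rounded composites, this lands near, but not exactly at, the 3.835 reported:

```python
import math

# Rounded values quoted in this post (used here as illustrative inputs)
b_model, s_model, n_models = 0.272, 0.058, 102   # model mean trend, sd, count
b_obs, s_obs = 0.095, 0.049                      # obs trend and its std error

se_model_mean = s_model / math.sqrt(n_models)    # the disputed n_m divisor
t_stat = (b_model - b_obs) / math.sqrt(se_model_mean ** 2 + s_obs ** 2)
```

Even with these rounded inputs, the statistic is far outside usual confidence limits.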

The bottom line is that Schmidt’s diagram does not contradict Christy after all and totally fails to support Schmidt’s charge that Christy’s diagram was “partisan”.

**Postscript**

As a small postscript, I am somewhat pleased to observe that my very first instinct, when confronted by the data in dispute in Santer et al 2008, was to calculate a sort of posterior distribution, albeit in a somewhat homemade method – see October 2008 CA post here.

In that post, I calculated a histogram of model trends used in Douglass et al (tropical TLT to end 2004, as I recall – I’ll check what I did). Note that the model mean (and overall distribution) at that time was considerably less than the model mean (and envelope) to the end of 2015. When one squints at the models in detail, they tend to accelerate in the 21st century. I had then calculated the proportion of models with greater trend than observations for values between -0.3 and 0.4 deg C/decade (a different format than the density curve in my diagram above, but one can be calculated from the other).

*Figures from CA here. Left – histogram of model runs used in Douglass et al 2008; right *


*@curryja use of Christy’s misleading graph instead is the sign of partisan not a scientist. YMMV. tweet;*

*@curryja Hey, if you think it’s fine to hide uncertainties, error bars & exaggerate differences to make political points, go right ahead. tweet.*

As a result, Curry decided not to use Christy’s graphic in her recent presentation to a congressional committee. In today’s post, I’ll examine the validity (or lack thereof) of Schmidt’s critique.

Schmidt’s primary dispute, as best as I can understand it, was about Christy’s centering of model and observation data to achieve a common origin in 1979, the start of the satellite period, a technique which (obviously) shows a greater discrepancy at the end of the period than if the data had been centered in the middle of the period. I’ll show support for Christy’s method from his long-time adversary, Carl Mears, whose own comparison of models and observations used a short early centering period (1979-83) “so the changes over time can be more easily seen”. Whereas both Christy and Mears provided rational arguments for their baseline decision, Schmidt’s argument was little more than shouting.

**Background**

The full history of the controversy over the discrepancy between models and observations in the tropical troposphere is voluminous. While the main protagonists have been Christy, Douglass and Spencer on one side and Santer, Schmidt, Thorne and others on the other side, Ross McKitrick and I have also commented on this topic in the past, and McKitrick et al (2010) was discussed at some length by IPCC AR5, unfortunately, as too often, deceptively on key points.

**Starting Points and Reference Periods**

Christy and Spencer have produced graphics in a similar style for several years. Roy Spencer (here) in early 2014 showed a similar graphic using 1979-83 centering (shown below). Indeed, it was this earlier version that prompted vicious commentary by Bart Verheggen, commentary that appears to have originated some of the prevalent alarmist memes.

*Figure 1. 2014 version of the Christy graphic, from Roy Spencer blog (here). This used 1979-83 centering. This was later criticized by Bart Verheggen here. *

Christy’s February 2016 presentation explained this common origin as the most appropriate reference period, using the start of a race as a metaphor:

To this, on the contrary, I say that we have displayed the data in its most meaningful way. The issue here is the rate of warming of the bulk atmosphere, i. e., the trend. This metric tells us how rapidly heat is accumulating in the atmosphere – the fundamental metric of global warming. To depict this visually, I have adjusted all of the datasets so that they have a common origin. Think of this analogy: I have run over 500 races in the past 25 years, and in each one all of the runners start at the same place at the same time for the simple purpose of determining who is fastest and by how much at the finish line. Obviously, the overall relative speed of the runners is most clearly determined by their placement as they cross the finish line – but they must all start together.

The technique used in the 2016 graphic varied somewhat from the earlier style: it took the 1979 value of the 1975-2005 trend as a reference for centering, a value that was very close to the 1979-83 mean.

**Carl Mears**

Ironically, in RSS’s webpage comparison of models and observations, Christy’s longstanding adversary, Carl Mears, used an almost identical reference period (1979-84) in order that “the changes over time can be more easily seen”. Mears wrote that “If the models, as a whole, were doing an acceptable job of simulating the past, then the observations would mostly lie within the yellow band”, but that “this was not the case”:

The yellow band shows the 5% to 95% envelope for the results of 33 CMIP-5 model simulations (19 different models, many with multiple realizations) that are intended to simulate Earth’s Climate over the 20th Century. For the time period before 2005, the models were forced with historical values of greenhouse gases, volcanic aerosols, and solar output. After 2005, estimated projections of these forcings were used.

If the models, as a whole, were doing an acceptable job of simulating the past, then the observations would mostly lie within the yellow band. For the first two plots (Fig. 1 and Fig 2), showing global averages and tropical averages, this is not the case.

Mears illustrated the comparison in the following graphic, the caption to which states the reference period of 1979-84 and the associated explanation.

**Figure 2.** From RSS here. **Original caption:** Tropical (30S to 30N) Mean TLT Anomaly plotted as a function of time. The blue band is the 5% to 95% envelope for the RSS V3.3 MSU/AMSU Temperature uncertainty ensemble. The yellow band is the 5% to 95% range of output from CMIP-5 climate simulations. **The mean value of each time series average from 1979-1984 is set to zero so the changes over time can be more easily seen.** Again, after 1998, the observations are likely to be below the simulated values, indicating that the simulations as a whole are predicting more warming than has been observed by the satellites.

The very slight closing overlap between the envelope of models and envelope of observations is clear evidence – to anyone with a practiced eye – that there is a statistically significant difference between the ensemble mean and observations using the t-statistic as in Santer et al 2008. (More on this in another post).

Nonetheless, Mears did not agree that the fault lay with the models, arguing instead, together with Santer, that the fault lay with errors in forcings, errors in observations and internal variability (see here). Despite these differences in diagnosis, Mears agreed with Christy on the appropriateness of using a common origin for this sort of comparison.

**IPCC AR5**

IPCC, which, to borrow Schmidt’s words, is not shy about “exaggerat[ing or minimizing] differences to make political points”, selected a reference period in the middle of the satellite interval (1986-2005) for their AR5 Chapter 11 Figure 11.25, which presented a *global* comparison of CMIP5 models to the average of 4 observational datasets.

*Figure 3. IPCC AR5 WG1 Figure 11.25a.*

The *effective origin* in this graphic was therefore **1995**, reducing the divergence between models and observations to approximately half of the full divergence over the satellite period. Roy Spencer recently provided the following diagram, illustrating the effect of centering two series with different trends at the middle of the period (top panel below), versus the start of the period (lower panel). If the two trending series are centered in the middle of the period, then the gap at closing is reduced to half of the gap arising from starting both series at a common origin (as in the Christy diagram).

*Figure 4. Roy Spencer’s diagram showing difference between centering at the beginning and in the middle.*
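The halving effect in Spencer’s diagram is simple arithmetic and can be checked with two toy trend lines. The 0.27 and 0.12 deg C/decade slopes below are illustrative values of mine, not the actual model and observational trends:

```python
import numpy as np

years = np.arange(1979, 2016)
model = 0.27 * (years - 1979) / 10.0   # toy model trend line (deg C)
obs   = 0.12 * (years - 1979) / 10.0   # toy observed trend line (deg C)

def center(series, yr_lo, yr_hi):
    """Subtract the mean over a reference period (anomaly baselining)."""
    mask = (years >= yr_lo) & (years <= yr_hi)
    return series - series[mask].mean()

# Common origin at the start (Christy/Mears style, 1979-83 reference):
gap_start = center(model, 1979, 1983)[-1] - center(obs, 1979, 1983)[-1]

# Mid-interval reference (IPCC AR5 style, 1986-2005, effective origin ~1995):
gap_mid = center(model, 1986, 2005)[-1] - center(obs, 1986, 2005)[-1]

# gap_mid comes out at about 0.57 of gap_start for these toy series:
# centering mid-period roughly halves the closing divergence.
```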

**Bart Verheggen**

The alarmist meme about supposedly inappropriate baselines in Christy’s figure appears to have originated (or at least appeared in an early version) in a 2014 blogpost by Bart Verheggen, which attacked the earlier version of the graphic from Roy Spencer’s blog (here) shown above. That version used 1979-83 centering, almost identical to the 1979-84 centering later used by RSS/Carl Mears.

Verheggen labeled such baselining as “particularly flawed” and accused Christy and Spencer of “shifting” the model runs upwards to “increase the discrepancy”:

They shift the modelled temperature anomaly upwards to increase the discrepancy with observations by around 50%.

Verheggen claimed that the graphic began with a 1986-2005 reference period (the period used by IPCC AR5) and that Christy and Spencer had then “re-baseline[d]” it to the shorter period of 1979-83 to “maximize the visual appearance of a discrepancy”:

The next step is re-baselining the figure to maximize the visual appearance of a discrepancy: Let’s baseline everything to the 1979-1983 average (way too short of a period and chosen very tactically it seems)… Which looks surprisingly similar to Spencer’s trickery-graph.

Verheggen did not provide a shred of evidence that Christy and Spencer had first done the graphic with IPCC’s middle-interval reference period and then “re-baselin[ed]” it to “trick” people. Given that the reference period of **“1979-83”** was clearly labelled on the y-axis, it hardly required reverse engineering to conclude that Christy and Spencer had used a 1979-83 reference period, nor should it have been “surprising” that an emulation using a 1979-83 reference period would look similar. Nor has Verheggen made similar condemnations of Mears’ use of a 1979-84 reference period to enable the changes to be “more easily seen”.

Verheggen’s charges continue to resonate in the alarmist blog community. A few days after Gavin Schmidt challenged Judy Curry, Verheggen’s post was cited at Climate Crocks as the “best analysis so far of John Christy’s go-to magical graph that gets so much traction in the deniosphere”.

The trickery is entirely the other way. Graphical techniques that result in an origin in the middle of the period (~1995), rather than at the start (1979), reduce the closing discrepancy by about 50%, thereby hiding the divergence, so to speak.

**Gavin Schmidt **

While Schmidt complained that the Christy diagram did not have a “reasonable baseline”, Schmidt did not set out criteria for why one baseline was “reasonable” and another wasn’t, or what was wrong with using a common origin (or reference period at the start of the satellite period) “so the changes over time can be more easily seen” as Mears had done.

In March 2016, Schmidt produced his own graphics, using two different baselines to compare models and observations. Schmidt made other iconographic variations to the graphic (which I intend to analyse separately), but for the analysis today, it is the reference periods that are of interest.

Schmidt’s first graphic (shown in the left panel below – unfortunately truncated on the left and right margins in the Twitter version) was introduced with the following comment:

Hopefully final versions for tropical mid-troposphere model-obs comparison time-series and trends (until 2016!).

This version used 1979-1988 centering, a choice which yields relatively small differences from Christy’s centering. Victor Venema immediately ragged Schmidt about producing anomalies so similar to Christy and wondered about the reference period:

@ClimateOfGavin Are these Christy-anomalies with base period 1983? Or is it a coincidence that the observations fit so well in beginning?

Schmidt quickly re-did the graphic using 1979-1998 centering, thereby lessening the similarity to “Christy anomalies”, announcing the revision (shown on the right below) as follows:

@VariabilityBlog It’s easy enough to change. Here’s the same thing using 1979-1998. Perhaps that’s better…

After Schmidt’s “re-baselining” of the graphic (to borrow Verheggen’s term), the observations were now shown as within the confidence interval throughout the period. It was this second version that Schmidt later proffered to Curry as the result arising from a “more reasonable” baseline.

*Figure 5. Two figures from Gavin Schmidt tweets on March 4, 2016. Left – from March 4 tweet, using 1979-1988 centering. Note that parts of the graphic on the left and right margins appear to have been cut off, so that the graph does not go to 2015. Right – second version using 1979-1998 centering, thereby lowering the model frame relative to observations.*

The incident is more than a little ironic in the context of Verheggen’s earlier accusations. Verheggen showed a sequence of graphs going from a 1986-2005 baseline to a 1979-1983 baseline and accused Spencer and Christy of “re-baselining” the graphic “to maximize the visual appearance of a discrepancy” – which Verheggen called “trickery”. Verheggen made these accusations without a shred of evidence that Christy and Spencer had started from a 1986-2005 reference period – a highly questionable interval in the first place, if one is trying to show differences over the 1979-2012 period, as Mears had recognized. On the other hand, prompted by Venema, Schmidt actually did “re-baseline” his graphic, reducing the “visual appearance of a discrepancy”.

**The Christy Graphic Again**

Judy Curry had reservations about whether Schmidt’s “re-baselining” was sufficient to account for the changes from the Christy figure, observing:

My reaction was that these plots look nothing like Christy’s plot, and its not just a baseline issue.

In addition to changing the reference period, Schmidt’s graphic made several other changes:

- Schmidt used annual data, rather than a 5-year average.
- Schmidt showed a grey envelope representing the 5-95% confidence interval, rather than showing the individual spaghetti strands;
- instead of showing 102 runs individually, Christy showed averages for 32 models. Schmidt seems to have used the 102 runs individually, based on his incorrect reference to 102 **models** (!) in his caption.

I am in the process of trying to replicate Schmidt’s graphic. To isolate the effect of Schmidt’s re-baselining on the Christy graphic, I replicated the Christy graphic as closely as I could, with the resulting graphic (second panel) capturing the essentials in my opinion, and then reproduced the graphic using Schmidt centering.

The third panel isolates the effect of Schmidt’s 1979-1998 centering period. This moves both models and observations downward, models slightly more than observations. However, in my opinion, the visual effect is not materially changed from Christy centering. This seems to confirm Judy Curry’s surmise that the changes in Schmidt’s graphic arise from more than the change in baseline. One possibility was that the change in visual appearance arose from Christy’s use of ensemble averages for each model, rather than individual runs. To test this, the fourth panel shows the Christy graphic using runs. Once again, it does not appear to me that this iconographic decision is material to the visual impression. While the spaghetti graph on this scale is not particularly clear, the INM-CM4 model run can be distinguished as the singleton “cold” model in all four panels.

*Figure 1. Christy graphic (left panel) and variations. See discussion in text. The blue line shows the average of the UAH 6.0 and RSS 3.3 TLT tropical data. *

**Conclusion**

There is nothing mysterious about using the gap between models and observations at the end of the period as a measure of differing trends. When Secretariat defeated the field in the 1973 Belmont by 25 lengths, even contemporary climate scientists did not dispute that Secretariat ran faster than the other horses.

Even Ben Santer has not tried to challenge whether there was a “statistically significant difference” between Steph Curry’s epic 3-point shooting in 2015-6 and leaders in other seasons. Last weekend, NYT Sports illustrated the gap between Steph Curry and previous 3-point leaders using a spaghetti graph (see below) that, like the Christy graph, started the comparisons with a common origin. The visual force comes in large measure from the separation at the end.

If NYT Sports had centered the series in the middle of the season (in Bart Verheggen style), then Curry’s separation at the end of the season would be cut in half. If NYT Sports had centered the series on the first half (in the style of Gavin Schmidt’s “reasonable baseline”), Curry’s separation at the end of the season would likewise be reduced. Obviously, such attempts to diminish the separation would be rejected as laughable.

There is a real discrepancy between models and observations in the tropical troposphere. If the point at issue is the difference in trend during the satellite period (1979 on), then, as Carl Mears observed, it is entirely reasonable to center the data on an early reference period such as the 1979-84 period used by Mears or the 1979-83 period used by Christy and Spencer (or the closely related value of the trend in 1979), so that (in Mears’ words) “the changes over time can be more easily seen”.

Varying Schmidt’s words, doing anything else will result in “hiding” and minimizing “differences to make political points”, which, once again in Schmidt’s words, “is the sign of partisan not a scientist.”

There are other issues pertaining to the comparison of models and observations which I intend to comment on and/or re-visit.


**Introduction**

In a recent article I discussed Bayesian parameter inference in the context of radiocarbon dating. I compared Subjective Bayesian methodology based on a known probability distribution, from which one or more values were drawn at random, with an Objective Bayesian approach using a noninformative prior that produced results depending only on the data and the assumed statistical model. Here, I explain my proposals for incorporating, using an Objective Bayesian approach, evidence-based probabilistic prior information about a fixed but unknown parameter taking continuous values. I am talking here about information pertaining to the particular parameter value involved, derived from observational evidence pertaining to that value. I am not concerned with the case where the parameter value has been drawn at random from a known actual probability distribution, that being an unusual case in most areas of physics. Even when evidence-based probabilistic prior information about a parameter being estimated does exist and is to be used, results of an experiment should be reported without as well as with that information incorporated. It is normal practice to report the results of a scientific experiment on a stand-alone basis, so that the new evidence it provides may be evaluated.

In principle the situation I am interested in may involve a vector of uncertain parameters, and multi-dimensional data, but for simplicity I will concentrate on the univariate case. Difficult inferential complications can arise where there are multiple parameters and only one or a subset of them are of interest. The best noninformative prior to use (usually Bernardo and Berger’s reference prior)[1] may then differ from Jeffreys’ prior.

**Bayesian updating **

Where there is an existing parameter estimate in the form of a posterior PDF, the standard Bayesian method for incorporating (conditionally) independent new observational information about the parameter is “Bayesian updating”. This involves treating the existing estimated posterior PDF for the parameter as the prior in a further application of Bayes’ theorem, and multiplying it by the data likelihood function pertaining to the new observational data. Where the parameter was drawn at random from a known probability distribution, the validity of this procedure follows from rigorous probability calculus.[2] Where it was not so drawn, Bayesian updating may nevertheless satisfy the weaker Subjective Bayesian coherency requirements. But is standard Bayesian updating justified under an Objective Bayesian framework, involving noninformative priors?

A noninformative prior varies depending on the specific relationships the data values have with the parameters and on the data-error characteristics, and thus on the form of the likelihood function. Noninformative priors for parameters therefore vary with the experiment involved; in some cases they may also vary with the data. Two studies estimating the same parameter using data from experiments involving different likelihood functions will normally give rise to different noninformative priors. On the face of it, this leads to a difficulty in using objective Bayesian methods to combine evidence in such cases. Using the appropriate, individually noninformative, prior, standard Bayesian updating would produce a different result according to the order in which Bayes’ theorem was applied to data from the two experiments. In both cases, the updated posterior PDF would be the product of the likelihood functions from each experiment, multiplied by the noninformative prior applicable to the first of the experiments to be analysed. That noninformative priors and standard Bayesian updating may conflict, producing inconsistency, is a well known problem (Kass and Wasserman, 1996).[3]

**Modifying standard Bayesian updating **

My proposal is to overcome this problem by applying Bayes’ theorem once only, to the joint likelihood function for the experiments in combination, with a single noninformative prior being computed for inference from the combined experiments. This is equivalent to the modification of Bayesian updating proposed in Lewis (2013a).[4] It involves rejecting the validity of standard Bayesian updating for objective inference about fixed but unknown continuously-valued parameters, save in special cases. Such special cases include where the new data is obtained from the same experimental setup as the original data, or where the experiments involved are different but the same form of prior is noninformative in both cases.

Since standard Bayesian updating is simply the application of Bayes’ theorem using an existing posterior PDF as the prior distribution, rejecting its validity is quite a serious step. The justification is that Bayes’ theorem applies where the prior is a random variable having an underlying, known probability distribution. An estimated PDF for a fixed parameter is not a known probability distribution for a random variable.[5] Nor, self-evidently, is a noninformative prior, which is simply a mathematical weight function. The randomness involved in inference about a fixed parameter relates to uncertainty in observational evidence pertaining to its value. By contrast, a probability distribution from which a variable is drawn at random tells one how the probability that the variable will have any particular value varies with the value, but it contains no information relating to the specific value that is realised in any particular draw.

Computing a single noninformative prior for inference from the combination of two or more experiments is relatively straightforward provided that the observational data involved in all the experiments is independent, conditional on the parameter value – a standard requirement for Bayesian updating to be valid. Given such independence, likelihood functions may be combined through multiplication, as is done both in standard Bayesian updating and my modification thereof. Moreover, Fisher information for a parameter is additive given conditional independence. In the univariate parameter case, where Jeffreys’ prior is known to be the best noninformative prior, revising the existing noninformative prior upon updating with data from a new experiment is therefore simple. Jeffreys’ prior is the square root of Fisher information (*h*), so one obtains the updated Jeffreys’ prior (*π*_{J.u}) by simply adding in quadrature the existing Jeffreys’ prior (*π*_{J.e}) and the Jeffreys’ prior for the new experiment (*π*_{J.n}):

*π*_{J.u} = sqrt(*π*_{J.e}^{2} + *π*_{J.n}^{2 })

where *π*_{J} = sqrt(*h*).
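A minimal numerical sketch of this quadrature rule (the function name is mine), illustrated with two Gaussian measurements of a parameter, one direct and one of its cube, anticipating Example 1 below:

```python
import numpy as np

def update_jeffreys_prior(prior_existing, prior_new):
    """Combine Jeffreys' priors from conditionally independent experiments.
    Fisher information is additive under independence, and Jeffreys' prior
    is its square root, so the priors add in quadrature, pointwise."""
    return np.hypot(prior_existing, prior_new)   # sqrt(pe**2 + pn**2)

# Illustration: experiment A measures theta itself (Gaussian error, sd 0.75),
# so I_A = 1/sd_A**2 and the Jeffreys' prior is uniform; experiment B measures
# theta**3 (sd 1.5), so I_B = (3*theta**2/sd_B)**2 and the prior ~ theta**2.
theta = np.linspace(0.0, 6.0, 601)
prior_A = np.full_like(theta, 1 / 0.75)   # sqrt(I_A): flat
prior_B = 3 * theta**2 / 1.5              # sqrt(I_B): proportional to theta**2
prior_AB = update_jeffreys_prior(prior_A, prior_B)
```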

The usual practice of reporting a Bayesian parameter estimate, whether or not objective, in the form of a posterior PDF is insufficient to enable implementation of this revised updating method. In the univariate case, it suffices to report also the likelihood function and the (Jeffreys’) prior. Reporting just the likelihood function as well as the posterior PDF is not enough, because that only enables the prior to be recovered up to proportionality, which is insufficient here since addition of (squared) priors is involved.[6]

The proof of the pudding is in the eating, as they say, so how does probability matching using my proposed method of Bayesian updating compare with the standard method? I’ll give two examples, both taken from Lewis (2013a), my arXiv paper on the subject, each of which involve two experiments: A and B.

**Numerical testing: Example 1**

In the first example, the observational data for each experiment is a single measurement involving a Gaussian error distribution with known standard deviation (0.75 and 1.5 for experiments A and B respectively). In experiment A, what is measured is the unknown parameter itself. In experiment B, the cube of the parameter is measured. The calculated Jeffreys’ prior for experiment A is uniform; that for experiment B is proportional to the square of the parameter value. Probability matching is computed for a single true parameter value, set at 3; the choice of parameter value does not qualitatively affect the results.

Figure 1 shows probability matching tested by 20,000 random draws of experimental measurements. The *y*-axis shows the proportion of cases for which the parameter value at each posterior CDF percentage point, as shown on the *x*-axis, exceeds the true parameter value. The black dotted line in each panel shows perfect probability matching (frequentist coverage).

Fig 1: Probability matching for experiments measuring: A – parameter; B – cube of parameter.

The left hand panel relates to inference based on data from each separate experiment, using the correct Jeffreys’ prior (which is also the reference prior) for that experiment. Probability matching is essentially perfect in both cases. That is what one would expect, since in both cases a (transformational) location parameter model applies: the probability of the observed datum depends on the parameter only through the difference between strictly monotonic functions of the observed value and of the parameter. That difference constitutes a so-called pivot variable. In such a location parameter case, Bayesian inference is able to provide exact probability matching, provided the appropriate noninformative prior is used.

The right hand panel relates to inference based on the combined experiments, using the Jeffreys’/reference prior for experiment A (green line), that for experiment B (blue line) or that for the combined experiments (red line), computed by adding in quadrature the priors for experiments A and B, as described above.

Standard Bayesian updating corresponds to either the green or the blue lines depending on whether experiment A or experiment B is analysed first, with the resulting initial posterior PDF being multiplied by the likelihood function from the other experiment to produce the final posterior PDF. It is clear that standard Bayesian updating produces order-dependent inference, with poor probability matching whichever experiment is analysed first.

By contrast, my proposed modification of standard Bayesian updating, using the Jeffreys’/reference prior pertaining to the combined experiments, produces very good probability matching. It is not perfect in this case. Imperfection is to be expected; when there are two experiments there is no longer a location parameter situation, so Bayesian inference cannot achieve exact probability matching.[7]
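The matching test for Example 1 can be sketched numerically. This is a scaled-down Monte Carlo version (far fewer draws than the 20,000 used above, and a single summary statistic, the mean posterior CDF value at the true parameter, in place of the full matching curve); all names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sA, sB = 3.0, 0.75, 1.5
grid = np.linspace(0.0, 6.0, 2001)          # parameter grid for the posterior

def mean_cdf_at_truth(prior, n_draws=1000):
    """Average posterior CDF value at the true parameter over repeated
    simulated experiment pairs. Under perfect probability matching this
    CDF value is uniform on (0,1), so its mean should be close to 0.5."""
    xA = rng.normal(theta_true, sA, size=(n_draws, 1))     # measures theta
    xB = rng.normal(theta_true**3, sB, size=(n_draws, 1))  # measures theta**3
    log_like = -0.5 * ((xA - grid) / sA)**2 - 0.5 * ((xB - grid**3) / sB)**2
    post = np.exp(log_like - log_like.max(axis=1, keepdims=True)) * prior
    post /= post.sum(axis=1, keepdims=True)                # normalise per draw
    return post[:, grid <= theta_true].sum(axis=1).mean()

prior_A  = np.ones_like(grid)                          # Jeffreys' prior, expt A
prior_B  = grid**2                                     # Jeffreys' prior, expt B
prior_AB = np.sqrt(1 / sA**2 + (3 * grid**2 / sB)**2)  # combined, in quadrature

# mean_cdf_at_truth(prior_AB) lands near 0.5, consistent with the good
# matching reported for the combined-experiments prior.
```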

**Numerical testing: Example 2**

The second example involves Bernoulli trials. The experiment involves repeatedly making independent random draws with two possible outcomes, “success” and “failure”, the probability of success being the same for every draw. Experiment A involves a fixed number of draws *n*, with the number of “failures” *z* being counted, giving a binomial distribution. In Experiment B draws are continued until a fixed number *r* of failures occur, with the number of observations *y* being counted, giving a negative binomial distribution.

In this example, the parameter (here the probability of failure, *θ*) is continuous but the data are discrete. The Jeffreys’ priors for these two experiments differ by a factor of sqrt(*θ*). This means that Objective Bayesian inference from them will differ even in the case where *z* = *r* and *y* = *n*, for which the likelihood functions for experiments A and B are identical. This is unacceptable under orthodox, Subjective, Bayesian theory. That is because it violates the so-called likelihood principle:[8] inference would in this case depend on the stopping rule as well as on the likelihood function for the observed data, which is impermissible.
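The two Jeffreys’ priors follow directly from the Fisher information of each sampling scheme, and their sqrt(*θ*) ratio can be verified numerically (using the *n* = 40, *r* = 2 values from the simulations below):

```python
import numpy as np

theta = np.linspace(0.01, 0.11, 101)   # probability of failure
n, r = 40, 2                           # fixed draws (A); fixed failures (B)

# Experiment A, binomial (n draws, count failures z):
#   Fisher information I_A = n / (theta * (1 - theta))
prior_binom = np.sqrt(n / (theta * (1 - theta)))

# Experiment B, negative binomial (draw until r failures, count draws y):
#   Fisher information I_B = r / (theta**2 * (1 - theta))
prior_negbin = np.sqrt(r / (theta**2 * (1 - theta)))

ratio = prior_binom / prior_negbin     # equals sqrt(n/r) * sqrt(theta)
```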

Figure 2 shows probability matching tested by random draws of experimental results, as in Figure 1. Black dotted lines show perfect probability matching (frequentist coverage). In order to emphasize the difference between the two Jeffreys’ priors, I’ve set *n* at 40, *r* at 2, and selected at random 100 probabilities of failure, uniformly in the range 1–11%, repeating the experiments 2000 times at each value.[9]

The left hand panel of Fig. 2 shows inference based on data from each separate experiment. Green dashed and red dashed-dotted lines are based on using the Jeffreys’ prior appropriate to the experiment involved, being respectively binomial and negative binomial cases. Blue and magenta short-dashed lines are for the same experiments but with the priors swapped. When the correct Jeffreys’ prior is used, probability matching is very good: minor fluctuations in the binomial case arise from the discrete nature of the data. When the priors are swapped, matching is poor.

The right hand panel of Fig. 2 shows inference derived from the experiments in combination. The product of their likelihood functions is multiplied by each of three candidate priors. It is multiplied alternatively by the Jeffreys’ prior for experiment A, corresponding to standard Bayesian updating of the experiment A results (green dashed line); by the Jeffreys’ prior for experiment B, corresponding to standard Bayesian updating of the experiment B results (dash-dotted blue line); or by the Jeffreys’ prior for the combined experiments, corresponding to my modified form of updating (red short-dashed line). As in Example 1, use of the combined experiments Jeffreys’ prior provides very good, but not perfect, probability matching, whilst standard Bayesian updating, of the posterior PDF from either experiment generated using the prior that is noninformative for it, produces considerably less accurate probability matching.

Fig 2: Probability matching for Bernoulli experiments: A – binomial; B – negative binomial.

**Conclusions**

I have shown that, in general, standard Bayesian updating is not a valid procedure in an objective Bayesian framework, where inference concerns a fixed but unknown parameter and probability matching is considered important (which it normally is). The problem is not with Bayes’ theorem itself, but with its applicability in this situation. The solution that I propose is simple, provided that the necessary likelihood and Jeffreys’ prior/ Fisher information are available in respect of the existing parameter estimate as well as for the new experiment, and that the independence requirement is satisfied. Lack of (conditional) independence between the different data can of course be a problem; prewhitening may offer a solution in some cases (see, e.g., Lewis, 2013b).[10]

I have presented my argument about the unsuitability of standard Bayesian updating for objective inference about continuously valued fixed parameters in the context of all the information coming from observational data, with knowledge of the data uncertainty characteristics. But the same point applies to using Bayes’ theorem with any informative prior, whatever the nature of the information that it represents.

One obvious climate science application of the techniques for combining probabilistic estimates of a parameter set out in this article is the estimation of equilibrium climate sensitivity (ECS). Instrumental data from the industrial period and paleoclimate proxy data should be fairly independent, although there may be some commonality in estimating, for example, radiative forcings. Paleoclimate data, although generally more uncertain, is relatively more informative than industrial period data about high ECS values. The different PDF shapes of the two types of estimate imply that different noninformative priors apply, so using my proposed approach to combining evidence, rather than standard Bayesian updating, is appropriate if objective Bayesian inference is intended. I have done quite a bit of work on this issue and I am hoping to get a paper published that deals with combining instrumental and paleoclimate ECS estimates.


[1] See, e.g., Chapter 5.4 of J M Bernardo and A F M Smith, 1994, Bayesian Theory. Wiley. An updated version of Ch. 5.4 is contained in section 3 of Bayesian Reference Analysis, available at http://www.uv.es/~bernardo/Monograph.pdf

[2] Kolmogorov, A N, 1933, Foundations of the Theory of Probability. Second English edition, Chelsea Publishing Company, 1956, 84 pp.

[3] Kass, R. E. and L. Wasserman, 1996: The Selection of Prior Distributions by Formal Rules. J. Amer. Stat. Ass., 91, 435, 1343-1370. Note that no such conflict normally arises where the parameter takes only discrete values, since a uniform prior (weighting all parameter values equally) is then noninformative in all cases.

[4] Lewis, N, 2013a: Modification of Bayesian Updating where Continuous Parameters have Differing Relationships with New and Existing Data. arXiv:1308.2791 [stat.ME] http://arxiv.org/ftp/arxiv/papers/1308/1308.2791.pdf.

[5] Barnard, GA, 1994: Pivotal inference illustrated on the Darwin Maize Data. In Aspects of Uncertainty, Freeman, PR and AFM Smith (eds), Wiley, 392pp

[6] In the multiple parameter case, it is necessary to report the Fisher information (a matrix in this case) for the parameter vector, since it is that which gets additively updated, as well as the joint likelihood function for all parameters. Where all parameters involved are of interest, Jeffreys’ prior – here the square root of the determinant of the Fisher information matrix – is normally still the appropriate noninformative prior (it is the reference prior). Where interest lies in only one or a subset of the parameters, the reference prior may differ from Jeffreys’ prior, in which case the reference prior is to be preferred. But even where the best noninformative prior is not Jeffreys’ prior, the starting point for computing it is usually the Fisher information matrix.

[7] This is related to a fundamental difference between frequentist and Bayesian methodology: frequentist methodology involves integrating over the sample space; Bayesian methodology involves integrating over the parameter space.

[8] Berger, J. O. and R. L. Wolpert, 1984: *The Likelihood Principle (2nd Ed.)*. Lecture Notes Monograph, Vol 6, Institute of Mathematical Statistics, 206pp

[9] Using many different probabilities of failure helps iron out steps resulting from the discrete nature of the data.

[10] Lewis, N, 2013b: *An objective Bayesian, improved approach for applying optimal fingerprint techniques to estimate climate sensitivity.* Jnl Climate, 26, 7414–7429.


I reported in a previous post, here, a number of serious problems that I had identified in Marvel et al. (2015): *Implications for climate sensitivity from the response to individual forcings*. This Nature Climate Change paper concluded, based purely on simulations by the GISS-E2-R climate model, that estimates of the transient climate response (TCR) and equilibrium climate sensitivity (ECS) based on observations over the historical period (~1850 to recent times) were biased low.

I followed up my first article with an **update** that concentrated on land use change (LU) forcing. Inter alia, I presented regression results that strongly suggested the Historical simulation forcing (iRF) time series used in Marvel et al. omitted LU forcing. Gavin Schmidt of GISS responded on RealClimate, writing:

“Lewis in subsequent comments has claimed without evidence that land use was not properly included in our historical runs…. These are simply post hoc justifications for not wanting to accept the results.”

In fact, not only had I presented strong evidence that the Historical iRF values omitted LU forcing, but I had concluded:

“I really don’t know what the explanation is for the apparently missing Land use forcing. Hopefully GISS, who alone have all the necessary information, may be able to provide enlightenment.”

When I responded to the RealClimate article, here, I inter alia presented further evidence that LU forcing hadn’t been included in the computed value of the total forcing applied in the Historical simulation: there was virtually no trace of LU forcing in the spatial pattern for Historical forcing. I wasn’t suggesting that LU forcing had been omitted from the forcings applied during the Historical simulations, but rather that it had not been included when measuring them.

Yesterday, a climate scientist friend drew my attention to a correction notice published by Nature Climate Change, reading as follows:

“**Corrected online 10 March 2016**

In the version of this Letter originally published online, there was an error in the definition of F2×CO2 in equation (2). The historical instantaneous radiative forcing time series was also updated to reflect land use change, which was inadvertently excluded from the forcing originally calculated from ref. 22. This has resulted in minor changes to data in Figs 1 and 2, as well as in the corresponding main text and Supplementary Information. In addition, the end of the paragraph beginning ‘Scaling ΔF for each of the single-forcing runs…’ should have read ‘…the CO2-only runs’ (not ‘GHG-only runs’). The conclusions of the Letter are not affected by these changes. All errors have been corrected in all versions of the Letter. The authors thank Nic Lewis for his careful reading of the original manuscript that resulted in the identification of these errors.”

So, as well as the previously flagged acceptance that the F2×CO2 value of 4.1 W/m^{2} was wrong in the original paper, the authors now accept that LU forcing had indeed been omitted from the Historical forcing values. It had been omitted from the forcing originally calculated in ref. 22 (Miller et al. 2014); Figures 2 and 4 of that paper likewise omit LU forcing.

It is decent of the authors to acknowledge me. However, I am mystified by their claim that “The conclusions of the Letter are not affected by these changes.” Their revised primary (iRF) estimate of historical transient efficacy is, per Table S1, 1.0 (0.995 at the centre of the symmetrical uncertainty range). This means the global mean surface temperature (GMST) response to the aggregate forcing applied during the historical period (actually, 1906–2005) was identical, in the model, to the response to the same forcing from CO2 only. That implies its TCR can be accurately estimated by comparing the GMST response and forcing over that period. But in their conclusions they contradict this, stating that:

“GISS ModelE2 is more sensitive to CO2 alone than it is to the sum of the forcings that were important over the past century.”

and they go on to claim that:

“Climate sensitivities estimated from recent observations will therefore be biased low in comparison with CO2-only simulations owing to an accident of history: when the efficacies of the forcings in the recent historical record are properly taken into account, estimates of TCR and ECS must be revised upwards.”

I would not have left these claims unaltered if it had been my paper – not that I would ever have made the second claim in any event, since it assumes the real world behaves in the same manner as GISS-E2-R.

I also find the way that the error in the F2×CO2 value has been dealt with in the corrected paper to be unsatisfactory. The previous, incorrect, value of 4.1 W/m^{2} has simply been deleted, without the correct values (4.5 W/m^{2} for iRF and 4.35 W/m^{2} for ERF) being given, either there or elsewhere in the paper or in the Supplementary Information. All the efficacy, TCR and ECS estimates given in the paper scale with the relevant F2×CO2 value, so it is important. In my view it would be appropriate to provide the F2×CO2 values somewhere in the paper or the SI.
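Since all the efficacy, TCR and ECS estimates scale with the relevant F2×CO2 value, the effect of using a wrong value is easy to quantify. A minimal sketch, with illustrative numbers that are my assumptions rather than values from the paper:

```python
# Sketch of the F2xCO2 dependence (illustrative numbers, not the paper's data):
# an energy-budget style TCR estimate scales linearly with the assumed forcing
# for a doubling of CO2.
def tcr_estimate(f2xco2, delta_t, delta_f):
    """TCR = F2xCO2 * (GMST change) / (forcing change)."""
    return f2xco2 * delta_t / delta_f

delta_t, delta_f = 0.8, 2.0    # assumed GMST change (K) and forcing (W/m2)

tcr_wrong = tcr_estimate(4.1, delta_t, delta_f)   # with the incorrect F2xCO2
tcr_irf = tcr_estimate(4.5, delta_t, delta_f)     # with the corrected iRF value
# the estimate changes by exactly the ratio 4.5/4.1, i.e. ~10%
```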

Finally, I note that none of the other serious problems with the paper that I identified have been corrected. These include:

- the use of values for aerosol and ozone forcings, which are sensitive to climate state, calculated in an unrealistic climate state;
- the use of ocean heat uptake – which amounts to only ~86% of total heat uptake – as a measure of total heat uptake, despite the observational studies Marvel et al. critique using estimates that included non-ocean heat uptake;
- downwards bias in the equilibrium efficacy estimates, caused by not comparing the GMST response to the forcing concerned with the response to CO_{2} forcing derived in the same way;
- various results whose values disagree significantly with the data used and/or the stated bases of calculation. For instance, all the uncertainty ranges appear to be out by a factor of two or more. (I can’t make sense of Gavin Schmidt’s justification for the way they have been calculated.)


Steve M asked about the changes in the corrected paper/SI from the original. Apart from those mentioned in the 10 March 2016 correction notice, I have identified the following other changes:

1) The caption to Figure 1 of the paper has been changed by inserting “(defined with respect to 1850-1859)” after “**a** Non-overlapping ensemble average decadal mean changes in temperature and instantaneous radiative forcing for GISS-E2-R single-forcing ensembles (filled circles)” in the second line. As worded, this definition applies to temperature as well as forcing, since the filled circles relate to both variables. But I think it is intended to apply only to the forcing change, not to the temperature change. This was triggered, I imagine, by the following exchange at RealClimate on 11 January 2016:

“Nic Lewis says:

Chris Colose, you asked “why do the volcanic-only forcings (red dots) hover around a positive value in the first graph?”. The explanation I give in my technical analysis of Marvel et al. at Climate Audit is that in Figure 1 the iRF for volcanoes appears to have been shifted by ~+0.29 W/m2 from its data values, available at http://data.giss.nasa.gov/modelforce/Fi_Miller_et_al14.txt. Why not check it out and see whether my analysis is confused, as has been suggested here?

[**Response:** You are confused because you are using a single year baseline, when the data are being processed in decadal means. Thus the 19th C baseline is 1850-1859, not 1850. We could have been clearer in the paper that this was the case, but the jumping to conclusions you are doing does not seem justified. – gavin]”

There was no mention anywhere in the original paper or SI of the 1850-1859 mean being used as a baseline for any variable, and it appears to have been used as the baseline solely for iRF forcing.

2) The words “for the iRF case” have also been added two lines further down, at the end of the caption for Figure 1.a.

3) The second sentence of the paragraph starting “Assuming that all forcings have the same transient efficacy as greenhouse gases” now reads: “Scaling each forcing by our estimates of transient efficacy determined from iRF we obtain a best estimate for TCR of 1.7 C (Fig. 2a) (1.6 C if efficacies are determined from ERF).” The words in brackets are new, and previously the figure before “(Fig. 2a)” was 1.7 C.

4) The ECS figure in the third sentence of the paragraph starting “We apply the same reasoning to estimates of ECS.” has been changed from 2.9 C to 2.6 C.

5) Various values in graphs have been changed.

6) All the efficacy values in Table S1 in the Supplementary Information have been changed. I show below the update history for this table, along with my provisional calculations of what all the values should be.


My calculations are based on the method I have deduced Marvel et al. used to produce the iRF efficacy estimates, not on the basis stated by Marvel et al., which is significantly different in two respects. Marvel et al. state:

“In the iRF case, where annual forcing time series are available, TCR and ECS are calculated by regressing ensemble-average decadal mean forcing or forcing minus ocean heat content change rate against ensemble-average temperature change.”

and say that transient and equilibrium efficacies are defined as the ratio of the calculated TCR or ECS to published GISS-E2-R TCR and ECS values. But I can only match the Table S1 mean iRF efficacy values by regressing the other way around, that is, regressing temperature change against decadal mean forcing or forcing minus ocean heat content change rate. Moreover, to match the ECS efficacy values (for which it makes a difference), I have to regress on a run-by-run basis and then average the regression slopes, rather than regressing on ensemble-average values as was stated to have been done.
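The effect of regression direction can be seen with synthetic data (illustrative only, not the Marvel et al. ensembles): with scatter present in both variables, the slope from regressing temperature on forcing is not the reciprocal of the slope from regressing forcing on temperature; the two implied sensitivities differ by a factor of r².

```python
import numpy as np

# Synthetic illustration (assumed data, not the GISS ensembles) of why
# regression direction matters when both variables are noisy.
rng = np.random.default_rng(0)
forcing = np.linspace(0.0, 3.0, 40) + rng.normal(0.0, 0.3, 40)
temp = 0.5 * forcing + rng.normal(0.0, 0.2, 40)   # assumed underlying slope 0.5

slope_t_on_f = np.polyfit(forcing, temp, 1)[0]    # regress T against F
slope_f_on_t = np.polyfit(temp, forcing, 1)[0]    # regress F against T

# By OLS algebra the two slopes multiply to r**2 (< 1), so the implied
# sensitivity from the first regression is smaller than from the second
r_squared = np.corrcoef(forcing, temp)[0, 1] ** 2
```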

For all iRF forcings other than LU, my mean estimates agree with those in the 10 March 2016 version of Table S1, within rounding errors. I am currently unsure why the iRF LU efficacies differ between the version of Table S1 dated 18 January 2016, that was available at the GISS website until recently – which matched my estimate as to the mean – and the 10 March 2016 version. Nor have I yet worked out why most of the ERF mean estimates and uncertainty ranges differ between my calculations and the 10 March 2016 version of Table S1.

As already noted, my iRF efficacy uncertainty ranges are much narrower than Marvel et al.’s. Their ratio appears to be [double] the square root of [one-fifth] the number of simulation runs from which the range was derived. That is [, by a factor of 1.118 (1.225 for historical forcing), not] consistent with Gavin Schmidt’s comment at RealClimate:

“The uncertainties in the Table S1 are the 90% spread in the ensemble, not the standard error of the mean.”

Since the values given are stated to be “mean and 95% confidence intervals”, I cannot see any justification for the efficacy uncertainty ranges actually being 95% confidence intervals for a single run, centered on the mean efficacy calculated over all runs.

Note: wording in square brackets in the paragraph beginning ‘As already noted’ inserted 14 March 2016 AM; I wrote the Update late at night and overlooked that Gavin Schmidt’s explanation would only account for the discrepancy if the divisor for the standard error of the mean of n runs were sqrt(n-1), rather than the correct value of sqrt(n).
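The distinction at issue can be illustrated with synthetic numbers (assumed, not GISS data): for n independent runs, the run-to-run spread is wider than the uncertainty of the ensemble mean by a factor of sqrt(n), so an interval sized from the ensemble spread is roughly sqrt(n) times too wide if a confidence interval for the mean efficacy is what is wanted.

```python
import numpy as np

# Simulate many hypothetical 5-run ensembles with known run-to-run scatter
rng = np.random.default_rng(1)
n_runs = 5                                   # assumed runs per ensemble
sims = rng.normal(1.0, 0.1, size=(200_000, n_runs))

# Typical single-run standard deviation (from unbiased per-ensemble variances)
run_sd = np.sqrt(sims.var(ddof=1, axis=1).mean())
# Standard deviation of the ensemble mean across the simulated ensembles
mean_sd = sims.mean(axis=1).std()

ratio = run_sd / mean_sd                     # ~ sqrt(n_runs) = 2.236...
```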


In April 2014 I published a guest article about statistical methods applicable to radiocarbon dating, which criticised existing Bayesian approaches to the problem. A standard – subjective Bayesian – method of inference about the true calendar age of a single artefact from a radiocarbon date determination (measurement) involved using a uniform-in-calendar-age prior. I argued that this did not, as claimed, equate to not including anything but the radiocarbon dating information, and was not a scientifically sound method for inference about isolated examples of artefacts.[1]

My article attracted many comments, not all agreeing with my arguments. This article follows up and expands on points in my original article, and discusses objections raised.

First, a brief recap. Radiocarbon dating involves determining the radiocarbon age of (a sample from) an artefact and then converting that determination to an estimate of the true calendar age *t*, using a highly nonlinear calibration curve. It is this nonlinearity that causes the difficulties I focussed on. Both the radiocarbon determination and the calibration curve are uncertain, but errors in them are random and in practice can be combined. A calibration program is used to derive estimated calendar age probability density functions (PDFs) and uncertainty ranges from a radiocarbon determination.

The standard calibration program OxCal that I concentrated on uses a subjective Bayesian method with a prior that is uniform over the entire calibration period, where a single artefact is involved. Calendar age uncertainty ranges for an artefact whose radiocarbon age is determined (subject to measurement error) can be derived from the resulting posterior PDFs. They can be constructed either from one-sided credible intervals (finding the values at which the cumulative distribution function (CDF) – the integral of the PDF – reaches the two uncertainty bound probabilities), or from highest probability density (HPD) regions containing the total probability in the uncertainty range.
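The two interval constructions just described can be sketched for a discretised posterior on a calendar-age grid (the bimodal shape below is illustrative, not OxCal output):

```python
import numpy as np

# Illustrative bimodal posterior on a calendar-age grid (assumed shape)
ages = np.linspace(0.0, 1000.0, 10001)
da = ages[1] - ages[0]
pdf = (np.exp(-0.5 * ((ages - 300.0) / 40.0) ** 2)
       + 0.6 * np.exp(-0.5 * ((ages - 700.0) / 60.0) ** 2))
pdf /= pdf.sum() * da                        # normalise to unit probability

# One-sided credible bounds: invert the posterior CDF at each probability
cdf = np.cumsum(pdf) * da
lo, hi = np.interp([0.05, 0.95], cdf, ages)  # a 5-95% two-sided range

# 90% HPD region: accumulate the highest-density cells until 90% is covered;
# the region may consist of more than one disjoint interval
order = np.argsort(pdf)[::-1]
n_cells = np.searchsorted(np.cumsum(pdf[order]) * da, 0.90) + 1
hpd_length = n_cells * da                    # total length of the HPD region
```

By construction the HPD region is the shortest set containing 90% of the probability, so its total length is at most that of the 5–95% interval, which also contains 90%.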

In the subjective Bayesian paradigm, probability represents a purely personal degree of belief. That belief should reflect existing knowledge, updated by new observational data. However, even if that body of knowledge is common to two people, their probability evaluations are not required to agree,[2] and neither evaluation need properly reflect the knowledge on which it is based. I do not regard this as a satisfactory paradigm for scientific inference.

I advocated taking instead an objective Bayesian approach, based on using a computed “noninformative prior” rather than a uniform prior. I used as my criterion for judging the two methods how well they performed upon repeated use, hypothetical or real, in relation to single artefacts. In other words, when estimating the value of a fixed but unknown parameter and giving uncertainty ranges for its value, how accurately would the actual proportions of cases in which the true value lies within each given range correspond to the indicated proportion of cases? That is to say, how good is the “probability matching” (frequentist coverage) of the method. I also examined use of the non-Bayesian signed root log-likelihood ratio (SRLR) method, judging it by the same criterion.

A quick recap on Bayesian parameter inference in the continuously-valued case, which is my exclusive concern here. In the context we have here, with a datum – the measured C14 age (radiocarbon determination) *d*_{C14} – and a parameter *t*, Bayes’ theorem states:

*p*(*t*|*d*_{C14}) = *p*(*d*_{C14}|*t*) *p*(*t*) / *p*(*d*_{C14})     (1)

where *p*(*x*|*y*) denotes the conditional probability density for variable *x* given the value of variable *y*. Since *p*(*d*_{C14}) is not a function of *t*, it can be (and usually is) replaced by a normalisation factor set so that the posterior PDF *p*(*t*|*d*_{C14}) integrates to unit probability. The probability density of the datum taken at the measured value, expressed as a function of the parameter (the true calendar age *t*), *p*(*d*_{C14}|*t*), is called the likelihood function. The construction and interpretation of the prior distribution or prior, *p*(*t*), is the critical practical difference between subjective and objective Bayesian approaches. In a subjective approach, the prior represents as a probability distribution the investigator’s existing degree of belief regarding varying putative values for the parameter being estimated. There is no formal requirement for the choice of prior to be evidence-based, although in scientific inference one might hope that it often would be. In an objective approach, the prior is instead normally selected to be noninformative, in the sense of letting inference for the parameter(s) of interest be determined, to the maximum extent possible, solely by the data.

A noninformative prior primarily reflects (at least in straightforward cases) how informative, at differing values of the parameter of interest, the data are expected to be about that parameter. In the univariate parameter continuous case, Jeffreys’ prior is known to be the best noninformative prior, in the sense that, asymptotically, Bayesian posterior distributions generated using it provide closer probability matching than those resulting from any other prior.[3] Jeffreys’ prior is the square root of the (expected) Fisher information. Fisher information – the expected value of the negative second derivative of the log-likelihood function with respect to the parameters, in regular cases – is a measure of the amount of information that the data, on average, carries about the parameter values. In simple univariate cases involving a fixed, symmetrical measurement error distribution, Jeffreys’ prior will generally be proportional to the derivative of the data variable being measured with respect to the parameter.

To simplify matters, I worked with a stylised calibration curve, which conveyed key features of the nonlinear structure of the real calibration curve – alternating regions where the radiocarbon date varied rapidly and very slowly with calendar age – whilst retaining strict monotonicity and having a simple analytical derivative. Figure 1, a version of Figure 2 in the original article, shows the stylised calibration curve (black) along with the error distribution density for an example radiocarbon date determination (orange). The grey wings of the black curve represent a fixed calibration curve error, which I absorb into the C14 determination error, assumed here to have a Gaussian distribution with fixed, known, standard deviation. The solid pink line shows the Bayesian posterior probability density function (PDF) using a uniform in calendar age prior. The dotted green line shows the noninformative Jeffreys’ prior used in the objective Bayesian method, which reflects the derivative of the calibration curve. The posterior PDF using Jeffreys’ prior is shown as the solid green line. The dashed pink and green lines, added in this version, show the corresponding posterior CDF in each case.

**Fig. 1**: *Bayesian inference using uniform and objective priors with a stylised calibration curve and an example observation; combined measurement/calibration error standard deviation 60 RC years.*
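To make the setup concrete, here is a minimal numerical sketch under stated assumptions: the curve below is a hypothetical monotonic stand-in with alternating steep and flat stretches (not the article’s actual stylised curve), with an assumed Gaussian combined error of 60 RC years.

```python
import numpy as np

# Hypothetical stand-in for the stylised calibration curve: strictly
# monotonic, with alternating steep and nearly flat sections (assumed form)
period, amp, sigma = 1000.0, 150.0, 60.0   # assumed curve and error parameters
t = np.linspace(0.0, 3000.0, 30001)        # calendar-age grid
dt = t[1] - t[0]

c14 = t + amp * np.sin(2 * np.pi * t / period)            # RC age vs calendar age
dc14_dt = 1.0 + amp * (2 * np.pi / period) * np.cos(2 * np.pi * t / period)

d_obs = 1500.0                             # example RC determination (assumed)
lik = np.exp(-0.5 * ((d_obs - c14) / sigma) ** 2)         # Gaussian likelihood

post_uniform = lik / (lik.sum() * dt)      # uniform-in-calendar-age prior
post_jeffreys = lik * dc14_dt              # Jeffreys' prior ∝ |dc14/dt| here
post_jeffreys /= post_jeffreys.sum() * dt
```

As in the article’s Figure 1, the Jeffreys’-prior posterior puts much less density than the uniform-prior posterior on the flat stretch of the curve around the observation, where the data are insensitive to calendar age.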

The key conclusions of my original article were:

“The results of the testing are pretty clear. In whatever range the true calendar age of the sample lies, both the objective Bayesian method using a noninformative Jeffreys’ prior and the non-Bayesian SRLR method provide excellent probability matching – almost perfect frequentist coverage. Both variants of the subjective Bayesian method using a uniform prior are unreliable. The HPD regions that OxCal provides give less poor coverage than two-sided credible intervals derived from percentage points of the uniform prior posterior CDF, but at the expense of not giving any information as to how the missing probability is divided between the regions above and below the HPD region. For both variants of the uniform prior subjective Bayesian method, probability matching is nothing like exact except in the unrealistic case where the sample is drawn equally from the entire calibration range”

For many scientific and other users of statistical data, I think that would clinch the case in favour of using the objective Bayesian or the SRLR methods, rather than the subjective Bayesian method with a uniform prior. Primary results are generally given by way of an uncertainty range with specified probability percentages, not in the form of a PDF. Many subjective Bayesians appear unconcerned whether Bayesian credible intervals provided as uncertainty ranges even approximately constitute confidence intervals. Since under their interpretation probability merely represents a degree of belief and is particular to each individual, perhaps that is unsurprising. But, in science, users normally expect such ranges to be at least approximately valid as confidence intervals, so that, upon repeated applications of the method – not necessarily to the same parameter – in the long run the true value of the parameter being estimated would lie within the stated intervals in the claimed percentages of cases.

However, there was quite a lot of pushback against the rather peculiar shape of the objective Bayesian posterior PDF resulting from use of Jeffreys’ prior. It put near zero probability on regions where the data, although compatible with the parameter value, was insensitive to it. That is, regions where the data likelihood was significant but the radiocarbon determination varied little with calendar age, due to the flatness of the calibration curve. The pink, subjective Bayesian posterior PDF was generally thought by such critics to be more realistically-shaped. Underlying that view, critics typically thought that there was relevant prior information about the age distribution of artefacts that should be incorporated, by reflecting through use of a uniform prior a belief that an artefact was equally likely to come from any (equal-length) calendar age range. Whether or not that is so, the uniform prior had instead been chosen on the basis that it did not introduce anything but the RC dating information, and I argued against it on that basis.

I think the view that one should reject an objective Bayesian approach just on the basis that the posterior PDF it gives rise to is odd-looking is mistaken. In most cases, what is of concern when estimating a fixed but uncertain parameter, here calendar age, is how well one can reliably constrain its value within one or more uncertainty ranges. In this connection, it should be noted that although the Jeffreys’ prior will assign low PDF values in a range where likelihood is substantial but the data variable is insensitive to the parameter value, the uncertainty ranges that the resulting PDF gives rise to will normally include that range.

I think it is best to work with one-sided ranges, which unlike two-sided ranges are uniquely defined: an 80% (say) two-sided range could be 5–85% or 10–90%. A one-sided *x*% range is the range from the lowest possible value of the parameter (here, zero) to the value, *y*, at which the range contains *x*% of the posterior probability. An *x*_{1}–*x*_{2}% range or interval for the parameter is then the interval from *y*_{1} to *y*_{2}, where *y*_{1} and *y*_{2} are the (tops of the) one-sided *x*_{1}% and *x*_{2}% ranges. An *x*% one-sided credible interval derived from a posterior CDF relates to Bayesian posterior probability. By contrast, a (frequentist) *x*% one-sided confidence interval bounded above by *y* is calculated so that, upon indefinitely repeated random sampling from the data uncertainty distribution(s) involved, the true parameter value will lie below the resulting *y* in *x*% of cases (i.e., with a probability of *x*%).

By definition, an accurate confidence interval bound exhibits exact probability matching. If one-sided Bayesian credible intervals derived using a particular prior pass the matching test for all values of *x* then they and the prior used are said to be probability matching. From a scientific viewpoint, probability matching is highly desirable. It implies – if the statistical analyses have been performed (and error distributions assessed) correctly – that, for example, only in one study out of twenty will the true parameter value lie above the reported 95% uncertainty bound. In general, Bayesian posteriors can only, at best, be approximately probability matching. But the simple situation presented here falls within the “location parameter” exception to the no-exact-probability-matching rule, and use of Jeffreys’ prior should in principle lead to Bayesian one-sided credible intervals exhibiting exact probability matching.

The two posterior PDFs in Figure 1 imply very different calendar age uncertainty ranges. However, as I showed in the original article, if a large number of true calendar ages are sampled in accordance with the assumed uniform distribution over the entire calibration curve (save, to minimise end-effects from likelihood lying outside the calibration range, 100 years at each end), and a radiocarbon date determined for each of them in accordance with the calibration curve and error distribution, then both methods provide excellent probability matching. This is to be expected, but for different reasons in the subjective and objective Bayesian cases.

Figure 2, a reproduction of Figure 3 in the original article, shows the near perfect probability matching in this case for both methods. At all one-sided credible/confidence interval sizes (upper level in percent), the true calendar age lay below the calculated limit in almost that percentage of cases. As well as results for one-sided intervals using the subjective Bayesian method with a uniform prior and the objective Bayesian method with Jeffreys’ prior, Figure 2 shows comparative results on two other bases. The blue line is for two-sided HPD regions, using the subjective Bayesian method with a uniform prior. An *x*% HPD region is the shortest interval containing *x*% of the posterior probability and in general does not contain an equal amount of probability above and below the 50% point (median). The dashed green line uses the non-Bayesian signed root log-likelihood ratio (SRLR) method, which produces confidence intervals that are in general only approximate but are exact when a normal distribution or a transformation thereof is involved, as here.

**Fig. 2**: *Probability matching: repeated draws of parameter values; one C14 measurement for each*

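A Figure 2-style coverage test can be sketched as follows, using an assumed monotonic stand-in curve (not the article’s stylised curve): true ages are drawn uniformly, one noisy determination is made for each, and we check how often the true age falls below the uniform-prior posterior’s one-sided 90% bound.

```python
import numpy as np

rng = np.random.default_rng(2)
period, amp, sigma = 1000.0, 150.0, 60.0   # assumed curve and error parameters
grid = np.linspace(0.0, 3000.0, 6001)
curve = grid + amp * np.sin(2 * np.pi * grid / period)    # stand-in curve

n_trials, hits = 2000, 0
for true_t in rng.uniform(100.0, 2900.0, n_trials):       # trim 100 years each end
    # one noisy radiocarbon determination for this true age
    d = true_t + amp * np.sin(2 * np.pi * true_t / period) + rng.normal(0.0, sigma)
    lik = np.exp(-0.5 * ((d - curve) / sigma) ** 2)
    cdf = np.cumsum(lik) / lik.sum()                      # uniform-prior posterior CDF
    hits += true_t <= np.interp(0.90, cdf, grid)          # one-sided 90% bound
coverage = hits / n_trials                                # close to 0.90 here
```

As the article explains, matching is close to exact in this gambling-style test because the true ages really are drawn from the uniform distribution that the prior assumes.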
Figure 2 supports use of a subjective Bayesian method in a scenario where there is an exactly known prior probability distribution for the unknown variable of interest (here, calendar age), many draws are made at random from it and only probability matching averaged across all draws is of concern. The excellent probability matching for the subjective Bayesian method shows that stated uncertainty bounds using it are valid when aggregated over samples of parameter values drawn randomly, pro rata to their known probability of occurring. In such a case, Bayes’ theorem follows from the theory of conditional probability, which is part of probability theory as rigorously developed by Kolmogorov.[4]

However, the problem of scientific parameter inference is normally different. The value of a parameter is not drawn from a known probability distribution, and does not vary randomly from one occasion to another. It has a fixed, albeit unknown, value, which is the target of inference. The situation is unlike the scenario where the distribution from which parameter values are drawn is fixed but the parameter value varies between draws, and, often, no one drawn parameter value is of particular significance (a “gambling scenario”). Uncertainty bounds that are sound only when parameter values are drawn from a known probability distribution, and apply only upon averaging across such draws, are not normally appropriate for scientific parameter inference. What is wanted are uncertainty bounds that are valid on average when the method is applied in cases where there is only one, fixed, parameter value, and it has not been drawn at random from a known probability distribution.

In Figure 2, values for the unknown variable are drawn many times at random from a specified distribution and in each instance a single subject-to-error observation is made. The testing method corresponds to a gambling scenario: it is only the average probability matching over all the parameter values drawn that matters. Contrariwise, Figure 3 corresponds to the parameter inference case, where probability matching at each possible parameter value separately matters. I selected ten possible parameter values (calendar ages), to illustrate how reliable inference would typically be when only a single, fixed, parameter value is concerned. I actually drew the parameter values at random, uniformly, from the same age range as in Figure 2, but I could equally well have chosen them in some other way. For each of these ten values, I drew 5,000 radiocarbon date determinations randomly from the measurement error distribution, rather than a single draw as used for Figure 2, and computed one-sided 1% to 100% uncertainty intervals. Each of the lines corresponds to one of the ten selected parameter values.

As can be seen, the objective Bayesian and the SRLR methods again provide almost perfect probability matching in the case of each of the ten parameter values separately (the differences from exact matching are largely due to end effects). But probability matching for the subjective Bayesian method using a uniform prior was poor – awful over much of the probability range in most of the ten cases. Whilst HPD region coverage appears slightly less bad than for one-sided intervals, only the total probability included in the HPD region is being tested in this case: the probabilities of the parameter value lying above the HPD region and lying below it may be very different from each other. Therefore, HPD regions are useless if one wants the risks of over- and under-estimating the parameter value to be similar, or if one is primarily concerned about one or other of those risks.

**Fig. 3**: *Probability matching for each of 10 artefacts with ages drawn
from a uniform distribution: 5000 C14 measurements for each*

I submit that Figure 3 strongly supports my thesis that a uniform prior over the calibration range should not be used here for scientific inference about the true calendar age of a single artefact, notwithstanding that it gives rise to a much more intuitively-appealing posterior PDF than use of Jeffreys’ prior. It may appear much more reasonable to assume such a uniform prior distribution than the highly non-uniform Jeffreys’ prior. However, the wide uniform distribution does not convey any genuine information. Rather, it just causes the posterior PDF, whilst looking more plausible, to generate misleading inference about the parameter.

What if an investigator is only concerned with the accuracy of uncertainty bounds averaged across all draws, and the artefacts are known to be exactly uniformly distributed? In that case, is there any advantage in using the subjective Bayesian method rather than the objective one, apart from the posterior PDFs looking more acceptable? Although both methods deliver essentially perfect probability matching, one might generate uncertainty intervals that are, on average, smaller than those from the other method, in which case it would arguably be superior. Re-running the Figure 2 case did indeed reveal a difference. The average 5–95% uncertainty interval was 736 years using the objective Bayesian method. It was rather longer, at 778 years, using the subjective Bayesian method.
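The interval-width comparison can likewise be sketched in a toy setup. Everything here is assumed for illustration: the curve, its derivative (which for a monotone curve gives a Jeffreys-type prior proportional to |c′(θ)|) and the error SD. The 736- and 778-year figures above will therefore not be reproduced; the sketch only shows how such average widths can be compared.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibration curve and its derivative (assumptions); for a
# monotone curve the Jeffreys prior is proportional to |c'(theta)|.
def calib(theta):
    return theta + 30.0 * np.sin(theta / 150.0)

def dcalib(theta):
    return 1.0 + 0.2 * np.cos(theta / 150.0)

sigma = 25.0
grid = np.linspace(0.0, 2000.0, 4001)
dx = grid[1] - grid[0]

def width_5_95(post):
    """Length of the 5%-95% equal-tailed credible interval."""
    cdf = np.cumsum(post) * dx
    return np.interp(0.95, cdf, grid) - np.interp(0.05, cdf, grid)

widths = {"uniform": [], "jeffreys": []}
for _ in range(2000):
    theta = rng.uniform(100.0, 1900.0)        # Figure-2-style random draw
    y = calib(theta) + rng.normal(0.0, sigma)
    like = np.exp(-0.5 * ((y - calib(grid)) / sigma) ** 2)
    for name, prior in (("uniform", np.ones_like(grid)),
                        ("jeffreys", dcalib(grid))):
        post = like * prior          # unnormalised posterior for this prior
        post /= post.sum() * dx      # normalise to a density on the grid
        widths[name].append(width_5_95(post))

for name, w in widths.items():
    print(name, "mean 5-95% width:", round(float(np.mean(w)), 1))
```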

So in this case the objective Bayesian method not only provides accurate probability matching for each possible parameter value, but on average it generates narrower uncertainty ranges even in the one case where the subjective method also provides probability matching. Even the less useful HPD 90% regions, which say nothing about the relative chances of the parameter value lying below or above the region, were on average only slightly narrower, at 724 years, than the objective Bayesian 5–95% regions. The failure of the subjective Bayesian method to provide narrower 5–95% uncertainty ranges reflects the fact that knowing the parameter is drawn randomly from a wide uniform distribution provides negligible useful information about its value.

My tentative interpretation of the difference between inference using a uniform prior in Figures 2 and 3 is this. In the Figure 2 case, the calendar age is a random variable, in the Kolmogorov sense (a KRG). Each trial involves drawing an artefact, and for it an associated RC determination, at random. There is a joint probability function involved and a known unconditional probability distribution for calendar age. The conditional probability density of calendar age for an artefact given its RC determination is then well defined, and given by Bayes’ theorem. But in the Figure 3 case, taking any one of the ten selected calendar ages on their own, there is no joint probability distribution involved. The parameter (calendar age) is not a KRG;[5] its value is fixed, albeit unknown. A conditional probability density for the calendar age of the artefact is therefore not validly given, by Bayes’ theorem, as the product of its unconditional density and the conditional density of the RC determination, since the former does not exist.[6]
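In symbols (with calendar age θ and RC determination y; the notation is mine, not from the post), the density form of Bayes' theorem at issue is:

```latex
% Density form of Bayes' theorem for calendar age \theta and determination y.
% In the Figure 2 (gambling) case the unconditional density \pi(\theta) is the
% known distribution from which ages are drawn, so the left-hand side is a
% well-defined conditional density:
p(\theta \mid y) \;=\;
  \frac{f(y \mid \theta)\,\pi(\theta)}
       {\int f(y \mid \theta')\,\pi(\theta')\,\mathrm{d}\theta'}
% In the Figure 3 case \theta is fixed, no unconditional density \pi(\theta)
% exists, and this construction is not available without further justification.
```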

Under the subjective Bayesian interpretation, the standard use of Bayes’ theorem is always applicable for inference about a continuously-valued fixed but unknown parameter. In my view, Bayes’ theorem may validly be applied in the standard way in a gambling scenario but not, in general, in the case of scientific inference about a continuously-valued fixed parameter – although it will provide valid inference in some cases and lead only to minor error in many others. The fact that the standard subjective uses of Bayes’ theorem should never lead to internal inconsistencies in personal belief systems does not mean that they are objectively satisfactory.

If there are many realisations of the variable of interest, drawn randomly from a known probability distribution, and inferences made after obtaining observational evidence for each value drawn only need to be valid when aggregated across all the draws, then a subjective Bayesian approach produces sound inference. By sound inference, I mean here that uncertainty statements in the form of credible intervals derived from posterior PDFs exhibit accurate probability matching when applied to observational data relating to many randomly drawn parameter values. Moreover, the posterior PDFs produced have a natural probabilistic interpretation.

However, if the variable of interest is a fixed but unknown, unchanging parameter, subjective Bayesian inferences are in principle objectively unreliable, in the sense that they will not necessarily produce good probability matching for multiple draws of observational data with the parameter value fixed. Not only is aggregation across randomly drawn parameter values impossible here, but only the value of the actual, fixed, parameter is of interest. In such cases I suggest that most researchers using a method want it to produce uncertainty statements that are, averaged over multiple applications of the method in the same or comparable types of problem, accurate. That is to say, their uncertainty bounds constitute, at least approximately, confidence intervals.[7] In consequence, the posterior PDFs produced by such methods will be close to confidence densities.[8] It follows that they may not necessarily have a natural probabilistic interpretation, so the fact that their shapes may not appear to be realistic as estimated probability densities is of questionable relevance, at least for estimating parameter values.

Subjective Bayesians often question the merits of confidence intervals. They point out that “relevant subsets” may exist, which can lead to undesirable confidence intervals. But I think the existence of such pathologies is in practice rare, or easily enough overcome, where a continuous fixed parameter is being estimated. Certainly, the suggestion in a paper recommended to me by a fundamentalist subjective Bayesian statistician active in climate science, that the widely used Student’s *t*-distribution involves relevant subsets, seems totally wrong.

To summarise, I still think objective Bayesian inference is normally superior to subjective Bayesian inference for scientific parameter estimation, but the posterior PDFs it generates may not necessarily have the usual probabilistic interpretation. I plan, in a subsequent article, to discuss how to incorporate useful, evidence based, probabilistic prior information about the value of a fixed but unknown parameter using an objective Bayesian approach.

*Notes and References*

[1] I did not comment on the formulation of priors for inference about groups of artefacts (which was a major element of the standard methodology involved).

[2] De Finetti, B., 2008. Philosophical Lectures on Probability. Springer, 212 pp., p. 23.

[3] Welch, B. L., 1965. On comparisons between confidence point procedures in the case of a single parameter. J. Roy. Statist. Soc. Ser. B, 27, 1, 1-8; Hartigan, J. A., 1965. The asymptotically unbiased prior distribution. Ann. Math. Statist., 36, 4, 1137-1152.

[4] Kolmogorov, A N, 1933, Foundations of the Theory of Probability. Second English edition, Chelsea Publishing Company, 1956, 84 pp.

[5] Barnard, GA, 1994: Pivotal inference illustrated on the Darwin Maize Data. In Aspects of Uncertainty, Freeman, PR and AFM Smith (eds), Wiley, 392pp.

[6] Fraser, D A S, 2011. Is Bayes Posterior just Quick and Dirty Confidence? Statistical Science, 26, 3, 299–316.

[7] There is a good discussion of confidence and credible intervals in the context of Bayesian analysis using noninformative priors in Section 6.6 of Berger, J. O., 1980: Statistical Decision Theory and Bayesian Analysis, 671 pp.

[8] Confidence densities are discussed in Schweder T and N L Hjort, 2002: Confidence and likelihood. Scandinavian Jnl of Statistics, 29, 309-332.


**The Correct System of Equations for Climate and Weather Models**

The system of equations numerically approximated by both weather and climate models is called the hydrostatic system. Using a scale analysis for mid-latitude large-scale motions in the atmosphere (motions with a horizontal length scale of 1000 km and a time scale of a day), Charney (1948) showed that hydrostatic balance, i.e., balance between the vertical pressure gradient and the gravitational force, is satisfied to a high degree of accuracy by these motions. Because the fine balance between these terms was difficult to calculate numerically, and in order to remove the fast vertically propagating sound waves so that numerical integration could use a larger time step, he introduced the hydrostatic system, which assumes exact balance between the vertical pressure gradient and the gravitational force. This system leads to a columnar (function of altitude only) equation for the vertical velocity called Richardson’s equation.
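For reference, the hydrostatic balance in question is, in standard notation (pressure p, density ρ, gravitational acceleration g, height z):

```latex
% Hydrostatic balance: the vertical pressure-gradient force exactly
% balancing gravity; the hydrostatic system assumes this holds exactly.
\frac{\partial p}{\partial z} = -\rho g
```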

A scale analysis of the equations of atmospheric motion assumes that the motion will retain those characteristics for the period of time indicated by the choice of the time scale (Browning and Kreiss, 1986). This means that the initial data must be smooth (vary on spatial scales on the order of 1000 km) in a way that leads to time derivatives on the order of a day. To satisfy the latter constraint, the initial data must satisfy the elliptic constraints determined by ensuring that a number of time derivatives are of the order of a day. If all of these conditions are satisfied, then the solution can be ensured to evolve smoothly, i.e., on the spatial and time scales used in the scale analysis. This mathematical theory for hyperbolic systems is called “The Bounded Derivative Theory” (BDT) and was introduced by Professor Kreiss (Kreiss, 1979, 1980).

Instead of assuming exact hydrostatic balance (which leads to a number of mathematical problems discussed below), Browning and Kreiss (1986) introduced the idea of slowing down the vertically propagating waves rather than removing them completely, thus retaining the desirable mathematical property of hyperbolicity of the unmodified system. This modification was proved mathematically to accurately describe the large-scale motions of interest and, subsequently, also to describe smaller scales of motion in the mid-latitudes (Browning and Kreiss, 2002). In the latter manuscript, the correct elliptic constraints needed to ensure smoothly evolving solutions are derived. In particular, the elliptic equation for the vertical velocity is three dimensional, i.e., not columnar, and the horizontal divergence must be derived from the vertical velocity in order to ensure a smoothly evolving solution.

It is now possible to see why the hydrostatic system is not the correct reduced system (the system that correctly describes the smoothly evolving solution to a first degree of approximation). The columnar vertical velocity equation (Richardson’s equation) leads to columnar heating that is not spatially smooth. This is called rough forcing and leads to the physically unrealistic generation of large amounts of energy in the highest wave numbers of a model (Browning and Kreiss, 1994; Page, Fillion, and Zwack, 2007). This energy requires large amounts of nonphysical numerical dissipation in order to keep the model from becoming unstable, i.e., blowing up. We also mention that the boundary layer interacts very differently with a three dimensional elliptic equation for the vertical velocity than with a columnar equation (Gravel, Browning, and Kreiss).

**References:**

Browning, G. L., and H.-O. Kreiss 1986: Scaling and computation of smooth atmospheric motions. Tellus, 38A, 295–313.

——, and ——, 1994: The impact of rough forcing on systems with multiple time scales. J. Atmos. Sci., 51, 369-383

——, and ——, 2002: Multiscale bounded derivative initialization for an arbitrary domain. J. Atmos. Sci., 59, 1680-1696.

Charney, J. G., 1948: On the scale of atmospheric motions. Geofys. Publ., 17, 1–17.

Kreiss, H.-O., 1979: Problems with different time scales for ordinary differential equations. SIAM J. Num. Anal., 16, 980–998.

——, 1980: Problems with different time scales for partial differential equations. Commun. Pure Appl. Math, 33, 399–440.

Gravel, S., G. L. Browning, and H.-O. Kreiss: The relative contributions of data sources and forcing components to the large-scale forecast accuracy of an operational model. This web site.

Page, Christian, Luc Fillion, and Peter Zwack, 2007: Diagnosing summertime mesoscale vertical motion: implications for atmospheric data assimilation. Monthly Weather Review, 135, 2076-2094.
