Radiocarbon calibration and Bayesian inference

A guest post by Nic Lewis

 

1. Introduction

On 1 April 2014 the Bishop Hill blog carried a guest post ‘Dating error’ by Doug Keenan, in which he set out his allegations of research misconduct by Oxford University professor Christopher Bronk Ramsey. Professor Bronk Ramsey is an expert on calibration of radiocarbon dating and author of OxCal, apparently one of the two most widely used radiocarbon calibration programs (the other being Calib, by Stuiver and Reimer). Steve McIntyre and others opined that an allegation of misconduct was inappropriate in this sort of case, and likely to be counter-productive. I entirely agree. Nevertheless, the post prompted an interesting discussion with Professor Radford Neal, a statistical expert at the University of Toronto, and with Nullius in Verba (an anonymous but statistically-minded commentator). They took issue with Doug’s claims that the statistical methods and resulting probability density functions (PDFs) and probability ranges given by OxCal and Calib are wrong. Doug’s arguments, using a partly Bayesian approach that he calls a discrete calibration method, are set out in his 2012 peer-reviewed paper.

I also commented, saying that if one assumes a uniform prior for the true calendar date, then Doug Keenan’s results do not follow from standard Bayesian theory. Although the OxCal and Calib calibration graphs (and the Calib manual) are confusing on the point, Bronk Ramsey’s papers make clear that he does use such a uniform prior. I wrote that in my view Bronk Ramsey had followed a defensible approach (since his results flow from applying Bayes’ theorem using that prior), so there was no research misconduct involved, but that his method did not represent best scientific inference.

The final outcome was that Doug accepted what Radford and Nullius said about how the sample measurement should be interpreted as probability, with the implication that his criticism of the calibration method is invalid. However, as I had told Doug originally, I think his criticism of the OxCal and Calib calibration methods is actually valid: I just think that imperfect understanding rather than misconduct on the part of Bronk Ramsey (and of other radiocarbon calibration experts) is involved. Progress in probability and statistics has for a long time been impeded by quasi-philosophical disagreements between theoreticians as to what probability represents and the correct foundations for statistics. Use of what are, in my view, unsatisfactory methods remains common.

Fortunately, regardless of foundational disagreements I think most people (and certainly most scientists) are in practice prepared to judge the appropriateness of statistical estimation methods by how well they perform upon repeated use. In other words, when estimating the value of a fixed but unknown parameter, does the true value lie outside the specified uncertainty range in the indicated proportion of cases?

This so-called frequentist coverage or probability-matching property can be tested by drawing samples at random from the relevant uncertainty distributions. For any assumed distribution of parameter values, a method of producing 5–95% uncertainty ranges can be tested by drawing a large number of samples of possible parameter values from that distribution, and for each drawing a measurement at random according to the measurement uncertainty distribution and estimating a range for the parameter. If the true value of the parameter lies below the bottom end of the range in 5% of cases and above its top in 5% of cases, then that method can be said to exhibit perfect frequentist coverage or exact probability matching (at least at the 5% and 95% probability levels), and would be viewed as a more reliable method than a non-probability-matching one for which those percentages were (say) 3% and 10%. It is also preferable to a method for which those percentages were both 3%, which would imply the uncertainty ranges were unnecessarily wide. Note that in some cases probability-matching accuracy is unaffected by the parameter value distribution assumed.

I’ll now attempt to explain the statistical issues and to provide evidence for my views. I’ll then set up a simplified, analytically tractable, version of the problem and use it to test the probability matching performance of different methods. I’ll leave discussion of the merits of Doug’s methods to the end.

2. Statistical issues involved in radiocarbon calibration

The key point is that OxCal and Calib use a subjective Bayesian method with a wide uniform prior on the parameter being estimated, here calendar age, whilst the observational data provides information about a variable, radiocarbon or 14C age, that has a nonlinear relationship to the parameter of interest. The vast bulk of the uncertainty relates to 14C age – principally measurement and similar errors, but also calibration uncertainty. The situation is thus very similar to that for estimation of climate sensitivity. It seems to me that the OxCal and Calib methods are conceptually wrong, just as use of a uniform prior for estimating climate sensitivity is normally inappropriate.

In the case of climate sensitivity, I have been arguing for a long time that Bayesian methods are only appropriate if one takes an objective approach, using a noninformative prior, rather than a subjective approach (using, typically, a uniform or expert prior). Unfortunately, many statisticians (and all but a few climate scientists) seem not to understand, or at least not to accept, the arguments in favour of an objective Bayesian approach. Most climate sensitivity studies still use subjective Bayesian methods.

Objective Bayesian methods require a noninformative prior. That is, a prior that influences parameter estimation as little as possible: it lets the data ‘speak for themselves’[i]. Bayesian methods generally cannot achieve exact probability matching even with the most noninformative prior, but objective Bayesian methods can often achieve approximate probability matching. In simple cases a uniform prior is quite often noninformative, so that a subjective Bayesian approach that involved using a uniform prior would involve the same calculations and give the same results as an objective Bayesian approach. An example is where the parameter being estimated is linearly related to data, the uncertainties in which represent measurement errors with a fixed distribution. However, where nonlinear relationships are involved a noninformative prior for the parameter is rarely uniform. In complex cases deriving a suitable noninformative prior can be difficult, and in many cases it is impossible to find a prior that has no influence at all on parameter estimation.

Fortunately, in one-dimensional cases where uncertainty involves measurement and similar errors it is often possible to find a completely noninformative prior, with the result that exact probability matching can be achieved. In such cases, the so-called ‘Jeffreys’ prior’ is generally the correct choice, and can be calculated by applying standard formulae. In essence, Jeffreys’ prior can be thought of as a conversion factor between distances in parameter space and distances in data space. Where a data–parameter relationship is linear and the data error distribution is independent of the parameter value, that conversion factor will be fixed, leading to Jeffreys’ prior being uniform. But where a data–parameter relationship is nonlinear and/or the data precision is variable, Jeffreys’ prior achieves noninformativeness by being appropriately non-uniform.
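
To make the ‘conversion factor’ idea concrete, here is a minimal sketch of my own (not code from OxCal or Calib, written in R like the code linked at the end of this article). For a parameter θ observed through a measurement d ~ N(c(θ), σ) with known σ, the Fisher information is c′(θ)²/σ², so Jeffreys’ prior is proportional to |c′(θ)|/σ and can be evaluated numerically for any smooth curve c:

    # Sketch only: Jeffreys' prior for a parameter theta observed via d ~ N(c(theta), sigma).
    # The Fisher information is c'(theta)^2 / sigma^2, so the prior is proportional to
    # |c'(theta)| / sigma. c_fun is any smooth curve; the derivative is taken numerically.
    jeffreys_prior <- function(theta, c_fun, sigma, eps = 1e-4) {
      slope <- (c_fun(theta + eps) - c_fun(theta - eps)) / (2 * eps)
      abs(slope) / sigma
    }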

Turning to the specifics of radiocarbon dating, my understanding is as follows. The 14C age uncertainty varies with 14C age, and is lognormal rather than normal (Gaussian). However, the variation in uncertainty is sufficiently slow for the error distribution applying to any particular sample to be taken as Gaussian with a standard deviation that is constant over the width of the distribution, provided the measurement is not close to the background radiation level. It follows that, were one simply estimating the ‘true’ radiocarbon age of the sample, a uniform-in-14C-age prior would be noninformative. Use of such a prior would result in an objective Bayesian estimated posterior PDF for the true 14C age that was Gaussian in form.

However, the key point about radiocarbon dating is that the ‘calibration curve’ relationship of ‘true’ radiocarbon age t14C to the true calendar date ti of the event corresponding to the 14C determination is highly nonlinear. (I will consider only a single event, so i = 1.) It follows that to be noninformative a prior for ti must be non-uniform. Assuming that the desire is to produce uncertainty ranges beyond which – upon repeated use – the true calendar date will fall in a specified proportion of cases, the fact that in reality there may be an equal chance of ti lying in any calendar year is irrelevant.

The Bayesian statistical basis underlying the OxCal method is set out in a 2009 paper by Bronk Ramsey[ii]. I will only consider the simple case of a single event, with all information coming from a single 14C determination. Bronk Ramsey’s paper states:

The likelihood defines the probability of obtaining a measurement given a particular date for an event. If we only have a single event, we normally take the prior for the date of the event to be uniform (but unnormalized):

p(ti) ~ U(–∞, ∞) ~ constant

Defensible though it is in terms of subjective Bayesian theory, a uniform prior in ti translates into a highly non-uniform prior for the ‘true’ radiocarbon age (t14C) as inferred from the 14C determination. Applying Bayes’ theorem in the usual way, the posterior density for t14C will then be non-Gaussian.

The position is actually more complicated, in that the calibration curve itself also has uncertainty, which is also assumed to be Gaussian in form. One can think of there being a nonlinear but exact functional calibration curve relationship s14C = c(ti) between calendar year ti and a ‘standard’ 14C age s14C, but with – for each calendar year – the actual (true, not measured) 14C age t14C having a slightly indeterminate relationship with ti. So the statistical relationship (N signifying a normal or Gaussian distribution having the stated mean and standard deviation) is:

        t14C ~ N(c(ti), σc(ti))                                                      (1)

where σc is the calibration uncertainty standard deviation, which in general will be a function of ti. In turn, the radiocarbon determination age d14C is assumed to have the form

        d14C ~ N(t14C, σd)                                                           (2)

with the variation of the standard deviation σd with t14C usually being ignored for individual samples.
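
As a simple illustration of this two-stage model (a sketch with invented inputs, not the real calibration data), one can simulate a 14C determination for a given calendar age:

    # Sketch of the sampling model in Eqs. (1)-(2); c_fun, sigma_c and sigma_d are supplied
    # by the user, and the values in the example call are purely illustrative.
    simulate_determination <- function(t_i, c_fun, sigma_c, sigma_d) {
      t_14C <- rnorm(1, mean = c_fun(t_i), sd = sigma_c)  # Eq. (1): 'true' 14C age for this calendar age
      rnorm(1, mean = t_14C, sd = sigma_d)                # Eq. (2): measured 14C determination
    }

    # Example with an invented linear 'calibration curve', just to show usage:
    set.seed(1)
    simulate_determination(1000, c_fun = function(t) 0.9 * t + 100, sigma_c = 10, sigma_d = 30)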

Fig. 1: Example OxCal calibration (from Fig. 1 of Keenan, 2012, Calibration of a radiocarbon age)

 

Figure 1, from Fig. 1 in Doug’s paper, shows an example of an OxCal calibration, with the resulting 95.4% (±2 sigma for a Gaussian distribution) probability range marked by the thin bar above the x-axis. The red curve on the y-axis is centred on the 14C age derived by measurement (the radiocarbon or 14C determination) and shows the likelihood for that determination as a function of true 14C age. The likelihood for a 14C determination is the relative probability, for any given true 14C age, of having obtained that determination given the uncertainty in 14C determinations. The blue calibration curve shows the relationship between true 14C age (on the y-axis) and true calendar age on the x-axis. Its vertical width represents calibration uncertainty. The estimated PDF for calendar age is shown in grey. Ignoring the small effect of the calibration uncertainty, the PDF simply expresses the 14C determination’s likelihood as a function of calendar age. It represents both the likelihood function for the determination and – since a uniform prior for calendar age is used – the posterior PDF for the true calendar age (Bayes’ theorem giving the posterior as the normalised product of the prior and the likelihood function).

By contrast to OxCal’s subjective Bayesian, uniform prior based method, an objective Bayesian approach would involve computing a noninformative prior for ti. The standard choice would normally be Jeffreys’ prior. Doing so is somewhat problematic here in view of the calibration curve not being monotonic – it contains reversals – and also having varying uncertainty.

If the calibration curve were monotonic and had an unvarying error magnitude, the calibration curve error could be absorbed into a slightly increased 14C determination error, as both these uncertainty distributions are assumed Gaussian. Since the calibration curve error appears small in relation to 14C determination error, and typically only modestly varying over the 14C determination error range, I will make the simplifying assumption that it can be absorbed into an increased 14C determination error. The statistical relationship then becomes, given independence of calibration curve and radiocarbon determination uncertainty:

        d14C ~ N(c(ti), sqrt(σc² + σd²))                                             (3)

On that basis, and ignoring also the calibration curve being limited in range, it follows that Jeffreys’ prior for ti would equal the absolute derivative (slope) of calibrated 14C age with respect to calendar date. Moreover, in the absence of non-monotonicity it is known that in a case like this the Jeffreys’ prior is completely noninformative. Jeffreys’ prior would in fact provide exact probability matching – perfect agreement between the objective Bayesian posterior cumulative distribution functions (CDFs – the integrals of PDFs) and the results of repeated testing. The reason for the form here of Jeffreys’ prior is fairly clear – where the calibration curve is steep and hence its derivative with respect to calendar age is large, the error probability (red shaded area) between two nearby values of t14C corresponds to a much smaller ti range than when the derivative is small.

An alternative way of seeing that a noninformative prior for calendar age should be proportional to the derivative of the calibration curve is as follows. One can perform the Bayesian inference step to derive a posterior PDF for the true 14C age, t14C, using a uniform prior for 14C age – which as stated previously is, given the assumed Gaussian error distribution, noninformative. That results in a posterior PDF for 14C age that is identical, up to proportionality, to its likelihood function. Then one can carry out a change of variable from t14C to ti. The standard (Jacobian determinant) formula for converting a PDF between two variables, where one is a function of the other, involves multiplying the PDF, expressed in terms of the new variable, by the absolute derivative of the inverse transformation – the derivative of t14C with respect to ti. Taking this route, the objective posterior PDF for calendar age is the normalised product of the 14C age likelihood function (since the 14C objective Bayesian posterior is proportional to its likelihood function), expressed in terms of calendar age, multiplied by the derivative of t14C with respect to ti. That is identical, as it should be, to the result of direct objective Bayesian inference of calendar age using the Jeffreys’ prior.
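
The equivalence of the two routes is easy to check numerically. The sketch below (my own illustration, not the OxCal algorithm) evaluates the objective posterior on a calendar-age grid directly as the normalised product of the likelihood and |dc/dti|, using the combined error standard deviation of equation (3):

    # Objective Bayesian posterior for calendar age on an equally spaced grid.
    # c_fun and sigma_tot (the combined error SD of Eq. (3)) are user-supplied;
    # nothing here is specific to the real IntCal calibration curves.
    objective_posterior <- function(t_grid, d_14C, c_fun, sigma_tot, eps = 1e-3) {
      lik   <- dnorm(d_14C, mean = c_fun(t_grid), sd = sigma_tot)          # likelihood vs calendar age
      slope <- abs(c_fun(t_grid + eps) - c_fun(t_grid - eps)) / (2 * eps)  # |dc/dti|: Jeffreys' prior
      post  <- lik * slope
      post / sum(post * (t_grid[2] - t_grid[1]))                           # normalise on the grid
    }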

3. Examining various methods using a simple stylised calibration curve

In order to make the problem analytically tractable and the performance of different methods – in terms of probability matching – easily testable, I have created a stylised calibration curve. It consists of the sum of three scaled shifted sigmoid functions. The curve exhibits both plateaus and steep regions whilst being smooth and monotonic and having a simple derivative.
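
The exact parameters of my curve are in the R code linked at the end of this article; the sketch below constructs a curve of the same general form (an offset plus three scaled, shifted logistic sigmoids, hence smooth and monotonic) with purely illustrative scales and shifts:

    # A stylised calibration curve of the general form described: an offset plus three scaled,
    # shifted logistic sigmoids. The numbers are illustrative, not those used for the figures.
    calib_curve <- function(t) {
      sigmoid <- function(x) 1 / (1 + exp(-x))
      200 + 600 * (sigmoid((t - 400) / 40) + sigmoid((t - 1000) / 40) + sigmoid((t - 1600) / 40))
    }

    # Its derivative (and hence Jeffreys' prior, up to normalisation) has a simple closed form.
    calib_slope <- function(t) {
      sigmoid <- function(x) 1 / (1 + exp(-x))
      dsig    <- function(x) sigmoid(x) * (1 - sigmoid(x))
      (600 / 40) * (dsig((t - 400) / 40) + dsig((t - 1000) / 40) + dsig((t - 1600) / 40))
    }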

 Figure 2 shows similar information to Figure 1 but with my stylised calibration curve instead of a real one. The grey wings of the curve represent a fixed calibration curve error, which, as discussed, I absorb into the 14C determination error. The pink curve, showing the Bayesian posterior PDF using a uniform prior in calendar age, corresponds to the grey curve in Figure 1. It is highest over the right hand plateau, which corresponds to the centre of the red radiocarbon age error distribution, but has a non-negligible value over the left hand plateau as well. The figure also shows the objective Jeffreys’ prior (dotted green line), which reflects the derivative of the calibration curve. The objective Bayesian posterior using that prior is shown as the solid green line. As can be seen, it is very different from the uniform-calendar-year-prior based posterior that would be produced by the OxCal or Calib programs for this 14C determination (if they used this calibration curve).

Fig. 2: Bayesian inference using uniform and objective priors with a stylised calibration curve

 

The Jeffreys’ prior (dotted green line) has bumps wherever the calibration curve has a high slope, and is very low in plateau regions. Subjective Bayesians will probably throw up their hands in horror at it, since it would be unphysical to think that the probability of a sample having any particular calendar age depended on the shape of the calibration curve. But that is to mistake the nature of a noninformative prior, here Jeffreys’ prior. A noninformative prior has no direct probabilistic interpretation. As a standard textbook (Bernardo and Smith, 1994) puts it in relation to reference analysis, arguably the most successful approach to objective Bayesian inference: “The positive functions π(θ) [the noninformative reference priors] are merely pragmatically convenient tools for the derivation of reference posterior distributions via Bayes’ theorem”.

Rather than representing a probabilistic description of existing evidence as to a probability distribution for the parameter being estimated, a noninformative prior primarily reflects (at least in straightforward cases) how informative, at differing values of the parameter, the data is expected to be about the parameter. That in turn reflects how precise the data are in the relevant region and how fast expected data values change with the parameter value. This comes back to the relationship between distances in parameter space and distances in data space that I mentioned earlier.

It may be thought that the objective posterior PDF has an artificial shape, with peaks and low regions determined, via the prior, by the vagaries of the calibration curve and not by genuine information as to the true calendar age of the sample. But one shouldn’t pay too much attention to PDF shapes; they can be misleading. What is most important in my view is the calendar age ranges the PDF provides, which for one-sided ranges follow directly from percentage points of the posterior CDF.

By a one-sided x% range I mean the range from the lowest possible value of the parameter (here, zero) to the value, y, at which the range is stated to contain x% of the posterior probability. An x1–x2% range or interval for the parameter is then y1–y2, where y1 and y2 are the (tops of the) one-sided x1% and x2% ranges. Technically, this is a credible interval, as it relates to Bayesian posterior probability.

By contrast, a (frequentist) x% one-sided confidence interval with a limit of y can, if accurate, be thought of as one calculated to result in values of y such that, upon indefinitely repeated random sampling from the uncertainty distributions involved, the true parameter value will lie below y in x% of cases. By definition, an accurate confidence interval exhibits perfect frequentist coverage and so represents, for an x% interval, exact probability matching. If one-sided Bayesian credible intervals derived using a particular prior pass that test then they and the prior used are said to be probability matching. In general, Bayesian posteriors cannot be perfectly probability matching. But the simplified case presented here falls within an exception to that rule, and use of Jeffreys’ prior should in principle lead to exact probability matching.

The two posterior PDFs in Figure 2 imply very different calendar age uncertainty ranges. As OxCal reports a 95.4% range, I’ll start with the 95.4% ranges lying between the 2.3% and 97.7% points of each posterior CDF. Using a uniform prior, that range is 365–1567 years. Using Jeffreys’ prior, the objective Bayesian 2.3–97.7% range is 320–1636 years – somewhat wider. But for a 5–95% range, the difference is large: 395–1472 years using a uniform prior versus 333–1043 years using Jeffreys’ prior.
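
Ranges like these can be read off any posterior evaluated on a grid by interpolating its CDF; the helper below is a sketch of that step, not the exact code behind the numbers just quoted:

    # Equal-tailed credible interval from a posterior on an equally spaced grid.
    # probs gives the two CDF points, e.g. c(0.023, 0.977) for a 95.4% range.
    credible_interval <- function(t_grid, post, probs = c(0.05, 0.95)) {
      cdf <- cumsum(post) / sum(post)
      approx(cdf, t_grid, xout = probs, ties = max)$y
    }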

Note that OxCal would report a 95.4% highest posterior density (HPD) range rather than a range lying between the 2.3% and 97.7% points of the posterior CDF. A 95.4% HPD range is one spanning the region with the highest posterior densities that includes 0.954 probability in total; it is necessarily the narrowest such range. HPD ranges are located differently from those with equal probability in both tails of a probability distribution; they are narrower but not necessarily better.
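
An HPD range can likewise be extracted from a gridded posterior by keeping the highest-density grid cells until the target probability is reached. The sketch below reports only the overall span, which for a multimodal posterior may bridge disjoint HPD pieces:

    # Approximate highest posterior density (HPD) range from a gridded posterior.
    hpd_range <- function(t_grid, post, prob = 0.954) {
      p    <- post / sum(post)
      ord  <- order(p, decreasing = TRUE)
      keep <- ord[cumsum(p[ord]) <= prob]     # highest-density cells totalling ~prob
      if (length(keep) == 0) keep <- ord[1]   # degenerate case: a single cell holds > prob
      range(t_grid[keep])                     # overall span (may bridge disjoint pieces)
    }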

What about confidence intervals, a non-Bayesian statistician would rightly ask? The obvious way of obtaining confidence intervals is to use likelihood-based inference, specifically the signed root log-likelihood ratio (SRLR). In general, the SRLR only provides approximate confidence intervals. But where, as here, the parameter involved is a monotonic transform of a variable with a Gaussian distribution, SRLR confidence intervals are exact. So what are the 2.3–97.7% and 5–95% SRLR-derived confidence intervals? They are respectively 320–1636 years and 333–1043 years – identical to the objective Bayesian ranges using Jeffreys’ prior, but quite different from those using a uniform prior. I would argue that the coincidence of the Jeffreys’ prior derived objective Bayesian credible intervals and the SRLR confidence intervals reflects the fact that here both methods provide exact probability matching.
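
For a monotonically increasing calibration curve with Gaussian combined error, the SRLR one-sided x% limit is simply the calendar age at which the curve equals the determination plus the corresponding standard normal quantile times the error standard deviation. A sketch using numerical curve inversion (it assumes the target value lies within the curve’s range over the bracketing interval):

    # One-sided SRLR confidence limit for a monotonically increasing calibration curve c_fun:
    # the calendar age t at which c(t) = d_14C + qnorm(p) * sigma_tot.
    srlr_limit <- function(p, d_14C, c_fun, sigma_tot, lower, upper) {
      target <- d_14C + qnorm(p) * sigma_tot
      uniroot(function(t) c_fun(t) - target, lower = lower, upper = upper)$root
    }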

4. Numerical testing of different methods using the stylised calibration curve

Whilst an example is illuminating, in order properly to compare the performance of the different methods one needs to carry out repeated testing of probability matching based on a large number of samples: frequentist coverage testing. Although some Bayesians reject such testing, most people (including most statisticians) want a statistical inference method to produce, over the long run, results that accord with relative frequencies of outcomes from repeated tests involving random draws from the relevant probability distributions. By drawing samples from the same uniform calendar age distribution on which Bronk Ramsey’s method is predicated, we can test how well each method meets that aim. This is a standard way of testing statistical inference methods. Clearly, one wants a method also to produce accurate results for samples that – unbeknownst to the experimenter – are drawn from individual regions of the age range, and not just for samples that have an equal probability of having come from any year throughout the entire range.

I have accordingly carried out frequentist coverage testing, using 10,000 samples drawn at random uniformly from both the full extent of my calibration curve and from various sub-regions of it. For each sampled true calendar age, a 14C determination age is sampled randomly from a Gaussian error distribution. I’ve assumed an error standard deviation of 30 14C years, to include calibration curve uncertainty as well as that in the 14C determination. Whilst in principle I should have used somewhat different illustrative standard deviations for different regions, doing so would not affect the qualitative findings.

In these frequentist coverage tests, for each integral percentage point of probability the proportion of cases where the true calendar age of the sample falls below the upper limit given by the method involved for a one-sided interval extending to that percentage point is computed. The resulting proportions are then plotted against the percentage points they relate to. Perfect probability matching will result in a straight line going from (0%, 0) to (100%,1). I test both subjective and objective Bayesian methods, using for calendar age respectively a uniform prior and Jeffreys’ prior. I also test the signed root log-likelihood ratio method.
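
In outline (a sketch of the procedure, not the exact R code linked at the end), the coverage test for a posterior-based method looks like this:

    # Frequentist coverage test: draw true calendar ages uniformly over a sub-range, simulate a
    # determination for each, and record how often the true age falls below each one-sided limit.
    # posterior_fun is any function returning a posterior on t_grid (e.g. the sketches above);
    # all settings are illustrative.
    coverage_test <- function(n = 10000, t_min = 100, t_max = 1900, sigma_tot = 30,
                              c_fun, posterior_fun, probs = (1:99) / 100) {
      t_grid <- seq(0, 2000, by = 1)
      hits   <- matrix(FALSE, n, length(probs))
      for (i in seq_len(n)) {
        t_true <- runif(1, t_min, t_max)
        d_14C  <- rnorm(1, mean = c_fun(t_true), sd = sigma_tot)
        post   <- posterior_fun(t_grid, d_14C, c_fun, sigma_tot)
        limits <- approx(cumsum(post) / sum(post), t_grid, xout = probs, ties = max)$y
        hits[i, ] <- t_true <= limits
      }
      colMeans(hits)   # plot against probs; the 1:1 line is perfect probability matching
    }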

For the Bayesian method using a uniform prior, I also test the coverage of the HPD regions that OxCal reports. As HPD regions are two-sided, I compute the proportion of cases in which the true calendar age falls within the calculated HPD region for each integral percentage HPD region. Since usually only ranges that contain a majority of the estimated posterior probability are of interest, only the right hand half of the HPD curves (HPD ranges exceeding 50%) is of practical significance. Note that the title and y-axis label in the frequentist coverage test figures refer to one-sided regions and should in relation to HPD regions be interpreted in accordance with the foregoing explanation.

I’ll start with the entire range, except that I don’t sample from the 100 years at each end of the calibration curve. That is because otherwise a significant proportion of samples result in non-negligible likelihood falling outside the limits of the calibration curve. Figure 3 accordingly shows probability matching with true calendar ages drawn uniformly from years 100–1900. The results are shown for four methods. The first two are subjective Bayesian using a uniform prior as per Bronk Ramsey – from percentage points of the posterior CDF and from highest posterior density regions. The third is objective Bayesian employing Jeffreys’ prior, from percentage points of the posterior CDF. The fourth uses the non-Bayesian signed root log-likelihood ratio (SRLR) method. In this case, all four methods give good probability matching – their curves lie very close to the dotted black straight line that represents perfect matching.

Fig. 3: Probability matching from frequentist coverage testing with calendar ages of 100–1900 years

 

Now let’s look at sub-periods of the full 100–1900 year period. I’ve picked periods representing both ranges over which the calibration curve is mainly flattish and those where it is mainly steep. I start with years 100–500, over most of which the calibration curve is steep. The results are shown in Figure 4. Over this period, SRLR gives essentially perfect matching, while the Bayesian methods give mixed results. Jeffreys’ prior gives very good matching – not quite perfect, probably because for some samples there is non-negligible likelihood at year zero. However, posterior CDF points using a uniform prior don’t provide very good matching, particularly for small values of the CDF (corresponding to the lower bound of two-sided uncertainty ranges). Posterior HPD regions provide rather better, but still noticeably imperfect, matching.

Fig. 4: Probability matching from frequentist coverage testing with calendar ages of 100–500 years

 

Figure 5 shows results for the 500–1000 year range, which is flat except near 1000 years. The conclusions are much as for 100–500 years, save that Jeffreys’ prior now gives perfect matching and that mismatching from posterior CDF points resulting from a uniform prior gives smaller errors (and in the opposite direction) than for 100–500 years.

Fig. 5: Probability matching from frequentist coverage testing with calendar ages of 500–1000 years

 

Now we’ll take the 1000–1100 years range, which asymmetrically covers a steep region in between two plateaus of the calibration curve. As Figure 6 shows, this really separates the sheep from the goats. The SRLR and objective Bayesian methods continue to provide virtually perfect probability matching. But the mismatching from the posterior CDF points resulting from a uniform prior Bayesian method is truly dreadful, as is that from HPD regions derived using that method. The true calendar age would only lie inside a reported 90% HPD region for some 75% of samples. And over 50% of samples would fall below the bottom of a 10–90% credible region given by the posterior CDF points using a uniform prior. Not a very credible region at all.

Fig. 6: Probability matching from frequentist coverage testing with calendar ages of 1000–1100 years

 

Figure 7 shows that for the next range, 1100–1500 years, where the calibration curve is largely flat, the SRLR and objective Bayesian methods again provide virtually perfect probability matching. However, the uniform prior Bayesian method again fails to provide reasonable probability matching, although not as spectacularly badly as over 1000–1100 years. In this case, symmetrical credible regions derived from posterior CDF percentage points, and HPD regions of over 50% in size, will generally contain a significantly higher proportion of the samples than the stated probability level of the region – the regions will be unnecessarily wide.

Fig. 7: Probability matching from frequentist coverage testing with calendar ages of 1100–1500 years

 

Finally, Figure 8 shows probability matching for the mainly steep 1500–1900 years range. Results are similar to those for years 100–500, although the uniform prior Bayesian method gives rather worse matching than it does for years 100–500. Using a uniform prior, the true calendar age lies outside the HPD region noticeably more often than it should, and lies beyond the top of credible regions derived from the posterior CDF twice as often as it should.

Fig. 8: Probability matching from frequentist coverage testing with calendar ages of 1500–1900 years

 

5. Discussion and Conclusions

The results of the testing are pretty clear. In whatever range the true calendar age of the sample lies, both the objective Bayesian method using a noninformative Jeffreys’ prior and the non-Bayesian SRLR method provide excellent probability matching – almost perfect frequentist coverage. Both variants of the subjective Bayesian method using a uniform prior are unreliable. The HPD regions that OxCal provides give less poor coverage than two-sided credible intervals derived from percentage points of the uniform prior posterior CDF, but at the expense of not giving any information as to how the missing probability is divided between the regions above and below the HPD region. For both variants of the uniform prior subjective Bayesian method, probability matching is nothing like exact except in the unrealistic case where the sample is drawn equally from the entire calibration range – in which case over-coverage errors in some regions on average cancel out with under-coverage errors in other regions, probably reflecting the near symmetrical form of the stylised overall calibration curve.

I have repeated the above tests using 14C error standard deviations of 10 years and 60 years instead of 30 years. Results are qualitatively the same.

Although I think my stylised calibration curve captures the essence of the principal statistical problem affecting radiocarbon calibration, unlike real 14C calibration curves it is monotonic. It also doesn’t exhibit variation of calibration error with age, but such variation shouldn’t have a significant impact unless, over the range where the likelihood function for the sample is significant, it is substantial in relation to 14C determination error. Non-monotonicity is more of an issue, and could lead to noticeable differences between inference from an objective Bayesian method using Jeffreys’ prior and from the SRLR method. If so, I think the SRLR results are probably to be preferred, where it gives a unique contiguous confidence interval. Jeffreys’ prior, which in effect converts length elements in 14C space to length elements in calendar age space, may convert single length elements in 14C space to multiple length elements in calendar age space when the same 14C age corresponds to multiple calendar ages, thus over-representing in the posterior distribution the affected parts of the 14C error distribution probability. Initially I was concerned that the non-monotonicity problem was exacerbated by the existence of calibration curve error, which results in uncertainty in the derivative of 14C age with respect to calendar age and hence in Jeffreys’ prior. However, I now don’t think that is the case.

Does the foregoing mean the SRLR method is better than an objective Bayesian method? In this case, perhaps, although the standard form of SRLR isn’t suited to badly non-monotonic parameter–data relationships and non-contiguous uncertainty ranges. More generally, the SRLR method provides less accurate probability matching when error distributions are neither normal nor transforms of a normal.

Many people may be surprised that the actual probability distribution of the calendar date of samples for which radiocarbon determinations are carried out is of no relevance to the choice of a prior that leads to accurate uncertainty ranges and hence is, IMO, appropriate for scientific inference. Certainly most climate scientists don’t seem to understand the corresponding point in relation to climate sensitivity. The key point here is that the objective Bayesian and the SRLR methods both provide exact probability matching whatever the true calendar date of the sample is (provided it is not near the end of the calibration curve). Since they provide exact probability matching for each individual calendar date, they are bound to provide exact probability matching whatever probability distribution for calendar date is assumed by the drawing of samples.

How do the SRLR and objective Bayesian methods provide exact probability matching for each individual calendar date? It is easier to see that for the SRLR method. Suppose samples having the same fixed calendar date are repeatedly drawn from the radiocarbon and calibration uncertainty distributions. The radiocarbon determination will be more than two standard deviations (of the combined radiocarbon and calibration uncertainty level) below the exact calibration curve value for the true calendar date in 2.3% of samples. The SRLR method sets its 97.7% bound at two standard deviations above the radiocarbon determination, using the exact calibration curve to convert this to a calendar date. That bound must necessarily lie at or above the calibration curve value for the true calendar date in 97.7% of samples. Ignoring non-monotonicity, it follows that the true calendar date will not exceed the upper bound in 97.7% of cases. The bound is, given the statistical model, an exact confidence limit by construction. Essentially Jeffreys’ prior achieves the same result in the objective Bayesian case, but through operating on probability density rather than on its integral, cumulative probability.

Bayesian methods also have the advantage that they can naturally incorporate existing information about parameter values. That might arise where, for instance, a non-radiocarbon based dating method had already been used to estimate a posterior PDF for the calendar age of a sample. But even assuming there is genuine and objective probabilistic prior information as to the true calendar year, what the textbooks tell one to do may not be correct. Suppose the form of the data–parameter relationship differs between the existing and new information, and it is wished to use Bayes’ theorem to update, using the likelihood from the new radiocarbon measurement, a posterior PDF that correctly reflects the existing information. Then simply using that existing posterior PDF as the prior and applying Bayes’ theorem in the standard way will not give an objective posterior probability density for the true calendar year that correctly combines the information in the new measurement with that in the original posterior PDF. It is necessary to use instead a modified form of Bayesian updating (details of which are set out in my paper at http://arxiv.org/abs/1308.2791). It follows that if the existing information is simply that the sample must have originated between two known calendar dates, with no previous information as to how likely it was to have come from any part of the period those dates define, then just using a uniform prior set to zero outside that period would bias estimation and be unscientific.

And how does Doug Keenan’s ‘discrete’ calibration method fit into all this? So far as I can see, the uncertainty ranges it provides will be considerably closer to those derived using objective Bayesian or SRLR methods than to those given by the OxCal and Calib methods, even though like them it uses Bayes’ theorem with a uniform prior. That is because, like the SRLR and (given monotonicity) Jeffreys’ prior based objective Bayesian methods, Doug’s method correctly converts, so far as radiocarbon determination error goes, between probability in 14C space and probability in calendar year space. I think Doug’s treatment of calibration curve error avoids, through renormalisation, the multiple counting of 14C error probability that may affect a Jeffreys’ prior based objective Bayesian method when the calibration curve is non-monotonic. However, I’m not convinced that his treatment of calibration curve uncertainty is noninformative even in the absence of it varying with calendar age. Whether that makes much difference in practice, given that 14C determination error appears normally to be the larger of the two uncertainties by some way, is unclear to me.

Does the uniform prior subjective Bayesian method nevertheless have advantages? Probably. It may cope with non-monotonicity better than the basic objective Bayesian method I have set out, particularly where that leads to non-contiguous uncertainty ranges. It may also make it simpler to take advantage of chronological information where there is more than one sample. And maybe in many applications it is felt more important to have realistic looking posterior PDFs than uncertainty ranges that accurately reflect how likely the true calendar date is to lie within them.

I can’t help wondering whether it might help if people concentrated on putting interpretations on CDFs rather than PDFs. Might it be better to display the likelihood function from a radiocarbon determination (which would be identical to the subjective Bayesian posterior PDF based on a uniform prior) instead of a posterior PDF, and just to use an objective Bayesian PDF (or the SRLR) to derive the uncertainty ranges? That way one would get both a realistic picture of what calendar age ranges are supported by the data and ranges above or below which the true age would lie in the stated percentage of instances.

Professor Bronk Ramsey considers that knowledge of the radiocarbon calibration curve does give us quantitative information on the prior for 14C ‘age’. He argues that the belief that in reality calendar dates of samples are spread uniformly means that a non-uniform prior in 14C age is both to be expected and what you would want. That would be fine if the prior assumption made about calendar dates actually conveyed useful information.

Where genuine prior information exists, one can suppose that it is equivalent to a notional observation with a certain probability density, from which a posterior density of the parameter given that observation has been calculated using Bayes’ theorem with a noninformative ‘pre-prior’, with the thus computed posterior density being employed as the prior density (Hartigan, 1965).

However, a uniform prior over the whole real line conveys no information. Under Hartigan’s formulation, its notional observation has a flat likelihood function and a flat pre-prior. Suppose the transformation from calendar date to 14C age using the calibration curve is effected before the application of Bayes’ theorem to the notional observation for a uniform prior. Then its likelihood function remains flat – what becomes non-uniform is the pre-prior. The corresponding actual prior (likelihood function for the notional observation multiplied by the pre-prior) in 14C age space is therefore non-uniform, as claimed. But when the modified form of Bayesian updating set out in my arXiv paper is applied, that prior has no influence on the shape of the resulting posterior PDF for true 14C age and nor, therefore, on the posterior for calendar date. In order to affect an objective Bayesian posterior, one has to put some actual prior information in. For instance, that could be in the form of a Gaussian distribution for calendar date. In practice, it may be more realistic to do so for the relationship between the calendar dates of two samples, perhaps based on their physical separation, than for single samples.

Let me give a hypothetical non-radiocarbon example that throws light on the uniform prior issue. Suppose that a satellite has fallen to Earth and the aim is to recover the one part that will have survived atmospheric re-entry. It is known that it will lie within a 100 km wide strip around the Earth’s circumference, but there is no reason to think it more likely to lie in any part of that strip than another, apart from evidence from one sighting from space. Unfortunately, that sighting is not very precise, and the measurement it provides (with Gaussian error) is non-linearly related to distance on the ground. Worse, although the sighting makes clear which side of the Earth the satellite part has hit, the measurement is aliased and sightings in two different areas of the visible side cannot be distinguished. The situation is illustrated probabilistically in Figure 9.

Fig. 9: Satellite part location problem

 

In Figure 9, the measurement error distribution is symmetrically bimodal, reflecting the aliasing. Suppose one uses a uniform prior for the parameter, here ground distance across the side of the Earth visible when the sighting was made, on the basis that the item is as likely to have landed in any part of the 100 km wide strip as in any other. Then the posterior PDF will indicate a 0.825 probability that the item lies at a location below 900 (in the arbitrary units used). If one instead uses Jeffreys’ prior, the objective Bayesian posterior will indicate a 0.500 probability that it does so. If you had to bet on whether the item was eventually found (assume that it is found) at a location below 900, what would you consider fair odds, and why?

Returning now to radiocarbon calibration, there seems to me no doubt that, whatever the most accurate method available is, Doug is right about a subjective Bayesian method using a uniform prior being problematical. By problematical, I mean that calibration ranges from OxCal, Calib and similar calibration software will be inaccurate, to an extent varying from case to case. Does that mean Bronk Ramsey is guilty of research misconduct? As I said initially, certainly not in my view. Subjective Bayesian methods are widely used and are regarded by many intelligent people, including statistically trained ones, as being theoretically justified. I think views on that will eventually change, and the shortcomings and limits of validity of subjective Bayesian methods will become recognised. We shall see. There are deep philosophical differences involved as to how to interpret probability. Subjective Bayesian posterior probability represents a personal degree of belief. Objective Bayesian posterior probability could be seen as, ideally, reflecting what the evidence obtained implies. It could be a long time before agreement is reached – there aren’t many areas of mathematics where the foundations and philosophical interpretation of the subject matter are still being argued over after a quarter of a millennium!

 

A PDF of this article and the R code used for the frequentist coverage testing are available at http://niclewis.files.wordpress.com/2014/04/radiocarbon-calibration-bayes.pdf and http://niclewis.files.wordpress.com/2014/04/radiocarbon-dating-code.doc
[i] A statistical model is still involved, but no information as to the value of the parameter being estimated is introduced as such. Only in certain cases is it possible to find a prior that has no influence whatsoever upon parameter estimation. In other cases what can be sought is a prior that has minimal effect, relative to the data, on the final inference (Bernardo and Smith, 1994, section 5.4).

[ii] I am advised by Professor Bronk Ramsey that the method was originally derived by the Groningen radiocarbon group, with other notable related subsequent statistical publications by Caitlin Buck and her group and Geoff Nicholls.

 

The “Ethics Application” for Lewandowsky’s Fury

In today’s post, I will discuss the ethics application and approval process for Fury. Continue reading

Frontiers Issues Statement on Lewandowsky

Following a variety of untrue allegations by Lewandowsky and his supporters, Frontiers have issued a new statement stating that they received “no threats” and that they had received “well argued and cogent” complaints, including mine here and here. (I did not report or publicize this complaint at Climate Audit or invite any public pressure on the journal.)

According to my understanding, the issues identified by the journal are issues that constitute of violations of most codes of conduct within academic psychology, including Australian codes.

There has been a series of media reports concerning the recent retraction of the paper Recursive Fury: Conspiracist ideation in the blogosphere in response to research on conspiracist ideation, originally published on 18 March 2013 in Frontiers in Psychology. Until now, our policy has been to handle this matter with discretion out of consideration for all those concerned. But given the extent of the media coverage – largely based on misunderstanding – Frontiers would now like to better clarify the context behind the retraction.

As we published in our retraction statement, a small number of complaints were received during the weeks following publication. Some of those complaints were well argued and cogent and, as a responsible publisher, our policy is to take such issues seriously. Frontiers conducted a careful and objective investigation of these complaints. Frontiers did not “cave in to threats”; in fact, Frontiers received no threats. The many months between publication and retraction should highlight the thoroughness and seriousness of the entire process

As a result of its investigation, which was carried out in respect of academic, ethical and legal factors, Frontiers came to the conclusion that it could not continue to carry the paper, which does not sufficiently protect the rights of the studied subjects. Specifically, the article categorizes the behaviour of identifiable individuals within the context of psychopathological characteristics. Frontiers informed the authors of the conclusions of our investigation and worked with the authors in good faith, providing them with the opportunity of submitting a new paper for peer review that would address the issues identified and that could be published simultaneously with the retraction notice.

The authors agreed and subsequently proposed a new paper that was substantially similar to the original paper and, crucially, did not deal adequately with the issues raised by Frontiers.

We remind the community that the retracted paper does not claim to be about climate science, but about psychology. The actions taken by Frontiers sought to ensure the right balance of respect for the rights of all.

One of Frontiers’ founding principles is that of authors’ rights. We take this opportunity to reassure our editors, authors and supporters that Frontiers will continue to publish – and stand by – valid research. But we also must uphold the rights and privacy of the subjects included in a study or paper.

SH Proxies: Peru d18O

One of the hidden assumptions of proxy reconstructions, as carried out by IPCC authors, is that each “proxy” has a linear relationship to temperature plus relatively low-order red noise. Under such circumstances, the noise will cancel out in a linear combination of proxies (reconstruction) and a “signal” will emerge. However, I’ve never seen any author discuss the validity of this assumption, let alone establish the validity.

In today’s post, I’m going to look at low-latitude South American d18O isotope series mainly from Peru, including three proxies from Neukom. Tropical ice core d18O series (especially Quelccaya, but also Huascaran and Sajama) have been a staple of temperature reconstructions. During the past few years, d18O series have also been obtained from speleothems and lake sediments.

In my opinion, before one can begin thinking about temperature reconstructions using many different types of proxies, some of which are singletons, it makes sense to see if one can make sense of something as simple as d18O series within one relatively circumscribed region.

Continue reading

Neukom and Gergis Serve Cold Screened Spaghetti

Neukom, Gergis and Karoly, accompanied by a phalanx of protective specialists, have served up a plate of cold screened spaghetti in today’s Nature (announced by Gergis here).

Gergis et al 2012 (presently in a sort of zombie withdrawal) had foundered on ex post screening. Neukom, Gergis and Karoly + 2014 take ex post screening to a new and shall-we-say unprecedented level. This will be the topic of today’s post. Continue reading

UWA Vice Chancellor Johnson Refuses Data Again Again

Barry Woods has been trying to get Lewandowsky’s data, inclusive of any metadata on referring blogs, since August 2012 (before anyone had even heard of Lewandowsky). Woods has made multiple requests, many of which have not even been acknowledged. Woods has expressed concern about Hoax to Eric Eich, editor of Psychological Science, who suggested that Woods submit a comment.

The UWA’s Code of Conduct for the Responsible Practice of Research states clearly:

3.8 Research data related to publications must be available for discussion with other researchers.

The Australian Code of Conduct for the Responsible Practice of Research (to which the University of Western Australia claims to adhere) states:

2.5.2 Research data should be made available for use by other researchers unless this is prevented by ethical, privacy or confidentiality matters.

Nonetheless, Vice Chancellor Johnson flatly and unequivocally denied data to Woods for the purpose of submitting a comment to the journal, stating that “it is not the University’s practice to accede to such requests”.

From: Paul Johnson
Sent: Friday, March 28, 2014 8:08 AM
To: Barry Woods
Cc: Murray Maybery ; Kimberley Heitman
Subject: request for access to data

Mr B. Woods

Dear Mr Woods,

I refer to your emails of the 11th and 25th March directed to Professor Maybery, which repeat a request you made by email dated the 5th September 2013 to Professor Lewandowsky (copied to numerous recipients) in which you request access to Professor Lewandowsky’s data for the purpose of submitting a comment to the Journal of Psychological Science.

It is not the University’s practice to accede to such requests.

Yours faithfully,
Professor Paul Johnson,
Vice-Chancellor

It seems highly doubtful to me that it is indeed the “University’s practice” to refuse access to data to other researchers. Such a practice, if generally applied, would be a flagrant violation of the Australian Code of Conduct and would surely have come to light before now. But whether the refusal of data to other researchers is the general “practice” of the University or merely applied opportunistically in this particular case, it is a violation of the Australian Code of Conduct for Responsible Research and the “practice” should cease.

UWA Vice-Chancellor Refuses Lewandowsky Data

Over the past 15 months, I’ve made repeated requests to the University of Western Australia for a complete copy of Lewandowsky’s Hoax data in order to analyse it for fraudulent and/or scammed responses. Up to now, none of my previous requests were even acknowledged.

I was recently prompted to re-iterate my longstanding request by the retraction of Lewandowsky’s Fury. This time, my request was flatly and permanently denied by the Vice Chancellor of the University himself, Paul Johnson, who grounded his refusal not on principles set out in university or national policy, but because the University administration’s feelings were hurt by my recent blogpost describing the “investigation” by the University administration into the amendment of Lewandowsky’s ethics application .

Continue reading

Lewandowsky Ghost-wrote Conclusions of UWA Ethics Investigation into “Hoax”

Following the retraction of Lewandowsky’s Fury, the validity of University of Western Australia ethics “investigations” is again in the news. At present, we have negligible information on the University’s investigation into Fury, but we do have considerable (previously unanalysed) information on their earlier and illusory “investigation” into prior complaints about the ethics application for Moon Landing Hoax (“Hoax”).

This earlier “investigation” (recently cited at desmog here and Hot Whopper here) supposedly found that the issues that I had raised in October 2012 were “baseless” and that the research in Hoax was “conducted in compliance with all applicable ethical guidelines”.

However, these conclusions were not written by a university investigation or university official but by Lewandowsky himself and simply transferred to university letterhead by UWA Deputy Vice Chancellor Robyn Owens within minutes after Lewandowsky had sent her language that was acceptable to him.

In today’s post, I’ll set out a detailed chronology of these remarkable events. Continue reading

Lewandowsky’s Fury

Moon Hoax author Stephan Lewandowsky is furious that Frontiers in Psychology has retracted his follow-up article, Recursive Fury. See also Retraction Watch here.

Accompanying the news, Graham Redfearn at desmog has released FOI documents from the University of Western Australia that include correspondence between the university and the Frontiers editorial office up to May 2013. (Lewandowsky did not take exception to an FOI request from desmog.)

One of the last emails in the dossier is a request from the Frontiers editorial to the University of Western Australia in early May 2013, acknowledging receipt of the University’s statement that Lewandowsky had been investigated and cleared of various misconduct allegations.

The University’s investigation had been swift, to say the least, given that some of the complaints had been made as recently as April 5, 2013, and that, at a minimum, the falseness of Lewandowsky’s SKS claim had been unequivocally confirmed by SKS editor Tom Curtis.

The Frontiers editorial office sought particulars of the procedures of the UWA investigation (see list below), telling UWA that they had appointed a team of senior academics to examine the incident and hoped that “the team’s report could state that they have seen UWA’s decision and the background documents and are happy to be able to rely on that as a solid and well-founded decision (assuming that to be the case)”. They also stated that they not only wanted the evaluation to be “robust, even-handed and objective” but for the process to be perceived as such:

I am therefore writing to ask if it would be possible for the team evaluating the complaints to have a little more information in the process adopted by UWA in assessing these issues. The sole purpose of any such access would be to assist the evaluation team in its work. We are striving to ensure that the evaluation is robust, even-handed and objective and this information would be helpful not only to facilitate this but also to allow it to seem to be so. The idea would be that the team’s report could state that they have seen UWA’s decision and the background documents and are happy to be able to rely on that as a solid and well-founded decision (assuming that to be the case.)

We are well aware of the sensitivity of whole question…

If UWA felt able to share any of the following types of information it would be helpful:

1. The specific complaints made
2. The articles of the code of conduct which were considered relevant for the assessment
3. Whether any codes of conduct relating specifically to psychology were considered relevant and if so, which ones
4. The aspects of factors considered by UWA in its investigation
5. The reasoning adopted to support the findings of the preliminary investigation
6. Whether the recommendations referred to in UWA’s letter concerning dealing with conflicts of interest means that UWA considers that conflicts of interest were present in this case
7. Confirmation by UWA that those who assessed these allegations were independent of each of the authors and had no conflicts of interest or similar challenges in carrying out this task (note that we are not asking for details or evidence, just UWA’s confirmation)
8. Finally, from UWA’s letter we understand that the conclusion is that there was neither any breach nor any research misconduct as defined by the Australian Code for the Responsible Conduct of Research. Is this correct?

A few days later, UWA appears to have sent a “more detailed report” (according to an acknowledgement by Frontiers on May 6, 2013).

Lewandowsky’s blog article contains the following short statement, which has now been issued by Frontiers:

In the light of a small number of complaints received following publication of the original research article cited above, Frontiers carried out a detailed investigation of the academic, ethical and legal aspects of the work. This investigation did not identify any issues with the academic and ethical aspects of the study. It did, however, determine that the legal context is insufficiently clear and therefore Frontiers wishes to retract the published article. The authors understand this decision, while they stand by their article and regret the limitations on academic freedom which can be caused by legal factors.

The statement conspicuously does not contain the planned statement that they had “seen UWA’s decision and the background documents and are happy to be able to rely on that as a solid and well-founded decision”, from which one can surmise that they were unable to make such a statement.

Does “Inhomogeneous forcing and transient climate sensitivity” by Drew Shindell make sense?

A guest post by Nic Lewis

This new Nature Climate Change paper[i] by Drew Shindell claims that the lowest end of transient climate response (TCR) – below 1.3°C – in CMIP5 models is very unlikely, and that this suggests the lowest end of model equilibrium climate sensitivity estimates – modestly above 2°C – is also unlikely. The reason is that CMIP5 models display substantially greater transient climate sensitivity to forcing from aerosols and ozone than to forcing from CO2. Allowing for this, Shindell estimates that TCR is 1.7°C, very close to the CMIP5 multimodel mean of ~1.8°C. Accordingly, he sees no reason to doubt the models. In this connection, I would note (without criticising it) that Drew Shindell is arguing against the findings of the Otto et al (2013) study,[ii] of which he and I were two of the authors.

As with most papers by establishment climate scientists, no data or computer code appears to be archived in relation to the paper. Nor are the six models/model-averages shown on the graphs identified there. However, useful model-by-model information is given in the Supplementary Information. I was rather surprised that the first piece of data I looked at – the WM-GHG (well-mixed greenhouse gas) global forcing for the average of the MIROC, MRI and NorESM climate models, in Table S2 –  is given as 1.91 W/m², when the three individual model values obviously don’t average that. They actually average 2.05 W/m². Whether this is a simple typo or an error affecting the analysis I cannot tell, but the apparent lack of care it shows reinforces the view that little confidence should be placed in studies that do not archive data and full computer code – and so cannot be properly checked.

The extensive adjustments made by Shindell to the data he uses are a source of concern. One of those adjustments is to add +0.3 W/m² to the figures used for model aerosol forcing, to bring the estimated model aerosol forcing into line with the AR5 best estimate of -0.9 W/m². He notes that the study’s main results are very sensitive to the magnitude of this adjustment. If it were removed, the estimated mean TCR would increase by 0.7°C. If it were increased by 0.15 W/m², presumably the mean TCR estimate of 1.7°C would fall to 1.35°C – in line with the Otto et al (2013) estimate. Now, so far as I know, model aerosol forcing values are generally for the change from the 1850s, or thereabouts, to ~2000, not – as is the AR5 estimate – for the change from 1750. Since the AR5 aerosol forcing best estimate for the 1850s was -0.19 W/m², the AR5 best estimate for the change from the 1850s onwards is only about -0.71 W/m², and the adjustment required to bring the model aerosol forcing estimates into line with it is ~0.3 + 0.19 ≈ 0.49 W/m², not ~0.3 W/m². On the face of it, using that adjustment would bring Shindell’s TCR estimate down to around 1.26°C.
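
As a rough check on this arithmetic, here is a minimal Python sketch. It assumes – my assumption, not anything stated in the paper – that the TCR estimate responds linearly to the aerosol forcing bias adjustment at the rate implied by the figures quoted above (0.7°C per 0.3 W/m²); the function name and the adjustments scanned are mine.

```python
# Rough check of the arithmetic above. Assumption (mine, not Shindell's):
# the TCR estimate responds linearly to the aerosol forcing bias adjustment,
# at the rate implied by the quoted figures (removing the +0.3 W/m2
# adjustment raises the estimate by 0.7 C, i.e. ~2.33 C per W/m2).

BASE_TCR = 1.7            # Shindell's central TCR estimate (deg C)
BASE_ADJ = 0.3            # his aerosol forcing bias adjustment (W/m2)
SENSITIVITY = 0.7 / 0.3   # deg C of TCR per W/m2 of adjustment

def tcr_for_adjustment(adjustment_wm2):
    """Hypothetical TCR estimate for a given aerosol forcing adjustment."""
    return BASE_TCR - (adjustment_wm2 - BASE_ADJ) * SENSITIVITY

for adj in (0.0, 0.3, 0.45, 0.49):
    print(f"adjustment {adj:.2f} W/m2 -> TCR ~ {tcr_for_adjustment(adj):.2f} C")
# adjustment 0.00 W/m2 -> TCR ~ 2.40 C   (adjustment removed)
# adjustment 0.30 W/m2 -> TCR ~ 1.70 C   (Shindell's estimate)
# adjustment 0.45 W/m2 -> TCR ~ 1.35 C   (adjustment increased by 0.15)
# adjustment 0.49 W/m2 -> TCR ~ 1.26 C   (adjustment re-based to the 1850s)
```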

Additionally, the estimates of aerosol forcing in the models that Shindell uses to derive the 0.3 W/m² adjustment are themselves quite uncertain. He gives a figure of -0.98 W/m² for the NorESM1-M model, but the estimate by the modelling team appears to be -1.29 W/m².[iii] Likewise, Shindell’s figure of -1.44 W/m² for the GFDL-CM3 model appears to be contradicted by the estimate of -1.59 W/m² (or -1.68 W/m², depending on version) by the team involved with the model’s development.[iv] Substituting these two estimates for those used by Shindell would bring his TCR estimate down even further.

In any event, since the AR5 uncertainty range for aerosol forcing is very wide (5–95% range: -1.9 to -0.1 W/m²), the sensitivity of Shindell’s TCR estimate to the aerosol forcing bias adjustment is such that the true uncertainty of Shindell’s TCR range must be huge  – so large as to make his estimate worthless.

I’ll set aside further consideration of the detailed methodology Shindell used and the adjustments and assumptions he made. In the rest of this analysis I deal with the question of to what extent the model simulations used by Shindell can be regarded as providing reliable information about how the real climate system responds to forcing from aerosols, ozone and other forcing components.

First, it is generally accepted that global forcing from aerosols has changed little over the well-observed period since 1980. And most of the uncertainty in aerosol forcing relates to changes from preindustrial (1750) to 1980. So, if TCR values in CMIP5 models are on average correct, as Shindell claims, one would expect global warming simulated by those models to be, on average, in line with reality. But as Steve McIntyre showed, here, that is far from being the case. On average, CMIP5 models overestimate the warming trend between 1979 and 2013 by 50%. See Figure 1, below.


Figure 1: Modelled versus observed decadal global surface temperature trend 1979–2013

Temperature trends in °C/decade. Virtually all model climates warmed much faster than the real climate over the last 35 years. Source: http://climateaudit.org/2013/09/24/two-minutes-to-midnight/. Models with multiple runs have separate boxplots; models with single runs are grouped together in the boxplot marked ‘singleton’. The orange boxplot at the right combines all model runs together. The default settings in the R boxplot function have been used; the ends of the boxes represent the 25th and 75th percentiles. The red dotted line shows the actual trend in global surface temperature over the same period per the HadCRUT4 observational dataset. The 1979–2013 observed global temperature trends from the three datasets used in AR5 are very similar; the HadCRUT4 trend shown is the middle of the three.
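
For readers who want to see the shape of the underlying calculation, here is a minimal sketch – in Python rather than the R used for the figure, and with placeholder temperature series rather than the actual CMIP5 and HadCRUT4 data, which would have to be obtained separately. It simply fits a least-squares trend in °C/decade to each run and to the observations over 1979–2013 and compares them.

```python
# Sketch of the trend comparison behind Figure 1, using placeholder data only.
import numpy as np

def decadal_trend(years, temps):
    """Least-squares linear trend, returned in deg C per decade."""
    slope_per_year = np.polyfit(years, temps, 1)[0]
    return 10.0 * slope_per_year

years = np.arange(1979, 2014)  # 35 annual values, 1979-2013

# Placeholder series for illustration; a real analysis would load the CMIP5
# runs' and HadCRUT4's annual global mean temperatures for these years.
rng = np.random.default_rng(0)
model_runs = {
    f"run_{i}": 0.025 * (years - 1979) + 0.1 * rng.standard_normal(years.size)
    for i in range(5)
}
observed = 0.016 * (years - 1979) + 0.1 * rng.standard_normal(years.size)

model_trends = [decadal_trend(years, t) for t in model_runs.values()]
obs_trend = decadal_trend(years, observed)

print("model trends (deg C/decade):", [round(t, 3) for t in model_trends])
print("observed trend (deg C/decade):", round(obs_trend, 3))
print("mean model trend / observed trend:", round(np.mean(model_trends) / obs_trend, 2))
```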

Secondly, the paper relies on the CMIP5 models’ simulated responses to aerosol, ozone and land use changes being realistic, and not overstated. Those components dominate the change in total non-greenhouse gas anthropogenic forcing over the 1850–2000 period considered in the paper. Aerosol forcing changes are the most important by a wide margin, and land use changes (which Shindell excludes in some analyses) are of relatively little significance.

For its flagship 90% and 95% certainty attribution statements, AR5 relies on the ‘gold standard’ of detection and attribution studies. In order to separate out the effects of greenhouse gases (GHG), these analyses typically regress time series of  many observational variables – including latitudinally and/or otherwise spatially distinguished surface temperatures – on model-simulated changes arising not only from separate greenhouse gas and natural forcings but also from other separate non-GHG anthropogenic forcings. The resulting regression coefficients – ‘scaling factors’ – indicate to what extent the changes simulated by the model(s) concerned have to be scaled up or down to match observations. There is a large literature on this approach and the associated statistical optimal fingerprint methodology. The IPCC, and the climate science community as a whole, evidently considers this observationally-based-scaling approach to be a more robust way of identifying the influence of aerosols and other inhomogeneous forcings than the almost purely climate-model-simulations-based approach used by Shindell. I agree with that view.
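
To make concrete what the scaling factors represent, here is a toy illustration using made-up series and ordinary least squares. It is not the optimal fingerprint method used in the studies themselves – that additionally weights by an estimate of internal-variability covariance and allows for noise in the model-simulated patterns – but it shows what the regression coefficients mean.

```python
# Toy illustration of 'scaling factor' regression with made-up data (mine),
# not the actual optimal fingerprint method used in detection and attribution
# studies. Observations are regressed on a model's simulated responses to the
# separate forcing components; the coefficients are the scaling factors.
import numpy as np

rng = np.random.default_rng(1)
n = 150  # e.g. annual values of a spatially aggregated diagnostic, 1861-2010

# Made-up model-simulated responses to the three forcing components.
x_ghg = np.linspace(0.0, 1.0, n) ** 1.5                 # greenhouse gases
x_other_anthro = -0.4 * np.linspace(0.0, 1.0, n)        # aerosol/ozone/land use
x_natural = 0.1 * np.sin(np.linspace(0.0, 12.0, n))     # volcanic/solar

# Synthetic "observations", constructed so that the model overstates the
# other-anthropogenic response (true scaling factor 0.6) but gets the GHG
# and natural responses about right (true scaling factors 1.0).
y_obs = x_ghg + 0.6 * x_other_anthro + x_natural + 0.05 * rng.standard_normal(n)

# Regress observations on the simulated responses to obtain scaling factors.
X = np.column_stack([x_ghg, x_other_anthro, x_natural])
beta, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
print("scaling factors (GHG, other anthropogenic, natural):", np.round(beta, 2))
# A factor well below 1 for 'other anthropogenic' says the model's simulated
# response to those forcings must be scaled down to match the observations.
```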

Figure 10.4 of AR5, reproduced as Figure 2 below, shows in panel (b) estimated scaling factors for three forcing components: natural (blue bars), GHG (green bars) and  ‘other anthropogenic’ – largely aerosols, ozone and land use change (yellow bars). The bars show 5–95% confidence intervals from separate studies based on 1861–2010, 1901–2010 and 1951–2010 periods. Best estimates from the studies using those three periods are shown respectively by triangles, squares and diamonds. Previous research (Gillett et al, 2012)[v] has shown that scaling factors based on a 1901 start date are more sensitive to end date than those starting in the middle of the 19th century, with temperatures in the first two decades of the 20th century having been anomalously low, so the 1861–2010 estimates are probably more reliable than the 1901–2010 ones.

Multimodel estimates are given for the 1861–2010 and 1951–2010 periods (“multi”, at the top of the figure). The best estimate scaling factors for ‘other anthropogenic’ over those periods are respectively 0.58 and 0.61. The consistency of the two best estimates is encouraging, suggesting that the choice between these two periods does not greatly affect results. The average of the two scaling factors implies that the CMIP5 models analysed on average exaggerate the response to aerosols, ozone and other non-greenhouse gas anthropogenic forcings by almost 70%. However, the ‘other anthropogenic’ scaling factors for both periods have wide ranges. The possibility that the true scaling factor is zero is not ruled out at a 95% confidence level (although zero is almost ruled out using 1951–2010 data alone).
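
The “almost 70%” figure follows directly from the reciprocal of the average scaling factor, as this small calculation (my arithmetic, using the two best estimates quoted above) shows:

```python
# Exaggeration implied by a scaling factor is its reciprocal minus one:
# a factor of ~0.6 means the simulated response must be scaled down by ~40%,
# i.e. the models overstate the response by roughly 1/0.6 - 1, about 68%.
best_estimates = (0.58, 0.61)   # 'other anthropogenic', 1861-2010 and 1951-2010
mean_factor = sum(best_estimates) / len(best_estimates)
print(f"mean scaling factor: {mean_factor:.3f}")         # 0.595
print(f"implied exaggeration: {1 / mean_factor - 1:.0%}")  # 68%
```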

Figure 2: Reproduction of Figure 10.4 of the IPCC AR5 WGI report


The individual results for models used by Shindell are of particular interest.

The first of the five individual CMIP5 models included in Shindell’s analysis, CanESM2, shows negative scaling factors for ‘other anthropogenic’ over all three periods – strongly negative over 1901–2010. The best estimates for its GHG scaling factor are also far below one over both 1861–2010 and 1951–2010.  So it would be inappropriate to place any weight on simulations by this model. In Figure 1, it is CanESM2 that shows the greatest overestimate of 1979-2013 warming.

The second CMIP5 model in Shindell’s analysis, CSIRO-Mk3-6-0, shows completely unconstrained scaling factors using 1901–2010 data, and extremely high scaling factors for both GHG and ‘other anthropogenic’ over both 1861–2010 and 1951–2010 – so much so that the GHG scaling factor is inconsistent with unity at better than 95% confidence for the longer period, and at almost 95% for the shorter period. This indicates that the model should be rejected as a representation of the real world, and no confidence put in its simulated responses to aerosols, ozone or any other forcings.

The third of Shindell’s models, GFDL-CM3, is not included in AR5 Figure 10.4.

The fourth of Shindell’s models, HadGEM2, shows scaling factors for ‘other anthropogenic’ averaging 0.44, with all but the 1901–2010 analyses being inconsistent with unity at a 95% confidence level. The best defined scaling factor, using 1861–2010 data, is only 0.31, with a 95% bound of 0.58. So HadGEM2 appears to have a vastly exaggerated response to aerosol, ozone etc. forcing.

The fifth and last of Shindell’s separate models, IPSL-CM5A-LR, is included in Figure 10.4 in respect of 1861–2010 and 1901–2010. The scaling factors using 1861-2010 data are much the better defined. They are inconsistent with unity for all three forcing components, as are those over 1901-2010 for natural and GHG components. That indicates no confidence should be put in the model as a representation of the real climate system. The best estimate scaling factor for ‘other anthropogenic’ for the 1861-2010 period is 0.49, indicating that the model exaggerates the response to aerosols, ozone etc. by a factor of two.

Shindell also includes the average of the MIROC-CHEM, MRI-CGCM3 and NorESM1-M models. Only one of those, NorESM1-M, is included in AR5 Figure 10.4.

To summarise, four out of six models/model-averages used by Shindell are included in the detection and attribution analyses whose results are summarised in AR5 Figure 10.4. Leaving aside the generally less well constrained results using the 1901–2010 period that started with two anomalously cold decades, none of these show scaling factors for ‘other anthropogenic’ – predominantly aerosol and to a lesser extent ozone, with minor contributions from land use and other factors – that are consistent with unity at a 95% confidence level. In a nutshell, these models at least do not realistically simulate the response of surface temperatures and other variables to these factors.

A recent open-access paper in GRL by Chylek et al,[vi] here, throws further light on the behaviour of three of the models used by Shindell. The authors conclude from an inverse structural analysis that the CanESM2, GFDL-CM3 and HadGEM2-ES models all strongly overestimate GHG warming and compensate with a very strongly overestimated aerosol cooling, which simulates AMO-like behaviour with the correct timing – something that would not occur if the models were generating true AMO behaviour from natural internal variability. Interestingly, the paper also estimates that only about two-thirds of the post-1975 global warming is due to anthropogenic effects, with the other one-third being due to the positive phase of the AMO.

In the light of the analyses of the characteristics of the models used in Shindell’s analysis, as outlined above, combined with the evidence that Shindell’s aerosol forcing bias-adjustment is very likely understated and that his results’ sensitivity to it makes his TCR estimate far more uncertain than claimed, it is difficult to see that any weight should be put on Shindell’s findings.



[i]  Drew T. Shindell, 2014, Inhomogeneous forcing and transient climate sensitivity. Nature Climate Change. doi:10.1038/nclimate2136, available here (pay-walled).

[ii] Otto, A., F. E. L. Otto, O. Boucher, J. Church, G. Hegerl, P. M. Forster, N. P. Gillett, J. Gregory, G. C. Johnson, R. Knutti, N. Lewis, U. Lohmann, J. Marotzke, G. Myhre, D. Shindell, B. Stevens and M. R. Allen, 2013. Energy budget constraints on climate response. Nature Geosci., 6: 415–416.

[iii] A. Kirkevag et al., 2013: Aerosol–climate interactions in the Norwegian Earth System Model – NorESM1-M. GMD.

[iv] M. Salzmann et al., 2010: Two-moment bulk stratiform cloud microphysics in the GFDL AM3 GCM: description, evaluation, and sensitivity tests. ACP.

[v] Gillett, N. P., V. K. Arora, G. M. Flato, J. F. Scinocca and K. von Salzen, 2012: Improved constraints on 21st-century warming derived using 160 years of temperature observations. Geophys. Res. Lett., 39, L01704, doi:10.1029/2011GL050226.

[vi] P. Chylek et al., 2014. The Atlantic Multidecadal Oscillation as a dominant factor of oceanic influence on climate. GRL.
