I think it’s a matter of understanding what the approach is trying to say. The objective Bayesian approach is trying to discount our prior knowledge and tell us only what the *evidence* tells us, and the evidence that we’ve obtained offers us *no support* for the hypothesis that the calendar date is 750-850, because the experimental method used *doesn’t provide that information*. It is telling us that radiocarbon dating is *blind* in this region, due to the plateau in the calibration curve. For us to conclude from this evidence that the date is in this region, we would have to suppose a very narrow, precisely defined measurement error, so narrow that it has a very low probability, and which is further diluted by being spread out over the entire plateau. Our C14 measurement is not precise enough to resolve this interval.

When we look at a result and say “That doesn’t look right…”, that’s our prior knowledge speaking. Your assessment that there is a good chance of the true calendar date being in the 750-850 range is not based on the radiocarbon evidence, it’s based on your prior expectations about how the world is.

I agree that taken as a best estimate of how the world is, it’s not right. But that’s not what the method is designed for.

Thanks for your kind comments!

Certainly, when judging whether a method is valid, one should look at the details of the method, its justification, intermediate quantities like the prior, as well as the final result. But what can one say if someone just accepts all the intermediate things that look wrong to you? The only ultimate ground for comparison is the final result, and the only ultimate judge is whether the results accord with common sense.

Now, by “common sense”, I mean a sophisticated common sense, that has contemplated any insights that theoretical analysis or computational investigation has provided, and has carefully considered whether an apparently bizarre result might actually be correct. In sufficiently complex problems, we might never be able to acquire such common sense, and just have to accept the result even if it looks “wrong”, if the method used produces results in simpler situations that do accord with common sense.

But this isn’t all that complex a problem. In Nic’s Fig. 2, a look at the posterior PDF produced with the objective prior should immediately produce the reaction, “Why do calendar years from 750 to 850 have virtually zero probability? That doesn’t look right…”. And it’s not right.

The posterior PDF produced by use of Jeffreys’ prior doesn’t just look “artificial”. It looks completely wrong. I think this is the most crucial point.

The criterion should not be whether the results look wrong, but whether the method (including the implicit prior) looks wrong. The Jeffreys prior on C14 age should be rejected because it implies a bizarre prior on the calibrated date of interest, given the objective information in the calibration curve. This is why it gives bizarre posteriors sometimes, but we should not reject the results just because the posterior is subjectively “wrong”.
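To see concretely why the implied prior is bizarre, here is a minimal Python sketch; the calibration curve and all numbers are hypothetical. For a Gaussian measurement of the C14 age mu(theta) with known error, the Jeffreys prior on the calendar date theta is proportional to the slope |mu'(theta)| of the calibration curve, so it collapses to zero wherever the curve is flat:

```python
import numpy as np

# Toy calibration curve (hypothetical): C14 age mu as a function of calendar
# age theta, flat between theta = 750 and theta = 850.
theta = np.linspace(600, 1000, 401)
mu = np.where((theta >= 750) & (theta <= 850), 1200.0,
              np.where(theta < 750, 1200.0 - (750 - theta),
                       1200.0 + (theta - 850)))

# For a Gaussian measurement with known sigma, the Jeffreys prior on the
# calendar date is proportional to the slope |d mu / d theta|.
slope = np.abs(np.gradient(mu, theta))
dtheta = theta[1] - theta[0]
prior = slope / (slope.sum() * dtheta)   # normalise to integrate to ~1

plateau = (theta > 750) & (theta < 850)
print("mean prior density on plateau:", prior[plateau].mean())
print("mean prior density off plateau:", prior[~plateau].mean())
```

On the toy plateau the implied prior density is essentially zero, which is exactly where the near-zero posterior weight over 750–850 comes from, regardless of the data.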

Could you try conditioning your Q-Q graphs on an observable quantity like the point estimate of calibrated date rather than the unobserved true date, but drawing the unobserved true dates from the OxCal prior with a generous margin on either side of the desired interval so as to ensure that the point estimate will lie in it?

If we knew the true date, we would just use it and wouldn’t bother with C-14 dating.

I should mention that whichever prior is used, your equal-tail credible intervals make a lot more sense to me than OxCal’s HPD (Highest Posterior Density) credible regions. I want to see a pair of bounds, such that the probability that the true value is less than the lower bound is say 2.5% (in the appropriate sense), and the probability that it is above the upper bound is the same. HPD is sensitive to monotonic transforms of the parameter in question — it will tell different stories for standard deviation, variance, precision (reciprocal variance), and log variance, while equal tails will give the same answer for all such transformations, if appropriately computed.
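The invariance claim is easy to check directly. A minimal sketch, with a hypothetical gamma sample standing in for posterior draws of a variance parameter: equal-tail bounds computed from order statistics commute exactly with any increasing transform, such as taking square roots to get a standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy posterior sample for a variance parameter (any positive sample works).
var_samples = rng.gamma(shape=3.0, scale=2.0, size=10001)

def equal_tail(samples, alpha=0.05):
    # Use order statistics directly, so each bound is an actual sample value;
    # increasing transforms then commute exactly with the interval.
    s = np.sort(samples)
    lo = s[int(np.floor(alpha / 2 * len(s)))]
    hi = s[int(np.ceil((1 - alpha / 2) * len(s))) - 1]
    return lo, hi

lo_v, hi_v = equal_tail(var_samples)            # interval for the variance
lo_s, hi_s = equal_tail(np.sqrt(var_samples))   # interval for the std dev

# The bounds for the standard deviation are just the square roots of the
# bounds for the variance; HPD regions do not transform this way.
assert np.isclose(lo_s, np.sqrt(lo_v)) and np.isclose(hi_s, np.sqrt(hi_v))
```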

How? With what distribution?

If you simply run through the possible C14 age values one by one, you are implicitly assuming that C14 age is uniformly distributed. In practice, some C14 age values/observations will be more likely to occur than others. That weights the odds differently.

You need to simulate the whole process. You submit a physical sample which has a particular true calendar/C14 age combination – a random point on the calibration curve – with some input distribution. You perform a measurement of that true C14 age to get a measured C14 age with lognormal error, you then apply one of our competing algorithms to estimate a measured calendar age distribution from it. You finally filter out from the collection of trials all those with a particular measured C14 age, and look at the distributions of true calendar ages that generated it, compared against the measured calendar age distribution each algorithm output.
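As a sketch of that simulation — using a toy calibration curve with a plateau at calendar years 750–850, a normal rather than lognormal measurement error for simplicity, and a uniform input distribution purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy calibration curve (hypothetical): flat plateau at C14 age 1200 for
# calendar ages 750-850, slope 1 elsewhere.
def calib(theta):
    return np.where((theta >= 750) & (theta <= 850), 1200.0,
                    np.where(theta < 750, 1200.0 - (750 - theta),
                             1200.0 + (theta - 850)))

n = 200_000
true_cal = rng.uniform(600, 1000, n)        # step 1: draw true calendar ages
true_c14 = calib(true_cal)                  # their points on the curve
meas_c14 = true_c14 + rng.normal(0, 20, n)  # step 2: add measurement error

# Step 3: keep trials whose measured C14 age falls near the plateau value,
# and see which true calendar ages actually generated them.
sel = np.abs(meas_c14 - 1200.0) < 5
frac_plateau = np.mean((true_cal[sel] >= 750) & (true_cal[sel] <= 850))
print(f"fraction of selected trials with a true date on the plateau: {frac_plateau:.2f}")
```

Under this input distribution, most trials whose measured C14 age lands on the plateau value were in fact generated by plateau dates — which is precisely what a near-zero posterior over 750–850 contradicts.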

The problem is, what input distribution do you use in the first step, to generate the true calendar ages? We know some things about the distribution: younger objects are generally more likely than older ones due to decay processes. But archaeologists likely only submit samples they’re fairly sure are old, so the very young end is cut off. Do you use a different distribution for Egyptian artefacts, versus Roman artefacts, versus Babylonian artefacts? Or do you smear them all together into one?

The chances are that the archaeologist is going to have a lot of prior information (although far from certainty) about the age, but the radiocarbon lab isn’t going to know what it is. It’s often very specialist knowledge, and the lab techs won’t have it.

It’s not quite the same situation that the objective Bayesian method is designed for, which is complete lack of any prior information. The information exists, it’s just that the client archaeologist has it rather than the radiocarbon lab. Objective priors are only for use when you don’t know *anything*. When you *do* know something, you ought to use all the information you’ve got.

There may certainly be an argument to be made that the archaeologists’ priors are unlikely to be objectively generated, and that they ought instead to start with an objective prior and then apply Bayesian updating to incorporate their evidence for decay/destruction rates of different materials in different environments, historical and physical context, contamination rates, and so on. The reliability of human judgement versus Bayesian methods is an old argument in AI research. But it’s not one that’s going to be settled here.

Thank you for a first-class, lucid and thought-provoking article.

Thank you for this clear essay.

It is not common to find this depth of understanding expressed in peer-reviewed papers for the IPCC (speaking here more broadly than carbon dating) by authors of diverse backgrounds indulging in statistics. A possible conclusion is that many do not understand the treatment of error at your level. Indeed, I use the treatment of error as a rough guide to the eventual quality of a paper. It is a factor that has led me to generalise over the years that much climate work is not of very high quality.

Researchers who do not comprehend your essay have a doubtful place in the writing of papers that are used for major policy considerations. Yet time and again I see huge errors accepted ‘because it is convenient’. The TOA radiation balance varies from satellite to satellite. The sea surface temperature varies with the make of probe. The atmospheric temperature record varies between balloons and satellite microwave methods. Etc, etc.

Background. At age 29 I borrowed a lot of money and established a fairly large private laboratory from which I sold analytical numbers to clients in several industries. It was an advanced lab, including the only private fast neutron generator for NAA seen here in Australia before or since.

I sinned in my younger years. I was one of a large set of optimistic analysts whose self-described prowess was greater than I realised as I aged. It is entirely possible, for it seems to be in the breed, that laboratory estimates of accuracy and precision now used in carbon dating have this optimistic element. I apologise for my sins, which time has shown to have had no major practical consequences.

Your article was easy to comprehend because I spent a career doing this type of inspection of results, either hands on or by supervision. The blue calibration curve in your figure 1 troubles me greatly, not because I have mathematical evidence (I’m now too old and have forgotten too much detail) but because decades of experience screams that it does not look right. The blue envelope is inconsistent with the slope reversals, for a start. It is too optimistic. It should maybe be wider at the ends where low or high instrument count rates can produce less precision. Above all, it seems not to recognise that some analyses have a very low probability of being correct, the ones way out on the wings of the distribution curve. They happen, though rarely. If the person constructing the curve has independent confidence that the analysis is correct, it can distort the proper form of the curve and widen the blue envelope. But, I don’t see this (though I do not claim deep recent reading of the detailed literature).

Your essay points to the need that has always been there, for replication, replication, replication such as multiple sampling of test material where possible. It reinforces old concepts like carefully prepared calibration standards, which in this case might be being used. But mostly it repeats the valuable observation that subjectivity can enter the picture and that there must be proper, appreciated ways to deal with it. The proper way delves into sources of bias, well beyond the simple calculation of a statistical factor such as a standard deviation, shown as if it is an obligatory inclusion of little importance.

There is good value in your essay and it complements the many threads that Steve has posted where more attention to accuracy is mentioned. It shall sit with my favourites, along with Steve’s memorable ‘However, drawing conclusions from a subpopulation of zero does take small population statistics to a new and shall-we-say unprecedented level.’

………………………..

Error analysis in the CMIP comparisons is a pathetic shambles of subjectivity. Please, you active statisticians, can it be next under the spotlight?

I understand your general point about generating random values, except for why there need be any question about how you choose your Monte Carlo values in this case.

We get a reading for C14 from a sample, and that equates to an age of that sample. We want error limits. There is an error in that original reading, and in the calibration curve.

Provided we know the size and shape of the error in that original reading (normal, lognormal, etc.) we can generate a spread of random values around our actual reading reflecting what it might be, and then for each of those read off the age that implies, using a table that randomly selects from the (smeared out) calibration curve at that value. That generates a confidence interval fairly quickly, to whatever % you want.
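A minimal sketch of that recipe (the curve, its uncertainty, and all numbers are hypothetical; a monotone toy curve is used so the lookup is a simple inversion, whereas a real plateaued curve would return a set of dates at each value):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy monotone calibration curve (hypothetical): c14 = 0.95 * cal + 100,
# so the inverse lookup is cal = (c14 - 100) / 0.95.
measured_c14 = 1300.0
reading_sigma = 25.0   # known size/shape of the reading error (normal here)
curve_sigma = 10.0     # known calibration-curve uncertainty

n = 100_000
# Spread of values the true C14 age might take, given the reading...
c14_draws = rng.normal(measured_c14, reading_sigma, n)
# ...smeared further by the curve uncertainty, then read off as calendar ages.
cal_draws = (c14_draws + rng.normal(0.0, curve_sigma, n) - 100.0) / 0.95

lo, hi = np.percentile(cal_draws, [2.5, 97.5])
print(f"95% interval for calendar age: [{lo:.0f}, {hi:.0f}]")
```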

There’s no prior choice involved unless we don’t know the size and shape of the reading error, or the calibration curve.

Well, subjective Bayesians think that there is always prior information. I do have my doubts about this in extreme cases like assigning prior probabilities to different cosmological theories, where it’s hard to see why people should have evolved to have intuitive knowledge of what’s plausible, but we’re not dealing with that sort of problem here. People have all sorts of information about what are more or less plausible dates for old pieces of parchment, tree stumps, or whatever. It’s unlikely that this will lead to a uniform prior, however, which is why I suspect that people are using that prior just as a way of communicating the likelihood function (while possibly being a bit confused about what they’re really doing).

Nic: We agree that the posterior PDF produced by use of Jeffreys’ prior may look artificial.

The posterior PDF produced by use of Jeffreys’ prior doesn’t just look “artificial”. It looks completely wrong. I think this is the most crucial point. Your example isn’t one that should convince readers to use Jeffreys’ prior because it gives exact probability matching for credible intervals. It’s an example that should convince readers that Jeffreys’ prior is flawed, and probability matching is not something one should insist on. Could there possibly be a clearer violation of the rule “DON’T INVENT INFORMATION”? The prior gives virtually zero probability to large intervals of calendar age based solely on the shape of the calibration curve, with this curve being the result of physical processes that almost certainly have nothing to do with the age of the sample.

Statistical inference procedures are ultimately justified as mathematical and computational formalizations of common sense reasoning. We use them because unaided common sense tends to make errors, or have difficulty in processing large amounts of information, just as we use formal methods for doing arithmetic because guessing numbers by eye or counting on our fingers is error prone, and is anyway infeasible for large numbers. So the ultimate way of judging the validity of statistical methods is to apply them in relatively simple contexts (such as this) and check whether the results stand up to well-considered common sense scrutiny. In this example, Jeffreys’ prior fails this test spectacularly.

I think you would maybe agree that Jeffreys’ prior is not to be taken seriously, given that you say the following:

Nic: … think of the posterior PDF as a way of generating a CDF and hence credible intervals rather than being useful in itself. I agree that realistic posterior PDFs can be very useful, but if the available information does not enable generation of a believable posterior PDF then why should it be right to invent one?

But with this comment, you seem to have adopted a strange position that may be unique to you. Frequentists usually don’t have much use for a posterior PDF for any purpose. And I think “objective” Bayesians aim to produce a posterior PDF that is sensible. I’m puzzled why you would bother to produce a posterior that you don’t believe is even close to being a proper expression of posterior belief, and then use it as a justification for the credible intervals that can be derived from it. If these credible intervals have any justification, it can’t be that. And in fact, for this example, you can (and do) justify these intervals as being confidence intervals according to standard frequentist arguments (albeit ones that I think are flawed in this context). So what is the point of the whole objective Bayesian argument?

Nic: My words “but there is no reason to think it more likely to lie in any part of that strip than another” were intended to indicate total ignorance, not that prior knowledge of the satellite’s location was equivalent to a uniform distribution over the strip. There is a difference. You are assuming it is known that the process which generated the impact location results in it being equally likely to lie in any part of the strip…

I think it is impossible to maintain this distinction between a physical random process and “ignorance”, which you don’t seem willing to represent using probability (even though that’s central to all Bayesian methods, subjective or not). Archetypal random processes such as coin flips are probably not actually random, in the sense of quantum uncertainty or thermal noise, but appear random only because of our ignorance of initial conditions, which could be eliminated by suitable measuring instruments (that are not impossible according to the laws of physics).

Among both frequentists and objective Bayesians, I think there is a degree of wishful thinking from wanting to find a procedure that avoids all “subjectivity”. But it’s just not possible. Refusal to admit that it’s not possible inevitably leads to methods that produce strange results.
