In many interesting comments, beaker, a welcome Bayesian commenter, has endorsed the Santer criticism of Douglass et al purporting to demonstrate inconsistency between models and data for tropical troposphere trends. (Prior post in sequence here) Santer et al proposed revised significance tests which, contrary to the Douglass results, did not yield results with statistical “significance”, which they interpreted as evidence that all was well, as for example, Gavin Schmidt here:

But it is a demonstration that there is no clear model-data discrepancy in tropical tropospheric trends once you take the systematic uncertainties in data and models seriously. Funnily enough, this is exactly the conclusion reached by a much better paper by P. Thorne and colleagues. Douglass et al’s claim to the contrary is simply unsupportable.

In passing, beaker mentioned that he was re-reading Jaynes (1976), Confidence intervals vs. Bayesian intervals. I took a look at this article by a Bayesian pioneer which proved to contain many interesting dicta, many of which were directed at ineffective use of significance tests resulting in the failure to extract useful statistical conclusions available within the data, many dicta resonating, at least to me, in the present situation. The opening motto for the Jaynes article reads:

Significance tests, in their usual form, are not compatible with a Bayesian attitude.

This motto seems strikingly at odds with beaker’s incarnation as a guardian of the purity of significance tests. Jaynes described methods whereby a practical analyst could extract useful results from the available data. Jaynes looked askance at unsophisticated arguments that results from elementary significance tests were the end of the story. In this respect, it’s surprising that we haven’t heard anything of this sort from beaker.

Not that I disagree with criticisms of the Douglass et al tests. If you’re using a significance test, it’s important to do it correctly. The need to allow for autocorrelation in estimating the uncertainty in trends was a point made here long before the publication of Santer et al 2008 and was one that I agreed with in my prior post. But in a practical sense, there does appear to be a “discrepancy” between UAH data and model data (this is not just me saying this; the CCSP certainly acknowledges a “discrepancy”). It seems to me that it should be possible to say something about this data and that’s the more interesting topic that I’ve been trying to focus on. So far I am unconvinced by the arguments of Santer, Schmidt and coauthors purporting to show that you can’t say anything meaningful about the seeming discrepancy between the UAH tropical troposphere data and the model data. These arguments seem all too reminiscent of the attitudes criticized by Jaynes.
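To make the autocorrelation point concrete, here is a minimal sketch. It assumes the common lag-1 adjustment in which the effective sample size is n_eff = n(1 - r1)/(1 + r1), with r1 the lag-1 autocorrelation of the regression residuals. The function name and the synthetic series are illustrative only; this is not the code or data of either paper.

```python
import math
import random

def trend_with_ar1_se(y):
    """OLS trend of y against 0..n-1, with naive and AR(1)-adjusted standard errors."""
    n = len(y)
    t = list(range(n))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    slope = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y)) / sxx
    intercept = y_mean - slope * t_mean
    resid = [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]
    sse = sum(e * e for e in resid)
    # Lag-1 autocorrelation of the residuals.
    r1 = sum(resid[i] * resid[i - 1] for i in range(1, n)) / sse
    # Positive autocorrelation shrinks the effective number of independent points.
    n_eff = n * (1 - r1) / (1 + r1)
    if n_eff <= 2:
        raise ValueError("effective sample size too small for a trend standard error")
    se_naive = math.sqrt(sse / (n - 2) / sxx)
    se_adj = math.sqrt(sse / (n_eff - 2) / sxx)
    return slope, se_naive, se_adj, r1

# Synthetic monthly series: small trend plus AR(1) noise (made-up numbers).
random.seed(0)
noise, series = 0.0, []
for month in range(360):
    noise = 0.8 * noise + random.gauss(0, 0.1)
    series.append(0.001 * month + noise)

slope, se_naive, se_adj, r1 = trend_with_ar1_se(series)
print(f"slope={slope:.5f}  naive SE={se_naive:.5f}  adjusted SE={se_adj:.5f}  r1={r1:.2f}")
```

With strongly autocorrelated residuals the adjusted standard error is several times the naive one, which is why the allowance matters for any trend comparison.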

The Jaynes article, recommended by beaker, was highly critical of statisticians who were unable to derive useful information from data that seemed as plain as the nose on their face, because their “significance tests” were poorly designed for the matter at hand. As a programme, Jaynes’ bayesianism is an effort to extract every squeak of useful information out of the matter at hand by avoiding simplistic use of “significance tests”. This is not to justify incorrect use of significance tests – but merely to opine that the job of a bayesian statistician, according to Jaynes, is to derive useful quantitative results from the available information.

Interestingly, the Jaynes reference begins with an example analysing the difference in means – taking an entirely different approach than the Santer et al t-test. Here’s how Jaynes formulates the example:

Jaynes observes that the “common sense” conclusion here would be that B out-performed even though the available information was less than one would want.

Jaynes inquires into how the authority arrived at this counter-intuitive conclusion:

Jaynes’ conclusion is certainly one which resonates with me:

Now any statistical procedure which fails to extract evidence that is already clear to our unaided common sense is certainly not for me!

Jaynes’ next example has a similar flavor:

Again, Jaynes observes that application of conventional significance tests failed to yield useful results in a seemingly slam-dunk situation:

This latter conclusion sounds all too much like the nihilism of Santer et al. Jaynes observes of this sort of statistical ineffectiveness:

Quite so.

Now I must admit that my eyes ordinarily tend to glaze over when I read disputes between Bayesians and frequentists. However, as someone whose interests tend to be practical (and I think of my activities here more as those of a “data analyst” than of a “statistician”), I like the sound of what Jaynes is saying. In addition, our approaches here to the statistical bases of reconstructions have been “bayesian” in flavor (as Brown and Sundberg are squarely in that camp and my own experiments with profile likelihood results are, I suppose, somewhat bayesian in approach, though I don’t pretend to have all the lingo). I also don’t have to unlearn a lot of the baggage that bayesians spend so much time criticising, as my own introduction to statistics in the 1960s was from a very theoretical and apparently (unusually for the time) Bayesian viewpoint (transformation groups, orbits), with surprisingly little in retrospect on the mechanics of significance tests.

I’ve done some more calculations in which I’ve converted profile likelihoods to a bayesian-style distribution of trends, given observations up to 1999 (Santer), 2004 (Douglass) and 2008 (as I presume a bayesian would do.) They are pretty interesting. I’ll post these up tomorrow. I realize that the Santer crowd have excuses for not using up-to-date data – their reason is that the models don’t come up to 2008. It is my understanding that bayesians try to extract every squeak of usable information. It appears to me that Jaynes would castigate any analyst who failed to extract whatever can be used from up-to-date information. Nevertheless, beaker criticized Douglass et al for doing exactly that.
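For readers wondering what such a conversion might look like, here is one way it could be sketched. The assumptions are mine for illustration and are not necessarily those of the calculations described above: independent Gaussian residuals (no AR1 term), intercept and noise variance profiled out, and a flat prior on the trend, so that the normalised profile likelihood is read as a density. The series is synthetic.

```python
import math

def profile_trend_density(y, slopes):
    """Normalised profile likelihood of the trend on an evenly spaced grid of slopes."""
    n = len(y)
    t = list(range(n))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    log_lik = []
    for b in slopes:
        # For a fixed slope b, the best-fit intercept leaves zero-mean residuals.
        sse = sum(((yi - y_mean) - b * (ti - t_mean)) ** 2 for ti, yi in zip(t, y))
        # Gaussian profile log-likelihood with the variance profiled out.
        log_lik.append(-0.5 * n * math.log(sse))
    peak = max(log_lik)
    weights = [math.exp(v - peak) for v in log_lik]
    step = slopes[1] - slopes[0]
    total = sum(weights) * step
    return [w / total for w in weights]

# Synthetic series with a trend of 0.02 per step plus a deterministic wiggle.
y = [0.02 * i + 0.1 * math.sin(i) for i in range(60)]
slopes = [k * 0.001 for k in range(41)]  # candidate trends from 0 to 0.04
density = profile_trend_density(y, slopes)
best = slopes[density.index(max(density))]
print(f"most probable trend on the grid: {best:.3f}")
```

Given such a density, a statement like "the probability that the trend exceeds x" is just the area of the grid above x, which is the kind of direct statement a significance test does not supply.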

Perhaps beaker is a closet frequentist.

Reference:

Jaynes, E. T. 1976. Confidence intervals vs. Bayesian intervals. Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science 2: 175-257. http://bayes.wustl.edu/etj/articles/confidence.pdf

## 31 Comments

I believe in the third para “In passing, beaker mentioned that he was re-reading Jaynes (2006)” should be corrected to “…Jaynes (1976)”.

Good mini-essay, Steve. I’ve been reading the Santer 2006 thread over recent days with some amazement. It has seemed at times like the old “How many angels on the tip of a pin” debates, but here you have brought some reality back to the subject.

I was struck by one of the phrases you used:

It seems to me that the Team appears to spend most of its time revising data and analysis methods in a great many areas: surface temperature data, upper-trop radiosonde and satellite data, dendro data, Mannian methods, statistical significance analyses, etc.

If only they spent as much time revising their climate sensitivity assumptions and their models……

Steve: I think that you’re unfair here. It’s not like Santer is revising their own tests; they objected to Douglass’ tests and proposed alternatives. That’s fair enough – the issue is whether their conclusion that the data does not enable any useful analysis is justified.

Re: Pat Keating (#2), Steve

Yes, Steve, you are probably right about my being unfair in the case of Santer et al 2006. However, the Team does seem to devote a lot more time to adjusting the empirical data (and derivatives thereof) to fit its hypotheses than it does to adjusting its theories (e.g., climate-sensitivity assumptions and models) to fit the data.

As a theoretical physicist myself, I am offended by reverse adjustment of experimental data to fit one’s theory, unless it is very well-founded.

I would encourage a look at:

http://www.newton.ac.uk/webseminars/pg+ws/2008/sch/0606/nychka/

“Challenges of regional climate modelling and validation”

6th June 2008

Author: D.W. Nychka (National Center for Atmospheric Research)

This seems relevant given the current topics at hand…

Steve, pardon my antipodean ignorance, what is CCSP?

SM: sorry about that. The US Climate Change Science Program Assessment Report 1.1, which I’ve mentioned as prompting much of the Douglass et al comment.

Re: david elder (#4),

Brief commercial:

http://climateaudit101.wikispot.org/Glossary_of_Acronyms

will lead you to CCSP = http://en.wikipedia.org/wiki/Climate_Change_Science_Program

Steve, could you please substitute this list for the old one on your masthead? Ours really is a *lot* better.

Thanks & cheers, Pete Tillman

The thing about the troposphere trend significance which bothered me is that it seemed all variation was treated as noise rather than signal. I may have missed the point but it is related to my latest post directed toward Tamino, where he claimed the 10 year temp drop is statistically insignificant. I used the standard deviation of the ‘difference’ between GISS, UAH and RSS, ran it through some simple R functions and demonstrated that the slope is known to a very high degree of confidence.

http://noconsensus.wordpress.com/2008/10/21/taminos-folly-temperatures-did-drop/

The example above is of a repeated static measurement rather than a variable trend. In the model case we have the ability to assess the confidence intervals by differencing actual measurements rather than using the entire signal. Wouldn’t a difference analysis be more appropriate for establishing the confidence of the tropospheric model trends?

Also, this talk obliquely mentions Mann and the lack of statistical rigor in the HS. It’s between the 10 and 20 min mark. But it’s a good talk overall that speaks to many things discussed here…

http://www.newton.ac.uk/webseminars/pg+ws/2008/sch/schw01/0108/cook/

For me the interesting quote from Jaynes is ‘the hypothesis that they are not equal is rejected at the 95% level, so without further ado he assumes that they are equal’

Performing a statistical test for significance is not about whether two things are absolutely the same or not – but what is the chance or risk that they are the same. ‘Eyeballing’, which is what Jaynes implies can give you obvious or intuitive answers, may lead you astray, and you need some sort of independent and quantifiable measure of significance, which is what the statistical tests give you; but to use them to say that 2 things are definitely different or definitely the same is wrong.

What they tell you is the chance that 2 things are the same or different.

The important point is that what level of significance you choose is entirely up to you on a case by case basis – it relates to the risk that your hypothesis is rejected. So I might be perfectly happy to assume a 1 SD risk if I have a flutter on a horse race (think about framing a hypothesis that horse B (my preferred flutter) will beat horse A).

However if I am designing a dam costing hundreds of millions of dollars and lives are at risk then I may only be prepared to accept a 6 sigma or higher random event that may cause the dam to fail. The statistics may be exactly the same in both cases but what I am prepared to gamble on is not the same.

Now commonly in science, a 95% significance level is chosen, but that does not mean that if 2 measurements are less than about 2 sigma different that they definitely, no doubt about it, are the same. It means only that, at that level of significance, I cannot exclude a 1/20 chance that the difference happened by random chance. I can choose a 99% or a 99.9% significance test – With ever more stringent tests, at some point my hypothesis is going to fail – but that doesn’t suddenly make 2 measurements the same, only that the chance of them being the same by random chance is 1/100, or 1/1000.
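For a normally distributed statistic, the correspondence between a two-sided significance level and a number of standard deviations can be checked directly. A small sketch using only the Python standard library (the helper name is just for illustration):

```python
from statistics import NormalDist

def two_sided_z(confidence):
    """Number of standard deviations enclosing `confidence` of a normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.95, 0.99, 0.999):
    print(f"{level:.1%} two-sided -> {two_sided_z(level):.2f} sigma")
# 95.0% two-sided -> 1.96 sigma
# 99.0% two-sided -> 2.58 sigma
# 99.9% two-sided -> 3.29 sigma
```

So the usual 95% level sits at roughly 2 sigma, and each tightening of the level simply moves the threshold outward; nothing about the data changes.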

So I look at Jaynes’ examples of the manufactured components and rockets and say the statistics tells me that not that they are the same but the chance that they are different – and whether you accept that chance (risk) or not, depends upon what rides upon it – or perhaps it tells you that you have to go and gather better data to lower your risk!

Re: Lav (#7), you say that in Jaynes’ examples of the manufactured components and rockets the statistics tell you “not that they are the same but the chance that they are different”.

I’d say the statistics tell you only the chance of getting the data you got if they were indeed equal. Why would you expect rockets from different vendors to be equal in the first place?

Re: carlitos (#26),

Maybe I wasn’t very clear. Sure the rockets are almost certainly made by different manufacturers, what is being tested is one of their properties, their ability to be fired in a straight line if I understand the example and, if lots of them are fired, how tightly clustered their ‘hits’ will be when they reach their target. (A very practical exposition of standard deviation!)

So, the military commander wants to know if he orders 100 rockets fired, which of the two systems is going to produce the most hits? In Jaynes’ example, the military commander’s ‘tame’ statistician does a two-sided F test and proclaims that the difference is statistically indistinguishable at the 95% confidence level, so therefore there is no reason to choose one system or the other.

But this is the fallacy. Choosing a 95% confidence level is quite arbitrary. As I tried to show, what level of confidence you choose depends upon how much rides on the result. An evens bet may be good enough for a horse-race flutter but I might only accept a 0.0001% chance that a freak rainstorm in a century can overwhelm the very expensive dam I am designing. To strengthen the dam to withstand that freak storm might cost another $20M on a $200M dam, is it worth it? But if the freak rainstorm does occur and wipes out the $200M dam (and possibly lives too), what confidence limit can I accept that the rainstorm will not occur in the next 100 years? This is how statistics becomes an assessment of risk and consequences.
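The dam arithmetic can be written down in a couple of lines. This is a deliberately crude sketch: it uses the $20M and $200M figures above, assumes the strengthening removes the risk entirely, and ignores lives and discounting, all of which are simplifying assumptions for illustration.

```python
def break_even_probability(mitigation_cost, loss_if_failure):
    """Failure probability above which paying for mitigation has the lower expected cost."""
    return mitigation_cost / loss_if_failure

def expected_costs(p_failure, mitigation_cost, loss_if_failure):
    """Expected cost of (strengthening, not strengthening), assuming strengthening removes the risk."""
    return mitigation_cost, p_failure * loss_if_failure

threshold = break_even_probability(20e6, 200e6)
print(f"strengthen if P(storm within the design life) exceeds {threshold:.0%}")
for p in (0.01, 0.10, 0.25):
    with_fix, without_fix = expected_costs(p, 20e6, 200e6)
    print(f"p={p:.2f}: strengthen ${with_fix / 1e6:.0f}M vs. do nothing ${without_fix / 1e6:.0f}M expected")
```

On money alone the break-even is a 10% chance over the design life; putting any value on the lives at risk pushes the acceptable probability far lower, which is the point about the threshold depending on what rides on the result.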

A better question is at what level of significance, or at what probability is the hypothesis falsified? In the case of the rockets that is at 92%. Sure it doesn’t meet the ‘95%’ criterion but it tells the military commander that there is only an 8% chance that the rocket tests gave the results that they did by chance. This is what the commander’s intuition in Jaynes’ example is telling him: that the second lot of rockets really is better, so he may go with that and order the second type of rocket. He may decide that that probability is still not low enough to be sure that this test hasn’t given a fluke result and order more tests until the probability that it could have happened by chance is low enough that he can place the order for rockets with confidence.

The important point, back to the GCM models and Douglass’ and Santer’s treatments of them, is that Santer et al. may be strictly correct that the hypothesis that the models agree with the observed satellite data is not falsified at the 95% confidence level – but that does *not* mean that the models therefore do agree with the observational data. 95% confidence level is typically used in science – but is still arbitrary. A better question is what is the confidence limit at which the hypothesis is falsified? If that is 92% then you are pretty damn sure that there is only a small chance that the result could have arisen by random fluctuations. More data and a longer time frame help push that confidence limit in one direction or another. You can never be 100% sure of anything, there is always a chance, however small that the result you observe may be a random fluke but you need to look at statistics in a case like this from the point of view of risk and probability, not having an arbitrary threshold that you have to cross. Just because your system hasn’t crossed that threshold does not make it the same, as the military commander understands, but Santer et al. apparently do not.

Sorry Steve, I think you are rather wide of the mark there. For example, when you say

In the post you link to I was pointing out that your criticism “Puh-leeze” of Santer et al for using model runs from 1979-1999 was perhaps a little unfair as in fact Douglass et al. had used the same interval (presumably that is why the reviewer insisted on it). I said:

Seemed pretty clear to me.

Now did I criticise Douglass et al for not using up-to-date information? No, I criticised them for not comparing apples-with-apples and thereby introducing a bias into the test, which means that even an ensemble capable of accurately predicting daily weather 30 years into the future is not guaranteed to pass the test.

Now if you think someone is being inconsistent, perhaps you ought to stop and make sure you actually understand the argument they are putting forward. Perhaps you have missed some subtle point.

I am glad you have read Jaynes. Read his book as well. However you have singularly missed my point (repeatedly) on my view of frequentist stats. Frequentist confidence intervals are fine, as long as you treat them as confidence intervals and not Bayesian credible intervals. Although the actual interval is the same in simple cases such as this, the interpretation is not.

I have to say I have run out of enthusiasm for repeatedly explaining subtleties of statistical tests on CA, but I will leave you with a quote from Jaynes’ book, where he quotes James Bernoulli (1654-1705) as saying:

plus ça change…

TTFN

I really don’t see what relevance Bayes vs Frequentist has in the Santer et al test. They’re applying a sub-optimally powered test (not based on the maximum likelihood estimate) with a clearly misspecified model (ie. AR1, Normal errors and a linear trend in the temperatures). It’s the wrong question answered in the wrong way and nothing to do with which statistical paradigm is applied.

Brown and Sundberg’s profile likelihood approach is not Bayesian because there is no prior information and the confidence intervals are defined via long run probabilities not epistemic probability. Certainly Fisherian statistics does aim to extract all the information from the data, where it diverges from Bayes is in not using external prior information of dubious validity.

I’m probably being a bit touchy because one of my main bugbears is people assuming Bayes is some panacea for statistical problems and arguing for it based on some caricature of non-Bayesian statistics.

All I can say is God help us if Mann et al embrace ‘subjective probability’ in a big way!

Re: Andrew (#9),

I agree. I do not understand either how this is heading to a “Bayesian vs Frequentist statistics” question. For those interested in philosophical matters of probability I recommend the book The Search for Certainty. The Clash of Science and Philosophy of Probability by Krzysztof Burdzy, available here: http://www.math.washington.edu/~burdzy/Philosophy/

I propose a different interesting question.

You have only one model and not the many that Santer et al. and Douglass et al. consider. Statistics of many models is gone. Now let the observational data sets [with average and standard deviation] remain the same.

What is the proper scientific way to compare this model to the observations?

Re: dh (#11),

In my field the usual way is simply by percentage error which implies that you have good data. But then we would also do a full sensitivity test of the input parameter ranges. That isn’t too useful with climate models because the huge uncertainties would invalidate the entire exercise. Hence a climate model is effectively based on a Bayesian approach anyway. That fact should discourage us from using frequentist statistics for the models I’d have thought.

But the real oddity is that we should be suspicious if all the data errors were in the same direction. Such a finding is at odds with the most basic foundation of statistics – gaussian distributions, means and sd’s. Moreover any data that is partly modeled is also likely based on Bayesian reasoning too and so probably shouldn’t be using frequentist stats either.

So it’s an eyeball test for me.

Re: JamesG (#14),

Reading my own words – that we usually have good data to compare models with – brings me to another oddity: If you don’t trust the available observational data then how do you know what the heck you are attempting to model? Isn’t it all then just a complete waste of time?

Steve: As I mentioned before, the “data” is itself highly modeled and the dispute between UAH and RSS lies in how this modeling is done. UAH and RSS don’t dispute what came from the satellite, merely how to interpret this as a temperature time series. So the dichotomy you’re making here is overdrawn.

Re: JamesG (#14),

Ok Steve but they don’t trust the Sondes either do they?

One interesting approach sometimes used in econometric analysis is a version of Leamer’s Extreme Bounds Analysis, which as I understand it essentially asks you how strong your prior beliefs would have to be to come to a given conclusion in the face of the observed data. This inverse question seems like a more productive way to have discussions and disagreements than the simple “rejected!” or “not rejected!” debates we commonly see. One contribution of this method when applied to some standard economic empirical controversies was to show that the data were not nearly strong enough to rule out competing hypotheses if you allowed even mild priors in favor of them.

I’m probably going totally OT, but in inferring phylogenetic trees from DNA sequences, various evolution models (dealing with various substitution or mutation rates) are applied using maximum likelihood, distance/similarity, parsimony and Bayesian methods. The authors of these approaches write about how well their model applied to data reproduces the real course of events. The humble geneticists usually run their data through all these “mills” just to be on the safe side, although the trees differ, sometimes quite a lot. For my data, especially Bayesian trees (with MC randomizations) looked weird.

There are a few not very widely cited papers basically saying that the method or evolutionary model is rather irrelevant if the original dataset of sequences is robust enough (in which case all trees produced by various methods have the same branch topology). I was once asked by a referee *to drop* a reference to one of these papers stressing the dataset importance from my manuscript. Apparently it is not fashionable to point out that the evolution models are only as good as the data fed to them. I haven’t heard about someone trying to get a “true” tree by averaging the output of ML, Bayesian and parsimony approaches, though.

Steve – Thank you for this interesting post. It really helped clarify one of the issues that has bothered me about many of the statistical arguments – that is, the seemingly general conclusion that if you can’t show that the distributions or trends are different at the 95% confidence level, then you must treat them as being the same, which, as shown by Jaynes’ examples, often defies common sense. It is the same as concluding that because a jury found a criminal not guilty then he did not commit the crime (OJ anyone?).

Perhaps it really gets back to making sure we are asking the right question (Are the models consistent with the data? *not* Are the data consistent with the models?) and making sure that the null hypothesis is structured accordingly.

BobN

Being a very primitive Popperian, I developed a hypothesis testing approach to evaluating model performance:

Loehle, C. 1997. A Hypothesis Testing Framework for Evaluating Ecosystem Model Performance. Ecological Modelling 97:153-165

and then applied it to hydrologic data

Loehle, C. and G. Ice. 2003. Criteria for Evaluating Watershed Models. Hydrological Science and Technology 19:1-15

The method is data-centric, in that the degree to which you have confidence that you know what the real system does limits how well you can evaluate your model (like power analysis). Then, test criteria are developed that evaluate degree of model deviation from the true system behavior (not accept or reject only). I can mail copies to anyone if they want (no pdfs). In my approach, a model ensemble best-estimate behavior would be a single hypothesis to be tested against the data, not a “population”.

Re: Craig Loehle (#17),

Exactly my position, Craig. A model ensemble best estimate is simply another model, nothing else, and has no knowable statistical properties outside its own predicted data points.

Re: Alan Wilkinson (#22),

As a layman, the above quote seems to cut right to the point. I don’t see how this could be false.

I have a physicist friend who tells me that no one knows what causes inertia. This is interesting in itself. He also tells me that Einstein believed (surmised?) that inertia is caused by the distant stars. (!) Of course Einstein was a great scientist, but that truth does not elevate his belief/surmise to the level of a scientific hypothesis. The notion that inertia is caused by the distant stars is not a scientific hypothesis precisely because it cannot be tested in any scientific way; it is metaphysical, or “metascience”. It seems to me that climate science tends at least somewhat also to metascience to the extent that experiments cannot be conducted on the climate, just as they cannot be conducted on the distant-stars notion of inertia causality. So statisticians try to tease out information from data not obtained by experiment, and it’s the statistical methodology which is really at the crux of it in the absence of scientific experiments. Steve M. continually makes this point, for which I am mighty grateful. Any climate scientist worth his salt ought to be equally grateful for all the Steve M.s that insist on poring over their methodology and holding them to the most rigorous standards.

Re: Bill Larson (#18),

You’ll find that explanation of inertia called “The Mach Principle”. Interestingly Mach was one of the last great scientists to disbelieve the atomic theory – which only goes to show.

I have a physics professor friend who told me that 90% of what she hypothesizes turns out to be wrong. For grad students it’s more like 95%, for undergrads 99.9% …

I’ll tell you what causes inertia in science – fear. Fear of being wrong. Fear of public humiliation. Peer pressure is an excellent producer of inertia. I often wonder if Einstein would have produced his four 1905 papers if he’d been inside academia at the time, with an academic career to defend in front of his academic peers.

That’s what was so refreshing about my friend telling me this, because unless you’re prepared to change your beliefs in the face of new evidence then you’ll never discover something new.

I have the Douglass paper. Interesting. Can someone point me to the monthly UAH data with error? I have only seen the data and then the trend with no mention of individual point errors. Also RSS data. Until these are appropriately combined with errors there is no point talking about linear trends and the like. From what I have seen already there is quite a lot of variation in the data (without errors) within the space of a few years.

Beaker,

What exactly do you mean when you say you are an objective Bayesian?

Good read so far, Jean S. Burdzy’s list of real-estate advertisements is hilarious (particularly the last 5 contributed by a reader, apparently).

Mark

Hi all,

There is some confusion about what a classical significance test actually says. It is an *indirect* probability statement about what you want to know. Or, in plain English, it is a statement about something you don’t want to know.

What you want to know is: what is the chance that these two missile types are different?

A classical significance test first abstracts the two things—the observables, the missile angles—with probability models. Those probability models (one for each type) have parameters, most of which are not of main interest either.

The next thing that happens is the strangest: one (or more) parameters from the probability model of each observable are said to be *exactly equal*. Then the classical statistician says, “Given that my probability models are correct, and that some of the parameters for the two observables’ probability models are exactly equal, what is the probability that I see a test statistic as large as the one I got?”

This is why Jaynes’s example is fantastic. The angles of the missiles are modeled with two different probability distributions, the parameters are assumed to be equal, and a probability that the, in this case, F statistic would be larger if we repeated the test is calculated.

Apparently, the statistic isn’t that improbable. But so what?

Who cares what the chance is that some weird statistic would be larger if we ran the experiment many more times?

What you really care about is, what are the chances the second missile type is better?

And that’s the kind of direct question you can answer using Bayesian statistics.
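As a rough illustration of what answering the direct question could look like, here is a Monte Carlo sketch. It assumes normally distributed firing angles and a Jeffreys (noninformative) prior for each type, under which (n-1)s²/σ² has a chi-square distribution with n-1 degrees of freedom. The sample sizes and standard deviations are hypothetical, not the numbers in Jaynes’ paper.

```python
import random

def prob_second_less_scattered(s_a, n_a, s_b, n_b, draws=20000, seed=1):
    """Monte Carlo estimate of P(sigma_B < sigma_A | data) under Jeffreys priors."""
    rng = random.Random(seed)

    def draw_variance(s, n):
        # Posterior draw: (n-1)*s^2 / sigma^2 ~ chi-square(n-1),
        # and chi-square(k) is a gamma distribution with shape k/2 and scale 2.
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)
        return (n - 1) * s * s / chi2

    wins = sum(draw_variance(s_b, n_b) < draw_variance(s_a, n_a) for _ in range(draws))
    return wins / draws

# Hypothetical sample standard deviations (degrees) from 10 firings of each type.
p_b_better = prob_second_less_scattered(s_a=1.2, n_a=10, s_b=0.8, n_b=10)
print(f"P(type B is less scattered than type A) ~ {p_b_better:.2f}")
```

With these made-up numbers a two-sided F test at the 95% level would not reject equality, yet the posterior probability that type B is the tighter system comes out well above one half.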

Understand that in *no* way can a classical significance test answer this question. (Incidentally, it wasn’t designed to; Fisher was a Popperian too and loved the idea of falsifiability.)

I don’t know how many of our readers at CA realize it, but the lengthy 83 page document consists of 39 pages of presentation of Jaynes’ ideas followed by several critiques by “mainstream” statisticians (with further responses by Prof. Jaynes). The critiques are worth reading as you read the main text to provide a balanced approach to the whole exercise of informing oneself about the Bayesian ethos.

I get cranky when I observe the common tactic of Bayesian advocates (see William M Briggs (#29)) of exaggerating and misrepresenting the meanings of concepts from mainstream statistics. For example,

I don’t understand what seems to be so strange about asking the simple question “If the two samples came from the same situation (i.e., population), what is the chance that I will observe as much difference (or more) in the sample results as I just saw?” In effect, that is what the null hypothesis is all about. No one states unequivocally that anything is “exactly equal”. It is a simple “what if” question. The probability value is a measure of how the samples differ that is both meaningful and easily understood. Along with the power of the test, you get a pretty good idea of how often you might be right and how often you might make errors in basing decisions on the test.
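That “what if” question can be asked almost literally with a permutation test, sketched below. This is a stand-in chosen for transparency, not the F test discussed in Jaynes’ example: pool the two samples, reshuffle the labels many times, and count how often the reshuffled difference in means is at least as large as the one observed. The function name and data are illustrative.

```python
import random

def permutation_p_value(a, b, n_perm=10000, seed=0):
    """Share of label reshufflings whose mean difference is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        # Small tolerance so floating-point rounding does not drop exact ties.
        if diff >= observed - 1e-12:
            hits += 1
    return hits / n_perm

# Made-up samples: clearly separated, so few reshufflings match the observed gap.
p_separated = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
# Identical samples: every reshuffling is at least as extreme as a difference of zero.
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])
print(p_separated, p_identical)
```

The resulting number is exactly the chance of seeing at least this much difference if the labels were interchangeable; what to do with that number is the decision question argued over in this thread.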

I definitely would be interested in an answer to such a question if I could get one without having to rely on the tooth fairy bringing me a prior. Mainstream statisticians gladly use any extra information about population parameters should such be available, but it should be based on real information, not a prior distribution chosen because the math can be worked out with it. By the way, how do you interpret the answer to the question? Using the “frequentist” definition? It is specious results like these that give credence to statements like “the probability that the current warming is caused by humans is x”.
