Comments on: Santer and the Closet Frequentist

By: SD or SE: What the heck are ‘beaker’ and the other talking about? | The Blackboard

Fri, 24 Oct 2008 17:20:01 +0000

[…] at Climate Audit? It’s discussed in several threads related to “Santer17”. (See 1, 2 etc.) Are you wondering what it all […]

By: Lav

Lav — Wed, 22 Oct 2008 21:43:30 +0000

In reply to carlitos.

Re: carlitos (#26),
Maybe I wasn’t very clear. Sure, the rockets are almost certainly made by different manufacturers. What is being tested is one of their properties: their ability to be fired in a straight line, if I understand the example, and, if lots of them are fired, how tightly clustered their ‘hits’ will be when they reach their target. (A very practical exposition of standard deviation!)

So, the military commander wants to know: if he orders 100 rockets fired, which of the two systems is going to produce more hits? In Jaynes’ example, the military commander’s ‘tame’ statistician does a two-sided F test and proclaims that the two systems are statistically indistinguishable at the 95% confidence level, so there is no reason to choose one system over the other.

But this is the fallacy. Choosing a 95% confidence level is quite arbitrary. As I tried to show, what level of confidence you choose depends upon how much rides on the result. An evens bet may be good enough for a horse-race flutter, but I might only accept a 0.0001% chance that a freak rainstorm in a century can overwhelm the very expensive dam I am designing. To strengthen the dam to withstand that freak storm might cost another $20M on a $200M dam. Is it worth it? But if the freak rainstorm does occur and wipes out the $200M dam (and possibly lives too), what confidence limit can I accept that the rainstorm will not occur in the next 100 years? This is how statistics becomes an assessment of risk and consequences.
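The dam trade-off can be put into rough numbers. The sketch below is only an illustration: the $20M and $200M figures are the ones quoted above, while the annual storm probabilities, the assumption that strengthening removes the risk entirely, and the neglect of lives and discounting are all simplifying assumptions of mine.

```python
# Rough expected-cost comparison for the dam example.
# The $20M / $200M figures are from the comment; everything else is assumed.

def prob_at_least_once(annual_prob: float, years: int) -> float:
    """Chance of at least one storm in `years` independent years."""
    return 1.0 - (1.0 - annual_prob) ** years

def break_even_probability(extra_cost: float, loss_if_failure: float) -> float:
    """Lifetime failure probability above which strengthening pays for itself,
    assuming strengthening removes the risk completely."""
    return extra_cost / loss_if_failure

EXTRA_COST = 20e6   # cost of strengthening
DAM_VALUE = 200e6   # loss if the dam is wiped out (lives not counted)
threshold = break_even_probability(EXTRA_COST, DAM_VALUE)

# Hypothetical annual probabilities of the freak storm.
for annual in (1e-6, 1e-4, 1e-3, 2e-3):
    lifetime = prob_at_least_once(annual, 100)
    expected_loss = lifetime * DAM_VALUE
    decision = "strengthen" if lifetime > threshold else "do not strengthen"
    print(f"annual p={annual:g}: 100-year p={lifetime:.4f}, "
          f"expected loss=${expected_loss / 1e6:.1f}M -> {decision}")
```

The point of the sketch is that the acceptable probability falls out of the costs rather than being fixed in advance: with these numbers the break-even point is a 10% chance of failure over the dam's lifetime, and it would move if lives or other consequences were priced in.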

A better question is: at what level of significance, or at what probability, is the hypothesis falsified? In the case of the rockets, that is at 92%. Sure, it doesn’t meet the ‘95%’ criterion, but it tells the military commander that there is only an 8% chance that the rocket tests gave the results that they did by chance. This is what the commander’s intuition in Jaynes’ example is telling him: that the second lot of rockets really is better, so he may go with that and order the second type of rocket. Or he may decide that that probability is still not low enough to be sure that this test hasn’t given a fluke result, and order more tests until the probability that it could have happened by chance is low enough that he can place the order for rockets with confidence.

The important point, back to the GCM models and Douglass’ and Santer’s treatments of them, is that Santer et al. may be strictly correct that the hypothesis that the models agree with the observed satellite data is not falsified at the 95% confidence level, but that does not mean that the models therefore do agree with the observational data. The 95% confidence level is typically used in science, but it is still arbitrary. A better question is: what is the confidence limit at which the hypothesis is falsified? If that is 92%, then you are pretty damn sure that there is only a small chance that the result could have arisen by random fluctuations. More data and a longer time frame help push that confidence limit in one direction or another. You can never be 100% sure of anything; there is always a chance, however small, that the result you observe may be a random fluke. But you need to look at statistics in a case like this from the point of view of risk and probability, not in terms of an arbitrary threshold that you have to cross. Just because your system hasn’t crossed that threshold does not make the two things being compared the same, as the military commander understands but Santer et al. apparently do not.

By: RomanM

RomanM — Wed, 22 Oct 2008 21:19:40 +0000

I don’t know how many of our readers at CA realize it, but the lengthy 83-page document consists of 39 pages of presentation of Jaynes’ ideas followed by several critiques by “mainstream” statisticians (with further responses by Prof. Jaynes). The critiques are worth reading alongside the main text to provide a balanced approach to the whole exercise of informing oneself about the Bayesian ethos.

I get cranky when I observe the common tactic of Bayesian advocates (see William M Briggs (#29)) of exaggerating and misrepresenting the meanings of concepts from mainstream statistics. For example,

The next thing that happens is the strangest: one (or more) parameters from the probability model of each observable are said to be exactly equal. Then the classical statistician says, “Given that my probability models are correct, and that some of the parameters for the two observables’ probability models are exactly equal, what is the probability that I see a test statistic as large as the one I got?”

I don’t understand what seems to be so strange about asking the simple question “If the two samples came from the same situation (i.e., population), what is the chance that I will observe as much difference (or more) in the sample results as I just saw?” In effect, that is what the null hypothesis is all about. No one states unequivocally that anything is “exactly equal”. It is a simple “what if” question. The probability value is a measure of how the samples differ that is both meaningful and easily understood. Along with the power of the test, you get a pretty good idea of how often you might be right and how often you might make errors in basing decisions on the test.
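The “what if” question can be answered by brute force, which may make it more concrete. The following is a minimal sketch, not the calculation from Jaynes’ paper: the aiming errors are made-up numbers, the population is assumed normal, and the statistic (larger sample variance over smaller) is a simulated stand-in for the two-sided F test rather than the textbook version.

```python
import random

def sample_variance(xs):
    """Unbiased sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def variance_ratio(a, b):
    """Larger sample variance divided by the smaller one (always >= 1)."""
    va, vb = sample_variance(a), sample_variance(b)
    return max(va, vb) / min(va, vb)

def null_p_value(a, b, n_sim=5000, seed=0):
    """Estimate the chance of a variance ratio at least as large as the
    observed one when both samples come from the same normal population.
    The ratio does not depend on the population's mean or scale, so a
    standard normal is enough for the simulation."""
    rng = random.Random(seed)
    observed = variance_ratio(a, b)
    hits = 0
    for _ in range(n_sim):
        sim_a = [rng.gauss(0.0, 1.0) for _ in a]
        sim_b = [rng.gauss(0.0, 1.0) for _ in b]
        if variance_ratio(sim_a, sim_b) >= observed:
            hits += 1
    return hits / n_sim

# Hypothetical aiming errors in degrees for two rocket types (not Jaynes' data).
type_1 = [1.2, -0.8, 2.1, -1.9, 0.4, -2.5, 1.7, -0.3]
type_2 = [0.5, -0.4, 0.9, -1.1, 0.2, -0.7, 0.6, 0.1]

print("estimated p-value:", null_p_value(type_1, type_2))
```

Nothing in the simulation asserts that the two populations are equal; it only asks how often data this lopsided would turn up if they were.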

What you really care about is, what are the chances the second missile type is better?
And that’s the kind of direct question you can answer using Bayesian statistics.

I definitely would be interested in an answer to such a question if I could get one without having to rely on the tooth fairy bringing me a prior. Mainstream statisticians gladly use any extra information about population parameters should such be available, but it should be based on real information, not a prior distribution chosen because the math can be worked out with it. By the way, how do you interpret the answer to the question? Using the “frequentist” definition? It is specious results like these that give credence to statements like “the probability that the current warming is caused by humans is x”.

By: William M Briggs

William M Briggs — Wed, 22 Oct 2008 17:59:14 +0000

Hi all,

There is some confusion about what a classical significance test actually says. It is an indirect probability statement about what you want to know. Or, in plain English, it is a statement about something you don’t want to know.

What you want to know is: what is the chance that these two missile types are different?

A classical significance test first abstracts the two things—the observables, the missile angles—with probability models. Those probability models (one for each type) have parameters, most of which are not of main interest either.

The next thing that happens is the strangest: one (or more) parameters from the probability model of each observable are said to be exactly equal. Then the classical statistician says, “Given that my probability models are correct, and that some of the parameters for the two observables’ probability models are exactly equal, what is the probability that I see a test statistic as large as the one I got?”

This is why Jaynes’s example is fantastic. The angles of the missiles are modeled with two different probability distributions, the parameters are assumed to be equal, and then the probability is calculated that the statistic (in this case, the F statistic) would be larger if we repeated the test.

Apparently, the statistic isn’t that improbable. But so what?

Who cares what the chance is that some weird statistic would be larger if we ran the experiment many more times?

What you really care about is, what are the chances the second missile type is better?

And that’s the kind of direct question you can answer using Bayesian statistics.

Understand that in no way can a classical significance test answer this question. (Incidentally, it wasn’t designed to; Fisher was a Popperian too and loved the idea of falsifiability.)
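For readers who want to see what such a direct answer looks like, here is a rough sketch. Everything specific in it is an assumption and not taken from Jaynes: the aiming errors are invented, they are modeled as normal with zero mean, “better” is taken to mean “smaller spread”, and the prior on each variance is the conventional noninformative one proportional to 1/σ². A different prior would give a different number.

```python
import random

def prob_second_is_tighter(a, b, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(var_b < var_a | data).

    Assumed model: errors ~ Normal(0, var), prior p(var) proportional to 1/var.
    Under that model the posterior of the precision 1/var is a Gamma
    distribution with shape n/2 and rate sum(x^2)/2.
    """
    rng = random.Random(seed)

    def draw_variance(xs):
        shape = len(xs) / 2.0
        rate = sum(x * x for x in xs) / 2.0
        # random.gammavariate takes (shape, scale); scale = 1 / rate.
        precision = rng.gammavariate(shape, 1.0 / rate)
        return 1.0 / precision

    wins = sum(draw_variance(b) < draw_variance(a) for _ in range(n_draws))
    return wins / n_draws

# Hypothetical aiming errors in degrees for two missile types (not Jaynes' data).
type_1 = [1.2, -0.8, 2.1, -1.9, 0.4, -2.5, 1.7, -0.3]
type_2 = [0.5, -0.4, 0.9, -1.1, 0.2, -0.7, 0.6, 0.1]

print("P(second type has smaller spread):", prob_second_is_tighter(type_1, type_2))
```

The output is a probability statement about the missiles themselves, but it is conditional on the model and the prior that were fed in.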

By: Mark T.

Mark T. — Wed, 22 Oct 2008 17:40:19 +0000

Good read so far, Jean S. Burdzy’s list of real-estate advertisements is hilarious (particularly the last 5 contributed by a reader, apparently).

Mark

By: Patrick M.

Patrick M. — Wed, 22 Oct 2008 16:54:28 +0000

In reply to Alan Wilkinson. Re: Alan Wilkinson (#22),

"A model ensemble best estimate is simply another model"

As a layman, the above quote seems to cut right to the point. I don't see how this could be false.

By: carlitos

carlitos — Wed, 22 Oct 2008 16:14:33 +0000

In reply to Lav. Re: Lav (#7), you say that in Jaynes' examples of the manufactured components and the rockets, the statistics tell you "not that they are the same but the chance that they are different". I'd say the statistics tell you only the chance of getting the data you got if they were indeed equal. Why would you expect rockets from different vendors to be equal in the first place?

By: JamesG

JamesG — Wed, 22 Oct 2008 16:07:05 +0000

In reply to JamesG. Re: JamesG (#14), OK Steve, but they don't trust the sondes either, do they?

By: Jean S

Jean S — Wed, 22 Oct 2008 10:23:36 +0000

In reply to Andrew.

Re: Andrew (#9),
I agree. I do not understand either how this is heading toward a “Bayesian vs Frequentist statistics” question. For those interested in philosophical matters of probability, I recommend the book The Search for Certainty. The Clash of Science and Philosophy of Probability by Krzysztof Burdzy, available here:
http://www.math.washington.edu/~burdzy/Philosophy/

By: Peter D. Tillman

Peter D. Tillman — Wed, 22 Oct 2008 06:57:20 +0000

In reply to david elder.

Re: david elder (#4),

Brief commercial:
http://climateaudit101.wikispot.org/Glossary_of_Acronyms

will lead you to CCSP = http://en.wikipedia.org/wiki/Climate_Change_Science_Program

Steve, could you please substitute this list for the old one on your masthead? Ours really is a *lot* better.

Thanks & cheers, Pete Tillman