## Re-read Pielke Jr on Consistency

Roger vs Annan here

1. TA
Posted Aug 13, 2010 at 2:21 PM | Permalink

This is a hilarious post by Roger Pielke Jr.

2. Posted Aug 13, 2010 at 5:16 PM | Permalink

Very funny – Megan is great!

3. orkneygal
Posted Aug 13, 2010 at 5:35 PM | Permalink

I’m an undergraduate and I’m not sure that I agree with Megan’s advice.

I think the more appropriate test in this situation, which should be possible to conduct, is either the Kolmogorov-Smirnov test or the Wilcoxon Mann-Whitney test.
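For readers who want to see what a distribution-free comparison looks like, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). The raw data are not available in this thread, so the numbers below are hypothetical placeholders, and the function name `ks_statistic` is just an illustration.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        # ECDF value = fraction of the sample that is <= x
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(fa - fb))
    return gap

# Hypothetical trend values, NOT the data discussed in the thread
model_trends = [0.05, 0.12, 0.18, 0.21, 0.25, 0.30, 0.41]
observed_trends = [-0.15, -0.10, -0.05, 0.00, 0.02]

d = ks_statistic(model_trends, observed_trends)
print(f"KS statistic D = {d:.2f}")
```

Turning D into a p-value requires the KS distribution for the two sample sizes, which this sketch leaves out.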

• Robbo
Posted Aug 13, 2010 at 6:04 PM | Permalink

Agree. What is the basis for the normality assumption ?

• Posted Aug 13, 2010 at 10:39 PM | Permalink

My post was much more about the philosophy of consistency than a particular statistical test

• orkneygal
Posted Aug 14, 2010 at 4:34 AM | Permalink

Dr. Pielke-

So, if I understand your comment, that discussion was more about ethics than it was about statistics.

We did Ethics of Science last year.

The thing I remember above all else from that is-

“If it cannot be reproduced, then it is not science.”

• snowrunner
Posted Aug 14, 2010 at 10:38 AM | Permalink

I agree this is about philosophy of science rather than details, so here are some philosophical points that seem to have got lost in all the fighting:
1) A model is used to make predictions (ok, there are other purposes, but concentrate on this one).
2) A prediction is useless without a measure of uncertainty. If someone says the temperature rise in the next 100 years will be 2C, that is useless if you don’t know whether that means 2 +/- 0.1, or 2 +/- 5. If you just say “2” then you are guaranteed to be wrong.
3) A prediction is still useless if the uncertainty is very large (eg 2 +/- 5) or if it is wrong (eg too tight error bars).
4) The only useful prediction is one that has a measure of uncertainty produced in a plausible way that has been tested to some degree.
5) One way to test a prediction is to compare it with observations of whatever was being predicted. The comparison is then whether or not the truth falls within the range of uncertainty given by the predictor. Remember that the truth may be measured with uncertainty, but that it does have a real single value (it is very rarely a distribution).

An individual model run is therefore useless as a prediction, because it has no uncertainty attached. One way of getting some uncertainty is to do multiple runs of the same model with different initial conditions. This is very unlikely to produce enough variability however, so the uncertainty would be too small, and therefore the prediction of no use. If you are in the happy position of having many different models, all making different assumptions and giving different predictions, you might use the whole range of predictions as a measure of uncertainty. If you want to give a point estimate as well as the range, you could use the mean, or the median, or anything else. This is the least important part of the prediction.

If the range of models is very wide, then the prediction still runs the risk of being useless (because the uncertainty is too large). If the uncertainty is small enough to give a useful prediction, it has to be verified (not that the prediction is correct, this can’t in general be done until too late, but that the method is plausible). This can be done by running the models on historic data and comparing the results with the truth. If the truth falls within the range of predictions, then there is no reason to say that the method is flawed (you have to make a judgment about what “within” means, at the 99 percentile probably doesn’t count, at the 80 percentile probably does).

If the modelers reported their predictions for the tropospheric temperature with the full uncertainty given by the range of models, then using data only up to 1999 you couldn’t say they were wrong (though you might say the uncertainty was so large it was useless). Using the data to 2009 you might start to think that the uncertainty was wrong and so again the prediction was useless.

So, if you go from your blue line to the red one, you are going from a prediction that is useless because the uncertainty is incorrect, to one that is useless because the uncertainty is too large.

Or to put it more succinctly, “consistency” is a necessary but not sufficient condition for a useful prediction.
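The hindcast check described above can be sketched in a few lines: find where the observed value sits in the ensemble, then decide whether that counts as "within". The ensemble values, the observed value, and the 80% central range are all assumptions made for illustration, not numbers from the thread.

```python
def percentile_rank(ensemble, truth):
    """Percentage of ensemble members at or below the observed value."""
    return 100.0 * sum(1 for x in ensemble if x <= truth) / len(ensemble)

def within_central_range(ensemble, truth, coverage=80.0):
    """True if truth lies inside the central `coverage` percent of the ensemble."""
    tail = (100.0 - coverage) / 2.0
    rank = percentile_rank(ensemble, truth)
    return tail <= rank <= 100.0 - tail

# Hypothetical hindcast trends from ten models (deg C per decade)
hindcasts = [0.08, 0.11, 0.14, 0.15, 0.17, 0.19, 0.22, 0.24, 0.27, 0.31]
observed = 0.12  # hypothetical observed trend

print(percentile_rank(hindcasts, observed))
print(within_central_range(hindcasts, observed, coverage=80.0))
```

Widening the ensemble makes this check easier to pass, which is the sense in which passing it is necessary but not sufficient.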

• Posted Aug 14, 2010 at 11:34 AM | Permalink

Nice overview, snowrunner.

• orkneygal
Posted Aug 14, 2010 at 7:21 PM | Permalink

snowrunner-

In properly done Monte Carlo simulations using appropriate Bayesian techniques, “uncertainty” describes the variability of input parameters, not the range of output values. Probability levels are assigned to output values when using Bayesian methods.

You seem to be talking about confidence intervals and confidence levels, which apply to the inferior, non-Bayesian approaches to modeling and which describe output variables. Yet you seem to be using the term uncertainty when discussing an output result instead.

Could you please clarify what you are trying to convey about the variability of model outputs?

TIA

• snowrunner
Posted Aug 15, 2010 at 5:06 AM | Permalink

I was using uncertainty in the common English sense that you don’t know exactly what something is, ie you have a guess at a future value of something (eg temperature trend) but you don’t know how close to the truth it is likely to be. You are uncertain about it. It is clearly true that a point estimate will be wrong, but that there must be some interval which will be correct in some sense. If you were a frequentist you would report a confidence interval, a Bayesian would report a credibility interval. It is difficult to use these definitions in the case of climate models, because they are not stochastic, and are not fitted to data in the usual way of statistical models. One approach is to say that the various models are random realisations from some distribution of models, and that somewhere within this distribution is what will turn out to be the truth. You can then give an interval for plausible values of the truth under this assumption. A Bayesian might regard this distribution as the prior distribution for the parameter of interest (the trend) and in the absence of any more information this would be the only way of calculating a credibility interval. It is a bit harder to come up with a frequentist interpretation.

I don’t suppose that helped very much.

• orkneygal
Posted Aug 15, 2010 at 6:27 AM | Permalink

You are quite right. Your commentary didn’t help my understanding at all.

So let me ask the question about the way climate models work a different way……..

Let us consider one of the important output variables of a climate model, such as lapse rate over the equator in the year 2335.

Would the climate model output have a probability value associated with each possible value or would the model output have an expected value with a confidence interval and confidence level associated with it?

TIA

PS What in the world is a “credibility interval”? That phrase is not in my modeling textbook. Could you give me a link about “credibility interval” so I can study it?

• snowrunner
Posted Aug 15, 2010 at 7:21 AM | Permalink

Orkneygal: Thought it wouldn’t help. For your example of lapse rate, I don’t know what this is, but if it is an output of a model, then a single run only gives a single point value. Since this is no use to anyone, there has to be some attempt to quantify the uncertainty in some other way. One way is to run the model with different starting values, which will result in different point values for the lapse rate. The distribution of these can be used as an estimate of the range of possible true values. It is obvious however that a single model even with different starting values will not give enough variability, which is why a range of models is used.

A credibility (or “credible”) interval is the Bayesian equivalent of a confidence interval. From the posterior distribution of a parameter (ie the distribution given the data and the prior) take the interval that includes 95% of this distribution.
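As a rough illustration of that definition, the sketch below takes samples from a posterior and reads off the equal-tailed interval holding 95% of them. The posterior here is a made-up normal distribution (mean 0.2, sd 0.05), chosen only so there is something to sample from, and `credible_interval` is a hypothetical helper name.

```python
import random

def credible_interval(samples, mass=0.95):
    """Equal-tailed interval containing `mass` of the posterior samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    lo = ordered[int(tail * len(ordered))]
    hi = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return lo, hi

random.seed(0)
# Hypothetical posterior for a trend: normal with mean 0.2 and sd 0.05
posterior = [random.gauss(0.2, 0.05) for _ in range(10_000)]

lo, hi = credible_interval(posterior)
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

An equal-tailed interval is only one convention; a highest-density interval would be another reasonable choice.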

• orkneygal
Posted Aug 15, 2010 at 9:16 PM | Permalink

Thank you for your response Snowrunner. The example we did in class was modeling of a power system grid.

That was complicated but the results were straightforward. This climate modeling seems to be so much different.

I’ll just have to study this more.

• Ron Cram
Posted Aug 14, 2010 at 2:06 AM | Permalink

Orkneygal,
Fine. Will you perform both of the tests and report your findings?

• orkneygal
Posted Aug 14, 2010 at 4:35 AM | Permalink

Certainly, Ron Cram-

Just post the raw data and I’ll be happy to.

• Ron Cram
Posted Aug 15, 2010 at 10:23 PM | Permalink

I don’t have the raw data. Perhaps you can get it from lucia. Have you visited her blog? It’s very good. Just post a request for the data and I bet she will post it or email it to you.

http://rankexploits.com/musings/

4. Posted Aug 13, 2010 at 9:30 PM | Permalink

I’m not a statistician, but why should that stop me.

Group 1: 55 points , mean .19 , stdev .21 .
Group 2: 5 points , mean -.07 , stdev .07 .

The underlying assumption must be that the variation in these values is due to a random difference in say ring-width as a response to say temperature, overlying a reliable relationship of width to temperature. That is a big thing to prove. Assume it is true for the sake of argument.

Question: Are these two groups different samples from the same population?

The relevant test is not to compare the mean values. They could be two different groups which happen to have nearby mean values. As Pielke points out, the less reliable the measurements (or model predictions) for Group 1 (the larger its stdev) then the more likely the mean of Group 2 will be “consistent”.

I suggest this analysis: Assume that Group 1 represents the population. What is the probability that a selection of 5 points from this population would have the observed mean and standard deviation of Group 2?

Maybe that is what the “unpaired t-test” is doing. I don’t know, but I’m willing to learn.
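One way to put numbers on the analysis suggested above is a simple z calculation: treat Group 1 as the population and ask how unusual a mean of -.07 would be for 5 independent draws from it. This is a sketch under a normality assumption that the thread itself questions, and it looks only at the mean, not at the standard deviation of Group 2.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in the comment above
n1, mean1, sd1 = 55, 0.19, 0.21   # Group 1, treated here as the population
n2, mean2 = 5, -0.07              # Group 2

# Standard error of the mean of 5 independent draws from the Group 1 population
se_mean = sd1 / sqrt(n2)
z = (mean2 - mean1) / se_mean

# Two-sided tail probability, ASSUMING the population is normal
p_two_sided = 2.0 * NormalDist().cdf(-abs(z))
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}")
```

With these inputs z comes out a little beyond -2.7, so under these assumptions a mean that low would be uncommon for a sample of 5.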

• Posted Aug 14, 2010 at 11:04 AM | Permalink

@Andrew: Regarding the two sample t-test, the null is:

H0: The two population means are the same

For the model runs, the variation is created by the initial conditions and internal components. For measurements, variation is created by measurement method, location, and random factors. The population for the models is “all possible runs of all possible models”. The population for the measurements is “all possible measurements at all possible locations using all possible methods”.

Your test statistic is d, the difference of sample means.

You reject the null if the probability of observing a difference in sample means that is as far away from zero as the value of d you got is small enough (where small enough traditionally means one of less than 10%, less than 5%, less than 1%, or less than 0.1%).

I tried to explain the philosophy of the test. I do not condone just calculating averages of some slope coefficients and doing a t-test on them.

Now, if there is very little variation in the model output and very little variation in the actual measurements of the corresponding variable, then you are essentially comparing two points, and there is very little chance of failing to reject the null.

Introducing more variation in the model output would increase those chances, but it would not increase our knowledge.

The only way that can be done is by having models generate output which more closely matches observations. Of course, a model that just takes data and outputs it would match observations perfectly, but would also be useless.

I have no problem with model output not being able to match perfectly the set of observations we have. What I have a problem with is using those models to make projections hundreds of years in to the future. From The Hitchhiker’s Guide:

The Total Perspective Vortex derives its picture of the whole Universe on the principle of extrapolated matter analyses.

To explain — since every piece of matter in the Universe is in some way affected by every other piece of matter in the Universe, it is in theory possible to extrapolate the whole of creation — every sun, every planet, their orbits, their composition and their economic and social history from, say, one small piece of fairy cake.

The man who invented the Total Perspective Vortex did so basically in order to annoy his wife.

Trin Tragula — for that was his name — was a dreamer, a thinker, a speculative philosopher or, as his wife would have it, an idiot.

And she would nag him incessantly about the utterly inordinate amount of time he spent staring out into space, or mulling over the mechanics of safety pins, or doing spectrographic analyses of pieces of fairy cake.

Have some sense of proportion! she would say, sometimes as often as thirty-eight times in a single day.

And so he built the Total Perspective Vortex — just to show her.

And into one end he plugged the whole of reality as extrapolated from a piece of fairy cake, and into the other end he plugged his wife: so that when he turned it on she saw in one instant the whole infinity of creation and herself in relation to it.

To Trin Tragula’s horror, the shock completely annihilated her brain; but to his satisfaction he realized that he had proved conclusively that if life is going to exist in a Universe of this size, then the one thing it cannot afford to have is a sense of proportion.(emphasis mine)

• Posted Aug 14, 2010 at 11:18 AM | Permalink

Regarding consistency:

Suppose a crime is committed at 6 pm and John Doe is a suspect.

Suppose you had seen him have lunch at noon.

– That is not inconsistent with him having committed the crime.

– That is also not inconsistent with him not having committed the crime.

Note the use of the cumbersome phrase “not inconsistent with” rather than the following:

– That is consistent with him having committed the crime.

– That is consistent with him not having committed the crime.

The word consistent has a specific meaning in statistics. The appropriate phrase to use when observations fall within the observed range of model outputs is “models are not inconsistent with observations”. That is a more accurate verbalization of our ignorance.

5. Alan Bates
Posted Aug 14, 2010 at 3:28 AM | Permalink

To quote Andrew_M_Garland:

I’m not a statistician, but why should that stop me.

I have understood (I may have misunderstood …) that the Student’s t-test is to compare 2 means, not 2 standard deviations. You can do that with the F test. The unequal standard deviations can be dealt with by a simple calculation. The t-test is reputed to be relatively insensitive to the assumption of a Normal Distribution.

No one can do the non-parametric tests (e.g. the Wilcoxon Mann*-Whitney test) unless they have the data set.

*Mann – no relation, AFAIK

6. Lewis
Posted Aug 14, 2010 at 11:45 AM | Permalink

Roger, beautiful (and fun) post. Are you younger now than you were back then? Fighting talk, almost McIntyre-esque!

7. Mesa
Posted Aug 14, 2010 at 4:13 PM | Permalink

snowrunner-

I disagree with your post in this sense. If we take the individual runs of one model, it must have a similar stochastic content to the actual single real observation we have. For example, if my “models” are constant slope linear forecasts of (.2C, .4C, .6C, .2C, all with std=0) then I can say that the mean is .35C and the std of the ensemble is .19C and perhaps that’s within the observed trend of .2C, std .1C (say). The problem is when I run the forecast forward for 100 yrs, none of the individual models has sufficient natural variability to create a sufficient distribution of outcomes. The terminal distribution of the models in 100 yrs will be (2C, 4C, 6C, 2C), so the entire spread will be +2C to +6C. If the natural climate was a normal distribution, its terminal distribution would be +2C +/- .3C or so. So this is the danger of the use of the ensemble distribution when the variance does not match. Clearly, a stochastic component needs to be added to these models.

Also, philosophically, I differ in interpretation that the different inter-model runs are all like different future histories from one big meta-run. The collective individual runs of each *individual* model are each different future history distribution estimations. Each model is trying to predict the distribution of future histories of the same climate. To group them together as uncorrelated separate history runs is plain wrong as I see it. Since the models are all trying to do the same thing, they should be judged individually, not as a group.
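The arithmetic in the example above can be checked with a short sketch. Two things are assumptions on my part: the four trends are read as deg C per decade (which is what makes 100 years give 2C, 4C, 6C, 2C), and the "+/- .3C or so" is reproduced by treating the .1C natural variability as independent from decade to decade.

```python
from math import sqrt
from statistics import mean, stdev

# Mesa's four deterministic "models", read here as trends in deg C per decade
trends = [0.2, 0.4, 0.6, 0.2]
decades = 10  # 100 years

ensemble_mean = mean(trends)   # about 0.35
ensemble_sd = stdev(trends)    # sample standard deviation, about 0.19

# Each model has zero internal variability, so it lands on a single point
terminal = [t * decades for t in trends]
spread = max(terminal) - min(terminal)

# ASSUMPTION: natural variability of 0.1 C per decade, independent between
# decades, so the spread grows like sqrt(number of decades)
natural_sd_100yr = 0.1 * sqrt(decades)

print(f"ensemble: mean {ensemble_mean:.2f}, sd {ensemble_sd:.2f}")
print(f"terminal spread across models: {spread:.1f} C")
print(f"natural sd after 100 yr: {natural_sd_100yr:.2f} C")
```

The contrast is between an inter-model spread that grows linearly with time and a natural spread that, under the independence assumption, grows only with the square root of time.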

• snowrunner
Posted Aug 15, 2010 at 5:24 AM | Permalink

One of the many problems with the models is that (as you say) they do not give the correct variability. It would be very nice if the runs from each individual model could be regarded as future histories, but it is plain that they can’t. In fact the individual models give too little variability, you seem to be suggesting they give too much. However, you can quite easily show that in this sense each model is wrong, and you could just stop there. However, there is a lot of variability between models, and it is at least worth trying to see if this variability is useful as an estimate of the variability of the real climate. An alternative would be to add a stochastic component, and in many ways this would be a good approach, but it isn’t what is being done here. The “ensemble model” has the problem that it is difficult to understand exactly what the physics behind it is, but this is also true of a stochastic model. If you are only interested in the prediction it doesn’t really matter how you interpret the model, what matters is whether the prediction is any good.

I am not saying that the ensemble model is correct, or even that it is a good idea, just that it is worth trying. It isn’t a totally crazy idea though. Each model includes assumptions, simplifications and guesses at unknown parameters, so it is at least plausible that the variability between models in these assumptions is of the same order of magnitude as the variability in real climate.

• Mesa
Posted Aug 15, 2010 at 10:16 AM | Permalink

To be clear, I have suggested that the individual models give too little variability, a point on which everyone seems to agree. Clearly a stochastic component needs to be added.

Yes I am suggesting the **inter-model** variability is too high. As discussed here we can make it as high or low as we want by adding or subtracting models, bogus or otherwise. In addition as I explained, they should all be designed to produce future histories – ie they are trying to do the same thing. If they differ wildly as to future histories, compare them to the data and pick the best model.

As my example above shows, it seems suspiciously like the ensemble approach is designed to “not reject” the models, while also not providing the inherent stochastic variability that would do anything but allow the 100 yr projections to essentially just have a drift term.

It looks like another bogus statistical approach from the climate scientists.

• Mesa
Posted Aug 15, 2010 at 11:12 AM | Permalink

Also, having outlier models allows the scary scenarios to remain firmly in place as part of a “not rejected” ensemble. This is also preposterous when considering how these models are used for long term future projections.

8. JS
Posted Aug 15, 2010 at 12:59 AM | Permalink

I read through that thread and felt that some greater clarity might have been gained if the concept of type I and type II errors was raised. The fixation on 5% p-values (the size of the test) meant that people failed to consider the power of the test. A test with no power is no test at all. As Roger’s example showed – the test that was being proposed has no power at all because it can not reject anything. The power of the proposed tests sucked.

That is at least one area where there seemed to be a failure to comprehend.

For those who don’t like statistics, this is a technical paraphrase of Roger asking for a definition of “consistent with” – if everything is consistent with something then the test is no test at all (i.e. it has suckworthy power).
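To make the point about power concrete, here is a small sketch of a "within the ensemble spread" test. It assumes the observed mean is normally distributed around the truth and that the test rejects only when the observation falls more than k model standard deviations from the ensemble mean; the function name and all numbers are hypothetical.

```python
from statistics import NormalDist

def power_of_range_test(true_shift, obs_se, model_sd, k=2.0):
    """Probability that an observed mean lands outside +/- k*model_sd of the
    ensemble mean when the truth is really `true_shift` away from it."""
    obs = NormalDist(mu=true_shift, sigma=obs_se)
    limit = k * model_sd
    inside = obs.cdf(limit) - obs.cdf(-limit)
    return 1.0 - inside

# Hypothetical numbers: the truth is 0.3 away from the ensemble mean
powers = [power_of_range_test(0.3, obs_se=0.03, model_sd=sd)
          for sd in (0.05, 0.1, 0.2, 0.4)]
for sd, p in zip((0.05, 0.1, 0.2, 0.4), powers):
    print(f"model sd {sd}: power {p:.3f}")
```

As the model spread grows, the same real discrepancy is rejected less and less often, which is the sense in which a wide enough ensemble is consistent with anything.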

9. Hu McCulloch
Posted Aug 15, 2010 at 10:18 AM | Permalink

Perhaps Annan was confusing standard error (SE) with standard deviation (SD). When I first read Pielke’s post this morning, it looked to me like the means were not significantly different, but only because on just one cup of coffee I was erroneously interpreting SD as if it were SE. Each mean is about one of these from 0, so the difference can’t be more than 1.4 of them from 0.

But of course the SE’s are much smaller than the SD’s (by a factor of sqrt(N)), so equality of the means is a very easy reject.

It does appear that Megan’s internet test is based on the equality of the two SDs, while in fact they are obviously very different. There is a long literature on the “Fisher-Behrens” problem of testing the equality of two means when the variances are unequal, e.g. http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177697509 . It seems that there is no exact test, but a lot of good approximate tests. In the present case, however, it’s not even a close call, so any reasonable statistic would give an easy reject.
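One of the standard approximate tests for unequal variances is Welch's t-test. A minimal sketch, using the group statistics quoted earlier in this thread (55 points with mean .19 and stdev .21; 5 points with mean -.07 and stdev .07), would look like this. It computes only the statistic and the Welch-Satterthwaite degrees of freedom; the p-value would come from a t table or a statistics library.

```python
from math import sqrt

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom
    for two samples with unequal variances."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2   # squared standard errors of each mean
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

t_stat, df = welch_t(0.19, 0.21, 55, -0.07, 0.07, 5)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Here the denominator uses the standard errors, not the standard deviations, which is the distinction made in the comment above.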

• pete
Posted Aug 15, 2010 at 5:57 PM | Permalink

Borrowing from John V:

It’s an important statistical subtlety, but the two sets of trend estimates should not be expected to have the same variance.

If we were comparing the heights of Swedish men (per lucia’s analog), this would be the difference:

The 5 observations are 5 measurements of *one* Swedish man’s height. The 55 models are single measurements of each of 55 different Swedish men. Both sets of data attempt to determine the average height of Swedish men. The variance is not the same.

So the appropriate scaling for the difference is $\sqrt{{SE}^2_{obs}+{SD}^2_{models}}$.
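Plugging the group statistics quoted earlier in this thread into that scaling gives a quick sense of how much the choice matters. This is a sketch only: it assumes the SD of .07 over 5 observations is what $SE_{obs}$ should be built from, and that .21 is the model SD.

```python
from math import sqrt

# Summary statistics quoted earlier in the thread
mean_models, sd_models = 0.19, 0.21
mean_obs, sd_obs, n_obs = -0.07, 0.07, 5

se_obs = sd_obs / sqrt(n_obs)               # standard error of the observed mean
scale = sqrt(se_obs**2 + sd_models**2)      # sqrt(SE_obs^2 + SD_models^2)
ratio = (mean_models - mean_obs) / scale    # difference in units of that scale

print(f"scale = {scale:.3f}, difference / scale = {ratio:.2f}")
```

Because the full model SD enters the scale, the ratio stays close to 1 here, far smaller than when both groups are scaled by their standard errors.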

• Skiphil
Posted Feb 13, 2013 at 4:05 PM | Permalink

Thanks, Hu — and I also find Lucia’s comment at Pielke Jr’s former blog to be very helpful in indicating how in some of the very heated debates people may be talking past each other and not really addressing the same issue at times: