Roger vs Annan here

31 Comments
This is a hilarious post by Roger Pielke Jr.
Very funny – Megan is great!
I’m an undergraduate and I’m not sure that I agree with Megan’s advice.
I think the more appropriate test in this situation, which should be possible to conduct, is either the Kolmogorov-Smirnov test or the Wilcoxon-Mann-Whitney test.
Agree. What is the basis for the normality assumption?
My post was much more about the philosophy of consistency than a particular statistical test.
Dr. Pielke
Thank you for your clarification.
So, if I understand your comment, that discussion was more about ethics than it was about statistics.
We did Ethics of Science last year.
The thing I remember above all else from that is
“If it cannot be reproduced, then it is not science.”
I agree this is about philosophy of science rather than details, so here are some philosophical points that seem to have got lost in all the fighting:
1) A model is used to make predictions (ok, there are other purposes, but concentrate on this one).
2) A prediction is useless without a measure of uncertainty. If someone says the temperature rise in the next 100 years will be 2C, that is useless if you don’t know whether that means 2 +/- 0.1, or 2 +/- 5. If you just say “2” then you are guaranteed to be wrong.
3) A prediction is still useless if the uncertainty is very large (eg 2 +/- 5) or if it is wrong (eg too tight error bars).
4) The only useful prediction is one that has a measure of uncertainty produced in a plausible way that has been tested to some degree.
5) One way to test a prediction is to compare it with observations of whatever was being predicted. The comparison is then whether or not the truth falls within the range of uncertainty given by the predictor. Remember that the truth may be measured with uncertainty, but that it does have a real single value (it is very rarely a distribution).
An individual model run is therefore useless as a prediction, because it has no uncertainty attached. One way of getting some uncertainty is to do multiple runs of the same model with different initial conditions. This is very unlikely to produce enough variability however, so the uncertainty would be too small, and therefore the prediction of no use. If you are in the happy position of having many different models, all making different assumptions and giving different predictions, you might use the whole range of predictions as a measure of uncertainty. If you want to give a point estimate as well as the range, you could use the mean, or the median, or anything else. This is the least important part of the prediction.
If the range of models is very wide, then the prediction still runs the risk of being useless (because the uncertainty is too large). If the uncertainty is small enough to give a useful prediction, it has to be verified (not that the prediction is correct, this can’t in general be done until too late, but that the method is plausible). This can be done by running the models on historic data and comparing the results with the truth. If the truth falls within the range of predictions, then there is no reason to say that the method is flawed (you have to make a judgment about what “within” means: at the 99th percentile probably doesn’t count, at the 80th percentile probably does).
If the modelers reported their predictions for the tropospheric temperature with the full uncertainty given by the range of models, then using data only up to 1999 you couldn’t say they were wrong (though you might say the uncertainty was so large it was useless). Using the data to 2009 you might start to think that the uncertainty was wrong and so again the prediction was useless.
So, if you go from your blue line to the red one, you are going from a prediction that is useless because the uncertainty is incorrect, to one that is useless because the uncertainty is too large.
Or to put it more succinctly, “consistency” is a necessary but not sufficient condition for a useful prediction.
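The range check in point 5 can be sketched in a few lines of Python. This is only an illustration: the trend values and the helper name `within_ensemble_range` are hypothetical and are not taken from any model archive or from the post under discussion.

```python
def within_ensemble_range(ensemble, observed):
    """Return True if the observed value lies inside the min-max range of the ensemble."""
    return min(ensemble) <= observed <= max(ensemble)


# Hypothetical trend predictions (degrees C per decade) from four models.
tight_ensemble = [0.15, 0.20, 0.25, 0.30]
observed_trend = 0.05  # hypothetical observation

tight_result = within_ensemble_range(tight_ensemble, observed_trend)

# Adding one outlying (hypothetical) model widens the range.
wide_ensemble = tight_ensemble + [-0.10]
wide_result = within_ensemble_range(wide_ensemble, observed_trend)

print(tight_result, wide_result)
```

In this sketch the observation falls outside the tight ensemble but inside the widened one, even though no individual model got any closer to it. That is the sense in which passing the range check is necessary but not sufficient.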
Re: snowrunner (Aug 14 10:38),
Nice overview, snowrunner.
snowrunner
I’m not understanding your comments at all.
In properly done Monte Carlo simulations using appropriate Bayesian techniques, “uncertainty” describes the variability of input parameters, not the range of output values. Probability levels are assigned to output values when using Bayesian methods.
You seem to be talking about confidence intervals and confidence levels, which apply to the inferior, non-Bayesian approaches to modeling and which describe output variables. Yet you seem to be using the term uncertainty when discussing an output result instead.
Could you please clarify what you are trying to convey about the variability of model outputs?
TIA
I was using uncertainty in the common English sense that you don’t know exactly what something is, ie you have a guess at a future value of something (eg temperature trend) but you don’t know how close to the truth it is likely to be. You are uncertain about it. It is clearly true that a point estimate will be wrong, but that there must be some interval which will be correct in some sense. If you were a frequentist you would report a confidence interval, a Bayesian would report a credibility interval. It is difficult to use these definitions in the case of climate models, because they are not stochastic, and are not fitted to data in the usual way of statistical models. One approach is to say that the various models are random realisations from some distribution of models, and that somewhere within this distribution is what will turn out to be the truth. You can then give an interval for plausible values of the truth under this assumption. A Bayesian might regard this distribution as the prior distribution for the parameter of interest (the trend) and in the absence of any more information this would be the only way of calculating a credibility interval. It is a bit harder to come up with a frequentist interpretation.
I don’t suppose that helped very much.
You are quite right. Your commentary didn’t help my understanding at all.
So let me ask the question about the way climate models work a different way.
Let us consider one of the important output variables of a climate model, such as lapse rate over the equator in the year 2335.
Would the climate model output have a probability value associated with each possible value or would the model output have an expected value with a confidence interval and confidence level associated with it?
TIA
PS What in the world is a “credibility interval”? That phrase is not in my modeling textbook. Could you give me a link about “credibility interval” so I can study it?
Orkneygal: Thought it wouldn’t help. For your example of lapse rate, I don’t know what this is, but if it is an output of the model, then a single run only gives a single point value. Since this is no use to anyone, there has to be some attempt to quantify the uncertainty in some other way. One way is to run the model with different starting values, which will result in different point values for the lapse rate. The distribution of these can be used as an estimate of the range of possible true values. It is obvious however that a single model even with different starting values will not give enough variability, which is why a range of models is used.
A credibility (or “credible”) interval is the Bayesian equivalent of a confidence interval. From the posterior distribution of a parameter (ie the distribution given the data and the prior) take the interval that includes 95% of this distribution.
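As a rough illustration of that definition, here is a minimal Python sketch of an equal-tailed 95% credible interval. It assumes the simplest textbook case (a normal prior on the parameter and normally distributed observations with a known standard deviation), which is not how climate models are fitted. All numbers and function names are hypothetical.

```python
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, obs_mean, obs_sd, n):
    """Posterior for a normal mean, given a normal prior and n observations
    with known standard deviation obs_sd (standard conjugate update)."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / obs_sd**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * obs_mean)
    return NormalDist(post_mean, post_var**0.5)


def credible_interval(posterior, level=0.95):
    """Equal-tailed interval containing `level` of the posterior probability."""
    tail = (1.0 - level) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)


# Hypothetical prior (e.g. from a spread of models) and hypothetical data.
posterior = normal_posterior(prior_mean=0.2, prior_sd=0.2, obs_mean=0.1, obs_sd=0.1, n=5)
low, high = credible_interval(posterior)
print(round(low, 3), round(high, 3))
```

The interval is read directly off the posterior distribution, which is what distinguishes it from a frequentist confidence interval.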
Thank you for your response Snowrunner. The example we did in class was modeling of a power system grid.
That was complicated but the results were straightforward. This climate modeling seems to be so much different.
I’ll just have to study this more.
Orkneygal,
Fine. Will you perform both of the tests and report your findings?
Certainly, Ron Cram
Just post the raw data and I’ll be happy to.
I don’t have the raw data. Perhaps you can get it from lucia. Have you visited her blog? It’s very good. Just post a request for the data and I bet she will post it or email it to you.
http://rankexploits.com/musings/
I’m not a statistician, but why should that stop me?
Group 1: 55 points, mean .19, stdev .21.
Group 2: 5 points, mean .07, stdev .07.
The underlying assumption must be that the variation in these values is due to a random difference in say ring width as a response to say temperature, overlying a reliable relationship of width to temperature. That is a big thing to prove. Assume it is true for the sake of argument.
Question: Are these two groups different samples from the same population?
The relevant test is not to compare the mean values. They could be two different groups which happen to have nearby mean values. As Pielke points out, the less reliable the measurements (or model predictions) for Group 1 (the larger its stdev) then the more likely the mean of Group 2 will be “consistent”.
I suggest this analysis: Assume that Group 1 represents the population. What is the probability that a selection of 5 points from this population would have the observed mean and standard deviation of Group 2?
Maybe that is what the “unpaired t-test” is doing. I don’t know, but I’m willing to learn.
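The analysis suggested above can be approximated by simulation. The Python sketch below takes the Group 1 figures exactly as quoted in this comment and assumes the population is normal, an assumption questioned earlier in the thread. It looks only at the sample mean, not the standard deviation, and the function name is hypothetical.

```python
import random
from statistics import NormalDist


def prob_sample_mean_at_most(pop_mean, pop_sd, n, threshold, trials=20000, seed=0):
    """Monte Carlo estimate of P(mean of n normal draws <= threshold)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(pop_mean, pop_sd) for _ in range(n)) / n
        if sample_mean <= threshold:
            hits += 1
    return hits / trials


# Group 1 treated as the population; Group 2 is a sample of 5 with mean .07.
simulated = prob_sample_mean_at_most(0.19, 0.21, 5, 0.07)

# Closed-form check: the mean of 5 draws is normal with sd = pop_sd / sqrt(5).
exact = NormalDist(0.19, 0.21 / 5**0.5).cdf(0.07)
print(simulated, exact)
```

Under these assumptions the simulated and closed-form probabilities should agree to within sampling noise. A two-sample t-test differs in that it also accounts for the uncertainty in the Group 1 mean itself.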
@Andrew: Regarding the two-sample t-test, the null is:
H0: The two population means are the same
For the model runs, the variation is created by the initial conditions and internal components. For measurements, variation is created by measurement method, location, and random factors. The population for the models is “all possible runs of all possible models”. The population for the measurements is “all possible measurements at all possible locations using all possible methods”.
Your test statistic is d, the difference of sample means.
You reject the null if the probability of observing a difference in sample means that is as far away from zero as the value of d you got is small enough (where small enough traditionally means one of less than 10%, less than 5%, less than 1%, or less than 0.1%).
I tried to explain the philosophy of the test. I do not condone just calculating averages of some slope coefficients and doing a t-test on them.
Now, if there is very little variation in the model output and very little variation in the actual measurements of the corresponding variable, then you are essentially comparing two points, and there is very little chance of failing to reject the null.
Introducing more variation in the model output would increase those chances, but it would not increase our knowledge.
The only way that can be done is by having models generate output which more closely matches observations. Of course, a model that just takes data and outputs it would match observations perfectly, but would also be useless.
I have no problem with model output not being able to match perfectly the set of observations we have. What I have a problem with is using those models to make projections hundreds of years into the future. From The Hitchhiker’s Guide:
Regarding consistency:
Suppose a crime is committed at 6 pm and John Doe is a suspect.
Suppose you had seen him have lunch at noon.
 That is not inconsistent with him having committed the crime.
 That is also not inconsistent with him not having committed the crime.
Note the use of the cumbersome phrase “not inconsistent with” rather than the following:
 That is consistent with him having committed the crime.
 That is consistent with him not having committed the crime.
The word consistent has a specific meaning in statistics. The appropriate phrase to use when observations fall within the observed range of model outputs is “models are not inconsistent with observations”. That is a more accurate verbalization of our ignorance.
To quote Andrew_M_Garland:
I have understood (I may have misunderstood …) that the Student’s t-test is to compare 2 means, not 2 standard deviations. You can do that with the F test. The unequal standard deviations can be dealt with by a simple calculation. The t-test is reputed to be relatively insensitive to the assumption of a Normal Distribution.
No one can do the non-parametric tests (e.g. the Wilcoxon Mann*-Whitney test) unless they have the data set.
*Mann – no relation, AFAIK
Roger, beautiful (and fun) post. Are you younger now than you were back then? Fighting talk, almost McIntyre-esque!
snowrunner
I disagree with your post in this sense. If we take the individual runs of one model, it must have a similar stochastic content to the actual single real observation we have. For example, if my “models” are constant slope linear forecasts of (.2C, .4C, .6C, .2C, all with std=0) then I can say that the mean is .35C and the std of the ensemble is .19C, and perhaps that’s within the observed trend of .2C, std .1C (say). The problem is when I run the forecast forward for 100 yrs, none of the individual models has sufficient natural variability to create a sufficient distribution of outcomes. The terminal distribution of the models in 100 yrs will be (2C, 4C, 6C, 2C), so the entire spread will be +2C to +6C. If the natural climate was a normal distribution, its terminal distribution would be +2C +/- .3C or so. So this is the danger of the use of the ensemble distribution when the variance does not match. Clearly, a stochastic component needs to be added to these models.
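The arithmetic in this example can be checked with a short Python sketch. It assumes the four slopes are trends per decade, which the comment does not state but which is what makes .2C become 2C after 100 years. The variable names are illustrative only.

```python
from statistics import mean, stdev

# The four constant-slope "models" from the example, assumed to be in C per decade.
trends_per_decade = [0.2, 0.4, 0.6, 0.2]

ensemble_mean = mean(trends_per_decade)
ensemble_sd = stdev(trends_per_decade)  # sample standard deviation (n - 1)

# With zero internal variability, each model's 100-year outcome is just slope * 10 decades.
terminal = [round(t * 10, 1) for t in trends_per_decade]
terminal_spread = (min(terminal), max(terminal))

print(round(ensemble_mean, 2), round(ensemble_sd, 2), terminal_spread)
```

The ensemble standard deviation here comes entirely from disagreement between models, so the 100-year spread is set by the differences in slope alone and not by any modelled natural variability.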
Also, philosophically, I differ in interpretation that the different inter-model runs are all like different future histories from one big meta-run. The collective individual runs of each *individual* model are each different future history distribution estimations. Each model is trying to predict the distribution of future histories of the same climate. To group them together as uncorrelated separate history runs is plain wrong as I see it. Since the models are all trying to do the same thing, they should be judged individually, not as a group.
One of the many problems with the models is that (as you say) they do not give the correct variability. It would be very nice if the runs from each individual model could be regarded as future histories, but it is plain that they can’t. In fact the individual models give too little variability, you seem to be suggesting they give too much. However, you can quite easily show that in this sense each model is wrong, and you could just stop there. However, there is a lot of variability between models, and it is at least worth trying to see if this variability is useful as an estimate of the variability of the real climate. An alternative would be to add a stochastic component, and in many ways this would be a good approach, but it isn’t what is being done here. The “ensemble model” has the problem that it is difficult to understand exactly what the physics behind it is, but this is also true of a stochastic model. If you are only interested in the prediction it doesn’t really matter how you interpret the model, what matters is whether the prediction is any good.
I am not saying that the ensemble model is correct, or even that it is a good idea, just that it is worth trying. It isn’t a totally crazy idea though. Each model includes assumptions, simplifications and guesses at unknown parameters, so it is at least plausible that the variability between models in these assumptions is of the same order of magnitude as the variability in real climate.
To be clear, I have suggested that the individual models give too little variability, a point on which everyone seems to agree. Clearly a stochastic component needs to be added.
Yes, I am suggesting the **inter-model** variability is too high. As discussed here we can make it as high or low as we want by adding or subtracting models, bogus or otherwise. In addition, as I explained, they should all be designed to produce future histories – ie they are trying to do the same thing. If they differ wildly as to future histories, compare them to the data and pick the best model.
As my example above shows, it seems suspiciously like the ensemble approach is designed to “not reject” the models, while also not providing the inherent stochastic variability that would do anything but allow the 100 yr projections to essentially just have a drift term.
It looks like another bogus statistical approach from the climate scientists.
Also, having outlier models allows the scary scenarios to remain firmly in place as part of a “not rejected” ensemble. This is also preposterous when considering how these models are used for long term future projections.
I read through that thread and felt that some greater clarity might have been gained if the concept of type I and type II errors was raised. The fixation on 5% p-values (the size of the test) meant that people failed to consider the power of the test. A test with no power is no test at all. As Roger’s example showed – the test that was being proposed has no power at all because it cannot reject anything. The power of the proposed tests sucked.
That is at least one area where there seemed to be a failure to comprehend.
For those who don’t like statistics, this is a technical paraphrase of Roger asking for a definition of “consistent with” – if everything is consistent with something then the test is no test at all (i.e. it has suckworthy power).
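The point about power can be illustrated by simulation. The Python sketch below uses a deliberately simplified stand-in for the consistency check: reject when the observation is more than two ensemble standard deviations from the ensemble mean. It is not the test anyone in the original thread actually ran, and every number in it is hypothetical.

```python
import random


def power_of_spread_test(ensemble_mean, ensemble_sd, true_mean, obs_sd,
                         trials=20000, seed=1):
    """Fraction of simulated observations that fall outside
    ensemble_mean +/- 2 * ensemble_sd, i.e. how often the check rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        obs = rng.gauss(true_mean, obs_sd)
        if abs(obs - ensemble_mean) > 2 * ensemble_sd:
            rejections += 1
    return rejections / trials


# The truth (0.0) really does differ from the ensemble mean (0.2) in every case;
# only the assumed ensemble spread changes.
spreads = [0.05, 0.1, 0.2, 0.4]
powers = [power_of_spread_test(0.2, sd, true_mean=0.0, obs_sd=0.05) for sd in spreads]

for sd, p in zip(spreads, powers):
    print(sd, p)
```

In this toy setup the rejection rate falls toward zero as the ensemble spread grows, even though the ensemble mean is wrong by the same amount throughout.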
Perhaps Annan was confusing standard error (SE) with standard deviation (SD). When I first read Pielke’s post this morning, it looked to me like the means were not significantly different, but only because on just one cup of coffee I was erroneously interpreting SD as if it were SE. Each mean is about one of these from 0, so the difference can’t be more than 1.4 of them from 0.
But of course the SE’s are much smaller than the SD’s (by a factor of sqrt(N)), so equality of the means is a very easy reject.
It does appear that Megan’s internet test is based on the equality of the two SDs, while in fact they are obviously very different. There is a long literature on the “Fisher-Behrens” problem of testing the equality of two means when the variances are unequal, e.g. http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177697509 . It seems that there is no exact test, but a lot of good approximate tests. In the present case, however, it’s not even a close call, so any reasonable statistic would give an easy reject.
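One widely used approximate test for unequal variances is Welch's. The comment does not name a specific test, so its use here is an assumption. The Python sketch below computes the Welch statistic and its approximate degrees of freedom from summary statistics, using the group figures exactly as they are quoted earlier in this thread; if the sign or size of either mean differs in the underlying data, the result changes accordingly.

```python
from math import sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic, Welch-Satterthwaite degrees of freedom,
    and the standard error of the difference in means."""
    v1 = sd1**2 / n1  # squared standard error of mean 1
    v2 = sd2**2 / n2  # squared standard error of mean 2
    se_diff = sqrt(v1 + v2)
    t = (mean1 - mean2) / se_diff
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df, se_diff


# Group 1: 55 points, mean .19, stdev .21. Group 2: 5 points, mean .07, stdev .07.
t_stat, dof, se_diff = welch_t(0.19, 0.21, 55, 0.07, 0.07, 5)
print(round(t_stat, 2), round(dof, 1), round(se_diff, 3))
```

The difference in means is scaled by the combined standard errors, each SD divided by sqrt(N), and not by the SDs themselves, which is the SD versus SE distinction drawn above.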
Borrowing from John V:
So the appropriate scaling for the difference is .
Thanks, Hu. I also find Lucia’s comment at Pielke Jr’s former blog to be very helpful in indicating how in some of the very heated debates people may be talking past each other and not really addressing the same issue at times:
http://cstpr.colorado.edu/prometheus/?p=4419#comment10097
The final upshot of these discussions I believe is that these models are not at all suited to the task of providing 100 yr average temperature projections (or any temperature projections, at any time horizon, probably). That’s fine – then they should not be used for that. It’s possible they can be used to estimate short/medium term feedbacks, evaluated in a tight loop with satellite data – if so great. Let’s get to the right feedbacks, then estimate a simple time series model with different CO2 scenarios, add a proper stochastic (unknowable) component and create terminal distributions we can all understand and that mean something. These model “forecasts” are pure hocus pocus as they are being currently applied.
Excellent discussion of form vs substance. James Annan held to his statistical “forms” while Pielke Jr literally eviscerated him on “substance”. Nice analogy. . . . the more trashy forecasts you add the more consistent the forecasts become with observations.
One commenter, perhaps slightly off topic, suggested winnowing out the models consistently producing the trashier forecasts, but what kind of heresy is that? That’s a direct attack on the alarmism! Consistency has to be defined in a way that all manner of garbage is carried along!