David Douglass' Comments:

David Douglass writes in:

I wish to make a few comments/clarifications on the Douglass et al. paper. We took the model data from the LLNL archive. We computed the tropical zonal averages and trends from 1979 to 1999 at the various pressure levels for each realization of each model, and averaged the realizations for each model to simulate removal of the El Niño Southern Oscillation (ENSO) effect; these models cannot reproduce the observed time sequence of El Niño and La Niña events, except by chance, as has been pointed out by Santer et al.
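As an illustration of the computation just described, here is a minimal sketch, assuming the per-realization zonal-average series have already been extracted from the LLNL archive into the hypothetical layout described in the code comments:

    import numpy as np

    def decadal_trend(years, series):
        """Least-squares linear trend of a zonal-average series, in K/decade."""
        return np.polyfit(years, series, 1)[0] * 10.0

    # runs_by_model: {model_name: [run1, run2, ...]} for one pressure level, where each
    # run is an array of tropical zonal averages over 1979-1999 (hypothetical layout).
    def per_model_trends(years, runs_by_model):
        """Compute the trend of each realization, then average the realizations within
        each model, so that ENSO noise (which each run places differently in time)
        tends to cancel."""
        return {model: float(np.mean([decadal_trend(years, run) for run in runs]))
                for model, runs in runs_by_model.items()}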

We offered no opinion as to the validity of any model.

This is Table II of our paper. [ Note: Some of the values in the table for the models at 1000 hPa are not consistent with the surface value. This is probably because some model values at p = 1000 hPa are unrealistic; they may be below the surface. This may be why the modelers compute the surface value separately; those separately computed surface values are the ones we use in the plot.]

How to interpret model data in Table II?

We noted that previous papers and IPCC reports had introduced the concept of averaging over models. There is no scientific justification for doing so. The models are not independent samples from a parent population because the models are, in fact, different from each other in many ways.

Nevertheless, we used this same concept to describe the results of our paper and computed statistical quantities: averages, standard deviations, and standard errors of the mean. This may not have been the best decision. Perhaps a better way would have been to forget statistics altogether and just look at the plot of the model results from the table, as Willis Eschenbach has suggested [comment #144]. I show this plot below. Note that the models are not normalized to their surface value; normalizing hides some of the discrepancies.

The observations are not plotted. The main observational results are:

1. The surface value is about 0.125 K/decade, which is close to the 22-model average.

2. The trend values of the 4 data sets generally decrease with altitude. [Note: Christy has explained in these comments the choice of RAOBCORE 1.2. We submitted this to the IJC as an addendum on Jan 3, 2008. It is not yet published, but I will send a copy of the addendum to anyone requesting it.]

The plot of the values from the table indicates that most of the models show an increase with altitude – opposite to what the observations show.
Which ones?

We would like to make a list of which models do not agree with the observations. Better yet, which ones are not excluded?

Rather than argue pointlessly about the meaning of the standard deviation and the standard error of the mean of the 22 models, let's do something really simple. Each model should be compared individually against the two observational results above.

Test 1.

Test 1 uses the somewhat arbitrary criterion that a model whose trend at higher altitudes [between 500 and 200 hPa] exceeds 0.2 K/decade disagrees with the observations. This test rejects all models except 2, 8, 12, and 22. [These are labeled in the figure.]

Test 2.

Test 2 rejects models that have surface trends less than 0.05 K/decade. Model 22 from the list in Test 1 is now eliminated. Out of the 22, only models 2, 8, and 12 are left as viable. Further tests or different tests could change this list. Invent your own tests or criteria. The main conclusion will be the same: only a few models can be reconciled with the observations.
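A minimal sketch of these two screening tests, with the thresholds stated above; the layout of the trend table is a hypothetical stand-in for Table II, and the function names are mine:

    # trends[model_id] maps pressure level (hPa) to trend in K/decade, plus a separate
    # "surface" entry, e.g. {"surface": 0.16, 500: 0.22, 300: 0.31, 200: 0.27} (illustrative).

    def passes_test1(levels, limit=0.2):
        """Test 1: reject a model if any trend between 500 and 200 hPa exceeds 0.2 K/decade."""
        upper = [v for p, v in levels.items() if p != "surface" and 200 <= p <= 500]
        return all(v <= limit for v in upper)

    def passes_test2(levels, surface_min=0.05):
        """Test 2: reject a model if its surface trend is below 0.05 K/decade."""
        return levels["surface"] >= surface_min

    def viable_models(trends):
        return [m for m, levels in trends.items()
                if passes_test1(levels) and passes_test2(levels)]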

I want to make a general comment.

Each modeling group may find comparison of its results to results from other groups interesting but a much more important comparison is to the observations. I would expect each group following the scientific method to do this. A corollary of this is that each group would have to say that some, perhaps most, models are wrong. It is in the interest of every group to do this.

[Figure dougla87.jpg: plot of the model trend values from Table II, with models 2, 8, 12, and 22 labeled]

529 Comments

  1. Andrew
    Posted May 1, 2008 at 7:46 PM | Permalink

    Where are the figures he references?

  2. Raven
    Posted May 1, 2008 at 8:53 PM | Permalink

    So what are models 8, 12, and 2?

    What are the long term GMST projections for these models?

  3. bender
    Posted May 1, 2008 at 9:04 PM | Permalink

    This approach is more productive than doing what Gavin did – computing an envelope for the family of models.

    Although it does make sense to me to think of these models as samples from a population of possible models. It’s just that it doesn’t get you very far. With Douglass’ method here you are now forced to argue whether the models that fit the observations are superior, and whether this support is then sufficient to define “the fingerprint”. And that’s progress.

    Now, who will be the first alarmist to step up and admit that the three models that fit best are the ones that happen to have the least credible convection parameterizations? Judith? Phil? Mr Crickets?

  4. Eggplant fan
    Posted May 1, 2008 at 9:16 PM | Permalink

    David Douglass said:

    We noted that previous papers and IPCC reports had introduced the concept of averaging over models. There is no scientific justification for doing so. The models are not independent samples from a parent population because the models are, in fact, different from each other in many ways.

    I have to say that I’m a bit dumbfounded. My immediate reaction when I first saw the “error bars” on the RC post was “You can’t do that!”, but a failure of basic statistical analysis on a blog isn’t all that surprising. But to hear that it is used in the IPCC reports is, well, shocking. Has anyone ever been called to defend this practice in the literature (peer reviewed or otherwise)? If so, what kind of response was given?

  5. bender
    Posted May 1, 2008 at 9:27 PM | Permalink

    #5
    I just defended it, in #3.

  6. bender
    Posted May 1, 2008 at 9:38 PM | Permalink

    #5 Think of it as a heuristic technique, not a statistical technique. It is the same method used to rank college football teams using composite indices. It’s legitimate if a huge panel of experts deems it to be acceptable. Which is the case.

  7. Raven
    Posted May 1, 2008 at 9:52 PM | Permalink

    I don’t understand why this approach was not used from the start.

    If I am simulating a transistor I look for the model that has the best match to reality over the frequency range that I am interested in. I do not take every possible model and use the average output.

  8. Ron Cram
    Posted May 1, 2008 at 10:19 PM | Permalink

    I have a question for David Douglass or John Christy.

    Here are some quotes from your published paper:

    We have tested the proposition that greenhouse model simulations and trend observations can be reconciled. Our conclusion is that the present evidence, with the application of a robust statistical test, supports rejection of this proposition.

    Models are very consistent, as this article demonstrates, in showing a significant difference between surface and tropospheric trends, with tropospheric temperature trends warming faster than the surface.

    The last 25 years constitute a period of more complete and accurate observations and more realistic modelling efforts. Yet the models are seen to disagree with the observations.

    Based on these quotes, I believed that no model accurately reflected both observations at the surface of the tropics and observations in the tropical troposphere.

    However, based on what David Douglass has written above it appears three models (2, 8, 12) all pass the test. What am I missing here?

  9. rk
    Posted May 1, 2008 at 10:35 PM | Permalink

    This graph is certainly illuminating. It does seem like developing a mean and variance based on the obs and then testing the models for goodness of fit is a logical approach, although with so few sets of obs whatever bands you have won’t be very tight.

    I’m sure it’s easy to increase the error bars of the 22 models so as to account for any obs that you want. So to do it the other way around seems a little silly. The error bars from RC obviously can account for almost any observations, therefore the falsification issue is real. I think someone has already said this…but I’m just now catching up.

  10. beaker
    Posted May 2, 2008 at 12:40 AM | Permalink

    Good morning all.

    rk says:

    I’m sure it’s easy to increase the error bars of the 22 models so as to account for any obs that you want.

    I don’t understand where the idea comes from that you can arbitrarily increase the error bars to account for any observation you want. You can’t; the error bars on the prediction of the ensemble are decided by the spread of the ensemble. The error bars are determined by the systematic and stochastic uncertainties inherent in this approach to modelling. They show the range of values the ensemble considers to be plausible. If an observation lies within the spread of the ensemble, then how can it be inconsistent with the ensemble?

    So to do it the other way around seems a little silly. The error bars from RC obviously can account for almost any observations, therefore the falsification issue is real. I think someone has already said this…but I’m just now catching up.

    You seem to misunderstand falsification. The error bars are of finite width, so it is not impossible that an observation could be found that lies outside them. Falsifiability only requires that falsification is possible in principle, not that it actually happens.

    This is the importance of the error bars that I am trying to explain. The fact that the error bars are so wide shows that the models are *very* uncertain about the trends; the mean can be argued to be the most probable value for the trend (assuming it is near the mode), but it is wrong to ignore the uncertainty of the prediction. It is wrong to ignore it in making predictions of future climate, but it is just as wrong to ignore it in seeing if the observed trend is consistent with a hindcast.

    However, if you want to make non-statistical arguments about the usefulness of the models, that is much less problematic (and probably more useful). One argument that you might reasonably make is that the models are so uncertain that they are not yet of practical value.

  11. beaker
    Posted May 2, 2008 at 12:46 AM | Permalink

    Ron #9: This seems to me to be a result of the difference between inconsistency and bias in the pre-existing statistical sense of the terms, rather than those defined in the original paper. The ensemble is biased (the mean does not give a good picture of the observed trend), but it isn’t inconsistent (the observed trend is not outside the range of values considered plausible, if not perhaps highly probable, by the ensemble).

  12. braddles
    Posted May 2, 2008 at 1:06 AM | Permalink

    It would seem to me that the failure of most models is not necessarily a failure of the concept. Let’s say a toymaker asks 22 people to invent a new toy; if 21 of the toys turn out to be rubbish but one is good, he can still go with the good one and the exercise was a success. He won’t average the good one and the 21 bad ones.

    It would appear that Model #2 is the only good toy in the ensemble. How is it different to the others? Is it based on sound science?

    If it is, and someone (not me, probably) can understand why it succeeds where others fail, it may have useful predictive power. The other 21 can be quietly dropped.

  13. beaker
    Posted May 2, 2008 at 1:52 AM | Permalink

    David Douglass said:

    We noted that previous papers and IPCC reports had introduced the concept of averaging over models. There is no scientific justification for doing so. The models are not independent samples from a parent population because the models are, in fact, different from each other in many ways.

    I doubt this was introduced by the IPCC; averaging over models is standard practice in Bayesian statistics, where it is called “marginalisation”.

    As for whether the models are samples from a parent population, consider this interpretation: Assume that we could construct a generalised GCM (G^2CM) that could implement any of the individual models in the existing ensemble by changing a set of simulation parameters. Now there will be some uncertainty in the correct setting of these parameters (which correspond to assumptions about the physics), so the natural thing to do would be to suggest a statistical distribution over these parameters establishing the plausibility of different settings. We could then construct an ensemble by drawing sets of parameters from this distribution and running them through the G^2CM. In that way the ensemble is a sample from a parent distribution. The ensemble we have could also be considered a sample from a parent distribution, although it has been created in a more ad-hoc manner and the distribution over plausible parameter sets (i.e. physical assumptions) is implicitly represented by the opinions held by the teams constructing the models. In principle, you could get the same result using the G^2CM approach, so it is reasonable.
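    A toy sketch of the G^2CM thought experiment described above; the parameter names, the prior over them, and the trend formula are all made up purely for illustration:

        import numpy as np

        rng = np.random.default_rng(0)

        def g2cm(params):
            """Toy stand-in for the generalised GCM: maps a vector of physics
            parameters (the assumptions) to a predicted tropical trend."""
            climate_sensitivity, aerosol_forcing = params
            return 0.10 * climate_sensitivity - 0.05 * aerosol_forcing

        # Uncertainty about the correct physics is expressed as a distribution over the
        # parameters; the ensemble is then a sample from the implied distribution of predictions.
        parameter_draws = rng.normal(loc=[2.5, 1.0], scale=[0.8, 0.3], size=(22, 2))
        ensemble_trends = np.array([g2cm(p) for p in parameter_draws])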

  14. Willis Eschenbach
    Posted May 2, 2008 at 3:28 AM | Permalink

    Folks, the data is not only two standard errors from the mean. It is also two standard deviations from the mean.

    Can we now agree that the models are inconsistent with the data?

    w.

  15. Willis Eschenbach
    Posted May 2, 2008 at 3:31 AM | Permalink
  16. Willis Eschenbach
    Posted May 2, 2008 at 3:33 AM | Permalink
  17. beaker
    Posted May 2, 2008 at 3:54 AM | Permalink

    Willis #14: Two additional things need to be taken into account:

    Firstly, the means of the runs for each model only really represent the systematic uncertainties involved (i.e. uncertainties in the underlying physics); the stochastic uncertainties also need to be taken into account, e.g. by taking the standard deviation over all model runs in order to reflect the true uncertainty of the ensemble. Certainly there is a problem to do with the fact that different numbers of runs were performed for each model. Taking the standard deviation over all runs is perhaps a conservative approach, that is true; however taking the standard deviation over model means is an obviously biased approach, as it ignores the stochastic uncertainty entirely. If you want to falsify a proposition, it is important to avoid unnecessary biases in favour of falsification, so we can be confident that the falsification itself is robust.

    Secondly, as demonstrated by your plot, the estimates of the observed trend are not in particularly good agreement even with each other, which shows there is uncertainty in the observed trend. To test for inconsistency properly, standard practice would be to show that the mean +-2sd error bars of the models don’t overlap with the mean +-2sd error bars of the data. Yes, I know that makes the inconsistency test easier to pass, but that is the correct way to perform the test for inconsistency.

    Now if you want to test for bias, that is another matter entirely.
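    A minimal sketch of the consistency test as described above, pooling all individual runs for the ensemble spread; the input arrays and function names are hypothetical placeholders:

        import numpy as np

        def two_sd_interval(values):
            """Mean +/- 2 sample standard deviations."""
            v = np.asarray(values, dtype=float)
            return v.mean() - 2 * v.std(ddof=1), v.mean() + 2 * v.std(ddof=1)

        def intervals_overlap(a, b):
            return a[0] <= b[1] and b[0] <= a[1]

        # all_run_trends: trends from every individual model run (not per-model means);
        # obs_trends: the handful of observational trend estimates.
        def shown_inconsistent(all_run_trends, obs_trends):
            """Inconsistency in the sense above: the two +/-2 SD ranges fail to overlap."""
            return not intervals_overlap(two_sd_interval(all_run_trends),
                                         two_sd_interval(obs_trends))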

  18. Bob B
    Posted May 2, 2008 at 4:34 AM | Permalink

    Raven, GaAs, SiGe, or 45 nm Si CMOS transistors?

  19. beaker
    Posted May 2, 2008 at 4:42 AM | Permalink

    Willis #14: Two additional things need to be taken into account:

    Firstly, the means of the runs for each model only really represent the systematic uncertainties involved (i.e. uncertainties in the underlying physics); the stochastic uncertainties also need to be taken into account, e.g. by taking the standard deviation over all model runs in order to reflect the true uncertainty of the ensemble. Certainly there is a problem to do with the fact that different numbers of runs were performed for each model. Taking the standard deviation over all runs is perhaps a conservative approach, that is true; however taking the standard deviation over model means is an obviously biased approach, as it ignores the stochastic uncertainty entirely. If you want to falsify a proposition, it is important to avoid unnecessary biases in favour of falsification, so we can be confident that the falsification itself is robust.

    Secondly, as demonstrated by your plot, the estimates of the observed trend are not in particularly good agreement even with each other, which shows there is uncertainty in the observed trend. To test for inconsistency properly, standard practice would be to show that the mean +-2sd error bars of the models don’t overlap with the mean +-2sd error bars of the data. Yes, I know that makes the inconsistency test easier to pass, but that is the correct way to perform the test for inconsistency.

    Now if you want to test for bias, that is another matter entirely.

    N.B. This post was blocked as spam the first time, but I can’t see anything in the content that is objectionable.

    Steve: The Spam Karma log stated: -10: Fake Javascript Payload.

  20. beaker
    Posted May 2, 2008 at 4:46 AM | Permalink

    My posts are getting blocked, I wonder if this one will be as well.

    Steve: I manually recovered 3 posts from Spam Karma. The log said that the posts had been penalized for a Fake Javascript payload.

  21. JamesG
    Posted May 2, 2008 at 5:07 AM | Permalink

    Beaker: You’ve just given us yet another reason to distrust Bayesian statistics. It doesn’t matter how much you say something is reasonable or plausible, or even if it has been done before – it doesn’t make it correct. Averaging across models is meaningless to a computer modeler. It doesn’t help drive our understanding of the model. Since Gavin is such a beast, he can only be using this standard deviation gotcha as an excuse to avoid seeing the obvious. The ensemble plot tells us all we need to know, using our brain. A modeler usually deals with whether the model is correct or not. Any test, statistical or otherwise which doesn’t help separate good from bad is useless. I’m glad we agree though that leaving statistics out is actually a step forward and more useful. I wish RC would realize that.

  22. beaker
    Posted May 2, 2008 at 5:13 AM | Permalink

    “Fake Javascript payload”, presumably it doesn’t like some of the punctuation then. Many thanks for restoring the post, much appreciated!

  23. beaker
    Posted May 2, 2008 at 5:21 AM | Permalink

    JamesG: Consider this: why doesn’t someone publish a paper showing the full uncertainties of the model, and say “have you seen the size of those error bars, do people realise just how uncertain the models are? Are models this uncertain really useful? Note also that the models have a significant bias, so the ensemble mean is not a good fit for the observations”. That would be a really good way of pointing out the important deficiencies of the models, and it would have the full backing of both frequentist and especially Bayesian statistics.

    Note the revised version of Willis’ test that I give in #19 is fully frequentist. A Bayesian would probably prefer to look at Bayes factors for competing hypotheses rather than a conventional significance test on a single null hypothesis.

  24. MarkW
    Posted May 2, 2008 at 5:23 AM | Permalink

    If I were to take a Semi, trailer fully loaded, a Semi, no trailer, a NASCAR stockcar, a nitro burning dragster, a Chevette, and a Sherman tank, and average their 0-60 times, what would that tell me about the capabilities of the internal combustion engine?

  25. beaker
    Posted May 2, 2008 at 5:44 AM | Permalink

    MarkW says:

    If I were to take a Semi, trailer fully loaded, a Semi, no trailer, a NASCAR stockcar, a nitro burning dragster, a Chevette, and a Sherman tank, and average their 0-60 times, what would that tell me about the capabilities of the internal combustion engine?

    That we can’t give confident predictions about 0-60 times unless we have a good understanding of the vehicle, i.e. there are large error bars on the 0-60 times. Equally, if we have imperfect knowledge of the physics of climate change, we will end up with models that make very uncertain predictions, with very wide error bars – and that is just what we have! It really is puzzling to me why it is so hard to see that it is the uncertainty of the models that is the greater practical problem and that it should be receiving greater focus.

  26. MBK
    Posted May 2, 2008 at 5:52 AM | Permalink

    I have been lurking so far and the amount of talk on this issue (starting at the other thread) amazes me. Obviously at the very least the statistical issues are debatable, and that in itself is a shame. Another clear conclusion to me is that model outputs were accepted at face value by the IPCC without standardized procedures for validation or even comparison. In my naive view a major point of a model would be to get a better guess at those parameters going into the model, of which one is uncertain, by comparing the output to reality. But the models do not seem to be used for such if-then scenarios: to better constrain parameters one doesn’t know, using measurements that one does know, and linking them by the physics. They seem to be viewed as plausible realities, rather than as the experimental research tools I would see them as.

    Also, any statistics on this data set do at least look debatable, seeing the ca. 500 posts on the subject. Fairly naively, it looks to me that the only clean way to test the models against the data is to take each and every model in isolation, do enough runs to stabilize the SD, and test against the available measured data. That way you test the, quoting beaker, systematic uncertainty of the model means, using the stochastic (not really stochastic, in my eyes chaotic) uncertainties of the runs. Then eliminate models whose means don’t measure up against the data, one by one. The remaining models, hopefully, will then have parameters, and produce projections, consistent with each other to begin with.

    If they don’t “converge” in this way that’s yet another issue.

  27. James Lane
    Posted May 2, 2008 at 5:55 AM | Permalink

    I recall looking at the Douglass paper and the RC response some time ago and chuckled at how Gavin suddenly discovered bucketloads of uncertainty in the GCMs to refute Douglass et al.

    But I’m rocked by what I’ve read in this and the trop trop thread. I think JamesG puts it well:

    Since Gavin is such a beast, he can only be using this standard deviation gotcha as an excuse to avoid seeing the obvious. The ensemble plot tells us all we need to know, using our brain. A modeler usually deals with whether the model is correct or not. Any test, statistical or otherwise which doesn’t help separate good from bad is useless.

    Even beaker seems to have conceded that the models are hopeless for the tropical troposphere, although maintaining that they are not statistically falsified.

  28. Judith Curry
    Posted May 2, 2008 at 6:19 AM | Permalink

    What David Douglass says is absolutely correct. At the recent NOAA review of GFDL modelling activities (we discussed this somewhere on another thread), I brought up the issue numerous times that you should not look at projections from models that do not verify well against historical observations. This is particularly true if you are using the IPCC results in some sort of regional study. The simulations should pass some simple observational tests: a credible mean value, a credible annual cycle, appropriate magnitude of interannual variability. Toss out the models that don’t pass this test, and look at the projections from those that do pass the test. This generated much discussion, here are some of the counter arguments:
    1) when you do the forward projections and compare the 4 or so models that do pass the observational tests with those that don’t, you don’t see any separation in the envelope of forward projections
    2) some argue that a multiple model ensemble with a large number of multiple models (even bad ones) is better than a single good model

    My thinking on this was unswayed by arguments #1 and #2. I think you need to choose the models that perform best against the observations (and that have a significant number of ensemble members from the particular model), assemble the error statistics for each model, and use these error statistics to create a multi-model ensemble projection.

    This whole topic is being hotly debated in the climate community right now, as people who are interested in various applications (regional floods and droughts, health issues, whatever) are doing things like averaging the results of all the IPCC models. There is a huge need to figure out how to interpret the IPCC scenario simulations.
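    One way to read the proposal above as code; the inverse-error weighting below is just one plausible choice of error statistic, not necessarily the scheme argued for at the GFDL review, and the names are hypothetical:

        import numpy as np

        def weighted_ensemble_projection(projections, hindcast_rmse):
            """Combine projections from the models that passed the observational screening,
            weighting each model inversely by its hindcast error variance."""
            proj = np.asarray(projections, dtype=float)
            rmse = np.asarray(hindcast_rmse, dtype=float)
            weights = 1.0 / rmse**2
            weights /= weights.sum()
            return float(weights @ proj)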

  29. beaker
    Posted May 2, 2008 at 6:24 AM | Permalink

    James Lane says:

    Even beaker seems to have conceded that the models are hopeless for the tropical troposphere, although maintaining that they are not statistically falsified.

    I’m sorry, but this comment is an illustration of the problem with blogs on climate change, which is that everyone is automatically assumed to have a position on one side of the debate or another. I don’t; however, I do think it is important that papers potentially containing errors are audited in an impartial and impersonal manner, without any commitment to the outcome of the audit, regardless of which side of the debate they belong to. I have made no such “concession”, as I have not said that the ensemble means give good predictions of the tropical tropospheric trends (and neither, for instance, does Gavin@RC if you read his article in an unbiased manner)!

    The models have NOT been falsified, they have been shown to be significantly biased. I have explained the difference (and at Willis’ request demonstrated that these terms have pre-existing meanings in statistics). The reason why Douglass et al. needs auditing is because some, such as yourself, believe that inconsistency has been established rather than bias; it hasn’t.

  30. Steve Geiger
    Posted May 2, 2008 at 6:36 AM | Permalink

    quick question. If the models were somehow tuned or calibrated to match this vertical gradient, would they then lose the match for the surface trend? At the least this seems like it should be very alarming for those trying to capture what’s really going on. Also, what types of forcings *would* result in the type of vertical gradients observed in the tropics?

    Thanks

  31. beaker
    Posted May 2, 2008 at 6:36 AM | Permalink

    Judith: For those interested in interpreting the IPCC scenarios for practical applications, Bayesian statistics already provides a sound means of interpretation. Just run each member of the ensemble through the impacts model separately and collect the results. The distribution of the impact variable that you get this way is the logical consequence of the uncertainty in the IPCC scenarios. Any impact that is within that distribution is plausible according to the scenarios, and the mean of the distribution gives you the expected (in the statistical sense) impact. I have investigated this approach myself and it does work; the only problem is you have a hard time explaining it to people who don’t trust statistical methods.

    If the scenarios are uncertain, you get uncertain predictions of the impacts. I would find that reassuring as a statistician in any means of interpretation.

    If you just run the ensemble mean through the impacts model, the error bars then only reflect the uncertainty imposed by the impacts model whilst ignoring the uncertainty of the GCM output. As a statistician, the narrow error bars would worry me.
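    A minimal sketch of that propagation, with impacts_model standing in for whatever downstream model is being driven; both names are hypothetical placeholders:

        import numpy as np

        def propagate_ensemble(ensemble_members, impacts_model):
            """Run every GCM ensemble member through the impacts model separately, so the
            spread of the GCM ensemble carries through to the impact variable."""
            impacts = np.array([impacts_model(member) for member in ensemble_members])
            return {"expected_impact": float(impacts.mean()),
                    "plausible_low": float(np.percentile(impacts, 2.5)),
                    "plausible_high": float(np.percentile(impacts, 97.5))}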

  32. Steve McIntyre
    Posted May 2, 2008 at 6:43 AM | Permalink

    I am highly troubled by the RC position that the standard deviation in the model outputs is a relevant estimate of the stochastic uncertainty of the earth’s response, as I believe that people are conflating uncertainties derived from modeling issues which appear to be very large with stochastic uncertainty in how the earth responds (which I suspect might be rather low, despite the views of some readers here on chaos.)

    For example, the IPCC likely range of doubled CO2 impact is 1.4-4.5 deg C, but surely the physics of the earth is constrained enough that all these outcomes are not possible merely through stochastic uncertainty. Whether the impact is 0.5, 1.5, 2.5 or 6.5 deg C, I would have thought that the stochastic uncertainty would be only 0.1 deg C or so, with the rest of the uncertainty arising entirely from model uncertainty.

    beaker, from a Bayesian perspective, wouldn’t it make sense to try to at least make a ball park allocation of these forms of uncertainty based on some sort of a priori judgement? (Unfortunately, my knowledge of Bayesian statistics is limited, as its popularity post-dates my university days (though I noticed some Bayesian articles by Dawid that cited my stats prof, D.A.S. Fraser), so I may have unintentionally inhaled some Bayesian concepts in the 1960s.)

  33. Paul Linsay
    Posted May 2, 2008 at 6:49 AM | Permalink

    Looks at Willis’ plot in #16 here and #144 on the Tropical Troposphere thread, then reads this in #29:

    The models have NOT been falsified, they have been shown to be significantly biased…

    Reminds me of Groucho Marx’s question “Who are you going to believe, me or your lying eyes?”

  34. JamesG
    Posted May 2, 2008 at 6:54 AM | Permalink

    I hope Gavin doesn’t think I’m calling him a beast – I meant he’s a computer modeler.

    Beaker: A paper like that has been published and ignored. All such papers will be ignored for now, just like the cloud feedback papers and much other contradictory evidence has been ignored. Even the evidence of the positive feedback of soot has been largely ignored because it lessens the CO2-as-driver argument. The thing is, to a modeler who isn’t into confirmation bias, such papers aren’t even necessary; you only have to see the huge input uncertainties and the huge number of adjustable parameters to absolutely know, without going any further, that the results will be almost useless. Even a system with 3 parameters is very difficult: Here we can have about 30. In order to limit the amount of work, the climate modelers fix most of these using unproven assumptions and vary the 3 or 4 remaining. So we are in the realm here of heuristics – as Bender said. Climate modelers are using parameters they believe to be correct because they give the result they believe to be true. When some outsider points out that the obs don’t match then it’s either back to square one or belittle the outsiders results (the CYA option). I find it sad because both environmentalists and computer modelers are usually right when they put their focus on hard data gathering and testing. Putting overly much trust in this farrago of unproven assumptions, while ignoring real world results, will ultimately backfire on both communities.

  35. Lance
    Posted May 2, 2008 at 6:59 AM | Permalink

    After originally watching beaker with a jaundiced eye I have concluded that he is just insisting that a rigorous definition of “falsified” be applied to the issue. He has submitted that the ensemble of models has not been “falsified” in a Bayesian sense.

    But speaking in Bayesian terms one must ask what is the “degree of belief” that can be attributed to the predictive power and thus usefulness of these models based on the statistical state of this analysis.

    Being a lowly physicist with little statistical acumen I haven’t the ability or proclivity to make such a determination.

    Perhaps beaker could render an analysis that could estimate the Bayesian probability of these models having useful predictive skill.

  36. Posted May 2, 2008 at 7:03 AM | Permalink

    Beaker – “Fake javascript payload” is some error generated by your browser. Your machine may be infected by a virus, or you may be running some sort of privacy program that is interfering with the script SpamKarma asks your browser to run to make sure you aren’t a spammer.

    Either way, it’s something you need to sort out, as you will encounter difficulties at blogs that are protecting themselves by asking your machine to return a unique value to make sure they aren’t spambots.

    Douglass –
    Have you computed standard errors for the experimental average trends? If you create confidence intervals for the data, then it is clear that only those predictions that predict mean trends falling within the confidence intervals for the observed mean trends are consistent with the observations.

    So, as an oversimplified example using heights:

    Say Bender has a model that predicts the average height of an army recruit is 6’7″ tall, which he obtained by running a zillion tests and taking an ensemble average (but he doesn’t tell us the standard deviation of heights and doesn’t bother to tell you the uncertainty in his individual estimate, because we are going to figure that all out by averaging with other people’s models).

    I tell you the mean height is 6’2″ and I’ve run ensembles with my model.

    Steve Mc says it’s 5’8″ and he’s run ensembles.

    None of us tell you the standard deviation of heights our ensembles give for the height, insisting that’s just “height noise” blah, blah, blah….

    Now, say you go out, measure 20 army recruits, and find the average height is 5’10″ and the 95% confidence interval for your mean is ±1″. You determine this using the traditional methods used by a famous beer manufacturing statistician.

    Now, using classical undergraduate statistics, you can run a t-test to see if 6’2″ is consistent with 5’10″, to a confidence of 95%. You can test 6’7″, you can test 5’8″.

    You will find we are all, individually wrong. There is something wrong with each model.

    If you wish, you can also test the “ensemble average” of our three predictions or what-not. But the individual assessments still stand: Each is individually wrong. This would indicate each of us has something wrong in our method.

    People may, of course, debate the meaning of the fact that the average of all our guesses is closer than our individual guesses. But if they tell us their method of predicting the height is to average and report confidence intervals for that prediction, then I think it’s fair to apply the test in your original paper to the evaluation of that method, which is the one they actually use.

    No matter what anyone’s idea of statistics might be, logic dictates that it’s fair to test the actual trends a group like the IPCC predicts (or uses as a basis for its predictions), using the uncertainty intervals they predict and communicate to the public. It’s also fair to test metrics computed using the method they use, which is averaging over all models and communicating the standard error for the average over all models.

    The public needs to know whether the actual predictions, or the method used to make predictions and communicate uncertainties are useful or accurate.

    In this regard, comparing the average trend from the ensemble with uncertainty intervals computed that way is fine. The only addition required was to add the uncertainty intervals to the data. Then you would be showing that the method used to communicate predictions and uncertainties a) is biased and b) has unrealistically small uncertainty intervals.
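    A minimal sketch of the individual-model check in the height example above, using the illustrative numbers converted to inches; the 95% interval for the observed mean is taken as given rather than recomputed from the 20 recruits:

        def consistent_with_obs(predicted_mean, obs_mean, ci_halfwidth):
            """A prediction passes only if it lies inside the ~95% confidence
            interval for the observed mean."""
            return abs(predicted_mean - obs_mean) <= ci_halfwidth

        obs_mean, halfwidth = 70.0, 1.0  # 5'10" +/- 1"
        predictions = {"bender": 79.0, "commenter": 74.0, "Steve Mc": 68.0}  # 6'7", 6'2", 5'8"

        verdicts = {who: consistent_with_obs(p, obs_mean, halfwidth)
                    for who, p in predictions.items()}
        # Every individual prediction fails the test, so each model has something wrong.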

  37. beaker
    Posted May 2, 2008 at 7:04 AM | Permalink

    Steve: Absolutely, the standard deviation of the model outputs is not an estimate of the stochastic uncertainty of the Earth’s response; it is a measure of the uncertainty of the prediction, taking into account all of the uncertainties, including the variability of the thing we are trying to predict as well as the uncertainties involved in constructing and estimating the model. However, if you want to invalidate the model, you need to show that the models can’t give rise to the observed data, i.e. the models don’t think the observed data are plausible. It is the standard deviation error bars that define the range of values that the ensemble considers to be plausible (taking all uncertainties into consideration). The SE error bars only show the uncertainty on the expected value of the trend, not the uncertainty of the modelled trend itself.

    I’m not sure it is possible at this stage to estimate the true stochastic uncertainty of the Earth’s response using the models. It isn’t my area of expertise, but if they can’t reproduce a trend that the theory apparently suggests they should show clearly, it may be an indication that the systematic uncertainty is too dominant. But that is just my intuition.

    The ensemble as we have it now is already the result of a-priori judgements on the part of the modellers. The ensemble basically reflects the range of ways in which the climate works that the modelling community consider plausible. If a number of groups independently come up with similar models, it is an indication that a particular set of assumptions is more plausible than others. If the ensemble is broad, this just reflects the uncertainty within the modelling community on the basic physics of the climate. The last thing we should do is ignore this by concentrating on the mean all the time.

  38. Andrew
    Posted May 2, 2008 at 7:05 AM | Permalink

    Someone above said that the observations disagree with one another slightly. However, couldn’t this be an artifact of the slightly varying definitions of “tropics” in each data set, at least in part? I noticed they gave slightly different latitude bands.

  39. beaker
    Posted May 2, 2008 at 7:15 AM | Permalink

    Andrew: Looking at Willis’s figure in #16, I’d say they disagreed substantially rather than only slightly, if I understand it correctly; compare, e.g., the RAOBCORE and IGRA trends at 200 hPa.

  40. kim
    Posted May 2, 2008 at 7:28 AM | Permalink

    32 (Steve) There may be enough flexibility in the carbon and water cycles to allow a range of sensitivities to doubling of carbon, depending on the other pertinent circumstances.
    =============================================

  41. kim
    Posted May 2, 2008 at 7:30 AM | Permalink

    Well, I meant doubling of carbon dioxide in the atmosphere.
    ==================================

  42. kim
    Posted May 2, 2008 at 7:31 AM | Permalink

    The point being that there may not be a narrow range for the numerical possibilities of the temperature sensitivity to doubling.
    =======================================

  43. bender
    Posted May 2, 2008 at 7:34 AM | Permalink

    beaker’s recent contributions #13, #17, #19, #23 etc. are not sophomoric. I agree with everything said here.

  44. Pat Keating
    Posted May 2, 2008 at 7:38 AM | Permalink

    I’m astonished at the statistical mumbo-jumbo pronounced by some of the posters, and used as an excuse to claim “fatally flawed”. No wonder we have the famous saying “Lies, damned lies, …”. What happened to simple observation and common sense?

    A straightforward examination of Figure 1 of the Douglass et al paper should be enough to convince most reasonable persons that there is a major difference between the observational data and the model results. All the attempts by Gavin and Beaker to obfuscate are just a statistical smoke screen. Baffle them with b… seems to be the order of the day.

    Steve: You’re not being fair to Beaker here. Whether one agrees or disagrees or fully understands beaker’s point, he’s making a conscientious effort to articulate it and is not merely doing a hit-and-run job of the type that we’ve seen all too frequently. I certainly have no problem with insisting on precision in statistical expression. I’m not convinced that Beaker’s original endorsement of Gavin’s position is valid, but that’s entirely different.

  45. bender
    Posted May 2, 2008 at 7:44 AM | Permalink

    The models have NOT been falsified, they have been shown to be significantly biased.

    Puh-leeze. To remove the bias requires that you adjust the model, which implies you think the old model was wrong, which implies that the hypothesis was rejected, which implies the model was falsified.

    Word gamers.

  46. bender
    Posted May 2, 2008 at 7:46 AM | Permalink

    #44

    there is a major difference between the observational data and the model results

    there is a major difference between the observational data and SOME OF the model results

  47. bender
    Posted May 2, 2008 at 7:47 AM | Permalink

    Raven’s #2, folks. beaker, do you know the answer?

  48. Pat Keating
    Posted May 2, 2008 at 7:59 AM | Permalink

    44 (Steve)
    Perhaps you are right about my being unfair to Beaker.
    However, my main point is that while statistics can be used to illuminate, it can also be used to obfuscate, and the clear difference between model and observational data in Fig. 1 is being lost in the discussion. This has been used to claim that the paper is “fatally flawed”. Flawed, perhaps, but fatally so, surely not.

  49. Posted May 2, 2008 at 8:03 AM | Permalink

    At first, I was troubled by the fact that each GCM was run a different number of times. I was under the impression that that was a choice by Douglass et al. More careful examination of the references led me to https://esg.llnl.gov:8443/index.jsp (which seems to have an incorrect SSL setup).

    If I understand correctly, those were the runs used for the IPCC reports. Therefore, it does not make sense to try to choose among them on the basis of some ad hoc criterion but to use the results which were used to make policy recommendations.

    I need to decide whether to get an account with the intention of writing a paper. Of course, before I can decide whether to do that, I need to see how much work it is to deal with the datasets. For that, I need an account. Seems a little circular 🙂

    Any way, the point of this post was to state that I no longer find the multiple and different number of runs per model that troubling given that it no longer looks like the decision was up to Douglass et al.

    Of course, I am a little slow, so this might have been obvious to everyone else.

    — Sinan

  50. Eggplant fan
    Posted May 2, 2008 at 8:05 AM | Permalink

    Bender #3, #5, and #6:

    I’m sorry, but a large group of experts saying something is acceptable does not make it so in science. Any techniques used must be based on a rigorous scientific foundation, particularly if they are going to be used for model validation studies. (As a corollary, peer review is not infallible.) Moreover, the IPCC is not trying to rank football teams; they are trying to set world policy. A rationale far more rigorous than a heuristic that seems to make sense is needed to make their sweeping statements, particularly when they assign “confidence levels” to their statements.

  51. Pat Keating
    Posted May 2, 2008 at 8:10 AM | Permalink

    46 bender

    SOME OF the model results

    If you wish, although I would point out that none of the models is close to the observational data at the most-important uppermost (100 hPa) part of the troposphere (where, BTW, the observations are quite close to each other).

  52. Steve McIntyre
    Posted May 2, 2008 at 8:19 AM | Permalink

    IPCC makes a number of comments that specifically claim that the errors in a multi-model mean will be lower than the errors in an individual run – a point which would contradict Gavin Schmidt’s categorical assertion at RC. Their theory is that each model estimates parameters and that these errors will cancel out to some extent in the multi-model ensemble, to the extent that there is no systemic bias. Quotes:

    The reason to focus on the multi-model mean is that averages across structurally different models empirically show better large-scale agreement with observations, because individual model biases tend to cancel (see Chapter 8). The expanded use of multi-model ensembles of projections of future climate change therefore provides higher quality and more quantitative climate change information compared to the TAR. (ch 10)

    The use of multi-model ensembles has been shown in other modelling applications to produce simulated climate features that are improved over single models alone (see discussion in Chapters 8 and 9).

    It continues to be the case that multi-model ensemble simulations generally provide more robust information than runs of any single model.

    Chapter 8 states:

    The multi-model averaging serves to filter out biases of individual models and only retains errors that are generally pervasive. There is some evidence that the multi-model mean field is often in better agreement with observations than any of the fields simulated by the individual models (see Section 8.3.1.1.2), …

    Why the multi-model mean field turns out to be closer to the observed than the fields in any of the individual models is the subject of ongoing research; a superficial explanation is that at each location and for each month, the model estimates tend to scatter around the correct value (more or less symmetrically), with no single model consistently closest to the observations. This, however, does not explain why the results should scatter in this way.

    The viewpoint expressed by IPCC here is completely different conceptually from the RC viewpoint, in which the real world is deemed to be one model realization and thus to have the same sigma as the models.

  53. beaker
    Posted May 2, 2008 at 8:20 AM | Permalink

    Pat Keating says:

    However, my main point is that while statistics can be used to illuminate, it can also be used to obfuscate, and the clear difference between model and observational data in Fig. 1 is being lost in the discussion. This has been used to claim that the paper is “fatally flawed”. Flawed, perhaps, but fatally so, surely not.

    One way of obfuscating with stats is to use phrases with pre-existing statistical interpretations (such as “inconsistent”) to mean something else, when there is already an existing statistical term with exactly the desired meaning (“biased”). The problem I have with the paper is that it makes a very big claim (inconsistency) but performs a test that can demonstrate only bias, a lesser claim (I have lost count of the number of times I have had to type that ;o). If they want to show the ensemble mean is a poor predictor of the observational data, that is a meaningful test of whether the ensemble is useful (although not the only one), and they have made their case. However bias and inconsistency are not the same thing, and they claim inconsistency.

    If Douglass’ test falsifies anything, it is the use of the ensemble mean to make predictions without considering the uncertainty. But then again, any decent statistician would tell you that is inherently the wrong thing to do anyway on basic principles.

  54. Clark
    Posted May 2, 2008 at 8:28 AM | Permalink

    #10

    beaker says:

    I don’t understand where the idea comes from that you can arbitrarily increase the error bars to account for any observation you want. You can’t; the error bars on the prediction of the ensemble are decided by the spread of the ensemble.

    Ah, but you can. Simply add a few more models with more divergent results, and your SD is now larger. There are likely hundreds of thousands of climate models that have been run. To pick and choose a few dozen makes it very easy to manipulate the mean, mode and SD of the models, which is why the entire concept of model averaging seems ludicrous.

  55. bender
    Posted May 2, 2008 at 8:29 AM | Permalink

    #53
    beaker, get over it. Agreed: the Douglass paper is not perfect. The point is: perfection is not required. They have made a very important observation and the alarmists are trying to hand-wave it away. You clearly agree that the observation is important. You can use words to distance yourself from the skeptics if you like, but the fact is you are as skeptical as we are about these models. Otherwise you wouldn’t be parsing the paper and this discussion as carefully as you are.

  56. James Lane
    Posted May 2, 2008 at 8:29 AM | Permalink

    Beaker, don’t get the wrong end of the stick:

    James Lane says:

    “Even beaker seems to have conceded that the models are hopeless for the tropical troposphere, although maintaining that they are not statistically falsified.”

    I’m sorry, but this comment is an illustration of the problem with blogs on climate change, which is that everyone is automatically assumed to have a position on one side of the debate or another. I don’t; however, I do think it is important that papers potentially containing errors are audited in an impartial and impersonal manner, without any commitment to the outcome of the audit, regardless of which side of the debate they belong to. I have made no such “concession”, as I have not said that the ensemble means give good predictions of the tropical tropospheric trends (and neither, for instance, does Gavin@RC if you read his article in an unbiased manner)!

    The models have NOT been falsified, they have been shown to be significantly biased. I have explained the difference (and at Willis’ request demonstrated that these terms have pre-existing meanings in statistics). The reason why Douglass et al. needs auditing is because some, such as yourself, believe that inconsistency has been established rather than bias; it hasn’t.

    I don’t conclude anything about your position on AGW. I value your contribution to the blog. But I think your position is absurd. Do you really believe that the modeled representation of the tropical tropo is consistent with the observational data?

  57. Eggplant fan
    Posted May 2, 2008 at 8:33 AM | Permalink

    Beaker #53:

    I’ve checked a number of the statistical terminology glossaries listed at http://www.glossarist.com/glossaries/science/physical-sciences/statistics.asp?Page=2 and have not found an existing definition of “inconsistent”. Douglass et al.’s use was informal, but was it really improper?

  58. Christopher
    Posted May 2, 2008 at 8:36 AM | Permalink

    >…To pick and choose a few dozen makes it very easy to manipulate the mean, mode and SD of the models, which is why the entire concept of model averaging seems ludicrous.

    Unless you take every single model (no cherry picking) or administer an a priori test to determine which ones should be dropped (lemon dropping by enforcing some degree of congruence with observational data). I do agree, and this would seem obvious, that summary statistics can be trivially pushed to a breakdown point. The mean is not a robust statistic.

    I am also curious how beaker would have done such an analysis. Maybe I’ve missed it? But Bayesian stats is not my strong point so I’d find it illuminating if he were to share.

  59. beaker
    Posted May 2, 2008 at 8:36 AM | Permalink

    Steve: Re: “Beaker’s original endorsement of Gavin’s position is valid, but that’s entirely different.” I think my endorsement will depend on how you read Gavin’s article. My interpretation (as a fairly neutral observer) was that he considers “inconsistent” the same way that I (and other statisticians) do, as meaning that the ensemble can’t explain the observational data, i.e. the ensemble considers the data to be outliers, or implausible. In that case, the +-2SD test is the right test. My reading of the article was that he also agrees that the model fit is not perfect (I would go further and say it is fairly poor unless you look at RAOBCORE 1.4). All he argues is that there is no clear inconsistency between the models and the data in the usual statistical sense, and he has demonstrated that. I am reasonably confident, having read a number of his articles, that if asked, he would readily agree that the ensemble demonstrates a significant bias.

    If you consider Gavin’s position to be that the models receive any significant corroboration from these observations, then I would not endorse that! There is plausibility, but little support.

  60. Eggplant fan
    Posted May 2, 2008 at 8:44 AM | Permalink

    Steve McIntyre #52:

    Thanks for providing the IPCC statements. That certainly isn’t a rigorous defense, and really not much more than hand waving. Maybe the systematic errors do just happen to average out, but there doesn’t seem to be a good demonstration of that except that they can get an answer they like for a particular parameter (the surface temperature). I’ll try to dig through the report and see exactly what they include in chapters 8 through 10. (Either that or drop it altogether as I’m spending way too much time learning about this!)

  61. bender
    Posted May 2, 2008 at 8:45 AM | Permalink

    #50 For pure scientists, yes, statistics are always preferred over heuristics. But to the extent that global policy is run by business interests and bureaucracies who have no time for “paralysis by analysis”, the world is indeed being run on heuristics. Should IPCC be held to a higher standard? Some would argue that. Those in favor of a “precautionary principle” would argue otherwise. Funny how it’s always the other guy that needs to show more “common sense”.

    Your other points have been made many times and are tiresome.

  62. beaker
    Posted May 2, 2008 at 8:47 AM | Permalink

    Christopher says:

    I am also curious how beaker would have done such an analysis. Maybe I’ve missed it? But Bayesian stats is not my strong point so I’d find it illuminating if he were to share.

    The first thing to do is to establish exactly what the question is and express it unambiguously in the appropriate statistical terms. Most statistical tests tell you something, the key is to know what to test for to answer the important question and how to interpret the answer.

    If you wanted to ask “Are GCMs consistent as a basic approach” then the +-2sd test discussed at RC by Gavin is the right one (although he didn’t include the uncertainty in the data and a few other sources that you would need to perform the test really thoroughly).

    If you wanted to ask “Is the mean of the ensemble an accurate predictor of the tropical trend” then the Douglass et al. +-2se test for statistical bias is fairly reasonable, although again you would have to include the data uncertainty to do it properly (it would be +-2sd on the data not +-2se). However, as I said, a decent statistician would tell you not to use the ensemble mean without considering the uncertainty (+-2sd error bars) anyway.
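    A minimal sketch contrasting the two tests just described, on hypothetical per-run trends and a single observed trend; the data-side uncertainty mentioned above is omitted for brevity, and the function names are mine:

        import numpy as np

        def biased(run_trends, obs_trend):
            """Douglass-style +/-2 SE test: is the observation outside the uncertainty of the
            ensemble MEAN?  Rejection here shows the mean is a poor predictor (bias)."""
            t = np.asarray(run_trends, dtype=float)
            se = t.std(ddof=1) / np.sqrt(t.size)
            return abs(obs_trend - t.mean()) > 2 * se

        def inconsistent(run_trends, obs_trend):
            """RC-style +/-2 SD test: is the observation outside the spread of values the
            ensemble considers plausible?  Rejection here shows inconsistency."""
            t = np.asarray(run_trends, dtype=float)
            return abs(obs_trend - t.mean()) > 2 * t.std(ddof=1)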

  63. Christopher
    Posted May 2, 2008 at 8:53 AM | Permalink

    Re: 62

    beaker, I understand that. I had, perhaps mistakenly, thought there might be a different (more Bayesian) approach you would adopt.

  64. bender
    Posted May 2, 2008 at 8:58 AM | Permalink

    a decent statistician would tell you not to use the ensemble mean without considering the uncertainty (+-2sd error bars) anyway

    1. So how would YOU do that?
    2. This is only going to widen gavin’s already huge envelope, and make the GCMs look *more* useless than they already appear. Agreed?

  65. beaker
    Posted May 2, 2008 at 9:00 AM | Permalink

    Eggplant fan says:

    I’ve checked a number of the statistical terminology glossaries listed at http://www.glossarist.com/glossaries/science/physical-sciences/statistics.asp?Page=2 and have not found an existing definition of “inconsistent”. Douglass et al.’s use was informal, but was it really improper?

    You won’t find inconsistent defined that way as an individual term, as it has another deeper meaning as a jargon term. However, it does have a clear statistical implication; I posted an example of this on the other thread and gave full details of the paper. You can find plenty more by performing a Google Scholar search for “inconsistent with the data”. Incidentally, yesterday I discussed this with another statistician in my department, over a cup of tea, this time a proper frequentist one of considerable experience, and he said that he would naturally interpret “inconsistent with the data” in exactly the same way that I did (which was reassuring ;o) (and smiled when I explained how it has been used in the paper). Having discussed this issue now on CA, I can see why the term was being used, but it is improper in a statistical sense, as there is already a more accurate and specific term (bias).

  66. bender
    Posted May 2, 2008 at 9:09 AM | Permalink

    Quit word parsing and focus on the substantive.

  67. beaker
    Posted May 2, 2008 at 9:11 AM | Permalink

    bender says:

    a decent statistician would tell you not to use the ensemble mean without considering the uncertainty (+-2sd error bars) anyway

    1. So how would YOU do that?

    I have already explained that a couple of times in the last day or so and at least once on this thread. See e.g. #31.

    The models give an impression of the logical consequences of the detailed assumptions regarding climate physics on which they are based, nothing more. If we do not understand the climate physics really well, the predictions are likely to be unreliable. However, their predictions can give an indication of the range of outcomes that can be considered plausible based on what we do know. Which is why we need to look at the whole ensemble, not just the mean, if we are really listening to what the GCMs are telling us.

    If you understand that, you will see why Gavin is right about the test of inconsistency, and why establishing bias is not necessarily the only test of whether the GCMs are useful.

  68. Not sure
    Posted May 2, 2008 at 9:14 AM | Permalink

    #49:

    They’re just using a self-signed SSL certificate. They either didn’t want to or couldn’t pay for a commercial one.

  69. Patrick M.
    Posted May 2, 2008 at 9:21 AM | Permalink

    (Beaker):

    I’m a layman trying to understand your descriptions. Would I be correct in making the following analogy?

    If we were talking about shooting a rifle would:

    1 – uncertainty = grouping
    2 – bias = distance of the center of the group from the bullseye

    If this is accurate, how would this analogy work regarding GCM’s?

    Thanks

  70. beaker
    Posted May 2, 2008 at 9:22 AM | Permalink

    Christopher says:

    Re: 62 beaker, I understand that. I had, perhaps mistakenly, thought there might be a different (more Bayesian) approach you would adopt.

    I would be happy with the frequentist tests (I may be a Bayesian, but I am not bigoted about it ;o); both are fine as long as they are implemented properly and the claim is supported by the outcome. However, it is my Bayesian outlook that makes me want to use the uncertainty to give more robust predictions, rather than just use the ensemble mean as if I had confidence in it.

  71. beaker
    Posted May 2, 2008 at 9:31 AM | Permalink

    Patrick M: Spot on. As an analogy, the models all predict different values for the trend, the spread of these predictions is just like the grouping and the bias is very much like the distance from the center of mass of the group to the bullseye (except that there are uncertainties in the data as well, so we don’t know exactly where the bullseye is, although we have a pretty good idea).

    The idea of inconsistency would be like saying that the marksman is so bad that the bullseye isn’t even anywhere within the grouping.

    The idea of significant bias is basically like saying that the bullseye clearly isn’t in the middle of the grouping, so there is good evidence that the marksman needs to adjust his sights (or the climate modeller his GCM, as noted by Gavin on RC).

  72. Steve McIntyre
    Posted May 2, 2008 at 9:31 AM | Permalink

    Part of the problem here may lie in the fact that IPCC AR4 Chapter 8 “Climate Models” does not contain a single reference to the problem being discussed here. It contains only one irrelevant mention of the tropical troposphere.

  73. MrPete
    Posted May 2, 2008 at 9:32 AM | Permalink

    I found this helpful, and enjoyable, not least:

    3) A frequentist is a person whose long-run ambition is to be wrong 5% of the time.
    4) A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.

    🙂

  74. MrPete
    Posted May 2, 2008 at 9:37 AM | Permalink

    To continue the marksman analogy:

    In this case, we’re not actually examining a single marksman. We have 22 contestants aiming at the same target, with different colored bullets.

    Nineteen of the marksmen are shooting their BB guns at the wrong side of the barn door.

    Three are getting an occasional shot within range. Except apparently they are shooting water pistols instead of BB guns — they didn’t read the contest rules. (This per Bender #3)

    🙂

  75. PI
    Posted May 2, 2008 at 9:38 AM | Permalink

    Some people seem to be confused about the Bayesian approach to this problem.

    beaker initially said that averaging models is a standard Bayesian practice, and JamesG complained that was therefore “another” reason — I wonder what the rest are? — to distrust Bayesian statistics.

    What beaker described is technically Bayesian, but it uses rather unrealistic prior assumptions: it assumes that you believe that all models are, a priori, equally probable to be true.

    Nobody actually believes this. But if you did believe it, then averaging the models would be perfectly valid, so this doesn’t speak against Bayesian methods: it’s a correct inference, given that assumption. The problem is that we have experimental and theoretical reasons to believe that some models are more probable than others, not that Bayesian inference doesn’t work.

    What Bayesians actually do in Bayesian model averaging (BMA), a la Raftery, is to give each model a probabilistic weight and compute a weighted average of the models. The weight is determined by how well each model fits the observations, as quantified by a likelihood function. (There is much room to argue about what likelihood to use, but frequentists have that problem too.) Models which fit the observations have a higher posterior probability of being true, and get a greater weight in the average.

    This is, in essence, a fancier version of what Douglass is describing in frequentist terms. Models which don’t fit the data are essentially discarded (given a low posterior probability). If you choose to average them, those models contribute little to the ensemble average (as long as there are models in the ensemble which fit the data better).

    Not all Bayesians agree with the BMA approach to model combination; there are different camps who like different approaches. BMA assumes that there is such a thing as a true model, and that one of the considered models is the true model (but we’re not sure which). Arguably, this is not the case. BMA proponents have various responses to that objection. One is that it’s not rigorously correct but it is a good approximation as long as the considered models span, in some sense, the full space of reasonable models. Another is more pragmatic, in that the model average is often demonstrably a better predictor (according to, e.g., cross validation) than is any individual model, even when all the considered models have obvious structural errors. There may be other, better responses; I am not an expert in BMA.
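
    A toy sketch of the weighting step described above, using a Gaussian likelihood and invented numbers; this is only the bare re-weighting idea, not a full BMA implementation.

        import numpy as np

        model_preds = np.array([0.10, 0.18, 0.22, 0.27, 0.31, 0.35])  # hypothetical model predictions
        obs, obs_sd = 0.06, 0.04                                       # hypothetical observation and uncertainty

        # Gaussian log-likelihood of the observation under each model, equal priors assumed
        log_like = -0.5 * ((model_preds - obs) / obs_sd) ** 2
        weights = np.exp(log_like - log_like.max())
        weights /= weights.sum()                       # posterior weights, summing to 1

        bma_mean = np.sum(weights * model_preds)       # poorly fitting models contribute little
        print(weights.round(3), round(bma_mean, 3))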

  76. Pat Keating
    Posted May 2, 2008 at 9:47 AM | Permalink

    59

    ..he considers “inconsistent” the same way that I (and other statisticians) do, as meaning that the ensemble can’t explain the observational data, i.e. the ensemble considers the data to be outliers, or implausible.

    This is trying to turn science upside down. The models are the “outliers” or “implausible”, not the data. It is the data which is pre-eminent, and a useful model must explain the data, or it fails.

  77. beaker
    Posted May 2, 2008 at 9:52 AM | Permalink

    PI: The problem with re-weighting like this is that the historical data has already been used (even if not directly) to tune the model parameters, so you would be using the data twice, which raises problems. Assigning equal weight to each model at least expresses an equal confidence in the ability of each modelling group. Not being an expert in the field I am not qualified to suggest a more realistic prior than that, but I can advise on how to proceed if a better prior were available. However, my intuition suggests that an equal weighting would guard against making extreme predictions in either direction, which would be no bad thing as it is generally better to express uncertainty than undue confidence.

  78. PI
    Posted May 2, 2008 at 9:59 AM | Permalink

    MrPete,

    My favorite Bayesian/frequentist quote is by P. G. Hamer:

    “A Frequentist uses impeccable logic to answer the wrong question, while a Bayesian answers the right question by making assumptions that nobody can fully believe in.”

  79. Christopher
    Posted May 2, 2008 at 10:02 AM | Permalink

    Re: 75
    Thank you, that was very useful (in effect what I was hoping to glean from beaker). Again, my Bayesian ability is limited, but the reason I’m curious here is that I see a link between ensembles and ergodicity. This is a real question in my opinion. I have also found an example where this procedure was applied in a relevant domain and where some of the issues here are discussed with more rigor than Douglass: “Using Bayesian Model Averaging to Calibrate Forecast Ensembles.”

  80. MrPete
    Posted May 2, 2008 at 10:04 AM | Permalink

    beaker, while I agree with you that the models appear to be tuned from the data, that flies in the face of what GCM experts tell us — claiming the models are based on physics, with little or no correction from observation. This gets into another long term CA thread topic led by Dr Browning.

    There are quite a few surprising ways that modelers explicitly or implicitly try to have it both ways.

  81. beaker
    Posted May 2, 2008 at 10:06 AM | Permalink

    Pat Keating says:

    59 ..he considers “inconsistent” the same way that I (and other statisticians) do, as meaning that the ensemble can’t explain the observational data, i.e. the ensemble considers the data to be outliers, or implausible.

    This is trying to turn science upside down. The models are the “outliers” or “implausible”, not the data. It is the data which is pre-eminent, and a useful model must explain the data, or it fails

    If you actually read my posts, you will find that I have repeatedly said that consistency does not imply usefulness (it is a necessary, but not a sufficient, condition). If you read Gavin’s RC article, you will find that he also says that the lack of inconsistency does not mean there is a good fit. Language like “trying to turn science upside down” is unhelpful, especially when it is Douglass et al that have redefined (I expect inadvertently) the meanings of existing statistical terms, not Gavin or I.

  82. PI
    Posted May 2, 2008 at 10:09 AM | Permalink

    beaker,

    It depends on which data were used for tuning (tropical vertical temperature profile trends, or something else?). Of course you have to worry about correlations between the residuals of whatever was used for tuning and whatever observations you’re trying to weight against (e.g., global annual averages vs. these vertical profiles). I don’t know that it’s unreasonable to give weights in this way, if you’re considering just the models’ ability to fit and predict these tropical trends. I do think it’s unreasonable to take a model that clearly is not predicting the data and assume that it has an equal probability as one that is, as far as predicting that data is concerned. On the other hand, you might say that they do about an equal job of fitting many other data, and on that basis we should give them similar prior weights even though they diverge on this data set.

  83. Pat Keating
    Posted May 2, 2008 at 10:21 AM | Permalink

    81 beaker

    I objected to the language you used (the ensemble considers the data to be….implausible.), which strongly implied that model results had at least as high a standing as observational data.

    I suspected that you didn’t believe that, but such biases in language, even though inadvertent, are important in science, which relies on careful statements for good communication. (BTW, an ensemble cannot consider anything).

  84. beaker
    Posted May 2, 2008 at 10:22 AM | Permalink

    PI says:

    On the other hand, you might say that they do about an equal job of fitting many other data, and on that basis we should give them similar prior weights even though they diverge on this data set.

    Yes, that is pretty much my line of thinking (not being sufficiently expert in the models themselves to hold a stronger opinion). My basic attitude is the med school principle of “first do no harm”: I’d rather not take steps that reduce the uncertainty unless I was sure they were justified, as that is a recipe for taking (in)action based on overly confident predictions of an extreme outcome.

    If you wanted to make predictions of the future tropical trends, I could see the value of dropping those models that were grossly wrong on the hindcast (taking into account all uncertainties in the data, and the stochastic uncertainties in each model), but I would not drop those models for all predictions solely on that basis.

  85. Steve McIntyre
    Posted May 2, 2008 at 10:26 AM | Permalink

    RAOBCORE versions are another controversy regarding this paper. RAOBCORE http://www.univie.ac.at/theoret-met/research/raobcore/ is an acronym for “RAdiosonde OBservation COrrection using REanalyses (RAOBCORE)”. So it’s a radiosonde version of the adjustment process and not independent measurements.

    They use HadCRUT3 surface temperatures in their calculations and thus are not “independent” of HadCRUT3 to the extent that that matters:

    For the calculation of the MT and LT layer mean temperatures, not only radiosonde data but also HadCRUT3 surface temperature data have been used.

    The tropical troposphere appears to be a source of concern for them:

    Cautionary note: The tropical mean trends 1958-1978 show warming at low levels but cooling at upper tropospheric levels, which seems unrealistic. This feature is related to a strong warming anomaly over the Eastern US which is spread to the Caribbean by the ERA-40 bg. Consequently the Carribean stations may be overcorrected at low levels during this period. Since most tropical stations are in the Caribbean during these early days, this problem strongly affects also the global tropical means. This issue, which affects all RAOBCORE versions, will be fixed in the next version.

    Their web-visualization diagrams show that one of the outcomes of their adjustment process is in fact to increase the trend in the tropical troposphere.

    In a quick look at their adjustment procedures, they look a lot like what USHCN plans to introduce – heavy reliance on changepoint analysis developed by climate scientists. An important issue to be sure, but I’m pretty sure that the USHCN changepoint analysis is easily contaminated and this is something that would need to be examined for RAOBCORE.

  86. steven mosher
    Posted May 2, 2008 at 10:30 AM | Permalink

    This argument has an interesting flipside.

    When models don’t match the observations, too well.

    Here’s one.

    Bender will get the argument.

  87. Jedwards
    Posted May 2, 2008 at 10:32 AM | Permalink

    Re #71, Ok that was very clear and understandable (being trained in classical marksmanship myself).

    However, your statement below is actually what we have here. The grouping of the ensembles clearly misses the bullseye (demonstrating bias or “windage” in marksmanship terms), but appears to at least be on the paper (and therefore NOT inconsistent).

    The idea of inconsistency would be like saying that the marksman is so bad that the bullseye isn’t even anywhere within the grouping.

    As anyone who has fired a gun will tell you, though, it’s pretty obvious that these folks need to “adjust their sights” more than just a touch.

  88. Patrick M.
    Posted May 2, 2008 at 10:39 AM | Permalink

    re #16, 17, 71 (beaker):

    Okay, after getting my rifle analogy approved and re-reading a lot of the posts, I believe I understand what beaker is saying, for example from #17

    To test for inconsistency properly, standard practice would be to show that the mean +-2sd error bars of the models don’t overlap with the mean +-2sd error bars of the data. Yes, I know that makes the inconsistency test easier to pass, but that is the correct way to perform the test for inconsistency.

    The fact that #16 does not show the +-2sd for the data means that we haven’t shown a proper test for “inconsistency”. But it looks like “inconsistency” is not very far off! Is inconsistency always a pass/fail proposition, or can there be degrees of inconsistency (or maybe there’s another term that describes degree of inconsistency)?

    After staring at the graph in #16, it looks to my eye that while the GCMs may have avoided failing due to inconsistency, they only passed with a D-.

    As far as “bias” goes, I don’t see how a label of bias can be avoided, correct?

  89. beaker
    Posted May 2, 2008 at 10:41 AM | Permalink

    Pat Keating says:

    I objected to the language you used (the ensemble considers the data to be….implausible.), which strongly implied that model results had at least as high a standing as observational data.

    I suspected that you didn’t believe that, but such biases in language, even though inadvertant, are important in science, which relies on careful statements for good communication. (BTW, an ensemble cannot consider anything).

    Language biases are indeed important in science, which is why I have been using the correct statistical terminology, unlike Douglass et al., which is then repeatedly misunderstood! Plausibility is about possibility, not probability. If I had said the ensemble considers the data to be probable that would suggest that the data provided a degree of support to the models. If I said the ensemble considers the data to be plausible, that means only that the data do not contradict the model, but that does not imply any substantial degree of support. I chose my words there carefully. I find the best thing to do in a discussion/argument is to give my opponent the benefit of the doubt and try my best to see his point of view. A good start is to look for the most moderate interpretation of what they have said, and use that, rather than an extreme one (note you say you expected I didn’t mean that, but your response clearly implied that I did).

    BTW, from a Bayesian perspective, the output of a model (or ensemble of models) expresses the distribution of belief over a set of propositions or outcomes, so it is not that unreasonable to say that the models consider something plausible. Of course I could have written it more formally, but given that the scientifically correct terms that I have used so far have been repeatedly misunderstood, even though I have explained fully what they mean several times, I’m not sure that would be a good idea!

  90. beaker
    Posted May 2, 2008 at 10:48 AM | Permalink

    Jedwards says:

    Re #71, Ok that was very clear and understandable (being trained in classical marksmanship myself).

    However, your statement below is actually what we have here. The grouping of the ensembles clearly misses the bullseye (demonstrating bias or “windage” in marksmanship terms), but appears to at least be on the paper (and therefore NOT inconsistent).

    The idea of inconsistency would be like saying that the marksman is so bad that the bullseye isn’t even anywhere within the grouping.

    As anyone who has fired a gun will tell you, though, it’s pretty obvious that these folks need to “adjust their sights” more than just a touch.

    O.K., perhaps a better analogy for inconsistency would be that the grouping is so far from the bullseye that you have to wonder whether the barrel is so bent that you can shoot around corners! ;o)

    I am a toxophilite myself (target recurve bow).

  91. beaker
    Posted May 2, 2008 at 10:55 AM | Permalink

    PatrickM #88 That sounds like a reasonable summary (although the chart in #16 doesn’t include the stochastic uncertainties in the models or the uncertainty in the data, and I think all the plots are normalised to the surface trend, which, if fixed, would all make inconsistency harder to prove).

    Inconsistency is defined at a level of confidence: at 2SD you can be about 95% sure that there is an inconsistency, and at 3SD about 99% sure. The more standard deviations you use, the more confident you can be of the inconsistency, but at 3SD you are really already at the point of diminishing returns for most purposes.
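
    The quoted confidence levels follow from assuming a roughly normal spread; a quick check in Python (the Gaussian assumption here is the usual rule of thumb, not something established for the ensemble):

        import math

        # Two-sided probability mass within k standard deviations of the mean,
        # assuming a normal distribution
        for k in (1, 2, 3):
            print(k, "sd:", round(math.erf(k / math.sqrt(2)), 4))
        # prints roughly 0.6827, 0.9545, 0.9973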

  92. Patrick M.
    Posted May 2, 2008 at 11:20 AM | Permalink

    re #91 (beaker):

    As I suspected, beaker appears to be arguing neither for nor against AGW or GCMs. He is just arguing for better statistics and better data (aren’t we all). People like beaker, Svalgaard, and Steve M. who wear the black and white striped jerseys are very important to climate science, and they are what make Climate Audit better than the other blogs.

  93. beaker
    Posted May 2, 2008 at 11:29 AM | Permalink

    PatrickM. Cheers! At last I can go home for the weekend and enjoy some cricket, content in the knowledge that somebody understood exactly what I was saying! ;o)

    Have a good weekend everybody.

    P.S. I am generally for GCMs as AFAICS there isn’t a better approach, but they do appear to need some work!

    Steve: I played some cricket when I was at university in England (I hadn’t played in Canada). To this day, I remember the first over that I ever batted. We had a weak college team and I played mostly to fill out the ranks, but I could field and throw well. Because I hadn’t played before, they placed me last in the order and I never got up. One slow day when we were getting killed, I asked to be moved up the order – the other low batters weren’t any good either so I couldn’t do any worse. The other team was using a pretty weak bowler as well. I swung like a baseball player and got 16 runs on 4 home runs or whatever they’re called. So the other team immediately changed bowlers and the next bowler gave some complicated spinning thingee which turned me inside out and that was it. The cricket fields in England were pretty civilized places on a spring day.

  94. Pat Keating
    Posted May 2, 2008 at 11:37 AM | Permalink

    89 beaker

    I guess it is the Bayesian ‘perspective’ that is a little topsy-turvy, then. To a scientist, the models have to justify themselves against the observations, not the other way round.

    Assuming we agree that the observational data is pre-eminent over the model results, I guess we can let this one go, now.

  95. Pat Keating
    Posted May 2, 2008 at 11:41 AM | Permalink

    85 Steve M

    For that reason, I prefer the Had-AT data. I believe this is from radiosondes using conventionally-calibrated thermometry and minimal “adjustment”, though I’m not sure of that.

  96. Steve McIntyre
    Posted May 2, 2008 at 11:56 AM | Permalink

    #91-93. Based on these very constructive posts, I wonder if a re-phrasing of the point in Douglass mightn’t meet Beaker’s standards. If Douglass et al had made the very narrow claim that the multi-model mean of any 22 model ensemble is inconsistent with the data (aside from possible issues with RAOBCORE 1.4), doesn’t that still follow from the above analysis?

  97. PI
    Posted May 2, 2008 at 12:01 PM | Permalink

    Pat,

    It’s not accurate to say that the Bayesian perspective means that the models don’t have to justify themselves against the data. The whole point of Bayesian inference is to infer the relative probabilities of hypotheses (here, models or model assumptions) as implied by the data.

    In my opinion:

    The phrase “the model implies that the data are unlikely” should technically be interpreted as a statement about the likelihood function, p(data|model). (It would be less ambiguous to say, “the model implies the data are unlikely, assuming the model is correct”.) Both frequentists and Bayesians use likelihood functions.

    The phrase “the data imply the model is improbable” should technically be interpreted as a statement about the posterior probability, p(model|data). This is a fully Bayesian statement; frequentists don’t speak of the probability of hypotheses, only of the probability of data conditioned on hypotheses.
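
    A toy numerical version of that distinction, for three hypothetical models with invented likelihood values and an equal prior:

        import numpy as np

        likelihood = np.array([0.02, 0.10, 0.30])   # p(data | model_i), assumed values
        prior = np.array([1, 1, 1]) / 3             # equal prior probability for each model

        posterior = likelihood * prior              # Bayes' rule, up to a constant
        posterior /= posterior.sum()                # normalise: p(model_i | data)
        print(posterior.round(3))                   # belief shifts toward the best-fitting model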

  98. Phil.
    Posted May 2, 2008 at 12:08 PM | Permalink

    Re #93
    Enjoy the cricket, which match are you watching? I’m waiting for Eng-NZ myself.

  99. Patrick M.
    Posted May 2, 2008 at 12:11 PM | Permalink

    re #96 (beaker):

    LOL! You think #92 meant you could go?? Oh no, that just means we can consider you a resource, see #96…

  100. Eggplant fan
    Posted May 2, 2008 at 12:11 PM | Permalink

    Bender #60:

    The Douglass paper appeared in a scientific journal so of course the standard must be purely scientific. As for the IPCC, my impression was that they present themselves as a scientific organization, so I assumed that they held themselves to the same standard. Your point of analysis paralysis is well taken, but expressions of near certainty when it does not exist are equally dangerous (e.g. Iraq). As for my other points being tiresome, I’m not sure exactly what you are referring to since you addressed all of the points of my post directly, but sorry for being annoying. I’m just kind of overwhelmed by the whole situation.

  101. Patrick M.
    Posted May 2, 2008 at 12:13 PM | Permalink

    Correction

    re #93 (beaker)

  102. Kenneth Fritsch
    Posted May 2, 2008 at 12:14 PM | Permalink

    If they want to show the ensemble mean is a poor predictor of the observational data, that is a meaningful test of whether the ensemble is useful (although not the only one), and they have made their case. However bias and inconsistency are not the same thing, and they claim inconsistency.

    The authors have made their case and explained it in sufficient detail to allow the thinking person not to confuse the word inconsistent, used in the general sense, with the statistical implication of that word. If they wrote a correction stating that the word inconsistent was used in the general sense, as was obvious from the context, would it have any implications for the import of the paper? I think not.

    Whether the authors should have used a comparison in the manner of the IPCC is another question, but I would assume that when a paper is apparently written in response to at least some of the IPCC review findings, the authors might choose the IPCC’s standard of comparison. With any of the potential methods used, I think the point made would be the same: the uncertainty in the models is large and their average is different from the observed GHG fingerprint.

    I would not consider Beaker’s comments in the same vein as those from RC. RC has obviously taken sides in the AGW debate in favor of immediate mitigation and tends to show evidence and defenses that fit that side. Pointing to potential contradictions is in order for RC on issues such as these, and particularly in the case of the IPCC methodology, while I can only ponder why Beaker, who has statistical insights to offer here, keeps going back to the word “inconsistent” used in the Douglass paper.

  103. steven mosher
    Posted May 2, 2008 at 12:44 PM | Permalink

    Beaker or Dr. Curry, as I noted in #86, when models with natural forcing only don’t match observations (attribution studies), what type of statistical tests are used to determine that the models (with natural forcing only) don’t match the observations?

    Is this test done formally, or just visually?

  104. Cliff Huston
    Posted May 2, 2008 at 12:47 PM | Permalink

    RE: 96 Steve M
    You say:

    “If Douglass et al had made the very narrow claim that the multi-model mean of any 22 model ensemble is inconsistent with the data. . .”

    To my reading of Douglass et al, that is the claim they made.

    Douglass et al carefully define the data and methods used and conditionally draw their conclusions from the framework defined. The only possible hidden pea was the rationale for using RAOBCORE 1.2, which they promptly supplied in an addendum.

    I made a statement earlier that I thought that Gavin overstated the claim in order to attack it. Beaker objected because they used the word ‘inconsistent’.

    Cliff

  105. Sam Urbinto
    Posted May 2, 2008 at 12:59 PM | Permalink

    Now we see what happens when a bunch of people who basically agree with each other start discussing the details. 😀

    I like computers.
    Oh, which OS.
    Windows. <—> Linux.
    Both scream at once, you suck!
    Ladies and gentlemen, start your engines.

    BTW The BB (gun) is at http://www.climateaudit.org/phpBB3/

  106. Kenneth Fritsch
    Posted May 2, 2008 at 1:05 PM | Permalink

    Target practice results as analogies to the SD and SEM approaches.

    1. I shoot at a target with the idea of making an adjustment to my gun based on the results. I know that individual shots can be affected differently by wind conditions and small shooter variations. The target is analogous to the instrumental data “target” and my shots are the model outputs. I want to know if my gun is biased, and therefore I must not adjust each shot based on knowing how far and in what direction my past shots missed the target. How do I determine whether I have a statistically significant bias? I determine the average location of the shots and, using the SEM, determine whether that average location is statistically different from the target.

    2. I shoot at a target many times and then remove the sight from my gun for a final shot. Now perhaps one would want to determine whether that last shot was inconsistent with the previous ones. In that case one has an average and distribution of previous shots with which to compare the last shot, and to do that we use the SD. Notice that in this case we do not consider how close the previous shots or the last shot came to the target. It could be that the last shot was on target and the previous average was well off target, but this comparison says nothing about that.
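
    A minimal sketch of the two target-practice tests in #106, with invented one-dimensional shot data; the 2-SEM and 2-SD thresholds are the usual rule-of-thumb choices.

        import numpy as np

        shots = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.3])   # hypothetical horizontal miss distances
        target = 0.0

        mean = shots.mean()
        sd = shots.std(ddof=1)
        sem = sd / np.sqrt(len(shots))

        # 1. Bias of the gun: compare the mean point of impact to the target using the SEM.
        gun_biased = abs(mean - target) > 2 * sem

        # 2. Consistency of one further shot with the earlier group: compare it to the spread (SD).
        last_shot = 0.1
        shot_inconsistent = abs(last_shot - mean) > 2 * sd

        print("gun biased:", gun_biased, " last shot inconsistent:", shot_inconsistent)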

  107. NTaylor
    Posted May 2, 2008 at 1:10 PM | Permalink

    Steve McIntyre (#96) stated, “wonder if a re-phrasing of the point in Douglass mightn’t meet Beaker’s standards.”

    As a nonstatistician (clinician), why should there be a concern about meeting Beaker’s ‘standards’? Is it just an interest in another statistical method (theory/philosophy) and that this interests you from a technical standpoint?
    In other words, for me, reading most every post in the two applicable threads (I began to lose interest in beaker’s certitude near the end), it appears that his absolute pronouncements on the Douglass paper are entirely based upon a Bayesian approach.

    My question relates to the validity/reliability (in layman’s usage) of the Bayesian method itself. It is epistemological, I guess. Independent of appeal to authority, or of the fact that it is in widespread usage, is there any evidence or data (other than its own self-generated conclusions) that show it should even be given weight in this (or any other) discussion? Are there error bars, SD or SEM for the statistical method itself, so that one can assess independently whether even looking at the question from a Bayesian method is worth spending all this energy on, except as an intellectual exercise?

    In case there’s a question of motivation in this post, I’m certain (w/in 95% confidence intervals) that I’m asking a serious question (of course, except for this one!).

    [My experience in my (medical) field has been that many treatments and/or approaches have been widely adopted before ever showing they actually do what they were “intended” to do; that many of these passed ‘statistical analysis’; and that relief from the propagation of the errors after being widely accepted as ‘true’ takes a lot of energy. That time/energy should have been better spent at the beginning being skeptical.

    Also, as a side note; just based on the posted charts/graphs themselves, were this a meta-analysis of a medical question–as one who is responsible to my patients to first do no harm, I would never accept the conclusion that the models were useful. I wouldn’t care about the debate as to whether they were ‘falsified’ or not–they would be so for any implementation into my practice.]

    NT

  108. Michael Smith
    Posted May 2, 2008 at 1:16 PM | Permalink

    In comment #144 in the other thread concerning this paper, Willis showed a graph of the 22 models in question. He noted, as I had, that two of the models do not show an amplification of troposphere temperatures relative to the surface.

    I have argued that it makes no sense to include models that don’t predict troposphere warming relative to the surface in any attempt to evaluate how well the AGW models agree with the observations.

    To test my claim, I have removed these two models from the data and recalculated the limits, beaker style. Here is what I got.

    For the 150 hPa altitude, here is the data (units are in milli C/decade)

    Original Calculations, all 22 models, from table IIB of Douglass et al:

    Average 268
    Std. Dev. 160
    +2 SD 589
    -2 SD -52

    Note that this gives us a negative lower limit and thus encompasses the observations. Here is what you get for the 150 hPa altitude if you omit the two models that don’t predict amplified warming of the troposphere:

    Average 301
    Std. Dev. 126
    +2 S.D. 552
    -2 S.D. 50

    Note that this new lower limit excludes even the RAOBCORE V1.4 observations that RealClimate used in their response to Douglass.

    I will recalculate the limits at all the altitudes, but I expect the same sort of effect: Tightened limits with a considerably higher lower limit that moves even further away from the observations used by Douglass.
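
    A generic sketch of this recalculation; the array below is purely illustrative, since the 22 model values at 150 hPa from Table IIB are not reproduced in this comment.

        import numpy as np

        def two_sd_limits(trends):
            # Mean, standard deviation, and the mean +/- 2 SD limits of a set of model trends
            m = np.mean(trends)
            sd = np.std(trends, ddof=1)
            return m, sd, m + 2 * sd, m - 2 * sd

        trends = np.array([350, 90, 300, 260, -20, 410, 280, 330])   # illustrative values, milli C/decade
        keep = trends > 50                                           # e.g. drop models showing no amplification

        print(two_sd_limits(trends))        # limits for the full set
        print(two_sd_limits(trends[keep]))  # limits with the non-amplifying models removed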

  109. Pat Keating
    Posted May 2, 2008 at 1:23 PM | Permalink

    97 PI

    It’s not accurate to say that the Bayesian perspective means that the models don’t have to justify themselves against the data.

    No, I realize that. However, the language Beaker used (based on “the Baysian perspective”) stated that the observational data was “implausible”, a truly topsy-turvy viewpoint. That is what I objected to, because the GCMs are already given unearned credibility IMO in many quarters, and this was taking it too far, when one reads it as stated.

  110. Steve McIntyre
    Posted May 2, 2008 at 1:26 PM | Permalink

    #104. I agree that interpretation of their statement is quite reasonable, but let’s see what we can agree on first.

    #107. I’m just trying to identify what things are actually at issue underneath the verbiage. This is an ongoing frustration in climate science. Wahl and Ammann and ourselves have code that reconciles exactly. However rather than choosing to make a joint statement on what we agree on and what we disagree on and what’s necessary to resolve things, their decision was to say that our results were “unfounded” even though they got identical results for apples-and-apples cases. And in doing so, they slimed what we had done, rather than listing points of agreement and recognizing calculations that we had already done. It’s an endless process of picking spitballs off the wall with these folks.

  111. Cliff Huston
    Posted May 2, 2008 at 1:31 PM | Permalink

    Beaker,

    The disconnect in this dialog is one of context. In terms of the proper statistic to use for a test of inconsistency of the model ensemble, that context was set by the IPCC report (as shown by Steve M in #52 above). For all practical purposes, the IPCC has put forth one model, the output of which is defined as the average of the ensemble. In that context the only choice is to use standard error for the model error bars.

    You don’t agree with the IPCC defined model context, and you use that disagreement (in effect) to declare Douglass et al (who simply used the IPCC defined model context) ‘fatally flawed’. You may well be correct in this assessment, but you should then be as forthright in declaring the IPCC report ‘fatally flawed’. I doubt very much that Gavin (or the IPCC cast of thousands) would agree with that assessment of the IPCC report, but it would be logically consistent.

    Ah you say, but the IPCC was not testing for consistency but rather they were merely discussing consistency. Douglass et al, on the other hand, were testing for consistency and had a duty to change the context (hence the meaning) of the model data and use standard deviation for the test. It matters not that they were testing the IPCC theory that the model output (defined as the average of the ensemble) would skillfully track observed temperatures in the troposphere.

    But how can that test be valid, when it is out of context?

    Cliff

  112. PI
    Posted May 2, 2008 at 1:35 PM | Permalink

    NTaylor,

    “it appears that the key for his absolute pronouncements on the Douglass paper are entirely based upon a Bayesian approach.”

    There is nothing inherently Bayesian about averaging model output (or anything else). Nor is there anything inherently Bayesian about comparing model spread to data. Bayesian statistics just gives such model averages a probability interpretation. (For instance, an unweighted average corresponds to giving each model an equal probability.)

    “Independent of appeal to authority, or that it is in ?widespread usage–is there any evidence or data (other than it’s own self-generated conclusions) that show [the Bayesian method] should even be given weight in this (or any other) discussion?”

    I have no idea what this question means. But to throw some random points out there: Bayesian inference satisfies some rather well-known axioms for coherence of inference; there are theorems showing the convergence of Bayesian posteriors to the true values in the asymptotic limit of infinite data; you can check the quality of inference by testing it on data where the data-generating process is known; etc.

    Note that frequentist methods often don’t satisfy coherency criteria, so one might wonder why you’re singling out Bayesian methods here.

    I don’t know if any of that was useful to what you’re asking. Turn it around and replace “Bayesian” by “frequentist” in your question. If you can tell me the answer to that question, perhaps I can figure out what kind of answer to the Bayesian question you’re looking for.

    “Are there error bars, SD or SEM for the statistical method itself”

    I don’t know what it means to have error bars, standard deviations, etc. for “a statistical method”.

    “Also, as a side note; just based on the posted charts/graphs themselves, were this a meta-analysis of a medical question–as one who is responsible to my patients to first do no harm, I would never accept the conclusion that the models were useful.”

    They’re probably not that useful for predicting the data being discussed here. Whether they’re useful for predicting other things is a different question.

  113. Patrick M.
    Posted May 2, 2008 at 1:41 PM | Permalink

    re #108(Michael Smith):

    It seems you could do one of those robustness deals where you recalculate the +-2sd with various models removed and show that passing the inconsistency test depends on one or two models being present (like what they did with the bristlecones).

    Would that be a statistically valid approach?
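
    Something along the lines of #113, sketched with invented numbers (a simple leave-one-out pass over the model trends):

        import numpy as np

        trends = np.array([350, 90, 300, 260, -20, 410, 280, 330])   # hypothetical model trends
        obs = 60                                                     # hypothetical observed trend

        for i in range(len(trends)):
            subset = np.delete(trends, i)
            lower = subset.mean() - 2 * subset.std(ddof=1)
            print(f"drop model {i}: lower 2SD limit {lower:7.1f}, observation excluded: {obs < lower}")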

  114. Jon
    Posted May 2, 2008 at 2:28 PM | Permalink

    A few points that seem to have been briefly addressed but then dismissed:

    1. The Douglass et al paper is fair game to be audited just like any other. As such it should be judged on its success or failure to achieve what it claims to.
    2. It claimed to show inconsistency when it did not. Whether or not this was an error in understanding the language used or a genuine but mistaken belief that it did, the result is a failure to properly demonstrate the claim.
    3. This paper was EXPLICITLY touted by its coauthors as proof that the observed warming was not manmade and thus no policy should be made to address it[1][2]. Even had Douglass et al sought to and succeeded in showing model bias rather than inconsistency, these grandiose claims are completely unwarranted.

    This isn’t just a single paper that sought to and succeeded in pointing out conflicts between models and obs and found bias in the models. This was touted by its own authors as the silver-bullet, AGW killer.

    That is why some people find it a bit odd that auditing the paper by its own goals is being dismissed as attempts to confuse the issue.

    Kudos to those that are happy to audit papers on both sides. Thanks especially to beaker for having the patience to restate time and again his analysis.

  115. Lance
    Posted May 2, 2008 at 2:42 PM | Permalink

    Jon,

    Your complete dismissal of Douglass et al. seems a bit overwrought. The issues raised here by beaker and others seem to be mostly semantic. The predictive skills of the models used by the IPCC, while not slain by this study, are certainly on life support.

  116. Bob B
    Posted May 2, 2008 at 2:44 PM | Permalink

    Jon, while Douglass was not successful in showing there is no AGW fingerprint, I would dare say anyone reading his paper came away doubting the models are capable of “predicting” the Earth’s temperatures with any certainty. They are useless, especially where political and policy decisions need to be made. Douglass has more than cast doubt on the physics behind the models, since there is no clear AGW signal.

  117. Willis Eschenbach
    Posted May 2, 2008 at 2:44 PM | Permalink

    beaker, you say:

    To test for inconsistency properly, standard practice would be to show that the mean +-2sd error bars of the models don’t overlap with the mean +-2sd error bars of the data. Yes, I know that makes the inconsistency test easier to pass, but that is the correct way to perform the test for inconsistency.

    The fact that #16 does not show the +-2sd for the data means that we haven’t shown a proper test for “inconsistency”.

    My friend, you are moving too fast. Think about it for a minute.

    If all of the observations are outside the ±2SD error bars for the models, the ±2sd error bars for the observations will also be outside of those error bars. Look at it again, and this time consider where the ± 2SD error bars for the data will go.

    Therefore, I have shown the test that you required. The observations (and thus their error bars) are everywhere outside of the models ±2SD error bars. By your and Gavins definition, we can say that the models are inconsistent with the data.

    As near as I can tell by re-reading both threads, that was your only point. Now that your point has been proven wrong by the results of the inconsistency test, can we move on?

    w.

  118. Andrew
    Posted May 2, 2008 at 2:46 PM | Permalink

    114 (Jon): Back from sparring with Roger, eh? Singer is well known for his sometimes hyperbolic statements. But his claim is not in the paper itself. If you have a problem with Singer’s interpretation, address it to him, not at the paper. However, beaker’s statistical points have been well taken. They were not “briefly addressed and then dismissed”; there is ongoing discussion of them! I for one am not going to argue his points. However, you should hardly expect everyone to accept them; it doesn’t work that way. But given what you have said about CA and other blogs elsewhere, you seem to just be upset that people don’t hear the critique from RC and bow down to the great Gods of Discrediting. Well, half the time (nay, more) what comes from them is more hit-and-run, obfuscation, or even downright lies and misrepresentation of the publication record than insightful analysis, so don’t be surprised if people don’t instantly believe it. But you can go back to calling us deniers on other blogs now. Heck, you’d feel right at home at Tammy’s (is that where you came from?), where you wouldn’t even have to act as intelligent as you are now to impress them.

  119. Andrew
    Posted May 2, 2008 at 2:51 PM | Permalink

    116 (Bob B):

    Jon, while Douglas was not successful in showing there is no AGW fingerprint, I would dare say any reading his paper came away doubting the models are capable of “predicting” the Earths temperatures with any certainty. They are useless especially where political and policy decisions need to be made. Douglas has more then cast a doubt on the physics behind the models, since there is no clear AGW signal.

    Whuh?

  120. Michael Smith
    Posted May 2, 2008 at 2:55 PM | Permalink

    Willis, are those +-2SD error bars plotted using the standard deviations from table IIB in Douglass?

  121. Bob B
    Posted May 2, 2008 at 2:58 PM | Permalink

    Let’s frame the discussion the other way. What is the confidence, using the data from Douglass, that the models show skill in predicting the AGW footprint?

  122. Ntaylor
    Posted May 2, 2008 at 3:01 PM | Permalink

    #110 Steve

    Thanks, it is frustrating for you statistically; it’s frustrating from the outside looking in, as well. Hopefully, your work will eventually break down some barriers, and such unwillingness to state publicly the areas of agreement in principle will not occur going forward.

    #112 PI

    I think you caught the (wordy) gist of my question pretty well. I infer from your reply that (generally speaking) Bayesian and ‘frequentist’ approaches are equally valid/invalid techniques (if/when applied properly). My underlying question tended to the philosophical; how to ‘statistically validate’ competing statistical methods. (How do we know we know what we know.) I understand it is outside the scope of this thread and purpose of CA. Thanks.

    NT

  123. Dave Andrews
    Posted May 2, 2008 at 3:17 PM | Permalink

    Jon,

    You must surely know that this isn’t just about the Douglass paper. The models have considerable flaws, there are so many things they can’t parameterize in any meaningful way. They might be the best that can be achieved at the moment but they are far from adequately recreating the reality of the earth’s climate.

    This is probably accepted in the climatology world but it is not the way things have been presented to the general public, nor is it the impression IPCC gives to policy makers. Political decisions which impact millions of peoples lives are thus being made on the basis of inadequate models.

  124. bender
    Posted May 2, 2008 at 3:29 PM | Permalink

    #103
    Exactly. Yet another double-standard for the database. Skeptics must adhere to strict statistics for refuting models. Alarmists are free to just eyeball it.

  125. bender
    Posted May 2, 2008 at 3:35 PM | Permalink

    Don’t let beaker off the hook yet.
    1. He has not explained why he thinks there is a bias in some of the GCMs.
    2. He has not explained whether tropical tropospheric warming is or is not part of the AGW fingerprint.
    3. He hasn’t explained why GMT has flattened since 1998.
    He’s half-answering all the easy questions, dodging the important ones.

  126. Jon
    Posted May 2, 2008 at 3:44 PM | Permalink

    @118

    Singer well known for his sometimes hyperbolic statements. But his claim is not in the paper itself.

    His claim was made as part of a media blitz surrounding the paper. His coauthors seemed to have no issue with his claims. For the standards some seem to have regarding professional conduct amongst the community, I would have thought that this would be considered inexcusable behavior rather than brushed aside as idiosyncrasy. But to each their own, I suppose.

    However, beaker’s statistical points have been well taken. The were not “briefly addressed and then dismissed”, there is ongoing discussion of them!

    That wasn’t what I was referring to- I was referring to the attitude that seemed quite prevalent to treat his excellent points as an obfuscation of “the real issue(s)” rather than acknowledging them.

    given what you have said about CA and other blogs elsewhere

    I made no value judgment of the blogs, rather of some of the comments allowed. I’d rather not get side-tracked, but the repeated meme that climate science is a conspiracy, implicitly or explicitly stated is an example of such. Further I hardly see how posts I have written on other blogs not pertaining to this discussion would be relevant to the points I made here. Let’s stay on topic, shall we?

    you seem to just be upset that people don’t hear the critique from RC and just bow down to the great Gods of Discrediting.

    I’m afraid you’re reading too much into my posts. Whether it was beaker or RC pointing out the problems with Douglass et al’s claim is immaterial. And in fact I would rather people hear from someone like beaker so they cannot dismiss it as biased- although it seems that hasn’t stopped some people from doing so.

    But you can go back to calling us deniers on other blogs now.

    I never called anyone on this blog a “denier”, and I don’t see what you hope to gain from such an accusation other than rhetorical point scoring. I’d rather focus on the topic, thanks.

    Heck, you’d feel right at home at Tammy’s (is that where you came from?) where you would even have to act as intelligent as you are now to impress them.

    I have no idea what this means, but I assume it is also unrelated to the topic and is more attempted point scoring.

  127. bender
    Posted May 2, 2008 at 3:50 PM | Permalink

    A simple set of yes/no questions: was the Douglass et al. focus on the tropical troposphere justified? Is it now? And now that it has been admitted that vertical convection parameterizations are shaky (where does it say that in AR4?), would a focus on the tropical troposphere be difficult to justify?

    This is NOT about Douglass’ paper. NOT AT ALL. It is about the climate community – and RC’s – reaction to the paper’s revelations. Gavin’s response may have been, technically, error-free, dr beaker; however it was a dodgy kind of truth, NOT “the whole truth and nothing but the truth”. The whole truth would address the questions posed in #125.

  128. bender
    Posted May 2, 2008 at 3:53 PM | Permalink

    #126

    attitude that seemed quite prevalent to treat his excellent points as an obfuscation of “the real issue(s)”

    I stand by my assertions. His points are not “excellent”. Technically correct as they may be, they are a distraction from the primary issue: #125.

  129. bender
    Posted May 2, 2008 at 3:54 PM | Permalink

    Jon, what is the “fingerprint”? Where do we look for it? Thank you in advance for not dodging this all-important question.

  130. bender
    Posted May 2, 2008 at 3:59 PM | Permalink

    Dr Douglass:
    Did you cherry-pick the tropical troposphere for testing model fit (“consistency”, “bias”, whatever), or were you led there honestly by the models?

  131. Jedwards
    Posted May 2, 2008 at 4:02 PM | Permalink

    Re #117 Willis, I think we need one more chart, this time with MEAN_Obs +/-2SD included. If there is no overlap, then I “think” the inconsistency claim is valid, according to what I’ve been reading from beaker in this thread.

  132. SteveW
    Posted May 2, 2008 at 4:05 PM | Permalink

    It seems like computing a mean/sd/se for the models is a trap we don’t want to fall into: 1. We don’t know what models were (cherry) picked by the IPCC. 2. We don’t know the uncertainties of the models, and no error bars have been provided for the models (or seem forthcoming). We don’t need the broad brush.

    Let’s flip it around and do a graph like #16, but add ±2SD error bars for the observations only. Let’s see visually which models fall inside that! We could then say which models succeed or fail based on that measure.

    You could then make a statement like “x out of 22 models agree with observations”. Let each separate failed model maker go back to the drawing board.
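
    A sketch of the flipped comparison in #132, again with invented numbers standing in for the observed trend, its spread, and the model trends:

        import numpy as np

        obs_mean, obs_sd = 60, 40                                     # hypothetical observed trend and spread
        model_trends = np.array([350, 90, 300, 260, -20, 410, 280])   # hypothetical model trends

        within = np.abs(model_trends - obs_mean) <= 2 * obs_sd
        print(f"{within.sum()} of {len(model_trends)} models fall inside the observations' 2SD band")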

  133. bender
    Posted May 2, 2008 at 4:08 PM | Permalink

    #117 Willis

    If all of the observations are outside the ±2SD error bars for the models, the ±2sd error bars for the observations will also be outside of those error bars. Look at it again, and this time consider where the ± 2SD error bars for the data will go.

    Disagree, Willis. The upper bound on the data (+2sd) will be higher than the lower bound for the model (-2sd), i.e. they will overlap to some degree.

  134. bender
    Posted May 2, 2008 at 4:16 PM | Permalink

    #67
    Since beaker won’t answer in detail – just a trivial sketch – I will provide my own recipe.

    Use a Monte Carlo approach: pick a GCM at random, set initial conditions at random, and run scenarios with a reasonable range of uncertainty on the model parameters (accounting for Judy’s concern). Iterate a million or more times and count the number of runs that cool more than the observed. The ratio of too-warm to too-cool runs is the probability that your models are crap.

    This approach simultaneously accounts for uncertainty stemming from all sources: Pat Frank’s parameter uncertainty, internal climate variability, and model formulation errors. No fancy statistics, just Monte Carlo Count ‘Em.

    This is a paper.

    Dr Douglass?
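
    A bare-bones sketch of the recipe in #134; run_gcm() is a random-number stand-in for an actual perturbed model run, and every number here is invented.

        import random

        def run_gcm(model_id, perturbation):
            # Placeholder: a real implementation would launch a GCM run with perturbed
            # parameters/initial conditions and return the simulated tropical trend.
            return random.gauss(0.25 + 0.005 * model_id, 0.10 + perturbation)

        observed_trend = 0.06          # hypothetical observed tropical trend, K/decade
        n_runs = 100_000
        too_warm = 0

        for _ in range(n_runs):
            model_id = random.randrange(22)            # pick one of the 22 models at random
            perturbation = random.uniform(0.0, 0.05)   # crude stand-in for parameter uncertainty
            if run_gcm(model_id, perturbation) > observed_trend:
                too_warm += 1

        print(f"fraction of runs warmer than observed: {too_warm / n_runs:.3f}")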

  135. Bob B
    Posted May 2, 2008 at 4:22 PM | Permalink

    You can stretch the statistical arguments to prove global warming is consistent with global cooling:

    http://sciencepolicy.colorado.edu/prometheus/archives/climate_change/001413global_cooling_consi.html

    twits

  136. Cliff Huston
    Posted May 2, 2008 at 4:22 PM | Permalink

    RE: 120 Michael Smith
    I believe Willis is using data from the IIB table, but he is normalizing such that all the models agree on the surface temperature. Willis explains:

    From #144 on the Tropical Troposphere thread:
    “AGW theory says that as you go up from the tropical surface, the temperature trends will increase. So, let’s start all of them at the same spot, the actual surface trend, and see who does what. No error bars, no averages, just the raw data.”

    I think this is a reasonable way to look at the data, but Douglass et al did not go there. This step reduces the range of the models, hence reduces the 2SD error bars (although more on the high side than the low side). It is a fairer way when applying 2SD, because it directly addresses the AGW theory in the models (higher trends in the troposphere than at the surface), but it is open to objections.
    Cliff

  137. bender
    Posted May 2, 2008 at 4:37 PM | Permalink

    #135
    Cooling is evidence of “internal climate variability”. Warming is evidence of long-term GHG trend.

  138. John M
    Posted May 2, 2008 at 4:54 PM | Permalink

    Jon #126

    I never called anyone on this blog a “denier”,

    OK, perhaps true literally (I do admire fine parsing), but surely the folks over here might take exception to this fine prose you posted over at RP Jr’s site.

    You either don’t understand what you are doing, or you do and are doing it intentionally. You are grossly distorting things and making claims pretty much indistinguishable from the kind of garbage posted in the comments on CA or Watts’s blog.

    This was not too long after you wrote this in the same comment

    It’s not the questions, Roger. It’s your baffling repetition of denier memes- “They’re predicting global cooling”, “I’m not allowed to question modeling”, “global warming is predicted to be monotonic”, etc.

    When you and others insist on treating Singer and Christy’s work in the context of the body of their work, don’t complain when folks look at what you have written and its context.

    Steve, sorry if this is off-topic, but you might be interested in how Jon’s been treating your “brand”.

  139. kuhnkat
    Posted May 2, 2008 at 5:00 PM | Permalink

    It would appear from Beaker’s point of view that the main point of contention is the use of the word INCONSISTENT.

    In his world inconsistent means that the models COULD NOT INCLUDE THE OBSERVATIONAL DATA AS AN OUTPUT, and the test does not show that.

    In his terminology BIAS is what has been shown.

    Although Douglass did give a definition in the paper for what he was showing, Beaker apparently feels this is insufficient to cover for the use of a term that has a specific meaning among Statisticians!! That Douglass was using statistics to make his point gives this some validity.

    The quick fix would be for Douglass to amend his paper, changing the terminology.

    The harder fix is to require RC and/or IPCC to show the usefulness of using a suite in general, and this suite in particular, for this purpose in the first place!!!

  140. bender
    Posted May 2, 2008 at 5:03 PM | Permalink

    #126

    climate science is a conspiracy

    This subject has been dealt with previously. It’s NOT a “conspiracy”. What it is is .. well … Jon can search the archives to find out.

  141. Michael Smith
    Posted May 2, 2008 at 5:12 PM | Permalink

    Cliff, thanks for clarifying that.

  142. Kenneth Fritsch
    Posted May 2, 2008 at 5:15 PM | Permalink

    Re: #114

    That is why some people find it a bit odd that auditing the paper by its own goals is being dismissed as attempts to confuse the issue.

    Kudos to those that are happy to audit papers on both sides. Thanks especially to beaker for having the patience to restate time and again his analysis.

    Good example of a statistician like Beaker getting stuck on the use of a term like inconsistent, word-smithing it into something to the effect that this significantly affects the analyzed results of the Douglass paper, and then having someone like Jon take it a step further. Jon, most of us know what Beaker has contributed here and little of it has to do with his “auditing” picking out a term in the paper that the authors obviously were using in a general sense when the context is revealed. Beaker does have insights into the statistical methodology that could be applicable to these situations, but his auditing of the word “consistent” was not a shining moment.

    If Steve M were to audit like Beaker has in this case, I think many of his fellow bloggers would fire him on the spot. So in conclusion, for auditing I don’t need no Beakers, but I certainly think with proper details of his proposed methodologies for statistical analyses Beaker could add valuable content to the CA knowledge base.

  143. Bob B
    Posted May 2, 2008 at 5:17 PM | Permalink

    Bender says:
    May 2nd, 2008 at 4:37 pm

    #135
    Cooling is evidence of “internal climate variability”. Warming is evidence of long-term GHG trend.

    And where is your objective proof???

  144. Richard Sharpe
    Posted May 2, 2008 at 6:14 PM | Permalink

    Bob B, I think you have misunderstood bender.

  145. Bob B
    Posted May 2, 2008 at 6:42 PM | Permalink

    snip – no ruminations on policy please.

  146. Andrew
    Posted May 2, 2008 at 6:59 PM | Permalink

    126 (Jon): No attempt at point scoring is being made. You are creating some big conspiracy media campaign to use this paper to discredit AGW that is almost entirely imaginary, and you are complaining about it instead of addressing the statistical issue raised by beaker, and complaining that not everyone has uncritically accepted it. Big surprise, a site like this has “deniers” who are (gasp!) allowed free speech by Steve! I thought it was clear from my post that I knew what you said about us on other blogs. Don’t expect to be treated nicely by people you insult when they aren’t looking. I’m not excusing anyone mistreating you because of it, of course, but don’t be surprised. BTW, I would love for you to explain to me why you engage in marginalizing the opposition with your conspiracy theory about conspiracy theories. Climate science is no conspiracy, but that doesn’t mean it’s all good in the hood. No conspiracy necessary; the involved parties just follow their own obvious self interest.

    You think we are frustrating, repeat “denier memes” and don’t just take your word for gospel? Have you looked in the mirror?

    At least show the people here a little respect. All I see from you so far is an attempt to act like you are above us all. Beaker is better than you because he is trying to explain things calmly and non-partisanly, and, contrary to what you may intend, you are making his criticism harder to accept. Jeez, just don’t be rude!

  147. Geoff Sherrington
    Posted May 2, 2008 at 7:14 PM | Permalink

    Re # 49 Sinan Unar and similar

    Any way, the point of this post was to state that I no longer find the multiple and different number of runs per model that troubling given that it no longer looks like the decision was up to Douglass et al.

    We have seen statistical theory dissected in some detail above. But I repeat an earlier point. The figures being examined were almost certainly cherry picked by the modellers from far larger numbers of simulations that were never published or used in IPCC reports.

    If you are going to spend so much effort on statistical dissection, why not contact the modellers and ask them for ALL of the results of ALL of their models, not just the ones chosen by them (by probably subjective/biased criteria) for submission to papers that ended up in IPCC compilations?

    If you include ALL the simulations, including the rejected ones, I suspect that arguments about Bayesian versus classical, SD versus SE, distributions, falsification and so on would disappear in one huge bowl of well-mixed spaghetti that has a foul taste.

  148. Steve McIntyre
    Posted May 2, 2008 at 7:26 PM | Permalink

    #146. No more squabbling of this type.

    Perhaps it’s worth examining how IPCC uses the word “consistent”. AR4 chapter 8 (detection and attribution) has 138 uses of the word. Interestingly, it has only two mentions of “tropical troposphere”. They say:

    All data sets show that the global mean and tropical troposphere has warmed from 1958 to the present, with the warming trend in the troposphere slightly greater than at the surface.

    It doesn’t seem to me that “all” data sets show that the tropical troposphere has warmed slightly more than the surface.

    In their only other mention of the TT in this chapter, they say:

    Uncertainties remain in homogenised radiosonde data sets which could result in a spurious inference of net cooling in the tropical troposphere.

  149. Max
    Posted May 2, 2008 at 7:29 PM | Permalink

    Just a layman question: since the modern observed data is being argued as the flawed part of the comparison, how about modeling backwards? If you program a climate model with data from the past from all the temperature proxies done, does it vindicate all the past temperature proxies by matching them all together along with all known forcings at the time? Do they predict the proper temperature changes when major natural disasters occurred and put known elements in known amounts into the air? Or is that asking too much of the models?

  150. aurbo
    Posted May 2, 2008 at 7:45 PM | Permalink

    Re the target shooting analogy. A definition of a Texas Sharpshooter is one who unloads a clip or two on a blank wall and then draws the bullseye around the tightest grouping.

    Applying this analogy to climate “science”, some AGW proponents know that model performance can be improved by simply cherry-picking or altering the observational data. Take for example the apparently consistent upward bias of each systematic “correction” made to land temperature observations and MSU data. I believe that statistics will show that so far the probability of such adjustments being upwards is close to 100%.

    (I hope I haven’t offended Beaker by using consistent and bias in the same sentence).

  151. Posted May 2, 2008 at 7:59 PM | Permalink

    Bender #125. Bender is correct that beaker has been dancing on the head of a pin with his ‘inconsistency’ schtick. And contributes nothing else. Likewise a lot of commentary, perhaps of necessity, by delving into statistical trivia, fails to recognize the extremely weak posture of current global climate science. We may be heading into a new 10 year or longer cooling period, but the IPCC climate models are useless for analysing this trend, or as this blog shows, any other observed trends or claimed climate fingerprints.

    snip – please don’t vent about climate politics.

  152. Andrew
    Posted May 2, 2008 at 8:02 PM | Permalink

    148 (Steve): Agreed. I hate getting into these stupid arguments.

    150 (aurbo): The modelers have definitely done this with the GMST. But I don’t think this is going on with the tropical trends. The large variability here has to do with something distinct about the tropics: slightly different parameters will make the biggest difference in model outputs there. And I don’t think every modeler knew the answer ahead of time, either.

  153. bender
    Posted May 2, 2008 at 8:35 PM | Permalink

    We may be heading into a new 10 year or longer cooling period, but the IPCC climate models are useless for analysing this trend

    They have done themselves and everyone a huge disservice in under-estimating and/or under-representing the stochastic “internal climate variability”. For example, I suspect lucia’s “refutation” is flawed – not through any fault of hers, but because you need an ill-behaved non-ergodic GCM to represent the true variability in Earth’s climate.

    And, yes, I am aware this is gross speculation.

    For me, the question is just how poorly these models perform in representing the internal stochastics. The ocean scares me.

  154. Pat Keating
    Posted May 2, 2008 at 8:39 PM | Permalink

    148 Steve McI

    Uncertainties remain in homogenised radiosonde data sets which could result in a spurious inference of net cooling in the tropical troposphere.

    So upper troposphere cooling is a “spurious inference”. Sometimes this IPCC stuff is hard to believe. They sure are committed to their belief in GCMs.

  155. Ron Cram
    Posted May 2, 2008 at 8:40 PM | Permalink

    bender,
    re: 153

    Would you say your view of the ocean has changed at all in recent months?

  156. steven mosher
    Posted May 2, 2008 at 8:42 PM | Permalink

    RE 148. I think there is an interesting issue to investigate, namely what tests were done to decide the “attribution”. If you look at the studies on attribution, they ran the GCMs with all natural forcings and compared them to the observation record. Then they added GHG forcing. In the former case they argued (you can see the charts) that the models did not explain the data. What kind of tests did they use? How wide was the distribution of the attribution runs? Now clearly, by eyeball, when you add GHGs the hindcasts do “much better”. But what method did they use to assess this “much better”? It is exactly the SAME logical structure as Douglass presents: observations that don’t seem to accord with the models. So, one could apply the beaker method (he’s sharp and thought provoking) to the attribution question.

    A fun thought experiment.

  157. bender
    Posted May 2, 2008 at 8:45 PM | Permalink

    #155 No. Why?

  158. Ron Cram
    Posted May 2, 2008 at 8:45 PM | Permalink

    bender,
    re: 134

    I like your idea for a paper. I think it should be run on 2, 8, and 12. I am already convinced the other models are crap.

  159. bender
    Posted May 2, 2008 at 8:46 PM | Permalink

    #156

    one could apply the beaker method (he’s sharp and thought provoking) to the attribution question

    dr beaker, to the blackboard, please.

  160. Ron Cram
    Posted May 2, 2008 at 8:48 PM | Permalink

    bender
    re: 157

    Because in a previous thread some months ago, you did not seem to think much of the PDO’s contribution to internal climate variability. In my view the PDO is the most powerful oceanic oscillation.

  161. bender
    Posted May 2, 2008 at 9:21 PM | Permalink

    #160
    1. I never said, suggested, or implied this. Maybe you misinterpreted something I said?
    2. PDO more powerful than ENSO? First, this is not a matter of opinion. Second, I think it can be disproven fairly easily based on physics.

  162. Ron Cram
    Posted May 2, 2008 at 9:32 PM | Permalink

    bender,

    It is possible I misunderstood what you said. Yes, it is a matter of opinion. ENSO gets all of the press but you never get a powerfully warm El Nino when the PDO is in the cool phase. You never see a powerfully cool La Nina when the PDO is in the warm phase. Yes, I know about monthly variations in the PDO related to ENSO (possibly some kind of negative feedback?) but they do not last long. Anyway, just my opinion. Back to the Douglass paper.

  163. Joe Crawford
    Posted May 2, 2008 at 9:36 PM | Permalink

    Re #37 beaker: “If a number of groups independently come up with similar models, it is an indication that a particular set of assumptions is more plausible than others”

    If we are going to be precise, this only holds true if there is no bias introduced by the scientists who generated these assumptions. Using your own definitions and methods, when taken as a group the 22 models are biased. The 64 dollar question is “Is this bias due to good science, or to preconceived notions / group think of the scientists?” Since that question is impossible to answer (i.e., the models were not developed in isolation), the IPCC is wrong and each model and its results must be treated separately.

    Joe

  164. bender
    Posted May 2, 2008 at 9:41 PM | Permalink

    #162 It is not a matter of opinion. It is a matter of Terawatts/Yr. But, yes, back to the paper.

  165. Jon
    Posted May 3, 2008 at 6:18 AM | Permalink

    @138

    This was not too long after you wrote this in the same comment

    Again, that was in reference to a specific kind of post that finds a home here and at other blogs that let’s just say isn’t particularly credible. I didn’t call any poster or blog a “denier” and again, it’s pretty transparent to see attempts to imply such as a way to avoid my comments on this topic.

    When you and others insist on treating Singer and Christy’s work in the context of the body of their work

    Shockingly, I and others have a huge problem with the way Singer is choosing to comport himself. I would think, of all places, readers of this site would have zero patience for politicizing and grossly overstating the results of flawed science in order to advance a particular take on policy.

    I have yet to see one person denounce Singer’s obviously unsupportable claims regarding this paper. I can only imagine what this thread would be like if the author were Mann. If Dr. Douglass returns to this thread, I would love to hear his reasoning for allowing his coauthors to portray this paper as something that it clearly is not.

    @142

    Jon, most of us know what Beaker has contributed here and little of it has to do with his “auditing” picking out a term in the paper that the authors obviously were using in a general sense when the context is revealed.

    As I said earlier:

    1. The Douglass et al paper is fair game to be audited just like any other. As such it should be judged on its success or failure to achieve what it claims to.
    2. It claimed to show inconsistency when it did not. Whether or not this was an error in understanding the language used or a genuine but mistaken belief that it did, the result is a failure to properly demonstrate the claim.

    That the failure of this paper to demonstrate its claim is being so willfully excused by some- not all, and to the rest of you I tip my hat- is telling. Extremely so given the way in which it was presented by its own authors.

  166. bender
    Posted May 3, 2008 at 6:44 AM | Permalink

    It claimed to show inconsistency when it did not.

    What paper did you read? Are you sure you got the right one? Their conclusion is mostly correct: most of the models are inconsistent with observations.

  167. bender
    Posted May 3, 2008 at 6:45 AM | Permalink

    And, Jon, Raven has a question for you in #2.
    [Prediction: dodge.]

  168. Pat Keating
    Posted May 3, 2008 at 7:19 AM | Permalink

    165 jon

    Shockingly, I and others have a huge problem with the way Singer is choosing to comport himself. I would think, of all places, readers of this site would have zero patience for politicizing and grossly overstating the results of flawed science in order to advance a particular take on policy.

    But no problem with Hansen for “politicizing and grossly overstating the results of flawed science in order to advance a particular take on policy”.

  169. Craig Loehle
    Posted May 3, 2008 at 7:25 AM | Permalink

    It seems to me there are some serious cart-before-the-horse issues here. When testing a model, we want the model to fall within the confidence intervals of the data. We do not accept that a model is valid when it has wide confidence intervals that barely touch the data. In the latter case, we can say that 95% of the time the model results are way outside the data bounds – does that sound like “not falsified”?

  170. Ron Cram
    Posted May 3, 2008 at 7:40 AM | Permalink

    jon,
    re: 165

    How did Singer mischaracterize the paper? Can you provide a link? My opinion of Singer is obviously much higher than yours and I want to see what evidence you have to back your claims.

  171. Ron Cram
    Posted May 3, 2008 at 7:51 AM | Permalink

    Craig,
    re: 169

    I would prefer scientists stick to conclusions they can support without fear of contradiction. The Douglass et al paper concludes:

    We have tested the proposition that greenhouse model simulations and trend observations can be reconciled. Our conclusion is that the present evidence, with the application of a robust statistical test, supports rejection of this proposition.

    I would have preferred the conclusion read something like this:

    After testing the proposition that greenhouse model simulations and trend observations can be reconciled, we conclude the 22 models we tested have mixed results. Using data currently available and considered appropriate, 19 of the 22 models failed the test and 3 of the models passed.

    Certainly the paper would still come under attack from RC, but that says more about RC than it does about the science.

  172. bender
    Posted May 3, 2008 at 8:19 AM | Permalink

    #169
    Agreed. There are biases, and then there are biases that are so large as to invalidate the model. Some of the models tested by Douglass et al. are biased. Others are so biased as to be wholly inconsistent with the data, and therefore worthy of “rejection” (a word that must be used cautiously).

    On model “rejection”. What the more reasonable alarmists object to is the wholesale *dismissal* of an entire model, when it may be just a few parameters or equations that are off. To a scientist, “rejection” means “back to the drawing board”, not “off with your head”. Instead of screeching that the Douglass et al. analysis is incorrect, the defendants, beaker and Jon, should be isolating the problem within those models that fail. That is the productive path that Judith Curry tried to put us on, speaking of faulty vertical convection parameterizations. But then Jon drags us back to idiotville.

  173. Ron Cram
    Posted May 3, 2008 at 9:28 AM | Permalink

    bender,

    The models, even the ones that failed the test, have some value in that they teach us what we do not know about the climate. No matter how much better the models eventually become, I am not convinced they will ever have any predictive value. Scott Armstrong is one of the leading thinkers in scientific forecasts and he points out that climate scientists have not even taken the first step to prove climate is predictable 100 years into the future.

  174. bender
    Posted May 3, 2008 at 9:33 AM | Permalink

    #173
    You are committing the same crime as Jon of failing to disaggregate model components. 100-year responses to GHG forcing are deterministic and predictable (with large error bounds). Internal variability is the problem, and it emerges long before the 100-year horizon comes up.

  175. Cliff Huston
    Posted May 3, 2008 at 9:36 AM | Permalink

    I’m still trying to understand the RC/Beaker view on testing consistency of models vs. observations. I would like to try to describe the way I view the problem and hopefully those with stronger statistics background can tell me what’s wrong with my picture.

    Starting with just the models, the IPCC says that the best representation of the climate signal is the average of the outputs of the ensemble of GCMs. The ensemble can be viewed as a collection of instruments, with unknown errors (noise, bias, etc.), whose individual outputs, when averaged, largely reduce the unknown errors and produce a good signal.

    Given that, the first thing of interest is to test the multi-model average against the individual models to see if any of the models are outliers, that is, to see if any of the models are inconsistent with the multi-model average. As Beaker has pointed out many times, the correct test is to use two standard deviations as the limit for model data that is consistent with the multi-model average. To my mind, the usefulness of this test is to identify and eliminate any of the individual models that are inconsistent with the multi-model average or, at a minimum, to gain an understanding that the multi-model average is potentially biased by the outliers.

    At this point, I have an understanding of the nature of the signal represented by the multi-model average. Now I would like to know the limit of significant values that this signal can have so I can compare it against some other signal. I can’t use two standard deviations as above, because that measure includes errors that cancel in the average that represents the signal. What I can use is two standard errors as the limit, because that measure accounts for the averaging effects. I believe this is correct because I’m no longer concerned with nature of pieces that make up the signal, but rather the nature of the average signal.

    Now I can test to see if the multi-model signal is consistent with the observations by looking for overlap between multi-model signal two standard error limits and the observation error bars. This can be done against each observation individually or I could form a multi-observation average, as with the models, and look for overlap with the multi-observation average error bars.
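    To make those two steps concrete, here is a rough sketch in Python with made-up trend numbers (I am only illustrating the procedure as I understand it, not reproducing the paper): step 1 screens individual models against the multi-model mean using 2 standard deviations; step 2 compares the multi-model mean, with 2 standard error limits, against an observation that has its own error bar.

    import numpy as np

    # Made-up model trends at one pressure level (K/decade); not the actual ensemble.
    model_trends = np.array([0.18, 0.22, 0.25, 0.30, 0.12, 0.27, 0.21, 0.24])
    obs_trend, obs_err = 0.06, 0.05   # hypothetical observation and its stated uncertainty

    mean = model_trends.mean()
    sd = model_trends.std(ddof=1)          # spread of the individual models
    se = sd / np.sqrt(len(model_trends))   # uncertainty of the multi-model mean

    # Step 1: flag individual models lying outside mean +/- 2 SD (outlier screen).
    outliers = [t for t in model_trends if abs(t - mean) > 2 * sd]
    print("model mean %.3f, SD %.3f, SE %.3f, outliers: %s" % (mean, sd, se, outliers))

    # Step 2: test the multi-model signal against the observation by checking
    # whether the mean +/- 2 SE band overlaps the observation's error bar.
    overlap = (mean - 2 * se) <= (obs_trend + obs_err) and (obs_trend - obs_err) <= (mean + 2 * se)
    print("multi-model signal consistent with observation (overlap test):", overlap)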

    The problem I see with the RC/Beaker approach to testing consistency is that they want to treat the observations as though they were just another model added to the multi-model set and apply the two-standard-deviation limit to see if the observations are outliers. This makes no sense to me, because the observations are not part of the multi-model average; they are separate measures. It makes as much sense as taking Willis’ frog and moose and concluding that they are the same animal because they share a common gene. The multi-model signal is not the same thing as the observation signal, and the point is to compare those separate signals for consistency. Throwing the component parts of the multi-model signal into a blender with the observation signal and declaring consistency because they share a value is, IMHO, just plain silly.

    So there you have my view and I would greatly appreciate any corrections where I have it wrong.

    Cliff

  176. Michael Smith
    Posted May 3, 2008 at 9:42 AM | Permalink

    Douglass noted that some of the 22 models do not agree with the surface trends. Others do not show any significant amplification of troposphere warming relative to the surface. But rather than throw those out, he argued that one should work with the average and not the range as a measure of model spread.

    Models 2 and 22, as Willis pointed out, are outliers in the sense that they show insignificant amplification of warming in the troposphere. Models 8 and 12 show poor agreement with the surface record. Surely these models do not qualify as “AGW models” since they don’t support AGW theory.

    So I eliminated those 4 models, recalculated the -2SD lower limit (that’s the thicker pink line on the chart) and re-plotted the data below. Sorry that my chart is not as readable as the ones by Willis. (The data points to the left are the T2LT and T2 observations that I believe Douglass shows on the right hand side of his graph — I couldn’t get them to display that way.)

    By my count, 27 of the 34 observations above the surface fall outside the new -2SD lower limit. I would say that Bender is right when he observes that RealClimate’s response was truthful, but not the whole truth. The whole truth seems to be that only the inclusion of models that don’t agree with the surface record and models that don’t fit AGW theory about amplified troposphere warming saves “the models” from being falsified by most of the data.

  177. Kenneth Fritsch
    Posted May 3, 2008 at 10:05 AM | Permalink

    Re: #171

    After testing the proposition that greenhouse model simulations and trend observations can be reconciled, we conclude the 22 models we tested have mixed results. Using data currently available and considered appropriate, 19 of the 22 models failed the test and 3 of the models passed.

    I would like to see both the naive (see Willis E in Post #16 above) and more complete statistical analysis/tests of those three models that would lead to this conclusion.

    I have a major problem with using a climate model output for a region, comparing it to the observed data for that region, and then failing to test how well that model performs in other regions of the globe against observations. I think Douglass makes that error here, as I assume the criteria he used to eliminate or qualify the models were restricted to how they performed in the tropics. Fitting a model to a region, restricting evaluation to that region, and/or selecting from global models based only on how they perform in that region while ignoring how they perform in other regions provides methods rife for overfitting and data snooping.

    I have assumed that the term GHG fingerprint of climate models, for tropical surface to troposphere warming trends, had to do with the observation that all models show the same biases regardless of how well they agree in other regions of the globe. Claims for climate model skill appear to be more or less relegated to a global mean temperature trend, with even the more ardent supporters conceding the models have much less skill in getting the regional climates correct. Anyway, I can see why the Douglass paper used the mean of the ensemble of models to test the “fingerprint” against the observed data. When Douglass starts tossing out models without more detailed and comprehensive analyses and suggests that we all do it using our own criteria, he has gone a step or two too far.

    When Judith Curry starts tossing out models that do not pass this test (as she states below) on a regional basis, I would want to know much more about how these models perform in other regions. Climate models that might be constructed to look only at a certain region (if that happens) present an even more complicated situation for statistical testing.

    This is particularly true if you are using the IPCC results in some sort of regional study. The simulations should pass some simple observational tests: a credible mean value, a credible annual cycle, appropriate magnitude of interannual variability. Toss out the models that don’t pass this test, and look at the projections from those that do pass the test.

  178. Kenneth Fritsch
    Posted May 3, 2008 at 10:31 AM | Permalink

    Re: #175

    Cliff, what I have posted here and on the other related thread is in essential agreement with what you said in post #175. I believe that Beaker is not (no longer?) claiming that the SD approach is a good one. Rather, he has repeated several times that he wants to take the use of the word “inconsistent” by the authors (which in the context of the paper was almost certainly used in the general sense) strictly in a statistical sense (more about this later). To make that reading work, the observed data have to be treated as just another model output (which makes no sense to anybody), with the SD then used to determine whether the observed data are consistent with the models.

    The authors might have been more cautious in their terminology, to avoid astute and statistically informed readers drawing attention away from the authors’ results and into a statistical/semantics battle.

  179. Patrick M.
    Posted May 3, 2008 at 10:39 AM | Permalink

    re #176 (Michael Smith):

    I think you are on the right track. But rather than picking which models to remove subjectively, you might want to show that certain models are necessary to pass the inconsistency test and then discuss why those models should be removed. It’s only a subtle difference, but it might help avoid the inevitable RC argument that you are cherry picking (that may be why Douglass didn’t remove any models). It will also shift the burden onto the RC types as to why those models should be included. I don’t think they will be looking forward to defending the inclusion of the “outlier” models.

  180. Steve McIntyre
    Posted May 3, 2008 at 11:54 AM | Permalink

    #154. I’ve spent some time in the last day or two looking at radiosonde data. Perhaps I’ll report on this some time, but my initial impression is that anyone who has concerns about biases and inhomogeneity in surface temperature trends should not regard the radiosonde data as engraved in stone merely because they like the story. Do we really know the trend in tropical troposphere temperatures from radiosonde data? On a first look, that’s not clear to me. IF I had a model whose only flaw was a discrepancy with radiosonde trends in the tropics, would I be cross-examining the radiosonde data for potential inhomogeneity rather than throwing out the models? I suspect that the uncertainties in the radiosonde data may be very large – perhaps significantly larger than presently presented in the radiosonde world.

  181. Andrew
    Posted May 3, 2008 at 12:07 PM | Permalink

    I have yet to see one person denounce Singer’s obviously unsupportable claims regarding this paper.

    Done. Shame on the old guy. Now please move on.

  182. Cliff Huston
    Posted May 3, 2008 at 12:42 PM | Permalink

    RE: 178 Kenneth Fritsch

    Willis asked Beaker for a citation for his insistent statement that ‘inconsistent’ in a statistical sense can only be used with standard deviation. Beaker’s only response (Tropical Troposphere #336) was to reference a paper where ‘inconsistent’ was used in that context. Not what I would call a conclusive citation. What I would call inconsistent would be any statistical relationship with a low probability (pick your number).

    What I have a problem with is when probability is enhanced by adding back error that has been removed by averaging. In effect, Beaker’s test for inconsistency is testing against the question ‘Are any of the ensemble models, at any point, consistent with the observed measurements?’ not ‘Is the multi-model signal consistent with the observed measurements?’ What Beaker seems to be saying, to me, is that Douglass et al used the word ‘inconsistent’ and as a consequence they should have used standard deviation, to test a question they did not ask.

    Cliff

  183. steven mosher
    Posted May 3, 2008 at 12:46 PM | Permalink

    re 180. More uncertainty. just what bender asked for. Looking forward to the post

  184. Michael Smith
    Posted May 3, 2008 at 1:47 PM | Permalink

    Patrick in 179 wrote:

    But rather than picking which models to remove subjectively, you might want to show that certain models are necessary to pass the inconsistency test and then discuss why those models should be removed.

    I agree completely; there needs to be an objective criterion. As a starting point, the criterion that Douglass was shooting for was to evaluate the upper-air predictions of models that are consistent with the surface record:

    We are comparing the best possible estimate of model-produced upper-air trends that are consistent with the magnitude of the observed surface trend.

    Since the mean of all the models met this criterion, Douglass decided to work with that mean and leave all models in. I was curious to see what the results would be if we excluded the models that fail this criterion.

    So how do we decide which models are consistent with the observed surface trend?

    Douglass gives us 5 surface trend observations, as follows:

    Dataset      Value
    ---------    -----
    HadCRUT2v      124
    GHCN           129
    GISS           126
    IGRA           176
    RATPAC         123

    Mean           136
    S.D.            23
    +2 S.D.        181
    -2 S.D.         90

    (Hope that table is readable. It doesn’t look good in preview.)
    I don’t know if that’s a valid criterion for screening for consistency or not; 5 is an awfully small sample size. But it’s what I had. It was based on this lower limit of 90 that I excluded models 8 and 12 on the last chart.

    However, to be consistent, I decided to exclude ALL models that do not fall within the 90 – 181 range — not just the ones on the lower end.

    Only 11 of the 22 models meet that criteria. So, here are the recalculated mean and -2SD lower limit of those models whose surface trend prediction is between 90 and 181, versus the observations. (Model mean is the broad red line, lower limit is the black one.)

    I now count 30 out of 34 observations outside the lower limit.
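    For anyone who wants to check the arithmetic, here is the screening band computed from those five surface trends (Python; I am assuming the tabled values are in thousandths of a degree per decade, and the example model trends at the end are placeholders, not the actual 22):

    import numpy as np

    # Observed surface trends from the five data sets listed above
    # (values as tabled; they appear to be in units of 10^-3 K/decade).
    surface_obs = {"HadCRUT2v": 124, "GHCN": 129, "GISS": 126, "IGRA": 176, "RATPAC": 123}

    vals = np.array(list(surface_obs.values()), dtype=float)
    mean, sd = vals.mean(), vals.std(ddof=1)        # sample SD, n = 5
    lo, hi = mean - 2 * sd, mean + 2 * sd
    print("mean %.0f, SD %.0f, band %.0f to %.0f" % (mean, sd, lo, hi))
    # -> mean 136, SD 23, band 90 to 181, matching the table above

    def passes_surface_screen(model_surface_trend):
        """Keep a model only if its surface trend falls inside the +/-2 SD band."""
        return lo <= model_surface_trend <= hi

    # Example with hypothetical model surface trends (not the actual 22 values):
    print([passes_surface_screen(t) for t in (70, 120, 160, 200)])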

  185. Jon
    Posted May 3, 2008 at 1:54 PM | Permalink

    bender says:

    And, Jon, Raven has a question for you in #2.

    2: CCSM3
    8: ECHAM5
    12: MIROC3.2 medres

    It depends on what assumptions they’re run under, obviously, but averaged under A1B emissions, for example, CCSM3 is ~+2.8C, and ECHAM5 and MIROC3.2 medres are around +3.4C for end of century. Under A2, CCSM3 is +4C pretty much on the nose while the other two are just under.

    [Prediction: dodge.]

    I’d say your prediction seems to be… inconsistent with observations, wouldn’t you?

  186. Posted May 3, 2008 at 1:54 PM | Permalink

    Beaker says:

    I don’t understand where the idea comes from that you can arbitrarily increase the error bars to account for any observation you want. You can’t, the error bars on the prediction of the ensemble is decided by the spread of the ensemble. The error bars are determined by the systematic and stochastic uncertainties inherent in this approach to modelling. They show the range of values the ensemble considers to be plausible. If an observation lies within the spread of the ensemble, then how can it be inconsistent with the ensemble?

    It is not hard to do. You keep widening the error bars until they encompass the observations, and the confidence you can place in the resulting claim of consistency drops toward zero. With zero confidence anything is possible.
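    To put a number on it (a toy calculation assuming normal error bars): if the observation sits k standard deviations from the ensemble mean, there is always a coverage level short of 100% whose interval reaches it; the farther out the observation, the emptier the resulting claim of ‘consistency’.

    from scipy.stats import norm

    def coverage_needed(k_sigma):
        """Smallest two-sided coverage whose interval (mean +/- z*SD) reaches a point
        k_sigma standard deviations from the mean (normal errors assumed)."""
        return 2 * norm.cdf(k_sigma) - 1

    for k in (1.0, 2.0, 3.0, 4.0):
        print("observation %.1f SD away -> need %.4f coverage to call it 'consistent'"
              % (k, coverage_needed(k)))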

  187. Kenneth Fritsch
    Posted May 3, 2008 at 1:54 PM | Permalink

    Re: #165

    1. The Douglass et al paper is fair game to be audited just like any other. As such it should be judged on its success or failure to achieve what it claims to.
    2. It claimed to show inconsistency when it did not. Whether or not this was an error in understanding the language used or a genuine but mistaken belief that it did, the result is a failure to properly demonstrate the claim.

    Interesting, Jon, that while I disagree with the conclusion on the paper that you continue to repeat, an audit so sensitive to the term inconsistent would have missed its consistent use in the AR4, as noted at CA by Steve M.

    By the way I looked for a definition of consistency and inconsistency in statistics books, a statistics glossary and published papers and came up with the following:

    Click to access 07Prescher.pdf

    A statistical estimator attempts to guess an unknown probability distribution by analyzing a sample from this distribution. One desirable property of an estimator is that its guess is increasingly likely to get arbitrarily close to the actual distribution as the sample size increases. This property is called consistency.

    Click to access fukumizu07a.pdf

    As in many statistical methods, the target functions defined in the population case are in practice estimated from a finite sample. Thus, the convergence of the estimated functions to the population functions with increasing sample size is very important to justify the method. Since the goal of kernel CCA is to estimate a pair of functions, the convergence should be evaluated in an appropriate functional norm; we thus need tools from functional analysis to characterize the type of convergence.

    The purpose of this paper is to rigorously prove the statistical consistency of kernel CCA. In proving the consistency of kernel CCA, we show also the consistency of a pair of functions which may be used as an alternative method for expressing the nonlinear dependence of two variables. The latter method uses the eigenfunctions of a NOrmalized Cross-Covariance Operator, and we call it NOCCO for short.

    http://stats.oecd.org/glossary/detail.asp?ID=5125

    An estimator is called consistent if it converges in probability to its estimand as sample increases (The International Statistical Institute, “The Oxford Dictionary of Statistical Terms”, edited by Yadolah Dodge, Oxford University Press, 2003).

    In other words, a common statistical use of the term consistent could deal with the relationship SEM = SD/n^(1/2), but it would not apply to Beaker’s use of the term inconsistency.

    The statistical glossary linked above did not have a definition for inconsistency but did give the definitions below:

    Consistency error:

    A consistency error refers to the occurrence of the values of two or more data items which do not satisfy some predefined relationship between those data items.

    Consistency check:

    A consistency check detects whether the value of two or more data items are not in contradiction.

    I would have to agree with Cliff that Beaker, or perhaps you, Jon, or anyone here, needs to produce a published definition of the term inconsistency as it would commonly be applied to statistical testing.
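    As a small illustration of the only technical sense I could find (consistency of an estimator, i.e. its error shrinking like SD/n^(1/2) as the sample grows), here is a quick Python check with simulated normal data; this is obviously not the sense in which Beaker is using the word.

    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, true_sd = 0.2, 1.0   # arbitrary "population" values

    # The sample mean is a consistent estimator: its standard error shrinks like SD/sqrt(n).
    for n in (10, 100, 1000, 10000):
        sample = rng.normal(true_mean, true_sd, size=n)
        sem = sample.std(ddof=1) / np.sqrt(n)
        print("n=%5d  estimate=%.4f  SEM=%.4f" % (n, sample.mean(), sem))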

  188. Kenneth Fritsch
    Posted May 3, 2008 at 2:10 PM | Permalink

    Re: #180

    IF I had a model whose only flaw was a discrepancy with radiosonde trends in the tropics, would I be cross-examining the radiosonde data for potential inhomogeneity rather than throwing out the models? I suspect that the uncertainties in the radiosonde data may be very large – perhaps significantly larger than presently presented in the radiosonde world.

    I recall a paper of which John Christy was a coauthor that looked at radiosonde inhomogeneities in doing a comparison of UAH and RSS with radiosondes. They found inhomogeneities (change points in the data series) corresponding to changes in the radiosonde equipment. Since Christy was also a coauthor of the Douglass paper I would assume that at least some of the grosser inhomogeneities were taken into account — but maybe Christy could answer that question.

  189. Kenneth Fritsch
    Posted May 3, 2008 at 2:53 PM | Permalink

    Re: #182

    I went back to the reference that Beaker gave to show the use of the term inconsistent in statistical testing. I searched for the word “inconsistency” and give below all the pertinent references in the paper. I replaced the Greek letter sigma, which the paper used for standard deviation, with SD.

    The notion that inconsistency means a fixed 2 SD threshold just does not come through in these excerpts. Note the use of “moderately inconsistent” at 2.6 and 2.3 SDs, and of dust being “inconsistent with the data at the 94% (1.9 SD)” confidence level. It appears to me that the authors are referencing inconsistency to some changing level of significance and not using inconsistency as a fixed measure of significance.

    These possibilities are moderately inconsistent with the data at the 99.0% (2.6 SD) and 97.7% (2.3SD) confidence levels, respectively.

    This dust is inconsistent with the data at the 94% (1.9 SD) confidence level (although the true inconsistency of this scenario is derived from the product of the two individual likelihoods, i.e., 98% or 2.3 SD).

    The distribution of synthetic EB−I values is asymmetric and implies an uncertainty in the measurement for SN 1999Q of +1SD = 0.11 mag and −1SD = 0.17 mag (including the systematic uncertainties from K-corrections and the J-band zeropoint). The results are consistent with no extragalactic reddening and inconsistent with Galactic and gray dust reddening at nearly identical (though marginally higher) confidence levels as the MLCS fits.

    The data for SN 1999Q are consistent with no reddening by dust, moderately inconsistent with AV =0.3 mag of gray dust (i.e., graphite dust with minimum size > 0.1 μm; Aguirre 1999a,b) and AV =0.3 mag of Galactic-type dust (Savage & Mathis 1979).

    This alternative to an accelerating Universe (see Totani & Kobayashi 1999) is inconsistent with the data at the 99.9% confidence level (3.4 SD).

    Click to access 0001384v1.pdf
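    As a sanity check on how those figures relate (assuming the paper means two-sided normal confidence levels), the quoted SD multiples convert to percentages like this:

    from scipy.stats import norm

    # Two-sided confidence level corresponding to a given number of SDs (normal errors).
    for n_sd in (1.9, 2.3, 2.6, 3.4):
        conf = 2 * norm.cdf(n_sd) - 1
        print("%.1f SD  ->  %.1f%% two-sided confidence" % (n_sd, 100 * conf))
    # 1.9 SD -> ~94%, 2.3 SD -> ~98%, 2.6 SD -> ~99%, 3.4 SD -> ~99.9%,
    # which is in line with the levels quoted in the excerpts above.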

  190. steven mosher
    Posted May 3, 2008 at 5:50 PM | Permalink

    bender is a POV.

    That’s a good one, buddy.

  191. Ron Cram
    Posted May 3, 2008 at 6:08 PM | Permalink

    re: 177
    Kenneth,

    The sample conclusion I proposed for the Douglass et al paper was not intended to validate the three models that passed the test. Certainly a number of other tests could be proposed and they may fail any number of them. In his comments above, Douglass writes that three models passed the tests they applied. The paper should have stated that.

    I dislike using statistics to make claims about “the models” which may unintentionally mislead readers into thinking the conclusion applied to “all” of the models. I think it far better for researchers not to overstate the conclusions that can be reached from their research.

    I think Douglass et al are on the right track and have provided a valuable service. I would like to see a battery of tests applied fairly to all of the models. The purpose would not be to see which of the models would provide the best forecasts for the future, because I do not believe computer models have any predictive value. The goal would be to learn what we don’t know about the way the climate works.

  192. Kenneth Fritsch
    Posted May 3, 2008 at 7:10 PM | Permalink

    Re: #191

    In his comments above, Douglass writes that three models passed the tests they applied. The paper should have stated that.

    I believe that the criteria Douglass set up for application to the models were simply in reply to a call for further analysis of the individual models and were something he thought up more or less on the spot just to demonstrate a point. Something like that should not go into a paper. Remember, he said make up your own criteria.

    Your comment that they passed the test seemed to imply that it was the statistical test using 2 SD, which Willis had shown they had not passed (without including the errors of the replicate runs, which Beaker suggested as a further requirement after the fact). They certainly would not have passed a test using SE limits.

    I hope what Douglass was calling for was a thorough vetting of the models, with his suggested pass/fail criteria as simple starting points. My point would be to make sure the models’ performance in other regions is taken into account.

  193. RomanM
    Posted May 4, 2008 at 7:06 AM | Permalink

    #187 Kenneth

    I would have to agree with Cliff that Beaker, or perhaps you, Jon, or anyone here, needs to produce a published definition of the term inconsistency as it would commonly be applied to statistical testing.

    This isn’t going to happen because NO such definition exists. The only technical definition of consistent in statistics is the one you found concerning the probabilistic convergence of estimators. Beaker admitted as much in #65 where he states

    You won’t find inconsistent defined that was as an individual term, as it has another deeper meaning as a jargon term. However it does have a clear statistical implication , I posted an example of this on the other thread and gave full details of the paper. You can find plenty more by performing a Google Scholar search for “inconsistent with the data”.

    It has no specific, generally understood meaning in the wider statistical community as jargon, nor is there a “clear statistical implication” of its meaning. It means exactly what the authors think it means – the claims about the relative accuracy of the model results are not in relative agreement with the observed values. Criticism that the authors didn’t understand a specific meaning is simply and clearly unfounded.

    Beaker threw out this red herring as part of a justification that the “correct” analysis as accepted by statisticians is to show that the results for EVERY model have to be ridiculously far from the observed tropospheric observations for anyone to be able to claim that the model results are “inconsistent” with reality. In #53, he states

    One way of obfuscating with stats is to use phrases with pre-existing statistical interpretations (such as “inconsistent”) to mean something else, when there is already an existing statistical term with exactly the desired meaning (”biased”). The problem I have with the paper is that it makes a very big claim (inconsistency) but performs a test that can demonstrate only bias, a lesser claim (I have lost count of the number of times I have had to type that ;o). If they want to show the ensemble mean is a poor predictor of the observational data, that is a meaningful test of whether the ensemble is useful (although not the only one), and they have made their case. However bias and inconsistency are not the same thing, and they claim inconsistency.

    Obfuscating, indeed…. Pre-existing or non-existing? Inconsistency is not a single property – it can take many forms. It can be systematic bias. It can be the presence of higher than or lower than expected variability in observations or calculated results. It can appear in the form of the distribution of the data or in the presence of outliers. To say that since there is a well-defined term for “bias”, you don’t need to refer to it as “inconsistency” is simply wrong. In this particular instance, to my knowledge, the climate science community has not made any claims or offered any specific evidence that any single particular model is THE “accurate” result. In the absence of such evidence they use the averages of the models, calculate error bars for those averages and present those as their best guesses. The authors have basically shown that these averages are NOT consistent with the observed behaviors. To properly evaluate the results of the individual models is more difficult given the lack of information about their parameters and properties provided by the modelers.

    As an aside, in #17, beaker states “To test for inconsistency properly, standard practice would be to show that the mean +-2sd error bars of the models don’t overlap with the mean +-2sd error bars of the data.” In other contexts of making multiple comparisons (e.g. Analysis of Variance), comparisons are done by constructing a single error bar for the difference and checking whether zero is in the interval. The test described may be OK for a quick-and-dirty eye-ball check, but the actual probability value for an individual comparison in the normal distribution case would vary from about .005 (when the SDs are equal) to .05 (when one SD is substantially larger or smaller than the other). How can the overall error rate be controlled when multiple comparisons are done? This doesn’t sound right to me.
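    For what it is worth, the .005 to .05 range above can be checked directly; a quick sketch in Python (assuming two independent normal estimates and declaring a difference only when the two ±2SD bars fail to overlap):

    from math import sqrt
    from scipy.stats import norm

    def alpha_nonoverlap(sd1, sd2):
        """Probability, when the true means are equal, that two independent normal
        estimates have non-overlapping +/-2SD bars, i.e. |x1 - x2| > 2*(sd1 + sd2)."""
        return 2 * norm.sf(2 * (sd1 + sd2) / sqrt(sd1 ** 2 + sd2 ** 2))

    print("equal SDs:        %.4f" % alpha_nonoverlap(1.0, 1.0))     # ~0.005
    print("very unequal SDs: %.4f" % alpha_nonoverlap(1.0, 0.001))   # ~0.045, close to 0.05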

  194. Jon
    Posted May 4, 2008 at 8:20 AM | Permalink

    Roman and Kenneth,

    As has been noted by Beaker and myself, whether or not the error was due to intent or accident isn’t really the point. If Douglass et al wish to submit a comment clarifying their goal in a manner non-contingent upon language that has broader implications than they intend, I think everyone involved would be happy, no? I’m also interested to hear your (or others’) thoughts on why some of the authors of this paper chose to parade it about in the media as a definitive refutation of current scientific agreement re: attribution of warming, explicitly in regards to policy and viewed in light of past similar claims regarding other regulatory matters.

  195. Steve McIntyre
    Posted May 4, 2008 at 8:59 AM | Permalink

    IPCC AR4 stated the following:

    Whereas, on monthly and annual time scales, variations of temperature in the tropics at the surface are amplified aloft in both the MMD simulations and observations by consistent amounts, on longer time scales, simulations of differential tropical warming rates between the surface and the free atmosphere are inconsistent with some observational records. One possible explanation for the discrepancies on multiannual but not shorter time scales is that amplification effects are controlled by different physical mechanisms, but a more probable explanation is that some observational records are contaminated by errors that affect their long-term trends.

    Douglass et al stated:

    (1) In all cases, radiosonde trends are inconsistent with model trends, except at the surface. (2) In all cases UAH and RSS satellite trends are inconsistent with model trends. (3) The UMD T2 product trend is consistent with model trends. …

    Summing up, then, there is a plausible case to be made that the observational trends are completely inconsistent with model trends, except at the surface …

    On the whole, the evidence indicates that model trends in the troposphere are very likely inconsistent with observations that indicate that, since 1979, there is no significant long-term amplification factor relative to the surface.

    I don’t see any relevant difference between the claim in Douglass et al and the corresponding statement in IPCC AR4. Beaker stated:

    If Douglass et al. have defined their own meanings of existing statistical concepts, that are greatly at odds with their established use, it is hardly a surprise that their findings are misunderstood.

    As to Beaker’s objection, Douglass et al did not define “their own meanings”. They used the terms in an identical way as IPCC and CCSP and thus did not use the terms in a way that is “greatly at odds with their established use” in climate science. If the use in climate science is different than the use in statistics, that’s a different matter, but one that cannot be laid at the feet of Douglass et al.

  196. RomanM
    Posted May 4, 2008 at 9:35 AM | Permalink

    #194 Jon

    I think that the authors are eminently clear in the language that they use. In my previous post, I thought that I was quite clear in my contention that the language issue was created by some posters ascribing a non-existent “standard” interpretation to the word “inconsistent” in an effort to argue that the ONLY appropriate comparison of the models to the measured temperatures is to reject each model individually.

    What error are you referring to? From looking at the paper, it is reasonably apparent that they are looking at the models as an ensemble and their summary of the model results are the model means. If you disagree with this, please inform the IPCC since their chapter on Climate Models and Their Evaluation in the most recent report ( http://www.ipcc.ch/pdf/assessment-report/ar4/wg1/ar4-wg1-chapter8.pdf ) is just replete with the use of exactly that approach. If they think it is OK for the IPCC to present their evaluations in that manner, why isn’t it appropriate to criticize the models on the basis that those same “meaningful” means are just plain wrong?

    The authors state their criteria for measuring agreement clearly; “Agreement means that an observed value’s stated uncertainty overlaps the 2σSE uncertainty of the models”. I pointed out in my post that I would use the more appropriate comparison of a single error bar for the difference for evaluating the probability value for the comparison – their method is actually conservative for testing at the .05 level of significance for a single comparison.

    As far as why they “chose to parade it around”, I guess it is because it has become the modus operandi in the Climate Science community to give press releases on their latest research results in the tradition initiated by some well-known individuals who have been extensively audited on this web site. Did you complain about them? I personally would like to see less of these types of “science proclamations” in the popular news media, however if it will continue, then at least it is a bit refreshing to see some balance to it.

  197. steven mosher
    Posted May 4, 2008 at 9:49 AM | Permalink

    Jon,

    You write:

    “As has been noted by Beaker and myself, whether or not the error was due to intent or accident isn’t really the point. ”

    snip

    “If Douglass et al wish to submit a comment clarifying their goal in a manner non-contingent upon language that has broader implications than they intend, I think everyone involved would be happy, no? ”

    Happy? Happy +- bemused. Did they intend broader implications? That’s an interesting proposition. How do I test someone’s intention from their words? This is known as the intentional fallacy. Douglass et al applied a method. They showed the data; they explained the method. They opened both the data and the method to review and criticism. And, we note, both the data and the method get the scrutiny such a matter deserves. Good. Contrast that with authors who will not show their data, will not show their methods, and claim that challenges are out of bounds. For example, Phil Jones. So, I am quite happy to join beaker and say that Douglass et al have made claims that do not stand up. We should thank Douglass et al for supplying enough information to question and perhaps refute their claims. Now, turn to Jones. Data? Dr Jones? He won’t give it. Methods? Won’t reveal them. So, a simple question, John (adding the h of course). I agree with beaker. That agreement is a function of beaker being ABLE to audit Douglass. Jones won’t release the data or the methods. Am I wrong to reject Jones’s results merely on the basis that he will not release the data or the methods?

    “I’m also interested to hear your (or others’) thoughts on why some of the authors of this paper chose to parade it about in the media as a definitive refutation of current scientific agreement re: attribution of warming, explicitly in regards to policy and viewed in light of past similar claims regarding other regulatory matters.”

    I would hope that you have more sense than to speculate about intentions. Definitive refutation? Your words, strawgirl argument. Maybe give your argument a sex change operation and upgrade it to a strawman.

    Bluto.

  198. Cliff Huston
    Posted May 4, 2008 at 11:04 AM | Permalink

    Bleaker,

    In #52 above SteveM quotes from IPCC AR4 and notes that the IPCC concept of the model significance is different than Gavin’s (which you seem to share). You did not comment on that post. It would be helpful to this discussion if you would comment on Steve’s observation in #52.

    Further, I would be interested in your opinion on whether the following IPCC and Douglass et al statements are consistent and whether, in both cases, they are talking about a ‘best’ model signal that has reduced error, not a signal that includes the entire ensemble error range.

    From IPCC AR4: (Re-quoted from #52, for convenience)
    “The reason to focus on the multi-model mean is that averages across structurally different models empirically show better large-scale agreement with observations, because individual model biases tend to cancel (see Chapter 8). The expanded use of multi-model ensembles of projections of future climate change therefore provides higher quality and more quantitative climate change information compared to the TAR. (ch 10)” . . .

    “The multi-model averaging serves to filter out biases of individual models and only retains errors that are generally pervasive. There is some evidence that the multi-model mean field is often in better agreement with observations than any of the fields simulated by the individual models (see Section 8.3.1.1.2), …”

    and

    From Douglass et al:
    “We are comparing the best possible estimate of model-produced upper-air trends that are consistent with the magnitude of the observed surface trend. With this pre-condition in place (granted to us by the fact the mean of the modelled surface trends was very close to observations) the upper air comparisons become informative and not confused by one or two model runs which are de facto inconsistent with observed surface trends.”

    Thanks,
    Cliff

  199. Cliff Huston
    Posted May 4, 2008 at 11:07 AM | Permalink

    RE:198 Cliff Huston

    Sorry, Beaker not Bleaker.

    Cliff

  200. Kenneth Fritsch
    Posted May 4, 2008 at 11:09 AM | Permalink

    Re: #194

    Jon, this discussion will go nowhere unless you reply to those of us who have submitted counter-arguments to Beaker’s and RC’s comments and evidence that points to “fatal” errors in both Beaker’s and Gavin Schmidt’s POVs. Restating your previously drawn conclusion without engaging us and supplying ideas and evidence of your own makes it appear as though you are simply taking the merit of Beaker’s and Schmidt’s comments on faith.

  201. Posted May 4, 2008 at 11:49 AM | Permalink

    Re #194,

    I’m also interested to hear your (or others’) thoughts on why some of the authors of this paper chose to parade it about in the media as a definitive refutation of current scientific agreement re: attribution of warming, explicitly in regards to policy and viewed in light of past similar claims regarding other regulatory matters.

    It seems to me that, given the considerable disagreement (not to mention the newly found Childe Cycle, not previously included in the models), the phrase

    a definitive refutation of current scientific agreement

    should be:

    a definitive refutation of current agreement among some scientists

    In fact can we really say anything useful about these models unless we know they included the Childe Cycle?

    With apologies to that Great Scientist Gordon R. Dickson.

  202. Eggplant fan
    Posted May 4, 2008 at 2:17 PM | Permalink

    If I might butt in again, having had a chance to read the Douglass et al. paper in more detail, the cause of the entire statistical controversy is reflected at the beginning of the second page: “The models, free to produce El Ninos at differing times and magnitudes, therefore, yield associated individual trends not directly comparable with observation.” This raises the question: why have the models not been run with the randomness of El Ninos/La Ninas (and any other random elements for which data is available) removed and replaced with the measured data? It would be a simple change in the software, and would allow for a more direct comparison with observation and therefore more complete model validation.

    I’m sure that this has been discussed to death in the past, so a referral to one of those discussions would be most appreciated. (I assume that the runs are not already available or Douglass et al. would have used them.)

  203. Eggplant fan
    Posted May 4, 2008 at 6:31 PM | Permalink

    The page http://www.gisclimatechange.org/faqPage.do includes the statement “CCSM modelers may perform additional runs where they supply ocean observational data, which forces the system to incorporate interactions between ocean and atmosphere and pick out specific events like El Nino.” which indicates that they could do what I suggested in #202. It also includes the statement “Essentially, the earth’s climate can be considered to be a special ensemble that consists of only one member.” That would seem to be consistent with the RC point of view as noted by Steve McIntyre in #52 (although not addressing the problem of averaging across models). It is difficult, then, to tell exactly what the IPCC means by “in better agreement with observations”.

  204. Geoff Sherrington
    Posted May 5, 2008 at 2:54 AM | Permalink

    Re # 203 Eggplant fan

    (although not addressing the problem of averaging across models).

    Averaging across which models? I think that, while you math people have been arguing about deviations and errors, you have failed to realise that you are working on a chosen subset of models from a bigger population of models. Is it not correct that the best mathematics includes the whole of the raw data, not a subjective subset? I mean, if you want to throw dice, you have to include every throw, not just the ones that suit your point.

    What is worse, the models that you are working on have within them sets of variables with their own errors and distributions, which are opaque to you. How can you do correct error analysis when you do not have the raw data?

    Blind Freddie can see that Douglass et al have at least demonstrated cause for concern that directly conflicts with the RC version that the models are in good agreement. The main models that I have seen in good agreement are about 6 feet tall, leggy, willowy, nice teeth, not too buxom – but they have mostly been cherry picked as well, I suspect. Maybe cherry picking can be good fun, even irresistible between consenting adults at times.

  205. beaker
    Posted May 5, 2008 at 4:12 AM | Permalink

    Good morning all, there have been a lot of posts since Friday, and I’ll attempt to address all the comments that seem to be asking for a response from me, but they will have to be brief due to their number. If you feel I haven’t adequately addressed the point, just say so and I’ll have another go.

    Steve McIntyre #93: 16 in your first innings sounds pretty good, four boundaries suggests you have a good eye, it is well above two standard errors of my league batting average (but not two standard deviations ;o).

    Pat Keating #95: I have always thought the frequentist perspective rather odd, as the intuitive interpretation of a probability is not a long run frequency in most cases. A frequentist can give no direct answer to the question “what is the probability that this hypothesis is correct?”, but it is the question that we would most like answered. At least a Bayesian can discuss whether it is more probable than an alternative hypothesis. See also PI’s apposite comments in #97.

    Steve McIntyre #96: A rewording of the point in the paper would indeed be possible; to say that “the ensemble mean is inconsistent with the data” would be a good start. From a statistical point of view it seems like a slightly un-natural way of putting it, but it makes the point that it is only the mean that is shown to be inconsistent, rather than the models themselves.

    Phil #98: I was playing rather than watching (Saturday and Sunday and paying for it now, I ache all over ;o). I am also looking forward to the ENG-NZ series.

    Kenneth Fritsch #102

    I can only ponder why Beaker who has statistical insights to offer here keeps going back to the word “inconsistent” used in the Douglas paper.

    Because my statistical insight is that there is a big difference between “the models are inconsistent with the data” and “the ensemble mean is inconsistent with the data” (see my comment on #96). Douglass et al. claim the former, but only establish the latter. As for pondering motives, I believe a scientist should always assume good motives and address the argument directly rather than question the source. I am happy with the idea that the problem with the Douglass paper is unintentional, and that they did not appreciate that their wording encourages misinterpretation.

    Cliff Huston #104: Read the summary of Douglass et al.: they make no mention of the ensemble mean, but only talk of model trends.

    N Taylor #107: No, it is not only a philosophy issue. There is a real difference between “the models are inconsistent with the data” and “the ensemble mean is inconsistent with the data” (thanks Steve for the compromise wording!). Bayesian stats has very sound theoretical foundations, see the works of RT Cox, Harold Jeffreys and ET Jaynes. See also the (again) apposite comments from PI in #112.

    As it happens, I do a lot of reviewing of medical papers based on the particular statistical methods with which I am most familiar, and I see many papers with very poor statistical evaluation. It is a field where a blog systematically auditing problematic papers would be very valuable. Having seen the stats in some cardiology papers it is a great spur to exercise ;o).

    Pat Keating #109

    the language Beaker used (based on “the Baysian perspective”) stated that the observational data was “implausible”, a truly topsy-turvy viewpoint.

    Frequentist statisticians do much the same thing in maximum likelihood based approaches; it isn’t a solely Bayesian view. However, it is clear that I need to adjust my language to be a little less statistical in order for my arguments to be understood!

    Steve McIntyre #110: I too am keen to reach an agreement on the issue. The compromise wording that you suggested in #96 seems helpful. The issue is the difference between the ensemble mean being inconsistent and the models or the ensemble being inconsistent.

    Cliff Huston #111: There is a difference between the inconsistency of the ensemble mean and the inconsistency of the ensemble itself. Do the IPCC recommend the use of the ensemble mean without error bars? If so that is a practice that needs auditing and adds value to the findings of Douglass et al (although it is only implied). If the IPCC suggest the mean with the +-2sd error bars, then it is wrong to apply the +-2se test.

    Jon #114: The SEPP press release goes a long way to encouraging the (mis)interpretation of the paper exemplified by the RC article. If the model ensemble itself were shown to be inconsistent, that might be a reasonable basis for the argument that AGW arguments based on the models were unsustainable, but not “inconsistency” in the +-2SE sense established by Douglass et al. However, even then it would not be evidence that AGW wasn’t happening, just that arguments should not be based on the models alone (but caveat emptor – I am not a climatologist!).

    Lance #115: If the models had failed the +-2SD they would have been “slain” (at the 95% level of confidence), you acknowledge that they haven’t been “slain” by Douglass et al. That shows that the difference is not merely semantic, but substantial.

    Bob B #116: Do the models claim to predict future temperature with any certainty? The +-2SD error bars give an indication of the certainty, and if they are wide then they are not claiming certainty. That is why IMHO the RC response makes a much better job of highlighting the problem the models have with tropical trends: the error bars are very wide and the data only overlap on the fringes. All predictions should have error bars and they should be used, not ignored.

    Willis Eschenbach #117: The +-2SD error bars are very likely to be wider than the range of the data if you only have four data points. Why not just plot them and see? But first, you need to include the stochastic uncertainty by averaging over all runs and use the original data without normalising to the surface trend. The variation at the surface is part of the uncertainty of the ensemble, so it is inappropriate to arbitrarily delete it. This means you still have not performed the test properly.

    NTaylor #122: For statistical validation of statistical methods, the thing that comes closest would probably be “computational learning theory”, which covers the analysis of learning algorithms (statistical models); see e.g. VC (Vapnik-Chervonenkis) theory and the PAC (probably approximately correct) framework. Saying that frequentist and Bayesian approaches have similarly strong theoretical foundations seems a good summary. However, the Bayesian definition of a probability is closer to the intuitive interpretation, and is consistent (in the usual sense ;o) with the frequentist one when a long run frequency is meaningful. Frequentist methods are preferable though where formal experimental design is possible (but I don’t generally work on such problems, so that is just my opinion).

    Bender #125:

    Don’t let beaker off the hook yet.
    1. He has not explained why he thinks there is a bias in some of the GCMs.
    2. He has not expained if tropical tropospheric warming is or is not part of the AGW fingerprint.
    3. He hasn’t explained why GMT has flattened since 1998.
    He’s half-answering all the easy questions, dodging the important ones.

    I was quite amused by this post, it should be obvious that I am a statistician, not a climatologist and not a climate modeller. Asking me for authoritative answers to any of those questions is a bit like hiring a concert pianist in the expectation that he can fix the transmission in your car! ;o)

    I’ll have a go at 3 (but I am just repeating arguments given elsewhere, with which I am sure you will be familiar; however they seem reasonable to me). The magnitude of any AGW trend is small in relation to the short term variability due to e.g. ENSO, so over a ten year window the short term variability can obscure the true trend (especially if you bias it by arbitrarily choosing to start at a conspicuous peak; just a guess, but maybe you would get a different result if you started the window in 1999?).

    Bender #134: If you look, you will find that your recipe is equivalent to the G^2CM idea I mentioned in #13. The ensemble we implicitly have is a small, ad-hoc implementation of the same basic approach. I am not sure why the ratio you give is an indication of whether the models are “crap”, but the methodology is much closer to the RC article than the Douglass paper (which explicitly ignores these uncertainties by concentrating only on the ensemble mean).

    kuhnkat #139: It is not so much the word “inconsistent” as its use to describe the results of a hypothesis test, to make phrases like “the models are inconsistent with the data” when the test only shows the ensemble mean is “inconsistent” with the data (thanks again Steve). Dr (Prof?) Douglass has already sent in one clarification in response to criticisms raised; my criticism could easily be resolved in a similar manner, and I think this would be beneficial. The main advantage of the ensemble is that it gives an indication of the uncertainties involved in the prediction around the expected value.

    Steve McIntyre: If the word “inconsistent” is used to describe the results of a statistical hypothesis test, it is reasonable to interpret it in a statistical manner; if it is in a report making qualitative statements, it is reasonable to interpret it in a more colloquial way. Anyway, your compromise wording clarifies the point rather better: in a statistical sense, “the models are inconsistent with the data” and “the model means are inconsistent with the data” are very different. Also, I am more comfortable with loose statements that unintentionally overplay factors pointing out the limitations of the authors’ arguments than with those that unintentionally overplay the strength of their argument, however not all may agree with that.

    Joe Crawford #163: You are quite right, I should have said “If a number of groups independently come up with similar models, it is an indication that a particular set of assumptions is [considered] more plausible than others [by the climate modelling community]”.

    Craig Loehle #170: That is why the RealClimate article did a better job of pointing out the problems with models than Douglass et al. If the observations are within the error bars then the data are plausible given the assumptions of the models, in which case the models, while not being shown to be right, have not been shown to be wrong either. As I said, consistency and usefulness are not the same thing (even unbiasedness is not a guarantee of usefulness).

    Ron Cram #171: A very good point. If you need to make a claim that is likely to be contentious, it is a good idea to adopt the strategy used in chess and make the move that minimises your opponent’s maximum gain. It is true that you then proceed only via small steps, but they have the advantage of being in a uniformly forward direction!

    Patrick M #179: The reason that all of the models should be in the ensemble is that all of them are needed to reflect the systematic and stochastic uncertainties that affect their predictions. By taking models away, you make the ensemble error bars look as if there is more confidence in the model output than is actually warranted.

    Steve McIntyre #180: IIRC the Santer paper referenced in Douglass et al draws the conclusion that the “inconsistency” is due to problems with the data rather than the models. So we have the modellers (and the RAOBCORE group) saying it is a problem with the data, and the data people (I believe that is one of Prof. Christy’s specialities at least) blaming the models. So who do we believe? Well I for one am in no position to decide, so I will adopt a true Bayesian stance and assume there is some truth in both positions (i.e. keep an open mind). ;o)

    Cliff Huston #182: The question answered by the RC +-2SD test is “are the observations plausible given the systematic and stochastic uncertainties involved?”. That is the relevant question for falsification of the models that might invalidate AGW arguments based solely on the models. The problem is a disconnect between the test they performed and the summarisation of the results.

    Kenneth Fritsch #187: I did say that there is a deeper meaning to “inconsistent” as a term (rather than in its use in the phrase “the models are inconsistent with the data”), and you have found it. It isn’t relevant to this discussion in any way.

    Kenneth Fritsch #189: You are missing the point: in the paper I referenced, the word “inconsistent” is used with respect to tests based on the standard deviation, not the standard error of the mean; that is the issue. I have already explained that the results of a statistical test are given at a confidence level that dictates the number of standard deviations/SEMs that are used, but that is not the point I am making.

    Steve McIntyre: Do the IPCC reports make these statements informally, or are they describing the results of specific hypothesis tests? It is the context that decides whether a statistical or informal interpretation is required. It seems to me that the IPCC are being overly cautious in their statement (although the models are not far from being inconsistent in the statistical sense anyway).

    Cliff Huston #198: I am looking into the IPCC reasons for the use of the ensemble mean, as a result of Steve’s post, but I would prefer not to comment directly until I have read up on the subject to make sure I understand the relevant issues. If they are saying that you should use the ensemble mean in isolation, or in conjunction with SEM error bars, then an audit is certainly warranted. If they suggest the use of the mean with SD based error bars, then that seems reasonable (but not as good as using the ensemble itself). However, if they say to use the SD error bars, it is even more unreasonable to use the SEM in validation exercises.

    Well that is more than enough typing for the moment, I’ll be back with some more substantive posts later.

  206. beaker
    Posted May 5, 2008 at 4:16 AM | Permalink

    Geoff Sherrington says:

    Blind Freddie can see that Douglass et al have at least demonstrated cause for concern that directly conflicts with the RC version that the models are in good agreement.

    Where have RC said that the models are in good agreement with the tropical trends?

  207. beaker
    Posted May 5, 2008 at 5:10 AM | Permalink

    Having started to read up on the IPCC’s reasons for the use of the ensemble mean (which also are possibly a fair target for an audit) I am now fairly sure that the statistical test performed in Douglass et al. is meaningless (although the point they seem to be trying to make with it, that the models do not adequately reproduce the observations, seems perfectly valid). My reasoning is as follows (apologies if I have misused climate modelling terms, as is quite likely; I hope my point can still be discerned):

    The ensemble consists of a variety of models, which are intended to represent the systematic uncertainties, i.e. the modellers’ uncertainties over the detailed physics governing the climate. Some models are represented by more than one run, such that the spread of runs for a model attempts to cover the stochastic uncertainty (i.e. the variability that is observed from a chaotic system depending on the initial conditions). So what does the mean of a multi-model, multi-run ensemble actually mean? As far as I can see, averaging over models and over runs attempts to average out the systematic and stochastic uncertainties, leaving you with an estimate of only the “forced component” of the climate, after the effects of the initial conditions have been averaged out.

    Now let’s consider the observations. We can consider the Earth as being a climate simulator where the systematic uncertainty is zero (the physics is of course exactly right), but we have only a single run (the observed reality). The observed climate therefore does depend on a single set of initial conditions.

    This means that even if the ensemble mean met its design brief perfectly, you would not expect it to exactly match the observed data. The ensemble mean aims to model the mean climate, averaged over all plausible initial conditions, whereas the observed climate is the climate observed for one particular set of initial conditions. There is no reason to expect them to be identical.

    Why does this matter? Because even if the ensemble mean gave a perfect representation of the mean climate, there would be a point where the ensemble was large enough to fail the 2SE test, even if it were correct. That is clearly absurd, which suggests that it is the wrong test; basically it is not a like-for-like comparison: the ensemble mean represents only the forced component, while the observed data represent both the forced component and a component due to the initial conditions.
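
    A minimal simulation sketch of this point, with entirely made-up numbers (a hypothetical forced trend, a hypothetical model spread, and a hypothetical internal variability for the single observed realisation; nothing here is taken from the paper or the model archive), shows how the +-2SE criterion comes to reject even a “perfect” ensemble once the ensemble is large enough:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    forced_trend = 0.20    # hypothetical forced component (K/decade)
    sigma_internal = 0.10  # hypothetical internal variability of a single realisation
    sigma_models = 0.10    # hypothetical spread of a "perfect" ensemble about the forced trend

    # One Earth: a single draw of the forced component plus internal variability.
    observed = forced_trend + rng.normal(0.0, sigma_internal)

    for n in (5, 22, 100, 1000):
        trials, failures = 2000, 0
        for _ in range(trials):
            ensemble = forced_trend + rng.normal(0.0, sigma_models, n)
            se = ensemble.std(ddof=1) / np.sqrt(n)
            if abs(observed - ensemble.mean()) > 2.0 * se:  # the +-2SE "inconsistency" criterion
                failures += 1
        print(f"n = {n:4d}: rejected in {failures / trials:.0%} of trials")

    # As n grows the standard error shrinks towards zero, while the gap between the
    # single observed realisation and the forced trend does not, so the rejection
    # rate climbs towards 100% even though the ensemble is "perfect" by construction.
    ```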

    What the test does tell us is that the true mean trend of some implicit distribution of models differs from the observed trend, and that we can be 95% sure that the observed difference can’t be explained as an artifact of the particular sample of models forming this particular ensemble. The strongest statement about the models that can be justified by this is that, for the models to give a good estimate of the forced component, we would require that the difference can be explained by the variability of the climate due to the initial conditions. Can we make arguments for the plausible magnitude of this variability without recourse to the models (and therefore a circular argument)? I don’t know.

    On the other hand, it is clearly interesting that the models only pass the conventional consistency test with a D- (as Patrick M puts it in #88), even taking more of the uncertainties into account. That seems to me to be a much more effective criticism of the models (or potentially the data, c.f. the Santer paper cited by Douglass et al.).

  208. Michael Smith
    Posted May 5, 2008 at 5:11 AM | Permalink

    beaker asked:

    Where have RC said that the models are in good agreement with the tropical trends?

    From RealClimate’s response:

    To be sure, this isn’t a demonstration that the tropical trends in the model simulations or the data are perfectly matched – there remain multiple issues with moist convection parameterisations, the Madden-Julian oscillation, ENSO, the ‘double ITCZ’ problem, biases, drifts etc. Nor does it show that RAOBCORE v1.4 is necessarily better than v1.2. But it is a demonstration that there is no clear model-data discrepancy in tropical tropospheric trends once you take the systematic uncertainties in data and models seriously.

    (Emphasis in original)

    Observe the use of the term “perfectly” in that first sentence. It serves to imply that the mismatches which do exist are trivial. Imagine how different that sentence would be if they had used the expression “well matched” instead of “perfectly matched”.

  209. beaker
    Posted May 5, 2008 at 5:19 AM | Permalink

    Another thought experiment that makes me question the value of the +-2SE test. Consider an ensemble that was in good subjective agreement with the observed trends, say the difference was of the order of 0.05 K/decade, but that the ensemble was large enough that the observations are only just within the +-2SE error bars, so it only just passes the Douglass test. Now we add two new models that straddle the existing mean in such a way that both the mean and standard deviation of the ensemble remain the same. However, because of the 1/sqrt(n) factor, the ensemble now fails the +-2SE test! Can you justify how it could be reasonable for an ensemble to go from being “consistent with the data” to being “inconsistent with the data”, without the mean or standard deviation having changed?
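
    A minimal numerical sketch of this thought experiment, with invented trend values (the consistent_2se helper below is just an illustrative implementation of the +-2SE criterion as described here, not code from the paper): placing the two extra models at the ensemble mean plus and minus one sample standard deviation is exactly the choice that leaves both the mean and the (ddof=1) sample SD unchanged, yet the verdict flips.

    ```python
    import numpy as np

    def consistent_2se(ensemble, observed):
        """True if the observation lies within +-2 standard errors of the ensemble mean."""
        se = np.std(ensemble, ddof=1) / np.sqrt(len(ensemble))
        return abs(observed - np.mean(ensemble)) <= 2.0 * se

    observed = 0.12                                      # hypothetical observed trend (K/decade)
    ensemble = np.array([0.05, 0.15, 0.25, 0.35, 0.45])  # hypothetical model trends
    m, s = ensemble.mean(), ensemble.std(ddof=1)
    print(consistent_2se(ensemble, observed))            # True: just passes

    # Two extra models at m - s and m + s leave the mean and the sample SD untouched,
    # but n rises from 5 to 7, so the standard error shrinks by a factor of sqrt(5/7).
    bigger = np.append(ensemble, [m - s, m + s])
    print(np.isclose(bigger.mean(), m), np.isclose(bigger.std(ddof=1), s))  # True True
    print(consistent_2se(bigger, observed))              # False: now "inconsistent"
    ```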

    Again, the aim of the Douglass paper seems reasonable (if not that novel), but the test they use does not make the point.

  210. beaker
    Posted May 5, 2008 at 5:30 AM | Permalink

    Michael Smith #208: Funny, I read that as deliberate understatement as a mild form of humour (a sort of litotes?). Given that he gives a long list of problems with the models, I can’t see it as a statement that he thinks the models are in good agreement with the data. He should perhaps have said “inconsistency” rather than “discrepancy”, however the difference (discrepancy) can be explained by the explicitly stated systematic and stochastic uncertainties.

  211. Michael Smith
    Posted May 5, 2008 at 6:26 AM | Permalink

    Can you justify how it could be reasonable for an ensemble to go from being “consistent with the data” to being “inconsistent with the data”, without the mean or standard deviation having changed?

    As a layman, my interpretation would be that as a result of the additional information gained by adding the two new models, we are now that much more certain of the true value of the mean. Hence, the observed trends that previously could be said to be within the 95% CI of the mean (which we have somewhat arbitrarily defined as “consistent with the data”) now fall outside it, which doesn’t meet our standard for “consistent with”.

    What I find far more difficult to “justify how it could be reasonable” is the notion that — in a test that purports to compare AGW models versus observations — it is proper to include models that don’t predict significant amplification of warming in the troposphere and models whose surface predictions are more than two standard deviations away from the surface trends — and then use the large standard deviation created by the inclusion of those models to conclude that “there is no clear model-data discrepancy”.

  212. Steve McIntyre
    Posted May 5, 2008 at 7:07 AM | Permalink

    beaker, thanks for your comments. These issues really are quite fun, aren’t they?

    I’ve noticed that “evidence-based medicine” raises many statistical issues about bias and cherry-picking, which I regard as similar to the ones that I raise about proxy studies. Some of the protagonists in this debate are also from Ontario; one even plays squash.

    If there’s some topic that you want to start your own thread on, let me know.

  213. beaker
    Posted May 5, 2008 at 7:11 AM | Permalink

    Michael Smith says:

    As a layman, my interpretation would be that as a result of the additional
    information gained by adding the two new models, we are now that much more certain of the true value of the mean.

    But if the mean and standard deviation remain the same, the (Gaussian approximation of the) ensemble is giving exactly the same hindcast of the tropical trend as it was before. If the hindcast is the same, why does one come in for criticism and not the other?

    I think it is relevant to ask what you mean by the “true value of the mean” in the first place. Is the ensemble an i.i.d. sample from some underlying distribution of models? As others have noted, this is debatable (especially from a frequentist perspective).

    What I find far more difficult to “justify how it could be reasonable” is the notion that — in a test that purports to compare AGW models versus observations — it is proper to include models that don’t predict significant amplification of warming in the troposphere and models whose surface predictions are more than two standard deviations away from the surface trends — and then use the large standard deviation created by the inclusion of those models to conclude that “there is no clear model-data discrepancy”.

    The output of the models represents what is considered feasible by the modelling community, given the known systematic and stochastic uncertainties. It just gives the consequences of a set of assumptions, nothing more. If you prune the ensemble, then you reduce (incorrectly) the apparent uncertainty in the models, and make them look as if they are more confident than originally claimed. That is why they should not be pruned.

    I have lost track of the number of times I have pointed this out, but the +-2SD test is only interesting if the models fail it. If they pass, it is not an indication that the models are good or even useful. The Douglass test doesn’t tell you if the models are useful either; as I have already pointed out, even if the GCMs had the physics exactly right they would still fail the +-2SE test (given a large enough ensemble)! The fact that you could have two functionally identical ensembles, where one passes and one fails, shows that the test is inappropriate in establishing usefulness or validity.

  214. Pat Keating
    Posted May 5, 2008 at 7:16 AM | Permalink

    205 beaker

    There has been discussion of Bayesian and Frequentist views of probability, but no one has mentioned the Popper view. This might have led to a more enlightened discussion.

    The conditions of selection of the models are relevant, as several pointed out without the Popper underpinning. IOW, if the selection criterion includes only models which satisfy the surface-trend data, then Douglass’ point is made. If other models are included, then Schmidt has a point. ‘Pure’ Popper.

  215. beaker
    Posted May 5, 2008 at 7:37 AM | Permalink

    Steve #212: The issues are indeed fun to discuss, as the length of #205 demonstrates! Your compromise wording in #96 goes a long way in clarifying the problem I (and others) have with the Douglass et al. paper.

    My views on the appropriateness/adequacy of the +-2SE test have sharpened rather; the arguments in #207 and #209 seem to make the point rather better than I had previously. However, this doesn’t mean I have a problem with what I take to be the aim of the paper, namely to demonstrate that the observations were not in close accord with the models. The really interesting question would be whether this is a problem with the models or the data or both.

    I think part of the problem lies in treating the models as if they were statistical models (e.g. linear regression) explicitly fitted to some data, which they are not.
    It might be a good idea to have a look at the ensemble issue; I use ensembles routinely in my work (though not ensembles of GCMs). Perhaps we should discuss it off-line via email?

  216. beaker
    Posted May 5, 2008 at 7:55 AM | Permalink

    Pat Keating #214: Part of the issue for me is that the ensemble as a whole has a useful interpretation (as an indication of what is considered plausible given the assumptions of the climate modellers). If you prune the ensemble, it just becomes a set of models, for which the spread has no straightforward interpretation, so I am against pruning. However, the key point for me is that the +-2SE test does not give an indication of whether or not the models are useful, if only because a perfect model isn’t guaranteed to pass!

    I do like the Popperian view, it seems to encourage the proper uncertainty that we shouldn’t get too excited about a theory based on positive results (because observations can never validate a theory, only falsify it), and likewise we should not disregard a theory until it has been disproven (as the weight of evidence may change as observations are collected).

  217. Eggplant fan
    Posted May 5, 2008 at 7:57 AM | Permalink

    Geoff Sherrington #204

    I guess I’ve come across as far more negative toward the Douglass et al. paper than I intended. The paper does the best possible test against the claims of the IPCC taken at face value: the mean of the different models converges to the observations. If the hypothesis is that different models give a normal/Gaussian distribution around the actual observation, then the t-test (with standard error of the mean) is the appropriate test. RC jumps on the fact that the model builders don’t make the same claim (note that the FAQ page I linked in #203 evades the issue of averaging different models altogether), but then goes off on a red herring that is compatible with neither the IPCC’s nor the model builders’ claims. I fully agree that the Douglass et al. paper shows that there are serious concerns about the modeled results. I find it troubling that the models have not been run with the observed El Ninos incorporated (which the FAQ indicates is possible) to see what specific effects they have on other observables for a more direct comparison. As it stands now, there is so much uncertainty allowed that it is almost impossible to invalidate the “average-of-the-models model”, which makes it essentially useless for projections.
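
    A minimal sketch of the one-sample t-test described above, with invented trend values rather than the numbers from Table II (model_trends and observed_trend are purely hypothetical); the small p-value is a statement about the ensemble mean, not about the spread of the individual models:

    ```python
    import numpy as np
    from scipy import stats

    model_trends = np.array([0.18, 0.22, 0.25, 0.28, 0.31, 0.35])  # hypothetical model trends (K/decade)
    observed_trend = 0.12                                          # hypothetical observed trend

    # One-sample t-test: is the mean of the model trends equal to the observed value?
    t_stat, p_value = stats.ttest_1samp(model_trends, observed_trend)
    sem = stats.sem(model_trends)  # standard error of the mean

    print(f"ensemble mean = {model_trends.mean():.3f} +- {2 * sem:.3f} (2 SEM)")
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    ```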

  218. Steve McIntyre
    Posted May 5, 2008 at 8:19 AM | Permalink

    #215. I’ll ping you by email from your sign-on.

    The main wording issue that I would like to see resolved is what the “true” sigma is on the climate given the forcings (or even how to put a rough prior on it, if I’m using the terms right). Gavin’s position (and the more I think about it, the less sense it makes to me) seems to be that, even if we knew all the forcings, the “true” sigma for the trend in the next 30 years could be estimated by the sigma of the models. I don’t think that this is true.

    First, I’m not sure how much sense it makes to think of the trend as being a single draw. Surely there must be some time period at which the random conditions of the draw average out and the forcings make themselves felt. Let’s say that the period was 100 years and the models yielded a range of 1-10 deg C with a model sigma of, say, 2.5 deg C. I don’t believe that that represents the sigma of the possible earth responses.

    Also, if Gavin were held to his position on stochastic variability, then this means that there is very large stochastic low-frequency variability and that the various calculations purporting to assert the significance of trends would be refuted by the existence of this low-frequency variability.

    I’m going to post some more on radiosonde data issues as there are many interesting matters in these records.

  219. Kenneth Fritsch
    Posted May 5, 2008 at 8:28 AM | Permalink

    Beaker, you have failed to show that the term “inconsistent” has any accepted and specific meaning in statistics. You keep changing the depth of your views on the use of SEM in the Douglass paper. There may be good reasons for not treating the model outputs as one would, for example, a sampling for control charting, but I do not believe that you have made a good case for why this could not be done as a first approximation, as Douglass et al. have done.

    Surely one cannot consider the observed data as another result at an equal level with the model outputs, because it is the target, and therefore it is entirely wrong to attempt a comparison with 2 SD or a range of model outputs as was done in some earlier papers, as I noted in a previous post. One hardly needs a thought experiment to show that including model output outliers and/or physically unrealistic model outputs can make a range or 2 SD comparison of little value. If one wants to at least narrow the consequences of these problems one would use a comparison to the model mean, and to a first approximation using SE seems most practical.

    Therefore a choice between the Gavin/Karl et al. and the Douglass et al. approaches for comparing model outputs to observed data easily goes to Douglass. In fact the use of ranges and SD is fatally flawed in my estimation.

  220. beaker
    Posted May 5, 2008 at 8:53 AM | Permalink

    Steve #218: I think this may be a misunderstanding of Gavin’s position. It seems to me that the stochastic uncertainty of the models is not the same thing as the sigma on the climate given the forcings, although if the models had the physics just right and had sufficient resolution, they might be expected to be of the same order.

    Basically if you run a stochastic model with different initial conditions and get different results each time, then all of the results are (by definition) consistent with the assumptions of the model. This is why the stochastic uncertainty must be considered in deciding if the models are consistent with the data. However, if the physics of the model is not perfect, then it may be too sensitive to initial conditions, so the stochastic uncertainty of the model may be greater than the true sigma of the climate given the forcing. Likewise it may be too insensitive and the stochastic uncertainty of the models may be less than the true sigma (and therefore overly optimistic).

    In principle, we could simulate the climate by visiting Magrathea (for Douglas Adams fans) and commissioning 67 duplicate Earths and measuring their observed trends, so from a statistician’s perspective you can view the observed trend as a single draw. In fact, frequentist stats can have little to say if you can’t. How long it takes for the initial conditions to be buried would have a big say on the true sigma of the trend, but I don’t have the expertise even to make a sensible guess.

    The key point is that if we knew the true sigma of the climate given the forcing, it should be added to the error bars on the observations (representing the measurement uncertainty) rather than on the models.

  221. beaker
    Posted May 5, 2008 at 8:57 AM | Permalink

    I forgot to add, the ensemble probably under-estimates the true systematic uncertainty of the models, as the modellers are likely to have used only what they consider to be the most likely physics, rather than sampling model parameters from the range they consider plausible.

  222. Kenneth Fritsch
    Posted May 5, 2008 at 8:58 AM | Permalink

    Re: #216

    However, the key point for me is that the +-2SE test does not give an indication whether or not the models are useful, if only because a perfect model isn’t guaranteed to pass!

    If a perfect model or one that even loosely approximated it were available and tested (even though that would be in-sample) do you think climate scientists would be using an ensemble of model outputs and their means? The way I understand the problems of varying model output is that some models get some regions and climate metrics approximately right but fail in other areas and metrics.

    I think most models agree rather well in a broad sense about the ratio of temperature trends at the surface and troposphere in the tropics, and that must have been a reason for Douglass et al. comparing against a model average.

    By the way, the use of 2 SD still involves a model average, and if we are thinking in terms of one perfect model with the other models meaning nothing, what would that average mean?

  223. beaker
    Posted May 5, 2008 at 9:09 AM | Permalink

    Kenneth Fritsch #219: There is a good reason why I change the depth of my view on the use of SEM in Douglass et al. It is because I listen to the views expressed by my “opponents” in the debate (they are not really opponents as we are auditing), and if I see value in them I moderate my opinion. How Bayesian is that? ;o)

    I also think about it and try to construct illustrative examples that determine whether the test makes sense, and revise my opinion accordingly. Isn’t this what a scientist should be doing?

    However, I have now demonstrated that a perfect model can only pass the test in the asymptotic case (i.e. as the number of models becomes large) by a fluke. For me that shows unequivocally that the test is wrong. If you can show me that my reasoning is faulty, I will revise my opinion, demonstrating an open mind.

    I have also shown that functionally identical ensembles don’t necessarily give the same result for the test. How can the test evaluate the usefulness of the model if it doesn’t depend on what the model actually says?

  224. beaker
    Posted May 5, 2008 at 9:21 AM | Permalink

    Kenneth Fritsch #222:

    If a perfect model or one that even loosely approximated it were available and tested (even though that would be in-sample) do you think climate scientists would be using an ensemble of model outputs and their means?

    Yes, of course they would, because even if you could eliminate the systematic uncertainty, the model would still have stochastic uncertainty. BTW, you can’t get a perfect model by fitting to the sample because the “sigma of the climate given the forcing”, as Steve puts it, prevents you from observing the forced component directly. If you could do this, the job of the climate modellers would be much easier!

    By the way, the use of 2 SD has to consider a model average and when thinking in terms of a perfect model and the other models meaning nothing what would that average mean?

    Read the summary of Douglass et al. They make no mention of means, they only talk of reconciling the models with the data. If they had said the means of the ensemble were “inconsistent”, the problem would largely go away. However, as I have shown, that is meaningless anyway. The point they were trying to make is made much more strongly by observing that the ensemble only passes the easier SD test with a D- (I like that phrase, very droll! ;o).

  225. Kenneth Fritsch
    Posted May 5, 2008 at 9:21 AM | Permalink

    Re: #213

    The Douglass test doesn’t tell you if the models are useful either; as I have already pointed out, even if the GCMs had the physics exactly right they would still fail the +-2SE test (given a large enough ensemble)! The fact that you could have two functionally identical ensembles, where one passes and one fails, shows that the test is inappropriate in establishing usefulness or validity.

    Please demonstrate in detail how the GCMs could have it exactly “right” and fail a +/- 2SE test. It seems you are making a case for using an average of models and model runs and then simply waving off the use of SE in making a comparison.

  226. RomanM
    Posted May 5, 2008 at 9:28 AM | Permalink

    # 209 beaker

    Another thought experiment that makes me question the value of the +-2SE test. Consider an ensemble that was in good subjective agreement with the observed trends, say the difference was of the order of 0.05 K/decade, but that the ensemble was large enough that the observations are only just within the +-2SE error bars, so it only just passes the Douglass test. Now we add two new models that straddle the existing mean in such a way that both the mean and standard deviation of the ensemble remain the same. However, because of the 1/sqrt(n) factor, the ensemble now fails the +-2SE test! Can you justify how it could be reasonable for an ensemble to go from being “consistent with the data” to being “inconsistent with the data”, without the mean or standard deviation having changed?

    Your example is not contradictory in any way. In fact it illustrates how statistics properly works.

    The mean of your initial ensemble differs by some amount from the observed trends (the adjective “good” is irrelevant in this context and just clouds the issues, since you state that “it only just passes the Douglass test”). The test says that, given the variability of the model results AND the amount of information, i.e., the number of models in the sample, there is possibly some question whether the mean results of the model population (tacitly assumed to exist) COULD be equal to the value of the observed trends (a constant existing value regardless of the processes that generated that trend). Now you add two more models which straddle the existing average of the models (without changing the ensemble variability). Despite your numeric results not changing, you have added more information to the situation, thus increasing the confidence with which that original difference can be viewed, and are now able to make the decision that they are different (in view of the criteria that you set in creating the test).

    Have you never said to someone, “I see X amount of difference between my sample values and the hypothesized value, but the sample size is so small, I can’t be sure it’s not caused by sampling variation. However, if my sample size was twice as large with that same difference, I would seriously doubt that they could be the same”.

    If your two new values straddled the observed trends and not the existing mean (again without changing the variability), that extra information would provide evidence in favor of the model mean possibly being the same as the trend rather than against it.

  227. Christopher
    Posted May 5, 2008 at 9:32 AM | Permalink

    Re: 224
    From the abstract/summary

    “Model results and observed temperature trends are in disagreement in most of the tropical troposphere, being separated by more than twice the uncertainty of the model mean.”

    beaker, could you list your top objections to the paper? I’ve been trying to follow this but I’m getting slowly lost in all the back and forth. Fwiw, I can understand your issue wrt SE vs SD, but Douglass lays out exactly what he does, defines uncertainty as 2SE, and then does his analysis. Part of me thinks the point is being missed. The observed ~ predicted agreement is lacking. Could the paper have been improved somehow? Of course; I’ve never published a single thing where I thought otherwise. But is that the issue now? I’m trying to see this here but feel like some things are being lost in semantics.

  228. beaker
    Posted May 5, 2008 at 9:33 AM | Permalink

    Sorry Kenneth, this is just getting ridiculous. I think I stated quite clearly that it was a thought experiment/illustrative example. The fact that a perfect model isn’t guaranteed to pass the test (indeed it is almost guaranteed to fail) shows the test is inappropriate, whether or not it is possible to construct a perfect model in practice.

  229. Michael Smith
    Posted May 5, 2008 at 9:45 AM | Permalink

    However, I have now demonstrated that a perfect model can only pass the test in the asymptotic case (i.e. as the number of models becomes large) by a fluke. For me that shows unequivocally that the test is wrong.

    I see two problems with this argument. First, Douglass is not addressing the asymptotic case. Second, a perfect model will not be evaluated as part of an ensemble; it will be evaluated by itself. (As I understand it, we are using an ensemble precisely because we don’t have a perfect model.) So I don’t see the relevance of your point to the case at hand.

  230. beaker
    Posted May 5, 2008 at 9:48 AM | Permalink

    Roman M: The contradiction is this: how can a test tell you if the ensemble is useful if two functionally identical ensembles (they both make the same predictions with the same confidence) give opposite test results? The correct interpretation of the SEM tells you to treat the two ensembles differently, when there is no good reason to do so.

    Christopher: My main issue is that Douglass et al. do not make a clear distinction between the ensemble mean and the ensemble itself, especially in the summary. My secondary issue is that the test they actually perform has a correct statistical interpretation; it is just that it doesn’t tell you what you want to know (given that a perfect model is almost guaranteed to fail it if the ensemble is large enough, which makes no sense).

    Kenneth: Sorry, I think I may have missed the point in my previous post; however, I have already explained the thought experiment in great detail and it isn’t reasonable to expect me to repeat it again. If you have a specific objection, of course I will be happy to discuss it with you.

  231. Kenneth Fritsch
    Posted May 5, 2008 at 9:50 AM | Permalink

    Re: #224

    Read the summary of Douglass et al. They make no mention of means, they only talk of reconciling the models with the data. If they had said the means of the ensemble were “inconsistent”, the problem would largely go away. However, as I have shown, that is meaningless anyway.

    From the Douglass paper we have:

    Evaluating the extent of agreement between models and observations
    Our results indicate the following, using the 2σSE criterion of consistency:
    (1) In all cases, radiosonde trends are inconsistent with model trends, except at the surface.
    (2) In all cases UAH and RSS satellite trends are inconsistent with model trends.
    (3) The UMD T2 product trend is consistent with model trends.

    We have tested the proposition that greenhouse model simulations and trend observations can be reconciled. Our conclusion is that the present evidence, with the application of a robust statistical test, supports rejection of this proposition.

    Models are very consistent, as this article demonstrates, in showing a significant difference between surface and tropospheric trends, with tropospheric temperature trends warming faster than the surface. What is new in this article is the determination of a very robust estimate of the magnitude of the model trends at each atmospheric layer. These are compared with several equally robust updated estimates of trends from observations which disagree with trends from the models.

    The authors are very obviously referring to the ensemble of models and the averages, not only by the wording in these excerpts, but by the statistical testing that they carried out.

    I do not find putting words and meaning into the paper that are obviously not there very instructive.

    You have made several declarations about the physical and stochastic nature of the model outputs without showing in detail how they would specifically affect the statistical testing of the model output in this case. I am more familiar with analysis at CA using the data in a paper to demonstrate a point. In my mind you are throwing caveats into the equation that are already accounted for in the assumptions of the tests.

  232. Jon
    Posted May 5, 2008 at 9:53 AM | Permalink

    @207

    So what does the mean of a multi-model, multi-run ensemble actually mean? As far as I can see, averaging over models and over runs attempts to average out the systematic and stochastic uncertainties, leaving you with an estimate of only the “forced component” of the climate, after the effects of the initial conditions have been averaged out.

    The AR4 pretty much states this directly, and I would add that the ensemble runs are meant to show the forcing over time, not offer year-by-year or even per-decade “predictions”; the period they use is 20 years rather than the 30-year period some may be more familiar with re: climate.

    This is something that should be clear to anyone with a reasonable grasp of the English language who has actually read the AR4, yet a surprising number of people either cannot or will not accept this and try to “falsify” the ensemble means by judging them against short term observations, despite the fact that as you noted the ensemble runs are designed to smooth out the very noise of internal variability that will likely cause disagreement between projections and observations on timescales as short as a decade or several years.

  233. beaker
    Posted May 5, 2008 at 9:55 AM | Permalink

    Michael Smith #229: I mentioned the asymptotic case simply because the failure of a perfect model is almost certain in that case. Depending on the apparently unquantifiable “sigma of the climate given the forcing”, the perfect model will fail with finite (potentially quite small) ensemble sizes; the larger the ensemble, the more likely it is to fail. Your second comment shows why you have misunderstood this. Even a perfect model will have stochastic uncertainty, it is inherent in chaotic systems, so climatologists would still use an ensemble even if they had a perfect model.

  234. RomanM
    Posted May 5, 2008 at 10:06 AM | Permalink

    #230

    Roman M: The contradiction is this: How can a test tell you if the ensemble is useful if two functionally identical ensembles (they both make the same predictions with the same confidence) give the opposite test results. The correct interpretation of the SEM tells you to treat the two ensembles differently, when there is no good reason to do so.

    You are missing the point. The two ensembles are NOT functionally identical. The one with the extra models has more information and therefore the results can be viewed differently. Are you telling me that a sample of two observations with the same average and standard deviation is somehow “functionally identical” to a sample of 1000 observations? They may make the same predictions, but in the case of a larger number of models, it is easier to see that the difference between their average results and the trend they are estimating is real and not an artifact of the model (or situational) variability.

  235. Kenneth Fritsch
    Posted May 5, 2008 at 10:08 AM | Permalink

    Beaker, you referred to GCMs getting it right, not GCM. Regardless, it is clear that Douglass et al. were comparing an average of an ensemble of model outputs. And outputs that were rather consistent in Ts to Tt trends.

    None of the model outputs matched the observed, but if you want to use (conveniently) here an average (that you imply has no meaning as there is only one perfect or near perfect model) to a 2 SD confidence limit to say that such a model could exist by less than mere chance, you might want to check your premises in arguing against the use of SE in a comparison.

  236. RomanM
    Posted May 5, 2008 at 10:24 AM | Permalink

    #232 Jon

    This is something that should be clear to anyone with a reasonable grasp of the English language who has actually read the AR4, yet a surprising number of people either cannot or will not accept this and try to “falsify” the ensemble means by judging them against short term observations, despite the fact that as you noted the ensemble runs are designed to smooth out the very noise of internal variability that will likely cause disagreement between projections and observations on timescales as short as a decade or several years.

    Translation: These models are guaranteed to be correct only on a scale of 30 or 40 years or more. Come back in 2030 and start evaluating them then. Any differences between reality and model values are a result of natural variation of the earth’s climate, so don’t blame the models because they’re correct. It’s just Mother Nature who’s being cranky about it. So, listen to us and don’t even think about looking at the models…

    So, what is everyone supposed to do – sit and wait? If the modelers want trust, don’t you think that it is up to them to provide evidence that the models are trustworthy? Where is that evidence? A single run from most models? We wouldn’t be arguing the issues in the paper if substantial information was available about the details of how well each model worked. Do you suppose that the modelers don’t actually possess that type of information? Why isn’t that in the AR4? You have to evaluate what’s available and in the form that it is available.

  237. beaker
    Posted May 5, 2008 at 10:30 AM | Permalink

    Roman M: O.K., in that case, if we were to use the two ensembles to decide policy, how would the resulting policies differ? They wouldn’t, because both ensembles tell you the same thing. Think of it like this: if we had an infinite number of models in the ensemble, does this mean that the ensemble insists that the observed climate should be identical to the ensemble mean (as the SEM is zero)? No, of course not, because the standard deviation gives the uncertainty of the ensemble, not the standard error of the mean.

    I think either the IPCC recommendation regarding the ensemble mean is badly flawed or it is being badly misinterpreted!

  238. beaker
    Posted May 5, 2008 at 10:39 AM | Permalink

    Roman M: How about this, then. We start with an ensemble that gives a good subjective fit and just passes the 2SE test, as before. We then clone each model in the ensemble so the ensemble is twice the size but contains no additional information. It now fails the test, even though it is functionally equivalent.
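
    A minimal numerical sketch of the cloning version, again with invented trend values (the consistent_2se helper is the same hypothetical +-2SE check used in the sketch under #209, not anything from the paper):

    ```python
    import numpy as np

    def consistent_2se(ensemble, observed):
        se = np.std(ensemble, ddof=1) / np.sqrt(len(ensemble))
        return abs(observed - np.mean(ensemble)) <= 2.0 * se

    observed = 0.12                                      # hypothetical observed trend (K/decade)
    ensemble = np.array([0.05, 0.15, 0.25, 0.35, 0.45])  # hypothetical model trends
    print(consistent_2se(ensemble, observed))            # True: just passes

    cloned = np.tile(ensemble, 2)  # every model listed twice: no new information
    print(consistent_2se(cloned, observed))              # False: larger n, smaller SE, the verdict flips
    ```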

  239. RomanM
    Posted May 5, 2008 at 10:51 AM | Permalink

    #237

    Deciding that two things are not the same (the focus of the paper) is not necessarily what a policy decision would be based on. In practice, for such decisions, one can form an alternative hypothesis that the difference exceeds a given amount (separately determined on practical-importance considerations). You still have the statistical problem of deciding whether this is indeed the case based on the uncertainty in the sample, the observed sample values, and the sample information, i.e. the sample size. In such a case, even your infinite sample could end up not rejecting the null hypothesis if the difference is less than or equal to the predetermined limit. In that case as well, the two ensembles do not tell you the identical thing.

  240. Jon
    Posted May 5, 2008 at 10:56 AM | Permalink

    RomanM

    Translation:

    Any time someone pulls the “translation” tactic around here they inevitably perform an Appeal to Ridicule. Sure, distorting what I am saying to the point where you have a strawman to attack is a heck of a lot easier than taking my words at face value. But doesn’t it make you feel… sleazy?

    These models are guaranteed to be correct only on a scale of 30 or 40 years or more.

    The ensemble runs aren’t “guaranteed to be correct” and no one is claiming they are. They are meant to project the future warming trend, not to offer a weather forecast.

    Come back in 2030 and start evaluating them then. Any differences between reality and model values are a result of natural variation of the earth’s climate so don’t blame the models because they’re correct. It’s just Mother Nature who’s being cranky about it. So, listen to us and don’t even think about looking at the models…

    If you don’t like the timescales or goals of projections for the ensemble runs, feel free to attack actual forecasts instead. The ensemble runs are specifically intended to do something other than what you’re whinging about here.

    So, what is everyone supposed to do – sit and wait?

    I don’t really care what you do. People who are interested in the process, without the drama and the implication that modeling is some sort of cloak-and-dagger operation designed to prevent anyone from evaluating it, can start by judging results based upon what the purpose of the exercise is. Ensemble projections are not meant to track short term variability in the system – quite the opposite.

    Keenlyside 2008 is an example of a first pass attempt at forecasting. Can you honestly not see the difference, or are you choosing not to?

    You have to evaluate what’s available and in the form that it is available.

    The irony here is amazing.

  241. RomanM
    Posted May 5, 2008 at 11:02 AM | Permalink

    #238

    What do you mean by “clone” the model in each ensemble? If that means writing the result of a single run down twice, then you have not increased the sample size and you have no new information. If you run each model twice, and get the same result in each case, I would conclude all of the models are basically of zero variability (i.e., deterministic) and that doing statistics with them is pretty much a moot exercise. If the runs are different then now I do have new information and the ensembles are different.

  242. beaker
    Posted May 5, 2008 at 11:18 AM | Permalink

    Roman M: Testing if two things are the same doesn’t necessarily tell you if one thing is a useful predictor of the other. If it did, then a perfect model would be guaranteed to pass, but as I have shown, it isn’t. This is because of the irreducible stochastic uncertainty of even a perfect model and what Steve refers to as the “sigma of the climate given the forcing”.

    Can we agree that there is a level of disagreement that is small enough to be irrelevant, say 0.01 K/decade?
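    For what it’s worth, one standard way to formalise “small enough to be irrelevant” is an equivalence test (TOST) against an agreed margin, rather than a test of zero difference. A minimal sketch, with invented numbers and an assumed 0.01 K/decade margin:

        import numpy as np
        from scipy import stats

        def tost_equivalence(trends, obs, margin, alpha=0.05):
            # Conclude "equivalent" only if the ensemble mean is shown to lie
            # within +/- margin of the observed trend (two one-sided t tests).
            n = len(trends)
            mean = np.mean(trends)
            se = np.std(trends, ddof=1) / np.sqrt(n)
            t_crit = stats.t.ppf(1 - alpha, df=n - 1)
            t_low = (mean - (obs - margin)) / se    # significantly above obs - margin?
            t_high = (mean - (obs + margin)) / se   # significantly below obs + margin?
            return (t_low > t_crit) and (t_high < -t_crit)

        model_trends = np.array([0.11, 0.13, 0.12, 0.14, 0.10])  # hypothetical K/decade
        print(tost_equivalence(model_trends, obs=0.12, margin=0.01))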

  243. RomanM
    Posted May 5, 2008 at 11:29 AM | Permalink

    #240 Jon:

    I apologize. You are correct, my response was phrased in somewhat disparaging terms aimed not directly at you, but at the concept of protecting the sanctity of the models by putting any serious evaluation of their results out of practical reach. Virtually every dire consequence predicted by the AGW advocates is based on models. Without them, there is very little real evidence of how important the A is in the recent GW. And, yes, I care very much whether they are correct since there are those advocating incredible changes based on these models.

    There ARE persons so imbued with an AGW agenda that I would not put it past them to embellish results by suppressing contrary information. Try reading the posts on the RC site. Any implication that anything (major or trivial) could be questionable is quickly and firmly beaten down by language far worse than what I wrote. You will excuse me if I prefer to ignore the self-protective put-real-tests-out-of-reach admonitions that you quote.

  244. RomanM
    Posted May 5, 2008 at 11:41 AM | Permalink

    #242 beaker

    I thought that I agreed to the last sentence in #239. We can agree to disagree on the rest…

    …including the “Bayesian” word … 🙂

  245. Michael Smith
    Posted May 5, 2008 at 11:47 AM | Permalink

    beaker, in 233 wrote:

    Even a perfect model will have stochastic uncertainty, it is inherent in chaotic systems, so climatologists would still use an ensemble even if they had a perfect model.

    Okay, but it is not going to be the same ensemble we have now, is it?

  246. Jon
    Posted May 5, 2008 at 11:48 AM | Permalink

    #243 RomanM:

    You will excuse me if I prefer to ignore the self-protective put-real-tests-out-of-reach admonitions that you quote.

    What are you talking about?

    I gave you an example of modeling-as-forecasting that makes short term predictions. We’ll see more and more in the run up to the AR5. And there will be a lot of crawling The long term ensemble projections just aren’t meant to do what you are demanding of them.

  247. Jon
    Posted May 5, 2008 at 11:50 AM | Permalink

    Should have read:

    And there will be a lot of crawling before walking before running, I would imagine.

  248. beaker
    Posted May 5, 2008 at 11:58 AM | Permalink

    O.K. then so we could have two ensembles with identical means that are within 0.01 K/decade of the observations and identical standard deviations but one has more models in the ensemble than the other. Why should either ensemble be rejected, given that we both agree that both are satisfactory?

    Basically any model with a finite stochastic uncertainty is guaranteed to fail the SE test, as the size of the ensemble increases, unless the ensemble mean is exactly equal to the observed trends (which is an unreasonable expectation because of the “sigma of the climate given the forcing”, as Steve puts it). Doesn’t seem like a fair test of the ensemble approach to me (given that the benefit of the ensemble increases as the size increases and more of the uncertainties are averaged out)!

  249. beaker
    Posted May 5, 2008 at 12:39 PM | Permalink

    Michael Smith #245: Well if you want to defend a test that will reject even a perfect (true) model, I would venture you are setting your standards a little high, to say the least! ;o).

    Note that a climatologist (assuming he understood the benefit of an ensemble) would include as many models as computationally feasible to account for as much of the stochastic uncertainty as possible. So best practice would lead to the model being more likely to fail the test than bad practice!

  250. Christopher
    Posted May 5, 2008 at 12:44 PM | Permalink

    Re: 248

    Yes, this is quite clear. But this seems to be practical vs statistical significance. While the climate sigma is greater than zero (and leads to the situation you describe), I do not see Douglass sullied at all. The observed ~ predicted agreement is lacking. That is apparent in the ms. I’m having a mountain-out-of-a-molehill feeling here. Any thoughts on the longitudinal ~= cross-sectional aspect of such ensembles in general? Everything I’ve come across in, say, Bayesian model averaging, does not really go there.

  251. DeWitt Payne
    Posted May 5, 2008 at 12:47 PM | Permalink

    If the SEM of the ensembles goes as 1/sqrt(n) and n becomes very large, then isn’t the consistency test really a comparison against the standard error of the data? Aren’t there time series methods for deriving an estimate of the data variability that take into account persistence and autocorrelation or whatever?
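    One common adjustment of that kind inflates the trend’s standard error with an AR(1) “effective sample size”, n_eff = n(1 - r1)/(1 + r1), where r1 is the lag-1 autocorrelation of the regression residuals. A sketch on synthetic monthly data (this correction is one approximation among several, not a prescription):

        import numpy as np

        def trend_and_adjusted_se(y):
            # Least-squares trend with its standard error inflated for AR(1)
            # persistence in the residuals.
            n = len(y)
            t = np.arange(n, dtype=float)
            slope, intercept = np.polyfit(t, y, 1)
            resid = y - (slope * t + intercept)
            r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
            n_eff = n * (1 - r1) / (1 + r1)                  # effective sample size
            s2 = np.sum(resid**2) / (n_eff - 2)              # residual variance
            se = np.sqrt(s2 / np.sum((t - t.mean())**2))     # SE of the slope
            return slope, se

        # Purely illustrative synthetic anomalies with AR(1) noise:
        rng = np.random.default_rng(1)
        noise = np.zeros(252)
        for i in range(1, 252):
            noise[i] = 0.6 * noise[i - 1] + rng.normal(0.0, 0.1)
        y = 0.001 * np.arange(252) + noise
        print(trend_and_adjusted_se(y))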

  252. Kenneth Fritsch
    Posted May 5, 2008 at 1:21 PM | Permalink

    Beaker, you appear to me to be discounting the use of averaging over n samples on one hand (to minimize the variations that make comparisons more uncertain) while at the same time explicitly pointing to the need to average.

    I believe the approach of Douglass et al. was to find a convenient way of comparing the tendency of climate models to overestimate the ratio of Tt to Ts, and it is reasonably assumed that there is some relationship of any model used to all the other models.

    That using SE limits could reject a single run of a model that perfectly replicated the observed data is of course not arguable, and is in fact what one would expect if runs varied and even perfect ones varied in attempts at replication. If one had a model that replicated the observations with a single run (even reasonably well, enough to become a candidate for further testing – do we see any that Douglass used that could fit that criterion?), would not the logical experimental/statistical method then call for many replicate runs of that model, and then for using the average of those runs with the SEMs to determine whether that model failed to match the observed simply by chance? Or, alternatively, would one say that the replicate runs are viewed as sufficiently disconnected that we would merely take an average and look at the SDs?

  253. beaker
    Posted May 5, 2008 at 1:43 PM | Permalink

    Kenneth Fritsch #252: The benefit of averaging over the ensemble is that it reduces the effects of the stochastic uncertainty. However, this does not mean that the ensemble mean will be a close match for the observed trend, even if you use the true model. The reason for this is because the ensemble mean only gives the forced component of the climate. However the observed trend is due to both this forced component and a component that depends on the initial conditions. Whether the ensemble mean is close to the observed trend depends on the strength of the component that depends on the initial conditions. That is what Steve terms “the sigma of the climate given the forcing”, which unfortunately we don’t actually seem to know.
    You have to distinguish between the uncertainty of the ensemble and the uncertainty of the mean of the ensemble. The SE doesn’t give a useful indication of the claimed uncertainty of the ensemble, and so should not be used for validation. I’m sorry, I seem to have run out of ways of trying to explain this.

    I understand the aim of the Douglass paper, but the test that they use didn’t make that point as it treats the ensemble mean as a confident prediction of the trend, with the SE as the error bars. It isn’t; the SD is the error bars for the ensemble uncertainty. The SE is only the error bars for the uncertainty in estimating the mean of the underlying population of models based on a finite sample.

    You have misunderstood my argument. The SE test will be more likely to reject a large ensemble based on the true model than a single run, because of the 1/sqrt(n) factor!

  254. Kenneth Fritsch
    Posted May 5, 2008 at 2:03 PM | Permalink

    Re: #248

    Basically any model with a finite stochastic uncertainty is guaranteed to fail the SE test, as the size of the ensemble increases, unless the ensemble mean is exactly equal to the observed trends (which is an unreasonable expectation because of the “sigma of the climate given the forcing”, as Steve puts it).

    It would seem that any comparison, like the use of SEM, that takes into account the additional sampling size (to reduce uncertainty) gets eliminated by Beaker’s objection here.

    Any statistics book will show that one can make the following comparisons for large samples:

    Expected value: X1-X2 or even X1-X2 + E, where X1 and X2 are means for the sampling of two populations and E can be an error that one is willing to live with and declare X1 and X2 sufficiently close for practical matters.

    SD12 = (SD1^2/n1+SD2^2/n2)^(1/2)

    where n1 and n2 are the respective sample sizes

    and the probability of the difference occurring by chance can be determined from

    z = (X1-X2+E)/SD12

    By the way, the use of a reasonable error E would not be expected to have changed the conclusion in Douglass et al.
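    A short sketch of that large-sample comparison (written with the tolerance subtracted from the absolute difference, which is the usual arrangement; all inputs are placeholders, not values from the paper):

        import math
        from statistics import NormalDist

        def z_test_with_tolerance(x1, sd1, n1, x2, sd2, n2, E=0.0):
            # One-sided test of whether |x1 - x2| exceeds the practical tolerance E.
            sd12 = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
            z = (abs(x1 - x2) - E) / sd12
            p = 1.0 - NormalDist().cdf(z)
            return z, p

        # e.g. an ensemble-mean trend vs an observed trend with E = 0.05 K/decade
        print(z_test_with_tolerance(x1=0.27, sd1=0.12, n1=22,
                                    x2=0.12, sd2=0.08, n2=4, E=0.05))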

  255. beaker
    Posted May 5, 2008 at 2:18 PM | Permalink

    Kenneth #254: We should not be testing whether the mean of the ensemble is different from the mean observed trend. I have already given an explanation of why it is unreasonable to expect this to be zero (I repeated it for your benefit in #253). When you understand why this is not a reasonable expectation, you will understand why even the true model will ultimately fail the 2SE test.

    I can’t put it any clearer than this. There is no reason to expect the difference between the observed trends and the ensemble mean of the perfect model to be zero. Given that, why test for the difference of the two means being zero?

  256. Tom Gray
    Posted May 5, 2008 at 2:32 PM | Permalink

    why test for the difference of the two means being zero?

    Because it is useful

    You seem to be conflating a statistical test with reality

  257. Ron Cram
    Posted May 5, 2008 at 2:45 PM | Permalink

    beaker,

    In his comments above, David Douglass wrote:

    Invent your own tests or criteria. The main conclusion will be the same – only a few models can be reconciled with the observations.

    Do you agree or disagree with this statement?

  258. conard
    Posted May 5, 2008 at 2:46 PM | Permalink

    256 (Tom Gray)

    But our “reality” is only one of perhaps an infinite number.

  259. Kenneth Fritsch
    Posted May 5, 2008 at 3:09 PM | Permalink

    Given that, why test for the difference of the two means being zero?

    I added an E (it could have a +/- E) in the difference in X1 – X2 that perhaps you missed.

    You also make a big leap by declaring that the “natural variation” of the component in the models that can be averaged out is in the observed data to a significant extent. The observed (and modeled) values in Douglass are dealing with differences (or one could use ratios), and one would be required to show that that difference leaves a significant residual in the observed data that is averaged out in the model results. We are also comparing trends that surely smooth out the low frequency natural variations in the observed data.

    I recall some relationships that David Smith showed on one of these threads that had a good annual correlation over the satellite era between Ts and Tt in the tropical zone for the observed data. I will attempt to find it again.

    Of course when E for natural effects is shown to be large compared to the forced effects we have another debate on our hands.

  260. Willis Eschenbach
    Posted May 5, 2008 at 3:26 PM | Permalink

    beaker, you say (emphasis mine):

    The benefit of averaging over the ensemble is that it reduces the effects of the stochastic uncertainty. However, this does not mean that the ensemble mean will be a close match for the observed trend, even if you use the true model. The reason for this is because the ensemble mean only gives the forced component of the climate. However the observed trend is due to both this forced component and a component that depends on the initial conditions. Whether the ensemble mean is close to the observed trend depends on the strength of the component that depends on the initial conditions. That is what Steve terms “the sigma of the climate given the forcing”, which unfortunately we don’t actually seem to know.
    You have to distinguish between the uncertainty of the ensemble and the uncertainty of the mean of the ensemble. The SE doesn’t give a useful indication of the claimed uncertainty of the ensemble, and so should not be used for validation. I’m sorry, I seem to have run out of ways of trying to explain this.

    I understand the aim of the Douglass paper, but the test that they use didn’t make that point as it treats the ensemble mean as a confident prediction of the trend, with the SE as the error bars. It isn’t; the SD is the error bars for the ensemble uncertainty. The SE is only the error bars for the uncertainty in estimating the mean of the underlying population of models based on a finite sample.

    The IPCC, on the other hand, says (emphasis mine):

    Multi-model ensemble approaches are already used in short-range climate forecasting (e.g., Graham et al., 1999; Krishnamurti et al., 1999; Brankovic and Palmer, 2000; Doblas-Reyes et al., 2000; Derome et al., 2001). When applied to climate change, each model in the ensemble produces a somewhat different projection and, if these represent plausible solutions to the governing equations, they may be considered as different realisations of the climate change drawn from the set of models in active use and produced with current climate knowledge. In this case, temperature is represented as T = T0 + TF + Tm + T’ where TF is the deterministic forced climate change for the real system and Tm= Tf -TF is the error in the model’s simulation of this forced response. T’ now also includes errors in the statistical behaviour of the simulated natural variability. The multi-model ensemble mean estimate of forced climate change is {T} = TF + {Tm} + {T”} where the natural variability again averages to zero for a large enough ensemble. To the extent that unrelated model errors tend to average out, the ensemble mean or systematic error {Tm} will be small, {T} will approach TF and the multi-model ensemble average will be a better estimate of the forced climate change of the real system than the result from a particular model.

    To me, they seem to be explicitly making the claim that you deny: that the multi-model average will converge to the actual climate.

    If the average of the models converges to the actual climate as the IPCC says, then the Douglass test (using the standard error of the mean) seems to be the proper test.

    If the models do not converge to the actual climate, then … well, then the IPCC is wrong, and what’s the use of having an ensemble of models?

    All the best,

    w.

  261. steven mosher
    Posted May 5, 2008 at 3:43 PM | Permalink

    Kenneth,

    I think the issue of SEM and SD is played out. Besides, it’s not the real issue.

    The models are not “beaker inconsistent” with the data. Next Tuesday is not “beaker inconsistent” with the data. The key takeaway is this: the models are all over the map, as we saw, and the data is all over the map and being “adjusted” by the data chiropractors. The real question is model SKILL. That’s what we really care about. Well, I’m a lukewarmer, so I believe in AGW physics, but I don’t trust the models farther than I can pitch a leather couch.

    Hansen claims, for example, that we cannot exceed 450 ppm or Greenland will do the Wicked Witch of the West dance and melt. Where is this level of accuracy in prediction supported? Where? In models that produce a spread like you see in Douglass? “I got some quatrains to write,” says Moshtradamous.

    It’s been a fun discussion. I’m glad beaker’s hung around.

    Oh well. See ya round.

  262. beaker
    Posted May 5, 2008 at 4:11 PM | Permalink

    Willis #260: Although averaging the models will on average reduce the error, that doesn’t mean it can be expected to reduce the expected error to zero. The reason that the ensemble mean of the ideal model will not be equal to the observed trend is because of what Steve termed the “sigma of the climate given the forcing”, which affects the observed trend and is unpredictable, so the models can do nothing about it.

    Basically it would only be possible to reduce the expected error to zero if we could predict the weather as well as the climate, and even the perfect model can’t do that as we don’t know the initial conditions.

    There is no conflict between what I wrote and what is given in the quote from the IPCC.
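    A toy simulation of that point, with invented numbers (a fixed forced trend, an assumed weather sigma, and an observation that carries its own weather excursion): as the perfect-model ensemble grows, the 2SE band collapses while the gap to the single observed realisation does not, so the 2SE test eventually rejects even the true model; the 2SD band does not shrink.

        import numpy as np

        rng = np.random.default_rng(42)
        forced_trend = 0.20       # assumed forced component, K/decade
        sigma_weather = 0.08      # assumed "sigma of the climate given the forcing"
        observed = forced_trend + 0.05   # one realisation: forced part plus a weather excursion

        for n in (5, 50, 5000):
            runs = forced_trend + rng.normal(0.0, sigma_weather, size=n)  # "true model" ensemble
            mean, sd = runs.mean(), runs.std(ddof=1)
            se = sd / np.sqrt(n)
            print(n,
                  "within 2SE:", abs(observed - mean) <= 2 * se,
                  "within 2SD:", abs(observed - mean) <= 2 * sd)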

  263. John F. Pittman
    Posted May 5, 2008 at 5:43 PM | Permalink

    From the IPCC #260 W.E. quote

    The multi-model ensemble mean estimate of forced climate change is {T} = TF + {Tm} + {T”} where the natural variability again averages to zero for a large enough ensemble.

    emphasis mine

    Isn’t what the IPCC is saying that the signal (forced climate change – the A part of AGW) is estimated from a large enough ensemble mean such that the natural (assumed unforced with respect to the A part and natural forcings) variability averages to zero? From their definitions of natural and forced, is the IPCC claiming to

    reduce the expected error to zero

    (from beaker 262) for natural variability, and for the systematic error to be small?

  264. Ed Snack
    Posted May 5, 2008 at 5:50 PM | Permalink

    Beaker, Willis’ quote from the IPCC explicitly states that {T} will approach TF. What does “approach” mean if not to get arbitrarily close, or to reduce to (effectively) zero? If they meant that a finite error would remain, why didn’t they say so, and isn’t it a reasonable position that Douglass et al. have taken to assume that meaning?

    It always seems that the IPCC gets a free pass on any semantic issues, yet any skeptics must face the gauntlet of endless semantic nitpicking.

  265. Kenneth Fritsch
    Posted May 5, 2008 at 6:10 PM | Permalink

    Re: #262

    The reason that the ensemble mean of the ideal model will not be equal to the observed trend is because of what Steve termed the “sigma of the climate given the forcing”, which affects the observed trend and is unpredictable, so the models can do nothing about it.

    Beaker, the “sigma of the climate” is in the models, or at least should be, and could tend to average out. The sigma of the climate is in the observed overall temperature signal, but in this case we are looking at differences in trends. Please explain how that leaves an added sigma of the climate in the observed data that is significantly different from the model output.

    We are looking for an effect of GHG forcing on both the observed and the models by using this difference in order to zero out other effects. Please explain where the sigma due to non-forcing effects comes into play. And if you can show it in general, then explain how it could be estimated vis-à-vis a significant model-to-observed difference.

  266. Willis Eschenbach
    Posted May 5, 2008 at 6:22 PM | Permalink

    beaker, thanks for the response. Let me go over again what the IPCC says. First, it defines the temperature as being:

    T = T0 + TF + Tm + T’ where TF is the deterministic forced climate change for the real system and Tm= Tf -TF is the error in the model’s simulation of this forced response.

    It describes the use of the ensemble as follows:

    To the extent that unrelated model errors tend to average out, the ensemble mean or systematic error {Tm} will be small, {T} will approach TF and the multi-model ensemble average will be a better estimate of the forced climate change of the real system than the result from a particular model.

    Now, I think you’d agree with the IPCC so far. So we have to ask … if the “multi-model ensemble average will be a better estimate”, how much better an estimate is it?

    So, let me pose the question to you: how would we measure to see if the ensemble average is “a better estimate” than the result from a particular model?

    If, as you claim, the error on the mean of the model ensemble encompasses 95% of the models, it’s very hard to see how that would make the average a better estimate than any one of those models.

    However, I await your answer to this question.

    w.

  267. Ron Cram
    Posted May 5, 2008 at 7:49 PM | Permalink

    beaker,

    After you answer Willis’s question, please reply to mine in #257.

  268. Kenneth Fritsch
    Posted May 5, 2008 at 8:20 PM | Permalink

    Below is plotted the UAH monthly temperature series for the lower troposphere versus the mid troposphere for the period Dec 1978 to March 2008. The excellent correlation would indicate to me that looking at differences between the temperature anomaly trends of these series would leave a very small residual “climate signal”.

    A reply is optional. I have become rather proficient in interpreting no answers.

  269. Phil.
    Posted May 5, 2008 at 8:28 PM | Permalink

    Re #268

    Considering those series are derived from the same sensors such a good correlation isn’t surprising:
    TLT=Temperature Lower Troposphere MSU 2 and AMSU 5
    TMT=Temperature Middle Troposphere MSU 2 and AMSU 5

    There’s considerable overlap of the signatures.

  270. Kenneth Fritsch
    Posted May 5, 2008 at 8:55 PM | Permalink

    Willis E., it is the T prime in the equation about which I have issues with Beaker.

    I say that it can be shown that, if one uses trends and differences in combination with a GHG (forcing) fingerprint effect for the observed temperatures (and the modelled ones also, even though averaging should remove a non-biased T prime), T prime becomes very small compared to the forced T component in the observed temperature. I judge that the correlation in my previous post above gives evidence for this.

    Would not the T primes be nearly equal for surface and tropospheric temperatures?

  271. srp
    Posted May 5, 2008 at 9:04 PM | Permalink

    Beaker’s analysis of the infinite ensemble has two problems:

    1) Everyone would embrace the models if it were proved statistically that they converged to a value that was off from the data but very close to it. For policy purposes, using a reasonable loss function, a model that is known to be wrong but only by a tiny bit is better than one which is not known to be wrong but gives answers that may be very far from the truth much of the time. Bohr’s model of the hydrogen atom is known to be wrong when you look at details of the spectra, but for many practical purposes it’s still a damn good model. Ditto Newtonian physics. You don’t want to confuse statistical significance with practical importance.

    2) The whole point of the IPCC exercise is that temperature variability conditional on the forcing is just transient weather, not persistent climate. Under this maintained (by the IPCC) hypothesis, it is not true that the initial conditions allow a wide range of data trajectories for the means of things like the troposphere/surface temperature ratio. Their hypothesis is that if you wind up the climate and let it go with the same forcings, the averages will converge to the same predictable values; all the sensitive dependence on initial conditions is about weather, not climate. So it is incorrect to suppose that a test of this hypothesis should treat the observed climate statistics as coming from an urn with a widely varying population. The whole point of the IPCC exercise is that when you look at long-run means, the data should come out the same every time.

  272. Ron Cram
    Posted May 5, 2008 at 9:28 PM | Permalink

    Kenneth,
    re: 268

    I do not understand your plot. Can you explain it for me?

  273. Kenneth Fritsch
    Posted May 5, 2008 at 9:46 PM | Permalink

    Re: #272

    It is simply a month by month plot of the UAH TLT (lower troposphere temperatures) versus UAH TMT (middle troposphere temperatures) for the tropics over the time period from Dec 1978 to March 2008. I need to point out that it is the tropics, as that information was left out of the graph title.

  274. Kenneth Fritsch
    Posted May 5, 2008 at 10:15 PM | Permalink

    In reply to Phil, I did a correlation of UAH TLT with the GISS surface temperature for the tropics on an annual basis for the period 1979-2005 and obtained an R^2 = 0.75. This in turn would correlate nearly as well with UAH TMT.

    A link below shows a good correlation, by eyeballing the graphs, for the RSS series for TLT, TMT and TTS (troposphere/stratosphere); they are towards the bottom of the page and below the graph that Phil copied in his post. I believe that there are reasonable explanations why the TLS would not correlate with the troposphere readings, and besides, the Douglass paper covered mainly the troposphere up to the troposphere/stratosphere region.

    http://www.ssmi.com/msu/msu_data_description.html#zonal_anomalies

  275. Ron Cram
    Posted May 5, 2008 at 10:52 PM | Permalink

    Kenneth,

    I’m sorry. I should have been more specific. I do not believe the plot is in degrees C. Is it measured as the anomaly to the monthly average for the surface temp? Or is it something else? As a scatter plot, the trend line cannot be time dependent. What does the trend line represent exactly? Just the fact that when Lower Trop temps are higher, so are the Mid Trop temps? And vice versa? The trend line is at 45 degrees (or would be if the rectangles were squares), what does this indicate to you? What would it indicate if the trend line was at 60 degree slope? Or 30 degree slope? What conclusion do you draw from the plot and why?

  276. beaker
    Posted May 6, 2008 at 1:26 AM | Permalink

    Good morning all:

    John F. Pittman #263:

    I can see how the wording of the IPCC comment is causing the misunderstanding and I’ll try my best to explain. Basically the “natural variability” they are talking about is the simulated natural variation due to the initial conditions of the run, part of what I have been calling the stochastic uncertainty of the model. Averaging over the ensemble will indeed reduce this uncertainty. What Steve refers to as the “sigma for the climate given the forcing” is the effect of the initial conditions in the observed climate. The models have no way of predicting this, which is why the ensemble mean can’t be expected to match the observations exactly, even asymptotically using the true model, which is why even the true model will fail the +-2SE test.

    The models can only hope to estimate the forced component of the climate, they can do no more. However, we can’t directly observe the forced component of the real climate as it is contaminated by a second component reflecting the effects of the “initial conditions” (i.e. the exact state of the atmosphere in 1979). Whether the model trends can be considered “close” to the observed trends depends on the magnitude of the second component (which as far as I am aware, we don’t know, but Douglass et al. implicitly treat as being zero).

    I think that the misunderstanding stems from a confusion between modelling uncertainties and real uncertainty in estimating the forced component of the actual climate.

    Ed Snack #264: The stochastic uncertainty of the ensemble will go to zero in the limit of an infinite ensemble. The problem with the +-2SE test is that it assumes the observed trend does not depend on the initial conditions, but it does. As I have said before, it is not a like for like comparison as it compares an estimate of the forced climate with the observed forced climate plus a component due to the initial conditions. Even a perfect model cannot predict the second component and so will fail the test if a large enough ensemble is made, which is absurd.

    Kenneth Fritsch #265: Basically the observed trend is a combination of a forced component and a random component. The “sigma of the climate given the forcing” is the standard deviation of this random component. Even the true model of climate physics with infinite resolution is unable to estimate this random component, as we do not know the appropriate initial conditions for the simulation. As the models can only estimate the forced component, we can’t expect it to match the observed trend exactly.

    The uncertainty removed by ensembling is the stochastic uncertainty of the simulations, it is not the comparable uncertainty that exists in estimating the forced component of the observed climate needed for a like-for-like comparison.

    Willis Eschenbach #266: I hope my answer to #263 clears things up a bit. As for the IPCC claim, note that they say it gives a better estimate of the “forced climate change” not the “climate change”. Had they claimed the ensemble mean gave a better estimate of the actual climate change (including the chaotic component) they would be wrong.

    Ron Cram #267: I didn’t answer the question because I have already made it abundantly clear that I was pointing out a problem with the methodology of the paper, not with the general finding (although it was expressed with more certainty and more strongly than was justified by their actual analysis). As I have also said before, one very nice criterion that shows the problem with the models very clearly is the +-2SD diagram on RealClimate, so I had actually answered it already.

    Kenneth Fritsch #268: That is only looking at the measurement uncertainty in the observations; the problem with the +-2SE test is in the uncertainty in estimating the forced component of the climate from a single observation of a chaotic system (represented by the “sigma of the climate given the forcing”).

    SRP: #271: The models can only model the forced component of any climate change and give an estimate of the uncertainty. However, judging whether this is close to the observations depends on the “sigma of the climate given the forcing”, which we don’t know. The +-2SE test rejects even the ideal model as it assumes that this is exactly zero, which is unreasonable. I’m sure it would be possible to come up with a good test, it is just that the one used by Douglass et al. is not it. As you say, you don’t want to confuse statistical significance with practical importance. I demonstrated in #248 that the +-2SE test is doing exactly that, which makes the comment in #256 rather a non sequitur!

    As to your second point, if the 20 year trend were long enough to suppress the effect of the chaotic variability of the trend then there would be little or no variation in the trends for different runs of the same model, but according to RealClimate, there is. However, that is a modelling issue, which I as a statistician can’t really comment on usefully. As for the data, the observed data is not a mean, so the averaging out isn’t there; it is a single time series.

  277. Michael Smith
    Posted May 6, 2008 at 6:32 AM | Permalink

    Beaker, you wrote in 276:

    Even a perfect model cannot predict the second component and so will fail the test if a large enough ensemble is made, which is absurd.

    But in reality, that’s not going to happen, is it?

    If the purpose of the ensemble is to account for stochastic uncertainty around the perfect model, wouldn’t that mean the ensemble would only include models that increased the standard deviation?

  278. Ron Cram
    Posted May 6, 2008 at 6:53 AM | Permalink

    beaker,
    re: 276

    No, I do not think you have made it “abundantly clear” that you agree with the general finding of the Douglass paper. You have made certain comments that could be interpreted to mean that possibly Douglass was not too far off the mark. But I have never seen you write: “After looking at the data and analysis in the Douglass paper, it is obvious that 19 of the 22 models do not pass the appropriate tests applied by Douglass.” But that is the general conclusion of Douglass in his comments above.

    Thank you for making it abundantly clear now.

  279. Kenneth Fritsch
    Posted May 6, 2008 at 8:35 AM | Permalink

    Beaker, your reply is much too general and dismissive to be of much use to me. You will have to explain the “sigma of the climate” and how that sigma is left in the ratios of trends (or R, as Douglass et al. term it in their paper) of the troposphere to surface GHG fingerprint in the observed data.

    From my readings I know that the climate models can model the climate without the forcings included, and this will give (or theoretically should give) in each individual run a sigma of the climate over a period of time. We can then take these individual model results without forcings and look at the trends. We can further do the same for the tropical surface and tropospheric temperature trends without forcings.

    Since I have seen climate models run without forcings and yielding little or no trends over relatively long time periods (without doing any ratios), I suspect that the climate sigma you refer to would be very small in any trend ratios (as in Douglass) over appreciable recent time periods, as derived from climate models or in the observed data. Now if you were to have disagreements with those model findings on a climate sigma as it affects trends, then I suspect the models simply are not at such a state of advancement as to allow reasonable evaluation of them or even parts of them.

    The observed Ts/Tt ratios of trends (and particularly one that gives a GHG fingerprint) should be relatively free of any climate sigma, no matter how it is manifested or related to a supposed chaotic system’s initial conditions.

  280. beaker
    Posted May 6, 2008 at 8:49 AM | Permalink

    Ron Cram #278: Sorry, I genuinely thought e.g. the last line of #207 was unambiguous. I don’t think Douglass has applied any “appropriate” tests from a statistical point of view, so that would be too strong a statement for me. He has however made some ad-hoc tests that make the point well enough, and I have no problem with that whatsoever. Having said which, the fact the models only just pass the consistency test even with the large uncertainty makes the case rather better, from a purely statistical perspective (IMHO).

    Michael Smith #277: If even a perfect model is almost guaranteed to fail the +-2SE test, how can it be a fair test of a real model, given that it could not be any more accurate?

    The standard deviation of the perfect ensemble would be equal to the “sigma of the climate given the forcing” as the physics is exactly right. You would not want to reject any member of the ensemble as they are all valid realisations of a chaotic process with the same forcing. However the observations also have a signal that is due to the chaotic nature of the climate (which you could say was “weather”) that can’t be estimated, and that is what prevents the perfect model from passing the +-2SE test. I appreciate that this is a fairly subtle point: basically the models can only predict the forced climate, but we can only observe the actual climate, not the actual forced climate.

  281. Kenneth Fritsch
    Posted May 6, 2008 at 8:51 AM | Permalink

    Ron Cram:

    Is it measured as the anomaly to the monthly average for the surface temp? Or is it something else? As a scatter plot, the trend line cannot be time dependent. What does the trend line represent exactly? Just the fact that when Lower Trop temps are higher, so are the Mid Trop temps? And vice versa? The trend line is at 45 degrees (or would be if the rectangles were squares), what does this indicate to you? What would it indicate if the trend line was at 60 degree slope? Or 30 degree slope? What conclusion do you draw from the plot and why?

    The plot is anomaly to anomaly on a monthly basis and, as you said, shows that the climate noise (sigma?) that is affecting the signals affects the TLT and TMT and TTS and surface, for that matter, nearly the same.

    A deviation from a 45 degree line would indicate different trends for the temperature anomalies being plotted. The UAH and RSS trends for TLT, TMT and TTS are different, but not necessarily by a lot.

  282. Christopher
    Posted May 6, 2008 at 8:51 AM | Permalink

    Well, I continue to find this back and forth intriguing. But I just can’t shake the feeling that this is somehow old hat. I’m sure anyone who sat thru a basic stats course learned that you can get an arbitrarily significant result by upping n. That’s really all we’re talking about here. That’s why the perfect model fails asymptotically. And it would seem to motivate the whole baseline used for comparison purposes and the range > SD > SE issues. The last bit bothers me as it seems to represent a moving target that can be tweaked on the fly to get the answer you want. The other subplot I find odd is that, imho, taken to its logical conclusions, beaker’s notion means that we cannot test such models at all. And this reminds me of the need for falsifiability that seems to be so absent in the world of GCMs. Lastly, at the end of the day we are comparing model to model (at least a dataset so processed and versioned as RAOBCORE certainly feels like a model to me). And here there are so many issues that I am reminded of a house of cards.

  283. beaker
    Posted May 6, 2008 at 9:07 AM | Permalink

    Sorry Kenneth, I am genuinely trying to communicate my point, but perhaps being hampered by being too much the statistician and not enough a climatologist to have the right language.

    The problem with the test is as much to do with the data as with the models. The Earth’s climate is the result of a chaotic system, with external forcings and internal feedback mechanisms. However, there is a chaotic element to the climate that is unpredictable, that depends on the initial conditions. This is why we can’t predict the weather more than a week or so ahead: to make accurate predictions one extra unit of time into the future, we (apparently) need to make exponentially more accurate measurements of the current atmospheric conditions. I understand there are longer term aspects of this chaotic behaviour, e.g. ENSO, that could affect the 20-year tropical trends that we are discussing. Even a perfect model can’t predict this component, so there is no reason to expect the model to predict the observed trend exactly (which is what is required to pass the +-2SE test in the asymptotic case of an infinite, i.e. perfect, ensemble).

    By “forced climate”, the IPCC seem to mean the average climate after you have averaged out the chaotic component. You can do that with the models by having an ensemble, but sadly you can’t do the same with the observed climate.

    Do keep asking questions, and I’ll keep trying to answer them the best I can, I may be near unintelligible, but I am determined! ;o)

  284. beaker
    Posted May 6, 2008 at 9:23 AM | Permalink

    Christopher #282: I am sure a reasonable statistical test is possible (even if it involved a subjective-Bayes element). The key step would be to agree on the reasonable magnitude of the “sigma of the climate given the forcing” and use that with the measurement uncertainty and the uncertainty in the estimation of the trend (probably comparatively small) to make +-2SD error bars for the observations. You could then test to see if the ensemble mean were a good estimate of the actual trend by comparing the mean +-2SE error bars of the ensemble with the mean +-2SD of the observations. I think that would be a valid test; the only objection would be the assumption of a value for the “sigma of the climate given the forcing”, but a subjective-Bayes approach would at least allow you to make a statistically sound test based on explicitly stated assumptions. The only point of contention would be one of climatology rather than statistics, which would be progress.
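    A minimal sketch of that comparison, with every uncertainty value assumed purely for illustration (the assumed climate sigma being exactly the contentious quantity):

        import numpy as np

        model_trends = np.array([0.24, 0.18, 0.31, 0.27, 0.22, 0.29])  # hypothetical K/decade
        obs_trend = 0.12                                               # hypothetical
        sigma_climate = 0.07      # assumed "sigma of the climate given the forcing"
        sigma_meas = 0.03         # assumed measurement uncertainty
        sigma_fit = 0.02          # assumed trend-estimation uncertainty

        mean = model_trends.mean()
        se = model_trends.std(ddof=1) / np.sqrt(len(model_trends))
        sd_obs = np.sqrt(sigma_climate**2 + sigma_meas**2 + sigma_fit**2)

        def overlap(lo1, hi1, lo2, hi2):
            # Do the two intervals intersect?
            return max(lo1, lo2) <= min(hi1, hi2)

        print(overlap(mean - 2 * se, mean + 2 * se,
                      obs_trend - 2 * sd_obs, obs_trend + 2 * sd_obs))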

  285. MBK
    Posted May 6, 2008 at 9:37 AM | Permalink

    Re: the various references to chaos. In deterministic chaos (which is itself a mathematical abstraction because the real world may add non deterministic noise) you will likely not be able to predict the trajectory / time-series of the system with any reliability except for the very short term. But, importantly, as long as the system remains in its previous basins of attraction, the attractors themselves will not change. So while climate models should be expected not to predict actual time-series, they should indeed be able to predict changes in the attractors: as a result of changes in model parameters (“external forcing”), does the model produce features of which measured real features are a plausible subset? And such features should include the kind of temperature differentials discussed here. Of course, “define plausible” is at the heart of this discussion.

    To me this means it doesn’t matter in which exact initial condition some model is initialized. The initial condition only modifies the trajectory. But the attractor remains the same (for a given “external forcing”). I also don’t see why one would have to use an ensemble rather than one and only one “correct” model, with as many runs as necessary to get a good idea of the likely attractors (system features).

  286. Ron Cram
    Posted May 6, 2008 at 9:44 AM | Permalink

    beaker,
    re: 280

    Obfuscation appears to be your forte. I asked you a very simple yes or no question and received back paragraphs that appear to contradict one another. First you say you agree with the general findings of Douglass. Then you claim the models just barely pass the statistical test. You refer me to a previous statement you made, but it is as much off-point and unclear as your other comments. You use similar terms as I do but make them refer to different antecedents. My question had to do with the tests Douglass applied in his comments above, not the statistical tests (I already know your opinion of those). My original question related to this statement by Douglass above:

    Invent your own tests or criteria. The main conclusion will be the same – only a few models can be reconciled with the observations.

    Do you agree or disagree with this statement?

    In the statement you referred me to (the one you say is unambiguous in Comment #207), you wrote:

    On the other hand, it is clearly interesting that the models only pass the conventional consistency test with a D- (as Patrick M puts it in #88), even taking more of the uncertainties into account. That seems to me to be a much more effective criticism of the models (or potentially the data, c.f. the Santer paper cited by Douglass et al.).

    But this is not addressing the question I asked. Douglass says 19 out of the 22 fail. So are you saying that as a group you think the models pass with a D- because three of them pass? If so, is that unambiguous to you? If you think the first 19 models pass the test, can you provide me with your analysis showing how they pass the test? You do agree that the tests applied by Douglass in his comments above are valid and appropriate, correct?

  287. beaker
    Posted May 6, 2008 at 9:45 AM | Permalink

    MBK #285: The fact that there is considerable spread in multiple runs for the same model (IIRC from the RealClimate article) suggests the time span for the trend is short enough for things like ENSO to have an effect on the trend. More than that I can’t say.

    If there were more than one attractor, wouldn’t that make the test even less reasonable, as the models can’t predict which one the real climate will have fallen into?

  288. Sam Urbinto
    Posted May 6, 2008 at 9:46 AM | Permalink

    I feel it’s old hat also Christopher. Make n bigger. So don’t we need both SEM and SD?

    beaker, it seems to me you’re saying SD is the correct way to go because the models can’t handle the random/chaotic part and we don’t know what exactly the initial conditions are for reality. That the SD needs to be used to remove the randomness. Therefore, the two can’t be directly compared, right? I think I said that once, maybe for another reason: you can’t exactly compare them at all.

    It seems the SD is the correct one to use. If you want the test to pass. And the SEM is the correct one. If you want the test to fail.

    Doesn’t using the SD and then comparing it to the SEM basically prove the point that the models fall short of being able to handle reality? That we need both?

  289. Kenneth Fritsch
    Posted May 6, 2008 at 9:58 AM | Permalink

    Beaker, unless you can address this issue with more specifics of the climatology involved like the climate sigma and how it survives not being cancelled out by the use of ratios of trends, I do not think I will learn much from our discussion going forward.

    Obviously we could make an educated guess, or at least decide on a reasonable measure of the climate sigma, if one were to conclude that it does not cancel out by ratioing trends (as I think it does). Then instead of looking at X1-X2 with SEMs we would look at X1-X2 +/- CS. If CS is large compared to the forcings, or mysterious in magnitude, as you appear to want to leave it, it would seem that the consensus of climate scientists is in much doubt.

    Your argument as a statistician is that, given that a stochastic residual (or climate sigma) remains in Xobserved and is averaged out of Xgcm, Xgcm – Xobserved can never equal zero. But someone in climate science has to provide you, as a statistician, with this given, and also show that the residual is sufficiently large to make a practical difference in Xgcm – Xobserved failing all tests using SEM.

    If we compared one group of climate model results with another group using large samples of both groups, we could take averages and use SEMs to determine whether the difference in means could happen by chance. We have effectively averaged out the climate signals in both groups. This appears to be a valid test that does not involve a residual climate signal. Now what are the chances that Xgroup1 – Xgroup2 = 0? If it isn’t zero, does that mean the comparison cannot be made by this method?

    I will conclude by saying that you are doing more than looking at the Douglass method strictly as a statistician; you are instead making assumptions about the climate science that you have not discussed or referenced in detail. At this point I have to go with Douglass et al. and assume that they can show that the climate signal cancels in their ratioing of trends, or at least becomes practically small – or else why do the testing as they prescribe? I will patiently await an explanation from them or do more detailed analyses of my own to answer the questions.

  290. beaker
    Posted May 6, 2008 at 10:03 AM | Permalink

    Ron Cram #286: If you ask whether I agree with a statement that I see some truth in, but do not completely agree with, then it seems to me the best thing to do is to set out my position more fully. I am not going to give a yes or no answer to a question that I don’t think should have one.

    You keep talking about tests. It would be more productive if you were to state exactly what it is you want to test for: inconsistency, bias, usefulness?

    As I believe Steve was suggesting, without knowing (or assuming) the “sigma of the climate given the forcing”, I don’t believe any statistical test of the ensemble mean, or of individual models, is really valid, except for the very easily passed +-2SD inconsistency test (including the stochastic uncertainty).

    Douglass statement you give is vague and informal, I agree with it to an extent, only a few of the models give a reasonable approximation to the data, but without knowing all of the uncertainties, especially the “sigma of the climate given the forcing”, I wouldn’t go as far as “reconcile” as that would imply more than the plot can support (as there are no error bars).

  291. beaker
    Posted May 6, 2008 at 10:07 AM | Permalink

    Kenneth, the “sigma of the climate given the forcing” is NOT part of the models, it is the uncertainty in estimating the actual forced climate (which we cannot observe) from the actual climate (which we can). It is part of the error bars on the data, not the models, and ensembling can do nothing about it. This seems to be the key issue that I am not getting across; if you get that point, the rest will follow.

  292. beaker
    Posted May 6, 2008 at 10:19 AM | Permalink

    Sam Urbinto: Whether you use the SD or the SE for a given test depends on what exactly it is you are testing for. I personally don’t care whether the models pass or fail a test. I care whether the correct conclusions are drawn from the results of the test. The SE-SE test is not useful here, it doesn’t answer a useful question; even the perfect model with an infinite ensemble (and therefore a perfect representation of the stochastic uncertainty) is almost guaranteed to fail! The SD-SD inconsistency test is easy to pass, but because it is easy to pass it is no indication of the models’ usefulness, and so is no big deal unless it fails. How much can we enjoy a victory that is so easily won? The SE-SD test I mentioned answers the question “can the ensemble mean be reconciled with the observations”, but we have to guess one of the uncertainties. It can also be performed on individual models, and I would expect some to pass and some to fail.

    At the end of the day the process is this: decide exactly what it is you want to find out; construct a hypothesis test for it; perform the test; interpret the test. Sadly in this case the chain failed at the second link, because the hypothesis test is not the right one for the question (ambiguously) posed.

  293. MBK
    Posted May 6, 2008 at 10:25 AM | Permalink

    beaker #287: Well from my understanding of chaotic systems a really good model would be rather like a risk analysis of how likely it is that the real climate will fall into a different attractor. Classic visualization, a ball in rough terrain. It will fall into a depression (the attractor) and if you move it most of the time it will fall back into the same “valley”. But if you move it too much, it may roll into a neighboring valley. It seems to me though that climate models are far from any such predictive capability. The very idea that there is a trend to be predicted (albeit with noise overlaid) is somehow quite incompatible with the idea of a system exhibiting chaotic nonlinear dynamics, possibly over a large range of scales (self similarity). Such a system would make any effort at predictive modelling look naive.

    Somehow I doubt, though, that within the time and parameter scales we’re interested in, the system is really all that chaotic. It looks rather homeostatic to me; it is rather its relative stability that is surprising. If this is true then models do have a chance, and a point. So this leaves the door open to models or observations that predict a departure from a previous homeostasis within its random-noise bounds. On the other hand this also means that models can’t take the chaos excuse anymore. They do have to predict something useful and testable.

    For the case in point here we’d ideally want to know what the “random-noise bounds” of the temperature differential were, w/o GHG additions, over a time-scale much longer than the one we are trying to test. Just common sense – yes we’d still only have one real climate “run” but over a period spanning many of the lengths of time we’d like to know about, so we can assess typical variability over any one length of time. Sadly, we don’t have much data here – not over time scales of historical, plausible variability. But it is precisely this that models should actually be doing in the first place – to put expected bounds on natural variability, and contrast them with expected departures from this variability, complete with significance levels and predictive power (what is the smallest departure from normality that can be asserted with good confidence within a certain time frame?). And this kind of data precisely is what multiple model runs give, so multiple model run data analysis is crucial IMO.

    What I don’t understand is why these data do not seem to be discussed, why models aren’t understood as research tools rather than as forecasts. The first thing a model should do is to give an assessment of expected variance in the output (one model, many runs). Without that, you can’t determine predictive power. The model output becomes untestable.
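    One way to do exactly that kind of multiple-run analysis (a sketch on an invented data structure, not the actual archived runs): pool the spread of trends across each model’s own runs to put a number on the internal, initial-condition variability of a 20-year trend.

        import numpy as np

        runs_per_model = {                    # hypothetical trends, K/decade
            "model_A": [0.21, 0.27, 0.18, 0.24],
            "model_B": [0.15, 0.19, 0.12],
            "model_C": [0.30, 0.26, 0.33, 0.28, 0.25],
        }

        within_vars = [np.var(r, ddof=1) for r in runs_per_model.values() if len(r) > 1]
        sigma_internal = float(np.sqrt(np.mean(within_vars)))   # pooled within-model spread
        print(f"estimated internal trend variability: {sigma_internal:.3f} K/decade")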

  294. beaker
    Posted May 6, 2008 at 10:26 AM | Permalink

    Kenneth:

    If we compared one group of climate model results with another group using large samples of both groups, we could take averages and use SEMs to determine whether the difference in means could happen by chance. We have effectively averaged out the climate signals in both groups. This appears to be a valid test that does not involve a residual climate signal. Now what are the chances that Xgroup1 – Xgroup2 = 0? If it isn’t zero, does that mean the comparison cannot be made by this method?

    The chaotic component of the climate doesn’t come into this test because the test doesn’t involve the data, just two groups of models. If the data isn’t involved in the test, the uncertainty in estimating the true forced component of the climate can’t be involved either.

  295. Jon
    Posted May 6, 2008 at 10:39 AM | Permalink

    @286

    Obfuscation appears to be your forte.

    Are you kidding?! He is going out of his way to answer and re-answer (and re-re-answer) the questions posed to him and avoid falling into a false dilemma when a yes or no is insufficient. How can that possibly be called obfuscation?

    His patience is remarkable.

    He also appears to be quite capable of discriminating between ensemble projections and forecasts, which is more than can be said for quite a few people around the blogosphere.

  296. Patrick M.
    Posted May 6, 2008 at 11:02 AM | Permalink

    beaker:

    If you were going to buy a computer model to forecast the effects of GHG’s on the troposphere, what amount of money would you be willing to spend on any of the models mentioned in the Douglass paper? Please specify model and price. I will accept an answer of “I would keep shopping around.”

    🙂

  297. Ron Cram
    Posted May 6, 2008 at 11:03 AM | Permalink

    beaker,
    re: 290

    You wrote:

    Douglass statement you give is vague and informal, I agree with it to an extent, only a few of the models give a reasonable approximation to the data, but without knowing all of the uncertainties, especially the “sigma of the climate given the forcing”, I wouldn’t go as far as “reconcile” as that would imply more than the plot can support (as there are no error bars).

    The statement by Douglass we are discussing is not “vague and informal.” It is a challenge:

    Invent your own tests or criteria. The main conclusion will be the same – only a few models can be reconciled with the observations.

    The “sigma of the climate given the forcing” is irrelevant to this discussion. We are not talking about climate sensitivity. The issue is “Are the models and the theory consistent with observations?” The models and the theory call for faster warming of the tropical troposphere than at the tropical surface.

    From your answer, I can only assume you are unwilling to attempt any mathematical effort to salvage more than the three models Douglass left standing. Is that correct?

  298. Ron Cram
    Posted May 6, 2008 at 11:32 AM | Permalink

    jon,
    re: 295

    I did not say beaker was not patient. But he brings up issues that are not relevant and attempts to change the discussion from one set of tests to another. And he pointed to an earlier statement as “unambiguous” but it did not address the point of the challenge issued by David Douglass at all. beaker’s most recent response looks to be his most clear statement so far, but I will not know for certain until he confirms it.

  299. Sam Urbinto
    Posted May 6, 2008 at 1:01 PM | Permalink

    I think I figured this out.

    beaker, your #292: I never said you cared if they pass or fail, and I don’t care either. 🙂 And you’re right, what test do you perform? The question is whether test A can reconcile models and observations, isn’t it?

    Pick test A.

    If this paper is trying to show that models and observations can’t be reconciled using SEM, does it do that or not? Not that it’s trying to reconcile them, just to see if it can be done using SEM.

    I can pick any test and number of models and see if I can reconcile. Could I pick SD-SD? Sure. But no, too easy to pass. Could I pick SD-SE (or SE-SD)? No, I have to guess at variables. Could I pick individual observations? No, I already know that whether it passes or fails depends on the individual test and which one I pick (which most of the time would be fail).

    So, the question: does this test show that if you pick this group of 22 models you can’t use SEM to compare them to observations? Does it falsify the usefulness of SEM for comparing this ensemble to those observations?

  300. Michael Smith
    Posted May 6, 2008 at 1:20 PM | Permalink

    Beaker, thank you for responding. In 280 you wrote:

    The standard deviation of the perfect ensemble would equal to the “sigma of the climate given the forcing” as the physics is exactly right. You would not want to reject any member of the ensemble as they are all valid realisations of a chaotic process with the same forcing.

    I’m confused by this. Previously, I understood you to be arguing against the use of the SE because the ensemble put together to work with the perfect model could be expanded until the SE approached zero, guaranteeing failure.

    The point I was trying to make in response to that concern is this: if the purpose of that ensemble is to reflect the stochastic uncertainty, why would it include models that did not increase the standard deviation? Isn’t that a reasonable way to determine what that ensemble should, and should not, include, and does that not eliminate the potential problem of increasing N until SE gets so small even the perfect model fails?

    I can’t tell from your response whether you agree or not.

  301. Kenneth Fritsch
    Posted May 6, 2008 at 1:28 PM | Permalink

    Beaker, this “sigma of the climate” that you have latched onto from an expression Steve M used recently at CA appears, in your usage, to be growing into an unknown and unknowable quantity with a description so vague that it can take on multiple meanings.

    Please give me a good reference to its description and definition and how it would manifest itself differently in the models than in the observed climate using ratios as in the Douglas paper and, if possible, a range of the magnitude of its effects.

    I thought you were differentiating the model ensemble output from the observed climate by treating the ensemble as merely the forced residual left after averaging out a sigma of the climate, but I understand you are now saying that the models do not include a sigma of the climate, while the sigma of the climate in the observed is related to the forcings (and initial conditions?).

    Taken in total, I think you are saying that all the models can do is give a potential measure of the forcing, and only to an approximation, because only in the observed case does the unknowable sigma of the climate manifest itself – and from the forcings.

    The unknowable sigma of the climate in a unique manifestation of an observed climate run (limited, of course, to the single run we have and will experience) would then appear to make a statistical comparison of a model output, much less an ensemble, with the observed a worthless or, at best, a very uncertain proposition.

  302. Posted May 6, 2008 at 1:39 PM | Permalink

    Jon 295,

    He also appears to be quite capable of discriminating between ensemble projections and forecasts, which is more than can be said for quite a few people around the blogosphere.

    I’d like to learn this discrimination; in the former the goal is to find predictive distribution of temperature given the forcings and in the latter just marginal distribution of future temperatures ?

    Haven’t read Douglass paper, so I probably shouldn’t comment, but my interval-estimation-based hunch is that the big issue here is the difference between a confidence interval and a tolerance interval?
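
    That hunch is easy to illustrate numerically: an interval for the ensemble mean shrinks as more models are added, while an interval meant to cover the spread of individual models does not. A rough sketch with invented numbers (a plain 2-sigma spread rather than a formal tolerance interval):

        import numpy as np

        rng = np.random.default_rng(1)
        for n in (5, 22, 200):
            trends = rng.normal(0.25, 0.10, n)    # hypothetical model trends, K/decade
            m, s = trends.mean(), trends.std(ddof=1)
            ci_half = 2 * s / np.sqrt(n)          # ~95% half-width for the mean (SEM-based)
            spread_half = 2 * s                   # rough 2-sigma spread of individual models
            print("n=%3d  mean=%.3f  mean +/- %.3f  members +/- %.3f"
                  % (n, m, ci_half, spread_half))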

  303. Jon
    Posted May 6, 2008 at 2:15 PM | Permalink

    I’d like to learn this discrimination; in the former the goal is to find predictive distribution of temperature given the forcings and in the latter just marginal distribution of future temperatures ?

    Read AR4 WG1 Chapter 10 and then look at a paper like Keenlyside 2008. The former is an attempt to show the long-term warming trend by smoothing the internal variability that the latter tries to anticipate. In the Keenlyside example, they use sea surface temperatures to attempt to model variability of the MOC. Such variability in an unforced climate would look like noise: fluctuations up and down that over time have a flat slope. Increased greenhouse forcing doesn’t eliminate that variability – you still see the fluctuations. However, it gives the overall picture a positive slope. Ensemble projections don’t try to predict the fluctuations; they show something more like the slope.

    A simple visual example would be comparing the variability of PDO to temperature. The PDO fluctuates, as does temperature, but temperature does so with a positive slope whereas PDO as an oscillation has no long term trend.

    If we look at an ensemble projection you can see even multiple runs of models will result in fluctuations, but averaged across all models the mean smooths them out, showing the overall trend.

  304. Sam Urbinto
    Posted May 6, 2008 at 3:04 PM | Permalink

    UC 302, I think Jon in 295 was talking about beaker (see Ron Cram 286)

    That said, about the paper, I think it attempts to answer the question of whether method X is appropriate for reconciling models (in this case 22) with observations. The method was SEM. It can’t reconcile them.

    And seemingly neither can SD, SEM/SD, or individual models reconcile anything.

    Inappropriate, meaningless, uncertain and conflicting.
    (SEM, SD, SEM/SD, by model)

  305. Sam Urbinto
    Posted May 6, 2008 at 5:10 PM | Permalink

    So again, does this test show that if you pick this group of 22 models you can’t use SEM to compare to observations? Or doesn’t it?

  306. gp
    Posted May 6, 2008 at 6:21 PM | Permalink

    Quick comment from a lurker.

    Let me try a reformulation:

    It seems that there is reasonable agreement that comparing model+SE to observation+SD would be a valid test, if it could be performed.

    There is, of course, some disagreement about the magnitude of the observation SD. RC, beaker, and others seem to be arguing for a relatively large SD, comparable to the SD of the model ensemble, because we only have one observation history and, lacking better information, the model ensemble SD could be considered at least a reasonable guess. Douglass, Kenneth, and others are arguing for a relatively small observation SD on the grounds that we are discussing trend ratios rather than trends; they seem to be assuming that these ratios are not much affected by the “weather” and should therefore be relatively stable.

    Question: is it reasonable to test the stability of observed trend ratios directly, by slicing the past 20 years into smaller time buckets and checking the stability of the resulting trend ratios? I suspect that this may only yield an upper bound of sorts, but that bound might be enough to settle the question.
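
    One way to try that slicing idea, sketched here with synthetic surface and upper-air series standing in for the observations (all numbers are invented for illustration):

        import numpy as np

        rng = np.random.default_rng(2)
        months = np.arange(240)                   # 20 years of monthly data
        t = months / 120.0                        # time in decades

        # Hypothetical observations: surface and upper-air series sharing ENSO-like noise.
        noise = rng.normal(0.0, 0.15, months.size)
        surface = 0.12 * t + noise
        aloft = 0.10 * t + 1.3 * noise            # assumed amplification of the noise aloft

        def trend(y, x):
            return np.polyfit(x, y, 1)[0]         # K/decade

        # Trend ratio (aloft / surface) in consecutive 5-year buckets.
        ratios = [trend(aloft[i:i + 60], t[i:i + 60]) / trend(surface[i:i + 60], t[i:i + 60])
                  for i in range(0, months.size, 60)]

        print("5-year trend ratios:", np.round(ratios, 2))
        print("full-period ratio:   %.2f" % (trend(aloft, t) / trend(surface, t)))

    If the bucketed ratios swing wildly (as they can when a bucket’s surface trend is near zero), that in itself says something about how stable the observed ratio can be expected to be.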

  307. Willis Eschenbach
    Posted May 6, 2008 at 6:36 PM | Permalink

    beaker, thanks for hanging in. However, you have consistently refused to answer my question. I’m a patient man, however, so I’ll ask it again:

    The modelers agree that their models can’t predict an “individual realization” of the climate system. However, they say that they can provide a model of the long term climate trajectory. Here’s a typical statement of the situation, from Professor Alan Thorpe, the head of the Natural Environment Research Council. Dr. Thorpe is a world renowned expert in computer modeling of climate. He says (emphasis mine):

    On both empirical and theoretical grounds it is thought that skilful weather forecasts are possible perhaps up to about 14 days ahead. At first sight the prospect for climate prediction, which aims to predict the average weather over timescales of hundreds of years into the future, if not more does not look good!

    However the key is that climate predictions only require the average and statistics of the weather states to be described correctly and not their particular sequencing. It turns out that the way the average weather can be constrained on regional-to-global scales is to use a climate model that has at its core a weather forecast model. This is because climate is constrained by factors such as the incoming solar radiation, the atmospheric composition and the reflective and other properties of the atmosphere and the underlying surface. Some of these factors are external whilst others are determined by the climate itself and also by human activities. But the overall radiative budget is a powerful constraint on the climate possibilities. So whilst a climate forecast model could fail to describe the detailed sequence of the weather in any place, its climate is constrained by these factors if they are accurately represented in the model.

    Thorpe, Alan J. “Climate Change Prediction — A challenging scientific problem”, Institute for Physics, 76 Portland Place London W1B 1NT

    If individual realizations are not predictable, and “the key is that climate predictions only require the average and statistics of the weather states to be described correctly”, then we must ask whether the increase of temperature trends with altitude (which is both predicted by theory and “confirmed” by models) is an “individual realization”, or whether it is an “average or a statistic of the weather.”

    Since we are looking at averages of what has happened over a relatively long period (20 – 30 years), it seems quite clear that it is an “average or a statistic of the weather”. This point of view is strongly supported by the fact that the increase is predicted by theory – in other words, it is not the result of a given “individual realization” of weather. It is something which is determined by the physics of the situation, not just in a general sense, but to the point where we can make theoretical predictions, based on the AGW hypothesis, of both its sign and its value.

    Since it is an “average or statistic of the weather”, the standard error of the mean is the appropriate metric to use for comparisons. The idea that one model out of twenty-two getting this critical parameter kinda sorta right somehow vindicates the models is contrary to everything we know about averages. The individual model results are long-term averages of individual runs of each model. The averages of the model results are averages of long-term averages. The averages of the observational data are also averages of long-term averages. None of this is an “individual realization”; they are all “averages and statistics”.

    I also note in my re-reading that two other people asked you to answer this question …

    All the best,

    w.

    PS – beaker, you say:

    The models have NOT been falsified, they have been shown to be significantly biased, I have explained the difference (and at Willis’ request demonstrated that these terms have pre-existing meanings in statistics).

    My friend, you are trying to rewrite history. First, the term you used before was not “falsified”, it was “inconsistent”, so your claim that I requested you to demonstrate anything about the term “falsified” is … well … not to put too fine a point on it, your claim is falsified.

    Second, in response to my request, you admitted that the “pre-existing meaning” of “inconsistent” was NOT the meaning you were using. Despite that, you have continued to use it as though you were vindicated, and have continued to make your incorrect claim that you were using “inconsistent” in some standard statistical manner.

    I hardly think that either of those statements supports your case. I was impressed, however, by your clever use of words. You say you “demonstrated that these terms have pre-existing meanings in statistics”, which is 100% true, you did demonstrate that … but somehow you forgot to mention that you were NOT using the pre-existing meaning, you were using another meaning altogether …

  308. Willis Eschenbach
    Posted May 6, 2008 at 7:54 PM | Permalink

    UC asked a valid question:

    “He also appears to be quite capable of discriminating between ensemble projections and forecasts, which is more than can be said for quite a few people around the blogosphere.”

    I’d like to learn this discrimination; in the former the goal is to find predictive distribution of temperature given the forcings and in the latter just marginal distribution of future temperatures ?

    to which Jon replied …

    “Read AR4 WG1 Chapter 10 and then look at a paper like Keenlyside 2008. …”

    A couple of comments:

    1) Jon, you cited the IPCC, you showed a pretty graph, you put out lots of words … but you didn’t answer the question. In fact, you managed to write your entire post without mentioning “forecast” even once. So, once again … what is the difference between an “ensemble projection” and a “forecast”?

    2) Anyone who cites a whole entire chapter of the IPCC Report is no better than a Bible-bashing evangelist who screams “The answer’s in the Book, in Leviticus”. Well, maybe it is, but where? … if you have a claim that you think the IPCC supports, give us page numbers, give us references, give us quotes, give us enough information to figure out what the heck you are talking about.

    All the best to everyone,

    w.

    PS (emphasis mine): “… a multi-model ensemble strategy may be the best current approach for adequately dealing with forecast uncertainty, for example, Palmer et al. (2004), in which Figure 2 demonstrates that a multi-model ensemble forecast has better skill than a comparable ensemble based on a single model.” AR4 Chapter 8 p 624

    “To the extent that simulation errors in different AOGCMs are independent, the mean of the ensemble can be expected to outperform individual ensemble members, thus providing an improved ‘best estimate’ forecast.” AR4 Chapter 10 p. 805

    SO … once you have clarified for us the distinction between multi-model “ensemble projections” and “forecasts”, you can move on to multi-model “ensemble forecasts”, and finish up with the famous “single model ensemble forecast”.

  309. Jon
    Posted May 6, 2008 at 9:19 PM | Permalink

    you didn’t answer the question. In fact, you managed to write your entire post without mentioning “forecast” even once.

    I cited Keenlyside 2008 as an example of forecasting up the thread (240), I didn’t think I needed to repeat myself when referring back to it.

    So, once again … what is the difference between an “ensemble projection” and a “forecast”?

    The ensemble projections take runs (multiple in some cases, single in others, depending on the model) and average them out in an effort to reduce internal variability. A forecast, unless otherwise stated, is an attempt to actually capture a reasonable amount of that variability. Interestingly, although Hansen’s 1988 projections were not intended to be forecasts, as single realizations rather than ensemble means they functioned similarly – and in them you can see quite a bit of variability compared to the way projections are handled now (compare them to Figure 10.5 in the AR4).

    Anyone who cites a whole entire chapter of the IPCC Report is no better than a Bible-bashing evangelist who screams “The answer’s in the Book, in Leviticus”. Well, maybe it is, but where?

    It’s on page 762, in discussing ensemble projections smoothing out internal variability.

    Look at that section, look at the Keenlyside paper. The kind of variability that they describe as damping the warming signal is smoothed out in the projections given in Figure 10.5. A forecast would have the kind of fluctuation we see in actual temperature, whereas the mean ensemble projections would be closer to the slope of the temperature increase.

    The first quote you cite would refer to a paper like Keenlyside using an ensemble of multiple models run with the same, or at least similar, sea surface temperature initializations, in order to best depict this initial state’s impact as variability – not an ensemble mean of projections run under differing initializations in order to smooth out variability. I assume the second is saying the same: that ensemble means under identical initializations should give a more accurate forecast than a single model, because errors in the models would be averaged out.

    I apologize if this doesn’t seem clear enough- it’s hard for me to describe a more clear way of distinguishing between the two than comparing the description of the projections to a paper like Keenlyside’s.

    multi-model “ensemble projections”

    Would be the forced component, with internal variability and model errors basically averaged out.

    and “forecasts”

    Using specific initializations to attempt to capture internal variability.

    multi-model “ensemble forecasts”

    Multiple models using the same initializations so that internal variability is present but modeling errors are averaged out.

    finish up with the famous “single model ensemble forecast”.

    I don’t understand this.

  310. Willis Eschenbach
    Posted May 6, 2008 at 11:07 PM | Permalink

    beaker, thanks for your reply. You seem to be saying that model ensemble “projections” and model ensemble “forecasts” are different, in that “forecasts” start from the same initialization, and projections don’t. If I understand you correctly and that is the case … then where do the “projections” start from? Totally different initializations? … that doesn’t make sense. Kinda similar initializations?

    Could you provide some kind of citation for this claim? I’ve never heard that distinction being made, and I don’t find it in the IPCC section you cited. In fact, they often seem to use the terms interchangeably, although “forecast” seems to be used more for short term (~one year) projections.

    Second, I see no indication that an average of different model runs starting from the same initialization would “capture internal variability”. The IPCC specifically said that they don’t (see my post above). Surely the runs cited in the Douglass paper started from the same initialization … yet they don’t “capture internal variability”. The IPCC uses the term to describe the internal variability of the climate system itself:

    It also differed in that, for each experiment, multiple simulations were performed by some individual models to make it easier to separate climate change signals from internal variability within the climate system.

    So the IPCC uses a model ensemble, not to “capture internal variability” as you have suggested, but to average out the internal variability.

    Third, perhaps my second point results from my misunderstanding … what are you calling “internal variability”? Internal variability in the climate system, as is used by the IPCC? Internal variability in each individual model? Internal variability between the models?

    w.

    PS — you say you don’t understand the “famous single model ensemble forecast” … it’s the one referred to directly above my question, viz:

    “… a multi-model ensemble strategy may be the best current approach for adequately dealing with forecast uncertainty, for example, Palmer et al. (2004), in which Figure 2 demonstrates that a multi-model ensemble forecast has better skill than a comparable ensemble based on a single model.” AR4 Chapter 8 p 624

    PPS — Reading the IPCC AR4 is always good for a laugh. This time I spotted this gem:

    Recent studies reaffirm that the spread of climate sensitivity estimates among models arises primarily from inter-model differences in cloud feedbacks.

    This is IPCCrap for two reasons. One is that they have not identified the “recent studies” … bad scientists, no cookies. The second is that since then, Kiehl has shown conclusively that the spread of climate sensitivity estimates among models comes from differences in forcing and has nothing to do with cloud cover … which shows how well the modeling community, and the IPCC, understand their own models. But I digress …

  311. beaker
    Posted May 7, 2008 at 2:40 AM | Permalink

    Good morning all!

    Patrick M #296: As I see it, the most useful thing the models tell us is the consequences of the modellers assumptions, and more importantly, the spread of the ensemble gives an indication of the uncertainty both within the modelling community (the systematic uncertainty) and the uncertainty that arises through the simulations themselves (the stochastic uncertainty). As a result, I am happy to see money spent performing as many simulations as possible on as many models, to get as good a representation of the uncertainties involved as possible.

    Basically, I would not recommend deleting models as this makes the ensemble appear to give an unduly confident impression of the consequences of the modellers assumptions, which I think is in itself dangerous. It seems to me to be a form of cherry picking. These are not statistical models and should not be treated as if they are. When looking at a set of “forecasts”, we have to remember that this “forecast” is just a statement of the logical consequences of a set of beliefs about the way the climate works, nothing more. Whether you believe the forecasts should be based on whether you accept the assumptions, not on whether you believe the “forecast” itself. Whether the “forecast” is a useful predictor of future climate (under a particular scenario) depends on how close these assumptions are to the reality (but I can’t comment on that; I don’t have the expertise).

    Now if the modelling community decide that the assumptions underpinning a particular model are unreasonable (perhaps as a result of reading Douglass et al. or the RealClimate article), then that would be a good reason for deleting it from the ensemble.

    Ron Cram #298: It seems to me that we are using “sigma of the climate given the forcing” to mean different things. I am using it (as I believe Steve was, but perhaps I am wrong) to refer to the uncertainty in inferring the forced change of the real climate (which we can’t observe directly) from the actual climate observations.

    If the issue is “Are the models and the theory consistent with observations?” then I would say that the RealClimate +-2SD test is the correct test of this question. That is because the theory has uncertainties involved, which are represented by the spread of the ensemble. If the spread of the ensemble overlaps the spread of the data, then there is “no clear inconsistency”, but the wide error bars tell you that the theory is very uncertain. This isn’t “playing with words”; in science (and statistics in particular) we have to be very careful to set out the question completely unambiguously so that we can interpret the answer correctly.

    All of the ensemble is useful in telling you the consequences of the modellers assumptions. That is the real purpose of the ensemble. That the ensemble only just manages to explain the observations casts doubt on those assumptions, but doesn’t falsify them outright. As I have said before, I am not keen to delete any model from the ensemble as this invalidates the most useful interpretation of the ensemble projections.

    Michael Smith #300: If you had a perfect model with unlimited resolution, it still could not be expected to predict the observed climate exactly, as the observed climate has an unpredictable chaotic component. The best you can do is to run the model lots of times to get an ensemble of predictions that are all consistent with the physics, but varied due to differences in the initial conditions of the simulation. Now if you had the perfect model and ran an infinite number of simulations, you would be able to find one of them that was essentially identical to the observed climate. That shows that an infinite ensemble using the perfect model can be reconciled with the observations, as the observations lie within the uncertainty of the model.

    This model will fail the +-2SE test because the 1/sqrt(N) factor drives the SE towards zero as the ensemble grows. The fact that a model that has been shown to be consistent fails the +-2SE test shows the +-2SE test is incorrect, not the model.

    The SE is inappropriate as it doesn’t actually measure the uncertainty of the ensemble prediction, but the SD does. Therefore if you ask if the observations can be reconciled with the models (in statistical terms is there an overlap in the uncertainties of the models and the data) then you need the SD.
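
    A toy simulation makes the perfect-model point concrete. Everything below is synthetic: the “observation” and every ensemble member are drawn from the same distribution, so the model physics is right by construction, yet the +-2SE criterion tightens without limit as members are added while the +-2SD criterion does not:

        import numpy as np

        rng = np.random.default_rng(3)
        sigma_climate = 0.05   # assumed non-zero unforced spread of 20-year trends, K/decade
        forced = 0.20          # the true forced trend, known exactly to the "perfect" model

        obs = rng.normal(forced, sigma_climate)        # one realisation of the real climate

        for n in (22, 100, 10000):
            ensemble = rng.normal(forced, sigma_climate, n)   # perfect-model runs
            mean, sd = ensemble.mean(), ensemble.std(ddof=1)
            se = sd / np.sqrt(n)
            print("n=%5d  |obs-mean|=%.4f  2*SE=%.4f  2*SD=%.4f  outside 2*SE: %s"
                  % (n, abs(obs - mean), 2 * se, 2 * sd, abs(obs - mean) > 2 * se))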

    Kenneth Fritsch #301: I have given a definition of what I take “sigma of the climate given the forcing” to mean (in my comment to #298). I can’t see how it can be measured; if you can suggest something, I am happy to discuss it with you. It is an uncertainty in inferring the effects of the forcing on the real climate without the unpredictable chaotic component.

    As I keep saying, the “sigma of the climate given the forcing” does not manifest itself in the models; it is a property of the observations, not the models.

    The perfect model (with unlimited resolution) can predict the actions of the forcing on the climate with complete accuracy, but it can’t explain the chaotic component which depends on the initial conditions. That means it can’t predict the observations exactly, because the observations are not just the forced component. That is why it can’t pass the 2SE test (even though no model can do better).

    gp #306: That would be a valid test, but it is only a test of the ensemble mean, not a test of the ensemble itself, which is an important distinction.

    I am not making any estimation of the size of the “sigma of the climate given the forcing” as I am not a climatologist; however, as a statistician I can see that it only needs to be non-zero for the perfect model to fail the +-2SE test.

    As it happens, the measurement uncertainty puts fairly large error bars on the observations already; just look at the disagreements between the observed trends computed from different sources.

    Your slicing method might give a useful ballpark estimate; however, as I said, it only needs to be non-zero for the +-2SE test to asymptotically fail a model it should pass (invalidating the test).

    Willis #307: Averaging (whether over time or over realisations) can only attenuate the effects of the chaotic component of the individual realisations (including reality), not eliminate it altogether. This means there will always be some residual uncertainty in estimating the forced climate. All that is needed for the perfect model counter-example to invalidate the 2SE test is that this uncertainty is non-zero, which arguably it is.

    I have seen arguments that show that ENSO can have a significant effect on ten-year trends (1998-current being the most frequently cited example); I can’t see a good argument why it should have no effect whatsoever on a 20-year trend.
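
    As a rough illustration of why that residual effect matters (entirely synthetic numbers, not a claim about the real ENSO amplitude), one can add an ENSO-like oscillation to a known trend and see how far the fitted trend is pulled at different record lengths:

        import numpy as np

        rng = np.random.default_rng(4)
        months = np.arange(360)                  # 30 years, monthly
        t = months / 120.0                       # time in decades
        true_trend = 0.15                        # assumed forced trend, K/decade

        # ENSO-like component: a ~4-year oscillation plus weather noise.
        enso = 0.2 * np.sin(2 * np.pi * t / 0.4 + rng.uniform(0, 2 * np.pi))
        series = true_trend * t + enso + rng.normal(0, 0.1, months.size)

        for n_years in (10, 20, 30):
            sl = slice(0, 12 * n_years)
            fitted = np.polyfit(t[sl], series[sl], 1)[0]
            print("%2d-year trend error: %+.3f K/decade" % (n_years, fitted - true_trend))

    The error shrinks as the record lengthens, but it does not go to zero, which is all the perfect-model argument needs.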

    I have noted that this question has been asked more than once, and I have tried to answer it more than once, even if it doesn’t appear that way!

    PS: for me, showing that the models are inconsistent falsifies them; I was trying to make a distinction about what the Douglass inconsistency test actually shows.

    I pointed out that “inconsistent” has a meaning as a term; it is the phrase “inconsistent with the models” that is the problem. It is quite possible that I have not explained this very well, but to a statistician, to say “the data are inconsistent with the models” has a particular severity, much more than “the models do not provide a good fit to the data”. Douglass et al. certainly demonstrate the latter, but not the former. Does that make it any clearer?

    Willis #310: While I might apparently appreciate the difference between a projection and a forecast, I am not sure that I understand the nuance in those terms as used by climatologists. I was using the quotation marks to hint that I was not entirely happy with the terms as I understood them, but was trying to give as helpful an answer as I could.

    I think I have made several statements about how I think the ensemble should be used (and hinted that an audit of IPCC recommended practice may be in order). The average of the runs does not capture the variability, but the SD error bars would to an extent. It could well be that the IPCC are not making full use of the models.

    Regarding the multi-model ensemble versus single model ensemble, the reason for ensembling over models is just the same as ensembling over runs: it averages out (or, if you are a Bayesian, captures) uncertainties. In the former, it is the systematic uncertainties; in the latter, the stochastic uncertainties. The same benefits accrue in both cases. On average the mean will be a better predictor, but not necessarily better than the best of the components.

    Sam Urbinto #299: You choose the test according to the question you want to ask rather than the answer you want to get. As I have pointed out, you can try and reconcile the models and data using SEM, but it is pointless as the best model that you could possibly make is almost guaranteed to fail it, which shows that the test can’t give a sensible answer to that particular question.

    SEM is the wrong statistic for this question. The perfect model example demonstrates this, as the test gives an obviously incorrect answer to the question. The reason why it gives the wrong answer is that it treats the SEM as the uncertainty of the ensemble, when it is only the uncertainty in estimating the true mean of some putative underlying distribution of models given this particular sample of 22 models.

  312. MrPete
    Posted May 7, 2008 at 5:06 AM | Permalink

    Jon,

    Catching up (a little), I fail to see how model errors can be assumed to “average out.” So far, AFAIK there’s been discussion of two potential error components:
    – “random” error (which would average out)
    – consistent “bias” error (which would not average out)

    But that’s just the tip of the iceberg, the smoke before the fire, so to speak.

    Because we’re dealing with computer models, not direct measurement of a physical attribute, quite a few other possibilities exist (and are likely in my experience):

    – non-modeling of significant physical systems [clouds, tropical t’storms, major non-human GHG source “discovery”, etc.]
    – ill-posed modeling [fluid flow, etc]
    – software bugs [fine-scale (calculations in a range outside the valid precision/accuracy of the functions/methods in use, consistent rounding error bias, array bounds misuse (e.g. invalid LUT (Look Up Table) code)), mid-scale (non-maintainable code resulting in deteriorating calculation quality over time), and large-scale (ill-posed software architecture leading to incorrect system “assembly”)]

    I could go on. My point here: NONE of the above issues can be assumed to “average out”. ALL of the above issues are known to exist in various ways. Even what is programmed into the GCM’s has not been through V&V/SQA (Verification & Validation / Software Quality Assurance).

    In case someone is wondering how an error can be anything other than random or consistent bias, please consider options such as chaotic and exponential.

    There are many blogs on these topics here at CA and elsewhere. cf Browning, Hughes, etc etc.

  313. MrPete
    Posted May 7, 2008 at 5:27 AM | Permalink

    beaker, thanks for your ongoing interaction!

    A couple of comments:

    1) Seems to me that if there’s a significant chaotic component to climate, we’re in trouble because natural variability can be huge. And AFAIK none of the GCM’s attempt to model a chaotic component.

    2) See my post above — GCM’s are not (yet) validated as approximations of reality, so there is no reason to presume that an infinity of runs would incorporate the correct future. Important systems are not modeled (by anyone), or badly modeled (by everyone), or there’s just plain bugs (somewhere–and without good software practices, nobody really knows where but experience says the bugs are there to be found).

    All this to suggest the CI of the GCM’s is quite large and not derivable from any attribute of model performance.

    For example, the ensemble results say a little about modelers’ assumptions. What do historic climate excursions really look like? We don’t actually know. A 30,000 foot view says: climate trends are said to require a few decades to show up; we only have thermometry and satellite data over one 30 year period and even that data is suspect; temp history over 150 years has proven to be quite suspect (for how many years has the global 1900-1950 temp record been a stable and non-controversial data base? Zero. Compare that to what we know of other observable phenomena.)

    Thus (big picture here) don’t we have little with which to confidently validate GCMs, statistically or otherwise? Aren’t there too many unknowns?

  314. Ron Cram
    Posted May 7, 2008 at 6:31 AM | Permalink

    beaker,
    re: 311

    You are doing it again. You are trying to change the discussion. I thought you had provided a clear statement of your position. I restated your position and requested a yes or no answer as to whether I had it correct. Instead, you try to change the subject.

    I will give you a short course on communication. You can use this your whole life to help limit misunderstandings. Let’s say I begin by asking you a question. There is a chance you misunderstand exactly what I’m asking, but you do your best to answer it. I read your answer but it may not be clear to me and there is a chance I misunderstand what you are saying so I restate your answer in my words and ask if I understand you correctly. You then respond either “Yes” or “No, I’m saying X.” But that is not what you are doing. You pretend I have asked a different question altogether. It is very frustrating.

    I did not ask about the ensemble. I am asking about trying to salvage more than three models. This has nothing whatever to do with climate sensitivity. It has nothing to do with the slope of the trend. It has to do with a basic, foundational concept essential to the theory of global warming. According to the theory, the troposphere has to warm more quickly than the surface and this will be most pronounced in the tropics. The models mostly bear this out and are in agreement with the theory. But do the observations match? Douglass says only three models match observations. I don’t care about tricky statistical efforts advanced by RealClimate to save the models as a group. RealClimate does not have a statistician among them. I want to look at each model individually to see if it is consistent with observations in the tropics just as Douglass did in his comments above.

    My question to you is “Are you willing to attempt to salvage more than the three models Douglass left standing?” It is a yes or no question.

  315. Geoff Sherrington
    Posted May 7, 2008 at 6:46 AM | Permalink

    There is an even better estimator than the Willis Eschenbach # 308 “single model ensemble forecast.” Maybe it is the “ultimate precision single model ensemble forecast” as shown by modeller 15 in the paper of Douglass et al, thus:

    Model 15 in the first column, the average of the 22 models next, then the SD in the last column:

        Model 15   22-model avg    SD
          163          156          64   (SURFACE)
          213          198         443
          174          166          72
          181          177          70
          199          191          82
          204          203          96
          226          227         109
          271          272         131
          307          314         148
          299          320         149
          255          307         154
          166          268         160
           53           78         124   (100 hPa)

    I know I have posted these figures before, but here is a question: Can the mathematical model be TOO good?

    I think that model 15 comes very close to the mean of all the other models, but I do not think one can draw significant inferences about whether it represents Nature.

    It surprises me that pro-model advocates have not trumpeted this achievement as a vindication of claimed modelling prowess. Agreement to a thousandth of a degree per decade at 3 different altitudes? Who needs multiple runs to reduce error?

  316. Ralph Becket
    Posted May 7, 2008 at 7:00 AM | Permalink

    Could someone correct the errors in my reasoning here?

    As I understand it, we have a bunch of models and we have some actual temperature measurements.

    If we are computing the mean of the models then we are assuming the models are all approximations to some ideal perfect model. Therefore, the standard error of the mean for the models is a guide to how far off the mean of the models is from the hypothetical ideal model. With a large enough number of models, the SE will tend towards zero and the mean of the models will tend towards the ideal.

    Regarding the temperature measurements, we have an error term indicating how far off the mean computed from the measurements is from the real mean, whatever that is. This error is (at least) the standard deviation for the measurements.

    Now, surely our confidence in the mean of the models must be related to the overlap of the mean of the models plus/minus (some multiple of) the SE for the mean of the models compared to the mean of the temperature measurements plus/minus (some multiple of) the error for the measurements.

    If this is a fair assessment of the situation, can someone explain to me whether this is what Douglass’ comparison does? If this is not a fair assessment of the situation, can someone explain to me where I’ve misunderstood things?
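
    To make the overlap check described above concrete, a toy version with invented numbers (not the actual Douglass values):

        def interval(center, half_width):
            return center - half_width, center + half_width

        # Hypothetical numbers: model-mean trend with 2*SEM, observed trend with 2*SD.
        model_mean, model_sem = 0.27, 0.02
        obs_mean, obs_sd = 0.13, 0.06

        m_lo, m_hi = interval(model_mean, 2 * model_sem)
        o_lo, o_hi = interval(obs_mean, 2 * obs_sd)

        print("model interval: [%.2f, %.2f]" % (m_lo, m_hi))
        print("obs interval:   [%.2f, %.2f]" % (o_lo, o_hi))
        print("intervals overlap:", m_lo <= o_hi and o_lo <= m_hi)

    Whether that is what Douglass’ comparison amounts to is exactly the question being argued in this thread.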

    Many thanks!
    — Ralph

  317. Michael Smith
    Posted May 7, 2008 at 7:30 AM | Permalink

    Willis, 314 was a wonderful summary. Thanks!

  318. beaker
    Posted May 7, 2008 at 7:48 AM | Permalink

    Good afternoon all!

    Mr Pete #313: As far as I can see, there is a chaotic element to the climate, but as others have rightly pointed out, averaging in time (e.g. to find a long term trend) or over replications (e.g. runs in an ensemble) will suppress this and leave a signal representing the effects of the forcing on the climate. However, it won’t get rid of it entirely: the longer the trend the better, and the more models in the ensemble the better. The key point is that the SE test will fail the perfect model unless the chaotic element is reduced exactly and precisely to zero, which is unrealistic.

    I understand many groups are looking at validation (I think this is one of Ben Santer’s particular interests). Your attitude to the models seems rather similar to mine (if a bit more extreme). The models only tell you what is plausible given that we agree on the assumptions about climate physics on which they are based. Each new generation of models will correct deficiencies pointed out by papers like Douglass et al. and hopefully will give more accurate predictions, but let’s not treat the models as if they were anything more than that.

    It is better to look at the spread of the models as a Bayesian credible interval rather than a frequentist confidence interval; they are not the same, but the credible interval is normally what we would want to know in practice. It basically shows the range of results that are plausible assuming the model is correct. In the case of a multi-model ensemble, it tells you what would be plausible assuming that the true physics lies somewhere in the span of the models. The interval is wide simply because the modellers are uncertain of the details of the physics. The fact that they are willing to talk about them (e.g. the RealClimate +-2SD diagram) shows that they know and are open about the uncertainties (which is good scientific practice).

    As per the discussion about Popper, observational data can never validate a model; the most it can give is corroboration. The only thing observations can tell you for sure is when they falsify the model, and even then it is only in a probabilistic sense. It is better to ask how well the hindcasts fit the data (including all of the uncertainties) and try to diagnose the problems if the fit is poor.

    Willis #314: I am unable to answer the “big picture” question as I am pretty much an amateur on the climatology; I can only comment on the statistical issues (and enjoy learning about the climatology).

    If the real issue is that the theory disagrees with the observations, why bring the models into it (assuming that the observations accurately reflect reality, which may not be the case here)? If there was a clear case based on a comparison of theory and observations, why not make it?

    Douglass et al MADE it a question of statistics, not me or RealClimate, by trying to make a point using a statistical test on the output of a model ensemble. If they are going to use such a test to support a particular position, they have to get it right; that is the point of an audit. The answer they give may be right (I am not in a position to tell), but the method they used is wrong (I am in a position to say that).

    Ron Cram #315: I am not going to give a yes or no answer to your question because it is a bit like the question “have you stopped beating your wife yet?”. If I gave either answer it would be a misleading representation of my position.

    If I answer “no”, it suggests that I think there is no value in the models that have failed the Douglass test, when I have already explained there is (even if they give bad predictions they tell you something useful about the assumptions, and even though they can’t predict this trend, that doesn’t mean they have no predictive value for other things).

    If I answer “yes”, it suggests that I agree with the idea of splitting up the ensemble (which I don’t and have explained why in great detail, repeatedly), just that Douglass has got the dividing line in the wrong place.

    BTW, for the record, I have never beaten my wife (except at Scrabble and croquet).

    Geoff Sherrington #316: Being near to the mean doesn’t imply that the model is any more correct than the others (it probably has assumptions that are a bit more bland than the models near the fringes; think of it as a Glenmorangie model that almost everyone finds agreeable rather than a Laphroaig model that most would leave on one side after a single sip, but which some love with a passion).

    The pro-model advocates probably haven’t trumpeted this achievement because they know it to be meaningless. For a start, if you had not made the other models, how would you know #15 was near the mean?

  319. RomanM
    Posted May 7, 2008 at 8:25 AM | Permalink

    #311 beaker

    You have consistently repeated a number of statements which some of us have trouble accepting. In my view, part of the reason is that there are implicit assumptions about the situation we are discussing that are needed to justify those statements. If you have studied statistics, then you understand the need in an analysis to specify what assumptions you are making before you follow a particular statistical path for that analysis. Whether or not the analysis is reasonable depends on how realistic those assumptions are. I have selected several quotes (not necessarily in the order made, with bold added to emphasize the parts that need assumptions for justification) from your recent post which I feel could use some explanation of the assumptions you made in stating them.

    Regarding the multi-model ensemble versus single model ensemble, the reason for ensembling over models is just the same as ensembling over runs: it averages out (or, if you are a Bayesian, captures) uncertainties. In the former, it is the systematic uncertainties; in the latter, the stochastic uncertainties. The same benefits accrue in both cases. On average the mean will be a better predictor, but not necessarily better than the best of the components….

    All of the ensemble is useful in telling you the consequences of the modellers assumptions. That is the real purpose of the ensemble. That the ensemble only just manages to explain the observations casts doubt on those assumptions, but doesn’t falsify them outright. As I have said before, I am not keen to delete any model from the ensemble as this invalidates the most useful interpretation of the ensemble projections….

    Basically, I would not recommend deleting models as this makes the ensemble appear to give an unduly confident impression of the consequences of the modellers assumptions, which I think is in itself dangerous. It seems to me to be a form of cherry picking. These are not statistical models and should not be treated as if they are. …

    What assumptions are you making about the ensemble of models? Are they a random sample of models from a population with some underlying distribution or are they just a fixed finite set of models? Why is the average a meaningful quantity? What stochastic uncertainties are averaged out and why does this produce a better predictor (presumably of the climate or any of its parameters)? What in the underlying structure relates the ensemble in a numeric way to the climate and/or its parameters? Why is the SD of the ensemble meaningful? What does it estimate (if anything)? What specifically makes +/- 2 SD meaningful? Why does deleting models which are obviously inept invalidate the most useful interpretation of the ensemble projections (and just exactly what is that interpretation)? If these models do try to predict or project a “chaotic” climate, why are they not “statistical” (or is there an EKT definition from the statistics community involved here as well)?

    A general observation on the Douglass paper: What they are evaluating is the ability of the models to estimate a set of climate parameters, the trends of the temperatures. No one is asking that the models “project” or “predict” the daily or monthly temperatures or duplicate the climate through an infinite set of runs in a monkeys-and-typewriters-writing-Shakespeare sort of way. If they can’t, then there is a problem. Do you think that particular aim was achieved?

  320. Ron Cram
    Posted May 7, 2008 at 9:23 AM | Permalink

    beaker,
    re: 320

    My question is nothing at all like asking “Have you stopped beating your wife?” I am not assuming you have begun any action at all.

    If you answer “no” it simply means you are not willing to undertake such an endeavor yourself. You could easily have replied “No, but I know someone who will and he has agreed to post his analysis here.” Great! That would have been a contribution to the dialog. You could have replied “No, but that does not mean it cannot be done.” This would not have added much except your opinion, but that would have been welcome.

    Or, you could have replied “Yes, I am willing. I cannot save all of the models but if you look at it this way you can see that eight of the models are still viable.” That would have been a contribution to the discussion.

    Your claim that you do not believe the ensemble should be split up is outrageous. What factors would qualify any model for so lofty a standing that it is not subject to its own validation and verification? Your claim that you have explained this view in great detail and done so repeatedly is equally outrageous.

  321. Patrick M.
    Posted May 7, 2008 at 9:42 AM | Permalink

    re 311 (beaker):

    You state:

    As I see it, the most useful thing the models tell us is the consequences of the modellers assumptions, and more importantly, the spread of the ensemble gives an indication of the uncertainty both within the modelling community (the systematic uncertainty) and the uncertainty that arises through the simulations themselves (the stochastic uncertainty). As a result, I am happy to see money spent performing as many simulations as possible on as many models, to get as good a representation of the uncertainties involved as possible.

    Regarding the bold text: and then what? Would you at some point consider thinning the field of GCM’s based on performance?

  322. Ron Cram
    Posted May 7, 2008 at 10:16 AM | Permalink

    beaker,
    re:320

    Regarding your claim you do not favor splitting up the ensemble, I see now where you have stated your view that you are not in favor of deleting models from the ensemble. This must be what you were referring to. I have never seen you argue that individual models should not be subject to their own verification and validation. Such a view would be far from the mainstream. It should be clear that no model can be considered part of the ensemble unless it passes certain tests. As of right now, only three models pass the test regarding tropical troposphere. That is the ensemble you have to work with.

  323. beaker
    Posted May 7, 2008 at 10:46 AM | Permalink

    Good evening all (one last post for the day)!

    Roman M #321: That is a lot of questions; the reason why an ensemble is a good thing is a bit involved and maybe requires a new thread. I’ll give it some thought and get back to you, but in the meantime…

    As to your general observation, yes indeed, the general aim was achieved; I only have problems with the (statistical) methodology and the lack of a distinction between “the models” and the “mean of the ensemble” in many places in the paper.

    Ron Cram #323: I have pointed out that I don’t think the ensemble should be whittled down at all; that means that both “yes” and “no” are incorrect answers. Your question is a logical fallacy as it assumes that the ensemble ought to be split up, which is not a premise with which we both agree. See http://en.wikipedia.org/wiki/Mu_(negative) for further details.

    As to whether that view is “outrageous”, perhaps you should consider why I might think that the very wide 2SD error bars of the models highlight their deficiencies more clearly than the test in Douglass et al. If you understood that point, you would understand why you wouldn’t want to do anything to make them narrower.

    Patrick M #324: The only way that the ensemble should be thinned is as a result of a change of opinion on the part of the modellers. Say a modeller reads Douglass et al and thinks “he has a point there (although the stats are wrong ;o)”, and then goes back to the drawing board to work out why the models don’t do what the theory and observations suggest they should. He then changes his opinions on the physics of the climate and this is reflected in the next (hopefully more accurate) generation of models. It seems to me that the safest way to treat the models is as a statement of what is plausible given the modellers’ best understanding of the physics (which will only be accurate if that understanding is good). The huge error bars basically mean that we have to consider a wide range of possible outcomes, and that appropriately guards against “jumping to conclusions”.

    N.B. My (rather cautious) view on how the models should be viewed is separate from the discussion of whether the 2SE test is valid; neither is contingent on the other.

  324. Patrick M.
    Posted May 7, 2008 at 10:55 AM | Permalink

    Here’s a thought about the number of GCM’s. If there were a GCM whose runs did accurately predict conditions in the future, (in retrospect of course), wouldn’t the field thin itself? So is the number of active GCM’s itself a measurement of the “success” of the GCM’s, (i.e. the more models, the less predictive power)?

    Haven’t weather forecasters narrowed down their models to something like 3 models?

  325. Patrick M.
    Posted May 7, 2008 at 10:59 AM | Permalink

    re 326 (beaker):

    Looks like my post 327 might be along the lines of your answer in 326.

  326. Kenneth Fritsch
    Posted May 7, 2008 at 11:02 AM | Permalink

    Beaker, thanks for your efforts to spell out your POV on the issue of modeling, the chaotic unknown added to the observations, and why you feel that SE is an inappropriate statistical measure of the differences used in the Douglass paper. For me your POV is crystal clear and that is not the impediment to a further discussion, but here is what is:

    1. You are unable to give references for your POV vis a vis the chaotic unknowable component (sigma, or should that be stigma, of the climate) in the observed data which is the essential ingredient in your arguments.

    2. You state that you are not qualified to estimate the magnitude of the sigma.

    3. You have not replied to my proposition that looking at the ratios of trends, as in the Douglass case, mitigates the objection to the chaotic component in the observed climate.

    I think what has been lacking in your replies, and frustrating for those making the queries, is any reference/link to papers that might add insight into, and support for, your POV as applied to the climate and climate modelling. Your case is rather a simple one from a theoretical POV and I have no problem with it in general terms. What it lacks is any application to the details of a practical matter such as that from the Douglass paper.

    I have begun a search of my own attempting to find links to the details of your POV as used in the statistics of climate modeling and comparisons with an observed phenomenon, as in the Douglass case.

  327. Sam Urbinto
    Posted May 7, 2008 at 12:37 PM | Permalink

    beaker #311 “You choose the test according to the question you want to ask rather than the answer you want to get.”

    And the question I want answered is “If I assume these 22 are perfect models, can I use SEM to reconcile them to actual observations?”

    And as Douglass et al and you and others have shown, the answer is no. Whether the models do or don’t agree with each other, or whether the SD shows they’re uncertain or not, is beside the point.

    Inappropriate, meaningless, uncertain and conflicting.
    (SEM, SD, SEM/SD, by model)

    “Can I use SD to show the models have very wide 2SD error bars?” and “Will SD show the models are deficient?” are different questions.

    But I do agree (#326) that the models shouldn’t be broken up; take them all and see what “the models” tell us.
    Which is: you can’t compare them with SEM and get reconciliation, and SD shows them deficient.

    Why you have such an issue with the paper not trying to answer the question leading to the second one, I have no idea.

    MrPete #313 “Thus (big picture here) don’t we have little with which to confidently validate GCMs, statistically or otherwise? Aren’t there too many unknowns?”

    Yes. I would say so.

    Willis #314 “But the observations say it’s not warming aloft, it’s cooling compared with the surface. Not only that, but the higher you go, the greater the cooling.”

    Then the models showing warming are wrong. No statistics needed. As far as why, well, how much air are we moving, and how saturated is it, and what temperature is it, and what is the pressure at the location? 🙂

    http://goldbook.iupac.org/A00144.html http://raob.com/pix/EmagramB.JPG

    Willis #314 “The point is that observations of negative and decreasing trends aloft is a direct refutation of the theoretical basis of the claim that the warming is caused by GHGs.”

    “Extra warming” (aka rise in the anomaly trend) 😀

    Ralph #317 “Could someone correct the errors in my reasoning here?”

    I’d say with model-comparison papers and tests and the like, what gets done depends on the assumptions made, even if other people disagree with those. Or not. I think you basically have it correct as to what was done and what was assumed/decided upon/whatever.

    Ken #329 “..that looking at the ratios of trends, as in the Douglass case, mitigates the objection to the chaotic component in the observed climate.”

    I would think it would. I’d be interested in a short answer as to why that’s an unreasonable way to go about things.

    But BTW, it’s bender who’s a POV! 😀

  328. Posted May 7, 2008 at 12:56 PM | Permalink

    Jon,

    The former is an attempt to show the long-term warming trend by smoothing internal variability; the latter tries to anticipate it.

    Ok, so in the former (ensemble projection) the goal is to find a location parameter of the distribution (of future T), and in the latter (forecast) the whole distribution (both given the forcings)? Forecast without future forcings is then a different story, and has significantly larger variance. Trying to link this discussion to this picture is IMO a bit problematic (what do we really know about internal variability?).

  329. Sam Urbinto
    Posted May 7, 2008 at 12:59 PM | Permalink

    There’s an interesting tidbit here in this paper Phil linked elsewhere.

    Arctic sea ice in IPCC climate scenarios in view of the 2007 record low sea ice event
    A comment by Ralf Döscher, Michael Karcher and Frank Kauker

    It is necessary to be aware of an important difference between the observed data and climate model results as they typically are published. Mostly, as is the case in Fig. 1, the presented climate model results are ensemble means. In this context an ensemble is a number of model experiments which differ from each other, e.g. by changed initial (starting) conditions and in the case of AR4 by different model formulations. In coupled climate models small initial differences may grow large as a consequence of the non-linearity and chaotic behavior of the climate system (the famous butterfly effect) and due to different model sensitivities. The use of ensembles instead of just one experiment provides a range of possible realizations of the climate system under given external forcing. It is evident that the ensemble mean provides a more robust picture on the system’s long term response to the external forcing than any single ensemble member. On the other hand the ensemble mean has a reduced variability and less extreme events than any single member. For a test whether the IPCC models are ‘too conservative’ in comparison to observations, we should compare the observed ‘realization’ with single realizations of the ensemble models.

  330. Sam Urbinto
    Posted May 7, 2008 at 1:19 PM | Permalink

    What about population, urbanization and industrialization as model inputs? As far as I know those aren’t accounted for. Anyone?

    UC: The title of that slide is “Natural factors cannot explain recent warming”. But what they mean is something along the lines of “Our models cannot match what the anomaly is doing unless we add something. Since our opinion is that the so-called greenhouse gases produced by human activity cause the difference, they are that something added.”

    Or at least it seems that’s the implied answer to what they mean….

    Correct me if I’m wrong, but isn’t it actually that they produce the model with that already in mind and tweak it until it matches the anomaly? And then take out the GHG and get that one? In which case it’s “Our model matches, but when we take out GHG it doesn’t anymore.”

  331. steven mosher
    Posted May 7, 2008 at 1:38 PM | Permalink

    re 331. ya UC I raised that issue before but nobody seems to get it.

    One set of rules for attribution studies and another set for rejecting models.

  332. Posted May 7, 2008 at 1:47 PM | Permalink

    re Harshad number

    In which case it’s “Our model matches, but when we take out GHG it doesn’t anymore.”

    And in this post the case is “Our model matches when we take out GHGs”? How should we interpret the green area in that slide?

  333. steven mosher
    Posted May 7, 2008 at 2:03 PM | Permalink

    RE 336.

    I found the GISS hindcasts for 1880 to 2003 (actually Gavin pointed me to them).
    You can pull out various forcings and see the response. Today I thought it might
    be fun to compare the four 30-year periods: 1880-1910, 1911-1941, 1942-1972, and
    1972 to 2003, and look at the metric of GMST trend. Subjectively, ModelE missed
    on the first two climate periods, not even close, nailed the flat trend from 1942
    to 1972, and nailed the rise from 72 to 2002. So, 2 outta 4 ain’t bad.

  334. Kenneth Fritsch
    Posted May 7, 2008 at 2:10 PM | Permalink

    Re: #332

    Sam, the reference you linked in your post presents the general POV that Beaker has been reiterating at CA about comparing climate model outputs versus observed data. It revolves around the chaotic component of climate and its dependence on initial conditions. Unfortunately that does not answer my questions about the chaotic content in ratios of trends vis a vis the Douglass paper or, alternatively, how one would go about estimating the magnitude of its variation.

    In order for one to infer anything about the models’ output in comparison to the observed, one has to have some idea of the size of the chaotic effect, or any attempt at comparing model to observed will be futile.

    In a quick read of the linked article, I get the impression that, while the central tendency of the model outputs (with evidently different initial conditions) is nice to know and perhaps averages out some of the chaotic content due to initial conditions, the best comparison is to compare the individual model outputs and their ranges with the observed.

    Now at this point, if one assumes that the variation of the individual models in the ensemble is due to the chaotic effect and that the center of the distribution is not biased by the model outputs, one could make a comparison, in my view, of the observation with the individual models using SD, but not without taking into consideration the chaotic content of the observed. But all these assumptions amount to producing an estimate of the variability of the chaotic content of the models and the observed. Given that that content can be estimated, one could use averages in conjunction with SEM-like tests for comparing a model mean to the observed.

    Without these assumptions and limiting the magnitude of the chaotic content in a single realization I cannot see how any comparison of model to observed has any meaning in a strict statistical sense (the Beaker revealed POV).

    As a practical matter I could certainly see climate modelers having a stake in avoiding an SEM test of an ensemble of models against an observed value, when an SD and/or subjective view of where the observed fits within the range of models will make it more difficult to show statistically that there is a difference between the observed and the model output. The same tendency of the modelers could also explain a hesitancy to reject any models regardless of their apparent outlier status, and that some models might be considered gratuitously included to expand the range.

  335. Sam Urbinto
    Posted May 7, 2008 at 2:11 PM | Permalink

    mosh #334

    But the SRES is a guessed, scenario-driven “storyline”, and even then, if they’re for emission inputs, that’s just more GHG hooey factors, right?

    What about the heat and lower albedo involved with those two and “urbanization” (and everything that comes from it: animals, farms, cities, freeways, suburbs, etc)?

    A case of “It must be GHG because we know they absorb IR,” it seems to me. That it’s a fairly well known static thing, therefore “it must be it” in the system. Not that they can’t explain it any other way, just that it’s too difficult to deal with?

    But, hey.

  336. Ron Cram
    Posted May 7, 2008 at 2:28 PM | Permalink

    beaker,
    re: 326

    You can say that you do not want to delete any models from the ensemble if you want to, but it is not science. Science requires that models undergo verification and validation. I gave you an opportunity to attempt to save more than three models. The fact you chose not to try says something.

    The next step in moving from models to scientific forecasts is to integrate the principles of scientific forecasting. So far, climate science has acted as if these principles do not apply to the physical world but they do apply. One of the first principles is to show that a scientific forecast is even possible. The current state of the models does not allow this. See http://www.forecastingprinciples.com

  337. Kenneth Fritsch
    Posted May 7, 2008 at 2:46 PM | Permalink

    Re: #336

    Steven Mosher, could you share that link with us, or do I have to go back to the web site where I noted that Gavin graciously gave it to you after a reminder? 2 out of 4 heads on a coin flip is pretty average, so I have to see these 4 periods’ fits for myself. Trust but verify.

  338. beaker
    Posted May 7, 2008 at 11:56 PM | Permalink

    Hi all,

    Putting to one side for the moment my views on how the ensemble should be interpreted, and the asymptotics, as those seem to have us getting bogged down in the details, let me restate the key point: the 2SE test is arbitrarily hard to pass.

    (i) The SE test is a test of whether there is a statistically detectable difference between the observations and the ensemble mean.

    (ii) It would be unreasonable to say that a model could not be reconciled with the observations if the difference between the ensemble means and the observations was less than 1e-6 K/decade.

    (iii) An ensemble with finite variance will fail the SE test if it is sufficiently large, even if the difference between the ensemble mean and observations is less than 1e-6 K/decade.

    (iv) Therefore the SE test doesn’t tell us whether there is a meaningful difference between the observations and the ensemble mean.
    (v) Therefore the SE test doesn’t show if the ensemble mean can be reconciled to the data in a meaningful way.

    (vi) The use of an ensemble reduces the stochastic uncertainty of the models.

    (vii) The larger the ensemble, the more the stochastic uncertainty is averaged out.

    (viii) The more the ensemble averages out the stochastic uncertainty, the more likely it is to be rejected by the SE test.

    Let me know which of those propositions you disagree with and we can go on from there.
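
    [A minimal R sketch of propositions (iii) and (viii), using an assumed between-model spread of 0.1 K/decade rather than the actual Douglass Table II figures: the 2SE band shrinks like 1/sqrt(n), so a fixed offset of 1e-6 K/decade is eventually flagged, while a 2SD band would stay at about 0.2 K/decade.]

    # Illustration only: an ensemble whose mean differs from the observation by a
    # negligible 1e-6 K/decade, with an assumed spread of 0.1 K/decade.
    bias  <- 1e-6      # meaninglessly small offset of the ensemble mean
    sigma <- 0.1       # assumed between-model SD (K/decade)
    for (n in c(22, 1e4, 1e8, 1e11)) {
      se <- sigma / sqrt(n)   # analytic SE rather than simulating draws
      cat(sprintf("n = %-6g  2*SE = %.1e  tiny offset rejected: %s\n",
                  n, 2 * se, bias > 2 * se))
    }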

  339. Geoff Sherrington
    Posted May 8, 2008 at 3:30 AM | Permalink

    Re 333 Steven Mosher

    Do you infer that by nailing 2 of the 4 periods in the hindcast, the modellers knew which physics/chemistry caused upturns and downturns in a quantitative manner? If this is so, then it is the first admission of this that I can recall.

  340. Michael Smith
    Posted May 8, 2008 at 5:32 AM | Permalink

    The argument from (iii) to (iv) and (v) is a non-sequitur. It does not follow that merely because SE may cause an inappropriate rejection of an ensemble in a purely hypothetical case that it cannot be used with any ensemble in any case. The fact that you can create such a hypothetical case points out a limitation in the use of SE — but it does not prove that SE is across-the-board unusable for evaluating whether or not the ensemble mean can be reconciled with the observations.

    In Douglass, the ensemble did not fail because (n) was large. Even if the number of models is cut in half, which I did to create the graph in comment 184, the ensemble still fails the SE test. That proves that in the case at hand, (n) is not the issue.

    The ensemble fails the SE test because above the surface, the overwhelming majority of the models show rising trends to an altitude of 250hPa while the observations show falling trends over the same range.

    Comparing the observations with the SE shows us that ALL 34 observations above the surface fall outside the SE on the same side. Surely that indicates that the probability that the mean of the ensemble is consistent with the observations is very low.

    Whether or not, according to the conventions of statistics, that only allows one to claim bias in the models and not inconsistency, seems irrelevant to me. The notion that any amount of overlap in the +-2SD intervals of observations and the ensemble proves that “the observations and the models have been reconciled” can only mean that the statistical definition of “reconciled” is inconsistent with the common-sense definition.

  341. steven mosher
    Posted May 8, 2008 at 6:13 AM | Permalink

    re 337

    http://data.giss.nasa.gov/modelE/transient/climsim.html

    table 1, select all forcings, select lat-time as the response

    next page, select a running mean of 1 month and output options that give
    you a table; you’ll get monthly data from 1880 on. I think I started with 1881.
    I then got HadCRU data; make sure you adjust to the right baseline (HadCRU is
    61-90).

    Then I just did simple linear trends on the first 30 years, the next 30, the next 30, and the last 30.
    Please double check; I did it quick and dirty.
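
    [A minimal R sketch of that last step, assuming the monthly data have already been pulled into a data frame df with columns year (decimal year) and anom (anomaly on a common baseline); the column names are hypothetical, not ModelE’s.]

    # Hypothetical: df holds either the ModelE or the HadCRU series, already merged
    # onto a common baseline, as columns `year` and `anom`.
    periods <- list(c(1881, 1910), c(1911, 1941), c(1942, 1972), c(1973, 2003))
    trends <- sapply(periods, function(p) {
      sub <- subset(df, year >= p[1] & year <= p[2])
      coef(lm(anom ~ year, data = sub))[["year"]] * 10   # K/decade
    })
    names(trends) <- sapply(periods, paste, collapse = "-")
    round(trends, 3)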

  342. steven mosher
    Posted May 8, 2008 at 6:17 AM | Permalink

    re 339. I infer nothing; first have Kenneth check my work in case I botched something.
    There were periods when the hindcast was off, missing bumps and such. I haven’t spent much
    time looking at it; you can select different forcings and see the response.

  343. steven mosher
    Posted May 8, 2008 at 6:23 AM | Permalink

    re 335. Not sure I follow, Sam. The SRES are storylines about future emissions. The
    hindcast from 1880-2003 uses “historical” forcings.

  344. Tom Gray
    Posted May 8, 2008 at 6:27 AM | Permalink

    re 339

    Do you infer that by nailing 2 of the 4 periods in the hindcast, the modellers knew which physics/chemistry caused upturns and downturns in a quantitative manner? If this is so, then it is the first admission of this that I can recall.

    And why did this same physics and chemistry not nail the other 2 of the 4 periods?

  345. Ron Cram
    Posted May 8, 2008 at 6:27 AM | Permalink

    beaker,

    Regarding (iii), (iv) and (v), I agree with Michael in #340. The issue is not whether the difference in the observations is less or more than 1e-6 K/decade. It does not have to do with any individual measurement. The issue is whether the troposphere is warming faster than the surface or not. This is a meaningful difference and is a sufficient basis to reject models that fail the test.

    I have to reject (vi) as well. While the statement by itself is true, it does not apply in the present case. Someone may well argue that weather is stochastic and monthly averaged temp differences between troposphere and surface may not be uniform, but the process is certainly not stochastic over longer time periods. It is deterministic. The idea the troposphere will warm faster than the surface especially in the tropics is central to the current theory of AGW. We have more than 20 years of data to compare to the models. It is obvious the vast majority of the models are consistent with the theory but not with observations.

  346. Cliff Huston
    Posted May 8, 2008 at 6:36 AM | Permalink

    Beaker,

    If you are going to continue to tout your ‘SE fails a perfect model ensemble’ argument, you will have to specify what a perfect model ensemble is.

    If all the models in an ensemble get the physical theory exactly right, the 2SD of the ensemble is going to be very close to zero (because they all get the same answers). At that point, nobody will care that 2SE is also very close to zero. But the model ensemble can still fail, if the theory does not match the real world.

    If the definition of the perfect model ensemble is that it exactly matches the real world, by definition it will pass the test whether you use 2SE or 2SD.

    Beyond all the hand waving, you need to come to grips with the IPCC ‘average of the model ensemble’ being the best model of the real world. This is a given in Douglass et al and you can only fault Douglass et al by proving that the IPCC intent was otherwise.

    Cliff

  347. steven mosher
    Posted May 8, 2008 at 6:56 AM | Permalink

    re 344, also note that I did a fairly easy test to hit: a 30-year trend. Some features
    like the early 20th century warming were not picked up by the model. This could be due
    to errors in the historical forcings.

  348. Tom Gray
    Posted May 8, 2008 at 7:34 AM | Permalink

    re 344, also note that I did a fairly easy test to hit: a 30-year trend. Some features
    like the early 20th century warming were not picked up by the model. This could be due
    to errors in the historical forcings.

    If the misses for early dates are the result of errors in early-period forcings, then the claims for attribution studies would seem to be very questionable.

    A real question for Gavin Schmidt would be how these inaccuracies in model outputs are reconciled with confident assertions about attributions.

  349. RomanM
    Posted May 8, 2008 at 7:53 AM | Permalink

    #338 beaker
    Let’s examine your statements one by one:

    (i) The SE test is a test of whether there is a statistically detectable difference between the observations and the ensemble mean.

    No, the test is whether there is a detectable difference between one or more parameters of the population of temperatures and the ensemble mean. More specifically, the test decides whether a suitably selected sample from a population of models whose mean is actually equal to the values of the parameters could reasonably generate the calculated ensemble statistics or whether the difference is extreme enough that this would be unlikely to occur.

    (ii) It would be unreasonable to say that a model could not be reconciled with the observations if the difference between the ensemble means and the observations was less than 1e-6 K/decade.

    True or not, irrelevant to the issues here.

    (iii) An ensemble with finite variance will fail the SE test if it is sufficiently large, even if the difference between the ensemble mean and observations is less than 1e-6 K/decade.

    Nonsense. This same criticism of statistical testing could apply to every statistical test. All you are saying is that, given an appropriate set of assumptions, if you have enough information you can detect a difference as small as desired. It doesn’t mean that for large samples only small differences can be seen. Of course, according to the data in the paper, we are talking about an ensemble of about 1,600,000,000 models to detect your chosen difference.

    (iv) Therefore the SE test doesn’t tell us whether there is a meaningful difference between the observations and the ensemble mean.

    I repeat, it gives us insight into whether the observed differences could be accounted for by randomness or whether there is a real systematic difference. “Meaningful” requires an external criterion to be applied to determine what difference you are willing to accept. Confidence intervals and other methods can be used to decide the size of the difference.

    (v) Therefore the SE test doesn’t show if the ensemble mean can be reconciled to the data in a meaningful way.

    I guess that means that every test involving population means (not just in climate science) fails in exactly the same way: if you take too large a sample and decide that there is a difference, then that difference is meaningless.

    (vi) The use of an ensemble reduces the stochastic uncertainty of the models.
    (vii) The larger the ensemble, the more the stochastic uncertainty is averaged out.
    (viii) The more the ensemble averages out the stochastic uncertainty, the more likely it is to be rejected by the SE test.

    I am still waiting for the assumptions that you are making about the ensemble so that these last statements have some validity.
    Basically, this whole issue appears to me as a specious red herring.

    IMHO, if you want to test individual models, you do that by comparing the characteristics and output of each model separately to the observed data. The ensemble can at best (under the right conditions) tell you how the various models differ among themselves, but offers little or no information about the individual models. Assuming all models are equally uncertain is pretty much baseless – the lack of existing individual information to make determinations about model validity (because none is provided by the modelers) is no excuse. If a given model is deterministic, i.e., every run from a fixed set of initial conditions gives the same result, then at best it seems that you can look at the model’s behavior as the conditions change. That is the appropriate “ensemble” within which to evaluate that type of model. However, questions of applying inferential statistical methodology to the ensemble do arise. For models with stochastic components, the appropriate “ensemble” is multiple runs of only that model under each condition. Then, it makes sense to try to do proper statistical evaluation.
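
    [For what it’s worth, the “1,600,000,000 models” figure is just the usual sample-size arithmetic, n roughly equal to (2*sigma/delta)^2. A sketch in R, with an assumed between-model SD of 0.02 K/decade chosen only to reproduce that order of magnitude:]

    # Ensemble size needed before a 2*SE test can resolve a difference delta,
    # given a between-model SD sigma (sigma here is an assumed value).
    n_needed <- function(sigma, delta) ceiling((2 * sigma / delta)^2)
    n_needed(sigma = 0.02, delta = 1e-6)   # ~1.6e9: the order of magnitude quoted
    n_needed(sigma = 0.02, delta = 0.01)   # 16 models to resolve 0.01 K/decade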

  350. steven mosher
    Posted May 8, 2008 at 7:56 AM | Permalink

    re 348, if I get time I’ll post some charts. I’m not defending anything or inferring
    anything. People can make of it what they will.

  351. Posted May 8, 2008 at 8:03 AM | Permalink

    The larger the ensemble, the more the stochastic uncertainty is averaged out.

    More studies, more accurate results. http://www.dilbert.com/2008-05-08/

  352. RomanM
    Posted May 8, 2008 at 8:10 AM | Permalink

    I thought yesterday’s strip was also relevant (and funny):

    http://www.dilbert.com/2008-05-07/

  353. Kenneth Fritsch
    Posted May 8, 2008 at 8:12 AM | Permalink

    Beaker @ #338:

    Putting to one side for the moment my views on how the ensemble should be interpreted, and the asymptotics, as those seem to have us getting bogged down in the details, let me restate the key point: the 2SE test is arbitrarily hard to pass.

    The devil is in the details and that is where the real and practical world lies. You have made your general POV on the matter of using SE and the chaotic nature of the observed and models being sensitive to the initial conditions very clear.

    However, you have not replied to the proposition that in the Douglass case, where ratios of trends are being used, the chaotic content in the observed and models does not apply, i.e. it is essentially differenced out.

    Your concentration on comparing means with large sample sizes and the means not being exactly equal makes it sound as though statistical tests taking sample size into account using SE measures are never or should never be used.

    I have noted previously that an SEM test could apply to your idealized situation (not necessarily to the Douglass case, because it may not need it, but the general case) of the observed if one could estimate the chaotic residual (E) remaining in the observed that would be averaged out in the model mean. Then one has a comparison using SEM where Xmodel – Xobserved = 0 is no longer expected, but rather |Xmodel – Xobserved| = E, where Xmodel is the average of the ensemble.
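
    [A minimal R sketch of that shifted comparison; all numbers below are made up for illustration, and E would have to be estimated from somewhere, which is the open question.]

    # Instead of testing Xmodel - Xobserved = 0, grant the ensemble mean a chaotic
    # allowance of size E and only reject if the difference exceeds E by more than 2*SE.
    shifted_sem_test <- function(model_vals, obs, E) {
      m  <- mean(model_vals)
      se <- sd(model_vals) / sqrt(length(model_vals))
      excess <- abs(m - obs) - E          # difference left over after the allowance
      list(mean = m, se = se, excess = excess, reject = excess > 2 * se)
    }
    set.seed(42)
    fake_ensemble <- rnorm(22, mean = 0.25, sd = 0.10)   # hypothetical trend values
    shifted_sem_test(fake_ensemble, obs = 0.12, E = 0.05)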

    I think the discussion has reached point where it would be instructive to discuss details and references — and maybe with some insights into why modelers prefer evaluating their models the way they do.

    Cliff Huston @ #346:

    If the definition of the perfect model ensemble is that it exactly matches the real world, by definition it will pass the test whether you use 2SE or 2SD

    Beaker’s point would be that the observed is more like another model output and the chaotic content or climate sigma of the observed will vary the output which in turn is averaged out to some degree in the ensemble mean. Therefore Xmodel – Xobserved = E.

    As stated above I have proposed that that condition does not necessarily apply to the Douglass case, and that one can test for differences in the general case by assuming that the difference could be as large as E.

    I think, on initial pondering of this issue, that if one does not do some of this estimating of E in the general case, one would conclude that computer outputs could never be reasonably compared in a statistical sense to an observed value. One could make predictions/projections/whatevers of the future climate with model outputs and then point to the range of outputs with comments like “they all show an increasing temperature and here is the range”. Of course what is avoided here is any comparison to an observed value. Lacking a rigorous statistical test, I would suppose a proud modeler would say that that agreement and range is sufficient for policy makers to go do their work. When organizations such as the IPCC get involved they need to push the concept of uncertainty a step further and add, in essence, a show of hands by (some or the) experts.

    I was hoping that Beaker would give us some insights into how the statistical testing of uncertainty of model outputs and comparisons to observed values is and/or should be progressing.

  354. Posted May 8, 2008 at 9:04 AM | Permalink

    #352 🙂

    steven, didn’t notice http://www.climateaudit.org/?p=3058#comment-243298 when I wrote #328. Quatloos belong to you. So, we need some zero A-CO2 runs for the tropical troposphere with different models .

  355. Kenneth Fritsch
    Posted May 8, 2008 at 9:23 AM | Permalink

    Re: Steven Mosher @ #342

    re 339. I infer nothing; first have Kenneth check my work in case I botched something.
    There were periods when the hindcast was off, missing bumps and such. I haven’t spent much
    time looking at it; you can select different forcings and see the response.

    No, no Steven, that’s my line to get people to do my work — and that has not worked for me here.

    Thanks for the link that your pal, Gavin, provided. I’ll look at it, but do not intend to preempt your posting on it first.

  356. steven mosher
    Posted May 8, 2008 at 9:52 AM | Permalink

    re 355. My Tom Sawyer skills are lacking.

  357. JamesG
    Posted May 8, 2008 at 10:59 AM | Permalink

    The argument that ensembles are more accurate than individual models is of course the argument that employs the greatest number of climate modelers. No longer can your model be total cr*p if it is a vital cog in the wheel. An appeal to a greater number of models to improve accuracy is only an appeal for yet more money, when in fact they need to reject the really bad models, disband the modeling groups responsible and focus on just the semi-bad ones. Seems to me the Russians are the winners of this contest and I’m not surprised.

  358. Christopher
    Posted May 8, 2008 at 11:02 AM | Permalink

    I really think this discussion is bogged down on beaker’s POV and that the climate context is getting in the way. Let’s try a different tack. Treat GCMs as random number generators (no pun intended). Each RNG has some underlying distribution. Assume that we perform n draws from this RNG, call the variable of interest x. So now we have an RNG-generated distribution of x. What is the best way to see if actual observations of this same quantity x can be reconciled with the RNG-generated distribution of x? Note the amorphous language here, no consistent vs. biased vs. meaningful. That’s where we are in a nutshell. But it gets more complicated. We may have an RNG-generated distribution of x but our observations come in various flavors (they are not fixed and known without error). So now we are comparing distribution one to distribution two. We want to see if they are the same. Again, n dependence figures prominently. (Brief aside: this is why RC likes ranges, you circumvent this problem – kind of.) So, what is the most robust test? Well, testing for equality of means (kind of what Douglass did) has several assumptions we may not like. We could try a permutation test but the RNG-generated x’s will swamp the observations, yielding beaker’s endgame. In the end there is no good way to do this. It is quite easy to generate orders of magnitude more RNG-generated x’s than it is observations. Any test will not work as advertised, in the statistical sense, a la beaker. Maybe if we could force n = 1 to factor out the n dependence? How do you do this? Back in the GCM world you need the conditional mean of GCM-temperature vs. the conditional mean of observed temperature. I don’t care how many models/realizations/ensembles etc. – it’s irrelevant (for now). Sample size is no longer an issue as sample sizes are equal. What happens when you do this? (You can follow along in R, just cut and paste.)

    We have two data vectors (only altitudes with observed and GCM values are kept here):

    > yhat = cbind(64,70,82,109,131,148,149,154,160,124) #GCM-ensemble means
    > yobs = cbind(135.6,93,38.75,10.5,74,94.25,74.33,-9.25,-125.25,-415) #obs. means
    #test for exchangeability; permutation tests generally require this
    > scale.test <- ansari.test(yhat, yobs, conf.int = TRUE)  # call inferred from the printed output below
    > scale.test

    Ansari-Bradley test

    data: yhat and yobs
    AB = 54, p-value = 0.9407
    alternative hypothesis: true ratio of scales is not equal to 1
    95 percent confidence interval:
    0.8648649 1.7204301
    sample estimates:
    ratio of scales
    0.8259587

    #CI encompasses 1 so exchangeability is plausible, let’s look at location
    > library(exactRankTests)  # assumed source of wilcox.exact(), which prints the 'Exact' test below
    > loc.test <- wilcox.exact(yhat, yobs, paired = TRUE, conf.int = TRUE)  # call inferred from the printed output below
    > loc.test

    Exact Wilcoxon signed rank test

    data: yhat and yobs
    V = 49, p-value = 0.02734
    alternative hypothesis: true mu is not equal to 0
    95 percent confidence interval:
    13.45 285.25
    sample estimates:
    (pseudo)median
    77.75

    So what have we learned? That obs and pred are not quite the same (if you like alpha = 0.05). There are issues with this approach. Non-parametric tests typically require exchangeability (this is rough shorthand for equal variances) and independence. The former we looked at with the Ansari-Bradley test, the latter is obviously violated. It stands to reason that a GCM value at 250 is related to the 200 value. And we know GCMs are based on the same underlying physics. This means we need a bootstrap, but for that I’d rather have the actual raw outputs from each run and some experimental design in parameter choices (but we have neither). Another point: climate_sigma was not mentioned. I don’t care about it and see no reason to chase the unknowable.
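
    [For what it’s worth, a rough sketch of the kind of bootstrap alluded to above, run on the per-level differences (yhat - yobs) with small blocks of adjacent levels to respect the vertical correlation; with the raw per-run outputs one would resample runs instead. Illustration only, not what Douglass et al did.]

    # Moving-block bootstrap of the mean per-level difference.
    d <- as.vector(yhat) - as.vector(yobs)
    block_boot_mean <- function(d, block = 3, B = 5000) {
      n <- length(d)
      starts <- seq_len(n - block + 1)
      replicate(B, {
        idx <- unlist(lapply(sample(starts, ceiling(n / block), replace = TRUE),
                             function(s) s:(s + block - 1)))
        mean(d[idx[seq_len(n)]])          # trim to the original length
      })
    }
    set.seed(1)
    quantile(block_boot_mean(d), c(0.025, 0.975))   # bootstrap CI for the mean difference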

  359. Sam Urbinto
    Posted May 8, 2008 at 11:20 AM | Permalink

    How about something like this:

    We have tested the proposition that greenhouse model simulations and trend observations of various altitudes of the tropical troposphere can be reconciled. Our conclusion is that comparing the standard error of the mean does not reconcile the two with each other. Other comparisons may yield different results.

  360. Ron Cram
    Posted May 8, 2008 at 11:41 AM | Permalink

    Sam,
    re:359

    That is not bad, but it should probably include the point made above by Willis so it reads something like this:

    We have tested the proposition that greenhouse model simulations and trend observations of various altitudes of the tropical troposphere can be reconciled. Our conclusion is that comparing the standard error of the mean (the approach used by the IPCC) does not reconcile the two with each other. Other comparisons may yield different results.

    Perhaps more importantly, this does not encompass Douglass’s comments above. Douglass et al really need to publish a sequel that assesses the individual models and explains why only models 2, 8 and 12 survive the two tests discussed by Douglass.

  361. Sam Urbinto
    Posted May 8, 2008 at 12:48 PM | Permalink

    Mosh #343 My comment about the SRES was related to an earlier conversation about explicitly taking population, industrialization and urbanization into account when trying to model the anomaly trend to match reality, rather than just taking GHG out of the model that rests upon GHG to make it match. (When I take my engine out of my car, it doesn’t match the speed profile)

    Ron #360 Thanks.

    The other question (besides “Can SEM do the job?”) is “Why didn’t we use the standard deviation, or a mix of SD and SEM, or individual models?” and the answer is (as I understand it):

    The IPCC doesn’t do it that way.

    Or

    A mix forces assumptions or actions beyond the scope of the test and would be far too uncertain.

    Using the SD results in “error bars” (or whatever you want to call it) that are so large as to only prove “uncertainty” (or whatever you want to call it). We already know there’s uncertainty (since we have to use an ensemble), and the purpose of an ensemble is to take models, regardless of whether they’re correct or not, to get an aggregate that is theorized to more closely be able to match reality.

    About the same reasons as the other two; using individual models we know don’t match doesn’t help (because the results depend on which you pick), and that is why an ensemble is used in the first place. Looking at them individually also amounts to trying to validate them, which is beyond the scope of the test.

    Or

    A mix or individual models is beyond the scope of the test as well as being meaningless. The SD isn’t used mainly for that reason, as the purpose of an ensemble is supposedly to provide an entity to use the SEM on.

    Or

    We used SEM. You can do the same test and use anything you want to.

  362. Michael Smith
    Posted May 8, 2008 at 1:56 PM | Permalink

    How about this for a conclusion:

    An ensemble of 22 GCMs yields surface temperature trend results that range from 28 milli deg C/decade — far below the surface observation range of 124 – 176 — up to 311 — which is far above the surface range. However, the ensemble’s mean surface prediction of 156 shows good agreement with the surface range.

    Accordingly, we have tested the proposition that the same ensemble’s mean predictions throughout the troposphere show similar agreement with the troposphere observations. Using the standard error of the mean, we find that such agreement does not exist. The ensemble mean and the observations diverge above the surface, move in opposite directions up to an altitude of 250 hPa and remain widely separated thereafter. The ensemble cannot be reconciled with the troposphere observations the same way it can with the surface observations.

  363. Larry T
    Posted May 8, 2008 at 2:11 PM | Permalink

    My wife is 85 lbs and I am 285 lbs; on average we are only slightly overweight, but that doesn’t truly tell the story.

  364. Sam Urbinto
    Posted May 8, 2008 at 2:18 PM | Permalink

    “…The ensemble cannot be reconciled with the troposphere observations the same way it can with the surface observations…”

    Michael, I’d add (maybe not)

    “… unless resorting to the comparison of the standard deviation to make them match. However, doing that results in error bars that are so wide, it simply makes the comparison become uncertain and therefore of little use, since uncertainty is already evident from the individual models ranging from .028 to .311 degrees.”

  365. Bob B
    Posted May 8, 2008 at 2:30 PM | Permalink

    This argument doesn’t help make decisions on the farm!

    http://sciencepolicy.colorado.edu/prometheus/archives/climate_change/001420teats_on_a_bull.html

  366. steven mosher
    Posted May 8, 2008 at 3:31 PM | Permalink

    re 361, Ok I get it.

    Here is the dilemma introduced by “beaker inconsistency”. Beaker inconsistency allows models
    with a wide ensemble spread to pass unscathed. With wide ensemble spreads and wide
    data spreads nothing is inconsistent with anything, and “the moon is made of blue cheese” cannot be rejected as a hypothesis. The net effect of wide uncertainties is that the hypothesis
    under test is not rejected. Which means it lives to fight another day. Typical low-power
    test.

    On the other hand the attribution studies consider divergence between model and observation as
    proof of AGW without really explaining the statistical tests they use. So lack of agreement
    between models and data is proof of AGW on one hand, and divergence between model and data
    is waved away as noise on the other.

    (Say hello to time series who forget their means)

    So under beaker logic a model that fits the data better has nothing to recommend it.
    A model with GHG forcing fits better than one without GHG forcing. Under beaker logic
    this is immaterial, so the logic used to reject Douglass is applicable to attribution studies.
    Now, don’t get me wrong. I think beaker has the stronger argument (SE versus SD). HOWEVER,
    how does that argument play in attribution studies? Good question I think. I don’t know.

  367. steven mosher
    Posted May 8, 2008 at 3:35 PM | Permalink

    re 365. TOBs

  368. Ron Cram
    Posted May 8, 2008 at 4:19 PM | Permalink

    Mosher,
    re: 366

    I hope you don’t mind me calling you that. It is simpler since I use “Steve” for Steve McIntyre.

    Looking at the ensemble is only one way to approach the question. If the ensemble does well, then you probably learned enough. But if the ensemble does poorly, as even beaker admits with his grade of D-, then one ought to look into the models individually to find out if any models are doing well or if all of them are consistently bad. The comments by Douglass at the top of this thread are really important.

    The question you raise is also important. I would like to see one attribution study done with all 22 models (because the IPCC has not yet rejected the models failing the test) and one attribution study done using 2, 8 and 12. It would be interesting to see how the results are different, if at all.

  369. Kenneth Fritsch
    Posted May 8, 2008 at 4:30 PM | Permalink

    So under beaker logic a model that fits the data better has nothing to recommend it.
    A model with GHG forcing fits better than one without GHG forcing. Under beaker logic
    this is immaterial, so the logic used to reject Douglass is applicable to attribution studies.
    Now, don’t get me wrong. I think beaker has the stronger argument (SE versus SD). HOWEVER,
    how does that argument play in attribution studies? Good question I think. I don’t know.

    Steven Mosher, I would not call Beaker’s POV, as expounded here at CA, beaker logic, because what he is stating is the generally accepted view in modeling statistics (as well as I have been able to determine). In fact many comparisons do not even use SD but instead merely present a range of computer outputs and the observed and then make subjective statements about it, much as was the case where Douglass et al. complained about Karl et al. doing this for a nearly identical comparison to Douglass et al, where they use SEM. I have read papers that appear to judge the state of statistical testing of model against observed to be rather subjective, and to be saved by using Bayesian methodology to put it on a more formal footing.

    I think if one pushed the beaker POV to its logical conclusion, use of SD (as also indicated by RC) for comparisons would fail also and in fact would show that a strict frequentist approach would always fail. It mostly comes down to the assumption that we have only a single realization of the observed and the observed realization has a chaotic content that means the mean Xmodel – Xobserved = E and not zero for a perfect model. Now I can see the logic of the Beaker POV given the assumptions he makes vis a vis the Douglass paper, i.e., if the comparison is the same as comparing climates modeled to observed and we cannot estimate E. That is not my point of contention. My point is that since Douglass uses ratios of trends in their comparison I see the chaotic content of the observed and the models canceling, since I am comparing two outputs from the same initial conditions and looking strictly at any GHG effects.

    Please explain how and why you think beaker has a stronger argument about using SD in the place of SE for the Douglass case – without Beaker or someone else showing that the chaotic content remains in the observed.

    The more I read about these types of comparisons in the climate modeling world, the more I am convinced that the choice of statistical tests has much to do with what keeps the model outputs plausible, and that the reaction to Douglass et al. using SEM would be expected.

  370. Sam Urbinto
    Posted May 8, 2008 at 4:31 PM | Permalink

    Reality is .150 +/- .026
    Model ensemble is .156

    Quite decent, see my reality has a +/- of 17% and my ensemble is only off 4%

    Individual model range .028 to .311. The center of the range is .1695
    But the range is +/- .1415. Oh, a mere 84%, pshaw, close enough.

    But certainly the center of the range fits. What does that tell us?

    Sure, if we use the center of the range of models, we have vindicated the models! Yay!

    Not that the +/- of the actual models are almost the same as the center of the range or anything.

    Yay!
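
    [Quick check of the arithmetic above in R, illustrative only:]

    obs <- 0.150; obs_hw <- 0.026; ens_mean <- 0.156
    rng <- c(0.028, 0.311)
    c(obs_rel   = obs_hw / obs,                 # ~17%
      ens_off   = abs(ens_mean - obs) / obs,    # ~4%
      range_rel = (diff(rng) / 2) / mean(rng))  # ~84%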

  371. steven mosher
    Posted May 8, 2008 at 5:27 PM | Permalink

    re 369. I’m just calling it beaker logic to give him some credit for sticking to his guns
    and remaining civil. As an aside I tend to judge threads by the people they attract, and
    who sticks with it. So, willis is always thoughtful, as are you, and when RomanM comes along
    it’s icing on the cake.

    I rather liked your approach. I thought few people got it. You always have a fresh way of
    going at things ( like in the surface stations).

  372. conard
    Posted May 8, 2008 at 5:37 PM | Permalink

    369 Kenneth Fritsch

    My point is that since Douglass uses ratios of trends in their comparison I see the chaotic content of the observed and the models canceling, since I am comparing two outputs from the same initial conditions and looking strictly at any GHG effects.

    How is the observed chaotic content measured and quantified when there is only a single draw? I do not understand how this is done and therefore cannot understand how the two chaotic components (earth’s climate, modeled climate) can be said to cancel.

    Thanks

  373. steven mosher
    Posted May 8, 2008 at 5:53 PM | Permalink

    re 368. You can call me mosh or moshpit, no worries. Getting the data is hard. Even for ModelE they only make certain data available and they always filter it! For example with ModelE
    data you can only get the average of the 5 runs per ensemble, and they give it to you as a running mean. The IPCC has an online database, and Gavin has promised that I can get access
    if I ask nicely and write up a good explanation of why I need the data. I’ll resubmit my
    request and see what happens. Then I’ll let you guys know how to get the data. Or you can all go try
    and see what they say to you. If you’re interested, ask and I’ll post a link.

  374. Sam Urbinto
    Posted May 8, 2008 at 6:12 PM | Permalink

    Ron #368 My issue is that if the models range from .028 to .311 (or whatever), will the ones now showing as within the same 17% range around the center of .15 continue to do so, or will the ones that match at all atmospheric pressures within ~20% stay there? What if you pick one of them, and the .028 or the .311 turns out to be better in the next 5 years? Or in other words, which model do you pick and run with in reality?

    Kenneth #369 I think with this ensemble, it depends on what you want to show as to which you pick, SD or SEM. If I start out initially saying “I know this ensemble would pass SD, but I’m not interested in knowing what I already do; the models as a whole do better than the ones at the extremes, and they’re all over the place. I don’t want to know that; I want to know if SEM will compare these two things that are said by some to be an apples-to-apples comparison,” then I’d pick SEM. mosh #371 I think beaker has been very civil also. I simply think this is a matter of how you look at things as to what opinion you form on what’s appropriate or not, not that there’s a “right” or “wrong” answer at all, much less a universal one.

    mosh #373 A good reason to need the raw data? What, is it made of gold, or could it destroy the universe if it fell into the wrong hands?

  375. Kenneth Fritsch
    Posted May 8, 2008 at 6:21 PM | Permalink

    Re: #372

    How is the observed chaotic content measured and quantified when there is only a single draw? I do not understand how this is done and therefore cannot understand how the two chaotic components (earth’s climate, modeled climate) can be said to cancel.

    I believe Beaker’s reference to chaotic content (my terminology not his) was that it depends (very sensitively) on the initial conditions and conditions that are not readily determined. We get only a single rendition of the observed so that we cannot average out that content using several runs.

    Having said all that, I am not saying that the cancelling of the chaotic content is between observed and models but between, for example, the surface and troposphere temperatures when one uses a ratio of surface to troposphere, since both are realizations from the same initial conditions. I also know that the noise level of the climate is nearly the same in the surface and troposphere, i.e. they correlate well. I am therefore saying that since we are not comparing the complete climate system, model to observed, but ratios of trends, models to observed, the chaotic content does not come into play.

    That has been my ongoing question to Beaker, who to this point has not responded.

    If, as Beaker suggests (and as do authors I have read on this subject), the chaotic content depends on the initial conditions, I assume that one would make model runs with a range of initial conditions to get an estimate of the effect. Once one has this estimate, why could it not be applied to the observed in making comparisons?
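
    [A toy R illustration of the ratio argument, under the strong assumption that the unforced (ENSO-like) fluctuations aloft are just the surface fluctuations scaled by the same amplification factor R as the forced trend. Under that assumption the trend ratio recovers R in every realization even though the raw trends scatter; if the unforced parts are not strictly proportional, the cancellation is only partial.]

    set.seed(123)
    R     <- 1.4             # assumed amplification aloft
    mon   <- 1:252           # 21 years of monthly time steps
    slope <- 0.15 / 120      # 0.15 K/decade forced surface trend, per month
    sims <- replicate(1000, {
      enso  <- as.numeric(arima.sim(list(ar = 0.9), n = length(mon), sd = 0.1))
      surf  <- slope * mon + enso
      tropo <- R * surf      # chaotic part amplified by the same factor (the assumption)
      c(surf_trend = unname(coef(lm(surf ~ mon))["mon"]) * 120,
        ratio      = unname(coef(lm(tropo ~ mon))["mon"] / coef(lm(surf ~ mon))["mon"]))
    })
    apply(sims, 1, sd)       # surface trends scatter; the ratio has essentially zero spread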

  376. Sam Urbinto
    Posted May 8, 2008 at 6:30 PM | Permalink

    I’m confused by now. We have a number of models. Some show a match with reality at some pressures from 100 to 1000 but not others. We have some models that match the curve well but are too high or too low.

    So in 22 tries only 3 are viable matches. The question is: how do you know ahead of time those three will remain viable? We don’t. So how can we keep them or throw them away or not, just because they got it somewhat correct from the start of the satellite era until now?

    Track them all singly and see what happens as a group, as a group of viable only, and as single models. 🙂

  377. Kenneth Fritsch
    Posted May 8, 2008 at 6:57 PM | Permalink

    Re: #376

    Sam, those three models passed a test proposed by Douglass in matching certain climate observations but did not “pass” a test (2 SD per Willis in a post in this thread) when comparing them to the observed ratio of surface to troposphere temperature trends.

    Also remember that Beaker, and in other instances other climate modelers, were hesitant to remove models for reasons that were not clear to me. Also, Douglass et al. complained in their paper that Karl et al. had included some model results in their paper’s comparison of the models to the observed that appeared to be outliers in a physical sense. In the end, removing dubious models reduces the SD and makes a biased model ensemble average easier to reject as the same as the observed.

    I think I see a thread running through all this. Do not use SEM as a method of comparison of an ensemble to observed and do not eliminate physical outliers from the ensemble — and the models will live to see the light of another day.

  378. conard
    Posted May 8, 2008 at 7:37 PM | Permalink

    375 Kenneth Fritsch

    Thank you for the reply. I am still not understanding but since this is between you and beaker I will take my time and try to visualize your question as time permits.

    I assume that one would make model runs with a range of initial conditions to get an estimate of the effect. Once one has this estimate why could not it be applied to observed in making comparisons.

    Sounds like a lot of monkeys and typewriters to me 😉 I am glad I earn my living by following after more modest aims.

  379. Geoff Sherrington
    Posted May 8, 2008 at 8:01 PM | Permalink

    Re 351 UC

    I defer to Dilbert’s 87. Brilliant.

    Sounds like the meaning of Life, the Universe and all that. 37.

    But what is the approved method to calculate if the 87 and the 37 derive from populations that overlap with various degrees of confidence?

  380. DeWitt Payne
    Posted May 8, 2008 at 8:16 PM | Permalink

    Geoff,

    Nitpick. The Answer to the Great Question of Life, the Universe and Everything was 42 not 37.

  381. Geoff Sherrington
    Posted May 8, 2008 at 8:50 PM | Permalink

    Re # 323 beaker

    You write –
    He (the modeller) then changes his opinions on the physics of the climate and this is reflected in the next (hopefully more accurate) generation of models.

    A scientist does not change his/her opinion of physics. If there is doubt about the physics, the modeller should not start on them. Physics is more formal than “Do gentlemen prefer blondes?”. The modeller must have authenticity for his physics. The next iteration of the modeller you mention might be more “precise” following changes to the emphasis put on the parts of the physics, but it can only be more “accurate” if wrong physics are deleted and correct physics substituted. You want to change F=ma because in your opinion it is wrong?

  382. Geoff Sherrington
    Posted May 8, 2008 at 9:06 PM | Permalink

    Re # 380 DeWitt Payne

    Nitpick back. 42 was the Northern Hemisphere version, 37 is the SH version.

  383. Ron Cram
    Posted May 8, 2008 at 9:38 PM | Permalink

    Sam,
    re: 374

    We have had more than 20 years of observations and the models can run hundreds of years in a particular run. If the models are not matching the observations within the time period available, they never will.

    I think it is time someone pointed out the obvious here. In 2007, Roy Spencer and co-authors published their observations on a newly discovered negative feedback over the tropics. They identified this as confirming the “Infrared Iris” effect hypothesized by Richard Lindzen. When the peer-reviewed paper came out, they mentioned that no GCM currently factors in this negative feedback.

    It is entirely possible that if this feedback was fed into the GCMs, they may produce results more in line with the observations. That still would not mean they have any predictive power, but they would mimic current climate changes better.

  384. frost
    Posted May 8, 2008 at 9:39 PM | Permalink

    The idea of considering an ensemble of models as being samples of some distribution seems silly to me. Here’s a picture that Oliver Morton linked to in January.

    By my count, there are 19 models in this GCM “Family Tree” but only 6 different families, so how many independent models are there really?

  385. DeWitt Payne
    Posted May 8, 2008 at 9:47 PM | Permalink

    Geoff,

    That’s really curious. I’d never heard that before. Did they change it for SH audiences in the recent movie version too? What about the BBC TV and radio versions? OT I know, but this sort of trivia fascinates me.

  386. Geoff Sherrington
    Posted May 8, 2008 at 11:14 PM | Permalink

    Re # 385 DeWitt Payne

    In the NH the recent values were adjusted upwards, something in the shape of a hockey stick, if I can use that analogy, with the subsequent effect on the mean. The SH value was submitted to “Natural Health” but was peer-reviewed out of existence. Most folk today, without knowing the history, believe 42 was the original number.

    37 is a prime and is pure; 42 is too rich in information, being the product of three prime candidates (something like the current USA presidential prospects ). 42 includes 2 numbers from the most popular Fibonacci series and is double another. These spooky coincidences cause overfitting speculation.

    (Actually, I made a typo in my first post).

  387. Geoff Sherrington
    Posted May 9, 2008 at 12:37 AM | Permalink

    Re # 385 DeWitt Payne (concluded)
    This is way OT and I mean no disrespect to David Douglass and co-authors,
    but the number 37 came from here, now I have dug deeper. Monty Python & Holy Grail (I think the text still has indirect relevance or irreverence in the context of Climate Science)

    King Arthur: Old woman.
    Dennis: Man.
    King Arthur: Man, sorry. What knight lives in that castle over there?
    Dennis: I’m 37.
    King Arthur: What?
    Dennis: I’m 37. I’m not old.
    King Arthur: Well I can’t just call you “man”.
    Dennis: Well you could say “Dennis”.
    King Arthur: I didn’t know you were called Dennis.
    Dennis: Well you didn’t bother to find out did you?
    King Arthur: I did say sorry about the “old woman”, but from behind you looked…
    Dennis: What I object to is you automatically treat me like an inferior.
    King Arthur: Well I am king.
    Dennis: Oh, king eh? Very nice. And how’d you get that, eh? By exploiting the workers. By hanging on to outdated imperialist dogma which perpetuates the economic and social differences in our society.

  388. beaker
    Posted May 9, 2008 at 1:12 AM | Permalink

    Morning all.

    Michael Smith #340: The jump from (iii) and (iv) to (v) is not a non-sequitur: if you use a test that will reject a model that has a meaninglessly small difference from the observations, it is a meaningless test. Someone mentioned confusing statistics with reality; that is exactly what is going on here. The 2SE test is a valid statistical test, it just doesn’t tell you anything about whether the models are useful or not in reality.

    Ron Cram #345: The real issue may be something else; that is fine. But if the issue is something else, then don’t try to support it with faulty statistical evidence. I would have no problem with a paper that just looked at the data and drew a conclusion; it is the use of incorrect statistics that I have issue with.
    If statement (vi) is true, then it applies in all cases. If what you say is true, then all of the model runs would give exactly the same value for the trend, but they don’t.

    Cliff Huston #346: If you could create 67 replica solar systems and create the same forcing on the replica Earths, that would be a perfect model. You would not get exactly the same trend on all of them because (having an element of chaos) things like ENSO would not be the same.

    If there is any chaotic component to the Earth’s climate, you would not get an SD close to zero, as the model outputs would not be identical because the initial conditions would not be the same.

    The perfect model would pass the 2SD test (with 95% confidence), but not the SE test if the climate has any chaotic component.

    You need to read the IPCC comment someone posted. They claim that the ensemble mean gives the best estimate of the “forced climate change”, not of the observed climate (not the same thing). It seems that you are misinterpreting the IPCC’s position on ensembling.

    Roman M #349: Your intepretation of the SE test is, shall we say “unique”. The SE gives the error bars of estimating a population mean from a sample. The test sees if the there is a detectable difference between this population mean and some particular value, nothing more.

    Your comment “True or not, irrellevant to the issue here” demonstrates that you are unwilling to listen to my argument. That a test will reject a model that gives a good fit to the data, just because the ensemble is large, demonstrates the meaninglessness of the SE test.

    The point I am making is that if the ensemble is large, it will get rejected even if the difference is small, even too small to be meaningful. That is not true of all tests.

    Again, you are confusing statistics with reality. Yes, there is a statistically significant difference between the mean and the observations, but the SE test cannot tell you whether that difference is meaningful.

    Kenneth Fritsch #353: How about if I use a quote from Douglass to show why the models can't be expected to predict the data exactly:

    David Douglass said:

    We took the model data from the LLNL archive. We computed the tropical zonal averages and trends from 1979 to 1999 at the various pressure levels for each realizations for each model and averaged the realizations for each model to simulate removal of the El Nino Southern Oscillation (ENSO) effect; these models cannot reproduce the observed time sequence of El Nino and La Nina events, except by chance as has been pointed out by Santer et al..

    Note that the ensemble predicts the climate with the effect of ENSO removed, but it is being compared with the trend for the observed climate, where the effect of ENSO has not been removed. It is not a like-for-like comparison, so there is no reason to expect the ensemble mean to exactly match the observations.

    I am not saying that there is anything wrong with the SE test in itself; it is just that it is not the right test for finding out whether a model ensemble is a useful predictor. It can't be, as it will fail models that provide an arbitrarily good fit to the data just because the ensemble is large, not because of their predictions.

    Sadly, you can’t really estimate E using the models, as you end up with a circular argument.

    As for alternative tests, as I said, first you have to decide exactly what it is you want to ask.

    Christopher #358: I have no endgame. The test you use depends on the question you want to have answered. The SE test answers a question, but it is not whether the models are useful or not. If there are "known unknowns", it is a bad idea for a statistician to ignore them; it leads to unduly confident conclusions.

    steven mosher #366: I have already pointed out that the 2SD test is easy to pass and that there is little glory in a victory so easily earned. I am not saying that the SD test tells you anything about whether the models are useful. But it does tell you if they are consistent. I am not saying that the SD test was the right one to use to see if the models have useful skill, just that the SE test definitely isn’t.
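    To make the SD-versus-SE distinction a bit more concrete, here is a minimal numerical sketch in Python. The trend values are invented placeholders, not the Douglass et al. Table II numbers; the point is only that the same ensemble-observation difference can sit comfortably inside +/-2SD of the model spread while falling well outside +/-2SE of the ensemble mean.

        # Sketch of the +/-2SD vs +/-2SE comparison discussed in this thread.
        # Trend values are invented for illustration; they are NOT the
        # Douglass et al. Table II numbers.
        import numpy as np

        model_trends = np.array([0.10, 0.18, 0.22, 0.25, 0.27, 0.30,
                                 0.16, 0.21, 0.24, 0.28, 0.19, 0.23])  # K/decade, hypothetical
        obs_trend = 0.12                                               # K/decade, hypothetical

        mean = model_trends.mean()
        sd = model_trends.std(ddof=1)          # spread of the individual models
        se = sd / np.sqrt(len(model_trends))   # uncertainty of the ensemble mean

        diff = abs(mean - obs_trend)
        print(f"mean={mean:.3f}  sd={sd:.3f}  se={se:.3f}  |diff|={diff:.3f}")
        print("within +/-2SD of the model spread:  ", diff < 2 * sd)
        print("within +/-2SE of the ensemble mean: ", diff < 2 * se)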

  389. Peter Thompson
    Posted May 9, 2008 at 4:40 AM | Permalink

    Beaker #388,

    This certainly is the nub of the issue:

    You need to read the IPCC comment someone posted. They claim that the ensemble mean gives the best estimate of the "forced climate change", not of the observed climate (not the same thing).

    What the IPCC unfortunately does not also do is go on to say that, since this post hoc tuned, flux adjusted mess of spaghetti covers every eventuality from virtually no temperature effect to Hades on Earth, we have no scientific physical basis for our claims concerning anthropogenic climate forcings.

  390. MarkW
    Posted May 9, 2008 at 4:58 AM | Permalink

    Geoff,

    You better be careful with this OT humor stuff. Otherwise Leif and the protocol police might get on your case.

  391. Michael Smith
    Posted May 9, 2008 at 5:11 AM | Permalink

    Beaker wrote:

    Michael Smith #340: The jump from (iii) and (iv) to (v) is not a non sequitur: if you use a test that will reject a model that has a meaninglessly small difference from the observations, it is a meaningless test.

    If a test rejects a model with a meaninglessly small difference from the observations under all possible conditions, or even under typical conditions, then you can argue that it is a meaningless test under all possible conditions, or even under typical conditions.

    Your argument, however, amounts to the opposite: namely, that a test that can be shown to be meaningless under some conditions is meaningless under all conditions. But you have not shown why this should be so. And the fact that, in Douglass, the ensemble failure has nothing to do with the size of n demonstrates it to be false.

    The difference between the ensemble and the observations in Douglass may not prove the two are inconsistent — as conventional statistics uses the term — but by your own admission that the test shows bias in the data, you cannot argue that the difference is meaningless.

  392. beaker
    Posted May 9, 2008 at 5:34 AM | Permalink

    Michael Smith #391: I find it hard to comprehend how someone could be comfortable with a test that is almost guaranteed to reject a model that is perfect, but I will continue to try to discuss this!

    The point is that the ensemble mean is not intended to model the observed climate, only the forced climate (see quote from the IPCC). Therefore one would not expect the ensemble mean to exactly match the observations. For example, as Douglass pointed out, the model ensemble removes the effect of ENSO (which, being chaotic, is not predictable even by a perfect model of the physics). The observations, however, have not had the effects of ENSO removed, so it is not a like-for-like comparison.

    The 2SE test is a test of whether we can be confident that the mean of the ensemble is the same as the observations or whether it is different. Well, of course the ensemble mean is different from the observations (except by fluke), as the thing the models aim to reproduce is not the observed climate but only the forced component of the observed climate (having eliminated the effects of things like ENSO).

    The +/-2SE is a test of something that modellers would not expect to be true in the first place, so how can it be a useful test of whether the models are useful? I said that the test shows the model is biased, and I meant biased in the statistical sense; just because there is a statistically significant bias, it does not imply that there is a meaningful bias, just that the bias is large enough that we can be confident a bias actually exists. I think I have already said that neither consistency nor bias necessarily implies usefulness. Neither the SE nor the SD test can tell you if the models are useful on their own.

  393. beaker
    Posted May 9, 2008 at 6:04 AM | Permalink

    I said:

    The 2SE test is a test of whether we can be confident that the mean of the ensemble is the same as the observations or whether it is different

    but that wasn't quite accurate. The 2SE test is a test of whether the mean of the underlying population from which the ensemble is drawn is the same as the observations or whether it is different. As I have already explained, we would expect them to be different purely from a priori arguments. Thus if the ensemble passes the test, it is only because the ensemble is too small for us to estimate the population mean accurately from the sample we have and be sure that this difference really does exist.
    It is a test of something that we would not expect to be true, and the ensemble passes only if the uncertainties are too high to be sure of what we would expect. Great test! ;o)

  394. Cliff Huston
    Posted May 9, 2008 at 6:11 AM | Permalink

    #388 Beaker,
    You say:

    You need to read the IPCC comment someone posted. They claim that the ensemble mean gives the best estimate of the "forced climate change", not of the observed climate (not the same thing). It seems that you are misinterpreting the IPCC's position on ensembling.

    No. You are cherry picking quotes, which is consistent with the hand waving I said you need to get past.
    From #260 Willis Eschenbach:

    The IPCC, on the other hand, says (emphasis mine):

    Multi-model ensemble approaches are already used in short-range climate forecasting (e.g., Graham et al., 1999; Krishnamurti et al., 1999; Brankovic and Palmer, 2000; Doblas-Reyes et al., 2000; Derome et al., 2001). When applied to climate change, each model in the ensemble produces a somewhat different projection and, if these represent plausible solutions to the governing equations, they may be considered as different realisations of the climate change drawn from the set of models in active use and produced with current climate knowledge. In this case, temperature is represented as T = T0 + TF + Tm + T’ where TF is the deterministic forced climate change for the real system and Tm= Tf -TF is the error in the model’s simulation of this forced response. T’ now also includes errors in the statistical behaviour of the simulated natural variability. The multi-model ensemble mean estimate of forced climate change is {T} = TF + {Tm} + {T”} where the natural variability again averages to zero for a large enough ensemble. To the extent that unrelated model errors tend to average out, the ensemble mean or systematic error {Tm} will be small, {T} will approach TF and the multi-model ensemble average will be a better estimate of the forced climate change of the real system than the result from a particular model.

    From #266 Willis Eschenbach:

    beaker, thanks for the response. Let me go over again what the IPCC says. First, it defines the temperature as being:

    T = T0 + TF + Tm + T’ where TF is the deterministic forced climate change for the real system and Tm= Tf -TF is the error in the model’s simulation of this forced response.

    It describes the use of the ensemble as follows:

    To the extent that unrelated model errors tend to average out, the ensemble mean or systematic error {Tm} will be small, {T} will approach TF and the multi-model ensemble average will be a better estimate of the forced climate change of the real system than the result from a particular model.

    Willis’ IPCC quotes seem to be your source of “forced climate change”, but you left out “of the real system”. Perhaps you need to re-read the following IPCC comments.

    From IPCC AR4: (Re-quoted from #52 Steve McIntyre, for convenience – emphasis added)

    The reason to focus on the multi-model mean is that averages across structurally different models empirically show better large-scale agreement with observations, because individual model biases tend to cancel (see Chapter 8). The expanded use of multi-model ensembles of projections of future climate change therefore provides higher quality and more quantitative climate change information compared to the TAR. (ch 10)
    The use of multi-model ensembles has been shown in other modelling applications to produce simulated climate features that are improved over single models alone (see discussion in Chapters 8 and 9).

    It continues to be the case that multi-model ensemble simulations generally provide more robust information than runs of any single model.

    Chapter 8 states:

    The multi-model averaging serves to filter out biases of individual models and only retains errors that are generally pervasive. There is some evidence that the multi-model mean field is often in better agreement with observations than any of the fields simulated by the individual models (see Section 8.3.1.1.2), …

    Why the multi-model mean field turns out to be closer to the observed than the fields in any of the individual models is the subject of ongoing research; a superficial explanation is that at each location and for each month, the model estimates tend to scatter around the correct value (more or less symmetrically), with no single model consistently closest to the observations. This, however, does not explain why the results should scatter in this way.

    It is clear that at one point you read and understood the above IPCC statements, because you made the following comments:

    From #205 Beaker:
    Cliff Huston #198: I am looking into the IPCC reasons for the use of the ensemble mean, as a result of Steves post, but I would prefer not to comment directly until I have read up on the subject to make sure I understand the relevant issues.

    From #207 Beaker:
    Having started to read up on the IPCCs reasons for the use of the ensemble mean (which also are possibly a fair target for an audit) I am now fairly sure that the statistical test performed in Douglass et al. is meaningless (although the point they seem to be trying to make with it, that the models do not adequately reproduce the observations, seems perfectly valid).

    From #237 Beaker:
    I think either the IPCC recommendations regarding the ensemble mean is badly flawed or it is being badly misinterpreted!

    What you have not done is cite any IPCC recommendations that, when comparing the model ensemble to real world observations, the whole ensemble instead of the ensemble average should be used. If you can show where the IPCC uses SD error bars for the model ensemble output, your case is made. Without that or some other clear citation from the IPCC, your position is in opposition to the clear IPCC statements above. If, after failing to find an IPCC statement in support of SD error bars for the ensemble output, you want to declare the IPCC ‘badly flawed’, I expect many here will agree. But without that support from the IPCC, your argument against Douglass et al lacks merit.

    Cliff

  395. beaker
    Posted May 9, 2008 at 6:31 AM | Permalink

    Cliff Huston:

    No. You are cherry picking quotes, which is consistent with the hand waving I said you need to get past.

    I’m sorry, but if you are going to take that attitude there is no point in discussing things any further with you. The IPCC quote appears to me to be worded that way to make an important distinction between what the models can predict and what they can’t. If you are happy to ignore such distinctions, it is not overly surprising that you don’t understand the point I am trying to make.

    Accusing me of cherry picking is insulting, and unhelpful, especially when I have gone out of my way to be patient and civil with you.

  396. steven mosher
    Posted May 9, 2008 at 7:21 AM | Permalink

    re 388. I think we are in violent agreement

  397. beaker
    Posted May 9, 2008 at 7:31 AM | Permalink

    steven: Cool, all we need to do now is to work out exactly what the question is ;o)

  398. Cliff Huston
    Posted May 9, 2008 at 7:39 AM | Permalink

    #388 Beaker,
    You say:

    Cliff Huston #346: If you could create 67 replica solar systems and apply the same forcing to the replica Earths, that would be a perfect model. You would not get exactly the same trend on all of them because (having an element of chaos) things like ENSO would not be the same.

    Yes, that would be the case for real worlds, because real worlds have an element of chaos – but GCMs do not. And yes, I would expect the ensemble data of those 67 worlds to have an SD much greater than zero.

    If there is any chaotic component to the Earth's climate, you would not get an SD close to zero, because the model outputs would not be identical when the initial conditions are not the same.

    No, you are confusing the real world element of chaos with an ensemble of GCMs, which are not chaotic. If the GCMs are perfect and all are started with the same initial conditions, they will all produce the same output, and the decade trends of the ensemble of GCMs will have an SD close to zero.

    The perfect model would pass the 2SD test (with 95% confidence), but not the SE test, if the climate has any chaotic component.

    No, the results would be the same whether you used 2SD or 2SE against the real world (element of chaos, e.g. ENSO) observations. Passing or failing the test would depend on the observation error bars and the magnitude of the chaos element remaining after trending the observation.

    Cliff

  399. Ron Cram
    Posted May 9, 2008 at 7:51 AM | Permalink

    beaker,
    re: 388

    You write:

    Ron Cram #345: The real issue may be something else; that is fine. But if the issue is something else, then don't try to support it with faulty statistical evidence. I would have no problem with a paper that just looked at the data and drew a conclusion; it is the use of incorrect statistics that I have issue with.
    If the statement (vi) is true, then it applies in all cases. If what you say is true, then all of the model runs would give exactly the same value for the trend, but they don't.

    I don’t understand which of my comments in 345 you are referring to in your first paragraph. Regarding your second paragraph: Nonsense. You are assuming that the actions of GHG gases are purely stochastic. That is contrary to the theory of AGW since GHG gases are believed to be deterministic when it comes to warming the troposphere faster than the surface. Surely you understand that by now. Climate is a non-linear coupled system so models would NOT “give exactly the same value for the trend.” (Which really is not the point anyway as this is about warming ratio of troposphere to surface.) Many other equations and parameterizations apply. But the models would have the troposphere warming faster than the surface. And indeed they do. The problem is not that the models do not match AGW theory. The problem is the models are not matching nature. Because these facts are true, statement (vi) cannot apply. And since statement (vi) does not apply, neither do any of your following statements.

  400. beaker
    Posted May 9, 2008 at 7:56 AM | Permalink

    Cliff #398: It is not true that the GCM simulations are not chaotic. If they weren't, you would get the same results every time you ran the simulation with different initialisations. For instance, they can't predict the actual ENSO, but they can make a synthetic ENSO with similar statistical properties. However, when you average to make an ensemble, it tends to cancel the synthetic ENSO out and give a prediction of what the trend would look like without an ENSO.

    The point is that the GCMs are always started with different initial conditions. There would be no point in having an ensemble in the first place if we run it with the same initialisation each time.

    The infinite ensemble of parallel Earth simulations would not pass the 2SE test, as each would have different initialisations (we didn't measure the initial conditions on the real Earth accurately enough to start them off in exactly the same state). Therefore, each would have differences in things like ENSO. Averaging over the ensemble then cancels the effects of ENSO. However, there is an ENSO on the real Earth, so there is going to be a difference between the ensemble mean and the observations. Therefore it will fail the SE test, as you are dividing by sqrt(infinity).

  401. Cliff Huston
    Posted May 9, 2008 at 7:59 AM | Permalink

    #395 Beaker,
    Sorry. You have my apology, no offense was intended. I will choose my words with more care.
    Cliff

  402. beaker
    Posted May 9, 2008 at 8:05 AM | Permalink

    Ron: In #345 you seemed to be suggesting a different question to the one tested in Douglass et al. I was pointing out that I have no problem with that; it is only the incorrect use of stats to support a contention that I have a problem with. Sorry, I obviously didn't make myself very clear (for a change ;o)

    As to the second point, I don't see how anything I have written suggests that I assume that "the actions of GHG gases are purely stochastic". Quite the opposite! I would have thought the action of GHG is preserved by ensembling while the stochastic elements like ENSO are averaged out.

  403. Christopher
    Posted May 9, 2008 at 8:10 AM | Permalink

    Re: beaker Yeah, I tried to look at this by denying all assumptions/knowledge and turning it into an abstract statistical exercise. And I wanted to get an actual probability, but not in the SD vs SE context (that's why I went nonparametric). I know of no test that can get at a chaotic component (the endgame metaphor). You have mentioned the SD test could, but I don't see how that could work. SD is also unbiased as n -> infinity (sans chaos). And since climate_sigma is in fact a "bias", you would miss it, no? Or, to turn it around, are you not making assumptions about the magnitude of climate_sigma? Saying that it's "somewhere between" 2 asymptotic SE and 2 asymptotic SD? Put another way, if climate_sigma > 2 asymptotic SD then the same result ensues as with the vanilla SE test.

    Here are some numbers. I can generate 10000 standard normals. I get a sigma of 1.001 with whatever seed Matlab picked this instance, and an SE of 0.01. Fine. Let's say I do a 2SD test. The test will fail (in your sense) if climate_sigma > 2.002? Yes? And the 2SE test fails if climate_sigma > 0.02. So I can have my perfect model but will still "miss" in both cases? It's easier to "miss" on the SE test because the 0.02 will get ever smaller asymptotically. But you can "miss" on the SD test too. See, this is why I said that I am not interested in chasing the unknowable. I'd love to have at it, but if it is, by definition, truly unknowable then, well, I'm not sure what else there is to do. The other thing here is that you can condition on the ensemble to estimate climate_sigma. In my previous post the relevant CI was (13.45, 285.25). So the observations are not in agreement with the ensemble means provided that climate_sigma is less than 13.45.
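    For anyone who wants to reproduce the flavour of those numbers, here is a quick Python equivalent of the Matlab exercise above (the exact values will vary slightly with the random seed):

        # With n = 10000 standard normals the sample SD stays near 1 while the
        # SE of the mean shrinks to about 0.01, so a 2SE criterion is roughly a
        # hundred times stricter than a 2SD criterion, and tightens further as n grows.
        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.standard_normal(10000)

        sd = x.std(ddof=1)
        se = sd / np.sqrt(x.size)
        print(f"sd = {sd:.3f}   -> 2SD threshold ~ {2 * sd:.3f}")
        print(f"se = {se:.4f}  -> 2SE threshold ~ {2 * se:.4f}")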

  404. Christopher
    Posted May 9, 2008 at 8:12 AM | Permalink

    Last bit got cut:

    “So the observations are not in agreement with ensemble means provided that climate_sigma is

  405. Kenneth Fritsch
    Posted May 9, 2008 at 8:12 AM | Permalink

    Re: #388

    Note that the ensemble predicts the climate with the effect of ENSO removed, but it is being compared with the trend for the observed climate, where the effect of ENSO has not been removed. It is not a like-for-like comparison, so there is no reason to expect the ensemble mean to exactly match the observations.

    Beaker, this observation gets to the crux of my proposition that, since Douglass et al. uses ratios of trends (and of a considered GHG fingerprint at that), the chaotic content is not in play. You have not addressed that proposition in any of your replies. I can show that ENSO has no significant effect on the ratios of annual surface to troposphere temperatures and trends.

    Sadly, you can’t really estimate E using the models, as you end up with a circular argument.

    Which, if true, and in the strictest meaning it is true, means we have no consistent way of determining whether an individual model or an ensemble average is different from the observed. You have consistently ignored using an SE test where Xmodels – Xobserved = E, where E could be some reasonable estimate of the observed variable content. For practical purposes, if E is large then the climate becomes essentially unpredictable to any reasonable limits. (Note that I am talking about climate here and not what Douglass et al. were measuring.)

    Beaker, when you note that use of an SE test can reject a perfect model, I think that confuses what the SE test is actually testing. It is testing the average of an ensemble, because we have no method of determining which individual model is perfect. That perfect-model result (and not necessarily from a perfect model) could have occurred by chance.
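    One way to read the Xmodels – Xobserved = E suggestion (my reading, offered as a sketch rather than as Kenneth's intended procedure) is to test whether the ensemble mean differs from the observation by more than an allowance E, instead of testing whether it differs at all:

        # Sketch of an SE test with an allowance E: reject only if the ensemble
        # mean differs from the observation by more than E plus the 2SE sampling
        # uncertainty. All numbers are hypothetical placeholders.
        import numpy as np

        def se_test_with_allowance(model_trends, obs_trend, E):
            trends = np.asarray(model_trends, dtype=float)
            mean = trends.mean()
            se = trends.std(ddof=1) / np.sqrt(trends.size)
            return abs(mean - obs_trend) > E + 2 * se   # True -> reject

        trends = [0.10, 0.18, 0.22, 0.25, 0.27, 0.30, 0.16, 0.21]      # K/decade, made up
        print(se_test_with_allowance(trends, obs_trend=0.12, E=0.0))   # plain 2SE test
        print(se_test_with_allowance(trends, obs_trend=0.12, E=0.1))   # with an allowance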

  406. beaker
    Posted May 9, 2008 at 8:14 AM | Permalink

    Cheers Cliff, I should have been more even-tempered as well; such misinterpretations are not uncommon on the net! It is the key point though: the models can't predict things like ENSO, so they can never predict the observed trends exactly. The SE test is a test of whether the population mean could be equal to some value for a given level of confidence, so it is testing for something we already know to be false!

  407. Christopher
    Posted May 9, 2008 at 8:18 AM | Permalink

    Still got cut: “less then 13.45.”

  408. beaker
    Posted May 9, 2008 at 8:30 AM | Permalink

    Kenneth: Can you show that the ratio cancels the effect of ENSO exactly? If it doesn't, the SE test will reject a perfect model of the physics.

    Have a good weekend everybody, I am off to play some more cricket!

  409. Kenneth Fritsch
    Posted May 9, 2008 at 8:43 AM | Permalink

    Christopher @ #403:

    I agree with your exercise and with showing that, if we take the strict view of Beaker and have no limit estimate on E, we have no way of comparing model output to the observed using SE or SD. If E is without limit then surely we conclude climate is not predictable, projectable or estimable. The models then become an exercise like solving a crossword puzzle, with little practical value.

    Karl et al. took a range of model outputs, without an average, when comparing the ratios of modeled tropical surface to troposphere temperature trends and then, according to Douglass et al., Karl noted that the observed value was within the range of model outputs and thus could not conclude that model and observed outputs were different. In fact he concluded that the results indicated that the observed data could be flawed.

    To be fair to Beaker, I think his POV is in line with that of the climate modelling world. Remember that the IPCC and climate modelers started making projections of models into the future, where there is no observed record for comparison. They could always fall back on a show of hands by the experts to put some uncertainty limits on those projections. Obviously a comparison of the observed to the model simulations gets more complicated. From my reading of the IPCC AR4 reports, some authors push for use of an ensemble average but, as I recall, do not provide any means of statistically comparing it to the observed other than a show of expert hands.

  410. steven mosher
    Posted May 9, 2008 at 8:52 AM | Permalink

    [Image: posted by moshersteven at 2008-05-09]

    Test post of ModelE hindcast versus Hadcru temps 1880-2003

  411. steven mosher
    Posted May 9, 2008 at 8:54 AM | Permalink

    [Image: ModelE hindcast versus HadCRU temps 1880-2003, posted by moshersteven at 2008-05-09]

  412. Steve McIntyre
    Posted May 9, 2008 at 9:11 AM | Permalink

    Mosh, can you refresh the exact URL for downloading the data, and any readme?

  413. Cliff Huston
    Posted May 9, 2008 at 9:43 AM | Permalink

    #395 Beaker,
    You posed the perfect model ensemble, with an infinite number of member models, to show that the SE error bars would go to zero. I posed the perfect model ensemble, with a finite number of perfect member models, to show that the SD error bars would go to zero. In both cases we have thought experiments that cannot actually be carried out. To apply to Douglass et al, ENSO-like effects are removed in both cases. The only chaos elements involved in judging against real-world observations are those of the real world. The case for running an ensemble of perfect models with the same initial conditions is no sillier than running an infinite number of imperfect models to produce a perfect ensemble. Both cases are straw-man arguments; neither provides any useful insight into the correctness of using SD or SE in Douglass et al.

    I keep harping on the IPCC intent of the model ensemble because, if the ensemble average is not the ensemble output, there is no useful prediction made by the ensemble. If there is no useful prediction from the model ensemble, there is no way to judge the theory built into the models. The use of SD to judge the ensemble members is, to my mind, valid – that is: are the member models consistent? But to use 2SD as the ensemble prediction is to use, potentially, the worst allowable model data as the prediction. The IPCC claims observational consistency only with the model ensemble average, which they call the best estimate, and makes no claims regarding the ensemble model cloud. The methods used in Douglass et al are consistent with the IPCC's claim, in that they state they are using the best-estimate model data.
    Cliff

  414. steven mosher
    Posted May 9, 2008 at 10:11 AM | Permalink

    re 412

    Ok Steve, I'll post detailed instructions on how to get the data.

    What you will all see below is the ModelE results for 1880-2003 put on the same graph with the error bounds of the HadCRU observations.

    So ModelE is an ensemble of 5 runs, and the other two lines are the upper and lower limits of the HadCRUT3 error bounds.

    [Image: ModelE ensemble (5 runs) versus HadCRUT3 error bounds, 1880-2003, posted by moshersteven at 2008-05-09]

  415. steven mosher
    Posted May 9, 2008 at 10:25 AM | Permalink

    Ok getting ModelE data

    Start here http://data.giss.nasa.gov/

    See the link for climate simulations of 1880-2003. click that

    http://data.giss.nasa.gov/modelE/transient/climsim.html

    Here you will see the link to the paper and all the readme I know of

    Now, to get the data, look at Table 1.

    See line #4, ALL FORCINGS. These are the simulations that include all forcings (lines 1-3 contain individual forcings, like GHG only, or volcano only).

    On the left hand side of the table you will see links for the forcing. On the RIGHT
    you will see a list of RESPONSES.

    select “lat time”

    http://data.giss.nasa.gov/modelE/transient/Rc_jt.1.11.html

    Now you will see a pull down menu.

    See the first box, Quantity: pick surface temp (there are others as well).

    Mean Period: pick 1 month to get rid of the running mean.

    Time interval: pick what you like.

    Base period: I selected 1961-1990 because I wanted to compare ModelE to HadCRUT.

    Output: formatted page with download links.

    Show plot and then get the data

    http://data.giss.nasa.gov/work/modelEt/time_series/work/tmp.4_E3Af8aeM20_1_1880_2003_1961_1990-L3AaeoM20D/LTglb.txt
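    If anyone prefers to script this rather than click through the form, here is a rough Python sketch for fetching and parsing one of those formatted text files. The work/tmp link above is a temporary working link and has likely expired, so paste in the download link the form gives you; the parsing assumes a whitespace-separated layout whose data rows begin with a 4-digit year, and may need adjusting to the actual file.

        # Rough sketch: download a GISS ModelE time-series text file (as produced
        # by the form described above) and keep rows that begin with a 4-digit year.
        # The URL is a placeholder; the tmp links generated by the form expire.
        import urllib.request

        url = "PASTE_DOWNLOAD_LINK_HERE"  # e.g. the LTglb.txt link the form returns
        text = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

        rows = []
        for line in text.splitlines():
            parts = line.split()
            if parts and parts[0].isdigit() and len(parts[0]) == 4:
                try:
                    rows.append((int(parts[0]), [float(p) for p in parts[1:]]))
                except ValueError:
                    continue  # skip rows containing non-numeric entries
        print(f"parsed {len(rows)} data rows; first: {rows[0] if rows else None}")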

  416. Posted May 9, 2008 at 10:58 AM | Permalink

    Grammar police alert:

    Let’s try a different tact.

    Should be : Let’s try a different tack.

    The word tack comes from the days of wooden ships and iron men. Tacking was used when sailing into the wind. It was a means of maintaining the proper course by sailing to the right and left (starboard and port) of your desired course. When it came time to change course to keep the average direction, you were tacking the ship.

  417. MarkW
    Posted May 9, 2008 at 11:18 AM | Permalink

    Two of my favorite mangled sayings, that I’ve run across on the web.

    two-headed sword
    and
    pitch quiet

  418. Cliff Huston
    Posted May 9, 2008 at 11:47 AM | Permalink

    RE: 411 Steve Mosher,

    Any chance of talking you into doing the plot in 411 using 10-year trends? It would help us see how much 'weather' is in the modeled and observed decade trends.

    Pretty Please.

    Cliff

  419. Richard Sharpe
    Posted May 9, 2008 at 11:58 AM | Permalink

    Two of my favorite mangled sayings, that I’ve run across on the web.

    two-headed sword
    and
    pitch quiet

    I’m kinda partial to:

    tow the line

    and

    bare with me

  420. Sam Urbinto
    Posted May 9, 2008 at 1:15 PM | Permalink

    So if the ensemble mean gives the best estimate of the forced part of the climate and not the reality of the climate (forced plus random), isn’t it best to say that SEM can’t be used to match “missing-chaos forcing models of” climate compared to “having-chaos observed” climate? And therefore, using SEM to disprove itself as a valid statistical method (circular statistical test?) shows the models aren’t models of reality and are at least biased?

    Biased to replicating “un-physical reality forcing-only” climate rather than “physical reality random-chaotic is included” climate. (If that’s bias or invalidation, whatever.)

    On the other hand, we can say that using SD rather shows that from the other direction, we get too much uncertainty (large error bars to pass (a “low bar”, indeed)) because “the models can’t predict things like ENSO so they can never predict the observed trends exactly” so all we get is an idea of what’s going on that can’t be trusted, due to the (whatever you want to call it) of the models versus reality.

    So really, again, all the ‘argument’ is about is what are we trying to prove how when? It appears we have a case where nothing can reconcile the ensemble with observations, but we know that anyway because the models miss a piece.

    Nature.

    So why even bother trying to reconcile them, there’s no test they can pass.

  421. Christopher
    Posted May 9, 2008 at 1:56 PM | Permalink

    >So why even bother trying to reconcile them, there’s no test they can pass.

    Well, I do not believe this. Climate_sigma is overrated, and there are tests that circumvent SD vs SE (I performed one earlier up). I think a very good test could be had by implementing a grid of parameter values and initial conditions across the current state-of-the-art generation of GCMs. Such a "grid search" would likely entail hundreds of runs on each GCM – a problem, to be sure, but I'm trying to tease out whether such a test is achievable. In any event, the SAT trajectories for each run by each GCM form the raw data. And then you apply a fairly complicated bootstrap to get a probability. This is doable, and I would trust the result, as the irreducible climate sigma is factored out: it's in each individual SAT trajectory and it's in the observations. You circumvent the "n thing" by comparing a bootstrap sample to an observed sample; one cannot swamp the other. It's like doing the test I did earlier but generating your own critical values based on the bootstrap world.
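    As a crude illustration of the general mechanics (not the full trajectory-level bootstrap described above), one can resample individual model trends rather than their mean, so the resulting interval does not shrink with ensemble size the way the SE does. All numbers below are made up.

        # Crude sketch: bootstrap critical values from resampled individual model
        # trends and place the observation within them. Placeholder numbers only;
        # the real exercise would bootstrap whole SAT trajectories from many runs.
        import numpy as np

        rng = np.random.default_rng(1)
        model_trends = np.array([0.10, 0.18, 0.22, 0.25, 0.27, 0.30, 0.16, 0.21])  # hypothetical
        obs_trend = 0.12                                                            # hypothetical

        boot = rng.choice(model_trends, size=10000, replace=True)
        lo, hi = np.percentile(boot, [2.5, 97.5])
        print(f"bootstrap 95% interval: ({lo:.2f}, {hi:.2f}); obs = {obs_trend:.2f}")
        print("observation inside interval:", lo <= obs_trend <= hi)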

  422. Sam Urbinto
    Posted May 9, 2008 at 3:16 PM | Permalink

    Christopher #421

    I don’t know, you could be correct.

    I suppose it depends upon your meaning of reconcile.

    Just thinking out loud, would it be possible to find models that are accurate (say 5% or so) at one or more altitudes, and only use the accurate altitudes of each until a combination follows reality pretty well?? Franken-model!

  423. Willis Eschenbach
    Posted May 10, 2008 at 2:32 AM | Permalink

    As always, reading everyone's comments brings up interesting ideas.

    Regarding model ensembles, it is worth noting the IPCC's use of a "simplified" model. This model, called MAGICC (Model for the Assessment of Greenhouse gas-Induced Climate Change), can be tuned to reproduce the outputs of the various individual climate models, without the weather "noise". The UN IPCC uses it to give many of the outputs shown in their report. Here's a quote:

    By using such simple models, differences between different scenarios can easily be seen without the obscuring effects of natural variability, or the similar variability that occurs in coupled AOGCMs (Harvey et al., 1997). Simple models also allow the effect of uncertainties in the climate sensitivity and the ocean heat uptake to be quantified.

    Curiously, it can reproduce the output of the various models with the use of only six parameters. Here are typical settings, along with the explanations, from here:

    Table 9.A1: Simple climate model parameter values used to simulate AOGCM results. In all cases the mixed-layer depth hm=60m, the sea ice parameter CICE=1.25 and the proportion of the upwelling that is scaled for a collapse of the thermohaline circulation is 0.3, otherwise parameters are as used in the SAR (Kattenberg et al., 1996; Raper et al., 1996).

    AOGCM          F2x (Wm-2)   T2x (°C)   T+ (°C)   k (cm2 s-1)   RLO   LO|NS (Wm-2 °C-1)
    GFDL_R15_a        3.71         4.2         8         2.3        1.2         1
    CSIRO Mk2         3.45         3.7         5         1.6        1.2         1
    HadCM3            3.74         3          25         1.9        1.4         0.5
    HadCM2            3.47         2.5        12         1.7        1.4         0.5
    ECHAM4/OPYC       3.8          2.6        20         9          1.4         0.5
    CSM 1.0           3.6          1.9         -         2.3        1.4         0.5
    DOE PCM           3.6          1.7        14         2.3        1.4         0.5

    F2x – the radiative forcing for double CO2 concentration
    T2x – climate sensitivity
    hm – mixed-layer depth
    CICE – sea ice parameter (see Raper et al., 2001a)
    T+ – magnitude of warming that would result in a collapse of the THC
    k – vertical diffusivity
    RLO – ratio of the equilibrium temperature changes over land versus ocean
    LO|NS – land/ocean and Northern Hemisphere/Southern Hemisphere exchange coefficients

    Now, the existence of this model and its use by the IPCC brings up some very interesting issues.

    First, it directly answers the question about the independence of the models. Since the individual models’ global temperature results can all be replicated by the MAGICC model, by changing only six parameters, it is quite clear that regarding the global mean temperature, the global climate models are not separate individual realizations. They are just variations of the basic MAGICC assumptions. They are nothing more than the results of a time-dependent function

    Temperature = MagicFunction(v1, v2, v3, v4, v5, v6, F)

    where v1 through v6 are the six variables, and F is a time-dependent set of forcing values. Different results are in no sense independent; they are the result of the same function with different tunable parameters.

    Next, it lets us understand the nature of a large ensemble of models. Since the models are all reproducible by varying those six numbers, we can in theory use the MAGICC model to give us a probabilistic distribution of all possible model results. All we need to do is to generate random numbers, perhaps using the mean and the standard deviation of those variables as given by the different GCM models. That will give us a probabilistic distribution of the results of all possible models for a given set of forcings.

    Alternatively, we could use the range of values given by current models, and randomly generate values bounded by that range. This, of course, will give a wider distribution of outcomes. However, by the law of large numbers, their mean will not be far from the mean done the other way (provided that we include sufficient models to start with).

    Next, from this universe of models, we throw away all of the realizations that are not within a certain distance of the historical record given the historical forcings. It’s the necessary (but not sufficient) first step. This gives us the universe of MAGICC-type climate models that can hindcast the historical climate, including all those that exist now.
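    A sketch of the sampling-and-screening procedure just described, in Python. The toy_emulator below is a purely hypothetical stand-in for MAGICC (it is not the real model), and the parameter ranges, forcing history and tolerance are illustrative only; the point is the structure: draw random parameter sets, hindcast, keep the survivors.

        # Sketch of: draw random parameter vectors, run a toy (hypothetical)
        # emulator against a stand-in historical record, keep the draws whose
        # hindcast error is small. None of the physics here is real MAGICC.
        import numpy as np

        rng = np.random.default_rng(42)

        # Illustrative ranges loosely spanning the table above
        # (F2x, T2x, T+, k, RLO, LO|NS).
        ranges = np.array([[3.4, 3.8], [1.7, 4.2], [5.0, 25.0],
                           [1.6, 9.0], [1.2, 1.4], [0.5, 1.0]])

        def toy_emulator(params, forcing):
            # Hypothetical placeholder: scale the forcing history by the
            # sensitivity-to-forcing ratio. Real MAGICC is far more involved.
            f2x, t2x = params[0], params[1]
            return (t2x / f2x) * forcing

        forcing = np.linspace(0.0, 2.0, 120)   # stand-in forcing history
        observed = 1.0 * forcing               # stand-in "historical record"

        draws = ranges[:, 0] + (ranges[:, 1] - ranges[:, 0]) * rng.random((5000, 6))
        rmse = np.array([np.sqrt(np.mean((toy_emulator(p, forcing) - observed) ** 2))
                         for p in draws])
        survivors = draws[rmse < 0.2]
        print(f"{len(survivors)} of {len(draws)} draws survive the hindcast screen")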

    It is worth stopping at this point to consider this ensemble of thousands of remaining different model results. A long-standing unexplained puzzle of the models is, why does an ensemble of models give better results than a single model? According to the IPCC, it should be more accurate (closer to the observations) than any single random realization. I would suggest the reason is that through our tuning procedures, we are drawing from a universe of models whose overall average is a good approximation to the historical record. Because of this, we would expect the average to do better than any given model at hindcasting the record …

    Next, it is worth recalling the recent Kiehl study discussed elsewhere on this site. He states that:

    A number of studies (Raper et al., 2001; Forest et al., 2002; Knutti et al., 2002] have shown that there are three fundamental climate factors, namely: the climate forcing, the climate sensitivity, and the efficiency of ocean heat uptake. Given these three factors one can calculate the evolution of the global mean surface temperature. All of the models applied to simulating the 20th century climate represent these factors in varying degrees of sophistication.

    These three factors are among the six factors listed above.

    There are more issues here, but I'm going to bed.

    w.

  424. Ron Cram
    Posted May 10, 2008 at 4:08 AM | Permalink

    Willis,
    re: 423

    Outstanding post! What is that old saying? “Give me six parameters and I can fit an elephant?” Two thoughts jump out at me here.

    First: Is this model used in attribution studies? If the model is designed only to predict climate change on the basis of greenhouse gases, isn’t this circular reasoning? Where is the model that seeks to understand natural climate variability? The model where you can tune TSI, PDO, ENSO, etc? Isn’t the model assuming zero natural climate variability? How can they say the temp record cannot be explained by natural climate variability alone if they have never tried?

    Second: The T+ parameter strikes me as odd. Is the only variable the “magnitude of the warming that would result from a collapse of the THC?” Can you not also determine at what level of warming the THC would collapse? It seems a great deal of uncertainty around that point is glossed over.

  425. steven mosher
    Posted May 10, 2008 at 6:44 AM | Permalink

    re 418: not sure what you want exactly, can you explain?

  426. Michael Smith
    Posted May 10, 2008 at 12:23 PM | Permalink

    Beaker wrote in 392:

    The point is that the ensemble mean is not intended to model the observed climate, only the forced climate (see quote from the IPCC). Therefore one would not expect the ensemble mean to exactly match the observations.

    But the ensemble mean does match the observations – it matches the surface observations. So we are left with the notion that the chaotic element of the climate is suppressing the warming the models predict for the troposphere but not suppressing the warming the models predict for the surface.

    However, surface and troposphere observations generally move together:

    In that plot, it certainly looks like whatever chaotic element is causing the swings is affecting both the surface and the troposphere. I certainly don’t see any evidence there of internal variability that can cause a lasting shift in one set of those observations without affecting the other set as well. Thus, I don’t see how the chaotic element of the climate can be invoked to explain why the ensemble mean doesn’t match the troposphere observations even though it does match the surface.

    Or am I missing something here?

  427. Kenneth Fritsch
    Posted May 10, 2008 at 1:04 PM | Permalink

    Re: #426

    I would like to take the discussion where you are attempting to go with it, but I am afraid the operative Beaker response here will be: but is it a perfect match.

    Here is some more practical insight into this phenomenon below.

    By the way, how would you like to have Beaker calling balls and strikes if you were the pitcher in a baseball game?

    The link below from which I excerpted the following seems pertinent to this discussion. Note my bolded emphasis in the first paragraph below.

    Click to access 226.pdf

    Also consider just one of the shortcoming of climate models cited by the IPCC (and listed earlier in this paper), their inability to reproduce the observed vertical temperature profile of the atmosphere. All climate models project that increased greenhouse gas concentrations should lead to the mid- and upper troposphere warming faster than the surface. However, data for the last two decades indicates that the troposphere has warmed at a considerably slower rate than the surface. In 2000, the National Research Council concluded that the differences in warming trends between the surface and troposphere were real, i.e., not the result of measurement errors, and that they were not adequately reflected in climate models.20 More recently, Chase, et al.,21 researchers at the University of Colorado, Colorado State University, and University of Arizona, examined whether the differences between observations and the outputs of four widely-used GCMs were caused by either forcing uncertainties, i.e., uncertainty in the effects of greenhouse gases, aerosols, etc. on climate; or by chance model fluctuations, i.e., the variability caused by the model’s representation of the chaotic behavior of the climate system. The authors found that neither of these factors explained the differences between model projections and observations. They further concluded:

    Significant errors in the simulation of globally averaged tropospheric temperature structure indicate likely errors in tropospheric water-vapor content and therefore total greenhouse-gas forcing, precipitable water, and convectively forced large-scale circulation. Such errors argue for extreme caution in applying simulation results to future climate-change assessment activities and to attributions studies and call into question the predictive ability of recent generation model simulations, the most rigorous test of any hypothesis.22

    The errors identified are in the fundamental equations in climate models, and relate to the water vapor feedback that is part of every climate model. Without this feedback, doubling the atmospheric concentration of CO2 would result in a global average surface temperature increase of 1.2°C. However, any increase in surface temperature will increase the rate at which water is evaporated and raise the average atmospheric concentration of water vapor. Since water vapor is a greenhouse gas, the result is a further increase in temperature. Climate models project that doubling the atmospheric concentration of CO2 would result in a global average surface temperature increase of between 1.5 and 4.5°C. This large range is due to the differences in the way the models handle the water vapor feedback. The increase in atmospheric concentration of water vapor also results in models projecting an increase in global average precipitation.

    Building more elaborate models at this time is unlikely to address the errors identified by the Chase, et al. As the Marshall Institute argued in it comments on the CCSP’s Draft Strategic Plan: “… model development should proceed only as fast as theoretical understanding of the climate system and validation permit.”23 Water vapor feedback is understood in qualitative terms; greater quantitative understanding is now required. On-going scientific studies offer hope that more quantitative understanding of the water vapor feedback is achievable. For example, a recent paper by Minschwaner and Dessler24 discussed observations from NASA satellites and scientific analysis indicating that there is less water vapor in the upper atmosphere than assumed by some climate models. In other words, climate models overestimate the size of the water vapor feedback and therefore potential future temperature rise. These findings, if validated by additional studies and then incorporated into existing climate models, should reduce the spread between model outputs.

  428. kuhnkat
    Posted May 10, 2008 at 2:03 PM | Permalink

    I finally found the quote!!!!!!

    http://petesplace-peter.blogspot.com/2008/05/predictive-power-of-computer-climate.htm

    The modelers can find a way to show anything. As the famous mathematical physicist von Neumann said:

    “If you allow me four free parameters I can build a mathematical model that describes exactly everything that an elephant can do. If you allow me a fifth free parameter, the model I build will forecast that the elephant will fly.”

    That is by the way why many of us more senior climatologists and meteorologists prefer to work with real data and correlate factors with real data than depend on models.

  429. Cliff Huston
    Posted May 10, 2008 at 2:10 PM | Permalink

    RE: 425 Steven Mosher,

    I was thinking of something like this:
    Plot @ 1890 Temperature Trend 1880-1890 (annual anomalies)
    Plot @ 1891 Temperature Trend 1881-1891
    Plot @ 1892 Temperature Trend 1882-1892
    .
    .
    .
    Plot @ 2007 Temperature Trend 1997-2007

    The idea was to get a look at how much decade trend wiggle there is over time, in the observed and modeled. I was hoping that you could do that without too much trouble.

    Cliff
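    If it helps, the rolling-trend series Cliff is asking for is easy to compute once the annual anomalies are in hand. A sketch (synthetic data stands in for the observed and modeled series; a window of 11 annual values matches "1880-1890 plotted at 1890"):

        # Sketch of the rolling decade-trend series described above: the OLS slope
        # of the annual anomalies over each trailing 11-year window, reported in
        # K/decade and plotted at the window's final year.
        import numpy as np

        def rolling_decade_trends(years, anomalies, window=11):
            years = np.asarray(years, dtype=float)
            anomalies = np.asarray(anomalies, dtype=float)
            out = []
            for i in range(window - 1, len(years)):
                y = years[i - window + 1 : i + 1]
                a = anomalies[i - window + 1 : i + 1]
                slope = np.polyfit(y, a, 1)[0]             # K per year
                out.append((int(years[i]), 10.0 * slope))  # K per decade
            return out

        # Toy usage with synthetic data; substitute the HadCRUT and ModelE series.
        yrs = np.arange(1880, 2008)
        fake = 0.005 * (yrs - 1880) + 0.1 * np.sin(yrs / 7.0)
        for year, trend in rolling_decade_trends(yrs, fake)[:3]:
            print(year, round(trend, 3))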

  430. Kenneth Fritsch
    Posted May 10, 2008 at 2:42 PM | Permalink

    Since Beaker's references have been almost exclusively to RC, he should be aware of how Gavin Schmidt contrasts weather and climate modeling with regard to chaotic content, as in the excerpt from the link below.

    http://www.giss.nasa.gov/research/briefs/schmidt_04

    Climate modeling is also fundamentally different from weather forecasting. Weather concerns an initial value problem: Given today’s situation, what will tomorrow bring? Weather is chaotic; imperceptible differences in the initial state of the atmosphere lead to radically different conditions in a week or so. Climate is instead a boundary value problem — a statistical description of the mean state and variability of a system, not an individual path through phase space. Current climate models yield stable and nonchaotic climates, which implies that questions regarding the sensitivity of climate to, say, an increase in greenhouse gases are well posed and can be justifiably asked of the models. Conceivably, though, as more components — complicated biological systems and fully dynamic ice-sheets, for example — are incorporated, the range of possible feedbacks will increase, and chaotic climates might ensue.

  431. Willis Eschenbach
    Posted May 10, 2008 at 3:48 PM | Permalink

    kuhnkat, you are close above, but not quite correct. The actual quote was:

    When I arrived in Fermi’s office, I handed the graphs to Fermi, but he hardly glanced at them. He invited me to sit down, and asked me in a friendly way about the health of my wife and our newborn baby son, now fifty years old. Then he delivered his verdict in a quiet, even voice. “There are two ways of doing calculations in theoretical physics”, he said. “One way, and this is the way I prefer, is to have a clear physical picture of the process that you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.”

    I was slightly stunned, but ventured to ask him why he did not consider the pseudo-scalar meson theory to be a self-consistent mathematical formalism. He replied, “Quantum electrodynamics is a good theory because the forces are weak, and when the formalism is ambiguous we have a clear physical picture to guide us. With the pseudo-scalar meson theory there is no physical picture, and the forces are so strong that nothing converges. To reach your calculated results, you had to introduce arbitrary cut-off procedures that are not based either on solid physics or on solid mathematics.”

    In desperation I asked Fermi whether he was not impressed by the agreement between our calculated numbers and his measured numbers. He replied, “How many arbitrary parameters did you use for your calculations?” I thought for a moment about our cut-off procedures and said, “Four.” He said, “I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” With that, the conversation was over. I thanked Fermi for his time and trouble, and sadly took the next bus back to Ithaca to tell the bad news to the students.

    Freeman Dyson, http://www.nature.com/nature/journal/v427/n6972/full/427297a.html

    w.

  432. steven mosher
    Posted May 10, 2008 at 3:56 PM | Permalink

    re 429. ok, I was planning on doing something like that. with a twist

  433. Willis Eschenbach
    Posted May 10, 2008 at 4:26 PM | Permalink

    Ron, thanks for your comments. You say:

    First: Is this model used in attribution studies? If the model is designed only to predict climate change on the basis of greenhouse gases, isn’t this circular reasoning? Where is the model that seeks to understand natural climate variability? The model where you can tune TSI, PDO, ENSO, etc? Isn’t the model assuming zero natural climate variability? How can they say the temp record cannot be explained by natural climate variability alone if they have never tried?

    This is several questions. AFAIK the model is not used in attribution studies. This model is only used because it can model the GMST (global mean surface temperature) output of other, much more complex models, in a very short time (compared to how long it takes to do model runs).

    There are precious few models that seek to understand natural variability. This MAGICC model does not assume zero natural variability; it only models other models' output without their inherent variability.

    Modelers claim the warming at the end of the last century can't be explained by natural variability because of the following (deeply flawed) logic:

    1) Tune a model with all (or more likely some subset) of the known forcings to match the historical record.

    2) Pull out the anthropogenic forcings, and note that the recent years no longer match the historical record.

    3) Declare the case closed.

    I find the recent study saying that we won’t see warming for twenty years unintentionally quite hilarious in this regard. After all the claims that natural variability could not be the reason for the late 20th century warming, guess what they ascribe the current cooling to … you guessed it, natural variability. And not only that, they ascribe it to a particular natural variability (the PDO) that is known to have caused warming for the last thirty years …

    Second: The T+ parameter strikes me as odd. Is the only variable the “magnitude of the warming that would result from a collapse of the THC?” Can you not also determine at what level of warming the THC would collapse? It seems a great deal of uncertainty around that point is glossed over.

    You have mis-read the citation, which says: “T+ – magnitude of warming that would result in a collapse of the THC” (emphasis mine).

    As I mentioned above, Kiehl shows that the modeling can be done with three variables: forcing, sensitivity, and ocean heat uptake. The T+ parameter is a part of the third variable; it describes how the ocean uptake varies with temperature.

    w.

  434. steven mosher
    Posted May 10, 2008 at 5:40 PM | Permalink

    RE 433. Willis, you will get a kick out of this. Long ago I used to do wargaming modelling. Count D' bodies was my name. So we would start with a simple model, expected value stuff, Lanchester equations: Count D' bodies. We had micro models, regional models, global models. More Count D' bodies. And you could explore a whole parameter space in an hour of computer time. As a joke once I wrote a program that calculated every possible result of our simple model. Choked a disk drive to near death. Anyway, people were not satisfied with these simple models, so we improved them. Explosions were modelled in glorious fragmentary detail. This piece of metal severed that artery. This pressure wave blew that poor f*** into the burning gas fire. Count D' bodies. This sensor can't see through this fog, poor f*** dies. Count D' bodies.

    So the physics of the models got more and more complex. And you reached a point where nobody understood the entire model, or maybe one twisted soul did.

    Then, after finishing all of this huge effort at creating a physical simulacrum, the boss would say "can you tune the simple model to match the complex one?"

    WTF, over!

    At which point one considered blue-on-blue missions.

  435. Sam Urbinto
    Posted May 10, 2008 at 7:36 PM | Permalink

    Does it matter exactly how the bits of metal fly out of the explosion to know they do?

  436. Willis Eschenbach
    Posted May 11, 2008 at 3:48 AM | Permalink

    mosh, too good.

    Now, back to our regularly scheduled programming.

    w.

  437. Michael Smith
    Posted May 11, 2008 at 5:35 AM | Permalink

    Re: #427

    I would like to take the discussion where you are attempting to go with it, but I am afraid the operative Beaker response here will be: but is it a perfect match.

    He may well ask that question. But I'd like to hear him explain how natural variability can account for the fact that the surface-troposphere relationship is the opposite of what AGW theory requires — not just what it predicts, but what it requires. (I think you've been making essentially the same point without getting a response, so I may not get one either.)

    In the first place, the surface and troposphere observations generally move together, and, as I said in 426, there is simply no evidence in that data that shows internal variability having a long term effect (i.e. one lasting more than a month or two) on one of the two without also affecting the other.

    Second, even if there is some natural variability suppressing troposphere temperatures without a corresponding effect on surface temperatures, that must mean that CO2-induced warming can account for only a portion of the observed surface heating; some other (non-CO2) factor must have contributed a significant portion of it, and climate sensitivity must be lower than the models predict.

    In either case, I see no way to invoke internal variability to save the ensemble from the conclusions reached by Douglass.

  438. Cliff Huston
    Posted May 11, 2008 at 8:55 AM | Permalink

    Since no one has offered a plot of the Douglass et al raw model data with +/- 2SD, +/- SE and the balloon data, I thought I would give it a try:

  439. Cliff Huston
    Posted May 11, 2008 at 9:10 AM | Permalink

    Well, it worked in preview – one more try:

  440. steven mosher
    Posted May 11, 2008 at 9:31 AM | Permalink

    Cliff.

    For the 10 years series is this what you want?

  441. Kenneth Fritsch
    Posted May 11, 2008 at 3:28 PM | Permalink

    In my reading of the background material pertinent to the subject of this thread, a rather repetitious and perhaps predictable theme runs through the publications that compare climate model output to observed results. It is expressed well in a paragraph of an article written by authors who are not antagonistic to the consensus view of AGW. The phrase “climate scientists have resisted statistical expression of uncertainty of climate model projections” says a lot, and I think it explains some of the reaction of the bloggers/scientists at RC who object so vociferously to the rigorous Douglass et al. testing of the ratios of troposphere to surface temperature trends.

    Click to access paper06.pdf

    The first excerpt below summarizes the progression of expressing uncertainty of climate models and the components of it. The authors are proposing a Bayesian approach to comparing observed and model climate and using it to evaluate model variations in projecting future climate.

    Note that the second excerpt below points to a rather obvious and critical limitation in their approach: the evaluation depends on cross validation of model outputs and while they claim to answer the skeptical claim that variation within the models invalidates them, the authors admit that “if there were systematic errors affecting future projections in all the GCMs, our procedures could not detect that”.

    Until recently, most climate scientists have resisted statistical expressions of uncertainty in climate models. IPCC (2001, page 2) recommended expressing the uncertainties of climate model projections using phrases such as “virtually certain” (greater than 99% chance that a result is true), “very likely” (90–99% chance), and so on, but gave no guidance as to how these chances were to be assessed. Even this was considered an advance on previous IPCC reports, which made no formal acknowledgement of the role of uncertainty.

    Nevertheless, over the past few years there has been growing recognition of the need for more rigorous statistical approaches. In this section, we summarize some of the main developments. Uncertainties in climate change projections are broadly of three types, (a) natural climate variability, (b) uncertainties in the responses to climate forcing factors, such as changes in atmospheric levels of greenhouse gases and sulfate aerosols, and (c) uncertainties in future emissions of greenhouse gases and other factors that could influence climate. The first two types of uncertainty are typically assessed in “detection and attribution” studies, which calibrate climate models based on their fit to existing observational data and which attempt to decompose observed changes into components associated with greenhouse gases, aerosols, solar fluctuations, and other known influences on the earth’s climate.

    There are of course some limitations to what these procedures can achieve. Although the different climate modeling groups are independent in the sense that they consist of disjoint groups of people, each developing their own computer code, all the GCMs are based on similar physical assumptions and if there were systematic errors affecting future projections in all the GCMs, our procedures could not detect that. On the other hand, another argument sometimes raised by so-called climate skeptics is that disagreements among existing GCMs are sufficient reason to doubt the correctness of any of their conclusions. The methods presented in this paper provide some counter to that argument, because we have shown that by making reasonable statistical assumptions, we can calculate a posterior density that captures the variability among all the models, but that still results in posterior-predictive intervals that are narrow enough to draw useful conclusions.

  442. Cliff Huston
    Posted May 11, 2008 at 4:52 PM | Permalink

    Mosh,

    Thanks for the plot, but I was looking for something a bit finer grained. Rather than plotting the trend lines and anomalies, I was looking for an annual plot of the decade trends (degrees/decade). Starting in 1890, each annual plot point would be the trend from the decade before. In the case you plotted, the decade trend for observation would be roughly -.03 degrees/decade and for ModelE the decade trend would be roughly -.35 degrees/decade. So the plot points for 1890 would be -.03 for observed and -.35 for ModelE. Shift one year on the anomaly data, calculate the trends from 1881 to 1891 and plot those points for 1891. Rinse and repeat through 2007 (or the last year of your anomaly data).
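
    For what it’s worth, here is a minimal sketch of that rolling decade-trend calculation in Python/NumPy. The anomaly series below is a made-up placeholder, not the actual observed or ModelE data from #411; substitute the real series to get the plot points described above.

    ```python
    import numpy as np

    def decade_trends(years, anomalies, window=10):
        """For each year, the OLS trend (deg/decade) over the preceding `window` years."""
        out_years, out_trends = [], []
        for i in range(window, len(years) + 1):
            x = years[i - window:i]
            y = anomalies[i - window:i]
            slope = np.polyfit(x, y, 1)[0]      # degrees per year
            out_years.append(years[i - 1])
            out_trends.append(slope * 10.0)     # degrees per decade
        return np.array(out_years), np.array(out_trends)

    # Hypothetical placeholder series (1880-2007), to be replaced by the real anomalies.
    yrs = np.arange(1880, 2008)
    anom = 0.005 * (yrs - 1880) + 0.1 * np.random.default_rng(0).standard_normal(len(yrs))

    plot_years, trends = decade_trends(yrs, anom)
    ```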

    Since Douglass et al are comparing model to observed decade trends in the troposphere, I got to wondering about how well the model surface decade trends track the observed surface decade trends. Will this plot be useful? Don’t know. But I expect it will be interesting. If this is more than you are up for, maybe you could send me the data used in the #411 post and I can give it a go in Excel. My email is Prop-Head at comcast dot net.

    Cliff

  443. Cliff Huston
    Posted May 11, 2008 at 6:17 PM | Permalink

    Here is Douglass et al data including the outliers at 1000 kPa:

    A lot of model ‘weather’ at 1000 kPa, we don’t want to lose that ‘information’. 🙂

    Cliff

  444. steven mosher
    Posted May 11, 2008 at 6:30 PM | Permalink

    re 441. Ahh, OK. I didn’t think you wanted me posting twelve plots. I’ll see what I can do in the next couple of days.

  445. Cliff Huston
    Posted May 11, 2008 at 6:32 PM | Permalink

    Oops, kPa -> hPa

    Cliff

  446. Sam Urbinto
    Posted May 12, 2008 at 12:06 PM | Permalink

    Michael Smith 427

    …the surface and troposphere observations generally move together… …there is simply no evidence in that data that shows internal variability having a long term effect… …on one of the two without also affecting the other.

    I’d think that would explain whatever extra IR effects we get from the long-lived gases as being taken care of by lapse rate and water vapor.

    Cliff Huston #439 #443

    So in other words, to make all the models fit (ignoring 1000 hPa but including it in the calculations), we have to add about +/- 1 C to our range (+/- .5 C at 500 hPa), etc. If we just leave the outliers out of the calculation, it’s about .2 +/- .2. Something like that.

    It appears surface to 925 is totally wrong, and from 200 and above very bad. Or am I misreading it?

    I see both SEM and SD saying the same thing basically; SEM says “They’re way far away from the truth” and SD says “I have to have a huge plus-minus to get all of these in here”.

  447. Cliff Huston
    Posted May 13, 2008 at 1:01 AM | Permalink

    #446 Sam,
    You say:

    It appears surface to 925 is totally wrong, and from 200 and above very bad. Or am I misreading it?

    I don’t know if surface to 925 hPa is totally wrong, but by my count 10 of the 22 models are not well at 1000 hPa, and you are right that above 200 hPa the model trends are way too high – way beyond even 2SD.

    I see both SEM and SD saying the same thing basically; SEM says “They’re way far away from the truth” and SD says “I have to have a huge plus-minus to get all of these in here”.

    The reason I wanted to post plots of the Douglass et al. data is that it is obvious that the wide 2SD is a function of the models not agreeing with each other. Beaker argues that the model spread is model ‘weather’, which ignores the fact that Douglass is using decade trends that average out short-term weather effects. Also, Beaker’s argument ignores the fact that the models in the ensemble are different programs, using different secret sauce to simulate the climate. Back in the Troposphere thread I wrote to Beaker:

    I think if you spend a little time studying this document: An Overview of Results from the Coupled Model Intercomparison Project (CMIP) ( http://www-pcmdi.llnl.gov/projects/cmip/overview_ms/ms_text.php ) you will find that what you are calling ‘simulated weather’ is actually only a difference between models.

    Here is a really short version of what I wanted Beaker to see.

    From the CMIP introduction:

    In 1995 the JSC/CLIVAR Working Group on Coupled Models, part of the World Climate Research Program, established the Coupled Model Intercomparison Project (CMIP; see Meehl et al. 2000). The purpose of CMIP is to provide climate scientists with a database of coupled GCM simulations under standardized boundary conditions. CMIP investigators use the model output to attempt to discover why different models give different output in response to the same input, or (more typically) to simply identify aspects of the simulations in which “consensus” in model predictions or common problematic features exist.

    Details of the CMIP database, together with access information, may be found on the CMIP Web site at http://www-pcmdi.llnl.gov/cmip/diagsub.php . The first phase of CMIP, called CMIP1, collected output from coupled GCM control runs in which CO2, solar brightness and other external climatic forcing is kept constant. (Different CMIP control runs use different values of solar “constant” and CO2 concentration, ranging from 1354 to 1370 W m-2 and 290 to 345 ppm respectively; for details see http://www-pcmdi.llnl.gov/cmip/Table.php ) A subsequent phase, CMIP2, collected output from both model control runs and matching runs in which CO2 increases at the rate of 1% per year. No other anthropogenic climate forcing factors, such as anthropogenic aerosols (which have a net cooling effect), are included. Neither the control runs nor the increasing-CO2 runs in CMIP include natural variations in climate forcing, e.g., from volcanic eruptions or changing solar brightness.
    CMIP thus facilitates the study of intrinsic model differences at the price of idealizing the forcing scenario.

    Fig. 1. Globally averaged annual mean surface air temperature (top) and precipitation (bottom) from the CMIP2 control runs.

    Fig. 20. Globally averaged difference between increasing-CO2 and control run values of annual mean surface air temperature (top) and precipitation (bottom) for the CMIP2 models. Compare with Fig. 1, which gives control run values.

    The model wiggles in the figures are from internal weather effects and it is clear that these effects would be lost in a decade trend. It is also clear that there is disagreement on what constitutes a correct physical model, shown by the 11.5C to 16.5C range in Fig. 1. The IPCC deals with this model disagreement by using the average of the ensemble, as their best estimate data, to compare with the observed – some sort of ‘wisdom of the crowd’ theory I guess. Given the IPCC’s position that the ensemble output is the average of the ensemble models, the obvious choice for the output error bars is SE.

    Beaker argues that this model disagreement is somehow information, that is lost if SE is used and can only be preserved by using SD, so I guess he also believes in crowd wisdom, but he goes beyond the IPCC in that he believes that all crowd members are equally wise. In Beaker’s view the ensemble is an information cloud that is reconciled if there is greater than 5% probability that any of the cloud members could produce the observed.

    The problem with Beaker’s argument is that the ensemble average is not only the IPCC’s best estimate, but also the output that best follows the GHG theory of what should be happening in the troposphere. The ensemble model members that are closest to the observed data don’t match the GHG theory. If the model ensemble were reduced to the 10 models that best match each other, the ensemble average would be very close to the 22-model ensemble average, but that 10-model ensemble would fail against observation even using 2SD. The 22-model ensemble is only reconciled, using SD, because it contains flawed members.

    Cliff

  448. Sam Urbinto
    Posted May 13, 2008 at 5:25 PM | Permalink

    Thanks Cliff. That’s why I’ve been saying that both SEM and SD prove the same thing from different directions; we can’t use SEM (too difficult to pass) and we can’t use SD (too easy to pass). Something like that.

    I do believe beaker has said that the more models you have, the more of a chance you have to get into the ballpark and match SD, and that it’s so low a bar as to be meaningless. The reason he doesn’t think SEM is a way to check these 22 (which is what has been proven by this paper, IMO) is that it’s almost impossible to pass no matter what. Same difference AFAIC.

    The question is then; is there a number of models, some ensemble, something somewhere that SEM is a good check for? Well, if we need the “bad models” to get it to pass a 2SD test, and 10 “good models” fail it, who cares? Really, doesn’t that prove the point, regardless of how you slice it?

    Like what Roger Pielke Jr said over at Prometheus in two posts. (There’s some interesting comments by James Annan in the posts, too BTW)

    Global Cooling Consistent With Global Warming

    For a while now I’ve been asking climate scientists to tell me what could be observed in the real world that would be inconsistent with forecasts (predictions, projections, etc.) of climate models, such as those that are used by the IPCC. I’ve long suspected that the answer is “nothing” and the public silence from those in the outspoken climate science community would seem to back this up.

    And

    How to Make Two Decades of Cooling Consistent with Warming

    If the test of “consistent with” is defined as any overlap between models and observations, then any rate of cooling or warming between -10 deg C/decade and +13.0 deg C/decade could be said to be “consistent with” the model predictions of the IPCC. This is clearly so absurd as to be meaningless.

  449. Cliff Huston
    Posted May 13, 2008 at 11:40 PM | Permalink

    RE: 448 Sam,

    Just to follow through on my ‘best 10 models’ speculation, I removed the 6 worst models above and the 6 worst models below the 22 model ensemble average. The remaining 10 models are those that best agree with each other. Here is the result:

    As I guessed, there is little change in the ensemble average and except in the ensemble’s strange surface to 925 hPa region, there is little support for observation reconciliation.

    Cliff

  450. beaker
    Posted May 14, 2008 at 2:10 AM | Permalink

    Good morning all.

    Kenneth Fritsh #409: No, the SD test is perfectly valid IF you include all model runs, rather than just the means for each model, so that the stochastic variation is put back in. If you do that, you don’t need to estimate E as it has a corresponding component in the stochastic variation of the models.

    Cliff Huston #413: If you use a perfect model ensemble with perfect initialisation, the models will reproduce the trends *exactly* and therefore will pass both the SE and SD tests, even though the SE and SD go to zero (2 is in the range 2+-0). So your counterexample doesn’t work (but mine does). In fact yours highlights the problem with the SE test: an ensemble is only guaranteed to pass the SE test if it exactly matches the observed trend, which is clearly an unreasonable expectation. My counterexample demonstrates that the best model that could possibly be made fails the SE test; I’d say that speaks volumes about its correctness!
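
    To make the two checks concrete, here is a minimal sketch in Python with invented numbers (not the Douglass et al. values): the SE test asks whether the observation lies within two standard errors of the ensemble mean, the SD test whether it lies within two standard deviations of the ensemble spread.

    ```python
    import numpy as np

    def se_test(trends, obs, k=2.0):
        """Is the observation within k standard errors of the ensemble mean?"""
        se = np.std(trends, ddof=1) / np.sqrt(len(trends))
        return abs(obs - np.mean(trends)) <= k * se

    def sd_test(trends, obs, k=2.0):
        """Is the observation within k standard deviations of the ensemble spread?"""
        return abs(obs - np.mean(trends)) <= k * np.std(trends, ddof=1)

    # Invented numbers: 22 model trends spread evenly from 0.10 to 0.40 K/decade
    # (mean 0.25), against an "observed" trend of 0.12 K/decade.
    models = np.linspace(0.10, 0.40, 22)
    obs = 0.12

    print("SE test:", se_test(models, obs))   # False: 2*SE is only about 0.04 here
    print("SD test:", sd_test(models, obs))   # True: 2*SD is about 0.19 here
    ```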

    Sam Urbinto #420:

    So if the ensemble mean gives the best estimate of the forced part of the climate and not the reality of the climate (forced plus random), isn’t it best to say that SEM can’t be used to match “missing-chaos forcing models of” climate compared to “having-chaos observed” climate? And therefore, using SEM to disprove itself as a valid statistical method (circular statistical test?) shows the models aren’t models of reality and are at least biased?

    There is nothing circular about it. The model mean has had the effects of e.g. ENSO cancelled out, the observed climate hasn’t, so you should not expect them to be exactly the same. There is no good reason for failing a large ensemble if it gives a good prediction, but the SE test will fail it unless it gets exactly the correct answer. That shows the SE test is unreasonable.

    On the other hand, we can say that using SD rather shows that from the other direction, we get too much uncertainty

    That seems to be pre-judging the answer. The amount of uncertainty you get is the amount of uncertainty there is. Whether data can be reconciled with a model depends on whether the data lie within the uncertainty of the model. That is because the error bars are part of the output of the model.

    So why even bother trying to reconcile them, there’s no test they can pass.

    Incorrect, it has already been demonstrated that they can pass the SD test. The question is are the models useful, the SD test doesn’t answer that unequivocally, and the SE test doesn’t as it fails a perfect model. What you need to do is to decide on a proper definition of “useful” and then we can try and make a statistical test for that.

    Michael Smith #426:

    But the ensemble mean does match the observations – it matches the surface observations. So we are left with the notion that the chaotic element of the climate is suppressing the warming the models predict for the troposphere but not suppressing the warming the models predict for the surface.

    But in order to pass the SE test, they need to be an exact match, which they are not. That is the reason why the SE test is invalid. There is nothing wrong with a large ensemble if it gives a good prediction, but the SE test will fail it on the basis of ensemble size unless its prediction is absolutely perfect.

    Kenneth Fritsch #427:

    Beaker response here will be: but is it a perfect match.

    Good forecast ;o)

    Michael Smith #437:

    But I’d like to hear him explain how natural variability can account for the fact that the surface – troposphere relationship is the opposite of what AGW theory requires — not just what it predicts, but what it requires. (I think you’ve been making essentially the same point without getting a response, so I may not get one either.)

    As I have said before, I am a statistician, not a climatologist. Asking me for authoritative answers to questions like this is like hiring a concert pianist in the expectation that he can fix the transmission in your car.

    The quote also suggests (although I may be reading too much into it) that, just because I am attempting to audit a paper with a sceptic stance, I must be a strong proponent of the models. However that is incorrect, I am just in favour of good science.

    Why should a large ensemble fail a test that a small ensemble passes if they both have the same mean and standard deviation (the SE test can do this, but the SD test won’t)? Why should a model fail a test even though it is in arbitrarily close agreement with the data (the SE test can do this, but the SD test won’t)? Why should a perfect model fail the test (the SE test will do this, the SD test won’t)? I am not trying to save anything. The test used in Douglass et al. was invalid. Pointing out scientific flaws, such as this, is the aim of an audit, is it not? If you only want to audit one side of the debate and not the other, then that is not science.

    Cliff Huston #447:

    The reason I wanted to post plots of the Douglass et al. data is that it is obvious that the wide 2SD is a function of the models not agreeing with each other. Beaker argues that the model spread is model ‘weather’, which ignores the fact that Douglass is using decade trends that average out short-term weather effects. Also, Beaker’s argument ignores the fact that the models in the ensemble are different programs, using different secret sauce to simulate the climate. Back in the Troposphere thread I wrote to Beaker:

    If you do the SD test properly, you use all model runs, not just the means for individual models, so “weather” does come into it, if you compute the SD correctly. The point remains that the “weather noise” in the models is removed by ensembling, but it is still there in the observations; that makes the SE test unreasonable as it is not a like-for-like comparison. It is however a like-for-like comparison to compare the observations with the model runs, via the SD test (with the SD computed over runs).
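
    A small sketch of that distinction in Python (the run counts and trend values are invented): the SD taken over all individual runs keeps the run-to-run “weather” scatter, while the SD taken over per-model means averages it away.

    ```python
    import numpy as np

    # Hypothetical ensemble: each model contributes several runs (trends in K/decade).
    runs_by_model = {
        "model_A": [0.15, 0.30, 0.22],
        "model_B": [0.20, 0.33],
        "model_C": [0.12, 0.28, 0.18, 0.25],
    }

    all_runs = np.concatenate([np.asarray(v) for v in runs_by_model.values()])
    model_means = np.array([np.mean(v) for v in runs_by_model.values()])

    print("SD over all runs:   ", round(float(np.std(all_runs, ddof=1)), 3))
    print("SD over model means:", round(float(np.std(model_means, ddof=1)), 3))
    ```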

    Beaker argues that this model disagreement is somehow information, that is lost if SE is used and can only be preserved by using SD, so I guess he also believes in crowd wisdom, but he goes beyond the IPCC in that he believes that all crowd members are equally wise. In Beaker’s view the ensemble is an information cloud that is reconciled if there is greater than 5% probability that any of the cloud members could produce the observed.

    No, you are reading too much into what I have written. That is the meaning of “consistent with the models” in a statistical sense. This is equivalent to asking if the models are capable of generating the observed climate, which seems a reasonable definition to me. However that doesn’t imply that they are useful (as I have pointed out – repeatedly). In my view the ensemble is an indication of the spread of opinions about climate physics amongst the modelling community; if there is uncertainty about the climate physics, we should not ignore it.

    Sam Urbinto #448

    Thanks Cliff. That’s why I’ve been saying that both SEM and SD prove the same thing from different directions; we can’t use SEM (too difficult to pass) and we can’t use SD (too easy to pass).

    Not quite, the SEM shouldn’t be used because it is not a fair test, the SD test doesn’t tell you whether the models are useful. Come up with a definition of useful and we can have a test.

    How to Make Two Decades of Cooling Consistent with Warming

    If the test of “consistent with” is defined as any overlap between models and observations, then any rate of cooling or warming between -10 deg C/decade and +13.0 deg C/decade could be said to be “consistent with” the model predictions of the IPCC. This is clearly so absurd as to be meaningless.

    There is nothing absurd about this if you understand the sources of uncertainty producing the error bars. Maybe all of these things are feasible given the uncertainties of the models, but that does not mean that they are all feasible in reality. Also, while many things are consistent with the models, they are not all equally probable under the models.

    As I said, consistency is a weak test. If you want to test the models to see if they are useful, then stop talking about consistency and come up with a reasonable definition of “useful”.

    P.S. Sorry if I come across as a bit terse in this post, I am rather busy at the moment!

  451. Raven
    Posted May 14, 2008 at 2:32 AM | Permalink

    beaker says:

    As I said, consistency is a weak test. If you want to test the models to see if they are useful, then stop talking about consistency and come up with a reasonable definition of “useful”.

    The trouble is the modellers keep making the assertion that data is “consistent” with models and that proves that the models are useful. This leads to arguments about the meaning of the word “consistent”.

  452. steven mosher
    Posted May 14, 2008 at 6:38 AM | Permalink

    If you guys want to see how big a spread there is in the model forecasts, head over to RC; Gavin has a good post on it.

  453. Ron Cram
    Posted May 14, 2008 at 7:04 AM | Permalink

    Mosh,
    re: 451

    I don’t know what you see in Gavin’s post that you think is insightful. It is another PR post, an attempt to salvage the reputation of the GCMs and AGW. In the past he has said ten years without a new high global temp would be reason to question the models. Now he is changing the standard to an “Unambiguous New Record” because it gives an extra eight years of funding to the AGW cause. He is using the fact the models are all over the place to argue for more time.

  454. steven mosher
    Posted May 14, 2008 at 7:34 AM | Permalink

    re 452. It’s insightful for exactly the reason you state. I said it was a good post.
    It’s good for people to see what kind of accuracy you can expect from the models.

  455. John F. Pittman
    Posted May 14, 2008 at 9:12 AM | Permalink

    #449 Beaker
    Thank you for your patience and help. I know you have asked several times about a test and a definition of useful. I think you may have answered this. If so, please, just direct me towards the answer. Also, it may simply be that I have “the cart before the horse”.

    The usefulness has been defined by the IPCC and climatologists: a natural climate response could not be found that explains the 20th century and present warming; it is thus CO2. So I pose this test: do the models, or the temperature anomaly data itself, support this definition of usefulness? If the answer is indeterminate, that would falsify the claim. In other words, we have the same broad window of comparison for the usefulness of the historical temperature anomalies, including their uncertainties and errors, that we have for the models, including their uncertainties and errors. If one cannot conclude this with the information at hand (it is unknowable), then is not the claim suspect anyway?

  456. Jon
    Posted May 14, 2008 at 9:32 AM | Permalink

    @452

    Now he is changing the standard to an “Unambiguous New Record” because it gives an extra eight years of funding to the AGW cause.

    Explain.

  457. Kenneth Fritsch
    Posted May 14, 2008 at 9:41 AM | Permalink

    Re: #449

    I see where Ground Hog day was postponed from Monday to Wednesday this week.

    Using SD to determine whether the observed is part of the distribution of all the model results is the same as treating the observed as another model result, i.e. the observed has no added value as a target that the models are attempting to reproduce.

    Not addressing the issue of using the ratios of temperature trends in order to effectively eliminate the effect of the chaotic variable in the observed (and in the model results as well), by simply repeating that the SE cannot be used unless the observed matches the modeled mean exactly, will get us to another Ground Hog day without much resolution of the question at hand. SE comparisons are made every day knowing full well that the averages being compared are never going to match exactly. Think of a control chart process whereby the average of a well-controlled process is estimated and limits are set, using the sample size of the control test, to establish SE limits for the process. One would not expect that the average of the distribution of an in-control process was captured exactly by the original estimate.
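
    As a rough sketch of the control-chart analogy in Python (all numbers are hypothetical): the limits come from the estimated process mean and the standard error appropriate to the control-sample size, and an in-control sample mean is expected to fall within those limits without matching the estimated mean exactly.

    ```python
    import numpy as np

    # Hypothetical in-control process: historical estimates of its mean and SD.
    process_mean, process_sd = 10.0, 0.5
    n = 5                                  # size of each control sample
    se = process_sd / np.sqrt(n)
    lower, upper = process_mean - 3 * se, process_mean + 3 * se   # conventional 3-sigma limits on the mean

    sample = np.array([9.8, 10.1, 10.3, 9.9, 10.2])               # one hypothetical control sample
    in_control = lower <= sample.mean() <= upper
    print(round(float(sample.mean()), 2), (round(lower, 2), round(upper, 2)), in_control)
    ```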

    To see through to the essence of these arguments on how to compare observed and model results, it is instructive to look at how Karl et al. compared similar model outputs against the observed for the ratios of surface to troposphere temperature trends in the tropics that Douglass et al. analyzed. Karl looks at the entire range of values for the model results and says that the observed falls within that range and therefore one cannot say the models disagree with the observed. In fact Karl uses model outputs that show outlier tendencies for the surface temperatures and concludes, using their comparison approach, that the observed results are probably wrong.

    One can look at the model results this way, but then one is forced to conclude that the model range the authors apparently legitimize, by using it in their comparison with the observed, is so large as to be practically worthless. One can invoke an unbounded limit for the chaotic content of the observed as making an observed-to-model-mean comparison impractical, but that in turn implies that a large or unlimited chaotic content in the observed makes climate projections/predictions unrealistic to impossible.

    On the other hand, I would suspect the thinking behind the Douglas approach goes something like:

    We have a GHG fingerprint that the models show as a strongly biased pattern in the ratio of temperature trends in the tropics, whereby the troposphere warms faster than the surface. We have model results that give a reasonable rendition of the observed surface temperature trends, and they all show more warming than the observed for the troposphere. We have data showing that the random climate variations match well between the surface and troposphere temperatures. We want to use the surface observations as a target for the models while still recognizing that the observations have a variable content. The models have a biased tendency as a whole, or ensemble, that most likely results from the GHG fingerprint. We want to compare the average model output for this feature of the climate prediction to an observed target.

  458. Michael Smith
    Posted May 14, 2008 at 10:24 AM | Permalink

    Beaker, you’ve simply unilaterally redefined the term “SE test” to mean a test wherein any ensemble being tested must be assumed to have an infinite number of models or runs of models — regardless of why that ensemble is put together and regardless of how many actual models the ensemble contains at the time of the test.

    But I don’t accept your assertion that there be a universal “infinity constraint” on the SE test. For one thing, you haven’t explained why an ensemble built to capture the uncertainty around a perfect model would include runs that capture no additional uncertainty.

    And please don’t respond by once again repeating the argument that under your arbitrary, hypothetical case of assuming infinity, a perfect model would fail. Tell me why the universal “infinity constraint” assumption is valid even when there is a perfectly logical reason to limit the models going into the ensemble.

  459. beaker
    Posted May 14, 2008 at 11:33 AM | Permalink

    Michael Smith #458:

    (i) The SE test will reject a model with finite difference between the observed climate and the ensemble mean if the ensemble is large enough, because of the 1/sqrt(n) factor. This is true no matter how small the difference actually is.

    (ii) The models do not attempt to model the observed climate, only the forced component of the observed climate, i.e. what is left after the effects of things that the models can’t predict (e.g. ENSO), however small they may be, have been averaged out.

    (iii) As a corollary of (ii), we would expect there to be a finite difference, even if the model is perfect, as the observations do not represent the forced climate that the modellers wish to estimate using the ensemble mean.

    (iv) As a result, the SE test will reject an ensemble that gives perfect predictions (in terms of what the models actually aim to do) purely on the basis that it is “too large”.

    (v) As a consequence of (i), the SE test will fail a model that has an arbitrarily small difference from the observations purely on the basis that it is “too large”.

    See the point I am making: whether a model passes the SE test doesn’t depend purely on how closely it agrees with the observations; therefore it isn’t a sensible test of whether the models are in good agreement with the data.
    There, I haven’t mentioned infinity anywhere.
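
    A small numerical illustration of point (i), in Python (the 0.03 K/decade difference and 0.10 K/decade model spread are invented, chosen only to show the 1/sqrt(n) effect):

    ```python
    import numpy as np

    diff = 0.03   # hypothetical fixed difference between ensemble mean and observations (K/decade)
    sd = 0.10     # hypothetical standard deviation of the individual model trends (K/decade)

    for n in (5, 22, 100, 1000):
        se = sd / np.sqrt(n)
        print(f"n={n:4d}  2*SE={2 * se:.3f}  passes SE test: {diff <= 2 * se}"
              f"  passes SD test: {diff <= 2 * sd}")
    ```

    The same small difference passes or fails the SE test depending only on how many models are in the ensemble, while the SD test is unaffected by the ensemble size.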

    Lastly, here is some food for thought: David Douglass says

    We noted that previous papers and IPCC reports had introduced the concept of averaging over models. There is no scientific justification for doing so. The models are not independent samples from a parent population because the models are, in fact, different from each other in many ways.

    The SE of the mean gives the uncertainty of estimating a population mean using a finite sample. Douglass apparently doesn’t believe the sample are representative of a parent population, so why is he using a statistical test based on its mean?

    The SD test on the other hand makes no such assumption, it is just a simplified Gaussian approximation of an empirical distribution.

  460. Christopher
    Posted May 14, 2008 at 11:52 AM | Permalink

    >The models are not independent samples from a parent population because the models are, in fact, different from each other in many ways.

    Yeah, this threw me when I first saw it. You can replicate GCMs with MAGICC. So we have a RNG, in effect, with a 6 (or was it 5) parameter distribution. So the models are not independent because the physics is, well, the physics. Maybe a semantic thing.

  461. beaker
    Posted May 14, 2008 at 12:10 PM | Permalink

    Christopher #460: Indeed, MAGICC does suggest a view of an underlying population, even though it still isn’t an i.i.d. sample from that population. The use of an ensemble is itself perfectly reasonable without the models being an i.i.d. sample (done all the time in machine learning, try a Google Scholar search for “ensemble diversity”). It did seem inconsistent ( ;o) ) for Douglass et al. to use a test that does require them to be an i.i.d. sample though!

  462. Sam Urbinto
    Posted May 14, 2008 at 5:10 PM | Permalink

    beaker #450 etc

    “That shows the SE test is unreasonable.”

    Bro, you ain’t pickin’ up what I’m throwin’ down. I’m saying the same thing; SE can’t be used to reconcile the model ensemble with the observations. Because it’s too constrained. Is that a fair way to put it?

    “Whether data can be reconciled with a model depends on whether the data lie within the uncertainty of the model.”

    And if you have to make the uncertainty great, isn’t that the same as failing if you can show consistency? Or to put it another way, it only goes to show the models are biased, doesn’t it?

    “The question is are the models useful, the SD test doesn’t answer that unequivocally”

    If there’s no clear answer, isn’t that the same as failing a test of reconciliation? I’m not saying we can’t fit things, just that a test that’s inconclusive is, well, meaningless at a practical level, I suppose I’d say.

    “But in order to pass the SE test, they need to be an exact match, which they are not.”

    Good point. Of course they can’t be. You also have a point about the paper; if the goal was to say the ensemble can’t be reconciled, at all, in any manner, in any way, clearly SE is the wrong way to do it, since we know SD can. Perhaps, as you’ve suggested, the point should be made that reconciliation fails under SE, and that it passes under SD but SD is basically undefined; so the proper test to show the bias (or uncertainty) of the ensemble would be SD, yet it tells us nothing except that we can get it to work if we put enough uncertainty in. However you want to put it.

    “The test used in Douglass et al was invalid.”

    Invalid for what? Showing that SEM (seemingly their interpretation, right or wrong, of what the IPCC is doing) doesn’t work to gauge the ability of an ensemble to correlate with or reconcile to observations?

    “This is equivalent to asking if the models are capable of generating the observed climate, seems reasonable definition to me.”
    “However that doesn’t imply that they are useful ”

    Can they generate it, sure. I don’t think anyone’s really debating that. Does it mean anything? No. I agree with you that they are consistent in a statistical sense. I think we’re again back to semantics; for me for them to be “consistent” they must not only be statistically consistent (in that there is overlap I suppose I’d say) but useful. If I can get answers that don’t match at all or don’t reasonably match, that’s inconsistent. But again, perhaps the paper could be reworded on this issue; but as you said, this isn’t just statistics, it’s climate also.

    “Not quite, the SEM shouldn’t be used because it is not a fair test, the SD test doesn’t tell you whether the models are useful. Come up with a definition of useful and we can have a test.”

    If SEM isn’t fair, and SD doesn’t tell us anything, neither is useful in attempting reconciliation that we can trust the models to tell us the reality to any degree of certainty.

    “The SE of the mean gives the uncertainty of estimating a population mean using a finite sample. Douglass apparently doesn’t believe the sample are representative of a parent population, so why is he using a statistical test based on its mean?”

    To show there’s no scientific justification to average over models? To show models only represent the forced component of reality?

    We are comparing a mean of incomplete simulations to reality after all, right?

  463. Kenneth Fritsch
    Posted May 14, 2008 at 6:32 PM | Permalink

    Below is a link to an article from NOAA, published in 2005, that shows correct procedures for comparing climate observations to modeled results and that avoids potential problems when the comparison is confined to using error bars. It is merely a standard statistical method using SEM for comparing means and, as such, would not in my view change the conclusions of the Douglass et al. paper. Note that SEM is used to compare means of observations and model outputs.

    Click to access jrl0501.pdf

    In contrast, the two-sample case is fundamentally different in that, in general, looking for overlap from two sets of error bars is not equivalent to the appropriate t test. Examples based on the data in Table 1 are used to illustrate the nature of the problem. Suppose we have finite samples of values of some quantity from both observed data and from a general circulation model (GCM). Estimates of the mean (X) and standard deviation (s) of the sample values can be made along with the uncertainty [standard error (SE)] of the estimated means. As is common practice, a confidence interval about the estimated means can be constructed by taking +/- twice the standard error. The intervals given in Table 1 are displayed graphically in Fig. 1 in the form of error bars.
    For example 1, three sets of error bars are shown on the left side of Fig. 1 for the observations (O), GCM (G), and their difference (D). In this example, the observations and GCM have equal standard deviations. It can be seen that there is considerable overlap of the error bars from the observations and GCM. In such a case, a researcher would typically conclude erroneously that there is no statistically significant difference between their respective means. An alternate approach to the same problem is to apply a two-sample t test. Such a test has been applied and the corresponding error bars about the difference of the means (D) do not include zero. Based on this test, the same researcher would conclude that there is a statistically significant difference between the means.

    The reason for this apparent paradox can be understood by considering the relationship between the SE of the mean of the individual samples (observations and GCM) to that of the SE of the difference of their means. The crucial factor is that in the case of the two-sample t test, the SE of the difference is estimated by “pooling” the variances from the two different samples. It should be noted that while the two-sample t test is well founded in statistical theory, the use of overlapping error bars in the two-sample case is not.

    The depiction of example 1 in Fig. 1 can be used to understand the crucial distinction between the two approaches. In order for significance to be declared using the overlapping error bars approach, one would have to move the means of the observations farther apart until the bottom whisker from the observations just touches the top whisker from the GCM. This can be expressed mathematically as:

    X1 – X2 ≥ c·SE1 + c·SE2   (1)

    where X1 (X2) is the mean of the observations (GCM), SE1 (SE2) is the standard error of the observations (GCM), and c is a constant that determines the level of confidence. The two terms on the rhs of (1) represent the distances from their respective means to the end of the whisker (i.e., half the length of the confidence intervals). In example 1, c = 2 since the confidence intervals represent +/- two standard errors. Note that throughout this paper no distinction is made between population and sample parameters; it should be understood that estimates of various population parameters from an available finite sample are being used.
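
    A minimal sketch of the two approaches in the excerpt, in Python (the sample statistics are hypothetical stand-ins, not Lanzante’s Table 1 values): the overlap criterion requires the difference to exceed c·(SE1 + SE2), while the two-sample test compares it to c times the standard error of the difference, sqrt(SE1² + SE2²), which is always the smaller threshold.

    ```python
    import numpy as np

    def compare(x1, x2, se1, se2, c=2.0):
        """Overlap-of-error-bars rule versus a two-sample style test on the difference."""
        diff = abs(x1 - x2)
        overlap_threshold = c * (se1 + se2)              # error bars just touch
        ttest_threshold = c * np.sqrt(se1**2 + se2**2)   # SE of the difference ("pooled")
        return {
            "difference": diff,
            "significant by overlap rule": diff >= overlap_threshold,
            "significant by two-sample test": diff >= ttest_threshold,
        }

    # Hypothetical observed vs. GCM decade trends (K/decade) with equal standard errors.
    print(compare(x1=0.12, x2=0.26, se1=0.04, se2=0.04))
    ```

    With equal standard errors the bars only just touch once the difference reaches 2c·SE, whereas the two-sample criterion needs only about 1.41·c·SE, which is why the overlap rule is the more conservative of the two (it is less likely to declare a significant difference).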

  464. Ralph Becket
    Posted May 14, 2008 at 10:24 PM | Permalink

    Re: beaker, 459:

    I said this before, but didn’t get a response from you. If you could tell me where you think my reasoning is faulty I would be grateful.

    The justification for taking the mean of the models is that they are approximations to an hypothetical ideal model. The SE of this mean gives you confidence intervals of how far off the sample mean is from the right value. With more models the SE tends to zero, as expected.

    The test of the mean of the models should surely be with respect to the SE *and* the error on the observations.

    If we had perfect observations (no error) and a perfect model (no error) and they differed then we would have to conclude that the perfect model was wrong (i.e., the model is not actually a perfect model of the process being observed). The next step is to ask how great is the discrepancy. At least then you can say something about whether the model is in the right ballpark.

  465. Reference
    Posted May 15, 2008 at 6:02 AM | Permalink

    Warming world altering thousands of natural systems – 14 May 2008

    The study, by an international research team featuring many members of the Intergovernmental Panel on Climate Change (IPCC), is a statistical analysis of observations of natural systems over time. The data, which stretch back to 1970, capture the behaviour of 829 physical phenomena, such as the timing of river runoff, and around 28,800 biological species.

    Researchers led by Cynthia Rosenzweig of NASA’s Goddard Institute for Space Studies in New York created a map of the planet with a colour-coded grid showing how much different regions have warmed or cooled between 1970 and 2004.

    They then placed each of the thousands of datasets on the map and determined whether they were “consistent with warming” or “not consistent with warming”. Trees, for example, might flower earlier in regions where the climate has warmed significantly.

    In around 90% of cases where an overall trend was observed, it was consistent with the predicted effects of climate warming, the researchers report in this week’s Nature.

  466. Reference
    Posted May 15, 2008 at 6:07 AM | Permalink

    Apologies for the above post – wrong section

    (Admin: a delete option would be really, really helpful, edit even more so.)

  467. Ron Cram
    Posted May 15, 2008 at 8:18 AM | Permalink

    Mosh,
    re: 454

    Okay, I misunderstood. I thought you were saying there was something new in Gavin’s post. I guess the fact he admitted the GCMs are all over the place is new. His attempt to change the standard was also new, but I don’t like changing the rules in the middle of the game.

    Jon,
    re: 456

    It is pretty obvious, from a scientific standard – not a political one, that AGW catastrophism is in trouble. No new temp records since 1998 and a number of peer-reviewed papers explaining why this is so. And now the PDO has turned to its cool phase and we probably will not see any new records for about 30 years. Without changing the ground rules, one would expect the funding to study climate change would be drastically cut within a few years (funding cutbacks have to wait until the politics follows the science). By changing the ground rules to “unambiguous,” Gavin gets to claim they need another eight years of funding.

  468. Bob B
    Posted May 15, 2008 at 10:34 AM | Permalink

    Lucia, falsifying IPCC projections once more:

    http://rankexploits.com/musings/2008/ipcc-projections-do-falsify-or-are-swedes-tall/#comments

  469. Cliff Huston
    Posted May 15, 2008 at 2:03 PM | Permalink

    RE: 463 Ken Fritsch,

    Thanks for the link to the Lanzante (2005) manuscript. Lanzante’s advice is not to use error bars at all for hypothesis testing; rather, he suggests using a two-sample t-test. The author’s main objection to the use of error bars is based on the work of Schenker and Gentleman (2001), which I have not been able to find free access to. I have found several papers that discuss aspects of Schenker and Gentleman (2001), and it seems to me that Schenker and Gentleman (2001) could provide further insight into the best way to judge the data in Douglass et al. Perhaps someone here does have access to Schenker and Gentleman (2001) and can provide us with some additional clues.

    Googling on Schenker and Gentleman (2001), I found that climatologists are not the only ones confused by the use of error bars. The following is from a preprint of Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005).
    ( http://www.latrobe.edu.au/psy/cumming/docs/Belia%20Fidler%20et%20al%20PM%2005rev.pdf )

    Our aim was to study some aspects of the understanding of graphically presented CIs and SE bars by authors of articles published in international journals. We investigated the interpretation of error bars in relation to statistical significance. This may not be the best way to think of error bars (Cumming & Finch, 2001, 2005), but is worthy of study because of the current dominance of NHST and p values, and because greater interpretive use of error bars is unlikely unless the relationship with p values is understood. We also investigated researchers’ appreciation, when comparing two means, of the importance of experimental design—in particular whether the independent variable is a between-subjects variable, or a repeated measure.

    We elected to study graphically presented intervals because we believe that pictorial representation is often valuable, and can “convey at a quick glance an overall pattern of results” (APA, 2001, p. 176), and because we agree with the advice of the APA Task Force on Statistical Inference (Wilkinson, et al., 1999): “In all figures, include graphical representations of interval estimates whenever possible” (p. 601). By ‘error bars’ we refer to the ambiguous graphic, two of which are shown in Figure 1, that marks an interval and may represent a CI, SE bars, or even SD bars. All error bars we use are centered on means (M), and n is the number of data values contributing to a mean. CIs are calculated as M ± tC × SE, where SE = SD/√n, and tC is the critical value of t, for (n – 1) degrees of freedom, for the chosen level of confidence, C. For us, this is 95%, implying that tC is close to 2. In all cases, SE bars are M ± SE.

    To make an inferential assessment of a difference between two means, it may be best to consider a single interval on the difference itself (Cumming & Finch, 2005). For two reasons, however, we chose to study a comparison of intervals on the two separate means. First, it is common in journals to see figures showing separate cell means, sometimes with error bars, and, with such figures, assessing any difference requires consideration of intervals shown on the separate means. The Publication Manual includes two examples of figures of this type (APA, 2001, pp. 180, 182). Second, Schenker and Gentleman (2001) reported that in medicine and health science a rule of thumb is sometimes used for interpreting CIs on two separate means. The rule maintains that non-overlap of two 95% CIs on independent means implies a significant
    difference at the .05 level between the means, and that overlap of the two CIs implies there is no significant difference. This rule is widely believed, but incorrect, and refers to CIs on separate means. In fact, non-overlap of the two CIs does imply a significant difference, with p distinctly less than .05, but overlap does not necessarily imply there is no statistically significant difference at the .05 level.
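
    The “p distinctly less than .05” point can be made quantitative with a quick calculation (Python/SciPy; this assumes two independent means with equal standard errors and a normal approximation, rather than the t-based intervals in the excerpt):

    ```python
    from math import sqrt
    from scipy.stats import norm

    # If two 95% CIs (mean +/- 1.96*SE) on independent means just touch, the difference
    # between the means equals 1.96*(SE1 + SE2).  With equal SEs, the z statistic for
    # that difference is (2 * 1.96 * SE) / (sqrt(2) * SE) = 1.96 * sqrt(2).
    z = 1.96 * sqrt(2)
    p = 2 * (1 - norm.cdf(z))    # two-sided p-value at the point where the bars just touch
    print(round(z, 2), round(p, 4))   # about z = 2.77, p = 0.006, well below 0.05
    ```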

    Here is another paper that provides an additional view of the issues: Overlapping confidence intervals or standard error intervals: What do they mean in terms of statistical significance? ( http://www.insectscience.org/3.34/Payton_et_al_JIS_3.34_2003.pdf )

    Abstract
    We investigate the procedure of checking for overlap between confidence intervals or standard error intervals to draw conclusions regarding hypotheses about differences between population parameters. Mathematical expressions and algebraic manipulations are given, and computer simulations are performed to assess the usefulness of confidence and standard error intervals in this manner. We make recommendations for their use in situations in which standard tests of hypotheses do not exist. An example is given that tests this methodology for comparing effective dose levels in independent probit regressions, an application that is also pertinent to derivations of LC50s for insect pathogens and of detectability half-lives for prey proteins or DNA sequences in predator gut analysis.

    One thing I note about these papers is that the authors don’t seem to understand Beaker’s point that a CI of 2SEM cannot be used for hypothesis testing – go figure. They all seem to agree with Lanzante’s statement:

    Before beginning discussion of the problem, it is worth reviewing some of the types of error bars that are commonly presented, along with related terminology. Error bars may represent the standard deviation, the standard error, the confidence interval, or some other measure of error or uncertainty. While the standard deviation is a measure of dispersion of individual observations about their mean, the standard error is the standard deviation of a derived statistic, such as the mean, regression coefficient, correlation coefficient, etc. A confidence interval can be constructed about a sample statistic such that it contains the true population statistic with a specified probability (2) and thus can be used in hypothesis testing. This note is relevant to error bars that represent confidence intervals for hypothesis testing.

    Cliff

  470. Sam Urbinto
    Posted May 15, 2008 at 2:25 PM | Permalink

    I was under the impression the ensemble was a derived statistic.

  471. Cliff Huston
    Posted May 15, 2008 at 2:58 PM | Permalink

    RE: 470 Sam,

    I was under the impression the ensemble was a derived statistic.

    According to the IPCC it is – see the IPCC statements in post #394
    Cliff

  472. Kenneth Fritsch
    Posted May 15, 2008 at 3:17 PM | Permalink

    Re: #469

    Cliff, it is important on reading the Lanzante (2005) manuscript to note that his point (which makes sense from a t test of the differences using SEM) is that using overlapping error bars for determining significance will always favor the case of no significant difference, i.e. when Douglass et al. use overlapping error bars they would have been more likely to see no significant difference when one actually exists.

    Beaker has an opinion about these tests just like RC does, but if one understands the underlying assumptions one can judge for oneself whether the beaker is half full or half empty. I know it is rather empty of references, links and details.

  473. Cliff Huston
    Posted May 15, 2008 at 3:39 PM | Permalink

    RE: 463 Ken Fritsch,

    Cliff, it is important on reading the Lanzante (2005) manuscript to note that his point (which makes sense from a t test of the differences using SEM) is that using overlapping error bars for determining significance will always favor the case of no significant difference, i.e. when Douglass et al. use overlapping error bars they would have been more likely to see no significant difference when one actually exists.

    I agree. The papers I linked in #496 make that point clear and put some numbers around it. What I don’t find is any support for using 2SD for the CI, but even in that case this issue would apply.

    Cliff

  474. Cliff Huston
    Posted May 15, 2008 at 4:01 PM | Permalink

    RE:473

    Oops – RE: 463 Ken Fritsch, should be RE: 472 Ken Fritsch and …linked in#496, should be … linked in 469.

    The ability to edit would be nice.

    Cliff

  475. Sam Urbinto
    Posted May 15, 2008 at 5:40 PM | Permalink

    #471 Cliff

    Well, if the IPCC thinks the standard (systematic) error of the ensemble mean of the forced portions of the models is a good estimate, then:

    To the extent that unrelated model errors tend to average out, the ensemble mean or systematic error {Tm} will be small, {T} will approach TF and the multi-model ensemble average will be a better estimate of the forced climate change of the real system than the result from a particular model.

    Matches well with:

    the standard error is the standard deviation of a derived statistic

    And it appears that one’s interpretation of the appropriateness of this measure will decide whether the SEM is a prudent way to compare the mean of a model ensemble of forcings only with a reality that includes forcings and natural variability.

    I have no opinion one way or the other (my post was pointing out this is indeed a derived statistic, but it makes no comment on the appropriateness of one test or another), but it appears that all of this is apples to cherries. In that (as I’ve said) SEM can’t pass because the idealized realization is not idealized (disproving the IPCC’s stance, if indeed that is their stance), and SD can almost always pass (since the outliers make the bars wider).

    I’d say if I have a group of whatever and get a +/- .2 on it, then add outliers in both directions that boost that to +/- 100, then what can’t I fit? The gist of this is that if -90 is the reality, and I throw away those outliers, I fail reality. But if I include the outliers, and +.1 is the reality, what does that tell me?

    Nothing! 🙂

  476. Sam Urbinto
    Posted May 15, 2008 at 6:26 PM | Permalink

    Another question; what is a model of only the forced aspects good for?

  477. Willis Eschenbach
    Posted May 15, 2008 at 7:57 PM | Permalink

    According to Lanzante, the proper test for significance is not whether the 2σ (more accurately 1.96σ) error bars overlap. It is whether the absolute distance between the means is greater than sqrt[ (1.96·σ1)² + (1.96·σ2)² ], where σ1 and σ2 are the two relevant standard deviations.

    IF one agrees with Gavin/beaker that the proper measure is the standard deviation rather than the standard error of the mean, then we find the following:

    Altitude (hPa)   95% CI   Difference between means   Result
    1000             0.01     0.15                       Not Different
     850             0.08     0.15                       Different
     700             0.15     0.18                       Different
     500             0.22     0.24                       Different
     400             0.20     0.27                       Different
     300             0.25     0.32                       Different
     250             0.30     0.37                       Different
     200             0.42     0.53                       Different
     150             0.39     0.33                       Not Different
     100             0.49     0.28                       Not Different

    If the difference between the means is larger than the 95% CI value, the two are significantly different. Note that these calculations are done with the raw data, not set to the same starting point, and not normalized.

    Note that everywhere between the surface and 150 hPa, the models and the datasets are significantly different even when we use Gavin/beaker’s curious method.
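
    For anyone who wants to reproduce this kind of check, here is a minimal Python sketch of the combined-interval rule quoted above; the means and standard deviations in the example call are hypothetical placeholders, not the actual Table II values.

    import math

    def lanzante_style_test(mean1, sd1, mean2, sd2, z=1.96):
        # Half-width of the combined interval: sqrt((z*sd1)^2 + (z*sd2)^2)
        combined = math.hypot(z * sd1, z * sd2)
        return abs(mean1 - mean2) > combined, combined

    # Hypothetical example: model-mean trend 0.25 K/decade with SD 0.08,
    # observed-mean trend 0.05 K/decade with SD 0.03 (made-up numbers)
    different, half_width = lanzante_style_test(0.25, 0.08, 0.05, 0.03)
    print(different, round(half_width, 3))   # True 0.167 with these made-up numbers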


    w.

  478. Cliff Huston
    Posted May 15, 2008 at 10:34 PM | Permalink

    RE:477 Willis,
    Looking at my plot of the Douglass Data in #439, I’m having trouble relating to your conclusion of ‘Not Different’ for 150 and 100 hPa. Either I screwed up the data plot or there is something else I don’t understand. The differences between the means on the plot look larger there, than anywhere else.

    Cliff

  479. beaker
    Posted May 16, 2008 at 12:45 AM | Permalink

    Ralph Becket #464:

    The justification for taking the mean of the models is that they are approximations to an hypothetical ideal model.

    No, it is just that the mean of an ensemble is generally on average a better predictor than any of its component parts. There is nothing to guarantee that this is in any way an “ideal model”.

    The SE of this mean gives you confidence intervals of how far off the sample mean is from the right value. With more models the SE tends to zero, as expected.

Yes, it is the uncertainty in estimating a population mean from a finite sample. Note that Douglass points out it isn’t an i.i.d. sample, which would make the SE an invalid estimate of this uncertainty, but never mind…

    The test of the mean of the models should surely be with respect to the SE *and* the error on the observations.

    If you wanted to know whether there was a detectable difference between the ensemble mean and the observations, yes it would be a fine test. However, that is not finding out if the observations are consistent or inconsistent with the models, or even if there is a meaningful difference, so it doesn’t really tell you anything about the usefulness of the models.

As I have already pointed out, since the model ensemble is only an estimate of the forced component of the climate, we know before we start that we should expect there to be a difference, even if the model is perfect. Therefore it seems nonsensical to test for a difference that we know should be there. If a model passes the SE test it is only because the ensemble is too small for us to be sure that the difference we expect to see is not a random sampling artifact.

    If we had perfect observations (no error) and a perfect model (no error) and they differed then we would have to conclude that the perfect model was wrong (i.e., the model is not actually a perfect model of the process being observed).

    No, because the models only predict the forced component of the climate, the observed climate also has a component due to things like ENSO that the models are not capable of predicting, even in principle.

    The next step is to ask how great is the discrepancy. At least then you can say something about whether the model is in the right ballpark.

    Actually, that is the first thing that Douglass et al. should have done instead of using a test for a detectable difference (which we know a-priori is almost certainly there anyway).

    Cliff/Ken #463/469:

    Whether you use error bars or a (slightly more sophisticated) t-test is not really the issue. The SE test is a test for a statistically detectable difference between the mean and the data. We know a-priori that a difference exists, so why is it any criticism of the models that the ensemble is large enough for it to be detected?

    One thing I note about these papers is that the authors don’t seem to understand Beaker’s point that a CI of 2SEM cannot be used for hypotheses testing – go figure.

    Except that isn’t the point I was making (of course). The SE can be used for hypothesis testing, just not the hypothesis that the models are inconsistent with the data, or that the models are useful. It can be used to test the hypothesis that there is a difference between the ensemble mean and the observations, but then that is what we would expect to see anyway, even if the model was a good one.

    Kenneth Fritsch #472:

    Beaker has an opinion about these tests just like RC does, but if one understands the underlying assumptions one can judge for oneself whether the beaker is half full or half empty. I know it is rather empty of references, links and details.

I have provided references where the SD error bars are used to test whether models are inconsistent with the data. You have provided references to papers showing the SE test can be used to detect a difference between the mean and the observations. Did Douglass et al. claim to have detected a difference, or did they claim to have shown the models were inconsistent?

    If you understood the underlying assumptions, you would know the SE test is only a test for a detectable difference, not a meaningful one (and as I said, we know the difference exists a-priori – so why test for it?).
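
    To make that concrete, here is a toy simulation (all numbers hypothetical) of why the SE and SD tests answer different questions: as the ensemble grows, the 2 SEM interval shrinks and eventually flags even the expected forced-versus-realisation gap, while the 2 SD interval tracks the spread of the models and does not.

    import numpy as np

    rng = np.random.default_rng(0)
    true_forced = 0.20    # hypothetical forced trend (K/decade) that the ensemble mean estimates
    observed = 0.15       # hypothetical single realisation: forced trend plus unforced variability
    for n in (5, 22, 100, 1000):
        ensemble = rng.normal(true_forced, 0.10, size=n)   # hypothetical inter-model spread
        gap = abs(ensemble.mean() - observed)
        sd = ensemble.std(ddof=1)
        sem = sd / np.sqrt(n)
        print(n, "SE test rejects:", gap > 2 * sem, "| SD test rejects:", gap > 2 * sd)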

    Willis Eschenbach #477:

Have you used the individual model runs in calculating the SD, or just the means for each model? If not, the SD doesn’t include the stochastic uncertainty of the simulations, so it isn’t a test for inconsistency in the statistical sense. I have made this point several times. BTW, it isn’t my curious method; I have given references to other papers using the same definition, and it is what a statistician would recognise as inconsistency.

    Not that consistency tells you if a model is useful, as I have also repeatedly pointed out.

  480. Cliff Huston
    Posted May 16, 2008 at 12:52 AM | Permalink

    RE: 476 Sam,

    …a model of only the forced aspects…

    I am quite sure Gavin would not agree that ‘only the forced aspects’ are being modeled, IMHO this is just Beaker’s latest (intentional or not) Wookie (see http://en.wikipedia.org/wiki/Chewbacca_defense ). Beaker is simply trying to confuse (or is confused by) the IPCC’s clear statements that the average of the ensemble models is the best estimate of the observed climate, claiming that there is a nuance because they used the word ‘forced’ and that nuance makes all the difference. Beaker seems to be saying that the models have a purpose built limitation (only the forced aspects) and so they will never match observation, hence the need to have wide error bars. Even if this were true, it could only be used to help explain why the models failed the Douglass test, not as a part of a proof that the Douglass et al test (CI=2SE) is wrong.
    In #450 Beaker disclaims:

    As I have said before, I am a statistician, not a climatologist. Asking me for authoritative answers to questions like this is like hiring a concert pianist in the expectation that he can fix the transmission in your car.

    Yet despite that disclaimer, Beaker in #459 claims the climate science knowledge to state:

    (ii) The models do not attempt to model the observed climate, only the forced component of the observed climate, i.e. what is left after the effects of things that the models can’t predict (e.g. ENSO), however small they may be, have been averaged out.
    (iii) As a corollary of (ii), we would expect there to be a finite difference, even if the model is perfect, as the observations do not represent the forced climate that the modellers wish to estimate using the ensemble mean.

    Much as in Beaker’s response to my #394 with the IPCC quotes:

    The IPCC quote appears to me to be worded that way to make an important distinction between what the models can predict and what they can’t. If you are happy to ignore such distinctions, it is not overly surprising that you don’t understand the point I am trying to make.

Again, as with the word ‘consistency’, Beaker is slicing and dicing words to invoke some deeper meaning (largely undefined), which somehow serves to make his point. In this case Beaker is invoking the ‘forced component’ as an additional source of model noise, which in turn fails his perfect model (also a Wookie) with CI=2SE, hence CI=2SE cannot be used. This is nonsense.

    I have asked Beaker for IPCC citations, that support his position that CI=2SD is correct for the model ensemble output, but so far none have been offered. The climatology paper Ken linked in #463 uses 2SE for the CI and discusses the use of the CIs in models vs. observation falsification. The same is found in the papers I linked in #469 with regard to medicine, behavioral science, psychology and entomology. In all cases the use of 2SE for CI for falsification is presented as the standard practice.

    Cliff

  481. Willis Eschenbach
    Posted May 16, 2008 at 2:28 AM | Permalink

    Cliff, you say:

    RE:477 Willis,
    Looking at my plot of the Douglass Data in #439, I’m having trouble relating to your conclusion of ‘Not Different’ for 150 and 100 hPa. Either I screwed up the data plot or there is something else I don’t understand. The differences between the means on the plot look larger there, than anywhere else.

    Cliff

    Good catch, Cliff. The auditing never stops. You are 100% correct, my labels were reversed. So, they are not statistically different. Ignore Number 477, what was Nixon’s lovely phrase … “Those statements are inoperative”.

    w.

  482. Kenneth Fritsch
    Posted May 16, 2008 at 9:41 AM | Permalink

    RE: #479

    Whether you use error bars or a (slightly more sophisticated) t-test is not really the issue. The SE test is a test for a statistically detectable difference between the mean and the data. We know a-priori that a difference exists, so why is it any criticism of the models that the ensemble is large enough for it to be detected?

Just repeating over and over that the means are different a-priori does not make it true. One can assume, as Douglass et al. evidently did, that if the models are claimed to provide a reasonable prediction/projection/hindcast of the climate, then the ensemble average can be compared to the observed value under the assumption that the two should be equal within reasonable limits, which validates the use of an average and SEM for the comparison. The chaotic component remaining in the observed data that you claim is averaged out of the models would not be a reasonable objection, given that the authors were comparing ratios of trends that one would reasonably assume cancel the chaotic content.

Assuming that the observed result can be handled as just another data point (and not a target) in a distribution of all models (and all model runs, for that matter), and that we therefore need only look at the entire distribution of runs and determine whether the observed value lies significantly within it (or within its range, per Karl et al.), just does not make sense unless one considers the models a huge guessing game whose ensemble average means little or nothing, particularly when dealing with an individual feature of the models that depends heavily on the GHG effect. The Beaker and RC arguments and approaches would seem to put in grave doubt the general capability of climate models ever to be practically compared to observed results. That should be made as a separate argument, and it would confirm the Douglass et al. results/conclusions. Douglass et al., in my view, were merely saying: given the assumptions made by the climate modelers, here is a comparison of their output to the observed.

The term ‘inconsistency’ has no common meaning in statistics of the kind described by beaker, nor has he demonstrated one with references/links. Besides, Douglass et al. made it crystal clear what they were comparing, so there is little need to become pedantic about it.

Thinking people will realize that one can use numerous methods for making statistical comparisons, and that it is the underlying assumptions made and the conditions revealed that one uses to judge the utility of those methods.

  483. Christopher
    Posted May 16, 2008 at 11:16 AM | Permalink

    Re: Gavin on forced (see the what the IPCC models really say post)

    “Claims that GCMs project monotonic rises in temperature with increasing greenhouse gases are not valid. Natural variability does not disappear because there is a long term trend. The ensemble mean is monotonically increasing in the absence of large volcanoes, but this is the forced component of climate change, not a single realisation or anything that could happen in the real world.”

So beaker is correct here. Steve would likely agree with that statement as well. I’m somewhat surprised we are still debating this. Again, there is no one way to resolve this mess, as statistical significance (ill-posed or otherwise) does not equate to practical significance. I’ve stated that climate sigma is overrated, and the reason is that I do not think it’s greater than the distance between the “observed” and “predicted” lines in any of Douglass’s graphs, SD/SE this way or that. But that is practical significance. Again, there’s no test for this. You can only threshold this, to my knowledge. I’m hoping we can leave the wordsmithing behind and move forward; LTP and ergodicity as confounding factors here are more compelling imho. A pity bender left us long ago…
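
    For what it’s worth, one rough way to put a number on “practical significance” under an assumed threshold is to ask whether the whole confidence interval for the difference in means lies outside a “don’t care” band of size delta. A minimal Python sketch, where the trend arrays and the 0.1 K/decade threshold are purely hypothetical choices, not anything from the paper:

    import numpy as np
    from scipy import stats

    def practically_different(a, b, delta, alpha=0.05):
        # Is the whole (1 - alpha) CI for the difference in means outside the
        # +/- delta "don't care" band?  (Pooled dof is a crude choice; a Welch
        # correction would be more careful.)
        a, b = np.asarray(a, float), np.asarray(b, float)
        diff = a.mean() - b.mean()
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        dof = len(a) + len(b) - 2
        half = stats.t.ppf(1 - alpha / 2, dof) * se
        return abs(diff) - half > delta

    # Hypothetical trends (K/decade) and a hypothetical 0.1 K/decade threshold
    models = [0.31, 0.28, 0.35, 0.22, 0.30, 0.27]
    obs = [0.08, 0.05, 0.11, 0.06]
    print(practically_different(models, obs, delta=0.1))   # True with these made-up numbers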

  484. Posted May 16, 2008 at 12:45 PM | Permalink

Interesting post on Pielke Jr.’s blog, “The Helpful Undergraduate…”, where he links to this PDF written by John R. Lanzante, whose final sentence concludes:

While the error bar method is tempting, it is not grounded by statistical theory when more than one sample is involved.

    Who here says use the error bars on 55 modeled (toy) results which are ensembled???

  485. bender
    Posted May 16, 2008 at 1:18 PM | Permalink

    Beaker is obfuscating the argument to try to win a trivial point. Yet the point he is trying to win only serves to further the relevance of the Douglass observation. The observed data do not fit within the expected* range of predictions. Of course the “expectation” may be wrong, and that is a separate issue. If internal climate variability is under-represented in the GCMs then the expectation is far from realistic. Is that what Gavin is arguing? He is unclear, and shifting about. As usual.

  486. Kenneth Fritsch
    Posted May 16, 2008 at 2:11 PM | Permalink

    Christopher @ #483

The effect that Douglass et al. are comparing is not the overall climate that Gavin and Beaker are evidently referring to here. How much of this left-over thingy would be left in the single-rendition observed data relative to the model outputs when comparing ratios of the tropical tropospheric to surface temperature trends? I just do not see where it applies in this comparison. Can someone show me why it should?

  487. Kenneth Fritsch
    Posted May 16, 2008 at 2:24 PM | Permalink

    Re: #484

    Who here says use the error bars on 55 modeled (toy) results which are ensembled???

    I sometimes miss the point that the poster intended in cases such as this one.

I reiterate: the paper referenced is pointing to the problem of missing statistically significant differences when using error bars instead of the proper t-test using means and SEMs. In effect it says that Douglass et al. were being too conservative in using error bars. The author also points to abuse of error bars in IPCC reports and the Journal of Climate. I would add that I have seen that abuse at CA also.

  488. beaker
    Posted May 16, 2008 at 4:26 PM | Permalink

    Kenneth Fritsch (#487) says:

I reiterate: the paper referenced is pointing to the problem of missing statistically significant differences when using error bars instead of the proper t-test using means and SEMs. In effect it says that Douglass et al. were being too conservative in using error bars.

    No, that is not the case as Douglass et al. ignore the uncertainty on the observations altogether, so they do not test whether the error bars overlap (which is common practice, but overly conservative as noted).

    However, that is a minor issue compared to the others.

  489. Sam Urbinto
    Posted May 16, 2008 at 5:45 PM | Permalink

    Statistical mumbo-jumbo aside, what is the practical nature of a range of anything that takes into account any reading between -5 and +5 as being “consistent”?

  490. Kenneth Fritsch
    Posted May 16, 2008 at 6:15 PM | Permalink

    Re: #489

    Statistical mumbo-jumbo aside, what is the practical nature of a range of anything that takes into account any reading between -5 and +5 as being “consistent”?

If an ensemble of climate models has a very wide range of outputs, and the observed result is treated as if it were just another model output, and that observed result is obviously biased toward one end of that range, and you are told that you cannot reject the proposition that the models cover the range (or 2 SD) in which the observed result lies, then you can give that claim a judgment of practicality without much statistical inference. That judgment should also take into account the claims otherwise made by many of those arguing for this approach that the models together are telling us something about past, current and future climate.

  491. Kenneth Fritsch
    Posted May 16, 2008 at 6:34 PM | Permalink

    Re: #488

    No, that is not the case as Douglass et al. ignore the uncertainty on the observations altogether, so they do not test whether the error bars overlap (which is common practice, but overly conservative as noted).

Below I have excerpted the following from Douglass et al., which hardly indicates that the authors ignored the “uncertainty on the observations altogether”:

    3.2. Uncertainties

3.2.1. Surface
The three observed trends are quite close to each other. There are possibly systematic errors introduced by urban heat-island and land-use effects (Pielke et al., 2002; Kalnay and Cai, 2003) that may contribute a positive bias, though these are estimated as being within ±0.04 C/decade (Jones and Moberg, 2003).

    3.2.2. Radiosondes
    Free et al. (2005) estimate the uncertainties in the trend values as 0.03–0.04 C/decade for pressures in the range 700–150 hPa. For HadAT2, using their figure 10, we estimate the uncertainties as 0.07 C/decade over the range 850–200 hPa. For RAOBCORE, Haimberger (2006) gives the uncertainty as 0.05 C/decade for the range 300–850 hPa.
Several investigators revised the radiosonde datasets to reduce possible impacts of changing instrumentation and processing algorithms on long-term trends. Sherwood et al. (2005) have suggested biases arising from daytime solar heating. These effects have been addressed by Christy et al. (2007) and by Haimberger (2006). Sherwood et al. (2005) suggested that, over time, general improvements in the radiosonde instrumentation, particularly the response to solar heating, has led to negative biases in the daytime trends vs nighttime trends in unadjusted tropical stations. Christy et al. (2007) specifically examined this aspect for the tropical tropospheric layer and indeed confirmed a spuriously negative trend component in composited, unadjusted daytime data, but also discovered a likely spuriously positive trend in unadjusted nighttime measurements. Christy et al. (2007) adjusted day and night readings using both UAH and RSS satellite data on individual stations. Both RATPAC and HadAT2 compared very well with the adjusted datasets, being within ±0.05 C/decade, indicating that main cooling effect of the radiosonde changes were evidently detected and eliminated in both. Haimberger (2006) has also studied the daytime/nighttime bias and finds that ‘The spatiotemporal consistency of the global radiosonde dataset is improved by these adjustments and spurious large day-night differences are removed.’ Thus, the error estimates stated by Free et al. (2005), Haimberger (2006), and Coleman and Thorne (2005) are quite reasonable, so that the trend values are very likely to be accurate within ±0.10 C/decade.

    3.2.3. MSU satellite measurements
    Thorne et al. (2005a) consider the uncertainties in climate-trend measurements; dataset construction methodologies can add bias, which they call structural uncertainty (SU). We take the difference between MSU UAH and MSU RSS trend values, ∼0.1 C/decade, as an estimate of SU.

    Much has been made of the disparity between the trends from RSS and UAH (Santer et al., 2005) – caused by differences in adjustments to account for time-varying biases. Christy and Norris (2006) find that UAH trends are consistent with a high-quality set of radiosondes (VIZ radiosondes) for T2LT at the level of ±0.06 and for T2 at the level ±0.04. For RSS the corresponding values are ±0.12(T2LT) and ±0.10(T2). For T2LT, Christy et al. (2007) give a tropical precision of ±0.07 C/decade, based on internal data-processing choices and external comparison with six additional datasets, which all agreed with UAH to within ±0.04. Mears and Wentz (2005) estimate the tropical RSS T2LT error range as ±0.09. Thus, there is evidence to assign slightly more confidence to the UAH analysis.
    UMD does not provide statistics of inter-satellite error reduction, and, since the data are not in a form to perform direct radiosonde comparison tests, we are unable to estimate its error characteristics. In the later discussion we indicate the likelihood of spurious warming. UMD data were not in a form to allow detailed analysis such as provided in Christy and Norris (2006) to generate an error estimate.

  492. bender
    Posted May 16, 2008 at 6:53 PM | Permalink

    #490 is right on.
    #491 beaker is referring to statistical uncertainty caused by sampling error, not measurement uncertainty caused by experimental imprecision.

  493. beaker
    Posted May 17, 2008 at 12:32 AM | Permalink

#491 There is evident uncertainty in the observations, as you get different values for the trend using data from different sources. Since error bars were not placed on the observations, the uncertainty of the observations was ignored in the statistical test. Therefore it is biased in favour of rejection of the models rather than acceptance.

As I have commented earlier, it would be more productive to try to come up with a meaningful definition of useful that we can test for, rather than try and salvage the test used by Douglass et al., which is meaningless (although that does not mean that their conclusion is wrong, just that it isn’t supported by the test they perform).

  494. Cliff Huston
    Posted May 17, 2008 at 3:22 AM | Permalink

Here is the Douglass et al. data with the observations (treated as an ensemble) showing the ±2SE region.

    Cliff

  495. Michael Smith
    Posted May 17, 2008 at 6:18 AM | Permalink

    Beaker, in 459 wrote:

(iv) As a result, the SE test will reject an ensemble that gives perfect predictions (in terms of what the models actually aim to do) purely on the basis that it is “too large”.

    (v) as a consequence of (i) the SE test will fail a model that has an arbitrarily small difference from the observations purely on the basis that it is “too large”

    There, I haven’t mentioned infinity anywhere.

    Beaker, I know you’ve ceased being serious when you expect me to believe that switching from the term “infinite” to the term “too large” changes anything. Whatever you call it, your demand that any use of the SE test be based on the assumption of an unlimited “n” constitutes the assignment of an arbitrary constraint to that test.

  496. Kenneth Fritsch
    Posted May 17, 2008 at 7:41 AM | Permalink

    Re: #493

    #491 There is evident uncertainty in the observations, as you get different values for the trend using data from different sources, as error bars were not placed on the observations, the uncertainty of the observations was ignored in the statistical test. Therefore it is biased in favour of rejection of the models rather than acceptance.

The Douglass rendition of observational uncertainty deals with the uncertainty of the individual measurements, with the intent to show that even the most generous estimates of that uncertainty would not change their conclusion that the observed value does not overlap the model average ±2 SE. If they merely took all the observational data and put error bars on it, it would not overlap the model ±2 SE, and a t-test using 2 SE would show a difference.

Looking at all the observed results, with the individual estimated uncertainties for each, adds credibility to their comparison in that it allows the reader to determine the sensitivity of the comparison to how the observed results were obtained and to the use of different data series. One can readily visualize the worst-case observed result (for showing no difference between models and observed) with its uncertainty bounds and compare it to the model outputs.

  497. Mike B
    Posted May 17, 2008 at 9:27 AM | Permalink

    Beaker, I know you’ve ceased being serious when you expect me to believe that switching from the term “infinite” to the term “too large” changes anything. Whatever you call it, your demand that any use of the SE test be based on the assumption of an unlimited “n” constitutes the assignment of an arbitrary constraint to that test.

First, my apologies for my absence from this thread and for not helping refute some of Mr. Beaker’s more tortured criticisms of Douglass. I’ve had other obligations…

    Let me address Beaker’s latest argument (can’t use SE test because sufficiently large n will always reject) on two fronts.

First, Beaker has argued repeatedly that “we” need to focus on defining “useful”. Well, Beaker, if you would define useful, then your above concern would be rendered moot, because any statistically significant difference (induced by sufficiently large n) smaller than the “useful” difference would be irrelevant. Goose, meet gander.

Second, following your “too large” lead, as long as the between-model sigma (actually s) is too large, the SD test will never have the power to show that the ensemble mean differs from the observed value. How convenient for Gavin and the other climate modelers! The more the models diverge, the harder it becomes to falsify the ensemble!

    Douglass provides a devastating analysis of the efficacy of GCMs. Gavinian tantrums and Beaker’s carefully parsed nit-picking change nothing.

  498. bender
    Posted May 17, 2008 at 11:56 AM | Permalink

    The question is no longer “are the GCMs crap?”, but “which components, exactly, are faulty?”. Once they are corrected we will have a better test of the hypothesis. (The last test was, possibly, not as strong as we would like.)

    beaker, please, disappear now. Your usefulness has dropped to zero. Or less. (You’re going to have to start paying quatloos to post.)

    On “rejection”. The hypothesis test was clearly a failure, by any measure. We reject the null hypothesis (that every aspect of the GCMs is correct), but not the hypothesis that there may be quite a bit that is correct. In science we do not throw the baby out with the bathwater. We carry on, figuring out which part of the models is wrong. (Is it vertical convection parameterization, as Judith Curry has offered? Curious. Gavin assured me that no more than a few parameters were guessed at. I wonder how many parameters were guessed at in the vertical convection parameterization scheme alone.)

  499. Cliff Huston
    Posted May 17, 2008 at 7:57 PM | Permalink

    Beaker says:

As I have commented earlier, it would be more productive to try to come up with a meaningful definition of useful that we can test for, rather than try and salvage the test used by Douglass et al., which is meaningless (although that does not mean that their conclusion is wrong, just that it isn’t supported by the test they perform).

While I strongly disagree that the test used by Douglass et al. is meaningless, I do agree that a more robust presentation of their test is in order. Lanzante (2005) has shown that a t-test is more robust for comparing climate model ensemble data to observed ensemble data (a link to the paper is in #463 above). Following Lanzante (2005), here are the unpaired t-test results for Douglass et al.

The above was calculated using the web-based unpaired t-test calculator found here.
    It is clear that the t-test results are in complete agreement with the Douglass et al. test and, in this case, that the extra robustness of the t-test was not needed.
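
    For anyone who would rather script it than use the web calculator, here is a minimal Python sketch of the same pooled-variance (Student) unpaired t-test; the two arrays are hypothetical placeholders standing in for the per-model and per-dataset trends at a single pressure level.

    from scipy import stats

    # Hypothetical per-model and per-dataset trends (K/decade) at one pressure level
    model_trends = [0.31, 0.28, 0.35, 0.22, 0.30, 0.27]
    obs_trends = [0.08, 0.05, 0.11, 0.06]

    # equal_var=True is the classic pooled-variance unpaired t-test
    t_stat, p_value = stats.ttest_ind(model_trends, obs_trends, equal_var=True)
    print(round(t_stat, 2), round(p_value, 4))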

    Cliff

  500. Posted May 17, 2008 at 11:29 PM | Permalink

    bender asks:

    I wonder how many parameters were guessed at in the vertical convection parameterization scheme alone.

    In keeping with the general theme of the thread I have an answer to your question.

    Your guess is as good as mine.

  501. Craig Loehle
    Posted May 18, 2008 at 8:32 AM | Permalink

    Cliff: thanks for the t-test. How come it took until post 500 for someone to do this?

  502. kim
    Posted May 18, 2008 at 8:53 AM | Permalink

    Well, C, Cliff is one in a million but it is possible for such a one to appear in a sample of five hundred.
    ================================================

  503. Ron Cram
    Posted May 18, 2008 at 8:53 AM | Permalink

    Cliff,

    I also thank you. Perhaps you can publish this somewhere?

  504. Kenneth Fritsch
    Posted May 18, 2008 at 9:35 AM | Permalink

Cliff, I second (or is it third) your efforts in showing us what the t-tests reveal about the model mean versus the observation mean. I find that efforts like the one you made here are what make analyzing papers at CA fun — and a learning experience. I think that way of looking at papers (extracting as much practical meaning as possible) may be why we, or at least I, tend to be impatient with Beaker’s seeming willingness to write off the importance of a paper like Douglass et al. based on some technical discrepancies imposed by the critic himself.

I do, however, think that Douglass et al.’s looking at the individual observations, with uncertainties for each, in comparison with the model outputs is more instructive. Perhaps they should have presented the comparison both ways. One should also remember that the paper was making a comparison to UMD, UAH and RSS data at two tropospheric levels of T2lt and T2 that would make the observed ensemble more difficult to present in this form.

  505. Geoff Sherrington
    Posted May 19, 2008 at 2:49 AM | Permalink

    Still, I remain skeptical of model error estimates when the modeller does not state how many model runs were rejected voluntarily before submitting those published. I was taught that distributions help dictate the type of statistics to be used and that the whole population of relevant observations should be used to construct the distribution. Arbitrary or subjective selection of a sub-population is not scientific, is it? Because these are projections, there cannot easily be any criteria for selection/rejection, apart from a discovered violation of known physics/chemistry.

    There remains a probability that one of the silly-looking rejected projections will, as the real results roll in over the years, be the one that was closest to reality – irrespective of the learned discussions of statisticians above. Sorry, I’ve seen it happen before in other fields. I suspect, but have no good evidence, that this is the story of the ozone hole as well.

    Can we close this off now? The angels have danced on the head of the pin.

  506. Kenneth Fritsch
    Posted May 19, 2008 at 10:05 AM | Permalink

    Re: #505

    There remains a probability that one of the silly-looking rejected projections will, as the real results roll in over the years, be the one that was closest to reality – irrespective of the learned discussions of statisticians above. Sorry, I’ve seen it happen before in other fields. I suspect, but have no good evidence, that this is the story of the ozone hole as well.

Douglass et al. looked at model results that can be compared (judged) against observations and were not evaluated as future-looking projections. Under such conditions I would think that one could make a case for throwing out, or at least pointing out, model results that are shown not to reflect reality. If one has no confidence in an ensemble of results then one must admit to having no confidence in modeling. Unless one believes that one of the billions of monkeys pounding on typewriters will come up with a literary masterpiece.

    Now we can close this off.

  507. Cliff Huston
    Posted May 19, 2008 at 2:06 PM | Permalink

    Craig, kim, Ron, Kenneth: Thanks for the kind words.

    Craig:

    How come it took until post 500 for someone to do this?

    Mostly it has been all of the silly side issues due to the Gavin/Beaker spin. Until I saw the Lanzante (2005) paper, posted by Kenneth, I didn’t know that the t-test could be applied to this issue. Still I was unsure how to do the t-test until I read Roger Pielke, Jr.’s post “The Helpful Undergraduate”, then it was just a matter of turning the crank. Sorry to be slow, but I’m learning. 🙂

    kim: Yes, but remember: bad pennies have their own odds.

    Ron:

    Perhaps you can publish this somewhere?

I have, here. 🙂 I think the t-test and a plot as in #494 would be a good addition to Douglass et al.’s SI. I wonder if that is still doable now.

    Kenneth:

    One should also remember that the paper was making a comparison to UMD, UAH and RSS data at two tropospheric levels of T2lt and T2 that would make the observed ensemble more difficult to present in this form.

I read that comparison as a first-order reality check on the balloon data; their test used only the balloon data as observations.

BTW, I decided to try using the unpaired t-test that does not assume equal variance in the samples. Here are the quick and dirty results from Excel (a scripted version is sketched after the table):

    Welch’s T-Test (unequal variance):
    Surface___0.855783081___Not Significantly Different
    1000 hPa__0.012353380___Significantly Different
    925 hPa___0.008871460___Very Significantly Different
    850 hPa___0.002147190___Very Significantly Different
    700 hPa___0.000516788___Extremely Significantly Different
    600 hPa___0.001357159___Very Significantly Different
    500 hPa___0.000616360___Extremely Significantly Different
    400 hPa___0.000063744___Extremely Significantly Different
    300 hPa___0.000000809___Extremely Significantly Different
    250 hPa___0.000000369___Extremely Significantly Different
    200 hPa___0.000017286___Extremely Significantly Different
    150 hPa___0.000000004___Extremely Significantly Different
    100 hPa___0.000000630___Extremely Significantly Different
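
    Here is a minimal Python sketch of the Welch (unequal-variance) version, again with hypothetical placeholder arrays rather than the actual per-level trends, and with the significance wording assigned from the p-value the way the labels in the table above appear to be.

    from scipy import stats

    def significance_label(p):
        # Same wording convention the table above appears to use
        if p >= 0.05:
            return "Not Significantly Different"
        if p >= 0.01:
            return "Significantly Different"
        if p >= 0.001:
            return "Very Significantly Different"
        return "Extremely Significantly Different"

    # Hypothetical trends (K/decade) at one pressure level
    model_trends = [0.31, 0.28, 0.35, 0.22, 0.30, 0.27]
    obs_trends = [0.08, 0.05, 0.11, 0.06]

    # equal_var=False gives Welch's t-test (no equal-variance assumption)
    t_stat, p_value = stats.ttest_ind(model_trends, obs_trends, equal_var=False)
    print(round(p_value, 6), significance_label(p_value))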

In case anybody would like to check my work or play with the data, I have zipped up my Excel spreadsheet, along with a few of the plots as PDFs – the zip file can be found here.

    Cliff

  508. Raven
    Posted May 19, 2008 at 2:55 PM | Permalink

    Geoff Sherrington says:

    Still, I remain skeptical of model error estimates when the modeller does not state how many model runs were rejected voluntarily before submitting those published.

I have seen it documented in a number of places that all model runs start with a ‘calibration period’ where the trends are expected to be 0. Any model runs that show a non-zero trend during the calibration period are discarded. Furthermore, some model runs will show ‘spurious’ cooling which is due to some numerical problem caused by the oversimplified ocean. These runs are dropped because they are ‘obviously wrong’.

    The climate modellers would like us to believe that discarding model runs is a perfectly normal part of the process. I would like to hear the opinion of Jerry or others on whether this is true or not.

  509. Ron Cram
    Posted May 19, 2008 at 4:18 PM | Permalink

    Cliff,

    Publishing it here is great and I thank you for it. I guess what I had in mind was possibly publishing it as a comment in the journal that published the Douglass paper. You could discuss the criticism from RealClimate and your analysis and conclusions. Making it part of the SI is an idea, but would be below the radar of most readers of the journal. I would encourage you to look into it.

  510. Cliff Huston
    Posted May 21, 2008 at 1:50 AM | Permalink

    I found an error in my Douglass et al spread sheet, which I have now corrected. If you have downloaded my zip file, please do so again for the corrected version (link in #507). Also, the PDF versions of the plots in the zip file have been corrected.

    The plots shown in posts #439, #443, #449 and #494 above have been updated as well.

    Sorry for the bother,

    Cliff

  511. Kenneth Fritsch
    Posted May 21, 2008 at 10:57 AM | Permalink

Cliff, for what it is worth to have an amateur check your calculations, I verified what you found in post #499. I would go with that more conservative estimate over the one used in post #507, even though for the small-sample case that means pooling the SD estimate when it is rather obvious that the observed data have a smaller SD than the models at most of the pressure levels.

  512. Sam Urbinto
    Posted May 21, 2008 at 6:27 PM | Permalink

    I think I found another example of excellent comparisons between an ensemble (a French word meaning “We have no clue, but combining these models will let us say we have averaged out the errors and sound all statistical and stuff.”) and observations. But maybe not one anyone thinks of much. And it also gets into how reliable observations actually are.

While I was out and about doing other things, long ago, I noticed that GISS in one of its FAQ pages puts the most trusted models as giving a +14 C figure for the elusive surface air temperature, and various other conversations recently answered one question and of course brought up more…

There seems to be very little on the “base period” (for the GST anomaly) and what it means, but it appears to be the median of the min and max values reported by CMIP control runs. Now, this is again a model ensemble, isn’t it? We have an 11-17 C range, i.e. 14 ±3 C, or about ±21%. Can that be compared to a global mean temperature anomaly in the first place, and if so, what does it mean? Certainly the ensemble mean is not inconsistent with the observations, which span a range of less than 7% over 130 years, compared to the base period of whenever during the control run of 11-17 C.

Now, there has been some question of how valid the satellite readings at various altitudes in the atmosphere are, but I don’t see the same questions asked much about how valid the sampled air and water observations are. At least not on the same level. Aren’t they the same, the land/sea versus satellites?

No, worse. We know the satellites have improved, but they’re a modern invention covering a shorter time period. Not only have the air/water samplings also improved, but they’ve changed the method and the locations over time.

    I hate to bring this up. What is the proper method of comparing the GISS anomaly observation to the CMIP ensemble?

    We know ~ -.1 to +.1 is going to fall into -3 to +3 I take it…..

  513. Kenneth Fritsch
    Posted May 21, 2008 at 7:04 PM | Permalink

    Re: #512

Sam, I think one needs to be careful what one is referencing here. When I looked at how well the climate models matched observed past regional temperatures from the AR4 and AR3 reports, I found that the +/- variations, and biases for that matter, were invariably large. I believe that is the same concept of absolute temperature to which you are currently referring in your post.

What I also found from these reports was that when anomaly trends were reported and compared to the observed, the error bars decreased significantly. I have to assume – even though I found no explicit explanation in the reports – that the models have large errors in absolute temperature simulations and smaller ones in hindcasting temperature anomalies.

All this leads me to take this a step further, using ratios of temperature anomaly trends between the surface and the troposphere, which should be easier to model (assuming that the correct theory is being applied) and easier to measure, particularly for the satellite observations.

I am not defending the climate models here, but simply observing that there are levels of measurement the models can attempt, some of which are evidently less apt to result in large errors.

  514. Cliff Huston
    Posted May 22, 2008 at 3:54 AM | Permalink

    #511 Kenneth,

    Thanks for checking my calculations. This amateur use can all the help he can get.

    I would go with this more conservative estimate over that used in Post #507, even though that means for the small sample case pooling the SD estimate when it is rather obvious that the observed data has a smaller SD than that for the models — at most of the pressure levels.

    I agree. The reason I did the Welch’s T-Test in #507 is that the issue of unequal variance was raised on Roger Pielke Jr.’s site, as a criticism of his use of the T-Test. I wanted to check to make sure that the T-Test in #499 was in fact conservative. IMHO, #499 validates the case Douglass et al was making and I see no need to go further, but it is nice to know that the Welch’s T-Test results show more significant difference.

    Cliff

  515. Sam Urbinto
    Posted May 22, 2008 at 9:45 AM | Permalink

    That should have been above ~ -1 to +1

    Thinking further, before CMIP (1995) they had to track it also, so probably a less sophisticated ensemble, or simply one of the ones around 14 C or something else…

    Kenneth 513

    I was simply making the observation that the “base period” (zero line for the comparison) seems to be a model ensemble output of some sort, which would make the anomaly a variation from some mathematical operation on a model or group of models. If so, it doesn’t seem too surprising that we’d move around in the range, nor be unable to fit inside 3 degrees of freedom so to speak. It’s like comparing observations to an ensemble, but almost sort of in reverse.

    I have no issue with any of the comparisons, even if they’re “consistent but meaningless” or even “inconsistent but meaningful” or however it’s put.

  516. DeWitt Payne
    Posted Oct 13, 2008 at 2:52 PM | Permalink

    New paper from LLNL refutes Douglass et al.?

  517. Kenneth Fritsch
    Posted Oct 13, 2008 at 4:22 PM | Permalink

On first glance at the RC post, it would appear that the new paper is simply a rehash of the choice between looking at the models and assuming that the central tendency means something, so that a valid comparison with the measured results can be made as Douglass et al. did, and taking the position of the authors that the model results are so varied that their range can cover the measured results. The latter position could certainly support the view that the models are so uncertain that their results are meaningless vis-a-vis the ratios of the temperature trends between the tropical surface and troposphere.

It appears that RC employs the advocacy arguments of a good defense lawyer here, although I am not sure how the graphs of the model versus measurement results would fly with a jury that was either not committed to a conclusion or was halfway intelligent.

  518. hswiseman
    Posted Oct 13, 2008 at 8:53 PM | Permalink

    The paper
    Consistency of modelled and observed temperature trends in the tropical troposphere
B. D. Santer, P. W. Thorne, L. Haimberger, K. E. Taylor, T. M. L. Wigley, J. R. Lanzante, S. Solomon, M. Free, P. J. Gleckler,
    J. D. Jones, T. R. Karl, S. A. Klein, C. Mears, D. Nychka, G. A. Schmidt, S. C. Sherwood, and F. J. Wentz

    Can be found here.

    Click to access NR-08-10-05-article.pdf

    • Dave Dardinger
      Posted Oct 14, 2008 at 7:40 AM | Permalink

      Re: hswiseman (#520),

      Here’s an interesting tidbit from that paper:

      [RICH] uses a new automatic data homogenization method involving information from both reanalysis and composites of neighbouring radiosonde stations (Haimberger et al., 2008).

      Do the name “Mannamatic” ring a bell? Terms like “reanalysis”, “data homogenization” and “composites” worry me.

  519. DG
    Posted Oct 14, 2008 at 6:10 AM | Permalink

    Why doesn’t that paper reference Randall & Herman 2008?

    http://www.agu.org/pubs/crossref/2008/2007JD008864.shtml
    Using limited time period trends as a means to determine attribution of discrepancies in microwave sounding unit–derived tropospheric temperature time series

    Abstract

    Limited time period running trends are created from various microwave sounding unit (MSU) difference time series between the University of Alabama in Huntsville and Remote Sensing System (RSS) group’s lower troposphere (LT) and mid troposphere to lower stratosphere channels. This is accomplished in an effort to determine the causes of the greatest discrepancies between the two data sets. Results indicate the greatest discrepancies were over time periods where NOAA 11 through NOAA 15 adjustments were applied to the raw LT data over land. Discrepancies in the LT channel are shown to be dominated by differences in diurnal correction methods due to orbital drift; however, discrepancies from target parameter differences are also present. Comparison of MSU data with the reduced Radiosonde Atmospheric Temperature Products for Assessing Climate radiosonde data set indicates that RSS’s method (use of climate model) of determining diurnal effects is likely overestimating the correction in the LT channel. Diurnal correction signatures still exist in the RSS LT time series and are likely affecting the long-term trend with a warm bias. Our findings enhance the importance of understanding temporal changes in the atmospheric temperature trend profile and their implications on current climate studies.

  520. hswiseman
    Posted Oct 14, 2008 at 10:09 AM | Permalink

Santer et al. 2008 never misses an opportunity to take a pot shot at Douglass, implying that Douglass’ use of the then-current datasets was some kind of slipshod approach. I am sure that if Douglass had merely asked, they would have sent him an FTP password to access all the newly recalibrated data. The contempt for UAH also slips through, holding out RSS data as somehow far superior, when in fact the results are quite close, and with far less divergence than exists in the 49 models. I wonder whether the slight difference between the two was enough to move the significance needle and render the conclusions slightly less robust. A result-oriented need to select one over the other might explain the intensive justifications offered.

I cannot comment on the claim of bias in Douglass’ screening technique for selecting eligible models, but there are a lot of fancy statistics to support Santer’s filtering approach and claim of bias. Douglass’ approach was commonsensical and mechanistic, but that alone doesn’t mean it is without fault, Occam’s Razor notwithstanding. I really don’t see what is accomplished by the petty and contemptuous tone of the article, but that attitude seems pretty pervasive these days among the AGW crowd. Such an attitude isn’t that constructive here at CA either (maybe I should not have posted that monkey chart video), but this is blog-land, not publishing.

    I think as a general principle that anytime you take a scalpel to the data in order to make the models look better, the burden of justifying each and every alteration falls heavily on the authors. I will leave it others with the math and stats chops to opine whether that burden has been carried.

  521. hswiseman
    Posted Oct 14, 2008 at 12:00 PM | Permalink

“Since most of the 20CEN experiments end in 1999, our trend comparisons primarily cover the 252-month period from January 1979 to December 1999, which is the period of maximum overlap between the observed MSU data and the model simulations.”

    This approach also sticks the 1998 Super El Nino (a statistical outlier if there ever was one, begging for data tinkering) on the end of the study period where it can cause maximum mischief with the smoothing.

    • jae
      Posted Oct 14, 2008 at 5:36 PM | Permalink

      Re: hswiseman (#524),

      Good point. Look at their figures. Why on earth would a 2008 paper omit all the data since 1999??? Looks like a hockey-stick Team approach to the problem.

  522. Pat Keating
    Posted Oct 14, 2008 at 5:35 PM | Permalink

    The MSU data is perhaps a little shaky, due to the lack of consistency in the instrumentation over the years (re ‘homogenization’). However, IMO, the strongest evidence for the cooling trend comes from the radiosonde data, which seems to be pretty solid, despite Sherwood et al’s attempt at criticism.

  523. Kenneth Fritsch
    Posted Oct 14, 2008 at 6:58 PM | Permalink

I have just finished a first read of Santer et al. (2008), which was published in reply to the Douglass et al. (2007) paper that had thrown into doubt the capability of the climate models to reproduce the observed results of the ratios of the trends in tropical surface and troposphere temperatures over the recent past.

I think that the proper reaction to Douglass et al. and the recent Santer et al. should be a big thank you to both parties. Thank you, Douglass et al., for provoking the reaction of Santer et al., and thank you, Santer et al., for showing us that not only is the variation among the models from floor to ceiling, but when you throw all the observations together, using the more recent sources of measurements of tropical troposphere temperature trends, the observational variation is apparently from floor to ceiling (and maybe the sky) as well.

If one is willing to wade through the obligatory advocacy-oriented statements from Santer et al., as indicated by the following, “We find that there is no longer a serious discrepancy between modelled and observed trends in tropical lapse rates. This emerging reconciliation of models and observations has two primary explanations.”, and proceed to a graph (shown below) that summarizes the model and observed results, one can readily conclude the following:

1. There exists a huge range in model results, as indicated by the 2-standard-deviation spread of the model trends, which reaches approximately 0.6 degrees C per decade.

2. An apparently even larger variation exists in the observed results, as indicated by the individually displayed trends from 7 radiosonde and 2 satellite sources. Note that the radiosondes cover the entire range of pressure levels, while the satellite results are confined to pressure zones, i.e. T2 and T2LT.

3. On my first read I did not see any discussion of something that hits the layperson smack in the eye, i.e. that the shapes of the observed profiles are very different from the average of the model results. Even the newer radiosonde measurement sources, which Santer et al. claim bring the observations more in line with the models since they show higher tropospheric temperature trends, only show that characteristic at some points on the pressure-level range. In fact, all of them except RICH show curves over the entire pressure-level range that are considerably different from the model average and from the older radiosonde measurements. As a layperson I would suspect that a better statistical analysis of the observed versus model results over the entire pressure-level range would be to compare the shapes of the curves for statistically significant differences.

    I also have a problem with the graph in that the important value is the difference between (or ratio of) the surface and troposphere temperature trends and this graph does not show that value.

Finally, the Santer paper makes reference to some recent correction of the RSS measurements that brings them closer to the model average but further from the UAH measurements. I would like to read more about that adjustment and any comments from Spencer and Christy. Does anyone have a reference?

Figure 6. Vertical profiles of trends in atmospheric temperature (panel A) and in actual and synthetic MSU temperatures (panel B). All trends were calculated using monthly-mean anomaly data, spatially averaged over 20 °N–20 °S. Results in panel A are from seven radiosonde datasets (RATPAC-A, RICH, HadAT2, IUK, and three versions of RAOBCORE; see Section 2.1.2) and 19 different climate models. Tropical T_SST and T_L+O trends from the same climate models and four different observational datasets (Section 2.1.3) are also shown. The multi-model average trend at a discrete pressure level, bm(z), was calculated from the ensemble-mean trends of the individual models [see Equation (7)]. The grey-shaded envelope is the 2σ standard deviation of the ensemble-mean trends at discrete pressure levels. The yellow envelope represents 2σ_SE, DCPS07’s estimate of uncertainty in the mean trend. For visual display purposes, T_L+O results have been offset vertically to make it easier to discriminate between trends in T_L+O and T_SST. Satellite and radiosonde trends in panel B are plotted with their respective adjusted 2σ confidence intervals (see Section 4.1). Model results are the multi-model average trend and the standard deviation of the ensemble-mean trends, and the grey- and yellow-shaded areas represent the same uncertainty estimates described in panel A (but now for layer-averaged temperatures rather than temperatures at discrete pressure levels). The y-axis in panel B is nominal, and bears no relation to the pressure coordinates in panel A. The analysis period is January 1979 through December 1999, the period of maximum overlap between the observations and most of the model 20CEN simulations. Note that DCPS07 used the same analysis period for model data, but calculated all observed trends over 1979–2004.

    • Pat Keating
      Posted Oct 14, 2008 at 7:34 PM | Permalink

      Re: Kenneth Fritsch (#527),

      Where did the RICH data come from?

      • Kenneth Fritsch
        Posted Oct 15, 2008 at 9:26 AM | Permalink

        Re: Pat Keating (#528),

        Pat, below I have excerpted a comment from the Santer et al. (2008) paper describing and referencing the RICH dataset and measurements.

        The third (RICH; ‘Radiosonde Innovation Composite Homogenization’) uses a new automatic data homogenization method involving information from both reanalysis and composites of neighbouring radiosonde stations (Haimberger et al., 2008).

        I would strongly recommend the Steve M review of the radiosonde datasets and data manipulation in his thread at CA titled, “Leopold in the Sky with Diamonds” that is linked here:

        http://www.climateaudit.org/?p=3082

        Below I have excerpted a number of comments (in no particular order) from Steve M’s review that I think point to major concerns with the methodologies used and describe some of the essentials of the data manipulations.

It is my view that, given adjustable data, some in the climate science community can adjust it rather arbitrarily, without detailed and proven methodologies, as evidenced by the wide range of results from the individual massages given by various scientists. The further upshot of this process is that the modelers can then point to their preferred manipulated data for comparisons.

        Raobcore v1.4 is hardly the last word in radiosonde adjustments. Leopold Haimberger (and I intend no slight by the title of the post, I just liked the sound of it) has already moved on to yet another adjustment system (“RICH”); Allen and Sherwood 2008 have used Iterative Universal Kriging, adding wind information into their adjustment brew. What each of these studies has in common is that none of them are new “experimental verification”; they are merely adjustments of ever increasing magnitude.

        Allen and Sherwood 2008 try a different tack – they try to create a homogenized wind data series on the basis that the radiosonde wind data is much less screwed up. They then argue that the trends in wind are consistent with tropical troposphere warming. They use this as evidence for the side of the argument that the UAH satellite temperature trends in the tropics are incorrect. I guess that we’ll see more about tropospheric wind data in the next while.

        So Raobcore v1.2 argued in a peer reviewed journal that there were post-1986 inhomogeneities in the ERA-40 model that required adjustment, giving a list of such inhomogeneities. Raobcore v1.4 decided that adjustments to ERA-40 after 1986 were not required after all. One would have thought that Journal of Climate would have required a detailed explanation of why Haimberger et al had changed their views so quickly and a detailed analysis of each post-1986 adjustment that was no longer deemed pertinent. A year earlier, Haimberger expressed concern about “jumps” at the end of NOAA-14. A year later, he was no longer concerned. Why? Is there any such analysis in Haimberger et al 2008? Nope.

        The algorithm in Haimberger et al 2007 added a novel tweak to changepoint methods – a tweak that should not be accepted a proven methodology, merely because it’s been published in a journal with weak statistical refereeing (Journal of Climate):

        “This paper introduces a new technique that uses time series of temperature differences between the original radiosonde observations (obs) and background forecasts (bg) of an atmospheric climate data assimilation system for homogenization.”

        • Pat Keating
          Posted Oct 15, 2008 at 3:36 PM | Permalink

          Re: Kenneth Fritsch (#529),
          Thank you for a thorough answer to my question.
          It appears that we have a serious case of “If you don’t get the result you want, adjust the data”. I believe that this comes from the same group that also continually ‘adjusts’ historical temperature data. Anything I can say about the apparent level of integrity in some of the work in climate science has already been said by others, so I will leave it at that.

  524. DG
    Posted Oct 15, 2008 at 10:38 AM | Permalink

    Pat Keating,
    Could the referenced RSS adjustment be related to this?
    ftp://ftp.ssmi.com/msu/readme_jan_2008.txt

    Again however, there is still Randall & Herman
    http://www.agu.org/pubs/crossref/2008/2007JD008864.shtml

    also noted by RPS
    http://climatesci.org/2008/01/01/important-new-paper-using-limited-time-period-trends-as-a-means-to-determine-attribution-of-discrepancies-in-microwave-sounding-unit-derived-tropospheric-temperature-time-by-rmrandall-and-bm-herman/
