Resolving the Santer Problem

In today’s post, I set out what I think is an interesting approach to the Santer problem, one that represents a substantial improvement on the analyses of either the Santer or Douglass posses.

I think that the approach proposed here is virtually identical to Jaynes’ approach to analyzing the difference between two means, as set out in the article recommended by beaker. As it happens, I’d done all of the calculations shown today prior to reading this article. While my own calculations were motivated primarily by trying to make sense of the data rather than anything philosophical, academics like to pigeonhole approaches and, to that extent, the approach shown below would perhaps qualify as a “bayesian” approach to the Santer problem, as opposed to the “frequentist” approach used both by Santer and Douglass. I had the post below pretty much in hand, when I was teasing beaker about being a closet frequentist.

The first problem discussed in Jaynes 1976 was the following question about the difference of two means (and keep in mind that the issue in Santer v Douglass is the difference between two trends):

The form of result arrived at by Jaynes was the following – that there was a 92% probability that B’s components have a greater mean life.

The form of conclusion that I’m going to arrive at today is going to be identical – ironically even the number is identical: there is a 92% probability that a model trend will exceed the true observed trend. Now as a caveat, my own terminology here may be a little homemade and/or idiosyncratic; I place more weight on the structure of the calculations which are objective. I think that the calculations below are identical to the following equation in Jaynes 1976 (though I don’t swear to this and, as noted before, I’d done the calculations below before I became aware of Jaynes 1976).

Let me start with the following IMO instructive diagram, which represents my effort to calculate a probability distribution of the slope of the trend, given the observations to 1999 (Santer), 2004 (Douglass) and up to 2008 (most of which were available to Santer et al 2008). It looks like this corresponds to Jaynes’ P_n(a), i.e. what bayesians call a “posterior distribution”, but I’m just learning the lingo. The dotted vertical lines represent 95% confidence intervals from the profile likelihood calculations shown in a prior post, color coded by endpoint. The colored dots represent 95% confidence intervals calculated using the Neff rule of thumb for AR1 autocorrelation in Santer. The black triangle at the bottom shows ensemble mean trend (0.214 deg C/decade).

Figure 1. See explanation in text immediately preceding figure.

A couple of obvious comments. The 95% confidence intervals derived from profile likelihoods are pretty much identical to the 95% confidence intervals derived from the AR1 rule of thumb – that’s reassuring in a couple of ways. It reassured me at least that my experimental calculations had not gone totally off the rails. Second, the illustration of probability distributions shown here is vastly more informative than anything in either Santer or Douglass both for the individual years and showing the impact of over 100 more months of data in reducing the CI of the trend due to AR1 autocorrelation. (A note here: AR1 autocorrelation wears off fairly quickly – if LTP is really in play, then the story would be different.)
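For readers who want to try the rule-of-thumb intervals themselves, here is a minimal R sketch of the effective-sample-size adjustment as I understand it from Santer et al 2008: fit an OLS trend, take the lag-one autocorrelation r1 of the residuals, set neff = n(1 - r1)/(1 + r1), and use neff - 2 degrees of freedom for both the residual variance and the t critical value. The series is synthetic (an AR1 process with an assumed trend of 0.06 deg C/decade) and the helper name neff_trend_ci is made up for this illustration; it is not the script behind Figure 1.

```r
# Sketch only: synthetic data and a hypothetical helper, not the MSU series.
neff_trend_ci <- function(y, tt, level = 0.95) {
  fit  <- lm(y ~ tt)
  res  <- residuals(fit)
  n    <- length(y)
  r1   <- acf(res, lag.max = 1, plot = FALSE)$acf[2]   # lag-1 autocorrelation
  neff <- n * (1 - r1) / (1 + r1)                      # effective sample size
  # Residual variance and trend standard error on neff - 2 degrees of freedom
  s2   <- sum(res^2) / (neff - 2)
  se   <- sqrt(s2 / sum((tt - mean(tt))^2))
  b    <- unname(coef(fit)[2])
  half <- qt(1 - (1 - level) / 2, df = neff - 2) * se
  list(trend = b, r1 = r1, neff = neff, se = se, ci = c(b - half, b + half))
}

set.seed(1)
n  <- 360                                  # 30 years of monthly values (assumed)
tt <- seq_len(n) / 120                     # time in decades
y  <- 0.06 * tt + as.numeric(arima.sim(list(ar = 0.8), n = n, sd = 0.1))
out <- neff_trend_ci(y, tt)
out$ci                                     # adjusted 95% interval for the trend
```

With positive autocorrelation neff is well below n, so the adjusted interval is wider than the naive OLS one, which is the whole point of the adjustment.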

Third, and this is perhaps a little unexpected, inclusion of data from 1999 to 2008 has negligible impact on the maximum likelihood trend of the MSU tropical data; the most distinct impact is on the confidence intervals, which narrow a great deal. The CIs inclusive of 2008 data are about 50% narrower than the ones using data only up to 1999. Whether, or to what extent, this is an apples-to-apples comparison, I’ll defer for now.

Fourth, and this may prove important, although the position of the ensemble mean triangle relative to the 95% CIs is unaffected by the inclusion of data up to 2004, it is affected by the inclusion of data up to 2008. With 2008 data in, the CI narrows such that the ensemble mean is outside the 95% CI of the observations – something that seems intuitively relevant to an analyst. I feel somewhat emboldened by Jaynes 1976’s strong advocacy of pointing out things that are plain to analysts, if not frequentists.

My procedure for calculating the distribution was, I think, interesting, even if a little homemade. The likelihood diagrams that I derived a couple of days ago had a function that yielded confidence intervals (this is what yielded the 95% CI line segment). I repeated the calculation for confidence levels ranging from 0 to 99%, which with a little manipulation gave a cumulative distribution function at irregular intervals. I fitted a spline function to these irregular intervals and obtained values at regular intervals and from these obtained the probability density functions shown here. The idea behind the likelihood diagrams was “profile likelihood” along the lines of Brown and Sundberg – assuming a slope and then calculating the likelihood of the best AR1 fit to the residuals. I don’t know how this fits into Bayesian protocols, but it definitely yielded results that seem to be far more “interesting” than the arid table of Santer et al.
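The steps in the preceding paragraph can be sketched roughly as follows. This is an illustration on a synthetic AR1 series, not my actual script: every object name is made up, the intervals are assumed to come from the usual chi-square cutoff on the likelihood ratio, and the interval endpoints are found by simple interpolation on a slope grid.

```r
# Sketch only: synthetic series; every object name here is made up.
set.seed(2)
n  <- 360                                            # 30 years of months (assumed)
tt <- (seq_len(n) - (n + 1) / 2) / 120               # centred time in decades
y  <- 0.06 * tt + as.numeric(arima.sim(list(ar = 0.8), n = n, sd = 0.1))

# Profile log-likelihood: assume a slope b, then take the likelihood of the
# best AR1 fit (with a free mean) to the residuals y - b * tt.
prof_ll <- function(b) {
  arima(y - b * tt, order = c(1, 0, 0), include.mean = TRUE, method = "ML")$loglik
}
b_grid <- seq(-0.20, 0.32, by = 0.005)
ll     <- sapply(b_grid, prof_ll)
i_max  <- which.max(ll)
root   <- sqrt(pmax(2 * (max(ll) - ll), 0))          # square root of the LR statistic

# Interval at a given confidence level: slopes whose LR statistic is below the
# chi-square cutoff; endpoints interpolated on each side of the maximum.
ci_at <- function(level) {
  cut   <- sqrt(qchisq(level, df = 1))
  left  <- 1:i_max
  right <- i_max:length(b_grid)
  c(approx(root[left],  b_grid[left],  xout = cut)$y,
    approx(root[right], b_grid[right], xout = cut)$y)
}
levels <- seq(0.01, 0.99, by = 0.01)
ends   <- t(sapply(levels, ci_at))                   # columns: lower, upper

# A level-p interval leaves (1 - p)/2 below its lower end and (1 + p)/2 below
# its upper end, which gives the CDF at irregularly spaced slopes.
cdf_x <- c(rev(ends[, 1]), b_grid[i_max], ends[, 2])
cdf_p <- c(rev((1 - levels) / 2), 0.5, (1 + levels) / 2)
ok    <- is.finite(cdf_x)
cdf   <- splinefun(cdf_x[ok], cdf_p[ok], method = "hyman")   # monotone spline

# Evaluate on a regular 0.01 grid and difference to get probability per cell
a  <- seq(-0.20, 0.32, by = 0.01)
Fa <- ifelse(a < min(cdf_x[ok]), 0, ifelse(a > max(cdf_x[ok]), 1, cdf(a)))
d  <- diff(c(0, Fa))
d  <- d / sum(d)                                     # probabilities sum to 1
```

I used a monotone (Hyman) spline in this sketch so that the fitted CDF cannot wiggle downwards between knots; an ordinary cubic spline can produce small negative densities. The half percent in each tail beyond the 99% interval is simply lumped into the end cells.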

My next diagram is a simple histogram of LT2T trends derived from data in Douglass et al 2007 Table 1, which, for the moment, I’m assuming is an accurate collation of Santer’s 20CEN results. In Douglass’ table, he averaged results within a given model; I’ll revisit the analysis shown here if and when I get digital versions of the 49 runs from Santer (Santer et al did not archive digital information for their 49 runs; I’ve requested the digital data from Santer – see comments below). [Note – beaker in a comment below states that this comparison is pointless. I remain baffled by his arguments as to why this sort of comparison is pointless and have requested a reference illuminating the statistical philosophy of taking a GCM ensemble and will undertake to revisit the matter if and when such a reference arrives.]

I derived LT2T trends by applying weights for each level (sent to me by John Christy) to the trends at each level in the Douglass table. In the diagram, I’ve carried forward the 95% CIs from the observations for location and scale comparison, as well as the black triangle for the ensemble mean. The $64 question is then the one that Jaynes asks. [Note – I think that this needs to be re-worked a bit more as a “posterior” distribution to make it more like Figure 1; I’m doing that this afternoon – Oct 21 – and will report on this.]
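The weighting step is just a weighted sum across levels for each model. A minimal R sketch, with the loud caveat that the pressure levels, weights and trends below are placeholders that I made up for illustration; they are not the weights John Christy sent me and not the Douglass Table 1 values:

```r
# Sketch only: levels, weights and trends are invented placeholders.
levels_hPa <- c(1000, 850, 700, 500, 300, 200)
w          <- c(0.10, 0.25, 0.25, 0.25, 0.10, 0.05)   # assumed normalised to sum to 1

# One row per model, one column per pressure level (deg C/decade)
trends <- rbind(modelA = c(0.12, 0.15, 0.18, 0.22, 0.28, 0.26),
                modelB = c(0.10, 0.11, 0.13, 0.16, 0.20, 0.19))
colnames(trends) <- levels_hPa

# A trend is linear in the data, so the trend of a weighted average of levels
# equals the weighted average of the level trends.
trend_LT <- as.vector(trends %*% w)
names(trend_LT) <- rownames(trends)
trend_LT
```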

Figure 2. Histogram of LT trends from Douglass et al 2007 data.

My next diagram is the proportion of models with LT2T trends that exceed a given x-value. I think that this corresponds to Jaynes’ P_m(b). The dotted lines are as before; the solid color-coded vertical lines are the maximum likelihood trend estimates for the respective periods. Again for an analyst, the models seem to be to the right of these estimates.

Figure 3. Proportion of Model Trends (per Douglass Table 1 information) exceeding a given value.

I then did a form of integration which I think is along the lines of the Jaynes recipe. I made a data frame with x-values in 0.01 increments (Data[[i]]$a, with i subscripting the three periods 1999, 2004 and 2008). For each value of x, I made a column of values representing the results of Figure 3:

# Proportion of the 22 models in Douglass Table 1 whose LT trend is at least x
for (i in 1:3) Data[[i]]$fail <- sapply(Data[[i]]$a, function(x) sum(douglass$trend_LT >= x) / 22)

Having already calculated the distribution P_n(a) in column Data[[i]]$d, I calculated the product as follows:

# Weight each exceedance proportion by the probability P_n(a) at that trend value
for (i in 1:3) Data[[i]]$density <- Data[[i]]$d * Data[[i]]$fail

Then simply add up the column for each period to get the integral (which should be an answer to Jaynes’ question):

sapply(Data, function(A) sum(A$density))
#      1999      2004      2008
# 0.8638952 0.8530540 0.9164450

On this basis, using data up to 2008, there is a 91.6% probability that the model trend will exceed the observed trend (86.4% and 85.3% using data up to 1999 and 2004 respectively).
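For anyone wanting to check the logic of the summation without my data, here is a self-contained R sketch. The sum of d times fail is a discrete version of integrating the exceedance proportion against the distribution of the observed trend, and it only works as written if column d holds probability per 0.01 cell (summing to 1); if d held density values, the sum would need to be multiplied by the 0.01 spacing. Everything below is synthetic: a normal curve stands in for d, and 22 random numbers stand in for douglass$trend_LT.

```r
# Sketch only: synthetic stand-ins for Data[[i]] and douglass.
set.seed(3)
a <- seq(-0.30, 0.50, by = 0.01)                 # grid of candidate observed trends

# Stand-in for column d: probability per grid cell (sums to 1), taken from a
# normal curve rather than from the profile likelihood.
d <- dnorm(a, mean = 0.06, sd = 0.05)
d <- d / sum(d)

# Stand-in for douglass$trend_LT: 22 made-up model trends.
model_trend <- rnorm(22, mean = 0.20, sd = 0.09)

# Proportion of models at or above each grid value (the quantity in Figure 3)
fail <- sapply(a, function(x) mean(model_trend >= x))

# Probability that a randomly chosen model trend exceeds the observed trend
p_exceed <- sum(d * fail)

# Cross-check by simulation: draw observed trends from d and models at random
obs_draw   <- sample(a, 1e5, replace = TRUE, prob = d)
model_draw <- sample(model_trend, 1e5, replace = TRUE)
c(sum = p_exceed, simulated = mean(model_draw >= obs_draw))
```

The two numbers should agree to within simulation noise, which is a useful sanity check that the column sum really is the probability of a model trend exceeding the observed trend.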

I leave it to climate philosophers to decide whether this is “significant” or not.

Going back to Jaynes 1976 – I think that this calculation turns out to be an uncannily exact implementation of how Jaynes approached the problem. In that respect, it was very timely for me to read this reference while the calculation was fresh in my mind. It was also reassuring that, merely by implementing practical analysis methods, I’d in effect got to the same answer as Jaynes had long ago.


  1. Steve McIntyre
    Posted Oct 21, 2008 at 9:38 AM | Permalink

    I asked one of Santer’s coauthors for a digital version of the 49 runs, but he said that he didn’t have the data, that I would have to write a letter to Santer. I wrote a letter to Santer yesterday saying that I’d been a good boy and could I please have the data.

    I received an automated email saying that Santer was occupied at a workshop.

    Of course. It’s only a couple of months till Christmas. Yeah, yeah, I know it’s lame, but it seemed funny at the time. (And yes, the obvious joke about a collective name for Santer’s coauthors crossed my mind.)

  2. beaker
    Posted Oct 21, 2008 at 9:54 AM | Permalink

    I think that the approach proposed here is virtually identical to Jaynes’ approach to analyzing the difference between two means, as set out in the article recommended by beaker.

    Steve, while I would recommend the tools of Bayesian probability that Jaynes recommends, a test of a difference between the mean of the ensemble and the mean of the observations is not appropriate. This is because the ensemble mean is an estimate of the average tropical trend for a frequentist style hypothetical population of Earths, not the trend on a specific member of that population (i.e. the Earth we actually occupy). When you understand the parallel Earth thought experiment, you will realise why this is the case.

    There is no point in talking about statistics until you appreciate what the modellers claim the models are actually capable of (i.e. the significance of the “forced” in “forced climate change”).

    Sadly, I doubt that Jaynes would support your test as there is a disconnect with the question.

    P.S. I suspect Santer may just have heard that one before ;o)

    Steve: beaker, I would very much like to “appreciate” what the modelers claim that the models are capable of. On previous occasions, I’ve asked you to provide a citation to a relevant authority and you’ve failed to do so and I wasn’t able to see a connection between your own opus (beaker has cordially identified himself to me offline) and use of ensembles in models. If I am having trouble grasping this point and there is no usable reference material, then you can hardly blame me. The fault lies with the climate community for not providing reference material on an important point. I know that you’ve tried hard to educate me on this, but honestly I don’t get the point. In any event, it’s not your responsibility. But again, if you can think of a statistical reference that deals appropriately with the concept of a model ensemble, I’m all ears.

    • RomanM
      Posted Oct 21, 2008 at 10:38 AM | Permalink

      Re: beaker (#2),

      This is because the ensemble mean is an estimate of the average tropical trend for a frequentist style hypothetical population of Earths, not the trend on a specific member of that population (i.e. the Earth we actually occupy).

      Beaker, give it up! Your straw man of a “hypothetical population of Earths” is just hogwash. Hey, the past really happened. There IS an actual set of “trends” to the temperature that can be defined in a unique fashion. One set. A single set of values , one for each pressure. The next group studying what happened will be considering that same set of values. The only uncertainty in the measured trends comes from the measuring and the processing of the data. The real question is how well can the models guess those values. Nobody is asking them to match every wiggle. All we want in this case is that the models exhibit a certain general type of behaviour as measured by the trend values. Calculate a summary of 9 or 10 trend values for each model. Get a handle on how variable those values from each model are when random components are included in the model. Now compare (using the appropriate statistical methodology) to see how close they came to the real world values with a view to whether they might be able to do this consistently (no, not your specious definition) in the future. It’s plain ordinary common sense! Your personal version of what constitutes “testing consistency” is pretty much a nothing exercise and can provide no insight to the situation.

      • DeWitt Payne
        Posted Oct 22, 2008 at 12:23 PM | Permalink

        Re: RomanM (#7),

        I think you have it exactly backwards. The variability within and between models is only important if it tells us something about the variability of the real climate, the weather noise as opposed to measurement error. Beaker’s parallel universes represent the ideal case of an ensemble of perfect physical models. Our planet, however, is only one realization. Even if you could measure past trends on this planet with zero error and plug the numbers into a perfect physical model, the future trend would still be uncertain. To represent this uncertainty you would still need to put non-zero confidence (or whatever you want to call them) limits on the past trend to see if any model or collection of models fit the data.

        • Alan Wilkinson
          Posted Oct 22, 2008 at 1:45 PM | Permalink

          Re: DeWitt Payne (#43),

          If you can’t predict the future with any model because of chaotic uncertainty, what on earth (or off it) is the point of the IPCC projections? And why should anyone believe that an ensemble of arbitrarily chosen runs is any more powerful than the best single model?

        • Kenneth Fritsch
          Posted Oct 22, 2008 at 1:46 PM | Permalink

          Re: DeWitt Payne (#43),

          Beaker’s parallel universes represent the ideal case of an ensemble of perfect physical models. Our planet, however, is only one realization. Even if you could measure past trends on this planet with zero error and plug the numbers into a perfect physical model, the future trend would still be uncertain.

          You have summarized the ongoing debate that I have had with Beaker. If a single realization of the earth’s climate has this uncertain value attached and if that uncertain value varies with individual realizations and further if the models cannot capture that uncertain value and its variation, then how could a model or ensemble of models make a prediction about future earth climates?

          If one could estimate that uncertain value then one could compare a climate model (or ensemble of models average) result to that of the observed earth (our experienced single rendition that is) plus/minus that uncertain value and proceed with that result in doing the statistical testing. If one cannot estimate that uncertain value of the single rendition then climate model results would become prone to subjective interpretation as they would say, our perfect models say this will be the future trend due to AGW, or whatever, but unfortunately our single earth rendition will have this plus/minus uncertain value that is, well uncertain and evidently not capable of being estimated.

        • RomanM
          Posted Oct 22, 2008 at 2:22 PM | Permalink

          Re: DeWitt Payne (#43),

          We are interested in examining what is happening on this earth (there is only one). The “perfectly measured” temperature record is what we are interested in. The talk of parallel universes is an unnecessary construct which offers nothing tangible to the situation. I would also be interested in knowing on what scientific basis you would model the distribution of the characteristics of these worlds – some sort of invented Bayesianism?

          Surely, you must also realize that a set of nonrandomly selected models each consisting of a single value description cannot be scientifically analyzed in a meaningful way without an extraordinary amount of assumed baggage. The only reason this has been done by these authors is that the IPCC chose to do this in their reports.

          The only common sense way to assess the models is individually. Models have many different aspects to them – the trends are just a single part. If a model is deterministic, its ability to “model” the trend does not lend itself to statistical analysis. You need to look at it in terms of evaluating how different input conditions affect the output. If there are random components then you also need many runs to evaluate how the model responds to the random factors. Look at how variable the predicted trends are and where they are centered. These runs are all under the same (known) conditions that existed on this earth. In the case that these results are not centered near the measured trends, then there is a bias in the models. That’s not good. If the variability is too high, then they have no practical value. Both of these last two features are where statistical techniques are useful.

          And, yes, regardless of how the models perform, there is no guarantee that they will do that in the future. However, they must first be able to hindcast in a sufficiently accurate fashion at the same time demonstrating they are reacting in an appropriate fashion to changes in the various input factors for us to have any faith in their future ability.

        • DeWitt Payne
          Posted Oct 22, 2008 at 10:31 PM | Permalink

          Re: RomanM (#46),

          However, they must first be able to hindcast in a sufficiently accurate fashion at the same time demonstrating they are reacting in an appropriate fashion to changes in the various input factors for us to have any faith in their future ability.

          I agree completely.

          I don’t think we’re all that far apart on the general principles. I found Beaker’s parallel Earths a useful concept for gaining a better understanding of the problem. YMMV.

        • Kenneth Fritsch
          Posted Oct 23, 2008 at 10:21 AM | Permalink

          Re: RomanM (#46),

          The only common sense way to assess the models is individually.

          I agree, Roman, and that is why I was hoping someone would apply Ross McKitrick’s suggested method to the data. His approach as I understand it would take into account the individual model results and the individual observation results and the results compared over the entire spectrum of temperature trends over the altitude range.

          I can, on the other hand, see Douglass et al. attempting to comply with their impression of the prevailing view on the handling of the ensemble of model results.

          One must also be careful to note that the claim of Santer et al. is that in bringing the new adjusted observed results into the calculations that the model and observed results become more compatible (or some such spin for future IPCC reviews). To me this is revealing in two ways:

          One, the authors recognize a difference in the model and observed results.

          Two, that the newer (and adjusted) observed results that show on some parts of the altitude range even higher tropospheric temperature trends than the models are what the authors use in their claim, but at the same time not discussing the fit of the model results and newer and adjusted observed results over the entire altitude range. I think a McKitrick-like analysis across the altitude spectrum would address this cherry picking tendency.

          The discussion here at CA, in my view, has skipped over what cancels when a difference or ratio for the tropical tropospheric to surface temperature trends is used. Would that include ENSO effects and/or the chaotic content of one of beaker’s parallel earths? Those effects, if not canceling, would be relevant to a statistical comparison whether one used averages of model ensembles or individual model results.

          Further we are losing the critical issue of the differences between the adjusted observed results both for MSU and radiosondes. Steve M uses the UAH MSU adjustments and perhaps with some good a priori reasons, but if one has no good preconceived basis these differences have to be considered in the analysis and discussions.

          If one looks hard at all the observed and model results (from the Santer and Douglass papers) and over the entire altitude range, I think one has to conclude that overall the picture is a mess and that, further, this area of climate science has a lot of work to do.

    • Lance
      Posted Oct 21, 2008 at 11:56 AM | Permalink

      Re: beaker (#2),

      …a test of a difference between the mean of the ensemble and the mean of the observations is not appropriate.

      I agree except that this is the metric by which various defenders of the IPCC’s climate models claim that these models have predictive skill.

  3. tolkein
    Posted Oct 21, 2008 at 10:04 AM | Permalink

    There’s a couple of questions.
    (1) What do the modellers claim the models are actually capable of?
    (2) How does the actual global climate compare with the models’ forecasts in the various IPCC reports? We’ve had 17 years since 1989 and a few since 2004, and a comparison showing how good the models were at forecasting would be very good; I don’t mean confidence intervals, I mean a visual look and see.

  4. Jeff Alberts
    Posted Oct 21, 2008 at 10:22 AM | Permalink

    I received an automated email saying that Santer was occupied at a workshop.

    Of course. It’s only a couple of months till Christmas. Yeah, yeah, I know it’s lame, but it seemed funny at the time. (And yes, the obvious joke about a collective name for Santer’s coauthors crossed my mind.)

    Sorry, but


    Steve: I thought that the workshop part made the joke rise slightly above the usual situation, notwithstanding beaker’s turning up his nose.

  5. Steve McIntyre
    Posted Oct 21, 2008 at 10:30 AM | Permalink

    #2. beaker, I didn’t use the ensemble mean at any point in the above calculations other than as a location point on the graph. Had Santer provided individual runs in a usable format, I would have used those. In the absence of Santer providing such information, I used the available information from Douglass Table 1. I’ll revisit the calculation if and when I get the info that IMO Santer should already have archived. (Yeah, yeah, the information is doubtless somewhere at PCMDI, but the exercise of replicating his collation is not germane to the present tests.)

  6. Lune
    Posted Oct 21, 2008 at 10:33 AM | Permalink

    This is because the ensemble mean is an estimate of the average tropical trend for a frequentist style hypothetical population of Earths, not the trend on a specific member of that population (i.e. the Earth we actually occupy).

    What’s the difference? The hypothetical population of Earths is of value only if it can provide us with meaningful information about the Earth we actually occupy. Presumably one must fit some kind of distribution curve to the hypothetical Earths in order for them to be predictive with respect to our actual Earth. (If they were generated randomly, that wouldn’t be very interesting.)

    Steve’s analysis says, if nothing else, that from the standpoint of the methodology that was used to generate all the hypothetical Earths, our actual observed Earth is ‘low’ probability. Certainly that tells us we need to better understand the assumptions underlying that methodology.

    • srp
      Posted Oct 21, 2008 at 4:51 PM | Permalink

      Re: Lune (#6),
      As far as the “philosophical” debate about the appropriate comparison statistics goes, I think this is the key point. Beaker’s parallel Earths thought experiment only works if we know the degree of variation (caused by sensitive dependence on initial conditions) of the various Earths with respect to the property we are trying to predict/explain.

      If the distribution of the property over the parallel Earths is very tight, then it is reasonable to treat the realized data on our earth as the benchmark to which any model or ensemble mean must converge. If not, then a looser standard should be applied. But simple hand-waving about variability due to dependence on initial conditions is not adequate for a quantitative prescription of test statistics.

  7. Craig Loehle
    Posted Oct 21, 2008 at 10:53 AM | Permalink

    To my mind there are two ways of looking at the model results. In the first, we want to see if the population of models looks like the population of data we have. This leads to various analyses, of which Steve’s in this thread is very nice. The second says the ensemble mean is the best estimate of model wisdom, so we use that as a SINGLE POINT ESTIMATE of what the models say about reality and compare that to the data (and it is outside the ci of the data by any possible test).

    • MC
      Posted Oct 21, 2008 at 2:12 PM | Permalink

      Re: Craig Loehle (#8), I haven’t read Santer in full yet but will do. I agree though with what you say in that the reference data (RSS, UAH) should be combined as a reference with errors and the model data combined as a reference with errors and the two compared. Then each model should be compared with the RSS/UAH etc data to see what model assumptions are agreeing with measurements. Obviously if both data sets have large errors (50% plus) then I would be less confident in overlap and would state the null hypothesis and ask for better resolved data. I normally would accept data with 10% error or less for a reasonable estimate between the two.

      More generally, Steve, this is a good post but I have to disagree with you, and with Jaynes for that matter. You both appear to have made the Central Limit Theorem assumption in that you regard a measurement x ± y as an expected value ± error. In a pure scientific sense (or more exactly in an empirical measurement sense) the convention of x ± y is used with the understanding that the ‘real’ or expected value has equal probability of lying anywhere within x - y to x + y. Some may disagree but this is making an assumption.
      Now how the errors y are decided can be a bit finger in the air but as long as assumptions and rationale are stated and the intention is to be as truthful and conservative as possible then the errors are accepted. There can be devil in some details admittedly but in general good scientific method should be clear enough. So for Jaynes’ example, I would have to say there is no difference because a) the uncertainty is large and b) there are only two data points. Get me more data with less uncertainty and then I’ll be able to resolve the issue better. Otherwise it is just a ‘gut feeling’ guess and no different from cherry picking proxies for that sports implement people like to talk about.

  8. Jeff Alberts
    Posted Oct 21, 2008 at 11:05 AM | Permalink

    We’ve had 17 years since 1989

    I’m no math whiz or anything, but I’m pretty sure it’s been 19 years since 1989 😉

    • lucia
      Posted Oct 21, 2008 at 11:18 AM | Permalink

      Re: Jeff Alberts (#9),

      I’m no math whiz or anything, but I’m pretty sure it’s been 19 years since 1989 😉

      What’s that in model years? (It appears the answer may be that we’ve had 10 years since 1989!)

      • Jeff Alberts
        Posted Oct 21, 2008 at 11:45 AM | Permalink

        Re: lucia (#10),

        Well, as stated it said “we’ve had 17 years since 1989”, not 17 model runs. 😉

    • Dave Dardinger
      Posted Oct 21, 2008 at 11:25 AM | Permalink

      Re: Jeff Alberts (#9),

      Jeff & others,

      We now have the handy “reply and paste link” below the displayed comment number. I’d urge everyone to use it regularly as it’s a savings in comment construction time and of real value to the reader as well. In this case it saves having to look at several messages to see who used “17”. In other cases it will take you to the proper prior message even if a message has been deleted in between. If used regularly it allows a person to easily go back through a thread of responses at the click of a button.

  9. Jedwards
    Posted Oct 21, 2008 at 11:29 AM | Permalink

    Beaker, I think I see what you are getting at. Steve’s calculation of the probability distribution Pn(A) for the MSU Tropical Trends appears to be a correct application of Jaynes 1976, but the comparison should be against the probability distributions Pn(B) of each model run individually in order to give a true apples to apples comparison. In other words, the methodology that Steve has illustrated here may be a reasonable qualitative test for the ability of individual models to properly hindcast the “real Earth”. Craig’s point in #8 appears to get to the crux of the problem in that the ensemble mean is used “politically” by the IPCC as a single point estimator without regard (apparently) to the CI of the individual model runs which have their own distributions. So what we really need are the distributions of those individual runs, and we could then determine whether there are any of those models which are better/worse at hindcasting, which would then give us “confidence” in that particular model’s ability to forecast.

    Steve: I think that the “posterior” for the MSU trends is probably OK, but I’m going to re-work the “posterior” for the model trends.

    • Craig Loehle
      Posted Oct 21, 2008 at 11:43 AM | Permalink

      Re: Jedwards (#12), I agree that the multiple runs of a single model make a proper ensemble. It is not clear, as others have stated, that it makes any sense to combine the output of a bunch of models and call this an “ensemble” but that is what IPCC does.

  10. Posted Oct 21, 2008 at 11:58 AM | Permalink

    It’s a great post. I love to read the most. The graphical represent is not bad at all. It’s simply the great. Thanks for sharing this in your post with us.

  11. SteveSadlov
    Posted Oct 21, 2008 at 12:11 PM | Permalink

    Probably the first time someone has applied these several-decades-old, completely mainstream quality methods to the models. The modelers are unlikely to have encountered their practical use previously, since most of them never worked in manufacturing.

    • Jedwards
      Posted Oct 21, 2008 at 1:07 PM | Permalink

      Re: SteveSadlov (#18),
      I’m actually in QA, and this method makes more sense than anything else I’ve seen (maybe because we actually use it,…a lot).

  12. tolkein
    Posted Oct 21, 2008 at 1:14 PM | Permalink

    re 9. Whoops! Slipped when typing, honest. 19 years since 1989. But the point is, how did the models do compared to actual? A visual would do best.

  13. Henry
    Posted Oct 21, 2008 at 3:29 PM | Permalink


    You are going to have to be very careful with language if you delve into Bayesian statistics: you lose phrases like “confidence interval” and “statistical significance”; instead you get things like “credible interval” and “this decision is better than that”. And while Jaynes wrote in an attractive style, he was often pressing an unconventional line (even among Bayesians), for example with his determination to use “maximum entropy”.

    You will even have to consider again your conclusion “there is a 92% probability that a model trend will exceed the true observed trend”. Since the actual trend has turned out lower than the model, there is little surprise there, but what you need to deal with is how big the difference probably is (if you were absolutely certain that the actual trend was exactly 0.0001° less than the model then the probability would be 100%, but the difference would not be seen as substantial enough to care about, since any loss associated with the error would be minimal); Bayesian methods provide a way into this, but the price to pay is that you are no longer talking the same language as those you are auditing.

  14. Steve McIntyre
    Posted Oct 21, 2008 at 4:14 PM | Permalink

    #22. This sort of enterprise is more prone to problems than working with the proxies. The stalemate between Santer and Douglass does seem frustrating and sometimes it’s a little fun experimenting.

  15. sky
    Posted Oct 21, 2008 at 5:29 PM | Permalink

    SUMMUS SUMMARUM: We are interested in the collective skill of models in estimating the actual observations. While the prospect of some model (or individual run) showing some skill cannot be statistically foreclosed, their pitiful lack of skill in the mean is unmistakable, no matter what statistical formalism is applied.

  16. Nick Moon
    Posted Oct 21, 2008 at 5:37 PM | Permalink

    Usually, I reckon I can follow the postings on CA. And usually I can just about keep up with the maths. But this posting makes me realise I know far too little statistics. So on behalf of those of us with only average sized brains, can I ask a hypothetical question?

    Supposing a bunch of skeptics got themselves a few super-computers and came up with a few models for predicting global temperatures. And, further suppose, they ran these models over and over again to produce an ensemble of ensembles of different model runs. And now let’s suppose a really big suppose. Let’s suppose that the spread of the individual runs in this skeptics’ ensemble had pretty much the same variance/SD/error as the Santer ensemble. And say, in some eerily spooky way, that when Steve McI plotted a triangular point to represent the mean of the skeptics’ ensemble, it came out at around -0.1C, i.e. the same distance from the mean of the 2008 plot but on the opposite side.

    Now is there any mathematical technique which would allow one to say that this ensemble mean is more or less consistent with reality than the Santer one?

  17. Roger Pielke. Jr.
    Posted Oct 21, 2008 at 6:00 PM | Permalink

    You’ll find some interesting parallels to this discussion here:

    and here:

  18. Christopher
    Posted Oct 21, 2008 at 6:00 PM | Permalink

    Steve, NOAA has a model ensemble walkthrough with sources. Might be fruitful?

  19. Alan Wilkinson
    Posted Oct 21, 2008 at 6:12 PM | Permalink

    I am still stuck on my original issue with beaker’s parallel earths. Although there may be an infinite number of them, the only ones which can conceivably model and predict our own earth’s future climate are ones that are compatible(?) with the observed data on this earth.

    The others (including non-complying runs) must all be excluded from IPCC prediction methods.

    Otherwise, I am entirely in agreement with the strategy of focussing on the real observed data and measuring the models’ ability to match it rather than the absurdity of using model scatter as an input to judging compatibility.

  20. Dishman
    Posted Oct 21, 2008 at 6:13 PM | Permalink

    It seems to me that at least some of the model runs within this ensemble have been falsified.

    I’m curious whether IPCC or anyone else puts any effort into investigating why.

    Are we running open-loop here, or is feedback from reality taken into consideration?

    • Pat Keating
      Posted Oct 22, 2008 at 8:37 AM | Permalink

      Re: Dishman (#31),

      or is feedback from reality taken into consideration?

      No, if the reality doesn’t agree with model results, reality must be “adjusted”.
      The feedback is in the other direction.

  21. Christopher
    Posted Oct 21, 2008 at 6:19 PM | Permalink

    And this looks promising.

    Sorry for the linkfarm-esque posts but I’ve been looking at this for another reason and simply wanted to share.

  22. James Smyth
    Posted Oct 21, 2008 at 8:03 PM | Permalink

    I received an automated email saying that Santer was occupied at a workshop. Of course. It’s only a couple of months till Christmas. Yeah, yeah, I know it’s lame, but it seemed funny at the time. (And yes, the obvious joke about a collective name for Santer’s coauthors crossed my mind.)

    I don’t get it. Are you saying they are short, have pointy ears and do all the work; or that they have four legs and antlers and pull him around?

  23. Paul29
    Posted Oct 21, 2008 at 9:37 PM | Permalink

    Has anyone treated the standard deviation of the model runs as an estimate of the standard error of the average model (since each model run provides an estimate of the average)? To get an estimate of the population’s standard deviation, multiply this standard error by the square root of “n”.

    Could this standard deviation and the ensemble average be used to define the population distribution that can be compared to actual data’s distribution?

  24. Steve McIntyre
    Posted Oct 21, 2008 at 10:03 PM | Permalink

    I re-did this calculation assuming a normal posterior (?) distribution for the model trends, centered on my estimate of the mean of the model trends (0.208) with the sd of the model trends (0.091). This yielded the following probabilities that the model trend exceeded the observed trend. I think that the posterior distribution may be a t-distribution of some sort, but I’d be surprised if the results varied a whole lot from those obtained under the assumption made here.

    # 1999 2004 2008
    #0.858 0.840 0.904
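
    A minimal sketch of the kind of calculation described above, assuming both the model trend and the observed trend are treated as independent normal distributions. The model mean (0.208) and sd (0.091) are the values quoted in the comment; the observed trend and its sd are hypothetical placeholders, since they are not given here, and the function name is illustrative only. This is not necessarily the exact procedure used to produce the figures above.

```python
import math


def prob_model_exceeds_obs(model_mean, model_sd, obs_mean, obs_sd):
    """P(model trend > observed trend) for two independent normal distributions.

    The difference of two independent normals is normal with mean
    (model_mean - obs_mean) and variance (model_sd**2 + obs_sd**2).
    """
    diff_mean = model_mean - obs_mean
    diff_sd = math.sqrt(model_sd ** 2 + obs_sd ** 2)
    # Normal CDF evaluated at diff_mean / diff_sd, written with the error function.
    return 0.5 * (1.0 + math.erf(diff_mean / (diff_sd * math.sqrt(2.0))))


MODEL_MEAN = 0.208  # deg C/decade, quoted in the comment
MODEL_SD = 0.091    # deg C/decade, quoted in the comment
OBS_MEAN = 0.10     # hypothetical observed trend, for illustration only
OBS_SD = 0.05       # hypothetical sd of the observed trend, for illustration only

p = prob_model_exceeds_obs(MODEL_MEAN, MODEL_SD, OBS_MEAN, OBS_SD)
print(round(p, 3))
```

    Swapping the normal for a t-distribution would change the tail weights somewhat, which is the sensitivity the comment speculates about.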

    • eilert
      Posted Oct 22, 2008 at 4:47 AM | Permalink

      Re: Steve McIntyre (#35), Thanks Steve. I like your approach to analysing the data far better, since it is designed to highlight the differences rather than hide them behind probability values, which seem more designed to obfuscate a problem. From Fig. 2 we can clearly see that there is a systematic shift between the two data sets. If I were a scientist modelling such data, I would be much more interested in what these shifts represent, in order to actually improve my modelling.

  25. eilert
    Posted Oct 22, 2008 at 2:43 AM | Permalink

    Your green (2008) curve in Fig. 1 shows a tighter distribution (the base is narrower), which would mean that the curve fitted to the data has fewer outliers. Do you perhaps have a plot of that curve? It should be interesting to see how the trends have evolved with time.

    • Alan Wilkinson
      Posted Oct 22, 2008 at 5:17 AM | Permalink

      Re: eilert (#36),

      No, it means the line is longer thereby giving a more certain definition to the trend (slope) even if the scatter around the trend line remains similar.
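
      A rough sketch of this point, assuming ordinary least squares on evenly spaced monthly data with uncorrelated residuals (which ignores the AR1 autocorrelation discussed in the post). The residual sd and the record lengths are hypothetical values chosen only for illustration, and the function name is not from the post.

```python
import math


def slope_standard_error(n_points, residual_sd):
    """Standard error of an OLS slope for n evenly spaced points.

    Assumes uncorrelated residuals with a known standard deviation:
    se = residual_sd / sqrt(sum((t - mean(t))**2)).
    """
    t_mean = (n_points - 1) / 2.0
    sxx = sum((t - t_mean) ** 2 for t in range(n_points))
    return residual_sd / math.sqrt(sxx)


RESIDUAL_SD = 0.2  # hypothetical scatter about the trend line, held fixed

# Hypothetical record lengths in months, loosely standing in for
# the three endpoints (1999, 2004, 2008) of a record assumed to start in 1979.
for n_months in (252, 312, 360):
    print(n_months, slope_standard_error(n_months, RESIDUAL_SD))
```

      With the scatter held fixed, the standard error of the slope falls as the record lengthens, which narrows the base of the distribution.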

  26. Clark
    Posted Oct 22, 2008 at 8:57 AM | Permalink

    I still don’t see how one can use an “ensemble” of models, and their mean or variance for anything. These measurements are completely dependent on the choices one makes in what models to include, and how many runs of each.

    Variance too small to include current observations? Simply include a few more runs of the high- and low-end models. Presto – larger uncertainty ranges. Mean not trending high enough? Add in a few models with strong positive feedback. Oh look, the mean is now higher.

    The ensemble values say more about the ensembler than the predictive power of any individual model.

    • JamesG
      Posted Oct 22, 2008 at 10:29 AM | Permalink

      Re: Clark (#40),
      Douglass and Gavin Schmidt clashed about the reasoning behind the use of an ensemble on Matt Briggs’s blog and Gavin’s response was simply an Italian phrase meaning “it just works”. Hence I think that’s admitting that there isn’t any actual theory behind it. It was just a discovery that they couldn’t match any individual model to the surface temperatures but they could get a match using an ensemble. Which is really quite cheeky of them. Since you clearly don’t get a match at the troposphere then I guess it actually “just doesn’t work” so the practice should be abandoned henceforth. But nobody in the IPCC worries about that too much I’m sure. Plenty of time for more data adjustments before the next report.

      • Pat Frank
        Posted Oct 22, 2008 at 11:18 AM | Permalink

        Re: JamesG (#41), Reading that debate, I got the feeling that Gavin used the Italian phrase to analogize himself to Galileo, who allegedly said, concerning Earth, “Eppur si muove (Yet it moves)” as he was threatened by the Inquisition. Gavin apparently likens himself to a martyr for science. What an irony. Some time ago, I watched the Iq^2 debate on global warming, here, featuring Gavin on one side and Richard Lindzen on the other. Of the six principals in that debate, only Gavin accused his opponents of lying. Some martyr.

  27. tolkein
    Posted Oct 22, 2008 at 2:59 PM | Permalink

    Am I understanding this correctly? The IPCC central tendency is not just the mean of lots of runs of the same model but averages of runs of lots of different models, each with different assumptions.

    So how do we (relatively simply) flex the model forecasts so as to see what happens if assumptions change? I can’t see the financial regulators going a bundle on this if I was trying this approach to justify my capital requirements under Basel 2, so how can this approach be valid for spending $trillions on AGW mitigation?

    How does this make any sense?

  28. Dave Andrews
    Posted Oct 23, 2008 at 4:52 AM | Permalink

    There’s an interesting discussion of climate models, involving Gavin and others, over a period of months at:-

    • Willis Eschenbach
      Posted Oct 23, 2008 at 8:06 PM | Permalink

      Re: Dave Andrews (#49), thanks for the links to the discussion. On the page you cited, Gavin Schmidt says:

      So how should one interpret future projections from climate models? I suggest a more heuristic approach. If models agree that something (global warming and subtropical drying for instance) is relatively robust, then it is a reasonable working hypothesis that this is a true consequence of our current understanding. If the models fail to agree (as they do in projecting the frequency of El Niño) then little confidence can be placed in their projections. Additionally, if there is good theoretical and observational backup for the robust projections, then I think it is worth acting under the assumption that they are likely to occur.

      This is statistical handwaving of the highest order. How can models agree that something is “relatively robust”? To start with, what on earth does “relatively robust” mean? How would we measure it? Is Santer showing a “relatively robust” result? Do the Santer models “agree”? How would we know if they do or don’t?

      He says that if the results are “relatively robust” (whatever that means), that a reasonable hypothesis is that this is a “true consequence of our current understanding.” As opposed to a “false consequence”? … I think that he is trying to say that if the models agree, the programmers of the models likely agree … but so what? Why should we pay a lot of attention to which climate principles computer programmers might happen to agree on?

      The fact that two models agree, or two hundred models agree, does not make a result “robust” in any sense of the word. It only reflects the knowledge, ideas, biases and errors of the computer programmers who wrote the models. It has no more weight than having two hundred fundamentalist Christians tell me that the Earth is only 6,700 years old. Yes, the result with the Christians is “relatively robust” … but that does not mean that it is right.

      Let me restate the problem: we have no general theory of climate. We don’t know what all the feedbacks, forcings, and inter-relationships are. We don’t know if the earth has an “equilibrium” temperature, and if so, what mechanisms might maintain it. We don’t know many of the feedbacks’ size, and in some cases we don’t even know their sign. We are trying to model the most complex system we have ever attempted to model, and we have not been at it for long. All of the models have hideous errors which are covered up by parameterization.

      Given all of that, even “relatively robust” doesn’t mean anything except that the models agree. But do they agree because they are correct, or because they are all built on a common base of incorrect assumptions?

      As a man who has made models of varying complexity for a variety of systems including the earth’s climate, I strongly support choice B, that they are built on common, incorrect assumptions. For example, most models see increasing clouds as warming the earth. To me, that’s nuts, that’s an incorrect assumption shared by most models. It doesn’t even accord with our everyday experience. We all know that a cloud passing overhead on a hot day cools us more than the same cloud passing overhead warms us at night. In addition, a cloud near the horizon can block the sun entirely, but it has almost no effect on the IR radiation. In short, clouds cool the earth more than they warm the earth.

      However, most of the models agree on this positive cloud feedback … does that make it a “robust result”, or merely a “robust error”?

      For me, the fact that the most recent, up-to-date you-beaut models give results that are no better than the models of 20 years ago means that the modelers are on a fundamentally wrong path …


      PS – he says that if “future projections from climate models” have good “observational backup”, then we should do something or other … but me, I got lost at the part where he claims to have “observational backup” for future projections … me, I find it a bit difficult to make observations in the future, but then I don’t work for NASA, maybe it is rocket surgery after all …

      • henry
        Posted Oct 25, 2008 at 10:55 PM | Permalink

        Re: Willis Eschenbach (#56), I think the “observational backup” he’s referring to is the list of observations that all show the AGW “result”: the melting glaciers, the shrinking sea ice, the change in growing seasons – you know, “the list”.

        They seem to pull out “the list” every time one of the observations doesn’t show what they expected.

        • Alan Wilkinson
          Posted Oct 26, 2008 at 2:36 AM | Permalink

          Re: henry (#58),

          Except that “the list” may suggest GW but cannot distinguish, measure or imply AGW.

  29. Martin Lewitt
    Posted Oct 23, 2008 at 7:10 AM | Permalink

    > On this basis, there is a 91.6% probability that the model trend will exceed the observed trend.

    > I leave it to climate philosophers to decide whether this is “significant” or not.

    The IPCC has spoken on this issue in the AR4. For most of their conclusions they used 90% as “very likely”.

  30. ED
    Posted Oct 23, 2008 at 9:29 AM | Permalink

    Attached is a link to an article about a substantial attempt to eliminate an area of uncertainty in the Climate models. “Scientists Go Cloud-hopping In The Pacific To Improve Climate Predictions” linked at

    What type of data will they be able to collect to determine how a cloud forms and how reflective it is?

  31. Craig Loehle
    Posted Oct 23, 2008 at 1:44 PM | Permalink

    Teacher: What are all these answers to my questions? And none of them are right!
    Student to teacher: I applied different computational techniques and got a range of answers. The variance in the answers is such that you can not preclude an overlap in their distributions. Therefore I pass the exam.

  32. Bob North
    Posted Oct 23, 2008 at 1:59 PM | Permalink

    Craig — I think you just summarized the viewpoint of many of the modellers perfectly.

  33. BarryW
    Posted Oct 23, 2008 at 7:22 PM | Permalink

    Four modelers were playing golf. On a par 3 hole the first hit his ball even with the hole but ten feet to the right. The second hit it even with the hole but ten feet to the left. The third hit his ten feet past the hole but dead in line, while the last hit his ten feet short of the hole but also in line. They all started cheering and each marked his score card with a hole in one, since the average of the ball’s locations was in the hole.

  34. Jeff Norman
    Posted Oct 25, 2008 at 11:02 AM | Permalink


    Excellent post! As an engineer I understand this practical QA approach better than the more hypothetical discussions.

    Regarding your Santer workshop comment. This would go a long way in explaining why the “climate community” appears to ignore what’s happening in the Antarctic/South Pole.

    Thanks yet again.

