Deciding Which Runs to Archive

Have any of you seen any articles discussing which model runs are archived? It doesn’t appear to me that all model runs are archived. So what criteria are used to decide which model runs are archived by the modelers at PCMDI? (This is a different question than IPCC selections from the PCMDI population.) We’re all familiar with cherrypicking bias in Team multiproxy studies e.g. the addiction to bristlecones and Yamal. It would be nice to think that the PCMDI contributors don’t have a corresponding addiction.

Figure 1 below shows the number of 20CEN runs in the Santer collection of 49 20CEN runs. A few models have 5 runs (GISS EH, GISS ER, NCAR CCSM, Japan MRI), but many models only have one run.

Figure 1. Number of Runs (49) by Model for Santer 20CEN Population

PCMDI now has 81 20CEN runs (KNMI – 78), but the distribution has become even more unbalanced with much of the increase coming from further additions to already well-represented models e.g. NCAR CCSM.

Figure 2. KNMI 20CEN Runs (78) by Model

It’s hard to envisage circumstances under which a modeling agency would only have 1 or 2 runs in their portfolio. Modeling agencies with only one 20CEN run include: BCCR BCM2.0, Canadian CGCM 3.1(T63), CNRM CM3, ECHAM4, INM CM3.0, IPSL CM4 and MIROC 3.2 (hi-res). Modeling agencies with only two archived 20CEN runs include the influential HadCM3 and HadGEM1 models. Surely there are other runs lying around? Why are some archived and not others?

The non-archiving impacts things like Santer et al 2008. One of the forms of supposed uncertainty used by Santer to argue against a statistically significant difference between models and observations is autocorrelation uncertainty in the models. While we are limited on the observation side by the fact that we’ve only got one earth to study, a few more available runs of each model would do wonders in reducing the supposed uncertainty in model trends. Santer should probably have 1) thrown out any models for which only one run was archived; 2) written to the modeling agencies asking for more runs; 3) included a critical note against non-archiving agencies in his paper (though I’m led to believe by reviewers that such criticisms would be “unscientific”.)
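
To illustrate the point about extra runs, here is a minimal sketch, not Santer's method. It assumes each run is a linear trend plus AR(1) noise, and every name and parameter value (ar1_series, phi, sigma, the 240-step length) is made up for illustration. It shows how the scatter of an ensemble-mean trend shrinks as more independent runs of the same model are averaged.

```python
import numpy as np

def ar1_series(n_steps, phi, sigma, rng):
    """AR(1) noise: x[t] = phi * x[t-1] + e[t], e ~ N(0, sigma^2), x[0] = 0."""
    x = np.zeros(n_steps)
    shocks = rng.normal(0.0, sigma, n_steps)
    for t in range(1, n_steps):
        x[t] = phi * x[t - 1] + shocks[t]
    return x

def ols_trend(y):
    """Least-squares slope of y against its time index."""
    t = np.arange(len(y))
    return np.polyfit(t, y, 1)[0]

def spread_of_ensemble_mean_trend(n_runs, n_trials=500, n_steps=240,
                                  true_trend=0.002, phi=0.8, sigma=0.1, seed=0):
    """Std. dev., across trials, of the mean trend of an n_runs ensemble.

    All defaults are illustrative assumptions, not fitted to any model.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    means = []
    for _ in range(n_trials):
        trends = [ols_trend(true_trend * t + ar1_series(n_steps, phi, sigma, rng))
                  for _ in range(n_runs)]
        means.append(np.mean(trends))
    return float(np.std(means))

if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, spread_of_ensemble_mean_trend(n))
```

For independent runs the spread should fall roughly as one over the square root of the number of runs, which is why a singleton tells you so little about the model as a model.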

Here’s another interesting scatter plot illustrating an odd relationship between trend magnitude (for each model) and trend standard deviation (for each model). This is done only for multi-run models – as standard deviation is obviously not defined for singletons.

Figure 3. Santer Population : Trend standard deviation (by model) versus mean trend (by model).

The above relationship is “significant” in statistical terms. But why should there be a relationship between the mean stratified by model and the standard deviation stratified by model? I’ve had to scratch my head a little to even think up how this might happen. I think that such a relationship could be established by a bias in favor of inclusion of (shall we say) DD trends relative to their less endowed cousins.
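
For anyone wanting to reproduce this sort of scatter plot, the calculation can be sketched as follows. The trend values below are invented placeholders, NOT values from the Santer collection, and the function name per_model_stats is hypothetical; the sketch only shows the bookkeeping (drop singletons, take mean and sample standard deviation by model, then correlate the two).

```python
import numpy as np

def per_model_stats(trends_by_model):
    """Return (names, mean trend, sample SD) for models with at least 2 runs."""
    names, means, sds = [], [], []
    for name, trends in trends_by_model.items():
        if len(trends) < 2:  # SD is not defined for singletons
            continue
        names.append(name)
        means.append(np.mean(trends))
        sds.append(np.std(trends, ddof=1))  # ddof=1 gives the sample SD
    return names, np.array(means), np.array(sds)

# Made-up trends for illustration only
example = {
    "model_A": [0.10, 0.14, 0.12],
    "model_B": [0.25, 0.35, 0.20, 0.30],
    "model_C": [0.18],            # singleton, excluded
    "model_D": [0.05, 0.07],
}

names, means, sds = per_model_stats(example)
r = np.corrcoef(means, sds)[0, 1]  # correlation of SD with mean, by model
print(names, r)
```

With only a handful of multi-run models, any such correlation rests on very few points, which is another reason to want the full inventory of runs.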

Or perhaps there’s some mundane reason that would trouble no one. Unfortunately, IPCC doesn’t seem to have established objective criteria requiring modeling agencies to archive all their results and so, for now, I’m left a bit puzzled. For the record, I’m not alleging such a bias on the present record. But equally it is entirely legitimate to ask what the selection criteria are. Not that I expect to get an answer.


  1. pjm
    Posted May 17, 2009 at 4:54 PM | Permalink

    Is there any correlation between the number of runs and the trend?

  2. Posted May 17, 2009 at 4:54 PM | Permalink

    Maybe some agencies did only perform one run. Time and funding could affect the number. Still, given the constant rhetoric about how one of the primary benefits of models is the ability to average out noise, you would think each agency would want to run at least 3 cases for the 20th century. Otherwise, how do you really do attribution– particularly with some of the extremely noisy models?

  3. Posted May 17, 2009 at 5:37 PM | Permalink

    Why does it matter that models are well represented? There is no logical reason to believe the suite of models by organization should bear any relationship to the set of possible climate states. Variation of parameters, in the matter of Stainforth et al. 2005, makes some sense of keeping multiple runs at least. What is the point of a heap of ill-characterized model runs, except for documentation purposes?

    • Steve McIntyre
      Posted May 17, 2009 at 5:51 PM | Permalink

      Re: David Stockwell (#3),

      David, we’re not talking about Stainforth variation of everything, but more than one simulation with 20CEN standard forcing. Otherwise, it’s like using one red noise series.

      Remember the arid discussions last fall about “consistency” between models and observations where things were supposedly vindicated if one model spanned the observations, even if the vast majority don’t.

      In the case at hand, the “spanning model” is a run of INM CM3.0 – one of the singleton models. It’s not like it behaves much differently than other models in A1B, but was a bit noisy in the end 20th.

      I’d be amazed if the model still spanned observations if it was represented by three runs instead of one – if the model is characterized by its properties as a model, rather than an individual run.

      So, no, I don’t think that there’s any reason not to have a proper inventory of 20CEN runs in order for a model to be qualified for inclusion in studies. Quite the opposite. I’d be inclined to boot the models without an adequate inventory.

  4. Geoff Sherrington
    Posted May 17, 2009 at 7:53 PM | Permalink

    The next fascination will be the reactions of modelling teams.

    Those who partook in earlier round robins would have reasonably examined why their means did not agree with the ensemble mean. Was more tweaking required? Were actual errors found?

    Now that there are more runs, the ensemble mean will probably have shifted. Is this going to cause more tweaking of knobs, more looking for errors?

    Is this process a continuing loop that converges on high agreement with a very low s.d.? If not, then why do it? If so, then what does the result mean, because it does not seem to be reflecting reality as we understand it.

  5. Posted May 17, 2009 at 8:02 PM | Permalink

    Perhaps if positive feedbacks are magnified the noise level goes with it – higher trend higher sd. Just because people like me assume positive feedbacks in the models will be highly dampened (long gradual response) doesn’t mean that is the case.

    I just took another step on the Mann08 hockey stick and came close to replicating the infilling by RegEM on the 1209 proxies.

    Steve: This is excellent work that you’re doing.

  6. AnonyMoose
    Posted May 17, 2009 at 9:58 PM | Permalink

    The models have to be run several times for testing. Surely they don’t get it right the first time. Surely the models run quickly enough that they can be run several times during testing, and several times for results. Do the organizations which archive these results publish papers based on a single run of their tool?

    Steve: Dunno. Would be interested in any answers.

  7. richard
    Posted May 18, 2009 at 1:03 AM | Permalink

    Have you any evidence that there are unarchived model runs? They are expensive to generate (BCM can do a few years per day), and for AR4 had to be ready by a certain date. After that date, there would be little incentive to do more runs as it would preclude other experiments – see for example

  8. James Lane
    Posted May 18, 2009 at 2:30 AM | Permalink


    It’s not an accusation, it’s a question. Are all the runs archived?

  9. Lars Kamél
    Posted May 18, 2009 at 3:00 AM | Permalink

    My guess at the criteria used to decide which model runs NOT to archive:
    1. If a model shows cooling, there must “certainly” be something wrong with it. Thus don’t archive the result and change the code until the “correct” result is obtained.
    2. If a model shows warming, but not “enough”, use the same procedure as in 1.

    • David Cauthen
      Posted May 18, 2009 at 4:01 AM | Permalink

      Re: Lars Kamél (#10),

      3. Budget constraints prevent us from archiving.

  10. Allen63
    Posted May 18, 2009 at 5:17 AM | Permalink

    Wow, quite an observation. I had not realized that “single runs” of models were considered adequate for any purpose (other than model development).

    In any case, use of ensembles has always seemed “wrong”. The more models in the set, the wider the confidence interval based on the set — and a better chance that actual data will fall within the ensemble confidence interval (I also question how they justify the confidence interval calculation as having physical relevance).

    As others have noted, there should be objective criteria to select models that are the “best performers” (whatever that means). Then work should move forwards with those — if money for model runs is an issue.

  11. Craig Loehle
    Posted May 18, 2009 at 9:21 AM | Permalink

    I do believe that if the base model gets cooling they consider it “drift” and discard the result as a fluke — though how you can consider it a fluke and not a bug escapes me.

    • Steve McIntyre
      Posted May 18, 2009 at 11:02 AM | Permalink

      Re: Craig Loehle (#13),

      In the effort to do a large number of HadCM runs a couple of years ago, I recall that a noticeable fraction of models suffered from equatorial cooling or something like that and were discarded. As I recall, this was not made very clear in the Nature report – I remember noticing this a couple of years ago, but didn’t write about it at the time.

      Does this sort of “reasonable” screening bias results? Dunno. In multiproxy studies, correlation picking biases towards HS, but the circs are different in this case. Screening is not necessarily a bias, but it’s the sort of information that is very relevant to a statistical analysis.

    • Joe Crawford
      Posted May 18, 2009 at 3:12 PM | Permalink

      Re: Craig Loehle (#14), Many years and many computer generations ago, the hardware for most scientific computers contained either no error checking or at best, error checking in some of the memory circuits. One gamma ray in the wrong spot, at the wrong time, and you got 2 + 2 = 5 or some other strange result. With some of the old computers, it was not unusual to have to run a program several times to get the “right” answer, especially if it took a long time (e.g., many hours to many days) to execute. There was even a study done for the Air Force that found that a PDP-11 at 30,000ft would get something like one undetected error for every 8 hours of running. Many researchers developed the habit of “if the answer doesn’t look right, just rerun the program.” Or, more often, if a few points don’t fit the curve drop them from the data so you can get the right answer. Only if they started occurring “too” often (“too” in this case was purely subjective) would they bother to look at the code for errors. Otherwise they were just a “fluke” and not worthy of investigation.

      I don’t know the status of the current generation of systems, but I imagine a lot of researchers, brought up with the older systems, still have the above attitudes ingrained in their work habits. These attitudes appear endemic in Climate Science.


  12. Posted May 18, 2009 at 10:11 AM | Permalink

    I think “drift” is not necessarily a bug. All models will “drift” initially. This happens because the modelers can’t know the quasi-steady state solution for the earth’s climate before they run the model. To use an over-simplified example, suppose a mechanical engineer wants to find the steady state solution for the room temperature of my cellar when the furnace is running at some level of power, the outdoor air is at some temperature etc. To figure this out, they first take a SWAG (i.e. guess) at the cellar temperature, then run a computer program that marches forward in time.

    Eventually, the temperature in their “model” reaches a steady state. (If their models and forcings work, it will match my cellar temperature after we install the furnace and do the experiment at a later date.)

    Model “drift” seems to be the description for what happens between the first guess and the time when the modelers finally reach a steady state.
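
    As a rough sketch of the cellar example: the lumped heat balance, the parameter values and the function name spin_up_cellar are all illustrative assumptions, not anything taken from a climate model. The point is only that different first guesses march to the same steady state, and that the stretch in between is the "drift".

```python
def spin_up_cellar(t_guess, power=2000.0, t_out=0.0, k=100.0,
                   heat_cap=5.0e5, dt=60.0, tol=1e-6, max_steps=1_000_000):
    """March a one-box heat balance forward until the temperature stops changing.

    Assumed model: heat_cap * dT/dt = power - k * (T - t_out),
    whose steady state is T = t_out + power / k.
    Returns (final temperature, number of time steps taken).
    """
    temp = t_guess
    for step in range(max_steps):
        d_temp = dt * (power - k * (temp - t_out)) / heat_cap  # forward Euler step
        temp += d_temp
        if abs(d_temp) < tol:  # change per step is negligible: call it steady
            return temp, step + 1
    raise RuntimeError("did not reach steady state")

if __name__ == "__main__":
    for guess in (5.0, 30.0):
        final, steps = spin_up_cellar(guess)
        print(guess, round(final, 3), steps)
```

    Both guesses end up at t_out + power / k (20 degrees with these made-up numbers); only the number of steps spent drifting differs.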

    Now, normally if this were an engineering problem, you test to see whether your temperature appears to be at steady state before reporting “steady state” results. But, with climate models, it appears they may do two things in parallel:

    1) Pick some year as year zero for the 20th century runs and hope that your models stopped drifting.
    2) Continue the control runs and monitor whether the climate continues to drift after the year they picked to kick off the 20th century runs.

    If the extension of the control runs after the year they used to kick off the 20th century runs shows “drift”, then that means they should have run the control run longer. But, they didn’t. So… now what? Either they a) discard the runs initiated before the control run settled down or b) ‘correct’ the results from runs initiated before the control runs settle down. Does any method of correction really work? Who knows?

    I know what you are going to ask: If the control run was still drifting, can we know whether the drifting would have ever stopped? Beats me. It’s much better to just stop. Other questions: How do they diagnose drift? Can they distinguish it from a hypothetical 500 year super-mega Medieval Warm Period/Maunder Minimum oscillation caused by mystery model physics? Beats me.

    Ideally, what they should be doing is running the control until they confirm non-drift. Then, running it further and picking off the years to use as the initial conditions for the 20th century runs. I suspect budgetary constraints and reporting milestones preclude this. Modelers from other agencies may be sympathetic to the difficulties since they all share them. But… it’s still a problem.
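
    One simple way to "confirm non-drift" could be sketched like this. It is a hedged illustration only: the linear-trend test, the 100-year window, the 0.05 degrees per century tolerance and the function names are all assumptions, not a description of what any modeling group actually does.

```python
import numpy as np

def drift_per_century(control_temps, window_years):
    """Least-squares trend over the last window_years of an annual control series."""
    tail = np.asarray(control_temps[-window_years:], dtype=float)
    years = np.arange(len(tail))
    slope = np.polyfit(years, tail, 1)[0]  # degrees per year
    return 100.0 * slope

def has_settled(control_temps, window_years=100, tol_per_century=0.05):
    """True if the recent trend is below an (arbitrary, assumed) tolerance."""
    return abs(drift_per_century(control_temps, window_years)) < tol_per_century

# Synthetic control run relaxing toward 14 degrees (made-up numbers)
years = np.arange(1000)
control = 14.0 - 2.0 * np.exp(-years / 80.0)

print(has_settled(control[:200]))  # checked after 200 years
print(has_settled(control))        # checked after 1000 years
```

    A test like this cannot tell slow drift apart from a very long-period oscillation inside the window, so it does not answer the harder questions above; it only shows the kind of check that would have to pass before picking initial conditions.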

    • Posted May 18, 2009 at 10:33 AM | Permalink

      Re: lucia (#15),

      Thanks for this, I’ve never considered setup drift in models before.

    • Craig Loehle
      Posted May 18, 2009 at 11:21 AM | Permalink

      Re: lucia (#14), The models can drift hot and cold and odd (certain regions cold, say). If they don’t verify that drift has stopped, then they don’t know if some models might be drifting UP (warmer). In my research, I have needed to spin-up a forest growth simulator by running it for 1000 years before commencing my experiments, but by then it in fact had stabilized, and no further “drift” was going on. To run experiments on model simulations that still might not be dynamically stable (realistic) seems….odd.

  13. Posted May 18, 2009 at 11:49 AM | Permalink

    Yes. Models can drift in either direction.
    Yes, if the modelers don’t verify their control stopped drifting there is a problem.
    Yes. Kick starting transients off models that are still drifting is … odd. (That’s putting it kindly.)

    The only reason I can imagine why a reviewer lets people publish results of transients initialized over control runs that still drift is being overly sympathetic for shared difficulties. In reality, the group should discipline themselves to insist that all transients be initialized from controls that aren’t drifting.

    I’m just trying to explain that drift may not be caused by “bugs”. Bugs are generally code errors. The drift problem is the result of a bad decision to save time by not running the control as long as they should.

    • Craig Loehle
      Posted May 18, 2009 at 12:46 PM | Permalink

      Re: lucia (#18), When I mentioned bug, I had in mind runs that never stop drifting–never arrive at a stable climate that is reasonably close to actual. If they only drift in this sense “sometimes” this is still a bug to my mind. It indicates that the feedbacks are wrong.

  14. Posted May 18, 2009 at 1:07 PM | Permalink


    If they never stop drifting, there are serious problems! If they don’t end up anywhere near the planet earth temperature, that’s a problem also. (Finding the ‘steady’ temperature is a good reason to run until they stop drifting. I’m a bit perplexed that authors of the IPCC would include steady state earth solutions in the appendix of the IPCC if the models had not stopped drifting. But… there ya’ go.)

  15. Scott Brim
    Posted May 18, 2009 at 1:08 PM | Permalink

    Willis Eschenbach says this in the A1B and 20CEN Models thread in referring to model drift and natural variability:

    A1B and 20CEN Models (#23),
    Willis Eschenbach:
    Re: Jesper (#22), it’s not quite that bad. The “control runs” merely hold all of the inputs constant. They’re not eliminating “models that show climate change from effects other than CO2.” They are eliminating models that “drift” when the “external forcings” are held stable.
    My objection to that process is that we have little information on the natural variability of the earth in the absence of external forcings. From the claims of the AGW supporters, I deduce that “natural variability” is a) big enough to overpower any CO2 effects and to explain any model vagaries, and yet b) small enough to ignore at all other times.
    However, since we don’t know how much the earth might “drift” in the absence of changes in the known forcings, as you point out, the cutoff is both suspect and arbitrary. Curiously, they don’t require that the models give realistic temperatures, only that they don’t drift …


    I quote Lucia as saying the following in the Unthreaded – n+2 thread in referring to Gavin’s special graphic, one which he generated as part of the ongoing Schmidt-Monckton Kerfuffle

    Unthreaded – n+2 (#62)
    ” …. No precisely similar graphic appears in the AR4. Their projections in Figure 10.4 shows 1 standard deviation bounds based on the model mean temperature anomalies. The graph above [see the referenced post] and the one Gavin shows show larger uncertainty bounds based on the spread of “all weather in all models”. For whatever reason, the authors of the AR4 choose a graphic that suggests less uncertainty in their “prediction/projections”; now that the temperatures have been flat, Gavin prefers to show these larger ones …..”

    When Willis Eschenbach refers to “the natural variability of the earth in the absence of external forcings”, he is referring to something we might think of as being “systemic internal variability.”
    Lucia notes that the authors of IPCC AR4 set their modeled projections, as shown in Figure 10.4 of AR4, at 1 standard deviation bounds based on the model mean temperature anomalies, while Schmidt in his own graphic prefers wider bounds.
    I have to wonder if Mother Nature, in setting her own objectives for managing an increase in the earth’s global mean surface temperature, doesn’t also set a performance goal of 1 standard deviation process control error.
    Or perhaps Mother Nature really has no inclination towards raising the earth’s global mean surface temperature at all, but instead attempts to enforce a kind of Climatological Six Sigma Program to keep temperatures flat — a noble effort on her part at pursuing beneficial climate process control in mankind’s best interests, but an effort which unfortunately is being thwarted by mankind’s actions in relying so heavily on the carbon fuel cycle.

  16. Mark T
    Posted May 19, 2009 at 1:57 PM | Permalink

    His inset graph looks almost identical to scenario A2 on page 803 of the WG1 report, with a slight aspect ratio adjustment as he clearly states. It looks like a copy that was stretched to fit differently, in fact. Not sure what you and lucia are referring to, but it’s pretty easy to see.


  17. Mark T
    Posted May 19, 2009 at 2:24 PM | Permalink

    The stuff Lucy referred to regards the scenario A2 CO2 projections, on page 803 of WG1. Pretty straightforward use, probably with some scrape of the JPEG picture to get the numbers he used. As for what lucia is referring to, that is a different nut to crack.

