Does the Endpoint of Santer H2 "Matter"?


Perhaps the first thing that I noticed about this article was the endpoint for analysis of 1999 – this seemed very odd. I mentioned that a Santer coauthor wrote to me, saying that the endpoint didn’t matter relative to the Douglass endpoint of 2004. That turns out to be true, but why would anyone in 2008 use data ending in either 1999 or 2004? (This applies to both Douglass and Santer). There’s been lots of criticism over the use of obsolete data in controversial articles – so why was either side of this dispute using obsolete data?

The Santer SI contains a sensitivity study of the H1 hypothesis up to 2006. There’s been some discussion here about whether trends to 1999 could be extended to 2006 for comparison purposes, something that made sense to me; Santer et al took the same position. They state:

In the second sensitivity test (“SENS2”), we calculated observed trends in T2LT and T2 over the 336-month period from January 1979 to December 2006, which is a third longer than the analysis period in the baseline case. As in SENS1, we set s{bm} = s{bo}. Since most of the model 20CEN experiments end in 1999, we make the necessary assumption that values of bm estimated over 1979 to 1999 are representative of the longer-term bm trends over 1979 to 2006. Examination of the observed data suggests that this assumption is not unreasonable.

They observe that the longer record leads to a sharpening of CIs for observed trends (as we’ve discussed here) but report that this does not affect their H1 results:

Even with longer records, however, no more than 23% of the tests performed lead to rejection of hypothesis H1 at the nominal 5% significance level
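Why a longer record sharpens the CI on an observed trend can be illustrated with a minimal sketch, assuming the common lag-1 autocorrelation adjustment of the trend standard error (effective sample size n_e = n(1 - r1)/(1 + r1), residual variance computed with n_e - 2 degrees of freedom), which is the kind of adjustment Santer et al describe; their exact autocorrelation estimator may differ. The series is synthetic (a small trend plus AR(1) noise with an arbitrarily chosen phi = 0.6), not UAH or RSS data, and the function name adjusted_trend_se is hypothetical.

```python
import numpy as np


def adjusted_trend_se(y):
    """OLS slope of y against time (per step) and its standard error,
    inflated for lag-1 autocorrelation of the residuals via an effective
    sample size n_e = n * (1 - r1) / (1 + r1)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)  # highest power first
    resid = y - (intercept + slope * t)
    # Lag-1 autocorrelation of the regression residuals (one common estimator).
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    n_e = n * (1.0 - r1) / (1.0 + r1)
    # Residual variance uses n_e - 2 degrees of freedom instead of n - 2.
    s2 = np.sum(resid**2) / (n_e - 2.0)
    se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    return slope, se, r1, n_e


# Synthetic monthly anomalies (hypothetical, NOT real satellite data).
rng = np.random.default_rng(0)
n_long = 336  # same length as Jan 1979 - Dec 2006
phi = 0.6
noise = np.zeros(n_long)
for i in range(1, n_long):
    noise[i] = phi * noise[i - 1] + rng.normal(scale=0.1)
series = 0.0015 * np.arange(n_long) + noise

# 252 months = 1979-1999 length; 336 months = 1979-2006 length.
results = {m: adjusted_trend_se(series[:m]) for m in (252, 336)}
for m, (b, se, r1, n_e) in results.items():
    print(f"{m} months: trend {120 * b:.3f} +/- {120 * se:.3f} per decade "
          f"(r1={r1:.2f}, n_e={n_e:.0f})")
```

Because the standard error scales roughly with n^(-3/2) for a fixed noise level, adding a third more months shrinks it substantially even after the autocorrelation penalty.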

Later in the SI, they discuss several sensitivity tests for the H2 hypothesis, but, for some reason, they do not report on the impact of the SENS2 test on the H2 hypothesis – a rather surprising omission.

It’s completely trivial to do these calculations on up-to-date data. CA readers can reproduce the Santer Table III results, updated to the most recent UAH data, as follows:


By using current data, the value of the Santer d1 test statistic (a t-statistic) increases to 2.232 (from the 1.11 reported in their Table III), yielding the opposite conclusion on this point from the one reported in the article.

These results are obtained not by doing the tests in a different way that I happen to prefer, but by using the same methodology as Santer et al on up-to-date data.
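The d1 statistic compares the multi-model average trend with an observed trend, scaling the difference by the combined standard errors of the two. A minimal sketch, assuming the form d1* = (<b_m> - b_o) / sqrt(s{<b_m>}^2 + s{b_o}^2); the convention used here for the standard error of the model mean (sample standard deviation of the model trends divided by sqrt(n)) is an assumption and may differ from the paper's. The observed standard error would be an autocorrelation-adjusted one. The function d1_star and the example numbers are hypothetical, not the Table III inputs.

```python
import math
import statistics


def d1_star(model_trends, obs_trend, obs_trend_se):
    """Difference between the multi-model mean trend and one observed trend,
    divided by the combined standard error:
        d1* = (<b_m> - b_o) / sqrt(s{<b_m>}^2 + s{b_o}^2)
    s{<b_m>} is ASSUMED here to be stdev(model trends) / sqrt(n_models)."""
    n_models = len(model_trends)
    mean_model = statistics.fmean(model_trends)
    se_mean_model = statistics.stdev(model_trends) / math.sqrt(n_models)
    # math.hypot(a, b) == sqrt(a**2 + b**2)
    return (mean_model - obs_trend) / math.hypot(se_mean_model, obs_trend_se)


# Hypothetical trends in degC/decade, for illustration only.
models = [0.20, 0.30, 0.10]
print(round(d1_star(models, obs_trend=0.06, obs_trend_se=0.05), 3))  # 1.833
```

Note that as the observed record lengthens, s{b_o} shrinks, so the same trend difference produces a larger d1.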

You can check results to the end of 2007 (data that would have been readily available to Santer et al when they submitted the article) as follows. This yields a d1 value of 1.935, which would be significant against a one-sided t-test.


The value for 2006 was 1.77, which would be significant against a one-sided t-test (and against a two-sided t-test at 90%). It seems odd that they would have gone to the trouble of doing the SENS2 sensitivity study on the H1 hypothesis, but not on the H2 hypothesis. And if they did run the SENS2 test on the H2 hypothesis, those results would have been important and relevant information.
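To see where the d1 values quoted in this post fall relative to conventional thresholds, here is a small sketch that assumes a large-sample (normal) approximation to the t distribution. The degrees of freedom behind the test are not given here; with few degrees of freedom the critical values would be somewhat higher. The helper name classify is hypothetical.

```python
from statistics import NormalDist

# Large-sample approximation to t critical values (ASSUMPTION: many d.o.f.).
ONE_SIDED_5 = NormalDist().inv_cdf(0.95)   # about 1.645, same as two-sided 10%
TWO_SIDED_5 = NormalDist().inv_cdf(0.975)  # about 1.960


def classify(d):
    """Return (significant one-sided at 5%, significant two-sided at 5%)."""
    return d > ONE_SIDED_5, abs(d) > TWO_SIDED_5


# d1 values quoted in the post for different end dates of the observed record.
quoted = {"to 1999 (Table III)": 1.11, "to 2006": 1.77,
          "to 2007": 1.935, "current": 2.232}
for label, d in quoted.items():
    one, two = classify(d)
    print(f"{label:>20}: d1={d:.3f}  one-sided 5%: {one}  two-sided 5%: {two}")
```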

And when they saw these results, you’d think that Gavin Schmidt, Santer and so on would be curious as to what would happen with 2007 results. RC has not been reluctant to criticize people who have used stale data, and you’d think that Schmidt would have taken care not to do the same thing himself, especially if the use of up-to-date data had a material impact on the results, as it does with the H2 hypothesis with respect to the UAH data.



  1. Posted Oct 23, 2008 at 3:51 PM

    Santer doesn’t have the greatest track record when it comes to endpoint selection.

    See here

    or, more formally,

    Michaels, P.J., and P.C. Knappenberger, 1996. Human Influence on Global Climate? Nature, 384, 522-523.

    (BTW, this is an example of where we got a comment (derived from our truth checking) placed in the literature)


  2. John A
    Posted Oct 23, 2008 at 4:33 PM

    I’m too old to believe in Santer any more.

  3. Dishman
    Posted Oct 23, 2008 at 5:09 PM

    So, as I understand it…

    The models do not fail when hindcasting. That seems likely to be true, given that they were based on the same historical data.

    On the other hand, given forecasts to compare against actual results, it appears that the models are very close to being invalidated.

  4. Steve McIntyre
    Posted Oct 23, 2008 at 5:23 PM

    #3. No, because there are other issues here. For example, and this is something that I’ve noted as we go along, I am not in a position to opine as to whether UAH or RSS is “right”. As long as competent specialists disagree on these matters, there is an imponderable that very much prevents someone saying that the models are “invalidated”. So don’t get all excited about this.

    All we’re saying here is that one of the Santer claims doesn’t hold up. Different point.

    However, there is perhaps an element of your point underlying this. The t-value has been climbing rapidly because of the discrepancies in recent results.

    • Jim Melton
      Posted Oct 24, 2008 at 12:15 PM

      Re: Steve McIntyre (#4),

      I have to disagree with you here, Steve: “competent specialists” is the reason. Don’t fret, I won’t mention the “F” word.

      Let me draw the obvious and belaboured analogy.

      Here in the UK the coming (as it was in 2005) financial crash was a dead cert. Pinning the date on the donkey was the trick; US mortgage crash or not, it was coming. We had a consensus of sorts at my place of work that UK banking practices vis-a-vis mortgages, credit, and checks and balances were a busted flush. Only one of my colleagues was an accountant by training; the rest of us were IT geeks.

      The “competent specialists” in UK banking were trusted by all to know better, and they were wrong. Confidence is now shot, which is why the UK gov has had to nationalise banks that, by the banks’ own estimations, were healthy. Minor OTT digression completed, I trust you can easily see the comparison. How is it that a layman can look at the graphs of model predictions vs. temperature observations and instinctively know that the models are wrong? Yet WE are beholden to some golden unwritten standard to trust those who should know.

      Let me simplify my point. If the trusted (and I include you here) cannot say whether the squiggles say what we all think they say, then why would you (if in government office) use or defend the use of said squiggles to justify policy decisions?

  5. craig loehle
    Posted Oct 23, 2008 at 6:14 PM

    Scent dogs can’t find a suitcase full of cocaine as fast as Steve finds a discrepancy! Well done!

  6. John M
    Posted Oct 23, 2008 at 6:29 PM

    Speaking of RSS vs UAH, looks like Tamino’s tripping all over himself trying to make a point.
    Worth reading the comments.

  7. John Lang
    Posted Oct 23, 2008 at 6:53 PM

    Do any of the observation datasets such as Raobcore end in 1999?

  8. steven mosher
    Posted Oct 23, 2008 at 8:11 PM

    #3. The models are not based on the same historical data. It took some time, but Gavin confirms that some of the models used do NOT contain volcanic forcing.

  9. Jaye Bass
    Posted Oct 23, 2008 at 9:00 PM


    That’s the parallel Earth thing. In ImaginationLand, some of the Earths have volcanoes, others don’t.

  10. John Lang
    Posted Oct 24, 2008 at 8:11 AM

    Gavin commented on RealClimate that the model runs were from IPCC AR4, which had a cut-off date of 2004 (for model runs only, I presume, since lots of other papers made it in even though they weren’t even published yet).

    Those 2004 model runs were based on observation data which was confirmed only up to 1999. So that is why they cut it off at 1999.

    I note that the trend per decade from the UAH lower troposphere data for the tropics is the same number, 0.06C per decade, whether you stop at 1999 or continue the analysis through September 2008.

    So, the actual observations are still well below the average of the model runs at close to 0.2C per decade, but the end date doesn’t matter much.

    In my mind, this is a problem with a simple least squares regression of the trend. The 2008 tropics temps are below the 1979 temps but a regression line still projects an upward trend.

    • Alan Wilkinson
      Posted Nov 11, 2008 at 7:58 PM

      Re: John Lang (#10),

      Plainly the trend since 1979 is not a straight line so any model that produces one is false. Other factors such as the PDO have to be accounted for properly.

      • John Lang
        Posted Nov 12, 2008 at 9:41 AM

        Re: Alan Wilkinson (#15),

        Actually, if one adjusts the RSS data for the influences of the Nino and the AMO, the trend rises to 0.097C per decade.

      • John Lang
        Posted Nov 12, 2008 at 11:43 AM

        Re: Alan Wilkinson (#15),

        Sorry, the little model I built has two constants and I started with the wrong one.

        The RSS tropics trend falls to 0.048C per decade if one adjusts for the influence of the AMO and Nino.
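The commenter does not give his model, so the following is only a minimal sketch of one common way to "adjust for" climate indices: fit the trend and the index terms jointly by least squares and read off the time coefficient. The indices are random synthetic stand-ins, not real Nino or AMO data, and trend_adjusted_for is a hypothetical helper.

```python
import numpy as np


def trend_adjusted_for(y, covariates):
    """Least-squares fit of y = a + b*t + sum_k c_k * x_k(t); returns b."""
    y = np.asarray(y, dtype=float)
    t = np.arange(y.size, dtype=float)
    design = np.column_stack([np.ones_like(t), t, *covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # coefficient on time


# Synthetic stand-ins for monthly indices (hypothetical, not real data).
rng = np.random.default_rng(1)
n = 360
nino = rng.normal(size=n)
amo = np.cumsum(rng.normal(scale=0.05, size=n))  # slowly wandering index
true_trend = 0.0005  # per month
y = true_trend * np.arange(n) + 0.08 * nino + 0.3 * amo

naive = np.polyfit(np.arange(n), y, 1)[0]
adjusted = trend_adjusted_for(y, [nino, amo])
print(f"naive {120 * naive:.3f}, adjusted {120 * adjusted:.3f} degC/decade")
```

Because a slowly wandering index can correlate with time, the naive and adjusted trends generally differ, and the result depends on which constants are fitted, consistent with the correction above.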

  11. Sylvain
    Posted Oct 24, 2008 at 12:51 PM

    #6 John m

    The last update is really funny. He withdrew his opinion because, after all, UAH is closer to HadCRUT2 than RSS is.

    Roger Pielke Sr. posted this last summer:

    It is about this paper comparing UAH to RSS:

    Randall, R. M., and B. M. Herman (2007), Using Limited Time Period Trends as a Means to Determine Attribution of Discrepancies in Microwave Sounding Unit Derived Tropospheric Temperature Time Series, J. Geophys. Res., doi:10.1029/2007JD008864, in press

  12. Dishman
    Posted Oct 24, 2008 at 1:18 PM

    I can accept that the models are “not falsified”.

    Part of the problem is that we’ve got translation going on at the policy level, where “not falsified” gets translated into “proven”. I think part of that is due to the use of technical language which contains the same words as plain language, but with different meanings.

    As I understand it, “consistent with” means roughly “not falsified” (less than 95% chance of being wrong) in statistics, while in plain English, “consistent with” is much closer to “proven” (more than 95% chance of being right).

  13. Posted Nov 14, 2008 at 1:56 PM

    We continue the saga of the paper by Santer and 16 co-authors [S17 in IJC 2008]. You will recall from recent TWTW newsletters that it attacks the findings of Douglass, Christy, Pearson, and Singer [DCPS in IJC 2007] as well as those of the NIPCC report “Nature – Not Human Activity – Rules the Climate”.

    S17 claim that the observed temperature trends in the tropical troposphere agree with those calculated from greenhouse (GH) models. The claim is based on two assertions: the observations (or, more properly, the analyses of the data) have changed drastically in just the past two years, and the uncertainties of both observed and modeled trends are found to be much larger.

    The first thing that struck me about S17 was their figure 6A, which depicts 7 (yes, seven) curves derived from the same set of radiosonde data, each claiming to show the true dependence of the temperature trend with (pressure) altitude. The curves fall into three “families” that show striking differences – for reasons that I will discuss elsewhere. Here I will concentrate on one feature only: the time interval chosen by S17 is 1979 – 1999. Please remember that 1998 was the year of unusual warmth because of a strong El Nino.

    Does the choice of endpoint matter and affect the trend values shown? You betcha. To check up on this matter, I briefly thought of writing to Santer to request the underlying temperature data. But why waste time? So I used a proxy: the MSU-UAH data set for lower-troposphere temperatures from satellites, kindly sent to me by John Christy. Here, then, are the OLS trends calculated for a time interval starting at the beginning of the satellite data set, 1979, and ending in 1993, 1996, 1999, or 2002: -0.010, 0.035, 0.103, 0.121 degC/decade.

    No need to comment further, except I just cannot resist quoting from page 130 of the CCSP-SAP-1.1 report (Karl et al. 2006). In an Appendix, Wigley, Santer, and Lanzante explain the mysteries of “statistical issues regarding trends” to the great unwashed in real simple words:
    “Estimates of the linear trend are sensitive to points at the start or end of the data set…. For example, if we considered tropospheric data over 1979 through 1998, because of the unusual warmth in 1998 … the calculated trend may be an overestimate of the true underlying trend.”

  14. akm
    Posted Feb 25, 2010 at 9:51 PM

    What is the different way that you happen to prefer?
    What results did you obtain by doing your calculations that way?

    Your article beyond “that turns out to be true” seems really to be about something better introduced by you as your own work, leading to some conclusion.

3 Trackbacks

  1. […] Let me assure Gavin that Steve McIntyre also numbers among the “some” who downloaded climate data from archives in all sorts of places. With regard to requesting information from Santer: It appears Steve wants to figure out precisely what Santer did and whether certain pesky details affect the results. (Typical pesky detail associated with the choice of end point for the analysis discussed here.) […]

  2. […] you been following the Douglas vs. Santer bout? Do you remember the blog controversies that asked “Why did Santer stop analysis at December 1999 when Douglas ran analyses through 2004?” Have you been hoping someone would get TLT data we can compare to UAH and RSS? (Yes, I mean you […]

  3. By The Blackboard » Reviewer outs himself! on Apr 21, 2011 at 5:42 PM

    […] It seems to me some people wondered. Written by lucia. Previous Post: […]
