As a diversion from ploughing through Mann et al 2008, I took a look at Santer et al 2008 SI, a statistical analysis of tropospheric trends by 16 non-statisticians and, down the list, Doug Nychka, a statistician who, unfortunately, is no longer “independent”. It is the latest volley in a dispute between Santer and his posse on one hand and Douglass, Christy and a smaller posse on the other hand. The dispute makes the Mann proxy dispute seem fresh in comparison. Santer et al 2008 is a reply to Douglass et al 2007, discussed on several occasions here. Much of the argument re-hashes points made at realclimate last year by Gavin Schmidt, one of the Santer 17.
The statistical issues in dispute seem awfully elementary – whether an observed trend is consistent with models. In such circumstances, I would like to see the authors cite well-recognized statistical authorities from off the Island – preferably well-recognized statistical texts. In this respect, the paper, like so many climate science papers, was profoundly disappointing. No references to Draper and Smith or Rencher or any statistics textbook or even articles in a statistical journal. In their section 4 (setting out statistical procedures), they refer to prior articles by two of the coauthors, one in a climate journal and one in a prestigious but general science journal:
To examine H1, we apply a ‘paired trends’ test (Santer et al., 2000b; Lanzante, 2005)
Lanzante 2005 (A cautionary note on the use of error bars. Journal of Climate 18: 3699–3703) is a small article at an undergraduate level, arguing that visual comparison of confidence intervals can play tricks (and that this sort of error was prevalent in many then-recent climate articles in IPCC TAR):
When the error bars for the different estimates do not overlap, it is presumed that the quantities differ in a statistically significant way.
Instead of this sort of comparison, Lanzante recommended the use of the t-test for a difference as described in any first-year statistics course. Lanzante 2005 cited Schenker and Gentleman (Amer. Stat. 2001), an article in a statistics journal written at a “popular” level. One can easily see how the standard in Lanzante 2005 raises the cut-off point in a simple case where the two populations have the same standard deviation σ. If typical 2σ 95% confidence intervals are applied, then, for the two confidence intervals not to overlap, the two means have to be separated by 4σ, i.e.
$$ \frac{|\mu_1 - \mu_2|}{\sigma} > 4 $$
For a t-test on the difference in means, the standard is:
$$ \frac{|\mu_1 - \mu_2|}{\sqrt{\sigma^2 + \sigma^2}} > 2, \quad \text{i.e.} \quad \frac{|\mu_1 - \mu_2|}{\sigma} > 2\sqrt{2} \approx 2.83 $$
Note that equality of the two s.d.s is the “worst” case. The value (in units of the larger s.d.) goes down to 2 as one s.d. becomes much shorter than the other – precisely because the hypotenuse of a right-angled triangle becomes closer in length to the x-length as the angle becomes more acute.
While the authority of the point is not overwhelming, the point itself seems fair enough.
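The arithmetic behind the two cut-offs described above can be checked in a few lines. This is only a sketch: the function names are mine, a round multiplier of 2 stands in for the 95% critical value, and the two s.d.s are treated as known.

```python
import math

def overlap_threshold(s1, s2, z=2.0):
    """Separation of the means needed for two z-sigma intervals not to overlap."""
    return z * (s1 + s2)

def difference_test_threshold(s1, s2, z=2.0):
    """Separation of the means needed for a z-sigma test on the difference.

    The s.d. of a difference of independent estimates is the hypotenuse
    of the two individual s.d.s.
    """
    return z * math.hypot(s1, s2)

if __name__ == "__main__":
    # Hold the larger s.d. at 1 and shrink the other one.
    for ratio in (1.0, 0.5, 0.1, 0.01):
        print(ratio,
              round(overlap_threshold(1.0, ratio), 3),
              round(difference_test_threshold(1.0, ratio), 3))
```

With equal s.d.s the gap between the two criteria is at its widest (4 versus about 2.83); as one s.d. shrinks, both thresholds head toward 2 in units of the larger s.d.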
Santer et al 2000b is more frustrating in this context, as it is not even an article on statistics but a predecessor article in the long-standing brawl: “Santer BD, et al. 2000b. Interpreting differential temperature trends at the surface and in the lower troposphere. Science 287: 1227–1232.” They stated:
All three surface – 2LT trend differences are statistically significant (21), despite the large, overlapping 95% confidence intervals estimated for the individual IPCC and MSUd 2LT trends (Table 1) (22).
Reference 21 proved to be another Santer article, also in a non-statistical journal and, at the time, still unaccepted:
21. The method for assessing statistical significance of trends and trend differences is described by B. D. Santer et al. ( J. Geophys. Res., in press). It involves the standard parametric test of the null hypothesis of zero trend, modified to account for lag-1 autocorrelation of the regression residuals [see J. M. Mitchell Jr. et al., Climatic Change, World Meteorological Organization Tech. Note 79 ( World Meteorological Organization, Geneva, 1966)]. The adjustments for autocorrelation effects are made both in computation of the standard error and in indexing of the critical t value
Santer et al (JGR 2000) proved to have much in common with the present study. Both studies observe that the confidence intervals for a trend of a time series with autocorrelation are wider. I agree with this point. Indeed, it seems like the sort of point that Cohn, Lins and Koutsoyiannis have pressed for a long time in connection with long-term persistence. However, Santer et al carefully avoid any mention of long-term persistence, limiting their consideration to AR1 noise (while noting that confidence intervals would be still wider with more complicated autocorrelation). Although the reference for the point is not authoritative, the point itself seems valid enough to me. My interest would be in crosschecking standards enunciated here against IPCC AR4 trend confidence intervals, which I’ll look at some time.
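For readers who want to see how an AR1 adjustment of this general kind widens a trend confidence interval, here is a minimal sketch. It uses the common effective-sample-size rule n_eff = n(1 − r1)/(1 + r1), with r1 the lag-1 autocorrelation of the OLS residuals, and I am assuming that this is close in spirit to the adjustment in the quoted note rather than reproducing it exactly. The function names and the synthetic series are mine and purely illustrative.

```python
import math
import random

def simulate_ar1_trend(n, slope, rho, sigma, seed=0):
    """Synthetic series: linear trend plus stationary AR(1) noise."""
    rng = random.Random(seed)
    u = rng.gauss(0.0, sigma / math.sqrt(1.0 - rho * rho))
    y = []
    for t in range(n):
        y.append(slope * t + u)
        u = rho * u + rng.gauss(0.0, sigma)
    return y

def ols_trend(y):
    """OLS slope against t = 0..n-1, plus residuals and sum of squares of t."""
    n = len(y)
    tbar = (n - 1) / 2.0
    ybar = sum(y) / n
    sxx = sum((t - tbar) ** 2 for t in range(n))
    slope = sum((t - tbar) * (yt - ybar) for t, yt in enumerate(y)) / sxx
    intercept = ybar - slope * tbar
    resid = [yt - (intercept + slope * t) for t, yt in enumerate(y)]
    return slope, resid, sxx

def lag1_autocorrelation(e):
    m = sum(e) / len(e)
    d = [x - m for x in e]
    return sum(d[i] * d[i - 1] for i in range(1, len(d))) / sum(x * x for x in d)

def adjusted_trend_se(y):
    """Naive and AR1-adjusted standard errors of the OLS trend."""
    n = len(y)
    slope, resid, sxx = ols_trend(y)
    r1 = lag1_autocorrelation(resid)
    n_eff = n * (1.0 - r1) / (1.0 + r1)   # effective sample size
    if n_eff <= 2.0:
        raise ValueError("effective sample size too small for a trend test")
    sse = sum(e * e for e in resid)
    se_naive = math.sqrt(sse / (n - 2) / sxx)
    se_adjusted = math.sqrt(sse / (n_eff - 2.0) / sxx)
    return slope, r1, n_eff, se_naive, se_adjusted

if __name__ == "__main__":
    y = simulate_ar1_trend(252, 0.001, 0.7, 0.2, seed=1)
    print(adjusted_trend_se(y))
```

Under this rule the critical t value would also be looked up with n_eff − 2 degrees of freedom rather than n − 2, which widens the interval a little further.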
Now for something interesting and puzzling. I think that reasonable people can agree that trend calculations with endpoints at the 1998 Super-Nino are inappropriate. Unfortunately, this sort of calculation crops up from time to time (not from me). A notorious example was, of course, Mann et al 1999, which breathlessly included the 1998 Super-Nino. But we see the opposite in some recent debate, where people use 1998 as a starting point and argue that there is no warming since 1998. (This is not a point that has been argued or countenanced here.) Tamino frequently fulminates against this particular argument and so it is fair to expect him to be equally vehement in rejecting 1998/1999 as an endpoint for trend analysis.
Now look at the Santer periods:
Since most of the 20CEN experiments end in 1999, our trend comparisons primarily cover the 252-month period from January 1979 to December 1999
If the models weren’t run out to 2008, get some runs that were. If they want to stick to old models and the old models were not archived in running order, the trend in CO2 has continued and so why can’t the trend estimates be compared against actual results to 2008? Would that affect the findings?
It looks to me like they would. Let me show a few figures. Here’s a plot of a wide variety of tropical temperatures – MSU, RSS, CRU, GISS, NOAA, HadAT 850 hPa. In some cases, these were calculated from gridcell data (GISS); in other cases, I just used the source (e.g. MSU, RSS). All data were centered on 1979-1998 and MSU and RSS data were divided by 1.2 in this graphic (a factor that John Christy said to use for comparability to surface temperatures), but native MSU [,”Trpcs”] data is used in the CI calculations below. The 1998 Super-Nino is well known and sticks out.
I’ve done my own confidence interval calculations using profile likelihood methods. Santer et al 2000 and 2008 apply a sort of correction for AR1 autocorrelation that reflects not modern statistical practice but the Cochrane-Orcutt correction from about 50 years ago. (Lucia has considered this recently.)
Instead of using this rule of thumb, I’ve used the log-likelihood generated in modern statistical packages (by the arima function, for example) and calculated profile likelihoods along the lines of our calibration experiments in Brown and Sundberg style. I’m experimenting with the bbmle package in R and some of the results here were derived using the mle2 function (but I’ve ground-truthed the calculations using optim and optimize).
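To make the idea of a profile likelihood for a trend concrete, here is a rough Python sketch. It is not the R code behind the figures: it assumes a model y_t = a + b·t + u_t with Gaussian AR1 errors, maximizes the intercept and innovation variance analytically, maximizes the AR1 coefficient over a coarse grid, and takes the 95% interval to be the set of trends whose profile log-likelihood lies within half the chi-square(1) cut-off of the maximum. All names and the synthetic data are my own; fracdiff is not covered.

```python
import math
import random

CHI2_95_1DF = 3.841458820694124  # 95% point of chi-square with 1 d.f.

def simulate_ar1_trend(n, slope, rho, sigma, seed=0):
    """Synthetic series: linear trend plus stationary AR(1) noise."""
    rng = random.Random(seed)
    u = rng.gauss(0.0, sigma / math.sqrt(1.0 - rho * rho))
    y = []
    for t in range(n):
        y.append(slope * t + u)
        u = rho * u + rng.gauss(0.0, sigma)
    return y

def concentrated_loglik(y, b, rho):
    """Exact Gaussian AR(1) log-likelihood at trend b and coefficient rho,
    with intercept and innovation variance maximized out."""
    n = len(y)
    r = [yt - b * t for t, yt in enumerate(y)]   # series with trend b removed
    k = math.sqrt(1.0 - rho * rho)
    # Prais-Winsten transform: w_t = intercept * c_t + white noise
    w = [k * r[0]] + [r[t] - rho * r[t - 1] for t in range(1, n)]
    c = [k] + [1.0 - rho] * (n - 1)
    a = sum(wi * ci for wi, ci in zip(w, c)) / sum(ci * ci for ci in c)
    s = sum((wi - a * ci) ** 2 for wi, ci in zip(w, c))
    return (-0.5 * n * (math.log(2.0 * math.pi * s / n) + 1.0)
            + 0.5 * math.log(1.0 - rho * rho))

def profile_ci(y, b_grid, rhos):
    """Grid-based profile likelihood estimate and 95% interval for the trend."""
    prof = [max(concentrated_loglik(y, b, rho) for rho in rhos) for b in b_grid]
    best = max(prof)
    inside = [b for b, ll in zip(b_grid, prof) if 2.0 * (best - ll) <= CHI2_95_1DF]
    return b_grid[prof.index(best)], min(inside), max(inside)

if __name__ == "__main__":
    y = simulate_ar1_trend(120, 0.002, 0.7, 0.2, seed=2)
    b_grid = [0.002 + 0.0004 * (i - 50) for i in range(101)]
    ar1_rhos = [i / 20.0 for i in range(-18, 20)]     # -0.90 .. 0.95
    print("AR1:", profile_ci(y, b_grid, ar1_rhos))
    print("OLS:", profile_ci(y, b_grid, [0.0]))       # rho fixed at zero
```

Fixing rho at zero reproduces the ordinary least squares likelihood, so the two calls give a like-for-like comparison of the OLS and AR1 intervals on the same grid. A finer grid, or a proper optimizer, would be needed for anything beyond illustration.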
First let me show a diagram comparing log-likelihoods for three methods: OLS, AR1 and fracdiff. The horizontal red line shows the 95% confidence interval for each method. As you can see, even for the UAH measurements, for the 1979-1999 interval, the mean trend of the models as an ensemble is just within the 95% CI for the observed trend assuming AR1 residuals. If one adds an uncertainty interval for the ensemble (however that is calculated), this would create an expanded overlap. Fractional differencing expands the envelope a little but not all that much in this period (it makes a big difference when applied to annual data over the past century). Expanding the CI is a bit of a two-edged sword, as no trend is also, and even further, within the 95% interval. So the expanded CI (barely) enables consistency with models, but also enables consistency with no trend whatever. I didn’t notice this point being made in either Santer publication.
Figure 1. Log Likelihood Diagram for OLS, AR1 and Fracdiff for 1979-1999 MSU Tropics. Dotted vertical red line shows the 0.28 trend of model ensemble. [Update – the multi-model mean is 0.215; the figure of 0.28 appears in Santer et al Figure 1, but is the ensemble mean for the MRI model only.]
On the face of it, I can’t see any reason why the model ensemble trend of 0.28 can’t be used in an update of the Santer et al 2008 calculation in a comparison against observations from the past decade. The relevant CO2 forcing trend has continued pretty much the same. Here’s the same diagram brought up to date, again showing the model ensemble trend of 0.28 deg C/decade as a vertical dotted red line. In this case, the ensemble mean trend of 0.28 deg C/decade is well outside the 95% CI (AR1 case).
Now some sort of CI cone needs to be applied to the ensemble mean as well, but 47 cases appear to be sufficient to provide a fairly narrow CI. I realize that there has been back-and-forth about whether the CI should pertain to the ensemble mean or to the ensemble population. As a non-specialist in the specific matter at hand, I think that Douglass et al have made a plausible case for using the CI of the ensemble mean trend, rather than of the model population. Using a difference t-test (or likelihood equivalent) along the lines of Lanzante 2005 requires a bit more than non-overlapping CIs, but my sense is that the spread – using an ensemble mean CI – would prove wide enough to also pass a t-test. As to whether the s.d. of the ensemble mean or s.d. of the population should be used – an argument raised by Gavin Schmidt – all I can say right now is that it’s really stupid for climate scientists to spend 10 years arguing this point over and over. Surely it’s time for Wegman or someone equally authoritative to weigh in on this very elementary point and put everyone out of their misery.
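The practical difference between the two choices of s.d. is easy to see numerically. The sketch below is my own illustration, not the exact statistic of either paper: it scales the model-minus-observation trend difference by (i) the standard error of the ensemble mean alone, (ii) that standard error combined with the observational trend uncertainty, and (iii) the inter-model spread combined with the observational uncertainty. The dictionary keys and the input numbers are made up for the example.

```python
import math
import statistics

def consistency_stats(model_trends, obs_trend, obs_se):
    """Three ways of scaling the gap between the ensemble mean and an observed trend."""
    n = len(model_trends)
    mean = statistics.mean(model_trends)
    sd = statistics.stdev(model_trends)   # spread of the model population
    se_mean = sd / math.sqrt(n)           # uncertainty of the ensemble mean
    gap = mean - obs_trend
    return {
        "mean_only": gap / se_mean,
        "mean_plus_obs": gap / math.hypot(se_mean, obs_se),
        "population_plus_obs": gap / math.hypot(sd, obs_se),
    }

if __name__ == "__main__":
    # Purely illustrative numbers, not taken from any model archive.
    stats = consistency_stats([0.1, 0.2, 0.3, 0.4], obs_trend=0.05, obs_se=0.1)
    for name, value in stats.items():
        print(name, round(value, 3))
```

The same gap looks large against the standard error of the ensemble mean and much smaller against the population spread, which is why the choice matters and why the shrinkage of the first with sample size is at the heart of the dispute.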
Here’s the same diagram using RSS data. The discrepancy is reduced, but not eliminated. Again, analysis needs to be done on the model CIs, which I may re-visit on another occasion.
I think that these profile likelihood diagrams are a much more instructive way of thinking about trends than the approaches so far presented by either the Santer or Douglass posses. In my opinion, as noted above, an opinion on model-observation consistency from an independent statistician is long overdue. It’s too bad that climate scientists have paid so little heed to Wegman’s sensible recommendation.
Similar issues have been discussed at length in earlier threads, e.g. here.