In browsing AR4 chapter 3, I encountered something that seems very strange in Table 3.2 which reports trends and trend significance for a variety of prominent temperature series (HAdCRU, HadSST, CRUTem). The caption states:
The Durbin Watson D-statistic (not shown) for the residuals, after allowing for first-order serial correlation, never indicates significant positive serial correlation.
The Durbin-Watson test is a test for first-order serial correlation. So what exactly does it mean to say that the a test on the residuals, after allowing for first-order serial correlation, does not indicate first-order serial correlation? I have no idea. I asked a few statisticians and they had no idea either. I’ve corresponded with both Phil Jones and David Parker about this, trying to ascertain both what was involved in this test and to identify a statistical authority for this test. I have been unable to locate any statistical reference for this use of the Durbin-Watson test and no reference has turned up in my correspondence to date. (My own experiments – based on guesswork as to what they did – indicate that this sort of test would be ineffective against a random walk.)
The insertion of this comment about the Durbin-Watson test, if you track back through the First Draft, First Draft Comments, Second Draft and Second Draft Comments was primarily in response to a comment by Ross McKitrick about the calculation of trend significance, referring to Cohn and Lins 2005. The DW test “after allowing for serial correlation” was inserted by IPCC authors as a supposed rebuttal to this comment (without providing a citation for the methodology). I’m still in the process of trying to ascertain exactly what was done and whether it does what it was supposed to do, but the trail is somewhat interesting in itself.
Once again, the caption to Table 3.2 says:
Trends with 5 to 95% confidence intervals and levels of significance (bold: <1%; italic, 1—5%) were estimated by Restricted Maximum Likelihood (REML; see Appendix 3.A), which allows for serial correlation (first order autoregression AR1) in the residuals of the data about the linear trend. The Durbin Watson D-statistic (not shown) for the residuals, after allowing for first-order serial correlation, never indicates significant positive serial correlation.
Appendix 3.A doesn’t add much to this, other than providing a reference (Diggle et al 1999):
The linear trends are estimated by Restricted Maximum Likelihood regression (REML, Diggle et al., 1999), and the estimates of statistical significance assume that the terms have serially uncorrelated errors and that the residuals have an AR1 structure. … The error bars on the trends, shown as 5 to 95% ranges, are wider and more realistic than those provided by the standard ordinary least squares technique. If, for example, a century long series has multi-decadal variability as well as a trend, the deviations from the fitted linear trend will be autocorrelated. This will cause the REML technique to widen the error bars, reflecting the greater difficulty in distinguishing a trend when it is superimposed on other long-term variations and the sensitivity of estimated trends to the period of analysis in such circumstances. Clearly, however, even the REML technique cannot widen its error estimates to take account of variations outside the sample period of record.
As I mentioned above, it’s interesting to see the provenance of this “test”. It’s not mentioned in the First Draft. In a comment on Table 3.2, Ross McKitrick said (and I was unaware of his interest in this prior to researching this post):
3-425 A 9:0 Table 3.2. Here, and in Appendix 3.A.1.2, reference is made to “Restricted Maximum Likelihood” standard errors, but the citation is to a general text book (Diggle), not a published article. Considering the importance of the contents of this table to the Chapter, the reader needs considerably more guidance about the estimating methodology, as well as reference to current literature.
There is a substantial literature dating back to the early 1990s showing that anomaly data have long autocorrelation processes in them, making for long term persistence and near unit-root behaviour. It is well known in the climate literature that this can severely bias significance estimates in trend regressions. Yet there is no mention of this problem and it seems that the t-stats in this table reflect only a first order autocorrelation correction, almost certainly making them misleading. I will suggest some improved wording, but I believe this table needs a serious re-do and the reader is owed a substantial discussion of the problems of estimating significance of trends in climatic data. Below I cite a forthcoming treatment of the issue by Cohn and Lins, who comment “It is therefore surprising that nearly every assessment of trend significance in geophysical variables published during the past few decades has failed to account properly for long term persistence…. For example, with respect to temperature data there is overwhelming evidence that the planet has warmed during the past century. But could this warming be due to natural dynamics? Given what we know about the complexity, long term persistence, and non-linearity of the climate system, it seems the answer might be yes.” All the trends should be re-estimated using, at minimum, an ARMA(1,1) model, not an AR(1) model; and the lag processes need to be extended out to sufficient length to ensure the ARMA coefficients become insignificant.
The treatment of this key issue in this chapter is at least 10 years behind the state of the art (see, for instance, Woodward and Gray JClim 1993, who were already ahead of where this discussion is), and unless substantial improvement is made this Table and related discussions should be removed altogether.
In response the authors introduced the “IPCC Test” for long-term persistence (although the DW statistics were not actually included in Table 3.3 as promised):
Discussion expanded to new appendix. Residuals from linear trends after fit of AR(1) model do not show strong long term autocorrelation processes as illustrated by the Durbin-Watson statistics now given in Table 3.3.
I consulted the statistical reference, Diggle et al, Analysis of Longitudinal Data, (incorrectly cited as 1999 rather than 1994). It does not mention the Durbin-Watson statistic. When I tried calculated Durbin-Watson statistics on trend residuals applied to HadWWT2, CRUTEM3 or HadCRU3, I obtained values showing substantial serial correlation in the residuals (e.g. HadSST2 1850-2005 had a DW – 0.47 well below the limit of 1.5). There are some other odd features in the trend calculations (the perhaps idiosyncratic use of REML methodology), but I’ll revert to them on another occasion.
Anyway, last week, I wrote to Phil Jones, the co-lead author of AR4 chapter 3, (copy to the University of East Anglia FOI), and inquired as follows:
In Table 3.2 of IPCC AR4, you refer to Durbin-Watson statistics for various trend calculations, but do not show them. Could you please provide me with these statistics.
I am unfamiliar with any prior use of the Durbin-Watson statistic “after allowing for first-order serial correlation”. Could you please provide me your statistical reference showing how one calculates a Durbin-Watson statistic “after allowing for first-order serial correlation” and giving significance levels for the statistic “after allowing for first-order serial correlation”.
Could you please identify the statistical packages used in your calculation of REML trends and Durbin-Watson statistics?
Would it be correct to say that (1) fitted a trend to the various series; (2) fitted an AR1 arima model to the residuals from (1)? (3) carried out a Durbin-Watson test on the residuals from (2)?
Where applicable, these requests are made under FOI provisions.
Thank you for your attention, Steve McIntyre
I received a prompt and semi-responsive answer. Jones sent me the requested Durbin-Watson statistics; their value for the 1850-2005 HadSST2 trend was 2.20 versus my calculation (using a standard function in R) of 0.47. Jones described the calculation of the Durbin-Watson statistic, which I knew how to do – my question pertained to their methodoogy of using the Durbin-Watson statistic after allowing for first-order serial correlation:
The Durbin-Watson statistics were in an earlier draft of the chapter. They were removed simply for space reasons, as none were significant. As you can see, we also removed the lag-1 autocorrelations as well.
REML comes from Diggle et al 1999 (section 4.5 pp 64-68). This reference is given at the end of the chapter. The page numbers refer to the 1999 edition of the book. There is a later one available on Amazon, so the page numbers may differ in that edition. David Parker programmed the calculations of all the trends. As far as I know he didn’t do this with any specific statistical packages. He likely used PV-WAVE which the Hadley Centre used for almost all their analysis work. The use of REML is discussed in Appendix 3.A.
DW is very simple to calculate. We used the lag-1 autocorrelations to calculate the reduced number of degrees of freedom of the residuals. This number was used with the DW statistic to estimate the significance. Basically, any DW value above about 1.8 is not significant. DW Tables are in some statistics books. There should be two significance values (for any DW value and N, here the effective number of degrees of freedom). For the lower of these, values below would be significant. Values above the upper are not significant. For values in between nothing can be said. We were always above the upper value. For random numbers, the DW statistic should return a value about 2. There are different sets of Tables for different significance levels (1%, 5% etc). We used 5%, which is generally the one given in text books.
Any statistical package would likely not use the reduced number of degrees of freedom (reduced based on the lag-1 autocorrelation) when giving the significance of DW. Using the reduction to the degrees of freedom makes the test harder to pass.
While the response is polite and did provide the requested DW statistics, my request was not about how to calculate a DW statistic – which I obviously know and for which there are many references – but for the DW statistic “after allowing for serial correlation”. In addition, Jones explained that the effect of “allowing for first-order serial correlation” was to change the benchmarks for significance of the DW depending on the number of degrees of freedom – an explanation that did not accout for the discrepancy between the reported DW statistic and what I calculated. So I wrote back to Jones seeking further clarification and attaching an R-script for my calculations:
Dear Phil, thanks for the prompt reply, but there are a number of points that remain very unclear. I am extremely familiar with the Durbin-Watson statistic as it is familiar to all econometricians. I use the R language which has a convenient Durbin-Watson test in the lmtest package (the dwtest function.) When I ran a Durbin-Watson test on residuals from fitting a trend, I obtained a Durbin-Watson statistic of 0.49 for an OLS-fitted trend to the HadSST2 series presently online (over 1850-2005).When I re-fitted a trend line using the reported slope of 0.038 deg C/decade, I obtained an even lower Durbin-Watson statistic of 0.27. For this same situation, you reported a DW statistic of 2.2. I’ve attached a script in R. You say that “We used the lag-1 autocorrelations to calculate the reduced number of degrees of freedom of the residuals.” In order to get a Durbin-Watson statistic of 2.2, you must have done something to the data that is not a typical procedure and which is not explained in Diggle. The best guess that I could come up with as to what you did was that you might have fitted an AR1 arima model to the trend residuals and then calculated a DW statistic for the residuals for the arima-fit. However, this is just speculation and there is no clue in AR4 as to what was done.
BTW the usual interpretation of the DW test in econometrics is as a test for first-order autocorrelation, so the exact meaning of using a DW test “after allowing for first-order serial correlation” is by no means obvious. Again, if you can direct me to an article describing the exact procedure that you used together with its statistical properties, I’d appreciate it.
Regards, Steve McIntyre
Jones (who is generally a prompt and courteous correspondent) replied:
The DW test was carried out on the residuals after removing the AR1 persistence. So with respect to your earlier email, you are correct that the three procedures were in essence your three points below.
“Would it be correct to say that (1) fitted a trend to the various series; (2) fitted an AR1 arima model to the residuals from (1)? (3) carried out a Durbin-Watson test on the residuals from (2)?”
The AR1 persistence is modelled by REML. Maximum likelihood widens the error bars to account for the AR1. The DW test is to see whether any further widening is needed and the results show it isn’t.
If Jones’ response is correct, then I’m pretty sure that the AR4 has goofed in using the “IPCC Test” as a test for long-term persistence, since, in a couple of experiments on trend portions within a random walk, I got DW statistics above 2 as well – and the IPCC Test looks ineffective to me against a random walk null.
However, I’m not sure that Jones has described the procedure correctly. Concurrent with this correspondence, David Parker sent me an email that some of the daily station temperature data used in “A demonstration that large-scale warming is not urban” had been posted up at http://www.hadobs.org/ – under “Land surface data”, click on Urban. Since Jones had attributed the calculations to him, I also asked him about statistical references for the methodology as follows:
Dear David… On another topic, I asked Phil Jones about the calculation of trends in AR4 in which a Durbin-Watson statistic was cited “after allowing for first-order serial correlation”. He said that you did the calculations. I’m very familiar with the Durbin-Watson statistic which is widely used in econometrics, but I am unfamiliar with any statistical authorities describing the use of this statistic “after allowing for first-order serial correlation”. He was also unable to describe exactly what the methodology was for doing this. Can you give me a reference for this exact procedure and a description of how you carried out the calculation.
Regards, Steve McIntyre
Parker (also a prompt and courteous correspondent) replied:
The Durbin_Watson statistic was done on the residuals from the linear regressions after removing the AR1 persistence as modelled by the restricted maximum likelihood software. That is why the values were close to 2. This procedure is correct because the restricted maximum likelihood software will already have widened the error-bars to take account of AR1; the DW is a test to see whether any further widening is needed, and the results showed that it wasn’t for the IPCC series. The restricted maximum likelihood method is described by Diggle et al. cited in the IPCC Report.
We did the calculations using PV-Wave language. In the restricted maximum likelihood software, an iterative procedure was used to choose the best fit to the series: the iteration was done simultaneously in 2 parameters – the AR1 coefficient and the variance of the residuals. After the fit was chosen, the residuals were pre-whitened by subtracting the AR1 component.
I hope this helps
Again, a courteous reply, but notably no statistical reference showing that they’ve used the Durbin-Watson test appropriately under these conditions. So I wrote back once again:
Dear David, thank you for the courteous note. I am familiar with REML methods, but I am not familiar with any description of the properties of the Durbin-Watson statistic in the form that you describe in any peer-reviewed statistical literature and I take it that there is no such reference. I’ve never seen the DW statistic described previously as a “test to see whether further widening is needed”. At present, I haven’t formed any conclusion as to whether such a use is right or wrong; it’s just that I’ve never seen the DW test used in the way that you describe and so far, neither you nor Phil Jones have been able to show me any statistical authority for this use. In the absence of such a reference, could you send me the code for the calculation of the DW statistic in the PV Wave software as you’ve done it so that I can see precisely what you’ve done.
Thanks, Steve McIntyre
Parker promptly replied as follows:
Here’s the coding.
FOR iz = 1 , nz-1 DO BEGIN
zsq = z(iz-1)*z(iz-1)
zsqsum = zsqsum+zsq
zdiff = z(iz)-z(iz-1)
zdiffsq = zdiff*zdiff
zdiffsqsum= zdiffsqsum + zdiffsq
It was applied to the residuals in 1-dimensional array “z” after the AR1 fit had been subtracted from them.
The Durbin-Watson statistic is described in Section 12.1.5 of H von Storch and F W Zwiers, “Statistical Analysis in Climate Research”, Cambridge University Press 1999.
So once again, I had the Durbin-Watson statistic explained to me – the one thing that I wasn’t inquiring about. Reviewing my latest letter, it wasn’t perfectly specified. In my email to Jones, I attached a script showing calculations top to bottom so that he could see what I was doing. I’ll inquire one more time with Parker and see what turns up.
I’ve used the nlme package in R a lot and it covers many of the same issues as Diggle and uses an REML algorithm. The nlme package is written by Pinheiro and Bates and Pinheiro has coauthored with Diggle – so this is pretty familiar turf for me – and I rather enjoy the exploration. At this point, my hunch is that trend portions of random walks will routinely generate DW statistics above 2 and that the “IPCC Test” is accordingly completely ineffective at excluding random walks. If so, then there is a real problem in all the significance calculations attached to Table 3.2. However, so far, I remain unable to replicate any of the reported calculations even for trivial calculations in Table 3.2. So right now I’m merely observing that there are issues with AR4 Table 3.2 and that I’m trying to resolve them.