Granger and Newbold [1974] provided examples of spurious significance in a random walk context. This has been extended by various authors to a number of other persistent processes. Granger and Newbold suggested that the DW statistic could be used to test the autocorrelation in the residuals, giving a test that could be used in a relatively unsupervised way to check for spurious relationships. Here are some examples from cases familiar to readers: Gaspé cedars, the MBH98 NOAMER PC1, the MBH98 reconstruction and satellite temperature trends.

First, here is a graphic showing autocorrelation functions for some of the series that have interested us: the Gaspé cedar series, the NOAMER PC1, the temperature PC1 and, for comparison, the Central England series. The lesser autocorrelation in the CEngland series is quite dramatic. The autocorrelations of the Gaspé cedar series, NOAMER PC1 and temperature PC1 are all high enough that regressions involving them are in a red zone.

**Figure 1.** Selected Autocorrelation Functions.

In Spurious Significance # 2 […], I quoted the following conclusion from Granger and Newbold [1974]:

It has been well known for some time now that if one performs a regression and finds the residual series is strongly autocorrelated, then there are serious problems in interpreting the coefficients of the equation. Despite this, many papers still appear with equations having such symptoms and these equations are presented as though they have some worth. It is possible that earlier warnings have been stated insufficiently strongly.

From our own studies we would conclude that if a regression equation relating economic variables is found to have strongly autocorrelated residuals, equivalent to a low Durbin-Watson value, the only conclusion that can be reached is that the equation is mis-specified, whatever the value of R2 observed.
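The Granger and Newbold setting can be sketched with a short simulation. This is a minimal illustration, not their original experiment: it regresses pairs of independent random walks on each other and reports the average R2 and Durbin-Watson statistic. The helper names (`durbin_watson`, `ols`, `random_walk`, `simulate`) and the choices of 200 trials and 100 points are assumptions made here for illustration.

```python
import random


def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def ols(x, y):
    """OLS of y on x with an intercept; returns slope, intercept, R^2, residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sst = sum((b - my) ** 2 for b in y)
    r2 = 1.0 - sum(e * e for e in resid) / sst
    return slope, intercept, r2, resid


def random_walk(n, rng):
    """Cumulative sum of independent standard normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def simulate(trials=200, n=100, seed=0):
    """Regress one random walk on another, unrelated one; collect (R^2, DW)."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        x, y = random_walk(n, rng), random_walk(n, rng)
        _, _, r2, resid = ols(x, y)
        out.append((r2, durbin_watson(resid)))
    return out


results = simulate()
mean_r2 = sum(r for r, _ in results) / len(results)
mean_dw = sum(d for _, d in results) / len(results)
print(f"mean R^2 = {mean_r2:.2f}, mean DW = {mean_dw:.2f}")
```

Although the two series in each pair are unrelated by construction, the average R2 is far from zero while the average DW sits well below 2, which is the symptom the quotation warns about.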

Here are some practical examples.

**Gaspé Cedar Series**

In our EE article, we discussed the Gaspé cedar series at considerable length. Together with the NOAMER PC1, it is one of two hockey stick shaped series that imprint the 15th century MBH98 results. We’ve shown that this series was specifically edited by MBH to insert it into the AD1400 calculations, where it lowered 15th century values (and that this editing was not reported at the time.) We showed that the series had many quality control problems, that later versions do not have a hockey stick shape, that the proponents have refused to identify the location of the site for re-sampling etc. etc. Here we look at a much simpler question: what are the results of the DW test and should it have given rise to concerns that there was a mis-specified relationship affecting the Gaspé tree rings?

It turns out that this is a classic case of a relationship failing the Granger and Newbold criteria. The regression has a very high correlation (r = 0.59; r2 = 0.35) to the temperature PC1 (that’s why it gets very highly weighted in the regression phase of the reconstruction). However, **the DW statistic is 1.08; the p-value for this statistic is 4.925e-06,** which mandates the conclusion that the relationship is mis-specified. No econometrician would present a calculation which depended on a step in which the DW statistic was 1.08. The idea of specifically editing a data set so that a series with a DW statistic of 1.08 can be inserted into a relationship and affect final results would be incomprehensible to any modern statistician.

Some of the defences to MBH98 have been that in a multiproxy method, errors or mis-specification in individual proxies will get washed out. One of the fundamental points in both our articles is that this claim is just arm-waving and not proven; in fact, the results are highly dependent on individual series: bristlecones and the Gaspé cedar series, and mis-specifications do not get washed out.

**The MBH98 NOAMER PC1**

The second key calibration is of course between the temperature PC1 and the MBH98 North American tree ring PC1 (which is essentially the bristlecones), which is the PC4 in a centered calculation. Relative to the Gaspé series, this has a slightly lower correlation to the temperature PC1 (r = 0.46; R2 = 0.22). We discussed the relationship between bristlecone growth and temperature at length in our EE article and it appears highly probable that the relatively high correlation between the NOAMER PC1 and the temperature PC1 is spurious. In this case, the DW statistic is right at the edge of rejection: DW = 1.5668 (p-value = 0.02064). As Ross mentioned in a post yesterday, the DW statistic only measures AR1 serial correlation. Unsupervised statistics are not a magic bullet; here the DW statistic is very much in a danger zone and careful analysis of this critical relationship should have been carried out.

**MBH98 NH Temperature Reconstructions**

The DW statistics for the MBH98 temperature reconstructions are a little further away from the red zone. MBH have never reported (and still refuse to provide) a digital version of their AD1400 step. The DW that I obtained from the AD1400 result of my Wahl-Ammann run-through (see […]) was DW = 1.6335 (p-value = 0.04965). For the AD1820 MBH98 step (which is archived), the DW = 1.7468 (p-value = 0.1285).

**Satellite Temperature Trends**

I’ve posted up a graphic showing the “trend” in satellite temperatures. Formally, this “trend” is generated by a regression of the data against time. Here the DW statistic is DW = 0.4445 (p-value < 2.2e-16), clearly failing the test for autocorrelated residuals.

I’m interested in the reasons why the DW statistic goes out of the red zone going from the NOAMER PC1/Gaspé to the MBH98 reconstruction. My understanding is that the other proxies essentially add white noise as "ripples" on the wave. The addition of the ripples takes the DW statistic out of the red zone without any change in the spurious relationship. This is easy to say and not even that hard to picture once you get there. I hope to show how this type of effect is consistent with Ferson et al [2003] and Deng [2005], which follow on from Phillips [1986], which gives some motivation for this.
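The "ripples on the wave" idea can be pictured with a toy calculation. The sketch below is only an illustration under stated assumptions: the "wave" is a synthetic AR(1) series with coefficient 0.9 standing in for persistent residuals, the "ripples" are independent Gaussian noise, and DW is computed directly on the demeaned sum rather than on residuals from any actual MBH98 step. The function names are hypothetical.

```python
import random


def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def demean(x):
    m = sum(x) / len(x)
    return [v - m for v in x]


def ar1(n, rho, rng):
    """Persistent 'wave': x[t] = rho * x[t-1] + standard normal innovation."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def dw_with_ripples(noise_sd, n=2000, rho=0.9, seed=1):
    """DW of the same persistent wave after adding white noise of the given sd."""
    rng = random.Random(seed)
    wave = ar1(n, rho, rng)
    series = [w + rng.gauss(0.0, noise_sd) for w in wave]
    return durbin_watson(demean(series))


for sd in (0.0, 2.0, 5.0):
    print(f"ripple sd = {sd}: DW = {dw_with_ripples(sd):.2f}")
```

The persistent component is identical in every run; only the amount of added white noise changes, and the DW statistic climbs toward 2 as the noise grows.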

Tomorrow, I’ll review our reconstruction of NH temperature using dot.com stock prices to show another example of spurious significance – this time with RE statistics, and, in a few days, start in with Phillips [1986], who provides a theoretical framework for spurious t-statistics. Ross also pointed out that the DW test only tests for AR1 relationships in the residuals and suggested the LM test – I’ll try that on the examples shown here as well.

## 28 Comments

W.J.Burroughs [Weather Cycles Real or Imaginary] suggests that the typical values for AR1 coefficients for monthly met. series lie in the range 0.5 to 0.9 “while for annual figures, including proxy records, the values are from 0.0 to 0.3”.

The coefficients for the charts look much higher. Is this simply because I haven’t understood the impact of the MA component?

Steve,

Let me see if I understand what you’re saying about white-noise moving spurious statistics in the MBH98 reconstruction out of the red zone. If we start with just a sloped straight line it will show high autocorrelation since you can predict quite well what the next value will be after a while. Much the same would be true of a sloped sinusoidal curve. But a curve which has random noise included will have less autocorrelation since the noise masks the underlying correlation.

So it would seem that if we have a bunch of ‘proxies’ and then use principal component analysis to see how much of a training signal each explains so as to weigh them properly in reconstructing the entire proxy record over a long period, a rather smooth record may do well at explaining some or much of a training signal, but the statistical tests may show that the autocorrelation is high giving it doubtful explanatory power. But if noise is added to or naturally present in a proxy, it will not show such high autocorrelation and the underlying trend, whether real or spurious will be let through the statistical filter.

OTOH, adding noise to a signal will also reduce the explanatory power of a proxy, so a balance would have to be looked for.

I may be far off above in understanding the principles involved but if you could let me know where it’d help. I’m a big-picture type thinker and until I understand the underlying principle I have trouble following the details of a discussion.

Dave, I didn’t phrase this well. Adding the noise just moves the statistic out of the Durbin-Watson measured red zone, while the relationship may still be in a red zone. I’m going to post some more on this topic during the next week with some more concrete examples. If that doesn’t help, ask me this again.

Well, I suppose I didn’t explain myself well either. I’m trying to understand just what particular statistical tests can show about a data set in terms of high-level [meta]language. And what sort of data wouldn’t set off any alarm bells.

Obviously nobody could prevent someone from creating fake data, but nobody’s accusing anyone of doing so either. Still, knowing how someone who was intent on creating false data would go about doing it is valuable because it sheds light on whether or not incidental manipulation of data would inadvertently create problems. To guard against out and out fraud would, of course, require the ability to resample or reproduce the initial work. (This is what makes the withholding of data from the general public so dangerous. It creates doubts where there shouldn’t be any and spawns conspiracy theories, etc.)

But from a heuristic perspective, knowing how to fake something and get away with it generally both helps someone understand the process being used to study the data and makes one alert to what precautions to take.

Steve,

You may want to look into the method of surrogate data as a way of testing the significance of a time series. This method generates new random time series –from the original– that have the same autocorrelation function. You can then apply whatever test you want to both the surrogate data and the original time series. If there is no difference the original is just noise. I’ve used it many times to test if a series was generated by a deterministic nonlinear process, e.g., the weather, or noise.

Kaplan D and Glass L, 1995, Understanding Nonlinear Dynamics (New York: Springer), pp 342–6 and pp 356-8

Theiler J, Eubank S, Longtin A, Galdrikian B and Farmer J D, 1992, Testing for nonlinearity in time series: the method of surrogate data, Physica D 58 77–94
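One common variant of the surrogate data idea is phase randomization: keep the Fourier amplitudes of the series, scramble the phases, and transform back. The sketch below is a minimal standard-library illustration of that variant, not code from the cited references; the function names are assumptions, and the naive O(n²) DFT is only suitable for short series.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (fine for short series)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(X):
    """Inverse DFT, returning the real part of each value."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def phase_randomized_surrogate(x, rng):
    """Same Fourier amplitudes as x, random phases, conjugate symmetry kept."""
    n = len(x)
    X = dft(x)
    Y = [0j] * n
    Y[0] = X[0]                       # keep the mean unchanged
    for k in range(1, (n + 1) // 2):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        Y[k] = abs(X[k]) * cmath.exp(1j * phase)
        Y[n - k] = Y[k].conjugate()   # ensures a real-valued result
    if n % 2 == 0:
        Y[n // 2] = X[n // 2]         # Nyquist term stays as it was
    return idft_real(Y)


def circular_autocorr(x, lag):
    """Lag autocorrelation with wrap-around, which the surrogate preserves."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    return sum(d[t] * d[(t + lag) % n] for t in range(n)) / sum(v * v for v in d)


rng = random.Random(7)
level, series = 0.0, []
for _ in range(64):                   # a short red-noise example series
    level = 0.7 * level + rng.gauss(0.0, 1.0)
    series.append(level)

surrogate = phase_randomized_surrogate(series, rng)
print(circular_autocorr(series, 1), circular_autocorr(surrogate, 1))
```

Any test statistic can then be computed on the original and on many such surrogates; if the original does not stand out from the surrogate distribution, the statistic is not telling you more than the autocorrelation structure already does.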

I agree entirely. I do a lot of red noise simulations. In our GRL paper, I used a method from Brandon Whitcher based on wavelets. For quick looks, I use arima models and the R function arima.sim. There was nothing particularly elevated conceptually about our application of these methods to paleoclimate, but it doesn’t seem to have been done much in the field.

Re #1, Chas. The two proxy series illustrated here are among the most highly autocorrelated in the entire corpus – hence they should be treated with particular suspicion. It looks to me like ocean gridcells have higher AR coefficients. Also if you model the process as ARMA(1,1) as I’ve been doing here, the AR1 coefficients are quite a bit higher with highly significant MA1 coefficients (this is on monthly data). The aggregates seem to be more highly autocorrelated than the individual gridcells. But I don’t claim to know the temperature gridcell data inside-out.

Once again, I’d caution that autocorrelation in a record does not mean that the TREND derived by linear regression should be suspect (i.e. autocorrelation does not bias the estimated trend). It just means that it is POSSIBLE that the UNCERTAINTY may have been underestimated if it was not calculated correctly (i.e. if the number of degrees of freedom was based on the number of data points rather than on the number of effectively independent input values, which is of order the number of data points divided by the “integer” autocorrelation scale).
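Both halves of this claim can be checked with a small Monte Carlo. The sketch below is an illustration under assumptions chosen here (a true slope of 0.01, 100 points, AR(1) noise, 500 trials, hypothetical function names): it compares the actual spread of fitted slopes across trials with the textbook OLS standard error that assumes independent residuals.

```python
import math
import random


def fit_trend(y):
    """OLS of y on t = 0..n-1; returns the slope and its textbook standard error."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    stt = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y)) / stt
    intercept = y_mean - slope * t_mean
    sse = sum((v - intercept - slope * t) ** 2 for t, v in enumerate(y))
    se = math.sqrt(sse / (n - 2) / stt)   # assumes independent residuals
    return slope, se


def monte_carlo(rho, true_slope=0.01, n=100, trials=500, seed=2):
    """Returns (mean fitted slope, spread of fitted slopes, mean textbook SE)."""
    rng = random.Random(seed)
    slopes, ses = [], []
    for _ in range(trials):
        e, y = 0.0, []
        for t in range(n):
            e = rho * e + rng.gauss(0.0, 1.0)   # AR(1) noise
            y.append(true_slope * t + e)
        s, se = fit_trend(y)
        slopes.append(s)
        ses.append(se)
    mean_slope = sum(slopes) / trials
    spread = math.sqrt(sum((s - mean_slope) ** 2 for s in slopes) / (trials - 1))
    return mean_slope, spread, sum(ses) / trials


for rho in (0.0, 0.8):
    m, spread, se = monte_carlo(rho)
    print(f"rho={rho}: mean slope={m:.4f}, actual spread={spread:.4f}, textbook SE={se:.4f}")
```

In this toy setup the average fitted slope stays close to the true value in both cases, but with autocorrelated noise the slopes scatter much more widely than the textbook standard error suggests.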

Re #8 John,

I’m no expert on this subject, but it would seem to me that if the number of degrees of freedom were reduced due to autocorrelation this would be equivalent to the length of the time series being shorter. Therefore the chances that the trend was spurious would be increased correspondingly. Now this may be just what you mean by uncertainty, but I think the average layman would take an increased uncertainty to mean that there was a trend whose value just isn’t certain, rather than that the whole trend may just be a section of a random variation.

Actually, I remember reading several years ago about a way to try and identify falsified data. It had to do with the number of appearances of digits 0-9 vs how often they tend to be distributed “naturally.” I recall there were a few preliminary cases of success, but I’m not sure how far it went.

Re: 10.

Doesn’t Freakonomics have a chapter on catching a teacher falsifying grades?

Michael,

I believe that in natural series 1 is most common, since we need to realize that many series have a maximum value, for example house numbers. But compare 1-999 vs 1000-1999. Same number of values (actually 1 more for the second). So if we look at likelihood of termination, we’re more likely to find one with a low first digit.

However we’re unlikely to find anyone just making up numbers. More likely they’d add random noise or a trend or something like that and presumably this wouldn’t show up in such a test.
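The house-number argument can be tried out numerically. The sketch below is a toy: it assumes street lengths are drawn log-uniformly between 1 and 10,000 houses (an assumption made here, not something from the comment) and counts leading digits over all house numbers. For comparison it also lists the leading-digit probabilities log10(1 + 1/d) usually associated with this kind of first-digit pattern; the toy is not claimed to reproduce them exactly.

```python
import math
import random


def benford_expected():
    """Leading-digit probabilities log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1.0 + 1.0 / d) for d in range(1, 10)}


def house_number_digit_counts(n_streets=500, seed=3):
    """Count leading digits of house numbers 1..last on streets of varying length."""
    rng = random.Random(seed)
    counts = {d: 0 for d in range(1, 10)}
    for _ in range(n_streets):
        # Street length is log-uniform between 1 and 10**4 (an assumption).
        last_house = int(10 ** rng.uniform(0.0, 4.0))
        for number in range(1, last_house + 1):
            counts[int(str(number)[0])] += 1
    return counts


counts = house_number_digit_counts()
total = sum(counts.values())
expected = benford_expected()
for d in range(1, 10):
    print(d, f"simulated {counts[d] / total:.3f}", f"log10(1+1/d) {expected[d]:.3f}")
```

Because every street starts numbering at 1, low leading digits can never be rarer than high ones in this setup, which is the point being made about series that terminate at some maximum value.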

John H., to take a example to get some practical agreement before considering the trend issue (which I’m working on), would you agree that the Gaspe calibration fits Granger and Newbold criteria for rejection?

Re#12,

It’s possible they only looked at values after the decimal point, omitted the first digit, etc. It certainly could be beaten, but I guess fraudsters can get sloppy and lazy, too, as many criminals do.

After that comes duplicability by other researchers…

Steve, I ran the MSU data with a dummy variable for the 1998 El Nino (I used calendar year 1998). This cut the estimated temperature trend in half and resulted in an estimate of El Nino of 0.4 degrees, which is close to the 0.5 that was estimated by the WMO and others. As far as auto-correlation, the El Nino dummy didn’t remove much if any of that.

Re #10:

I believe you’re referring to Benford’s Law, which describes the distribution of lead digits in many collections of data. It has been used to detect fraud. There’s a nice article in New Scientist (fee required to read the entire thing).

Would be helpful if you cited more about the use of the DW technique in the literature and in actual practice. Not the ref on the practice itself, but some papers that show it being used. Preferably, not from econometrics.

Steve (#18): I sent a rather long posting #18, which has got almost completely chopped — can you retrieve it or is there some other way I can send it?

John H., it’s not here. Try again. If no luck, email it to me and I’ll post it for you. I’m going to post something in a little more detail about trends in a few days. I tried to find Ostrom, but it wasn’t in the downtown U of Toronto library. There’s a copy in one of the satellite campuses, but it’s an expedition to find them. Is there a one-paragraph quote from Ostrom that covers the point that you relied on? Putting a top limit of 0.3 on the autocorrelation coefficient seems completely unjustified to me. There’s lots of evidence of coefficients >0.9.

I looked at the Emery and Thomson text as well, especially section 3.15.1 Trend estimates and the integral time scale. They don’t provide any references for the integral time scale technique. I don’t claim to have encyclopaedic knowledge of statistical literature, but I haven’t run into this technique in general statistical literature. It’s possible that it does something similar to what’s done in HAC (heteroskedasticity and autocorrelation consistent) estimation of covariance matrices in economics, but I’m not sure. It’s an interesting topic to think about, which I’m doing. It would be nice to see an actual statistical discussion of the integral time scale technique – maybe I’ll email the authors and inquire as to their source.

I have a quite long posting in the pipeline (via email to Steve, since “Submit Comment” can’t handle it). However, there is another way of looking at the “Durbin-Watson / autocorrelation” issue. Suppose that I have a set of observational data, I fit a trend to it and the residuals show no autocorrelation according to the Durbin-Watson statistic (i.e. it is around 2). Now, contrary to the ideas in most of this thread, THE INDICATED ABSENCE OF ANY AUTOCORRELATION WOULD SOUND ALARM BELLS FOR ME: it would strongly suggest that the data had been UNDERSAMPLED in time. I would ask the question: what useful information have I missed (in between my data points) by sampling at this frequency?

It is a bit like attempting to estimate the trend in global average temperature by only looking at data observed on every Christmas Day: the residuals could well appear uncorrelated using the Durbin-Watson test, but the trend would ONLY represent the trend in “Christmas Day” global average temperature. If I sampled more frequently (e.g. every month), then I would learn about the seasonal cycle and be able to estimate better the trend in global average temperature (for example, it is quite likely that the trend in July temperatures is different from the trend in December temperatures). However, the Durbin-Watson statistic would now indicate autocorrelation of the residuals, which, as I have pointed out before and will point out again, does NOT necessarily disqualify the trend. It just means that I have to do things a bit differently (e.g. remove both a trend AND the seasonal cycle in the regression) and/or estimate the uncertainty in the trend appropriately.

(AND STEVE — CAN YOU DELETE #18? – IT IS JUST THE BEGINNING OF MY “LOST” POSTING)

seasonality has to be addressed

Steve (#20): You say:

“Putting a top limit of 0.3 on the autocorrelation coefficient seems completely unjustified to me. There’s lots of evidence of coefficients >0.9.”

I’m not sure what you mean — we are trying to find a threshold for the autocorrelation coefficient below which we are confident that the residuals are uncorrelated (or at least that any autocorrelation has little effect on the results of a linear regression). If you picked a critical value of 0.9, many highly correlated series of residuals would appear uncorrelated. Are you suggesting that we should view a series with an autocorrelation coefficient of 0.9 as NOT autocorrelated?

Re #23: I see what your point is now. (I haven’t located Ostrom, so I didn’t understand your argument.) I take it that the point is that, if the AR1 coefficient of the residuals is

John Hunter sends the following which I am posting for him (John H, There are points that I disagree with, but many thanks for the thoughtful post):

Steve (#13): I actually disagree with the Granger and Newbold criterion that you quote (“It has been well known for ….. whatever the value of R2 observed.”). I also believe that the following statement of yours is misleading:

“I’ve posted up a graphic showing the “trend” in satellite temperatures. Formally, this “trend” is generated by a regression of the data against time. Here the DW statistic is DW = 0.4445 (p-value < 2.2e-16), clearly failing the test for autocorrelated residuals."

since it implies that the trend should be questioned just because the DW statistic indicates autocorrelation of the residuals. I would rather not spend time getting into details such as the "Gaspe calibration", but here are my general beliefs on Durbin-Watson and regression.

Firstly, the Durbin-Watson test is a very stringent test for autocorrelation. It detects "autocorrelation" even if only adjacent values are correlated. Consider the following example. Generate a series of 100 independent random numbers from a statistical distribution of standard deviation S. The series may be symbolised by "A B C D ….." where each letter represents a random number (you'll have to make up some new symbols after "Z" to get to 100 numbers). The Durbin-Watson statistic for this series is close to 2, from which you correctly deduce that the data is not autocorrelated (at least for adjacent values). Now introduce intermediate points into the series, each equal to the preceding one, so the series becomes "AABBCCDD ….". The Durbin-Watson statistic for this second series is close to 1, so it completely fails the "no correlation" test. However, the second series contains JUST AS MUCH INFORMATION as the first (100 independent numbers). Linear statistics such as the mean or trend are essentially THE SAME for both series. The only thing that is different is the way that you estimate the uncertainty. For example, the standard error (i.e. the standard deviation of the mean) of the first series is S/sqrt(100), since we know the values are independent. However, for the second series, we may be tempted to think that standard error is S/sqrt(200), since there are 200 values in the series. This would be wrong, as there are only 100 INDEPENDENT values in the series (i.e. the number of degrees of freedom), so the standard error is actually S/sqrt(100), the same as for the first series. So there is nothing inherently WRONG in the second series, even though it fails the Durbin-Watson test for independence. It is in fact just as "valuable" as the first series because it contains exactly the same data (if you plotted the two series, they would be almost indistinguishable). You just have to be more careful estimating the uncertainty in statistics (e.g. the mean or trend) derived from the second series.
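The "A B C D" versus "AABBCCDD" example is easy to reproduce. The following is a minimal sketch with hypothetical function names; it computes DW on the demeaned series, the mean of each series, and the standard error you would get by naively dividing by the square root of the number of values.

```python
import math
import random
import statistics


def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def demean(x):
    m = sum(x) / len(x)
    return [v - m for v in x]


def duplicate_each(x):
    """Turn A B C ... into A A B B C C ..."""
    return [v for v in x for _ in range(2)]


def compare(n=100, seed=4):
    rng = random.Random(seed)
    first = [rng.gauss(0.0, 1.0) for _ in range(n)]
    second = duplicate_each(first)
    return {
        "dw_first": durbin_watson(demean(first)),
        "dw_second": durbin_watson(demean(second)),
        "mean_first": sum(first) / len(first),
        "mean_second": sum(second) / len(second),
        # Standard error of the mean for the independent series.
        "se_first": statistics.stdev(first) / math.sqrt(len(first)),
        # Naive (too small) standard error that treats all 2n values as independent.
        "naive_se_second": statistics.stdev(second) / math.sqrt(len(second)),
    }


for key, value in compare().items():
    print(f"{key}: {value:.4f}")
```

The duplicated series has the same mean and a DW statistic of half the original, while the naive standard error shrinks by roughly a factor of sqrt(2) even though no information has been added.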

Secondly, the Durbin-Watson test only considers the NOISE (e.g. the residual of a linear regression); it "cares" nothing about the SIGNAL (e.g. any underlying trend). The signal-to-noise-ratio is all important here. Consider a signal (a linear trend) plus noise made up of 100 values defined by i + r(i), where i goes from 1 to 100 and r(i) is a set of 100 random numbers with a standard deviation of unity. The first value is therefore 1 + r(1) and the last value is 100 + r(100), and each r is given roughly by r(i) = +/- 1. Plot this series and it quite clearly looks like a straight line with unit slope and a tiny amount of noise. Do a linear regression on this data and you will find a slope close to unity. The Durbin-Watson statistic for the residuals will be close to 2, indicating no autocorrelation. Now generate a second series, which is similar to the first except that you reject values with even "i" (i.e. r(2), r(4), r(6) etc), replacing them with r(2)=r(1), r(4)=r(3), r(6)=r(5) etc. Again, if you plot this, it will look like a straight line of unit slope with a tiny amount of noise. If you do a linear regression, you will obtain a unit slope. However, this time the Durbin-Watson statistic will be close to 1, which indicates autocorrelation. Do you now reject the estimated slope as suspect, just because the Durbin-Watson test has shown autocorrelation? Of course not: as above, you just have to be a bit more careful estimating the uncertainty of that slope.
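The trend-plus-noise example can be run directly. This is a minimal sketch of the construction described above (i + r(i), then the same with r(2)=r(1), r(4)=r(3), and so on); the function names and the seed are assumptions made for illustration.

```python
import random


def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def trend_and_dw(y):
    """Regress y on i = 1..n by OLS; return (slope, DW of the residuals)."""
    n = len(y)
    i_vals = range(1, n + 1)
    i_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - i_mean) ** 2 for i in i_vals)
    sxy = sum((i - i_mean) * (v - y_mean) for i, v in zip(i_vals, y))
    slope = sxy / sxx
    intercept = y_mean - slope * i_mean
    resid = [v - (intercept + slope * i) for i, v in zip(i_vals, y)]
    return slope, durbin_watson(resid)


def experiment(n=100, seed=5):
    """Returns ((slope, DW) for independent noise, (slope, DW) for paired noise)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    paired = noise[:]
    for k in range(1, n, 2):        # list index k holds r(k+1), so odd k means even i
        paired[k] = paired[k - 1]   # r(2)=r(1), r(4)=r(3), ...
    plain_series = [i + r for i, r in zip(range(1, n + 1), noise)]
    paired_series = [i + r for i, r in zip(range(1, n + 1), paired)]
    return trend_and_dw(plain_series), trend_and_dw(paired_series)


(plain_slope, plain_dw), (paired_slope, paired_dw) = experiment()
print(f"independent noise: slope={plain_slope:.3f}, DW={plain_dw:.2f}")
print(f"paired noise:      slope={paired_slope:.3f}, DW={paired_dw:.2f}")
```

Both fits recover a slope close to 1; only the DW statistic of the residuals changes between the two series.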

Finally, the significance of a linear regression depends mainly on six things:

a. the actual trend in the data (for which you have only an estimate), T,

b. the number of data points, N,

c. the autocorrelation length of the residuals, C,

d. the uncertainty in each data point (which you may or may not know), DP,

e. the standard deviation of the residuals, DR, and

f. the length of the record, L.

If you don't know (DP), you generally assume it is the same as (DR), which means that you cannot judge whether a fit to a straight line is a good fit to the data. If you DO know (DP), then you can tell that a "straight line fit" is a good model if (DP) is the same order as (DR). If (DP) is significantly less than (DR), then you know a straight line is not a good fit (for example, a parabola may be a better fit).

So let's consider only the case where a straight line IS a good fit. The uncertainty in the trend is of order DR/(L x sqrt(N)) if there is no autocorrelation, and of order DR/(L x sqrt(L/C)) = DR/sqrt(L^3 / C) if there is correlation (since autocorrelation essentially reduces the number of independent points to L/C). (Please note the "of order" caveat here — I'm omitting lots of constants (which are of order 1) just to give you a feel for the problem.) You can derive this result simply by realising that the result of a linear regression is quite similar to what you get by taking the mean of the first half of the record and the mean of the second half, and dividing the difference of these means by the record length (again, I'm ignoring any annoying constants). As the autocorrelation is reduced, then C approaches the time interval and L/C becomes N.

The Durbin-Watson test only provides a warning that C is larger than L/N (the sampling interval) and that you need to estimate the uncertainty in the trend from DR/sqrt(L^3 / C) rather than from DR/(L x sqrt(N)) (which is the case for uncorrelated residuals). The test CERTAINLY doesn't suggest that you should not use the data for trend estimation.

To anyone who is interested in autocorrelation and the Durbin-Watson test:

Make a synthetic seasonal cycle from a sine wave of period one year and sample it every “season” (i.e. once every three months) starting anywhere you like in the series — a pretty reasonable thing to do — I expect there are countless observational records with 3-month sampling increments. Now, calculate the Durbin-Watson statistic for this data — it comes out to be close to 2, indicating independent non-autocorrelated data! Similarly, the lag-1 correlation is zero, again indicating independent values! Now, estimate the uncertainty in the mean, assuming that the data is uncorrelated — i.e. since the data is apparently independent, you would apparently use (amplitude of sine wave)/sqrt(2 x number of values in series), which is WRONG again — the actual uncertainty is around (amplitude of sine wave)/(number of values in series), which is very different.

So what can we say about Durbin-Watson: that it is flawed? That an “audit” has shown that it can give completely incorrect results?

No — I think it just shows that nothing in statistics is simple, that there is no “right” way to do statistics and that most statistical tests make a good few assumptions, some of which may be quite false — in this case that the correlation between adjacent points is a good indicator of the overall serial correlation.
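The quarterly-sampled sine wave is simple enough to check numerically. The sketch below is an illustration with assumed values (unit amplitude, an arbitrary starting phase of 0.3 radians, hypothetical function names). It reports the DW statistic, the lag-1 autocorrelation, the actual mean of the samples, and the standard error you would compute if you treated the samples as independent.

```python
import math
import statistics


def quarterly_sine(n, amplitude=1.0, phase=0.3):
    """Sine wave of period one year sampled every three months (4 samples per cycle)."""
    return [amplitude * math.sin(2.0 * math.pi * t / 4.0 + phase) for t in range(n)]


def demean(x):
    m = sum(x) / len(x)
    return [v - m for v in x]


def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def lag1_autocorr(x):
    d = demean(x)
    return sum(d[t] * d[t + 1] for t in range(len(d) - 1)) / sum(v * v for v in d)


def naive_standard_error(x):
    """Standard error of the mean if the samples were independent."""
    return statistics.stdev(x) / math.sqrt(len(x))


x = quarterly_sine(102)   # 25 full years plus two extra quarters
print(f"DW = {durbin_watson(demean(x)):.3f}")
print(f"lag-1 autocorrelation = {lag1_autocorr(x):.3f}")
print(f"sample mean = {sum(x) / len(x):.4f}")
print(f"naive standard error = {naive_standard_error(x):.4f}")
```

The series is completely deterministic, yet DW is close to 2 and the lag-1 correlation is close to zero. The sample mean differs from the true mean of zero by an amount of order amplitude divided by the number of values, far less than the naive standard error implies.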

WRT “no right way”. Surely some ways are demonstrably wrong in any case. Whereas others may just be not provably true to the nth degree…

Unless…you believe in Stanley Fish…or the other postmodernists…

re 25

Take a long series of consecutive days on which the temperature is measured at hourly intervals. Estimate the area under the daily curve and correlate it with the max daily temperature, the min daily or the difference between them. Do you get significant correlations? This is a quick test as to whether max/min temps are a suitable proxy for heat flow, which is the more important parameter.

The maths of correlation is way beyond me now, but I used to do a fair bit in mining scenarios. I keep coming back to Geostatistics, a certain type of mathematics developed by M.David at Fontainebleau in France. In a typical application, one has an ore deposit with a few expensive drill holes through it, which have been assayed every metre. The holes are X m apart. How small does X have to be before the values in hole A can be used to predict values in hole B and so allow interpolation using a search ellipsoid whose size is determined by the geostatistical analysis?

This is the type of calculation I feel is important in relating far-spaced data such as climate data. There are also ways to forward project time-dependent data streams like temps at a single locality, and smoothing methods which can have advantages over weighted rolling means.

I have even extended it recreationally to correlation analysis – examples such as sunspot activity and the annual yield of tomatoes in California, substituting these arcane cases for drill holes A and B. Correlation analysis is possible when one parameter leads or lags the other.

I keep coming in from way out left field because I’m not a climatologist or a good mathematician. But I have seen good mathematics put to good effect when performed by other capable people.

#27. Geoff, I have experience with ore reserve calculations. Mining engineers have developed methods of spatial averaging and have learned to not be over-influenced by high outliers in ore reserve calculations. They cut the area of influence of high-grade holes. My own experience was that for vein ore bodies, high-grade holes tended to be even more localized than shown in usual polygon interpolations and that low-grade holes were less localized than the polygonal – so that usual polygonal interpolation tended to over-estimate the mine reserves.

Mann’s technique did exactly the opposite. The bristlecones were his equivalent of a high-grade hole. Instead of cutting the area of influence like a prudent engineer, his method over-weighted it. Mining promoters would love to promote ore reserves calculated with the equivalent of Mannian principal components. Mining promoters try to use high-grade outliers to run the stock – Myles Allen’s press release is very much what mining promoters like to do. Mining promoters are prohibited from doing a press release like Myles Allen’s by law, but obviously do whatever they can to promote their stocks.

The mining perspective on spatial averaging very much influenced my examination of Mann’s data – where the bristlecones do what an outlier high-grade hole does for a mining promotion.