Remember the argument of Gavin and Rasmus that you shouldn’t use empirical temperature data to calibrate ARIMA models to provide null distributions, which they used against Cohn and Lins. I’m not sure that the argument is valid, but if it is, then they should not have selectively used it against Cohn and Lins, since Mann does exactly the same thing in MBH98.

Here’s what Rasmus said in his post:

When ARIMA-type models are calibrated on empirical data to provide a null-distribution which is used to test the same data, then the design of the test is likely to be seriously flawed. To re-iterate, since the question is whether the observed trend is significant or not, we cannot derive a null-distribution using statistical models trained on the same data that contain the trend we want to assess.

He re-iterated the point in a number of comments as follows:

I think that the use or ARIMA-type models, calibrated on the trended series itself, will not provide a good test for the null-hypothesis, since you a priori do not know whether the process you examine is a ‘null-process’. The risk is that the test is already biased by using the empirical data both for testing as well as tuning the statistical models used to represent the null-process…. ARIMA-type models do not contain any physics, one never knows if the ARIMA-type models really are representative or just seeming to be so, but I agree that they are convenient tools when we have nothing else. The question is not about using ARIMA-type models or not, but what conclusions you really can infer from them. link

I do not believe that statistical models are appropriate because i: they are used to test a null-hypothesis where no antropogenic forcing (of just solar volcanoes) is assumed, ii) they are trained on empirical data subject to forcings (be it anthopogenic as well as solar/volcanic). link

If you use ARIMA-type models tuned to mimic the past, then the effect of changes in forcing is part of the null-process. link

Gavin chipped in with a very similar comment:

The ‘problem’ such as it is with Cohn and Lins conclusions (not their methodology) is the idea that you can derive the LTP behaviour of the unforced system purely from the data. This is not the case, since the observed data clearly contain signals of both natural and anthropogenic forcings. Those forcings inter alia impart LTP into the data. The models’ attribution of the trends to the forcings depends not on the observed LTP, but on the ‘background’ LTP (in the unforced system). Rasmus’ point is that the best estimate of that is probably from a physically-based model – which nonetheless needs to be validated. That validation can come from comparing the LTP behaviour in models with forcing and the observations. Judging from preliminary analyses of the IPCC AR4 models (Stone et al, op cit), the data and models seem to have similar power law behaviour, but obviously more work is needed to assess that in greater detail. What is not a good idea is to use the observed data (with the 20th Century trends) to estimate the natural LTP and then calculate the likelhood of the observed trend with the null hypothesis of that LTP structure. This ‘purely statistical’ approach is somewhat circular . link .
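The circularity worry in the comments quoted above can be illustrated with a small numerical sketch. It is not Cohn and Lins' method (they work with long-term persistence models), and every number in it (`n`, `phi_true`, `slope`, the seed) is an arbitrary assumption chosen for illustration. It only shows that a lag-one autocorrelation estimated from a series containing a trend tends to come out larger than the autocorrelation of the underlying noise, so a null model tuned on the trended series absorbs part of the trend.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-one autocorrelation coefficient of a 1-D series."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return float(np.dot(d[:-1], d[1:]) / np.dot(d, d))

def ar1_series(n, phi, rng):
    """Simulate AR(1) noise x[t] = phi * x[t-1] + e[t] with unit-variance shocks."""
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0] / np.sqrt(1.0 - phi**2)  # start from the stationary distribution
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

# Illustrative values only; none of these come from the papers discussed.
rng = np.random.default_rng(0)
n, phi_true, slope = 150, 0.3, 0.03

noise = ar1_series(n, phi_true, rng)    # the "unforced" process
trended = noise + slope * np.arange(n)  # same noise plus a linear trend

phi_noise = lag1_autocorr(noise)
phi_trended = lag1_autocorr(trended)
print(f"lag-1 autocorrelation, noise only:  {phi_noise:.2f}")
print(f"lag-1 autocorrelation, with trend:  {phi_trended:.2f}")
```

Whether that inflation matters in practice depends on the size of the trend relative to the noise, which is exactly what is in dispute.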

Amusingly, Mann in MBH98 used an ARMA model (AR1) to do exactly what Rasmus and Gavin object to in Cohn and Lins. Here’s how they described their calculation of null distributions: they estimated AR1 coefficients (an ARMA (1,0) model) and applied these for estimating null distributions.

We test the significance of the correlation coefficients (r) relative to a null hypothesis of random correlation arising from natural climate variability, taking into account the reduced degrees of freedom in the correlations owing to substantial trends and low frequency variability in the NH series. The reduced degrees of freedom are modelled in terms of first-order markovian ‘red noise’ correlation structure of the data series, described by the lag-one autocorrelation coefficient ρ during a 200-year window… We use Monte Carlo simulations to estimate the likelihood of chance spurious correlations of such serially correlated noise with each of the three actual forcing series…

Significance levels for β were estimated by Monte Carlo simulations, also taking serial correlation into account. Serial correlation is assumed to follow from the null model of AR(1) red noise, and degrees of freedom are estimated based on the lag-one autocorrelation coefficients (r) for the two series being compared.
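For readers who want to see the shape of such a calculation, here is a minimal sketch of a Monte Carlo significance test with an AR(1) red-noise null. It is not MBH98's code: the function names, the use of the whole series rather than a 200-year window, the two-sided p-value, and the synthetic inputs are all assumptions made for illustration. The point to notice is that `phi` is estimated from the same series whose correlation is then tested.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-one autocorrelation coefficient of a 1-D series."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return float(np.dot(d[:-1], d[1:]) / np.dot(d, d))

def ar1_surrogates(n, phi, n_sims, rng):
    """Return an (n_sims, n) array of AR(1) red-noise series with coefficient phi."""
    e = rng.standard_normal((n_sims, n))
    x = np.empty((n_sims, n))
    x[:, 0] = e[:, 0] / np.sqrt(1.0 - phi**2)  # stationary starting values
    for t in range(1, n):
        x[:, t] = phi * x[:, t - 1] + e[:, t]
    return x

def mc_corr_pvalue(series, forcing, n_sims=2000, seed=0):
    """Two-sided Monte Carlo p-value for corr(series, forcing) under an AR(1) null."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    forcing = np.asarray(forcing, dtype=float)
    phi = lag1_autocorr(series)  # null model calibrated on the tested data itself
    r_obs = float(np.corrcoef(series, forcing)[0, 1])
    sims = ar1_surrogates(len(series), phi, n_sims, rng)
    f = forcing - forcing.mean()
    s = sims - sims.mean(axis=1, keepdims=True)
    # Correlation of each surrogate series with the fixed forcing series.
    r_null = (s @ f) / np.sqrt((s**2).sum(axis=1) * (f @ f))
    p = float(np.mean(np.abs(r_null) >= abs(r_obs)))
    return r_obs, phi, p

# Synthetic stand-ins for a temperature series and a forcing series.
rng = np.random.default_rng(1)
n = 200
forcing = np.linspace(0.0, 1.0, n)
series = rng.standard_normal(n)
r_obs, phi, p = mc_corr_pvalue(series, forcing)
print(f"r = {r_obs:.2f}, fitted phi = {phi:.2f}, Monte Carlo p = {p:.3f}")
```

Swapping the AR(1) generator for a longer-memory process would change the null distribution but not the structure of the test.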

I’m not saying that the method is or isn’t any good – only that, if using the empirical data to estimate a dull distribution is no good for Cohn and Lins, it’s no good for MBH98 (and there’s no a priori reason why an AR1 model is OK and a more sophisticated model is not). As so often, the Hockey Team is not merely sucking and blowing, but they are sucking and blowing out of every major orifice simultaneously – sort of a one-mann band.

## 7 Comments

dull distribution :-)

The data used to determine the null hypothesis CANNOT be the data under study.

Hans :-) Precisely

I haven’t followed the RC dispute with Cohn and Lins, but it is true that you can not rely on a method where the process that generates the benchmark data pre-supposes the falsehood of the null hypothesis you want to test. You’ll be comparing your observations to the wrong distribution. For example, if you generate random numbers that follow a log-normal distribution, then you can’t use the distribution of that series to test a null hypothesis that your observed data follow a standard normal distribution. You will be biased towards rejecting the null. You can only test a null hypothesis of a log-normal distribution.

But it’s perfectly legit to go the other way, by generating critical values under the null hypothesis you want to test. If you use ARFIMA to produce critical values under the null of a trendless, unforced climate, and the observed data fail to reject the null hypothesis, that’s a legitimate inference. It sounds like this is what Cohn and Lins are doing.

Also, once you introduce persistence models, there are qualitative changes in the character of the model that can be tested. If you have a theory that predicts a null hypothesis of persistency under greenhouse forcing, and the data exhibit antipersistency, you can reject the null, and any other null that also implies persistency. This is what Karner did.

In the case of MBH98, they fall prey to the first problem. They built an artificial hockey stick into the observed data via their PC glitch, then generated artificial noise data to benchmark the RE stat on the assumption that there is no artificial hockey stick, leading them to a spurious rejection of the null of no hockey stick in the observed data. That, basically, was the central point in our GRL paper (and our exchange with Huybers).

I am trying to understand what Ross and Lubos said, and my question for them is whether this is correct:

It is true that a purely time-series-based model of, say, temperature cannot answer questions about AGW. It can only tell us the nature of the temperature fluctuations, and whether they exhibit persistence and other characteristics. To address AGW you need a structural, physics-based model. However, time series methods can be used to test a model such as MBH98 and to reject it if it fails certain tests.

To my mind what is happening is that MBH have a model that, among other issues, has autocorrelation problems. This is because it is misspecified. For example, maybe solar activity is causing the autocorrelation, and the addition of variables for solar activity would eliminate the autocorrelation and provide a better model of temperature changes.

So the ideal is to have a correctly specified model, and ARIMA is a tool to check models for their validity.

re: #6 John,

To continue from your point, if GCMs are properly physics based, then extrapolating backwards from the point where CO2 emissions became important, we should see the models continue to match the historical record. But one need only look at the spaghetti diagrams to see that there’s no consensus among the GCMs as to what this historical record should show. And even that’s after they’re allowed to tweak tons of parameters to keep the models within reason.

Therefore, since there’s no consensus about what the more distant past was like, but a good consensus on what the near present was like, why should we expect this consensus to be what we’ll see in the future?