Santer et al 2008 (including realclimate’s Gavin Schmidt) sharply criticized Douglass et al for failing to properly consider the effect of autocorrelation of regression residuals on trend confidence intervals, which they described as a “methodological error”. The need to properly account for autocorrelation in confidence interval estimation is a fairly long-standing theme at CA and it’s nice to see Gavin Schmidt come down so firmly for this view.
Now that the Santer 17 have spoken on this matter, I thought that it would be interesting to re-visit the Mann et al 2008 proxies, applying the Nychka adjustment to the regressions in Mann et al 2008.
Mann et al argued that their proxies rose above mere scavenging through a garbage bin of red noise series as follows:
Although 484 (~40%) pass the temperature screening process over the full (1850–1995) calibration interval, one would expect that no more than ~150 (13%) of the proxy series would pass the screening procedure described above by chance alone. This observation indicates that selection bias, although potentially problematic when employing screened predictors (see e.g. Schneider (5); note, though, that in their reply, Hegerl et al. (10) contest that this is actually an issue in the context of their own study), does not appear a significant problem in our case.
On an earlier occasion, we observed that Mann used a unique pick-two option for these correlations – but did not modify the benchmark. Today, I’ll take a first quick look at how he handled autocorrelation, merely applying Santer-Nychka methods.
Here’s how Mann described their test:
We assumed n ~ 144 nominal degrees of freedom over the 1850–1995 (146-year) interval for correlations between annually resolved records (although see discussion below about influence of temporal autocorrelation at the annual time scales), and n ~ 13 degrees of freedom for decadal resolution records. The corresponding one-sided P ~ 0.10 significance thresholds are |r| ~ 0.11 and |r| ~ 0.34, respectively.
Owing to reduced degrees of freedom arising from modest temporal autocorrelation, the effective P value for annual screening is slightly higher (P ~ 0.128) than the nominal (P ~ 0.10) value. For the decadally resolved proxies, the effect is negligible because the decadal time scale of the smoothing is long compared with the intrinsic autocorrelation time scales of the data.
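For readers who want to see where thresholds of this size come from, here is a minimal sketch that inverts the usual relation t = r·sqrt(n)/sqrt(1 − r²) to get a critical |r| for a one-sided test. It is not Mann's code. The function name `r_threshold` is mine, and I use a normal approximation to the t distribution to stay within the Python standard library. The results land close to the quoted ~0.11 and ~0.34, but the exact figures depend on whether n is read as a sample size or as degrees of freedom and on using the t distribution proper.

```python
from math import sqrt
from statistics import NormalDist


def r_threshold(n, p_one_sided=0.10):
    """Approximate critical |r| for a one-sided test at level p_one_sided.

    Assumption of this sketch: the t quantile is replaced by the normal
    quantile z, and n is used directly as the degrees of freedom.
    Solving z = r * sqrt(n) / sqrt(1 - r^2) for r gives z / sqrt(n + z^2).
    """
    z = NormalDist().inv_cdf(1.0 - p_one_sided)
    return z / sqrt(n + z * z)


# n values quoted by Mann et al for annual and decadal records
for label, n in (("annual", 144), ("decadal", 13)):
    print(f"{label:8s} n={n:4d}  critical |r| ~ {r_threshold(n):.3f}")
```

The point to notice is that the threshold depends only on n: the fewer the effective degrees of freedom, the higher the correlation needed to pass.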
In order to sustain this argument, Mann needs to demonstrate that the temporal autocorrelation is “modest”. But is this actually true? “Modest”, in the context of the reduction of degrees of freedom in the above calculation, is AR1 of about 0.2 (I need to double check this). The following graphic shows a histogram of AR1 coefficients of the residuals from regressing each proxy on gridded temperature, showing the Luterbacher-gridded and Briffa MXD-gridded series separately. Far from the residuals having “modest” autocorrelation, the majority had very high autocorrelations – certainly high enough to have a dramatic effect on the effective degrees of freedom. In some cases, an AR1 coefficient could not even be calculated – there was so much autocorrelation!
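One way to check what “modest” has to mean is to run the arithmetic directly. The sketch below assumes the simple effective-sample-size form n_eff = n(1 − ρ)/(1 + ρ) with ρ the AR1 coefficient, and again uses a normal approximation rather than the t distribution. The function names are mine, and this is not necessarily the calculation Mann performed (a common variant for correlations between two series uses the product of their two lag-1 autocorrelations in place of ρ). It holds the screening threshold fixed at its nominal P = 0.10 value and reports the effective P as AR1 rises.

```python
from math import sqrt
from statistics import NormalDist

N = 144  # nominal n quoted for the annually resolved records


def effective_n(n, rho):
    """Effective sample size under AR1 coefficient rho (assumed simple form)."""
    return n * (1.0 - rho) / (1.0 + rho)


def r_threshold(n, p_one_sided=0.10):
    """Critical |r| at the nominal level, normal approximation."""
    z = NormalDist().inv_cdf(1.0 - p_one_sided)
    return z / sqrt(n + z * z)


def effective_p(r_crit, n_eff):
    """One-sided P actually attained by r_crit when only n_eff points count."""
    t = r_crit * sqrt(n_eff) / sqrt(1.0 - r_crit ** 2)
    return 1.0 - NormalDist().cdf(t)


r_crit = r_threshold(N)  # threshold fixed as if there were no autocorrelation
for rho in (0.0, 0.1, 0.2, 0.5, 0.8):
    n_eff = effective_n(N, rho)
    print(f"AR1={rho:.1f}  n_eff={n_eff:6.1f}  "
          f"effective P={effective_p(r_crit, n_eff):.3f}")
```

With AR1 at zero the effective P equals the nominal 0.10 by construction, and it climbs steadily as AR1 increases, which is why the size of the AR1 coefficients matters so much to the benchmark.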
The effect of allowing for autocorrelation in the t-statistic (through accounting for the effective degrees of freedom) results in the following histograms of t-statistics, adjusted according to the procedures of Gavin Schmidt and his Santer coauthors. Out of 1033 non-Luterbacher non-Briffa gridded series, only 32 pass a positive t-test. (This is actually a little generous as I’ve not parsed the decadal series in these graphics and there’s further hair on how these are handled, which I’ll get to some time.) On the other hand, the majority of Luterbacher and Briffa gridded MXD series are “significant”. Given that the Luterbacher data uses instrumental data in its “calibration” period, the existence of high correlations to a different gridded instrumental version proves nothing about the quality of the proxy population and, in effect, spikes the drink. The gridded Briffa MXD data only became available a few weeks ago following an FOI request and, of course, this is the data that was truncated in 1960 because of the divergence problem. Whatever its ultimate disposition, it is also a subset with properties that cannot be generalized to run-of-the-mill proxies.
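For anyone who wants to try this sort of adjustment, the following is a minimal sketch of an autocorrelation-adjusted regression t-statistic along the lines described in Santer et al 2008: fit by least squares, take the lag-1 autocorrelation r1 of the residuals, set n_eff = n(1 − r1)/(1 + r1), and use n_eff − 2 in place of n − 2 when estimating the residual variance. It is not the script behind the histograms. The function names, the dictionary keys and the synthetic test series are illustrative assumptions, and returning NaN when n_eff is 2 or less is my own choice for the cases where the adjustment breaks down.

```python
from math import sqrt, sin


def lag1_autocorr(e):
    """Lag-1 autocorrelation of a sequence (mean removed)."""
    n = len(e)
    m = sum(e) / n
    d = [v - m for v in e]
    return sum(d[i] * d[i + 1] for i in range(n - 1)) / sum(v * v for v in d)


def adjusted_t(x, y):
    """OLS of y on x with naive and autocorrelation-adjusted t-statistics."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    a = ym - b * xm
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(v * v for v in resid)
    r1 = lag1_autocorr(resid)
    n_eff = n * (1.0 - r1) / (1.0 + r1)  # effective sample size
    t_naive = b / sqrt(sse / (n - 2) / sxx)
    # Sketch choice: report NaN when too few effective points remain.
    t_adj = b / sqrt(sse / (n_eff - 2) / sxx) if n_eff > 2 else float("nan")
    return {"slope": b, "r1": r1, "n_eff": n_eff,
            "t_naive": t_naive, "t_adj": t_adj}


# Synthetic illustration only: a weak trend plus a slowly varying wiggle,
# which leaves strongly autocorrelated residuals.
x = list(range(146))
y = [0.01 * i + sin(i / 2.0) for i in x]
print(adjusted_t(x, y))
```

Whenever r1 is positive, n_eff falls below n, the standard error grows, and the adjusted t-statistic is smaller in magnitude than the naive one. With very persistent residuals n_eff can drop to a handful of points or fewer.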
Of the non-gridded proxies, 32 (3.1%) have a positive t-statistic above 2, the usual value for a two-sided 95% confidence interval (or one-sided 97.5%). And only a handful of these extend back before AD1400 (including old chestnuts like Briffa’s Tornetrask reconstruction, which has its own problems.)
These results are not very impressive if one avoids the “methodological error” of failing to allow for autocorrelation, as emphasized by Schmidt and Santer. Once autocorrelation is allowed for, Mann’s “proof” that he had not merely scavenged red noise series is no longer valid.
It’s interesting that Schmidt praised one study containing such errors and condemned another.