If Nychka Standards Applied to Mann…

Santer et al 2008 (including realclimate’s Gavin Schmidt) sharply criticized Douglass et al for failing to properly consider the effect of autocorrelation of regression residuals on trend confidence intervals, which they described as a “methodological error”. The need to properly account for autocorrelation in confidence interval estimation is a fairly long-standing theme at CA and it’s nice to see Gavin Schmidt come down so firmly for this view.

Now that the Santer 17 have spoken on this matter, I thought that it would be interesting to re-visit the Mann et al 2008 proxies, applying the Nychka adjustment to their screening regressions.

Mann et al argued that their proxies rose above mere scavenging through a garbage bin of red noise series as follows:

Although 484 (~40%) pass the temperature screening process over the full (1850–1995) calibration interval, one would expect that no more than ~150 (13%) of the proxy series would pass the screening procedure described above by chance alone. This observation indicates that selection bias, although potentially problematic when employing screened predictors (see e.g. Schneider (5); note, though, that in their reply, Hegerl et al. (10) contest that this is actually an issue in the context of their own study), does not appear a significant problem in our case.

On an earlier occasion, we observed that Mann used a unique pick-two option for these correlations – but did not modify the benchmark. Today, I’ll take a first quick look at how he handled autocorrelation, merely applying the Santer-Nychka methods.

Here’s how Mann described their test:

We assumed n ~ 144 nominal degrees of freedom over the 1850–1995 (146-year) interval for correlations between annually resolved records (although see discussion below about influence of temporal autocorrelation at the annual time scales), and n ~ 13 degrees of freedom for decadal resolution records. The corresponding one-sided P ~ 0.10 significance thresholds are |r| ~ 0.11 and |r| ~ 0.34, respectively.

Owing to reduced degrees of freedom arising from modest temporal autocorrelation, the effective P value for annual screening is slightly higher (P ~ 0.128) than the nominal (P ~ 0.10) value. For the decadally resolved proxies, the effect is negligible because the decadal time scale of the smoothing is long compared with the intrinsic autocorrelation time scales of the data.
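
As a quick check, the stated thresholds can be approximately recovered from the standard t-test for a correlation coefficient. A minimal sketch (in Python; obviously not Mann’s code, and the exact degrees-of-freedom accounting is my assumption):

    # Back out the one-sided P ~ 0.10 correlation thresholds from the
    # stated degrees of freedom.
    import numpy as np
    from scipy import stats

    def r_threshold(df, p_one_sided=0.10):
        # correlation magnitude at which a one-sided t-test of r = 0
        # becomes significant, via the identity t = r*sqrt(df/(1-r^2))
        t_crit = stats.t.ppf(1 - p_one_sided, df)
        return t_crit / np.sqrt(df + t_crit**2)

    print(r_threshold(144))  # ~0.107, matching the stated |r| ~ 0.11 (annual)
    print(r_threshold(13))   # ~0.35, near the stated |r| ~ 0.34 (decadal)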

In order to sustain this argument, Mann needs to demonstrate that the temporal autocorrelation is “modest”. But is this actually true? “Modest”, in the context of the reduction of degrees of freedom in the above calculation, corresponds to an AR1 coefficient of about 0.2 (I need to double-check this). The following graphic shows a histogram of the AR1 coefficients of the residuals from the regression of each proxy against gridded temperature, showing the Luterbacher-gridded and Briffa MXD-gridded series separately. Far from having “modest” autocorrelation, the majority of residuals had very high autocorrelations – certainly high enough to have a dramatic effect on the effective degrees of freedom. In some cases, an AR1 coefficient could not even be calculated – there was so much autocorrelation!
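
The diagnostic itself is straightforward to replicate. A minimal sketch of the assumed workflow (in Python; not the original script, so treat it as illustrative only):

    # Regress each proxy on its gridded temperature series and estimate
    # the lag-one autocorrelation of the OLS residuals.
    import numpy as np

    def residual_ar1(proxy, temp):
        # OLS of proxy on temperature, then lag-1 autocorrelation of residuals
        X = np.column_stack([np.ones_like(temp), temp])
        beta, *_ = np.linalg.lstsq(X, proxy, rcond=None)
        e = proxy - X @ beta
        e = e - e.mean()
        return (e[:-1] @ e[1:]) / (e @ e)

    # Collect residual_ar1() over all proxies and histogram the results;
    # values near 1 correspond to the severe autocorrelation discussed above.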

Allowing for autocorrelation in the t-statistic (by accounting for the effective degrees of freedom) results in the following histograms of t-statistics, adjusted according to the procedures of Gavin Schmidt and his Santer coauthors. Out of 1033 non-Luterbacher non-Briffa gridded series, only 32 pass a positive t-test. (This is actually a little generous, as I’ve not parsed the decadal series in these graphics and there’s further hair on how these are handled, which I’ll get to some time.) On the other hand, the majority of Luterbacher and Briffa gridded MXD series are “significant”. Given that the Luterbacher data uses instrumental data in its “calibration” period, the existence of high correlations to a different gridded instrumental version proves nothing about the quality of the proxy population and, in effect, spikes the drink. The gridded Briffa MXD data only became available a few weeks ago following an FOI request and, of course, this is the data that was truncated in 1960 because of the divergence problem. Whatever its ultimate disposition, it is also a subset with properties that cannot be generalized to run-of-the-mill proxies.
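
The adjustment is simple to state: deflate the sample size by the lag-one autocorrelation of the residuals, then redo the t-test on the reduced degrees of freedom. A sketch of one way to implement it (using the Quenouille-style formula of Santer et al 2008; illustrative values, not the actual post code):

    import numpy as np
    from scipy import stats

    def adjusted_t(r, n, rho):
        # deflate the sample size by the AR1 coefficient of the residuals
        n_eff = n * (1 - rho) / (1 + rho)   # Quenouille-style adjustment
        df = n_eff - 2
        return r * np.sqrt(df / (1 - r**2)), df

    t, df = adjusted_t(r=0.25, n=146, rho=0.7)
    print(t, stats.t.ppf(0.975, df))
    # with rho = 0.7, an unadjusted t of ~3.1 shrinks to ~1.26, well short
    # of the ~2.07 critical value on ~24 effective degrees of freedom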

Of the non-gridded proxies, 32 (3.1%) pass a positive t-test of 2, the usual value for a two-sided 95% confidence interval (or one-sided 97.5%). And only a handful of these extend before AD1400 (including old chestnuts like Briffa’s Tornetrask reconstruction, which has its own problems).

These results are not very impressive if one avoids the “methodological error” of failing to allow for autocorrelation, as emphasized by Schmidt and Santer. Once autocorrelation is allowed for, Mann’s “proof” that he had not merely scavenged red noise series is no longer valid.

It’s interesting that Schmidt praised one study containing such errors and condemned another.

12 Comments

  1. Neil Fisher
    Posted Oct 30, 2008 at 10:54 PM | Permalink

    It’s interesting that Schmidt praised one study containing such errors and condemned another.

    To misquote you, Steve – Hey, it’s Cli-MATE Science! 😉

  2. Posted Oct 31, 2008 at 2:33 AM | Permalink

    And could you describe in detail the processes described in your post?

  3. Posted Oct 31, 2008 at 9:15 AM | Permalink

    It should be remembered that there are in fact two Nychka standards: the Santer Nychka+15 (2008) adjustment of
    ne = n*(1 - rho)/(1 + rho),
    and the far superior Nychka Santer+4 (2000) adjustment of
    ne = n*(1 - rho - 0.68/sqrt(n))/(1 + rho + 0.68/sqrt(n)),
    which tries to compensate (if still only imperfectly) for the small-sample bias in estimating rho.

    See my discussion at Comment 64 of the Oct. 22 thread Replicating Santer Tables 1 and 3. The unpublished 2000 NCAR working paper was linked by Jean S at #10 of that thread.

    Note that Nychka Santer+4 2000 are using this formula to adjust the effective sample size itself, so that the effective DOF (ne-2 versus n-2) adjustment is even bigger. Furthermore, they then enter the adjusted DOF in the heavy-tailed Student t table to find a 5% critical value that can be well above 2.

    Steve here is only imposing the inferior 2008 Nychka standard on Mann, and not the far superior 2000 Nychka standard.
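
    A quick numerical illustration of the gap between the two (a sketch; rho and n are taken as given, and the values are purely illustrative):

      import numpy as np

      def ne_2008(n, rho):
          # Santer et al 2008 (Quenouille) effective sample size
          return n * (1 - rho) / (1 + rho)

      def ne_2000(n, rho):
          # Nychka et al 2000: inflate rho by 0.68/sqrt(n) before adjusting
          rho_adj = rho + 0.68 / np.sqrt(n)
          return n * (1 - rho_adj) / (1 + rho_adj)

      for rho in (0.2, 0.5, 0.8):
          print(rho, ne_2008(146, rho), ne_2000(146, rho))
      # the 2000 version always yields fewer effective observations, and
      # the relative gap is largest when rho is high and n is small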

    • Posted Oct 31, 2008 at 10:33 AM | Permalink

      Re: Hu McCulloch (#4),

      and the far superior Nychka Santer+4 (2000) adjustment of
      ne = n*(1 - rho - 0.68/sqrt(n))/(1 + rho + 0.68/sqrt(n)),

      Why do you say the Nychka Santer +4 is far superior?

      I’ve run Monte Carlo on various AR1 processes and I just don’t find this method superior. Without the 0.68/sqrt(N) correction, it rejects less frequently than it should; with the correction, it over-corrects. (Or at least it does for N near 90 and for N of 252.) For N=252, the original Nychka is closer to correct. (I don’t remember for N=90.)

      So the “improved” Nychka method is only an improvement if the standard is that excess false positives are bad but excess false negatives are OK. But that doesn’t make sense. The analyst is supposed to select their false positive rate, and then the method is supposed to provide the rate one intended!
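
      For anyone who wants to repeat the experiment, the skeleton of such a Monte Carlo looks roughly like this (my own reconstruction, not the script I actually ran; trial counts and parameters are illustrative):

        # Generate pairs of independent AR1 series, screen with the adjusted
        # t-test, and compare the empirical false-positive rate to nominal 5%.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)

        def ar1(n, rho):
            x = np.empty(n)
            x[0] = rng.standard_normal()
            for t in range(1, n):
                x[t] = rho * x[t-1] + rng.standard_normal()
            return x

        def reject_rate(n, rho_true, correction, trials=2000, alpha=0.05):
            hits = 0
            for _ in range(trials):
                x, y = ar1(n, rho_true), ar1(n, rho_true)
                r = np.corrcoef(x, y)[0, 1]
                e = y - np.polyval(np.polyfit(x, y, 1), x)   # OLS residuals
                rho = np.corrcoef(e[:-1], e[1:])[0, 1]       # estimated AR1
                if correction:
                    rho = min(rho + 0.68 / np.sqrt(n), 0.99)
                n_eff = max(n * (1 - rho) / (1 + rho), 3.0)
                t = abs(r) * np.sqrt((n_eff - 2) / (1 - r**2))
                hits += t > stats.t.ppf(1 - alpha / 2, n_eff - 2)
            return hits / trials

        print(reject_rate(90, 0.5, correction=False))
        print(reject_rate(90, 0.5, correction=True))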

    • Kenneth Fritsch
      Posted Oct 31, 2008 at 3:25 PM | Permalink

      Re: Hu McCulloch (#4),

      It should be remembered that there are in fact two Nychka standards: the Santer Nychka+15 (2008) adjustment of
      ne = n*(1 - rho)/(1 + rho),
      and the far superior Nychka Santer+4 (2000) adjustment of
      ne = n*(1 - rho - 0.68/sqrt(n))/(1 + rho + 0.68/sqrt(n)),
      which tries to compensate (if still only imperfectly) for the small-sample bias in estimating rho.

      Hu, if the co-author of the Santer et al. (2008) paper, i.e. none other than Nychka himself, “allows” the use of the Santer adjustment for the trend standard deviation over what you see as the better, no, make that far superior, Nychka adjustment from the 2000 paper, then either you or Nychka has some explaining to do.

  4. Steve McIntyre
    Posted Oct 31, 2008 at 9:41 AM | Permalink

    I did both calcs and showed a histogram here with pick-one Nychka 2000, which I should have made clearer. I’ll do a hist with Nychka-Santer 2008 (Quenouille). Hu’s point about the t-test needs to be kept in mind as well.

    This isn’t quite a final rendition as there’s much hair on the “decadal” data, which I’m planning to get to.

  5. Posted Oct 31, 2008 at 11:48 AM | Permalink

    For what it’s worth.

    I can confirm Steve’s ARMA 1,1 results from my own work modeling different M08 proxies. There were 15 series which I couldn’t get to converge at all with an ARMA 1,1 model. Some of them were the most ridiculous-looking ones (e.g. a big flat line replacing HF data in the middle of the graph). A bunch had very high AR figures in the regression.

    I understand that it is subjective, but the 5th-grade-class giggle test could have been applied to some of this data. Since 100-odd series were rejected before correlation analysis (the 1357 minus 1209 minus Luterbacher difference), something subjective was evidently already used. The more unreasonable of these curves would have been rejected from the 1357 anyway if they had ‘accidentally’ ended up among the few with high weighting in the final result, because of the implications that would have for the paper.

    In CPS at least, I see no reason for putting in a fake flat line where data was missing (if that is what was done). It simply flattens the historic pre-calibration result. In the CPS version, you could instead rework the averaging algorithm to leave the missing stretch out.

    Series 422 is a good example.

    Quansheng Ge, Jingyun Zheng, Xiuqi Fang, Xueqin Zhang, and Piyuan Zhang. 2003. Winter half-year temperature reconstruction for the middle and lower reaches of the Yellow River and Yangtze River, China, during the past 2000 years. Holocene Volume 13, Issue 6, pp. 933-940

    BTW: since I am fairly new to ARMA regression, is there some reason that the CA group prefers AR1 to ARMA 1,1, which often gives me a better SE result on proxies? Perhaps the terms are being used interchangeably in the threads?

  6. Pat Frank
    Posted Oct 31, 2008 at 11:50 AM | Permalink

    Re: lucia (#6), who wrote: “The analyst is supposed to select their false positive rate, and then the method is supposed to provide the rate one intended!”

    Maybe, in the spirit of the climatological times, Nychka has developed a new “proxy statistics,” in service to proxy thermometry, in which the statistical adjustments are chosen a posteriori to give the best signal.

    Remember the specious political manipulation of events to provide a government principal with “plausible deniability”? Well, we now see high analytical skill being bent to provide an academic principal with ‘plausible assertability.’ As distance from the event increases, and if the desired story holds through insistent repetition, the “plausible” part drops away and the denial or assertion takes on the force of fact. Hence the meaning of “moved on” as regards MBH98&99; and so it goes in climatology these days.

  7. Steve McIntyre
    Posted Oct 31, 2008 at 1:29 PM | Permalink

    #7. I wouldn’t say that I “prefer” AR1 to ARMA 1,1. I did some posts a couple of years ago on ARMA 1,1. However, Mann uses AR1 (as does Santer) and, for analyzing their statements, one has to use the format that they used.
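
    For anyone who wants to compare the two error models on a given residual series, here is a sketch using Python’s statsmodels (illustrative only; “residuals.txt” is a hypothetical input file, not something from the post):

      import numpy as np
      from statsmodels.tsa.arima.model import ARIMA

      e = np.loadtxt("residuals.txt")          # hypothetical residual series
      ar1  = ARIMA(e, order=(1, 0, 0)).fit()   # AR1, as used by Mann and Santer
      arma = ARIMA(e, order=(1, 0, 1)).fit()   # the ARMA 1,1 alternative
      print(ar1.aic, arma.aic)                 # lower AIC indicates better fit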

  8. Craig Loehle
    Posted Oct 31, 2008 at 2:11 PM | Permalink

    As Shakespeare said: “hoisted by their own petards”

    • Dave Andrews
      Posted Oct 31, 2008 at 2:52 PM | Permalink

      Re: Craig Loehle (#10)

      Craig, sorry to be a pedant, but the Shakespeare quote is “hoist with his own petar” :-), although petar and petard mean the same and most people use your format (a bit like “play it again Sam”).

      • Craig Loehle
        Posted Oct 31, 2008 at 3:21 PM | Permalink

        Re: Dave Andrews (#11), Well, I didn’t go look it up…I trust my meaning was clear.

One Trackback

  1. […] Read the rest here:  If Nychka Standards Applied to Mann… « Climate Audit […]