In a post a few months ago, I discussed MBH99 proxies (and similar points will doubtless apply to the other overlapping series) from the point of view of the elementary calibration diagram of Draper and Smith 1981 (page 49), an older version of a standard text. Nothing exotic.
One of the problems that arose in these plots is that virtually none of the individual calibrations made any sense according to the Draper and Smith style plot. Since then, I’ve dug further into the problem. If a statistical relationship is not significant under a standard t-test (critical value of about 2), then a Draper and Smith style plot throws up nonsensical confidence intervals. I’ve determined that if you lower your t-standard, you can typically “get” confidence intervals for the individual calibrations that are at least intelligible. The larger question is whether a multivariate calibration can overcome the deficiencies of the individual calibrations (the topic of Brown 1982 and subsequent articles, which several of us have been parsing).
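For readers who want to experiment, here is a minimal sketch of the Draper and Smith style inverse prediction (classical calibration) interval. The function name and the synthetic data in the tests are my own illustration, not the actual proxy series:

```python
import numpy as np

def calibration_interval(x, y, y0, t_crit, m=1):
    """Classical calibration a la Draper and Smith: regress proxy y on
    temperature x over the calibration period, then invert the fit to
    estimate the x that produced an observed proxy mean y0 (of m values).
    Returns (x0_hat, lo, hi); lo and hi are None when Fieller's g >= 1,
    i.e. when the slope fails the t-test at t_crit and the interval
    breaks down."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = ((x - xbar) ** 2).sum()
    b = ((x - xbar) * (y - ybar)).sum() / Sxx            # calibration slope
    s2 = ((y - ybar - b * (x - xbar)) ** 2).sum() / (n - 2)
    x0 = xbar + (y0 - ybar) / b                          # inverse prediction
    g = t_crit ** 2 * s2 / (b ** 2 * Sxx)                # Fieller's g
    if g >= 1:
        return x0, None, None                            # no finite interval
    D = x0 - xbar
    half = (t_crit * np.sqrt(s2) / abs(b)) * np.sqrt(
        D ** 2 / Sxx + (1 - g) * (1.0 / m + 1.0 / n))
    return x0, xbar + (D - half) / (1 - g), xbar + (D + half) / (1 - g)
```

The `g >= 1` branch is the breakdown case: it occurs exactly when the calibration slope’s t-statistic falls below `t_crit`.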
However, before re-visiting that text, I want to re-plot calibration diagrams for all the MBH99 proxies, lowering the t-standard where required to get an intelligible confidence interval. The post title refers to the Mann et al 2007 network because it is the same as the MBH98-99 network. In all cases, I’ve limited my analysis to the verification period mean, since Wahl and Ammann have agreed that MBH has no “skill” at higher frequency. So we might as well see which proxies, if any, have “skill” at the verification period mean. In the MBH99 period (the AD1000 network), only one “climate field” is reconstructed, so all the piffle about lower order temperature PCs can be disregarded.
Let me start by showing a plot for the MBH99 series with the highest t-score in calibration – the NOAMER PC2 (not the Mannian PC1). I haven’t shown plots of individual series (you can see these here – link), but the PC2 doesn’t have a HS shape. Its individual “prediction” of the verification period temperature (sparse, centered on 1902-1980) is an even 0.0 deg C, with 95% confidence intervals of -0.1 and 0.16 deg C. While the confidence intervals are quite precise, unfortunately the observed verification temperature of -0.18 deg C lies outside them.
The next highest t-score comes from Briffa’s Tornetrask reconstruction. In this case, it verifies almost exactly. The problem with putting much weight on this particular reconstruction is that Briffa adjusted his results so that they “worked”, as discussed here [link].
Only one other series in the MBH99 network passes a standard t-test – Cook’s old Tasmania version. Here the 95% confidence intervals are rather uninformatively between 0.04 and 8.81 deg C, but even this wide interval fails to bracket the observed -0.19 deg C.
All other series failed a simple t-test in the calibration period and yielded perverse confidence intervals that failed to bracket the estimate. I examined some of the earlier literature, including Fieller 1954, and the most logical approach to these failed confidence intervals seems to be to lower the standard. If you can’t get a 95% confidence interval that makes any sense, maybe you can get a 75% confidence interval or a 50% confidence interval. There are some interesting mathematical issues involved in this, which Ross and I have been mulling over.
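To make the “lower the standard” idea concrete, here is a small sketch (my own construction, again on synthetic data) of the highest two-sided confidence level at which a finite interval is still available. It is exactly the level at which the calibration slope’s t-statistic equals the critical value:

```python
import numpy as np
from scipy import stats

def max_finite_confidence(x, y):
    """Slope t-statistic of the calibration regression of y on x, plus the
    largest two-sided confidence level for which the Fieller-type interval
    is still finite (any t_crit below the slope t gives g < 1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar = x.mean()
    Sxx = ((x - xbar) ** 2).sum()
    b = ((x - xbar) * (y - y.mean())).sum() / Sxx
    s = np.sqrt(((y - y.mean() - b * (x - xbar)) ** 2).sum() / (n - 2))
    t_slope = abs(b) * np.sqrt(Sxx) / s        # slope t-statistic
    level = 2.0 * stats.t.cdf(t_slope, df=n - 2) - 1.0
    return t_slope, level
```

For example, a slope t of about 1.31 caps the two-sided level at roughly 80%; a slope t of 0.36 supports far less.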
But the best way of seeing what is going on is simply to look at a lot of plots, and the Mann et al 2007 data set (which is identical to MBH98) provides an excellent compendium of perverse cases, one that can be recommended to statistics students even if climate science “moves on”. In the following plots, I’ll show, on the left, the 95% confidence intervals (where there is a breakdown) and, on the right, a confidence interval based on a lower standard. Notice that the left-hand plots all show both intersections on the same branch of the hyperbolae – thus the nonsense intervals. By lowering the confidence target, one intersection on each hyperbola branch is achieved, and thus a “confidence” interval.
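The branch geometry can be checked numerically. As a sketch (synthetic data, my own function name, and assuming g is not exactly 1), classify where the horizontal line through the observed proxy value cuts the two branches of the confidence hyperbola:

```python
import numpy as np

def branch_check(x, y, y0, t_crit, m=1):
    """Solve the Fieller quadratic for the points where the line y = y0
    meets the confidence hyperbola around the calibration regression.
    With g < 1 the two roots straddle the point estimate (one intersection
    per branch: a proper interval); with g > 1 both real roots fall on the
    same branch, giving the nonsense interval."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = ((x - xbar) ** 2).sum()
    b = ((x - xbar) * (y - ybar)).sum() / Sxx
    s2 = ((y - ybar - b * (x - xbar)) ** 2).sum() / (n - 2)
    g = t_crit ** 2 * s2 / (b ** 2 * Sxx)
    D = (y0 - ybar) / b                       # point estimate minus xbar
    disc = g * D ** 2 + (1 - g) * t_crit ** 2 * s2 * (1.0 / m + 1.0 / n) / b ** 2
    if disc < 0:
        return "no real intersections"
    r1 = (D - np.sqrt(disc)) / (1 - g)
    r2 = (D + np.sqrt(disc)) / (1 - g)
    return "opposite branches" if min(r1, r2) < D < max(r1, r2) else "same branch"
```

Lowering `t_crit` below the slope’s own t-statistic flips a “same branch” case to “opposite branches” – which is all that lowering the standard achieves.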
Here is the famous NOAMER PC1. (Note: this is from the original MBH archive and I need to re-check whether this is the “fixed” version or not.) While this version fails a standard t-test, if the t-standard is lowered to 1.5, then one can say with (say) 90% confidence that the verification period mean was between -1 and -16 deg C (actual value -0.19 deg C). So this doesn’t seem as helpful a standard as one might hope.
The next highest t-score for NH temperature came from the Quelccaya 1 ice core (t=1.31, which is still 90th percentile). Somewhat disappointing, though, is the fact that the Quelccaya 2 ice core has a t-value of only 0.36. Now that I think of it, we’ve seen a tendency in some recent studies to use the Quelccaya 1 ice core as a proxy on its own, discarding the Quelccaya 2 results. Hmmm. The Quelccaya 1 results purport to show a confidence interval (t=1.31) of between -0.39 and -4.8 deg C (once again not containing the observed -0.19 deg C). The Quelccaya 2 results, at an even lower confidence level (t=0.36), bracket -2 and -8 deg C, again not containing the observed -0.19 deg C. Hmmm.
Next in descending t-scores is Briffa’s Polar Urals series. This is the older version before the update (which resulted in a warm MWP), prompting the Team to switch to Yamal. Once again, on the left side, we see the broken down confidence intervals at the usual 95% t-values, but with a lowered confidence standard (t=1.25), confidence intervals of between -0.9 and -8.4 deg C result (again not containing the observed value.)
I hope you’re not getting too bored, because there are some interesting examples still to come, though the next few are more of the same. Next we have the NOAMER PC3. In passing, how does application of a Preisendorfer rule result in 3 PCs for the AD1000 NOAMER network and 2 PCs for the AD1400 network? Maybe the PR Challenge will tell us. The t-value has now declined to 0.95, yielding the typical broken-down diagram on the left. By lowering the t-standard, we can get a less confident interval, this time between 0.46 and 5.8 deg C (and once again unfortunately not including the observed value).
Next come two accumulation series from Quelccaya. It’s intriguing that this one site accounts for 4 of the 14 proxies in the network. Perhaps each individual ice core is thought to be tuned to a different channel on the teleconnection dial. The confidence intervals for one accumulation series are between 0.05 and 0.9 deg C at low confidence (but not containing the observed value), while the other accumulation series, using a t-confidence of only 0.5, yields an uninformative interval of 0.5 to 15.5 deg C, again unfortunately not containing the observed value.
Next, here are three proxies all with t-values in the seemingly uninformative 0.6-0.65 range: an Argentine tree ring series, a French tree ring series and a Greenland dO18 series (the last one being used over and over again in these studies). At very low confidence levels, each of these “proxies” yields only very wide “confidence” intervals, none of which actually overlap the observed value.
The last proxy has a bit of a place of honor. In this case, to obtain an intelligible “confidence” interval, one has to lower the t-standard to 0.02 (!!), resulting in a “confidence” interval between 12 and 19 deg C for the verification period reconstruction.
Reviewing the bidding, only one of the MBH99 proxies, considered in an individual calibration, yielded confidence intervals that contained the observed verification mean (and that proxy – Briffa’s Tornetrask series – had been fudged so that it “worked”.)
The interesting statistical question, for which hopefully the methods of Brown 1982 and subsequent literature can assist, is whether a multivariate calibration using proxies which have all individually failed so badly can yield an answer with a valid confidence interval using proper methods (as opposed to the methods applied by IPCC relying on Wahl and Ammann and the rest of the Team).