UC and Hu McCulloch have been carrying on a very illuminating discussion of statistical issues relating to calibration, with UC, in particular, drawing attention to the approach of Brown (1982) towards establishing confidence intervals in calibration problems.
In order to apply the statistical theory of regression, you have to regress the effect Y against the cause X. You can't simply regress a cause X against a bunch of effects Y (which is what Wilson did in Kyrgyzstan, and which occurs all too frequently in paleoclimate) without proper consideration of the effect of the inverse procedure.
Calibration deals with the statistical situation where Y is a "proxy" for X and where you want to estimate X given Y. It's the kind of statistics that dendros should be immersing themselves in, but which they've totally disregarded, instead using procedures for estimating confidence intervals that cannot be supported under any statistical theory. IPCC AR4 unfortunately acquiesced in this practice, hardly enhancing its credibility on these matters.
The starting point in Brown (1982) is the following:
Perhaps the simplest approach is that of joint sampling. It is easy to see that, given α, β, σ, X, X', the joint sampling distribution is such that

$Z = \dfrac{Y' - \hat{\alpha} - \hat{\beta} X'}{\sigma \sqrt{1 + \frac{1}{n} + \frac{(X' - \bar{X})^2}{S_{XX}}}}$

is standard normal. Note that this standard normal does not involve any of the conditioning parameters α, β, σ, X, X', so that probability statements are also true unconditionally and, in particular, over repetitions of (Y, X) where both Y and X are allowed to vary.
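Brown's claim is easy to check numerically. The sketch below is my own illustration, not from Brown: the "true" values of α, β, σ and the design are arbitrary. It repeatedly draws a calibration sample plus a new observation Y' at X', refits the line each time, and standardizes the prediction error; the standardized values behave like standard normal draws regardless of the parameters chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 1.0, 0.5, 0.3   # arbitrary "true" parameter values
n, x_new = 30, 2.0                   # calibration size and the new X'
x = np.linspace(-1, 3, n)
xbar = x.mean()
sxx = ((x - xbar) ** 2).sum()

z = []
for _ in range(20000):
    # fresh calibration sample and OLS fit
    y = alpha + beta * x + rng.normal(0, sigma, n)
    b = ((x - xbar) * (y - y.mean())).sum() / sxx   # OLS slope
    a = y.mean() - b * xbar                         # OLS intercept
    # new observation at X', then standardize with the KNOWN sigma
    y_new = alpha + beta * x_new + rng.normal(0, sigma)
    se = sigma * np.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)
    z.append((y_new - a - b * x_new) / se)

z = np.array(z)
print(z.mean(), z.std())   # should be close to 0 and 1
```

Re-running with different α, β, σ leaves the mean and standard deviation essentially unchanged, which is the point of the "unconditional" remark above.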
I dare say that many readers may find that this statement is a fairly big first bite and that it may not be as obvious to them as to Brown’s audience.
However, this particular result is derived in chapter 1 of a standard textbook, Draper and Smith, Applied Regression Analysis (1981). I worked through this chapter in detail and found the exercise very helpful. Its approach is, in turn, derived from E.J. Williams (1959), Regression Analysis, chapter 6. In some fields, while people "move on", they try to at least achieve results that survive the test of time.
The key strategy in the univariate case is to draw curves enclosing the 100(1-γ)% confidence limits for y given x; these curves are quadratic in x. Illustrations are given in Draper and Smith Figures 1.11 and 1.12 and in Williams Figure 6.2. The equation for the confidence interval curves is:

$y = \hat{\alpha} + \hat{\beta} x \pm t\, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}$

where t is the two-sided 100(1-γ)% t-value for the relevant degrees of freedom, n is the size of the calibration set, and the other quantities are the usual regression estimators.
Squaring, this can be transformed into a quadratic equation in x. The strategy in these texts for estimating fiducial limits on x given y is to draw a horizontal line at y, determine its intersections with the two confidence interval curves, and take the x-values of the intersections as the upper and lower fiducial limits, with the point estimate being calculated from the fitted linear equation:

$\hat{x} = (y - \hat{\alpha}) / \hat{\beta}$
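The inversion can be carried out in closed form: squaring the band equation yields a quadratic in x whose real roots, when they exist, are the fiducial limits. Here is a minimal sketch in Python (my own illustration; the function name and the synthetic data are invented for the example):

```python
import numpy as np
from scipy import stats

def fiducial_limits(x, y, y0, gamma=0.05):
    """Invert the prediction band of the regression of y on x at y = y0.

    Returns (x_hat, lower, upper); lower/upper are None when the
    quadratic has complex roots (the pathological "no limits" case).
    """
    n = len(x)
    xbar = x.mean()
    sxx = ((x - xbar) ** 2).sum()
    b = ((x - xbar) * (y - y.mean())).sum() / sxx    # OLS slope
    a = y.mean() - b * xbar                          # OLS intercept
    s2 = ((y - (a + b * x)) ** 2).sum() / (n - 2)    # residual variance
    t = stats.t.ppf(1 - gamma / 2, n - 2)
    g = t ** 2 * s2
    d = y0 - a
    # (y0 - a - b x)^2 = g (1 + 1/n + (x - xbar)^2 / sxx), rearranged:
    A = b ** 2 - g / sxx
    B = -2 * b * d + 2 * g * xbar / sxx
    C = d ** 2 - g * (1 + 1 / n) - g * xbar ** 2 / sxx
    x_hat = d / b
    disc = B ** 2 - 4 * A * C
    if disc < 0:
        return x_hat, None, None
    roots = sorted([(-B - np.sqrt(disc)) / (2 * A),
                    (-B + np.sqrt(disc)) / (2 * A)])
    return x_hat, roots[0], roots[1]

# a well-behaved synthetic calibration: strong linear signal, small noise
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2 * x + rng.normal(0, 0.1, 50)
xh, lo, hi = fiducial_limits(x, y, y0=1.0)
print(xh, lo, hi)
```

Feeding in a series with no real relationship to x typically returns `(x_hat, None, None)`, or real roots on the same side of the estimate, which is exactly the pathology discussed below.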
In a "well behaved" case, the upper confidence limit lies on the upper quadratic, the lower confidence limit lies on the lower quadratic, and the estimate falls between them. However, if the roots of the quadratic are complex, there are no real solutions to the equation, which means that every value of x falls within the confidence limits permitted by the data. Another related pathological case arises when both the "upper" and "lower" confidence limits are on the same side of the estimate.
In these cases, if one examines the underlying regression fit in univariate calibration, one finds that there was no statistically significant fit: the slope could not be statistically differentiated from zero. This is a point that UC has been emphasizing in recent posts.
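The connection between the failed quadratic and the insignificant slope can be made explicit. Writing the band half-width as $t s \sqrt{1 + 1/n + (x-\bar{x})^2/S_{xx}}$ and squaring, the coefficient of $x^2$ in the resulting quadratic is $\hat{\beta}^2 - t^2 s^2 / S_{xx}$, and:

```latex
\hat{\beta}^2 - \frac{t^2 s^2}{S_{xx}} > 0
\quad\Longleftrightarrow\quad
\left(\frac{\hat{\beta}}{s/\sqrt{S_{xx}}}\right)^2 > t^2
\quad\Longleftrightarrow\quad
|t_{\hat{\beta}}| > t
```

Since $s/\sqrt{S_{xx}}$ is the standard error of the slope, the $x^2$ coefficient is positive exactly when the slope's t-statistic exceeds the critical t-value, and only then is the horizontal line at y guaranteed to cut both branches of the band. An insignificant slope is what produces the complex-root and same-side pathologies.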
I went through all 14 MBH99 proxies and found that they beautifully illustrated the pathologies warned about in these texts.
First here is an example where the calibration graphic in the style of Draper and Smith 1981 has the structure of a “well behaved” calibration. This is for Briffa’s Tornetrask series. The calibration here has been “enhanced” by some questionable prior manipulations by Briffa, who constructed his temperature series by an inverse regression of regional temperature against 4 time series – so the “raw” proxy is not really “raw” any more. In these graphics, I’ve used the average value of the proxy in the 1854-1901 “verification” period as the y-value (everything’s been standardized on 1902-1980). In this case, the fiducial limits for x (temperature) given y are 0.36 deg C, so this looks like a pretty successful calibration (BUT the prior massaging will have to be deconstructed at some point.)
Next here is the same style of diagram for the Quelccaya 2 accumulation series, showing a very pretty example of complex roots and no fiducial limits. Examining the original calibration regression, one finds an r^2 of 0.011 (adjusted r^2 of -0.00147) with an insignificant t-statistic of -0.94 for the proxy-temperature relationship. Because the coefficient is not distinguishable from 0, there is no contribution towards calibration from this data.
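As a sanity check, the reported r^2 and t-statistic are mutually consistent. Assuming a 79-point calibration (1902-1980 inclusive; the length is my assumption, not stated above), the slope's |t| can be recovered from r^2 via the identity t^2 = r^2 (n-2) / (1 - r^2):

```python
import math

n = 79          # assumed calibration length, 1902-1980 inclusive
r2 = 0.011      # reported r^2 for Quelccaya 2 accumulation

# |t| for the slope follows from r^2:  t^2 = r^2 (n - 2) / (1 - r^2)
t = math.sqrt(r2 * (n - 2) / (1 - r2))

# adjusted r^2 = 1 - (1 - r^2)(n - 1)/(n - 2)
adj = 1 - (1 - r2) * (n - 1) / (n - 2)

print(t, adj)
```

The small gaps from the reported -0.94 and -0.00147 come from r^2 having been rounded to two significant figures.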
Here's a snippet of the corresponding Draper-Smith page from Google, which shows enough that you can see that the Quelccaya 2 accumulation case matches the situation in the Draper-Smith Figure 1.12 top panel diagram.
Quelccaya 1 accumulation is also pathological; in this case, the quadratic solves, but both the "upper" and "lower" confidence limits are on the same side of the estimate, as shown below. This calibration also fails standard tests, as the t-statistic is -0.544 (the r^2 is less than 0.01).
Here’s another pretty example of total calibration failure – the morc014 tree ring series. This would make a nice illustration in a statistics text. This has a t-statistic of -0.037 – a value that is low even for random series.
In total, 10 of the 14 series in the MBH99 network failed this chapter 1 calibration test. In addition to the above 3 series, the other failed series were: the fran010 tree ring series, Quelccaya 1 dO18, Quelccaya 2 dO18 (why are there 4 different Quelccaya series??), a Patagonia tree ring series, the Polar Urals reconstruction and the West Greenland dO18 series.
Only 4 series passed this elementary test. In addition to the highly massaged Tornetrask series, the other three were: the Tasmania tree ring series, the NOAMER PC2 and the NOAMER PC1 (AD1000 style). I guess the Tasmania series teleconnects to NH temperature more than most of the NH tree ring reconstructions. Its calibration results are not strong: the t-statistic is 2.1 and the adjusted r^2 is 0.04.
Now to what we’ve been waiting for: the NOAMER PC series. The NOAMER PC2 (and the AD1000 network is far more dominated by Graybill bristlecones than even the AD1400 network) has the strongest fit. It has a t-statistic of 4.3 and an adjusted r^2 of 0.19, the highest in the network.
Now what of the NOAMER (Graybill bristlecone) PC1? This is the only MBH99 series that has a HS shape (I’ve flipped the archived series so that it has the expected upward bend). It has a very idiosyncratic appearance in the Draper-Smith style diagram as shown below. The upper and lower limits are on opposite sides of the estimate, but this series yields very broad fiducial limits. The t-statistic here is 1.71, somewhat below statistical significance. The MBH99 “adjustment” of the PC1 has the effect of “improving” its fit to temperature, and thereby increasing its weight in an MBH-style reconstruction.
Moving towards Multivariate Calibration
As we approach the mountain of multivariate calibration, let’s pause and consider the information on fiducial limits from the 4 series that actually calibrated, as summarized in the table below:
| Proxy | Lower (deg C) | Upper (deg C) |
| --- | --- | --- |
Thus, we have the remarkable situation where the 95% fiducial limits for the 4 proxies essentially do not overlap at all (there's a minuscule overlap between the NOAMER PC2 and Tasmania). It will be interesting to see what happens as one works through a Brown 1982 style calibration. It also illustrates rather nicely the total lack of significance of the majority of proxies.
It’s hard to think how one can purport to derive confidence intervals of a few tenths of a degree, when 10 of 14 proxies don’t calibrate at all and the remaining 4 yield results that are inconsistent in the verification period.
I did these calculations with the MBH "sparse" temperature series since it had a verification value. MBH obviously used temperature PCs for calibration. Even though the two series are highly correlated, the calibrations will be different, though I'd expect the patterns to stay pretty similar.