Darrell S. Kaufman, David P. Schneider, Nicholas P. McKay, Caspar M. Ammann, Raymond S. Bradley, Keith R. Briffa, Gifford H. Miller, Bette L. Otto-Bliesner, Jonthan T. Overpeck, and Bo M. Vinther (*Science* 9/4/2009) propose a reconstruction of Arctic summer land temperatures for the last 2000 years, using 23 diverse proxies. Decadal averages of each proxy are normalized to zero mean and unit variance relative to the period 980-1800 AD, when all 23 proxies are available. These are then are averaged, as available, to form a 2000-year composite. This composite is converted into temperature anomalies by comparison to the CRUTEM3 summer (JJA) temperature for land north of 60°N latitude, for 1860 – 2000 A.D.

Unfortunately, the paper’s calibration of the proxy composite is defective. The 23 proxies used include lake varve thicknesses, varve densities, varve and sediment organic material (OM), sediment biosilica, ice core ^{18}O depletions, and tree ring widths, and are all from different locations. There is no *a priori* reason to expect these very diverse proxies to all have the same behavior with respect to temperature, even when normalized. Because not all the proxies are available in each decade, the composite does not have constant composition. As its composition changes, it essentially becomes a different index, which must be calibrated separately to temperature. The authors fail to do this, and hence the reconstruction is invalid.

In the online Supplementary Information, the authors give their calibration regression equation as

P = 2.079T + 0.826 (r^{2} = 0.79, p < 0.01, n = 14).

Presumably they then invert this equation to reconstruct T(t) as a function of P(t) during the reconstruction period 0-1860 AD. In itself, this corresponds to the univariate version of the “Classical Calibration Estimation” discussed by Brown (1982), and advocated on CA by reader UC. In general, it would be more efficient not to weight the individual normalized series equally, but there is no reason such an equally-weighted composite of standardized series cannot be calibrated univariately in this manner.

Nevertheless, there are three problems with the calibration as performed by Kaufman et al.: First, they estimate their calibration equation on a sub-composite index which includes only those 19 series that extend to 1980. Since the calibration equation is based on a composite that does not include 4 of the series, it is invalid to use it to calibrate the full 23 proxy series, or any sub-composite that includes any of these 4 series. The completely excluded series are #11 (the SFL4 sediment series from W. Greenland) and #19 – 21 (the 3 lake varve series from Finland, including the controversial Tiljander series #20). Since the Tiljander series was (validly) truncated at 1810 to avoid effects of human activities on the lake sediments, there is in fact no way to calibrate it or any composite containing it to post-1860 temperatures. Series #19 is likewise truncated at 1810.

The second problem is that even though the last decade in which the 19 series were observed together was the 1970s (12 decades), the authors use subsets of these 19 to fit their regression equation through 2000 (14 decades). The 13th decade in this regression in fact is based on the average of only 15 of the 19 series, and the 14th decade is based on only 12 of the 19 series. Since these are different composites, they do not necessarily obey the same regression equation, and it is invalid to include these points in the calibration of the 19-proxy sub-composite. (In fact, there were only *9* decades during the calibration period when the 19 series were all truly observed, because series 2 and 8 had a few internal missing observations that the authors interpolated. While interpolation might be an excusable expedient for the reconstruction, it is invalid for the calibration regression, since it does not add a true observation.)

The third problem is that the authors apply their calibration equation to reconstruct temperatures during decades when some, or even several, of the 19 calibration proxies were not available. Even if the reconstruction composite had been restricted to the 19 calibration proxies, and even if all 19 of these had been observed for all 14 decades of the calibration period, it would still be invalid to apply the calibration equation derived from all 19 to any subset of the 19. This in itself makes the reconstruction invalid before 970 A.D., when proxy #9 ceases to be observed. Prior to 460 A.D., only 13 of the 19 calibration proxies are available.

The irregularity of this data set does not make its calibration insurmountable. So long as the proxies were not “cherry picked” from a broader universe of proxies on the basis of their correlation with instrumental temperatures (a big “if” in the present instance), the following steps would be adequate to perform a valid calibration:

1. Regress the decadal averages of each proxy P_{i}(t) on the decadal averages T(t) of the instrumental temperature series, using all the observations for that proxy but excluding interpolations, to obtain a regression equation of the form

P_{i}(t) = a_{i}(t) + b_{i}(t) T(t) + e_{i}(t),

with standard errors appropriately adjusted for serial correlation. Of course, proxies with less than 14 decadal observations will tend to have higher standard errors than those with all 14 observations, and the two that were truncated at 1810 cannot be used at all.

2. Compute the covariance matrix of the regression errors for the 21 remaining series, as follows: First, compute the variance of the errors in each regression. Then, for each pair of proxies, compute the correlation between their regression errors, using only the observations that are active for both proxies, and without recentering the errors. Finally, convert these correlations into covariances using the full sample variance for each of the errors. This matrix is not of full rank, but since it does not need to be inverted, this should not be a problem.

3. For each reconstruction period t, compute P(t) = mean(P_{i}(t)), a(t) = mean(a_{i}), and b(t) = mean(b_{i}), taking the means over the time-t specific subset of proxies, including interpolations if desired. Compute the variance and covariance of a(t) and b(t) and the variance of e(t) = mean(e_{i}(t)) using the covariance matrix computed in step 2, and the assumption that the correlations of the coefficients from different regressions are equal to the correlations of their error terms. Generally, the smaller the active proxy set, the larger the standard errors of the coefficients will be.

4. Invert the t-specific equation

P(t) = a(t) + b(t) T(t) + e(t)

to obtain

T(t) = (P(t) – a(t) – e(t))/b(t).

Setting e(t) = 0 gives the point estimate

T*(t) = (P(t) – a(t) )/b(t).

5. Use the formula for the distribution of the ratio of two correlated normal random variables (D.V. Hinkley, *Biometrika* 1969, corr. 1970) to compute confidence intervals for T(t). This approach always leads to finite CI bounds, unlike the classical method of inverting upper and lower confidence bounds for P(t) as a function of T(t). It also can be given a Bayesian interpretation, in terms of a diffuse prior for T(t), provided Bayes’ Rule is invoked before the regression information is applied and not after.

Since four of the proxies are treering series that may well have been affected by CO2 fertilization in the last 60 years, CO2 should also be controlled for in the TR regressions, and appropriately entered in the reconstruction.

I have not yet attempted to perform these calibrations, but they should be straightforward.

*(I have previously noted, in Comments 35 and 37 of the Kaufman and Upside-Down Mann thread, that the correction for serial correlation of regression residuals as described in the Kaufman SI is also incorrect. This is a separate, but also important problem.) *

## 34 Comments

I hope you submit this as a comment. It might strengthen your argument to show WHY you can’t use a subset of the 19 that formed the calibration set (yeah, I can see it, but I bet team members will insist it is ok).

Looking back on my recon last year, it does seem that sometimes simpler is better. Especially if you get so fancy you don’t even understand what you are doing.

UnnecessaryRe: Darryl (#2), I am sure that many readers find it viscous…

OTHu, you refer to:

I’ve looked at the distributions of varves in the few cases where the raw data is available and the distributions are hugely non-normal and even the logged versions are nowhere near non-normal.

The handling of varve non-normality is inconsistently handled in the various original studies – with very unsatisfactory discussions in the original studies. Kuafman doesn’t specify the exact version of original data used, but with some experimentation, I’ve decoded the original version. Blue Lake uses log(thickness), interestingly Murray Lake uses log(mass), while Iceberg Lake, Big Round Lake and Donard Lake all use linear versions. Logging mitigates the non-normality though it doesn’t eliminate it. Iceberg Lake and Big Round Lake are both among the 4 series that contribute to HS-ness and the failure to properly normalize these series looks like it might have a noticeable impact on results.

RE Craig Loehle #1,

A valid yet simpler reconstruction would be to simply give up altogether on the 4 series that do not make it to 1980 (#19 and 20 can’t be used in any event), and then to give up also on the last two decades, and just calibrate the composite of these 19 on only the 12 decades for which all 19 are observed.

This reconstruction would be valid (assuming, that is, that the series themselves have not been cherry-picked) back to 970AD, when #9 drops out. But then it could be carried back to 750AD (when #11 drops out), simply by re-calibrating the resulting 18-proxy composite to the 12 decades when these 18 are all observed. And so forth, recalibrating the new series each time a proxy drops out.

This simplified procedure overlooks the fact that a few observations on proxies 2 and 8 were interpolated, and therefore not really valid as part of the dependent variable. But this is a lot less questionable than having some series missing altogether.

One subtle but important issue is the standardization period for the z-scores that enter the CPS composite with equal weight. I’m not sure there’s a “right” way to do CPS, but if the 4 proxies that don’t make it to 1980 are going to be dropped, the standardization “ought” to run through 1980. This makes a big difference for some of the standard deviations, and therefore for the weights these series get in the reconstruction.

(In fact, all 23 of the original series are observed from 970 – 1810, rather than the 980-1800 period used in the paper for the standardization. This puzzling choice already makes a small difference in the weights.)

RE Craig Loehle #4,

Did you mean thixotropic? 😉

Re: Hu McCulloch (#6), You got me, I had to look up thixotropy. Very viscous of you.

Re: Hu McCulloch (#6),

Apart from being surprised – even disgusted – by the elementary lack of calibration procedures that you rightfully document, might I taunt you that any more word plays on “thixotropy” will bring on “penecontemporaneous dedolomitisation” which so far has not entered the lexicology of processes in the formation of sedimentary piles or seashells. (Americans might spell it with an ultimate “z” not “s”).

Re: Geoff Sherrington (#9),

Brilliant, Geoff – love it🙂

Re: ianl (#12),

Of course, the far more serious questions are how or why some scientists dealing with statistics fail to even comprehend what a distribution is and why it is important that the type of distribution dictates which subsequent procedures can and cannot, or should and should not, be used. This really is elementary stuff.

Gamblers sometimes profit from a knowledge of distributions. So there is value in understanding them. Those who do not take the trouble to understand them will suffer economic or intellectual loss.

The paper under discussion reads like a gamblers’ convention, where each proxy (gambler) tries to outdo the ignorance of the other.

Hilarious? No, tragic.

Agree with Craig Loehle #1; submit this as a comment then Kaufman will have no option but to really, really try to respond to your points. Either that or wait for a corrigenda from the Team…

RE Steve #5,

I’m as big a fan as anyone of heavy-tailed non-normal distributions, and in particular Mandelbrot’s infinite variance

stableclass (the “Black Swan” distributions, if you like). See my “Financial Applications of Stable Distributions,” inHandbook of Statistics, vol. 14, and other papers on my webpage.However, you have to crawl before you can walk, and I am still struggling with the calibration issue when everything is nicely finite variance and Gaussian. It seems to me that if a(t) + e(t) is normal, and b(t) is normal, and their covariance is known, and P(t) is observed, then T(t) = (P(t) – a(t) – e(t))/b(t) has Hinckley’s distribution, which is just a

~~simple~~messy transform of Matlab’s mvncdf. I’m sure there is a similar function in R.Since the governing variance is only estimated, you have to integrate out its distribution numerically to be precise, but this just thickens the tails a little more than they are already.

RE Craig Loehle #8 and Geoff Sherrington #9, see Comment 81 on the Iceberg Lake thread.

Perhaps Craig just meant “dense” in #4?

Hu M, thanks for the analysis of the spotty time series. Those issues arose, loud and clear, when I did my calculations of break points and times series of the 23 proxies. I was amazed to find that the 23 proxy average was simply an average by decade of whatever proxy data were available. The lack of instrumental period data for 4 of the proxies was apparent also.

Would not some of these deficiencies have caused a peer reviewer to shout out, “You cannot be serious”? I am not attempting to be funny here.

Re: Kenneth Fritsch (#11),

Maybe they did. We don’t know what the reviewers or the editor said because the review process is closed. This is one question I would have asked Kaufman – to summarize the reviews.

UC, at #82 over in the “Is Kaufman Robust?” thread wrote,

They may be wide, but aren’t pathological if you’re using my proposed ratio-of-normals CI’s. Suppose, for instance, that a-hat and b-hat both turn out to be 0, and are uncorrelated (because the variables were demeaned, say). Then (y – a-hat – e)/b-hat is exactly Cauchy. This has very heavy tails and even infinite variance, but still has finite percentiles, including the .025 and .975. The 95% CI is therefore bounded, symmetrical, and contiguous.

Brown couldn’t get Bayes rule to work with an uninformative prior, because he tried to impose the uninformative prior

afterhe had run the regression, and therefore really had some information, thus contradicting an uninformative prior.In order to get it work, you have to impose the uninformative prior

beforeyou look at the data, when a and b (and var(e)) are all still hypothetical, data-free values. Then p(y|x,a,b,var(e)) is just Gaussian, as is p(x|y,a,b,var(e)) under a diffuse prior. Then if you run the regression, the posterior distribution of a and b given var(e) is BV Gaussian, which when integrated out just gives a ratio of normals distribution for x, which can easily be calculated in terms of MATLAB’s mvncdf.The variance may then be integrated out numerically (as when Student t is derived from a mixture of normals), but this just thickens the tails somewhat.

Re: Hu McCulloch (#15),

Brown’s CIs are sometimes ‘pathological’ (two disjoint semi-infinite lines, for example). But if the slope is statistically significant one avoids this problem (at least in the univariate case). If the true beta is zero (non-sense proxy), we shouldn’t obtain finite CI.

Hence, I’d like to add something to the procedure you describe in (#16) – first check that there actually is proxy-temperature relation. To me Ebisuzaki’s method ( http://www.climateaudit.org/?p=4640#comment-336449 ) seems to be a reasonable way to start. It doesn’t make AR(1) assumption, which can’t be valid for both annual

anddecadal averages. Fundamental problem of cherry-picked proxies remains, but at least it will be more difficult to cheat.Re: UC (#17),

According to Brown, any self-respecting calibrator will use only statistically significant calibration slopes. With this rule (sl 0.05), the proxies that can be used are

2

5

6

9

10

12

14

15

18

22

23

If we take serial correlation into account, with Ebisuzaki (sl 0.05) the remaining proxies will be

2

10

12

14

22

The interesting one is proxy # 10 (Donard Lake) that survives both tests, but with different sign that is used in the Kaufman composite.

Re: UC (#28),

[Use Nelson’s voice from Simpsons]

Ha ha!

Steve McI, at #6 in the “5 Alaskan BSi Series” thread wrote:

El Maestro is onto something here, but in fact we don’t have to wait for the authors to convert proxies to temperature — we can do it ourselves, and then just apply L&M:

Just regress each proxy Pi univariately on T (or T + CO2 if appropriate), and invert to get a point reconstruction Ti as a function of Pi (and CO2 if used). But then just Loehle the point estimates to get a consensus T = mean(Ti), and then use the

panel of reconstructed Ti’sto estimate their covariance about T, as in Loehle and McCulloch.The only difference is that there we assumed the 18 global proxies were independent (since they were well spread out), whereas here many of them are quite close and therefore the correlations cannot be ignored. However, this covariance matrix will not be singular, since it is computed from the 200 reconstruction dates (in Kaufman’s case), rather than just from the 14 regression residuals. Then use this covariance matrix to compute the precision of the equally-weighted consensus T. If the proxies go every which way during the reconstruction period, the derived se will end up big, even if the regressions seemed tight in the calibration period.

If desired, you can go one step further, by using the estimated covariance matrix to obtain a more efficient GLS estimate of T. This is doable since the covariance matrix now has full rank.

Kaufman’s uneven panel of proxies is no problem here, since you can use all of each proxy’s calibration points for its regression, and then just average over the available subsample (equally or by GLS) in each reconstruction period. Compute the reconstruction variances using each proxy’s full observation period, and the reconstruction covariances using each pair’s full period of overlap. This wouldn’t require discarding 4 of the proxies and the last two decades altogether as suggested in #6 above.

Of course, since Kaufman proxies #19 and 20 (Tiljander) don’t extend into the instrumental period, they can’t be calibrated at all, so that only leaves the other 21.

And of course, this method won’t do anything for the more fundamental problem of cherry-picked proxies.

RE UC #17,

Even if the point estimate of the slope (b or beta) is zero, it still has a positive standard error, and this uncertainty gives x a proper (if heavy-tailed and maybe even bimodal) posterior distribution that readily gives bounded and contiguous CI’s with equal probabilities in both tails.

And even if the slope is significantly positive, say, we are never positive that it is positive. (Would a double positive sort of like a double negative?😉 ) But the logic of Brown’s inverted CI’s requires that we

knowthe sign of the slope. In fact, there’s a small probability that it is opposite the point estimate, and as a result, Brown puts unequal probability in the two tails of the CI.As the p-value of the slope reaches the combined tail probability of the CI, Brown’s upper bound reaches infinity, and therefore has 0 probability in the upper tail, instead of equal probabilities like my proposed method. The difference is that the ratio-of-normals distribution exactly takes into account the growing possibility that the denominator could be of either sign, while Brown does not.

The end result is that my CI’s will be more centered than Brown’s unless the slope is highly significant in comparison to the CI probability, so that you will get more “could go either way” outcomes, and fewer “could be very extreme” outcomes.

Re: Hu McCulloch (#18),

I didn’t mean estimate, what if the true beta is zero? Proxy=zero times temperature plus noise.

“no one would make estimates from a regression equation unless he were convinced, either by theoretical argument or the statistical significance of b itself, that he knew the sign of beta” – Williams

..but that’s a bizarre statement, as

multivariate regression methods are insensitive to the sign of😉predictors

Re UC #19,

If the true value of beta is known to be zero, you can’t use the proxy to reconstruct temperature at all, and so the CI for past temp is (-inf, inf).

But we always just have an estimate of beta, with positive se. If you are certain that the true beta is non-negative, it is easy enough to truncate the usual gaussian (or student, taking var(e) into account) posterior distribution of the slope coefficient so as to place zero density on the negative values. This just knocks one of the two terms out of the ratio-of-normals distribution, and merely requires jacking up the remainder by a computable amount to compensate for the missing probability.

The resulting CDF for temperature has a proper distribution, however, which can be inverted to obtain a CI for past temperature with finite limits and equal probabilities in both tails. So there’s still no need for unbounded or noncontiguous CIs.

I’m not sure what your last point is.

I see that there’s a new hockey stick published, hard to keep up with these. But here are univariate classical calibration results for Kaufman21:

Takes some time to find finite CIs, let’s see..

Interesting, but remember when you combine these into a composite reconstruction that you have only (at most) 14 observations to estimate the covariance matrix across 21 proxies!

Kaufman gets around this (albeit inefficiently) by using CPS, which arbitrarily gives equal weights to the standardized series, somehow signed. He assigned the wrong sign to Tiljander (according to Tiljander herself, at least), and also to #19, I believe, but these won’t matter for you, since you have presumably excluded these two, which have been truncated before calibration data is available, to obtain your 21.

RE #21,

Now I get it — you were alluding to Mann’s reply to Steve in PNAS!

Re: Hu McCulloch (#22),

Maybe

Sunberg,Brown (1989) “Multivariate Calibration With More Variables Than Observations”, Technometrics, Aug 1989, Vol. 31, No. 3will help. It seems that Lehmilampi is now inverted ( http://www.arcus.org/synthesis2k/synthesis/index.php ) to the orientation that Haltia-Hovi et al prefer. Negatively correlated with PC1 and calibration temperatures now.

Re: UC (#23),

“say my name, say my name ..”, October 2009 -remix:

Re: UC (#23),

Seems that they didn’t bother to correct this statement form the original paper:

They changed the formula in the SI for autocorrelation adjustment as well.

RE Steve McIntyre #26,

In the original SI, they gave the wrong formula for the autocorrelation adjustment, with r1^2 where they should have had r1, but Roman has confirmed that they must have used the right formula for their actual calculation, so this was just a slip in the SI writeup.

In the new SI, they have corrected the formula and used it to compute a CI for the slope coefficient, but then add a second formula, that in principle is a little more precise, and use it to test for significance of the slope coefficient. This is a little odd, since the slope is significant at the 5% level if and only if 0 lies outside the 95% CI. They should just use one or the other, and leave it at that.

The newly added second formula actually goes back to Bartlett (1935), who took into account the serial correlation of x in addition to the serial correlation of the errors (the latter being the same thing as the serial correlation of y under the null of no correlation). In principle these interact, but if the regressor is a just time trend, its first order serial correlation is 1 in large samples, so it drops out, as in the simpified Santer-Nychka formula. The same would be approximately true for a variable with a strong trend or drift during the regression period, like temperature, so again the Santer-Nychka simplification should give very similar results.

Section D of the new Supplementary Online Material available at http://www.arcus.org/synthesis2k/synthesis/index.php does essentially what I suggested in Comment #6 above, by dropping the 4 proxies that were not used for the calibration from the reconstruction, and then recalibrating the early portion of the recon using only those proxies that are still observed to construct the z-score composite.

They find in Figure S4 of the new SOM that these changes have only a small effect on their temperature reconstruction. This is fine, but if there is a right and a wrong way to do a calculation, you ought to do it right, even if you get a similar answer doing it wrong.

They defend their original approach on the grounds that

It’s true that if you have say several ice cores from the same ice cap, or several tree cores from the same stand of trees, then the several series may as well be averaged together (in natural units, not z-scores) without much concern about which are present or not, since they are all measuring the same thing, whatever that is, with idiosyncratic noise that cancels out across the series.

However, in the case of Kaufman (2009), they have several very different proxies — treering widths, d18O ratios from ice cores, varve thickness, varve density, biogenic silica content, etc — that don’t even have the same units, let alone the same response (if any) to temperature. There is no reason to expect a composite index made from these to have the same relation to temperature when the makeup of the index changes.

While I’m pleased that Kaufman and coauthors have taken the time to respond to my comment, it would be customary in such a circumstance to acknowledge the person making the suggestion, if only in the revised SOM where the issue is addressed. Likewise, the published Corrigendum definitely should identify Steve McIntyre, who originally pointed out that Kaufman was using the Tiljander series upside down, and who (with Ross McKitrick) actually published a comment in PNAS on an earlier paper that had made the same mistake.

On another issue, which I had made only briefly in my headpost, Kaufman’s website linked above states,

Item (5) is disturbing, since it admits that they have cherry-picked proxy records that correlate with temperature (even if they have done a poor job of it per UC #28 above). As David Stockman showed back in 2006 (Aust. Inst. of Geophysics News, http://landshape.org/enm/wp-content/uploads/2006/06/AIGNews_Mar06%2014.pdf), this can lead to almost exactly the same type of picture Kaufman comes up, just using serially correlated randomly generated series.

Brown’s approach to multiproxy calibration, mentioned by UC above, does require all the proxies to have a significant correlation to the target state variable. However, if some are considered and set aside by this criterion, the confidence levels should be adjusted appropriately — either assuming independence or using Bonferroni’s less restrictive bound perhaps. The DOF adjustments in the final model should also be adjusted appropriately for the omitted series.

(Note to S. Mosher — a Bonferroni is sort of like a Zamboni, except that it tears the ice up instead of glossing it over, and occasionally resets the scoreboard to 0.) 😉

Thanks Hu. A long time ago a co worker hit my results with a bonferroni stick. It left a mark.

bender, please.

like so:

Found the Donard Lake data,

ftp://ftp.ncdc.noaa.gov/pub/data/paleo/paleolimnology/northamerica/canada/baffin/donard_2001.txt

it seems that the sign is correct in Kaufman. It just happens to be ‘significantly’ correlated with the inverted Arctic-wide JJA temperature. Interestingly, the local calibration done in Moore01 is ICE ( regression of temperature on proxy); this biased estimator easily gives you false impression on accuracy. Past temperatures should be well within the range of calibration temperatures for this one to work. Where does this prior information come from??

re Jean S

http://www.climateaudit.org/?p=7360#comment-363136

Probably there’s been efforts, but 10 degrees C wide CIs turned them away 😉

The Kaufman Correction: Before & After

Science has published the “Correction & Clarification” to Kaufman et al 2009, but curiously the reconstructed temperature proxy has changed yet again between the draft clarification and the published clarification, to “hide the decline”? from the temperature anomaly 2000 years ago that in the draft was shown to be about the same as present.

http://hockeyschtick.blogspot.com/2010/02/kaufman-correction-before-after.html

## One Trackback

[…] there is no need to name those who pointed out errors. There is a small improvement over the draft version though; congratulations Hu! We thank H. McCulloch and others who have pointed out errors and have […]