However, the formula Huybers gives for RE is itself clearly wrong. In his notation, he writes

RE = 1 – Sum(y-x)^2 / Sum(y^2),

where y is instrumental NH temperatures and x is eg the PC1 of proxy records. But as stated, this formula will give different answers if temperature is measured in dC or dF, which is nonsense. (A further problem is that, as in the MBH formula, the terms in the denominator have not been demeaned, but perhaps this is implicit, relative to an unstated reference period.) **[Steve**: all the series in question have been centered on the calibration period 1902-1980 so I don’t see an issue here.]
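To make the units point concrete, here is a minimal sketch in Python. The numbers are made up, and the helper name `re_as_written` is my own; it simply evaluates the formula as stated, with x left in arbitrary PC units while y is switched from dC to dF anomalies.

```python
def re_as_written(y, x):
    """RE = 1 - Sum(y-x)^2 / Sum(y^2), evaluated exactly as the formula is stated."""
    num = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    den = sum(yi ** 2 for yi in y)
    return 1.0 - num / den

# Made-up temperature anomalies in dC and a made-up proxy PC1 in arbitrary units.
y_c = [-0.3, -0.1, 0.0, 0.2, 0.4]
x_pc = [-0.2, -0.2, 0.1, 0.1, 0.3]

# The same anomalies expressed in dF (anomalies scale by 1.8, with no offset).
y_f = [1.8 * v for v in y_c]

re_c = re_as_written(y_c, x_pc)
re_f = re_as_written(y_f, x_pc)
print(re_c, re_f)  # the two values differ although the data are the same
```

The statistic only becomes unit-free once x has been converted into a predicted temperature by some estimation step, which is the issue taken up next.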

But then he states that “Inspection of the MM05 Monte Carlo code (provided as auxiliary material) shows that realizations of x are not adjusted to the variance of the instrumental record during the 1902 to 1980 training interval — a critical step in the procedure. The MM05 code generated realizations of x having roughly a fourth the variance of y, biasing RE realizations toward being too large.”

Adjusting the proxy to have the same variance as temperature during the calibration period would indeed give it the same units as temperature, as if it were a predicted temperature value. However, neither MBH-style simplified CCE estimation (regressing the proxy on temperature and inverting the relationship) nor “direct” or “ICE” regression of temperature on the proxy will make the predicted temperature have the same variance as actual temperature. In the single proxy CCE case, predicted temperature will have a greater variance, by a factor of 1/R^2 from the regression, while in the “ICE” case predicted temperature will have a smaller variance, by a factor of R^2.
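The single-proxy variance claims can be checked numerically. The following is only a sketch on simulated data (the signal strength of 0.5, the noise level and the sample size are arbitrary assumptions of mine): in-sample, the CCE prediction has variance equal to that of temperature divided by r^2, and the ICE prediction has variance equal to that of temperature times r^2.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

random.seed(0)
n = 200
temp = [random.gauss(0.0, 1.0) for _ in range(n)]          # "temperature" x
proxy = [0.5 * t + random.gauss(0.0, 1.0) for t in temp]    # proxy y = signal + noise

s_xx, s_yy, s_xy = cov(temp, temp), cov(proxy, proxy), cov(temp, proxy)
r2 = s_xy ** 2 / (s_xx * s_yy)

# CCE: regress proxy on temperature (slope b), then invert to back out temperature.
b = s_xy / s_xx
a = mean(proxy) - b * mean(temp)
temp_cce = [(y - a) / b for y in proxy]

# ICE: regress temperature directly on the proxy (slope d).
d = s_xy / s_yy
c = mean(temp) - d * mean(proxy)
temp_ice = [c + d * y for y in proxy]

ratio_cce = cov(temp_cce, temp_cce) / s_xx   # equals 1 / r2 in-sample
ratio_ice = cov(temp_ice, temp_ice) / s_xx   # equals r2 in-sample
print(r2, ratio_cce, ratio_ice)
```

Neither ratio is 1 unless r^2 = 1, so neither procedure matches the variance of the instrumental record.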

The only estimation procedure in which predicted and actual temperature will end up with the same variance is what I call “Non-Least Squares” (NLS) regression, ie the variance-matching used by Moberg et al (2005). This NLS procedure has no statistical justification whatsoever, but is apparently what Huybers has in mind.

Considerably different values of RE will be obtained, depending on whether the reconstruction is performed by CCE or ICE. Since MBH are using a form of CCE, similarly constructed simulated forecasts should be used to establish critical values, and not ICE (used, eg by Li, Nychka and Ammann).

If anyone is wrong here, it is evidently Huybers.

Huybers also claims that his Figure 2 is “somewhat at odds” with MM’s statement that their PC1 is “very similar to the unweighted mean of all the series”. In his Figure 2, however, all series are plotted with equal mean and variance in the 1902-1980 period. It would seem to me that normalizing them over the entire period 1400-1980 would be more natural for visual comparison.

On his first page, Huybers notes that MBH normalize each series both by subtracting out the 1902-1980 mean, and also by dividing by the standard deviation *after detrending*, presumably over the same period. He grants that the latter step “seems questionable”, but claims that it “turns out not to influence the results.”

I would think that either step has equal potential for generating hockey sticks all by itself. Even without short-centering, dividing by the standard deviation about a trendline will amplify those series that happen to have a pronounced up- (or down-) trend in the last period, either randomly or because of some spurious factor in the series.
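A toy illustration of the amplification, using two made-up series of my own (nothing here is taken from the MBH data): one has no trend, the other has a steady uptrend plus a small wiggle. Dividing by the standard deviation about a fitted trendline, rather than by the ordinary standard deviation, inflates the trending series roughly threefold and leaves the flat one essentially unchanged.

```python
def plain_std(series):
    m = sum(series) / len(series)
    return (sum((s - m) ** 2 for s in series) / len(series)) ** 0.5

def detrended_std(series):
    """Standard deviation of residuals about an OLS trendline in time."""
    n = len(series)
    mt, ms = (n - 1) / 2.0, sum(series) / n
    stt = sum((t - mt) ** 2 for t in range(n))
    sts = sum((t - mt) * (s - ms) for t, s in enumerate(series))
    slope = sts / stt
    resid = [s - ms - slope * (t - mt) for t, s in enumerate(series)]
    return (sum(r * r for r in resid) / n) ** 0.5

# Made-up series: no trend vs. a steady uptrend with a small wiggle.
flat = [0.5 if i % 2 else -0.5 for i in range(20)]
trending = [0.05 * i + (0.1 if i % 2 else -0.1) for i in range(20)]

# How much larger each series becomes when scaled by the detrended standard
# deviation instead of the ordinary one.
amp_flat = plain_std(flat) / detrended_std(flat)
amp_trending = plain_std(trending) / detrended_std(trending)
print(amp_flat, amp_trending)
```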

PS: The italics at the end of my post #41 above were supposed to end after “before any tests are run”, so I wasn’t trying to emphasize all this text.

My null hypothesis is that MBH has one classic univariate spurious regression, plus 21 or so calibrations against white noise equivalents (all of which would have B=0).

If you do univariate MBH recons with rescaling, the recons tend to overshoot the verification period. If you blend them out with white noise, then you improve the tailored statistics in two ways: 1) you improve the calibration r2 just by having more series and 2) you control the overshoot in the RE statistic. I noticed this overshoot effect in responding to the Huybers Comment. Needless to say, Wahl and Ammann totally ignore this.

So they have a non-null impact on the recons. I’ll try to do a MBH style recon without the “white noise” class of series. My guess is that it will overshoot and be too “cold”.

I’ve been plugging through univariate calibration equations for the MBH99 proxies using the starting univariate formula of Brown 1982 (which is in chapter 1 of Draper and Smith, Applied Regression Analysis, 1981, which uses a method from Williams, 1959). Unsurprisingly, the MBH99 proxies beautifully illustrate the pathologies.

Relative to the “sparse” NH temperature used in verification, nearly all of the MBH99 proxies are indistinguishable from 0. I observed some time ago that you got as “good” a reconstruction using bristlecones and white noise and this is another perspective on this phenomenon – although a more useful vantage point since it ties this insight into a statistical framework.

There tend to be only a very few proxies that survive first stage calibration testing. When one gets into lower order PCs, I can’t imagine that it’s anything other than a total dog’s breakfast as the entire procedure seems to be mining for any sort of relationship with total disregard for statistical testing.

#41. Note that Wahl and Ammann (and MBH) have relied heavily on the argument that one can “get” a HS without using PCs. Their various salvage strategies tend to go from the frying pan into the fire and the criticism will continue to apply to the salvage arguments.

#40 and 41. One of the fundamental themes of Rao’s approach to multivariate normal distributions is that any linear combination of variables is also normal. He uses this strategy to reduce many multivariate problems to univariate problems. Even in cases where more than one temperature PC is reconstructed, the final NH temperature reconstruction is a linear combination of the reconstructed PCs and thus becomes a univariate problem under Rao’s strategy. It would take a bit of work to flesh this out, but I’m 100% sure of the approach.

x*hat = Sum((y*j-aj) bj)/Sum(bj^2),

where the aj and bj are the OLS estimates from regressing the columns of Y on a constant and X. (Your equation simplifies by leaving out the intercepts, but it’s important in the end not to forget them.) A few insignificant or even actually 0 bj values don’t hurt this calculation, since if at least some of the bj’s are nonzero, the denominator is nonzero. But if we can’t reject that the denominator is zero, x*hat is completely meaningless, even though it is computable with probability one.
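As a sketch of how the estimator above behaves, here is a small Python version on made-up calibration data (the function names `ols` and `cce_estimate` are mine, and the proxies are exact linear functions of x purely for illustration). One proxy has b = 0 exactly and contributes nothing to either sum.

```python
def ols(x, y):
    """Return (intercept a, slope b) from regressing y on a constant and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def cce_estimate(x_cal, proxies_cal, y_star):
    """x*hat = Sum((y*j - aj) bj) / Sum(bj^2), summing over proxies j."""
    coefs = [ols(x_cal, yj) for yj in proxies_cal]
    num = sum((ys - a) * b for (a, b), ys in zip(coefs, y_star))
    den = sum(b ** 2 for _, b in coefs)
    return num / den  # fails outright only if every bj is exactly zero

# Made-up calibration data: two proxies that respond to x, one that is flat.
x_cal = [0.0, 1.0, 2.0, 3.0]
proxies_cal = [
    [1.0, 3.0, 5.0, 7.0],   # y1 = 1 + 2x
    [4.0, 3.0, 2.0, 1.0],   # y2 = 4 - x
    [2.0, 2.0, 2.0, 2.0],   # y3 has b = 0 and drops out of both sums
]
y_star = [6.0, 1.5, 9.9]     # proxy readings at an unknown x* (2.5 by construction for y1, y2)
x_hat = cce_estimate(x_cal, proxies_cal, y_star)
print(x_hat)
```

With real data the bj are estimated with error, so the division always goes through; the code gives no warning when the denominator is statistically indistinguishable from zero, which is exactly the problem.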

The near-zero bj’s obviously don’t have much effect on this weighted sum, and so these proxies may as well have been discarded as far as x*hat goes. However, my strong hunch is that discarding them is probably misleading in terms of the appropriately computed confidence interval for x*, through a cherry-picking effect.

A row of B that is zero or jointly insignificantly different from zero is another matter, since it means that the corresponding TPC is unidentified. But IMHO, the TPCs are a waste of time if one is ultimately only interested in global or hemispheric temperature, so I don’t see that anything is lost by just setting p = 1 (where X is nXp). Then the only issue is whether B’s single row is all zeroes.

Either way, the reconstruction step should not be attempted until *after* it has been established with an F test that B is not all zeroes. “Cross-validation” statistics on the reconstruction are commendable, but completely pointless if the reconstruction was meaningless to start with. MBH & Co err by omitting the important first step of making sure their divisor is not zero before dividing by it! Also, the primary test of “skill” of a proxy network is the F-test on B, not any ancillary cross-validation or “verification” statistics.

While PCA and Preisendorfer’s Rule N do not take care of zero B’s, they may help by reducing q and/or p down to a manageable number, since ideally both of these should be small compared to n *before any tests are run that compare the proxies to temperature**. If you have a hundred tree ring series from the American SW, say, it wouldn’t hurt to condense these down to their first few PC’s, using Rule N. And even then if you have say 20 such proxies, it is probably useful to do a second round of PCA and Rule N on the full set of proxies (some of which are PCs already), in order to reduce it to just a few “MacroPC’s”! (Or some such catchy term — MetaPCs?). Just passing Rule N does not mean that these are valid temperature proxies, only that they are objectively noteworthy attributes (perhaps identifiable with precipitation, insolation, etc) of the original set, so it is still important to do that F test for their joint significance versus temperature.*

*I see no harm in discarding contiguous high-order macroPC’s that are collectively insignificant as temperature indicators, so long as the first k are retained for some k. Discarding non-contiguous ones on the basis of their t-tests would have an adverse cherry-picking effect on the apparent significance of the retained ones, but the discipline of considering them in the order of their non-temperature-determined eigenvalues enormously mitigates this effect.*

If we define R2x to be 1 – SSRx/TSSx, where SSRx is the sum of squared residuals of X about its backed-out predicted values, and TSSx is the total sum of squared deviations of X about its mean, we get a different number, because now we are computing TSS from the X’s instead of the Y’s. Unless I am mistaken, R2x = 1 – (1-R2)/R2 = 2 – 1/R2,

I guess that in MBH9x case this is more complex, as calibration result is further transformed from U-vectors to grid-temperatures and then averaged to NH-mean (http://www.climateaudit.org/?p=2969#comment-236757).

IMO the biggest problem in this monster is possible zeros in matrix B of the applied model

If some column of B is a zero vector, the corresponding proxy (column of Y) is a nonsense proxy. If some rows of B are zeros, there’s no response at all to the corresponding X-vector (TPC in the MBH case). The model still holds, but in calibration we eventually have to invert B, so zeros in B are problematic. This is mentioned in almost all statistical calibration papers.

When Y is a vector, tests for zero rows of B simplify to the well-known F-test. An extension of this test to matrix Y can very likely be found in the statistical literature. However, in multivariate calibration papers, AFAIK it is always assumed that rows of E are independent, and thus errors are not correlated in time. In later publications of Mann it is assumed that rows of E are as correlated as rows of Y are. And then Mann proceeds with pseudoproxies, climate model realization + noise with non-zero SNR, i.e. the possibility of zero B is completely ignored. Does Preisendorfer’s Rule N take care of zero Bs? I don’t think so.

MBH98 p. 785 give an equation for what they call the “conventional resolved variance statistic” beta, which appears to be another name for both these things, depending on whether their “ref” period is the calibration or verification period. However, their denominator is clearly wrong, since it is given as the sum of the squared data values in the ref period, rather than the sum of their squared residuals about their average in some period. It makes a big difference which period the average is taken over, so this is not very helpful.

Rutherford et al 2005 (2004 preprint p. 25) give separate equations for RE and CE, which is helpful, making the distinction that in RE the denominator uses the sum of squared deviations about the calibration period mean, while CE uses the sum of squared deviations about the verification period mean. However, they then state that the sums are taken “over the reconstructed values”. The reconstruction period is impossible, since the instrumental values are not known there. They must mean either the calibration period or the verification period, or both together, but which?

Wahl and Ammann (2005) have a lengthy verbal discussion of these statistics in section 2.3 in the text, and their Appendix 1, but don’t bother with any of those pesky equation things. However, between the discussion in the appendix and Rutherford’s equations, I think I see how these are calculated.

Both were evidently proposed by Fritts in his 1976 dendrochronology book as a well-meaning, if atheoretical, attempt to evaluate verification outcomes. They were evidently motivated as extensions of the conventional regression R2 (read R^2) statistic.

If we have a simple regression of some Y observations on a constant and one or more X’s, R2 = 1 – SSR/TSS, where SSR is the sum of squared residuals of the Y’s about their predicted values, and TSS is the total sum of squares, the sum of squared residuals of the Y’s about their average. Note that (absent serial correlation) R2 can be used to compute the standard F test for significance of the slope coefficient(s), and therefore has well-established critical values that depend on the sample size and number of regressors. Evidently the Hockey Team does not deem this test to be worth performing, and instead just looks at the ancillary verification statistics.
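For reference, the link between R2 and the F statistic is a one-liner. This sketch assumes the textbook case (n observations, k slope regressors plus a constant, no serial correlation); the function name `f_from_r2` and the R2 values in the loop are illustrative only, and the critical value would still have to be looked up for (k, n-k-1) degrees of freedom.

```python
def f_from_r2(r2, n, k=1):
    """Standard F statistic for H0: all k slopes are zero, given R2 from n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

n = 79  # e.g. annual data over 1902-1980
for r2 in (0.01, 0.05, 0.20):
    print(r2, round(f_from_r2(r2, n), 2))
```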

If the regression is just run on a calibration period, and then tested against a verification period, RE and CE can, according to WA, be computed for either the calibration or verification period, so that potentially there are 4 separate statistics here. However, they note that in the calibration period, RE = CE = R2, so that in fact there is no point in even talking about a calibration RE or CE: it’s just R2. (This does not necessarily prevent them from referring to R2 as the calibration RE/CE/beta statistic, however.)

In the verification period, RE is, according to their verbal description, computed as

RE = 1 – SSRv/TSSvc,

where SSRv is the sum of squared residuals in the verification period, and TSSvc is the total sum, over the verification period, of squared deviations of the actual values from their mean in the *calibration* period.

CE, on the other hand, is

CE = 1 – SSRv/TSSvv,

where TSSvv instead uses the sum over the same period of the squared deviations of actual values from their mean in the *verification* period. They note that since TSSvv < TSSvc, CE < RE, and they have different properties.
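As I read the verbal description, the two verification statistics differ only in the mean used in the denominator. A minimal sketch with made-up numbers (the function name and data are mine, not WA’s):

```python
def verification_re_ce(actual_ver, predicted_ver, calibration_mean):
    """Return (RE, CE) for the verification period.

    RE uses deviations of the verification actuals from the calibration mean,
    CE uses deviations from the verification mean itself.
    """
    ssr_v = sum((a - p) ** 2 for a, p in zip(actual_ver, predicted_ver))
    ver_mean = sum(actual_ver) / len(actual_ver)
    tss_vc = sum((a - calibration_mean) ** 2 for a in actual_ver)
    tss_vv = sum((a - ver_mean) ** 2 for a in actual_ver)
    return 1.0 - ssr_v / tss_vc, 1.0 - ssr_v / tss_vv

# Made-up verification data: actuals sit well below the calibration mean of 0.
actual_ver = [-0.4, -0.2, -0.3, -0.1]
predicted_ver = [-0.3, -0.3, -0.2, -0.2]
re, ce = verification_re_ce(actual_ver, predicted_ver, calibration_mean=0.0)
print(re, ce)
```

In this example the forecasts get the shift in mean between the two periods about right but track the year-to-year movements poorly, so RE comes out far higher than CE.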

I don’t see that either of these is a particularly relevant way to use the verification outcomes to confirm the model. The sum of squared forecast-variance-adjusted forecasting errors, as I mentioned on the “More on Li, Nychka and Ammann” thread, seems more promising, and may even have a standard F distribution. But be that as it may, MBH et al are wed to this RE statistic.

There is, however, the further complication that MBH are not looking at regression residuals themselves, but at calibration residuals, in which proxies Y have been regressed on temperature X, and then X backed out of observed values of Y. It turns out that this causes a discrepancy that at first confused me between R2 and r^2.

In an ordinary regression in which the predicted values of Y are found by OLS regression on a constant and X, R2 is simply the square of r, the estimated correlation between Y and X, so that R2 = r^2, and they are one and the same thing. However, if predicted values of X are backed out of a regression of Y on X, this identity no longer holds when R2 is computed from the actual and predicted *X* values. If we define R2x to be 1 – SSRx/TSSx, where SSRx is the sum of squared residuals of X about its backed-out predicted values, and TSSx is the total sum of squared deviations of X about its mean, we get a different number, because now we are computing TSS from the X’s instead of the Y’s. Unless I am mistaken,

R2x = 1 – (1-R2)/R2 = 2 – 1/R2,

so that R2x < R2. R2x can easily be negative (whenever R2 < .5), goes to minus infinity when R2 goes to 0, and is 1 when R2 = 1.
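The relation R2x = 2 – 1/R2 is easy to check numerically in the single-proxy case. A quick sketch with made-up data, regressing the proxy y on temperature x and then backing x out of the fitted line:

```python
# Made-up calibration data: x is "temperature", y is a single proxy.
x = [0.1, -0.3, 0.4, 0.0, -0.2, 0.5, -0.5, 0.2]
y = [0.5, -0.2, 0.3, 0.4, -0.6, 0.9, -0.3, -0.2]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

# Ordinary R2 of the regression of y on a constant and x (equal to r^2).
r2 = sxy ** 2 / (sxx * syy)

# Back x out of the fitted line y = a + b x.
b = sxy / sxx
a = my - b * mx
x_hat = [(yi - a) / b for yi in y]

# R2x: one minus the squared residuals of x about its backed-out values over TSSx.
ssr_x = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
r2x = 1.0 - ssr_x / sxx
print(r2, r2x, 2.0 - 1.0 / r2)
```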

Since R2 can be backed out of R2x, and then the F-test for zero slopes backed out of R2, R2x could, in principle, be used to construct a valid test for “skill” in the calibration period, though it would be more straightforward to just use R2 or F from the original calibration equation of Y on X.

Likewise, whatever properties RE (and CE) have when computed from the Y-residuals, they will be very different when computed from the backed-out X-residuals. It might be useful in this case to call them REx and CEx.

Viewed in this light, the WA and Rutherford attack on Pearson’s r as a measure of “skill” is simply misguided. The calibration period r^2 = R2 tells us the standard F-test for statistical significance of the proxies, whether or not MBH stoop to perform this test. Neither r^2 nor R2 purports to be a verification statistic except perhaps in the Team’s idiosyncratic statistics toolbox, which they evidently inherited from Fritts. As a verification statistic, it indeed would be “foolish”, as Mann puts it, to compute r^2. But as a calibration statistic, it would be equally foolish (or should we say innumerate), *not* to compute it. It is unclear whether WA realize there is a difference between R2x and r^2 in their applications.

(r^2 in the verification period could, of course, be used as a valid test of the statistical significance of a calibration in which the roles of the verification and calibration period were reversed. This would be “2-fold cross-validation” in LNA’s terminology. But even then, r^2 would not equal R2x.)

Wahl and Ammann make no attempt to confirm the MBH98 claim that “any positive value of beta [RE? CE?] is statistically significant at greater than 99% confidence as established from Monte Carlo simulations.” (SI p. 2) They instead pass the buck on to Huybers (2005), whom they claim “determined a 99% significance RE benchmark of 0.0 in verification”. I still haven’t looked at Huybers. WA blast MM for allegedly doing this wrong, but can’t be troubled to do it right themselves.

MBH98 SI gives various “Calibration” and “Verification” values of “beta” and “r^2”, but it is unclear how “beta” relates to what WA refer to as RE and CE, or whether by r^2 they mean R2 or R2x.

Note that “verification skill”, however tested, is not a foolproof safeguard against cherry picking of proxies, since the proxies can just as easily have been cherry picked for the size of their verification statistics as for the size of their calibration statistics. Ultimately what matters is how they were selected, not how big their verification statistics are.

It was rejected because one reviewer (Holland, I think) said that the results were all wrong and that the analysis was even “fraudulent” and the other reviewer said that the results were already well-known and well-established in the literature.

I had a similar experience some years ago with a paper on binary arithmetic operations. One reviewer said it was correct but already well-known, and the other said it was wrong (no fraudulence implications, though). I never did go back and resubmit it — too busy at the time.

Wouldn’t the clearer example of your complaint be better directed towards the hurricane fellow? Ie. Steve Mc. made a pretty clever hurricane analysis, (one of quite a few), and a limpy. snagged it without giving due credit? I DO remember that one…
