Huybers' second and (to me) more interesting issue pertains to the benchmarking of the RE statistic. I'm going to start in the middle of this issue. If I start with the history, defining the RE statistic and tracing its use (and I just tried), it's hard to get to the punch line. So what I'm going to do is start with the punch line with minimal history and then go back and put the information into some context.
In our GRL article, having determined that the MBH cross-validation R2 statistic was ~0, we tried to explain the "spuriously" high RE statistic, using "spurious" in the sense of Granger and Newbold or Phillips. I've posted on some of these articles and, while they were not specifically applied, their viewpoint informs our GRL article and even more so our more recent work.
While the RE statistic may be "preferred" by climatologists [Mann], it has many disadvantages as a statistic. It does not have a theoretical distribution. It has fallen outside the notice of theoretical statisticians and its properties have never been studied from a theoretical point of view. It has, however, flourished in the world of dendroclimatology. There we see statements such as the following [Fritts 1991]:
The theoretical distribution of the RE statistic has not been determined adequately so its significance cannot be tested. Any positive value of RE indicates that the regression model, on the average, has some skill and that the reconstruction made with the particular model is of some value.
While dendroclimatologists regularly describe the RE statistic as "rigorous" (and it's remarkable to see the canonical use of this adjective), one cannot say that the derivation of the statistic is "rigorous". Anyway, in reflecting on the seeming contradiction between the high RE and low R2 statistics, it seemed quite possible to us that the significance level of the RE statistic had been benchmarked incorrectly in the particular circumstances of MBH98, which was, after all, a "novel" statistical method. Maybe the rules of thumb of regression models as to RE significance did not carry over to this "novel" methodology.
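For concreteness, here is a minimal sketch of how the two verification statistics in question are conventionally computed over a verification period (the function names and interface are mine, purely for illustration):

```python
import numpy as np

def re_statistic(obs, pred, calib_mean):
    """Reduction of Error: skill of the prediction relative to using the
    calibration-period mean as a constant predictor.  RE = 1 is a perfect
    fit; RE = 0 is no better than the calibration mean."""
    sse = np.sum((obs - pred) ** 2)
    sse_clim = np.sum((obs - calib_mean) ** 2)
    return 1.0 - sse / sse_clim

def r2_statistic(obs, pred):
    """Squared Pearson correlation over the verification period."""
    return np.corrcoef(obs, pred)[0, 1] ** 2
```

Note that RE, unlike R2, is sensitive to the mean and scale of the prediction, not just its shape; that sensitivity is why the re-scaling operations discussed below matter so much.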
MBH themselves did not merely apply the "rule of thumb" but carried out their own simulations to benchmark RE:
Significance levels for β [RE] were estimated by Monte Carlo simulations, also taking serial correlation into account. Serial correlation is assumed to follow from the null model of AR(1) red noise, and degrees of freedom are estimated based on the lag-one autocorrelation coefficients (r) for the two series being compared. Although the values of r differ from grid-point to grid-point, this variation is relatively small, making it simplest to use the ensemble average values of r over the domain (r < 0.2).
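As a sketch of that kind of Monte Carlo benchmark (my own toy version; the calibration/verification split and the simple OLS fit are my assumptions, not a reproduction of the MBH code): generate AR(1) series with modest lag-one autocorrelation, fit each to the target in the calibration period, and tabulate RE in the verification period.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1(n, r):
    """AR(1) red noise with lag-one autocorrelation r."""
    x = np.zeros(n)
    e = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = r * x[t - 1] + e[t]
    return x

def re_stat(obs, pred, calib_mean):
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - calib_mean) ** 2)

def re_critical(target, n_ver, r=0.2, n_sim=500, pct=99):
    """Critical value of RE under an AR(1) null.  Assumed split: the first
    n_ver years are verification, the remainder calibration."""
    vals = []
    for _ in range(n_sim):
        proxy = ar1(len(target), r)
        slope, intercept = np.polyfit(proxy[n_ver:], target[n_ver:], 1)
        pred = intercept + slope * proxy
        vals.append(re_stat(target[:n_ver], pred[:n_ver], target[n_ver:].mean()))
    return np.percentile(vals, pct)
```

Against a trendless target, an AR(1) proxy has essentially no out-of-sample skill and the 99% critical value comes out near zero, which is the MBH-style benchmark. The whole question is whether this null is the right one for data-mined PC1s.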
In our GRL article, we pointed out that this benchmarking methodology was inappropriate as it completely failed to emulate actual MBH procedures, where data-mining principal components methods were being applied to much more persistent series. Our view was that the MBH principal components methodology would regularly yield series that had a high RE against NH temperature. If we were re-writing the article today, we would place even more emphasis on the issues and problems of a few "bad apples" with nonclimatic trends.
Since we had already shown that the MBH PC methodology yielded hockey stick shaped PC1s, we tested the hypothesis that these simulated PC1s would yield high RE (and near-zero R2) statistics against NH temperature. This is a different (and, we think, more subtle) use of the simulated PC1s than the one usually attributed to us. We didn't say that the MBH reconstruction was simply an artifact of their PC methodology (a position regularly attributed to us). We've always emphasized the interaction between flawed proxies (bristlecones) and flawed methods. But we did say that, if your method is biased toward producing hockey sticks, you have to treat any hockey stick shaped series resulting from application of this biased method with a great deal of caution. In our RE benchmarking, we applied this insight by using the biased PC1s to benchmark RE standards. We did not represent this as a complete model of MBH processes (as there were other degrees of freedom in the MBH method that we did not use), but a result obtained only from operations on red noise looked like an excellent way to set a lower limit on RE significance.
We used the simulated PC1s as follows. MBH had fitted proxies to temperature (PCs) by a form of inverse regression (there were other steps as well). So we tried this procedure on the simulated hockey sticks, in our case by simply doing an inverse regression of the simulated PC1s directly on NH temperature. What were the results? These simulated hockey sticks, generated entirely from red noise, frequently had high RE statistics. In fact, the 99% significance level was 0.59 from our simulations, as compared to 0 in the MBH simulations. Accordingly, against these new benchmarks, the MBH RE statistic (0.47) was high but no longer wildly exceptional. Moreover, the pattern of statistics from these simulated PC1s was completely coherent with the observed pattern: they had high RE statistics, but R2 statistics of ~0. So we felt that we had a pretty convincing explanation of why the MBH RE statistic was spurious.
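The high-RE/low-R2 pattern can be reproduced with an entirely artificial toy example (all numbers here are invented for illustration, not taken from the actual simulations): give a "hockey stick" predictor a blade that matches the calibration-period trend, and it earns a high RE in a verification period where its correlation with the target is nil.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ver, n_cal = 50, 79            # 79-year calibration, as in MBH98

# Toy "NH temperature": trendless noise early, warming trend late
target = np.concatenate([0.2 * rng.standard_normal(n_ver),
                         np.linspace(0.0, 1.0, n_cal) + 0.2 * rng.standard_normal(n_cal)])

# Toy hockey stick: flat noisy handle, trending 20th-century blade
stick = np.concatenate([0.1 * rng.standard_normal(n_ver),
                        np.linspace(0.0, 1.0, n_cal)])

# Fit in the calibration period, score in the verification period
slope, intercept = np.polyfit(stick[n_ver:], target[n_ver:], 1)
pred = intercept + slope * stick

sse = np.sum((target[:n_ver] - pred[:n_ver]) ** 2)
sse_clim = np.sum((target[:n_ver] - target[n_ver:].mean()) ** 2)
re = 1.0 - sse / sse_clim
r2 = np.corrcoef(target[:n_ver], pred[:n_ver])[0, 1] ** 2
```

In this toy, `re` comes out well above zero while `r2` is near zero: the blade alone buys RE skill, because a near-constant prediction close to the verification mean beats the (offset) calibration mean, with no verification-period correlation at all.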
Enter Huybers. Huybers observed that our predictors (the simulated PC1s) had a significantly lower variance in the calibration period than the target being modeled (temperature). Huybers stated (without any citation) that the adjustment of the variance of the predictor to the target was a "critical step in the procedure", which we had omitted. Huybers then stated that
"when the MM05 algorithm is corrected to include the variance adjustment step and re-run, the estimated RE critical value comes into agreement with the MBH98 estimate [0.0]."
We’ve attempted, without success, to obtain from Huybers a citation as to why he thinks that variance adjustment is a "critical step in the procedure". GRL would not intervene.
Variance adjustment is a bit of a hot topic in paleoclimate these days. Von Storch et al. attributed the attenuated variance of some multiproxy reconstructions to the use of regression models. (I disagree that this is the correct explanation of the phenomenon of attenuated variance, a feature which I attribute more to a wearing-off of cherry-picking effects.) Esper et al. (the et al. including Briffa) contrast results between regression and scaling. Von Storch is hosting a session at AGU in Dec. 2005 in which the topic is mentioned. (I've submitted an abstract to this session.)
The argument of Esper et al. is that the predictor should have the same variability as the target. In a regression model, the variability of the estimate is necessarily less than the variability of the target, a result familiar to statisticians [e.g. Granger and Newbold, 1973], but a reminder to climatologists was recently issued again by von Storch et al.
If a predictor has less variance than the target, a common naïve response to the desire that the predictor have the same variance as the target is to blow up the variance of the predictor by simply multiplying by the ratio of the standard deviations. Von Storch here strongly criticizes this approach (citing examples of the discouraged practice back to Karl), suggesting, as an alternative, adding white or red noise. Granger and Newbold also discuss this issue, in quite similar terms to von Storch. I mention this background because I'm going to come back to these issues after showing some graphs about what happens in the respective simulations, and place these results in the context of this debate. I think that the results are very surprising.
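The two options can be sketched in a few lines (made-up data, purely to show the mechanics of each adjustment):

```python
import numpy as np

rng = np.random.default_rng(2)
target = rng.standard_normal(200)
# An attenuated regression estimate: right shape, too little variance
estimate = 0.4 * target + 0.1 * rng.standard_normal(200)

# Option 1 (the practice von Storch criticizes): inflate the series
# by the ratio of standard deviations
scaled = estimate * target.std() / estimate.std()

# Option 2 (the suggested alternative): add noise to make up the
# missing variance, leaving the original amplitude intact
deficit = target.var() - estimate.var()
noised = estimate + np.sqrt(deficit) * rng.standard_normal(200)
```

Both series end up with (approximately) the target's variance, but `scaled` gets there by blowing up its amplitude while `noised` keeps the original scale and buries the deficit in noise. As discussed below, this distinction is essentially what separates Huybers' recipe from what the MBH re-scaling step ends up doing.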
MBH98 did not report the use of a variance re-scaling step. However, their recent source code dump (and also the Wahl-Ammann code) shows that they did do a form of re-scaling, but NOT in the manner carried out by Huybers. MBH98 took their North American tree ring PC1 and combined it with 21 other proxies in the AD1400 network. They performed a regression-inversion operation (the details of which I'll describe elsewhere) to obtain a "reconstructed temperature PC1". They re-scaled the variance of the reconstructed TPC1 to that of the observed TPC1. This step was added to the Wahl and Ammann code only in April 2005; I mention this to show that the precise existence of this step in MBH was not obvious. From the temperature PCs, they calculated a NH temperature index.
In his Comment, Huybers simply blew up the variance of the simulated PC1s, rather than carrying out MBH98 procedures, which, as it happens, also blow up the variance to exactly the desired degree, but in a different way and with completely different results.
We studied the effect using MBH98 methods. Once again, we used the same simulated PC1s from our GRL study. However, this time, we made up proxy networks consisting of one simulated PC1 and 21 white noise series (matching the 22 proxies of the AD1400 network) and did an MBH98-type calculation. Figure 1 shows the before-and-after for three PC1s.
Figure 1. Left (before) – Simulated PC1; Right (after) – "NH reconstruction" using simulated PC1 and 21 white noise series.
On the left is the simulated PC1 fitted against NH temperature (as used in our first GRL calculations) and on the right is the NH temperature resulting from the same PC1 combined with 21 white noise proxy series. The shape and scale of the two series are essentially identical; the most visible difference is the added high-frequency variance in the right-hand panel. I hadn't done graphs in this form at the time our Huybers response was finalized a few weeks ago, but I find them extremely instructive. While the MBH98 re-scaling operation seems on the surface to be the simple re-scaling of the predictor criticized in von Storch, the net result is more like an addition of white noise. I've always regarded these proxies as operating like noise, but it is remarkable to see exactly what they do in this before-and-after diagram.
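A grossly simplified caricature of this network experiment can be run in a few lines: one toy hockey stick plus 21 white noise "proxies", with ordinary least squares standing in for the much more elaborate MBH regression-inversion (all numbers invented; this is a sketch of the idea, not of the actual procedure).

```python
import numpy as np

rng = np.random.default_rng(4)
n_ver, n_cal, n_noise = 50, 79, 21
n = n_ver + n_cal

# Toy target: trendless early, trending calibration period
target = np.concatenate([0.2 * rng.standard_normal(n_ver),
                         np.linspace(0.0, 1.0, n_cal) + 0.2 * rng.standard_normal(n_cal)])

def network_re():
    """RE of a 'reconstruction' from one hockey stick + 21 white noise series."""
    stick = np.concatenate([0.1 * rng.standard_normal(n_ver),
                            np.linspace(0.0, 1.0, n_cal)])
    X = np.column_stack([stick] + [rng.standard_normal(n) for _ in range(n_noise)])
    # OLS over the calibration period (stand-in for regression-inversion)
    A = np.column_stack([np.ones(n_cal), X[n_ver:]])
    coef, *_ = np.linalg.lstsq(A, target[n_ver:], rcond=None)
    pred = np.column_stack([np.ones(n), X]) @ coef
    sse = np.sum((target[:n_ver] - pred[:n_ver]) ** 2)
    sse_clim = np.sum((target[:n_ver] - target[n_ver:].mean()) ** 2)
    return 1.0 - sse / sse_clim

res = [network_re() for _ in range(100)]
```

In this toy, the white-noise "proxies" neither add nor subtract much verification skill: the median RE of the runs stays high, inherited from the hockey-stick series alone, with the noise series simply adding variance.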
The difference between Huybers’ rescaling approach and our implementation of MBH methods is illustrated in Figure 2 below, which should explain the profound differences in RE statistics resulting from the two re-scaling approaches.
Figure 2. Top: black – simulated PC1; red – re-scaled according to Huybers' recipe. Bottom: after MBH methodology, as in Figure 1.
The top panel shows (for two different PC1s) the simulated PC1 (black) and the series re-scaled according to Huybers' recipe (red). Huybers' recipe adds variance, but by blowing up the scale of the series. The RE values for the blown-up version (red) are very poor because the scale doesn't match. It is the act of blowing up the scale that changes the RE performance.
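The scale effect can be seen in isolation (again with made-up numbers): take a prediction that tracks the target, multiply it by a constant, and RE collapses even though the correlation is untouched.

```python
import numpy as np

rng = np.random.default_rng(3)
obs = rng.standard_normal(100)
pred = obs + 0.3 * rng.standard_normal(100)   # tracks the target well
blown = 3.0 * pred                             # same shape, inflated scale

def re(o, p, m):
    """RE relative to a (toy) calibration mean m."""
    return 1.0 - np.sum((o - p) ** 2) / np.sum((o - m) ** 2)

re_good = re(obs, pred, obs.mean())
re_blown = re(obs, blown, obs.mean())
```

Here `re_good` is strongly positive while `re_blown` is negative, yet the two predictors have identical correlation with the target; RE punishes the scale mismatch that correlation-based statistics ignore.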
The bottom panel shows the result of adding variance using the MBH method and white noise. (This figure re-states the results of Figure 1.) Because the scale is not wrecked in adding variance, the RE properties of the simulated PC1s essentially carry forward into the bottom panel "predictors", which also have high RE statistics against NH temperature, with a critical value also in excess of 0.5. In retrospect, I would probably do the Huybers reply a little differently today than a couple of weeks ago, although the space constraints make it pretty hard to give a detailed reply.
“Real” Proxies versus White Noise
An obvious question arising from the above: can you simply model the vaunted MBH98 proxies as white noise? With all the emphasis on persistent red noise in the tree ring networks, shouldn't you be using red noise?
I'm not saying that I won't get to a red noise model, but it's always a good idea to start with white noise and see what adding redness does. (I did this with the tree ring PCs as well.) Here we get interesting results simply using white noise. The reason for this may be that we're at a different stage of the MBH98 procedure: before, we were talking about principal components analysis, where the redness made a difference; here we're doing regression-inversions.
To test the difference between MBH98 proxies and white noise, I tried the following experiments, illustrated in Figure 3 below (see explanation below the figure). I discuss RE results (the "preferred" metric for climatological reconstructions) for each panel. As a benchmark, the RE statistic for the MBH98 AD1400 reconstruction (second right) is 0.46, said to be exceptionally significant.
Figure 3. Left – Tech stocks; right – MBH. Top – Tech PC1 and Gaspé-NOAMER PC1 blend; Middle – plus network of actual proxies; Bottom – plus network of white noise.
Some time ago, I posted up the "Tech PC1", which I obtained by replacing bristlecones with weekly tech stock prices, as an amusing illustration that Preisendorfer significance in a PC analysis did not prove that the PC was a temperature proxy. I re-cycled the Tech PC1 to compare its performance in climate reconstruction against that of the NOAMER PC1 (actually a blend of the NOAMER PC1 and Gaspé – the two "active ingredients" in the 15th century hockey stick).
In the top panels, I fitted both series against NH temperature in a 1902-1980 calibration period. The top right panel shows the Tech PC1 (red), together with MBH (smoothed – blue) and the CRU temperature (smoothed – black). The "Tech PC1" has an RE of 0.49, slightly out-performing the NOAMER PC1-Gaspé blend (RE: 0.46), the MBH98 reconstruction (RE: 0.46) and the MBH98 step itself (RE: 0.47), though it does have a lower variance (in Huybers' terms). In the most simple-minded spurious significance terms, this should by itself evidence the possibility of spurious RE statistics. Both the Tech PC1 and the Gaspé-NOAMER blend have less variance than the target (and than the MBH98 reconstruction itself).
The middle panels show the effect of making up a proxy network with the other MBH98 proxies in the AD1400 network. In both cases, variance is added to the top panel series in exactly the same way as in the examples with simulated PC1s. The RE for the Tech PC1 is lowered slightly (from 0.49 to 0.46) and remains virtually identical to the RE of the MBH98 reconstruction.
Now for some fun. The third panel shows the effect of using white noise in the network instead of actual MBH proxies. In each case, I did small simulations (100 iterations) to obtain RE distributions. For the "Tech PC1 reconstruction", the median RE was 0.47 (99%: 0.59), while for the MBH98 case, using the NOAMER-Gaspé blend plus white noise proxies, the median RE was 0.48 (99%: 0.59). Thus, in a majority of runs, the RE statistic improves with the use of white noise instead of actual MBH98 proxies. The addition of variance using white noise is almost exactly identical to the addition of variance using actual MBH98 proxies.
Doubtless, these results can be re-stated more formally; they are hot off the press. At a minimum, the experiment shows the reasonableness of modeling the other 20 proxies as white noise for the purpose of the Reply to Huybers, since there is no evidence that the MBH98 proxies out-perform white noise. (I will on another occasion post up some information about correlations of these "proxies" to gridcell temperature.)
This is off topic for the purpose of replying to Huybers, but, since it's my blog, I guess I can digress. What does all of this mean, aside from merely rebutting Huybers (which is pretty much in hand)? These results, which I find remarkable, tell me a lot about what is going on in the underlying structure of MBH98, which was, if you recall, a "novel" statistical methodology. Maybe the "novel" features should have been examined. (Of course, then they'd have had to say what they did.)
When I see the above figures, I am reminded of the following figure from Phillips, where he observed that you can represent even smooth sine-generated curves by Wiener processes. The representation is not very efficient: Phillips' diagram required 125 Wiener terms to get the representation shown below.
Figure 4. Original Caption: The series f(r) = r^2 on -π ≤ r ≤ π, extended periodically.
Phillips' Figure 2 is calculated using 1000 observations and 125 regressors. In the MBH98 regression-inversion step, the period being modeled is only 79 years, using 22 (!) different time series (a ratio of about 4), increasing to even more "proxies" in later periods. My suspicion right now is that the role of the "white noise proxies" in MBH98 works out as being equivalent to a "representation" of the NH temperature curve more or less like Figure 2 from Phillips. The role of the "active ingredients" is distinct and is more like a "classical" spurious regression. I find the combination pretty interesting.
Much of my earlier energy was spent just trying to figure out what was done in MBH98; now I'm thinking about it in more theoretical terms, and you can see some of my motivation for wading through the difficulties of Phillips.