I’m working on a long note, but probably a short interim note will make some sense. From the point of view of simply advancing our particular take on the MBH98 mess, as between Wahl et al and von Storch et al, there are pros and cons to us for either side being right. So I’d be inclined to say that we don’t have a dog in this particular race and that I can look at it pretty objectively. Right now, for what it’s worth, I’m inclined to think that VZGT have much the stronger position in the current dispute. Given my personal experience with the various parties, this seemed likely at the outset as well.
I’ve taken some time to re-read VZGT 2004 and have a better appreciation for it now than I did before, which I’ll discuss some now and more on a future occasion. The exercise of re-reading has been highly instructive as I think that I’m now getting into a position to pull together the various disparate threads of this small industry of MBH commentary – starting with the MMs, continuing with VZGT04, Bürger and Cubasch 2005, plus all the comments and replies.
For now I just want to present what VZGT actually say, so that there’s at least a blog record of it without going through the poisonous realclimate filter.
Von Storch et al 2004
Let’s go back to VZGT 2004. The figure below is perhaps their key graphic. It showed results from an "MBH98 method" (which in this case involved the use of detrended proxies) using pseudoproxies. Let’s leave aside (for now) the issues of whether detrending is "right" and merely attend to the results under this particular assumption. I’ve blown up the image so that it can be seen clearly, as I found the coloring in the original figure hard to follow at its original scale. On another occasion, I’m going to try to place the "MBH98 method" within the range of multivariate techniques known to the rest of mankind (for now, grant me that it can be construed as a form of one-stage partial least squares – this follows fairly directly from my posts on MBH Linear Algebra, but I haven’t posted this up).
For now, let’s pretend that VZGT04 presented the graph below as an abstract presentation on the technical limitations of one-stage partial least squares methods with detrended calibration (with no named adversary). Would anyone be surprised that the reconstructions increasingly departed from the target as the proportion of noise in the proxies increased? Of course not. In the context of our empirical studies of actual MBH98 proxies, I’d say that the study left off at much too low a noise level – you need to go to 95% or even 99% or even 100% noise – and/or didn’t adequately explore the impact of contaminated or unrelated proxies. But that doesn’t invalidate the information in the graph below. So despite anything that Wahl et al said or any spin by realclimate, the position in VZGT 2004 Figure 2 below, as far as it goes, seems unassailable to me.
VZGT Fig. 2 Original Caption. (A) The Northern Hemisphere annual temperature evolution over the last 1000 years. The NH annual temperature simulated by the model ECHO-G and MBH98 reconstructions of this temperature from 105 model gridpoints mimicking the multi-proxy network of MBH98. Increasing amounts of noise have been added to the gridpoint temperatures to mimic the presence of other than temperature signals in the proxies. The corresponding local correlation is also indicated. The 2-sigma uncertainty range (derived as in MBH98 from the variance of the interannual residuals) for the different noise levels is indicated. The reconstruction with ρ=0.5 is shown with its 2×sigma uncertainty range. (B) The spectra of the NH annual temperatures shown in (A).
In partial least squares methodology, there’s a scaling decision that has to be made in each case and I don’t know how VZGT 2004 actually re-scaled the above graphic. There’s a great deal of nonsense being published on scaling and regression by paleoclimate people right now – think of almost any publication by IPCC 4AR Lead Author Briffa – but not by the von Storch group. An earlier von Storch paper is an illuminating take on this. It’s cited in VZGT04, but the citation is easy to miss (I didn’t heed it); von Storch referred me to it last year in private correspondence. It’s useful for understanding what’s involved in re-scaling and may be relevant to what VZGT are doing.
The caveat in that von Storch paper is that blowing up (inflating) the variance of a predictor is not necessarily a "correct" way of matching it to the variance of a target. Von Storch observed that there may be material differences and that it might be more appropriate to add white noise or red noise to the predictor if you want to match variance. Von Storch criticized inflation procedures in Karl and, while he did not mention MBH98, it turns out that MBH98 practiced a similar form of inflation. This doesn’t help much in finalizing an interpretation of the above figure and I’ll try to clarify what VZGT did, as I simply don’t know.
We do know (now) what MBH98 did with their PLS estimator – they re-scaled the variance of the PLS estimator of the temperature principal components in the calibration period to the variance of the target series in the calibration period. This re-scaling at the RPC step has been troublesome for people attempting to replicate MBH98 as it was not mentioned in the original text and different guesses as to when and how re-scaling or re-fitting took place can be plausibly made. The matter came up in the Huybers Comment, our Reply to Huybers; there’s a late note in Wahl and Ammann code showing an 11th hour change in how they dealt with this step presumably based on personal feedback from Mann. In any event, the rescaling step in MBH98 methodology is now clear. It should be mentioned that the MBH98 re-scaling procedure is not a law of statistics and is not mentioned in any statistical text. It’s a method used in MBH to deal with an indeterminacy in PLS methods. It’s not "wrong", but neither is it "right". Other methods and other choices could be plausibly argued on an a priori basis. Bürger and Cubasch described this type of decision as a "flavor" and that’s not a bad analogy.
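As a rough illustration of the re-scaling step just described, and assuming for simplicity that both series are expressed as anomalies from their calibration-period means (my simplification, not something taken from MBH98), variance matching amounts to multiplying the estimate by a ratio of calibration-period standard deviations. The symbols are mine: \hat{u}_t is the raw PLS estimate of a temperature principal component, u_t is the target and s_cal(.) is the standard deviation over the calibration period.

```latex
\tilde{u}_t \;=\; \frac{s_{\mathrm{cal}}(u)}{s_{\mathrm{cal}}(\hat{u})}\,\hat{u}_t
\qquad\Longrightarrow\qquad
s_{\mathrm{cal}}(\tilde{u}) \;=\; s_{\mathrm{cal}}(u)
```

The ratio is fixed by the calibration period alone, so outside that period the scaling is simply carried along; nothing in it guarantees that variance is matched there.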
So while I don’t exactly know the provenance of the VZGT 2004 figure, my guess is that you might very well get a result like this for smoothed curves, even with a variance matching exercise. (I’m not 100% sure of this.) As the proportion of noise increases, it gets harder and harder to pick out a signal with PLS or any other method. Think about a limiting case with 100% white noise or 100% red noise. In such circumstances, you can still match variance in the calibration period, but you’re not going to recover any information about the "signal". Or consider limiting cases with spurious information (dot.com stock prices) and otherwise white noise. You’ll recognize this line of argument from our work. It’s coming at the situation from the other direction.
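The limiting-case argument can be tried out with a toy experiment. The sketch below is not the VZGT or MBH98 code: it is a simplified stand-in in which each pseudoproxy is weighted by its calibration-period covariance with the target (the first component of a univariate PLS fit) and the composite is then variance-matched in the calibration period. The function name, the toy "temperature" series, the period lengths and the number of proxies are all illustrative assumptions.

```python
import math
import random
import statistics


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def pseudo_reconstruction(noise_fraction, n_years=1000, n_calib=100,
                          n_proxies=50, seed=0):
    """Toy one-component PLS-style reconstruction (hypothetical set-up).

    Returns (correlation with the signal before the calibration period,
    ratio of reconstruction sd to target sd within the calibration period).
    """
    rng = random.Random(seed)
    # Toy "true" temperature: slow oscillation plus interannual variability.
    signal = [0.5 * math.sin(2 * math.pi * t / 250) + rng.gauss(0, 0.2)
              for t in range(n_years)]
    mean_s, sd_s = statistics.mean(signal), statistics.pstdev(signal)
    unit_signal = [(s - mean_s) / sd_s for s in signal]

    # Pseudoproxy = signal share + white-noise share, unit total variance.
    a, b = math.sqrt(1 - noise_fraction), math.sqrt(noise_fraction)
    proxies = [[a * s + b * rng.gauss(0, 1) for s in unit_signal]
               for _ in range(n_proxies)]

    calib = range(n_years - n_calib, n_years)   # last n_calib years
    early = range(0, n_years - n_calib)         # everything before that
    target_cal = [signal[t] for t in calib]
    tbar = statistics.mean(target_cal)

    # Weight each proxy by its calibration-period covariance with the target.
    weights = []
    for p in proxies:
        p_cal = [p[t] for t in calib]
        pbar = statistics.mean(p_cal)
        cov = sum((x - pbar) * (y - tbar)
                  for x, y in zip(p_cal, target_cal)) / n_calib
        weights.append(cov)
    composite = [sum(w * p[t] for w, p in zip(weights, proxies))
                 for t in range(n_years)]

    # Variance matching: rescale so the calibration-period sd equals the target's.
    comp_cal = [composite[t] for t in calib]
    cbar = statistics.mean(comp_cal)
    factor = statistics.pstdev(target_cal) / statistics.pstdev(comp_cal)
    recon = [tbar + factor * (c - cbar) for c in composite]

    r_early = correlation([recon[t] for t in early], [signal[t] for t in early])
    sd_ratio = statistics.pstdev([recon[t] for t in calib]) / statistics.pstdev(target_cal)
    return r_early, sd_ratio


for fraction in (0.5, 0.9, 1.0):
    r_early, sd_ratio = pseudo_reconstruction(fraction)
    print(f"noise {fraction:.0%}: pre-calibration r = {r_early:.2f}, "
          f"calibration sd ratio = {sd_ratio:.2f}")
```

By construction the calibration-period variance is matched at every noise level, including 100% noise; what changes is how much of the pre-calibration signal the composite tracks.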
Reply to Wahl et al 
Now let’s look at the corresponding figure in the Reply to Wahl et al. Again, let’s not use the term MBH98 method; let’s continue to use the term one-stage partial least squares with and without detrending, with red noise and white noise. Directionally, the results are the same as before. In the white noise scenario, signal recovery using one-stage PLS without detrending is better than one-stage PLS with detrending. However, with red noise, the results using one-stage PLS are virtually identical with and without detrending. Obviously white noise is a very unrealistic assumption, not simply in climate series but especially with tree ring chronologies, which, if nothing else, are reservoirs of red noise. Realclimate huffs and puffs against red noise, but it seems to me that the effect originally described in VZGT 2004 survives unscathed and has even been clarified a little. Again, the results in the Figure below seem unarguable to me.
Update: Apr 30 – Eduardo Zorita confirmed that the data below was re-scaled.
Reply to WRA Fig. 1. Northern Hemisphere temperature deviations from the 1900 to 1998 mean, simulated and pseudoreconstructed from a network of pseudoproxies and three implementations of the MBH98 reconstruction method (2): with detrended and nondetrended calibration using white-noise pseudoproxies with 75% noise variance; and, additionally, with nondetrended calibration and red-noise pseudoproxies with the same amount of total noise variance, constructed from a AR-1 process with 0.7 1-year autocorrelation. One hundred Monte Carlo realizations of the noise were used to estimate the median and the 5% to 95% range. Two climate models were used, ECHO-G (left) and HadCM3 (right). Scale on the right is half that on the left.
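For readers who want to see what a red-noise pseudoproxy of the kind described in the caption looks like in practice, here is a minimal sketch. It takes the two numbers from the caption (75% noise variance, AR-1 with 0.7 lag-one autocorrelation); everything else, including the function names and the assumption that the signal has already been standardized to unit variance, is mine and not taken from the VZGT code.

```python
import math
import random


def ar1_noise(n, phi=0.7, rng=None):
    """AR(1) series with lag-one coefficient phi and unit marginal variance."""
    rng = rng or random.Random(0)
    innovation_sd = math.sqrt(1 - phi * phi)  # keeps the variance at 1
    x = [rng.gauss(0, 1)]                     # start from the stationary distribution
    for _ in range(n - 1):
        x.append(phi * x[-1] + innovation_sd * rng.gauss(0, 1))
    return x


def red_noise_pseudoproxy(unit_signal, noise_fraction=0.75, phi=0.7, rng=None):
    """Mix a unit-variance signal with AR(1) noise.

    noise_fraction is the share of total proxy variance due to noise,
    assuming the input signal really does have unit variance.
    """
    noise = ar1_noise(len(unit_signal), phi, rng)
    a = math.sqrt(1 - noise_fraction)
    b = math.sqrt(noise_fraction)
    return [a * s + b * e for s, e in zip(unit_signal, noise)]


def lag1_autocorrelation(x):
    """Sample lag-one autocorrelation."""
    m = sum(x) / len(x)
    num = sum((x[i] - m) * (x[i - 1] - m) for i in range(1, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den


noise = ar1_noise(20000, phi=0.7, rng=random.Random(42))
print("sample lag-1 autocorrelation:", round(lag1_autocorrelation(noise), 2))
```

Setting phi to zero gives the white-noise case with the same total noise variance, which is the comparison the caption describes.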
So what are Wahl et al and realclimate gloating about? Let’s look closely at what Wahl et al said, this time watching the pea under the thimble. Wahl et al did not disprove any specific result reported by VZGT about one-stage partial least squares methods – just as they have never disproved any specific result that we’ve ever reported. Their comment is the usual Hockey Team procedure – isolate some point where parties failed to replicate some poorly disclosed aspect of MBH98 procedure, something which is never mandated as a statistical procedure, and shout loudly.
If you think about what VZGT04 are actually doing, surely it is most reasonably construed as setting limits on what performance you can expect from a one-stage partial least squares method – both with perfect pseudoproxies and under noise assumptions (which I view as being inappropriately optimistic). Wahl et al do not show that these performance limits are wrong or incorrectly calculated. They just shout. When I find out exactly what’s going on in the re-scaling step in VZGT, I’ll be in a better position to comment further, but right now I don’t see a problem in relying on these specific VZGT results.
There are other VZGT and VZ assertions that I don’t buy. I think that they are seriously wrong from Zorita et al 2003 on in thinking that you can’t allocate weights to any given proxy and that the MBH one-stage PLS procedure is "robust". I think they really grabbed the wrong end of the stick here and, when they see this, we’ll get to a synthesis.
Back to Wahl et al, who, in effect, claim that the MBH98 reconstruction out-performs the theoretical limits of the partial least squares method. Thus, despite the fantastic noise levels of MBH proxies, the MBH reconstruction supposedly out-performs reconstructions with near-perfect pseudoproxies – and, considering the results in the VZGT Reply, this supposedly happens with and without detrended calibration. How is this possible? What accounts for this remarkable "achievement"?
Answering this takes us into MM world where we deal with the topics that are left out by VZGT and Zorita et al 2003 – the world of flawed proxies, the world of total noise, the world of spurious regression, the world of cherrypicking, failed (and unreported) verification r^2 statistics. It takes us into the world of non-robustness, first hinted at in MM03, expanded in MM05 and placed into a broader context by Bürger and Cubasch. Within a family of high calibration r^2 fits thrown up by various partial least squares (and other plausible methodological) alternatives, you get a wide choice of verification mean. (Also see Briffa et al 2001 for another example of wildly non-robust alternatives.) The situation is fraught with temptations to tune on the verification period mean.
If you use the verification period mean to select your model, as B&C astutely point out, you no longer have a statistic left to check against overfitting. (One point here – in the calibration period fit with ultra-high noise MBH proxies, the proxies are close to orthogonal and thus the PLS fit is like a multiple regression of temperature on 22-112 proxies. No wonder you can get good calibration r^2 statistics. This is also why calibration period residuals are totally inappropriate for confidence interval estimation.)
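The parenthetical point about calibration r^2 can be checked with a toy calculation. The sketch below regresses a pure-noise target on pure-noise predictors (so there is no signal at all), fits only on the calibration rows and then scores the same fit on held-out rows. It is ordinary least squares done by Gram-Schmidt orthogonalization, not the MBH98 procedure; the function name and the period lengths (79 calibration rows, 48 verification rows) are illustrative assumptions, and the predictor counts are simply picked from within the 22-112 range mentioned above.

```python
import math
import random


def fit_and_verify(n_predictors, n_calib=79, n_verif=48, seed=1):
    """Return (calibration r^2, verification r^2) for a regression on noise.

    Requires n_predictors < n_calib - 1 so the fit is not exactly saturated.
    """
    rng = random.Random(seed)
    n = n_calib + n_verif
    cal = range(n_calib)        # rows used to fit
    ver = range(n_calib, n)     # held-out rows

    def cal_dot(u, w):
        # Inner product over calibration rows only.
        return sum(u[i] * w[i] for i in cal)

    def centered(v):
        # Remove the calibration-period mean (plays the role of an intercept).
        m = sum(v[i] for i in cal) / n_calib
        return [x - m for x in v]

    # Target and predictors are independent white noise: nothing to recover.
    y = centered([rng.gauss(0, 1) for _ in range(n)])
    fitted = [0.0] * n
    basis = []
    for _ in range(n_predictors):
        v = centered([rng.gauss(0, 1) for _ in range(n)])
        # Orthogonalize against earlier predictors using calibration rows,
        # but carry the same linear combination through the held-out rows.
        for q in basis:
            c = cal_dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(cal_dot(v, v))
        q = [vi / norm for vi in v]
        basis.append(q)
        c = cal_dot(y, q)
        fitted = [fi + c * qi for fi, qi in zip(fitted, q)]

    def r_squared(rows):
        u = [fitted[i] for i in rows]
        w = [y[i] for i in rows]
        mu, mw = sum(u) / len(u), sum(w) / len(w)
        suw = sum((p - mu) * (s - mw) for p, s in zip(u, w))
        suu = sum((p - mu) ** 2 for p in u)
        sww = sum((s - mw) ** 2 for s in w)
        return suw * suw / (suu * sww)

    return r_squared(cal), r_squared(ver)


for k in (22, 60):
    cal_r2, ver_r2 = fit_and_verify(k)
    print(f"{k} noise predictors: calibration r^2 = {cal_r2:.2f}, "
          f"verification r^2 = {ver_r2:.2f}")
```

Under these assumptions the expected calibration r^2 is roughly the number of predictors divided by the number of calibration rows minus one, so it climbs with the predictor count even though nothing is being recovered.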
All in all, it’s hard to imagine a worse statistical method.
This covers a lot of ground and I’m going to be tied up for the rest of the week-end. But you can see how one can start pulling together the various strands of MM, VZGT and B-C. Instead of the Wahl et al Comment acting as a vindication of MBH, I predict that it will be a type of catalyst. I think that placing MBH98 regression methods in the context of partial least squares methods will prove to be important in placing this type of work in terms that applied statisticians will understand without having to wade through pages of inflated Mannian, mini-Mannian and Ammannian commentary. Once the applied statistics community understands what’s going on, and there are signs of interest, that will be the end of the line for MBH98.
I know that they’ve "moved on". But RegEM is another peculiar method, whose statistical properties can’t be read about in a standard statistical text. Mann claims that the similarity of RegEM and MBH98 results shows that MBH is all right. I suspect the exact opposite: when a method with unknown properties gives results similar to those from a method known to be flawed, that suggests to me that it has some serious problems as well. No one has ever studied Mann’s application of RegEM to multiproxy reconstructions in a critical way, but doubtless someone will.