"7 c. Did you calculate the R2 statistic for the temperature reconstruction, particularly for the 15th Century proxy record calculations and what were the results?"
"My colleagues and I did not rely on this statistic in our assessments of “skill” (i.e., the reliability of a statistical model, based on the ability of a statistical model to match data not used in constructing the model) because, in our view, and in the view of other reputable scientists in the field, it is not an adequate measure of “skill.”
My previous discussion was based on the Supplementary Information, where the cross-validation R2 statistic is notably not reported. However, I did not discuss the following figure from the Nature article itself, which is well worth examining.
Original Caption to MBH98 Figure 3. Spatial patterns of reconstruction statistics…. For the r2 statistic, statistically insignificant values (or any gridpoints with unphysical values of correlation r < 0) are indicated in grey. The colour scale indicates values significant at the 90% (yellow), 99% (light red) and 99.9% (dark red) levels (these significance levels are slightly higher for the calibration statistics which are based on a longer period of time). A description of significance level estimation is provided in the Methods section…. The Methods section referenced in the caption says only: "For comparison, correlation (r) and squared-correlation (r2) statistics are also determined."
Before discussing the above figure, note that Mann’s full response to the House Committee on this question was as follows:
A(7C): The Committee inquires about the calculation of the R2 statistic for temperature reconstruction, especially for the 15th Century proxy calculations. In order to answer this question it is important to clarify that I assume that what is meant by the “R2” statistic is the squared Pearson product-moment correlation, or r2 (i.e., the square of the simple linear correlation coefficient between two time series) over the 1856-1901 “verification” interval for our reconstruction. My colleagues and I did not rely on this statistic in our assessments of “skill” (i.e., the reliability of a statistical model, based on the ability of a statistical model to match data not used in constructing the model) because, in our view, and in the view of other reputable scientists in the field, it is not an adequate measure of “skill.” The statistic used by Mann et al. 1998, the reduction of error, or “RE” statistic, is generally favored by scientists in the field. See, e.g., Luterbacher, J.D., et al., European Seasonal and Annual Temperature Variability, Trends and Extremes Since 1500, Science 303, 1499-1503 (2004).
The last sentence deserves some analysis since, other than the Luterbacher article cited here by Mann, I have been unable to find an article in which the RE statistic is used without the R2 statistic also being quoted. Last year at realclimate, Mann cited Cook et al. as additional support for this position, but dropped this citation after I pointed out that Cook et al. also report R2 statistics (which were significant). It would be nice to check Luterbacher’s work to see whether his R2 statistics are significant, but, unfortunately, Luterbacher has not archived any data to check this. (He published in Science, which has a poor track record in this regard – I’ll post about this some time.) However, a full analysis of this last sentence will have to wait for another day.
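For readers who want the two statistics concretely, here is a minimal sketch in Python (my own illustration with made-up numbers, not MBH98 code or data). The verification r2 is the squared Pearson correlation between observed and reconstructed series over the withheld interval; the RE statistic benchmarks the squared reconstruction error against a "no-knowledge" forecast of the calibration-period mean. The toy series below shows why the distinction matters: a reconstruction that merely gets the mean level about right can post a strongly positive RE while its r2 is essentially zero.

```python
import numpy as np

def verification_r2(obs, recon):
    # Squared Pearson correlation over the verification interval
    return np.corrcoef(obs, recon)[0, 1] ** 2

def verification_re(obs, recon, calib_mean):
    # Reduction of error: 1 - SSE/SSM, where the benchmark forecast
    # is simply the calibration-period mean
    sse = np.sum((obs - recon) ** 2)
    ssm = np.sum((obs - calib_mean) ** 2)
    return 1.0 - sse / ssm

# Made-up 48-year "verification" period (cf. the withheld 1854-1901 data)
rng = np.random.default_rng(0)
obs = np.linspace(0.0, 0.5, 48) + rng.normal(0, 0.2, 48)
recon = np.full(48, obs.mean()) + rng.normal(0, 0.05, 48)  # flat, but at the right level

print(verification_r2(obs, recon))        # ~0: no year-to-year skill
print(verification_re(obs, recon, -0.3))  # well above 0: level beats the benchmark
```

RE rewards getting the level right relative to the calibration mean, while r2 rewards tracking the year-to-year variations, which is why the two statistics can disagree sharply on the same reconstruction.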
In the original SI, the cross-validation R2 statistic was not reported. You can see columns for calibration beta (which is equivalent to the calibration period R2) and for the verification beta, plus some r^2 and g^2 statistics pertaining to Nino, but, if you look closely, there is no verification R2 statistic. We remarked on this in MM05a and MM05b.
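As to why the calibration beta column can double as a calibration period R2: for a least-squares fit, an RE-type resolved-variance statistic computed over the calibration period itself reduces to the ordinary R2. Here is a quick numerical check (synthetic data, my own sketch, not the MBH98 calculation):

```python
import numpy as np

rng = np.random.default_rng(1)
proxy = rng.normal(size=100)
temp = 0.6 * proxy + rng.normal(0, 0.5, 100)   # synthetic "calibration" data

slope, intercept = np.polyfit(proxy, temp, 1)  # OLS fit of temperature on proxy
fitted = slope * proxy + intercept

r2 = np.corrcoef(temp, fitted)[0, 1] ** 2
beta = 1 - np.sum((temp - fitted) ** 2) / np.sum((temp - temp.mean()) ** 2)
print(r2, beta)  # equal (to machine precision) over the calibration period
```

Out of sample the equivalence breaks down, which is why a separate verification R2 column would have carried real information.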
We then pointed out that the source code shows that the cross-validation R2 statistic was calculated at each step, by the same code that produced the table in the SI. A poster at timlambert.org recently drew attention to this Figure, presumably to refute the idea that the R2 statistic had not been presented in MBH98. So what does Figure 3 in the original article show?
The text says the following:
In the reconstructions from 1820 onwards based on the full multiproxy network of 112 indicators, 11 eigenvectors are skilfully resolved (nos 1–5, 7, 9, 11, 14–16) describing ≈70–80% of the variance in NH and GLB mean series in both calibration and verification. (Verification is based here on the independent 1854–1901 data set which was withheld; see Methods.) Figure 3 shows the spatial patterns of calibration β, and verification β and the squared correlation statistic r2, demonstrating highly significant reconstructive skill over widespread regions of the reconstructed spatial domain. 30% of the full spatiotemporal variance in the gridded data set is captured in calibration, and 22% of the variance is verified in cross-validation. Some of the degradation in the verification score relative to the calibration score may reflect the decrease in instrumental data quality in many regions before the twentieth century rather than a true decrease in resolved variance.
So what we have here in Figure 3 is a graphic showing cross-validation R2 statistics by gridcell for the AD1820 step, which uses 112 "proxy" series and whose success differs markedly from that of the controversial 15th century step. The "full" multiproxy network of 112 "proxies" includes 12 instrumental temperature series, which are hardly "proxies" for temperature. One would expect some "skill" in reconstructing temperature, especially in the northern Europe area, when you are spotted 12 actual temperature series. In this network, the cross-validation R2 statistics were favorable, and they were not only reported but presented in a prominent graphic.
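For what it's worth, the computation that a map like Figure 3 summarizes is easy to sketch (hypothetical arrays below, not the MBH98 data or code): for each gridcell, correlate the withheld-period observations with the reconstruction, grey out any cell with r < 0, and map r2 for the rest.

```python
import numpy as np

def r2_map(obs, recon):
    """obs, recon: (n_years, n_gridcells) arrays over the withheld
    verification interval. Returns per-cell r2; cells with r < 0
    come back as NaN (the grey cells in the MBH98 figure)."""
    out = np.full(obs.shape[1], np.nan)
    for j in range(obs.shape[1]):
        r = np.corrcoef(obs[:, j], recon[:, j])[0, 1]
        if r >= 0:
            out[j] = r ** 2
    return out
```

The identical computation can be run on any step of the reconstruction; the issue here is which step's map was shown.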
MBH posted up a graphic demonstrating high cross-validation R2 statistics for the one step in which they obtained high cross-validation R2 statistics. Most readers would conclude that similar results applied in the other steps. However, if the SAME graphic is drawn for the AD1400 network (I’ve done this and will try to locate it and post it up), the result is extraordinarily unfavorable. They provided results for the AD1820 step, but not for the AD1400 step, which is the controversial one. So it’s a little hard to reconcile MBH98 Figure 3 with the answer to question 7C to the House Committee.