Jean S has sent the following longer contribution on Rutherford et al 2005. I always appreciate Jean S’s thoughtful comments (which is no secret to readers here). So enjoy. JeanS:
Now since the review of Burger and Cubash [BC06 here after] put Rutherford (2005) [R05] back on the table, there is an issue to which I’d like to draw readers’ attention. This issue went completely unnoticed by the AR3 of BC06 [I wonder if AR3 was a reviewer for R05 also, if so, what were his/her comments]. IMO, the issue is not fully noticed by BC, either. Namely that R05 does not use the RegEM algorithm by Schneider(2001), but a modified version of it. In fact, there are at least three completely ad-hoc modifications introduced in R05, I’ll discuss them below in some detail.
As a general note, one should recognize that any (optimal) properties proved for RegEM can not be claimed for the R05 version, since it has not been shown anywhere that these (ad-hoc) modifications introduced do not change the properties.
1) “Hybrid frequency modification”
This step is “justified” in R05 with a lot of handwaving without any indication of its mathematical soundness. There are some RE values given for the reconstruction, but those values seem to differ little. So it is hard to understand why this step was needed given the the trouble in coding etc.
Modification introduced: instead of using RegEM directly on (proxy) data, R05 chooses to apply RegEM to separately to “low” and “high frequency” components. In essence, they filter the proxy data with a low pass filter (10 point Butterworth) to create the “low frequency” component and then take the residual as the “high frequency” component. Then each of these series is given “weight” which is essentially the ratio between their variances with respect to the variances of unfiltered series. Then this weight is used as an initial covariance estimate for the RegEM, which is applied separately for the both series. Finally the results are somehow combined.
The combination code is not provided in Rutherford’s web page (why not???) [As a side note: Rutherford’s code is terrible to read, and I would go nuts if any of our PhD-students would return me code like that not to speak of making it public.] and the combination is described in R05 as:
The REGEM method is applied separately to the calibration of proxy and instrumental data in the high- and low-frequency bands. The results of the two independent reconstructions are then recombined to yield a complete reconstruction. Each proxy record is weighted by a bandwidth retention factor defined as the percent of its total variance within the particular frequency band under consideration. For example, a proxy record dominated by interannual variability, with
very little low-frequency variability (e.g., a very data-adaptive standardized tree-ring record) would be assigned a high weight (near one) in the high-frequency band calibration and a low weight (near zero) in the low-frequency band calibration.
From which I couldn’t figure out the exact details. If anyone can help me with this, I would be grateful as this would allow the complete testing of Rutherford’s results.
At this point I’d like also to remind the readers about the AR3 (of the BC paper) comment:
Concepts are used loosely in a way that leaves their specific reference in the context unclear
Hmmm… isn’t that customary in the field ;)
Anyhow, it seems that the “low and high frequency” component results are somehow weighted with respect to the relative variances of the proxies (under what period?). In essence, this step gives, in BC terminology, yet-another-flavor introduced by the hockey team. This is somewhat recognized by BC as they state in their answer to AR2 (see also the nonsense answer by AR2):
Regarding the hybrid-20 variant, Rutherford et al., 2005 state: “As described below, cross-validation experiments motivate the choice f = 0.05 cpy (20-yr period) for the split frequency in almost all cases.” – This sentence illuminates how a posteriori selections penetrate a model definition. If there is no extra (independent, a priori) motivation for that choice, the chosen model is no longer independently verifiable. This is especially critical if the above cross-validation is sensitive to the parameter in question, and is of course dramatically aggravated with the number of such choices.
However, IMO BC have not fully realized the arbitrariness of this step. Not only there is the “cut-off-frequency” variable issue (as realized by BC above), but also the ad-hoc weighting is used as an initial estimate for the covariance matrix and, supposingly, is later used in combining the “low” and “high” frequency results. Potentially, I see another “Mannian PCA”-step that would give high weight for proxies with “right” properties. In any case, the arbitrariness of this “yet-another-flavor” should be IMO clearly stated in the main BC text.
I also find it curious that Mann goes nuts on the detrending/trending-issue, but here, without hesitation, he is ready to introduce this type of procedure which seems to allow any type of “detrending” one is looking for.
2) “Stepwise modification”
This is a modification that I think that have not been discussed much before.
This step was “justified” in R05 by an unreported toy example. I find it troubling that in “climate science” papers you are allowed to modify existing algorithms and then justify the modification by simply claiming that the modification “works better” in a single toy-example. It seems that one does not need to even provide the results of the toy-example runs for the reader to check. Even worse in R05, they modified a recently proposed algorithm that in itself has not been fully explored. Anyhow, the “justification” in R05 ends:
The stepwise approach performed as well as or better than the nonstepwise approach in cross validation in each of these experiments. The results of these pseudoproxy experiments give us some confidence that the primary conclusions presented in this study are insensitive to whether the stepwise or nonstepwise approach is used.
This is very troubling for at least three reasons: a) either the first sentence is not correct or this is a strong case for inappropriateness of the used metrics b) they generalize the results of single toy example c) if the results are truly insensitive, why then use a modification?
The idea in this modification is that instead of calculating the target (temperature field) by inputing the whole available data into the algorithm at once, one reconstructs the temperature field in a step-wise manner: first back to 1820, then back to 1800, and so forth. This method leads to different reconstruction as that implemented by BC06 (for the AD1400 step):
if we are at step AD1400, the temperature field for the period 1500-1855 is obtained from the step AD1500 so it is not calculated by RegEM as in the BC06 implementation. In other words, the AD1400 step have no effect to the temperature field past 1500.
Now, the Table~2 in R05 supposingly calculates the RE values with respect to the BC06-style implementation (using only the proxies available to the given time). However, BC obtained clearly lower value for the AD1400 step (from the reply to AR2):
For this setting they report a verification RE of 40% (66% for the “early” calibration, with even 71% for the “hybrid-20” variant), as opposed to our emulated RE of 30%..
BC give possible reasons for the difference:
a) different temperature data (NH only, newer CRU version), b) different calibration periods than MBH98 (1901-1971 and 1856-1928), and c) an infilling via RegEM of the incomplete temperature data set prior to the entire analysis.
These are good sources for the difference reported, but IMO the difference (0.1) seems just too large to be completely explained by those factors alone. Rutherford’s code does not show how Table~2 was calculated. So maybe, just maybe, instead of calculating Table~2 in BC06-manner, they used the following procedure:
a) calculate the temperature field backwards as usual (maybe NaNing the verification period at each step)
b) calculate the verification temperature field for the target period by NaNing the verification period
BUT using the rest of the temperature field obtained so far (instead of using NaNs for the whole non-caliberation temperature field). I think this is worth considering in some detail. Another possible source for the difference is the fact that there is “yet-another-flavor” introduced in R05:
3) Unreported “standardization”
There is a “standardization” step in Rutherford’s code, which seems to be (to my knowledge) unreported. Here’s a segment of the code (taken from “mbhstepwiselowrecon.m”, but the same is present in all reconstruction codes):
if i==1% first step standardize everything
else % now just standardize the new pcproxy network
The above code “standardizes” all proxies (and the surface temperature field) by subtracting the mean of the calibration period (1901-1971) and then divides by the std of the calibration period. I’m not sure whether this has any effect to the final results, but it is definitely also worth checking. If it does not have any effect, why would it be there?
Rutherford, S., Mann, M.E., Osborn, T.J., Bradley, R.S., Briffa, K.R., Hughes, M.K., Jones, P.D., Proxy-based Northern Hemisphere Surface Temperature Reconstructions: Sensitivity to Methodology, Predictor Network, Target Season and Target Domain, Journal of Climate, 18, 2308-2329, 2005. Code and supplementary material available from: http://fox.rwu.edu/~rutherfo/supplements/jclim2003a/
Schneider, T., Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14, 853-871, 2001. Code available from: http://www.gps.caltech.edu/~tapio/imputation/index.html
Bürger,G., Cubasch, U., On the verification of climate reconstructions Climate of the Past Discussions, 2, 357-370, 2006