Rutherford et al [2005] Collation Errors

I’ve got something that’s a little bit amusing today. In MM03, we pointed out collation errors in pcproxy.txt (which I’ve recently hypothesized was used in the version of Rutherford, Mann et al [2005] submitted in July 2003 and was laundered after MM03). We pointed out that the PC series all seemed to start one year too early (1799 instead of 1800) and then the missing 1980 values (the last year in the matrix) were filled from left to right, so that, in some cases, 8 series had identical 1980 values. Rutherford, Mann et al [2005] said that these criticisms applied only to the "wrong" data set (while not mentioning that we were directed to this dataset at Mann’s FTP site by them). After MM03, they fixed the problems in their collation of the PC series for Rutherford, Mann et al [2005] (without thanking us). But they made the very same goof in their collation of the instrumental series into their composite matrix. It’s too funny for words.

I’ve just started looking at Rutherford’s RegEM method and first noticed another goof right away, which I will describe first. If you go to the script mbhstepwiselowrecon.m in directory http://fox.rwu.edu/~rutherfo/supplements/jclim2003a/regem/multiproxypc-scripts/, towards the bottom of page 1 in a Notepad rendering of the script, you see programming steps which combine MBH98 proxy networks and 1008 temperature gridcell series (the proxies are placed in columns 1009 on). In the 2nd step (i=10) representing the AD1450 step in MBH98, Rutherford has incorrectly set the number of proxies (nproxies) as 93, when it should be 25. It’s correct in the companion file mbhstepwisehighrecon.m. As a result of the error, the composite matrix is made with too many columns and he tries to fill columns 1009:1111 with a matrix that’s only 25 columns. A question for Matlab users: in such a mismatch, does Matlab leave the extra columns as empty or does it circularly apply the smaller matrix to fill or does the instruction simply fail (as it would in R) leaving the matrix "composite" at whatever prior value that it had.

I notice that there’s a later comment in his script that:

% there are typically a few gridpoints in the infilled instrumental data without % any values because they couldn’t be inflled. Remove them here. % note this also removes empty proxy columns

Depending on how Matlab handles the above programming error, this might very well cooper up the prior error. It raises another question: why would some of the infilled instrumental data not have any values? Did anyone notice any mention of this in the Rutherford text? Some people may remember that in MM03 we pointed out that some of the MBH98 cells with "nearly continuous" histories had 0 values in HadCRU2. This would be an interesting topic to look at for someone. Maybe someone would write Phil Jones and ask him.

Anyway, just above this step, there is the following instruction:

composite(456:598,1:1008)=respliced(:,1:1008)

On the face of it, there doesn’t seem to be anything objectionable about this. However, the devil is in the details and I’ll bet that there is a real clanger here. The gridcell temperature dataset archived here goes from 1854-1993 (and is the one used in MBH98). They’ve archived a different temperature dataset than the one that they actually used. The later HadCRU datasets start in 1856 and my guess is that they were using a temperature dataset that went from 1856 to 1998. Since the combined dataset starts in 1400 (not 1401), the correct collation instruction to insert a dataset going from 1856-1998 into a matrix starting in 1400 should have been:

composite(457:599,1:1008)=respliced(:,1:1008)

If they used a HadCRU dataset starting in 1856 as appears almost certain (I’m unaware of any versions starting in 1855 although who knows what may materialize out of the blue), they’ve collated the instrumental data incorrectly [insert smiley] – it’s all one year too early. This is precisely the collation errors that we oberved in MM03 about the collation of principal component series in pcproxy.txt – where they had spliced in the PC series one year too early. In Rutherford, Mann et al.[2005], they said that some of the MM03 criticisms applied to the "wrong" data set (not mentioning that it was the data set at Mann’s FTP site that they had directed us.)

Astonishingly, the only errors described in MM03 that MBH98 might have avoided (the collation errors – see MM03 Scorecard) appear to have been made in Rutherford, Mann et al [2005] – but in the instrumental series rather than the PC series. (All the other data problems of MM03, including incorrect PC calculations, carry forward pari passu to RM05.) Does this "matter"? In one sense, nothing "matters" to MBH98 except bristlecones (and you know how they squeal if someone tries to examine results without bristlecones). But geez, after all the commotion about collation errors, you’d have thought that even the gang that can’t shoot straight would make sure that this study didn’t have any clanging collation errors. And you’ve got almost the entire Hockey Team on the masthead of this study: Rutherford, Mann, Bradley, Hughes, Jones, Briffa, Osborn. What a tangled web indeed. I can almost hear the smoke rising from Mount Mann.

15 Comments

  1. Phil B.
    Posted Feb 3, 2006 at 10:03 AM | Permalink

    Steve, I have got mbhstepwisehighrecon.m to run successfully, but mbhstepwiselowrecon.m crashed when attempted. I did a file difference and found a number of errors which I corrected and now have mbhstepwiselowrecon.m running. The nproxy = 93 error was one of them, along with a double use of a for loop index i. It appears that the script was in the process of being modified, but never run. I have not found a script where the NH temps are calculated. Hope this helps.
    Phil B.

  2. Steve McIntyre
    Posted Feb 3, 2006 at 10:08 AM | Permalink

    The apparent splicing error is in mbhstepwisehighrecon.m as well.

  3. Phil B.
    Posted Feb 3, 2006 at 10:30 AM | Permalink

    Steve, didn’t mbh98 used 1082 temperature grid cell instead of 1008?

  4. Steve McIntyre
    Posted Feb 3, 2006 at 10:44 AM | Permalink

    yes. I’ll check what RM05 say about selection procedures for gridcells.

    BTW you couldn’t derive the 1082 selected gridcells from the reported selection procedures – I posted up on this a long time ago – see the Replication category in the right frame. There are many MBH conundrums that remain unresolved. So there’s no guarantee that there will be a replicable protocol for the 1008 either.

  5. Mark
    Posted Feb 3, 2006 at 1:27 PM | Permalink

    Steve, regarding your question about adding different size matrices in Matlab…

    Matlab will generate an error if you are adding two matrices of different sizes (or replacing a sub-array with a different matrix). If, for example, you have one matrix B that is 20×20 and another A that is 10×10, you can do a replacement but you need to specify the exact dimensions you want replaced ala…

    B(4:13,8:17) = A will replace the 10×10 subset of B with the data in A.

    If you just try to replace B with A (B = A), the old B will be gone and the new B will have A in its place.

    Is that what you were asking (I haven’t looked at the file to see the exact problem)?

    Mark

  6. Spence_UK
    Posted Feb 3, 2006 at 4:10 PM | Permalink

    Ooops I hadn’t spotted this topic.. and when I went to download the m-file I just got the message file not found. It could be me not copying the URL across correctly, or it could have been hastily removed…

    If you want to run these scripts without shelling out thousands of pounds for a Matlab licence, you could try Scilab. It is a piece of freeware produced by a consortium that isn’t a million miles from Matlab, and includes a matlab-to-scilab function conversion routine (m2sci). Unfortunately the conversion isn’t perfect and there is scope for apparent errors to appear not from the original code, but from the conversion process, but I understand for most straightforward stuff it is pretty good.

    Please note I haven’t used scilab, so I can’t vouch for it, just going on what people have told me about it and the blurb on their website (http://www.scilab.org)

  7. Spence_UK
    Posted Feb 3, 2006 at 4:11 PM | Permalink

    Ooops part two: thousands of Canadian dollars might have been a more appropriate comment. Not got my international head on today 🙂

  8. Steve McIntyre
    Posted Feb 3, 2006 at 4:41 PM | Permalink

    #6 – I fixed the url for the link. (Since I’m blocked, I couldn’t just cut and paste).

  9. ET SidViscous
    Posted Feb 3, 2006 at 4:59 PM | Permalink

    http://www.freeproxy.ru/en/free_proxy/cgi-proxy.htm

    http://www.theinquirer.net/?article=29457

  10. Spence_UK
    Posted Feb 4, 2006 at 8:49 AM | Permalink

    Thanks for the link update. I downloaded the code and ran it, but of course the previous posters have addressed your key points already.

    That just leaves me to rant about pet peeves in MatLab code, highlighting an observation from Phil. One of the things I am particularly critical of is when people use the variables “i” and “j” in matlab code – particularly in scripts where variable scope is loose – as both of those tokens have built-in definitions, in particular they represent the square root of -1. Once they have been declared as variables, the variable use overrides the constant, but it does not make for easy reading of the code.

    Of course the use of these as variable names is a hang over from Fortran, in which they are commonly used as integer loops. When anyone writing code for me does this I tell them to go away, and bring it back to me when they have used meaningful variable names!

    OK this is a little bit petty, these guys are scientists, not software engineers, and these are short routines so there is not so much of an issue with sloppy code, as long as the code runs and gives the right answer.

    Shame it didn’t do that either…

  11. Steve McIntyre
    Posted Feb 4, 2006 at 9:27 AM | Permalink

    I’m guilty of using i and j as integer loop subscripts (also a Fortran hangover).

    They didn’t make the instrumental collation error in the script combinedstepwiselowrecon2.m, which splices in the instrumental record as follows:

    composite(457:599,1:1008)=respliced(:,1:1008);

    The difference in splices is final proof of the cock-up in their MBH98 version. I’m surprised that there’s not been more piling on by readers with respect to the collation error in the Rutherford et al [2005] implementation of MBH98. I think that it’s going to be a big deal in a couple of ways.

    First, on an elementary basis, it’s a pretty humiliating error – both in itself and given all the discussion arising out of MM03 and collation errors and supposedly "wrong" data sets. I guess that they will suddenly announce that they posted up the "wrong" script and that anyone who was less stupid than me would have known enough to consult the "right script" which was really up all the time in some other location never previously mentioned and delete the "wrong" script. But it will be pretty tough to get away with this a second time.

    Secondly, think about what it means for MBH98. They have proudly announced that they get essentially the same results as MBH98 using a "completely different" method RegEM. (I’m going to do a post on whether RegEM is really a "completely different" method or whether the linear algebra ends up boiling down to being pretty similar in the 15th century.) But in the RegEM version, their "target" instrumental version for MBH is one year off. So they get essentially the same reconstruction regardless of whether they get the instrumental year right. !?!?!

    This is completely consistent with our understanding of MBH methods – a classic spurious regression of the bristlecones combined with massive "overfitting" from essentially white noise series in the calibration period, thus the breakdown of the cross-validation R2. I’ve shown a little how this works in Huybers #2, which has never attracted as much comment as I’d hoped, but really need to articule this in a longer form.

  12. Jo Calder
    Posted Feb 4, 2006 at 4:59 PM | Permalink

    But in the RegEM version, their “target” instrumental version for MBH is one year off. So they get essentially the same reconstruction regardless of whether they get the instrumental year right. !?!?!

    Is there some indication of the scale of the errors this induces (I guess some measure of year-to-year variation)? OTOH if the data is highly autocorrelated, wouldn’t that tend to reduce the consequent errors?

    I had the thought, further to this thread and G Schmidt’s comment about the unsuitability to Climate Research of any archive mechanism yet devised, that this site could offer a service to archive researchers’ SI code and have it debugged by willing volunteers. A similar pre-publication service might be more generally appreciated.

    Cheers, — Jo Calder

  13. Steve McIntyre
    Posted Feb 4, 2006 at 7:36 PM | Permalink

    Jo – Gavin’s comment is total crap. There is an absolutely adequate service providing permanent data archives at the World Data Center for Paleoclimatology, which has extensive archives and to which Hockey Team members contribute from time to time. They have ongoing funding and are a “permanent” archive. Doing something like this is well beyond the objectives of this blog.

    The target data is not as autocorrelated as some of the proxy regressors – see the warnings of Ferson et al for this type of situation.

    In a sense, it would reduce the errors, but you’re assuming that there is a “right” answer. My impression is that you have an unsavory combination of spurious regression and multilinear overfitting, creating quite a unique witches brew, the statistical properties of which are not understood by the authors.

  14. Jo Calder
    Posted Feb 5, 2006 at 3:19 PM | Permalink

    Thanks for that, Steve. I didn’t intend my suggestion too seriously.

    Cheers, — Jo Calder

  15. Phil B.
    Posted Feb 11, 2006 at 11:24 AM | Permalink

    For those that have been looking at Rutherford’s mbhstepwisehighrecon.m, on line 158 the true value of nproxy is stepped on by nproxy = 112, which affects the weighting going into regemhigh and affects the last 112-truenproxy grid cells and associated B matrix. A similar line of code is in mbhstepwiselowrecon.m. Phil

One Trackback

  1. […] entirely self-contained and so I re-visited some of our previous comments on Rutherford et al here here here here . I’m going to reprise two issues today: Collation Errors, a small but amusing […]