M08 with realdata

Yesterday we noted that “validation” in M08 means that the average of the “late-miss” and “early-miss” RE statistic is above a benchmark of about 0.35. I take no position at present whether this unusual methodology means anything, though I’m a bit dubious.
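For readers who want to see the arithmetic, here is a minimal sketch of the "validation" test as described above. The RE formula is the standard reduction-of-error definition (errors measured against the calibration-period mean); the function names and the sample numbers are made up for illustration and are not taken from the M08 code.

```python
def reduction_of_error(observed, predicted, calibration_mean):
    """Standard RE: 1 - SSE / (sum of squares about the calibration-period mean)."""
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse / ss_ref


def passes_validation(re_late_miss, re_early_miss, benchmark=0.35):
    """Average the late-miss and early-miss RE and compare with the benchmark.

    The 0.35 default is the approximate benchmark quoted in the post.
    """
    average_re = (re_late_miss + re_early_miss) / 2.0
    return average_re, average_re > benchmark


# Hypothetical verification-period values, for illustration only.
observed = [0.1, 0.3, -0.2, 0.4]
predicted = [0.0, 0.2, -0.1, 0.5]
re_value = reduction_of_error(observed, predicted, calibration_mean=0.0)

# Hypothetical late-miss and early-miss RE values for one step.
average_re, validated = passes_validation(0.5, 0.1)
```

With these made-up inputs, a step whose two RE values are 0.5 and 0.1 averages below the benchmark and so would not "validate", even though one of the two values is well above it.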

I also observed that the with-dendro reconstructions added surprisingly few series in the AD1000 step to the no-dendro reconstruction as the majority (16 of 19 NH) of dendro series are screened out.

I also noted that the average RE statistic (and thus “validation”) turned out to be sensitive to the inclusion/exclusion of 1 or 2 series. In yesterday’s case, the exclusion of the one “passing” bristlecone in the AD1000 network strongly affected the period of validation.

Today, I’m going to report on another interesting calculation, this time re-visiting one issue noticed right at the time of publication of M08 and another that I just noticed today, both involving the use of realdata.

First, in September 2008, Jeff Id and I both discussed M08’s replacement of actual Briffa MXD data after 1960 with “infilled” data. Obviously the deletion of post-1960 Briffa data has a long and sordid history. It was also deleted in the IPCC Third Assessment Report spaghetti graph and from the AR4 spaghetti graph. It was the data at issue in the “trick … to hide the decline” email. In this case, actual data was not replaced by temperature data, but by “RegEM data” calculated in Rutherford et al 2005. As a reader notes in comments, Rutherford et al say that they did not use Luterbacher data in making this particular sausage. At this time, I don’t know for sure exactly what is in the RegEM sausage as a substitute for actual MXD data. Actual data for 105 out of 1209 M08 series using MXD had been replaced after 1960 by, shall we say, verification-enhanced RegEM data. I had obtained realdata from CRU with a successful FOI request in 2008 and replaced the “enhanced” data with realdata for this analysis.
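The substitution itself is simple bookkeeping. A minimal sketch, assuming each series is held as a year-to-value mapping; the function name, the data layout and the numbers are hypothetical and are not the layout of the M08 or CRU archives.

```python
def restore_actual_values(archived, actual, cutoff_year=1960):
    """Return a copy of `archived` in which years after `cutoff_year`
    take the measured value from `actual` wherever one is available.

    Years after the cutoff with no measured value keep the archived
    (infilled) value; years up to and including the cutoff are untouched.
    """
    restored = dict(archived)  # copy, so the archived version is not modified
    for year, value in actual.items():
        if year > cutoff_year:
            restored[year] = value
    return restored


# Hypothetical MXD series: archived version is infilled after 1960.
archived_series = {1959: 0.2, 1960: 0.1, 1961: 0.3, 1962: 0.4}
measured_series = {1959: 0.2, 1960: 0.1, 1961: -0.1, 1962: -0.3}

realdata_series = restore_actual_values(archived_series, measured_series)
```

Only the post-1960 entries change; everything up to and including 1960 is carried over from the archived version.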

Second, M08 used performance-enhancing data in the form of 70 Luterbacher gridded data sets that incorporate actual instrumental data. This data was subtracted in Mann et al 2009 (Science) and I do so here as well in both the no-Tilj and the realdata cases.

Third, and this was intriguing: yesterday I noted that there were only three dendro series in the AD1000 with-dendro network. I discussed the bristlecone series yesterday. Today I looked at the “Tornetrask” series, the citation for which was Briffa, K.R. et al, 1992 (Clim Dyn) (the “bodged” reconstruction). However, the M08 version went from 1 to 1993 – the time period of Briffa (2000). Nor did the M08 version match the Briffa (2000) version or the Briffa et al 2008 Tornetrask-plus-Finland version. Adopting the policy of using realdata whenever possible, I replaced the M08 “Tornetrask” version with realtornetrask data (the Briffa 2000 chronology in this case).

I then re-ran the M08 CPS algorithm using a realdata version, extracting the RE statistics by step as before. Without the performance-enhancing Luterbacher series (fortified with instrumental data) and with real MXD data, the RE statistics after 1500 are noticeably reduced for both late-miss and early-miss versions. Before 1500, the late-miss RE statistic is also reduced significantly (due to no-Tilj and the impact of realtornetrask).

Figure 1. Late-miss and early-miss RE statistics for 3 M08-style networks: M08 from archived statistics; no-Tilj: removing contaminated Tiljander sediments and Luterbacher; realdata: using real Schweingruber MXD data, real Tornetrask data and without performance-enhancing Luterbacher gridded series with instrumental data.

I then calculated the average RE statistics used to “validate” the M08 reconstructions. As you see, the realdata network results in substantially lower average RE statistics in all periods.
M08 do not report their benchmark “95% significant” statistic, but by inspection of their diagrams and the archived data cps-validation.xls, the 95% benchmark looks to be about 0.355 or so. This is plotted as a horizontal line.

Figure 2. Average RE statistics for three cases as above. Solid line shows the “adjusted” M08 RE statistic (which basically is the “best” RE statistic if there is a higher RE in a network with an earlier starting date). Dotted line shows M08 “95% significant” RE benchmark.
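One way to read the “adjusted” statistic in the caption is as a running maximum over networks ordered by starting date. The sketch below implements that reading only; the function names and numbers are hypothetical, and the rule actually coded in M08 may differ in detail.

```python
def adjusted_re(re_by_start_year):
    """For each step, take the best RE among networks starting at or before it."""
    adjusted = {}
    best_so_far = float("-inf")
    for start_year in sorted(re_by_start_year):
        best_so_far = max(best_so_far, re_by_start_year[start_year])
        adjusted[start_year] = best_so_far
    return adjusted


def failing_steps(re_by_start_year, benchmark=0.355):
    """Steps whose RE does not exceed the benchmark (about 0.355 by inspection)."""
    return [year for year in sorted(re_by_start_year)
            if re_by_start_year[year] <= benchmark]


# Hypothetical average RE by network starting year, for illustration only.
average_re = {1000: 0.40, 1100: 0.30, 1200: 0.45, 1300: 0.20}

adjusted = adjusted_re(average_re)
raw_failures = failing_steps(average_re)
adjusted_failures = failing_steps(adjusted)
```

In this made-up example, the steps starting in 1100 and 1300 fall below the benchmark on their own statistics but clear it once the adjustment borrows the higher RE of an earlier-starting network, which is why the solid line in the figure can sit above the plotted points.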

In this example using realdata, the with-dendro M08 CPS reconstruction fails “validation” for all intervals. This example (CRU_NH) had the “best” “validation” statistics in the base M08 examples and so I don’t expect other combinations to fare any better.

The script to make these figures from previously collated Notilj and realdata versions is at
http://www.climateaudit.info/scripts/mann.2008/blog_20100808.txt. The runs to make the versions are at

Update: Here’s a graphic for the CRU_SH network – which is unaffected by Tilj and Luter – which serves to compare my emulation with archived M08 verification stats.


  1. scientist
    Posted Aug 8, 2010 at 9:21 PM | Permalink

    Most of the below, you may think of as form related. (But I need to ask.) Comment 9 is sorta content. (In case you want to skip to that.)

    1. I understand that you are using a different program, coded in R, that emulates Mike’s Matlab code. That this is advantageous since it has the added feature of showing percent proxy in the results, (maybe) that you prefer working with R, and that you just learn algorithms better from the process of rewriting code. I understand that a few days ago, you didn’t have the code emulated (the stepwise reconstruction) but now you say you are very confident. Could you pleeeeze plot some Mann’s code versus your code figures? Duplicate some of his figures from the paper. Show a difference plot, etc.? I just want to know that you really have his code emulated (since you choose to run with your code rather than just use his Matlab for experiments).

    2. Please supply a citation or link for “Mann09”.

    3. When you say that Luterbacher was “subtracted in Mann09”, do you mean that they didn’t use it at all? Or that it was one of the sensitivity tests?

    4. It seems like there are 4 Briffa data versions in discussion?
    a. The 1992 Climate Dynamics published version.
    b. The Briffa 2000 version (not sure what article this refers to)
    c. A 2008 Tornetrask and Finland version (not sure the citation)
    d. The version Mike used, which goes to 1993 and does not match any of the above versions.

    4.1 Could you plot these and do some differences? I’m curious which version is most closely matched by what Mike used. Also if perhaps one of these versions would match with some years clipped off, added (the 1992 cited version) or perhaps there was a transcription or two. Note, I’m not excusing Mike citing an article and then using different data. But I just want to understand how different the series are, see where they come from, etc.

    4.2 Could you give the different citations for these?

    5. I guess if you are going to look at the sensitivity to version type, you might as well run all 4 versions, Steve. It’s just a computer…crank it.

    6. The last sentence before the first paragraph mentions that you ran “no tilj” in passing. Is this how Mike ran in 2009? If you choose to run that way, fine. But it seems like it should be mentioned up front, like the Luterbacher. I almost get the impression that you view this as automatic, but apparently it wasn’t to Mike. So at least be clear.

    6.1. Really what would be helpful, would be some sort of full factorial (tilj y/n, Luterbacher y/n (if my suspicion on Mann09 is correct), Briffa MXD (1, 2, 3, 4)). Just to make sure that we keep track of what combinations pass or fail. I guess you could use your choice if you consider early and late to be two different outputs, or make them an input. Probably outputs, I guess.

    7. You made a real effort, here, Steve on the figure caption. You could still clean it up more, but noticeable improvement. “M08 from archived statistics; no-Tilj: removing contaminated Tiljander sediments; realdata: using real Schweingruber MXD data, real Tornetrask data and without performance-enhancing Luterbached gridded series with instrumental data.”

    7.1. What is the

    8. I would avoid the use of terms like “realdata”. It’s confusing. It’s offputting (both a little insulting and too “cute”). It seems to be trying to win the argument by how you label things. Better to just call things descriptively and neutrally. If you have killer analysis, the analyses will slaughter your enemy. Cute terms like this are really a distraction.

    9. (Content) Figure 1 is an interesting chart. Yes, removal of some proxies does reduce RE, although seems general shape is followed. It’s hard for me to say what is significant. You can see areas where the lines stay on top of each other and others where it’s a .1 drop for the first reduction and a .2 for the more extreme reduction. Latem just seems to have more issues regardless, versus earlym (really in all cases).

    9.1. Hard for me to understand if the difference between “notilj” and “realdata” is being driven by the alternate tree proxy series versions or by the Luterbacher change. Intuitively, would think it is the Luterbacher issue driving the bus, (even though the name “realdata” emphasizes the proxy version issue), but can’t tell without running the more intermediate combinations.

    10. Again another good effort on the figure caption. But I humbly ask questions to try to follow the explanation…

    “Average RE statistics for three cases as above. Solid line shows the “adjusted” M08 RE statistic (which basically is the “best” RE statistic if there is a higher RE in a network with an earlier starting date ). Dotted line shows M08 “95% significant” RE benchmark.”

    10.1. OK I get the three cases are re-represented (don’t understand why we have red and green colors for the crosses and squares).

    10.2 I can’t tell the difference of the dashed line for the base case M08 versus the dashed line for the 95% significance. I just can’t resolve the different lines visually! Maybe make all the cases be black. And use green for the limits?

    10.3. Does the solid line only correspond to one (the base) of the 3 cases? And if so, shouldn’t it always be at least as high as the base case? Why does it dip below around 1500? And why didn’t you show the other cases’s solid lines?

    10.4 By the 95 percent significance dotted line, do you mean the 0.35 cutting across the figure?

    11. “In this example using realdata, the with-dendro M08 CPS reconstruction fails “validation” for all intervals.” I think it’s the absence of tilj and Luterbacher that also drive this problem. You actually haven’t shown a clean “realdata” versus “fakedata” comparison. And I’m not faulting you for deciding to show different combinations, but let’s be clear.

    12. “. This example (CRU_NH) had the “best” “validation” statistics in the base M08 examples and so I don’t expect other combinations to fare any better.”

    12.1 Huh? This example CRU_NH is the first we hear (at the last sentence) of different choices to be made.

    12.2. Well…I kinda think “SH” will be pretty robust to removal of proxies in the Northern hemisphere or different versions of them. 🙂

    • Steve McIntyre
      Posted Aug 8, 2010 at 9:53 PM | Permalink

      As I said previously, most of the code was developed in fall 2008 and reconciled at that time – see category https://climateaudit.org/category/other-multiproxy-studies/mann-2008/ for previous posts setting out reconciliation and steps. The present version is modified to follow a bit further into the verification statistics and splicing from where I got to in 2008. There’s only so much that I can stomach at any one time. I’m not re-tracing this totally. I had 85% of this done in 2008 and reported on it then. However, the other 15% makes a difference.

      Briffa 2000 QSR, Briffa et al 2008 Phil Trans B. Both discussed at length. Look at the Briffa category in left frame or just try googling briffa 2000.

      SH won’t be affected. The various “target” temperature cases are HAD and some things called iCRU and iHAD, that I haven’t figured out and don’t have immediate plans to try to figure out.

      I’m tired of these cases for now. They take a while to run because Mann re-smooths everything in every step. I’ve tweaked it to use smoothed once-for-all but need to doublecheck this implementation before reporting results.

      I’ve got other things that I’ll be working on for the next couple of weeks. Mann methodology is like root canals – it’s very laborious figuring things out. I need a break from it.

      I’ve uploaded code to do these calculations. If you want to experiment with other Tornetrask versions, feel free to do so yourself.

      • scientist
        Posted Aug 9, 2010 at 11:47 AM | Permalink

        Your link gave MANY more papers than are available if I go to the side and pick the Mann et al 2008 category (only 3 papers come back).

    • geo
      Posted Aug 8, 2010 at 11:31 PM | Permalink

      scientist– This is the internet; this is a blog you’re commenting on not writing articles for. So you have an absolute right to anonymity. I don’t question that.

      Are you willing to tell us at least if those who follow climate science at least kinda-closely would recognize the name on your birth certificate if you shared it?

      I don’t think that’s an inappropriate question, but it is certainly one that you can decline to answer without being inappropriate either.

      I guess part of me hopes you’re really, oh, say (just for example), Kevin Trenberth or Tim Osborn, trying to be mostly constructive and honestly engaged, under the cover of anonymity, even if you have your own (and we’re all entitled to them) slant on things.

  2. scientist
    Posted Aug 8, 2010 at 9:25 PM | Permalink

    Please disregard 6 and 7.1. (But regard 6.1)

  3. Steve McIntyre
    Posted Aug 8, 2010 at 10:15 PM | Permalink

    Mann et al 2009 SI says of the Luterbacher data;

    The original Mann et al (S1) proxy dataset also included 71 European composite surface temperature reconstructions back to AD 1500 based on a composite of proxy, historical, and early instrumental data (S5). These data were not used in the present study, so that gridbox level assessments of skill would be entirely independent of information from the instrumental record.

    Despite this statement, its Table S1 shows the same number (1209) proxies as Mann et al 2008. You’ll have to ask them which is correct.

    • scientist
      Posted Aug 8, 2010 at 10:27 PM | Permalink

      Your figure caption says that you removed Luterbacher from the third case. Is this correct or was it removed from all?

      If you replicated the base case and some figures from the paper, it seems that it would be clear if Luterbacher was used at all.

      Steve: the no-Tilj case also has the Luterbacher removed. This was mentioned in yesterday’s post but should have been stated here as well. Sorry about that. I’ve amended the caption and text to clarify this. The M08 case is a transcription of M08 data and includes Luterbacher. The Luterbacher data starts only in 1500 and is the main contributor to the post-1500 difference in the no-Tilj (no luter) case.

  4. scientist
    Posted Aug 8, 2010 at 10:24 PM | Permalink

    I saw one good post on the issue of replication from 2008.

    Followup questions:

    A. Seems like in that post, you were only replicating the simple 1000 network. And just a few days ago said you had issues with replicating the stepwise reconstruction (“splicing”). So a difference plot showing replication would still be helpful.

    B. What setting on methodology settings (“pet tricks”) did you use, in this post, just now?
    Steve: default options: lat.adjustment= -1; outerlist_method=”mann”; center_smooth=”mann”;screenmethod=”mann”.

  5. Steve McIntyre
    Posted Aug 8, 2010 at 11:18 PM | Permalink

    I’ve added the same graphic for the CRU_SH variation. Because the Tilj, Luter and Tornetrask things don’t affect the SH network, this functions as a demonstration of reconciliation of my methods to Mann’s. The replication is about as exact as you could want. You can’t get this perfect a replication of verification stats without replication at the stages leading to this final calculation. There are a lot of advantages in my emulation since it is not only far more concise and understandable, but it has important additional features – especially the provision of weights.

  6. Fred
    Posted Aug 8, 2010 at 11:31 PM | Permalink


    I’m glad you’re auditing the auditor, maybe after following his tracks, you’ll change your mind…

  7. pete
    Posted Aug 9, 2010 at 12:35 AM | Permalink

    First, in September 2008, Jeff Id and I both discussed M08’s replacement of actual Briffa MXD data with “infilled” data. CA readers (and many others) are aware that substitution of real Briffa MXD data with temperature data was what instigated “hide the decline”. 105 out of 1209 series in M08 had been “adjusted” so that realdata was replaced by, shall we say, realclimatedata.

    Are you sure about this?

    From the Mann et al 2008 SI (S2):

    Because of the evidence for loss of temperature sensitivity after 1960 (1), MXD data were eliminated for the post-1960 interval. The RegEM algorithm of Schneider (9) was used to estimate missing values for proxy series terminating before the 1995 calibration interval endpoint, based on their mutual covariance with the other available proxy data over the full 1850–1995 calibration interval. No instrumental or historical (i.e., Luterbacher et al.) data were used in this procedure.

    Steve: I stated that actually observed MXD data in the post 1960 period was replaced by “infilled” data in Rutherford et al 2005.This is confirmed by your quotation (which I had considered in the 2008 posts) – the fact that Luterbacher data is not used in this particular “infilling” doesn’t alter the brute fact that actual data was replaced by infilled data. CRU conceded the point in respect to Rutherford et al 2005 in their submission to Muir Russell. The infilling was done with less extreme immediate impact than the famous hide-the-decline email but it was done nonetheless and had an impact on “validation” statistics.

    • pete
      Posted Aug 9, 2010 at 8:13 AM | Permalink

      The paragraph I quoted gives the impression that the infilling used instrumental temperature data.

      Is the hide-the-decline dig worth the resulting loss in clarity? WMO cover-art isn’t really relevant to discussions of M08.

      Steve: I’ll rephrase slightly to avoid any confusion. However, the continuity is relevant since the same period of Briffa MXD data was deleted in the IPCC AR3 spaghetti graph and the AR4 spaghetti graph – which I’ve consistently emphasized as the most critical examples (as opposed to the WMO report.)

  8. scientist
    Posted Aug 9, 2010 at 12:53 AM | Permalink

    It seems like they should have just said it failed the correlation test (rather than chop the part where it didn’t fail). Also should have put this crap in the PAPER, not the SI.

    Steve – the deletion of post-1960 data was not mentioned in the SI to Mann et al 2008 – only in the SI of a predecessor paper, Rutherford et al 2005. Also I’ll have to doublecheck this – and may not for a while – but my recollection is that the real data doesn’t fail the correlation test. In September 2008 in the midst of the financial meltdown, I observed that, whatever the rights or wrongs of the Briffa truncation versus the Tiljander non-truncation, the diametrically inconsistent accounting policies didn’t meet any GAAP standards. There were much better reasons for truncating Tiljander than for truncating Briffa MXD – indeed, no objective reasons for truncating Briffa have ever been provided.

    • scientist
      Posted Aug 9, 2010 at 11:42 AM | Permalink

      1. If he’s going to be truncating proxies like that, it needs to be front and center and it seems like a really bad idea given his calibration/validation* rests on “match” to instrumental.

      2. Please do me the courtesy of replies that are in separate boxes.

      *I’m still not convinced that you can or should try to have a separate validation step.

      Steve: “If he’s going to be truncating proxies like that, it needs to be front and center”. I guess you missed the events around the “trick to …hide the decline”. As an IPCC reviewer, I asked Briffa to show the decline in the AR4 spaghetti graph. He refused.

      • scientist
        Posted Aug 9, 2010 at 1:49 PM | Permalink

        What Briffa did in a review paper is a different issue from what Mann did in a primary report.

  9. EdeF
    Posted Aug 9, 2010 at 1:09 AM | Permalink

    Here is a CA tutorial on the RE reduction of error statistic.


  10. Bill Jamison
    Posted Aug 9, 2010 at 3:19 AM | Permalink

    “performance-enhancing data” 😀

    Thanks for the laugh Steve, that’s a classic!

  11. Posted Aug 9, 2010 at 4:09 AM | Permalink

    Really nice work Steve. Rather crushing to the paper when actual data is used.

    Did any of the actual MXD data pass screening?

    I hope that the climate scientists who lurk here are catching the fact that this is a demonstration of a really weak signal in the ‘actual’ data. Not that it will come out in print, but it really is a very noisy set of data which has NOT been demonstrated to have anything at all to do with temperature. Well — except Luterbacher — because that data has temperature pasted right on it.

    Trees make lousy thermometers.

    Steve: there isn’t as much deterioration in screening performance with real MXD data relative to MXD realclimatedata as one might expect. For some reason, these have much better calibration than ring width series and most seem to pass screening either way.

    • Posted Aug 9, 2010 at 8:08 AM | Permalink

      Now that is interesting. Perhaps Briffa could be fixed up for the IPCC using correlation screening methods rather than chopping the data off with a dull axe. It would have made a much less interesting email for climategate.

  12. Ale Gorney
    Posted Aug 9, 2010 at 4:42 AM | Permalink

    all of this is just nonsense.

  13. Geoff Sherrington
    Posted Aug 9, 2010 at 5:42 AM | Permalink


    Those of us who have been following this for some time are able to grasp the essential points without having to do lengthy primers in Steve’s space. We are suffering from your prior lack of reading, through having to wade through your catchups.

    All that redeems you is the occasional astute point, often as a question, but even then it is rarely novel.

    Why not do us a favour by doing your homework first, then coming back with short, original advancements, rather than an essay.

    That would be polite on the blog of another.

    • Area Man
      Posted Aug 9, 2010 at 9:37 AM | Permalink

      The publishing of posts by scientist and the patient, respectful, direct, and accurate responses by SM are great examples of how the Team could have and should have responded to Steve early on.

    • QBeamus
      Posted Aug 10, 2010 at 10:29 AM | Permalink

      At this point, I have to disagree. I think the posts along these lines were valid when Scientist was disclaiming his own posts with “I haven’t read this” and “I can’t say until I read that.” But at this point I’m finding the byplay is helping me to get a better grip on the details than I got from my own reading.

  14. andy
    Posted Aug 9, 2010 at 5:42 AM | Permalink

    So the real data has no signal in it worth a damn. So I replace it with data I have which does. Is that correct?

  15. Craig Loehle
    Posted Aug 9, 2010 at 6:45 AM | Permalink

    Steve said about the averaging of the RE stat “I take no position at present whether this unusual methodology means anything”. I just want to comment that this is an excellent example for anyone–to be clear about what one does or does not know at any moment. It is quite all right to be pondering or investigating whether something is correct or not or to admit that you don’t at the moment quite understand something. This omniscience cr*p is a major cause of climategate.

  16. JimD
    Posted Aug 9, 2010 at 3:39 PM | Permalink

    @ ‘scientist’

    8. I would avoid the use of terms like “realdata”. It’s confusing. It’s offputting (both a little insulting and too “cute”). It seems to be trying to win the argument by how you label things.

    Personally, I would avoid the use of pseudonyms like “scientist”. It’s confusing. It’s offputting (both a little insulting and too “cute”). It seems to be trying to win the argument by how you label yourself.


    (For the sake of clarity, also a scientist)

  17. EdeF
    Posted Aug 9, 2010 at 6:01 PM | Permalink

    “I also observed that the with-dendro reconstructions added surprisingly few series in the AD1000 step to the no-dendro reconstruction as the majority (16 of 19 NH) of dendro series are screened out.”

    Red warning flags should be waved when you see 16 out of 19 dendro series flunk.
