Wahl and Ammann: Some Verification Statistics

I’m now completely file compatible with Wahl and Ammann. In terms of my algorithm, I had to tweak the procedure for scaling RPCs a little, but the RPCs themselves were identical. It took some patience to reconcile differing data setups. I archived results from a complete run-through from their base case a couple of days ago. They have not archived results yet, but the graphs look very similar.

Our standing prediction for MBH98-type reconstructions is that verification statistics other than the RE statistic will be insignificant. I suppose that another prediction is that the Hockey Team will not report these other verification statistics. Both predictions are correct here. Needless to say, only the RE statistic is at Ammann’s website. The archived program does not even calculate other statistics, although it’s hard to imagine that they have not peeked to see what the other statistics are.

As someone with prospectus experience, I think that these statistics are highly relevant to readers and that they should not be withheld by the Hockey Team. If they then want to argue that verification statistics other than the RE statistic are incorrect, then let them. (They will then have to defend this position against their own use of these other statistics in their other writings where it is to their advantage.) Anyway, here are my calculations for standard verification statistics for the AD1400 step from the Wahl-Ammann run-through. The cross-validation R2 is virtually 0; the sign is correct only 54% of the time, just better than random; the product mean test (which has a t-distribution) is insignificant; and the CE statistic is negative. No wonder they don’t want anyone to look at these statistics.

              RE (cal)  RE (ver)  R2 (ver)  CE     Sign Test  Product Test
Run-Through   0.16      0.47      0.02      -0.24  0.54       0.91
Reported      0.39      0.48      NA        NA     NA         NA
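For readers who want to see how statistics of this kind are computed, here is a minimal sketch in Python. It is not the Wahl-Ammann code or my own script. The function name `verification_stats` and the toy numbers are made up for illustration, and the definitions are the usual textbook ones as I understand them: RE benchmarks the errors against the calibration-period mean, CE against the verification-period mean, R2 is the squared correlation, and the sign score is the fraction of years in which observed and estimated departures from the calibration mean have the same sign. That choice of baseline for the sign score is an assumption. The product mean test is left out.

```python
import numpy as np

def verification_stats(obs, est, cal_mean):
    """Verification-period skill scores (illustrative definitions).

    obs, est : observed and reconstructed values in the verification period
    cal_mean : mean of the observations in the calibration period
    """
    obs = np.asarray(obs, dtype=float)
    est = np.asarray(est, dtype=float)
    sse = np.sum((obs - est) ** 2)
    # RE: skill relative to always guessing the calibration mean.
    re = 1.0 - sse / np.sum((obs - cal_mean) ** 2)
    # CE: skill relative to always guessing the verification mean.
    ce = 1.0 - sse / np.sum((obs - obs.mean()) ** 2)
    # R2: squared Pearson correlation in the verification period.
    r2 = np.corrcoef(obs, est)[0, 1] ** 2
    # Sign score: share of years with matching sign of departure (assumed baseline).
    sign = np.mean(np.sign(obs - cal_mean) == np.sign(est - cal_mean))
    return {"RE": re, "CE": ce, "R2": r2, "sign": sign}

# Hypothetical toy data: both series sit below the calibration mean of 0,
# but their year-to-year wiggles are uncorrelated.
obs = [-0.40, -0.20, -0.30, -0.10]
est = [-0.20, -0.10, -0.40, -0.30]
stats = verification_stats(obs, est, cal_mean=0.0)
print(stats)
```

In this toy case the estimate only gets the shift in mean right, so RE comes out positive while R2 is essentially zero and CE is negative. That is the same pattern as in the run-through row of the table.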

I’ve reconciled so that I can now specify differing methods as parameters, run my function which calculates MBH98 NH temperature reconstructions and get both an AD1400 step result and a stepwise result. I’ve reconciled my algorithm step by step to WA methods and kept a reconciliation script if anyone is interested. There was a real curiosity in the stepwise reconciliation. My results were identical to 7 decimal places or so in the AD1400 calculation and to 7 decimal places or so in the stepwise reconstruction, except for 1750-1759 where there was a divergence amounting to about 0.9 sigma in one year. This pointed out an interesting little sensitivity. MBH98 changed the selection of temperature PCs used in different steps. The schedule is shown here. (This information used to be on the old Nature SI, which has been deleted; other than the relict UMass FTP website, this information would no longer be publicly available.) If you look at the selected PCs, you’ll see that the AD1750-1759 step [why does this decade have its own step?] uses PCs 1-3,5,6,8,11,15.

I’ve never been able to determine how MBH98 determined which PCs to retain. It always seemed to me that the lower PCs had no conceivable physical meaning and were mere artifacts of matrix decomposition. MBH do not prove that these lower PCs have any stable meaning – they grandly reify them as "climate field reconstructions". However, WA’s lower PCs using annual data differ considerably from MBH lower PCs. (Also if the square root of cos(latitude) is used for area weighting in the PC decomposition, as it should be, a different set again emerges.) Anyway, this decade is the only decade in which PCs 6 and 8 are used. In all other periods, PCs 7 and 9 are used instead. I have no idea why. I presumed that this must have been an erratum and substituted PCs 7 and 9 in my parameterization. Wahl and Ammann didn’t – hence the discrepancy. But it makes for an interesting sensitivity result, which I would never have thought of calculating otherwise: using PCs 7 and 9 instead of PCs 6 and 8 can have an impact of up to 0.9 sigma on the reconstruction in a 10-year period.
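To make the area-weighting remark concrete, here is a rough sketch of a PC decomposition with square-root-of-cos(latitude) weights, done through an SVD. It is not the MBH98 or WA code. The function name `weighted_pcs`, the latitudes and the random "temperature" matrix are all hypothetical. The square root is used because the decomposition works on squared values, so multiplying each column by sqrt(cos(lat)) weights its variance by cos(lat), which is proportional to gridcell area.

```python
import numpy as np

def weighted_pcs(temps, lats_deg, n_keep):
    """PCs of a (time x gridcell) matrix with sqrt(cos(latitude)) weights.

    Returns (pcs, patterns): pcs is time x n_keep, patterns is n_keep x gridcell
    (patterns are expressed in the weighted coordinates).
    """
    temps = np.asarray(temps, dtype=float)
    # Center each gridcell on its own mean over the whole period.
    anomalies = temps - temps.mean(axis=0)
    # sqrt(cos(lat)) on the data gives cos(lat) weighting of the variance.
    weights = np.sqrt(np.cos(np.deg2rad(lats_deg)))
    u, s, vt = np.linalg.svd(anomalies * weights, full_matrices=False)
    return u[:, :n_keep] * s[:n_keep], vt[:n_keep]

# Hypothetical data: 50 "years" at 6 gridcells on different latitudes.
rng = np.random.default_rng(0)
lats = np.array([-60.0, -30.0, 0.0, 30.0, 60.0, 75.0])
temps = rng.normal(size=(50, lats.size))
pcs, patterns = weighted_pcs(temps, lats, n_keep=3)
print(pcs.shape, patterns.shape)
```

Changing the weights (or dropping them) changes the matrix being decomposed, so the lower-order PCs in particular can come out quite differently.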

I’ve not attempted to explore other aspects. However, it does seem presumptuous to assert that you have a 2 sigma confidence interval when such trifles can use up half of your confidence interval, before you even start considering real errors. The difference in scaling results from a couple of factors. MBH98 starts with temperature principal components calculated from a centered calculation using a 1902-1993 reference period. The resulting temperature PCs are re-based to a 1902-1980 reference period and final MBH98 results are reported for a 1902-1980 reference period. After calculating RPCs (reconstructed temperature principal components which have a 1902-1980 reference period under the calculation), WA appear to re-transform them to a 1902-1993 reference period, although MBH98 results and instrumental comparisons are in a 1902-1980 reference period. The effect does not appear to be material, but it is hard to understand why they would do this. I’ve set this up as an optional method parameter so that the re-transformation can be excluded if desired.
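The reference-period point can be illustrated with a small sketch. This is not the WA code. The function name `rebase` and the linear-trend series are made up for illustration. Re-basing here just means subtracting the mean over a chosen reference period, so switching between a 1902-1980 and a 1902-1993 base shifts a series by a constant.

```python
import numpy as np

def rebase(values, years, ref_start, ref_end):
    """Re-center a series on its mean over [ref_start, ref_end], inclusive."""
    values = np.asarray(values, dtype=float)
    years = np.asarray(years)
    in_ref = (years >= ref_start) & (years <= ref_end)
    return values - values[in_ref].mean()

# Hypothetical series: a steady rise of 0.005 units per year from 1902 to 1993.
years = np.arange(1902, 1994)
values = 0.005 * (years - 1902)

a = rebase(values, years, 1902, 1980)  # reference period used for MBH98 results
b = rebase(values, years, 1902, 1993)  # reference period WA appear to re-transform to
offset = a - b
print(offset[0])
```

For a single series the two versions differ only by that constant offset. In the actual reconstruction the re-transformation is applied to the RPCs before later steps, which is why I would rather test its effect through an optional parameter than assume it away.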

The other scaling difference resulted from re-scaling the variance of estimated RPCs to the variance of "observed" TPCs. The variance from the MBH98 RPCs does not typically correspond with "observed" variances. This is probably inherent in the MBH98 procedure for calculating the RPCs – which is not really inverse regression as von Storch et al. [2004] described it. It’s a two-step procedure in which inverse regression is used to establish a set of proxy models (described in inflated terms in MBH98), followed by a minimization of the mean squared error of the various models. (This can also be shown to be a regression.) I think that the combination is a type of simple neural network, but, beyond the label, I’m not presently in a position to apply this information. But I’m pretty sure that that would be the direction to look in if one wanted to really examine confidence intervals for this type of process. Ammann added a step to do this re-scaling in April 2005. In our emulations without re-scaling, we also got completely different variances than MBH98; I ended up doing a re-scaling at the NH temperature index stage by re-scaling the variance of our final reconstruction to the variance of the MBH98 reconstruction.
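Here is a minimal sketch of what a variance re-scaling step of this kind might look like. It is not Ammann's code. The function name `rescale_to_observed` and the numbers are hypothetical. I am assuming that the match is made over the calibration period (the only period in which "observed" TPCs exist) and that the series is already centered, so a simple multiplicative factor is enough.

```python
import numpy as np

def rescale_to_observed(rpc, tpc_obs, cal_mask):
    """Scale a reconstructed PC so its calibration-period standard deviation
    equals that of the observed temperature PC (assumed convention)."""
    rpc = np.asarray(rpc, dtype=float)
    tpc_obs = np.asarray(tpc_obs, dtype=float)
    cal_mask = np.asarray(cal_mask, dtype=bool)
    factor = np.std(tpc_obs, ddof=1) / np.std(rpc[cal_mask], ddof=1)
    # One factor is applied to the whole series, not just the calibration years.
    return rpc * factor

# Hypothetical numbers: six reconstructed values, the last four in calibration.
rpc = np.array([0.10, -0.10, 0.20, -0.20, 0.05, -0.05])
cal_mask = np.array([False, False, True, True, True, True])
tpc_obs = np.array([1.0, -1.0, 0.5, -0.5])

scaled = rescale_to_observed(rpc, tpc_obs, cal_mask)
print(np.std(scaled[cal_mask], ddof=1), np.std(tpc_obs, ddof=1))
```

A multiplicative factor leaves correlations unchanged, so a step like this can alter RE and CE but not the verification R2.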

These are pretty minute differences. Most of our principal results relied on contrasts and are insensitive to this sort of procedural difference. Other results, such as the insignificance of MBH98 verification statistics, are stable to the difference. Nonetheless, for good order’s sake, I’ll re-execute our reported results under WA methods. (I’m still thinking about whether to re-transform to 1902-1993 as they do – I can’t see why.) I’ll send an email to Ammann; since he is a member of the Hockey Team, it’s unlikely that he’ll provide an explanation, but you never know. Now that I’ve reconciled on the temperature PCs used by Ammann, I’m also going to check the impact of the "simplifications", i.e. use of annual data and elimination of MBH98 weights. In a reconciliation, it is never a "simplification" to change some of the parameters. You should start with the exact parameters. Things are seldom what they appear with the Hockey Team. I’ll bet that the "simplifications" yield "improved" RE statistics. We’ll see.


  1. Spence_UK
    Posted May 17, 2005 at 10:44 AM | Permalink

    … surprise surprise, coefficient of determination at a staggering 2%. I think rumours of the hockey stick’s vindication were greatly exaggerated.

    I’m trying to follow the debate on the various forums and websites and as far as I can see the key comparisons they are making are between two sets of runs:

    (c) Run without bristlecone pines, Gaspe cedars and different procedure (i.e., MM05, “without merit”)

    (d) Run without Gaspe cedars (WA05, significant under RE stats, “without merit” from R2)

    However they seem to be claiming the difference between the two is due to the procedure followed, which I believe is supposed to include the PC selection and presumably the rescaling steps at the end, rather than being due to the presence of the bristlecone pines.

    It seems odd that they chose to do their runs (c) and (d) by changing several things simultaneously (pines, scaling, PC selection) and not assess these things independently, which would be somewhat more logical, to identify the extent of the problems caused by these issues. Given that the software has been written, it surely should not have taken long for them to do so.

    Watching the debate though, I don’t think the content of the paper matters much to the hockey team: I suspect if a context-free grammar was used to generate the paper from a random list of technical buzzwords, and the right conclusions were added, the usual suspects would be showering it with plaudits.

  2. Spence_UK
    Posted May 17, 2005 at 2:51 PM | Permalink

    Forgive me for posting more comments, but I’m curious about the link you provide for Mann’s PC selection schedule (or “Eigenvector Filtering” as he refers to it). There are some other things of interest to me there.

    The link given includes some R-squared statistics. Bizarrely, whilst the RE statistic is offered on various measures (including the global temp, Northern hemisphere temp and Southern oscillation index) the R^2 coefficient is quoted only for the Southern oscillation index. Now I understand that the MBH98 methodology has the benefit of gridding the world and therefore these “subcategories” can be easily generated, which (granted) is an appealing feature of the analysis. But why be so selective over which of the R-squared statistics are quoted? Also, I find it difficult to reason why the use of R-squared here is acceptable but not for the global temperature construction.

    Moving on to the actual figures, he sees a maximum of 0.14 for the 1820 step down to a negative value (!!!) for the 1400 step. So… this means that the calculations go the “wrong way” more often or by larger amounts than they go the right way for this early step. And this is on a selected component of his overall construction which I can only guess is “better” than others that could have been chosen.

    In addition to this, he must have appreciated that these figures don’t sound so great so he added some “significance” comments. He then goes on to band the confidence intervals scored by each step – one at 99% significance, 95%, 90% and 85%.

    To this end, in this sort of study a 95% confidence is about as far as I can see justification for. A 90% interval is a very weak test (1-in-10 chance that a random number generator could have done better) and 85% confidence is a joke. What is even worse is that 2 of the steps carry even greater error risk than this!!! (but suddenly he stops reporting what interval they are in – although since one is worse than the average performance achieved by a random number generator perhaps it isn’t surprising!)

  3. John A.
    Posted May 17, 2005 at 11:58 PM | Permalink

    This reminds me of an old joke:

    A school math teacher started a lesson one day without a word. He went to the blackboard and started writing some algebra. And then some more. The equations covered two blackboards. Eventually after 40 minutes, while the children became increasingly bewildered about how they were going to learn this, he ended triumphantly with:

    Therefore A=0

    "Class, what does this mean?", said the teacher.
    Someone from the back of the class said, "Does this mean you’ve done all of that FOR NOTHING???"

    That’s how I feel about this Wahl and Ammann exercise. Ultimately, none of the exercises has any statistical (or real world) significance. Just like MBH98 and MBH99, the entire reconstruction means nothing at all.

  4. John Davis
    Posted May 18, 2005 at 10:53 AM | Permalink

    Any chance of a brief primer on the application & significance of these error statistics? I’m OK with basic stats (did some Quality Management once) but this is getting a bit arcane…

  5. Michael Mayson
    Posted May 18, 2005 at 3:19 PM | Permalink

    Re #4. Yes, I would welcome this too. Perhaps you have a link to a web site where the merits and drawbacks of various verification statistics are described.

    Steve: I’m swamped right now. If Spence or someone else wants to write something, I’ll post it up. Otherwise I’ll take a crack at it in about a week.

  6. Larry Huldén
    Posted May 18, 2005 at 11:39 PM | Permalink

    I think it would be best to let Steve concentrate on the most important things now! Few of us have the capacity to dig and clear up in the mess produced by the hockey stick team backed up by the enormous resources of the IPCC. I am quite convinced that Steve is not going to hide anything.
    Larry Huldén
    Finnish Museum of Natural History

  7. Peter Hearnden
    Posted May 19, 2005 at 4:04 AM | Permalink

    Re #6 Larry,

    Surely you’re concluding before the conclusion?

    The one conclusion I think ‘Climate Audit’ is coming to is it’s all down to the bristlecones. Well, I’ve asked this before but I’ll ask it again: What does the hockey stick look like without the bristlecones? There is no sign of such a graph – why isn’t it being trumpeted here if the bristlecones are soooooo important?

    Larry, can we see your reconstruction of past climate or are you another one only interested in deconstructions?

  8. Louis Hissink
    Posted May 19, 2005 at 5:59 AM | Permalink

    I have a vague feeling that all these papers are basically feeding canned statistical programs minus the scientific understanding of what the data meant in the first place.

    CSIRO LEME are doing a rather large geochemical sampling operation in which 54 different elements are analysed.

    In order to make sense of all this data, a somewhat geoscientific specific multivariate analysis is scheduled. The principal scientist remarked that all the data first had to be “normalised” and that rang a bell – much what Steve is studying I suspect.

    The computer program for this is not public domain, and the author, an Australian I believe, will sell a copy for a monopoly price – standard procedure when the client is known to be a mining company.

    Mind you, initial runs of the data through the analysis have been successful, so in this case, we are onto something and will therefore fund it. (Success being defined by any anomaly in the data being verified on the ground by physical fact.)

    Pity climate science is not so constrained by facts.

  9. Steve McIntyre
    Posted May 19, 2005 at 6:15 AM | Permalink

    Peter, I don’t know how you can say that you’ve not seen reconstructions without the bristlecone impact. Look at our EE paper. Most of the popular renditions of our work feature graphs showing a high 15th century. Without the bristlecones, that’s what you get. Mann et al. knew this – we’ve talked about their CENSORED file, but they failed to report it. To make matters worse, they claimed that their reconstruction was “robust” to the presence/absence of all dendroclimatic indicators, when they knew that it was not even robust to bristlecone pines.

    Some of the issues related to bristlecone impact show up in discussions of how many PCs to retain. Mann et al. acknowledge that they need the PC4 under proper calculations (which carries the bristlecone shape). Without the PC4, there is no hockeystick; with the PC4, there is. We state this in our EE article. Or if you remove the bristlecones directly, as opposed to through not using the PC4, you get the same sort of graph – high 15th century. Mann et al. know this, but they cloud it with code words. They use terms like “all available information” or “effectively throwing out information”; they try to conflate a sensitivity without bristlecones with not using any US tree ring data.

    So your question is asked and answered. Regards, Steve

  10. Peter Hearnden
    Posted May 19, 2005 at 7:04 AM | Permalink

    Re #9, ahhh OK fair enough, I have seen those graphs, though not one labelled ‘the hockey stick minus the bristlecones’ I think. I’ll check them out when time permits.

  11. John Davis
    Posted May 19, 2005 at 8:17 AM | Permalink

    Re #7 Peter
    You can always look at the Wahl and Ammann version shown on this site under Wahl & Ammann 2. It’s the pink curve. Higher in the 15th century. Without merit, as they say.

  12. Steve McIntyre
    Posted May 19, 2005 at 9:00 AM | Permalink

    John D., one of the frustrating aspects of Wahl-Ammann is that they throw up a variety of scenarios which don’t exactly correspond with base case scenarios. IMO, the most logical way of testing bristlecones in MBH98 methods is to exclude the bristlecones, carry on with the PC calculation methods etc., yielding a high 15th century. WA #2 doesn’t do this, although to a civilian, it would seem so. As I interpret it, they do not attempt to summarize US tree ring records, but use all the tree ring series directly without a PC procedure. This swamps out the tendency of the other proxies to yield a high 15th century. So they are being a little tricky here, but they are the Hockey Team. As to the 15th century, I’d like to reiterate that we do not say that the 15th century was warm – our argument was that Mann et al. can’t argue that the 20th century was unique on their data and methods. In the 2005 articles, we particularly emphasize statistical issues e.g. total R2 failure in MBH98. Regards, Steve

  13. Posted May 19, 2005 at 11:00 AM | Permalink

    Great work Steve.

  14. Peter Hearnden
    Posted May 19, 2005 at 12:40 PM | Permalink

    You’ll not believe me but I spent the afternoon, at least in part, reading M&M05.

    Steve, and others if you like, are you happy with the MBH reconstruction back to the 15th Century?

  15. Michael Jankowski
    Posted May 20, 2005 at 7:11 AM | Permalink


    Personally, no. For starters, way too many issues with proxy data. Way too many "adjustments" done when interpreting the proxy data. Way too many non-temperature induced responses in proxy data. Way too little coverage globally of the proxy data (see the locations here http://www.ncdc.noaa.gov/paleo/pubs/mann2003b/mann2003b.html). Proxy data needs to be brought up-to-date to see how well the interpretation methodology correlates with land-based temp changes over the past decades.

    You’ll also find in the 2001 IPCC Working Group I a conflict they noted between glacial recession timing and the temperatures of Mann’s reconstructions http://www.grida.no/climate/ipcc_tar/wg1/064.htm:

    "Nevertheless, work done so far indicates that the response times of glacier lengths shown in Figure 2.18 are in the 10 to 70 year range. Therefore the timing of the onset of glacier retreat implies that a significant global warming is likely to have started not later than the mid-19th century. This conflicts with the Jones et al. (2001) global land instrumental temperature data (Figure 2.1), and the combined hemispheric and global land and marine data (Figure 2.7), where clear warming is not seen until the beginning of the 20th century. This conclusion also conflicts with some (but not all) of the palaeo-temperature reconstructions in Figure 2.21, Section 2.3 , where clear warming, e.g., in the Mann et al. (1999) Northern Hemisphere series, starts at about the same time as in the Jones et al. (2001) data. These discrepancies are currently unexplained."

    One can try to reconcile that it’s highly likely global average temps are within the noise of Mann. That noise itself, however, is ridiculously large for trying to make any strong conclusions. As the IPCC so poorly worded it http://www.grida.no/climate/ipcc_tar/wg1/069.htm :

    "Taking into account these substantial uncertainties, Mann et al. (1999) concluded that the 1990s were likely to have been the warmest decade, and 1998 the warmest year, of the past millennium for at least the Northern Hemisphere. Jones et al. (1998) came to a similar conclusion from largely independent data and an entirely independent methodology. Crowley and Lowery (2000) reached the similar conclusion that medieval temperatures were no warmer than mid-20th century temperatures. Borehole data (Pollack et al., 1998) independently support this conclusion for the past 500 years although, as discussed earlier (Section, detailed interpretations comparison with long-term trends from such of such data are perilous owing to loss of temporal resolution back in time."

    John adds: I added some blockquotes for extra clarity.

  16. Posted May 21, 2005 at 12:25 AM | Permalink

    I’m astonished that W&A got published given what you’ve found, Steve.

  17. Louis Hissink
    Posted May 21, 2005 at 3:58 AM | Permalink

    Re #15

    “Crowley and Lowery (2000) reached the similar conclusion that medieval temperatures were no warmer than mid-20th century temperatures. Borehole data (Pollack et al., 1998) independently support this conclusion for the past 500 years although, as discussed earlier (Section, detailed interpretations comparison with long-term trends from such of such data are perilous owing to loss of temporal resolution back in time.”

    “are perilous owing to loss of temporal resolution back in time”

    What on earth do they mean by this? What next, Liturgy in Roman circa AD 235 to limit understanding?

  18. Steve McIntyre
    Posted May 21, 2005 at 4:53 AM | Permalink

    W&A have only submitted; they haven’t been published yet, but they are supported by powerful interests and I presume that they will be published. Mann et al. got published with an R2 of ~0 (of course they withheld this information and the original referees didn’t think to ask.) Regards, Steve

  19. Spence_UK
    Posted May 21, 2005 at 5:46 AM | Permalink

    Re #5

    I would be happy to have a pop at trying to describe some of the statistics, especially if it would help free up some of Steve’s time – although I’m not going to be able to have a crack at this just yet, I’m running a fairly big event (voluntary work) this weekend which is taking up all of my spare time at the moment.

  20. Ross McKitrick
    Posted May 21, 2005 at 8:02 AM | Permalink

    Re #16: W&A didn’t get published. They put out a press release. Or rather, the National Center for Atmospheric Research put out a press release. The papers themselves have not passed peer review and aren’t published.

  21. Dave Dardinger
    Posted May 21, 2005 at 8:15 AM | Permalink

    Re #17

    I think it’s fairly simple. Heat spreads in all directions. As heat moves downward through the earth the ‘temperature signal’ has to be diluted and thus the ability to decipher how the temperature changed with time will grow less the farther you go.

  22. John A
    Posted May 21, 2005 at 8:57 AM | Permalink

    Re: #20

    Yes, yet another “announce the result before the due diligence is done”. In stockmarket terms, we’d call it “ramping”.

  23. Michael Mayson
    Posted May 21, 2005 at 3:16 PM | Permalink

    Re #5 and #19. Thanks – that would be much appreciated.

  24. Peter Hearnden
    Posted May 22, 2005 at 2:13 AM | Permalink

    Re #20


    Is that a definite ‘have not’ or is it ‘have not yet’? On what authority do you say ‘have not’ (just doing an audit of your claims ;)).

  25. Ross McKitrick
    Posted May 22, 2005 at 1:42 PM | Permalink

    Peter: “Have not”, as of the present. I say this on the basis of knowing that they have only just entered the review process at each journal, both of which we are involved in. I did not say they “will not”, but to say they have not “yet” suggests a foregone conclusion that they will be published. Perhaps they will, but the referee process is not a rubber stamp, which is why we usually expect authors to wait until publication to put out a press release announcing their findings.

  26. Posted May 22, 2005 at 9:33 PM | Permalink

    Steve & Ross,
    My apologies. Thanks for the correction.

  27. Peter Hearnden
    Posted May 25, 2005 at 2:22 AM | Permalink

    Re #25,

    OK, I understand and welcome your clarification.

  28. Rob Matthews
    Posted Jun 18, 2005 at 11:05 AM | Permalink

    Can you direct me to on-line resources on statistical analyses and verification? It has been a long time since I took any course work in statistics and I would like to have a quick refresher in these topics. Thanks.
