Reconciling Zorita

Eduardo Zorita and I are in the process of reconciling some results. We have taken one issue off the table – VZ implemented Mannian PCs accurately enough that this does not account for any differences between our results and theirs. So I take back some observations and will place updates in the appropriate places. In fairness to me, their description in GRL of what they did was, shall we say, language-challenged, and my interpretation of it was the most reasonable one. The description was probably not a problem for anyone else, but then I was probably the only one trying to figure out what they did from the description alone. This proves once more the advantages of using source code for reconciliation.

By getting this issue off the table, we can try to focus on other issues perhaps accounting for the differences in results e.g. how the pseudoproxies are constructed – which was what we primarily discussed in our GRL article. In this respect, Eduardo has acknowledged that their reconstructions testing the impact of Mannian PCs in an ECHO-G context have high verification r scores in both cases, which certainly suggests to me, as we hypothesized, that their pseudoproxies did not accurately represent the dog’s breakfast of MBH proxies, where the high RE-low verification r2 combination is a distinctive signature that a simulation should replicate. Hopefully there will be more on this.

NY Times: For Science’s Gatekeepers, a Credibility Gap

From the NY Times, a wide-ranging article on the problems of journal peer-review:

Recent disclosures of fraudulent or flawed studies in medical and scientific journals have called into question as never before the merits of their peer-review system.

The system is based on journals inviting independent experts to critique submitted manuscripts. The stated aim is to weed out sloppy and bad research, ensuring the integrity of what it has published.

Because findings published in peer-reviewed journals affect patient care, public policy and the authors’ academic promotions, journal editors contend that new scientific information should be published in a peer-reviewed journal before it is presented to doctors and the public.

That message, however, has created a widespread misimpression that passing peer review is the scientific equivalent of the Good Housekeeping seal of approval.


Predict future climate change!

[Steve: Editorial comment – This is John A’s post. I do not agree with his editorial flourishes linking this to models. I view the following as illustrating the defects of sole reliance by multiproxy reconstructions on the RE statistic – a statistic for which there are no distribution tables, which is little known among ordinary statisticians and little studied outside tree ring circles. This is a battleground issue with respect to Mann and similar studies, and the post is useful in illustrating how the statistic operates; I don’t view it as showing more than that.]

Dave Stockwell has shown the future of climate prediction with a great new device that allows you to create a new prediction of future climate change that is at least as good as the tree ring proxies were for the past. Arguably this new technique puts expensive climate modelling exercises like climateprediction.net to shame.

Here’s a prediction of future climate generated each time the page is refreshed. Note that the RE and r2 statistics are calculated automatically.

The data in blue is the instrumental data (courtesy of the CRU) and the red is the prediction of future temperatures for the next 100 years.

The explanation is here and the source code can be found here.

Let Dave Stockwell explain:

The validation is based on the 11 points at the end of the temperature record not used in generating the simulated points. Two statistics were calculated and can be seen on the figure:

  • The R2 correlation is ubiquitously used for quantifying the strength of association between two variables. A value around 0.1 would indicate at most a mild correlation; values closer to one indicate a strong association.
  • The RE reduction of error statistic is used in dendroclimatology and in the “hockey stick” reconstruction of MBH98, where critical values greater than zero are claimed to indicate significance of the model. RE is claimed to be superior to the R2 statistic in WA06.
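
For concreteness, here is a minimal sketch (mine, in Python – not Dave’s actual code) of how the two verification statistics are conventionally computed. The RE statistic scores a prediction against the “no-skill” reference forecast of the calibration-period mean:

    import numpy as np

    def verification_stats(obs, pred, calib_mean):
        """Verification r2 and RE over a validation segment.

        obs, pred  : observed and predicted values in the validation period
        calib_mean : mean of the observations over the CALIBRATION period,
                     the 'no-skill' reference forecast in the RE statistic
        """
        obs, pred = np.asarray(obs, float), np.asarray(pred, float)
        r2 = np.corrcoef(obs, pred)[0, 1] ** 2        # squared Pearson correlation
        sse_pred = np.sum((obs - pred) ** 2)          # errors of the prediction
        sse_ref = np.sum((obs - calib_mean) ** 2)     # errors of the mean forecast
        re = 1.0 - sse_pred / sse_ref                 # reduction of error
        return r2, re

Note that RE rewards any prediction whose level is closer to the validation observations than the calibration mean is, whether or not the prediction tracks the wiggles – exactly the loophole exploited here.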

Hit reload a few times to get a feel for the average of the statistics. The R2 statistic is usually close to zero indicating the prediction has no statistical skill over the validation period. The RE statistic, however, is always greater than zero, and often greater than 0.5.

MBH98 uses an RE benchmark of zero to indicate significance. The random numbers here give RE statistics greater than the critical value of zero. Therefore, using the RE statistic with a critical value of zero would attribute statistical skill to random numbers. That is, under criteria used in MBH98, random numbers could be regarded as skillful predictors of future temperatures.

This example illustrates (if the code is correct) a situation, similar to MBH98, where the R2 statistic correctly indicates no statistical skill in the predictions, but the RE statistic erroneously indicates statistical skill.
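
Readers who want to experiment can reproduce the effect in a few lines. The following toy Monte Carlo is my own sketch, not Stockwell’s device, and the trend and noise parameters are invented; but the mechanism is the same – a persistent “prediction” anchored at recent (high) levels beats the calibration mean almost automatically (verification_stats is from the sketch above):

    import numpy as np

    rng = np.random.default_rng(0)

    def ar1(n, phi=0.9, scale=0.05):
        """AR(1) ('red') noise with persistence phi."""
        x = np.zeros(n)
        for t in range(1, n):
            x[t] = phi * x[t - 1] + rng.normal(scale=scale)
        return x

    re_vals, r2_vals = [], []
    for _ in range(1000):
        # Toy 'instrumental record': an upward trend plus red noise
        obs = np.linspace(-0.3, 0.5, 140) + ar1(140)
        calib, valid = obs[:-11], obs[-11:]      # hold out the last 11 points
        # 'Prediction': a random walk anchored at the last calibration value
        pred = calib[-1] + np.cumsum(rng.normal(scale=0.05, size=11))
        r2, re = verification_stats(valid, pred, calib.mean())
        r2_vals.append(r2)
        re_vals.append(re)

    print("median r2:", np.median(r2_vals))      # typically near zero
    print("median RE:", np.median(re_vals))      # typically well above zero
    print("P(RE > 0):", np.mean(np.array(re_vals) > 0))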

Conclusions hinge on the choice of statistic and where you set the benchmark. MM05 obtain a critical value for RE of greater than 0.5 using random red-noise data in a replication of the procedure used in MBH98. Non-existent statistical skill of the models is one of the main arguments in MM05 against the reconstruction method in MBH98.

So there you have it, statistical skill or not? If the statistical tests can be equalled or bettered using random numbers which have long-term persistence, then the next IPCC review, just like the previous one, will contain just as much information to inform policymakers as a table of random numbers. If this is so, then why are climate journals still publishing studies with just this behavior?

Eduardo Zorita Comments…

Eduardo Zorita sent the following in as a comment on earlier postings. As I did on a similar occasion with Rob Wilson, I’m re-posting this as a separate post on its own to ensure that it’s properly noted.

Von Storch et al [2006]

I’m working on a long note, but probably a short interim note will make some sense. From the point of view of simply advancing our particular take on the MBH98 mess, as between Wahl et al and von Storch et al, there are pros and cons to us for either side being right. So I’d be inclined to say that we don’t have a dog in this particular race and that I can look at it pretty objectively. Right now, for what it’s worth, I’m inclined to think that VZGT have much the stronger position in the current dispute. Given my personal experience with the various parties, this seemed likely at the outset as well.

I’ve taken some time to re-read VZGT 2004 and have a better appreciation for it now than I did before, which I’ll discuss some now and more on a future occasion. The exercise of re-reading has been highly instructive as I think that I’m now getting into a position to pull together the various disparate threads of this small industry of MBH commentary – starting with the MMs, continuing with VZGT04, Bürger and Cubasch 2005, plus all the comments and replies.

For now I just want to present what VZGT actually say, so that there’s at least a blog record of it without going through the poisonous realclimate filter.

Hansen and Schmidt: Predicting the Past – Continued

Continuation of the thread here.

Wahl, Ritson and Ammann on VZGT

Wahl, Ritson and Ammann, the authors of two rejected comments on MM05 to GRL – see here and here for our Replies to the rejected comments – have joined forces and published a critical comment on Von Storch et al [2004] in Science, to which von Storch et al have issued a Reply. realclimate has issued an editorial here.

There’s quite a bit of back story to cover on this exchange, which I’ll cover in more detail on another occasion. For now, here are a few quick observations. I’ve pointed out on this site for a long time both (1) that VZGT had not correctly implemented MBH procedures and (2) that I did not think that VZGT had correctly diagnosed the problems with MBH98.

I had identified at least 3 different problems with the VZGT implementation of MBH: (1) they did not appear to implement a re-scaling step unreported in the original MBH98 article. As I pointed out in my AGU05 presentation, the variance differences alleged by VZGT empirically did not exist; (2) in the GRL Comment on MM, VZ did not accurately implement the goofy MBH principal components method, seemingly not fully comprehending just how bad the method was; [Update – May 4, 2006: I’ve reconciled code with Eduardo. Their description in GRL of what they did certainly suggested otherwise, but they did implement the key features of the Mannian PC method – so the differences with us are probably due to the next item.] (3) relying on Jones and Mann 2004, the VZ (and the VZGT) pseudoproxies wildly over-estimated the temperature signal content of MBH proxies and did not allow for "bad apples".

I don’t blame (and didn’t blame) VZGT for any of these "problems"; the fault lies entirely with Mann et al. (1) How could VZGT replicate a re-scaling step that was never mentioned in the original article? (2) While I think that we provided enough information in our articles to decode the MBH principal components method, I can readily understand how people would assume that the method was more reasonable than it really was and think that it wasn’t possible that MBH used such a weird method; but it was possible and it did happen. (3) Absent a detailed investigation of MBH98 proxies of the type that we’ve carried out, anyone relying on MBH information would assign much better behavior to the proxies than is justified.
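
For readers new to the dispute, the "Mannian" PC step amounts to principal components computed on data centered (and scaled) over the 1902-1980 calibration period rather than over the full record. Here is a minimal sketch of the centering issue only – the actual MBH recipe has further wrinkles (e.g. scaling by detrended calibration-period standard deviations), so treat this as my paraphrase, not a replication:

    import numpy as np

    def short_centered_pcs(X, calib, n_pcs=2):
        """PC series computed with MBH98-style 'short centering'.

        X     : (n_years, n_series) proxy matrix
        calib : slice or boolean mask selecting the calibration years
                (1902-1980 in MBH98)
        """
        mu = X[calib].mean(axis=0)               # calibration-period means
        sd = X[calib].std(axis=0, ddof=1)        # calibration-period std devs
        Z = (X - mu) / sd                        # short centering and scaling
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return U[:, :n_pcs] * s[:n_pcs]          # leading PC time series

A series whose calibration-period mean differs sharply from its long-term mean acquires large apparent variance under this centering, which is why hockey-stick-shaped series get promoted into PC1.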

To the above three points, Wahl et al add a fourth: that VZGT calibrated on detrended proxies rather than non-detrended proxies. In reply, VZGT argue that Wahl et al overstated the impact of this methodological difference on their results, which they claim to be valid even with calibration on non-detrended proxies. Both Wahl et al. and especially realclimate gloat over this seeming "error" in the VZGT implementation. However, even if VZGT incorrectly implemented this MBH procedure, one can certainly see some basis for the misunderstanding. In a criticism of MM03, Mann et al said:

The use of gridpoint standardization factors based on undetrended data (MM) to unnormalize EOFs that had been normalized by standardization factors of detrended data (MBH98) implies a pattern of bias in the projection of an eigenvector onto the surface temperature field that is increasingly large in regions where the 20th century trend is large.

Similarly in the Corrigendum SI, MBH stated:

Standard deviations were calculated from the linearly detrended gridpoint series, to avoid leverage by non-stationary 20th century trends. The results are not sensitive to this step.

While neither of these points specifically refers to proxies, MBH procedures are so poorly and inaccurately described that one can see why VZGT might innocently assume that if detrending was the "correct" MBH method in standardizing gridcell standard deviations, then it might very well be what MBH did with proxies. In the Original Supplementary Information to MBH at Nature (now deleted but preserved in a University of Massachusetts mirror), MBH report results from a detrended (DET) run. So even if VZ have inadequately modeled one MBH variant, one could see why they might at least think that they had modeled the DET alternative.

I must confess to feeling a certain amusement at Mann savaging VZGT for allegedly "incorrectly" implementing his precious methodology. Back in 2003, when we sought clarification of MBH methodology, Mann refused, on the basis that von Storch and Zorita had found his existing disclosure sufficient to implement his methodology (see Mann correspondence). In the Corrigendum SI, Zorita et al 2003 is cited on 2 different occasions as an accurate implementation of MBH methodology. In summer 2004, we advised Nature that the Corrigendum SI remained insufficient; Nature said that anything further was up to the authors. So if VZ subsequently misinterpreted MBH, surely Mann has only himself to blame. All in all, surely it proves a point we’ve been making for a long time: code should be archived so that this sort of confusion is avoided. Even now, code archived by MBH is incomplete – how do they calculate confidence intervals? This mystery would be resolved in 2 seconds by looking at code. Likewise their supposed Preisendorfer calculations. Neither was archived last summer.

Realclimate completely mischaracterizes the handling of the detrending issue by Bürger and Cubasch, accusing them of following VZGT in using detrended calibration. In fact, Bürger and Cubasch carefully distinguish between trended and detrended calibration, analyzing each as separate "flavors". Nothing wrong with that.
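
To make the distinction concrete, here is a minimal sketch of what the two calibration "flavors" amount to, assuming a simple univariate linear calibration of one proxy against temperature (the actual MBH and VZGT machinery is multivariate, so this only illustrates what "detrended calibration" means):

    import numpy as np

    def calibrate(temp, proxy, detrend=False):
        """Fit proxy = a*temp + b over a calibration period.

        With detrend=True, a linear trend is removed from both series
        first, so the fit is carried by interannual variability rather
        than by a shared 20th-century trend (the 'detrended flavor').
        """
        temp, proxy = np.asarray(temp, float), np.asarray(proxy, float)
        if detrend:
            t = np.arange(len(temp))
            temp = temp - np.polyval(np.polyfit(t, temp, 1), t)
            proxy = proxy - np.polyval(np.polyfit(t, proxy, 1), t)
        a, b = np.polyfit(temp, proxy, 1)        # least-squares line
        return a, b

Without detrending, a shared 20th-century trend – climatic or not – can dominate the fit, which is exactly the danger that VZGT point to below.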

The closing paragraph of the VZGT response raises two important issues, which are familiar to readers of this site. They state:

It is commonly accepted that proxy indicators may contain nonclimatic trends. This is particularly true with tree-ring data (8), which were intensively used in the study by MBH98. The calibration and validation of any statistical method using nondetrended data are dangerous, because the nonclimatic trends are interpreted as a climate signal. Only in the case that the trend in the proxy indicators can be ascertained to be of climate origin is a nondetrended calibration and validation permissible.

Does this sound to anyone like they are coming around to recognizing the impact of bristlecones on MBH? They go on:

In the validation period, in contrast, the correlation between the (5-year-smoothed) reconstructed and observed NHT in the validation period 1856 to 1900 is 0.23. This low correlation skill in the validation period has been recently acknowledged (9 – citing Wahl and Ammann, submitted to Climatic Change).

They do not cite M&M – unfairly under the circumstances, since these are points that we originally made and have dragged out of Wahl and Ammann kicking and screaming. Citing the verification r (correlation), which is not mentioned in Wahl and Ammann, rather than the verification r2, which is so directly associated with our critique, seems a little sly. However, on balance, I’m happy to see them coming round to the points that we made.

I think that Mann’s going to regret that Wahl et al have put all of this back on the table. It’s pretty amazing how they complain on the one hand about people criticizing a "10 year old paper" – with Mann, arithmetic within 25% is pretty good – and then continually pick at scabs by publishing stuff like Wahl and Ammann [Climatic Change] and Wahl et al [Science]. Without Wahl and Ammann, I’d have "moved on" to other long overdue studies. However, as long as they keep contesting things – and especially when they do so with misrepresentations and withheld data – then I’m content to keep returning the ball from my side of the court.

Treydte, Moberg, Soon and Baliunas

Several people have written to me about today’s article in Nature by Treydte et al (including Esper) announcing that the 20th century is the wettest period in the millennium. Treydte et al state:

Comparison with other long-term precipitation reconstructions indicates a large-scale intensification of the hydrological cycle coincident with the onset of industrialization and global warming, and the unprecedented amplitude argues for a human role.

Nature published an accompanying review of the Treydte article by Evans, which concurs with this as follows:

Furthermore, it seems that recent changes in precipitation patterns probably exceed the range of natural variability estimated for the past several hundred to one thousand years.

It’s hard to keep up with the extended Hockey Team, but here is a report from yr obedient servant. With the Hockey Team, nothing is ever quite what it appears on the surface, and our text today provides an interesting opportunity to reflect on Soon and Baliunas, or rather the mugging of Soon and Baliunas by the Hockey Team.

The Caramilk Secret … Finally

Reader DEA wrote in to say that he figured out the Caramilk secret a long time ago.

Not A Solution to the Caramilk Secret

Update: the following does not explain the Caramilk secret of MBH99 confidence intervals, which remains unexplained and mysterious. End Update.

OK, Mann starts with a sigma obtained from the standard errors in the calibration period from his hugely overfitted model. He uses this in MBH98. In MBH99, recognizing the autocorrelation in the residuals, he adjusts the confidence intervals through a mysterious procedure. The adjustments as shown this morning in the figure below are done in bulk.


Top: black – MBH99 sigma; red – MBH98 sigma. Bottom: ratio of MBH99 sigma to MBH98 sigma (using ignore2 to extend to 1000-1399).

The adjustment that we’re looking for is about 1.6, derived somehow. Now look at the following figure from MBH99, posted up previously here:


Original Caption. Figure 2. Spectrum of NH series calibration residuals from 1902-1980 for post-AD 1820 (solid) and AD 1000 (dotted) reconstructions (scaled by their mean white noise levels). Median and 90%, 95%, and 99% significance levels (dashed lines) are shown.

This is derived from a spectrum but it is, as noted in the caption, "scaled by mean white noise levels". If you squint at the y-axis, you can perhaps persuade yourself that the value at the y-axis is about 1.6, which is the value of the "adjustment". So maybe what they do is use this y-axis value as an "adjustment" to the standard deviation and thus to the confidence interval. [Update: You cannot reasonably persuade yourself that it is 1.4. I was trying too hard, and this possibility – which doesn’t make any sense anyway – is ruled out.]

Jean S sent me a note saying that he thinks he’s figured it out but was too busy to write it down till the end of the week. Jean S:

I kind of figured out the procedure from Mann and Lees. I don’t know what to say anymore; I’m kind of sad that papers of this type get published… Mann & Lees is sad.

[Update: I guess we’ll have to wait for Jean S. ]

Jean S has suggested a look at Stoica and Moses, Intro to Spectral Analysis, p. 37, section 2.4, "Properties of the Periodogram Method", so I’ll check that out.

I wonder what an econometrician working with spectral methods would say about this – someone like Granger or Peter Robinson. For that matter, Bloomfield and Nychka of the NAS panel are both specialists in the frequency domain. Now, however odd Mann’s result is, bear in mind that Esper, Briffa, Jones and D’Arrigo are all even worse. At least Mann has indirectly considered the possibility of autocorrelated residuals and tried to allow for them – even if his method was weird. The other folks just ignore the problem and use the same approach as MBH98 for estimating residuals: fit a model in a calibration period and then use the standard errors to calculate confidence intervals with no allowance for autocorrelation. But hey, they’re the Hockey Team.
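
For reference, here is a toy illustration of one textbook-style allowance for lag-one autocorrelated residuals. To be clear, this is my own sketch of the general idea, emphatically not a reconstruction of the MBH99 adjustment, which remains unexplained:

    import numpy as np

    def sigma_with_ar1_allowance(residuals):
        """Calibration-residual sigma, naive and with an AR(1) inflation.

        The (1 + rho) / (1 - rho) variance inflation is a standard
        allowance for lag-one autocorrelated residuals; it is NOT a
        reconstruction of the (unexplained) MBH99 procedure.
        """
        r = np.asarray(residuals, float)
        sigma_naive = r.std(ddof=1)
        rho = np.corrcoef(r[:-1], r[1:])[0, 1]   # lag-1 autocorrelation
        sigma_adj = sigma_naive * np.sqrt((1 + rho) / (1 - rho))
        return sigma_naive, sigma_adj

For what it’s worth, a lag-one autocorrelation of about 0.44 would inflate sigma by a factor of about 1.6 under this formula – the same ratio as observed above – though whether that is more than coincidence, I cannot say.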

[Update: The relevant section of Stoica says that periodogram estimates of the spectrum are inconsistent: they do not converge with N, but continue to behave as random variables. This doesn’t sound promising as a method of estimating an adjustment, that’s for sure. Now, Mann estimates his spectrum using Thomson’s multitaper method – Stoica did not discuss this methodology; the purpose of the multitaper method is to reduce the variance of the estimate, so it’s probably less bad than the periodogram, but the whole procedure, whatever it is, still doesn’t sound promising. Having said that, doing nothing is not an alternative either.]
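
As a quick numerical check of Stoica’s point (my own sketch, assuming Gaussian white noise): the raw periodogram ordinates have a spread comparable to their mean at every sample size, so lengthening the series does not tighten the estimate at any individual frequency:

    import numpy as np

    rng = np.random.default_rng(1)

    for n in (128, 1024, 8192):
        x = rng.normal(size=n)                   # unit-variance white noise
        pgram = np.abs(np.fft.rfft(x)) ** 2 / n  # raw periodogram
        interior = pgram[1:-1]                   # drop the DC and Nyquist bins
        # The mean stays near 1, but so does the spread, however large n gets
        print(n, round(float(interior.mean()), 2), round(float(interior.std()), 2))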