Conflicted Reviewers Distort Literature

The comments by James Annan and his commenters here on McKitrick et al (2010) demonstrate very nicely how the literature gets distorted by the rejection of a simple comment showing that the application of Santer's own method to updated data resulted in failure on key statistics. Annan and his commenters worry about the novelty of the method and accuse us of being subject to the same criticisms that Santer made of Douglass.

The statistical apparatus of MMH10 is used in econometrics and is not “novel” in that sense. But it is unfamiliar to climate science readers and it’s entirely reasonable for them to wonder whether there is some catch to the new method. It’s a question that I would ask in their shoes.

That Annan and his commenters should be in a position to make such a comment shows how the IJC reviewers and editor Glenn McGregor have succeeded in poisoning the well by rejecting a simple comment showing that key Santer results fail with updated data (our comment is on arxiv here).

In our rejected IJC comment, we used Santer's exact methodology. Nonetheless, Annan makes the following accusation:

Tuesday, August 10, 2010
How not to compare models to data part eleventy-nine…

Not to beat the old dark smear in the road where the horse used to be, but…

A commenter pointed me towards this which has apparently been accepted for publication in ASL. It’s the same sorry old tale of someone comparing an ensemble of models to data, but doing so by checking whether the observations match the ensemble mean.

Well, duh. Of course the obs don’t match the ensemble mean. Even the models don’t match the ensemble mean – and this difference will frequently be statistically significant (depending on how much data you use). Is anyone seriously going to argue on the basis of this that the models don’t predict their own behaviour? If not, why on Earth should it be considered a meaningful test of how well the models simulate reality?

Later a commenter writes:

1) Didn’t Santer address this point? e.g. “application of an inappropriate statistical ‘consistency test’.” Perhaps you’re right that by adding all the extra bits to the paper, they made it so that an idiot might not realize the elementary nature of the most important error, and we need to keep in mind that there are many idiots out there, but…

To which Annan responds:

I haven’t got Santer to hand (and am about to go travelling, so am not going to go looking for it) so I will take your word for it. In which case this new paper is pretty ridiculous. Well, it’s ridiculous anyway.

The observation that key Santer results do not hold up with more recent data is not "ridiculous". It holds for key Santer results using Santer's own methodology. The only reason that this information is not in the "literature" is that IJC editor Glenn McGregor did not feel that, as an IJC journal editor, he had any responsibility to place a rebuttal of Santer results in the literature, and appears to have permitted reviewers with conflicts to determine the outcome of the rebuttal.

But now the debate is muddied because it is entangled with understanding a different methodology.

The people to blame for the muddying of the debate are McGregor and the IJC reviewers who rejected our simple comment.

If, as seems likely, the most adverse reviewer was Santer coauthor Peter Thorne of the UK Met Office, then Thorne would be the person most responsible for Annan and his readers being unaware of this result. Thorne wrote to Phil Jones on May 12, 2009 (there had been no CA discussion to that point and the decision had been issued on May 1, 2009) as follows:

Mr. Fraudit never goes away does he? How often has he been told that we don’t have permission? Ho hum. Oh, I heard that fraudit’s Santer et al comment got rejected. That’ll brighten your day at least a teensy bit?

This represents the attitude of the climate science peer reviewers who tied up our Santer comment at IJC.

Our comment didn't do anything novel or fancy. It simply applied Santer's methodology to updated data and showed that key results no longer held up. As noted yesterday, the paper was rejected. One reviewer's principal complaint proved to be not with our results, but with Santer's own methodology. (It looks like this reviewer was Peter Thorne, who ironically was one of the Santer coauthors.)

The authors should read Santer et al. 2005 and utilise this diagnostic. It is a pity that Douglass et al took us down this interesting cul-de-sac and that Santer et al 2008 did not address it but rather chose to perpetuate it. The authors could reverse this descent away to meaningless arguments very simply by noting that the constrained aspect within all of the models is the ratio of changes and that therefore it is this aspect of real-world behaviour that we should be investigating, and then performing the analysis based upon these ratios in the models and the observations.

McKitrick et al (2010) accepted by Atmos Sci Lett

CA readers are aware that Ross and I twice submitted a comment on Santer et al 2008 to the International Journal of Climatology, showing that key Santer results (which were based on data only up to 1999) were overturned with the use of up-to-date data. Both submissions were rejected, but both have been posted at arxiv.org. Ross has now led a re-framed submission, applying an econometric methodology to the analysis. This is available, together with the SI and the data/code archive, here.

M08 with realdata

Yesterday we noted that "validation" in M08 means that the average of the "late-miss" and "early-miss" RE statistics is above a benchmark of about 0.35. I take no position at present on whether this unusual methodology means anything, though I'm a bit dubious.
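To make that benchmark concrete, here is a minimal R sketch of the RE (reduction of error) statistic and the averaging step as I read the description above; the function name, variable names and the 0.35 cutoff are illustrative rather than taken from Mann's code.

# Minimal sketch (not Mann's code): RE over a verification period, using the
# calibration-period mean of the observations as the no-skill reference.
re_stat <- function(obs, recon, calib, verif) {
  ref <- mean(obs[calib])                       # calibration-period climatology
  1 - sum((obs[verif] - recon[verif])^2) / sum((obs[verif] - ref)^2)
}

# M08-style "validation" as described above: average the late-miss and
# early-miss RE statistics for a step and compare to a benchmark of ~0.35.
# validates <- mean(c(re_latem, re_earlym)) > 0.35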

I also observed that the with-dendro reconstructions added surprisingly few series in the AD1000 step to the no-dendro reconstruction, since the majority (16 of 19 NH) of dendro series are screened out.

I also noted that the average RE statistic (and thus "validation") turned out to be sensitive to the inclusion/exclusion of one or two series. In yesterday's case, the exclusion of the one "passing" bristlecone in the AD1000 network strongly affected the period of validation.

Today, I'm going to report on another interesting calculation, this time re-visiting one issue noticed right at the time of publication of M08 and another that I just noticed today, both involving the use of realdata.

First, in September 2008, Jeff Id and I both discussed M08's replacement of actual Briffa MXD data after 1960 with "infilled" data. Obviously the deletion of post-1960 Briffa data has a long and sordid history. It was also deleted from the IPCC Third Assessment Report spaghetti graph and from the AR4 spaghetti graph. It was the data at issue in the 'trick … to hide the decline' email. In this case, actual data was not replaced by temperature data, but by "RegEM data" calculated in Rutherford et al 2005. As a reader notes in comments, Rutherford et al say that they did not use Luterbacher data in making this particular sausage. At this time, I don't know for sure exactly what is in the RegEM sausage as a substitute for actual MXD data. Actual data for the 105 MXD series (out of the 1209 M08 series) had been replaced after 1960 by, shall we say, verification-enhanced RegEM data. I had obtained realdata from CRU with a successful FOI request in 2008 and replaced the "enhanced" data with realdata for this analysis.
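In code, the substitution amounts to something like the sketch below; the function and variable names are hypothetical, and the actual run is in the make.realdata.txt script linked later in this post.

# Sketch of the substitution described above (names are hypothetical): for each
# MXD proxy, overwrite the post-1960 infilled RegEM values with the measured
# values obtained from CRU under FOI, where measurements exist.
swap_post1960 <- function(m08_series, cru_series, years) {
  out <- m08_series
  idx <- years > 1960 & !is.na(cru_series)
  out[idx] <- cru_series[idx]                   # realdata replaces the infilled values
  out
}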

Second, M08 used performance-enhancing data in the form of 70 Luterbacher gridded data sets that incorporate actual instrumental data. These series were dropped in Mann et al 2009 (Science), and I drop them here as well in both the no-Tilj and the realdata cases.

Third, and this was intriguing: yesterday I noted that there were only three dendro series in the AD1000 with-dendro network. I discussed the bristlecone series yesterday. Today I looked at the "Tornetrask" series, the citation for which was Briffa, K.R. et al, 1992 (Clim Dyn) (the "bodged" reconstruction). However, the M08 version went from 1 to 1993 – the time period of Briffa (2000) – yet it matched neither the Briffa (2000) version nor the Briffa et al 2008 Tornetrask-plus-Finland version. Adopting the policy of using realdata whenever possible, I replaced the M08 "Tornetrask" version with realtornetrask data (the Briffa 2000 chronology in this case).

I then re-ran the M08 CPS algorithm using a realdata version, extracting the RE statistics by step as before. Without the performance-enhancing Luterbacher series (fortified with instrumental data) and with real MXD data, the RE statistics after 1500 are noticeably reduced for both the late-miss and early-miss versions. Before 1500, the late-miss RE statistic is also reduced significantly (due to no-Tilj and the impact of realtornetrask).


Figure 1. Late-miss and early-miss RE statistics for three M08-style networks: M08, from archived statistics; no-Tilj, removing contaminated Tiljander sediments and Luterbacher; realdata, using real Schweingruber MXD data and real Tornetrask data, and without the performance-enhancing Luterbacher gridded series containing instrumental data.

I then calculated the average RE statistics used to "validate" the M08 reconstructions. As you see, the realdata network results in substantially lower average RE statistics in all periods.

M08 do not report their benchmark "95% significant" statistic, but by inspection of their diagrams and the archived spreadsheet cps-validation.xls, the 95% benchmark looks to be about 0.355. This is plotted as a horizontal line.


Figure 2. Average RE statistics for the three cases as above. Solid line shows the "adjusted" M08 RE statistic (which is basically the "best" RE statistic if there is a higher RE in a network with an earlier starting date). Dotted line shows the M08 "95% significant" RE benchmark.

In this example using realdata, the with-dendro M08 CPS reconstruction fails "validation" for all intervals. This example (CRU_NH) had the "best" "validation" statistics in the base M08 cases, so I don't expect other combinations to fare any better.

The script to make these figures from previously collated Notilj and realdata versions is at
http://www.climateaudit.info/scripts/mann.2008/blog_20100808.txt. The runs to make the versions are at
http://www.climateaudit.info/scripts/mann.2008/make.Notilj.txt
http://www.climateaudit.info/scripts/mann.2008/make.realdata.txt

Update: Here's a graphic for the CRU_SH network – which is unaffected by Tilj and Luter – comparing my emulation with the archived M08 verification stats.

Mann and his bristlecones

Gavin Schmidt and others have claimed that the M08 usage of the Tiljander sediments didn't "matter", because they could "get" a series that looked somewhat similar without the sediments. They've usually talked around the impact of the Tiljander series on the no-dendro reconstruction. But there are two pieces of information on this. A figure added to the SI of Mann et al 2008 showed a series said to be a no-tilj no-dendro version, about which Gavin said that it was similar to the original no-dendro version, thereby purportedly showing that the incorrect M08 use of the Tiljander series didn't "matter". However, Gavin elsewhere observed that the SI to Mann et al 2009 reported that withdrawing the Tiljander series from the no-dendro network resulted in the loss of 800 years of validation – something that is obviously relevant to the original M08 claim to have made a "significant" advance through their no-dendro network.

To better understand Gavin's seemingly inconsistent claims, I re-examined my M08 CPS emulation – I had previously replicated much of this, but this time managed to get further, even decoding most of their (strange) splicing procedures. As I've done for MBH98, I was able to keep track of the weights of individual proxies in the reconstruction – something not done in Mann's original code, though obviously relevant to the reconstruction. This was not a small project, since you have to keep track of the weights through the various screening, rescaling, gridding and re-gridding steps – something that can only be done by re-doing the methodology pretty much from the foundations. However, I'm confident in my methods and the results are very interesting.
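The bookkeeping idea is simple even if the details are tedious: since each CPS step (gridding, averaging, rescaling) is linear in the proxies, the final composite can be written as a weighted sum of the standardized proxies, and an implied weight vector can be carried through each step. The sketch below is my own illustration of that idea, not Mann's code; the proxies and the steps are hypothetical.

# Illustrative weight bookkeeping (not Mann's code): carry a weight vector per
# intermediate series so that the composite's implied proxy weights are known.
set.seed(1)
proxies <- matrix(rnorm(500 * 5), ncol = 5)     # 5 hypothetical standardized proxies
W <- diag(5)                                    # row i = weights defining series i

# step 1: average proxies 1 and 2 into a "gridcell" series
grid1   <- rowMeans(proxies[, 1:2])
w_grid1 <- colMeans(W[1:2, ])                   # implied weights follow the same average

# step 2: composite = mean of the gridcell series and the remaining proxies
composite <- rowMeans(cbind(grid1, proxies[, 3:5]))
w_comp    <- colMeans(rbind(w_grid1, W[3:5, ]))

# step 3: any rescaling of the composite rescales the implied weights identically
scale_fac <- 2.0                                # e.g. scaling to an instrumental s.d.
w_final   <- w_comp * scale_fac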

The first graphic below shows, on the left, the NH and SH reconstructions for the AD1000 network for the two calibration steps considered in M08: latem (calibration 1850-1949) and earlym (calibration 1896-1995), for the "standard" M08 setup. On the right are "weight maps" for the latem and earlym networks. (The weight map here is a bit muddy – I've placed a somewhat better rendering of the 4 weight maps online here.) The +-signs show the locations of proxies which are not used in the reconstruction. Take a quick look and I'll comment below.


Left – AD1000 network reconstructions, latem and earlym, NH and SH. Right – weight maps, earlym and latem.

First, there are obviously a lot of unused series in the M08 with-dendro reconstruction. Remarkably, the exclusions are nearly all dendro series. Out of 19 NH dendro chronologies, 16 are not used; only three are used: one Graybill bristlecone chronology (nv512) in the SW USA, Briffa's Tornetrask (Sweden) and Jacoby-D'Arrigo's Mongolia, all three of which are staples of the AR4 spaghetti graphs. Only one of 10 Graybill bristlecone chronologies "passes" screening.

In other words, nearly all of the proxies in the AD1000 network are "no-dendro" proxies. I.e., the supposedly improved "validation" of the with-dendro network arises not from a general contribution of dendro chronologies to recovery of a climate signal, but from the individual contribution of three dendro series, with the other 16 series screened out.
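For readers unfamiliar with the screening step, the sketch below shows the general flavour of calibration-period correlation screening. It uses a simple two-sided threshold; the threshold value, the function name and the omission of the local-gridcell and significance details are my own simplifications, not the exact M08 rules.

# Illustrative correlation screening (not the exact M08 rules): keep a proxy
# only if its calibration-period correlation with the instrumental target
# exceeds a threshold in absolute value.
screen_proxies <- function(proxies, target, calib, r_min = 0.34) {
  # proxies: years x nproxy matrix; target: instrumental series on the same years
  r <- apply(proxies[calib, , drop = FALSE], 2, cor, y = target[calib],
             use = "pairwise.complete.obs")
  list(keep = abs(r) >= r_min, r = r)
}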

Secondly, the reconstructions are weighted averages of the individual reconstructions. The latem and earlym reconstructions don't appear at first glance to have remarkably different weights, but they have noticeably different appearances. In the NH, the earlym 20th century is at levels that were precedented in the MWP, while the latem reconstruction has higher values in the 20th century than in the MWP – but with a marked divergence problem. This divergence problem results in a very low RE for the latem version (about 0), while the earlym version has an RE of 0.84. The earlym SH reconstruction has MWP values that are much higher than late 20th century values, while the latem SH reconstruction has MWP values that are lower than late 20th century values.
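To see why the choice of calibration window matters, here is a minimal composite-plus-scale sketch under the simplifying assumption of equal weights; the 1850-1949 and 1896-1995 windows correspond to the latem and earlym steps described above, and the function and variable names are hypothetical.

# Minimal composite-plus-scale sketch (equal weights assumed, not the full M08
# weighting): standardize, average, then rescale the composite to the
# instrumental mean and s.d. over the chosen calibration window.
cps_recon <- function(proxies, instr, years, calib_years) {
  z         <- scale(proxies)                   # standardize each proxy
  composite <- rowMeans(z, na.rm = TRUE)
  idx       <- years %in% calib_years
  (composite - mean(composite[idx])) / sd(composite[idx]) *
    sd(instr[idx], na.rm = TRUE) + mean(instr[idx], na.rm = TRUE)
}

# rec_latem  <- cps_recon(proxies, instr, years, 1850:1949)   # latem-style window
# rec_earlym <- cps_recon(proxies, instr, years, 1896:1995)   # earlym-style window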

Values of the latem RE statistic appear to be an important determinant of Mannian-style “validation” – more on this later.

As a fourth point – back in 2008, I'd noted that the M08 algorithm permitted the same proxy to have opposite orientations depending on the calibration period, and that at least one proxy did this. Note the Socotra (Yemen) speleothem in the weight map. It has opposite orientations in the two reconstructions – something that seems hard to justify on a priori reasoning and which appears to have a noticeable impact on the differing appearance of the two reconstructions.
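The mechanism is easy to see in a sketch: if a proxy's orientation is taken from the sign of its calibration-period correlation, then a proxy whose correlations have opposite signs in the two windows enters the two reconstructions with opposite orientations. The function below and the series name in the comments are hypothetical illustrations, not Mann's code.

# Illustrative orientation rule (not Mann's code): sign of the proxy's
# correlation with the instrumental target over the calibration window.
orient <- function(proxy, instr, years, calib_years) {
  idx <- years %in% calib_years
  sign(cor(proxy[idx], instr[idx], use = "pairwise.complete.obs"))
}

# For a series like the Socotra speleothem, the two windows can disagree:
# orient(socotra, instr, years, 1850:1949)   # e.g. +1
# orient(socotra, instr, years, 1896:1995)   # e.g. -1 for the same series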

In the SH, there are obviously only a few relevant proxies. The big weight comes from Lonnie Thompson's Quelccaya (Peru) data, with other contributions from Cook's dendro series in Tasmania and New Zealand (the Tasmania series being an AR4 staple) and from a South African speleothem.

Other NH series include Baker's Scottish speleothem (used upside down from the orientation in Baker's article), the Crete (Greenland) ice cores (an AR4 staple), the old Fisher Agassiz (Ellesmere Island) melt series (used in Bradley and Jones 1993), the Dongge (China) speleothem and the Tan (China) speleothem. A number of these proxies have been discussed at CA.

Notilj Nobristle
One of the large issues in respect to MBH98-99 was the impact of bristlecones. Eventually, even Wahl and Ammann conceded that an MBH-style reconstruction did not "verify" prior to 1450 at the earliest without Graybill bristlecones. However, for the most part, the Team avoided talking about bristlecones, most often trying to equate no-bristle (or even no-Graybill) sensitivities with no-dendro sensitivity, over-generalizing criticisms of bristlecone chronologies into criticism of all dendro chronologies. M08 adopted the same tactic – discussing no-dendro, rather than no-bristle (which was the actual point at issue).

I've done experiments calculating M08-style CPS reconstructions, first with no-tilj and then with no-tilj no-bristle. At this point, no-tilj should be the base case for an M08-style result, as there is no scientific justification for including this data in an M08-style algorithm: it doesn't meet any plausible criteria for inclusion.

Below are results for the no-tilj no-bristle case. At first glance, the shape of the recons looks fairly similar to the M08 case. In detail, there are some important differences: for example, the divergence problem in the no-tilj no-dendro latem reconstruction is much more pronounced than in the M08 reconstruction, where the huge ramp of the Tilj sediments and the bristlecones mitigates the divergence problem considerably.

These differences arise with relatively little difference in the relative weights of the other proxies.

Mannian Splicing
M08 has a unique methodology for splicing reconstruction steps – one which you definitely can't read about in Draper and Smith. First they calculate RE statistics for the latem and earlym reconstructions. In the figure below, I've plotted latem and earlym RE statistics for the different steps under three cases:
(1) M08 from their archive, shown as a line;
(2) no-tilj (using my emulation of M08), shown as + signs;
(3) no-tilj no-bristle, shown as "o". As noted above, "no-bristle" in this context only involves one series (nv512).

The latem RE stat decreases quite dramatically without the Tilj and nv512 data sets.


Figure 3. Mannian RE stats.

This sharp decline in the latem RE statistic ends up affecting the rather weird M08 "validation" method. From the earlym and latem RE statistics, Mann calculated an "average" RE statistic – another ad hoc and unheard-of method. If the "average" RE statistic is above a benchmark that looks like it's about 0.35 (note that this benchmark is a far cry from the benchmark of 0 used in MBH98 and Wahl and Ammann 2007 – one that we criticized in our 2005 articles – more on this on another occasion), the series is said to "validate". If the addition of more data in a step fails to increase the average RE (and the average CE), then the earlier version with fewer data is used. This is "justified" in the name of avoiding overfitting, but it is actually an extra fitting step based on RE statistics.
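As I read that rule, the step selection amounts to something like the sketch below; the function name and the numbers are illustrative, and this is my reading of the procedure rather than Mann's code.

# Illustrative step-selection rule (my reading, not Mann's code): for each
# reconstruction step, average the latem and earlym RE (and CE) statistics;
# if adding the newer step's data does not raise them, keep the earlier step.
select_step <- function(avg_re_prev, avg_ce_prev, avg_re_new, avg_ce_new) {
  improves <- (avg_re_new > avg_re_prev) && (avg_ce_new > avg_ce_prev)
  if (improves) "use new step" else "keep earlier step"
}
select_step(0.42, 0.30, 0.38, 0.28)   # -> "keep earlier step"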

In any event, the reason why the no-tilj no-bristle case (and, a fortiori, the no-tilj no-dendro case) fails to "validate" prior to AD1500 or so is simply that, without the Tilj series and nv512, the latem RE statistic becomes negative due to the divergence problem. (I haven't studied the EIV/RegEM setup, but I suspect that the same sort of thing is what's causing its failure as well.)

In a way, the situation is remarkably similar to the MBH98 situation and bristlecone sensitivity. One point on which Mann et al and we were in agreement was that the AD1400 MBH98 reconstruction (and earlier steps) failed their RE test without bristlecones. In the terminology of M08, without bristlecones, they did not have a "validated" reconstruction as at AD1400 and thus could not make a modern-medieval comparison with the claimed statistical confidence.

Ironically, the situation in M08 appears to be almost identical. Once the Tilj proxies are unpeeled, Mann once again doesn't have a "validated" reconstruction prior to AD1500 or so, and thus cannot make a modern-medieval comparison with the claimed statistical confidence. (By saying this, I do not agree that his later comparisons mean anything; however, they don't "matter" for the modern-medieval comparison.)

Mosher on Gavin’s “Frustration”

Mosher writes in:

The No-Dendro Illusion

In September 2008, Mann et al reported a "significant development" in paleoclimate reconstructions – a "skillful" reconstruction without tree ring data for over 1300 years.

A skillful EIV reconstruction without tree-ring data is possible even further back, over at least the past 1,300 years, for NH combined land plus ocean temperature (see SI Text). This achievement represents a significant development relative to earlier studies with sparser proxy networks (4) where it was not possible to obtain skillful long-term reconstructions without tree-ring data.

The story was widely covered at the time, and the result has been relied upon to marginalize criticism of the reliance of IPCC multiproxy studies on strip bark bulges or tree ring chronologies developed by CRU. Now it turns out that the much-vaunted claim to have a "validated" no-dendro reconstruction for the past 1300 years was merely an illusion.

Not only was it an illusion, but recent admissions by Gavin Schmidt show that it foundered on Mann's much-criticized use of the Tiljander sediments – a topic on which the seeming obtuseness of the climate science community to the simplest of issues (e.g. contamination by bridge and agricultural sediments) has mystified third parties over the past two years. Only last month, Schmidt had reassured readers at Keith Kloor's that Mann's misuse of the Tiljander sediments didn't "matter". It turns out that it did.

Mann versus the Provincial Parrots

Roman M and TomRude have observed an interesting letter-writing campaign in which Michael Mann contests adverse opinion in provincial newspapers, accusing the letter writers of being "parrots".

Today (July 31, 2010), Mann sent the following letter to the Saint John (New Brunswick) Telegraph Journal, objecting to a letter published July 30. Similar letters were sent on July 22 to the Fredericton (New Brunswick) Daily Gleaner and on July 29 to the Minneapolis Star Tribune.

Make a stick, make a stick

NASA blogger Gavin Schmidt, as part of his ongoing attempt to rehabilitate Mannian paleoclimate reconstructions (characterized here as dendro-phrenology), has drawn attention to a graphic posted at Mann's website in November 2009. In this graphic, Mann responded to criticisms that his "no-dendro" stick had been contaminated by bridge-building sediments despite warnings from the author (warnings noted by Mann himself, though the contaminated data was used anyway). I'll show this figure at the end of the post, but first I'm going to show the "raw materials" for this "reconstruction" and my results from the same data.

Kola versus Yamal

A news release on a new tree ring study here (h/t Anthony Watts) reported a reconstruction maxing out in the mid-20th century, with the characteristic late 20th century divergence problem. Their results contrast with CRU’s notorious Yamal chronology:

Following the summer temperature reconstruction on the Kola Peninsula, the researchers compared their results with similar tree-ring studies from Swedish Lapland and from the Yamal and Taimyr Peninsulas in Russian Siberia, which had been published in Holocene in 2002. The reconstructed summer temperatures of the last four centuries from Lapland and the Kola and Taimyr Peninsulas are similar in that all three data series display a temperature peak in the middle of the twentieth century, followed by a cooling of one or two degrees. Only the data series from the Yamal Peninsula differed, reaching its peak later, around 1990. What stands out in the data from the Kola Peninsula is that the highest temperatures were found in the period around 1935 and 1955, and that by 1990 the curve had fallen to the 1870 level, which corresponds to the start of the Industrial Age. Since 1990, however, temperatures have increased again evidently.

Although the reconstruction has declined since the mid-20th century, the sub-headline reads: "New data indicate rapid temperature rise in the coldest region of mainland Europe".

EPA Denies Reconsideration Petitions

The EPA, as expected, has denied the various petitions for reconsideration of its Endangerment Finding. It refers to the various "inquiries" on some points. Interesting reading here:
http://epa.gov/climatechange/endangerment/petitions.html