McKitrick et al (2010) accepted by Atmos Sci Lett

CA readers are aware that Ross and I twice submitted a comment on Santer et al 2008 to International Journal of Climatology, showing that key Santer results (which were based on data only up to 1999) were overturned with the use of up-to-date data. Both submissions were rejected, though both have been posted online. Ross has now led a re-framed submission, applying an econometric methodology for the analysis. This is available, together with SI and data/code archive here.

Although key Santer et al 2008 results are invalid with up-to-date data, they have been widely cited as showing that there is no inconsistency between models and observations in the tropical troposphere (e.g. CCSP, EPA), as had been previously believed/argued by some.

IJC reviewers and editor Glenn McGregor took the position that the invalidity of key Santer results was not of interest to the climate science community. They proposed all sorts of other investigations as a precondition for publication – many of them interesting enterprises and suggestions, but all very time consuming and not relevant to the simple issue of whether key Santer results were overturned with up-to-date data.

The reviewers of our first submission refused to permit the editor to provide us with their actual reviews, requiring the editor to paraphrase their reviews.

In our second try, one of our reviewers objected to us using Santer et al 2008 methods in a comment on Santer et al 2008. He argued that the S08 methods were incorrect (blaming Douglass et al for leading them down a “cul-de-sac”) and condemned our demonstration that their results fell apart with up-to-date data as a “descent away to meaningless arguments”. He argued that our comment on S08 should instead use the “diagnostic” of Santer et al 2005.

The authors should read Santer et al. 2005 and utilise this diagnostic. It is a pity that Douglass et al took us down this interesting cul-de-sac and that Santer et al 2008 did not address it but rather chose to perpetuate it. The authors could reverse this descent away to meaningless arguments very simply by noting that the constrained aspect within all of the models is the ratio of changes and that therefore it is this aspect of real-world behaviour that we should be investigating…

Again, potentially an interesting enterprise, but hardly relevant to the simple demonstration of our short comment. Particularly when Santer et al 2005 does not, in fact, have a “diagnostic” reduced to a statistical test.

The history of our comment was somewhat played out in the Climategate letters. In one Climategate email, Peter Thorne of the UK Met Office, a Santer coauthor, who appears to have been one of the reviewers who rejected our submission, wrote to Phil Jones notifying him of the rejection of our submission, using the defamatory term “Fraudit”.

Our first comment was submitted in January 2009 and the second comment in August 2009. I had previously reported some of the findings at Climate Audit. Despite numerous Climate Audit posts on our findings and two efforts to publish our results, Santer accused me of asking for data purely as a “fishing expedition” – see NASA blogger Gavin Schmidt’s realclimate here.

Mr. McIntyre’s FOIA requests serve the purpose of initiating fishing expeditions, and are not being used for true scientific discovery.

Santer’s campaign for support for his obstruction of my data requests accounts for many Climategate letters. Santer and Phil Jones, both members of the editorial board of Climatic Change, had previously co-operated in 2004 in ensuring that Climatic Change did not require Mann et al to comply with reviewer requests for supporting data and code.

Santer did ultimately place some of the requested material online. Despite Santer’s whining and delaying, this archive was very useful as it enabled co-author Chad Herman of the excellent treesfortheforest blog to benchmark his own emulation of Santer’s calculations and to create a fresh archive of PCMDI runs. Chad’s archive is FAR more usable for statistical analysis than endlessly re-processing PCMDI and may well have use for interested parties over and above the analysis in this article. (Remind me to discuss this at greater length).

After a certain point, Ross gave up on us being able to publish the simplest of comments at IJC and re-framed the analysis with “new” econometric methodology and submitted to Atmospheric Science Letters. There was a Team reviewer, but the editor permitted Ross to respond and used his own judgment on the response – this is what is referred to in Climategate letters as a “leak” in the journal network.

Here is Ross’ letter to colleagues:

You might be interested in a new paper I have coauthored with Steve McIntyre and Chad Herman, in press at Atmospheric Science Letters, which presents two methods developed in econometrics for testing trend equivalence between data sets and then applies them to a comparison of model projections and observations over the 1979-2009 interval in the tropical troposphere. One method is a panel regression with a heavily parameterized error covariance matrix, and the other uses a non-parametric covariance matrix from multivariate trend regressions. The former has the convenience that it is coded in standard software packages but is restrictive in handling higher-order autocorrelations, whereas the latter is robust to any form of autocorrelation but requires some special coding. I think both methods could find wide application in climatology questions.

The tropical troposphere issue is important because that is where climate models project a large, rapid response to greenhouse gas emissions. The 2006 CCSP report pointed to the lack of observed warming there as a “potentially serious inconsistency” between models and observations. The Douglass et al. and Santer et al. papers came to opposite conclusions about whether the discrepancy was statistically significant or not. We discuss methodological weaknesses in both papers. We also updated the data to 2009, whereas the earlier papers focused on data ending around 2000.

We find that the model trends are 2x larger than observations in the lower troposphere and 4x larger than in the mid-troposphere, and the trend differences at both layers are statistically significant (p<1%), suggestive of an inconsistency between models and observations. We also find the observed LT trend significant but not the MT trend.

If interested, you can access the pre-print, SI and data/code archive at my new weebly page


  1. kim
    Posted Aug 9, 2010 at 8:27 AM | Permalink

    Regressed by double,
    Parametered and non-p.
    Look all ’round the world.

  2. Punksta
    Posted Aug 9, 2010 at 8:32 AM | Permalink

    The idea that “fishing expeditions” are not part and parcel of science, is somewhat reminiscent of “Why should I show you my data if I know you are just going to find something wrong with it”.

    • Duster
      Posted Aug 11, 2010 at 8:01 PM | Permalink

      In reality, you could consider “fishing expeditions” an essential part of science. Such expeditions help to identify potential productive and unproductive holes. If it looks productive, you turn around and say, “it looks like we’ll be needing some new and independent data.” If not, you say, “well H***, we’ll have to look somewhere else.”

  3. Hu McCulloch
    Posted Aug 9, 2010 at 8:42 AM | Permalink

    URL doesn’t work.

  4. Posted Aug 9, 2010 at 8:47 AM | Permalink

    There’s a missing “close the href” tag somewhere in the paragraph about Santer’s response to the FOIA.
    Steve – fixed

  5. Bernie
    Posted Aug 9, 2010 at 8:59 AM | Permalink

    Figure 2 and Figure 3 are devastating, if replicated.

    The model error term compared to the real world is enough to make me wonder.

  6. j ferguson
    Posted Aug 9, 2010 at 9:21 AM | Permalink

    Are the three model bars in each chart identical and if so why are they repeated?

    • Posted Aug 9, 2010 at 9:35 AM | Permalink

      Yes they are identical. They are repeated in the chart as a visual aid. They are estimated in each of the panel regressions (look at the regression log files).

  7. Posted Aug 9, 2010 at 9:31 AM | Permalink

    I’d like to add a few comments to Steve’s post. The paper that went to ASL departs from the IJoC comment by building around 2 purposes. One is to provide an updated model/obs comparison in the tropical troposphere, and these results are likely to get the most attention for the time being — appropriately so.

    The other is to try and tutor the readership on modern trend comparison methods. I know that the “effective degrees of freedom” method is familiar and beloved among climatologists, but it is only accurate in limited circumstances, namely for computing the variance of a univariate AR1 process. We have some additional notes in the SI about why it is not accurate for other processes and for multivariate comparisons. Panel regression methods are easy to implement in standard software and in most cases the covariance matrix will be good enough, at least the way Stata estimates it. The multivariate method of Vogelsang and Franses is the most robust for the task and not that hard to program, but takes a bit of effort. One of the referees noted, however, that it was a “delight to read” and appreciated it being brought to the field’s attention. Hopefully these methods will get wider application, since trend comparisons are used all over the place, usually just eyeball methods (arghh).
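As a rough sketch of the “effective degrees of freedom” adjustment mentioned above, the snippet below fits an OLS trend, estimates the lag-1 autocorrelation r1 of the residuals, and inflates the trend standard error using n_eff = n(1 - r1)/(1 + r1). This is one common form of the adjustment, not the code from the paper or its archive. The function names and the simulated AR1 series are assumptions made purely for illustration, and, as noted above, the adjustment is only appropriate for a univariate AR1 process.

```python
import math
import random


def ols_trend(y):
    """Fit y = a + b*t by ordinary least squares for t = 0..n-1.

    Returns (slope, residuals, sxx), where sxx is the sum of squared
    deviations of t from its mean.
    """
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    resid = [v - (intercept + slope * t) for t, v in enumerate(y)]
    return slope, resid, sxx


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    m = sum(x) / len(x)
    num = sum((x[i] - m) * (x[i - 1] - m) for i in range(1, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den


def trend_with_ar1_adjustment(y):
    """Return the OLS slope with naive and AR1-adjusted standard errors."""
    n = len(y)
    slope, resid, sxx = ols_trend(y)
    r1 = lag1_autocorr(resid)
    # Effective sample size under an assumed AR1 residual process.
    n_eff = n * (1 - r1) / (1 + r1)
    if n_eff <= 2:
        raise ValueError("effective sample size too small for a trend test")
    sse = sum(e * e for e in resid)
    se_naive = math.sqrt(sse / (n - 2) / sxx)
    se_adjusted = math.sqrt(sse / (n_eff - 2) / sxx)
    return {"slope": slope, "r1": r1, "n_eff": n_eff,
            "se_naive": se_naive, "se_adjusted": se_adjusted}


def simulate_ar1_series(n, slope, phi, sigma, seed):
    """Hypothetical test data: a linear trend plus AR1 noise."""
    rng = random.Random(seed)
    noise = 0.0
    series = []
    for t in range(n):
        noise = phi * noise + rng.gauss(0.0, sigma)
        series.append(slope * t + noise)
    return series


if __name__ == "__main__":
    y = simulate_ar1_series(n=360, slope=0.01, phi=0.7, sigma=0.1, seed=1)
    print(trend_with_ar1_adjustment(y))
```

With positively autocorrelated residuals, n_eff falls below n and the adjusted standard error exceeds the naive one. The sketch handles a single series only and does not address the cross-series covariances that the panel and VF methods are designed for.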

    Our comparison of trends takes the model surface trend as given. As I understand it, Douglass et al. slid the vertical trend profile along so that the surface trend matches the observations. Since that tended to increase the trends aloft in most models, it would, in our case, increase the model/data mismatch and might make the 79-99 results match the Douglass et al ones. However there was no need for this step under the circumstances to show that the models and data are in disagreement on the 79-09 data.

    The refereeing process for ASL was a textbook case of what peer review is supposed to do. Our first submission only used the panel method on Santer’s original data. In that round the editor obtained 2 reviews with diametrically opposed recommendations, so he solicited 2 more reviews, both recommending rejection. But he gave us the opportunity to respond. Dealing with the 3 critical reviews required a total rebuild of the data set (ably handled by Chad Herman), and introduction of the VF method to address the higher-order AC processes. In the next round one referee was carried over and a new referee was added, making 5 altogether to that point, with some new criticisms to address, but again the editor allowed us to respond. In the final round one referee continued to maintain the paper should be rejected, but I was able to show that the 2 arguments put forward by the referee were erroneous and the editor approved the paper for publication. The paper improved a lot over the course of reviewing and we benefited from the adversarial process.

    Of the five referees, four were anonymous. The one who signed his review provided a long and largely negative but constructive critique. The editor’s decision to solicit a review from this individual eliminates any possibility for people to argue that the process was soft or an easy ride.

  8. Craig Loehle
    Posted Aug 9, 2010 at 10:14 AM | Permalink

    Lessons from this adventure:
    1) Yes, Virginia, there is censorship in climate science when journals are “not interested” in disproof of published work.
    2) Sometimes reviewers don’t know what they are talking about, and really should admit it.
    3) Yes, sometimes editors must exercise judgement and not just reject because there is a negative review. That is part of their job.

    • j ferguson
      Posted Aug 9, 2010 at 10:43 AM | Permalink


      I thought his experience was much more positive despite one publication’s rejection and the review process – it sounds like he got a wonderfully constructive one.

      One of my architectural efforts was subjected to successive reviews by ‘The Fine Arts Commission.’ Although I hated the idea of the process, two of the members were renowned architects whose criticisms and contributions significantly improved the project. At the point of the vote to approve, I held forth briefly on my discomfort with the process but confessed that the result was much better than what I had first envisioned and they should be thanked for their help with it.

      It turned out that they had never been thanked for their contributions and were quite charmed by the idea – approval issued.

      • MikeN
        Posted Aug 9, 2010 at 1:42 PM | Permalink

        That doesn’t strike me as remotely similar, and your case is much worse. Were they arguing the building’s structural integrity, or its aesthetic design?

        • j ferguson
          Posted Aug 9, 2010 at 2:07 PM | Permalink

          After looking at it, I concede its irrelevance, but to someone who takes those things seriously, and we all did, the comments and criticisms spoke to issues with which I was concerned and although not structural, the ‘collaboration’ produced a better result. I submit that the selection and combination of what gets covered in a paper and how it is discussed and expanded is aesthetic to some degree despite the eventual demonstration of ‘proof’ being much higher than what I did.

  9. anon
    Posted Aug 9, 2010 at 10:20 AM | Permalink

    “Fishing” is similar to “cherry picking”, both involve the use of true data that you do not like so you create an ideological sphere where the evidence, though true, becomes Unclean by being fished or cherrypicked.

    • Duster
      Posted Aug 11, 2010 at 8:43 PM | Permalink

      Not really. Fishing can be a form of exploratory data analysis. It may involve, for example, tossing a lot of data at a series of statistical tests and seeing what shakes out. It isn’t good statistics, since it is likely that you could randomly turn up apparently significant relationships that are bogus, and you certainly would not want to publish the results, but the possible significant relationships can be used to explicitly frame new research that collects new data. It isn’t pretty, but the fact is that searching for WHAT to research is an essential part of the process of science that is rarely or never addressed in classes in any science. There’s a frequent assumption that 1) the scientist already KNOWS what the question is, and 2) knows how to address it. In fact, like the mice in the Hitch Hiker’s Guide to the Galaxy, we frequently know the answer, but not what the actual question is.
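The point about bogus “significant” relationships can be made concrete with a small calculation. The sketch below assumes the tests are independent and each run at the same significance level, which is a simplification, and the function names are hypothetical.

```python
def prob_any_false_positive(num_tests, alpha=0.05):
    """Chance of at least one spurious 'significant' result among
    num_tests independent tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance level that keeps the overall false-positive
    rate at or below alpha (a conservative correction)."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(m, round(prob_any_false_positive(m), 3), bonferroni_threshold(m))
```

Under these assumptions, twenty unrelated tests at the 5% level have a better than even chance of producing at least one apparent hit, which is why results found this way are better treated as hypotheses for new data than as findings.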

      • Bernie
        Posted Aug 11, 2010 at 9:53 PM | Permalink

        I agree, if you can ensure that you can generate more data to test the refined hypothesis or theory.

  10. Posted Aug 9, 2010 at 11:19 AM | Permalink

    Well, ASL is obviously NOT a reputable climate publication.

  11. juakola
    Posted Aug 9, 2010 at 11:54 AM | Permalink

    I think this paper should be ignored by the community on the whole, and perhaps we should consider ASL no longer a legitimate peer-reviewed journal? Please suggest your colleagues not to cite, or to publish any more papers in this journal.

    • Posted Aug 9, 2010 at 4:54 PM | Permalink

      Tom crowley has sent me a direct challenge to mcintyre to start contributing to the reviewed lit or shut up. i’m going to post that soon.
      -Andy Revkin, in the emails

      So more likely he won’t submit for peer-reviewed scrutiny, or if it does get his criticism “published” it will be in the discredited contrarian home journal “Energy and Environment”. I’m sure you are aware that McIntyre and his ilk realize they no longer need to get their crap published in legitimate journals. All they have to do is put it up on their blog, and the contrarian noise machine kicks into gear, pretty soon Druge, Rush Limbaugh, Glenn Beck and their ilk (in this case, The Telegraph were already on it this morning) are parroting the claims. And based on what? some guy w/ no credentials, dubious connections with the energy industry, and who hasn’t submitted his claims to the scrutiny of peer review.
      -Michael Mann, Sept 2009, in the emails

  12. scientist
    Posted Aug 9, 2010 at 12:48 PM | Permalink

    It sounds like the paper changed pretty substantially if different methods were brought in, mid-way. So perhaps the early negative reviews were on target. Glad that McKitrick kept working on his analyses and got some sort of finding.

    Steve: the findings about the impact of updated information on the Santer results remained unchanged throughout.

  13. Posted Aug 9, 2010 at 1:05 PM | Permalink

    The methods used were new to me. It was interesting, and probably will be so for some time.

    I hope that some of the readers here recognize that this is one of the most important results in climate this year. A lot of the other stuff out there is just repeating the mantra and comparing results to these same models.

    From this, we can now state with certainty that the models have problems on even a 30 year scale. They do exaggerate trend on that scale and they therefore cannot project temperatures out a hundred years. Most here probably recognized the shakiness of the projection aspect of models already, but it’s now proven.

    One of my favorite lines from the paper references adding the last ten years into the analysis record:

    “But with the addition of another decade of data the results change, such that the differences between models and observations now exceed the 99% critical value.”

    Good stuff that!

    Stick a fork in ’em, they’re done – if anyone in the climate modeling game pays attention. I’m not new enough in this game anymore to think they will. Someone should drop a link to gavin and see what he says.

    Steve; the same results were in the Jan 2009 submission to IJC (and related results reported in 2008 at CA). In the 18 months that IJC reviewers tied this article up, Santer et al 2008 has been cited in policy reports.

  14. Patagon
    Posted Aug 9, 2010 at 1:07 PM | Permalink

    I find the reviewer rejection very interesting:

    It is a pity that Douglass et al took us down this interesting cul-de-sac and that Santer et al 2008 did not address it but rather chose to perpetuate it.

    The reviewer seems to betray some frustration at Santer having exposed the great cacophony of models among themselves.

    The authors could reverse this descent away to meaningless arguments very simply by noting that the constrained aspect within all of the models is the ratio of changes and that therefore it is this aspect of real-world behaviour that we should be investigating…

    “Real world behaviour”. I find it puzzling that climate science must be the only discipline where a simulation of a real world process (a model) does not need to come with results close to the actual values in the real world. It suffices that the simulations replicate the rate of change. There seems to be a real consensus about that. If the rate of change is replicated we can trust the model.

    That is astonishing, I cannot find any physical, mathematical, philosophical or statistical justification for that. If your simulation can not replicate a key parameter but only its rate of change it tells you interesting things about the model (the good fit of the calibration for example), but it lacks predictive ability. Try that approach on other disciplines that rely heavily on computer simulations and models, wing design aerodynamics, for example. I would be happy to know the reactions.

    GCMs are a great achievement, there is a lot of effort into them and many aspects of atmospheric behaviour are better understood thanks to them. But they are not an oracle in any way. If they fail to simulate a key atmospheric behaviour, such as temperature profiles, that means that the implementation of the climatic system energy balance is incorrect. It is good that they tell us that, so we know where to put more research effort. The tendency to assume that they can not be wrong is just an alienation of what models are meant to be.

    For those who may not be aware of how far away from the mark models are in simulating temperature (up to 4K in some cases), here there is a chart by Lucia comparing actual outputs (not anomalies); I have added precipitation here.

    • Bob Koss
      Posted Aug 9, 2010 at 4:32 PM | Permalink

      Re: Patagon (Aug 9 13:07), I saw that absolute temperature graph of the models at Lucia’s last year. It leaves one wondering how accurate their modeling of ice cover and albedo is.

  15. David L. Hagen
    Posted Aug 9, 2010 at 1:09 PM | Permalink

    We find that the model trends are 2x larger than observations in the lower troposphere and 4x larger than in the mid-troposphere, and the trend differences at both layers are statistically significant (p<1%), . . .

    A dry statement with remarkable import. The Fleet street version: “400% error in climate models“.
    M&M’s results have game changer consequences and should be highly publicized. Showing non-overlapping error bars is powerful.
    (This also highlights the RSS vs UAH temperature divergence that needs resolution.)

    • nono
      Posted Aug 9, 2010 at 1:28 PM | Permalink

      I may have missed something, but when I compare Fig. 3 of both articles, it is striking that MM2010 have error bars associated with the models that are WAY smaller than that of Santer08. Hence the non-overlapping between models and satellites/radiosondes.

      How can the error bars be so different?

      • Posted Aug 9, 2010 at 4:38 PM | Permalink

        When you extend the data by 10 years the confidence in trend shrinks.

        • Posted Aug 9, 2010 at 4:38 PM | Permalink

          Confidence in knowledge of the trend that is.

      • Posted Aug 9, 2010 at 4:41 PM | Permalink

        That’s hilarious, someone take my keyboard away. Confidence intervals shrink – you know what I mean.

        • nono
          Posted Aug 9, 2010 at 6:57 PM | Permalink

          Shrink that much?

        • Posted Aug 9, 2010 at 8:17 PM | Permalink

          In the paper they acknowledge confirmation of Santer’s result then extend that result for recent data. In the ‘rejected’ paper I think Santer’s method was replicated then extended to the same conclusion.

          Santer mentioned this in the emails as well, claiming that some unnamed people didn’t understand that confidence intervals would tighten with more data.

          I think of it like a lever, the farther the data is from the midpoint, the stronger the moment. 30 years instead of 20 is a large difference.
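The lever analogy can be written out for the simplest case. The following is a sketch that assumes ordinary least squares on n equally spaced points with independent errors of variance σ²; it is not the error model used in the paper. With autocorrelated errors the standard error is larger, as discussed elsewhere in this thread.

```latex
\operatorname{Var}(\hat b) = \frac{\sigma^2}{\sum_{t=1}^{n}(t-\bar t)^2},
\qquad
\sum_{t=1}^{n}(t-\bar t)^2 = \frac{n(n^2-1)}{12}
\;\Longrightarrow\;
\operatorname{SE}(\hat b) = \sigma\sqrt{\frac{12}{n(n^2-1)}}
\approx \sigma\sqrt{12}\,n^{-3/2}

\frac{\operatorname{SE}_{30}}{\operatorname{SE}_{20}}
\approx \left(\frac{20}{30}\right)^{3/2} \approx 0.54
```

Under these assumptions, going from 20 to 30 years of data narrows the standard error of the trend by a little under half. The gain is faster than the familiar 1/√n rate for a mean because the added points sit far from the midpoint of the record.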

        • Craig Loehle
          Posted Aug 9, 2010 at 8:44 PM | Permalink

          If you have a process that is like a random fractal walk (red noise), the slope of a longer series will NOT necessarily have narrower confidence intervals. Santer is thinking like it is a normally distributed process like sampling a bunch of people’s heights. It is not.

          Steve: fractals are an interesting mathematical thing, but this issue is OT to this paper. It’s the sort of issue that has been discussed before.

        • VS
          Posted Aug 10, 2010 at 6:39 AM | Permalink

          Santer considered that an argument pro Santer et al [2008] vs MM…? That would be very amusing, were it not so tragic.


          A scientist measures the heights of three men and three women, and performs a test that fails to reject the H0 that men and women are of equal average height.

          A second scientist measures another twenty people, adds them to the dataset, performs the same test but rejects the H0 that men and women are of equal average height.

          Second scientist: “Apparently, when we extend the number of observations, and thereby introduce more confidence into our estimate of the difference in mean heights between men and women, one *can* reject the hypothesis that men and women are of equal average height”

          First scientist retorts: “That’s no rebuttal of my conclusions, you apparently don’t understand that confidence intervals for differences in means tighten with more observations”

          The whole point of the MM10 approach, as I understand it, is to measure (ie. estimate) the estimator covariance matrix without (arbitrarily) restricting the off-diagonal elements to zero. This in addition to introducing more obs.

          The fact that when Santer’s arbitrary restrictions are relaxed and more observations are taken into account, the confidence interval of the estimator of the difference between the trend slopes tightens – thereby rejecting the H0 of slope equality – is a *conclusion*.

          Any debate about the estimated confidence intervals should therefore revolve around the employed model specification. Needless to say, the arguments used in such a debate need to be *formal*.

          Climate Science™ might be all ‘yea’/’nae’… statistics isn’t.
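The two-scientists example can be sketched numerically. The heights below are made-up numbers chosen only for illustration, and the Welch t-statistic is an assumed choice of test, since the comment does not name one.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, standard error) for the difference in means
    of two samples, allowing unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se, se


# Hypothetical heights in cm: the first scientist's three men and three women.
men_small = [178, 170, 183]
women_small = [165, 172, 160]

# The second scientist adds twenty more people (ten men, ten women).
men_large = men_small + [175, 181, 177, 172, 185, 179, 174, 180, 176, 182]
women_large = women_small + [163, 168, 161, 170, 166, 159, 167, 164, 171, 162]

if __name__ == "__main__":
    print("small sample:", welch_t(men_small, women_small))
    print("large sample:", welch_t(men_large, women_large))
```

In this toy data the difference in means is similar in both samples, but the standard error is much smaller in the larger one, so the t-statistic is much larger. The tighter interval is what allows the rejection; it is not an objection to it.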

        • nono
          Posted Aug 10, 2010 at 7:27 AM | Permalink

          I understand that the confidence interval around the mean decreases with the number of observations.

          However, the difference between MM2010 and Santer08 is really huge. I’m wondering whether 10+ years of data are sufficient to shrink the error bars that much. Especially with autocorrelated data (you know, when you deal with autocorrelated data, you need even more obs to shrink the error bar by a given amount).

        • nono
          Posted Aug 10, 2010 at 7:28 AM | Permalink

          [Edit: ok, this issue seems to be discussed further below. Had not initially seen that.]

        • Steve McIntyre
          Posted Aug 10, 2010 at 8:51 AM | Permalink

          This comparison is made pointlessly difficult by the rejection of our Santer comment using Santer’s own methodology. See for some results using Santer’s methodology.

        • Posted Aug 10, 2010 at 9:00 AM | Permalink

          Re: Steve McIntyre (Aug 10 08:51),
          Steve, I was interested to run the R code associated with that paper. But the data sets seem to have moved. Do they have a new home?

          Steve: .org became .info. I’ve got more recent versions as well.

        • nono
          Posted Aug 11, 2010 at 8:10 AM | Permalink

          Excuse me Steve, but I’m comparing the results reported in Table 1 of your rejected comment, with Fig. 6B of Santer08…

          …and I fail to see the difference between the two — except that I can’t find the error bars in your comment.

          The trends reported in their Fig. 6B (supposedly up to 1999) seem to me equal to those reported by you (up to 2009), within the error bars.

        • stan
          Posted Aug 10, 2010 at 9:25 AM | Permalink

          I like that — ‘Climate Science’ with a trademark. Nice touch. As in — the proprietors who “own” the intellectual property rights to the designation are not happy whenever the designated phrase is used in a manner which contravenes their business interests.

          Well-played. Hope the usage catches on. [how do I do it on my keyboard?]

    • TA
      Posted Aug 9, 2010 at 3:28 PM | Permalink

      I am sorry, being unaccustomed to reading such graphs, I am not sure where to find the “error bars”. Are they the I-beams that appear overlapping with the top of each bar, with the center of each I-beam at the exact top of the bar? If so, this makes the graphs a very powerful statement. It would appear that the models’ error bars are much too short. Correct?

      • nono
        Posted Aug 9, 2010 at 4:00 PM | Permalink

        I think that your assumption is correct. And unless I’m completely mistaken, the closest thing that can be compared with Fig. 3 of MM2010 would be Fig. 3A of Santer08:

        NR-08-10-05-article.pdf

        with RSS/UAH on the right side, and average of the models +-2 sigma (i.e. “error bars”) as grey area.

        It seems to me that the error bar of MM2010 is +- 0.025, while that of Santer08 is +-0.12 (1 sigma). I wonder why such a difference.

      • scientist
        Posted Aug 9, 2010 at 5:46 PM | Permalink

        Are the I-bars the 2 sigma or one sigma? I agree that the models seem amazingly tight!?

        • scientist
          Posted Aug 9, 2010 at 7:53 PM | Permalink

          I’m concerned that there is some definitional disconnect giving you such AMAZINGLY tight error bars, some difference versus Santer, and that this is the operative issue, not the 10 years of extra data or the new method. (Of course, had you done the full factorial of all the changes, we could see what was causing what. We can’t tell when you are changing more than one thing at a time.)

          I suspect you are wrong if you are using a different definition than Santer, but even if not wrong, or arguably not wrong, I think it incumbent on you to discuss the difference if there was one (as was the case with the Douglass versus Santer dispute).

          Or perhaps there are just many fewer models with the long runs? So you show a significant difference for “these models” but others would not have one?

          In any case, I’ve asked James Annan to comment. He’s a sharp dude on this sort of thing.

          P.s. I think this is pretty content-filled post (even if I’m wrong) compared to all the peer review whining. But in any case cross posted at DC, to prevent/show your censoring…

        • Craig Loehle
          Posted Aug 9, 2010 at 8:47 PM | Permalink

          I am guessing that the narrower confidence intervals on the models are due to their covariation over time with each other, since they all go up even though some are warmer than others. The panel method takes that into account, but neither Douglass nor Santer are able to with their methods. Ross would need to speak to that.

        • scientist
          Posted Aug 9, 2010 at 9:59 PM | Permalink

          snip – nothing to do with McKitrick et al.

          ———–crossposted at Deep Climate because of censor———

        • Posted Aug 9, 2010 at 10:36 PM | Permalink

          The std errors are what they are, based on the model shown in the paper. Bear in mind, the panel model estimates the omega matrix one way, and the VF method estimates it a completely different way, but they both yield the same inferences about the significance of the trend differences. And as Steve has noted, we could even use Santer’s method and get similar results. So don’t expect any easy escape hatches by saying the std errors are too small.

        • pete
          Posted Aug 9, 2010 at 10:51 PM | Permalink

          So the error bars on the model estimates represent confidence intervals, not prediction intervals? Isn’t this the same mistake Douglass et al made?

        • nono
          Posted Aug 10, 2010 at 7:51 AM | Permalink

          Well I don’t expect any escape hatches, I’m just trying to understand the difference. (I’m also aware that it’s not the central point of the paper, and that those I-error bars anyway do not appear in the statistical analysis. Or am I wrong?)

          Anyway when you write “The std errors are what they are”, do you mean that, as far as the (N) models are concerned, you calculate it as:
          where trend_i is the trend (°C/decade) of model i?

        • nono
          Posted Aug 10, 2010 at 7:55 AM | Permalink

          sorry, I meant

        • Posted Aug 10, 2010 at 8:24 AM | Permalink

          Re: scientist (Aug 9 17:46),
          I agree. I’ve been contending at the Air Vent that the analysis does not take proper account of between model variability. And these tight error bands show that up. I’ve plotted a histogram here of the MT case. The models are far more scattered than those error bounds allow.
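The distinction running through this sub-thread, between the uncertainty of the multi-model mean trend and the scatter of individual model trends, can be sketched as follows. The trend values are hypothetical, and this simple calculation is not how the paper derives its standard errors, which come from the panel and VF covariance estimates.

```python
import math
import statistics


def ensemble_summary(trends):
    """Return (mean, between-model spread, standard error of the mean)
    for a list of per-model trend estimates."""
    n = len(trends)
    mean = statistics.mean(trends)
    spread = statistics.stdev(trends)    # scatter of individual models
    se_mean = spread / math.sqrt(n)      # uncertainty of the ensemble mean
    return mean, spread, se_mean


if __name__ == "__main__":
    # Hypothetical model trends in degrees C per decade.
    trends = [0.10, 0.20, 0.30, 0.40]
    print(ensemble_summary(trends))
```

The standard error of the mean shrinks as models are added, while the spread does not. Which of the two is the right yardstick depends on whether the hypothesis concerns the ensemble mean or an individual model run, and that is the point in dispute here.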

  16. John
    Posted Aug 9, 2010 at 1:49 PM | Permalink

    Is it Atmospheric Research Letters? Or Atmospheric Science Letters in which this article is now in press? The first is mentioned in Steve’s last paragraph, the second in Ross’s contribution above.

    Steve= fixed.

  17. Ed Waage
    Posted Aug 9, 2010 at 1:51 PM | Permalink

    Nice work Steve and Ross. I am glad you two are persistent. Kudos to the editor of ARL for doing the right thing.

    The radiosonde and satellite data appear to agree fairly well so this would seem to indicate that the real world data is “robust”.

    If the models were adjusted to match the observed data, one wonders if the models could still predict the levels of global warming attributed to CO2.

    • Craig Loehle
      Posted Aug 9, 2010 at 1:59 PM | Permalink

      Silly boy, the models can’t be wrong, so it must be the data that need fixing (smiley face goes here).

      • Kenneth Fritsch
        Posted Aug 9, 2010 at 3:33 PM | Permalink

        And, of course, we know that that argument has been seriously put forth to explain the model and observed differences. Since more of the observed data was used in MMH 2010 we would now have to conclude that all, or nearly all, the observed data is wrong – on average.

        We have also come a long way from the earlier papers that used a range of model results and a no difference conclusion between models and observations when the observed results were enveloped by the model range – and even with outlier model results, that showed no surface temperature trend, extending the model range.

  18. Craig Loehle
    Posted Aug 9, 2010 at 1:58 PM | Permalink

    Typo? p.6, last para, 1st sent: “has improved size as the sample size grows”
    can’t be right.
    Very elegant methods by the way. Thanks for doing this paper.

    • Posted Aug 9, 2010 at 2:07 PM | Permalink

      “Size” in this context is a statistical term, related to power.

      • VS
        Posted Aug 9, 2010 at 2:23 PM | Permalink

        Ah, so you actually meant ‘power’ (ie. 1-beta), not ‘size’ (ie. alpha). Then the sentence makes sense.

        That *is* a typo 😉

    • VS
      Posted Aug 9, 2010 at 2:17 PM | Permalink

      I don’t think it’s a typo, although the wording is a bit ambiguous.

      The ‘size’ of a test is the false rejection rate of a true null hypothesis, or the significance level if you will (usually 1%, 5% or 10%).

      I presume that what the authors mean here is that the actual size approaches the chosen significance level (ie. nominal size) as the number of observations increases.

      However, this sentence is indeed a bit strange:

      “The VF05 statistic, as with all test statistics, has improved size as the sample size grows.”

      First of all, given the above, the term ‘improved size’ is not clearly defined. Second, there are plenty of test statistics that are exact in finite samples, under certain assumption, so I guess ‘all test statistics’ is a bit of a stretch.


      Steve, Ross

      I was also wondering about something else.

      Since you are using monthly data, why doesn’t your specification employ any seasonals (e.g. (1 – L^12) in the lag polynomial)?

      The reason I ask is that I played around with some of the data, and I’m not convinced by the six-AR-term-sledgehammer employed to clear out the residual serial correlation.

      Is it standard practice within CS to just skip the seasonality step? It wouldn’t surprise me.


      Nice paper by the way.

      I don’t agree with the first sentence (the trend stationarity assumption :), but conditional on it, the story is straightforward and clear.
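VS's question about the seasonal factor (1 – L^12) can be made concrete with a small sketch. The monthly series below is synthetic, with an assumed trend, cycle amplitude and noise level, and `seasonal_difference` is a hypothetical helper. It only shows what the operator does to a fixed monthly cycle; it is not the specification used in the paper.

```python
import numpy as np

def seasonal_difference(y, period=12):
    """Apply (1 - L^period): subtract the value observed `period` steps earlier."""
    y = np.asarray(y, dtype=float)
    return y[period:] - y[:-period]

rng = np.random.default_rng(0)
n_months = 360                       # 30 years of monthly data (assumed)
t = np.arange(n_months)
trend_per_month = 0.0015             # assumed slope, illustration only
seasonal_cycle = 0.3 * np.sin(2 * np.pi * t / 12)
noise = rng.normal(0.0, 0.1, n_months)

y = trend_per_month * t + seasonal_cycle + noise
dy = seasonal_difference(y)

# A fixed 12-month cycle cancels exactly; a linear trend becomes a constant
# equal to 12 times the monthly slope.
print(len(dy), dy.mean(), 12 * trend_per_month)
```

Note that differencing also changes the error structure (white noise becomes a seasonal moving average), so it is not a free substitute for modelling the residual autocorrelation with AR terms.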

      • Posted Aug 9, 2010 at 2:57 PM | Permalink

        Yes, the sentence is strange. I would not have added it but for the insistence of a referee who wanted something along the lines of “yeah but maybe if we had a hundred more years of data the results would be totally different”. The point was that 30 years is enough to get significance if the differences really are there. Faced with a 3500 word limit there are sections that get very terse.

        As for trend stationarity, these data are at the cusp. You could make a case for nonstationarity, or for trend stationarity, but it would be debatable either way.

        • VS
          Posted Aug 10, 2010 at 3:46 AM | Permalink

          Yes, editorial handwaving does somehow explain the presence of the sentence 🙂

          I mean, as sample size increases, the distinguishing power of the test (ie. 1-beta, ie. 1-Type II error prob, ie. the ‘probability’ of rejecting a false null) increases. That means that, if you had 100 more years of data, I would say you are in fact *more* likely to reject the null that the difference in slopes is equal to 0, given that you have already rejected it with fewer data points.

          In other words.. he should have pointed that out to Santer et al, before pointing it out to you guys 😉

          PS. W.r.t. the MC simulations cited: 350+ obs and 100 obs is a pretty big difference.


          As for non-stationarity, I agree with you that with these data both cases can be made, with the problem here being the seasonality component. This makes the test equations very sensitive to the AR structures employed, because the error component has to be white noise for the ADF (and related) tests to function properly (and the PP-based tests have their own issues).

          Whether or not I reject the unit root on the Had data for example, depends largely on the lags included in the test equation.

          In any case, given this application and the context of this paper, I don’t think it’s very crucial. This especially holds given the approach of the Santer et al paper you are responding to.

          I simply added the nonstationarity comment as a ‘ceterum censeo..’ 😉
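The size/power distinction discussed in this sub-thread, and VS's remark that 350+ observations versus 100 is a big difference, can be sketched with a small Monte Carlo experiment. Everything here is assumed for illustration: white-noise errors, the slope under the alternative, the noise level, and the hypothetical function `rejection_rate`. The real series are autocorrelated, which is why the paper uses AR terms and the VF statistic, so this is not a reproduction of its simulations.

```python
import numpy as np

def rejection_rate(n_obs, true_slope, n_sims=2000, sigma=1.0, crit=1.96, seed=0):
    """Share of simulations in which an OLS t-test rejects H0: slope = 0.

    With true_slope == 0 this estimates the SIZE of the test;
    with true_slope != 0 it estimates the POWER.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_obs, dtype=float)
    t_centered = t - t.mean()
    sxx = np.sum(t_centered ** 2)
    rejections = 0
    for _ in range(n_sims):
        y = true_slope * t + rng.normal(0.0, sigma, n_obs)
        slope = np.sum(t_centered * y) / sxx
        intercept = y.mean() - slope * t.mean()
        resid = y - intercept - slope * t
        s2 = np.sum(resid ** 2) / (n_obs - 2)      # residual variance
        t_stat = slope / np.sqrt(s2 / sxx)
        if abs(t_stat) > crit:                      # normal critical value (approx.)
            rejections += 1
    return rejections / n_sims

assumed_slope = 0.005   # illustration only
size_100 = rejection_rate(100, 0.0)
size_360 = rejection_rate(360, 0.0)
power_100 = rejection_rate(100, assumed_slope)
power_360 = rejection_rate(360, assumed_slope)
print(f"size : n=100 {size_100:.3f}, n=360 {size_360:.3f}")
print(f"power: n=100 {power_100:.3f}, n=360 {power_360:.3f}")
```

Under these assumptions the rejection rate under the null stays near the nominal 5% at both sample sizes, while the rejection rate under the alternative rises sharply with more observations. That is the sense in which a longer record makes it more, not less, likely that a real difference is detected.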

  19. Steve McIntyre
    Posted Aug 9, 2010 at 2:01 PM | Permalink

    Ross is far more deferential to the contribution of reviewers than I am.

    While Ross’ application of an econometric method to the present situation was nicely done and has resulted in an article that is of methodological interest, it’s absurd that Santer critics should be required to go to such extremes to make the simple point that Santer’s claims did not hold up with updated data using his method – the point made in our 2009 submissions.

    It’s a fluke that Ross was capable of doing something original with the methodology and thereby circumvent the review blockade at IJC. The IJC reviews were totally unhelpful and totally useless.

    I think that IJC had an obligation to ensure that problems in the Santer results were reported once they became aware of the problems. They failed to do so.

    • Posted Aug 9, 2010 at 2:42 PM | Permalink

      Pat Michaels claimed he has had 4 manuscripts rejected since Climategate; he also stated that Spencer is having a difficult time. Others I know of aren’t exactly flying through. It’s like they don’t really like outsiders.

    • Dave Andrews
      Posted Aug 9, 2010 at 2:57 PM | Permalink

      So one of the IJC reviewers (Thorne) of your paper was a co-author of the original Santer paper. How can that be? I can understand that the original authors should be allowed to submit a rebuttal to the criticism, but to be allowed to act as a gatekeeper as well says there is something drastically wrong in the world of scientific peer review!

      Steve: I think that it was Thorne, but am not certain of this. This issue arose in the Climategate emails, e.g. Jones “going to town” on articles that had the temerity to criticise CRU. Muir Russell didn’t even bother reporting on this well-publicized incident. I talked to a philosophy professor friend of mine, who says that conflicted reviews are not permitted in philosophy journals (whether friend or foe).

      • Hu McCulloch
        Posted Aug 9, 2010 at 3:24 PM | Permalink

        Santer et al had 17 co-authors, so it must have been hard for the editor to find someone outside this set. 😉

    • michel
      Posted Aug 10, 2010 at 2:37 AM | Permalink

      Yes, without reading the reviews one cannot be sure, but as described the process seems to have become one where the reviewers have to be satisfied that the paper is right, for it to be worthy of publication, rather than, as the peer review process is intended to assure, simply being worth reading carefully and thinking about. You can see how this would happen, but once it starts being the first, we are in practice in the world of the censorship of non-consensus argument.

  20. Steve McIntyre
    Posted Aug 9, 2010 at 2:43 PM | Permalink

    Interested readers should also check in with Chad at treesforforest and send him your regards. As I mentioned above, Chad has done some excellent work with the models – this collection of data needs to be publicized and used by others.

  21. scientist
    Posted Aug 9, 2010 at 3:54 PM | Permalink

    Steve: good job getting something written up. Makes it much easier to sink teeth into.

    All: Before indicting peer review, I would remind you that it is incredibly common for writers to complain about peer review and unfairness and while it’s not ALWAYS the writers at fault, my experience is it’s more often than not. Also, you really can’t judge the situation without seeing the previous drafts, the reviews themselves, etc. You’re just getting a limited amount of info and from only one side. Many scientists will show pre-prints of papers that are in review, but (despite the very frequent communications inherent in a blog), the 3 Ms don’t seem to do so. (Ross does some and to his credit, and the drafts aren’t that strong at early stages, see WUWT for recent example.)

    P.s. Cross-posted at Deep Climate in case Steve snips this (while leaving the other “peer review complainers” up.)

    • Patagon
      Posted Aug 9, 2010 at 4:06 PM | Permalink

      Re: scientist (Aug 9 15:54),

      A good solution would be to make reviews (not reviewers) public, like in the EGU journals. In that way any unfair treatment is at least evident. Another problem with reviews (which I have seen personally) is that objections (even flagrant errors) are ignored by the editor if the paper goes with the editorial line or the authors are well known in that field.

      • Patagon
        Posted Aug 9, 2010 at 4:23 PM | Permalink

        Re: scientist (Aug 9 16:17),

        P.s. Any reason my posts are going into moderation now?

        Relax, mine too. Everyone’s, I guess; only the info wasn’t shown to the author before.

      • scientist
        Posted Aug 9, 2010 at 5:58 PM | Permalink

        snip – I’m aware of Climate of the Past policies and have reviewed for them.

        • MrPete
          Posted Aug 9, 2010 at 6:41 PM | Permalink

          Re: scientist (Aug 9 17:58),
          Naaaah…. no idea what the auto-moderation system here is doing. I just now looked (sorry, busy day!) Found a few posts that were held up, released ’em.

          Steve: TCO’s surge of posts has resulted in me moderating.

    • Craig Loehle
      Posted Aug 9, 2010 at 5:00 PM | Permalink

      After getting published a lot, I can detect a difference. For non-sceptic non-climate papers across a variety of fields, I have had papers accepted with no changes, with changes, and rejected. The rejects are due to different POV of reviewers, misunderstanding what I did, and failure on my part to communicate clearly. All this is par for the course.
      On sceptic papers, the comments are nasty, short, and do not show that what I did was wrong. Things like “you can’t do this type of analysis” and “this should never see the light of day” and “this topic is of no interest” (the last was about my non-treering reconstruction, though Mann keeps getting his published in spite of the lack of interest). Real mean spirited and sometimes foaming at the mouth angry.
      It is particularly disturbing that papers showing major flaws in consensus work can’t penetrate the fortress.

      • scientist
        Posted Aug 9, 2010 at 6:09 PM | Permalink

        snip – nothing to do with this topic

  22. Posted Aug 9, 2010 at 4:34 PM | Permalink

    Yep. And as you can see, the requirements for skeptics to publish are much more stringent than for the consensus crowd.

    • scientist
      Posted Aug 9, 2010 at 4:45 PM | Permalink

      Sonic, 1634:

      No…I can’t see that. I would need to see the draft papers and reviews from several sides to discern that. And there is a long history of skeptics doing poor work and complaining about the man keeping them down. It’s possible that skeptics have a tougher time in review. I just can’t tell for sure.

      Actually I speculate that they do have a mildly tougher time in review. Just human nature being what it is!

      But I also think they do a much worse job on draft papers (just look at the average quality of the few drafts we do see, for instance Eschenbach’s stuff). And I think given the hurdles they face in review, the complications of the science itself, that they are trying to correct previous published work, etc. that they should really take a strain to do bee-yoo-tee-ful draft papers.

      Anyway…they got a byline on this one. Kudos and uncork the champagne!

      • steven Mosher
        Posted Aug 9, 2010 at 11:57 PM | Permalink

        Since you don’t get to see many drafts of believers (recall what the inquiries said), you don’t have any basis to make a comparison.

        It’s pointless to speculate about what the process looks like in general,
        for either side. Pointless and not verifiable.

        The simple question is this.

        If You replicated Santer’s method and brought the data up to date,
        AND IF, that reversed Santer’s findings

        Should that be published?

        The current state of climate science journals probably says no. Partly because they are infected with some of the same stupid notions that you have. We’ve discussed this before.

        • per
          Posted Aug 10, 2010 at 1:34 AM | Permalink

          First of all, my congratulations to those who got published. Maybe we will see as many comments as last time 🙂

          Second, to interact with the issue that it is difficult to publish a paper that simply says a previous paper is wrong. There can be a principled argument for this; on the grounds that so much of the scientific literature is wrong, editors would soon end up publishing a lot of work, and frequently unworthy or boring work, solely on the basis that it shows something else published in the journal is wrong.

          I think you have here a strategy. Not only have you elaborated on a problem (the previous paper is wrong), but by undertaking a little bit of methodology, or sensitivity analysis of the methodology, you make a clearer contribution and novelty.

          It makes it much easier to get into productive discussion with editors.

          my congratulations

  23. scientist
    Posted Aug 9, 2010 at 4:41 PM | Permalink

    See my 1554 attaboy.

  24. Posted Aug 9, 2010 at 5:14 PM | Permalink

    My heartfelt congratulations to Ross and Steve and Chad on their significant contribution. This is a paper that should get a breakthrough and deserves to send their citation counts exponential.

    I had a realization that the name “peer review” is probably misleading everyone. In actuality the dynamic is more like a job interview. The ‘peers’ are frequently the ‘gatekeepers’ and they need to feel comfortable that their ‘hire’ will not embarrass them in any way.

  25. Ian
    Posted Aug 9, 2010 at 6:07 PM | Permalink

    Minor point: you refer to “up-to-data” at one point, when obviously you mean “up-to-date data”. Fix it and I can say for all time that I corrected one of Steve McIntyre’s articles on climate change 😉

  26. scientist
    Posted Aug 9, 2010 at 6:09 PM | Permalink

    snip – your idea of substantive and mine are different. I take no issue with technical comments. however, when this turns into whinging, as it has too often in the past, I will moderate after the fact and have started doing so.

  27. Patagon
    Posted Aug 9, 2010 at 6:28 PM | Permalink

    Re Nono

    Posted Aug 9, 2010 at 4:00 PM | Permalink | Reply
    I think that your assumption is correct. And unless I’m completely mistaken, the closest thing that can be compared with Fig. 3 of MM2010 would be Fig. 3A of Santer08:

    After all the discussion I spent some time on figure 6 of that Santer’s paper.

    Homogenized radiosounding values are way off other sounding data. Even homogenized soundings are quite discrepant among themselves (RAOBCORE vs RICH), but the difference is even larger between homogenized and Ratpac or Hadat2. That really raises some questions about the spatial homogenization process (correlation with neighbours as in RICH) or the use of ERA reanalysis as a guide. There are known issues with instrument changes. Vaisala is a major provider and there are good and thorough studies about the radiative heating on some of their older sensors, especially at the daylight sounding (12Z or 00Z depending on where you are).

    I wonder if there is enough metadata about sensor use and sensor change dates to do a better homogenization.

    One thing that strikes me is that, depending on the RAOBCORE version you choose, the correction dates fall in different years, which does not make any physical sense.

    It would be nice to get a good solution to this because those soundings could be a valuable data set, even if they weren’t intended for climate studies.

  28. Steve E
    Posted Aug 9, 2010 at 7:39 PM | Permalink


    I’m not qualified to comment on the science, but I understand communication and I know mole signs when I see them. I’ve been reading these threads for the past several days.

    Scientist’s role seems to be: engage on a reasonable basis; create the quotable where possible; run when colours run true. Cross posting at Deep Climate…puhlease. That’s the perfect neutral arbitrator for this discussion /sarcasm.

    You have the patience of Job to indulge the self indulgent!

  29. Robert
    Posted Aug 9, 2010 at 8:35 PM | Permalink

    snip -OT

  30. Craig Loehle
    Posted Aug 9, 2010 at 8:42 PM | Permalink

    The average paper in science is…well, average. And flawed. Even the good ones have flaws or things that could have been done better with more time and intelligence. In ecology, some of the most lauded and cited papers of the 1970s turned out to be completely wrong. They were eloquently written, clever, and told a good story, so people were not critical of them and just took them to be true. So getting praise is not necessarily a measure of truth.
    In the climate case what is happening is that the politics has led to a reaction that any criticism of the science is perforce a criticism of policy or means you don’t want to save the world. Criticism of the existing science that in a non-contentious field would be considered an advance is viewed as a hostile move by oil funded deniers. The climategate emails show this in spades. If it was an airplane being built, I think someone would want to know that a critic has found an aerodynamic instability or metal fatigue. Instead, there is resistance to any hint that mistakes have been made — “UHI is unimportant” when it has been shown that it is, is just a simple example. The Team will never ever admit they are wrong, and they have power in many of the key journals. It is far from a level playing field. Demanding that critics write more elegant papers than anyone else may be what is required to get published, but often that is not even enough.

  31. pete
    Posted Aug 9, 2010 at 9:06 PM | Permalink

    While detrended climate model projections may be uncorrelated with observations, the assumption of no covariance among trend coefficients implies models have no low-frequency correspondence with observations in response to observed forcings, which seems overly pessimistic.

    I hate to ruin such a finely crafted piece of sarcasm, but aren’t you confusing ‘trend coefficients’ with ‘estimates of trend coefficients’?

  32. ZT
    Posted Aug 9, 2010 at 11:34 PM | Permalink

    Nice job gentlemen. Let’s hope that a tortuous riposte, with obscure terminology, and a few mentions of forcings, is not being concocted in some dark corner of the interweb. Oxburgh’s “If your experiment needs statistics, you ought to have done a better experiment.” Kelly should receive a copy.

  33. EdeF
    Posted Aug 10, 2010 at 1:19 AM | Permalink

    It is interesting to note in Fig 1 in the report that the starting and ending points of the model ensemble and the instrument data are nearly the same as are
    the overall trends, top graph looks like about 0.012 deg C/decade. The models seem to be out of phase with the measured data. Without the mega dips in 1983 and 1992 the models would be a lot closer to measured data. Why the huge dips,
    PDO or 11 year solar cycle? They seem to have overdone the cooling in those
    years. Straight line from start to end works better?

    Steve: Model dips are from volcanoes – many 20CEN models use real volcano to parameterize aerosols.

  34. ianl8888
    Posted Aug 10, 2010 at 3:32 AM | Permalink

    James Annan today:

    “It’s the same sorry old tale of someone comparing an ensemble of models to data, but doing so by checking whether the observations match the ensemble mean.”

    He’s of the view that comparing the mean of the model ensemble with real data is of no value – he doesn’t state of what use the ensemble mean may actually be, then

  35. Adam Gallon
    Posted Aug 10, 2010 at 6:04 AM | Permalink

    Over to the modellers. Your models aren’t good enough, now improve them.

  36. PaulM
    Posted Aug 10, 2010 at 10:43 AM | Permalink

    It is unfortunate that the paper does not explain clearly the meaning of the ‘bars’ on figs 2 and 3, either in the caption or the text. How are they defined for (a) the models and (b) the observations? This has caused considerable confusion both here and on Annan’s blog. I hope this can be clarified before the paper goes to press. Bear in mind that most casual readers will just focus on these figures (as this blog post does).

  37. C Monster
    Posted Aug 10, 2010 at 9:40 PM | Permalink


  38. Posted Aug 10, 2010 at 9:47 PM | Permalink

    Wouldn’t this method of multi-model comparison be more appropriate to the task?

    Chad Herman using a multi-model comparison method to check skill in snow cover:

    Similar method used by Zeke Hausfather (scroll down)

  39. Paul-in-CT
    Posted Aug 10, 2010 at 9:54 PM | Permalink

    If it is the case that many climate scientists don’t follow the blogs, and therefore are not aware of much of what has gone on, say, here, I suppose that with each new publication like this one we can expect the pool of potential future referees to approach their reviewing duties with an incrementally more balanced perspective, due to their reading the PR Literchur.

    That is, whatever the importance of this particular paper on its merits, its mere publication may turn out to have much broader significance.

    So we got that going for us, at least I’m a hopin’.

  40. Posted Aug 13, 2010 at 9:30 AM | Permalink

    Steve, there’s something I don’t understand. The trends of UAH/RSS are given in the paper

    as 0.070/0.157 °C/decade in table 1
    as 0.079/0.159 °C/decade in table 2

    The trends are estimated using different techniques, so I’ll accept this results in the differences seen in the tables and figure 2. However, a simple linear regression of monthly data yields 0.138/0.162 °C/decade. Why is the UAH trend only about half its linear trend, whereas RSS is roughly the same?
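For readers who want to repeat the "simple linear regression of monthly data" mentioned in the comment above, a minimal sketch follows. The series is synthetic, with an assumed trend and noise level, and is not the UAH or RSS data; `trend_per_decade` is a hypothetical helper. It computes an ordinary least-squares slope only and makes no attempt to reproduce the estimators behind the paper's Tables 1 and 2, so it does not by itself explain the gap the commenter asks about.

```python
import numpy as np

def trend_per_decade(monthly_values):
    """OLS slope of a monthly series, converted from per-month to per-decade."""
    y = np.asarray(monthly_values, dtype=float)
    months = np.arange(y.size)
    slope_per_month, _intercept = np.polyfit(months, y, 1)  # highest degree first
    return slope_per_month * 120.0                          # 120 months per decade

# Synthetic anomalies: assumed 0.15 degrees C/decade trend plus white noise.
rng = np.random.default_rng(1)
n_months = 372                                  # 31 years, assumed span
true_trend_per_decade = 0.15
months = np.arange(n_months)
clean = true_trend_per_decade / 120.0 * months
noisy = clean + rng.normal(0.0, 0.15, n_months)

print(trend_per_decade(clean), trend_per_decade(noisy))
```

Running the same function on the actual monthly series, over the same start and end months used in the paper, would be the first check when comparing against the tabulated values.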

  41. dahuang
    Posted Sep 20, 2010 at 8:27 AM | Permalink

    McKitrick et al (2010) has already been published online by ASL at WILEY (Early View, Article first published online: 17 SEP 2010). The URL is, and DOI: 10.1002/asl.290. A PDF copy of the article could be downloaded at

  42. Bob Koss
    Posted Dec 2, 2012 at 11:52 PM | Permalink

    I see Santer finally agrees the models run warm. Naturally he makes no reference to this having been demonstrated two years ago in the paper by McKitrick, McIntyre and Herman.

    • Posted Dec 3, 2012 at 1:21 PM | Permalink

      When you’re in the 97% you have no need to mention the 3% that have proved you wrong. That would give absolutely the wrong impression. Models may indeed have run warm but those who are out in the cold, who have been so rude as to prove it before you wished, shall be held at all times at the same temperature, unmentionable. And that’s consensus science.

