Conflicted Reviewers Distort Literature

The comments by James Annan and his reviewers here on McKitrick et al (2010) demonstrate very nicely how the literature gets distorted by the rejection of a simple comment showing that the application of Santer’s own method to updated data resulted in failure on key statistics. Annan and his commenters are worrying about the novelty of the method and accusing us of being subject to the same criticisms that Santer made of Douglass.

The statistical apparatus of MMH10 is used in econometrics and is not “novel” in that sense. But it is unfamiliar to climate science readers and it’s entirely reasonable for them to wonder whether there is some catch to the new method. It’s a question that I would ask in their shoes.
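
For readers unfamiliar with that apparatus, here is a heavily simplified sketch of the underlying idea (fit trends with autocorrelation-robust standard errors and test their difference). It uses synthetic data and ordinary Newey-West errors; it is not the MMH10 panel/Vogelsang-Franses estimator.

```python
# A heavily simplified sketch of a trend-equality test with autocorrelation-robust
# standard errors. Synthetic data and ordinary Newey-West (HAC) errors only; this
# is NOT the MMH10 panel / Vogelsang-Franses estimator.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
t = np.arange(360)                      # 30 years of monthly data (hypothetical)
X = sm.add_constant(t)

def trend_and_se(y, lags=12):
    """OLS trend with a Newey-West (HAC) standard error."""
    res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": lags})
    return res.params[1], res.bse[1]

# Hypothetical series: an 'ensemble mean' warming faster than 'observations'
ens = 0.20 / 120 * t + rng.normal(0, 0.10, t.size)   # ~0.20 C/decade
obs = 0.10 / 120 * t + rng.normal(0, 0.15, t.size)   # ~0.10 C/decade

b_ens, se_ens = trend_and_se(ens)
b_obs, se_obs = trend_and_se(obs)

# Treat the two estimates as independent: a simplification; MMH10 instead model
# the cross-correlations explicitly in a multivariate panel framework.
z = (b_ens - b_obs) / np.hypot(se_ens, se_obs)
p = 2 * (1 - norm.cdf(abs(z)))
print(f"trend difference z = {z:.2f}, two-sided p = {p:.3f}")
```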

That Annan and his commenters should be in a position to make such a comment shows how the IJC reviewers and editor Glenn McGregor have succeeded in poisoning the well by rejecting a simple comment showing that key Santer results fail with updated data (our comment is on arxiv here).

In our rejected IJC comment, we used Santer’s exact methodology. Nonetheless, Annan makes the following accusation:

Tuesday, August 10, 2010
How not to compare models to data part eleventy-nine…

Not to beat the old dark smear in the road where the horse used to be, but…

A commenter pointed me towards this which has apparently been accepted for publication in ASL. It’s the same sorry old tale of someone comparing an ensemble of models to data, but doing so by checking whether the observations match the ensemble mean.

Well, duh. Of course the obs don’t match the ensemble mean. Even the models don’t match the ensemble mean – and this difference will frequently be statistically significant (depending on how much data you use). Is anyone seriously going to argue on the basis of this that the models don’t predict their own behaviour? If not, why on Earth should it be considered a meaningful test of how well the models simulate reality?

Later a commenter writes:

1) Didn’t Santer address this point? e.g. “application of an inappropriate statistical ‘consistency test’.” Perhaps you’re right that by adding all the extra bits to the paper, they made it so that an idiot might not realize the elementary nature of the most important error, and we need to keep in mind that there are many idiots out there, but…

To which Annan responds:

I haven’t got Santer to hand (and am about to go travelling, so am not going to go looking for it) so I will take your word for it. In which case this new paper is pretty ridiculous. Well, it’s ridiculous anyway.

The observation that key Santer results do not hold up with more recent data is not “ridiculous”. It holds when Santer’s own methodology is applied to the updated data. The only reason that this information is not in the “literature” is that IJC editor Glenn McGregor did not feel that, as an IJC journal editor, he had any responsibility to place a rebuttal of Santer results in the literature, and appears to have permitted reviewers with conflicts to determine the outcome of the rebuttal.

But now the debate is muddied because it is entangled with understanding a different methodology.

The people to blame for the muddying of the debate are McGregor and the IJC reviewers who rejected our simple comment.

If, as seems likely, the most adverse reviewer was Santer coauthor Peter Thorne of the UK Met Office, then Thorne would bear most of the responsibility for Annan and his readers being unaware of this result. Thorne wrote to Phil Jones on May 12, 2009 (there had been no CA discussion to that point, and the decision had been issued on May 1, 2009) as follows:

Mr. Fraudit never goes away does he? How often has he been told that we don’t have permission? Ho hum. Oh, I heard that fraudit’s Santer et al comment got rejected. That’ll brighten your day at least a teensy bit?

This represents the attitude of the climate science peer reviewers who tied up our Santer comment at IJC.

Our comment didn’t do anything novel or fancy. It simply applied Santer’s methodology to updated data and showed that key results no longer held up. As noted yesterday, the comment was rejected. One reviewer’s principal complaint proved to be not with our results, but with Santer’s methodology. (It looks like this reviewer was Peter Thorne, who, ironically, was one of the Santer coauthors.)

The authors should read Santer et al. 2005 and utilise this diagnostic. It is a pity that Douglass et al took us down this interesting cul-de-sac and that Santer et al 2008 did not address it but rather chose to perpetuate it. The authors could reverse this descent away to meaningless arguments very simply by noting that the constrained aspect within all of the models is the ratio of changes and that therefore it is this aspect of real-world behaviour that we should be investigating, and then performing the analysis based upon these ratios in the models and the observations.
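
For concreteness, here is a minimal sketch of two readings of that suggestion, on synthetic monthly series with hypothetical numbers: the amplification ratio of the tropospheric trend to the surface trend, and the trend of the difference series that comes up in the comments below.

```python
# A minimal sketch (synthetic monthly series, hypothetical numbers) of two readings
# of the "ratio of changes" idea: the amplification ratio of tropospheric to surface
# trends, and the trend of the difference series discussed in the comments below.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(360)
surf = 0.10 / 120 * t + rng.normal(0, 0.10, t.size)   # hypothetical surface series
trop = 0.14 / 120 * t + rng.normal(0, 0.12, t.size)   # hypothetical tropospheric series

def ols_trend(y):
    # slope of a simple least-squares fit, in degrees C per month
    return np.polyfit(t, y, 1)[0]

b_surf, b_trop = ols_trend(surf), ols_trend(trop)
print("amplification ratio (trop/surf trends):", round(b_trop / b_surf, 2))
print("trend of the difference series (trop - surf), C/decade:",
      round(ols_trend(trop - surf) * 120, 3))
```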

49 Comments

  1. David S
    Posted Aug 10, 2010 at 8:52 AM | Permalink

    Did you have the chance while you were in the UK to get a lawyer to have a look at Thorne’s behaviour? We have some very generous libel laws here.

  2. Posted Aug 10, 2010 at 9:09 AM | Permalink

    On this thread over at Annan’s, Chip Knappenberger provided an excellent summary of the situation:

    Chip Knappenberger said…
    Am I starting to sense a pattern here?
    When we apply a test to the models that shows they aren’t perfect, folks get grumpy about the methodology.
    But when that very same methodology is applied and shows that models seem to be working fine, then all is well.
    Lucia is simply applying the already-published Santer17 methodology. And the work behind the Heartland presentation is a modification of the already-published Knight et al. methodology.
    Where were all these concerns when those papers were published?
    -Chip

  3. Kenneth Fritsch
    Posted Aug 10, 2010 at 10:01 AM | Permalink

    Some of these climate scientists’ comments are hard for me to decipher (care to comment TCO), but I take what Steve M has presented above to mean that some scientists and others commenting are objecting to the use of a mean model result to compare to a mean observed result. It would thus appear that they would suggest we go back to comparing a range of model results to the observed results and assume no difference if the range of model results encompasses the observed results.

    Perhaps we need a reasonable a priori selection criterion for “validating” those model results we finally use for the comparison. Actually, without such a criterion for selecting valid model results we have a situation where we can only say that the range of model results is so wide as to be meaningless in making predictions or even hindcasting. Without someone providing a selection criterion, I cannot see a better comparison than using the mean of the ensemble means. Using a range of results for comparison is the same to me as admitting that with enough models one can provide just about any desired result – and that is not a definition of “robust”.
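
A minimal sketch of this range-versus-mean distinction, using hypothetical trend numbers (deg C/decade), may make the contrast concrete.

```python
# A minimal sketch (hypothetical trend numbers, deg C/decade) of the two very
# different checks at issue: "does the observed trend fall inside the spread of
# runs?" versus "is it consistent with the ensemble mean?"
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
run_trends = rng.normal(0.25, 0.08, 23)   # 23 hypothetical model-run trends
obs_trend = 0.12                          # hypothetical observed trend

inside_range = run_trends.min() <= obs_trend <= run_trends.max()
t_stat, p_val = stats.ttest_1samp(run_trends, obs_trend)

print("observed trend inside the model range:", inside_range)
print(f"test against the ensemble mean: t = {t_stat:.2f}, p = {p_val:.4f}")
# A wide enough spread of runs 'passes' the range check almost by construction,
# which is the point made above about that comparison being uninformative.
# (This sketch also treats the observed trend as exact, which it is not.)
```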

    • j ferguson
      Posted Aug 10, 2010 at 10:32 AM | Permalink

      Do they take ALL the models, or do they cherry pick them? Obviously there must be spurious or clearly wild ones, so not them, but what about the rest?

    • Posted Aug 10, 2010 at 9:02 PM | Permalink

      I don’t see the value of obtaining the mean of model runs. Clearly only ONE model or run can be correct, so by lumping them all together you’re knowingly adding bogus data and expecting something meaningful.

  4. Bernie
    Posted Aug 10, 2010 at 11:30 AM | Permalink

    As somebody pointed out at Annan’s site, 17 climate science worthies signed off on Santer (2008). That one of them should then critique the method is wonderfully ironic and raises other issues about the peer review process.

  5. scientist
    Posted Aug 10, 2010 at 11:56 AM | Permalink

    You are responsible for the methods you use in your paper. Saying that the reviewer made you use an incorrect method is bunk. It’s your byline. You are responsible.

    Your error bars for “the models” are unbelievably tight. You should have had enough physical intuition to know that did not make sense. That error is a real boner.

    If the error bars, correctly calculated, would still be different from observations, then you haven’t SHOWN it in your paper.

    Why not show the test both ways, if there was a debate about how to make the error bars?

    What if the reviewer who told you to put the error bars down the wrong way was Douglass? What if a reviewer told you to put something down that said 2 plus 2 is 5? Would you?

    Why is the correct method not shown in your SI?

    You messed up, Steve. And didn’t have the balls to let me criticize your paper last night. Crossposted at AMAC to display your censoring.

    Steve: if you want to post your comments elsewhere, you’re welcome to do so. However, I’ll moderate out comments that include this sort of taunt.

    Your comment here is stupid. I do not agree that the method was “incorrect”, only that it was different from and better than Santer’s. However, the introduction of this method to climate science should have taken place in a context where people were already familiar with the impact of the new data using Santer’s own methodology. For people who want to “disaggregate”, I suggest that they read our arxiv paper as well.

    • scientist
      Posted Aug 10, 2010 at 2:17 PM | Permalink

      snip
      Oh…and if you group RSS and UAH together, the error bars of “the satellites” are pretty big too and overlap “the models”…

    • PhilH
      Posted Aug 10, 2010 at 5:53 PM | Permalink

      “And didn’t have the balls to let me criticize your paper last night.”

      I am sorry, but I find this kind of comment highly offensive. You need to clean up your act or move on elsewhere.

    • kan
      Posted Aug 10, 2010 at 7:56 PM | Permalink

      “And didn’t have the balls to let me criticize your paper last night. ”

      Grow up, little anonymous one.

    • Skip Smith
      Posted Aug 11, 2010 at 6:55 AM | Permalink

      >>”You are responsible for the methods you use in your paper. Saying that the reviewer made you use an incorrect method is bunk. It’s your byline. You are responsible.”<<

      Yet over at Annan's I notice you gave Gavin a free pass when he said the reviewers told them not to use the updated data.

  6. JamesG
    Posted Aug 10, 2010 at 12:15 PM | Permalink

    As Annan’s wife has now produced a paper utilising a Bayesian technique for comparison – the only one imo that is sensible apart from straightforward % error – he does have an interest in deriding others’ work. Quite why he’d avoid deriding Santer for using a totally inappropriate frequentist method on model outputs is not explicable, except that they are on the same team. And there we have it – you have to be in with the in crowd and accept the wrong but useful paradigm.

    One question I’d have with the econometrics based assessment is “how did it do in predicting the current economic slump?”. Presumably that’s the best test of the method, and what it was intended for.

  7. Posted Aug 10, 2010 at 12:49 PM | Permalink

    that the constrained aspect within all of the models is the ratio of changes and that therefore it is this aspect of real-world behaviour that we should be investigating, and then performing the analysis based upon these ratios in the models and the observations.

    Can anyone explain what “is the ratio of changes” means?

  8. Posted Aug 10, 2010 at 12:54 PM | Permalink

    James writes this –

    Even the models don’t match the ensemble mean – and this difference will frequently be statistically significant (depending on how much data you use).

    But many individual model runs do match the ensemble mean in the way these tests are being applied to test the mean. No one is just testing to see if the observed mean exactly matches the ensemble mean. Of course it doesn’t do that – and no one says that is remarkable.

    • TAG
      Posted Aug 10, 2010 at 1:13 PM | Permalink

      Even I understand this, so why can’t PhD climate scientists?

    • Steve McIntyre
      Posted Aug 10, 2010 at 1:46 PM | Permalink

      Santer made precisely this sort of test as we discussed back in 2008.

      If Gavin and others think that this sort of test is wrong, then they should have criticized Santer et al 2008.

    • EdeF
      Posted Aug 10, 2010 at 3:32 PM | Permalink

      Which model results do they wish to not use?

  9. TAG
    Posted Aug 10, 2010 at 1:11 PM | Permalink

    Well, duh. Of course the obs don’t match the ensemble mean. Even the models don’t match the ensemble mean – and this difference will frequently be statistically significant (depending on how much data you use). Is anyone seriously going to argue on the basis of this that the models don’t predict their own behaviour? If not, why on Earth should it be considered a meaningful test of how well the models simulate reality?

    Annan’s statements brings out two questions

    a) If models cannot be matched to observations, then how are models validated? If they can be matched to past temperatures but fail to predict future temperatures, does this not mean something about their internal validity as models of reality?

    and

    b) if models cannot be matched to observations, then of what use are models?

    These questions are not meant as a form of sarcasm. If models are not some approximation of reality as represented by the observations then why are they created?

    • JamesG
      Posted Aug 10, 2010 at 1:33 PM | Permalink

      Well, I can answer that. He is using the finding that model ensembles can often be more accurate than individual model runs. Believe it or not, there is evidence for this where uncertainties in non-linear systems are high and the models are individually poor, though not yet in climate science – because they rarely care to use other people’s tried and tested methods. Usually, traditionally, you still have to eliminate the really bad models, and that can only truly be done with real data. This is the crux: they get the basic idea but invent their own ad hoc methods.

      Annan released a paper recently that also got rid of most of the outlier models, leaving just a handful to be relied on. Unfortunately the comparison with real data step was not addressed, as it was just a proof of concept to get the method accepted. On the way though he is, step-by-step and falteringly, trying to dispel some sacred cows in climate science. Trouble is, if you aren’t James Annan, then you are automatically wrong, whoever you might be and however sensible, logical, textbook and/or mathematically correct your argument may appear to you. It becomes quite farcical.

  10. Wijnand
    Posted Aug 10, 2010 at 3:15 PM | Permalink

    Is this Annan the same guy that said in an interview:
    “9) What advice would you give to young science students trying to plan their careers?

    Pick a famous supervisor who will get you some early papers in one of the glamour mags.”

    http://scienceblogs.com/principles/2009/08/pnas_james_annan_climate_chang.php

    • ZT
      Posted Aug 10, 2010 at 3:55 PM | Permalink

      …good to know that the spirit of scientific inquisitiveness is alive and well in this individual.

      • SteveF
        Posted Aug 10, 2010 at 5:57 PM | Permalink

        I think you’re rather missing the sardonic tone of that comment.

  11. Kenneth Fritsch
    Posted Aug 10, 2010 at 5:37 PM | Permalink

    Lucia, when you ask: “Can anyone explain what “is the ratio of changes” means?” I think this might refer to the use of difference trends between the surface and troposphere temperatures. Santer et al. used differences only as an apparent afterthought and found that, by gosh, a good deal of the variability went away. I think the paper that MM submitted, and which was rejected, used differences.

    I was going to ask MMH if, in effect, their panel regression methods deal with differences between the surface and troposphere as I think they do.

  12. Kenneth Fritsch
    Posted Aug 10, 2010 at 5:47 PM | Permalink

    “But many individual model runs do match the ensemble mean in the way these tests are being applied to test the mean. No one is just testing to see if the observed mean exactly matches the ensemble mean. Of course it doesn’t do that – and no one says that is remarkable.”

    Lucia, if your methods for comparing observed and model results assume a normal distribution of model results, then with a distribution of individual model results in which a high percentage show a significant difference, one would no longer assume a normal distribution. I think the method used in MMH 2010 does not assume or require a normal distribution of model results.

    • Posted Aug 10, 2010 at 7:32 PM | Permalink

      Re: Kenneth Fritsch (Aug 10 17:47),
      “the method used in MMH 2010 does not assume or require a normal distribution of model results”
      Indeed, by the simple expedient of assuming that there is only one set of model results for the analysis. No distribution needed. The errors of the trend calculations are taken into account, but not the scatter of the models. In the end, you get results only about that set of runs.

  13. Posted Aug 10, 2010 at 5:56 PM | Permalink

    Ken

    Lucia, if your methods for comparing observed and model results assume a normal distribution of model results, then with a distribution of individual model results in which a high percentage show a significant difference, one would no longer assume a normal distribution.

    What do you mean? Either metrics computed from the models have a normal distribution around their own mean, or they don’t. A large standard deviation around the mean doesn’t automatically result in non-normality. The distribution can be normal with a large or small standard deviation, or it can be non-normal with a large standard deviation.

    As it happens, the distribution of individual trends around the mean trend in the ensemble doesn’t look far from normal. We don’t have enough runs to get a “fail” on the tests of normality – but even if we would with more runs, a t-test does use the normal distribution, and the results are known to be pretty robust to non-normality. So, it’s a theoretical flaw that is often overlooked because “it doesn’t matter” (unless you are so far from normal that you would be failing the test with very few samples).

    I think the method used in MMH 2010 does not assume or require a normal distribution of model results.

    That’s great. I think the more types of tests we have, the better.
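
A quick Monte Carlo sketch of the robustness point, using a hypothetical skewed population rather than any model output:

```python
# A quick Monte Carlo sketch of the robustness point above: the type-I error rate
# of a one-sample t-test stays in the neighbourhood of the nominal 5% even for a
# noticeably skewed population. Hypothetical distribution, not model output.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps, alpha = 23, 20000, 0.05
true_mean = 1.0                          # mean of an Exponential(1) population

rejections = 0
for _ in range(reps):
    sample = rng.exponential(scale=1.0, size=n)    # skewed, clearly non-normal
    rejections += stats.ttest_1samp(sample, true_mean).pvalue < alpha

print("empirical rejection rate under the null:", rejections / reps)
# For a sample this small and this skewed the rate typically comes out somewhat
# above 0.05, but in the same ballpark: imperfect rather than useless.
```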

  14. Kenneth Fritsch
    Posted Aug 10, 2010 at 8:05 PM | Permalink

    Lucia, what do you mean, what do I mean? I believe James Annan’s critique was that the models differ significantly among themselves. With 20-odd model means we would expect about one model to be significantly different (approximately 2 sds) from the average of the means. Below is a link to a histogram from Nick Stokes of the model means. I get an average of 0.25 and an sd of 0.081. That would put one (0.45) significantly different from the mean and another (0.10) marginally different.

    http://moyhu.blogspot.com/2010/08/underestimate-of-variability-in.html

    From eyeballing the histogram I would suspect that the distribution is leptokurtic, but perhaps the question should be posed to Annan. As I said before, the MMH method does not require that distribution to be normal and perhaps that is what Annan does not understand.
    Or perhaps I do not understand.

    • Posted Aug 11, 2010 at 1:40 AM | Permalink

      Re: Kenneth Fritsch (Aug 10 20:05),
      I don’t think James was referring to the distribution itself there, but to the test MMH applied to the distribution. As you say, the observed sd is about 0.08, and the se of the mean is about 0.018 (not allowing for the variance of the means as calculated trends). But in Table 2 MMH give 0.253 with se 0.012. I believe the latter was derived from the trend variance for each run, but did not include the scatter of runs seen in the histogram.

      If you back-calc from that MMH se of 0.012, you get an implied sd of about 0.053, not 0.08, for the runs. Then lots would be significant outliers.

      • Kenneth Fritsch
        Posted Aug 11, 2010 at 4:46 PM | Permalink

        Nick Stokes, the means and sds have to be considered based on the weighting used in MMH 2010 for the model ensemble means, which is based on the number of runs made with each model.

        • Posted Aug 11, 2010 at 9:40 PM | Permalink

          Re: Kenneth Fritsch (Aug 11 16:46),
          Kenneth,
          I did that on the forward calc. But it doesn’t make much difference, so I didn’t do it on the back-calc.
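
For reference, a back-of-envelope sketch of the arithmetic in Nick Stokes’s comment above, using the numbers quoted in this exchange (unweighted, so it ignores the run-count weighting Kenneth raises):

```python
# A back-of-envelope sketch of the arithmetic in Nick Stokes's comment above,
# using the numbers quoted there (unweighted; the run-count weighting raised by
# Kenneth is ignored here).
import math

n = 20                  # "20 odd" model means, per Kenneth's comment
sd_runs = 0.081         # spread of model trends from the histogram (C/decade)
print("implied se of the ensemble mean:", round(sd_runs / math.sqrt(n), 3))  # ~0.018

se_quoted = 0.012       # se attributed to MMH Table 2 in the comment
print("spread implied by that se:", round(se_quoted * math.sqrt(n), 3))      # ~0.054
```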

  15. Posted Aug 10, 2010 at 8:51 PM | Permalink

    Hi Ken–
    I thought when you said “if your methods for comparing observe…” you meant, well, methods I’d used. I’ve been looking at surface trends, and those have more normal distributions. So, I was puzzled.

    Of course, if a distribution looks non-normal, it’s better to use a method that doesn’t assume normality, when such a thing is available. As for the bit I wrote that you quoted – I was just commenting on what Annan’s words seem literally to say – that someone is comparing an observation as a point to a mean as a point. That observation wouldn’t mean much and also wouldn’t require any statistics. You just say: Observed least squares trend = X. Ensemble trend = Y. X ≠ Y. Look, not equal. Done.

    No one anywhere is doing that.

    • Robbo
      Posted Aug 11, 2010 at 5:06 AM | Permalink

      Lucia: “if a distribution looks non-normal, it’s better to use a method that doesn’t assume normality”

      I think you have understated the point. For all that the methods that depend on normality are elegant and parsimonious of data, when normality is not demonstrably the case it’s not just ‘better’ not to use them; if you want to draw meaningful conclusions, it is required.

      • Posted Aug 11, 2010 at 12:56 PM | Permalink

        Robbo–

        …if you want to draw meaningful conclusions, it is required.

        Define “meaningful”.

        Whether or not one can use the assumption of normality when data are slightly off-normal is a matter of some debate. Some tests were derived based on the assumption that data are normal, and so apply exactly in that circumstance. Meanwhile, when we obtain data, we often do not know if the data are normal – and yet we go ahead and apply tests that use the assumption, often for lack of any better test.

        Because errors are important, people also do explore the robustness of methods to deviations from normality. It turns out t-tests are pretty robust to lack of normality. They aren’t perfect, but, as a practical matter, the imperfections are of no consequence relative to other problems. So, people state they are using the assumption, mention the imperfection, report results and then interpret the results providing caveats. There is nothing wrong with this.

        These results are still “meaningful” relative to throwing up one’s hands, decreeing they don’t know if the distribution might deviate from normal and doing nothing.

        But of course, if we know the distribution is way off normal, we do something else. Try to transform to make things normal etc. Otherwise, the caveats become too severe and no one believes the results.

        • Robbo
          Posted Aug 11, 2010 at 2:01 PM | Permalink

          Lucia:
          To be clear, I am not advocating “..throwing up one’s hands, decreeing they don’t know if the distribution might deviate from normal and doing nothing”. I am advocating, for example, using a Wilcoxon test rather than a t-test when the distribution is unknown or not demonstrably Normal. That way no caveats are necessary and credibility is perfectly conserved.

          And you have to be veeerryy careful about transforming variables – the mean of the inverse is not the inverse of the mean !

          PS My personal view is that the Normal distribution is a special case and the assumption of Normality is often a mistake, with consequences. YMMV.

        • Kenneth Fritsch
          Posted Aug 11, 2010 at 5:52 PM | Permalink

          The Shapiro-Wilk test for normality does not reject the hypothesis that the distribution of model means used in MMH 2010 has a normal distribution.
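
A minimal sketch of the checks discussed in this sub-thread, on hypothetical trend numbers rather than the actual MMH 2010 model means:

```python
# A minimal sketch, on hypothetical trend numbers, of the checks discussed in this
# sub-thread: a Shapiro-Wilk test for normality, then a t-test alongside the
# distribution-free Wilcoxon signed-rank alternative Robbo suggests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
model_trends = rng.normal(0.25, 0.08, 20)   # hypothetical ensemble of trends
obs_trend = 0.12                            # hypothetical observed trend

w_stat, p_norm = stats.shapiro(model_trends)
print(f"Shapiro-Wilk p = {p_norm:.2f} (large p: no evidence against normality)")

t_res = stats.ttest_1samp(model_trends, obs_trend)
wil = stats.wilcoxon(model_trends - obs_trend)   # signed-rank test against obs_trend
print(f"t-test p = {t_res.pvalue:.4f}, Wilcoxon p = {wil.pvalue:.4f}")
```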

  16. Posted Aug 10, 2010 at 11:09 PM | Permalink

    I have just arrived at our cottage after a day-long flight/boat ride from home, and I am on a slow dial-up for the next 3 weeks. I am just looking at this thread for the first time. I don’t want anyone thinking that if our IJC comment had gone through, the ASL paper would not have been written, i.e. it was only a tool for getting the tropical troposphere material out in another form. That is not the case. I have had it in mind to challenge the 1930’s era trend comparison methods used in climatology for a long time, and this topic was a good one to use as an application. The exasperating reviews at IJC kept us from printing a simple demonstration of the effect of updating the data while staying with Santer’s method, but I’d have written about the need for better methods in any case.

    Over at James Annan’s blog a few insta-experts are obviously very relieved to have convinced themselves that they don’t actually have to read our paper or deal with the results, because in their view the methodologies are all so obviously wrong. I don’t plan to use up my dial-up time trying to figure out the basis for their jubilation, or trying to figure out how they avoid arguing along the way that models are simply immune from any empirical test. But I look forward to hearing about the impending submissions to econometrics journals by these would-be tutors explaining why panel regression methods yield biased standard errors, and why the multivariate convergence results in the Vogelsang-Franses paper are invalid.

    However, before anyone sets to writing up such papers, I would suggest they reflect on some basic statistical modeling concepts, such as what it means to estimate a model under the null hypothesis rather than under the alternative. Sorry if that seems cryptic, but where I work that’s a pons asinorum.

    • pete
      Posted Aug 11, 2010 at 12:39 AM | Permalink

      The problem is not with your standard errors, it’s that you calculated confidence intervals from those SEs instead of prediction intervals.

      How many false positives do you get when you run your test substituting model runs* for observations? How does that compare to the nominal size of your tests?

      [* individual model runs, not model ensemble means]

      • ianl8888
        Posted Aug 11, 2010 at 1:32 AM | Permalink

        This seems to be the core of the criticisms, especially the last, parenthetic comment

        My comment is based on a somewhat tedious reading of the discussion on Annan’s blog. Gavin is on about “pairs” there, which seems to mean comparison of confidence intervals with individual model runs
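
A minimal numerical sketch of the confidence-interval versus prediction-interval distinction pete raises above, with a hypothetical run count and spread:

```python
# A minimal numerical sketch (hypothetical run count and spread) of the
# confidence-interval / prediction-interval distinction: the CI for the
# ensemble-mean trend shrinks like 1/sqrt(n); the interval in which a single
# new realization should land does not.
import math
from scipy import stats

n, sd = 20, 0.08                        # hypothetical run count and trend spread
tcrit = stats.t.ppf(0.975, df=n - 1)

ci_half = tcrit * sd / math.sqrt(n)             # uncertainty of the mean itself
pi_half = tcrit * sd * math.sqrt(1 + 1 / n)     # where one new draw should land

print(f"95% CI half-width for the mean:   {ci_half:.3f}")
print(f"95% PI half-width for a new draw: {pi_half:.3f}")
```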

    • VS
      Posted Aug 11, 2010 at 7:47 AM | Permalink

      Agreed.

      Climate scientists should understand that in order to assail a statistical method you need to wave some convergence theorems and matrices, not vague conjectures and the apparently popular term ‘dicking around’ (so far, I’ve seen both Joshua Halpern/Rabett and James Annan use it to discredit a statistical method they fail to grasp).

      —————–

      Also, repost of my comment from James Annan’s blog.

      —————–

      Annan,

      I took a quick look at the Knappenberger presentation you linked to, and as far as I can tell, the conclusions broadly coincide with those of MM10. In summary: the models have a tendency to overestimate the warming trend.

      Having said that, I’m not impressed with the ‘statistical power’ of the setup employed there. Then again, it’s not really a formally derived test, so speaking about power/size doesn’t make much sense in this context. With all due respect, I don’t understand why everybody got so ‘excited’ about it.

      The nice thing about the MMH paper, which *is* a formal test, is that it allows for correlations between the distribution of the trend estimators. This greatly improves the power of the test employed, which is a *must* given the enormous cross-dependencies and short time-interval available.

      And given the reactions here I feel the need to point out the obvious: this method for a large part deals with the problem of inter-series dependency. Note that the naive ‘hypothesis test’ presented by Knappenberger in essence fails to do this. Also note that failing to account for the *positive* correlation between the trend point estimates *will* widen the confidence interval for parameter difference, leading to severe statistical power reductions in your test.

      People should brush up on their stats, and then proceed to thank MMH for addressing this issue.

      Jeez.

      Best, VS

      PS. I resent your ‘dicking around’ comment on statistical hypothesis testing, as should any positivist scientist. You guys would really benefit from some gravity, especially given that your methods fail to predict even the direction of temperature change correctly.

      • Hoi Polloi
        Posted Aug 11, 2010 at 9:08 AM | Permalink

        You forget climatology’s pons asinorum, which is: “trust us”.
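
A small numerical sketch of VS’s point about correlated trend estimators, with hypothetical standard errors and correlation:

```python
# A small numeric sketch of VS's point about correlated trend estimators:
# Var(a - b) = Var(a) + Var(b) - 2*Cov(a, b), so ignoring a positive covariance
# overstates the uncertainty of the difference. Hypothetical numbers.
import math

se_a, se_b, rho = 0.05, 0.05, 0.6      # hypothetical standard errors and correlation

var_ignoring_cov = se_a**2 + se_b**2
var_with_cov = se_a**2 + se_b**2 - 2 * rho * se_a * se_b

print("se of the difference, correlation ignored:", round(math.sqrt(var_ignoring_cov), 3))
print("se of the difference, correlation used:   ", round(math.sqrt(var_with_cov), 3))
```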

  17. MikeN
    Posted Aug 10, 2010 at 11:19 PM | Permalink

    Do you have any evidence for why Thorne is the reviewer beyond that email?

    I didn’t think you made the case that Jones was talking about your paper in the Climategate e-mails last time, though he may have been the reviewer.

    Steve: the review comments themselves indicate it but do not prove it. I can’t “prove” that I’m right, but I’d bet that I’m right.

  18. Posted Aug 11, 2010 at 2:54 AM | Permalink

    “to estimate a model under the null hypothesis rather than under the alternative”

    Back when I tried to understand political econometrics that was the gold standard. Put another way, check your model against the assertion that nothing is going on. If it fails that test at a statistically significant level you have nothing. Time to reconsider.

    This is, apparently, a concept foreign to “Team” climate science.

  19. Kenneth Fritsch
    Posted Aug 11, 2010 at 3:32 PM | Permalink

    I have posted this comment at TAV:

    After my second read of MMH 2010 it is apparent that the statistic used was not a trend of a difference series between the surface and troposphere temperature anomalies, but rather the LT and MT series. I thought that it was that measure (difference between surface and troposphere) that was important in the debate about the tropical surface and troposphere trends. I also thought that at CA (and in Santer et al. (2008)) it had been shown that using the differences series allowed one to more readily show significant differences between models and observed results because the difference series has less variation. Could not a panel regression have been used with difference series?
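
A rough sketch of the kind of pooled regression Kenneth is asking about, on synthetic difference series; it illustrates the general approach only, not the MMH10 specification:

```python
# A rough sketch (synthetic data) of the kind of pooled regression asked about here:
# stack the difference series from several runs and estimate one common trend with
# cluster-robust standard errors. An illustration only, not the MMH10 specification.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n_runs, n_months = 5, 240
t = np.arange(n_months)

# hypothetical (troposphere - surface) difference series, each with a small trend
diffs = [0.04 / 120 * t + rng.normal(0, 0.08, n_months) for _ in range(n_runs)]

y = np.concatenate(diffs)
X = sm.add_constant(np.tile(t, n_runs))
groups = np.repeat(np.arange(n_runs), n_months)    # one cluster per run

res = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})
print("pooled difference trend (C/decade):", round(res.params[1] * 120, 3))
print("cluster-robust se (C/decade):      ", round(res.bse[1] * 120, 3))
# With only five clusters the robust se is itself noisy; a real analysis would
# need many more runs (or a different variance estimator).
```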

  20. TAG
    Posted Aug 11, 2010 at 6:23 PM | Permalink

    Annan wrote on his blog

    Well, duh. Of course the obs don’t match the ensemble mean. Even the models don’t match the ensemble mean – and this difference will frequently be statistically significant (depending on how much data you use). Is anyone seriously going to argue on the basis of this that the models don’t predict their own behaviour? If not, why on Earth should it be considered a meaningful test of how well the models simulate reality?

    I’m just a layman, but is this a group of rather strange statements that miss the point of the paper entirely?

    It is acknowledged that models are immature and that they do not, as yet, make useful predictions individually. However, it is asserted that the mean of the models may make a better prediction than individual models. The new paper is a test of that assertion.

    One puzzling statement – if it is acknowledged that the individual models are poor predictors of reality and that the model mean can be a better one, then why is it surprising that the models do not match the mean? Isn’t that a rather essential part of the assertion that the model mean is a predictor of reality and is used as such by the IPCC?

    Another puzzling statement – the paper’s test was of the model mean which supposedly has different properties than the individual models. However there is an assertion that this is a test of how well the individual models simulate reality.

    • TAG
      Posted Aug 11, 2010 at 6:30 PM | Permalink

      Why is Annan assuming that the model mean will have the same properties as the individual models? If it did, then wouldn’t that make the creation of multiple models pointless? For the mean and the models to match in properties, they must all have the same properties; so one model would represent all of them and the generation of more than one would be pointless. For the mean to be a better predictor of reality than the individual models, the process of creating it must somehow suppress some properties that cause errors in the prediction of reality.
