Using Santer’s own methodology with up-to-date observations, here are results comparing observations to the ensemble mean of Chad’s collation of 57 A1B model runs, extended to 2009. In each case, the d1* calculated Santer-style has moved into very extreme percentiles.

The results from Ross’ more advanced methodology are not in any sense “inconsistent” with the application of Santer’s own methods to up-to-date data.

| Tropo | Sat | Obs Trend | Ensemble | Santer d1* (1999) | d1* (2009) | Percentile |
|---|---|---|---|---|---|---|
| Lapse_T2LT | rss | -0.033 | -0.079 | -0.67 | -2.819 | 0.003 |
| Lapse_T2LT | uah | 0.048 | -0.079 | -3.5 | -7.395 | 0 |
| Lapse_T2 | rss | 0.005 | -0.069 | NA | -4.212 | 0 |
| Lapse_T2 | uah | 0.084 | -0.069 | NA | -8.518 | 0 |
| T2LT | rss | 0.159 | 0.272 | 0.37 | 1.69 | 0.948 |
| T2LT | uah | 0.075 | 0.272 | 1.11 | 2.862 | 0.996 |
| T2 | rss | 0.121 | 0.262 | 0.44 | 2.196 | 0.981 |
| T2 | uah | 0.04 | 0.262 | 1.19 | 3.449 | 0.999 |
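For readers following along, the general shape of a Santer-style d1* calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the linked script: the AR1 effective-sample-size adjustment and the quadrature combination of uncertainties are the two pieces discussed in the comments, while the function names and the exact degrees-of-freedom convention are my own simplifications.

```python
import numpy as np

def ar1_adjusted_trend_se(y):
    """OLS trend of a series and its standard error, with degrees of
    freedom deflated for lag-1 autocorrelation of the residuals
    (the effective-sample-size adjustment)."""
    n = len(y)
    t = np.arange(n, dtype=float)
    X = np.column_stack([np.ones(n), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    n_eff = n * (1 - r1) / (1 + r1)          # effective sample size
    s2 = (resid @ resid) / (n_eff - 2)       # residual variance, adjusted dof
    se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    return beta[1], se

def d1_star(obs_trend, obs_se, model_trends):
    """Gap between ensemble-mean and observed trend, scaled by the
    combined uncertainty: standard error of the ensemble mean and
    standard error of the observed trend, added in quadrature."""
    m = len(model_trends)
    se_ens = np.std(model_trends, ddof=1) / np.sqrt(m)
    return (np.mean(model_trends) - obs_trend) / np.hypot(se_ens, obs_se)
```

Extending the observational record lengthens the series fed to `ar1_adjusted_trend_se`, which shrinks `obs_se` and pushes the statistic toward the tails, which is the effect visible in the d1* (2009) column.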

## 11 Comments

I’m guessing the final sentence should say “up-to-date data”?

Steve: fixed.

Script for this – which collects quite a bit of other information – is at

http://www.climateaudit.info/scripts/models/santer/script_comment_short.txt

There’s a bug in `make.table4`: it’s trying to read from your `d:\` drive instead of downloading from the website.

Steve, I’m looking at T2LT, rss.

The quantity in the numerator of d1*, (ensemble – obs trend), does not seem to be very different from that of Santer08.

I then assume that the difference in d1* comes from the denominator. So which term(s) of the denominator changes a lot between Santer08 and your estimate? Is it the (inter-model) variance of mean trends, or the variance of the observed trend?

Steve: the change results simply from more observations, which yields more degrees of freedom and thus narrower CIs in the trend estimation allowing for AR1 autocorrelation.

Okay, then the larger d1* comes almost exclusively from a smaller variance of the observed trend, s(b0)^2 in the formula.

s(b0)^2 is inversely proportional to the number of observations. Therefore, intuitively, adding +10 years of data (that is, +50% of data) should reduce s(b0)^2 to 2/3 of its previous (Santer08) value.

Since the denominator of d1* is the square root of (inter-model variance + s(b0)^2), such a reduction of s(b0)^2 should yield a denominator reduced to at most sqrt(2/3) = 0.8 times its Santer08 value. And that’s a lower bound, obtained by neglecting the inter-model variance.

So this gives an upper bound for d1* = 1.25 times its Santer08 value. You find a x4 increase (1.69 vs 0.37).

What have I done wrong? Is s(b0) reduced by more than that? Is the inter-model variance reduced as well?
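One likely gap in the back-of-the-envelope above: the variance of a least-squares *trend* falls much faster than 1/n. For white noise the slope variance is sigma^2 / sum((t - tbar)^2) = 12*sigma^2 / (n*(n^2 - 1)), roughly proportional to 1/n^3, so 50% more data cuts it to about (2/3)^3 ≈ 0.30 of its former value, not 2/3. (With AR1 autocorrelation the constant changes, but the steep scaling survives.) A quick numeric check, assuming evenly spaced observations:

```python
import numpy as np

def trend_var(n, sigma2=1.0):
    """Variance of the OLS slope for n evenly spaced white-noise
    observations: sigma2 / sum((t - tbar)^2) = 12*sigma2/(n*(n^2-1))."""
    t = np.arange(n, dtype=float)
    return sigma2 / np.sum((t - t.mean()) ** 2)

# 20 years -> 30 years of monthly data (+50% of observations):
ratio = trend_var(360) / trend_var(240)
# a 1/n intuition predicts 2/3; the slope variance actually drops
# to about (2/3)**3, i.e. roughly 0.30 of its former value
```

This steeper-than-1/n shrinkage of s(b0)^2 goes a long way toward closing the gap between the x1.25 bound and the observed jump in d1*.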

I assume that Lapse_T2LT and Lapse_T2 refer to the difference series between the troposphere and surface temperature anomalies.

Santerizing the models is not now nor ever will be a legitimate statistical procedure for the simple reason that statisticians do not consider the mere failure to reject the null hypothesis as evidence to support that hypothesis.

The problem of trying to show that the null hypotheses could be true also exists in other scientific areas. In particular, in bioavailability tests, a pharmaceutical company manufacturing a generic version of a drug tries to demonstrate that their product will be absorbed by the body in a manner equivalent to the original preparation. The testing required from the drug manufacturer must show that the null hypothesis of no difference in the mean absorption is not rejected.

However, in order to guarantee that the result is not due to high variability in the sample or to insufficient information from an inadequate sample size, they must also show that if the difference between the two formulations were greater than a specified amount, the procedure would reject the null hypothesis at a predetermined significance level (in technical terms, the power of the test must be sufficient to distinguish differences of the given magnitude). This portion is completely lacking in the procedure used by Santer, rendering the test useless.

Dr. Pielke demonstrated in his presentation how the latter procedure works: more garbage in … Santerized models out.
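For the curious, the bioequivalence logic described above is usually formalized as the “two one-sided tests” (TOST) procedure: equivalence is declared only when the data significantly rule out differences beyond ±delta in both directions. A minimal sketch, where the normal approximation to the critical value and the function name are my simplifications:

```python
from statistics import NormalDist

def tost_equivalent(diff, se, delta, alpha=0.05):
    """Two one-sided tests: declare the two formulations equivalent
    only if the observed difference is significantly above -delta AND
    significantly below +delta. Normal approximation for simplicity."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejects_low = (diff + delta) / se > z_crit    # rules out diff <= -delta
    rejects_high = (diff - delta) / se < -z_crit  # rules out diff >= +delta
    return rejects_low and rejects_high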

Re: RomanM (Aug 13 19:34),

Yep. We never see information on the power of the test when we get “fail to reject”. Power (or type II error) was discussed in my sophomore-year statistics class, so it’s odd not to see it. There also never seems to be any suggestion that if we are testing a hypothesis H0 and specify our assumptions about the process, we should, when possible, pick a method with greater power. (So, for example, if a time series really IS AR(1), and we have a choice between using monthly data and annual-average data, we should generally prefer the method that gives us more power. Admittedly, if the higher-power method requires a supercomputer to implement and the poorer one can be done on a spreadsheet, one might go for the lower-power method for that reason. But all things being equal, the higher-power method is preferred.)
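A back-of-the-envelope power calculation is easy to sketch. Under a normal approximation, the power of a two-sided test depends only on the true effect size measured in standard-error units; the function name and the z-approximation here are my own choices:

```python
from statistics import NormalDist

def power_two_sided(effect, se, alpha=0.05):
    """Approximate power of a two-sided z-test: the probability of
    rejecting the null when the true effect has the given size."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    z = effect / se
    # probability the test statistic lands beyond either critical value
    return nd.cdf(z - z_crit) + nd.cdf(-z - z_crit)
```

If this number is low, “fail to reject” tells you almost nothing. And on the monthly-versus-annual choice: whichever treatment yields the smaller (autocorrelation-adjusted) standard error for the quantity of interest delivers the larger power.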

> Santerizing the models is not now nor ever will be a legitimate statistical procedure for the simple reason that statisticians do not consider the mere failure to reject the null hypothesis as evidence to support that hypothesis.

Like you said!

Also, there’s a known rule of common sense – keep pharmacists away from statistics! How many press releases of ‘statistical significance’ are reported as red letter news! And yet, with climate science, we give them a pass!

Roman, you are muddying the waters a bit. Steve’s point is that the null IS rejected, with all the data!