Does the Endpoint of Santer H2 "Matter"?

Yes.

Perhaps the first thing that I noticed about this article was the endpoint for analysis of 1999 – this seemed very odd. I mentioned that a Santer coauthor wrote to me, saying that the endpoint didn’t matter relative to the Douglass endpoint of 2004. That turns out to be true, but why would anyone in 2008 use data ending in either 1999 or 2004? (This applies to both Douglass and Santer). There’s been lots of criticism over the use of obsolete data in controversial articles – so why was either side of this dispute using obsolete data?

The Santer SI contains a sensitivity study of the H1 hypothesis up to 2006. There’s been some discussion here about whether trends to 1999 could be extended to 2006 for comparison purposes – something that made sense to me, and Santer et al took the same position. They state:

In the second sensitivity test (“SENS2”), we calculated observed trends in T2LT and T2 over the 336-month period from January 1979 to December 2006, which is a third longer than the analysis period in the baseline case. As in SENS1, we set s{bm} = s{bo}. Since most of the model 20CEN experiments end in 1999, we make the necessary assumption that values of bm estimated over 1979 to 1999 are representative of the longer-term bm trends over 1979 to 2006. Examination of the observed data suggests that this assumption is not unreasonable.

They observe that the longer record leads to a sharpening of CIs for observed trends (as we’ve discussed here) but report that this does not affect their H1 results:

Even with longer records, however, no more than 23% of the tests performed lead to rejection of hypothesis H1 at the nominal 5% significance level

Later in the SI, they discuss several sensitivity tests for the H2 hypothesis, but, for some reason, they do not report on the impact of the SENS2 test on the H2 hypothesis – a rather surprising omission.

It’s completely trivial to do these calculations on up-to-date data. CA readers can obtain results corresponding to Santer Table III, updated to the most recent UAH data, as follows:

source("http://data.climateaudit.org/scripts/models/santer.utilities.txt")
source("http://data.climateaudit.org/scripts/spaghetti/msu.glb.txt")
f(msu[,"Trpcs"],"UAH_T2LT",end0=2050)$info

By using current data, the value of the Santer d1 test (a t-test) increases to 2.232 (from the 1.11 reported in their Table III), yielding an opposite conclusion in this respect from the one reported in the article.

These results are obtained not by doing the tests in a different way that I happen to prefer, but using the same methodology as Santer et al on up-to-date data.

You can check results to the end of 2007 – data that would have been readily available to Santer et al at the time of submission of the article – as follows; this yields a d1 value of 1.935, which would be significant against a one-sided t-test.

f(msu[,"Trpcs"],"UAH_T2LT",end0=2007)$info

The value for 2006 was 1.77, which would be significant against a one-sided t-test (and against a two-sided t-test at 90%). It seems odd that they would have gone to the trouble of doing the SENS2 sensitivity study on the H1 hypothesis, but not the H2 hypothesis. And if they did the SENS2 test on the H2 hypothesis, these results would be important and relevant information.

And when they saw these results, you’d think that Gavin Schmidt, Santer and so on would be curious as to what would happen with 2007 results. RC has not been reluctant to criticize people who have used stale data and you’d think that Schmidt would have taken care not to do the same thing himself. Especially if the use of up-to-date data had a material impact on the results, as it does with the H2 hypothesis in respect to the UAH data.


Lucia on Santer

Excellent post here. Please comment at Lucia’s.

Replicating Santer Tables 1 and 3

Has anyone tried to replicate Santer’s Table 1 and 3 results? It’s not as easy as it looks. What’s tricky is that the table looks pretty easy (and most of it is), but, if you assume that it’s done in a conventional way, you’ll get wrongfooted. In fairness, Santer provided an equation for the unconventional calculation, but it’s easy to miss what he did and I was unable to verify one of the legs for the unconventional method using profile likelihood methods on the UAH data. The difference is non-negligible (about 0.05 deg C/decade in the upper CI trend).

We’ve also had some debate between beaker and Lucia over whether there was a typo in Santer’s equation – Lucia saying no, beaker yes. Lucia said that she emailed Gavin seeking clarification, but she hasn’t reported back to my recollection. Anyway, I can now see what Santer is doing in Table III and can objectively confirm Lucia’s interpretation over beaker’s.

First, here’s Santer Table 1:

Now here it is in a digital format (collated from the pdf to make comparisons handy)

loc="http://data.climateaudit.org/data/models/santer_2008_table1.dat"
santer=read.table(loc,skip=1)
names(santer)=c("item","layer","trend","se","sd","r1","neff")
row.names(santer)=paste(santer[,1],santer[,2],sep="_")
santer=santer[,3:ncol(santer)]

Next let’s collect the UAH series for 1979-99 (using a script on file at CA), dividing the year scale by 10 (because Table 1 is in deg C/decade):

source("http://data.climateaudit.org/scripts/spaghetti/msu.glb.txt")
x=msu[,"Trpcs"];id="UAH_T2LT"
temp=(time(x)>=1979)&(time(x)<2000)
x=ts(x[temp],start=c(1979,1),freq=12)
year=c(time(x))/10;N=length(x);N #252

Column 3 (“S.D.”) matches the obvious calculation almost exactly:

options(digits=3)
c(sd(x),santer[id,"sd"]) # 0.300 0.299

The OLS trend calculations are pretty close, but suggest that the UAH version used in Santer et al 2008 may not be a 2008 version.

fm= lm (x~year)
c(fm$coef[2],santer[id,"trend"]) # 0.0591 0.0600

The AR1 coefficients from the OLS residuals are a bit further apart than one would like – it may be due to a different version of the UAH data or to some difference between Santer’s arima algorithm and the R version. (This is the step where Gavin got wrongfooted over at Lucia’s, as he did arima calculations for Lucia on the original series x rather than the residuals, as he acknowledged when I pointed it out to him.)

r= arima(fm$residuals,order=c(1,0,0))$coef[1]
c(r,santer[id,"r1"]) # 0.888 0.891

This difference leads to a 3% difference in the N_eff:

neff= N * (1-r)/(1+r) ;
r1= santer[id,"r1"]
c(neff,N * (1-r1)/(1+r1) , santer[id,"neff"]) ; #15.0 14.5 14.5

Now for the part that’s a little bit tricky. Santer describes an adjustment of the number of degrees of freedom for autocorrelation according to the following equation:
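Presumably this is the familiar lag-1 rule already used in the code above, i.e.

n_{eff} = N \, \frac{1 - r_1}{1 + r_1}

where r_1 is the lag-1 autocorrelation coefficient of the regression residuals.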

My instinct in implementing this adjustment was to multiply the standard error of the slope from the linear regression model by sqrt(N/N_eff) as follows, but this left the result about 7% narrower than Santer:

c(sqrt(N/santer[id,"neff"])*summary(fm)$coef[2,"Std. Error"], santer[id,"se"] ) # 0.129 0.138

It took me quite a while to figure out where the difference lay. First, the standard error of the slope in the regression model can be computed from first principles as the square root of the residual sum of squares divided by the degrees of freedom (fm$df = N-2) and by the sum of squared deviations of the time values (ssx):

ssx=sum( (year-mean(year))^2 )  # sum of squared deviations of the time values
c(summary(fm)$coef[2,"Std. Error"],sqrt(sum(fm$residuals^2)/fm$df /ssx) )
# 0.031 0.031

Santer appears to calculate his slope SE from first principles, describing the operation as follows:
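Judging from the replication below (rather than transcribing the equation from the paper), the adjusted standard error presumably has the form

s_b = \sqrt{ \frac{1}{n_{eff} - 2} \cdot \frac{\sum_t \hat{e}_t^2}{\sum_t (t - \bar{t})^2} }

i.e. the usual OLS formula with the effective sample size n_{eff} replacing N in the degrees of freedom.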

If N − 2 and N_eff − 2 are used in the adjustment factor instead of N and N_eff, the Santer standard error of the slope can be reproduced:

c( summary(fm)$coef[2,"Std. Error"] * sqrt( (N-2)/(santer[id,"neff"]-2)) , santer[id,"se"] )
# 0.139 0.138
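For convenience, the adjustment can be wrapped into a small function (a sketch only, re-using the fm and r objects defined above; the function name is my own):

santer.se = function(fm, r) {
  # adjusted standard error of the OLS trend, with the (N_eff - 2) degrees-of-freedom adjustment
  N = length(fm$residuals)
  neff = N * (1 - r)/(1 + r)   # lag-1 effective sample size
  summary(fm)$coef[2, "Std. Error"] * sqrt((N - 2)/(neff - 2))
}
santer.se(fm, r)   # close to the 0.138 in Table 1; small differences reflect the slightly different r1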

Santer did not provide a statistical citation for the subtraction of 2 in this context, but was conscious enough of the issue to raise the matter in the SI (though not the article):

The “-2” in equation (5) requires explanation.

Indeed.

Speaking in general terms, I don’t like the idea of experimental statistical methods being used in contentious applied problems. After a short discussion in the SI, they argue that their method does not penalize the Douglass results because the estimated r1 tends to run slightly low, thereby giving a higher N_eff than the data warrant:

The small positive bias in rejection rates arises in part because r1, the sample value of the lag-1 autocorrelation coefficient (which is estimated from the regression residuals of the synthetic time series) is a biased estimate of the population value of a1 used in the AR-1 model (Nychka et al., 2000).

Unfortunately Nychka et al 2000 does not appear in the list of references and does not appear at Nychka’s list of publications. In the predecessor article (Santer et al JGR 2000), the corresponding academic reference was check-kited (Nychka et al, manuscript in preparation), but 8 years later, the check still seems not to have cleared. (I’ve written Nychka seeking clarification on this and will report further if anything changes.) [Update – we’ve tracked down Nychka et al 2000 through a 2007 citation to it as an unpublished manuscript. The link at UCAR is dead, but the article is available here (pdf). Even though the article has not appeared in the “peer reviewed” literature, it is a sensible technical report and has much of interest. It builds on Zwiers and von Storch (J Clim 1995), another sensible article. It’s not clear to me as of Oct 23 that the method in Santer et al 2008 is the same as the methods in Nychka et al 2000, but I’ll look at this. ]

In the running text, they purport to justify their procedure using a different reference:

The key point to note here is that our paired trends test is slightly liberal – i.e., it is more likely to incorrectly reject hypothesis H1 (see Section 4)

Turning to section 4, it states:

Experiments with synthetic data reveal that the use of an AR-1 model for calculating N_eff tends to overestimate the true effective sample size (Zwiers and von Storch, 1995). This means that our d test is too liberal …

I attempted to test this claim by comparing the Santer CIs to the CIs calculated using the profile likelihood method described previously. The black colors show 1999 data – which is the only data discussed here. The dotted vertical lines show the 95% CI intervals calculated from profile likelihood. The black dots show the intervals calculated using a sqrt(N/N_eff) adjustment relative to the maximum likelihood slope under AR1 – which in this case appears to be less than the OLS slope. The black triangles show the Santer confidence intervals.

The Santer 95% upper CI is 0.051 deg C higher than my result (0.336 vs 0.285). This result is exactly opposite to the assertion in Santer’s SI, as the expanded confidence intervals would make it more difficult to show inconsistency (rather than less difficult.)

I think that one can see how the Santer method might actually over-estimate the AR1 coefficient of the residuals relative to the maximum likelihood method that I used. The Santer results are NOT maximum likelihood for the combination of trend and AR1 coefficient. They calculate the slope and then the AR1 coefficient given the slope, whereas I did a joint maximization over both the slope and AR1 coefficient using profile likelihoods. In the specific examples, the AR1 coefficient from this method is less than the Santer (N_eff − 2) coefficient. This makes sense – the slope in the combined ML result is not the same as the OLS slope; the differences are not huge, but neither are the differences in the AR1 coefficient. So I’m not persuaded that the Santer argument carries over to a combined ML calculation, even if it holds when the calculations are done one at a time.
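As a sketch of this comparison (not the exact profile likelihood calculation behind the figure), R’s arima function can fit the trend and the AR1 coefficient jointly by maximum likelihood via the xreg argument, re-using the x, year, fm and r objects from the Table 1 replication above:

# joint ML: slope and AR1 coefficient estimated together, vs the two-step OLS-then-AR1 approach
fit.joint = arima(x, order = c(1, 0, 0), xreg = cbind(year = year))
c(fm$coef["year"], fit.joint$coef["year"])   # OLS slope vs joint-ML slope
c(r, fit.joint$coef["ar1"])                  # two-step AR1 coefficient vs joint-ML AR1 coefficient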

As noted before, with the inclusion of data up to 2008, the upper CI declines quite dramatically to 0.182 deg C/decade (as compared to Santer’s 0.336), a change which may not be irrelevant when one considers that the ensemble mean trend is 0.214 deg C/decade. As noted before, the change arises primarily from narrowing of the CI intervals owing to AR1 modeling, with changes in the trend estimate being surprisingly slight.

Table 3
This is an important table, as the results from this table are ultimately applied to significance testing. As noted above, controversy arose between beaker and Lucia over exactly what Santer was doing. Lucia’s interpretation (also mine) was that the only difference between the Douglass and Santer tests was the inclusion of an error term relating to the observations, and that both studies used the standard error of the model mean; beaker argued that the appropriate standard error was the population standard error and that this must have been what Santer used (and that they had made a typo in their article). I can now report that Lucia’s interpretation can be verified in the calculations.

First let’s collect some data for the Table III calculations.

d1=1.11;d1star=7.16 #from Table III
trend.model=santer["Multi-model_mean_T2LT","trend"]; # 0.215
# this is a little higher than the multi-model mean of Douglass Table 1 (0.208)
sd.model=santer["Inter-model_S.D._T2LT",1]; #0.0920
# this is a little higher than the multi-model sd of Douglass Table 1 (0.0913)
trend.obs=santer["UAH_T2LT","trend"];trend.obs # 0.06
se.obs=santer["UAH_T2LT","se"];se.obs #0.138
# this is the adjusted Std. Error of the trend estimate
M=19 #from Santer

Next we can verify Santer’s Table III d1star calculation for his Douglass emulation (presuming his methods):

c( (trend.model-trend.obs)/ (sd.model/sqrt(M-1)), d1star)
## 7.15 7.16 #replicates Table III

Now here is the calculation yielding Santer’s Table III d1 result, proving that they used the standard deviation of the model mean (the method decried by beaker):

c( (trend.model-trend.obs)/ sqrt( (sd.model/sqrt(M-1))^2 + se.obs^2), d1)
#1.11 1.11
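As a rough gauge (my own check, not a calculation reported by Santer et al, and the appropriate reference distribution is itself part of the dispute), the d1 and d1star values can be converted to approximate two-sided p-values against a normal reference:

c(2*pnorm(-abs(d1)), 2*pnorm(-abs(d1star)))   # roughly 0.27 for d1; essentially 0 for d1star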

At this point, we really don’t need to hear back from Gavin Schmidt on what he thinks that they did. We know what they did.

Resolving the Santer Problem

In today’s post, I think that I’ve developed an interesting approach to the Santer problem, one that represents a substantial improvement over the analyses of either the Santer or Douglass posses.

I think that the approach proposed here is virtually identical to Jaynes’ approach to analyzing the difference between two means, as set out in the article recommended by beaker. As it happens, I’d done all of the calculations shown today prior to reading this article. While my own calculations were motivated primarily by trying to make sense of the data rather than anything philosophical, academics like to pigeonhole approaches and, to that extent, the approach shown below would perhaps qualify as a “bayesian” approach to the Santer problem, as opposed to the “frequentist” approach used both by Santer and Douglass. I had the post below pretty much in hand, when I was teasing beaker about being a closet frequentist.

The first problem discussed in Jaynes 1976 was the following question about the difference of two means (and keep in mind that the issue in Santer v Douglass is the difference between two trends):

The form of result arrived at by Jaynes was the following – that there was a 92% probability that B’s components have a greater mean life.

The form of conclusion that I’m going to arrive at today is going to be identical – ironically even the number is identical: there is a 92% probability that a model trend will exceed the true observed trend. Now as a caveat, my own terminology here may be a little homemade and/or idiosyncratic; I place more weight on the structure of the calculations which are objective. I think that the calculations below are identical to the following equation in Jaynes 1976 (though I don’t swear to this and, as noted before, I’d done the calculations below before I became aware of Jaynes 1976).

Let me start with the following IMO instructive diagram, which represents my effort to calculate a probability distribution of the trend slope, given the observations to 1999 (Santer), 2004 (Douglass) and up to 2008 (most of which were available to Santer et al 2008). It looks like this corresponds to Jaynes’ P_n(a), i.e. what bayesians call a “posterior distribution”, but I’m just learning the lingo. The dotted vertical lines represent 95% confidence intervals from the profile likelihood calculations shown in a prior post, color coded by endpoint. The colored dots represent 95% confidence intervals calculated using the N_eff rule of thumb for AR1 autocorrelation used in Santer. The black triangle at the bottom shows the ensemble mean trend (0.214 deg C/decade).

Figure 1. See explanation in text immediately preceding figure.

A few obvious comments. First, the 95% confidence intervals derived from profile likelihoods are pretty much identical to the 95% confidence intervals derived from the AR1 rule of thumb – that’s reassuring in a couple of ways; it reassured me at least that my experimental calculations had not gone totally off the rails. Second, the illustration of probability distributions shown here is vastly more informative than anything in either Santer or Douglass, both for the individual periods and for showing the impact of over 100 more months of data in reducing the CI of the trend under AR1 autocorrelation. (A note here: AR1 autocorrelation wears off fairly quickly – if LTP is really in play, then the story would be different.)

Third, and this is perhaps a little unexpected: inclusion of data from 1999 to 2008 has negligible impact on the maximum likelihood trend of the MSU tropical data; the most distinct impact is on the confidence intervals, which narrow a great deal. The CIs inclusive of 2008 data are about 50% narrower than the ones using data only up to 1999. Whether (or to what extent) this is an apples-to-apples comparison, I’ll defer for now.

Fourth, and this may prove important: although the position of the ensemble mean triangle relative to the 95% CI is unaffected by the inclusion of data up to 2004, it is affected by the inclusion of data up to 2008. With 2008 data in, the CI is narrowed such that the ensemble mean is outside the 95% CI of the observations – something that seems intuitively relevant to an analyst. I feel somewhat emboldened by Jaynes 1976’s strong advocacy of pointing out things that are plain to analysts, if not to frequentists.

My procedure for calculating the distribution was, I think, interesting, even if a little homemade. The likelihood diagrams that I derived a couple of days ago had a function that yielded confidence intervals (this is what yielded the 95% CI line segment). I repeated the calculation for confidence levels ranging from 0 to 99%, which, with a little manipulation, gave a cumulative distribution function at irregular intervals. I fitted a spline function to these irregular intervals, obtained values at regular intervals, and from these obtained the probability density functions shown here. The idea behind the likelihood diagrams was “profile likelihood” along the lines of Brown and Sundberg – assuming a slope and then calculating the likelihood of the best AR1 fit to the residuals. I don’t know how this fits into Bayesian protocols, but it definitely yielded results that seem far more “interesting” than the arid tables of Santer et al.
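A minimal sketch of that recipe, with a hypothetical function ci.profile(level) standing in for the confidence-interval function from the earlier post (it is assumed to return the lower and upper profile-likelihood bounds for a given two-sided level):

levels = seq(0.01, 0.99, by = 0.02)
ci = t(sapply(levels, ci.profile))        # one row per level: lower and upper bound
# stack the bounds into an (irregular) empirical CDF: lower bound at (1-level)/2, upper at 1-(1-level)/2
cdf = data.frame(a = c(ci[, 1], ci[, 2]), p = c((1 - levels)/2, 1 - (1 - levels)/2))
cdf = cdf[order(cdf$a), ]
cdf = cdf[!duplicated(cdf$a), ]
# spline the CDF onto a regular grid and difference it to get a density
grid = seq(min(cdf$a), max(cdf$a), by = 0.001)
F = splinefun(cdf$a, cdf$p, method = "monoH.FC")(grid)   # monotone spline keeps the CDF non-decreasing
plot(grid[-1], diff(F)/diff(grid), type = "l", xlab = "trend (deg C/decade)", ylab = "density")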

My next diagram is a simple histogram of T2LT trends derived from data in Douglass et al 2007 Table 1, which, for the moment, I’m assuming is an accurate collation of Santer’s 20CEN results. In Douglass’ table, he averaged results within a given model; I’ll revisit the analysis shown here if and when I get digital versions of the 49 runs from Santer (Santer et al did not archive digital information for their 49 runs; I’ve requested the digital data from Santer – see comments below). [Note – beaker in a comment below states that this comparison is pointless. I remain baffled by his arguments and have requested a reference illuminating the statistical philosophy of taking a GCM ensemble; I will undertake to revisit the matter if and when such a reference arrives.]

I derived T2LT trends by applying weights for each level (sent to me by John Christy) to the trends at each level in the Douglass table. In the diagram, I’ve carried forward the 95% CI intervals from the observations for location and scale comparison, as well as the black triangle for the ensemble mean. The $64 question is then the one that Jaynes asks. [Note – I think that this needs to be re-worked a bit more as a “posterior” distribution to make it more like Figure 1; I’m doing that this afternoon – Oct 21 – and will report on this.]

Figure 2. Histogram of LT trends from Douglass et al 2007 data.

My next diagram is the proportion of models with T2LT trends that exceed a given x-value. I think that this corresponds to Jaynes’ P_m(b). The dotted lines are as before; the solid color-coded vertical lines are the maximum likelihood trend estimates for the respective periods. Again, to an analyst, the models seem to lie to the right of these estimates.

Figure 3. Proportion of Model Trends (per Douglass Table 1 information) exceeding a given value.

I then did a form of integration which I think is along the lines of the Jaynes recipe. I made a data frame with x-values in 0.01 increments (Data[[i]]$a, with i subscripting the three periods 1999, 2004 and 2008). For each value of x, I made a column of values representing the results of Figure 3:

for(i in 1:3) Data[[i]]$fail=sapply(Data[[i]]$a, function(x) sum(douglass$trend_LT>=x )/22)

Having already calculated the distribution P_n(a) in column Data[[i]]$d, I calculated the product as follows:

for(i in 1:3) Data[[i]]$density=Data[[i]]$d*Data[[i]]$fail

Then simply add up the columns to get the integral (which should be an answer to Jaynes question):

sapply(Data,function(A) sum(A$density))
# 1999 2004 2008
#0.8638952 0.8530540 0.9164450

On this basis, there is a 91.6% probability that the model trend will exceed the observed trend.

I leave it to climate philosophers to decide whether this is “significant” or not.
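For anyone who wants to reproduce the structure of this calculation without the objects above, here is a self-contained sketch; obs.density stands in for the posterior density of the observed trend from Figure 1, and model.trends for the model trends from the Douglass table – both are placeholders (loosely based on the Table III values), not the actual inputs, so the resulting number will differ from 91.6%:

# P(model trend > observed trend) = sum over a of p_obs(a) * P(model trend >= a)
a = seq(-0.2, 0.6, by = 0.01)                               # grid of trend values (deg C/decade)
obs.density = dnorm(a, mean = 0.06, sd = 0.138)             # placeholder posterior for the observed trend
obs.density = obs.density/sum(obs.density)                  # normalize so the grid weights sum to 1
model.trends = rnorm(22, mean = 0.215, sd = 0.092)          # placeholder model trends
exceed = sapply(a, function(x) mean(model.trends >= x))     # proportion of model trends exceeding each a
sum(obs.density * exceed)                                   # the Jaynes-style probability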

Going back to Jaynes 1976 – I think that this calculation turns out to be an uncannily exact implementation of how Jaynes approached the problem. In that respect, it was very timely for me to read this reference while the calculation was fresh in my mind. It was also reassuring that, merely by implementing practical analysis methods, I’d in effect got to the same answer as Jaynes had long ago.


Santer and the Closet Frequentist

In many interesting comments, beaker, a welcome Bayesian commenter, has endorsed Santer’s criticism of Douglass et al, which purported to demonstrate inconsistency between models and data for tropical tropospheric trends. (Prior post in the sequence here.) Santer et al proposed revised significance tests which, contrary to the Douglass results, did not yield results with statistical “significance”, which they interpreted as evidence that all was well – as, for example, Gavin Schmidt here:

But it is a demonstration that there is no clear model-data discrepancy in tropical tropospheric trends once you take the systematic uncertainties in data and models seriously. Funnily enough, this is exactly the conclusion reached by a much better paper by P. Thorne and colleagues. Douglass et al’s claim to the contrary is simply unsupportable.

In passing, beaker mentioned that he was re-reading Jaynes (1976), Confidence intervals vs. Bayesian intervals. I took a look at this article by a Bayesian pioneer, which proved to contain many interesting dicta, many of them directed at ineffective uses of significance tests that fail to extract useful statistical conclusions available within the data – dicta that resonate, at least for me, in the present situation. The opening motto for the Jaynes article reads:

Significance tests, in their usual form, are not compatible with a Bayesian attitude.

This motto seems strikingly at odds with beaker’s incarnation as a guardian of the purity of significance tests. Jaynes described methods whereby a practical analyst could extract useful results from the available data. Jaynes looked askance at unsophisticated arguments that results from elementary significance tests were the end of the story. In this respect, it’s surprising that we haven’t heard anything of this sort from beaker.

Not that I disagree with criticisms of the Douglass et al tests. If you’re using significance tests, it’s important to do them correctly. The need to allow for autocorrelation in estimating the uncertainty in trends was a point made here long before the publication of Santer et al 2008 and was one that I agreed with in my prior post. But in a practical sense, there does appear to be a “discrepancy” between UAH data and model data (this is not just me saying this; the CCSP certainly acknowledges a “discrepancy”). It seems to me that it should be possible to say something about this data, and that’s the more interesting topic that I’ve been trying to focus on. So far I am unconvinced by the arguments of Santer, Schmidt and coauthors purporting to show that you can’t say anything meaningful about the seeming discrepancy between the UAH tropical troposphere data and the model data. These arguments seem all too reminiscent of the attitudes criticized by Jaynes.

The Jaynes article, recommended by beaker, was highly critical of statisticians who were unable to derive useful information from data that seemed as plain as the nose on their face, because their “significance tests” were poorly designed for the matter at hand. As a programme, Jaynes’ bayesianism is an effort to extract every squeak of useful information out of the matter at hand by avoiding simplistic use of “significance tests”. This is not to justify incorrect use of significance tests – but merely to opine that the job of a bayesian statistician, according to Jaynes, is to derive useful quantitative results from the available information.

Interestingly, the Jaynes reference begins with an example analysing the difference in means – taking an entirely different approach than the Santer et al t-test. Here’s how Jaynes formulates the example:

Jaynes observes that the “common sense” conclusion here would be that B out-performed A, even though the available information was less than one would want.

Jaynes inquires into how the authority arrived at this counter-intuitive conclusion:

Jaynes conclusion is certainly one which resonates with me:

Now any statistical procedure which fails to extract evidence that is already clear to our unaided common sense is certainly not for me!

Jaynes’ next example has a similar flavor:

Again, Jaynes observes that application of conventional significance tests failed to yield useful results in a seemingly slam-dunk situation:

This latter conclusion sounds all too much like the nihilism of Santer et al. Jaynes observes of this sort of statistical ineffectiveness:

Quite so.

Now I must admit that my eyes ordinarily tend to glaze over when I read disputes between Bayesians and frequentists. However, as someone whose interests tend to be practical (and I think of my activities here as those of a “data analyst” rather than a “statistician”), I like the sound of what Jaynes is saying. In addition, our approaches here to the statistical bases of reconstructions have been “bayesian” in flavor (Brown and Sundberg are squarely in that camp, and my own experiments with profile likelihood results are, I suppose, somewhat bayesian in approach), though I don’t pretend to have all the lingo. I also don’t have to unlearn a lot of the baggage that bayesians spend so much time criticising, as my own introduction to statistics in the 1960s was from a very theoretical and apparently (unusually for the time) Bayesian viewpoint (transformation groups, orbits), with surprisingly little, in retrospect, on the mechanics of significance tests.

I’ve done some more calculations in which I’ve converted profile likelihoods to a bayesian-style distribution of trends, given observations up to 1999 (Santer), 2004 (Douglass) and 2008 (as I presume a bayesian would do.) They are pretty interesting. I’ll post these up tomorrow. I realize that the Santer crowd have excuses for not using up-to-date data – their reason is that the models don’t come up to 2008. It is my understanding that bayesians try to extract every squeak of usable information. It appears to me that Jaynes would castigate any analyst who failed to extract whatever can be used from up-to-date information. Nevertheless, beaker criticized Douglass et al for doing exactly that.

Perhaps beaker is a closet frequentist.

Reference:
Jaynes, E. T. 1976. Confidence intervals vs. Bayesian intervals. Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science 2: 175-257. http://bayes.wustl.edu/etj/articles/confidence.pdf

Peter Brown and Mann et al 2008

Today, I’m going to consider the handling in Mann et al 2008 of 17 proxy series developed by Peter Brown and Connie Woodhouse. Peter Brown is an anti-CA dendro who made a few posts here last year mainly on this thread. He introduced himself by saying “I have little patience for your blog. .. Typically a thread here quickly devolves from anything remotely connected to science into a personal attack”.

Willis asked him a number of excellent questions, to which Brown gave polite though mostly unresponsive answers (see passim through the post). When Willis asked him about the use of “novel untested” statistical methods, Brown replied as follows, and, when the obvious example of MBH was brought up, we heard no more from him on the matter:

I cannot answer this question, I am not a statistician. I can say that I think your question is mostly a strawman; I cannot think of a single paper that I have read that used what I what I imagine you might refer to as a “novel, untested” method without providing justification for its use. Often this justification comes in the form of reference to existing literature, but it is always there.

When Willis asked:

2) Are proxy reconstructions without ex ante proxy selection rules valid? And if so, what protection is there against “data snooping” and “cherry picking”?

Brown gave another unresponsive reply (see here and here), observing:

All scientific studies begin with premises that guide model and data selection. So in answer to the second part of your question above, the premise of the question is incorrect; there is no need for “protection” since those terms are irrelevant to the process of development of explicit site selection criteria.

Obviously the process of selecting and weighting proxies for use in a multiproxy study is fraught with potential problems and one doesn’t avoid the problems by saying that they don’t exist. Willis asked:

3) Many proxy temperature reconstruction use proxies which have been used previously for proxy precipitation reconstructions. What methods (if any) have been used to control for the other variables? If there is no attempt made to control for confounding variables, is the study valid?

Brown answered, again unresponsively (with the latter part seemingly at odds with his professed opposition to “personal attacks”):

I am not the person to answer this question. I would suggest that the main approach, as I see it, has been blunt force, the law of large numbers; with enough chronologies that have weak temperature response (and again in ppt-sensitive series this is typically an inverse response) the broader-scale patterns will emerge and be strengthened. It is analogous of course to trying to see global warming in a single time series, can’t be done. (And to the commenter in post #85 about my observation of early grass-cutting in Colorado, did I say anything about it being a sign of global warming? Simply an observation…please don’t place your biases into my comments, you sanctimonious SOB. )

Brown and Woodhouse Chronologies in Mann et al 2008
17 Woodhouse and Brown chronologies are used in Mann et al 2008; indeed, they make up just a little less than 20% of the 104 Mann proxies with post-1995 values. The 17 proxies are listed below, with site names located in the ITRDB data set. The correlations are from the SI and from the rtable. As you see, nearly all of the correlations are negative. The sites all come from the Great Plains (virtually all of the sites can be matched here and are listed in Woodhouse and Brown (Tree Ring Research 2003), a discussion of drought). In Rob Wilson’s terms, these chronologies obviously do not meet any of his ex ante criteria for selecting a temperature proxy. Not that he’ll ever admit it. In Brown’s terms, the sites would ex ante be expected to have a dominant precipitation response, with a lesser negative temperature response (and the latter appears to hold here).

ID     Site                        End year   SI corr   rtable corr
ne008  Ash Canyon                  1997       NA        -0.22
co564  Black Forest East           1997       NA        -0.24
nm575  Capulin Volcano             1998       NA        -0.23
nm576  Capulin Volcano             1998       NA        -0.22
nm574  Cornay Ranch                1994       NA        -0.3
co582  Escalante Forks Update      1999       NA        -0.2
co568  Kim                         1998       NA        -0.2
co580  Land’s End                  2000       NA         0.06
co572  Lily Lake                   1998       NA        -0.16
co569  Mesa de Maya                1997       NA        -0.12
nm577  Mill Canyon                 1998       NA        -0.25
ne004  Niobrara Valley Preserve    1997       NA        -0.24
co579  Pumphouse                   1999       NA        -0.21
co581  Seedhouse                   2000       NA        -0.12
co570  Sheep Pen Canyon            1998       NA        -0.26
ne005  Snake River                 1998       0.17       0.17
co566  Valley View Ranch           1998       NA        -0.26

Peter Brown defended the temperature reconstructions by invoking the “law of large numbers” but that’s not what’s going on here. If dendros believe that Douglas firs (or whatever) in certain regions have a negative correlation to temperature (and support this) and then calculate large-scale averages (inverted), then that seems like a do-able procedure. But that’s not what Mann did.

Their data mining process is rather neatly illustrated by the one proxy that “passed”. The “passing” site – ne005, Snake River, Nebraska – is a PIPO site in the Great Plains at 42°42′N, 100°52′W and elevation 810 m. One of the 17 sites happened by sheer chance to have a positive correlation to temperature and it’s the one that’s selected. The “screening” procedure was described as follows:

Where the sign of the correlation could a priori be specified (positive for tree-ring data, ice-core oxygen isotopes, lake sediments, and historical documents, and negative for coral oxygen-isotope records), a one-sided significance criterion was used.

This policy was not an idle puff, as one can see evidence of its implementation in the above table: proxies with negative correlations are shown as NA in the SI correlations and excluded from the calculations. As I’ve observed elsewhere, they are not just excluded from the “passing” 484 but from the “full” network – which proves to be a network of only positively correlated sites. That’s not the law of large numbers at work.
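To make the methodological point concrete, here is a minimal sketch of a one-sided screen of this sort applied to the rtable correlations in the table above; the 0.10 significance level, the large-sample critical value and the assumed calibration length n are my own illustrative choices, not Mann’s exact parameters:

brown.corr = c(-0.22, -0.24, -0.23, -0.22, -0.30, -0.20, -0.20, 0.06, -0.16,
               -0.12, -0.25, -0.24, -0.21, -0.12, -0.26, 0.17, -0.26)
n = 144                           # assumed number of annual calibration values
crit = qnorm(0.90)/sqrt(n)        # approximate one-sided critical correlation at p < 0.10
sum(brown.corr > crit)            # 1: only Snake River "passes" a positive one-sided screen
sum(abs(brown.corr) > crit)       # 16 of 17 exceed the same threshold in absolute value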

Now Brown will perhaps say – well, this is only 17 of 1209 proxies. Maybe the handling of “my” proxies was wrong, but it doesn’t “matter” because there are so many proxies. I submit that it does matter – largely because the problem is methodological. While the Brown proxies may not “matter”, if an incorrect method is applied over and over, the mishandling of the Brown proxies is evidence of an incorrect method.

Will Brown and/or Woodhouse take any steps to correct the mishandling of their data? Let’s hope so, but I’d be astonished if they do.

Santer et al 2008

As a diversion from ploughing through Mann et al 2008, I took a look at Santer et al 2008 SI, a statistical analysis of tropospheric trends by 16 non-statisticians and, down the list, Doug Nychka, a statistician who, unfortunately, is no longer “independent”. It is the latest volley in a dispute between Santer and his posse on one hand and Douglass, Christy and a smaller posse on the other hand. The dispute makes the Mann proxy dispute seem fresh in comparison. Santer et al 2008 is a reply to Douglass et al 2007, discussed on several occasions here. Much of the argument re-hashes points made at realclimate last year by Gavin Schmidt, one of the Santer 17.

The statistical issues in dispute seem awfully elementary – whether an observed trend is consistent with models. In such circumstances, I would like to see the authors cite well-recognized statistical authorities from off the Island – preferably well-recognized statistical texts. In this respect, the paper, like so many climate science papers, was profoundly disappointing. No references to Draper and Smith or Rencher or any statistics textbook, or even to articles in a statistical journal. In their section 4 (setting out statistical procedures), they refer to prior articles by two of the coauthors, one in a climate journal and one in a prestigious but general science journal:

To examine H1, we apply a ‘paired trends’ test (Santer et al., 2000b; Lanzante, 2005)

Lanzante 2005 (A cautionary note on the use of error bars. Journal of Climate 18: 3699–3703) is a small article at an undergraduate level, arguing that visual comparison of confidence intervals can play tricks (and that this sort of error was prevalent in many then recent climate articles in IPCC TAR):

When the error bars for the different estimates do not overlap, it is presumed that the quantities differ in a statistically significant way.

Instead of this sort of comparison, Lanzante recommended the use of the t-test for a difference, as described in any first-year statistics course. Lanzante 2005 cited Schenker and Gentleman (Amer. Stat. 2001), an article in a statistics journal written at a “popular” level. One can easily see how the non-overlap standard criticized in Lanzante 2005 raises the cut-off point relative to a proper t-test, in the simple case where the two populations have the same standard deviation σ. If typical 2σ 95% confidence intervals are applied, then, for the two confidence intervals not to overlap, the two means have to be separated by 4σ, i.e.
(m_2-m_1)> 4\sigma

For a t-test on the difference in means, the standard is:
(m_2-m_1)> t \sqrt{\sigma^2 +\sigma^2} = 2.8 \sigma

Note that equality is the “worst” case. The value goes down to 2 as one s.d. becomes much smaller than the other – precisely because the hypotenuse of a right-angled triangle approaches the length of its longer leg as the other leg shrinks.
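A quick numerical check of the two cut-offs (a sketch with arbitrary standard deviations):

sigma1 = 1; sigma2 = 1                            # equal s.d. - the "worst" case
c(2*sigma1 + 2*sigma2,                            # separation needed for the 2-sigma CIs not to overlap: 4
  2*sqrt(sigma1^2 + sigma2^2))                    # separation needed for the difference t-test: about 2.8
sigma2 = 0.01                                     # one s.d. much smaller than the other
c(2*sigma1 + 2*sigma2, 2*sqrt(sigma1^2 + sigma2^2))   # both cut-offs approach 2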

While the authority of the point is not overwhelming, the point itself seems fair enough.

Santer et al 2000b is more frustrating in this context, as it is not even an article on statistics but a predecessor article in the long-standing brawl: “Santer BD, et al. 2000b. Interpreting differential temperature trends at the surface and in the lower troposphere. Science 287: 1227–1232.” They stated:

All three surface – 2LT trend differences are statistically significant (21), despite the large, overlapping 95% confidence intervals estimated for the individual IPCC and MSUd 2LT trends (Table 1) (22).

Reference 21 proved to be another Santer article also in a non-statistical journal and, at the time still unaccepted:

21. The method for assessing statistical significance of trends and trend differences is described by B. D. Santer et al. ( J. Geophys. Res., in press). It involves the standard parametric test of the null hypothesis of zero trend, modified to account for lag-1 autocorrelation of the regression residuals [see J. M. Mitchell Jr. et al., Climatic Change, World Meteorological Organization Tech. Note 79 ( World Meteorological Organization, Geneva, 1966)]. The adjustments for autocorrelation effects are made both in computation of the standard error and in indexing of the critical t value

Santer et al (JGR 2000) proved to have much in common with the present study. Both studies observe that the confidence intervals for a trend of a time series with autocorrelation are wider. I agree with this point. Indeed, it seems like the sort of point that Cohn, Lins and Koutsoyannis have pressed for a long time in connection with long-term persistence. However, Santer et al carefully avoid any mention of long-term persistence, limiting their consideration to AR1 noise (while noting that confidence intervals would be still wider with more complicated autocorrelation). Although the reference for the point is not authoritative, the point itself seems valid enough to me. My interest would be in crosschecking the standards enunciated here against IPCC AR4 trend confidence intervals, which I’ll look at some time.

Now for something interesting and puzzling. I think that reasonable people can agree that trend calculations with endpoints at the 1998 Super-Nino are inappropriate. Unfortunately, this sort of calculation crops up from time to time (not from me). A notorious example was, of course, Mann et al 1999, which breathlessly included the 1998 Super-Nino. But we see the opposite in some recent debate, where people use 1998 as a starting point and argue that there is no warming since 1998. (This is not a point that has been argued or countenanced here.) Tamino frequently fulminates against this particular argument and so it is fair to expect him to be equally vehement in rejecting 1998/1999 as an endpoint for trend analysis.

Now look at the Santer periods:

Since most of the 20CEN experiments end in 1999, our trend comparisons primarily cover the 252-month period from January 1979 to December 1999

Puh-leeze.

If the models weren’t run out to 2008, get some runs that were. If they want to stick to old models and the old models were not archived in running order, the trend in CO2 has continued and so why can’t the trend estimates be compared against actual results to 2008? Would that affect the findings?

It looks to me like they would. Let me show a few figures. Here’s a plot of a wide variety of tropical temperatures – MSU, RSS, CRU, GISS, NOAA, HadAT 850 hPa. In some cases, these were calculated from gridcell data (GISS); in other cases, I just used the source (e.g. MSU, RSS). All data were centered on 1979-1998 and MSU and RSS data were divided by 1.2 in this graphic (a factor that John Christy said to use for comparability to surface temperatures), but native MSU [,"Trpcs"] data is used in the CI calculations below. The 1998 Super-Nino is well known and sticks out.

I’ve done my own confidence interval calculations using profile likelihood methods. Santer et al 2000, 2008 do a sort of correction for AR1 autocorrelation that does not reflect modern statistical practice, but rather the Cochrane-Orcutt correction from about 50 years ago (Lucia has considered this recently).

Instead of using this rule of thumb, I’ve used the log-likelihood parameter generated in modern statistical packages (in the arima function for example) and calculated profile likelihoods along the lines of our calibration experiments in Brown and Sundberg style. I’m experimenting with the bbmle package in R and some of the results here were derived using the mle2 function (but I’ve ground truthed calculations using optim and optimize).
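A minimal sketch of the profile likelihood idea using base R only (not the exact mle2-based code behind the figures; it re-uses a monthly tropical series x and a year/10 time index as defined in the Table 1 replication code earlier in these posts, and the grid limits are arbitrary):

# profile log-likelihood of the trend: for each candidate slope b, fit an AR1 model to the detrended series
profile.loglik = function(x, year, slopes)
  sapply(slopes, function(b) arima(x - b*year, order = c(1, 0, 0))$loglik)
slopes = seq(-0.2, 0.6, by = 0.005)        # deg C/decade, since year has already been divided by 10
ll = profile.loglik(x, year, slopes)
# approximate 95% CI: slopes whose profile log-likelihood is within qchisq(0.95,1)/2 of the maximum
range(slopes[ll > max(ll) - qchisq(0.95, 1)/2])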

First let me show a diagram comparing log-likelihoods for three methods: OLS, AR1 and fracdiff. The horizontal red line shows the 95% CI interval for each method. As you can see, even for the UAH measurements over the 1979-1999 interval, the ensemble mean trend of the models is just within the 95% CI for the observed trend assuming AR1 residuals. If one adds an uncertainty interval for the ensemble (however that is calculated), this would create an expanded overlap. Fractional differencing expands the envelope a little, but not all that much in this period (it makes a big difference when applied to annual data over the past century). Expanding the CI is a bit of a two-edged sword, as a zero trend is also within the 95% interval, and even more comfortably so. So the expanded CI (barely) enables consistency with models, but also enables consistency with no trend whatever. I didn’t notice this point being made in either Santer publication.


Figure 1. Log Likelihood Diagram for OLS, AR1 and Fracdiff for 1979-1999 MSU Tropics. Dotted vertical red line shows the 0.28 trend of model ensemble. [Update – the multi-model mean is 0.215; the figure of 0.28 appears in Santer et al Figure 1, but is the ensemble mean for the MRI model only.]

On the face of it, I can’t see any reason why the model ensemble trend of 0.28 can’t be used in an update of the Santer et al 2008 calculation in a comparison against observations from the past decade. The relevant CO2 forcing trend has continued pretty much the same. Here’s the same diagram up to date, again showing the model ensemble trend of 0.28 deg /decade as a vertical dotted red line. In this case, the ensemble mean trend of 0.28 deg C/decade is well outside the 95% CI (AR1 case).

Now some sort of CI cone needs to be applied to the ensemble mean as well, but 47 cases appear to be sufficient to provide a fairly narrow CI. I realize that there has been back-and-forth about whether the CI should pertain to the ensemble mean or to the ensemble population. As a non-specialist in the specific matter at hand, I think that Douglass et al have made a plausible case for using the CI of the ensemble mean trend, rather than of the model population. Using a difference t-test (or likelihood equivalent) along the lines of Lanzante 2005 requires a bit more than non-overlapping CIs, but my sense is that the spread – using an ensemble mean CI – would prove wide enough to also pass a t-test. As to whether the s.d. of the ensemble mean or the s.d. of the population should be used – an argument raised by Gavin Schmidt – all I can say right now is that it’s really stupid for climate scientists to spend 10 years arguing this point over and over. Surely it’s time for Wegman or someone equally authoritative to weigh in on this very elementary point and put everyone out of their misery.


Figure 2. Same up to 2008. [See update note for Figure 1.]

Here’s the same diagram using RSS data. The discrepancy is reduced, but not eliminated. Again, analysis needs to be done on the model CIs, which I may re-visit on another occasion.


Figure 3. Same as Figure 1 using RSS data. [See update note for Figure 1.]

I think that these profile likelihood diagrams are a much more instructive way of thinking about trends than the approaches so far presented by either the Santer or Douglass posses. In my opinion, as noted above, an opinion on model-observation consistency from an independent statistician is long overdue. It’s too bad that climate scientists have paid such little heed to Wegman’s sensible recommendation.

Similar issues have been discussed at length in earlier threads, e.g. here.

The Silence of the Lambs

In March last year, I was intrigued by the following statement in the then-recent IPCC Summary for Policymakers:

“Studies since the TAR draw increased confidence from additional data showing coherent behaviour across multiple indicators in different parts of the world”

What exactly was the “additional data” since the TAR? At the time, AR4 had not been published. Even now, if you re-visit chapter 6, there is negligible support for this statement. Yes, AR4 lists many Team multiproxy studies, but, for the most part, these regurgitate pre-TAR data. Note that Mann et al 2008 will be of no help in this respect as it almost entirely relies on pre-TAR data as well.

Anyway, I was stimulated enough in March 2007 by this comment to re-canvass North American tree ring data in the ITRDB for new additions. I noticed a recent addition of spruce chronologies from northern Alberta and did a post on Mar 19, 2007 showing a quick average of these chronologies to see if they supported the IPCC claim. I observed that, if these were temperature proxies (carefully expressing the matter in conditional terms), they did not support the SPM claim.

This post prompted a number of posts from angry/frustrated dendros and a discussion of what to do about CA at the dendro listserv.

Rob Wilson, a welcome and occasional visitor here, sharply criticized me for simply grabbing sites from ITRDB without determining that they had been sampled as temperature proxies, pointing out that Meko, the author of the majority of these particular chronologies, was known primarily for moisture studies:

For readers of this blog, PLEASE understand that one cannot randomly sample trees from any location and expect there to be a valid climate signal (temperature or precipitation)….
Dendrochronology is NOT just a study of climate. Tree-rings are used to study fire history, geomorphological processes and ecological aspects of tree growth (e.g. stand dynamics, insect attacks, pollution effects etc). Therefore, one cannot assume that ALL tree-ring data in the “North American tree ring data base at WDCP” was sampled with a climate related question in mind. Therefore, a simple average of 14 sites, without knowing why they were sampled in the first place, is arguably a meaningless exercise.

Mike Pisaric, another dendro, joined in the criticism, stating that my analysis was “flawed!”, that ITRDB chronologies had been collected for a multitude of purposes and that you could not assume that they contained a climate signal.

As Rob points out in 3 and 28, Steve’s analysis is flawed!… The ITRDB contains tree ring chronologies that have been used for a multitude of purposes and not just paleoclimatology. So no, you would not expect all tree ring chronologies from the ITRDB to contain a valid climate signal.

The dendro critique spilled over to the dendro listserv (see CA post here), where the following question was asked:

So – should I (we) ignore this Blog? … Personally, I cannot do this. Although some of the criticisms and commentary are valid, some of it is simply wrong and misinformed, and in my mind, it is dangerous to let such things go. .. Overall, this is a matter of outreach. I believe that tree-rings are one of the most powerful palaeo proxies available. However, we cannot allow the discipline to be muddied by a few ‘loud’ individuals who’s motives may be suspect.

I responded to this criticism in a post, The Dendrochronologists are Angry, stating that I had no interest in spreading misinformation and would undertake to promptly correct any such misinformation, opening up a thread for dendros to respond without any obligation to deal with CA readers. This did not result in much of a response.

I challenged the Dendro Truth Squad to root out the use of precipitation proxies, highlighting Dulan junipers as an example – a chronology used (either directly or indirectly) in a number of multiproxy studies. One angry dendro replied via Rob Wilson (see here):

Those in the know, who really know the science, know not to use that chronology and know who still use that chronology. The work that uses that chronology for a temperature reconstruction is less-respected than others.

He accused me of ignoring “recent work that surpasses all others”. I expressed my desire to visit this particular shrine and requested the identity of this work that “surpasses” all others, but no one was able to identify it for me. Perhaps the angry dendro had a draft version of Mann et al 2008.

There were about 10 posts at the time both on this dispute and on the more general question of North American treeline proxies, listed in this category.

As so often, my principal disappointment and frustration was on the one hand at the lack of substance in the dendro interventions and, on the other hand, at their inconsistency.

The inconsistency was particularly frustrating. While I didn’t think that I particularly deserved the criticism leveled at me for merely calculating an average of the 14 PCGL chronologies, I would have readily accepted the criticism if it had led to the articulation of an industry-wide standard that dendros would apply to multiproxy studies. (In this light, I criticized dendros for feeling that it was important to speak out about an incidental post at CA, while remaining silent on MBH98-99 and similar studies.)

History repeats itself. If Peter Brown, Mike Pisaric, Rob Wilson and other dendros were outraged at my calculating the average of 14 Alberta PCGL chronologies, then how can they stand idly by while Mann et al 2008 essentially scavenges the ITRDB like a garbage picker? If they felt so strongly about my post last year, then they must be outraged about the total failure of Mann et al to assess “Location, location, location”.

Let’s go back to the questions asked at the dendro listserv and see which, if any, of the following applies to Mann et al:

So – should I (we) ignore this ? …

Although some of the criticisms and commentary are valid, some of it is simply wrong and misinformed, and in my mind, it is dangerous to let such things go…

I believe that tree-rings are one of the most powerful palaeo proxies available. However, we cannot allow the discipline to be muddied by a few ‘loud’ individuals who’s motives may be suspect.

What’s sauce for the goose is sauce for the gander.

Gavin's Boast

Over the past few years, I’ve tried to keep an eye on and review new millennium proxies, posting a number of reviews on high-resolution ocean sediments and new tree ring proxies.

I’ve reported on new tree ring data archived by Jacoby, Rob Wilson, David Meko, Connie Woodhouse and others, leading to some interesting interventions here by Rob Wilson and Mike Pisaric. I’ve kept an eye on ice core data, speleothems and lake sediments as well, periodically canvassing the WDCP site for new additions. So I think I can claim to be as up to date as anyone on new 1000-year proxies, and, by extension, Climate Audit readers are relatively well informed on this topic. There are some relevant new proxies, but not as many as one would think. Also, as Wilson and Pisaric were quick to point out, many new tree ring proxies were not collected as temperature proxies. (More on this matter on another occasion.)

So I was a bit startled when Gavin Schmidt boasted of an increase of 800 or so new proxies in Mann et al 2008 calling for “applause” for the paleoclimatological community:

The number of well-dated proxies used in the latest paper is significantly greater than what was available a decade ago: 1209 back to 1800; 460 back to 1600; 59 back to 1000 AD; 36 back to 500 AD and 19 back to 1 BC (all data and code is available here). This is compared with 400 or so in MBH99, of which only 14 went back to 1000 AD. The increase in data availability is a pretty remarkable testament to the increased attention that the paleo-community has started to pay to the recent past – in part, no doubt, because of the higher profile this kind of reconstruction has achieved. The individual data-gatherers involved should be applauded by all.

Where did these 800 or so proxies come from? How could I have missed the addition of 800 new series to the WDCP when I was keeping track of additions at least every quarter over the past three years?

Of the total increase (about 794, depending a little on how the MBH population is determined), as shown in the pie chart below, about 83% are tree ring proxies. The next largest increase comes from the Luterbacher series, replacing the long Jones-Bradley instrumental series – not really proxy series at all. The additions to ice core, coral, speleo and sediment series are all relevant and I’ll review them separately. Today I want to look at where the additions to the tree ring population came from.


Figure 1. Pie chart of proxy count increase by class.

The next pie chart subdivides the tree ring increase by continent, showing that over half of the total increase came from the North American tree ring network, one that we’ve kept a particularly close eye on for obvious reasons. The other interesting aspect of this graphic is that there was actually a slight (and surprising) decrease in the number of Asian tree ring sites. MBH contained 61 Vaganov series; none of these were retained. There were quite a few Briffa MXD grid sites added in Asia, which left matters close to being flat, but still a slight decrease. Curiously, no Schweingruber RW series – sites chosen to be temperature sensitive – were included. (These were the sites that originally gave rise to worries about divergence. I suppose the argument would be that these sites are covered by the MXD versions, but not all sites are so covered; and, for many sites, multiple indices are used, so there is no rule in Mann et al 2008 limiting a site to one series. I’ll do a separate post on this some time.)

As a bit of a change of pace from the North American network, let’s look at the additions to the Australia-New Zealand network. Here the number of series increased by 48 from 18 to 66. How had I missed this explosion of Australian dendro activity in the past 10 years? Well, I hadn’t. Every single “new” series in Mann et al 2008 ends in 1993 or earlier. Some of the “new” Australian sites (ausl005 ausl010 ausl012 ausl013 ausl019) were collected by Lamarche (of bristlecone fame) back in the 1970s. I then checked at the ITRDB version control to see whether the “new” Mann et al 2008 series had perhaps been archived after 1998, which wouldn’t justify Gavin’s boast but might at least explain the addition. Nope. All the Lamarche series were in the original archive and would have been available to MBH. So series that did not qualify for MBH for some reason or other were permitted in to Mann et al 2008, grossing up the count.

While Mann et al 2008 report a tree ring selection criterion (“series must cover at least the interval 1750 to 1970”), they did not include a material change report discussing any changes in selection criteria from MBH98, the reasons for such changes and their impact. Mann et al 2000 say of the MBH98 selections:

The first year of the chronology was before AD 1626, and it contained at least 8 segments by 1680;

It appears that this criterion has been relaxed somewhat in Mann et al 2008, and this has opened the door for many series that were not included in MBH98 (and which are not directly relevant to medieval-modern comparisons).
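As a rough way of seeing how much the relaxed criterion matters, both rules could be applied to the same metadata. A hedged sketch, using the same hypothetical “details” table, with “segments1680” standing in for whatever segment count underlies the MBH98 rule – this is my reading of the two criteria, not code from either paper:

pass.mann08 <- with(details, start <= 1750 & end >= 1970)       # Mann et al 2008: must cover 1750-1970
pass.mbh98  <- with(details, start < 1626 & segments1680 >= 8)  # MBH98 rule as described in Mann et al 2000
sum(pass.mann08 & !pass.mbh98)                                  # series admitted only under the relaxed rule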

We get similar patterns in the other networks. Europe increased from 7 to 90 series, but only 4 series are 1995 or later: a 1995 series from Rob Wilson (germ040), the only Rob Wilson series used in the analysis; a 1996 Kirchhefer series; and 3 Kuniholm series from the Aegean (not obvious temperature proxies).

South America increased from 13 to 78 sites, but all are 1995 or earlier and all but 4 are 1991 or earlier. Again, the addition of these sites appears to result from changed inclusion criteria.

Now for North America. MBH98 used 281 series: 11 Jacoby sites and mexi001 directly; 3 Stahle precipitation reconstructions; and 219 series used (232 reported to have been used) in 3 PC networks (North American – 232 reported, 219 used; Stahle SWM – 24; Stahle TXOK – 16). Mann et al 2008 uses 694 North American series, an increase of 413 series. The majority of the additional series end before 1985, many in the 1970s, and were in existence long before MBH98 was published. Indeed, only 15% of the new North American tree ring series extend after 1995. Of the new North American series, most appear to be located below altitudinal treeline and are at best precipitation proxies.
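The same metadata makes it easy to check the North American arithmetic and the end-year pattern; a minimal sketch under the same assumptions as above (placeholder object and column names):

nam.new <- subset(details, region == "North America" & !(id %in% mbh.id))
nrow(nam.new)                              # on the order of the 413 additional series
round(100 * mean(nam.new$end > 1995))      # roughly 15% extend past 1995
mean(nam.new$end < 1985)                   # the majority end before 1985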

There are many puzzling omissions: important data from Alaska and Canada archived by Jacoby and d’Arrigo and by Rob Wilson and Greg Wiles is inexplicably not used, while Mann et al 2008 added series after series that are at best precipitation proxies. In part, the failure to include the Jacoby, Wilson, Wiles etc. data is because of a very stale cutoff date: Mann et al 2008 say that they used the ITRDB data set as of 2003, which left these series out.

But why have a 2003 cutoff for a study published in Sept 2008 (with calculations as of Dec 2007)? It’s not that hard to keep up to date – I’ve regularly collated ITRDB additions into my tree ring data collation and I don’t have hundreds of thousands of dollars of NSF grants. Mann et al have been lavishly funded by NSF to do exactly the sort of collation that they failed to do. Why didn’t they? It’s not even a matter of heavy equipment or Starbucks availability. This is just computer work. So why not use at least a 2007 ITRDB version? Baffling.

In any event, readers (and Gavin Schmidt) should clearly understand this: only a negligible number of new proxies were collected after MBH98 (about 32, mostly U.S. tree ring). So the increase to 1209 proxies in Mann et al 2008 is not a “remarkable testament” to recent paleoclimate work – it has virtually nothing to do with it, even to the extent of ignoring most primary work that has actually been done in the past decade.

I don’t mind acknowledging primary data collectors. Indeed, I’ve generally been positive in such reviews (as opposed to reviews of the narrow Team) and, despite what my critics may think, this has not gone entirely unnoticed among primary collectors. At the 2007 AGU conference, as I mentioned previously, several eminent oceanographers complimented me on my reviews of ocean sediment literature. My point here is not to disparage primary collectors, but merely to observe that the Mann increase has virtually nothing to do with recent work by primary collectors and has a great deal to do with casting a wider net over old collections, even collections made in the 1970s.

Gavin’s boast is an empty one.

When the Team is on the Move

When the Team is on the move, they can sometimes move with surprising speed, as you’ll see in today’s story.

Over the last month, we’ve seen multiple changes to the Mann SI, at first without any notice. More recently, they’ve started to note the existence of changes, though the changes themselves are typically not reported (in the sense of a list of before and after values) and the old files have been deleted when the change is made, even if the incorrect values were actually used in Mann et al 2008 (this happened with the lat-long reversal of the rain in Spain). Sometimes, the change notices themselves have not proved permanent: for example, the notice of the correction of the Schweingruber MXD locations, once on the website, has itself now been deleted.

Now that we have a little more experience with changes to Mann’s SI, I’d like to revisit the first such incident, one where I got wrongfooted and where both Gavin Schmidt and a couple of CA readers criticized me as a result. There’s an interesting backstory on exactly how I got wrongfooted, one which may appeal to people interested in dating problems and which arguably shows a surprising degree of coordination by the Team when on the move.

The dispute arose over my comments on Sep 4 and 5 about an inline response by Gavin Schmidt to RC comment #23 (the comment is timestamped Sep 4 9:18 am EDT, but the inline response does not have a timestamp), in which Gavin stated:

The raw data (before any infilling) is also available on the SI site, and so you can look for yourself.

As it happens, on Sep 4, between 11:33 and 11:35 am EDT, over 2 hours after the above comment was submitted, I visited the relevant directory at Mann’s website and did not observe the existence of non-infilled data at that time. Strangely enough, I have timestamped files on my computer which enable me to establish the exact time of this visit. Here is a screenshot from my computer showing timestamps of 3 downloads from Mann’s website (rtable1209, rtable1209late and itrdbmatrix), timestamped locally on my computer between 11:33 and 11:35 am EDT.

The screenshot of my directory also shows a subdirectory “proxy” timestamped Sep 5 9:11 am, which is when I downloaded the Sep 4 version of the “original” data. This directory now has a later timestamp on Mann’s website because Mann deleted the Sep 4 “original” data on the afternoon of Sep 5, inserting a new version of “original” data.

So when I read Gavin’s inline response to Comment #23 (some time on Sep 4), my assumption – and given my late morning inspection of Mann’s website, hardly an unreasonable assumption – was that he’d been wrongfooted by Mann and had simply been in error in his claim that non-infilled data was available at the SI. I had checked the SI late in the morning of Sep 4 and it wasn’t there. I certainly didn’t think that he was intentionally misrepresenting the situation; I simply assumed that he’d got the wrong impression – that sometimes happens.

By early the next morning (Sep 5), a reader had observed at CA that Mann had altered his SI, providing a link to a new directory which proved to contain non-infilled data. Although the timestamp was overwritten when, later on Sep 5, Mann deleted the Sep 4 “original” data and replaced it with new “original” data, a CA reader had recorded the Sep 4 timestamp as 15:42, which I adopt here for my timeline.

Gavin subsequently asserted on Sep 5 that (1) he had verified that the non-infilled data was online prior to making his inline response to the 9:18 am comment; (2) he made his inline response to the 9:18 am comment when he approved the comment; and (3) the approval and inline comment came some time after 9:18 am EDT. The inline comment came some finite time before 12:14 pm, because a comment on the RC thread at 12:14 pm refers to comment #25 (made after #23). Therefore, comment #23 had to have been approved and online long enough for the RC reader to have read the various new comments on the thread and composed a reply. The 12:14 pm comment is short and wouldn’t take much time to compose, so approval needn’t have been much before 12:14 pm.

There’s one more part of the puzzle – server time. When was 15:42? If it was EDT (the RC server time), this would be 3:42 pm, which would have made it impossible for verification to have taken place prior to 12:14 pm. However, one CA reader hypothesized that the Penn State server time was UTC, and another CA reader was able to prove this. UTC is 4 hours ahead of EDT, so 15:42 UTC converts to 11:42 am EDT.
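The conversion is easy to confirm in R (Sep 4, 2008 falls within daylight saving, so Eastern time is EDT, which is UTC minus 4 hours):

t.utc <- as.POSIXct("2008-09-04 15:42", tz = "UTC")
format(t.utc, tz = "America/New_York", usetz = TRUE)   # "2008-09-04 11:42:00 EDT"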

Mann altered his data 7 minutes after I visited his site.

Perhaps critics will have a little more sympathy for my being wrongfooted on this. Surely even someone as diligent as me can’t check Mann’s website every 10 minutes to see if he’s changed it. And look at the near-military precision of the Team on the move.

 11:35 am  SM visits Mann website. Non-infilled data not there.
 11:42 am  Mann inserts non-infilled data.
 ~12:00    High Noon: Frank Miller arrives on train, meets gang; gunfight breaks out between Gary Cooper and Miller gang. Grace Kelly returns to assist Gary Cooper. Gary and Grace defeat all 4 members of the Frank Miller gang.
 12:03 pm  Schmidt verifies existence of non-infilled data at Mann’s website.
 12:05 pm  Schmidt approves comment #23, inserting inline response.
 12:14 pm  RC comment mentioning comment #25.

While the above timeline is pretty improbable, it does seem to be what happened. Based on this review, I’ve added inline comments to any of my earlier comments that now appear to have been incorrect.

In this case, I think that one can also conclude that Mann had emailed Schmidt, notifying him of the alterations to the SI and directing him to the new location, and that Schmidt had information not available to the public when he visited Mann’s website. (It’s possible that it’s one more bizarre coincidence, but c’mon.) If Schmidt had received information about alterations to Mann’s website that was not publicly announced, then, in my opinion, he should have provided a forthright change notice in his inline response, agreeing that, yes, Mann’s original SI failed to include “original” data, but reporting that Mann had amended his SI on Sep 4 (referring to the date of the amendment) so that there was no uncertainty. Schmidt’s inline response falls well short of the form of notice that is appropriate for someone in possession of information not available to the public. (If Schmidt was not in possession of such information and this was one more coincidence, then this criticism would not apply.) Provision of a proper change notice, with an explicit reference to the date of the change, would have avoided most, if not all, of the subsequent misunderstandings.

At the time, Gavin also made an accusation that appears to be directed at me, one that I wish to deny.

An RC reader made the following polite request to Gavin Schmidt:

Gavin, I realize it’s not your responsibility to patrol the skeptic hordes, but could you offer a quick summary of how the data set has been updated and where these changes are recorded?… I think (hope?) that McIntyre would happily “move on” and apologize after a clear statement that you were acting in good faith.

Instead of providing a clear timeline (which might have helped), Gavin petulantly stated:

[Response: What is the point? The presumption will be that I’ve just made something up and even if I didn’t, I’m a bad person in any case. I have no interest in communicating with people whose first and only instinct is to impugn my motives and honesty the minute they can’t work something out (and this goes back a long way). Well, tough. You guys worked it out already, and I have absolutely nothing to add. If McIntyre was half the gentleman he claimed to be, we’d all be twice as happy. – gavin]

I presume that I’m supposed to be one of the people whose “first and only instinct is to impugn my motives and honesty the minute they can’t work something out (and this goes back a long way)”. I deny this allegation on a number of counts.

First, I don’t have a track record “going back a long way” of repeated incidents where my “first and only instinct is to impugn [Schmidt’s] motives and honesty the minute [I] can’t work something out”. On the contrary, I think that any CA reader has to concede that I’m very patient in working out paleoclimate studies. On an earlier occasion, I asked a reader critical of me to name one such incident and received no reply.

I also deny that “impugning” people’s motives and honesty is my “first and only instinct” when I can’t work something out. It is neither my “first instinct” nor my “only instinct”. My general practice is exactly the opposite – to refrain from speculating on authors’ “motives”. It’s a policy that I try to maintain at the blog, though not all readers observe the policy. I challenge Schmidt or anyone else to support this allegation.

This is not the first such incident involving Schmidt. As an IPCC Reviewer (SOD Review Comment 6-760), Schmidt impugned the integrity of Ross and myself, falsely accusing us of “deliberate obsfucations [sic]” as follows:

M&M2003 is a non peer reviewed publication, and as such should not be referenced here. The points raised are almost invaraibly due to misunderstandings, errors in the archived data set (subsequently corrected at Nature) and deliberate obsfucations. [IPCC SOD Review Comment 6-740]

Schmidt has no basis whatever for accusing us to IPCC of “deliberate obfuscations”. That is an untrue and defamatory allegation.

On an earlier occasion, Mann had made a similar defamatory allegation to Natuurwetenschap & Techniek (and you’ll be amused at the question that occasioned this answer):

This claim by MM is just another in  a series of disingenuous (off the record: plainly dishonest) allegations by them about our work.

Again, this is an untrue and defamatory allegation.

I must say that I’m surprised at the recklessness of Mann and Schmidt, both as individuals and as employees representing presumably responsible organizations, in making such comments.