Many of the problems in multiproxy studies have very strong parallels in econometrics. It’s econometricians – who know all about autocorrelated series and data mining – rather than statisticians who really should be involved in climate statistics.
Here are some quotes from Clinton Greene, from an article about data mining in a 2000 issue of the Journal of Economic Methodology devoted to the topic.
One of Greene’s resonant conclusions is that sometimes analysts simply have to wait for more data to provide a true out-of-sample test. He points out that repetitive use of the same data sets invalidates statistical tests, and gives the example of an econometrician who develops his theory from data up to 1980 and then waits until 2010 to test it. There’s much to be said for this in the present quandary of proxy studies – if Gaspé cedars, for example, were a "key proxy" up to 1980, then there’s a very easy way to see whether they are a proxy for temperature – bring the damn series up to date.
Here are some comments from Greene; then I’ll mention a comment by Bradley below.
From this perspective data-mining refers to invalid statistical testing as a result of naive over-use of a sample. In particular, the use of a sample both for learning-inspiration and for testing of that which was learned or mined from the sample. Any test of a theory or model is corrupted if the test is conducted using data which overlaps that of any previous empirical study used to suggest that theory or model. The moral is clear. Scientific and reliable knowledge requires experiments. And good econometrics free of distorted statistic distributions requires repeated investigation of data which is at least new, even if not experimentally controlled. Given the difficulty in economics of testing in an unexamined data set it would then seem that the corollary to “avoid data-mining” is “stick to pure theory”, a position taken by many.
But testing in un-mined data sets is a difficult standard to meet only to the extent one is impatient. There is a simple and honest way to avoid invalid testing. To be specific, suppose in 1980 one surveys the literature on money demand and decides the models could be improved.
File the proposed improvement away until 2010 and test the new model over data with a starting date of 1981. In this simple approach the development and testing of theory would be constrained to move at a pace set by the rate at which new data becomes available (at which experiments are conducted). That the development of science must be constrained by the pace of new experimentation seems obvious enough. The mistake is to suppose every new regression is a new experiment. Only new data represents a new experiment. I do not consider this a pessimistic outlook. This is because I think much can be learned from exploring a sample. Patience and slow methodical progress are virtuous. And the impatient can conduct forecast-based stability and encompassing tests with only a few years of new data. But seeing economists behave as though they do not believe in the central role of constraints and of inputs in the production of reliable knowledge, certainly is grounds for pessimism….
Statistics from data-mined specifications provide informal but valuable evidence or suggestions. The claim that one specification is less data-mined than another is not sufficient to justify formal interpretation of regression statistics as in classical statistics. All are guilty and a measure of explicit data-mining does not discriminate between useful and un-useful work. In-sample “tests” are useful as design criteria but only out-of-sample tests are precisely meaningful applications of statistics. Without out-of-sample testing there is no distinction between running regressions and constructing historical (ex post) narratives….
But the most important fear of data-mining stems from legitimate doubts about the validity of most testing in over-worked time series data and from the false hope that if our own contribution to data-mining is small then our own research will somehow circumvent the problem of pre-test distortion of standard statistic distributions. But avoidance of explicit specification search will not cure our discomfort with sloppy in-sample “tests” nor will it insulate us from dishonest or deluded results. Specification search is the only way to learn from the data. Out-of-sample testing of data-mined specifications is the only way to conduct statistically valid tests and create reliable knowledge. All else represents a wish for scientific validity unconstrained by scientific inputs, technology, method and sensibility.
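Greene’s point about pre-test distortion is easy to demonstrate. Here is a minimal Python sketch (my illustration, not Greene’s) of what happens when you “mine” a pile of autocorrelated series for the best in-sample fit to an autocorrelated target: the winning candidate looks impressive over the calibration period and has no skill at all out of sample. All series here are synthetic noise by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1(n, phi=0.7):
    """Generate an AR(1) series: x[t] = phi * x[t-1] + noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

n_total, n_cal = 120, 80                 # e.g. a long record with a pre-1981 calibration window
temperature = ar1(n_total)               # the "target" series (pure noise)
candidates = [ar1(n_total) for _ in range(100)]  # 100 unrelated "proxy" candidates (pure noise)

# "Mine" the sample: pick the candidate with the best in-sample correlation.
in_corrs = [np.corrcoef(c[:n_cal], temperature[:n_cal])[0, 1] for c in candidates]
best = int(np.argmax(np.abs(in_corrs)))

# Score that same winner over the held-out period it never saw.
out_corr = np.corrcoef(candidates[best][n_cal:], temperature[n_cal:])[0, 1]

print(f"best in-sample correlation : {in_corrs[best]:+.2f}")  # looks "significant"
print(f"same series, out of sample : {out_corr:+.2f}")        # near zero: no skill
```

With 100 pure-noise candidates, the best in-sample correlation typically comes out well above 0.4 in absolute value, while the out-of-sample correlation for the same series hovers around zero. That gap is the pre-test distortion Greene is talking about, and it is why a verification period held out from all specification search is the only honest test.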
The Bradley comment that I’m thinking about is one at the UCAR press conference in 2005 to promote the hockey stick. Someone asked him what the proxies showed after 1980. Bradley said that they continue to function just fine after 1980, citing Oerlemans in a just-published Science paper. Now I think that glaciers going up and down are probably a pretty decent proxy for temperature, but Oerlemans is a bait-and-switch. Glacier movements were not a proxy in ANY of the multiproxy studies, and the Oerlemans series only goes back through the Little Ice Age. The question is whether bristlecones or tree ring densities function as temperature proxies after 1980. Most of the evidence is that they don’t – the Divergence Problem.
If Greene or some other econometrician were asked to consider the Divergence Problem, there would be little doubt as to the conclusion – the model of a linear relationship between ring width and temperature extending into warmer temperature ranges was invalidated. End of story. If someone wanted to propose a new model, it would have to be tested in its turn, but the last one failed. In D’Arrigo et al 2006, Rob Wilson’s good influence at least ensured that they admitted the post-1985 failure, but they astonishingly went on to compare MWP and modern levels after admitting that failure.
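For concreteness, here is the kind of out-of-sample check Greene’s standard would call for, sketched in Python under stated assumptions: `verify_linear_calibration` is a hypothetical helper of mine, and `years`, `ring_width` and `temperature` stand in for real aligned annual series from the relevant proxy and instrumental records. It fits the linear ring-width/temperature model on the calibration period only, then scores it over the post-1985 verification period with the standard reduction-of-error (RE) statistic.

```python
import numpy as np

def verify_linear_calibration(years, ring_width, temperature, split_year=1985):
    """Fit ring_width = a * temperature + b on years <= split_year,
    then score the fit on the later verification period.
    All three inputs are assumed to be aligned 1-D numpy arrays."""
    cal = years <= split_year
    ver = ~cal

    # Calibrate on the early period only.
    a, b = np.polyfit(temperature[cal], ring_width[cal], 1)

    # Verification skill: predicted vs observed ring widths, benchmarked
    # against a no-skill prediction (the calibration-period mean).
    pred = a * temperature[ver] + b
    sse = np.sum((ring_width[ver] - pred) ** 2)
    sse_clim = np.sum((ring_width[ver] - ring_width[cal].mean()) ** 2)
    reduction_of_error = 1.0 - sse / sse_clim  # RE < 0 means the model failed

    return a, b, reduction_of_error
```

An RE below zero means the calibration-period line predicts post-1985 ring widths worse than simply guessing the calibration mean – exactly the failure that the Divergence Problem documents.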
Clinton A. Greene, 2000, “I am not, nor have I ever been a member of a data-mining discipline”, Journal of Economic Methodology 7:2, 217–230.