There has been a great deal of discussion on a recent CA thread on the efficacy of screening proxies for use in reconstructions by selecting on the size of the correlation between the proxy and the temperature during the calibration time period. During the discussion I asked Nick Stokes the following questions in a comment:

Do you think that it would be appropriate to use correlation to screen tree rings in a particular site or region when doing a temperature reconstruction for that site? Would you not be concerned that the process could bias the result even if the trees did contain actual information about the temperature?

His reply was

Roman, let me counter with a question,

“bias the result”

Bias from what? You’re using the language of population statistics. But what is the population? And why do you want to know its statistics? What is it biased from?

But to answer yours, yes I do. The proxy result is going to track instrumental in the training period. That’s intended, and means it isn’t independent information. But how does it bias what you want, which is the pre-training signal?

## The Example

In order to examine whether Nick is correct or not, I thought it might be a nice idea to get some data for which we could compare the effects of screening to those of using all of the proxies. However, it should be clear that no *real* proxy data exists for doing this, so I decided we could use pseudo-proxies generated for exactly the purpose of testing the paleo reconstruction methodology. A paper (*DISCUSSION OF: A STATISTICAL ANALYSIS OF MULTIPLE TEMPERATURE PROXIES: ARE RECONSTRUCTIONS OF SURFACE TEMPERATURES OVER THE LAST 1000 YEARS RELIABLE? by Gavin A. Schmidt, Michael E. Mann and Scott D. Rutherford*) published in the Annals of Applied Statistics to rebut criticism of their previous work by McShane et al (discussed on CA starting here) used pseudo-proxies generated from modeled temperature series with various forms of noise. The data from that paper is available in a 26 MB zipped supplement.

For my example, I chose to use a set of 59 proxies based on the NCAR CSM model temperatures which had generated “temperature” values from 850 to the year 1980. The “noise” is apparently autoregressive, but I don’t know the specific details of the parameters. Other sets of proxies along with another model (GKSS) temperature series are available in the supplement and people should be able to use the function in the R script to carry out further tests on the other pseudo-proxies or on their own generated proxies should they wish to do so.

For a reconstruction method, I wrote a simple script to perform CPS as described in the 2008 paper, *Proxy-based reconstructions of hemispheric and global surface temperature variations over the past two millennia (Michael E. Mann, et al)*. Two methods are available for the calibration step – simple scaling and regression. Since this post is not intended to evaluate which reconstructions are better than others, I thought CPS would suffice.

First, the CSM series looks like this:

and the first 10 pseudo-proxies:

The calibration period was chosen (for no particular reason) to be 1901 to 1980. I ran the comparison with a cutoff correlation of .12, mainly to ensure that there were sufficient “screened” proxies to create a reasonable reconstruction. The actual critical value used is not germane to the discussion below – the character of the impact of the screening would be the same with only the magnitude of the effect increasing as the critical value increased.

The reconstruction results for CPS with simple scaling:

The reconstruction results for CPS with regression:

The graphs clearly show that all of the reconstructions are biased, but as I mentioned before, that is not the point of the post. If one looks at mean squared errors of the reconstructions, the screened proxies have done a poorer job:

MSE(All, Scale) = 0.07511459

MSE(Screen, Scale) = 0.15536520

MSE(All, Regression) = 0.1329907

MSE(Screen, Regression) = 0.1855703

Why is this the case? Well, one reason might lie in a plot of the correlation of the proxies with temperature in the Calibration time period and in the pre-calibration era:

It is quite obvious from the plot that a higher correlation in the calibration period is no guarantee of a similarly high correlation prior to calibration. To categorically state that it is always a good idea to select *only* proxies which have a correlation higher than some cutoff value is clearly unwarranted in this case. However, to understand *why*, we need to do some mathematics.
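The same behaviour can be sketched with synthetic data, without downloading the supplement. The following is only an illustration under assumed values: the series length (850 to 1980), the signal strength of 0.25 and the AR(1) coefficients are made up for the example and are not the parameters of the Schmidt et al pseudo-proxies.

```r
# Synthetic illustration only: all parameter values below are assumptions
set.seed(6)
n.years <- 1131                           # 850 to 1980
n.prox  <- 59
cal     <- (n.years - 79):n.years         # last 80 "years" as calibration

temp <- as.numeric(arima.sim(list(ar = 0.7), n = n.years))
prox <- sapply(seq_len(n.prox), function(k)
  0.25 * temp + as.numeric(arima.sim(list(ar = 0.3), n = n.years)))

# correlation of each proxy with temperature in the two periods
cor.cal <- c(cor(prox[cal, ], temp[cal]))
cor.pre <- c(cor(prox[-cal, ], temp[-cal]))

# "screen" on the calibration correlation (top half, an arbitrary rule)
keep <- cor.cal > median(cor.cal)
c(cal.kept = mean(cor.cal[keep]), pre.kept = mean(cor.pre[keep]))
plot(cor.cal, cor.pre, pch = 20, xlab = "Calibration", ylab = "PreCalibration")
```

Every proxy here has the same underlying relationship with temperature, so any spread in the calibration correlations comes from the noise. One would expect the screened proxies to look better in the calibration period than they do before it.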

## The Mathematics

Suppose that we have a homogeneous set of proxies that respond linearly to the same temperature series. We can describe the average relationship with a statistical model:

**Y_{k}(t) = A + B T(t) + e(t)**

where Y_{k}(t) = the value of the kth proxy at time t, T(t) = the temperature at time t, e(t) = random noise at time t, and A and B are constants which relate the numerical effect of the temperature to the proxy. This model might be a simple description of (age adjusted) tree ring sequences from a collection of trees from a single species at a given site or in a small region.

The model would be set up this way (rather than with the roles of Y and T reversed) because it more naturally reflects the reality that the tree rings are reacting to temperature, not the other way around. Without loss of generality, the error (noise) e(t)’s are assumed to average out to zero to make the definition of the constant A unique. The e(t)’s are also assumed to be independent of the temperature sequence so that the relationship between the proxies and the temperature is encapsulated by the linear equation A + B T(t).

Now, some simple mathematics can show that B (which is the *slope* of the linear relationship) can be expressed as

**B = Correlation(T, Y_{k}) * sqrt(Variance(Y_{k}(t)) / Variance(T(t)))**

This implies two things. First, the slope B is a simple multiple of the correlation. In fact, if the temperature and the proxies are standardized (i.e. rescaled to all have mean zero and standard deviation 1), B is exactly equal to the correlation. The second is that, since B is the same for all of the proxies, the correlation with temperature (and calculated slopes) for our model should be exactly the same for all proxies.
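The slope identity is easy to check numerically. The sketch below uses made-up values (A = 2, B = 0.5, white noise), which are assumptions for the illustration and not taken from the pseudo-proxy data.

```r
# Synthetic check of slope = correlation * sd(proxy) / sd(temperature)
set.seed(1)
n     <- 200
temp  <- rnorm(n)                      # stand-in "temperature"
proxy <- 2 + 0.5 * temp + rnorm(n)     # assumed A = 2, B = 0.5

slope.lm  <- unname(coef(lm(proxy ~ temp))[2])
slope.cor <- cor(temp, proxy) * sd(proxy) / sd(temp)
all.equal(slope.lm, slope.cor)

# after standardizing both series, the slope is the correlation itself
z <- function(x) (x - mean(x)) / sd(x)
slope.std <- unname(coef(lm(z(proxy) ~ z(temp)))[2])
all.equal(slope.std, cor(temp, proxy))
```

The identity holds for the sample quantities as well as the population ones, which is why screening on the sample correlation amounts to screening on the estimated slope.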

Despite this fact, in practice, taking a sample of proxies from this model would produce a variety of results for the different proxies. Why? Because some of these proxies are better than others … or is it because in this case *all* of the variation is due to the noise component of the proxy? It is easy to show that the calculated least squares slope estimate for proxy k can be written as **B_{k} = B + B_{e,k}** where B_{e,k} is the slope of the noise regressed on the temperatures.
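The decomposition of the slope estimate can be confirmed on a single simulated proxy. In this sketch the true slope B = 0.5 and the white noise are assumed values chosen only for illustration.

```r
# Synthetic check that the estimated slope is B plus the noise-on-temperature slope
set.seed(2)
n     <- 150
B     <- 0.5                           # assumed true slope
temp  <- rnorm(n)
noise <- rnorm(n)
proxy <- 1 + B * temp + noise

b.k  <- unname(coef(lm(proxy ~ temp))[2])   # estimated slope for this proxy
b.ek <- unname(coef(lm(noise ~ temp))[2])   # slope of noise regressed on temperature
all.equal(b.k, B + b.ek)
```

In real data the noise is not observable, so B_{e,k} cannot be computed separately; the simulation only shows where the proxy-to-proxy variation in slope comes from.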

It is also important to note that the coefficient B has a physical interpretation. It represents the change in the proxy due to a unit change in the temperature. Thus, in order to get a valid reconstruction, it is necessary to be able to properly estimate the value of B. Suppose that we form the average of all the slope estimates. Then, it is easy to see that:

**Ave(B_{k}) = B + Ave(B_{e,k})**

so that as we use more and more proxies, due to the independence of the noise and the temperature, Ave(B_{e,k} ) becomes consistently closer to zero and our estimate converges to the correct value B to be used in the calibration.

However, when the proxies are screened using the correlation coefficient, this is not the case. The screening process removes all of the proxies whose correlation with temperature is below a specific threshold. From our observation above that the slope is a multiple of that correlation, this is tantamount to removing all of the proxies for which B_{e,k} is below a particular level. Now, Ave(B_{e,k} ) will converge to a positive non-zero value and our estimate of B will be a biased overestimate of the correct value. This statistical problem is not easily correctable.
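A small simulation makes the contrast concrete. It is a sketch under assumed values (true slope 0.3, white noise, 80 calibration years, a cutoff of 0.3) and uses far more proxies than any real study so that the averages are stable.

```r
# Synthetic illustration: average slope with and without correlation screening
set.seed(3)
n.prox <- 2000                      # many proxies, so the averages are stable
n.cal  <- 80                        # length of the calibration period
B      <- 0.3                       # assumed true slope
temp   <- rnorm(n.cal)
noise  <- matrix(rnorm(n.cal * n.prox), n.cal, n.prox)
prox   <- B * temp + noise          # temp is recycled down each column

slopes <- c(cov(prox, temp)) / var(temp)   # least squares slope for each proxy
cors   <- c(cor(prox, temp))
keep   <- cors > 0.3                       # arbitrary screening cutoff

c(true = B, all = mean(slopes), screened = mean(slopes[keep]))
```

The average over all proxies should sit close to B, while the average over the screened subset is pushed upward. Adding more proxies does not shrink that gap, because the screening removes the low values of B_{e,k} every time.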

So what is the resulting effect on a reconstruction? In order to estimate the temperatures in the past, we need to invert the relationship. If B is known, we can easily remove A from the equation by subtracting the means from both the temperature and the proxy series. Thus we get **Y_{k}(t) = B T(t) + e(t)** or **T(t) = (1/B) Y_{k}(t) – (1/B) e(t)**. If B is overestimated, then 1/B is too small. Thus the reconstructed temperature series is flattened toward the mean of the temperatures in the calibration period.

In a model where the proxies are theoretically not identical, the effect will still be the same even though the magnitude of the bias may vary. Choosing proxies with higher correlation will invariably increase the proportion of proxies with higher spurious noise-temperature correlation and reduce the opposite. This will contribute toward producing a flatter, straighter shaft as in the example.
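The flattening can be seen by inverting a noise-free proxy signal with the wrong slope. In the sketch below the temperature series, the true slope (0.5) and the inflated estimate (0.8) are all hypothetical values; the noise is left out to isolate the effect of the slope alone.

```r
# Synthetic illustration: an overestimated slope shrinks the reconstruction
set.seed(4)
n      <- 500
B      <- 0.5                                   # assumed true slope
B.over <- 0.8                                   # assumed overestimate from screening
temp   <- as.numeric(arima.sim(list(ar = 0.9), n = n))
temp   <- temp - mean(temp)                     # centred, so A drops out
signal <- B * temp                              # proxy signal without noise

recon.true <- signal / B                        # recovers temp exactly
recon.over <- signal / B.over                   # same shape, smaller amplitude

sd(recon.over) / sd(temp)                       # equals B / B.over
ts.plot(cbind(temp, recon.over), col = 1:2)
```

The amplitude of the reconstruction is reduced by the factor B / B.over, so every excursion from the mean is understated by the same proportion.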

**UPDATE – June 18, 2012**

Maybe an analogy might give some insight.

Suppose that you want to estimate the mean weight of a population of fish in a body of water. We will use nets to catch fish and weigh them. To avoid dealing with a specious argument from NS (wink,wink), we will assume that each fish is equally likely to enter a net and the weight of a fish is unrelated to where the fish is found. The only problem is that the holes in the net are quite large and all of the fish below a critical size escape from the net and are not caught. Our sample is the set of fish that do not escape.

*Is the sample of fish reasonably representative of the entire population?* Obviously not because we have no information on the distribution of the fish below the critical “escape” value.

*Is the average of the weights of fish in our sample a reasonable estimate of the mean of the entire population?* Under the assumptions we have made, the average weight of all of the fish that enter a net is known to have good properties. It is unbiased (i.e. will not *systematically* over- or under-estimate the mean) and larger samples will generally have averages closer to the mean of the population. However, the average weight of the fish that do not escape will be larger than the average weight of the fish who have entered the net since all of the “critically small” fish will be removed. The difference between the two averages will remain approximately the same *regardless of how large the number of fish entering the net will be*.
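The fish example can be simulated directly. The lognormal weight distribution and the escape threshold of 0.8 below are assumptions made purely for illustration.

```r
# Synthetic illustration: truncation bias does not shrink as the sample grows
set.seed(5)
hole  <- 0.8                                   # hypothetical critical escape weight
sizes <- c(100, 1000, 10000, 100000)           # number of fish entering the net

res <- t(sapply(sizes, function(n) {
  w <- rlnorm(n, meanlog = 0, sdlog = 0.5)     # weights of fish entering the net
  c(n = n, entered = mean(w), caught = mean(w[w > hole]))
}))
cbind(res, gap = res[, "caught"] - res[, "entered"])
```

The "entered" column should settle toward the population mean as n grows, while the "caught" column stays above it by roughly the same amount at every sample size.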

How does this relate to our situation?

Each proxy contains information from which we can calculate an estimate B_{prox} of the slope B (which relates the size of the change of this type of proxy to a unit change in the temperature). Due to the random noise, this estimate can be larger than B or smaller. Because B_{prox} is calculated using least squares, it is known to be unbiased and the sample of B_{prox}‘s is similar to the sample of fish who *enter* the net. The screening procedure removes all proxies which have a correlation lower than the critical value, but because the B-value is directly related to that correlation, all of the “small” proxies “escape” leaving an unrepresentative sample from which to calculate the reconstruction relationship.

Why is this important? The correct value to use in the reconstruction is the population parameter B in the equations above – any deviation from that value will produce errors. With our screened sample, we cannot get a proper estimate of B and the effect of this is passed on to the reconstruction methodology.

If the proxies are a mix of different types with in some cases only one of a given type, the problem still remains in the same way that the fish can be of different species. Whales will not escape and their information contribution to the average will be unaffected. However, the information from species whose size is closer to the hole size (proxies whose *actual* correlation is closer to the critical value) will be filtered to appear more temperature responsive.

## 265 Comments

Here is an R script for the example:

#Warning: large 26 MB file.

#Could be easier to download separately into a known directory

file.url ="http://lib.stat.cmu.edu/aoas/398D/supplementB.zip"

download.file(file.url,"supplementB.zip")

unzip("supplementB.zip")

nh.all.csm = ts(read.table("SchmidtMannRutherford/nh_csm")$V1, start = 850)

#shorten temperature series to match proxies

nh.csm = window(nh.all.csm, end = 1980)

plot(nh.csm, main = "CSM MODEL TEMPERATURE SERIES ANOMALY", ylab = "Degrees C", xlab="Year")

csm.prox59 = ts(as.matrix(read.table("SchmidtMannRutherford/csmproxynorm_59red")),start=850)

plot(csm.prox59[,1:10],main = "Proxies V1 to V10")

CPS = function(proxies, temp, caltime = c(1901,1980), cps.type = c("scale", "regress"), sp = .05,plots=T) {

# scale proxies and temps to caltime values

dat.tsp = tsp(temp)

if (length(cps.type) > 1) cps.type = cps.type[1]

cal.temp = window(temp, start = caltime[1], end=caltime[2])

composite = ts(rowMeans(scale(proxies)),start = dat.tsp[1])

cal.prox = window(composite, start = caltime[1], end=caltime[2])

#calculate recon

#simple scaling

if (cps.type == "scale") {

temp.mean = mean(cal.temp)

temp.sd = sd(c(cal.temp))

comp.mean = mean(cal.prox)

comp.sd = sd(c(cal.prox))

recon = ts(temp.mean + (composite - comp.mean)*(temp.sd/comp.sd), start= dat.tsp[1])}

#regression

if (cps.type == "regress") {

reg.coef = coef(lm(cal.temp~cal.prox))

recon = ts(reg.coef[1] + reg.coef[2]*composite, start = dat.tsp[1]) }

#

# loess smooth results

ttime = time(temp)

lo.recon = ts(predict(loess(recon~ttime,span = sp)),start=dat.tsp[1])

lo.temp = ts(predict(loess(temp~ttime,span = sp)),start=dat.tsp[1])

if (plots) ts.plot(ts.union(lo.temp,lo.recon),col=1:2)

#

#calculate mean square error of reconstruction

mse = mean((window(recon, end = caltime[1]-1)- window(temp, end = caltime[1]-1))^2)

list(recon = recon, sm.recon = lo.recon, temp = temp, sm.temp =lo.temp,mse = mse) }

#test.scale = CPS(csm.prox59,nh.csm)

#test.reg = CPS(csm.prox59,nh.csm, cps.type ="regress" )

#ts.plot(ts.union(test.scale$sm.temp,test.scale$sm.recon,test.reg$sm.recon), col=1:3)

compare.screen = function(prox, temps,cal.time = c(1901,1980),critcor = .12,

lospan =.05, plots = T, cpstype = c("scale", "regress") ) {

cors = c(cor(window(prox,start = cal.time[1], end = cal.time[2]),window(temps,start = cal.time[1], end = cal.time[2])))

bigcor = which(cors > critcor)

screened = CPS(prox[,bigcor],temps,cal.time,cpstype,lospan,F)

allprox = CPS(prox,temps,cal.time,cpstype,lospan,F)

mse = c(allprox$mse,screened$mse)

names(mse) = c("all","screen")

if (plots) {par(mfrow = c(2,1))

ts.plot(ts.union(allprox$temp,allprox$recon,screened$recon),col = 1:3,

main = paste("Reconstruction with",cpstype[1]," ... Screen Cor = ", critcor),ylab="Anomaly C")

legend(1300,-1,legend = c("Temp","All","Screen"),fill=1:3, horiz=T)

ts.plot(ts.union(allprox$sm.temp,allprox$sm.recon,screened$sm.recon),col = 1:3,

main = paste("Loess Smoothed with",cpstype[1], " ... Span =",lospan ),ylab="Anomaly C")

par(mfrow = c(1,1)) }

list(all = allprox, screen = screened, mse = mse)}

comp.scale =compare.screen(csm.prox59, nh.csm)

comp.reg = compare.screen(csm.prox59, nh.csm, cpstype="regress")

comp.scale$mse

# all screen

#0.07511459 0.15536520

comp.reg$mse

# all screen

#0.1329907 0.1855703

cor.calc = function(dats, temp, calstart = 1901) {

#calculate and plot correlations

outcors = matrix(NA,ncol(dats),3)

outcors[,3] = cor(dats,temp)

outcors[,1] = cor(window(dats,start = calstart),window(temp,start = calstart))

outcors[,2] = cor(window(dats,end = calstart -1),window(temp,end = calstart-1))

colnames(outcors) = c("Calib","Early","All")

plot(outcors[,1:2],pch = 20, main="Correlation Scatter Plot",xlab = "Calibration",ylab = "PreCalibration")

invisible(outcors)}

cor.calc(csm.prox59, nh.csm)

abline(v=.12,lwd=2,col="green")

Can I ask what happens if you pre-screen the data looking for the temperature drop that occurs after a volcanic eruption?

You have two nice events in your model series; about 1250 and 1805.

The only reason I ask is that I cannot find either Mount Tambora or Krakatoa in the proxies used by Gergis; nor are they present in the reconstruction.

These events are surely internal positive controls for pre-calibration temperature/SO2 sensitivity.

If you pre-screen for events that you know affected temperature, what changes do you see in the reconstitution of the ‘past’?

It strikes me that picking on specific incidents such as volcanic eruptions is not necessarily valuable for numeric purposes since you need to make the assumption that any reaction of the proxies to the events is “normal operating procedure” and is not affected by other concurrent physical effects also initiated by the eruption. However, one certainly should expect to see some response in the reconstructed temperature series to such events.

As far as *using* them for selection, would you not have to know more about how the eruption specifically affected the region containing the proxy?

Doc,

Frost events are also evident in some tree rings series. Hmm I have to dig up the paper

Very clear, Roman. Thanks, and thank you Steve for letting it appear here.

It is so incredibly FAIR to use Mann and team’s red noise data plus Mann’s team’s CPS method to demonstrate the point.

I still anticipate that there will be a protest that the analysis arises from the choice of data. After all, “simple physics” dictates that there MUST be signal in “real” data sets.

Is my inference correct that, given sufficient “proxies” one could find the particular one that most closely matches the calibration data, throw all the other proxies out of “sample”, and then use the one proxy to back cast the prior thousand years? Doing so would eliminate the suspiciously flat hockey stick shaft, at least. It might, or might not, include a MWP or any wiggles corresponding to volcanoes, etc. But boy howdy if one cherry picked so selectively the statistical tests would sure look good.

Ouch! Picking the single, most representative proxy (however that may be defined and implemented) seems to me a recipe for disaster. 😉

The reconstruction method should extract the temperature information available from all the proxies in a way which is unbiased and properly reflects the reality of that type of proxy. Estimate the parameters properly and let them determine the result. A single proxy just can’t do that.

“A single proxy just can’t do that.”

Why not? If a single proxy can’t, why can two? Or three?

So if a single proxy can’t, why are we to believe a hundred proxies with magically derived weights?

Of course you are right. But perhaps the argument needs to be reversed. If a hundred proxies with magical weights do have the validity to conjure up a history, then the correlation between the best proxies (highest weights) outside the training period ought to still be impressively high. But is that what we find?

If the correlation between pairs of proxies is better in the training period than outside the training period, then how did the best proxies know ahead of time what we would choose as our training period years later?

Well I could follow your argument pretty well, though I’m not a statistics expert. It will be interesting to see if Nick (or anyone else) can find any flaws in your argument. If not, I think you have slain the dragon of proxy temperature reconstructions, at least of the sort used up to now by the Team.

I am *certain* that Nick will show up, but I suspect he may have gone to bed already.

I would not say that this has “slain the dragon of proxy temperature reconstructions”; however, it should attract attention to the fact that selecting by correlation also tends to increase the presence of proxies whose noise matches the temperature. The artificial inflation of the proxy response by the noise becomes part of the calibration process, causing bias errors in the estimated paleo temperatures.

Ok so it’s “if you choose only what’s best between A and B you’ll likely exclude what’s best before A and what’s best after B, and include what just looks good between A and B but is just noise”.

It’s another form of the telephone-based poll which finds 100% of people have a phone.

Past Performance is No Guarantee of Future Results…Past Performance is No Guarantee of Future Results….Past Performance is No Guarantee of Future Results…Past Performance is No Guarantee of Future Results…

ps I thought the whole issue had been killed off by the Divergence Problem?

pps in layman’s terms I would summarize the post with “if you choose only what’s best between A and B you’ll likely lose what’s best before A and what’s best after B”. Talk about, “simply physics”…

Re: Past performance is no guarantee of future results, see:

The Truth Wears Off:

Is there something wrong with the scientific method?

http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer?currentPage=all

Re: Michael Larkin (Jun 17 09:09),

Jonah Lehrer’s 13 December 2010 piece was very interesting, but his 3 January 2011 piece ‘More Thoughts on the Decline Effect’ betrays his blind spot.

Why is it that so many otherwise erudite people confuse the controversy over climate sensitivity with the lesser, if not entirely mythical, one over climate change?

Past performance is no guarantee of earlier past results.

Both, if you’ll read it carefully.

Was there a reply to the Schmidt et al paper ?

I note they liked to exclude Tiljander on this particular occasion; their hypocrisy on proxy selections can only be laughed at. Tiljander has a contaminated modern portion, but in their studies they include it upside down because it improves their verification stats and has (when upside down) no MWP, yet in a published response they choose to EXCLUDE it because it’s “contaminated” and contributes (when used the correct way up) to a pronounced MWP. And they state all this with a straight face. They simply have no integrity left.

They also wanted to remove chronologies where there were fewer than 8 cores, after sitting for years on the secret that the Yamal chronology and its modern day hockey blade portion rested on a mere 6 cores. Yet again their hypocrisy is astonishing.

Nice post.

Minor question:

“B = Correlation(X, Yk) * sqrt(Variance(Yk(t)) / Variance(T(t))”

Should this be “Correlation(T, Yk)” ?

This analysis is correct if all the error terms have the same magnitude. I think the counter argument is that if certain trees/proxies have larger error magnitudes (lower signal to noise ratio) it’s more effective to screen them out. Any thoughts?

Standardized Bs (where variables have zero mean and unit SD) can account for the different absolute magnitude of errors. This should not affect the results since Standardized Bs and crude Bs are simple multiples of each other. For this and other reasons, candidate proxies with larger variance should not be screened out. Their error adds to the general error of the procedure linking proxies and measured temperatures.

Thanks for pointing out the typo. It has been fixed.

Had to think about it. I don’t think that it changes the argument. The point is that we are really still trying to estimate the same B value. Conceptually, if the noise variance is lower, it implies that we will omit such proxies less often but they can still have a biasing effect on the end result albeit smaller. Furthermore, they will not compensate for the biases induced by the exclusion of the higher variance proxies caused by screening.

Roman:

Apologies, I forgot that there is a second step here that calculates regression weights as well. If you are doing some type of regression following the screening there is no justification for the screening as the regression will down weight the lower correlation proxies. IE you want to use all the information.

IMO what is needed is a comparison of simple averaging (or variance weighted averaging) versus the various types of optimization (regression) schemes that occur. If the results aren’t similar then there are probably larger issues with the methodology. There might be some argument for screening if the next step was simple averaging, and the correlation measurements were not smoothly distributed (ie bi-modal or something).

Actually, it should be Covariance(T,Yk) at this point. It is only after normalization of the variables to have unit variances that Cov(T,Yk) = Corr(T,Yk).

[RomanM: Actually not, Hu. B = Cov(T,Yk) / Var(T) = Corr(T, Yk) * sqrt(Var(Yk) / Var(T)), as given.]

So true! 🙂

Hi Roman —

I think I follow and agree with your argument up until the end: “In a model where the proxies are theoretically not identical, the effect will still be the same even though the magnitude of the bias may vary. Choosing proxies with higher correlation will invariably increase the proportion of proxies with higher spurious noise-temperature correlation and reduce the opposite.” Are you sure this is a valid conclusion?

Doesn’t this depend on your choice of cutoff correlation and the proportion of “accurate” proxies? For example, assume you added 59 randomly generated “proxies” that have no correlation to temperature. Start by setting a correlation cutoff of 0, which selects all 1/2 good proxies and 1/2 random ones. Is there no higher correlation cutoff that would select more of the 59 real proxies while accepting fewer of the confounding ones?

Alternatively, you mention that there are different sets of proxies in the supplement. What happens when you blend together some number of proxies from one set and twice the number from another, then correlate to either the “correct” or “incorrect” temperature record? Does your statement remain true in both cases?

My instinct is that with 100% known good proxies, screening will always produce the results you show. And with an overwhelmingly large number of bad proxies, your statement is correct and the false correlations will dominate. But I think there has to be a point in the middle where the discrimination produces a better reconstruction. The hard part is that there is no “in test” manner of determining where on this spectrum the proportion of good to bad proxies lies.

Nathan, I think you have missed the point of the post. Why would screening even be necessary in your example? The effect of random noise in Bk = B + Be,k is to cancel to 0 and increase the contribution of the proxies’ “signal” component in the estimate of temperature. “Selecting” noise with a positive (ie: non-zero) relationship to temperature can only force a decrease/dampen (ie: bias) of the signal component. One can conjure up all kinds of examples where the magnitude of the effect is minimal, but that doesn’t mean that such selection is correct.

Yes, I think so. Proxies with spuriously higher correlation are more likely to be included and those with spuriously lower correlation excluded. The end result will be a collection of proxies which will tend to produce the end result that I described.

No. Increasing the cutoff reduces the numbers of proxies and the information available to create a decent reconstruction. For the example, the correlation plot shows that what looks good at one time may look poor at another. You can’t necessarily tell which from looking at the calibration statistics.

Yes, in the cases I looked at. However, that is why I wrote up a script which does that through a simple function. The exercise is left to the reader. 😉

Yes, but there is no reason why one cannot do some post-reconstruction evaluation to determine how the proxies fared. This does not need to be limited to just calculating some correlations. One could also come up with reconstruction methodology by which proxies can select or deselect themselves without overwhelming the entire reconstruction.

This statistical argument seems to me to miss the point. Obviously, if you assume that all the proxies are equally good, albeit randomly noisy, then throwing some of them away based on an essentially arbitrary criterion like whether the random noise happens to be low in a particular interval will produce a worse result than including all the proxies. My impression, however, is that the assumption underlying the selection process is that the proxies are *not* all equally good–that is, that the random noise functions for some proxies have much greater magnitudes (by various measures) than others– and that selection to eliminate the “bad” proxies is therefore justified. Presumably, for sufficiently extreme versions of this scenario–say, two classes of proxies, one with enormously larger random noise values than the other–this justification can be demonstrated statistically.

However, which of these two scenarios applies is surely an empirical question, rather than a statistical one. In particular, Nick Stokes and others invoke the “uniformity principle” to justify their assertion that a lower magnitude of random noise in one interval implies a lower magnitude of random noise in other intervals. Whether that’s true of, say, tree ring proxies is surely a question that can–and must–be answered empirically, rather than via statistical arguments.

Dan – what is your interpretation of the scatter plot?

This is what the defenders of the selection process say. However, my point is that the selection procedure by its basic nature tends to choose proxies tainted by a spurious correlation of noise with temperature more often. This results in an uncorrectable bias in the results.

This post has nothing to do with the “uniformity principle”. The correlation plot above shows that the noise in these proxies is so high that at one time the proxy can look good and at another it may go the wrong way *purely* due to the noise character at that moment.

RomanM, I understand that under the model you defined–all proxies behave identically with respect to both correlation and random noise–the selection process under discussion introduces bias. However, under a different model of proxy behavior, the assessment of the selection process might well be very different.

Let’s imagine, for instance, a model in which the proxies are not all identical, but rather fall into two sets–“good” proxies and “bad” ones. The “good” proxies follow your model, but most of the proxies are “bad”, and consist entirely of red noise. (This is my attempt at a rough approximation of the model implicit in the papers that use the selection process.) Under this model, the selection process stands a good chance of selecting strongly in favor of good proxies, and therefore of improving the resulting estimates substantially.

Now let’s imagine a model in which all proxies consist of “good” and “bad” *time periods*–that is, each proxy consists of a time interval (distributed according to some uniformish distribution) in which the proxy follows your model, spliced together with periods on either side in which the proxy consists of pure red noise. Under this model, the selection process is about the *worst* method one could use–it practically guarantees that proxies with good correlation during the calibration period contribute only red noise during the periods for which estimates are being calculated. Hence the estimates will be calcualated from pretty much nothing but noise, and thus be completely worthless.

It therefore seems to me that devising an empirically accurate model of proxy behavior–in effect, determining which of the above models better characterizes the variousproxies being used in climate reconstructions, or whether some other model entirely is more appropriate–is far more important than determining the effects of a particular selection process under a particular model that virtually nobody seems to believe actually applies to all proxies. Only when we understand how proxies actually behave can we tell whether the methods being used to analyze them make sense or not.

Yes, screening may very well remove some proxies which do not relate to temperature and leave more which do seem to exhibit a behavior related to the calibration temps. However, my point is that in the ones that remain the temperature information has been distorted to a larger or smaller degree, depending on how strong that proxy’s information is. This does not depend on the proxies being “identical”. My reason for assuming that was to make it easier to see what was going on and to determine the character of the effect.

I firmly agree with you that understanding how proxies for something behave (both qualitatively and quantitatively) is extremely important for solving the selection problem. However, that does not always seem to be as important to some paleo researchers, who are of the opinion that their methods can easily determine which are real and which aren’t.

In particular, Nick Stokes and others invoke the “uniformity principle” to justify their assertion that…

The term “uniformity principle” is a misnomer. It is the “uniformity assumption”.

If you call it a “principle” you instantly get the idea that it “proves” something. Merely mentioning its name is enough to justify your argument and rush blindly ahead.

If you call it correctly the “uniformity assumption”, you (and your reader) instantly realise that you are making an assumption and that your argument is limited. It invites the question of whether the assumption is correct for the case in hand.

I pointed this out in the previous thread but you seem to have missed it.

Invoking the “uniformity assumption” should be uniformly avoided.

I very much share your skepticism regarding the “uniformity principle”–in fact, as I have pointed out in other threads, we have a concrete example of this “principle” failing quite conspicuously for a class of proxy: the famous “divergence problem” associated with the “hide the decline” scandal.

But that’s precisely why we need to look very carefully at the empirical behavior of the various proxies being used for temperature estimation. These arguments about statistical methods are therefore something of a red herring–as I explained above, depending on how those proxies behave, the statistical methods used could be eminently sensible or ridiculously wrong.

Steve: the divergence problem is NOT evidence of the uniformity principle failing.

Steve, as I mentioned in the previous thread, the term “uniformity principle” is being thrown around quite imprecisely, and has clearly been interpreted with respect to temperature proxies in at least two different ways. I’m using it in a very broad sense, as a synonym for, “once a good proxy, always a good proxy”, because that’s the meaning that defenders of the screening method such as Nick Stokes implicitly attribute to it when they use it to justify selectively trusting certain proxies based on correlation over a calibration period.

You’ve pointed out, quite correctly, that one could instead use the term more narrowly, to mean, “proxies behave consistently over time”, and that one can still believe that this narrower version of the principle holds without accepting the truth of the broader version, since the narrower meaning clearly does not necessarily imply the broader one. You gave the example of tree ring proxies responding quadratically to temperature, in which case one could believe that they always do so, as per the narrower version of the “uniformity principle”, without accepting that (linear) correlation with the instrumental temperature record (i.e., being a “good” proxy) over a given time interval necessarily implies a similar correlation over other time intervals.

Nevertheless, I believe this is in the end merely a semantic quibble. What we presumably both believe is that (a) the selective use of proxies based on correlation with the instrumental temperature record is only justified if the strong, “once a good proxy, always a good proxy” sense of the principle holds, and (b) this strong version of the principle appears not to hold for at least some proxies, as demonstrated by the “divergence problem”.

If your concern is that the term “uniformity principle” is being misused when given the stronger meaning, then fine–but then your gripe is really with Nick Stokes and the others, since they’re the ones who are using the “uniformity principle” in a way that implicitly associates it with this stronger meaning. I’m just following their terminology for consistency while I critique their arguments.

Extracts of “CARGO CULT SCIENCE” by Richard Feynman: Adapted from the Caltech commencement address given in 1974.

If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it, as well as those that agree with it.

==================

Climate science specifically excludes the facts that disagree with the theory. The hockey stick is a prime example. Only those trees that match current temperatures are used, while the trees that don’t match are hidden from the study.

The problem is that those trees that don’t match, they are telling you that the trees that do match are likely an accidental match, that you can’t place much weight on their reliability. So, by hiding these trees, climate science makes it appear that trees are a much more reliable measure of temperature than the evidence supports.

Why do this? Because Climate Science is a pseudo science. It is not searching for the truth. It is searching for evidence that supports a predetermined, politically correct conclusion. The evidence that supports this conclusion is included in the studies. The evidence that does not support the conclusion is hidden from the final report.

“Hide the decline” as revealed via Climategate is a prime example of this, where the IPCC poster child was shown to be a result of selection bias. A more recent example is the “southern hemisphere hockey stick” which was recently retracted from publication. In this case the “experimenter expectation” effect likely played a role, where an obvious mistake got past the researchers, peer review, and “climate scientists” such as “Real Climate”.

Instead, the error was first publicly identified on Climate Audit. After the fact the researchers claimed they discovered the error first, but this was only after CA published the error. If in fact the researchers did discover the error first, it suggests they knowingly withheld this from the public until CA announced the error. This raises the question about whether the researchers would have released their knowledge of the error except for CA. If Climate Science history is an indication, they would not have made this error public. They would have allowed the erroneous paper to go forward, because it supported the predetermined conclusion.

Rather, I would say that similar trees that match more poorly are an indicator of a higher noise level which should be taken into account during the reconstruction.

…What about the physical argument that stands-in and undergirds any time the statistical argument is relatively weak (low correlation co-efficients, p ~ 0.1, etc.)

It’s the ‘one working thermometer among broken thermometers’ idea…

Jim Bouldin was saying over at RC (and anticipating folks seizing on it) that even if you only have one tree that properly reasonably reflects ITR, then that’s what you go with, and can still legitimately do so.

[RomanM: Sheer hubris on the part of some who could use a real statistics course to get some understanding of both the capabilities and, perhaps more importantly, the limitations of statistical techniques.]

Jim Bouldin was saying over at RC (and anticipating folks seizing on it) that even if you only have one tree that properly reasonably reflects ITR, then that’s what you go with, and can still legitimately do so.

=========

If 999 trees don’t correlate and only 1 does, those other 999 are telling you that the one that does is likely doing so simply by accident.

for example:

If you look at enough trees, eventually you will find one that correlates with the stock market.

Using the logic of tree rings, since it can be argued that economic activity results in CO2 which enhances tree growth, and the observation that the market leads economic activity by 6 months, we should then be able to use this tree as a proxy for the stock market and future economic activity.

In other words, if climate science is correct, this one tree should be able to forecast future economic activity by about 6 months.

However, the trees that don’t correlate are telling you that what you are observing is the laws of chance. Given enough time and enough monkeys, eventually you get Shakespeare.

Fred, the Shakespeare metaphor was invoked there too (actually, it was “Moby Dick”) … The idea being that it’s IMPOSSIBLE to get a Moby Dick out of an infinite number of monkeys, unless that monkey happens to be Herman Melville.

Likewise, it is postulated on physical principles that a tree that follows the ITR is a valid tree-mometer (and therefore does not require statistical significance by comparison within a population of like-situated trees that don’t follow the ITR).

Where does it cross the line into hubris? (Does an argument like this have to satisfy all statistical as well as physical principles? Because it is postulated as a matter of physics, need there be no acknowledgment that a few assumptions, or articles of faith, are being made?)

That goes back to my earlier comment. The inference that “one good thermometer” can be determined among the 999 bad ones should let us dispense with averaging, collating, or otherwise combining data from, say 20 or 30 not-quite-so-good thermometers with my cherry picked ideal. Reduce the whole idea to the extreme. One carefully picked proxy tells us everything we want to know, or ought to know, or can know, about the periods before and after the calibration.

Unless of course the one good thermometer has a divergence problem after selection for periods later than the calibration interval.

Rather, I would say that similar trees that match more poorly are an indicator of a higher noise level which should be taken into account during the reconstruction.

========

Noise is noise. It is randomly distributed. Thus, if a tree has low noise in one sample, that says ABSOLUTELY NOTHING about the noise for that tree in another sample.

Your argument is similar to the Gambler’s Fallacy. The assumption being that random chance in one sample affects random chance in another sample. It is statistical nonsense.

If a tree has low noise in one sample, that in no way reduces the odds of noise in any other sample.

Fred’s language is much clearer than Roman’s in this instance. So unless Fred’s language introduces an inaccuracy, if you want to communicate with the public, use Fred’s formulation of the problem.

Why is it surprising that random generated proxies have no correlation between the Calibration and Pre-calibration? If they had, they wouldn’t be random would they?

Random says there should be no correlation between the data points. It does not say there should be no correlation between the pre and post screening.

Statistics says that the selection process must leave the distribution pre and post screening unchanged. If you change the distribution via screening, you have broken the underlying assumption in the mathematics.

Focusing on using longer calibration periods seems to me to be the way to go. Using 30 year calibration periods, focused on a warming period is self-predicting, as Steve has proven many times.

If you can’t find a correlation strong enough to be relevant over, say, a 150-year period, then why would you think it’d be good for another 850 years beyond that?

Focusing on using longer calibration periods seems to me to be the way to go.

===========

Longer calibration period will not solve the problem. The problem is calibration. It is a hidden variable problem:

You are telling the statistics that

G = F(T), where G=Growth and T=Temp

But what you are actually doing in the model is:

G = F(T,C) where C=Calibration.

C, Calibration is the hidden variable. It results in bias because it hasn’t been accounted for in the solution equation.

So you’re saying there is no length of time you’d be willing to use as statistically valid to extrapolate past the end of the observational record?

For instance, if your target was 1,000 years, and you had 950 years of observational data to calibrate your proxies, you would be unwilling to trust that last 50 years to the proxy that calibrated significantly to the observational record?

Extreme example, of course; but if you admit “yes”, then we have to start walking further up the line to find the point where you’ll insist “no”.

Hello

Over at RC,

http://www.realclimate.org/index.php/archives/2012/05/fresh-hockey-sticks-from-the-southern-hemisphere/comment-page-4/#comments

I asked this question:

“As I understand it, the p value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed. So obtaining one result with p value of .1 out of, say, 10 tree rings (10%) could hardly be considered remarkable. Or obtaining around 100 correlations out of 1000 with p=.1 could not be considered remarkable. I would argue that in those cases, a p=.1 could easily be attributable to random noise.

Question: Can you point to any specific examples where data shows much greater than 10% percentage of the tree rings samples matched temperatures with a correlation of p=.1 or better?”

There were a few answers from the RC bloggers, referencing this supplementary information to a Mann study:

http://www.pnas.org/content/suppl/2008/09/02/0805721105.DCSupplemental/0805721105SI.pdf#nameddest=ST1

One of Mann’s comments in this study is:

“Although 484 (40%) pass the temperature screening process over the full (1850–1995) calibration interval, one would expect that no more than 150 (13%) of the proxy series would pass the screening procedure described above by chance alone.”

40% of the data passing a screen that should only yield 13% by chance alone, seems extremely improbable to me, if the data were really just random noise.

I don’t know the exact formula, but I suspect the odds are less than 1% of this occurring by pure random chance.
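The formula in question is a binomial upper tail, but only if (a big if) each series passes the screen independently with the same 13% chance. The sketch below works in log space to avoid underflow. The total of 1210 series is inferred here from 484 being 40% and should be read as an assumption; the function names are made up for the illustration.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability of exactly k passes in n tries."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def log10_upper_tail(k_min, n, p):
    """log10 of P(X >= k_min) for X ~ Binomial(n, p), via log-sum-exp."""
    logs = [log_binom_pmf(k, n, p) for k in range(k_min, n + 1)]
    top = max(logs)
    total = top + math.log(sum(math.exp(v - top) for v in logs))
    return total / math.log(10.0)


if __name__ == "__main__":
    # 484 of an assumed 1210 series passing, each with a 13% chance of passing.
    print(log10_upper_tail(484, 1210, 0.13))
```

Under independence the result is a probability far below 1%. The number is only as good as its inputs: series that share climate, neighbours or source material are not independent draws, and the calculation takes both the 13% benchmark and the 484 count at face value.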

Any comments on this?

Notwithstanding the possibility that the tree rings are truly measuring temperature, the one issue I have is that the data might have gone through two additional “sieves” before even getting to Mann’s screening process:

1. All the data was pulled from the “International Tree Ring Data Bank”. The various miscellaneous data providers might have done some sort of screening on the data before uploading to this data bank.

2. Mann “prescreened” the data before screening it. He describes this process in detail in the SI, and it is unrelated to how well the data correlates to local temps. However, it seems possible to me that Mann still might have used other criteria (cherry picking) not disclosed in his paper. Or not; it is impossible to check without ridiculous effort and time.

Steve: You can’t take anything at face value from Mann et al 2008. We discussed the screening back in Sept 2008 and showed that many of the “passing” results are bogus. They include 71 Luterbacher series that incorporate instrumental series. They also include Briffa MXD series where the decline has been deleted and replaced with “infilled” values before doing the check. They also compared to two gridcells picking the “best” while doing the benchmark for one. On and on. The claimed skill is fictitious.

40% of the data passing a screen that should only yield 13% by chance alone

========

What this means is that prior to screening, only 1/8 of the trees in the sample correlated by chance. After screening, 1/3 correlated by chance.

In other words, after screening, a much higher proportion of trees correlated simply by chance than prior to screening.

Thus, the screening process amplified the noise.

Yes but, according to Mann, the prescreening had nothing to do with correlation, so it should not have affected the results.

This was his “prescreening criteria”

“(i) series must cover at least the interval 1750 to 1970, (ii) correlation between individual cores for a given site must be 0.50 for this period, (iii) there must be at least eight samples during the screened period 1800–1960 and for every year used”

Just some guy, why don’t you go read Mann (08) and the SI very carefully and decide for yourself? Mann (08) used a p value for correlation of the proxy to temperature of 0.10 which had to be corrected in the SI to 0.13 after adjusting for what Mann (08) assumed was the correct ARIMA model. It had all the problems that SteveM noted above plus Mann (08) uses infilling of proxies that were incomplete to 1998. That bit is hidden in the SI also. For a number of proxies the infilling was extensive.

Okay thanks Kenneth and Fred and Steve. I’ll investigate that and also check out the Sept-2008 threads, although it will take me some time to really get my arms around it. (I was kind of hoping this wouldn’t get overly complicated.)

The practice of picking the best of two gridcells seems to me like it might double their results, if every tree ring gets 2 chances to pass the test. So maybe that means only 20% should have passed, before taking into account the other problems? I’ll lob that one into a RC thread and see what they say. If reality is closer to 13%, I would argue the whole study is garbage.
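The "two chances" arithmetic can be checked directly. A minimal sketch, assuming the two gridcell tests are independent (which neighbouring gridcells are not, so the true rate would sit somewhere between the single-try rate and this figure); the function name is hypothetical.

```python
def best_of_k_pass_rate(p_single, k=2):
    """Chance that at least one of k independent tries passes the screen."""
    return 1.0 - (1.0 - p_single) ** k


if __name__ == "__main__":
    for p in (0.10, 0.13):
        print(p, round(best_of_k_pass_rate(p), 4))
```

With a nominal 10% screen, two independent tries give a chance pass rate of about 19%, close to the 20% guessed above; starting from 13% it is about 24%.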

Does anyone know if the raw data for Mann08 is available (other than scattered over many files at ITRDB)? I can’t seem to find it.

Steve: it’s at Mann’s website and in a collated R-table at climateaudit.info.

“Does anyone know if the raw data for Mann08 is available (other than scattered over many files at ITRDB)? I can’t seem to find it.”

just some guy, go to the Mann (08) SI and in the SI are links to data used. You will not find the in-filled data unfortunately, but welcome to the world of climate science.

Yes but, according to Mann, the prescreening had nothing to do with correlation, so it should not have affected the results.

This was his “prescreening criteria”

(ii) correlation between individual cores

============

Prescreening had nothing to do with correlation? Did you check point 2 of the criteria? His criteria contradict his assertion. He says it has no effect, and the reason he gives is that he doesn’t do the very thing he then does.

You are telling the statistics that

G = F(T), where G=Growth and T=Temp

But what you are actually doing in the model is:

G = F(T,C) where C=Calibration.

========

From this:

C = S(T,G), where S = selection function

Thus, what you are modelling is:

G = F(T,S(T,G)), where F = growth function.

You have now got G on both sides of the equation. A circular problem.

Not at all like your statistical assumption:

G = F(T).

I am not referring to Roman in the above. I’m referring to climate science.

What climate science tells the statistics:

G = F(T).

What they are actually modelling is:

G = F(T,S(T,G))

Roman has laid out the math why this must result in bias when F(T) is linear. In point of fact, tree growth is not linear, it is parabolic, with a maximum at optimum temperature, making the solution extremely sensitive to mishandling of the error term.

It is the sensitivity to the error term, coupled with the mishandling of the error as a result of preselection that renders tree ring science the modern equivalent of phrenology.

Roman —

You are using “Classical Calibration Estimation”, which correctly (in UC’s view and mine) regresses proxies on temperature and then inverts to get reconstructed temperatures:

But I suspect Gergis is naively using “Inverse Calibration Estimation” which regresses temperature on proxies instead. If this is done using an average of proxies that have been screened for their high correlation with temperature, wouldn’t this give too high a coefficient in the ICE equation, and therefore a reconstruction that is biased in the opposite direction?

Or perhaps she is using CPS variance matching (“Composite plus scale”, which averages z-scores of proxies and then matches variances rather than running the regression either way). What is the bias then?

The guess about Gergis regressing temperature on proxies is interesting, but I made that calculation earlier. The OLS slope is quite small, and yields an extremely flat reconstruction, flatter than Gergis’s. During the period 1000-1150 when there are only two proxies, the Gergis reconstruction follows fairly closely the “Classical Calibration Estimation” from Mt. Read, but varies far more than the “Inverse Calibration Estimation” from either Mt. Read or Oroko.

I plead ignorance of the CPS variance matching method, so have not made that calculation.

Hu, I used the “Classical Calibration Estimation” in the discussion, because I am also of the same opinion as UC and yourself that it is the correct way to approach the problem. However, the R analysis in the example was done using both the variance matching and the inverse regression fit.

It appears to me that the bias goes in the same direction but I need to think it out some more.

Oddly enough, we both discuss the exact same point using somewhat different examples. Mine is the “easier, more cartoonish, fewer equations” method. Carrick also has come up with examples he thought of and posted while I was coming up with my graphs. Here’s my “example for the less math tolerant”: http://rankexploits.com/musings/2012/screening-fallacy-so-whats-the-past-really-like/

A well written article Roman, however, I think it is even simpler than what you present.

Take a number of random number sequences and select those which match the recent temp record. Take the average and you will have something that still bears the same overall trend as temps since they all correlated to it.

Since all are random, the fact that they match in the test period does not relate to how they vary before (or after) that period.

Since they are random, they will generally NOT correlate and will tend to cancel out over the rest of the record.

Thus such a selection process will almost guarantee a hockeystick.
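The recipe in this comment is easy to try. A minimal sketch, with several assumptions that are not in the comment: the "random sequences" are AR(1) red noise with coefficient `rho`, the recent temperature record is replaced by a plain rising ramp, and a series is kept if its calibration-period correlation with the ramp reaches a hypothetical cutoff `r_min`. All names and parameter values are illustrative.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ols_slope(y):
    """Least-squares slope of y against its index 0..n-1."""
    n = len(y)
    tbar = (n - 1) / 2.0
    ybar = sum(y) / n
    num = sum((t - tbar) * (v - ybar) for t, v in enumerate(y))
    den = sum((t - tbar) ** 2 for t in range(n))
    return num / den


def screened_composite(n_series=1000, n_total=600, n_cal=100,
                       rho=0.9, r_min=0.5, seed=2):
    """Average the pure-noise series that 'match' a rising calibration ramp."""
    rng = random.Random(seed)
    target = list(range(n_cal))  # stand-in for a rising instrumental record
    selected = []
    for _series in range(n_series):
        series, prev = [], 0.0
        for _step in range(n_total):
            prev = rho * prev + rng.gauss(0.0, 1.0)  # AR(1) red noise
            series.append(prev)
        # Screening step: correlation over the final n_cal points only.
        if pearson(series[-n_cal:], target) >= r_min:
            selected.append(series)
    composite = [sum(col) / len(selected) for col in zip(*selected)]
    return len(selected), composite


if __name__ == "__main__":
    n_kept, comp = screened_composite()
    print("series kept:", n_kept)
    print("slope, calibration period:", ols_slope(comp[-100:]))
    print("slope, earlier period:", ols_slope(comp[:-100]))
```

No series here contains any temperature signal, yet the composite should rise through the calibration window while the earlier part averages out to a nearly flat line.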

Thank you, this explains things nicely for people like myself.

I was wondering, does such a process exist where each and every proxy is correlated with each other, rather than with temperature? You could then rank all the correlations, etc…

I doubt any would get over 0.5, since you would be trying to match entire series.

Yes, thanks for that. Every stripped down explanation makes the ostensibly more confusing actual situation easier to parse.

Roman,

Thanks for doing this very thorough analysis. As you noted, I’ve been sleeping through some of the discussion, and it’s still not quite light. I’ll run your code and look carefully through your analysis.

Just one early contribution to the discussion though – much of the discussion seems to have been based on the claim that screening would generate hockey sticks from nothing. But in your analysis, it seems to straighten them?

[RomanM: It would straighten the shaft where the reconstruction would be taking place. Assuming that the calibration was being done in the most recent era and that the correlation with the temperature was higher, then the blade would still be there.]

It’s always been about the SHAFT.

Since the reconstructions are used to make claims about the warmest year ever, it’s always been about the shaft.

Smoothing the shaft.

From a climate sensitivity standpoint the smoothing of the shaft would tend to support arguments of a less sensitive system. What that means is this. Somebody like me who is a lukewarmer should be happy to see a smooth shaft. Big amplitudes in the shaft would be consistent with higher sensitivity. But I am not happy to see a smooth shaft if it involves questionable untested methods.

Steven, a wobblier shaft does not necessarily imply more sensitivity, it could just as easily mean more variability than is explainable by the current models. Since there is already more variability than the models (typically) can explain, that’s not a giant leap.

let me be clearer.

Assume a MWP that is roughly the same as today.

Assume a LIA that is 1C cooler than today.

If the LIA is the result of changes in solar forcing (which is small) then that is a large change in T for a small change in Watts.

If however the MWP is, say, .3C cooler than today and the LIA is .7C cooler than today (attenuated humps), then you have smaller changes in T for the same assumed forcing.

So it’s not the number of wobbles, but rather the fact that we have a MWP peak, a LIA trough and a modern peak.

Given that the forcing during that period is what it is, a reconstruction that has a higher peak to trough (delta T) is consistent with higher sensitivity and a smaller peak to trough is consistent with lower sensitivity.

But, I’m open to being 100% wrong about this and learning something.
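The arithmetic behind the two scenarios in this comment can be written out. A minimal sketch, assuming the entire MWP-to-LIA swing is a response to one and the same change in forcing (if part of the swing is unforced variability, the ratio says nothing about sensitivity). No forcing value is needed, because it cancels in the ratio. The function names are made up for the illustration.

```python
def peak_to_trough(mwp_anomaly, lia_anomaly):
    """MWP-to-LIA swing in degrees C, both anomalies relative to today."""
    return mwp_anomaly - lia_anomaly


def sensitivity_ratio(swing_a, swing_b):
    """Ratio of implied sensitivities when both swings share one forcing change."""
    return swing_a / swing_b


if __name__ == "__main__":
    large = peak_to_trough(0.0, -1.0)   # MWP like today, LIA 1C cooler
    small = peak_to_trough(-0.3, -0.7)  # attenuated humps
    print(large, small, sensitivity_ratio(large, small))
```

The swings are 1.0C and about 0.4C, so under that assumption the first reconstruction implies roughly 2.5 times the sensitivity of the second.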

Steven, let me try again. What I’m saying here isn’t all that profound:

If you have the same forcings, and the variability is caused by these forcings, then a larger change means a greater climate sensitivity to these forcings. Certainly that’s the case. If you start by assuming the MWP and LIA were caused by changes in radiative forcings, then your conclusion follows.

The problem is that we have unforced climate variability in addition to forced climate variability which can be used as explanations of the MWP and LIA, so looking at how much variability you have, implies either that there is more unforced variability or greater climate sensitivity (or some combination of the both).

Since we don’t have a good lock on the causes of these two episodic periods, there isn’t a whole lot to be learned from what temperature the MWP peaked at or how cold the LIA got to.

that assumes that all forcing by longwave and shortwave have identical impacts. Is that a certainty?

Mosh: “Then you have smaller changes in T for the same assumed forcing.”

One other possibility is a higher sensitivity to solar forcing (SW or other) than LW GHG forcing of the same W/m^2.

Steve, I understand your point, but a smooth shaft would really just point to less variability, not necessarily less sensitivity. A nuance, but an important one.

Just thinking out loud here . . .

‘Sensitivity’ presumably means sensitivity to some specific thing. The standard CAGW story is based on sensitivity to CO2. So even if the shaft varied a lot, it would not mean there is particular sensitivity to CO2, absent some definitive causative correlation. Particularly when there is evidence of absence of CO2 driving temperature (ice cores, 20th-21st century temperature profile, etc.).

Finally, there is lots of empirical evidence of actual climate change independent of temperature reconstructions (ice ages, warmer periods, etc.). Thus a smooth shaft reconstruction — essentially by definition — should clue us in that the reconstruction is not faithfully representing reality.

The thing that is most astonishing is that these very elementary issues are apparently lost on Mann and company, in addition to the massive statistical blunder of actively selecting proxies that result in a hockey stick. Pretty astounding.

Steven,

“It’s always been about the SHAFT”

Yes, that’s been my contention. But we’ve been in a small minority.

But in this recon screening seems to raise the shaft relative to the blade. So still less HS, not more, with screening.

Steve Mc: Nick, please don’t assume from my lack of interest in engaging that I agree with this point.

That particular shifting of the shaft is an artifact of the particular underlying temperature sequence assumed in generating the Monte Carlo series. Compare instead these real reconstructions.

The “real” effect is a reduction in variance, which results in the shaft appearing flatter compared to the blade than it would, were the bias not present. It also has an effect in shifting the baseline in the pre-calibration period, but that’s mathematically a consequence of the change in scaling and is not a new effect.

Re: Steven Mosher (Jun 17 17:19),

… or, alternatively, consistent with low sensitivity together with large natural variability.

Not 100% wrong, but (apparently) 100% certain, where Judith Curry has clearly warned us about the uncertainty monster.

If you select samples that follow the instrumental period, of course your result follows the instrumental period, but any other time would be more or less noise.

This type of selection could be OK, if you are certain that there is a signal, but if not, then you are lost, and you can not use the signal to prove that there is a signal.

Imagine you selected those that forecasted last year’s weather with an 80% score, and then said they must be the best. Would you believe them?

Svend: “if you are certain that there is a signal”. Yes, exactly. It is an assumption, unproven. Very much like the search for extraterrestrial communication or those who played Beatles albums backwards (am I showing my age?). Those doing reconstructions ASSUME there is a signal there that they can amplify.

In science, one should test one’s assumptions or control for extraneous influences, not just run off with unproven assumptions.

Steve: I think the best way for everyone to get it is to forget about tree rings and use some other kind of data that is not climate.

For example, to reconstruct world population growth (=temperature), is it possible to download the data of population growth for all the countries in the world (=proxies), apply to it the screening fallacy (i.e. keep those with good correlation, calibrating over some period of your choice, say 1950-2010) and reconstruct the evolution of global world population from those “good” proxies over a much longer period?

How is that different from what is currently being talked about?

My remark also applies to Roman

Roman –

Excellent presentation, thanks!

The weakness of this counter example is that it assumes that the proxies are drawn from some well behaved population, when in fact we know they aren’t. What’s just one more source of bias along the way?

So at the risk of getting into hot water here, I think you misunderstand Nick. His view is that proxies are the temperature – hence “Bias from what? You’re using the language of population statistics. But what is the population?”

Thus any subsequent analysis is in terms of the proxy, not temperatures. The sleight of hand of course is that all the conclusions end up being about the temperature when they should be about the proxy, but you’d have to agree that if you make Nick’s assumption your kind of bias doesn’t arise.

As it happens Gergis’ pre-screening does clearly bias the subsequent analysis even if you equate proxies and temperature. Don’t know that I made much more progress with Nick on that score – he’s definitely a details man, into re-running this analysis like a pig into mud as we talk.

Roman,

I’m curious about why both screened and unscreened have a bias – I know you said that was not the point of the post. But I think it may be relevant. I’m going to explore why there is a bias, but it seems to me that we’re seeing the way the reconstruction approaches the known value as proxy numbers increase. In which case the apparent screening bias may be simply the effect of fewer proxies.

Nick:

Nah. I’ve done the Monte Carlo’ing too and it follows exactly as predicted by RomanM’s formulation. Screen for r ≥ r0 and you introduce a bias in the slope. In retrospect it’s obvious.
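A Monte Carlo of the kind described here can be sketched in a few lines. This is not Carrick’s or RomanM’s code, only a toy under stated assumptions: every proxy is `beta` times a ramp-shaped calibration temperature plus AR(1) noise, each proxy is regressed on temperature by ordinary least squares, and the fitted slopes are averaged over the proxies that pass r ≥ r0. The names, the 2000 proxies and the noise settings are hypothetical.

```python
import random


def fit_slope_and_r(x, y):
    """OLS slope of y on x, together with the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def mean_slope_by_threshold(thresholds=(-1.0, 0.5, 0.6), n_proxies=2000,
                            n_cal=100, beta=1.0, rho=0.3, sigma=0.5, seed=3):
    """Mean fitted slope among proxies passing r >= r0, for each cutoff r0."""
    rng = random.Random(seed)
    temp = [t / n_cal for t in range(n_cal)]  # assumed calibration temperature
    fits = []
    for _proxy in range(n_proxies):
        e, proxy = 0.0, []
        for t in temp:
            e = rho * e + rng.gauss(0.0, sigma)  # AR(1) noise
            proxy.append(beta * t + e)           # true slope is beta
        fits.append(fit_slope_and_r(temp, proxy))
    rows = []
    for r0 in thresholds:
        kept = [b for b, r in fits if r >= r0]
        mean_b = sum(kept) / len(kept) if kept else float("nan")
        rows.append((r0, len(kept), mean_b))
    return rows


if __name__ == "__main__":
    for r0, n_kept, mean_b in mean_slope_by_threshold():
        print(f"r0={r0:5.2f}  kept={n_kept:5d}  mean slope={mean_b:.3f}")
```

With the cutoff at -1 every proxy is kept and the mean slope should sit near the true value of 1; as r0 rises, fewer proxies survive and their mean slope should drift upward. An overestimated proxy-on-temperature slope, once inverted, compresses the reconstructed temperatures.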

To give a flavor here’s what an individual Monte Carlo proxy looks like.

This is what happens to the mean value of the slope as you increase r0, reducing the percentage of proxies retained. 10% retained for this example still represents 1000 proxies.

Well, Carrick, as I’ve described downthread, I ran Roman’s code with the selection by correlation just replaced by random subselection, and it gave very similar results. I even tried just the first 24 proxies – same. So I am firming in my view that what you’re seeing here is just due to having a smaller set of proxies.

I’m not sure I know the complete answer to that. If you look at the paper by Schmidt et al, they give plots of a series of reconstructions including their own for the proxy data and ALL of the recons are biased, some more than others.

I will hazard a guess here that the reason may have to do with the fact that most of these are what I term “passive” reconstructions, i.e. those for which the relationship between proxy and temperature is determined completely by the calibration time data. I have been working on one where the coefficients and the reconstruction are estimated recursively and have had better luck pretty much eliminating most of the bias. However, that work is not complete yet.

Broken link sorry. Here’s the mean value of slope as you change the selection criterion.

Let me try this again… Nick are you referring to a bias in the offset as opposed to a bias in the trend?

The answer to that is they are linked. If you have a reduction in scaling, it will cause a vertical shift, as long as you are anchoring your data to line up with the instrumental data. (That seems really obvious to me so maybe I’m just misunderstanding your question here.)

Regarding the reduction in scaling and bias in the nonscreened data, I found that too (note that at 100%, my curve doesn’t intersect the a priori slope). It’s a small effect for the data I was using, so I didn’t mention it, but it may be larger in more “tree-like” proxies.

You might be happy to hear this because this could be a place where your and Hu’s formulations of simultaneous fitting to AR(n) and the slope of the regression could be useful in reducing the bias.

I haven’t tested it yet, but I would predict that if you used white noise instead of red (or “climate-like”) noise, you won’t get a bias in the trend. (That’s actually provable, I think.)

Good question Nick. Do you think it has something to do with scaling a proxy containing both noise and signal to match the variance of the instrumental temperatures?

– – – – –

RomanM,

Can you please amplify why you say “it should be clear that no real proxy data exists for doing this”.

I would appreciate background on why it is the case that it should be clear.

Thank you.

John

The proxy data exists, but the temperatures associated with those proxies centuries ago are not available. The example used proxies which were generated from the model “temperatures” so that comparisons between reconstruction and temperature could be made.

I hope that the readers here understand that what Roman has shown here is that given a group of proxies has a signal there can be noise introduced by screening. I guess one could argue if you have a signal then use all the proxies.

The argument in general against screening where the proxy validity as a thermometer is not known is much easier to make.

I’m still (between breakfast etc) trying to get the R code and data running. But I’d like to make this general comment on where we stand.

On this site and at Lucia’s I repeatedly asked for a specification of the “selection fallacy”. And I was told things like “you’ve been told many times” or it’s “baby food statistics”.

Reasons for wanting a specification include:

1. We’d know if we’re all talking about the same thing.

2. We’d know what bad effect it actually entails

3. We’d know how to model it

I eventually made enough of a nuisance of myself at Lucia’s that definitions started trickling in. But they were wildly disparate. We weren’t all talking about the same thing.

And then the modelling started. But again, without a specification, it’s hard to see whether they are actually modelling the SF. That’s not, BTW, to grumble about the models, which do seem to be modelling something interesting.

But the worst is that having modelled, have we shown the bad effect of SF? Problem is, we never said what they would be. People like Josh and jeez drew cartoons and everyone cheered. But nothing quantitative.

So now we have three modelling efforts – 2 at Lucia’s and one here. And again they show various effects. But are they what we expected from the SF? We had nothing a priori. It seems like the effects of “SF” are whatever discrepancy the model throws up. And again, this one and Lucia’s seem to have very little in common. And nothing in common with the cartoons.

The real screening fallacy is the assumption that your reconstruction has anything to do with temperature prior to your calibration period.

Then there’s the loss of scaling. It’s inaccurate on your part to claim that this hasn’t been discussed. It has. Von Storch’s name has been mentioned. Jeff Id’s work has been mentioned.

Multiple times.

You are either reading the criticisms with your eyes closed, or your mind shut, or both.

That’s one of the wildly disparate definitions. But all these modelling efforts show reconstructions which, with whatever faults, certainly have to do with temperature prior to the calibration period. So either they aren’t modelling SF or …

You can’t use the argument of “wildly disparate definitions” to claim that it or its ramifications have not been discussed.

What we’re discussing here to be specific is using correlation to screen, and the argument has been presented that this is an erroneous procedure on a number of grounds.

If you are left to argue over the semantics of a particular word choice rather than deal with the in-depth discussion that has gone over the ramifications of using correlation to screen, then, as gently as I can, I suggest your own position is intellectually bankrupt and that you need to move on.

The selection fallacy and what it portends in the case of attempts to discover temperature proxies is not an issue that can be covered with a sentence or two. Much of the weakness in the efforts of those doing reconstructions is their failure to explain and warn of all the pitfalls in the methods.

To avoid the selection fallacy one must either (1) show a physical reason for selecting proxies, then apply a reasonable criterion a priori and subsequently use all the data that that selection provides, or (2) have the means to test the model out-of-sample, which means testing it with data other than that used to construct the model, rather than with in-sample data. Many skeptical practitioners will accept only truly out-of-sample data for testing a model in these cases, and that is data that the modeler could not have observed in making his model. What can turn the so-called out-of-sample data that is sometimes used in calibration and verification methods into in-sample data is that the modeler has looked at all the data and knows that the verification will succeed, and/or the modeler throws out models that eventually do not work without somehow including these decisions in the statistics or mentioning the failures in the papers published. Hu McCulloch has referred to more sophisticated methods like 10-fold cross-validation for calibration and verification of models, but those methods are also subject to peeking and throwing out bad models.

Those people doing reconstructions use the same proxies repeatedly, and thus the proxy series have been well peeked at and tested over a calibration and verification period. Also, since the blade of the hockey stick can be rather linearly trending, as is the instrumental temperature record, finding a candidate proxy series with a quick peek is a rather easy task.

Selecting proxies based on non detrended correlations can rather easily be shown to be trend matching a proxy to the instrumental record within certain limits and that those matches can happen by chance. I would think that those people advocating for screening by correlation would have to have plausible deniability that that matching can occur by chance. Wiggle matching on proxy and temperature using detrended series could provide reasonable correlations but not provide a good thermometer which would require trend matching.

Finally, even if one were to accept that screening by correlation “works”, the screening threshold used, for example, in Gergis (2012) and in Mann (08) gives an average correlation that, when squared, would indicate that on average approximately 10% of the proxy variation could be attributed to temperature. How well would one expect that relationship to translate back in time, particularly if one admitted the obvious: that the proxy variation in the instrumental period is only 10% due to temperature, and that for the other 90% we neither know what is affecting the proxy response nor model it? In some other time period, might that relationship of 10% temperature and 90% unknown change?

I find much of the foundation of temperature reconstructions as currently applied to be based on wishful thinking, and I think that thinking is epitomized by Nick Stokes’ discussions at these blogs.

Nick: “But all these modelling efforts show reconstructions which, with whatever faults, certainly have to do with temperature prior to the calibration period.”

No they don’t. Unless you are saying that because the recent instrumental record is used for calibration that the proxies somehow take on this characteristic of temperature. Let’s be specific in the present case: tree rings have to do with tree ring growth. Period. Saying they have to do with temperature is assuming the very issue in question. They might be a valid temperature proxy, but that needs to be established independently and is currently very much in question. Simply screening them with a recent instrumental record does not magically move them from being in the category ‘not-about-temperature’ into the category ‘about-temperature.’

“The real screening fallacy is the assumption that your reconstruction has anything to do with temperature prior to your calibration period. ”

I don’t think that’s right, we routinely make assumptions that allow us to move from a sample to a wider population.

The screening fallacy is just a special case of any situation where you select data for use later in an analysis, and the selection process has unintended (or unreported) consequences for that subsequent analysis.

In this case we are selecting proxies based on correlation with a trial period, and we are showing that this has an unintended consequence when we use the subset to do a hindcast.

Another example that has been quoted is a pre-selection that predisposes the subsequent analysis to a particular outcome.

HAS, what you are describing as a counter example is not the same thing.

Specifically the problem here is using a multivariate proxy like tree rings, correlating them against temperature to “select” which are temperature sensitive, and assume that they remain correlated with temperature prior to the calibration period. Simply because you observe a correlation from 1950-2000 for example between a proxy and temperature is no basis for assuming that same correlation will persist from e.g. 0-1950 AD.

The example of sample being representative of a population is a quite different assumption. If you took a sample of a population today, you certainly wouldn’t necessarily assume it would be representative of the population present in 0 AD.

Yes I perhaps shouldn’t have used the word “population”, I was just making the point that this doesn’t mean you can’t make conditional inferences about the past based on what we know about today.

Agreed, HAS.

But you can’t (or shouldn’t) make inferences about the past based purely on the observation of the correlation of two variables. One obvious problem from the perspective of empiricism is there is no way to put an error bar on the uncertainty introduced by that assumption (without further measurement that is).

IMHO the screening process seems fine as long as one is certain that at least SOME of the trees in a series carry a relevant signal. It is the process of finding those trees.

If there is no signal, the screening method will produce what SEEMS like a signal from noise.

The method can not verify if there is a temperature signal, and since it is impossible to know whether such a signal exists in the series or not, the relevance of the method becomes a question of faith.

1) Assume some subset of the whole set of data contains signal.

2) Measure part of each element of the set looking for correspondence to an assumed signal.

3) Having identified the subset, summarize the results of that set.

4) Compare the cherry picked summary signal to the assumed signal.

5) Publish as “proof” of the signal.

What keeps me or anybody else from assuming a completely different “signal” in step two and following the procedure to prove a contrary position? Is it the claim that, given 1000 proxies and a model completely inverting the instrumental anomalies trend, I couldn’t find and cherry pick a subset of proxies that could be calibrated to the non-physical, deliberately falsified, signal? What would make it harder to match a false signal to red noise than match the instrument record to that noise?
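The five steps above can be tried on series that contain no signal at all. The following is a minimal R sketch under stated assumptions: the “proxies” are pure AR(1) red noise, the rising target is a made-up stand-in for an instrumental record, and the AR coefficient, the 0.5 threshold, and the names `screen_mean`, `target_up` and `target_down` are all hypothetical choices for illustration.

```r
# Screening pure red noise against a target and against its inverse.
# No series here contains any temperature information.
set.seed(42)
n_years <- 500
n_prox  <- 1000
cal     <- 401:500      # assumed calibration window

# Pure AR(1) red noise, one column per pseudo-proxy.
prox <- replicate(n_prox, as.numeric(arima.sim(model = list(ar = 0.9), n = n_years)))

target_up   <- seq(0, 1, length.out = length(cal))  # assumed rising "instrumental" record
target_down <- -target_up                           # deliberately inverted record

screen_mean <- function(target, r0 = 0.5) {
  r    <- as.vector(cor(prox[cal, ], target))   # calibration correlation per proxy
  keep <- r >= r0                               # step 2: screen on correlation
  list(n = sum(keep), comp = rowMeans(prox[, keep, drop = FALSE]))  # step 3: summarize
}

up   <- screen_mean(target_up)
down <- screen_mean(target_down)

# Step 4: compare each screened composite with the signal it was screened against.
print(c(kept_up   = up$n,
        kept_down = down$n,
        r_up      = cor(up$comp[cal], target_up),
        r_down    = cor(down$comp[cal], target_down)))
```

In this toy setting both targets should find a sizeable subset of “proxies”, and each screened composite should correlate strongly with whichever target was used to select it, which is the point the question above is making.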

ottar: “IMHO the screening process seems fine as long as one is certain that at least SOME of the trees in a series carry a relevant signal. It is the process of finding those trees.”

What signal? An up signal? A down signal? A steady signal? A zig-zag signal? The whole problem is that, by definition, you don’t know beforehand which proxies carry valid signals.

This kind of thinking is what led Mann to (i) think he had published a valid reconstruction in the first place, when it was just a statistical artifact, and (ii) hide the decline when the chosen proxies stopped carrying the signal. He went looking for a specific signal, found some proxies that carried it. Then in later years when the proxies stopped carrying the signal he wanted to see, he simply dropped them because (so he convinced himself) they were no longer carrying the signal.

ottar:

This was mentioned on Lucia’s blog too. The question of whether it is temperature or not gets answered by employing non-tree-ring proxies for which we have confidence that the proxy itself relates to temperature, and for which there is a real physical model relationship between the proxy value and temperature, and for which the real uniformity principle can be applied (and being the real version instead of the “weak uniformity principle” = junk science version, you almost don’t need to mention).

Moberg 2005 suggested a method based on using low-frequency pure temperature proxies to “anchor” the reconstruction on, then use tree-ring proxies to fill in geographically and spatially.

This method was demonstrated by von Storch et al (2009) to not have the loss of variance suffered by Mann 2008’s CPS algorithm.

My take home is — we can and should move on from the provably flawed junk-science method screening using calibration. It’s an inconvenient result for some, but science is like that some days.

By the way, there is a discussion that has gone on over the validity of Moberg’s reconstruction.

I would suggest that it would be helpful if that issue be kept separated from the question of the validity of the methodology he proposed, unless the purpose of that is to demonstrate that his methodology is not workable. (I.e., if you disagree with the use of a particular proxy selection, that doesn’t invalidate the methodology, even if it might invalidate his result.)

clt50: “The question of whether it is temperature or not gets answered by employing non-tree-ring proxies for which we have confidence that the proxy itself relates to temperature..”

Translation: ‘argument by association has more scienceness than blatant correlation = causation.’

Pretend scientists arguing pretend science. Incredible.

Nick’s question, “But how does it bias what you want, which is the pre-training signal?” exhibits everything that’s wrong with Nick’s view of proxythermometry, and in fact everything that’s wrong with proxythermometry itself: there is no “pre-training signal.”

All these complicated statistical vaporings have zero scientific content. Proxythermometry as currently practiced has no scientific content.

The whole field is a physical crock, Nick’s view is a physical crock, and the statistical manipulation of proxies, including “training,” is a physical crock.

Pat: tell us what you really think!

Craig, I specifically exempt your focused work on methodology from that general condemnation. 🙂

“training signals” exist only when you have a priori information regarding the desired outcome, e.g., the preamble (aka unique word) in a packet of data that is always the same. An algorithm can synchronize to that preamble in order to aid parameter estimations and increase the detection capability. But that’s just it – the preamble is known a priori. Not so with tree rings. That’s what you DON’T KNOW and you are trying to find out. It biases what you want simply by choosing what you want based on post hoc correlation.

What a joke.

Mark

Excellent post Roman. However I have two comments. These are all great exercises but they all use pseudo-proxies and modelled temperatures, not real world data. In the real world there is no way of getting around the problem of having only 100-150 yrs of instrumental records for calibration (with luck!).

So a person dealing with a temperature reconstruction in a particular area will have to work with what’s available. He/she may see these exercises as good examples of potential uncertainties in the final reconstructed time series, but will go ahead anyways and select the proxies that best match his instrumental data. There is no way to know which proxies will be biased in earlier centuries!

If you have a large set of proxies and do not select them somehow during calibration the resulting model will likely end up being very poor/non significant. Most people will not spend much time with (or trust the results from) models that show no reasonable skill during the calibration period.

One way around the problems you discuss here may be to always validate the model using data that had been excluded during calibration. To be considered acceptable, the model has to match relatively well the instrumental data during the calibration AND validation periods. This is what most (if not all) dendro-climatological studies do prior to developing the reconstructions. Given the lack of temperature data during earlier centuries, what other option(s) would you suggest?

The impact of collecting data in a particular manner or of a particular method for calculating estimates is not always obvious. In the case of selecting proxies, my intent was to make people aware that correlation selection could also have an impact on proxies which did contain genuine temperature information. Knowing that a problem exists motivates one to find a way to accommodate for that specific problem.

As far as a possible solution goes, IMHO, one way to remove “false” proxies would be to have them “deselect” themselves during the calculation of the past temperatures. E.g., construct the past temperature so that they relate to a proxy in the same way as the calibration temperatures relate to that proxy. This could reduce the effect of distorted information in the calibration period and could possibly de facto eliminate false proxies by assigning them a low weight towards the reconstruction.

I was also wondering if a poor proxy could be deselected by poor correlation to the other proxies or perhaps to a reconstruction done without that proxy. However, it seems to me that a poor proxy that is purely noise gets “deselected” by being averaged with other noisy proxies (at a cost of greater error bars).

If the ALL reconstruction is better than the SCREEN reconstruction, what would a reconstruction made from the proxies which FAILED the screen look like? Presumably it will be worse during the calibration period. Since the FAILED proxies do contain a temperature signal, it might be possible to get a reconstruction from the FAILED proxies that is just as “reliable” as the reconstruction from the SCREENED proxies. The absurdity of any selection scheme might best be demonstrated by the signal present in the discarded proxies.

Imagine a collection of pseudo-proxies that contain different amounts of noise: say 20% with 1:1 signal to noise; 20%, 2:1 S:N; 20%, 4:1 S:N; 20%, 8:1 S:N; and 20%, 16:1 S:N. A sensible screen (with or without detrending) would select for the pseudo-proxies with higher S:N while minimizing suppression of the signal amplitude in the shaft/reconstruction period.

Roman,

I’ve only fairly recently tried to run your code. I’m interested in the question of bias. However, I thought I’d investigate my earlier hypothesis that the apparent increase in bias was due simply to the lower proxy numbers after screening. So I replaced bigcor in your code with a random selection of approximately the same number. And I got very similar plots.

Example.

Don’t know why you’d think number of samples would matter. Could you explain your intuition on this?

I’m pretty sure it’s just the bias in scale that RomanM talks about creeping in, just manifesting itself as an apparent baseline shift, and I don’t see why the number of samples would affect the apparent bias.

Yes. I think that the reconstruction converges on the black temp as the number of proxies goes to infinity. I don’t know why it converges from above – I’ve been a bit distracted with that. But the selected set is about 24 out of 59. I’m guessing convergence goes as sqrt(N). And that looks like what is happening.

If so, then it shouldn’t matter how you select the 24. And it doesn’t seem to.

No, it won’t converge to the black line. Sorry to be harsh, but do you understand anything?

Like I have that much to say, having just figured out how to get wordpress to report my name instead of my user name on this blog. 😉

But seriously, Nick. You’ve completely missed a very basic point if you think it should converge. Since there is a descaling, it will never reach it.

Correction (to my erstwhile correction, I was right the first time): it still fails *unless* you have a very high SNR.

This is the numerical experiment by RomanM I was discussing.

You can see the effect of SNR on the bias at retention of 100% of the data in my Monte Carlo.

Well, I won’t argue the detail of that at the moment. But more proxies should get closer. And that seems to be happening.

But you haven’t addressed the main point. The green/red discrepancy seems to be a function of subset size only. Here’s a plot selecting respectively the first 4,8,16,32 proxies.

I’m not sure why you think that’s the “main point” Nick. I think I missed part of the plot here. At this point, I’m not even sure exactly what it is you are manipulating. Are you using different subsets of the same 59 proxies?

Carrick,

Yes. Instead of choosing the subset of 24 by correlation, I simply choose the first 4, or 8 etc. I also tried choosing them in random order, and choosing just the first 24. So it’s the main point because you get a bias similar to selection by selecting in a way that has nothing to do with the temp curve.

Nick, you just can’t read and understand the mathematical concepts in the post which explain the effect.

I took the temperature series and made new proxies using that series and simple white noise. Here is the function:

`Simulate = function(nrun, temp, prox.sd = .5, corr.cut = .12, caltime = c(1901,1980), runComp = T){`

`# each column starts as a copy of the temperature series`

`proxmat = matrix(rep(temp,nrun), nrow = length(temp))`

`# add independent white noise with standard deviation prox.sd to every proxy`

`proxmat = ts(proxmat + matrix(rnorm(length(temp)*nrun, sd = prox.sd), nrow = length(temp)), start = tsp(temp)[1])`

`list(proxies = proxmat, compare = compare.screen(proxmat, temp, caltime, corr.cut))}`

I ran the script with increasing sample sizes

Look at the MSEs:

`Simulate(100,nh.csm,1,.25)$compare$mse`

all screen

0.01036564 0.06406199

`Simulate(1000,nh.csm,1,.25)$compare$mse`

all screen

0.001047594 0.047505437

`Simulate(10000,nh.csm,1,.25)$compare$mse`

all screen

0.0001561857 0.0439037337

`Simulate(100000,nh.csm,1,.25)$compare$mse`

all screen

1.074715e-05 4.228215e-02

The resulting graph for n = 10000:

The “All” case overlies the temperature curve, but the “Screened” (a sample of about 5000) is nowhere near.

Yes, both converge towards the temperature series as the reconstruction bias decreases, but the selection bias from the screening remains.
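Since `compare.screen` and `nh.csm` come from the head post and are not reproduced in this comment, here is a self-contained R sketch of the same kind of convergence check. It makes several assumptions: the synthetic temperature series is a made-up stand-in for `nh.csm`, `compare_sketch` is a hypothetical simplification of `compare.screen`, and only a “scale”-type reconstruction (matching the calibration mean and standard deviation) is implemented.

```r
# All-proxy vs correlation-screened reconstruction as the number of proxies grows.
# Illustrative stand-in only; not the head post's compare.screen().
set.seed(7)
yr   <- 1000:1980
# Hypothetical temperature series standing in for nh.csm.
temp <- 0.25 * sin(2 * pi * (yr - 1000) / 400) + 0.35 * sin(2 * pi * yr / 70)
cal  <- yr >= 1901                    # calibration period 1901-1980

compare_sketch <- function(nrun, prox.sd = 1, corr.cut = 0.25) {
  # proxies = temperature + white noise, one column per proxy
  prox <- temp + matrix(rnorm(length(temp) * nrun, sd = prox.sd), nrow = length(temp))
  keep <- as.vector(cor(prox[cal, ], temp[cal])) >= corr.cut
  recon <- function(p) {
    comp <- rowMeans(p)
    # "scale" step: match the calibration-period mean and sd of the temperature
    (comp - mean(comp[cal])) / sd(comp[cal]) * sd(temp[cal]) + mean(temp[cal])
  }
  c(kept   = sum(keep),
    all    = mean((recon(prox) - temp)^2),
    screen = mean((recon(prox[, keep, drop = FALSE]) - temp)^2))
}

res <- sapply(c(100, 1000, 5000), compare_sketch)
print(res)
```

Under these assumptions the “all” MSE should keep shrinking as proxies are added, while the “screen” MSE should level off well above it, in line with the pattern in the numbers quoted above.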

Roman,

Were those numbers using “scale” or “regress”?

It may well be that the asymptotics for very large numbers of proxies are different.

But can you demonstrate a difference for a Gergis size sample?

It seems to me that at the very least, in presenting your loess plot in the head post you should show to what extent, if any, the difference you show is due to reduced sample size rather than to the selection effect. At the moment your post implies that it is all due to selection.

[RomanM: Hey, I just wrote the script after I got up this morning! It used the default “scale”. It is easy to alter the program to allow a choice of type if you wish. Give me a break, the post was long enough as it was. If you think it was wrong in any way, please indicate where.]

Roman,

Just checking the example you showed in the head post. For the “regress” method, you got MSE’s of

0.1329907 for All and 0.1855703 for 24 selected proxies

But if I choose just the first 24 proxies on the list (effectively random) I get a MSE of 0.2303739

In other words, 24 selected proxies do much *better* than a randomly chosen subset of 24.

So? That’s what selection methods do. They reduce the sample size. To say “It’s not fair. The *All* sample had more observations!” is silly.

What we are comparing is the results you get with all the data to what we get with a sample reduced by correlation selection. That’s what happens in “real life”.

Roman,

I guess we’re equally handicapped – it’s midnight here 😦

“If you think it was wrong in any way, please indicate where.”

But that last comment has it. Using the 59 NCAR-based proxies, I’m finding that the 24 proxies selected by correlation do better than 24 at random, though the example of 1:24 seems to be at the higher MSE end of things. So on that basis, I don’t think you can say selection is causing bias.

Though I’ve run your Simulate() function and indeed, it shows the contrast that you describe, even for 59. So I’m not sure what makes the difference between the NCAR proxies and white noise.

Roman,

“That’s what happens in “real life”.”

No, we’re supposed to be talking about a selection fallacy, not a size reduction fallacy. But in real life you don’t get 59 equally good proxies. Gergis cut to 27 because the rest she couldn’t calibrate.

But in any case, the fact is that Gergis presented an analysis with 27 proxies. You’re saying that, if all the remaining proxies had been equally good, she might have got a better analysis by using them all (if she could have). That doesn’t say that there was anything wrong with the analysis she actually did.

Let me understand this. If she *introduced* unnecessary biases into the methodology and at the same time avoided a possibly better analysis:

“That doesn’t say that there was anything wrong with the analysis she actually did.”

I guess that it all depends on the meaning of the word *wrong*…

No, she didn’t introduce bias. Unlike your toy example, she had an inhomogeneous collection of proxies, many of which were quite useless for temperature. She found 27 she could use. You have not shown that her analysis of those 27 was in any way damaged by the selection process.

To give an example, the collection included two Callitris collections from WA and NT. These are warm dry areas and were collected as part of a historic rainfall study. There is no real reason to expect them to be good temperature proxies, and she could have eliminated them on that basis. But she decided to check. Unreasonable? I suspect that if she had eliminated without checking, people would be hollering about that.

Leading Australian coral specialists have stated that the secular increase in coral O18 is due to precipitation not temperature. Australian coral specialists Hendy and Gagan criticized the use of coral O18 as a temperature proxy by Mann et al 1998 and Jones et al 1998, but their criticism has been ignored. It is ironic that Gergis and Karoly have continued to ignore the caveats of coral professionals.

Does anyone know whether Gergis or Karoly actually have relevant specialist experience? I don’t recall seeing Karoly in paleoclimate publications previously. Gergis seems to be a geographer and activist and not a statistical specialist.

Nick. How did she determine they were useless?

Nick, it is absolutely *amazing* that you can argue (depending on the circumstances) what we don’t or can’t know about what someone was thinking or did, and then (in other circumstances) you can tell us EXACTLY what someone knew and did.

It is so transparent an effort to just *spin things* that I think your contributions have become quite useless for furthering any honest discussion.

Roman, I think Nick has shown that the bias is a function of the number of proxies used in the reconstruction. Your response is to point out that if screening reduces the number of proxies used, it does not matter that the bias is just a consequence of the reduced number of proxies, the bias has still been introduced, which is a fair point. What seems to be missing from this discussion is attention to the range of Signal to Noise Ratios (SNRs) in the proxies.

I believe Nick has shown that with the proxies being used, with a slight range in SNR, the selected proxies perform better than an equal number of random proxies. Intuitively, the greater the range of SNR in available proxies, the better selected proxies will perform relative to an equal number of randomly selected proxies. Conversely, if all proxies have the same SNR, selected proxies will perform the same as randomly selected proxies.

The reason for this intuition is simple. If the SNR of all proxies is the same, then selection will only select based on a random factor which does not project into the past. On the other hand, if there is a great range of SNRs, the selection will preferentially select those proxies with a high SNR. Presumably that is a property that will project into the past, and hence can improve the reconstruction.

So, granted the correctness of this intuition, if the range of SNRs in available proxies becomes sufficiently large, do selected proxies ever outperform the full range of available proxies? My intuition is that they do, although how great the range of SNRs has to be for them to do so I cannot say. In fact, if they do not, inclusion of any data, including that with an SNR of 0, will always improve (or at least never degrade) the reconstruction, which is absurd.

So, would you expand your experiment by explicitly comparing reconstructions which only differ in the range (and distribution) of SNRs of the proxies?
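One way such a comparison could be set up is sketched below in R. This is only an assumption-laden starting point, not an answer to the question: the temperature series is synthetic, the noise levels are arbitrary, the reconstruction is a simple unweighted “scale” composite, and the names `recon_scale` and `mse_all_vs_screen` are hypothetical. The only thing that differs between the two runs is the spread of noise standard deviations across proxies.

```r
# Equal-SNR vs mixed-SNR proxies: all-proxy vs screened reconstruction MSE.
# Exploratory sketch; every setting here is an illustrative assumption.
set.seed(11)
yr   <- 1000:1980
temp <- 0.25 * sin(2 * pi * (yr - 1000) / 400) + 0.35 * sin(2 * pi * yr / 70)
cal  <- yr >= 1901                    # calibration period 1901-1980

recon_scale <- function(p) {
  comp <- rowMeans(p)                 # unweighted composite
  # match calibration-period mean and sd of the temperature
  (comp - mean(comp[cal])) / sd(comp[cal]) * sd(temp[cal]) + mean(temp[cal])
}

mse_all_vs_screen <- function(noise_sd, corr.cut = 0.25) {
  n     <- length(noise_sd)           # one noise sd per proxy
  noise <- matrix(rnorm(length(temp) * n), nrow = length(temp))
  noise <- sweep(noise, 2, noise_sd, "*")   # scale column j by noise_sd[j]
  prox  <- temp + noise
  keep  <- as.vector(cor(prox[cal, ], temp[cal])) >= corr.cut
  c(kept   = sum(keep),
    all    = mean((recon_scale(prox) - temp)^2),
    screen = mean((recon_scale(prox[, keep, drop = FALSE]) - temp)^2))
}

equal_snr <- mse_all_vs_screen(rep(1, 1000))
mixed_snr <- mse_all_vs_screen(rep(c(0.25, 0.5, 1, 2, 4), each = 200))
print(rbind(equal_snr, mixed_snr))
```

Varying the vector of noise standard deviations, the cutoff, and the seed would show how the comparison between “all” and “screen” shifts as the range of SNRs widens. A weighted composite might behave differently from the unweighted mean used here.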

I don’t believe that Nick has shown any such thing. In my test of convergence which looked at what happens when the number of proxies increases, non-selection performed better with 100 proxies than selection in the 100000 case (where selection used about 50000 proxies). The bias disappeared as the number of proxies increased when all were used, but hit a limit far from the temperatures when selection was involved.

Unfortunately, the flaw in this is that when the theoretical SNR of a proxy is the same for all proxies, the *apparent* SNR varies randomly when estimated over a temporal subset. One can see this by calculating correlation between proxies and the temperature series in a sliding window. There is no guarantee that what looks good in calibration remains so during reconstruction.

Yes, proxies with actual low SNR will be correctly selected without much problem. But the existence of a critical cutoff strongly affects what happens with proxies whose SNR places their correlations in a range which overlaps that cutoff. If this just added extra uncertainty to the end result, it would not be so bad. However, the contention in the post is that the selection procedure for such processes *also* favors the selection of proxies whose noise has a spurious positive correlation with temperature which has nothing to do with the magnitude of the noise. This increased correlation is a result of lower noise combining with lower temperature and higher noise with higher. Since the selection tends to reject those with a reversed pattern, the bias results.

Tom, can you please demonstrate where Nick has shown any of that? I’ve seen a lot of questions which seem to be (trying to) take the thread in different directions, but haven’t seen where he has “shown” either of the points you’re talking about. As always, I could be wrong – links welcome. Thanks!

Nick, you clearly haven’t read Gergis yet.

The proxies for temps at various locations in the SH were selected on the basis of their individual correlation with the SH construct (the target). She made no attempt to assess their joint contribution to the target during the instrumental period, but it was to their joint contribution that she subsequently turned to do her hindcast.

Even putting aside the issues being discussed here, by this step alone she biased the results.

BTW you’ll be pleased to know that the use of WA & NT Callitris isn’t a totally lost cause http://ic.ltrr.arizona.edu/pp/td1.pdf

Mosh,

“Nick. How did she determine they were useless?”

No significant observed correlation. What can you do?

Tom Curtis

“it does not matter that the bias is just a consequence of the reduced number of proxies, the bias has still been introduced, which is a fair point.”

Well, not really. I’ve been trying, still with little success, to get a statement of what the selection fallacy is and what bad effects it might have. If it comes down to:

if you select 27 proxies from 62 based on calibration correlation, then you’ll have 27 proxies

then OK, I’ll concede, there is a selection fallacy.

HAS,

Yes, of course the proxies have to be aggregated at some point to create an index. But there should be no claim that that index is some kind of figure representative of the 62 that were under consideration at one point of the process.

Interesting Callitris study – there were a lot more than just WA and NT.

TerryMN, Nick shows that he can reproduce the bias by random selection of fewer proxies here:

https://climateaudit.org/2012/06/17/screening-proxies-is-it-just-a-lot-of-noise/#comment-338529

(Follow link in post for graph).

He points out that proxies selected based on correlation over the calibration period perform better than an equal number of proxies selected at random here:

https://climateaudit.org/2012/06/17/screening-proxies-is-it-just-a-lot-of-noise/#comment-338559

Roman also shows that increasing the number of selected proxies improves the reconstruction, in that by increasing the total number of proxies, he increased the number selected by correlation, here:

https://climateaudit.org/2012/06/17/screening-proxies-is-it-just-a-lot-of-noise/#comment-338549

Unfortunately he did not indicate how many proxies were selected, which is relevant.

In all, what has been clearly shown is that reducing the number of proxies degrades the reconstruction. It has not been shown that reducing the number of proxies by selection based on correlation during a calibration period causes a greater degradation of the reconstruction than simply reducing the number of proxies by random selection. In fact, evidence has been presented that the opposite is the case. Therefore, based on this analysis to date, it is the reduced number of proxies alone, not the selection process, that is degrading the reconstruction. Indeed, the selection process is partially counteracting the degradation due to reduced proxies.

Everything from the second sentence of my second paragraph of my first post has not been shown, as yet. It simply describes and attempts to motivate my intuitions on the subject, with the hope that Roman (or Nick) will set up the appropriate experiments to test them.

[RomanM: The selection rate was about 50% in the example. I don’t have time today to look into it, but perhaps I can tomorrow.]

TerryMN,

“where Nick has shown any of that?”

I went back and did a little Monte Carlo. Roman’s result for “regress” was:

MSE(All, Regression) = 0.1329907

MSE(Screen, Regression) = 0.1855703

I ran 100 runs each time selecting randomly 24 proxies from the 59. I found, averaging:

MSE = 0.20755633 ± 0.02023938

So the screened 24 is not significantly different from this average of randomly chosen groups.
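A minimal sketch of this kind of Monte Carlo check, assuming synthetic data rather than the pseudo-proxies from the post: the AR(1) temperature series, the function names, and the "regress temperature on the proxy average" reconstruction are all illustrative assumptions, not Roman's or Nick's actual code. Only the counts (59 proxies, 24 kept, 100 random draws) and the noise sd of 2.5 are taken from the thread.

```python
import random
import statistics


def make_temp(n_years, rng):
    """Hypothetical AR(1) 'temperature' series standing in for the model output."""
    temp, level = [], 0.0
    for _ in range(n_years):
        level = 0.9 * level + rng.gauss(0.0, 0.2)
        temp.append(level)
    return temp


def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def recon_mse(temp, proxies, cal):
    """Average the proxies, regress temperature on that composite over the
    calibration slice, and return the MSE over the whole series."""
    comp = [statistics.fmean(col) for col in zip(*proxies)]
    x, y = comp[cal], temp[cal]
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return statistics.fmean((intercept + slope * c - t) ** 2 for c, t in zip(comp, temp))


def compare(n_proxies=59, n_keep=24, n_runs=100, noise_sd=2.5,
            n_years=500, n_cal=100, seed=1):
    rng = random.Random(seed)
    temp = make_temp(n_years, rng)
    proxies = [[t + rng.gauss(0.0, noise_sd) for t in temp] for _ in range(n_proxies)]
    cal = slice(n_years - n_cal, n_years)  # calibrate on the final n_cal years
    # Screening: keep the n_keep proxies best correlated with temperature in calibration.
    ranked = sorted(proxies, key=lambda p: corr(p[cal], temp[cal]), reverse=True)
    mse_all = recon_mse(temp, proxies, cal)
    mse_screen = recon_mse(temp, ranked[:n_keep], cal)
    # Control: the same number of proxies chosen at random, repeated n_runs times.
    mse_random = [recon_mse(temp, rng.sample(proxies, n_keep), cal) for _ in range(n_runs)]
    return mse_all, mse_screen, mse_random


if __name__ == "__main__":
    mse_all, mse_screen, mse_random = compare()
    print("MSE(all)      =", mse_all)
    print("MSE(screened) =", mse_screen)
    print("MSE(random)   =", statistics.fmean(mse_random), "+/-", statistics.stdev(mse_random))
```

Comparing the screened MSE against the spread of the random-subset MSEs is what separates the effect of having fewer proxies from the effect of how they were chosen. The outcome will depend on the assumed signal and noise.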

Nick @ Jun 18, 2012 at 10:29 PM

“Yes, of course the proxies have to be aggregated at some point to create an index. But there should be no claim that that index is some kind of figure representative of the 62 that were under consideration at one point of the process.”

Just assume (as you are often wont to claim) that the proxies were accurate thermometers so we have accurate temperatures at 3 locations in the SH back in the MWP.

Meanwhile I have defined a composite index for the SH temp for the instrumental period.

I come to you and say I want to hindcast that index back into the MWP and what I am going to do is to correlate each of my 3 thermometers one by one with the index during the instrumental period, and not use any that don’t correlate. What would you say?

This is a trick question because if you would say “don’t do that” I’m going to say that that is exactly what Gergis did (but more so).

And I’m further going to say that is an example of inappropriate selection leading to bias.

HAS,

It sounds like you’re talking about spatial issues there. Yes, the true thermometer is measuring a temperature different from the target temp, so there is a question of how well they are correlated. And there is the question of whether the three are sufficiently regionally representative.

So I would think a reconstruction based on such a small number would be problematic. So what’s the trick?

So, Nick, she screened them using correlation. And she assumed that screening would introduce no bias. Test that.

1. Let’s see the results with all the proxies.

2. Let’s investigate whether screening introduces bias with synthetic cases.

A. I want to see her screening test to confirm that she did it correctly.

B. I want to see the results with and without screening

C. I want to see a synthetic example where screening by correlation introduces no bias.

Easy peasy. Until then the decision to screen is an undocumented, untested, unverified procedure that may introduce bias and artificially reduce our uncertainty.

The uncertainty introduced by an unproven method swamps all other uncertainties.

Mosh

“B. I want to see the results with and without screening”

Well, you can’t, in her method. I’ve said it many times but again: if you can’t significantly relate a proxy to temperature, you can’t use it. You need that relation to proceed.

Nick,

1. Gergis’ failure has nothing to do with spatial issues, it is intrinsic in the experimental design.

2. If a reconstruction based on a small number of measurements would be problematic, the trick in the question is that that is exactly the number Gergis used for the MWP.

3. Steven Mosher can see the reconstruction with and without screening. Just put Law Dome into the mix and look at the MWP. Gergis was relating the proxies for locations in the SH to an index (no temperature), and it’s easy peasy to include another location’s proxy in the model (and test for its significance within it using standard techniques).

Nick, the thing to get your head around is that the correlation between the proxy and the index had nothing much to do with the subsequent analysis.

When I was young we had a very simple “artificial intelligence” programme called ELIZA (http://en.wikipedia.org/wiki/ELIZA). I suspect that CSIRO may have secretly been working over the last few decades to develop a more sophisticated version to keep climate bloggers entertained.

No content in the response just a chatterbot.

Whoops not “(no temperature)”; “(not temperature)”

Roman –

You wrote that the result of screening is that “the reconstructed temperature series is flattened toward the mean of the temperatures in the calibration period.” In your example https://climateaudit.org/2012/06/17/screening-proxies-is-it-just-a-lot-of-noise/#comment-338549 the calibration period has a decidedly different mean than the remainder of the time. The majority of the screened MSE is due to the difference between the in-sample and out-of-sample means (the smoothed plot shows about 0.2K difference on average). Perhaps you can separate the screened MSE into a component due to the mismatch of means, and one due to the reduction in variance. Looking at the Loess smoothed versions, it appears that the screened reconstruction has “wiggles” about 2/3 the amplitude of the actual temperature. But is the scale factor the same for the unsmoothed reconstruction? I think a comparison of the average of the (1/B) terms, between screened and unscreened, would show the compression factor. By varying the cutoff factor, you could derive a curve similar to Carrick’s, showing how the compression varies as a function of percentage of proxies retained.

Nick:

I think we would all agree that is correct…as far as it goes. It’s just that using correlation on a multivariate proxy to establish whether a proxy is a temperature proxy isn’t it. Even if we allow you the Nick/North weak uniformity principle/assumption, it still fails even when you have a very high SNR.

By the way, *I think* the issue you’re wondering about (random selection of proxies still giving a bias) is a separate issue from what Roman is talking about. It relates to a deflation in scale associated with the reconstruction method, rather than with the selection process.

I’ve said in other places we see the same deflation of scale in frequency regions with poor SNR using the correlation method for broad-band calibration. [*]

My suspicion is the central value of e.g. the regression slope in my Monte Carlo and the sensitivity in that other example have a bias in the presence of noise but that this bias depends on sqrt(E(n(t)^2)/N) [or sqrt(E(n(f))^2/N) for frequency based tests]. Just your effect, but I believe it’s separate from the one associated with screening/truncation using correlation.

Increasing the number of samples improves the SNR as 1/sqrt(N) just as you say it does.

So IMO there are two effects here; one has to do with a small number of samples and poor SNR. That’s why, when RomanM repeated the test with artificial proxies, the bias in the reconstruction eventually plateaued as he increased N.
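The 1/sqrt(N) behaviour is easy to check numerically: the noise left in an average of N independent proxies shrinks as 1/sqrt(N), so the SNR of the composite grows as sqrt(N). A minimal sketch, assuming independent white noise of unit variance (the function name and parameters are illustrative only):

```python
import random
import statistics


def composite_noise_sd(n_proxies, n_years=2000, noise_sd=1.0, seed=0):
    """Standard deviation of the noise remaining after averaging
    n_proxies independent white-noise series, estimated over n_years."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, noise_sd) for _ in range(n_proxies))
        for _ in range(n_years)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    for n in (1, 4, 25, 100):
        print(n, composite_noise_sd(n), "expected about", 1.0 / n ** 0.5)
```

This only covers the small-N effect. It says nothing about the screening bias, which by the argument in the post does not average away as N grows.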

[*] No, this is not the same as screening by correlation. It’s a well-established practice used in science and engineering where you correlate input versus output using broad-band noise to obtain the frequency response (transfer function) associated with a sensor. Curiously, in the frequency regions where we have the most problems, the noise is very “red”. Also somewhat curious, using median rather than mean seems to remove much of the bias.

I posted a follow-on response on the wrong sub-thread. It’s located here.

Mosh

“B. I want to see the results with and without screening”

Well, you can’t, in her method. I’ve said it many times but again: if you can’t significantly relate a proxy to temperature, you can’t use it. You need that relation to proceed.

#####

err no you don’t. You can and she should have shown the results with and without screening. Basically, she should show the sensitivity to the correlation.

Roman

“In my test of convergence which looked at what happens when the number of proxies increases,”

Roman, if I understand your test correctly, it uses a much higher S/N than your 59 proxies had. By my sums, csm.prox59 used in your head post seems to be created by adding to temp noise of sd 2.5. But in your later Simulate() you’ve added white noise with prox.sd=0.5.

I take the underlying point that with high S/N and large numbers of proxies, there is a bias that, while relatively small, can’t be removed by increasing proxy numbers further. Your post example used fairly realistic S/N and proxy numbers – is that true of the convergence test?

Well, I can test that easily enough with a much larger “N” set of pseudoproxies.

That’ll have to wait until tomorrow, since my night is getting late.

In the meantime I checked the spectral content of the 59 proxies. It’s a bit worrisome. Looks like low-passed white noise.

Wonder what gives with that, after Mann has already been dinged for not using realistic noise in his proxies in prior papers?

“Looks like low-passed white noise.”

aka “correlated” noise, the source of so many real-world “signals”.

rickb, low-passed white noise looks nothing at all like red noise, which has a 1/f^nu spectrum.

Many real world signals, including proxy noise, have a 1/f^nu spectrum.

It’s really amazing how many people have a reflexive need to defend bad science.

Thank you, Roman, for a post that needed to be written with the clarity you have used.

What follows below is not important, it’s OT but related. I have problems with correlation coefficients, often wonder if they are the best or even appropriate technique. With that in mind, I wrote up a short essay that deals only with the instrumented period and at only one site (Melbourne) using daily temperature data.

If you had the time and inclination, I’d be much obliged if you could make a short comment about correlation. I would not even consider using 0.12 as a correlation coefficient cut-off because to me, that’s still down in noise territory. Some of my reasons follow in this essay, the main one being that a calibration can be no better than the data against which calibrations are made, unless occasionally by chance:

I think the data is the reason for the bias. Looking at the smoothed plots in the post, the temperature series is decreasing over the last 30 years, while the proxies are increasing. Wouldn’t this have the effect of decreasing the slope of the linear transformations (B in the equations)? In playing with RomanM’s code, changing the region used for the correlation and the CPS call changes the bias. Using c(1801,1930) gives a better reconstruction by eye, and I think quantitatively as well. I added:

to the compare.screen function. Using the scaling method and looking at both correlation and variance (I don’t know if the variance measures the bias or not, I think lower variance would infer lower B value).

using 1901 to 1980 gives 24 screened proxies and:

0.64719306 0.50333904 0.07061145 0.06722973 0.03351538

and using 1801-1930 gives 40 screened proxies and:

0.64719306 0.65109188 0.07061145 0.07497447 0.05705736

1801-1930 and the first 24 screened proxies gives:

0.64719306 0.55836857 0.07061145 0.07497447 0.06238678

[RomanM: Yes, much of the reconstruction bias is likely related to the specific behavior of the proxies and temps of the 1900s. The two reconstruction methods used here have no way of correcting possible “errors” in the match-up.

It is also interesting to try other calibration periods, eg. 1300 to 1400. How well are the 1900s estimated? 😉]

Roman,

I’ve mentioned elsewhere that I don’t think the change in bias depends on your screening method. I think any subselection of 24 proxies from your set of 59 would give a similar result. It just reflects the lower numbers.

But in any case, I think you are selecting on the wrong criterion. You have selected by the correlation coefficient itself. I think you should be selecting based on the significance of the coefficient. Gergis et al used a 95% level (two-sided).

The magnitude of the correlation coefficient is not a figure of merit for the correlation. It is just a property of the proxy. Gergis had a mixture of treering and δ18O. The latter had negative correlation coefficients. But sorting these, even after normalising, doesn’t make much sense.

“The magnitude of the correlation coefficient is not a figure of merit for the correlation.”

Oops, late night brain fart there – I was thinking of the regression slope T vs proxy prop. Still, I think the sign issue is significant.

Nick, for a given sample size, the significance of the slope coefficient in a simple regression is just a monotonic function of the correlation coefficient (abstracting from serial correlation).

In a simple regression we have the following identity relating the t statistic to test that the coefficient is zero (with n-2 DOF), the regression F statistic to test that the coefficient is zero (with 1 numerator and n-2 denominator DOF) and the regression R^2, which as its name suggests is just the square of the correlation coefficient r between the two variables:

t^2 = F = (n-2) * R^2 / (1-R^2)

For a 95% 2-tailed test with 70 observations and hence 68 DOF, the t critical value is 2.00, corresponding to an F critical value of 4.00, and hence an R^2 of 4/72 = 0.0556. This corresponds to abs(r) = 0.236.

Hence, a correlation coefficient greater than 0.236 in absolute value is equivalent to a t statistic that rejects zero at the .05 level on a 2-tailed test.
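As a quick check of the identity above, solving t^2 = (n-2) * R^2 / (1-R^2) for r gives abs(r) = t / sqrt(t^2 + n - 2). A minimal sketch in Python (the function names are illustrative; the critical value is passed in rather than looked up, and serial correlation is ignored, as in the comment):

```python
import math


def r_threshold(t_crit, n_obs):
    """Smallest |r| whose regression t statistic reaches t_crit,
    for a simple regression on n_obs observations (n_obs - 2 DOF)."""
    dof = n_obs - 2
    return t_crit / math.sqrt(t_crit ** 2 + dof)


def t_from_r(r, n_obs):
    """t statistic implied by a correlation r with n_obs observations."""
    return r * math.sqrt((n_obs - 2) / (1.0 - r * r))


if __name__ == "__main__":
    r0 = r_threshold(2.00, 70)
    print("critical |r| =", r0)
    print("round trip t =", t_from_r(r0, 70))
```

Because the mapping between r and t is monotonic for fixed n, screening on the size of the correlation and screening on its significance pick out the same proxies once the cutoffs are matched.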

Yes, thanks, Hu. That’s what I realised after posting the initial comment.

What utter nonsense. There is absolutely nothing *sacred* about a value which tells you what proportion of the correlations of two normally distributed random variables will lie above that particular value.

I have described an effect which takes place when one selects proxies based on screening them by comparing a correlation to a fixed number. The *magnitude* of that effect may vary depending on what number has been chosen, but the effect is real and can have an impact on the final reconstruction.

RomanM:

I had done a Monte Carlo analysis before I actually saw your post. (See my comment on Lucia’s blog for details.)

I generated my own proxies; that way I could make N large (I used N=10000). The noise is based on the assumption that the tree-rings act like real temperature proxies, and I used an algorithm I had already developed for generating “realistic” temperature noise.

By the way, as I mentioned above, it looks to me like there is an issue with Mann’s proxies: the spectral characteristics fail to match those of real tree-ring proxies; instead they look like low-pass filtered white noise.

Anyway, for illustration, here is one of my sequences. I just used a “hockey stick” temperature signal. That’s easy enough to change. I’m using linear regression to compute the slope for 1950-2000, and then just plotting the mean value of the slope versus percent of data retained using the r ≥ r0 criterion.

Figure.

As you said “The magnitude of that effect may vary depending on what number has been chosen, but the effect is real and can have an impact on the final reconstruction.”
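A rough sketch of how a "mean slope versus fraction retained" curve like this could be produced. It is not Carrick's code: the linear ramp standing in for the 1950-2000 blade of the hockey stick, the white noise, the noise level, and every name below are assumptions made for illustration.

```python
import random
import statistics


def slope_and_corr(temp, proxy):
    """OLS slope of proxy on temperature, and their Pearson correlation."""
    mt, mp = statistics.fmean(temp), statistics.fmean(proxy)
    stp = sum((t - mt) * (p - mp) for t, p in zip(temp, proxy))
    stt = sum((t - mt) ** 2 for t in temp)
    spp = sum((p - mp) ** 2 for p in proxy)
    return stp / stt, stp / (stt * spp) ** 0.5


def slope_vs_retention(n_proxies=2000, noise_sd=0.5,
                       thresholds=(-1.0, 0.3, 0.5, 0.7), seed=0):
    """Return (r0, fraction retained, mean slope of retained proxies) per threshold.
    Every proxy has a true slope of exactly 1 against temperature."""
    rng = random.Random(seed)
    temp = [i / 50 for i in range(51)]  # assumed ramp over a 51-year calibration window
    stats = []
    for _ in range(n_proxies):
        proxy = [t + rng.gauss(0.0, noise_sd) for t in temp]
        stats.append(slope_and_corr(temp, proxy))
    rows = []
    for r0 in thresholds:
        kept = [b for b, r in stats if r >= r0]
        mean_slope = statistics.fmean(kept) if kept else float("nan")
        rows.append((r0, len(kept) / n_proxies, mean_slope))
    return rows


if __name__ == "__main__":
    for r0, frac, mean_slope in slope_vs_retention():
        print(f"r0={r0:5.2f}  retained={frac:6.1%}  mean slope={mean_slope:.3f}")
```

With r0 = -1 nothing is screened out and the mean slope sits near the true value of 1. Raising r0 keeps only proxies whose noise happened to line up with the ramp, so the mean estimated slope of the survivors drifts upward.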

Hi Roman, this may end up being a duplicate my previous comment got held up in moderation.

I had done a Monte Carlo analysis before I actually saw your post (see my comment on Lucia’s blog for details), and it agrees with your statement “The magnitude of that effect may vary depending on what number has been chosen, but the effect is real and can have an impact on the final reconstruction.”

Just an update, to my comment, DeWitt had pointed this problem out back in 2010 on Jeff Id’s blog.

link.

He also has a comment on Lucia’s blog.

It is rare to find a researcher who consistently muffs as many attempts as Mann does, yet is simultaneously (apparently) held in such high regard in his own community. In fact I can’t think of a single example.

Steve: That was a good post by Jeff.

I also looked at the spectral characteristics of Mann’s proxies, the spectral characteristics fail to match those of real tree-ring proxies: they look like low-pass white noise.

Your effect doesn’t depend on the noise characteristics, but the noise characteristics certainly affect what the reconstruction looks like outside of the calibration region.

[RomanM: You are absolutely right. Since the effect of the screening comes from the magnitude of the correlation between the noise and temperature, one would expect that any proxy whose noise exhibited pronounced autoregressive behavior would be more likely to be affected by the screening]

Nick,

I don’t know how to use the tools, but here is a suggestion for how you can convince yourself of the selection fallacy. Do the following.

Use Steve McIntyre’s data to select well correlated data based on a fifty year period. Then, from that same fifty year period, calculate the response to temperature. Next, using the same proxies, calculate the response to temperature during a different fifty year period. Create reconstructions for both. What should happen is the reconstruction determining the signal response from a different fifty year period should be more predictive of actual values than the one using the correlation period. I wish I knew how to do this myself, incidentally, but don’t.

If you do this for various numbers of proxies, you should find that using two different time periods to determine correlation and signal impact will give you better results.

The mean coefficient bias Roman discusses here is a valid consequence of screening, but is a different issue than most of us (at least me) had in mind.

The “screening fallacy” (or wheelbarrow effect or data mining effect, etc) I have in mind is the distortion of significance levels and predictive confidence intervals that comes from pre-screening the data.

When the ill-fitting proxies (or data points) are culled out, the remaining proxies will fit temperature far better than they could have just by chance. If they in fact have no true explanatory power, they may still appear to be collectively significant. And even if they are valid proxies, they will fit the instrumental temperature deceptively well, and hence give too-small reconstruction confidence intervals.

Ironically, the bias Roman has in mind can actually work to offset an opposite bias in “ICE” calibration estimates, that I mentioned above at https://climateaudit.org/2012/06/17/screening-proxies-is-it-just-a-lot-of-noise/#comment-338436 :

When exogenous temperature is naively regressed on the endogenous proxies (Inverse Calibration Estimation), the resulting OLS coefficients are biased toward zero, and in fact are inconsistent so that this bias would not go away even with an infinitely large calibration sample. But if only strongly positive proxies are retained, their coefficients will be biased back upwards for the reasons Roman gives.

However, although the two biases work in opposite directions, there is no reason to think they will cancel out, so that pre-screening is not a valid cure for ICE bias. Two wrongs make a right only by accident.
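The ICE attenuation described above can be seen in a small simulation. This is a minimal sketch under assumed conditions (one proxy with true slope `beta` on temperature, unit-variance Gaussian temperature, independent Gaussian noise; all names are illustrative). It compares the coefficient from regressing temperature on the proxy (ICE) with the inverted coefficient from regressing the proxy on temperature (CCE):

```python
import random
import statistics


def ols_slope(x, y):
    """OLS slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)


def ice_vs_cce(n=5000, beta=1.0, noise_sd=1.0, seed=0):
    """Return (ICE coefficient, inverted CCE coefficient); the true value is 1/beta."""
    rng = random.Random(seed)
    temp = [rng.gauss(0.0, 1.0) for _ in range(n)]
    proxy = [beta * t + rng.gauss(0.0, noise_sd) for t in temp]
    ice = ols_slope(proxy, temp)        # temperature regressed on proxy
    cce = 1.0 / ols_slope(temp, proxy)  # proxy regressed on temperature, then inverted
    return ice, cce


if __name__ == "__main__":
    ice, cce = ice_vs_cce()
    print("ICE coefficient:", ice)
    print("CCE coefficient (inverted):", cce)
```

Under these assumptions the ICE coefficient converges to beta*Var(T) / (beta^2*Var(T) + noise variance) rather than 1/beta, so a longer calibration sample does not remove the shrinkage.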

The issue I have is a different one: if you have a multivariate proxy (one that is known to strongly react to multiple stimuli), and you select for proxies that have large correlations during the correlation period and label these as “temperature proxies”, how do you compute the uncertainty associated with whether your proxies remain temperature limited outside of the calibration region?

Without additional information, it seems like an unmeasurable uncertainty.

Isn’t that a classic example of a flawed methodological design, where you end up with a quantity with an unquantifiable uncertainty?

No fair bringing in a real statistician! It is amazing to me how many people can not (or refuse to) see this. Ignoring these problems (and upside down proxies, and small Yamal sample sizes, and and and…) simply builds a house of cards that is only convincing to the builders. Hint to advocates: if you want to convince anyone but yourselves, be extra conservative and careful in your work.

When one uses proxies other than tree rings (corals, ice cores, sediments), they tend to both be not annual and to have dating error. The effect of dating error is to flatten peaks and valleys in the multi-proxy reconstruction even if the coefficients in the individual proxy reconstructions are exactly correct. This becomes relevant in many reconstructions such as Moberg and the recent Gergis ms, but is never considered.

see: Loehle, C. 2005. Estimating Climatic Timeseries from Multi-Site Data Afflicted with Dating Error. Mathematical Geology 37:127-140.

I can send copies cloehle at ncasi dot org
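The flattening from dating error can be illustrated with a toy composite. This sketch is not taken from the Loehle paper: the Gaussian-shaped peak, the uniform integer dating offsets, and the function name are all assumptions. Each proxy records the same peak perfectly but is misdated by a random number of years before averaging.

```python
import math
import random
import statistics


def peak_after_dating_error(n_proxies=50, max_shift=30, n_years=1000, seed=0):
    """Return (true peak height, peak height of the misdated composite)."""
    rng = random.Random(seed)
    # One warm excursion centred on year 500 with unit amplitude.
    signal = [math.exp(-(((t - 500) / 20.0) ** 2)) for t in range(n_years)]
    shifted = []
    for _ in range(n_proxies):
        shift = rng.randint(-max_shift, max_shift)  # dating error in years
        shifted.append(
            [signal[min(max(t + shift, 0), n_years - 1)] for t in range(n_years)]
        )
    composite = [statistics.fmean(col) for col in zip(*shifted)]
    return max(signal), max(composite)


if __name__ == "__main__":
    true_peak, composite_peak = peak_after_dating_error()
    print("true peak:", true_peak, "composite peak:", composite_peak)
```

Every individual series here has exactly the right amplitude, yet the composite peak comes out lower because the misdated peaks do not line up.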

Roman —

I’m not fluent in R, but from the following it looks to me like your “regress” results use “ICE” (T on a proxy composite) rather than “CCE” (proxies on T), despite the discussion in your text which is based on CCE.

Since a scale-matching recon is just the geometric mean of ICE and CCE, this could explain why your “all regress” reconstruction is more attenuated than your “all scale” reconstruction.

Hu, I know that. I intentionally did the example using the methods that were described in the Mann paper (where I assumed that Temperature was the dependent variable) to avoid arguments on using an “inappropriate” reconstruction technique.

However, in the math discussion, I used the more realistic direction from which the effect could be more adequately deduced (proxy on T).

Yes, I see now that you already said that in your numerical examples you regress temperature on proxies, at https://climateaudit.org/2012/06/17/screening-proxies-is-it-just-a-lot-of-noise/#comment-338469 .

How would your graphs look using CCE instead?

I have added an update to the post (hopefully) in order to clarify the specific issues involved in the post.

Roman, when you say B = Correlation(T, Yk) * sqrt(Variance(Yk(t)) / Variance(T(t))), don’t you mean B to be what you later call Bk? Each Bk is proportional to its own correlation but with a different proportionality constant. I’m not clear how you conclude what B itself is proportional to.

Thanks, Karl Kruse

Roman, let Pk be the proportionality constant relating Bk to Correlation(T,Yk). If Bk = Pk*Correlation(T,Yk) = B + Bek, then specifying a threshold for correlation is not the same as specifying a threshold for Bek.

Karl Kruse

Sorry, my writing was a little sloppy.

I should have used the generic model proxy Y rather than Y_k. B is the theoretical quantity which represents the slope when noise is absent.

The correct statement is B = Correlation(T, Y) * sqrt(Variance(Y) / Variance(T))

When a specific proxy Y_k is (randomly) selected from the population, the calculated coefficient B_k = B + B_{e,k} includes the noise from that proxy.

If the Y’s have been standardized, then obviously it is the same. However, you may have a point in the situation when they have not been so, and that bears some further looking into.

Using only a simple correlation between tree rings and temperature during the so-called training period is like saying nutrients, moisture, sunlight, CO2, and anything else which affects growth all remained static during the training period. We know this isn’t true for CO2 since it has been rising since just prior to the training period. It seems to me, not removing the effect of the increasing CO2 signal during training means the stability of CO2 concentration prior to the training period must result in erroneous proxy temperatures during the early record.

Due to poor records it may not be possible to remove the signal of all the different properties which affect growth, but it should be possible to do a reasonably good job with the available CO2 records. It seems to me too many people in the field are sitting around coming up with suspect correlations between data sets instead of thinking up and doing unique experiments to tease out the actual physical relationship between the growth properties of trees. They may get to publish a lot of papers, but they aren’t doing much to advance the field.

Not providing the full set of proxy series from which their subset was selected leaves no way to check the validity of their selections. For example, take the hypothetical species Quercus Warmus. If three proxies are derived from that species and two were discarded and one selected, some physical justification, not just correlation with temperature, should be made for selecting one and discarding the others.

Thank you RomanM.

I have a suggestion. How about we all have a whip round to fly Nick over to Roman or Steve or anyone else in the statistical fraternity and sit him in front of his PC with the guys around him. Get him to ask all his questions like ‘why doesn’t this work/hand wave/spin/change subject/spot spelling error’ and settle this once and for all.

Following Nick’s ‘Black Knight’ impressions for years now…getting bored. Please let’s bury this perennial blogfight. I’d be happy to chuck in some monies to the flight/hotel room.

Look…maybe Nick is right and I’m doing him a huge disservice, maybe all you guys are wrong…?

But this way we can finish it. Otherwise I may just have to top myself, can’t bear another of Nick’s stonewalling in the face of everyone telling him he is wrong.

Caveat – I don’t really understand the maths you guys are on about, but it seems to me that one person argues from desperation (call it gut feeling I have). I’m happy to admit I’m wrong if Nick convinces me you are the guys chasing ghosts.

Can we start a fund? Pleeeeeese!!!

Mikef2

Steve: Nick still hasn’t conceded that there was anything wrong with Mann 2008 using upside-down contaminated Tiljander. If he can’t concede something as simple as that, he’s never going to concede anything.

On second thought . . .

http://www.telegraph.co.uk/comment/9338939/Global-warming-second-thoughts-of-an-environmentalist.html

How about a non-tree-ring example? A problem I am working on is to predict the presence of an animal species from habitat variables like elevation, precip, tree cover, etc. I have 300 to 400 sample locations in each of 11 regions and 24 variables (each at 2 spatial scales) in each region (mostly the same variables, with minor differences). In each region, I get a 2 to 5 variable response surface type model with about 80+ percent accuracy (up to 88%). BUT the variables in the models are not the same in any of the 11 regions. Are the animals responding idiosyncratically? No, I think the models are spurious and result from the screening fallacy in the process of obtaining coefficients for each model. I have essentially replicated my experiment 11 times and shown that it is not working. I accept that it is not working rather than hand-waving in some way.

A question we can ask about temperature proxies: if they are valid, why do the coefficients for different sites vary so much from each other? Why at one site is the temperature “signal” so much stronger than at other sites? This is like the result I discuss above, and it suggests a problem with the assumptions.

There is some subtlety in RomanM’s update:

If whales are being included in the analysis of average *fish* weight, it’s going to introduce a whole other host of problems. Sort of like when temperature reconstructions use non-temperature proxy data!

What I would really like to see is what effect Gergis style screening has on a set of proxies we know are accurate – ie thermometers. That ought to settle once and for all what the screening effect (if any) really is.

Steve: that is not a relevant comparison. The existence of the bias has been demonstrated. The amount of bias depends on individual cases.

In fact it’s how thermometers used to be made. Get blanks from the production process – calibrate – discard the ones you can’t calibrate.

It seemed to work.

Nick, are you suggesting that proxies that don’t match the temperatures of the instrumental records of the 20th century shouldn’t be used because they aren’t reliable? If you are, could I introduce you to Messrs Mann, Jones, Briffa, Osborne etc. and maybe you could explain it to them. Between them they have various papers that don’t match the instrumental records of the 20th century extant in the learned journals and are adamant that these same proxies can give us an accurate reflection of the temperatures back 1000 years.

No. I’m saying, repeatedly, that in Gergis’ approach you have to have a calibration interval in which you can find a significant relation of proxy to temp. Otherwise you don’t have the numbers you need to proceed. In CPS it’s a bit different because of aggregation of standardized proxies.

Nick:

“No. I’m saying, repeatedly, that in Gergis’ approach you have to have a calibration interval in which you can find a significant relation of proxy to temp. Otherwise you don’t have the numbers you need to proceed.”

How many correctly functioning thermometers do not correlate well with the regional average, and would you say this meant the temperature really was different and should be included in the average (as it is), or that that particular thermometer should be removed from the dataset and the average recomputed (at which point we could rinse and repeat until we were left with one last thermometer whose record would be The Regional Record)?

It gets worse of course when you add a significant amount of noise, as these proxies have. The proxies you are retaining are merely the proxies which had the “right noise” to cause a very good fit over the calibration period. The implied claim that this means their historic noise artifact would also have caused a better fit prior to the calibration period is just not sane.
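A minimal sketch of the "right noise" point in R, taken to the extreme where the proxies contain no temperature information at all. Everything here is an assumption for illustration: white-noise proxies, a made-up linear warming ramp as the calibration "temperature", 500 years, 1000 proxies and a 0.12 correlation cutoff.

```r
set.seed(42)
n_years <- 500                     # total length of each series (assumed)
cal     <- 421:500                 # last 80 "years" form the calibration period
n_prox  <- 1000                    # number of pure-noise proxies (assumed)

temp <- rep(0, n_years)
temp[cal] <- seq(0, 1, length.out = length(cal))   # hypothetical warming ramp

# proxies are white noise with no temperature signal whatsoever
prox <- matrix(rnorm(n_years * n_prox), nrow = n_years)

# screen on calibration-period correlation with temperature
r    <- cor(prox[cal, ], temp[cal])[, 1]
keep <- r > 0.12

comp_all  <- rowMeans(prox)             # composite of every proxy
comp_keep <- rowMeans(prox[, keep])     # composite of the screened proxies

# calibration-period slope of each composite against temperature
slope_all  <- unname(coef(lm(comp_all[cal]  ~ temp[cal]))[2])
slope_keep <- unname(coef(lm(comp_keep[cal] ~ temp[cal]))[2])
c(retained = sum(keep), slope_all = slope_all, slope_keep = slope_keep)
```

The screened composite picks up a positive calibration-period slope purely from the selection step, while before the calibration period it is just averaged noise, since the fit of the retained series says nothing about their earlier values.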

Nick:

Actually this has little to do with how real thermometers or other instruments are designed, constructed and selected for use.

Firstly, you don’t normally design thermometers so that they are equally sensitive to precipitation, number of hours of sunlight, amount of fertilizer, etc.

Real world, we not only look at the correlation with the variable of interest, we discard instrument designs which show an undue sensitivity to variables we’re not interested in. We call those “defective” and we improve the design to reduce the sensitivity to them.

Unless you build instruments as part of your research (you don’t, I do), I’d suggest laying off the explanations of how instruments get built and selected for use. Your explanation has little to do with the real world.

Carrick, I was responding to the premises of the question.

That’s cool, just don’t get into a habit of saying that’s how they select good thermometers from bad ones. 😉

We use other criteria in addition to correlation to screen instruments, though we do use correlation to look at the frequency content of the sensor:

As an example, the abnormal roll-off at low frequencies exhibited by the Mann 2008 CPS method but not by Moberg’s 2005, as discussed by von Storch et al 2009, is a basis for tossing certain reconstructions over others.

My bet is if RomanM redid his analysis using red noise, and examined the frequency domain response, for smallish N, he’d have a bigger deflation of scale at low frequencies than high frequencies.

“My bet is if RomanM redid his analysis using red noise, and examined the frequency domain response, for smallish N, he’d have a bigger deflation of scale at low frequencies than high frequencies.”

No question. Mann 07 used this knowledge to create unrealistically high frequency proxies and then point out that there was minimal deflation of scale.

Nick,

I really don’t understand how you can not agree there is a bias.

There are simple ways to overcome the bias.

1) Don’t use correlation to temperature in selection. Use something else.

2) Use different time periods to filter based on correlation and to construct the response to temperature.

Steve, I wonder if method 2) might give an understanding of how much bias might exist?

Nick,

Depending on the calibration process, it might add bias or not. Consider the following calibration process. Thermometers respond with from 4.8 to 5.3 mm for a degree of temperature change. The random error is ±0.1 mm.

A calibration process that adds bias would do the following. Measure the response to a 1 degree C temperature change, and discard all those that do not have at least 5.0 mm of range in the reading. Then immediately mark the lines based on response. Here, you are selecting for thermometers that had a positive noise error, and including the noise as the thermometer’s response to temperature. This will overestimate the amount of range in thermometers. I take it this is what is happening with proxies.

Here is a calibration that does not add bias to the thermometers. Do the same as in the above, but do not mark the lines to the response. Cool them down a degree C, warm them up a degree C, and then add the measurement lines based on this response alone. These thermometers will not have bias (provided the noise is truly random over time).
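The two calibration procedures can be compared with a small simulation in R. This is a sketch under assumptions not stated in the comment: true responses are drawn uniformly between 4.8 and 5.3 mm per degree, the ±0.1 mm error is treated as Gaussian with standard deviation 0.1, and the second procedure is represented by one independent re-measurement.

```r
set.seed(1)
n <- 100000
true_resp <- runif(n, 4.8, 5.3)            # true response, mm per degree C
m1 <- true_resp + rnorm(n, sd = 0.1)       # reading used for screening
m2 <- true_resp + rnorm(n, sd = 0.1)       # independent later reading

keep <- m1 >= 5.0                          # discard thermometers reading below 5.0 mm

# error in the marked scale, averaged over the retained thermometers
bias_same  <- mean(m1[keep] - true_resp[keep])   # lines marked from the screening reading
bias_fresh <- mean(m2[keep] - true_resp[keep])   # lines marked from the fresh reading
round(c(bias_same = bias_same, bias_fresh = bias_fresh), 4)
```

Marking the lines from the same reading that passed the screen bakes the favourable noise into the scale; marking them from a fresh reading does not, as long as the noise in the two readings is independent.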

Ed,

With the proxies, they are deciding not on the sensitivity itself, but its significance. They measure the ±0.1.

Sorry, Steve, you’ve lost me. I am not as bright, nor as knowledgeable about the subject as you.

Alex, if the proxies were totally accurate Gergis’ method still gives bias.

In some ways this makes Gergis’ failings clearer (and it doesn’t need all the statistical analysis on this thread to see it).

Basically Gergis has developed an index of Southern Hemisphere (SH) temperatures from the instrumental record. She then wants to develop a model to project the index back in time (hindcast) so she can make claims about whether it is hotter now than before.

She also has a set of temperatures at various points in the SH going back various distances in time, further back than the instrumental record (called proxies, but let’s assume they are absolutely accurate).

She then looks at each of these locational temperatures and correlates them one by one with the SH index over the instrumental period. (She actually knows that the temperature at each location went into constructing the index so there is a strong theoretical reason to assume correlation).

But she decides not to use any that don’t correlate well.

She builds her forecasting model from the subset of locational temperatures that remain.

The alternative (usual) approach would be to use all the locational temperatures to build the model. In the end this might lead to some of the locational temperatures not making a significant contribution to the final model – but even then they well may not be the same set that you got to originally.

The point is you want to find the best model to hindcast the index, not the best correlation between the index and the individual temperatures.

And the latter doesn’t (necessarily) give you the former.

“if the proxies were totally accurate Gergis’ method still gives bias.”

That was what I was trying to get at.

OK, I think I understand this.

What is measured in a proxy (once adjusted) consists of three parts. A constant, a response to the temperature, and a random part. These are compared to a time period where temperatures are known.

Screening out poorly correlated samples screens for correlation improved by the random part. Some part of “random part” becomes attributed to the response to the temperature. This occurs because the time period used to determine correlation is also used, whole or in part, with or without other time periods, to determine the response to the temperature.

So for example with tree rings, perhaps each degree C above normal actually yields .5 mm of growth. But since random noise has been added, perhaps the models suggest .6 mm of growth is added per degree C above normal. So it takes bigger swings in tree ring growth in the past to change the temperature, flattening the tail.

If the time periods during which the correlation was determined, and the time period used to determine response to temperature were non-intersecting, the problem would go away. Provided the error added was time independent.
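A rough R sketch of that last point, assuming the simple model proxy = B × temperature + noise with white, time-independent noise (the constant term is omitted). The sizes, the retained fraction and the helper names `make_prox` and `ols_slope` are illustrative choices, not anything used in the actual reconstructions.

```r
set.seed(7)
n_prox <- 5000; n_cal <- 40        # assumed sizes
B <- 0.5                           # true response, 0.5 mm per degree C as in the example above
temp1 <- rnorm(n_cal)              # hypothetical temperatures, screening period
temp2 <- rnorm(n_cal)              # hypothetical temperatures, separate period

# proxy = B * temperature + time-independent noise
make_prox <- function(temp) {
  B * temp + matrix(rnorm(length(temp) * n_prox), nrow = length(temp))
}
p1 <- make_prox(temp1)
p2 <- make_prox(temp2)

# OLS slope of every proxy column on temperature
ols_slope <- function(prox, temp) cov(prox, temp)[, 1] / var(temp)

r1   <- cor(p1, temp1)[, 1]
keep <- r1 >= quantile(r1, 0.75)   # retain the best-correlated quarter

b_same  <- mean(ols_slope(p1, temp1)[keep])   # response fitted on the screening period
b_split <- mean(ols_slope(p2, temp2)[keep])   # response fitted on the other period
round(c(true = B, same_period = b_same, split_period = b_split), 3)
```

With overlapping periods the fitted response of the retained proxies is inflated above the true B, which is what flattens the reconstruction; with non-intersecting periods the inflation should vanish, provided the noise really is independent between the two periods.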

Another theory on the bias problem. I think it’s due to the definition of correlation. When plotting the screened proxy composite (simple average before the reconstruction), the region used for the correlation screening is biased away from the mean. Outside of that region, the composite doesn’t have that bias. After reconstruction, the correlated region bias (which increases the standard deviation) is fixed, but that’s also applied to the unbiased remainder, which gives the loss in variance.

The definition of Pearson correlation has a sum of Xi – mean(X) multiplied by Yi – mean(Y). This means that when Xi (a proxy value) is above the mean and Yi (a temperature series value) is below, or vice versa, a negative number is generated and the correlation is penalized. When both line up, correlation would favor Xi (proxy) values further from the mean. Since we’re looking for the highest correlation, the selected series will tend to give an exaggerated temperature signal in the correlated region.

Switching the screening function from higher correlation to lower mse values fixes the bias issue with the simulated proxy example. And it improves the reconstruction mse with the CSM Set.

I think you’re right, it sounded logical and I checked my Monte Carlo.

Here’s a comparison.

This was done for N=10000.

There may still be a bias that depends on 1/sqrt(N) remaining, but as sqrt(N) → ∞ I am pretty sure the bias disappears.

Try again:

Fixed link.

Carrick,

Can you help me interpret that graph you just posted – is that essentially saying that if you use MSE instead of R for screening then you remove (or reduce) the bias as long as you have an adequate portion of proxies?

Yep, that’s what it says. What I’m plotting is mean slope as a function of percent of retained proxies for two different criteria: 1) Pearson’s r (customary screening approach), 2) mt’s suggestion of using mean-square error. This particular bias disappears when you use mean-square error.

Cool, heh? Props to mt.
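For readers who want to try this comparison themselves, here is a simplified R sketch. It is not the Monte Carlo behind the graph: it assumes white noise rather than red noise, and proxies already expressed in temperature units (true slope B = 1), so that the unstandardized MSE against temperature depends only on the noise. Proxy count, calibration length and retention fractions are arbitrary.

```r
set.seed(11)
n_prox <- 10000; n_cal <- 80; B <- 1       # assumed values
temp <- rnorm(n_cal)                       # hypothetical calibration temperatures
prox <- B * temp + matrix(rnorm(n_cal * n_prox), nrow = n_cal)

slopes <- cov(prox, temp)[, 1] / var(temp) # fitted slope of each proxy on temperature
r      <- cor(prox, temp)[, 1]             # Pearson's r per proxy
mse    <- colMeans((prox - temp)^2)        # unstandardized MSE per proxy

retain <- c(0.1, 0.25, 0.5, 0.75, 1)       # fraction of proxies retained
mean_slope <- sapply(retain, function(f) c(
  by_r   = mean(slopes[r   >= quantile(r,   1 - f)]),
  by_mse = mean(slopes[mse <= quantile(mse, f)])
))
colnames(mean_slope) <- paste0(100 * retain, "%")
round(mean_slope, 3)
```

Under these assumptions the mean slope of the r-screened set climbs above 1 as fewer proxies are retained, while the MSE-screened set stays near 1, because a low MSE does not favour noise that happens to correlate positively with temperature. If B differs from 1, the unstandardized MSE no longer has this symmetry.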

Yeah – really cool and clever.

Could this really be a viable solution? I may have to reread these posts a bit to figure out why that solved this issue. So let’s say that hypothetically one were to conduct an analysis: would a Monte Carlo simulation on the MSE be the first step for choosing which MSE value to use as a cut-off, say with a minimum of 50% of the data retained? Also, your MSE Monte Carlo has a trend: the slope increases as more proxies are retained…

Robert I’m using a “red-noise” proxy… actually it’s meant to be a realistic temperature-related proxy (leave out the details for now) and treating the proxies as “ideal temperature proxies”.

Anyway the “red” nature of the source introduces a net bias in the estimate of the trend that depends on 1/sqrt(N), even for 100% retained.

Of course this is all very small, partly because I have such a large N (and relatively large SNR).

But to your primary question, Monte Carlo’ing is something that I absolutely would advocate before I tried it on real data! I’d use a spectral-based method, where I pre-computed the spectra of the proxy.

If you want more detail on this, send email to Lucia and she’ll forward it on to me. I’d rather not expose my email where spammers can get to it.

Robert, I believe the reason why mse screening might work is because negative slope values for the noise term would not be excluded as they are by definition when using correlation screening.

Screening by MSE may lower the bias, but is it the test that they want? What they need is the significance of the relation to be used in the pre-training period.

I note that the North report description, which is presumably representative, would test the significance of the regression slope (T vs proxy) rather than the correlation coefficient.

I’ve been working on an analysis of the bias, which in the non-selective case at least seems to be quantifiable, and 1/N. I’ll post it some time today, I hope.

Perhaps the reason Carrick’s graph shows a slight trend in the slope vs retention is that the calibration range is off center. If I understand this correctly, this would tip the probability balance slightly towards an mse screen selecting slightly more positive than negative noise (or vice versa).

Nick:

Um… the North report again???

You seriously need to read something a bit more up-to-date and maybe something that actually appeared in a peer reviewed journal. Just a thought. Anyway I’m game, tell me what significance test you want to run, and I’ll generate a graph for that too.

Layman:

I think the bias you are seeing results from the red-noise nature of the proxy (actually it should be there in any proxy that has nonzero autocorrelation present). It is producing a bias in the estimate of the trend. I thought it might be interesting to try Nick’s algorithm where he fits to the slope and AR(n) coefficients simultaneously to see if that eliminated the bias. But it’s an artifact of the red noise nature of my proxies + the SNR present.

Although that bias goes to zero as 1/sqrt(N) or as SNR → ∞, you usually don’t have the luxury of large N in proxy reconstructions and SNR is always a problem in these reconstructions.

Re: Carrick (Jun 19 22:50),

Carrick, mt —

The graph Carrick links (Jun 19, 2012 at 10:51 PM) looks important to me, in terms of finding a way forward for using high-noise proxies in paleoclimate reconstructions. As Robert notes, it appears that using MSE instead of R removes the bias in assigning a (treering vs. temperature) correlation, when the investigator has compelling reasons to exclude a high proportion of the treering data series in the set. (Of course, a laundry list of caveats, as discussed elsewhere, will still apply.)

Of particular interest, as the percent of treerings retained declines, the uncertainty estimate for the Slope increases. Qualitatively, this makes sense — excluding data should decrease the investigator’s confidence in the accuracy of the estimate produced by the procedure. Is there a way to know if the change in uncertainty as percent-retained is lowered is “just right”?

Robert, Nick, my thoughts are that there are really two processes going on here, orientation and screening. Correlation-style algorithms are inappropriate for screening, as they can exaggerate the signal in the calibration region, which then depresses any signal in the reconstruction. They can still be used for orientation, although personally I agree with Steve that if the orientation is unknown, the process generating the proxy is unknown, and it shouldn’t be used.

For screening proxies (after orienting), MSE will be able to order the proxies, but I don’t think there’s a “natural” cutoff like p-value. The only idea I have would be to compute probabilities for the proxies using a draw from a statistical model derived from the temperature. Or maybe each proxy gets a model and the probability of the temperature series is drawn from that? This would also have a nice side effect of requiring a real examination of uncertainties and error. How large is the spread of the distributions, do different proxies deserve different error distributions, what’s the null hypothesis?

Nice idea! Screening on MSE seems like a promising concept.

I assume that what you would be doing is scaling both series and calculating their mean square errors. In that case, given the model in the head post, the theoretical expected value of the MSE is (assuming no errors in a quick calculation):

E[MSE] = 2 { 1 - B / sqrt( B^2 + (Var(e) / Var(T)) ) }

which varies from 0 to 4 and is monotonic decreasing in B for fixed Var(e) and Var(T). In the case B = 0, the expected value is two.
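A quick simulation can sanity-check that expression under the model P = B·T + e. The values of B, the two variances and the calibration length below are illustrative assumptions, Gaussian white noise is assumed throughout, and the formula should only be expected to hold approximately in finite samples.

```r
set.seed(3)
n <- 80; B <- 0.5; var_e <- 1; var_T <- 1   # assumed illustrative values

# MSE between the standardized temperature and the standardized proxy
one_mse <- function() {
  temp <- rnorm(n, sd = sqrt(var_T))
  prox <- B * temp + rnorm(n, sd = sqrt(var_e))
  sum((scale(temp) - scale(prox))^2) / (n - 1)
}

sim    <- mean(replicate(5000, one_mse()))
theory <- 2 * (1 - B / sqrt(B^2 + var_e / var_T))
round(c(simulated = sim, theory = theory), 3)
```

Changing B or the noise variance and rerunning shows the monotonic decrease in B described above.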

Determining a “natural cutoff” would likely present some difficulties. If B = 0 and the sequences are white noise, then it is a case of finding the distribution of two independent sequences of t-distributed random variables (which are dependent within each sequence). This I believe to be tractable, but offhand I don’t have the specific answer.
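One possible route to a cutoff for the white-noise case, sketched in R. It assumes the MSE is computed between standardized series with an n - 1 divisor, in which case it equals 2(1 - r) exactly, and it assumes Gaussian white noise, so that r under B = 0 follows the usual t-based distribution. The calibration length, the 5% level and the helper `std_mse` are arbitrary choices, and nothing here addresses autocorrelation.

```r
set.seed(5)
n <- 80                                    # calibration length (assumed)

# MSE between two standardized series
std_mse <- function(x, y) sum((scale(x) - scale(y))^2) / (length(x) - 1)

# standardized MSE is an exact function of r: MSE = 2 * (1 - r)
x <- rnorm(n); y <- rnorm(n)
all.equal(std_mse(x, y), 2 * (1 - cor(x, y)))

# one-sided 5% critical correlation for white noise, from the t distribution
t_crit   <- qt(0.95, df = n - 2)
r_crit   <- t_crit / sqrt(t_crit^2 + n - 2)
mse_crit <- 2 * (1 - r_crit)               # retain a proxy if its MSE falls below this

# Monte Carlo check of the same cutoff under the null (B = 0)
null_mse <- replicate(10000, std_mse(rnorm(n), rnorm(n)))
round(c(analytic = mse_crit, monte_carlo = unname(quantile(null_mse, 0.05))), 3)
```

Because of the exact relation to r, a cutoff chosen this way retains the same proxies as a correlation cutoff at r_crit, so it offers no new screening behaviour for standardized series.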

If there is autocorrelation in either or both of the temperatures and proxies, the difficulty of the problem becomes greater. However, the same is true when using correlations in this case as well.

More study is necessary to resolve this critical climate science problem. Maybe we can get a million dollar grant. Where do we apply… 😉

Oops! How have you defined the mean square errors?

Using the definition I assumed was used, some experimentation seems to indicate that the MSEs are perfectly linearly correlated to the correlations.

```r
# Compare calibration-period correlations with the MSEs of the standardized series
comp.mse.cor = function(dats, temp, caltime = c(1901,1980), critcor = .12){
  # restrict temperature and proxies to the calibration window
  cal.temp = window(temp, start = caltime[1], end = caltime[2])
  cal.prox = window(dats, start = caltime[1], end = caltime[2])
  cors = c(cor(cal.temp, cal.prox))
  # MSE between standardized temperature and each standardized proxy
  mse = colSums((c(scale(cal.temp)) - scale(cal.prox))^2)/(length(cal.temp)-1)
  msex = mse - 2
  reg = lm(msex ~ 0 + cors)
  par(mfrow = c(2,1))
  plot(cors, msex, main = paste("Correlation = ", cor(cors,mse)))
  plot(cors, residuals(reg), main = "Residuals")
  par(mfrow = c(1,1))
  invisible(list(corr = cors, mse = mse, reg = reg))}

test1 = comp.mse.cor(csm.prox59, nh.csm)
test2 = comp.mse.cor(matrix(rnorm(5900), nrow=100), rnorm(100), caltime = c(1,100))
```

Nick, with the remaining bias in the calculations, try removing the scale call from the CPS function, and there’s no more bias:

composite = ts(rowMeans(proxies),start = dat.tsp[1])

In the simulated proxy case, scaling changes the standard deviation (increasing it) which then represses the reconstruction. If you want to see the bias go the “other way”, try:

composite = ts(rowMeans(proxies*0.6),start = dat.tsp[1])

To establish the proper cut-off point why not order the proxies by MSE then use a monte carlo simulation to remove proxies along the MSE scale until it reaches a point where the MSE values lose consistency (say 20% of proxies remaining) and then just make the reconstruction be the area which overlaps the most amongst all the individual reconstructions (ie with different MSE cut-offs).

At least that should be able to pick out the signal that is the most dominant in the proxies.

```r
# Unstandardized MSE between each proxy and temperature over the calibration period
makeMeans = function(prox, temps, cal.time = c(1901,1980)) {
  pcount = dim(prox)[2]
  mse = rep(0, pcount)
  for (i in 1:pcount) {
    mse[i] = mean((window(prox[,i], start=cal.time[1], end=cal.time[2])
                   - window(temps, start=cal.time[1], end=cal.time[2]))^2)
  }
  mse
}
```

mt:

Do you standardise either or both of the variables before you run this? If not then this selection procedure will not work very well when there is a mix of proxy types in the selection set.

RomanM,

In my Monte Carlo’s, I don’t see a distinct relationship between MSE and r.

(I’m plotting Pearson’s r versus the RMS residual fit for the model P = A T + B + n(t).)

NB: I’m using “red” noise.

RomanM, regarding standardization, were I doing multiple proxies, based on what I’ve read, I’d never use CPS to start with.

All this relates to, IMO, is the question of the best approach to weed low SNR samples from high SNR samples.

Correlation doesn’t work, MSE (at least in my tests) does.

How we go from there to practical, defensible reconstructions… that’s a serious research project. I’d certainly use Monte Carlo’ing as one of the tools and I would do it during the methodological design (rather than just as an a posteriori “verification” on a reconstruction method I duct-taped together).

RomanM, nope. Here’s where I switch back and forth in your compare.screen function:

```r
#cors = c(cor(window(prox,start = cal.time[1], end = cal.time[2]),window(temps,start = cal.time[1], end = cal.time[2])))
#bigcor = which(cors > critcor)
cors = makeMeans(prox, temps, cal.time)
bigcor = which(cors < critcor)
```

I was primarily looking at the simulated proxies, I don't think standardization matters for those. It would make sense that standardization should be used for real proxies. But if I add "sprox=scale(prox)" and calculate mse with sprox, your comp.mse.cor above with my makeMeans function still gives me a -0.25 correlation with comp.mse.cor(csm.prox59,nh.csm), and I get -.40 without scaling.

Scaling removes the time series property from a time series. I presume that you replace it before you enter the scaled series into a “window” function. Also, if you scale the proxy over the full time rather than the calibration period, this will also affect the results.

In the model example, I would assume that the same B value might have been used so that a higher variability does indicate higher noise and not a larger value of B. However, in real life, I would expect that theoretically “identical” proxies might not be as identical as we would like so that not scaling is a risky business with its own effects. When the proxies are all scaled in the original calibration period, your function should give basically the same results as mine.

I can confirm that adding scaling gives me the same mse results that you’re getting, which means that screening with scaled mse doesn’t change anything vs correlation. However, the behavior of scaled mse vs unscaled is surprising (at least to me):

```r
s1=c(-1.0, 0, 1.0)
s2=c(-1.0, 0, 2.0)
s3=c(-1.0, 0, 0.5)
c(cor(s1,s2),cor(s1,s3))
# [1] 0.9819805 0.9819805
c(mean((s1-s2)^2),mean((s1-s3)^2))
# [1] 0.33333333 0.08333333
c(mean((scale(s1)-scale(s2))^2),mean((scale(s1)-scale(s3))^2))
# [1] 0.02402599 0.02402599
```

RomanM:

In the real world, I normally would obtain calibration constants for each proxy using e.g. linear regression and use that to scale each proxy to temperature. I think it would generally be inappropriate to scale data like this to unit variance, at least if the goal is to combine them to form a global reconstruction of temperature.

That’s my take on it anyway.

As a bit of a wrap up, here’s a view of reconstructions with different screening algorithms using the simulated proxies. Each graph shows the black temperature series, the red composite screened proxy series (average of standardized proxies) and the green reconstruction. The gray regions show the calibration range. The proxy count from the correlation screen was used in the MSE/random screens. Cal.MSE is the calibration MSE, MSE is the pre-calibration MSE.

For the simulated proxies: Correlation gives an enhanced temperature signal in the calibration range, which then causes the muted response in the pre-calibration region during reconstruction. Standardized MSE gives the same result as correlation. Unstandardized MSE and random selection are essentially equivalent, and random can give better results. If standardization is disabled when generating the composite series in CPS, MSE always yields a better calibration region, but not necessarily a better full reconstruction. Also note there’s a bit of a bias in the all-proxy composite series that disappears with standardization disabled.

Here’s the CSM proxies. Same results as above, except unstandardized MSE generally beats random.

So, biased screening yields a biased reconstruction.

Nice, MT!

You can really see the difference in scale when you have “significant” climate signals like the 1815 Tambora and 1883 Krakatoa eruptions.

A search of the web for the phrases “stable isotopes”, “tree rings”, and “climate” yields nearly 59,000 records. Several – this may be sampling error based on what I can access without paying $40 to Elsevier for a paper – find that there is no evidence of correlation between tree ring width and temperature as indicated by delta C-13/C-12 and delta O-18/O-16. Studies extend back to the early 1980s at least. None show hockey sticks, even where the authors appear to be attempting to correlate temperature via isotope ratios with ring widths. Why bother with ring widths at all?

I don’t have the references to hand, but this seems similar to another problem. About 100 years ago an experiment was done at a village fete. Guess the weight of the cow.

The guesses of the experts both individually and averaged (screened proxies) were less accurate than the average of all the guesses. The most extreme guesses helped to draw the averages back to the correct figure.

Congratulations to Roman for a well-conceived example and good posting. Unfortunately, this thread has become absurd. As Steve points out, Nick cannot concede the Tiljander mistake. He also failed to address the very pertinent questions that Don Keillor posed vis-a-vis tree temperature response and the selection fallacy. He is not operating in good faith and this is a huge waste of effort.

sooooooooo…question is then, is Nick just seriously in need of some statistics education (hence my earlier suggestion) or is he actually willfully disingenuous and sets out to distort and disturb debate in the skeptical community? In short, is he just a troll, albeit a very polite troll, but a troll nonetheless?

Maybe it’s time to decide if it’s worth engaging at all if ‘troll’ is the answer?

Re: mikef2 (Jun 19 08:30), in general one can assess such situations quite well by applying the consequences of the Tiljander trichotomy: supporting the use of upside down Tiljander is only possible if you are either (a) completely ignorant, or (b) spectacularly stupid, or (c) utterly disingenuous. In some cases it is possible to identify which of these three applies, but more fundamentally it doesn’t really matter: there is no useful purpose served by attempting serious discussions with anyone who falls into any of the three categories.

I have not supported the use of Tiljander. I would not have used it.

Re: Nick Stokes (Jun 19 15:49),

Yes Nick, BUT DO YOU CONCEDE IT WAS A MISTAKE? No word parsing. Was it an error to use it? Does it’s inclusion lead to erroneous results?

Re: Nick Stokes (Jun 19 15:49), (I mean its – D’oh)

Re: Nick Stokes (Jun 19 15:49)

> I have not supported the use of Tiljander. I would not have used it.

Nick — Reading that prompted me to take a stroll down memory lane, in the form of the Tiljander-themed CA post of July 6, 2011, “Dirty Laundry II: Contaminated Sediments”. The fifty or so comments you authored in that thread were not, on the whole, your finest moments.

I imagine you could argue that you have not “supported the use of Tiljander.” That is, given the arguments we could enjoy about the proper definitions of the words “supported,” “use,” and “Tiljander.”

At the time — and at other times — your remarks on the subject seemed quite supportive of the procedures by which the authors of Mann08 and Mann09 employed the uncalibratable Tiljander data series in their reconstructions. And quite unsupportive of any efforts to discuss the shortcomings of those papers in that regard.

Here is the point in the discussion where I was frustrated enough by your tactics to compare them with those used by the famed defense attorney “Racehorse” Haynes.

Throughout that thread, interested readers can judge your variegated positions on Tiljander for themselves, contrasting your comments with those written by me and others.

I have merely noted that Prof Jones statement was incorrect. I have not supported the use of the Tiljander proxies. I have said that I would not have used them. Beyond that I am not going to get into tribal loyalty oaths. It is way OT for this discussion anyway.

Re: Nick Stokes (Jun 19 15:49),

I note your comment, “I have merely [sic] noted that Prof Jones statement was incorrect…” I supplied a link at 10:47 PM to provide context to your rather fantastic claim of 3:49 PM (“I have not supported the use of Tiljander”), and will leave it at that.

Amac,

And I’ll just note this succinct comment from that thread. It is not supporting the use of Tiljander.

“Does it’s inclusion lead to erroneous results?”

Nothing tribal about that.

Re: Nick Stokes (Jun 20, 2012 at 6:15 PM),

> And I’ll just note this succinct comment from that thread.

Yes, you did say that in July 2011, on that thread…

And much else.

Believing as many as six impossible things before breakfast is an excellent strategy for accomplishing some tasks. Constructive discussion of scientific work, not so much.

Nick –

Thanks for responding to my comment. Do you have any idea why Mann would continue to defend and use the Tiljander proxy?

It is really funny to watch the hoops that Nick Stokes will jump through so as not to acknowledge what everyone knows – that Mann screwed up. He resorts to two separate excuses on this thread: (1) that it is OT and (2) that he is “not going to get into tribal loyalty oaths”. The latter is particularly funny from Nick “Mr. Tribalism” Stokes. Many of us will never take Nick seriously until he has the integrity to clearly admit the error. Simple question Nick – in Mann’s CPS reconstruction, did he use the sediment series upside down relative to the accepted physical meaning? Yes or No? Come on Nick, restore your credibility.

Despite local opinion, I am not a spokesman for Michael Mann. I have stated my view.

Nick, No-one is asking you to be a spokesman for Mann (though at times you appear to act like one) and it does you no credit to try to re-frame the question that way. We are simply requesting that you express a clear, unambiguous opinion on what has been one of the most discussed climate memes in the skeptic blogosphere over the past few years. Consistent with most “mainstream” AGW advocates, your own tribalism will just not let you acknowledge that Mann erred and used Tiljander upside down. I can understand why Mann is trying to avoid embarrassment, but I don’t get why people like Nick are so eager to sacrifice their own integrity in defense of Mann.

gerben, I do not accept your framing of the matter. It is not a requirement of integrity that I should do so. I have given my views at length, and if you want to read them, I suggest you start at the clear unambiguous comment linked above.

Nick, Your comment is not clear and unambiguous – it is a cheap cop-out. Saying you do not support the use of the proxy does not acknowledge Mann erred. In fact your own link (in the linked piece) starts with “There’s no reason to suppose that Mann did it incorrectly.” Nick, you have never been shy to express an opinion. I even recall threads when you felt comfortable opining on points of US federal and Virginia state law. (As an Australian research scientist!) And yet when it comes to a question undisputedly within your area of expertise, you suddenly turn reticent. You may feel this has no bearing on your credibility. Others may differ.

Stokes and Connelley and the other enablers do this schtick where they talk around and around and every which way from Sunday to avoid conceding some ridiculous transgression of Mann et al. My opinion is that from the point of “teleconnection” on they should not be engaged but simply mocked.

I think it’s a good thing that Nick is here making his arguments. Think of him as a proxy for those who’ve been perpetrating these statistical techniques and who lack the courage to come here and face the music.

I imagine those people, the ones who are not here, are grimacing as they read Nick’s posts, since his arguments are essentially their arguments and those arguments do not appear to be holding up at all.

Speaking of those who are not here, what ever happened to Jim Bouldin?

Re: theduke (Jun 19 10:11), It is likely he confers with “the ones who are not here” continuously and is coached via chat windows, and relays the points arrived at by consensus.

Could someone who’s got the patience (and I could understand why those of you in the depths of this analysis might not care to do it) please offer or point to the basics of the treemometer premise?

It is unclear to me how an instrument such as a tree or group of trees, whose rings vary in width from year to year depending on several factors including temperature, can be used with any confidence to determine marginal differences in any of those factors, especially a thousand years ago. How, for example, do we know the extent to which a tree is reflecting temperature change in 1820 relative to the extent to which it might have been reflecting a change in moisture in 1240?

A quick tour of dendrochronology principles (uniformitarian, limiting factors, site selection, replication, crossdating…) fills me at once with a sense of admiration for the puzzlers trying to piece it all together and a sense of incredulity that this science, in its infancy, is being used for anything other than entertainment. No wonder, then, that skilled and unskilled statisticians can draw just about any conclusion they like and that the Mann-McIntyre debate (loosely characterized, sorry) is so intractable.

Please, somebody, convince me that dendro-science is robust enough at this point to contribute meaningfully to our understanding of past temperatures.

Dendrochronology and “dendrothermology” so to speak are very different procedures. DC uses ring counts to develop a good calendrical age estimate. DT uses the DC age estimate and biased sampling methods or bad statistical procedure or both to develop a “paleo-temperature” profile.

In the US southwest, where DC was originally developed, it was known that the trees’ growth rates varied with rainfall. The variation results in a pattern of variable annual growth, occasional added rings or partial rings and not infrequently dropped rings as well. According to one source as much as 5% of a bristle cone pine’s rings may be missing. In over 50 years of observation though, there are no known examples of extra rings in bristle cones – at least not in the White Mountains.

“Dendrothermology” assumes that in extremely cold climates the temperature rather than water is the rate controlling factor, though studies in Finland flatly contradict this.

If the “team” were really serious about things, they could develop delta-O18 and delta-C13 curves from the tree rings and then analyze the correlation between growth rates and temperature as reflected in known proxies of temperature (i.e. stable isotopes).

You can find out more about real dendrochronology at this site:

http://c14.arch.ox.ac.uk/embed.php?File=calibration.html#tree_rings

“You’re using the language of population statistics.”

Is there any other kind of statistics? Can you compute the average and variance of Joe’s height? I don’t understand why you give credit to a question when the person asking it just seems to be trolling you. The screening fallacy is not in dispute – it exists. Its application here is textbook. Why honor a perverse dispute over what is indisputable?

Climate science is post-normal science. Post- meaning after. Nick is a master of it.

If he declared himself to be an artist, and that he was mocking that which he is seemingly protecting, I would regard him as a master.

May I suggest “science with preferred conclusions” be designated as “pscience”?

Would the inconsistency statistic help here (https://climateaudit.org/2008/07/30/brown-and-sundberg-confidence-and-conflict-in-multivariate-calibration-1/)? Will R be higher if proxies with equal SNR are screened using the sample correlation?

BTW, using pseudo-proxies generated from modeled temperature series is not completely fair. ICE-team will win if the series is known to be flat before looking at the proxies.

Having read through all this again I think the biggest problem with screening goes something like this, which I think is best described as a “meta” problem:

–In the search for proxies many different ideas are tried – sediment, ice cores, trees of all descriptions, etc. Presumably some underlying physical mechanism is proposed for a strong temperature effect. Then many different examples of each category are examined.

–The only possible way to see how these proxies work is to check empirical proxy data against the instrument record (or to run controlled experiments, which given the time scales involved may not be feasible).

–Clearly researchers will want to select proxy categories that have high correlation with the current temperature record. So the “meta” problem arises here. If there are proxy categories that don’t work well they are rejected. There is a type of confirmation bias here, although the researchers won’t see it that way – why use bad thermometers, indeed?

–Since the correlations are quite low, the question arises: of all the proxy categories (or proxies within a category) examined or considered, what is the probability of obtaining by chance proxies that have these low correlations during the instrument record, but which actually have zero expected temperature correlation?

–If there is a strong chance that the proxies are just noise, then clearly the “shaft” will look like averaged noise of some type. I think the effect outlined here by Roman also exists but is different from the “meta” problem; it’s more of a “statistics” or “estimation” problem.

–These “meta” problems are much more difficult to analyze – however techniques have been developed – cf the “Reality Check” in finance.

My general conclusion is that there is nothing wrong with looking for proxies that correlate with temperature assuming there is a valid potential physical mechanism – there really is no choice without the ability to do controlled experiments. However the results of the search have to be analyzed for significance in the meta context of how many proxy categories or individual proxies were tried and rejected.
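To put a rough number on the “by chance” question above, here is a minimal sketch, not anyone’s actual method. It assumes white-noise candidates, a 100-year instrument record and a screening threshold of r > 0.2; all of these values, and the names `pearson` and `fraction_passing`, are made up for illustration. It simply counts how many candidates with no temperature signal at all would survive a correlation screen.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fraction_passing(n_candidates=2000, n_years=100, threshold=0.2, seed=0):
    """Fraction of pure-noise candidates whose correlation with a
    trending 'instrument record' exceeds the screening threshold."""
    rng = random.Random(seed)
    # Hypothetical instrument record: linear trend plus white noise.
    target = [0.01 * t + rng.gauss(0, 0.3) for t in range(n_years)]
    passed = 0
    for _ in range(n_candidates):
        # Candidate proxy with zero expected temperature correlation.
        noise = [rng.gauss(0, 1) for _ in range(n_years)]
        if pearson(noise, target) > threshold:
            passed += 1
    return passed / n_candidates

frac = fraction_passing()
print(f"signal-free candidates passing r > 0.2: {frac:.3f}")
```

The more candidates or categories that are tried and discarded, the more such chance survivors accumulate, which is why the count of rejected candidates matters for any significance statement. Autocorrelated noise would pass more often than the white noise assumed here.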

The basic problem is overfitting models to a collection of data and then using the same data to evaluate the fit. Anyone with a moderate amount of quantitative ability can fit mathematical equations to data to minimize some criterion. The hard thing is knowing when fitting has gone too far, producing models that give illusory measures of goodness due to matching the noise in data more than the signal. Selection of proxies using correlation-based criteria is a good example of the kind of ad hoc rule used by those with enough understanding to know that there is a problem, but not enough statistical training to know what to do about it.

For all their talk about “skill” the use of an ad hoc correlation-selection rule tells me that the climate modelers don’t have a clue of what proxies should be in their models a priori. Their “skill” comes down to data snooping. What this roughly means is that if they have N proxies the number of possible models that they must consider is on the order of 2^N. This fact alone makes their selection rule vulnerable to overfitting. Overfitting, in turn, can produce hockey sticks out of noise as has been amply demonstrated. That doesn’t mean that the hockey sticks that have been produced are noise, but the methodology does not offer any assurance that they are not.
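The “hockey sticks out of noise” point can be illustrated with a small simulation. This is only a sketch under stated assumptions: the proxies are AR(1) red noise with no signal, the calibration “temperature” is a plain rising ramp over the last 50 years, and the threshold, series counts and function names are hypothetical choices, not taken from any published reconstruction.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(y):
    """Least-squares slope of y against its index 0..n-1."""
    n = len(y)
    mt, my = (n - 1) / 2, sum(y) / n
    num = sum((t - mt) * (v - my) for t, v in enumerate(y))
    den = sum((t - mt) ** 2 for t in range(n))
    return num / den

def ar1(n, phi, rng):
    """Red noise x[t] = phi * x[t-1] + e[t]; contains no signal."""
    x = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0, 1))
    return x

def screened_composite(n_series=500, n_years=300, calib=50,
                       phi=0.9, threshold=0.3, seed=1):
    """Keep noise series that correlate with a rising ramp over the
    final `calib` years, then average the survivors year by year."""
    rng = random.Random(seed)
    ramp = list(range(calib))  # assumed rising calibration target
    kept = []
    for _ in range(n_series):
        s = ar1(n_years, phi, rng)
        if pearson(s[-calib:], ramp) > threshold:
            kept.append(s)
    composite = [sum(col) / len(kept) for col in zip(*kept)]
    return len(kept), composite

calib = 50
n_kept, comp = screened_composite(calib=calib)
blade = ols_slope(comp[-calib:])   # trend inside the calibration window
shaft = ols_slope(comp[:-calib])   # trend before the calibration window
print(f"kept {n_kept} series; blade slope {blade:.4f}, shaft slope {shaft:.4f}")
```

Every surviving series has a positive slope in the calibration window by construction, so the composite must rise there, while outside the window the survivors are unrelated noise that averages toward a flat line. As the comment says, this does not show that any published hockey stick is noise, only that the selection step alone can manufacture the shape.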

A layman’s list of tree-ring variables: temperature, moisture, bugs, slope, drainage, species difference, tree age, cloudiness, fire, competition… I’m sure someone can add more. And I read that “limiting stands”, attempts to isolate one variable or other, are not even used that often? Is that true?

Is it any wonder, with all these confounding variables, that there are unexplained divergence issues here, or that one man’s signal is another man’s noise?

It seems to me the questions far outweigh the answers at this contentious point in the analysis, and that any attempts to claim consensus in the dendro. temperature record are premature, and you don’t need to speculate about anyone’s motives to draw that conclusion.

I could easily be wrong about that, though. What in the dendro. record do scientists generally agree on at this point, in and outside of the so-called Team?

I have been looking at the correlation between HadCRU 1920-1990 and a few of the proxies.

As the authors have been quite insistent that we are observing ‘unprecedented’ warming, I thought I would test this out.

I plotted the R Squared value of the correlation vs end year of the proxy; thus in 1990 both series are lined up, the year 1890 plots 1920-1990 HadCRU vs. 1820-1890 of the proxies.

My favorites so far are Buckley’s Chance and Oroko.

The end date of 1990 is the 59th best correlation for Buckley’s Chance and the 49th best correlation for Oroko.

1954 gives the best fit for Buckley’s Chance and 1544 is the best correlation for Oroko.
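For anyone who wants to repeat this kind of scan, here is a rough sketch of the calculation as I read it: slide a window the length of the instrumental record along the proxy and record R² against the window’s end year. The data below are synthetic stand-ins (random numbers, not HadCRU, Buckley’s Chance or Oroko), and the function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r2_by_end_year(instrumental, proxy, proxy_first_year):
    """Map each possible window end year to the R^2 between the
    instrumental series and the equally long proxy window ending then."""
    w = len(instrumental)
    result = {}
    for end in range(w, len(proxy) + 1):
        window = proxy[end - w:end]
        result[proxy_first_year + end - 1] = pearson(instrumental, window) ** 2
    return result

rng = random.Random(42)
# Synthetic stand-in for a 1920-1990 instrumental series (71 values).
instrumental = [0.005 * t + rng.gauss(0, 0.1) for t in range(71)]
# Synthetic stand-in for an annual proxy covering 1500-1990 (491 values).
proxy = [rng.gauss(0, 1) for _ in range(491)]

scan = r2_by_end_year(instrumental, proxy, 1500)
aligned = scan[1990]                      # the properly aligned comparison
rank = 1 + sum(1 for v in scan.values() if v > aligned)
print(f"end year 1990 ranks {rank} of {len(scan)} windows by R^2")
```

If the aligned year ranks well down the list for a real proxy, as reported above, the proxy matches the modern record no better than it matches many arbitrary earlier periods.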

Sorry about the length of this, but is it nonsense or does it have some merit?

A teacher has 5 pupils. She gives each of them a piece of paper with a different number from 1 to 5 inclusive. The numbers are drawn from a hat with no selection principle. She repeats the exercise, so that each pupil now has two numbers from 1 to 5. She asks each pupil to add the two numbers.

Should the resultant sum be of equal probability for each pupil? After all, there was no prior selection of numbers to pupils. But no, we find that the likelihood of the sum is not constant. Here is the table of possible outcomes, showing the composition of the inputs.

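| Possible first number | Possible second number | All possible sums of 1st & 2nd |
|---|---|---|
| 1 | 1 | 2 |
| 1 | 2 | 3 |
| 1 | 3 | 4 |
| 1 | 4 | 5 |
| 1 | 5 | 6 |
| 2 | 1 | 3 |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | 6 |
| 2 | 5 | 7 |
| 3 | 1 | 4 |
| 3 | 2 | 5 |
| 3 | 3 | 6 |
| 3 | 4 | 7 |
| 3 | 5 | 8 |
| 4 | 1 | 5 |
| 4 | 2 | 6 |
| 4 | 3 | 7 |
| 4 | 4 | 8 |
| 4 | 5 | 9 |
| 5 | 1 | 6 |
| 5 | 2 | 7 |
| 5 | 3 | 8 |
| 5 | 4 | 9 |
| 5 | 5 | 10 |

| Summed number | Probability of sum appearing (relatively) |
|---|---|
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 4 |
| 6 | 5 |
| 7 | 4 |
| 8 | 3 |
| 9 | 2 |
| 10 | 1 |

Mean of the relative probabilities: 2.777777778; sd: 1.314684396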

There are 25 possible sums, with 1 chance in 25 to pick a 2 or a 10, up to 5 chances in 25 to pick a 6.
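The counts are easy to check by brute force. A minimal sketch, assuming the two draws are independent and each equally likely to be 1 to 5, as the table does:

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

# Enumerate all 25 ordered (first, second) pairs and tally their sums.
counts = Counter(a + b for a, b in product(range(1, 6), repeat=2))

for total in sorted(counts):
    print(f"sum {total:2d}: {counts[total]} chance(s) in 25")

# Mean and population sd of the relative-frequency column.
freqs = [counts[s] for s in sorted(counts)]
print(mean(freqs), pstdev(freqs))
```

The last line gives the mean and (population) standard deviation of the relative-frequency column, which is where the 2.78 and 1.31 figures in the table come from.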

Notice that we have taken two very simple number strings and combined them in a very simple way, so simple that children can understand. However, the very act of imposing an operator (here ‘the sum’) has provided a third number string that has some more complicated properties than its parents. I think that most people would guess without thinking that the most popular combined number would be 5, not 6.

So far, so good. Suppose, however, that the teacher wants to select the top 2 of the 5 pupils to play some more games. It starts to get more difficult, because the top two pupils could have scored as high as 9 & 10 respectively, or as low as 4 & 5. The pupils had no choice when the numbers were handed out and the teacher used random selections. Yet, there were two very different score sets possible for the top two pupils.

From simplicity to complexity.

Suppose that the 5 pupils were first handed a paper that had a range of temperatures and then a second piece of paper containing some more temperatures at another point – plus a form of measurement between points. Instead of adding the first of these, as above, we calculate the correlation coefficient between the number strings. Neglecting noise for now, we end up with 5 students who between them could have 25 different coefficients. The teacher wishes to select the top 2 pupils, the ones with the highest coefficients…….

There is no need to continue the analogy.

The main argument against the analogy is that in the world of climate work, the temperature strings and distance string are related. Stats done on independent numbers need not apply to numbers with dependent relationships. But, this might happen only in a perfect world. Often, the connection between temperatures might be very weak, such as when only rudimentary or inaccurate climate data were available for the chosen distance pairs.

Irrespective of what happened next, the very act of calculating correlation coefficients had to produce a result that ranged from +1 to -1. The preferred selection of the “best” combination is not of necessity built into the math package. It has no way (as described above) to return a zero for all combinations below a threshold. See figure –

Concluding questions.

How much of the correlation in this figure is due to effects in the schoolroom selection example and how much is not? Is the correlation coefficient a useful device in this work?

It is rare that I would miss a post like this.

Good stuff.

There is something spooky about sending a contribution on noise that gets a neat table mangled into noise. But one can follow it, as Jeff has done with his comment that followed. Thanks, Jeff. It was meant for me, was it not? Maybe not? No?

Steve Mosher says:

If the LIA is the result of (cloud) albedo changes that just happen to correlate at that time with solar forcing variations, then the sensitivity to solar forcing can be small and we can still have the LIA @ -1C.

Yours is the classical “one parameter explains it all” fallacy – if there are several, uncorrelated forcings that happen to coincide at various times, or even worse, if they each have *some* effect on the others, it’s a wicked problem with no easy solution. ISTM that climate is indeed just such a problem.

## 3 Trackbacks

[…] calibration period that spanned the final 50 years, all of which had an uptick. But I noticed that Roman’s calibration period ended in 1980. So, I thought: What does the reconstruction look like if calibrate on a period that doesn’t […]

[…] the results in favour of the desired result if that correlation is with a short period of the data. RomanM states the issues succinctly here. My, more colloquial take, is that if the proxies (to some […]

[…] D’Arrigo said if you’re going to make cherry pie you have to pick cherries. The issue is also discussed in decisions about choosing proxies. Another example is selection of start and end points for a […]