More on Voodoo Correlations

Mann said:

Although 484 (~40%) pass the temperature screening process over the full (1850–1995) calibration interval, one would expect that no more than ~150 (13%) of the proxy series would pass the screening procedure described above by chance alone.

Reader DC said:

Of the 484 proxies passing the 1850-1995 significance test, 342 also passed both sub-period tests (with 341 having r values with matching sign). 111 passed only one of the sub-period tests, and 31 failed both sub-periods.

Let’s think about this a little in terms of statistics. If a “proxy” is a proxy, then it is a proxy regardless of the subperiod. It is not enough to have a “significant” relationship in the 1850-1995 period, it should also have “significant” relationship in the 1850-1949 and 1896-1995 periods (Mann’s late-miss and early-miss periods.)

DC remarked above, in effect, that nearly 30% of the 484 “passing” proxies failed this elementary precaution. I checked this calculation and can confirm it. This can be done as follows.

download.file(“http://data.climateaudit.org/data/mann.2008/Basics.tab”,”temp.dat”,mode=”wb”);load(“temp.dat”)
details=Basics$details; passing=Basics$passing
temp=(passing$whole&passing$latem&passing$earlym);sum(temp)
#342

342 out of 1209 is only 29% (as opposed to Mann’s stated 13% by chance). As observed in September, Mann’s chance benchmark is wrong because his pick two-daily keno method inflates the odds. [As a reader noted, Mann’s 13% is based on the 1850-1995 period and the yield for passing 1850-1995, 1850-1949 and 1896-1995 would necessarily be lower. This goes the other way from pick two daily keno. Autocorrelation is a third benchmarking issue and it doesn;t look to me like Mann’s benchmarks adequately allow for observed autocorrelation.]

I don’t want readers to place any weight on any benchmarks right now other than indicatively, as today I want to look at a different issue: how different proxy classes stand up to this undemanding test. In a given proxy class (ice cores, dendros, speleos, whatever), which proxy classes outperform random picking?

The “best” performer are the Luterbacher series – series which have no business whatever being in a “proxy data sets”. 71 out of 71 Luterbacher series pass the above test. This is not much of an accomplishment since Luterbacher uses instrumental data in his “proxies”. That instrumental data has a high correlation with instrumental data means precisely nothing. You’d think that someone in the climate science “community” would object to this, but seemingly not. The inclusion of these series obviously inflates the count. Without these absurd inclusions, we have 24% of the proxies passing elementary screening ( (342-71)/(1209-71).

“Low-frequency” make up 51 of the 1209 series. Of these 51 series, only 8 series pass the above elementary screening (15.8%). One of these series (Socotra O18- which is non-incidental in M08 reconstructions BTW), fails an additional undemanding test that “significance” have a consistent sign. This leaves 7/51 (13.7%) as being “significant”.

annual=Basics$criteria$annual;
c(sum(!annual),sum(temp&!annual))#51 8

Code 9000 dendro proxies make up over 927 of 1209 M08 proxies. Only 143 pass the above simple test ( 15.4%).

dendro=(Basics$details$code==9000)
c(sum(dendro),sum(temp&dendro)) #[1] 927 143

On the other hand, Briffa MXD proxies (code 7500) have a totally different response: 93 out of 105 (88%) pass M08 screening. This is such a phenomenonal difference from run-of-mill dendro proxies that one’s eyebrows arch a little. Now these aren’t ordinary Briffa MXD proxies. These series were produced in part by Rutherford (Mann) et al 2005 performing RegEM on Briffa MXD data; then M08 truncated the Rutherford Mann MXD versions in 1960 because of the “divergence” problem and replaced actual data from 1960 to 1990 by infilled data, all prior to calculating the above correlation. I haven’t parsed every little turn of Mannian adjustments, but you will understand if I view the statistical performance of this data for now as a little suspect. None of this data is earlier than AD1400 in any event.

I’ll look at the other classes of data (only 55 series left) tomorrow.

This entry was written by Stephen McIntyre, posted on Jan 21, 2009 at 12:18 AM, filed under Data, Mann et al 2008, Proxies, Spurious and tagged correlation, Mann, screening, Statistics, voodoo. Bookmark the permalink. Follow any comments here with the RSS feed for this post. Both comments and trackbacks are currently closed.

159 Comments

pete

Posted Jan 21, 2009 at 1:11 AM | Permalink

Let’s think about this a little in terms of statistics. If a “proxy” is a proxy, then it is a proxy regardless of the subperiod. It is not enough to have a “significant” relationship in the 1850-1995 period, it should also have “significant” relationship in the 1850-1949 and 1896-1995 periods (Mann’s late-miss and early-miss periods.)

It terms of statistics, you’re wrong here. For a given level of correlation, significance will decrease with sample size.

Steve: that’s not what I said. I didn’t say that the benchmarks need be identical. In this case, I used Mann’s results as is (which applied different benchmarks as length varied). SO please take a little trouble before throwing stones. I merely noted that the supposed relationship (and the benchmark should be appropriate to the length) should not be opportunistic.
pete

Posted Jan 21, 2009 at 1:32 AM | Permalink

that’s not what I said … SO please take a little trouble before throwing stones.

It’s a direct quote of what you said. If you get in a snit whenever you’re corrected you’ll just turn this place into an echo chamber.

Steve: you also stated:

It terms of statistics, you’re wrong here. For a given level of correlation, significance will decrease with sample size.

as though that was news to me and as tho it corrected the prior point.,Your point is correct but not at issue in the quote. As I observed above, the benchmarks are different in the comparison depending on the length so your criticism is invalid.
Ed Snack

Posted Jan 21, 2009 at 1:43 AM | Permalink

Pete, please re-read. It’s not the quote, it’s the interpretation you place on it. The significance need not be identical. Comprehension is the soul of criticism.
pete

Posted Jan 21, 2009 at 2:43 AM | Permalink

As I observed above, the benchmarks are different in the comparison depending on the length so your criticism is invalid.

The benchmarks over a shorter time period have to be harder (for a given significance), because the smaller sample size makes false positives more likely.

This means that a proxy that passes the (easier) benchmark over the longer time period might still fail the (harder) benchmark over the shorter time period.

So while 13% would be expected to pass the easier screening by chance, some smaller number (say x%) would be expected pass the harder screening you recommend. Therefore it makes sense to compare 40% (484/1209) to 13%, or to compare 28% (342/1209) to x%. Comparing 28% to 13% is a mistake.
kim

Posted Jan 21, 2009 at 6:46 AM | Permalink

It’s words. You all understand the points, and agree, and your discussion has allowed me to do so, too. Thanks.
===========================================================
bernie

Posted Jan 21, 2009 at 6:59 AM | Permalink

Pete:
If I understand the point, I am not sure that increasing the difference in the sample sizes of 1850 to 1949 compared to 1850 to 1995 would be particularly material to the point that Steve was making, even if technically you are correct.
- pete
  
  Posted Jan 21, 2009 at 8:22 AM | Permalink
  
  Re: bernie (#6), if you decrease the sample size from 1850–1995 to 1850–1949 you decrease the number of significant proxies from 484 to 466.
  
  If you increase the number of sceening tests from one (1850–1995) to three (1850–1995, 1850–1949, and 1896–1995) you decrease the number of significant proxies from 484 to 342. (At one point Steve also adds a fourth test, screening for consistent sign).
  
  So you’re right, the reduced sample size isn’t the most important factor here.
  
  There are really two points to be made:
  (1) additional screening reduces false positives but increases false negatives
  (2) reducing the sample size increases false negatives without decreasing false positives
Fred

Posted Jan 21, 2009 at 7:55 AM | Permalink

The Church of Climate Scientology marches on.

Obviously Statistics is not in the curriculum.
Soronel Haetir

Posted Jan 21, 2009 at 8:28 AM | Permalink

Steve,

So, after winnowing 1209 inputs (I will specifically not refer to them as proxies per your oft repeated comment that just because it is called a proxy doesn’t mean that it is a proxy) to 13-15% or some other number no better than the random chance threshold are we allowed to claim that there is any meaning to be gleaned from the data set? Or are we to just write off the entire exercise as a futile effort?
Peter Thompson

Posted Jan 21, 2009 at 8:33 AM | Permalink

I apologize for asking a probably stupid question, but if a proxy’s correlation to temperature is deemed significant in one screening period, but not another, then what exactly can it tell you about past temperatures? To a layman, the obvious answer is “not much” insofar as has been proven to be a proxy for temperature some of the time. When in the past? When not? Shouldn’t a proxy correlate with the instrumental record at any point for which you have data, all other things being equal?
- pete
  
  Posted Jan 21, 2009 at 8:48 AM | Permalink
  
  Re: Peter Thompson (#10)
  
  if a proxy’s correlation to temperature is deemed significant in one screening period, but not another, then what exactly can it tell you about past temperatures?
  
  The proxy should be correlated in both periods, but this does not guarantee it will be significant in both periods.
  
  If a proxy is significant, then we know it is correlated (to some specified level of confidence). If a proxy is not significant, then we don’t know whether it’s correlated or not.
  
  Significant is a tricky piece of stats jargon; it has more to do with our knowledge than about the underlying process. And the less data we use, the less we know — which is why I recommend taking the correlation over the full period.
pete

Posted Jan 21, 2009 at 8:35 AM | Permalink

Only 143 pass the above simple test ( 15.4%). This is only slightly above Mann’s 13%. Pick two-daily keno raises the bar a bit as well. One is forced to conclude that these dendro proxies as a group are not significant as well.

This is particularly wrong: you’re comparing the triple-screening result to the single-screening threshold — apples and oranges.
bernie

Posted Jan 21, 2009 at 10:20 AM | Permalink

pete:
I think the issue here is that if something is going to be called a proxy then the relationship should be persistent through time. Hence checking the temporal stability of the relationship (magnitude and direction) is a necessary piece of preliminary work. One can easily imagine a relationship that is overall positive but looks dramatically different at different periods of time (positive and negative), i.e, the divergence issue. If the latter is true then are we really talking about a proxy?
Mark T.

Posted Jan 21, 2009 at 10:24 AM | Permalink

Statistically speaking, that’s an issue of stationarity, bernie, and it is assumed from the beginning in all of the reconstruction work, without ever being demonstrated to be true.

Mark
Deep Climate

Posted Jan 21, 2009 at 11:11 AM | Permalink

As the one whose series of comments (from here on) apparently inspired this post, I’d like to jump in.

One can advocate tighter screening of proxies in order to reduce “false positives” (to use Pete’s term). I do think some sort of check on correlation “fluctuation” might be appropriate, although requiring two sub-period correlations to also be significant would be too restrictive. One possible simple test would be a sign test, requring the sub-periods to be of the same sign as the overall correlation, but not necessarily to the level of significance. (The sub-periods should also not overlap for this particular purpose, of course – half periods would seem appropriate).

But Pete’s essential point is very pertinent and needs to be dealt with in any meaningful analysis of this issue. A tightening of screening criteria will reduce the number of “passing” proxies, but will also reduce the proportion that would have passed by chance. So it is inappropriate to compare the result of a large change in screening criteria (such as requiring significance in two sub-periods) to the unadjusted 13% “by chance” figure.
- Craig Loehle
  
  Posted Jan 21, 2009 at 11:30 AM | Permalink
  
  Re: Deep Climate (#15), Jeff Id has shown quite clearly that screening for significance in the calibration period picks series that give a hockey stick even with random data. This is identical in concept to the use of Principal Components to weight proxies by their correlation (which M&M showed also manufactured hockey sticks from red noise series), except that here all passing proxies get a wt of 1. This is just automated cherry picking, and the details of your statistical cutoff are irrelevant to this bigger problem.
  - Deep Climate
    
    Posted Jan 21, 2009 at 1:49 PM | Permalink
    
    Re: Craig Loehle (#17),
    
    … {S]creening for significance in the calibration period picks series that give a hockey stick even with random data … This is just automated cherry picking, and the details of your statistical cutoff are irrelevant to this bigger problem.
    
    I’m well aware of such claims made here and elsewhere.
    
    You appear to be suggesting that the percentage of “chance” passes for a given statistical screening (presumably based on application of that screening test to a red noise data set of similar statistical properties to the candidate proxy set) is somehow “irrelevant” to the evaluation of the claim of bias from “automated cherry picking”. If so, I disagree.
    
    But, in any case, it’s certainly relevant to Steve’s analysis above.
    - Mark T.
      
      Posted Jan 21, 2009 at 2:52 PM | Permalink
      
      Re: Deep Climate (#25), The point you don’t seem to understand is that ex-post picking includes only proxies with an outcome that agrees with the hypothesis. It is immaterial whether their signal is by chance or an actual signal since the process uses an outcome to determine its input.
      
      Re: Hmmm (#29),
      
      Doesn’t divergence directly prove that these “proxies” are not so good?
      
      Yes, but this is “Team Science” so the easy refutation of such an argument is simply to chop the data off that disagrees with your desired result.
      
      Mark
Sam Urbinto

Posted Jan 21, 2009 at 11:19 AM | Permalink

If the resolution of the proxy is 100 years, periods of less than this time are likely to not be significant or such significance not meaningful. However, tests involving a period such as 1851-1950 would be expected to yield a similar correlation as 1801-1900, 1771-1870 or 1621-1720.

But if a “proxy” is actually a proxy, it has to be a proxy regardless of the period involved as long as it’s long enough and the length is taken into account. Any given period (within reason) of a true proxy should have a correlation of some significance, according to that sample length, with the metric being proxied. “Significance” means the results are appropriate for the length being considered, so varies accordingly. In other words, the expected results are not the same for all periods. Duh.

To be a proxy, it is not enough to have a “significant” relationship in period a-d, it should also have “significant” relationship in sub periods, such as a-b, a-c and c-d, and possibly even points with a sub sub period such as if b is 5 parts, b2-b4. If not all time frames match, it is impossible to know which period(s) in the past the proxy reflect the proxied metric and which don’t.

To Steve’s use of pick two he said this (edited);

The Mann et al SI mentions a highly unusual procedure (which I called ‘pick two daily keno’ in an early post. Pick Three Daily Keno is a lottery in Ontario.) Instead of using the correlation to the actual gridcell, Mann calculates the correlation to the two nearest gridcells in his network. The pick two procedure has another interesting result. A lot of ring width series that fail the Mannian correlation test to the actual gridcell are bumped into “significance” by the pick two daily keno procedure.
John Archer

Posted Jan 21, 2009 at 11:57 AM | Permalink

OT (with apologies):

I’m trying to locate a graph I’ve seen (or rather a link to it) of delta T versus CO2 concentration. It’s the essentially logarithmic one.

I’d be very grateful if you could help. (My email is johnwdarcher=at=hotmail=dot=co=dot=uk.) Thanks.
Carrick

Posted Jan 21, 2009 at 11:58 AM | Permalink

bernie:

I think the issue here is that if something is going to be called a proxy then the relationship should be persistent through time.

That isn’t necessarily always the case.

My comments are based on my understanding as a non-expert in this field, and I of course welcome clarifications or corrections:

All trees, for example, have an inverted U-relationship with temperature (meaning for all other variables held constant (there is an optimal temperature for growth rate). If you pick a given tree on a mountain side for example, as the climate warms, a tree may initially increase its growth rate as it approaches its optimum, then start slowing its growth rate after the optimum temperature has been attained.

Add to this, the tree also responds to fertilization effects from CO2 and to changes in precipitation, and you can have a fairly complex relationship among the variables that you have to tease out. And of course, as the temperature changes, other quantities such as precip amount and CO2 level co-vary. As I understand it, Mann is allowing for the relationship to flip over time between the proxy and temperature to tease out just such a complex relationship.

It certainly is likely that this is not be a particularly robust approach to temperature reconstruction.

In the end what is needed in my opinion is the sort of approach taken by Inez Fung at UCB, where she is incorporating biological models into the climate modeling. I could imagine that such a more model-driven approach to this would yield a much better temperature reconstruction.
Jeff Id

Posted Jan 21, 2009 at 11:58 AM | Permalink

I do enjoy the Mann posts. Let’s see. First I see comments about tighter screening. This sounds good but it is IMHO not correct, the spurious correlations are too high? With all the infilling and autocorrelation a claim of X percent pass so this must be temp is not a fair test. Removing the infilled data I was able to get 39% to pass a negative linear slope screen. Had to mention it again. Still, it is kind of the crux of the paper so I understand why people keep going back to that.

The Briffa proxies are interesting because they pass to such a high percentage and yet have substantial infilling. It might be interesting to look at the degrees of freedom of the infilled portion of the data after RegEM compared to the normal set. Oops, I have a meeting and have to go.
bernie

Posted Jan 21, 2009 at 12:58 PM | Permalink

Carrick:
I do not disagree. IMO the issue is for those who propose the use of a temperature proxy to demonstrate that it is in fact a reliable temperature proxy and has not been selected because it is say a CO2 proxy. Finding a temperature proxy that essentially acts like a thermometer is without a doubt a challenge – that’s why its hard to see this part of the paleoclimate science as settled.

I am not sure that more models are the answer per se, unless you mean that the models will offer precise means of separating out various contributors to changes in a particular data set such as tree rings. Such models would have to be subject to the same statistical assessment that is the subject of this thread.
Clark

Posted Jan 21, 2009 at 1:07 PM | Permalink

So am I correct in reading what you wrote, for the MXD series they:

1. Remove the last 50-odd years from the series because the proxy gives the ‘wrong’ result (um, I mean there is ‘divergence’).

2. They replace this deleted data with made-up data taken from the proxies that pass the correlation-with-temperature screening.

3. After this, then they test the newly ‘improved’ series in a correlation test.

And that finally, the MXD series and the Luterbacher temperature-as-a-proxy-for-temperature series are the only group to pass better than random correlation?
per

Posted Jan 21, 2009 at 1:22 PM | Permalink

For info, mann is in nature today, doing ice:

Click to access nature07669.pdf

per
- Mark T.
  
  Posted Jan 21, 2009 at 2:32 PM | Permalink
  
  Re: per (#23), Apparently doing “infilling” to determine that in spite of the instrumental record that indicates otherwise, Antarctica really is warming.
  
  This guy is a joke.
  
  Mark
- MarcH
  
  Posted Jan 21, 2009 at 5:12 PM | Permalink
  
  Re: per (#23),
  Per, I wonder how many of these Antarctic weather stations used in this reconstruction are affected by localised UHI effects around buildings etc, perhaps Anthony Watts might take a look!
  - Mark T.
    
    Posted Jan 21, 2009 at 5:32 PM | Permalink
    
    Re: MarcH (#36), The whole discussion on this paper will need a new thread, and obviously nobody has had time to look at it yet (published tomorrow, I believe), but in quick answer: the stations actually show cooling except for the one on the peninsula. Mann apparently came to the new conclusion using “new statistical techniques,” whatever that means.
    
    Mark
    - RomanM
      
      Posted Jan 21, 2009 at 6:17 PM | Permalink
      
      Re: Mark T. (#42),
      
      The only thing I could think of when I saw who the authors of the paper were was “the hand of Mann has set foot on Antarctica”!
    - Mark T
      
      Posted Jan 21, 2009 at 9:05 PM | Permalink
      
      Re: RomanM (#48), Yeah, something like that. I shivered, if it matters. The news article also mentioned something about melting, which is really funny when they also mentioned an average temperature of -50 C, and all reports indicate the continent’s ice mass is increasing.
      
      Mark
- RomanM
  
  Posted Jan 21, 2009 at 6:49 PM | Permalink
  
  Re: per (#23),
  
  The supplementary info for the article is available here.
  
  It might be useful until the articcle becomes more widely available.
  
  Part of the first paragraph:
  
  Accuracy in the retrieval of ice sheet surface temperatures from satellite infrared data depends on
  successful cloud masking, which is challenging because of the low contrast in the albedo and
  emissivity of clouds and the surface. In Comiso (ref. 8), cloud masking was done by a
  combination of channel differencing and daily differencing, based on the change in observed
  radiances due to the movement of clouds. Here, we use an additional masking technique in
  which daily data that differ from the climatological mean by more than some threshold value are
  assumed to be cloud contaminated and are removed. We used a threshold value of 10°C, which
  produces the best validation statistics in the reconstruction procedure.
  
  Models and Monte Carlo appear to play a role.
  
  Looks like a new thread to me in a day or two…
Steve McIntyre

Posted Jan 21, 2009 at 1:33 PM | Permalink

#1. the salient point in my inline comment could have been expressed without the extra editorializing. It was very late when I wrote this and there was no need to be Gavinesque. Sorry bout that.

#5,8. pete and DC, you’re both fixing on a point that was passim to the post. Mann asserted that a “passing” yield of 484 out of 1209 proxies showed something as only ~13% would pass by chance. The observation that I was attempting to highlight was the stratification of the yield: 100% of the Luterbacher instrumental series “passed”. No wonder – they are instrumental series. How can these be included in a statistical analysis of “proxies” in a serious article?

The second point was the very high yield (88%) in the M08 infilled version of Rutherford Mann ete al ReGEM gridded versions of Briffa MXD series, where the yield is far higher than run-of-mill dendro proxies.

The third point is the very low yield of low-freq proxies and dendro proxies.

I think that I may have misinterpreted one of Pete’s points above and my reply, while valid in one perspective, may have missed a more salient issue. The yield of red noise series that are “significant” in all 3 comparisons will necessarily be smaller than the yield for any one of the comparisons. What is the “right” benchmark allowing for pick two daily keno and autocorrelation? Dunno. Mann’s “test” is typically home-made, in the sense that there is no statistical reference for his test. So it’s very hard to tie his stuff to literature known off the Island. Is 13% the “right” number allowing both for 3 comparisons and pick two -daily keno? Dunno. Might be higher, might be lower.

What is the “right” sort of red noise from which to construct a test? Autocorrelation matters a lot – see Santer et al 2008.

And if (say) 19% of the dendro proxies passed the Mann test, what exactly does that mean? Does it mean that the series are a bit more autocorrelated than the red noise template? Or does it mean that they have a decodable signal?

At this point, I really don’t want to place undue weight on benchmarks, as these take time to analyse. My issue is primarily the simple one – Luterbacher and Briffa-Rutherford-Mann RegEM MXD.

In light of these comments (which I take as review comments), I’ve edited the post slightly to remove some passim references to the questionable Mannian benchmarks (which I will return to on another occasion).
- Mark T.
  
  Posted Jan 21, 2009 at 2:35 PM | Permalink
  
  Re: Steve McIntyre (#24),
  
  Or does it mean that they have a decodable signal?
  
  A colleague sent me a list of signal processing lemmas today, one of which seems entirely relevant.
  
  “In the limit, given enough matched filtering, enough cross-correlation, and enough coherent integration, there is no need for a signal.”
  
  There are two others that don’t really apply.
  
  Mark
- pete
  
  Posted Jan 21, 2009 at 6:22 PM | Permalink
  
  Re: Steve McIntyre (#24)
  
  #5,8. pete and DC, you’re both fixing on a point that was passim to the post.
  
  The point I’m fixing on is the very first substantive paragraph in this post; the passim points that you have removed are simply a follow-on from this original error. It’s hard to get to the intended point when you begin by confusing statistical and physical significance.
  
  In the spirit of offering a “review” comment:
  
  If your main point is the stratification, you would be better off ignoring the sub-period tests altogether and simply compare the strata using the original test. The way the post is laid out at present gives the impression that the stratification only shows up when using the triple-screening process, which would make the statistical mistakes in the triple-screening process relevant. But Luterbacher, for example, is still going to give 71/71 with the original test.
  - Deep Climate
    
    Posted Jan 21, 2009 at 11:33 PM | Permalink
    
    Re: pete (#49),
    
    Steve said:
    
    Code 9000 dendro proxies make up over 927 of 1209 M08 proxies. Only 143 pass the above simple test ( 15.4%).
    
    Then Pete said:
    
    If your main point is the stratification, you would be better off ignoring the sub-period tests altogether and simply compare the strata using the original test.
    
    For what it’s worth, 243 of the code 9000 candidate proxies pass under the original test (25%).
    
    I’m not sure which codes or other criteria correspond to “low frequency” (said to pass 7/51 under the more restrictive statistical screening), so I can’t review that particular statement at present.
    
    And, by the way, it’s a bit of a stretch to state that I “remarked in effect” that “30% of the proxies failed” what you term an “elementary precaution.” Rather, that is a conclusion you have drawn from data I drew to your attention. I’ve made it very clear that I don’t consider the sub-period significance test appropriate, although a much less stringent test for correlation fluctuation might be.
    
    Indeed my original point has been lost, namely, that only one of the 342 proxies that passed all three tests had an opposing sign in the sub-period correlations, which ran counter to your speculation:
    
    Out of 1209 proxies, 308 have opposite “low frequency” orientations between early-miss and late-miss. Presumably many are “significant” in both directions.
Steve McIntyre

Posted Jan 21, 2009 at 2:02 PM | Permalink

#15. I totally reject the idea of ex post picking – which all too often are connected to things like “false positives”.

As I’ve said on many occasions, if, say, white spruce treeline ring width chronologies are believed to be a temperature proxy, then climate scientists have take them all. They can’t say – after the fact – that ones passing correlation tests are “proxies” and ones failing aren’t. If the first study missed a salient factor, then they need to do a fresh out-of-sample study using new ex ante selection to avoid data snooping bias.

EVen worse is mining the entire ITRDB data set for correlations with complete disregard for whether the series meet any ex ante criteria of being a temperature proxy.

You can get high RE statistics easily – Yule’s classic spurious regression between alcoholism and C of E marriages has a high RE. If you have “Method Wrong” (in Wegman’s phrase), the work isn’t saved by a high RE.

DC seems to take a different position.
Hmmm

Posted Jan 21, 2009 at 2:42 PM | Permalink

STEVE-
I’m confused on one point: Why doesn’t the “divergence” issue itself dominate the debate? Doesn’t divergence directly prove that these “proxies” are not so good? How can proxies not be calibrated over the entire period? Did they cherry pick the best correlated timeframe so that the computated errors would appear lower than we are observing in real life? Shouldn’t the highest level of observed divergence be added into the error computation in the reconstruction (and if so, was it)? If it can diverge this much now, it could in the past.

I would suggest that instead of deconstructing their studies, you create your own study starting with the raw data available. You use your own methods and get your own results and compare to theirs.
- Neil Fisher
  
  Posted Jan 21, 2009 at 7:35 PM | Permalink
  
  Re: Hmmm (#29),
  
  I would suggest that instead of deconstructing their studies, you create your own study starting with the raw data available. You use your own methods and get your own results and compare to theirs.
  
  This is not what CA does. Nor do I see any value in such an undertaking. There are plenty of proxy based reconstructions out there already. Finding out which is correct will be difficult – if not impossible. By calling attention to defects in any such study, CA is aiding in the search for what really happened. Others may find other problems, but CA, by focusing on the stats, at least let’s us know what is wrong or even questionable about such studies, allowing everyone to converge on the right answer – and by “right”, I mean “correct” not “most believed” or “most popular”.
  
  I fail to see why finding errors in others work is so denigrated in climate science, when it is so revered in every other area of science. As far as I can tell, Steve is not interested in who is right and who is wrong, merely that when someone is wrong, they and everyone else who might rely on that work is aware of the problems he has found and that they do not continue to make the same mistake. What’s wrong with that?
  - James S
    
    Posted Jan 21, 2009 at 10:05 PM | Permalink
    
    Re: Neil Fisher (#52),
    
    And on top of that this site is called Climate Audit and not Climate Reconstruction. The purpose of a company audit is to look for errors and to try and quantify them to work out whether the overall picture is reflective of what the directors of that company want to tell the shareholders in the financial accounts
    
    If a large enough error (or errors) is found then the auditors go back to say that the accounts are materially incorrect and it is the directors’ job to decide whether or not to fix the errors. The auditors do not fix the errors for them or come up with a whole new set of accounts.
JaneHM

Posted Jan 21, 2009 at 2:50 PM | Permalink

Steve

More voodoo to be published in Nature tomorrow with new adjustments for Antarctica 1957 – 2006:

“The majority of weather stations on Antarctica sit around the coast, with only two providing an unbroken record from the continent’s interior. Steig and colleagues overcame this lack of data by using satellite data and statistical techniques to fill in the gaps…
“Eric has done a very clever analysis with extremely sparse data,””

http://www.newscientist.com/article/dn16460-even-antarctica-is-now-feeling-the-heat-of-climate-change.html

http://www.nature.com/nature/journal/v457/n7228/abs/nature07669.html
thefordprefect

Posted Jan 21, 2009 at 4:26 PM | Permalink

The use of multiple proxies – getting signal from noise
If I take a photo of my back garden in the dark, under exposing it (using an exposure time that does not trigger the long exposure algorithm of the camera), the result (after increasing the gamma until it is not just a black picture) is noise with no discernable detail – just a slight brightness at the horizon where a distant church is illuminated and lights from the nearest town light up the sky.
Looking at 128 photographs taken at the same time shows no visible correlation apart from the brightness.
Using these bright areas as the know valid record would enable me to throw away all the photographs
• where this brightness does not exist
• change to negative all photos which have a darkness in this area
• expand or shrink or rotate the photo until the known brightness aligns with others
Lining up all photos using this bright area and adding them all together magically (voodoo) enables the picture in the noise to be seen – i.e. the totally “random” firing of the sensor pixels is not actually random but very, very slightly skewed depending on the few photons reaching them (or not).
Also because in this instance all photos were taken on a tripod and simply added together using a photo editing program the imperfections of the sensor become visible – there are inconsistent bright pixels, vertical banding and a light windowpane-like pattern. Knowing the picture I can say these are not valid. Not knowing the picture I would possibly assume these were part of the view.
Whilst I can see this is not quite the same as 1000s of temperature proxies it seems similar to me.
Proxies are aligned with temperature records.
Proxies are added and “averaged”
Any common factors show up in the averaged results – this may not be temperature of course and there may be more than one effect! But what is common –
• Sunlight/radiation ~ not in caves
• rainfall ~ unusual for unrelated areas
• CO2 ~ hmmm
• Temperature ~ hmmm etc.
I still find it inconceivable that correlation statistics could untangle my photographs which may have different zoom ratio (=invalid time scale) different angle (= different time alignment) which the eye can do.
Photo 1 – A single exposure (gamma increased until noise visible is there any visible foreground?) 1/5th second f2.8
Photo 2 – 128 photos added (not averaged – i.e brightness increases for every photo added)
Photo 3 – a flash photo from the same location
- bernie
  
  Posted Jan 21, 2009 at 5:15 PM | Permalink
  
  Re: thefordprefect (#32),
  I am not sure of your point. Your pictures seem to suggest that the true picture can be recreated from very poor images – are you saying this is true for multiple types of data records that may or may not have a discernible and stable temperature signal in them? Chris Hull’s reaction is the same as mine.
- Jeff Id
  
  Posted Jan 21, 2009 at 5:18 PM | Permalink
  
  Re: thefordprefect (#32),
  
  If you choose 10 pixels in a line anywhere on the photo and sort the photos based on the darkest pixels in that region you will discover the images actually contain a dark line in that region not seen by the flash.
  
  On topic, there were I believe only 55 proxies used to create the temperature extension of about 90% of the dataset and this was done on a gridded basis. The RegEm control proxies nearby the briffa sets which passed, become critical to the HS result. This may be a reason some of the 1357 original series were cut down to 1209.
- JS
  
  Posted Jan 21, 2009 at 8:53 PM | Permalink
  
  Re: thefordprefect (#32),
  A problem here is that (within the limits of analogies and so ignoring many things that may be relevant to CCDs and are not relevant to trees) – you know the answer already (the brightness on the horizon is from a church) and so work towards that end.
  
  If I were to provide you with 128 similar photographs but not tell you anything about their context or whether there was a signal or not do you think you would be so successful? Or do you think you might end up emphaising something that wasn’t there in the first place? (Particularly if you get to discard all photos that do not conform to your a priori assumption about what the answer looks like.)
- xtronics
  
  Posted Jan 22, 2009 at 12:27 PM | Permalink
  
  Re: thefordprefect (#32),
  
  Yes, it is possible to retrieve a signal from adding signals where the s/n ratio < 1 – (All GPS units work with S/N < 1) But it is important to understand the limitations – You might be able to determine a picture because what we see is based on the difference between pixels (there are specialized neurons that produce this transform ) – but looking at that data you would not be able to determine the absolute brightness of any pixel. Further you can’t know the absolute difference between the brightest and darkest pixel – or that the resulting differences are even linear). Our brain does an amazing amount of processing and infilling – vision theory
  
  To claim knowledge where the s/n < 1 one first has to have knowledge about the noise – is it and has it always been Gaussian? Is the information we are seeking just the phase or do we need amplitude to support the case? Does the amplitude need to be relative or absolute? In my mind, most of the proxy studies can only possibly show phase information – not absolute temperature.
  
  The danger here is that as humans, we have a weakness to see patterns where non exist; picking a signal out of noise is key to our survival, but we are prone to see patterns where non exist. We can find pictures in random snow images when non were there to start with. Clustering illusion .
  
  (One thought that I keep returning to is that if global temperature varies randomly there is a 50:50% chance that the data will support or discredit AWG – and both sides are eager to claim these trends as proof instead of realizing it is unknowable.) To me, most of these proxy studies can mean nothing without the control of multitudes of confounding variables no matter how good the statistics are and then you end up with only phase information. To look at ice cores and tree cores as equally weighted proxies is insane.
  
  I also consider studies where the data and algorithms are not openly shared mere politics and not science.
  - Mark T.
    
    Posted Jan 22, 2009 at 12:54 PM | Permalink
    
    Re: xtronics (#67),
    
    and both sides are eager to claim these trends as proof instead of realizing it is unknowable.
    
    Actually, I think the legitimate skeptic “side” claims unknowable, or at least, “the conclusions are not supported by this line of evidence.”
    
    To look at ice cores and tree cores as equally weighted proxies is insane.
    
    Well, technically they aren’t equally weighted. I believe they are centered by mean and scaled by standard deviation. However, I know for a fact they used an ergodic assumption for the mean in RegEM, i.e., take the mean of each proxy, then the mean of the resulting vector, and subtract that for centering. There’s more to it, but the function was pretty clear in his RegEM code (Jean S spotted this first, as I recall).
    
    Mark
Mark T.

Posted Jan 21, 2009 at 4:40 PM | Permalink

I’m not sure what your point is…?
Mark
hmm

Posted Jan 21, 2009 at 4:47 PM | Permalink

STEVE-
I’m confused on one point: Why doesn’t the “divergence” issue itself dominate the debate? Doesn’t divergence directly prove that these “proxies” are not as good as claimed? How can proxies not be calibrated over the entire instrumental period; what justification is there for this? Did they cherry pick the best correlated timeframe so that the computated errors would appear lower than we are observing in real life?

Shouldn’t the highest level of observed divergence in the instrumental period be added into the error computation in the reconstruction period (and if so, was it)? If it can diverge this much now, it could in the past too IMO.

I would suggest that instead of deconstructing their studies, you create your own study starting with the raw data available. You use your own methods and get your own results and compare to theirs.
Chris Hull

Posted Jan 21, 2009 at 4:56 PM | Permalink

Reply to 32

The CCD photo above has a s/n ratio of better than 1/1 and has a consistant bias and noise charaistics…note the crossbars… in fact the cheapest CCDs have a consistancy to close to .5% and respond to photons in a extordianry linear manner If you turned the camera a random number degrees per exposer, change the camera brand, use the same file multiple times, put the camera in negitive mode, put multiple filters in front of the camera and vary the exposer time by factors of 10 then we have something close to fitting problem of putting together the proxies.
- Mark T.
  
  Posted Jan 21, 2009 at 5:19 PM | Permalink
  
  Re: Chris Hull (#35), Also build CCDs that respond to inputs other than light, say sound or time of day, and only provide enough information for the analyst to determine these various inconsistencies for 0.1% of the observation window.
  
  Mark
per

Posted Jan 21, 2009 at 5:14 PM | Permalink

just for amusement, I note that President Obama has just made clear that the FOI act is to be much more diligently enforced for american government employees.

wonder if that will have any impact ?
per
Edward

Posted Jan 21, 2009 at 5:25 PM | Permalink

#32 The Ford Prefect
You state: “Proxies are aligned with temperature records”. Is that an assumption? Based on my reading of this blog that statement may or may not be true and certainly depends on the particular proxy you would like to discuss.
thefordprefect

Posted Jan 21, 2009 at 5:33 PM | Permalink

What I was pointing out was you can align pictures (aligning proxies with the temperature record from 1700 onwards) If there is enough overlap and detail the pictures can be repositioned and zoomed to same size (adjusting time scale and time position). In the case of the photograph, camera zoom and position changes will lessen the effect of pixel faults.
If a picture has no point of visual correlation with the average of all photos then it can be discarded from the total. Negative images can be inverted, A photo with high noise levels but good visual correlation, can be added with less significance. If a flash light appears within the picture but the rest is visually correlated to the average then why not cut out the light?
Mark T.

Posted Jan 21, 2009 at 5:40 PM | Permalink

Yes, and what we were trying to understand was a question of relevance to the issue at hand, which I think has been detailed sufficiently since: not so much.

Mark
Mark T.

Posted Jan 21, 2009 at 5:44 PM | Permalink

The point is, thefordprefect, you can’t just arbitrarily use ex post selection criteria when you don’t already have a priori knowledge of the behavior of your sensor system. CCDs respond relatively linearly to a very limited range of inputs, proxies respond in an unknown fashion, non-linearly to boot, to a very wide range of inputs, many of which are correlated with each other.

Mark
Sam Urbinto

Posted Jan 21, 2009 at 5:53 PM | Permalink

Hey, where’s the infrared photos?

The ones in places with higher concentrations of carbon dioxide are especially cute.
Hu McCulloch

Posted Jan 21, 2009 at 6:12 PM | Permalink

Re Thefordprefect (#32),
An interesting example, Mike. However, the noise across your 128 photos is independent, so that their sum has a smaller relative variance than the individual photos by a factor of 1/128. Fortunately, there was also a faint meaningful signal there, which came out when they were summed.

Unfortunately, Mann’s 1207 proxies probably do not have independent errors when regressed on instrumental temperature, and may not even have a meaningful temperature signal in the first place (aside from the Luterbacher instrumental series). Hence they do not necessarily produce a meaningful picture when aggregated (except perhaps for the Luterbacher period, which does not overlap the controversial MWP period, or even most of the LIA period). Averaging 128 independent screenviews of a TV monitor that is not connected to a signal will just produce one more meaningless white noise screenview.

More later.
William

Posted Jan 21, 2009 at 7:31 PM | Permalink

#32 Ford Prefect
When I veiwed your image it teleconnected for me and brought up an image of an old I love Lucy re-run. Would that be considered a high correlation to temperature?
Steve McIntyre

Posted Jan 21, 2009 at 11:38 PM | Permalink

I’ve made it very clear that I don’t consider the sub-period significance test appropriate

I’m glad that you think that at least something is wrong with Mannian CPS 🙂 But I’m curious – why do you object to this? 🙂
- Deep Climate
  
  Posted Jan 22, 2009 at 12:21 AM | Permalink
  
  Re: Steve McIntyre (#57),
  
  To clarify: I meant that the sub-period sigfnificant test was not appropriate for purposes of screening and calibration of proxy set for use in the general reconstruction, whereas you seem to advocate this more stringent test as a “precaution.” To me, that seems overly restrictive – I’m in agreement with Pete on that one.
  
  But I don’t have a problem with the use of the sub-period test for validation, as actually used by Mann et al.
  
  However I do acknowledge that the Socotra dO18 proxy seems problematic as the correlation fluctuates so much. I haven’t seen any evidence that any other proxies suffer from this problem, at least not to that extent.
  - pete
    
    Posted Jan 22, 2009 at 12:33 AM | Permalink
    
    Re: Deep Climate (#60)
    
    However I do acknowledge that the Socotra dO18 proxy seems problematic as the correlation fluctuates so much. I haven’t seen any evidence that any other proxies suffer from this problem, at least not to that extent.
    
    Because a random sequence has ~13% chance of passing screening, we’d expect about 108 of the 484 selected proxies to be spurious*. Socotra dO18 is probably one of these.
    
    [ * These spurious proxies will add noise to the reconstruction, but will not bias the reconstruction if it’s done properly.]
Alan Wilkinson

Posted Jan 21, 2009 at 11:45 PM | Permalink

Using proxies that are not understood to the ridiculous extent that their relevance is not predictable prior to use is never going to yield reliable estimates of temperature.

When they are anyway non-linear to the point of sign reversal they become farcical.

Yet again this is not science, it is politics.
pete

Posted Jan 22, 2009 at 12:13 AM | Permalink

I’ve made it very clear that I don’t consider the sub-period significance test appropriate

I’m glad that you think that at least something is wrong with Mannian CPS 🙂 But I’m curious – why do you object to this? 🙂

You can get an unbiased estimate of the mean of N numbers by taking the first number. Of course you shouldn’t — in statistical terms you are using an inefficient estimate; in layman’s terms you are “throwing away data”.

In this case, the sub-period significance test is inefficient. Presumably there is a more efficient test to achieve the same end; unfortunately my time series knowledge is too rusty to tell you which one =(

Out of 1209 proxies, 308 have opposite “low frequency” orientations between early-miss and late-miss. Presumably many are “significant” in both directions.

For those following alomg at home: the other 307 with opposite orientations were non-significant proxies. The correlation is about zero, so the estimate in different periods might be plus-epsilon or minus-delta, pretty much at random.
Steve McIntyre

Posted Jan 22, 2009 at 1:07 AM | Permalink

For those following along at home: the other 307 with opposite orientations were non-significant proxies.

Don’t think so. Of these 308 proxies, 151 are considered “significant” in at least one of the 3 versions.

temp1= !(sign(details$rtable.r1850_1949lf)== sign(details$rtable.r1896_1995lf)) ;sum(temp1) #308
temp2=temp1& (passing$whole|passing$latem|passing$earlym);sum(temp2) #151

The usual interpretation of such failures in reconstruction literature is that the posited relationship was not “verified”. I agree that the Socotra case seems to be especially bad by yielding both a “significant” positive and “significant” negative relationship, whereas the others appear to have (say) a “significant” positive relationship in one period and an “insignificant” negative relationship in another period – not particularly edifying.

Statistically, it’s useful to ponder how you can have both a”Significant” positive and “significant” relationship. The explanation is almost certainly that the data is highly autocorrelated, that the degrees of freedom are much fewer than assumed by Mann (the type of argument used by Santer against Douglass) and thus neither relationship is actually “significant” allowing for a very modest number of degrees of freedom. I’ll do these calcs tomorow. The knock-on impact of applying Santer-style methods will probably be to eliminate quite a few “significant” relationships, but that’s just a guess.

I remind readers that Mann’s benchmarking lacks any statistical reference and cannot be relied on. I haven’t parsed whether this should be 13% or some other number or considered what an 18% yield actually means (given that M08 is very obscure on this.) As I noted above, the yield represented by “484” is highly inflated by inclusion of Luterbacher and perhaps by RegEM’ed MXD data. The percentage of “significant” dendro proxies is remarkably low even with Mannian under-allowance for autocorrelation.
- pete
  
  Posted Jan 22, 2009 at 4:53 AM | Permalink
  
  Re: Steve McIntyre (#62)
  
  I’d prefer to do it this way:
  
  temp1 <- sign(details$rtable.r1850_1949lf) != sign(details$rtable.r1896_1995lf) ; sum(temp1)
  # 308
  temp2 <- temp1 & (passing$whole & passing$latem & passing$earlym) ; sum(temp2)
  # 40
  
  You can’t determine if the correlations have opposite sign unless you can determine the sign, which you can’t do unless your estimated correllation is significant.
  
  Statistically, it’s useful to ponder how you can have both a”Significant” positive and “significant” relationship. The explanation is almost certainly that the data is highly autocorrelated,
  
  Actually, this is about as many as I’d expect by chance:
  
  Of the 484, say 376 are genuine and 108 spurious [108 / (1209 – 376) ~ 13%]
  Of those 108, 76 survive the two sub-period tests [108 * (342/484) ~ 76]
  Of those 76, 38 should have opposite signs [76/2 = 38; it’s a coin flip since they could have the same sign by chance]
  
  Which is pretty close to the 40 I got earlier.
  
  The percentage of “significant” dendro proxies is remarkably low even with Mannian under-allowance for autocorrelation.
  
  28% (260/927) compared to a threshold of 13%. That’s not “remarkably low”.
  
  Steve: As a start, 484 is a fake number because it includes 71 out of 71 Luterbachers. Plus the 88% yield of M08 massaged Rutherford-RegEMd Briffa MXD data is suspect. Whatever you do needs to be done without these datasets [codes 2000, 7500]
  - RomanM
    
    Posted Jan 22, 2009 at 2:59 PM | Permalink
    
    Re: pete (#65),
    
    Actually, this is about as many as I’d expect by chance:
    
    Of the 484, say 376 are genuine and 108 spurious [108 / (1209 – 376) ~ 13%]
    Of those 108, 76 survive the two sub-period tests [108 * (342/484) ~ 76]
    Of those 76, 38 should have opposite signs [76/2 = 38; it’s a coin flip since they could have the same sign by chance]
    
    Which is pretty close to the 40 I got earlier.
    
    Actually there are several things wrong with these calculations. As I pointed out in Comment 54 of the Mann Correlation Mystery thread, 13% is not correct. If you read the screening procedure on page two of the SI,
    
    The corresponding one-sided p=0.10 significance thresholds are |r|=0.11 and |r|=0.34 respectively. For the shorter (100 year) calibration intervals, we assumed n=98 nominal degrees of freedom for annually resolved records, and n=8 degrees of freedom for decadal resolution records. The corresponding one-sided p=0.10 significance thresholds are |r|=0.13 and |r|=0.42 respectively. Owing to reduced degrees of freedom arising from modest temporal autocorrelation, the effective p value for annual screening is slightly higher (p~0.128) than the nominal (p=0.10) value.
    
    you will notice that they themselves point out that the critical value is for a one-sided test at the .10 level (which I verified in my linked comment). However, as soon as they take the absolute value of the correlation coefficient before making the comparison, they convert it into a two-sided test with a corresponding doubling of the significance level. The nominal level is p = 0.20 and with their “slightly higher” correction gives an error rate of 25.6% not 13%.
    
    I would also take exception with your statement that “it’s a coin flip since they could have the same sign by chance”. In fact, because the two correlations overlap for 54 years of each validation period, I would expect a pretty strong positive correlation between the two correlations thereby increasing the proportion of pairs with the same signs.
    
    I wrote a little script to simulate correlated two independent normal series in the same fashion (since I didn’t have access to “climate model data” 😉 ):
    
    #n = number of repetitions
    samp = function(n) {
    result = matrix(NA,nrow=n, ncol=2)
    for (i in 1:n) {
    rand1 = rnorm(146)
    rand2 = rnorm(146)
    result[i,1] = cor(rand1[1:100],rand2[1:100])
    result[i,2] = cor(rand1[47:146],rand2[47:146])}
    result}
    
    #answers will vary
    test=samp(10000)
    cor(test[,1],test[,2]) # 0.5307599
    
    #number with same signs
    sum(test[,1]*test[,2] >0) # 6789
    
    In several runs, the correlation of the two calculated correlations was between .50 and .55 and the percentage of cases where both of the signs were the same was about 68%.
    - pete
      
      Posted Jan 22, 2009 at 6:10 PM | Permalink
      
      Re: RomanM (#69)
      
      you will notice that they themselves point out that the critical value is for a one-sided test at the .10 level (which I verified in my linked comment). However, as soon as they take the absolute value of the correlation coefficient before making the comparison, they convert it into a two-sided test with a corresponding doubling of the significance level. The nominal level is p = 0.20 and with their “slightly higher” correction gives an error rate of 25.6% not 13%.
      
      I read this as converting a two-sided p=0.10 test with thresholds +/-0.13 into a one-sided p=0.10 test with threshold +0.13.
      
      Thanks for the correction re “coin flip”, I was a little careless there.
      
      Um, Steve? Could you perhaps delete my post above? That huge chunk of whitespace was meant to be a +/- sign; might have to brush up on my TeX.
    - RomanM
      
      Posted Jan 22, 2009 at 6:53 PM | Permalink
      
      Re: pete (#78),
      All you have to do is calculate the probability that that a correlation (without absolute value exceeds the critical value. here is an R program:
      
      #Calculate probability correlation greater than r
      
      probcor = function(r,df) {
      tval = r/sqrt((1-r^2)/df)
      prob = pt(tval, df,lower.tail = F)
      list(t=tval, p =prob)}
      
      df = c(8, 13, 98, 144)
      r =c(.42, .34, .13, .11)
      
      probcor(r,df)$p # 0.11344236 0.10750085 0.09867576 0.09312986
      
      Since the distribution is symmetric around 0, the absolute value doubles the probability. The differences from .10 come from rounding the critical value to two digits (why would anyone round the values when computers are being used???). My suspicion is that someone got the idea that since the comparison on the negative side was unnecessary due to the absolute value, the test was therefore “one-sided” – an error that I’ve seen before in elementary courses…
      
      Steve: Roman, there’s probably a native function that does this even more directly. I can directly reverse this with qnorm:
      
      qnorm( 1-c (0.11344236, 0.10750085, 0.09867576, 0.09312986 ),sd=1/sqrt(c(8,13,98,144))
      #0.4272419 0.3438944 0.1302222 0.1101438
      
      A question: I’ve been taking the tanh of this expression for the Fisher transformation i.e.
      
      tanh(qnorm( 1-c (0.11344236, 0.10750085, 0.09867576, 0.09312986 ),sd=1/sqrt(c(8,13,98,144)))
      #0.4030138 0.3309497 0.1294911 0.1097005
      
      IT doesn’t make much difference but isn’t it required in some of these calcs?
    - pete
      
      Posted Jan 22, 2009 at 8:16 PM | Permalink
      
      Re: RomanM (#80), thanks got it now.
      
      (getting some weird errors, sorry if this double-posts)
    - Deep Climate
      
      Posted Jan 23, 2009 at 11:33 AM | Permalink
      
      Re: RomanM (#69),
      
      The corresponding one-sided p=0.10 significance thresholds are |r|=0.11 and |r|=0.34 respectively.
      
      You will notice that they themselves point out that the critical value is for a one-sided test at the .10 level (which I verified in my linked comment). However, as soon as they take the absolute value of the correlation coefficient before making the comparison, they convert it into a two-sided test with a corresponding doubling of the significance level.
      
      This appears to be a misinterpretation of the text. Mann et al are simply using a short hand notation to describe the two possible one-sided tests. But in the actual significance test for the vast majority of individual proxies, only one or the other is applied (otherwise it would be a two-sided test).
      
      At least, that’s how I understood it. But if you’re not convinced, check the data:
      
      temp1= (abs(details$rtable.r1850_1995) > .1) ;sum(temp1) # 930
      temp2 = temp1 & passing$whole ;sum(temp2) # 484
      
      It looks like there were quite a few proxies that would have passed a two-sided test, if it had been applied. But in the vast majority of cases it wasn’t.
    - RomanM
      
      Posted Jan 23, 2009 at 12:19 PM | Permalink
      
      Re: Deep Climate (#82),
      
      You are totally and completely confused. Have you not taken an elementary stat course?
      
      What we are talking about here is a SINGLE comparison involving the correlation of the proxy to a target sequence to try to determine whether the calculated correlation could be as far from zero as it is if there actually was no relationship between the two series.
      
      In this case, we have what is called a two-sided test because the decision is made to say that there is a relationship if the correlation is either large enough on the positive OR large enough on the negative side. If we were looking for positive correlation only, we would ignore the negative values (no matter how far below zero) and only do our check for large enoughness on the positive side. That is called a one-sided test.
      
      Now, the value that has been selected by the Mann will accept a non-existent, spurious correlation about 10% of the time when a one-sided test is done . That is what the SI correctly states (which my calculation #79 verifies – if you don’t agree with the calculations, by all means, point out the errors). Now in an elementary stat course, you would learn that using that same cutoff value on both sides of zero (for a symmetrically distributed statistic) will double the error rate. By taking the absolute value of that statistic, this is EXACTLY what you are doing even though it might look like you are only comparing on one-side, a mistake made by the folks who wrote this paper and selected the WRONG cutoff value for their stated error rate. Got it?
      
      There will be a test on this on Monday and you really need to pull up your grades if you expect to pass. 😉
    - Deep Climate
      
      Posted Jan 23, 2009 at 2:09 PM | Permalink
      
      Re: RomanM (#83),
      
      From Mann et al 2008:
      
      Where the sign of the correlation could a priori be specified (positive for tree-ring data, ice-core oxygen isotopes, lake sediments, and historical documents, and negative for coral oxygen-isotope records), a one-sided significance criterion was used. Otherwise, a two-sided significance criterion [e.g. spelo] was used.
      
      [Emphasis added]
      
      Here is the result of applying the two-sided test to every proxy: 902 pass.
      
      temp1= (abs(details$rtable.r1850_1995) > .106) ;sum(temp1) # 902
      
      Here’s the number that actually passed: 484
      
      temp2 <- temp1 & passing$whole; sum(temp2) # 484
      
      Honestly, I don’t know what else to say.
    - RomanM
      
      Posted Jan 23, 2009 at 2:29 PM | Permalink
      
      Re: Deep Climate (#85),
      
      So show me where in Mann’s program where he actually did a one-sided test.
    - Deep Climate
      
      Posted Jan 23, 2009 at 6:00 PM | Permalink
      
      Re: RomanM (#86),
      
      So show me where in Mann’s program where he actually did a one-sided test.
      
      Here are some code snippets from gridproxy.m
      
      One sided positive significance test for most of the proxies (tree-rings etc.):
      
      if (z(3,i)==9000 | z(3,i)==8000 | z(3,i)==7500 | z(3,i)==4000 | z(3,i)==3000 | z(3,i)==2000) &…
      x(kk,i+1)>-99999 & x(kkk,i+1)>-99999 &…
      z(1,i)>=ilon1 & z(1,i)=ilat1 & z(2,i)<=ilat2 & z(ia,i)>=corra
      
      One-sided negative significance test for code 7000 proxies:
      
      if ( z(3,i)==7000) &…
      x(kk,i+1)>-99999 & x(kkk,i+1)>-99999 &…
      z(1,i)>=ilon1 & z(1,i)=ilat1 & z(2,i)<=ilat2 & z(ia,i)<=corra2
      
      Two-sided test for code 5000/6000 proxies only (relatively small number of proxies)
      
      if (z(3,i)==6000 | z(3,i)==5000) &…
      x(kk,i+1)>-99999 & x(kkk,i+1)>-99999 &…
      z(1,i)>=ilon1 & z(1,i)=ilat1 & z(2,i)<=ilat2 & abs(z(ia,i))>=corra3
      
      These are for the annually resolved proxies, but I see similar code for the decadally resolved proxies too. It appears that Mann did what he said he did.
    - RomanM
      
      Posted Jan 24, 2009 at 10:14 AM | Permalink
      
      Re: Deep Climate (#88),
      
      Score one for DC and the hockey team. You are corect, they do one-sided tests for the proxies in several of the code snippets that you list. My bad! I had not gone through all of the Matlab programs in detail and overlooked the pieces that you listed. I agree that Prof. Mann et al. are aware of the difference between one-sided and two-sided tests and apologize to them for having unfairly implied the opposite. 😦
      
      Now having eaten some crow, I have to ask why in a serious scientific study, anybody would do a “two-sided test for code 5000/6000 proxies only (relatively small number of proxies)”. If you don’t know in advance whether the correlation is positive or negative (indicating that the physical processes behind that link are poorly and/or incompletely identified), how could you assume that that relationship is specifically to temperature in a linear fashion? Black box teleconnection at its finest!
- Deep Climate
  
  Posted Jan 23, 2009 at 12:53 PM | Permalink
  
  Re: Steve McIntyre (#62),
  
  Of these 308 proxies, 151 are considered “significant” in at least one of the 3 versions.
  
  I think it’s worth having a little more precision on this issue, focusing only on the “significant” sub-period proxies.
  
  First of all in the screening process (using HF):
  
  temp1 <- sign(details$rtable.r1850_1949) != sign(details$rtable.r1896_1995) ; sum(temp1) # 229
  temp2 <- temp1 & (passing$latem & passing$earlym) ; sum(temp2) # 1
  temp3 <- temp1 & (passing$latem | passing$earlym) ; sum(temp3) # 102
  
  So out of 229 opposite signed proxies, only one was “significant” and passed the screen. Among proxies that passed in one sub-period only, 101 were opposite-signed in the other subperiod – all rejected.
  
  Only 1/229 opposite-signed proxies was significant.
  
  Now let’s look after conversion to low frequency for calibration. Here there is no further test for significance. So we would expect some “passing” proxies to change, even to the point of changing signs.
  
  temp1 <- sign(details$rtable.r1850_1949lf) != sign(details$rtable.r1896_1995lf) ; sum(temp1) # 308
  temp2 <- temp1 & (passing$latem & passing$earlym) ; sum(temp2) # 40
  temp3 <- temp1 & (passing$latem | passing$earlym) ; sum(temp3) # 136
  
  So the correct number, as pete first showed, is 40/308, not 151.
  
  Based on a complete understanding of the process, I don’t find any of these numbers particularly surprising or problematic.
bill Drissel

Posted Jan 22, 2009 at 1:10 AM | Permalink

#56

243/9000 = 2.7%

Regards,
Bill
- pete
  
  Posted Jan 22, 2009 at 4:02 AM | Permalink
  
  Re: bill Drissel (#63)
  
  243/9000 = 2.7%
  
  The dendro proxies are “Code 9000”, i.e. that’s how they’re labelled, not how many there are.
  
  There are 927 dendro proxies; 243/927 ~ 15%
bril

Posted Jan 22, 2009 at 8:24 AM | Permalink

Pete,

I am one of those following at home (as you term us) and I am now confused. I can’t pretend to understand your formula in #65. Are you still saying that “the other 307 with opposite orientations were non-significant proxies”?
- pete
  
  Posted Jan 22, 2009 at 5:50 PM | Permalink
  
  Re: bril (#66):
  
  I am one of those following at home (as you term us) and I am now confused. I can’t pretend to understand your formula in #65. Are you still saying that “the other 307 with opposite orientations were non-significant proxies”?
  
  I got the 1/307 from DC’s quote in the original post. He’s since explained where it came from, but in post 65 I was unsure.
Steve McIntyre

Posted Jan 22, 2009 at 3:02 PM | Permalink

pete and DC, you can’t apply Mann’s benchmark of 0.13 to pick two daily keno. I re-calculated an item for details$pickone.1850_1995 i.e. correlation for the actual gridcell. Instead of 243 passing as stated by DC (actually I got 260 passing in Msnn’s table), there are only 166 pickone (17.9% versus 13% according to the Mann benchmark). This pick two methodology is VERY odd. I’d be astonished if anyone can cite a precedent.

details$pickone.1850_1995=NA
details$pickone.1850_1995lf=NA
for (k in 1:1209) {
(index= (1:2592)[!is.na(match(instrumental.info$jones,unlist(details[k,”next1″]))) ] )
#locates mannid for next1, next2 for proxy #1486 1487
x=g(mann[[k]]);y= infilled2poles[,index]
test=ts.union(x,y )
temp=(time(test)>=1850)&(time(test)=1850) &(time(x)< =1995)
temp1=(time(x)=1850)&(time(test).13 & details$code==9000) # 166
- pete
  
  Posted Jan 22, 2009 at 6:01 PM | Permalink
  
  Re: Steve McIntyre (#70)
  
  pete and DC, you can’t apply Mann’s benchmark of 0.13 to pick two daily keno. I re-calculated an item for details$pickone.1850_1995 i.e. correlation for the actual gridcell. Instead of 243 passing as stated by DC (actually I got 260 passing in Msnn’s table), there are only 166 pickone (17.9% versus 13% according to the Mann benchmark). This pick two methodology is VERY odd. I’d be astonished if anyone can cite a precedent.
  
  My understanding is that the 13% threshold applies specifically to “pick 2”, which means you can’t apply it anywhere else (like to “pick one”).
Deep Climate

Posted Jan 22, 2009 at 3:20 PM | Permalink

Pete said:

For those following along at home: the other 307 with opposite orientations were non-significant proxies.

Steve said:

Don’t think so. Of these 308 proxies, 151 are considered “significant” in at least one of the 3 versions.

It’s worth keeping in mind that HF (annual) correlations when available are used for screening, but LF (decadally smoothed) correlations are used for calibration.

Here are the numbers using the HF correlations:

temp1 <- sign(details$rtable.r1850_1949) != sign(details$rtable.r1896_1995) ; sum(temp1) # 229
temp2 <- temp1 & (passing$whole & passing$latem & passing$earlym) ; sum(temp2) # 1
temp3 <- temp1 & (passing$whole | passing$latem | passing$earlym) ; sum(temp3) # 113

So to summarize sign differences between “early miss” and “late miss” correlation:

HFD means count of high frequency correlations with sign difference
LFD means count of low frequency correlations with sign difference
Pass 3 means proxies that passed all 3 tests
Pass 1 means proxies that passed at least 1 test

Correl. HFD LFD
Pass3 001 040
Pass1 113 151
AllPrx 229 308

It should be noted, though, that in most cases in the pass-at-least-one row figures, the proxy with the “wrong” or opposing sign did not pass the significance test.
Steve McIntyre

Posted Jan 22, 2009 at 3:59 PM | Permalink

Mann stuff is always needles in eyes. Consider the following:

We assumed n = 144 nominal degrees of freedom over the 1850–
1995 (146-year) interval for correlations between annually resolved
records (although see discussion below about influence of
temporal autocorrelation at the annual time scales), and n = 13 degrees
of freedom for decadal resolution records. The corresponding
one-sided P = 0.10 significance thresholds are |r| = 0.11 and
|r|~ 0.34, respectively.

They have 8 annual series with stated correlations less than r=0.11. It looks like they used something lower – about .106.

More annoying is this. They have 3 series that don’t meet the low-frequency guidlines:

id code r1850_1995 rtable.r1850_1995lf
483 li_1998_d13c 6001 NA -0.3907627
1043 suk_1987_koreadrought 5001 NA -0.3540128
1205 zhang_1980_region3 5001 NA -0.3541390

Here’s it’s not that they used a different number, as there are a few series in the same range that “pass”. I guess maybe they forgot to flip in time.

# id code r1850_1995 rtable.r1850_1995lf
#316 chesapeak_paleosalinity 3001 0.3886 0.3886366
#382 curtis_1996_d13cpyro 4001 0.3971 0.3970749
#1055 thompson_2003_dasuopu 8001 0.3423 0.3422881
- Deep Climate
  
  Posted Jan 22, 2009 at 5:53 PM | Permalink
  
  Re: Steve McIntyre (#72),
  
  Hmmm …
  
  These two end in 1960 – could that reduce d.f. just enough to fail?
  
  1043 suk_1987_koreadrought 5001 NA -0.3540128
  1205 zhang_1980_region3 5001 NA -0.3541390
  
  Can’t explain the other one though.
Steve McIntyre

Posted Jan 22, 2009 at 5:56 PM | Permalink

#74. Doubt it. Remember that Mann infills everything for these calcs, so these series have data from 1960-90 (hence the correlations) even if the data’s fictitious.
Steve McIntyre

Posted Jan 22, 2009 at 6:17 PM | Permalink

#76. Oh, now I see why you got cross-eyed with my comments as we don’t have a common understanding of Mann’s benchmarks.

I’m quite sure that your interpretation isn’t right (though the article definitely leads you to the understanding that you have), because I think that I know how Mann derived his benchmarks. (Not that he actually provides references; it’s just that I understand the ecology.)

I’m out playing squash league in about one minute and will explain tomorrow, but for now, think about it on the basis that the test is premised on pick one and that the corresponding pick-one yield is about 17.9%. And that the pick-two yield compared to pick-one standards is a gross inflation.

Note also that that the pick-one yield is fairly sensitive to the autocorrelation assumption of the benchmark. Mann’s benchmark can be shown to be assume an average AR1 coefficient in the dendros of about 0.133 – a huge underestimate. To generate a yield of 17.9%, you need an average autocorrelation AR1 of about 0.35: the observed average AR1 autocorrelation in the Mann dendro network is higher than that.

I really doubt that there is a recoverable “signal” in Mann’s dendro network – and I’ll post up some evidence tomorrow that the pick one yield is not surprising given observed levels of autocorrelation.
Peet

Posted Jan 23, 2009 at 12:24 AM | Permalink

This is particularly wrong: you’re comparing the triple-screening result to the single-screening threshold — apples and oranges.
bril

Posted Jan 23, 2009 at 3:13 PM | Permalink

Deep Climate,

This one really ain’t rocket science. Mann said he was applying a one-sided test and thought he was applying a one-sided test. However, by using absolute values he was effectively using a two-sided test.
pete

Posted Jan 24, 2009 at 12:28 AM | Permalink

So when Mann says …

The corresponding one-sided p=0.10 significance thresholds are |r|=0.11 and |r|=0.34 respectively.

… he’s talking about two different one-tailed thresholds: r > 0.11 for a priori positive correlations and r < -0.11 for negative correlations?

That would explain #72
- Deep Climate
  
  Posted Jan 24, 2009 at 10:07 AM | Permalink
  
  Re: pete (#89),
  
  So when Mann says …
  
  The corresponding one-sided p=0.10 significance thresholds are |r|=0.11 and |r|=0.34 respectively.
  
  … he’s talking about two different one-tailed thresholds: r > 0.11 for a priori positive correlations and r < -0.11 for negative correlations?
  
  That’s exactly the point I’ve been trying patently to explain to the CA crowd; I think they get it now.
  [Steve: while I’ve taken exception to many aspects of Mann’s methods, this was not an issue that I had raised or taken exception to. Roman had – and has agreed that he misunderstood Mann’s methods (which is easy enough to do), but I’m not sure that anyone else had weighed in on the topic, so it’s a bit grandiose to say that you were dealing with a “crowd” on this topic.
  
  Of course, for low frequency one-tailed thresholds (|r|=0.34) the two tests would be r > 0.34 for a priori positive correlations and r < -0.34 for negative correlations.
  
  That would explain #72
  
  Sort of. I went back to that comment and noticed that the failing low-frequency proxies were code 5001 and 6001. So these would have been subjected to the two-sided test (as described in the paper and implemented in the code). I’d have to check the documentation and code, but obviously you would need a higher absolute threshold for the two-sided test to achieve the same level of significance. So it seems very plausible that that’s the explanation.
  
  As for the high frequency one-sided proxies that passed but were under .11, it’s clear that the actual calculated threshold was .106, which was rounded up in the SI documentation.
Hu McCulloch

Posted Jan 24, 2009 at 9:35 AM | Permalink

Re #1, 2, Steve originally wrote,

If a “proxy” is a proxy, then it is a proxy regardless of the subperiod.

True enough — if a proxy’s relation to temperature isn’t constant across subperiods of the calibration period, there is no reason to expect it to have remained constant back in the reconstruction period, either.

However, as Pete points out (with unnecessary rudeness), Steve’s next sentence is too strong. Steve said,

It is not enough to have a “significant” relationship in the 1850-1995 period, it should also have “significant” relationship in the 1850-1949 and 1896-1995 periods (Mann’s late-miss and early-miss periods.)

Requiring the proxy to pass three hurdles in this manner — a “Leap 3” criterion, as it were — is just as size-distorting, but in the opposite direction, as Mann’s “Pick 2 Keno” criterion. In fact, as the subperiods get shorter, the power of a test at the level in question gets smaller, and so one is almost guaranteed not to find a significant correlation in subperiods if they are sufficiently short.

Although it is too much to ask that the data show a significantly nonzero relationship in each subperiod as well as the whole period, a very different but valid double check is whether the relationship is significantly non-constant between subperiods. This can be done with a standard Chow test, breaking the whole period in half.

The standard two-regime Chow test could be easily modified in the spirit of the Mann 08 paper to split the whole period into three approximately equal periods, and then do a (single) test to see if the coefficients are equal in all 3 periods. Howevever, the power of such a test would be reduced. My inclination would be to stick with 2 periods, but if someone wants to do 3 that’s fine too.

The difference in the two approaches is that Steve would require the series to reject-reject-reject non-zero slope in the full and 2 subsamples, whereas I would have it reject nonzero slope in the full sample, and then not reject unequal coefficients between periods. My suggestion is still a double test of sorts, but it is more of the nature of leap one hurdle and then don’t fall on your face, rather than leap two hurdles.

On the opposite distortion generated by Mann’s Pick-2 Keno procedure, newbies are strongly urged to go back and read the 9/20 thread The Mann Correlation Mystery, and then to browse through the remainder of the Mann 08 Category on CA.
- Deep Climate
  
  Posted Jan 24, 2009 at 10:37 AM | Permalink
  
  Re: Hu McCulloch (#90),
  
  This can be done with a standard Chow test, breaking the whole period in half.
  
  The standard two-regime Chow test could be easily modified in the spirit of the Mann 08 paper to split the whole period into three approximately equal periods, and then do a (single) test to see if the coefficients are equal in all 3 periods.
  
  Your idea is a better version of the idea I raised back in Deep Climate (#15),
  
  I do think some sort of check on correlation “fluctuation” might be appropriate, although requiring two sub-period correlations to also be significant would be too restrictive. One possible simple test would be a sign test, requring the sub-periods to be of the same sign as the overall correlation, but not necessarily to the level of significance. (The sub-periods should also not overlap for this particular purpose, of course – half periods would seem appropriate).
  
  One implication of any sub-period test, of course, is that the same test should be applied in the validation exercise to the calibration subperiod. So you divide the whole period in halves for the “whole” reconstruction Chow test. And you would do the same thing within the two sub-period validation reconstructions.
Hu McCulloch

Posted Jan 24, 2009 at 9:54 AM | Permalink

Re #82, 83, 89, I think Pete and DC are right here — Mann just means that the 1-tailed critical value for r is +.11 or -.11, as the case may be, but then confusingly merged this into the single statement that the absolute value of the critical value is +.11, which could easily be taken as saying that the critical value for the absolute value of r is .11. As RomanM correctly objects, the latter would be a 2-tailed test, but Mann seems to be giving it the former interpretation. (A more exact value is .1067.)

Three valid problems with Mann’s critical value are:

a) Mann’s Pick2 procedure, which greatly disorts its validity, as Steve showed earlier on the 9/20 thread The Mann correlation Mystery,

b) The “one size fits all” adjustment for serial correlation claiming that the SC-adjusted test size is about 0.128 for the annually resolved proxies and 0.1 for the decadal proxies. I can see applying a single adjustment to each category of proxies, but surely tree rings widths, grape harvests, and ice cores have completely different serial correlation properties. An average serial correlation should have been calculated for each category, reported, and used to compute SC-adjusted critical values at a common p-value.

c) The “Voodoo” data mining procedure that applies a critical value for a single test to 1209 tests, and that was the original topic of this thread and its predecessor. More on that later.
Steve McIntyre

Posted Jan 24, 2009 at 10:51 AM | Permalink

If you don’t know in advance whether the correlation is positive or negative (indicating that the physical processes behind that link are poorly and/or incompletely identified), how could you assume that that relationship is specifically to temperature in a linear fashion? Black box teleconnection at its finest!

Quite so. A point made on other occasions in different forms, but absolutely valid.
Hu McCulloch

Posted Jan 24, 2009 at 11:26 AM | Permalink

Back, now to Voodoo Correlations and Correlation Picking, which was the original topic here. (See also Industrial Strength Voodoo Correlations).

As I discussed at Comment #19 of the original Voodoo Correlations and Correlation Picking thread, the standard Bonferroni size adjustment states that the individual test size α* may have to be set as low as α/n in order for the probability of a false reject to be no more than α. This is just an inequality that covers the worst possible case, but is not much different, for small test sizes, than the Šidàk adjustment, which is exact in the case of independent tests.

With n = 1209 and a desired collective test size of α = .10, the adjusted individual test size could have to be as low as α* = .1/1209 = .0000827. The Šidàk value of 1-.91/n = .0000871 isn’t different enough to be worth calculating. For the high-frequency proxies, the Bonferroni-adjusted 1-tailed critical value becomes 0.3068 (or -.3068, as the case may be) rather than 0.1067 (or -.1067) as for a single test. The low-frequency adjusted 1-tailed hurdle is 0.8228 (or -.8228).

Scanning the SI ProxyNames spread sheet, I see at least a few TRW’s that are greater than .3068, plus a number of other proxies with even higher value. I take this to mean that only these proxies are clearly significant at the .1 level, and that only these should be included in any reconstruction period where all 1209 are available, after eliminating the Pick-2 Keno trick and doing a more detailed adjustment for serial correlation.

If only a handful of the 1209 are active for a given period (say the MWP), the adjusted critical values should be recomputed using the actual number of active proxies and the selection procedure redone with these adjusted critical values.

Since the tests are in fact positively correlated (in the sense of being positively dependent), the Bonferroni and even Šidàk criteria are admittedly probably too severe. A more powerful test that all the proxies are zero (and therefore of determining that at least 1 is non-zero) would be based on an F statistic that took the correlations of the proxy residuals on temperature (as well as their individual serial correlation) into account.

Unfortunately, the matrix of the unrestricted correlations of these residuals is singular, since there are far more proxies than observations. However some sort of test like this might still be computable under strong restrictions on the covariances, such as that the members of each class of proxies have a common correlation coefficient with temperature, and that different classes (TRW vs ice core, eg) have uncorrelated errors.

A further problem with an F test is that it is intrinsically based on a 2-tailed alternative hypothesis for the coefficients. Perhaps this can be gotten around by Monte Carlo methods, but it’s a big complication if the one-tailed priors are to be maintained.
Hu McCulloch

Posted Jan 24, 2009 at 11:51 AM | Permalink

Steve wrote,

The “best” performer are the Luterbacher series – series which have no business whatever being in a “proxy data sets”. 71 out of 71 Luterbacher series pass the above test. This is not much of an accomplishment since Luterbacher uses instrumental data in his “proxies”.

I don’t see why these series shouldn’t be included in a temperature reconstruction, if they have a good correlation with hemispheric or global temperature in the post-1850 period. Since they are instrumental, they are in a different class than the more indirect proxies, and naturally fit a lot better. Since they are only fragmentary instrumental data, they can’t directly be used to construct global temperature the way it is constructed after 1850, but instead just have to be treated as (very strong) proxies that give an only noisy reading on target temperature.

To be sure, they don’t go back very far, and so shed no light on the MWP and very little on the LIA. Nevertheless, I think it is perfectly legitimate to include them to try to beef up the precision on the period they do cover.

Still, your point is well taken that it’s totally misleading of Mann to crow that the bimillenial proxy set as a whole is good just because it contains these few good proxies whose range is limited to the last couple of centuries.

Steve: Hu, in the instrumental period, they aren’t “proxies – they are actual instrumental series derived from the same data that is used for the gridcell comparanda. So it’s a totally meaningless comparison that inflates the yield. I understand the point you’re making, but the issue in these recons is the MWP, so the inclusion of this stuff ends up distorting the comparisons.
- Deep Climate
  
  Posted Jan 24, 2009 at 3:28 PM | Permalink
  
  Re: Hu McCulloch (#97),
  
  Still, your point is well taken that it’s totally misleading of Mann to crow that the bimillenial proxy set as a whole is good just because it contains these few good proxies whose range is limited to the last couple of centuries.
  
  Hu, I agree with most of what you write here (and elsewhere). However, I find this particular statement a bit of a stretch. I know most here would not agree, but I think Mann has done a reasonable job of reporting the increasing uncertainties as one goes back in time. This graph (from the original paper) is probably the best visual representation of Mann’s NH CPS recon validation results. Note the high RE score for the post-1700 network, and the progressively lower scores going back in time.
  - UC
    
    Posted Jan 24, 2009 at 3:45 PM | Permalink
    
    Re: Deep Climate (#105),
    
    I know most here would not agree, but I think Mann has done a reasonable job of reporting the increasing uncertainties as one goes back in time.
    
    For sure, he has taken the uncertainties into account as fully as possible. This visual representation is quite difficult to interpret, but the black line seems to out of all CIs ? Has he over-optimized something ?
UC

Posted Jan 24, 2009 at 12:02 PM | Permalink

propick.m is very difficult to follow, modified the code to show some values,

if sum(d3)>=igr % at least ‘igr’ correlation of ‘igrid’ pass the CL
n4=n4+1;
aa
bb=abs(aa)
ra=max(bb)
[iii,jjj]=find(bb==ra)
if sum(iii)>2
rra(n4)=aa(1,1); %07/20/06
else
rra(n4)=aa(iii,jjj);
end

rra(n4), pause
d4(n4)=i+1;
end

%% two nearest correlations

aa =

-0.1852
0.1093

%% both significant ? 🙂

%% absolute value
bb =

0.1852
0.1093

%% max abs correlation
ra =

0.1852

%% selected rr(n4)
ans =

-0.1852

Where is rtable1209late generated, btw?
Steve McIntyre

Posted Jan 24, 2009 at 12:27 PM | Permalink

Hu’s 96 gives much food for thought. On a more mundane topic, I think that I can derive Mann’s benchmarks (which do not have a reference and pretty much appear out of the sky) – and in the process prove that these are pick one benchmarks.

The Fisher transformation says that the atanh of the correlation has a N(0,sd=sqrt(N-3)) distribution. Mann’s benchmarks relate to this (with the usual Mannian flourish notes of inconsistency with the formulas).

Start with the annual one-sided data:

N=length(1850:1995)
tanh( qnorm(.9, sd=1/sqrt(N-3))) #[1]# 0.1067603

The reported number is 0.11, but the minimum observed “passing” correlation is a bit less at 0.1061 – very close to the tanh calculation but a bit under it. [UC observes below that a value of 0.106 is used in Mann’s code]. The maximum non-passing correlation is 0.1056892.

min(abs( details$r1850_1995),na.rm=T) # 0.1061
temp=!is.na(details$r1850_1995)
temp1=details$rtable.r1850_1995>0
max( details$rtable.r1850_1995[!temp&temp1&annual],na.rm=T) # 0.1056892

With a little experimenting, it looks like Mann might have used N-1 instead of N-3 in the above formula.

tanh( qnorm(.9, sd=1/sqrt(N-2))) #[1]# 0.1063918
tanh( qnorm(.9, sd=1/sqrt(N-1))) #[1]# 0.1060271 ##used
tanh( qnorm(.9, sd=1/sqrt(N))) #[1]# 0.1056660

[As noted by DC below, the code uses .106 and rounding doesn’t permit choosing between N-2 or N-1].

This pattern works for the two-sided decadal. On the basis that decadal N-1=13 (and there’s basis for this):

tanh( qnorm(.9, sd=1/sqrt(13))) # 0.3411898

The reported benchmark in the text is 0.34. The minimum observed passing is 0.3423 and the maximum observed non-passing is 0.3319072.

min(abs( details$r1850_1995[!annual]),na.rm=T) # 0.3423
max( details$rtable.r1850_1995lf[!temp&temp1&!annual&!tempb],na.rm=T) # 0.3319072

For the two-sided decadal case (speleo and documentary), the formula yields:

tanh( qnorm(.95, sd=1/sqrt(13))) # 0.4269823

The minimum observed passing absolute value is 0.4272 and the maximum non-passing is 0.4110194:

min(abs( details$r1850_1995[!annual&tempb]),na.rm=T) #] 0.4272
max( abs(details$rtable.r1850_1995lf[!temp &!annual&tempb]),na.rm=T) # 0.4110194

This explains the exclusion of a few series that we wondered about recently.

For annual two-sided, the formula gives a benchmark (2-sided) of 0.1357544, rather than 0.1063 one-sided.

tanh( qnorm(.95, sd=1/sqrt(N-1))) # 0.1357544

The minimum passing observed is 0.1494 (consistent with this), but there is only one non-passing two-sided annual series: 201 burns_2002_thickness, which has a value of 0.04973034 and does not bracket the one-sided value.

min(abs( details$r1850_1995[annual&tempb]),na.rm=T) # 0.1494
max( abs(details$rtable.r1850_1995[!temp &annual&tempb]),na.rm=T) # 0.04973034

However, the pattern shows beyond any dispute that a pick-one test was used for the benchmark.

What would be a correct pick two test (aside from Hu’s issues which I am assimilating)? If the two gridcells were independent (which they aren’t), it would be pretty much double the pick one less about .01 for the overlap. Obviously there’s considerable spatial covariance between the two gridcells, so the increase in benchmark is by some percentage less than a double.

The big problem, as we’ve noted in the past, is that whatever degree the pick two gooses the correlations, by leaving the benchmark unchanged, Mann has put his finger on the scales of the test.
- UC
  
  Posted Jan 24, 2009 at 1:00 PM | Permalink
  
  Re: Steve McIntyre (#99),
  
  The reported number is 0.11
  
  gridproxy uses
  
  corra=0.106; %% 9000,8000,7500,4000,3000,2000
  corra2=-0.106; %% 7000
  - Deep Climate
    
    Posted Jan 24, 2009 at 2:36 PM | Permalink
    
    Re: UC (#100),
    
    Here is the complete list of correlation thresholds:
    
    corra=0.106; %% 9000,8000,7500,4000,3000,2000
    corra2=-0.106; %% 7000
    corra3=0.136; %% 6000, 5000
    
    corrb=0.337; %% same as above +1
    corrb2=-0.337;
    corrb3=0.425;
    
    “+1” (i.e. 9001 and so forth) are the decadal proxies, so corra3 and corrb3 are the two-sided limits for annual and decadal proxy significance respectively.
    
    Since three (rather than four) significant digits are used, it may be difficult to determine exactly the form of the threshold calculation by “reverse engineering” these values, as done here Steve McIntyre (#99),
    
    With a little experimenting, it looks like Mann might have used N-1 instead of N-3 in the above formula.
    
    tanh( qnorm(.9, sd=1/sqrt(N-2))) #[1]# 0.1063918
    tanh( qnorm(.9, sd=1/sqrt(N-1))) #[1]# 0.1060271 ##used
    tanh( qnorm(.9, sd=1/sqrt(N))) #[1]# 0.1056660
    
    Presumably, N-3 was not used, but whether it was N-2 (rounded down) or N-1 (almost exact match) can not be definitively determined.
Steve McIntyre

Posted Jan 24, 2009 at 2:49 PM | Permalink

As you observe, the code numbers are slightly different tho very close to this derivation.

However you can’t get .337 because the steps for N are too big and rounding won’t do it.

I presume that this must have done the calculation some Mannian way with Mannian rounding errors accumulating along the way.

Regardless, my derivation reconciles all the retentions and permits further analysis, which I’m doing.
- Deep Climate
  
  Posted Jan 24, 2009 at 3:18 PM | Permalink
  
  Re: Steve McIntyre (#102),
  Sure, it’s close enough for your purposes – no problem with that.
RomanM

Posted Jan 24, 2009 at 3:23 PM | Permalink

Re:Steve’s inline comment on RomanM (#79)

The tanh transform that you are using is an approximation and is less accurate for small degrees of freedom. It is used mainly when you wish to calculate confidence intervals or test whether the correlation might be equal to some non-zero value.

For the test of zero correlation, the statistic

t = r / sqrt((1-r^2)/df)

where df = sample size – 2 is used. This statistic is distributed exactly as a t random variable with df degrees of freedom. To calculate the critical value, you use the inversion formula which I gave in the post. A simple R script for that calculation is

#calculate upper quantiles for testing corr = 0

quant = function(alpha, df) {
tval = qt(alpha,df, lower.tail=F)
(sqrt( tval^2 /(tval^2 +df)))}

df = c(8,13,98,144)

quant(.10,df)
#[1] 0.4427959 0.3506884 0.1292418 0.1066760

quant(.05,df)
#[1] 0.5493568 0.4408608 0.1654298 0.1366643

The corresponding values from gridproxy appear to be:

alpha = .10: .425 .337 .136 .106

alpha = .05: .521 .418 .164 .128

These differences could not be due to rounding as we know it.
- Deep Climate
  
  Posted Jan 24, 2009 at 3:55 PM | Permalink
  
  Re: RomanM (#104),
  
  df = c(8,13,98,144)
  
  quant(.10,df)
  #[1] 0.4427959 0.3506884 0.1292418 0.1066760
  
  quant(.05,df)
  #[1] 0.5493568 0.4408608 0.1654298 0.1366643
  
  The corresponding values from gridproxy appear to be:
  
  alpha = .10: .425 .337 .136 .106
  
  alpha = .05: .521 .418 .164 .128
  
  These differences could not be due to rounding as we know it.
  
  I think some of the values got transposed:
  
  alpha = .10: .418 .337 .128 .106
  
  alpha = .05: .521 .425 .164 .136
  
  For example, .136 is the threshold used for the “whole” period two-sided test (i.e. df=144 with .05 two sided).
  
  See Deep Climate (#101)
  
  for the four “whole” recon parameters (df=13, df=144).
  - RomanM
    
    Posted Jan 24, 2009 at 4:24 PM | Permalink
    
    Re: Deep Climate (#109),
    
    Makes sense and the smaller values are close, but it still isn’t rounding so who knows where the values came from.
    - Deep Climate
      
      Posted Jan 24, 2009 at 4:42 PM | Permalink
      
      Re: RomanM (#112),
      
      Steve has come very close for df=144, 13, so my hunch is that his method would come closer for the sub-period cases too.
    - RomanM
      
      Posted Jan 24, 2009 at 5:55 PM | Permalink
      
      Re: Deep Climate (#113),
      
      What I have calculated is correct – this material has been my bread and butter for 40 years. If the results differ by as much as they do, particularly for the lower degrees of freedom, then what was done by Mann et al is not the same process. What did they do?
    - Deep Climate
      
      Posted Jan 24, 2009 at 6:54 PM | Permalink
      
      Re: RomanM (#114),
      
      I’m not disputing the correctness of your calculations. I’m just saying Mann et al values are quite close to the approximation that Steve used (but still not exact).
      
      tanh( qnorm(.90, sd=1/sqrt(98))) # 0.1287379
      tanh( qnorm(.95, sd=1/sqrt(98))) # 0.1646430
      tanh( qnorm(.90, sd=1/sqrt(8))) # 0.4244413
      tanh( qnorm(.95, sd=1/sqrt(8))) # 0.5237864
      
      alpha = .10: .418 .337 .128 .106
      alpha = .05: .521 .425 .164 .136
      
      On the other hand, if you add 1 to each df, you get this from your program:
      
      df = c(9,14,99,145)
      quant(.1,df) # 0.4186622 0.3382816 0.1285896 0.1063083
      quant(.05,df) # 0.5214044 0.4259020 0.1645995 0.1361950
      
      That’s even closer – so who knows?
    - RomanM
      
      Posted Jan 24, 2009 at 8:11 PM | Permalink
      
      Re: Deep Climate (#115),
      
      That’s even closer – so who knows?
      
      So who knows what? There is a right way and there are wrong ways. You don’t just “add 1 to each df” and you don’t invent new ways to calculate “principal components” because there is no valid reason to do so. Are you implying that what they have calculated these values correctly? Just a little wrong? Or did they (as you may have discovered) use the incorrect value for the degrees of freedom?
    - Deep Climate
      
      Posted Jan 24, 2009 at 9:16 PM | Permalink
      
      Re: RomanM (#116),
      The question I asked was the same as yours: “who knows where the values came from?” It’s strictly a “reverse engineering” puzzle for me and I’m not implying that the method was correct, whatever it was.
      
      As I said before, as far as I know your threshold values are correct. And so they probably either used an approximation or made some other error that gave low threshold results, especially for the low d.f. cases. So I don’t really see that we have a disagreement here.
      
      Steve: I think that you’re being careful here, but just in case: Roman is simply trying to do things in an intelligible way; don’t assume that Mannian “benchmarks” calculated intelligibly are “correct”. Roman doesn’t, nor do I. There’s much still to disentangle.
    - RomanM
      
      Posted Jan 25, 2009 at 8:10 AM | Permalink
      
      Re: Deep Climate (#117),
      
      Given your previous history as a poster, I am leery of your motives (unquestioning vindication of the team and their methods – they simply can’t be wrong in any way) and interpret everything you say in that context. If I see any admissions from you that some of the publications under consideration might have inappropriate and/or incorrectly done analyses or unjustifiable conclusions, I might see your posts in a different light.
      
      Following up on your observation that the degrees of freedom might not be correctly used, here is some R output
      
      df = c(8,13,98,144)
      floor(1000*quant(.10,1+df))/1000 #[1] 0.418 0.338 0.128 0.106
      floor(1000*quant(.05,1+df))/1000 #[1] 0.521 0.425 0.164 0.136
      
      Only one of the values is still different (.338 is out by .001). To achieve this, all of the degrees of freedom are (incorrectly) increased by one and all of the results are rounded down to three decimal places (presumably as a “conservative” approach).
      
      Let’s call .338 a “typing error” by whomever typed up the matlab script. 🙂
    - RomanM
      
      Posted Jan 25, 2009 at 9:29 AM | Permalink
      
      Re: RomanM (#120),
      
      … wait a minute! Rounding down isn’t “conserative”. It allows for slightly easier acceptance as a proxy, not the opposite.
Steve McIntyre

Posted Jan 24, 2009 at 3:45 PM | Permalink

#105. You’ve changed topics here. The estimation of uncertainties (which is a separate can of worms) has nothing to do whether there is a “significant” yield of correlated proxies and whether Mann has misrepresented the situation here.

Even if the uncertainties were done correctly (about which I’m dubious), it doesn’t justify the inclusion of Luterbacher instrumental data into the reporting of proxy yields – different topic.
- Deep Climate
  
  Posted Jan 24, 2009 at 4:12 PM | Permalink
  
  Re: Steve McIntyre (#106),
  
  OK, point taken. Another way to get at this issue would be to analyze the “pass” rate by proxy length. For example, what is the “pass” rate for proxies going back to 1000 AD? 800 AD? That could be interesting, right?
Steve McIntyre

Posted Jan 24, 2009 at 3:50 PM | Permalink

#104. Roman, you’re undoubtedly right in how they should be calculated, but you just don’t see t-stats in the Mann corpus, so I’m guessing that the values are derived from the normal approximation. But we need to keep the correct t-formula at hand in the event that we get to trying to do these things based on methods known to the ROM.
Steve McIntyre

Posted Jan 24, 2009 at 4:17 PM | Permalink

#110. Yes, that’s worth keeping in mind, but the population of 927 is interesting in itself

Is the pick two yield “significant”? is the 13% benchmark “right”? are the M08 claims a misrepresentation of the situation?
Deep Climate

Posted Jan 24, 2009 at 9:38 PM | Permalink

Off topic, but interesting from a blog management perspective – did anyone wonder about #81? The link goes to some Russian dining website. Looks like there’s a bot that took an earlier comment of pete’s, mangled the name slightly and added a link. What will those spammers think of next!
- pete
  
  Posted Jan 25, 2009 at 3:49 AM | Permalink
  
  Re: Deep Climate (#118), I certainly wondered!
Hu McCulloch

Posted Jan 25, 2009 at 9:21 AM | Permalink

RomanM asks in #93,

I have to ask why in a serious scientific study, anybody would do a “two-sided test for code 5000/6000 proxies only (relatively small number of proxies)”. If you don’t know in advance whether the correlation is positive or negative (indicating that the physical processes behind that link are poorly and/or incompletely identified), how could you assume that that relationship is specifically to temperature in a linear fashion? Black box teleconnection at its finest!

Good question. However, its equally interesting converse is that if one is confident that a type of proxy (eg treerings) should have say a positive but noisy relationship to temperature, then the distribution of coefficients over a large number of such proxies should not be symmetrical about zero, but should at least have a clear majority of positives.

Mann’s 1209proxyames SI file does not give all the correlations, before or after his size-distorting Pick2 procedure, but only the ones that passed the usually 1-tailed significance threshold. However, Steve evidently found a version of the file that did have all the Pick2 correlations, and was able to compute the raw correlations himself. He reported the results graphically in his 9/20/08 thread on the The Mann Correlation Mystery.

The following scatter plot from that thread shows the raw correlations on the horizontal axis and Mann’s Pick2 correlations on the vertical axis.

The red symbols are dendro, which are supposed to have an a priori postive relationship to temperature. In fact, they look about equally positive and negative, with perhaps a majority of negatives!

He didn’t provide a separate histogram of the raw dendro correlations, but the right side of the following diagram is a histogram of all the raw correlations.

The little bump at the right end are the indisputably positive Luterbacher instrumental series. Aside from them, there is no evidence of a predominance of positive correlations, despite the fact that the supposedly positive dendros (TRW + MXD) make up the great majority of the proxies (927/1138 non-Luterbachers).

Mann’s own data therefore now places very heavy burden of proof on him and his coauthors to demonstrate that treerings or densities have a generally positive relationship to temperature at all.

Incidentally, the upper graph and the lower left graph illustrate that Mann’s Pick2 cherry-picking procedure, although highly size distorting, did not maximize his reported correlations as much as it might have: Instead of selecting the more positive of the two correlations, he selected the correlation with the greater absolute value, and then discarded that result if it was negative. As a result, the histogram on the left is hollowed out, but still fairly symmetrical after eliminating the Luterbacher bump on the right. See also my comment #12 on that thread.
- UC
  
  Posted Jan 25, 2009 at 11:09 AM | Permalink
  
  Re: Hu McCulloch (#121),
  
  before or after his size-distorting Pick2 procedure
  
  In propick.m there is option to select pick-n ( igrid , %numbers of gridpoints near proxy site used to cal. r )
  so I can try how would rtable look with igrid=1. But first I did
  compare my Matlab (igrid=2 ) result with the one Mann has in his
  webpage. Only difference is this:
  
  rtable_myrun(:,899)
  
  ans =
  
  -5
  37.5
  5000
  -0.072659745
  -0.27887841
  -0.12528468
  -0.50476717
  
  >> rtable_website(:,899)
  
  ans =
  
  37.5
  -5
  5000
  -0.149443
  -0.51617378
  -0.13864945
  -0.55486167
  
  is this result of some lat / lon issue discussed here ?
  
  Steve: Yep. That’s the rain in Spain in Kenya.
UC

Posted Jan 25, 2009 at 12:00 PM | Permalink

Nah, not that easy when there are clear all commands everywhere and separate files for high – pick and low-pick.. Just a second..
UC

Posted Jan 25, 2009 at 12:22 PM | Permalink

Here: http://signals.auditblogs.com/files/2009/01/rtable_pick1.txt
Steve McIntyre

Posted Jan 25, 2009 at 12:54 PM | Permalink

I’d got within .001 of UC’s values for 1193 out of 1207 proxies as follows. I’ve benchmarked my calcs against rtable results and they work in the cases that I’ve studies – I’ll need to check out the other 14. This calc uses the next1, next2 allocations that I calculated a while ago and there seems to be slight differences in a few cases with UC’s Matlab. As I recall, there were some seriously weird fencepost issues in Mann’s allocations.

url=”http://data.climateaudit.org/data/mann.2008″
source(“http://data.climateaudit.org/scripts/mann.2008/utilities.txt”)
download.file(file.path(url,”mann.tab”),”temp.dat”,mode=”wb”);load(“temp.dat”)
download.file(file.path(url,”Basics.tab”),”temp.dat”,mode=”wb”);load(“temp.dat”)
instrumental.info=Basics$instrumental.info; details=Basics$details

require(“R.matlab”);require(“Rcompression”)
download.file(“http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/instrument/infilled2poles_1850.mat”,”temp.dat”,mode=”wb”)
inst=readMat(“temp.dat”)
infilled2poles=ts(inst$anngridinst,start=1850); dim(inst) #157 2592

g=function(X) ts(X[,2],start=X[1,1])
library(signal)

details$pickone.1850_1995=NA
details$pickone.1850_1995lf=NA
for (k in 1:1209) {
(index= (1:2592)[!is.na(match(instrumental.info$jones,unlist(details[k,”next1″]))) ] )
#locates mannid for next1, next2 for proxy #1486 1487
x=g(mann[[k]]);y= infilled2poles[,index]
test=ts.union(x,y )
temp=(time(test)>=1850)&(time(test)=1850) &(time(x)< =1995)
temp1=(time(x)=1850)&(time(test)< =1995)
details$pickone.1850_1995lf[k]= cor(test[temp,],use=use0)[1,2]
}

test=read.table("http://signals.auditblogs.com/files/2009/01/rtable_pick1.txt&quot😉
test=t(test)
cor(test[,4],details$pickone.1850_1995) # 0.9985831
plot(test[,4],details$pickone.1850_1995)
temp= abs(test[,4]-details$pickone.1850_1995)<.001
sum(temp)
#[1] 1193
data.frame( details[!temp,c("id","pickone.1850_1995")],test[!temp,4])
id pickone.1850_1995 test..temp..4.
308 cana156 -0.05984178 0.074809533
395 fisher_1994_agassiz 0.16454857 0.258134410
396 fisher_1995_agassiz 0.04978613 -0.065326351
405 fisher_2002_plateau -0.20314180 -0.002479786
576 meese_1994_gisp2 0.01949094 0.061961720
641 mongolia-darrigo -0.03435543 0.091591579
799 nv037 -0.07755463 -0.040603843
897 qian_2003_yriver -0.18205392 -0.154840220
899 rodrigo_1999_andaluciarainual -0.12687449 -0.048791467
944 schweingruber_mxdabd_grid28 0.06002927 0.253837030
966 schweingruber_mxdabd_grid5 0.10559929 0.075073236
977 schweingruber_mxdabd_grid6 0.09591880 0.257575630
980 schweingruber_mxdabd_grid63 0.32504908 0.167564870
987 schweingruber_mxdabd_grid7 0.21058567 0.156203550
1017 smith_2004_thicknessmm 0.08221941 0.137785490
1055 thompson_2003_dasuopu 0.19079018 0.032087830
davidc

Posted Jan 25, 2009 at 9:39 PM | Permalink

Anyone who has come across the comments from the AGW camp that only material that has been peer reviewed should be taken seriously should try and get hold of a copy of some of those peer reviews. And maybe post them, for comparison with the level of discussion that goes on here. I think it would be acceptable as leading to an audit of the climate auditors (reviewers).

I have never seen a review by a peer in Climate Science but I have no reason to think it is vastly superior to other fields. From my experience the depth and rigour of the analysis on CA is orders of magnitude greater than in the usual peer review process.

This is as it should be. A typical scientific paper has little impact (to be cited 3 or 4 times by other papers in the field is generally considered good). There is little point in an anonymous reviewer putting a great deal of effort into the details. A typical good review (leading to publication) says thing like interesting work, advances the field, a few mistakes but nothing gross and a few key references omitted (which narrows the range of possible reviewers). The real test of the paper comes if it attracts attention sufficient to make it worthwhile by the wider peer group to engage in detailed criticism (which is itself peer reviewed and published). This last step does not appear well developed in Climate Science, for reasons everyone understands.

But if anyone without personal experience of peer review thinks that what goes on is something like what you read on CA, no, nothing remotely like it.

I call on Michael Mann to make his peer reviews avaiable on CA.
Hu McCulloch

Posted Jan 26, 2009 at 8:59 AM | Permalink

Re my #97 and Steve’s inline reply,

I’ve now taken a look at Luterbacher’s paper (Science 2004 vol. 303 pp. 1499-1503, with Dietrich, Xoplaki, Grosjean and Wanner), and now concur with Steve that these series should not have been used in Mann 08.

Can anyone recommend a good recipe for roast crow? 😉

This is not to say that Luterbacher’s Figure 1C is not necessarily a valid multiproxy reconstruction of European temperatures back to 1500, to within its large CI. Nor that some or even all of his raw data might not be valid inputs to a study like Mann’s. The problem is that the entire period 1900-present is not a reconstruction at all, but merely the CRU and/or GISS data with which the earlier period was calibrated. Furthermore, the period 1750-1900 has a much smaller standard error than the period 1500-1750, because fewer good proxies are available in the earlier period. The proxies themselves might be valid inputs to another multiproxy study, and the overall reconstruction itself might validly be averaged into other, independent reconstructions, but the 71 Luterbacher local reconstructions cannot be treated as if they were proxies in their own right.

Luterbacher mentions that there is one instrumental record going back to 1659, and several back to the mid-18th century. Since these are not as comprehensive as the post-1850 or post-1900 data, they cannot just be averaged to create a precise instrumental index for all of Europe. However, there is no reason these shouldn’t just be treated as proxies in their own right, i.e. noisy observations on the average temperature. In fact, their calibration might not even require estimation of a slope term, since their slope relative to the comprehensive instrumental data in the calibration period is presumably unity.

Other Luterbacher input series, like historical records of cherry blooms and canal freezes, go back further, and may be useful proxies in their own right. Luterbacher’s SI unfortunately doesn’t tabulate the raw data, or even his local or continental reconstructions, but it does give a long list of reference sources that may have the actual data, if only in hard copy form.

Steve’s diagrams copied in #121 above illustrate the anomalous behavior of the Luterbacher series — The black symbols in the top graph are Luterbacher, and naturally have a much higher correlation with instrumental data in the 1850-present period, since they are the instrumental data in the post-1900 period, and presumably draw heavily on it in the 1850-1900 period. They account for the anomalous bumps on the right of the histograms in the lower figure.

Their high correlation with the instrumental data definitely cannot be taken as representative of the “skill” of the remainder of the Mann08 1209 “proxy” data set.

It is most unfortunate that a journal named “Science” would publish an article like this without at least requiring the authors to provide their numerical results! An electronic tabulation of the input data would even further advance the cause of true science (with a small s), even if the numbers can all be found in the 51 references cited in the SI.

Incidentally, the exponent in the formula I gave for the Šidàk size correction that I gave back in comment #96 didn’t come through correctly in the post, even though it looked OK on the preview window. Using ^ for exponentiation, the formula I gave that came out
1-.91/n = .0000871
should instead have been
1-.9^(1/n) = .0000871
Steve McIntyre

Posted Jan 26, 2009 at 12:10 PM | Permalink

#128. Including Luterbacher in the proxy stats is like using loaded dice. To my knowledge, no climate scientist has objected.
UC

Posted Jan 26, 2009 at 1:10 PM | Permalink

Team’s choice for H0 seems to be AR(1) noise , but I tried how 1/f – like noise ( as discussed in here and here ) would do, correlations for original proxy data and 1/f-noise :

Quite interesting, I’ll try to clean and put this somehow online, when I have more time. ( Note that tree rings are one example in here )
Hu McCulloch

Posted Jan 26, 2009 at 2:20 PM | Permalink

Re #130,
Thanks, UC — I’ve learned a lot that I’ve long wished I knew about 1/f processes from your posts here on CA!

What is irfo? Is it one of Mann’s proxies? Is the big blip around observations 500-570 for real, or a typo? Even a 1/f process would have a hard time explaining this!

There are two separate issues here, both important: Are the series themselves, ie proxies and temperatures, 1/f? And, are the errors in a regression of proxy on temperature 1/f?

That being said, I’d be thankful if the Team would just adjust their stats for AR(1) serial correlation, however primitively. Mann08 do at least acknowledge some serial correlation, but it seems unlikely that the small adjustment of test sizes they make from .10 to .13 works for the regression errors implied by all or even most of their annual series.
- UC
  
  Posted Jan 26, 2009 at 2:50 PM | Permalink
  
  Re: Hu McCulloch (#131),
  
  Hu,
  
  info is a file propick.m writes to
  
  save(‘/holocene/s1/zuz10/work1/temann/data/info’,’PPP’,’-ascii’)
  
  and it includes pick2-ed correlations ( written to rtable1209 later in MAKEPROXY.m ) . Not easy to follow, for sure 😉
  
  I guess the flip is due to instrumental “proxies” ( Steve, is this correct ? )
Steve McIntyre

Posted Jan 26, 2009 at 2:29 PM | Permalink

Hu, Mann’s AR1 analyses are an entire new rabbit hole. I’ll post something up on them.
Steve McIntyre

Posted Jan 26, 2009 at 3:57 PM | Permalink

#133. I’m not sure what the question is. In the figure in #130, the plateau of high correlations are the loaded dice from Luterbacher. (In the online R data frame details.tab, I’ve collated odds and ends of proxy info strewn in various places in the Mann SI. details$id[500:600] shows that these series are Luter series.
Sam Urbinto

Posted Jan 26, 2009 at 5:00 PM | Permalink

Perhaps things would be clarified with showing how population levels lead to automobiles which lead to increased levels of carbon dioxide in the atmosphere.
UC

Posted Jan 27, 2009 at 12:19 PM | Permalink

Hu,

Is the big blip around observations 500-570 for real, or a typo?

These are sample correlations of 1/f noise vs. grid-box temperatures.

Steve,

In the figure in #130, the plateau of high correlations are the loaded dice from Luterbacher.

OK. So real proxies are not that significant, if you take 1/f as benchmark. Like I said, I’ll need some time to write this out.

Trying to go through relevant journals ( Technometrics, Biometrika, I don’t like Science or Nature anymore 😉 ) on this topic,

Slowly decaying serial correlations have, if ignored, disastrous effects on classical tests and confidence intervals.

..

Ad hoc methods like smoothing and detrending usually do not lead to satisfactory results.

(from A test of location for data with slowly decaying serial correlations, J. Beran, Biometrika (1989), 76, 2, pp. 261-9 )

Compare with Mann’s:

For the decadally resolved proxies, the effect is negligible because the decadal time scale of the smoothing is long compared with the intrinsic autocorrelation time scales of the data.

So the question is, are there slowly decaying correlations in proxy series ?
andymc

Posted Jan 27, 2009 at 1:00 PM | Permalink

Re Luterbachers proxy,

I arrived at work the other day only to discover that the central heating didn’t come on. It turned out that the contractors we had in at the weekend switched off the power while they were working but forgot to switch it back on when they left. I digress. The interesting bit is that in my search for finding out why the heating wasn’t working I came across a glass capillary about 9 inches long. It was mounted vertically on the wall and it appeared to contain a red liquid that extended about one quarter way up the capillary. The following day, the heating had been fixed, but this time I noticed that the red liquid was much further up the tube. I reckoned that it must have been due to the differential expansion of the glass capillary and the red liquid contained within; since the glass expanded less than the liquid, the liquid was observed to rise up the capillary as it got hotter.
Maybe such a device would prove useful as a temperature proxy.

Andy
- DeWitt Payne
  
  Posted Jan 27, 2009 at 1:51 PM | Permalink
  
  Re: andymc (#137),
  
  Good one.
  
  Too many people seem to want to diss models and proxies in general as somehow not being real, but in fact we use them every day. We measure the volume of a fixed mass of mercury or methanol or the curvature of a strip of two different metals and convert it to the abstract concept we call temperature with a mathematical model called a calibration. We have mechanical devices that count swings of a pendulum or electronic devices that count vibrations of a quartz crystal and we call them clocks and we use them as proxies for the abstraction we call time. The real question is how good are the proxies and how good are the models. That’s where statistics comes in.
  - Mark T.
    
    Posted Jan 27, 2009 at 2:00 PM | Permalink
    
    Re: DeWitt Payne (#139), The type of “proxy” that you used is a poor analogy to the types of proxies being discussed herein. A thermometer “proxy” results from a known, testable, physical relationship. Using statistics to validate a proxy is a circular argument.
    
    Mark
Steve McIntyre

Posted Jan 27, 2009 at 1:18 PM | Permalink

#136. How retro. Of course, you’ll have to develop some kind of digital device to figure out what it means.
Sam Urbinto

Posted Jan 27, 2009 at 2:47 PM | Permalink

Proxies are measured variables to infer the values of variables of interest. Thermometers, scales, clocks, voltmeters and the like aren’t proxies.

Although the idea of a scale as a proxy of the force of gravity is a rather funny idea.

But we know what was meant. We don’t measure time (whatever that is) we measure the known quality of the vibration of quartz.

But of course the general rule is that things that are measured themselves for their own values (measured variables of interest) are not proxies.
- DeWitt Payne
  
  Posted Jan 27, 2009 at 4:18 PM | Permalink
  
  Re: Sam Urbinto (#141),
  
  The simplest form of a gravitometer is in fact a weight on a spring. They used them in the oil patch in Texas. I saw one hooked up to a strip chart recorder and you could see the change in local g from the influence of the moon as the earth rotated. For more precise measurement, one drops a weight in a very high vacuum and measures the position as a function of time with a laser interferometer.
  
  Re: Mark T. (#140),
  
  If I want to measure the concentration of iron by emission spectrometry, I validate the linear model to convert emission intensity to concentration over a designated range of concentrations using statistics by measuring several replicates at several different concentrations and calculating the correlation coefficient of the OLS fit of intensity to concentration. I also validate the method by determining that the concentration I measure in a certified reference material is within the error bounds of the certification. How is that a circular argument.
  - Mark T.
    
    Posted Jan 27, 2009 at 5:07 PM | Permalink
    
    Re: DeWitt Payne (#142), W.r.t. your reply to Sam, yes, I agree that you are indirectly measuring one thing to provide information on something else. However, there is a testable equation to relate the one measurement to the other. In order for the proxy to hold up, however, there needs to be a known relationship between the measured and inferred variables.
    
    As to your reply to me, apples and oranges and, ultimately, a strawman argument. Again, you are using as an analogy a physical relationship, albeit one with a statistical distribution. That’s not even remotely what I am referring to when I said “using statistics to validate a proxy is a circular argument,” and the context of this thread should have made that obvious. Mann is using ex post statistical analyses of his proxies to determine which ones he should use, before proving they are proxies for temperature in the first place (as Steve noted in#143). By doing so he necessarily removes any proxies that would serve to invalidate his results, which creates the circular argument.
    
    Mark
    - Craig Loehle
      
      Posted Jan 27, 2009 at 8:42 PM | Permalink
      
      Re: Mark T. (#144), Yes.
    - DeWitt Payne
      
      Posted Jan 27, 2009 at 11:05 PM | Permalink
      
      Re: Mark T. (#144),
      
      I am in no way defending Mann’s or anyone else’s abuse of proxies. I’m objecting to the blanket condemnation of models as somehow not being real.
      
      As far as known physical relationships or testable equations being required, Galileo had no idea why a pendulum has a nearly constant period. He observed it empirically by counting his heartbeats and used it as a proxy for time in his experiments. The theoretical explanation requires a knowledge of Newton’s laws of motion and gravity, which came later.
    - Mark T
      
      Posted Jan 28, 2009 at 12:51 AM | Permalink
      
      Re: DeWitt Payne (#146),
      
      I’m objecting to the blanket condemnation of models as somehow not being real.
      
      I’m objecting to your blanket assumption that complaints about proxies are based on an incomplete understanding of what a proxy is or that such complaints assume all proxies are unphysical. This is simply not true. The complaints about many of the proxies are well-founded, actually. Certainly one would expect some proxies to hold verifiable relationships with temperature, but the ones that seem to “do the trick” for the Team do not.
      
      Galileo had no idea why a pendulum has a nearly constant period. He observed it empirically by counting his heartbeats and used it as a proxy for time in his experiments. The theoretical explanation requires a knowledge of Newton’s laws of motion and gravity, which came later.
      
      Yes, notice the little bit out how he “observed it emprically,” again, apples and oranges. The proxies the people protest have never been shown to be proxies empirically, they are simply assumed to be, then used as if they are. There is no attempt at proving so, even after Mann said in 1998 that in the (unlikely) event that the tree rings don’t represent temperature, then none of the following work holds. He then left it at that, apparently assuming putting “unlikely” in there made it all good. Now we’ve got him/others ex post picking based on correlation, which is what he needs to empirically prove FIRST. It is shameful. Until someone proves (or demonstrates reasonably) the relationship between a proxy and temperature, its use in a reconstruction is completely unfounded, correlation or no. Likewise, saying that a reconstruction using an unproven proxy is an unphysical representation is a valid argument.
      
      Look back at my reply, btw, and you’ll note that I did not mention models (which is, IMO, another fish to fry). I intentionally restricted my scope to your use of the proxy analogy, which was an incorrect analogy. I think, though I cannot verify, that andymc’s original comment was sarcastic.
      
      Mark
    - DeWitt Payne
      
      Posted Jan 28, 2009 at 10:49 AM | Permalink
      
      Re: Mark T (#147),
      
      The proxies the people protest have never been shown to be proxies empirically
      
      I’m not sure what you mean here. Are you talking about tree rings in general or a specific example of a clearly bad proxy like bristlecones? Besides, how does this conflict with my original statement:
      
      The real question is how good are the proxies and how good are the models.
      
      Just because I give examples of very good proxies does not mean I think that calling something a proxy makes it so.
      
      And here is an example of the attitude which I find objectionable. Disclaimer: I am in no way endorsing the conclusion of the Caswell Emperor Penguin extinction paper.
Steve McIntyre

Posted Jan 27, 2009 at 4:44 PM | Permalink

Quite so – we do use “proxies” all the time. Brown and Sundberg describe a calibration theory. The problem with Mannian stuff is not the idea of a proxy, but that that he just calls things proxies. You don’t use the weight of the spectrometer as a “proxy” for iron concentration.
Mark T.

Posted Jan 28, 2009 at 11:17 AM | Permalink

Within just a few posts you’ve already lost the context of your own argument.

In (#139), you said:

Too many people seem to want to diss models and proxies in general as somehow not being real, but in fact we use them every day.

which is utter BS. People are not dismissing “proxies in general,” they are dismissing temperature proxies and, more specifically, tree-ring proxies with good reason. Your analogies to physically verifiable measurements are irrelevant and more correctly, a Faulty Analogy.

Just because I give examples of very good proxies does not mean I think that calling something a proxy makes it so.

No, what you have done with your examples is to state “see, proxies are useful,” when in fact, nobody is saying proxies aren’t useful or real. They are saying that proxies need to be proven to be proxies to be useful as real tools, and you provide examples of proxies that have already been proven as if that gives you some comeuppance. It doesn’t. It makes you look as if you’re trying hard to side with warmers so they will see you as objective.

Mark
- DeWitt Payne
  
  Posted Jan 29, 2009 at 12:30 PM | Permalink
  
  Re: Mark T. (#149),
  
  You quote me and then always ignore the more important part of it in your argument. Suppose I had left “and proxies” out of the original statement or replaced “and” with “using”. Do you still disagree with such apparent hostility?
  
  I’m assuming that you indirectly answered my question about tree rings in general as temperature proxies in the negative. There is in fact good reason to believe that tree rings could be temperature proxies. Temperature does affect growth rate, or do you contest that? The problem with constructing temperature time series from tree ring widths or densities are mostly model related, including among other things, confounding variables like precipitation and whether the assumption that growth rate is a linear, or at least monotonic, function of temperature is valid.
  - Mark T.
    
    Posted Jan 29, 2009 at 2:07 PM | Permalink
    
    Re: DeWitt Payne (#151),
    
    You quote me and then always ignore the more important part of it in your argument. Suppose I had left “and proxies” out of the original statement or replaced “and” with “using”. Do you still disagree with such apparent hostility?
    
    But you didn’t leave them out, and I don’t “always ignore” anything, not particularly in reference to anything herein. My argument has been very to the point, but you simply cannot see it. You are attempting to leave an impression of objectivity by saying that everyone calling models or proxies unreal has no clue, then you proceed to provide analogies that aren’t even relevant – those calling the proxies (and even models) are much more correct than you, whether you want to believe it or not. You don’t even understand your own ignorance. There is no hostility, DeWitt, just fatigue listening to you spout off about how smart you are while repeatedly failing minimal tests of your actual knowledge. Get off your kick of omnipotent intelligence, it is simply not true.
    
    Mark
    - DeWitt Payne
      
      Posted Jan 29, 2009 at 3:43 PM | Permalink
      
      Re: Mark T. (#153),
      
      You are attempting to leave an impression of objectivity by saying that everyone calling models or proxies unreal has no clue
      
      Strawman argument. I never called everone who calls a model unreal (useless would have been a better choice of word) as being clueless for that reason alone. All models are by definition unreal. The map is not the territory. But that doesn’t mean a map is always useless. That also doesn’t mean that all maps must be correct models of the territory either. I have an extremely poor opinion of the utility of climate models for projecting the future climate and The Team’s paleoclimate reconstructions as representing the actual past temperature with any accuracy or precision. It’s everyone who dismisses the utility of all models. And I’m beginning to see that entirely too much, as in the link above and elsewhere on this site.
      
      I admit I have a strong tendency to pontificate. It’s not intentional.
    - Craig Loehle
      
      Posted Jan 30, 2009 at 7:31 AM | Permalink
      
      Re: Mark T. (#153), A little too harsh Mark. If you two want to have a flame war exchange email addr.
Sam Urbinto

Posted Jan 28, 2009 at 1:56 PM | Permalink

Yes, proxies can be useful if they’ve been established to actually infer the variable of interest. My point is that if the timed vibrations of a piezoelectric or the effects of gravity on a fixed weight are the variables of interest themselves, they aren’t proxies. But in the other respect, everything’s a “proxy”, even the well known relationship between applied voltage and mechanical deformation.

Seems a rather pointless discussion. 🙂
andymc

Posted Jan 29, 2009 at 1:35 PM | Permalink

Re proxies,

Yes, I was being sarcastic. This was prompted by Mann’s questionable practice of using Luterbacher’s thermomoeter data as a “proxy” in order to increase the numbers passing elementary screening. And he does all this before getting down to using high correlation infilled “data”.
I’m not a statistician but I can’t help but suspect that it might be easier to work out what methods were employed if you start at the end and work backwards. That’s what I did in chemistry practicals at school…
Perhaps those with a statistical bent can clarify matters for laymen like me. Maybe some kind of list in order to rank the proxies.
I’ve made a start.

The best temperature proxies are:
Thermometer 100%
Made up (i.e. infilled) data 88%
…

Andy
- Mark T.
  
  Posted Jan 29, 2009 at 2:09 PM | Permalink
  
  Re: andymc (#152), Yes, I figured as much.
  
  Mark
Mark T.

Posted Jan 30, 2009 at 9:55 AM | Permalink

I tell you what, he drops the pompous attitude, and I get nice again. He can’t even follow the argument HE made. It is tiring and this is not the only instance. People like DeWitt want to seem generally knowledgeable, smarter than the rest of us, and cannot understand their own failings when they dip into areas they don’t understand.

DeWitt, the fact remains that you said:

Too many people seem to want to diss models and proxies in general as somehow not being real, but in fact we use them every day.

Then followed it up with ONLY PROXY ANALOGIES, which were, as I have explained now 4 times, faulty. You used examples that have empirically testable outcomes, which is NOT the case for the proxies that people are “diss”ing. Not only do those of us that understand what a proxy is, we also understand why the proxies that we are “diss”ing ARE unreal.

Mark
- Kenneth Fritsch
  
  Posted Jan 30, 2009 at 11:06 AM | Permalink
  
  Re: Mark T. (#157),
  
  Mark, DeWitt provides readers of this blog with a lot of good explanations of the science of the atmosphere and other climate related fields. He comes across as believable by not going a bridge too far in his conclusions.
  
  When he says that temperature could be reconstructed by way of proxy but sees major weaknesses and errors in currently espoused temperature reconstructions that means more to me than if someone generalized against modeling and/or temperature reconstructions and by inference that the current models/reconstructions must be wrong.
  
  DeWitt has given evidence why the simple atmospheric models without the feedback effects could be accurate and provide useful insights. He has indicated that we should move on to those areas where the evidence and theory are much less settled.
  
  Steve M has commented about his problem with people going a bridge too far in their comments at CA and what it does to the perception of the blog (rightly or wrongly) by outsiders.
  
  I am not sure what this argument is about and I certainly am not implying that you have gone a bridge too far. I guess I am simply saying that I agree for the most part with DeWitt’s approach.
Sam Urbinto

Posted Jan 30, 2009 at 3:15 PM | Permalink

Conflicting methods, incompatible approaches, a way to explain something from different vantages that seem opposite?

So there’s this spoon and various folks are describing it; handle from both sides, bowl from both sides, looking at it from the bottom of handle or top of bowl.

Sometimes, to paraphrase Freud, a spoon is just a spoon.

2 Trackbacks

By SNR Estimates of Temperature Proxy Data « the Air Vent on Aug 19, 2010 at 7:21 AM

[…] Climate Audit, because that’s where I happened to be reading when I ran across this point. More on voodoo correlations. Although 484 (~40%) pass the temperature screening process over the full (1850–1995) calibration […]
By Signal to Noise Ratio Estimates of Mann08 Temperature Proxy Data « Climate Audit on Aug 19, 2010 at 9:50 AM

[…] at Climate Audit, because that’s where I happened to be reading when I ran across this point. More on voodoo correlations. Although 484 (~40%) pass the temperature screening process over the full (1850–1995) calibration […]

Climate Audit