Neukom and Gergis Serve Cold Screened Spaghetti

Neukom, Gergis and Karoly, accompanied by a phalanx of protective specialists, have served up a plate of cold screened spaghetti in today’s Nature (announced by Gergis here).

Gergis et al 2012 (presently in a sort of zombie withdrawal) had foundered on ex post screening. Neukom, Gergis and Karoly 2014 take ex post screening to a new and, shall we say, unprecedented level. This will be the topic of today’s post.

Data Availability
As a preamble, the spaghetti is cold in the sense that the network of proxies is almost identical to the proxy network of Neukom and Gergis 2012, which was not archived at the time and which Neukom refused to provide (see CA here). I had hoped that Nature would require Neukom to archive the data this time, but disappointingly Neukom once again did not archive the data. I’m reasonably optimistic that Nature will eventually require Neukom to archive the data, but the unavailability of the data when the article is released restricts commentary significantly. I’ve written to Nature asking them to require Neukom and Gergis to archive the data. (April 1 – an archive has been placed online at NOAA).

Wagenmakers’ Anti-Torture Protocol
In the wake of several social psychology scandals, there has been renewed statistical interest in the problem of “data torture”, for example by Wagenmakers (for example, here and here).

Wagenmakers observes that “data torture” can occur in many ways. He is particularly critical of the ad hoc and ex post techniques that authors commonly use to extract “statistically significant” results from unwieldy data. Ex post screening is an example of data torture. Wagenmakers urges that, for “confirmatory analysis”, authors be required to set out a statistical plan in advance and stick to it. He acknowledges that some results may emerge during analysis, but finds that such results can only be described as “exploratory”.

Wagenmakers’ anti-torture protocol not only condemns ex post statistical manipulations (including ex post screening), but excludes data used in the formulation of a hypothesis, from confirmatory testing of the hypothesis. In other words, Wagenmakers anti-torture protocol would exclude proxies used to develop previous Hockey Sticks and restrict confirmation studies to the consideration of new proxies. This would prevent the use of the same data over and over again in supposedly “independent” studies – a paleoclimate practice long criticized at CA.

In my own examination of new multiproxy reconstructions, I tend to be most interested in “new” proxies. It would be a worthwhile exercise in each new reconstruction to clearly show and discuss the “new” proxies – which are the only ones that pass Wagenmakers’ criteria.

Ex post (after the fact) screening is a form of data torture long criticized at climate blogs (CA, Jeff Id, Lucia, Lubos, though not previously under the term “data torture”), but widely accepted by IPCC scientists. It was an issue with Gergis et al 2012 and is again with Neukom, Gergis and Karoly 2014.

Gergis et al 2012 had stated that they had mitigated post hoc screening by de-trending the data before correlation. However, as Jean S observed at Climate Audit at the time, they actually calculated correlations on non-detrended data. Jean S observed that almost no proxies passed screening using the protocol reported in the article itself.

Gergis, encouraged by Mann and Schmidt, tried to persuade the journal that they should be allowed to change the description of the methodology to match their actual calculations. However, the journal did not agree. They required Gergis and Neukom to re-do their calculations using the stated methodology and to show that any difference in protocol “didn’t matter.” Unfortunately for Gergis and Neukom, it did matter. They subsequently re-submitted, but two years later, nothing has appeared.

In their new article, Neukom and Gergis are once again back in the post hoc screening business, and their screening is more elaborate than ever.

They stated that their network consisted of 325 “records”:

The palaeoclimate data network consists of 48 marine (46 coral and 2 sediment time series) and 277 terrestrial (206 tree-ring sites, 42 ice core, 19 documentary, 8 lake sediment and 2 speleothem) records [totalling 325 sites] (details in Supplementary Section 1)…

Some of the 206 tree ring sites are combined into “composites” of nearby sites: their list of proxies in Supplementary Table 1 contains 204 records and it is these 204 records that are screened.

Once again, they claimed that their screening was based on local correlations with detrended data, a screening that reduced the network to 111 proxies (54% of the 204 screened records). From the Methodology section of the article:

Proxies are screened with local grid-cell temperatures yielding 111 temperature predictors (Fig. 1) for the nested multivariate principal component regression procedure.

and in the SI:

The predictors for the reconstructions are selected based on their local correlations with the target grid…

Later in the SI, they state that detrended data was used for the local correlation:

Both the proxy and instrumental data are linearly detrended over the 1911-1990 overlap period prior to the correlation analyses. Correlations of each proxy record with all grid cells are then calculated for the period 1911-1990.
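The screening correlation is thus supposed to be computed on residuals from a straight-line fit over 1911-1990. A minimal sketch of that detrending step (my own paraphrase of the SI description, not their code):

```python
def detrend(y):
    """Remove the ordinary-least-squares straight-line fit from a series."""
    n = len(y)
    tbar = (n - 1) / 2.0
    ybar = sum(y) / n
    stt = sum((t - tbar) ** 2 for t in range(n))
    slope = sum((t - tbar) * (v - ybar) for t, v in zip(range(n), y)) / stt
    return [v - (ybar + slope * (t - tbar)) for t, v in zip(range(n), y)]

# sanity check: a pure linear trend detrends to (numerically) zero
residual = detrend([1911 + 0.5 * t for t in range(80)])
print(max(abs(v) for v in residual))
```

Both proxy and grid-cell temperature would be passed through this before correlating; as discussed below, the issue is whether this was actually done.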

Jean S determined that only a few proxies in the network of Gergis et al 2012 (which contributes to the present network) passed a screening test using detrended data.

So how did Neukom and Gergis 2014 get a 54% yield?

Watch how they calculated “local” correlation. Later in the SI, they say (for all non-Antarctic proxies):

We consider the “local” correlation of each record as the highest absolute correlation of a proxy with all grid cells within a radius of 1000 km and for all the three lags (0, 1 or -1 years). A proxy record is included in the predictor set if this local correlation is significant (p<0.05). … Significance levels (5% threshold) are calculated taking AR1 autocorrelation into account (Bretherton et al., 1999).

Mann et al 2008 had improved their screening yield by a “pick two” methodology. Neukom and Gergis go far beyond that by comparing to grid cells within 1000 km and three lags. As I understand this, they picked the “best” correlation from several dozen comparisons. One wonders how they calculated “significance” in such a calculation (not elaborated in the article itself). Unless their benchmarks allowed for the enormous number of comparisons, their “significance” calculations would be incorrect.
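To see how much a best-of-many selection can inflate the yield, here is a toy simulation (pure white noise on both sides; the 10-cell, 3-lag count is an assumption for illustration, and the comparisons are treated as independent, which spatially correlated grid cells are not):

```python
import math
import random

random.seed(0)

def corr(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

n_years = 80             # 1911-1990 calibration window
n_comparisons = 10 * 3   # assumed: 10 cells within 1000 km, lags -1/0/+1
n_proxies = 500
r_crit = 1.96 / math.sqrt(n_years)   # naive large-sample 5% cutoff for ONE test

passed = 0
for _ in range(n_proxies):
    proxy = [random.gauss(0, 1) for _ in range(n_years)]
    # "local" correlation = best |r| over all cells and lags
    best = max(abs(corr(proxy, [random.gauss(0, 1) for _ in range(n_years)]))
               for _ in range(n_comparisons))
    if best > r_crit:
        passed += 1

print(f"{100 * passed / n_proxies:.0f}% of pure-noise proxies pass the naive 5% test")
```

On these assumptions, far more than 5% of noise proxies “pass”; spatial correlation between cells would reduce, but not eliminate, the inflation.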

The above procedure is used for non-Antarctic proxies. For Antarctic proxies, they say:

Proxies from Antarctica, which are outside the domain used for proxy screening, are included, if they correlate significantly with at least 10% of the grid-area used for screening (latitude weighted).

At present, I am unable to interpret this test in operational terms.

With a radius of 1000 km, 54% of the proxies passed their test (111). With a reduced radius of 500 km, the yield fell to 42% (85 proxies). The acceptance rate for corals was about 80% and for other proxies was about 50% (slightly lower for ice cores).

Among the “long” proxies (ones that start earlier than 1025, thus covering most of the MWP), 9 of 12 ice core proxies were rejected, including isotope records from Siple Dome, Berkner Island, EDML Dronning Maud. The only “new” passing ice core record was a still unpublished Law Dome Na series (while Na series from Siple Dome and EDML did not “pass”).

Of the 5 long tree ring series, only Mt Read (Tasmania) and Oroko Swamp NZ “passed”. These are not new series: Mt Read has been used since Mann et al 1999 and Jones et al 1998, while Oroko was considered in Mann and Jones 2003. Both were illustrated in AR4. Mann and Jones 2003 had rejected Oroko as not passing local correlation, but it “passes” Neukom and Gergis with flying colors. (The Oroko version needs to be parsed, because at least one version spliced instrumental data due to recent logging disturbance.) Not passing were three South American series, among them Rio Alerce, a series used in Mann et al 1998-99.

None of the “documentary” series cover the medieval period, but calibration of these series is idiosyncratic to say the least. Nearly all of these series are direct measures of precipitation. SI Table 4 shows that these series end in the late 20th century, but a footnote to the table says that the 20th century portion of nine (of 19 series) is projected, citing earlier publications of the same authors for the projection method.

The documentary record ends in the 19th or early 20th century and was extended to present using “pseudo documentaries” (see Neukom et al. 2009 and Neukom et al. 2013)

They “explain” their extrapolation as follows:

Some documentary records did not originally cover the 20th century (Supplementary Table 4). In order to be able to calibrate them, we extend them to present using the “pseudo documentary” approach described by Neukom et al. (2009; 2013). In this approach, the representative instrumental data for each record are degraded with white noise and then classified into the index categories of the documentary record in order to realistically mimic its statistical properties and not overweight the record in the multiproxy calibration process. The amount of noise to be added is determined based on the overlap correlations with the instrumental data. In order to avoid potential biases by using only one iteration of noise degrading, we create 1,000 “pseudo documentaries” for each record and randomly sample one realization for each ensemble member (see below, Section 2.2). All documentary records are listed in Supplementary Table 4.
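As best I can tell from this description, the degrading step works roughly as follows (a sketch on my reading; the unit-variance assumption, the category count and the equal-count binning are my own assumptions, since the paper’s category definitions are not given):

```python
import math
import random

random.seed(0)

def pseudo_documentary(instrumental, target_r, categories=5):
    """Degrade a unit-variance instrumental series with white noise so that
    its expected correlation with the original is roughly target_r, then bin
    the result into documentary-style index categories (equal-count bins)."""
    # for z = x + e with var(x) = 1 and var(e) = s^2, corr(z, x) = 1/sqrt(1 + s^2)
    s = math.sqrt(1.0 / target_r ** 2 - 1.0)
    noisy = [v + random.gauss(0, s) for v in instrumental]
    ranked = sorted(noisy)
    n = len(noisy)
    return [min(categories - 1, ranked.index(v) * categories // n) for v in noisy]

# toy "instrumental" series: 80 years of unit-variance noise
temp = [random.gauss(0, 1) for _ in range(80)]
pseudo = pseudo_documentary(temp, target_r=0.5)
print(pseudo[:10])
```

Repeating the noise draw 1,000 times and sampling one realization per ensemble member, as they describe, would then give the ensemble of “pseudo documentaries”.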

For Tucuman precipitation, one of the documentary records so extended, they report a “local” correlation (under the idiosyncratic methods of the article) of 0.43 and a correlation to SH temperature of 0.37. This was a higher correlation than all but one of the documentary indices with actual 20th century data.

Comparison to PAGES2K
The present dataset is closely related to data used for the South American and Australasian regional PAGES2K reconstructions used in IPCC AR5. I previously discussed the PAGES2K South American reconstruction here, pointing out that it had used the Quelccaya O18 and accumulation data upside down relative to the orientation employed by specialists and to Thompson’s own reports. I also discussed Neukom’s South American network in the context of the AR5 First Draft here.

Neukom et al 2014 is non-compliant with Wagenmakers’ anti-torture protocol on several important counts, including its unprecedented ex post screening and its reliance on the same proxies that have been used in multiple previous studies.

I have some work on SH proxies in inventory, some of which touches on both Neukom and Gergis 2014 and PAGES2K, and will try to write up some posts on the topic from time to time.


  1. Mark Lewis
    Posted Mar 31, 2014 at 10:42 AM | Permalink

Steve, you keep torturing us with all this talk about sharing the source data – when it is obvious (like with Lewandowsky’s data) that you will just use it to criticize their article. This is politics Steve, not science. With a best of 100 ratio to “verify” your data sources – you can get policies enacted and politicians elected!

    I wonder how Nature will respond? (he says with naive hope.)

    Steve: I’m pretty sure that Nature will require them to archive their data. We’ll see.

    • Steve McIntyre
      Posted Mar 31, 2014 at 12:32 PM | Permalink

      A Nature editor responded as follows:

      I have contacted the authors requesting the data and will be in touch soon

      • Mark Lewis
        Posted Mar 31, 2014 at 9:56 PM | Permalink

        Fingers crossed – perhaps it will serve as a model for the future. I bet it would be a boost for Nature in the long term.

  2. MikeN
    Posted Mar 31, 2014 at 11:32 AM | Permalink

    So the proxies could be proxies for temperature in the future too.

    Does the Antarctic not match with screening based on the 55S-10N gridcell? Has to match at least ten percent of that area, with a lower threshold if you are matching with higher latitudes?

    Steve: I don’t know what this means in operational terms. What does this mean in equations or code?

    • MikeN
      Posted Mar 31, 2014 at 12:08 PM | Permalink

      Which temperature dataset are they using?

      Steve: GISS

    • MikeN
      Posted Mar 31, 2014 at 2:35 PM | Permalink

      I see several possible explanations.
      1) They use latitude-weighted earlier, so it could be a correlation, of at least .1 or 10% significance, with the latitude-weighted temperature of the whole area. Unlikely.

2) My original guess: take the correlation with each gridcell, with higher latitudes given more weight as they are closer to the Antarctic, and a proxy passes if 10% of the total gridcell area gets a passing correlation. Weighting the gridcells by latitude (weighting = latitude) and ignoring cells north of the equator would yield
      for (i = 55; i > 0; i = i - 2)
        for (j = -179; j < 180; j = j + 2)
          if (correlation with cell(i, j) is significant) count = count + i

      if count > 14112 then proxypass = true;
      However, this may be overcomplicating things.

3) Take the area of each gridcell in km^2 and the correlations:
      for (i = 55; i > 0; i = i - 2)
        for (j = -179; j < 180; j = j + 2)
          if (correlation with cell(i, j) is significant) count = count + area(i, j)
      count must be more than 10% of the total area.

Steve: dunno. but whatever it is, I doubt that they established this criterion ex ante. Not that any of this stuff necessarily has any statistical meaning. What does it mean when Law Dome O18 passes, but not Siple Dome O18? Or one Na series, but not another? If some proxies “fail”, then the “hypothesis” that this class of proxy is a temperature proxy doesn’t appear to work out of sample.

      • HaroldW
        Posted Mar 31, 2014 at 4:06 PM | Permalink

        MikeN –
        This is merely a guess, as the description in the SI is imprecise, but given the unlikelihood of the authors clarifying, or archiving code…you have not taken into consideration the phrase “at least 10% of the grid-area used for screening.”
        for (all cells within 1000 km of any non-Antarctic proxy)
        …total_area = total_area + area(cell)
        …if (proxy correlates significantly with cell)
        ……[presumably using best correlation of the “pick 3” lags]
        ……corr_area = corr_area + area(cell)
        proxy_passes = (corr_area > 0.1*total_area)

        As I said, merely a guess.

        What’s truly puzzling to me, though, is that Supplementary Figure 20 shows a reconstruction without screening (the red curve), and it’s not appreciably different to my eye from the screened ones (black or blue). So why introduce the complexity and possible biases of screening?

Steve: Interesting point. Doesn’t make a whole lot of sense. They describe their multivariate method as principal components regression, citing Luterbacher et al 2002, 2004 and Wahl and Smerdon 2012. Wahl and Smerdon show algebra that looks identical to me to the underlying algebra of the Mann et al 1998 method (it more or less matches algebra posted up at Climate Audit long ago). These methods can achieve very high r2 in the calibration period, but are prone to ~0 verification r2, as we saw with MBH. Neukom and Gergis have a very idiosyncratic RE method where calibration and verification are both taken in the 1911-90 period. Again it’s annoying to see a novel methodology introduced in an applied study.

        • Frank
          Posted Mar 31, 2014 at 10:19 PM | Permalink

          Steve: When each new data set (or re-submission) of an old data set requires a novel and unproven methodology …

          Perhaps this should be called “enhanced interrogation” of the data.

      • FerdiEgb
        Posted Mar 31, 2014 at 5:31 PM | Permalink

Steve, in all cases d18O is a temperature proxy for the catch area where the water vapour originated before being deposited as snow. For more coastal places like Law Dome and Siple Dome, it is the nearby Southern Ocean which gives most of the precipitation. For the inland cores like Vostok and Dome C, it is all of the Southern Hemisphere oceans.
        There was some kind of Southern Ocean temperature reconstruction, based on the d18O in coastal ice cores which showed the influence of ENSO and other oscillations, where the Peninsula and the rest of the coast cores were reacting in opposite ways.

        I doubt that GISS has much historical temperature data from the oceans near the Antarctic or even land temperatures. Thus if they used any land station far away, it is no wonder that they have some “good” and some bad correlations…

        I hadn’t heard before of Na as temperature proxy, but I suppose that this is about salt content in general and that has to do with wind speed/salt spray. There is some far away correlation with temperature: during glacial maxima there is more wind and less rain, which makes that salt and even sand from far away can reach the inland ice sheets. For coastal ice cores it may be a proxy for local wind speed too, but I don’t see the relation with temperature…

Steve: keep in mind that Neukom and Gergis, like Mann, put instrumental precipitation into their meat grinder. I haven’t parsed their algorithm yet but I’m pretty sure that the algorithm mines for correlations.

        • Posted Mar 31, 2014 at 10:19 PM | Permalink

AFAICS, only one proxy is based on Na, due to Mayewski et al 2004. They say:

“Higher (lower) levels of sea-salt deposition (represented by Na) at DSS are associated with high (low) sea-level pressure changes from coastal and interior East Antarctic locations (Souney and others, 2002) and with increased (decreased) wind speeds at Casey Station, coastward of DSS (Curran and others, 1998). The high sea-salt loading of the poleward-moving air masses is coincident with the austral winter minimum in sea ice (June). The correlation is best for June because there is at this time an available source of sea salt from open water and also energetic air–sea exchange allowing entrainment of sea-salt aerosols. Annual values of Na in the DSS ice core are well correlated (r = 0.35–0.63; r average for 11 stations (Fig. 1c); p < 0.05) with winter (June correlation highest) sea-level pressure over East Antarctica during the period of ice-core and instrumental overlap (1957–96), providing a proxy for the East Antarctic High (EAH) (Souney and others, 2002)."

        • FerdiEgb
          Posted Apr 1, 2014 at 9:03 AM | Permalink

          Nick, indeed a quite good proxy for wind speed, but I don’t see any relation with temperature…

        • TimTheToolMan
          Posted Apr 3, 2014 at 9:27 PM | Permalink

          The correlation with temperature must hinge on sea ice extent…but it ALSO depends on windspeed. Not much of a proxy afaics.

      • MikeN
        Posted Apr 1, 2014 at 1:39 PM | Permalink

        It’s possible the non-screened proxies have some questionable hockey sticks included.

  3. Posted Mar 31, 2014 at 12:01 PM | Permalink

    Cold Screened Spaghetti. The blog that makes the latest ‘independent confirmation’ of the Mann hockey stick sound as congealed as it deserves. All we need now is Keith or Kim exploiting the rhyme between screens and greens.

  4. MikeN
    Posted Mar 31, 2014 at 12:33 PM | Permalink

I was surprised when I recently saw a post of yours saying that Nature has adopted a policy of making authors reveal all data and not hide behind contributing authors. Now, we are seeing the same thing.

    Plus here they are citing their own retracted paper. Granted it is just to list proxies, but why not just list them again. Seems to be a weak attempt at getting citations.

    Steve: they are not citing their retracted paper (J Climate), but one that was published in Holocene

  5. MikeN
    Posted Mar 31, 2014 at 12:45 PM | Permalink

    >Wagenmakers urges that, for “confirmatory analysis”, authors be required to set out a statistical plan in advance and stick to it.

    Authors can set out a statistical plan that they know yields the results they want. Or they can adjust the statistical plan on the fly when they don’t like the initial results.

  6. Posted Mar 31, 2014 at 1:05 PM | Permalink

    Here’s the problem with screening. Suppose you generate N random series and compute a correlation term “p” with a temperature series T. About 5% of them will have correlation coefficients that exceed the usual critical value (call it alpha) indicating 5% significance. If you then select these series and build your model, you are still just using random data so your model is pure noise, even though it looks “significant”. So where’s the catch?

    It’s that alpha is no longer the correct critical value for p. As soon as you do a grid search the critical values follow a supremum function and tend to get large very quickly. The model actually has two parameters being estimated at the same time: the correlation coefficient p and a second parameter (call it gamma) that represents the point in the sample (from i=1,…,N) where the target variable optimally correlates to T. The null hypothesis is that gamma is not in the interval i=1,…,N (i.e. because gamma equals 0 or some number greater than N.) Under the null hypothesis, gamma is not in the sample so p is not identified. p is only identified under the alternative hypothesis.

    This is the so-called “Davies problem” – how to conduct inference in a model when a nuisance parameter is only identified under the alternative hypothesis. It was first analysed by Robert Davies in Biometrika 1977

The topic was developed extensively in econometrics during the 1990s. Tim Vogelsang presented another application at the EAIC conference last summer, on the problem of trend estimation when there may be a shift in the series mean at an unknown point. It comes up all over the place in time series applications.

    The bottom line in Davies-type problems is that failure to adjust the critical values for the effect of doing a grid search across a nuisance parameter space leads to a large bias towards spurious rejections of the null (i.e. exaggerating the significance of the model fit). It looks like yet another area where a young statistician or econometrician could make a useful contribution by bringing empirical techniques in climatology up to, say, circa the mid-1990s.
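    The size of the required adjustment is easy to see by simulation: compare the null 95th percentile of the sup statistic (the best absolute correlation over the grid search) with the single-test critical value. A sketch, with the sample size and number of grid points assumed for illustration and the comparisons treated as independent:

```python
import math
import random

random.seed(0)

def corr(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

n, m, reps = 80, 30, 300   # sample size, grid points searched, null replications
sup_stats = []
for _ in range(reps):
    proxy = [random.gauss(0, 1) for _ in range(n)]
    sup_stats.append(max(abs(corr(proxy, [random.gauss(0, 1) for _ in range(n)]))
                         for _ in range(m)))
sup_stats.sort()
sup_crit = sup_stats[int(0.95 * reps)]   # 95th percentile of the sup statistic
pointwise = 1.96 / math.sqrt(n)          # naive single-test critical value
print(f"single-test 5% critical |r|: {pointwise:.2f}")
print(f"best-of-{m} 5% critical |r|: {sup_crit:.2f}")
```

    The sup critical value is substantially larger than the pointwise one; using the pointwise value for a searched statistic is exactly the spurious-rejection bias described above.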

    • Clark
      Posted Apr 1, 2014 at 1:43 PM | Permalink

      How many grid cells are within a 1000 km radius?

If your p value cutoff is p<0.05, then you expect 5% of random series to pass this cutoff. If you have, say, 5 grid cells within 1000 km, and you are triplicating each grid cell by including lag points, then you would expect over half (1 − 0.95^15 ≈ 54%) of the proxies to make it through the screening with completely random data with no actual physical correlation at all.

      Steve: they used GISS which has a complicated grid structure. My rough estimate would be more like 10-12 grid cells, but I haven’t confirmed this with a calculation. There is spatial autocorrelation. For rough estimates, one can do something like the AR1 adjustment for temporal autocorrelation: use an “effective” number of independent grid cells which would be less than the actual number – say half and we’re back to your 5 grid cells.
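      On that rough basis, the chance that a pure-noise proxy passes at least one of k effective independent 5% tests is 1 − 0.95^k (an upper bound, since neighbouring cells are correlated; three lags per cell assumed):

```python
# expected fraction of pure-noise proxies passing at least one of
# k = cells * 3 independent 5% tests (three lags per cell)
for cells in (5, 10, 52):
    k = cells * 3
    print(f"{cells:2d} effective cells, k = {k:3d}: {1 - 0.95 ** k:.0%}")
```

      On these toy assumptions, five effective cells with three lags already give a pass rate in the mid-50s.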

      • Willis Eschenbach
        Posted Apr 1, 2014 at 5:19 PM | Permalink

        Actually, it’s about 60 gridcells within 1000 km if you count gridcells that are partially overlapped.

        Or it’s 52 gridcells, if you count only gridcells whose centers are within 1,000 km.

While the GISS system is complicated, it works out to (roughly) equal-area gridcells of about 2.25° x 2.25° in size.


        • Clark
          Posted Apr 1, 2014 at 6:59 PM | Permalink

Wow. If it’s 52 grid cells at 3 different lag settings, even accounting for autocorrelation, it’s surprising that only 54% passed muster. What’s wrong with the other proxies that they can’t partially match at least 1 of 156 data sets?

  7. RD
    Posted Mar 31, 2014 at 1:22 PM | Permalink

    I’m new to proxy-based temperature reconstructions, but I do have a background in applied statistics. This “local correlation” procedure is unlike anything I have ever heard of. There are accepted ways of computing distance-weighted correlations over subsets of data and I see no valid reason to invent some new criteria based on the most extreme outlier of a relatively large comparison set. Then there’s the issue of using an absolute correlation, which means that the same proxy type could have a positive correlation with temperature in one location and a negative correlation in another location. The potential causality mechanism (“higher temperatures cause more growth”, etc.) is deliberately ignored by such a procedure. Then there’s the issue of 1000 km- why this distance? Why three lags? And why a different test for Antarctic vs. non-Antarctic proxies? To an outsider like myself, this all seems like an elaborate fishing exercise.

    Steve: don’t ask me to explain this. Try the authors. If you’re new to proxy reconstructions and don’t wish to become infuriated by appalling methodology, I suggest that you immediately forget that you ever heard of the topic.

    • kim
      Posted Mar 31, 2014 at 3:12 PM | Permalink

      There’s madness in the method,
      Unbalance in the biased beam.
      Hygeia weeps at words withheld,
      Truth is smote, insane, unclean.

    • Posted Apr 1, 2014 at 12:21 PM | Permalink

      Well there is an easy way to test this once the data is available. For each proxy create say 10,000 randomised series based on the differenced time series (i.e. create randomised series of differences). Now put these 10,000 new series through the same mill.

      500 should pass and be forwarded to the next stage of analysis. 9,500 should fail and be discarded. If more than 500 pass, it looks like spurious correlations are being generated.

      More simply you can use the Bonferroni correction for multiple correlations where the new p value threshold for significance becomes p/n where p=0.05 and n=number of tests. (Although I’m not sure how this is affected when the data are not independent.) I assume something like this must have been done, but I can’t see it in the SI, nor is it clear how many correlations were done with each potential proxy.
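    In correlation terms the Bonferroni cutoff can be computed directly. A sketch, assuming n = 80 calibration years, m = 30 comparisons per proxy, and the usual large-sample normal approximation for a correlation coefficient (the Bonferroni bound stays valid, just conservative, when the tests are dependent):

```python
import math
from statistics import NormalDist

n, m, alpha = 80, 30, 0.05   # calibration years, comparisons per proxy, level
nd = NormalDist()
# large-sample rule: |r| is significant at level a when |r| > z_{1-a/2} / sqrt(n)
r_naive = nd.inv_cdf(1 - alpha / 2) / math.sqrt(n)
r_bonf = nd.inv_cdf(1 - alpha / (2 * m)) / math.sqrt(n)
print(f"single-test cutoff:          |r| > {r_naive:.3f}")
print(f"Bonferroni (m = {m}) cutoff: |r| > {r_bonf:.3f}")
```

    Correlations that look “significant” against the single-test cutoff can fall well short of the corrected one.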

  8. bernie1815
    Posted Mar 31, 2014 at 1:30 PM | Permalink

    Steve: Do you have any thoughts on Nature’s review process? Is there anyone you would have suggested as a reviewer? Should the reviewers have caught these “anomalies”?

    Steve: IMO a reviewer should have been aware that you need to look at screening in Neukom, Gergis and Karoly. However, there’s a larger problem in that many specialists in the field (e.g. Mann, Schmidt) think that ex post screening is an acceptable method. Also, knowing what’s “new” in this network requires intimate knowledge of the proxies in all previous networks. I have collated all this information over the years and can comment on this within minutes, but many specialists wouldn’t have spent the time on such collations.

    • Posted Mar 31, 2014 at 2:11 PM | Permalink

      I have collated all this information over the years and can comment on this within minutes, but many specialists wouldn’t have spent the time on such collations.

      I see an inconsistency with the word ‘specialists’ in this sentence. Wouldn’t the first guy mentioned be the real specialist?

      • NZ Willy
        Posted Apr 4, 2014 at 5:26 PM | Permalink

        The first guy isn’t paid.

  9. MarkB
    Posted Mar 31, 2014 at 2:00 PM | Permalink

    From my time in grad school (evolutionary biology) I seem to recall that NSF grants did not allow researchers to use the same data for multiple publications. Money was being turned over to produce NEW research, not to publish as many papers as possible. In (paleo) climate science, this practice seems to be not only de rigueur, but required.

    • Clark
      Posted Apr 1, 2014 at 1:45 PM | Permalink

      Journals are very picky about papers using previously published data. But it seems in climate science, the ‘data’ that has to be new is the novel statistical method employed. It has to differ from all other previous statistical approaches. In that case, this study is ‘novel’ indeed.

    • Posted Apr 2, 2014 at 7:50 PM | Permalink

      New papers using old data are very common in economics and finance. And I think also in other observational sciences.

      Steve: and data snooping is criticized in economics as well. Also the issue is not “old data” per se – it’s using the data that was used to develop a hypothesis to “confirm” the hypothesis.

  10. Bruce
    Posted Mar 31, 2014 at 2:26 PM | Permalink


It is well that you spend the time to go through this bilge. Otherwise we would be hearing Karoly pronouncing the end is neigh again.

    I thought this joker had a degree in mathematics?


    • kim
      Posted Mar 31, 2014 at 2:56 PM | Permalink

      His neigh, it nigh
      Offends the fly.

      • Jeff Alberts
        Posted Mar 31, 2014 at 8:52 PM | Permalink

        A hoarse! A hoarse! My kingdom for a lozenge! 😉

        • AndyL
          Posted Apr 1, 2014 at 1:24 AM | Permalink

          Bruce makes a good point. Don’t neigh-say it.

  11. Posted Mar 31, 2014 at 2:45 PM | Permalink

    The trader-philosopher NN Taleb picked up on a new study in Nature Neuroscience (actually the popular Sci Am article based on it) that tackles the problem of non-independent sampling and analysis.

    Aarts et al A solution to dependency: using multilevel analysis to accommodate nested data. Nature Neuroscience 17, 491–496 (2014)

    Steve: from the earliest days of this blog, I’ve drawn attention to random effects/mixed effects models – this is the same thing as multilevel analysis. I even showed a way to emulate recipes for tree ring chronologies using mixed effects models. The idea of multilevel analysis is very old, but not as widely applied as it might be.
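    A toy illustration of why the nesting matters (simulated data with an assumed site-level random effect; a real multilevel model would estimate the variance components jointly rather than by this crude method-of-moments split):

```python
import random
from statistics import mean, pvariance

random.seed(0)

# nested data: replicate measurements within sites; ignoring the grouping
# treats correlated within-site replicates as independent observations
n_sites, n_reps = 8, 50
site_effects = [random.gauss(0, 1.0) for _ in range(n_sites)]
data = [[s + random.gauss(0, 0.5) for _ in range(n_reps)] for s in site_effects]

within = mean(pvariance(group) for group in data)     # within-site variance
between = pvariance([mean(group) for group in data])  # between-site variance
print(f"within-site variance  = {within:.2f}")
print(f"between-site variance = {between:.2f}")
# when between-site variance dominates, the effective sample size is closer
# to the number of sites than to the number of individual measurements
```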

  12. EdeF
    Posted Mar 31, 2014 at 3:07 PM | Permalink

    1000 km distance would be like correlating Northern California Coastal Redwood
    tree ring growth with Las Vegas temperature data.

  13. pauldd
    Posted Mar 31, 2014 at 3:16 PM | Permalink

I have been casually following this debate for some time and I have a basic background in college level statistics. I have a question regarding ex post screening that perhaps someone can answer. Forgive me if it is too elementary.

First, I fully understand the problem with ex post screening. As I understand it, ex post screening substantially increases the risk that one will include a proxy in the reconstruction that is spuriously correlated with temperature. I am familiar with the criticism that if one screens a large group of randomly generated red noise proxies, one will find a subsample that appears to consist of temperature proxies but does not contain a true temperature signal.

My question is: can this problem be ameliorated by testing the correlation of the proxies to each other during the reconstruction period? For example, if one is identifying spurious correlations during the calibration period, one will detect this if the proxies are not correlated with each other during the reconstruction period. On the other hand, if one is identifying a true temperature proxy in the calibration period, one would expect that it would remain correlated to other true proxies during the reconstruction period, as it is responding to the same temperature signal. Obviously, this test is not perfect, as one would expect some degree of independence in the proxies during the reconstruction period due to differences in the true temperature signal of different regions. Nevertheless, this would seem to be a helpful test. Yes, no or maybe? Can you link me to a previous discussion that addresses this issue?

    Steve: I have long urged specialists to examine proxies for consistency as a precondition to presenting a reconstruction. If one has consistent proxies, then you get similar reconstructions regardless of the methodology. Specialists have either ignored the idea or sneered at it. Instead, they prefer to throw increasingly complicated and poorly understood multivariate methods at the problem, yielding poorly interpreted squiggles. This focus on poorly understood methods regularly results in proxies being used upside down and/or use of badly contaminated proxies.
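
    As an illustration of the screening problem pauldd describes, here is a minimal sketch (the AR(1) persistence and screening threshold are illustrative choices, not Neukom's actual settings): screening independent red-noise "proxies" against a red-noise "temperature" target passes far more than the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(0)

def red_noise(n, phi=0.7):
    """AR(1) series: x[t] = phi * x[t-1] + e[t] (illustrative persistence)."""
    x = np.zeros(n)
    e = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

n_years, n_proxies = 100, 1000
target = red_noise(n_years)                  # stand-in "temperature" record
proxies = [red_noise(n_years) for _ in range(n_proxies)]

# Ex post screen: keep any "proxy" whose correlation with the target exceeds
# the nominal 5% two-sided threshold for i.i.d. data (|r| > ~1.96/sqrt(n)).
threshold = 1.96 / np.sqrt(n_years)
passed = [p for p in proxies if abs(np.corrcoef(p, target)[0, 1]) > threshold]

frac = len(passed) / n_proxies
print(f"{frac:.0%} of pure-noise 'proxies' pass screening (nominal rate: 5%)")
```

    Because autocorrelation inflates the variance of the sample correlation, the pass rate runs well above the nominal 5%; a network screened this way will "correlate" with temperature whether or not any signal is present.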

    • Posted Apr 1, 2014 at 3:22 PM | Permalink

      It certainly would be a much needed step in the right direction.

      Unfortunately, I think part of the reason this hasn’t really been attempted much is that there are often huge inconsistencies between proxies… even proxies taken from the same area and using the same class of proxy, e.g., the several different “Polar Urals”/Yamal tree ring proxies.

      As Steve notes, if the proxies were all even moderately consistent, then you could probably get a reasonable enough estimate from a simple “Composite Plus Scale” (i.e., simple annual mean of all the proxy values available in a given year).

      An additional problem is that even where there is some consistency, it is possible that a lot of that might be an artefact of researcher confirmation bias.
      Paleoclimatologists are used to searching for (1) a “Medieval Warm Period”, (2) a “Little Ice Age”, (3) a “Current Warm Period”. This leads to the worry that some researchers might be accepting/dismissing a particular proxy series prematurely, based on whether or not such features could be identified.

      There are a lot of “thorny” problems in this field which need to be addressed, but unfortunately many of the studies in this field have so far chosen to only mention the “rose petals”…

      “Can you link me to a previous discussion that addresses this issue?”

      I discussed some of these problems in a review of the various millennial temperature proxy reconstructions that we have recently submitted for open peer review in a new experimental peer review forum we have set up called Open Peer Review Journal.

      You might find our review helpful – the article is called “Global temperature changes of the last millennium”. We also provide links to some CA posts which you might find relevant to your questions.

      • Jeff Alberts
        Posted Apr 1, 2014 at 8:13 PM | Permalink

        If you’re convinced there is a global temperature, then I don’t think it would be very useful.

  14. kim
    Posted Mar 31, 2014 at 3:19 PM | Permalink

    Instead of pasta,
    How about ‘Basta’?

  15. Posted Mar 31, 2014 at 3:31 PM | Permalink

    This should be fun.

  16. David Brewer
    Posted Mar 31, 2014 at 4:00 PM | Permalink

    Steve reports: “The acceptance rate for corals was about 80% and for other proxies was about 50% (slightly lower for ice cores)…Among the “long” proxies (ones that start earlier than 1025, thus covering most of the MWP), 9 of 12 ice core proxies were rejected, including isotope records from Siple Dome, Berkner Island, EDML Dronning Maud…Na series from Siple Dome and EDML did not “pass”…Of the 5 long tree ring series, only Mt Read (Tasmania) and Oroko Swamp NZ “passed”.”

    So if Neukom and Gergis’ method were correct (yeah I know, but just for the sake of argument…), wouldn’t it invalidate large numbers of previous studies which used the now rejected series as temperature proxies? Wouldn’t it also invalidate the previous methods which declared those series to be adequate proxies?

    Steve: things like Rio Alerce are indistinguishable from white noise. Whether they are included in or excluded from an earlier study will have little impact on the shape of the reconstruction (which typically comes from bristlecones, Yamal, upside-down Tiljander, that sort of thing).

  17. Kenneth Fritsch
    Posted Mar 31, 2014 at 4:33 PM | Permalink

    Thanks SteveM for making the point and referencing the articles about the fallacy of post facto screening that for some reason climate scientists doing temperature reconstructions simply either do not understand or do not want to understand. This point cannot be made too often or in too many different ways in attempts to help readers see the light. Wagenmakers points to the distinction between exploratory and confirmatory analyses and also provides practical means of doing proper confirmatory analyses, in order to prevent the statistical tool box from simply being dumped on the data.

    I have said this before and I’ll say it again here: this simple concept of misapplied statistics, and the effect it has on hypothesis testing, is, I would bet, not understood by many of the people reading these posts, or by a goodly number of otherwise intelligent and informed scientists practicing in the hard sciences. In the hard sciences one has the great advantage of running controlled experiments, experiments that by their very nature are confirmatory. I think perhaps there is a blind spot when it comes to hard scientists dealing with the softer sciences where controlled experiments are not possible.

    Interesting that the soft science that Wagenmakers references in the links above has some appreciation for the problem, while I do not recall a single climate scientist, or at least any of those doing temperature reconstructions, making those points.

    The Neukom and Gergis application of post facto screening as you describe here, Steve, is not only wrong; it is hideous.

    • Jeff Alberts
      Posted Mar 31, 2014 at 8:58 PM | Permalink

      Thanks SteveM for making the point and referencing the articles about the fallacy of post facto screening that for some reason climate scientists doing temperature reconstructions simply either do not understand or do not want to understand.

      Or, they understand fully, which is why they do it.

    • Ian Blanchard
      Posted Apr 1, 2014 at 3:42 AM | Permalink


      I’m FAR less surprised than you that scientists from a ‘hard’ science background are poor at applying appropriate statistical techniques to these types of problems. I have Bachelor’s, Master’s and PhD degrees in geology/geochemistry (and an A-level background in maths, physics and chemistry), and during my university education there was very little formal training in statistics and data processing beyond taking mean averages and standard deviations (or at least learning how to do these in Excel).

      The University where I did my PhD did run a dedicated statistics module for undergrad geologists (a couple of hours a week for one semester in the 2nd year), but that was an exception rather than the norm. I actually learnt more about stats from being a teaching assistant on that course than I ever did through being taught directly, and even then it was fairly limited – a bit about Student’s t-test and other tests of statistical significance, but a long way short of the level needed to get involved in arguments between Bayesians and frequentists or to follow the details of the statistical arguments here (although obviously the issue of inverting non-proxy series because of spurious correlations is a rather easier matter to understand…).

    • Clark
      Posted Apr 1, 2014 at 1:59 PM | Permalink

      I think there is a lot of truth in Kenneth Fritsch’s comment. Coming from the hard sciences, one sort of assumes that other disciplines also require strong conclusions to be based on independent, confirming studies.

      Having watched this climate proxy business via CA for many years, proxy reconstructions seems like the kind of project I would tell any graduate student in my lab to avoid at all costs. Tons of different proxy types, very noisy data, limited verification approaches, etc. If they insisted on pursuing proxy reconstruction, the only logical way would seem to find a specific proxy type that had a strong physical, demonstrated connection to temperature, skeptically test the heck out of all existing proxies of that type versus temp record. IF you found one that passed all the tests, THEN go out and get a bunch of new records of that particular proxy type and test them for consistency. Only then would it be worth while to think about trying your hand at a reconstruction.

  18. Follow the Money
    Posted Mar 31, 2014 at 5:10 PM | Permalink

    May I point out the exact publication is Nature Climate Change?

    “The only “new” passing ice core record was a still unpublished Law Dome Na series”

    Indeed the Law Dome “sea salt” is new to the 2014 paper, but I think you mean that what is “still” unpublished is the Law Dome isotope record purportedly looked at for the 2012 paper. I’m not nitpicking; I think the now doubled reliance on “unpublished” data on something as important as Antarctic isotope records is a big deal.

    Steve: 1000%. Law Dome O18 has been a longstanding issue here. I asked for it as early as 2004 or so (Tas van Ommen reported my inquiry to Phil Jones in Climategate). van Ommen promised to publish it in 2005 or so. Nine years later it is still unpublished. There is some grey information in other articles that I’ve collected.

    • Follow the Money
      Posted Mar 31, 2014 at 7:27 PM | Permalink

      Steve, the Neukom 2014 start date for the Antarctic O18 is A.D. 179. The same for Pages 2k 2013 is A.D. 167. That’s pretty close; looking at the same core? Anyway, this is an opportunity to point out to others that the visual representation of the Antarctic isotope record in the Pages 2k 2013 paper “shows” colorful warming in the first millennium A.D. The IPCC AR5 edit of the Pages 2k 2013 information truncates the start at 950 A.D.

    • joe
      Posted Apr 1, 2014 at 6:56 AM | Permalink

      Steve – we could use some education on the Antarctic ice core proxies and their usage in the temperature reconstructions (especially as they relate to the MWP). There was an indication that 7 of 9 were rejected as proxies. Were they rejected due to unreliability of data, inconsistencies, the wrong answer, etc.?
      I also recall seeing a chart of the temperature reconstructions using the 9 proxies. My recollection was that 4-5 showed a warmer MWP, 2 showed a cooler MWP and the other 2 were inconclusive as to whether the MWP was warmer or cooler. One shortcoming of those charts is that they ended circa 1950-1980ish, so there was some inability to compare through to today.
      Any thoughts would be appreciated.

      Steve: I have an inventory of work in progress on Antarctic proxies. It’s an interesting topic.

  19. observa
    Posted Mar 31, 2014 at 6:01 PM | Permalink

    I’m not normally into International Conventions to protect my rights but I must confess they do have their moments-

    ‘For the purposes of this Convention, the term “torture” means any act by which severe pain or suffering, whether physical or mental, is intentionally inflicted on a person for such purposes as obtaining from him or a third person information or a confession, punishing him for an act he or a third person has committed or is suspected of having committed, or intimidating or coercing him or a third person, or for any reason based on discrimination of any kind, when such pain or suffering is inflicted by or at the instigation of or with the consent or acquiescence of a public official or other person acting in an official capacity.’

    Particularly when I’m being constantly mentally tortured by a plague of these public officials, presumably with the intention of incarcerating me and my kind in science re-education gulags-

  20. Alastair
    Posted Mar 31, 2014 at 6:12 PM | Permalink


    Surely you are aware of the German maxim that one must “torture the data until it confesses”? I heard this nugget of wisdom at a scientific meeting I am attending and could not stop myself from sharing it with you given how appropriate it is to this current ‘conundrum’.

  21. snarkmania
    Posted Mar 31, 2014 at 7:04 PM | Permalink

    “However, as Jean S observed at Climate Audit at the time, they actually calculated correlations on non-detrended data.”

    I realize that detrending may provide a bit more resolution for spectral analysis (than not detrending). What is the purpose of detrending prior to a correlation analysis? I wonder especially because a common trend could be an important component of the correlations.

  22. Willis Eschenbach
    Posted Apr 1, 2014 at 1:04 AM | Permalink

    This is absolutely hilarious! That is the most bizarre procedure for proxy selection I ever heard of. I immediately thought of the issue that Ross McKitrick discusses so ably above—it will mine for “significant” results which won’t be significant in any sense of the word.

    However, I’d never considered the problem pointed out by Roman M … well done, that man.

    Onwards, ever onwards.


    • Nathan Kurz
      Posted Apr 1, 2014 at 4:48 AM | Permalink

      > However, I’d never considered the problem pointed out by Roman M … well done, that man.

      Which problem are you referring to? I’m not finding it here on the page.

    • MikeN
      Posted Apr 1, 2014 at 10:17 AM | Permalink

      If you correlate proxy data with a temperature record that makes it a temperature proxy.
      Since temperature tends to be coherent on a regional scale (as compared to, say, precipitation which is local), 1000 km is OK – especially since that only represents the sample space and does not mean that they automatically and always used data records located that far away. Data from that distance would only enter the equation if it was “statistically significant.”

      Response from a climate scientist.

      • Kenneth Fritsch
        Posted Apr 1, 2014 at 11:12 AM | Permalink

        MikeN, from coherency or high-frequency correlations between a proxy and temperature, it does not follow that the proxies will correctly capture the lower-frequency (decadal/centennial-scale) behaviour that matters most to these studies, i.e. the trends. One can readily generate time series with excellent high-frequency correlations and poor low-frequency ones.

        In the original and retracted Gergis paper on this same topic, the rationale behind using a detrended, high frequency correlation in the post facto selection process was an attempt to get around spurious low frequency correlations that arise in a time series with reasonably high autocorrelations. I call these 2 problems the reconstruction dilemma.

        But of course all these problems are noticeable only if one accepts the very wrong proposition that post facto selection is proper without a severe, complicated-to-impossible adjustment to the calculation of the p-value, or without a confirmatory analysis such as Wagenmakers (wagen fixer?) suggests.
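
        The high-/low-frequency point can be sketched in a few lines of Python (illustrative numbers only): two series that share year-to-year noise but carry opposite trends correlate strongly after detrending, yet weakly or even negatively in raw form.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
t = np.arange(n)

common = rng.standard_normal(n)                   # shared year-to-year "weather"
temp = 0.05 * t + common                          # warming trend plus that noise
proxy = -0.05 * t + common + 0.3 * rng.standard_normal(n)  # opposite trend

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def detrend(x):
    """Remove the least-squares linear trend."""
    return x - np.polyval(np.polyfit(t, x, 1), t)

r_raw = corr(temp, proxy)
r_detrended = corr(detrend(temp), detrend(proxy))
print(f"raw correlation:       {r_raw:+.2f}")
print(f"detrended correlation: {r_detrended:+.2f}")
```

        A screen based on detrended (high-frequency) correlation would accept this "proxy" even though its centennial-scale trend runs the wrong way.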

  23. Geoff Sherrington
    Posted Apr 1, 2014 at 1:57 AM | Permalink

    In due course I shall be fascinated to learn the detail of how the many error terms are carried and accumulated faithfully through the several processes of screening, calibration, etc.

  24. Geoff Sherrington
    Posted Apr 1, 2014 at 3:51 AM | Permalink

    Lest there be doubt, there are applications in science that take more customary paths to completion.
    A simple example faces the analytical chemist who wishes to devise a method to determine the concentration of a sought substance in a described variety of materials. (for shortcuts and a trip back to olde times, let’s adopt a colorimetric determination of copper in soils for an example).
    The analyst either knows or devises a way to get the copper into solution and to hold it there without significant change.
    The solution is tested against a range of reactants. The reactants are not chosen randomly – there are far too many chemicals for that. Instead, the analyst uses prior education to home in on a small subset of chemicals, such as those that are strong ligands for metals. The reactants need to show a colour change when exposed to ionic copper; and the change needs to be representable by a calibration graph (linear is easiest). The more sensitive the response to low concentrations, the better the choice.
    “Interferences”. Ions other than Cu might also react with the reactant. Many mixes are done to see if copper and only copper causes a colour change. Next, there are higher order interferences. An unwanted colour change might occur in the presence of Cu plus another metal, but only when a third reactant is present. Hydrogen ion as in pH is very commonly found here. Next, there are the unknown unknowns, investigated by methods such as obtaining or creating complex mixtures in which there is negligible copper, to see if there is a colour change that has nothing to do with the objective.
    Next comes the calibration step, in the following form if it is available. Completely different methods of analysis are used to derive copper concentration in a variety of materials. These might have acronyms like AAS, XRF, MS, NAA. If the developing colorimetric method does not reproduce these values to a very high degree of match, it’s curtains.
    Lastly, if the method is adopted, there is a usual plan of replicate analysis of routine samples for something like every 50th sample. As well there is the creation of standard samples by weights and measures type bodies, their routine inclusion in daily work, the round robin exercises with other labs and statistical ways to see how deviations that are revealed should be investigated.
    Even with the care and controls I’ve mentioned, the best analytical labs in the olde world were shocked by the results of an important collaboration, the chemical analysis of rocks and soils from the Apollo lunar project. See here
    I’ve purposely used this older period of the 1970s to emphasise a way that science was then conducted. It is important that many other branches of science adopted methods of similar rigour as this, as the routine, the expected way to go.
    What do we deduce? These, possibly.
    * The old rigour was thought adequate but was found not to be, rigour being subjective here. The Apollo exercise was continued in order to find ways to improve.
    * The variability of the Apollo results looks to me to be worse than the variability that is often asserted to be present in modern climate change work. The latter is far less controlled and controllable. My deduction is that climate uncertainty is calculated hopefully but wrongly.
    * The analytical method did not depend on statistics to evolve or perform. (If 1 unit of copper produced 1 unit of colour change, that was the calibration – and it had to be repeatable over the years.)
    * If a novel analytical method was found, it did not proceed if there were uncontrollable variables or factors. Near enough was not good enough.
    Back to the present, we see mention of tree rings at Mt Read, Tasmania, as passing screening tests. While statistical screening tests let this slip through, an objective, observational test would be that the relevant Tasmanian temperature record is not fit for purpose, unless one uses a very broad definition of ‘fit’.
    One unit of tree change has not been shown to match one unit of temperature change.
    The Neukom paper seems to fail about every one of the types of precautionary steps that were the norm a few decades ago.

    • Konrad
      Posted Apr 1, 2014 at 6:15 AM | Permalink

      “…observational test would be that the relevant Tasmanian temperature record is not fit for purpose”

      Ah, but Geoff, observational tests are somewhat subjective. Some observers would consider the proxy highly fit for purpose, if you consider that purpose was shameless propaganda 😉

  25. Steve McIntyre
    Posted Apr 1, 2014 at 6:27 AM | Permalink

    Nature contacted me today to say that an archive of Neukom data has been placed online here:

    I haven’t parsed it yet.

    • Posted Apr 1, 2014 at 7:04 AM | Permalink

      Top marks for both enforcing the policy and letting you know.

      • dalechant
        Posted Apr 1, 2014 at 9:20 AM | Permalink

        OK, I got this far:

        ****** WARNING ** WARNING ** WARNING ** WARNING ** WARNING ******
        ** This is a United States Department of Commerce computer **
        ** system, which may be accessed and used only for **
        ** official Government business by authorized personnel. **
        ** Unauthorized access or use of this computer system may **
        ** subject violators to criminal, civil, and/or administrative **
        ** action. All information on this computer system may be **
        ** intercepted, recorded, read, copied, and disclosed by and **
        ** to authorized personnel for official purposes, including **
        ** criminal investigations. Access or use of this computer **
        ** system by any person, whether authorized or unauthorized, **
        ** constitutes consent to these terms. **
        ****** WARNING ** WARNING ** WARNING ** WARNING ** WARNING ******

        and was too scared to go any further, not being on “official Government business” or an “authorized personnel”.

      • CG
        Posted Apr 4, 2014 at 10:52 AM | Permalink

        Seconded, cheers to Nature for this!

    • JunkPsychology
      Posted Apr 1, 2014 at 10:18 AM | Permalink

      I got access to the archive, no problem. Even downloaded a file.

      Just don’t ask me whether everything is there 🙂

      Steve: annoyingly their roster is only in pdf form. I’ve collated it into a spreadsheet for easier analysis and will put it online in a day or so.

      • Willis Eschenbach
        Posted Apr 1, 2014 at 2:53 PM | Permalink

        Steve, I downloaded their stuff, but I don’t find any metadata (proxy type, location, etc). Is that the “roster” you speak of?

        Your patient manner pays off … many thanks.


  26. John Ritson
    Posted Apr 1, 2014 at 7:11 AM | Permalink

    It would be interesting to know the lowest correlation (including the three lags) over the grid. Tiljander, anyone?

  27. Steve McIntyre
    Posted Apr 1, 2014 at 8:19 AM | Permalink

    Checking some individual proxy versions: they show the end of ocean sediment 106KL offshore Peru as AD 2000, whereas the underlying articles show an end in the 1920s. They also truncate the record at 1243. In red below, I’ve shown data from the original source; the red looks like the Neukom data in the common period. It is lithics concentration as a percentage of maximum. The original authors interpret the record as showing LIA and MWP impacting Peru, with low lithics concentration being “up”.

    I wonder where the post-1930 values in the Neukom record came from? Are they real or are they “infilled” a la Mann? This record was screened out in the reconstruction, but presumably used in SI Figure 20.


    • Matt Skaggs
      Posted Apr 1, 2014 at 1:11 PM | Permalink

      I had not heard of using lithic concentration in ocean sediments as a temperature proxy, so I went looking for a rationale based upon a physical process. Here is a note on 106KL posted by the authors, Rein et al, at a NASA site:

      “During cruise Sonne-147 in 2000, laminated marine cores were recovered from the ENSO region on the Peruvian shelf. Lithoclastics are supplied and dispersed on the shelf during El Niño floods. El Niño warm surface water anomaly shuts down the upwelling of nutrient rich waters and strongly reduces marine primary production. As ENSO proxies, photosynthesis pigment and lithoclastic contents were derived with very high sampling density which resolves the interannual ENSO variability. Sea surface temperatures were estimated from alkenones. Core 106KL comprises in its upper 11 m sedimentation since the Last Glacial Maximum and has been dated by 42 AMS-14C radiocarbon ages.”

      Alkenones are widely exploited as temperature proxies, of course. Hopefully Gergis et al obtained the approval of the authors to use the lithics concentration data as a temperature proxy rather than alkenone content. But I’m still looking for that rationale based upon a physical process.

      • Willis Eschenbach
        Posted Apr 1, 2014 at 2:56 PM | Permalink

        Interesting quote, matt. My question is, why would we expect an ENSO proxy to be a temperature proxy? And if we did expect it, which way is up (increasing temperature)?


    • MrPete
      Posted Apr 1, 2014 at 2:12 PM | Permalink

      Re: Steve McIntyre (Apr 1 08:19),
      Their method for producing more recent data is described in your quote above:

      In this approach, the representative instrumental data for each record are degraded with white noise and then classified into the index categories of the documentary record in order to realistically mimic its statistical properties and not overweight the record in the multiproxy calibration process. The amount of noise to be added is determined based on the overlap correlations with the instrumental data.

      If I’m not mistaken, they are simply:
      – Taking instrumental data
      – Adding (random) noise, the amount of noise depending on certain parameters, but it is still random noise
      – “Extending” the measured field data using this “pseudo documentary” data

      No matter how I stare at that method, it looks to me like a fancy way of saying “we extended the proxy data using instrumental data.”

      This is where a visualization of the results is worth 1000 words.
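
      For concreteness, here is one way the quoted noise-degradation step might look in code (my guess at the procedure from the quoted description, not the authors' actual code): standardize the instrumental series, add white noise scaled so the overlap correlation hits a target value, then quantize into documentary-style index categories.

```python
import numpy as np

rng = np.random.default_rng(2)

def degrade_to_index(instrumental, target_r, n_categories=3):
    """Add white noise until the correlation with the instrumental series is
    roughly target_r, then quantize into documentary-style index categories.
    (A reading of the quoted passage, not the authors' actual code.)"""
    x = (instrumental - instrumental.mean()) / instrumental.std()
    # For a standardized series, corr(x, x + s*noise) ~ 1/sqrt(1 + s^2),
    # so pick the noise scale s that hits the target overlap correlation.
    s = np.sqrt(1.0 / target_r**2 - 1.0)
    noisy = x + s * rng.standard_normal(len(x))
    # Equal-count categories, mimicking a 1/2/3 documentary index.
    edges = np.quantile(noisy, np.linspace(0, 1, n_categories + 1)[1:-1])
    return np.digitize(noisy, edges) + 1

instrumental = rng.standard_normal(150)      # stand-in instrumental record
pseudo_doc = degrade_to_index(instrumental, target_r=0.5)
r_overlap = np.corrcoef(instrumental, pseudo_doc)[0, 1]
print(f"overlap correlation after degrading: {r_overlap:.2f}")
```

      However the noise is tuned, the "pseudo documentary" series remains the instrumental record plus noise, which is the point above.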

      • RomanM
        Posted Apr 1, 2014 at 3:02 PM | Permalink

        This appears to have been done with precipitation and river streamflow information contained in old documents. What algorithm can turn temperatures into realistically mimicked water data is a mystery. To then claim that this can somehow provide a genuine calibration of the proxy is no longer scientific.

        • HaroldW
          Posted Apr 1, 2014 at 3:33 PM | Permalink

          I think the interpretation of “representative instrumental data” as temperature is inaccurate. Rather, they use instrumental data of the relevant variable, e.g. precipitation. See Neukom et al. 2009, esp. figure 2.

          It seems that in essence they calibrate a hybrid of instrumental precipitation (say) and documentary values vs. temperature, and then apply that relationship to the historical data. I leave it to more knowledgeable folks to analyze what that implies about the reliability of the resulting reconstruction.

        • RomanM
          Posted Apr 1, 2014 at 4:00 PM | Permalink

          Thanks for the info and the link to the paper, Harold. I stand corrected. 🙂

        • Kenneth Fritsch
          Posted Apr 1, 2014 at 6:43 PM | Permalink

          Figure 2 in your link, Harold, shows historical values of either 1 or 2 with many missing years and then shows more modern readings with smaller increments and no missing data. How would you analyze and apply statistics to that? It makes my head hurt.

        • HaroldW
          Posted Apr 1, 2014 at 10:13 PM | Permalink

          Kenneth –
          Mine too. I think the “missing” years are interpreted as zero values. As I understand it, the logic behind the reconstruction is based on an assumption of additive normal “noise”, which certainly is not an accurate description of such highly quantized time series. While many methods that are demonstrably optimal for Gaussian distributions still give reasonable results when the random variables follow other distributions (often due to the central limit theorem), it is not apparent to me that this is true in the present instance, nor the reverse. As I don’t see any evidence in the SI about the applicability of the methodology to such series, my guess is that the authors are just winging it here.

    • Matt Skaggs
      Posted Apr 1, 2014 at 3:49 PM | Permalink

      Steve wrote:

      “The original authors interpret the record as showing LIA and MWP impacting Peru, with low lithics concentration being “up”.

      I just noticed this on a re-read, can you give a citation? I have not found any instance where the original authors, Rein et al, relate lithics to temperature in any way. Or did they mean that ENSO magnitude/frequency fluctuations correlate to the MWP and LIA? Elsewhere Rein states that their SST data from 106KL support a SH LIA but not a SH MWP.

      • Matt Skaggs
        Posted Apr 2, 2014 at 5:49 PM | Permalink

        Since I cannot find anything remotely resembling Steve’s statement anywhere in the literature, and absent a citation from him, I will assume that he erroneously concluded that the 106KL graph showing lithics concentration was somehow related to temperature when the authors actually used the data to show flood frequency (high lithics = major storm). So as to Willis’ question of “why would we expect an ENSO proxy to be a temperature proxy?”…you will have to ask Steve!

        Steve: Neukom et al 2014 included 106KL lithics concentration as a candidate series in their reconstruction. They also included the Peru ENSO documentary series and various precipitation series. Mann also included instrumental precipitation series in his reconstructions. If you want explanations, better to seek it from these authors, rather than me, as I have not proposed any of these series as temperature proxies.

  28. HR
    Posted Apr 1, 2014 at 9:28 AM | Permalink

    Steve, if you are familiar with many of the proxies and also think you know the best tools for screening and building a multiproxy dataset, then why not just generate your own reconstruction and let us see how you think temperatures have developed over the past 1 or 2 thousand years? I’d certainly be interested in seeing how much it differed from the published stuff.

    (apologies if you’ve already done this or I’m asking too much)

    Steve: I don’t see any point in dumping a lot of stuff into a hopper and making a weighted average. What specialists need to demonstrate first is that there is consistency among proxies. I also don’t see much point in 1000-2000 reconstructions without a Holocene perspective.
    I think that there are some regions where one can, with some squinting, begin to see consistency. I’ve been working at this off and on for a while.

    • Matt Skaggs
      Posted Apr 3, 2014 at 9:37 AM | Permalink

      Fair enough.
      106kl sits in the East Pacific Warm Pool and so is an important proxy.
      After reading the various Rein et al papers, I get this:
      Rein et al showed a strong anomaly of low lithics in the MWP.
      Lithics come from strong rain events.
      Strong rain events come from El Nino.
      El Nino comes from warm SST.
      Therefore low lithics in the MWP = cool SST in the medieval Pacific.
      (So the reason for Mann’s interest is obvious.)

      You wrote:
      “The original authors interpret the record as showing LIA and MWP impacting Peru, with low lithics concentration being ‘up’.”

      But actually the original authors interpret the low lithics as “down” in temperature.

      So whatever your point was, I missed it.

      • AndyL
        Posted Apr 3, 2014 at 5:16 PM | Permalink

        On 1 April, you asked for a citation for the original authors’ orientation of low lithics. As you have apparently found the citation, could you share it?

        • Matt Skaggs
          Posted Apr 4, 2014 at 9:13 AM | Permalink

          The logic steps I listed were pieced together from reading everything pertinent that came up from a Google Scholar search of “106kl Rein.” None of the papers concisely lay out the entire logic structure so I pieced it together as best I could, but the logic is consistent throughout every paper I found. What I did NOT find was any support for Steve’s statement that low lithics are “up” (Steve’s word) for temperature, rainfall, flood events, or ENSO. He also wrote of the “MWP impacting Peru” which is sort of right but misleading; what the authors found was a total cessation of major rain events which they equated to cool SST, so not the “MWP” having an impact per se. Any further research would have to go through a paywall so unless Steve responds again I can only assume that he erred somewhere or that I misinterpreted what he wrote.
          More interesting than this conundrum is Mann and Neukom’s interest in this proxy. The anomalously low lithics throughout the MWP must look like a shiny coin to a crow – if you just average it in with NH MWP temperature proxies and work a little principal component hocus pocus, you can disappear the MWP and achieve a long sought goal of the Hockey Team.

  29. miker613
    Posted Apr 1, 2014 at 2:41 PM | Permalink

    Dunno. How can this nonsensical procedure be published in a serious journal? Peer review? Nick Stokes, where are you when we need you?

  30. Curt
    Posted Apr 1, 2014 at 2:42 PM | Permalink

    A friend of mine is a top cancer researcher. One of the challenges he faces is that the state of knowledge in the field is changing so rapidly that before any clinical trial is over, and often virtually as soon as it has actually begun, they wish they had set up the analysis differently.

    But the research protocols in his field are very strict – all of the methods of statistical analysis must be explicitly laid out ahead of time, and cannot be changed.

    • Jeff Norman
      Posted Apr 2, 2014 at 9:37 AM | Permalink


      Surely a quick note to the Ethics Officer would allow your friend to alter his study in any way he wished

  31. Don B
    Posted Apr 1, 2014 at 2:55 PM | Permalink

    “Statistics, it has been observed, will always confess if tortured sufficiently.”

    Financial economist Andrew Smithers wrote these words in a Financial Times article describing how stockbrokers data mine so that studies always conclude that stocks should be bought. Climate scientists ought to aim higher than stockbroker standards.

  32. ThinkingScientist
    Posted Apr 1, 2014 at 2:55 PM | Permalink

    In a simple Student’s t-test for correlation significance, the significance threshold must be tightened as a function of the number of comparisons made (for a nice lay explanation, google Kalkomey, 1997, The Leading Edge). However, this would still assume that the grid-cell temperature variables were spatially independent. If not (which seems likely), then I think the significance test would need to be even more onerous.
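    A minimal sketch of that adjustment (using a normal approximation to the t distribution and the Šidák form for the per-test level; the values of n and m below are illustrative, not taken from the paper):

```python
from math import sqrt
from statistics import NormalDist

def critical_r(alpha: float, n: int) -> float:
    """Two-sided critical correlation at level alpha for n paired
    observations, via a normal approximation to the t distribution
    (adequate for a roughly century-long calibration window)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / sqrt(z * z + n - 2)

def sidak_alpha(alpha: float, m: int) -> float:
    """Per-test level that holds the family-wise error rate at alpha
    across m independent comparisons (Sidak adjustment)."""
    return 1 - (1 - alpha) ** (1 / m)

n = 90       # e.g. a 1911-2000 calibration window
m = 100      # proxies screened against the same temperature target

r_single = critical_r(0.05, n)                  # ~0.20
r_family = critical_r(sidak_alpha(0.05, m), n)  # ~0.35
print(f"one test:  |r| > {r_single:.3f}")
print(f"{m} tests: |r| > {r_family:.3f}")
```

    With 100 screens the cutoff roughly doubles, which is the direction of the correction being described; spatial dependence among the grid cells would change the effective number of comparisons and so the exact figure.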

  33. Willis Eschenbach
    Posted Apr 1, 2014 at 5:51 PM | Permalink

    More fun with the data … I got to wondering about their error estimates. They give those as a part of the data Steve linked to. I multiplied their error estimates by the square root of the number of datapoints for each year, which I got from the “screened proxies” dataset. That gives me the standard deviation of the underlying data, year by year. It looks like this:

    Then I went to the “screened proxies” dataset and got the actual standard deviations year by year for comparison, shown in red above …

    There are a couple of oddities in that graph. The first one is that the actual measured standard deviation across all proxies, year by year, is never far from 1.0. This implies to me that they first standardized the proxies, then they compared them to the standardized temperatures, in order to convert from proxy units to units of degrees C. This is supported by the histogram of the entirety of the screened proxy data, N=223,110:

    The next oddity is that the standard deviation given in the Neukom2014 results is so regular from year to year, without the large variations we see in the actual SD data. In addition, the Neukom2014 value is much more stable from decade to decade than the actual measured SD.

    The next oddity is that the period 1911 to 2000 is the time they used to calibrate the proxies … so as you’d expect, the red line dips down during that period (except for the big jump at the end where the number of datapoints is dropping rapidly). But we don’t see the equivalent in the blue line, the proxy SD per Neukom2014.

    Sigh …
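    The arithmetic behind this check can be sketched with synthetic stand-ins (the proxy matrix below is random placeholder data, not the actual Neukom2014 files): if a reported annual error is the standard error of the mean, SD/√N, then multiplying it back by √N should recover the cross-proxy standard deviation for that year.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)

# Synthetic stand-in for a screened, standardized proxy matrix:
# for each year, a list of proxy values drawn with SD near 1.0.
proxies = {year: [random.gauss(0.0, 1.0) for _ in range(30)]
           for year in range(1000, 2001)}

# Suppose the reconstruction reports a standard error for each annual
# mean. If that error is SD/sqrt(N), multiplying it by sqrt(N) should
# recover the cross-proxy standard deviation for that year.
reported_se = {y: stdev(v) / sqrt(len(v)) for y, v in proxies.items()}
implied_sd = {y: reported_se[y] * sqrt(len(proxies[y])) for y in proxies}
direct_sd = {y: stdev(v) for y, v in proxies.items()}

# For standardized proxies both averages should sit near 1.0; a wide
# or suspiciously smooth gap between the two series is the oddity
# described above.
print(round(mean(implied_sd.values()), 3), round(mean(direct_sd.values()), 3))
```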


    • Steve McIntyre
      Posted Apr 1, 2014 at 6:28 PM | Permalink

      I’ve collated information from the PDF SI, plus some other details, into digital form.

      • Willis Eschenbach
        Posted Apr 1, 2014 at 7:02 PM | Permalink

        Many thanks, Steve.


        • Willis Eschenbach
          Posted Apr 1, 2014 at 7:07 PM | Permalink

          Which reminds me, I forgot to mention, the spreadsheet I used for the first graph above is here. It contains the screened and unscreened proxies, along with the summary data about the reconstruction.


      • Willis Eschenbach
        Posted Apr 1, 2014 at 8:37 PM | Permalink

        Steve, I’ve been able to correlate their screened proxies with your info with the following exceptions. The numbers are the order of the screened proxies in their data:

        1 Lake Challa Kenya
        3 Laguna Pumacocha
        9 Law Dome sea salt
        42 Ifaty Madagascar
        66 Eastern NSW
        104 GBR precip recon rec17

        I do like the fact that they have no less than 7 unpublished proxies in the mix …


        Steve: there seems to be one extra Ifaty series. There are two Ifaty_1, two Ifaty_4 and one Ifaty.

      • Posted Apr 4, 2014 at 5:13 AM | Permalink

        I’ve put my usual active viewer for the Neukom proxies here. There’s also a CSV file for the metadata in Table 5 of the SI.

        • Jim Z
          Posted Apr 6, 2014 at 11:07 PM | Permalink

          Nick, this may be tardy, but thank you for posting the active viewer.

  34. RalphR
    Posted Apr 1, 2014 at 11:29 PM | Permalink

    So, not only is there lipstick on the pig, but they picked the most attractive pig and gave it cosmetic surgery too. Then they put it in a big room with a bunch of real oinkers and told their algorithm to pick out the nicest.
    I’m just a lay person, but I wonder whether the FDA and the drug companies aren’t at least on to something with their double- or triple-“blind” requirements. Yes, the combined-endpoint type of study we get is misleading, but just imagine what an uproar there would be if it were discovered that they had adopted a “pseudo documentary” approach in their analysis. Oh wait, do they maybe do that?

  35. Beta Blocker
    Posted Apr 3, 2014 at 9:35 AM | Permalink

    Steve McIntyre ….. Wagenmakers observes that “data torture” can occur in many ways. He is particularly critical of the ad hoc and ex post techniques that authors commonly use to extract “statistically significant” results from unwieldy data. Ex post screening is an example of data torture. Wagenmakers urges that, for “confirmatory analysis”, authors be required to set out a statistical plan in advance and stick to it. He acknowledges that some results may emerge during analysis, but finds that such results can only be described as “exploratory”.

    What the paleostatisticians might do to clear up all these methodological issues in a clean, precise, and reproducible way would be to combine all of their various innovative methods into a paleoclimate statistical analysis package under the umbrella of the new DataTorture series of R functions.

    For example:

    DataTorture.Rack with optional control parameters Stretch, Twist, etc.

    DataTorture.Regularize with optional control parameters Invert, Expand, Contract, Bloat, etc.

    DataTorture.Clamp with optional control parameters Squeeze, Squish, Spike, etc.

    DataTorture.Confess with optional control parameters Review, Redact, Obfuscate, Restate, Regurgitate, and of course, the ever useful Robust and Unprecedented.

    The possibilities for a DataTorture R package are endless.

    Could the National Science Foundation be persuaded to award a grant to the paleoscience community for writing this new statistical analysis package?

    • John A
      Posted Apr 3, 2014 at 3:58 PM | Permalink

      Congratulations! That made me laugh.

      • Beta Blocker
        Posted Apr 3, 2014 at 7:44 PM | Permalink

        Re: John A (Apr 3 15:58),

        Ooh, ooh, there’s another possible new R function for the paleoclimate statistical analysis package that I neglected to list:

        DataTorture.Conform with optional parameters Copy, Shape, Skew, Distort, Rotate, Scale, Revise, and Reprint.

        • Jeff Norman
          Posted Apr 4, 2014 at 7:39 AM | Permalink

          Would it be possible to derive something for the diminutive Mr. Ward, say DataTorture.Spin?

3 Trackbacks

  1. […] at Climate Audit, Steve McIntyre is engaged in the slow public defenestration of the latest multi-proxy extravapalooza, a gem of a paper yclept “Inter-hemispheric […]

  2. […] […]

  3. […] Neukom and Gergis serve cold screened spaghetti  Climate Audit […]
