Reflecting on then-current scandals in psychology arising from non-replicable research, E.J. Wagenmakers, a prominent psychologist, blamed many of the problems on “data torture”, attributing them in large part to ex post selection of methods. In today’s post, I’ll show an extraordinary example of data torture in the PAGES2K Australasian reconstruction.
Wagenmakers on Data Torture
In the first article, Wagenmakers observed that psychologists did not define their statistical methods before examining the data, creating a temptation to tune the results to obtain a “desired result”:
we discuss an uncomfortable fact that threatens the core of psychology’s academic enterprise: almost without exception, psychologists do not commit themselves to a method of data analysis before they see the actual data. It then becomes tempting to fine tune the analysis to the data in order to obtain a desired result—a procedure that invalidates the interpretation of the common statistical tests. The extent of the fine tuning varies widely across experiments and experimenters but is almost impossible for reviewers and readers to gauge.
Some researchers succumb to this temptation more easily than others, and from presented work it is often completely unclear to what degree the data were tortured to obtain the reported confession.
It is obvious that Wagenmakers’ concerns are relevant to paleoclimate, where ad hoc and post hoc methods abound and where some results are more attractive to researchers.
Gergis et al 2012
As is well known to CA readers, Gergis et al did ex post screening of their network by correlation against their target Australasian regional summer temperature, reducing the network from 62 series to 27. Climate blogs have long criticized ex post screening as a bias-inducing procedure - a bias that is obvious, but which has been largely neglected in the academic literature, where the issue has for the most part been either ignored or denied by specialists.
Gergis et al 2012, very unusually for the field, stated that they intended to avoid screening bias by screening on detrended data, describing their screening process as follows:
For predictor selection, both proxy climate and instrumental data were linearly detrended over the 1921-1990 period to avoid inflating the correlation coefficient due to the presence of the global warming signal present in the observed temperature record. Only records that were significantly (p<0.05) correlated with the detrended instrumental target over the 1921-1990 period were selected for analysis. This process identified 27 temperature-sensitive predictors for the SONDJF warm season.
Unfortunately for Gergis and coauthors, that’s not what they actually did: their screening was done on undetrended data. When screening was done in the described way, only 8 or so proxies survived. Jean S discovered this a few weeks after the article’s publication on May 17, 2012. Two hours after Jean S’ comment at CA, coauthor Neukom notified Gergis and Karoly of the problem.
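To see why the detrended/undetrended distinction matters so much, here is a minimal sketch (with synthetic, hypothetical data - not the actual proxy series) of why a shared warming trend inflates a raw correlation relative to a detrended one:

```python
import math
import random

def detrend(xs):
    """Remove an OLS linear trend, returning the residuals."""
    n = len(xs)
    t = list(range(n))
    tbar = sum(t) / n
    xbar = sum(xs) / n
    slope = sum((ti - tbar) * (xi - xbar) for ti, xi in zip(t, xs)) / \
            sum((ti - tbar) ** 2 for ti in t)
    return [xi - xbar - slope * (ti - tbar) for ti, xi in zip(t, xs)]

def pearson(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n = 70  # 1921-1990 calibration period
trend = [0.02 * i for i in range(n)]
# temperature and "proxy" share the trend but have independent noise,
# i.e. the proxy has no real interannual temperature signal
temp  = [t + random.gauss(0, 0.3) for t in trend]
proxy = [t + random.gauss(0, 0.3) for t in trend]

r_raw = pearson(proxy, temp)                       # inflated by the common trend
r_det = pearson(detrend(proxy), detrend(temp))     # near zero
```

Here a trend-only "proxy" shows a strong raw correlation but essentially none after detrending - which is precisely why a detrended test is the stricter screen.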
Gergis and coauthors, encouraged by Gavin Schmidt and Michael Mann, attempted to persuade the Journal of Climate editors that they should be allowed to change the description of their methodology to what they had actually done. However, the editors did not agree, challenging the Gergis coauthors to show the robustness of their results. The article was not retracted. The University of Melbourne press statement continues to say that it was published on May 17, 2012, but has been submitted for re-review (and has apparently been under review for over two years now.)
The PAGES2K Australasian network is the product of the same authors. Its methodological description is taken almost verbatim from Gergis et al 2012, and its network is substantially identical to the Gergis 2012 network: 20 of 27 Gergis proxies carry forward to the P2K network. Several of the absent series are from Antarctica, which is covered separately in P2K. The new P2K network has 28 series, now including 8 series that had previously been screened out. The effort to maintain continuity even extended to keeping proxies in the same order in the listing, with new series inserted into the precise empty spaces left by departing series.
Once again, the authors claimed to have done their analysis using detrended data:
All data were linearly detrended over the 1921-1990 period and AR(1) autocorrelation was taken into account for the calculation of the degrees of freedom.
This raises an obvious question: in the earlier test using detrended data, only a fraction of the proxies passed. So how did they pass a detrended test this time?
Read their description of P2K screening and watch the pea:
The proxy data were correlated against the grid cells of the target (HadCRUT3v SONDJF average). To account for proxies with different seasonal definitions than our target SONDJF season (for example calendar year averages) we calculate the correlations after lagging the proxies for -1, 0 and 1 years. Records with significant (p < 0.05) correlations with at least one grid-cell within a search radius of 500 km from the proxy site were included in the reconstruction. All data were linearly detrended over the 1921-1990 period and AR(1) autocorrelation was taken into account for the calculation of the degrees of freedom. For coral record with multiple proxies (Sr/Ca and δ18O) with significant correlations, only the proxy record with the higher absolute correlation was selected to ensure independence of the proxy records.
Gergis et al 2012 had calculated one correlation for each proxy, but the above paragraph describes ~27 correlations per proxy: three lag periods (+1, 0, -1) by nine gridcells (not just the host gridcell, but also the W, NW, N, NE, E, SE, S and SW gridcells, all of which would be within 500 km on my reading of the above text). The other important change is from testing against a regional average to testing against individual gridcells, which, in some cases, are not even in the target region.
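The effect of taking the best of ~27 tests can be illustrated by simulation. The sketch below idealises the 27 correlations as independent null test statistics (an assumption on my part - neighbouring gridcells are spatially correlated, so the true effective number of tests is smaller), and asks how often a pure-noise "proxy" passes a nominal one-sided 95% screen:

```python
import random

random.seed(1)
N_PROXIES, N_TESTS = 2000, 27   # 9 gridcells x 3 lags per proxy
CRIT = 1.645                    # one-sided 95% point of a normal test statistic

passes = 0
for _ in range(N_PROXIES):
    # pure-noise proxy: 27 independent null test statistics
    stats = [random.gauss(0, 1) for _ in range(N_TESTS)]
    if max(stats) > CRIT:       # pass if ANY of the 27 tests is "significant"
        passes += 1

rate = passes / N_PROXIES
# under independence, roughly 1 - 0.95**27 ≈ 75% of noise proxies pass
```

Even allowing for spatial correlation among the gridcells, a screen that accepts the best of many tries passes noise far more often than the nominal 5%.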
Gergis’ test against multiple gridcells takes the peculiar Mann et al 2008 pick-two methodology to even more baroque lengths. Thinking back to Wagenmakers’ prescription of ex ante methods, it is hard to imagine Gergis and coauthors proposing ex ante that they test each proxy against nine different gridcells for “statistical significance”. Nor does it seem plausible that much “significance” can be placed on a higher correlation from a contiguous gridcell, as compared to the actual gridcell. It seems evident that Gergis and coauthors were doing whatever they could to salvage as much of their network as possible, and that this elaborate multiple screening procedure was simply a means to that end. Nor does it seem reasonable to data mine after the fact for “significant” correlations across three different lag periods, including one in which the proxy leads temperature.
Had the PAGES2K coauthors fully discussed the background and development of this procedure from its origin in Gergis et al 2012, it is hard to believe that a competent reviewer would not have challenged them on this peculiar screening procedure. Even if such data torture were acquiesced in (which is dubious), it should have been mitigated by adjusting the t-statistic benchmark to account for the repeated tests: with 27 draws, the odds of obtaining a value that is “95% significant” obviously change dramatically. When the draws are independent, there are well-known procedures for doing so. Under a Bonferroni correction with 27 “independent” tests, the t-statistic for each individual test would have to exceed qt(1-0.05/27, df) rather than qt(1-0.05, df). For typical detrended autocorrelations, the df is ~55, which changes the benchmark t-statistic from ~1.7 to ~3.0. The effective number of independent tests is less than 27 because of spatial correlation, but even if it were as few as 10, the benchmark t-statistic still rises to ~2.7. All this is without accounting for the initial consideration of 62 proxies - something else that ought to be reflected in the t-test.
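The Bonferroni arithmetic is easy to verify. The sketch below uses a normal approximation from the Python standard library rather than the Student-t quantile `qt(..., df=55)` quoted above, so its values run slightly below the t-based benchmarks (~2.90 versus ~3.0 for 27 tests):

```python
from statistics import NormalDist

def bonferroni_threshold(alpha=0.05, m=27):
    """One-sided critical value after a Bonferroni correction for m tests.
    Normal approximation; with ~55 degrees of freedom the Student-t
    values are slightly higher (~3.0 for m=27, ~2.7 for m=10)."""
    return NormalDist().inv_cdf(1 - alpha / m)

single  = NormalDist().inv_cdf(1 - 0.05)  # one test: ~1.64
multi   = bonferroni_threshold()          # 27 tests: ~2.90
multi10 = bonferroni_threshold(m=10)      # 10 effective tests: ~2.58
```

The point survives any reasonable allowance for spatial correlation: once the screen involves dozens of tries per proxy, the benchmark for "significance" nearly doubles.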
While all of these are real problems, the largest problem with the Neukom-Gergis network is grounded in the data itself: the long ice core and tree ring series don’t have a HS shape, but there is a very strong trend in coral δ18O data after the Little Ice Age, and especially in the 20th century. Splicing the two dissimilar proxy datasets yields hockey sticks even without screening. Such splicing of unlike data under the guise of a “multiproxy” reconstruction has been endemic in paleoclimate since Jones et al 1998 and is underdiscussed - a topic that I plan to return to.
There are other peculiarities in the Gergis dataset. Between Gergis et al 2012, PAGES2K and Neukom et al 2014, numerous proxies are assigned to inconsistent calendar years. If a proxy is assigned to a calendar year inconsistent with that of its corresponding temperature series, the calculated correlation will understate the true correlation. Some of the low detrended correlations of Gergis et al 2012 appear to have arisen from such errors in proxy year assignment. I noticed this with Oroko, which I analysed in detail: given its splicing of instrumental data, it ought to pass a detrended correlation test, and its failure to do so therefore requires close examination.
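A small synthetic sketch (hypothetical data, not the Oroko series) shows how severe the effect of a one-year misassignment can be on an interannual (detrended-style) correlation:

```python
import random

def pearson(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    sxx = sum((x - xb) ** 2 for x in xs)
    syy = sum((y - yb) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(2)
n = 70
# interannual temperature variability (little year-to-year persistence)
temp = [random.gauss(0, 1) for _ in range(n)]
# a proxy that genuinely tracks temperature in the correct year
proxy = [t + random.gauss(0, 0.5) for t in temp]

r_aligned = pearson(proxy, temp)
# mis-assigning the proxy's calendar years by one year offsets the series
r_shifted = pearson(proxy[1:], temp[:-1])
```

With little year-to-year persistence in the target, a one-year offset collapses an otherwise strong correlation toward zero - enough to make a genuinely temperature-sensitive proxy fail a detrended screen.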