Obviously I think that R is a great language. But one of the reasons that it’s great is because it’s open source and because of the incredible energy and ingenuity of the packages contributed by the R Community for the use of others.
In a real sense (as opposed to a realclimate sense), this sort of open source philosophy represents what a lot of us thought that climate science would be like (and should be like). I have a story today which shows a small victory for open source philosophy.
As a preamble, I don’t think that any of us were really prepared for stonewalling by leading members of the “Community”, such as Phil Jones:
We have 25 or so years invested in the work. Why should I make the data available to you, when your aim is to try and find something wrong with it. There is IPR to consider.
Most members of the public expected an open source philosophy – the attitude of the R Community as opposed to the attitude of the “Community” [of climate scientists] and particularly the “Team”.
None of us could have expected Michael Mann’s remarkable responses, first to a reporter from the Wall St Journal:
Dr. Mann refuses to release [the source code]. “Giving them the algorithm would be giving in to the intimidation tactics that these people are engaged in,” he says
Or later his truly extraordinary response to a request for his source code from the House Energy and Commerce Committee, who were nonplussed at the above answer. The Committee asked Mann:
Q5: according to The Wall Street Journal, you have declined to release the exact computer code you used to generate your results… (a) Is that correct? (b) What policy on sharing research and methods do you follow? (c) What is the source of that policy? (d) Provide this exact computer code used to generate your results.
Mann’s answers are well worth re-reading. Mann refers to “intellectual property” no less than seven times in his answer, a few of which are quoted here:
My computer program is a piece of private, intellectual property, as the National Science Foundation and its lawyers recognize. It is a bedrock principle of American law that the government may not take private property “without [a] public use,” and “without just compensation.” …
It also bears emphasis that my computer program is a private piece of intellectual property, as the National Science Foundation and its lawyers recognized….
under long-standing Foundation policy, the computer codes referred to by The Wall Street Journal are considered the intellectual property of researchers and are not subject to disclosure.
Even more recently, the National Science Foundation confirmed its view that my computer codes are my intellectual property.
Actually, there’s an interesting argument that the computer code was not actually Mann’s private property but the property of his employer at the time (University of Massachusetts, I presume) – the NSF letter merely said that they did not claim title to the code, but did not opine on Mann’s ownership rights as opposed to those of the university. In a law school examination sense, Mann’s assertion of personal ownership of the code is arguably an assertion of a right inconsistent with the right of the true owner and thereby a tort (of conversion). See CA discussion here.
Notwithstanding the above statements, on July 18, 2005, Gavin Schmidt stated that the “data and computer code are in the public domain”:
These responses emphasise two main points that we have explained in great detail in earlier postings [linking to a Feb 18, 2005 thread] on this site:
1. There is no case for casting doubt on the scientific value and integrity of the studies by Mann et al. – they have been replicated by other scientists, the data and the computer code are available in the public domain (including the actual fortran program used to implement the MBH98 procedure) …
If the data and computer code had always been “available in the public domain, including the actual fortran program used to implement MBH98”, it’s hard to figure out why Mann would have made such an issue about it. And, of course, the computer code hadn’t always been in the public domain. It had been put online a few days earlier (July 12, 2005) in response to the Committee request. On other occasions when data has been produced after a long campaign, we’ve seen the same pattern of pretending that it had always been there. Gavin Schmidt pulled a similar stunt in connection with non-infilled data for Mann et al 2008. (And BTW, the code wasn’t complete or workable with available data. No one knows to this day how the confidence intervals were calculated or how the principal components were retained.)
Anyway, back to today’s story. A few days ago, I posted on coral calcification, an issue to which I had been referred by Peter Ridd of James Cook University in Australia. Although the article involved linear mixed effects modeling using lmer (a technique with which I’m familiar), this is a relatively complicated methodology and you really need to see the equations (i.e. code) to see what people are doing. I read the SI and couldn’t decode it. With enough time and effort, I might have been able to, but, realistically, if I couldn’t decode it right away – and this was a program that I happened to have experimented with – how many paleoclimate scientists could even get a foothold on it?
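To illustrate why the exact code matters: in lme4, the entire model structure (fixed effects, covariates, random-effects nesting) is carried in a single one-line formula, and prose in an SI rarely pins it down. A minimal sketch of the kind of call at issue; the data frame and variable names here are my own assumptions, not theirs:

library(lme4)
# Hypothetical names: 'calc' is calcification, 'coral' is the data frame,
# with colonies nested within reefs. The random-effects structure lives
# entirely in the term in parentheses:
fit <- lmer(calc ~ year + sst + (1 | reef/colony), data = coral)
summary(fit)

Change the parenthesized term and you have a different model; without the formula itself, two readers can implement the same SI prose quite differently.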
Following my post, Peter Ridd sent an email to the author group requesting actual source code. Peter Doherty gave an answer more or less in keeping with Mann’s “private property” attitude, though couched more pleasantly:
I won’t ask Glenn to provide his R code because I don’t believe that it is public property. If you think that he got the analysis wrong, you have the ability to reanalyse the exact same data set and publish an alternative interpretation of it. I’m sorry that this doesn’t meet your latest request but I hope that our actions to date are consistent with your high expectations of scientific probity.
Peter Ridd forwarded this to me. Needless to say, the “private property” assertion was a red flag for me. In this case, however, adding insult to injury, De’ath had written his programs in R, the quintessential open source language, and, whatever the standards of the [climate science] “Community”, the stance was at odds with that of the R Community. So I wrote:
Dear Dr Doherty,
I am familiar with the lme4 package.
I consulted the SI to De’ath et al and, from that information, I am unable to determine how the lmer models were constructed. Could you therefore please provide a substantially expanded description of your methodology for the construction of your lmer models that is sufficient to permit replication. It would probably be just as easy to provide code.
Aside from whatever obligations you may have under Australian policies or the journal policies of Science, R is an open-source language. Your attitude here is very much in opposition to the R philosophy. You were willing enough to use open source software and your unwillingness to reciprocate will undoubtedly be viewed unfavorably by this community.
Thank you for your attention,
Shortly thereafter I received a squib of code (uploaded to CA here) covering the response to a comment by Ridd et al (but not the original paper), which I’ll discuss on another occasion. De’ath said:
Not sure what your problem is:
From the SOM: “The final models for calcification, extension and density were fitted using the R package mgcv and partial effects plots were used to illustrate the effects of year and SST on calcification, extension and density. For calcification, extension and density, the temporal trends were selected with ~9 df.”
The code for the comparison in the rebuttal of Ridd et al (with the final years omitted etc) is in the zip and the data are in the .RData.
It also produces the graphics.
All very straightforward really.
Stay cool — all’s well in the world of R & Ubuntu :-)
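For what it’s worth, the SOM passage quoted in his email maps onto mgcv in a fairly direct way. A minimal sketch of my reading of it; the data frame and variable names are my assumptions, not De’ath’s actual code:

library(mgcv)
# Hypothetical names. Smooth temporal trend with basis dimension k = 10,
# i.e. ~9 df as described in the SOM:
fit <- gam(calcification ~ s(year, k = 10), data = coral)
# Partial effects plot of the fitted smooth
plot(fit, shade = TRUE)

Even granting that, the sketch above is a guess at the specification; the point of releasing code is that nobody should have to guess.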
I liked his closing line. I’ve had limited success in getting access to code – the two main exceptions (Mann’s MBH98 and Hansen’s GISTEMP) both came under exceptional circumstances, more or less over their dead bodies and only after enormous adverse publicity. Appeals to journal policies, scientific mores and funding agencies have usually been unsuccessful.
But the mention of the R philosophy worked. There’s a moral here: the Community needs to learn from the R Community.
UPDATE: The concession was only partial. I wrote to De’ath asking him for the code for the actual article, rather than the code for his rebuttal to Ridd. He replied:
OK — one final go.
(1) You’ve got the code for the purely temporal analysis as part of the code I previously sent you.
That, together with the previously sent code, addresses ALL Ridd et al issues with the analysis.
The lmer analyses (of which ALL the outputs are in the SOM, and hence ALL the models are stated as part of that output) are used (as stated in the SOM) to identify the models in terms of (a) the random effects structure and (b) which of the spatial & temperature covariates are included for the second analysis.
To quote the SOM:
“In this section the details of the analysis for data set 1900 – 2005 that included SST and spatial predictors are presented in detail. Similar models and fitting procedures were followed for the purely temporal analyses of the data 1900 – 2005 (all colonies) and the data 1572 – 2001 (10 colonies), but details are omitted as the procedures were simpler since no selection of predictors was required.”
Couple that with (also from the SOM):
“The final models for calcification, extension and density were fitted using the R package mgcv and partial effects plots were used to illustrate the effects of year and SST on calcification, extension and density. For calcification, extension and density, the temporal trends were selected with ~9 df.”
Then you have the answer!
** You’ve got the code for the temporal analysis **
I have had several discussions with R/statistically trained people and they seem to get it fine.
The code previously sent read as follows:
# Fit gamms to all data (.2000) and data (.2000.x) with final years for 2004 & 2005 removed
# Fit gamms with common smoothness df=9 (i.e. k=10) — round up largest df from above fits
# Plot the fits for the mean trends
# Use overall mean of all corals >1900 for centering
# More conservative (i.e. gives same absolute but smaller % change)
# short term average used in MS
This was actually quite helpful, as it shows the actual equation that they used. I don’t know why he wouldn’t send the corresponding code for the other parts of the article; as far as I can tell, withholding it accomplishes nothing except to make it more difficult to see what he did. With some time and effort, I can probably sort it out, but with the code, I could do it at lightning speed. The purpose of providing code in econometrics – as nicely articulated by Bruce McCullough – is to improve the efficiency of verification studies. The more trouble that you put people to, the less likely it is that anyone will verify the work. How hard would it have been for De’ath to send the code out? He’s already spent more time corresponding about not doing it than it would have taken to send it.
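For concreteness, the centering convention those comments gesture at might look something like the following; this is my reading of the comments, with hypothetical data and variable names, not De’ath’s code:

# Center on the overall mean of all corals after 1900 (per the comments),
# then express change as a percentage of that mean
base <- mean(coral$calc[coral$year > 1900], na.rm = TRUE)
coral$calc_centred <- coral$calc - base
coral$calc_pct <- 100 * (coral$calc - base) / base

Trivial arithmetic, to be sure, but exactly the sort of convention that determines whether a replication ties out to the figures in the article.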
On a scale of 1 to Lonnie Thompson, this is not a particularly bad case. The online data doesn’t tie in to the article, but the authors provided the data to Peter Ridd on request. The SI is pretty decent as these things go. After a first refusal, they provided a squib of code, but only grudgingly and incompletely. Annoying, but nothing like Lonnie Thompson.