GISS does not provide a coherent archive of station data (such as is available at GHCN). As reported before, Nicholas developed a technique for downloading station data within R. Downloading the entire data set (which takes a minute or so from GHCN on a high-speed network) is laborious but automatic. I did this in the background and after about 8 hours had downloaded half of one version. In the middle of this, the program stopped working, and it was hard to figure out why. When I went back and tried to do things line by line, I found that it wasn’t reading. Now there had been a few missing records which caused my read program to fail and to require restarting at the next record (this could be fixed but it seemed just as easy to restart if it didn’t happen too often). After a while, I checked some records that I’d already downloaded and these failed as well. I wrote to the GISS webmaster wondering about the 403 diagnostic.
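The restart problem with missing records could be handled inside the download loop itself. The following is only a minimal sketch in Python, not the R script that was actually used: `download_all` and its `fetch` argument are hypothetical names, and it assumes that a missing record surfaces as an HTTP error other than 403. It skips records that fail, keeps a list of them, and stops as soon as the server answers 403, since retrying at that point is pointless.

```python
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.error import HTTPError


def download_all(
    station_ids: Iterable[str],
    fetch: Callable[[str], str],
) -> Tuple[Dict[str, str], List[str], bool]:
    """Fetch each station once; skip bad records, stop if access is refused.

    `fetch` is a hypothetical function that returns the text for one station
    and raises HTTPError (or another OSError/ValueError) on failure.
    Returns (data by station id, ids that failed, whether we were blocked).
    """
    data: Dict[str, str] = {}
    failed: List[str] = []
    for sid in station_ids:
        try:
            data[sid] = fetch(sid)
        except HTTPError as err:
            if err.code == 403:
                # Forbidden: the server is refusing access, so stop here.
                return data, failed, True
            failed.append(sid)  # e.g. a missing record; note it and move on
        except (OSError, ValueError):
            failed.append(sid)  # network or parsing problem; move on
    return data, failed, False
```

With something like this, a missing record no longer requires a manual restart, and a 403 is reported immediately rather than discovered by going through the program line by line.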
Robert Schmunk promptly replied that my attempting to “scrape” data from their website constituted an “obvious and blatant violation” of their rules as set out in their robots.txt file and that they had accordingly blocked access.
Although you did not provide any further details about your problem, I will assume that you are the person on the cable.rogers.com network who has been running a robot for the past several hours trying to scrape GISTEMP station data and who has made over 16000 (!) requests to the data.giss.nasa.gov website.
Please note that the robots.txt file on that website includes a list of directories which any legitimate web robot is _forbidden_ from trying to index. That list of off-limits directories includes the /work/ and /cgi-bin/ directories.
Because the robot running on the cable.rogers.com network has rather obviously and blatantly violated those rules, I placed a block on our server restricting its access to the server.
If you are indeed the person who has been running that particular web robot, and if you do need access to some large amount of the GISTEMP station data for a scientific purpose, then you should contact the GISTEMP research group to explain your needs. E-mail addresses for the GISTEMP research group are located at the bottom of the page at http://data.giss.nasa.gov/gistemp/
I wrote back to Schmunk stating that I was not using a “web robot” but was downloading data for legitimate scientific purposes, as follows:
I have been attempting to collate station data for scientific purposes. I have not been running a robot but have been running a program in R that collects station data.
However, even after I confirmed that this was not a web robot and was data access for scientific purposes, NASA GISS did not remove the block (which applies to many webpages besides the GISTEMP data that I was downloading).
I wrote to Reto Ruedy of NASA GISS this morning as follows:
Dear Dr Ruedy, I have been unable to locate an organized file of station data as used by GISS (such as is available from GHCN). In the absence of such a file, I attempted to download data on all the stations using a script in R. This was laborious as it required multiple calls. I was not using a “web robot” nor was I indexing files. During the course of this, your webmaster blocked my access to the site claiming that downloading the data in this fashion violated your policies. Would you please either restore my access to the site or provide me with an alternative method of downloading the entire data set of station data in current use. Thank you for your attention, Steve McIntyre
We’ll see what happens.
UPDATE: After a series of emails, GISS agreed to allow me to continue operating my download program exactly how I’d been doing it (after hours.) I posted the correspondence below, but here is a collection. I sent a further email to Schmunk noting that I was blocked from the pages identifying the email addresses of contact persons.
I am blocked from access to the page where the email addresses are located.
How can I download the data then?
The GISS webmaster replied:
Good point. That was foolish of me to suggest checking a page on which access had been turned off. I have turned off the restriction that I added to the server on data.giss.nasa.gov last night so that you can access the GISTEMP page and view the contact information.
> I have been attempting to collate station data for scientific purposes. I have not been running a robot but have been running a program in R that collects station data.
It is an automated process scraping content from the website, and if that isn’t what a web robot does, then it’s close enough.
The only “notice” of the supposed policy is their robots.txt file. Google, which can surely be regarded as authoritative on web robots, discusses the function of robots.txt files as follows:
A robots.txt file provides restrictions to search engine robots (known as “bots”) that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages.
My program was obviously not “crawling the web”, but was downloading specific data from GISS.
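For readers unfamiliar with the mechanism, here is a minimal sketch of how a program can consult robots.txt rules before requesting a page, using Python's standard `urllib.robotparser`. The rules below are an assumption: they are reconstructed only from the two directories named in the webmaster's email, not copied from the actual GISS file, and the user-agent name and example paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, reconstructed from the two directories named in the
# webmaster's email; the real robots.txt is not quoted in this post.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /work/
Disallow: /cgi-bin/
"""


def is_allowed(path: str, user_agent: str = "station-downloader") -> bool:
    """Return True if the assumed rules permit `user_agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ASSUMED_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical paths, for illustration only.
print(is_allowed("/gistemp/"))           # a page outside the listed directories
print(is_allowed("/work/station.txt"))   # a page under /work/
```

Nothing enforces the check: a robots.txt file is a request that a program may or may not look at, which is why the block had to be applied on the server.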
The webmaster also said separately:
Please contact the GISTEMP group and inquire if they are willing to provide you with the dataset(s) from which the website applications extract information.
If they are not (I do not know what their current policy on this), then you can go a step closer to the source and obtain station data from the same location that the GISTEMP group obtains the original “raw” datasets that they work from. That is the Global Historical Climatology Network at http://www.ncdc.noaa.gov/oa/climate/ghcn-monthly/index.php
I’m not sure which specific files from the GHCN site are used. But if the complete GISTEMP data are not available then perhaps Dr. Ruedy of the GISTEMP group could give you some tips on how to use the GHCN data.
I had already contacted the GISTEMP group. I replied that I wasn’t interested in “tips” on how to access GHCN data and re-iterated my request (copy to Ruedy):
I know how to use the GHCN data. I’m not interested in “tips” on how to use it.
I’m interested in the versions as used by GISS. The GHCN version is convenient to download and I see no reason why GISS versions should not be available on equivalent terms.
In response to my initial request to Ruedy about access to the data, I received the following reply, which described the station data as “scratch pads”, asked why I needed the data, and undertook to “try” to provide the necessary data (which was already online).
Our main focus working with observed data is creating a gridded set of temperature anomalies which gives reasonable means over comparatively large regions – the global mean average being one of the major goals. If you are interested in individual stations, you are much better off working directly with the GHCN data.
Our station data are really intermediate steps to obtain a global anomaly map, and are not to be viewed as an end result. A modified time series for a particular location may be more representative for the surrounding region than for that particular location. So it is important to use these data in the proper context.
All our publications and investigations deal with regional temperature anomalies and that is the only use these data are good for after our modifications.
If you still think that downloading our “scratch pads” is important to your investigations, please let me know exactly what stage after the raw GHCN data you need and maybe an indication why you need it, and I’ll try to provide you with the necessary data.
Again, we are not trying to compete with GHCN as provider of station data; we are using their data for a very specific project and we made – perhaps unwisely – some of our tools that we used to test the various steps of our process available on the web.
I promptly responded that I was interested in the data as it was available to the public, and asked for a copy of the program by which they generated their data from GHCN data (on the basis that the size of the file could not be held to be a consideration for this request):
Dear Reto, in that case, could you provide me with copies of the programs that you use on the GHCN data so that I can replicate these calculations for myself? Thanks, Steve McIntyre
In answer to your question, I’m interested in the data as it is presented to the public. All I was doing was downloading the data that is supposedly available to the public, but in a way that would not take 4 weeks of manual labor. If your version differs from the GHCN version, I’m interested in downloading your version so that I can assess the differences.
Later in the day, instead of providing a coherent file of the data or the source code, Ruedy said that I could continue downloading the data in the way that I had commenced, asking me to do so after hours or on weekends. (My original download was being done after hours and was interrupted at 11.30 at night; so they were asking me to observe a condition that had not been a problem in my initial download attempt.)
After a short meeting with Dr. Hansen, we were advised to let you download whatever you want as long as generally accepted protocols are observed. Please try to do so at a time that does not impact other users, i.e. late nights, weekends.
What we did with the GHCN data is carefully documented in the publications listed on our website. We are not creating an alternate version of the GHCN data, we are mainly combining their data in various steps to create our anomaly maps.
Reto A. Ruedy
I replied to Ruedy thanking him for this and politely re-iterating my request for code:
Thank you for this. I will observe this condition.
I realize that you have provided some documentation of what you did. In econometrics, it is a condition of publication in journals that authors archive their code and data so that their results can be routinely replicated. I realize that no such standards apply to climate science. However, equally, there is no prohibition on individual climate scientists voluntarily adopting these best practice standards. In that spirit, I would appreciate it if I could inspect the code used to process the GHCN data. Thanks, Steve McIntyre
Ruedy replied not entirely cordially:
The block has been lifted as far as I know. As far as I’m concerned, this is the end of our correspondence.
I resumed downloading after hours and have finished one data version. While I was downloading, I tested browser access to the GISS site to see if the R-downloading program interfered with access to the GISS site by others and invited any readers online to verify this. I experienced no service degradation whatever when I attempted browser access simultaneous with R download, nor did another reader who tested it simultaneously.
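The “late nights, weekends” condition is easy to build into a download script. The sketch below is in Python rather than R, and every name and number in it is an assumption: the 22:00 to 06:00 window is one possible reading of “late nights”, and the time is taken to be whatever clock the caller passes in. `is_off_peak` decides whether a given moment qualifies, and `paced` spaces out successive requests.

```python
import time
from datetime import datetime
from typing import Callable, Iterable, Iterator


def is_off_peak(now: datetime, start_hour: int = 22, end_hour: int = 6) -> bool:
    """True on weekends, or on weekdays from start_hour until end_hour."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return now.hour >= start_hour or now.hour < end_hour


def paced(
    items: Iterable[str],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield items one at a time, pausing between consecutive items."""
    first = True
    for item in items:
        if not first:
            sleep(delay_seconds)
        first = False
        yield item
```

A download loop could then iterate over `paced(station_ids, delay_seconds)` and stop, or wait, whenever `is_off_peak(datetime.now())` is false. By this definition my original run, interrupted at 11.30 at night, was already off-peak.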
I have no particular objection to the webmaster blocking access until he was assured that the inquiry was legitimate, or even to his referring the matter to his bosses. I also have no objection to how long GISS took to remove the block. If all climate data access issues were resolved this quickly, it would be great. Reasonable people can differ about whether they would have been so responsive in the absence of blog publicity. I happen to think that the publicity facilitated resolution of the matter, but I can’t prove that they wouldn’t have resolved it anyway. On the other hand, I don’t think that any of my actions were unreasonable.