GISS has a large collection of station data, both adjusted and unadjusted. Unlike many data archives, GISS does not let you extract the entire data set from a single archive, nor does it provide permanent files for individual stations. You can obtain digital data for individual stations, but only through a laborious process of manually clicking through several links for each station, copying the result to a text file and then working with the saved file.
Until now.
CA reader Nicholas, who had previously developed a nice method for reading zip files, has provided me with a very pretty method of reading GISS station data into R objects. Some of his programming techniques were new to me and I thought they might be of interest to other readers, especially since some of his methods can be adapted to slightly different tasks.
Don’t try to cut-and-paste code from the WordPress version here. Use the ASCII version here. I’m going to massage this a little and will make up a collection of similar retrieval scripts (which I’ll link here as well).
First, let’s go through the process of how one downloads GISS station data.
1. First you go to the GISTEMP page http://data.giss.nasa.gov/gistemp/station_data/
2. Scroll down and choose the option “raw data +USHCN corrections” – later we’ll worry about the other data sets.
3. Then go to the box asking for a site name, insert Tarko for Tarko-Sale, which we’ve been considering, and click Search.
4. You will get 3 Tarko-Sale versions. Pick one of the versions and click that.
5. You get a graph showing the selected version in blue and the alternatives in dashed and dotted black.
6. If you scroll down, there’s a link to monthly data. Click that and you get a nice ASCII text file.
7. Save the text file for later use.
There’s no point saving the URL; while it looks like something that you might be able to insert into an R read command, the URL is temporary and, if you re-visit it a few minutes later, it no longer returns the ASCII file.
Nicholas has two main functions here for extracting data. Nicholas’ functions are here. I tried to post a line-by-line discussion of the code, but the interaction of R script and WordPress coding proved difficult to reconcile and not worth the trouble. So I’ve deleted much of the script from the post; interested parties can consult the link in ASCII format. The first function, download_html, is a utility which saves the HTML for a URL as a text file. The second function, download_giss_data, returns the ASCII file obtained through the steps above. So a call like this returns a usable file:
my_data <- download_giss_data("Tarko");
my_alt_data <- download_giss_data("Tarko", 2);
Nicholas has recently modified this a little to work with station numbers as well. There’s also a little editing that I need to do to make time series of the results and to collect the results if there are several variations, but this is low-level programming that I can do easily and will do in the near future. Anyway here’s how he did this, for future reference and because some of the methods are transferable to other problems.
First, Nicholas replicated the URL for the first search, pasting the station name into the search string. Here he uses the R functions paste and gsub, both of which I use a lot (I knew this step). The paste function is a very nice way of parameterizing names and R users should be familiar with it. If you copy giss_url into a browser, you get the first search page of the manual procedure.
giss_url <- paste("http://data.giss.nasa.gov/cgi-bin/gistemp/findstation.py?datatype=gistemp&data_set=0&name=", gsub(" ", "%20", station_name), "&submit=1", sep = ""); giss_url
# [1]"http://data.giss.nasa.gov/cgi-bin/gistemp/findstation.py?datatype=gistemp&data_set=0&name=Barabinsk&submit=1"
Next Nicholas uses his function download_html to retrieve the html version of the webpage.
my_data <- download_html(giss_url);
# html page mentioning 3 ids: 222296120000 222296120001 222296120002
He uses this function on a few occasions and I’ll work through it the first time here in baby steps. The first step is to use the R function download.file to retrieve the contents of giss_url. I hadn’t used download.file before, but Nicholas uses it a lot. This copies the HTML version of the webpage to a temporary file.
url <- giss_url; download.file(url, "temp.html");
Then Nicholas used the R function file to open a connection to the temporary file “temp.html”; “rt” is a parameter meaning the connection is opened for reading in text mode. (I hadn’t used this function and am not familiar with it.) Its value is the connection.
html_handle <- file("temp.html", "rt"); html_handle
Next Nicholas used the function readLines to read the file as text, then closed the connection and unlinked (deleted) the temporary file. I use readLines all the time. What we have now is the HTML text as a character vector in R, which you can inspect by typing html_data; it is 204 lines long.
html_data < - readLines(html_handle);
close(html_handle);
unlink("temp.html");
length(html_data) #204
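Putting those baby steps together, a download_html along the following lines would do the job. This is my sketch of the idea rather than Nicholas’s exact code (his file linked above is the authoritative version), and it uses tempfile(), which Nicholas himself suggests later in the thread, rather than a hard-coded “temp.html”:
download_html <- function(page_url) {
	tmp <- tempfile()                            #unique temporary file instead of "temp.html"
	download.file(page_url, tmp, quiet = TRUE)   #fetch the page to the temporary file
	handle <- file(tmp, "rt")                    #open a read-only text connection
	html_data <- readLines(handle)               #the HTML as a character vector, one element per line
	close(handle)
	unlink(tmp)                                  #delete the temporary file
	html_data
}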
Now within the 204 lines of HTML code, the salient information – the three station codes – occurs on only a few lines. Nicholas used an interesting trick here. He noticed that the three lines containing the required codes were associated with the distinctive text “gistemp_station.py”, which did not occur on any other lines. Nicholas used the R function grep to identify the three lines containing the station numbers that were the objects of the original search. This is a neat trick when you have to work with HTML text and one that may come in handy in other situations.
urls <- grep("gistemp_station.py", my_data, value=1); urls
#[1] " Barabinsk"
#[2] " Barabinsk"
#[3] " Barabinsk"
Next, Nicholas extracted the codes of interest from these lines and inserted them into new URLs, using the paste and gsub functions.
url < - paste("http://data.giss.nasa.gov", gsub("^ *[^<]*$", "", urls), sep="");url
#[1] "http://data.giss.nasa.gov/cgi-bin/gistemp/gistemp_station.py?id=222296120000&data_set=0&num_neighbors=1"
#[2] "http://data.giss.nasa.gov/cgi-bin/gistemp/gistemp_station.py?id=222296120001&data_set=0&num_neighbors=1"
#[3] "http://data.giss.nasa.gov/cgi-bin/gistemp/gistemp_station.py?id=222296120002&data_set=0&num_neighbors=1"
We get 3 station versions here. We’ll just worry about one version for now, as it’s easy enough to handle multiple versions with a little additional programming. Nicholas then uses the function download_html on the new URL, which corresponds to the page reached at step 5 of the manual process above; this gives us the HTML for that webpage (171 lines).
j <- 1; my_data <- download_html(url[j]); length(my_data) #171
Nicholas uses the same grep trick as above to retrieve the salient code from this HTML. This time, he uses the phrase “monthly data as text” to extract the salient line.
urls[j] <- grep("monthly data as text", my_data, value=1); urls[j]
# an <a href=...> link whose visible text is "monthly data as text"; the tags were stripped by WordPress
Using paste and gsub, this is reshaped into the call for the text file that we saw originally:
"http://data.giss.nasa.gov/work/gistemp/STATIONS//tmp.222296120000.0.1/station.txt"
Next Nicholas opens a connection to this temporary file using the R function url (the function url works here like the function file above) and reads from the connection.
my_handle <- url(url, "rt");
my_data <- read.table(my_handle, skip=1);
dim(my_data) #91 18
Now we have the sought-for ASCII file, which can be manipulated in the usual ways. A little work needs to be done to allow for reading the different versions. The calls use a data set identification of 0, 1 or 2 and this needs to be allowed for. Also we’ll need to think about using the WMO number rather than the name, and about handling multiple versions in a convenient way. But this is relatively easy stuff that I can figure out.
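For what it’s worth, here is a sketch of how these pieces might be wired together to collect every version returned by a search for a given data set. The function name download_all_versions and its internals are my own illustration, not Nicholas’s code; his linked script is the real thing.
download_all_versions <- function(station_name, data_set = 0) {
	#build the search URL; data_set is 0, 1 or 2 as in the URLs shown above
	search_url <- paste("http://data.giss.nasa.gov/cgi-bin/gistemp/findstation.py?datatype=gistemp&data_set=", data_set, "&name=", gsub(" ", "%20", station_name), "&submit=1", sep = "")
	page <- download_html(search_url)
	links <- grep("gistemp_station.py", page, value = TRUE)   #the grep trick again
	ids <- gsub(".*id=([0-9]+).*", "\\1", links)              #pull out the 12-digit station ids
	lapply(ids, function(id) {
		station_url <- paste("http://data.giss.nasa.gov/cgi-bin/gistemp/gistemp_station.py?id=", id, "&data_set=", data_set, "&num_neighbors=1", sep = "")
		station_page <- download_html(station_url)
		txt_line <- grep("monthly data as text", station_page, value = TRUE)
		txt_url <- paste("http://data.giss.nasa.gov", gsub(".*href=\"([^\"]*)\".*", "\\1", txt_line), sep = "")
		read.table(txt_url, skip = 1)                         #read.table accepts a URL directly
	})
}
# e.g. tarko <- download_all_versions("Tarko-Sale") should return a list of data frames, one per version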
The above works for matching names, but it’s usually easier to work with WMO numbers so as not to worry about spelling variations (Karsakpay, Qarsakbay etc.). Nicholas modified the routine here to work with WMO numbers.
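As a usage note, Nicholas mentions in the comments below that the GISS search field also matches station numbers, so an unambiguous id can be passed instead of a name. The exact argument format (WMO number, 11-digit GHCN id or the 12-digit id shown above) depends on his script, so treat this call as hypothetical:
my_data <- download_giss_data("222296120000")   #one of the 12-digit ids returned by the Barabinsk search above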
43 Comments
I have been a lurker for some time
Last nite I downloaded the GISS temps for my hometown (Christchurch, New Zealand), as the government agency NIWA (National Institute of Water and Atmospheric Research) restricts this info
We have a continuous recording since 1905
But here’s the rub
GISS separates the monthly readings into the 4 seasons of the year, i.e. D-J-F, by adding the monthly numbers and dividing by 3. OK so far, but no allowance is made for the number of days in each month
And then they have an annual “ANN” which equals the sum of the 4 seasons divided by 4
Now I humbly ask — Does a climate year start at Dec 1st?
Because this is the effect of the GISS calculation
Probably of little importance, except that here in Christchurch we had an exceptionally cold Dec 06 and this is not counted in the 2006 year
Here’s the URL for a plain text version of my script, which seems to have been a bit mangled by WordPress in the post above:
http://x256.org/~hb/giss2.r
This version also does a few things better: it escapes strings in the URL more correctly, handles station numbers better, and is a bit more robust to any updates GISS make to their HTML pages.
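For reference, base R’s URLencode() does this kind of escaping in the general case; it may or may not be what giss2.r actually uses:
URLencode("Tarko Sale", reserved = TRUE)   #gives "Tarko%20Sale"; escapes spaces and other reserved characters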
Oh, and by the way, the reason I wrote the function to use download.file() to download the HTML, rather than just opening a connection using url(), is that for some reason I get different HTML data back from those URLs using download.file() and url() when I execute the code in Windows.
Obviously using url() would be easier; it would skip the temporary file step. I’m guessing the difference is that R sends different HTTP headers in each circumstance, which for some reason GISS detects and returns different data for. Perhaps it’s to do with the User-Agent string.
I was lazy when I wrote the code to download to temp.html. I should have used tempfile() to get a unique temporary file. Perhaps I’ll update the code a bit later to fix that.
I’ve undone the McIntyre formatting snafu, and I hope I’ve not altered any of the R code in the process.
Please check.
That’s better, but some of the regular expressions are still mangled. Here is the original version, which the above post quotes and is based on:
http://x256.org/~hb/giss.r
Notice the gsub expressions, like gsub("^ *<a href=\"|\">[^<]*</a></td>$"…, contain HTML tags, so have understandably not been reproduced properly. Perhaps you can compare the post to the file and fix up the discrepancies? It’s probably limited to the two complex gsub parameters.
I’m sorry, but posting code on WordPress is a nightmare. I’ve put in a support request to try and resolve this.
Please be patient.
Well, it looks better now, except the problem is the reverse.
Before, we were having trouble with unescaped characters; now, the characters are showing up as escaped sequences in the browser. For example I see both &amp; and &gt; in the code above. I’ve e-mailed Mr. McIntyre about it. Maybe there’s nothing you can do to fix it.
Anyway, for anyone trying to use the script, replace &amp; with & and &gt; with > and hopefully it will work.
Any better?
I’ve re-edited this, entirely removing the most problematic interactions of R script and WordPress script, which pertained to the gsub operations; these were not needed in the post itself.
Re #1,
Neil,
It seems that there is a concept of a ‘meteorological year’
which starts on December 1, and ends on the next November 30,
and contains meteorological quarters of varying numbers of
days according to the lengths of the respective months.
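To make the arithmetic concrete, here is a small R illustration of the convention described above, using made-up numbers; this is just a sketch of the averaging rule, not GISS code:
monthly <- ts(rnorm(24, mean = 10), start = c(2005, 1), frequency = 12)   #toy monthly series, Jan 2005 - Dec 2006
# DJF 2006 uses Dec 2005, Jan 2006 and Feb 2006; each season is a plain 3-month mean,
# with no weighting for the differing number of days in the months
djf <- mean(window(monthly, start = c(2005, 12), end = c(2006, 2)))
mam <- mean(window(monthly, start = c(2006, 3), end = c(2006, 5)))
jja <- mean(window(monthly, start = c(2006, 6), end = c(2006, 8)))
son <- mean(window(monthly, start = c(2006, 9), end = c(2006, 11)))
ann <- mean(c(djf, mam, jja, son))   #the annual "ANN" figure is the unweighted mean of the four seasons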
The ability to use station numbers will be very helpful, as there are
some station names that would refer to several stations, such as
aberdeen, cambridge, columbus, and georgetown. However, the WMO station
number is also ambiguous. Almost half of GHCN stations share WMO station
numbers with one, or more, somewhat nearby stations, for example: 60630,
32165, 87270, and 71068, the latter being shared by six stations. So, it
may be necessary to make use of the full 11 digit GHCN station numbers,
or at least the last 8 digits thereof.
This WordPress document Writing Code in Your Posts may be useful.
“Writing code within a WordPress post can be a challenge … “
Jerry, I haven’t tried it, but I think it will work if you pass the full 11 digit station number. Their search tool seems to search that field as well as the station name. Since that number should be unambiguous, no additional support will be needed in the script, since hopefully only one match will be returned, and therefore that will be the data it downloads.
Nicholas,
Yes, I would expect it to work fine with the 11 digit GHCN
number, and even with a twelve digit number that includes
the source number for those stations, such as Barabinsk, that
have multiple sets of temperature series.
Did you know this? GISS shows a temperature index curve for the New York Central Park station as its poster girl. The starting date of 1900 seems to be cherry-picking. This site also shows the station data from 1820 on, demonstrating – surprise – a cooling, and links to the GISS station data. However, and this is very interesting, the link fails and the GISS station data for New York Central Park begin in 1880, not 1820. It may be – this is my suspicion – that GISS has censored the data prior to 1880. It would be very interesting to find out whether there are data available for this station for the period 1820-1880 or not. If so, GISS is selecting information, which would be clear proof of manipulation.
When I discovered the discrepancy between the official GISS US surface data in 1999 and the one in 2004 (here and here, in German), I wrote an email to GISS. They never answered, but two days later they deleted their 1999 report from the website. Maybe it’s a coincidence, maybe not.
When did you write them? Why don’t you post up your email to them?
It won’t be long before Antarctica shows a warming trend.
Mark
I wrote the email on 12.12.2006.
Dear ladies and gentlemen,
on the GISS website you publish two graphs showing the temperature record of the USA. One graph is from 1999 (here), the other one is from 2005 (here). As both graphs are supposed to show the same thing, with the exception that the new graph contains more data, one would think that they would coincide up to the year 1999. However, this is not the case. In the 2005 graph, the difference between the 1934 temperature and the 1990 temperature is much lower than in the 1999 graph. In the 1999 graph, 1934 was the hottest year of the century, while according to the 2005 graph it was 1998. This seems strange and I would be pleased if you could explain this deviation to me.
Thank you very much
Dirk Paulsen
Dirk,
The change of the GISS US temperature graph occurred in 2000,
and is discussed in another thread http://www.climateaudit.org/?p=1142
although the discussion is rather convoluted.
Briefly, GISS began using USHCN adjusted data, instead of USHCN “raw” data.
Hansen did reply to emails on the subject back in March, 2001.
As for pre 1880 data, GISS stopped using it, but is not censoring it.
It is still available at GHCN, which is where GISS gets the
temperature data that they use.
Correction: The change of the GISS US temperature graph occurred in 2001.
Quick question from a newbie* – regarding the ‘script’ language these scripts are written for: is it TCL, “C” shell (CSH, KSH etc), Python … can someone vector me in and I’ll go from there and get the proper interpreter for my Wintel box …
*Not new to computers, programming, or the net; just new to some of the latter-day scripting languages. 😉 Started out on minis in the ’80s, using them for ‘job’ submission to the corporate mainframe ‘iron’, which was a System/370 (3090 processors at the time doing some 32 MIPS!) with jobs ‘bid’ via JCL under MVS …
R. You can download it from http://www.r-project.org. It’s miraculously good for statistical work and has untold varieties of software (packages).
Very cool … thanks Steve.
Well, it looks like there is more than one way to make a hockey stick. The tree rings didn’t work out, so it looks like some “scientists” are doing it by “adjusting” the temperature data. And if so, it is being done largely by government employees! This is really spooky. Maybe Orwell should have used the date 2014, instead of 1984.
There’s something funny with the GISS web site.
I just tried to run this code in Linux, which I successfully tested in Windows, and it’s not working. Once again, the problem is that depending on which browser I use to visit the same URL at their site, I get a totally different page!
Why would they do that?
I’m trying to find a work-around. Telling R to use Lynx to download files sort of works – it gets the right page, but with the HTML stripped out. However, the Lynx support is described as “legacy” so I probably shouldn’t use it.
What a pain. Why can’t they set up their web server correctly? Or are they actually purposefully trying to make it hard to automatically download their data?
OK, I worked out what was going on: I had downgraded from R 2.4.1 to R 2.2.1 (because that was the version recommended for use with my operating system). I upgraded back to 2.4.1 and now the script works.
It turns out they introduced an “HTTPUserAgent” option in R 2.4.0 which properly identifies the program to the web site it is downloading files from. For some reason GISS returns the wrong data, or forbids access to the data, based upon your user agent.
Here is an updated script which is slightly improved:
http://x256.org/~hb/giss3.r
It uses tempfile() to create a temporary file and uses read.table() to read the data rather than scan(). I also added a comment noting you need at least version 2.4.0, at least in Linux.
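For anyone poking at the user-agent issue by hand, the relevant option can be inspected or overridden in R 2.4.0 and later; the string below is just an example, not anything GISS specifically expects:
getOption("HTTPUserAgent")                                            #what R's internal download method sends as the User-Agent header
options(HTTPUserAgent = paste("R (", getRversion(), ")", sep = ""))   #set it explicitly if a site is fussy about the header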
Nicholas,
Some speculation:
They may be attempting to discourage robotic access in order
to reduce load on their servers. Some bots honor robots.txt
limitations, but some do not, and can constitute a considerable
load on a server. The access denial can be based on the
name of the software attempting the access, and they may have
a list of such names that they presume are robotic. I have
no evidence that this speculation applies in this instance.
Well, maybe that’s what they’re trying to do, but if so they are going about it in a very strange manner.
Firstly, it seems if you don’t provide a user agent at all (e.g. R versions before 2.4.0) then you still get served a page, it’s just the wrong one. (It appears to be the index page, even if your URL is that of a sub-page of some type). As if it’s totally ignoring the URI and query string in that case. Which is a very odd thing to do.
Secondly, I’ve never noticed significant load on any of my web servers from robots, unless someone actively tried to mirror them, in which case there’s nothing stopping them from changing their user agent string. Maybe it varies a lot based on your domain or something. Anyway I guess the bottom line here is that you need to make sure you have a very recent version of R for this to be guaranteed to work, thanks to whatever fiddling they’ve done to their configuration.
Nicholas,
after upgrading to V2.4.1 (Win) and changing line #46 to read
my_data <- read.table( … , header=T, na.strings = c("999.9","999.90","999.99"));
the code worked fine so far – Thank you!
PS:
Could someone please provide an elegant example of how to reshape the resulting data.frame to reflect a monthly time series?
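One way to do it, assuming the 18 columns are YEAR, the 12 monthly means, the 4 seasonal means and the annual mean, and that 999.9 flags missing values; check the layout of your own file first, this is only a sketch:
monthly_matrix <- as.matrix(my_data[, 2:13])        #keep the 12 monthly columns, dropping YEAR and the seasonal/annual columns
monthly_matrix[which(monthly_matrix > 999)] <- NA   #999.9 means missing (redundant if na.strings already caught it)
monthly_series <- ts(as.vector(t(monthly_matrix)), start = c(my_data[1, 1], 1), frequency = 12)   #row by row into one long monthly vector
plot(monthly_series)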
#30. Wolfgang, I’ve posted up a collection of functions which return time series objects for GISS station data (and other archives).
I started downloading and collating all the GISS station data, but about halfway through, my access to the GISS site was blocked. It has taken about 8 hours so far. It’s very slow because there is no organized data set that can be downloaded. They may have assumed it was a robot, but I’ve been blocked by the Team before (Mann blocked me from his FTP site at UVA and Rutherford blocked me from his SI).
Do you need someone else to run a script for you? Perhaps if you break the job up into a few chunks, you could distribute those chunks to us, we could run them, and then you could collate the results into one big set.
We might end up getting blocked too, eventually, but hopefully after we’re each finished 1/8th or 1/4 or some fraction of the sites.
I sent an inquiry to the webmaster. I was indeed blocked. The webmaster sent me a reply in amazingly quick time:
I was not running a robot, but an R script. I’m blocked from the webpage in question so maybe someone can send me the email address.
Nicholas, I’ll email you with what I was running.
Here’s the program that I was running. It failed a couple of times when there was no information and had to be re-started by increasing the start point. I could use the try(…) function to work around this, but it seemed just as easy to occasionally restart it. It’s a slow retrieval process. I’m on a high-speed network and this has taken about 8 hours. The grandkids were over and it was just running in the background.
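For what it’s worth, here is a sketch of the sort of try()-and-throttle wrapper being discussed; this is an illustration rather than Steve’s actual program, and station_list, start_point and the 2-second pause are all made up:
results <- vector("list", length(station_list))                    #station_list: a character vector of station names or ids (hypothetical)
for (i in start_point:length(station_list)) {
	got <- try(download_giss_data(station_list[i]), silent = TRUE)   #a station with no information no longer kills the run
	if (!inherits(got, "try-error")) results[[i]] <- got
	Sys.sleep(2)                                                     #pause between requests so as not to hammer the server
}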
#36 Steve,
On March 5, soon after you posted the original entry, I successfully downloaded all three datasets for all stations using a variation of the process you described. I throttled my requests to a couple every few seconds so that I would not hammer the site (and hopefully wouldn’t get disconnected) and ended up taking 36 hours to grab everything. I figured it was only a matter of time before the multi-step process stopped working, or they started blocking people using your technique, given the obvious lengths that they went to to make downloading difficult.
I would be happy to send you the combined and compressed station data if you like. As a tar.bz2 file it’s a 26 MB archive containing ~27,000 separate files. I originally intended to combine and process it into one file with a WMO number column, but the day job interfered. If that would be helpful, I could try to get to it maybe this weekend.
-rv
#38 sounds good. Can somebody post up the email addresses for the GISTEMP research group so that I can contact them directly as well?
#39 I’m getting a 403 Forbidden error when going to any page under the http://data.giss.nasa.gov/ site from all networks that I have access to, so it looks like they have a server error right now, and it’s not just us being blocked.
How would you like me to get you the data set? Direct e-mail’s out since it’s too big, but perhaps we can switch to e-mail to coordinate an upload?
Let me know,
-rv
Re #39, Steve
Bottom of the page http://data.giss.nasa.gov/gistemp/ :
Contacts
Please address scientific inquiries about the GISTEMP analysis to
Dr. James Hansen.
Please address technical questions about these GISTEMP webpages to
Dr. Reto Ruedy.
Also participating in the GISTEMP analysis are Dr. Makiko Sato and Dr. Ken Lo.
Twit. Not much help if he’s blocked …
Dr. James E. Hansen : jhansen@giss.nasa.gov
Dr. Reto A. Ruedy : rruedy@giss.nasa.gov
Dr. Makiko Sato : makikosato@giss.nasa.gov
Dr. Kwok-Wai Ken Lo : klo@giss.nasa.gov
Re: #36
Steve M:
You asked for e-mail addresses in response to this from the webmaster:
Not exactly true. There are links to their bio pages which do have their e-mail addresses. However, good luck with getting a response from them. I wish thee well.
Contacts
Please address scientific inquiries about the GISTEMP analysis to Dr. James Hansen. (jhansen@giss.nasa.gov)
Please address technical questions about these GISTEMP webpages to Dr. Reto Ruedy (rruedy@giss.nasa.gov)
Also participating in the GISTEMP analysis are Dr. Makiko Sato (makikosato@giss.nasa.gov) and Dr. Ken Lo (klo@giss.nasa.gov)