GISS has a large collection of station data, both adjusted and unadjusted. Unlike many data archives, GISS does not let you extract the entire data set, either from a single archive or from permanent individual files. You can obtain digital data for individual stations, but you have to go through a laborious process of manually clicking on several links for each station, then copying the result to a text file and then working with the saved file.
CA reader Nicholas, who had previously developed a nice method for reading zip files, has provided me with a very pretty method of reading GISS station data into R objects. Some of his programming techniques were new to me and I thought that they might be of interest to other readers, especially since some of his methods can be modified to do slightly different tasks.
Don’t try to cut-and-paste code from the WordPress version here. Use an ASCII version here. I’m going to massage this a little and will make up a collection of similar retrieval scripts (which I’ll link here as well.)
First, let’s go through the process of how one downloads GISS station data.
1. First you go to the GISTEMP page http://data.giss.nasa.gov/gistemp/station_data/
2. Scroll down and choose the option “raw data +USHCN corrections” – later we’ll worry about the other data sets.
3. Then go to the box asking for a site name, insert Tarko (for Tarko-Sale, which we’ve been considering) and click Search.
4. You will get 3 Tarko-Sale versions. Pick one of the versions and click that.
5. You get a graph showing the selected version in blue and the alternatives in dashed and dotted black.
6. If you scroll down, there’s a link to monthly data. Click that and you get a nice ASCII text file.
7. Save the text file for later use.
There’s no point saving the URL; while it looks like something that you might be able to insert into an R read command, the url is temporary and, if you re-insert the URL in a few minutes, it won’t return the ASCII file any more.
Nicholas has two main functions for extracting data; his functions are here. I tried to post up a line-by-line discussion of the code but the interaction of R script and WordPress coding has proved difficult to reconcile and not worth the trouble. So I’ve deleted much of the script from the post and interested parties can consult the link in ascii format. The first function, download_html, is a utility function which saves the html commands for the url as a text file. The second function, download_giss_data, returns the ASCII file obtained through the steps above. So a call like this returns a usable file.
my_data <- download_giss_data("Tarko");
my_alt_data <- download_giss_data("Tarko", 2);
Nicholas has recently modified this a little to work with station numbers as well. There’s also a little editing that I need to do to make time series of the results and to collect the results if there are several variations, but this is low-level programming that I can do easily and will do in the near future. Anyway here’s how he did this, for future reference and because some of the methods are transferable to other problems.
First, Nicholas replicated the url for the first search, pasting the station name into the search string. Here he uses the R functions paste and gsub, both of which I use a lot (I knew this step). The paste function is a very nice way of parameterizing names and R users should be familiar with it. If you copy giss_url into a browser, you get the first search page in the manual procedure.
giss_url <- paste("http://data.giss.nasa.gov/cgi-bin/gistemp/findstation.py?datatype=gistemp&data_set=0&name=", gsub(" ", "%20", station_name), "&submit=1", sep = ""); giss_url
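The same paste/gsub step can be wrapped in a small function so that the station name becomes a parameter. This is only a sketch: the name build_search_url is made up for illustration and is not one of Nicholas' functions. The query string is the one shown above.

```r
# Hypothetical helper (not part of Nicholas' script): build the GISS
# station-search URL for a given station name.
build_search_url <- function(station_name) {
  base <- "http://data.giss.nasa.gov/cgi-bin/gistemp/findstation.py"
  # Spaces are not allowed in a URL, so replace each one with %20.
  encoded_name <- gsub(" ", "%20", station_name, fixed = TRUE)
  paste(base, "?datatype=gistemp&data_set=0&name=", encoded_name,
        "&submit=1", sep = "")
}

build_search_url("Tarko")
build_search_url("Tarko Sale")
```

gsub only handles spaces here. If station names might contain other reserved characters, the base R function URLencode is probably a safer choice for that step.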
Next Nicholas uses his function download_html to retrieve the html version of the webpage.
my_data <- download_html(giss_url);
#html page mentioning 3 ids- 222296120000 222296120001 222296120002
He uses this function on a few occasions and I’ll work through it the first time here in baby steps. The first step is to use the R function download.file to retrieve the contents of giss_url. I hadn’t used download.file before, but Nicholas uses it a lot. This copies the html version of the webpage to a temporary file.
url <- giss_url; download.file(url, "temp.html");
Then Nicholas used the R function file to open a connection to the temporary file “temp.html”; “rt” is the parameter for reading in text mode. (I haven’t used this function and am not familiar with it.) Printing html_handle shows the properties of the connection.
html_handle <- file("temp.html", "rt"); html_handle
Next Nicholas used the function readLines to read the file in text format, then closed the connection and unlinked the temporary file. I use readLines all the time. What we have now is the html text as a text object in R, which you can inspect by typing “html_data”; it is 204 lines long.
html_data <- readLines(html_handle);
close(html_handle); unlink("temp.html");
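Putting the baby steps together, a utility of this kind might look roughly like the sketch below. It is not Nicholas' actual download_html (consult his ASCII script for that); the name download_html_sketch and the use of tempfile() instead of a fixed “temp.html” are my own assumptions. The demonstration reads a small local page through a file:// URL (written here for Unix-style paths) so that it runs without touching the GISS server.

```r
# A sketch only, NOT Nicholas' download_html: fetch a page and return
# its lines as a character vector.
download_html_sketch <- function(page_url) {
  tmp <- tempfile(fileext = ".html")          # temporary file for the page
  on.exit(unlink(tmp))                        # remove it when we are done
  download.file(page_url, tmp, quiet = TRUE)  # copy the page to disk
  html_handle <- file(tmp, "rt")              # open for reading, text mode
  html_data <- readLines(html_handle, warn = FALSE)
  close(html_handle)
  html_data
}

# Offline demonstration with a made-up local page.
demo_page <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>hello</body>", "</html>"), demo_page)
demo_lines <- download_html_sketch(paste0("file://", demo_page))
length(demo_lines)
unlink(demo_page)

# Against the live site one would call, for example:
# my_data <- download_html_sketch(giss_url)
```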
Now within the 204 lines of html code, the salient information – three codes – occurs on only a few lines. Nicholas used an interesting trick here. He noticed that the three lines containing the required codes were associated with the distinctive text “gistemp_station.py”, which did not occur on any other lines. Nicholas used the R function grep to identify the three lines containing the station numbers that were the objects of the original search. This is a neat trick when you have to work with html text and one that may come in handy in other situations.
urls <- grep("gistemp_station.py", my_data, value=TRUE); urls
# " Barabinsk"
# " Barabinsk"
# " Barabinsk"
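To see the grep trick in isolation, here is a small sketch on a made-up page. The html lines are invented for illustration (the real GISS markup and link format may well differ); only the three station ids come from the search above. One wrinkle: grep treats its pattern as a regular expression, in which “.” matches any character, so fixed=TRUE is used to match the text literally.

```r
# Invented stand-in for a search-results page; real GISS html may differ.
page <- c(
  '<html><body>',
  '<!-- gistemp_station_py was the old script name -->',
  ' <a href="/cgi-bin/gistemp/gistemp_station.py?id=222296120000">Tarko-Sale</a>',
  ' <a href="/cgi-bin/gistemp/gistemp_station.py?id=222296120001">Tarko-Sale</a>',
  ' <a href="/cgi-bin/gistemp/gistemp_station.py?id=222296120002">Tarko-Sale</a>',
  '</body></html>'
)

# value = TRUE returns the matching lines themselves rather than their indices.
hits <- grep("gistemp_station.py", page, value = TRUE, fixed = TRUE)
length(hits)

# Without fixed = TRUE the "." also matches the "_" in the comment line.
loose_hits <- grep("gistemp_station.py", page, value = TRUE)
length(loose_hits)
```

On a page where the distinctive text really is unique, as Nicholas found, both forms give the same lines.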
Next, Nicholas extracted the codes of interest from these lines and inserted them into new URLs, using the paste and gsub functions.
url <- paste("http://data.giss.nasa.gov", gsub("^ *[^<]*$", "", urls), sep=""); url
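WordPress appears to have eaten part of the pattern in the line above, so for the exact expression consult the ASCII script. The general idea of pulling a link target out of an html line can be sketched as follows. The pattern and the sample lines are my own assumptions for illustration, not Nicholas' code: the parenthesised group captures whatever sits between the quotes after href=, and “\\1” keeps only that captured piece.

```r
# Invented link lines; the real GISS markup may differ.
link_lines <- c(
  ' <a href="/cgi-bin/gistemp/gistemp_station.py?id=222296120000">Tarko-Sale</a>',
  ' <a href="/cgi-bin/gistemp/gistemp_station.py?id=222296120001">Tarko-Sale</a>',
  ' <a href="/cgi-bin/gistemp/gistemp_station.py?id=222296120002">Tarko-Sale</a>'
)

# Keep only the href target: ([^"]*) captures everything up to the closing quote.
paths <- sub('^.*<a href="([^"]*)".*$', "\\1", link_lines)

# Prefix the host name to get complete URLs.
station_urls <- paste("http://data.giss.nasa.gov", paths, sep = "")
station_urls
```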
We get 3 station versions here. We’ll just worry about one version for now, as it’s easy enough to handle multiple versions with a little additional programming. Nicholas then uses the function download_html to obtain the html text for the new url, which is the second url in the manual process (step 5 above); this webpage is 171 lines long.
j <- 1; my_data <- download_html(url[j]); length(my_data) #171
Nicholas uses the same trick with grep as above to retrieve the salient codes from this html code. This time, he used the phrase “monthly data as text” to extract the salient lines.
urls[j] <- grep("monthly data as text", my_data, value=TRUE); urls[j]
#monthly data as text"
Using paste and gsub, this is reshaped into the call for the text file that we saw originally:
Next Nicholas opens the connection to this temporary file using the R function url (the function url working here like the function file above) and reads from the connection.
my_handle <- url(url, "rt");
my_data <- read.table(my_handle, skip=1);
dim(my_data) #91 18
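As a sketch of the “making time series” step, the example below reads a two-year table laid out like the 18-column file (year, twelve months, then seasonal and annual columns) and turns the monthly columns into an R ts object. Everything in it is an assumption for illustration: the numbers are invented, the column labels are a guess at the file’s header, and 999.9 is assumed to be the missing-value flag.

```r
# Invented two-year sample in the assumed 18-column layout.
sample_text <- c(
  "YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC D-J-F M-A-M J-J-A S-O-N metANN",
  "1937 -25.1 -22.3 -15.0 -8.2 0.5 10.1 16.2 12.4 5.3 -4.1 -15.6 -21.9 999.9 -7.6 12.9 -4.8 999.9",
  "1938 -27.4 -20.0 -14.2 -6.9 1.1 11.3 999.9 13.0 6.1 -3.5 -14.8 -23.2 -23.1 -6.7 999.9 -4.1 999.9"
)

# skip = 1 drops the header line, as in the call above;
# na.strings turns the assumed missing flag into NA.
station <- read.table(text = sample_text, skip = 1, na.strings = "999.9")
dim(station)

# Columns 2 to 13 hold January to December. Transposing before flattening
# puts the values in calendar order: all of 1937, then all of 1938.
monthly <- ts(as.vector(t(as.matrix(station[, 2:13]))),
              start = c(station[1, 1], 1), frequency = 12)
monthly
```

With a real file one would pass the url connection in place of the text argument and check the missing-value flag against the file itself.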
Now we have the sought-for ASCII file which can be manipulated in the usual ways. A little work needs to be done to allow for reading the different versions. The calls use a dataset identification of 0, 1 or 2 and this needs to be allowed for. Also we’ll need to think about using the WMO number rather than name and about handling multiple versions in a convenient way. But this is relatively easy stuff that I can figure out.
The above works for trying to match names, but it’s usually easier to work with WMO numbers so as not to worry about spelling variations (Karsakpay, Qarsakbay etc.). Nicholas modified the routine here to work with WMO numbers.