Another R-Script for a Zip File (NOAA)

I’m posting up another script for unzipping a file in R (in part so that I can keep track of these things as well). I’m trying to figure out how to download the Australian data in R and am reviewing some prior successful efforts (which have relied on Nicholas).

Here is how one can get at the NOAA gridded data directly from R. First download the file to a temporary location. I’m not sure what mode="wb" does, but it’s something you have to do.

download.file("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/anom/anom-grid2-1880-current.dat.gz", "anom-grid2-1880-current.dat.gz", mode="wb");
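
Since the file really is temporary, here is a variant of the same call that writes to a genuine temp file; this is just a sketch using base R’s tempfile(), and nothing else changes:

tmp <- tempfile(fileext=".dat.gz")   # a scratch file in the session's temp directory
download.file("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/anom/anom-grid2-1880-current.dat.gz", tmp, mode="wb")

If you do it this way, pass tmp to gzfile() below instead of the literal filename.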

The following will get it into an ASCII file. The script for unzipping the Russian meteo data used the scan function; here readLines works. The resulting object Data can be handled with ordinary techniques.

handle <- gzfile("anom-grid2-1880-current.dat.gz");
Data <- readLines(handle);
close(handle);
length(Data) # 331359
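
For what it’s worth, the temporary file can be skipped entirely by streaming the gzipped data straight off the FTP server. This is a sketch using base R’s gzcon() wrapped around a url() connection, assuming the URL above is still live:

con <- gzcon(url("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/anom/anom-grid2-1880-current.dat.gz", open="rb"))
Data <- readLines(con)   # decompresses on the fly
close(con)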

In this case, the data comes out as 12 values per line, with 217 lines per month-year combination: one header line carrying the month-year, then (217-1)*12 = 2592 = 72*36 gridcell values. There are 331142 lines in the file currently (the file grows by one 217-line block each month, so the line counts in the comments below will drift as NOAA updates the data). The data comes out with latitude as the hour hand (N to S in 5 degree increments) and longitude as the minute hand (W to E from the Dateline in 5 degree increments). To make a collated version of the time series with gridcells in each column in the same hour hand-minute hand order (which I try to use consistently), I first identify and remove the header lines (1, 218, 435, …) with the month-year information and then transform the data through matrix operations.

N <- length(Data); N
temp <- !is.na(match(1:N, seq(1, N-216, 217)))   # TRUE at the month-year header lines (1, 218, 435, ...)
Data <- Data[!temp]   # 329616
# each line holds twelve 6-character fields; parse them into a 12-column numeric matrix
noaa <- sapply(seq(1, 67, by=6), function(j) as.numeric(substr(Data, j, j + 5)))
rm(Data)
temp <- (noaa == -32768); noaa[temp] <- NA   # -32768 is the missing-value flag
dim(noaa) #329616 12
# 329616*12/2592 =1526
noaa <- c(t(noaa))   # flatten row-wise so values run gridcell-by-gridcell within each month
noaa <- array(noaa, dim=c(2592, length(noaa)/2592)); dim(noaa) # 2592 1526
noaa <- t(noaa)      # one row per month, one column per gridcell: 1526 2592
#save(noaa,file="d:/climate/data/noaa/noaa.tab")
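
To keep track of which column is which gridcell, the 2592 columns can be labeled in the same hour hand-minute hand order. The cell-centre convention below (87.5N down to 87.5S, 177.5W east to 177.5E) is my assumption about the 5-degree grid rather than something stated in the file:

lat <- seq(87.5, -87.5, by=-5)    # hour hand: N to S, 36 bands
lon <- seq(-177.5, 177.5, by=5)   # minute hand: W to E from the Dateline, 72 cells
dimnames(noaa)[[2]] <- paste(rep(lat, each=72), rep(lon, times=36), sep="_")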

Nicholas also looked recently at *.Z files, an obsolete compression format not supported at present in R. It is used in some climate data sets, e.g. http://cdiac.ornl.gov/ftp/tr055/sta60.dat.Z. Nicholas wrote a routine, which he says is slow and which I haven’t tried yet. If it’s a one-off analysis, it’s easy enough to download and unzip manually; the need for automated unzipping arises when the data is updated or when you need to call individual stations. I’ll revisit the *.Z files if and when I get to that situation.
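In the meantime, on systems that happen to have a gunzip executable (most Unix variants; not stock Windows), one can shell out to it, since gunzip understands the old compress format. This sketch assumes gunzip is on the path; it is a stopgap, not a substitute for a pure-R routine:

if (nzchar(Sys.which("gunzip"))) {
  download.file("http://cdiac.ornl.gov/ftp/tr055/sta60.dat.Z", "sta60.dat.Z", mode="wb")
  system("gunzip -f sta60.dat.Z")   # replaces sta60.dat.Z with sta60.dat
  sta60 <- readLines("sta60.dat")
}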

There are a couple of BOM files at ftp://ftp.bom.gov.au/anon/home/bmrc/perm/climate/temperature/annual. As an exercise, I tried to see if I could modify Nicholas’ methods to unzip this data in R, but so far have been unsuccessful. I’m sure that Nicholas will have an answer.

5 Comments

  1. Skip
    Posted Apr 26, 2007 at 9:57 AM | Permalink

The “wb” is likely just passed through to the C runtime library fopen() function. Likely values are:

    “wb” – open for writing as a binary file
“w” – open for writing as a text file. This will translate end-of-line characters to whatever is appropriate for the system the file is downloaded to.
    “ab” – same as “wb” but appending instead of creating the file new
    “a” – same as “w” but appending.

You need the “wb” if it’s a zip file because otherwise the end-of-line translation will likely corrupt the file.

  2. mhcoffin
    Posted Apr 26, 2007 at 10:50 AM | Permalink

    Files with a .Z suffix are probably generated by “compress”. Compress was the standard Unix compression utility until Unisys, which owned a patent on Lempel-Ziv compression, began trying to enforce their patent. (That patent expired a few years ago, but by then gzip had become standard.)

    The good news is the patent didn’t cover decompression, so gunzip will happily uncompress .Z files. If R supports gzip, it can probably be tricked into decompressing .Z files pretty easily. Maybe by merely renaming the file to .gz and pretending it’s gzipped.

  3. Nicholas
    Posted Apr 26, 2007 at 7:37 PM | Permalink

mhcoffin, you’re right, but the problem is that while gzip may support uncompressing .Z files, the built-in R function gzfile does not work on .Z files. You can’t rely on gzip being installed on the system, since R is often run on Windows, where there is no gzip command-line program.

I wrote an R implementation of a Z decompressor (it’s just dictionary-based, so it wasn’t too hard), but it takes several minutes to decompress a 1MB file. There’s no way I know of to avoid a for-loop, since the compressed data is stored in variable-length bit strings, and R is slow with for-loops.

    The idea of renaming the file is interesting, but I don’t think gzfile cares about the extension. It probably just calls the deflate algorithm, which returns an error because the data is not compressed with deflate. The gzip program is probably smart enough to detect the extension and call the modified-LZ78 routine instead, but R’s built-in function doesn’t seem to have that capability.

  4. MurrayK
    Posted Apr 26, 2007 at 9:21 PM | Permalink

    znew recompresses files from .Z (compress) format to .gz (gzip) format. There’s an example here. Perhaps the R ‘system’ interface can call this utility first?
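
Something like this, perhaps (untested; it assumes znew and gzip are installed and on the PATH, so it won’t help on stock Windows):

system("znew sta60.dat.Z")         # converts sta60.dat.Z to sta60.dat.gz in place
handle <- gzfile("sta60.dat.gz")
sta60 <- readLines(handle); close(handle)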

  5. Nicholas
    Posted Apr 27, 2007 at 3:17 AM | Permalink

MurrayK, it’s tough to rely on anything which isn’t a standard part of R, because as soon as you do, it makes it more difficult for people to download & run Steve’s scripts. I think it’s OK if he requires some commonly used and widely available third-party R packages, but I would be extremely hesitant to require people to install znew on their system, especially Windows users, who may have no tidy way of installing that program.

That is why I wrote the R uncompress routine: it was the only solution I could think of that wouldn’t require users to jump through hoops to get the script to run. It works, but I don’t think most people have the patience to wait 2-4 minutes while the data decompresses. It’d be even worse if the script had to deal with multiple .Z files. I’d much rather just recompress those files myself, upload them to my web server, and point the scripts there. Less hassle for all of us.
