An R Package by a CA Reader Solves the Z Problem

CA reader Nicholas, an extremely able computer analyst, has helped me with a number of problems downloading data in compressed formats into R. One of the most annoying and heretofore unsolved problems was how to get Z files into R without having to handle them manually – a problem that I revisited recently when I looked at ICOADS data, which is organized in over 2000 Z files.

Z files are an obsolete form of Unix compression that is not even mentioned at zlib.com, nor was it supported in R. So if you wanted to analyze a Z file in R, you had to download the file, decompress it manually using WinZip or an equivalent tool and then start again.

I presume that this obsolete format fits in an ecological niche with Fortran, an antique computer language (one that I learned over 40 years ago and which, in comparison with a modern language like R, seems about as relevant as medieval Latin). Since most climate scientists appear to live in an ecological niche with Fortran and Unix, many climate data sets are only available in Z files, e.g. USHCN, GHCN, ICOADS, although a number of data sets are available in NetCDF format, which is accessible in R through the ncdf package.

Nicholas figured out how to uncompress Z files and contributed a package, “uncompress”, to R, which is online and downloadable as of today. You can install R packages easily within a session using the Install Packages button. There are a couple of little tricks in using the package to extract ASCII data, so you have to pay close attention to the example. I did a test this morning and it worked like a champ. Here was my trial session (after installing uncompress). The separator in the strsplit call is set here to “\n” (Unix line endings), which will be the relevant option in most of our applications.

handle <- url("ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/station.inventory.Z", "rb")
data <- readBin(handle, "raw", 9999999)
close(handle)
uncomp_data <- uncompress(data)
Data <- strsplit(rawToChar(uncomp_data), "\n")
Data <- unlist(Data)

This returns an ASCII file, which can be handled conventionally using a variety of techniques. For large files, I usually use the substr command to parse columns out, but you could also write the file to a “temp.dat” file and read it using read.fwf or read.table or scan.
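The substr approach can be sketched as follows; the sample lines and column positions below are illustrative only, not the real USHCN inventory layout:

```r
# Sketch of the substr() approach on a character vector of lines, as
# produced by the session above. Sample lines and column positions
# are made up for illustration, not the actual USHCN format.
lines <- c("011084 31.0581  -87.0547  BREWTON",
           "012813 30.5467  -87.8808  FAIRHOPE")
id   <- substr(lines, 1, 6)                # station id
lat  <- as.numeric(substr(lines, 8, 14))   # latitude
name <- trimws(substr(lines, 27, 40))      # station name
```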

Anyway, it’s a great utility!!

PS. I asked any number of people about how to handle Z files in R without having to do it manually and got nowhere. I did learn about a number of annoying Windows mysteries and some interesting R techniques, which I’ll note here as a diary item. It turns out that you can run DOS commands out of R by using the system() command. The following command runs Firefox:
system(paste('"c:/Program Files/Mozilla Firefox/firefox.exe"', '-url cran.r-project.org'))
On my machine in Windows, some applications would only run from a default directory. So the following commands:
system(paste('"d:/Documents/gzip.exe"', 'COPYING'))
dir()[12] # "COPYING.GZ"
system(paste('"d:/Documents/gunzip.exe"', 'COPYING'))
dir()[12] # "COPYING"
worked, but they didn’t work in any other directory. Go figure.

Note – the R function gzfile handles gz files just fine; I was using the gzip.exe program only for testing DOS commands within R.
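As a quick illustration of gzfile(), here is a self-contained round trip on a temporary file (not the USHCN data): write two lines compressed, then read them back.

```r
# Round trip with R's built-in gzfile(): write text to a compressed
# temporary file, then read it back.
f <- tempfile(fileext = ".gz")
con <- gzfile(f, "wt")
writeLines(c("hello", "world"), con)
close(con)
con <- gzfile(f, "rt")
txt <- readLines(con)
close(con)
txt  # c("hello", "world")
```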

12 Comments

  1. Steve McIntyre
    Posted Jun 9, 2008 at 7:22 AM | Permalink

    I notified a couple of people of the new package. Douglas Bates of Wisconsin, author of the nlme package and one of the leading lights of R, already replied that he thought that the package was a “great idea”.

  2. Nicholas
    Posted Jun 9, 2008 at 9:36 AM | Permalink

Note that right now this package is not available for Mac OSX because, for some reason, CRAN hasn’t compiled the latest version I uploaded yet.

    Hopefully within the next day or two they will do so, and it will be available.

    I also uploaded a version with more examples in the documentation but they haven’t put it up yet.

  3. austin
    Posted Jun 9, 2008 at 10:24 AM | Permalink

    I once offered use of a private cluster and modern database to a researcher for free after seeing the horrible tools and methods he was using. At first he was elated, then when he realized he could do a year’s worth of work in a day, he backed off, because “No one will be able to reproduce my results.”

    The private sector runs rings around most other institutions.

  4. Nicholas
    Posted Jun 9, 2008 at 10:40 AM | Permalink

By the way Steve, what you might be able to do in the case of gzip is put gzip.exe and gunzip.exe somewhere in your system “path”, then you won’t need to prepend “d:/Documents/” each time.

    It’s possible to add locations to the path via, I believe, Control Panel’s “System” panel. However personally I’m lazy and I usually just copy such files into C:\WINNT\SYSTEM32 (or C:\WINDOWS\SYSTEM32 as appropriate). Then you can run them from anywhere.

    Of course R has gzfile() so in this case it isn’t strictly necessary, but there may be other times you need to execute an external program. Then again, now that I know how to make R packages, it should be less necessary 🙂


    Steve:
    I was just using gzip to experiment with the system() command on something that worked (my interest then being in the compress.exe command), not because I was having trouble with gz files.
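An alternative to editing the system path or copying binaries into SYSTEM32 is to prepend the directory to PATH for the current R session only; a sketch, using the “d:/Documents” directory from the post (hypothetical on other machines):

```r
# Prepend a directory to the executable search path for this R session
# only, so system("gzip ...") can find the program without a full path.
# "d:/Documents" is the directory from the post; adjust as needed.
old_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste("d:/Documents", old_path, sep = .Platform$path.sep))
```

.Platform$path.sep supplies the right separator (“;” on Windows, “:” elsewhere), and the change disappears when the R session ends.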

  5. Not sure
    Posted Jun 9, 2008 at 5:26 PM | Permalink

    As is the case with many things Unix, the tale of compress is long and sordid.

  6. OldUnixHead
    Posted Jun 10, 2008 at 2:57 PM | Permalink

    Apologies in advance if you folks have already tried this (and I’m barely passing-familiar with ‘R’), but, if ‘R’s internal code for gzfile() used the complete set of gzip’s/gunzip’s ‘magic number’-recognition and decoding functionality, it might be able to parse ‘.Z’ files automatically.

    Following your example, is something like the following possible without requiring the addition of a package?


    handle

  7. OldUnixHead
    Posted Jun 10, 2008 at 3:07 PM | Permalink

    Follow-on to #6:

    Re-attempting to show the code that got cut off:

    Following your example, is something like the following possible without requiring the addition of a package?


handle <- url("ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/station.inventory.Z", "rb")
uncomp_data <- gzfile(handle)
close(handle)
Data <- strsplit(rawToChar(uncomp_data), "\n")

Steve: gzfile does gz files, not Z files. That’s why the package was necessary. My guess is that Nicholas’ function will ultimately be added into base R along with gzfile and such.

  8. Nicholas
    Posted Jun 10, 2008 at 10:44 PM | Permalink

    I think I tried that and it didn’t work, but I can try it again. I believe you’d have to download the file to the hard disk since I don’t think gzfile works with URLs.

    The Mac OSX version of my package is available now too. I believe that means “uncompress” will run on any platform that R will, unless I’ve made a mistake with Unix portability. I haven’t tested it on 64 bit, but theoretically it should work.

    The packages are here:

    http://cran.r-project.org/web/packages/uncompress/index.html

    I’ll check again if gzfile will open .Z files – as I said I don’t think so – but I’ll report on the result.

  9. Nicholas
    Posted Jun 10, 2008 at 10:51 PM | Permalink

    OK, interestingly, what seems to happen is that if you use gzfile on a .Z file, it doesn’t complain or fail in any way, but the data you get back appears to be the raw compressed data – i.e. exactly what you get if you use file().

    Observe:

> handle <- gzfile("IMMA.1784.02.Z", "rb");
> data <- readBin(handle, "raw", 9999999999);
> close(handle);
> rawToChar(data)
[1] "37\235\2201n\340\24001B\206\f\202…
> library("uncompress");
> rawToChar(uncompress(data));
[1] "1784 224 3913 28782 02 4 10EMPRE*_C 1 23…
    >

I think what is happening is that the command-line program “gzip” can handle .Z data through separate code, not via zlib, whereas gzfile() uses zlib. However that’s just a guess. I’m pretty sure I tried this before I wrote the uncompress package, since I was trying to find the easiest way to solve this problem 🙂
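The two formats can be told apart by their leading “magic” bytes: compress (.Z) data starts with 0x1f 0x9d (the octal "37\235" visible in the dump above), while gzip data starts with 0x1f 0x8b. A sketch of such a check on a raw vector:

```r
# Check the magic bytes of a raw vector, e.g. as returned by readBin(),
# to tell compress (.Z) data from gzip data before decoding it.
is_Z_data  <- function(x) length(x) >= 2 && x[1] == as.raw(0x1f) && x[2] == as.raw(0x9d)
is_gz_data <- function(x) length(x) >= 2 && x[1] == as.raw(0x1f) && x[2] == as.raw(0x8b)
is_Z_data(as.raw(c(0x1f, 0x9d, 0x90)))   # TRUE for compress-format data
```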

  10. Slartibartfast
    Posted Jun 12, 2008 at 12:14 PM | Permalink

    It turns out that you can run DOS commands out of R by using the system() command.

    Sure, but just try and call a Cygwin executable using system() from Microsoft C++ .NET…and passing it a filename, including path, as an argument. Recalling, of course, that DOS wants to use backslash in path descriptors, but Cygwin treats backslash as an escape.

  11. Nicholas
    Posted Jun 12, 2008 at 4:30 PM | Permalink

    Slartibartfast : Does it work if you double up the backslashes, in order to escape them?

    Anyway the bigger problem with calling external programs, besides inconvenience and performance, is that we (and specifically Mr. McIntyre, I think) want these scripts to be reproducible. If you’re executing an external program it requires the user to have that program installed in the same place as you do, and then you have to deal with the case where the program may be executing on different operating systems.

    However, with this R package, all a user has to do is install it via CRAN, whether they are using Linux, Windows, MacOS, or some other Unix variant. It should just work. If it doesn’t hopefully they will let me know and I can fix it.

    If there’s another processing step required for external data in future which is hard to perform in R then I can write another package to deal with it. Hopefully we can reach a point where practically all climate data is easily imported into R for processing. We may already be close to that point.

  12. Henry
    Posted Feb 14, 2009 at 9:08 PM | Permalink

    Just to note there is an unhelpful space in the first line of the example code with an erroneous “< -” which should be “<-“.

    Pedantically, I would also tend to avoid having “data” and “Data” around at the same time meaning different things, and stylistically either “<-” or “=” should be used throughout, so perhaps as a model the code could be

handle <- url("ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/station.inventory.Z", "rb")
raw_data <- readBin(handle, "raw", 9999999)
close(handle)
uncomp_data <- uncompress(raw_data)
Data <- strsplit(rawToChar(uncomp_data), "\n")
Data <- unlist(Data)

    More seriously, I do not seem to get the whole NOAA file into Data, as it breaks at records 143 or 458 depending how I run the code. Copying the file to my hard disk first, it breaks at record 1222.