Over the weekend (before I picked up my “regular” files), I started looking at Steve Mosher’s use of raster and zoo – both of which intrigue me a great deal – but got intrigued by something else and ended up finally figuring out how to extract .Z files within an R script without having to handle them manually. (R has utilities for .zip and .gz files, but not the older .Z format.) This is merely a nuisance with GHCN, which has only one .Z file to worry about, but it was a big problem with the very large ICOADS SST data, where every month of data is in its own .Z file and manual processing isn’t an alternative. It’s further complicated since the 12 monthly .Z files for each year are packaged into an annual .tar file.
On many occasions, I’ve expressed regret at the disproportionate interest in land datasets relative to SST datasets, which carry more weight in the global averages but have received negligible third-party analysis. We’ve all raised our eyebrows a little at the bucket adjustments, but I, for one, hadn’t handled the original data. For a start, the data sets were too big for the computer that I had a couple of years ago, but they are practical on my new computer (though, if I were doing much on this, I’d need to upgrade).
A couple of years ago, CA reader Nicholas experimented with extracting .Z files in R, contributing the package compress, which, unfortunately, I couldn’t get to work on the GHCN data set. I lost interest in the issue at the time, but the inability to handle .Z files automatically has been at the back of my mind for a while.
Mosh started one of his posts as follows:
In the last installment we started by downloading the v2.mean.Z file unzipping it and processing the raw data into an R object.
Mosh linked to another blog, whose author had used the R command system to unzip a .gz file as follows:
# ----------- UNCOMMENT ME TO DOWNLOAD HADSST2 -------------#
#if (! file.exists(hadsst2_data)) {
#  hadsst2_gz <- paste(hadsst2_data, ".gz", sep="")
#  download.file(hadsst2_data_url, hadsst2_gz)
#  # windows users will need gunzip in their PATH
#  system(paste("gunzip ", hadsst2_gz, sep=""))
#}
# ----------------------------------------------------------#
While browsing through some sites, I noticed the following comment at gzip.org:
gunzip can decompress files created by gzip, compress or pack. The detection of the input format is automatic.
.Z files are produced by compress. So maybe, I thought, a simple expedient for extracting .Z files was staring me in the face. I’d gone through the process of installing gunzip.exe on my old computer but hadn’t done so on my new computer and had to retrace my steps. I found a version of gzip here http://www.powerbasic.com/files/pub/tools/win32/gzip124xN.zip – there are other versions around. You have to unzip this file, which yields gzip.exe, but not gunzip.exe. I was stumped by this for a while. A webpage here explains:
Note that this archive contains only gzip.exe — to get gunzip.exe, you must copy gzip.exe to gunzip.exe (a silly Unix trick– don’t ask).
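For what it’s worth, the copy can also be done from within R itself; a one-liner with base R’s file.copy does it (this assumes gzip.exe has already been unzipped into the working directory):
# the gzip binary acts as gunzip when invoked under that name
file.copy("gzip.exe", "gunzip.exe")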
I uploaded a version of gunzip.exe to climateaudit.info to save others the labor. The following short script downloads the .Z file for GHCN and the gunzip.exe program to the working directory. Worked like a champ for me:
download.file("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.mean.Z", "v2mean.Z", mode="wb")
download.file("http://www.climateaudit.info/scripts/gzip/gunzip.exe", "gunzip.exe", mode="wb")
system(paste('"gunzip.exe"', 'v2mean'))   # gunzip finds v2mean.Z and decompresses it to v2mean
working=readLines("v2mean", n=20)
working[1:3]
#[1] "1016035500001966-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 133 110"
#[2] "1016035500001967 100 112 129 143 181 193 239 255 225 201 166 107"
#[3] "1016035500001968 109 123 130 160 181 207 247 241 223 191 162 128"
This was handy for GHCN, where it’s a nice-to-have rather than a need-to-have, but it’s essential for ICOADS. I checked it out on the .Z files extracted from the ICOADS .tar files and again it worked fine. Note that the .Z files have to be in the working directory for the R system command to work.
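Wrapped up as a little helper, the pattern looks something like this (my own sketch – the name readZ is arbitrary, and gunzip.exe is assumed to be in the working directory):
# download a .Z file, decompress it with gunzip.exe and return the text lines
readZ = function(url, dest="temp") {
	download.file(url, paste(dest, ".Z", sep=""), mode="wb")
	system(paste('"gunzip.exe"', dest))   # gunzip finds dest.Z on its own
	working=readLines(dest)
	file.remove(dest)   # keep only the R copy
	working
}
# usage: working=readZ("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.mean.Z", "v2mean")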
It turned out that one could use tar.exe the same way as gunzip.exe. The structure of tar commands is described in a manual here. I located a version of tar.exe in a zip package at http://sourceforge.net/projects/unxutils/ and downloaded it (and uploaded an unzipped copy to http://www.climateaudit.info so it can be readily used).
The following commands downloaded and extracted the contents of a tar file. (This is a small file – recent years get huge.)
loc="ftp://ftp.ncdc.noaa.gov/pub/data/icoads/IMMA.1784.tar"
download.file(loc, "JMMA.tar", mode="wb")
system(paste('"d:/climate/scripts/compression/tar.exe"', '-xvf', '"JMMA.tar"'))
#IMMA.1784.02.Z
#IMMA.1784.03.Z
#IMMA.1784.04.Z
Each of the monthly .Z files could then be unzipped. To do so, I found it convenient to rename each .Z file in turn to temp.Z, unzip it, read it into an R object and then remove the temp file, saving the R object (which takes up less space than the original .Z file anyway). The following read-instruction reads each of the .Z files and then saves the result by year.
index=list.files()[grep("IMMA", list.files())]; index   # the extracted monthly .Z files
K=length(index); n=nchar(index)
sst=rep(list(NA), K)
for(i in 1:K) {
	file.rename(index[i], "temp.Z")
	system(paste('"gunzip.exe"', '"temp"'))   # decompresses temp.Z to temp
	working=readLines("temp", n=-1)
	X=read.fwf2(working, widths=info$width, colClasses=info$class)
	names(X)=info$id
	sst[[i]]=X
	file.remove(list.files()[grep("temp", list.files())])   # clean up before the next month
}
save(sst, file=file.path(destdir, paste("sst_", year, ".tab", sep="")))   # destdir and year set beforehand
The above assumes that information about the ICOADS record format has been extracted into a data frame info (giving field widths, classes and ids). This is not a small job – see here. I downloaded and extracted data from 1900 to 1980 over the last couple of days (the tar files go from a few MB in 1900 to over a GB in 2006). I got some odd results, which I’ll mention in another post, but will have to move on, as looking through SST data is a big analytic job.
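For readers following along: read.fwf2 is a CA utility function, not part of base R. Assuming (as the call above suggests) that it parses a character vector of fixed-width lines, a minimal stand-in might look like:
# hypothetical stand-in for the CA read.fwf2 utility: splits fixed-width lines
# at cumulative field widths and coerces numeric columns
read.fwf2 = function(lines, widths, colClasses) {
	ends=cumsum(widths); starts=ends-widths+1
	cols=lapply(seq_along(widths), function(j) {
		x=substring(lines, starts[j], ends[j])
		if(colClasses[j] %in% c("numeric","integer")) as.numeric(x) else x
	})
	names(cols)=paste("V", seq_along(cols), sep="")
	as.data.frame(cols, stringsAsFactors=FALSE)
}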
12 Comments
Well I don’t see how Gavin can disagree with this post…
Nice job. New data to play with.
If you were having problems you should have gone to your peeps sooner.
On Windows, 7-Zip is your friend
http://www.7-zip.org/
On Linux if it is just gnu zip file then gunzip file_name
If it is a gziped tar ball then tar xvfz file_name
If it is a bz2 tar ball then tar xvfj file_name
On most Unix machines you have to do it in 2 steps and pipe them together.
Steve: The issue was doing this within R as opposed to doing it in Unix. The problem wasn’t figuring out the Unix commands – that was the easiest part. The problem was figuring out how to run Unix commands within an R environment – which is what I use and which has many advantages for analysis purposes. Also, it wasn’t something that I was working on. I don’t feel bad about not figuring this out earlier, since I’d asked a couple of very good programmers how to do this within an R environment, without success.
gunzip file_name |tar xvf
“gunzip file_name |tar xvf”
if the pipe is the problem, try “tar xzvf file_name”
Steve: as I said before, the Unix commands were not at issue. It was the R interface. No more discussions of Unix commands please.
Steve,
I think the core of the problem is that you’re running a Windows port of R. R was originally developed on Linux, and decompressing files within R is easy there because the system() command can call binaries that are already present by default on any Linux system. But you have what you need installed, so problem solved.
http://stat.ethz.ch/R-manual/R-patched/library/base/html/system.html
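For example, on a stock Linux install (gzip already on the PATH) the whole thing is something like a one-liner from R:
# no extra download needed - gunzip is already a system binary
system("gunzip v2.mean.Z")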
Mosh linked to another blog, whose author had used the R command system to unzip a .gz file as follows:
You’re welcome. I’m glad it was of some use to you. I’ve learned much from your R code as well.
Ron Broberg
http://rhinohide.wordpress.com
This might be a time to consider writing the R script to work from a local copy of the data, and to preprocess the data beforehand into .gz (rather than .Z) files with a .bat or .sh script.
Code that relies on remote data really needs to calculate a hash on that data and record it along with other (more interesting) results. The remote data might change. If someone downloads the code (a year after it was published, say) and gets different results then, absent the hash, it may not be obvious that this is due to a change in the remote data. The md5deep utility provides the relevant functionality and is freely available for all operating systems.
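Within R itself, the built-in tools::md5sum would serve the same purpose (a sketch):
# record a hash of the downloaded file so later re-runs can detect upstream changes
library(tools)
md5sum("v2mean.Z")   # returns a named vector of hex digests; store it with the results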
Have you tried asking the ICOADS folks to provide the files in .gz?
When ccc-gistemp had this problem (decompressing a .Z file in Windows), I read the C code for the OS X version of compress and reimplemented the same algorithm in Python. Such fun.
Steve: I asked one climate group to provide the files in .gz, but they didn’t do anything. The method set out here works fine – so I’ve worked around the problem.
BTW I’d emulated GISTEMP’s steps quite a long way in R in 2007 and placed the code online. Their so-called UHI adjustment seemed totally ineffective – as it effectively presumed that other inhomogeneities had already been resolved.
This data is no longer available on the ftp site you cited ( ftp://ftp.ncdc.noaa.gov/pub/data/icoads/ ). Is it available somewhere else?
They’ve removed the data from the site. It is now at http://dss.ucar.edu/datasets/ds540.0/ behind a wall of Java access screens that are hard to penetrate with R.
One Trackback
[…] Handling .Z files – September 13, 2010 – Steven Mosher: A while back Steve Mcintyre was looking for a way to handle .Z files in R […]