Over the weekend (before I picked up my “regular” files), I started looking at Steve Mosher’s use of raster and zoo – both of which intrigue me a great deal – but got intrigued by something else and ended up finally figuring out how to extract .Z files within an R script without having to handle them manually. (R has utilities for .zip and .gz files, but not the older .Z format.) This is merely a nuisance with GHCN, which has only one .Z file to worry about, but it was a big problem with the very large ICOADS SST data, where every month of data is in its own .Z file and manual processing isn’t an alternative. It’s further complicated since the 12 monthly .Z files for each year are packaged into an annual .tar file.
On many occasions, I’ve expressed regret at the disproportionate interest in land datasets relative to SST datasets, which carry more weight in the global averages but have received negligible third-party analysis. We’ve all raised our eyebrows a little at the bucket adjustments, but I, for one, hadn’t handled the original data. For a start, the data sets were too big for the computer that I had a couple of years ago, but they are practical on my new computer (though, if I were doing much on this, I’d need to upgrade).
A couple of years ago, CA reader Nicholas experimented with extracting .Z files in R, contributing the package compress, which, unfortunately, I couldn’t get to work on the GHCN data set. I lost interest in the issue at the time, but the inability to handle .Z files automatically has been at the back of my mind for a while.
Mosh started one of his posts as follows:
In the last installment we started by downloading the v2.mean.Z file unzipping it and processing the raw data into an R object.
Mosh linked to another blog, whose author had used the R command system to unzip a .gz file as follows:
# ----------- UNCOMMENT ME TO DOWNLOAD HADSST2 -------------#
#if (! file.exists(hadsst2_data)) {
#  hadsst2_gz <- paste(hadsst2_data, ".gz", sep="")
#  download.file(hadsst2_data_url, hadsst2_gz)
#  # windows users will need gunzip in their PATH
#  system(paste("gunzip ", hadsst2_gz, sep=""))
#}
# ----------------------------------------------------------#
While browsing through some sites, I noticed the following comment at gzip.org:
gunzip can decompress files created by gzip, compress or pack. The detection of the input format is automatic.
.Z files are produced by compress. So maybe, I thought, a simple expedient for extracting .Z files was staring me in the face. I’d gone through the process of installing gunzip.exe on my old computer but hadn’t done so on my new computer and had to retrace my steps. I found a version of gzip here http://www.powerbasic.com/files/pub/tools/win32/gzip124xN.zip – there are other versions around. You have to unzip this file, which yields gzip.exe, but not gunzip.exe. I was stumped by this for a while. A webpage here explains:
Note that this archive contains only gzip.exe — to get gunzip.exe, you must copy gzip.exe to gunzip.exe (a silly Unix trick– don’t ask).
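For what it’s worth, the copy can also be done from within R itself; a one-liner with base R’s file.copy does it (this assumes gzip.exe has already been unzipped into the working directory):
# the gzip binary acts as gunzip when invoked under that name
file.copy("gzip.exe", "gunzip.exe")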
I uploaded a version of gunzip.exe to climateaudit.info to save others the labor. The following short script downloads the .Z file for GHCN and the gunzip.exe program to the working directory. Worked like a champ for me:
download.file("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.mean.Z", "v2mean.Z", mode="wb")
download.file("http://www.climateaudit.info/scripts/gzip/gunzip.exe", "gunzip.exe", mode="wb")
system(paste('"gunzip.exe"', 'v2mean'))   # gunzip finds v2mean.Z and decompresses it to v2mean
working=readLines("v2mean", n=20)
working[1:3]
#[1] "1016035500001966-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 133 110"
#[2] "1016035500001967 100 112 129 143 181 193 239 255 225 201 166 107"
#[3] "1016035500001968 109 123 130 160 181 207 247 241 223 191 162 128"
This was handy for GHCN, where it’s a nice-to-have rather than a need-to-have, but it’s essential for ICOADS. I checked it out on the .Z files extracted from the ICOADS .tar files and again it worked fine. Note that the .Z files have to be in the working directory for the R system command to work.
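Wrapped up as a little helper, the pattern looks something like this (my own sketch – the name readZ is arbitrary, and gunzip.exe is assumed to be in the working directory):
# download a .Z file, decompress it with gunzip.exe and return the text lines
readZ = function(url, dest="temp") {
	download.file(url, paste(dest, ".Z", sep=""), mode="wb")
	system(paste('"gunzip.exe"', dest))   # gunzip finds dest.Z on its own
	working=readLines(dest)
	file.remove(dest)   # keep only the R copy
	working
}
# usage: working=readZ("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.mean.Z", "v2mean")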
It turned out that one could use tar.exe the same way as gunzip.exe. The structure of tar commands is described in a manual here. I located a version of tar.exe in a zip package at http://sourceforge.net/projects/unxutils/ and downloaded it (and uploaded an unzipped copy to http://www.climateaudit.info so it can be readily used).
The following commands downloaded and extracted the contents of a tar file. (This is a small file – recent years get huge.)
loc="ftp://ftp.ncdc.noaa.gov/pub/data/icoads/IMMA.1784.tar"
download.file(loc, "JMMA.tar", mode="wb")
system(paste('"d:/climate/scripts/compression/tar.exe"', '-xvf', '"JMMA.tar"'))
#IMMA.1784.02.Z
#IMMA.1784.03.Z
#IMMA.1784.04.Z
Each of the monthly .Z files could then be unzipped. To do so, I found it convenient to rename each .Z file in turn to temp.Z, unzip it, read it into an R object and then remove the temp file, saving the R object (which takes up less space than the original .Z file anyway). The following read-instruction reads each of the .Z files and then saves the result by year.
index=list.files()[grep("IMMA", list.files())]; index   # the extracted monthly .Z files
K=length(index); n=nchar(index)
sst=rep(list(NA), K)
for(i in 1:K) {
	file.rename(index[i], "temp.Z")
	system(paste('"gunzip.exe"', '"temp"'))   # decompresses temp.Z to temp
	working=readLines("temp", n=-1)
	X=read.fwf2(working, widths=info$width, colClasses=info$class)
	names(X)=info$id
	sst[[i]]=X
	file.remove(list.files()[grep("temp", list.files())])   # clean up before the next month
}
save(sst, file=file.path(destdir, paste("sst_", year, ".tab", sep="")))   # destdir and year set beforehand
The above assumes that information about the ICOADS record format has been extracted into a data frame info (giving field widths, classes and ids). This is not a small job – see here. I downloaded and extracted data from 1900 to 1980 over the last couple of days (the tar files go from a few MB in 1900 to over a GB in 2006). I got some odd results, which I’ll mention in another post, but will have to move on, as looking through SST data is a big analytic job.
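For readers following along: read.fwf2 is a CA utility function, not part of base R. Assuming (as the call above suggests) that it parses a character vector of fixed-width lines, a minimal stand-in might look like:
# hypothetical stand-in for the CA read.fwf2 utility: splits fixed-width lines
# at cumulative field widths and coerces numeric columns
read.fwf2 = function(lines, widths, colClasses) {
	ends=cumsum(widths); starts=ends-widths+1
	cols=lapply(seq_along(widths), function(j) {
		x=substring(lines, starts[j], ends[j])
		if(colClasses[j] %in% c("numeric","integer")) as.numeric(x) else x
	})
	names(cols)=paste("V", seq_along(cols), sep="")
	as.data.frame(cols, stringsAsFactors=FALSE)
}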
12 Comments
Well I don’t see how Gavin can disagree with this post…
Nice job. New data to play with.
If you were having problems you should have gone to your peeps sooner.
On Windows, 7-Zip is your friend
http://www.7-zip.org/
On Linux if it is just gnu zip file then gunzip file_name
If it is a gziped tar ball then tar xvfz file_name
If it is a bz2 tar ball then tar xvfj file_name
On most Unix machines you have to do it in 2 steps and pipe them together.
Steve: The issue was doing this within R as opposed to doing it in Unix. The problem wasn’t figuring out the Unix commands – that was the easiest part. The problem was figuring out how to run Unix commands within an R environment – which is what I use and which has many advantages for analysis purposes. Also, it wasn’t something that I was working on. I don’t feel bad about not figuring this out earlier, since I’d asked a couple of very good programmers how to do this within an R environment, without success.
gunzip file_name |tar xvf
“gunzip file_name |tar xvf”
if the pipe is the problem, try “tar xzvf file_name”
Steve: as I said before, the Unix commands were not at issue. It was the R interface. No more discussions of Unix commands please.
Steve,
I think the core of the problem is that you’re running a Windows port of R. R was originally developed on Linux, and decompressing files within R is easy there because the system() command can call binaries that are already present by default on any Linux system. But you have what you need installed, so problem solved.
http://stat.ethz.ch/R-manual/R-patched/library/base/html/system.html
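For example, on a stock Linux install (gzip already on the PATH) the whole thing is something like a one-liner from R:
# no extra download needed - gunzip is already a system binary
system("gunzip v2.mean.Z")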
Mosh linked to another blog, whose author had used the R command system to unzip a .gz file as follows:
You’re welcome. I’m glad it was of some use to you. I’ve learned much from your R code as well.
Ron Broberg
http://rhinohide.wordpress.com
This might be a time to consider writing the R script to work from a local copy of the data, and to preprocess the data beforehand into .gz (rather than .Z) files with a .bat or .sh script.
Code that relies on remote data really needs to calculate a hash on that data and record it along with other (more interesting) results. The remote data might change. If someone downloads the code (a year after it was published, say) and gets different results then, absent the hash, it may not be obvious that this is due to a change in the remote data. The md5deep utility provides the relevant functionality and is freely available for all operating systems.
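Within R itself, the built-in tools::md5sum would serve the same purpose (a sketch):
# record a hash of the downloaded file so later re-runs can detect upstream changes
library(tools)
md5sum("v2mean.Z")   # returns a named vector of hex digests; store it with the results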
Have you tried asking the ICOADS folks to provide the files in .gz?
When ccc-gistemp had this problem (decompressing a .Z file in Windows), I read the C code for the OS X version of compress and reimplemented the same algorithm in Python. Such fun.
Steve: I asked one climate group to provide the files in .gz, but they didn’t do anything. The method set out here works fine – so I’ve worked around the problem.
BTW I’d emulated GISTEMP’s steps quite a long way in R in 2007 and placed the code online. Their so-called UHI adjustment seemed totally ineffective – as it effectively presumed that other inhomogeneities had already been resolved.
This data is no longer available on the ftp site you cited ( ftp://ftp.ncdc.noaa.gov/pub/data/icoads/ ). Is it available somewhere else?
They’ve removed the data from the site. It is now at http://dss.ucar.edu/datasets/ds540.0/ behind a wall of Java access screens that are hard to penetrate with R.
One Trackback
[…] Handling .Z files – September 13, 2010 – Steven Mosher: A while back Steve Mcintyre was looking for a way to handle .Z files in R […]