One of the common operations in the types of analyses done here is simply counting things. I am constantly amazed by the tremendous productivity of the tapply function in R – I use it over and over – for producing interesting results at warp speed. There’s nothing particularly novel in how I use it, but I’ve got a lot of experience now and have a few little techniques that I find useful. I made a little improvement in how I use it the other day and thought that it might be useful to some readers to illustrate this. The method has no mathematical significance – but efficiency in handling new data sets is important for what I do and I thought that I write it up and illustrate a couple of simple and useful techniques. The issue arise during the collation of hurricane landfall data.
The most recent dataset is at http://www.aoml.noaa.gov/hrd/hurdat/ushurrlist18512005-gt.txt ; this appears to be a little more recent that the dataset http://www.aoml.noaa.gov/hrd/hurdat/ushurrlist.htm that Willis used in making a count posted up in October. The archived information is quite messy in format and it took me a couple of hours to massage the data into a consistent tab-separated file with each gridcell representing one datum for one hurricane. I also matched each hurricane to the Atlantic basin hurricane data set by attaching an id number to each landfall hurricane by matching year-name combinations where available; and by matching year-landfall combinations in earlier years with a couple of years requiring manual inspection. I’ve posted up my collation here http://data.climateaudit.org/data/hurricane/landfall.hurdat.csv in a tab-separated ascii file, that you can access as follows:
Now let's say that you want to count the number of landfall hurricanes by year. I organize my datasets so that each record has an id and a year, maing use of tapply possible. If you wanted to calculate the average landfall windspeed by year, you could just do (the na.rm option excludes NA values from the calculation.)
landfall.wind< – tapply(landfall$wind,landfall$year,mean,na.rm=TRUE)
In order to do counts, I use the !is.na function and then sum. For example:.
count< – tapply(!is.na(landfall$year),landfall$year,sum)
Usually, one wants a time series and this method only returns values for years with values with the return showing the year as a name. In the past, through a couple of fiddly but not complicated operations, I’ve massaged such vectors to recover a time series. However, I noticed that, if you use the “levels” option in the “factor” function in R, you can avoid these fiddly operations, through the following:
count< – tapply(!is.na(landfall$year),factor(landfall$year,levels=1851:2006),sum)
This returns a value for each year with NA rather than 0 for years with no values. Sometimes you want NA if it isn’t observed; sometimes you want 0. If you want 0, this can be done by simply doing (I use !is,na and is.na a lot):
count[is.na(count)]< -0 #assigns 0 to NA values
If you want this as a time-series object, you can simply do (or the lines can be combined):
count< – ts( count ,start=1851)
One gets a quick plot as follows (I’ve increasingly used the plot function with parameters, but the ts.plot and plot.ts functions are even quicker, but sacrifice a little control):
Figure 1. Count of Annual Landfalling Hurricanesà’à
For analyses in which I want to do counts or averages or medians on restrictions, I nearly always apply a logical operator first and then use tapply the same way. Here’s a restriction for landfalls with speed greater than 65 knots, then one can use a logical limitation prior to the tapply function – I do this all the time in various studies. Thus:
count< – ts( tapply(!is.na(landfall$year[temp]),factor(landfall$year[temp],levels=1851:2006),sum) ,start=1851);count[is.na(count)]<-0
Easy as pi.