Counting Ragged Arrays in R

One of the common operations in the types of analyses done here is simply counting things. I am constantly amazed by the tremendous productivity of the tapply function in R – I use it over and over – for producing interesting results at warp speed. There’s nothing particularly novel in how I use it, but I’ve got a lot of experience now and have a few little techniques that I find useful. I made a little improvement in how I use it the other day and thought that it might be useful to some readers to illustrate this. The method has no mathematical significance – but efficiency in handling new data sets is important for what I do and I thought that I write it up and illustrate a couple of simple and useful techniques. The issue arise during the collation of hurricane landfall data.

The most recent dataset is at ; this appears to be a little more recent that the dataset that Willis used in making a count posted up in October. The archived information is quite messy in format and it took me a couple of hours to massage the data into a consistent tab-separated file with each gridcell representing one datum for one hurricane. I also matched each hurricane to the Atlantic basin hurricane data set by attaching an id number to each landfall hurricane by matching year-name combinations where available; and by matching year-landfall combinations in earlier years with a couple of years requiring manual inspection. I’ve posted up my collation here in a tab-separated ascii file, that you can access as follows:

landfall< -read.table(";,sep="\t",header=TRUE)

Now let's say that you want to count the number of landfall hurricanes by year. I organize my datasets so that each record has an id and a year, maing use of tapply possible. If you wanted to calculate the average landfall windspeed by year, you could just do (the na.rm option excludes NA values from the calculation.)

landfall.wind< – tapply(landfall$wind,landfall$year,mean,na.rm=TRUE)

In order to do counts, I use the ! function and then sum. For example:.

count< – tapply(!$year),landfall$year,sum)

Usually, one wants a time series and this method only returns values for years with values with the return showing the year as a name. In the past, through a couple of fiddly but not complicated operations, I’ve massaged such vectors to recover a time series. However, I noticed that, if you use the “levels” option in the “factor” function in R, you can avoid these fiddly operations, through the following:

count< – tapply(!$year),factor(landfall$year,levels=1851:2006),sum)

This returns a value for each year with NA rather than 0 for years with no values. Sometimes you want NA if it isn’t observed; sometimes you want 0. If you want 0, this can be done by simply doing (I use !is,na and a lot):

count[]< -0 #assigns 0 to NA values

If you want this as a time-series object, you can simply do (or the lines can be combined):

count< – ts( count ,start=1851)

One gets a quick plot as follows (I’ve increasingly used the plot function with parameters, but the ts.plot and plot.ts functions are even quicker, but sacrifice a little control):

par(mar=c(3,3,1,1)); plot(1851:2006,count,type=”l”,xlab=””,ylab=””)


Figure 1. Count of Annual Landfalling Hurricanesàƒ’€šà‚

For analyses in which I want to do counts or averages or medians on restrictions, I nearly always apply a logical operator first and then use tapply the same way. Here’s a restriction for landfalls with speed greater than 65 knots, then one can use a logical limitation prior to the tapply function – I do this all the time in various studies. Thus:

count< – ts( tapply(!$year[temp]),factor(landfall$year[temp],levels=1851:2006),sum) ,start=1851);count[]<-0

Easy as pi.


  1. John Norris
    Posted Jan 17, 2007 at 7:59 PM | Permalink

    I thought I would take my second attempt at R, my first being when Dr. Juckes was having trouble with your Mann Hockey Stick replicator.

    I cut and paste the line with landfall and read.table (less than sign omitted here for fear of unintended blog consequences) into a script on my computer. I ran it and received the following error:

    Error: object “landfall” not found

    I deleted a spurious appearing space between the less than sign and the subsequent – sign, and the line appeared to execute okay. I did this with subsequent lines that I cut and paste from this thread and they also appeared to execute okay. When I got down to plotting using the par statement, it failed.

    Error: syntax error in ” plot(1851:2006,count,type=””

    Not expecting you to waste your valuable time on my baby steps with R, but if you see something obvious I would appreciate a redirect.

  2. Evan Englund
    Posted Jan 17, 2007 at 9:15 PM | Permalink

    To: New R users,

    Steve writes R scripts using “=65)
    temp = (landfall$wind>=65)

    are identical. If, like me, you grew up on languages like Basic and Fortran, the latter is easier to read – and it is definitely easier to type!

    (When you really want to say “equals” as opposed to “is assigned the value of the following expession”, the operator is “==”, as in: if (a == b))

  3. Evan Englund
    Posted Jan 17, 2007 at 11:20 PM | Permalink


    Why can’t the comment box do WISIWIG?

    The point I was trying to make is that in current versions of R you can use a simple “=” instead of the more cumbersome “less_than hyphen” symbol combination as the assignment operator:

    temp = (landfall$wind>=65) is the same as
    temp “less_than hyphen” (landfall$wind>=65)

  4. Steve McIntyre
    Posted Jan 17, 2007 at 11:46 PM | Permalink

    EVan this is only a problem when I paste these things into WordPress. Sorry about that.

  5. Posted Jan 18, 2007 at 1:56 AM | Permalink

    Here’s a hint:

    If you want to post code into the comments and stop WordPress from interpreting the symbols, then use the tags.

%d bloggers like this: