Hansen and the GHCN Manual

We’ve been discussing the bewildering methods that Hansen used to do what appears to be a trivial task: combining slightly variant versions of GHCN data at an individual station – versions that in many cases are identical for many of their readings. As it turns out, GHCN has a manual for their data set, including a section on “Duplicate Elimination”. If Hansen had read this manual, here’s what he would have read:

First, they discuss the situation with the maximum and minimum data sets at GHCN. GISS uses only the mean monthly records, so we haven’t explored the max/min data sets in the present context – which is worth keeping in mind.

A time series for a given station can frequently be obtained from more than one source. For example, data for Tombouctou, Mali were available in six different source data sets. When “merging” data from multiple sources, it is important to identify these duplicate time series because: (1) the inclusion of multiple versions of the same station creates biases in areally averaged temperature analyses, and (2) the same station may have different periods of record in different data sets, therefore “mingling” the two versions can create longer time series.

The goal of duplicate station elimination is to reduce a large set of n time series (many of which are identical) to a much smaller set of m groups of time series that are unique. In the case of maximum and minimum temperature, 8,000 source data set time series were reduced to 4,964 unique time series. This was accomplished in the following fashion. First, the data for every station were compared with the data for every other station. This naturally started with stations whose metadata indicated they were in approximately the same location. Similarity was assessed by computing the total number of months of identical data as well as the percentage of months of identical data. Maximum/minimum temperature time series were considered duplicates of the same station if they shared the same monthly value at least 90% of the time, with at least 12 months of data being identical and no more than 12 being different. This process identified the duplicates, which were then mingled to form time series with longer periods of record after a manual inspection of the metadata (to avoid misconcatenations). This process was then repeated on the mingled data set without the initial metadata considerations, so every time series was compared to all the other time series in the database. Similarity of time series in this step was judged by computing the length of the longest run of identical values.
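
For concreteness, here is a minimal sketch in Python of the matching rules as I read them from the manual. The data structures and function names are my own invention – GHCN’s actual code is not public – so this only encodes the stated thresholds: at least 90% of overlapping months identical, at least 12 months identical, no more than 12 different, plus the longest-run criterion used in the second pass.

```python
# Sketch of the GHCN max/min duplicate tests as stated in the manual.
# Hypothetical data structures; not GHCN's actual code.

def is_duplicate(series_a, series_b):
    """Each series is a dict mapping (year, month) -> monthly value."""
    overlap = set(series_a) & set(series_b)
    if not overlap:
        return False
    identical = sum(1 for key in overlap if series_a[key] == series_b[key])
    different = len(overlap) - identical
    return (identical / len(overlap) >= 0.90
            and identical >= 12
            and different <= 12)

def longest_identical_run(series_a, series_b):
    """Longest run of equal values over the overlapping months, taken in
    calendar order (the second-pass similarity measure)."""
    run = best = 0
    for key in sorted(set(series_a) & set(series_b)):
        run = run + 1 if series_a[key] == series_b[key] else 0
        best = max(best, run)
    return best
```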

Cases where the time series were determined to be duplicates of the same station but the metadata indicated they were not the same station were examined carefully and a subjective decision was made. This assessment provided additional Quality Control of station locations and the integrity of their data. For example, a mean temperature time series for Thamud, Yemen had 25 years (1956-1981) of monthly values that were exactly identical to the mean temperature data from Kuwait International Airport (12 degrees farther north). Needless to say, one of these time series was in error. As with most of these problems, determining which time series was erroneous was fairly easy given the data, metadata, knowledge about the individual data sources, duplicate data, and other climatological information available.

They then proceed to discuss what they describe as a more “complicated” decision tree for mean temperatures.

The procedure for duplicate elimination with mean temperature was more complex. The first 10,000 duplicates (out of 30,000+ source time series) were identified using the same methods applied to the maximum and minimum temperature data sets. Unfortunately, because monthly mean temperature has been computed at least 101 different ways (Griffiths 1997), digital comparisons could not be used to identify the remaining duplicates. Indeed, the differences between two different methods of calculating mean temperature at a particular station can be greater than the temperature difference from two neighboring stations. Therefore, an intense scrutiny of associated metadata was conducted. Probable duplicates were assigned the same station number but, unlike the previous cases, not mingled because the actual data were not exactly identical (although they were quite similar). As a result, the GHCN version 2 mean temperature data set contains multiple versions of many stations. For the Tombouctou example, the 6 source time series were merged to create 4 different but similar time series for the same station (see Figure 1).
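
To see why “digital comparisons could not be used”, it helps to work one example. Here are three defensible ways to compute a daily mean from the same synthetic observations – all numbers invented, and these three are merely a sample of the 101-plus variants Griffiths counted:

```python
import math

# Synthetic hourly temperatures for one day (invented, illustrative only).
hourly = [15.0 + 8.0 * math.sin(math.pi * (h - 5) / 14) for h in range(24)]

mean_24h = sum(hourly) / 24                    # mean of 24 hourly readings
mean_maxmin = (max(hourly) + min(hourly)) / 2  # (Tmax + Tmin) / 2
# A fixed-hours formula of the traditional "Mannheim" type:
mean_fixed = (hourly[7] + hourly[14] + 2 * hourly[21]) / 4

# Same day, three different "means" (roughly 16.3, 15.4 and 15.9 here).
print(round(mean_24h, 2), round(mean_maxmin, 2), round(mean_fixed, 2))
```

Two versions of the same station that differ only in the averaging convention will rarely match digit-for-digit, so the identical-value tests above are useless for them.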

While GHCN has obviously done a lot of work in collecting this information, I really wish that someone with some accounting experience had been involved so that there were clear audit trails. I looked at the Tombouctou data and, sure enough, there are 4 versions – but where did these come from? As far as I can tell, the version numbers do not track back to references; they are just arbitrary. If one is doing a proper analysis, shouldn’t one know which version comes from which provenance (Smithsonian, WWR, UCAR, etc.)? Some of the records seem to have negligible differences in the period of common coverage. And having taken the trouble to preserve original data (good), why would they then merge 6 data versions to get 4 data versions? Why not stick with the original 6 series, with an identifier showing the provenance? Frustrating.

They go on to say:

Preserving the multiple duplicates provides some distinct benefits. It guarantees no concatenation errors. Adding the recent data from one time series to the end of a different time series can cause discontinuities unless the mean temperature was calculated the same way for both time series. It also preserves all possible information for the station. When two different values are given for the same station/year/month it is often impossible for the data set compiler to determine which is correct. Indeed, both may be correct given the different methods used to calculate mean temperature.
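
The discontinuity warning is easy to illustrate with invented numbers. Splice a duplicate whose mean was computed one way onto a duplicate computed another way (here, offset by a constant ~0.4 degrees) and you manufacture a step at the splice that has nothing to do with climate:

```python
# Toy illustration of the concatenation discontinuity; all values invented.
older = {1950 + i: 14.0 + 0.01 * i for i in range(30)}          # method A, 1950-1979
recent = {1975 + i: 14.4 + 0.01 * (25 + i) for i in range(30)}  # method B, 1975-2004

# Naive concatenation: method A through 1979, method B from 1980 on.
spliced = {y: older[y] if y < 1980 else recent[y] for y in range(1950, 2005)}

# The underlying trend is 0.01/yr, but the splice jumps ~0.41 degrees.
print(round(spliced[1979], 2), round(spliced[1980], 2))
```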

Fair enough, for the most part. As to the last sentence: when you have long runs of identical values and only singletons that differ, I think one can safely conclude that the difference is scribal rather than a difference in calculation method – although I suppose there could be a singleton difference in calculation method: a scribal error of a different sort.

GHCN go on to give the following warning:

Unfortunately, preserving the duplicates may cause some difficulty for users familiar with only one “correct” mean monthly temperature value at a station. There are many different ways to use data from duplicates. All have advantages and disadvantages. One can use the single duplicate with the most data for the period of interest; use the longest time series and fill in missing points using the duplicates; average all data points for that station/year/month to create a mean time series; or combine the information in more complicated ways, such as averaging the first difference (FD(year 1) = T(year 2) − T(year 1)) time series of the duplicates and creating a new time series from the average first difference series. Which technique is best depends on the type of analysis being performed.
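
Of these options, the first-difference recipe is the least obvious, so here is how I read it – a minimal sketch with hypothetical data structures, not anyone’s actual code:

```python
def combine_first_differences(duplicates):
    """duplicates: list of dicts, each mapping year -> mean temperature.
    Averages the first differences FD(y) = T(y+1) - T(y) across all
    duplicates reporting both years, then cumulates them into a single
    anomaly-like series."""
    years = sorted(set(y for d in duplicates for y in d))
    series, level = {years[0]: 0.0}, 0.0
    for y0, y1 in zip(years, years[1:]):
        fds = [d[y1] - d[y0] for d in duplicates if y0 in d and y1 in d]
        # Where no duplicate covers both years, this sketch simply carries
        # the level forward; a real implementation must decide what to do.
        level += sum(fds) / len(fds) if fds else 0.0
        series[y1] = level
    return series
```

Note that the result is only defined up to an additive constant, which is why it is natural for anomaly work rather than for absolute temperatures.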

It certainly appears that Hansen’s “method”, whatever it ultimately proves to be, is distinct from any of the above, leading to what seems to be an extensive corruption of the GISS data set.

12 Comments

  1. David
    Posted Sep 7, 2007 at 9:04 AM | Permalink

    or combine the information in more complicated ways, such as … Which technique is best depends on the type of analysis being performed.

    Hansen is performing a very complex analysis that requires him to combine the data in a more complicated way such as to fit his hypothesis…

  2. Steve Moore
    Posted Sep 7, 2007 at 9:54 AM | Permalink

    First, the data for every station were compared with the data for every other station.

    In the “Good Old Days” this and the subsequent steps would have been done manually. Nowadays, there is such a blind faith in the efficacy of computers that the temptation is to just throw some code together to do it (my choice of words is deliberate). It is very rare to find someone skilled in both areas.

    I’m reminded of a quote from Jerry Pournelle in a Byte Magazine column (around 1980) in which he bemoaned the paucity of adequate accounting software: “Accountants aren’t programmers, and programmers aren’t accountants.” Just substitute “climate scientists” for “accountants”.

  3. VG
    Posted Sep 7, 2007 at 9:55 AM | Permalink

    Don’t know if this should be posted here. From Cryosphere Today: http://arctic.atmos.uiuc.edu/cryosphere/IMAGES/current.365.south.jpg – note that Antarctic extent went off the chart on 8 Sept 2007, but there is no mention of it (or correction to the chart), as distinct from the record decrease in the Arctic. By the way, the Arctic is also increasing (albeit from a record low), some 4-5 weeks sooner this autumn compared to the previous year.

  4. John F. Pittman
    Posted Sep 7, 2007 at 10:04 AM | Permalink

    Preserving the multiple duplicates provides some distinct benefits. It guarantees no concatenation errors. Adding the recent data from one time series to the end of a different time series can cause discontinuities unless the mean temperature was calculated the same way for both time series.

    With Hansen saying that they used a 5-sigma criterion (?, from memory) to decide whether to reject a data point, doesn’t the above quote indicate that there would certainly be errors in Hansen’s approach, due to his implicit assumption that he could combine station data using one method, rather than looking at the data and determining which method was appropriate?
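
    (For concreteness, an n-sigma rejection rule looks something like the sketch below – purely illustrative, since the “5 sigma” figure above is from memory and Hansen’s actual procedure is exactly what is in question.)

    ```python
    # Illustrative n-sigma outlier rejection; not Hansen's actual code.
    def reject_outliers(values, n_sigma=5.0):
        mean = sum(values) / len(values)
        sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [v for v in values if abs(v - mean) <= n_sigma * sd]
    ```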

  5. steven mosher
    Posted Sep 7, 2007 at 10:31 AM | Permalink

    Gavin and Eli are always quick to tell people to RTFR: Read The ..-. ..- -.-. -.- .. -. --. Rules.

    This is similar to RTFM: Read The ..-. ..- -.-. -.- .. -. --. Manual.

    How many more mistakes do we have to find in Hansen’s prosecution before he Frees the Code?

  6. Robert L
    Posted Sep 7, 2007 at 10:52 AM | Permalink

    I suppose it is possible that Hansen wrote the manual?

    In which case he is merely exercising his options to:

    Unfortunately, preserving the duplicates may cause some difficulty for users familiar with only one “correct” mean monthly temperature value at a station. There are many different ways to use data from duplicates. All have advantages and disadvantages. One can use the single duplicate with the most data for the period of interest; use the longest time series and fill in missing points using the duplicates; average all data points for that station/year/month to create a mean time series; or combine the information in more complicated ways, such as averaging the first difference (FD(year 1) = T(year 2) − T(year 1)) time series of the duplicates and creating a new time series from the average first difference series. Which technique is best depends on the type of analysis being performed.

    Get the result desired.

  7. Sam Urbinto
    Posted Sep 7, 2007 at 1:30 PM | Permalink

    Re Steve Moore #2: It’s been my experience that someone who’s an expert programmer does pretty much that alone. True for most experts. They focus. Being an X and having to program is usually a side thing. In fact, in my experience, it usually takes a few people just to do one thing correctly.

    BTW, Chaos Manor is still around: http://www.chaosmanorreviews.com/

  8. VG
    Posted Sep 7, 2007 at 2:40 PM | Permalink

    Re #3: the link should have been

    my apologies

  9. Steve Moore
    Posted Sep 7, 2007 at 3:52 PM | Permalink

    Re #7:
    That’s pretty much my point. Ideally, a client (scientist, engineer, etc.) should understand his data well enough to be able to specify to a programmer what he wants. This might take a few rewrites, but the end product (hopefully) will be reproducible.
    Sadly, that doesn’t appear to be the case with a lot of what we see.

    I continue to visit the Manor. Often worth the trip.

  10. Gary
    Posted Sep 7, 2007 at 6:30 PM | Permalink

    Good to see ol’ Jerry P is still around. Imagine if he sank his teeth into the climate science shenanigans.

  11. steven mosher
    Posted Sep 7, 2007 at 6:52 PM | Permalink

    RE 10.

    Where’s Jerry?

    http://www.weirdstuff.com/sunnyvale/html/weirdcam.htm

  12. D. Patterson
    Posted Sep 8, 2007 at 2:34 AM | Permalink

    Re: #10

    In fact, Jerry has been providing commentary at Chaos Manor in regard to the current state of climate science and Global Warming hysteria.
