We’ve been discussing the bewildering methods that Hansen used to do what appears to be a trivial task: combining slightly variant versions of GHCN data at an individual station – versions that, in many cases, are identical for many of their readings. As it turns out, GHCN has a manual for their data set, including a section on “Duplicate Elimination”. If Hansen had read this manual, here’s what he would have read:
First, they discuss the situation with the maximum and minimum data sets at GHCN. Since GISS uses only mean monthly records, we haven’t explored the max/min data sets in the present context – something worth keeping in mind.
A time series for a given station can frequently be obtained from more than one source. For example, data for Tombouctou, Mali were available in six different source data sets. When “merging” data from multiple sources, it is important to identify these duplicate time series because: (1) the inclusion of multiple versions of the same station creates biases in areally averaged temperature analyses, and (2) the same station may have different periods of record in different data sets, therefore “mingling” the two versions can create longer time series.
The goal of duplicate station elimination is to reduce a large set of n time series (many of which are identical) to a much smaller set of m groups of time series that are unique. In the case of maximum and minimum temperature, 8,000 source data set time series were reduced to 4,964 unique time series. This was accomplished in the following fashion. First, the data for every station were compared with the data for every other station. This naturally started with stations whose metadata indicated they were in approximately the same location. Similarity was assessed by computing the total number of months of identical data as well as the percentage of months of identical data. Maximum/minimum temperature time series were considered duplicates of the same station if they shared the same monthly value at least 90% of the time, with at least 12 months of data being identical and no more than 12 being different. This process identified the duplicates, which were then mingled to form time series with longer periods of record after a manual inspection of the metadata (to avoid misconcatenations). This process was then repeated on the mingled data set without the initial metadata considerations, so every time series was compared to all the other time series in the database. Similarity of time series in this step was judged by computing the length of the longest run of identical values.
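The two similarity tests described above are simple enough to sketch in a few lines of Python. This is my own illustration of the stated criteria, not GHCN's actual code; the data layout (dicts keyed by (year, month)) and function names are assumptions.

```python
# Sketch of GHCN's stated duplicate criterion for max/min series:
# 90% of common months identical, at least 12 identical, no more than
# 12 different. My own illustration, not GHCN's code.

def is_duplicate(a, b):
    """a, b: dicts mapping (year, month) -> monthly temperature value."""
    common = set(a) & set(b)                      # months present in both
    if not common:
        return False
    identical = sum(1 for k in common if a[k] == b[k])
    different = len(common) - identical
    return (identical / len(common) >= 0.90
            and identical >= 12
            and different <= 12)

def longest_identical_run(a, b):
    """Length of the longest run of consecutive common months with equal
    values - the similarity measure of the second (post-mingling) pass."""
    run = best = 0
    for k in sorted(set(a) & set(b)):
        if a[k] == b[k]:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best
```

Note that a criterion like this only catches exact matches, which is exactly why the mean temperature series (computed many different ways) needed the more elaborate treatment described below.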
Cases where the time series were determined to be duplicates of the same station but the metadata indicated they were not the same station were examined carefully and a subjective decision was made. This assessment provided additional Quality Control of station locations and the integrity of their data. For example, a mean temperature time series for Thamud, Yemen had 25 years (1956-1981) of monthly values that were exactly identical to the mean temperature data from Kuwait International Airport (12 degrees farther north). Needless to say, one of these time series was in error. As with most of these problems, determining which time series was erroneous was fairly easy given the data, metadata, knowledge about the individual data sources, duplicate data, and other climatological information available.
They then proceed to discuss what they describe as a more “complicated” decision tree for mean temperatures.
The procedure for duplicate elimination with mean temperature was more complex. The first 10,000 duplicates (out of 30,000+ source time series) were identified using the same methods applied to the maximum and minimum temperature data sets. Unfortunately, because monthly mean temperature has been computed at least 101 different ways (Griffiths 1997), digital comparisons could not be used to identify the remaining duplicates. Indeed, the differences between two different methods of calculating mean temperature at a particular station can be greater than the temperature difference from two neighboring stations. Therefore, an intense scrutiny of associated metadata was conducted. Probable duplicates were assigned the same station number but, unlike the previous cases, not mingled because the actual data were not exactly identical (although they were quite similar). As a result, the GHCN version 2 mean temperature data set contains multiple versions of many stations. For the Tombouctou example, the 6 source time series were merged to create 4 different but similar time series for the same station (see Figure 1).
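The point about calculation methods is easy to see with a toy example (mine, not GHCN's). Two of the most common definitions of a daily mean – the true 24-hour average and (Tmax + Tmin)/2 – can disagree by several tenths of a degree for the same station on the same day:

```python
# Toy hourly temperatures for one day (invented values for illustration).
hourly = [5.0, 4.0, 3.5, 3.0, 3.0, 4.0, 6.0, 9.0, 12.0, 15.0, 17.0, 18.5,
          19.0, 19.5, 19.0, 18.0, 16.0, 13.5, 11.0, 9.0, 8.0, 7.0, 6.5, 6.0]

mean_24h = sum(hourly) / len(hourly)              # true 24-hour average
mean_maxmin = (max(hourly) + min(hourly)) / 2.0   # (Tmax + Tmin) / 2

# The two "means" differ by about 0.7 degrees C here - easily larger
# than the difference between two neighboring stations.
```

With 101 formulas in circulation (per Griffiths 1997), a digital equality test is obviously hopeless for mean temperature, hence the metadata scrutiny.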
While GHCN has obviously done a lot of work in collecting this information, I really wish that someone with some accounting experience had been involved so that there were clear audit trails. I looked at the Tombouctou data and, sure enough, there are 4 versions – but where did these come from? As far as I can tell, the version numbers do not track back to references, but are just arbitrary. If one is doing a proper analysis, shouldn’t one know which data set comes from which provenance (Smithsonian, WWR, UCAR, etc.)? Some of the records seem to have negligible differences in the period of common coverage. And having taken the trouble to preserve original data (good), why would they then merge 6 data versions to get 4 data versions? Why not stick with the original 6 series, with an identifier showing the provenance? Frustrating.
They go on to say:
Preserving the multiple duplicates provides some distinct benefits. It guarantees no concatenation errors. Adding the recent data from one time series to the end of a different time series can cause discontinuities unless the mean temperature was calculated the same way for both time series. It also preserves all possible information for the station. When two different values are given for the same station/year/month it is often impossible for the data set compiler to determine which is correct. Indeed, both may be correct given the different methods used to calculate mean temperature.
Fair enough for the most part. As to the last sentence, when you have long runs of identical values and singletons that differ, I think one can safely conclude that the difference is scribal rather than a difference in calculation method – although I suppose that there could be singleton differences in calculation method: a scribal error of a different sort.
GHCN go on to give the following warning:
Unfortunately, preserving the duplicates may cause some difficulty for users familiar with only one “correct” mean monthly temperature value at a station. There are many different ways to use data from duplicates. All have advantages and disadvantages. One can use the single duplicate with the most data for the period of interest; use the longest time series and fill in missing points using the duplicates; average all data points for that station/year/month to create a mean time series; or combine the information in more complicated ways, such as averaging the first difference (FD(year 1) = T(year 2) – T(year 1)) time series of the duplicates and creating a new time series from the average first difference series. Which technique is the best depends on the type of analysis being performed.
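The last of those options – averaging first differences – is the least obvious, so here is a minimal sketch of how it could work. This is my own reading of the one-sentence description, not GHCN's code; the annual data layout, function name, and the need for a `base` anchor are all my assumptions.

```python
# Sketch of combining duplicates by averaging first differences:
# difference each duplicate series, average the differences year by
# year, then cumulate back into one series. First differences discard
# the absolute level, so the result must be anchored at some `base`
# value for the earliest year - my assumption, not stated by GHCN.

def first_difference_combine(series_list, base=0.0):
    """series_list: list of dicts mapping year -> annual mean temperature.
    Returns a dict built from averaged first differences."""
    diffs = {}
    for s in series_list:
        years = sorted(s)
        for prev, cur in zip(years, years[1:]):
            if cur == prev + 1:                   # only adjacent years
                diffs.setdefault(cur, []).append(s[cur] - s[prev])
    combined = {}
    level = base
    for y in sorted(diffs):
        level += sum(diffs[y]) / len(diffs[y])
        combined[y] = level
    return combined
```

The appeal of this method is that duplicates computed with different mean formulas can share a common year-to-year *change* even when their absolute levels disagree, so averaging the differences sidesteps the level offsets that a plain average of values would smear together.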
It certainly appears that Hansen’s “method”, whatever it ultimately proves to be, is distinct from any of the above, leading to what seems to be an extensive corruption of the GISS data set.