For your Sunday edification, today’s crossword puzzle pulls together 3 examples of Hansen data “combining”. The question is to find an AlGore-ithm that accounts for all three versions. The three data sets each have 3 columns: the first two are data versions; the third column is Hansen-combined. In each case, there is an overlap between data versions and where the data versions overlap, they are identical. In each case, during the period of one overlap, one version has one value missing and as a result of this one missing value, the early values of the version which has the earlier values is adjusted.
In one case, the missing value is a winter value; in the other two cases. In the case where a winter value was missing the earlier version was adjusted down; in a case where a summer value was missing, the earlier version was adjusted up.
Praha http://data.climateaudit.org/data/giss/61111520000.dat
Joenssu http://data.climateaudit.org/data/giss/61402929000.dat
Gassim http://data.climateaudit.org/data/giss/22340405000.dat
These are all tab-separated and can be read into Excel for Excel users.
The issue is not to find a rational algorithm that yields the adjustments: the only rational way to combine the data is just to combine the available values. The issue is to find any algorithm, however improbable. I’ve posted notes on my efforts, but am now stuck. Some things seem likely but don’t be limited by my guesses as they may be wrong.
Notwithstanding these caveats, as noted before, it appears that Hansen calculates the first month of overlapping values and last month of overlapping values. One possibility was that he calculated the difference in means over available months without any allowance for whether it was a summer or winter month missing and then uses a form of Hansen-rounding to calculate a 1-digit adjustment. I was able to get a semi-plausible formula that fit Praha and then failed Joenssu; then one that fit Praha and Joenssu, but failed Gassim. Praha has a negative adjustment to early values; Joenssu and Gassim positive adjustments. From inspecting quite a few series, I would say that negative adjustments are by far more prevalent, but we’re working here with 2-version examples for simplicity. Who knows what will happen when we get to 3 version stations.
UPDATE #1
Thank you for the various contributions and suggestions, especially to John Goetz both for identifying the problem and solving the calculation of the delta. I think that we’ve solved quite a bit of what Hansen’s done and ended up turning over another large stone: how Hansen calculates seasonal and annual averages for individual station versions. Now that we’re close to describing what he did, it appears increasingly likely to me that Hansen has corrupted his entire data set. Whether the corruption of the data set “matters” is a different issue, but the evidence is now quite powerful that the data set has been corrupted through the problems discussed below.
Let me summarize what I think that we know. In some cases, there are perhaps alternative ways at arriving at an answer and I’ve picked approaches that seem most likely to me to have been used.
Single-Digit Adjustment
I think that the evidence is overwhelming that Hansen makes an adjustment in even tenths to one column in these two-column arrays of monthly data. The delta for Praha is -0.1 to the first column; +0.2 for Gassim and +0.1 for Joenssu. After adjusting one column, Hansen then calculates a combined monthly series, that is not a direct average, but a “Hansen average” as discussed below. This leaves two problems: (1) how the single-digit delta is calculated; and (2) how the combined version is calculated from from the adjusted matrix.
Integer Arithmetic
There is convincing evidence that Hansen uses integer arithmetic in tenths of a degree. It’s possible to emulate integer arithmetic to some extent in R with a little thought. Floating point arithmetic with conventional rounding won’t yield Hansen combined series.
Calculation of the Delta
I am convinced that John G has solved the calculation of the delta. By collating monthly series, I complicated the solution of the problem because I (and people following my collations) were working only with monthly data versions. Hansen’s GISS version also includes annual versions. While it is presently unknown how Hansen calculated the annual and quarterly averages (as his method of filling missing data is unknown), John G’s method (and I’ve confirmed this with a slight variation) will recover the deltas.
Briefly, the two versions are compared to determine which years have values for both versions – this is 4-5 years in the Praha, Gassim and Joensuu cases. Then the difference between the averages is calculated, yielding the required delta (-0.1, 0.2, 0.1). In the Gassim case, the result is exactly 0.15 and when I calculated the result using floating point arithmetic, I got 0.1 instead of 0.2. So you do have to take care that one emulates integer arithmetic.
Hansen Averaging
We’ve had a variety of suggestions on how to obtain the combined value once the adjusted matrix is created – some of which are probably equivalent. I’ve re-stated my own solution to the problem since I’ve spent time confirming that it works – without excluding the possibility that some other variation was used. (BTW I think that it is very unlikely that Hansen detoured through Kelvin in these calculations though.) As I noted in a post below, here’s the evidence to consider:
1) if there is only one available value from the two-column matrix after adjustments, the adjusted value is taken whether it comes from column 1 (adjusted) or column 2 (unadjusted). No problem here. So we only need to consider averaging where both columns have values.
2) If the sum of the two columns (in tenths) is even, then the combined value is the average – no problem.
3) for Praha, if the sum of the two columns (in tenths) is odd, then the combined value is averaged up. For Joensuu, if the sum of the two columns (in tenths) is odd, then the combined value is averaged down. For Gassim, there is only one case where the sum of the two columns (in tenths) is odd – May 1990- and in this case, the combined value is averaged down.
The rounding up or rounding down depends on the sign of the adjustment. If there is a negative adjustment to column 1, it’s averaged up; if there’s a positive adjustment, it’s averaged down. I was able to implement this by checking whether the sum of the two entries was even or odd, calculating the “Hansen-average” as follows:
1) if there’s only one available value, use it.
2) if the sum of the two values is even, take a simple average;
3) if the sum of the two values is odd, add 1 if the delta is positive (i.e. subtraction in the first column) and subtract 1 if the delta is negative (i.e. addition in the first column). Then take an average.
If it seems a little longwinded, remember that we have to consider integer arithmetic.
Seasonal Averages
Now we turn to the seasonal and annual averages that underlie the troublesome delta calculations. Here we have another crossword puzzle. I’ve shown below the relevant overlapping years for the two Praha-Libu” versions – I apologize for the ragged format. I’ve bolded points of difference.
The 1986 year does not enter into the delta calculations but December values enter into the quarterly DJF calculations of the following year. The only difference between the two versions here is that Version 0 has a December 1986 value of 1.4, while Verson 1 is missing a December 1986 value. The 1987 DJF estimated first quarter for the version with a missing month (version 1) is -3.7, while the 1987 actual DJF first quarter for the version with all three months is calculated arithmetically to be -2.0. There is a second smaller difference in 1991 values which is hard to understand since all 1991 values are identical in the two versions. (Explaining this is a new puzzle.) To calculate the deltas from raw data, we have to be able to replicate how Hansen fills in missing data.
In the example below, the actual Praha temperature in Dec 1986 according to version 0 was 1.4 deg C. This information was missing from version 1. By calculating the Dec 1986 temperature required to yield the 1987 DJF average, one can easily see that Hansen estimated the version 1 Dec 1986 temperature at -3.6 deg C – fully 5.0 deg C lower than the actual temperature (Version 0). He then said – aha, version 1 measures cold relative to version 0 and therefore version 0 (which contains the older data) needs to be adjusted down (in this case by 0.1 deg C.) It’s crazy.
Praha Version 611115200000
year jan feb mar apr may jun jul aug sep oct nov dec DJF MAM JJA SON ANN
1986 0.1 -6.4 3.9 9.4 16.2 17.1 18.2 17.8 12.6 9.4 5.0 1.4 -0.8 9.8 17.7 9.0 8.93
1987 -6.7 -0.8 -0.7 9.8 11.6 15.6 18.6 16.3 15.5 9.6 4.9 1.8 -2.0 6.9 16.8 10.0 7.93
1988 2.6 2.2 2.7 9.3 15.7 16.3 18.6 18.3 14.1 9.9 1.3 2.7 2.2 9.2 17.7 8.4 9.40
1989 1.2 3.6 7.5 9.2 14.7 16.2 18.8 18.3 15.0 10.6 1.9 1.6 2.5 10.5 17.8 9.2 9.98
1990 1.5 5.5 7.9 8.1 15.3 17.4 18.7 19.8 12.0 9.6 4.4 0.3 2.9 10.4 18.6 8.7 10.15
1991 1.3 -2.9 6.2 7.8 10.2 15.6 20.0 18.5 15.3 NA 3.2 NA -0.4 8.1 18.0 9.2 8.73
Praha Version 611115200001
year jan feb mar apr may jun jul aug sep oct nov dec DJF MAM JJA SON ANN
1986 NA NA NA NA NA NA NA NA NA NA 5.0 NA NA NA NA NA NA
1987 -6.7 -0.8 -0.7 9.8 11.6 15.6 18.6 16.3 15.5 9.6 4.9 1.8 -3.7 6.9 16.8 10.0 7.50
1988 2.6 2.2 2.7 9.3 15.7 16.3 18.6 18.3 14.1 9.9 1.3 2.7 2.2 9.2 17.7 8.4 9.40
1989 1.2 3.6 7.5 9.2 14.7 16.2 18.8 18.3 15.0 10.6 1.9 1.6 2.5 10.5 17.8 9.2 9.98
1990 1.5 5.5 7.9 8.1 15.3 17.4 18.7 19.8 12.0 9.6 4.4 0.3 2.9 10.4 18.6 8.7 10.15
1991 1.3 -2.9 6.2 7.8 10.2 15.6 20.0 18.5 15.3 NA 3.2 -0.7 -0.4 8.1 18.0 9.4 8.77
Difference
year jan feb mar apr may jun jul aug sep oct nov dec DJF MAM JJA SON ANN
1987 0 0 0 0 0 0 0 0 0 0 0 0 -1.7 0 0 0.0 -0.43
1988 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0 0 0.0 0.00
1989 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0 0 0.0 0.00
1990 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0 0 0.0 0.00
1991 0 0 0 0 0 0 0 0 0 NA 0 NA 0.0 0 0 0.2 0.04
Appropriateness of “Bias Method” for Scribal Errors
When one puts down what Hansen did in black and white, the inappropriateness becomes increasingly clear.
In this case, Hansen has estimated a value for the missing December 1986 value in Version 1. The most obvious method for estimating the value would be to use the available December 1986 value in Version 0, since these values coincide for all other points. In linear regression terms, the r2 is 1. Instead of doing this, Hansen used an unknown method to estimate the missing value. He then calculated the difference between his estimate and the actual values in the other version and stated that this difference was a bias in the versions. He then used this bias to adjust the data.
It’s pretty hard for a statistical method to be as “wrong” as this one. In the cases that we’ve seen, the error sometimes adds to early values; it is my impression that the error more often increases the apparent trend. Estimating the impact of this Hansen error on the data set is a different project. We’ll see where this goes.
UPDATE #2
The following R-function appears to calculate the Hansen-delta between two station versions. X and Y are R-data frames in GISS format with a column with name “ANN” corresponding to the 18th column in GISS format.
hansen_bias=function(X,Y) {
temp1=!is.na(X$ANN); temp2=!is.na(Y$ANN)
test=match(Y$year[temp2],X$year[temp1]);temp=!is.na(test)
A=Y$ANN[temp2][temp];#A
B= X$ANN[temp1][test[temp]]
a= sum( floor(100*A -100*B)) ; K=length(A)
hansen_bias= sign(a) *round(floor( (abs(a) + 5*K)/K)/100,1)
hansen_bias}
UPDATE #3 (Sept 8, 2007)
Hansen released source code yesterday. In a quick look at the code for this step, the combination of station information is done in Step 1 using the program comb_records.py which unfortunately, like Hansen’s other programs, and for that matter, Juckes’ Python programs, lacks comments. However, with the work already done, we can still navigate through the shoals.
The delta calculation described above appears to be done in the routine get_longest_overlap where a value diff is calculated. Squinting at the code, there appear to be references to the annual column, confirming the surmise made by CA readers. It doesn’t look like there is rounding at this stage – so maybe the rounding just occurring at the output stage, contrary to some of the above surmise. I still think that there’s some rounding somewhere in the system, but I can’t see where right now and this could be an incorrect surmise.
After calculating the delta (“diff”), this appears to be applied to the data in the routine combine which in turn uses the routine add (in which “diff” is an argument).
There is an interesting routine called get_best which ranks records in the order: MCDW, USHCN, SUMOFDAY and UNKNOWN and appears to use them in that order. I don’t recall seeing that mentioned in the documentation. I’m going to look for the location of this information in the new data as I don’t recall any prior availability of this information.