Earlier this year I wrote a post on how much estimation GISS applies to the GHCN temperature record before generating zonal and global averages. A graphic in that post compared the amount of real temperature data with the amount of estimation over time. To read the graphic, consider 2000 as an example. As of February 7, 2008, there were 3159 station records in the GHCN data with an entry for the year 2000. Of those station records, 62% were complete, so an annual average could be calculated directly. Another 29% were incomplete but contained enough monthly data for the GISS estimation method to kick in. The final 9% were so incomplete that no estimation could be done.
What I did not explore at the time, and would like to look at more closely here, is the accuracy of the estimation. One would hope that with so much infilling going on the accuracy would be rather high (I will leave the determination of “high accuracy” for a later time). Because I did not have real data to compare with the GISS estimations, I took another approach: I used the GISS method to estimate real temperature data as if that data were missing.
Recall that GISS never explicitly estimates missing monthly temperatures. What they do is estimate seasonal averages when one monthly temperature is missing but the other two are present. Similarly, an annual temperature can be estimated when one seasonal value is missing but the other three are present. Using this methodology GISS can estimate an annual temperature when as many as six monthly values are missing.
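The counting rule above can be sketched in a few lines. This is an illustration only, assuming a season is usable when at least two of its three months are present and a year is usable when at least three of its four seasons are. The function names are hypothetical and this is not GISS code.

```python
def season_usable(months):
    """A season can be averaged or estimated if at most one month is missing."""
    return sum(m is not None for m in months) >= 2


def annual_usable(seasons):
    """A year can be averaged or estimated if at most one season is unusable.

    `seasons` is a list of four 3-element lists; None marks a missing month.
    """
    return sum(season_usable(s) for s in seasons) >= 3


# Worst case that still yields an annual value: one season entirely missing
# (3 months) plus one missing month in each of the other three seasons.
worst_case = [
    [None, None, None],
    [None, 1.0, 2.0],
    [3.0, None, 4.0],
    [5.0, 6.0, None],
]
missing = sum(m is None for s in worst_case for m in s)
print(missing, annual_usable(worst_case))
```

Removing one more month from any of the three partial seasons leaves two unusable seasons, so seven missing months is always one too many under these rules.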
While no explicit monthly estimate is recorded by GISS, it certainly can be derived from the seasonal estimate. I have shown several times a one-line equation that exactly reproduces the GISS seasonal estimate. Leaving a subsequent derivation as an exercise for the reader, the implied monthly estimate can be found from that equation and is expressed as follows:
where the average values for A, B, and C are calculated from all valid entries for the given month in a particular station record.
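For readers who want to experiment, here is a minimal sketch of one way to compute the implied monthly estimate. It assumes that the GISS seasonal estimate is the seasonal climatology plus the average anomaly of the two months that are present. Under that assumption, the implied value for a missing month A works out to mean(A) + [(B - mean(B)) + (C - mean(C))] / 2. The function names and the sample numbers are made up for illustration.

```python
def giss_seasonal_estimate(b, c, mean_a, mean_b, mean_c):
    """Seasonal mean with month A missing: climatology plus the mean
    anomaly of the two months that are present (assumed form)."""
    climatology = (mean_a + mean_b + mean_c) / 3.0
    anomaly = ((b - mean_b) + (c - mean_c)) / 2.0
    return climatology + anomaly


def implied_monthly_estimate(b, c, mean_a, mean_b, mean_c):
    """Value of A that makes (A + b + c) / 3 equal the seasonal estimate."""
    return mean_a + ((b - mean_b) + (c - mean_c)) / 2.0


# Made-up numbers, not station data.
a_hat = implied_monthly_estimate(b=1.0, c=-2.0, mean_a=2.0, mean_b=0.0, mean_c=-1.0)
seasonal = giss_seasonal_estimate(1.0, -2.0, 2.0, 0.0, -1.0)
print(a_hat, seasonal, (a_hat + 1.0 - 2.0) / 3.0)
```

The last two printed values should agree, which is the sense in which the monthly estimate is "implied" by the seasonal one.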
Now to test the estimation accuracy. In Connecticut, December 2006 was warmer than normal, but February 2007 was colder than normal. Looking at the records for Hartford, CT, we see the following monthly and seasonal temperatures:
Dec 2006: 3.3
Jan 2007: -0.3
Feb 2007: -4.6
If the December 2006 record were missing from Hartford, GISS would estimate a value of -0.7 C, which would yield a seasonal average of -1.9 C. Similarly, if February 2007 were missing, GISS would estimate it at 1.7 C and produce a seasonal average of 1.6 C. That’s a 4.0 degree miss for December, a 6.3 degree miss for February, and a 3.5 degree swing at the seasonal level.
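The arithmetic behind those figures can be checked directly. The sketch below uses only the Hartford values and the two estimates quoted above. Rounding to one decimal place is an assumption made here to match the precision of the quoted numbers.

```python
dec, jan, feb = 3.3, -0.3, -4.6      # actual Hartford values from the text
est_dec, est_feb = -0.7, 1.7         # estimates quoted in the text

# Seasonal averages with one month replaced by its estimate.
season_if_dec_missing = round((est_dec + jan + feb) / 3, 1)
season_if_feb_missing = round((dec + jan + est_feb) / 3, 1)

# Size of each monthly miss and the swing between the two seasonal values.
dec_miss = round(abs(est_dec - dec), 1)
feb_miss = round(abs(est_feb - feb), 1)
swing = round(season_if_feb_missing - season_if_dec_missing, 1)

print(season_if_dec_missing, season_if_feb_missing)
print(dec_miss, feb_miss, swing)
```

The swing is taken between the rounded seasonal averages, which is how the 3.5 figure arises. The unrounded difference is closer to 3.4.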
The winter of 06-07 in Connecticut was a bit of an oddball. I really wanted to know what the typical error looked like. To do that, I performed the same calculation on all GHCN v2.mean records.
A real monthly value can be compared against its GISS estimate only when all three monthly values in the season are available. In my copy of GHCN v2.mean, there are approximately 6.25 million monthly values that meet that requirement. I went through each of the monthly values and simulated a GISS estimate, and from that estimate I subtracted the actual value to produce a delta temperature. A positive delta means that GISS would over-estimate the temperature and a negative delta means GISS would under-estimate the temperature.
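The procedure can be sketched for a single season of a single station as follows. This is a simplified illustration, not the code used for the results here. It assumes the anomaly-averaging form of the estimate, keeps the withheld value in the monthly means, and takes the seasonal triples as given rather than parsing v2.mean. All names and data are hypothetical.

```python
def month_means(triples):
    """Mean of each month position over all valid (non-None) entries."""
    means = []
    for i in range(3):
        vals = [t[i] for t in triples if t[i] is not None]
        means.append(sum(vals) / len(vals))
    return means


def simulate_deltas(triples):
    """For every complete season, hide each month in turn and return
    estimate minus actual."""
    means = month_means(triples)
    deltas = []
    for t in triples:
        if any(v is None for v in t):
            continue  # need all three months to compare against a real value
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            estimate = means[i] + ((t[j] - means[j]) + (t[k] - means[k])) / 2.0
            deltas.append(estimate - t[i])
    return deltas


# Made-up winter (Dec, Jan, Feb) values for one hypothetical station.
winters = [
    (1.0, -2.0, -1.0),
    (3.0, 0.0, -4.0),
    (None, -1.0, 0.0),   # incomplete: contributes to means, not to deltas
    (2.0, -3.0, -2.0),
]
deltas = simulate_deltas(winters)
print(len(deltas), [round(d, 2) for d in deltas])
```

Under this assumed formula the three deltas within a season always sum to zero, so an over-estimate of one month is balanced by under-estimates elsewhere in the same season.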
Following is a histogram of the delta values collected. The x-axis is the value of the delta in degrees C. The y-axis is the percentage of records that had the specified delta value.
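A histogram of this kind can be built by binning the deltas and converting counts to percentages. The sketch below assumes 0.1 C bins, which matches the resolution of the probabilities discussed later. The function name and sample deltas are invented for illustration.

```python
from collections import Counter


def delta_histogram(deltas):
    """Map each delta, rounded to the nearest 0.1 C, to its share of records (%)."""
    bins = Counter(round(d * 10) for d in deltas)   # integer tenths of a degree
    total = len(deltas)
    return {b / 10: 100.0 * n / total for b, n in sorted(bins.items())}


# Made-up deltas for illustration only.
sample = [0.0, 0.04, -0.1, 0.1, 0.12, -1.3, 1.3, 2.0]
hist = delta_histogram(sample)
for delta, pct in hist.items():
    print(f"{delta:+.1f} C  {pct:.1f}%")
```

Binning on integer tenths avoids the floating-point key collisions that can occur when rounding floats directly.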
The fact that the simulation histogram looks like a normal distribution should not be surprising. This comes about because I need all three months in a season in order to simulate an estimate and a resulting delta. Recall that in the Hartford example above a large delta for December was followed by a similarly large delta for February, but of the opposite sign. Given the enormous sample size, the small differences in magnitude eventually even out.
The above distribution tells us the probability that the GISS estimate will miss the actual value by a specific amount. Zooming in on the distribution, we see GISS should get it exactly right just over 3% of the time:
Following is a table of absolute values and their corresponding probabilities, through a delta value of 2.9 degrees:
Referring to the table, the probability that GISS will create an estimate within 0.4 C of the actual value is 26.7%. A miss between 0.5 C and 0.9 C has a 22.2% probability of occurring. Similarly, 1.0 C to 1.9 C is 26.5%, and 2.0 C to 2.9 C is 12.7%. There is about a 12% probability that the GISS estimate will be off by 3.0 C or more.
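As a quick consistency check, the quoted ranges can be summed to confirm the remainder. The dictionary below holds only the percentages stated above. The range labels are shorthand chosen for this sketch.

```python
# Probabilities (%) quoted in the text for each absolute-error range (degrees C).
buckets = {
    "0.0-0.4": 26.7,
    "0.5-0.9": 22.2,
    "1.0-1.9": 26.5,
    "2.0-2.9": 12.7,
}

within_2_9 = sum(buckets.values())
at_least_3 = 100.0 - within_2_9
within_0_9 = buckets["0.0-0.4"] + buckets["0.5-0.9"]

print(f"within 2.9 C: {within_2_9:.1f}%")
print(f"3.0 C or more: {at_least_3:.1f}%")
print(f"within 0.9 C: {within_0_9:.1f}%")
```

By these figures, an estimate lands within 0.9 C of the actual value a little under half the time.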