During the past few days, I’ve been assessing the GHCN-Daily dataset, which is very large, and I plan to do a number of posts on this topic, including a description of the data set. It turns out that literally hundreds of stations that expire around 1989 or 1990 in the NASA data set are alive and thriving in the GHCN-Daily parallel universe. More on this over the next few days.
Before I get to this, I’d like to document a small puzzle in connection with the calculation of the USHCN “raw” monthly average that arose out of inspection of the GHCN-D data, a puzzle that takes us back to the Detroit Lakes MN station, the investigation of which led to the identification of Hansen’s Y2K error.
I’m not saying that these small puzzles necessarily or even probably “matter” in terms of world averages, but they are relevant in terms of craftsmanship and I presume the following: if one is going to the trouble of making these large temperature collations, then the craftsmanship should be as good as possible. By commenting on issues pertaining to craftsmanship, I am not imputing malfeasance, as some perpetually agitated commenters allege. However, as far as I can tell, no one – journal peer reviewer, NASA peer reviewer (if such existed), rival climate scientist, skeptic – ever seems to have gone to the trouble of parsing through the actual craftsmanship of the large temperature calculations and I see no harm and some benefit in doing so. It looks like NASA is paying some attention and has already implemented a couple of recommendations made at CA. (I’ll have some similar suggestions in the near future.)
The GHCN-D dataset contains daily max and min information for nearly all the USHCN stations, as well as thousands of rest-of-world (ROW) stations. (I’ll discuss some discrepancies in the USHCN station lists on another occasion.) The GHCN-D data set is available in a huge zipped file, but is also available on a station-by-station basis. Most of the identification codes are different from those in GHCN-M (and thus GISS), but I’ve managed to create a concordance of over 3300 station identifications – and do not preclude the possibility of further gains. I’ve created time series of monthly means for these 3300 or so GHCN-D series. As a first cut, I simply took a monthly average of available values, without requiring a minimum number of values to constitute an average (which I usually do and would probably do if I re-ran the results). I then calculated the monthly mean as the average of the mean monthly minimum and mean monthly maximum – in some cases, the two would be based on different numbers of measurements.
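For readers who want to try this themselves, here is a minimal sketch of the first-cut calculation just described, written in Python. It is not the script used for the plots in this post; the function name `monthly_means`, the `(year, month, tmax, tmin)` record layout and the `min_count` parameter are all assumptions made for illustration. Missing daily values are represented as `None`.

```python
from collections import defaultdict

def monthly_means(daily, min_count=1):
    """Return {(year, month): mean} from daily records.

    daily: iterable of (year, month, tmax, tmin); None marks a missing value.
    min_count: hypothetical threshold on the number of daily values needed
               for each of tmax and tmin (1 reproduces the "first cut").
    """
    # Per month: [sum, count] for tmax and for tmin, tracked separately
    # because the two can have different numbers of available values.
    acc = defaultdict(lambda: [[0.0, 0], [0.0, 0]])
    for year, month, tmax, tmin in daily:
        slot = acc[(year, month)]
        if tmax is not None:
            slot[0][0] += tmax
            slot[0][1] += 1
        if tmin is not None:
            slot[1][0] += tmin
            slot[1][1] += 1

    result = {}
    for key in sorted(acc):
        (sum_max, n_max), (sum_min, n_min) = acc[key]
        # Skip months where either series has too few values.
        if n_max < max(min_count, 1) or n_min < max(min_count, 1):
            continue
        mean_max = sum_max / n_max
        mean_min = sum_min / n_min
        # Monthly mean = average of mean monthly max and mean monthly min.
        result[key] = (mean_max + mean_min) / 2
    return result
```

Raising `min_count` (say, to 20 or so) would implement the minimum-number-of-values requirement mentioned above; the appropriate threshold is a judgment call and is not specified here.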
The figure below shows the difference between the USHCN “raw” monthly mean and my calculation from daily information for a station (Kalispell MT) with an excellent match. There are rounding differences, but the two versions clearly reflect the same provenance. In this case, I presume that the small spike differences result from some procedural difference in calculation of monthly averages. While the differences appear attributable to rounding, the differences are not truly random: there are far more +0.1 differences than -0.1 differences, but this is unrelated to time.
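The asymmetry between +0.1 and -0.1 differences can be checked with a simple tally. The following Python sketch assumes two hypothetical dictionaries keyed by `(year, month)`, one holding the USHCN “raw” monthly means and one holding the means calculated from daily data, and assumes the comparison is made at 0.1 precision; neither the names nor the layout come from the actual data files.

```python
from collections import Counter

def tally_differences(ushcn, computed, ndigits=1):
    """Count rounded differences (USHCN raw minus computed) by value.

    ushcn, computed: hypothetical dicts mapping (year, month) -> monthly mean.
    Only months present in both series are compared.
    """
    counts = Counter()
    for key in ushcn.keys() & computed.keys():
        # Adding 0.0 turns a possible -0.0 into 0.0 for tidier keys.
        diff = round(ushcn[key] - computed[key], ndigits) + 0.0
        counts[diff] += 1
    return counts
```

If the differences were pure rounding noise, `counts[0.1]` and `counts[-0.1]` would be expected to be roughly equal; a marked excess on one side points to a systematic procedural difference instead.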
Next, here is the same plot for Detroit Lakes MN, a station which had a puzzling jump around 2000 in the original NASA version (a jump that could be attributed in part to the Y2K error). This particular error has now been patched by NASA. In this case, the tracking looks very similar to the Kalispell tracking from 1950 to about 1980. But in the late 1990s and 2000s, the USHCN Raw version (and thus the downstream versions) jumps up relative to the average calculated from GHCN-D daily information. Why is this?
Figure 2. As Figure 1, but for Detroit Lakes MN.
I parsed through about 40 such plots, most of which were in between Kalispell and Detroit Lakes in appearance. But there were a couple of oddballs: here’s one. It looks like the USHCN Raw version must be spliced from two different GHCN-D stations, with values after 1980 or so from the present station and earlier values from some other station.
Here’s a station (Dillon MT) which has a somewhat similar appearance of being spliced – only this time, it looks like the USHCN station is drawn from the GHCN-D data set prior to 1980 and perhaps some other related source after 1980.
The puzzle that needs to be resolved is the exact relationship between the USHCN “Raw” and GHCN-D data. If this can be sorted out, then NASA could make a substantial gain in the timeliness of their reporting.
GHCN-D versions of USHCN stations are current through early March 2008. Right now NASA’s USHCN data is only current to March 2006 – the date of the most recent GHCN update. Following a CA suggestion, NASA is moving to make its USHCN stations more current by adopting the USHCN (NOAA) source, which is more current than the versions at GHCN-M or CDIAC.
However, the GHCN-D data is truly current. NASA already uses “raw” USHCN data for its current results, using a patch to splice each station to the FILNET version used for historic values. If monthly averages calculated from GHCN-D data were used instead of GHCN-M data, then NASA could report USHCN stations right through to February 2008 (and keep current) instead of the current system of being up to two years out of date for USHCN stations. (A better system would be for NASA to write to NOAA and ask them to update the USHCN data set on a monthly basis, which should be trivial to program and would dispense with the patch altogether.)
Gaining two years in report timeliness for USHCN stations is a small thing but worth doing. In some forthcoming posts, I’ll discuss how NASA can gain nearly 20 years in reporting timeliness for many international stations.