My curiosity in the mathematics behind the homogeneity adjustment caused me to finally take a close look at Hansen Step 2. This turned out to be an incredibly torturous task. A quote from a Stephen King short story, The Jaunt, came to mind as I plowed through lines of code: “It’s eternity in there…“. However, I think I’ve decoded enough of the information to begin explaining it in layman’s terms. Part I will deal with the process of preparing the data to be adjusted, while a future, planned post will describe the actual adjustment.
There are a number of programs involved in prepping the data for adjustment. Most are involved in reformatting, splitting, and trimming files. These programs don’t really do anything meaningful to the data, but it is important to understand the input and output file formats in order to follow the real work that takes place later. One prep program, toANNanom.exe, creates the seasonal and annual anomalies we have come to know and love on this blog.
The important program in data preparation is PApars.exe. It is one of the better commented programs in the set, but before the champagne is uncorked at NASA, it should be noted that it is about the only program containing comments. Detracting from the comments that do exist are the utterly confusing variable names and lack of formatting.
With that in mind, following is a summary of the preparation process this program undertakes.
The highlights of this summary are:
- Urban adjustments are not consistently based on rural stations from 0km to 1000km. Adjustments are based on stations from 0km to 500km, or on stations from 500km to 1000km, but never both.
- Rural stations in the range of 500km to 1000km carry the same weight as stations in the 250km to 500km range.
- The USHCN brightness index determines whether or not a station is rural or urban over most, but not all, of North America. For all other stations, the GHCN flag is used.
The first comment in this program states the purpose is “to combine for each urban station the rural stations within R=1000km and write out parameters for broken line approximations to the difference of urban and combined rural series.” As we will discover later, this is not quite true.
A subsequent comment provides a little more detail:
The combining of rural stations is done as follows: Stations within Rngbr km of the urban center U contribute to the mean at U with weight 1.- d/Rngbr (d = distance between rural and urban station in km). To remove the station bias, station data are shifted before combining them with the current mean. The shift is such that the means over the time period they have in common remains unchanged. If that common period is less than 20 (NCRIT) years, the station is disregarded. To decrease that chance, stations are combined successively in order of the length of their time record.
There is the option to pass a different distance to the program (such as R-1200km) and different overlap period (such as NCRIT = 30 years), but the shell script does not do so.
Rural or Urban
A first pass is made through the data to determine which stations are rural and which are urban. If the station has a USHCN brightness index, that index is used to make the rural / urban determination. In those cases, a station with an index of “1” is rural and all others are urban. If the station does not have a USHCN brightness index, the GHCN R/S/U flag is used, with “R” indicating rural and “S” or “U” urban. The GHCN brightness index is ignored.
Use of the USHCN brightness index extends outside the continental US to some locations in Canada, Mexico, and the Caribbean. Also, not all USHCN sites have a brightness index, notably Alaska and Hawaii. Note that when using the GISS website to interactively get station data, a “rural” station is determined by examining the GHCN flag, not the USHCN brightness index. Thus, stations such as Cedarville are listed as being rural on the GISS website, but are treated as urban when calculations are done.
As the pass is made through the data, the latitude and longitude for each station is collected, as is the temperature data. The temperature data was stored as an integer (units of 0.1C), so it is reconverted to real data by dividing by 10.
Sorting by length of station record
From here on I ran into problems determining when months are used and when years are used. PApars.exe and most of the other programs look like they can operate on years, months, or seasons. Which unit they operate on depends on information placed in each file format along the way. Some of it appears implicit. Unfortunately, one cannot depend on the variable names for resolving the issue. For example, is variable N3L expressed in months, years, or hair follicles? As a result, my use of specific month or year units in the following discussion indicates my best judgment as to what is actually used, but I recognize I may be wrong.
A pass is made through the rural station data to sort the stations according to the number of valid months of temperature data. First, the number of valid monthly records is determined (i.e., a month with a temperature value of 9999 is not valid). Then the stations are sorted in order of the station with the most valid months of temperature data to the station with the least valid months of temperature data. A station with 1000 consecutive months of data is not given preference over a station with 1001 months of occasional data.
Identifying “nearby” rural stations
Next, a pass is made through all urban stations to generate a list of “nearby” rural stations. This is done in two steps, first by looking at stations within 500km of the urban station and then, if no rural stations are found within 500km, looking at the remaining stations out to 1000km. If rural stations within 500km exist, it is quite possible any stations from 500km to 1000km will not be considered.
First, using the spherical law of cosines, the cosine of the angle between the coordinates of the urban station and all rural stations is calculated:
COS_U_R = SIN(R_LAT*PI180) * SIN(U_LAT*PI180) + COS(R_LAT*PI180) * COS(U_LAT*PI180) * [COS(R_LON*PI180) * COS(U_LON*PI180) + SIN(R_LON*PI180) * SIN(U_LON*PI180)]
COS_U_R is compared with the previously calculated “critical” cosine value COS(500/Earth_radius). If the resulting COS value is less than the critical value, the rural station is too far away and is skipped. The SIN and COS values for each station’s latitude and longitude are pre-calculated and saved so that each iteration through a station does not require a new calculation.
If a rural station is within the distance limit of the urban station, the distance between the two is calculated. However, the program does not calculate the great circle distance. This is to avoid the use of ARCCOS, which must have been quite expensive 20 years ago. Instead, the shorter chord distance is calculated, and a ratio is produced, as follows:
Dist_U_R = (Earth_radius / 500) * SQRT(2 * (1 – COS_U_R ))
From this, weight is assigned to the station using WT = 1 – Dist_U_R
As an example, for a station 5 km away, a weight of 0.99 is assigned, and for a station 499 km away, a weight of .002 is used. These seem appropriate if the range is 0km to 500km.
However, if no rural stations are found within that range, the search is extended out to 1000km. In this extended case:
Dist_U_R = (Earth_radius / 1000) * SQRT(2 * (1 – COS_U_R ))
The same weighting function is applied. The implication is that if an urban station has no rural neighbors within 500km, but does have neighbors between 500km and 1000km, those rural stations are given the same weight as those rural stations 250km to 500km from other urban stations. It seems odd to me that such an asymmetric weighting would be used.
Combining an urban station’s nearby rural stations into a single series
Next, a pass is made through all rural stations that are within the 0km to 500km (or 500km to 1000km) zone from the urban station. The stations are visited in order of temperature record length, which was sorted previously. The purpose of this pass is to create a combined rural time series for the urban station. Each rural station’s record is added to the combined series as follows:
(Assume the station to be added is “NEW” and the combined series is “COMBINED”.)
- The average temperature of all months in common between NEW and COMBINED is found for both series. Note that this is a single number representative of each series.
- A bias is calculated by subtracting the NEW average from COMBINED average.
- Next, each month M from NEW is added to COMBINED as follows:
NEW_WEIGHT = COMBINED_W(M) + WT
COMBINED(M) = [COMBINED_W(M) * COMBINED(M) + WT * (NEW(M) + BIAS)] / NEW_WEIGHT
COMBINED_W(M) = NEW_WEIGHT
The difference between the urban and combined series
After this is done, the urban time series is examined. Two tests are made against the combined series.In the first test, if the urban and combined series have fewer than 20 years in common, the urban station is dropped.
In the second test, if the number of years in common is fewer than 2/3 the period of overlap, the early years are dropped and the comparison is re-done from a new starting point. The formula for the new starting point is NEW_START = LAST – (COMMON – 1) / (2/3). NEW_START and LAST are offsets from the first year of record (1880). Thus, if there are 110 years of overlap (1880 to 1989) but only 50 years have valid records in common, the new starting point would be 110 – (50 – 1)/(2/3) = 36 (integer value). This is then added to 1880 to yield a new starting year of 1916. The analysis is rerun from that point.
Note that if there are fewer than 20 years in common for rural stations within a 500km radius, the program will make an attempt to find and combine rural stations between 500km and 1000km. The rural stations within 500km are never combined with the new stations between 500km and 1000km – they are dropped from the urban station’s analysis.
Assuming there is enough overlap between the urban and combined rural time series, a new time series is calculated by subtracting the urban series from the combined rural series. Also, the mean of this new series is calculated over the period of overlap. Both calculations appear to be done on the annual averages and not the monthly averages, but I am not entirely certain at this point. This information – along with many other parameters – are passed in a call to GETFIT, which will be the subject of my future post.