GISS Step 2

Here are some notes and functions from some work I did last fall trying to benchmark the GISS Step 2 adjustment at a non-US site. My first efforts to follow the written prescription have been unsuccessful. I’m placing some tools and a benchmark case (Wellington NZ) online, and perhaps people who are trying to decode Step 2 from the Fortran angle can push this peanut along a little. All discussion is in R.

The script is online here: http://data.climateaudit.org/scripts/station/hansen/step2/step2.txt

I’ve worked through the major chunks below, but you’re better off cutting and pasting from the ASCII version, as WordPress does odd things to quotation marks.

First here are commands to load functions that I’ve used.

source("http://data.climateaudit.org/scripts/station/hansen/step1.txt") #loads hansenq
source("http://data.climateaudit.org/scripts/station/hansen/step2/functions.txt")

Step 2A
This is an optional step that I would encourage people to ignore for present purposes since there is nothing at issue here and I’ve saved the files from this step in ASCII form online. This step finds the “rural” stations within 1000 km and then collects their GISS dset1 histories. For completeness, I’m showing the work here.

First here are commands to download two files of GISS dset1 and dset2 versions (scraped Feb 2008 vintage).

loc.giss="http://data.climateaudit.org/data/giss"
download.file( file.path(loc.giss,"giss.dset1.tab"),"temp.dat",mode="wb"); load("temp.dat")
download.file( file.path(loc.giss,"giss.dset2.tab"),"temp.dat",mode="wb"); load("temp.dat")

Next, the station information is downloaded:

url= file.path(loc.giss,"giss.info.dat")
stations=read.table(url,sep="\t",header=TRUE,fill=TRUE,quote="")
names(stations)[3]="name"
pi180=pi/180; R= 6372.795 #earth's radius in km

A utility function calculating the great circle distance is used:

circledist =function(loc, lat,long,R=6372.795) {
#loc=c(lat,long) of the target; lat, long may be vectors of candidate stations
N=length(lat)
if(N==0) circledist=NA else {
pi180=pi/180;
x= abs(long -loc[2]) #longitude separation in degrees
if (N>1) delta= apply( cbind(x %%360,(360-x)%%360),1,min) *pi180 else delta= min (x %%360,(360-x)%%360) *pi180
loc=loc*pi180; lat=lat*pi180; long=long*pi180
theta= 2* asin( sqrt( sin( (lat- loc[1])/2 )^2 + cos(lat)*cos(loc[1])* (sin(delta/2))^2 )) #haversine formula
circledist=R*theta
}
circledist
}
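
A quick sanity check (not in the original script), applying the function to Wellington and Chatham Island with approximate coordinates:

#Wellington approx (-41.3, 174.8); Chatham Island approx (-43.95, -176.57)
circledist( loc=c(-41.3,174.8), lat=-43.95, long=-176.57)
# roughly 765 km, comfortably inside the 1000 km screen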

The function ruralstations locates the “rural” stations within a distance of 1000 km of the target station. Only the tail of the function is reproduced below; the preliminary screens are set up earlier in functions.txt.

#excerpt: the tail of the ruralstations function in functions.txt. The logical
#screens temp1, temp2 (preliminary screens) and temp_rur (the "rural"
#classification), together with the target coordinates lat0 and long0, are
#set up earlier in the function (not shown here).
ruralstations=stations[temp1&temp2&temp_rur,c("id","name","long","lat","start_raw","end_raw","start_adj","end_adj","dist","pop","urban","lights")]
#matrix of stations meeting first two screen tests
ruralstations$dist=circledist(loc=c(lat0,long0),lat=ruralstations$lat,long=ruralstations$long)
#calculate circle distance for all stations in first screen
temp=(ruralstations$dist<=1000)&!is.na(ruralstations$dist)
ruralstations=ruralstations[temp,]
#restrict to stations within 1000 km
ruralstations=ruralstations[order(ruralstations[,"dist"]),]
#order by increasing distance
ruralstations$weights= 1- ruralstations$dist/1000
ruralstations
}

This step leaves us with three data sets for onward use: an information set of 4 stations (here: HOKITIKA AERO, WHENUAPAI, KAITAIA and CHATHAM ISLAND); a data set holding the dset1 and dset2 versions of Wellington NZ; and a data set holding the dset1 versions of the 4 rural comparanda – all located at http://data.climateaudit.org/data/giss/step2/ .


STEP 2

Starting from this step, first we read in the comparandum series, the target series and the rural station information.

chron=read.table("http://data.climateaudit.org/data/giss/step2/compare.dat",sep="\t",header=TRUE)
chron=ts(chron[,2:ncol(chron)],start=chron[1,1])
target=read.table("http://data.climateaudit.org/data/giss/step2/target.dat",sep="\t",header=TRUE)
target=ts(target[,2:3],start=target[1,1])
dset1= target[,1]
dset2= target[,2]
rural=read.table("http://data.climateaudit.org/data/giss/step2/rural.dat",sep="\t",header=TRUE)
weights0=rural$weights

Now we count the number of rural stations available in each year. This is done by counting availability and setting the count to NA where there are fewer than 3 stations. Then the range is determined: in this case, 1951-1984. However, dset2 is calculated for 1939 to 1988, and I don’t know why this doesn’t reconcile.

count=ts(apply(!is.na(chron),1,sum),start=tsp(chron)[1])
count.adj=count;count.adj[count<3]=NA
#if less than 3 stations, not calculated
M0=range(time(count)[!is.na(count.adj)])
#range of available

Then I calculated the fraction of this range in which there are at least 3 stations (some cases go in and out); this is not an issue in this example. I haven’t implemented this test yet, as there are other conundrums at hand, but will at some point.

Y=ts.union(dset1,count.adj)
temp1=(time(Y)>=M0[1])&(time(Y)<=M0[2])
N3=sum(temp1&!is.na(Y[,2])) #years with count>=3 in range
N3/(M0[2]-M0[1]+1) # fraction of range

Then a “reference” series is calculated, adding in the rural stations in order of decreasing length, using the Hansen delta-adjustment that we employed in Step 1.

#set up arrays
N=nrow(chron)
K=ncol(chron) #number of rural comparanda
W=t(array(rep(weights0,N),dim=c(length(weights0),N) ) )
W[is.na(chron)]=NA
#matrix of weights: NA where the station is unavailable
long0=apply(chron,2,long)
#calculates the length of each candidate rural series ('long' is a helper in functions.txt)
order1=order(long0,decreasing=TRUE);index=NULL
#long0[order1] # orders by decreasing length
delta=rep(NA,K);
fixed=NULL; available= 1:K;

#start with the longest series
j0=order1[1];
reference=chron[,j0] #picks longest series to insert
delta[j0]=hansen_delta(chron[,j0], dset1);
#calculates the Hansen delta using Step 1 methods
reference=reference-delta[j0];
reference.seq=reference
weights1=rep(weights0[j0],length(reference));weights1[is.na(reference)]=NA
N=nrow(chron);fixed=j0

#sequentially adds in the rural series
for (k in 2:K) {
j0=order1[k] #chooses the series to be added
delta[j0]=hansen_delta(chron[,j0], reference);#calculates Hansen delta
weights2=W[,j0]
weights1=apply(as.matrix(W[,fixed]),1,sum,na.rm=T)
#combined weight of the stations already folded in

Y=cbind(chron[,j0]-delta[j0],reference); X=cbind(weights2,weights1)
for(j in 1:N) reference[j]= weighted.mean(Y[j,],X[j,] ,na.rm=T)
fixed=c(fixed,j0)
reference.seq=cbind(reference.seq,reference)
}
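
The hansen_delta offset used in the loop is loaded from step1.txt via the source command at the top. A minimal sketch of what it presumably computes (the mean offset between a candidate series and the reference over their period of overlap, in the style of the Step 1 bias method):

#minimal sketch; the actual hansen_delta implementation is loaded from step1.txt
hansen_delta_sketch=function(candidate,reference) {
temp= !is.na(candidate)&!is.na(reference) #period of overlap
mean(candidate[temp]-reference[temp]) #offset removed before the candidate is averaged in
}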

The reference series is then used in the two-legged adjustments to follow. These operations are bundled in a function called emulate_dset2 which can be called:

id0="50793436001"
test=emulate_dset2(id0,method="read")

This returns the various dset versions and the reference series.

Two-Legged Adjustment

Now collect the dset versions and reference series with placeholders for the adjustments (keeping an index for dates that meet the purported test criteria of 3 comparanda).

Y=ts.union(test$dset1,test$dset2,test$reference,test$reference,test$reference)
dimnames(Y)[[2]]=c("dset1","dset2","reference","adjustment","adjusted")
Y[,4:5]=NA
temp1= (test$count>=3)&!is.na(Y[,1]);sum(temp1) #patch assumption

Now calculate the two-legged adjustment over the period in which adjusted values have been reported (this particular period selection has not been replicated in this example, so this is a restricted test). The two-legged adjustment is fitted to the difference between the dset1 version for Wellington and the comparandum series (in column 3), using the following implementation of the two-legged procedure as described in the underlying texts:

twoleg=function(x) {
N=length(x)
stat=rep(NA,N)
temp=!is.na(x)
index=(1:N)[temp]
for (i in index[2:(length(index)-1)]) { #candidate knee positions
z=data.frame(x, abs( (1:N)-i), (1:N)-i) ;names(z)=c("y","abs","x")
fm=lm(y~abs+x,data=z)
stat[i]=ssq(fm$residuals) #ssq: sum-of-squares helper loaded above
}
index0=(1:N)[(stat==min(stat,na.rm=TRUE))&!is.na(stat)] #knee minimizing the residual sum of squares
z=data.frame(x, abs( (1:N)-index0), (1:N)-index0) ;names(z)=c("y","abs","x")
twoleg=lm(y~abs+x,data=z)
twoleg}
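
As an illustrative check (not part of the original script), twoleg recovers the legs of a synthetic series with a knee at t=20; the left and right slopes fall out as coef(fm)[3]-coef(fm)[2] and coef(fm)[3]+coef(fm)[2], the decomposition that also appears in the comments below:

ssq=function(x) sum(x^2) #assumed definition of the ssq helper, if not already loaded
set.seed(1)
t0=1:40
y0=ifelse(t0<=20, 2*t0, 40+0.5*(t0-20)) + rnorm(40,sd=0.5) #slope 2 up to the knee, slope 0.5 after
fm=twoleg(y0)
c(coef(fm)[3]-coef(fm)[2], coef(fm)[3]+coef(fm)[2]) #left and right slopes, approximately 2 and 0.5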

The adjustment is inserted in the data set as is the adjusted series.

temp=(time(Y)>=M0[1])&(time(Y)<=M0[2]) #patch assumption
Y[!temp,3]=NA
adj0=twoleg(Y[,3]-Y[,1])
Y[temp&!is.na(Y[,1]),4]=fitted(adj0)
Y[,5]=Y[,1]+Y[,4]

The results are illustrated below. The first panel shows the dset1 version for Wellington and the rural reference series calculated above (green). The second panel shows the two-legged fit (green) to the difference between the two series (dashed), compared to the actual adjustment series (solid). Obviously not at all the same. The bottom panel compares the dset2 version to the emulated dset2 using the above adjustment series.
[Figure wellin2.gif: dset1 Wellington vs rural reference; two-legged fit vs actual adjustment; dset2 vs emulated dset2]

In this particular case, the adjustment is not at all close. Of course, there are other issues here that we’ve visited previously, like why NASA has been unable to locate data for Wellington NZ for nearly 20 years, but that’s a different story.

In choosing this site, I wanted to stay away from sites that had dozens of comparanda, in case clues arose from the versions. I got stuck last fall. What I’d like from anybody who’s been able to get GISTEMP to work through this step is to extract working files for Wellington NZ (50793436001), and we’ll see if we can decode the intermediate steps.

I’ve also collated various scripts and programs from GISTEMP Step 2, in the order that I think they are implemented, in one file here, but have been unable to get much of a foothold in understanding the actual implementation of the calculations.

101 Comments

  1. Spam
    Posted Jun 13, 2008 at 4:56 PM | Permalink

    Are they seriously using the Chatham Islands to ‘adjust’ Wellington? Wellington is one of the windiest cities in the world – does average windspeed affect the UHI effect? I would have expected wind to provide cooling, and hence reduce the UHI effect. And the Chatham Islands are very isolated.

    Surely there is a clear difference between correcting for UHI by using a station in a nearby and similar geographic environment, and using one in a completely different environment. Do they only care about them having a similar elevation, or what?

  2. Geoff Sherrington
    Posted Jun 13, 2008 at 7:11 PM | Permalink

    Re # 1 Spam

    Your para 1 on Chatham endorsed. Have already noted it when Wellington was studied on CA before. Re your para 2, elevation is a separate issue – follow the coming story.

  3. MrPete
    Posted Jun 13, 2008 at 8:56 PM | Permalink

    The assumption is that:
    * Actual station temperature is immaterial, only trends matter
    * Trends are essentially identical across vast distances, including different altitudes etc

  4. Steve McIntyre
    Posted Jun 13, 2008 at 9:32 PM | Permalink

    Folks, for now, let’s not discuss whether any of this makes sense, please limit remarks to what Hansen is actually doing. There will be plenty of time to discuss things, once we’ve confirmed what he’s done.

    Also remember that some of this to-ing and fro-ing probably won’t matter much, in that it’s just making weird weightings of different series – which only matters if some of the series are “bad”. If the data is OK, you can have a pretty crummy method and still not screw weighted averages all that much.

    The foolishness of his coding and methodology annoys me as much or more than any of you. But no need to pile on right now.

  5. Geoff Sherrington
    Posted Jun 14, 2008 at 1:07 AM | Permalink

    OK Steve. You want to concentrate on the code and its quality and not so much on constants and coefficients derived from climate studies, that are used in the code. That’s cool.


    Steve
    : There’s time enough for both. But it doesn’t do any harm to do one thing at a time.

  6. MrPete
    Posted Jun 14, 2008 at 6:35 AM | Permalink

    Not even code quality. It’s a given that the code is not what some of us would like. (Being nice)
    The debugging issue right now is: what’s this thing actually do?

    So sad that it takes anyone, even the author, quite a while to be sure the code actually does what was intended. Completely unnecessary.

    This is entirely in line with Steve’s adopted role: auditing data, methods and more.

  7. buck smith
    Posted Jun 14, 2008 at 7:00 AM | Permalink

    What is the scripting language being used here?

  8. Richard Steckis
    Posted Jun 14, 2008 at 7:04 AM | Permalink

    Steve,

    Could you have a look at the GISS Temp data for southwestern Australia? I have been looking at it to answer a criticism of me on Tamino’s site. It would seem that almost all rural stations in the region were “destroyed” in 1992. Apart from Perth (the capital of Western Australia) and a couple of other major towns, that’s where the records stop. Of course Perth, a major city, has records through to 2008. I am sure data are still being collected in the rural parts of the region (after all, our Bureau of Meteorology is first class), so why aren’t they in the GISS dataset?

    Perhaps this could be another avenue of investigation for you.

  9. Richard Steckis
    Posted Jun 14, 2008 at 7:51 AM | Permalink

    Further to my previous post: nearly all rural stations for Western Australia cease in 1992. The major towns have more current and up-to-date data. The question is: why is there 16 years of missing data for Western Australian rural stations? Is the station data on the GISTEMP website (http://data.giss.nasa.gov/gistemp/station_data/) the same data as that used by GISS for their temperature anomaly calculations? If it is, why are more current data for rural stations in Western Australia not available?

  10. Steve McIntyre
    Posted Jun 14, 2008 at 7:59 AM | Permalink

    #8. We’ve talked about the 1992 issue on many occasions. GHCN (and thus NASA) hasn’t collected fresh data from rural stations in the ROW since the early 1990s, not just in Australia, but in Russia, China, South America. You can get data on the internet that NASA hasn’t been able to locate. It’s pathetic.

    Why? Don’t ask me. Ask them.

  11. DocMartyn
    Posted Jun 14, 2008 at 9:08 AM | Permalink

    Steve “Folks, for now, let’s not discuss whether any of this makes sense, please limit remarks to what Hansen is actually doing.”

    I am really trying to keep on this point, as I know you hate diversions.

    Can I just ask why he is doing this process at all?

    Statistically, what is the effect of transforming a dataset, in this manner, in terms of the size of the SD? Can a transformation of a dataset really ‘improve’ the quality? If it can, it should be possible to estimate, by triangulation, the temperature of one rural site, from three other rural sites.
    Has Hansens method been tested in this manner? How close is Hansens method to the ‘real’ value?

  12. hswiseman
    Posted Jun 14, 2008 at 12:32 PM | Permalink

    Steve and all,

    The content here for the last few weeks has been nothing short of exceptional (not that I pretend to understand what is going on under the hood of the code). There is alot of bloviating on the topic all over the web, but who is doing any real work in the public forum open to all for critique and collaboration? CA and no one else. The unique experiment that is this blog will be remembered long after we figure out the real delta associated with doubling CO2.

  13. hswiseman
    Posted Jun 14, 2008 at 12:38 PM | Permalink

    When I say “CA” above, I include the contributors who are undertaking research and analysis in separate endeavours as well, such as Anthony, Lucia and many others.

  14. Scott McG
    Posted Jun 14, 2008 at 2:48 PM | Permalink

    #9 Richard – you are probably aware of BOM’s network of Reference Climate Stations:
    http://www.bom.gov.au/climate/change/reference.shtml
    I have no idea why GISS doesn’t seem to know of them though!!

  15. Thor
    Posted Jun 14, 2008 at 4:33 PM | Permalink

    Steve,

    I’m trying to run the R code, but have run into a couple of problems. It seems to be related to the following:

    On line 74 in …/step2/functions.txt there is a line " rbind(info[,c("id","name"," …. " that doesn't seem to belong there. I just removed it from my local copy, and thus got the R program to run a bit further.

    My next problem is related to the emulate_dset2(…) function. It seems that when it is called with method="base" it creates the dset1.monthly data structure. That does not seem to happen when using method="read". So, in the "step2.txt" file, it is called with method="read", and I get the following:

    test=emulate_dset2(id0,method="read")
    Error in emulate_dset2(id0, method = "read") :
    object "dset1.monthly" not found

    I'm not really sure how to resolve that one. It seems to help by setting method="base", but I'm not sure I will then produce the intended result. And, if I do that, I run into a problem where M0 is not defined a few lines further down, on the line with: temp=(time(Y)>=M0[1])

    I still need to familiarize myself with this code AND with the fortran code. Wow, fortran77, now those were the days. I’m not at all ruling out that I’m doing something wrong. I’m using R 2.7.0

    Thank you.

  16. Vincent Gray
    Posted Jun 14, 2008 at 7:55 PM | Permalink

    Hansen’s compilation is very far from comprehensive and no reason is given why. Some authorities demand confidentiality and/or a fee. For New Zealand, he seems to have access to airports but not to others. Is access to airports some sort of international obligation?

  17. Posted Jun 14, 2008 at 9:19 PM | Permalink

    According to NIWA ALL of their data is publicly available.

  18. Steve McIntyre
    Posted Jun 14, 2008 at 10:53 PM | Permalink

    As discussed on many occasions, the ongoing data is primarily MCDW data.

  19. Posted Jun 15, 2008 at 7:02 AM | Permalink

    It seems I’m not able to download the software.. is the site regularly on-line?
    Thanks

  20. Posted Jun 15, 2008 at 7:02 AM | Permalink

    The 2-legged ROW UHI adjustment for Wellington looks the same as John G described in the previous post for the US. So is the difference between US and ROW in this respect just how stations are classified into Urban and Rural?

    However, for Wellington, the Urban series itself (dset1) and the emulated rural comparison series overlap for most of c.1880-1985. Wouldn’t the adjustment have been fit to this entire period, with a knee anywhere in its interior less 5 years on each end, and then applied to dset1 for the entire period dset1 is available? Instead, the actual and emulated adjustments are shown as beginning c.1940 and 1950, resp.

  21. John Goetz
    Posted Jun 15, 2008 at 9:19 AM | Permalink

    Steve, you say:

    Then I calculated the fraction of this range in which there are at least 3 stations (some cases go in and out.)

    Is your interpretation that a minimum of 3 rural stations are required to adjust an urban series?

  22. Steve McIntyre
    Posted Jun 15, 2008 at 11:17 AM | Permalink

    #15. I tidied the references and it should now run turnkey. I had a slightly different version on my own computer.

  23. Steve McIntyre
    Posted Jun 15, 2008 at 11:54 AM | Permalink

    Hu and John G and others, could you take a look at Hansen’s two-legged routine purporting to fit two line segments with a knee at midpoint xmid, posted up here: http://data.climateaudit.org/scripts/station/hansen/step2/tr2.f

    I’ve also posted up a simple data set with a knee http://data.climateaudit.org/scripts/station/hansen/step2/template.dat

    The methodology in my “twoleg” function is far simpler and thus more reliable for what they say they are trying to do (and it works fine on this template), but I’m having difficulty seeing that Hansen’s function actually does what it’s supposed to do. To be fair, I don’t have a Fortran compiler and would appreciate results on this template from someone with such a compiler.

    The steps that look odd to me are the calculations for xnum1, xnum2 and denom.

    The slope and intercept of the last step entitled “linear regression” seem to work OK although the method is a bit roundabout and old-fashioned. But I can’t get the results from the segmented calculations to line up with the construction.

  24. Steve McIntyre
    Posted Jun 15, 2008 at 12:30 PM | Permalink

    Here’s a script showing my attempt to emulate Hansen’s trend2 routine.

    Load in simple sample.

    ##Example
    X=read.table("http://data.climateaudit.org/scripts/station/hansen/step2/template.dat",skip=1)
    y=X[,3]
    x=1:40
    xmid=10

    Do standard regression and obtain coefficients:

    ##FULL-PERIOD REGRESSION (see bottom of trend2)
    fm=lm(y~x);coef(fm) #(Intercept) x
    # 37.929850 1.346511

    Transliterate the formula at the bottom of function trend2 (regression). This part is OK.

    sxtot=sum(x);sxxtot=sum(x^2);sxatot=sum(x*y);sa=sum(y);ntot=length(x)
    sl=(ntot*sxatot-sa*sxtot)/(ntot*sxxtot-sxtot^2)
    Y0=(sa-sl*sxtot)/ntot
    c(Y0,sl)
    # 37.929850 1.346511

    Now do two separate regressions for knee at x=10.

    x=(1:40)-xmid
    temp=(x>0)
    fm1=lm(y[!temp]~x[!temp]);coef(fm1) # 49.911370 4.943939
    fm2=lm(y[temp]~x[temp]);coef(fm2) #83.7076432 -0.3581346

    Now transliterate Hansen’s calculation:

    N=length(y)
    temp=(x>0)
    sa=sum(y);
    saa=sum(y^2);
    sx=c(sum(x[!temp]),sum(x[temp]));
    sxx=c(sum(x[!temp]^2),sum(x[temp]^2));
    sxa=c(sum(x[!temp]*y[!temp]),sum(x[temp]*y[temp]));
    kount=c( sum(!temp),sum(temp));
    ntot=sum(kount);
    denom=ntot*sxx[1]*sxx[2]-sxx[1]*sx[2]^2-sxx[2]*sx[1]^2;
    xnum1=sx[1]* ( sx[2]*sxa[2]-sxx[2]*sa) +sx[1]*(ntot*sxx[2]-sx[2]^2);
    xnum2=sx[2]* ( sx[1]*sxa[1]-sxx[1]*sa) +sx[2]*(ntot*sxx[1]-sx[1]^2);
    sl1=xnum1/denom;
    sl2=xnum2/denom;
    Ymid= (sa-sl1*sx[1]-sl2*sx[2])/ntot;
    RMS= ntot*Ymid^2 +saa-2*Ymid*(sa-sl1*sx[1]-sl2*sx[2])+
    sl1*sl1*sxx[1]+sl2*sl2*sxx[2]-2*sl1*sxa[1]-2*sl2*sxa[2];

    This yields different values for the slopes.

    c(sl1,sl2) # [1] 13.48777 -12.04889

    The plot also is very different.

    mx=c(mean(x[!temp]),mean(x[temp]))
    my=c(mean(y[!temp]),mean(y[temp]))
    int= my-c(sl1,sl2)*mx
    ylim0=range(c(y,Ymid))
    par(mar=c(3,3,2,1))
    plot(x,y,xlab="",ylab="",ylim=ylim0)
    lines(x[!temp],fitted(fm1),col=2,lty=2,lwd=2)
    lines(x[temp],fitted(fm2),col=2,lty=2,lwd=2)
    lines(x[!temp],int[1]+sl1*x[!temp])
    lines(x[temp],int[2]+sl2*x[temp])

    Hansen’s segment slope calculations:

    denom=ntot*sxx[1]*sxx[2]-sxx[1]*sx[2]^2-sxx[2]*sx[1]^2;
    xnum1=sx[1]* ( sx[2]*sxa[2]-sxx[2]*sa) +sx[1]*(ntot*sxx[2]-sx[2]^2);
    xnum2=sx[2]* ( sx[1]*sxa[1]-sxx[1]*sa) +sx[2]*(ntot*sxx[1]-sx[1]^2);
    sl1=xnum1/denom;
    sl2=xnum2/denom;

    look incorrect to me. Wherever these formulas come from and whatever their purpose, they don’t seem to yield the correct answer, as illustrated by the plot provided in this script. This is something that I’ve just noticed and it’s possible that I’ve incorrectly transliterated something or missed something, but the formulas themselves look weird.

  25. Thor
    Posted Jun 15, 2008 at 3:25 PM | Permalink

    #22 #24 (Steve)

    Thanks for uploading the revised code, that helped a lot. The plots aren’t working straight out of the box, but I’m sure I can work that part out.

    Now, for the trend2 function, I notice a slight transliteration error, and the corrected version follows (the change is easy to miss: sxa[1] and sxa[2] replace sx[1] and sx[2] in the second terms):

    denom=ntot*sxx[1]*sxx[2]-sxx[1]*sx[2]^2-sxx[2]*sx[1]^2;
    xnum1=sx[1]* ( sx[2]*sxa[2]-sxx[2]*sa) +sxa[1]*(ntot*sxx[2]-sx[2]^2);
    xnum2=sx[2]* ( sx[1]*sxa[1]-sxx[1]*sa) +sxa[2]*(ntot*sxx[1]-sx[1]^2);
    sl1=xnum1/denom
    sl2=xnum2/denom

  26. Steve McIntyre
    Posted Jun 15, 2008 at 9:34 PM | Permalink

    #25. Thor, thanks for this. That reconciles the subroutine. Hansen’s method is a very roundabout way of doing things, but with the corrected transliteration it matches the simpler routine that I had posted as the function “twoleg”. Here was the form of my calculation:

    N=length(x)
    z=data.frame(y, abs( (1:N)-xmid), (1:N)-xmid) ;
    names(z)=c("y","abs","x")
    fm=lm(y~abs+x,data=z)
    #c(coef(fm)[3]-coef(fm)[2],coef(fm)[3]+coef(fm)[2])
    #8.7394759 0.1217609

    Re-doing the HAnsen method with the corrected transliteration:

    xnum1=sx[1]* ( sx[2]*sxa[2]-sxx[2]*sa) +sxa[1]*(ntot*sxx[2]-sx[2]^2);
    xnum2=sx[2]* ( sx[1]*sxa[1]-sxx[1]*sa) +sxa[2]*(ntot*sxx[1]-sx[1]^2);
    sl1=xnum1/denom;
    sl2=xnum2/denom;
    Ymid= (sa-sl1*sx[1]-sl2*sx[2])/ntot;
    c(sl1,sl2)
    # 8.7394759 0.1217609

    So these things match. The plot works.

    plot(x,y,xlab="",ylab="",ylim=ylim0)
    lines(x,fitted(fm),lty=2)
    points(0,Ymid,pch=19,col=2)

    So Hansen’s calculation could be implemented in a couple of lines but, given his objective for this step, his formula can be replicated.

    The flip side of this is that my function “twoleg” is a valid and simpler implementation of this subroutine. This is therefore not the reason why my Step 2 implementation does not replicate Hansen’s Step 2; the difference lies elsewhere.

  27. Posted Jun 16, 2008 at 5:56 AM | Permalink

    Steve wrote in the post,

    Now we count the number of rural stations available in each year. This is done by counting availability and setting the count to NA where there are fewer than 3 stations. Then the range is determined: in this case, 1951-1984. However, dset2 is calculated for 1939 to 1988, and I don’t know why this doesn’t reconcile.

    So I gather, re #20, that the green line in the top panel of the graph is the weighted average of the available rural stations in the 4-station set that was identified, whether there are 3 or not. There are at least 3 only during the period 1951-84 (though not always the same 3), whence this is the emulation period in the 3rd panel.

    If Hansen in fact went back to 1939 with this algorithm, he must be picking up another station going back to 1939. Could this starting date be a clue as to which station this might be?

    I see you’re using the Arcsine to compute great circle distance, while John G mentioned that GISS in fact approximated this with the straight line (“as the worm burrows”) distance, to avoid the once significant computational expense of an arcsine. Is there a station just outside 1000 KM that might be picked up this way, and that starts in 1939?

  28. John Goetz
    Posted Jun 16, 2008 at 6:31 AM | Permalink

    #27 Hansen also uses 6375 km as the earth’s radius. Another small adjustment, but perhaps enough to matter in terms of bringing another station into play. I haven’t found it yet, but I also have a bug in my calculation of distances. I get Chatham wrong but everything else right. Go figure.
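
    For what it’s worth, a minimal sketch of the chord alternative, re-using circledist from the post to get the central angle (R = 6375 km per the above; illustrative only, since which convention GISTEMP actually uses is the open question):

    chorddist=function(loc,lat,long,R=6375) 2*R*sin( circledist(loc,lat,long,R=1)/2 )
    #circledist with R=1 returns the central angle in radians; the chord is 2R*sin(theta/2)
    #a 1000 km chord corresponds to an arc of only about 1001 km, so the choice of
    #convention moves the 1000 km screen by only about 1 km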

  29. Steve McIntyre
    Posted Jun 16, 2008 at 7:54 AM | Permalink

    The permitted midpoints of the two-legged adjustment go from year 5 to year N-4. Not an unreasonable restriction, but it could also have been something else.

  30. wkkruse
    Posted Jun 16, 2008 at 8:23 AM | Permalink

    #2 Hu McCullough, The GISS program doesn’t need a string of consecutive years with at least 3 rural stations in order to compute the 2-piece linear regression. It looks to me like the first year has to have 3 rural stations, and the total number of years with at least 3 must be more than 2/3 of the total number of years in the record starting from that first year with 3. In addition, it has to have at least 20 years with 3 rural stations. I hope this is understandable!

  31. Steve McIntyre
    Posted Jun 16, 2008 at 8:35 AM | Permalink

    #30. Your comment is understandable but I’m having trouble locating where these tests are implemented in the code.

    For reference, Hansen et al 1999 says:

    An adjusted urban record is defined only if there are at least three rural neighbors for at least two thirds of the period being adjusted.

    Hansen et al 2001 says:

    The urban adjustment in the current GISS analysis is a similar two-legged adjustment, but the date of the hinge point is no longer fixed at 1950, the maximum distance used for rural neighbors is 500 km provided that sufficient stations are available,

    “Sufficient” is not defined in Hansen et al 2001.

  32. wkkruse
    Posted Jun 16, 2008 at 8:46 AM | Permalink

    #31 Steve Mc, The program that combines the rural stations within range keeps a count of the number of stations used in each period. This is called IWT(n). The variable NRURM is set to 3 in a parameter statement up front. After the program is done combining all of the rural records in range, it starts a loop at statement 191 to see if the combined record meets certain criteria. You’ll see in this loop several checks against NRURM, i.e. 3. They check for at least 20 periods with at least 3 rural records in range and then for the 2/3 criterion before trying to fit the broken line.

  33. John Goetz
    Posted Jun 16, 2008 at 9:16 AM | Permalink

    #30 wkkruse – I don’t believe the very first year needs three rural stations.

    Here is the code in question (sorry about the formatting)

    DO 195 IY=IY1,IYRM
    IF(AVG(IY).NE.XBAD.OR.URB(IY).NE.XBAD) NXX=NXX+1
    IF(AVG(IY).EQ.XBAD.OR.URB(IY).EQ.XBAD) GO TO 194
    if(IWT(IY).ge.NRURM) then
    N3L=IY
    N3=N3+1
    if(N3F.eq.0) N3F=IY
    end if
    if(N3.le.0) go to 195
    NXY=NXY+1
    TS(NXY)=AVG(IY)-URB(IY)
    F(NXY)=TS(NXY)
    TMEAN=TMEAN+F(NXY)
    YR(NXY)=IY+IYOFF
    X(NXY)=YR(NXY)-X0
    W(NXY)=1.
    if(IWT(IY).ge.NRURM) THEN
    NXY3=NXY
    TM3=TMEAN
    END IF
    194 if(nxx.gt.0.and.IY1.eq.1) write(66,'(i4,2f10.2)')
    * IY+IYOFF,URB(IY),AVG(IY)
    195 CONTINUE

    The “N3” variables are all associated with keeping count of records with three or more rural stations. N3L is the last one, N3F the first, and N3 a count of all of them.

    However, “NXY” is an index into the time series “TS” created when the urban record is subtracted from the combined record. This time series is created when the combined rural record has one or more stations. TS is immediately copied into “F”, and F and NXY are passed to trend2.

    I see no check in getfit or trend2 that limits the curve fitting to those years with 3 or more rural stations. It seems to me that, so long as there are enough years with 3 or more rural stations (NCRIT or more), and they meet the XCRIT temporal density requirement, then the curve fitting proceeds with the entire combined series.

  34. Steve McIntyre
    Posted Jun 16, 2008 at 9:26 AM | Permalink

    In the Wellington case that I’m examining, assuming that the identification of rural stations is correct, N3F is 1951, but the dset2 series starts in 1939. 1939 is the first year that there are 2 rural comparanda. Off the top of my head, I can think of two possibilities: my collation of rural stations is missing a candidate or perhaps the target station is included in counting N3.

  35. wkkruse
    Posted Jun 16, 2008 at 9:44 AM | Permalink

    #33 John, I think you’re right about not needing 3 records at the beginning. And I didn’t mean to imply that all of the data points in TS had to have at least 3 rural stations. You’re right that GETFIT and trend2 don’t check for the 3-rural-station condition. All that is done in PApars.

  36. wkkruse
    Posted Jun 16, 2008 at 9:48 AM | Permalink

    #34 I hate to bring this up because I don’t yet understand what’s going on, but there is a final step after GETFIT is run that “finds an extended range.” Perhaps that’s what you are seeing. Maybe John Goetz has figured this part out already.

  37. John Goetz
    Posted Jun 16, 2008 at 10:14 AM | Permalink

    Yes, I have been looking at the extended range. I think that is what is happening with Wellington. Here is the Fortran code:

    IYXTND=NINT(N3/XCRIT)-(N3L-N3F+1)
    write(*,*) 'possible range increase',IYXTND,N3,N3L-N3F+1
    n1x=N3F+IYOFF
    n2x=N3L+IYOFF
    IF(IYXTND.lt.0) stop 'impossible'
    if(IYXTND.gt.0) then
    LXEND=IYU2-(N3L+IYOFF)
    IF(IYXTND.le.LXEND) then
    n2x=n2x+LXEND
    else
    n1x=n1x-(IYXTND-LXEND)
    if(n1x.lt.IYU1) n1x=IYU1
    n2x=IYU2
    end if
    end if

    IYXTND = 17 for Wellington, so the code will find LXEND. We know N3L is the last year of the combined record (offset from 1880 I believe), IYOFF is 1880 (I believe), and
    IYU2 is IYU1+ILSU(NURB)-I1SU(NURB)+2
    where
    IYU1=MFSU(NURB)+IYOFF-1
    If wkkruse can help me figure out what IYU2 is for Wellington then I think we can find out how Hansen extends the range back to 1939.

  38. Fred Nieuwenhuis
    Posted Jun 16, 2008 at 10:21 AM | Permalink

    An example of missing Canadian data in GISS record: Ottawa, Ontario. WMO registered and compliant with daily records going back to 1938. GIStemp stops using the data in 1990.

    Steve: This issue has been discussed on many occasions – it’s a worldwide problem

  39. John Goetz
    Posted Jun 16, 2008 at 10:26 AM | Permalink

    #37 Check that, IYOFF = 1879 (it is INFO(6) minus 1, and INFO(6) is 1880).

  40. Steve McIntyre
    Posted Jun 16, 2008 at 10:31 AM | Permalink

    John G, Arthur E has archived working files from a GISS run through Step 2 (see link at http://www.climateaudit.org/?p=3171#comment-259839 ) with some working file outputs which may be of assistance.

    For Wellington, the N3 range in PAPars.list, one of the working files, is 1951-1984 with the “Extended range” shown as 1939-1989. So we’ve probably got the right rural comparanda in this example and it’s a matter of decoding how the extended range is determined in order to push through this little portion of the code.

  41. Steve McIntyre
    Posted Jun 16, 2008 at 10:39 AM | Permalink

    #37. What about this? IYU2 is probably the last year available in the target urban U-series, which in this case is 1988, since NASA doesn’t know where Wellington NZ is. So they allocate the extension first to lengthen the record into the present and then assign the remainder of the 17 years to the early portion.

  42. SteveSadlov
    Posted Jun 16, 2008 at 10:40 AM | Permalink

    So it’s down to this. Is there an undocumented fudge factor operation, manually applied? Is that the stumbling block to replication?

    Steve: Please don’t editorialize or jump to unwarranted conclusions at this point. All we know right now is that the code is a mess. However, they get results, so it operates. I have no reason to surmise manual fudging, though there are many other issues with this.

  43. John Goetz
    Posted Jun 16, 2008 at 11:00 AM | Permalink

    Clearly it looks like GISS tries to make sure the combined series fits exactly their “2/3 rule”, which wkkruse and I have been discussing here. The rule might be summarized as follows:

    During the period in common between the combined rural series and the urban series, 2/3 of the years in common must have three or more rural stations represented in the combined rural series.

    To this point we have been looking at instances where GISS trims the combined series to meet the rule. For example, if the combined record starts in 1891 and ends in 1990, but the years 1901 to 1940 are made up of fewer than three stations, then the early years of 1891 to 1900 are dropped, and 1941 to 1990 are used to do the adjustment. At least that is my initial guess.

    Wellington seems to be a case where they expand the series to meet the rule. In the case of Wellington, there are 34 years in common with three or more rural stations in the combined series. Thus, the rule says the comparison can be made against up to 34/(2/3) = 51 years, so long as the 34 with 3 or more stations fall in that 51 year range. The code I mentioned in #37 is responsible for extending the range.

    As we all know, this particular Fortran program is painful to follow, but I have narrowed things down some. For Wellington, some key values are:

    N3F = 72
    N3L = 105
    N1x = 1951
    N2x = 1984

    Using substitution, we can express LXEND as:

    LXEND=((MFSU(NURB)+IYOFF-1)+ILSU(NURB)-I1SU(NURB)+2)-(N3L+IYOFF)

    So what we need to nail down are:

    MFSU(NURB)
    ILSU(NURB)
    I1SU(NURB)

    Once nailed down we can see how the start and finish are extended.

  44. Fred Nieuwenhuis
    Posted Jun 16, 2008 at 11:01 AM | Permalink

    Calculation of Fudge factor = Hansen’s constant
    if slope(stationrecord)== 1950
    then Fudgefactor = 0.5 + (year-1950)* 0.1

  45. Fred Nieuwenhuis
    Posted Jun 16, 2008 at 11:06 AM | Permalink

    Oops, my pseudo code didn’t translate well to comment block.
    It should read:

    if slope(stationrecord) is less than or equal to zero and year(stationrecord) is greater than 1950
    then …

  46. Posted Jun 16, 2008 at 11:13 AM | Permalink

    Steve McI wrote in #41,

    For Wellington, the N3 range in PAPars.list, one of the working files, is 1951-1984 with the “Extended range” shown as 1939-1989. So we’ve probably got the right rural comparanda in this example and it’s a matter of decoding how the extended range is determined in order to push through this little portion of the code.

    1951-84 (34 years) is exactly 2/3 of 1939-89 (51 years), so perhaps this is where the 2/3 factor mentioned by wkkruse in #30 and Hansen in #31 above comes in. Other adjustment periods like 1938-88 would still have 2/3 of its years with 3 comparanda, but perhaps there was also an unstated minimum of 2 stations, whence 1939?

    Would the regression be run with the full “extended range”, or just with the 3-station range?

  47. wkkruse
    Posted Jun 16, 2008 at 11:26 AM | Permalink

    # 43 John, I’ve been out having some fun, so I’ve missed a lot here. I think you’ve got the logic correct for the 2/3 rule. That’s why they do that “drop early years” step. But remember they still want at least 20 periods with at least 3 rural stations too. I only have the Fortran code to work from, and I’m not familiar with the data. So I think I’ll just sit back and enjoy this and hopefully learn something.

    If I can help with the code, let me know.

  48. John Goetz
    Posted Jun 16, 2008 at 11:42 AM | Permalink

    Taking the three unknown variables from #43 one at a time, beginning with MFSU(ISU), the code develops the value as follows (leaving out a lot of unrelated steps in between):

    MF=INFO(1)
    MFCUR=MF
    MFSU(ISU)=MFCUR

    The value for INFO(1) is generated by the toANNanom program, and I believe it is an offset to the first year of the record for Wellington.

    So wkkruse, if you don’t mind looking at toANNanom and figuring out if the INFO(1) it writes for Wellington has a value of 0, 1, or 2, that would be great. Keep in mind the first year of record for Wellington is 1881, and that toANNanom does recognize that a GISS year begins in December (although it may not matter here). I am guessing the value is 2.

  49. wkkruse
    Posted Jun 16, 2008 at 1:11 PM | Permalink

    #48, John, I’m confused. toANNanom reads in INFO just like Papars does. I may quickly be getting out of my depth of knowledge here.

  50. John Goetz
    Posted Jun 16, 2008 at 1:42 PM | Permalink

    I change my guess…MFSU(NURB) is 1.

    If we go back and simplify LXEND:

    LXEND=((MFSU(NURB)+IYOFF-1)+ILSU(NURB)-I1SU(NURB)+2)-(N3L+IYOFF)

    becomes

    LXEND = ILSU(NURB) – I1SU(NURB) + 1 + MFSU(NURB) – N3L

    The first part ILSU(NURB) – I1SU(NURB) + 1 is the length of the urban record. In the case of Wellington, this is 1989 – 1881 + 1 = 109 years.

    Thus LXEND = 109 + 1 – 105 = 5

    Looking back at #37,
    n1x = n1x – (IYXTND – LXEND) = 1951 – (17 – 5) = 1939

    n2x = IYU2 = (ILSU(NURB) – I1SU(NURB) + 1) + MFSU(NURB) + IYOFF

    so n2x = 109 + 1 + 1879 = 1989

  51. John Goetz
    Posted Jun 16, 2008 at 2:43 PM | Permalink

    #50 in English:

    Let N3 be the total number of years in which three or more rural stations can be combined, with N1X the first year and N2X the last year. N3O = N2X – N1X + 1 is the length of the overlap period. Not all years in the overlap period necessarily have three or more stations.

    If N3 is greater than 2/3 of N3O, the algorithm will extend the beginning and/or the end of the overlap period so that N3 is 2/3 of N3O. IYXTND is the number of years that must be added to the overlap period to make N3 2/3 of N3O.

    If N2X is earlier than the last year of the urban record, the end of the rural overlap period is extended toward the urban record’s last year. If additional extension years remain, the extension years are applied to the beginning of the combined rural record.
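
    To make the arithmetic concrete, a compact restatement in R using the Wellington numbers from #37 and #50 (XCRIT = 2/3 per the 2/3 rule):

    N3=34; N1X=1951; N2X=1984; XCRIT=2/3 #Wellington values
    IYXTND=round(N3/XCRIT)-(N2X-N1X+1) #years to add: 51-34 = 17
    LXEND=1989-N2X #room at the recent end of the urban record: 5
    n2x=N2X+LXEND #extend forward first: 1989
    n1x=N1X-(IYXTND-LXEND) #remaining 12 years go to the start: 1939
    c(n1x,n2x) # 1939 1989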

  52. John Goetz
    Posted Jun 16, 2008 at 3:19 PM | Permalink

    I should point out that Steve in #41 observed what I described in #50/51. My #50/51 comments simply confirm his observation, based on my deconstruction of the Fortran.

  53. Geoff Sherrington
    Posted Jun 17, 2008 at 4:07 AM | Permalink

    In case it matters for this test set with Wellington at centre, some distances are about:

    Wellington to
    Hokitika 350 km
    Whenuapai 500 km
    Kaitaia 700 km
    Chatham Is 700 km

    All under 1000 km. But if, for reasons like triangulation, other inter-station distances are used, and stations are excluded if more than 1000 km apart, some are:

    Chatham Is to
    Hokitika 1000 km
    Whenuapai 1000 km
    Kaitaia 1300 km

  54. Steve McIntyre
    Posted Jun 17, 2008 at 7:38 AM | Permalink

    John G, there are epicycles upon epicycles in this calculation by Hansen.

    From examination of the logs from the run that Arthur E has archived, we can be sure that the 4 rural reference stations are correctly selected here. The difference between the reference composite and Wellington dset1 looks like it’s going to be pretty close to what’s shown below (solid = reference minus Wellington dset1) for the “extended range” 1939-1989 (3 stations only available for 1951-1984). This would not appear to yield the actual dset2 adjustment shown dashed, using a two-legged adjustment.

    So what’s going on? It looks like a couple of things. There is an override in which one-leg is substituted for two legs and this can be confirmed for the Wellington case. Also, and I’m just checking this, it looks like pre-extended range information is used to create the adjustment.

    In the plot below, I’ve shown two-leg and one-leg adjustments – using reference information from prior to the “extended range” start of 1939 (the 3-station range starting only in 1951). What we can see here is that the actual dset2 adjustment (dashed) is one-legged despite a two-legged fit.

    The subroutine in the padjust program contains an override in which the two-legged slopes are overwritten by the one-legged slopes depending on the value of iflag. The iflag value appears in the log PAPars.list, which shows a flag of 1211 for Wellington NZ, a value which will provoke an override. The iflag value appears to be generated in the program flags.f, which looks for short knees (limit 7 years), the absolute value of slope 1 (limit 1), the absolute value of slope 2 (limit 1), the absolute value of the slope difference (limit 0.5), and a test which seems to be (a) different slope signs and (b) absolute values of both slopes (limit 0.2).
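
    Reading those tests literally, a minimal sketch of the override condition (assumptions: exceeding any stated limit trips the flag; the digit encoding of iflag values like 1211 is not reproduced):

    #minimal sketch of the flags.f tests as read above
    force_one_leg=function(sl1,sl2,knee_years) {
    short_knee= knee_years<7 #knee within 7 years of an end of the fit period
    steep= abs(sl1)>1 | abs(sl2)>1 #either leg too steep
    divergent= abs(sl1-sl2)>0.5 #legs too different
    sign_change= (sign(sl1)!=sign(sl2)) & (abs(sl1)>0.2) & (abs(sl2)>0.2)
    short_knee | steep | divergent | sign_change #TRUE: substitute the one-legged fit
    }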

    This heretofore elusive calculation is now looking within reach. Arthur E’s logs have been very helpful. It should be possible from available code to determine if pre-extended range results are used in creating the slope.

    I might add that the 1.5 deg C step in the 1920s that appears to drive the slope adjustment is precisely the sort of step that is used as a homogeneity adjustment in USHCN. So one can see quite different methods at work here.

  55. John Goetz
    Posted Jun 17, 2008 at 10:27 AM | Permalink

    Steve, nice catch in flags.f, which to this point I had ignored. I feel like STEP 2 is a Ronco commercial – “but wait, that’s not all!!”

    I would be interested in the rationale behind the conditions in which a one-legged adjustment is favored over two-legged. The 7-year limit is interesting, given that five years are already factored into the beginning and end of the overlap period in PApars.

  56. Dave Dardinger
    Posted Jun 17, 2008 at 10:31 AM | Permalink

    re: #54 Steve,

    This looks like the short version of what will turn into the top of a new thread. Looks interesting.

  57. Steve McIntyre
    Posted Jun 17, 2008 at 11:41 AM | Permalink

    In the file PAPars.list, the following information is logged in relation to the Wellington fit:
    [1] “CCdStationID slope-l slope-r knee Yknee slope Ymid RMS RMSl 3-rur+urb ext.range flag”
    [3527] “507934360010 0.324 -1.026 1978 4.653 0.195 2.470 5.038 7.630 1951-1984 1939-1989 1211”

    I presume that the y-values are all in units of 10 times deg C as otherwise the values don’t make sense. Below is a plot on that basis, plus values in the early 1980s are centered at 0 to match the dset2 adjustment. On this basis, the one-legged adjustments more or less match over the 1951-1984 period with 3 stations. The slope here is determined not over the 3-station period but includes value back to 1880 with only one station. Also the values in the “extended range” are not those from the slope, but are simply the 1951 and 1984 values extended. I expect that this is what further parsing of the code will show, but the code is like tangled fishing line.

  58. John Goetz
    Posted Jun 17, 2008 at 1:42 PM | Permalink

    I guess I need to download the 400 MB file – the information is obviously useful. Now that my cable is back and I am off dial-up, it falls within the realm of possibility.

  59. Thor
    Posted Jun 17, 2008 at 2:06 PM | Permalink

    Some similar logfiles are found on the FTP site below. It looks like the values are from some kind of test run. Still, information can be decoded from the different log files. If you don’t want to download 400 MB, that is.
    ftp://data.giss.nasa.gov/pub/gistemp/GISS_Obs_analysis/test/STEP2/test1/

  60. Steve McIntyre
    Posted Jun 17, 2008 at 6:36 PM | Permalink

    The holding directory at ftp://data.giss.nasa.gov/pub/gistemp/ for GISTEMP working files was created for the first time on June 10, 2008, just after John G initiated the present attention on June 8: http://www.climateaudit.org/?p=3169 . In the past, we’ve noticed NASA responding to CA posts – needless to say without linking here – and the present “coincidence” seems like another example.

  61. Steve McIntyre
    Posted Jun 19, 2008 at 7:08 AM | Permalink

    Boy, it’s hard to reconcile GISS bilge. I’m getting fairly close to being able to do Wellington NZ and then something else pops up.

    Nicholas sent me a new tool to read one of Hansen’s annoying binary files
    ftp://data.giss.nasa.gov/pub/gistemp/GISS_Obs_analysis/ARCHIVE/2008_05/binary_files/Ts.GHCN.CL.5

    This is an annual anomaly calculation. I extracted Chatham Island which is used in the Wellington NZ calculation and compared it to the annual anomaly (Hansen style) that I had calculated from the GISS dset1 version that I scraped in Feb 2008. Most values were identical, with some being 0.1 different presumably due to mysteries of Hansen rounding.

    But there were (and are) some puzzling differences. The binary version has values for 1880, while the Feb 2008 version commenced in 1881. I re-downloaded the current dset1 version and it also started only in 1881. Where did the 1880 values appear from? Who knows? This is NASA.

    The only other differences were in 1989, where the differences were as much as 3 deg C! I re-downloaded the current dset0 version and compared it to the versions scraped a few months ago and compared the dset0 versions (there were 7 of them.) dset0 series 1-5 for Chatham Island were file identical. dset0 series 6 had a few more months of data in 2008 and added some months that were previously NA in 2007, but otherwise was file-identical.

    But there was a difference in version 7 as shown below.

    Here are the values from the Feb 2008 version (as I scraped it).

    87 1987 15.8 15.2 14.2 12.5 10.9 8.8 7.3 8.9 8.4 10.5 12.3 14.0 15.1 12.5 8.3 10.4 11.60
    88 1988 1.0 16.6 1.0 11.5 NA 9.1 8.7 NA NA 1.0 11.7 13.7 10.5 5.3 8.9 5.8 7.64
    89 1989 1.0 1.0 14.1 12.1 1.0 NA NA 9.2 NA 1.0 1.0 13.1 5.2 9.1 NA 0.5 3.95
    90 1990 14.4 15.0 14.7 12.1 11.3 9.7 7.9 9.0 8.9 10.9 12.9 14.1 14.2 12.7 8.9 10.9 11.66

    Here are the values from the June 2008 version (as I scraped it).

    87 1987 15.8 15.2 14.2 12.5 10.9 8.8 7.3 8.9 8.4 10.5 12.3 14.0 15.0 12.5 8.3 10.4 11.58
    88 1988 14.1 16.6 14.5 11.5 11.0 9.1 8.7 8.3 9.7 11.3 11.7 13.7 14.9 12.3 8.7 10.9 11.71
    89 1989 16.3 15.2 14.1 12.1 11.2 9.5 7.4 9.2 9.1 11.7 13.3 13.1 15.1 12.5 8.7 11.4 11.90
    90 1990 14.4 15.0 14.7 12.1 11.3 9.7 7.9 9.0 8.9 10.9 12.9 14.1 14.2 12.7 8.9 10.9 11.66

    So they’ve obviously located a transcription error in this version and corrected it between Feb 2008 and June 2008 either at the GHCN stage or GISS stage. Fair enough. But this error has been corrected nearly 20 years after the collection of the data and only after Climate Audit commenced examination of this particular calculation. Needless to say, they didn’t issue an error notice nor advise us of the change.

    The impact on dset1 values is attenuated but still noticeable (indeed, that’s what I originally noticed.) Here are the monthly dset1 anomaly values calculated from the Feb 2008 scrape:

    1987 15.8 15.2 14.2 12.5 10.9 8.9 7.3 8.8 8.4 10.4 12.1 14.1
    1988 10.9 16.6 11.2 11.5 11.0 9.1 8.7 8.3 9.7 7.9 11.7 13.7
    1989 12.5 11.7 14.1 12.1 8.7 9.5 7.4 9.2 9.1 9.1 10.3 13.2
    1990 14.4 15.0 14.7 12.1 11.3 9.7 8.0 9.0 8.9 11.0 13.0 14.2

    Here are the present values (times 10) from the binary file:
    1987 158 152 142 125 109 88 73 88 84 103 121 141
    1988 141 166 145 115 110 91 87 83 97 113 117 137
    1989 163 152 141 121 112 95 74 92 91 117 133 131
    1990 144 150 147 121 113 97 79 90 89 109 129 141

    Because Hansen erases his history, I can’t go back and double check against actual Feb 2008 values at NASA, only against what I scraped, and it’s not beyond the realm of possibility that I’ve been wrongfooted somewhere in trying to extract results from the nasty NASA formats. But in this case, I’ve got independent scrapes of dset0 and dset1 versions showing an error in the same place, so I’m pretty sure that I’ve not introduced the error as a scraping artifact.

    So why were QC efforts at GHCN and NASA incapable of identifying the huge summer (Jan 1988) errors in one of the Chatham Island versions, especially given the existence of multiple independent versions, most without the transcription error? How could a summer value of 1 deg C survive even the most cursory quality control?

    It would also be interesting to know how the error was picked up between Feb 2008 and now. Who picked it up? Who made the correction?

  62. Steve McIntyre
    Posted Jun 19, 2008 at 7:38 AM | Permalink

    When I was young, there was an annoying novelty song about a crazy toy – it had lyrics like It went zing when it spinned, …., with the narrator trying to guess the purpose of the crazy toy, ending with a comment that “I guess we never will” [know the purpose]. Now that it’s in my head, I can’t get rid of it, but I can’t quite remember it either. Argggh.

    GISS calculations seem like that most of the time. Lights popping, gizmos moving, binary files appearing, disappearing, re-appearing. Pointless calculations. At the end of the day, moving about 2 inches. Sort of like the crazy toy.

  63. wkkruse
    Posted Jun 19, 2008 at 7:57 AM | Permalink

    Steve, Peter, Paul and Mary, The Marvelous Toy.

  64. John Goetz
    Posted Jun 19, 2008 at 8:11 AM | Permalink

    I have two copies of v2.mean from GHCN. One is from February of this year and the other is from August 2007. In both cases the Chatham record ends with 5079398700055, meaning there are six versions. What is the source then of this seventh version GISS uses?

  65. Posted Jun 19, 2008 at 8:38 AM | Permalink

    Steve, you can refresh your memory here:

    Peter, Paul and Mary, The Marvelous Toy

  66. John Goetz
    Posted Jun 19, 2008 at 10:15 AM | Permalink

    I figured out where the seventh version of Chatham Island comes from. It is not GHCN. In Step 0 GISS combines in a bunch of Antarctica station records. In the input files for Step 0 one can find the station lists and their records. antarc2.txt contains the mysterious scribal record for Chatham.

    It turns out GISS scrapes this data from http://www.antarctica.ac.uk/met/READER/temperature.html.

    Here is where it gets interesting. Looking at the Chatham record on the UK site, it still contains the errors described by Steve in comment #61. The version on the GISS website contains the corrections / changes Steve noted.

    The timestamp on the antarc2.txt file buried in the latest GISTEMP sources (as of today) is June 9 at 12:35 PM. I have a version of the GISTEMP sources I downloaded on June 2. In that version antarc2.txt has a timestamp of February 28 at 6:57 PM, and in that version Chatham matches the record presently on the UK site, errors and all. It is pretty obvious then that GISS made the changes.

    Clearly the record on the UK site has some errors, and it makes sense GISS would want to correct them. However, I find it interesting that GISS made the changes without seeming to involve the source of the data. I wonder how the new values were derived?

  67. Steve McIntyre
    Posted Jun 19, 2008 at 10:20 AM | Permalink

    #63,65. Thanks, it was annoying me like crazy. John G, I urge you to watch the link in #65 and see if you agree that it’s a nice description of GISS.

  68. Steve McIntyre
    Posted Jun 19, 2008 at 10:25 AM | Permalink

    #66. So Hansen and his boys are manually editing data. My, my. This GISTEMP thing really is a pig. ftp://data.giss.nasa.gov/pub/gistemp/GISS_Obs_analysis/GISTEMP_sources/STEP0/input_files/preliminary_manual_steps.txt

    antarc1.txt was downloaded from http://www.antarctica.ac.uk/met/READER/surface/stationpt.html
    antarc2.txt was downloaded from http://www.antarctica.ac.uk/met/READER/temperature.html
    antarc3.txt was downloaded from http://www.antarctica.ac.uk/met/READER/aws/awspt.html

    some typos in antarc2.txt were manually corrected

    Station information files were manually created combining information from the above files and GHCN’s v2.temperature.inv

    CHATHAM ISLAND in ftp://data.giss.nasa.gov/pub/gistemp/GISS_Obs_analysis/ARCHIVE/2008_05/text_files/antarc2.txt as of today has been corrected. However, in Arthur E’s zipfile of June 9th, antarc2.txt is not corrected.

  69. John Goetz
    Posted Jun 19, 2008 at 11:13 AM | Permalink

    #68 Steve…they are manually editing, which is fine, but what are the edits based on? How do they know that a 1.0 in one case should be 14.1, and in another it should be 16.3? It is good that they noted a manual change occurred, but it would be better if they noted how the values were derived.

    I had noticed last year that a number of the data points flagged in the GHCN quality control file (and hence removed from the v2 dataset) were easily recoverable transcription errors, such as an incorrect sign or an inadvertent shift of several months worth of data. Easily recoverable and easy to document the reason. A lot of the Meteo data fit the same bill – before it mysteriously disappeared last month.

    If the missing and incorrect Chatham Islands data were so easy to recover by hand, then document the methodology.

  70. Steve McIntyre
    Posted Jun 19, 2008 at 11:20 AM | Permalink

    http://www.youtube.com/watch?v=vvmyxTm6hkg has an animation of GISTEMP.

  71. John Goetz
    Posted Jun 19, 2008 at 11:21 AM | Permalink

    Regarding #65, I do find it a perfect description! It is this crazy toy with thousands of fancy features, none of which have a meaningful purpose. I also think my analogy to Stephen King’s The Jaunt is apt as well. It is a science fiction / horror story, evoking the haunting phrase “It’s eternity in there…”

  72. Steve McIntyre
    Posted Jun 19, 2008 at 11:26 AM | Permalink

    #71. Doncha feel exactly like the kid in the video – seeing two strange green GISTEMP eyes, or pressing on GISTEMP buttons and watching it go whrrr and disappearing under a chair. And Hansen says that he disdains jesters.

  73. steven mosher
    Posted Jun 19, 2008 at 11:36 AM | Permalink

    re 72.

    He’s saving the planet, details don’t matter.

  74. Thor
    Posted Jun 19, 2008 at 12:20 PM | Permalink

    I notice there are 17 temperatures in each row of the “scraped” data. I’m guessing that 12 of these are monthly data, 4 are seasonal averages and one is a yearly average?

    Maybe some sort of interpolation was made on the seasonal and/or yearly averages from years before and after, and then monthly data were reconstructed based on this?

  75. Steve McIntyre
    Posted Jun 19, 2008 at 12:23 PM | Permalink

    #74. Your surmise is correct. However, we have control of this part of the calculation and I’m 99.99999% certain that the problems arise elsewhere.

  76. jeez
    Posted Jun 19, 2008 at 2:11 PM | Permalink

    RE: 68 and 69. If GISS is continuing to edit, correct, and obfuscate online data in real time in response to the diligence of outsiders without posting reasons why or methodologies, I would think that Michael Griffin should be notified. It has really gotten out of hand.

    http://www.nasa.gov/about/highlights/griffin_bio.html

  77. bernie
    Posted Jun 19, 2008 at 5:31 PM | Permalink

    Tom Paxton wrote “The Marvellous Toy” in 1962 or 1963. Many have performed it, including PPM and John Denver. My kids loved it.

  78. Larry T
    Posted Jun 19, 2008 at 6:01 PM | Permalink

    And my favorite group of that time, The Chad Mitchell Trio.

  79. jeez
    Posted Jun 19, 2008 at 6:07 PM | Permalink

    Steve Mc, some of the craziness you observe is probably because a bunch of this was written for nine-track tape.

  80. jeez
    Posted Jun 19, 2008 at 6:08 PM | Permalink

    “whirr”

  81. Neil
    Posted Jun 19, 2008 at 8:16 PM | Permalink

    Re 61
    Chathams temps
    I don’t know why the GISS temps vary so much from the station data as supplied by NIWA below (or where the errors and NAs came from).

    Year Jan  Feb  Mar  Apr  May  Jun Jul Aug Sep Oct  Nov  Dec
    1987 16.2 15.7 14.9 12.7 10.8 8.8 7.4 8.9 8.9 10.5 12.2 14.3
    1988 14.4 17.1 15 11.6 10.8 8.9 8.6 8.4 10.1 11.7 12.1 14
    1989 16.8 15.5 14.5 12.2 11.1 9.4 7.7 9.5 9.6 12.1 13.7 13.4
    1990 14.5 15.2 14.8 12.4 11.2 9.8 8 9.3 9.4 11.1 13.4 14.3
    1991 15.1 16.1 14.5 13.9 10.7 8.5 8.5 9.2 10.2 10.5 10.9 12.6
    1992 14.5 14.2 12.6 10.2 8.3 8.3 7.5 7.9 8.2 10.4 12.4 12.2
    1993 13.4 14.1 12.3 11.2 10.5 9.8 8.2 7.9 8.4 10.4 10.7 12.9
    1994 15.2 15.6 13.7 12.4 10.1 8.4 7.8
    The Chatham station runs from 1956 to 1994, when it is replaced by an AWS (automatic weather station).
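
    For anyone wanting to line these up in R, a minimal sketch (first three full years only; comparing against the GISS version is left for the reader):

    #parse the NIWA monthly values above into a year x month table
    niwa= read.table(textConnection(
    "1987 16.2 15.7 14.9 12.7 10.8 8.8 7.4 8.9 8.9 10.5 12.2 14.3
    1988 14.4 17.1 15 11.6 10.8 8.9 8.6 8.4 10.1 11.7 12.1 14
    1989 16.8 15.5 14.5 12.2 11.1 9.4 7.7 9.5 9.6 12.1 13.7 13.4"),
    col.names=c("year",month.abb))
    round(rowMeans(niwa[,-1]),2) #annual means for an eyeball comparison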

    When you have done the Hansen Step 2, maybe a Watts-style audit in this part of the world would be easy, given that every station has raw data available.

  82. Steve McIntyre
    Posted Jun 21, 2008 at 9:58 AM | Permalink

    I’ve had a bit more luck decoding this sucker. The black jiggly series is the difference between the “reference” composite of 4 “rural” stations and the Wellington dset1 series. The dashed red line is a fit over the 1951-1984 period, in which there are 3 stations, and the black dashed line is the difference between dset2 and dset1. It looks like the adjustment in this case is coerced back to a one-legged adjustment and that the end values of the 1951-1984 adjustment are extended to the “extended” dset2 range of 1939-1988. I’ll pull these together in a longer note.

    Given the black series, does the black dashed line represent any sort of logical estimate of required adjustment? In this case, the procedure makes no sense whatever. I didn’t pick this data set in order to embarrass them; as readers here know, I picked this one in advance because it had enough texture to illustrate the procedure and not so many comparanda as to make reverse engineering too difficult.
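
    In R terms, my reading of the operation is roughly the following (a sketch only, not Hansen’s Fortran; the variable names are mine and the delta series is simulated just to show the shape of the calculation):

    set.seed(1)
    year= 1951:1984 #period with 3 rural stations
    delta= 0.01*(year-1951)+rnorm(length(year),sd=0.1) #stands in for: rural composite minus Wellington dset1
    fit= lm(delta~year) #the "one-legged" (straight line) fit
    yr.ext= 1939:1988 #"extended" dset2 range
    adj= predict(fit,newdata=data.frame(year=yr.ext))
    adj[yr.ext<1951]= adj[yr.ext==1951] #carry the 1951 end value back to 1939
    adj[yr.ext>1984]= adj[yr.ext==1984] #carry the 1984 end value forward to 1988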

  83. Steve McIntyre
    Posted Jun 21, 2008 at 10:07 AM | Permalink

    The four rural comparanda are:

         id          name           long     lat    altitude start_raw
    6020 50793615000 HOKITIKA AERO  170.98  -42.72  40       1894
    6010 50793112000 WHENUAPAI      174.63  -36.78  27       1951
    6009 50793012000 KAITAIA        173.27  -35.13  87       1961
    6028 50793987000 CHATHAM ISLAN -176.57  -43.95  49       1881
    (The unlabelled first column is the R row number.)
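
    For reference, the distances and weights fall out of the circledist function loaded earlier (Wellington’s location is approximate here, about 41.29S, 174.78E):

    lat= c(-42.72,-36.78,-35.13,-43.95)
    long= c(170.98,174.63,173.27,-176.57)
    dist= circledist(loc=c(-41.29,174.78),lat=lat,long=long) #great circle distances in km
    round(dist) #distances from Wellington
    round(1- dist/1000,3) #the 1 - dist/1000 weights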

    Are there any New Zealand readers who can provide further details on these stations? I’ll ask over at Anthony’s as well.

  84. John Goetz
    Posted Jun 21, 2008 at 3:06 PM | Permalink

    Steve, getting very close. Any thoughts on why there is a displacement between your red adjustment and the GISS adjustment?

    It would be good to know from our New Zealand friends how the rural stations evolved in the period 1939 to 1988. It looks like Hokitika Aero started as a grass field and now has two paved runways. No idea, though, on the exact location of the station then and now. Looking at where the coordinates for Whenuapai land me in Google Earth, that station is probably at the airport as well.

  85. Steve McIntyre
    Posted Jun 21, 2008 at 3:46 PM | Permalink

    It looks like Hansen re-zeros the adjustment at the closing values, though I have to confirm this. Here’s a plot of the deltas between the observed dset2 adjustment and the emulated adjustment. For this station, the calculation is matched up to Hansen-rounding, which I’m not going to worry about right now. I can only take so much of this dreck at one go.

    I spent a number of hours today making a coherent data frame that didn’t fail with some of the bizarre NAs in the dset1 database. There are sure a lot of land mines in this data set.

  86. Steve McIntyre
    Posted Jun 21, 2008 at 3:57 PM | Permalink

    John G, if I subtract the last value of the emulated adjustment before rounding, I get an exact match. How bout that.

  87. _Jim
    Posted Jun 21, 2008 at 4:59 PM | Permalink

    jeez –

    Steve Mc, some of the craziness you observe is probably because a bunch of this was written for nine track tape.

    Ninth track - wasn’t that used for parity (the parity bit)?

    I believe it was in the case of our Pertec drives on the 990/12’s (the 990 series was a “16 bit” machine with 8-bit byte addressability). The Pertec tape drive was rack-mounted, using vacuum-technology tape ‘accumulators’ in which tape was looped off the supply and take-up reels into ‘wells’, with vacuum supplied at one end, as a means to reduce the inevitable inertia of the reels (sometimes full of tape) during rapid back-and-forth tape movement past the read/write heads. A quick bit of googling shows the ninth track was used for parity on other machine series (incl. VAX) as well.

  88. John Goetz
    Posted Jun 21, 2008 at 6:50 PM | Permalink

    Steve Mc…it always feels good when you break the code 🙂

    I am not exactly sure what you meant, however, in #86. Do you mean you drop the last year?

    Now it’s time to test on South America?

  89. Steve McIntyre
    Posted Jun 21, 2008 at 8:26 PM | Permalink

    #88. No, Hansen’s pattern is to have the last adjustment at 0. So if the emulation ends at a non-zero value, deduct the closing value from the series as a whole. I’ve also done the re-scaling back to the dset2 monthly version (Centigrade, not anomaly); the residual is up to 0.1 deg C in sporadic years, with most at 0.0. The adjustment ends in 1989 rather than 1988. Merely having a couple of values in 1989 triggers the extension algorithm. I should be able to get the last squeak on this case.
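
    In R, the re-zeroing step is just this (a sketch, with a stand-in adjustment vector):

    adj= c(0.3,0.27,0.21,0.18,0.12) #stand-in emulated adjustment
    adj= adj- adj[length(adj)] #deduct the closing value so the last adjustment is 0
    round(adj,1) #back to 0.1 deg C precision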

  90. Steve McIntyre
    Posted Jun 21, 2008 at 8:55 PM | Permalink

    I got the last squeak. I’ll post this up tomorrow. Maybe it will even work on other stations.

  91. Posted Jun 21, 2008 at 10:49 PM | Permalink

    Kaitaia is at the top of the North Island, sub-tropical; the Chatham Islands are below the bottom of the South Island, alpine/Antarctic. Hansen is mad. These are two extremes relative to Wellington.

  92. Grant
    Posted Jun 21, 2008 at 11:51 PM | Permalink

    And Hokitika is nestled at the western base of the Southern Alps, and receives nearly 3m of rain per year vs. 1.2m for Wellington.

    It throws a great festival, though: http://www.wildfoods.co.nz

  93. Ed Snack
    Posted Jun 22, 2008 at 3:33 AM | Permalink

    I’m not sure what records Hansen uses for NZ, presumably the raw numbers? The NZ Met Service carries out its own adjustments. The raw temperatures for most NZ sites have tended to trend downwards; it is only the adjustments that give them a positive slope. I’ll try to find some further info on these adjustments.

  94. Geoff Sherrington
    Posted Jun 22, 2008 at 4:07 AM | Permalink

    Here is a short primer on aspects of surveying of the Earth. Its purpose is to show that as technology advances, so do corrections to past data.

    Hu McCulloch might find interest in map projection maths.

    Steve Mc., there might be reasons within this unpublished paper why GISS data are changed frequently by small amounts. I am not saying that the amounts are often significant, only that they do cause recalculation, especially for satellite and military work. You might be able to correlate some past dates with adjustments to the surface temperature record.

    There are more papers available, showing how some of the equations in this paper are derived; and more on satellite accuracy.

    It remains the case that absolute distance measurements do not exist in this type of work. For example, the acceptable answer might depend on the number of terms used in a series expansion like Taylor’s.

    Link: coorddef.pdf

  95. MrPete
    Posted Jun 22, 2008 at 6:34 AM | Permalink

    Geoff,

    Even with a new reference frame, the rain in Maine falls mainly on the Seine.**

    It’s an interesting paper. However, I seriously doubt any of this would impact climate calculations. Grid adjustments in the paper would be pure “bit noise,” below climate measurement detectability.

    Grid adjustments in the paper are all much smaller than an arc-second (30m or so); climate location measurements are not that precise.
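
    For scale, using the Earth radius from the circledist function given earlier:

    2*pi*6372.795/(360*3600)*1000 #length of one arc-second of latitude, about 30.9 metres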

    More significant: even with reference frame adjustments, distances change much less than do positions.

    –Pete

    **If the reader is new to CA, search for “Seine” to find that little joke 🙂

  96. Steve McIntyre
    Posted Jun 22, 2008 at 6:37 AM | Permalink

    81, 92. What I’m interested in is Anthony Watts-type detailed information on the stations and details on station history, if they can be obtained from the NZ meteorological service or elsewhere.

  97. Geoff Sherrington
    Posted Jun 22, 2008 at 6:52 AM | Permalink

    Re # 95 MrPete

    It’s not the value of a change that I’m referring to, it’s the fact of a change. I do not know the answer to this, but if a new calculation for (say) satellite calibration adjusts a distance, does this mean that the whole GISS system is adjusted? There has to be some reason for the frequency of adjustments.

    Re # 96 Steve McI, let me know if you already have the public CD of some 1200+ Australian BoM stations. The CDs are the best means I know for getting ROW data from here.

  98. tty
    Posted Jun 22, 2008 at 10:59 AM | Permalink

    #91. I think you are confusing the Chathams with some other island (Stewart?). The Chatham Islands are on the Chatham Rise, about 800 km almost due east of Christchurch, i.e. just north of the middle of the South Island. A fairly good analogue would be using data from Bermuda to correct the record for Washington DC.

  99. Steve McIntyre
    Posted Jun 22, 2008 at 12:12 PM | Permalink

    #97. Nope. I asked BoM for data quite a few months ago, but got no response.

  100. Geoff Sherrington
    Posted Jun 22, 2008 at 5:52 PM | Permalink

    Re 99 Steve McI

    My email is sherro1 at optusnet dot com dot au

    If you are prepared to mail me a postal address I will try to remedy that shortage.

    Regards Geoff.

  101. Geoff Sherrington
    Posted Jun 22, 2008 at 5:57 PM | Permalink

    Re # 95 MrPete

    Rain in Seine … I think I have read all CA posts for the last 2 years (E&OE) and enjoyed Maine. That is, if it is possible to enjoy bad science, errors going neglected and the rest of the lessons therein.

3 Trackbacks

  1. […] of Wellington NZ (an arbitrary example, but an interesting one in the present context.) See here. In the next comment, I noticed problems with Chatham Island (which is in the READER database, […]

  2. By Another quiet GISS update « Climate Audit on Jul 13, 2011 at 9:41 PM

    […] is this important? For those of you who followed the GISS Step 2 thread on this site, you will recall that on June 19 Steve and I were commenting on a mysterious […]

  3. By More Changes at BAS « Climate Audit on Mar 22, 2012 at 4:03 PM

    […] together with a revised “credit” here. One of the 6 stations is Chatham Island, where I noticed a problem on June 13, 2008 when I was trying to replicate GISS methodology using Wellington NZ as an example. John Goetz […]