Mann and Perfect Reconstructions

I finally turned over a few stones in Mann’s EIV reconstructions, little suspecting the perfection that awaited me in the cave, in the form of the simple and unassuming alter ego shglfulihad_smxx.

The figure below compares the SH reconstruction to the smoothed (SH) iHAD instrumental version. From the frail instruments of speleothems, bristlecone ring widths and upside-down sediments, Mann has achieved something divine – a “perfect” reconstruction.

Presumably out of an excess of modesty, Mann did not claim the perfect verification statistics to which he was ‘entitled’, only claiming an “average” verification r2 of 0.28. But let’s not hide this light under a bushel. [Note: Jean S observes below that the calculations are done separately for calibration and verification periods. As I interpret his comment (together with my own take on Mannian RegEM), the method yields essentially ‘perfect’ reconstructions during the calibration period – no overfitting involved, of course – with verification stats dropping sharply in the verification period due, of course, to overfitting: a phenomenon that we discussed in a May 2006 post on VZ pseudoproxies.]

This was the first stone that I turned over in the EIV cave. Perhaps more perfection awaits us deeper within? Perhaps even richer treasures.

If you wish to confirm the perfection for yourself, here are scripts that download perfection from Mann’s FTP site. First download the shglfulihad reconstruction:

url="http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/reconstructions/eiv"
x=ts(scan(file.path(url,"shglfulihad_smxx")),start=0)

Next the instrumental target:

url="http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/instrument/iHAD_SH_reform"
target=read.table(url);target=ts(target[,2],start=1850)

Now do Mannian smoothing on the instrumental target:

library(signal)  # provides butter() for the Butterworth filter
source("http://data.climateaudit.org/scripts/mann.2008/utilities.txt")
cutfreq=0.1; ipts=10 #ipts set as 10 in Mann lowpass
bf=butter(ipts,2*cutfreq,"low"); npad=1/(2*cutfreq) #npad from Mann lowpass routine
smooth=ts( mannsmooth(target,M=npad,bwf=bf), start=1850)
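
The mannsmooth() function comes from the utilities script. For orientation, here is a rough stand-in whose internals are my assumption rather than a copy of the CA function: reflect npad points at each end of the series, run a zero-phase Butterworth filter, then trim the padding.

# Rough stand-in for mannsmooth(); the internals here are an assumption,
# not a copy of the CA utility.
mannsmooth.sketch = function(x, M, bwf) {
  n = length(x)
  padded = c(rev(x[1:M]), x, rev(x[(n-M+1):n]))  # reflect M points at each end
  sm = signal::filtfilt(bwf, padded)             # zero-phase (forward-backward) filter
  ts(sm[(M+1):(M+n)], start=start(x)[1])         # trim padding, restore the ts
}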

Now plot perfection:

test=ts.union(target,smooth,x)
ts.plot(test[,1:2],col=c("grey80",1),xlim=c(1850,2000))
title("shglfulihad")
lines(c(time(test)),test[,3],col=2)   # the EIV "reconstruction" in red
points(c(time(test)),test[,2],pch=19,cex=.07)
legend("topleft",fill=c("grey80",1,2),legend=c("Instrumental","Instr Smooth","'Reconstruction'"),cex=.8)

Oh yes, here’s the rest of the ‘perfect’ reconstruction.


Figure 2 – this is the same data as above, but plotted over the entire record.

Replication Problems: Mannian Verification Stats

If anyone feels like sticking needles in their eyes, I’d appreciate assistance in trying to figure out Mannian verification statistics. Even when Mann posts up his code, replication is never easy, since his team never bothered to ensure that the frigging code works. Or maybe they checked to see that it didn’t work. UC’s first post on the matter wondered where the file c:\scozztemann3\newtemp\nhhinfxxxhad was. We still have no idea. This file is referred to in the horrendously written verification stats program and it may be relevant.

With UC’s help, I’ve been able to replicate quite a bit of the CPS program (the EIV module remains a mystery).

I’ve been testing verification stats with the SH iHAD reconstruction. I mentioned previously that Mannian splicing does not always use the larger proxy networks: if a smaller network yields “better” RE stats, the smaller network is retained. This Mannian piece of cherry picking is justified in the name of avoiding “overfitting”, although it is actually just the opposite. It reminds me of the wonderful quote from Esper 2003 (discussed here):

this does not mean that one could not improve a chronology by reducing the number of series used if the purpose of removing samples is to enhance a desired signal. The ability to pick and choose which samples to use is an advantage unique to dendroclimatology.

Mining promoters would like a similar advantage, but, for some reason, securities commissions require mining promoters to disclose all their results.

Mann’s reconstruction archive in 2008, as with MBH98, only shows spliced versions – some habits never change, I guess. But in the SH iHAD run, the AD1000 network remains in use right through to the 20th century, with all proxies starting later than AD1000 being ignored – all in the name of not “overfitting”. But the long run of values from a consistent network is very handy for benchmarking and, with much help from UC’s Matlab runs, I’ve managed to very closely replicate the SH iHAD reconstruction from first principles, as shown below – this graphic compares a version archived at Mann’s FTP site with my emulation.

For comparison, here is an excerpt from Mann SI Figure S5d (page 11), which has an identical appearance.

You can download an original digital version of this reconstruction (1000-1995) as follows:

url="http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/reconstructions/cps/SH_had.csv"
had=read.csv(url)
temp=!is.na(had[,2])
estimate=ts(had[temp,2],start=min(had[temp,1]))

A digital version of the “target” instrumental is also at Mann’s website and can be downloaded as follows:

url="http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/instrument/iHAD_SH_reform"
target=read.table(url)
target=ts(target[,2],start=1850,end=1995)

The reported verification statistics for the SH iHAD reconstruction are also archived and can be downloaded as follows (load the package indicated). BTW this is a nice package for reading Excel sheets into R.

library(xlsReadWrite)
url="http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/cps-validation.xls"
download.file(url,"temp.xls",mode="wb")
test=read.xls("temp.xls",colNames=TRUE,sheet=14,type="data.frame",colClasses="numeric")
count=apply(!is.na(test),2,sum);count
temp=!is.na(test[,1])
stat=test[temp,!(count==0)]
name1=c("century", c(t( outer(c("early","late","average","adjusted"),c("RE","CE","r2"), function(x,y) paste(x,y,sep="_") ) )) )
names(stat)=name1[1:ncol(stat)]
stat[stat$century==1000,]
# century early_RE early_CE early_r2 late_RE late_CE late_r2 average_RE average_CE average_r2 adjusted_RE adjusted_CE adjusted_r2
# 1000 0.0746 -1.663 0.3552 0.7194 0.1475 0.303 0.397 -0.758 0.3291 0.397 -0.758 0.3291

Given digital versions of the reconstruction and the “target”, it should be simplicity itself to obtain standard dendro verification statistics. But, hey, this is hardcore Team. First, Mann does some Mannian smoothing of the instrumental target. Well, we’ve managed to replicate Mannian smoothing and can follow him through this briar patch.

library(signal) # used for smoothing and must be installed
source("http://data.climateaudit.org/scripts/mann.2008/utilities.txt")
cutfreq=.1;ipts=10 #ipts set as 10 in Mann lowpass
bf=butter(ipts,2*cutfreq,"low");npad=1/(2*cutfreq);npad
smooth=ts( mannsmooth(target,M=npad,bwf=bf) ,start=1850)

Now the “early miss” verification stats using a simple (and well-tested) program to do the calculations:

rbind( unlist(stat[ stat$century==1000, grep("early",names(stat)) ]),
  unlist(verification.stats(estimator=estimate,observed=smooth,calibration=c(1896,1995),verification=c(1850,1895))[c(2,5,4)]) )
# early_RE early_CE early_r2
#[1,] 0.0745930 -1.6633600 0.3551960
#[2,] 0.2883958 -0.8940804 0.1432888
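
For readers without the utilities script, here is a hedged sketch of the standard statistics; the actual verification.stats function presumably computes something equivalent, with RE benchmarked against the calibration-period mean, CE against the verification-period mean, and r2 as the squared correlation over the verification period.

# Hedged sketch of standard dendro verification statistics; the name and
# return order of the actual verification.stats utility may differ.
verify.sketch = function(estimator, observed, calibration, verification) {
  c.obs = window(observed, start=calibration[1], end=calibration[2])
  v.obs = window(observed, start=verification[1], end=verification[2])
  v.est = window(estimator, start=verification[1], end=verification[2])
  sse = sum((v.obs - v.est)^2)
  c(RE = 1 - sse/sum((v.obs - mean(c.obs))^2),   # vs calibration mean
    CE = 1 - sse/sum((v.obs - mean(v.obs))^2),   # vs verification mean
    r2 = cor(v.obs, v.est)^2)                    # squared correlation
}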

And for the “late-miss” stats:

rbind( unlist(stat[ stat$century==1000, grep("late",names(stat)) ]),
  unlist(verification.stats(estimator=estimate,observed=smooth,calibration=c(1850,1949),verification=c(1950,1995))[c(2,5,4)]) )
# late_RE late_CE late_r2
#[1,] 0.719441 0.1474790 0.3030280
#[2,] 0.804556 0.4111129 0.4549566

These should match the early_ and late_ values, but don’t. The inability to replicate the r2 values is particularly troubling, since these are not affected by the various scaling transformations. I simply haven’t been able to get the reported verification r2 values, despite trying many permutations.

Since the reconstruction ties to both the digital and graphic versions, perhaps the archived instrumental version http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/instrument/iHAD_SH_reform is not the same as c:\scozztemann3\newtemp\shhinfxxxhad .

The code for the verification stats is at
http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/code/codeveri/veri1950_1995sm.m and http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/code/codeveri/veri1850_1895sm.m. They seem to have learned their programming style from Hansen: the code is replete with steps that don’t seem to serve any function, unhelpful comments made less helpful in places by inaccuracy, and, most of all, an almost total lack of mathematical understanding and organization in its implementation.

Santer and the 4 NOAA Coauthors

Apparently none of Santer’s four NOAA coauthors either sent to or received from Santer any correspondence regarding the monthly data series used in Santer et al 2008. What a strange way to run a railroad.

GHCN Updates

In its land surface temperature calculations, NASA GISS is little more than a distributor for NOAA GHCN. As Gavin Schmidt explained, they spend no more than 0.25 man-years on this product, which permits negligible (if any) quality control. Although there are 7280 stations in the GHCN network, only a fraction of these occur as up-to-date records in GHCN (or its distributors), even from countries where up-to-date information is easy to locate (e.g. Canada, New Zealand). I’ve discussed this bizarre failure to update station data on many occasions in the past.

Aside from the seemingly poor practice, a further issue for me and others is whether some kind of bias has been introduced into the system by the discontinuity. Assessing this potential bias is not a small job, but it is one that NOAA and NASA should obviously have done long ago; the existing literature does not do so.

An obstacle to assessing such a bias has been the lack of any clear description of the update program. In my recent review of the data, I noticed a couple of scraps of information and have now constructed what I believe to be a reasonably accurate inventory of the stations that are included in the GHCN update program.

Peterson et al (BAMS 1997) reported the following update procedure:

Of the 31 sources, we are able to perform regular monthly updates with only three of them (Fig. 5). These are 1) the U.S. HCN, 1221 high quality, long-term, mostly rural stations in the United States; 2) a 371-station subset of the U.S. First Order station network (mostly airport stations in the United States and U.S. territories such as the Marshall and Caroline Islands in the western Pacific); and 3) 1502 Monthly Climatic Data for the World stations (subset of those stations around the world that report CLIMAT monthly code over the Global Telecommunications System and/or mail reports to NCDC). Other stations will be updated or added to GHCN when additional data become available, but this will be on a highly irregular basis.

When they said that the “other stations” would be updated on a “highly irregular basis”, they were not joking. In the 11 subsequent years, as far as I can tell, there have been no such updates. We’ve already collated information on the 1221 USHCN sites, but the GHCN website unfortunately doesn’t provide any lists of either the 1502 MCDW stations or 371 First Order stations.

Last fall, included in the NASA GISS archive were two tables named “mcdw.tbl” and “sod.tbl” (insert urls). The two tables consisted of columns of id numbers without any descriptors, but the first table had 1502 rows and the second table had 371 rows, matching the information in Peterson et al 1997. When I noticed this, I realized that this was a guide to the stations in the GHCN update program.

Even with this guide, it proved to be a lot of work actually reconciling this information to the inventory of NASA GISS/GHCN station id numbers, as the tables matched to a point, but only to a point. Many numbers in the 2nd column of these tables matched parts of the GHCN station identifications, but there were many sites that were only partial matches. In some cases, the GHCN station seemed to match the number in the first column better than the second column. I got about 95% matched semi-automatically but then had a bunch of leftover numbers for which I sought matches in a variety of ways. I inserted oddball id numbers at this website (http://weather.gladstonefamily.net/site/42599) and got names; then I checked partial name matches in my GHCN station inventories to see if the id number resembled any of the numbers in the two tables.

I got close enough that I wanted to finish the enterprise and eventually wasted a lot of time on it, but I’ve emerged with a plausible list of identifications here: http://data.climateaudit.org/data/station/ghcn/update.dat.

There appears to be a considerable overlap (149 stations) between the two lists (mcdw.tbl and sod.tbl), reducing the total number to 1724 stations. Of the 1724 stations, only 1058 stations have values in 2008.

Although there are no USHCN stations in this part of the network, there are 138 US stations in the non-USHCN update, almost entirely from airports, including large airports like Phoenix, Los Angeles, Houston etc. If you want a list, you can examine them as follows:

info=read.table("http://data.climateaudit.org/data/station/ghcn/update.dat",sep="\t",header=TRUE)
temp=(info$country==425)  # US stations (GHCN country code 425)
temp1=(info$end_raw>=2008)&!is.na(info$end_raw)  # stations still reporting in 2008
info[temp,]
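
If the collation matches the counts quoted above, a quick cross-check (using the columns from the snippet above) should return 1724, 1058 and 138:

# cross-check of the counts quoted above (column names assumed as in the snippet)
nrow(info)               # total stations in the update list: 1724
sum(temp1, na.rm=TRUE)   # stations with values in 2008: 1058
sum(temp, na.rm=TRUE)    # US (country 425) stations: 138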

The ID numbers in this table tie into the ID numbers in the giss_info collation that I’ve already posted up (which contains some additional GISS information on these sites, e.g. altitude, (incorrect) population, etc.).

Country codes are here and can be read as follows:

url="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.country.codes"
ccode=read.fwf(url,widths=c(4,30),colClasses=c("numeric","character"))
names(ccode)=c("id","name")
ccode$name=gsub(" +$","",ccode$name)

NASA GISS makes an effort to adjust for UHI. As noted elsewhere, their procedure in the US is considerably different than their procedure outside the US, as, in the US, the USHCN network, whatever its warts, provides many stations that are not in urban airports. There is scattered information on non-US stations, but this is clearly a major lacuna at present (and I’m sending a copy of this list to Anthony in the hope that this topic will interest him – which seems likely given the interesting recent posts at his site on Verhoyansk in Russia). The Russian (and western China) sites seem like excellent first targets as they have a lot of leverage in overall land-based aggregates.

October 2008 – NOAA vs MSU

Today, I’m posting up plots comparing the MSU gridded data and the NOAA land-sea gridded data. On an overall basis, MSU (GLB) is running about 0.2 deg C cooler than NOAA land-sea gridded in October 2008. These differences are quite volatile and this discrepancy is not unusual.

First here is October MSU (centered on 1980-2000) to facilitate comparison with the NOAA land-sea version (which I’ve also centered on 1980-2000).

Next here is October NOAA land-sea (re-centered on 1980-2000).

While the major patterns are similar in both data sets, there are some intriguing differences. NOAA surface coverage of the Southern Hemisphere looks decidedly scrappy beside UAH. While both NOAA and UAH show a cool “ridge” in the ocean offshore western Europe, it’s much larger in the UAH version. The tropical Pacific and northern Pacific are cooler in UAH. But it’s not all the same direction – UAH western Canada is warmer than NOAA western Canada.

Can’t See the Signal For the Trees

ABSTRACT: A new method is proposed for determining if a group of datasets contains a signal in common. The method, which I call Correlation Distribution Analysis (CDA), is shown to be able to detect common signals down to a signal:noise ratio of 1:10. In addition, the method reveals how much of the common signal is contained by each proxy. I applied the method to the Mann et al. 2008 (hereinafter M2008) proxies. I analysed all (N=95) of the M2008 proxies which contain data from 1001 to 1980. These contain a clear hockeystick-shaped signal. CDA shows that the hockeystick shape is entirely due to Tiljander proxies plus high-altitude southwestern US “stripbark” pines (bristlecones, foxtails, etc). When these are removed, the hockeystick shape disappears entirely.

=============================================

Pondering the wide variety of distributions in the M2008 proxies led me to try to understand some of the statistical properties of a group of proxies. In particular, I wanted to understand more about how to find out if there is a common signal in a group of proxies. Before I got into the question of what the signal might look like, I wanted to know if the signal existed.

To do this, I invented a curious game that I called “hide the signal”. In this game, I took a group of random pseudoproxies, which I standardized to a common mean of 0 and a standard deviation of 1. To these, I added a common signal of different sizes. Then, I tried to devise processes that would reveal the existence of the signal. Not the shape of the signal necessarily, but the size of the signal and how it was distributed amongst the proxies.
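
A minimal sketch of that setup, with illustrative names (nothing here comes from an archive): white-noise pseudoproxies plus a scaled common signal, standardized afterwards.

# "Hide the signal": white-noise pseudoproxies plus a common signal at a
# chosen signal:noise ratio. All names here are illustrative.
make.pseudoproxies = function(nyears=1000, nproxies=100, snr=0.1) {
  signal = as.numeric(scale(cumsum(rnorm(nyears))))       # arbitrary common "signal"
  noise = matrix(rnorm(nyears*nproxies), nyears, nproxies)
  scale(noise + snr*signal)    # restandardize to mean 0, sd 1 per proxy
}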

After some experimentation, I adopted a procedure which successively removes the proxy with the least correlation with the average of the remaining proxies. The procedure looks like this:

1. Repeat a loop that does the following:

a) Calculate the row means (yearly means of a proxy matrix where the rows are years and the columns are individual proxies) of all of the “proxies” for each year. Let me designate this resulting dataset as RM, for the row means. It has one entry for each year: the mean (average) of all the proxy values for that year.

b) Invert (swap plus for minus) all proxies that have negative correlations with RM. (Since we are looking at the correlation with the average of all proxies, there are typically not many negative correlations.) Repeat until no proxy is negatively correlated with RM.

2. Repeat a loop that does the following:

a) Calculate RM, the row means of the group of proxies.

b) Identify and record the proxy with the minimum correlation with RM.

c) Remove from the dataset the proxy with the minimum correlation. Repeat until there is only one proxy left.

At the end of this process, graph the minimum correlation values.
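
In R, the procedure sketches out roughly as follows; this is my reading of the steps above, not the actual code behind the post.

# Minimal sketch of the CDA loop described above. X: years x proxies matrix.
cda = function(X) {
  if (is.null(colnames(X))) colnames(X) = paste("P", 1:ncol(X), sep="")
  repeat {                                # step 1: orient all proxies with RM
    r = as.numeric(cor(X, rowMeans(X)))
    if (all(r >= 0)) break
    X[, r < 0] = -X[, r < 0]              # flip negatively correlated proxies
  }
  minr = NULL; dropped = NULL
  while (ncol(X) > 1) {                   # step 2: strip least-correlated proxy
    r = as.numeric(cor(X, rowMeans(X)))
    k = which.min(r)
    minr = c(minr, r[k]); dropped = c(dropped, colnames(X)[k])
    X = X[, -k, drop=FALSE]
  }
  list(min.correlation = minr, removed = dropped)
}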

My logic ran like this. If I keep removing the proxy that least resembles the common signal, I will end up with the (possibly few) proxies containing the majority of the signal. It also allows me to investigate how the information was distributed among the proxies.

Knowing nothing about this new procedure, I decided to start by examining datasets containing no information at all. I looked at results from random distributions of various types. I created 100 “proxies” containing 1000 “years” of data for each type of distribution.
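
With the sketch above, the Gaussian case of that null experiment looks like this (the other distributions just swap the random generator):

# null experiment: 100 pure-noise "proxies" x 1000 "years"
set.seed(1)
X = scale(matrix(rnorm(1000*100), nrow=1000, ncol=100))
out = cda(X)
plot(out$min.correlation, type="l",
     xlab="proxies removed", ylab="minimum correlation")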

The first big surprise in this was the following graph:


Fig. 1. Correlation distribution in a variety of random datasets. The results, while not identical, are so close that the top one covers them all.

I found this quite amazing. In terms of how the information is spread out among the individual proxies, there was absolutely no significant difference between any of the standard kinds of distributions. It seemed like that could be very useful, although I did not know how.

Hello, Viet Nam!

Hey, NOAA, if you’re not too busy deleting data, you can drop in and say hello.

Emulating CRUTem Graphics

Because of the puzzles in trying to replicate the NOAA graphic from archived data, I’ve also tried to replicate the HadCRU plots the same way. Hu McCulloch has criticized the highly inappropriate use of Mercator projections by climate scientists and, as an exercise, I figured out a fairly simple method of plotting these results in a Mollweide projection (also shown below), which seems like a pretty reasonable approach.

First here is October CRUTEM, followed by my emulation from the data archived at http://hadobs.metoffice.com/crutem3/data/CRUTEM3.nc as of today. I spent a while trying to match the color coding and am close enough for practical purposes. The archived data appears to contain some stations not in the webpage graphic e.g. in Canada.


Here’s a Mollweide projection of the CRUTem3 data. I’ll post up a utility for doing this from gridded data, as Hu is 100% right that Mercator projections should not be used. They are a bit easier to do, but it didn’t take me that much time to figure out how to do color coded Mollweide plots and you’d think that CRU, Hadley Center and NOAA should be able to do as well.
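
Pending that utility, here is a minimal sketch of a color-coded Mollweide plot using the maps and mapproj packages; the anomaly field below is random stand-in data, not the CRUTEM3 grid.

# Minimal Mollweide sketch using maps/mapproj; the anomaly field here is
# random stand-in data, not CRUTEM3.
library(maps); library(mapproj)
grid = expand.grid(long=seq(-177.5,177.5,5), lat=seq(-87.5,87.5,5))
grid$anom = rnorm(nrow(grid))                        # stand-in for gridded anomalies
col = rev(rainbow(11, end=4/6))[cut(grid$anom, 11)]  # blue (cold) to red (warm)
map("world", projection="mollweide", col="grey60")   # base map sets the projection
p = mapproject(grid$long, grid$lat)                  # reuse the current projection
points(p$x, p$y, pch=15, cex=.5, col=col)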

"Anomalous" in Finland

Occasionally, I’ve been criticized for spending too much time on NASA GISS. Since we aim to please even our severest critics, let’s spend a little time today on NOAA, which has just reported the second warmest October of all time.

I downloaded the NOAA gridded data (who mercifully, in this case, at least use .gz instead of the medieval .Z compression so beloved of antique climate scientists). I’ve uploaded the read script and you can read the data into R as follows, in 2592 columns (Jones grid order) starting in 1880:

source("http://data.climateaudit.org/scripts/gridcell/collation.functions.txt")
noaa=download.noaa(); noaa=noaa/100; dimnames(noaa)[[2]]=1:2592
dim(noaa)

The first thing that most data analysts do with fresh data is to look for extreme cases. Here I picked out gridcells with October 2008 anomalies over 5 deg C (10 of them).

N=nrow(noaa); x=noaa[N-2,]  # the October 2008 row (third-last in the matrix)
temp=(x>5.00)&!is.na(x); id=as.numeric(names(x[temp]))
x[temp]
# 196 208 221 303 329 418 421 485 634 710
#6.49 5.24 5.29 5.45 7.42 5.20 5.22 5.23 5.70 6.24

I then identified the GHCN stations that were in these gridcells, noting in particular, stations with values in 2007 and 2008. Relevant stations were Ostrov Vize, Ostrov Kotel, Barrow, Cambridge Bay, Karuesuando/Haparanda, Ergobasen, Viljujsk, Koplasevo, Dauunmod and Yanji, a couple of which we encountered in connection with GISS.

download.file("http://data.climateaudit.org/data/giss/giss.info.tab","temp.dat",mode="wb");load("temp.dat")
giss.info$jones=jones(giss.info$lat,giss.info$long)  # Jones gridcell for each station
test=giss.info[ !is.na(match(giss.info$jones,id)),c(1:6,22,29)]
order1=order(test$jones,test$id)
test[order1,]
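
The jones() function here comes from the CA collation utilities. For readers without the script, a hedged sketch of a lat/long to Jones-cell-index mapping; the exact convention used by the utility is an assumption:

# Hedged sketch of a lat/long -> Jones-order cell index; the convention
# (north-to-south rows, dateline-west-first columns) is an assumption.
jones.sketch = function(lat, long) {
  i = floor((90 - lat)/5)     # latitude row: 0 (85-90N) .. 35 (85-90S)
  j = floor((long + 180)/5)   # longitude column: 0 (180-175W) .. 71
  72*i + j + 1
}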

Some regular CA contributors are from Finland. Combined with the fact that the October 2008 anomaly for this gridcell was the largest in the entire world, this seemed like a good one to analyze. I downloaded the most recent GISS dset1 versions of Haparanda and Karuesuando and compared them to the corresponding NOAA gridcell. For each GISS dset1 series (which are expressed in deg C), I converted them to anomalies prior to the comparison.

g= function(A) ts( c(t(A[[1]][,2:13])),start=c(A[[1]][1,1],1),freq=12)  # table to monthly ts
station1=anom(g( read.giss(giss.info$id[6848],dset=1)))  # Haparanda
station2=anom(g( read.giss(giss.info$id[6852],dset=1)))  # Karuesuando
ts.union(station1,station2,noaa[,329])
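
The anom() function comes from the CA utilities. A guess at its effect (the internals and base period here are assumptions) is to subtract each calendar month's climatological mean:

# A guess at what the anom() utility does (base period assumed):
# subtract each calendar month's mean from a monthly ts.
anom.sketch = function(x, base=c(1961,1990)) {
  w = window(x, start=c(base[1],1), end=c(base[2],12))
  clim = tapply(w, cycle(w), mean, na.rm=TRUE)   # 12 monthly means
  x - clim[cycle(x)]
}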

The October 2008 anomalies for Haparanda and Karuesuando were garden variety – nothing that yielded the NOAA anomaly:

             Hap.   Kar.   NOAA
Aug 2008    -1.04  -0.97   0.64
Sep 2008    -0.42  -0.67  -2.49
Oct 2008     0.07   1.01   7.42

A little puzzled by this, I printed out the NOAA history for gridcell 329; excerpts are shown below:
1971 2.57 -0.78 -4.85 -1.49 -0.85 0.15 -0.94 NA NA NA NA NA
1972 2.92 1.10 1.85 0.36 -0.75 2.37 3.21 NA NA NA NA NA
1973 7.47 0.90 2.30 -0.12 -0.62 1.22 3.96 NA NA NA NA NA
1974 5.72 5.45 2.55 1.83 0.10 1.60 0.46 NA NA NA NA NA
1975 3.19 6.40 4.90 0.31 1.15 -2.20 -2.02 NA NA NA NA NA

….
1999 -1.76 -0.68 1.92 2.06 -1.10 1.82 0.31 -0.01 0.41 8.62 7.09 -9.53
2000 4.14 3.32 2.12 1.06 1.30 -0.33 1.31 2.09 -1.14 11.12 7.14 -5.53
2001 6.54 -1.03 -2.78 0.56 0.10 1.92 0.36 2.39 0.46 7.67 2.29 -6.63
2002 0.89 4.22 1.42 3.41 2.85 2.97 1.76 4.59 -2.09 3.87 -2.16 -9.78
2003 -4.51 5.77 4.72 0.91 1.65 0.02 4.01 2.59 -1.84 6.87 4.19 -5.78
2004 1.79 0.72 2.42 2.21 1.25 -0.93 1.61 2.64 -0.49 7.17 0.99 -2.33
2005 6.54 2.62 -0.73 2.36 -0.80 0.92 2.21 3.24 -0.99 9.22 6.59 -6.68
2006 4.24 0.32 -3.33 2.81 1.90 0.67 0.41 5.19 -0.79 5.57 4.19 -0.58
2007 0.94 -3.58 5.18 2.11 0.45 0.72 0.51 2.84 -2.24 9.87 3.14 0.57
2008 6.14 4.97 0.32 0.81 0.15 -0.63 -0.04 0.64 -2.49 7.42 NA NA

So the October 2008 anomaly was very high, to say the least, but the NOAA anomaly is always high in October; it was 11.12 deg C in 2000. WTF is going on? And what happened prior to 1991? NOAA has gridded values from Jan to July, but not from August to December, during the entire period from 1880 to 1991.

Here’s something odd. I calculated my own anomaly on the GISS/GHCN dset1 versions of Haparanda and Karasuando and took a simple average. I compared this to the NOAA gridded version. For every October from 1992 to 2008, the NOAA series was exactly 6.88 deg C higher than the average and for every December it was 7.53 deg C lower. Over the 5 months with oddball data, the NOAA gridcell averaged 0.2 deg C higher than the average of the two stations.

2002 -0.23 -0.24 -0.19 -0.11 -0.06 0.00 0.00 1.64 -1.94 6.88 1.96 -7.52
2003 -0.23 -0.24 -0.19 -0.11 -0.06 0.00 0.00 1.65 -1.95 6.88 1.97 -7.53
2004 -0.23 -0.24 -0.19 -0.11 -0.05 0.00 0.00 1.65 -1.95 6.88 1.97 -7.53
2005 -0.24 -0.24 -0.19 -0.11 -0.06 0.00 0.00 1.65 -1.94 6.88 1.96 -7.52
2006 -0.23 -0.24 -0.19 -0.11 -0.06 0.00 -0.01 1.65 -1.95 6.88 1.97 -7.53
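
A direct check of these offsets, assuming the NOAA column is coerced to a monthly ts starting January 1880 (per the read script above), with station1 and station2 as in the earlier snippet:

# confirm the fixed monthly offsets: NOAA gridcell minus two-station average
noaa329 = ts(noaa[,329], start=c(1880,1), freq=12)
delta = noaa329 - (station1 + station2)/2
window(delta, start=c(2002,1), end=c(2006,12))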

The comparison also tells us something else – because the match was so precise even if weird, it shows that the NOAA gridcell was calculated from the Haparanda and Karasuando stations. My surmise is that something is wrong with how they calculated their normals, but for sure, there’s something wrong.

Update – 10 pm: see comment below comparing NOAA plot and emulated NOAA plot. It looks like the version of the data used for the NOAA October graphic does not contain the error observed here. It appears that the archived data (See link in comment below) differs from the data used in the NOAA October graphic and that I stumbled upon one of the discrepant gridcells. I’ve emulated most parts of the NOAA graphic quite accurately with the northern Finland gridcell sticking out quite dramatically. There are a couple of other interesting discrepancies, which readers can note for themselves.

A Collation Utility for GISS dset1 and dset2

GISS has been providing a considerable amount of intermediate information on their results. Unfortunately, it’s been provided in binary format that is presumably suited for people who speak Fortran with a Unix accent. I presume that such people converse with one another in medieval Latin. It’s not very handy for people who use modern languages.

Nicholas sent me some binary scripts a while ago, which I’ve modified to make organized dset1 and dset2 data sets. The form of organization is: the object dset1 is a list of 7364 items, one for each row of the GISS station data table, with each item “named” by its 11 digit id number. For dset1 and dset2, there can be 0, 1, 2 or more versions for each station. I’ve made a list of the time series, so that there are 0, 1, 2, … time series for each of the 7364 items. I’ve checked that these reconcile to the webpage results. Each file is about 7-8 MB, about half the combined size of the 6(!) binary files that GISS uses to store the data.

An annoying feature of Hansen’s intermediate files is that the ID numbers are related to, but different from, the ID numbers in the GISS information file. I’ve done a concordance and checked it, but really, you’d think that they could manage to keep one set of ID numbers. Even if you prefer another language to R (which you shouldn’t 🙂 ), it’s probably easier to port this sort of organized information out of R into another language. This program supersedes my previous scraping program. The program requires giss.info.tab for names.

source("http://data.climateaudit.org/scripts/station/giss/collate.dset.nicholas.txt")
download.file("http://data.climateaudit.org/data/giss/giss.info.tab","temp.dat",mode="wb");load("temp.dat")
dset1=collate.giss(m=1,month=10,year=2008)
#save(dset1,file="d:/climate/data/station/giss/200810/dset1.tab")
#dset2=collate.giss(m=2,month=10,year=2008)

I use the giss.info.tab table to keep track of stations. To plot (or recover) Verhojansk data, here’s a simple command in which the letters “VERHO” are looked up, the id recovered and the series located.

# look up "VERHO" by name, recover the id, plot the first dset2 series
plot.ts(dset2[[paste( giss.info$id[grep("VERHO",giss.info$site)])]][[1]])