Code and realcode

Mann et al (PNAS 2008) states:

All data used in this study and Matlab source codes for implementing the algorithms described, as well as additional supporting information are available at http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07 .

They did make a much better effort than others in the field, but unfortunately the reality falls short of the PNAS representation. They have already coopered up some defects in the original documentation in response to observations made at Climate Audit, e.g. by adding “original” data to the archive in early September (notably failing to acknowledge us, presumably following the example set by industry leader James Hansen).

One of the sections of code that was added post-publication was the directory http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/code/codeprepdata/infillproxydata/ , which contained code for the infilling of proxy data. There’s something interesting about the code in this directory.

First take a look at the programs in the directory http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/code/coderecon/ supposedly containing code to do the EIV reconstruction. This contains 14 programs:

center.m gcvfctn.m gcvridge.m iridge.m mridge.m nancov.m nanmean.m nanstd.m nansum.m newregem.m peigs.m pttls.m regrechigh.m regreclow.m

Now return to the directory http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/code/codeprepdata/infillproxydata/ which supposedly contains code to do proxy infilling. It contains 24 programs, of which 14 have names identical to names in the recon code directory. In a quick spot check, it appears that it is not just the names of the programs that are identical, but the programs themselves, even down to the file names of the output.
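
For anyone who wants to check this for themselves, here is a minimal R sketch for comparing the shared files byte-for-byte. It assumes you have downloaded local copies of the two directories; the directory names below are placeholders, not anything from the archive:

library(tools)
shared = c("center.m","gcvfctn.m","gcvridge.m","iridge.m","mridge.m","nancov.m",
  "nanmean.m","nanstd.m","nansum.m","newregem.m","peigs.m","pttls.m",
  "regrechigh.m","regreclow.m")
recon = md5sum(file.path("coderecon", shared))          # hashes of the recon copies
infill = md5sum(file.path("infillproxydata", shared))   # hashes of the infill copies
data.frame(file = shared, identical = unname(recon == infill))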

I originally wrote that the recon code directory lacks a readme and lacks any programs comparable to the 3 programs listed below, which actually carry the water in the infillproxydata operations. That was partly my error: as observed by a reader below, the recon code directory does contain a readme, which states the following:

This directory contains all the codes used for reconstuction. Users need to run regrechigh.m for the reconstruction of high-frequency band, and run regreclow.m for the reconstruction of low-frequency band. The other codes are subroutines that have brief description of the function of the codes.

The infillproxydata code directory includes a Readme, which states that the following 3 programs need to be run in order to execute the operations:

1) trimdata1227nolu.m 2) masterinfillproxy.m 3) backsiginfilled1227nolu.m

Given that the programs regrechigh.m and regreclow.m in the recon code directory are identical to the corresponding programs in the infillproxydata directory, it seems odd that the same programs can be used in an identical way in both places, and that covering programs are needed in the one case but not the other. Thoughts on this are welcomed.

Proxy Counts: Today’s Crossword Puzzle
Table S1 shows the counts of proxies starting before specified dates; there are 12 such tables. I’ve excerpted the tables at the bottom of the first page of the SI showing the counts of proxies (GLB): the 3 columns on the left show “annually” resolved proxies, the 3 columns on the right “all” proxies. The classification into “annual” is not provided anywhere and appears to have some traps for the unwary.

I tried to replicate the counts shown in the 3 columns to the left (code shown in a comment below). My counting method replicated the top line counts of the 3 right-hand columns.

The “dendro” count (1032) equals the total of code -9000 and code -7500 proxies. In passing, this means that there is another Mannian trap for the unwary: some dendro series are classified as non-dendro – the ones with code 3000. These are two of Cook’s Tasmanian dendro series, the Jacoby-D’Arrigo Mongolian series and the Taymyr series, all of them long series (extending through the MWP). One feels that this detracts somewhat from whatever point was being made in the “non-dendro” reconstructions. The Luterbacher series (instrumental in their later portion) are included among the “other” proxies, but are really sui generis.

However, as shown below, I was unable to replicate the counts for any earlier period. I place this before you as today’s crossword puzzle. Given that observed counts exceed Mann counts in most periods, the discrepancy is a real braintwister.

Period      Dendro  Other  Total
1800-1899     1032    177   1209
1700-1799      926    154   1080
1600-1699      553    145    698
1500-1599      317    139    456
1400-1499      197     58    255
1300-1399      117     52    169
1200-1299       90     47    137
1100-1199       73     45    118
1000-1099       58     37     95
900-999         43     34     77
800-899         31     31     62
700-799         27     30     57
600-699         24     29     53
500-599         19     28     47
400-499         17     25     42
300-399         11     25     36
200-299         10     24     34
100-199         10     24     34
0-99             9     23     32

58 Comments

  1. Soronel Haetir
    Posted Oct 4, 2008 at 8:42 AM | Permalink

    I don’t know what limitations matlab has but in my experience duplicating code files is just begging for trouble.

    Though duplicates/conflicts/overlaps don’t seem to faze the team in other contexts so why would it here?

    • RomanM
      Posted Oct 4, 2008 at 12:11 PM | Permalink

      Re: Soronel Haetir (#1),

      Matlab has a well-defined order for the treatment of functions with identical names. From the Matlab website documentation:

      The MATLAB® software uses a search path to find M-files and other files related to MATLAB, which are organized in directories on your file system…

      When there are two files with the same name, one in the current directory and the other in a directory on the search path, MATLAB runs the one in the current directory.

      When there are two or more files with the same name, each located in a different directory on the search path, MATLAB runs the file located in the directory closest to the top of the search path, and MATLAB considers the other file to be shadowed. For this reason, the order of the directories on the search path is relevant.

      It seems to me to be a risky business to have multiple copies of functions floating around. Errors found and fixed in one version would still exist in the others and may resurface unexpectedly. However, one would expect that someone as careful as Prof. Mann would not be likely to make such obvious errors.

      By the way, I noticed that the pea seems to be moving again to a new shell. There seem to have been further corrections to the data and the results posted on the supplement website at http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/corrections/ . Look at the file A_README.txt.

      • John A
        Posted Oct 4, 2008 at 5:40 PM | Permalink

        Re: RomanM (#6),

        It seems to me to be a risky business to have multiple copies of functions floating around. Errors found and fixed in one version would still exist in the others and may resurface unexpectedly. However, one would expect that someone as careful as Prof. Mann would not be likely to make such obvious errors.

        You’re obviously new in these parts. Welcome to Climate Audit.

  2. TerryBixler
    Posted Oct 4, 2008 at 9:54 AM | Permalink

    The routines show no version numbers, no edit dates or reasons for edits, and no authorship. More trouble.

  3. Richard deSousa
    Posted Oct 4, 2008 at 11:10 AM | Permalink

    How about climate and RealClimate… 😉

  4. Steve McIntyre
    Posted Oct 4, 2008 at 11:31 AM | Permalink

    Mann sets traps everywhere for the unwary.

    I had some trouble replicating SI Table S1 showing proxy counts by time (which I will return to). Frustrated by this, I re-calculated the starting date of the 1209 series and compared to the SI. All but a few matched. However, the SI start date and realstart dates were different for 9 series – 8 New Mexico series (one of which, nm573, was already on our horizon). Below I show the start date and realstart dates. The SI is fine for the first 785 values; then the 785th value is repeated and the next 8 values are displaced before getting back in order at 794. How the hell does Mann make these sorts of goofs? We saw the same thing with MBH98 (and uncorrected in Mann et al 2007) with the precipitation series, where the locations got displaced one row – thus the rain in Maine.

    id start realstart
    782 nm566 1590 1590
    783 nm567 1530 1530
    784 nm568 1670 1670
    785 nm570 1554 1554
    786 nm572 1554 -136
    787 nm573 -136 1610
    788 nm574 1610 1613
    789 nm575 1613 1712
    790 nm576 1712 1626
    791 nm577 1626 1595
    792 nm578 1595 1635
    793 nm579 1635 1627
    794 norw008 1599 1599

    Also
    id start realstart
    648 moy_2002_ageredcolor 0 -48
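
    The pattern is easy to check (a quick sketch using the collation above, where start is the SI value and realstart the recomputed one): the SI entries in rows 786-793 are simply the recomputed values for rows 785-792 shifted down one row, with the recomputed value for row 793 (1627) squeezed out entirely.

    # rows 786-793 of the SI start column equal rows 785-792 of the recomputed column
    all(details$start[786:793] == details$realstart[785:792])  # expect TRUE
    details$realstart[793]  # 1627 - the value that never makes it into the SI column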

    These calculations are so simple, I don’t understand how the errors are made. Here’s all that’s required to calculate the start dates (given a collation of the data into mann.tab).

    h=function(X) min(X[,1])
    details$realstart=sapply(mann,h)

  5. Steve McIntyre
    Posted Oct 4, 2008 at 12:02 PM | Permalink

    Here is code to produce the table here (it should be turnkey, though I haven’t cleared my workspace to double-check):

    download.file("http://data.climateaudit.org/data/mann.2008/details.tab","temp.dat",mode="wb"); load("temp.dat")
    download.file("http://data.climateaudit.org/data/mann.2008/mann.tab","temp.dat",mode="wb"); load("temp.dat")

    countf=function(temp) {
      count=array(NA,dim=c(19,3)); count=data.frame(count); names(count)=c("dendro","other","total")
      row.names(count)=paste(seq(0,1800,100),seq(99,1899,100),sep="-")
      start0=seq(0,1800,100)
      for (i in 1:19) {
        count$dendro[i]=sum(details$start[temp&temp2]<=start0[i])
        count$other[i]=sum(details$start[temp&!temp2]<=start0[i])
        count$total[i]=sum(details$start[temp]<=start0[i])
      }
      countf=count[nrow(count):1,]
      countf
    }
    temp2=(details$code==7500)|(details$code==9000); sum(temp2) #1032
    countf(rep(TRUE,1209))

  6. bender
    Posted Oct 4, 2008 at 12:28 PM | Permalink

    At this point I now know MBH98 was s…. MBHWEDFGRSTYD08 appears to be s… as well. I thank Steve M for revealing this to the world. The bottom line is that we most certainly cannot conclude that modern temperatures are “unprecedented” since the MWP, i.e. this is a loathsome fiction. The “MCA” hypothesis is a similar fiction. So can we have the “warmies” now just admit this, so that we can all “move on”? To the GCMs? Which are the only thing that matter. This message is for Ray Ladbury, Hank Roberts, Barton Paul Levenson and all my friends at RC, but especially Gavin Schmidt and Michael Mann. Can we PLEASE move on to the GCMs? PLEASE?

  7. Pat Frank
    Posted Oct 4, 2008 at 12:46 PM | Permalink

    #4 — “Mann sets traps everywhere for the unwary.” Consistent with pure innocent unintentional carelessness?

    Steve: I was intending to be ironic here. I don’t think that Mann intentionally screwed up start dates for proxies 785-792. I find it hard to imagine how you’d go about making this sort of error, though.

  8. Pat Frank
    Posted Oct 4, 2008 at 12:48 PM | Permalink

    #7 — bender, the global average surface (not satellite) temperature record is of the same malodorous quality.

  9. Steve McIntyre
    Posted Oct 4, 2008 at 1:07 PM | Permalink

    #6. Well, they seem to be paying a little attention to Climate Audit, although they refuse to provide “clear referencing and acknowledgment” to Climate Audit, which has demonstrable priority in identification of these various problems. Their obstinate refusal to cite Climate Audit – Hansen’s policy – is leading them to walk a pretty fine line in respect to the following:

    Plagiarism involves the use of another person’s work without full and clear referencing and acknowledgement.

    If a student were to submit an essay observing that Mann had made this particular goof without citing Climate Audit, would he be incurring potential liability? Do the same standards apply to webpage alterations? Just asking.

  10. Steve McIntyre
    Posted Oct 4, 2008 at 1:12 PM | Permalink

    A new contest : see comment #4. Provide your best theory as to how the start dates are in error for items 785-792 but not elsewhere. I am at a total loss for an explanation.

  11. Soronel Haetir
    Posted Oct 4, 2008 at 1:27 PM | Permalink

    #14:

    A transcription error, like losing your place on a standardized test form and then getting back on track without bothering to go back and fix the sequence in error.

  12. Alan S. Blue
    Posted Oct 4, 2008 at 1:28 PM | Permalink

    I’d just say it was an off-by-one transcription error. The kind that is trivial to make whenever someone decides that it is easier to retype something instead of doing it programmatically or copying the list electronically. They aren’t calculating the start date – they’re manually tabulating it.

    How this could: 1) pass the typist’s double-check, 2) pass the user’s scrutiny of the incoming information, 3) pass the check when generating tables for publication, 4) pass internal review as a last check before submitting to any journal, 5) pass peer review, and 6) pass the ground-up reexamination at the galley proof stage is the boggling part.

    And the only defense I’ve seen so far is “It’s Climate Science, you wouldn’t understand.”

  13. Posted Oct 4, 2008 at 1:29 PM | Permalink

    I sure hope the majority of the PR Challenge folks frequent this blog.

  14. Steve McIntyre
    Posted Oct 4, 2008 at 1:39 PM | Permalink

    C’mon – do you think that they manually tabulated the start dates? It takes two lines of code. Actually one line. Assuming that they organized their data in a fashion that even an “amateur” can do.

    sapply(mann, function(X) min(X[,1]) )

  15. Posted Oct 4, 2008 at 2:24 PM | Permalink

    I noticed this problem when I first started loading the data in C++. I tried to use the start and end dates to set the file length for each proxy during load. It didn’t take long to notice values weren’t loading correctly.

  16. Posted Oct 4, 2008 at 3:00 PM | Permalink

    Regarding #17,

    What you say makes sense if you start with the data you are going to use, compile it, make it neat, and then write the software. But if you are continually adding and deleting series even up to the last minute, SD1 would have been compiled at the beginning and then require regular editing.

    It’s a stretch but then this kind of error makes more sense.

    At least now you have real grid locations for Schweingruber.

  17. Steve McIntyre
    Posted Oct 4, 2008 at 4:14 PM | Permalink

    #5. Take a look at the PDFs in the corrections file. The changes arising from re-locating the rain in Spain are plain to see (and larger than I would have expected).

  18. Peter W
    Posted Oct 4, 2008 at 9:34 PM | Permalink

    I downloaded the m scripts mentioned in the blog entry and did the following check in a bash shell (linux) from the coderecon directory.

    for f in *.m;do echo $f;echo;diff $f ../codeprepdata/infillproxydata;echo;echo;done

    The result was (except for compression of blank lines by the blog formatter):
    ——————————————————-
    center.m

    gcvfctn.m

    gcvridge.m

    iridge.m

    mridge.m

    nancov.m

    nanmean.m

    nanstd.m

    nansum.m

    newregem.m

    peigs.m

    pttls.m

    regrechigh.m

    1,2c1
    < % this file is for running regem with stepwise precedure/reconstruction of
    % this file for doing regem with stepwise precedure

    regreclow.m

    1,2c1
    < % this file is for running regem with stepwise precedure/reconstruction of
    % this file for doing regem with stepwise precedure

    —————————————————————

    That is, the files are identical except for the comments in regrechigh.m and regreclow.m.

  19. Thor
    Posted Oct 5, 2008 at 4:49 AM | Permalink

    There is a Readme.txt in the coderecon directory stating the following:

    This directory contains all the codes used for reconstuction.

    Users need to run regrechigh.m for the reconstruction of high-frequency band, and run regreclow.m for the reconstruction of low-frequency band. The other codes are subroutines that have brief description of the function of the codes.

    Steve: Quite so. My bad. I copied the files manually and omitted this one. I’ve noted your observation in a revision to the above post. The underlying puzzle remains, however: how can regrechigh.m and regreclow.m be identical in the two different cases, right up to calls and writes?

  20. Patrick M.
    Posted Oct 5, 2008 at 6:47 AM | Permalink

    Look at the date column in nm572.

    -136

  21. Patrick M.
    Posted Oct 5, 2008 at 6:48 AM | Permalink

    Here are the files that I found with what look like invalid dates (filename, date):

    arge092.ppd -342.0
    burns_2003_socotrad13c.ppd -2754.0
    burns_2003_socotrad18o.ppd -2754.0
    burns_nicoya_d13c.ppd -3862.0
    burns_nicoya_d18o.ppd -3862.0
    ca051.ppd -42.0
    ca534.ppd 0.0
    ca535.ppd -6000.0
    ca630.ppd -420.0
    cronin_2003_mgca.ppd -183.0
    curtis_1996_d13c.ppd -1597.0
    curtis_1996_d13cpyro.ppd -1597.0
    curtis_1996_d18o.ppd -1597.0
    curtis_1996_d18opyro.ppd -1597.0
    dongge.ppd -6.0
    hodell_2001_d18o.ppd -6239.0
    hodell_2001_yucatancaco3.ppd -8855.0
    lee_thorpe_2001_c13.ppd -4453.0
    lee_thorpe_2001_o18.ppd -4453.0
    moy_2002_ageredcolor.ppd -48.0
    nm572.ppd -136.0
    nv515.ppd -2370.0
    nv516.ppd 0.0
    recjj_yy1.ppd -407.0
    tan_2003_recontemp.ppd -665.0
    tasmania_recon_orig.ppd -1600.0
    tiljander_2003_darksum.ppd -1000.0
    tiljander_2003_lightsum.ppd -1000.0
    tiljander_2003_thicknessmm.ppd -1000.0
    tiljander_2003_xraydenseave.ppd -1000.0
    ut509.ppd 0.0
    yu_1999_mgcaratio.ppd -146.0

  22. Steve McIntyre
    Posted Oct 5, 2008 at 7:08 AM | Permalink

    #26. No, there are many problems, but these aren’t among them. Some data sets start before 0 CE; the above dates are not problematic on this count.

  23. steven mosher
    Posted Oct 5, 2008 at 7:19 AM | Permalink

    Maybe the problem is here, in doindexinf.m:

    %%%%%%%%%%%%%%
    %%%% This section finds the starting and stopping year for each column. It assumes that
    %%%% there are no missing values between start and stop, otherwise filtering chokes.

    for i=1:nrows
    for j=1:ncols
    if ~isnan(stdkeepers(i,j)) & gotstart(j)~=1
    start(j)=i;
    gotstart(j)=1;
    elseif isnan(stdkeepers(i,j)) & gotstart(j)==1 & gotstop(j)~=1
    stop(j)=i-1;
    gotstop(j)=1;
    end
    end
    end
    %%%%%%%%%%%%%%
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
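
    For non-Matlab readers, here is a rough R translation of that start/stop logic (a sketch of the logic only, not Mann’s code). It makes the point explicit: the stop is set at the row before the first missing value encountered after the start, so an interior gap truncates the series.

    # start = first non-missing row; stop = row before the first missing value after it
    find_start_stop = function(x) {
      ok = !is.na(x)
      start = which(ok)[1]
      gap = which(!ok & seq_along(x) > start)[1]
      stop = if (is.na(gap)) length(x) else gap - 1   # no interior gap: run to the end
      c(start = start, stop = stop)
    }
    find_start_stop(c(NA, 1, 2, NA, 4, 5))  # start 2, stop 3: the interior NA ends the series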

  24. Patrick M.
    Posted Oct 5, 2008 at 7:25 AM | Permalink

    The error exists in 1209proxynames.xls also. It has the start date as 1554 instead of -136 for nm572.

  25. Wolfgang Flamme
    Posted Oct 5, 2008 at 8:15 AM | Permalink

    #28

    Don’t know Matlab, but at first glance it looks like the first missing value in a series will be treated as the end of the series?

  26. RomanM
    Posted Oct 5, 2008 at 9:28 AM | Permalink

    In Thor (#24), Steve:

    I copied the files manually and omitted this one.

    The files are also available (presumably without the latest “updates”) in zip format at:

    ftp://ftp.ncdc.noaa.gov/pub/data/paleo/contributions_by_author/mann2008 .

    It seems to be an easier place for people still trying to gather the supplemental information to download the individual files.

    Steve: I’ve referred to that. I double-checked against that to see if it might have been added after the fact, but it’s in the WDCP zip file, so it’s my copy error.

  27. GTFrank
    Posted Oct 5, 2008 at 1:53 PM | Permalink

    slightly ot – snip at will
    A question for all you folks in the Academic world. Do these types of errors speak more about:
    1.the quality of the graduate students that actually create, track, modify, and organize the code and data, or
    2. the quality of the educational process at Penn State, and the institutions of the coauthors, or
    3. the abilities of Mann et al to properly manage all of these various tasks?

    Is this typical in other areas of academic research?

    I find it hard to imagine that all of this is intentional. It seems they have no one in the entire group that is detail oriented.

    I also find it hard to believe the president of Penn State doesn’t step in and get it organized.

  28. Steve McIntyre
    Posted Oct 5, 2008 at 2:33 PM | Permalink

    Jean S has got the program gridproxy.m in the CPS code section to work. This program selects proxy rosters according to several criteria. The counts from this program seem to tie in to the Table S1 counts (other than the AD1800 counts). But somewhere along the line, this program is excluding series that seem to qualify. All the excluded series are dendro series, but I can’t figure out the pattern or why the program is excluding them.

    I’d appreciate programmers seeing if they can figure it out.

  29. Steve McIntyre
    Posted Oct 5, 2008 at 2:50 PM | Permalink

    Here are some examples. The following are AD900 network NH dendro series, included in the SI screened 484, that do not get included in the gridproxy.m network (and not in the Table S1 count).

    id si rtable rtable_lf
    152 az510 0.2678 0.268 0.505
    232 ca528 0.1282 0.128 0.445
    233 ca529 0.1485 0.149 0.368
    237 ca533 0.1578 0.158 0.346
    271 ca630 0.3048 0.305 0.243
    654 mt110 0.2794 0.279 0.432
    796 norw010 0.2552 0.255 0.129
    817 nv513 0.2281 0.228 0.668
    873 or060 0.1282 0.128 0.233

    The following dendro series are included in the count:

    id si rtable rtable_lf
    816 nv512 0.2916 0.292 0.564
    1070 tornetrask 0.3358 0.336 0.660

    ca630 has a higher rtable correlation than nv512, so why is it left out? Jean S got these results from running gridproxy.m, but so far I can’t figure out where ca630 gets bumped.

  30. steven mosher
    Posted Oct 5, 2008 at 5:00 PM | Permalink

    Re #32, some snippets from code comments. I started by looking at the grunt-work code, basically the data prep code: the most likely place for Mann to assign work to someone else.

    Now, in the software profession, it’s normal practice to put in some information about the author of the code, the date of the initial write, and updates, and, if you’re a real professional, to actually create code that is self-documenting (call trees, data dictionary).
    http://en.wikipedia.org/wiki/Software_documentation
    ROBODoc unfortunately doesn’t support Matlab.
    For Matlab there are some tools:
    Universal Report (only on Windows, I think)
    There is also cell mode:


    http://www.artefact.tk/software/matlab/m2html/
    http://www.mathworks.com/company/newsletters/news_notes/dec04/cellmode.html

    Any matlab guys out there use cell mode?

    Anyway, back to the issue of who wrote the code:

    As to the authorship of SOME of the code, there is only one solid clue:

    % program to do proxy gridbox CPS against instrumental data
    % Created December 2006 by skm (from earlier work)
    % Last modified October 2007 by skm

    And from another piece of code:

    %%% Last modified October 2007 by skm

    Not that there is anything wrong with this, but every file should at least have documentation
    on who wrote it and when.

    And for R folks (in case you didn’t know):

    For R : http://www.stat.umn.edu/~charlie/Sweave/

    and reproducible research, a movement CA would approve of.

    http://www.stat.umn.edu/~charlie/Sweave/#reproduce
    http://www.reproducibleresearch.org/

    Click to access hpc-considered-harmful-2008.pdf

  31. steven mosher
    Posted Oct 5, 2008 at 5:06 PM | Permalink

    Trident: A workflow system for doing data-intensive science with reproducible results

    Ok, no more.

  32. Posted Oct 5, 2008 at 6:22 PM | Permalink

    Has anyone noticed that every single time someone outside the closed community of IPCC-sponsored climate scientists has done some serious analysis of the science supporting AGW, there have been issues discovered that are material to the outcome?

    Is there any other field in which the same situation exists/existed?

  33. Steve McIntyre
    Posted Oct 5, 2008 at 6:37 PM | Permalink

    #34. Jean S has figured out the problem. In the SI table, there are several different columns showing start dates. Table S1 somehow uses the column denoted “Start date (for reconstruction)” rather than “Start year”. This in turn corresponds (for ITRDB series) to the first year where Mann-data shows 8 samples (a calculation which I haven’t verified, but which is probably not at issue). Using this, I can replicate Table S1 either exactly or within 1 count (this would be a rounding difference in a single example). Here’s a script to produce Table S1 – some of the tools here carry over into other calculations. This is located at http://data.climateaudit.org/scripts/mann.2008/table_S1.txt.

    I apologize if my script doesn’t meet programming protocols either. (To some extent, I think that the structure is easier by avoiding all the pointless subscripting in Mann code, where they continue to write Fortran code in modern languages.)

    #TABLE S1
    #point to watch: for ITRDB dendro series, the column start.rec is used: this is first year with 8 samples. (Doesn't apply to end)

    #ITRDB matrix
    download.file("http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/itrdbmatrix","temp.dat")
    test=scan("temp.dat",n=-1) #2426050
    test=t( array(test,dim=c(1210,length(test)/1210) )); dim(test) #2005 1210

    proxy_info=test[1:3,]
    test=test[4:nrow(test),] # from -5 to 1996
    range(test[,1])
    test=ts(test[,2:ncol(test)],start=test[1,1])
    proxy=test; rm(test)
    dim(proxy) #[1] 2002 1209
    tsp(proxy) # -5 1996

    proxy_info=t(proxy_info)
    proxy_info=proxy_info[2:nrow(proxy_info),]
    names(proxy_info)=c("long","lat","code")

    ##rtable
    rtable=scan("http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/rtable1209")
    rtable=array(rtable,dim=c(1209,7))

    ##LOAD DETAILS
    download.file("http://data.climateaudit.org/data/mann.2008/details.tab","temp.dat",mode="wb"); load("temp.dat")
    #coral proxies (7000,7001) may need to be inverted
    coral=(details$code==7000)|(details$code==7001); sum(coral) #15

    ###MAKE TABLE S1: NUMBER OF PROXIES IN NETWORKS
    dendro=(details$code==9000)|(details$code==7500); sum(dendro) #1032
    annual=!(substr(as.character(details$code),4,4)=="1"); sum(annual) #1158  #this digit controls the definition of decadal
    NH=(details$lat>0); sum(NH) #1036
    passing=!is.na(details$r1850_1995); sum(passing) #484

    countf=function(temp) {
      count=array(NA,dim=c(19,3)); count=data.frame(count); names(count)=c("dendro","other","total")
      row.names(count)=paste(seq(0,1800,100),seq(99,1899,100),sep="-")
      start0=seq(0,1800,100)
      for (i in 1:19) {
        count$dendro[i]=sum(details$start.rec[temp&dendro]<=start0[i])
        count$other[i]=sum(details$start.rec[temp&!dendro]<=start0[i])
        count$total[i]=sum(details$start.rec[temp]<=start0[i])
      }
      countf=count[nrow(count):1,]
      countf
    }

    #set up list. Named per Table S1: 2 columns across, then by geog, then by screened or not
    name1=c(outer(c("annual","all"),c("NH","SH","GLB"), function(x,y) paste(x,y,sep="_")))
    name1=c(outer(name1,c("full","screen"), function(x,y) paste(x,y,sep="_")))
    name1
    # [1] "annual_NH_full" "all_NH_full" "annual_SH_full" "all_SH_full" "annual_GLB_full" "all_GLB_full" "annual_NH_screen"
    # [8] "all_NH_screen" "annual_SH_screen" "all_SH_screen" "annual_GLB_screen" "all_GLB_screen"
    count=rep(list(NA),12)
    names(count)=name1

    count[[1]]=countf(annual&NH)
    count[[2]]=countf(NH)
    count[[3]]=countf(annual&!NH)
    count[[4]]=countf(!NH)
    count[[5]]=countf(annual)
    count[[6]]=countf(rep(TRUE,1209))
    count[[7]]=countf(annual&NH&passing)
    count[[8]]=countf(NH&passing)
    count[[9]]=countf(annual&!NH&passing)
    count[[10]]=countf(!NH&passing)
    count[[11]]=countf(annual&passing)
    count[[12]]=countf(passing)

    count
    #this is a close reproduction of Table S1

  34. Steve McIntyre
    Posted Oct 5, 2008 at 8:43 PM | Permalink

    Does anyone know why Mann uses Butterworth filters? Is there any material advantage to using them relative to Gaussian weights? I’m using the BW filter in the R package mFilter and the filter is quite slow, especially when one does them over and over incessantly (and needlessly) as Mann does. I’m going to do a filter for the entire data set once and then use that.

    Update: I abandoned the version in mFilter and used the version in package “signal”, which is a zillion times faster and appears just fine for the purpose.
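
    For reference, a minimal sketch of that route with the “signal” package (the filter order and cutoff below are purely illustrative, not the parameters used in Mann et al):

    library(signal)
    bf = butter(4, 0.1, type = "low")              # 4th-order low-pass Butterworth, cutoff at 0.1 of Nyquist
    x = rnorm(500) + sin(2 * pi * (1:500) / 100)   # toy noisy series
    smoothed = filtfilt(bf, x)                     # forward-backward pass, so zero phase shift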

    • Mark T
      Posted Oct 6, 2008 at 12:15 AM | Permalink

      Re: Steve McIntyre (#39),

      Does anyone know why Mann uses Butterworth filters? Is there any material advantage to using them relative to Gaussian weights?

      A Butterworth filter is IIR whereas the Gaussian weighted filter is likely FIR. IIR filters primarily give you higher rejection (and steeper transition bands) for fewer taps. They also typically have non-linear phase, and of course, infinite impulse response. He likely uses these because they are maximally flat in the passband, though it could also be because the function butter() is listed first in the IIR generation methods when you type help/signal and then click on “Digital Filters” in MATLAB. Ya never know.
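
      For comparison, a Gaussian-weighted smoother is an FIR filter and, with symmetric weights, has no phase shift at all. A sketch in R (window length and sd are illustrative only):

      w = dnorm(-10:10, sd = 3); w = w / sum(w)        # normalized Gaussian weights
      x = rnorm(500) + sin(2 * pi * (1:500) / 100)     # toy noisy series
      smoothed_fir = stats::filter(x, w, sides = 2)    # symmetric FIR convolution: zero phase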

      Mark

  35. Smokey
    Posted Oct 5, 2008 at 9:26 PM | Permalink

    Mann refuses to honestly cooperate with the general scientific community. This article from the Sept. 20th issue of the Economist may show a way around Mann’s obfuscation:

    Web 2.0 tools are beginning to change the shape of scientific debate

    In pre-internet times, peer-reviewed journals were the best way to disseminate research to a broad audience. Even today, editors and reviewers cherry-pick papers deemed the most revelatory and dispatch them to interested subscribers worldwide. The process is cumbersome and expensive, but it has allowed experts to keep track of the most prominent developments in their respective fields.

    Peer-review possesses other merits, the foremost being the ability to filter out dross. But alacrity is not its strong suit. With luck a paper will be published several months after being submitted; many languish for over a year because of bans on multiple submissions. This hampers scientific progress, especially in nascent fields where new discoveries abound. When a paper does get published, the easiest way to debate it is to submit another paper, with all the tedium that entails.

    Now change is afoot. Earlier this month Seed Media Group, a firm based in New York, launched the latest version of Research Blogging, a website which acts as a hub for scientists to discuss peer-reviewed science. Such discussions, the internet-era equivalent of the journal club, have hitherto been strewn across the web, making them hard to find, navigate and follow. The new portal provides users with tools to label blog posts about particular pieces of research, which are then aggregated, indexed and made available online.

    Although Web 2.0, with its emphasis on user-generated content, has been derided as a commercial cul-de-sac, it may prove to be a path to speedier scientific advancement. According to Adam Bly, Seed’s founder, internet-aided interdisciplinarity and globalisation, coupled with a generational shift, portend a great revolution. His optimism stems in large part from the fact that the new technologies are no mere newfangled gimmicks, but spring from a desire for timely peer review.

    However, what Dr Bly calls Science 2.0 has drawbacks. Jennifer Rohn, a biologist at University College London and a prolific blogger, says there is a risk that rivals will see how your work unfolds and pip you to the post in being first to publish. Blogging is all well and good for tenured staff but lower down in the academic hierarchy it is still publish or perish, she laments.

    To help avoid such incidents Research Blogging allows users to tag blog posts with metadata, information about the post’s author and history. This enables priority of publication to be established, something else peer-reviewed journals have long touted as their virtue.

    Coming home to roost
    With the technology in place, scientists face a chicken-and-egg conundrum. In order that blogging can become a respected academic medium it needs to be recognised by the upper echelons of the scientific establishment. But leading scientists are unlikely to take it up until it achieves respectability. Efforts are under way to change this. Nature Network, an online science community linked to Nature, a long-established science journal, has announced a competition to encourage blogging among tenured staff. The winner will be whoever gets the most senior faculty member to blog. Their musings will be published in the Open Laboratory, a printed compilation of the best science writing on blogs. As an added incentive, both blogger and persuader will get to visit the Science Foo camp, an annual boffins’ jamboree in Mountain View, California.

    By itself this is unlikely to bring an overhaul of scientific publishing. Dr Bly points to a paradox: the internet was created for and by scientists, yet they have been slow to embrace its more useful features. Nevertheless, serious science-blogging is on the rise. The Seed state of science report, to be published later this autumn, found that 35% of researchers surveyed say they use blogs. This figure may seem underwhelming, but it was almost nought just a few years ago. Once the legion of science bloggers reaches a critical threshold, the poultry problem will look paltry.

    Steve: Please don’t pile on here. Mann has made a decent effort to archive data and code for Mann et al 2008. Sure there are annoying defects in what’s archived, but it’s much better than Lonnie Thompson. So let’s give this a rest for a while. There are plenty of other issues.

  36. Steve McIntyre
    Posted Oct 6, 2008 at 7:38 AM | Permalink

    For these sorts of noisy data sets, I can’t imagine that the differences between two filters can matter very much. Nonetheless, the implementation in the “signal” package proved very fast. I don’t know why the implementation in mFilter bogged down, but I am not going to worry about it.

  37. Mark T.
    Posted Oct 6, 2008 at 8:49 AM | Permalink

    For these sorts of noisy data sets, I can’t imagine that the differences between two filters can matter very much.

    It can matter, but you need to have some valid assumptions regarding the differences between “signal” and “noise.” We all know how that plays out.

    Mark

  38. Mark T.
    Posted Oct 6, 2008 at 8:51 AM | Permalink

    Oh, I should add, you also need to have an understanding of how phase linearity will affect your results.

    Mark

  39. Geoff Sherrington
    Posted Oct 7, 2008 at 2:15 AM | Permalink

    “Feeding the chooks”

    One outstanding feature of Mann’s work is the use of Matlab. Previously I have said that the data was too dirty for the task; the methods for rejection/retention horrendous.

    Matlab is a program born in a university for the purpose of performing mathematical calculations, for university students. It may have “grown up” in the last few years, but a look at the site a week ago said to me that Matlab is Matlab. It is not a high-level program; you can think of its behavior as that of a FORTRAN 77 preprocessor.

    It does have some spatial data ability, but it is trivial when compared to the civilian version of the STK/AGI package, which NASA can use in full military form, along with other advanced packages. The STK/AGI suite of programs is designed for 4D and greater analysis and modeling. I guarantee that Mann et al do not get scenario column data for multiple rocket launch windows from Matlab results. So why use Matlab for modeling columns of atmospheric climate data given the present use of purpose-written modeling programs such as STK/AGI? STK/AGI may be quiet as they get funding from the USA department of defense, some of it tied to the requirements of NASA.

    See
    http://www.agi.com/solutions/specializedAreas/geoIntelApp/

    STK/AGI is short for Satellite Tool Kit / Analytical Graphics Incorporated.

    It seems to me that Mann et al are feeding data to inquirers in less than optimum form. It would be courteous if the data were prepared in modern form to satisfy requests.

    There is a good reason for this, as shown by my country, Australia. Here, decision makers are being stampeded into action by a series of doomsday reports that threaten to cause the early adoption of heroic measures. We need proper data to make decisions. At the moment, we feel like Mann et al are throwing scraps of chicken feed to the masses to watch them scramble.

    The World deserves better. Mann et al can learn from these user groups if they feel inadequate.

    The following is from: http://www.agi.com/userCommunity/userGroups/nasa/
    Welcome to the NASA User Community pages on AGI.com! Here you can access the latest resources targeted specifically for the NASA User Community.
    Download the PowerPoint presentation from the July 22, 2008 NASA STK User Community Webinar. (login required)
    Download the 2008 NASA User Group Meeting PDF.
    Free AGI software training is held year-round at NASA Johnson Space Center.
    Click here for upcoming offerings.
    Special AGI license server at NASA JSC. Learn more.
    Download resources from the 2007 AGI User Exchange in D.C.

    • Jean S
      Posted Oct 7, 2008 at 3:12 AM | Permalink

      Re: Geoff Sherrington (#45),

      Huh?!? Matlab is a technical computing language, used all over in science and engineering. It is well suited for the problem Mann et al are tackling, and the only real alternative for this type of generic scientific numerical research is nowadays offered by R. The problem here is not Matlab, it’s Mann et al’s way of programming with Matlab (or programming style in general).

    • Mark T.
      Posted Oct 7, 2008 at 9:15 AM | Permalink

      Re: Geoff Sherrington (#45),

      I guarantee that Mann et al do not get scenario column data for multiple rocket launch windows from Matlab results. So why use Matlab for modeling columns of atmospheric climate data given the present use of purpose-written modeling programs such as STK/AGI? STK/AGI may be quiet as they get funding from the USA department of defense, some of it tied to the requirements of NASA.

      I have a much simpler propagator than STK, one that is a MATLAB script (albeit very large), and works just as well. STK is sitting idly on the shelf next to my desk. I can modify the MATLAB propagator, not the STK one. 😉 STK provides better visualization, but that’s not the issue with Mann’s code, not even close.

      MATLAB is a) very high level and b) probably the most often used tool within the engineering disciplines for all-purpose computing tasks. MATLAB’s strength is its built-in ability to manipulate multi-dimensional arrays with simple commands. In fact, it is billed as a matrix manipulator for just that reason. It is fast for a high-level language (interpreted) and it contains tools for just about every computational need that one can imagine. Without it, I would not have a job.

      Mark

      • Geoff Sherrington
        Posted Oct 7, 2008 at 4:49 PM | Permalink

        Re: Mark T. (#47),

        So Mark, why does NASA use STK as in the box you quoted?

        Re: Jean S (#46),

        I have seen you solve some problems brilliantly, so I defer. I am left with the uneasy feeling that programs other than Matlab might be used by Mann et al and that some of the mess is caused by retrospectively making older programs match newer ones. You are much more expert than I – would you dismiss this possibility?

        • Geoff Sherrington
          Posted Oct 8, 2008 at 9:55 PM | Permalink

          Re: Geoff Sherrington (#48),

          Here is an example of what I mean. Suppose you want to send up a SV and maybe you want it to land a couple of km from a nominated point. With respect to launch coords, the other point is over the Equator and crosses the Greenwich Mean time line. There is no room to get the lats and longs wrong. You want to calculate the great circle distance on a nominated map projection. Choices limited to 3 here: a) You write your own routine in MATLAB or similar. b) you call up a routine in MATLAB or similar. c) you use STK.

          Question. Given that you will likely need to use Taylor’s series/expansion and to find the squares of differences between small numbers, how confident would you be that a) you are carrying adequate significant figures b) the program has been debugged by experts and c) that your programmer is aware of the special surveying cases arising from crossing Equator and Greenwich?

          Is this not heading towards the “rain in Maine” problem and the present problems Steve has with Butterworth filters?

          I’m not trying to take fertile future exploration areas from Steve’s acute prospecting eyes, but I wonder if there are less visible, higher quality versions of climate data that, if revealed, might allow a concentration of thought on other problems, of which plenty remain.

          Whatever happened to all that sub-sea sonde data collected for submarine signalling and detection in the cold war years and later?

        • Mark T
          Posted Oct 8, 2008 at 10:33 PM | Permalink

          Re: Geoff Sherrington (#48),

          So Mark, why does NASA use STK as in the box you quoted?

          Most likely, if I were to guess, it is because STK is a powerful 3D modeler. It cannot do general purpose signal processing, however, which is what Mann is doing, and that makes it somewhat useless for this task.

          If you don’t care about graphics, and the STK program is big on graphics, there’s no real reason to use it and many reasons to not use it (satellite intel guys I know like other programs better because they are more user friendly).

          Re: Geoff Sherrington (#49),

          Here is an example of what I mean. Suppose you want to send up a SV and maybe you want it to land a couple of km from a nominated point. With respect to launch coords, the other point is over the Equator and crosses the Greenwich Mean time line. There is no room to get the lats and longs wrong. You want to calculate the great circle distance on a nominated map projection. Choices limited to 3 here: a) You write your own routine in MATLAB or similar. b) you call up a routine in MATLAB or similar. c) you use STK.

          What does this have to do with Mann’s PCA algorithm?

          Question. Given that you will likely need to use Taylor’s series/expansion and to find the squares of differences between small numbers, how confident would you be that a) you are carrying adequate significant figures b) the program has been debugged by experts and c) that your programmer is aware of the special surveying cases arising from crossing Equator and Greenwich?

          Again, what does this have to do with the problems being discussed here? He isn’t modeling satellite tracks, he’s just processing proxies to determine if there is a common signal in them. You don’t need complex 3D modeling software to do this, particularly not a propagator such as STK.

          Mark

        • Geoff Sherrington
          Posted Oct 8, 2008 at 11:19 PM | Permalink

          Re: Mark T (#50),

          Thanks for the response, Mark T.

          In the more general sense, Mann et al use algorithms for estimation of great circle distances, they use lats and longs when searching for grid cells and points within, spherical coordinates, etc., etc. There are many computational threads over the years on CA. There are other programs that do similar computations, but with different potential consequences for error, and some are used by the same organisation, maybe with more attention to error. STK, for example, models oceanic columns for turbulence, stratification and other physical properties. I would have thought this relevant. (No, I’m not on the payroll; I have never even spoken to them by email or any other method.)

          When I learned machine language programming in 1970 I thought it was the bee’s knees until Basic appeared and I bought a machine with 8 K of ferrite core memory to run it. Then Fortran appeared, even better. Some people are early adopters, others are content to wait until version XYZ is about to expire. I’m not so close to it now as before, but I wonder aloud if CA people are keeping up with the latest – I suspect they are.

        • Mark T
          Posted Oct 8, 2008 at 11:35 PM | Permalink

          Re: Geoff Sherrington (#51),

          In the more general sense, Mann et al use algoriths for estimation of great circle distances, they use lats and longs when searching for grid cells and points within, spherical coordinates, etc etc. There are many computational threads over the years on CA. There are other programs that do similar computations, but with different potential consequences for error and some are used by the same organisation, maybe with more attention to error. STK for example models oceanic columns for turbulence, stratification and other physical properties. I would have thought this relevant. (No, I’m not on the payroll, have never even spoken to them by email or any method).

          Yes, I agree, but none of this requires the level of complexity contained within something like a satellite propagator, and ultimately, has nothing to do with his algorithm. In fact, the only reason “location” matters is whether or not his proxies relate to local temperatures, and what local effects need to be removed. Even with this, precision in location is not important. Now, claiming that the rainfall in Maine… ahem.

          Mann isn’t doing anything that even closely resembles turbulence, oceanic, atmospheric or otherwise. Once he picks his proxies that provide enough spatial coverage of the globe, 3D implications no longer matter (and ultimately, it mostly boils down to a tree in Colorado from what I gather, but that’s another discussion, hehe).

          The thing is that STK brings nothing to the table for doing proxy work. It’s like bringing a sledge-hammer when you have a screw.

          Mark

        • Geoff Sherrington
          Posted Oct 9, 2008 at 3:28 AM | Permalink

          Re: Mark T (#52),
          It’s like bringing a sledge-hammer when you have a screw.
          That’s ok. Personally, I prefer a nut cracker.

        • Geoff Sherrington
          Posted Oct 11, 2008 at 6:21 AM | Permalink

          Re: Mark T (#52),

          Mark T.
          I was aware of the truth of your comments before you posted them and agree with them in substance.

          However, I was not discussing the employment functions of Mann (I tried to keep it broad with “Mann et al”). My concern was more with quality. I would sooner trust software that makes you think of Wernher von Braun than 20 year-old program versions stacked on versions. We have example after example of the problems Steve meets in reverse engineering old software. It’s quality that is the issue. Why not do the surveying exercise I sketched above and surprise yourself with the results?

          Here are the inputs. What is the main Great Circle distance, to the nearest meter, from a famous Oxford University place at 51 deg 45 min 12.340 sec North; 1 deg, 15 min 14.460 sec West, to a nostalgic place at Sydney University, 33 deg 53 min 22.140 sec South; 151 deg 11 min 15.510 sec East? The WGS-84 geoid can be used.

          Why not all run it routinely in any canned program you like and submit your results on the Message Board on November 1st, 2008, 1000 hours Greenwich Mean Time, so that people do not peek at each other’s answers first.

          BTW, I do not concur that this exercise is irrelevant to climate work. If you use a routine to find nearest neighbours in a dense data field, some of the answers might be close (or equal in calculation) and you have to be sure which one to use. If you don’t carry enough figures, you can choose one place one time and another place another time, depending on the order your particular program defaults to. If you then try to match outcomes, you get a glitch and have to waste time tracking it down.

        • Mark T
          Posted Oct 8, 2008 at 11:48 PM | Permalink

          Re: Geoff Sherrington (#51),

          Some people are early adopters, others are content to wait until version XYZ is about to expire. I’m not so close to it now as before, but I wonder aloud if CA people are keeping up with the latest – I suspect they are.

          I would add, MATLAB is the latest in terms of scientific computing, though I’ve been using it over 20 years. There are many clones that are freely available, though they don’t come with the same level of maintenance (nor the depth of libraries) and may not always have the same code (Mathworks now compiles most of their m-files for both speed and copyright protection purposes, probably to keep the latest and greatest away from the clones).

          Mark

  40. Steve McIntyre
    Posted Oct 9, 2008 at 7:00 AM | Permalink

    #53. Mark, I would submit that R is more “latest” than Matlab. If it has not already surpassed Matlab in market penetration, it is only a matter of time. There are an unbelievable number of packages. (Although now that I think of it, R is an evolution of S, which might be about 20 years old as well.)

    • Spence_UK
      Posted Oct 9, 2008 at 2:09 PM | Permalink

      Re: Steve McIntyre (#55),

      I would submit that R is more “latest” than Matlab

      Is that an objective view based on extensive experience of using both there Steve? 😉

      Having used both, I find R and MATLAB are doing very similar things. To argue one is the latest and greatest seems a bit odd. Checking wikipedia (caveat emptor as usual) suggests both were formed in the second half of the 1970s (making each circa 30 years old) for more or less the same reason – to provide access to the extensive and powerful FORTRAN libraries through a simple and intuitive language more closely linked to how the maths would be written and a bit more divorced from the nuts and bolts of computer languages.

      Both do this pretty well, to be honest. I have limited experience in R, but I can see some areas where it does seem better, while other areas are stronger in MATLAB. The main difference is likely to be that one has a more distributed base and is free, whilst the other is more centrally controlled and costs money. To someone working at home, who wants to keep costs down to a minimum, R makes more sense. To a corporate environment, the support and consistency provided by a central design authority means MATLAB is probably going to be preferred. I also note that R is a bit stronger on the statistics side and MATLAB is a bit stronger on the engineering and control side (although each can play these roles).

      PS. It is worth noting that there are a large number of “packages” available for MATLAB – take JEG’s mapping library as an example. The Mathworks also have a pretty extensive upload site (although the quality can be variable! They do include reviews). Although if there is a built-in option in MATLAB, I usually prefer to use that on the basis of quality control, consistency, etc.

  41. Mark T.
    Posted Oct 9, 2008 at 8:53 AM | Permalink

    Now that could be, Steve. I only heard of R through you, however. I think the difference is going to be companies buying a package for their employees and home users that need powerful tools without the expense.

    Mark

  42. Mark T.
    Posted Oct 9, 2008 at 8:58 AM | Permalink

    I should add that I still refuse to learn C++. My coding buddy insists that I know it, and I actually program in it, but I just don’t realize it. 🙂

    Mark

  43. Mark T.
    Posted Oct 9, 2008 at 3:57 PM | Permalink

    I think both of our guesstimations are based on usage, not existence. Both MATLAB and R are becoming very popular, the former in companies that can afford the licensing, the latter with home users (as you also note). It seems I’ve even seen MATLAB advertisements in mainstream publications such as Discover Magazine, which is relatively recent.

    I actually bought my own professional copy for my MS thesis back in 1994. Back then, the student version sucked, but you could buy a pro version for $500 if you could prove academic intent. I could, and did. Now, the student version is identical to the pro version other than some nits about printing graphics and publishing stuff, but it still sucks because it doesn’t run on 64-bit platforms! 🙂 I am fortunate that I have a copy from work and licenses are user based, not location based.

    Mark