Team Euro's CPD submission (online here) does a commendable job of archiving data in its supplementary information (online here). It's by no means perfect, but it's a big improvement on prevailing practices in the field. So while they've chosen to unfairly disparage my own efforts with respect to source code, perhaps one can construe the marked improvement in their own practices as a kind of compliment.
They state that they will provide source code for the calculations after the review period as follows:
The software (in the 'python' language) used to generate the reconstructions presented in the manuscript will be made available on http://mitrie.badc.rl.ac.uk when the manuscript is published.
Now this is obviously an improvement, but, to the extent that they are trying to adhere to improved practices in the paleoclimate field – something that I obviously endorse – it would be much better if they archived their code at the time of submission, as required in econometrics journals.
Notwithstanding the above disclaimer, some of their Python functions are archived at the MITRIE website here, although there does not appear to be any unifying script to date. To give a flavor of Juckes' programming style, I've attached his code for the do_regress.py module below. I suspect that their job would have been more cost-effective if they had used R, but people use what they're used to and I've got no complaint about that.
At my present rate, I’ll have much of the article replicated in R in a day or two.
73 Comments
Just a note that this isn’t visible to me yet.
#
###########################################################################
# module do_regress.py
#
###########################################################################
#
# Utilities for proxy based temperature reconstructions
###########################################################################
from Numeric import *
import cdms
import string, shelve, mods
from config import data_dir, work_dir
import basic_utils

class regresser:

    def __init__(self, ncfile=data_dir+'nc/mitrie_proxies_v01.nc',
                 xrange=(1000,1980),
                 std_opt='std',
                 key_list=None,
                 key_opt=None,
                 std_period='full',
                 proxy_id='mann_etal1999',
                 n_extrap=1,
                 select_on_id=True,
                 replace_pcs=None,
                 ppcs=None,
                 mann_etal1999_fixed_pc=True,
                 verbose=0,
                 lcal=79 ):

        self.name = 'regresser'
        self.n_extrap = n_extrap

        self.ee = {}
        self.start = xrange[0]
        self.end_year = xrange[1]
        self.proxy_id = proxy_id
        self.ncfile = ncfile
        self.lcal = lcal
        self.time_res = False

        self.nc = cdms.open( self.ncfile, 'r' )
        keys = self.nc.variables.keys()
        keys.sort()

        if key_list != None:
            nnf = 0
            for k in key_list:
                if k not in keys:
                    print k, ' not found in keys'
                    nnf += 1
            if nnf != 0:
                print keys
                self.nc.close()
                raise 'error'
            keys = key_list
        else:
            keys = basic_utils.xx_where( \
                lambda x: self.nc.variables[x].standard_name == 'proxy', keys, extract=True )
            if len(keys) == 0:
                print ' no proxies found in: ', self.ncfile
                self.nc.close()
                raise 'error'

        if select_on_id:
            keys = basic_utils.xx_where( \
                lambda x: self.nc.variables[x].collection_id == proxy_id, keys, extract=True )
            if len(keys) == 0:
                print 'no proxies matching id %s found in %s ' % (proxy_id, self.ncfile)
                self.nc.close()
                raise 'error'

        ## the comparison operators here were eaten when the code was posted
        ## (the page swallowed everything between '<' and '>');
        ## the obvious start/end test has been reconstructed
        keys = basic_utils.xx_where( \
            lambda x: (self.nc.variables[x].start[0] <= self.start and \
                       self.nc.variables[x].end[0] >= self.end_year-n_extrap ), keys, extract=True )

        if proxy_id == 'mann_etal1999':
            if mann_etal1999_fixed_pc:
                keys.pop( keys.index( 'mbh99_13' ) )
            else:
                keys.pop( keys.index( 'mbh99_01' ) )
        elif proxy_id == 'mann_etal1998':
            sect_list = ['VAGANOV_SOVIET_TREEL', 'ITRDB-NORTH_AMERICA', \
                         'Stahle_swm_late', 'Stahle_swm_early' ]
            keys = basic_utils.xx_where( \
                lambda x: self.nc.variables[x].section not in sect_list, \
                keys, extract=True )

        if len(keys) == 0:
            print 'no proxies matching id %s and starting before %s found in %s ' % (proxy_id, self.start, self.ncfile)
            self.nc.close()
            raise 'error'

        if verbose > 1:
            print 'keys extracted:', keys
        self.keys = keys

        nproxy = len( keys )
        ny = self.end_year + 1 - self.start
        self.predictor = multiarray.zeros( (nproxy,ny), 'f' )
        self.predictor_raw = multiarray.zeros( (nproxy,ny), 'f' )
        self.predictor_mean = multiarray.zeros( (nproxy), 'f' )
        self.predictor_var = multiarray.zeros( (nproxy), 'f' )
        tax = self.nc.getAxis( 'time' ).getValue()
        i0 = self.start - tax[0]
        ik = 0

        if std_period == 'cal':
            n_right = lcal
        elif std_period == 'full':
            n_right = -1
        else:
            print 'std_period should be cal or full'
            raise 'error'

        if verbose > 1:
            print std_opt, lcal
        self.meta_dict = {}
        for k in keys:
            mdict = {}
            for at in self.nc.variables[k].listattributes():
                mdict[at] = self.nc.variables[k].getattribute( at )

            self.predictor[ik,:], self.predictor_mean[ik], self.predictor_var[ik] = \
                self.standardise( \
                    self.nc.variables[k].getValue().tolist()[i0:i0+ny], \
                    opt=std_opt, n_right=n_right )
            self.predictor_raw[ik,:] = \
                self.nc.variables[k].getValue().tolist()[i0:i0+ny]
            ik += 1
            self.meta_dict[k] = mdict

        self.keys = keys
        self.l1 = nproxy
        self.nc.close()
        ##
        ## recalculated mann et al, 1998, 1999 proxy pcs
        ##
        if ppcs != None:
            if replace_pcs == None:
                kks = range(3)
            else:
                kks = []
                for p in replace_pcs:
                    kks.append( self.keys.index( p ) )

            nc = cdms.open( data_dir + 'nc/mitrie_new_proxy_pcs_%4.4i_v01.nc' % self.start, 'r' )
            k1 = 0
            for k in kks:
                v2 = nc.variables[ ppcs[k1] ].getValue()
                self.predictor[k,:], self.predictor_mean[k], self.predictor_var[k] = \
                    self.standardise( v2, opt=std_opt, n_right=n_right )
                self.predictor_raw[k,:] = v2
                k1 += 1
            nc.close()

        ############################################
        ### predictand stuff.

        self.load_predictand( )
        if proxy_id == 'mann_etal1999':
            for k in range(3):
                ss = sum( self.predictor[k,-self.lcal:]*self.predictand )
                if ss  ## the '<'-comparison and everything after it were
                       ## swallowed in the posting, up to the '>' in an
                       ## 'if verbose > 1:' some eighty lines further on

    ## [a long stretch of the listing, including the rest of __init__ and the
    ##  start of a composite-building method, was lost here; it resumes mid-method]

        if verbose > 1:
            print 'getting composite::', res[1], res[2]
            print len(res[0])
            print res[0][0:10]
        self.composite = array( res[0] )

    def regress_composite( self, opt='cols' ):

        cc = dot( self.composite[-self.lcal:], self.composite[-self.lcal:] )
        if self.time_res:
            print 'smoothing predictand in variance calculation'
            sp = basic_utils.smooth( self.predictand, self.time_res )
            self.pp = dot( sp, sp )*len(self.predictand)/len(sp)
        else:
            self.pp = dot( self.predictand, self.predictand )

        if opt == 'cols':
            self.gg = dot( self.composite[-self.lcal:], self.predictand )/cc
            self.predicted_temp = self.gg*self.composite
        elif opt == 'cvm':
            self.gg = sqrt( self.pp/cc )
            self.predicted_temp = self.gg*self.composite
        else:
            print 'option ', opt, ' not available'
            raise 'error in regress_composite'

        self.residuals = self.predictand - self.predicted_temp[-self.lcal:]


    def residuals_pp( self ):

        lch = self.lcal/2
        self.ac = multiarray.zeros( (2,lch), 'f' )
        for i in range( self.lcal/2 ):
            self.ac[0,i] = vdot( self.residuals[:lch],
                                 self.residuals[i:i+lch] )/lch
            if i != 0:
                self.ac[1,i] = vdot( self.residuals[:-i],
                                     self.residuals[i:] )/(self.lcal-i)
            else:
                self.ac[1,i] = self.ac[0,i]

        self.ac[0,:] = self.ac[0,:]/self.ac[0,0]
        self.ac[1,:] = self.ac[1,:]/self.ac[1,0]

        v0 = vdot( self.residuals, self.residuals )/self.lcal
        v1 = vdot( self.residuals[1:], self.residuals[:-1] )/(self.lcal-1)
        v2 = vdot( self.residuals[2:], self.residuals[:-2] )/(self.lcal-2)
        v3 = vdot( self.residuals[3:], self.residuals[:-3] )/(self.lcal-3)
        std_err = sqrt( v0 )
        ndf1 = self.lcal*( 1.-v1/v0 )
        ndf2 = self.lcal*( 1.-sum( [v1/v0, v2/v1, v3/v2] )/3. )
        x = self.predicted_temp[-self.lcal:]
        u0 = vdot( self.predictand, self.predictand )/self.lcal
        u1 = vdot( self.predictand[1:], self.predictand[:-1] )/(self.lcal-1)

        tau_d = 1. - self.ac[0,1]
        tau_i = sum( self.ac[0,:] )/lch
        ndf3 = int( self.lcal*( 1.-u1/u0 ) )
        alpha = u1/u0
        return (std_err, ndf1, ndf2 )


    ##########################################################
    ### standardise proxies -- either on whole time series
    ### or on last n_right values (as in Mann et al., 1998)
    ##########################################################
    def standardise( self, aa, n_right=-1, opt='std' ):
        ##
        ## n_right > 0:: standardise using latter end of series only.
        ##
        if self.n_extrap > 0:
            ii = basic_utils.xx_where( lambda x: x  ## condition lost in posting
            if len(ii) > 0:  ## reconstructed; only '0:' survived
                print 'standardise:: filling'
                for i in ii:
                    aa[i] = aa[ii[0]-1]

        if n_right > 0:
            if min( aa[-n_right:] )  ## the remainder of the listing was cut off in the posting
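The standardise routine is cut off above. As a rough sketch of what such a routine presumably does (centre and scale a series on either its full length or only its last n_right values, the Mann et al., 1998 calibration-period convention), here is a hypothetical modern-Python version; this is a guess at the lost logic, not Juckes' actual code:

```python
import numpy as np

def standardise(aa, n_right=-1):
    # Hypothetical sketch (NOT the lost original): centre and scale a series
    # using either the whole record, or only its last n_right values as the
    # reference period (the Mann et al., 1998 convention).
    aa = np.asarray(aa, dtype=float)
    ref = aa[-n_right:] if n_right > 0 else aa
    mean = float(ref.mean())
    var = float(ref.var())
    return (aa - mean) / np.sqrt(var), mean, var

series = np.arange(20.0)
full, m_full, v_full = standardise(series)          # reference = full record
cal, m_cal, v_cal = standardise(series, n_right=5)  # reference = last 5 values
```

Standardising on the calibration period only, as above, leaves the calibration-period segment with zero mean, while the earlier part of the series keeps whatever offset it has relative to that period.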
The readme file makes it look as though this package is everything you need. There are, however, a couple of dependencies most folks would have to add: CDAT (Climate Data Analysis Tools) and an NCAR graphics package. Folks thinking of saving Python code for posterity should probably consult the readability guidelines from the Benevolent Dictator.
In fairness to the authors, their tabs and spacing got lost in the posting to the site. What I like about R is the vector and matrix handling, which lets you elegantly avoid do-loops. When I see them, I just see busy work.
R rocks
Steve, there is an extension for Python called numpy which makes the language very MATLAB/R-like. I think Python is also a little more general-purpose than R.
2 comments.
1. Looks like typical academic code. Poorly commented, pasta-like, etc.
2. Languages like C++, Java, etc. can be set up so that matrix operations look like matrix operations. I would wager that R doesn't have the robustness (or speed?) of a general-purpose language, which would make it less suitable for really large sims/models. For instance, Blitz++ is a very fast matrix/vector library. Not for the faint of heart, though... tough on compilers also.
Mike T is right that there are great matrix functions for Python in numpy. This code even uses Numeric, the predecessor to numpy.
Bender R is more complete but there is a lot of weight behind scipy now. And Python is way better at parsing data salad than R (or even SAS).
Re #7
It’s not just the look of the code, it’s also performance. I’ll take your wager on R’s speed for matrix operations.
Re #8
I’ll take power over popularity.
I’ll wager R itself is written in C++.
X=randn(1000);tic,out=inv(X'*X)*X';toc
elapsed_time =
14.4530
I have time to wait, old Matlab does fine 🙂
I’ll wager R’s statistical libraries far outstrip anything available for C++. Run-time is only a tiny fraction of the real time-cost associated with any particular language. C++ is not for dummies. R is.
I would expect that R’s statistical libraries are themselves written in C++.
Given that it’s an open source program, you should be able to download the source code, compile it, and link it into your own program written in C, Basic, or whatever you like.
Not sure why you keep saying that. It may be true but it’s irrelevant to anyone looking for statistical solutions.
Fair point. I’ll shut up.
Which one to use depends on the scale of your problem. Large scale apps with a long lifetime probably should be done in the “real” language of your choice. Smaller scale, prototyping or quick what ifs are amenable to things like R, MathCad (one of my favorites) and MatLab. I’d wager they are a better choice.
Blitz++ runs at Fortran speeds…or faster.
Re #16 I wasn’t going to say the F word …
Re #15 Hey, fFreddy, I wasn’t tryin to razz ya. Comments & insights appreciated. (Overall I’ve used C++ as much as R; but I haven’t touched it in years.)
Hearing about unfamiliar things like Python, numpy, blitz++ – this is good. You can’t get that at RC.
UC, I tried that on my machine:
>> X=randn(1000);tic,out=inv(X'*X)*X';toc
elapsed_time =
5.5180
See? You don’t need a new language, you just need a faster computer 😉
I hope you don’t actually use inv to compute that :),
>> X=randn(1000);tic,out=inv(X'*X)*X';toc
Elapsed time is 1.376841 seconds.
>> X=randn(1000);tic,out=(X'*X)\X';toc
Elapsed time is 0.943003 seconds.
Not only is inv slower than \, but it’s also more susceptible to numerical precision problems.
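UC's point carries over directly to numpy, where `numpy.linalg.solve` plays the role of MATLAB's backslash. A minimal sketch with invented matrices (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))
A = X.T @ X          # 50x50 normal-equations matrix, well-conditioned here
B = X.T

# Explicit inverse, as in inv(X'*X)*X': forms inv(A), then multiplies.
out_inv = np.linalg.inv(A) @ B

# Direct solve, as in (X'*X)\X': factorizes A once and back-substitutes,
# avoiding the explicit inverse. Typically faster and more accurate.
out_solve = np.linalg.solve(A, B)
```

On a well-conditioned system the two agree; on a near-singular one, the solve route generally gives the smaller residual.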
Bender popularity is power. What makes R great is that a lot of smart people put some great functions there. Now lots of smart people are working on sciPy and Climate Data Analysis Tools.
Just looking at the Python code, my impression – and it’s gut feel only – is that there are 4-10 times more lines than you’d need with R. In R, you hardly ever use do-loops because of the matrix and vector orientation, and Juckes’ code was full of do-loops.
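The loop-versus-vector contrast is easy to show in numpy (which the thread has already brought up); the proxy matrix below is invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
proxies = rng.standard_normal((10, 100))  # 10 proxy series, 100 years

# Loop style, one series at a time (as in the listing above).
out_loop = np.empty_like(proxies)
for i in range(proxies.shape[0]):
    m = proxies[i].mean()
    s = proxies[i].std()
    out_loop[i] = (proxies[i] - m) / s

# Vectorized style, the whole matrix at once (what R and numpy make natural).
means = proxies.mean(axis=1, keepdims=True)
stds = proxies.std(axis=1, keepdims=True)
out_vec = (proxies - means) / stds
```

Both produce identical results; the vectorized version is one conceptual step rather than a loop of bookkeeping.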
Popularity is good, no question. Power is more than just popularity however.
Heh…the F word…clunky but fast. The gold standard for computational speed.
I’m very much a fan of C-like languages and their derivatives, but I would say that if there is a language specifically designed for your purpose, and it’s designed well, then you should use it. I can probably write any function you perform in R much faster in C, but with more effort. If you aren’t running a script that’s going to take weeks or months to execute, performance probably doesn’t matter much to you and you should just stick with whatever makes your life easiest.
I certainly don’t see the point of using a scripting language that’s not specifically designed for mathematical applications in order to write mathematical functions. There’s no reason you can’t, but as Steve says the scripts are going to end up longer and messier. Still, if one is very familiar with a given language, and unfamiliar with R, for a single program it may not be worth learning. If statistics are used heavily in your line of work, learning R or Matlab or similar (Maple?) is probably a good idea.
Re: #24: Nicholas,
I would be surprised if the R implementation does not perform a whole bevy of high-level optimisations that would make it hard for a roll-your-own C program to compete on any non-trivial problem without major programming effort.
Mike T, thanks for the note. Don’t tell anyone I’ve used inv().
There is a clear difference:
X=randn(79,415);
out=X*((X'*X)\X');max(max(abs(eye(79)-out)))
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 7.411750e-021.
ans =
1.8097e-014
out2=X*inv(X'*X)*X';max(max(abs(eye(79)-out2)))
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 7.411750e-021.
ans =
31.1229
SteveM, you can't judge Python by this script. Scientists and engineers are both notoriously lousy programmers, PhDs especially; they just can't be bothered to write comments. For a PhD scientist, professor J seems to be a couple of deviations better than average, yet he's clear as mud. Write me the next time you find yourself confounded by tossed data salad and I will show you how Python goes beyond R, SAS, Perl, or anything else at converting nonsense into arrays. The R tutorial page suggests learning Perl or SAS along with R since the R/S text processing facilities are junk, but Python is at heart a text processing language. With numpy for arrays and matplotlib for graphs, it's the room, the sun, and the sky.
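A small illustration of the "data salad" point: pulling year/value pairs out of an irregularly delimited text fragment into a clean list. The sample text is invented for the example:

```python
import re

# Messy, inconsistently delimited input (invented for illustration).
raw = """
1400  0.12;   1401 , -0.03
  1402:0.07
"""

# One regex pass turns it into (year, value) tuples.
pairs = [(int(y), float(v))
         for y, v in re.findall(r"(\d{4})\s*[,:;]?\s*(-?\d+\.\d+)", raw)]
print(pairs)  # [(1400, 0.12), (1401, -0.03), (1402, 0.07)]
```

From there it is one call (e.g. `numpy.array(pairs)`) to an array ready for analysis.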
Bender, would I bother you with popular but useless things? Guido van Rossum is just so darn smart that I am trying to institute his PEP 8 style guide for Python as a style guide for SAS in my shop. Alex Martelli and Tim Peters, too, are such clear thinkers and concise writers that you have to consider what they are spending so much time on. While I am clearly enthralled, I will make an effort to contain my evangelism to this post. So, Bender, stay with me a minute.
A long while ago I had a job where I needed to hand SAS code over to a COBOL programmer for implementation in a production system. One extremely bright COBOL programmer was brutal in his criticism of any repeated code that did not use a loop, and he was very critical of the SAS code we gave him. It was a little unfair: Jim from Mar Vista could see, several steps beyond the code, what was going to happen next, while we could barely see what we were writing. It's only in Python that I can write code clear enough that I don't worry Jim from Mar Vista would complain.
Yes R is great if you can think in arrays. But Python lets you look like an organized thinker even if you are like me. I need to wear all my keys on my necklace or I would lose them. I selected the car I drive partially on the basis that it’s physically impossible to lock the keys in the ignition or to leave the lights on. But while I need help from Tokyo just to make my way to work, I like to think that if you look at my code it’s transparent.
If you are really smart, like SteveM, no syntax can hold you back. But for the rest of us, clarity matters. R is great, but you also need something for the data salad R is no good with. Python is less well developed with respect to stat functions but is the best thing ever for data salad.
Steve,
I have extended the JAMA (Java Matrix) package, freely available from the link below, to be more Matlab-like. It means that you are basically writing Matlab code, but in Java.
“JAMA”
http://math.nist.gov/javanumerics/jama/
If you want a copy, I can send you one.
Here is a piece of code by UC shown above:
—————- UC Code Snippets —————
X=randn(79,415);
out=X*((X'*X)\X');
————————————————-
—————- Sione Code Snippets Using JAMA—————
Matrix X = JDatafun.randn(79,415);
Matrix out = X.times(X.transpose().times(X).inverse()).times(X.transpose());
————————————————————–
Rafe: A mathematician friend of mine who writes a lot of software to be run on large clusters swears by C, and uses some high-performance C mathematical libraries (for stuff like matrix operations, fourier transforms, etc.) I would ask him what library he likes best and start with that. That should take care of providing me with well-optimized high level functions.
As for actually performing mathematical optimizations by refactoring formulae, that is something I suppose I would be bad at because I am not a trained mathematician as such. But most of the code I see on this blog is pretty straightforward matrix ops. I’m sure a good matrix library would deal with them quite well.
As I said in general I would use R for this type of thing, unless I was writing, say, a climate model which processed vast amounts of data. Then I would use something low level like C in order to be able to get close to optimal low-level code to reduce run-time. But if anyone is going to write mathematical code in C or any other language not specifically designed for it, chances are you are better off doing a little research and finding the premier mathematical library(ies) for the language you have chosen. There’s no point re-inventing the wheel, especially since the prewritten libraries are likely to be more heavily optimized than anything you’re likely to come up with yourself.
Firstly I'd like to apologise for an error in our manuscript. On page 19, line 6, the manuscript states: "The code used by MM2005 is not, at the time of writing, available". This comment should have referred to the code used by MM2005c (their Energy and Environment paper); the code used by MM2005 (their GRL paper) was made available at the time of publication.
We apologise to McIntyre and McKitrick for suggesting that the code for MM2005 was unavailable when it was in fact the code for MM2005c which we had not been able to obtain, and note that they have made efforts to provide code for this and other publications which go beyond the norm in this field.
Further details are on the Climate of the Past Discussion site, where anyone is welcome to post a comment.
Stephen McIntyre has now made the code for their Energy and Environment paper available to us (though his http://www.climate2003.com site still declares that it is unavailable). Interestingly, given the strong emphasis of McIntyre and McKitrick’s 2005 Energy and Environment paper on the importance of centering data for principal component calculations this step appears to have been omitted for figure 2 of their paper, which displays a principal component of uncentred data. Further details are available at mitrie.badc.rl.ac.uk.
Concerning the code, this is provided in the spirit of transparency. I fully agree with the comment above that python should not be judged on the basis of this code.
Martin: I’m a bit confused. I’m looking at Figure 2 in M&M’s 2005 Energy & Environment paper. The caption says:
This seems, to me, to be explicitly indicating that the graph is of four sets of data. Two of them are the PC1 using centered calculation method (with & without arbitrary adjustment) and the other two are the PC1 using decentered calculation method (with & without arbitrary adjustment).
Without reading the rest of the paper, it seems clear to me the figure is attempting to display the difference that decentering/centering makes to the results of the method. Why is it strange that they are presenting a graph of decentered data, when they are using it as a basis of comparison? It seems entirely straight-forward to me. Otherwise I don’t know how they could illustrate the impact of the decentering.
Martin, thank you for your letter. Would you like to also put some comments in your paper on the unavailability of the codes and data for some of your co-authors on your paper? Because if you don’t, I don’t see the point of commenting on code availability at all …
In particular, could you respond to, or even better, offer some help with, Steve McIntyre’s problems with getting the following information? Steve M. says:
Steve M. has always provided data and code, both in advance and on request, and has always been totally open about his methods and data. I find your singling out of Steve M. for your comments, while totally ignoring the well-known problems with your co-authors Briffa, Esper, and Moberg, to be a very troubling and disconcerting occurrence.
Your comments greatly appreciated …
w.
PS – Everything in science requires a citation from some reputable Journal, so regarding your preferable course of action in this matter, let me offer a reference to the following Journal article. The Abstract says:
… “And why beholdest thou the mote that is in thy brother’s eye, but considerest not the beam that is in thine own eye? Or how canst thou say to thy brother, Let me pull out the mote out of thine eye; and, behold, a beam is in thine own eye? Thou hypocrite, first cast out the beam out of thine own eye; and then shalt thou see clearly to cast out the mote out of thy brother’s eye.”
“The Sermon On The Mount”, Christ, J. et al., Journal of Applied Theology, 30 CE
Response to comment 32. The lower two curves in the plot are calculated using the MBH method, centred on the calibration period, which is described in the caption as "decentred", i.e. not centred on the complete data record. The upper two curves, which are described in the caption as "Above: PC1 using centred calculations", are calculated without centering of any kind. The discussion on page 77 of the paper makes it clear that the authors intend to show the contrast between the MBH PC and a centred PC, but what is plotted in the figure is a comparison between the MBH PC and a PC calculated with no centering.
Response to comment 33: I have never advocated that code should always be disclosed. For the record, I asked for the code for the McIntyre and McKitrick 2005 paper in December 2005, and it was not provided. I checked Stephen McIntyre's website prior to submission and found a line clearly saying that the code was not yet available. It has now been provided and, as I acknowledge, this goes beyond what is normal in the field.
Why not? Is it the threat of terrorism?
That’s a matter of record that Steve can clarify.
If you're referring to http://www.climate2003.com, then it hadn't been updated in months. That's no excuse for making, six months later, the false claim that the software code was unavailable, because that claim is simply untrue - and you know it.
That’s because what is normal in the field is a disgrace to science and can fairly be described as rampant misconduct.
#30. Martin Juckes draws our attention today to the following comment which he has just published at CPD:
I will revert more fully on a number of points here. I wish to note for the record that the stale index webpage at http://www.climate2003.com was updated yesterday (November 2, 2006) and that a direct link to the Supplementary Information for MM05 was placed on the index page. This change was made promptly after the matter was brought to my attention. The date of the change is recorded at the website and, if Juckes wishes confirmation, I will provide an access password to coauthor Zorita to confirm this. I presume that Juckes either did not double-check the webpage when he submitted his correction or possibly used a cached version inadvertently.
Second, on the issue of EE Figure 2, when the matter was brought to my attention, I edited the backup script on Oct 31, 2006, including a commentary and a script to produce a re-stated version of Figure 2 using prcomp. I posted this re-stated version at climateaudit on Nov 2 here. Juckes discusses Figure 2, but failed to discuss the comments as of Oct 31, 2006.
Third, the backup script originally archived in March 2006 includes a case using correlation PCs (which were reported in MM05b). Juckes' correction stated: "…shows that they use PCs which have been (except in figure 2) centred but not normalised to unit variance (standardised)." Correlation PCs obviously yield identical results to PCs in which the chronologies are divided by their standard deviation - a point which Juckes, like Wahl and Ammann, fails to acknowledge. Juckes failed to refer to any of the discussion in Huybers 2005 or the Reply to Huybers 2005 about correlation and covariance PCs.
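The equivalence between correlation PCs and covariance PCs of standard-deviation-scaled series is easy to check numerically. A small sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
# 79 "years" of 6 correlated series: a shared signal plus noise (invented data).
X = rng.standard_normal((79, 6)) + 0.5 * rng.standard_normal((79, 1))

Xc = X - X.mean(axis=0)               # centred
Xs = Xc / X.std(axis=0, ddof=1)       # chronologies divided by their std deviation

# The covariance matrix of the rescaled series equals the correlation matrix
# of the raw series, so their eigenvectors -- and hence the PCs -- coincide.
same = np.allclose(np.cov(Xs, rowvar=False), np.corrcoef(X, rowvar=False))
print(same)  # True
```

In R terms, this is why `princomp(..., cor=TRUE)` on raw data and covariance PCA on pre-standardised data give the same principal components.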
Juckes said that there were "uncertainties" about what we did, but never asked me any direct questions about his "uncertainties". I presume that he did not consult his coauthors either. During the last year, I have exchanged over 100 emails with 2 of his coauthors. One of his coauthors has posted here frequently. Eduardo Zorita exchanged code to reconcile results. I have met with 2 of his coauthors on 3 different occasions. I spent nearly an entire day at KNMI with Nanne Weber, one of his coauthors. On none of these occasions did the Juckes group express any "uncertainties" about what we did.
Question to the consensus: Whatsa matter with you? Why are you behaving this way?
Answer from the consensus: Amateurs like Steve McIntyre might just find out what's wrong with our important scientific papers and show everybody! These amateurs must be silenced and discredited in any way, shape, or form! Hrrrumph, waffle waffle waffle... kah kah kah
LOL
Another AGW double-standard for the file:
Steve M must make all his codes (1) available, and (2) turnkey, and (3) must advise all team members when such codes are available, and (4) must update all stale web pages linking to such material, and (5) must be sure team members aren’t using cached web material; whereas the team is under no obligation to do anything whatsoever.
Martin, you’re an extremely bright guy. Grow up. Stop being cute.
Juckes says:
Actually January 2006. The code was archived on March 1, 2006. The line on the old website does not "clearly say that the code was not yet available". It was a very stale webpage, but it says that the code will be archived – not that it was "not available". At the time that Juckes purported to check, it was archived – in the same directory as the MM03 scripts which Juckes appears to have located.
During this period, I exchanged code with one of Juckes’ coauthors – I would have presumed Juckes consulted with his coauthors, but it appears that he failed to consult Eduardo Zorita about his "uncertainties".
Zorita has put his name to the paper, presumably if he had reason to disagree with its contents, he wouldn’t have signed off on it.
Hello Stephen,
As you are no doubt aware, my update to the Climate of the Past Discussion site was posted yesterday, Nov 2nd, not today.
Concerning your revised figure 2, would you confirm that you still have a minus sign multiplying the centred PC, and that without this minus sign the effect of the adjustment to the non-BCP is to decrease the PC in the first 50 years, as it does for the MBH PC?
re: #40
Just an observation.
Why wasn’t this all straightened out before statements were made and submitted in the paper ? And look, instant communication right here right now! Amazing how easy it is and unfolding right before our eyes too.
My husband is a published environmental scientist.
I know the real drill. This is not it.
#40. First, if an increase to non-BCP ring widths from 1400-1450 results in a decrease in both PC1s in the period 1400-1450, that hardly helps you, as it simply shows that PC methods are singularly inappropriate whether Mannian or centered. In this case, the sign was chosen so that the orientation of the post-1450 values were consistent in the two versions. I notice that your “evaluation” of the Mannian method does not evaluate this perverse property of the methodology.
You stated “at the present time this page is not linked to from the main site (www.climate2003.com)”. Yes, I realize that you posted your “correction” yesterday, but the webpage was amended yesterday.
#40. The webpage was updated yesterday. I checked the time as against the time of your submission to CPD and the webpage was updated prior to your submission to CPD. Thus this statement in your “correction” was untrue as well.
Let’s (yes, including me) keep the chatter down on this thread while Steve and Martin hash this out.
The sign of the Principal Component is arbitrary, I only mentioned it because you appear to attach importance to the direction of the perturbation.
As regards the difference between your centred PC and the Mann et al. PC, would you agree that this is partly due to the fact that you have not normalised the data, so that proxy series with high variance contribute disproportionately to your PC? If the data is normalised (by setting cor=TRUE in the call to princomp) the leading PC does reflect the anomalous growth in the 20th century, though it is still significantly different from the Mann et al. PC. Wouldn’t you agree that using normalised data is appropriate here?
As far as I can tell from your modified site, it was updated after I posted the comment yesterday, but if I've got that wrong, and the check I made on your site yesterday got information from a cache for some reason, please accept my apologies.
Just a brief comment in response to 41: the paper is on a discussion site, the idea is that the site can act as a forum for discussion (hence the name). This is slightly different from traditional printed publications.
#45. As I just pointed out in #43, it was updated prior to your submission. Specifically, the webpage was updated at 14:24 Greenwich Mean Time, while your comment is shown as being made at 17:40 yesterday. If I had used the term "at the present time" in a document, I would have ensured that the comment was actually true at the time that I made it, something that is easily done and which you failed to do. It's no good making an "apology" here. Your obligation is to correct your correction.
So have you contacted the editor of Energy and Environment about the mistake in your published paper?
Re:#44,48,49
Bender makes a good suggestion.
Martin Juckes thanks for introducing folks to CDAT.
Martin Juckes stated:
and
The sign of a PC series, as it comes out of a PC algorithm, is arbitrary. However, in the cases of interest here, the interpretation of a PC1 in which virtually all first eigenvector coefficients have the same sign, as a weighted average of ring width site chronologies (i.e. nearly all positive coefficients), is far more natural than an interpretation in which they nearly all have negative coefficients. It's always a good idea to look at the eigenvector coefficients and I recommend this to you. Note that the MBH99 PC1, which came out as an upside-down hockey stick, had mainly negative first eigenvector coefficients; Mann multiplied it by negative 1 and presented it as an upside-up hockey stick. I have no objection to this presentation as the underlying bristlecone chronologies have upside-up hockey sticks.
In the cases of interest here, nearly all the weighting coefficients of the PC1 have the same sign and the most natural physical interpretation of the PC1 is as a weighted average with nearly all positive coefficients.
In Figure 2, the first eigenvector coefficients were mostly positive in the AD1400 Mannian PC step. However, when non-BCP ring widths were increased in the early 15th century, this had the effect of flipping over the entire hockey stick, i.e. it assigned negative first eigenvector coefficients to nearly all the series. For interpretation purposes, there is no particular meaning to all-negative coefficients, and I multiplied the Mannian PC1 with the 15th century increment by negative 1 to yield the series in MM05 (EE) Figure 2.
Using prcomp on the network, i.e. centered, the first eigenvector coefficients were again nearly all positive. When non-BCP ring widths were increased in the early 15th century, once again this had the effect of flipping the entire series over, yielding mainly negative first eigenvector coefficients. Again, as in the Mannian case, I multiplied the PC1 with the 15th century increment by negative 1 to yield the series in the re-stated Figure 2.
The critical difference in the two situations is that after all the PCs are consistently oriented (such that first eigenvector coefficients are nearly all positive), the 15th century increment yields an increase with a centered prcomp PC1 and a decrease with the Mannian PC1. We expressed this conclusion in our EE 2005 article as follows:
This statement is correct. Since you have raised the matter and alleged an error, do you seriously deny the correctness of the above statement?
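The orientation convention described above can be sketched in a few lines of Python (a hypothetical illustration of the sign-flipping step, not code from any of the papers): since the sign of a PC is arbitrary, one flips the whole component whenever the majority of its eigenvector coefficients come out negative.

```python
import numpy as np

def orient_pc(scores, loadings):
    """Flip a principal component's sign so that the majority of its
    eigenvector coefficients (loadings) are positive. PC sign is
    arbitrary, so this changes the presentation only, not the fit."""
    if np.count_nonzero(loadings < 0) > np.count_nonzero(loadings > 0):
        return -scores, -loadings
    return scores, loadings

# an "upside-down" PC1: mostly negative loadings
loadings = np.array([-0.5, -0.4, -0.6, 0.1])
scores = np.array([1.0, -2.0, 3.0])
s, l = orient_pc(scores, loadings)
assert np.count_nonzero(l > 0) >= np.count_nonzero(l < 0)
```

Once both PC1s are oriented this way, the effect of the 15th century increment (increase versus decrease) can be compared on a consistent footing.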
re #45: Although Bender's suggestion (#44) is good, I just cannot keep my hands off this:
No, Martin, no. It is inappropriate. Since the series are already in standard units, it is normal practice to use the covariance matrix for PCA. If you use the correlation matrix instead in this case, the burden of justification is on you. I could not find any such justification in your paper, only an astonishing claim about "another apparent error by McIntyre and McKitrick". IMO, the only thing apparent here is that you are not familiar with PCA.
BTW, M&M have already discussed the issue in their reply to Huybers. That comment and other related material are available from this site by clicking "multiproxy pdfs" on the right panel. I suggest you take a look at those publications before re-iterating more refuted claims.
I have commented on this and another question at the "Climate of the Past" discussion site. Here is the part about proxy selection:
Sorry, left off the address of the “Climate of the Past” discussion site.
w.
Willis, I saw your comment. For the record, Moberg's SI was sufficient to identify data versions for 16 of 18 series. He refused (said he was not authorized) to provide either the Indigirka data (which is non-HS) or the Lauritzen data. This time, I filed a complaint with Nature (not Science) immediately and they required him to file a Corrigendum and provide both series, which he did. The inconsistency between the Lauritzen published version, with the cold point dated to the mid 19th century, and the Nature version, with the same cold point supposedly dated to the 1930s, remains unexplained.
I note that Stephen is tiring of this discussion and starting a new thread on http://www.climateaudit.org/?p=893#more-893.
To clarify, the "error" I referred to in McIntyre and McKitrick (2005, Energy and Environment) was plotting an obviously un-centred PC and describing it as centred.
Concerning the normalisation of data (#51 above) (or, in other words, the use of the correlation matrix instead of the covariance matrix): it is true, as noted above, that the tree-ring series have already been normalised, but the normalisation is different for each series. That sounds odd, but it is a logical consequence of the fact that the different series span different periods. Still, calling the input data normalised is a little misleading, though not strictly false. Normalising the data actually used (i.e. over the selected time period) is standard practice where the normalisation is otherwise arbitrary or, as here, affected by factors outside the study period. Normalising the data on a period other than the study period sounds a little like what Mann et al. (1998) did, i.e. centering the data, but not on the study period.
However, the main point regarding McIntyre and McKitrick (2005, Energy and Environment) is that the claim that the difference between the new reconstruction published there and the Mann et al. (1998) reconstruction is primarily due to differences in the centering of the data is false. Most of the difference results from the difference in normalisation.
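The point about normalising over the study period, rather than over whatever period each series was originally archived on, can be made concrete with a small numpy sketch (the helper name and the toy data are hypothetical, not code from the paper):

```python
import numpy as np

def standardize_over_period(X, years, start, end):
    """Rescale each proxy series to zero mean and unit variance computed
    over a common study period [start, end], rather than over the
    differing periods on which the series were originally normalised.
    Illustrative helper only."""
    mask = (years >= start) & (years <= end)
    mu = X[mask].mean(axis=0)
    sd = X[mask].std(axis=0, ddof=1)
    return (X - mu) / sd

# toy proxy matrix: rows = years, columns = series with arbitrary scales
years = np.arange(1400, 1981)
rng = np.random.default_rng(1)
X = 5.0 * rng.standard_normal((years.size, 3)) + 2.0
Z = standardize_over_period(X, years, 1400, 1980)
# over the chosen study period each column now has mean ~0 and sd ~1
assert np.allclose(Z.mean(axis=0), 0.0, atol=1e-10)
```

This is what is meant above by normalising on the study period itself: the scaling is then determined by the data actually entering the reconstruction, not by factors outside it.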
Concerning our choice of proxies, page 1021, line 15 of our manuscript says: “Finally there is a `Union’ composite made using 18 independent northern hemisphere proxy series marked with “*” in table 1. These series have been chosen on the basis that the extend to 1980 (the HCA composites and the French tree ring series end earlier), the southern hemisphere series have been omitted apart from the Quelcaya glacier data, Peru, which are included ensure adequate representation of tropical temperatures. The MBH1999 North American PCs have been omitted in favour of individual series used in other studies. Finally the Polar Urals data of ECS2002, MBH1999 and the Tornetraesk data of MSH2005 have been omitted in favour of data from the same sites used by JBB1998 and ECS2002 respectively.”
Martin, your response is much appreciated. However, what are needed are a priori rules for proxy selection which are actually followed. You say:
While two of these are a priori proxy selection rules (must extend back to the year 1000, must go forward to 1980), the rest are just descriptions of proxy choices without giving any reason for the choice. Why the exclusion of later Tornetraesk data in favor of earlier? Why use the Tornetraesk data from ECS2002, but reject the Polar Urals data from the same source?
Finally, you have not explained the omission of the Indigirka series. Since this runs from the year 0 to 1996, it clearly fits your criteria. The Yamal series is used … why not the Indigirka? If those are your proxy selection rules as you state, you have not followed them in the Indigirka case.
In short, it is apparent that you have selected proxies on an “ad hoc” basis, and thus your study has no value. Let me pick the proxies as I wish, as you have done, and I will use your methodology to give you a very different result. For your study to be significant, you need to:
1) Establish clear a priori proxy selection criteria, and
2) Provide a rationale for the selection criteria, and
3) Include all of the proxies that meet the criteria.
You have not done this.
w.
PS – I know you don’t want to answer this, because I see you skate around it every time it comes up, so I will ask again. Why is Steve M. singled out in your paper by the false claim that he has withheld information, when your co-authors Briffa, Moberg, and Esper have withheld details of their data and methodology despite repeated requests? Why are they not mentioned in your paper?
I don’t like to push you on this, but I have asked it several times, including on the CPD discussion forum, without any reply. Your unwillingness to answer this question is … interesting.
Dear Willis,
your question about why I singled out Stephen for mention of non-provision of data (I was doing no more than echo what he said on his own web site at the time) has been answered elsewhere on this wiki. I guess the proliferation of different pages is not helping the transparency of the discussion. Anyway, the answer is simply that Stephen has made many assertions to the effect that software should be made available on publication. The fact that he didn’t follow this procedure for his 2005 Energy and Environment paper is … interesting.
The answer to your question about selection is also elsewhere on this blog, in a contribution from Stephen — so I think the information is available in the manuscript, but I will endeavour to make it clearer on revision: where several versions exist, the earliest version is used. The reason for this (not given in the manuscript) is to try to avoid any suggestion of being influenced, in the selection/rejection of data, by the results of the earlier publications. For the Indigirka — see Moberg et al. (corrigendum).
#57. Martin has evaded answering about Sargasso Sea, which goes back to 1000, as pointed out elsewhere.
The Moberg Corrigendum, which resulted from my inquiry and is familiar to readers of this blog, states as follows:
The data is available from one of the Mitrie coauthors; indeed, he's provided it to me. Sidorova et al. stated:
As to selection criteria – and Juckes has provided no selection criteria so far to justify excluding Sargasso Sea and Indigirka – Mitrie coauthor Esper stated:
#57
Dr. Juckes, you state:
So if I understand correctly, this personal attack on Steve McIntyre in a peer reviewed publication is a little revenge, because he dared give openness lessons to the "established" scientists. How civilized and polite!
Re # 57 and #59 – What is not stated is that Steve M will provide the information when requested while certain others will not provide it. THAT IS THE SIGNIFICANT PART!
RE: #57 – So in other words, you allowed an ad hominem element to influence your selection of whom to single out. It was particularly ad hominem since you resorted to a hair-splitting argument to say "he hath not the data" – when, of course, his own archiving and transparency far exceed most of what we see within this discipline. You wanted to make an example of the man who, as I imagine you feel, opened this can of worms in the first place. It was pure revenge.
Folks, if saying the software was not available is such a crime, why did Stephen say just that on his own website?
While we are on the issue, I’d appreciate a copy of c:/climate/data/mann/proxy.tab as used in the 2005 Energy and Environment paper. [this is a request for data, repeating an earlier request, not an accusation]
First, where did I say that the software "was not available"? Any climate scientist who thinks that climate2003.com is my primary website, rather than climateaudit.org, is living on another planet. The index page at climate2003.com was obviously stale. It said in January 2005 that the code would be available in a few days. It never said that the code was "not available". I exchanged code with one of your coauthors.
The table proxy.tab is the same as proxy.MBH which loads from the script. Repeating an answer already given.
Dear Martin,
I am sure you will appreciate that I think your rationalisation (not given in the manuscript) is “… interesting”. I fail to see how your procedure prevents you from cherry-picking series that give a desired result. In fact, much the opposite.
yours confused
per
Re #57,
Dear Martin,
This is astounding. You say:
First, let’s get one thing straight. Non-provision of data means you don’t provide it when you are asked. Steve is not guilty of that, you just couldn’t find it. Esper, Briffa, and Moberg are guilty of that. But that’s not the astounding part.
The astounding part is this. You are saying that you singled out Steve in your paper because he has always stood for openness and transparency, and failed to send you a notification that his data had been placed where he said he would place it. This you see as hypocrisy.
The clear implication, of course, is that that since Esper, Briffa, and Moberg have always stood for obfuscation, concealment, and trickery, and since they continue to practice that, there’s no need to mention that in your paper. There’s no hypocrisy there at all, they’re just doing what they always did.
Say what? So if Esper, Briffa, and Moberg actually release their data, will you bust them for being hypocritical and not following their code of dishonor?
Dr. Juckes, the data was there all along. Sure, Steve forgot to mention that it was there … but then, you forgot to look, didn’t you.
Also, please quit saying you’ve answered questions when you haven’t. I said:
You said:
NO. IT. IS. NOT. ANSWERED. ELSEWHERE. PLEASE. ANSWER. THE. QUESTION. ABOUT. INDIGIRKA.
I really hate to be petty about this, but handwaving and saying “I answered that elsewhere” WHEN YOU HAVEN’T just doesn’t cut it.
What about Indigirka? What about Sargasso? Both of them fit your criteria, and you didn’t use them. Why not?
w.
(crossposted to Juckes Omnibus)
Stephen, the script you have provided attempts to load c:/climate/data/mann/proxy.tab. This looks like a file to me (it is not loaded directly in ee2005.backup.txt, but in MBH98config4.txt, which is itself loaded by ee2005.backup.txt).
These are Mann’s proxies. The table is at http://www.climateaudit.org/data/ee/proxy.MBH.txt
re #67 – The link doesn’t work for me.
Re #68 It works for me.
re: #68 It did for me so either it’s been fixed or the problem is on your end.
Martin,
Could you clarify your choice for the Torneträsk data?
(eg I see “Tornetraesk” data plotting in the Baltic Sea.)
see map
http://home.casema.nl/errenwijlens/co2/swedenmap.htm
To my knowledge Schweingruber refers to the only real Torneträsk location.
and why is the “Tornetraesk” data plotting in the Baltic Sea?
Has anyone tried dimensional reduction methods other than PCA? Here are some newer non-linear techniques available on the internet as Matlab code.
#1) LLE (Locally Linear Embedding)
http://www.cs.toronto.edu/~roweis/lle/
#2) IsoMap
http://isomap.stanford.edu/
#3) Principal Curve Analysis
(R implementation)
http://rss.acs.unt.edu/Rdoc/library/pcurve/html/00Index.html
(Java Implementation)
http://www.iro.umontreal.ca/~kegl/research/pcurves/implementations/index.html
#4) KICA (Kernel Independent Component Analysis)
http://www.cs.berkeley.edu/~fbach/kernel-ica/
#5) KPCA (Kernel Principal Component Analysis) – Statistical Pattern Recognition free toolbox.
http://cmp.felk.cvut.cz/cmp/software/stprtool/
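To give a flavour of how one of these techniques looks in practice, here is a minimal numpy sketch of kernel PCA with an RBF kernel (technique #5 in the list above). This is an illustrative implementation on made-up data, not the STPRtool code linked above:

```python
import numpy as np

def kernel_pca(X, n_components=1, gamma=0.1):
    """Minimal kernel PCA with an RBF (Gaussian) kernel: builds the
    kernel matrix, centres it in feature space, and projects onto the
    leading kernel principal components. Illustrative sketch only."""
    # pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    K = np.exp(-gamma * D2)
    # centre the kernel matrix in feature space: K_c = J K J
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    # leading eigenvectors, scaled by 1/sqrt(eigenvalue)
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:n_components]
    alphas = evecs[:, idx] / np.sqrt(np.maximum(evals[idx], 1e-12))
    return Kc @ alphas

# toy data standing in for a proxy matrix (rows = years, columns = series)
rng = np.random.default_rng(2)
X = rng.standard_normal((60, 5))
scores = kernel_pca(X, n_components=2)
assert scores.shape == (60, 2)
```

Whether a non-linear method would be appropriate for proxy networks is a separate question; the sketch just shows that trying one is not a large undertaking.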
#71: Hans, looks like a typing error (by me) entering the Tornetraesk position, it should be 68N, 20E, which hopefully will keep it out of the Baltic. Thanks for pointing that out.
#67-70: I could download the file, but not read it with R. I’ve asked Stephen for clarification on page 894.
#72: Not that I know of. Most people in this field don’t even use Principal Component Analysis.