Do not post anything other than programming comments. Absolutely no piling on please.
September 16th, 2008 at 2:18 am edit
And this is just the beginning, Mann08 requires some time to comprehend. Code with lines
makes the task difficult. Specially, I’d like to find
( Funny, variance matching is well alive, SI:
The gridded proxy data were then areally weighted, spatially averaged over the target hemisphere, and scaled to have the same mean and decadal standard deviation (we refer to the latter as ‘‘variance matching’) as the target hemispheric (NH or SH) mean temperature series.
September 18th, 2008 at 5:17 am edit
More problems with the code, gridboxcps.m,
??? Out of memory. Type HELP MEMORY for your options.
Error in ==> gridboxcps at 132 saveregts=reg_ts(:,igood,:);
reg_ts seems to be too big for my computer
reg_ts 1996×2592×19 786392064 double
and seems to be almost full of NaNs. Any volunteers to optimize the code by applying sparse matrices ?
September 23rd, 2008 at 5:22 am edit
First match, quick way to reduce memory consumption is to run fewer proxy sets at a time,
for n=1:19 %loop over each proxy set (back to year n*100)
etc, for example. Now NH_whole_newgrid_HAD_gbeachyear.mat gbcpsfull(:,4) agrees with archived NH_had.mat gbcps for 1500-1600, so I guess I’m on the right track.
What does gridboxcps.m do ?
* Each gridbox instrumental is filtered with Mann’s smoothing algorithm
* Proxies are standardized (they seem to be Mann-smoothed already in griproxy.m; cutfreq 0.1 for ‘annual’ and 0.05 for ‘decadal’ proxies)
* All proxies in 5×5 grid are averaged
* Within the grid, proxy set is ‘calibrated’ with variance matching to temperature. Smooth first, calibrate then. What would happen vice versa ? I have to read ‘The Foundations of Variance Matching’ again..
(* Data is regridded to 15X15)
* Grid-reconstructions are area-weighted to get the hemispheric means, and variance matched again (!!) (with CRU or HAD as target).
September 24th, 2008 at 12:56 pm edit
This is probably inappropriate thread, #119-#121 might need a transfer. Anyway, gridboxcps.m is very interesting program to play with. For example, you can replace proxy data with instrumental data to see the effect of sampling errors in the NH mean. Leaves not much room for proxy errors.. Fig 3. CPS uncertainties seem to be about +- 0.15 C ( and not shown after 1850. Probably because there would be a conflict with CRU 😉 )
Re: Steve McIntyre (#65),
A question here: for a proxy located on a gridcell border e.g. 5S or 5W here, there are two equidistant “nearest” neighbors. This situation is not described in the text, but might be clarified in the code.
It is done in ‘propick.m’, see the part described as ‘great circle distance’. As far as I can see it works as follows:
1) calculate the distance (see the code for the formula) between the proxy location and all instrumental grid cell centres (loop ‘i’). The values are stored in vector ‘d1’. (BTW, notice here that comments for lat1/lon1 and lat2/lon2 are wrong (reversed):
‘X’ contains instrumental data and ‘Y’ proxies)
2) use build-in sort-function to sort ‘d1’. Sort-function arranges values from smallest to the largest and in the case of equal values, the value appearing first comes first (FIFO).
Hence, the order of equidistant neighbours depends on the order the intrumental grids appearing in ‘X’. ‘X’ comes from file ‘clidata’. ‘climdata’ is prepared in ‘doannsst.m’, where there is a description how the grid cells are arranged. Data is coming from ‘instfilled’, and the order should be the same (I’m too tired to check the code).
However, notice that there are additionally flags ‘enough’ and ‘toofew’ (found in intrumental data directory), which control if the grid cells are used: the locations in ‘enough’ are used and the ones in ‘toofew’ are not. According to ‘Readme.txt’ in the ‘instrument’ data folder, those grid cells which have less than 10% of annual data are disregarded.
From The Mystery in Kenya, 2008/09/23 at 3:04 PM
I have not looked in-depth at the code that resulted in this: reg_ts 1996×2592×19 786392064 double,
but, if it is possible, break it into 19 separate 1996×2592 blocks. MATLAB is grumpy when it comes to memory allocation. In particular, it requires contiguous memory for a single array. Even on my machine with 4 GB of RAM (well, 3.2 GB usable because of Winders), the largest array I’ve ever seen it create was on the order of 500 MB. In fact, you can run a check:
[uV sV] = memory
and it will return a element in uV called
Of course, I could play all day and never get a 958 MB array, but it will put you in the correct order of magnitude. 🙂
works with ver 7.4.,
Largest Contiguous Free Blocks:
1. [at 18e40000] 1050 MB (41a70000)
Not enough for gridboxcps. Separating blocks helps. How much memory Mann has, I wonder 😉
Re: UC (#4),
What about a 64 bit OS? Both XP and Vista are available in 64 bit versions. Would MATLAB take advantage of the increased memory address space? With 2GB sticks, most new motherboards support 8GB of memory with a 64 bit OS?
How much memory Mann has, I wonder
Well, if I were doing what he does and had grant money to fund it, I would have at least a high end workstation, with at least dual Xeon or Opteron CPU’s with 20 to 30 GB of ECC memory. a dedicated SCSI 320 RAID controller with a few terabytes of drive space and be running a high end 64 bit OS, not a desktop with a desktop OS. I suspect that explains some of the sloppy coding. It’s a variation on Parkinson’s Law: work expands to fill the time available. Code expands to fill the computer capacity, as exemplified by the progression from DOS to Vista. Progression probably isn’t the correct word to use.
Re: DeWitt Payne (#30),
I’ve tested servers with 64 GIG of Raid ECC memory.
Still couldn’t play Far Cry on 64-bit XP on it though. Stupid built in graphic cards. 🙂
If the application works OK in a 32 bit word environment, it’s a 32 bit application (or perhaps 16 bit, or even 8 bit)!
If you are using a [i]winders[/i] op/sys there shouldn’t be an [i]out of memory problem[/i] (unless your hard drive is nearly full) because the o/s uses a swap file that increases with its requirement (but it slows responses a lot). However, if the o/s is a UNIX based o/s the swap file is a fixed size.
Increasing the size of the swap file on a UNIX based system provides the advantage of an increased virtual memory.
Perhaps you need to increase the size of your swap file.
Best regards, suricat.
Re: suricat (#6),
There are 64 bit versions of Matlab available for multiple 64 bit OS’s, XP, Vista, Linux, etc.
Have considered giving Octave a shot?
It is fairly compatible with MatLab Compatibility.
I have a dumb question. It’s been quite a while since I last used MatLab. Back then, MatLab source code would be contained in basic text files. I have been looking at the *.mat files in http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/reconstructions/cps/
Those files seem to be in some kind of binary format. What am I missing? Where can I find the actual source code?
And, what do cps and eiv stand for?
I am not sure if I can be of any help and I am sorry I got stumped at such an elementary level. Any pointers would be much appreciated.
Apologies. I realized I was looking in the wrong directory. I have since found http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/
Steve, please feel free to delete my earlier posts.
MAT files are just data files that hold MATLAB variables upon executing a standard save command. They are not really binary (not like a bunch of packed 32-bit floats, for example), but MATLAB’s proprietary format.
64-bit OS may help, but it really depends upon Winders memory management. By going to 64-bit OS, and more memory, there is certainly more available for MATLAB to use. The problem regards contiguous space, which is not often granted by Winders (not that it tries to break memory up, I’ve just never been able to get large arrays even when I have the memory available).
From a coding standpoint, it doesn’t really make sense to store things in large 3D arrays in MATLAB anyway. First, that array needs to be generated prior to running the loop in which it gets used – reg_ts = zeros(1996,2592,19); – else MATLAB must redefine the array (and space it occupies) every time you iterate and add to it. This can immediately be the source of “out of memory” errors. It also slows MATLAB down considerably (keep in mind, the first run of a MATLAB script is interpreted, not compiled, and it does not optimize array growth even on consecutive runs). Also, there’s simply no benefit other than a reduction in the number of arrays that are used unless larger array operations are being performed (I can’t think of a reason for such an idea).
This is probably coals to Newcastle, but the Matlab site has a lot of information about out of memory errors. The process limit for 64 bit Matlab on a 64 bit OS, btw, is 8 TB compared to 2 or 3 GB on a 32 bit OS or a 32 bit version of Matlab.
I’ve uploaded an R script to calculate the two next nearest gridcells for each proxy using great-circle distances as scripts/mann.2008/nextnearest.txt and added two columns next1, next2 to the details data frame recording the jones cell #s of the two closest gridcell with Mann08 instrumental values.
The script scripts/mann.2008/temperature.correlation.txt calculates correlations to Mann instrumental data for a variety of cases: two by two for 1) actual gridcell and 2) two nearest with Mann instrumental values; a) infilled; b)”original”, and collects the rtable and SI versions, for the periods 1850-1995, 1896-1995 and 1850-1949.
The scripts should be turnkey or close to turnkey. I’ve “replicated” the rtable correlations up to .01 for all but 27 series (and up to .001 for most of these). This doesn’t mean that the underlying calculations are “right” in any sense, merely that I’m pretty close to replicating this step. The 27 oddballs are:
id next_infill rtable si
325 chil016 -0.077 0.062 NA
326 chil017 -0.151 -0.037 NA
339 co522 0.084 0.105 NA
395 fisher_1994_agassiz 0.165 0.258 0.5238
396 fisher_1995_agassiz 0.065 -0.065 NA
400 fisher_2002_devon 0.121 -0.116 NA
405 fisher_2002_plateau -0.203 -0.283 NA
492 lutannt12 0.746 0.864 0.8637
512 lutannt30 0.478 0.561 0.5614
526 lutannt43 0.835 0.846 0.8460
549 lutannt64 0.677 0.623 0.6229
576 meese_1994_gisp2 -0.080 0.062 NA
799 nv037 -0.078 0.061 NA
857 or026 0.202 -0.134 NA
910 schweingruber_mxdabd_grid100 0.222 0.363 0.3626
911 schweingruber_mxdabd_grid101 0.465 0.439 0.4387
936 schweingruber_mxdabd_grid20 0.374 0.302 0.3021
946 schweingruber_mxdabd_grid3 0.280 0.215 0.2146
952 schweingruber_mxdabd_grid35 0.233 0.306 0.3059
964 schweingruber_mxdabd_grid48 0.312 0.282 0.2820
982 schweingruber_mxdabd_grid65 0.363 0.428 0.4284
984 schweingruber_mxdabd_grid67 0.528 0.559 0.5593
992 schweingruber_mxdabd_grid78 0.160 0.289 0.2886
1000 schweingruber_mxdabd_grid87 0.310 0.374 0.3742
1004 schweingruber_mxdabd_grid90 0.450 0.397 0.3973
1017 smith_2004_thicknessmm 0.111 0.151 0.1513
1041 stenni_1999_talos 0.161 0.142 0.1424
Re: Steve McIntyre (#13),
**UPDATED: 23 September 2008 to correct errors with the longitude value in the “schweingruber mxd” grids.
Steve: I notice that they didn’t acknowledge us. It seems to be a concerted Team policy not to mention me or CA. The PNAS SI isn’t corrected yet. http://www.pnas.org/content/suppl/2008/09/02/0805721105.DCSupplemental/SD1.xls
I’ve sort of located the problem with series 910 and others may be similar. The gridded series coincides exactly with a gridcell (42.5N 107.5W). So there’s a tie for next nearest between 112.5W and 102.5W. The decision rule in propick.m according to Jean S is that tie goes to the Mann series that occurs first in his numbering system, which would be 112.5W. However the correlation to 102.5W is higher and this is the one that is used. Puzzling.
mannid mann lat long jones
1886 1886 1 42.5 -112.5 662
1887 1887 1 42.5 -107.5 663
1888 1888 1 42.5 -102.5 664
Re: Steve McIntyre (#14),
The decision rule in propick.m according to Jean S is that tie goes to the Mann series that occurs first in his numbering system, which would be 112.5W. However the correlation to 102.5W is higher and this is the one that is used. Puzzling.
propick.m looks at the two closest gridcells – by distance and then as pointed out, by order in the data. It works out the correlation to each cell. If correlation is good enough it picks the cell that has the best correlation.
Here is the code – I’ve snipped out other stuff – and one caveat – this is not a language I actually know. The correlations are done in a loop – variable name of k. It always loops twice, igrid=2, so once with k=1 and once with k=2. That’s the closest two cells or first in the list if tied on distance.
for k=1:igrid % %numbers of gridpoints near proxy site used to cal. corrcoef
… snip …
… snip …
So, at this point, if I follow the code aa(1,1) has the closest cell’s correlation and aa(2,1) has the next.
if sum(d3)>=igr % at least ‘igr’ correlation of ‘igrid’ pass the CL
Take the absolute values of aa, and find the largest value, into ra
I don’t understand the syntax of that bit. I’d expect iii and jjj to be the index of the element in bb that matches ra. So I can’t see why you would then sum it. However rra(n4) is going to end up set to the larger (positive or negative) in aa().
There are several different ways in which this whole thing screws up. mxd series aren’t really proxies – so the lat,long are exactly equal to the grid cells. And because the algorithm only considers two close cells, it will use the matching one – and then whichever one it hits next. There being 4 adjacent cells that are exactly equally close. setting igrid to 5 would have been more logical – although frankly why not just see if the proxy matches any temperature anywhere on the planet 🙂
The other issue, is what happens if there is no data for the instrumental grid cell, i.e the cell is named in the toofew list. This happens mainly close to the poles, so I’d guess it would affect ice core proxies. If there is no instrumental data, it’ll cast around for two cells with data. These might be some way apart and maybe 20 deg of lat away from the proxy location.
Re: Nick Moon (#17),
although frankly why not just see if the proxy matches any temperature anywhere on the planet 🙂
Cut the smiley. That’s what MBH98 did.
Re: UC (#21),
Ahh, but I thought we had moved on.
I’ve been too busy to download and try the latest MBH stuff (hopefully get more time soon and have a play with it) but I have been browsing this thread with interest. This caught my eye. I would stress I haven’t seen the full code or context so I may be missing something.
As noted by Mark T. and others, sum(iii) is a terrible way to test the length of the matrix. The behaviour of this test is going to be dependent on a number of different things; if bb is a row vector, iii will simply be a vector of ones and sum(iii) will indeed yield length (just very inefficiently with lots of extra computation compared to the length function). If bb is a column vector or matrix, iii will contain indices and this test will pass even for some non-vector results.
Interestingly, in the former case, a two-element vector will fail the test, because the code uses greater than two instead of greater than or equal to two. This would cause the code to fall over so I’m assuming this never happens?
But the curious thing is this. If the test passes (i.e. sum of iii greater than two), the code appears to just select the first result. But that isn’t what the code seems to do. Selection of the first result would be the code
but the code is actually
which is presumably the first entry in the data, which may not be the maximum at all? Have I got this right? That means under certain conditions where you get multiple values through, a high value will be replaced with garbage. If that is most likely to happen for the Luterbacher series, the code may yield results with a correlation value closer to zero than should actually be the case.
Repeated caveat: this is speculation from the code I’ve seen. I may have the context wrong.
If I’ve got the right end of the stick, Dr Mann may be interested to know that the max function in MATLAB also provides the index into the array as a second output. This is a neater solution to the apparent intent:
[maxValue, maxIndex] = max(abs(aa));
rra(n4) = aa(maxIndex(1));
The index into maxIndex will work whether maxIndex is a scalar, vector or matrix and is much simpler to follow. Also, Dr Mann, variables called “aa”, “bb” and “iii” really went out of programming fashion a few decades ago. MATLAB pre-processes the m-file into p-code before running it so there is no performance advantage to unreadable short names. And given the size of your arrays, I’m not convinced memory is an excuse, either.
UC, you may be able to reduce some of the memory by converting those large arrays to single precision. If the code were well written, this shouldn’t have much affect on the results. Of course, I think we already know the answer to that question (as Jean S alludes to above). Unfortunately that means using machines with different maths co-processors (even different versions of the same chip, let alone entirely different architectures) will yield different results as well…
Re: Spence_UK (#52),
Duh… penny drops that aa is already a restricted subset and aa(1,1) may have some special meaning… in which case aa(1,1) may do something not entirely unreasonable. I probably ought to find time to download the code and get some context before commenting 🙂
Re: Spence_UK (#53),
aa contains the correlation coefficients, between the centre of the grid cell and the proxy. Though there are only ever 1 or 2 values in aa. The values maybe + or -. So Mann creates a copy with the absolute values – bb. Then he finds the largest value in that array. Then he uses that value to find the index. Then he uses the index to find the original coefficient in aa. I think there is then a little bit of code to cope with the special case that both coefficients are the same, if so he just takes the first one. It’s hard to believe there isn’t a much cleaner way of doing it. But code like this probably gets tweaked and modified and probably some of this stuff was doing useful work in a previous version.
Re: Nick Moon (#54),
Thanks Nick, I finally downloaded some of the code to figure out more closely what was going on. Not the easiest code to follow.
One neater and more generic way to do it is in the third blockquote of my #52. But having looked at the code more closely, Mann’s method will work, although for fairly non-obvious reasons. The vector aa is constructed as a 2-element column vector, which means iii will contain the indices into the vector and jjj will just be a vector of ones. iii can therefore only take on three possible values; ,  or [1; 2]. jjj will simply be a vector of 1s as long as iii. Mann’s sum test will therefore always yield the correct answer, so long as igrid is 2. This test won’t work for igrid of 3 or more. Of course the value is hard wired, but the fact that it is parameterised gives it the feel of one of the many tweakable degrees of freedom available to the author – but the code doesn’t support it being used in this way.
As for the code being tweaked and modified – yes, this is pretty evident in the comments. One convenience of MATLAB is the ability to quickly and easily try out little tweaks and changes. And it appears Mann’s passion for little tweaks and changes haven’t abated any in recent years, no matter how much they can get him into hot water.
Here’s an interesting observation. On one pass around the larger outer loop, it could be set (and populated) to length 2. But say on a later pass, only one valid record may be found. This would still invoke the piece of code you list (since it requires a minimum of igr, 1 record, to be found). But then you get an inconvenient side effect – you’ve still got the correlation coefficient hanging around from the last time around the loop that you found 2 cases from.
This may have no consequence whatsoever in the actual application, but nevertheless, this sort of coding is just asking for mistakes, and of course underscores the need for the original code to make any kind of replication. Anyhow, I’ve downloaded most of the files now, just need to generate clidata as Steve notes above and then I’ll be able to get propick running.
I think I figured this out … a sort of. It seems to be a rounding/floating point issue! Variable ‘d1’ has the distances calculated. I checked for the proxy #910, and the three smallest distances are:
and these correspond to:
Now the difference 2nd smallest-3rd smallest is: -9.6634e-013 !!!
So the rule is as I (and Nick Moon) explained, but it happens due to floating point issues that according
to Mannian code 112.5W and 102.5W are not equally close to 107.5W …
I’m 100% sure that there is an incredible amount of pointless calculation in this.
Yes, that’s true. My favorite line (from ‘propick.m’) so far:
Just so every one is clear about memory management on PC’s:
In a workstation Windows 32bit operating system (eg Windows 2000 Professional; Windows XP Pro) the maximum addressable memory by a program (single executable process) is 2 Gb.
For a server version of Windows 32bit OS the maximum addressable memory by a program is 3 Gb.
For a 32bit Linux operating system running on a PC the maximum addressable memory is 3.6 Gb.
A 32bit program will usually run without difficulty on a 64bit operating system but it will only run as a 32bit application and will still have the 32bit memory addressing limitations.
A 64bit version of a program running on a 64bit machine will, to all practical purposes, have no memory addressing limits on most current PC hardware and the limitation will then be the available physical memory which is installed in the machine less the operating system overhead.
Re: Ashley (#15),
Theoretical maximum addressable value for a 64 bit OS is 2 ^ 64 = 18446744073709551616 bits.
Windows 64 imposes a synthetic limit of 8 Gb (8 000 000 000). Many Main (Mother) Boards also have BIOS limits that will not allow more than 8 Gb and some 4 and 3.2 Gb limits, regardless of the OS. Intrinsic math functions (on die) on the cpu / alu and math processer clip functions such as (arctan) ^ (-1) (inverse arc tan) to 16 bits or 8 bits on another brand of board. Check the Intel web site for advanced math functions handled on cpu. Variations occur markedly between the Centrino, Pentium D, Core2, Core2 Duo, Core2 Quad within the same die series.
Speed and cost are the reasons.
Swap space in windows is calculated as 2.3 X physical ram =42427511369531968716.8 (2^64)X2.3 total avaiable ram. This can be overridden by the Administrator.
To use all the hard disk one must have a minimum capacity of OS + Program + Swap Space which means you require a single disk with greater than 42427511369531968716 bits on a single platter and single head to get a contiguous file stored.
The Unix workaround above does not work because the CPU assigns the size of the memory addresses. From an Oil and Gas exploration program with matrix upto 1000 X 1000 double floats (lats and longs) non standard methods to handle the data had to be employed, such as checking for matrix symmetry. The complete control of all memory handling functions was removed from the OS to the program and done in C C++ and Assembler for BSD then Solaris. This generated an extra 20 000 lines of code. Even at this level any data input errors could cause buffer under and overflows.
When you find a disc of these dimensions, put me down for one.
A copy of a post at linuxformat.co.uk on the Sun Solaris filesystem, which allows many distinct drives to be used to create a single pool:
Re: The LXF Test: OpenSolaris 2008.05 (Score: 1)
by pwbrum61 on Aug 20, 2008 – 12:03 PM
(User information | Send a message)
SUNW – Stanford University Network Workstation It’s a throwback from Sun’s early days SFW – Sunfreeware (I think) form http://www.sunfreeware.com With ZFS there’s no need to “grow” a filesystem or expand it. In fact there’s no need to newfs (Solaris UFS equivalent of mkfs) it. If, for example, you have a system with 5 drives in it you can combine them into a raidz (variation on RAID5) pool (apologies for using Solaris device names, but you sould get the gist): zpool create testpool raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 Then create your zfs filesystem: zfs create testpool/filesys1 The filesystem will automatically be mounted for you, so df -h would show something like : testpool 977G 431K 72G 1% /testpool testpool/filesys1 977G 21M 72G 1% /testpool/filesys1 (ignore the sizes above, it’s simply an extract from a live system to illustrate) If you wanted to restrict the size of a filesystem you could simply set a quota when creating it: zfs create -o quota=100G testpool/filesys2 or for an existing filesystem: zfs set quota=80G testpool/filesys2 In the example above, entire drives are used to create the pool. It’s also possible to use slices (Solaris equivalent of partitions): zpool create testpool2 raidz c0t0d0s3 c0t1d0s3 c0t2d0s3 If/when zfs does make its way into Linux it could make life a lot easier during install for newcomers Paul
OK, let’s test your memory. Type
(from gridboxcps.m )
Who’s Matlab survives through these two commands ?
Re: UC (#20),
Without a doubt the two largest arrays I’ve ever had in a MATLAB session.
I don’t understand the syntax of that bit. I’d expect iii and jjj to be the index of the element in bb that matches ra. So I can’t see why you would then sum it.
Yet another case of Mann not understanding MATLAB, IMO. If he wants to test whether there are more than two instances of bb==ra, all he needs to do is use length(iii) rather than sum(iii).
Do I get some quatloos for getting it to run without crashing? I’m guessing, btw, that I can’t do a whole lot with MATLAB once I’ve scarfed up that much memory for two arrays. I probably can’t do much with Winders, either (I’m not working very hard at the moment, so it matters not, hehe). 🙂
I’m 100% sure that there is an incredible amount of pointless calculation in this. The guts of MBH98 can be boiled down to a few lines of code. Mann has a tendency to repeat the same operation tens of thousands of times.
The purpose of getting this to work is to figure out what RegEM is actually doing. He obviously has code that produces results, because he has produced results. The key issues are exactly the same as MBH PC – what are the properties of the method?
But quite aside from the esoterica of obscure Mannian methods, there’s quite a lot that’s interesting in the proxies themselves. At the end of the day, after using gigabytes of memory and teraflops of operations, all the operations appear to be linear (a point that Tapio Schneider confirmed for his RegEM), so one ends up with 1209 coefficients, which can vary somewhat.
If anyone can get the pig to work, I’d like it to be tweaked so that the program keeps track of these coefficients – something that Mann should have done, so we can then show maps of the proxy weights, as I did for MBH.
Re: Steve McIntyre (#25),
At the end of the day, after using gigabytes of memory and teraflops of operations, all the operations appear to be linear (a point that Tapio Schneider confirmed for his RegEM), so one ends up with 1209 coefficients, which can vary somewhat.
That’s exactly what PCA (et. al.) is designed to do: generate a set of coefficients that are used to linearly combine the observations into a single output vector. Different sets of coefficients yield different principal components. In a comm system, the coefficients would represent the variance of individual multipaths. You’d then shift and weight each input according to each coefficient (which are spaced in time), sum the result and the result is your original (desired) signal. The difference in the latter case, however, is that you know a priori what the signal was, or at least how it was constructed, so you know what your output is as well.
Part of what makes this code so unwieldy is the fact that it processes ALL of the data as one big block, rather than breaking it up into smaller sections and generating a different set of weights for each section. Of course, this goes to the entire stationarity argument, and past sections do not have the temperature recordings to correlate against, making it very difficult to determine which component is actually temperature, hehe.
Ooops, when I said “Different sets of coefficients yield different principal components.” I should have said “Different sets of coefficients yield different sources.”
Based on the english in some of the readme files, I’m going to guess that they were written by somebody whose native language is not english.
I wonder how much of the code was written by Mann?
Steve: Which individual wrote the code is irrelevant for our purposes here, just as much as it’s irrelevant to a third party which Pfizer scientist developed a patent. From the directory labels, Zhang probably wrote some of the code; Rutherford probably wrote others, but unless you connect this to something in the output, forget about this line of analysis please.
Patrick M: The readme files were probably written by somebody who’s learned new languages and ways of thinking; climatesciencespeak and computerprogrammingtalk.
Oh, I forgot. It would be interesting to know if there are any proxies that are “significant/insignificant by chance”. That is, if there are any proxies whose selection depends from the floating point issues in Matlab…
#31. Too funny. The Little Shop of Horrors has definitely opened a new franchise.
#33. Do you mean that they are insignificant with a “correct” pick two rule under Mannian order, but significant with an “incorrect” pick two?
Jean S, have you located the low-frequency code? I presume that this will be Butterworth filters or something like that?
#35: Both ways. Those whose picking depends on how you order the equal distance cells. BTW, the ‘propick.m’ code is clearly designed to implement “pick m” rules. Would be interesting to know why they chose 2, not 1 or 3?
#34: Yes. The correlations are calculated in ‘propickl.m’. It inputs the “low frequency” splits ‘clidatal’ (instrumental) and ‘itrdbmatrixl’ (proxies). ‘clidatal’ is supposedly calculated in ‘clihybrid.m’ (which seems to be missing; at least I can’t locate it) and ‘itrdbmatrixl’ in calculated in ‘proxyhybrid.m’. Here’s the relevant filtering part (additionally there seems to be a lot of “normalization” going on, which is so typical to Mann):
%%%% Apply the zero-phase butterworth filter but only to the proxies
clear y y2
frequency = 0.05;
[smoothed0,icb,ice,mse0] = lowpassmin(data,frequency);
So before choosing this 2nd latest Mannian filtering (choose the “best” “end condition”; ‘lowpassmin.m’) at least Mannian filtering version 2004 (with “minimum roughness”; ‘lowpass(data,0.05,2,2)’) and stardard Butterworth in Matlab (‘butter’ + ‘filtfilt’) were used. How come he needs to test so many things if his results are so robust?
Re: Jean S (#37),
I found this bit of code confusing because in general I don’t like to divide numbers by zero, even if I’m dividing zero by zero.
This is apparently Mannian long-hand for
lowf = highf = NaN(nrows,ncols);
which would be much easier to read. I can see why it is so hard to follow what Mann does in his code because he changes direction in the code so many times and leaves bits and pieces of old methods hanging around in the code.
Re: Rusty Scott (#48),
This is apparently Mannian long-hand for lowf = highf = NaN(nrows,ncols);
Curiously enough, I have an older version of matlab and the NaN function doesn’t seem to work on my system. Maybe that piece of code was written on an earlier version of Matlab.
That’s the definition of ad-hoc, right?
#18. Although they said it was updated on Sept 23 in this statement, elsewhere they say Sep 25 which seems more likely. Not that it matters particularly, but this may be helpful for those readers who get cross at me when I do not take everything at face value
I thought there was also an offset b coefficient for each proxy. y=mx+b; So m and b are 2418 total coefficients.
Re: Jeff Id (#40),
The intercepts don’t matter that much so I would still say there are 1209 coefficients. After all, the analysis seems to be focused on correlations.
Re: DeWitt Payne (#30), based on the header information in the *.mat files, the code was run on some flavor of 64 bit Linux platform. I am not sure if that counts as a desktop OS.
MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 18 20:51:53 2007
MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Thu Mar 27 12:54:48 2008
MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Thu Jul 26 14:48:43 2007
MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 18 20:51:53 2007
MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Thu Mar 27 12:54:48 2008
MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Thu Jul 26 14:48:43 2007
Not that this is of any help in terms of understanding what is going on.
Re: Sinan Unur (#41),
based on the header information in the *.mat files, the code was run on some flavor of 64 bit Linux platform. I am not sure if that counts as a desktop OS.
Yeah it can. I had Fedora Core 5 running on my machine for a while (which was a broken version, actually). I spend most of my at-home pooter time playing games so that went out the window for Winders (sort of punny) rather quickly. Lots of Linux users running around. Btw, the nice thing about MAT files is that they are portable from one OS to another! Yay.
So, does anyone have this whole thing working? I’ve got a pretty clean XP build at home and potentially some free time on my hands.
Re: Sinan Unur (#41), This comment was written on a 64-bit Linux workstation.
I thought there was also an offset b coefficient for each proxy. y=mx+b; So m and b are 2418 total coefficients.
No, that wouldn’t make any sense nor would it matter.
Therefore the offset and final result cannot be calculated without both.
Uh, no, not at all. And yes, correlation is independent of scale since you normalize. DC offset is arbitrary, and contributions from one are no different than from another, and therefore immaterial. They should also be removed prior to application of any 2nd order statistical analyses, but Mann does not quite do it correctly (this is because offsets correlate perfectly, which will bias results that are based on variances).
Correlation is ‘allegedly’ independent of the multiplier as well. I don’t recall the specific phrase but I believe Mann stated something to that effect. Therefore the offset and final result cannot be calculated without both.
Oops, clicked on the wrong link for the second one. I was trying to get Jeff Id’s #42.
Jeff Id, if RegEM is some flavor of linear fitting, the final result will likely be (or be able to be recast) in the form:
“temperature” = proxy1*m1 + proxy2*m2 + proxy3*m3 + …. + proxyn*mn +b
So we’d end up with 1210 coefficients, the majority of which will likely be ~ 0 …
Exactly, and then the entire “global temperature” is represented by one or two trees in Colorado. Them are some pretty nifty treemometers.
Can’t do the double equal thing in MATLAB like you can in C, btw. Mann just isn’t as familiar with the language as he could (or should) be. You’re right, it is confusing at times.
JEan S generated clidata, which I’ll upload to http://www.climateaudit.org/data/mann.2008/data/replication in a few minutes.
OK, got doannsst.m and propick.m running okay. Just needed to redirect the loads and saves.
I think the coding flaw I outlined in #55 seems moot, since it seems to always pick up two locations (i.e. sum(d3) appears always 2, never 0 or 1, at least in the first half of propick). Buggy software along execution paths that are never tested… I probably shouldn’t be surprised!
In Mann’s proxyinfill program, he uses (or says that he uses) a dataset 1227proxynamenolu.txt. “nolu” is a shorthand for “no Luterbacher”. The prefix “1227” doesn’t tie into present nomenclature which has things like 1209proxy … etc and presumably comes from an unreported earlier iteration with 1227 proxies. However this data set has 1137 series. All 1137 are in the 1209. HOwever 1209 minus 71 Luterbacher = 1138. What’s the missing series? The rain in Spain/Kenya (rodrigo).
The starting point of Mann’s program is the infilling of proxies. This might be a good place to watch how the RegEM works as well.
Re: Steve McIntyre (#59),
This might be a good place to watch how the RegEM works as well.
Got some code from Jean S.. The truth is out: EIV-Files go soon public 🙂
Re: UC (#61),
here’s a small preview what to expect 😉 Some proxies weighted according to their weight in the (RegEM) EIV-reconstruction (nhnhscrihad). Blue indicates positive weight and red negative. The lower right corner is the actual (low split) reconstruction (for the step) scaled to the final scaling.
It is rather interesting how some proxies respond with a different sign depending on the century they are used…
Are people still having trouble running out of memory trying to run this? Is running out of memory holding anyone up still?
It is running now
Name Size Bytes Class
saveregts 1996x1728x19 524261376 double
reg_ts 1996x1824x19 553387008 double
(as I updated the computer:)
Maximum possible array: 27353 MB (2.868e+010 bytes) *
Memory available for all arrays: 27353 MB (2.868e+010 bytes) *
Memory used by MATLAB: 1879 MB (1.971e+009 bytes)
Physical Memory (RAM): 16267 MB (1.706e+010 bytes)
* Limited by System Memory (physical + swap file) available.
ok, where were we..
[…] continue Mann 2008 – Replication with EIV. To run regreclow.m and regrechigh.m you’ll need files climate, eofnumb, and proxyN, […]