Mann 2008 – Replication

Do not post anything other than programming comments. Absolutely no piling on please.

63 Comments

  1. UC
    Posted Sep 24, 2008 at 1:03 PM | Permalink

    UC writes:

    September 16th, 2008 at 2:18 am

    And this is just the beginning, Mann08 requires some time to comprehend. Code with lines

    pp=load('c:\scozztemann3\newtemp\CRU_NH_reform');

    and

    nnn=load('/holocene/s1/zuz10/work1/temann/newtemp/HAD_NH_reform')

    makes the task difficult. Especially, I'd like to find

    c:\scozztemann3\newtemp\nhhinfxxxhad

    ( Funny, variance matching is well alive, SI:

    The gridded proxy data were then areally weighted, spatially averaged over the target hemisphere, and scaled to have the same mean and decadal standard deviation (we refer to the latter as ‘‘variance matching’) as the target hemispheric (NH or SH) mean temperature series.

    )
    #119
    UC:
    September 18th, 2008 at 5:17 am

    More problems with the code, gridboxcps.m,

    ??? Out of memory. Type HELP MEMORY for your options.

    Error in ==> gridboxcps at 132 saveregts=reg_ts(:,igood,:);

    reg_ts seems to be too big for my computer

    reg_ts 1996×2592×19 786392064 double

    and seems to be almost full of NaNs. Any volunteers to optimize the code by applying sparse matrices ?
    #120
    UC:
    September 23rd, 2008 at 5:22 am

    First match. A quick way to reduce memory consumption is to run fewer proxy sets at a time:

    replace

    for n=1:19 %loop over each proxy set (back to year n*100)

    by

    for n=1:9

    etc, for example. Now NH_whole_newgrid_HAD_gbeachyear.mat gbcpsfull(:,4) agrees with archived NH_had.mat gbcps for 1500-1600, so I guess I’m on the right track.

    What does gridboxcps.m do ?

    * Each gridbox instrumental is filtered with Mann’s smoothing algorithm

    * Proxies are standardized (they seem to be Mann-smoothed already in griproxy.m; cutfreq 0.1 for ‘annual’ and 0.05 for ‘decadal’ proxies)

    * All proxies in 5×5 grid are averaged

    * Within the grid, the proxy set is 'calibrated' with variance matching to temperature (a sketch of this step follows the list). Smooth first, calibrate then; what would happen vice versa? I have to read 'The Foundations of Variance Matching' again..

    (* Data is regridded to 15X15)

    * Grid-reconstructions are area-weighted to get the hemispheric means, and variance matched again (!!) (with CRU or HAD as target).
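
    A minimal sketch of the variance-matching step quoted from the SI above (my own illustration, not Mann's code; the variable names are made up, and reading "decadal standard deviation" as the std of a 10-yr smoothed series is an assumption):

    % Toy calibration-period inputs (stand-ins, not real data)
    yrs       = (1850:1995)';
    target    = randn(numel(yrs),1);                 % stand-in for the CRU/HAD NH mean
    composite = 0.4*target + randn(numel(yrs),1);    % stand-in proxy composite
    % Rescale the composite to the same mean and "decadal" std as the target
    decsmooth = @(x) filter(ones(10,1)/10, 1, x);    % crude 10-yr boxcar smoother
    s         = std(decsmooth(target)) / std(decsmooth(composite));
    matched   = (composite - mean(composite)) * s + mean(target);
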
    #121
    UC:
    September 24th, 2008 at 12:56 pm

    This is probably the wrong thread for these; #119-#121 might need a transfer. Anyway, gridboxcps.m is a very interesting program to play with. For example, you can replace proxy data with instrumental data to see the effect of sampling errors in the NH mean. That leaves not much room for proxy errors.. Fig. 3 CPS uncertainties seem to be about +- 0.15 C (and not shown after 1850, probably because there would be a conflict with CRU 😉 )

  2. Jean S
    Posted Sep 24, 2008 at 1:04 PM | Permalink

    Jean S

    Re: Steve McIntyre (#65),

    A question here: for a proxy located on a gridcell border e.g. 5S or 5W here, there are two equidistant “nearest” neighbors. This situation is not described in the text, but might be clarified in the code.

    It is done in ‘propick.m’, see the part described as ‘great circle distance’. As far as I can see it works as follows:
    1) calculate the distance (see the code for the formula) between the proxy location and all instrumental grid cell centres (loop ‘i’). The values are stored in vector ‘d1’. (BTW, notice here that comments for lat1/lon1 and lat2/lon2 are wrong (reversed):
    ‘X’ contains instrumental data and ‘Y’ proxies)
    2) use the built-in sort function to sort 'd1'. The sort function arranges values from smallest to largest and, in the case of equal values, the value appearing first comes first (FIFO).
    Hence, the order of equidistant neighbours depends on the order in which the instrumental grid cells appear in 'X'. 'X' comes from the file 'clidata'. 'clidata' is prepared in 'doannsst.m', where there is a description of how the grid cells are arranged. Data is coming from 'instfilled', and the order should be the same (I'm too tired to check the code).

    However, notice that there are additionally flags 'enough' and 'toofew' (found in the instrumental data directory), which control whether the grid cells are used: the locations in 'enough' are used and the ones in 'toofew' are not. According to 'Readme.txt' in the 'instrument' data folder, those grid cells which have less than 10% of annual data are disregarded.
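
    For anyone who wants to poke at the tie-breaking behaviour without the full file set, here is a minimal sketch of the nearest-gridcell search as described above (my own reconstruction, not propick.m; the spherical law of cosines is used here and may differ in detail from the formula in the actual code):

    R = 6371;                                                % Earth radius, km
    [glon, glat] = meshgrid(-177.5:5:177.5, -87.5:5:87.5);   % 2592 centres of 5x5 cells
    glon = glon(:); glat = glat(:);
    plat = 42.5; plon = -107.5;                              % example proxy location
    % great-circle distance to every cell centre (clamped to avoid a complex acos)
    d1 = R * acos(min(1, sind(plat).*sind(glat) + cosd(plat).*cosd(glat).*cosd(plon - glon)));
    [dsort, idx] = sort(d1);    % MATLAB's sort is stable: exact ties keep input order
    nearest2 = idx(1:2);        % the two cells a propick-style rule would consider
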

    From The Mystery in Kenya, 2008/09/23 at 3:04 PM

  3. Mark T.
    Posted Sep 24, 2008 at 1:27 PM | Permalink

    UC,

    I have not looked in-depth at the code that resulted in this: reg_ts 1996×2592×19 786392064 double,
    but, if it is possible, break it into 19 separate 1996×2592 blocks. MATLAB is grumpy when it comes to memory allocation. In particular, it requires contiguous memory for a single array. Even on my machine with 4 GB of RAM (well, 3.2 GB usable because of Winders), the largest array I’ve ever seen it create was on the order of 500 MB. In fact, you can run a check:

    [uV sV] = memory

    and it will return an element in uV called

    MaxPossibleArrayBytes: 958464000

    Of course, I could play all day and never get a 958 MB array, but it will put you in the correct order of magnitude. 🙂
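
    If anyone wants to try the "19 separate blocks" idea with minimal surgery, one option is a cell array, since each cell is allocated separately and only ~41 MB of contiguous space is needed at a time (a sketch only, not tested against the actual script):

    nsets  = 19;
    reg_ts = cell(nsets, 1);
    for n = 1:nsets
        reg_ts{n} = NaN(1996, 2592);     % one 1996x2592 slice per proxy set (~41 MB each)
    end
    % downstream indexing then changes from reg_ts(:, igood, n) to reg_ts{n}(:, igood)
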

    Mark

  4. Posted Sep 24, 2008 at 1:41 PM | Permalink

    Thks Mark,

    feature('memstats')

    works with ver. 7.4,

    Largest Contiguous Free Blocks:
    1. [at 18e40000] 1050 MB (41a70000)

    Not enough for gridboxcps. Separating blocks helps. How much memory Mann has, I wonder 😉

    • DeWitt Payne
      Posted Sep 24, 2008 at 2:51 PM | Permalink

      Re: UC (#4),

      What about a 64 bit OS? Both XP and Vista are available in 64 bit versions. Would MATLAB take advantage of the increased memory address space? With 2 GB sticks, most new motherboards support 8 GB of memory under a 64 bit OS.

    • DeWitt Payne
      Posted Sep 25, 2008 at 12:36 PM | Permalink

      Re: UC (#4),

      How much memory Mann has, I wonder

      Well, if I were doing what he does and had grant money to fund it, I would have at least a high end workstation, with at least dual Xeon or Opteron CPUs, 20 to 30 GB of ECC memory, a dedicated SCSI 320 RAID controller with a few terabytes of drive space, and a high end 64 bit OS, not a desktop with a desktop OS. I suspect that explains some of the sloppy coding. It's a variation on Parkinson's Law: work expands to fill the time available. Code expands to fill the computer capacity, as exemplified by the progression from DOS to Vista. Progression probably isn't the correct word to use.

      • Sam Urbinto
        Posted Sep 25, 2008 at 1:48 PM | Permalink

        Re: DeWitt Payne (#30),

        I’ve tested servers with 64 GIG of Raid ECC memory.

        Still couldn’t play Far Cry on 64-bit XP on it though. Stupid built in graphic cards. 🙂

  5. suricat
    Posted Sep 24, 2008 at 3:36 PM | Permalink

    If the application works OK in a 32 bit word environment, it’s a 32 bit application (or perhaps 16 bit, or even 8 bit)!

    If you are using a winders op/sys there shouldn't be an out of memory problem (unless your hard drive is nearly full) because the o/s uses a swap file that increases with its requirement (but it slows responses a lot). However, if the o/s is a UNIX based o/s the swap file is a fixed size.

    Increasing the size of the swap file on a UNIX based system provides the advantage of an increased virtual memory.

    Perhaps you need to increase the size of your swap file.

    Best regards, suricat.

  6. Posted Sep 24, 2008 at 4:33 PM | Permalink

    Have you considered giving Octave a shot?

    It is fairly compatible with MatLab.

    — Sinan

  7. Posted Sep 24, 2008 at 5:00 PM | Permalink

    I have a dumb question. It’s been quite a while since I last used MatLab. Back then, MatLab source code would be contained in basic text files. I have been looking at the *.mat files in http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/reconstructions/cps/

    Those files seem to be in some kind of binary format. What am I missing? Where can I find the actual source code?

    And, what do cps and eiv stand for?

    I am not sure if I can be of any help and I am sorry I got stumped at such an elementary level. Any pointers would be much appreciated.

    Sinan

  8. Posted Sep 24, 2008 at 5:05 PM | Permalink

    Apologies. I realized I was looking in the wrong directory. I have since found http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/

    Steve, please feel free to delete my earlier posts.

    — Sinan

  9. Mark T
    Posted Sep 24, 2008 at 5:25 PM | Permalink

    MAT files are just data files that hold MATLAB variables upon executing a standard save command. They are binary, but not a bare dump of packed 32-bit floats; they use MATLAB's proprietary format.

    64-bit OS may help, but it really depends upon Winders memory management. By going to 64-bit OS, and more memory, there is certainly more available for MATLAB to use. The problem regards contiguous space, which is not often granted by Winders (not that it tries to break memory up, I’ve just never been able to get large arrays even when I have the memory available).

    From a coding standpoint, it doesn’t really make sense to store things in large 3D arrays in MATLAB anyway. First, that array needs to be generated prior to running the loop in which it gets used – reg_ts = zeros(1996,2592,19); – else MATLAB must redefine the array (and space it occupies) every time you iterate and add to it. This can immediately be the source of “out of memory” errors. It also slows MATLAB down considerably (keep in mind, the first run of a MATLAB script is interpreted, not compiled, and it does not optimize array growth even on consecutive runs). Also, there’s simply no benefit other than a reduction in the number of arrays that are used unless larger array operations are being performed (I can’t think of a reason for such an idea).
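
    A small illustration of the growth-versus-preallocation point (generic MATLAB, nothing to do with the Mann code itself; timings vary by version and machine):

    % growing the array: reallocates and copies on every pass
    tic
    a = [];
    for k = 1:500
        a = [a, randn(2000,1)];
    end
    toc
    % preallocating once: no reallocation inside the loop
    tic
    b = zeros(2000, 500);
    for k = 1:500
        b(:,k) = randn(2000,1);
    end
    toc
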

    Mark

  10. DeWitt Payne
    Posted Sep 24, 2008 at 5:34 PM | Permalink

    This is probably coals to Newcastle, but the Matlab site has a lot of information about out of memory errors. The process limit for 64 bit Matlab on a 64 bit OS, btw, is 8 TB compared to 2 or 3 GB on a 32 bit OS or a 32 bit version of Matlab.

  11. Steve McIntyre
    Posted Sep 24, 2008 at 6:04 PM | Permalink

    I've uploaded an R script to calculate the two next nearest gridcells for each proxy using great-circle distances, as scripts/mann.2008/nextnearest.txt, and added two columns next1, next2 to the details data frame recording the jones cell #s of the two closest gridcells with Mann08 instrumental values.

    The script scripts/mann.2008/temperature.correlation.txt calculates correlations to Mann instrumental data for a variety of cases, two by two: 1) actual gridcell and 2) two nearest with Mann instrumental values; a) infilled and b) "original". It collects the rtable and SI versions for the periods 1850-1995, 1896-1995 and 1850-1949.

    The scripts should be turnkey or close to turnkey. I've "replicated" the rtable correlations to within .01 for all but 27 series (and to within .001 for most of these). This doesn't mean that the underlying calculations are "right" in any sense, merely that I'm pretty close to replicating this step. The 27 oddballs are:

    id next_infill rtable si
    325 chil016 -0.077 0.062 NA
    326 chil017 -0.151 -0.037 NA
    339 co522 0.084 0.105 NA
    395 fisher_1994_agassiz 0.165 0.258 0.5238
    396 fisher_1995_agassiz 0.065 -0.065 NA
    400 fisher_2002_devon 0.121 -0.116 NA
    405 fisher_2002_plateau -0.203 -0.283 NA
    492 lutannt12 0.746 0.864 0.8637
    512 lutannt30 0.478 0.561 0.5614
    526 lutannt43 0.835 0.846 0.8460
    549 lutannt64 0.677 0.623 0.6229
    576 meese_1994_gisp2 -0.080 0.062 NA
    799 nv037 -0.078 0.061 NA
    857 or026 0.202 -0.134 NA
    910 schweingruber_mxdabd_grid100 0.222 0.363 0.3626
    911 schweingruber_mxdabd_grid101 0.465 0.439 0.4387
    936 schweingruber_mxdabd_grid20 0.374 0.302 0.3021
    946 schweingruber_mxdabd_grid3 0.280 0.215 0.2146
    952 schweingruber_mxdabd_grid35 0.233 0.306 0.3059
    964 schweingruber_mxdabd_grid48 0.312 0.282 0.2820
    982 schweingruber_mxdabd_grid65 0.363 0.428 0.4284
    984 schweingruber_mxdabd_grid67 0.528 0.559 0.5593
    992 schweingruber_mxdabd_grid78 0.160 0.289 0.2886
    1000 schweingruber_mxdabd_grid87 0.310 0.374 0.3742
    1004 schweingruber_mxdabd_grid90 0.450 0.397 0.3973
    1017 smith_2004_thicknessmm 0.111 0.151 0.1513
    1041 stenni_1999_talos 0.161 0.142 0.1424

  12. Steve McIntyre
    Posted Sep 24, 2008 at 7:03 PM | Permalink

    I’ve sort of located the problem with series 910 and others may be similar. The gridded series coincides exactly with a gridcell (42.5N 107.5W). So there’s a tie for next nearest between 112.5W and 102.5W. The decision rule in propick.m according to Jean S is that tie goes to the Mann series that occurs first in his numbering system, which would be 112.5W. However the correlation to 102.5W is higher and this is the one that is used. Puzzling.

    mannid mann lat long jones
    1886 1886 1 42.5 -112.5 662
    1887 1887 1 42.5 -107.5 663
    1888 1888 1 42.5 -102.5 664

    • Nick Moon
      Posted Sep 25, 2008 at 4:05 AM | Permalink

      Re: Steve McIntyre (#14),

      The decision rule in propick.m according to Jean S is that tie goes to the Mann series that occurs first in his numbering system, which would be 112.5W. However the correlation to 102.5W is higher and this is the one that is used. Puzzling.

      propick.m looks at the two closest gridcells – by distance and then as pointed out, by order in the data. It works out the correlation to each cell. If correlation is good enough it picks the cell that has the best correlation.

      Here is the code – I’ve snipped out other stuff – and one caveat – this is not a language I actually know. The correlations are done in a loop – variable name of k. It always loops twice, igrid=2, so once with k=1 and once with k=2. That’s the closest two cells or first in the list if tied on distance.

      for k=1:igrid % %numbers of gridpoints near proxy site used to cal. corrcoef
      n2=0; d2=[];

      … snip …

      rr=corrcoef(d2);
      aa(k,1)=(rr(1,2));

      … snip …
      end

      So, at this point, if I follow the code aa(1,1) has the closest cell’s correlation and aa(2,1) has the next.

      if sum(d3)>=igr % at least ‘igr’ correlation of ‘igrid’ pass the CL
      n4=n4+1;

      bb=abs(aa);
      ra=max(bb);

      Take the absolute values of aa, and find the largest value, into ra

      [iii,jjj]=find(bb==ra);
      if sum(iii)>2
      rra(n4)=aa(1,1); %07/20/06
      else
      rra(n4)=aa(iii,jjj);
      end

      d4(n4)=i+1;
      end

      I don’t understand the syntax of that bit. I’d expect iii and jjj to be the index of the element in bb that matches ra. So I can’t see why you would then sum it. However rra(n4) is going to end up set to the larger (positive or negative) in aa().

      There are several different ways in which this whole thing screws up. The mxd series aren't really proxies, so the lat/long are exactly equal to the grid cells. And because the algorithm only considers two close cells, it will use the matching one and then whichever one it hits next, there being 4 adjacent cells that are exactly equally close. Setting igrid to 5 would have been more logical, although frankly why not just see if the proxy matches any temperature anywhere on the planet 🙂

      The other issue is what happens if there is no data for the instrumental grid cell, i.e. the cell is named in the toofew list. This happens mainly close to the poles, so I'd guess it would affect ice core proxies. If there is no instrumental data, it'll cast around for two cells with data. These might be some way apart and maybe 20 deg of lat away from the proxy location.

      • Posted Sep 25, 2008 at 6:36 AM | Permalink

        Re: Nick Moon (#17),

        although frankly why not just see if the proxy matches any temperature anywhere on the planet 🙂

        Cut the smiley. That’s what MBH98 did.

        • Nick Moon
          Posted Sep 25, 2008 at 7:23 AM | Permalink

          Re: UC (#21),

          Cut the smiley. That’s what MBH98 did.

          Ahh, but I thought we had moved on.

      • Spence_UK
        Posted Sep 27, 2008 at 3:36 AM | Permalink

        Re: Nick Moon (#17),

        I’ve been too busy to download and try the latest MBH stuff (hopefully get more time soon and have a play with it) but I have been browsing this thread with interest. This caught my eye. I would stress I haven’t seen the full code or context so I may be missing something.

        As noted by Mark T. and others, sum(iii) is a terrible way to test the length of the matrix. The behaviour of this test is going to be dependent on a number of different things; if bb is a row vector, iii will simply be a vector of ones and sum(iii) will indeed yield length (just very inefficiently with lots of extra computation compared to the length function). If bb is a column vector or matrix, iii will contain indices and this test will pass even for some non-vector results.

        Interestingly, in the former case, a two-element vector will fail the test, because the code uses greater than two instead of greater than or equal to two. This would cause the code to fall over so I’m assuming this never happens?
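
        For anyone who wants to see this at the console, a quick generic experiment (not the Mann code) showing how sum(iii) behaves for a tie in a 2-element column vector versus a row vector:

        bb = [0.3; 0.3];                  % column vector, tie for the maximum
        [iii, jjj] = find(bb == max(bb)); % iii = [1; 2], jjj = [1; 1]
        sum(iii)                          % 3  -> passes a "> 2" test
        length(iii)                       % 2  -> the count presumably intended
        bb = [0.3 0.3];                   % the same tie as a row vector
        [iii, jjj] = find(bb == max(bb)); % iii = [1 1], jjj = [1 2]
        sum(iii)                          % 2  -> now fails "> 2"
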

        But the curious thing is this. If the test passes (i.e. sum of iii greater than two), the intent appears to be to just select the first result. But that isn't what the code actually does. Selection of the first result would be the code

        rra(n4)=aa(iii(1),jjj(1));

        but the code is actually

        rra(n4)=aa(1,1);

        which is presumably the first entry in the data, which may not be the maximum at all? Have I got this right? That means under certain conditions where you get multiple values through, a high value will be replaced with garbage. If that is most likely to happen for the Luterbacher series, the code may yield results with a correlation value closer to zero than should actually be the case.

        Repeated caveat: this is speculation from the code I’ve seen. I may have the context wrong.

        If I’ve got the right end of the stick, Dr Mann may be interested to know that the max function in MATLAB also provides the index into the array as a second output. This is a neater solution to the apparent intent:

        [maxValue, maxIndex] = max(abs(aa));
        rra(n4) = aa(maxIndex(1));

        The index into maxIndex will work whether maxIndex is a scalar, vector or matrix and is much simpler to follow. Also, Dr Mann, variables called “aa”, “bb” and “iii” really went out of programming fashion a few decades ago. MATLAB pre-processes the m-file into p-code before running it so there is no performance advantage to unreadable short names. And given the size of your arrays, I’m not convinced memory is an excuse, either.

        UC, you may be able to reduce some of the memory by converting those large arrays to single precision. If the code were well written, this shouldn't have much effect on the results. Of course, I think we already know the answer to that question (as Jean S alludes to above). Unfortunately that means using machines with different maths co-processors (even different versions of the same chip, let alone entirely different architectures) will yield different results as well…

        • Spence_UK
          Posted Sep 27, 2008 at 4:32 AM | Permalink

          Re: Spence_UK (#52),

          Duh… penny drops that aa is already a restricted subset and aa(1,1) may have some special meaning… in which case aa(1,1) may do something not entirely unreasonable. I probably ought to find time to download the code and get some context before commenting 🙂

        • Nick Moon
          Posted Sep 28, 2008 at 5:20 AM | Permalink

          Re: Spence_UK (#53),

          aa contains the correlation coefficients between the centre of the grid cell and the proxy, though there are only ever 1 or 2 values in aa. The values may be + or -. So Mann creates a copy with the absolute values, bb. Then he finds the largest value in that array. Then he uses that value to find the index. Then he uses the index to find the original coefficient in aa. I think there is then a little bit of code to cope with the special case that both coefficients are the same; if so, he just takes the first one. It's hard to believe there isn't a much cleaner way of doing it. But code like this probably gets tweaked and modified, and probably some of this stuff was doing useful work in a previous version.

        • Spence_UK
          Posted Sep 28, 2008 at 11:30 AM | Permalink

          Re: Nick Moon (#54),

          Thanks Nick, I finally downloaded some of the code to figure out more closely what was going on. Not the easiest code to follow.

          One neater and more generic way to do it is in the third blockquote of my #52. But having looked at the code more closely, Mann’s method will work, although for fairly non-obvious reasons. The vector aa is constructed as a 2-element column vector, which means iii will contain the indices into the vector and jjj will just be a vector of ones. iii can therefore only take on three possible values; [1], [2] or [1; 2]. jjj will simply be a vector of 1s as long as iii. Mann’s sum test will therefore always yield the correct answer, so long as igrid is 2. This test won’t work for igrid of 3 or more. Of course the value is hard wired, but the fact that it is parameterised gives it the feel of one of the many tweakable degrees of freedom available to the author – but the code doesn’t support it being used in this way.

          As for the code being tweaked and modified: yes, this is pretty evident in the comments. One convenience of MATLAB is the ability to quickly and easily try out little tweaks and changes. And it appears Mann's passion for little tweaks and changes hasn't abated any in recent years, no matter how much they can get him into hot water.

          Here's an interesting observation. On one pass around the larger outer loop, aa could be set (and populated) to length 2. But say on a later pass, only one valid record may be found. This would still invoke the piece of code you list (since it requires a minimum of igr, 1 record, to be found). But then you get an inconvenient side effect: you've still got the correlation coefficient hanging around from the last time around the loop, when you found 2 cases.

          This may have no consequence whatsoever in the actual application, but nevertheless, this sort of coding is just asking for mistakes, and of course underscores the need for the original code to make any kind of replication. Anyhow, I’ve downloaded most of the files now, just need to generate clidata as Steve notes above and then I’ll be able to get propick running.
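
          The carry-over is easy to demonstrate in isolation (a generic sketch, not the propick.m loop):

          aa = [];
          for i = 1:2
              if i == 1
                  aa(1,1) = 0.8;    % pass 1: two valid records found
                  aa(2,1) = 0.6;
              else
                  aa(1,1) = 0.3;    % pass 2: only one valid record found
              end
              disp(aa.')            % pass 2 still shows 0.6 left over from pass 1
          end
          % clearing or re-sizing aa at the top of each pass avoids the stale value
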

    • Jean S
      Posted Sep 25, 2008 at 1:45 PM | Permalink

      Re: Steve McIntyre (#14),

      The decision rule in propick.m according to Jean S is that tie goes to the Mann series that occurs first in his numbering system, which would be 112.5W. However the correlation to 102.5W is higher and this is the one that is used. Puzzling.

      I think I figured this out … sort of. It seems to be a rounding/floating point issue! Variable 'd1' has the distances calculated. I checked proxy #910, and the three smallest distances are:
      0
      410.1783
      410.1783
      and these correspond to:
      -107.5000 42.5000
      -102.5000 42.5000
      -112.5000 42.5000
      Now the difference (2nd smallest minus 3rd smallest) is -9.6634e-013 !!!
      So the rule is as I (and Nick Moon) explained, but due to floating point issues the Mannian code finds that 112.5W and 102.5W are not equally close to 107.5W …
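
      The underlying effect is ordinary floating-point behaviour rather than anything exotic: expressions that are equal on paper need not be equal bit-for-bit, so sort() breaks the "tie" by noise. A quick console check (generic MATLAB, with stand-in values for the two distances):

      0.1 + 0.2 - 0.3                        % ans = 5.5511e-17, not 0
      d2 = 410.1783; d3 = d2 + 1e-12;        % stand-ins for the two computed distances
      abs(d2 - d3) <= 1e-9 * max(d2, d3)     % true -> a relative tolerance would call it a tie
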

      I’m 100% sure that there is an incredible amount of pointless calculation in this.

      Yes, that’s true. My favorite line (from ‘propick.m’) so far:

      itrdbmatrix=itrdbmatrix;

  13. Ashley
    Posted Sep 25, 2008 at 12:28 AM | Permalink

    Just so everyone is clear about memory management on PCs:

    In a workstation Windows 32bit operating system (eg Windows 2000 Professional; Windows XP Pro) the maximum addressable memory by a program (single executable process) is 2 GB.

    For a server version of Windows 32bit OS the maximum addressable memory by a program is 3 GB.

    For a 32bit Linux operating system running on a PC the maximum addressable memory is 3.6 GB.

    A 32bit program will usually run without difficulty on a 64bit operating system but it will only run as a 32bit application and will still have the 32bit memory addressing limitations.

    A 64bit version of a program running on a 64bit machine will, to all practical purposes, have no memory addressing limits on most current PC hardware and the limitation will then be the available physical memory which is installed in the machine less the operating system overhead.

    • Geoff Sherrington
      Posted Sep 25, 2008 at 1:46 AM | Permalink

      Re: Ashley (#15),

      Theoretical maximum addressable value for a 64 bit OS is 2 ^ 64 = 18446744073709551616 bits.

      Windows 64 imposes a synthetic limit of 8 GB (8 000 000 000). Many Main (Mother) Boards also have BIOS limits that will not allow more than 8 GB, and some have 4 and 3.2 GB limits, regardless of the OS. Intrinsic math functions (on die) on the cpu / alu and math processor clip functions such as (arctan) ^ (-1) (inverse arc tan) to 16 bits, or 8 bits on another brand of board. Check the Intel web site for advanced math functions handled on cpu. Variations occur markedly between the Centrino, Pentium D, Core2, Core2 Duo, Core2 Quad within the same die series.

      Speed and cost are the reasons.

      Swap space in Windows is calculated as 2.3 X physical ram = 42427511369531968716.8, i.e. (2^64) X 2.3 total available ram. This can be overridden by the Administrator.

      To use all the hard disk one must have a minimum capacity of OS + Program + Swap Space which means you require a single disk with greater than 42427511369531968716 bits on a single platter and single head to get a contiguous file stored.

      The Unix workaround above does not work because the CPU assigns the size of the memory addresses. In an Oil and Gas exploration program with matrices up to 1000 X 1000 double floats (lats and longs), non standard methods to handle the data had to be employed, such as checking for matrix symmetry. The complete control of all memory handling functions was removed from the OS to the program and done in C, C++ and Assembler for BSD, then Solaris. This generated an extra 20 000 lines of code. Even at this level any data input errors could cause buffer under and overflows.

      When you find a disc of these dimensions, put me down for one.

  14. KimberleyCornish
    Posted Sep 25, 2008 at 5:33 AM | Permalink

    A copy of a post at linuxformat.co.uk on the Sun Solaris filesystem, which allows many distinct drives to be used to create a single pool:

    Re: The LXF Test: OpenSolaris 2008.05 (Score: 1)
    by pwbrum61 on Aug 20, 2008 at 12:03 PM

    SUNW = Stanford University Network Workstation. It's a throwback from Sun's early days. SFW = Sunfreeware (I think), from http://www.sunfreeware.com

    With ZFS there's no need to "grow" a filesystem or expand it. In fact there's no need to newfs (Solaris UFS equivalent of mkfs) it. If, for example, you have a system with 5 drives in it you can combine them into a raidz (variation on RAID5) pool (apologies for using Solaris device names, but you should get the gist):

    zpool create testpool raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0

    Then create your zfs filesystem:

    zfs create testpool/filesys1

    The filesystem will automatically be mounted for you, so df -h would show something like:

    testpool 977G 431K 72G 1% /testpool
    testpool/filesys1 977G 21M 72G 1% /testpool/filesys1

    (ignore the sizes above, it's simply an extract from a live system to illustrate)

    If you wanted to restrict the size of a filesystem you could simply set a quota when creating it:

    zfs create -o quota=100G testpool/filesys2

    or for an existing filesystem:

    zfs set quota=80G testpool/filesys2

    In the example above, entire drives are used to create the pool. It's also possible to use slices (Solaris equivalent of partitions):

    zpool create testpool2 raidz c0t0d0s3 c0t1d0s3 c0t2d0s3

    If/when zfs does make its way into Linux it could make life a lot easier during install for newcomers.

    Paul

  15. Posted Sep 25, 2008 at 5:43 AM | Permalink

    OK, let’s test your memory. Type

    reg_ts=NaN([1996,2592,19]);
    saveregts=reg_ts(:,1:1728,:);

    (from gridboxcps.m )

    Whose Matlab survives through these two commands?

    • Mark T.
      Posted Sep 25, 2008 at 9:22 AM | Permalink

      Re: UC (#20),

      Whose Matlab survives through these two commands?

      Without a doubt the two largest arrays I’ve ever had in a MATLAB session.

      Re: Nick Moon (#17),

      I don’t understand the syntax of that bit. I’d expect iii and jjj to be the index of the element in bb that matches ra. So I can’t see why you would then sum it.

      Yet another case of Mann not understanding MATLAB, IMO. If he wants to test whether there are more than two instances of bb==ra, all he needs to do is use length(iii) rather than sum(iii).

      Mark

  16. Mark T.
    Posted Sep 25, 2008 at 9:35 AM | Permalink

    Do I get some quatloos for getting it to run without crashing? I’m guessing, btw, that I can’t do a whole lot with MATLAB once I’ve scarfed up that much memory for two arrays. I probably can’t do much with Winders, either (I’m not working very hard at the moment, so it matters not, hehe). 🙂

    Mark

  17. Steve McIntyre
    Posted Sep 25, 2008 at 9:44 AM | Permalink

    I’m 100% sure that there is an incredible amount of pointless calculation in this. The guts of MBH98 can be boiled down to a few lines of code. Mann has a tendency to repeat the same operation tens of thousands of times.

    The purpose of getting this to work is to figure out what RegEM is actually doing. He obviously has code that produces results, because he has produced results. The key issues are exactly the same as MBH PC – what are the properties of the method?

    But quite aside from the esoterica of obscure Mannian methods, there’s quite a lot that’s interesting in the proxies themselves. At the end of the day, after using gigabytes of memory and teraflops of operations, all the operations appear to be linear (a point that Tapio Schneider confirmed for his RegEM), so one ends up with 1209 coefficients, which can vary somewhat.

    If anyone can get the pig to work, I’d like it to be tweaked so that the program keeps track of these coefficients – something that Mann should have done, so we can then show maps of the proxy weights, as I did for MBH.

    • Mark T.
      Posted Sep 25, 2008 at 9:53 AM | Permalink

      Re: Steve McIntyre (#25),

      At the end of the day, after using gigabytes of memory and teraflops of operations, all the operations appear to be linear (a point that Tapio Schneider confirmed for his RegEM), so one ends up with 1209 coefficients, which can vary somewhat.

      That's exactly what PCA (et al.) is designed to do: generate a set of coefficients that are used to linearly combine the observations into a single output vector. Different sets of coefficients yield different principal components. In a comm system, the coefficients would represent the variance of individual multipaths. You'd then shift and weight each input according to each coefficient (which are spaced in time), sum the results, and the result is your original (desired) signal. The difference in the latter case, however, is that you know a priori what the signal was, or at least how it was constructed, so you know what your output is as well.

      Part of what makes this code so unwieldy is the fact that it processes ALL of the data as one big block, rather than breaking it up into smaller sections and generating a different set of weights for each section. Of course, this goes to the entire stationarity argument, and past sections do not have the temperature recordings to correlate against, making it very difficult to determine which component is actually temperature, hehe.

      Mark

  18. Mark T.
    Posted Sep 25, 2008 at 9:54 AM | Permalink

    Ooops, when I said “Different sets of coefficients yield different principal components.” I should have said “Different sets of coefficients yield different sources.”

    Mark

  19. Patrick M.
    Posted Sep 25, 2008 at 11:10 AM | Permalink

    Based on the English in some of the readme files, I'm going to guess that they were written by somebody whose native language is not English.

    I wonder how much of the code was written by Mann?

    Steve:
    Which individual wrote the code is irrelevant for our purposes here, just as much as it’s irrelevant to a third party which Pfizer scientist developed a patent. From the directory labels, Zhang probably wrote some of the code; Rutherford probably wrote others, but unless you connect this to something in the output, forget about this line of analysis please.

  20. Sam Urbinto
    Posted Sep 25, 2008 at 12:01 PM | Permalink

    Patrick M: The readme files were probably written by somebody who’s learned new languages and ways of thinking; climatesciencespeak and computerprogrammingtalk.

  21. Jean S
    Posted Sep 25, 2008 at 1:50 PM | Permalink

    Oh, I forgot. It would be interesting to know if there are any proxies that are "significant/insignificant by chance". That is, if there are any proxies whose selection depends on the floating point issues in Matlab…

  22. Steve McIntyre
    Posted Sep 25, 2008 at 1:50 PM | Permalink

    #31. Too funny. The Little Shop of Horrors has definitely opened a new franchise.

  23. Steve McIntyre
    Posted Sep 25, 2008 at 1:52 PM | Permalink

    #33. Do you mean that they are insignificant with a “correct” pick two rule under Mannian order, but significant with an “incorrect” pick two?

  24. Steve McIntyre
    Posted Sep 25, 2008 at 1:54 PM | Permalink

    Jean S, have you located the low-frequency code? I presume that this will be Butterworth filters or something like that?

  25. Jean S
    Posted Sep 25, 2008 at 3:15 PM | Permalink

    #35: Both ways. Those whose picking depends on how you order the equal distance cells. BTW, the ‘propick.m’ code is clearly designed to implement “pick m” rules. Would be interesting to know why they chose 2, not 1 or 3?

    #34: Yes. The correlations are calculated in 'propickl.m'. It inputs the "low frequency" splits 'clidatal' (instrumental) and 'itrdbmatrixl' (proxies). 'clidatal' is supposedly calculated in 'clihybrid.m' (which seems to be missing; at least I can't locate it) and 'itrdbmatrixl' is calculated in 'proxyhybrid.m'. Here's the relevant filtering part (additionally there seems to be a lot of "normalization" going on, which is so typical of Mann):

    [lowa,lowb]=butter(5,lowlim,'low');

    [snip]

    %%%% Apply the zero-phase butterworth filter but only to the proxies
    highf=zeros(nrows,ncols);
    lowf=zeros(nrows,ncols);
    lowf=lowf./lowf;
    highf=highf./highf;
    for k=1:ncols
    clear y y2
    data=stdkeepers(start(k):stop(k),k);
    % lowf(start(k):stop(k),k)=lowpass(data,0.05,2,2);
    frequency = 0.05;
    [smoothed0,icb,ice,mse0] = lowpassmin(data,frequency);
    lowf(start(k):stop(k),k)= smoothed0;
    % lowf(start(k):stop(k),k)=filtfilt(lowa,lowb,data);
    highf(start(k):stop(k),k)=data-lowf(start(k):stop(k),k);
    end

    So before choosing this 2nd latest Mannian filtering (choose the "best" "end condition"; 'lowpassmin.m'), at least Mannian filtering version 2004 (with "minimum roughness"; 'lowpass(data,0.05,2,2)') and the standard Butterworth in Matlab ('butter' + 'filtfilt') were used. How come he needs to test so many things if his results are so robust?
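
    For reference, the "standard Butterworth in Matlab" route mentioned above is the Signal Processing Toolbox pair butter + filtfilt; a minimal zero-phase low/high split looks something like this (a generic sketch, not Mann's code; the 0.05 cutoff and 5th-order filter mirror the excerpt, and butter wants the cutoff normalised to the Nyquist frequency of 0.5 cycles per year for annual data):

    data   = randn(500,1);              % stand-in annual proxy series
    fcut   = 0.05;                      % cutoff in cycles per year (~20-yr periods)
    [b, a] = butter(5, fcut/0.5, 'low');
    lowf   = filtfilt(b, a, data);      % zero-phase: filtered forward and backward
    highf  = data - lowf;               % residual high-frequency part
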

    • Rusty Scott
      Posted Sep 26, 2008 at 1:31 PM | Permalink

      Re: Jean S (#37),

      highf=zeros(nrows,ncols);
      lowf=zeros(nrows,ncols);
      lowf=lowf./lowf;
      highf=highf./highf;

      I found this bit of code confusing because in general I don’t like to divide numbers by zero, even if I’m dividing zero by zero.
      This is apparently Mannian long-hand for
      lowf = highf = NaN(nrows,ncols);
      which would be much easier to read. I can see why it is so hard to follow what Mann does: he changes direction so many times and leaves bits and pieces of old methods hanging around in the code.

      • RomanM
        Posted Sep 26, 2008 at 5:00 PM | Permalink

        Re: Rusty Scott (#48),

        This is apparently Mannian long-hand for lowf = highf = NaN(nrows,ncols);

        Curiously enough, I have an older version of matlab and the NaN function doesn’t seem to work on my system. Maybe that piece of code was written on an earlier version of Matlab.

  26. Mark T.
    Posted Sep 25, 2008 at 3:41 PM | Permalink

    That’s the definition of ad-hoc, right?

    Mark

  27. Steve McIntyre
    Posted Sep 25, 2008 at 5:56 PM | Permalink

    #18. Although they said it was updated on Sept 23 in this statement, elsewhere they say Sep 25, which seems more likely. Not that it matters particularly, but this may be helpful for those readers who get cross at me when I do not take everything at face value.

  28. Posted Sep 25, 2008 at 6:47 PM | Permalink

    At the end of the day, after using gigabytes of memory and teraflops of operations, all the operations appear to be linear (a point that Tapio Schneider confirmed for his RegEM), so one ends up with 1209 coefficients, which can vary somewhat.

    I thought there was also an offset b coefficient for each proxy. y=mx+b; So m and b are 2418 total coefficients.

    • Posted Sep 25, 2008 at 7:28 PM | Permalink

      Re: Jeff Id (#40),

      The intercepts don’t matter that much so I would still say there are 1209 coefficients. After all, the analysis seems to be focused on correlations.

      Re: DeWitt Payne (#30), based on the header information in the *.mat files, the code was run on some flavor of 64 bit Linux platform. I am not sure if that counts as a desktop OS.

      nhcru_cps_composite.mat

      MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 18 20:51:53 2007

      glcru_eiv_composite.mat

      MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Thu Mar 27 12:54:48 2008

      infilled2poles_1850.mat

      MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Thu Jul 26 14:48:43 2007

      Not that this is of any help in terms of understanding what is going on.

      • Mark T
        Posted Sep 25, 2008 at 11:19 PM | Permalink

        Re: Sinan Unur (#41),

        based on the header information in the *.mat files, the code was run on some flavor of 64 bit Linux platform. I am not sure if that counts as a desktop OS.

        Yeah it can. I had Fedora Core 5 running on my machine for a while (which was a broken version, actually). I spend most of my at-home pooter time playing games so that went out the window for Winders (sort of punny) rather quickly. Lots of Linux users running around. Btw, the nice thing about MAT files is that they are portable from one OS to another! Yay.

        So, does anyone have this whole thing working? I’ve got a pretty clean XP build at home and potentially some free time on my hands.

        Mark

      • Not sure
        Posted Sep 26, 2008 at 2:57 PM | Permalink

        Re: Sinan Unur (#41), This comment was written on a 64-bit Linux workstation.

    • Mark T
      Posted Sep 25, 2008 at 8:54 PM | Permalink

      Re: Jeff Id (#40),

      I thought there was also an offset b coefficient for each proxy. y=mx+b; So m and b are 2418 total coefficients.

      No, that wouldn’t make any sense nor would it matter.

      Re: Sinan Unur (#41),

      Therefore the offset and final result cannot be calculated without both.

      Uh, no, not at all. And yes, correlation is independent of scale since you normalize. DC offset is arbitrary, and contributions from one are no different than from another, and therefore immaterial. They should also be removed prior to application of any 2nd order statistical analyses, but Mann does not quite do it correctly (this is because offsets correlate perfectly, which will bias results that are based on variances).
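
      A quick console check of the scale/offset point (generic MATLAB, not from the Mann code):

      x = randn(100,1);
      y = 3.7*x + 12;        % arbitrary positive gain and DC offset
      r = corrcoef(x, y);
      r(1,2)                 % 1.0000 (the sign flips only if the gain is negative)
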

      Mark

  29. Posted Sep 25, 2008 at 8:17 PM | Permalink

    Correlation is ‘allegedly’ independent of the multiplier as well. I don’t recall the specific phrase but I believe Mann stated something to that effect. Therefore the offset and final result cannot be calculated without both.

  30. Mark T
    Posted Sep 25, 2008 at 8:55 PM | Permalink

    Oops, clicked on the wrong link for the second one. I was trying to get Jeff Id’s #42.

    Mark

  31. Willis Eschenbach
    Posted Sep 25, 2008 at 10:25 PM | Permalink

    Jeff Id, if RegEM is some flavor of linear fitting, the final result will likely be (or be able to be recast) in the form:

    “temperature” = proxy1*m1 + proxy2*m2 + proxy3*m3 + …. + proxyn*mn +b

    So we’d end up with 1210 coefficients, the majority of which will likely be ~ 0 …
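
    In matrix form the point is just this (a sketch with made-up sizes and weights, matching only the 1209-proxy count):

    nyears = 1996; nproxy = 1209;
    P      = randn(nyears, nproxy);        % stand-in proxy matrix (rows = years)
    m      = zeros(nproxy, 1);             % per-proxy weights, mostly ~0
    m([17 203 998]) = [0.4; -0.1; 0.2];    % a handful of nonzero weights (arbitrary picks)
    b      = 0.05;                         % intercept
    recon  = P*m + b;                      % the entire "reconstruction" as a linear combination
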

    w.

  32. Mark T
    Posted Sep 25, 2008 at 11:05 PM | Permalink

    Exactly, and then the entire “global temperature” is represented by one or two trees in Colorado. Them are some pretty nifty treemometers.

    Mark

  33. Mark T.
    Posted Sep 26, 2008 at 2:34 PM | Permalink

    Can't do the double assignment thing (two equals signs in one statement, like lowf = highf = …) in MATLAB like you can in C, btw. Mann just isn't as familiar with the language as he could (or should) be. You're right, it is confusing at times.

    Mark

  34. Steve McIntyre
    Posted Sep 28, 2008 at 12:50 PM | Permalink

    Jean S generated clidata, which I'll upload to http://www.climateaudit.org/data/mann.2008/data/replication in a few minutes.

  35. Spence_UK
    Posted Sep 28, 2008 at 5:09 PM | Permalink

    OK, got doannsst.m and propick.m running okay. Just needed to redirect the loads and saves.

    I think the coding flaw I outlined in #55 seems moot, since it seems to always pick up two locations (i.e. sum(d3) appears always 2, never 0 or 1, at least in the first half of propick). Buggy software along execution paths that are never tested… I probably shouldn’t be surprised!

  36. Steve McIntyre
    Posted Sep 29, 2008 at 8:59 AM | Permalink

    In Mann's proxyinfill program, he uses (or says that he uses) a dataset 1227proxynamenolu.txt. "nolu" is shorthand for "no Luterbacher". The prefix "1227" doesn't tie into the present nomenclature, which has things like 1209proxy … etc, and presumably comes from an unreported earlier iteration with 1227 proxies. However, this data set has 1137 series. All 1137 are in the 1209. However, 1209 minus 71 Luterbacher = 1138. What's the missing series? The rain in Spain/Kenya (rodrigo).

  37. Steve McIntyre
    Posted Sep 29, 2008 at 9:00 AM | Permalink

    The starting point of Mann’s program is the infilling of proxies. This might be a good place to watch how the RegEM works as well.

    • Posted Mar 16, 2009 at 2:53 PM | Permalink

      Re: Steve McIntyre (#59),

      This might be a good place to watch how the RegEM works as well.

      Got some code from Jean S. The truth is out: the EIV files will soon go public 🙂

      • Jean S
        Posted Mar 16, 2009 at 3:47 PM | Permalink

        Re: UC (#61),
        here's a small preview of what to expect 😉 Some proxies are weighted according to their weight in the (RegEM) EIV reconstruction (nhnhscrihad). Blue indicates positive weight and red negative. The lower right corner is the actual (low split) reconstruction (for the step) scaled to the final scaling.

        It is rather interesting how some proxies respond with a different sign depending on the century in which they are used…

  38. Not sure
    Posted Sep 29, 2008 at 11:06 AM | Permalink

    Are people still running out of memory trying to run this? Is it still holding anyone up?

    • UC
      Posted Oct 28, 2013 at 12:38 PM | Permalink

      It is running now

      Name Size Bytes Class
      saveregts 1996x1728x19 524261376 double
      reg_ts 1996x1824x19 553387008 double

      (as I updated the computer:)

      Maximum possible array: 27353 MB (2.868e+010 bytes) *
      Memory available for all arrays: 27353 MB (2.868e+010 bytes) *
      Memory used by MATLAB: 1879 MB (1.971e+009 bytes)
      Physical Memory (RAM): 16267 MB (1.706e+010 bytes)

      * Limited by System Memory (physical + swap file) available.

      ok, where were we..

One Trackback

  1. […] continue Mann 2008 – Replication with EIV. To run regreclow.m and regrechigh.m you’ll need files climate, eofnumb, and proxyN, […]