Porting RegEM to R #1

I’ve transliterated the relevant Tapio Schneider code into R – pttls.m and the parts of regem.m that seem relevant at present. Jeff Id has extracted a variety of intermediates from his Matlab run and I’ve fully reconciled through two steps, with the remaining differences appearing to be due to rounding in transmission. My dXmis statistic at step one was 1.258116 (Jeff – 1.258) and at step two was 0.4673552 (Jeff – 0.4674). I’ve posted a detailed turnkey script for the reconciliation here (something that I do from time to time to make sure that I don’t inadvertently modify a reconciliation – scripts which are unexciting but time consuming).

I’ve added a few features that I’ll use for analysis purposes. RegEM calculates a bazillion coefficients – well, maybe not quite a bazillion – in the case at hand, 1,341,662 coefficients per iteration. Given that there are only 23,449 actual measurements from the AWS and surface stations in the 1957-2006 period, this is a yield of over 57 coefficients per measurement per iteration, which seems like a high yield and one that might lend itself to simplification.

In particular, given that the “money” series is the reconstruction trend, the station and AWS reconstructions end up being combined in some way. By doing the algebra in a slightly different order (merely the associative property of matrix multiplication), we should be able to recover the weighting factors from this information. But that’s a little way in the future.

Today, I want to show a couple of graphics from the first two steps, in which I calculated the proportion of positive coefficients to total coefficients by year. Given that we’re dealing with millions of coefficients, these coefficients have to be considered statistically – something that you see in random matrix theory, which we’ve alluded to from time to time. Here they are. The proportions are not identical (I checked), but there isn’t much change in the first step.


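For reference, the by-year proportion can be computed along these lines – a minimal sketch with a stand-in coefficient matrix; the object names here are hypothetical, not the ones in my script:

  set.seed(7)
  B <- matrix(rnorm(600 * 60), nrow = 600)           # stand-in coefficient matrix, one row per month
  year <- rep(1957:2006, each = 12)                  # 1957-2006, monthly

  prop_pos <- tapply(rowMeans(B > 0), year, mean)    # share of positive coefficients by year
  plot(as.numeric(names(prop_pos)), prop_pos, type = "l",
       xlab = "year", ylab = "proportion positive")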
What does this mean? Dunno but, intuitively, there are a couple of things that I don’t like.

First, I don’t like the idea that 1 out of 3 coefficients is negative. When things are eventually recombined into an Antarctic average, these negative coefficients will express themselves as upside down contributions from some stations – which surely can’t be a good thing.

Second, I don’t like the non-stationarity in the proportion of positive coefficients. Why does the proportion of positive coefficients decrease so sharply from the early 1980s to 2000? Does it “matter”? Dunno. This is the sort of thing that IMO makes it very undesirable to use poorly understood methods (such as TTLS RegEM with RegPar=3) for important applied results, particularly when the new results seem to yield “better” answers than earlier results with better understood methods (GISTEMP, Monaghan) and the differences are not clearly reconciled.

Anyway, I should have an end-to-end port of this method to R later today or tomorrow.

41 Comments

  1. jack mosevich
    Posted Feb 17, 2009 at 11:20 AM | Permalink

    Pardon my ignorance but I am not familiar with some acronyms. What pray tell is RegEM? Thanks

    Jack

    • Peter D. Tillman
      Posted Feb 17, 2009 at 11:32 AM | Permalink

      Re: jack mosevich (#1),

      http://climateaudit101.wikispot.org/Regularized_expectation_maximization_algorithm?action=show&redirect=RegEM

      –and see http://climateaudit101.wikispot.org/Glossary_of_Acronyms
      for more climalphabet soup.

      The wiki is worth looking through, if you’re a newcomer.

      Best, Pete Tillman

    • RomanM
      Posted Feb 17, 2009 at 11:36 AM | Permalink

      Re: jack mosevich (#1), (Alberta?)

      Regularized Expectation Maximization, Jack.

      This abstract may help a bit.

      • Mike B
        Posted Feb 17, 2009 at 12:46 PM | Permalink

        Re: RomanM (#3),

        Roman, is there a difference between “Expectation Maximization” and “Maximum Likelihood Estimation”?

        Also, I’m still fascinated (and confounded) by the revelation that RegEM doesn’t take into account the distances between stations. Surely there are better methods available that do take into account the distances?

        The more I look at the cover of Nature, the more I think that the map of Antarctica looks like a methodological artifact, where the densely stationed peninsula is casting a warm “shadow” across all of Western Antarctica.

        • RomanM
          Posted Feb 17, 2009 at 1:16 PM | Permalink

          Re: Mike B (#8),

          The two concepts are quite different (although EM uses MLE). Obviously, you can search google so I will give a short “intuitive” explanation.

          In both cases, the point of the exercise is to estimate one or more parameters which play a role in the distribution of some variable. Take a sample from the population (some of the values in the sample can be related to each other). In MLE, using the underlying distribution, you calculate the “probability” (sort of … read “likelihood”) of getting the sample values that you just got. This “probability” will depend on the actual value of the parameter you want to estimate. The estimate of that parameter is the value which makes that “probability” as large as possible (on the principle that more likely outcomes will occur more often than unlikely ones).
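          As a tiny R illustration of the MLE idea (made-up numbers, and a normal with known sd assumed for simplicity), the mean that maximizes the likelihood of the sample is just the sample mean:

            x <- c(2.1, 1.7, 3.0, 2.4, 2.8)                      # hypothetical sample
            negloglik <- function(mu) -sum(dnorm(x, mean = mu, sd = 1, log = TRUE))
            optimize(negloglik, interval = c(-10, 10))$minimum   # ~ mean(x) = 2.4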

          In EM, you are also trying to estimate a parameter, but some of the data are missing. The method being used requires the data to be complete. (For example, your sample consists of multiple observations on each of a group of individuals. If you miss a value for one individual, your options are to throw out all of the data for that person or to guess (impute) the one value that you missed so the rest of the data can be used.) So EM alternates between guessing at the value and then estimating the parameter, then re-guessing the value, etc., until the calculation process converges to some result. Both the imputation and the estimation steps use MLE for a multivariate normal distribution. The assumption of normality is important because of the MLE, and the assumption that values are not missing for systematic reasons is important for the validity of the imputation step.

          The Reg part comes when hordes of values are missing and conditions must be placed on the process to be able to get an answer in the first place.
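          A toy R sketch of the alternation described above – impute, re-estimate, repeat until the changes are small – on made-up data. This is only an illustration of the idea, not Schneider’s RegEM, and it omits the conditional-covariance correction that a full EM step includes:

            set.seed(1)
            X <- matrix(rnorm(200), ncol = 4)                  # complete synthetic data, 50 x 4
            miss <- matrix(runif(200) < 0.15, ncol = 4)        # ~15% missing at random
            Xobs <- X; Xobs[miss] <- NA

            Ximp <- Xobs                                       # start from column-mean infills
            for (j in 1:ncol(Ximp)) Ximp[is.na(Xobs[, j]), j] <- mean(Xobs[, j], na.rm = TRUE)

            for (iter in 1:50) {
              mu <- colMeans(Ximp); S <- cov(Ximp)             # estimation step
              Xnew <- Ximp
              for (i in 1:nrow(Ximp)) {                        # imputation step
                m <- miss[i, ]
                if (any(m) && any(!m)) {
                  Xnew[i, m] <- mu[m] + S[m, !m, drop = FALSE] %*%
                    solve(S[!m, !m], Ximp[i, !m] - mu[!m])
                }
              }
              dX <- sqrt(mean((Xnew[miss] - Ximp[miss])^2))    # plays the role of dXmis here
              Ximp <- Xnew
              if (dX < 1e-6) break
            }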

        • Mike B
          Posted Feb 17, 2009 at 1:25 PM | Permalink

          Re: RomanM (#11),

          Thanks Roman. That’s a very succinct explanation.

          My primary questions regarding the approach would revolve around a) the convergence criteria, and b) why there would be a reasonable expectation of convergence in the first place.

        • Scott Brim
          Posted Feb 17, 2009 at 3:05 PM | Permalink

          Re: RomanM (#11)

          The Reg part comes when hordes of values are missing and conditions must be placed on the process to be able to get an answer in the first place.

          In a paleoclimate temperature reconstruction using tree rings, can we presume that one such condition on the use of RegEM for in-filling temperature data would be an assumption not only that temperature signals can be cleanly dissociated from the other environmental factors which might affect tree ring growth patterns, but also that the temperature response curve is reasonably linear for the temperature ranges encountered?

        • RomanM
          Posted Feb 17, 2009 at 3:45 PM | Permalink

          Re: Scott Brim (#20),

          The conditions I refer to are mathematical conditions on the form of the matrices used in the calculation process, not on the data values themselves. That is an entirely different matter.

        • Jeff C.
          Posted Feb 17, 2009 at 3:59 PM | Permalink

          Re: Mike B (#8),

          Also, I’m still fascinated (and confounded) by the revelation that RegEM doesn’t take into account the distances between stations. Surely there are better methods available that do take into account the distances?

          Roman had some good comments on RegEM’s implicit weighting in an earlier thread. My non-statistician brain looks at it this way. Take two stations 1000 miles apart with fairly complete records (station A and station B). Now add station C, 50 miles from station A, with a third of its data randomly missing. RegEM will recognize that station C is better correlated with station A than with station B and infill accordingly.
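          A synthetic version of this A/B/C setup (illustration only, using an ordinary regression infill rather than the actual RegEM machinery) shows the same behaviour – the weight lands on A, not B, with no distance information supplied:

            set.seed(5)
            n <- 480
            A <- arima.sim(list(ar = 0.5), n)                 # 'nearby' station
            B <- arima.sim(list(ar = 0.5), n)                 # 'distant', unrelated station
            C <- 0.9 * A + rnorm(n, sd = 0.3)                 # station C tracks A closely
            C_obs <- C; C_obs[sample(n, n/3)] <- NA           # a third of C missing

            fit <- lm(C_obs ~ A + B, na.action = na.exclude)
            coef(fit)                                         # coefficient on A >> coefficient on B
            C_fill <- ifelse(is.na(C_obs), predict(fit, data.frame(A = A, B = B)), C_obs)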

          Now we have the Antarctic reconstruction with 20 out of 42 predictor stations on either the peninsula or islands off the continent. Add in the predictand AWS sites, with woefully incomplete records and data missing for long stretches, typically in the winter months. Is RegEM sophisticated enough to avoid spurious correlations without knowing the distance between the sites? Steig and Gavin (on comments at RC) say yes. I think most of us are skeptical.

        • Posted Feb 17, 2009 at 4:33 PM | Permalink

          Re: Jeff C. (#23),

          To me, the reduction of the spatial data to just 3 PCs would encourage spurious correlation. I don’t have any idea why the actual data wouldn’t be used instead.

        • bender
          Posted Feb 17, 2009 at 4:49 PM | Permalink

          Re: Jeff C. (#23),

          Is RegEM sophisticated enough to avoid spurious correlations without knowing the distance between the sites? Steig and Gavin (on comments at RC) say yes. I think most of us are skeptical.

          They’re kind of right. But not entirely. The issue is this: does the correlation stem from low-frequency or high-frequency correspondence? If A and B correlate because of high-frequency coherence, then it makes some sense to infill one from the other. If the correlation stems from a low-frequency shared “trend” then it might be a little more dubious: you are manufacturing signal based on very little real information (as defined by Shannon-Weaver).

        • Posted Feb 17, 2009 at 6:07 PM | Permalink

          Re: bender (#25),

          I’ve been thinking a lot about that issue and was considering doing a post on r correlation, but I’m no statistician; I just know the problem exists and thought it would be interesting to show it. It seems to me that with such slight trends as compared to the anomaly variation, the r values are almost meaningless unless the trends are separated into low and high frequency components. It allows a paper to claim high correlation to a signal without actually matching the important part. How do actual statisticians handle this?

        • bender
          Posted Feb 17, 2009 at 10:25 PM | Permalink

          Re: Jeff Id (#26),
          Your #26 was a happy crosspost with my #27.

          It seems to me that with such slight trends as compared to the anomaly variation, the r values are almost meaningless unless the trends are separated into low and high frequency components.

          Exactly.

          It allows a paper to claim high correlation to a signal without actually matching the important part. How do actual statisticians handle this?

          Filter your x and y (independent and dependent) variables into orthogonal low- and high-frequency components and calculate your correlation on the basis of those separate components. The high-frequency correlation will not be prone to spurious matching. The low-frequency correlation will. Infilling on the basis of correlation in the high-frequency domain makes sense to me (whether you are talking weather or climate “noise”). Not so for low-frequency (trends from forcing etc.). That’s how I would do it. I have not studied RegEM.
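          A minimal R sketch of that split, on synthetic series (a loess smooth standing in for the low-pass filter; any reasonable filter would do):

            set.seed(2)
            t <- 1:600
            x <- 0.002 * t + arima.sim(list(ar = 0.3), 600)   # toy series with a slight trend
            y <- 0.002 * t + arima.sim(list(ar = 0.3), 600)

            lowpass <- function(z) fitted(loess(z ~ t, span = 0.3))
            x_lo <- lowpass(x); x_hi <- x - x_lo              # low- and high-frequency components
            y_lo <- lowpass(y); y_hi <- y - y_lo

            cor(x_lo, y_lo)   # trend correlation - prone to spurious matching
            cor(x_hi, y_hi)   # high-frequency correlation - safer basis for infilling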

          What do you think, Gavin? Is this how you would do it?

        • Posted Feb 18, 2009 at 1:36 AM | Permalink

          Re: bender (#30),

          I was hoping for a rule or something. It seems like statistics should have a frequency based r val. I’m just an engineer though and have not yet attempted to prove anything with a correlation. Things either work or they don’t, and in my experience the god of physics pretty well lets me know when I messed up.

          By the way, I did some slightly more complete PC analysis of the Sat data. I’m completely exhausted but if I didn’t mess it up, Dr. Steig is going to have a headache tomorrow.

          The Three PC’s of the Antarctic

        • bender
          Posted Feb 18, 2009 at 7:50 AM | Permalink

          Re: Jeff Id (#33),

          It seems like statistics should have a frequency based r val.

          For periodic time-series you use cross-spectral coherence. It is interpreted like Pearson’s r wrt a specific frequency range. But for aperiodic time-series the spectral methods (frequency domain) are not so good. There is nothing illegal about filtering and correlating based on the filtered component series. This is not a statistical issue. It’s a signal processing issue. The idea is that weather noise, climate noise, and forcing signals occur on different characteristic timescales and you filter on that basis. This is not a statistical proposition; it is a climatological proposition. So it’s out of the hands of the statisticians. You can ask a statistician’s opinion, but they are going to tell you that it makes sense, if that’s what the physics dictates. Two statisticians I would trust on this are Bloomfield and Nychka. (And of course, Wegman.)
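          For what it’s worth, base R will return squared coherency from the smoothed cross-spectrum directly; a minimal sketch on synthetic series:

            set.seed(3)
            n <- 512
            common <- arima.sim(list(ar = 0.8), n)            # shared signal
            x <- common + rnorm(n); y <- common + rnorm(n)

            sp <- spec.pgram(cbind(x, y), spans = c(7, 7), plot = FALSE)
            plot(sp$freq, sp$coh, type = "l",
                 xlab = "frequency", ylab = "squared coherency")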

        • bender
          Posted Feb 17, 2009 at 6:10 PM | Permalink

          Re: bender (#25),
          To rephrase:
          The closer two points are to each other, the higher the degree of high-frequency coherence. S&M are ok on that one. The same cannot be said of low-frequency coherence (trend). To assume so risks making unreasonable assumptions about long-range teleconnectivity vis-à-vis trends.

        • Posted Feb 17, 2009 at 6:42 PM | Permalink

          Re: bender (#27),

          It just seems like there must be some standard for determining ‘which’ coherence you’ve identified. The example I think of right now is where temperatures would affect the length of the growing season rather than the actual growth rate. You may then get a good high frequency correlation and a poor or even negative low frequency correlation. It seems to me that this is one of the main problems in the use of statistics to verify a temperature proxy. I’m pretty far off topic but very interested in how the pros handle it.

        • Mike B
          Posted Feb 18, 2009 at 9:33 AM | Permalink

          Re: Jeff C. (#23),

          Thanks Jeff. And I get that part (i.e. the algorithm implicitly “recognizes” that two stations 50 miles apart are more highly correlated than two stations 950 miles apart).

          But my poorly posed question(s) are slightly different:

          a) What if station B, via error or spurious correlation, is more highly correlated with station C 950 miles away than it is with station A 50 miles away?

          and more generally:

          b) Even if things are working well, you’re still stuck (in your hypothetical) with 950 miles of guesswork between two stations. And RegEM seems to crank along indifferent to this.

        • Ryan O
          Posted Feb 18, 2009 at 9:43 AM | Permalink

          Re: Mike B (#39), There’s no way to know. In the case of the AWS recon, it is very possible to have undetected, spurious correlations because some of the station data is so short. For longer series – like the manned stations – this is less likely because a spurious correlation is unlikely to persist for long periods. But with the shorter duration of many of the AWS stations, it could definitely be an issue.
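          A quick synthetic check of that point (made-up lengths, unrelated white-noise series): the largest correlation you can stumble on by chance shrinks as the overlap lengthens.

            set.seed(6)
            max_abs_cor <- function(len) max(abs(replicate(1000, cor(rnorm(len), rnorm(len)))))
            sapply(c(24, 120, 600), max_abs_cor)              # max |r| for 2-, 10- and 50-year overlaps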

        • Posted Feb 18, 2009 at 2:10 PM | Permalink

          Re: Ryan O (#40),

          For sure. Not only is there no way to know, there is no effort to check.

        • Molon Labe
          Posted Feb 17, 2009 at 11:10 PM | Permalink

          Re: Mike B (#8),

          To summarize, given the ill-posed problem Ax = y:

          MLE: Find x which minimizes ||y - Ax||

          EM: Find minimum dA for which (A + dA)x = y

          However, in practice EM is indistinguishable from the FM or “Funding Maximization” method:

          FM: Input desired result x=xGW, find “adjustments” to the data in A which maximize my funding and media exposure.

          This method is solved using a genetic algorithm in which research groups successful in cramming xGW backwards through the problem are rewarded with higher funding.

          It’s also sometimes called the ILM or “Individual Liberty Minimization” method.

        • Craig Loehle
          Posted Feb 18, 2009 at 7:47 AM | Permalink

          Re: Molon Labe (#31), Oh I just love all the tech jokes. You had me completely sucked in for a minute. I notice when the discussion gets really mathematical the trolls stay away: is math like wolfbane or garlic?

  2. Andy
    Posted Feb 17, 2009 at 11:42 AM | Permalink

    Thank god someone else asked that, I thought I was the only one.

  3. Steve McIntyre
    Posted Feb 17, 2009 at 11:50 AM | Permalink

    I think that I can declare that my R script is now completely file compatible with Jeff Id’s Matlab run. Below is a comparison of dXmis through 21 iterations. Not even a rounding difference.

    me jeff
    [1,] 1.25811556 1.25800
    [2,] 0.46735524 0.46740
    [3,] 0.31128378 0.31130
    [4,] 0.18168994 0.18170
    [5,] 0.11207077 0.11210
    [6,] 0.08017931 0.08018
    [7,] 0.06474290 0.06474
    [8,] 0.05473997 0.05474
    [9,] 0.04690776 0.04691
    [10,] 0.04046183 0.04046
    [11,] 0.03512440 0.03512
    [12,] 0.03070195 0.03070
    [13,] 0.02702687 0.02703
    [14,] 0.02395646 0.02396
    [15,] 0.02137379 0.02137
    [16,] 0.01918531 0.01919
    [17,] 0.01731698 0.01732
    [18,] 0.01571032 0.01571
    [19,] 0.01431906 0.01432
    [20,] 0.01310643 0.01311
    [21,] 0.01204302 0.01204

  4. Nic L
    Posted Feb 17, 2009 at 12:43 PM | Permalink

    Well done, Steve. I look forward to your posting of the full R RegEM script and to experimenting with it.

  5. Ryan O
    Posted Feb 17, 2009 at 12:45 PM | Permalink

    Very cool, Steve. R is much cheaper than Matlab. 🙂

  6. Posted Feb 17, 2009 at 12:52 PM | Permalink

    Thanks for the effort on this Steve.

    Re: Mike B (#8),

    I think the shadow analogy is probably going to turn out to be the case as well.

  7. jack mosevich
    Posted Feb 17, 2009 at 12:59 PM | Permalink

    Thanks for your responses Peter and Roman. The problem of calculations where there is missing data has been around for an awfully long time. I would have thought that robust algorithms (not to be confused with Al-Gore-ithms) had been developed previously (e.g. neural net approaches, maximum likelihood…) and some well known standard approach taken.

  8. Ryan O
    Posted Feb 17, 2009 at 1:26 PM | Permalink

    Second, I don’t like the non-stationarity in the proportion of positive coefficients. Why does the proportion of positive coefficients decrease so sharply from the early 1980s to 2000?

    Just throwing out a guess, I would think that the proportion of positive coefficients would be related to the degree to which the temperatures of the stations are physically coupled. Sharp drops like the one from the early 1980s to 2000 might be reflective of microclimate changes, or of changes in the relative importance of large-scale climate vs. microclimate.

    Another somewhat unrelated thought is that RegEM might have value as a tool of a different sort: detecting whether station moves or instrumentation calibration/replacement affected temperature readings. One might expect changes in microclimate to cause a drift over time, but a sudden, sharp change in correlation would be more likely to arise from man-made influences. Could be a way to verify GHCN splicing of time series, for example. Dunno.

    • Ryan O
      Posted Feb 17, 2009 at 1:30 PM | Permalink

      Re: Ryan O (#12), Just to clarify, the second half of the above wouldn’t apply to proportions of coefficients . . . it would be to withhold some of the data of the station you’re analyzing, impute, and look at the difference. A sharp step would be indicative of an artificial change vs. natural drift of microclimate.

    • Kenneth Fritsch
      Posted Feb 17, 2009 at 2:15 PM | Permalink

      Re: Ryan O (#13),

      Another somewhat unrelated thought is that RegEM might have value as a tool of a different sort: detecting whether station moves or instrumentation calibration/replacement affected temperature readings. One might expect changes in microclimate to cause a drift over time, but a sudden, sharp change in correlation would be more likely to arise from man-made influences. Could be a way to verify GHCN splicing of time series, for example. Dunno.

      Tom Karl (he is not a Dr.) looks for non-climate-related changes in the USHCN temperature series using change point analysis.

      • Ryan O
        Posted Feb 17, 2009 at 3:41 PM | Permalink

        Re: Kenneth Fritsch (#18), Thank you. I was not aware of that. 🙂 I have some reading to do, then . . .

  9. jack mosevich
    Posted Feb 17, 2009 at 1:27 PM | Permalink

    Roman: Thanks. Have the data been tested for Normality? Maybe a quick computation of skewness and kurtosis would give a hint.

    • RomanM
      Posted Feb 17, 2009 at 1:54 PM | Permalink

      Re: jack mosevich (#14),

      It isn’t that simple. What we are talking about here is multivariate normality where a single “observation” consists of all measurements in a given month. The dimension of the observation is over 100 in the AWS case and over 5500 in the TIR recon. Furthermore, there are temporal trends as well which would have to be removed. About the best one can do is to look at detrended residuals for each station, etc. and see how well behaved they are. Combined with the other difficulties in the data, doing proper inference on the results obtained is pretty much a matter of faith under such conditions.
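      One way to do that per-station check in R – a sketch on a made-up monthly series; real station data would also need its seasonal structure handled:

        set.seed(4)
        time <- seq(1957, 2006.99, by = 1/12)              # monthly, 1957-2006
        z <- 0.01 * (time - 1957) + rnorm(length(time))    # stand-in station anomalies
        res <- residuals(lm(z ~ time))                     # detrended residuals
        qqnorm(res); qqline(res)                           # visual normality check
        shapiro.test(res)                                  # formal test, to be used with care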

  10. Dave Dardinger
    Posted Feb 17, 2009 at 1:34 PM | Permalink

    Steve,

    Anyway, I should have an end-to-end port of this method to R later today or tomorrow.

    Well, sounds like it’s time for the team to move on now. I’m sure they’re not interested in lots of people who aren’t members of the team being able to detail what’s going on in the locker-room between games / articles.

    Volleyball anyone?

  11. Martin Åkerberg
    Posted Feb 17, 2009 at 2:16 PM | Permalink

    How large is the magnitude of the negative coefficients? Could it be that there are a lot of very small coefficients hovering around zero, and the ones with larger magnitude are all positive? And is the portion of large coefficients increasing from the early 1980s to 2000? Could you plot the distribution of the coefficients?

  12. Alan S. Blue
    Posted Feb 18, 2009 at 12:08 AM | Permalink

    I’m still missing the fundamental reason for infilling or inferring data in the first place.

    First break the spatial area into gridcells (which is already done). Then determine “The Gridcell Temperature” in each gridcell at each time. Gridcells with no data are exactly that – gridcells with no data. Gridcells with 30 datapoints at a given time mean that you’ve probably substantially reduced the error on that particular datapoint.

    When it becomes time to estimate the average temperature of the entire area, you’re doing an average weighted by the errors on the individual gridcells. The ’30 point’ cell is going to get a substantially heavier weight than the zero-data cell (which would actually get a zero weight). Along with the “Temperature Maps” that we keep getting, you’d be able to show an “Error Map.”

    At the next level, when you’re moving from individual times to trend across the years, you’ve got individual error bars on each time. If there’s very little data in 1973, you can see it flat out – because the error for the 1973 data would be enormous. Whereas when you’re doing the same thing for a gridcell in 2003 with full coverage, you should have nice tight error bars.

    So in fitting a trend, you also get to weight the yearly data by its own individual merit. Rock solid data after 2000? Excellent, strongly weighted. All data from any random year missing? That’s fine – you’ve made no assumptions about missing data whatsoever.
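    A rough sketch of that weighting in R (all numbers invented): each gridcell mean gets weight 1/variance, an empty cell gets weight zero, and the area average and its error bar then track data coverage automatically.

      cell_mean <- c(-2.1, NA, 0.4, 1.3)                   # per-gridcell temperature estimates
      cell_n    <- c(30, 0, 3, 12)                         # observations behind each estimate
      cell_var  <- c(0.8, NA, 2.5, 1.1) / pmax(cell_n, 1)  # variance of each gridcell mean

      w <- ifelse(cell_n > 0, 1 / cell_var, 0)             # zero weight for no-data cells
      area_mean <- sum(w * cell_mean, na.rm = TRUE) / sum(w)
      area_se   <- sqrt(1 / sum(w))                        # shrinks as coverage improves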

    • MrPete
      Posted Feb 18, 2009 at 3:47 AM | Permalink

      Re: Alan S. Blue (#32), that sounds like a great methodology to me. But what do I know, I’m certainly no statistician. (Note: Alan’s comment must be read carefully; many times when referring to “average” or “weight” he’s referring to the average or weight of the errors, not of the temperatures.)

    • bender
      Posted Feb 18, 2009 at 7:54 AM | Permalink

      Re: Alan S. Blue (#32),
      I just asked the same question in the other thread. Why infill if your sole purpose is to estimate a trend? Unless you are infilling for one data type based on patterns in another data type …

  13. Geoff Sherrington
    Posted Feb 18, 2009 at 3:02 AM | Permalink

    Re your graphs in header:

    I’m not too sure of the validity of correlation coefficients on annual data averaged from daily data. Using years 2004 (less a day for leap year) and 2005 for Macquarie Island, the CORREL Excel function gives 0.555 for Tmax and 0.453 for Tmin. Years chosen at random.

    The stats package for year 2005 daily data gives:

    Statistic            Tmax            Tmin
    Mean                 7.039452055     3.536191781
    Standard Error       0.116673752     0.142589332
    Median               6.9             3.9
    Mode                 6.6             3.0
    Standard Deviation   2.229048909     2.724165366
    Sample Variance      4.96865904      7.421076941
    Kurtosis             0.086984527     -0.452491633
    Skewness             -0.312414358    -0.457868522
    Range                13.8            13.1
    Minimum              -0.2            -4.3
    Maximum              13.6            8.8
    Sum                  2569.4          1290.71
    Count                365             365

    How do you get correlations of 0.75 in your graphs when taking a favourable case I can only manage 0.45-0.55? Am I missing a square root terminologically somewhere? If not, surely correlation coefficients should decrease as you go further out on a data limb?