Martin Ringo on Principal Components

We have a number of readers who are highly qualified econometricians. I think that initially they find it hard to believe the description of Hockey Team statistical practices. Martin Ringo is one such reader, who has a doctorate in "finite sample properties of a variety of feasible, generalized least squares estimators". He’s sent in the following discussion of principal components reconstruction from a different aspect than any that I’ve considered – whether the temperature PCs from the short calibration period will even accomplish what MBH want. Marty:

Even if we had the exact PC component from the SVD of the long series, it is not enough to reconstruct the temperatures. (That is, even if MBH’s proxies could perfectly estimate the temperature PCs, they still won’t be able to reconstruct what they want to reconstruct.) To reconstruct the temperature, even the averages across all the grids, one needs a better estimate of the post-multiplying matrices associated with the long series SVD than the post-multiply matrices of the short series will give.

Martin Ringo: More on Linear Algebra

I am going to start by reviewing what the proxy variables and all those regressions were about (explained in “MBH Calibration-Estimation Procedure”). There is a relatively short temperature record, roughly 150 years, and Mann et al and most of us would like to have a much longer record. To do this the MBH98 procedure was to use the several linear relationships between the temperature records and proxy variables over that same period to produce a estimate of a set of the principal components of that temperature and then to use that relationship applied to the long period (600 or 1000 years) of the proxy records.

Now for a little linear algebra (although fortunately I am going to avoid all those proxy regression and scaling issues).

MBH98 had 1082 temperature series over 94 years. Call that a 94 by 1082 temperature matrix. The matrix has a most a rank of 94, and thus MBH98 couldn’t use it for principal component analysis (PCA) as such. They expressed on a monthly basis to give a 1128 x 1082 matrix (presumably of full rank 1082). Using the matrix form of MBH98’s equation (in the Methods Section), the temperature matrix could be expressed as:

1) T1 = U1 * S1 * V1′, (for monthly data)

where (for simplicity) I am going to assume that 1) was derived from the compact form of the SVD algorithm with U1 a 1128 x 1082 matrix of principal components, S1 a 1082 x 1082 diagonal matrix, and V1 a 1082 x 1082 matrix. U1 and V1 are, of course, both orthogonal matrices.

MBH98 took the first Neofs PCs of U1 to derive a T^ (“T hat”) estimate of the full temperature matrix.

The thing to note about equation 1) is that it can be used to “reconstruct” T. That is with U1, S1, and V1 we can get T, and if the first Neofs of U1 explain most of the variation, the can approximately reconstruct T from those first Neofs PC. MBH98 took the long series of proxies and relationships from the inverse regression equations to build a U1^ as long as the proxy data. If one goes through the linear algebra in “MBH Calibration-Estimation Procedure,” you will see that the length of the U1^ is independent of the length of T1. Hence the long proxy series allows MBH98 to produce a long estimate of the temperature PCs, a long U^ so to speak.

That is what was done. I looked at the reconstruction a slightly different way. Suppose we don’t have to use U^. Suppose we have the actual U, the actual PCs from the actual, long, temperature series. Then we should be able to do an even better job of reconstructing the full temperature matrix for the long period, say a full 600 or 1200 years, because the actual PCs must be a better estimator than the estimated PCs. Posing the problem this way gives a type of upper bound on what can be accomplished through reconstruction of estimated temperature PCs.

This type problem is just the kind of thing Monte Carlo experiment can — I won’t say solve, but — give insight into. The experiment simulated a set of serially and spatially correlated temperature variables (as deviations from the mean), a matrix T of monthly data. T_full, which is combination of T1 and, say, T2 corresponding to observed and unobserved parts. T1 is decomposed per equation 1) and the annual version of T_full is decomposed similarly:

2) T_full = U_full * S_full * V_full’, (annual)
[where T_full might be 1200 x 1028, U_full the same, S_full 1028 x 1028, and V_full 1028 x 1028]

The first Neofs PCs from U_full are taken: call them W. The U1 from the short period SVD is simply thrown away. In this “world” we know the true PCs. But the S1 and V1 of equation 1) are retained. This is the crucial point: the MBH98 reconstruction has to use the S1 and V1 matrices from the short (observed) period SVD not the S_full and V_full of the full period SVD. We then have two reconstructions

3a) R_short = [W| zeros] * S1 * V1′;
and
3b) R_full = [W| zeros] * S_full * V_full’;
[Note the “[W|zeros]” represents filling out the last 1082-Neofs columns of the matrix with zeros to make the matrix multiplication conform.]

The entire difference in the two estimates are the postmultiplying matrices S and V. with R_short we have a reconstruction just as MBH98 would have done only with a better, in fact perfect, estimate of the temperature PCs, but still limited by the postmultiply matrices being determined from the short (observed) period data. With R_full those matrices are taken from the full period data.

If the MBH98 reconstruction procedure is feasible, it will perform reasonably well in comparison to the R_full reconstruction. That is, the limitation will the fact that Neofs PCs cannot perfectly reconstruction 1082 temperature series.

MBH98 use the Reduction in Error (RE) statistic to “verify” their reconstruction. Alternatives to that are the “Coefficient of Efficiency,” CE, (do not use that term as climatologists use it in economics, sociology, educational research or industrial engineering) and the old standby rho-squared, R2. Let me suggest that Henri Theil’s U statistic — “coefficient of uncertainty [of forecasts],” (Theil, H, Applied Economic Forecasts and Policy, North Holland, 1961), is superior to all three.

4) U = ( sqrt( sum( (Yhat-Y)^2) ) )
/ ( sqrt( sum(Yhat^2))+sqrt( sum(Y^2)) )

John interjects: this might be easier to read

$U = \frac {\sqrt{ \sum {( \hat{Y}-Y)^2}}}{ \sqrt { \sum{ \hat{Y}^2}} + \sqrt{ \sum {Y^2}}}$

It measures the size of the squared errors in proportion to the values being forecasted, both the forecast and the actual values. This (the original) version of the statistic goes from 0, a perfect forecast, to 1, a forecast in which the errors completely overwhelm the forecast, and because of that, I prefer it to the later version with a more tractable distribution but unbounded upper end.

But in any case I calculated all four statistics.

The results: The RE statistics for the R_short reconstructions of the individual temperature series all average well less than 0 for 100 run batches. The grand average is probably something like -20. When the reconstruction is compared on the basis of a spatial average, the RE values improve a bit, to maybe around -5. While it might be fun to hoist Dr. Mann by his own petard, remember the RE is a lousy statistic for Monte Carlos because of its asymmetrical unboundedness. Get 99 RE values of 0.5 and one of -100 (which is can happen), the conclusion on the average is a -0.5, yet the procedures might have been pretty good. The Theil’s U statistic is much more telling, at least to me. There the R_short, i.e. the MBH98 type reconstruction, never performed lower than a 0.70 while the R_full, i.e. what could be done with accurate weightings, never was over 0.2.

A note of caution: It should always be remembered that all Monte Carlo results are dependent on the parameters. For things like the Classical Linear Model (the textbook regression model), changing the parameters doesn’t change the Monte Carlo results (as to the nature or the distribution). But with more complex models, especially when nonlinearity is introduced, use beware. One would like to sample over all the parameter space, but life and computing facilities will run short. And right now I haven’t figured out just what is the appropriate variation of parameters. Also because the time consuming nature of the simulation, I am doing it right now with about half the size of MBH98’s work. I report these preliminary results because they were so dramatic and have remained unchanged across my various parameter changes. Black and white: Neofs — be that 10% of number of series or 75% — principal components constructed from the whole temperature series can reconstruct the full temperature series fairly accurately on an individual basis and very accurately on a spatial average when that reconstruction is based on the postmultiplying matrices that were jointly part of the SVD. But those same principal components cannot even reasonably reconstruct the spatial average when that reconstruction is based on the postmultiplying matrices of a short span of the full period of the temperature reconstruction.

To give a numerical flavor to the differences in the S * V’ matrices from the full versus short series SVDs, below are the two S * V’ matrices for first the full and then the short SVD on a 40 x 4 data set:

Full data S*V’ values:
-0.178 1.432 0.920 -7.440
-0.086 -5.435 3.568 -0.603
-0.849 -2.786 -4.445 -1.065
-4.448 0.398 0.540 0.250

Short data S*V’ values:
0.930 1.379 -0.599 4.414
1.126 -1.833 1.631 0.557
1.244 -0.272 -1.055 -0.320
-0.556 -0.745 -0.548 0.276

And the effect? Below is the 40th observation on the data:
0.161 -0.135 -1.872 1.108

What did the short data SVD reconstruction predict?
0.016 0.031 -0.373 -0.959

Percent differences:
-90.0% -122.7% -80.1% -186.5%

Maybe it isn’t even necessary to go through the Monte Carlo experiments. What MBH98 and the principal components reconstruction advocates appear to be butting their collective heads against is a little linear algebra.

The pseudocode is here and PC reconstruction sample here

This entry was written by Stephen McIntyre, posted on Feb 15, 2006 at 8:16 AM, filed under Multivariate, Statistics. Bookmark the permalink. Follow any comments here with the RSS feed for this post. Both comments and trackbacks are currently closed.

31 Comments

kim

Posted Feb 15, 2006 at 10:03 AM | Permalink

B I N G O and bingo was nis nome o.
===========================================
jae

Posted Feb 15, 2006 at 10:50 AM | Permalink

These analyses are way over my head, but they look like pure mathematics. As such, are they contestable?
Ian

Posted Feb 15, 2006 at 11:36 AM | Permalink

Steve,

In your Monte Carlo experiments, what did the spectrum of your noise look like (was it white noise)? I was just wandering if your results would be different if there was a significant low frequency component. Intuitively, it seems like it might.
Ian

Posted Feb 15, 2006 at 11:41 AM | Permalink

I meant ‘simulated signal’ not noise — but I sure you knew what I meant anyways.
kim

Posted Feb 15, 2006 at 12:00 PM | Permalink

jae, I’m ignorant enough of math for the language and the logic to make sense to me.
===================================================================================
Ian

Posted Feb 15, 2006 at 12:41 PM | Permalink

In case they are listening here is my proposal to the hockey stick team for future supporting/refuting papers

1. Drop the use of controversial proxies. They only represent a small percentage so just drop them.
2. Drop the use of principal components. Steve has made it abundantly clear that there are too many issues with its current usage.
3. For each proxy type use the sensor data as known signal and the proxy minus sensor data as noise to create a wiener filter (spectral filter optimized for known signal spectra and noise spectra). Smooth the filter.
3 (alternate). If one doesn’t like this approach (wiener filter) then visually inspect the spectrum and propose a simple low pass filter with appropriate cut off. I am pretty confident that both approaches will lead to similar results (the spectral filtering approach is quite robust).
4. Apply the filter to the data to get the best estimate of the signal.
5. Create a weighted sum of the signals where the weighting is estimated from a best guess geographic area that the proxy applies to.
6. Do the same but only for an ensemble of European proxies to check for the existence of a MWP, where there is the most confidence that one existed.
7. Compare and publish

If you confirmed the existence of MWP in Europe and still get a global hockey stick, then I, and I assume many others, would be convinced. Goodbye global MWP.

Anyone see any flaws with this approach?

If I find the time, I might even take a crack at doing this myself.
Mark

Posted Feb 15, 2006 at 12:57 PM | Permalink

ian, you just described a generalized sidelobe canceller…
mark
Ian

Posted Feb 15, 2006 at 12:58 PM | Permalink

Mark – Ok. And … doesn’t this seem like a good approach to resolve this issue, or what are the flaws with this approach?
Ian

Posted Feb 15, 2006 at 1:06 PM | Permalink

An important point to add/change would be too apply this technique blindly to all proxies, (yes even the bristlecones so scratch step #1), as part of the problem from the beginning is the danger of cherry picking data.
fFreddy

Posted Feb 15, 2006 at 1:36 PM | Permalink

John A , a while ago you mentioned an add-in that would make it possible to represent mathematical formulae more legibly. Is that available yet ?
(Apologies if you’ve been too busy, but it really would make it easier to thrash through these algebra posts.)
David Stockwell

Posted Feb 15, 2006 at 2:58 PM | Permalink

Yes, this seems to be about the problem of overfitting the random errors that initally concerned me with the MBH98 use of so many PCs. It would be interesting to see how many Neofs are required to minimize the error of the short program (to borrow a figure skating term). If I understand this correctly 10% is way too many and probably even one might capture all the worthwhile signal.
Steve Sadlov

Posted Feb 15, 2006 at 2:58 PM | Permalink

Ian, have you ever considered a career in high speed computing? 😉

As the core voltages get lower and lower, and the clocks get faster … well, you can imagine!

Same principles apply and you do clearly get it. Bravo!
John Hekman

Posted Feb 15, 2006 at 3:04 PM | Permalink

Ian
I don’t know why all this filtering would provide better data series with which to test the hypothesis at issue. Assume for the sake of argument that tree ring width and density could be “perfectly” explained if one had good data on sunlight, temp, humidity and soil quality, whereas the Hockey Team has only tested the ring widths and densities against recent temperatures.

Then the regression of ring width on temperature is going to have residuals with a lot of unexplained variance. In reality, this is due to the effects of the omitted variables: sunlight, humidity etc. Now if you do all this filtering and smoothing, how is this going to give you a better estimate of the effect of temperature on tree rings? Why would one have any more confidence in the estimated parameters coming out of this? Thanks in advance.
Ian

Posted Feb 15, 2006 at 3:26 PM | Permalink

John,

I agree this does not help the situation if the proxies themselves are flawed, however if the proxies are flawed, then this will show up as a high level of noise.

The main advantage of this approach is its simplicity and robustness. The current processing technique is basically an exercise in obsfucation. As Steve has shown it is a way to hide unrealistic weighting of bristlecones and other targeted proxies.

Of course the simplicity of the approach is also part of the problem since scientific papers are designed to be complex to impress one’s peers of one’s mathematical prowess (and also to suppress criticism for fear of looking foolish by not understanding some obsfucated process). Writing a paper that anyone can understand is looked down upon. It took Steve quite a while to decipher Mann’s paper in terms of more elementary linear algebra equations. That barrier is part of the point, and may have been a major reason why the paper passed review in the first place.
Frank H. Scammell

Posted Feb 15, 2006 at 4:04 PM | Permalink

Steve, Suppose you were to take all the proxies that DON’T have a kickup in the region of merging with the “instrumental” record (and are not duplicated in other records), and merge them with (for example) the UAH tropospheric record at a single point (the beginning of the UAH record. I expect this will show that the blade of the hockeystick is grossly exxagerated and a MWP and LIA clearly existed.
Mark

Posted Feb 15, 2006 at 4:22 PM | Permalink

Ian, I think there may be some valid direction with such an approach.

FWIW, a generalized sidelobe canceller is sort of the mother of all adaptive filtering schemes (assuming multiple inputs). In directive (beamformed) signal processing applications, the point is to find the common signals, those that are correlated, across an array of sensors. The correlated signals across the array, after blocking the desired signal, represents the sidelobe interference from just about anything, but usually a jammer of some sort.

In our applications (er, the climate application), we’d actually be looking to keep the correlated signals as they would represent the commonality across all the proxy inputs (the proxies are the analogy to my sensor array, I suppose).

At the very least, this seems reasonable…

Mark
Mark

Posted Feb 15, 2006 at 4:25 PM | Permalink

Somehow the first sentence in my last post came out sort of silly sounding… try this:

“Ian, I think there may be validity with such an approach.”

🙂

Mark
Steve McIntyre

Posted Feb 15, 2006 at 4:35 PM | Permalink

#15 – Frank, you might look at my splice of Moberg with the satellite record here – http://www.climateaudit.org/?p=179. There’s obviously lots of hair on Moberg as a record, but the difference between the satellite and Jones records is material to the look of the rhetorical hockey stick.
Steve McIntyre

Posted Feb 15, 2006 at 4:41 PM | Permalink

#11 – David, the overfitting is even worse: Mann represents each of the temperature PCs in terms of 22-112 proxy series. The representation is related to multiple regression – I’ll chat about the difference soon. This is a big anvil.

#14. Ian, I don’t necessarily agree that flawed proxies will show up as noise. That will be the case only if the bulk of the proxies actually are “proxies”. Look at the correlations reported in our Reply to Von Storch, where the validity of the proxies is contested.
Ian

Posted Feb 15, 2006 at 4:59 PM | Permalink

Mark,

I have had some experience with beamforming array data, but was not familiar with that term. From my experience with beamforming, it is robust and works even with very low SNR. It’s only downside is minor sidelobe artifacts, which people attempt to attenuate through various window functions.

I have seen many people try fancier techniques, such as maximum entropy, SVD, etc, however these techniques only seem to work under high SNR, and as far as I can tell the only advantage of these techniques (besides sprucing up a paper with unneeded complexity) is they can provide sharper peaks and eliminate sidelobes. With low SNR, these techniques fail horribly providing strong spurious peaks and no peak where they should be one. I think most people will agree that the proxy data has a low SNR.
Ian

Posted Feb 15, 2006 at 5:04 PM | Permalink

Steve,

Even if the flawed proxies do not show up as noise at the state will be given reasonable weightings, far far lower than their current weightings (by a factor of a 100 for bristlecones, if I remember correctly the effective weighting from one of your other posts).

But you’re right, just eliminate them, they only represent a small fraction of the overall number of proxies, however doing so would open one up to claims of cherry picking. At least eliminating 5% of proxies is a lot easier to defend then selecting 5%, which is what is going on now.
Ian

Posted Feb 15, 2006 at 5:06 PM | Permalink

Sorry that should say:

Even if the flawed proxies do not show up as noise at least they will be given reasonable weightings

The dangers of using voice recognition system.
Mark

Posted Feb 15, 2006 at 5:21 PM | Permalink

The generalized sidelobe canceller is nothing more than the top-level description of a beamformer. The underlying principle is simply to find the correlated elements across a signal set. Assuming your primary signal (main-lobe from a directional antenna) is pointing at your desired signal, and each canceller element is omni-directional, the GSC actually steers nulls (spatially) towards sidelobe interferers.

That there are lobes in the directional portion of the desired signal is irrelevant to this discussion. Where I noted similarity is the concept of finding the correlation across an array of proxies. In this sort of system, the correlated bits are actually the desired signals (as opposed to the noise we want to reject in a radar application, for example), sort of a reverse of the canceller problem. I.e. we are in a sense pulling the correlated signals out of the noise.

I typically used a form of a mofified Gram-Schmidt orthogonalization to implement the GSC and generate the weights – one per element or proxy in this case (do a search at the USPTO for “Gerlach, Karl”
and you’ll see the actual algorithm I’m running). The cascaded structure I use has computational benefits that aren’t important when the process is not real-time. However, I’m wondering if a similar approach can be used, without a main-lobe analogy, to find a correlation across a proxy set and “pull out” the signals from the background noise.

Hopefully we haven’t derailed to far off topic, Steve. 🙂

Mark
Steve Sadlov

Posted Feb 15, 2006 at 8:03 PM | Permalink

Mark, no comment yet from Steve M, but here are my two pence. The more we can get people such as yourself, with experience in things like ECM, or folks with experience in noise management in high speed digital circuits, and the like, the more we can bring to bear the best the world has to offer in order to not only challenge the current “climatological” orthodoxy, but also, to come up with entirely new methodologies for analyzing the – pardon the pun – real climate!
Harry G

Posted Feb 16, 2006 at 6:38 AM | Permalink

John

re your interjection in the original article by Mark – I think you missed out a square on the second Y hat.
Harry G

Posted Feb 16, 2006 at 6:42 AM | Permalink

Oops sorry – my apologies – I meant in the original article by Martin Ringo – getting old and dyslexic.

John responds: Quite right! Equation updated. At least someone is awake on this blog….
David Stockwell

Posted Feb 16, 2006 at 8:22 AM | Permalink

Re: #19 It would be interesting to see the error predicting the known long series with the short clibration series, plotted against number of Neofs. That would tell us the number that maximizes accuracy. I would do it myself if the script was in R.
David Stockwell

Posted Feb 16, 2006 at 9:11 AM | Permalink

The more I look at these graphs I think Martin is right. This method couldn’t even predict the mean. You can see this by imagining a process where the proxies are random series with a given mean, from which a small number are chosen based on correlation with temperatures at the end. Then the reconstruction of those selected series is simply the mean – which is an arbitrary parameter! Looking at the reconstructions, including Moberg, Esper and others, you could see (reading backwards in time) the height of the MWP as a reversion of a random series to a predetermined mean, after a purturbation due to the biased selection for a downward impulse matching temperatures.
Mark

Posted Feb 16, 2006 at 10:25 AM | Permalink

Steve S. Agreed. At least, I think that “related” fields provide a different perspective. There’s a common math underlying any form of signal processing (that’s all this is), but a different viewpoint for each area of interest.

I’m still thinking about ways to extract what we want, btw. What I do for radar (ECM as noted) correlates noisy elements against a known signal or primary element, which is something not necessarily present in climate science (if we had a 1000 year true temperature record we could, but we don’t). perhaps we could “test” the known flawed proxies against the noisy ones to show the correlation between them?

Mark
Martin Ringo

Posted Feb 19, 2006 at 12:57 AM | Permalink

Re #27, David,
I do not believe that the will a great deal of sensitivity in “most cases” for changes in the number of principal components used. The reason is that in most cases (my that I mean most of the parameter space for a given model of the Data Generating Process, “DGP,”) the relevant statistics are close to 1 or 0. I have been playing around (euphemism for I’m still debugging parts of the model) and there are specifications that will bring the “full” and “short” reconstructions closer together. I will run the sensitivity to the number of PCs that in the near future.

Re #28, because I am a simple minded old guy (or maybe because I live most of my working life in a business world), I like to see the numbers and the math. So I try to build my simulations in spreadsheets first with a much simplified version. The “Can one reconstruct from principal components with short sample weighting” issue — that’s a mouthful — becomes very clear in the spreadsheet: no REs, no rho-squareds, not even Theil’s Us, just some simple graphs and the interocular test. The results hit you between the eyes: with weightings (the post-multiplying matrix) from the full sample (which can’t be done in reality), one can reconstruct; with weightings from the short sample, one can’t.

Note, this does not rule out reconstruction from one or a few proxies to one temperature series. But it cast considerable doubt on the reconstruction from many proxies to a many temperature or to a temperature aggregate series. Making this claim truly applicable resolves on how well one can model the range of reasonable specifications of temperature. If temperature were strictly periodic with, say, white noise and with constant spatial correlation, then short sampling would do pretty well. It is easy to go to a fully randomly periodic variation, but how does one move on the continuum between the two? Or how much should the spatial correlations be allowed to change over time? All these things make a big difference in the results. The MBH98 type of reconstruction is a complicated way of saying that the relevant temperature patterns of the past are sufficiently strong in the current instrument record that even if we had the full record, we would change how much to weight PC1, PC2, etc. to reconstruct temperature series A and so forth for series B, C … That condition is a “maintained hypothesis” (untested and assumed a priori to be true, and I think unstated) of the MBH98 model. They would say my type simulations only apply if the simulations could produce short sample weightings that reconstructed temperature approximately correctly. They would probably phrase it in the form that the work is irrelevant unless it simulated strong forcings that affected all series approximately the same way. {Ever think about why someone would have to run a climate model to test if the MBH98 transformation mined for hockey stick? Running a tree growth model (e.g. TreeGrOSS, PTAEDA, STIM, TREGRO — go take a look at it to see what the denrochronology guys are missing by not teaming up with the commercial forestry guys) or maybe something that David figured out in his kind of modeling might make sense as the underlying DGP, but not a climate model. It merely serves to restrict the DGP of the proxies, not estimate it from empirical data.}

Finally note, while we can never test the maintained hypothesis on temperature, we can do a “proxy” test, in both senses of the word. Can the proxy variable series be reconstructed from the short sample? If the proxy variable sets can’t be reconstructed by principal component analysis, then there is more than reasonable doubt as to efficacy of doing so on the temperatures. Any interest?
David Stockwell

Posted Feb 21, 2006 at 9:33 AM | Permalink

Re #30. There are three possible outcomes of plotting error against number of PCs. (1) errors decrease with increasing PCs (Mann’s maintained hypothesis) (2) errors are unchanged with increasing PC’s (Martin’s hypotheisi) and (3) errors increase with increasing PC’s (David’s hypothesis). This seems like a nice experiment and I will be interested in seeing the result!

Re #30 re #28. I am the same way. I have generally found z tests to underestimate errors, and like to construct simulations to understand what is going on. As to means, I have a theory that the temperature of the MWP from the reconstructions is related to the normalizing temperature for the series. If I get a chance I will collate and plot the zero anomoly temperature against the MWP temperature for a number of series and see if there is a positive slope! Once again, not proof, but suggestive of these proxies are simply expressing hidden ‘maintained hyportheses’.