## Rule N and Weighted Regression

Today’s post, despite its forbidding title, contains an interesting mathematical and statistical point which illuminates the controversy over how many PCs to retain. In revisiting a virtually unexamined corner of Mannian methodology – regression weights and their determination – I re-parsed the source code, finding something new and unexpected in Mannian methodology and resolving a puzzling issue in my linear algebra.

The key idea is very simple. After creating a network of proxies, Mann does what are in effect two weighted regressions: first, a calibration regression in which the proxies are regressed against the temperature PCs weighted by their eigenvalues; second, an estimation regression in which the reconstructed temperature PCs are estimated by a weighted regression in which different weights are assigned to each proxy. These weights ranged from 0.33 to 1.5.

Now there’s an interesting aspect to Mann’s implementation of the weights using SVD, which has taken me a long time to figure out. After a new push at parsing the source code, I’ve determined that the code actually implements weighting by the square of the weights. I’m going to do a very boring post on the transliteration of these weighting calculations from Mannian cuneiform into intelligible methodology in a day or two.

The effective weighting thus varies from 0.11 to 2.25 (a range of over 20) – with the largest weight being assigned to the Yakutia series. This is amusing, because Yakutia is an alter ego for the Moberg Indigirka series which has a very elevated MWP – we discussed this in connection with Juckes. Use of the updated long Yakutia-Indigirka series will have a dramatic impact on MBH99 as well as Juckes – something that I’d not noticed before.

In the AD1400 network, the sum of squares of all 22 weights is 12.357, of which the NOAMER network accounts for about 16% of the total weight.

How were these weights determined? No one knows. Indeed, I challenge Tamino or anyone else to locate these weights. Weighted regression is not mentioned in MBH98 itself; MBH98 contains an equation for this step which clearly fails to show that any weighting was done. The weights are not discussed in the Corrigendum, but they are given in the Corrigendum SI in the file http://www.nature.com/nature/journal/v430/n6995/extref/PROXY/data1400.txt. In the source code provided to the House Energy and Commerce Committee, the weights are referred to by the variable weightprx.

Weights are mentioned in passing in Wahl and Ammann, who stated incorrectly that:

A second simplification eliminated the differential weights assigned to individual proxy series in MBH, after testing showed that the results are insensitive to such linear scale factors.

and

Additionally, the reconstruction is robust in relation to two significant methodological simplifications-the calculation of instrumental PC series on the annual instrumental data, and the omission of weights imputed to the proxy series. Thus, this new version of the MBH reconstruction algorithm can be confidently employed in tests of the method to various sensitivities.

The argument below shows exactly how incorrect these claims are, and how inconsistent they are with Wahl and Ammann’s own results using 2 and 5 PCs.

### Weights and PC Retention

To illustrate the connection between weights and PC retention, showing how PC retention can be viewed as a special form of regression weighting, let’s construct a network using centered PCs with 5 PCs included, perhaps selected through Rule N (even if Rule N wasn’t used in MBH98, it’s not an implausible retention rule).

Here’s how special choices of weights yield the two cases that have been at issue so far. If you give weights of (1,1,0.001,0.001,0.001) to the 5 PCs, then you’ll get a result that is virtually identical to the retention of only 2 PCs, while if you give equal weights to the 5 PCs, then you’ll get a result that looks like the Stick – heavy weighting on the bristlecones. Wahl and Ammann agree that you get different results using 2 and 5 covariance PCs. So obviously the selection of weights can “matter” – their own calculations confirm this. They happened to show one calculation where the two choices didn’t matter much, but that hardly rises to the standard of mathematical proof. It’s amazing how low the standards of reasoning are in this field.
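The first claim can be checked with a generic weighted least-squares sketch. This is synthetic data, not the MBH network: it simply shows that assigning near-zero weights to some observations is numerically the same as dropping them from the regression.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                      # 8 observations, 2 coefficients
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=8)

def wls(X, y, w):
    """Weighted least squares: b = (X'WX)^{-1} X'Wy."""
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

w_tiny = np.r_[np.ones(5), 1e-3 * np.ones(3)]    # last three series nearly zeroed out
b_tiny = wls(X, y, w_tiny)
b_drop = wls(X[:5], y[:5], np.ones(5))           # drop the last three entirely

print(np.allclose(b_tiny, b_drop, atol=1e-2))    # True: tiny weights ~ dropping
```

The same mechanism is at work when weights of (1, 1, 0.001, 0.001, 0.001) on 5 PCs reproduce the 2-PC result.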

If you assign equal unit weights to all 5 PCs in the network now represented by 5 PCs instead of 2, you also increase the proportion of total weight assigned to the NOAMER network. With two PCs, the NOAMER network had about 16% of the weights; if 5 PCs are each assigned equal weight, the proportion increases from 2/12.357 (about 16%) to 5/15.357 (about 33%), i.e. it nearly doubles.
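The arithmetic of the preceding paragraph, as a quick check (the 12.357 total is the sum of squared weights from the post):

```python
# Sum of squared weights in the AD1400 network, of which the 2 unit-weight
# NOAMER PCs contribute 2.0.
total_2pc = 12.357
share_2pc = 2.0 / total_2pc          # about 0.162, i.e. roughly 16%

# Replacing 2 unit-weight PCs with 5 adds 3.0 to the total.
total_5pc = total_2pc + 3.0          # 15.357
share_5pc = 5.0 / total_5pc          # about 0.326, i.e. roughly 33%

print(round(share_2pc, 3), round(share_5pc, 3))   # 0.162 0.326
```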

Let’s suppose for a moment that there was a reason why Mann allocated 16% of the total weights to the NOAMER network. Perhaps it wasn’t clearly articulated but one can’t exclude the possibility that there was a reason. Why would the proportion assigned to this network increase because of the decision to increase the number of PCs used to represent the network? Mann has changed this aspect of his methodology. On the basis that this network should represent 16% of the total weight, then the weight allocated to all 5 PCs should still amount to 16% of the total weight. Instead of equal unit weights, the 5 PCs should be assigned lower weights.

There’s another weighting issue which reflects the reduced proportion of total variance accounted for by a PC4 (and is a pleasing and rational way of dealing with the frustration that a PC4 should dominate the subsequent regression.) In Mann’s calibration, he weighted the temperature PCs by their eigenvalues. Now this procedure is undone by the subsequent rescaling (hey, this is Mann we’re talking about), but the idea of weights by eigenvalues is not a bad one.

So a plausible implementation for the tree ring network would be to weight each tree ring PC by its eigenvalue so that the more important PCs were weighted more heavily, requiring also, as in the prior paragraph, that the total weights be apportioned so that the network still accounts for 16% of total weight. If this quite plausible procedure is done, then the weights assigned to the Graybill bristlecones are reduced substantially – so that the resulting recon is only marginally different from the reconstruction using 2 PCs as shown below:
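A sketch of this apportionment. The eigenvalue shares below are hypothetical placeholders, not the actual NOAMER eigenvalues; the 10.357 figure for the other proxies’ squared weight is the post’s 12.357 total less the 2.0 NOAMER contribution.

```python
import numpy as np

# Hypothetical eigenvalue shares for the 5 retained NOAMER PCs (illustrative).
eig = np.array([0.38, 0.09, 0.05, 0.04, 0.03])

other_sq = 12.357 - 2.0     # squared weight held by the non-NOAMER proxies
target_share = 0.16         # keep NOAMER at ~16% of total squared weight

# Solve sum(w^2) / (sum(w^2) + other_sq) = target_share for sum(w^2),
# then scale eigenvalue-proportional weights to hit that total.
target_sq = target_share * other_sq / (1.0 - target_share)
w = eig * np.sqrt(target_sq / np.sum(eig**2))

print(np.round(w, 3))       # PC1 dominates; the lower PCs are heavily downweighted
share = np.sum(w**2) / (np.sum(w**2) + other_sq)
print(round(share, 2))      # 0.16
```

Under any eigenvalue profile of this general shape, a PC4 gets a small weight, which is the point of the paragraph above.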

I absolutely don’t want readers to get the impression that any of these squiggles mean anything. They are just squiggles. The MBH squiggle gets a HS shape from higher weighting of Graybill bristlecone chronologies. As I’ve said over and over, the Graybill Sheep Mt chronology was not replicated by Ababneh and all calculations involving the Graybill Sheep Mt chronology (including MBH, Mann and Jones 2003, Mann et al 2007, Juckes, Hegerl etc) should be put in limbo pending a resolution of the Sheep Mt fiasco.

Today I’m just making an appealing mathematical point – differences between the number of retained PCs can and should impact the weights in the weighted regression. The statement by Wahl and Ammann that these weights don’t matter – as a mathematical point – is, in Tamino’s words, “just plain wrong”. Purporting to salvage MBH by increasing the number of retained PCs without reflecting this in the overall weight assigned to the network increases the proportion of weight assigned to the network – in this case dramatically.

If the overall weight of the network is left unchanged and the PCs additionally weighted in proportion to their eigenvalues, the net result does not yield a HS using AD1400 proxies.

AD1400 MBH reminds me of GISS: a sort of downward transform of pre-early-20th-century values.

2. Ross McKitrick

By increasing the number of NOAMER PCs to 5, what is the resulting change in weight on the NOAMER network (i.e. it rises from 16% to what?)

I was having trouble figuring out why regression weights should matter here. In a normal regression, re-scaling the independent variable just rescales the coefficient by the inverse. If your data are in dollars and you report them in pennies, the coefficient just shrinks by a factor 0.01, but nothing changes about the fit or the inferences.

However, I think the reason the rescaling matters here is that the coefficients in one step of the MBH algorithm are used as regressors in another step. Mann had 16 temperature PCs and k (= up to 112) proxy series. If I recall, he regressed the k proxies one by one on the 16 temperature PCs, yielding a matrix G of kx16 coefficients (these regressions have no constant terms). Then the temperature reconstruction for one year is formed by taking the k proxy observations for that year and regressing them on the transpose of G, yielding 16 coefficients. Those 16 coefficients are interpreted as the estimated temperature PCs for that year. I think — not having seen the algebra — the second step effectively reverses the weight cancellation in the first step, preserving the manipulation of weights. A data series in pennies would count for a hundred times as much as the same data in dollars. Is that right?
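Ross’s description of the two-step procedure can be sketched with synthetic data. The dimensions (79 calibration years, 16 temperature PCs, 22 proxies) follow the AD1400 step as described; the data themselves are random placeholders, not actual proxies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_tpcs, n_prox = 79, 16, 22

U = rng.normal(size=(n_years, n_tpcs))                    # temperature PCs
B = rng.normal(size=(n_tpcs, n_prox))                     # true proxy loadings
Y = U @ B + 0.1 * rng.normal(size=(n_years, n_prox))      # proxies

# Step 1 (calibration): regress each proxy on the temp PCs, no constant,
# giving a k x 16 coefficient matrix G.
G = np.linalg.lstsq(U, Y, rcond=None)[0].T

# Step 2 (estimation): regress one year's k proxy values on G to obtain
# that year's 16 reconstructed temperature PCs.
u_hat = np.linalg.lstsq(G, Y[0], rcond=None)[0]

print(u_hat.shape)                                        # (16,)
print(round(np.corrcoef(u_hat, U[0])[0, 1], 2))           # close to 1 here
```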

Steve:
Nope, remember that there are two steps. The effect is in the second part. The residuals for proxy i are weighted by weight w_i. This does more than change the scale.

The formula for weighted regression is
$\hat{U} = YPG^T (GPG^T)^{-1}$
where P is the diagonal matrix of weights, Y the matrix of proxies, $\hat{U}$ the reconstructed PCs, and G the calibration coefficients. In Mann-world, because he does this in a weird way with PC analysis, it works out to P^2.

In partial least squares one-step terms (with cancellations), it works out that the recon coefficients are proportional to the correlation coefficient multiplied by the square of the assigned weight. Very strange, but the cancellations leave a very nice expression.

3. Jean S

#2: Like Ross, I’m also having a hard time understanding this … can you clarify that a bit more? I do not understand why the weights do not cancel out. In the equation you have for weighted regression, the P’s cancel out.

BTW, are these weights the ones given in datalist1XX0.dat -files in Corrigendum SI?

Steve: Yes, I’ve revised and added this link. No, they don’t cancel out; I’ll expand the discussion. I’ve had a lot of trouble sorting out Mann’s weighted regression and am glad that (I think) I can finally recover it. Needless to say, as we’ve observed, there’s something wrong with the AD1400 network – probably in the proxies – so that this RPC can’t be replicated from the available proxies. As we’ve discussed with UC, the reconstructed PC1 is not linear in the available proxies, so something’s wrong with either the archived PC1 or the archived proxies (or both). It’s easier working with the MBH99 proxies, but I’m not sure that the same weighting system is used.
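Whether the weights cancel turns on the dimensions. In the formula $\hat{U} = YPG^T (GPG^T)^{-1}$ from the reply to #2, P cancels algebraically when G is square (as many proxies as retained PCs, so $(GPG^T)^{-1} = G^{-T}P^{-1}G^{-1}$), but not in the overdetermined case. A minimal check with random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)

def uhat(Y, G, p):
    """Uhat = Y P G' (G P G')^{-1}, with P = diag(p)."""
    P = np.diag(p)
    return Y @ P @ G.T @ np.linalg.inv(G @ P @ G.T)

# Square case: 3 PCs, 3 proxies. P cancels, leaving Y G^{-1}.
G = rng.normal(size=(3, 3))
Y = rng.normal(size=(10, 3))
same = np.allclose(uhat(Y, G, np.ones(3)),
                   uhat(Y, G, rng.uniform(0.3, 1.5, 3)))

# Overdetermined case: 3 PCs, 8 proxies. P no longer cancels.
G8 = rng.normal(size=(3, 8))
Y8 = rng.normal(size=(10, 8))
diff = np.allclose(uhat(Y8, G8, np.ones(8)),
                   uhat(Y8, G8, rng.uniform(0.3, 1.5, 8)))

print(same, diff)   # True False
```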

4. BDAABAT

Sorry, OT:

All faculty need to adhere to SOME academic principles or codes of conduct. It seems that he has not upheld his end of the bargain. Without getting into issues of intent, the bottom line is that he produced fundamentally flawed work. He has failed to comply with editorial policies on access to data/programs. He has continued to use the same (or nearly the same) flawed methods after he’s been made aware of the problems. Doesn’t this pattern of behavior violate the policies of his various sponsoring institutions? It appears so…

Cut and pasted some info from both Penn State and UVA’s website on academic grant/contract policy and faculty responsibilities:
Penn State University Policy:

“Although the University is legally responsible to the sponsor as the actual recipient of a grant or contract, the PI is held accountable for the proper fiscal management and conduct of the project….
The PI must also ensure that the project is completed in a diligent and professional manner.”

http://www.research.psu.edu/osp/PSU/Proposal/facgd97.pdf

UVA Statement on Faculty Rights and Responsibilities

Thomas Jefferson helped establish the principles upon which academic freedom is based when he said of the University of Virginia, “This institution will be based on the illimitable freedom of the human mind. For here we are not afraid to follow truth wherever it may lead, nor to tolerate any error so long as reason is left to combat it.”

Selected snippets from the Policy:

“To this end professors devote their energies to developing and improving their scholarly competence. They accept the obligation to exercise critical self-discipline and judgment in using, extending, and transmitting knowledge. They practice intellectual honesty.”

http://www.virginia.edu/provost/Faculty_Handbook_2006_Version/2.4FacultyRightsandResponsibilities.htm

5. Jean S

#3: Yes, you are right. Silly me.

6. Pat Keating

1

I absolutely don’t want readers to get the impression that any of these squiggles mean anything.

The squiggles may not mean anything from a technical point of view, but it goes without saying that the MBH squiggle has meant a great deal politically and in the media.

7. dearieme

I can’t quite make out whether you think the first principal component of Mann’s work is ineptitude or impropriety.

8. Pat Keating

7
Steve avoids making judgments like that, which is good.

9. tetris

Re: 7
dearieme
Don’t know where Steve Mc stands on this and he may decline to tell us. Meanwhile, #4 BDAABAT highlights rules of conduct not followed by Mann, which would point to the latter of your two options.

For those many (most) readers who, like me, don’t know enough statistics to judge this debate, it is worth remembering that Craig Loehle got a robust and meaningful historical temperature record by taking the simple average of a range of published records. If this can be done, then piling on layer after layer of sophisticated data massaging a la Mann should be redundant, and should be treated with suspicion.

I found it interesting that some journals rejected Loehle’s analysis without review, apparently because of its lack of sophistication. It would also appear that steaming piles of impenetrable language are prerequisites for publication in Nature and Science. This may be a reason that Mann et al used so much of it.

For me one of the most telling quotes on this blog came from Dr Jolliffe, who “wrote the book” on Principal Components analysis. He said

“My one (anonymous) interaction with Mann, his co-workers and his critics was last year when I acted as a referee for an exchange of views submitted to Nature. After a couple of iterations I came to conclusion that I simply could not understand what was being done sufficiently well to judge whether Mann’s methodology was sound, but I certainly would not endorse it. At least one other referee came to same conclusion.”

Sorry about drifting off-topic, but for those who can’t really follow this debate, this is a quote worth repeating.

11. Posted Mar 18, 2008 at 9:00 PM

Re #2, on behalf of Mann, Bradley and Hughes (how often do you hear that here?), their method as described is at least a lot more sophisticated than either the “direct regression” or “variance matching” advocated by Juckes, Allen, Briffa, Esper, Hegerl, Moberg, Osborn and Weber in Climate of the Past 2007. “Variance matching,” which was also used by Moberg, Sonechkin, Holmgren, Datsenko, Karlen (and Lauritzen), Nature 2005, makes no use of the correlation between the series at all, and therefore has no statistical merit whatsoever. I like to refer to this method as “Non-Least-Squares (NLS) regression”.

The formula given by Steve in his remark on #2 is essentially that advocated by UC in the CA thread UC on CCE, and is also (2.17) in Brown (J. Royal Statist. Soc. B, 1982). However, in order to be valid, it is essential that P be proportional to inv(V), where V is an appropriate estimate of the covariance matrix of the step 1 regression residuals. The constant of proportionality is immaterial, however, since multiplying P by 100, e.g., will just multiply Uhat by 100/100 = 1.

If it is reasonable to assume that the step 1 residuals all have equal variance, then P is proportional to I, and drops out. Or, if the step 1 residuals have different variances but are uncorrelated, then Weighted Least Squares (WLS) is appropriate. P and V are then diagonal, with p_ii = 1/v_ii, where v_ii is the variance of the residuals for the i’th proxy’s regression on the Temp PC’s. The appropriate weights are not arbitrary, but are dictated by these variances. I have no clue whether MBH derived their weights in this manner.
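Hu’s WLS prescription can be sketched as follows. The data are synthetic; in a real application the v_ii would come from the actual step-1 calibration residuals.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 6                            # calibration years, proxies
U = rng.normal(size=(n, 2))              # two temperature PCs, for simplicity
B = rng.normal(size=(2, k))              # true proxy loadings
sigma = np.linspace(0.2, 2.0, k)         # residual sd differs across proxies
Y = U @ B + sigma * rng.normal(size=(n, k))

# Step 1: regress each proxy on the temp PCs; keep residual variances v_ii.
G = np.linalg.lstsq(U, Y, rcond=None)[0]
v = (Y - U @ G).var(axis=0)

# WLS weights are dictated by the residual variances: p_ii = 1 / v_ii.
p = 1.0 / v
print(np.round(p / p.max(), 3))          # noisy proxies get small weight
```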

More likely, the step 1 regression residuals are not independent, but are correlated across proxies. In this case, you often can’t just estimate V without restriction (as advocated by Brown), since if there are too many proxies relative to the calibration sample size, the outer product of the errors will not be of full rank, and therefore can’t be inverted to perform the calculation. In this case, the covariance structure must be simplified with a priori restrictions, such as that the correlation between two proxies is a simple declining function of the great circle distance between the two sites.
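One way to impose such a restriction, as a sketch: the exponential form, the 3000 km scale length, and the site coordinates below are illustrative choices of mine, not anything from MBH.

```python
import numpy as np

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance (km) between two (lat, lon) points in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = p2 - p1, np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Hypothetical proxy sites (lat, lon) and a correlation declining with distance.
sites = [(37.5, -118.3), (67.5, 142.0), (46.0, 8.0)]
scale_km = 3000.0
n = len(sites)
V = np.array([[np.exp(-great_circle_km(*sites[i], *sites[j]) / scale_km)
               for j in range(n)] for i in range(n)])

print(np.round(V, 3))   # unit diagonal; correlations decay with separation
```

The resulting V is symmetric positive definite by construction, so it can be inverted to supply the P = inv(V) that the GLS formula requires.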

Evidently, MBH ignore the off-diagonal elements of V and therefore P altogether. This may be a big shortcoming of their paper(s). Note that what is relevant is not the correlations of the proxies themselves (which may be 0 if they are PC’s normalized Mann-style to the calibration period), but rather of the first stage regression residuals, which may be much smaller and differently-behaved entities.

In any event, I find it odd that MBH bothered with the 16 temperature PC’s, if it was their sole objective to come up with a NH temperature index. They could have simplified the whole procedure considerably by just calibrating their proxy PCs to NH temperature itself.

12. Posted Mar 19, 2008 at 1:26 AM

Hu,

Re #2, on behalf of Mann, Bradley and Hughes (how often do you hear that here?), their method as described is at least a lot more sophisticated than either the “direct regression” or “variance matching” advocated by Juckes, Allen, Briffa, Esper, Hegerl, Moberg, Osborn and Weber in Climate of the Past 2007. “Variance matching,” which was also used by Moberg, Sonechkin, Holmgren, Datsenko, Karlen (and Lauritzen), Nature 2005, makes no use of the correlation between the series at all, and therefore has no statistical merit whatsoever. I like to refer to this method as “Non-Least-Squares (NLS) regression”.

Interesting comparison, CVM vs. MBH algorithm. Would you prefer to eat a newspaper or drink a bottle of glue? :) Juckes INVR is CCE with identity S, IIRC.

More likely, the step 1 regression residuals are not independent, but are correlated across proxies. In this case, you often can’t just estimate V without restriction (as advocated by Brown), since if there are too many proxies relative to the calibration sample size, the outer product of the errors will not be of full rank, and therefore can’t be inverted to perform the calculation.

Yes, the relevant article is “Multivariate Calibration with More Variables than Observations” by Rolf Sundberg and Philip J. Brown, Technometrics, Vol. 31, No. 3 (Aug. 1989), pp. 365-371.

In this case, the covariance structure must be simplified with a priori restrictions, such as that the correlation between two proxies is a simple declining function of the great circle distance between the two sites.

That is an interesting idea. See the recent posts by Steve: the MBH algorithm discards distance information completely. And Mann later wrote (in Climate Over the Past Two Millennia):

CFR methods do not require that a proxy indicator used in the reconstruction exhibit any local correlation with the climate field of interest, but instead make use of both local and nonlocal information by relating predictors (i.e., the long-term proxy climate data) to the temporal variations in the large-scale patterns of the spatial field.

Makes no sense. At least I’d add some kind of distance-based cost function for ‘teleconnections’; in the MBH9x algorithm anything goes, and that cannot be right.

13. Posted Mar 19, 2008 at 2:07 AM

In this case, the covariance structure must be simplified with a priori restrictions, such as that the correlation between two proxies is a simple declining function of the great circle distance between the two sites.

But only north of 30N and south of 30S. In the equatorial region the spatial correlation is poor.

14. Louis Hissink

Steve,

Reply by email when time suits but the whole idea of “weighting” stuff is problematical.

Sensu stricto, “weighting” is applied as a second step when processing measurements, and actually means converting intensive variable measurements to scientifically useful data.

Mann’s PC contortions of temperature proxies are really nothing more than topological transformations of some imaginable construct; there does not seem any obvious analog with physical reality.

Weighting, therefore, means what in a physical sense, for surely any discussion on this topic must be limited to physical reality, not virtual as dictated by climate model inputs.

15. Patrick Henry

Some good NOAA records of raw temperature data from the 1930s, before any modern temperature manipulations. It appears that 1938 was a very hot year.

16. Posted Mar 19, 2008 at 9:11 AM

UC, #12, writes,

Interesting comparison, CVM vs. MBH algorithm. Would you prefer to eat a newspaper or drink a bottle of glue? [smile] Juckes INVR is CCE with identity S, IIRC.

I’ll take the “glue” any day! Variance Matching or “NLS” as I call it, a la Moberg et al, is statistical nonsense. The MBH data set has a lot of problems, as we are all aware, but at least their estimator, as described above, is an estimator, if only an inefficient one.

If the covariance matrix V of the first stage regression errors is known, then GLS with P = inv(V) is the efficient estimator of U. But any P, including P = I, gives an unbiased estimator. Of course, the variance of a non-GLS estimator requires knowing V, and falsely assuming that V = sigma^2 I will give you wrong answers, so you still have to come up with an intelligent estimate of V. If you have enough degrees of freedom to have reasonable confidence in this estimate of V, you may as well do GLS with it, and thereby improve your efficiency. (With small DOF, weights based on your estimate of V may be cherry picking randomly well-fitting proxies, but I’m willing to assume that away for now.)
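A quick Monte Carlo illustration of the point, using synthetic heteroskedastic data with V known: every choice of P gives an unbiased estimator, but P = inv(V) has the smallest variance.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 50, 2.0
x = rng.normal(size=n)
sd = np.linspace(0.2, 2.0, n)            # known error sds, so V = diag(sd^2)

def fit(x, y, w):
    """Weighted regression through the origin: sum(w x y) / sum(w x^2)."""
    return np.sum(w * x * y) / np.sum(w * x * x)

ols, gls = [], []
for _ in range(2000):
    y = beta * x + sd * rng.normal(size=n)
    ols.append(fit(x, y, np.ones(n)))    # P = I
    gls.append(fit(x, y, 1.0 / sd**2))   # P = inv(V): the GLS weights

# Both means sit near the true beta = 2.0 (unbiasedness), but the GLS
# sampling variance is markedly smaller (efficiency).
print(round(np.mean(ols), 1), round(np.mean(gls), 1))   # 2.0 2.0
print(np.var(ols) > np.var(gls))                        # True
```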

Thanks for the Technometrics cite. I’ll take a look at it.

Hans Erren, #13 writes, re my proposal to model the correlation as a simple declining function of the great circle distance between the two sites,

But only north of 30N and south of 30S. In the equatorial region the spatial correlation is poor.

Hans has already illustrated his point with an excellent graph he posted on 2/11/08 at comment #83 of Thread 2711. The graph itself is on his site at http://home.casema.nl/errenwijlens/co2/station_correlation.gif:

The relationship is indeed tighter north of 23.6N than in the tropics or southern latitudes. An ideal model should somehow take this different behavior into account, but I think that as a first approximation, it’s better to assume that the relationship is universal than to just ignore it altogether.

The “gridding” beloved of climatologists is, if I understand it correctly, a clumsy attempt to take this correlation into account, by essentially assuming that the correlation is perfect within 5-degree gridcells (556 km at the equator) and 0 between gridcells. MBH grid their data, I gather, in an effort to “correct” for spatial correlation in this manner.

17. Patrick Henry

Some problems with GISS temperature data for Death Valley.

http://data.giss.nasa.gov/cgi-bin/gistemp/gistemp_station.py?id=425746190010&data_set=1&num_neighbors=1

GISS shows temperatures from 1911-1921 between 21.5 and 26 C. The official records from 1922 show the average for that period to be much higher, at 27.3C. Also note the photograph of a well maintained and properly located station.

http://docs.lib.noaa.gov/rescue/mwr/050/mwr-050-01-0010.pdf

18. Steve McIntyre

#16. Hu, both methods contain variance matching steps. The difference is whether you have more or less equal weights or whether you weight proxies according to correlation to the trend.

I haven’t parsed some of the recent discussion and will try to spend some time on it.

19. Geoff Sherrington

Re # 16 Hu McCulloch

A quick look failed to find the answer to these questions.

In the series of graphs of correlation coefficient versus station separation, were the graphs conducted with a linear E-W search shape, or with some ellipsoid shape, or with a circle of up to 5000 km radius? Let’s assume the latter.

Given that a half-circumference of the earth is 20,000 km, North Pole to South, if you divide this into 7 zones shown (should be 8, but Antarctic is empty) then each latitude band is about 3,000 km wide. But the points on the graph go out to 5,000 km, so they are sampling adjacent bands. Problem?

The next problem with the graph is that the number of pairs within a latitude zone drops off with distance after a while. I would have thought that the more distant the search, the more numerous the pairs available.

Final problem: a correlation coefficient below about 0.8 gets a bit ratty when plotted. In some of the graphed latitude bands, most of the coefficients are below 0.8 by far, so I personally would not have much regard for them. In fact, I’d read very little of use into this graph other than that the tropics behave dissimilarly to the poles.

Have you a reference to more of this type of work? Please don’t go to any trouble. I just seem to be missing some vital assumptions. Used to do similar things interpolating grades of ore deposits between drill holes.

20. Geoff Sherrington

Why are there so few points nestled beside the left Y-axis? The graphs indicate that there are very few cases of comparison stations being closer than about 200 km to each other.

“Curiouser and curiouser” as Dodgson would say.

21. Posted Mar 20, 2008 at 11:20 AM

Re Geoff, #19,20,

Hans would be the person to answer these questions. I’d like to know if he did these himself or if not where he found them, as they’re very useful.

One thing I see here is that beyond about 3000 KM (27 deg), the correlations are mostly noise, so that it is better to try to constrain them to a functional form than to just take each correlation at face value. They pretty much die to 0 by 3000 KM, except perhaps for the tropics. There is a piecewise linear curve fit line if you look closely.

Another funny thing is that the northern latitude correlations converge on unity at 0 distance, while the tropical and southern ones don’t. Is this climatological, or does it just mean that tropical and southern stations often aren’t as reliable as northern ones?

Geoff is right that it might make a difference whether distance is measured NS or EW. Also, what is the frequency of this data? Daily data surely has much less correlation, so I’m guessing this is monthly or annual.

22. Geoff Sherrington

Re # 21 Hu

Many thanks. I’m trying to help others avoid the tedium of math analysis if some of the fundamental inputs are doubtful. I re-read Steve’s article on station spacing (the one with the diverse map projections) and was none the wiser. We put years of computer team time into finding the right ways to interpolate ore grades, and it is not a trivial subject. It has direct parallels with using paired stations to correct surface temps, and as I have been saying for a long time, you simply cannot use a search distance of 1200 km for temp work. In short, there is no global surface temperature graph, land or sea, in which I have any faith at all. You can drive a big truck through many of the assumptions and corrections.

23. Posted Mar 20, 2008 at 6:57 PM

re 16
I can’t find the source at this moment; if I remember correctly it was from Hansen.
AFAIK it’s a pairwise correlation, grouped in latitudinal bands: of course stations can be more than 3000 km apart – in the EW direction!

24. Geoff Sherrington