Here’s something amusing.

Mann has written to the House Energy and Commerce Committee, arguing that we made a fundamental and obvious mistake in how we calculated AR1 coefficients for the North American tree ring network, which exaggerated the HS-ness of the simulated hockey sticks – a mistake that Ritson has now supposedly picked up and which supposedly any reviewer would have picked up.

The irony is that the method that we used to calculate AR1 coefficients is **identical** to the method used by Mann in his Preisendorfer diagram for the NOAMER network submitted to Nature and posted up at realclimate, as I prove below.

Reviewing the debate a little, here’s what Mann wrote to the House Energy and Commerce Committee:

There is another element of this question which raises a deeply troubling matter with regard to Dr. Wegman’s failure to subject his work to peer review, and Wegman’s apparent refusal to let other scientists try to replicate his work. Professor David Ritson, Emeritus Professor of Physics, Stanford University, has found error in the way that Dr. Wegman models the “persistence” of climate proxy data. Interestingly, this is the same error Steven McIntyre committed in his work, which was recently refuted in the paper by Wahl and Ammann, which was in turn vetted by Dr. Douglass Nychka, an eminent statistician. Dr. Ritson has determined that that the calculations that underlie the conclusions that Dr. Wegman advanced in his report are likely flawed. Although Dr. Ritson has been unable to reproduce, even qualitatively, the results claimed by Dr. Wegman, he has been able to isolate the likely source of Wegman’s errors. ..

Moreover, the errors that Dr. Ritson has identified in Dr. Wegman’s calculations appear so basic that they would almost certainly have been detected in a standard peer review. In other words, had Dr. Wegman’s report been properly peer-reviewed in a rigorous process where peer-reviewers were selected anonymously, it likely would not have seen the light of day. Dr. Wegman has thus unwittingly provided us with a prime example of the importance of the peer review process as a basic first step in quality control.

Here’s the "mistake" supposedly identified by Ritson in the calculation of AR1 coefficients. (Even if Ritson were correct, all that this would affect is the HS-ness of our illustration of the biasing effect – it doesn’t in any sense disprove the biasing effect.) Ritson:

To facilitate a reply I attach the Auto-Correlation Function used by the M&M to generate their persistent red noise simulations for their figures shown by you in your Section 4 (this was kindly provided me by M&M on Nov 6 2004 ). The black values are the ones actually used by M&M. They derive directly from the seventy North American tree proxies, assuming the proxy values to be TREND-LESS noise…Surely you realized that the proxies combine the signal components on which is superimposed the noise? I find it hard to believe that you would take data with obvious trends, would then directly evaluate ACFs without removing the trends, and then finally assume you had obtained results for the proxy specific noise! …Your report makes no mention of this quite improper M&M procedure used to obtain their ACFs.

Now we’ve done a variety of calculations to show the artificial hockey stick effect. The calculation that’s received the least attention, but is probably the most pertinent, is the discussion in the Reply to VZ, in which we talk about the impact of 1-2 "bad apples" in an MBH context. Given the concern over potential nonclimatic effects in bristlecones, this is actually a more important issue than the red noise argument. In our red noise discussions, we did two calculations – one with ARFIMA noise and one with AR1 noise. The ARFIMA noise produced pretty hockey sticks but introduced a secondary complication, and replications have focused on AR1 examples. To set parameters for the simulation, we calculated AR1 coefficients on the North American AD1400 tree ring network using a simple application of the arima function in R:

arima.coef = arima(x,order=c(1,0,0))

This is what Ritson is criticizing, arguing that application of a standard arima function to a tree ring network without previously removing trends is incorrect. Now it seems to me that Ritson has recently argued that VZ’s implementation of MBH made some sort of ghastly error by removing a trend prior to regression – so it’s hard to say what Team policy is on when trends should be removed and when trends shouldn’t be removed – but that’s a story for another day.
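To make the disagreement concrete, here is a minimal sketch in R of the two procedures, run on a synthetic series rather than on any actual proxy. The series `x`, its length, the trend slope and the linear detrending via `lm` are all assumptions made for illustration; Ritson's own trend-removal procedure is not specified in this post and may well differ.

```r
set.seed(1)
n <- 581

# Synthetic stand-in for one proxy (an assumption, not real tree ring data):
# AR1 noise with phi = 0.3 plus a weak linear trend.
noise <- as.numeric(arima.sim(model = list(ar = 0.3), n = n))
x <- noise + 0.002 * seq_len(n)

# Procedure used in the post: fit AR(1) directly to the series as-is.
ar1_raw <- unname(coef(arima(x, order = c(1, 0, 0)))["ar1"])

# Alternative: remove a fitted linear trend first, then fit AR(1) to the residuals.
x_detrended <- as.numeric(residuals(lm(x ~ seq_len(n))))
ar1_detrended <- unname(coef(arima(x_detrended, order = c(1, 0, 0)))["ar1"])

c(raw = ar1_raw, detrended = ar1_detrended, difference = ar1_raw - ar1_detrended)
```

The size of the difference depends entirely on how strong the trend is relative to the noise, so the numbers from this toy series say nothing about the actual network.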

However, my point here is different. Whatever the right method may be, the method that I used simply followed Mann’s own methodology. This can be proven by looking at his Preisendorfer simulations posted up at realclimate here (which were also submitted to Nature). In the SI to Mann’s revised reply to our Nature submission (all unpublished), Mann stated (posted up here for the first time):

We performed the experiments described by MM04 , producing various realizations of M=70 statistically independent red noise series of length N=581 ‘years’, using an N(0,1) Gaussian innovation forcing and the lag one autocorrelation coefficients of each of the actual M=70 North American ITRDB data for the interval 1902-1980.

From this calculation, Mann produced the figure shown in the left panel below, which was submitted to Nature and later posted up at realclimate in an identical form. In the right panel, I show my replication of this figure, posted up early last year here and discussed here at CA. This exact replication of Mann’s diagram was produced using the arima coefficients calculated above – if the coefficients had been calculated using Ritson’s method, the diagram would have been much different. So whether this method of calculating AR1 coefficients is right or wrong, it is **EXACTLY** the method used by Mann himself.

**Mann’s Original Caption for Left Panel: FIGURE 1.** Comparison of eigenvalue spectrum for the 70 North American ITRDB data based on MBH98 centering convention (blue circles) and MM04 centering convention (red crosses). Shown is the null distribution based on simulations with 70 independent red noise series of the same length with the same lag-one autocorrelation structure as the actual ITRDB data using the centering convention of MBH98 (blue curve) and MM04 (red curve). In the former case, 2 (or perhaps 3) eigenvalues are distinct from the noise floor. In the latter case, 5 (or perhaps 6) eigenvalues are distinct from the noise floor. The simulations are described in "supplementary information #2".
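For readers who want to see the mechanics behind a null distribution of this kind, the following is a rough sketch in R of one realization of the experiment as Mann's SI describes it: 70 independent AR1 series of length 581 with N(0,1) innovations, followed by an eigenvalue spectrum under each centering convention. The AR1 coefficients in `phi` are hypothetical placeholders (the real ones come from the ITRDB series), the last 79 values stand in for 1902-1980, and the sketch omits any standardization steps that the original calculations applied, so it is an illustration and not a replication.

```r
set.seed(42)
M <- 70; N <- 581; n_cal <- 79   # last 79 "years" stand in for 1902-1980

# Hypothetical AR1 coefficients (assumption); the real ones are fitted to the proxies.
phi <- runif(M, min = 0.1, max = 0.8)

# One realization: M independent AR1 series driven by N(0,1) innovations.
sim <- sapply(phi, function(p) as.numeric(arima.sim(model = list(ar = p), n = N)))

# Normalized eigenvalue spectrum: fraction of total sum of squares per component.
eig_spectrum <- function(X) {
  d2 <- svd(X)$d^2
  d2 / sum(d2)
}

cal <- (N - n_cal + 1):N
# Full centering: subtract each series' mean over the whole period.
full_centered <- sweep(sim, 2, colMeans(sim))
# Short centering: subtract each series' mean over the closing segment only.
short_centered <- sweep(sim, 2, colMeans(sim[cal, ]))

spec_full  <- eig_spectrum(full_centered)
spec_short <- eig_spectrum(short_centered)
round(rbind(full = spec_full[1:5], short = spec_short[1:5]), 3)
```

Repeating this over many realizations and averaging the spectra would give curves comparable in kind to the blue and red null curves in the caption.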

While we’re looking at these diagrams, it’s interesting to look at a couple of other examples to see that Mann’s use of the Preisendorfer criterion cannot be demonstrated in other networks, as I pointed out early last year, although nobody from realclimate has explained the discrepancies and Wahl and Ammann avoided the topic altogether. On the left is a diagram for the Stahle SWM AD1700 network – no fewer than 9 PCs were actually retained. On the right is a diagram for the Vaganov AD1600 network – in this case only 2 PCs were retained. What system did Mann actually use to decide retained PCs? I have no idea. You can’t get the retention decisions from these diagrams. No code for these retention decisions has ever been produced. I’d love to know how the retentions were made.

Mann’s tactics are really pretty amazing sometimes. Why pick another fight on such a bad issue? He should have just shut up, taken his medicine and got on with it. What if Wegman takes him at his word and actually examines the issues that Mann raises here? He will conclude that the climate science community has taken leave of their senses.

## 21 Comments

So, there is some kind of ‘fear of the high AR1 coefficients’. I assume that you simulated observations in mm05(grl), observation=signal+noise. And Ritson thinks that you simulated noise? And he argues that noise in proxies is not as red as you simulated.

Thanks to Steve M’s criticism of their work, the HT are starting to pay much closer attention to the issue of error structures. Their argument that the AR1 model is too simplistic is correct. But that’s not the point. The AR1 model is THEIR model, which Steve M was merely replicating.

They know they need a better model now. But rather than innovating with some new models, they are using the lessons learned from Nychka to score some quick optics victories attacking Steve M’s credibility. This was predictable, and I further predict this tactic will not abate. Despite the hit they took, the team has much capital and residual credibility to work with. They will mobilize those resources to innovate at a rate that will make it very difficult for them to be caught again. To the extent that they rely on other equally flawed Mannomatic methods, however, (such as RegEM) they WILL trip up again.

If you thought the first round of argumentation was inanely technical, you ain’t seen nothin’ yet.

More Mannian fog?! When will this clear up?

It is hard to know where to begin, particularly since SteveM has clarified all the technical points. Nonetheless…

Mann’s call for peer review of Wegman’s report (aside from the social network stuff, which IMHO would have benefited from peer review) is ridiculous. Wegman was providing expert peer review on MBH and MM, not introducing original work. Does Mann believe we have to peer-review peer review?

Also, if I’m not mistaken, Wegman did get the work peer reviewed anyway — by none other than the Board of the American Statistical Association.

Finally, and this needs to be noted, Wegman is 1000 times better at statistics than are MBH combined. This is obvious from even a cursory look at the resumes.

(Admittedly, there are some typos in Wegman’s report, but they are irrelevant to the substance. Do I recall correctly that the last time I examined Mann and Ritson’s work, they were unable to correctly compute an ACF?)

One last thought: It is telling that Mann is lashing out at Wegman now (does he do this with all his critics?). Mann must finally realize how devastating the NAS and Wegman reports are in their criticism of his methods.

Their opening must be ‘Gambit’. Sacrifice Ritson, more fog and more time to play.

#3

If he is Anonymous Referee 2, then the answer is yes.

Barton et al didn’t bother going after Mann on issues like non-reporting of adverse results. Mann has claimed that their not bothering with the non-reporting issue amounts to some sort of vindication (see his San Francisco Chronicle comments especially), when in fact the House Committee simply didn’t bother with it and the NAS panel intentionally avoided inquiring.

I don’t think that Mann realizes how lucky he was that they lost interest in him. He’s led a sheltered life obviously and has no idea what tough lawyers can do if they turn their minds to it. There’s more than one type of smart person in the world. Why would you give these guys a chance to change their minds by taking a parting shot? I’m sure that they won’t bother but it’s a foolish chance to take.

Can I recommend to Professor Wegman that he should immediately go to the New York Times and tell them that he is not going to be intimidated into releasing his code?

Alternatively, if Ritson is unable to reproduce Wegman’s results, how does he know that Wegman made an error? Is it like telepathy?

Ritson appears to be arguing that a high AR1 coefficient is actually evidence of a trend+low AR1 coefficient. Well, sometimes a series has a high autocorrelation component. In any case, none of this has any bearing on the central problems. In the MBH98 context the decentering has a specific effect: artificially promoting the bristlecone signal to PC1 instead of PC4 and inflating the explained variance term. This arises due to the shape of the graph, not (merely) the autocorrelation of the series. As Wegman explains (with a theoretical derivation), decentering here is incorrect mathematics, end of story. The resulting “PCs” are not actually PCs. The decentering on AR1 coeffs matters for benchmarking the RE stat. Suppose we simply conceded the point Ritson seems to be making. That doesn’t affect the bristlecone issue, but might change the RE benchmark, up or down. Even if it lowers the RE benchmark (and thereby increases the chance that the hockey stick attains significance narrowly defined) it would not affect the failure on the r2 and CE scores. Under the circ’s if the RE benchmark jumps around just based on a small change in noise specification then it’s one more reason to ignore the RE stat until some clever stats buff works out an exact or asymptotic distribution. Surely one of the clear lessons from NAS is not to rely just on the RE, especially when the CE and r2 stats are insignificant, as they are for MBH98.

I conjecture that Ritson’s noise model won’t change the RE benchmark much, since Ritson’s noise model includes a trend, which is likely susceptible to the decentering effect in any situation where the red/white distinction matters. But even if I’m wrong about that, the CE and r2 scores as reported by Wahl-and-Ammann-and-vetted-by-eminent-statistician-Nychka show that the hockey stick has no significance prior to the late 1700s. The CE scores in particular show that it is worse than simply using the verification period mean. Nothing raised by Ritson changes that.

Steve, could you produce a version of the first graph using this new method they are saying you should have used (even if they themselves didn’t?)

I’m curious to see if it makes any real difference, and by showing that it produces different results than they have reported, it will show just how silly their attack is.

I have one last question: is this science, or is this a school yard argument? I wonder how long it will be before others start realizing that their defences are a sham, and what that reveals about the quality of their work in the first place.

Probably the most galling result (to the Hockey Team) of the McIntyre-McKitrick papers is the bias of the MBH PC method (transformation before calculating the principal components). The MBH method will exaggerate hockey stick shaped series if they are part of the data set, and the method will find hockey stick principal components out of serially correlated (“red”) noise (essentially all the time if the noise is modeled as ARFIMA with the same ACF, autocorrelation function, as the original tree ring data). This is the “cherry picking” by algorithm issue, not to be confused with the cherry picking by hand which is probably more important in the long run issue of whether any reconstruction is valid.

Cherry picking by algorithm, and hence biasing the results, is an issue of both bias and competence. Had the result been acknowledged when it was initially pointed out the bias issue would have disappeared had the Team really “moved on.” The competence issue would by now also have vanished for lack of attention. One “Oops, my bad,” and it becomes both old news and non-controversial.

The mathematics (or maybe I should say the computation) of principal components is sufficiently messy that it would be a tedious exercise to show the bias, and the conditions where the bias appears, for the MBH method. However, the result can be shown in both simple and involved simulations. The first one I did was in a spreadsheet for a little set of 10 variables with 100 observations, looking at white noise with one deterministic hockey stick. But more to the point is the 581 x 70 matrix with means equal to one and various types of red noise. The MM05b conclusions were based on LTP based on the NOAMER ACF. David Stockwell did a similar exercise with a random LTP. I suspect that any kind of LTP for the simulation will show the bias, since LTP produces “long runs” which get amplified by the MBH method.

But Steve has shown the bias with an AR(1) model. The bias here becomes very obvious around an AR1 coefficient of 0.25, not that it can’t be seen for coefficients as low as 0.1. [The average NOAMER ACF1 value is about 0.4 and over 20 of the first order autocorrelations are over 0.5, at which point the MBH method finds hockey stick PC1 around 99% of the time.] Now there are ARMA models which will not show the bias: do a simple negative AR1 coefficient. Maybe this is where the trend removal comes from, but it doesn’t work: the trends over the whole series are largely so weak that there is little effect on the major AC coefficients and thus the bias will show with an ARMA or ARFIMA fit and simulation.

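| Series | Difference (AR1 with trend minus AR1 without trend) |
|--------|------|
| ar049 | 0.00 |
| ar050 | 0.00 |
| ar052 | 0.00 |
| ar053 | 0.00 |
| az082 | 0.00 |
| az086 | 0.00 |
| az510 | 0.00 |
| az550 | 0.00 |
| ca065 | 0.00 |
| ca073 | 0.00 |
| ca084 | 0.00 |
| ca087 | 0.00 |
| ca528 | 0.04 |
| ca529 | 0.03 |
| ca530 | 0.03 |
| ca531 | 0.01 |
| ca532 | 0.00 |
| ca533 | 0.04 |
| ca534 | 0.08 |
| ca535 | 0.00 |
| ca555 | 0.01 |
| co067 | 0.00 |
| co076 | 0.00 |
| co509 | 0.00 |
| co509x | 0.00 |
| co511 | 0.00 |
| co522 | 0.00 |
| co523 | 0.05 |
| co524 | 0.04 |
| co525 | 0.00 |
| co535 | 0.02 |
| co545 | 0.00 |
| co547 | 0.00 |
| ga002 | 0.00 |
| ga003 | 0.01 |
| ga004 | 0.00 |
| la001 | 0.00 |
| mt006 | 0.00 |
| nc008 | 0.00 |
| nm025 | 0.00 |
| nm026 | 0.00 |
| nm559 | 0.00 |
| nm560 | 0.00 |
| nm572 | 0.03 |
| nv037 | 0.00 |
| nv049 | 0.00 |
| nv053 | 0.00 |
| nv056 | 0.00 |
| nv060 | 0.00 |
| nv061 | 0.00 |
| nv510 | 0.05 |
| nv511 | 0.03 |
| nv512 | 0.05 |
| nv513 | 0.09 |
| nv514 | 0.20 |
| nv515 | 0.00 |
| nv516 | 0.01 |
| nv517 | 0.00 |
| or009 | 0.00 |
| or012 | 0.00 |
| or015 | 0.00 |
| sd017 | 0.00 |
| ut023 | 0.00 |
| ut508 | 0.00 |
| ut509 | 0.01 |
| va021 | 0.00 |
| wy005b | 0.00 |
| wy006 | 0.03 |
| wy023 | 0.00 |
| wy023x | 0.00 |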

Of course, the difference is irrelevant because the whole purpose of the MBH paper was to find the multi-decade changes, i.e. just the kind of thing that trends are. Thus, the mis-modeling accusation looks like one of “You people [McIntyre, McKitrick, Wegman, and all others who independently found the bias] didn’t remove the hockey stick shapes, er… make that trends, from the simulated data. Therefore, you have bias.” Of course, even if that were true, it is wrong because the MBH method will find hockey sticks, per the MM test, in cases where none of the data series individually show a hockey stick per the same test. For instance, a hockey stick series (MM test > 1 in absolute value) is less than a 1 in 100,000 occurrence for an AR1 = 0.5 simulation, but an AR1 = 0.5 simulation of data will produce MBH PC1 hockey sticks 99% of the time. Thus, I will slightly disagree with Ross’s description of what the MBH transformation does: it is not so much an issue of promoting one series as it is one of concentrating the “hockey-stickedness” of the whole data set (by the choice of weights, which I concede can be called “promoting”) into a single PC, which coincidentally reduces the variance of the shaft of the PC1 and hence gives a reduced MWP. The hockey stick PC1 happens each time the hockey stick shape is the largest single type of variation, which is what the centering of the data on the blade (tail end) of the series produces with even moderate serial correlation. Remember, all PC methods do this too, just not to the degree of the MBH method.
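The comparison described above can be sketched roughly in R. This assumes the “MM test” is the hockey stick index, taken here to be the closing-segment mean minus the full-period mean, divided by the full-period standard deviation; that definition, the 79-value closing segment, and the simplified short-centered PC step (no rescaling of the series) are assumptions for illustration, not the original code.

```r
set.seed(7)
N <- 581; M <- 70; n_cal <- 79
cal <- (N - n_cal + 1):N

# Hockey stick index as assumed here: closing-segment mean minus
# full-period mean, in units of the full-period standard deviation.
hsi <- function(v) (mean(v[cal]) - mean(v)) / sd(v)

# One simulated network: M independent AR1 = 0.5 series of length N.
sim <- replicate(M, as.numeric(arima.sim(model = list(ar = 0.5), n = N)))

# Index of each simulated series taken on its own.
hsi_series <- apply(sim, 2, hsi)

# PC1 of the same data after centering each series on its closing segment only.
short_centered <- sweep(sim, 2, colMeans(sim[cal, ]))
pc1 <- svd(short_centered)$u[, 1]

# The sign of a PC is arbitrary, so compare absolute values.
c(max_abs_series = max(abs(hsi_series)), abs_pc1 = abs(hsi(pc1)))
```

Wrapping this in a loop over many simulated networks and counting how often `abs(hsi(pc1))` exceeds 1 would be one way to check the frequencies quoted above.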

And remember, all these problems come before the statistical infirmities of the regression analysis, which can also be rigged, that ties the multiproxies to the temperatures. Were I a young paleoclimatologist, I would get to work showing the conditions sufficient for successful reconstruction. Right now I doubt anyone has shown that a reconstruction can be accomplished with anything but orthogonal original series or series so treated. That is, it may be that there is no possible “multi” in multiproxy.

Marty, but even worse than the red noise issue is the “bad apple” issue which we discussed in Reply to VZ. Under plausible circumstances, if you have even one nonclimatic HS series, even in a network with a signal, you get a HS. I’ve done some experimenting with the VZ pseudoproxy network for my trip to Europe and done graphics showing the impact of one bad apple on a network of 55 pseudoproxies from VZ. In some circumstances, the Mann method will pick out one HS series as the PC1 and flip over all the other series. I don’t know why they want to contest this. They just look dumber and dumber.

If you can imagine, the Wahl-Ammann no-PC Variation is looking even crazier – out of the frying pan into the fire.

Re # 11

I confess I get hung up on the statistics side of the questions. When I first tried modeling the MBH98 method, I did it with a single “bad apple” format and calculated how bad the apple had to be to turn the whole bushel to worms. With real data for many — maybe, many many — series, the bad apple has got to be a — and I am not arguing the “the” — big problem in saying anything about the many, many.

But hey, maybe Dr. Mann has done paleoclimatology a great service: the MBH98 transformation can be used, sort of in reverse, as a general “de-wormer.”

Everything is consistent. Mann actually did vote for the method before he voted against it.😉

Who guards the guardians? I think all of us believe in assuring that the peer-reviewers have as little pro-author or anti-author bias as possible. Peer-review ought to be less anonymous and have more stature and reward than it currently does.

#10. Marty, I agree with your closing observation – I’ve noticed that ordinary PC methods add low-frequency variation that isn’t in the data – in particular, spurious low-frequency waves in red noise data, at approximately 2-3 waves in the period. I think that it’s a lesser variation of the HS effect, as segments that end (or begin) off-centre get overweighted. I haven’t seen this written up anywhere.

I have the vague feeling Mann does not understand what his ‘canned’ maths programs do. GIGO.

Or he has an agenda to push.

Mind you he would make an excellent salesman for another BrEx.

#8

I think he just argues that you included signal in your simulated observations(=signal+noise).

This is about to get interesting. Signal must contain ‘trends’ and no high frequencies (otherwise Ritson coefficient would underestimate the AR1 coeff of the proxy noise, link). On the other hand, signal cannot contain trends without CO2 forcing. Trendsetters.

A high AR1 coefficient does not necessarily imply a trend+low AR1 coefficient. However, a trend+low AR1 coefficient does lead to a high AR1 coefficient. That is one of the points these guys are making: AR1 models are an improvement over AR0 models, but they are fraught with their own problems.

The problem with these HS-shaped series is that they are nonstationary, so AR coeffs do not have a straightforward interpretation. (Split any time-series at the join of the shaft and blade, compute the PACF and you will see what I mean when you compare the two.) You could take out the trend, to give the coeffs a straightforward interpretation, but then you’ve got the problem of interpreting what it is you’ve taken out, and an autocorrelation analysis certainly isn’t going to help you now.

The purpose of autoregression is to figure out how Xt varies as a function of Xt-1. If they are autocorrelated only indirectly, through the action of some other forcing variable, then the autoregressive model is a bad model, and this badness will be revealed when the forcing agent fades in and out (as teleconnections are wont to do).

bender, I agree. The AR1 issue doesn’t matter a whit for bias arguments. Suppose that the bristlecones have a nonclimatic trend and lower AR1 coefficient. The Mann algorithm will mine for the nonclimatic trend.

If one views the statistics in Hampel terms – what’s the breakdown point from contamination? With Mannian methods, the breakdown point can arise with as little as one contaminated series.

Follow-up on #18:

1. For a quasi-demonstration of how a PACF changes when a trend is removed, compare the PACFs of the tropical storm count (with 1970-2005 trend) vs. the landfalling hurricane count (without trend) with which it is strongly correlated (r=0.62 before 1930, r=0.49 afterward). See how PACs 1-4 drop in the detrended series?

2. A clarification for anyone who finds it necessary: the purpose of autoregression is to identify endogenous processes that are persistent through time. Exogenous processes that fade in and out tend to inhibit the estimation of the endogenous autoregressive component, because they introduce a complex nonstationary noise structure.

3. Incidentally, the more dominating the low-frequency exogenous component(s), the lower the precision on the ARMA model estimates. This is the real problem with 1/f noise: you increase your sample size over time, and you inevitably uncover some new “trend” caused by some hitherto unknown exogenous forcing agent. Consequently, it is impossible to obtain an “out-of-sample” sample. (Your new samples come from different populations, which thus nullifies the validation test.)

Re: #20

bender, you say, “Incidentally, the more dominating the low-frequency exogenous component(s), the lower the precision on the ARMA model estimates.” I don’t believe that is true, at least as I understand the statement.

The exogenous component can be made more and more dominating, in the sense of explaining the total sum of squares, by increasing the sample variance of the exogenous variables. Thus in a model of

1) Y(t) = phi*[Y(t-1) - X(t-1)*b - e(t-1)] + X(t)*b + e(t),

i.e. an AR1 model with a mean equal to X(t)*b, an increase in the variation of X will increase the explanatory power of the model and the significance of the estimate of b.

Run a Monte Carlo with varying levels of the variance of X and you will see, I believe, that the Mean Square Error of the estimate of phi is essentially unchanged by the variance of X while the MSE of the estimate of b is inversely related to the variance of X.
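A sketch of such a Monte Carlo in R might look like the following. It assumes the intended model is a regression with AR(1) errors, Y(t) = X(t)*b + u(t) with u(t) = phi*u(t-1) + e(t), fitted with `arima` and an `xreg` term; the sample size, the true values of `phi` and `b`, the white-noise X and the number of replications are arbitrary choices for illustration.

```r
set.seed(123)
n <- 200; phi <- 0.5; b <- 1; n_rep <- 200   # all illustrative assumptions

# For a given standard deviation of X, simulate n_rep data sets, fit
# regression-with-AR(1)-errors, and return the MSE of the phi and b estimates.
mc_mse <- function(sd_x) {
  est <- replicate(n_rep, {
    x <- rnorm(n, sd = sd_x)
    u <- as.numeric(arima.sim(model = list(ar = phi), n = n))
    y <- b * x + u
    fit <- arima(y, order = c(1, 0, 0), xreg = cbind(x = x))
    coef(fit)[c("ar1", "x")]
  })
  c(sd_x    = sd_x,
    mse_phi = mean((est["ar1", ] - phi)^2),
    mse_b   = mean((est["x", ] - b)^2))
}

# One row per level of variation in X.
results <- t(sapply(c(0.5, 1, 2, 4), mc_mse))
results
```

Comparing the `mse_phi` and `mse_b` columns across rows is the check proposed above; a low-frequency or trending X could be swapped in for `rnorm` to get closer to the case bender describes.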

Is this the model you are referring to in your quote?

Marty