Judith Curry and JEG have expressed an interest in talking about Mann et al 2007.
Looking past the annoying and amusing faults, here are some thoughts about the substance of the article. There are two sorts of results in Mann et al 2007: results on the MBH98 network and pseudoproxy results.
The pseudoproxy results are much weakened because they only consider results from one quirky and idiosyncratic multivariate method (RegEM TTLS) in a very tame pseudoproxy network without
(a) comparing results from the method to methods other than their own equally quirky RegEM Ridge, or
(b) examining results from networks that are flawed and, in particular, flawed in ways that may potentially compromise MBH. I’ve posted on both these issues and will review some thoughts on this.
It is somewhat surprising to see another lengthy effort to rehabilitate the MBH98 network, which is analysed complete with original (incorrect) PC series without conceding a comma to the NAS panel or Wegman reports or even mentioning the bristlecone problem. Mann shows that he can recover the bristlecone shape using RegEM if we spot him the PC1 or raw bristlecone series. This was never in doubt – see MM (EE 2005) and, other than for polemical reasons, it’s hard to see any purpose or interest in the application of RegEM to this flawed network.
The main properties of the MBH98 network have been known for some time. MM (EE 2005) and Wahl and Ammann (2007), despite the claims made in the latter, agree on virtually every specific calculation, as is unsurprising since our codes matched. If you do an MBH98-type calculation with 2 NOAMER covariance PCs, the bristlecones get downweighted and you don’t get a HS; if you increase the number of covariance PCs to 5, you include the bristlecones and you get a HS. If you use correlation PCs, the bristlecones dominate the PC2, which is attenuated a little, and you get a HS; if you do a calculation without bristlecones, you don’t get a HS regardless of method. If you do a calculation without a PC analysis and without bristlecones, you don’t get a HS; if you do a calculation with bristlecones and without a PC analysis, you get a HS. The incorrect Mann method promoted bristlecones into the PC1 of the AD1400 network and made the HS shape of the bristlecones appear to be the “dominant component of variance” as opposed to a local phenomenon (and one very reliant on chronologies done by Graybill).
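To see the centering effect concretely, here is a minimal sketch in Python/numpy – my own toy construction, not the MBH98 code or data, with made-up dimensions and a made-up 79-year “calibration” window. A network of white noise plus a handful of hockey-stick-shaped series is run through PCA twice, once with full-period centering and once with short-segment centering:

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_series = 581, 50                    # made-up network dimensions
X = rng.standard_normal((n_years, n_series))
X[-79:, :5] += np.linspace(0, 2, 79)[:, None]  # give 5 series a late uptick

def pc1(data, center_rows):
    # the centering convention is the only thing that differs between runs
    centered = data - data[center_rows].mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

hs_shape = np.zeros(n_years)
hs_shape[-79:] = np.linspace(0, 2, 79)

full = pc1(X, slice(None))                     # conventional full-period centering
short = pc1(X, slice(-79, None))               # short-segment centering

for name, pc in [("full centering", full), ("short centering", short)]:
    print(name, "|corr(PC1, HS shape)| =",
          round(abs(np.corrcoef(pc, hs_shape)[0, 1]), 2))
```

In runs like this, the short-centered PC1 locks onto the HS shape far more strongly than the full-centered PC1 does: series whose calibration-period mean differs from their long-term mean acquire inflated variance under short centering and get promoted.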
Where does RegEM fit into this dispute? It really has nothing to do with it. After the construction of their PC-proxy network, Mann carried out an “inverse regression” analysis – described in the most overblown and uninformative terms imaginable. I’ve worked through the linear algebra of this and confirmed (as have UC and Jean S) that, in the early AD1400 and AD1000 steps where only one temperature PC is reconstructed, the weights of each proxy are in direct proportion to their correlation with the temperature PC1. This is a form of (one-stage) Partial Least Squares regression – a method used in chemometrics.
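For concreteness, here is a minimal sketch of that one-stage PLS step – my own notation and function name, assuming standardized proxies, with the final rescaling step omitted:

```python
import numpy as np

def pls_one_stage(proxies, temp_pc1, cal):
    # One-stage PLS: weight each proxy by its correlation with the
    # temperature PC1 over the calibration period `cal` (a slice).
    Z = (proxies - proxies[cal].mean(axis=0)) / proxies[cal].std(axis=0)
    t = (temp_pc1 - temp_pc1.mean()) / temp_pc1.std()
    weights = Z[cal].T @ t / len(t)   # = each proxy's calibration correlation
    return Z @ weights, weights       # unscaled reconstruction over full period
```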
In the Mann et al 2007 proxy section, they use RegEM (Total Least Squares version) instead of Partial Least Squares regression. They say that the process is non-linear and that they are unable to calculate weights for each proxy – a claim also made for MBH98, which proved untrue. Their network is the identical MBH98 network – warts and all, including the incorrect PC series criticized by both the NAS Panel and Wegman – which seems pretty insolent towards other climate scientists and suggests rather weak reviewing by JGR.
Using RegEM, they “get” a NH reconstruction that is said to be pretty similar to the MBH98 reconstruction, and I don’t doubt that this is true. What I don’t “get” is exactly what this proves in the scheme of things. My instinct is that RegEM (TTLS) is implicitly generating coefficients somewhere (or can be approximated by a method that does) and that those weights more or less match the weights from Mannian inverse regression.
In MM (EE 2005), we discussed the situation where Mannian inverse regression was done with no PCs; in this circumstance, because there are a lot of Graybill bristlecones, they dominate the network without PC analysis – but any pretense of geographic balance, one of the warranties of MBH98 that led to its acceptance, is sacrificed in the process. So the fact that RegEM leads to a similar result in a case where the network is dominated by bristlecones is nothing new: it has been known since 2004, and a response was given in MM (EE 2005).
Mann et al 2007 do not mention the word “bristlecone” even once – a remarkable omission since the bristlecones still imprint their results. Indeed, one might argue that Mann’s major innovation was his introduction of the known-to-be-problematic bristlecone chronologies into multiproxy reconstructions – a temptation resisted by Bradley and Jones 2003 and Jones et al 1998 (but perhaps anticipated in Hughes and Diaz 1994).
Mann et al 2007 report a lot of RE, r2 and CE statistics for different reconstructions using pseudoproxies. I think that these sorts of pseudoproxy studies can be quite useful, although I don’t think that scientists in the field have necessarily got the hang of them yet, nor have Mann et al surveyed a very broad range of the cases that need to be examined. Zorita’s results are the most reliable.
A couple of my more interesting posts have reported my own pseudoproxy results: one using Echo-G runs (kindly provided to me by Eduardo Zorita) and one using pseudoproxy networks constructed to emulate the MBH98 network (Huybers #2 and Reply to Huybers), also here, and I’ll try to tie these three studies together.
In Mann et al 2007, they only consider results from RegEM (TTLS) under different noise scenarios. In Benchmarking from VZ Proxies, I tested the effect of a broader range of methods on the attenuation of low-frequency response, comparing the impact of OLS regression, PLS (Mannian inverse) regression with and without detrending, CPS and principal components (PC1). OLS regression had the “best” fit in the calibration period but the poorest low-frequency recovery; in the circumstances of this very “tame” setup where an equal amount of noise was added to each proxy, the PC1 and simple average had the best recovery of low-frequency variance, with Mannian inverse regression in the middle, as shown below:
Spaghetti graph of selected multivariate methods. The blow-up is not because the period is of intrinsic interest, but just to show detail a little better at a different scale.
I then compared verification statistics for the different reconstructions as shown below. OLS yielded much the “best” fit in the calibration period, but the worst fit in the verification period. I think that this is a useful perspective on what’s going on with more “sophisticated” multivariate methods, as they will fit somewhere on this graphic (and there is no free lunch). You’ll notice that these reconstructions all have good calibration r2, RE and verification r2 (other than OLS). In this particular case, the one example with a verification r2 failure (OLS) is the one with the worst performance in terms of signal recovery – suggesting that this particular statistic is definitely worth looking at.
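For readers who want to poke at this themselves, here is a stripped-down version of the comparison – a toy setup of my own (made-up dimensions; equal white noise on every pseudoproxy), not the VZ benchmarking runs. It exhibits the usual pattern: OLS wins the calibration fit and loses the verification fit relative to the simple average:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 22                          # made-up years x proxies
signal = np.cumsum(rng.standard_normal(n)) * 0.05   # slow "climate" signal
signal -= signal.mean()
proxies = signal[:, None] + rng.standard_normal((n, p))  # equal noise on each

cal = slice(n - 100, n)                 # calibration period
ver = slice(n - 150, n - 100)           # verification period

# OLS: regress the target on all proxies over the calibration period
A = np.c_[np.ones(n), proxies]
beta, *_ = np.linalg.lstsq(A[cal], signal[cal], rcond=None)
ols = A @ beta

# simple average, rescaled to the calibration mean and variance
avg = proxies.mean(axis=1)
avg = (avg - avg[cal].mean()) / avg[cal].std() * signal[cal].std() + signal[cal].mean()

def re_stat(obs, est, bench):           # reduction of error vs. a benchmark mean
    return 1 - np.sum((obs - est) ** 2) / np.sum((obs - bench) ** 2)

for name, est in [("OLS", ols), ("average", avg)]:
    print(f"{name:8s}",
          "cal r2:", round(np.corrcoef(signal[cal], est[cal])[0, 1] ** 2, 2),
          "ver r2:", round(np.corrcoef(signal[ver], est[ver])[0, 1] ** 2, 2),
          "ver RE:", round(re_stat(signal[ver], est[ver], signal[cal].mean()), 2))
```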
Another interesting image in the earlier post was the image showing the regression coefficients (the “Fourier” space, if you will) for the different methods – and this is a tame network with an equal amount of noise. The best models were the ones with the most balanced weights and the worst models were the ones with the most tailored weights. The mathematics of this are trivial once you think about it: because equal amounts of noise are being added to a signal, the noise cancels out most effectively if the weights are equal. If some of the series are turned upside down, as happens in OLS, then the noise cancellation is diminished. The arithmetic is spelled out below.
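Spelled out: if each of $m$ proxies is signal plus independent noise $\varepsilon_i$ with common variance $\sigma^2$, a weighted composite carries noise variance

$$\operatorname{Var}\!\left(\frac{\sum_i w_i \varepsilon_i}{\sum_i w_i}\right) \;=\; \sigma^2\,\frac{\sum_i w_i^2}{\bigl(\sum_i w_i\bigr)^2},$$

which by the Cauchy–Schwarz inequality is minimized when all the $w_i$ are equal, giving $\sigma^2/m$. Flipping the sign of a series shrinks $\sum_i w_i$ while leaving $\sum_i w_i^2$ alone, so the noise floor rises.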
One of the morals for me from the adverse impact of flipping series is that anyone starting a temperature reconstruction needs to know the sign of the expected relationship to temperature in advance and to specify it in advance. This is one of the hidden strengths of the simple average – a strength that regression methods can overturn in a noisy network by flipping signs.
If we put Mann et al 2007 results in this context, we find that RegEM methods, like all other methods, will generally yield high calibration r2, RE and verification r2 scores in a tame network with known and equally distributed noise additions. There are a couple of cases (j,x) in which the RegEM method fails to achieve a good r2 in the verification period and my guess is that it’s ended up overfitting the model somehow – along the lines of the OLS failure shown in my example. It’s just a guess.
In these pseudoproxy studies, the noise is all very “tame” – it’s white noise of equal amount or low-order red noise of the same type. However, if you plot the residuals for each proxy in the MBH AD1400 network, that’s not what you get. The residuals for the NOAMER PC1 and Gaspé are fantastically autocorrelated. Actually autocorrelation doesn’t really describe the mis-specification at all: the residuals for the NOAMER PC1 relative to the recon are hockey stick shaped – something that doesn’t fit very well into autocorrelation vocabulary or techniques. The reason is that the PC1 is a super hockey stick that overshoots the NH reconstruction. So the residual is itself also a HS. This of course renders assumptions about white and low order red noise completely moot. I’ll try to post up a graphic illustrating this point – I probably have one already somewhere.
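A toy illustration of why the usual lag-one vocabulary misses the point (the “residual” here is fabricated for illustration, not the actual MBH series): a deterministic HS shape yields a huge lag-one autocorrelation, but it is not an AR process at all, so neither a white- nor a low-order red-noise benchmark describes it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 581
white = rng.standard_normal(n)                 # well-behaved residuals
hs = np.zeros(n)
hs[-79:] = np.linspace(0, 3, 79)               # hockey-stick-shaped "residual"
hs_resid = hs + 0.3 * rng.standard_normal(n)

def lag1(x):
    x = x - x.mean()
    return (x[1:] @ x[:-1]) / (x @ x)          # lag-one autocorrelation

print("white noise lag-1:        ", round(lag1(white), 2))     # near 0
print("HS-shaped residual lag-1: ", round(lag1(hs_resid), 2))  # large
```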
There is also another notable gap in the Mann et al 2007 pseudoproxy study. They only deal with how the system works on a network where a signal is guaranteed. What happens when you move into networks where there is no common signal? Now you’re getting into the pseudoproxy networks of our Reply to Huybers and my posts Huybers #2 and Wahl and Ammann #2. In those cases, I constructed synthetic hockey sticks from red noise (or alternatively from stock price indices), blended them into a network of 21 white noise series and pressed the result into the Mannomatic.
What happened? You got “reconstructions” that were functionally equivalent to MBH, as shown below: high calibration r2, high verification RE and negligible verification r2 – the same as MBH98 in the AD1400 network. Virtually all of the MBH proxies can be replaced with white noise with no impact whatever on the reconstruction. However, there are a few “active ingredients” – in the AD1400 network, they are the Graybill bristlecone pine PC1 and the problematic Jacoby Gaspé series. See this post for a derivation of the graphic below.
Left – Tech stocks; right – MBH. Top left – Tech PC1 (red), MBH recon (smoothed- blue). Top- Tech PC1 and Gaspé-NOAMER PC1 blend; Middle – plus network of actual proxies; Bottom – plus network of white noise.
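The divergence between RE and the verification r2 is easy to reproduce in miniature. A sketch with made-up numbers: a near-flat “reconstruction” sitting at roughly the right level in the verification period earns a healthy RE, because it beats the calibration-mean benchmark, while its verification r2 is negligible:

```python
import numpy as np

rng = np.random.default_rng(3)
obs = rng.standard_normal(50) - 1.0        # verification temps below calibration mean
cal_mean = 0.0                             # RE benchmark: the calibration-period mean
est = np.full(50, obs.mean()) + 0.01 * rng.standard_normal(50)  # flat, right level

re = 1 - np.sum((obs - est) ** 2) / np.sum((obs - cal_mean) ** 2)
r2 = np.corrcoef(obs, est)[0, 1] ** 2
print("verification RE:", round(re, 2), "  verification r2:", round(r2, 3))
```

Getting the mean level roughly right is enough to pass RE; tracking the year-to-year wiggles is what r2 demands.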
So from a mathematical point of view, this very different type of pseudoproxy situation has to be incorporated into the testing universe. There is no evidence that the MBH proxies contain a common signal along the lines of the Mann et al 2007 pseudoproxy test. If anything, the MBH proxies are remarkably orthogonal. Because they are “near orthogonal”, you end up with an interesting mathematical situation in which the matrix that rotates PLS coefficients to OLS coefficients is near-orthogonal, and thus the Mannian situation can end up being much closer to an OLS overfit than one would like.
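In regression notation (my notation, with $X$ the standardized proxy matrix and $y$ the target temperature PC):

$$\hat\beta_{OLS} = (X^{\top}X)^{-1}X^{\top}y, \qquad \hat\beta_{PLS} \propto X^{\top}y,$$

so the matrix carrying the PLS coefficients into the OLS coefficients is $(X^{\top}X)^{-1}$, up to scale. When the proxies share a common signal, $X^{\top}X$ is far from the identity and the two sets of coefficients differ materially; when the proxies are near orthogonal, $X^{\top}X$ is close to a scaled identity, the two coefficient vectors nearly coincide, and PLS loses its protection against the OLS overfit.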
One of the obstacles to understanding the situation is that people are used to regressions from cause to effect, as Ross pointed out, where collinearity is a problem. Here you actually want collinearity, and orthogonality is the problem. Mannian inverse regression is actually an improvement on OLS (which is not “optimal” at all), but in a case where the proxies are near orthogonal, it may revert to having OLS problems.
If you don’t know much about the proxies other than their sign, I think that there’s something to be said for quite simple averaging procedures. You may leave something on the table but you’re less likely to screw up. As for re-scaling after averaging – both UC and Jean S recoil at this step: it’s a different kettle of fish and deserves a lot more thought than has been given to it. It’s not a given that it’s a good technique and I’ll discuss this another day. That’s one reason why the Loehle method – the simple average so scorned by JEG – is intriguing: it avoids the potentially problematic re-scaling step. It may have other problems but it avoids that one.
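For concreteness, a sketch of the two composites in Python (my own function names; `signs` is the vector of a-priori orientations and `cal` a calibration slice). The first is the Loehle-style plain average; the second adds the contested re-scaling step:

```python
import numpy as np

def simple_composite(proxies, signs):
    # Orient each series by its a-priori sign, standardize, and average;
    # no rescaling against the instrumental record.
    Z = (proxies - proxies.mean(axis=0)) / proxies.std(axis=0)
    return (Z * signs).mean(axis=1)

def rescaled_composite(proxies, signs, temp_cal, cal):
    # Same average, then rescaled so its calibration-period mean and variance
    # match the instrumental series -- the step that deserves more thought.
    comp = simple_composite(proxies, signs)
    z = (comp - comp[cal].mean()) / comp[cal].std()
    return z * temp_cal.std() + temp_cal.mean()
```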
One of the things that I’d be interested in seeing from the various Mann networks is a barplot of the coefficients along the lines of the plot shown here for the VZ network. Yeah, I know that they say that it’s impossible, but I suspect that the weights can be extracted somehow. Maybe JEG can figure out how.