WOW….. I’m not asking for room service but wondering if anyone still arriving at this and related threads can think of ways to address these problems now. Even after so many years, can a public case be made that MBH98 and MBH99 should be retracted? It seems that they were always sold under false pretences….

I have a somewhat unrelated question for Dr. McKitrick with regard to his article, “Does a Global Temperature Exist?”: if you could establish that a separate, non-temperature-related series of data fits the temperature data well (has a low r2?), could you not say that the temperature data then has some validity and relevance? I ask because solar intensity and global temperatures follow the same trend rather well until around the ’70s, so couldn’t we independently verify the temperature data from a specific form of data analysis by saying that we see a correlation between two sets of data, one of which is known to be reliable?

Thanks

Regarding your ideas, Ross M. has done a good job of explaining most of your questions, I hope. One that got missed was the question about “red noise”.

“Noise” is any information in your data that is not the signal you are looking for. It could be literally anything, from instrumental variations, to the effect of other variables, to typographical errors in the collation of the dataset.

There are three flavors of noise: white noise, red noise, and blue noise. White noise is the simplest, because it is totally random. There are some different flavors of random, such as Gaussian (normally distributed), binomially distributed, Poisson distributed and the like. All of these types of white noise are called “IID”, or independent and identically distributed. This means that the data points all have the same distribution (Gaussian, Poisson, etc.), and that they are all independent of each other.

Red and blue noise are not IID, because the individual data points are not independent of each other. Instead, a given value depends in some way upon the previous value. For example, a very hot day is more likely to be followed by another hot day than by a very cold day. Today’s temperature depends in part on yesterday’s temperature, so they are not independent. This kind of signal, where a high value is commonly followed by another high value and a low value by another low value, is called “red noise”. Datasets with this kind of structure are called “autocorrelated” (more precisely, positively autocorrelated).

Red noise is extremely common in climate data. We may, for example, be looking for a temperature signal in tree rings. However, the tree ring width may also be affected by, say, precipitation. From our perspective, the precipitation information is “noise”, because we are looking for a temperature signal. It is not white noise, though, because the precipitation signal is autocorrelated: a very dry year is more commonly followed by another dry year than by a very wet year.

There is also the possibility that one data point depends *negatively* on the previous data point. For example, because of ocean cycles, good fishing years may alternate, with one year being good and the next one bad. In the case of noise, this type of data is called “blue noise”. Datasets with this structure are called “negatively autocorrelated”.

Processes that generate red or blue noise are generally called “AR”, or autoregressive, processes. The simplest have the form

*X(t) = aX(t-1) + e*

where

*X = data value*

*t = time*

*X(t) = data value at time “t”*

*e = a random error, with some mean and standard deviation*

What does this mean? It means that the value of “X” at time “t” is equal to some number “a” times the previous value X(t-1), plus some random number “e”. The coefficient “a” can take any value between -1 and 1. If “a” is negative, then you get blue noise, and if it is positive, you get red noise. If “a” is zero, the point doesn’t depend on the last point, and you get white noise.
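Here is a minimal sketch of that formula in Python, in case it helps to see it run. The function names, the seed, the series length and the three sample values of “a” are my own choices for illustration, and “e” is assumed to be Gaussian with mean zero.

```python
import random

def simulate_ar1(a, n, seed=0, sigma=1.0):
    """Generate n values of X(t) = a*X(t-1) + e, with e ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    x = 0.0
    series = []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series

def lag1_autocorrelation(series):
    """Correlation between each value and the value just before it."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

# Negative a gives blue noise, zero gives white noise, positive a gives red noise.
for label, a in [("blue", -0.8), ("white", 0.0), ("red", 0.8)]:
    r1 = lag1_autocorrelation(simulate_ar1(a, 5000, seed=1))
    print(f"{label:5s} a={a:+.1f}  lag-1 autocorrelation = {r1:+.2f}")
```

With a long enough series, the measured lag-1 autocorrelation should land close to the value of “a” that generated it, which is a handy way to check what color of noise you are dealing with.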

Finally, why is this important? Well, it turns out that the statistics for AR processes are different, sometimes very different, from standard (IID) statistics. In particular, long trends and wide excursions from the mean are much more common as the value of “a” increases towards 1. This means that many things that look like they represent some real trend are in fact just natural swings in an AR process.
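To make the point about trends concrete, here is a rough sketch that fits a straight line to many simulated series that contain no trend at all and measures how widely the fitted slopes scatter. Everything in it is an assumption for illustration: the value a = 0.9, the series length of 100, the 300 trials, and the scaling of “e” so that white and red series have about the same overall variance.

```python
import random
import statistics

def simulate_ar1(a, n, rng, burn_in=100):
    """AR(1) series with no underlying trend; the first burn_in values are discarded."""
    sigma = (1.0 - a * a) ** 0.5  # keeps the overall variance of X near 1 for any a
    x = 0.0
    out = []
    for step in range(burn_in + n):
        x = a * x + rng.gauss(0.0, sigma)
        if step >= burn_in:
            out.append(x)
    return out

def ols_slope(series):
    """Least-squares slope of the series against time 0, 1, 2, ..."""
    n = len(series)
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def slope_spread(a, n=100, trials=300, seed=0):
    """Standard deviation of fitted slopes over many trend-free series."""
    rng = random.Random(seed)
    slopes = [ols_slope(simulate_ar1(a, n, rng)) for _ in range(trials)]
    return statistics.pstdev(slopes)

white = slope_spread(0.0)
red = slope_spread(0.9)
print(f"spread of fitted slopes, white noise: {white:.4f}")
print(f"spread of fitted slopes, red noise:   {red:.4f}")
print(f"ratio red/white: {red / white:.1f}")
```

The red-noise slopes scatter several times more widely than the white-noise ones, so a slope that would look remarkable under IID assumptions can be quite ordinary for an AR process.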

For a good exposition of this, see Cohn and Lins, “Naturally Trendy”, along with the discussion of Cohn and Lins here on CA.

Best of luck, keep following the science, you can’t go wrong.

w.

PS – I’ve posted a list of acronyms used on this blog here …

But thanks for posting your question anyhoo. I will try to decode.

The proxy climate reconstruction problem involves taking proxy data and temperature data during an interval where they overlap (called the calibration interval), and working out a set of coefficients that map the two together, so that given the proxy data for an earlier interval you could estimate what the temperature data would have been. It’s easy to correlate any pair of data, so to test that you’re not just generating gibberish you do a “verification” test. You shorten the calibration interval a bit so it no longer covers the whole overlap. Now you have a calibration interval and a verification interval. Then you compute the mapping coefficients and estimate the temperatures during the verification interval. That gives you estimated temperatures, but also you have the observed temperatures for that period. The tests we are talking about boil down to asking how well your model estimates the observed temperatures during the verification interval.
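The calibration/verification split can be sketched in a few lines. This is only a toy with one proxy, a straight-line fit, and made-up data; it is not the MBH procedure, and the names, the 130-year overlap and the choice to hold out the first 50 years are assumptions for illustration.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def calibrate_and_verify(proxy, temp, n_verify):
    """Hold out the first n_verify years, calibrate on the rest,
    then estimate the held-out temperatures from the proxy alone."""
    slope, intercept = fit_line(proxy[n_verify:], temp[n_verify:])
    estimated = [intercept + slope * p for p in proxy[:n_verify]]
    observed = temp[:n_verify]
    return estimated, observed

# Synthetic overlap period: 130 "years" of made-up temperature and proxy data.
rng = random.Random(0)
temp = [0.005 * year + rng.gauss(0.0, 0.2) for year in range(130)]
proxy = [2.0 * t + rng.gauss(0.0, 0.3) for t in temp]

estimated, observed = calibrate_and_verify(proxy, temp, n_verify=50)
for est, obs in list(zip(estimated, observed))[:5]:
    print(f"estimated {est:+.2f}   observed {obs:+.2f}")
```

The verification tests are then just different ways of scoring how close the `estimated` list is to the `observed` list.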

The r2 test is the simplest. It is the square of the correlation between the estimated and observed temperatures, and can be interpreted as the fraction of the variation in the observed data that the estimated data overlaps with. There are two other tests, the CE and the RE tests. They are conceptually similar, but slightly change the point of comparison. The RE test, for instance, asks how much better your statistical model did than if you had just used the mean of the temperature data during the calibration interval.
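For anyone who wants to see the three scores side by side, here is a small sketch using the usual textbook definitions: RE compares the model’s squared errors with those from using the calibration-period mean, and CE does the same with the verification-period mean. The function names and the three-point example are made up for illustration.

```python
def r_squared(observed, estimated):
    """Square of the correlation between observed and estimated values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    soe = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    soo = sum((o - mean_o) ** 2 for o in observed)
    see = sum((e - mean_e) ** 2 for e in estimated)
    return soe * soe / (soo * see)

def re_score(observed, estimated, calibration_mean):
    """Reduction of error: benchmark is the calibration-period mean."""
    sse = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse / ref

def ce_score(observed, estimated):
    """Coefficient of efficiency: benchmark is the verification-period mean."""
    verification_mean = sum(observed) / len(observed)
    sse = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ref = sum((o - verification_mean) ** 2 for o in observed)
    return 1.0 - sse / ref

# Toy verification period: estimates track the shape perfectly but sit 1.0 too high.
observed = [0.0, 1.0, 2.0]
estimated = [1.0, 2.0, 3.0]
calibration_mean = -1.0  # hypothetical mean temperature in the calibration interval

print(f"r2 = {r_squared(observed, estimated):.3f}")
print(f"RE = {re_score(observed, estimated, calibration_mean):.3f}")
print(f"CE = {ce_score(observed, estimated):.3f}")
```

In this toy case r2 comes out at 1.0 because it ignores the constant offset, while RE is about 0.786 and CE is -0.5. That is the sense in which the tests change the point of comparison.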

Each of these tests gives you a number, say 0.15. Now you have to decide if that’s a “pass” or a “fail”. In statistics, we decide pass and fail by asking if your model did better than if you had just used a bunch of random numbers. In some cases, tests come in standard forms so there are tables you can look up to tell you if 0.15 is a large number or not (for that test). But some tests, like the RE score, don’t have tables since the “marking scheme” so to speak changes with each data sample.

In that case you can come up with the pass/fail cut-off by “Monte Carlo analysis”. You estimate the model using random numbers instead of proxy data and see what kind of RE score you get. If you do this a thousand times and in 95% of the cases you get an RE score of 0.2 or less, then we would say that your proxy data has to yield an RE score of more than 0.2, or you’re not significantly better than random numbers.
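A bare-bones version of that Monte Carlo exercise might look like the sketch below. It is a toy under stated assumptions, not the calculation from MBH or from our paper: one fake proxy per trial, a straight-line calibration, made-up temperature data, and hypothetical function names. The parameter “a” lets you feed in white noise (a = 0) or red noise (a > 0) as the fake proxies.

```python
import random

def ar1(a, n, rng):
    """Random fake proxy: X(t) = a*X(t-1) + e, with e ~ Normal(0, 1)."""
    x, out = 0.0, []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def re_score(observed, estimated, calibration_mean):
    sse = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse / ref

def percentile_95(values):
    """Smallest value that at least 95% of the values do not exceed."""
    ordered = sorted(values)
    return ordered[(95 * len(ordered) + 99) // 100 - 1]

def null_re_threshold(temp, n_verify, a=0.0, trials=1000, seed=0):
    """RE cut-off that random proxies stay at or below in 95% of trials."""
    rng = random.Random(seed)
    temp_cal, temp_ver = temp[n_verify:], temp[:n_verify]
    cal_mean = sum(temp_cal) / len(temp_cal)
    scores = []
    for _ in range(trials):
        fake = ar1(a, len(temp), rng)
        slope, intercept = fit_line(fake[n_verify:], temp_cal)
        est = [intercept + slope * p for p in fake[:n_verify]]
        scores.append(re_score(temp_ver, est, cal_mean))
    return percentile_95(scores)

# Made-up temperature record; the first 50 "years" form the verification interval.
rng_temp = random.Random(42)
temp = [0.005 * year + rng_temp.gauss(0.0, 0.2) for year in range(130)]

print("white-noise cut-off:", round(null_re_threshold(temp, 50, a=0.0), 3))
print("red-noise cut-off:  ", round(null_re_threshold(temp, 50, a=0.9), 3))
```

The point of the design is that the cut-off is not a fixed number from a table. It depends on the temperature data, the intervals chosen, and what kind of random numbers you consider a fair null.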

In the MBH case, they got an RE score for the longest portion of their model of 0.51. They said that the Monte Carlo experiment yielded a pass/fail cut-off of 0.0, so therefore they have significant explanatory power. However, we showed that they did not take account of the distortion induced by their decentered principal component method, and if they had done so, the RE pass/fail mark would move up to 0.56 (IIRC). We also claimed that their r2 score fails the significance test, and they knew it at the time but never reported it. This thread was started when 2 of Mann’s later coauthors published a paper in which they (reluctantly) reported the full suite of hitherto-unpublished test scores, showing failing grades for most of the reconstruction experiments.

Hope this helps.

“RE score,” a

“null” RE score,

what the values of RE mean,

what “red noise” is.

Is an RE score the same as r2? I was wondering if anyone could explain this to me.

It overstated the explanatory power of the model when checked against a null RE score (RE=0), since red noise fed into Mann’s PC algorithm yielded a much higher null value (RE>0.5), because the erroneous PC algorithm bends the PC1 to fit the temperature data.

Also, in the future, is there a reference source that I could use for these types of situations?

Thanks

It’s r^2, ‘r-squared.’ It’s a goodness-of-fit statistic. The closer it is to 1.0, the better the fit reproduces the data. It is a measure of goodness-of-fit for any linear or non-linear least squares fit to data. In a fit from theory, as opposed to a purely empirical fit, anything less than about 0.9 becomes suspicious.

In empirical fits (those not constrained by a valid theory), people usually cite the r^2 fraction as saying that the fit explains ‘blah’ percent of the data wiggles. When someone is using a phenomenological model to fit some data, as in quantity of women’s lingerie sold to men correlated to the demographics of religious fundamentalism, for example, one would explain a fit with r^2 = 0.5 as saying that fundamentalist religious beliefs explain 50% of the sales.
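As a small worked example of the “explains blah percent” reading, here is a sketch for a straight-line least squares fit, where r^2 is one minus the ratio of the leftover (residual) wiggles to the total wiggles in the data. The four data points and the function names are invented for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared_of_fit(x, y):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 1.0, 2.0]
print(f"r^2 = {r_squared_of_fit(x, y):.2f}")  # prints r^2 = 0.90
```

Here the fitted line leaves a residual sum of squares of 0.2 out of a total of 2.0, so one would say the fit explains 90% of the wiggles.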
