Comments on: Verification r2 Revealed!!!

By: Skiphil

Skiphil — Fri, 01 Mar 2013 20:46:11 +0000

WOW….. I’m not asking for room service but wondering if anyone still arriving at this and related threads can think of ways to address these problems now. Even after so many years, can a public case be made that MBH98 and MBH99 should be retracted? It seems that they were always sold under false pretences….

By: Kevin Meyer

Kevin Meyer — Mon, 30 Nov 2009 20:33:48 +0000

snip – policy

By: Carl Wolk

Carl Wolk — Mon, 26 Nov 2007 03:34:00 +0000

Thank you very much, Dr. McKitrick and Dr. Eschenbach. I felt somewhat guilty for posting possibly the least-sophisticated comment I have ever seen appear on this site, but your answers were very helpful.

I have a somewhat unrelated question for Dr. McKitrick with regard to his article, “Does a Global Temperature Exist?”: if you could establish that a separate, non-temperature-related series of data fits well (has a low r2?) to the temperature data, could you not say that the temperature data then has some validity and relevance? The reason I am asking is because solar intensity and global temperatures follow the same trend rather well until around the ’70s, so couldn’t we independently verify the temperature data from a specific form of data analysis by saying that we see a correlation between two sets of data, one of which is known to be reliable?
Thanks

By: Willis Eschenbach

Willis Eschenbach — Fri, 23 Nov 2007 23:49:22 +0000

Carl, first, welcome to ClimateAudit, and congratulations on finding a spot for real science.

Regarding your ideas, Ross M. has done a good job of explaining most of your questions, I hope. One that got missed was the question about “red noise”.

“Noise” is any information in your data that is not the signal you are looking for. It could be literally anything, from instrumental variations, to the effect of other variables, to typographical errors in the collation of the dataset.

There are three flavors of noise: white noise, red noise, and blue noise. White noise is the simplest, because it is totally random. There are some different flavors of random, such as gaussian or normally distributed, binomially distributed, Poisson distributed and the like. All of these types of white noise are called “IID”, or independent identically distributed. This means that they all have the same distribution (Gaussian, Poisson, etc.), and that they are all independent of each other.

Red and blue noise are not IID, because the individual data points are not independent of each other. Instead, a given value depends in some way upon the previous value. For example, a very hot day is more likely to be followed by another hot day than by a very cold day. Today’s temperature depends in part on yesterday’s temperature, so they are not independent. This kind of signal, where a high value is commonly followed by another high value and vice versa, is called “red noise”. Datasets with this kind of structure are called “autocorrelated”.

Red noise is extremely common in climate data. We may, for example, be looking for a temperature signal in tree rings. However, the tree ring width may also be affected by, say, precipitation. From our perspective, the precipitation information is “noise”, because we are looking for a temperature signal. It is not white noise, though, because the precipitation signal is autocorrelated: a very dry year is more commonly followed by another dry year than by a very wet year.

There is also the possibility that one data point depends negatively on the previous data point. For example, because of ocean cycles, good fishing years may alternate, with one year being good and the next one bad. In the case of noise, this type of data is called “blue noise”. Datasets with this structure are called “negatively autocorrelated”.

Processes that generate red or blue noise are generally called “AR”, or autoregressive, processes. The simplest have the form

X(t) = aX(t-1) + e

where

X = data value

t = time

X(t) = data value at time “t”

e = a random error, with some mean and standard deviation.

What does this mean? It means that the value of “X” at time “t” is equal to some number “a” times the previous value X(t-1), plus some random number “e”. The variable “a” can take any value from -1 to 1. If “a” is negative, then you get blue noise, and if it is positive, you get red noise. If “a” is zero, the point doesn’t depend on the last point, and you get white noise.
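To make the formula concrete, here is a minimal Python sketch of it. The function names, the starting value of zero and the choice of a normally distributed “e” are assumptions made for illustration; nothing here comes from any published reconstruction code.

```python
import random

def simulate_ar1(a, n, seed=0, sigma=1.0):
    """Generate n values of X(t) = a*X(t-1) + e, with e drawn from Normal(0, sigma)."""
    rng = random.Random(seed)
    x = [0.0]  # assumed starting value
    for _ in range(n - 1):
        x.append(a * x[-1] + rng.gauss(0.0, sigma))
    return x

def lag1_autocorrelation(x):
    """Sample correlation between the series and itself shifted by one step."""
    mean = sum(x) / len(x)
    num = sum((x[i] - mean) * (x[i - 1] - mean) for i in range(1, len(x)))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

if __name__ == "__main__":
    # Negative "a" gives blue noise, zero gives white noise, positive gives red noise.
    for a in (-0.8, 0.0, 0.8):
        series = simulate_ar1(a, 5000)
        print(a, round(lag1_autocorrelation(series), 2))
```

For a long series the measured lag-1 autocorrelation should land close to the “a” you put in, which is a quick way to see that “a” really does control the colour of the noise.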

Finally, why is this important? Well, it turns out that the statistics for AR processes are different, sometimes very different, from standard (IID) statistics. In particular, long trends and wide excursions from the mean are much more common as the value of “a” increases towards 1. This means that many things that look like they represent some real trend are in fact just natural swings in an AR process.

For a good exposition of this, see Cohn and Lins, “Naturally Trendy“, along with the discussion of Cohn and Lins here on CA.

Best of luck, keep following the science, you can’t go wrong.

PS – I’ve posted a list of acronyms used on this blog here …

By: Ross McKitrick

Ross McKitrick — Fri, 23 Nov 2007 22:12:24 +0000

Carl, I think you meant to post this on the thread here– http://www.climateaudit.org/?p=2418
But thanks for posting your question anyhoo. I will try to decode.

The proxy climate reconstruction problem involves taking proxy data and temperature data during an interval where they overlap (called the calibration interval), and working out a set of coefficients that map the two together, so that given the proxy data for an earlier interval you could estimate what the temperature data would have been. It’s easy to correlate any pair of data, so to test that you’re not just generating gibberish you do a “verification” test. You shorten the calibration interval a bit so it no longer covers the whole overlap. Now you have a calibration interval and a verification interval. Then you compute the mapping coefficients and estimate the temperatures during the verification interval. That gives you estimated temperatures, but also you have the observed temperatures for that period. The tests we are talking about boil down to asking how well your model estimates the observed temperatures during the verification interval.

The r2 test is the simplest. It is the square of the correlation between them, and can be interpreted as the fraction of the variation in the observed data that the estimated data overlaps with. There are 2 other tests, the CE and the RE test. They are conceptually similar, but change slightly the point of comparison. The RE test, for instance, asks how much better your statistical model did than if you had just used the mean of the temperature data during the calibration interval.
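For readers who prefer code, the three scores can be sketched in a few lines of Python. This uses the standard textbook definitions (RE measured against the calibration-period mean, CE against the verification-period mean); the function names are made up for this illustration and this is not the code used in any of the papers discussed here.

```python
def mean(values):
    return sum(values) / len(values)

def r_squared(observed, estimated):
    """Square of the correlation between observed and estimated values."""
    mo, me = mean(observed), mean(estimated)
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, estimated))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_e = sum((e - me) ** 2 for e in estimated)
    return cov * cov / (var_o * var_e)

def reduction_of_error(observed, estimated, calibration_mean):
    """RE: 1 minus model squared error over the squared error of just
    guessing the calibration-period mean every time."""
    sse = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse / ref

def coefficient_of_efficiency(observed, estimated):
    """CE: same as RE, but the reference guess is the verification-period mean."""
    return reduction_of_error(observed, estimated, mean(observed))
```

Note that r2 ignores a constant offset between the two series while RE and CE penalize it, so an estimate that tracks the wiggles perfectly but sits one degree too high still gets r2 = 1 with lower RE and CE.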

Each of these tests gives you a number, say 0.15. Now you have to decide if that’s a “pass” or a “fail”. In statistics, we decide pass and fail by asking if your model did better than if you had just used a bunch of random numbers. In some cases, tests come in standard forms so there are tables you can look up to tell you if 0.15 is a large number or not (for that test). But some tests, like the RE score, don’t have tables since the “marking scheme” so to speak changes with each data sample.

In that case you can come up with the pass/fail cut-off by “Monte Carlo analysis”. You estimate the model using random numbers instead of proxy data and see what kind of RE score you get. If you do this a thousand times and in 95% of the cases you get an RE score of 0.2 or less, then we would say that your proxy data has to yield an RE score of more than 0.2, or you’re not significantly better than random numbers.
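A stripped-down version of that experiment might look like the Python sketch below. It is only a toy under stated assumptions: one fake “proxy” per trial, red noise with an assumed coefficient a = 0.5, a simple straight-line fit, and the first n_calib points used as the calibration interval. It does not include a principal components step and is not the procedure used in MBH or in our papers; all names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def ar1_series(a, n, rng):
    """Red noise: X(t) = a*X(t-1) + e with standard normal e."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(a * x[-1] + rng.gauss(0.0, 1.0))
    return x

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

def re_score(observed, estimated, calibration_mean):
    sse = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse / ref

def null_re_cutoff(temps, n_calib, trials=1000, a=0.5, level=0.95, seed=0):
    """RE score that random red-noise 'proxies' stay at or below in
    roughly `level` of the trials."""
    rng = random.Random(seed)
    calib, verif = temps[:n_calib], temps[n_calib:]
    calib_mean = mean(calib)
    scores = []
    for _ in range(trials):
        fake = ar1_series(a, len(temps), rng)               # random numbers, not proxies
        slope, intercept = fit_line(fake[:n_calib], calib)  # calibrate
        est = [intercept + slope * p for p in fake[n_calib:]]
        scores.append(re_score(verif, est, calib_mean))     # verify
    scores.sort()
    index = min(trials - 1, max(0, int(round(level * trials)) - 1))
    return scores[index]
```

A real proxy model would then need an RE score above the value returned by `null_re_cutoff` to count as doing better than random numbers at that level.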

In the MBH case, they got an RE score for the longest portion of their model of 0.51. They said that the Monte Carlo experiment yielded a pass/fail cut-off of 0.0, so therefore they have significant explanatory power. However, we showed that they did not take account of the distortion induced by their decentered principal component method, and if they had done so, the RE pass/fail mark would move up to 0.56 (IIRC). We also claimed that their r2 score fails the significance test, and they knew it at the time but never reported it. This thread was started when 2 of Mann’s later coauthors published a paper in which they (reluctantly) reported the full suite of hitherto-unpublished test scores, showing failing grades for most of the reconstruction experiments.

Hope this helps.

By: Carl Wolk

Carl Wolk — Fri, 23 Nov 2007 18:29:24 +0000

Hi, I’m a high school kid with no background in science or statistics, so this sentence has left me bewildered. This is what I don’t understand:
“RE score,” a
“null” RE score,
what the values of RE mean,
what “red noise” is.

Is an RE score the same as r2? I was wondering if anyone could explain this to me.

It overstated the explanatory power of the model when checked against a null RE score (RE=0), since red noise fed into Mann’s PC algorithm yielded a much higher null value (RE>0.5) due to the fact that the erroneous PC algorithm bends the PC1 to fit the temperature data.

Also, in the future, is there a reference source that I could use for these types of situations?
Thanks

By: NAS Committee Hearings on Hockeystick « The Global Warming Hoax

NAS Committee Hearings on Hockeystick « The Global Warming Hoax — Thu, 19 Oct 2006 20:27:14 +0000

[…] Lubos Motl, 8 March 2006 Theoretical physicist, Harvard http://motls.blogspot.com/2006/03/verification-statistics.html Verification r2 revealed ( http://www.climateaudit.org/?p=564) […]

By: john lichtenstein

john lichtenstein — Tue, 25 Apr 2006 05:41:40 +0000

JohnA, #117 is spam.

By: Pat Frank

Pat Frank — Tue, 25 Apr 2006 02:54:30 +0000

#117 “I don’t know what the hell this r2 is?”

It’s r^2 – ‘r-squared.’ It’s a goodness-of-fit statistic. The closer it is to 1.0, the better the fit reproduces the data. It is a measure of goodness-of-fit for any linear or non-linear least squares fit to data. In a fit from theory, as opposed to a purely empirical fit, anything less than about 0.9 becomes suspicious.

In empirical fits (those not constrained by a valid theory), people usually cite the r^2 fraction as saying that the fit explains ‘blah’ percent of the data wiggles. When someone is using a phenomenological model to fit some data, as in quantity of women’s lingerie sold to men correlated to the demographics of religious fundamentalism, for example, one would explain a fit with r^2 = 0.5 as saying that fundamentalist religious beliefs explain 50% of the sales.

By: ENM » How to start a science blog (scary version)

ENM » How to start a science blog (scary version) — Mon, 24 Apr 2006 21:21:06 +0000

[…] Not only do blogs enable an aggressive falsification program, they enable individuals to defend themselves against stones thrown from ivory towers. On May 11, 2005, on the day that Ross McKitrick and Steve McIntyre were presenting their results debunking the hockey stick in Washington, UCAR issued a press release announcing that one of its scientists, Caspar Ammann and one of its former post-doc fellows, Eugene Wahl, had supposedly demonstrated that their criticisms of the hockey stick were “unfounded”. S&M have used the blog medium masterfully to reveal that a crucial unfavorable r2 verification statistic was withheld from the Nature publications, thus proving the UCAR accusations, not only unwarranted, but totally unfounded. Claims that scientists have been ‘harassed’ about archiving their data have been shown false by posting all relevant correspondence on the web. […]