Allen and Tett 1999

I’ve posted a pdf of Allen and Tett 1999 here, since it seems to be a frequently cited article, it presents "optimal fingerprinting" as linear regression, and it gives a flavor of the literature. The approach looks to me like pretty garden-variety methodology, the sort of thing one would see in the fall term of an econometrics course. It’s hard to believe that this is the Royal Society’s "advanced statistical methods" – I wonder whether they checked this with any statisticians.
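
For readers who want to see the flavor: as I read the paper, the "optimal" estimate is just generalised least squares – the observed pattern is regressed on the model-simulated response patterns, with the noise covariance estimated from a control run (and the state-space truncated to the leading EOFs). Here is a toy sketch in R; every number is a made-up placeholder, not anything from the paper.

# Toy sketch of "optimal fingerprinting" as generalised least squares (GLS).
# y: observed pattern, X: model-simulated signal patterns, CN: noise covariance
# estimated from a control run. All values below are synthetic placeholders.
set.seed(1)
n  <- 50                                      # length of the (truncated) state vector
X  <- cbind(ghg = rnorm(n), sul = rnorm(n))   # two hypothetical signal patterns
CN <- diag(runif(n, 0.5, 1.5))                # stand-in for the control-run covariance
y  <- X %*% c(1.0, 0.5) + rnorm(n)            # synthetic "observations"

CNinv    <- solve(CN)
beta_hat <- solve(t(X) %*% CNinv %*% X, t(X) %*% CNinv %*% y)   # GLS scaling factors
beta_hat

As far as I can see, anything fancier in the paper (the EOF truncation, the consistency check on the residuals) sits on top of this regression.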

30 Comments

  1. Willis Eschenbach
    Posted Sep 24, 2006 at 1:29 AM | Permalink

    Some interesting stuff from that paper:

    4 Consistency checks to detect model inadequacy

    Having framed the optimal fingerprinting algorithm as a linear regression problem, a variety of simple checks for model adequacy immediately present themselves, drawn from the standard statistical literature. For simplicity, following Hasselmann (1997) we will focus on parametric tests based on the assumption of multivariate normality. To judge from the analyses we have performed to date, the assumption of normality is likely to be reasonably close to valid for temperature data on large spatio-temporal scales. Assuming normality for other data types (such as precipitation) would be more problematic.

    Right, temperature series are normal … We have the HadCRUT3 monthly global temperature series, for example, on the largest spatio-temporal scale possible, and it is incredibly non-normal: the Jarque-Bera test gives X-squared = 60.5801, df = 2, p-value = 7.006e-14 – in other words, non-normal at the 99.999999999999% level. Even detrended, it is still non-normal at the 99% level.

    Or how about the Kaplan North Atlantic monthly SST series: undetrended, it’s non-normal at the 99.9999999-etc. level; detrended, it’s still non-normal at the same level.
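
    If anyone wants to repeat this sort of check, here is a minimal R sketch (the file and column names are placeholders for wherever you keep your copy of the data):

    # Sketch: Jarque-Bera normality test on a monthly temperature anomaly series.
    # "hadcrut3_monthly.csv" and the column name are placeholders, not real paths.
    library(tseries)                                   # provides jarque.bera.test()

    anom <- na.omit(read.csv("hadcrut3_monthly.csv")$anomaly)
    jarque.bera.test(anom)                             # raw (trended) series

    detr <- residuals(lm(anom ~ seq_along(anom)))      # remove a linear trend
    jarque.bera.test(detr)                             # detrended series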

    Then there’s the null hypothesis statement, which immediately follows the previous statement:

    Our null-hypothesis, H0, is that the control simulation of climate variability is an adequate representation of variability in the real world in the truncated state-space which we are using for the analysis, i.e. the subspace defined by the first k EOFs of the control run does not include patterns which contain unrealistically low (or high) variance in the control simulation of climate variability. Because the effects of errors in observations are not represented in the climate model, H0 also encompasses the statement that observational error is negligible in the truncated state-space (on the spatio-temporal scales) used for detection. A test of H0, therefore, is also a test of the validity of this assumption.

    I love this one. Their null hypothesis is that the climate model works just fine, and if they can’t disprove the null hypothesis, why, everything’s just dandy … wonder how much time they spent trying to disprove the null hypothesis …

    My null hypothesis, on the other hand, is that the climate models don’t work for sh*t, and it’s up to the modelers to prove otherwise.

    Then we have the way they test their null hypothesis …

    We formulate a simple test of this null-hypothesis as follows: if H0 is true then the residuals of regression [of the model results on the data] should behave like mutually independent, normally distributed random noise in the coordinate system …

    Umm … well, OK. Here are the model results from the Santer et al. paper published in Science magazine, Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere, which was supposed to provide a “fingerprint” of tropical tropospheric warming. These are the models and the non-normality of their residuals. Sorry for all the decimals, but I had to include them to show just how non-normal the residuals are:

    CM2.1 , 99.9999898480390%
    UKMO , 99.9999999999258%
    M_hires , 99.9999999999513%
    CM2.0 , 87.7303407597607%
    CCSM3 , 99.9999965856309%
    GISS-EH , 99.9999999999999%
    HadCRUT , 99.1232969781016%
    GISS-ER , 99.9999999999999%
    M_medres , 99.9999999892319%
    PCM , 99.9999918309551%

    Only one of these models (CM2.0) has even vaguely normally distributed residuals (the test gives only 87% confidence that they are non-normal, so we can’t reject normality), and that model gave wildly wacky results, with huge plunges above and below the actual data.

    In addition, the tropical NOAA SST data and the tropical HadCRUT SST data, both of which were used in the study, are non-normal at the 99.99999999% and 99.9999999% levels respectively. (As an aside, only one of these models (HadCRUT) showed a significant correlation with the data, p less than 0.05.)

    This was the study that famously said that:

    These results suggest that either different physical mechanisms control amplification processes on monthly and decadal timescales, and models fail to capture such behavior, or (more plausibly) that residual errors in several observational datasets used here affect their representation of long-term trends.

    Oh, right, the models disagree with the data, so it is more plausible to think that the data is wrong …

    How do these guys get away with this stuff?

    w.

  2. Pat Frank
    Posted Sep 24, 2006 at 2:20 AM | Permalink

    #1 — “How do these guys get away with this stuff?”

    That’s the $64,000 question dogging all of climate science.

  3. bender
    Posted Sep 24, 2006 at 5:11 AM | Permalink

    It’s hard to believe that this is the Royal Society’s “advanced statistical methods”

    That it’s marketed as “advanced” probably tells you something about the “unadvanced” methods that came before it.

  4. Willis Eschenbach
    Posted Sep 24, 2006 at 5:12 AM | Permalink

    Given that much of the temperature data is non-normal, in some cases greatly so, I don’t see how their test for normal residuals would ever get passed. I’ve been looking at the Kaplan North Atlantic SST data, which is wildly non-normal, both trended (Jarque-Bera test, non-normal at p = 4e-35) and detrended (non-normal at p = 4e-37).

    Having had no luck with linear detrending, I thought that if I instead detrended it with, say, a six-year gaussian smoothing of the data, the residuals would be normal. But no joy: the gaussian residuals were non-normal as well (p = 9e-26, moving in the right direction, but a long way from there). This is despite the fact that the gaussian average itself is also significantly non-normal (p = 4e-37).

    So I tried removing the average monthly anomalies from the linearly detrended data; that was the best yet, but it didn’t work either, still non-normal at p = 0.001. How about the gaussian-detrended data minus the gaussian monthly average anomalies? Getting close, p = 0.02, but still non-normal.

    OK, how about detrending it with a longer gaussian smoothing, maybe 12 years, then removing the monthly average residuals … whoa, I did it. We can’t reject the null hypothesis that the residuals are normal, p = 0.13. Of course, this means we have 87% confidence that they are in fact non-normal … but we can’t reject the hypothesis. And lengthening the smoothing filter beyond that doesn’t improve matters.
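
    A rough sketch of that recipe in R (the kernel construction and the data object are illustrative stand-ins, not exactly what I ran):

    # Sketch of the recipe above: smooth with a gaussian kernel, remove monthly
    # means from what is left, then test normality. Details are illustrative only.
    gauss_smooth <- function(x, width) {
      w <- dnorm(seq(-3, 3, length.out = width))       # gaussian weights
      as.numeric(stats::filter(x, w / sum(w), sides = 2))
    }

    sst    <- as.numeric(kaplan_sst)                   # hypothetical monthly ts object
    month  <- cycle(kaplan_sst)                        # month index, 1..12
    smooth <- gauss_smooth(sst, width = 12 * 12 + 1)   # roughly a 12-year smoother
    resid1 <- sst - smooth                             # gaussian-detrended series
    resid2 <- resid1 - ave(resid1, month,              # remove monthly mean anomalies
                           FUN = function(z) mean(z, na.rm = TRUE))

    library(tseries)
    jarque.bera.test(na.omit(resid2))                  # normality test on what is left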

    I’m fresh out of ideas … but I’d sure like to see the climate model that can successfully pass their test regarding emulating the Kaplan North Atlantic SST record …

    w.

  5. Douglas Hoyt
    Posted Sep 24, 2006 at 6:14 AM | Permalink

    Willis, it would be helpful if you showed a histogram of one of these residuals compared to a normal distribution.

  6. Posted Sep 24, 2006 at 9:24 AM | Permalink

    Willis, this is all publishable stuff, it seems. Why don’t you write it up?

  7. Tim Ball
    Posted Sep 24, 2006 at 12:26 PM | Permalink

    #2
    One major way appears to be that they peer-review each other’s papers, which is why they keep stressing that those who question them are, for the most part, not peer reviewed.

  8. TCO
    Posted Sep 24, 2006 at 7:18 PM | Permalink

    For the most part, those who are criticizing are not bothering to put ass to chair seat and write papers. And when they do, they too often look like BC06 or Steve’s PPT presentation for the AGU: abortions. Let’s ditch the “hating the journals” when people are not even trying to get published.

  9. Posted Sep 25, 2006 at 8:20 AM | Permalink

    TCO,

    I have to agree. Scientific revolutions are won on the battlefield of scientific journals. What many of the critics are doing is more like guerrilla warfare: don’t face the enemy in the open, just stay on the fringe and strike here and there, claiming victory each time, but without really making a dent in the established regime. On the other hand, every published paper is another battle won. AGW proponents have understood that right from the start.

  10. rwnj
    Posted Sep 25, 2006 at 7:16 PM | Permalink

    I have not followed the technical details that have been developed on this website, so I apologize if this has been answered somewhere else. I have two related comments:
    1. PCA is a calculation performed on a covariance matrix. In a typical application, the covariance matrix is an estimate of the “true” covariance of the system, derived from a finite number of observations of the system. Also in the typical application, the covariance matrix is estimated with the sample means removed. This can be shown to be the optimal estimate (with respect to likelihood) under some simple assumptions. If the sample means are not used to center the covariance estimate, then the estimate is simply not optimal with respect to likelihood. Perhaps an “uncentered” estimate is optimal with respect to some other data model. What is that data model? (A small sketch contrasting the two estimates follows below.)
    2. Why are not all of these discussions prefaced with a description of an assumed data model? (e.g., the signal is Brownian motion with variance 1 per unit time and the noise is uncorrelated normal with mean 0 and variance 2). With an explicit data model, the discussion divides into two parts: is the data model appropriate and does the statistical technique diminish the noise and enhance the signal?
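
    To make point 1 concrete, here is a small R sketch contrasting the two covariance estimates (toy data only; “uncentered” here simply means the column means are not removed before the decomposition):

    # Toy sketch for point 1: centered vs. "uncentered" principal components.
    set.seed(2)
    X <- matrix(rnorm(200, mean = 3), nrow = 50, ncol = 4)      # 50 obs, 4 variables

    pc_centered   <- prcomp(X, center = TRUE,  scale. = FALSE)  # usual covariance PCA
    pc_uncentered <- prcomp(X, center = FALSE, scale. = FALSE)  # no mean removal

    # With a nonzero mean, the leading "uncentered" component mostly tracks the
    # mean vector rather than the variance structure around it:
    round(pc_centered$rotation[, 1], 2)
    round(pc_uncentered$rotation[, 1], 2)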

    Regards,

    rwnj

  11. Posted Sep 26, 2006 at 1:14 AM | Permalink

    If the model simulation of internal variability is correct, the variance in the response patterns from an M-member ensemble is approximately 1/M times the variance in the observations (exactly so if all distributions are Gaussian).

    Let’s see, Fisz (1980), Probability Theory and Mathematical Statistics:

    Theorem 3.6.3. The variance of the sum of an arbitrary finite number of independent random variables, whose variances exist, equals the sum of their variances.

    Didn’t find a theorem that explains ‘exactly so if all distributions are Gaussian’.
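
    (The 1/M factor itself needs nothing more than that independence theorem: if each of the M ensemble members carries independent internal-variability noise u_i with variance sigma^2, then

    \operatorname{Var}\!\left(\frac{1}{M}\sum_{i=1}^{M} u_i\right) = \frac{1}{M^2}\sum_{i=1}^{M}\operatorname{Var}(u_i) = \frac{\sigma^2}{M},

    so the ensemble-mean noise variance is 1/M of the single-realization noise variance. Gaussianity would pin down the full distribution of the ensemble mean, not just this variance.)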

    But this is more important: they use ‘control integrations’ to estimate the covariance matrix of the ‘climate noise’, just like IDAG. Mann uses ad hoc spectrum smoothing. It is not detection, it is circular reasoning. Useless.

    (and what is rank-L vector?)

  12. Willis Eschenbach
    Posted Sep 26, 2006 at 3:22 AM | Permalink

    Re #5, Doug, you asked for a histogram of the Kaplan dataset. Here ’tis …

    Like I said … radically non-normal …

    The problem is that the earth’s climate is chaotically bi-stable (or, more properly, multistable). Think for example of the PDO, or in this case, the AMO. The datasets from these chaotically bi-stable systems typically have “humped” distributions, with a concentration of data on either side of the overall average. That gives us non-normal distributions.
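
    A toy picture of what I mean (synthetic numbers, nothing from a real climate series):

    # A mixture of two "states" gives a humped, bimodal distribution that fails a
    # normality test, even though each state on its own looks normal.
    set.seed(6)
    state <- rbinom(1000, 1, 0.5)                          # which regime each point is in
    x     <- rnorm(1000, mean = ifelse(state == 1, 1, -1), sd = 0.5)
    hist(x, breaks = 40)                                   # two humps
    shapiro.test(x)                                        # strongly rejects normality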

    w.

  13. Posted Sep 26, 2006 at 3:53 AM | Permalink

    #1,12

    How do you test normality of time series? i.i.d case is easy, but what if the correlation in time is high?

  14. Steve McIntyre
    Posted Sep 26, 2006 at 7:19 AM | Permalink

    #10. I agree. You’d think that people purporting to make reconstructions based on a methodology such as PCA would show the applicability of the methodology. However, the original article was published in Nature and the referees did not require any demonstration that the “novel” methodology was applicable. So for a full explanation of the phenomenon, you’d have to ask Nature.

  15. bender
    Posted Sep 26, 2006 at 7:28 AM | Permalink

    Re #13
    Tests of normality are available in most standard stats packages.
    Shapiro-Wilk’s test is one. In R, use function:

    shapiro.test(x)

    which assumes iid, of course.

    If the series is nonstationary, then there is by definition more than one distribution. If the system is bistable, then there are 2 distributions, which you can estimate if you have a long enough data series and if you eliminate the transient phase where states are switching. If the series are autocorrelated, then you can use pre-whitening to remove the red.
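
    A minimal sketch of that pre-whitening step (the AR order and the simulated series are illustrative only):

    # Pre-whitening before a normality test: fit an AR model, then test the
    # (approximately white) innovations rather than the raw, autocorrelated series.
    set.seed(3)
    x   <- arima.sim(model = list(ar = 0.8), n = 500)   # illustrative red-noise series
    fit <- arima(x, order = c(1, 0, 0))                 # fit an AR(1); order is illustrative

    shapiro.test(as.numeric(x))                # raw series: the iid assumption is violated
    shapiro.test(as.numeric(residuals(fit)))   # pre-whitened innovations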

  16. bender
    Posted Sep 26, 2006 at 7:31 AM | Permalink

    Re #14 Steve M, that’s what my supervisory committee asked me to do when as a grad student I proposed using PCA to extract signals from a network of tree ring data. They wanted proof of concept before any manuscripts were written. (But of course, there were no climatologists on my committee.)

  17. Steve McIntyre
    Posted Sep 26, 2006 at 8:45 AM | Permalink

    #16. No wonder this Mann stuff seems so bizarre to you. Precautions taken by your advisory committee (presumably some time ago) were completely thrown to the winds by Nature and IPCC, followed by a scorched-earth policy from the Team: deny, deny, deny.

  18. bender
    Posted Sep 26, 2006 at 9:12 AM | Permalink

    Re #17
    That’s how I knew from the start (i.e. from the time the bizarre off-centering method was revealed) that your criticisms of their methods and data were spot on. Their reluctance to release the new bristlecone pine data is, as you have said many times, suspicious. Their reluctance to admit how statistically non-independent these “independent” multiproxy reconstructions are is telling. That they fail to understand Wegman’s primary point – that the community is too inbred and too far removed from the statistics community – resonates with my observations. They say they’ve “moved on”, but look at the faulty statistical methods still being used in the analysis of the hurricane data. Detection and attribution is becoming non-scientific as they become single-minded in their focus on the A in AGW and unwilling to consider alternative hypotheses.

    Then again, maybe climatology never was a science. (In the Popperian sense of growing knowledge through iterative conjecture & refutation.)

  19. Martin Ringo
    Posted Sep 26, 2006 at 10:06 PM | Permalink

    Re: tests of normality – 4 notes

    1) I am about 20 — OK, 30 — years out of date on distribution tests, but the word within the community – say, 1975 – was that distribution tests didn’t work well in practice, which led to their being ignored.

    2) Be careful with the Shapiro-Wilk test for data with multiple observations of the same value. The test is based on order statistics, which are obviously sensitive to ties.

    3a) If you are testing a variable Y for normality of the levels, you need to account for the ARMA structure in the model and then test the residuals — a point that is pretty easy to see when you plot the realizations of an AR(1) series with coefficient 0.5 next to a series of i.i.d. N(0,1) of the same sample size.
    3b) If Y(t) = X(t)*B + e(t), where e(t) is i.i.d. N(0, sigma^2) but X is not normal, Y won’t be normal either. Thus again one has to look to the residuals for the test of normality to have any meaning (a toy illustration follows these notes).

    4) The error term in a linear model does not have to be normal for the Gauss-Markoff theorem to hold, only independently and identically distributed with zero mean and finite variance. Of course, the significance of the coefficients can no longer be read straight from the tables (t, F or Chi-Square), which presume normally distributed errors. In practice most applied econometric studies do not test for normality. They do test for serial correlation and heteroskedasticity in the residuals. (And most readers of said studies do a little implicit discounting of the significance, both for lack of normality and for data mining.) A “good” model has clean residuals: they will look roughly like white noise on a plot, although they will probably fail most tests of normality.
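
    A toy illustration of 3b (synthetic numbers only, nothing from any climate series):

    # Note 3b: Y = X*B + e with normal errors but a non-normal regressor X.
    # The levels of Y fail a normality test; the regression residuals do not.
    set.seed(4)
    x <- rexp(300)                         # strongly skewed regressor
    y <- 2 * x + rnorm(300)                # normal errors on top of a non-normal signal

    shapiro.test(y)                        # typically rejects: Y inherits the skew of X
    shapiro.test(residuals(lm(y ~ x)))     # typically does not reject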

  20. Posted Sep 26, 2006 at 11:37 PM | Permalink

    #19

    Good points. Should add to 4) that if the error distribution is unknown, 2-sigma and 3-sigma limits are quite useless.

    We should avoid making the same mistake as Mann & Jones do in the More on the Arctic post (5-sigma events in correlated time series).

  21. bender
    Posted Sep 27, 2006 at 12:44 AM | Permalink

    Re #20

    We should avoid making the same mistake as Mann & Jones

    You mean the mistake of assuming that the error distribution is (i) known and (ii) homogeneous?

  22. bender
    Posted Sep 27, 2006 at 12:46 AM | Permalink

    Re #12
    I am intrigued by Willis’ confident assertion in #12 of [chaotic] multi-(bi-)stability.

    1. Mathematically & intuitively I understand the concept. But in reality, doesn’t local bistability in a network of n cells across a globe imply the system actually has 2^n super-states? i.e. ENSO, AMO, PDO are just big, resilient chunks of a superstate. If the globe is warming and we are passing from one superstate to the next, then some of these chunks will be more resilient to change over time than others. i.e. The illusion of large-scale bistability persists for some time, until familiar chunk after familiar chunk is finally broken down and reconstituted to form new and different chunks that characterize the new superstates. If this is the shape GW will take, then the notions of bistability and metastability seem not very useful. [Apologies for vagueness, imprecision, ambiguity. I’m trying my best with my limited toolkit.]

    2. Are these systems really locally bistable? Or is this largely an illusion enhanced by the way warm & cool waters vertically separate? i.e. It’s not a single variable system with two states, but a two variable system with continuous states. [Again, apologies for the bandwidth-consuming musings. It makes sense to the writer, but probably sounds incoherent to the reader. Even after substantial efforts in editing.]

    I’m willing to read if anyone’s got suggestions.

  23. Willis Eschenbach
    Posted Sep 27, 2006 at 1:30 AM | Permalink

    Re #22, bender, thanks for the interesting question. I assert it based on the existence of a variety of “oscillations”, which is climatespeak for a couple of separate stable states.

    Take for example the PDO. Here’s the correlation of the PDO with the SST:

    And here’s the PDO index, from here.

    Note the stable period between 1945 and 1975. That’s what I mean by “bi-stable” … but I’m willing to learn. In any case, I’m not sure I agree that there are 2^n superstates. Rather, I would say that there are various sub-systems, many of which have more than one stable state.

    w.

  24. Posted Sep 27, 2006 at 1:33 AM | Permalink

    re #21

    RC:

    The April mean temperature is almost 5 standard deviations above the mean, a “5 sigma event” in statistical parlance. Under the assumption of stationary ‘normal’ statistics, such an event is considered astronomically improbable.

    So, the sample mean and sample std are computed from 1961–1990 data. Then it is observed that the 2006 value is 5 sample standard deviations above the sample mean. And that, they say, is astronomically improbable.

    Mistake: stationary does not mean i.i.d. It means that all of the distribution functions of the process are unchanged regardless of the time shift applied to them (let’s be strict here 🙂).

    I think no one disagrees that there are non-zero autocorrelations in temperature data. The Jarque-Bera test assumes a random sample, and an autocorrelated series won’t necessarily do (as noted in #19, point 3). And the sample mean and sample std won’t tell you much if the samples are not random.

    Maybe I should put it this way: in #1 Willis proves that the way Mann & Jones interpret their data is wrong. But now I’m confusing people (and myself), so I need to stop.

    related discussion here (http://www.climateaudit.org/?p=678)

  25. bender
    Posted Sep 27, 2006 at 1:48 AM | Permalink

    Re #23
    Fair enough. Bistability is useful conceptually, but in practice it has its limits. Worth mentioning because I sometimes get the sense that skeptics take the “bistability” proposition to mean there is a hard ceiling on global warming. In reality there is no telling how many ceilings there are to bust through. Just wanted to be clear that local bistability does not imply global bistability.

  26. bender
    Posted Sep 27, 2006 at 2:03 AM | Permalink

    Re #24
    I follow better now. If the underlying distribution is changing, then that “5 sigma” “event” may really be a 2 sigma event, with the 3 sigma difference attributable to a trend, or a switch among bistable states, or nonstationary background forcing effects, or what have you. So their “astronomic improbability” is exaggerated by cherry-picking the time-frame of the baseline “normals” used for comparison. And 1 and 2 sigma events aren’t all that uncommon.

  27. Posted Sep 27, 2006 at 2:44 AM | Permalink

    Generate an AR(1) series with p = 0.9 and Gaussian driving noise, N = 300, and take the sample std and sample mean using 30 of the samples. It won’t take long to find 6-sigmas.
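
    A sketch of that recipe in R (treating the first 30 points as the “baseline”, which is my own reading of it):

    # AR(1) with phi = 0.9: use a 30-point "baseline" mean and sd, then measure the
    # largest excursion of the rest of the series in those baseline sigmas.
    set.seed(5)
    max_sigma <- replicate(1000, {
      x    <- arima.sim(model = list(ar = 0.9), n = 300)
      base <- x[1:30]                                # the "1961-1990"-style baseline
      max(abs(x[31:300] - mean(base)) / sd(base))    # worst excursion, in baseline sigmas
    })

    mean(max_sigma >= 6)        # fraction of runs containing a "6-sigma event"
    quantile(max_sigma, 0.9)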

    So their “astronomic improbability” is exaggerated by cherry-picking the time-frame of the baseline “normals” used for comparison.

    Yep. But of course they can claim that stationary ‘normal’ actually means those weakly correlated AR1 background processes that CGCMs and spectrum-smoothing methods provide. This might be an infinite loop.

    And 1 and 2 sigma events aren’t all that uncommon.

    5-sigma events are not uncommon if you let me choose the distribution. Or if we observe an astronomical number of samples.

  28. Steve Sadlov
    Posted Sep 27, 2006 at 9:14 AM | Permalink

    RE: #23 – OK, I am going to be a tree (well, actually, bush) ringer for a short while here. This is based on personal recollection, so please excuse any slight errors. I recall reading an article in the Los Angeles Times back during the early to mid 80s (probably ~ 1983) which discussed a ring width study done on chaparral (don’t recall the exact species, perhaps ceanothus) in the near coastal transverse ranges (now Dano, that is most definitely Mediterranean! 🙂 ). The altitude was such that the main limiting factor of growth was moisture availability. The assertion of the study was that the ring widths showed there was in general less moisture available from the 1940s to the time of the sampling (I suspect late 1970s) than had been available from some point in the 1800s (I seem to recall late 1860s) to the 1940s. The folks who did the study also issued a warning that the 1940s – 1970s lull in precip (taken as a proxy for late fall through spring mid latitude cyclones and cold fronts) would, based on the result for the earlier period, likely not hold. So, what has happened since then? After a fairly significant – ~ 7 year – regional drought (interestingly, affecting mostly Southern, but not Northern, California) during much of the 1980s, there have since been some very wet years, with the exception of the odd dry one (expected that far south). I am not sure to what level the concept of the PDO was understood back then. Nonetheless I find it fascinating that the time frames seem to align.

  29. Steve Sadlov
    Posted Sep 27, 2006 at 9:26 AM | Permalink

    Additional notes: Southern California (where I resided during most of the 1980s) experienced very moist years 1981 – 1983, corresponding to the upper portion of that significant rising edge of the PDO figure of merit. The So Cal drought I mentioned, 1984 – 1991, was during the subsequent slight lowering and trough in the waveform after that significant 1973 through 1985 rising edge. IIRC, the El Ninos during the rising edge were 78 – 79 and 82 – 83. There have been no truly extreme El Ninos since. There were two moderate ones during the time since, one in 97 and early 98, the other back in 00 and 01. Interestingly, the severe flooding and mud slides a few years back (03 I believe?) in So Cal were not during a true El Nino event but were a result of a persistent split polar jet with one leg stuck over So Cal. Also, the worst drought in recent California history, affecting the entire state, was 75 – 77, years during which most of the state melded in with Baja and Arizona from a precipitation standpoint. Also, in Feb 1976 we had one of the more notable cP outbreaks ever witnessed, bringing a couple of inches of snow to most of the lowland areas of the state.

  30. Steve Sadlov
    Posted Sep 27, 2006 at 9:33 AM | Permalink

    RE: #29 – sorry, final note: while the 98 El Nino was really major from an ENSO figure-of-merit perspective, the 81 – 83 one was more impactful as experienced in much of California. My statement about extremity refers to what was experienced here, not to the ENSO figure of merit.