If one is to advance in the statistical analysis of temperature reconstructions, let alone climate reconstructions – and let’s take improving the quality of the data as the obvious priority – Task One in my opinion is to place the ad hoc Team procedures used in reconstructions in a statistical framework known off the Island. The reason is not to be pedantic, but that a lot of work has been done in statistics over the past two centuries and you’d think that some of it would somehow apply to temperature reconstructions. This is a theme that we’ve discussed here from time to time.
One of the many frustrating aspects about conferences like Kim Cobb’s Trieste workshop and/or the PR Challenge is that this sort of issue is nowhere on their horizon. In my opinion, the parties involved are thrashing around rather aimlessly, in part because no one in the field has successfully connected the reconstruction problem to known statistical problems. UC, Jean S, Hu McCulloch and I have reflected on this from time to time and it is our shared view that advances will come by applying known results from multivariate calibration to the particular and difficult problems of proxy reconstruction. UC has an interesting but cryptic post here from last summer on this matter.
Had Kim Cobb asked for my advice on her workshop, I would have suggested putting multivariate calibration experts and their doctoral students looking for topics in the same room with people familiar with the empirical nuts and bolts of the proxies, to see if they could connect. It’s hard to see the point of re-assembling the same old faces – what are Briffa, Mann and Ammann going to say to one another that they don’t already know? Who wants to hear about some other “moving on” method that is nowhere connected to the relevant calibration literature? A complete waste of time and money. Well, they went to a nice place.
Framing the problem in this way is actually a long step towards solving it. Some of you may have noticed references to various important papers in the multivariate calibration literature (Brown 1982; Brown and Sundberg 1987, 1989), but the literature is very large.
The multivariate calibration literature originates in a proxy problem that has many points of similarity to temperature reconstructions – with the helpful advantage that the “proxies” are known to be valid. In each case, one establishes a relationship between cause (X) and effect (proxy-Y) in a calibration period and then seeks to use Y in the prediction period (or verification period) to obtain information about X. Indeed, as I’ll mention below, it’s possible to directly compare the MBH method (once its orotund rhetorical description is reduced to something coherent) to a known calibration method, as UC and I have discussed from time to time.
The prototype case in calibration literature is how to use cheap and efficient near-infrared (NIR) reflectance information from many (q) frequencies (the Y-matrix) to estimate p values of interest (the X-matrix), the X-values in these cases being chemical measurements which are expensive and time-consuming. NIR calibration uses a direct and known relationship (Beer-Lambert Law) as opposed to the more exotic, speculative and probably non-linear relationship in which (say) Graybill bristlecone ring width chronologies act as an integrator of world temperature across 7 continents.
But for now, we’re not going to question the existence of these relationships. Let’s stipulate for a moment that there are relationships – maybe very noisy and with complicated noise. If so, then many of the issues in temperature reconstructions parallel examples in multivariate calibration literature. In any event, before dealing with more difficult problems of non-proxy proxies, it is helpful to try to understand what’s going on in relatively “tame” situations, which, as it turns out, are not without their own perils or interest.
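To fix ideas before getting into the multivariate machinery, here is a minimal univariate sketch of the calibrate-then-invert structure. Everything in it is made up for illustration – none of these numbers come from Brown and Sundberg:

```r
# Synthetic illustration of the calibration/prediction structure.
set.seed(1)
x <- seq(0, 10, length.out = 30)         # known "cause" (X) in the calibration period
y <- 2 + 0.8 * x + rnorm(30, sd = 0.3)   # proxy (Y) responding linearly, with noise
fm <- lm(y ~ x)                          # calibration: regress proxy on cause
# "classical" prediction: invert the fitted relation for a new proxy reading
y.new <- 6.5
x.hat <- (y.new - coef(fm)[[1]]) / coef(fm)[[2]]
x.hat  # estimate of the unobserved X in the prediction period
```

With a strong linear signal this inversion is benign; the interesting pathologies (Fieller-type intervals, inconsistency between multiple proxies) only emerge in noisier, multivariate versions of the same setup.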
Unfortunately the original literature (e.g. Brown 1982, Brown and Sundberg 1987, Brown and Sundberg 1989 and many others) is not easy; nor am I aware of software packages that implement the procedures and diagnostics in a way that can be transferred to topics of interest to us; nor am I aware of textbook-level treatments of the material. Often the terminology in the original articles is not very clear; notation that looks the same in two articles often means different things, and for someone trying to apply the literature to a novel situation, the meaning can often only be elucidated by patient experimenting with examples. Brown and Sundberg 1987 in particular gives just enough results on a case where original data is available for benchmarking that one can, with considerable patience, get a foothold on the methods. For me to make sense of it, I have to work through example by example, line by line, which is very slow going, but I’m pleased to report some headway.
I’ve organized my notes into a post, both to help me keep track of things when I re-visit the matter and in case they are of interest to others. I’ve also placed a set of functions used to implement this methodology online together with the small data set used as an example in Brown and Sundberg 1987.
While some of the text will be dense for many readers, the problem that’s going to be illuminated by this methodology is how to measure “inconsistency” between proxies – something that should be of keen interest to anyone in this field – and then how to estimate confidence intervals in calibration problems. Readers often complain about climate articles not providing error bars, as though that were some sort of panacea. This has not been a particular theme of mine because, in the paleoclimate field, authors often do provide error bars. The problem is that the error bars have no statistical meaning and provide a faux confidence in the results.
A justified calculation of confidence intervals in calibration is not straightforward; counterintuitive results in univariate calibration have been known for many years. Procedures in the field don’t appear to be settled. However, any person who’s read this literature is IMO very unlikely (in the IPCC sense) to conclude that the discussants of paleoclimate confidence intervals in IPCC AR4 or in Wahl and Ammann have any idea of the nature of the problems or any familiarity with the relevant literature.
The title of Brown and Sundberg 1987, “Confidence and conflict in multivariate calibration”, indicates its relevance to our problem, once the premise of the linkage of temperature reconstructions to multivariate calibration is accepted. I found Brown and Sundberg 1987 a bit more accessible and on point than the more widely referenced Brown 1982. Conflict between proxies is an issue that’s bothered me for a long time and one that’s been discussed here on an empirical basis at great length. But it’s an issue that’s pretty much swept under the carpet in climate literature; when I asked IPCC to cite this as an uncertainty in AR4, the mere suggestion was brusquely dismissed and the problem nowhere appears in AR4.
The Multivariate Inconsistency Statistic
Brown and Sundberg 1987, p 49 (and elsewhere) propose a multivariate inconsistency statistic (their R), with a high R denoting inconsistency. This is not to be confused with the usual R/r (one seems to run out of letters). So watch out. About this statistic, they observe:
It should be observed that one is unlikely to want to use [their “reconstruction” estimate] when R is significantly larger than would be expected given that Z comes from the same model as the calibration data.… For such a large R, one might question the validity of the observation. Various strategies are then possible, including investigation of the individual error components and seeking further data.
Note already the difference in approach as compared to people with limited understanding of statistical procedures, such as Wahl and Ammann. If your analysis throws up problems, you recognize them and deal with them in a straightforward way. You investigate the bristlecones or other key series; you don’t just argue that their inclusion in a data set is Newton’s Fifth Law.
The other important and relevant aspect of the Brown and Sundberg approach is the application of maximum likelihood methods to calibration problems. They report that the GLS estimate (which equates to Mannian and like approaches) is inconsistent (in the statistical sense) and departs increasingly from the MLE (maximum likelihood) estimate as the Inconsistency R increases.
Benchmarking with Paint Data
Brown and Sundberg (1987) use the Paint data available in print form in Brown (1982) (transformed slightly) to report some results that can be used to benchmark the implementation of the methods. I’ve manually retrieved this data from the print document and made a digital version applying transformations described in the articles and using the calibration set as defined in the articles. Here is a script to retrieve the Paint data and to allocate subsets in the terminology used in the articles.
Paint=read.table(file.path(url,"data/brown/Paint.brown.dat"),sep="\t",header=TRUE) #this table has been organized in brown/collation.paint.1982.txt
Z0=as.matrix(Paint[verification,c("Y1","Y4")]);Z0 #see 1987 page 56
The nomenclature in the literature varies a little from article to article and it gets hard to follow sometimes. Here I’ve used X to denote the calibration “cause” matrix and Y to denote the calibration proxy matrix. This is an important heuristic; all too often, it seems that people use methods that require that tree rings cause temperature rather than the other way around. In the methods used here, the X and Y data are centered in the calibration period (not a good idea for PC analysis, but OK for regression).
I use Z to denote the (centered) proxies in the verification/prediction period and Z0 to denote the uncentered proxies in the verification period; Xver to denote the uncentered X matrix in the verification period. (You have to watch out for different ways of centering being handled in different articles). In the 1987 example, Brown uses q=2 proxies (Y1,Y4) against p=1 causes (viscosity) in a calibration set with n=27 measurements.
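Concretely, the centering conventions can be sketched as follows. The data frame and index sets here are toy stand-ins for the Paint set, just to make the conventions explicit:

```r
# Toy stand-in for the Paint data, to illustrate the centering conventions.
Paint <- data.frame(X = 1:6, Y1 = c(2.1, 3.1, 4.1, 5.1, 6.1, 7.1),
                    Y4 = c(30, 32, 34, 36, 38, 40))
calibration <- 1:4; verification <- 5:6              # illustrative split
X <- scale(as.matrix(Paint[calibration, "X", drop = FALSE]), scale = FALSE)  # center only
Y <- scale(as.matrix(Paint[calibration, c("Y1", "Y4")]), scale = FALSE)      # no division by sd
ybar <- colMeans(Paint[calibration, c("Y1", "Y4")])
Z0 <- as.matrix(Paint[verification, c("Y1", "Y4")])  # uncentered verification proxies
Z <- sweep(Z0, 2, ybar)                              # centered with calibration-period means
```

The point to notice is that Z is centered with the calibration-period means, not its own means – using the wrong means is an easy way to go astray when moving between articles.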
I haven’t created an R package, but I’ve placed the functions used in my calculations in one file to make the following discussion turnkey. The functions can be downloaded as follows:
My implementation is very much a work in progress. For now, I use two wrapper functions to produce Brown-type information, in the first instance for benchmarking, with a view to applying the maximum likelihood methods to proxies. The work is done in the following commands, which produce a variety of information, not least of which is the maximum likelihood estimates (discussed below):
The first function brown.basics performs various elementary operations to produce information used in the maximum likelihood analysis. Brown centers both X and Y in the calibration period – this is done and the vectors of means are saved as xbar and ybar. Contrary to procedures frequent in climate, Brown and Sundberg do not divide the proxies by their standard deviations – this step isn’t required for regression and, since the “true” standard deviations are unknown, this procedure is itself a relevant part of the error structure, which is concealed to a certain extent by prior standardization. Brown then carries out a garden-variety OLS regression of the “stack” of q proxies Y against the p X-variables, more or less as:
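The snippet itself hasn’t survived in this version of the post, but the stacked regression is presumably along the following lines (toy centered X and Y here, since the real ones come from the Paint script):

```r
# Stacked OLS: q proxy regressions fitted in one lm() call (toy data).
set.seed(2)
n <- 27; p <- 1; q <- 2
X <- scale(matrix(rnorm(n), n, p), scale = FALSE)                  # centered causes
Y <- scale(X %*% matrix(c(-0.1, -1.7), p, q) +
           matrix(rnorm(n * q, sd = 0.1), n, q), scale = FALSE)    # centered proxies
fm <- lm(Y ~ X)                      # one call, q separate regressions
alpha <- coef(fm)[1, ]               # intercepts: ~0 with centered data
B <- coef(fm)[-1, , drop = FALSE]    # p x q matrix of slopes
S <- t(resid(fm)) %*% resid(fm)      # q x q residual crossproduct matrix
Sinv <- solve(S)
```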
In effect, there are q individual linear regressions, which are stacked. This yields several objects carried forward to the Maximum Likelihood calculations:
alpha: the first row of fm$coef, the vector of p constants in the stacked regression. This will be 0 with centered matrices.
B: rows 2 to p+1 of fm$coef, the p x q matrix of OLS regression coefficients. This is Mann’s G (ignoring the arbitrary MBH weighting factors, which don’t matter according to Wahl and Ammann).
S, Sinv: the qxq crossproduct matrix of residuals (t(fm$resid) %*% fm$resid) with its inverse Sinv
n: the number of rows in the calibration set
p: the number of columns in the X-matrix
q: the number of columns in the Y-matrix
X, Y: post-centering versions of X,Y
xbar, ybar: means
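Pulling these together, brown.basics presumably does something like the following – a sketch of assumed behavior, with the names following the list above; the posted script is authoritative:

```r
# Sketch of brown.basics: centering plus stacked OLS (assumed behavior).
brown.basics <- function(X0, Y0) {
  X0 <- as.matrix(X0); Y0 <- as.matrix(Y0)
  xbar <- colMeans(X0); ybar <- colMeans(Y0)
  X <- sweep(X0, 2, xbar)       # center in the calibration period
  Y <- sweep(Y0, 2, ybar)       # note: no division by standard deviations
  fm <- lm(Y ~ X)
  S <- t(resid(fm)) %*% resid(fm)
  list(alpha = coef(fm)[1, ], B = coef(fm)[-1, , drop = FALSE],
       S = S, Sinv = solve(S), n = nrow(X0), p = ncol(X0), q = ncol(Y0),
       X = X, Y = Y, xbar = xbar, ybar = ybar)
}
```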
A few items can be benchmarked against information about the example at Brown and Sundberg 1987, page 56, evidencing that the data set has been matched and that the elementary matrices have been correctly identified (ybar, B and Sinv):
Basics$ybar #matches 1987 p 56
# Y1 Y4
# 1.747778 37.936296
Basics$B #matches 1987 p 56
# Y1 Y4
#[1,] -0.128 -1.69
Basics$Sinv #this is the inverse of S
# Y1 Y4
#Y1 6.86285 0.03052
#Y4 0.03052 0.02299
That the calibration carried out in this example is a pretty simple one can be seen in the graphic below, showing the calibration of the two proxies against the one-column X matrix – and one column is the relevant case for most recons where, rightly or wrongly, there is only one X-column (temperature). In this example, there is controlled calibration, i.e. the X values are set at known values – permitting random X values changes the statistical analysis. In this case, there is a much more distinct relationship between proxy Y and X than one sees in typical paleoclimate proxies, so this is very much an idealized situation.
Brown’s Inconsistency Diagnostic
Brown’s inconsistency R (denoted here sometimes by Rb) is defined as follows, with the hat denoting the GLS fit rather than the maximum likelihood fit and with B being the same matrix obtained above (the hat on B being assumed in this frequently used object):

R = (Z − B′ξ̂)′ S⁻¹ (Z − B′ξ̂), where ξ̂ = (B S⁻¹ B′)⁻¹ B S⁻¹ Z is the GLS estimate.
At 1987 page 56, Brown and Sundberg give the results for three cases, which you may also calculate directly as follows (which I present as evidence that my implementation of the Brown inconsistency diagnostic has been done correctly, prior to describing its calculation). Brown considers the significance of this statistic relative to a chi-squared distribution (df=1 here) and the implied chi-squared values are also shown.
Rb(Z=c(1.68,38.64)-Basics$ybar) #case a brown[18,]
# 0.825 versus reported 0.83
Rb( Z=c(1.86,35.7) -Basics$ybar) #case b
# 4.46 versus reported 4.46 #said to be significant at 5% level
Rb (Z=c(1.94,34.09)-Basics$ybar);#case (c)
# 13.14 versus reported 13.14
#said to be significant at 0.1% level based on chisq df=1
# qchisq(c(.95,.99,.999),df=1) critical values: 3.841459 6.634897 10.827566
The calculation of the inconsistency R can be seen in the functions. While there seemed to be some other candidate denominators for the calculation of the GLS estimate, the following seemed to work (note that the definition of R is structurally identical to MBH, except that MBH replaces the GLS estimator with arbitrary weights and Wahl and Ammann 2007 with the identity matrix I). Using l here to denote the number of rows in the individual estimate (typically 1 for each year in proxy situations, but the methods can be adapted to an entire verification period), 1987 equation 2.11 gives

R = (Z − B′ξ̂)′ S⁻¹ (Z − B′ξ̂)

which is merely notation for the squared distance of the observed proxies from their GLS fit, weighted by the inverse of the residual crossproduct matrix. I’m not clear why the GLS estimator is used here instead of the maximum likelihood estimator, which Brown denotes with a double-hat and which will be described below. The double-hat doesn’t render here and so the tilde is used.
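A minimal sketch of this statistic in R, for the l = 1 case, with B the p x q coefficient matrix and Sinv the inverse residual crossproduct as above (the values in the usage check are toy numbers, not Brown’s):

```r
# Inconsistency statistic: squared S^-1-weighted distance of Z from its GLS fit.
Rb <- function(Z, B, Sinv) {
  Z <- matrix(Z, ncol = 1)                              # centered q-vector of proxies
  xi <- solve(B %*% Sinv %*% t(B), B %*% Sinv %*% Z)    # GLS estimate of X (p-vector)
  r <- Z - t(B) %*% xi                                  # proxy residual from the GLS fit
  drop(t(r) %*% Sinv %*% r)       # compare against chi-squared with q - p df
}
```

Note that when Z lies exactly on the fitted plane (the proxies agree perfectly), the statistic is zero; it grows as the proxies pull the GLS estimate in different directions.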
Confidence Intervals and the Inconsistency Diagnostic
Brown and Sundberg 1987 show that the likelihood-based confidence interval in multivariate calibration expands dramatically as Rb increases.
First is their Figure 1 and then my emulation, which visually matches and, as discussed below, there are benchmark matches as well. In the examples shown here, the GLS estimate is arbitrarily set at 1 and the profile maximum likelihood is then calculated. As observed by Brown and Sundberg, the maximum likelihood estimate (the minimum here) coincides with the GLS estimate only when Rb=0. As Rb increases, it moves increasingly far away from the GLS estimate. This is something that needs to be borne in mind by people who worry about the standard deviation of reconstruction time series – it would be noticeably higher in a maximum likelihood reconstruction (I’ve done some preliminary analyses).
For benchmarking purposes, Brown and Sundberg report the confidence intervals for each of these four figures, which are matched up to rounding in my emulation (thumbnails shown below):
The widening of the confidence intervals with increasing inconsistency diagnostic R(b) is really quite dramatic. The interval doubles when R=4 and becomes almost infinite when R=7; when R=20, one has a very perverse situation (which has been noted for a long time in univariate calibration, e.g. Fieller 1954 and many others).
Having established that I’ve replicated this particular Brown and Sundberg 1987 procedure, what is it operationally? This calculation originates with the profile likelihood equation set out as 1987 equation 2.14.
After some manipulations, Brown and Sundberg report that this can be re-written as proportional to a ratio of quadratic forms (and thus yielding the same maximum likelihood estimator) – in the one-dimensional case, (R + (μ − μ̂)²)/(1 + gμ²) up to constants. The element m in this equation is a function of the degrees of freedom: m = (n − p − q)(n + 1)/n. The above form (1987 equation 3.6), with simplifications available for the case of one-dimensional X, is implemented in R as:
f=function(mu,R) (R+ (mu-muhat)^2)/(1+g*mu^2) ##see ff eq 3.10
#muhat and g are pre-set
Because the maximum likelihood functions have a well-defined shape, the location of the minimum is easily obtained numerically in R using the flexible and efficient optimize function (Brown and Sundberg also show an analytic calculation.) Here I show the implementation within my wrapper function for R=4.
# 1.531 3.469
The maximum likelihood estimate is the first item (called “minimum” in R – this is the x-location of the minimum, while “objective” is the y-value of the minimum). The second argument in the function is the interval over which the optimization is specified. Here I know what’s needed to cover the answers and have specified it. In other cases, I’ve experimented with 3 times the range of the x-values in the calibration period and that seems to work. Note that the maximum likelihood value has already changed quite a bit from the GLS (Mannian) value of 1 and this changes quite a bit as Rb gets higher (the proxies get more inconsistent).
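The call producing the output above is presumably something like the following – muhat, g and the search interval here are toy stand-ins for the pre-set quantities:

```r
# Toy illustration: locating the profile minimum with optimize().
muhat <- 1; g <- 0.04; R <- 4                               # stand-in values, not Brown's
f <- function(mu, R) (R + (mu - muhat)^2) / (1 + g * mu^2)  # eq 3.6 form, as above
fit <- optimize(f, interval = c(-10, 10), R = R)
fit$minimum    # x-location of the minimum: the maximum likelihood estimate
fit$objective  # y-value of the function at the minimum
```

Because g > 0, the minimum sits away from muhat itself – the same pull away from the GLS estimate noted in the discussion of Figure 1.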
The next operation in the calculation of likelihood-based confidence intervals is the calculation of the level of the dashed line in the above figures, which represents the 95% coverage level (and thus a likelihood-based confidence interval). This is defined in 1987 equation 4.3 as c* = (m + f(ξ̃)) · exp(kgamma/(n+1)) − m, with ξ̃ (Brown’s double-hat) being the maximum likelihood estimator.
In my R script, this has been implemented in log form through the function shown below (where kgamma is the 95th percentile chi-squared value, defined by kgamma=qchisq(.95, df=p)):
cstar=function(R) (m+f(muhat,R)) * exp(kgamma/(n+1)) -m
The confidence intervals could easily be calculated numerically as the intersect of this vertical level with the likelihood function. Brown and Sundberg 1987 derive an analytic formula (1987 equation 4.4) based on solutions to a quadratic equation. This formula is helpful because it shows how the confidence interval can become imaginary; it is the approach used in my R-function cstar.brown.
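Without reproducing equation 4.4 itself, the quadratic structure is easy to see: the endpoints solve f(μ) = c*, i.e. (1 − c*g)μ² − 2μ̂μ + (R + μ̂² − c*) = 0, and a negative discriminant is exactly the “imaginary interval” case. A toy numerical version (hypothetical names, stand-in constants):

```r
# Toy endpoints of f(mu) = cstar, with f as in eq 3.6; NULL when "imaginary".
muhat <- 1; g <- 0.04                    # stand-in constants, not Brown's
ci.endpoints <- function(R, cstar) {
  a <- 1 - cstar * g                     # quadratic coefficients from f(mu) = cstar
  b <- -2 * muhat
  cc <- R + muhat^2 - cstar
  disc <- b^2 - 4 * a * cc
  if (disc < 0) return(NULL)             # discriminant < 0: no real interval
  sort((-b + c(-1, 1) * sqrt(disc)) / (2 * a))
}
```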
Brown and Sundberg Figure 2
There is one other benchmarking calculation: Brown and Sundberg Fig 2, shown below together with my emulation, showing the profile likelihood for the three cases for which the Inconsistency R was calculated above. Here is the original figure followed by my emulation:
Visually they appear identical. Brown and Sundberg give information on the confidence intervals shown by the dashed lines and my calculation matches these exactly, confirming that this implementation is also accurate. The curves plotted above are the profile likelihoods for the three different proxy combinations (leading in each case to the maximum likelihood estimator). As indicated in the original legend, this is based on 1987 equation 2.14.
This was a bit fiddly to figure out, since Brown and Sundberg elsewhere say that, in the case of just one reading of the proxies (the case in examples that interest us), the S matrix in the denominator is the calibration S matrix without pooling, although the language here certainly indicates pooling. For calculation purposes, the one is as easy as the other. I got the results to match in the case where l= 1 without pooling so I presume that this is the proper interpretation. (see the function profile.lik in the scripts for this step.)
I then determined the maximum numerically by the optimize function. The dashed line showing the 95% confidence interval is set at kgamma/2 below the maximum (kgamma defined from the chi-squared distribution above.) I then solved for the two end-points of the confidence interval calculation numerically by equating the vertical level to the likelihood function. The benchmarking is exact.
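The numerical solve for the endpoints can be sketched with uniroot, bracketing on either side of the maximum (the profile here is a toy stand-in, not the Paint profile):

```r
# Toy: endpoints where a log profile likelihood drops kgamma/2 below its maximum.
prof <- function(mu) -(mu - 2)^2          # stand-in log profile, maximized at mu = 2
kgamma <- qchisq(0.95, df = 1)            # 95th percentile chi-squared cutoff
cutoff <- prof(2) - kgamma / 2
lower <- uniroot(function(mu) prof(mu) - cutoff, c(-10, 2))$root
upper <- uniroot(function(mu) prof(mu) - cutoff, c(2, 10))$root
c(lower, upper)                           # likelihood-based confidence interval
```

Bracketing each root between the maximum and a far endpoint guarantees the sign change that uniroot requires.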
I’ve posted up a note here reconciling three different forms of expressing profile likelihoods in Brown and Sundberg 1987.
At a later date, I’ll show what happens when you start applying these methods to proxy studies. It’s pretty interesting. You can see the direction that this is going to go: Inconsistency R statistics can be calculated on a year-by-year basis for each of the proxy networks, regardless of whether they are CPS, Mannian or correlation-weighted. This approach also leads to “honest” confidence intervals (to borrow a phrase from an econometrics article) as opposed to the irresponsible use of calibration residuals in Mannian/Ammannian calculations, something that I think may lead to a far more definitive treatment of these networks. Finally, I think that there is a way of getting at the RE bilge, because the methods can be used for verification sets where l exceeds 1. So I think that these simple tools may be pretty useful.
Brown, P. J. 1982. Multivariate calibration. Journal of the Royal Statistical Society, Series B 44: 287-321.
Brown, P. J., and R. Sundberg. 1987. Confidence and conflict in multivariate calibration. Journal of the Royal Statistical Society, Series B 49: 46-57.
—. 1989. Prediction diagnostics and updating in multivariate calibration. Biometrika 76, no. 2: 349-361.