A Surprising Result

Yesterday, I observed some seemingly bizarre patterns in Mannian splicing, which seemed at first to be an inexplicable stupid pet trick. However, in this case, what seems to be a stupid pet trick appears to implement an intentional Mannian procedure referred to in passing in the SI, but not shown in the online source code. The net result is really quite astonishing: it appears that none of the 422 proxies that started after AD1700 contribute to any of the reported CPS reconstructions (EIV are a different can of worms). Here’s how I arrived at this startling conclusion.

More Stupid Pet Tricks

Yesterday I left my exposition of Mannian CPS just before trying to do a NH composite. There was another stupid pet trick left, that made replication of this step a little trickier than one would think.

UC had sent me a file entitled NHmean.txt (which I've placed online), which was produced during gridboxcps.m and proved to be the 9 unspliced steps of the NH reconstruction using iHAD as a "target".

In this case, I was able to locate information at Mann's website that could be reconciled exactly to this output (after performing a little trick). The third column in the dataset at http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/reconstructions/cps/NH_had.csv (the column entitled "cps.iHAD") started in AD1500 and was a splice of three columns of our NHmean.txt – the first 100 years was from the AD1500 step, the next 100 years from the AD1600 step and the remainder from the AD1700 step.

The AD1800 step was left out. How? It doesn’t seem possible from the source code. But who knows? It is what it is. Another stupid pet trick.

The second reconstruction in this spreadsheet (headed "cps.HAD") has a correlation of exactly 1 to the cps.iHAD reconstruction. It starts from the same composite (which I'll discuss below) and is rescaled a little differently – to match the HAD instrumental series (instead of the iHAD series). Its 1850-1995 standard deviation and mean are a little different, but otherwise the series is identical. One version supposedly reconstructs land temperatures, while the other reconstructs land+ocean temperatures: the size of the glove changes depending on the person wearing it.

Related (spliced) reconstructions using CRU (instead of hadCRU) as a target are at http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/reconstructions/cps/NH_land.csv . The column headed “cps.iCRU” has a correlation of exactly 1 to the two HAD series, but is scaled a little differently. It starts in AD600 instead of AD1500, but like the two versions already discussed, omits the AD1800 step.

The column headed "cps.CRU" starts in AD200 and, unlike the other three series, had a high correlation (more than 0.9) to the other series but differed for inexplicable reasons – something that shouldn't happen given the source code.

Inconsistent programming

Another weird and stupid pet trick lay here. The post-AD1700 portion had an exact correlation of 1 to the other versions – so the first leg in the splice was the AD1700 portion. For an inexplicable reason, Mann skipped the next two steps in this version – the AD1400-AD1700 portion had an exact correlation to the reconstruction step beginning in AD1400 – so the AD1400 network was used for 300 years.

At this time, I only have Matlab output going back to AD1000, and the portion before AD1400 did not have an exact correlation to any of the steps starting in AD1000 or later. The highest correlation was to the AD1000 step – it looks like this recon uses a pre-AD1000 step (perhaps AD800?) for the period up to AD1400. But only in the reconstruction with "CRU" target.

Weird. It’s hard to figure out how this would happen given the source code. I wonder if the climate models work like this.

Making a Composite

After this diversion through new pet tricks, back to making the NH composite. Because UC sent me a Matlab version (that proved to have an iHAD target), I used this instrumental series (at http://www.meteo.psu.edu/~mann/supplements/MultiproxyMeans07/data/instrument/NH_iHAD_reform) as a target in gridboxcps.m.

I made a weighted average of the 15 gridcell series for the AD1000 network discussed yesterday, weighting them by the cos latitude of the gridcell. This weighted average had a correlation of 0.9999968 to the column in the Matlab dataset NHmean that started in AD1000. So the composite matched perfectly up to scaling. Re-scaling to match the mean and sd of Mannian smoothed iHAD, the two matched up to a slight rounding discrepancy of about 0.006, as shown below. So we've pretty much managed to replicate all the little pet tricks so far.
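For concreteness, here is a minimal sketch of the cos-latitude weighting and re-scaling just described (illustrative only, not the actual script); grids15, lat15 and target are made-up stand-ins for the 15 re-gridded series, their gridcell-center latitudes and the Mannian-smoothed iHAD target.

set.seed(1)
lat15 = rep(c(37.5,52.5,67.5), each=5)                # stand-in gridcell-center latitudes
grids15 = matrix(rnorm(997*15), ncol=15)              # stand-in for the AD1000-1996 series
target = rnorm(997)                                   # stand-in for smoothed iHAD
w = cos(lat15*pi/180)                                 # cosine-latitude weights
composite = apply(grids15, 1, weighted.mean, w=w, na.rm=TRUE)
cal = (1850:1995) - 1000 + 1                          # calibration rows (1850-1995)
recon = (composite - mean(composite[cal]))/sd(composite[cal]) * sd(target[cal]) + mean(target[cal])  # re-scale to target mean and sd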

Next we’ll try to figure out why some reconstructions start in AD1500 and some in AD600.

Mannian CPS: Stupid Pet Tricks

David Letterman sometimes has a segment entitled “Stupid Pet Tricks”, which is an apt title for today’s post – more parsing of Mannian CPS, recently discussed here. With helpful contributions from Jean S, UC and Roman M, I can now pretty much replicate Mannian CPS, but only through a variety of devices that fall into the category of stupid pet tricks. I’m not sure which is stupider – the original pet trick or spending the time trying to figure out what’s going on. Probably the latter.

I realize that this isn’t the most interesting material in the world, especially compared to talking about U.S. politics, but this takes a lot of time to do and I find it useful to document these results while they’re fresh.

Smoothing

The first stupid pet trick is Mannian smoothing. With particular thanks here to Roman M, we now have an exact replication of Mann’s smoothing method in R. Mannian smoothing has been described in two different articles in GRL (see references below); the algorithm as used in Mann et al 2008 differs somewhat from either.

I suspect that we are the first people – including the reviewers of Mann’s two GRL articles – who have actually figured out what Mann’s smoothing method actually does. The real issue is whether the world needs Mannian smoothing – there are standard smoothing algorithms developed by statisticians (e.g. lowess).

And why is GRL publishing articles on smoothing methodology? In my opinion, if Mann (or I or anyone else) has something to contribute to the smoothing literature, it should be submitted to a statistical journal where it can be properly reviewed. Review by a couple of climate scientists at GRL is worse than pointless, as it gives faux authenticity to the method.

The tricky thing about Mannian smoothing is, as noted before, the use of butterworth filters. These may be meritorious in sound engineering where you are, say, trying to recover music which is governed by frequency. But these methods seem less apt when you are dealing with red noisy time series, not least because they have strong mean-reverting properties at their endpoints.

Because Mann doesn’t even bother centering the series, this leads to very odd endpoint behavior, which he tries to patch with endpoint padding of ever increasing length. The most recent aspect of this method winkled out by Roman M. is that Mann’s mean padding is the mean of the entire series, rather than the mean of the last M (say M=10) values, the method used in IPCC padding.
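Here's an illustrative sketch of that style of padding and filtering in R (this is not Mann's lowpass routine or Roman M's filtfiltx – and note that R's signal::filtfilt differs in detail from Matlab's filtfilt, a point discussed elsewhere in these posts); the filter order, cutoff and padding length are placeholders.

library(signal)                                       # butter() and filtfilt()
set.seed(2)
x = cumsum(rnorm(500))                                # red-noise stand-in for a proxy series
npad = 100                                            # padding length (placeholder)
xpad = c(rep(mean(x), npad), x, rep(mean(x), npad))   # pad both ends with the whole-series mean
bf = butter(4, 0.1, type="low")                       # lowpass butterworth; order and cutoff are placeholders
xs = filtfilt(bf, xpad)[(npad+1):(npad+length(x))]    # filter, then drop the padding again
# An IPCC-style variant would pad with mean(tail(x,10)) and mean(head(x,10)) instead of mean(x).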

Here’s a plot comparing our emulation of Mannian smoothing for Dongge to a result extracted by UC using Mann’s source code. The validity is further confirmed by the ability to replicate other results, as shown below.

Counting more than once

The next stupid pet trick relates to double-counting and even quadruple-counting of certain proxies, a point already identified (the issue was first noted by Jean S, who observed the identity of a couple of Mannian gridcells.)

If the Mannian lat-longs for a proxy (which may well be in error) are on the border of a cell e.g. exactly 50E as with the (incorrect) Socotra location, Mann places the proxy in both adjacent cells. A proxy located precisely at a 4-corner is allocated to all 4 gridcells.

It took a while to figure out this stupid pet trick, even with the code. This got most gridcells right but there were further traps for the unwary.

Wrong locations

The next stupid pet trick wasn’t easy to find. I eventually tracked the difference between the allocations that I was calculating and Mann’s results to gridcells where the latitude was at the north of a gridcell.

UC had recovered a Mann file showing the latitudes for gridcell centers that Mann had calculated – these ran from -88.5 to +86.5 (S to N), rather than the correct -87.5 to 87.5 (S to N). So a proxy located in the top 1 degree of a gridcell got assigned to the wrong gridcell.

This goof can be 100% confirmed in one of the Matlab runs that UC sent to me. Re-doing with (Mannian incorrect) gridcell centers, I was able to match the AD1000 selections exactly.
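A toy illustration of the effect, assuming for simplicity that proxies are assigned to the latitude band with the nearest center (an assumption on my part, not a transcription of the Matlab code):

correct.centers = seq(-87.5, 87.5, by=5)              # standard 5x5 latitude-band centers
mannian.centers = seq(-88.5, 86.5, by=5)              # centers as recovered from the Mann file
lat = 69.5                                            # a proxy in the top 1 degree of the 65-70N cell
correct.centers[which.min(abs(correct.centers-lat))]  # 67.5 - the right band
mannian.centers[which.min(abs(mannian.centers-lat))]  # 71.5 - a different band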

Smoothing over and over

The next stupid pet trick is that the proxy data are smoothed over and over again in all 19 steps. On each occasion (e.g. AD1000 which we’ve been studying), the proxy data are truncated to AD1000 and then padded at their beginning even if there are actual values.

Mannian smoothing is not especially efficient in the first place, so, aside from using padded values when known values are available, the procedure adds to the overheads. It would be perfectly feasible to smooth the master proxy data set once and select from this.

In any event, for a given step (e.g. 1000 here), Mann truncates the data to start at AD1000 and selects the subset of series with “passing” correlations. The smoothed series are all given short-segment standardization on 1850-1995 – note that this period is a little different than the 1855-1995 used in correlations – and then averaged. The average is then re-scaled to the mean and standard deviation of the instrumental gridcell over 1850-1995.

Here is the comparison of the emulation to the series (that UC extracted from Mann’s Matlab code) for the gridcell containing the Dongge O18 series. A perfect emulation.

You’ll notice that, unlike the Socotra O18 speleothem series, the Dongge O18 speleothem series has been flipped over, a point that I’ll return to in a separate post. (Gavin Schmidt excoriated Loehle for using Mangini’s O18 speleothem series where the O18 series had been flipped over, but, for some reason, failed to criticize Mann for flipping over the Dongge O18 series.)

This example only had one contributing proxy, but the emulation in a gridcell with multiple proxies is also accurate, as shown below for the gridcell containing the 4 Tiljander proxies (which Mann inverted from the usage of the original author, who warned against non-climate anthropogenic effects in the latter part of the series.)

After making these gridded series, Mann then re-grids all series north of 30N into 15×15 gridcells, weighting the constituent gridcells by their cos latitude. This yields 15 re-gridded cells with contributions.

I’ve compared all 15 to versions extracted by UC and the emulations are mostly exact, with the “worst” emulation being the one below combining the Tiljander and Tornetrask gridcells, affected only by very slight rounding.

Fat files

While I'm thinking about it, another stupid pet trick is the huge Matlab files that Mann created in the CPS program. The gridded series for the 9 steps from AD200 to AD1000 were over 250 MB in size. In the AD1000 step, 2567 out of 2592 columns were empty. In addition, all values prior to AD1000 were empty (as used).

Adding to the absurdity, Mann created a huge 3-dimensional matrix of these series. I guess they simply buy more and more powerful work stations to perform stupid pet tricks.

Keeping only the portion of each step that's used, naming each column to keep track of it, and keeping the information from each step in a list rather than a 3-D matrix, the same information took about 1 MB in R. So the handling of the information became effortless rather than a strain. This absurd data handling was repeated time after time.

I’ll finish off this post through to the NH reconstruction in a little while.

Smoothing References:

Mann, M. E. 2004. On smoothing potentially non-stationary climate time series. Geophys. Res. Lett. 31: 710–713.

Mann, M. E. 2008. Smoothing of climate time series revisited. Geophys. Res. Lett. 35. http://holocene.meteo.psu.edu/shared/articles/MannGRL08.pdf

Soon, W. H. S., D. R. Legates, and S. L. Baliunas. 2004. Estimation and representation of long-term (>40 year) trends of Northern-Hemisphere-gridded surface temperature: A note of caution. Geophys. Res. Lett. 31: L03209. http://www.cfa.harvard.edu/~wsoon/myownPapers-d/SLB-GRL04-NHtempTrend.pdf


The U.S. Election

For several months, the words Obama and McCain have not been allowed in posts here. I'm declaring a one-day moratorium on this policy. The ban was not because I'm uninterested in the U.S. election – quite the opposite. I thought that this particular election was an important one. And, quite aside from the importance of the election, I enjoy the "pennant race" aspect and have followed the election avidly.

I don’t often talk about my political views – though I’ve sometimes taken pains to point out that I do not share the political views of many readers. In American terms, Canada would be a blue state along the lines of Massachusetts; Toronto would be a liberal city in a blue state; and I live downtown in one of the most liberal constituencies in the city. None of this is unrelated to my political views. I realize that many Climate Audit readers have opposite political views, but we try to get along.

I thought that McCain’s concession speech last night was remarkably and commendably gracious and that it was the high point of McCain’s campaign, in which he showed the positive aspects of his character in a way that was little evident in the campaign.

I thought that Obama’s acceptance speech was equally gracious. Obviously any politician has to have an ego and I presume that Obama is no different; but he at least expresses himself with a commendable humility, which I, for one, am prepared to take at face value.

It was pretty amazing that McCain, at the age of 72, could have campaigned so tirelessly for so long. McCain’s body language last night showed to me that he was tired and, in a way, probably ready to get off the horse. Whereas for Obama, there will be no days off and it looked like he was ready to work just as hard today as he did yesterday. Aside from anything else, I think that the U.S. is better off with a leader who’s at the peak of his powers and energy.

If one is looking for good in the economic crisis (and this is hard for me to say as I am entirely reliant on my diminished 401K equivalent), it is that crises focus people’s minds on problems and issues. Obama is in a position to call on the best and most competent Americans for his Cabinet and, in the present circumstances, few would refuse. I think that we’ll see people of the stature of Warren Buffett and Colin Powell. And, if they don’t serve directly, Obama will listen to their advice on the younger person who is selected. I would be shocked if his cabinet is doctrinaire or mediocre.

I think that Obama’s election is also very healthy for the U.S. in world terms. The U.S. stands for both good and bad in world terms. While U.S. economic dominance has faded, it is still the leading world nation and leadership from the U.S. is important. Obama is in a position to provide such leadership in a way that would have been impossible for McCain.

In this context, the climate wars seem like small beer. Clearly an Obama win will be welcomed by Gore-ites, though my sense is that Obama will steer his own course. I also think that there are several other imperatives driving economic policy in the direction of reducing U.S. economic dependence on oil and especially imported oil. Will there be an end to the generation-long embargo on nuclear power construction in the United States in an Obama administration? Surely that should be one of the first topics on the agenda. For climate change activists, it is one of the few ways to seriously dent CO2 emissions; for people worried about foreign oil, it's one of the few ways to seriously reduce dependence on foreign energy. If something better comes along, so be it, but right now, in Ontario, we're dependent on nuclear stations designed and built in the 1970s and my guess is that people in the 2040s and 2050s will appreciate any nuclear power plants built in the next 10 years.

Margaret Wente, a Toronto columnist with American roots (who's written kindly about me and whom I often agree with), wrote yesterday before the election: "Americans are about to get their country back" – at least in the sense that U.S. politics of the next 4 to 8 years are going to be less acrimonious than the politics of the last 16 years and that its leadership will be accorded the respect that it deserves, both in the world and in the country. The graciousness of last night's speeches by both McCain and Obama was a big step in this direction.

Mannian CPS

As a technique, CPS is little more than averaging and shouldn't take more than a few lines of code. You standardize the network – one line; take an average – one more line; and then re-scale to match the mean and standard deviation of your target – one more line. (I'm not commenting here on the statistical merits of CPS – merely on what implementing it in a modern programming language involves.) However, Mannian CPS takes pages of code, and, needless to say, everywhere there are snares for the unwary. I guess Mann learned his programming from Hansen and passed the style along to his students.
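To make the "few lines" point concrete, here's a minimal sketch of generic CPS (not Mann's code); proxies, target and the calibration index cal are made-up stand-ins.

set.seed(3)
proxies = matrix(rnorm(200*10), ncol=10); target = rnorm(200); cal = 151:200          # stand-ins
Z = scale(proxies, center=colMeans(proxies[cal,]), scale=apply(proxies[cal,],2,sd))   # standardize on the calibration period
comp = rowMeans(Z, na.rm=TRUE)                                                        # take an average
recon = (comp - mean(comp[cal]))/sd(comp[cal]) * sd(target[cal]) + mean(target[cal])  # re-scale to the target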

We had previously spent some time starting down this analysis before we were so rudely interrupted by Santer et al 2008. So let’s pick up where we left off.

We had just finished smoothing the proxies. This had its own set of complications as Mann used a butterworth filter on uncentered data. This led to peculiar end point problems requiring long end point padding to control the effect of the butterworth filter. Emulating this in R was complicated by the odd problem that R filtfilt operated differently than Matlab filtfilt. Roman M identified and resolved this by writing a pretty function filtfiltx, which emulated Matlab filtfilt, enabling an emulation of Mann’s smoothing function in R.

Aside from the fact that I’m used to R, going through the exercise of translating Matlab into R is helpful in walking through each step, always bearing in mind UC’s “Implicit Assumption Advisory.”

On to gridboxcps.m.

After smoothing the proxies using Mannian smoothing, they smooth the "instrumental" data – now the "instrumental" data appears to be related to CRU data, but has been "infilled" back to 1850 in many gridcells that don't actually have 19th century data. Mannian correlations are often between infilled proxies and infilled data. The instrumental data is smoothed using the same smoothing function as the proxies. The comments in the program say that the data is smoothed with a "lowpass filter with a cutoff frequency of 0.05 — 20 years", but the parameter cutfreq is set at 0.1 (10 years). It's a bit of a guess as to which was actually used for the archived runs, so I worked this through using cutfreq=0.1, presuming that the comment is incorrect, as opposed to the wrong version being archived (but either is possible).

Averaging into Gridcells
The next step is to “average all proxies … into 5×5 grid boxes”. This is something that should be pretty routine, but hey… If I were doing the exercise from scratch, I would do a one-time assignment of a gridbox number (I use the “Jones” sequence N to S, Dateline to Dateline) in my proxy “info” database and use this information whenever required. This is easy enough to do.
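For reference, here is one plausible sketch of such a jones-style function for 5×5 cells, numbered N to S and Dateline to Dateline. The convention shown is illustrative rather than a transcription of my actual script, but it is consistent with the gridcell numbers that appear further down (e.g. the adjacent Socotra pair 1126/1127).

jones = function(lat, long) {
  row = floor((90-lat)/5) + 1              # 1 for the 85-90N band, 36 for 85-90S
  col = floor((long+180)/5) + 1            # 1 for the band just east of the Dateline
  (row-1)*72 + col                         # 72 longitude cells per latitude band
}
jones(12.5, 50 + c(-1,1)*1e-4)             # the (incorrect) Socotra location straddles cells 1126 and 1127

Because this form is vectorized, it also works inside the outerjones wrapper shown below.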

Mann does something a little different. Instead of assigning each proxy to a gridcell, he makes an empty dummy array of 1732 gridcells and looks for proxies contained within each gridcell, a procedure repeated over and over, a piece of pointless inefficiency in rococo GISTEMP style. However, in addition to the inefficiency, each search looks for proxies using "less than or equal to" comparisons on the gridcell boundaries. In the AD800 network that I've been looking at, Mann has (incorrectly) used coordinates of 12.5N 50E for the Socotra Island speleothem. On its face, the logic of the program includes this proxy in two gridcells: 12.5N 47.5E and 12.5N 52.5E. This problem was noted a couple of weeks ago by Jean S and UC, who observed that the regts series in two gridcells were identical in an early network (Socotra Island).

Given that a relatively small number of proxies are exactly on gridcell boundaries, my procedure would be to re-examine the source documents for a more precise location to take the lat-long off dead center. In the case of Socotra Island, the matter is easily resolved as the original reference places the site at 12.5N 54E, removing the problem in this instance.

Because the Mann algorithm permits proxies to be located in more than one gridcell, the emulation of his method (using a concise style) is complicated somewhat. Instead of having one gridbox associated with each proxy, you need a list of 1-4 gridcells associated with each proxy. I did this in what I thought was a fairly pretty way. I have a function jones that locates the Jones gridcell for a given lat-long. I created a new function outerjones that yielded 1-4 gridcells in the Mann style and made a list for all 1209 proxies for repetitive use. The outerjones function perturbed each lat-long location plus-minus 0.0001 deg and found the unique gridcells (4 on the exact corner). This eliminates one repetitive step.

outerjones=function(x) unique( c(outer (x[1]+c(-.0001,.0001),x[2]+c(-.0001,.0001),jones) ))
outerlist=apply(details[,c("lat","long")],1,outerjones)

To simplify the programming, for a given step, I then made a subset of the proxies that (1) started in time; (2) passed the Mannian correlation step; and (3) were truncated to start at the beginning of the Mannian step. For the AD800 step, this network goes from 800-1996 and has 18 series.

tempi=!is.na(proxys[period_MBH[k]-tsp(proxys)[1]+1,])&passing;sum(tempi)   # proxies with a value at the step start that also pass screening
temp=(time(proxy)>=period_MBH[k])          # years from the start of the step onward
working=proxyscaled[temp,tempi];dim(working) #1197 18
dimnames(working)[[2]]                     # names of the selected proxies

We then determine the number of gridcells attached to these proxies using the list that we created once (rather than cycling through 1732 gridcells over and over.) In this case, there are 12 such gridcells (including two cells for Socotra Island.) The list of gridcells can be obtained in one line as follows:

idsort= sort(unique(unlist(outerlist[tempi]))) ;K=length(idsort);K #12
# 94 245 328 402 661 780 922 955 1126 1127 1462 1626 1938

Next we set up an array, one column for each of the above gridcells. We name each column by its Jones gridcell number (which is not the same as the Mann gridcell number – Mann goes from S to N). I use Jones numbers because I'm used to them.

grids=array(NA,dim=c(nrow(working),K) )
dimnames(grids)[[2]]=idsort

We then average the standardized proxies within each gridcell, again emulating the Mannian calculation in a more intelligible style. One of the few annoying features of R is that some important functions (apply) apply only to matrices; the average function shown here will average matrices but won’t fail with a vector.

average= function(X) if( is.null(ncol(X) )) X else apply(X,1,mean,na.rm=T)

For each column in the gridcell, the attached proxies are picked out from our standard list and averaged (they have already been standardized in Mann style). Using techniques like lists and lapply enables one to accomplish pages of programming in one line and, more importantly, to avoid the endless subscripting that characterizes Mann-Hansen programming.

for (i in 1:K) {
tempr= !is.na(unlist(lapply(outerlist[tempi],function (A) match(idsort[i],A)) ) ) # this picks out the relevant series in gridcell
grids[,i]=average( working[,tempr])
}
grids=ts(grids,start=period_MBH[k])

But aside from style, one feels that clearer programming would have identified the apparent defect by which one proxy is assigned to multiple gridcells.

Calibrate Proxies Against Instrumental Data
In a typical CPS calculation, the proxies are assumed, rightly or wrongly, to be temperature proxies and calibration is done after they are averaged. Mannian CPS calibrates on more than one occasion and again it appears that programming issues lead to some unexpected results.

This step is done in the gridboxcps.m program after the comment % calibrate proxies against instrument data. The calibration is a usual Team method: re-scale and re-center the proxy to the mean and standard deviation of the “target” instrumental data. As the program is written, this requires that target instrumental data exist for the gridbox (or gridboxes) to which the proxy is attached. What happens if there is no target instrumental data in the required box? As written, the program omits the affected proxy. In the AD800 network, two series are affected: Agassiz melt and Central Greenland GISP. As far as I can tell, they do not enter into subsequent CPS calculations. So at the end of these two operations, Mannian CPS has double-counted Socotra Island and removed Agassiz and Central Greenland, leaving us with 11 gridcells. [see below as this doesn’t seem to be right for the AD1000 network examined by UC. UC’s compilation below doesn’t contain Tornetrask and Tiljander, but does contain GISP and Agassiz – I’ll check with him on this, as this seems odd.]

Update – I've now compared UC's graphic here against my emulation. I can identify all 10 series in his graphic, which confirms the double-counting of the Socotra series, but UC's chart here includes the Agassiz and GISP series (and excludes a number of series that my emulation would have included) – so I'm missing something here. The caption below this figure identifies the series.

Left column, top to bottom: 1- Socotra Island speleothem (C13-O18 combination); 2- Punta Laguna sediment combination; 3- Graybill foxtail (Campito Mt ca533); 4- Scotland speleothem (Baker); 5- Graybill bristlecone (nv512). Right column, top to bottom: 1- duplicate Socotra speleothem; 2- inverted Dongge cave speleothem; 3- Shihua cave speleothem; 4- GISP O18; 5- Agassiz melt.

Here’s how I emulated this calculation (the first line coordinates between Jones-sequence gridcell numbering and Mann-sequence gridcell numbering):

test=match(idsort,info$jones);temp=!is.na(test);sum(temp);info[temp,]
mean.obs= info$mean[test];sd.obs=info$sd[test]
mean.proxy=apply(grids[(1850:1995)-tsp(grids)[[1]]+1,],2,mean)
sd.proxy=apply(grids[(1850:1995)-tsp(grids)[[1]]+1,],2,sd)

regts=array(NA,dim=c(nrow(grids),ncol(grids)))
dimnames(regts)[[2]]=dimnames(grids)[[2]]
for( i in (1:ncol(grids))[!is.na(mean.obs)]) {
regts[,i]=(grids[,i]-mean.proxy[i])*sd.obs[i]/sd.proxy[i]+mean.obs[i] }
regts=ts(regts[,!is.na(mean.obs)],start=tsp(grids)[1]);dim(regts)

Re-Gridding
The Mann SI states:

In regions within North America and Eurasia with especially dense spatial coverage, grid boxes were first averaged over 15° resolution boxes before inclusion in the spatial average to diminish potential bias due to variations in spatial sampling density (see Fig. 1B).

In the computer code, this is done as follows. All of the gridded series located south of 30N are retained as is. The gridded series north of 30N are re-averaged into 15×15 degree boxes. In the AD800 network, this results in the Tornetrask and (upside-down) Tiljander sediments being averaged. We now have 10 gridboxes, plots shown below (with names of contributing series attached):


As in other studies, for analysis of the period back to AD800, the number of contributing proxies is very small. Claiming the use of 1209 proxies is mostly rhetorical. Many of these proxies are already familiar to us.

Tornetrask, Graybill bristlecones, Quelccaya and Cook's Tasmania reconstruction are all staples of all the reconstructions. The Graybill bristlecone chronology is one of only a couple of strong HS-contributors. The other main HS contributor is the Tornetrask-Tiljander combination. We've already discussed the Tiljander series – Tiljander reported that the modern portion of this series has been disturbed; Mann inverted the interpretation of the authors and shows the series upside down from the interpretation of the authors. (Obviously the authors don't think that the data went hugely down in the 20th century; that's why they said the modern portion shouldn't be used.) But you can see that Mann's error in the Tiljander series is not simply incidental.

Most of the other series are speleothems, a class of proxy relatively undiscussed at CA. Shihua Cave was used in Moberg; the South African speleothem was used in Loehle, with this use being criticized by Gavin Schmidt. (Loehle used the grey-scale series, Mann used C13 and O18 – some of Schmidt’s disputes with Loehle would seem to apply to Mann as well.) Dongge Cave and Socotra Island are both speleothems as well. I plan to do specific posts on these proxies.

NH and SH Averages
Finally, Mann calculates NH and SH averages, re-scaling the averages to the target CRU (or HadCRU) hemispheric temperature. As far as I can tell from the code, each of these reconstructions to a different "target" is a linear transformation of the others, with only the mean and standard deviation of the recon varying somewhat. In each case, the reconstruction is a weighted average of the contributing proxies, with the Mannian weights varying a little due to things like re-gridding, intermediate rescaling etc.

Though Mann has provided a commendable amount of information on the reconstruction, I have so far been unable to locate anything other than spliced versions of output (as with MBH98). Below is a comparison of my emulation (spliced) to the spliced NH reconstruction (CRU target). It isn't quite the same – even when you're spotted the data and code, it's never easy repeating a Mann calculation – but it's pretty close: certainly sufficiently close to feel justified in doing sensitivity analyses, like the impact of using Tiljander data the right way up or Ababneh bristlecones, which I'll do another day.

Atlantic Hurricane Season 2008 Withers on the Vine

The North Atlantic hurricane season has nearly come to an end. As November progresses, the chance of another storm developing becomes smaller. Climatology (last 60 years) tells us that roughly 4 in 10 years see a November storm formation including 4 in 2005 (Beta, Gamma, Delta, Epsilon), Hurricane Michelle (2001), Hurricane Lenny (1999), and Hurricane Kate (1985). Jeff Masters from the Weather Underground has an image of previous early-November storm tracks especially clustered in the Western Caribbean.

So, what has the 2008 season wrought in the North Atlantic and how well did the seasonal prognosticators fare?

Even with the expected post-season tinkering of the real-time storm tracks by the folks at the National Hurricane Center, we can provide fairly accurate preliminary numbers. The community at Wikipedia constantly updates many interesting facts about the ongoing 2008 hurricane season.

Total Named storms (34 knots + one-minute maximum sustained winds): 15
Total Hurricanes (64 knots +): 7
Total Major Hurricanes (96 knots +): 4
Accumulated Cyclone Energy: 132

The respective forecasts made by CSU (Klotzbach and Gray), NOAA, as well as the UK Met Office came in quite close to the actual experienced storm activity. Before handing out trophies, please keep in mind that forecast “skill” is a function of many forecasts over longer time periods. Each of the forecasting outfits prefers to use different techniques and variables to calculate their storm numbers, so we will have to wait until each completes their post-season analysis to determine if they were “right for the right reason” or got lucky.

Now, to answer the question: how active was the 2008 hurricane season, we need to define climatology. This is where the tricksters can play pranks on the public. Where is the beginning point of the analysis? How well do we trust the frequency and the estimated intensities of each storm? What metric do we use – number of tropical storms, number of hurricanes, ACE (accumulated cyclone energy), Power Dissipation, or perhaps some complicated statistical measure? All of these questions are entangled in the debate surrounding whether anthropogenic climate change is indeed a modulating influence upon current and future Atlantic hurricane activity.

A well-accepted metric which convolves storm frequency, intensity, and duration is called accumulated cyclone energy (ACE) and is calculated very simply: take the maximum sustained winds reported by the NHC every 6 hours for all storms (> 34 knots), square this value, sum over the entire lifetime, then divide by 10,000. In 2007, even though there were also 15 storms, the ACE was only 72 compared to 132 for 2008 with the same number of named storms. This is partially because the storms in 2008 were much longer lived, especially Bertha.
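Here's a minimal sketch of that arithmetic, using a made-up 6-hourly wind record as a stand-in for a season's worth of best-track fixes.

ace = function(vmax) sum(vmax[vmax>=34]^2)/1e4        # only tropical-storm-strength fixes count
vmax = c(30, 35, 45, 60, 70, 65, 55, 40, 30)          # made-up 6-hourly maximum sustained winds (knots)
ace(vmax)                                             # this storm's contribution to the seasonal total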

Here are three different views of the Atlantic hurricane climatology depending upon what period you look at. The data is from the NHC Best Tracks without any corrections to the intensity data.

Links to two other time periods: 1978-2008 and 1944-2008.

Thus, since 1995, Atlantic hurricane activity measured by ACE is hugely variable, with feast (i.e. 2005) and famine (1997). 2008 ACE is nearly equivalent to 2006 and 2007 combined, but about half of what was experienced in the record 2005 season. The choice of 30 years is a particular favorite for many researchers in the tropical cyclone community (1978-2007). The second image clearly shows the nearly stepwise increase in ACE between 1994 and 1995. In this reference frame, 2008 ranks as one of the more active years of the past 30. Now, back up to 1944, when admittedly the intensity (and detection) data is somewhat less reliable. However, since the ACE metric is the convolution of an entire year's worth of storm lifecycle information, and is most sensitive to higher wind speeds, the track data points prior to satellite observation (~1970s) are probably sufficient for this exercise.

Final verdict: When encapsulated in the recent active period in North Atlantic activity (1995-2007), 2008 experienced normal or expected activity as measured by ACE. In terms of a long-term climatology, either the last 30 or 65 years, 2008 is clearly an above average year.

Note: for the Climate Audit seasonal forecasters, especially those that showed exemplary skill (however you wish to measure it), please fill us in on your methodology and perhaps provide guidance for 2009. Or, for those feeling shame about being “blown off track”, time to think of good excuses.

Also, a new Science perspective has been published by Vecchi et al. (2008) entitled Whither Hurricane Activity? More on that later…

More on the Gridded MXD Data

We've noted that Briffa's gridded MXD has high correlations to temperature, much higher than run-of-the-mill proxies. We've also noted that Mann (like Briffa) truncated this data at 1960 because of divergence. At the time of Mann et al 2008, the gridded MXD data was not available anywhere – Mann cited a webpage as follows:

The gridded maximum density dataset was developed differently by Osborn et al. (http://www.cru.uea.ac.uk/~timo/datapages/mxdtrw.htm) [as used by Rutherford et al. (2)] with the specific goal of capturing century to multicentury variability.

While Briffa has published many articles on his MXD data, the only article providing even a cursory description of the gridded version was Rutherford (Mann) et al 2005, which stated:

The version of the MXD dataset used here was compiled using a combination of grid-box estimates based on traditionally standardized MXD records (with limited low frequency information) and regional estimates developed to retain low-frequency information (OSB; the data in the MXD network are available online at http://fox.rwu.edu/~rutherfo/supplements/jclim2003a). The latter were developed using the age-band decomposition (ABD) method of standardization, wherein density data from trees of similar ages are averaged to create long chronologies with minimal effect of tree age and size (Briffa et al. 2001).

Unfortunately, the information that the data was available at http://fox.rwu.edu/~rutherfo/supplements/jclim2003a was untrue. Previous attempts to get this data had failed. In September, as noted previously, I sent an FOI request to CRU in England for any gridded data that they had sent to Mann. (Note that requesting in terms of documents is more effective). This led to a new tranche of data and file information at Osborn’s webpage, and, of course, the usual puzzles.

The creation of the gridded series (by coauthors Osborn, Briffa and Jones) is described in Rutherford et al 2005 as follows:

OSB therefore worked first with the traditionally standardized data at the individual chronology scale and gridded them to provide values in 115 5° by 5° grid boxes (26 available back to A.D. 1400) in the extratropical NH (Fig. 1b). They then developed temperature reconstructions by the local calibration of the MXD grid-box data against the corresponding instrumental grid-box temperatures. The “missing” low-frequency temperature variability was then identified as the difference between the 30-yr smoothed regional reconstructions of Briffa et al. (2001) and the corresponding 30-yr smoothed regional averages of the gridded reconstructions. OSB add this missing low-frequency variability to each grid box in a region. After roughly 1960, the trends in the MXD data deviate from those of the collocated instrumental grid-box SAT data for reasons that are not yet understood (Briffa et al. 1998b, 2003; Vaganov et al. 1999). To circumvent this complication, we use only the pre-1960 instrumental record for calibration/cross validation of this dataset in the CFR experiments.

You have to watch a bit carefully here – you expect "OSB" to be an acronym for a peer reviewed publication describing the gridding process, but OSB (which confusingly is Briffa, Osborn and Schweingruber GPC 2004, as noted by Pat Frank below) contains no description of the gridding method. I was wrongfooted a bit by this and tried to find additional descriptions elsewhere in the Briffa opus, before deciding that this was the description of the gridding operation.

To understand what happened in the gridding process, I picked a cell with only one site (72N, 132E) – there are many others – and plotted the underlying chronology of the one site, together with three versions of the gridded data: the Osborn version, the "Mann original" version and the Mann infilled version. The 4 plots are shown below and present a variety of puzzles.

The earliest value of the underlying site chronology (omoloya) is in 1496 (end-1991). The gridded version is almost identical in the period 1496-1991 (correlation of about 0.99), but it begins in 1400. Where did the first 96 years come from? Is there another tranche of unarchived data? Or was there some “infilling” of the early data? Rutherford et al 2005 is silent on this.

In any event, we know that Mann truncated post-1960 values and used “infilled” data for 1960-1995 instead of real data. However the effect in this particular series is a little unexpected: the infilled data doesn’t go up as much as one would have expected given the trouble that they’ve gone to replacing real data with imaginary data. It “diverges” from temperature as well. In fact, for this gridcell, the “original” original data has a higher correlation to temperature than the replacement data.

Update – Osborn’s webpage (and I think that this info was provided in Sept 2008 in response to my FOI request) mentions that the 341 sites used in Rutherford et al 2005 were selected as follows:

the MXD data from the 341 sites determined by Briffa et al. (2002a) to exhibit correlations with nearby gridded April-September temperature of at least +0.22 were used in this study. See Briffa et al. (2002a) and the data description above for further details

This is helpful, as Rutherford et al do not mention 341 sites, instead saying:

it [ABD] has been applied only at a regional scale for the MXD network used here, rather than at the level of the 387 original site chronologies.

If Nychka Standards Applied to Mann…

Santer et al 2008 (including realclimate’s Gavin Schmidt) sharply criticized Douglass et al for failing to properly consider the effect of autocorrelation of regression residuals on trend confidence intervals, which they described as a “methodological error”. The need to properly account for autocorrelation in confidence interval estimation is a fairly long-standing theme at CA and it’s nice to see Gavin Schmidt come down so firmly for this view.

Now that the Santer 17 have spoken on this matter, I thought that it would be interesting to re-visit the Mann et al 2008 proxies, applying the Nychka adjustment for regressions in Mann et al 2008.

Mann et al argued that their proxies rose above mere scavenging through a garbage bin of red noise series as follows:

Although 484 (~40%) pass the temperature screening process over the full (1850–1995) calibration interval, one would expect that no more than ~150 (13%) of the proxy series would pass the screening procedure described above by chance alone. This observation indicates that selection bias, although potentially problematic when employing screened predictors (see e.g. Schneider (5); note, though, that in their reply, Hegerl et al. (10) contest that this is actually an issue in the context of their own study), does not appear a significant problem in our case.

On an earlier occasion, we observed that Mann used a unique pick-two option for these correlations – but did not modify the benchmark. Today, I'll take a first quick look at how he handled autocorrelation, merely applying Santer-Nychka methods.

Here’s how Mann described their test:

We assumed n ~ 144 nominal degrees of freedom over the 1850– 1995 (146-year) interval for correlations between annually resolved records (although see discussion below about influence of temporal autocorrelation at the annual time scales), and n~ 13 degrees of freedom for decadal resolution records. The corresponding one-sided P ~ 0.10 significance thresholds are |r| ~ 0.11 and |r| ~ 0.34, respectively.

Owing to reduced degrees of freedom arising from modest temporal autocorrelation, the effective P value for annual screening is slightly higher (P ~ 0.128) than the nominal (P~ 0.10) value. For the decadally resolved proxies, the effect is negligible because the decadal time scale of the smoothing is long compared with the intrinsic autocorrelation time scales of the data.

In order to sustain this argument, Mann needs to demonstrate that the temporal autocorrelation is "modest". But is this actually true? "Modest", in the context of the reduction of degrees of freedom in the above calculation, is an AR1 coefficient of about 0.2 (I need to double check this). The following graphic shows a histogram of AR1 coefficients of the residuals from the regression relationship between the proxy and gridded temperature, showing the Luterbacher-gridded and Briffa MXD-gridded series separately. Far from the residuals having "modest" correlation, the majority had very high autocorrelations – certainly high enough to have a dramatic effect on the effective degrees of freedom. In some cases, an AR1 coefficient could not even be calculated – there was so much autocorrelation!
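For reference, here is a sketch of the standard AR1-based adjustment to the effective sample size (the exact formula used by Santer et al or Mann may differ in detail); the AR1 values are illustrative.

n = 146                                    # 1850-1995 calibration years
r1 = c(0.2, 0.5, 0.8)                      # "modest" AR1 versus the sort of values in the histogram
neff = n*(1-r1)/(1+r1)                     # effective sample size: about 97, 49 and 16
rcrit = function(ne) {tq = qt(0.90, ne-2); tq/sqrt(ne-2+tq^2)}   # one-sided p=0.10 correlation threshold
rcrit(n)                                   # ~0.11, the nominal threshold quoted above
sapply(neff, rcrit)                        # the threshold rises sharply as AR1 increases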

The effect of allowing for autocorrelation in the t-statistic (through accounting for the effective degrees of freedom) results in the following histograms of t-statistics, adjusted according to the procedures of Gavin Schmidt and his Santer coauthors. Out of 1033 non-Luterbacher non-Briffa gridded series, only 32 pass a positive t-test. (This is actually a little generous, as I've not parsed the decadal series in these graphics and there's further hair on how these are handled, which I'll get to some time.) On the other hand, the majority of Luterbacher and Briffa gridded MXD series are "significant". Given that the Luterbacher data uses instrumental data in its "calibration" period, the existence of high correlations to a different gridded instrumental version proves nothing about the quality of the proxy population and, in effect, spikes the drink. The gridded Briffa MXD data only became available a few weeks ago following an FOI request and, of course, this is the data that was truncated in 1960 because of the divergence problem. Whatever its ultimate disposition, it is also a subset with properties that cannot be generalized to run-of-the-mill proxies.

Of the non-gridded proxies, 32 (3.1%) pass a positive t-test of 2, the usual value of a two-sided 95% confidence interval (or one-sided 97.5%). And only a handful of these go before AD1400 (including old chestnuts like Briffa’s Tornetrask reconstruction, which has its own problems.)

These results are not very impressive if one avoids the “methodological error” of failing to allow for autocorrelation, as emphasized by Schmidt and Santer. Once autocorrelation is allowed for, Mann’s “proof” that he had not merely scavenged red noise series is no longer valid.

It’s interesting that Schmidt praised one study containing such errors and condemned another.

This Gets Even More Amusing

Can anyone on the Team actually hit a target?

A couple of days ago, I reported that Santer’s own method yielded failed t-tests on UAH when data up to 2008 (or even 2007) was used. I also reported that their SI (carried out in 2008) included a sensitivity test on their H1 hypothesis up to 2006, but they neglected to either carry out or report the corresponding test on the H2 hypothesis (the results reported in Table III).

The reason why the H2 test failed using up to date data is not that the trend changed materially from the 1979-99 trend, but that the AR1-style uncertainty decreased a lot. While I had calculated the changes in SE (for the observed trend) to make these calculations, I didn’t discuss these changes, but, for reference, the SE (observed trend) for 1979-99 (Santer Table 1, which I’ve replicated given their methods) was 0.138 deg C/decade. Apples and apples, using information up to 2007 (available to Santer and his Team as of the submission of their article), the SE(observed trend) had decreased to 0.067 deg C/decade.

The “correct” CI, according to the actual methods of Santer et al, is to take the Pythagorean sum of the SE (observed trend) (0.138 in their analysis) and the much decried SE of the inter-model mean (0.092/sqrt(19-1) = 0.0217), thus 0.14 deg C/decade.

Santer et al. state that Douglass et al had ignored uncertainty in the observations:

DCPS07 ignore the pronounced influence of interannual variability on the observed trend (see Figure 2A). They make the implicit (and incorrect) assumption that the externally forced component in the observations is perfectly known (i.e. the observed record consists only of φ_o(t), and η_o(t) = 0).

Most readers have taken this for granted. But let’s see what Douglass et al themselves say that they did:

Agreement means that an observed value’s stated uncertainty overlaps the 2σSE uncertainty of the models.

In examples in the running text of Douglass et al, one can see applications of this procedure. For example:

Given the trend uncertainty of 0.04 °C/decade quoted by Free et al. (2005) and the estimate of ±0.07 for HadAT2, the uncertainties do not overlap according to the 2σSE test and are thus in disagreement.

So it’s a bit premature to take at face value Santer’s allegation that Douglass et al “ignored” trend uncertainty in the observations. Douglass et al say that they considered trend uncertainty in the observations (I haven’t parsed this in their article yet.) For now, let’s compare the uncertainties that Douglass said that they used, as compared to uncertainties calculated according to the method Santer said should be used. Here’s what Douglass et al said about trend uncertainty for the UAH T2LT that we’ve been looking at:

For T2LT, Christy et al. (2007) give a tropical precision of ±0.07 °C/decade, based on internal data-processing choices and external comparison with six additional datasets, which all agreed with UAH to within ±0.04. Mears and Wentz (2005) estimate the tropical RSS T2LT error range as ±0.09.

Do you recognize any numbers here? Look at the trend uncertainty for UAH T2LT up to the end of 2007 – calculated according to Santer’s own methods – 0.067 deg C/decade. Compare to the Douglass number – ±0.07 °C/decade.

So if you use up-to-date information in calculating the trend uncertainty according to the Santer method, you get the same value (actually a titch less) than that Douglass et al say that they used (0.07). I suppose that Santer might then try to argue the Douglass et al estimate of 0.07 deg C is covering different uncertainties than their 0.07 deg C, but what’s the evidence? And since this is a critical point, shouldn’t Santer have proved the point with some sort of exegesis?

[Update – Oct 26] John Christy writes in with the following exegesis of the topic:

In our paper, we addressed “measurement” uncertainty – i.e. how precise the trend really is given the measurement problems with the systems (I introduced the separation of these two types of uncertainty in the IPCC TAR text in the upper air temperature section.) We wanted to know how accurate the observational trend estimates were.

The subtle point here is that “temporal” or “statistical” uncertainty is not needed because of the precondition that the surface trends of the observations and the model simulations be the same. Sure, you can create a surface trend of +0.13 C/decade from scrambling the El Ninos and volcanoes any way you like, but when the surface trend comes out to +0.13 it will be accompanied by (in the models) a specific tropospheric trend – no need for the “temporal” uncertainty. When you scramble the interannual variations and get a trend different from +0.13, then we don’t want to use it in our specific experimental design.

This notion of the precondition seems lost in the discussion, but when understood, the arguments of Santer17 become irrelevant.

On this information, I’m not quite sure what the ±0.07 °C/decade in Douglass et al would represent in Santer terms. At this point, it appears that the Santer accusation that Douglass et al had “ignored” this form of uncertainty is incorrect, but the way that they handled it is, as Christy says, sufficiently “subtle” that, for me to express an opinion on whether the Douglass method is adequate in the context of Santer’s data, I’d need to emulate exactly what was done on Santer’s data and see how it works.

At present, I’ve requested the 49 Santer runs as used, but have received no response from Santer other than that he is at a workshop. The Santer coauthor with whom I’ve cordially corresponded did not personally ever receive a copy of the data and indicates that Santer will probably take the position that I be obliged to run the irrelevant gauntlet of trying to reconstruct his dataset from first principles, even to do simple statistical tests – the sort of petty silliness that is incomprehensible to the public. [End update]

PS. There’s one other issue raised in Santer (citing Lanzante 2005). Lanzante 2005 observed that the denominator when combining different uncertainties is a Pythagorean sum (sqrt(ssq)) and that a t-test using this gives different results than a visual comparison of overlap of uncertainties. If one uncertainty is a lot larger than the other, the Pythagorean sum ends up being dominated by the larger uncertainty.

For example, in the Santer Table III example sqrt(.138^2+.021^2)= .1397, only 1% larger than the greater uncertainty by itself. A visual comparison in which the overlap of separate 2-sigma intervals is inspected implies a little more onerous t-test than using t=2 against the Pythagorean sum. Lanzante 2005 criticized these sorts of visual comparisons in IPCC TAR and many climate articles, but the practice continues.
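A couple of lines of arithmetic illustrate the point, using the Table III numbers above.

se1 = 0.138; se2 = 0.021                   # SE(observed trend) and SE(inter-model mean)
sqrt(se1^2 + se2^2)                        # 0.1397 - only ~1% larger than se1 alone
2*sqrt(se1^2 + se2^2)                      # difference needed to fail a t=2 test on the combined SE (~0.28)
2*se1 + 2*se2                              # difference needed for separate 2-sigma bars not to overlap (~0.32)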

While the point is valid enough against Douglass, it is equally so against dozens of other studies; however, in the matter at hand, the issue does not appear to be material in any event.

The Santer "S.D."

Lucia has written an interesting post – see here, continuing the effort to figure out the Santer brainteaser.

I can shed a little more light (I think) on what Santer’s “S.D” is in operational terms. I was able to replicate Santer’s Table III values using the line item from Table 1 entitled “Inter-model_S.D._T2LT” which is shown in column 1 (“Trend”) as being 0.092. So this number – whatever it is – is the one that is used as the other component of the denominator in his t-test.

Santer did not archive the 49 time series as used and thus far has not responded to my request for this data.

The caption to Santer Table 1 says:

The multimodel mean and inter-model standard deviation were calculated using the ensemble-mean values of the time series statistics for the 19 models [see Equations (7)–(9)].

Now there is a bit of helpful information in Douglass Table 1, which provides ensemble-mean values for 22 models – I haven’t checked to see which models are different. Douglass Table 1 provides model ensemble trends by altitude, but not weighted according to T2LT weights. John Christy sent me weights by email and I calculated T2LT trends according to these weights as follows:

douglass=read.table("http://data.climateaudit.org/data/models/douglass.dat",header=TRUE,sep="\t")
name1=names(douglass)[4:15]
alt=as.numeric(substr(name1,2,nchar(name1)))
douglass[,3:15]=douglass[,3:15]/1000 #converts to deg C/decade

christy.weights= c(0.1426, 0.0706, 0.1498, 0.1894, 0.1548, 0.1382, 0.1035,
0.0477, 0.0167, 0.0067, -0.0016, -0.0179) ##CHRISTY WEIGHTS email Apr 30 2008
names(christy.weights)=alt
douglass$trend_LT= apply(douglass[,c(3,5:15)],1,function(x) weighted.mean(x,christy.weights) );
douglass$trend_LT
# [1] 0.176 0.131 0.427 0.144 0.268 0.212 0.141 0.093 0.175 0.245 0.337
sd(douglass$trend_LT) #[1] 0.0913

The standard deviation of 0.0913 from this calculation compares nicely with the value of 0.092 in Santer Table 1 for UAH_T2LT, so it looks like this is apples and apples.

In the subsequent t-calculation for Table III, Santer (like Douglass) divides this number by sqrt(M-1), where M is the number of models, yielding a value of 0.02152. So Santer, like Douglass, in effect proceeds on the basis that the 95% CI envelope for the ensemble mean trend is 0.171- 0.257 deg C/decade. The majority of individual models are above 0.171 deg C/decade.

Now here’s a quick simulation that implements what I believe to be a fairly reasonable interpretation of what should be done (considering the sorts of points that beaker and others have made).

First I assumed that the ensemble mean was the true trend; I centered everything at their midpoints and then calculated residuals from the observed values. I then did an arima AR1 fit to the residuals getting an AR1 coefficient of 0.89 – a little higher than with a better fit and a sd of a little over 0.14 (deg C/decade). I then did 1000 simulations in which I generated AR1 “noise” with parameters of AR1= 0.89 and sd=0.141, added the AR1 noise to the ensemble mean trend ( a straight line), calculated the OLS trend for each run and made a histogram as shown below, also showing the observed OLS trend as a red triangle and the 2.5% and 5% percentiles. The observed OLS trend is outside both percentiles.

Histogram of OLS Trends adding AR1 noise (AR1=0.89, sd=0.141) to ensemble mean trend
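Here's a sketch of that simulation recipe (a reconstruction of the steps described above rather than the exact script; the monthly time step, the assumed ensemble-mean trend of ~0.21 deg C/decade and treating 0.141 as the innovation sd are all assumptions on my part).

set.seed(4)
nmon = 252                                            # 1979-1999 monthly (an assumption)
trend0 = 0.214                                        # assumed ensemble-mean trend, deg C/decade
tt = (1:nmon)/120                                     # time in decades
sims = replicate(1000, {
  y = trend0*tt + as.numeric(arima.sim(list(ar=0.89), n=nmon, sd=0.141))   # trend line plus AR1 noise
  coef(lm(y ~ tt))[2]                                 # OLS trend of the synthetic series
})
quantile(sims, c(0.025, 0.05))                        # the percentiles marked on the histogram
hist(sims, breaks=30, main="OLS trends: ensemble-mean trend + AR1 noise")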

Next, I did the same thing assuming a trend of 0.17 (the lower limit of CI interval on ensemble means) and a level that picks up a few important individual models. In this case, the observed OLS trend is outside the 5% percentile, but inside the 2.5 % percentile.

Suppose that Santer now says in Mannian tones: you are “wrong”!!!! You haven’t allowed for uncertainty in the observations. Haven’t you read about the uncertainty attached to an OLS trend in the presence of autocorrelation?

But haven't we already dealt with that by generating this type of "noise" in our simulations? Aside from the big problem between RSS and UAH, we know what the observed OLS trend is, and the AR1-type uncertainty does not enter into this calculation. We compare one OLS trend to the distribution of OLS trends generated by our simulation.

And BTW, I sure don't get why this has to get to a 5% or even 2.5% level of significance before practitioners concede that something needs to be re-tuned. This isn't really a Popperian falsification problem, and that way of thinking probably makes this over-dramatic. This is more like an engineering problem. Ask yourself – would a reasonable engineer in a chemical factory wait until his model got into this sort of red zone, or would he take steps when signs of bias started showing up?

Which takes us back to UAH vs RSS. That's the $64 question. If RSS is right, the modelers don't have a problem; if UAH is right, they do. The UAH-RSS differences still need to be resolved and the failure of the climate science community to resolve this leaves the modelers in an unreasonable situation. If I were running the show, I'd say that the parties had had long enough to resolve the situation by lobbing grenades at one another in the academic literature, where nothing is really getting resolved. (They might as well be writing blog articles.) I would hire third party engineers, give them a proper budget and ask them to resolve the matter.

Not a pro bono winging effort, but a proper engineering study so that modelers would have a benchmark. I agree that Santer had cause to criticize the statistics in Douglass, but that doesn’t mean that Santer’s own analysis was necessarily right. It’s possible for both sides to have done things wrong. In terms of conclusions, I think that both Santer et al and the predecessor CCSP report, in effect, distract people from the need to finally resolve UAH v RSS by purporting to paper over real differences with what turns out to be less than adequate statistical analysis.