Jones et al [1998]: Confidence Intervals

I’m working away at Jones et al [1998]. Here’s an interesting diagram from Jones et al [Science, 2001], which purports to provide confidence intervals for the J98 reconstruction (blue). There’s (at least) one really strange feature in this diagram. See if you can pick it out.

Original Caption: Fig. 2. (A) Northern Hemisphere surface temperature anomalies (deg C) relative to the 1961-1990 mean (dotted line). Annual mean land and marine temperature from instrumental observations (black, 1856-1999) (5) and estimated by Mann et al. (red, 1000 to 1980) (6, 10) and Crowley and Lowery (orange, 1000-1987) (7). April to September mean temperature from land north of 20N estimated by Briffa et al. (green, 1402-1960) (8) and estimated by re-calibrating (blue, 1000 to 1991) the Jones et al. Northern Hemisphere summer temperature estimate (9, 16). All series have been smoothed with a 30-year Gaussian-weighted filter. (B) Standard errors (SE, deg C) of the temperature reconstructions as in (A), calculated for 30-year smoothed data. The proxy average series (6-10) do not extend to the present because many of the constituent series were sampled as long ago as the early 1980s.

In most multiproxy studies, the number of series declines in the earlier portions. In the Jones 1998 reconstruction, there are only 3 series in the first part of the 11th century (Polar Urals, Tornetrask and Greenland dO18); a fourth enters in the mid-11th century (Jasper, Alberta tree rings) and the roster is only 4 until 1400. Eventually, there are 10 "proxies", including the Central England instrumental record and the Central Europe documentary record. The J98 standard errors (second panel – blue) using 10 proxies are only a little lower than the standard errors using the 3-4 MWP proxies. That’s curious enough. But look at what happens at AD1400. One proxy series is added – Svalbard ice melt percentage. For some reason, the standard error nearly doubles. I wonder what’s going on.
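
A rough way to gauge how odd the AD1400 jump is. This is only a sketch, assuming for illustration that the reconstruction is an unweighted average of proxies with independent errors, in which case the composite standard error is sqrt(sum of sigma_i^2) / k. The per-proxy error values are hypothetical, and nothing here is claimed to be the procedure Jones et al. actually used.

```python
import math

def composite_se(sigmas):
    """SE of an unweighted mean of proxies with independent errors.

    sigmas: per-proxy error standard deviations (hypothetical values).
    """
    k = len(sigmas)
    return math.sqrt(sum(s * s for s in sigmas)) / k

def sigma_needed(base_sigmas, target_se):
    """Error SD one added proxy would need to push the composite SE to target_se."""
    k = len(base_sigmas) + 1
    needed_var = (target_se * k) ** 2 - sum(s * s for s in base_sigmas)
    return math.sqrt(needed_var) if needed_var > 0 else 0.0

four = [0.3] * 4                       # four equally noisy early proxies (assumed)
print(composite_se(four))              # SE with four proxies
print(composite_se(four + [0.3]))      # a fifth, similar proxy lowers the SE
print(sigma_needed(four, 2 * composite_se(four)))  # noise needed to double it
```

Under these assumptions, a single added proxy would have to be more than four times noisier than the existing ones to double the standard error of a simple average.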

There are some other frustrating aspects to this diagram. It turns out that the Jones et al [1998] version used in the compilation in Jones et al [2001] is not the same as the archived version for Jones et al [1998], but has been "re-calibrated". I think that I’ve figured out the "re-calibration", but it leads into more murky by-ways of the multiproxy underworld.


  1. TCO
    Posted Oct 26, 2005 at 1:31 PM | Permalink

    The more I look at the spaghetti graphs over the period of the instrumentation, the more I wonder if there is something to the possibility that the proxies are right and the instrumentation off (lately) because of UHI.

  2. Brooks Hurd
    Posted Oct 26, 2005 at 5:22 PM | Permalink

    The Mann SE appears to be a constant value of slightly more than 0.2 through 1600, at which point it becomes a new constant with a value of half the previous constant. I suppose that I was always operating under the misconception that SE had something to do with the data.

  3. Willis Eschenbach
    Posted Oct 26, 2005 at 7:19 PM | Permalink

    Re: #2, Brooks, an interesting comment. The missing link is that the SE is connected to the variation of the data and to the number of data points. This is why the SE on these types of reconstructions is usually wider the further back you go, because there are fewer data series. In 1600 there were a bunch of other proxies added, so the standard error dropped.

    Steve’s comment, though, points to the real question — why does the SE go up when another proxy is added in 1400?

    I also find the Jones and Crowley & Lowery results for the period ~ 1860-1920 to be curious, as they both are going up while the instrumental record is dropping.

    Finally, why is the Mann error (red line, lower panel) mostly straight while the Briffa error (green line, lower panel) jumps up and down like a kid who’s eaten too much sugar?

    The world wonders …


  4. JerryB
    Posted Oct 26, 2005 at 8:16 PM | Permalink


    > The world wonders …

    OT, but are you picking on Halsey there? 🙂

  5. Brooks Hurd
    Posted Oct 27, 2005 at 12:15 AM | Permalink


    The Jones SE appears to be derived from the data, albeit with a step change in the 1400 to 1600 period. The explanation that makes sense to me (lacking any other information) is that the proxy(s) which Jones added after 1400 would seem to have much more variation. Either he added additional proxies after 1600 with considerably less variation or replaced the proxy(s) added in 1400 with others.

    I cannot figure out how Mann got an SE that is essentially constant before and after a step change.

  6. Paul Gosling
    Posted Oct 27, 2005 at 4:00 AM | Permalink

    Here’s a stab in the dark. Arctic temperatures are notoriously variable (I think I am correct) so adding in the Svalbard melt would add in a highly variable proxy.

  7. Willis Eschenbach
    Posted Oct 27, 2005 at 9:42 PM | Permalink

    Steve, a question for you, or for any dendrochronologists who read this.

    Jones et al. quote a standard error (SE) of around 0.15°C in the data shown above. Is this SE just the internal uncertainty of the mathematical procedure used to do the reconstruction, or does it also include some value for the uncertainty due to the error between the reconstruction and the actual data?

    If it is the former, how well do the reconstructions actually match up with known temperatures? What kind of errors do the reconstructions have WRT reality?


  8. Steve McIntyre
    Posted Oct 27, 2005 at 10:30 PM | Permalink

    I’ve spent a couple of days trying to figure out how Jones calculated his “standard errors”. So far I haven’t been able to replicate anything close. As usual with the Hockey Team, it’s a total waste of time trying to decode their murky procedures.

    My impression is that the MBH standard errors are just 2 x the standard error of the residuals (which assumes that they are i.i.d., which of course they aren’t). The MBH steps sort of correspond to the calculation steps – I posted on this in May – there are some odd differences between MBH98 standard errors and MBH99 standard errors.

    I presume that Jones has a term reflecting (x- mean(x))^2, which is actually part of the proper confidence interval formula (the MBH formula probably being incorrect). This term accounts for the volatility, but as I said, I can’t yet get it to work and it’s increasingly irritating me.

  9. TCO
    Posted Oct 27, 2005 at 11:41 PM | Permalink

    the answer?

  10. Willis Eschenbach
    Posted Oct 28, 2005 at 6:07 PM | Permalink

    Thanks, Steve. Look, I’m just a kid off of a cattle ranch, so maybe I missed something obvious. But can you or someone answer me this?

    The Mann standard error for 1900 is about 0.1°C. It is unclear whether this is one or two standard errors. You seem to be saying it is two. That means they’re saying that trees can tell us the temperature a hundred years ago to an accuracy of plus or minus a tenth of a degree.

    Does anyone really believe that?

    I say that’s nonsense. Tree ring records might be able to tell within a degree or two what the temperature was a hundred years ago, but a tenth of a degree? Get real. That would mean if two successive years’ average temperatures were a quarter of a degree apart, you could easily tell the difference from the tree rings … absolute nonsense. Way too many confounding factors, water, clouds, early springs but cool summers, summers that are too hot, all of those affect tree growth. They say with all those confounding factors that the trees tell temperature to a tenth of a degree? Risible.

    We can’t measure to much better than a half of a degree with the standard weather thermometer. Most weather records are kept to the nearest degree. And now they’re saying that trees tell temperature to a tenth of a degree? Absolutely impossible. I’d say impossible even if they said the standard error was +/- 0.5. I truly do not believe that trees are accurate to half a degree.

    Unless I’m missing something really obvious … which is certainly possible …


  11. TCO
    Posted Oct 28, 2005 at 6:19 PM | Permalink

    Learn some statistics, dude. (Think about the difference between a single measurement and repeats for a given accuracy measurement.) I think you are the same chuckie who thought you could divide accuracy for a decadal measurement to get a 10 times more accurate measurement for a year. (Imagine how impressively accurate it must be for a second?)

  12. Paul
    Posted Oct 28, 2005 at 7:16 PM | Permalink

    TCO, Willis is right. Errors average down as the square root of the number of measurements — of the same quantity. Unless they have hundreds of proxy measurements of the temperature in one year they can’t improve the error very much. Most outdoor thermometers are only good to a degree, a proxy measurement has to be worse, if it’s valid at all. If they took all their proxies and averaged over 100 years, they could get the error down on the average, but no one cares about the average temperature of a century.
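
The square-root rule mentioned above can be written out as a short sketch. It assumes independent measurements of the same quantity, each with the same error SD; the 1-degree and 0.1-degree figures are the ones used in this thread, not values from any of the papers.

```python
import math

def sem(sd, n):
    """Standard error of the mean of n independent measurements."""
    return sd / math.sqrt(n)

def n_required(sd, target_se):
    """Smallest n for which sd / sqrt(n) is at or below target_se."""
    return math.ceil((sd / target_se) ** 2)

# Per-measurement error of 1 degree, target standard error of 0.1 degree.
print(n_required(1.0, 0.1))
print(sem(1.0, 4))   # only four measurements, as in the pre-1400 roster
```

With a 1-degree error per measurement, on the order of a hundred independent measurements are needed to reach 0.1 degree, which is the scale of the objection raised here.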

  13. TCO
    Posted Oct 28, 2005 at 7:31 PM | Permalink

    Actually you and I are in agreement. That more measurements improves the error of the overall estimate. It’s called having more than one tree. Willis did not seem aware of this. He seems to think that the limit of the one tree measurement applies to the overall surveying.

  14. Willis Eschenbach
    Posted Oct 28, 2005 at 8:07 PM | Permalink

    TCO, the abuse is un-necessary. You obviously missed my first point, where I asked Steve whether the published error was the statistical error only, or whether it included the error between the results and the reality.

    And of course I know there is more than one tree. I also understand standard deviation, standard error, R2, arima models, the law of large numbers, skew, kurtosis, the Jarque-Bera test for normality, autocorrelation, and the power content of complex signals. Can we move on now?

    I don’t care if you have a thousand trees. It’s called instrument error. It is not normally distributed and it is not linear with temperature — it is upside down quadratic.

    As the dendrochronologist mentioned on an earlier thread, the R2 of tree ring reconstructions is typically 0.1 to 0.3. This is a very weak correlation. This is because the instrument error is huge. In the hot European summer of 2003, tree growth was greatly reduced. This would be read on the tree “thermometer” as a very cool year. This is an instrument error of tens of degrees .

    You can have a hundred tree “thermometers” out there (a typical dendrochronological sample size) that have a non-random, non-gaussian, non-linear error of up to 10 degrees or a thousand, but you still won’t know if the temperature changes by 0.1°. It’s called instrument error. Save your statistical abuse until you are sure that you understand what I’ve said.


  15. Willis Eschenbach
    Posted Oct 28, 2005 at 8:19 PM | Permalink

    Oh, I forgot to mention in my last post that the quoted errors are in-sample errors. These are undoubtedly smaller, and perhaps much smaller, than out-of-sample errors. And historical temperature reconstructions are by definition out of sample. See Steve’s earlier thread on this subject.


  16. Willis Eschenbach
    Posted Oct 28, 2005 at 8:45 PM | Permalink

    I realized that people may not be familiar with the concept of how instrument accuracy limits measurements, and how this is not improved by increasing the sample size.

    I have a mechanical pencil. The thickness of the lead is 0.7mm. Think about trying to measure the thickness of this lead using an ordinary ruler. I look at it and I say “about 0.8 mm”. OK, one measurement. Can we improve on that?

    Sure we can. I ask you, and you say “0.7”, and someone else says “three quarters of a mm”, and so on. And for the first few additional measurements, our accuracy does improve. But soon a limit is reached.

    We will never, for example, be able to detect the difference between two pencil leads that are different by a thousandth of a mm by using an ordinary ruler. No matter how many measurements we take. We can ask a million people, and have a statistical error of … just a sec, call the standard deviation of the readings say 0.1, error of the mean = SD/sqrt(n) … SEM ~ 0.0001 mm. Statistical enough for ya?

    Doesn’t matter if the SE is a ten thousandth of a mm. The limit is not in the sample size. The limit is in the accuracy of the instrument. We still can’t tell the difference between the two pencil leads using a ruler.

    It gets worse with trees, though. At least a ruler is a nice linear measure, with a very high R2 to the data, and the errors are likely to be randomly distributed. With trees, none of those are true …
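
One reading of the ruler argument is that an error shared by every reading is untouched by averaging. The sketch below simulates that case only: the bias and noise values are hypothetical, and it does not model the ruler's finite resolution or any nonlinearity.

```python
import random
import statistics

def simulate_readings(true_value, bias, noise_sd, n, seed=0):
    """n readings that all share the same bias plus independent Gaussian noise."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

TRUE_MM = 0.70   # actual lead thickness
BIAS_MM = 0.05   # assumed systematic error common to every reading

readings = simulate_readings(TRUE_MM, BIAS_MM, noise_sd=0.10, n=100_000)
mean = statistics.fmean(readings)
sem = statistics.stdev(readings) / len(readings) ** 0.5

# The SEM shrinks with n, but the mean settles near TRUE_MM + BIAS_MM.
print(f"mean={mean:.4f}  SEM={sem:.5f}  error vs truth={mean - TRUE_MM:+.4f}")
```

The statistical error becomes tiny while the distance from the true value stays close to the assumed bias, so the quoted SEM says nothing about that part of the error.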


  17. Paul
    Posted Oct 28, 2005 at 9:54 PM | Permalink

    RE #11…

    TCO. No…I’m not the same guy. Merely pointing out the UHI issues aren’t clearly defined or understood. And, with Jones not sharing, we’re all just guessing (which is what I think we’d be doing even if Jones was sharing).

  18. TCO
    Posted Oct 28, 2005 at 11:01 PM | Permalink

    Post 11 was addressed to post 10. And if Willis is the same dude who assumed that the error in a year’s temp would be one tenth that of a decade’s temp.

  19. Steve McIntyre
    Posted Oct 28, 2005 at 11:24 PM | Permalink

    I don’t even think that we get to instrument accuracy here. If you have spurious regression, you have spurious confidence intervals. In the case of MBH, as far as I can tell, they calculate the standard error of the residuals, multiply by 2 and call it a confidence interval. You could do the same with Honduran births as a “proxy” for Australian wine sales. They ignore elementary tests for spurious regression. If you use the true t-statistic allowing for autocorrelation, my guess is that it would be much higher. Also, some of the series appear to be estimated outside the calibration zone.

    With Jones, I still haven’t figured out how they calculate their confidence intervals, but, whatever it is, it will be along the same lines.

    When cross-validation is used, often the verification period standard errors are used to estimate confidence intervals. This is NOT done here. Since MBH98 using 15th century proxies has a cross-validation R2 of ~0, the standard errors will be equal to natural variability. In fact, in some ways, it seems to be the estimates of long-run variance of temperature that may be a problem. Remember my posts on estimation of standard deviations in autocorrelated series and how sample standard deviations can be wild under-estimates.
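
The spurious-regression point above can be illustrated with a small simulation. This is a sketch under stated assumptions: pairs of independent random walks stand in for unrelated trending series, and the "naive interval" is 2 x the residual standard error as described in the comment. It is not a reconstruction of the MBH or Jones calculations.

```python
import random

def random_walk(n, rng):
    """Cumulative sum of independent Gaussian steps."""
    x, path = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def white_noise(n, rng):
    """Independent Gaussian values with no persistence."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def ols_fit(x, y):
    """Return (r_squared, naive_half_width) for the fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    resid_se = (sse / (n - 2)) ** 0.5
    return 1.0 - sse / syy, 2.0 * resid_se

def share_high_r2(make_series, trials=200, n=100, threshold=0.2, seed=1):
    """Fraction of unrelated series pairs whose regression R2 exceeds threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r2, _half_width = ols_fit(make_series(n, rng), make_series(n, rng))
        hits += r2 > threshold
    return hits / trials

print("random walks:", share_high_r2(random_walk))
print("white noise: ", share_high_r2(white_noise))
```

Unrelated random walks clear the R2 threshold far more often than unrelated white noise does, and in each such case the naive interval would still be reported as if the relationship were real.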

  20. Steve McIntyre
    Posted Oct 28, 2005 at 11:27 PM | Permalink

    Re #12 – in autocorrelated situations, errors do not average down with the square root. Beran points out that under the autocorrelation of long-run dependence (which occurs in geophysical systems, without necessarily taking a position on the physical mechanism) you may need up to 10,000 measurements to achieve the confidence of 100 i.i.d. measurements.
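
A sketch of the arithmetic behind figures like these. For fractional Gaussian noise with Hurst exponent H, the variance of the sample mean scales as n^(2H-2), which gives an effective sample size of n^(2-2H). The value H = 0.75 below is an assumption chosen because it reproduces the 10,000-versus-100 figure; it is not taken from Beran or from any of the temperature series. The AR(1) formula is the usual large-n approximation, included for contrast.

```python
def n_eff_long_memory(n, hurst):
    """Effective i.i.d. sample size for fractional Gaussian noise (H = hurst)."""
    return n ** (2.0 - 2.0 * hurst)

def n_eff_ar1(n, rho):
    """Large-n effective sample size for an AR(1) series with lag-1 correlation rho."""
    return n * (1.0 - rho) / (1.0 + rho)

print(n_eff_long_memory(10_000, 0.75))  # long memory, assumed H
print(n_eff_long_memory(10_000, 0.50))  # H = 0.5 is the i.i.d. case
print(n_eff_ar1(10_000, 0.5))           # short memory costs only a constant factor
```

With short-memory autocorrelation the square-root rule survives with a smaller n; with long memory the exponent itself changes, which is why the penalty grows so quickly.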

  21. TCO
    Posted Oct 29, 2005 at 2:21 PM | Permalink

    Is autocorrelation of the record agreed on? Proven? Is it fundamental to your criticisms? Do others in the field use autocorr style stats?

  22. Hans Erren
    Posted Oct 29, 2005 at 2:50 PM | Permalink

    Just some thought about homogeneity corrections, station density and confidence intervals of the surface records.

    A proper homogeneity correction can ONLY be carried out if there are sufficient stations within a radius of 1000 km. If this is not the case then the confidence interval of average annual temperature IMHO is as big as 1 Kelvin, being the approximate average station inhomogeneity.

    I would therefore imagine that confidence interval of the surface record is inversely proportional to station density, and increasing dramatically in the 19th century.

    Now that the proxies were calibrated against the surface record, was the confidence interval of the surface record ever taken into account?
    In all the graphs I saw of the surface record I never saw an error bar….

  23. Willis Eschenbach
    Posted Oct 29, 2005 at 5:36 PM | Permalink

    re: 22, interesting point, Hans. And with tree ring temperature reconstructions, in addition to the error you point out in the surface temperature, there is an added difficulty.

    This is that the most temperature sensitive series, those near the polar or elevational end of their range, are also likely to be a long distance (both in kilometres and temperature) from the nearest long-term ground station temperature record.

    Consider, for example, two stands of trees, say they’re only 1 km. apart, but one is on the north slope of a mountain and the other on the south slope. We all know that both the average temperature and the changes of the temperature can be significantly different between the two locations.

    To understand their changes in growth, we compare them to the nearest temperature record, which may be 30 km. away and down in a valley.

    Now me, I’m not a brave enough man to attempt to numerically estimate the underlying error inherent in this calculation, but I can tell you one thing for sure …

    It’s more than a tenth of a degree, the statistical error in the reconstructions above.


  24. Posted Oct 31, 2005 at 11:37 AM | Permalink

    Doesn’t anyone here understand statistics? The SE did not “almost double” in 1400. The blue line pops up by at most 50% of the pre-1400 value. Adding a series with a larger inherent variability will increase your SE. The red line shows a smaller step because it’s a completely different series. Red is “Annual mean land and marine temperature” while blue is “April to September mean temperature from land north of 20N”. The differences between these series are not readily apparent when expressing them as differences from the ’61-’90 mean. The appropriate mean is obviously different for the two series. Clearly, Arctic data (that added beginning in 1400) affects estimates of temperature for part of the year above 20N more than it affects annual global mean temperatures. For one thing, it obviously has a much lower weighting in a global reconstruction than in a summertime reconstruction.

    One might reasonably ask why include a proxy with a larger inherent variation? The answer is that it’s not proper statistics to deliberately exclude data just because it’s going to make your fit worse for a period.

  25. TCO
    Posted Oct 31, 2005 at 4:10 PM | Permalink

    Steve, my question??

  26. Steve McIntyre
    Posted Oct 31, 2005 at 4:28 PM | Permalink

    The autocorrelation can be verified simply by the time series. It’s not obvious that it’s been handled properly. I can’t come close to replicating their results, so I really don’t know what they did. I’m still working on it.

  27. TCO
    Posted Oct 31, 2005 at 5:24 PM | Permalink

    I’m trying to disaggregate issues:

    a. is there autocorrelation
    b. how should autocorrelation be handled

    I could imagine a case for instance where there is agreement on the method of dealing with autocorr, but disagreement on it being there. Or vice versa. Let’s drill down.

  28. Posted Dec 27, 2007 at 7:00 PM | Permalink

    Interesting thread (which I will have to study to understand better). I can only hope I am OT.

    Reading it though, I was reminded of the teacher of my first statistical physical measurement class. Before telling us about errors, he talked about precision and accuracy. As I recall his example was “If you ask the billion Chinese the height of their chairman (whom they have never seen) you will get a very precise, but not necessarily accurate answer”.

    His point was to be very careful when you use the term error, since errors can result from many different source types. 😉

  29. Posted Feb 21, 2008 at 12:20 PM | Permalink

    Good post. You make some great points that most people do not fully understand.

    “There are some other frustrating aspects to this diagram. It turns out that the Jones et al [1998] version used in the compilation in Jones et al [2001] is not the same as the archived version for Jones et al [1998], but has been “re-calibrated”. I think that I’ve figured out the “re-calibration”, but it leads into more murky by-ways of the multiproxy underworld.”

    I like how you explained that. Very helpful. Thanks.
