It's a Mystery!

Ok, coffee break’s over, back on your heads! Let’s get back to business here…

I like puzzles. They provide no end of intellectual entertainment, and there is a thrill in figuring out the solution, particularly when it can be done in an elegant fashion. A good source of puzzles seems to be climate science papers which involve (sometimes extremely simple) statistical analyses.

This puzzle stems from a throwaway line toward the end of the Steig et al. Antarctic paper. The line is one of those typical "our analysis is correct because it sort of looks like something that somebody else did earlier and nobody complained about it" claims, which are often thrown out without any further specific corroboration:

Independent data provide additional evidence that warming has been significant in West Antarctica. At Siple Station (76S, 84W) and Byrd Station (80S, 120W), short intervals of data from AWSs were spliced with 37-GHz (microwave) satellite observations, which are not affected by clouds, to obtain continuous records from 1979 to 1997 (ref. 13). The results show mean trends of 1.1 +/- 0.8C per decade and 0.45 +/- 1.3C per decade at Siple and Byrd, respectively (Ref. 13). Our reconstruction yields 0.29 +/- 0.26 C per decade and 0.36 +/- 0.37C per decade over the same interval. In our full 50-year reconstruction, the trends are significant, although smaller, at both Byrd (0.23 +/- 0.09 C per decade) and Siple (0.18 +/- 0.06 C per decade). Furthermore, the seasonal characteristics of these data (ref. 13) agree well with those from our reconstructions, with the greatest amount of warming in austral spring and winter (Fig. 3). Independent analyses of tropospheric temperature trends have also found spring and winter warming to be greatest in West Antarctica (ref.14,15).

Reference 13 refers to the paper, Shuman, C. A. & Stearns, C. R. Decadal-length composite inland West Antarctic temperature records. J. Clim. 14, 1977-1988 (2001).

Steig et al Figure 3b
Figure 3b, which is the relevant part of Fig. 3, has the caption:

b, Mean annual trends for 1969-2000, to facilitate comparison with ref. 2…. Black circles in b show the locations of Siple and Byrd Stations, and the adjacent numbers show their respective trends (Ref. 13) for 1979-1997.

So what is the mystery? Is it that the authors substituted the Byrd and Siple values from the earlier study into Figure 3b instead of their own? No, I think anybody who hangs out at CA can guess that pretty easily from just looking at the numbers. My guess is that Steig et al. may have decided that the Shuman results were more reliable than their own and wished to present as correct a picture of the situation as possible.

Normally, I might not have gone any farther in looking at this except that I accidentally ran across information about the Shuman paper when I was searching for Antarctic satellite data. I figured that, because their values were co-opted for Figure 3 of the Steig paper, they were fair game for auditing. The actual data were found in Excel files here.

You may also notice that there are two other AWS (Lettau and Lynn) whose data were reconstructed by Shuman for a slightly shorter period of time. However, these both show negative trends so presumably they would not be of further interest.

Boot up R and let’s go.

# As an example of quick-and-dirty data entry,
# I infilled the missing values in the Annual Excel file with NAs,
# then copied just the data in the four AWS columns of the annual data set to the clipboard.
# Note that the first line below should be TYPED directly into R, not run via ctrl-R or
# copying-and-pasting the command (which destroys the data in the clipboard).

shuman = read.table("clipboard")

colnames(shuman) = c("Byrd","Lettau","Lynn","Siple")
shuman = ts(shuman,start=1979)
year = time(shuman)

shumanb.reg = lm(shuman[,”Byrd”] ~ year)
coef(shumanb.reg)[2] # 0.04557895
diff(confint(shumanb.reg)[2,])/2 # 0.1353811
#Shuman Byrd decadal trend 0.456 +/- 1.354
#Byrd matches the stated results

shumans.reg = lm(shuman[,”Siple”] ~ year)
coef(shumans.reg)[2] # 0.1112807
diff(confint(shumans.reg)[2,])/2 #0.07902228
#Shuman Siple decadal trend 1.11 +/- 0.790
#Siple matches the stated result
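Since this lm/confint pattern recurs throughout the post, it can be wrapped in a small helper function. This is a sketch of my own (the name decadal.trend is not in the original script):

```r
# Hypothetical helper (not in the original script): decadal trend and
# half-width of the 95% confidence interval for an annual time series,
# using the same lm/confint approach as above.
decadal.trend = function(x) {
  yr = as.numeric(time(x))
  fit = lm(x ~ yr)
  c(trend = 10 * coef(fit)[[2]],
    error = 10 * unname(diff(confint(fit)[2, ])) / 2)
}
```

Applied to shuman[,"Byrd"] and shuman[,"Siple"], this should reproduce the values above.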

Well, no mystery here. The results are as claimed. The trends and the error bars match the printed values. Might as well keep going and check the results for the current reconstruction as well.

#load recon data and supporting information
download.file("", "", mode="wb")
download.file("", "", mode="wb")

#make annual series
recon.ann = ts(sapply(1:ncol(recon_aws), function(x) colMeans(matrix(recon_aws[,x],12,50))),start=1957)
colnames(recon.ann) = Info$aws_grid$name
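As a cross-check on the colMeans reshaping trick, aggregate() on a monthly ts gives the same annual means. A toy comparison with synthetic data (my own sketch; recon.demo is a stand-in, not the actual reconstruction):

```r
# The sapply/colMeans line reshapes each 600-month column into a
# 12 x 50 matrix and averages within years; aggregate() on a monthly
# ts does the same thing. Toy check on synthetic data (recon.demo is
# a stand-in for the real reconstruction matrix).
set.seed(42)
recon.demo = ts(matrix(rnorm(600 * 2), 600, 2), start = 1957, frequency = 12)
ann1 = ts(sapply(1:2, function(x) colMeans(matrix(recon.demo[, x], 12, 50))),
          start = 1957)
ann2 = aggregate(recon.demo, nfrequency = 1, FUN = mean)
# ann1 and ann2 agree to floating-point precision
```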

#restrict to shuman time period
reconx.ann = window(recon.ann,start = 1979, end = 1997)

steigb.ann = lm(reconx.ann[,"Byrd"] ~ year)
coef(steigb.ann)[2] # 0.01117608
diff(confint(steigb.ann)[2,])/2 #0.09751764
#Steig Byrd decadal trend 0.112 +/- 0.975
#Claimed value: 0.36 +/- 0.37

steigs.ann = lm(reconx.ann[,"Siple"] ~ year)
coef(steigs.ann)[2] # 0.0224113
diff(confint(steigs.ann)[2,])/2 #0.04585337
#Steig Siple decadal trend 0.224 +/- 0.459
#Claimed value: 0.29 +/- 0.26

Well, we have finally located our mystery: not earthshaking, but a mystery nonetheless. Where did the values claimed by Steig et al. come from? The differences here are not minor: a factor of three in the case of Byrd, with a somewhat smaller difference for Siple. Note that the Shuman trend for Byrd is now about four times as large as the one calculated here, despite the fact that there is agreement on the "seasonal characteristics" (whatever that may mean) according to the authors.

This is a job for the CSI (Climate Science Investigation) unit! I went after this puzzle with some more R tools. First we look at Byrd:

#Can we match a Byrd trend of .36?
#Try all consecutive time periods of three years or longer
bcheck = NULL
for (i in 1:48) {
  for (j in (i+2):50) {
    xyear = i:j
    regx = lm(recon.ann[i:j,"Byrd"] ~ xyear)
    trend = 10*coef(regx)[2]
    if (round(trend,2) == .36) {
      bcheck = rbind(bcheck, c(trend, 10*diff(confint(regx)[2,])/2, 1956 + i, 1956 + j))
    }
  }
}
colnames(bcheck) = c("trend","95%error","start","end")
# trend 95%error start end
#[1,] 0.3646779 0.9444317 1957 1977
#[2,] 0.3577181 0.8571222 1978 1998
#[3,] 0.3593054 0.6060411 1978 2002

The best candidate appears to be the time interval from 1978 to 1998. When a year is added at each end of the 1979-1997 sequence, the decadal trend increases from 0.11 to 0.36. However, the error bar for the trend is nowhere near the target value of 0.37.

On to Siple:

scheck = NULL
for (i in 1:48) {
  for (j in (i+2):50) {
    xyear = i:j
    regx = lm(recon.ann[i:j,"Siple"] ~ xyear)
    trend = 10*coef(regx)[2]
    if (round(trend,2) == .29) {
      scheck = rbind(scheck, c(trend, 10*diff(confint(regx)[2,])/2, 1956 + i, 1956 + j))
    }
  }
}
colnames(scheck) = c("trend","95%error","start","end")
# trend 95%error start end
#[1,] 0.2932384 0.2567706 1957 1992
#[2,] 0.2882775 0.2426950 1957 1993
#[3,] 0.2939766 0.2662463 1973 2003
#[4,] 0.2942575 0.2860570 1977 2004
#[5,] 0.2942141 0.2964496 1978 2003
#[6,] 0.2886615 0.5032191 1983 2000
#[7,] 0.2887827 0.4908402 1985 2000

We have an excellent match in both the trend (0.29) and the error (0.26); however, the years are a bit off. Maybe it is just coincidence. Some further searching could be useful here.
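The brute-force scan above can be generalized into a reusable function, so other targets or series can be checked the same way. Again, this refactoring is my own sketch, not code from the paper:

```r
# Hypothetical generalization of the loops above: scan every window of
# min.len or more consecutive years in an annual series for a rounded
# decadal trend matching a target value.
scan.trends = function(x, target, min.len = 3) {
  yrs = as.numeric(time(x))
  out = NULL
  for (i in seq_along(yrs)) {
    for (j in seq_along(yrs)) {
      if (j - i + 1 < min.len) next
      xyear = yrs[i:j]
      regx = lm(x[i:j] ~ xyear)
      trend = 10 * coef(regx)[[2]]
      if (round(trend, 2) == target)
        out = rbind(out, c(trend = trend,
                           error = 10 * unname(diff(confint(regx)[2, ])) / 2,
                           start = yrs[i], end = yrs[j]))
    }
  }
  out
}
```

With this, the Siple search above becomes scan.trends(recon.ann[, "Siple"], 0.29).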

I also checked the full 50-year trends for the two stations:

#Full 50 year trends
allyear = time(recon.ann)
steigbyrd.all = lm(recon.ann[,”Byrd”] ~ allyear)
coef(steigbyrd.all)[2] # 0.008854096
diff(confint(steigbyrd.all)[2,])/2 #0.02195606
#Steig Byrd decadal trend for all 50 years 0.089 +/- 0.220
#Claimed value: 0.23 +/- 0.09

steigsiple.all = lm(recon.ann[,”Siple”] ~ allyear)
coef(steigsiple.all)[2] # 0.01838724
diff(confint(steigsiple.all)[2,])/2 #0.01393572
#Steig Siple decadal trend for all 50 years 0.184 +/- 0.139
#Claimed value: 0.18 +/- 0.06

You don't suppose that they accidentally swapped the two values in the Byrd case, with the end result of both making the trend larger and gaining statistical "significance"? The coincidence is pretty close. The Siple trend is correct (one (sort of) out of four isn't too bad, I guess) but the error limit is out by a factor of a little more than two. The standard error of the trend calculated in R is equal to 0.069, so rounding it down could give the quoted value, but that is speculation.

Where did all those other values come from? Dunno for sure yet. To quote a line from what I remember as being a George Carlin routine, I guess that "It's a mystery, fadda!"


  1. tarpon
    Posted Mar 10, 2009 at 12:00 PM | Permalink

    Kudos for the work you do.

    It’s great that you decided to put up source code and go through your methods. It makes a much better site when users can follow your work more closely, step by step. It also makes it far easier to spot errors and fix them. Using programs like “R” is great because either everyone has them or can obtain them at reasonable cost — free.

    It adds significantly to the public open science record, to your site and the sites valuable archives.

    Open Science is the future.

    • RomanM
      Posted Mar 10, 2009 at 1:02 PM | Permalink

      Re: tarpon (#1),

      One thing that I particularly appreciate about this site is the opportunity to learn things from the others who post here. This can be anything from information on proxies, methods used in tracking world temperatures, where data on virtually anything can be found – you name it. Understanding this information is a lot more useful and easier to comprehend when the presentation is transparent and detailed. I see including the background R scripts as part of this process although sometimes they may appear to add some clutter to the posts.

  2. Bob McDonald
    Posted Mar 10, 2009 at 12:36 PM | Permalink

    There’s something about the Mean annual trends for 1969-2000…

  3. anon
    Posted Mar 10, 2009 at 12:38 PM | Permalink

    A classic example of why I contributed to the server appeal. Great stuff.

  4. Ryan O
    Posted Mar 10, 2009 at 3:07 PM | Permalink

    No wonder I couldn’t get their numbers. Haha! 🙂 I just kind of passed it off as interesting and moved on to other things. Thanks for doing this, Roman.

  5. Posted Mar 10, 2009 at 3:57 PM | Permalink

    Roman —
    Very interesting!

    Although the Shuman trends are probably computed from their annual data (as plotted in their graph on the webpage), I would have expected Steig et al. to have computed their trends from their monthly recon. This might have changed the results a little, though not by as much as you indicate.

    I would, however, take exception to your aside,

    You may also notice that there are two other AWS (Lettau and Lynn) whose data were reconstructed by Shuman for a slightly shorter period of time. However, these both show negative trends so presumably they would not be of further interest.

    In fact, Shuman gives Byrd and Siple for 19 years (79-97), but Lettau for only 12 (86-97) and Lynn for only 10 years (88-97). I wouldn’t call barely half “slightly shorter”. Although the negative trends may have had an influence on Steig & Co’s decision to omit these stations, they could easily argue that it was an objective decision dictated by the substantially shorter time periods, and therefore undoubtedly larger standard errors.

    • RomanM
      Posted Mar 10, 2009 at 5:03 PM | Permalink

      Re: Hu McCulloch (#6),

      It would not make much sense for Steig to do a monthly regression for several reasons. Although the trend would not change much (it didn’t; as you might expect, I did do the monthly calculation as well), it would become apples and oranges because the standard errors (and degrees of freedom) would no longer be comparable.

      With regard to the shorter time periods for Lettau and Lynn, I do agree with you that there is an argument that this would make their use less meaningful. I also didn’t check whether they satisfied the criterion of being in West Antarctica which applied to the other AWSs.

      • Geoff Sherrington
        Posted Mar 11, 2009 at 6:18 PM | Permalink

        Re: RomanM (#8),

        This post deals with estimation of errors, which we know can be broadly divided into precision and accuracy.

        Do you know of an accepted way to express errors of both types for land surface temperatures and SST? There seems to be a tendency to progressively reduce raw data to fewer numbers, then to calculate errors on the reduced data. This can result in an improvement in the stated error terms. Sometimes this can be valid, sometimes not. I do not think that a standard deviation type calculation on annual averaged data should be used to derive error terms, when monthly would be worse, daily worse again.

        The original instrument capability must be considered in the error terms and probably should dominate them. If a thermometer reads to 1 degree, then it’s hard to improve on that error without secondary experiments, like taking replicate readings in a controlled chamber, or experiments where certified thermometers are compared over the range of interest.

        If you take the GISS style of about 5 “adjustments”, should the error bounds move in unison with each adjustment, or should they remain pegged to the raw record lines, plus any excursions outside them that the adjustments add?

        There has been a lot of discussion about the validity of temporal trends because of the calculation of errors done in certain ways. Ditto global models. Ditto radioisotopes, where the laboratory replication error is often quoted as an indicator of life in the wild.

        I have no mind full of rules of thumb that lets me grasp the magnitude of real errors from estimates on condensed data. Do you consider it fruitful to open a thread where the formalism of carrying error estimates forward through successive calculations can be laid out as a preferred practice? With a mention of the need to determine actual distributions before assuming Poisson, the correct approach to DOF, and a preferred way to treat outliers, including end padding? (In-filling has had fair coverage, so its conclusions can wait.)

        • Kazinski
          Posted Mar 11, 2009 at 11:00 PM | Permalink

          Re: Geoff Sherrington (#17),

          If a thermometer reads to 1 degree…

          Why would you use a thermometer when Mannian proxies can be accurate to a fraction of a degree going back a thousand years?

  6. Ian
    Posted Mar 10, 2009 at 4:03 PM | Permalink

    I feel quite sick reading this. I am used to Team talk, of course, being a frequent visitor to unrealclimate, closedmind, rabid etc., and I know there’ll be no answer from the professional climate people, but [snip…] it’s just not possible to make these sorts of mistakes. Mind you, with those error bars I’m not sure I’d waste any effort on the original Shuman et al. 01; “our research suggests it’s either cooling or warming” doesn’t seem to lend a lot to the literature.

  7. asdf
    Posted Mar 10, 2009 at 8:52 PM | Permalink

    There is a read.xls function in R so that you don’t need to use the clipboard.

    • RomanM
      Posted Mar 11, 2009 at 4:58 AM | Permalink

      Re: asdf (#9),

      I went looking for the xlsReadWrite package and the search on cran-R indicates that the package has been removed although earlier versions are apparently available. There is a read.xls function in the package gdata but it appears to require having perl installed on your computer.

      One could also save the file in csv format, but my point in reading the file through the clipboard was to show users less familiar with R that there is a quick-and-dirty way to get data into R. I have also used this method to extract partial data from text files in my browser without saving the file and writing a more complicated script to extract the part I needed. You just have to remember not to accidentally copy something else into the clipboard while you are doing this.
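      For completeness, here is a clipboard-free variant along the lines asdf suggests: save the sheet as a CSV file and read it directly. The file name and the two data rows below are placeholders to keep the sketch self-contained, not actual Shuman values; with a real file, only the last line is needed.

```r
# Clipboard-free variant: save the annual sheet as a CSV file and read
# it directly. The file name and the data rows are placeholders, NOT
# actual Shuman values; with a real file only the last line is needed.
writeLines(c("Byrd,Lettau,Lynn,Siple",
             "-28.1,NA,-25.3,-24.0",
             "-27.6,-30.2,NA,-23.5"),
           "shuman_annual.csv")
shuman = ts(read.csv("shuman_annual.csv"), start = 1979)
```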

      • Peter
        Posted Mar 14, 2009 at 10:34 AM | Permalink

        Re: RomanM (#11),

        “remember not to accidentally copy something else into the clipboard”

        Suggest you install a clipboard extender:

  8. Posted Mar 10, 2009 at 10:56 PM | Permalink

    RE: RomanM, #8,

    I would expect the trend CI’s to be about the same, after adjustment for serial correlation, at the monthly and annual frequencies. There are more monthly observations, but they are noisier, so it should be a wash. Annual might have the advantage that serial correlation might no longer be an issue, however.

    Lettau is on the Ross Ice Shelf, just on the WH side of 180, so some might put it in WH. Lynn is on the EH side of Ross Bay, so it definitely is E.Ant. See
    (Click on image for blowup).

  9. Posted Mar 11, 2009 at 6:01 AM | Permalink

    You need a Glossary of Useful Research Phrases.


    — Sinan

    • RomanM
      Posted Mar 11, 2009 at 12:02 PM | Permalink

      Re: Sinan Unur (#12),

      You can add to that one of my favourites from mathematics:

      It is obvious that … : I think it’s true, but I can’t prove it.

      In an early paper, I used that phrase and the referee replied that this needed to be shown. After a week of trying to prove an obvious fact 😦 , it took less than an hour to find a counter-example. I stopped using the phrase after that. Fortunately, by adding a relatively simple condition, I was able to get the result in a useful form for use in the rest of the paper.

      Re: Kenneth Fritsch (#14),

      Ken, these “nimble” exercises often start out like a heavy ballet dancer before I decide to go out and learn a simpler way of re-writing the R script.

  10. Manniac
    Posted Mar 11, 2009 at 6:38 AM | Permalink

    This site is the true inheritor of the motto of the late Royal Society : ‘Nullius in verba’ – On no-one’s word.

  11. Kenneth Fritsch
    Posted Mar 11, 2009 at 9:06 AM | Permalink

    For AR1 correlations for the AWS and TIR reconstructions from 1957-2006, I found:

    AWS Monthly = 0.20; AWS Annual = 0.27 and for TIR Monthly = 0.32; TIR Annual = 0.17.

    Also I want to add that I much appreciate Roman’s nimble exercises using R. A novice like me can learn from them and appreciate them more, the more I learn about R.

  12. Ryan O
    Posted Mar 11, 2009 at 12:12 PM | Permalink

    An interesting quote from Shuman:

    However, caution must be exercised in extending this result inland to AWS Byrd. This site’s positive slope is dependent on the first annual value (cooler, see Fig. 6), which is derived solely from extrapolated TC values. Without that point, the Byrd trend would be slightly negative making it more compatible with the trends at Lettau and Lynn and in opposition to Siple. This is in close agreement with the results reported by Comiso (2000), who reports an overall tendency toward cooling across Antarctica over 1979–98 period but with warming along the Antarctic Peninsula (see Comiso 2000, Fig. 2). It must be noted here that the Siple record is the most heavily dependent on extrapolated TC data. In any event, replacement of the AWS units at both Siple and Lynn would allow continuation of the composite record into the future giving longer records that could better detect small but significant trends. This would also improve definition of climate baselines for these specific sites around West Antarctica.

    Exclude that first point from the Byrd record and the trend becomes:
    Slope: -0.0093 Deg C/yr +/-0.0624.
    Now go to Fig. 9 in the Shuman article and look at the scatter plot for Byrd. Note how the satellite data from Byrd tails strongly for the cooler temps (actual – calibrated), which is why Shuman cautioned about drawing conclusions from Byrd’s trend. Shuman again says as much in the caption. The quoted number for Byrd that Steig used is likely not reflective of actual surface air temperatures and should not have been used without a caveat (my opinion).
    I think a case could successfully be made against the inclusion of Lynn and Lettau because of their shorter records in Shuman, but Siple, at 38% complete, actually contains fewer ground measurements than either Lynn or Lettau (1st paragraph in Section 2 of Shuman). Also, in Fig. 9, Siple not only demonstrates a downturn at the lower end, but also an upturn at the higher end, which would indicate that the calculated trend is likely to be exaggerated.
    Inclusion of Byrd and Siple – and exclusion of Lynn and Lettau – in the Steig graphic without accompanying explanation has a cherry-picking feel to it . . . especially given how explicit Shuman was about the calibration error between satellite and ground temp measurements. I cite as evidence Roman’s quote where they tout only the upside of the 37 GHz measurements and mention none of the difficulties outlined by Shuman.

    • RomanM
      Posted Mar 12, 2009 at 6:58 AM | Permalink

      Re: Ryan O (#16),

      However, caution must be exercised in extending this result inland to AWS Byrd. This site’s positive slope is dependent on the first annual value (cooler, see Fig. 6), which is derived solely from extrapolated TC values. Without that point, the Byrd trend would be slightly negative making it more compatible with the trends at Lettau and Lynn and in opposition to Siple. This is in close agreement with the results reported by Comiso (2000), who reports an overall tendency toward cooling across Antarctica over 1979–98 period but with warming along the Antarctic Peninsula (see Comiso 2000, Fig. 2).

      Good eye! I hadn’t read through the Shuman paper so I didn’t see this particular quote. There is certainly no mention of this caveat in the Steig paper, which makes the use of the numbers in Figure 3 more questionable.
      I decided to go a little further into comparing the reconstructions for Byrd by looking at the monthly data. This was complicated somewhat by the fact that Shuman is given in actual temperatures while the Steig reconstruction is in anomalies. I managed to straighten this out using the actual measured data.

      [Added note] Grrrr! The “resizing” feature in WordPress has again altered the graph. It was fine when it left the R-shop. The black line is actually continuous. Sorry about that.

      In R:

      #Assuming that the following exist as in the initial post:
      #Data, Info, recon_aws

      #copied Shuman monthly data to a text file
      #with header and with NAs inserted in appropriate places
      #file name: Shuman_monthly.txt
      shuman.mon = read.table("Shuman_monthly.txt", header = TRUE)
      #change Kelvin to Celsius
      shuman.mon = ts(shuman.mon, start=c(1978,11), frequency=12) - 273.15

      byrd = ts.union(recon_aws[,3], shuman.mon[,1], Data$aws[,3])
      colnames(byrd) = c("steig","shuman","aws")

      #easier to recreate steig temperatures from anomalies
      #calculate monthly adjustments used by Steig
      dif.staws = byrd[,3]-byrd[,1]
      # [1] NA -27.39 -30.39 -30.47 -32.03 -35.77 -37.63
      # [8] -35.39 -28.68 -21.05 -13.79 -20.67 -14.09
      #twelve distinct non-missing values (to two decimals)

      byrd.steig = byrd[,1] + rep(dif.staws[37:48],20)[1:230]
      byrd.shuman = byrd[,2]
      dat.avail = -5*(byrd[,3]<20)
      matplot(time(byrd), cbind(byrd.shuman - byrd.steig, dat.avail), main="Byrd (Shuman-Steig)", ylab="Difference (C)", xlab="Red line indicates where actual Byrd data is available", type="l", lty=1, lwd=c(1,3))

      I have added a red line where actual AWS data is available.

      The graph explains the large difference in the two reconstruction trends. It is almost completely due to the (often large) difference in the two satellite recons. Another curious feature is that the temperatures where AWS data exists are not identical. Steig’s values are basically the measured temperatures. It isn’t clear what Shuman used.

      • Ryan O
        Posted Mar 12, 2009 at 7:38 AM | Permalink

        Re: RomanM (#20),

        Another curious feature is that the temperatures where AWS data exists are not identical. Steig’s values are basically the measured temperatures. It isn’t clear what Shuman used.

        I believe this would explain the difference (from Shuman):

        The AWS data were obtained from the 3-hourly files at the anonymous ftp site operated by the University of Wisconsin—Madison (see Table 1). These data are quality-controlled averages of 10-min AWS observations that are also available from this site. Daily average temperature values are then derived from the 3-h data for comparison to the daily average passive microwave TB (see Figs. 2a–d).

        Shuman started with the 3-hourly data and turned it into daily data. He filled in missing data from the 37 GHz observations on a daily, not a monthly, basis. He also may have calculated daily/monthly averages differently than UWisc (for example, as a guess, say UWisc used max-min and Shuman used a time average) . . . and who knows what kind of data changes have been made since Shuman did the paper.

        • romanm
          Posted Mar 12, 2009 at 9:00 AM | Permalink

          Re: Ryan O (#22),

          The measurement frequency could certainly explain some of the smaller wiggles. Infilling daily observations with satellite data would be needed to account for the larger spikes (e.g. around 1983), which are on the order of 2 C for a monthly mean! I haven’t looked at the daily data at all yet, but I might have a chance later.

        • Ryan O
          Posted Mar 12, 2009 at 9:09 AM | Permalink

          Re: romanm (#24), Ah, if only Shuman had archived the AWS data he had used . . .

  13. John Andrews
    Posted Mar 11, 2009 at 11:57 PM | Permalink

    Be careful people, you are getting close to picking the data you like.

    John Andrews in Knoxville

    • romanm
      Posted Mar 12, 2009 at 7:21 AM | Permalink

      Re: John Andrews (#19),

      I’m not sure exactly what your point is.

      There are several reasons why I think that what we are doing in this thread is important.

      There is the obvious fact that, if Steig et al. give incorrect results, then the quality of the remainder of the statistical treatment in the paper (particularly the “black box” portion where errors are less obvious) becomes somewhat more suspect.

      However, IMHO, the more egregious issue is that if the results are not corrected in a public way, they enter the mainstream of “knowledge” and will be quoted as indisputable fact by later authors.

      There is also the issue of “laundering” previous work as well. Ryan O (#16) pointed out that Shuman placed a caveat on the appropriate interpretation of their Byrd results. The quotation by Steig removed this caveat and future authors can now quote the more recent Steig version of Shuman instead. Unless one goes back to Shuman this change won’t be noticed. Previously unreliable results thus gain a patina of undeserved reliability.

      • Eric Anderson
        Posted Mar 12, 2009 at 8:03 AM | Permalink

        Re: romanm (#21),
        Well stated, Romanm. I’m not sure what the outcome of the investigation will be in this particular instance, but the effort is important.

      • steven mosher
        Posted Mar 12, 2009 at 10:31 AM | Permalink

        Re: romanm (#21),

        Exactly. And science becomes a game of Chinese whispers.

        • Kenneth Fritsch
          Posted Mar 12, 2009 at 2:09 PM | Permalink

          Re: steven mosher (#27),

          To me, a paper reference without the original caveats, particularly when the original author made those reservations clear, goes beyond Chinese whispers and could amount to a deliberate attempt to mislead. Of course, I have seen that in the game where a participant deliberately changes what he heard so as to make the end result more humorous.

      • Posted Mar 13, 2009 at 2:59 AM | Permalink

        Re: romanm (#21) and Re: Kenneth Fritsch (#32),

        This “cleansing” of the caveats sounds to me remarkably like some of the stuff done by highly paid merchant bankers with mortgages and CDOs. You start off with a bunch of home loans, some good some bad. Mix them together in a large pot and stir on a medium flame for 30 minutes taking care not to let the mixture boil over. Then split off into tranches of debt class with Tranche 1 being AAA and lowest rate etc. Finally take the lowest debt class mix with some other lower debt class from another pot and repeat.

        The proprietor of this site is always talking about how the methods used in climate science would not be acceptable in finance/economics. Perhaps here we have an example of why they aren’t acceptable.

        • RomanM
          Posted Mar 13, 2009 at 8:54 AM | Permalink

          Re: FrancisT (#36),

          I would not attribute any sort of nefarious motives or foul play to the “laundering” in this case. It seems to me to be merely a case where the quoted values offered “evidence” for the main point of the paper and it was convenient not to raise issues that might reduce the value of that evidence. I don’t think that some of the researchers even understand the effect that such laundering may have on the future use of the quoted material.

          Regardless of the motives, I believe that this reduces the credibility of the results of the paper as much as the misstatement or miscalculation of the numerical results.

          Re: Jorge Mata (#37),

          Many of these issues have been discussed on CA quite often. You might try doing a search of CA for “Global Temperature” using the Google box at the top right hand corner of the page to see some of the threads in which this was done.

        • Posted Mar 13, 2009 at 12:33 PM | Permalink

          Re: RomanM (#38), thanks, Roman, I posted that because Ryan O said “In cases like this, there is no NIST standard for “surface air temperature” – and even defining what is meant by “surface air temperature” is a whole topic in and of itself.”

  14. Ryan O
    Posted Mar 12, 2009 at 9:20 AM | Permalink

    Roman, one other thought that occurred to me. For a while I’ve been playing with GHCN records for Siberia and calculating monthlies in different ways. I have actually seen differences in excess of 2C for a month that are due solely to the method of calculating the monthly average. You wouldn’t think that the method could cause such a large disparity, but it can. Then Steig’s paper came along and I kind of dropped the Siberian thing.

  15. romanm
    Posted Mar 12, 2009 at 10:51 AM | Permalink

    Geoff Sherrington (#17), I don’t think that the issue of error that you are speaking of is exactly the same one that we are dealing with here.


    However, I am probably over-lecturing here, but here goes (even though it is slightly OT).


    Statisticians tend to think of estimators (of a population parameter) calculated from data with random properties as random variables. The estimator has a (sampling) distribution and a mean and standard deviation. The concepts of error, precision and accuracy stem from considerations of the properties of that distribution. In particular, error is defined as the difference between the estimate generated by the estimator and the parameter value.


    Bias is defined as the difference between the mean of the sampling distribution and the value of the parameter. When the bias is non-zero, the estimator consistently overestimates (or underestimates) the parameter inducing an average “error” of a fixed size in a specific direction. This error can be due to (sampling) bias in the original measurements or by the procedure used to calculate the estimator. Sampling bias is usually not correctable without information external to the specific sample at hand. Calculation bias can also be difficult to determine (particularly in a complicated procedure such as RegEM), but there have been some techniques developed for reducing bias (e.g. jackknife procedures).


    The remaining part of the error equation stems from the sampling variability of the estimator. Even when the estimator is unbiased, the estimate can have a high probability of ending up a long way from the parameter value. If the estimator can be shown to have certain properties (e.g. consistency – defined in a probabilistic sense), these types of error can be reduced by increasing the sample size, i.e. the information on which the estimate is based.


    The concepts of precision and accuracy stem from a combination of these two qualities mentioned above. The particulars of each situation need to be looked at individually to assess the error structure for that situation.


    Some general principles do apply. For example, errors due to sampling variability usually propagate in a reasonably straightforward fashion (after taking possible dependencies within the data into account) when summarizing a structure. The overall variability can often be estimated from the end result. The same is usually not true for bias. Bias errors could be passed along without much change through the entire procedure and without the ability to determine their final size.
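    The contrast above can be sketched in a few lines of Python. The numbers here are purely illustrative (a hypothetical constant bias of 0.5 and noise with standard deviation 2), not anything from the Steig data: the random part of the error of a mean shrinks roughly as 1/sqrt(n), while the bias passes through untouched.

    ```python
    import random

    random.seed(0)
    true_value = 10.0   # the parameter we are trying to estimate
    bias = 0.5          # hypothetical constant bias in every measurement
    noise_sd = 2.0      # hypothetical random measurement error (std. dev.)

    def mean_of_n(n):
        # each reading = truth + fixed bias + independent random noise
        readings = [true_value + bias + random.gauss(0, noise_sd) for _ in range(n)]
        return sum(readings) / n

    # The sampling error shrinks roughly as 1/sqrt(n);
    # the bias remains no matter how large n gets.
    for n in (10, 1000, 100000):
        print(n, round(mean_of_n(n) - true_value, 3))
    ```

    However many readings are averaged, the estimate settles near truth-plus-bias rather than truth, which is the sense in which bias cannot be reduced from within the sample itself.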


    In the case of GISS adjustments, there could be an effect due to induced (usually positive) correlation between nearby stations. This could lead to underestimation of the size of the “error bar” due to sampling. Any bias induced in the procedure would shift the results and be difficult to account for without looking at the details. As well, the original error bars would no longer be appropriate.


    IMO, any thread in this direction would be better grounded in specific cases where the “formalism” could have a real-life interpretation … and would have to be initiated by our host.

    • Geoff Sherrington
      Posted Mar 14, 2009 at 3:31 AM | Permalink

      Re: romanm (#28),

      Thank you. I shall raise the topic of error propagation with Steve when he’s ready, which, as you note, is the correct procedure. However, I would infer that you have a lot to contribute, and I hope that you do if it happens.

      The lecture you gave me differs in no material aspects from ones I gave to staff 30 years ago. But, there has been an observable change in climate “science”. That change can briefly be summarised as obtaining confidence intervals that are an order of magnitude or more better than reality, while failing to authenticate their derivation. The start of this error obfuscation is the omission of calibration data.

      Given the satellite involvement in the Steig paper, one can study satellite errors from literature. If you wish to give yourself a fright, try … and …

      Although this is specific to Landsat, the scope for errors as described (but not individually quantified) is huge. To quote,

      A major objective of the Landsat-7 program is to upgrade the radiometric quality of the data to be commensurate with the other sensors in the Earth Observing System (EOS). Unlike its predecessors, a specific goal of the Landsat-7 program is to achieve radiometric calibrations of the data to ± 5% uncertainty over the 5 year life of the mission. Pre-launch, the mission design supports this requirement through hardware design changes, and instrument characterizations. Post-launch or on-orbit, this 5% requirement is supported by a monitoring and calibrations program, and the implementation of any necessary changes to the ground processing of the data.

      From whence does one contrive a 0.1 deg C warming increment? Maybe it just rolls easily off the tongue, as does the so often seen +/- 5%. This is more the type of error analysis to which I was referring. And this is only the tip of the iceberg, so to speak.

  16. Fred
    Posted Mar 12, 2009 at 12:00 PM | Permalink

    “Recent research indicates”

    I did a five minute pop quiz with my Thursday evening Grad Student Discussion Seminar Course and interpreted the results.

  17. Johan i Kanada
    Posted Mar 12, 2009 at 1:19 PM | Permalink

    romanm (#28)

    Haven’t you overlooked the most basic sources of errors?
    – Precision (does the thermometer measure to the nearest degree, tenth of a degree, or something else?)
    – Accuracy (does the thermometer measure the temperature correctly, or is there e.g. a constant offset, plus or minus?)

    Maybe I have totally underestimated the power of statistics, but doesn’t the GIGO rule apply also to climate science?

    And, speaking of precision, how can RSS, UAH, and others, report the global average temperature with a precision of one thousandth of a degree, when the thermometers, best case, have a precision of 0.1 degree?

    Please clarify. (Here or in some other thread.)


    • romanm
      Posted Mar 12, 2009 at 2:24 PM | Permalink

      Re: Johan i Kanada (#30),

      This is really OT here, but …

      Haven’t you overlooked the most basic sources of errors?
      – Precision (does the thermometer measure to the nearest degree, tenth of a degree, or something else?)
      – Accuracy (does the thermometer measure the temperature correctly, or is there e.g. a constant offset, plus or minus?)

      What you call accuracy, I have defined as bias: the values generated tend to center around a value other than the one you are trying to estimate, in effect an “offset”. Such an offset may be hard to identify and to measure, which makes appropriate compensation for the bias difficult.

      Precision (as you describe it for the thermometer) can be thought of as measuring to a high precision and then rounding the results to the nearest degree, tenth of a degree or whatever. Assuming that there is no bias in the instrument, this will usually have the result of increasing the variability of the values, thus requiring a larger number of observations to achieve a given level of precision. If the instrument is poor, the number of observations needed to estimate a result with a small error bar could be very large.
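      A minimal sketch of that rounding effect, with hypothetical numbers (a true value of 20.3 and underlying variability of 0.8): quantizing to the nearest degree inflates the spread, yet the average of many unbiased, rounded readings still lands on the true mean.

      ```python
      import random
      import statistics

      random.seed(1)
      true_temp = 20.3
      noise_sd = 0.8                     # hypothetical underlying variability

      fine = [true_temp + random.gauss(0, noise_sd) for _ in range(50000)]
      coarse = [round(x) for x in fine]  # instrument reads to the nearest degree

      # Rounding to a 1-degree grid inflates the spread (adding roughly 1/12
      # to the variance when readings straddle the rounding boundaries) ...
      print(statistics.pstdev(fine), statistics.pstdev(coarse))
      # ... yet with enough unbiased readings the mean is still recovered.
      print(statistics.fmean(fine), statistics.fmean(coarse))
      ```

      The caveat is the “no bias” assumption, and that the underlying variability is large enough for readings to straddle the rounding boundaries; a perfectly constant temperature read on a coarse scale would give the same rounded value every time, and no amount of averaging would recover the fraction.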

      GIGO is another story.

      Results given by RSS, UAH, and others may be to three decimal places, but the error bars are larger than .001. For example, HadCru claims on this page:

      How accurate are the hemispheric and global averages?

      Annual values are approximately accurate to +/- 0.05°C (two standard errors) for the period since 1951. They are about four times as uncertain during the 1850s, with the accuracy improving gradually between 1860 and 1950 except for temporary deteriorations during data-sparse, wartime intervals.

      (I said that they “claim” this. I didn’t say that I necessarily agree with the correctness of their claim 😉 )

    • Ryan O
      Posted Mar 12, 2009 at 3:16 PM | Permalink

      Re: Johan i Kanada (#30) and Re: curious (#31),
      I don’t know about relevant texts or introductory material for climate-specific usages, so I’m sorry I can’t help there, but I can make a comment on the usage of “calibration” in this sense.
      As an engineer, I expect “calibration” to mean that the instrument readings fall within +/- X of the True Value, where the True Value is traceable to a standard at the National Institute of Standards and Technology (NIST). I receive for my payment a certification stating as much. In reality, of course, the instrument reading falls within +/- X of some surrogate of the True Value, which in turn falls within +/- X of another surrogate, and so on . . . until you eventually arrive at NIST.
      Theoretically, it would be possible to obtain all of the calibration records for each of the surrogates, add up their respective errors, and find how far your instrument deviates from the True Value. In practice, these errors are usually distributed fairly evenly around the True Value, so +/- X from the True Value can be substituted for +/- X from the surrogate. You can then do a gage R&R (repeatability and reproducibility) study in order to determine the internal variability of the instrument (the range of readings when taking duplicate measurements) and the external variability associated with the method of use (the error introduced by the user or measurement method). In this way, you can establish confidence intervals for measurements with a high degree of precision.
      In cases like this, there is no NIST standard for “surface air temperature” – and even defining what is meant by “surface air temperature” is a whole topic in and of itself. In this case, the ground station readings (AWS) are defined to be a surrogate for the True Value and the “calibration” is finding a mathematical expression that allows you to predict the AWS reading from a satellite reading within a desired range.
      As you might expect, the results are not comparable to an engineering-quality calibration. Unlike an engineering calibration, there is no calibration record that takes you from your surrogate (AWS) to the True Value (actual surface air temperature – whatever that means). Any bias inherent in the AWS instrumentation passes through the calibration undetected. Because you cannot perform a gage R&R study, any repeatability errors with the surrogate instrumentation cannot be distinguished from real changes in the quantity you are measuring.
      You can estimate errors, but to do so, you have to assume a distribution for the errors. Most often a normal assumption is made, but this is not always accurate and can lead to overly tight confidence intervals. For example, in the Byrd and Siple scatter plots out of Shuman, the distribution of errors is clearly non-normal. Since this scatter plot was done without regard to time, there is the potential of a time-varying error that could also change the results in a meaningful way.
      As far as instrument accuracy goes, I believe Shuman quotes +/- 0.35C for an individual AWS measurement. This is obviously larger than the trend we are looking for, but assuming the instrument error to be a combination of bias + noise (where the noise is evenly distributed around the mean), then by taking large numbers of measurements, you can assume that the noise approximately cancels. The bias doesn’t matter for the trend.
      Whether modeling instrument error as bias + noise is valid is a separate question that depends both on the nature of the measurement and the quantity being measured.
      Without making this post too long, none of the above addresses splicing the surrogate (AWS) and the transformed measurement (satellite) together. If you want to do that, then bias does matter, as splicing instrument readings together with different biases results in a trend. Because there is so much “noise” compared to signal, making sure you’ve eliminated enough of the bias for your results to be meaningful is not a trivial task.
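      The splicing point above is easy to demonstrate with made-up numbers (this is an illustration of the general hazard, not a model of the Shuman splice). A series with no trend at all, measured by two instruments whose only difference is a hypothetical 1-degree offset, acquires a clearly positive least-squares slope the moment the two halves are joined:

      ```python
      import random

      random.seed(2)
      n = 200

      def ols_slope(y):
          # ordinary least-squares slope of y against its index
          m = len(y)
          xbar = (m - 1) / 2
          ybar = sum(y) / m
          num = sum((i - xbar) * (v - ybar) for i, v in enumerate(y))
          den = sum((i - xbar) ** 2 for i in range(m))
          return num / den

      # The truth is trendless; the two halves differ only by a
      # hypothetical 1-degree offset between the instruments.
      first = [0.0 + random.gauss(0, 0.2) for _ in range(n // 2)]
      second = [1.0 + random.gauss(0, 0.2) for _ in range(n // 2)]
      spliced = first + second

      print(ols_slope(first))    # consistent with zero
      print(ols_slope(spliced))  # clearly positive, purely from the offset
      ```

      With real data the offset is not known in advance, which is exactly why eliminating enough of the relative bias before splicing is the hard part.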

  18. curious
    Posted Mar 12, 2009 at 1:21 PM | Permalink

    re: Geoff 17, Ryan O 22 and 26, romanm 28.

    FWIW I’d appreciate a thread on this and any introductory references/texts that are relevant. Also any thoughts on analogous/worthwhile monitoring situations to investigate please – the ones which come to my mind are (say) process plant or civil structures but I think the calibration and operational parameters will be much more tightly controlled and identified than (say) Antarctic AWS. Thanks

  19. curious
    Posted Mar 12, 2009 at 4:47 PM | Permalink

    RyanO and romanm – thanks for the info. I just lost a long response but I think it was heading OT… Might retype on unthreaded at the weekend. I think a thread on these issues would be good.

  20. Posted Mar 13, 2009 at 5:16 AM | Permalink

    NASA’s GISS article on The Elusive Absolute Surface Air Temperature (SAT) here: .
    At the bottom it says: GISS Website Curator: Robert B. Schmunk. Responsible NASA Official: James E. Hansen. Page updated: [Dec 07, 2005]
    Q. What exactly do we mean by SAT ?
    A. I doubt that there is a general agreement how to answer this question. Even at the same location, the temperature near the ground may be very different from the temperature 5 ft above the ground and different again from 10 ft or 50 ft above the ground. Particularly in the presence of vegetation (say in a rain forest), the temperature above the vegetation may be very different from the temperature below the top of the vegetation. A reasonable suggestion might be to use the average temperature of the first 50 ft of air either above ground or above the top of the vegetation. To measure SAT we have to agree on what it is and, as far as I know, no such standard has been suggested or generally adopted. Even if the 50 ft standard were adopted, I cannot imagine that a weather station would build a 50 ft stack of thermometers to be able to find the true SAT at its location.
    Q. What do we mean by daily mean SAT ?
    A. Again, there is no universally accepted correct answer. Should we note the temperature every 6 hours and report the mean, should we do it every 2 hours, hourly, have a machine record it every second, or simply take the average of the highest and lowest temperature of the day ? On some days the various methods may lead to drastically different results.

    Q. If SATs cannot be measured, how are SAT maps created ?
    A. This can only be done with the help of computer models, the same models that are used to create the daily weather forecasts. We may start out the model with the few observed data that are available and fill in the rest with guesses (also called extrapolations) and then let the model run long enough so that the initial guesses no longer matter, but not too long in order to avoid that the inaccuracies of the model become relevant. This may be done starting from conditions from many years, so that the average (called a ‘climatology’) hopefully represents a typical map for the particular month or day of the year.

  21. Johan i Kanada
    Posted Mar 13, 2009 at 12:12 PM | Permalink

    romanm (#33)


    So HadCru believes their estimate of the global mean temperature is within +/- 0.05 degrees C. Great.
    Meanwhile, the [snip] thermometers measure AC outlet air, amplified by concrete structures, the precision of each thermometer is 1 degree C, some of them have been buried under snow for years, and large parts of the globe have no thermometers at all.

  22. Ryan O
    Posted Mar 13, 2009 at 12:16 PM | Permalink

    Hey Roman . . . when you calculated the Byrd/Siple trends from Steig’s data, which recon did you use?

  23. Ryan O
    Posted Mar 13, 2009 at 12:59 PM | Permalink

    I think they may mean that their trends are from the corresponding grid location from the full reconstruction. The claimed confidence intervals look way too small for the AWS recon . . . unless they accidentally reported standard error instead of CI.
    I suspect the former, though. I did the trends at the corresponding grid points here: (#328). I found:
    Byrd 50-yr trend: 0.25 +/- 0.09 (after converting the SE to a 95% CI)
    Siple 50-yr trend: 0.19 +/- 0.06 (after converting the SE to a 95% CI)
    The minor difference is probably due to the fact that I matched up the stations to the grid assuming the grid was referenced from the center of the cell, but I believe it is actually referenced from the southeast corner.
    I don’t have immediate access to the data right now to check the shorter trend.
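    For anyone following along, the “converting the SE to a 95% CI” step is just the standard large-sample conversion for an OLS trend: half-width = 1.96 x standard error of the slope. A self-contained sketch with hypothetical annual anomalies (0.02 deg/yr plus noise; Steig’s actual intervals may additionally adjust for autocorrelation, which this ignores):

    ```python
    import math
    import random

    random.seed(3)
    years = list(range(50))
    # hypothetical annual anomalies: 0.02 deg/yr trend plus noise
    temps = [0.02 * t + random.gauss(0, 0.5) for t in years]

    n = len(years)
    xbar = sum(years) / n
    ybar = sum(temps) / n
    sxx = sum((x - xbar) ** 2 for x in years)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(years, temps)) / sxx
    resid = [y - (ybar + slope * (x - xbar)) for x, y in zip(years, temps)]
    se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)

    half = 1.96 * se                 # large-sample 95% half-width
    print(10 * slope, "+/-", 10 * half, "deg per decade")
    ```

    Note the factor of 10 to convert a per-year slope into the per-decade figures quoted in the paper; reporting the bare SE instead of 1.96 x SE would roughly halve the stated interval, which is the kind of confusion suggested above.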

  24. Ryan O
    Posted Mar 13, 2009 at 1:21 PM | Permalink

    Also – just realized something else – with all the rounding Steig does with his lat/lons for station locations, we could easily be different by a couple of grid cells.

  25. romanm
    Posted Mar 13, 2009 at 2:42 PM | Permalink

    Great stuff! I have just been looking at it in R and you might be on to the answer. If that turns out to be the case, the comparison to Shuman is pretty much apples and oranges, since the Shuman reconstruction uses actual AWS data and the satellite recon does not.

    Your link to #328 took a little searching since it links to this page. My guesses for Byrd and Siple would be series 817 and 324 of the satellite recon (which were simply calculated by minimizing the sum of the absolute difference between the lats and lons). I located more accurate co-ordinates for Byrd (80.007S, 119.404W), but you may be right that extra rounding could move things around.

    I don’t have the time to do any more with it right now, but I will try to look further at things later.

  26. Ryan O
    Posted Mar 13, 2009 at 5:56 PM | Permalink

    Roman, I had used 817 & 324 as well, but had arrived at that by explicitly calculating distance. Not sure what Steig used.
    Going from 1979 up to (but not including) 1997 I get 0.35 +/- 0.39 for Byrd and 0.28 +/- 0.28 for Siple . . . pretty close to what they report. A grid cell or two might easily make up the difference.
    I agree that the comparison is apples and oranges. In order to give their reconstruction legitimacy, they did need to compare it to something. I would have preferred that they compare it to actual data alone (or, better yet, attempt a calibration). Instead, they do no calibration, use RE/CE and correlation statistics that – to be quite honest – aren’t sensitive enough for the application of splicing data sets from different instruments, and compare to the two most problematic series from Shuman. Kind of hokey.
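    On the station-to-grid matching discussed above: minimizing the sum of absolute lat/lon differences can mislead near the pole, because a degree of longitude at 80S spans only about 19 km while a degree of latitude spans about 111 km. Explicitly computing great-circle distance avoids that. A sketch in Python, using Byrd’s coordinates from this thread and deliberately hypothetical cell centres (these are NOT Steig’s actual grid):

    ```python
    import math

    def gc_km(lat1, lon1, lat2, lon2):
        # great-circle distance via the haversine formula; inputs in degrees
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dl = math.radians(lon2 - lon1)
        a = (math.sin((p2 - p1) / 2) ** 2
             + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(a))

    byrd = (-80.007, -119.404)  # Byrd coordinates quoted in the thread
    # hypothetical cell centres, for illustration only
    cells = {816: (-80.0, -121.0), 817: (-80.0, -119.5), 818: (-80.0, -118.0)}
    nearest = min(cells, key=lambda k: gc_km(*byrd, *cells[k]))
    print(nearest)
    ```

    Longitude differences that look large on paper shrink by cos(latitude) on the ground, so the distance-based match and the |dlat| + |dlon| match can disagree by a cell or two, consistent with the rounding ambiguity Ryan mentions.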

    • RomanM
      Posted Mar 14, 2009 at 2:21 PM | Permalink

      Re: Ryan O (#46),

      I’ve been looking at the Steig stuff today and I intend to put up a new post on the topic (or update the old one) in the next day or so. I think it’s more or less figured out.

      Re: Peter (#48), Thanks for the information, but it wouldn’t help the readers much. I’m also not sure that it would interact properly with the front end of R as well.

  27. Johan i Kanada
    Posted Mar 14, 2009 at 4:44 AM | Permalink

    romanm (#33)

    You write

    Precision .. can be thought of as measuring to a high precision and then rounding the results to the nearest degree… Assuming that there is no bias in the instrument, this will usually have the result of increasing the variability of the values thus requiring a larger number of observations to achieve a given level of precision. If the instrument is poor, the number of observations needed to estimate a result with a small error bar could be very large.

    1) But in climate studies, we’re not observing the same thing more than once, so how can the above statement apply?
    I.e. if we measure the temp in N different locations, we do not have N samples of 1 thing, we have 1 sample each of N different things (i.e. with different estimated values and distributions).
    Likewise, if you measure the temp M times (in the same location), separated in time (let’s say a day between each sample), again you do not have M samples of the same thing, you have 1 sample each of M different things (i.e. with different estimated values and distributions).

    2) Furthermore we do indeed have biases, limited precision, and random measurement errors, for every instrument, every where, every time.
    How is that taken into account?

    (As can be seen, I do not understand how these .322 average global temperatures (deviations) can make any sense whatsoever.)

    3) Simple example
    Assume that the measured temps (at some point in time) in 7 different locations are 0, 1, 3, 2, 4, 1, and 0.
    Assume further that each thermometer has an imprecision of 0.5, and that there is no systematic error (bias), nor any other measurement error, in any of the thermometers.
    i) What is the average temperature at this particular time? (2? 1.6? 1.57? 1.571?)
    ii) What is the (estimated) error, taking also instrument imprecision into account?

    If and when you get some time…
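    One conventional answer to the worked example above, under stated assumptions: read “+/- 0.5” as an independent uniform error on [-0.5, 0.5] at each site, in which case the instrument errors partially cancel in the average. This covers only the instrument contribution to the mean of those seven sites; how well seven sites represent a whole region is the separate (and usually larger) spatial-sampling question.

    ```python
    import math

    readings = [0, 1, 3, 2, 4, 1, 0]
    n = len(readings)
    mean = sum(readings) / n          # 11/7, about 1.571

    # Assumption: each "+/- 0.5" is an independent uniform error on
    # [-0.5, 0.5]; the sd of such an error is 0.5/sqrt(3), about 0.289.
    sd_one = 0.5 / math.sqrt(3)
    sd_mean = sd_one / math.sqrt(n)   # instrument part of the mean's error

    print(round(mean, 3), "+/-", round(2 * sd_mean, 3))  # 1.571 +/- 0.218
    ```

    So quoting the average as 1.571 is defensible as arithmetic, but the digits past the first decimal carry no information once the roughly +/- 0.2 instrument term (let alone spatial sampling) is acknowledged.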


One Trackback

  1. […] to some insight from Ryan O (#43), we have been able to make progress on the puzzle of where the values in the Steig paper came from. […]
