Some BEST Tools

Here’s a major complaint about BEST now that I’ve looked at it more closely.

If BEST wanted to make their work as widely available as possible, they should have done their statistical programming in R, so that it was available in a public language, and made their data available as organized R-objects.

I’ve taken a quick look at their code, which is in Matlab; what I’ve browsed looks like transliterated Fortran. I haven’t noticed much vector processing or list processing in the R style. IMO, one of the great benefits of the vector and list processing features of R is that you can write scripts that clearly self-document what you’re doing. I haven’t seen anything in their code files that looks like the sort of R script that I would like to see in order to follow the calculations.

I collated the station data and station information into two R-objects for interested readers, so that the data can be accessed quickly without having to re-collate the rather large original files.

The station information is uploaded to http://www.climateaudit.info/data/station/berkeley/details.tab. It is a dataframe of 39028 rows containing id, name, lat, long, etc, directly collated from the information in the BEST data. Some tidying of trailing spaces and use of NA has been done. It’s 1.8 MB.

The station data is uploaded to http://www.climateaudit.info/data/station/berkeley/station.tab. It’s organized in a style that I’ve used before: it is a list of 39028 objects, each object being a time series of the station data beginning in the first year of data. I didn’t collate the accompanying information about counts and uncertainty. Interested readers can consult the original data for these. Each of the 39028 objects is given a name corresponding to the id in the details file – which I’ve kept as a character object rather than a number (though it is a number). As an organized R-object, this is 39 MB, as opposed to 618 MB if data.txt in the PreliminaryTextDataset were expanded.
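
Once station.tab is loaded (the download code is just below), summaries across all 39028 series are one-liners. A minimal sketch, assuming each element is a monthly ts object as described above:

load("d:/climate/data/berkeley/station.tab") #see download code below
len=sapply(station, function(x) length(x)/frequency(x)) #record length in years
summary(len)
head(sort(len,decreasing=TRUE),3) #the three longest records, by station id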

If you want to look at the BEST results for a given station, here’s how to do it quickly (the example assumes that you keep the data in, say, the directory d:/climate/data/berkeley). The example here is Detroit Lakes 1NNE, the subject of a number of posts in connection with Hansen’s Y2K:

destfile="d:/climate/data/berkeley/details.tab" 
download.file("http://www.climateaudit.info/data/station/berkeley/details.tab",destfile, mode="wb")
load(destfile);
nrow(details) #39028

destfile="d:/climate/data/berkeley/station.tab" 
download.file("http://www.climateaudit.info/data/station/berkeley/station.tab",destfile, mode="wb")
load(destfile);
length(station) #39028

details[grep("DETROIT L",details$name),1:5]
#           id                name     lat     long     alt
#144289 144289 DETROIT LAKES(AWOS) 46.8290 -95.8830 425.500
#144298 144298 DETROIT LAKES 1 NNE 46.8335 -95.8535 417.315

u=station[[paste(144298)]]
ts.plot(u,ylab="deg C (Seas Adj)",main="Detroit Lakes 1NNE")


Figure 1. BEST Station Data for Detroit Lakes 1NNE
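
As a quick extension – a minimal sketch, assuming u is the monthly ts object loaded above – you can add annual means and a simple least-squares trend to the plot:

annual=aggregate(u, nfrequency=1, FUN=function(z) mean(z,na.rm=TRUE)) #12-month means
fit=lm(annual~time(annual))
round(coef(fit)[2]*100,2) #trend in deg C per century
lines(annual,col="red",lwd=2) #overlay the annual means on Figure 1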

There are some puzzles about the station data that I’ll discuss in another post.

Update: Nick Stokes reports:
I made a GHCN v2 version of data.txt. It’s here. I had to split into two, bestdata1.zip and bestdata2.zip (to fit site limit). Each is about 11 Mb. Units are 1/10 deg C. [SM note - this is the same data that I collated into an R-list. For R users, you're far better off using my collation than re-collating from this.]

There is also a zipfile called ALL4inv.zip which has inventories in CSV format of BEST, GHCN, GSOD and CRUTEM3. The fields don’t match GHCN, but may be the best that can be derived from what BEST has published.

There are also lots of KMZ/KML files. The one called ALL4.kmz combines GHCN, GSOD, BEST and CRUTEM3 with a folder structure to show by source and start date.


24 Comments

  1. BillC
    Posted Oct 29, 2011 at 1:06 PM | Permalink

    Maybe if they had done this I would have tried to learn R. Still should. Expanding data.txt was a pain – cutting it into shorter files so text editors and Excel could read it. I couldn’t even get the flags and source-description files to come through. At least I have the same data you describe above, Steve, in the same sort of arrangement but in Excel. (Yeah, I know.) Oh, and I only grabbed it for the US. I thought Matlab programming could be done vector-style like you said, so they just didn’t?

    Probably worth getting and learning R and just getting the files from here.

    Steve: Yup. When trying to handle large data sets, CA readers should NOT use Excel for data analysis. (Or for small data sets, either.)

    • Willis Eschenbach
      Posted Oct 29, 2011 at 2:41 PM | Permalink

      Bill, I learned R at Steve’s urging. I still use Excel for initial analysis because I can see all the data. Somehow that helps me when I have something new to understand.

      But Steve is right, R is the only language worth knowing for the kinds of analysis required in climate science. In most other languages, if I want to add 3 to every record in a datablock DB, I have to do something like this:

      for i = firstyear to lastyear
        for j = firstmonth to lastmonth
         DB[ i , j ] = DB[ i , j ] + 3
        next j
      next i

      In R, the corresponding statement is

      DB = DB + 3

      Which one is easier to write, debug, and understand?

      In addition, as Steve points out, unlike Excel, R can be used as supplementary material for a scientific paper. As in this blog post.

      Well, you could use Excel … but I generally can’t because I have literally dozens of my own user functions. For example, Excel has “Sumif” and “Countif” functions, to sum or count data based on some conditions. So why doesn’t Excel have the following functions?

      Averageif
      Maxif
      Minif

      In addition, I have functions to return Gaussian averages

      Gaussian
      GaussianTrunc

      There’s a bunch more, but you get my point. I can’t just toss my typical Excel spreadsheet up here; it only runs with the associated user functions.

      Finally, consider what Steve has done with R above. In a few lines of plain old ASCII text, nothing proprietary, is the code to download, unpack, analyze, and display the analysis results.

      A spreadsheet that does the same thing in Excel would be a very large file to pass around. So R gets the vote for economy of representation.

      Learn the language, you’ll never regret it. WARNING. Use of R may lead to chronic frustration with your current language. Consult your doctor if you experience programming sessions lasting longer than four hours.

      Finally, Steve, your few lines of text above are hugely valuable, can’t thank you enough.

      w.

      • Nicholas
        Posted Oct 29, 2011 at 5:13 PM | Permalink

        “Well, you could use Excel … but I generally can’t because I have literally dozens of my own user functions. For example, Excel has “Sumif” and “Countif” functions, to sum or count data based on some conditions. So why doesn’t Excel have the following functions?

        Averageif
        Maxif
        Minif”

        I think the problem here is that they have approached it from the wrong angle: Excel started out as a pretty basic program and the language grew up around it; it wasn’t designed around a proper high-level language from the start, the way R was.

        The way this type of problem is normally solved is with a “map” or “apply” type function and closures. The closure is a program snippet and it is “applied” against every element of a vector. It can then contain the if/average or whatever other procedure you want to apply to each element.

        I’m not an R expert but I believe that is the way you would do this in R. It’s much more general than specific functions to handle situations you might come across.
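
        For instance, a minimal sketch of what that looks like in R (the vectors here are just made up for illustration):

        x = c(3, NA, 7, 1, 12)
        flag = c("rural", "urban", "rural", "rural", "urban")
        mean(x[flag == "rural"], na.rm = TRUE)  # an "Averageif" is just subsetting
        max(x[flag == "rural"], na.rm = TRUE)   # likewise "Maxif"
        sapply(split(x, flag), function(v) mean(v, na.rm = TRUE))  # a closure applied per group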

  2. Posted Oct 29, 2011 at 2:09 PM | Permalink

    You can get GNU Octave here. It runs most Matlab-style code.

    http://www.gnu.org/software/octave/download.html

    You certainly can use lists and vectors easily, and there is a built-in stats package. And, since it is engineering-oriented, you can do frequency-domain (Fourier and wavelet) analysis.

  3. Posted Oct 29, 2011 at 2:19 PM | Permalink

    Maybe the R link will motivate some to install the package…

    See here:

    http://www.r-project.org/

    Maybe you can mention which add-on libraries (if any) would be a good idea for following your discussions.

    • BillC
      Posted Oct 29, 2011 at 2:24 PM | Permalink

      ha ha just did that before I even saw your comment. nothing like getting called out by the blog owner.

  4. Wolfgang Flamme
    Posted Oct 29, 2011 at 2:23 PM | Permalink

    Thanks a lot – I haven’t found the time yet to dig into this.
    I’d be interested in obtaining their gridded reconstruction as well.

  5. observer
    Posted Oct 29, 2011 at 3:27 PM | Permalink

    R is good—But.

    Based on my experience many engineers and scientists use MatLab. It’s great, albeit expensive. There is a free clone named Scilab (www.scilab.org) that seems to be pretty good.

    In Matlab, to add three to every element of a vector one writes
    x = x + 3
    which is no harder than the corresponding R code
    x = x + 3

    Matlab has some capabilities for interfacing with lab equipment and it has extensive libraries for tasks like signal processing.

    In contrast, a lot of stat people use R—and its libraries focus on statistics.

    If you are interested in the differences, search the web for matlab vs. r.

    Both are great tools. In the case of BEST, I suspect that people were just using the tools that they often use and are most familiar with.

    Steve: I’m sure that they were using the tools that they were most familiar with. However, the criticism remains. If their objective is to reach people, then R, as both free software and better statistical software, trumps Matlab. Their exercise was not interfacing with lab equipment but statistics. They should have got someone on their team who could program in R.

    Mosher lives in the Bay area and has gotten very skilful with R, for example.

    • Posted Oct 30, 2011 at 12:29 AM | Permalink

      I offered to turn their Matlab into R months ago; however, after the dust-up with the congressional testimony and Romm claiming that I was somehow an evil force in the Muller camp, we just decided to wait. I put together the skeleton of an R package for reading in their data and making custom S4 class objects for their project. It’s collected dust. We will re-engage shortly. I have no desire to own the whole project, but will gladly farm out pieces and parts to folks who want to help and who can use <- instead of = for the assignment operator.. hehe.

      more next week.

      One thing people forget about R: if you ever wonder how a built-in function works, you have the source.
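
      For example, a quick illustration at the R prompt:

      sd                     # type a function's name without () to print its R source
      getAnywhere("plot.ts") # finds a function even if it is hidden in a package namespace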

  6. Daublin
    Posted Oct 29, 2011 at 5:21 PM | Permalink

    To be more precise with the blame, I would say it’s a matter of poor programming rather than a poor programming language. You can write good code in Matlab if you bother to try.

    And you should. If you are doing a data analysis that matters, then you want a way to convince yourself that you’re calculating what you think you’re calculating. You want to “audit” yourself before you publish something and potentially make a fool of yourself.

    If you want to find screwups in someone’s analysis, look for the sloppiest code. That’s going to be the part where they haven’t thought much about what the calculation is doing. That will be the part where they repeatedly hacked the code until the output looked like they expected.

    • Wayne
      Posted Oct 29, 2011 at 10:53 PM | Permalink

      Matlab is a terrible programming language. To pass named (optional) arguments to a function, you pass a list of alternating strings (the argument “names”) and values, and then parse it inside the function with yet another function. Huh? Perhaps it should also require you to use Roman numerals?

      Matlab’s widely used, it’s fast, and it’s very succinct for small jobs, but it’s a badly-designed language. R has its issues, but it was also ahead of its time and it still fits in with modern languages. And it’s free. (And it has 2,000 some free packages to handle any statistical problem you might imagine.)
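
      By contrast, a sketch of how optional arguments look in R (the running-mean function here is just an illustration):

      running_mean = function(x, window = 12, centred = TRUE) {
        # defaults live in the signature; callers override them by name, in any order
        filter(x, rep(1/window, window), sides = if (centred) 2 else 1)
      }
      running_mean(rnorm(120))              # uses the defaults
      running_mean(rnorm(120), window = 24) # named override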

  7. Steven Mosher
    Posted Oct 29, 2011 at 7:47 PM | Permalink

    I’ve got a BEST package around here somewhere based on an early data release. As soon as I finish some other work I’ll post it up.

  8. observer
    Posted Oct 29, 2011 at 9:01 PM | Permalink

    Steve, well, you may be right. My point is that MatLab is widely used in engineering and, I think, in the physical sciences. In contrast, I bet many in those fields don’t even know R exists.

    I will admit, my view is based on my experience. But, I recall more than a decade ago seeing books with MatLab examples in the appendix. I’ve never seen that with R—although there probably are such books.

    See, for example:

    http://www.amazon.com/Digital-Signal-Processing-Using-MATLAB/dp/1111427372/ref=sr_1_1?s=books&ie=UTF8&qid=1319939624&sr=1-1

    A search on amazon reveals 118 hits for MatLab; a search for R gets a million hits but, as best I can tell by looking quickly, only about 20 of them involve R—the others are books like “Industrial R&D Management”.

    My perception is that far more people use MatLab than use R. I expect that will change within the next decade because the price of R is much lower and the quality high. But I don’t think it is reasonable to criticize physicists for using MatLab; that’s almost like criticizing them for using calculus.

    Steve: statistics books now will use R code in the running text, not just appendices. Yes, I know that physicists are comfortable with Matlab. If they want to use it for physics papers, fine. However, if their objective was outreach, R would have been better. Sometimes people have to think about the audience and the market.

    • Steven Mosher
      Posted Oct 30, 2011 at 11:11 AM | Permalink

      examples in the book?

      child. in R the book can be executed. google sweave.

  9. Posted Oct 30, 2011 at 1:22 PM | Permalink

    Thanks Steve

  10. Tom
    Posted Oct 31, 2011 at 6:03 AM | Permalink

    I am mostly a software engineer (C/C++, Java) and a moderate MATLAB user for time-series analysis. Every six months or so I get sick of pushing the need for a MATLAB license onto my clients and try learning R. I have to say I find R syntax almost incomprehensible, or at least completely unmemorable. It seems to regularly break assumptions common to almost every other language. They are actually capable of roughly the same things, and in about the same amount of code, if it’s written well. MATLAB has just as good vector/matrix-oriented operations as R. It’s just that MATLAB syntax is very reminiscent of what you’d expect from undergraduate math, where R syntax is completely different. Some examples:

    Writing matrix literals. In MATLAB, you write it exactly as you would on paper – a square bracket on either side, newlines between rows. In R, I have to look up the documentation on this ‘matrix’ function, because you must specify at least three arguments (four? not sure) to get a usable matrix. And it depends on the ‘c’ function, which constructs vectors, which for some reason are a separate data type and are constructed in a completely different way. ???
    Multiplying arrays. In MATLAB, ‘A*B’ does what I expect. In R, it says, “non-conformable arrays”. In every other language, ‘*’ means multiply. Multiplication of two matrices is a well-understood operation. Why does ‘A*B’ then not multiply two matrices? It has some bizarre new set of operators for matrices, apparently.
    Combining arrays. In MATLAB, C = [A B], just like in ordinary math. In R, I need to find out how some cbind function (? or is this just for vectors?) works.
    In R, functions are always pass-by-value. Except when they aren’t – see the ‘names’ function for an example that appears to modify its arguments.
    Try this in R:


    A = c(1,2,3,4)
    B = c(1,2,3,4)
    C = c(-1,-2,-3,-4)
    plot(A,B,type='l',col='blue')
    plot(A,C,type='l',col='red')

    Does the axis autoscale, or not?
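
    For reference, the R counterparts of those operations look roughly like this (a minimal sketch):

    M = matrix(c(1, 2,
                 3, 4), nrow = 2, byrow = TRUE) # a matrix written row by row
    N = diag(2)                                 # 2x2 identity
    M %*% N                                     # matrix multiplication ('*' alone is elementwise)
    cbind(M, N)                                 # combine side by side; rbind() stacks rows
    A = c(1, 2, 3, 4); C = c(-1, -2, -3, -4)
    plot(A, A, type = "l", col = "blue", ylim = range(A, C)) # fix the y-range up front...
    lines(A, C, col = "red")                    # ...then overlay, instead of a second plot()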

    There are just too many places where R breaks your assumptions. MATLAB succeeds because people who use it all the time know all the tricks to unlock all its power, while people who use it occasionally find it intuitive enough to get by. R, IMO, will remain more niche because people who use it all the time know all the tricks to unlock all its power, while people who use it occasionally find it very frustrating and come away thinking WTF?

    OTOH, of course, I have someone to pay the £3k for a MATLAB license, or whatever it is this month.

    And, on one incidental note, a brief outline of what comment syntax is available here would be really nice. I’m sure the above turns out all over the place, but I can’t seem to find any guide to how to format comments. The blog setup page claims that there is live preview, but I can’t see it. There’s not even a preview button to see if your formatting worked.

  11. Posted Oct 31, 2011 at 6:40 PM | Permalink

    Tom,

    I’m also a software developer but not a Matlab or R programmer (yet).

    When Climategate kicked off I decided that it was about time the GHCN V2 dataset was put into an RDBMS (in third normal form). After writing some code to do this, I decided it would also be a good idea to write some code to fit trends to the raw and adjusted data for each monthly series for each GHCN station. Later, after doing some research on how to create Google KML files, I also wrote some code to generate a KML file showing colour-coded placemarkers (based on their warming/cooling trends) for all the stations in the GHCN V2 and V3 datasets, and for all the stations in each country/WMO region for a range of different time periods. I also generated charts showing the warming/cooling trends for the raw/adjusted yearly data series for each individual station for each time period. I called the software tool I developed TEKTemp and if you are interested you can read some further details about it by clicking on the following link

    http://diggingintheclay.wordpress.com/2010/01/04/climate-database-development/

    and you can download some KML via the following link

    http://www.climateapplications.com/kmlfiles.asp

    If you are interested in downloading the ‘full set’ then drop me an email and I’ll be happy to send you the zip file containing all the country/region KML files.

    Paging Mosh……

    Mosh, I’ve yet to learn and become proficient in R, and in the intervening time I’d like to use TEKTemp to process the recently released BEST dataset. I’ve written the TEKTemp application so that it can import datasets provided they are in ‘GHCN text format’, i.e. ASCII files in the format of the GHCN station inventory and raw/adjusted monthly station temperature files. By chance, Mosh, do you have the BEST dataset (station inventory and temperature files) in this format? It doesn’t matter if they aren’t exactly GHCN format, as I can change the import code to cope. If you do have any files in approximately GHCN format and can supply a download link to them, it would be much appreciated. If you could drop me an email with a download link or post a reply with the link in this thread, that would be great. Anthony may well have mentioned my name to you recently in regard to setting up a GHCN version of the surface stations project? Hopefully we’ll be in contact to discuss collaborating on this project shortly.

    Cheers

    KevinUK

    • Posted Oct 31, 2011 at 9:50 PM | Permalink

      Kevin,
      I made a GHCN v2 version of data.txt. It’s here. I had to split into two, bestdata1.zip and bestdata2.zip (to fit site limit). Each is about 11 Mb. Units are 1/10 deg C.

      There is also a zipfile called ALL4inv.zip which has inventories in CSV format of BEST, GHCN, GSOD and CRUTEM3. The fields don’t match GHCN, but may be the best that can be derived from what BEST has published.

      There are also lots of KMZ/KML files. The one called ALL4.kmz combines GHCN, GSOD, BEST and CRUTEM3 with a folder structure to show by source and start date.

      • Steve McIntyre
        Posted Oct 31, 2011 at 9:56 PM | Permalink

        thanks for this, Nick. I’ll refer to this in the head post.

      • Posted Nov 1, 2011 at 4:08 AM | Permalink

        Nick,

        You are a STAR!!

        It looks like you’ve broken the back of the conversion-to-GHCN-format process, which will save many of us a lot of effort and is much appreciated. When I’ve got more time later this week, I’ll analyse the data and put up a version of the TEKTemp ‘admin system’ so others can view the BEST data using a nice user-friendly interface. The TEKTemp RDBMS database(s) are in MS Access 2003 .mdb format and I’ll make them available for download as well.

        Does Roger P Snr know about the KMZ/KML files you’ve produced? He is keen to find out how much the BEST dataset overlaps with the GHCN V3 dataset. From my previous analyses of both datasets, I’ve concluded that there are very few differences in the station inventory metadata (e.g. station GPS location data) between the GHCN V3 and GHCN V2 datasets.

        I presume you made a GHCN V2 version of the BEST dataset because of the different way NCDC handles ‘duplicates’ and combines stations in V3 versus V2? I have versions of TEKTemp that can handle both.

        Just to confirm Nick, are your temperature files Tave monthly and NOT Tmax?

        Once again thanks for all your effort and keep up the good work!

        KevinUK

        • Posted Nov 1, 2011 at 4:32 AM | Permalink

          Alas, perhaps a dimmed star. I didn’t do anything clever about duplicates – I just took the first one. I should have noted that. I had meant to review that; for the moment I was just counting stations and coverage.

          But yes, I believe they are TAve – I downloaded the second file they put up after the warning. But I’ll double check.

          Yes, Roger P knows about it – there is some chit-chat round about here.

        • Posted Nov 1, 2011 at 4:58 AM | Permalink

          Kevin,
          I’ve checked – I downloaded the data.txt on 23 Oct and I understand it is TAve. On duplicates, I now remember why I didn’t do anything special. I don’t think there are any. I’ve just checked again.

        • Posted Nov 1, 2011 at 7:04 AM | Permalink

          Nick,

          Thanks for the update. No matter what you’ve done you’ve made a good start.

          It really is time that the climate community got together and agreed upon a common format (an XML schema, maybe) for how station inventory and daily/monthly temperature data should be stored and interchanged (the X in XML stands for extensible).

          I thought that this was one of the things that BEST was going to do, and since they haven’t, like many others I’m sadly disappointed. In fact I’m left wondering what exactly it is that BEST has brought to the party.

          Now why stop at station inventory data and daily/monthly temperature data? Why not widen the scope so that it includes other metadata and data, e.g. images/video of the station as viewed from the 8 different points of the compass, etc. Just imagine how much easier it would be to look at the relative differences between GISS/CRUTEM/BEST etc. if these datasets were all stored in a fully relational, structured common format. Yes, I know R is very powerful and can do a lot of this, but what I’m aiming at would be much more user-friendly than that and far more along the lines of what the KNMI Explorer, Appinsy and WoodfortheTrees web sites have achieved, i.e. access to and the ability to analyse the data without any specialised programming (R, Python, Ruby etc.) skills. This is one of the tasks I’m hoping to try to accomplish shortly once Anthony W and Roger P Snr’s GHCN surface stations project gets underway.

          At the moment the USHCN surface stations project data is stored in a database schema used by a piece of open-source image management software called Gallery. It’s really not up to the job, as all too often the ‘tail’ (the database schema and software) will be ‘wagging’ the ‘dog’ (the data/metadata and how it should be structured relationally). Rather than use customised fields in the Gallery database schema, it’s far better IMO to start from scratch and design a new schema around how the station data needs to be structured (now and in the future). We can then build a ‘fit for purpose’, user-friendly interface on top of the data stored in this new, equally ‘fit for purpose’ schema, complete with a backend ‘admin system’ for ‘bulk’ importing, cleansing and updating data within the database. MySQL and PHP, anyone?

          Regards

          KevinUK

  12. MikeN
    Posted Nov 3, 2011 at 8:25 PM | Permalink

    Complaining that a scientific paper used Matlab instead of R is quite petty.

One Trackback

  1. [...] data from BEST, plotted by Steve McIntyre.  Glad to see Steve is on the case, big time it seems (Some BEST Tools and subsequent posts). Overlay of two data sets for Los Angeles combined to fit scale and time. [...]
