Annoying Fortran Formats

Earth to climate scientists: no one uses Fortran punch cards anymore. When you put data on the web, it doesn’t have to fit into 80 columns as though it were a punch card. And when home computers have 100 GB of memory, you don’t have to squish multiple records onto one line. Yet hurricane track data – the Atlantic basin version is here – is archived on the web so that it fits into 80-column punch cards, just as in the 1960s. (Dendro data is the same.) The first few storms are shown below. There are header “cards”, trailer “cards” and fixed-format fields. A Fortran program is provided at the site to read the data. Reading this into R requires a fiddly little program – easy enough to do, but a pointless nuisance (a sketch follows the sample below). I’m in the process of making R tables out of the data for the 6 ocean basins and will save the results as flat files. Aside from the nuisance, the continued use of Fortran makes everything much more work than it needs to be with modern languages like R.

00005 06/25/1851 M= 4 1 SNBR= 1 NOT NAMED XING=1 SSS=1
00010 06/25*280 948 80 0*280 954 80 0*280 960 80 0*281 965 80 0*
00015 06/26*282 970 70 0*283 976 60 0*284 983 60 0*286 989 50 0*
00020 06/27*290 994 50 0*295 998 40 0*3001000 40 0*3051001 40 0*
00025 06/28*3101002 40 0* 0 0 0 0* 0 0 0 0* 0 0 0 0*
00030 HRBTX1
00035 07/05/1851 M= 1 2 SNBR= 2 NOT NAMED XING=0
00040 07/05* 0 0 0 0* 0 0 0 0*222 976 80 0* 0 0 0 0*
00045 HR
00050 07/10/1851 M= 1 3 SNBR= 3 NOT NAMED XING=0
00055 07/10* 0 0 0 0* 0 0 0 0*120 600 50 0* 0 0 0 0*
00060 TS
00065 08/16/1851 M=12 4 SNBR= 4 NOT NAMED XING=1 SSS=3
00070 08/16*134 480 40 0*137 495 40 0*140 510 50 0*144 528 50 0*
00075 08/17*149 546 60 0*154 565 60 0*159 585 70 0*161 604 70 0*
00080 08/18*166 625 80 0*169 641 80 0*172 660 90 0*176 676 90 0*
00085 08/19*180 693 90 0*184 711 70 0*189 726 60 0*194 743 60 0*
00090 08/20*199 759 70 0*205 776 70 0*212 790 70 0*219 804 70 0*
00095 08/21*226 814 60 0*232 825 60 0*239 836 70 0*244 843 70 0*
00100 08/22*250 849 80 0*256 855 80 0*262 860 90 0*268 863 90 0*
00105 08/23*274 865 100 0*280 866 100 0*285 866 100 0*296 861 100 0*
00110 08/24*307 851 90 0*316 841 70 0*325 830 60 0*334 814 50 0*
00115 08/25*340 800 40 0*348 786 40 0*358 770 40 0*368 751 40 0*
00120 08/26*378 736 40 0*389 718 40 0*400 700 40 0*413 668 40 0*
00125 08/27*428 633 40 0*445 602 40 0*464 572 40 0*485 542 40 0*
00130 HRAFL3 GA1
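
For the curious, here is a minimal R sketch of the sort of fiddly program required. The field positions are my own reading of the sample above (latitude and longitude in tenths of a degree, 0 apparently meaning missing), so treat the widths as assumptions to check against the official HURDAT documentation rather than as the definitive layout.

    # Sketch: parse old-style HURDAT "cards" into an R data frame.
    # Field positions inferred from the sample; verify before relying on them.
    parse_hurdat <- function(file) {
      lines <- readLines(file)
      daily <- grep("^\\d{5} \\d{2}/\\d{2}\\*", lines, value = TRUE)  # daily cards only
      out <- list()
      for (ln in daily) {
        date   <- substr(ln, 7, 11)                        # mm/dd
        blocks <- strsplit(substr(ln, 12, nchar(ln)), "\\*")[[1]]
        for (b in blocks[nchar(blocks) > 0]) {             # four 6-hourly blocks
          lat   <- as.numeric(substr(b, 1, 3)) / 10        # tenths of a degree N
          lon   <- as.numeric(substr(b, 4, 7)) / 10        # tenths of a degree W
          wind  <- as.numeric(substr(b, 8, 11))            # knots
          press <- as.numeric(substr(b, 12, 16))           # mb
          out[[length(out) + 1]] <- data.frame(date, lat, lon, wind, press)
        }
      }
      x <- do.call(rbind, out)
      x[x == 0] <- NA    # zeros appear to mark missing values
      x
    }

Once the data are in a data frame, converting to a sane flat file is one more line, e.g. write.csv(parse_hurdat("tracks.atl"), "tracks.csv") – the file names here are made up.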


25 Comments

  1. Michael Jankowski
    Posted Oct 10, 2006 at 9:03 AM | Permalink

    Aside from the nuisance, the continued use of Fortran makes everything much more work than it needs to be with modern languages like R.

    I hated FORTRAN. I never understood why it uses the reverse notation of rows and columns in matrices. But they were still teaching it fairly recently (at least until a dozen years ago, when I was an undergrad) because of all the pre-programmed mathematical functions. I imagine a lot of folks who grew up using FORTRAN are reluctant to change because it means moving to a “new” language (although it’s quite easy – if you can program in FORTRAN, you can program in pretty much anything – just change some terminology).

  2. Steve McIntyre
    Posted Oct 10, 2006 at 9:27 AM | Permalink

    I think that these Fortran data sets inhibit proper statistical analysis. If you put the track data into an organized data frame in R, analyses suggest themselves. I hope to get to this.

  3. Posted Oct 10, 2006 at 9:36 AM | Permalink

    Dang Steve, your PC has 100 GB of memory?!?! How did you find one with enough DIMM slots?

    (I suspect you meant 100 GB of disk space, not 100 GB of memory…)

    That file format really is a horror-show. However, speaking as somebody who programs in Fortran on a fairly regular basis, I think you’re unfairly tarring the entire language based on one old (and admittedly crusty) example. I find Fortran, particularly Fortran 90, a much more usable language than R.

  4. Steve McIntyre
    Posted Oct 10, 2006 at 10:04 AM | Permalink

    Yes, disk space. The file formats in dendro are worse. They are in fixed format. They will have trailer values of 999 even in files which have values in that range. The WDCP does not impose standard formats, so if you try to use fixed widths, they blow up every so often, sometimes without overt failure. A sample is below, with a sketch of one way to read it afterwards.

    42288C 1 HERRING ALPINE LONG CORES TSHE 1129
    42288C 2 ALASKA HEMLOCK 275M 06026-14745 40 1422 1972
    42288C 3 H.C. FRITTS 08/23/73
    42288C14229990 09990 0 675 11033 1 975 1 768 11146 11082 11378 11231 1
    42288C1430 745 11340 12567 11180 1 710 11009 11268 11780 1 850 11430 1
    42288C1440 609 1 661 1 974 11127 12331 12358 12332 11959 1 815 1 449 1
    42288C1450 951 1 684 1 600 1 833 1 617 1 882 1 448 1 678 11338 1 692 1
    42288C14601231 11176 11432 1 614 1 692 11057 11067 11694 11212 11250 1
    42288C14701258 11234 1 964 11048 1 497 1 314 1 252 1 221 1 365 1 275 1
    42288C1480 403 1 600 1 439 1 971 1 766 1 871 11112 11539 11567 11246 1
    42288C14901929 1 830 1 995 1 866 2 905 2 730 2 984 21103 2 871 21190 2
    42288C15001209 21060 2 634 2 624 2 488 2 556 2 727 2 593 2 735 2 606 2
    42288C1510 608 21064 21278 2 894 21009 2 802 2 916 2 805 2 850 21509 2
    42288C15201326 2 997 2 754 2 669 21065 21131 21298 21137 2 748 21125 2
    42288C15301149 21054 2 823 2 754 2 787 2 946 21171 21020 21036 21036 2
    42288C15401106 21019 2 905 2 732 2 903 21368 21512 21210 21663 21431 2
    42288C15501222 21720 21212 2 884 2 802 2 821 2 793 2 875 21024 21118 2
    42288C15601283 21391 21298 21272 21087 21349 21078 21020 21190 21290 2
    42288C15701181 2 814 2 503 2 727 2 694 2 986 2 923 21002 21011 21174 2
    42288C15801000 21141 21333 21442 2 988 2 535 2 961 3 663 3 516 3 799 3
    42288C1590 996 31628 31392 31285 31071 31052 3 739 31268 3 896 3 939 3
    42288C16001010 5 640 5 477 51014 51219 51230 51239 51651 51392 51264 5
    42288C16101438 71614 71613 71097 71036 71172 71136 71338 7 776 71095 7
    42288C1620 352 7 564 7 584 7 561 7 832 7 823 7 743 7 787 7 910 7 948 8
    42288C1630 990 81072 81368 81188 81178 81308 8 952 81017 81041 8 516 8
    42288C16401178 8 627 8 751 8 794 8 641 8 899 8 917 81025 81062 8 728 8
    42288C16501160 91432 91076 9 955 9 859 9 832 9 754 9 982 9 895 91129 11
    42288C1660 965 11 694 11 791 11 488 11 317 11 435 11 643 11 736 11 712 11 607 11
    42288C1670 949 111115 11 923 11 761 11 536 11 750 11 873 11 403 11 287 11 152 11
    42288C1680 534 13 789 131072 131104 131086 131322 131359 131123 131247 14 951 14
    42288C16901270 14 944 14 913 14 983 141173 14 968 14 865 14 800 15 707 15 760 15
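
    A minimal sketch of reading such a file in R, assuming the widths I infer from the sample above (site id 6 characters, decade 4, then ten value/count pairs). WDCP files vary – which is exactly the problem – so the widths, the skip count and the file name here are all assumptions:

    # Sketch: read a fixed-width ring-width chronology and kill the sentinels.
    widths <- c(6, 4, rep(c(4, 3), 10))      # id, decade, 10 x (value, count)
    cols   <- c("site", "decade", paste0(c("val", "n"), rep(1:10, each = 2)))
    crn    <- read.fwf("42288.crn", widths = widths, skip = 3, col.names = cols)
    crn[crn == 9990] <- NA                   # trailer/sentinel values, with luck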

  5. Jaye Bass
    Posted Oct 10, 2006 at 11:03 AM | Permalink

    As a frequent lurker, first let me say I enjoy the site and have learned quite a bit. I visit this site, Pielke’s and RC to stay abreast of the goings-on related to climate science/policy.

    The opening remarks regarding “issues” when dealing with a Fortran code base bring to mind a different but related problem mentioned here on occasion: namely, problems associated with acquiring and building source as part of the auditing process. While I work in industry (large-scale aerospace simulation/modeling), I seem to run into this very problem with the academic world from time to time. A nontrivial percentage of the academics I’ve worked with seem to be a bit sloppy when it comes to developing their software. A little unit testing, decent configuration management, code documentation and modern build practices would go a long way toward making software from the academic world more usable/accessible. There is an array of open-source tools that can help with each of these areas. Of course, not all academics produce sloppy, hard-to-maintain software. The group that develops/maintains Ptolemy at Berkeley creates quality code that is definitely up to snuff.

    I think there are at least 3 reasons for this. First, frequently they aren’t being paid to develop tools but to study something, so the code suffers. Next, it has been my experience that grad students with limited experience writing code for reuse are often the ones who generate the code that the grant isn’t paying for in the first place. Lastly, if FORTRAN is the language of choice, then the researcher typically sees the programming as a necessary evil in the way of solving his/her problem. The irony is that in the end the code is the secondary, if not the primary, product. In the engineering disciplines MATLAB is quickly replacing FORTRAN as the “language” of choice. I don’t know if the physical sciences are moving in that direction; if so, you’ll start to see lots of MATLAB scripts.

  6. Posted Oct 10, 2006 at 11:40 AM | Permalink

    Re: #4 This looks like it might be easier to process in C (using scanf()) or Perl.

    I find it somewhat ironic that these file formats are so gross, when one of the nicest scientific data file formats I’ve ever used (NetCDF) comes from our friends at NCAR. I also wonder why nobody’s come up with conversion tools to load all this data into an honest-to-goodness database…

  7. Stan Palmer
    Posted Oct 10, 2006 at 1:30 PM | Permalink

    Re 6 “I also wonder why nobody’s come up with conversion tools to load all this data into an honest-to-goodness database…”

    For the same reason that the IETF uses text files for its RFCs and Internet Drafts: these documents are going to be useful for times much longer than the lifetime of any particular word processing or database program.

  8. Nicholas
    Posted Oct 10, 2006 at 1:40 PM | Permalink

    I agree with Stan; I think it makes more sense to have a standard text format, then a tool for each type of database to import and export data in that format.

    Doing so would be quite trivial; the only hard part is defining the standard and actually getting people to use it.

  9. Ric Techow
    Posted Oct 11, 2006 at 5:28 AM | Permalink

    I think it is called XML.

  10. Steve Sadlov
    Posted Oct 11, 2006 at 9:01 AM | Permalink

    I have to wonder how much of this is due to all the archival data that were crunched back in the 1970s and 80s using Fortran programs, punch cards and mainframes. It may be one of those data-continuity things: if one switched to newer methods for more recent data, someone would need to reformat the old data. That is a bit of a stretch, though, for I am not aware of issues trying to read those old anally formatted files using C++ or something like that. I must admit, I don’t do hands-on programming any more and was never any good at it, so I am pretty much shooting from the hip here – LOL ;)

  11. Posted Oct 11, 2006 at 12:07 PM | Permalink

    Dave,

    When you’ve got a lot of egg on your face and zero “climate prediction”, then egg-throwing is all that’s left.

    I’m waiting for the announcement of the merger with Andrew “Hydrometeor particles” Weaver for an ensemble forecast of “there will be a lot of weather everywhere”.

  12. K
    Posted Oct 11, 2006 at 12:09 PM | Permalink

    Converting and standardizing old data gets done when it pays. If you want it done then get it funded.

    Stan, and some others, have noted that a final repository in text is not a bad idea. Peripheral storage is becoming so cheap that compression and optimization schemes make less sense every year.

    Supposedly, one of the merits of Chinese writing is that the symbols for a word do not change – two-thousand-year-old text can be read as easily as today’s news. I suspect such a blanket statement is not entirely true, but it has some truth to it.

    When I began programming fifty years ago (yes fifty) one’s toolkit had to include juggling memory, storage, speed, and precision. And with some work taking weeks it was necessary to divide calculations and periodically save the status.

    One by one these horrors were turned over to the machine – virtual memory, terabyte storage, gigahertz speeds in multiple processors, and precision can normally be whatever you want.

    Peoples of the past weren’t considerate of your difficulties this morning? How dare they!

  13. Steve McIntyre
    Posted Oct 11, 2006 at 1:15 PM | Permalink

    #12. I’m not objecting to old data; it’s storing today’s data in 80-column formats that is tiresome. Here’s another irritating entry: http://www.weather.unisys.com/hurricane/w_pacific/2004/01W/track.dat
    Missing pressure and wind data are denoted by a dash. C’mon.
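
    For that one, at least, R’s read.table can cope once you tell it what missing looks like – a sketch, where the number of header lines to skip is a guess about that file:

    # Sketch: read the Unisys track.dat, treating "-" as NA.
    trk <- read.table("track.dat", skip = 3, header = FALSE,
                      fill = TRUE, na.strings = "-")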

  14. Posted Oct 11, 2006 at 1:47 PM | Permalink

    I have a theory: they used Fortran and punch cards when they first started to digitize data in the 1960s. Lisp and Fortran were almost the only programming languages back then, and Fortran was the choice for science applications. No one has bothered to change the format and translate the old programs into some other programming language. Updates have been made, but Fortran is still used.
    But it must be rather easy to make a program that converts data files in an old format into a new, more easily read format, and then change the input statements in the programs used to analyse the data. They can still use Fortran, but they don’t have to have 80-character lines anymore.

  15. Joel McDade
    Posted Oct 11, 2006 at 2:18 PM | Permalink

    If I remember right, Fortran has had “NAMELIST” data entry for 25 years now. It’s sort of like reading a CSV file – easy as pie. I long ago switched mostly to Delphi Pascal and wouldn’t know about more recent implementations.

    Some of the posted examples look pretty horrible even for Fortran!

  16. Joel McDade
    Posted Oct 11, 2006 at 2:24 PM | Permalink

    Wait… I figured it out. They can’t program without those precious GOTO statements :-)

  17. Jaye Bass
    Posted Oct 11, 2006 at 3:06 PM | Permalink

    text => perl => anything you want.

  18. K
    Posted Oct 11, 2006 at 3:11 PM | Permalink

    Dave: Sorry, I got a bit ticked seeing comments that more or less implied design decisions of the past were foolish, or even mischievous, because we wouldn’t make them today.

    Your tough questions on the matter of climate warming have done, and are doing, a lot of good.

    Some comments on data conversion: programs are (or were) available to easily convert almost anything written in Fortran, or its associated data, to better formats. They worked extremely well twenty years ago and were public domain (is there anything NASA hasn’t funded at one time?).

    Ironically, conversion programs themselves are soon in an obsolete language and format, or need some long-vanished machine to run on. Emulators were devised to limit the damage, and they are still used today – Windows can be run on Macintosh (kinda, sorta, hopefully).

    Which points exactly to the trouble with stepwise changeovers to newer languages and data formats: each step must be known and preserved or the chain breaks. The Rosetta Stone story is the well-known illustration of how a broken chain was repaired after a chance discovery.

    Yes, Fortran did reverse the standard row and column notation in matrices. But at least the (1) element was the first – unlike C, where [1] is the second data point; a decision which could allow spectacular beauty in coding but was not otherwise agreeable.

    As for missing pressure and wind data being indicated with a dash: well, you have to do something. Years ago programmers would use an impossible value such as 999.99 for barometric pressure. That is one reason old data sets often ended with strings of 9s (as someone commented).

    But some chores have to be done, and the sooner we have an agreed target format for all this data the better. I believe it should be a string rather than a fixed format, and it should define what it represents within the file itself. Such a format, properly devised, should remain usable indefinitely; I see no reason why not.

  19. Nicholas
    Posted Oct 11, 2006 at 4:59 PM | Permalink

    XML is not a format. It’s a scheme you can use to design a format so that other programs can read it, but not necessarily understand it.

    It’s unnecessarily complex and verbose anyway. Personally, I would use something like CSV, except with name/value pairs. Something like:

    date=1/2/3,time=14:25:00,wind_speed(km/h)=7.265,temperature(celsius)=24.5,lat=35.43276,long=24.87215
    date=1/2/3,time=14:26:00,wind_speed(km/h)=7.624,temperature(celsius)=24.6,lat=35.43382,long=24.87198

    Not only would it be relatively easy to parse, it’s also easy to understand what it represents, and deals well with data where certain elements are not available for each record. If there’s no temperature reading for a given time, you can just leave it out.
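
    As a sketch in R (using the made-up field names above), one such record splits apart in a couple of lines:

    # Sketch: parse one name/value record into a named list.
    line   <- "date=1/2/3,time=14:26:00,wind_speed(km/h)=7.624,temperature(celsius)=24.6,lat=35.43382,long=24.87198"
    pairs  <- strsplit(strsplit(line, ",")[[1]], "=")
    record <- setNames(lapply(pairs, `[`, 2), sapply(pairs, `[`, 1))
    record$lat   # "35.43382"; convert with as.numeric() as needed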

    You could do something very similar in XML but it would be larger and harder to read, and given how easy this is to parse, I don’t see the point.

    <record><date>1/2/3</date><time>14:25:00</time><wind_speed>7.265</wind_speed><temperature>24.5</temperature><lat>35.43276</lat><long>24.87215</long></record>
    <record><date>1/2/3</date><time>14:26:00</time><wind_speed>7.624</wind_speed><temperature>24.6</temperature><lat>35.43382</lat><long>24.87198</long></record>

    etc.

    I don’t really see the advantage. But you could do that if you wanted. More programs could read it, but you’d still have to program them to understand it, which is the hard bit anyway.

  20. Nicholas
    Posted Oct 11, 2006 at 5:00 PM | Permalink

    gah, the XML tags I wrote in got stripped out.

  21. Paul
    Posted Oct 11, 2006 at 10:06 PM | Permalink

    Clearly, you’re just not advanced enough. You can see from this link. We’re just a short time away from controlling the weather. I’m sure Fortran would be a part of that computer…

  22. Jaye Bass
    Posted Oct 12, 2006 at 8:56 AM | Permalink

    It’s unnecessarily complex and verbose anyway.

    I suppose for very large data sets that might be true. But the ability to validate an XML file/fragment without changing one line of parsing code, even if the XML format changes, is a very big deal. Not to mention all the tools that allow one to transform XML into just about anything else. If your data sets are extremely large, or if you never expect the file format to change, or if the data is never going to be corrupted by an extra space, a spurious character or an unexpected CR/LF, then CSV is a great way to handle important data files.

  23. Steve McIntyre
    Posted Oct 12, 2006 at 12:06 PM | Permalink

    There’s an easier-to-read version of the updated HURDAT here: http://www.aoml.noaa.gov/hrd/hurdat/easyhurdat_5105.html

  24. Nicholas
    Posted Oct 12, 2006 at 9:26 PM | Permalink

    Jaye:

    The ability to validate an XML file/fragment without changing one line of parsing code, even if the XML format changes, is a very big deal.

    Except you can only validate the format, and perhaps check if the data is in range. You can’t validate the data itself, and that’s where I would worry about corruption or mistakes the most. Besides, I can write Perl code to validate a CSV file faster than I can install an XML validation tool. Any XML validation you can perform is pretty trivial with a CSV-type file and any decent text-processing language.

    Besides I don’t see why it’s such a big deal. If the data format is invalid, the program reading it in should throw an error, right?

    Not to mention all the tools that allow one to transform XML into just about anything else.

    Again, I can write a Perl program to transform to/from a CSV format in a couple of minutes, as could anybody competent with a language that has decent text-handling abilities. I think you’re exaggerating the availability of XML conversion software, too. I bet you won’t find a program to convert the weird FORTRAN formats Steve is complaining about to XML.

    If your data sets are extremely large, or if you never expect the file format to change, or if the data is never going to be corrupted by an extra space, a spurious character or an unexpected CR/LF…

    Why would the data be corrupt? If you’re worried about that, keep a checksum (say an MD5 sum) or use an archiver (zip, rar, etc.) that keeps checksums on the files and will report an error if the data is corrupt.
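
    In R, for instance, checksumming a file is a one-liner (the file name here is made up):

    # Sketch: MD5 checksum of a data file, using base R's tools package.
    tools::md5sum("tracks.csv")

    Store the sum alongside the data and any silent corruption shows up on re-checking.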

    Besides which, if the corruption, say, truncates a number, XML likely won’t be able to tell: it could still be valid XML. XML doesn’t solve the issue of corruption; it just pads the data enough that corruption is likely to affect the extra padding and therefore be detected. CSV is more compact, so corruption is more likely to hit real data, I suppose, but again, checksums are a far better defence.

    I just don’t see what XML brings to the table over simple text formats. I’ve dealt with XML parsers in the past, and they’re a pain to use. It typically takes me longer to get one hooked up to my program than it takes to write a small routine to import/export CSV or a similar format.

  25. Jaye
    Posted Oct 14, 2006 at 11:14 PM | Permalink

    Besides I don’t see why it’s such a big deal. If the data format is invalid, the program reading it in should throw an error, right?

    I’ve seen non-XML-based parsers break because of unanticipated format errors: they will try to use garbage because they don’t know that the format has been violated. Hence no exception, but a perplexing runtime bug. XML validators take the guesswork out of format validation. The big deal is that XML validation really only depends on the DTD or schema. So if one is writing production code that somebody else has to use, versus writing little scripts for onesy-twosy apps, then XML is definitely preferable. Using the Xerces validator, for instance, is trivial.

    MD5s and checksums can indicate a change in state, but not an incorrect state to begin with.

    I personally wouldn’t want to use XML for large data sets, since in most cases the XML tags could account for a large percentage of the file size.

    I bet you won’t find a program to convert the weird FORTRAN formats Steve is complaining about to XML.

    You misunderstand. The ease of conversion is from XML to something else, via XSLT or whatever. Again, I don’t have to change the code, just change the XSLT… instead of “just writing more Perl”, which defeats the purpose.

    Just because you can’t do it doesn’t mean it can’t be done.
