Re: Ryan O (#40),

For sure. Not only is there no way to know, there is no effort to check.

Re: Mike B (#39), There’s no way to know. In the case of the AWS recon, it is very possible to have undetected spurious correlations because some of the station data is so short. For longer series – like the manned stations – this is less likely, because a spurious correlation is unlikely to persist over a long period. But with the shorter duration of many of the AWS stations, it could definitely be an issue.
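A quick Monte Carlo sketch of the point (my own construction, not from the reconstruction code): pairs of completely independent AR(1) “weather-like” series, and the fraction of pairs whose correlation exceeds 0.5 purely by chance, for a short AWS-length record versus a long manned-station record.

```python
import random
import statistics

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ar1(n, rng, phi=0.5):
    """Autocorrelated noise: AR(1) with lag-1 coefficient phi."""
    x = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0, 1))
    return x

def frac_spurious(n_months, trials=2000, cutoff=0.5, seed=0):
    """Fraction of independent series pairs with |r| above the cutoff."""
    rng = random.Random(seed)
    hits = sum(abs(pearson(ar1(n_months, rng), ar1(n_months, rng))) > cutoff
               for _ in range(trials))
    return hits / trials

print(frac_spurious(24))    # short AWS-like record: spurious hits are common
print(frac_spurious(360))   # long manned-station record: almost none
```

With only a couple of dozen autocorrelated points, high correlations between unrelated series turn up routinely; with a thirty-year record they essentially never do.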

Re: Jeff C. (#23),

Thanks Jeff. And I get that part (i.e. the algorithm implicitly “recognizes” the higher correlation between two stations 50 miles apart and two stations 950 miles apart).

But my poorly posed question(s) are slightly different:

a) What if station B, via error or spurious correlation, is more highly correlated with station C 950 miles away than it is with station A 50 miles away?

and more generally:

b) Even if things are working well, you’re still stuck (in your hypothetical) with 950 miles of guesswork between two stations. And RegEM seems to crank along indifferent to this.

Re: Alan S. Blue (#32),

I just asked the same question in the other thread. Why infill if your sole purpose is to estimate a trend? Unless you are infilling for one data type based on patterns in another data type …

Re: Jeff Id (#33),

It seems like statistics should have a frequency based r val.

For periodic time-series you use cross-spectral coherence. It is interpreted like Pearson’s r with respect to a specific frequency range. But for aperiodic time-series the spectral (frequency-domain) methods are not so good. There is nothing illegal about filtering and then correlating the filtered component series. This is not a statistical issue; it’s a signal-processing issue. The idea is that weather noise, climate noise, and forcing signals occur on different characteristic timescales, and you filter on that basis. This is not a statistical proposition; it is a climatological proposition. So it’s out of the hands of the statisticians. You can ask a statistician’s opinion, but they are going to tell you that it makes sense, if that’s what the physics dictates. Two statisticians I would trust on this are Bloomfield and Nychka. (And of course, Wegman.)
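A minimal sketch of the filter-then-correlate idea on synthetic data (my own toy example): two stations share a slow “climate” component buried in independent weather noise. Correlating the raw series understates the shared signal; correlating low-pass-filtered versions recovers it.

```python
import numpy as np

def smooth(x, window):
    """Centered moving average, valid region only (a crude low-pass filter)."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

rng = np.random.default_rng(0)
n = 1200                                   # e.g. 100 years of monthly data
t = np.arange(n)
climate = np.sin(2 * np.pi * t / 240)      # shared slow component (20-yr period)
a = climate + rng.normal(0, 1, n)          # station A: signal + weather noise
b = climate + rng.normal(0, 1, n)          # station B: independent weather noise

r_raw = np.corrcoef(a, b)[0, 1]            # modest: noise dominates
r_low = np.corrcoef(smooth(a, 61), smooth(b, 61))[0, 1]  # high: signal dominates
print(round(r_raw, 2), round(r_low, 2))
```

The usual caveat applies: filtering shrinks the effective number of independent samples, so any significance test on the filtered correlation needs adjusted degrees of freedom.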

Re: Molon Labe (#31), Oh I just love all the tech jokes. You had me completely sucked in for a minute. I notice when the discussion gets really mathematical the trolls stay away: is math like wolfbane or garlic?

Re: Alan S. Blue (#32), that sounds like a great methodology to me. But what do I know, I’m certainly no statistician. (Note: Alan’s comment must be read carefully; many times when referring to “average” or “weight” he’s referring to the average or weight *of the errors*, not of temperature.)

I’m not too sure of the validity of correlation coefficients on annual data averaged from daily data. Using the daily data for years 2004 (less a day, for the leap year) and 2005 at Macquarie Island, Excel’s CORREL function gives 0.555 for Tmax and 0.453 for Tmin. Years chosen at random.

The stats package for year 2005 daily data gives:

                        Tmax            Tmin
Mean                    7.039452055     3.536191781
Standard Error          0.116673752     0.142589332
Median                  6.9             3.9
Mode                    6.6             3.0
Standard Deviation      2.229048909     2.724165366
Sample Variance         4.96865904      7.421076941
Kurtosis                0.086984527    -0.452491633
Skewness               -0.312414358    -0.457868522
Range                   13.8            13.1
Minimum                -0.2            -4.3
Maximum                 13.6            8.8
Sum                     2569.4          1290.71
Count                   365             365

How do you get correlations of 0.75 in your graphs when taking a favourable case I can only manage 0.45-0.55? Am I missing a square root terminologically somewhere? If not, surely correlation coefficients should decrease as you go further out on a data limb?
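For what it’s worth, a toy simulation (my own, synthetic data) of why correlations computed on monthly or annual means run higher than on the raw daily values: averaging shrinks the independent day-to-day noise but leaves the shared slow anomaly intact.

```python
import numpy as np

rng = np.random.default_rng(1)
months, days_per_month = 360, 30            # 30 simplified years
n = months * days_per_month

t = np.arange(n)
shared = 0.6 * np.sin(2 * np.pi * t / 3600) # shared slow anomaly (~10-yr period)
a = shared + rng.normal(0, 1, n)            # station A: anomaly + daily weather
b = shared + rng.normal(0, 1, n)            # station B: independent daily weather

to_monthly = lambda x: x.reshape(months, days_per_month).mean(axis=1)
r_daily = np.corrcoef(a, b)[0, 1]           # low: daily weather noise dominates
r_monthly = np.corrcoef(to_monthly(a), to_monthly(b))[0, 1]  # much higher
print(round(r_daily, 2), round(r_monthly, 2))
```

So a 0.45–0.55 correlation on daily values and a 0.75 correlation on averaged anomalies are not contradictory; they are measuring the signal at different levels of noise.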

Re: bender (#30),

I was hoping for a rule or something. It seems like statistics should have a frequency based r val. I’m just an engineer though and have not yet attempted to prove anything with a correlation. Things either work or they don’t, and in my experience the god of physics pretty well lets me know when I messed up.

By the way, I did some slightly more complete PC analysis of the Sat data. I’m completely exhausted but if I didn’t mess it up, Dr. Steig is going to have a headache tomorrow.

http://noconsensus.wordpress.com/2009/02/18/the-three-pcs-of-the-antarctic/

First break the spatial area into gridcells (which is already done). Then determine “The Gridcell Temperature” in each gridcell at each time. Gridcells with no data are exactly that – gridcells with no data. Gridcells with 30 datapoints at a given time mean that you’ve probably substantially reduced the error on that particular datapoint.

When it becomes time to estimate the average temperature of the entire area, you do an average weighted by the errors on the individual gridcells. The ’30 point’ cell is going to get a substantially heavier weight than the zero-data cell (which would actually get a zero weight). Along with the “Temperature Maps” that we keep getting, you’d be able to show an “Error Map.”

At the next level, when you’re moving from individual times to trend across the years, you’ve got individual error bars on each time. If there’s very little data in 1973, you can see it flat out – because the error for the 1973 data would be enormous. Whereas when you’re doing the same thing for a gridcell in 2003 with full coverage, you should have nice tight error bars.

So in fitting a trend, you also get to weight the yearly data by its own individual merit. Rock solid data after 2000? Excellent, strongly weighted. All data from any random year missing? That’s fine – you’ve made no assumptions about missing data whatsoever.
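The scheme described above is essentially inverse-variance weighting. A minimal sketch (function names and the sample numbers are my own, not from any published code): the spatial average downweights sparse cells, and the weighted trend fit downweights sparse years.

```python
import numpy as np

def weighted_mean(values, errors):
    """Inverse-variance weighted mean and its standard error.
    An empty cell is simply excluded (zero weight)."""
    values, errors = np.asarray(values, float), np.asarray(errors, float)
    w = 1.0 / errors ** 2
    mean = np.sum(w * values) / np.sum(w)
    return mean, np.sqrt(1.0 / np.sum(w))

def weighted_trend(years, temps, errors):
    """Weighted least-squares slope (deg/yr); np.polyfit expects w = 1/sigma."""
    return np.polyfit(years, temps, 1, w=1.0 / np.asarray(errors, float))[0]

# A well-sampled cell (0.1 deg error) dominates two sparse ones.
m, se = weighted_mean([-1.2, 0.4, 0.9], [0.1, 0.5, 2.0])
print(round(m, 3), round(se, 3))

# Years with rock-solid data pull the fitted trend toward their values.
years = np.arange(1973, 2004)
temps = 0.02 * (years - 1973)               # a clean 0.02 deg/yr trend
errors = np.where(years < 2000, 2.0, 0.1)   # sparse early years, dense late
print(round(weighted_trend(years, temps, errors), 4))
```

Alongside the mean, `weighted_mean` returns the standard error of the areal average, which is exactly the quantity an “Error Map” would display.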
