For sure. Not only is there no way to know, there is no effort to check.

]]>Thanks Jeff. And I get that part (i.e. the algorithm implicitly “recognizes” that two stations 50 miles apart are more highly correlated than two stations 950 miles apart).

But my poorly posed question(s) are slightly different:

a) What if station B, via error or spurious correlation, is more highly correlated with station C 950 miles away than it is with station A 50 miles away?

and more generally:

b) Even if things are working well, you’re still stuck (in your hypothetical) with 950 miles of guesswork between two stations. And RegEM seems to crank along indifferent to this.

]]>I just asked the same question in the other thread. Why infill if your sole purpose is to estimate a trend? Unless you are infilling for one data type based on patterns in another data type … ]]>

It seems like statistics should have a frequency based r val.

For periodic time-series you use cross-spectral coherence. It is interpreted like Pearson’s r with respect to a specific frequency range. But for aperiodic time-series the spectral (frequency-domain) methods are not so good. There is nothing illegal about filtering and then correlating the filtered component series. This is not a statistical issue. It’s a signal processing issue. The idea is that weather noise, climate noise, and forcing signals occur on different characteristic timescales, and you filter on that basis. This is not a statistical proposition; it is a climatological proposition. So it’s out of the hands of the statisticians. You can ask a statistician’s opinion, but they are going to tell you that it makes sense, if that’s what the physics dictates. Two statisticians I would trust on this are Bloomfield and Nychka. (And of course, Wegman.)
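To make "filter, then correlate the components" concrete, here is a minimal sketch. It assumes a plain centered moving average as the low-pass filter and uses synthetic series (a shared slow sine plus independent noise); the function names, the window length, and the data are all illustrative choices, not anything taken from RegEM or from a real station record.

```python
import numpy as np

def moving_average(x, window):
    """Centered moving average (a crude low-pass filter); edges are trimmed."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def band_correlations(a, b, window):
    """Pearson r between the slow (smoothed) and fast (residual) parts of a and b.

    `window` should be odd so the smoothed series lines up with the raw one.
    """
    slow_a = moving_average(a, window)
    slow_b = moving_average(b, window)
    # Trim the raw series so they align with the 'valid' smoothed output.
    start = (window - 1) // 2
    fast_a = a[start:start + slow_a.size] - slow_a
    fast_b = b[start:start + slow_b.size] - slow_b
    r_slow = np.corrcoef(slow_a, slow_b)[0, 1]
    r_fast = np.corrcoef(fast_a, fast_b)[0, 1]
    return r_slow, r_fast

# Synthetic example (assumption): two "stations" share a slow signal
# but have independent short-timescale noise.
rng = np.random.default_rng(0)
t = np.arange(600)
shared = np.sin(2 * np.pi * t / 200.0)
a = shared + rng.normal(0.0, 1.0, t.size)
b = shared + rng.normal(0.0, 1.0, t.size)

r_raw = np.corrcoef(a, b)[0, 1]
r_slow, r_fast = band_correlations(a, b, window=25)
print(f"raw r = {r_raw:.2f}, slow r = {r_slow:.2f}, fast r = {r_fast:.2f}")
```

In this toy setup the slow components correlate much more strongly than the raw series, while the fast residuals are close to uncorrelated. Which timescales to separate is the physical judgment call; the code only shows the mechanics.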

]]>I’m not too sure of the validity of correlation coefficients on annual data averaged from daily data. Using years 2004 (less a day for leap year) and 2005 for Macquarie Island, the CORREL Excel function gives 0.555 for Tmax and 0.453 for Tmin. Years chosen at random.

The stats package gives the following for the 2005 daily data:

Statistic            Tmax           Tmin
Mean                 7.039452055    3.536191781
Standard Error       0.116673752    0.142589332
Median               6.9            3.9
Mode                 6.6            3.0
Standard Deviation   2.229048909    2.724165366
Sample Variance      4.96865904     7.421076941
Kurtosis             0.086984527    -0.452491633
Skewness             -0.312414358   -0.457868522
Range                13.8           13.1
Minimum              -0.2           -4.3
Maximum              13.6           8.8
Sum                  2569.4         1290.71
Count                365            365

How do you get correlations of 0.75 in your graphs when, taking a favourable case, I can only manage 0.45-0.55? Am I missing a square root somewhere in the terminology? If not, surely correlation coefficients should decrease as you go further out on a data limb?
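On the square-root question, a quick arithmetic check is easy to run. It only assumes, without any confirmation from the graphs themselves, that one of the two figures being compared might be a squared quantity (r² rather than r). The two input values are the CORREL results quoted above; nothing else is taken from the data.

```python
import math

# CORREL results quoted above (Macquarie Island, 2004 vs 2005).
r_values = {"Tmax": 0.555, "Tmin": 0.453}

for name, r in r_values.items():
    # If the quoted value were really r^2, the matching r would be its square root;
    # if it is already r, the matching r^2 is its square.
    print(f"{name}: value = {r:.3f}, square = {r**2:.3f}, square root = {math.sqrt(r):.3f}")
```

The square roots come out near 0.745 and 0.673, so a mix-up between r and r² would be enough to bridge the gap for Tmax. Whether that is what actually happened is not something this arithmetic can settle.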

]]>I was hoping for a rule or something. It seems like statistics should have a frequency based r val. I’m just an engineer though and have not yet attempted to prove anything with a correlation. Things either work or they don’t, and in my experience the god of physics pretty well lets me know when I messed up.

By the way, I did some slightly more complete PC analysis of the Sat data. I’m completely exhausted but if I didn’t mess it up, Dr. Steig is going to have a headache tomorrow.

http://noconsensus.wordpress.com/2009/02/18/the-three-pcs-of-the-antarctic/

]]>First break the spatial area into gridcells (which is already done). Then determine “The Gridcell Temperature” in each gridcell at each time. Gridcells with no data are exactly that – gridcells with no data. A gridcell with 30 datapoints at a given time probably has a substantially smaller error on its temperature for that time.

When it comes time to estimate the average temperature of the entire area, you’re doing an average weighted by the errors on the individual gridcells. The ’30 point’ cell is going to get a substantially heavier weight than the zero data cell (which would actually get a zero weight). Along with the “Temperature Maps” that we keep getting, you’d be able to show an “Error Map.”
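A minimal sketch of that error-weighted area average follows. It makes several assumptions: each gridcell's error is taken as the standard error of its readings, the weights are inverse variances, and a cell with fewer than two readings is treated as having no usable error estimate and is dropped. The helper names, the `min_err` floor, and the readings are made up for illustration.

```python
import numpy as np

def cell_estimate(readings, min_err=1e-6):
    """Mean and standard error for one gridcell at one time.

    Returns None when there are fewer than two readings (assumption:
    no error estimate means the cell contributes zero weight).
    """
    readings = np.asarray(readings, dtype=float)
    if readings.size < 2:
        return None
    mean = readings.mean()
    stderr = readings.std(ddof=1) / np.sqrt(readings.size)
    return mean, max(stderr, min_err)  # floor avoids dividing by zero

def area_average(cells):
    """Inverse-variance weighted mean over gridcells, with its own error."""
    estimates = [e for e in (cell_estimate(c) for c in cells) if e is not None]
    if not estimates:
        raise ValueError("no gridcell has enough data")
    means = np.array([m for m, _ in estimates])
    errs = np.array([s for _, s in estimates])
    weights = 1.0 / errs**2
    avg = np.sum(weights * means) / np.sum(weights)
    avg_err = np.sqrt(1.0 / np.sum(weights))
    return avg, avg_err

# Illustrative readings (not real data): a 30-point cell, a 3-point cell, an empty cell.
cells = [
    np.tile([-11.0, -9.0], 15),   # 30 readings, mean -10
    [-22.0, -20.0, -18.0],        # 3 readings, mean -20
    [],                           # no data: ignored, no infilling
]
avg, avg_err = area_average(cells)
print(f"area average = {avg:.2f} +/- {avg_err:.2f}")
```

The per-cell `stderr` values are exactly what would go into an "Error Map", and the well-sampled cell dominates the average without any value being invented for the empty one.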

At the next level, when you’re moving from individual times to trend across the years, you’ve got individual error bars on each time. If there’s very little data in 1973, you can see it flat out – because the error for the 1973 data would be enormous. Whereas when you’re doing the same thing for a gridcell in 2003 with full coverage, you should have nice tight error bars.

So in fitting a trend, you also get to weight each year’s data on its own individual merit. Rock solid data after 2000? Excellent, strongly weighted. All data from any random year missing? That’s fine – you’ve made no assumptions about missing data whatsoever.
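The trend step can be sketched the same way, as a weighted least-squares line with weights of 1/error² per year. The years, temperatures, and error bars below are synthetic and chosen only to show the effect; missing years are simply absent from the arrays, and `weighted_trend` is a hypothetical helper name.

```python
import numpy as np

def weighted_trend(years, temps, errs):
    """Weighted least-squares slope (per year) and its standard error."""
    years, temps, errs = (np.asarray(v, dtype=float) for v in (years, temps, errs))
    w = 1.0 / errs**2
    xbar = np.sum(w * years) / np.sum(w)
    ybar = np.sum(w * temps) / np.sum(w)
    sxx = np.sum(w * (years - xbar) ** 2)
    slope = np.sum(w * (years - xbar) * (temps - ybar)) / sxx
    slope_err = np.sqrt(1.0 / sxx)
    return slope, slope_err

# Synthetic example: sparse, noisy early years (1972 and 1974 missing),
# well-covered years from 2000 on. The underlying trend is 0.02 per year.
years = np.array([1970, 1971, 1973, 1975] + list(range(2000, 2009)), dtype=float)
temps = 0.02 * (years - 1970) - 15.0
temps[2] += 5.0                      # a bad value in poorly observed 1973
errs = np.array([5.0] * 4 + [0.2] * 9)

w_slope, w_err = weighted_trend(years, temps, errs)
u_slope, _ = weighted_trend(years, temps, np.ones_like(years))  # equal weights
print(f"weighted slope   = {w_slope:.4f} +/- {w_err:.4f}")
print(f"unweighted slope = {u_slope:.4f}")
```

With these made-up numbers the weighted fit stays much closer to the underlying 0.02 per year than the equal-weight fit, because the bad 1973 value carries a large error bar and therefore little weight.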

]]>