From the linked article summary:

However, the behavior of PCA for genetic data showing continuous spatial variation, such as might exist within human continental groups, has been less well characterized. Here, we find that gradients and waves observed in Cavalli-Sforza et al.’s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events.

Well there it is – they said it. Do we conclude the same for the subject of this thread?

It points out some pitfalls and recreates some of the same patterns Steve has derived.

http://www.nature.com/ng/journal/vaop/ncurrent/abs/ng.139.html

PCA patterns

There are lots of ways to describe data using functions or sets of functions. That allows the scientist to pick the math that best suits the science. If I were handed sets of temperature data over time and told to use PCA, I probably would have just thrown all the data together and come up with a basis that is a mixture of temperature and time. Maybe the first element of the basis would have most of the structure that I want to pull out of the data. Not a given, but worth trying.

If the data were a lot more complex, PCA could also be used to simplify the representation of the data, and to show me what part of the data is redundant. Thus I could use the results to build a simpler set of data that represents the original data, maintaining its greatest variances.
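To make that concrete for myself, here is a minimal sketch of PCA used as a data simplifier, assuming NumPy. The function names (`make_redundant_data`, `pca_reduce`) and the synthetic data are my own illustration, not anything taken from the reconstruction discussed in this thread: five columns are driven by one hidden signal plus a little noise, so four of them are essentially redundant.

```python
import numpy as np

def make_redundant_data(seed=0):
    """Hypothetical test data: five columns that all follow one hidden signal."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, 200)
    signal = np.sin(2.0 * np.pi * t)
    loadings = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
    noise = 0.05 * rng.standard_normal((t.size, loadings.size))
    return np.outer(signal, loadings) + noise

def pca_reduce(data, n_components):
    """Rebuild `data` from its first `n_components` principal components.

    Returns the low-rank approximation and the fraction of the total
    variance that those components retain.
    """
    mean = data.mean(axis=0)
    centered = data - mean
    # SVD of the centered data: rows of vt are the principal directions.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    approx = scores @ vt[:n_components] + mean
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return approx, explained
```

Calling `pca_reduce(make_redundant_data(), 1)` keeps a single component, and on this toy data that one component should retain nearly all of the variance, which is the "simpler set of data" idea above.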

I have a bit of trouble in my mind trying to expand this concept to sets of data. Can I call each set of data a separate ordinate and develop a basis set as a function of these data sets? And what on earth is orthogonality between these sets, especially since they are definitely non-orthogonal series of time data?

Maybe they are misusing the math, or maybe it is a brilliant and powerful generalization. But the result won’t pick out which data sets best represent the planet’s temperature. Instead, it will pick out and emphasize which data sets best represent the variance inherent in the aggregate.
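Here is a small sketch of what I mean, assuming NumPy and treating each data set as one column (one "ordinate") of a matrix whose rows are time steps. The data and the names `series_pca` and `demo_series` are invented for illustration: three quiet series plus one with a large trend. This is covariance-based PCA on centered columns, which may or may not match the centering and scaling used in the work under discussion.

```python
import numpy as np

def series_pca(series):
    """PCA with each column of `series` (n_times x n_series) as a variable.

    Returns eigenvalues (descending) and the matching loading vectors
    as columns.
    """
    centered = series - series.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def demo_series(seed=0):
    """Hypothetical input: three low-variance series and one 'loud' series."""
    rng = np.random.default_rng(seed)
    n_times = 300
    quiet = 0.1 * rng.standard_normal((n_times, 3))
    loud = np.linspace(-2.0, 2.0, n_times) + 0.1 * rng.standard_normal(n_times)
    return np.column_stack([quiet, loud])

vals, vecs = series_pca(demo_series())
# Index of the series with the largest weight in the first component.
dominant = int(np.argmax(np.abs(vecs[:, 0])))
```

On this toy input the first component is dominated by the high-variance series, regardless of whether that series is a good temperature proxy. The loading vectors come out mutually orthogonal even though the input series themselves are not, which is the only sense of "orthogonality" I can find in the construction.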

There will be strong correlations between latitude and temperature. There will also be correlations between altitude and temperature, and altitude is positional. It also makes a difference which side of an ocean one is on, whether one is near an ocean or landlocked, etc. Steve is right to consider these; they are hidden features of the sets. One could use these techniques to clean up larger data sets, say by smoothing the data over area as in weather maps, to get better areal averages of temperature for computing the global average temperature. But how on earth can one use so few data sets, with such a poor representation of the earth, and dare to claim that they tell us the average temperature of the earth?

It seems to me that this is a sophisticated sleight of hand. A technique was found that was best suited to emphasize particular data sets over other data sets. The result was then presented as proof that man has set himself on a runaway train to temperatures that were higher than the world had ever seen.

Is there a mathematical generalization of this technique to use sets as ordinates?

If valid, has anyone demonstrated that this is an appropriate technique to use for this situation?

Has someone included latitude, longitude and elevation as well as time, and used PCA in the normal way in their analysis and averaging of the data?

Thanks for the stimulation.

“if you have spatial autocorrelation, can you always find a wave equation to describe the field?”

I doubt it in an analytical sense. Wave equations are particular second-order PDEs, and it seems unlikely that you could find such a PDE for every possible spatial autocorrelation. But I imagine in many cases you can find a wave equation whose eigenfunctions look a lot like those for your particular autocorrelation if you allow for noise terms.
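As a rough numerical illustration of that last point (not a proof), here is a sketch assuming NumPy and one specific, hypothetical choice of autocorrelation: an exponential decay `rho ** |i - j|` between sites on a 1-D line. It compares the leading covariance eigenvectors with the cosine standing-wave modes of a string with free ends. The function names and the values of `n` and `rho` are my own.

```python
import numpy as np

def exponential_covariance(n, rho):
    """Covariance between n equally spaced sites, decaying as rho**distance."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def leading_eigenvectors(cov, k):
    """The k eigenvectors of `cov` with the largest eigenvalues, as columns."""
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]

def cosine_modes(n, k):
    """First k cosine modes cos(pi * m * x), sampled at cell centres, unit norm."""
    x = (np.arange(n) + 0.5) / n
    modes = np.cos(np.pi * np.outer(x, np.arange(k)))
    return modes / np.linalg.norm(modes, axis=0)

def mode_similarity(n=64, rho=0.95, k=3):
    """Absolute cosine similarity between each eigenvector and its cosine mode."""
    vecs = leading_eigenvectors(exponential_covariance(n, rho), k)
    modes = cosine_modes(n, k)
    return np.abs(np.sum(vecs * modes, axis=0))
```

For this particular covariance, `mode_similarity()` should return values close to 1 for the first few modes, meaning the PCA patterns are nearly the sinusoids you would get from the wave equation on the same domain. Other autocorrelation shapes or irregular station layouts may behave differently, which is why I would not claim it holds in general.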

Information like this keeps me sharp.