Ex post selection based on correlation has been a long-standing issue at this blog (and has been discussed at other blogs from time to time: Luboš, Jeff Id and David Stockwell have all written on it independently). The issue came back into focus with Mann 2008, in which there is industrial strength correlation picking. While the problem is readily understood (other than by IPCC scientists), it’s hard to find specific references. Even here, surprisingly, my mentions of this have mostly been passim, in part because I’d worked on this in pre-blog days. We mention the issue in our PNAS comment, using Stockwell (AIG News, 2006) as a reference, as I mentioned before.
“Spurious” correlations are very familiar to anyone who has followed the stock market, whereas people coming from applied math and physics seem to be much quicker to reify correlations and much less wary about the possibility of self-deception.
The problem discussed by Vul et al seems highly similar to ex post selection of proxies by correlation. They write (and touch on other issues familiar to CA readers):
The implausibly high correlations are all the more puzzling because social-neuroscience method sections rarely contain sufficient detail to ascertain how these correlations were obtained. We surveyed authors of 54 articles that reported findings of this kind to determine the details of their analyses. More than half acknowledged using a strategy that computes separate correlations for individual voxels, and reports means of just the subset of voxels exceeding chosen thresholds. We show how this non-independent analysis grossly inflates correlations, while yielding reassuring-looking scattergrams. This analysis technique was used to obtain the vast majority of the implausibly high correlations in our survey sample. In addition, we argue that other analysis problems likely created entirely spurious correlations in some cases. We outline how the data from these studies could be reanalyzed with unbiased methods to provide the field with accurate estimates of the correlations in question. We urge authors to perform such reanalyses and to correct the scientific record.
In their running text, they observe:
in half of the studies we surveyed, the reported correlation coefficients mean almost nothing, because they are systematically inflated by the biased analysis.
They illustrate “voodoo correlations” with one more example of spurious correlation (echoing our reconstruction of temperature with principal components of tech stock prices):
It may be easier to appreciate the gravity of the non-independence error by transposing it outside of neuroimaging. We (the authors of this paper) have identified a weather station whose temperature readings predict daily changes in the value of a specific set of stocks with a correlation of r=-0.87. For $50.00, we will provide the list of stocks to any interested reader. That way, you can buy the stocks every morning when the weather station posts a drop in temperature, and sell when the temperature goes up. Obviously, your potential profits here are enormous. But you may wonder: how did we find this correlation? The figure of -.87 was arrived at by separately computing the correlation between the readings of the weather station in Adak Island, Alaska, with each of the 3315 financial instruments available for the New York Stock Exchange (through the Mathematica function FinancialData) over the 10 days that the market was open between November 18th and December 3rd, 2008. We then averaged the correlation values of the stocks whose correlation exceeded a high threshold of our choosing, thus yielding the figure of -.87. Should you pay us for this investment strategy? Probably not: Of the 3,315 stocks assessed, some were sure to be correlated with the Adak Island temperature measurements simply by chance – and if we select just those (as our selection process would do), there was no doubt we would find a high average correlation. Thus, the final measure (the average correlation of a subset of stocks) was not independent of the selection criteria (how stocks were chosen): this, in essence, is the non-independence error. The fact that random noise in previous stock fluctuations aligned with the temperature readings is no reason to suspect that future fluctuations can be predicted by the same measure, and one would be wise to keep one’s money far away from us, or any other such investment advisor [9].
[9] See Taleb (2004) for a sustained and engaging argument that this error, in subtler and more disguised form, is actually a common one within the world of market trading and investment advising.
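The weather-station procedure is easy to imitate with pure noise. The following is only a rough sketch, not Vul et al's actual calculation: it uses Gaussian random numbers in place of the Adak Island readings and the stock data, and the selection threshold of 0.8 is an assumed value (the quoted passage says only that a "high threshold" was chosen). The function names are likewise made up for illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def screen_noise(n_series=3315, n_days=10, threshold=0.8, seed=0):
    """Correlate one noise 'temperature' series with many noise 'stocks'.

    Returns every correlation plus the subset that passes the
    (assumed) negative threshold, mimicking ex post selection.
    """
    rng = random.Random(seed)
    temperature = [rng.gauss(0, 1) for _ in range(n_days)]
    corrs = []
    for _ in range(n_series):
        stock = [rng.gauss(0, 1) for _ in range(n_days)]
        corrs.append(pearson(temperature, stock))
    picked = [r for r in corrs if r <= -threshold]
    return corrs, picked


corrs, picked = screen_noise()
overall = mean(corrs)
print(f"mean correlation, all series:      {overall:+.3f}")
print(f"series passing the threshold:      {len(picked)}")
print(f"mean correlation, selected series: {mean(picked):+.3f}")
```

By construction nothing here is related to anything else, yet the average over the selected series cannot be weaker than the threshold used to select them. The number reported is determined by the screening, not by the data.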
Nature’s summary states:
They particularly criticize a ‘non-independence error’, in which bias is introduced by selecting data using a first statistical test and then applying a second non-independent statistical test to those data. This error, they say, arises from selecting small volumes of the brain, called voxels, on the basis of their high correlation with a psychological response, and then going on to report the magnitude of that correlation. “At present, all studies performed using these methods have large question marks over them,” they write.
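One way to see what "non-independent" means here is to do the selection on one part of the data and measure the correlation on a different part. The sketch below is an illustration under assumptions, not a description of the reanalysis Vul et al propose: all series are Gaussian noise, the 20-day length, the split into two 10-day halves and the 0.7 threshold are arbitrary choices, and the function names are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_check(n_series=3315, n_days=20, threshold=0.7, seed=1):
    """Select noise series on the first half, re-measure on the second half.

    Returns (mean selection-period r, mean held-out r, number selected).
    Raises statistics.StatisticsError if no series passes the threshold.
    """
    rng = random.Random(seed)
    half = n_days // 2
    target = [rng.gauss(0, 1) for _ in range(n_days)]
    in_sample, held_out = [], []
    for _ in range(n_series):
        series = [rng.gauss(0, 1) for _ in range(n_days)]
        r_select = pearson(target[:half], series[:half])
        if r_select <= -threshold:  # ex post selection step
            in_sample.append(r_select)
            # Second measurement uses data not seen during selection.
            held_out.append(pearson(target[half:], series[half:]))
    return mean(in_sample), mean(held_out), len(in_sample)


in_r, out_r, n_picked = split_half_check()
print(f"selected series: {n_picked}")
print(f"mean r in selection period: {in_r:+.3f}")
print(f"mean r in held-out period:  {out_r:+.3f}")
```

With noise inputs, the selection-period average is large by construction, while the held-out average should sit near zero. When the same data are used for both steps, there is no way to tell the two situations apart.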
The scientists under criticism reply that the criticisms do not matter because they apply appropriate corrections:
Appropriate corrections ensure that the correlations between the selected voxels and psychological responses are likely to be real, and not noise,
Interestingly, these criticisms are said to have an “iconoclastic tone” and to have been widely covered in blogs, much to the annoyance of the scientists defending their correlations. Nature:
The iconoclastic tone [has] attracted coverage on many blogs, including that of Newsweek. Those attacked say they have not had the chance to argue their case in the normal academic channels. “I first heard about this when I got a call from a journalist,” comments neuroscientist Tania Singer of the University of Zurich, Switzerland, whose papers on empathy are listed as examples of bad analytical practice. “I was shocked — this is not the way that scientific discourse should take place.”