Voodoo Correlations and Correlation Picking

Ex post selection based on correlation has been a long-standing issue at this blog, and has been discussed independently at other blogs from time to time: Luboš, Jeff Id and David Stockwell have all written on it. The issue came back into focus with Mann 2008, in which there is industrial strength correlation picking. While the problem is readily understood (other than by IPCC scientists), it’s hard to find specific references. Even here, surprisingly, my mentions of this have mostly been passim, in part because I’d worked on this in pre-blog days. We mention the issue in our PNAS comment, using Stockwell (AIG News, 2006) as a reference, as I mentioned before.

“Spurious” correlations are very familiar to someone familiar with the stock market, whereas people coming from applied math and physics seem to be much quicker to reify correlations and much less wary about the possibility of self-deception.

Reader Jonathan brings to our attention an interesting new study entitled “Voodoo Correlations in Social Neuroscience”, which was discussed in Nature.

The problem seems to be highly similar to ex post selection of proxies by correlation. Vul et al write (and touch on other issues familiar to CA readers):

The implausibly high correlations are all the more puzzling because social-neuroscience method sections rarely contain sufficient detail to ascertain how these correlations were obtained. We surveyed authors of 54 articles that reported findings of this kind to determine the details of their analyses. More than half acknowledged using a strategy that computes separate correlations for individual voxels, and reports means of just the subset of voxels exceeding chosen thresholds. We show how this non-independent analysis grossly inflates correlations, while yielding reassuring-looking scattergrams. This analysis technique was used to obtain the vast majority of the implausibly high correlations in our survey sample. In addition, we argue that other analysis problems likely created entirely spurious correlations in some cases. We outline how the data from these studies could be reanalyzed with unbiased methods to provide the field with accurate estimates of the correlations in question. We urge authors to perform such reanalyses and to correct the scientific record.

In their running text, they observe:

in half of the studies we surveyed, the reported correlation coefficients mean almost nothing, because they are systematically inflated by the biased analysis.

They illustrate “voodoo correlations” with one more example of spurious correlation (echoing our reconstruction of temperature with principal components of tech stock prices):

It may be easier to appreciate the gravity of the non-independence error by transposing it outside of neuroimaging. We (the authors of this paper) have identified a weather station whose temperature readings predict daily changes in the value of a specific set of stocks with a correlation of r=-0.87. For $50.00, we will provide the list of stocks to any interested reader. That way, you can buy the stocks every morning when the weather station posts a drop in temperature, and sell when the temperature goes up. Obviously, your potential profits here are enormous. But you may wonder: how did we find this correlation? The figure of -.87 was arrived at by separately computing the correlation between the readings of the weather station in Adak Island, Alaska, with each of the 3315 financial instruments available for the New York Stock Exchange (through the Mathematica function FinancialData) over the 10 days that the market was open between November 18th and December 3rd, 2008. We then averaged the correlation values of the stocks whose correlation exceeded a high threshold of our choosing, thus yielding the figure of -.87. Should you pay us for this investment strategy? Probably not: Of the 3,315 stocks assessed, some were sure to be correlated with the Adak Island temperature measurements simply by chance – and if we select just those (as our selection process would do), there was no doubt we would find a high average correlation. Thus, the final measure (the average correlation of a subset of stocks) was not independent of the selection criteria (how stocks were chosen): this, in essence, is the non-independence error. The fact that random noise in previous stock fluctuations aligned with the temperature readings is no reason to suspect that future fluctuations can be predicted by the same measure, and one would be wise to keep one’s money far away from us, or any other such investment advisor [9].
[9] See Taleb (2004) for a sustained and engaging argument that this error, in subtler and more disguised form, is actually a common one within the world of market trading and investment advising.
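
The selection effect in the quoted stock example can be reproduced with nothing but random numbers. The sketch below is only an illustration, not the procedure Vul et al actually ran: the "temperature" and the "stocks" are independent Gaussian noise, and the threshold of 0.7 is an assumption, since the paper's threshold is not given in the excerpt. The function name `voodoo_demo` is hypothetical.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def voodoo_demo(n_series=3315, n_days=10, threshold=0.7, seed=0):
    """Return (mean r of the selected series, mean r of as many fresh series).

    Every series is pure noise, so any correlation found is spurious.
    The threshold is an assumed value, not the one used in the paper.
    """
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(n_days)]

    temperature = noise()
    # Step 1: correlate every noise series with the noise "temperature".
    all_r = [pearson(temperature, noise()) for _ in range(n_series)]
    # Step 2: keep only the series that pass the threshold, then average them.
    selected = [r for r in all_r if r < -threshold]
    if not selected:
        raise ValueError("no series passed the threshold; lower it or add series")
    in_sample = sum(selected) / len(selected)

    # Independent check: the series are noise, so "the same stocks over a new
    # period" is just a fresh draw for each selected series.
    new_temperature = noise()
    fresh_r = [pearson(new_temperature, noise()) for _ in selected]
    out_of_sample = sum(fresh_r) / len(fresh_r)
    return in_sample, out_of_sample


if __name__ == "__main__":
    print(voodoo_demo())
```

By construction the first number is more negative than the threshold, because the average is taken over exactly the series that were picked for being strongly negative. The second number is computed on data that played no part in the selection, and it should hover near zero.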

Nature’s summary states:

They particularly criticize a ‘non-independence error’, in which bias is introduced by selecting data using a first statistical test and then applying a second non-independent statistical test to those data. This error, they say, arises from selecting small volumes of the brain, called voxels, on the basis of their high correlation with a psychological response, and then going on to report the magnitude of that correlation. “At present, all studies performed using these methods have large question marks over them,” they write.

The scientists under criticism say that the criticisms do not matter because they apply appropriate corrections:

Appropriate corrections ensure that the correlations between the selected voxels and psychological responses are likely to be real, and not noise,

Interestingly, these criticisms are said to have an “iconoclastic tone” and to have been widely covered in blogs, much to the annoyance of the scientists defending their correlations. Nature:

The iconoclastic tone have attracted coverage on many blogs, including that of Newsweek. Those attacked say they have not had the chance to argue their case in the normal academic channels. “I first heard about this when I got a call from a journalist,” comments neuroscientist Tania Singer of the University of Zurich, Switzerland, whose papers on empathy are listed as examples of bad analytical practice. “I was shocked — this is not the way that scientific discourse should take place.”

This entry was written by Stephen McIntyre, posted on Jan 16, 2009 at 9:42 AM, filed under General and tagged ex post.

39 Comments

Eve N.

Posted Jan 16, 2009 at 10:29 AM | Permalink

Cargo cult science strikes again!

I’m sorry if I sound facetious, but I think every scientist has to read Mr. Feynman’s speech on the subject, *especially* climate scientists, who seem to lack that certain thing which Mr. Feynman spoke of.

- Peter D. Tillman
  
  Posted Jan 16, 2009 at 11:33 AM | Permalink
  
  Re: Eve N. (#1),
  
  [Feynman discusses a meta-experiment on how to be smarter than a rat, in rat-in-maze experiments]
  
  I looked into the subsequent history of this research. The next
  experiment, and the one after that, never referred to Mr. Young.
  They never used any of his criteria of putting the corridor on
  sand, or being very careful. They just went right on running rats
  in the same old way, and paid no attention to the great discoveries
  of Mr. Young, and his papers are not referred to, because he didn’t
  discover anything about the rats. In fact, he discovered all the
  things you have to do to discover something about rats. But not
  paying attention to experiments like that is a characteristic of
  cargo cult science.
  
  Quite familiar story for CA regulars!
  
  Steve, keep after the rat-runners….
  
  Best for 2009,
  Pete Tillman
UC

Posted Jan 16, 2009 at 10:32 AM | Permalink

Not sure if related, but I was reading this http://news.bbc.co.uk/2/hi/health/7825890.stm just before the post appeared 🙂
Tom Gray

Posted Jan 16, 2009 at 10:33 AM | Permalink

From a Nature article as quoted in the posting:

Those attacked say they have not had the chance to argue their case in the normal academic channels. “I first heard about this when I got a call from a journalist,” comments neuroscientist Tania Singer of the University of Zurich, Switzerland, whose papers on empathy are listed as examples of bad analytical practice. “I was shocked — this is not the way that scientific discourse should take place.”

Note that phrase “Those attacked”. No one is being “attacked” here. A scientific practice is being subjected to informed criticism. The phrase reveals a great deal about the scientific community and the revelation is not flattering.
Ryan O

Posted Jan 16, 2009 at 10:34 AM | Permalink

The veracity of a criticism is independent of the source. Who cares if it comes from a blog? If the criticism is correct, then it is correct.
.
And Richard Feynman rulez.
Sam Urbinto

Posted Jan 16, 2009 at 10:40 AM | Permalink

Instead of calling them voxels, perhaps they should be called loa or vodun.
Hu McCulloch

Posted Jan 16, 2009 at 10:45 AM | Permalink

I always caution my econometrics students against “Data Mining”, which I interpret as searching through a long list of potential variables and combinations thereof for sets that appear to have significant correlations using critical values that are only valid for single tests. I am therefore disturbed to find that the SAS catalog has an entire section of manuals and packages that automate “Data Mining” for the researcher. Perhaps these procedures correct the critical values for the effects of the multiple tests on test size, but I doubt it.

Wikipedia’s approving article on Data Mining makes no mention of size correction.
JD Long

Posted Jan 16, 2009 at 10:47 AM | Permalink

Eve, great Feynman reference! My colleagues have suggested that at meetings we have coconut halves and if anyone breaks out in Cargo Cult Science we make them wear the coconut halves like a radio headset. The only problem is dealing with the religious battles between those who see something as CCS and those who don’t!
GTFrank

Posted Jan 16, 2009 at 11:23 AM | Permalink

from Wired July 2008, I think
“The End of Science”
“The quest for knowledge used to begin with grand theories. Now it begins with massive amounts of data. Welcome to the Petabyte Age.”

“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”
“But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the “beautiful story” phase of a discipline starved of data) is that we don’t know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on”

“There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”
- Mark T.
  
  Posted Jan 16, 2009 at 11:51 AM | Permalink
  
  Re: GTFrank (#8), I read that this summer and cried.
  
  Mark
Hoi Polloi

Posted Jan 16, 2009 at 11:23 AM | Permalink

More Voodoo Correlations
Kenneth Fritsch

Posted Jan 16, 2009 at 11:36 AM | Permalink

“Spurious” correlations are very familiar to someone familiar with the stock market, whereas people coming from applied math and physics seem to be much quicker to reify correlations and much less wary about the possibility of self-deception.

Your comment is much in line with what I have observed over the recent years in participating in blogs dealing with investment strategies where posters with hard science backgrounds seem to have a mental block when it comes to the dangers of data snooping. I sometimes think it might be the difficulty in separating the deterministic from the stochastic parts of the model.

Posters were, of course, able to show that some of these data snooped strategies worked out-of-sample (for the relatively short time periods that they existed out-of-sample), as they came up with hundreds of them that were published (out of many thousands or more generated, snooped and discarded). Instead of settling for the tendency of some to work by mere chance, they would go into detailed analysis of why their snooped strategies worked.

There was actually a poster on one web site called datasnooper, who explained, in great detail and with many examples and in an articulate manner, the dangers of data snooping. He was a whiz at statistics and a financial analyst. He had to fight off some very nasty efforts to discredit him on the site and finally gained a huge following of people who were willing and able to understand what he was attempting, in good faith, to teach them.
The leaders on the site were very hesitant to recognize his efforts and appeared to go out of their way to ignore his comments.

Once the data snooping has been used for finding a correlation (or investment strategy) it could, in this layperson’s view, be theoretically and approximately corrected for the data snooping by way of the Bonferroni correction, but that correction requires keeping track of the n (from the excerpt below), which in most cases is very difficult to impossible to do in practice, as one must include all the explicit and implicit hypotheses considered.

At least, the Bonferroni correction exercise allows one to see what that correction can do to the probability of a statistical test, provided, of course, that one is able to honestly record all the implicit and explicit hypotheses considered in finding a pet theory, conjecture or strategy.

The Bonferroni correction is a safeguard against multiple tests of statistical significance on the same data falsely giving the appearance of significance, as 1 out of every 20 hypothesis-tests is expected to be significant at the α = 0.05 level purely due to chance. Furthermore, the probability of getting a significant result with n tests at this level of significance is 1-0.95n (1-probability of not getting a significant result with n tests).

http://en.wikipedia.org/wiki/Bonferroni_correction
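
A few lines of arithmetic show how quickly the chance of at least one false positive grows with the number of tests. This is a minimal sketch that assumes the tests are independent and reads the expression in the excerpt as 1 minus 0.95 raised to the power n. The function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability that at least one of n independent true-null tests
    comes out 'significant' at per-test size alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        # Uncorrected, and with the per-test size divided by n (Bonferroni).
        uncorrected = familywise_error_rate(n)
        corrected = familywise_error_rate(n, alpha=0.05 / n)
        print(n, round(uncorrected, 4), round(corrected, 4))
```

The catch described above still applies: the correction only helps if n counts every hypothesis that was actually looked at, explicit or implicit.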
Demesure

Posted Jan 16, 2009 at 12:00 PM | Permalink

The stocks prediction method based on thermometer/stock correlation should be marketed with Mannian speak. It would be rendered obscure enough to sell like crazy to subprimes brokers.
Carl G

Posted Jan 16, 2009 at 12:53 PM | Permalink

#6: HU,

I have both SAS and Enterprise Miner, and I don’t think that data mining is necessarily a problem. Suppose I have a dataset of 10,000,000 addresses and I am trying to market. An overfit model on a portion of that dataset will predict well at all on another portion of that dataset. The easy solution is this: make hundreds of models (different types of models, different variables, and/or different parameters) that fit well to a subset of the data. Then, on another large subset of the data, select the model that fits the 2nd dataset the best. The principle is that a model that is overfit on the calibration data will certainly predict horribly on the validation set, but a more modest calibration model will fit the validation set better. Finally, the model is used on a third subset of the data. This data, not used at all in selecting variables or setting parameters, is the best possible gauge of how the model will do on other datasets. It is from predictions on this dataset that confidence statistics should come.

If there’s something wrong with the above method, I don’t see it. I think data mining has gotten a bad rap, especially as an exploratory tool, when it’s really poor validation methodology (or none at all) that is the problem.
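
The three-subset procedure described in this comment can be sketched on synthetic data. Everything here is an assumption made for illustration: the "models" are single noise predictors, the target is noise too, the score is the absolute Pearson correlation, and the shortlist keeps the top 10% from calibration. It is not SAS or Enterprise Miner output, and `three_way_selection` is a hypothetical name.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def three_way_selection(n_models=200, n_obs=300, seed=1):
    """Return the chosen model's |r| on (calibration, validation, test)."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    # Candidate "models": noise predictors with no real skill (assumption).
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)]
                  for _ in range(n_models)]

    third = n_obs // 3
    calibration = slice(0, third)
    validation = slice(third, 2 * third)
    test = slice(2 * third, n_obs)

    def score(model, part):
        return abs(pearson(model[part], target[part]))

    # Stage 1: shortlist the best 10% of candidates on the calibration subset.
    ranked = sorted(candidates, key=lambda m: score(m, calibration), reverse=True)
    shortlist = ranked[: max(1, n_models // 10)]
    # Stage 2: pick the shortlisted model that does best on validation.
    best = max(shortlist, key=lambda m: score(m, validation))
    # Stage 3: the test subset took no part in either choice.
    return score(best, calibration), score(best, validation), score(best, test)


if __name__ == "__main__":
    print(three_way_selection())
```

Because both earlier subsets were used to choose the model, their scores flatter it. Only the third score is free of selection, which is why confidence statistics would come from that subset.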
- Kenneth Fritsch
  
  Posted Jan 16, 2009 at 1:13 PM | Permalink
  
  Re: Carl G (#14),
  
  Perhaps data mining and data snooping should be contrasted, but I think there are dangers in both. I think of data mining as the process that you describe above as opposed to data snooping as I described in my previous post.
  
  Your methods prescribe an out-of-sample test that works within limits but depends on the experimenter not peeking at the “virgin” data or making multitudes of runs in searching for a fit in the in-sample and out-of-sample periods.
  
  I would also think that some well-reasoned a priori assumptions would complement the data mining approach and eliminate some rather obviously spurious correlations.
  
  Now I will let Hu give you an expert and professional answer.
Carl G

Posted Jan 16, 2009 at 1:02 PM | Permalink

#14: 2nd line, I meant to say: “An overfit model on a portion of that dataset will *NOT* predict well at all on another portion of that dataset”
Craig Loehle

Posted Jan 16, 2009 at 1:08 PM | Permalink

There is a difference between data mining for marketing on mailing addresses and in science. If your data mining finds that a certain zip code/income level/whatever buys more of certain products, that relationship will probably remain stable since the same people live there and the zip has the same income level from one year to the next. In stocks, the selected stocks based on spurious relations may well continue to go up or down for a while, and thus for short-term trading you might make money, but then things change…In other areas of science, not so much.
Jonathan

Posted Jan 16, 2009 at 1:30 PM | Permalink

My own thread on Climate Audit – now I can die happy 🙂

I should give full credit to my wife (an avid reader but rare poster) who spotted the Nature article over breakfast this morning.
Hu McCulloch

Posted Jan 16, 2009 at 1:32 PM | Permalink

Re Kenneth Fritsch #11,
Thanks, Ken (Kenneth?) for the reference to the Wikipedia Bonferroni article! Unfortunately, your quote didn’t cut and paste the exponent correctly. Your comment had

The Bonferroni correction is a safeguard against multiple tests of statistical significance on the same data falsely giving the appearance of significance, as 1 out of every 20 hypothesis-tests is expected to be significant at the α = 0.05 level purely due to chance. Furthermore, the probability of getting a significant result with n tests at this level of significance is 1-0.95n (1-probability of not getting a significant result with n tests).

The last sentence should have read, using ^ for exponentiation since I can’t get superscripts to work without going into LaTeX,

Furthermore, the probability of getting a significant result with n tests at this level of significance is 1-0.95^n (1-probability of not getting a significant result with n tests).

In fact, this equation is the basis of what is known as the Šidák correction: If n independent tests are performed at size α*, the probability that none will falsely reject the null is (1-α*)^n. In order for this to equal 1-α, where α is the probability that at least one of the tests will falsely reject its null, we must have α* = 1-(1-α)^(1/n). This adjustment is exact, but only for independent tests. For example, if we wanted the final test size α to be .05 and ran 5 independent tests, the Šidák adjusted test size would be 1-.95^(1/5) = .0102.

The Bonferroni adjustment itself (which is referenced in the Wikipedia paragraph preceding the one Ken quotes) is based on Bonferroni’s Inequality, which here states that whether or not the tests are independent, α* ≥ α/n. For example, if we want α = .05, we may have to set the individual test size α* as low as .05/5 = .01 in order to achieve it. This is the worst case scenario, but for small test sizes it isn’t much different from the Šidák adjustment, which is based on independence, so it makes sense to just use the simple Bonferroni adjustment and rest assured that it is conservative.

The best-case scenario is based on the additional inequality α ≥ α*. When this holds with equality, no adjustment is required at all, but it only applies when the individual tests are perfectly correlated, i.e. they are replications of exactly the same test with the same data. Hence it’s not very relevant, even if it’s the basis for Mann’s 2008 Pick-Two procedure.

The Bonferroni and Šidák adjustments are discussed in an article by Hervé Abdi from the Encyclopedia of Measurement and Statistics cited by the Wiki article and online at http://www.utdallas.edu/~herve/Abdi-Bonferroni2007-pretty.pdf. Unfortunately, Abdi does not make it clear that Šidák is exact for independent tests, while Bonferroni is a worst-case inequality.

Of course, using either of these adjustments potentially loses a lot of efficiency, since a rectangular acceptance region is ordinarily not optimal. With Gaussian errors, if possible we would want to explicitly measure the correlations and use elliptical regions governed by a χ^2 or F critical value.
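
The two per-test sizes discussed in this comment are easy to compare numerically. The sketch below just evaluates the formulas as stated, α* = 1-(1-α)^(1/n) for independent tests and α/n for the Bonferroni bound. The function names are hypothetical.

```python
def sidak_size(alpha, n_tests):
    """Per-test size that gives overall size alpha for n independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def bonferroni_size(alpha, n_tests):
    """Per-test size from the Bonferroni bound; valid without independence."""
    return alpha / n_tests


if __name__ == "__main__":
    for n in (5, 20, 400):
        print(n, round(sidak_size(0.05, n), 6), round(bonferroni_size(0.05, n), 6))
```

For α = .05 and n = 5 this reproduces the .0102 and .01 of the worked example, and the Bonferroni size is never larger than the Šidák size, which is the sense in which it is conservative.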
- John A
  
  Posted Jan 16, 2009 at 5:00 PM | Permalink
  
  Re: Hu McCulloch (#19),
  
  Hu, it’s pretty easy to use LaTeX here – just surround your text with the tex tags (the button is on the quicktags line and works the same way as bold or blockquote). Just don’t add the $$ signs because they are not required.
  
  $\alpha^* = 1-(1-\alpha)^{\frac{1}{n}}$
  
  $\chi ^2$
- Kenneth Fritsch
  
  Posted Jan 16, 2009 at 6:56 PM | Permalink
  
  Re: Hu McCulloch (#19),
  
  Thanks, Ken (Kenneth?)
  
  Hu, you can call me Kenneth or you can call me Ken or you can call me Kenny, just don’t call me Kenny boy.
  
  Thanks for the link to the Abdi article and your more complete explanation of the Bonferroni correction.
Steve Geiger

Posted Jan 16, 2009 at 1:38 PM | Permalink

Craig L., you write

“. In stocks, the selected stocks based on spurious relations may well continue to go up or down for a while, and thus for short-term trading you might make money”

I thought the issue was that given enough chances (i.e., different stocks) one could find a reasonable correlation between a few of them and, say, the price of tea in China (Chinese tea stocks notwithstanding, of course ;-). Thus, if that is the issue, I would see no benefit (short term or otherwise) in buying any stocks based on the seemingly great correlation to (fill in the parameter).
Maurice Garoutte

Posted Jan 16, 2009 at 2:01 PM | Permalink

Parametric studies are a good way to handle a problem that you don’t understand. In the world of academia correlations can point to where more study could lead to understanding. However, recommending changes in the real world based on correlations that are not understood on the basis of first principles is reaching too far.

My favorite example is the neural net that correlated tanks with clouds and concluded that tanks must be present if it’s cloudy. http://neil.fraser.name/writing/tank/

As founding CTO for Cernium (http://www.cernium.com/) my work involved the analysis of large noisy data sets of visual primitives and extracting high level recognition data such as “car” or “person”. There were many parameters, most of which were the same for all classes of target. My motto was “Correlation should make me curious not convinced”.

If I was unable to understand the first principles of why a visual primitive should indicate an object class I did not use it. First principles are always correct. Correlations that hold up for one data set (the past for example) may not be relevant to another data set (the future in the case of climate).
- Kenneth Fritsch
  
  Posted Jan 16, 2009 at 6:48 PM | Permalink
  
  Re: Maurice Garoutte (#21),
  
  If I was unable to understand the first principles of why a visual primitive should indicate an object class I did not use it. First principles are always correct. Correlations that hold up for one data set (the past for example) may not be relevant to another data set (the future in the case of climate).
  
  I recall some of the data snooping that was used in formulating some investment models using various past performance criteria and then looking at the square and cube roots of those criteria. After obtaining an unbelievable in-sample performance with the cube root of the criteria, some of the rather convoluted explanations that were derived after the fact would almost sound convincing – if one did not recognize the dangers of data snooping.
henry

Posted Jan 16, 2009 at 2:22 PM | Permalink

I suppose this is better in this thread:

The American Statistical Association (ASA) invites applications and nominations for the position of editor of the Statistical Analysis and Data Mining.

Further info at amstat.org

I wonder if any Climate Scientists are members of the ASA. Sounds like their kind of journal.
Hu McCulloch

Posted Jan 16, 2009 at 2:41 PM | Permalink

Carl G writes in #14,

#6: HU,

I have both SAS and Enterprise Miner, and I don’t think that data mining is necessarily a problem. Suppose I have a dataset of 10,000,000 addresses and I am trying to market. An overfit model on a portion of that dataset will predict well at all on another portion of that dataset. The easy solution is this: make hundreds of models (different types of models, different variables, and/or different parameters) that fit well to a subset of the data. Then, on another large subset of the data, select the model that fits the 2nd dataset the best. The principle is that a model that is overfit on the calibration data will certainly predict horribly on the validation set, but a more modest calibration model will fit the validation set better.

Validation out of sample can help, but is not a cure-all. Suppose you start with 400 models that in truth have no explanatory power. We would expect to find that 20 of them are significant at the 5% level using the initial data set. Of these we would expect 1 to be significant at the 5% level using the validation data set, so it is not true that a false model will “certainly predict horribly on the validation set” as you state.

It would make more sense to me to apply all 400 models to the combined data set, but to use a Bonferroni-adjusted test size of .05/400 = .000125 on the individual models to obtain a final test size of .05. (This corresponds, with large samples, to t values greater than about 3.84. This is not unusually high for a clearly valid model.) The final model(s) can then be double-checked for stability across subsamples with a Chow (switching regression) test, if desired.

Finally, the model is used on a third subset of the data. This data, not used at all in selecting variables or setting parameters, is the best possible gauge of how the model will do on other datasets. It is from predictions on this dataset that confidence statistics should come.

Again, with sufficient initial models, even this two-stage validation with a third data set could give misleading results. Suppose one had three decades of annual data on 8,000 tree ring series, and tried to use them to model the following year’s return on the S&P 500 stock index. Of these, 400 are expected to be significant at the 5% level in the first decade. Then, 20 of these survivors are expected to be significant at the 5% level when verified with the second decade of data. And 1 of the 20 finalists is expected to be significant when re-verified with the third decade of data. (Of course, the realization will not be 1 necessarily, but just a random number with expectation 1.)

I sure wish I knew which tree that was before this past year’s bear market! Is it a Gaspé cedar, a Graybill BCP, or some linear combination thereof?
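
The arithmetic in this comment can be checked in a few lines. This is a sketch under the comment's own assumptions: every candidate model is truly useless, and each validation stage is an independent test at the 5% level. The normal approximation stands in for the large-sample t distribution, and the function names are hypothetical.

```python
from statistics import NormalDist


def expected_survivors(n_models, alpha, n_stages):
    """Expected number of useless models that pass n independent stages,
    each a test at size alpha."""
    return n_models * alpha ** n_stages


def bonferroni_critical_z(alpha, n_tests):
    """Two-sided normal critical value for a per-test size of alpha / n_tests."""
    per_test = alpha / n_tests
    return NormalDist().inv_cdf(1.0 - per_test / 2.0)


if __name__ == "__main__":
    print(expected_survivors(400, 0.05, 2))    # 400 models, two data sets
    print(expected_survivors(8000, 0.05, 3))   # 8,000 tree ring series, three decades
    print(bonferroni_critical_z(0.05, 400))    # cutoff for .05/400 = .000125
```

Both survivor counts come out at about 1, and the critical value lands close to the 3.84 quoted for the Bonferroni-adjusted test.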
Steve McIntyre

Posted Jan 16, 2009 at 2:59 PM | Permalink

Almagre tree 2007-31 predicted the market decline, well ahead of the actual event giving ample opportunity to trade profitably.
Jerry M

Posted Jan 16, 2009 at 4:01 PM | Permalink

I also use SAS and Enterprise Miner for forecasting and predictive modeling. Yesterday I presented a model I developed for another group in my company. They were very excited and want more, but my time is limited. I was asked, “Well if you can’t do it, can you just tell one of our analysts what you did and they can just plug in new numbers?”. No, no I can’t. One needs to know statistics and what they are capable of.
Hu McCulloch

Posted Jan 16, 2009 at 5:00 PM | Permalink

Henry #22 reports,

The American Statistical Association (ASA) invites applications and nominations for the position of editor of the Statistical Analysis and Data Mining.

The new journal’s webpage is at http://www3.interscience.wiley.com/journal/112701062/home.

The very title sets my teeth on edge, sort of like “The ASA Journal of Cherry Picking” would. But given that this is coming from the ASA, perhaps they are cognizant of the potential perils of data mining, and are primarily concerned with how to compensate correctly for it, e.g. with the Bonferroni or Šidák adjustment?

Or maybe it’s really just a spoof, like the famous Journal of Irreproducible Results, or the Sokal hoax?
per

Posted Jan 16, 2009 at 6:59 PM | Permalink

interesting post.

It is probably germane to note that this paper, and the story, are not unique. This one has got a lot of press, perhaps due to the title, which is snappy.

However, this is a common problem. See:
Nature 454, 682-685 (2008)
Amyotroph Lateral Scler. 2008;9(1):4-15 Scott et al.

Scott et al followed up a mouse model used in over 50 publications, all of which purportedly showed statistically significant effects on prolonging lifespan in the mouse. Little bit of poor control of variables, bit of chucking out data you don’t like, and the use of small group size; and gee whiz, you have 50 papers. Small group sizes are key, because you can get a big effect size by random chance; you don’t publish the uninformative experiments but you do get a decent number of false positives. When you do the experiment properly, you cannot repeat the “positive” studies.

One of the interesting aspects is the Nature overview, which quotes one researcher: “There just aren’t the resources now to do really large, well-powered mouse studies”. That’s interesting, ‘cos there was all the money spent on enabling 50 different publications (plus unpublished failed replications), plus the resulting clinical trials in humans, and that has all gone down the drain because the mouse studies are (according to Scott’s study) not replicable. So 50 different publications is going to be ~$5 million, plus the clinical studies, which will easily double that.

there is also the rather invidious issue that many of these researchers who got these fantastic papers got follow-on grant money, prestige, etc.

per
- DeWitt Payne
  
  Posted Jan 16, 2009 at 7:26 PM | Permalink
  
  Re: per (#30),
  
  Sounds like the old quality management saw: “There’s not enough resources to do it right the first time, but there’s always enough to do it over again.” Poorly designed trials can keep a project alive for years.
- MarkB
  
  Posted Jan 16, 2009 at 9:10 PM | Permalink
  
  Re: per (#30),
  
  I saw something similar when I was in grad school. People use poor data sets, or limited experimental design. You point out their error – carefully, because you can’t make enemies out of faculty – and the response is “It’s not optimal, but it’s all we’ve got in this field, so we need to use it to solve the problem.” That’s the good careerist answer, but of course the truth is that if you are unable to do it right, then you have no right to do it. Wrong is wrong, and if that means you have to find some other way to get published, so be it.
Steve McIntyre

Posted Jan 16, 2009 at 9:28 PM | Permalink

Little bit of poor control of variables, bit of chucking out data you don’t like, and the use of small group size; and gee whiz, you have 50 papers. Small group sizes are key, because you can get a big effect size by random chance

Sounds like the Team multiproxy literature with a bunch of studies with 6-15 “proxies”, all with bristlecones and/or Yamal, plus Tornetrask, ….
jorgekafkazar

Posted Jan 17, 2009 at 2:06 PM | Permalink

For some reason, this part of the Sokal link thread, above, stood out:

http://en.wikipedia.org/wiki/Rosenhan_experiment

Researchers claimed to hear voices and were admitted to institutions. When the phony patients wrote their daily research notes, genuine patients quickly caught on that the researchers were sane. Staff, on the other hand, interpreted the note making as compulsive behaviour, a further symptom of insanity. The only way some of them could get released was to admit that they were insane. Irony²!

There’s a parallel here, somewhere…
EW

Posted Jan 17, 2009 at 2:22 PM | Permalink

“There just aren’t the resources now to do really large, well-powered mouse studies”.

And there’s also a trend to limit the use of laboratory animals or skip some levels of animal testing in the experiments, because of all this “green” legislation, especially in the EU. This, of course, can lead to half-baked experiments with totally insufficient controls.
al

Posted Mar 6, 2009 at 2:21 PM | Permalink

I thought this might amuse http://www.xkcd.com for more like it..
Dr Justin Marley

Posted Jan 1, 2010 at 7:23 AM | Permalink

Hi,

Any chance of some constructive feedback on a video I put together about the above study?

Regards

Justin
Matthew Lieberman

Posted Mar 24, 2010 at 3:22 PM | Permalink

For anyone interested, there was a public debate on Voodoo Correlations last fall at the Society of Experimental Social Psychologists between Piotr Winkielman (one of the authors on the Voodoo paper) and myself (Matt Lieberman). The debate has been posted online.

http://www.scn.ucla.edu/Voodoo&TypeII.html
tz2026

Posted May 30, 2013 at 4:43 PM | Permalink

Speaking of Voodoo, anyone else notice as soon as they started burning witches, the Medieval Warm Period ended? And that when we stopped burning witches temperatures started climbing?