The Texas Sharpshooter fallacy is a logical fallacy in which a man shoots a barn thirty times, then circles the bullet holes nearest each other after the fact and calls that his target. It’s of particular concern in epidemiology.
Folks, you are never going to see a better example of the Texas Sharpshooter work itself out in real life than Caspar Ammann’s handling of Mann’s RE benchmark.
I introduce you to Caspar Ammann, the Texas Sharpshooter. Go get ’em, cowboy.
In Ammann’s replication of MBH, he reports a calibration RE (the Team re-brand of the despised calibration r2 statistic) of 0.39 and a verification RE of 0.48. So that’s his bulls’ eye.
In our original papers, we observed that combinations of high calibration RE and high verification RE statistics were not necessarily “99.99% significant” (whatever that means), but were thrown up quite frequently even by red noise handled in Mannian ways. So something that might look at first blush like sharpshooting, could happen by chance.
In my first post, the other day on this, I observed that Ammann’s simulations, like ours, threw up a LOT of high RE values – EXACTLY as we had found. There are nuances of differences in our simulations, but he got a 99th RE percentile of 0.52, while we got 0.54 in MM2005c. Rather than disproving our results, at first blush, Ammann’s results confirmed them. Mann didn’t appear to be quite the sharpshooter that he proclaimed himself to be or that everyone thought. (This is something that should have been reported in their article, but, needless to say, they aren’t going to admit that we know the street that we live on.)
It’s not that the MBH RE value for this step isn’t in a high percentile – it is, something that we reported in our articles, though in a slightly lower percentile according to our calculations. For us, the problem was the failure of other statistics, which suggested to us that the seemingly high RE statistic (99.999% significant) was an illusion from inappropriate benchmarking – a form of analysis familiar in econometrics (especially the seminal Phillips 1986). The pattern of MBH statistics (high RE, negligible verification r2) was a characteristic pattern of our red noise simulations – something we reported and observed in our 2005 articles.
Obviously, it wasn’t enough for Ammann to show that the MBH RE value was in a high percentile – he wanted to show that it was “99% significant” as the maestro had claimed.
So he re-drew the bulls’ eye. A couple of days ago, I described the two steps whereby Ammann gets the MBH RE score (0.4817) into the 99% “bullseye”, but that was a first-cut analysis and did not tie it directly to the re-drawing of the bulls’ eye.
Ammann’s first step was to assign an RE value of -9999 to any result with a calibration RE under 0. That only affected 7 out of 1000 and didn’t change the 99th percentile anyway. So this seemingly plausible argument had nothing to do with re-drawing the bulls’ eye, as noted previously.
The bulls’ eye was re-drawn in the next step – where Ammann proposed a “conservative” ratio of 0.75 between the calibration RE and verification RE statistics. Using this “conservative” ratio, he threw out 419 out of 1000 votes. The salient question is whether this “conservative” procedure has any validity or whether it’s more like throwing out black votes because the voters couldn’t answer a skill-testing question like naming the capital of a rural county in Tibet or identifying the 11th son of Ramesses II. I’ll provide some details below and you decide.
First, no one has ever heard of this “conservative” benchmark – and I mean, no one. You can’t look up this “conservative” ratio in Draper and Smith or any other statistical text. The “conservative” benchmark is completely fabricated. So everyone’s statistical instincts should be on red alert (as Spence_UK’s and Ross’s have been and as mine were).
So I thought – let’s look at the votes that didn’t count. What did the rejected votes actually look like? First of all, if a simulation had a negative RE score, those would fail the test and be re-assigned to -9999. OK, but those didn’t matter, because they were already to the left of the target; the 99% bulls’ eye wasn’t affected by this.
The only ones that mattered were the votes with RE scores higher than MBH which were thrown out on this new technicality. There were 13 votes thrown out on this pretext, which I list below in order of decreasing verification RE score (note once again how high both the calibration and verification RE scores are in these rejected votes). Most of the rejected votes had calibration RE values above 0.3, slightly lower than the calibration RE in the WA emulation of MBH (0.39), but the third one in the list had both a calibration RE and a verification RE that were higher than MBH. Nonetheless, the vote still got thrown out. The RE score was too “good”.
For a calibration RE of 0.3957, the maximum allowable verification RE to be eligible would be 0.528 (0.3957/0.75)! Turn that over in your minds, folks. If the calibration RE was 0.3957, unless the verification RE fell between 0.4817 (MBH) and 0.528, the score would be placed to the left of MBH and the bulls’ eye re-drawn. Redneck scrutineers would be proud.
# Cal_RE Ver_RE
647 0.3390 0.644
944 0.3485 0.620
113 0.3957 0.609
548 0.3016 0.599
374 0.3542 0.550
683 0.2479 0.542
153 0.3826 0.519
146 0.3112 0.514
299 0.3383 0.508
40 0.3176 0.508
194 0.1840 0.502
492 0.3552 0.491
656 0.3284 0.483
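For anyone who wants to check the arithmetic on these 13 votes, here is a minimal sketch of the screening rule as I've described it in this post. It is not Ammann's code: the names `RATIO`, `FAIL`, `ceiling` and `screen` are my own for illustration, and the rule is assumed to be "keep a verification RE only if calibration RE is non-negative and calibration RE / verification RE is at least 0.75".

```python
# Illustrative sketch of the screening rule as described in this post.
# Function and constant names are hypothetical, not taken from Ammann's code.

RATIO = 0.75          # the "conservative" calibration/verification ratio
MBH_VER_RE = 0.4817   # MBH verification RE for this step
FAIL = -9999.0        # sentinel assigned to results that fail the screen

# (simulation number, calibration RE, verification RE), copied from the table
REJECTED = [
    (647, 0.3390, 0.644), (944, 0.3485, 0.620), (113, 0.3957, 0.609),
    (548, 0.3016, 0.599), (374, 0.3542, 0.550), (683, 0.2479, 0.542),
    (153, 0.3826, 0.519), (146, 0.3112, 0.514), (299, 0.3383, 0.508),
    (40, 0.3176, 0.508), (194, 0.1840, 0.502), (492, 0.3552, 0.491),
    (656, 0.3284, 0.483),
]


def ceiling(cal_re, ratio=RATIO):
    """Largest verification RE that still passes for a given calibration RE."""
    return cal_re / ratio


def screen(cal_re, ver_re, ratio=RATIO):
    """Return ver_re if the result survives the screen, otherwise the sentinel."""
    if cal_re < 0 or ver_re > ceiling(cal_re, ratio):
        return FAIL
    return ver_re


for sim, cal_re, ver_re in REJECTED:
    print(
        f"{sim:>4}  cal={cal_re:.4f}  ver={ver_re:.3f}  "
        f"ratio={cal_re / ver_re:.3f}  ceiling={ceiling(cal_re):.3f}  "
        f"screened={screen(cal_re, ver_re):g}"
    )
```

Every row has a verification RE above the MBH value and a ratio under 0.75, so under this reading each one is re-assigned to -9999 and lands to the left of MBH.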
Once the above 13 votes were thrown out, MBH was declared the winner of the election by 99% of the votes – sort of like a paleoclimate Kim Jong Il.
Let’s think a little further about the “conservative” ratio of 0.75 between calibration RE and verification RE. The one that no one’s ever heard of. Where did it come from? As soon as he saw it, Spence_UK thought that it probably stunk and, needless to say, it does. Here’s how it works.
The MBH ratio in the AD1400 step is 0.813. So any ratio that is higher than 0.813 would cause the MBH result to be thrown out. 0.75 is tucked in just under the value that would cause the MBH result to be thrown out. That’s the first part. (Ammann’s code shows that he tested a variety of cases with values higher than 0.813, but that these ratios would cause MBH rejection is never mentioned.)
On the other hand, if you go to a ratio of 0.5 (also a case shown in the code but not discussed), you don’t throw out enough votes. Only 2 votes would get thrown out with such a criterion and MBH would not win the election.
So 0.75 is pretty much the optimum value for throwing the maximum number of votes out without throwing MBH out. Perhaps this is what Ammann meant by a “conservative” ratio – he’s allying himself with redneck vote manipulation. Hardly what one expects in Boulder, Colorado, but life is strange.
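The squeeze on the ratio can be checked against the 13 rejected votes listed earlier. The sketch below is my own illustration, not Ammann's code: `passes` and `sweep` are hypothetical names, the MBH values are the rounded 0.39 and 0.48 quoted above, and 0.85 is simply an arbitrary stand-in for "some ratio above 0.813", since the exact higher values tested in his code aren't listed here.

```python
# Illustrative sweep over candidate ratios; names are hypothetical.

MBH_CAL_RE, MBH_VER_RE = 0.39, 0.48  # rounded values quoted in this post

# (calibration RE, verification RE) for the 13 simulations with
# verification RE above MBH that were rejected at a ratio of 0.75
ABOVE_MBH = [
    (0.3390, 0.644), (0.3485, 0.620), (0.3957, 0.609), (0.3016, 0.599),
    (0.3542, 0.550), (0.2479, 0.542), (0.3826, 0.519), (0.3112, 0.514),
    (0.3383, 0.508), (0.3176, 0.508), (0.1840, 0.502), (0.3552, 0.491),
    (0.3284, 0.483),
]


def passes(cal_re, ver_re, ratio):
    """True if calibration RE is at least `ratio` times verification RE."""
    return cal_re >= ratio * ver_re


def sweep(ratios):
    """Map each ratio to (votes rejected from ABOVE_MBH, does MBH itself pass)."""
    results = {}
    for ratio in ratios:
        rejected = sum(not passes(c, v, ratio) for c, v in ABOVE_MBH)
        results[ratio] = (rejected, passes(MBH_CAL_RE, MBH_VER_RE, ratio))
    return results


results = sweep([0.5, 0.75, 0.85])  # 0.85 is an assumed example above 0.813
for ratio, (rejected, mbh_ok) in results.items():
    print(f"ratio={ratio:.2f}  rejected={rejected:>2} of {len(ABOVE_MBH)}  MBH passes={mbh_ok}")
```

On these numbers, a ratio of 0.5 removes only 2 of the 13, 0.75 removes all 13 while MBH survives, and the assumed 0.85 removes MBH as well.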
Now there are other issues involved in all of this, such as whether bristlecones operate as a type of radar meteorological antennae measuring temperatures in Asia, Africa and Australia. And nothing in this particular dispute affects the “big picture”.
In the past, I’ve sometimes sarcastically referred to the Team as the gang who couldn’t shoot straight. You’d think that if they sent out a Texas sharpshooter gunning for Ross and me, that they’d send out a guy that wouldn’t shoot himself in his own foot. Or draw the bulls’ eye with himself in the middle? But that’s the Team.
Who else could lose a Texas sharpshooting contest?