You can use the VF framework to test model-data equivalence while allowing for model heterogeneity. To do so you construct the joint test {[model1=obs], [model2=obs], [model3=obs], …, [model23=obs]} (see math below).

The catch is that, to be testing modeling as an “approach”, you need to do this as a joint test across models, which is what the curly braces denote. In other words, what you cannot do is run separate tests [model1=obs], [model2=obs], [model3=obs], …, [model23=obs], then pick the lowest F score and, if it is insignificant, claim the “approach” is consistent with observations. In that case each test is run as if it were the only model in the world, or as if all the other models were unrelated to the one being tested, which contradicts the notion that you are testing models jointly. The independence assumption means that a low F score on model #23 would tell you no more about models #1 through #22 than it would about Environment Canada’s weather prediction model or the US Fed’s GDP forecasting model. Claiming otherwise means trying to have it both ways: the independence assumption lets you get pretty much any F score you want, but then you have to pretend you never made the independence assumption in order to say something nontrivial about modeling.

If you want a result that tells you something about the group (or “approach”) of models, without imposing the assumption that all the models have the same trend, you need to use the multivariate testing approach to construct an F score for the joint test given by the expression in the {}’s above. In equation (16) in the MMH paper, the R matrix will consist of a 23-row identity matrix for its top left block, zeros in the bottom 4 rows, and the constant -0.25 everywhere else (the first 23 rows of the last 4 columns). Also, q now equals 23, which is the number of restrictions being tested. This tests each model against the mean of the 4 observed trends. (Write out Rb to see why.)
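For anyone who wants to write out Rb without pencil and paper, here is a minimal NumPy sketch. It only builds R as described above and multiplies it by a made-up trend vector b (23 model trends followed by 4 observed trends). The numbers in b are hypothetical placeholders, not estimates from the paper, and none of this reproduces the VF test statistic itself.

```python
import numpy as np

n_models, n_obs = 23, 4
k = n_models + n_obs  # 27 trend coefficients in total

# R as described in the comment: identity in the top-left 23x23 block,
# -0.25 in the first 23 rows of the last 4 columns, zeros in the bottom 4 rows.
R = np.zeros((k, k))
R[:n_models, :n_models] = np.eye(n_models)
R[:n_models, n_models:] = -1.0 / n_obs

# Hypothetical trend vector, for illustration only (not real data).
rng = np.random.default_rng(0)
b = rng.normal(loc=0.2, scale=0.05, size=k)

Rb = R @ b

# Each of the first 23 entries is (model trend) minus (mean of the 4 observed trends).
model_minus_obs_mean = b[:n_models] - b[n_models:].mean()
print(np.allclose(Rb[:n_models], model_minus_obs_mean))  # True
print(np.allclose(Rb[n_models:], 0.0))                   # True: bottom rows add no restrictions
```

So the null Rb=0 says every one of the 23 model trends equals the mean of the 4 observed trends, which is why q=23.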

As q increases, the critical value of F increases as well. For 23 restrictions the 99% critical value for the F2 test is 118.3. The test I have described, if I’ve done things correctly, yields F scores of 661.5 in the LT layer and 903.02 in the MT layer on data spanning 1979-2009. That means we can reject, at the 1% significance level, the hypothesis that the models jointly have the same trend as the mean of the observations. This is not the same as testing the model mean trend against the observational mean trend (which also rejects significantly).

I think this comment also serves to address Nick Stokes’ point below.

]]>Yes. And the way you do the test is as follows. Stack the vectors in the order you have written them, then define the dummy variable d=(one,zero,zero,zero)’ where one is a 100-length vector of ones, and zero is a 100-length vector of zeros. Then estimate

y=b0+b1.t+b2.d.t+b3.d+e

Then test b2=0.
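Writing out the two cases of the dummy may help show why b2 is the coefficient to test. This is just the equation above evaluated at d=0 and d=1, using the same symbols:

```latex
\begin{aligned}
d = 0:&\quad y = b_0 + b_1 t + e \\
d = 1:&\quad y = (b_0 + b_3) + (b_1 + b_2)\, t + e
\end{aligned}
```

So b2 is the difference in trend between the series flagged with d=1 and the remaining series, and b3 is the corresponding difference in intercept.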

]]>Ross.

Thanks. So, back to your example…

tt = B * t + c

t0 = B0 * t + c

t1 = B1 * t + c

t2 = B2 * t + c

… and test for a difference between the B and Bn coefficients, right?

]]>Lazar, for the basic model/obs diff-significance test, the panel regression equation is equation 10:

y=b0+b1.t+b2.d.t+b3.d+e

The code file is VF09.do, and even if you don’t use Stata it should be readable. The SE on the model trend is the SE on the trend slope (b2). It comes from the computation of panel-corrected standard errors, in which each panel is assigned its own variance and its own AR1 coefficient, and the off-diagonal covariances are also calculated.

All the models enter the estimation individually. We do not average the models together when estimating the regression parameters. For models with multiple runs we average the runs together to create an ensemble mean for that model, but if we used the runs individually it would not change our conclusions (IIRC — this was something a reviewer demanded in an early round).

Hope this helps.

]]>Ross, where you state

“In this case the null is: the models and the observations have the same trend over 1979-2009.”

… are you essentially testing the mean of models or each model individually?

]]>Ross,

could you write the regression equations? (I’ve read MMH, but still don’t get where an s.e. of 0.08 comes from, or where the model spread of trends comes into the testing…)

tx

]]>“The model may perform its calculations well and simply the assumptions are wrong. It could make its calculations wrong and get a “right” answer by sheer chance. It could get its assumptions wrong but multiple wrongs may give the correct answer.”

You’re right, which is why having multiple tests with various observational data sets can help discover what is correct and what is not.

That’s the problem I see. It is a case of having your cake and eating it too. Of all the possible tests to run, the few that are passed get cherry-picked and shown as support, and the rest are disregarded as unimportant. If you run 3 tests and only one of the 3 passes, that should be stated plainly in the report on the model.

What we get… IT PASSED THE TEST!

What we should get… It passed one test but failed 2 others. They can then go on to explain why the other 2 tests matter less than the one that passed; ignoring the other 2 tests, however, is wrong.

Also, if you can, please explain why adding nonsense models as a test of a procedure is a bad idea. Several people have given explanations as to why it is necessary; if you think it is wrong, please explain why.

]]>If they don’t go through independent V&V then I don’t know what they are for. Seems to me I could propose two models that would bound temperature as a function of time: f(t) = a big number and f(t) = 0. There ya go, two models that are guaranteed to bound any temperature trend you might have.

We model/test/model to do our design work. Thing is our stuff has to work.

However, it is simply laughable that an ensemble of models, each with different behaviors wrt trend prediction, should be cobbled together to show much of anything. Annan’s attempt to show that two distributions are similar if the mean of one falls somewhere in the meaty portion of the other is ridiculous. Back when I was doing pattern recognition, we used Mahalanobis distance to do this sort of thing. If you aren’t taking the variance of both distributions into account then you have a meaningless metric.

]]>Let’s take the MMH10 approach to an extreme. Let’s suppose that the internal variability for runs of each model is zero. Thus the “within group” SD must be zero. Let’s further suppose that the different models agree pretty well with each other, and that observations fall within the tight band of model projections. Then, by the MMH10 method, you will determine the average for the “ensemble of models,” and give this average an SD. But the MMH10-calculated SD will be zero. So, in this Thought Experiment, observations would be within the range of the closely-spaced models that make up the “ensemble of models.” And yet, since the average of the “ensemble of models” has an SD of zero and an SEM of zero, the MMH10 test will declare that the modeling has “failed” — the observed trend falls very far from the ensemble average (an infinite number of SDs away, in this extreme example).

This is not true. Here’s a demo you can do on a spreadsheet, though I did it in Stata. Generate a trend t=1,…,100. Now generate 3 deterministic “model” runs using t0=0*t, t1=1*t and t2=2*t, so each one has zero SD around its own trend. Then generate some “observed temperature” data using tt=0.8*t+N(0,1), so the observed trend is 0.8, within the spread of models. Now do the panel estimation as in MMH: construct a dummy variable d=0 for models and d=1 for obs, stack the 4 series, construct dt = d*t and do the panel regression. OLS will do fine here since there is no autocorrelation. The estimated trend on t will be 1.00 and the SE will be about 0.08. The test of a model-obs difference will not reject. You could even leave out the N(0,1), in other words make the “observed temperature” data deterministic, and the t-statistic on the model/obs difference will be 1.21 (p=0.227). In this case there is no within-model variance, only between-model variance, but the panel regression still takes it into account. I don’t know where they got the idea that the variance of the trend would be zero if all the (detrended) model runs had zero variance. That would only happen if all the models were identical and exactly linear, but in that case there would, in fact, be zero variance on the trend.
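For readers without Stata, the deterministic version of this demo (no N(0,1) term) can be sketched in Python with plain OLS. This is an illustration of the toy example only: it uses ordinary OLS standard errors, not the panel-corrected standard errors of the actual MMH code, and the variable names are my own.

```python
import numpy as np

t = np.arange(1, 101, dtype=float)

# Three deterministic "model" runs and one deterministic "observed" series.
series = [0.0 * t, 1.0 * t, 2.0 * t, 0.8 * t]
d_flags = [0.0, 0.0, 0.0, 1.0]  # d=0 for models, d=1 for obs

# Stack the four series into one long panel.
y = np.concatenate(series)
tt = np.tile(t, len(series))
d = np.repeat(d_flags, t.size)

# Design matrix for y = b0 + b1*t + b2*d*t + b3*d + e
X = np.column_stack([np.ones_like(tt), tt, d * tt, d])

# OLS estimates and conventional standard errors.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = X.shape[0] - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stat = beta[2] / se[2]

print(f"trend on t (b1)      = {beta[1]:.2f}, SE = {se[1]:.3f}")
print(f"obs-model diff (b2)  = {beta[2]:.2f}, t = {t_stat:.2f}")
```

With d=1 for obs, b2 comes out as 0.8 minus 1.0, so the t-statistic is negative; its magnitude is the 1.21 quoted above. The nonzero SE on the trend comes entirely from the spread between the three model slopes, since each series is exactly linear.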

]]>