The Two Jeffs on Emulating Steig

The two Jeffs (C and Id) have interesting progress reports on emulating Steig using unadorned Tapio Schneider code here. Check it out. One of the first questions that occurred to third-party readers was whether RegEM somehow increased the proportional weight of Peninsula stations relative to continental stations as compared to prior studies. Jeff C observes:

As I became more familiar, it dawned on me that RegEM had no way of knowing the physical location of the temperature measurements. RegEM does not know or use the latitude and longitude of the stations when infilling, as that information is never provided to it. There is no “distance weighting” as is typically understood as RegEM has no idea how close or how far the occupied stations (the predictor) are from each other, or from the AWS sites (the predictand).

Jeff notes that the Peninsula is less than 5% of the land mass, but has over 35% of the stations (15 of 42). Jeff shows that the reported Steig trend is cut in half merely through geographic grouping, saying:

Again, I’m not trying to say this is the correct reconstruction or that this is any more valid than that done by Steig. In fact, beyond the peninsula and coast data is so sparse that I doubt any reconstruction is accurate. This is simply to demonstrate that RegEM doesn’t realize that 40% of the occupied station data came from less than 5% of the land mass when it does its infilling. Because of this, the results can be affected by changing the spatial distribution of the predictor data (i.e. occupied stations).
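As a rough illustration of the mismatch Jeff describes, the sketch below compares an average weighted by station count with one weighted by land area. Only the 15-of-42 station count and the 5% area figure come from the text above; the two trend values are invented placeholders, and RegEM itself is neither of these averages, so this shows the size of the imbalance rather than what RegEM computes.

```python
# Toy comparison of a station-count average and an area-weighted average.
# The two trend values are hypothetical placeholders, not measured results.
n_peninsula, n_total = 15, 42          # station counts quoted in the post
area_peninsula = 0.05                  # "less than 5%" of the land mass, taken as 5%

trend_peninsula = 0.4                  # hypothetical, deg C per decade
trend_rest = 0.05                      # hypothetical, deg C per decade

station_share = n_peninsula / n_total
count_weighted = station_share * trend_peninsula + (1 - station_share) * trend_rest
area_weighted = area_peninsula * trend_peninsula + (1 - area_peninsula) * trend_rest

print(f"peninsula share of stations: {station_share:.1%}")
print(f"station-count average trend: {count_weighted:.3f}")
print(f"area-weighted average trend: {area_weighted:.3f}")
```

With these made-up inputs the station-count average is more than double the area-weighted one, simply because the over-sampled region is given seven times its areal share.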

The irrelevance of geography is something that we’ve observed in other Mannian methods, starting right from the rain in Maine (which falls mainly in the Seine.) In MBH98, geographic errors didn’t “matter” either. The rain in Spain/Kenya error in Mann 2008 only “mattered” because the hemisphere changed. Had the error stayed in the same hemisphere, it wouldn’t have “mattered”. Gavin Schmidt and Eric Steig took umbrage at someone bothering to notice a geographic error in the Supplementary Information. At the time, I noted that I wasn’t sure whether the error was a typo or, as in the MBH and Mann 2008 cases, was embedded in the information files themselves. In either case, I didn’t expect the error to “matter” simply because I didn’t expect that Steig’s methods care whether a site was correctly located – a point that is a corollary to the results of the two Jeffs. Take a look.

This entry was written by Stephen McIntyre, posted on Feb 15, 2009 at 8:49 AM, filed under Steig at al 2009 and tagged RegEM, Steig.

54 Comments

Bernie

Posted Feb 15, 2009 at 9:16 AM | Permalink

I am not sure as to the preferred location of comments. My apologies for the repetition.

Jeff & Jeff:
Nicely done.
The caveats and limitations are also a nice touch.
Ian

Posted Feb 15, 2009 at 9:40 AM | Permalink

Well done JJ09. I’m not sure, and really don’t care, whether it halves the slope or doubles it; this is proper science: look at a paper, find a few station errors, consider the whole approach, notice that the difference between coastal and inland sites, east and west, is significant, and then do the proper science. Why the hell can’t the professionals do the same….
Allen63

Posted Feb 15, 2009 at 9:40 AM | Permalink

Value added analysis.

Some interpolation is OK. However, interpolation along changing latitude lines is more like extrapolation into the unknown in this case, I think. My “feeling” is that we may never accurately know (accurately enough to confirm or deny AGW) what was happening in Antarctica prior to satellites.

Nonetheless, interesting that warming is confirmed during decades when warming may have been inertia (from natural warming since the 1800s). However, lately, no warming — when warming is supposed to be rampant from CO2. A result pretty much in accord with Steig’s finding, I presume.
Ron Cram

Posted Feb 15, 2009 at 10:03 AM | Permalink

Nice work, guys. Amazing what comes out when a paper is examined closely.
Jeff C.

Posted Feb 15, 2009 at 10:21 AM | Permalink

Steve,

Thanks for the mention. FYI – the reason I was waffling between the peninsula station count being 35% (15 of 42) and 40% (17 of 42) was that I was originally including the two stations on the South Orkney Islands (Orcadas and Signy) under the category of the “Antarctic Peninsula”. Although they are close (about 400 miles away), they aren’t actually on the peninsula so the 35% value is technically correct. However, an island at 60 degrees north latitude is probably no more representative of the continent-wide Antarctic trends than is that of the peninsula itself.
Jeff C.

Posted Feb 15, 2009 at 10:24 AM | Permalink

Ooops, should be 60 degrees South latitude.
Ryan O

Posted Feb 15, 2009 at 10:49 AM | Permalink

Very nice. It would be interesting to see how your result compares to the AVHRR recon, especially as the overall trend you got was half of Steig’s AWS recon . . . which already didn’t match the AVHRR. Great job, guys!
.
P.S. – and many thanks for the explanations and help in the Deconstruction thread. 🙂
AnonyMoose

Posted Feb 15, 2009 at 11:05 AM | Permalink

RegEM is used for the geographic infilling, but is unaware of geography? I thought RegEM was being used to infill missing data periods within each station’s records. Somehow in previous phrasing I missed that it was used for the land blanket. Well, using for geographic tasks something which is unaware of two-dimensional geometry is asinine. And such a use also obviously requires adjustments for topography, as altitude has obvious effects upon temperatures.
Bob North

Posted Feb 15, 2009 at 11:36 AM | Permalink

I must say that I did not realize that RegEM does not consider location in infilling data. The nearest-neighbor concept is nothing new in evaluating geospatial data and it seems to me that some sort of distance/altitude weighting is absolutely critical for having any type of confidence in a reconstruction of any type, be it benzene in a groundwater plume, gold in an ore body, or temperature in Antarctica. If this is true, I am left speechless.
- Jeff Id
  
  Posted Feb 15, 2009 at 12:01 PM | Permalink
  
  Re: Bob North (#9),
  
  It’s true. It’s the main reason I was interested in working on RegEM. I’m just an engineer and don’t work in climatology but I can’t even begin to understand how data could be imputed without spatial weighting.
  
  I left this on RomanM’s thread when I was starting to figure it out.
  
  I’m really not sure what’s going on exactly but I don’t believe location of individual stations was incorporated into the analysis except by correlation to the 3 PCs in the RegEM analysis.
  
  In the reconstructions done above, it is only the temperatures fed into a matrix. No information at all is included about spatial position. You could feed a trend in from the north pole and as I understand it the weighting would be determined by the statistical match to the 3 PC’s. Of course you’d expect a better match to stations in closer proximity but there are no other controls or even checks for the final weighting (things that make engineers happy).
  
  Jeff C showed clearly that adding more stations with a particular trend (upslope) changes the weighting of those trends on all the other stations.
RomanM

Posted Feb 15, 2009 at 12:16 PM | Permalink

Jeff Id (#10), you are correct in your conclusion that RegEM does not, in fact, explicitly take distance into account. However, a satellite image from a given moment does not usually look like random noise – there are going to be spatially induced relationships between values from adjacent grid points. RegEM uses the covariance matrix of the values from the various gridpoints, which will thus contain information about the location relationships implicitly. The PCs are calculated from that matrix.

In some circumstances, this may have some advantages. The results from a grid point may conceivably be more closely related to those of a point at the same altitude than with a closer point at a different altitude. From looking at their results, it is highly doubtful that this method is the best way to go in this analysis.
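RomanM's point that a covariance matrix carries location information implicitly can be seen in a small synthetic example. The sketch below is not RegEM and uses no Antarctic data: the station positions, the exponential decay length and the 600 time steps are all assumptions chosen for illustration. The analysis step is handed only the data matrix, yet the sample correlations still come out high for neighbours and low for distant pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical station positions along a 1-D transect (arbitrary units).
x = np.linspace(0.0, 10.0, 11)

# Assumed "true" spatial covariance: correlation decays with distance.
dist = np.abs(x[:, None] - x[None, :])
true_cov = np.exp(-dist / 3.0)

# Simulate 600 monthly anomalies; the sampler sees true_cov, but the
# analysis below is given only the data matrix, never the coordinates.
data = rng.multivariate_normal(np.zeros(len(x)), true_cov, size=600)

corr = np.corrcoef(data, rowvar=False)
near = corr[0, 1]      # stations 1 unit apart
far = corr[0, 10]      # stations 10 units apart
print(f"correlation at distance 1: {near:.2f}, at distance 10: {far:.2f}")
```

The caveat raised elsewhere in this thread still applies: this only works when there is enough overlapping data to estimate the correlations well, and nothing in the matrix prevents a distant series from being weighted heavily if it happens to correlate.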
PhilH

Posted Feb 15, 2009 at 12:30 PM | Permalink

I would like to propose that you add “The Air Vent” to your “Blogroll.”

Done
stan

Posted Feb 15, 2009 at 12:35 PM | Permalink

snip – editorializing
Hank Henry

Posted Feb 15, 2009 at 12:51 PM | Permalink

I wonder how this method would work if you took a couple dozen spaced out stations from Australia or North America or some other continent and ran the numbers?
MikeU

Posted Feb 15, 2009 at 1:13 PM | Permalink

This is a little difficult to comprehend – how can any scientist not think spatial weighting matters when interpolating geographically sparse data? It seems intuitively obvious that it would, and it’s good to see that intuition is borne out by running gridded cells over the same dataset. A factor of 2 difference isn’t exactly minor. Nice work, gentlemen.
Allen63

Posted Feb 15, 2009 at 1:13 PM | Permalink

Though open to considering most any idea, I would be hard to convince that a global or continental temperature extrapolation/interpolation method that does not explicitly account for distance, altitude, longitude, latitude, and surroundings (mountains, plains, ocean, snow fields, city streets, etcetera) has any credibility for “fine” trend discrimination of the type needed to prove AGW. I also question the use of data from hundreds of miles away to “correct” local data. How many of these methods are “proven” performers and how many are merely personal preference “guesstimates”?

If the AGW community and politicians were not planning to “forcibly-take” my money to prevent “catastrophic” AGW, I could be more charitable regarding the questionable AGW “science” used to support their position.
Jeff C.

Posted Feb 15, 2009 at 1:49 PM | Permalink

My take on this is that RegEM will do a reasonable job on infilling missing data from dates if the temperature series used are well correlated. Take California as an example. If I had ten locations from the central valley (Bakersfield – Fresno area) that had data missing from scattered dates, RegEM would probably do a decent job infilling as all the sites have similar climatic trends.

Now add in two more California series of Eureka (NW Coast) and Palm Springs (SE Desert). Although both are in California, their climates aren’t anything like the central valley. RegEM doesn’t know that these two sites are distant as it doesn’t have the lat/long coordinates. As Roman mentions in #11, the algorithm can recognize that the two new sites aren’t well correlated to the original ten and make implicit assumptions regarding distance. However, if lots of points in the time series are missing (as we know is the case) any trend difference may not be readily apparent. In addition, since there are only two outliers, their impact on the overall reconstruction is minimized. RegEM would still do a reasonable job on the central valley sites, but probably is off considerably for the other two.

We might have an analogous geographic imbalance in the Steig reconstruction due to the prevalence of peninsula and coastal stations.

I’m going to run some variations deleting gridcells to see the impact. I’ll put up the results later in the day.
- Billy Ruff'n
  
  Posted Feb 15, 2009 at 2:35 PM | Permalink
  
  Re: Jeff C. (#17),
  The California example raises a question in the mind of this layman: Wouldn’t it be possible to “test” the accuracy of RegEM reconstructions under varying circumstances using known data, e.g. take data from a region where data sets are reasonably complete (California?), randomly delete elements of the known data set to approximate the data voids found in places like Antarctica, then let RegEM do a reconstruction to “recreate” the missing (deleted) data, and then measure the accuracy of the reconstructed data vs the deleted (known) data? You could then repeat the process under varying circumstances, e.g. more or less deleted data, greater geographic separation between stations, delete data from stations with different climatic trends.
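The withhold-and-recover test described above can be sketched in a few lines. Two loud assumptions: the infilling routine here is a plain iterated low-rank SVD fit standing in for RegEM (it is not Tapio Schneider's algorithm), and the "complete" data set is synthetic rather than real California records. The function name `svd_infill` and all parameter values are hypothetical.

```python
import numpy as np

def svd_infill(data, rank=2, n_iter=50):
    """Fill NaNs by iterating a rank-`rank` SVD fit (a stand-in, not RegEM)."""
    missing = np.isnan(data)
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - mean, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank] + mean
        filled[missing] = low_rank[missing]   # observed values are never altered
    return filled

rng = np.random.default_rng(1)

# Hypothetical "complete" region: 10 series driven by 2 shared signals plus noise.
signals = rng.normal(size=(240, 2))
loadings = rng.normal(size=(2, 10))
truth = signals @ loadings + 0.3 * rng.normal(size=(240, 10))

# Randomly withhold 30% of the known values, then try to recover them.
mask = rng.random(truth.shape) < 0.3
holed = truth.copy()
holed[mask] = np.nan

recon = svd_infill(holed)
rmse_infill = np.sqrt(np.mean((recon[mask] - truth[mask]) ** 2))
col_means = np.broadcast_to(np.nanmean(holed, axis=0), truth.shape)
rmse_mean = np.sqrt(np.mean((col_means[mask] - truth[mask]) ** 2))
print(f"RMSE, SVD infill: {rmse_infill:.2f}   RMSE, column means: {rmse_mean:.2f}")
```

To follow the suggestion more closely, the random mask could be swapped for block deletions (whole years, or whole stations before a start date), which is closer to how the Antarctic records are actually incomplete.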
Jeff C.

Posted Feb 15, 2009 at 2:54 PM | Permalink

Another observation in all of this is that Steig refers to the occupied station data set as the predictor and the AWS reconstructed data set as the predictand. However, in the methods section and in the Jeff Id implementation of RegEM, there is no real distinction between these data sets.

From the methods section of the paper:

RegEM uses an iterative calculation that converges on reconstructed fields that are most consistent with the covariance information present both in the predictor data (in this case the weather stations) and the predictand data (the satellite observations or AWS data).

In the Jeff Id RegEM, both the AWS data (63 sites) and the occupied station data (42 sites) are dumped in. 105 series are input, and 105 series are output. RegEM has no idea which is the predictor and which is the predictand. This is important as any true distance weighting would need to be applied to both data sets, not just the occupied station data.

We have been focused on the AWS recon series (i.e. the 63 infilled AWS series) because that is what Steig provided. RegEM also provides an occupied station recon (the other 42 series of the 105 output from RegEM). There might be something to be learned by evaluating these.

I highly recommend visiting Jeff Id’s site and getting his code to run through this yourself. You need both R and Matlab to run it; if you work in a technical industry, you probably have a site-wide Matlab license.
Ryan O

Posted Feb 15, 2009 at 6:01 PM | Permalink

I’m not sure if this belongs here or in the RegEM thread. It’s more general than just Antarctica.
.
In my opinion, the problem with using RegEM like this is more fundamental than spatial weighting. RegEM can only find correlations. It cannot identify causality. Without establishing causality, no weighting scheme is any less arbitrary than another.
.
While it is reasonable to suppose that stations close to each other will share many aspects of climate and weather, the degree to which they share is dependent on more than just proximity. For example, land topography can greatly affect the degree of coupling, and, over time, the nature of that coupling can change. This is not restricted to timescales of years or more – it can occur in days if it is strongly dependent on highly variable aspects of the weather, such as the location of the jet stream. The correlation in temperature between points in an area as small as Montana (where I’m from) can change from year to year and month to month, and depending on how much needs to be infilled, there may not be enough actual information present for any algorithm (RegEM or otherwise) to accurately capture those changes.
.
Because RegEM cannot identify causality, I hesitate to believe any uncertainties calculated from imputed series. Those uncertainties are valid IF and ONLY IF the correlations between the series did not change with time . . . and by virtue of needing RegEM to begin with, that information is unavailable.
.
Furthermore, the uncertainties with any algorithm like RegEM must be dependent on the data being missing in a random fashion. With climate information in real life, this is rarely the case. Data is usually missing in large chunks, not randomly.
.
I think imputation algorithms can be valuable tools, but the current way they are being used is, in my opinion, poor science. This doesn’t just go for Antarctica; it goes for the paleoclimate reconstructions as well. Unless you can identify a physical cause for temperature between sites to be correlated in a meaningful way, the output is suspect at best.
.
Correlation does not demonstrate causality.
Hal

Posted Feb 15, 2009 at 6:12 PM | Permalink

# 20 Hans Erren

RegEM shows that the correlation of i before e (rather than e before i, following c) is much higher in full continental usage.

The most prevalent peg was the fact that the study appeared to reverse the “Steig” meme that has been a staple of disinformation efforts for a while now.

Therefore Stieg is now the correct usage.
husten

Posted Feb 15, 2009 at 6:24 PM | Permalink

Jeff, Geostatistics tools have an algorithm that computes and visualizes the variance between stations: the (semi)variogram.
Geostats, like RegEM, has its own set of believers and deniers. In the end it all depends on how you interpret the results. Some of the tools might be of use to you here. The various software packages involved all honour the distance between stations, address spatial clustering etc. I am not sure there is FREE software; one would need to google. Most code around is based on the 1980s BLUEpack.
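For readers who have not met the (semi)variogram husten mentions, here is a minimal sketch of the empirical version: for every pair of stations, take half the squared difference of their values and average those within bins of separation distance. The station layout and field below are synthetic assumptions, and the function name `empirical_semivariogram` is hypothetical; dedicated geostatistics packages do considerably more (model fitting, anisotropy, declustering).

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Mean of 0.5 * (z_i - z_j)^2 over station pairs, binned by separation."""
    i, j = np.triu_indices(len(values), k=1)
    sep = np.linalg.norm(coords[i] - coords[j], axis=1)
    half_sq_diff = 0.5 * (values[i] - values[j]) ** 2
    which = np.digitize(sep, bin_edges)
    gamma = np.full(len(bin_edges) - 1, np.nan)
    for b in range(1, len(bin_edges)):
        in_bin = which == b
        if in_bin.any():
            gamma[b - 1] = half_sq_diff[in_bin].mean()
    return gamma

rng = np.random.default_rng(2)

# Hypothetical stations scattered over a 100 x 100 region, with a value that
# varies smoothly in space plus a little measurement noise.
coords = rng.uniform(0.0, 100.0, size=(200, 2))
values = np.sin(coords[:, 0] / 30.0) + np.cos(coords[:, 1] / 30.0)
values += 0.05 * rng.normal(size=200)

edges = np.array([0.0, 10.0, 20.0, 40.0, 80.0])
gamma = empirical_semivariogram(coords, values, edges)
print(np.round(gamma, 3))
```

A semivariance that rises with separation is the signature of spatial structure; the distance at which it levels off gives a rough idea of how far a station's reading can be expected to inform its neighbours.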
Robert Wood

Posted Feb 15, 2009 at 6:48 PM | Permalink

JeffId @ #10

With spatial data, one should not “impute” data, as S&M do, but, rather interpolate… in two dimensions. This is not difficult. Image processing technology applies here.
- Jeff Id
  
  Posted Feb 15, 2009 at 7:00 PM | Permalink
  
  Re: Robert Wood (#24),
  
  One of my several hidden backgrounds is image processing which I’ve done quite a bit of. It seems from the quote Jeff C put in the article that these methods were rejected. — “Unlike simple distance weighting……….”
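For reference, "simple distance weighting" in its most common form is inverse-distance weighting: each target point is estimated as a weighted mean of station values, with weights falling off as a power of distance. The sketch below is one generic version under assumed inputs (three made-up stations, power 2); it is not the method of Steig et al. or of anyone in this thread, and the function name `idw_interpolate` is hypothetical.

```python
import numpy as np

def idw_interpolate(station_xy, station_values, target_xy, power=2.0):
    """Inverse-distance-weighted estimate at each target point."""
    d = np.linalg.norm(target_xy[:, None, :] - station_xy[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)              # avoid division by zero at a station
    w = 1.0 / d ** power
    return (w * station_values).sum(axis=1) / w.sum(axis=1)

# Hypothetical stations (x, y) and anomalies.
stations = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
anomalies = np.array([1.0, 0.0, -1.0])
targets = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]])

est = idw_interpolate(stations, anomalies, targets)
print(np.round(est, 3))
```

The first target sits on a station and reproduces its value, the second is equidistant from all three and returns their plain mean, and the third is dominated by its nearest neighbour. Unlike RegEM, this scheme needs the coordinates and ignores how well the series actually correlate.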
Jon

Posted Feb 15, 2009 at 8:58 PM | Permalink

Isn’t proper spatial weighting one of the purposes of GISS? This puts new contrast on the discrepancy between GISS and Steig et al results. Also vis-a-vis the notion of prior methods doing a ‘back of the envelope estimation’, Steig et al seem to be taking the casual approach!

To the Jeffs: I know you are struggling to select a spatial weighting of your own. Perhaps you ought to follow the GISS procedure and use RegEM as a replacement for the GISS in-fill procedures.
- Ryan O
  
  Posted Feb 15, 2009 at 10:12 PM | Permalink
  
  Re: Jon (#26), There’s a conceptual difference between weighting in order to determine an aggregate measurement (like “surface temperature”) and weighting in order to drive an interpolation. (GISS may do a bit of both . . . the homogenization procedure smacks a bit of the latter.)
  .
  The former does not use the weighting to determine the degree to which the grid cells communicate or can be used as predictors of each other. It is an averaging technique to spread points of data over large areas.
  .
  The latter, on the other hand, uses the weighting to determine the degree to which grid cells are causally connected – i.e., can be used as predictors of each other. While distance is a reasonable parameter to choose, how would you know if you’ve chosen the correct distance function? How would you assign a confidence interval? Nor would the function be universal; it would depend on topography and local climatic variations.
  .
  While distance certainly wouldn’t be useless, it is not necessarily physically correct. Physically one quantity can only be used as a predictor of another if there is a causal connection. Distance probably could be used in many cases to approximate the strength of that connection . . . but it’s still just an approximation. Depending on the situation, it might not even be a good approximation.
  .
  Still, it’s probably better than just teleconnecting everything everywhere . . . but I’m not sure how much I would trust it.
alpha

Posted Feb 15, 2009 at 9:38 PM | Permalink

(longtime reader, first time commenter, etc.)

Steve, it might be useful to put together a regularly updated summary post with a table.
Idea is to give a heads up view of exactly which papers have been contested and by who. With
you, Watts, Pielke Junior, and others, it appears that there are a fair number of skeptics out there.

Moreover, there appear to be at least three or four nontrivial data errors (Mann, Hansen, and now Steig)
and it would be good to summarize them (acknowledged and unacknowledged) in one place.

Possible columns of this table could be:

1) year
2) authors
3) title
4) abstract
5) your summary of their message
6) link(s) to any papers, code or forensics you or others have done which points out an error
7) link(s) to any admission of error
8) link(s) to any post or email denying access to data or code
9) number of citations
10) binary indicator: was this cited in IPCC or other influential report?

Basic idea is the briefest possible heads up view to show — definitively — that many of the
core papers by Mann and crew have a significant degree of associated controversy.

The goal would be for me (or others) to point an intelligent non-specialist at this page and — by sheer weight of acknowledged error — demonstrate to them that the “consensus” isn’t really so.
- Peter D. Tillman
  
  Posted Feb 16, 2009 at 12:12 PM | Permalink
  
  Re: alpha (#26),
  
  Rather than asking Steve to do this (he’s pretty well committed), why don’t you write up a brief summary of Stieg’s article (and others, if possible) in the format you outlined, and post it to http://climateaudit101.wikispot.org/ — the wiki for this site. Make sure to post a link here so people can find your writeup, and add to it.
  
  Best, Pete Tillman
alpha

Posted Feb 15, 2009 at 9:42 PM | Permalink

also, all this stuff regarding imputation is very dodgy.

If you have a missing data problem, if you have NAs, it really depends on how those NAs arose.

Are they missing completely at random (MCAR)? Are the NAs due to completely random events and uncorrelated with any other measured variable?

Or are they statistically dependent on some other columns in your predictor matrix?

The best thing to do when you have an NA is to get the distribution of the possible values for that NA. Sometimes in the MCAR case this will be the univariate distribution for that column alone, because no other columns correlate with it. At other times it will be a conditional distribution, where other columns can be used. But using a simple scalar replacement is generally not a good idea.

NAs are real things, they shouldn’t be papered over…
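To make the "conditional distribution" idea concrete, here is the standard result under one specific assumption, namely that the columns are jointly Gaussian. The notation is introduced here for illustration: $x_m$ is the missing entry (or entries) in a row, $x_o$ the observed entries, and $\mu$, $\Sigma$ the mean and covariance partitioned the same way.

```latex
x_m \mid x_o \;\sim\; \mathcal{N}\!\left(
  \mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}(x_o - \mu_o),\;
  \Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}
\right)
```

When no other column correlates with the missing one, $\Sigma_{mo} = 0$ and this collapses to the univariate distribution $\mathcal{N}(\mu_m, \Sigma_{mm})$, which is the first case described above. Replacing the NA with the conditional mean alone is the "simple scalar replacement": it discards the conditional variance term, which is where the uncertainty of the infilled value lives.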
bugs

Posted Feb 15, 2009 at 10:35 PM | Permalink

“starting right from the rain in Maine (which falls mainly in the Seine.)”

No abuse here and childish taunts, no siree, just honest, dispassionate commentary and analysis.

Steve: And even after the error had been identified and was well known, the rain in Maine continued to fall in the Seine in the Mann et al 2007 SI. I agree that Mann’s refusal to correct the geographical mislocations was childish. The important thing for readers to reflect on is that, under Mannian methods, geographic errors don’t “matter”.
mhc

Posted Feb 16, 2009 at 12:39 AM | Permalink

It seems to me that a very simple way to get reasonable weights for weather stations would be to draw a Veroni diagram, with one polygon around each weather station. Use the area of the polygon as the weight. This essentially attributes to each weather station all points closer to that station than any other station, and avoids all kinds of special cases like two weather stations in a cell, no stations in a cell, etc.

(I spent most of the winter of 1978 using this method to estimate the amount of ore in potential open-pit uranium mines. Veroni diagrams can be drawn pretty easily with a compass and straightedge, and areas can be found with a mechanical planimeter. Somewhat boring, though. 🙂
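A computer can approximate the nearest-station polygons mhc describes (Voronoi cells) without any geometry: lay a fine grid over the region, assign each grid point to its closest station, and count. The sketch below does this for a made-up square region with four hypothetical stations; the function name `voronoi_area_weights` and the grid resolution are assumptions, and a real application would need the actual coastline and a proper map projection.

```python
import numpy as np

def voronoi_area_weights(stations, bounds, n_grid=200):
    """Approximate each station's Voronoi cell area as a fraction of the region.

    Every point of a regular grid is assigned to its nearest station; the
    share of grid points a station collects approximates its cell's area share.
    """
    (x0, x1), (y0, y1) = bounds
    gx, gy = np.meshgrid(np.linspace(x0, x1, n_grid), np.linspace(y0, y1, n_grid))
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    d2 = ((grid[:, None, :] - stations[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(stations))
    return counts / counts.sum()

# Hypothetical layout: three stations clustered in one corner, one far away.
stations = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5], [8.0, 8.0]])
weights = voronoi_area_weights(stations, bounds=((0.0, 10.0), (0.0, 10.0)))
print(np.round(weights, 3))
```

In this toy layout the single remote station ends up with more than half of the total weight, while the three clustered stations split the remainder, which is the opposite of what a simple per-station average would give.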
John A

Posted Feb 16, 2009 at 2:14 AM | Permalink

In Googling “Veroni Diagram” I found that Dr John Snow solved the mysterious cholera epidemic that struck central London in the mid 19th Century by constructing such a diagram, where the weights were the number of cholera victims in each house.

Here is the diagram:

The outer line denotes the points of equidistance between water pumps. The only pump in the area was the one on Broad Street, so Snow went there and took the handle off and the epidemic was stopped in its tracks.

The pump (without the handle) is still there:

and the nearest pub was renamed in honour of the Doctor:
- Bill Drissel
  
  Posted Feb 16, 2009 at 8:59 AM | Permalink
  
  Re: John A (#31), For “veroni” read “Voronoi”. Discussion, bio at: http://en.wikipedia.org/wiki/Voronoi_diagram . According to: http://en.wikipedia.org/wiki/John_Snow_(physician) , Dr Snow had to convince politicians to remove the handle.
  
  One of Edward Tufte’s books has a discussion of Dr Snow’s maps. Start at: http://www.amazon.com/Visual-Display-Quantitative-Information-2nd/dp/0961392142/ref=sr_1_1?ie=UTF8&s=books&qid=1234796131&sr=1-1
  
  Regards,
  Bill Drissel
  - John A
    
    Posted Feb 16, 2009 at 3:05 PM | Permalink
    
    Re: Bill Drissel (#37),
    
    I’ve just ordered “The Ghost Map” so I should be a lot less ignorant in about a week.
    
    In any case, I’ve learned to avoid Wikipedia for historical references, knowing that large numbers of articles are written by crazies. Maybe it’s correct, we’ll see.
BKR

Posted Feb 16, 2009 at 2:37 AM | Permalink

The reference to John Snow is wonderfully accurate for this blog. For a fascinating account of his work, see
Statistical Models and Shoe Leather
Author(s): David A. Freedman
Source: Sociological Methodology, Vol. 21 (1991), pp. 291-313
Published by: American Sociological Association
Stable URL: http://www.jstor.org/stable/270939

As Freedman (a noted statistician) observes “…this paper suggests that statistical technique can seldom be an adequate substitute for good design, relevant data, and testing predictions against reality in a variety of settings.”
- Peter D. Tillman
  
  Posted Feb 16, 2009 at 12:16 PM | Permalink
  
  Re: BKR (#32),
  
  Thanks for the ref, and the delightful quote. It does seem like the Mannian crowd prefers shuffling electrons to fieldwork, doesn’t it?
  
  I’d appreciate a copy of the paper if you have it: pdtillmanATgmailDOTcom
  
  TIA & Cheers — Pete Tillman
bender

Posted Feb 16, 2009 at 4:03 AM | Permalink

I had *assumed* RegEM used spatiotemporal covariance for infilling. My very bad.
- John A
  
  Posted Feb 16, 2009 at 4:23 AM | Permalink
  
  Re: bender (#33),
  
  It’s very easy to overestimate the abilities of the Team. You’d think we would have wised up by now…
Louis Hissink

Posted Feb 16, 2009 at 4:16 AM | Permalink

Spatial weighting has been used in geostats for decades – it’s otherwise known as area of influence issues, and usually the method is to weight a reading with the area it is representative of.

It’s actually a fancy way of making sure the intensive values (station readings) are applied to areas (the extensive variables) to produce numbers which actually mean something physically.

It’s the main reason why the method of calculating the global mean temp is just a load of horsefeathers. No different to aggregating telephone numbers in the grid cells.
husten

Posted Feb 16, 2009 at 5:22 AM | Permalink

Geostats: Re: husten (#22), Google has indeed yielded tools that are still free:
1.) at www-sst.unil.ch/research/variowin/index.html.
or 2.) at sgems.sourceforge.net/ which is also recommended by NASA – see 3.) at gcmd.nasa.gov/records/SGEMS.html.

You don’t want to use commercial tools. Their customers are the oil, coal and mining industries and therefore always yield results biased towards global cooling. 😉 /sarcasm-off/ Most are based on the same code – gslib- but offer more or less comfortable data analysis tools.
- Evan Englund
  
  Posted Feb 16, 2009 at 1:35 PM | Permalink
  
  Re: husten (#36),
  
  In R, you can use a contributed package, gstat:
  
  http://cran.r-project.org/web/packages/gstat/index.html
Harry Eagar

Posted Feb 16, 2009 at 12:14 PM | Permalink

As I recall, Tufte also showed that the epidemic was pretty well over before Snow took the pump handle off.

I predict something similar with respect to global warming and mitigation.
Ryan Maue

Posted Feb 16, 2009 at 1:04 PM | Permalink

Pat Michaels has an interesting point in his editorial piece at the Guardian:

The problem with Antarctic temperature measurement is that all but three longstanding weather stations are on or very near the coast. Antarctica is a big place, about one-and-a-half times the size of the US. Imagine trying to infer our national temperature only with stations along the Atlantic and Pacific coasts, plus three others in the interior.

As a test of the veracity of the RegEM reconstruction, couldn’t a similar experiment be conducted with the US continent as the testbed?
Dean P

Posted Feb 16, 2009 at 6:42 PM | Permalink

I posted the following over at RC and here was Gavin’s reply:

One of the primary points of the WUWT discussion is that RegEM assumes that the missing data locations are random in nature. Since the missing data in the Steig paper isn’t random (almost all the interior is “missing”), then is it proper to use a method that assumes otherwise?

[Response: No. The issue isn’t that the data have to be randomly missing in time or space, but that the value of the missing data is unrelated to the fact that it is missing. – gavin]

So the question is, does RegEM really expect the missing data to be from random locations and if so, does Gavin realize this?
- Jason
  
  Posted Feb 17, 2009 at 8:04 AM | Permalink
  
  Re: Dean P (#44),
  
  The data in the interior is different because it is in the interior.
  
  The data in the interior is missing because it was in the interior.
  
  It is plainly NOT the case that the value of the data is unrelated to the fact that it is missing. In fact, both are primarily the result of a single factor: the geographic location.
  
  It is remarkable that Gavin can understand the basic principle and then so spectacularly fail to apply it.
BKR

Posted Feb 17, 2009 at 2:23 AM | Permalink

Pete: (#40)

Click to access Freedman91.pdf

Absolutely delightful reading if you are interested in statistical analysis of data (it focuses on social science but is still useful for thinking about the stuff here).
Carrick

Posted Feb 17, 2009 at 10:03 AM | Permalink

A bit off-the-wall question, but one could apply RegEM to the total Earth ground-station data.

What happens when you do that? How does the reconstruction compare to other methods (e.g., GISS & HadCRUT)?
Harry Eagar

Posted Feb 17, 2009 at 11:49 AM | Permalink

‘the value of the missing data is unrelated to the fact that it is missing.’

While I take Jason’s point, on the other hand, the value of missing data will always be unrelated to the fact that it is missing.

The value is/was the value, it does not change by the fact of observation, at least not till you get down to subatomic observations.

The speed at which I drove to work this morning is unrelated to whether a cop with a laser gun was hiding behind a billboard (purely hypothetical, we don’t have billboards in Hawaii).
- bender
  
  Posted Feb 17, 2009 at 2:51 PM | Permalink
  
  Re: Harry Eagar (#48),
  
  While I take Jason’s point, on the other hand, the value of missing data will always be unrelated to the fact that it is missing.
  
  The data are assumed to be missing at random relative to the data field, not the x, y geo-coordinates. The reality is that data sensors are most likely to fail under extreme weather, which in Antarctica means extreme cold. I note further that the data field in fact covaries with x, y, z, with the South Pole and higher elevations being far colder than more northerly locations at lower elevation. Therefore it is a bit disingenuous to dismiss someone’s concern about data missing largely from the colder interior continental region, i.e. the data are probably not missing at random with regard to the temperature data field. (Of course you have no way of knowing this because the missing data are not known to you. That is why it is an assumption that is so easy to pretend is true.)
  
  Re: Dean P (#44),
  You must be careful in talking with the dismissive ones at RC. They would rather show you to be an idiot – especially if you smell like a skeptic – than take the time necessary to fully answer (and clarify, if necessary) your question. As my comment above would indicate.
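The distinction bender draws, between data lost at random and data lost because of its own value, is easy to simulate. In the sketch below everything is assumed for illustration: the anomalies are synthetic, and the logistic failure curve (sensor more likely to fail the colder it gets) is an invented model, not a measured failure rate for any Antarctic instrument.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical temperature anomalies for one site (arbitrary units).
temps = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Case 1: missing completely at random, 30% of readings lost.
lost_at_random = rng.random(temps.size) < 0.3

# Case 2: the sensor is assumed more likely to fail the colder it gets.
p_fail = 1.0 / (1.0 + np.exp(2.0 * (temps + 1.0)))
lost_when_cold = rng.random(temps.size) < p_fail

mean_all = temps.mean()
mean_random = temps[~lost_at_random].mean()
mean_cold = temps[~lost_when_cold].mean()
print(f"true mean {mean_all:+.3f}, random loss {mean_random:+.3f}, "
      f"cold-dependent loss {mean_cold:+.3f}")
```

Random loss leaves the mean of the surviving readings essentially unchanged, while cold-dependent loss biases it warm, and nothing in the surviving data alone reveals which case applies.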
Dean P

Posted Feb 17, 2009 at 3:06 PM | Permalink

Bender,

I’ve had that happen there before, but thanks for the warning. The reason I even went over there was that several people had mentioned Jeff & Jeff’s work, but hadn’t described what the result was (and I’m not sure I did it justice). The link to the article had been dismissed as not worthy of reading even on the slowest of days. I really wanted to see if Gavin knew that RegEM doesn’t factor in distances when doing its magic.

I think he knows that, but then I’m not sure he understands the issue that this can lead to. And that is that the easiest way to warm the Antarctic is to take more measurements on the peninsula. RegEM will handle the rest…

(note, this is my take on what J&J have shown… if that’s not accurate, then please let me know!)
- bender
  
  Posted Feb 17, 2009 at 4:25 PM | Permalink
  
  Re: Dean P (#50),
  It is impossible to probe them. They know when you’re probing and will always dodge if they’re in the wrong. This makes it impossible to tell when they’re really wrong vs. when they’re just annoyed and are ignoring the substance of your question. Occasionally, and only when they are in the right, you will get a sensible reply. Don’t forget to genuflect. It increases your odds of getting an answer.
Harry Eagar

Posted Feb 17, 2009 at 5:44 PM | Permalink

Hmmm. OK. I get you better now. It’s like the joke about the drunk looking for his lost keys in the dark. He looks under the lamp post, although that’s not where the keys are, it’s where the light is.

There is an expressive word in Hawaiian pidgin for what I think about this: shibai.

To watch you guys tease out the story is amusing and stimulating. Even if The Team were all bullet-proof statisticians, I would still think that recreating a temperature history by making up temperatures where there are no observations is shibai, at least when the making up is on the scale we see in Steig.

But it’s all about process in the audit, ain’t it?

Gee, hope I didn’t cross the snip line.
Douglas Hoyt

Posted Feb 18, 2009 at 7:36 AM | Permalink

Here is a suggested test of RegEM:

1. Double up the number of peninsular stations from 15 to 30.
2. The new 15 stations have identical temperature records to the existing 15. The “new” stations could be visualized as being 10 feet away from the existing stations.
3. Re-run the RegEM analysis and calculate the continent-wide trend. If RegEM is correct, then there should be no change in trend. If the method is poor, the trend will change.
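The logic of this duplication test can be previewed on a toy problem. The sketch below does not run RegEM; as a loudly labelled stand-in it tracks the trend of the first principal component, on the assumption that a method built on a few leading PCs of the covariance matrix will respond in a similar direction. The station counts (3 and 5), signal strengths and noise levels are all invented.

```python
import numpy as np

def leading_pc_trend(data, t):
    """Least-squares slope of the first principal component time series."""
    centred = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    sign = 1.0 if vt[0].sum() >= 0 else -1.0   # SVD sign is arbitrary
    pc1 = sign * u[:, 0] * s[0]
    return np.polyfit(t, pc1, 1)[0]

rng = np.random.default_rng(4)
t = np.arange(600)

# Hypothetical regional signals: a trending "peninsula" and a flat "interior".
peninsula = 0.002 * t + rng.normal(size=600)
interior = rng.normal(size=600)

def make_stations(signal, n):
    """n stations sharing one regional signal plus independent local noise."""
    return signal[:, None] + 0.3 * rng.normal(size=(600, n))

pen_stations = make_stations(peninsula, 3)
int_stations = make_stations(interior, 5)

original = np.hstack([pen_stations, int_stations])
doubled = np.hstack([pen_stations, pen_stations, int_stations])  # exact copies

trend_original = leading_pc_trend(original, t)
trend_doubled = leading_pc_trend(doubled, t)
print(f"PC1 trend, original network: {trend_original:+.4f} per step")
print(f"PC1 trend, peninsula doubled: {trend_doubled:+.4f} per step")
```

In this construction the duplicated records add no new information, yet they are enough to hand the leading mode to the trending region. Whether RegEM as configured by Steig responds the same way is exactly what the proposed test would establish.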
re-nature

Posted Feb 26, 2009 at 7:49 AM | Permalink

I think environmental awareness is slowly improving. In addition, the market niche for environmental protection keeps growing, since demand is rising too. So developments are slowly taking a positive course. Furthermore, the economic crisis should also be seen as an opportunity, because when old structures are destroyed, new structures will grow. As nature would have it, when something new emerges it can be far better and more modern. Just let the politicians get on with it; they all only want to save their world of money and not our environment. Politics is only about power and not about idealism.
Here is also a tip for you for posting.
This is not meant to be spam; I find these pages interesting
Environmental protection in the blog
NEW record for carbon dioxide emissions
With sustainable regards,
Heinz