Milestone

Another parking lot being measured for climate change: Newport, TN, at a sewage treatment plant

The surfacestations.org project has reached an important milestone.

With the submission of #222, Lexington, VA, by John Goetz, we now have fewer than 1,000 stations (out of 1,221) left to survey. It was a 3-way race to #222 between power surveyors John Goetz, Kristen Byrnes, and Don Kostuch.

Thanks to ALL of the wonderful volunteers for helping to reach this important benchmark! We currently stand at 231 surveyed stations and 990 left to go.

I still need help in the midwest and the south, particularly Kansas, Nebraska, Montana, the Dakotas, Oklahoma, Mississippi, and Alabama. If you live in these areas and want to make a lasting contribution to climate science, please visit www.surfacestations.org and sign up as a volunteer. It's easy to do, and it makes for a fun science learning experience.

91 Comments

  1. Reid
    Posted Jul 28, 2007 at 12:08 PM | Permalink

    Anthony,

    Do you have any early data on the percentage of sites that are high quality versus those that have serious problems?

  2. Posted Jul 28, 2007 at 12:29 PM | Permalink

    RE1 I plan to start that now. I feel that we have a large enough sample at this point to be statistically significant. Right now, at 18.9%, soon to be 20%, it seems reasonable to start. Once the first 231 stations are analyzed for siting quality and given a rating, a running total will be kept as we add more. Based on what I know is in the queue for surveys, by the time I get the 231 rated, there will be another 20 completed, bringing the total to 20.5% of all USHCN stations. We also seem to have better balance now, with east coast stations and west coast stations being closer in numbers.

    We still need more in the midwest.

    National public opinion and product surveys are done with far smaller percentages, so I believe we can start with what we have now.

    I expect we’ll hear the usual criticisms of the first analysis, but we have to start somewhere. If anybody has better ideas, I’m open to suggestions.

  3. Pat Frank
    Posted Jul 28, 2007 at 12:34 PM | Permalink

    On a general note, congratulations to you, Anthony, for opening climate science to the significant multiple contributions from dedicated and talented amateurs that, until now, have been typical only of astronomy.

    You have produced a real example of the democracy of science.

    Anyone can freely participate, and only the quality of the argument carries the ultimately deciding weight. The entire validity of the surface temperature record now hangs on your project.

    High priest wannabes can only screech their frustration.

  4. Posted Jul 28, 2007 at 12:55 PM | Permalink

    Congratulations and a heartfelt thank you Anthony.

    I do have one correction, though. You say in #2 that:

    National public opinion and product surveys are done with far smaller percentages, so I believe we can start with what we have now.

    The primary criterion that determines whether a sample is suitable is not the sample size but whether it is a representative random sample. Random sampling enables one to put proper confidence intervals on estimates. You are conducting a census of all weather stations, not randomly sampling them. It is important to keep this distinction in mind.

    This census is very important: The main problem with temperature data sets is that they are not random samples of temperature at various locations on the planet: Humans have tried to measure temperature and other weather variables mostly where they lived and traveled and that fact alone should indicate that such measurements may not be representative of the whole planet.

    You are identifying many interesting ways such discrepancies can occur. This is an historic undertaking. I do not think humankind can thank you and the volunteers enough. Good luck.

    Sinan

  5. steven mosher
    Posted Jul 28, 2007 at 1:48 PM | Permalink

    RE #4. Anthony, Sinan is correct, kinda, sorta. Even a census makes estimates on unsampled households.

    There are 1200 ( ok 1221) USHCN stations.

    Using the CRN ranking system we would rank them using the 5 classes specified by CRN

    Class 1 and 2 meet standards.

    Class 3 starts to have issues (encroachment of warming surfaces).

    Class 4 and 5 are clearly impaired.

    My expectation going in was that the CRN score would be something like this:

    Class 1: 5%
    Class 2: 15%
    Class 3: 60%
    Class 4: 15%
    Class 5: 5%

    So, for 1200 sites or so, I would expect that 240 or so would be unsuitable for use. And 240 would
    be pristine. Hey if Parker can use 290 sites……

    We have surveyed 200 sites or so. I’d guess that if you go through those sites you’d find
    40 near pavement or buildings. We should pull the Google map view of each site and refine
    this a bit. Anyway

    To facilitate analysis by others, I’m thinking the project needs a file format that embeds
    certain metadata information. Silly to have two files. One for Temp, and one for station change.

    So, just like Hansen has nightlights, I'm thinking of flags for CRN class, AC unit, asphalt,
    distance from buildings, population density of the ZIP code, water treatment facility flag.

    Other thing I noted: the survey forms with GPS sometimes indicate elevations that differ from the
    metadata. If Hansen and Jones rely on metadata and not an up-to-date GPS reading, there is error in
    the elevation adjustments.

  6. BarryW
    Posted Jul 28, 2007 at 2:14 PM | Permalink

    One criticism that has already been made of the survey is that there is a bias towards urban (hence “bad”?) sites. Statistics on how many urban vs rural sites have been surveyed should be kept to determine if the mix of sites is representative of the network as a whole. It would also be good to rate the sites for urban/rural to see how the survey’s ranking compares with the NOAA rankings for the sites. Let’s see if they even know which sites are still rural.

  7. DeWitt Payne
    Posted Jul 28, 2007 at 2:53 PM | Permalink

    The sample size is important because it determines the minimum statistical error. Once again we’re in the realm of counting statistics where the total number of ‘counts’ is equal to the variance and the standard deviation is the square root of the variance. So for 225 stations sampled the standard deviation is 15 or a relative standard deviation of 6.7%. Getting a representative sample is much less of a problem when you’ve sampled nearly 20% of the entire population compared to an opinion poll of 1,000 people out of a population of 300,000,000.
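
    A quick sketch of the counting-statistics arithmetic DeWitt describes, using the 225-station figure from the comment above (Python here is only an illustration, not part of anyone's workflow in the thread):

    ```python
    import math

    # Counting statistics: treat the number of surveyed stations as a count
    # whose variance equals the count itself, as described in the comment above.
    n_surveyed = 225
    std_dev = math.sqrt(n_surveyed)        # sqrt(225) = 15
    relative_sd = std_dev / n_surveyed     # 15 / 225 = 0.0667, about 6.7%

    print(f"standard deviation ~ {std_dev:.0f}")
    print(f"relative standard deviation ~ {relative_sd:.1%}")
    ```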

  8. Posted Jul 28, 2007 at 2:58 PM | Permalink

    Re: #5

    Anthony, Sinan is correct, kinda, sorta. Even a census makes estimates on unsampled households.

    In Statistics, census means observing every single observational unit in a population. There is no sampling in a census. When a census fails to reach every unit (as population censuses routinely do), the missing information has to be estimated some other way; see
    this or
    this.

    In fact, it is a pity that governments around the world, including the U.S., have spurned proper sampling (rather than census) and resort to imputation to make up information they do not have (but that is a whole other topic).

    Anthony is attempting to organize a census of stations. That is, he is trying to find out the characteristics of all the stations.

    I hope he succeeds. However, what he has so far is not a representative random sample of all stations in the U.S. just as the proportion of actors in California’s population would not be representative of the proportion of actors in the U.S. population.

    I do expect that many more awful stations will be found not because of the statistical properties of the data that have been collected so far but because of my intuition.

    Even if the census of all stations fails (that is, some stations are not visited, photographed, etc.), that does not diminish the importance of this work in terms of making people aware of the stations that are indeed problematic. In fact, if there were an instantaneous response to these concerns and these stations were fixed, I am afraid we might face another panic over sudden global cooling (which, of course, is caused by CO2).

    🙂 🙂

  9. Posted Jul 28, 2007 at 3:09 PM | Permalink

    Re: #7

    You cannot carry out those calculations if your sample is not random. Period.

    Getting a representative sample is much less of a problem when you’ve sampled nearly 20% of the entire population compared to an opinion poll of 1,000 people out of a population of 300,000,000.

    That is statistical baloney. The population of New York City is about 8 million and the population of New York State is about 20 million. About 45% of New York City is white whereas about 68% of New York State is white.

    In statistical sampling, the margin of error depends only on sample size, not at all on population size. When a proper random sample is not used, the margin of error calculations become meaningless.

    So, a random sample of 1,000 out of 300 million is more meaningful than just counting all the people in the top ten largest cities in the United States.

    Sinan
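
    A minimal sketch of the textbook margin-of-error formula behind this point, assuming a simple random sample of a proportion; note that the population size never enters the calculation:

    ```python
    import math

    def margin_of_error(n, p=0.5, z=1.96):
        """Approximate 95% margin of error for a proportion estimated from a
        simple random sample of size n (standard formula, ignoring the finite
        population correction, which is negligible for large populations)."""
        return z * math.sqrt(p * (1 - p) / n)

    # The margin depends on the sample size, not the population size:
    for population in (300_000_000, 20_000_000, 8_000_000):
        print(f"population {population:>11,}: n=1000 gives +/- {margin_of_error(1000):.1%}")
    ```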

  10. steven mosher
    Posted Jul 28, 2007 at 3:20 PM | Permalink

    Re 6.

    Good point Barry.

    the list of sites USED by hansen is
    http://data.giss.nasa.gov/gistemp/station_data/station_list.txt

    Select country code cc=425 and see what the distribution of rural/urban/small town is.

    I count over 1700 sites used in the US.

  11. Posted Jul 28, 2007 at 3:35 PM | Permalink

    As I understand it, what is being done is a census of current samples (aka surface stations). There are a finite number of surface stations. However, in theory there are an infinite number of locations at which one could place a surface station. Some sort of broad (average? global?) temperature is currently being sampled via these surface stations. The census could shed light on how biased (or not) the current sampling placement (technique) is.

    Having said that, tally ho, so to speak. Only 4/5 to go!

  12. BarryW
    Posted Jul 28, 2007 at 3:54 PM | Permalink

    Re #10

    I notice Hansen gives a brightness index for each site. Wonder how this compares to urban/rural designations and if that compares to what the audit shows. I also wonder what that bodes for the “corrections” that are made to the raw data.

  13. steven mosher
    Posted Jul 28, 2007 at 3:58 PM | Permalink

    Sinan,

    Good point. With 1221 stations, if we wanted to select a random sample:

    1. How would you do it?
    2. How big would it be?
    3. If you had a sample, could you show that it was not random?

  14. DeWitt Payne
    Posted Jul 28, 2007 at 4:22 PM | Permalink

    You cannot carry out those calculations if your sample is not random. Period.

    No. Period. You can and should always do the calculation because counting statistical error is always the best you can do. That’s not saying that the results of the sample measurement will be within the counting statistical error of the true value. That is only likely to be true if the sample is representative. The thing is, until you do a survey of some sort, how do you determine what’s representative? As far as the quality of the measurement stations is concerned, what would be your criteria for a representative sample? I doubt rural vs. urban is a significant factor. A completely random selection of 225 stations from the list of 1200 would likely show as much geographical clustering as in the current sample.

    The sample for opinion polls is not a completely random list of people. The sample is, or at least should be, designed to be representative. You have to have the correct ratio of male to female, Republican to Democrat to Independent, etc., or at least control for these variables. You’ll probably get the male to female ratio close to correct by pulling names out of the proverbial hat. Sampling of minorities will be considerably more problematic in a completely random sample. A group representing 10% of the population will have a 2 sigma range of 40 to 160 in a completely random sample of 1,000 people. That would be an unacceptably large variation for a polling organization.
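
    A quick simulation of the scenario described above (a 10% subgroup appearing in completely random samples of 1,000), assuming a simple binomial model; the numbers are synthetic:

    ```python
    import random
    import statistics

    # Draw many completely random samples of 1,000 from a population in which
    # 10% belong to some subgroup, and look at the spread of the subgroup count.
    random.seed(0)
    counts = [sum(random.random() < 0.10 for _ in range(1000)) for _ in range(2000)]

    mean = statistics.mean(counts)
    sd = statistics.pstdev(counts)
    print(f"mean count ~ {mean:.0f}, sd ~ {sd:.1f}, "
          f"2-sigma range ~ {mean - 2*sd:.0f} to {mean + 2*sd:.0f}")
    ```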

  15. Posted Jul 28, 2007 at 4:49 PM | Permalink

    Re: 13

    Good point. With 1221 stations, if we wanted to select a random sample:

    Please note that I am not advocating a sampling approach. All I am saying is that the data so far does not constitute a sample that can be used for confidence intervals, statistical tests etc. The data set represents a substantial achievement towards a complete census of all 1221 stations.

    1. How would you do it?

    Honestly, I’d rather not. I do not know the distribution of these stations over the United States (although, that will change by later this evening). If the stations themselves are distributed uniformly enough, I would sample about 100 of them randomly and with replacement.

    2. How big would it be?

    N = 100 seems fine to me. If you go for a much larger size, the benefits of sampling (in terms of reduced cost) start to go down and a census becomes more attractive. Now that Anthony is getting volunteers and has passed a significant milestone, I am sure we will have a census.

    3. If you had a sample, could you show that it was not random?

    Random sampling is not a characteristic of the sample but of the method used to collect the sample.

    I know that the data Anthony has collected so far is not a random sample because the data are being collected as volunteers get a chance to go to a particular station. The current data set is a portion of a census.

    Hope this helps.

    Sinan
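
    A minimal sketch of the kind of draw Sinan describes in his answer to question 1 (100 stations selected randomly, with replacement); the station IDs below are made-up placeholders, not the real USHCN inventory:

    ```python
    import random

    # Placeholder identifiers standing in for the 1,221 USHCN stations.
    ushcn_stations = [f"USHCN-{i:04d}" for i in range(1, 1222)]

    random.seed(42)                                 # reproducible draw
    sample = random.choices(ushcn_stations, k=100)  # with replacement, as suggested

    print(len(sample), "stations drawn,", len(set(sample)), "distinct")
    ```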

  16. Posted Jul 28, 2007 at 5:06 PM | Permalink

    Re: 14

    You can and should always do the calculation because counting statistical error is always the best you can do.

    I am sorry, but there is no way to calculate a margin of error on the estimate of the count of bad stations in the U.S. using the data Anthony has collected so far. Statistics is not magic.

    The word sample does not just mean subset.

    The sample for opinion polls is not a completely random list of people. The sample is, or at least should be, designed to be representative.

    Obviously. And, the data that Anthony and the volunteers have collected so far have not been collected in a way to ensure that the data set would be representative of the whole U.S.

    At this point, the data set is a portion of a census. Not a sample from which you can draw statistical conclusions.

    If you want to call it a type of sample, it is a “convenience sample”.

    It is similar to asking everyone in Ithaca, NY whom they are going to vote for and predicting the results of the U.S. presidential election based on that information.

    In a similar vein, even before Anthony’s effort, I fundamentally opposed making any predictions about the future based on the existing so-called temperature record, because the data we humans have has only been collected where we lived and traveled, and coverage only expanded as industrialization took hold.

    Sinan

  17. Posted Jul 28, 2007 at 5:33 PM | Permalink

    RE5 Steven, I have no expectations. Let the data and photos tell the story, whatever it may be.

    RE13,14 If this survey had a huge budget where a random number generator could be employed, picking the USHCN station to survey next, and then somebody could be immediately flown there by private jet or helo (maybe Gore could help) then yes we could get a random sample.

    But given a zero budget relying upon out of pocket travel expenses paid for by each volunteer, we’ll just have to do the best we can and rely upon the randomness of the Internet to deliver volunteers as it has already.

  18. steven mosher
    Posted Jul 28, 2007 at 6:12 PM | Permalink

    RE 17.

    Anthony what I mean by expectation is mathematical. I could expect to find a uniform distribution
    of station scores or something more normalish or something skewed to the ‘good’ end of the scale
    where 4s and 5s were rare events.

    Clearly I think the AGW folks think that CRN score will be skewed to the “good” end.

    As for Random sampling that is not what I am getting at. I need to cogitate a bit.

  19. BarryW
    Posted Jul 28, 2007 at 6:15 PM | Permalink

    Re #17

    I think the concept of randomness is being overstated. The random sampling of a population is to ensure that the measurement is indicative of the overall population, but this is only true if the sample represents the population. For example, if I am trying to determine an election result with a phone sample, even if I do totally random calling I can get a biased sample if the number of Democrats, Republicans or Independents is skewed from the actual population, and especially if it’s skewed from the actual percentage of that group that will vote. Polling companies routinely adjust for the differences from the actual population in their sampling. I think (and this is a non-statistician speaking) that ensuring that the sample you do use is as close as possible to the attributes of the entire population would give meaningful results. By this I mean adjust for urban/rural, population, maybe even east/central/west. (“is there a statistician in the house?”)

    The real statistic is how well NOAA is doing, not how well Anthony and the volunteers are doing. If they’re way off from what they say the quality of the stations is, then that is significant, regardless of how much carping is done about the size or bias of the present sample. And this is going to be a preliminary result, not the final answer.

  20. Posted Jul 28, 2007 at 7:30 PM | Permalink

    I would like to make it clear that I am absolutely NOT second-guessing what Anthony did or suggesting that he should have gone with a sophisticated sampling scheme.

    It is, however, important to keep in mind the distinction between proper sampling (in which case you can use standard statistical methods to make inferences beyond the observations you have) versus an incomplete census.

    I am firmly of the opinion that even a few really bad stations are grounds for serious concern about the rest of the temperature data.

    I am also firmly of the opinion that all the surface temperature data sets only show how measured temperatures at the locations where people have chosen to measure the temperature have changed and they cannot be used to make inferences about places where people have not made the choice to measure the temperature.

    Now, as promised before, you can find a map of the stations in the list at http://www.unur.com/climate/giss-station-locations.html

    Sinan

  21. Posted Jul 28, 2007 at 8:40 PM | Permalink

    Since you fellows are pondering statistical techniques here is one that has been bugging me.

    There are those that say the “power of large numbers”, as in Tamino’s post here:
    http://tamino.wordpress.com/2007/07/05/the-power-of-large-numbers/

    makes the data more accurate in the final analysis. While I can see that it makes the average or the mean more accurate by using larger sets of numbers, I don’t see how it can null out the starting uncertainty in the measurement.

    One thing I just can’t get away from though, and keep coming back to even with the power of large numbers, is the propagation of uncertainty. I posit that if you start with a +/- 0.5°F uncertainty (two sources: rounding to whole integers by the observer, and the base accuracy of the MMTS system, which is also +/- 0.5°F per published NWS specs), no amount of large numbers is ever going to change the base uncertainty. I think that even a slight increase in uncertainty occurs when USHCN temps are converted to °C, because there is a slight conversion error and a slight rounding uncertainty added.

    Over on Rabett/Halpern’s site I asked this question, prefaced with some of the above paragraphs: “Can you tell me why temperature data from USHCN and GISS are not reported with +/- x°C uncertainties?”

    Of course nobody answered, and when I pressed the issue a second time one angry fellow named Delgado went über ballistic, so I don’t try to engage them anymore with the question.

    It seems straightforward and simple to me. It seems to me that any GISS plot should read +/- 0.9°C or +/- 0.5°F to reflect the base uncertainty of measurement. Likewise, products derived from the combination of such data should state the same.

    Or, am I missing something?

    I’d appreciate hearing A. Sinan Unur’s and Steve and Barry’s thoughts on this if you are so inclined.
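
    For what it is worth, a minimal sketch of one conventional way to combine the two +/- 0.5°F contributions mentioned above, assuming they are independent and added in quadrature (that independence is an assumption, not something established in the thread), together with the °F-to-°C conversion for a temperature difference:

    ```python
    import math

    # Two +/- 0.5 F contributions (observer rounding, MMTS spec), combined in
    # quadrature under an assumed independence -- a sketch, not NWS practice.
    rounding_f = 0.5
    mmts_f = 0.5
    combined_f = math.sqrt(rounding_f**2 + mmts_f**2)   # ~0.71 F

    # A temperature *difference* converts to Celsius with the factor 5/9.
    combined_c = combined_f * 5.0 / 9.0                 # ~0.39 C

    print(f"combined uncertainty ~ +/- {combined_f:.2f} F, or +/- {combined_c:.2f} C")
    ```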

  22. David Smith
    Posted Jul 28, 2007 at 9:04 PM | Permalink

    RE #20 Sinan, on the GISS map you prepared, any idea why there is such a concentration of stations in Turkey?

  23. Posted Jul 28, 2007 at 10:14 PM | Permalink

    Re: 21

    I am afraid the whole thing is a little bit of a diversion. If errors are independently and identically distributed with constant mean (they don’t have to be normally distributed), the larger the number of observations, the closer will be the average of those observations to the “true” average.

    That is the law of large numbers. The Wikipedia entry actually does a good job of explaining it: http://en.wikipedia.org/wiki/Law_of_large_numbers

    For statistical inference, we resort to variations on the Central Limit Theorem: http://en.wikipedia.org/wiki/Central_limit_theorem

    In the latter case, we are not talking about converging to a true parameter value but rather about the ability to calculate the conditional probability that the sample one observed came from a hypothesized distribution (assuming, independently and identically distributed observations).

    Neither means squat if the iid assumption is violated.

    The post you linked to is certainly amusing, but the argument assumes that whatever errors exist are iid across time and place. I doubt that is the case with surface temperature data.

    How many places are there in your database where they put the weather station in a location that would have been cooler than the surrounding area?

    There is a difference between the distribution of IQ scores in the population versus the IQ scores in an Economics Ph.D. program.

    Hope this helps.

    Sinan

  24. Posted Jul 28, 2007 at 10:29 PM | Permalink

    Re: 22

    I have no firm facts. However, as far as I know, weather stations started cropping up around Anatolia around the turn of the century, first during the end of the Ottoman period, then more systematically when the State Meteorological Institute was established around 1930 during the first decade of the Republic.

    Second, the fact that Turkey has been a member of NATO, OECD, EEC, EC and affiliated with various EU organizations probably meant funding and access.

    This is all speculation though.

    Anyway, it looks like 3.7% of all stations in that list are in Turkey. On the other hand, 28% of the stations are in the U.S., even though the U.S. is about 12.5 times the size of Turkey.

    Note that there is also a very high density of stations in Japan and South Korea.

    Sinan

  25. Posted Jul 28, 2007 at 10:33 PM | Permalink

    Re: 22 and 24

    I think this reinforces my point that the temperature data that exists can only show how temperature measurements have changed in the places we have chosen to measure temperature. And, certainly, we did not use proper sampling techniques to choose where to measure the temperature so that the trends in the resulting aggregate data set would be representative of global trends.

    Sinan

  26. Posted Jul 28, 2007 at 10:50 PM | Permalink

    Can’t let it go. According to this, there are a total of 451 stations in Turkey. Out of these, 64 are unmanned, automated and 56 are at airports. 69 are so-called “synoptic” stations which aggregate information from a number of stations. I do not know which ones are in the GISS list.

    Sinan

  27. matt
    Posted Jul 29, 2007 at 2:52 AM | Permalink

    One thing I just can’t get away from though, and keep coming back to even with the power of large numbers, is the propagation of uncertainty. I posit that if you start with a +/- 0.5°F uncertainty (two sources: rounding to whole integers by the observer, and the base accuracy of the MMTS system, which is also +/- 0.5°F per published NWS specs), no amount of large numbers is ever going to change the base uncertainty. I think that even a slight increase in uncertainty occurs when USHCN temps are converted to °C, because there is a slight conversion error and a slight rounding uncertainty added.

    Assuming the MMTS and human reader both exhibited a mean error of zero degrees, why do you think this is a problem?

  28. MrPete
    Posted Jul 29, 2007 at 7:03 AM | Permalink

    Once there are sufficient stations surveyed, couldn’t a proper random sample be generated this way:

    – Computer-generate a truly random subset from the full USHCN list
    – Pull the already-surveyed data for those sites
    – If any sites in the list have not been surveyed, do your darndest to get them surveyed
    – Process the results

    Obviously, with only 200 complete, selecting 100 out of the 1200 is going to mean we’ll have a good number of “misses”… but that’s ok. We’re making good progress in any case!
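
    A minimal sketch of the bookkeeping MrPete outlines above; the full list and the set of already-surveyed stations are hypothetical placeholders, not real survey data:

    ```python
    import random

    # Hypothetical placeholders for the full USHCN list and the stations surveyed so far.
    full_list = [f"USHCN-{i:04d}" for i in range(1, 1222)]
    already_surveyed = set(random.Random(1).sample(full_list, 200))

    # Step 1: computer-generate a truly random subset (here 100 stations, no replacement).
    target = set(random.Random(2).sample(full_list, 100))

    # Steps 2-3: split the target into stations already surveyed and the "misses"
    # that would still need a volunteer visit.
    hits = target & already_surveyed
    misses = target - already_surveyed
    print(f"{len(hits)} already surveyed, {len(misses)} still to be surveyed")
    ```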

    On another note: the desire to see a clear and accurate temp picture (from non-contaminated sites) ought to be motivation enough for people to find the good sites along with the bad! I’m curious whether any good sites will be found in Maine 😉

  29. Steven B
    Posted Jul 29, 2007 at 7:05 AM | Permalink

    I’m sure this averaging business was discussed here before. The fundamental problem is that many measurement errors such as quantisation are correlated, and the laws of large numbers only apply up to a certain point. You can gain some increased accuracy from averaging, but this is strictly limited.

    The variance of an average is 1/n^2 times the sum of all elements of the covariance matrix. If the observations are independent then this vanishes except along the diagonal, you have n diagonal elements divided by n^2 which causes your variance to reduce as 1/n. That’s the usual 1/Sqrt(n) dependence of SD.

    But if the observations are not independent, you have the sum of a whole nxn matrix of values divided by n^2, which no longer necessarily disappears with increasing n.

    If you suppose the correlations are all the same, you get var(y-bar) = (r + (1-r)/n) s^2, where r is the correlation and s the standard deviation of each individual measurement. For r small, the (1-r) term dominates and you get the expected accuracy improvement until n approaches 1/r, when the improvement tails off. There is an ultimate limit to the accuracy. In practice it won’t be anything like as neat, with cross-correlation values all over the place, but the same sort of behaviour is to be expected. The matrix expands in size as fast as the 1/n^2 shrinks the total.

    Have a look at the “Guide to the Evaluation of Measurement Uncertainty for Quantitative Test Results” equation A5.5 on p45.

    Rounding error is clearly correlated to the actual local temperature value, and local temperature values are correlated with one another. Humidity, sunlight, wind, common calibration procedures, and similar instrument construction all cause tiny correlations. All the adjustments they apply also introduce correlation, as does the anomaly calculation, smoothing, and any other manipulation. Unless the correlation between observations is perfectly and absolutely zero, the benefit to be obtained from averaging is finite.
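
    A small numerical illustration of the equal-correlation formula quoted above, var(y-bar) = (r + (1-r)/n) s^2, showing the floor it implies; the values of r and s are arbitrary:

    ```python
    # Variance of the mean of n equally correlated measurements:
    #   var(ybar) = (r + (1 - r)/n) * s**2
    # As n grows, the (1 - r)/n term vanishes and the variance bottoms out at r*s**2.
    def sd_of_mean(n, r, s):
        return ((r + (1.0 - r) / n) * s * s) ** 0.5

    s = 0.5    # per-measurement standard deviation (illustrative)
    r = 0.01   # common pairwise correlation (illustrative)
    for n in (1, 10, 100, 1_000, 10_000):
        print(f"n = {n:>6}: sd of mean ~ {sd_of_mean(n, r, s):.4f}")
    ```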

  30. steven mosher
    Posted Jul 29, 2007 at 7:11 AM | Permalink

    re 28.

    That's what I was thinking, but let others who know more weigh in.

  31. steven mosher
    Posted Jul 29, 2007 at 7:33 AM | Permalink

    re 21.

    Anthony, it's been bugging me too. Over on another thread I've been discussing a related problem with JerryB.

    When you download USHCN daily data and request Tmean, they round it up. So if Tmax were 12F
    and Tmin were 9F, they report Tmean as 11.

    No more significant digits than are in the observation. Tmean is not OBSERVED per se.

  32. steven mosher
    Posted Jul 29, 2007 at 7:51 AM | Permalink

    Sinan,

    http://data.giss.nasa.gov/gistemp/station_data/station_list.txt

    Turkey is cc=649

    I’ll do a count

  33. steven mosher
    Posted Jul 29, 2007 at 8:07 AM | Permalink

    Sinan,

    I counted 232 for Turkey, and over 1700 for the US.

  34. Reference
    Posted Jul 29, 2007 at 8:16 AM | Permalink

    The small error margins of historic temperatures are only possible because of the large number of high quality observations. As the number of high quality stations will be significantly reduced as a result of this survey, it will be interesting to see how much difference this will make to the accuracy of these historic temperatures.

  35. John Lang
    Posted Jul 29, 2007 at 8:39 AM | Permalink

    So not only are weather stations being increasingly sited on parking lots, the global temperature monitors of GISS, the Hadley Centre and the USHCN continue to adjust the raw temperatures taken at these sites UPWARDS.

    GISS – total adjustment to the raw data – +0.6C;

    USHCN – total adjustment to the raw data – +0.6C;

    Hadley Centre – total adjustment – +0.7C;

    So the total increase in temperatures since 1900 – +0.8C –

    … could easily be explained by parking lots and overzealous climate researchers.

  36. Posted Jul 29, 2007 at 8:46 AM | Permalink

    The apparent increase in ‘accuracy’ afforded by the power of large numbers is a diversion from the several issues of a fundamental nature associated with measurement of experimental data. Repeated measurements of exactly the same property many times can only increase the apparent accuracy of the mean value of the measurements. The mean value of the estimates and absolutely nothing else. And the necessary conditions under which this can be achieved have been mentioned several times on several blogs, especially here on CA. Not all procedures/methods meet the requirements of the necessary conditions. The same person reading the same instrument many times does not necessarily meet the requirements because of the possibility of systematic bias both in the instrument and the way the person reads the instrument.

    The power of large numbers cannot increase the precision of the measurement. If that was the case the market for fine German high-precision metal-working machines would not exist. And development of ever-more precise methods and procedures would have ceased long ago.

    The power of large numbers cannot overcome limitations introduced by systematic bias. And it is equally useless if the same quantity is not being measured.

    The questions that Anthony has asked are important, in my opinion. Additionally, the issue of significant digits has also not been completely addressed. Early reports of the temperature read off a gauge are very likely limited to digits in the units place, or at most to one digit beyond the decimal place. The number of significant digits cannot be increased by mathematical operations. Note that this does not address the issues of the limitations of the numerical values that are reported. These limit the precision of the numbers.

    If Tmin and Tmax are reported to two significant digits, the numerical average can only be reported to two significant digits. Note also that the increase in the number of digits accumulated in the numerical values of the mean of the measurements with large numbers of measurements should also be limited by the number of significant digits in the measurements.

    Maybe all this data analysis stuff would be easier, and done more nearly correctly, if Interval Arithmetic routines were used. I look forward to more complete discussions here.

    All corrections appreciated.
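
    A minimal sketch of the interval-arithmetic bookkeeping suggested above, assuming each reported whole-degree reading really means an interval of +/- 0.5; the Tmax = 12, Tmin = 9 figures are borrowed from comment #31:

    ```python
    # Treat each reading as an interval [lo, hi]: +/- 0.5 around the reported
    # whole-degree value (an illustrative assumption about the rounding).
    def interval(reported, half_width=0.5):
        return (reported - half_width, reported + half_width)

    def interval_mean(a, b):
        return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

    tmin = interval(9)    # reported 9 F
    tmax = interval(12)   # reported 12 F
    lo, hi = interval_mean(tmin, tmax)
    print(f"Tmean lies somewhere in [{lo}, {hi}]")   # [10.0, 11.0]
    ```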

  37. JS
    Posted Jul 29, 2007 at 10:36 AM | Permalink

    Anthony, of the less than 1,000 sites that are left, how many are closed vs open?

  38. Scott-inWA
    Posted Jul 29, 2007 at 12:43 PM | Permalink

    Just as a point of interest for this discussion concerning mounting evidence of the generally poor siting of many USHCN weather stations, in the world of nuclear facilities QA auditing which I am directly familiar with, the progression of events in identifying and documenting quality assurance issues generally occurs as follows:

    Stage 1: Early indications of potential problems emerge through:
    * Concerns and issues raised by interested parties both formally and informally
    * Analysis of past audit reports and surveillance activities for the presence of recurring themes and topics.
    * Formal trending and analysis of the types of QC compliance data which is systematically generated as part of purchasing/procurement activities, construction activities, or as part of the engineering configuration management process.
    * Analysis of system failures or process compliance failures of such significance so as to require a root cause analysis report and a follow-on corrective action plan.

    Stage 2: Recurring indications of potential problems might be of such concern (depending upon the nature of the problem and the importance of the systems or activities it affects) that a more-focused surveillance effort is initiated to take a systematic look at the issue. If a large number of systems or activities are potentially affected, a surveillance plan might be developed which targets some representative sample of the systems or activities. Information is gathered which might include:
    * The presence or absence of process or system documentation as required by written policies and procedures.
    * An itemized listing of specific noncompliances and nonconformances with system specifications or with procedural/process requirements.
    * An itemized evaluation of the potential safety and cost impacts associated with each identified noncompliance and nonconformance.
    * An aggregate evaluation of the overall potential safety and cost impacts associated with the collective set of noncompliances and nonconformances, where determined to be required.
    * As in Stage 1, a root cause analysis report and a follow-on corrective action plan might be developed for each of the identified issues, where individually determined to be required.

    Stage 3: If the more focused surveillance activity performed in Stage 2 reveals evidence of widespread problems and issues of a specific and related nature, and if the problems and issues are judged to have sufficiently important safety and/or cost impacts associated with them, then a full-blown, 100% compliance audit of all potentially affected systems and activities might be initiated. As in Stage 2, a root cause analysis report and a follow-on corrective action plan might be developed for each of the identified issues, where individually determined to be required. Moreover, a comprehensive root cause analysis and corrective action plan might also be developed to deal with cross-cutting problems or issues, as they exist within the overall context of system/program requirements versus actual system/program implementation.

    OK …

    It is generally recognized in the nuclear industry that if one has entered Stage 3 Auditing Space for some particular set of systems and/or work processes, then there are probably larger issues which must be addressed across the entire scope and breadth of the project; i.e., are the individual contributors taking a quality-conscious approach in performing their work; are the individual project managers taking responsibility for implementing proper quality control practices within their own assigned project areas; is senior management on board with an appropriate QC/QA philosophy, and is the Quality Assurance program effective in keeping senior management informed as to how well the project is doing in meeting its overall expectations for quality control and quality assurance?

  39. Posted Jul 29, 2007 at 1:32 PM | Permalink

    I can't add anything from a scientific perspective, only: good work, keep it up. Not everyone has swallowed the AGW agenda hook, line and sinker, and life has taught me that you need to be so careful; “experts” can get it so wrong. I've been fishing all my life since the early 60s and have seen all sorts of weather when you would not expect it. Our climate seems to vary over periods of time in little cycles of warm and cold.
    It seems to me from a common sense point of view that siting weather stations close to parking lots will increase the temperature detected. I know from my own house, which has an asphalt front, that the temperature displayed on my vehicle will always be higher on the asphalt than in my back garden (which has an external thermometer).
    You guys and girls have certainly stirred up a hornets' nest on some of the other AGW sites by daring to question the figures etc. in detail. Some of them seem so juvenile with name calling of anyone who is skeptical. How professional is that?

  40. Anthony Watts
    Posted Jul 29, 2007 at 2:47 PM | Permalink

    RE37 “open -vs- closed” ????

    please elaborate

  41. JS
    Posted Jul 29, 2007 at 3:13 PM | Permalink

    Anthony, MMS shows some of the USHCN stations as closed. I imagine they went there and removed the equipment.

  42. matt
    Posted Jul 29, 2007 at 6:38 PM | Permalink

    The power of large numbers cannot increase the precision of the measurement.

    I think this statement is false. How do you explain a one bit digital to analog converter delivering the equivalent of 24 bits of resolution? It does it by extreme oversampling.

    In fact, in radio engineering, sending the same signal twice increases the measured sensitivity of your receiver as a whole by nearly 3 dB.

    There are countless instances in which additional information allows you to improve the accuracy of a measurement.

    If in fact the measurements are rounded to the nearest degree, and if the error mean is zero, then there isn’t a bias being introduced. And small trends can still be easily detected.

    Play with excel and normal distributions for a bit to convince yourself of this. It’s very easy to spit out a month of temps with a given distribution. Round them all to the nearest integer and look at monthly means and standard deviations versus unrounded measurements.
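
    A quick simulation in the spirit of the Excel exercise matt suggests, using synthetic normally distributed daily temperatures; the same synthetic "weather" is reused for both years so that only the effect of rounding is being tested:

    ```python
    import random
    import statistics

    random.seed(0)
    n_days = 365
    shift = 0.1    # the small warming we try to detect (synthetic)

    year1 = [random.gauss(15.0, 3.0) for _ in range(n_days)]
    year2 = [t + shift for t in year1]    # identical weather plus a 0.1 degree shift

    diff_rounded = (statistics.mean(round(t) for t in year2)
                    - statistics.mean(round(t) for t in year1))
    print(f"difference of rounded annual means ~ {diff_rounded:.3f}")
    ```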

  43. Bryn
    Posted Jul 29, 2007 at 7:10 PM | Permalink

    Please excuse my naivety, but what is the likely outcome of this survey? Demonstrating that a percentage of the USHCN sites are poorly located is all very well, but what effect would this have on any reading of the data gathered? Some stations are poorly sited, yes, but as long as the conditions remain constant, won’t the variations recorded through time at those sites still be significant? And that is what counts. Much has been said about the effect of UHI. This contribution to bias is only so because the HIs are enlarging, i.e. the conditions of the environment are changing. Poorly sited recorders may provide readings that affect the state-wide average. Only if the number of recorders on line changes, will the average change “significantly” [said with full deference to the preceding discussion of the statistics of sampling]. If the result is, for the sake of argument, 35% of the stations are poorly located, 65% remain that are properly sited. Wouldn’t they be enough from which to still derive a statistically valid result?

  44. Jan Pompe
    Posted Jul 29, 2007 at 7:28 PM | Permalink

    #43 Bryn

    Wouldn’t they be enough from which to still derive a statistically valid result?

    They are probably statistically valid, but some will be measuring climate effects and others air conditioner usage; it will be nice to know which is which, and to get some idea how much one will influence the other in the aggregates. Any engineer worth his salt will do his/her best to minimise any noise affecting any transducer output before the measuring and analyses start. Failing that, we have to sort out likely sources of noise and its influence and deal with that.

  45. John Goetz
    Posted Jul 29, 2007 at 8:31 PM | Permalink

    I am happy to see Anthony’s site get below the 1000-to-go mark, although I do feel bad (sort of) for beating out Kristen and Don for the honors. I echo Anthony’s call for more sites to be surveyed in the midwest, although I have a plea of my own. I tried to survey the site in Staunton, VA, and I made the mistake of doing my correspondence with them via email up until the last moment. Unfortunately, I was never able to get official permission from them to photograph the site, and when I showed up the operator on duty had to decline my request because he lacked the authority to give me the permission. If anyone in the Staunton area would like to take a crack at getting permission to photograph the site, please contact Anthony to get my email, and I will be happy to give you all of the contact information I have, including names and phone numbers. This is one logistical mistake I made and I would simply like to see the site added to the survey.

  46. Willis Eschenbach
    Posted Jul 29, 2007 at 8:32 PM | Permalink

    matt, in response to the statement “The power of large numbers cannot increase the precision of the measurement.” you say:

    I think this statement is false. How do you explain a one bit digital to analog converter delivering the equivalent of 24 bits of resolution? It does it by extreme oversampling.

    In fact, in radio engineering, sending the same signal twice increases the measured sensitivity of your receiver as a whole by nearly 3 dB.

    There are countless instances in which additional information allows you to improve the accuracy of a measurement.

    If in fact the measurements are rounded to the nearest degree, and if the error mean is zero, then there isn’t a bias being introduced. And small trends can still be easily detected.

    Play with excel and normal distributions for a bit to convince yourself of this. It’s very easy to spit out a month of temps with a given distribution. Round them all to the nearest integer and look at monthly means and standard deviations versus unrounded measurements.

    While this is true in certain specialized situations, in others it is not true.

    For example, consider a sine wave that varies from -10 to 10. We’ll call it a daily temperature record, and take measurements every 5°. Assume further that it contains some random noise on all the measurements, let’s say with a standard error of ± 0.2. Now sample the wave say every 15°, and take the average of the maximum daily measurements.

    Next, as in your example, round the measurements to the nearest degree … you will find that the answer is quite different from the average of the actual measurements.

    And this doesn’t even include the human factors, such as the tendency for very high temperatures to be rounded up and very low (possibly record low) temperatures to be rounded down …

    Finally, you say that repeated oversampling can increase the accuracy from 1 bit to 24 bit … this seems somewhat extreme, since 2^24 is over sixteen million. I’d have to see the data on that one. But in the meantime, let’s look at another example.

    Suppose I’m measuring the length of a line, using a ruler marked in mm. I ask 10,000 people, or a million people, to measure the line. Are you really saying that with enough people I could use that ruler to accurately measure the line to the nearest sixteen millionth of a millimetre?

    I don’t think so … in fact, I’d be very surprised if the average of a million people measuring the line could do better than 0.01 mm, if that.

    w.

  47. Paul S
    Posted Jul 29, 2007 at 8:34 PM | Permalink

    Re: Post # 43 by Bryn:

    If the result is, for the sake of argument, 35% of the stations are poorly located, 65% remain that are properly sited. Wouldn’t they be enough from which to still derive a statistically valid result?

    I would think that first one would need to know which 35% of the sites are poorly located before any conclusion regarding statistical validity could be reached. That this apparently has never been done before by the climate experts responsible for the US surface sites is, to me, the biggest surprise.

  48. Anthony Watts
    Posted Jul 29, 2007 at 8:36 PM | Permalink

    RE42 Matt, not to be brusque, but oversampling is not at all applicable to this issue and the a/d converter analogy doesn't apply.

    Reasons:

    1 Manual samples are made once a day by weather observers, hi/lo datapoints are recorded. You can’t oversample a single reading.

    2 Samples are taken with resolution of 0.1 degree F but are rounded to integers; oversampling won't recover that resolution because the methodology precludes it. It's not time domain sampling, it's once a day.

    3 You can’t go back in time and change the way temperatures were sampled in the past. The data is what it is.

  49. Al
    Posted Jul 29, 2007 at 8:36 PM | Permalink

    If you have 65% good sites measuring something completely flat, and 35% sites that are going upwards, then the overall trend will go upwards in a statistically significant fashion. The ‘bad sites’ aren’t just poorly positioned, if you pay attention to the station notes you can actually tie the rises directly to site changes in several cases. Like: new building -> 1 degree rise. New AC, 2 degree rise. At least, I haven’t seen any with a step change _down_ at all.

    If you throw garbage in, you have to know a LOT more about the data to get the garbage back out again. You can’t assume the _garbage_ is randomly distributed, even more than you can assume the sample was random.

  50. Neil Fisher
    Posted Jul 29, 2007 at 10:34 PM | Permalink

    Re #46:

    “Finally, you say that repeated oversampling can increase the accuracy from 1 bit to 24 bit … this seems somewhat extreme, since 2^24 is over sixteen million. I’d have to see the data on that one.”

    It’s a furphy, IMO. In this application, you *start* with a 24 bit sample, then convert it to a highly oversampled 1 bit stream and get back… 24 bit resolution. The only reason it’s done is to simplify (read: reduce the cost of) the filters post d->a (because the sampling frequency is so much higher, you can get away with a filter that rolls off at 6db/octave rather than 10+db/octave, and the gain flatness/phase response is more easily controlled).

    IOW, in this application, we are *not* increasing the resolution by oversampling.

  51. Nathan Kurz
    Posted Jul 30, 2007 at 12:47 AM | Permalink

    Hi Anthony —

    Some of the site problems you have found (incinerators, parked cars, possibly air conditioners) seem like they might vary on a workday/weekend schedule. Have you been able to see any day of the week effects in the data? While it seems certain that these problems must have some effect, it would be nice to have a better idea of the magnitude, or at least a lower bound.

    –nate

  52. Geoff Sherrington
    Posted Jul 30, 2007 at 1:27 AM | Permalink

    Re # 22 David Smith
    Why is there such a high concentration of these things in Turkey?

    Because turkeys like playing with these things.

    Serious note. First, the pictures tell the story. The public has to see as many pictures as possible. Then turn to mathematics.

    I once set up and ran a polling organisation after hours as an addition to my normal work. We used 1,100 responses in a population of 18 million. I did not trust the data, but that was the number that all the other pollsters used. The important factor was writing the question.

    Correct. A census is a total count.

    Correct. You cannot use the data we have in the raw and have critics believe it. There has to be a random choice if a subsample is to be used.

    Problem. What are we measuring against? The long term record of each station? Then we would have to accept that it is correct, which is the very point that we are disputing. No logic there.

    Another reader said “Why not use a thermometer” in a naturally-better place nearby and compare the official temp with the measured one (away from cars, airconds, asphalt, but at the correct height above ground and in an acceptable shelter). Problem: One needs to know the time the official record was taken and take the thermometer record at the same time.

    It is vitally important to minimise any invalid aberrations between the official record method/equipment and the ones we use. Otherwise credibility will be shot to pieces. I think this is where the thought should be concentrated. It is the methodology principle that we are attacking, not the magnitude of the differences (yet). A random 10 sites would show the principle if the differences were big enough and our methodology well controlled. It’s a bit like writing captions for the cartoons that our photographers are posting.

  53. Posted Jul 30, 2007 at 1:38 AM | Permalink

    “Problem: One needs to know the time the official record was taken and take the thermometer record at the same time.”

    Well, sort of. Since the measurement recorded is only the daily high/low and not the current temperature at the time of the reading, one could be relatively confident of obtaining the high/low temperature by taking one reading per day. The corner case being when one of the stations was in the process of recording the daily high or low at the moment it was being read. So in addition to the height standard and other standards, there really should be a standard time of day when the high/low temperature for the preceding 24 hours is recorded. With modern electronic recording this should be possible with minimum fuss and then it should also be possible to keep measurements from various stations in reasonable sync with each other.

  54. matt
    Posted Jul 30, 2007 at 2:41 AM | Permalink

    RE42 Matt, not to be brusque, but oversampling is not at all applicable to this issue and the a/d converter analogy doesn't apply.

    Reasons:

    1 Manual samples are made once a day by weather observers, hi/lo datapoints are recorded. You can’t oversample a single reading.

    2 Samples are taken with resolution of 0.1 degree F but are rounded to integers; oversampling won't recover that resolution because the methodology precludes it. It's not time domain sampling, it's once a day.

    3 You can’t go back in time and change the way temperatures were sampled in the past. The data is what it is.

    1. Yes, agree. I’m not talking about a single reading. I’m stating it’s simple math to detect a 0.1 degree shift in a year of temperature measurements even if they are rounded. Do you disagree? It’s easy to model in excel.

    2. Oversampling doesn’t recover the resolution in a given time period, but if you double the time period, with enough samples you can easily detect a shift of 0.1 degree in spite of rounding to the nearest degree. Again, try it in excel. Let me know if you need help. Excel can generate normal distributions and round.

    3. Agree.

    The A/D example demonstrates how engineers trade redundancy for accuracy all the time. The theoretical noise floor in a given bandwidth can be exceeded via signal processing, which of course relies on redundancy.

  55. matt
    Posted Jul 30, 2007 at 2:59 AM | Permalink

    “Finally, you say that repeated oversampling can increase the accuracy from 1 bit to 24 bit … this seems somewhat extreme, since 2^24 is over sixteen million. I’d have to see the data on that one.”

    It’s a furphy, IMO. In this application, you *start* with a 24 bit sample, then convert it to a highly oversampled 1 bit stream and get back… 24 bit resolution. The only reason it’s done is to simplify (read: reduce the cost of) the filters post d->a (because the sampling frequency is so much higher, you can get away with a filter that rolls off at 6db/octave rather than 10+db/octave, and the gain flatness/phase response is more easily controlled).

    Well, the real reason it’s done is because one bit converters (both ADC and DAC) can be built directly in CMOS. And when they are, they are less costly than analog converters. It works the other way too: you can take a one bit ADC and with enough sampling you can get an arbitrarily accurate representation of the input voltage. So, I’d dispute that this is a “furphy”

    Here’s a good starting point for reading more: http://www.maxim-ic.com/appnotes.cfm?appnote_number=1870&CMP=WP-10

    If you have a cellphone, it’s loaded with one-bit ADCs and DACs.

    Think about this: using a simple analog comparator that measures if a voltage is above or below 1.5V, (and some wicked fast digital logic), you can measure any voltage between 0 and 3V to 24 bits of resolution. It’s very common. Your bathroom scale likely uses this type of converter.

    The summary, though, is that rounding to the nearest degree doesn’t matter IF you have enough data to look at. If I want to know if this may was warmer than last may by 0.1 degree, then there’s enough data. If I want to know if this year was warmer than last year by 0.1 degree, then there’s more than enough data. If I want to know if this July 29 was warmer than last July 29 by 0.1 degree, then no, you don’t have enough data. But we don’t care about single days do we? It’s trends.

    As noted elsewhere, spend 5 mins in excel and convince yourself of this. You don’t even need to go through all the noise shaping stuff outlined in the article. Generate 30 rands with a normal disti and a std dev of 3 (typical for monthly weather data). In another column, round the rands to the nearest int. Compare with unrounded version. In the two trials I just did, the difference is
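
    The 30-value comparison described above, sketched with synthetic numbers (normal distribution, standard deviation 3) instead of Excel:

    ```python
    import random
    import statistics

    random.seed(1)
    temps = [random.gauss(20.0, 3.0) for _ in range(30)]   # one synthetic month
    rounded = [round(t) for t in temps]

    print(f"unrounded mean: {statistics.mean(temps):.3f}")
    print(f"rounded mean:   {statistics.mean(rounded):.3f}")
    print(f"difference:     {statistics.mean(temps) - statistics.mean(rounded):.3f}")
    ```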

  56. Brooks Hurd
    Posted Jul 30, 2007 at 4:50 AM | Permalink

    Matt,

    With temperature data being fed into an algorithm to calculate a mean global temperature, the standard deviation does not go to zero even though it appears to get smaller. The standard deviation is always a finite number. The standard deviation is based only on the data itself. The errors involved in taking the data, recording the data, adjusting the data, and averaging the data are still there. These errors are distinct from the standard deviation of the data.

  57. RomanM
    Posted Jul 30, 2007 at 5:43 AM | Permalink

    #55: Matt says

    The summary, though, is that rounding to the nearest degree doesn’t matter IF you have enough data to look at. If I want to know if this may was warmer than last may by 0.1 degree, then there’s enough data.

    You are partly correct. The extra variance introduced into a single reading by rounding is about 1/12 (assuming a uniform distribution from -.5 to +.5 centered on the rounded reading) and is relatively minor. However, if you introduce other effects: take larger values and stretch them upwards (the AC and asphalt “feedbacks”), look at temperatures at other locations and add or subtract some unknown quantities (the quality control “feedback”), and randomly change some values (calibration changes, trash burning effects, etc.), then even if none of these were to introduce a bias into the situation, you won’t have “enough data” to detect the changes you suggest are trivial to detect.
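
    A quick check of the 1/12 figure, assuming the rounding error is uniform on (-0.5, +0.5) as stated:

    ```python
    import random
    import statistics

    # Variance of a uniform(-0.5, 0.5) rounding error should be (b - a)**2 / 12 = 1/12.
    random.seed(0)
    errors = [random.uniform(-0.5, 0.5) for _ in range(200_000)]
    print(f"simulated variance ~ {statistics.pvariance(errors):.4f} (1/12 = {1/12:.4f})")
    ```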

  58. MrPete
    Posted Jul 30, 2007 at 5:59 AM | Permalink

    Perhaps this is just a longwinded way of repeating what Brooks Hurd said nicely in one paragraph ;)… I’m no stats expert, but I do have a bit of background in other areas related to what Matt is saying. I agree with part of his argument…

    Anthony said,

    It’s not time domain sampling, it’s once a day.

    As Matt noted in #55, if you decrease the time resolution (ask an annual question from daily data) you just changed the time domain sampling ratio. Instead of one sample (daily temp from daily reading) you have 365 samples. Hundreds of samples of a value that varies smoothly over time *will* buy you something through averaging.

    However, I believe Matt’s examples and logic do not fully fit the situation.

    First, a 1-bit A/D or D/A does its magic not only through extreme oversampling. There’s another key element required: what’s sampled is the difference between the signal and a precision reference value. (In the case of an A/D meter, there’s a feedback loop that adjusts a generated analog value up or down for a short period of time, based on whether the generated value is higher or lower than the input.)

    So yes, by extreme oversampling, one can (in a sense) trade off bits of data precision vs bits of time precision… in a differential signal. But you need a (valid) reference value for the comparison (“high/low”) function to know what value to emit. This is crucial…

    Second, Matt says:

    I’m stating it’s simple math to detect a 0.1 degree shift in a year of temperature measurements even if they are rounded. Do you disagree? It’s easy to model in excel.

    There are important hidden assumptions here. Others have alluded to them (particularly Willis) but perhaps not clearly.

    Yes, if all measurements are done identically, and if the unmeasured portion of the signal has particular (constant waveshape/frequency/noise/etc) characteristics, and if the measurement crosses over the reference value in proportion to the signal value, and as long as we’re oversampling in time (not space), then it’s true that oversampling will buy you a lot. But are those assumptions true?

    * Measurement methods and timing have changed over time. We don’t have calibration for all the meter changes, timing changes, station moves, etc. Somewhere on CA there is a reference to a (John Christy?) paper showing the work necessary to correct for such issues in a small region; it’s painful.

    * SurfaceStations is showing that sample biases are likely being introduced into the record, at random, over time. Some of these biases may affect the entire “shape” of the signal. Others may affect only the high or low or other aspects. We don’t know. IOW, the signal itself is likely corrupted, and does not have predictable characteristics.

    * As Willis noted, there’s a likely tendency for observation bias (among human observers). Rounding high readings up and low readings down. This would corrupt oversampled analysis.

    * I believe the “reference value” in this case is the temperature rounding hi/low point. Thus, the oversampling “works” as long as measured high/low temps vary considerably more than a degree over time. Nightly lows in Singapore might not work out so well. Consider February 2007’s data. (I go to S’pore pretty often… one of those places where seemingly “the weather never changes”…)

    * A single station is sampled in time. But the networks and models and corrections are sampling in space (i.e. the calculated value for a single station is based on samples from all other “nearby” stations out to 1000mi.) 1000 rounded readings in different places are not made more accurate by such “oversampling”.

    It’s an interesting topic! And to me, it shows even more the importance of understanding what we have in our historical datasets.

  59. MarkW
    Posted Jul 30, 2007 at 7:23 AM | Permalink

    Bryn,

    How many sites do you suppose were surrounded by asphalt and air conditioner units 100 years ago?

  60. Sam Urbinto
    Posted Jul 30, 2007 at 12:42 PM | Permalink

    I would think the readings, based upon all the factors, aren’t accurate beyond whole degrees, but it would seem that over time, the combined data is pretty consistently going up. So I would say “The trend appears to be up.” But remember, the actual low and high anomalies (-0.46 C in 1909 and +0.51 C in 2005) have never exceeded 1 degree F… (which we’ll call 0.6 C at our resolution of 0.1 C)

    So if you’re correct in #5, we should have, in this dataset of 240:

    Class 1: 12
    Class 2: 36
    Class 3: 144
    Class 4: 36
    Class 5: 12

    These stations are “random” in the sense that nobody’s picking them out. They’ll have to be categorized to see if they are biased towards something…

    But I think everyone’s missing something here. From what I understand, 214 USHCN stations and 256 of the GHCN are rural, and supposedly are the only ones used to make the calculations.

    If so, then there are actually two sets of data in the entire network: those stations classified “rural” and comprising the data, and those not classified as such and not comprising the data. Then the trick becomes even more interesting and seemingly far easier: we have a control network and can change what makes up our anomalies.

    We could vary which stations are used in the calculation in several ways:

    * By class: recalculate the anomaly datapoints for each class into a new trendline graph, regardless of the number of stations in that class.
    * By class, but separately for rural and non-rural stations.
    * By random subset: grab a random 250 (or a percentage of the total, until we have them all) from the entire set and see how things look for a few random sets.
    * By geography: use just one station for each 250th of the land area, picking the station closest to the center of that area. (It would be interesting to see whether each 250th even has a roughly centered station….)

    Once these are graded, this 20% can have such things done to it, to see preliminarily what the stations look like until they’re all in; then do the same for each subsequent 20% and see how they compare to each other as they come in.

    With the reduced set we have now, whatever trend we get would be based on fewer stations and maybe not accurate, but at least we’d have some idea what the ones “we have now” show.

    I think one of the most important things somebody said is also being a little ignored: these stations were not placed at locations picked to be indicative of the surrounding area. Another thing we could do with the data is include only stations that have been determined to fairly accurately represent their area, and see how that changes the numbers.

    Because as it is, we are discussing data that probably isn’t better than 1 degree F correct anyway. But even if a reading is perfect, if the thermometer’s location is anomalous to the area, it is meaningless. So….

    Then we just have to “correct” the data for each station in the group we happen to be using, to make the trend look like it does now. 🙂

  61. Paul G M
    Posted Jul 30, 2007 at 1:23 PM | Permalink

    SURFACE STATIONS UK

    Thought CA readers might like to see and comment on this.

    http://www.anenglishmanscastle.com/archives/004389.html

    Are they acquiescing?

    Paul G M

  62. Posted Jul 30, 2007 at 1:38 PM | Permalink

    Re: 43

    The work Anthony is doing is invaluable for many reasons:

    1) We can divide stations into groups of “good” and “bad” and compare their temperature trends.

    2) We can apply some good old econometric panel data methods and look at the temperature trends with the effects of things like parking lots, air conditioners, etc. removed.

    This is orders of magnitude more important than just saying “the stations suck!, Nah na nah na!”

    Sinan

  63. Mark T.
    Posted Jul 30, 2007 at 2:04 PM | Permalink

    Here’s a good starting point for reading more: http://www.maxim-ic.com/appnotes.cfm?appnote_number=1870&CMP=WP-10

    There’s an error in that document, just below Figure 3. It should state “a factor of 64x” not 24x.

    Mark

  64. Curt
    Posted Jul 30, 2007 at 2:58 PM | Permalink

    Matt:

    Your example of the 1-bit comparator of sigma-delta ADCs is not appropriate here, as the comparator is not sampling the input signal, but rather the time integral of the difference between the input signal and the clocked output of the comparator. See figure 4 of the Maxim application note you cite. This is completely different. If the comparator were simply oversampling the input signal and averaging the output, you would get virtually no resolution enhancement, excepting perhaps when the difference between the input and the comparator threshold was within the noise band. I say this, by the way, as someone who is in the middle of a design now utilizing sigma-delta ADCs.

    IMO, Steven B nails it in #29.

  65. Bill F
    Posted Jul 30, 2007 at 3:28 PM | Permalink

    #42,

    Your flaw is in assuming that the +/- error is randomly distributed in both directions. That assumption has not been proven to be valid either for manual observation or for the automated equipment. Take the following example:

    If all of the manual thermometers were set at 1.5 meters high (a number I picked at random… not sure if it is the real standard), and most observers were males averaging 1.75 m in height, they would be looking down at the thermometer to read it unless they squatted. Let’s say half squatted to read it and got a precise measurement, while half read it standing up straight and, as a result of the angle from which they viewed the thermometer, all of their readings were 0.5 degrees lower than the actual temperature. There would be an error of -0.25 degrees in that dataset that would be present no matter how many millions of observations were included. The law of large numbers would not eliminate that kind of error. If all of those stations then converted over to automated measurement over a period of 10-15 years, with the automated equipment having a net error of zero over the large number of observations, then over that period the dataset would show a positive temperature trend of 0.25 degrees that was not real and was due to observational error.
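
    (A short sketch of this point with the made-up numbers above; the 0.5-degree parallax offset and the 50/50 split of observers are hypothetical:)

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1_000_000
        true_temps = rng.normal(20.0, 3.0, n)          # "true" temperatures, arbitrary mean and spread

        # Half the observers squat (no error); half read 0.5 degrees low because of the viewing angle.
        parallax = rng.choice([0.0, 0.5], size=n)
        observed = true_temps - parallax

        print(true_temps.mean())   # ~20.00
        print(observed.mean())     # ~19.75 -- the -0.25 offset never averages away, so replacing
                                   # the observers with unbiased instruments later would show up
                                   # as a spurious +0.25 degree step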

  66. Sam Urbinto
    Posted Jul 30, 2007 at 3:58 PM | Permalink

    That’s known as parallax if I remember correctly, and is why most meters that use needle deflection have a mirror behind the gauge, to ensure you’re looking at it straight on….

  67. Bill F
    Posted Jul 30, 2007 at 4:02 PM | Permalink

    Just as an FYI, I wasn’t suggesting that the scenario in #65 is realistic (it might be…who knows). What I was trying to do is point out that the hypothesis that measurement error gets averaged out as the number of measurements increases may incorrectly assume that the errors are symmetrical around the actual value.

  68. Sam Urbinto
    Posted Jul 30, 2007 at 4:07 PM | Permalink

    What you brought up comes under the blanket of “various measurement errors” on the part of an observer. I’m sure it’s happened, just no way to know how much, probably. But yeah, I got your point….

  69. Neil Fisher
    Posted Jul 30, 2007 at 9:41 PM | Permalink

    Matt said “Well, the real reason it’s done is because one bit converters (both ADC and DAC) can be built directly in CMOS. And when they are, they are less costly than analog converters.”

    As I said, it’s a *cost* exercise, NOT a resolution improvement exercise. You’re not seriously suggesting that we can get 24 bit resolution from 16 bit samples, are you?

    “It works the other way too: you can take a one bit ADC and with enough sampling you can get an arbitrarily accurate representation of the input voltage. So, I’d dispute that this is a “furphy””

    To continue the analogy, if there’s 2 bits worth of noise on your 8-bit samples, it’s possible to “average” it out with oversampling, but it’s *not* possible to do so if you use multiple converters to achieve the oversampling, is it? Well, *maybe*, but it’s hardly as straightforward as you seem to imply.

  70. MrPete
    Posted Jul 31, 2007 at 5:41 AM | Permalink

    #69 — as was explained in #58, yes you can get 24 bit resolution from even 1-bit samples, if done fast enough and properly accumulated.

    and yes, as explained in #58 and #64, you’re correct: it’s hardly as straightforward as matt implied.

    In essence, through oversampling it’s possible to pull a signal out of an amazingly high noise floor. That’s how GPS phones get a reading in parking garages when “normal” GPS won’t even work under a tree. But that’s another topic for another blog.

    Returning to the subject at hand, I think there’s general agreement here that Anthony’s growing dataset is already of great value in what has been demonstrated; that a truly random subset of the whole would be required to get a valid “sample” analysis; and that we understand and appreciate Anthony’s goal to recover a complete census, not just a valid sample.

  71. matt
    Posted Jul 31, 2007 at 8:49 AM | Permalink

    To continue the analogy, if there’s 2 bits worth of noise on your 8-bit samples, it’s possible to “average” it out with oversampling, but it’s *not* possible to do so if you use multiple converters to achieve the oversampling, is it? Well, *maybe*, but it’s hardly as straightforward as you seem to imply.

    I think there are a host of concepts being mixed together. Oversampling doesn’t create more signal; oversampling reduces quantization noise. So your improvement comes not from boosting the S in signal-to-noise ratio, it comes from reducing the N in SNR.

    Not sure how much more straightforward we can get than an experiment in Excel:

    Assume August has a daily mean high temp of 90 degrees, normally distributed with a std dev of 3. Generate a high temp for each of the 31 days, and do 3 trials:

    Average for the month:
    Trial 1: 90.189 degrees
    Trial 2: 90.256 degrees
    Trial 3: 89.039 degrees

    Now take each daily temp, and round to nearest degree, and re-average for the month:
    Trial 1: 90.194 degrees
    Trial 2: 90.258 degrees
    Trial 3: 89.065 degrees

    Trial 1 Error: 0.005%
    Trial 2 Error: 0.002%
    Trial 3 Error: 0.029%

    Over a single month, this shows that rounding your temp to the nearest degree introduces very small errors, and still permits you to see very small shifts in mean temps. In other words, each measurement was quantized, but with enough measurements we can still detect very small shifts to the mean for the month.
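
    The same experiment is easy to reproduce outside Excel. A sketch in Python, assuming (as above) independent, normally distributed daily highs; the seed is arbitrary, so the exact numbers will differ from the trials quoted:

        import numpy as np

        rng = np.random.default_rng(42)
        for trial in (1, 2, 3):
            highs = rng.normal(90.0, 3.0, 31)       # 31 daily highs: mean 90, std dev 3
            exact = highs.mean()
            rounded = np.round(highs).mean()        # round each day to the nearest degree first
            print(f"Trial {trial}: exact {exact:.3f}, rounded {rounded:.3f}, "
                  f"error {abs(rounded - exact) / exact:.3%}")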

  72. matt
    Posted Jul 31, 2007 at 8:57 AM | Permalink

    Your example of the 1-bit comparator of sigma-delta ADCs is not appropriate here, as the comparator is not sampling the input signal, but rather the time integral of the difference between the input signal and the clocked output of the comparator. See figure 4 of the Maxim application note you cite. This is completely different. If the comparator were simply oversampling the input signal and averaging the output, you would get virtually no resolution enhancement, excepting perhaps when the difference between the input and the comparator threshold was within the noise band. I say this, by the way, as someone who is in the middle of a design now utilizing sigma-delta ADCs.

    It’s a climate blog, not eet.com 🙂 I left a citation to a paper outlining how one-bit converters work so those interested can find out more AND to highlight that a single-bit converter can indeed deliver accuracy far beyond intuition. Note I handwaved the rest of the stuff as “some wicked fast digital logic.”

    But at the end of the day, the decision element in a one-bit ADC is a simple comparator. See output X4 in the referenced doc.

  73. Neil Fisher
    Posted Jul 31, 2007 at 5:28 PM | Permalink

    Mr Pete says:
    “In essence, through oversampling it’s possible to pull a signal out of an amazingly high noise floor. That’s how GPS phones get a reading in parking garages when “normal” GPS won’t even work under a tree.”

    You can only do that if you know the characteristics of the signal in the first place – you can’t pull an “unknown” signal out of noise in the same way. If you only have a limited number of possible states, then oversampling and some clever heuristics help you to reconstruct the *likely* signal, and the coding used in the signal verifies that you “guessed” right – such techniques allow you to pull a signal from negative SNRs.
    Unless I missed something, there is no a priori knowledge of what the climate “signal” should look like, and there is no error detection/correction coding in the climate “signal”. How, then, can you pull it out of the noise? The only thing I can think of is to characterise the noise and try to subtract it. To get mildly back OT, I think it would be fair to say that Anthony is attempting to characterise the noise.

  74. TCO
    Posted Jul 31, 2007 at 5:51 PM | Permalink

    It’s also a milestone of way too much publication of intermediate results, of people taking this to the bank too early, and of a shoddy practice which does not push for mathematical insight, but instead looks for “shocking pictures”. Zorita (a solid state chemist) is 10 times the man that you lot are in terms of ethics and intellectual curiosity. I’m still waiting for Steve’s mathematical definition of a bad apple and for his proof that matrices and regressions can promote BAD apples as opposed to “apples”.

  75. BarryW
    Posted Jul 31, 2007 at 6:15 PM | Permalink

    Re #74

    Oh, you mean like Al Gore’s melting glaciers, drowning polar bears, and the like? Shouldn’t the fact that it’s been so easy to find those “shocking pictures”, together with the lack of rigorous, documented methodologies used for processing the temperature data, give you pause to wonder at the quality of the temperature record, instead of attacking those who’ve identified the lack of quality control in the climate community’s activities?

  76. D. Patterson
    Posted Jul 31, 2007 at 7:28 PM | Permalink

    Well, it certainly doesn’t justify your dishonesty resulting from your use of strawman arguments and refusals to supply honest answers to honest questions as you indulge in increasingly strident and insulting comments. Such behavior on your part creates a strong impression that you are deliberately attempting to force someone to bar you from this forum. Perhaps you would care to dispel or mitigate such an impression by giving an honest and straightforward answer to the question of why you do not ask Cicerone to require compliance with the PNAS data archive policy. If you were to succeed, it would make the complaint against Thompson and Cicerone, et al., a moot issue. On the other hand, perhaps you will not do so because you want to assist Thompson in any efforts to deny disclosure of his data to public scrutiny. In financial auditing, any refusal to disclose information is a presumptive cause for suspicion of potential error.

  77. Anthony Watts
    Posted Jul 31, 2007 at 11:55 PM | Permalink

    RE75, I’d add that other expectations of science are being demonstrated by the surfacestations.org project: verification, data sharing, and replication/repeatability.

    The stations that have good, mediocre, or bad siting can be easily verified by any third party by simply making a visit and taking similar frames for comparison using the metadata published with each station survey. As for data sharing, everything we have is put online for all to see from the start.

    Replication/repeatability is being demonstrated daily. The premise of the project is that a number of USHCN sites have data contamination from a variety of sources. Daily, new photos arrive from all corners of the USA showing that it’s not just a handful, nor confined to a region, but that the problems are widespread throughout the network. An analysis of the photos will tell us whether the problem is limited to a small percentage or widely systemic.

  78. Posted Aug 1, 2007 at 6:18 AM | Permalink

    Re: 71

    matt, you have just applied the Central Limit Theorem. In doing so, you assumed that each day’s high temperature is independent of every other day’s high temperature.

    When the observations may be correlated across time and/or space, this naive application of the CLT is not valid.

    Sinan
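
    A rough sketch of this point, using an AR(1) daily-temperature model chosen purely for illustration (the 0.7 day-to-day autocorrelation is an assumption, not an estimate): the monthly mean of correlated days wanders far more than the naive sigma/sqrt(n) would suggest, so there are fewer "effective" samples available to average the rounding away.

        import numpy as np

        rng = np.random.default_rng(1)
        n, trials, phi = 31, 5000, 0.7               # 31 days, 0.7 lag-1 autocorrelation (assumed)
        sigma = 3.0                                  # day-to-day std dev, as in the example above
        means_iid, means_ar = [], []
        for _ in range(trials):
            iid = rng.normal(0.0, sigma, n)
            ar = np.empty(n)
            ar[0] = rng.normal(0.0, sigma)
            for t in range(1, n):
                ar[t] = phi * ar[t - 1] + rng.normal(0.0, sigma * np.sqrt(1 - phi**2))
            means_iid.append(iid.mean())
            means_ar.append(ar.mean())

        print(np.std(means_iid))   # close to 3/sqrt(31), about 0.54
        print(np.std(means_ar))    # noticeably larger: correlation inflates the variance of the mean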

  79. MarkW
    Posted Aug 1, 2007 at 6:22 AM | Permalink

    The fact that a single “shocking picture” exists proves that the climate “scientists” were lying when they claimed that the sensor network had verified as producing good data. So much for your claims of ethics.

  80. steven mosher
    Posted Aug 1, 2007 at 7:44 AM | Permalink

    TCO,

    There are several points to Anthony’s study. I’ll point to an obvious one

    There is a claim that the network is high quality. Every researcher who currently uses the data
    relies on this. Yet the claim has not been verified. I will give you one specific example.
    In Parker’s paper on UHI, he claimed

    Furthermore, Peterson (2003) found no statistically
    significant impact of urbanization in an analysis
    of 289 stations in 40 clusters in the contiguous United
    States, after the influences of elevation, latitude, time of
    observation, and instrumentation had been accounted
    for. One possible reason for this finding was that many
    “urban” observations are likely to be made in cool
    parks, to conform to standards for siting of stations.

    Now, this is a testable hypothesis. Parker supposes that urban observations
    are made in cool parks that meet siting standards.

    Did he check? no. Can we check? yes. How? visit the sites and take pictures.

    What is the point? to provide a resource to investigators so they don’t say things
    like Parker did. Pretty basic.

    Does Tucson conform to siting standards? No.

    Now, if we cannot agree that Anthony’s project disconfirms Parker’s supposition here,
    then we really can’t have a rational discussion.

    As for the analysis approach, I will rely on one suggested by Gavin S.
    We rank the sites using CRN ranking guidelines. Select a grid. Select the sites in the grid
    that meet the standard and calculate a century trend for the grid.

    Basically, he believes that the grids are oversampled in the US and we will see the same
    warming trend if we remove the bad sites. A bad site is simply determined by using CRN
    guidelines and photo evidence.

    Pretty straightforward.
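
    As a sketch of what that calculation might look like (the file name, column names, and the 5-degree grid are all assumptions for illustration, not a description of anyone’s actual code):

        import numpy as np
        import pandas as pd

        # Hypothetical input: one row per station-year, with a CRN siting class (1-5),
        # a latitude/longitude, and an annual temperature anomaly.
        stations = pd.read_csv("ushcn_rated_stations.csv")

        # Assign each station to a 5-degree grid cell.
        stations["grid"] = list(zip((stations.lat // 5) * 5, (stations.lon // 5) * 5))

        def century_trend(df):
            # Least-squares slope of anomaly vs. year, expressed in degrees per century.
            return np.polyfit(df.year, df.anomaly, 1)[0] * 100

        good = stations[stations.crn_class <= 2]                  # well-sited stations only
        print(good.groupby("grid").apply(century_trend))          # trend per cell, good sites
        print(stations.groupby("grid").apply(century_trend))      # trend per cell, all sites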

    there is more, but if we can’t agree on the Parker example, then we cannot have a discussion.

  81. fFreddy
    Posted Aug 1, 2007 at 8:16 AM | Permalink

    Re #80, steven mosher

    Furthermore, Peterson (2003) found no statistically
    significant impact of urbanization in an analysis
    of 289 stations in 40 clusters in the contiguous United
    States, after the influences of …

    Does Peterson name his 289 stations ?
    Has Anthony’s project covered any of them yet ?

  82. BarryW
    Posted Aug 1, 2007 at 8:45 AM | Permalink

    I take the title of this site at face value, it is an audit of climate research and data collection. Audits don’t necessarily look at every piece of data, they check a representative sample to determine if proper procedures and quality controls are being employed. It’s not up to the auditors to fix the problems, no more than it is for a reviewer of a paper to fix the problems, only identify them. I’ve been on both ends of an audit (in the software world) and being on the receiving end is not always pleasant. If I were confronted by audit results such as Anthony’s site surveys thus far, my auditors would have gone away telling me they’d come back when I had actually done what I was supposed to do or show cause why my contract shouldn’t be canceled for non performance.

  83. Steve McIntyre
    Posted Aug 1, 2007 at 9:00 AM | Permalink

    There are a couple of quotes from Vose (Karl) et al 2005, the official reply to the criticisms by Davey and Pielke of sites in eastern Colorado:

    First they stated:

    the USHCN database could definitely benefit from improved site exposure documentation. Under ideal conditions, this new documentation would meet the high standards set forth by Davey and Pielke (2005).

    So when you see comments by people criticizing the efforts to document stations, keep in mind that the new documentation at surfacestations.org is trying to meet or exceed the “high standards” of Davey and Pielke 2005 – standards endorsed by USHCN itself in an “ideal” world.

    They also threw a type of gauntlet at Davey and Pielke saying that they had not demonstrated poor sites anywhere other than eastern Colorado:

    Furthermore, their analysis was a static assessment of site exposures over a relatively small part of the country, an area within which station exposures varied considerably. In other words, their results do not show that a large number of USHCN stations have a comparable exposure problem,

    Ordinary businesses, confronted with the defects in eastern Colorado identified in 2005, would have done their own survey to ensure that other sites didn’t have such exposure problems. NOAA negligently failed to do so. Anthony and others are showing what Vose et al 2005 challenged: a large number of stations do have comparable exposure problems; in fact, many are worse.

  84. Douglas Hoyt
    Posted Aug 1, 2007 at 9:37 AM | Permalink

    Karl said

    “their results do not show that a large number of USHCN stations have a comparable exposure problem”

    and at the same time Karl had photographs of all the stations in his possession. Pielke is now trying to obtain copies of these photos.

  85. MarkW
    Posted Aug 1, 2007 at 9:53 AM | Permalink

    In every audit that I have been involved with, the audit only continues until one problem site is found.
    At that point the audit concludes with a result of fails audit.

    It is then up to the person being audited to prove that the failed site had been fixed and that all other sites had been examined and that they do not suffer from the same problems.

  86. steven mosher
    Posted Aug 1, 2007 at 9:58 AM | Permalink

    RE 83.

    That was my next argument.

    I think Anthony’s project addresses THREE issues.

    1. Prior claims of proper siting. Previous studies have relied on or claimed that the sites meet standards.
    Anthony’s study puts these claims INTO QUESTION. They may survive the question, but the onus is on
    the claimants to prove it. I view this as an extension of the Pielke-Davey paper and a challenge to the
    criticisms of it. Like you said, P&D were criticized on the basis of limited sampling. Watts et al.
    is a rational response to that criticism of P&D.

    2. Potential bias in the land record. This is obviously the issue that causes all the heat and publicity.
    One issue that must annoy the AGW folks is the drip, drip, drip of site corruption. The longer
    the project takes, the more drip, drip, drip. A smart opponent would get the photo job done as quickly
    as possible, and not engage in Nixonian stonewalling. A smart opponent would take charge of the
    re-analysis and get on the offensive, and not send rabbetts to engage in a proxy war.

    3. Provide quality sites that the CRN can be coordinated with. Currently the CRN plan is to link the
    CRN sites to the nearest neighbor in the historical network. They should consider the nearest
    quality site.

  87. MarkW
    Posted Aug 1, 2007 at 10:01 AM | Permalink

    Furthermore, when a high fraction of the audited sites are discovered to have problems, the claim that “we will just assume that none of the unaudited sites have any problems” would be laughed out of court.

  88. steven mosher
    Posted Aug 1, 2007 at 10:26 AM | Permalink

    RE 81.

    I have been unable to find the list.

    What we do know is that with 231 stations surveyed, I have yet to see an urban site located
    in a cool park. Maybe the next 990 will be situated as Parker described.

  89. Sam Urbinto
    Posted Aug 1, 2007 at 1:29 PM | Permalink

    #81, #88
    “Does Peterson name his 289 stations ?”

    I haven’t read it, but I doubt they were named; it doesn’t seem anyone leaves much of an audit trail. It seems to me to be akin to other non-replicable and vague (in some or all ways) studies and surveys:

    …by analyzing 928 abstracts, published in refereed scientific journals between 1993 and 2003, and listed in the ISI database with the keywords “climate change” (9).

    9. The first year for which the database consistently published abstracts was 1993. Some abstracts were deleted from our analysis because, although the authors had put “climate change” in their key words, the paper was not about climate change.

    We know what happened to the person that tried to replicate it, given the errors and omissions in that paragraph and vagueness of the note.

  90. Steve McIntyre
    Posted Aug 1, 2007 at 1:47 PM | Permalink

    I’ve emailed Peterson and asked him for a list of the 289 sites. Given that it’s NOAA, he’ll probably send the information. Russell Vose was quite pleasant.

  91. matt
    Posted Aug 2, 2007 at 12:07 AM | Permalink

    matt, you have just applied the Central Limit Theorem. In doing so, you assumed that each day’s high temperature is independent of every other day’s high temperature.

    When the observations may be correlated across time and/space, this naive application of the CLT is not valid.

    Disagree. While I agree that there is day-to-day correlation between temps, we just need a strong guarantee that there will be at least 0.25 degrees (half the rounding step) of change between the days. Browsing temps at Weather Underground shows that we have that in spades. Once you can be very certain that you’ll have at least 0.25 degrees of variability between the highs of two days, then I’d assert you can treat each day as random.

    Of course, you could crank something out in Excel in 5 minutes to prove me wrong.
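
    In that spirit, a quick sketch (Python rather than Excel, with invented numbers) of both regimes: plenty of day-to-day variability versus almost none:

        import numpy as np

        rng = np.random.default_rng(7)

        # Regime 1: daily highs vary by several degrees (std dev 3): rounding averages out.
        varied = rng.normal(90.0, 3.0, 365)
        print(varied.mean(), np.round(varied).mean())

        # Regime 2: a nearly constant series (std dev 0.05, as with some tropical nightly lows):
        # every reading rounds to the same integer, and the 0.1-degree detail is simply gone.
        flat = rng.normal(75.2, 0.05, 365)
        print(flat.mean(), np.round(flat).mean())   # the rounded mean is stuck at 75.0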