Standardization in Mannian algorithms is always a bit of an adventure. The bias towards bristlecones and HS-shaped series introduced by Mann's short-segment standardization of his tree ring PCs has been widely publicized. Smerdon's demonstrations of defects in Rutherford et al 2005, Mann et al 2005 and Mann et al 2007 all relate to errors introduced by inconsistencies between the standardization methods and the assumptions of the algorithm. Jean S has pointed out further defects that have not yet been incorporated in the PRL.
Jeff Id and I have been exchanging digital files. Jeff has provided some intermediates from his Matlab run and I've made some progress in emulating them in R. I think that I've got a stable emulation of the Truncated TLS operation (Tapio Schneider's pttls function) and a pretty close emulation of the Xmis object in the first step of the RegEM operation, and I'm getting close to the first-step X object.
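For readers trying to replicate this, here is the textbook SVD-based truncated TLS solution that my emulation is built around. This is my own minimal R sketch, not a line-for-line port of Schneider's pttls.m (which, as I understand it, works from a pre-computed SVD and handles a range of truncation parameters):

# Minimal sketch of the truncated TLS (TTLS) coefficient matrix via the
# SVD of the augmented matrix [A B]; the standard Van Huffel formula.
# My own R code, not Schneider's pttls.m.
# Assumes nrow(A) > ncol(A) + ncol(B) and 1 <= r < ncol(A) + ncol(B).
library(MASS) # for ginv (Moore-Penrose pseudoinverse)
ttls <- function(A, B, r) {
n <- ncol(A); d <- ncol(B)
V <- svd(cbind(A, B))$v # right singular vectors of [A B]
V12 <- V[1:n, (r + 1):(n + d), drop = FALSE] # top block, discarded directions
V22 <- V[(n + 1):(n + d), (r + 1):(n + d), drop = FALSE] # bottom block
-V12 %*% ginv(V22) # n x d matrix of TTLS regression coefficients
}

In RegEM, as I understand it, this regression is computed from the eigendecomposition of the current covariance estimate rather than from the raw data matrix, but the truncation formula is the same.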
Along the way, I noticed something that may prove of interest, though I caution that for now this is just a diarized note for follow-up.
One of the fundamental assumptions of Tapio Schneider's RegEM is that data is missing at random. This is clearly untrue for Steig's data. That sort of failure doesn't necessarily mean the algorithm is worthless, but careful authors would surely have discussed the issue and its potential impact.
One key non-randomness is that some series are missing a LOT more data than others. It looks like this could have a material effect on weights through an interesting step in the algorithm.
Schneider’s algorithm first centers the data (Matlab code below):
[X, M] = center(X); % center data to mean zero
Then missing data is set to zero:
X(indmis) = zeros(nmis, 1); % fill missing entries with zeros
A little further on, the series are divided by their standard deviations:
% scale variables to unit variance
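Putting those steps together, here is how I've rendered the sequence in my R emulation so far. A condensed sketch of my reading of regem.m, not Schneider's Matlab verbatim:

# Condensed R rendering of the sequence quoted above (my sketch, not
# Schneider's code): center to mean zero, fill missing entries with
# zeros, then scale each series by its standard deviation.
regem_first_scale <- function(X) { # X: time x station matrix with NAs
M <- colMeans(X, na.rm = TRUE) # column means over available data only
X <- sweep(X, 2, M) # center each series to mean zero
X[is.na(X)] <- 0 # fill missing entries with zeros
D <- apply(X, 2, sd) # SDs computed over the zero-filled data
sweep(X, 2, D, "/") # scale variables to unit variance
}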
Do you see what happens? If a series has a LOT of missing data, the missing data is all set to 0 for the purposes of calculating the standard deviation. Could this introduce a bias? Here's a plot of the standard deviations after first-step infilling. It's interesting that the infilled standard deviations of inland stations are so much lower than those of coastal stations, though there's no reason to believe that this is the case physically. It's only because the inland stations are missing more data, which is thus set to zero and shrinks the standard deviation. This might reduce the weighting of the inland stations relative to the Peninsula stations if it persists through later steps. I don't see any code right now that would compensate for this (but I haven't finished yet). Nor is there any reason why there would be such code: Schneider assumes that data is missing at random, while Steig and Mann ignore this caveat. I'll keep you posted as I continue to work through my emulation of Schneider's RegEM.
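In the meantime, to put a number on the shrinkage, here's a toy calculation on simulated data (not Steig's station data):

# Toy illustration (simulated series, not Steig's data): zero-infilling a
# centered series shrinks its SD by about sqrt(1 - fraction missing).
set.seed(1)
x <- rnorm(600) # a centered series with true SD of about 1
frac <- 0.7 # an inland-style station missing 70% of its record
x_inf <- replace(x, sample(600, 600 * frac), 0) # infill "missing" values with zero
sd(x) # about 1.00
sd(x_inf) # about sqrt(1 - 0.7) = 0.55

A station missing 70% of its record therefore enters the scaling step with barely half the standard deviation it ought to have.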
UPDATE Feb 19:
Tapio Schneider says above:
The initial variance estimate is biased because missing values are set to the mean (zero after centering), but the estimate is iteratively improved. (One can show that the estimate in the ordinary EM algorithm converges monotonically to the maximum likelihood estimate for normal data; in the regularized EM algorithm, this monotonic convergence is assured for a fixed regularization parameter but not necessarily in general, when regularization parameters are chosen adaptively.)
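Fair enough as far as it goes. Here's a toy version of the mechanism Tapio describes: ordinary EM for a bivariate normal with most of one variable missing. My own sketch, nothing to do with regem.m itself. Started from the zero-infilled variance, the estimate does claw its way back to the truth in this idealized missing-at-random setting:

# Toy EM illustration (my own sketch, not regem.m): bivariate normal with
# 70% of y missing. The variance estimate starts biased low (zero infill)
# but is iteratively corrected, as Schneider describes for the ideal case.
set.seed(2)
n <- 1000
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.6) # true var(y) is about 1
miss <- sample(n, 0.7 * n) # 70% of y "missing"
mu_x <- mean(x); Sxx <- mean((x - mu_x)^2) # x fully observed
y0 <- replace(y, miss, 0) # zero-infilled start
mu_y <- mean(y0)
Syy <- mean((y0 - mu_y)^2) # biased low, roughly 0.3 of the truth
Sxy <- mean((x - mu_x) * (y0 - mu_y))
start_sd <- sqrt(Syy)
for (i in 1:50) {
b <- Sxy / Sxx # regression of y on x under current estimates
resv <- Syy - Sxy^2 / Sxx # conditional variance of y given x
Ey <- y; Ey[miss] <- mu_y + b * (x[miss] - mu_x) # E-step: E[y | x]
Ey2 <- y^2; Ey2[miss] <- Ey[miss]^2 + resv # E-step: E[y^2 | x]
mu_y <- mean(Ey) # M-step: update the moments
Syy <- mean(Ey2) - mu_y^2
Sxy <- mean(x * Ey) - mu_x * mu_y
}
c(truth = sd(y), start = start_sd, final = sqrt(Syy)) # final is close to truth

Whether anything like this idealized recovery actually happens with Steig's decidedly non-random missingness pattern is another matter.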
The following diagram of PCs is from the AVHRR data. Note the structure of PC3. PCs recover patterns, and the pattern being recovered in PC3 is surely the infilling of data with zeros.
So notwithstanding Tapio's comment, it sure looks like the zero-infilling pattern is preserved in the final result. As a caveat: while Steig has said loudly that he used the unadorned Schneider algorithm, this contradicts a statement in the article that they used the algorithm as "adapted" by Mann and Rutherford. Previous adaptations by these authors have been criticized by Smerdon in three publications, and it is possible that the phenomenon observed here arises from the adaptations rather than from the Schneider algorithm itself. Without knowing what Steig actually did (and he's incommunicado), there's still a lot of speculation here.