RegEM has reared its ugly head again in Mann’s review of Burger and Cubasch.
An EM algorithm was one of the very first things that I tried when I started doing this a couple of years ago. When I tried to replicate MBH98, I got stuck on the temperature principal components even before the tree ring principal components. Both networks have problems with missing data. Mann had said that he had selected gridcells with continuous data records to enable PCA, but this simply wasn’t the case. Four gridcells in the HadCRU2 data set had no data whatever. (In passing, in light of recent Greenland temperature discussions, these 4 gridcells are all in Greenland and the difference in data versions bears examination. They were “nearly continuous” in the earlier HadCRU version used in MBH98 and had no data in HadCRU2 – what happened?)
I sent an email to Rutherford and then Mann asking for clarification, without any success. I had seen an iterative procedure for doing PCA on a data set with missing values in an image-processing context, and I applied it to the MBH data set. The EM operation seemed to converge. This was the first mathematical thing that I’d done in almost 35 years; I was still learning my way around R and was quite proud of myself that it actually seemed to work. The answers were somewhat different from the MBH PCs. I didn’t pursue it because of all the convoluted issues on the proxy side.
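For anyone curious, the iteration is simple enough to sketch in a few lines of R. This is a schematic of the general approach rather than the code I actually used; the rank k, the tolerance and the initialization by column means are all illustrative choices:

    # EM-style iterative PCA infill of a matrix with missing values
    # X: years x gridcells matrix containing NAs; k: rank used for the infill
    em_pca_infill <- function(X, k = 5, tol = 1e-6, maxit = 500) {
      miss <- is.na(X)
      Xfill <- X
      # start by filling missing cells with column means
      # (gridcells with no data at all, like the 4 Greenland cells, need special handling)
      for (j in seq_len(ncol(X))) Xfill[miss[, j], j] <- mean(X[, j], na.rm = TRUE)
      for (it in seq_len(maxit)) {
        mu <- colMeans(Xfill)
        Xc <- sweep(Xfill, 2, mu)
        sv <- svd(Xc, nu = k, nv = k)
        # rank-k reconstruction of the matrix from the leading singular vectors
        Xhat <- sweep(sv$u %*% diag(sv$d[1:k], k) %*% t(sv$v), 2, mu, "+")
        delta <- max(abs(Xhat[miss] - Xfill[miss]))
        Xfill[miss] <- Xhat[miss]    # re-impute only the missing cells
        if (delta < tol) break
      }
      list(filled = Xfill, pca = prcomp(Xfill, center = TRUE))
    }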
Later, in the Corrigendum SI, Mann archived the original temperature data set, an interesting record of an earlier-generation (HadCRU1?) version of the data (and the only such record that I’m aware of). He also said that he had interpolated missing data. From the earlier data set and using interpolation, I was able to replicate his temperature PCs fairly accurately. So my EM exercise seemed to have been wasted, as Mann had not used anything so complicated to fill in the missing values and derive temperature PCs.
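The interpolation route is much simpler – something along the following lines captures the idea (a sketch only; I’m not asserting that this is exactly how the MBH interpolation was done):

    # interpolate gaps in each gridcell series, then take principal components
    interp_pca <- function(X) {
      yrs <- seq_len(nrow(X))
      Xi <- apply(X, 2, function(x) approx(yrs, x, xout = yrs, rule = 2)$y)
      prcomp(Xi, center = TRUE)
    }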
However, it seems that the exercise was not totally wasted as RegEM appears to be a variant of this process using both proxy and gridcell data.
Let’s step back and consider the multivariate problem of relating a large data set X of (say) m=1082 temperature gridcells available over n=79 years in a calibration period to a proxy population Y of p=22 proxies available over n+N=581 years, with the final objective of estimating an average NH temperature. Cook et al 1994 (the et al including Jones and Briffa) consider, as a base case, the OLS regression of each temperature gridcell on the proxies (note that I’ve got the X and Y variables in reverse of the usual equation because, after all, tree rings do not cause temperature):
(1) \; X = YB + E

where X is the n \times m (79 \times 1082) matrix of gridcell temperatures, Y the n \times p (79 \times 22) matrix of proxies and B the p \times m matrix of regression coefficients.
In our calibration period above, with 22*79=1738 proxy measurements, this would result in the calculation of 22*1082=23,804 coefficients. While Cook et al did not doubt the miracle under which each measurement could generate nearly 14 parameters (loaves and fishes, so to speak), even though they were climate scientists, they were able to see some risk of overfitting in such circumstances. And after the 23,804 coefficients have been calculated, the NH average that ultimately comes out contains only 79 values in the calibration period – so, intuitively, it seems like a more parsimonious model should be available.
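To make the bookkeeping concrete, here is the count in R, using stand-in random matrices with the dimensions above (only the dimensions matter here; the data are simulated):

    # parameter count for equation (1): one OLS fit of each gridcell on the 22 proxies
    n <- 79; m <- 1082; p <- 22
    Y <- matrix(rnorm(n * p), n, p)            # stand-in proxy matrix
    X <- matrix(rnorm(n * m), n, m)            # stand-in temperature gridcells
    B <- solve(crossprod(Y), crossprod(Y, X))  # p x m matrix of OLS coefficients
    length(B)                                  # 23,804 coefficients
    n * p                                      # from only 1738 proxy measurements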
Cook et al 1994 then discuss using principal components to reduce the populations of both X and Y. This strategy is followed in MBH in only a piecemeal fashion. Temperature PCs are calculated with 11 of the first 16 retained. On the proxy side, Mann used PC methods to reduce some tree ring populations, but the majority of series in his Y matrix are used in their raw form. In the AD1400 network, 19 of 22 proxy series are original and only 3 are PC series. For now, let’s treat the Y matrix as garden variety proxies with the temperature gridcell matrix X being represented as principal components as follows:
(2) \; X \approx U \Lambda V^{T}

i.e. a truncated SVD of the temperature field, where the columns of U are the retained temperature PCs (time series) and the columns of V the corresponding spatial EOF patterns.
Then under a Cook et al methodology, they would calculate a multivariate inverse regression (OLS) of temperature PCs on proxies as in equation (3).
(3) \; U = YB + E, \qquad \hat{B}_{OLS} = (Y^{T}Y)^{-1}Y^{T}U \qquad \text{(OLS)}
As regards the number of generated coefficients, in the 22-proxy, 1-PC AD1400 network, Mann would thus calculate only 22 coefficients (instead of 23,804) from the 22*79=1738 measurements (mutatis mutandis for the 11 PCs calculated with 112 proxies in the AD1820 network).
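Continuing with the stand-in matrices above, the more parsimonious calculation of equation (3) looks like this (k=1 as in the AD1400 step):

    # equation (3): OLS of the retained temperature PC(s) on the proxies
    k <- 1                                         # AD1400 step: one temperature PC
    U <- prcomp(X, center = TRUE)$x[, 1:k, drop = FALSE]
    B_ols <- solve(crossprod(Y), crossprod(Y, U))  # p x k coefficient matrix
    length(B_ols)                                  # 22 coefficients instead of 23,804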
Now MBH98 did not simply do a multiple linear regression of PCs on proxies; as I’ve argued elsewhere, their “novel” method was, in effect, PLS (partial least squares) regression, a known technique in chemometrics (although they did not realize that this was what they were doing – one of many things that I need to write up formally). The differences between OLS and PLS are not as great as you’d think, in the early networks anyway, as the proxy networks are close to being orthogonal (some signal??). Thus
(4) \; \hat{B}_{PLS} \propto Y^{T}U \qquad \text{(PLS)}
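The near-equivalence under orthogonality is easy to check: if Y^{T}Y is close to the identity after standardization, the OLS coefficients (Y^{T}Y)^{-1}Y^{T}U are nearly proportional to the covariance weights Y^{T}U of a one-factor PLS step. A toy check with the stand-in matrices above:

    # OLS versus one-factor PLS weights when the proxy network is close to orthogonal
    Ys <- scale(Y) / sqrt(n - 1)           # standardized proxies: crossprod(Ys) is the correlation matrix
    u1 <- prcomp(X, center = TRUE)$x[, 1]  # first temperature PC
    b_ols <- solve(crossprod(Ys), crossprod(Ys, u1))
    b_pls <- crossprod(Ys, u1)             # proportional to the PLS weights
    cor(b_ols, b_pls)                      # close to 1 for a near-orthogonal network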
Burger and Cubasch plausibly characterize the distinctions between methods like OLS and PLS (the latter, unfortunately, inaccurately characterized by them – an inaccuracy that detracts from, but does not invalidate, their results).
Now my take on what’s going on with RegEM as described here is that it contains many of the problems of the Cook et al situation, in which an implausible number of different coefficients was calculated on the back of a rather limited population of actual measurements – except that, instead of OLS regression, ridge regression (RR) is used. Thus, instead of (1), we have:
(5) \; \hat{b}_{j} = (Y^{T}Y + h_{j}^{2}I)^{-1}Y^{T}x_{j} \qquad \text{(RR)}

i.e. a ridge regression of each gridcell column x_{j} of X on the proxies, each with its own ridge parameter h_{j}.
I don’t guarantee that I’ve diagnosed this correctly, but it seems that in the AD1400 network a grand total of only 1738 (79*22) measurements are used to yield 1082*22=23,804 regression coefficients, plus 1082 ridge parameters as well. I’m not 100% certain that this is what they’ve done; their use of idiosyncratic methods always makes it hard to tell exactly what they’ve done. In this case, code has been provided, so it should be possible to decode what’s been done more expeditiously than with MBH98, although the method itself is much more complicated. [Note: Code was subsequently provided]
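For concreteness, here is roughly what that setup amounts to on my reading: a separate ridge regression, with its own ridge parameter, for every gridcell column. I emphasize that this is a sketch of my interpretation, not the RegEM code, and the leave-one-out grid search below is just a placeholder for whatever selection rule RegEM actually uses for the ridge parameter:

    # equation (5): one ridge fit per gridcell, each with its own ridge parameter h
    ridge_fit <- function(Y, x, h) {
      solve(crossprod(Y) + h^2 * diag(ncol(Y)), crossprod(Y, x))
    }
    # placeholder choice of h: leave-one-out prediction error over a coarse grid
    h_grid <- 10^seq(-2, 2, length.out = 25)
    fit_gridcell <- function(Y, x) {
      press <- sapply(h_grid, function(h) {
        e <- sapply(seq_along(x), function(i) {
          b <- ridge_fit(Y[-i, , drop = FALSE], x[-i], h)
          x[i] - drop(Y[i, , drop = FALSE] %*% b)
        })
        sum(e^2)
      })
      h_best <- h_grid[which.min(press)]
      list(h = h_best, b = ridge_fit(Y, x, h_best))
    }
    # applied to all 1082 stand-in gridcells (slow; for illustration of the counts only):
    # fits <- lapply(seq_len(m), function(j) fit_gridcell(Y, X[, j]))
    # sum(sapply(fits, function(f) length(f$b)))   # 23,804 regression coefficients
    # length(fits)                                 # 1082 ridge parameters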
In his Review, if one manages to get past all the ad hominems and irrelevancies, Mann’s main objection to Burger and Cubasch is that RegEM is “correct” and comparing results from other seemingly plausible methods is “spurious” or “erroneous” or “demonstrably incorrect” because RegEM is “correct”. Me, I like seeing what happens under other methods, all of which seem equally plausible a priori, relative to a method used by no one except Mannians.
But let’s see why Mann argues that RegEM is “correct”. His main reason (see S142) is that
“RegEM employs both a rigorous, objective regularization scheme and explicit statistical modeling of the error term”.
Now Jean S and I have spent a fair bit of time trying to decode Mannian confidence intervals and I will no doubt be forgiven if I think it prudent to investigate how the “explicit statistical modeling of the error term” is done in RegEM. As far as I can tell, in Mann’s own calculations, the “explicit” modeling is simply two sigmas (but I have to re-check this.)
My main point here is to draw attention to the consideration of error terms in Schneider 2001 (Analysis of Incomplete Climate Data, available online at Tapio Schneider’s website), which proposed RegEM. In his section 6, Schneider tested the estimated errors from his methodology against simulated data from a climate model (page 866). In this case, Schneider considered a data set consisting entirely of temperature data, in which:
In each of the nine data sets, 3.3% of the values were missing.
Under these circumstances, he reported that his error estimates were biased too low:
The estimated rms relative imputation error was, on the average, 11% smaller than the actual rms relative imputation error.
The underestimation of the imputation error points to a general difficulty in estimating errors in ill-posed problems. Error estimates in ill-posed problems depend on the regularization method employed and on the regularization parameter, but one rarely has a priori reasons, independent of the particular dataset under consideration, for the choice of a regularization method and a regularization parameter. In addition to the uncertainty about the adequacy of the regression model (1), the uncertainties about the adequacy of the regularization method and of the regularization parameter contribute to the imputation error. Since in the estimated imputation error, these uncertainties are neglected, the estimated imputation error underestimates the actual imputation error.
…
This underestimation of the variances is a consequence of using the residual covariance matrix of the regularized regression model in place of the unknown conditional covariance matrix of the imputation error (cf. section 3a). The residual covariance matrix of the regularized regression model underestimates the conditional covariance matrix of the imputation error for the same reason that the estimate of the imputation error in the appendix underestimates the actual imputation error: the error estimates neglect the uncertainties about the regularization method and the regularization parameter. To be sure, the traces of the estimated covariance matrices, on the average, have a relative error of only about 1.8%, but for datasets in which a greater fraction of the values is missing, the underestimation of the variances will be greater.
Now let’s compare this to MBH98. In the 15th century network, we are estimating 1082 gridcells from 22 noisy proxies. In other words, Mann is going from a method that under-estimates error with 3.3% missing data to a situation where over 98% of the data is missing. Worse, it’s not “missing” data in the sense that you have actual measurements of the same quantity to fill it with; the infilling has to come from “proxies” whose connection to temperature is itself unproven in many cases.
If one actually reads Schneider’s exposition of RegEM, his caveats seem entirely consistent with the “flavor” problem of Burger and Cubasch. Having said that, I’m inclined to agree with the critics that the flavors are not as felicitously laid out as desirable, so it’s far from being the last word on the topic. I’ll re-visit the question of RE statistics on another occasion.