Estimation of Confidence Intervals for the Mean in the Presence of Outliers
Raymond L Correll, Peter Hall, and Ravendra Naidu

Data from contaminated sites often contain some large positive outliers. These outliers are usually verified by repeated chemical analysis. They may have a large effect on the arithmetic mean and hence on estimates of the total contaminant load. Land use planners are often concerned with this total load, especially if the ground water is threatened by the contaminant. Such an estimate should be unbiased. This paper considers methods of estimating confidence intervals in the presence of a large positive outlier, illustrated with copper levels from a contaminated site near Adelaide, South Australia.
The data consist of 76 samples, with a mean of 556 but a median of only 115. The highest value was 18000 mg/kg. This outlier affects not only the mean and the variance, but also the skewness and kurtosis. In our sample, G2, Fisher's measure of kurtosis, had a value of 61. Use of the t distribution in this case is therefore invalid. Robust estimates of central tendency typically do not give an unbiased estimate of the mean.
We note that the distribution of means from a simple bootstrap is dominated by the number of times the outlier occurs in each bootstrap resample (0, 1, 2, 3 or more). The bootstrap distribution is thus multimodal and can be disjoint. Because the distribution is disjoint, a small change in the quantile (e.g. from 0.97 to 0.98) can lead to a large change in the resulting confidence limit. We therefore explore a method of data sharpening followed by a bootstrap technique to obtain confidence intervals.
The sharpening is achieved by minimising a penalty function that increases slowly with the distance of the observations from the mean, subject to the constraints that the arithmetic mean is unchanged and that the variance is reduced to a pre-assigned fraction (say, one half) of the original variance. Confidence intervals can then be obtained from the sharpened values using a bootstrap technique, and those confidence intervals can be back-transformed to the original scale.
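The multimodal behaviour of the simple bootstrap described above can be illustrated with a minimal Python sketch. It uses synthetic stand-in data (75 lognormal values plus one value of 18000), not the Adelaide copper measurements; the function name, the lognormal parameters and the number of resamples are assumptions made purely for illustration, and the sketch does not implement the sharpening step.

```python
import random
from collections import defaultdict

def bootstrap_means_by_outlier_count(data, outlier, n_boot=2000, seed=0):
    """Group bootstrap means by how often the outlier appears in each resample."""
    rng = random.Random(seed)
    n = len(data)
    groups = defaultdict(list)
    for _ in range(n_boot):
        # Draw n values with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        k = resample.count(outlier)
        groups[k].append(sum(resample) / n)
    return dict(groups)

# Synthetic stand-in data (assumption): 75 moderate values plus one extreme value.
rng = random.Random(1)
background = [rng.lognormvariate(4.7, 0.8) for _ in range(75)]
outlier = 18000.0
data = background + [outlier]

groups = bootstrap_means_by_outlier_count(data, outlier)
for k in sorted(groups):
    means = groups[k]
    # Columns: outlier count, number of resamples, smallest and largest mean.
    print(k, len(means), round(min(means)), round(max(means)))
```

Each extra copy of the outlier shifts the resample mean by roughly 18000/76, so the means fall into separate clusters indexed by the outlier count, which is why a quantile that lands near a gap between clusters is unstable.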

RAYMOND CORRELL
Mathematical & Information Sciences, CSIRO
PMB#2 Glen Osmond
Adelaide 5064, Australia
ray.correll@cmis.csiro.au


A Statistical Test for the Evaluation of Signal Species
as Indicators of Key Biotopes for Endangered Species
Anders Grimvall, Stig Danielsson, Markus Malm, and Stefan Stark

Forests and other widespread ecosystems are believed to contain specific biotopes that play a crucial role in the protection of endangered species. Hence, it is of great interest to develop procedures that enable identification of such key biotopes. One strategy that has been adopted is based on inventories of so-called signal species, i.e. species that are easy to detect and that indicate favourable environments for species that are red-listed because they are rare or endangered. This article describes how the results of an inventory of both signal species and red-listed species in a number of biotopes can be used to assess whether the presence of red-listed species is linked to specific patterns in the presence of signal species. The data set observed in such an inventory may be regarded as outcomes of binary variables, one for each combination of species and biotope. The prediction problem addressed is then a matter of selecting a suitable subset of signal species and a suitable binary function of the variables representing presence/absence of the selected signal species. We propose a procedure in which the class of permissible predictors comprises all binary variables that can be expressed as increasing functions of one, two or three other binary variables. Furthermore, we propose that a permutation test be employed to test whether the fit of predicted values to observed values is statistically significant. A case study of mosses and lichens in forest ecosystems in Sweden is used to illustrate the methods proposed.
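The general idea of such a permutation test can be sketched in Python as follows. This is a hedged illustration, not the authors' procedure: the candidate predictors are restricted to AND and OR of one to three signal species (only a subset of all increasing binary functions), the fit measure is the number of biotopes predicted correctly, and the presence/absence data and all function names are hypothetical.

```python
import random
from itertools import combinations

def candidate_predictors(signals):
    """Build 0/1 predictions from AND/OR of one to three signal species."""
    preds = []
    for size in (1, 2, 3):
        for combo in combinations(signals, size):
            rows = list(zip(*combo))
            preds.append([int(all(r)) for r in rows])      # AND (increasing)
            if size > 1:
                preds.append([int(any(r)) for r in rows])  # OR (increasing)
    return preds

def best_fit(preds, target):
    """Largest number of biotopes on which any predictor matches the target."""
    return max(sum(p == t for p, t in zip(pred, target)) for pred in preds)

def permutation_p_value(signals, target, n_perm=500, seed=0):
    """Permute the red-listed indicator over biotopes and redo the search,
    so that the selection of the best predictor is accounted for."""
    rng = random.Random(seed)
    preds = candidate_predictors(signals)
    observed = best_fit(preds, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_fit(preds, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical inventory: 4 signal species observed in 24 biotopes.
signals = [
    [1, 1, 0, 0] * 6,
    [1, 0, 1, 0] * 6,
    [1, 1, 1, 0, 0, 0] * 4,
    [0, 1] * 12,
]
# Hypothetical red-listed species present only where species 0 and 1 co-occur.
target = [a & b for a, b in zip(signals[0], signals[1])]
p_value = permutation_p_value(signals, target)
print(p_value)
```

Repeating the predictor search inside every permutation is the design choice that matters here: comparing the best observed fit with the best fit under each permutation guards against the optimism introduced by choosing among many candidate predictors.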

ANDERS GRIMVALL
Department of Mathematics
Linkoping University
Linkoping 58183, Sweden
angri@mai.liu.se
 
 

Estimation of Soil Ingestion via Semiparametric Bayes Methods
John V. Tsimikas and Edward J. Stanek

Exposure to soil is assessed using mass-balance soil ingestion studies. In these studies a sample of individuals is followed over a period of time, and their daily soil ingestion is estimated from measurements of trace element intake from food and trace element output observed in fecal samples. Many trace elements may be used in one study. Given knowledge of the transit time between intake and output of the trace element, one can reliably estimate the daily amounts of soil ingested by individuals in the study.
Linear random coefficient models provide a natural framework for the analysis of such data, the crucial parameter to be estimated being the upper 5th or 10th percentile of the distribution of subject-specific soil ingestion over a fixed period of time. An estimate easily arises if one assumes normality of the random subject-specific effects in the model. A semiparametric alternative to the standard linear random effects model is the Bayesian nonparametric hierarchical model involving Dirichlet priors or Dirichlet process mixtures (Escobar and West, 1995, 1998; Ibrahim and Kleinman, 1998).
We apply and extend these methods to the estimation of subject and population exposure parameters based on short multiple time series of trace element excretion. We discuss how these methods yield more reliable estimates of soil ingestion exposure distributions, which serve as the foundation for many environmental risk assessments.
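For orientation, the normal-theory estimate mentioned above reduces to a quantile of a normal distribution once the population mean and the between-subject standard deviation have been estimated. The short Python sketch below shows only that final step; the numerical values and the function name are hypothetical, and it does not implement the random coefficient model or the Dirichlet process alternative.

```python
from statistics import NormalDist

def upper_quantile_normal(mean, sd_between, upper_tail=0.05):
    """Upper-tail quantile of subject-specific ingestion under normality."""
    return NormalDist(mean, sd_between).inv_cdf(1 - upper_tail)

# Hypothetical estimates (mg/day), chosen only to show the calculation.
mean_ingestion = 50.0
sd_between_subjects = 30.0

q95 = upper_quantile_normal(mean_ingestion, sd_between_subjects, 0.05)
q90 = upper_quantile_normal(mean_ingestion, sd_between_subjects, 0.10)
print(round(q95, 1), round(q90, 1))
```

Because the result depends entirely on the assumed normal tail, it can be misleading when the subject effects are skewed, which is the motivation for the semiparametric alternative.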

JOHN V. TSIMIKAS
Department of Mathematics and Statistics
University of Massachusetts at Amherst
1442 LGRT
Amherst, MA 01003, USA
tsimikas@math.umass.edu
 
 
