This practicum will provide an opportunity to explore two topics of fairly fundamental importance in spatial (and other) statistics: uncertainty and the role of monitoring networks in addressing uncertainty. In particular, you will learn how to go about redesigning a monitoring network to maximize the amount of information it provides about an environmental process.

Problem 1: Uncertainty

Monitoring networks for hazardous environmental processes are developed with the goal of collecting data, and hence information, about those processes. That information is in turn needed to address uncertainties arising from concerns about the levels of those processes. This problem concerns uncertainty.

  1. Probability has been called the language of uncertainty. In other words, uncertainty about an event’s outcome can be characterized as the probability of its occurrence. This idea is enshrined in statistics through the Bayesian paradigm, where all uncertain quantities become random variables with probability distributions. However, in other contexts, uncertainty is not regarded as quite so simple. Commonly it is dichotomized as being either “aleatory” (pure chance) or “epistemic” (due to lack of knowledge). In the latter case, additional knowledge will change that state of uncertainty.

    [Exercise 1.1.] A perfectly balanced, conventional six-sided die is to be tossed. What is the probability of seeing an “ace” (one) on its uppermost side? Suppose n independent tosses of the die yield \(Z_1,\dots,Z_n\), where \(Z\) is \(1\) or \(0\) according to whether an ace is seen or not. Could that information change your answer?

    [Exercise 1.2.] A new die is now used in this experiment, with no assurance of its “fairness” but with knowledge that it has 6 faces numbered from 1 to 6 as before. What would you now say about the probability of an ace on a single toss of that die? Take a Bayesian approach and say how the sequence of \(Z\)s in this case might alter your answer to that question and your certainty about your answer.

    [Exercise 1.3.] So what has all this to do with modelling spatio-temporal processes within a Bayesian framework, where the two types of uncertainty are not distinguished? Consider the following example. An environmental process \(\{Y_{it},~i=1, \dots, p, ~t =1, \dots, T\}\) is measured without error over time \(t\) at \(p\) spatial sites and modeled by:

\[\begin{aligned} Y_{it}\mid\beta &= \beta_{it} + \epsilon_{it}, \quad i = 0, 1, \dots, p,\ t = 1, \dots, T+1, \quad \text{or}\\ \mathbf{Y}_{i}\mid\beta &= \beta_{i} + \epsilon_{i}, \quad i = 0, 1, \dots, p, \quad \text{with}\\ \beta_{i} &\sim \mathrm{N}(\mu_0 \mathbf{1}_{T+1}, \sigma^2_\beta\, \mathrm{I}_{T+1}), \quad i = 0, 1, \dots, p,\\ \mu_0 &\sim \mathrm{N}(\mu^*, \sigma^2_\mu), \end{aligned}\]

where the variances are known and the \(\epsilon\)’s are independent of one another. For parsimony, the model thus puts all the spatial correlation into that of the \(\beta\)’s. Some of the randomness in the \(\beta\)’s might be due to chance, such as fluctuations in temperature or wind direction; this randomness, and hence the uncertainty about the coefficients, would be aleatory. But some of their uncertainty would be due to their unknown baseline level \(\mu_0\) and would be epistemic.
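To make the distinction concrete before turning to the questions below, here is a minimal R sketch that simulates from this model; the number of sites, the series length, and all variance values are arbitrary illustrative choices, not part of the practicum.

# A minimal simulation from the hierarchical model above (illustrative values only)
set.seed(1)
p       <- 5     # number of monitored sites (indices 1..p; site 0 is ungauged)
nT      <- 51    # T + 1 time points
mu.star <- 30    # prior mean of the baseline level mu_0
s2.mu   <- 25    # prior variance of mu_0 (epistemic uncertainty)
s2.beta <- 4     # variance of the beta's around mu_0 (aleatory)
s2.eps  <- 1     # variance of the epsilon's (aleatory)

mu0  <- rnorm(1, mu.star, sqrt(s2.mu))                                     # unknown baseline level
beta <- matrix(rnorm((p + 1) * nT, mu0, sqrt(s2.beta)), nrow = p + 1)      # beta_{it}, i = 0,...,p
Y    <- beta + matrix(rnorm((p + 1) * nT, 0, sqrt(s2.eps)), nrow = p + 1)  # Y_{it} | beta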

  1. Compute the posterior spatial covariance between sites at time \(T+1\).
  2. What is wrong with this result, and hence with the model, as \(T\) increases?
  3. How might this model be fixed to overcome the difficulty in (2)?

Problem 2: Redesigning monitoring networks

In this problem you will work through an example that illustrates the use of methods developed by Le and Zidek (2006) and published in a CRAN package called EnviroStat. From the inception of their development, the methods were: aimed at models for multivariate data vectors; intended to provide computationally realistic models for spatio-temporal fields with fine temporal resolution, e.g. thousands of hours over hundreds of spatial sites; and capable of representing random fields with heavy tails, e.g. chlorine concentrations in acid precipitation. As a compromise, empirical Bayes shortcuts were used in the methods in EnviroStat.

The data for this problem come from a network of sites that began measuring hourly ozone concentrations at various starting dates, thus yielding data with a “staircase pattern” of missing values. The model is described in detail in Le and Zidek (2006, Chapter 10). The example will lead you to generate a set of several new sites that ideally should be added to the network.

More specifically, the dataset (data) used in this example consists of hourly O\(_3\) concentration levels (ppb) from nine stations (S1–S9) in New York State. A square root transformation of those concentration levels has been taken to render their distribution more Gaussian in shape. Furthermore, for simplicity, only the data for four hours in the so-called “ozone day” (8AM to 12 noon) are included; for each day, these are stacked in a 4-dimensional vector.

Preliminary analysis (not included here) suggests ‘month’ and ‘weekday’ are important covariates and that the square-root-transformed O3 concentration levels are approximately normal. The results show consistent patterns of hourly and weekday effects for all stations. Except for the first few weeks in April, the weekly effects show little temporal trend. Autoregression of order 2 was also seen in the preliminary analysis. One could attempt to model that autocorrelation, but to demonstrate the full power of the multivariate approach we use a different strategy that avoids altogether the need to model that dependence at the fine temporal scale of one hour. Modelling it would be challenging, since the strength of that autocorrelation varies over a day, being fairly weak at night, for example, but strong in the middle of the day. Instead we allow it to be arbitrary by exploiting the power of the multivariate approach and stacking the hourly responses in a single vector of responses for each day, as noted above. Excluding the intervening hours makes the successive daily random vectors of four hours approximately independent, with a conditionally multivariate Gaussian distribution. Yet because four hours are used, inferences about each random coordinate hour can “borrow strength” from the other three hours – quite an asset, as ozone is very strongly correlated over successive hours during the summer months covered by the data. The datafile is described in detail below.
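To make the stacking concrete, the following sketch (assuming the column ordering used later in the script, i.e. the four hours nested within each station) shows how to index the four hourly values for a given station and day in the 36-column matrix of square-root-transformed concentrations.

# Illustrative indexing only, assuming hours are nested within stations:
# station j occupies columns 4*(j-1) + (1:4), and hour h (1 = 8AM) at
# station j is column 4*(j-1) + h.
station.cols <- function(j) 4 * (j - 1) + 1:4
station.cols(3)  # columns holding the 8AM-12noon values for station S3
# e.g. sqO3[10, station.cols(3)] would give station S3's four-hour vector on day 10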

Exercises

[Exercise 2.1.] Plot the site locations using both Google maps (color and black-and-white) as well as ggplot, and comment on the relative advantages of these different visualizations. Likewise plot the prospective new sites to be considered for extending the network.

[Exercise 2.2.] Comment on: the between-hour within-day correlation between ozone levels; the fitted degrees of freedom calculated for the steps in the data staircase; the Sampson-Guttorp analysis and the nature of any nonstationarity it suggests. Would you expect the ozone field over New York State to be nonstationary?

[Exercise 2.3.] Find an optimal extension to the monitoring network by adding four additional sites. (It would be nice to consider more, but the optimization is slow, so we’ll settle for four.)

[Exercise 2.4.] Comment on your new design. Why did entropy pick the four sites it did?

Initiate R run

Start R to begin the example using the script below.

Start by (1) loading the EnviroStat package, (2) reading the raw data, (3) extracting the month, weekday, and sqrt-transformed ozone concentrations, and (4) reading in (lat, long) coordinates for the sites.

  • Raw data file data.NY.txt: 9 sites, S1-S9;
  • month (4-9);
  • weekday (2-8);
  • square-root-transformed data: Month; Weekday; sqrtO3 (36 columns: 4 hours (8AM-12noon) x 9 stations)

The R script

# Load libraries
library(EnviroStat)
library(sp)
library(rgdal)
library(ggplot2)
library(maps)
library(RgoogleMaps)
library(RColorBrewer)
library(plotGoogleMaps)
library(ggmap)
library(spatial)
library(geoR)
library(gstat)



# Read data
# data = matrix(scan('../data/data.NY.txt'), byrow = T, ncol = 38)
data = matrix(scan("http://www.stat.washington.edu/peter/591/labs/Lab7/data.NY.txt"), 
    byrow = T, ncol = 38)
month = data[, 1]
weekday = data[, 2]
sqO3 = data[, 3:38]

# Read information on locations (lat, long)
# site.loc = matrix(scan('../data/location.NY.txt'), byrow = T, ncol = 2)
site.loc = matrix(scan("http://www.stat.washington.edu/peter/591/labs/Lab7/location.NY.txt"), 
    byrow = T, ncol = 2)

Fit the hierarchical model in the LZ method with covariates Z

Rearrange the data to have a monotone pattern and calculate number of missing observations

apply(is.na(sqO3), 2, sum)
# Output [1] 0 0 0 0 109 109 109 109 0 0 0 0 0 0 0 0 3 3 3 [20] 3 84 84 84
# 84 0 0 0 0 0 0 0 0 0 0 0 0

Thus, the order of stations should be (2,6,5,1,3,4,7,8,9), with S2 having the largest number of NA’s - i.e. the latest startup date. Note that within each station, the elements must have the same number of missing observations.

norder = c(2, 6, 5, 1, 3, 4, 7, 8, 9)
tt = NULL
for (i in 1:9) tt = c(tt, c(1:4) + 4 * (norder[i] - 1))
ndata = sqO3[, tt]
nloc = site.loc[norder, ]
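As a quick sanity check (plain base R, not part of the original script), the reordered data should now show a monotone staircase: the per-column counts of missing values should be non-increasing.

# Verify the monotone (staircase) pattern after reordering
apply(is.na(ndata), 2, sum)
all(diff(apply(is.na(ndata), 2, sum)) <= 0)  # should be TRUE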

Plot the site locations.

# A crude plot
plot(nloc[, 2], nloc[, 1], xlab = "Long", ylab = "Lat", type = "n")
text(nloc[, 2], nloc[, 1], c(1:9))

# Use ggplot

# crs.org <- read.table('../data/location.NY.txt', header=T)
crs.org <- data.frame(site.loc)  # Same site coordinates, converted to a data frame for ggplot
dimnames(crs.org)[[2]] <- c("lat", "long")

# load us map data
all_states <- map_data("state")

# plot all states with ggplot

# states <- subset(all_states, region %in% c( 'illinois', 'indiana', 'iowa',
# 'kentucky', 'michigan', 'minnesota','missouri', 'north dakota', 'ohio',
# 'south dakota', 'wisconsin' ) )

states <- subset(all_states, region %in% c("new york"))
p <- ggplot()
p <- p + geom_polygon(data = states, aes(x = long, y = lat, group = group), 
    colour = "black", fill = "white")
p
########### 
long <- crs.org$long
lat <- crs.org$lat


p <- p + geom_point(data = crs.org, aes(x = long, y = lat), color = "coral1")
p

# A Google map

# ny_data <- read.table('../data/location.NY.txt', header = T)
ny_data <- data.frame(site.loc)  # Same site coordinates, converted to a data frame
dimnames(ny_data)[[2]] <- c("Lat", "Long")


# copy ny_data into ny_data_sp so that ny_data can be used later by other
# packages
ny_data_sp <- ny_data

# convert data to 'sp' format
coordinates(ny_data_sp) <- ~Long + Lat

# Specify the projection: assign a reference coordinate system to ny_data_sp
proj4string(ny_data_sp) <- CRS("+proj=longlat +ellps=WGS84")


# Specifying the bounding box - 2 x 2 matrix of corners of the geographic
# area. Then specifying the range of locations within the box.  The location
# must be in left-bottom-right-top bounding box format
latLongBox = bbox(ny_data_sp)
location = c(latLongBox[1, 1] - 0.2, latLongBox[2, 1] - 0, latLongBox[1, 2] + 
    0.2, latLongBox[2, 2] + 0.2)

# Now create the map with the site locations marked on it as dots
NYmap <- get_map(location = location, source = "google", color = "color", maptype = "roadmap")
NYmap <- ggmap(NYmap)
NYmap <- NYmap + geom_point(data = ny_data, aes(x = Long, y = Lat), size = 4)
NYmap
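Exercise 2.1 also asks for a black-and-white Google map; one possible sketch, using ggmap’s color = "bw" option on the same location, is:

# A black-and-white version of the same Google map
NYmapBW <- get_map(location = location, source = "google", color = "bw", maptype = "roadmap")
NYmapBW <- ggmap(NYmapBW)
NYmapBW <- NYmapBW + geom_point(data = ny_data, aes(x = Long, y = Lat), size = 4)
NYmapBW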

Preliminary analysis (not included here) suggests ‘month’ and ‘weekday’ as important covariates and that the square-root-transformed O3 levels are approximately normal. Set up the design matrix for the covariates using the ‘model.matrix’ function.

month[1:5]
weekday[1:5]

ZZ = model.matrix(~as.factor(month) + as.factor(weekday))

ZZ[1:2, ]

The EM staircase fit is computed using the staircase.EM function in EnviroStat. Missing data are allowed only at the beginning of the series, but all elements within a station must have the same number of missing observations. Start by examining the help file for the staircase.EM function

help(staircase.EM)

Here the staircase steps are used for the block structure (default). The fitting takes a couple of minutes!

em.fit = staircase.EM(ndata, p = 4, covariate = ZZ, maxit = 200, tol = 1e-06, 
    verbose = T)
em.fit$block  # 4 blocks with the first block having 1 station and the last one 6 stations
em.fit$Omega  # between hours covariance
em.fit$Lambda  #  residual covariances
em.fit$Beta0[, 1:4]

The current version only allows an exchangeable structure for the elements; i.e., the same b0 is assumed for each hour across all stations, but different b0’s for different hours.

Get the marginal correlation matrices (corr.est: spatial and omega: between hours) and include these in your report.

cov.est = em.fit$Psi[[1]]
dim1 = dim(cov.est)[1]
dim2 = dim(em.fit$Omega)[1]
corr.est = cov.est/sqrt(matrix(diag(cov.est), dim1, dim1) * t(matrix(diag(cov.est), 
    dim1, dim1)))
round(corr.est, 2)  #Spatial correlation

omega = em.fit$Omega/sqrt(matrix(diag(em.fit$Omega), dim2, dim2) * t(matrix(diag(em.fit$Omega), 
    dim2, dim2)))
round(omega, 2)  # Correlation between hours

Plot correlation vs inter-station distances. First project the coordinates into a rectangular coordinate system with the Lambert projection using the Flamb2 function (SG) in EnviroStat.

help(Flamb2)

coords = Flamb2(abs(nloc))
coords

Calculate distance between the locations using the Fdist (SG) function in EnviroStat and plot spatial correlations vs inter-station distances.

help(Fdist)
dist = Fdist(coords$xy)

par(mfrow = c(1, 1))
plot(-0.2, 0, xlim = c(0, 250), ylim = c(-0.2, 1), xlab = "Dist", ylab = "Spatial correlation", 
    type = "n")
for (i in 1:8) for (j in (i + 1):9) points(dist[i, j], corr.est[i, j])

Next, similarly plot the “dispersions” vs inter-station distances.

disp = 2 - 2 * corr.est
plot(-0.2, 0, xlim = c(0, 250), ylim = c(0, 2), xlab = "Dist", ylab = "Dispersion", 
    type = "n")
for (i in 1:8) for (j in (i + 1):9) points(dist[i, j], disp[i, j])

Add a fitted exponential variogram using the Fvariogfit3 (SG) function, with model = 1 for exponential (the default) and model = 2 for Gaussian.

Exponential variogram: \(a_1+ (2-a_1)(1-\exp(-t_0 h))\)

Gaussian variogram: \(a_1+ (2-a_1)(1-\exp(-t_0 h^2))\)

h.lt = dist[row(dist) < col(dist)]
disp.lt = disp[row(disp) < col(disp)]
variogfit <- Fvariogfit3(disp.lt, h.lt, a0 = 1.5, t0 = 0.1)
x = seq(min(h.lt), max(h.lt), 1)
a0 = variogfit$a[1]
t0 = variogfit$t0
lines(x, a0 + (2 - a0) * (1 - exp(-(t0 * x))))
# Save it for later comparison!
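For comparison, a Gaussian variogram can be fitted and overlaid in the same way; this is a sketch assuming Fvariogfit3 accepts model = 2 (as described above) and returns the same $a and $t0 components.

# Sketch: fit and overlay a Gaussian variogram (model = 2) for comparison
variogfit.g <- Fvariogfit3(disp.lt, h.lt, a0 = 1.5, t0 = 0.1, model = 2)
a0.g <- variogfit.g$a[1]
t0.g <- variogfit.g$t0
lines(x, a0.g + (2 - a0.g) * (1 - exp(-(t0.g * x^2))), lty = 2)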

Use SG method to extend the spatial covariance to new locations

Step 1: Identifying a thin-plate spline mapping transformation with no smoothing (lambda = 0) using the Falternate3 function (SG), starting with the spatial correlation.

This function does simultaneous estimation of coordinates and an exponential or Gaussian variogram by alternating weighted least squares.

This version passes a smoothing parameter to the optimization. This parameter is not scaled exactly the same way as in the thin-plate spline sinterp function of SG used by Ftransdraw in Step 2 below, and this has not been investigated yet. It should not be an issue for you, as you are asked to use the function only with the default lambda = 0.
Note: we rescale the coordinates so that they are reasonably small in magnitude before attempting the computation; otherwise matrix inversion may fail in the calculation of the bending-energy matrix.

help(Falternate3)

Recall that the exponential variogram is the default; set model=2 for a Gaussian variogram.

Scaling the distances to be somewhat small facilitates the multidimensional scaling method computations.

coords.lamb = coords$xy/10
sg.est = Falternate3(disp, coords.lamb, max.iter = 100, alter.lim = 100, model = 1)
sg.est

Step 2: Selecting a smoothing parameter for the identified spline mapping through the ‘Ftransdraw’ function (SG).

How this function works: for varying (user-supplied) values of the spline smoothing parameter lambda,

  • compute image of coordinates, interpoint distances, and dist-disp plot.
  • compute and draw image of regular grid (2D or 3D perspective plot)

Computed prior to execution of this code are:

  • disp: spatial dispersion matrix
  • Gcrds: geographic coordinates (nx2)
  • MDScrds: kyst mds solution (nx2 or nx3) - using Falternate3
  • gridstr: regular grid structure on the G-plane (from Fmgrid)

First create a coordinate grid for examining the mapping transformation.

apply(coords.lamb, 2, range)
coords.grid = Fmgrid(range(coords.lamb[, 1]), range(coords.lamb[, 2]))
par(mfrow = c(1, 2))
temp = setplot(coords.lamb, ax = T)
deform = Ftransdraw(disp = disp, Gcrds = coords.lamb, MDScrds = sg.est$ncoords, 
    gridstr = coords.grid)

Note: You must click on the plot window to register the cursor in order to proceed; this interactive function will then provide instructions in the R console window - specifically for entering new lambda values.

Lambda = 50 may be ok. The fitted variogram is quite similar to the one without any transformation (above). The mapping is almost linear!

Step 3: combining to get an optimal thin-plate spline using the sinterp function, saving the results.

After computing the thin-plate spline, plot the biorthogonal grid characterizing the deformation of the G-space

Tspline = sinterp(coords.lamb, sg.est$ncoords, lam = 50)

par(mfrow = c(1, 1))
Tgrid = bgrid(start = c(0, 0), xmat = coords.lamb, coef = Tspline$sol)

tempplot = setplot(coords.lamb, ax = T)
text(coords.lamb, labels = c(1:(dim(coords.lamb)[1])))
draw(Tgrid, fs = T)

Solid lines indicate contraction and dashed lines indicate expansion. See Sampson and Guttorp (1992) for details on interpretation.

Step 4: estimating the dispersion between new locations and the stations using the SG fit from Steps 1-3 above.

Here new locations are created using a grid of 100 points between the stations. The locations are ordered as (lat1,long1),(lat2,long1), ... ,(latn,long1), (lat1,long2), ... (latn,long2), ...

lat10 = seq(min(nloc[, 1]), max(nloc[, 1]), length = 10)
lat10

long10 = seq(max(abs(nloc[, 2])), min(abs(nloc[, 2])), length = 10)
long10

llgrid = cbind(rep(lat10, 10), c(outer(rep(1, 10), long10)))
llgrid[1:10, ]

Project the new locations using the same Lambert projection computed for the stations above (i.e. using the same reference point). Note that the same scale factor of 10 is used as before. Then combine the new locations and the original stations together, beginning with the new locations.

z = coords
newcrds.lamb = Flamb2(llgrid, latrf1 = z$latrf1, latrf2 = z$latrf2, latref = z$latref, 
    lngref = z$lngref)$xy/10
allcrds = rbind(newcrds.lamb, coords.lamb)

Use the corrfit function to obtain correlations among all the locations (new and existing).

help(corrfit)

corr.est = corrfit(allcrds, Tspline = Tspline, sg.fit = sg.est, model = 1)
round(corr.est$cor[1:5, 1:5], 2)

Step 5: Interpolating the variance field

The variance field is considered non-homogeneous and interpolated using the same thin-plate spline function sinterp. Variances are taken from the diagonal of the covariance matrix.

diag(cov.est)

Tspline.var = sinterp(allcrds[101:109, ], matrix(diag(cov.est), ncol = 1), lam = 50)

The “thin-plate spline evaluation” function seval is used to obtain variance estimates at the locations. The results are arranged in a matrix. These variances are combined to scale the correlation matrix to get the covariance matrix for all the locations.

varfit = seval(allcrds, Tspline.var)$y
temp = matrix(varfit, length(varfit), length(varfit))
covfit = corr.est$cor * sqrt(temp * t(temp))

This completes the SG-method for extending the covariance matrix to ungauged sites stored in covfit.
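Before extending the hyperparameter estimates, a quick base R check of covfit can be reassuring; the sketch below assumes only that covfit should be a 109 x 109 (100 new locations plus 9 stations) covariance matrix.

# Quick checks on the extended covariance matrix
dim(covfit)                   # expect 109 x 109
max(abs(covfit - t(covfit)))  # asymmetry should be negligible
min(eigen((covfit + t(covfit))/2, symmetric = TRUE, only.values = TRUE)$values)  # should be > 0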

Extend the results to estimate hyperparameters associated with the new locations through the staircase.hyper.est function

Begin by examining the arguments and outputs of this function in its helpfile.

help(staircase.hyper.est)

u = 100  # number of new locations
p = 4  # dimension of the multivariate response
hyper.est = staircase.hyper.est(emfit = em.fit, covfit = covfit, u = u, p = p)

This completes the estimation of all hyperparameters.

The predictive distribution for spatial interpolation.

For example, to get the predictive mean and covariance for Day 183:

# 
x = hyper.est
tpt = 183
Z = x$covariate[tpt, ]
y = x$data[tpt, ]
# The beta0 for u= 100 ungauged locations each with p=4 hours
b0 = matrix(rep(c(x$Beta0[, 1:p]), u), nrow = length(Z))

# Predictive mean for hours 8-12 on Day 183
mu.u = Z %*% b0 + (y - Z %*% x$Beta0) %*% kronecker(x$Xi0.0, diag(p))

Next plot the observed values for Day 183 by hours.

# X11()

# Select a set of colors - color = colors() if default is preferred
color = colors()[c(12, 26, 32, 37, 53, 60, 70, 80, 84, 88, 94, 101, 116, 142, 
    366, 371, 376, 386, 392, 398, 400:657)]

par(mfrow = c(1, 1))
plot(c(7, 13), range(y), type = "n", xlab = "Hours", ylab = "Levels (sqrt scale)")
for (i in 1:p) points(rep(i + 7, 9), y[i + p * c(0:8)], col = color)
# dev.off()

# X11()
par(mfrow = c(2, 2))
# Plot the contour for hours
for (i in 1:p) {
    tt = i + p * c(0:(u - 1))
    mu = mu.u[tt]
    hr = matrix(mu, byrow = T, ncol = length(lat10))
    print(range(hr))
    contour(-long10, lat10, hr, xlab = "Long", ylab = "Lat", main = paste("Mean: Day 183 - Hour ", 
        7 + i))
}

Note that the predictive covariance may have to be obtained recursively (i.e. when steps are involved). Generally the derivation is not simple in this case. A simpler approach is to simulate realizations from the predictive distribution and estimate the mean and covariance from the simulated data, as demonstrated next.

We do the interpolation by simulation from the predictive distribution, via the pred.dist.simul function with N = 1000. As usual, first examine the help file for this function to understand its arguments and outputs.

help(pred.dist.simul)

simu = pred.dist.simul(hyper.est, tpt = 183, N = 1000)
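Because the simulated draws are now available, the full predictive covariance at the ungauged grid locations can also be estimated directly from them; this is a sketch assuming, as in the variance-field plot below, that the first u*p columns of simu correspond to the ungauged locations.

# Estimate the full predictive covariance at the ungauged grid locations
# from the N = 1000 simulated draws
pred.cov <- var(simu[, 1:(u * p)])
dim(pred.cov)                # (u*p) x (u*p), here 400 x 400
round(pred.cov[1:4, 1:4], 3)  # covariance among the four hours at the first grid point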

Extract the simulated data at the ungauged (grid) locations and plot the contours of the mean in a new graphics window. Then plot the corresponding variance field.

x = apply(simu, 2, mean)[1:(p * u)]
# X11()
par(mfrow = c(2, 2))
# Plot the contour for hours
for (i in 1:p) {
    tt = i + p * c(0:(u - 1))
    x1 = x[tt]
    hr = matrix(x1, byrow = T, ncol = length(lat10))
    print(range(x1))
    contour(-long10, lat10, hr, xlab = "Long", ylab = "Lat", main = paste("Mean: Day 183 - Hour ", 
        7 + i))
}

# Plot the corresponding variance field
x = simu[, 1:(u * p)]
# X11()
par(mfrow = c(2, 2))
# Plot the contour for hours
for (i in 1:p) {
    tt = i + p * c(0:(u - 1))
    x1 = x[, tt]
    x2 = diag(var(x1))
    vv = matrix(x2, byrow = T, ncol = length(lat10))
    contour(-long10, lat10, vv, xlab = "Long", ylab = "Lat", main = paste("Var: Day 183 - Hour ", 
        7 + i))
    points(nloc[, 2], nloc[, 1])
}

Redesigning a network.

The design problem can be solved with the ldet.eval function. Look at its help file.

help(ldet.eval)

We select an additional 4 stations among the 100 new locations from the grid above. Their conditional covariance is in hyper.est$Lambda.0. The entropy criterion selects the combination with the largest log|det|.

ldet.eval evaluates the log|det| for sub-covariance matrices of size nsel and returns the combination with the largest value when the option all = F is used.

Option all = T returns all combinations with their corresponding values. This option needs a very large memory allocation if the number of combinations is big, and so should be used only for a small number of potential sites, say < 15.

Note that when the number of combinations is large, this function can be very slow even with option all = F. For example, it could take about 30 minutes to select 5 out of 100.

nsel = 4
yy = ldet.eval((hyper.est$Lambda.0 + t(hyper.est$Lambda.0))/2, nsel, all = F)

# Option all = T for a smaller matrix

yy1 = ldet.eval(((hyper.est$Lambda.0 + t(hyper.est$Lambda.0))/2)[1:10, 1:10], 
    nsel, all = T)
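
# As a check on the entropy criterion (a sketch: yy$coord.sel is assumed to
# hold the indices of the selected locations, as used in the plotting code
# below), recompute log|det| of the selected sub-covariance matrix directly.
Lambda0.sym <- (hyper.est$Lambda.0 + t(hyper.est$Lambda.0))/2
sel <- yy$coord.sel
determinant(Lambda0.sym[sel, sel], logarithm = TRUE)$modulus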


# Plot the selected locations

# NYmap <- get_map(location = location, source = 'google', color = 'color',
# maptype = 'roadmap') NYmap <- ggmap(NYmap) NYmap <- NYmap +
# geom_point(data = ny_data, aes(x = Long, y = Lat), size = 4) NYmap

dimnames(llgrid)[[2]] <- c("Lat", "Long")
llgrid[, 2] <- -llgrid[, 2]

NYmapGrid <- NYmap + geom_point(data = data.frame(llgrid), aes(x = Long, y = Lat), 
    size = 1, color = "red")
NYmapGrid
ny_dataAug <- data.frame(llgrid[yy$coord.sel, ])
NYmapAug <- NYmapGrid + geom_point(data = ny_dataAug, aes(x = Long, y = Lat), 
    size = 4, color = "red")
NYmapAug