PASI 2014
Data and software
Nonstationary workshop
Small scale spatial data set
Log housing prices from Stockton, CA in column 1.
There are 8 covariates:
Column 2: sqft = area in 100 square feet (roughly 9.3 square meters)
Column 3: age = age of house
Column 4: bedrooms = number of bedrooms
Column 5: vacant_lot = lot vacant when sold (1 if true)
Column 6: large_lot = large lot (1 if true)
Column 7: dist_freeway = distance to Interstate freeway in miles
Column 8: lat = latitude
Column 9: long = longitude
Suggested mean values (obtained from an OLS regression of log price on the covariates) are in column 10.
The locations for which covariance predictions are to be made have NA in columns 1 and 10.
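The column layout above suggests a simple split between observed houses and prediction locations. The Python sketch below is illustrative only: the rows are made-up values rather than the Stockton data, the function name `split_rows` is hypothetical, and it assumes each row has already been parsed into 10 values with NA represented as None. Subtracting the suggested mean from the log price is one plausible way to obtain residuals for covariance modeling, not necessarily what the organizers intend.

```python
# Column names follow the description above
# (column 1 = log price, column 10 = suggested mean).
COLUMNS = ["log_price", "sqft", "age", "bedrooms", "vacant_lot",
           "large_lot", "dist_freeway", "lat", "long", "mean"]

def split_rows(rows):
    """Separate observed rows from prediction rows (None in columns 1 and 10)."""
    observed, to_predict = [], []
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if record["log_price"] is None and record["mean"] is None:
            to_predict.append(record)
        else:
            observed.append(record)
    return observed, to_predict

# Made-up rows for illustration only (not taken from the data set).
rows = [
    [11.9, 15.0, 30, 3, 0, 0, 1.2, 37.96, -121.29, 11.8],
    [None, 12.5, 12, 2, 0, 1, 0.4, 37.98, -121.31, None],
]
observed, to_predict = split_rows(rows)

# Residuals from the suggested mean, available only at observed locations.
residuals = [r["log_price"] - r["mean"] for r in observed]
```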
Global bivariate data set
These data are Nov-Mar averages over 1970-1999 of precipitation and temperature from a historical run of the global CCSM3 climate model from NCAR (Trenberth and Shea, GRL 2005), after removing some spherical harmonics. See Jun, Scand J Stat Vol. 38: 726–747, 2011 for details.
Small scale spatio-temporal data set
This is a spatio-temporal dataset for nonstationary spatial covariance modeling. The processed data are derived from concentrations of NO2 recorded daily in the U.S. EPA Air Quality System (AQS) at 52 monitoring sites over a region of southern California for the period from 1999 to 2011. In the interest of focusing on spatial covariance, the data have been processed and detrended as follows.
1. First, to eliminate most of the issues of small-scale temporal correlation, the data have been averaged to a 2-week time scale, which happens to be the time scale for most of the analyses being conducted by the MESA Air study at the University of Washington. (The main reason for this time scale in the MESA Air study is that supplemental monitoring data, not being provided here, are from instruments that are sampling 2-week time periods.) The two-week averages are then log-transformed (you could argue whether this is the best choice for these data, but it works well enough and it is what MESA Air investigators have been using).
2. The log 2-week average time series at each monitoring site were detrended using an empirical smooth SVD approach described in a couple of papers, including the two listed here. See attached figures for the nature of the detrending.
a. Guttorp, P., Fuentes, M., Sampson, P.D. (2006). Using transforms to analyze space-time processes. in: Statistical Methods for Spatio-Temporal Systems, B. Finkenstadt, L. Held, V. Isham, Eds., CRC/Chapman and Hall, pp 77-150.
b. Sampson PD, Szpiro AA, Sheppard L, Lindstrom J, and Kaufman JD. (2011) Pragmatic estimation of a spatio-temporal air quality model with irregular monitoring data. Atmospheric Environment, 45, 6593-6606.
3. The detrended, mean zero time series (residuals from the fitted smooth trends) are the basis for the computation of a 52 x 52 covariance matrix.
4. An empirical covariance matrix was computed using the EM algorithm to deal with missing data.
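Step 1 of the processing can be sketched in a few lines of Python. This is a minimal illustration, not the MESA Air code: the function name `two_week_log_averages` is hypothetical, and it assumes that 2-week blocks are aligned to the start of the series, that missing days are given as None and skipped, and that all concentrations are positive. The smooth SVD detrending of step 2 and the EM covariance estimate of step 4 are not reproduced here.

```python
import math

def two_week_log_averages(daily, period=14):
    """Average daily values over consecutive `period`-day blocks, then take logs.

    Missing days (None) are skipped; a block with no valid days yields None.
    """
    out = []
    for start in range(0, len(daily), period):
        block = [v for v in daily[start:start + period] if v is not None]
        out.append(math.log(sum(block) / len(block)) if block else None)
    return out

# Made-up daily series: one complete block, one half-missing block,
# and one block that is missing entirely.
daily = [10.0] * 14 + [20.0] * 7 + [None] * 7 + [None] * 14
log_avgs = two_week_log_averages(daily)
```

Averaging before taking logs matches the order described in step 1; taking logs of the daily values first would give a different quantity.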
Notes:
10 of the 52 monitoring sites have been reserved for validation purposes. The figure shows the location of the sites for analysis/modeling (red) and the sites for validation (blue). The aim of the modeling is to predict covariances among the validation sites and between the validation and analysis sites. Validation sites were selected by repeatedly sampling 10 sites at random until I observed a configuration with sites covering most of the geographic span of the 52 sites, but not including extreme spatial sites.
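The repeated random sampling described above can be mimicked with a small rejection sampler. The original selection was made by inspecting configurations, so the acceptance rule below is purely an assumption for illustration: a draw is accepted if it contains no site on the edge of the bounding box and spans at least a fraction `min_span` of the full range in each coordinate. The function name, the parameters, and the coordinates are all hypothetical.

```python
import random

def pick_validation_sites(coords, n_val=10, min_span=0.5, max_tries=1000, seed=0):
    """Rejection-sample validation sites under a hypothetical acceptance rule."""
    rng = random.Random(seed)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
    # Sites on the edge of the bounding box count as "extreme" in this sketch.
    extreme = {i for i, (x, y) in enumerate(coords)
               if x in (xmin, xmax) or y in (ymin, ymax)}
    for _ in range(max_tries):
        chosen = rng.sample(range(len(coords)), n_val)
        if extreme & set(chosen):
            continue  # reject draws that include an extreme site
        cx = [coords[i][0] for i in chosen]
        cy = [coords[i][1] for i in chosen]
        covers_x = (max(cx) - min(cx)) >= min_span * (xmax - xmin)
        covers_y = (max(cy) - min(cy)) >= min_span * (ymax - ymin)
        if covers_x and covers_y:
            return sorted(chosen)
    return None  # no acceptable configuration found within max_tries

# Made-up coordinates for 52 sites, for illustration only.
demo_rng = random.Random(1)
coords = [(demo_rng.uniform(0, 100), demo_rng.uniform(0, 100)) for _ in range(52)]
validation = pick_validation_sites(coords)
analysis = [i for i in range(len(coords)) if i not in validation]
```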
These data have not been analyzed for nonstationary covariance structure in the MESA Air study. The sites provided here cover a larger region of southern California than is the focus (around L.A.) for MESA Air.
There are missing data, as in almost all environmental monitoring datasets, but we selected sites that were mostly complete, having no more than 110 missing observations out of 339 2-week averages for these 13 years. A few sites were eliminated for having batches of highly suspicious observations that inflated variances and deflated spatial correlations.
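The completeness criterion amounts to a count of missing values per site. A minimal Python sketch, assuming missing 2-week averages are stored as None; the site ids and series are made up, and the function name `mostly_complete_sites` is hypothetical:

```python
def mostly_complete_sites(series_by_site, max_missing=110):
    """Return ids of sites with at most `max_missing` missing (None) values."""
    return [site for site, values in series_by_site.items()
            if sum(v is None for v in values) <= max_missing]

# Made-up series of 339 two-week averages; site ids are hypothetical.
series_by_site = {
    "site_A": [2.5] * 339,                 # complete
    "site_B": [2.5] * 229 + [None] * 110,  # exactly 110 missing: kept
    "site_C": [2.5] * 200 + [None] * 139,  # too many missing: dropped
}
kept = mostly_complete_sites(series_by_site)
```

The screening for suspicious observations mentioned above was a separate judgment and is not captured by this count.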
A separate dataset will be provided for consideration of possible nonstationary spatial covariance for the spatial (not spatio-temporal) dataset of long-term mean concentrations. Geographic covariates will be provided for specification of a mean model.
The current data for analysis are provided in the R workspace PASI.NO2.anal.RData. It contains the following objects. Dates and monitoring site ids are provided in the dimnames of these objects.
CA.NO2.anal
339x42 matrix of log 2-week averages for analysis sites
CA.NO2.anal.res
339x42 matrix of detrended log 2-week averages
CA.NO2.anal.cov
42x42 matrix of covariances among the analysis sites
CA.NO2.anal.longlat
42x2 matrix of longitude and latitude for analysis sites
CA.NO2.anal.lamb2xy
42x2 matrix of Lambert projection coordinates
CA.NO2.xval.longlat
10x2 matrix of longitude and latitude for validation sites
CA.NO2.xval.lamb2xy
10x2 matrix of Lambert projection coordinates for validation sites
CA.NO2.sites3.pdf
CA.nc2.crtTrendFits.pdf
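One common first look at objects like the covariance matrix and the projected coordinates is to pair each inter-site distance with the corresponding empirical correlation. Under a stationary, isotropic model these pairs would scatter around a single decreasing curve, so strong departures can hint at nonstationarity (or simply at sampling variability). The Python sketch below uses a made-up 3x3 covariance matrix and made-up coordinates as stand-ins; it does not read the R workspace, and the function names are hypothetical.

```python
import math

def cov_to_corr(cov):
    """Convert a covariance matrix (list of lists) to a correlation matrix."""
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]

def distance_correlation_pairs(xy, cov):
    """Return (distance, correlation) for every pair of distinct sites."""
    corr = cov_to_corr(cov)
    pairs = []
    for i in range(len(xy)):
        for j in range(i + 1, len(xy)):
            pairs.append((math.dist(xy[i], xy[j]), corr[i][j]))
    return pairs

# Made-up stand-ins for a covariance matrix and projected coordinates.
cov = [[1.0, 0.5, 0.2],
       [0.5, 4.0, 0.4],
       [0.2, 0.4, 1.0]]
xy = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
pairs = distance_correlation_pairs(xy, cov)
```

Projected coordinates are used here rather than longitude and latitude so that Euclidean distance is meaningful.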
Simulated data set 1 (Higdon model)
There are data on a grid as well as irregularly spaced data. Each of the data sets contains an indicator variable for leaving out a variable from the analysis. See README.pdf for more details.
simulated_grid.txt
simulated_irreg.txt
Here are 200 independent reps of this simulation:
simulated_grid_repsWM.txt
simulated_irreg_repsWM.txt
Simulated data set 2 (a deformation model)
Please see SimulationDataDescription.pdf for more details.
DeformationDataProvided.csv
GeographicLocProvided.csv
GeographicLocWithheldSites.csv
SimulationDataDescription.pdf