Quiz 6

This quiz deals with polynomial regression, the ANOVA decomposition, and some issues arising in multiple regression.

Part I: Polynomial regression and the ANOVA decomposition

a) Write code to compute SST, SS_explained, and SS_unexplained using their respective defining formulas, for a quadratic polynomial regression model of the brainhead data used in the previous lab. You may use lm() and predict(). Report the numerical answers.

dat = read.table("http://sites.stat.washington.edu/people/marzban/390/spring21/brainhead_dat.txt", header=T)
x = dat[,3]
y = dat[,4]
lm.1 = lm(y ~ x + I(x^2))                    # Quadratic polynomial regression.
SST = sum((y - mean(y))^2)                   # 3417710
SS_exp = sum((predict(lm.1) - mean(y))^2)    # 2192907
SS_unexp = sum((y - predict(lm.1))^2)        # 1224803
# FYI: These R formulas were given in the last lecture note.
# Also note that SST = SS_exp + SS_unexp, i.e., the ANOVA decomposition works even for polynomial regression.

In a hw you just turned in, you showed that the cross-term in the ANOVA decomposition is identically zero. To confirm that,

b) Write code to compute and report the cross-term for the ANOVA decomposition in the previous part. Your answer may not be identically zero, due to machine precision.

sum((predict(lm.1) - mean(y)) * (y - predict(lm.1)))   # -6.072682e-09

Part II: Multiple regression

c) Write code to generate data (n=100) on x1, x2, and y, with some collinearity (correlation coeff about 0.5), following the equation y = 1 + 2 x1 + 3 x2, with NO error in y, and make the relevant scatterplots. YOU may choose the values of any other parameters for this simulation.

library(MASS)   # This library contains mvrnorm(); see ?mvrnorm.
set.seed(1)     # Not necessary; only for ensuring reproducible output.
n = 100
r = 0.5
dat = mvrnorm(n, rep(0, 2), matrix(c(1, r, r, 1), 2, 2))
x1 = dat[, 1]
x2 = dat[, 2]
y = 1 + 2*x1 + 3*x2 + rnorm(n, 0, 0)   # Generate y, but with no error/noise. May skip rnorm() completely.
dat = data.frame(x1, x2, y)            # Here is the whole data.
plot(dat)

d) Write code to fit a model y = alpha + beta1 x1 + beta2 x2 to the resulting data, and report its R2 and s_e.

lm.1 = lm(y ~ x1 + x2)   # Fit a plane through the data.
summary(lm.1)            # R2 = 1, s_e = 0

e) Given the presence of unambiguous scatter in all of the scatterplots in the previous parts, why are the fits perfect (i.e., R^2 = 1, s_e = 0)? Explain in words. Hint: mentally visualize the data in 3d.

Because the data were made to lie exactly on a plane; that's what "NO error" means in part c. FYI: Note that the reason is NOT overfitting.

f) Write code to generate data (n=1000) on x1, x2, and y, with little collinearity (correlation coeff about 0.1), following the equation y = 1 + 2 x1 + 3 x2 + 4 x1 x2, again with NO error in y, and make the relevant scatterplots. YOU may choose the values of any other parameters for this simulation.

library(MASS)   # Again, for mvrnorm().
set.seed(1)     # Again, not necessary.
n = 1000
r = 0.1
dat = mvrnorm(n, rep(0, 2), matrix(c(1, r, r, 1), 2, 2))
x1 = dat[, 1]
x2 = dat[, 2]
y = 1 + 2*x1 + 3*x2 + 4*x1*x2 + rnorm(n, 0, 0)   # Generate y, and add NO error.
dat = data.frame(x1, x2, y)                      # Here is the whole data.
plot(dat)

g) Write code to fit a model y = alpha + beta1 x1 + beta2 x2 + beta3 x1 x2 (i.e., with interaction) to the resulting data, and report its R2 and s_e.

summary(lm(y ~ x1 + x2 + I(x1*x2)))   # R2 = 1, s_e = 0
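As an optional sanity check (not part of the quiz), the decomposition claimed in part a and the exactness of the fit in part g can both be confirmed numerically. The name lm.2 below is introduced here just for illustration; SS_exp and SS_unexp are the part-a scalars, while x1, x2, and y are the part-f variables at this point in the script.

SS_exp + SS_unexp                   # 3417710, i.e., 2192907 + 1224803, equal to SST.
lm.2 = lm(y ~ x1 + x2 + I(x1*x2))   # Refit the part-g model, keeping the fit object.
max(abs(residuals(lm.2)))           # Effectively zero, up to floating-point error.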
h) Hopefully, you'll see a bowtie/butterfly pattern in the y-x1 and/or y-x2 planes in the previous part. Even if you don't, say something that would explain a bowtie/butterfly pattern in the y-x1 or y-x2 planes. Hint: mentally visualize the data in 3d.

That pattern is a result of the interaction term: projecting the saddle surface in 3d onto the two planes gives rise to it.

Morals:

Part I: The ANOVA decomposition works for any model (linear or nonlinear).

Part II: In any modeling work, it's important to be able to visualize the data in higher dimensions. In this case, we made the data ourselves, and so we knew there was no error; but even if we didn't, being able to visualize the data in 3d would reveal to us why R^2 and s_e suggest a perfect fit. Also, notice that even the complex scatterplots in part h still lead to a perfect fit when we include an interaction term. In that case, the reason is the same, i.e., the data lie exactly on a saddle surface. Finally, learn to recognize the bowtie or butterfly pattern in the y-x1 or y-x2 planes as a signature of an interaction.
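Not part of the quiz, but to make the "visualize in 3d" moral concrete, here is a minimal sketch that draws the error-free saddle surface y = 1 + 2 x1 + 3 x2 + 4 x1 x2 from part f with base R's persp(), next to its projection onto the y-x1 plane; the grid range, grid size, and viewing angles are arbitrary choices.

par(mfrow = c(1, 2))                                    # Two panels, side by side.
g = seq(-3, 3, length.out = 40)                         # Arbitrary grid of x1 and x2 values.
z = outer(g, g, function(a, b) 1 + 2*a + 3*b + 4*a*b)   # The part-f surface, with no error.
persp(g, g, z, theta = 30, phi = 20, xlab = "x1", ylab = "x2", zlab = "y")
plot(x1, y)                                             # The bowtie/butterfly projection (x1, y from part f).
par(mfrow = c(1, 1))                                    # Restore the plotting layout.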