Quiz 10 (questions are in italic)

Part I: You've tossed a 6-sided die 60 times, and have recorded the following frequency table for obtaining each of the 6 numbers on the die.

Face       1   2   3   4   5   6
Frequency  14  6   5   17  12  6

You are interested in whether these data suggest that the die is unfair.

a) What are the appropriate hypotheses, stated in terms of well-defined population parameters?

# H0: pi_1 = pi_2 = ... = pi_6 = 1/6
# H1: At least one of the above specifications is incorrect.
# where pi_i = the true proportion of tosses that land on face i.
# In English, these mean
# H0: The die is fair.
# H1: The die is not fair.

b) Write code for using the chisq.test() function to test the hypotheses, and report the p-value.

obscounts = c(14, 6, 5, 17, 12, 6)
pi0 = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
chisq.test(obscounts, p = pi0)
# p-value = 0.027

c) Based on the above p-value, state your conclusion "in English" (at alpha = 0.05).

# Given that this p-value is less than alpha, we reject
# H0 (the die is fair) in favor of H1 (the die is unfair).
# In English: the data do provide sufficient evidence that the die is unfair.
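As a side check (not required for the quiz), the chisq.test() result can be reproduced by hand from the formula of the statistic, i.e., the sum over the k = 6 faces of (observed - expected)^2 / expected, compared to a chi-squared distribution with df = k - 1 = 5:

expected = 60 * pi0                               # expected counts under H0: 10 per face
stat = sum((obscounts - expected)^2 / expected)   # = 12.6
pchisq(stat, df = 5, lower.tail = FALSE)          # = 0.027, same as chisq.test()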
Part II: There are concerns that the bridges in the USA may be damaged due to aging. An engineering firm is wondering if a quantity denoted MS (the maximum stress a bridge can tolerate) is a good measure of the condition of a bridge. So, the company measures MS for bridges in Poor, Good, and Excellent condition, and obtains the following data:

Poor:      1.0  4.0  3.0  1.0  1.0  2.0  4.0  3.0
Good:      3.0  4.0  4.0  3.0  2.0  4.0
Excellent: 5.0  4.0  2.0  4.0  2.0  3.0

d) Write code to answer the question of whether or not MS is a good measure. Report the p-value, and state the conclusion "in English" at alpha = 0.05.

# The appropriate test is a 1-way ANOVA F-test, because whether or not the
# measure MS is appropriate depends on whether the mean of MS differs
# across the three bridge conditions.
x = c(1,1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)  # condition: 1 = Poor, 2 = Good, 3 = Excellent
y = c(1,4,3,1,1,2,4,3, 3,4,4,3,2,4, 5,4,2,4,2,3)  # the MS measurements
aov.1 = aov(y ~ as.factor(x))
summary(aov.1)
#              Df Sum Sq Mean Sq F value Pr(>F)
# as.factor(x)  2  4.408   2.204   1.662  0.219
# Residuals    17 22.542   1.326
# The p-value 0.219 is above 0.05, and so the conclusion is
# "One cannot tell if the three means of MS are different."
# I.e., "One cannot tell if MS is an appropriate measure."

Part III: Here is a different kind of question than the ones we're used to. Study and run the following lines of code, which are accompanied by explanations. Copy/paste into R:

set.seed(3)  # pick a random seed (just so that we all get the same answer)

# Here are 1000 random numbers. Think of them as 1000 observations of a response y:
y = rnorm(1000, 0, 1)

# Here are 100,000 random numbers:
x = rnorm(1000*100, 0, 1)

# Here they are formatted into a 1000 x 100 matrix.
# Think of them as 1000 observations on 100 variables:
x = data.frame(matrix(x, ncol = 100, byrow = T))

# This is how we give the names x1 - x100 to the 100 columns/variables:
colnames(x) = c("x1","x2","x3","x4","x5","x6","x7","x8","x9","x10",
                "x11","x12","x13","x14","x15","x16","x17","x18","x19","x20",
                "x21","x22","x23","x24","x25","x26","x27","x28","x29","x30",
                "x31","x32","x33","x34","x35","x36","x37","x38","x39","x40",
                "x41","x42","x43","x44","x45","x46","x47","x48","x49","x50",
                "x51","x52","x53","x54","x55","x56","x57","x58","x59","x60",
                "x61","x62","x63","x64","x65","x66","x67","x68","x69","x70",
                "x71","x72","x73","x74","x75","x76","x77","x78","x79","x80",
                "x81","x82","x83","x84","x85","x86","x87","x88","x89","x90",
                "x91","x92","x93","x94","x95","x96","x97","x98","x99","x100")

# And this command makes sure that you can call the variables by their names:
attach(x)  # e.g., x1 now contains the 1000 observations on x1

In summary, we have constructed fake data consisting of 1000 observations of a response (y), and another, completely unrelated, 1000 observations on each of 100 predictors (x1, ..., x100). Now, you know what this line does:

lm.1 = lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
              x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 +
              x21 + x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 +
              x31 + x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 +
              x41 + x42 + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 +
              x51 + x52 + x53 + x54 + x55 + x56 + x57 + x58 + x59 + x60 +
              x61 + x62 + x63 + x64 + x65 + x66 + x67 + x68 + x69 + x70 +
              x71 + x72 + x73 + x74 + x75 + x76 + x77 + x78 + x79 + x80 +
              x81 + x82 + x83 + x84 + x85 + x86 + x87 + x88 + x89 + x90 +
              x91 + x92 + x93 + x94 + x95 + x96 + x97 + x98 + x99 + x100)

e) Write code to perform the F-test of model utility, report the p-value, and state the conclusion "in English."

summary(lm.1)
# p-value of the F-test (last line of the output) = 0.1865
# Given that the p-value > alpha, there is no evidence that any of the 100
# predictors is useful. Or: we cannot tell if any of the predictors are useful.

As shown in the prelab, we can perform 2-sided t-tests on each of the predictors. As we studied on Friday, the corresponding p-values appear in the last column of the standard output from summary(). This is how we can collect that last column:

tt = summary(lm.1)
pv = tt$coefficients[,4]  # note: the first entry is the intercept's p-value

And, finally, let's count how many of the p-values are less than 0.05:

TF = pv < 0.05  # Are the p-values less than 0.05 (TRUE) or not (FALSE)?
sum(TF)         # Count the p-values that are less than 0.05.

f) In words, what is the conclusion that follows from comparing part e with the code you just went through?

Even though none of the predictors is useful, t-testing them one at a time can suggest otherwise. FYI: Cases where something is not useful, but some test says it is, are called "false positives." In fact, the number of false positives here is about 5, i.e., a fraction alpha (5%) of the 100 predictors.
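As an aside (beyond the quiz), one standard remedy for such false positives is to adjust the p-values for multiple testing before counting, e.g., with a Bonferroni correction via base R's p.adjust():

pv.adj = p.adjust(pv[-1], method = "bonferroni")  # [-1] drops the intercept's p-value
sum(pv.adj < 0.05)  # expect (essentially) zero, since every predictor is pure noise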
# Morals:
# Part I:   The chi-squared test of *proportions* actually depends on the counts in each of the k categories!
# Part II:  The 1-way ANOVA F-test is a test of k means.
# Part III: When you have a lot of predictors, try to avoid doing t-tests on each of them one at a time, because otherwise you'll get false positives.

BTW, if/when you come across a problem with many predictors, here is a more efficient way of supplying the formula to lm() (as suggested by our TA Qiliang):

colnames(x) = paste0("x", 1:100)
fmla = as.formula(paste("y ~ ", paste(colnames(x), collapse = "+")))
lm.1 = lm(fmla, data = x)
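An even shorter alternative worth knowing: if the response is placed in the same data frame as the predictors, the "." shorthand in a formula stands for "all other columns as predictors":

dat = data.frame(y, x)     # one data frame holding y and x1, ..., x100
lm.1 = lm(y ~ ., data = dat)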