QUIZ 5 (The questions are in italics)

Part I: qq-plot

In the prelab, you learned how to make a qqplot "by hand," i.e., without using qqnorm(). The advantage of the "by hand" version is that you can put quantiles of different distributions on the x-axis, e.g., quantiles of a uniform distribution. That way, a straight line in the qqplot is evidence that your data come from a uniform distribution. To make that point clear, 

a) Write code to take a sample of size 1000 from Unif(2,4) and make a qqplot for this sample, but with the quantiles of Unif(0,1) on the x-axis.

  x = runif(1000,2,4)
  n = length(x)
  X = seq(.5/n, 1-.5/n, length=n)  # seq(0,1,length=n) is sufficient, as well.
  Q = qunif(X, 0,1)
  plot(Q, sort(x))

b) Write code to superimpose on the above qqplot a line whose intercept and slope capture the overall pattern of the qqplot. Explore only integer values of slope and intercept.

  abline(2,2)

c) What is the relationship between the intercept and slope found above and the parameters of the uniform distribution, Unif(a,b), from which the sample is taken? Give your answer in the form of "intercept = ..." and "slope = ..." where the "..." are expressions/functions of the parameters a and b. Hint: you have had a hw related to this, but I cannot tell you which hw.

# intercept = a 
# slope = (b-a)
# The main purpose of this question is to make a connection between the 
# theoretical calculation done in hw_lect9_4 and the simulation done here.
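
FYI: The relationship in part c can also be checked numerically. The p-th quantile of Unif(a,b) is a + (b-a)*p, and p itself is the p-th quantile of Unif(0,1), which is where "intercept = a, slope = b-a" comes from. The sketch below is only an illustration, not part of the required answer; the names a_par, b_par, n_obs, samp, p_pos, q_std, and fit_unif are made up here, and the seed is set only so that the run is reproducible.

```r
# Sketch: check "intercept = a, slope = b - a" for a Unif(a,b) sample.
# All variable names below are illustrative, not from the quiz.
set.seed(1)                                   # reproducibility only
a_par = 2
b_par = 4
n_obs = 1000
samp  = runif(n_obs, a_par, b_par)
p_pos = seq(.5/n_obs, 1 - .5/n_obs, length = n_obs)
q_std = qunif(p_pos, 0, 1)                    # quantiles of Unif(0,1)

# Theory: quantiles of Unif(a,b) are a linear function of those of Unif(0,1).
stopifnot(isTRUE(all.equal(qunif(p_pos, a_par, b_par),
                           a_par + (b_par - a_par) * q_std)))

# Simulation: a least-squares line through the qqplot should have
# intercept near a and slope near b - a.
fit_unif = lm(sort(samp) ~ q_std)
coef(fit_unif)
```

Changing a_par and b_par and rerunning should move the fitted intercept and slope accordingly.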

d) Write code to take a sample of size 1000 from an exponential distribution with parameter 2, i.e., from Exp(2), and make a qqplot for this sample, but with the quantiles of Exp(1) on the x-axis.
Additionally, on this qqplot superimpose a line with the intercept and slope you reported in part b. Remember to check the posted soln, later.

  x = rexp(1000,2)
  n = length(x)
  X = seq(.5/n, 1-.5/n, length = n)
  Q = qexp(X, 1)
  plot(Q, sort(x))
  abline(0,1/2)

# A theoretical calculation similar to that in hw_lect9_4 shows that the 
# intercept and slope of such a qqplot are 0 and 1/lambda, respectively. 
# You can see the theoretical calculation in test2 of Spring 2019.
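
FYI: The same kind of check works for the exponential case. The p-th quantile of Exp(lambda) is -log(1-p)/lambda, which is 1/lambda times the p-th quantile of Exp(1); hence intercept 0 and slope 1/lambda. The sketch below is an illustration under that formula; the names lambda, n_exp, samp_exp, p_exp, q_exp1, and fit_exp are made up here.

```r
# Sketch: check "intercept = 0, slope = 1/lambda" for an Exp(lambda) sample.
# All variable names below are illustrative, not from the quiz.
set.seed(2)                                   # reproducibility only
lambda   = 2
n_exp    = 1000
samp_exp = rexp(n_exp, lambda)
p_exp    = seq(.5/n_exp, 1 - .5/n_exp, length = n_exp)
q_exp1   = qexp(p_exp, 1)                     # quantiles of Exp(1)

# Theory: quantiles of Exp(lambda) equal (1/lambda) * quantiles of Exp(1).
stopifnot(isTRUE(all.equal(qexp(p_exp, lambda), q_exp1 / lambda)))

# Simulation: a least-squares line through the qqplot should have
# intercept near 0 and slope near 1/lambda.
fit_exp = lm(sort(samp_exp) ~ q_exp1)
coef(fit_exp)
```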

Part II: Correlation

As you know, R comes with many data sets, one of which is called anscombe.
To "see" the data, simply type/run anscombe ; and if you want to know even
more type ?anscombe , but that's not necessary here. All you need to know
is that it looks like a matrix, and so you can select different columns
by treating it like a matrix. Alternatively, you can use the names that are
already given to the columns; but to be able to use those names, you must run 

  attach(anscombe)      # The attach() function allows you to use the names given to the columns of anscombe.

e) Write code for computing the correlation cor(xi,yi) for i = 1, 2, 3, 4. Hint: You don't need a for-loop for this. 

  cor(x1,y1)  # 0.8164205
  cor(x2,y2)  # 0.8162365
  cor(x3,y3)  # 0.8162867
  cor(x4,y4)  # 0.8165214 

# FYI: Alternatively,
  for(i in 1:4) {
    print(cor(anscombe[,i], anscombe[,(i+4)]))
  }
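
FYI: A loop-free alternative is also possible, because cor() accepts two sets of columns and returns the matrix of all pairwise correlations. The sketch below assumes the column order of anscombe is x1..x4 followed by y1..y4 (as the loop above also assumes); the name r_pairs is made up here.

```r
# cor() on two blocks of columns gives the 4 x 4 matrix of cor(xi, yj);
# its diagonal holds the four matched values cor(xi, yi).
r_pairs = diag(cor(anscombe[, 1:4], anscombe[, 5:8]))
names(r_pairs) = paste0("cor(x", 1:4, ",y", 1:4, ")")   # labels for printing
r_pairs
```

The off-diagonal entries (e.g., cor(x1,y2)) are computed too but are not needed here, which is why only the diagonal is kept.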

f) Note that the correlation values are very similar. Write code for exploring the relationship between xi and yi, i = 1, 2, 3, 4. For i = 2, 3, 4, accompany your code with an explanation of why the correlation is the same as that between x1 and y1. Use an appropriate tool for exploration of relationships.

# The appropriate tool for exploring relationships is the scatterplot.

   plot(x1,y1)

   plot(x2,y2)  # The relationship is sufficiently nonlinear to reduce r from 1 to 0.816

   plot(x3,y3)  # The outlier is just in the "right" place to reduce r from 1 to 0.816

   plot(x4,y4)  # The outlier is just in the "right" place to increase r from 0 to 0.816
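
FYI: It can help to see the four scatterplots side by side, each with its own OLS line and r in the title. The sketch below is one way to do that; it indexes the columns of anscombe directly (so it does not depend on attach()), and the names op, ols, xi, yi, and fit_i are made up here.

```r
# Sketch: the four anscombe scatterplots in one 2 x 2 figure.
op  = par(mfrow = c(2, 2))                 # 2 x 2 grid of panels
ols = matrix(NA, nrow = 4, ncol = 2,
             dimnames = list(paste0("pair", 1:4), c("intercept", "slope")))
for (i in 1:4) {
  xi = anscombe[, i]                       # columns 1-4 are x1..x4
  yi = anscombe[, i + 4]                   # columns 5-8 are y1..y4
  fit_i = lm(yi ~ xi)
  ols[i, ] = coef(fit_i)                   # keep the fitted line for comparison
  plot(xi, yi, xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("r =", round(cor(xi, yi), 3)))
  abline(fit_i)                            # OLS line for this pair
}
par(op)                                    # restore the previous layout
ols
```

The printed table of intercepts and slopes shows that not only r but also the fitted lines are nearly the same across the four pairs, even though the scatterplots look very different.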

Part III: Regression

We often hear about how the human brain weighs about 2 pounds, etc. Have you wondered how much your brain weighs?! Well, we can find out, if we know how big your head is.  This is a perfect example for regression because brain weight is hard to measure, while head size is easy to measure. The 4 columns in the dataset brainhead_dat.txt on the course website correspond to gender (1=Male, 2=Female), age (1=between 20-46, 2=over 46), head size (cm^3), brain weight (grams).

g) Write code to 
- read in the data, 
- make a scatterplot appropriate for predicting brain weight from head size, 
- superimpose the OLS line on the scatterplot.

   dat = read.table("http://sites.stat.washington.edu/marzban/390/spring21/brainhead_dat.txt", header=TRUE)
   x = dat[,3]
   y = dat[,4] 
   plot(x,y)          # "hard" variable on y-axis, "easy" variable on x-axis.
   lm.1 = lm(y ~ x)   
   abline(lm.1)

h) Suppose your head size is 3760 cm^3. Based on the model developed in the previous part, what is your brain weight?

   lm.1                       # y = 325.5734 + 0.2634 * x
   325.5734 + 0.2634 * 3760   # 1315.957 grams
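
FYI: Instead of retyping rounded coefficients, predict() can do the arithmetic with the full-precision coefficients. The sketch below uses the built-in anscombe data (y1 vs. x1) only so that it runs without downloading anything; the names fit_demo, new_pt, pred_auto, and pred_hand, and the value x1 = 10, are made up for illustration. For the brain data, the analogous call would be predict(lm.1, newdata = data.frame(x = 3760)), assuming lm.1 was fit as lm(y ~ x) in part g.

```r
# Sketch: prediction from a fitted OLS line, two ways.
fit_demo  = lm(y1 ~ x1, data = anscombe)
new_pt    = data.frame(x1 = 10)            # column name must match the predictor
pred_auto = predict(fit_demo, newdata = new_pt)
pred_hand = coef(fit_demo)[1] + coef(fit_demo)[2] * 10   # intercept + slope * x
c(predict = unname(pred_auto), by_hand = unname(pred_hand))
```

The two numbers agree; the by-hand version only differs in practice when the coefficients are rounded before being typed in.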
  
Morals:
- qqplots can be used to test whether your data come from *any* given distribution. All you need to do is to put quantiles of that distribution on the x-axis. Then a linear qqplot is evidence that your data come from that distribution. 
- Additionally, the intercept and slope of the qqplot are related to parameter(s) of the distribution; those relationships can be obtained from the formula for the q^th quantile of the distribution.
- clusters and outliers can adversely affect r.
- With regression, you can predict hard things from easy things.