QUIZ 3 (questions are in italics)

This quiz deals with distributions and samples taken from them.

Part I: Language

As explained in class, in statistics we use distributions to represent populations. In fact, in 390 we can think of distributions and populations as one and the same thing. Specifically, when we talk about "a sample from a population," what we really mean is "a sample from a distribution." The prelab has explained how to do that for some distributions, but for some distributions the language and the symbols can get confusing. As such, every time you hear "a sample from a distribution," translate that into what the actual data in the sample would look like. Consider the following examples, and their translations:

"A sample of size n from a Bernoulli distribution with parameter pi."
Translation: n numbers, each either 0 or 1, and the proportion of 1's is around pi.

"A sample of size m from a Binomial distribution with parameters n and pi."
Translation: m numbers, each an integer between 0 and n. (Later, you'll see how pi affects the sample.)

"A sample of size n from a Normal distribution with parameters mu and sigma."
Translation: n numbers, each between -infinity and +infinity, with most around mu, and a typical deviation around sigma.

a) Write code to take a sample of size 100 from a
- Binomial distribution with parameters n = 20, pi = 0.5. (This is the one where the language and symbols get confusing; so, read the help pages carefully, and feel free to tinker/experiment to figure out what you want.)
- Normal distribution with parameters mu = 10, sigma = 2.

rbinom(100, 20, 0.5)
rnorm(100, 10, 2)

Part II: N(mu,sigma)

Let's learn how to do for-loops in R, so that we can be more efficient. To that end, study the following example:

x = numeric(10)        # Allocate space for an array of size 10.
x[] = NA               # Fill it with NA (Not Available, i.e., missing).
for(i in c(1,2,3)){    # i takes the values 1, 2, 3.
  x[i] = 2*i           # Store 2i in the ith element of x.
}
x                      # Look at x.

Now, I have claimed that mu and sigma of the normal distribution control its center and its width, respectively. Let's see how those parameters affect the histogram of a sample taken from that distribution. Specifically,

b) Write code to make a 3x3 panel of figures, with each panel showing the histogram of a sample of size 500 taken from N(mu,sigma). Let the columns (or rows) have values mu = -3, 0, 3, and the rows (or columns) have sigma = 1, 2, 3. Limit the x-axis to the interval (-15, 15). Hint: The example in the prelab that shows how to overlay two histograms also shows how to limit the x-axis.

n = 500
par(mfrow=c(3,3))
for(mu in c(-3,0,3)){
  for(sigma in c(1,2,3)){
    x = rnorm(n, mu, sigma)
    hist(x, xlim=c(-15,15))
  }
}

In class and in hw, we've learned how the mathematical change of variable z = (x - mu)/sigma transforms the normal *distribution* with parameters mu and sigma into the standard normal *distribution*. Let's confirm that empirically, i.e., on a sample. To that end,

c) Take a sample of size 1000 from N(2,3), transform the data to z = (x - mu)/sigma, and make a histogram of the resulting transformed data.

n = 1000
x = rnorm(n, 2, 3)
z = (x - 2)/3
hist(z)   # Note that the center and width are about 0 and 1, respectively.
# Later, we'll learn about a similar transformation where mu and sigma
# are replaced by quantities we compute from the sample itself
# (see the sketch below).
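FYI (not part of the quiz): here is a minimal sketch of that later transformation, assuming mu and sigma are simply replaced by the sample mean and sample standard deviation of the same x from part (c).

z2 = (x - mean(x))/sd(x)   # mean(x) and sd(x) play the roles of mu and sigma.
hist(z2)                   # Center and width are again about 0 and 1.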
Part III: Percentile

One example of a random variable that follows the exponential distribution is the inter-arrival time. For example, suppose you record all the times at which an advertisement gets clicked. If you look at the histogram of the times between clicks, you'll get something that very closely approximates the exponential distribution. The comparison between the histogram and the distribution can be done in a variety of ways, but one way is to compare their quantiles. In fact, later, we'll learn how this idea is the basis of something called a qq-plot (a preview sketch appears at the end of this quiz). Until then,

d) Write code to
- take a sample of size 1000 from an exponential distribution with parameter 1,
- find and report the 0.1, 0.5, and 0.9 quantiles of that sample,
- find and report the 0.1, 0.5, and 0.9 quantiles of the distribution.
IMPORTANT: Start your code with set.seed(123)

set.seed(123)
x = rexp(1000, 1)
quantile(x, c(0.1, 0.5, 0.9))   # 0.09862473 0.73116671 2.27992175
qexp(c(0.1, 0.5, 0.9), 1)       # 0.1053605 0.6931472 2.3025851 (i.e., pretty close).

Part IV: Sample size

One of the most common misunderstandings among students is to think that the width of a histogram (i.e., the spread of data) increases with sample size. Let's see how that is completely wrong. Because it's hard to compare many histograms, we're going to make 10 comparative boxplots, each one for a sample of some size taken from a normal distribution, and with the sample sizes increasing across the boxplots. This simulation is especially easy to do if we know how to make a matrix in R. To that end, study the following code:

x = matrix(nrow=6, ncol=7)   # Allocate space for a 6x7 matrix.
x[,] = NA                    # Fill it with NA.
for(i in 1:4){               # For each of the first 4 columns,
  x[1:3, i] = i*c(1,2,3)     # insert three numbers in the first three rows.
}
x                            # Look at x to see what we've done.
boxplot(x)                   # Makes a boxplot per column; the first 4 have 3 cases each.

e) Write code to make 10 (comparative) boxplots, each for samples of size 100, 200, 300, ..., 1000, taken from N(2,3).

x = matrix(nrow=1000, ncol=10)   # Allocate space for the 10 samples.
for(i in 1:10){
  n = i*100
  x[1:n, i] = rnorm(n, 2, 3)     # The 1:n is important.
}
boxplot(x)
boxplot(x, range=0)   # FYI: Since one student asked about outliers:
                      # range=0 treats them as part of the data.

Morals:
- Make sure you're clear what it means to "take a sample from a pop/dist."
- The mathematical notion of standardization applies to the data/sample as well.
- Quantiles/Percentiles are the basis of comparing samples and distributions.
- The width/spread of data does NOT increase with sample size.
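FYI (not part of the quiz): as a preview of the qq-plot idea mentioned in Part III, here is a minimal sketch; it assumes we simply plot the quantiles of the exponential sample from part (d) against the corresponding quantiles of the exponential distribution.

set.seed(123)
x = rexp(1000, 1)                 # Sample from the exponential distribution.
p = ppoints(1000)                 # Probabilities at which to compare quantiles.
plot(qexp(p, 1), quantile(x, p),  # Distribution quantiles vs. sample quantiles.
     xlab="Distribution quantiles", ylab="Sample quantiles")
abline(0, 1)                      # Points near this line mean the sample matches the distribution.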