UW Stat 311

Lab 1 : Basics and Introduction to R

Winter 2015

With thanks to Eli Gurarie and Martina Morris

Introduction

The following examples and exercises should give you a first look at what R does and how it works.

R is a command-line program, which just means that commands are entered line by line at the prompt. Like any programming language, it is finicky: everything has to be entered exactly right, and names are case-sensitive.

There are two ways of entering commands. One is to type them carefully into the “Console Window” (the lower-left window in RStudio) and hit Enter. The other is to write and edit lines in the script window (upper-left window in RStudio) and “pass” the code to the console by hitting Ctrl-Enter or clicking the Run icon.

For the most part, we will only be running single commands, so the Console window is sufficient. But in general, it is smarter to do all of your coding in a script window and then save the raw code file as a text document, which you can revisit and re-run at any point later.

R is a calculator

1 + 2
## [1] 3
3^6
## [1] 729
sqrt((20 - 19)^2 + (19 - 19)^2 + (19 - 18)^2)/2
## [1] 0.7071
12345 * 54312
## [1] 670481640

and so on.

Assigning variable names

The assignment operator is <-. It is supposed to look like an arrow pointing left.
An alternative is to use the = sign.

X <- 5
Or
X = 5

Notice that using the assignment operator sets the value of X but doesn't print any output. To see what X is, you need to type

X
## [1] 5

Notice also that X now appears in the upper-right panel of RStudio, letting you know that there is now an object in memory called X.

Now you can use X as if it were a number

X * 2
## [1] 10
X^X
## [1] 3125

Note that you can name a variable almost ANYTHING, as long as the name starts with a letter and contains only letters, digits, periods, and underscores.

Fred <- 5
Nancy <- Fred * 2
Fred + Nancy
## [1] 15
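
Here is a small sketch of how naming behaves in practice. The names fred and shoe_size.2015 are made up for illustration; any names that follow the same rules would do.

```r
Fred <- 5
fred <- 50                 # a different object: R distinguishes upper and lower case
Fred + fred                # 55

shoe_size.2015 <- 9        # letters, digits, underscores and periods are all fine
exists("shoe_size.2015")   # TRUE
exists("FRED")             # FALSE in a fresh session: nothing is stored under this spelling
```

If you ever get an "object not found" error, a mismatch in upper and lower case is one of the first things to check.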

Vectors

Of course, X can be much more than a single number. Another simple object in R is a “vector”, which is a series of numbers (and therefore resembles “data”). We create it as follows:

X <- c(3, 4, 5)  # sets X equal to the vector (3,4,5)
X
## [1] 3 4 5
Y <- c(1:10)  # sequence of whole numbers from 1 to 10
Y
##  [1]  1  2  3  4  5  6  7  8  9 10

c() is a function - a very very useful function that creates “vectors”. In all functions, arguments are passed within parentheses.
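
c() is not the only way to build a vector. As a quick sketch, the colon operator and the base R functions seq() and rep() cover the most common patterns. The object names mixed, evens and zeros are arbitrary choices for this example.

```r
mixed <- c(1:3, 10)            # c() can combine a sequence with extra values
mixed                          # 1 2 3 10
evens <- seq(2, 10, by = 2)    # from 2 to 10 in steps of 2
evens                          # 2 4 6 8 10
zeros <- rep(0, times = 4)     # the same value repeated four times
zeros                          # 0 0 0 0
length(mixed)                  # 4
```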

Now, let's do some arithmetic with the vector X:

X + 1
## [1] 4 5 6
X * 2
## [1]  6  8 10
X^2
## [1]  9 16 25
((X + X^2/2)/X)^2
## [1]  6.25  9.00 12.25

Note that in all of these cases, the arithmetic operations are performed on a term-by-term basis.
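
The same term-by-term rule applies when both sides of an operation are vectors. A minimal sketch, assuming a second vector W of the same length as X (W and its values are invented for this example):

```r
X <- c(3, 4, 5)
W <- c(10, 20, 30)    # hypothetical second vector, same length as X
X + W                 # 13 24 35   (3+10, 4+20, 5+30)
X * W                 # 30 80 150  (3*10, 4*20, 5*30)
sum_sq <- sum(X^2)    # square each term first, then add them up
sum_sq                # 50
```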

You can also make a vector of character strings:

Names <- c("Alice", "Boris", "Chaozhi", "Diego", "Eliza")
Names
## [1] "Alice"   "Boris"   "Chaozhi" "Diego"   "Eliza"

Exercise:

Make a vector called Data that contains, in order, the number of students sitting in your row, the number of people living in your house/apartment/dorm suite, the number of pets you have ever had, your age, and your shoe size.

Make another vector called Names, which contains short names for each of these numbers.

Vectors and Functions

Now that we have some data, we can study them with simple functions:

Data
## [1]  1  2  8 36  9
Names
## [1] "Students" "Roomies"  "Pets"     "Age"      "Shoesize"
sum(Data)
## [1] 56
length(Data)
## [1] 5

Note that if you try to do something mathematical with characters, you get an error.

sum(Names)
## Error: invalid 'type' (character) of argument
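
If you are not sure what kind of values a vector holds, you can ask R before doing any math. The sketch below uses the example Data and Names values shown above, together with the base R functions class(), is.numeric() and nchar().

```r
Data  <- c(1, 2, 8, 36, 9)
Names <- c("Students", "Roomies", "Pets", "Age", "Shoesize")

class(Data)          # "numeric"
class(Names)         # "character"
is.numeric(Names)    # FALSE, which is why sum(Names) fails
nchar(Names)         # 8 7 4 3 8: a function that does work on characters
```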

You can learn more about how functions work by using “?”, e.g. ?sum, ?length.

A help window will appear (in the lower-right panel in RStudio) with all sorts of information (only occasionally useful) about the functions. Often, there are some useful examples at the bottom of the help window.

Exercise:

Create an object called N which is the length of your Data, and use that object with the sum() function to calculate the arithmetic mean of your data.

Follow-up: Try applying the mean() function to Data. Does it agree with your calculation above?

Plotting

It is easy (but not very interesting) to plot the Data:

plot(Data)

Notice that the figure appears in the lower right panel of the RStudio screen. Try using the “Zoom” command in the Plot tab command bar.

A “barplot” might be a little nicer:

barplot(Data)

And we can label the bars by adding the names.arg “argument” (note that “argument” refers to any information that you give to a function).

barplot(Data, names.arg = Names)

Note that you can export the image using the Export button in the Plot command bar.

Here's a pie chart:

pie(Data, labels = Names)
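
Another option, sketched here with the example values from above, is to attach the labels to the vector itself using names(). A named vector carries its labels along, so you do not have to pass them to the plotting function each time.

```r
Data  <- c(1, 2, 8, 36, 9)
Names <- c("Students", "Roomies", "Pets", "Age", "Shoesize")

names(Data) <- Names    # store the labels inside the vector
Data["Age"]             # pick out one value by its name
barplot(Data)           # the bars are labelled from names(Data)
```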

Exercise:

The International Rhino Foundation estimates that there are 17,800 rhinoceroses (Rhinocerotidae spp.) living in the wild in Africa and Asia. Here is a breakdown of their numbers:

   Species Population
1    Black       3610
2    White      11330
3 Sumatran        300
4    Javan         60
5   Indian       2500

Make a barplot and pie chart of these data.

Loading Data

Loading data into R can be done via several possible functions, or using the Import Dataset tool in RStudio (upper right pane, Environment tab command bar).

We will load some data on the clutch sizes (number of pups born) of great white sharks (Carcharodon carcharias). It is in a file called sharkclutch.csv, which is a text file containing:

pups
3
5
4
5
5
9
8
7
5
...

Except with more data.

You can download the file from the Lab data sets page of the Canvas course website.

To load a simple string of numbers like this, you can use the scan() function, and to do this you need to provide the location of the data. For example, if I put my shark data in the c:\Users\myname\data\ folder, then I load it via:

pups <- scan("c:/Users/myname/data/sharkclutch.csv", skip = 1)

The skip = 1 argument tells scan() to skip the first line of the file (the one with the word “pups”) and read in only the numbers. Note also that the slashes in the path are forward slashes, not backslashes.

To avoid having to type in the full pathname for the file, you can set your “working directory” to be the one where that file is stored. Open the Session dropdown menu in the main RStudio menu bar up top, choose Set Working Directory and then Choose Directory, and navigate to the directory with the data. Then you can just type:

pups <- scan("sharkclutch.csv", skip = 1)

Another way is to use the file.choose() function:

pups <- scan(file.choose(), skip = 1)

and then navigate to and click on the file in your usual file-finder window, which will open. Personally, I find this last method easiest.
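
The working directory can also be checked and changed from the Console with getwd() and setwd(). This is only a sketch: tempdir() stands in for your own data folder (for example "c:/Users/myname/data"), and old_dir is just a name chosen here.

```r
old_dir <- getwd()     # remember where R is currently looking
setwd(tempdir())       # stand-in: replace tempdir() with the path to your data folder
getwd()                # confirm the change
list.files()           # list the files R can see in this folder
setwd(old_dir)         # switch back to where you started
```

Running list.files() before scan() is a quick way to confirm that the file name is spelled the way R sees it.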

Download the data from the course website (on the Labs page), and give this a try.

The scan function puts the data in a vector object. Note that after you read it in, it is listed under the Values section of the Environment tab in the upper right pane. To see what's inside, you need to type the name of the object (pups) in the Console window.

Alternatively, you can read in the data using the Import Dataset command on the Environment tab in the upper right pane. Note that this brings the data in as a data.frame object, gives the data.frame the filename by default and the variable name “pups” is taken from the first row of the file. As a data.frame, it is now listed under the Data section of the Environment pane. That allows you to query the contents by clicking the icons associated with it. To make it easy to access the pups variable, we attach the data.frame:

attach(sharkclutch)
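
An alternative to attach() is to pull a column out of the data.frame with the $ operator. The sketch below is self-contained so you can run it anywhere: it writes a tiny stand-in file holding only the nine values listed earlier (not the full data set) and reads it back both ways. The names tmp, pups_vec and shark_df are made up for this example, and read.csv() is used here as a console equivalent of importing the file as a data.frame.

```r
# Write a small stand-in file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("pups", "3", "5", "4", "5", "5", "9", "8", "7", "5"), tmp)

pups_vec <- scan(tmp, skip = 1)    # a plain numeric vector
shark_df <- read.csv(tmp)          # a data.frame with one column called pups

shark_df$pups                      # access the column without attach()
identical(as.numeric(shark_df$pups), pups_vec)   # TRUE: same numbers either way
```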

Exploring the shark data

Now that we have a nice long data set, we can start exploring it, with a whole bunch of useful, simple R functions.

length(pups)
## [1] 80
sort(pups)
##  [1]  3  3  4  4  4  4  4  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6
## [24]  6  6  6  6  6  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  8
## [47]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9  9
## [70]  9  9  9  9 10 10 10 10 11 11 12
mean(pups)
## [1] 7.125
median(pups)
## [1] 7
range(pups)
## [1]  3 12

We note that these data can only take discrete, integer values (1, 2, 3, 4, …). So one convenient way to summarize them is to make a table. The function for this is, predictably, table().

table(pups)
## pups
##  3  4  5  6  7  8  9 10 11 12 
##  2  5 10 11 17 17 11  4  2  1
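
A table holds enough information to rebuild the sorted data and to recompute the mean. As a sketch, the counts below are copied from the table output above; the names sizes, counts and pups_sorted are chosen just for this example.

```r
sizes  <- 3:12
counts <- c(2, 5, 10, 11, 17, 17, 11, 4, 2, 1)

pups_sorted <- rep(sizes, times = counts)   # each size repeated as often as it occurs
length(pups_sorted)                         # 80
mean(pups_sorted)                           # 7.125
sum(sizes * counts) / sum(counts)           # the same mean, computed from the table
prop.table(table(pups_sorted))              # proportions instead of counts
```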

Plotting some quantitative data

We can make a barplot of the clutch sizes of pups in this table:

barplot(table(pups))

Note, we can add labels and colors with a few extra arguments

barplot(table(pups), main = "White shark clutch size distribution", xlab = "clutch size", 
    ylab = "number of sharks", col = rainbow(12))

(A loud color scheme! Try this one and see if you like it better:)

barplot(table(pups), main = "White shark clutch size distribution", xlab = "clutch size", 
    ylab = "number of sharks", col = terrain.colors(12))

Another way to illustrate these data is with a histogram. The following commands make a frequency and density histogram, respectively:

hist(pups, col = "grey", main = "Frequency")
hist(pups, col = "grey", freq = FALSE, main = "Density")

They are obviously very similar to each other … except for one, very important difference.
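
If you want to check your answer, hist() can return the numbers behind the picture instead of drawing it. In this sketch the data are rebuilt from the table of counts shown earlier, so pups_sorted is a stand-in for your own pups vector, and h is just a name for the result.

```r
pups_sorted <- rep(3:12, times = c(2, 5, 10, 11, 17, 17, 11, 4, 2, 1))

h <- hist(pups_sorted, plot = FALSE)   # compute the histogram without plotting it
h$breaks                               # the edges of the bins
h$counts                               # bar heights when freq = TRUE (the default)
h$density                              # bar heights when freq = FALSE
sum(h$counts)                          # 80
sum(h$density * diff(h$breaks))        # 1 (height times width, added over all bars)
```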

Note also that the histogram looks different from the bar plot in a few key ways. What are they? And why?

Exercises:

Now would be a good time to work on your Lab 1 assignment.

The goal of the lab is to see how an outlier impacts the mean, median, and standard deviation of samples of varying size.
The rnorm() function in R generates samples from a Normal distribution; by default these have mean 0 and standard deviation 1.
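
To get you started, here is a rough sketch of how rnorm() can be used. The seed, the sample size of 20, and the outlier value of 50 are all arbitrary choices for illustration and are not part of the assignment.

```r
set.seed(311)                        # arbitrary seed, so the same numbers come up each run
x <- rnorm(20)                       # 20 draws with the default mean 0 and sd 1
y <- rnorm(20, mean = 10, sd = 2)    # the mean and sd arguments change the defaults

x_out <- c(x, 50)                    # the same sample plus one hypothetical outlier
c(mean(x), median(x), sd(x))
c(mean(x_out), median(x_out), sd(x_out))
```

Compare the two rows of output, then try again with a different sample size.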

The End!

… with a little post-script:

For some silliness, run the following bit of code:

par(mar = c(0, 0, 0, 0))
for (i in 1:300) {
    cols <- rainbow(i, alpha = 1:i/i)
    Z <- complex(mod = sqrt(1:i), arg = 1:i + i/20)
    plot(Z, col = cols, pch = 19, cex = sqrt(i:1), asp = 1)
}