The goal of this assignment is for you to get familiar with the R statistical programming environment and R language syntax.
R is a powerful environment for statistical computing. You will see R again in other classes and you will no doubt use R again many times during your graduate studies and professional work lives.
R is an open source, free version of the S-plus statistical programming language sold by Insightful. One nice feature of R is that it is free and community-supported. You can install it on as many machines as you like, without worrying about copyright issues. And if you find a bug, you can fix it and contribute your fix to the community, thus giving yourself bragging rights for many years to come. And, of course, since S-plus and R are similar languages, almost anything you learn in R should apply to S-plus, and vice versa.
There are many outstanding R textbooks and references books. Books I've used and liked include:
After the introductory lecture, please work through the exercises that follow and answer each question. Cut and paste your answers into a report, along with any plots, and hand them in as usual at the beginning of class on the due date.
Visit the R project Web site - http://www.r-project.org and look around.
There are two major sites for R - the r-project.org site (just one) and numerous identical CRAN mirror sites. CRAN stands for Comprehensive R Archive Network and offers links to instructions for downloading and installing R for a variety of platforms.
Then take a look at the Bioconductor Web site: http://www.bioconductor.org. Bioconductor is a large collection of packages designed mostly for microarray data analysis. Be aware that one of the challenges of the Bioconductor site is that there are many, many contributions. Navigating and understanding them is not a trivial task!
Start the R interpreter, a program that evaluates R commands interactively. How you do this will vary depending on the platform you are using. In Windows, you just double-click the "R" icon or launch R using the "Start" menu. On Unix systems with R installed, open a terminal window and type "R" at the command prompt.
On Apple computers, you can install R Aqua, a GUI-based version of R similar to the version available on Windows. You can also install a version of R that runs in an X server or in the Terminal. Either one should work fine for this assignment.
Once you've started R, you'll see a ">" symbol at the left of the interpreter window (Windows) or terminal (Unix). This is the R command prompt. To use R, you type commands into the R intepreter, hit return, and then R executes the commands. Note that this is very similar to how python operates. Once you have started an R session, you can type commands and see the results right away.
The R language contains many "built-in" functions (commands) for manipulating and analyzing data as well as for navigating your computer's file system. Fortunately, all of these functions are documented as part of the on-line help functionality packaged with each R distribution. (More on this later.)
If you are doing a lot of complicated things, it is usually a good idea to have another screen open with a text editor (like emacs) for text editing so that you can save your commands. To get R to execute the commands you've saved in a file, you can use the "source" command like so:
> source("afile.R")
where "afile.R" is a file containing the R commands you would like
to execute.
One neat feature of R's "source" command is that if your computer is connected to the Internet, you can even run commands contained in files at remote locations. To see how this works, type this into the R interpreter window - don't type the ">" symbol:
> source("http://teaching.transvar.org/r/hello.R")
Take a look at the text in this file http://teaching.transvar.org/r/hello.R using your Web browser.
> quit()
R will then ask you if you want to save your data environment. If you answer "yes," the next time you launch the R interpreter, all the variables you defined previously will be available to you, and you can start up again exactly where you left off.
The R interpeter contains built-in commands for manipulating and analyzing data that you import from files or create "randomly" using R's pseudo-random number generators (very useful for simulation). Fortunately, these commands are documented as part of the on-line help functionality packaged with each R distribution. Less fortunately, the documentation is not always easy for a beginner to understand. However, the more you use the help pages, the easier it will become for you to master the language. So even if the "help" pages seem confusing at first, you should make a point of taking time to read these pages carefully for each new command you use.
Start the on-line help function. To do this, type:
> help.start()
Note that invoking a command in R requires you to type the name of the command plus a pair of parentheses. In this case, you have invoked "help.start()" with no arguments, which would normally be placed within the parentheses.
The "help.start" command launches a Web browser which shows a page containing the table of contents for the on-line help system. To view the help page for the rnorm command, which lets you generate random numbers sampled from the normal distribution, you would type something like:
> help('rnorm')
You can also see these commands in action by typing something like:
> example(rnorm)
QUESTION 1: How do you use the help command to search for functions that you don't already know the name of? Hint: help(help) and go to the end of the page to the "See also" section.
Like python, you can use R as a calculator. Just as with python, when you type a command and hit return, the R interpreter parses the command and then executes (evaluates) it, printing the result to the screen. If the command syntax is incorrect, R prints an error message instead of computing any output.
Try it out.
The number of codons in the genetic code:
> 4^3The log, base 2, of 256:
> log(256,base=2)Addition:
> 4 + 5Square roots:
> sqrt(2)To learn about other mathematical functions:
> help('Special')
As with python, a variable in R is symbol that represents a value. To save output to a variable name, use the assignment operator: "<-" greater than symbol followed by a hyphen or an equals (=) sign.
Try out these simple examples:
> x <- 3 > y <- 5 > z = 4 > x + y + z
Note that when you use the assignment operator, no value is returned or printed to the screen. The reason is that the assignment operator makes a change to the current R environment but doesn't return a value.
If you want to see the value of what you have created, you simply type the name of the variable at the prompt like so:
> x
To see a list of the variables (objects) you have created thus far, use the ls command.
>ls()
QUESTION 2: What is the equivalent to R's ls command in python?
You should choose variable names that are easy for you to remember and which are not already assigned to built-in variables or functions.
To get rid of a variable, use the rm command. Try this:
> x > rm(x) > x
To create a vector (list) of elements, use the "c" operator.
Create vectors: z <- c(4,8,10) and v <- c(-1,2,5). Note what happens with:
> z + v > z - v > z * v > v - z > v + 1 > sqrt(z)
You can change the values contained in an array using subscripting notation ([]) and the assignment operator. Try this:
> v[4] <- 6 > v > mean(v) > v[8] <- 15 > v > mean(v) > mean(v,na.rm=T)
Note that R contains an object called NA which stands for "data not available". This is like the None value in python. An data container like a vector or a list can contain multiple NA values. However, performing some statistical functions on vectors or lists containing NA values can yield results you may not expect, as in the example above.
Many commands in R (as with Unix) can take optional arguments. These optional arguments (also called flags) modify the behavior of the command. For example, mean computes the arithmatic average of its first argument, and the the na.rm=T flag tells it to ignore missing values (NA) in the calculation.
QUESTION 3: What option tells the 'mean' command to computed a trimmed average? When are trimmed averages useful? (To answer this question, do a Web search with keywords "trimmed mean".)
To define a vector containing character values, you would enter each character value using double or single quotes. Try this:
> v <- c("hello","world","welcome","to","the class.")
Character vectors are very useful when creating x and y-axis labels, or when you are working with sequence data.
Type in the following commands and note the result:
> paste(v) > paste(v,collapse=" ") > sep=" " > paste(v,collapse=sep)
QUESTION 4: How does R deal with an attempt to create a vector with mixed types? To find out, try this:
> v <- c(1,2,3,"hello") > v
You can use R's "seq" command to create new vectors of numeric values:
> s1 <- seq(1,10) > s2 <- seq(1,100,10)R also provides a shortcut syntax for creating a sequence:
> s3 <- 1:10which is equivalent to "seq(1,10)"
Subscripting in R works as you might expect - square brackets operators allow you to extract values:
>v = seq(1,100) >v[1] >v[1:10]
However, you can also insert logical expressions in the square brackets to retrieve subsets of data from a vector or list. For example:
>v = seq(1,100) >logi = v>95 >logi >v[logi] >v[v<6] >v[105]=105 >v[is.na(v)]
And to negate:
>v[!logi]QUESTION 5: What command would you use to select the components of v that are not missing (NA) values? Use the "v" below to test your answer:
> v <- 1:10 > length(v) > v[15] <- 15 > length(v)
Objects and variables can have different modes, which are similar to types in python. To determine the mode of given object, use the "mode" command.
QUESTION 5: What do the following commands return?
> mode(c("a","b","c"))
> mode(mean)
> mode(1:2)
Vectors in R are like arrays in other languages, where only objects of the same mode can co-exist within the same vector. Thus, if a vector contains strings, it cannot also contain numbers. However, R also has a list data type that lets you store objects of any mode, including other lists. For example:
> l = list()
> l[[1]] = c("hello","world")
> l[[2]] = c(1,2,3)
> l[[3]] = 42
> l
> l[1]
> mode(l[1])
> mode(l[[1]])
In this example, we first create a new list. Next, we add some data to it, using the double-square-brackets sub-scripting operator. When we access data stored in the list l, using the double-brackets retrieves the just the object stored at the requested location. When we access data using the single brackets, we get an object that is itself a list -- a sub-list of the original list l.
QUESTION 6: How does this behavior compare with python? Does python bother with having two different list-like data structures? Why do you suppose that R has both vectors and lists?
QUESTION 7: Use the help function to determine the command you would use to count the number of items in a vector. Does the same command work on lists? Does the python len operator work similarly? What is it called when you can use the same command on objects of many different types?
A matrix in R is just a two-dimensional array of values that all have the same type. A convenient way to create a matrix is to use the matrix function:
> mat <- matrix(1:12,nrow=3)
Create a vector using the following command:
> v <- seq(1,20)
Read the R help section describing the matrix command.
QUESTION 8: What commands would you use to convert the vector v that you defined above to the following matrices?> mat1 [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 1 3 5 7 9 11 13 15 17 19 [2,] 2 4 6 8 10 12 14 16 18 20 > mat2 [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 1 2 3 4 5 6 7 8 9 10 [2,] 11 12 13 14 15 16 17 18 19 20
Data frames are similar to matrices in that vector components of data frames must all be the same length. However, unlike matrices, data frames are actually lists (of class "data.frame") and can therefore contain objects of different modes.
The base R package comes with a number of built-in data sets which you can use to try out some of the features of data frames. To view a list of these built-in data sets, use the "data" command:
> data(package="base")
Load the "women" data frame in your R session and view the data:
> data(women) > women > names(women)
Note that the columns represent variables (height and weight) while the rows represent individual women -- samples. You can access a vector containing the women's heights using the following notations:
> women$height > women[,1] > women[seq(1,2),] > women[1:5,1]
Note how the [n,m] syntax retrieves values from row(s) n and column(s) m. Note also that the values separated by the comma in [n,m] can be sequences, individual values, or no values at all, in which case the all possible values are assumed.
You can fit a linear model (predicting weight based on height) to these data using the "lm" command like so
> model <- lm(women$weight ~ women$height) > model > print(model) > names(model) > mode(model)
Use the 'cor' command to comput Pearson's correlation coefficient:
> r <- cor(women$weight,women$height) > summary(model) > r^2
QUESTION 9: Which value reported by the summary command is equivalent to the square of the correlation coefficient?
One of the most useful features of R is that it provides a number of powerful plotting and graphing functions. To view the relationship between women$weight and women$height graphically, use the "plot" command:
> x <- women$height > y <- women$weight > plot(x,y,xlab="height",ylab="weight",main="women")
Note that the response variable (weight) is plotted on the y-axis and the independent variable (height) is plotted on the x-axis. Plot takes numerous option parameters that allow you to add a title to the plot (main) and label the axes (height, weight).
In R, you build up plots incrementally, adding new graphical elements to existing plots.
Add the line predicted by the model from Exercise 9 above to the plot:
> abline(model)
To get a graphical summary of how well the model fits the data, plot the residuals from the model (predicted - actual) against the independent variable. Obvious trends, such as the residuals increasing as the independent variable increases are a sign that the linear model may not be the best fit. Note that the quality statistics associated with a linear model assume that deviations from the model are merely random error and that the errors are normally distributed.
> plot(women$height,model$residuals)
To see whether the residuals are normally distributed, use a quantile-quantile plot for the normal distribution. This type of plot plots the theoretical quantiles of the normal distribution against the corresponding quantiles of the data in a vector.
To view the plot in a new window without erasing the previous plot, you may need to create a new "graphics device" or plot display window. (The behavior may vary depending on whether you are running Windows, MacOS, or Unix.) On Unix, type: x11() to create a new graphics display.
> qqnorm(model$residuals)
QUESTION 10: Examine the three plots you created. Do taller women tend to be heavier or lighter than the linear model predicts?
R contains two special logical values: TRUE and FALSE. These are given as the result of logical expressions. Try this:
> 0 > 1 > 0 < 1 > 66/3 == 5 > 66/3 == 2
The negation "!" operator means "not," so if you create a logical expression and precede it with "!", you are actually asking for the negation of the expression. Try this:
> TRUE > FALSE > a <- c(0,1,2,3) > a == 0 > a != 0Observe the output from the following logical expressions:
> !TRUE > !!TRUE > !!!TRUE > !c(TRUE,FALSE,TRUE) > a != 1 > !(a == 1)
QUESTION 11: What is the result if you attempt to perform a logical operation on a vector that contains an NA (not available) value?
> a <- seq(1:3) > a[6] <- 6 > a > a < 10
You can use logical vectors to select, and then count, all the values in vector that satisfy a logical expression, such as:
> b <- seq(1:100) > b[b >= 50] > length(b[b >= 50])
The reason this is possible is subtle - in R, you can use vectors as indices to extract values from other vectors. What's happening above is that the expression "b >= 50" creates a new logical vector that is then used as index for "b" to select the values in "b" that are greater than or equal to (>=) 50.
Now enter the following commands to simulate a coin flipping experiment, where "flip.100" is function that computes a realization of a random variable representing the number of heads you get if you flip a coin 100 times.
The first line defines a function and assigns it to the variable name "flip".
> flip.100 <- function() { sum(sample(c(1,0),100,replace=TRUE)) }
> v <- numeric()
> v
> for (i in 1:100) { v[i] <- flip.100() }
> v
> v <= 50
> logi <- v <= 50
> logi
You can use "logi" as an index to select values from v:
> v[logi]
QUESTION 12: How would you select the values from v that are greater than 50 using the negation operator and the vector ("logi") you just created?
Descriptive statistics - such as the average, median, etc. - are useful ways to summarize a data set. However, it's almost always a good idea to view data in a graphical format so that you can use your natural humam ability to spot potentially meaningful patterns and get a sense for the shape of the distribution.
R has several high-level plotting functions that allow you to visualize data. To see some of these functions in action, type the following command and hit return to view, one by one, some of the various kinds of plots R can create:
> demo(graphics)
NOTE: To see the first plot, you may need to hit return twice.
Read the on-line documentation for the "rnorm" and "rexp" commands, which selects random numbers from the normal and exponential distributions. Use it to create two vectors with random numbers sampled from these distributions.
> vexp <- rexp(1000) > vnorm <- rnorm(1000)
At this point, you should start taking advantage of the "source" command, a way for you to execute R scripts - files containing R commands - from inside the interpreter. For convenience, use an editor to type your next commands into a file called "rlab.R" and execute these commands by typing:
> source("rlab.R")
In this exercise you'll make two histograms showing the distribution of values in vexp and vnorm. Use the "par" command to allow you to create two plots in the same window.
The "par" command is a bit difficult to get used to at first. It stands for "parameter" or "graphics parameter" and controls how plots will be displayed on the currently open (or soon-to-be opened) graphics window - the graphics "device." If you then close the graphics window that you've "adjusted" using the "par" command, the next R graphics window will use the default "par" settings. To view all the current graphics parameter settings, invoke "par" without any arguments. Try this inside the R interpreter:
> settings <- par() > settings > mode(settings)
Graphing in R is a bit like drawing with a pen or pencil. When you call a plotting function (like "hist") R displays (draws) the plot in the currently open graphics window, or in a newly created one if none is already open. You can then add elements to the plot, one by one, much as you would if you were drawing a plot by hand.
Enter the following command in your file. This tells R to plot the next two plots in a single column with two rows:
> par(mfrow=c(2,1))
Now, tell R to plot a histogram for each vector, giving each plot an informative title. After you've created each histogram, draw vertical lines showing where the mean and one standard deviation from the mean are located. Note that the following 'hist' command is given the option 'freq=F', which tells the histogram to display densities rather than actual counts on the y axis.
> hist(n,freq=F,main=paste("SD=",signif(sd(n),3),"mean=",signif(mean(n),3)))
> abline(v=mean(vnorm),lty=2,col="blue")
> abline(v=mean(vnorm) + sd(vnorm),lty=3,col="red")
> abline(v=mean(vnorm) - sd(vorn),lty=3,col="red")
Next, add a curve showing a the probability density (or mass) function for a normal distribution. To superimpose it on the plot, you need to include the option 'add=T' to the following function call:
>curve(dnorm(x,mean=mean(vnorm),sd=sd(vnorm)),add=T)
Now create a plot for the other vector: vexp. Note that adding the exponential curve to the plot requires a different function and different options, specifically, a rate parameter. To refresh your memory on the exponential distribution, take a look at the Wikipedia entry for the exponential distribution.
QUESTION 13: Demonstrate your understanding and ability to master on-line documentation for R: What commands did you use to create the plot for vexp?
To save the output from a graphics function in R, use one of several optional "device drivers" that don't display to the screen but instead write the plots to files.
For example, if you were preparing a plot for publication, you might use the 'pdf' device driver, which would create a 'pdf' format file. Take a moment to read the on-line document for the pdf function.
To create a 'pdf' file of your plot, just type:
> pdf("plots.pdf")
Then, run the graphics commands, or use the 'source' command to execute the commands you've saved in a file. When the commands have completed, then type:
> dev.off()
The "pdf" command above tells R to create the plot using the pdf device driver and that you will ultimately want R to save the plot to a pdf-format file called "plots.pdf." To tell R when you are done creating the plot, you must use the "dev.off" command. This causes R to create the file. If you forget to invoke "dev.off," then R will not write the file.
Take a look at the file. You can open it using your Web browser's "open file" command (under the File menu) or using Acrobat.
QUESTION 14: Hand in a copy of your plots with the homework. If you would like to vary the color scheme to suit your tastes, feel free. Tip: type
> colors()To get a list of built-in colors in R for creating plots.
One of the basic problems in statistics has to do with using a sample to make an educated guess about the population from which it was selected.
In a typical situation, an experimenter would sample a population and the use the sample average and standard devation to estimate the mean and standard deviation for the entire population. For example, you might measure the heights of 20 randomly selected women and then use their average height to estimate the heights for all women.
Now, change your thinking from a single sample and consider the situation where you take many samples and then determine an average for each one. According to the Central Limit Theorem, the distribution of a collection of sample averages tends to approach normality, even when the distribution of items from which they were selected is not normal.
Test this idea. Read the on-line documentation for the 'replicate' command, which is a wrapper around R's family of apply functions, which apply a given function to each item in a list (lapply) or vector (sapply).
> exp.means <- replicate(1000,mean(rexp(1000)) > normal.means <-replicate(1000,mean(rnorm(1000))
QUESTION 15: Now make two histograms, one for each set of averages. What do you observe about these two distributions?
Your goal in this exercise is to learn how to read data into R "data frames" - a type of data object in R that can hold multi-mode data - e.g., character as well as numeric data, and then perform simple exploratory data analyses in R. In this assignment, you'll be working with the output from Homework Assignment 11 - findNmdTargets.py.
Use your findNmdTargets.py program to create a file containing distances separating each gene model's stop codon and the distal-most (most distant) 3-prime EEJ, or "NA" if there is no EEJ three-prime of the annotated stop codon. Name your file "distances.txt" and check it into subversion into your project1 directory. QUESTION 16: Unix refresher: What Unix command lets you count the number of lines in your distances.txt file?Load your data file into R using the 'read.delim' command:
> dat <- read.delim("distances.txt",header=F,sep='\t')
This command reads the file you just created into a data frame object named 'dat'. A data.frame object is a lot like a matrix in that it it contains rows and columns. Unlike a matrix, however, a data frame can contain data of mixed types - both character and numeric, in this case.
Once you have loaded the data into R, you can use the 'names' command to assign headings to each column.
> names(dat)
> names(dat) <- c('gene.model','dist')
To find out how large (how many rows and columns) your data set is, use the "dim" command, which returns this information. Read the on-line documention for the "dim" command. Execute:
> dim(dat)
Because you've associated names with each column of data, you can access individual columns from this data set by referring to them by name and using the $ operator. For example, the following command:
> distances <- dat$distretrieves the second column of the data set and saves it to the variable "distances."
You can also access individual elements in the data set using subscripts. For example, these command:
> dat[2,1]tells R to return the item in the second row and the first column of the data frame.
You can access entire rows or columns using empty subscripts. For example, to get the first row of data, you would use this command:
> dat[1,]Note that this command includes a comma (,) which tells R that you want everything - all columns - for the first row of data in the data frame.
If you want to select a range of rows or columns, you can also use sequences. For example, this command:
> dat[1:10,]retrieves the first 10 rows of data.
You can also use logical vectors to access rows and columns. For example, you can use the function is.na to retrieve all the gene models that did not contain "NA" in the "dist" column:
> notnas <- dat[!is.na(dat$dist),]
QUESTION 17: What data type is "bigs" and how many gene models does it contain?
QUESTION 18: Use the 'hist' command to observe the shape of EEJ distances in Arabidopsis genes. Is the distribution skewed or symmetrical?
QUESTION 19: Compute some summary statistics for the EEJ distances:
average: median: standard deviation: min: max:
Now add a new column to "dat" - it should contain the log (base 10) of the "dist" column. Create a histogram of the log-transformed "dist" data. Add a vertical, dashed line indicating where 50 bases occurs in the distribution.
QUESTION 20: Where does the 50 base line appear relative to the rest of the data?
Read the on-line documentation for the 'write.table' function in R. Now, write out the "dat" data.frame you created.
QUESTION 21: What options to you give to write.table to write a tab-delimited file? What about a comma-separated (csv) file?