INTRODUCTION TO R WITH EXERCISES

From Training Material
Jump to navigation Jump to search


title
Introduction to R with exercises
author
Yolande Tra for NobleProg Ltd

USE R AS A CALCULATOR ⌘

Use the ‘gets’ (also called the ‘assigns’) operator ( <- ) rather than the equals sign. After assigning a particular value to an object, confirm that the object now has that value by typing in the name of the object.

4+7**3   sqrt(45)     (1+0.005)^4   a <- log(89)    b <- exp( 3.4) 
sin(0)     f <- sin(pi/2)      
h <- 2: 9      x <- seq(0, 30, by = 2)   seq(1: 12)      1: 12

Try the following three commands together.

x<-seq(-4, 4, 0.2) ;  y<-2*x^2+4*x-7
plot(x,y)

Note: for assistance or further information about any function, type: ?functionname (e.g. ?log) OR ?functionname (e.g. ?mean) Also try: example(mean) etc and apropos("mean") for functions that involve the word mean

DEAL WITH VECTORS (LISTS OF NUMBERS)⌘

An array is an arrangement of objects, pictures, or numbers in columns and rows. Use the ‘combine’ or ‘concatenate’ operator, c, to set up vectors and arrays (as objects), and use square brackets and subscripts to identify and extract elements. Note the comparison operators:

== (equal to)   != (not equal to)   >= (greater than or equal to)   <= (less than or equal to)   

Also note the logical operators:

& (and)         | (or)              ! (not) 

Now set up the following three vectors by cutting and pasting from the electronic version of this document:

a <- c(2,-4,6,-4,8,-2,4,-8)    
b <- c(7,2,4,-1,-2,6,8,12)
d <- c(1,2,3,4,5,6,7,8)
length(a) ;  a[6] ;  a[c(2,5,6)] ; a[a>4] ; a[-c(2,3,6)]       
a^2     ;      a+100  ;      mean(a)  ;     median(a)                   
sd(a)   ;  sum(a[a<6]) ;   mean(a[a!=4]) ;      f<-sort(a);     
a==4   ;    which(a==4);    abs(a);   row_ab <- rbind(a,b)    col_ab <- cbind(a,b)

READ (IMPORT) A DATAFILE ⌘

Let’s read in the data file, Marks.txt, which contains a list of students’ marks over several subjects.

M<-read.table("http://training-course-material.com/images/e/ef/Marks.txt",header=T)
attach(M)
dim(M)
is.data.frame(M)
names(M)
M

The read.table function is very powerful and can read a range of file types as long as you instruct it properly (see http://www.r-project.org/ and click on Manuals). Note: h=T is short for header=TRUE, meaning that the datafile has a header (in this case consisting of students’ names - John, Mary etc, each of which is interpreted as a variable). The attach function places your array as an object in the R workspace. You will see the following table on your screen:

           John     Mary      Dave      Anne      Bob
English     56       73        45        78        51
History     67       82        42        85        49
Maths       68       80        51        92        58
Physics     73       78        46        83        54
Chem        72       79        60        77        62
Biology     64       80        54        89        68

R now treats this table as a dataframe, ready for any analysis you care to perform. If you were reading from a csv file you would use a command such as: A <- read.csv("filename.csv",sep=",",h=T).

GENERATE SUMMARY STATISTICS ⌘

In this section we continue to use the object M, created from the file Marks.txt from section 2 above.

summary(M)               colMeans(M)        rowMeans(M)   

The functions sappply and lapply traverse over a set of data and return a vector or a list. They call a specified function. The function apply traverse row wise (1) or column wise (2) by the second argument.

sapply(M, mean)   lapply (M, mean)  unlist(lapply (M, mean))  sapply(M, sd)   lapply (M, sd)  
apply(M, 1, mean)        apply(M, 2, mean)

Note: the operator, t, switches rows and columns (i.e. produces the transpose of the array). Now, instead of statistics on each student, you can generate summary statistics by subjects as follows:

MT<-t(M)
MT<-data.frame(MT)
attach(MT)
summary(MT)

Note the data.frame function, which allows R to interpret the academic subjects as variables.

EXTRACT ELEMENTS AND SUBSET YOUR ARRAYS ⌘

As before, we use the c operator to set up arrays, and use square brackets and subscripts to identify and extract elements. Now figure out what each of the following commands do:

M$Dave    M["Dave"]    M[2,4]     Dave[2]        M[M>70]                    
M[ ,4]    M[3, ]    M[ ]    M[ , ]         M[c(2,3,4), ]       M[ , c(2:4)]
match(John,  Dave)      M>80      which(M>80)     max(Dave)  
min(Anne)      min(M)     max(M[,4 ])
which.min(Anne)     which.max(Bob)     which(Anne<=80) 
which(M==46)     which(M[,2]!=82)
M[2,4]<-M[2,4]*0.95        M[,3]<-M[,3]+10         M[3,]<-M[3,]+10


All the rows for which the entry at column 2 is 80

M[ M[ , 2 ] == 80, ]

More generally: all rows for which the entry at column numcol equals the value VAL

Z <- x[ x [ , numcol ] == VAL, ]         

All the rows for which the first column is greater than 72

M[ M[ ,1 ]  >  72, ]

All rows for which Dave (column 3) had grades < 60

hs1<-subset(M,Dave <60,select=c(Dave))
hs1

You can subset your arrays by retaining wanted rows and columns using the c operator or by eliminating the unwanted (-c). Try the following:

MGirls<-M[,c(2,4)]                      MScience<-M[c(4:6),]   
MBoyArts<-M[c(1,2),c(1,3,5)]     OR     MBoyArts<-M[-c(3:6),-c(2,4)]        

Now you can work with your subsetted arrays.

MBoyArts2<-MBoyArts2+12                 MBoyArts2["Dave"] <-Dave+7   

To use boxplot to find outliers, try:

which(M==boxplot(M,range=1)$out)

To find unique lines in your array:

unique(as.data.frame(M))

You can save your objects as follows:

write.table(objectname, "Filename.txt")

You can edit your object from the keyboard. Try the following:

ME <- edit(M)                       fix(M)

You can delete any variable from your array; e.g.

M["Dave"] <- NULL      OR           M[,3]<-NULL

Note that you can use the cbind(column-bind) and rbind (row-bind) function to combine lists, such as a, b and d of section 2 above, to form a table. Then you can give it row and column names, as follows:

Y<-cbind(a,b,d)
colnames(Y) <- c("Col1", "Col2","Col3")
rownames(Y) <- c("Row1", "Row2", "Row3","Row4","Row5","Row6","Row7","Row8")
Y

Of course, you can give your rows and columns any names you like! You can also bind your set of vectors the other way:

Z<-rbind(a,b,d)
rownames(Z) <- c("Row1", "Row2","Row3")
colnames(Z) <- c("Col1","Col2", "Col3","Col4","Col5","Col6","Col7","Col8")
Z

However, to perform statistical analysis on any table constructed this way, you must first turn it into a dataframe, as follows:

Y<-data.frame(Y)
attach(Y)
summary(Y)

Note: You can enter data straight from the keyboard. Thus, to enter a string of numbers as an object called d, simply type and hit enter when done.

d <- scan()        OR  for character variables:        d <- scan( , "") 

and enter as many data as you like, hitting return after each and finishing by hitting return twice.

DEAL WITH MISSING DATA (NA) ⌘

Let’s read in a new file, MarksNA.txt, which also contains lists of students’ marks, but which includes some missing data (NAs).

MM<-read.table("http://training-course-material.com/images/6/64/MarksNA.txt",h=T)
attach(MM)
MM

On your screen you’ll see:

               Joe    Bill   Roger   Karen   Steve
Geog           46      71      NA      88     62
French         60      47      NA      69     NA
German         59      50      51      78     65
SocStud        NA      NA      57      73     55
HomeEc         72      59      70      86     NA
Art            49      60      NA      89     NA

R provides several commands or operators to deal with missing values. Here are four of the most useful: is.na(MM) - identifies missing data in your array complete.cases(MM) - a logical vector telling you which rows are complete MM[complete.cases(MM), ] - lists only complete rows (i.e. that have no missing data) na.omit(MM) - omits NAs and so is similar to MM[complete.cases(MM), ] na.rm=T - enables you to use any function (e.g. mean or standard deviation) by ignoring NAs.

Try the following:

is.na(MM)   complete.cases(MM)  MM[!complete.cases(MM),]    any(is.na(MM))  
which(is.na(MM))

You will find the function na.rm = T very useful. It allows you to perform calculations using many built-in functions, simply by ignoring missing values.

mean(MM, na.rm=T)       median(MM, na.rm=T)       sd(na.rm=T)

You can replace missing values element-by-element or by variable (column).

MM[2,3] <- mean(MM[,3],na.rm=T)          MM[2,3] <- median(MM[,3],na.rm=T)
MM[is.na(MM$Steve),"Steve"] <- median(MM$Steve,na.rm=T)  

The above code replaces all of Steve’s NAs with his median mark, but more generic is:

MM[is.na(MM[,5]),5] <- median(MM[,5],na.rm=T)

In a later section you will find an R programme (script) that performs this replacement on an entire array at once!

WORKING DIRECTORY, WORKSPACE AND HISTORY ⌘

Returns an absolute file path representing the current working directory of the R process

getwd()

Set the working directory to

setwd("C:/Training")

Your workspace consists of all the data objects you've created or loaded during your R session (and perhaps a few other things as well). When R asks if you want to save your workspace image before quitting, and you say "no", all of the NEW stuff you've created goes away--forever. If you say "yes", then a file called ".RData" is written to your working directory. The next time you start R in this directory, that workspace and all its data objects will be restored.

Here's how to retrieve them. First, you may want to get rid of any workspace items R has dragged along from the default working directory. That can be done with the rm( ) function. Then you can load the previously saved workspace (and presumably the history file if so desired).

ls()                                 # This is the default workspace                    
rm(list=ls())                        # Delete the default workspace.
load(".RData")                       # Load a previously saved workspace.
loadhistory()                        # Load a previously saved history file.
ls()

EXERCISE ⌘

The library MASS has a built-in data set named immer that records the barley yield in years 1931 and 1932 of the same field. It is a matched paired sample.

Load the MASS package and the data. Find out the variable names. Display the internal structure of the data. Determine the dimension of the data frame.

library(MASS)
data(immer)
names(immer)
str(immer)
dim(immer)

Return the first part of the data.

head(immer)

Identify 4th row, rows 6 through 10 of columns 3 and 4, 2nd column of immer data.

Assuming that the data in immer follows the normal distribution, construct a 95% confidence interval estimate of the difference between the mean barley yields between years 1931 and 1932.

t.test(immer$Y1, immer$Y2, paired=TRUE)

We will use a built-in data frame in R, called mtcars to run a logistic regression. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Instead of printing out the entire data frame, it is often desirable to preview it with the head function beforehand.

 help(mtcars)
 head(mtcars)

We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family.

am.glm = glm(formula=am ~ hp + wt,data=mtcars,family=binomial)

We can print out a summary of the generalized linear model and check for the p-values of the hp and wt variables.

summary(am.glm)

READ AND WRITE DATAFILES ⌘

R likes text files but can read most file types. OK, what do the following commands do?

Grades <- read.table("http://www.oxford.ac.uk//Users//PhysicsGrades.dat", h=T)
Grades <- read.csv(file="grades2009.csv",sep=",",h=T)
Grades<- read.spss("C:/mydata/survey.sav", to.data.frame=TRUE)
Grades<- read.table(file("clipboard"),h=T)    OR     Grades <- read.clipboard()

Writing a new file is easy!

write.table(M, "FinalGrades.txt")

However, your table won’t look so nice, so try:

write.table(M, "FinalGrades.txt", sep="\t", quote=F)

The last two arguments write your table in neat columns and get rid of unwanted quotation marks.

PLOTTING IN R ⌘

Let’s create a simple plot using the plot() command. First set up a vector B

B <- c(2, 4, 5, 7, 12, 14, 16)

Now let’s graph B

plot(B)

That was simple! However, our plot is very basic.


Let’s start again and enhance the plot Plot B using dark green points

plot(B, type="o", col="darkgreen")  # type="o" produces symbols joined by straight lines 

Now let’s create a title

title(main="My Plot", font.main=2, col.main="blue")

Notice how to create a title.


Let’s take a more complex example and enhance further

B1 <- c(2, 4, 5, 7, 12, 14, 16)
B2 <- c(3, 6, 7, 8, 9, 11, 12)

Graph B1 using a y axis from 0 to 20

plot(B1, type="o", col=" darkgreen ", ylim=c(0,20))

Graph B2

lines(B2, type="o", pch=22, lty=2, col="blue")

Notice how we plotted the first curve and then added the second using the lines function. Let’s create a title

title(main=" My Plot ", col.main="red", font.main=2)

You can look up the various arguments for colour and font type on line. Note the default labels for the horizontal and vertical axes


Another example, in which we automate the calculation of the y-axis limits Define 3 vectors

B1 <- c(2, 4, 5, 7, 12, 14, 16)
B2 <- c(3, 6, 7, 8, 9, 11, 12)
B3 <- c(1, 7, 3, 2, 2, 7, 9)

Calculate the maximum value of B1, B2 and B3

yaxismax <- max(B1, B2, B3)

Plot using a vertical axis from 0 to yaxismax. Then disable the default axes so that we can create our own axes.

plot(B1, pch = 15, type="o", col="blue", ylim=c(0, yaxismax), axes=FALSE, ann=FALSE)   
axis(1, at=1:7, lab=c("A","B","C","D","E", "F", "G")) 

The first argument in the axis command (1) specifies the horizontal axis. The at command specifies where to place the axis labels. The vector called lab stores the actual labels. Create y axis with horizontal labels that display ticks every 2 marks.

axis(2, las=1, at=2*0: yaxismax) # las controls the orientation of the axis labels. Try entering other values. 

Create a box around the plot

box()
lines(B2, pch = 16, type="o", lty=2, col="red")
lines(B3, pch = 17, type="o", lty=3, col="darkgreen")

Create a title

title(main="My Plot", col.main="darkgreen", font.main=2)

Label the x and y axes

title(xlab="Letters", col.lab="purple")
title(ylab="Values", col.lab="brown")

Create a legend at the location (1, yaxismax)

legend(1, yaxismax, c("B1","B2", "B3"), cex=0.7, col=c("blue", "red", "darkgreen"), pch=c(15, 16,17), lty=1:2)

The cex argument gives you control over the size of fonts, where the default is 1. You can check out the parameters pch, type and lty for yourselves.


In this example, we save our plot as a png file and create our own axes.

A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)  # reading our data from a csv file. 

Find the maximum y value

ymax <- max(A)
print(A)

Prepare your colours

colours <- c("darkgreen","grey","orange","brown")

Start the device driver to create a png file

png(filename="MyPlot.png")

Disable the default axes and set up a y axis

plot(A$Bill, pch = 25, type="o", col= colours [1], ylim=c(0, ymax), axes=FALSE, ann=FALSE)

Create an x axis

axis(1, at=1:5, lab=c("H", "I", "J", "K", "L"))

Create a y axis with horizontal labels with ticks every 2 units.

axis(2, las=1, at=2*0:ymax)

Create a box

box()

Graph Robert

lines(A$Robert, type="o", pch=15, lty=2, lwd=2, col=colours[2])

Graph David

lines(A$David, type="o", pch=16, lty=3, lwd=2, col=colours[3])

Graph Anne

lines(A$Anne, type="o", pch=17, lty=4, lwd=2,col=colours[4])

Create a title

title(main="MY PLOT TITLE", col.main="forestgreen", font.main=2)

Label the x and y axes

title(xlab= "Letters", col.lab="red")
title(ylab= "Y Values", col.lab="darkgreen")

Create a legend at (1, ymax)

legend(1, ymax, names(A), cex=0.7, col=colours, pch=c(25, 15,16,17), lty=1:3);

Close the device driver

dev.off()

Now you should have your plot as a png file in your working directory.

CREATE A BAR CHART ⌘

B <- c(2, 4, 5, 7, 12, 14, 16)
barplot(B) # It’s that simple! However, this one is not very sophisticated! 

ANOTHER BAR CHART

Read values from csv file

A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)
print(A)

Create Your bar chart with labels

barplot(A$David, main="MY BARPLOT", xlab="Letters",
ylab="VALUES", names.arg=c("A","B","C","D ", "E"),
border="red", density=c(90,70,50,40,30)) 

Try other values of the density parameter and see what you get. What does the command names.arg do?


Another bar chart, this time using R’s rainbow palette. We have five scores for each of four people.

A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)
print(A)

Plot a bar chart of A with adjacent bars

barplot(as.matrix(A), main="My Barchart", ylab= "Numbers",
beside=TRUE, col=rainbow(5))

Create a legend at the top-left corner without any frame

legend("topleft", c("First score","Second score","Third score","Fourth score", "Fifth score"), cex=0.6,bty="n", fill=rainbow(5))  # The bty parameter controls borders

We will talk later about the as.matrix function.


HISTOGRAMS

B <- c(2, 4, 5, 7, 12, 14, 16)

Create a histogram for B

hist(B)

Again this was easy, but you need more from your histogram.


Let’s create a histogram from all the data in an array.

A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)
print(A)

Transform the four vectors into a single vector and make a histogram of all elements.

B <- c(A$Bill, A$Robert, A$David, A$Anne)

Create a histogram in light blue

hist(B, col="lightblue", ylim=c(0,10))


Now a more complex example.

 A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)
print(A)

Transform the four vectors, as before.

B <- c(A$Bill, A$Robert, A$David, A$Anne)

Find the maximum value so that we can create a bar for each value.

max <- max(B)

Disable right-closing of cell intervals, and make the horizontal axis labels horizontal

hist(B, col=  "darkgreen", breaks=max, xlim=c(0,max), right=F, main="My Histogram", las=2, xlab = "Values", cex.lab = 1.3)


PIE CHARTS

B <- c(2, 4, 5, 7, 12, 14, 16)

Create a simple pie chart

pie(B)


Now let's create a pie chart with a heading, use nice colours, and define our own labels using R’s rainbow palette

B <- c(2, 4, 5, 7, 12, 14, 16)
pie(B, main="My Piechart", col=rainbow(length(B)),
labels=c("Mon","Tue","Wed","Thu","Fri"))


Here is a more complex example, using percentages and a legend

B <- c(5, 3, 1, 8, 9, 4, 6)  # A vector of data, one for each day of the week

Set up colours for printing

cols <- c("grey90","grey50","black","grey30", "white","grey70","grey50")

Calculate the percentage for each day, at one decimal place

percentlabels <- round(100*B/sum(B), 1)  # Rounding the percentage labels to one decimal place

Add a '%' sign to each percentage value using the paste command.

pielabels <- paste(percentlabels, "%", sep="")  # What does the paste command do? 
pie(B, main="My Best Piechart", col=cols, labels= pielabels, cex=0.8)

Create a legend to the right

legend(1.5, 0.5, c("Mon","Tue","Wed","Thu","Fri"), cex=0.5, fill=cols)

A SIMPLE PLOT INVOLVING A CALCULATION ⌘

Produce a simple plot and develop it iteratively! Note: abline(a,b) draws a line with intercept a and slope b. The function lm (Linear Model) performs a least squares regression.

x<-seq(-5,5,0.2)
y<-3*x^2+2*x-5
plot(x,y)
plot(x,y, xlab="X Values",ylab="Y Values",pch=16, col="red",main="A NICE QUADRATIC")
lines(x,y)
text(-0.8,40.0,"Place text anywhere on your plot!")

Now your text is located at the point (-0.8,40) on your plot.

abline(lm(y~x))

Admittedly, a least squares fit doesn’t help much with a quadratic, but you get the point!

abline(h=3)
abline(v=3)
legend(-4,66,"This is my Legend")

Your legend is now located at the point(-4,66)

rug(x)

What does the rug command do?

A More Complex Plot ⌘

Produce a more complicated plot (note the use of the expression and paste functions). Copy and paste the following syntax (which I have modified from a very fine book called "Statistics: an introduction using R", by Michael J. Crawley) into R and try to understand what each line is doing.

x <- seq(-4, 4, len = 101)
plot(x, cos(x),type="l",xaxt="n", col="red", main="A DEMONSTRATION PLOT",
xlab=expression(paste("Angle",  theta)),
ylab=expression("cos "*theta))
axis(1, at = c(-pi, -pi/2, 0, pi/2, pi),
lab = expression(-pi, -pi/2, 0, pi/2, pi))
text(pi/2+0.3,0.5,substitute(chi^2=="3.2"))
text(pi/4, -0.5, expression(paste(frac(1, sigma*sqrt(2*pi)), " ",
e^{frac(-(x-mu)^2, 2*sigma^2)})))
text(pi/2+1.5,0,expression(hat(alpha) %+-% theta))

Using Loops for Multiple Plots ⌘

You can use loops to develop complex plots. Try the following (also modified from other sources):

x <- c(1:5); y <- 2*x+3 # create some data  
par(pch=20, col="blue") # plotting symbol and colour 
par(mfrow=c(2,4)) # all plots on one page  
choices = c("p","l","o","b","c","s","S","h")  
for(i in 1:length(choices)){  
heading = paste("Type=", choices[i]) 
plot(x, y, type="n", main=heading) 
lines(x, y, type=choices[i]) 
}


WRITE R PROGRAMMES (SCRIPTS) ⌘

R provides control statements similar to those of other languages. Try copying the code given in sections 9.1 to 9.4 into the R workspace to see what they do:

The if Command

mark<- 87
if(mark>=50) {print("Pass")} else {print("Fail")}

The ifelse Loop

y<-1:20
ifelse(y<10, y, 99)

Example of function with if else

x <- 1:10
test <- function(x) { 
   if(x < 5) { 
       x-1 
   } else { 
       x / x 
   } 
}
apply(as.matrix(x), 1, test)

The for Loop

for(i in 1:100) print(2*i^2 - 7)

The while Loop

n<-0
while(n<5) {
n<-n+1
print(n)
}

Function syntax

myfct <- function(arg1, arg2, ...) { 
   function_body 
}

Write a function that calculates the square of a number

square <- function(number)
 {
   answer <- number * number
   return(answer)
 }

Calculate square of 13

square(13)

Write a function that takes as its argument two vectors, x and y, produces a scatterplot, and calculates the correlation coefficient (using cor(x,y)).

two.outcomes<-function(x,y){
   plot(x,y)
   cor(x, y)
}

Apply the function to produce a scatterplot of eruption duration versus waiting interval in the data set faithful.

data(faithful)
head(faithful)
two.outcomes(faithful[,1],faithful[,2])


A Simple R Programme and how to Run it

You are not yet ready to write your own programmes, but examine the following code and analyse what each step is doing. This code comes from the accompanying text file - ProgNA.txt. To run the code, type: source("ProgNA.txt"). The programme should read the file MarksNA.txt (which contains missing data), print the number of rows and the number of columns on your screen, replace all missing data by the median of the existing data for each column (variable), and write a new text file called FinalMarks.txt. The syntax: sep="\t", quote=F formats the newly written file and makes it pleasing to the eye.

M<-read.table("MarksNA.txt", h=T)
nrows <- dim(M)[1]  # Can also use: nrows <- nrow(M)
ncols <- dim(M)[2]    # Can also use: ncols  <- ncol(M)
print(nrows)  
print(ncols)
print(M)
for(j in 1: ncols) 
{
M[is.na(M[, j]), j] <- median(M[, j], na.rm=T) 
} 
write.table(M,"FinalMarks.txt", sep="\t", quote=F)

EXERCISE ⌘

Read with read.csv, the data (courtesy of Statistical Consulting Group, University of California). Name the dataframe hsb. Determine the data structure

hsb = read.table('http://www.ats.ucla.edu/stat/data/hsb2.csv', header=T, sep=",")
attach(hsb)
str(hsb) 

Inspect the resulting table. Obtain summary statistics for the variables summary(hsb). Cross tabulate the female and race variables. Is the design balanced?

table(female,race)
table(female)
table(race)

Read and Write correlation We will use qplot, which is part of the ggplot2 package. To install the package use

install.packages("ggplot2")

This only needs to be done once. Load the ggplot2 package

library(ggplot2)

This, however, needs to be done every time you start a new session or at the top of your script file. Make a scatterplot of read and write. Color the points by race

qplot(x=read, y=write, data=hsb, color=race)

Calculate Pearson correlation and the associated test with cor.test.

cor(hsb$write, hsb$read, use="pairwise")
cor.test(hsb$write, hsb$read)

Write a function which takes a numeric vector as argument and calculates the mean of the vector. Fill in the two lines that are blank

meanofvec <- function(vec) {
  
 
 sum/n  
}

Use the above function to find the mean of the vector 1:5

meanofvec(1:5)

Write a function that normalizes the values of an input vector; normalize by subtracting the mean of the data and rescaling it to have a std dev of 1. Fill in the blank. What happens if you supply a single value to this function?

stdnorm <- function(nums) {
  m <- mean(nums)
 nums <- nums-m
 stdev<-
 nums <- 
nums
}

Apply the function to normalize the values of data and a summary of the normalized data

data <- rnorm(1000, mean=5, sd=3)
summary(data)
norm <- stdnorm(data)
summary(norm)

PERFORM STATISTICAL MODELLING ⌘

If you are using R for the first time to perform a regression or ANOVA, you may need some assistance. The two commands you’ll use here are lm (short for Linear Model), which performs a regression, and aov (short for Analysis of Variance). Try out the following code.

A simple linear regression ⌘

The following regression uses a made-up datafile (MarksRegression.txt), consisting of marks out of 10 in a trial examination and those scored by the same students in the final examination. You wish to know how well the trial exam marks predicted the final exam marks. So – perform a regression!

par(mfrow=c(1,1))
mod<-read.table("http://training-course-material.com/images/e/e2/MarksRegression.txt",h=T)
attach(mod)
names(mod)
plot(Trial,Final,pch=16,xlab="TRIAL EXAM",ylab="FINAL EXAM", main="TRIAL AND FINAL MARKS", cex=1.4) 
plot(Trial,Final,pch=16,xlab="PRIOR EXAM",ylab="FINAL EXAM", main="PRIOR AND FINAL MARKS", cex=1.5, cex.lab = 1.5, cex.main = 1.6, xlim=c(0,100), ylim=c(0,100))
plot(Trial,Final,pch=16,xlab="PRIOR EXAM",ylab="FINAL EXAM", main="PRIOR AND FINAL MARKS", cex=1.5, cex.lab = 1.5, cex.main = 1.6)

Run a simple linear regression

lm(Final~Trial)
summary(lm(Final~Trial)) 

The model is: Final Mark = 0.83*Trial Mark + 16.3. It explains over 90% of the variance in the Final Mark – not bad! Its p-value is 1.148e-05, which is much less than 0.05, so that’s pretty good too! OK, but now you can go a little further and see if you can improve the model. First, let’s plot the regression and its residuals.

abline(lm(Final~Trial))

Let’s set up the regression model as an object called regmodel.

regmodel<-predict(lm(Final~Trial))

The predict function gives you actual points from your regression. Now draw lines to highlight the differences between the final and fitted points.

for (k in 1:10) lines (c(Trial [k], Trial [k]),c(Final [k], regmodel [k]))

Why not remind ourselves of the results of the regression using slightly different syntax?

model<-lm(Final~Trial)
summary(model)

The next command gives another breakdown:

summary.aov(model)

Now you can use several R diagnostics to improve your model by eliminating outliers.

plot(model)

The first plot (residuals vs. fitted values) should look more or less random, and that’s what your model provides. The second plot (normal Q-Q errors) will give a straight line if the errors are distributed normally, so again we’re OK. The third plot is similar to the first and should look random (and ours does!). The last plot (Cook’s distance) tells you which points have the greatest influence on the regression. Any that are excessively influential are candidates for removal. This plot suggests that the tenth point is an outlier. You can eliminate it to investigate what its removal does to your model. But where is the tenth point on the Trial axis?

Trial[10]

OK, so the tenth point is located at 67 on the Trial axis. Now you can use the update and subset commands to re-run the regression on a subset of the original data (i.e. without the point Trial =67). The update command allows you to re-run your model, but including any changes you wish to specify. The subset command provides an easy way of altering your input data.

model2<-update(model,subset=(Trial!= 67))
summary(model2)

(Remember that != means not equal to). OK. This second regression has greater explanatory power (it explains 96% of the variance) and an even smaller p value (p = 1.037e-06). Now you have a very good model that you may prefer over your first attempt.

One-Way ANOVA ⌘

The following ANOVA uses a datafile of test marks (MarksAnova.txt), scored by a class of 24 students, where gender provides a two-level categorical variable (i.e. male or female).

[Categorical variables allocate subjects into categories that cannot be quantified (e.g. gender, ethnicity, country of birth, school attended, eye colour)].

In this example, the idea is to test for a significant difference between the distribution of marks achieved by males and that achieved by females.

ms<-read.table("http://training-course-material.com/images/c/c7/MarksAnova.txt",h=T)
attach(ms)
names(ms)
str(ms)
ms

Now let’s do an initial plot of the marks.

plot(1:24,ms$Marks,pch = 23, bg = c("red", "blue")[unclass(ms$Gender)], main="MARKS FOR BOYS AND GIRLS", xlab = "STUDENT", ylab = "MARK" )

Now let’s look at the means for each gender.

tapply(Marks, Gender, mean)

Note that the marks for males and females are alternated. Strictly, you don’t need to do this, but it is convenient for plotting (as you’ll see below). We don’t have identifier for each student, so we plot their marks in the order in which their marks are given to us.

summary(aov(Marks ~ Gender))

OK. Your p value is about 0.004. OK. Now plot your 24 data.

plot(1:24,Marks,ylim=c(0,10),ylab="Marks",xlab="Student Order", pch=16, col="blue", main="DISTRIBUTION OF MARKS")

Now let’s draw a horizontal line at the mean value.

abline(mean(Marks),0)

Now let’s draw lines to connect each datum with the mean value in order to highlight the difference visually.

for(j in 1:24) lines(c(j,j),c(mean(Marks), Marks [j]))

So far, your plot hasn’t told you very much about the difference between boys’ marks and girls’ marks. So, let’s do a better job by separating the boys from the girls! First suppress R’s desire to plot your data using the syntax type="n". Then plot the boys’ marks and the girls’ marks separately. Note the syntax: Marks[Gender=="M"] and Marks[Gender=="F"] which works because gender is a two-level categorical variable. This syntax subsets the original distribution of marks into one set for boys and one for girls.

plot(1:24, Marks,ylim=c(0,10),type="n",ylab="Marks",xlab="Student Order", main="GIRLS’ AND BOYS’ MARKS")
points(seq(1,23,2), Marks [Gender=="M"],pch=1)
points(seq(2,24,2), Marks [Gender=="F"],pch=16)

The girls’ marks (black dots) are clearly better than the boys’ (open circles). Question: why did we plot the boys’ marks on a grid from 1 to 23 and the girls’ marks on a grid from 2 to 24? Now plot the mean boys’ mark and the mean girls’ mark.

abline(mean(Marks[Gender=="M"]),0)
abline(mean(Marks[Gender=="F"]),0)

OK, so now draw lines to highlight the residuals for boys and girls.

j<- -1
for (i in 1:12){
  j<-j+2
  lines(c(j,j),c(mean(Marks[Gender =="M"]),Marks[Gender =="M"] [i]))
  lines(c(j+1,j+1),c(mean(Marks [Gender =="F"]), Marks [Gender =="F"] [i]))
}

Let’s re-run the ANOVA in the light of this plot.

summary(aov(Marks ~ Gender))

As we saw with the regression of the last section, you can use diagnostics to improve your model.

plot(aov(Marks ~ Gender))

The first and third plots indicate that the variances of the marks for males and those for females are roughly the same, which is what you’re hoping for. The second is a Normal Quantile-Quantile plot and is a straight line (again, this is good because the errors are not significantly non-normal). The fourth plot gives the Cook’s residuals, and indicates points that are particularly influential. Points 7, 9 and 20 are very influential, so try again without them! First define your new subset ms2 (effectively ms, but without the seventh, ninth and twentieth points).

ms2=(1:24 != 7 & 1:24 != 9 & 1:24 != 20)
ms2

Now re-run your ANOVA on the new datset.

A <-  aov(Marks ~ Gender, subset=ms2)
summary(A)

The p value (0.0004) suggests that the revised model is better than the first. However, whether or not you prefer this revised model over the original is an entirely separate question! You may also wish to plot box and whisker plots, one for each factor. First, change the level names so as to get a better description on your boxplot.

levels(Gender) <- c("FEMALE", "MALE")

Now use the plot command, which automatically produces boxplots when one variable has been defined as a factor. Note the order in which you write the arguments (factor first, then the continuous variable).

print(model.tables(A,"means"), digits=3)  
#report the means and the number of subjects/cell
plot(Gender, Marks, col = "red", xlab = "GENDER", ylab = "MARKS", col.lab = "darkgreen", cex.lab = 1.3, main = "FINAL MARKS BY GENDER", cex.main = 1.6, col.main = "blue", ylim = c(0,10))
abline(6.57,0)

Two-Way ANOVA (Two Factors and looking for an interaction between the factors) ⌘

We are modelling consumers' ratings of products against cost and time for manufacture.

productappearance <-   c(6.5, 6.2, 5.8, 6.5, 6.5, 6.9, 7.2, 6.9, 6.1, 6.3,  6.7, 6.6, 7.2, 7.1, 6.8, 7.1, 7.0, 7.2, 7.5, 7.6)
cost <-  factor(gl(2,10), labels = c("LowC", "HighC"))
time <-  factor(gl(2, 5, len = 20), labels = c("LowT", "HighT"))
Z <- data.frame(productappearance, cost, time) 
Z
fit <-  aov(productappearance ~ cost * time)
summary(fit) 
print(model.tables(fit,"means"), digits=3)       
#report the means and the number of subjects/cell

boxplot(productappearance ~ cost*time, data = Z)    
#graphical summary of means of the 4 cells

interaction.plot(cost, time, productappearance)  

Also try the following approach to setting up a three level factor and plotting the variable according to its factor levels:

app= cut(productappearance,3)
app
table(app)
levels(app) <-  c("SMALL", "MED", "LARGE")
nlevels <-  length(levels(app))
levels(app) 
table(app)

Z <- cbind(productappearance,app)
Z
index <- 1:20
npoints <- length(index) 
plot(index, productappearance,  type="n",xlab="ORDER",ylab="APPEARANCE",xlim=c(0,max(index)), ylim = c(5,max(productappearance)))
ind<-split(index, app)
appfac<-split(productappearance, app)

for (k in 1: nlevels) { points(ind[[k]], appfac[[k]], pch = ( k + 12 )) }  # Now plot the data for each level 

PLOT REGRESSION LINES FOR EACH FACTOR

for (k in 1:nlevels) abline(lm(appfack ~ indk)) 

reg1<-predict(lm((appfac[[1]] ~ ind[[1]])))
reg2<-predict(lm((appfac[[2]] ~ ind[[2]])))
reg3<-predict(lm((appfac[[3]] ~ ind[[3]])))

reg <- list(c(0,nlevels))
for (k in 1: nlevels) {
regk <- predict(lm((appfack ~ indk)))  }  # We have set up three regression models; one for each level

Two-way Repeated measures ANOVA (balanced data) ⌘

Consider a design of experiment with 8 participants. The time to finish a task is recorded. Each participant uses three techniques: A, B, and C and two devices: Pen and Touch. This is a two-way repeated measures ANOVA. We will use the ezANOVA() function of the ez package

Create the data, install and call the library

Participant <- factor(rep(c("1", "2", "3", "4", "5", "6", "7", "8"), 6))
Device <- factor(c(rep("Pen",24), rep("Touch",24)))
Technique <- factor(rep(c(rep("A",8), rep("B",8), rep("C",8)), 2))
Time <- c(1.2,1.4,1.8,2.0,1.1,1.5,1.5,1.7,2.1,2.5,2.2,2.2,2.9,2.3,2.3,2.6,3.5,3.4,3.3,3.2,2.9,2.8,3.8,3.4,2.4,1.8,
    2.5,2.1,2.2,1.9,1.7,2.3,2.8,3.1,3.2,4.0,2.9,3.6,3.2,3.7,4.5,4.8,4.7,4.1,4.1,4.2,4.6,4.9)
data <- data.frame(Device, Technique, Participant, Time)
options(install.packages.check.source = "FALSE")
install.packages("ez")
library(ez)

To reset contrasts to defaults, you type

options(contrasts=c("contr.sum", "contr.poly"))

Let's run the analysis with ezANOVA() function.

ezANOVA(data=data, dv=.(Time), wid=.(Participant), within=.(Technique, Device), type=3)

The followings are specified: data is the dataframe you are using. dv means the dependent variable, which is Time in this case. wid means the indices that represent different cases or participants. In this example, Participant represent eight different cases. within means the within-subject factors, which are Technique and Device. If you have a between-subject factor, you can also use the between option. Look and interpret the results.

Affymetrix data analysis ⌘

This section was given gracefully for free from Bioinformatics Training and Service Facility (BITSF), the Bioinformatics Facility of VIB (http://www.vib.be ).

http://wiki.bits.vib.be/index.php/Introduction_to_R/Bioconductor_for_analysis_of_microarray_data#Module_2_Reading_and_storing_expression_data

EXERCISE ⌘

We will use the same data as in previous exercise. For a refresher, read and attach the data (courtesy of Statistical Consulting Group, University of California). Determine the data structure

hsb = read.table('http://www.ats.ucla.edu/stat/data/hsb2.csv', header=T, sep=",")
attach(hsb)
str(hsb)

We need to convert race and female from integer to factor. Check the conversion after.

hsb$race<-as.factor(hsb$race)
hsb$female<-as.factor(hsb$female)
str(hsb)

Perform a two-way Anova (function aov) with female and race as factors and write as dependent variable. Start by testing the full model

fit<-aov(write~ female* race)
summary(fit)

Test for an interaction. If not significant drop the interaction term and go with an additive model.

fit1<-aov(write~ female+ race)
summary(fit1)

When you find any significant effects, make post-hoc confidence intervals for differences using Tukey’s method. The function for this is TukeyHSD. Look at its help file: ?TukeyHSD

TukeyHSD(fit1)

Diagnostic plots.

Plot residuals versus fitted values and look for obvious departures from constant errors.

library(ggplot2)
qplot(x=fitted(fit1), y=residuals(fit1))

Make a normal quantile-quantile plot of residuals to check for gross departures from normality.

qplot(sample=residuals(fit1))

Boxplots

qplot(race, write, fill=female, data=hsb, geom="boxplot", position="dodge")+theme_bw()

Barplot.

First, you need to get the averages and standard deviations that you desire. For this, we will use the plyr package and ddply function. Install and run it.

install.packages("plyr") 
library(plyr)
hsb.avg<-ddply(hsb, c("race", "female"), function(hsb)
return(c(hsb.avg=mean(hsb$write), hsb.sd=sd(hsb$write))))

Create the barplot component

avg.plot<-qplot(race, hsb.avg, fill=factor(female),stat="identity", data=hsb.avg, geom="bar", position="dodge")

define the width of the dodge

dodge <- position_dodge(width=0.9) 

Then add the error bars (plus or minus one standard deviation) to the plot

avg.plot+geom_linerange(aes(ymax=hsb.avg+hsb.sd, ymin=hsb.avg-hsb.sd), position=dodge)+theme_bw()