Introduction to R

From Training Material
Revision as of 16:34, 25 November 2014 by Cesar Chew (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

USE R AS A CALCULATOR

Use the ‘gets’ (also called the ‘assigns’) operator ( <- ) rather than the equals sign. After assigning a particular value to an object, confirm that the object now has that value by typing in the name of the object.

4+7**3   sqrt(45)     (1+0.005)^4   a <- log(89)    b <- exp( 3.4) 
sin(0)     f <- sin(pi/2)      
h <- 2: 9      x <- seq(0, 30, by = 2)   seq(1: 12)      1: 12

Try the following three commands together.

x<-seq(-4, 4, 0.2) ;  y<-2*x^2+4*x-7
plot(x,y)

Note: for assistance or further information about any function, type: ?functionname (e.g. ?log) OR ?functionname (e.g. ?mean) Also try: example(mean) etc and apropos("mean") for functions that involve the word mean

DEAL WITH VECTORS (LISTS OF NUMBERS)

Use the ‘combine’ or ‘concatenate’ operator, c, to set up vectors and arrays (as objects), and use square brackets and subscripts to identify and extract elements. Note the comparison operators:

== (equal to)   != (not equal to)   >= (greater than or equal to)   <= (less than or equal to)   

Also note the logical operators:

& (and)         | (or)              ! (not) 

Now set up the following three vectors by cutting and pasting from the electronic version of this document:

a <- c(2,-4,6,-4,8,-2,4,-8)    
b <- c(7,2,4,-1,-2,6,8,12)
d <- c(1,2,3,4,5,6,7,8)
length(a)   a[6]   a[c(2,5,6)]  a[a>4]  a[-c(2,3,6)]       
a^2           a+100        mean(a)       median(a)                   
sd(a)     sum(a[a<6])     mean(a[a!=4])       f<-sort(a)     
a==4       which(a==4)    abs(a)   ab <- rbind(a,b)    ab <- cbind(a,b)

READ (IMPORT) A DATAFILE

Let’s read in the data file, Marks.txt, which contains a list of students’ marks over several subjects.

M<-read.table("http://training-course-material.com/images/e/ef/Marks.txt",header=T)
attach(M)
dim(M)
is.data.frame(M)
names(M)
M

The read.table function is very powerful and can read a range of file types as long as you instruct it properly (see http://www.r-project.org/ and click on Manuals). Note: h=T is short for header=TRUE, meaning that the datafile has a header (in this case consisting of students’ names - John, Mary etc, each of which is interpreted as a variable). The attach function places your array as an object in the R workspace. You will see the following table on your screen:

           John     Mary      Dave      Anne      Bob
English     56       73        45        78        51
History     67       82        42        85        49
Maths       68       80        51        92        58
Physics     73       78        46        83        54
Chem        72       79        60        77        62
Biology     64       80        54        89        68

R now treats this table as a dataframe, ready for any analysis you care to perform. If you were reading from a csv file you would use a command such as: A <- read.csv("filename.csv",sep=",",h=T).

GENERATE SUMMARY STATISTICS

In this section we continue to use the object M, created from the file Marks.txt from section 2 above.

summary(M)               colMeans(M)      sapply(M, mean)       sd(M)     
apply(M, 1, mean)        apply(M, 2, mean)

Note: the operator, t, switches rows and columns (i.e. produces the transpose of the array). Now, instead of statistics on each student, you can generate summary statistics by subjects as follows:

MT<-t(M)
MT<-data.frame(MT)
attach(MT)
summary(MT)

Note the data.frame function, which allows R to interpret the academic subjects as variables.


EXTRACT ELEMENTS AND SUBSET YOUR ARRAYS

As before, we use the c operator to set up arrays, and use square brackets and subscripts to identify and extract elements. Now figure out what each of the following commands do:

M$Dave    M["Dave"]    M[2,4]     Dave[2]        M[M>70]                    
M[ ,4]    M[3, ]    M[ ]    M[ , ]         M[c(2,3,4), ]       M[ , c(2:4)]
match(John,  Dave)      M>80      which(M>80)     max(Dave)  
min(Anne)      min(M)     max(M[,4 ])
which.min(Anne)     which.max(Bob)     which(Anne<=80) 
which(M==46)     which(M[,2]!=82)
M[2,4]<-M[2,4]*0.95        M[,3]<-M[,3]+10         M[3,]<-M[3,]+10


All the rows for which the entry at column 2 is 80

M[ M[ , 2 ] == 80, ]

More generally: all rows for which the entry at column numcol equals the value VAL

Z <- x[ x [ , numcol ] == VAL, ]         

All the rows for which the first column is greater than 72

M[ M[ ,1 ]  >  72, ]

All rows for which Dave (column 3) had grades < 60

hs1<-subset(M,Dave <60,select=c(Dave))
hs1

You can subset your arrays by retaining wanted rows and columns using the c operator or by eliminating the unwanted (-c). Try the following:

MGirls<-M[,c(2,4)]                      MScience<-M[c(4:6),]   
MBoyArts<-M[c(1,2),c(1,3,5)]     OR     MBoyArts<-M[-c(3:6),-c(2,4)]        

Now you can work with your subsetted arrays.

MBoyArts2<-MBoyArts2+12                 MBoyArts2["Dave"] <-Dave+7   

To use boxplot to find outliers, try:

which(M==boxplot(M,range=1)$out)

To find unique lines in your array:

unique(as.data.frame(M))

You can save your objects as follows:

write.table(objectname, "Filename.txt")

You can edit your object from the keyboard. Try the following:

ME <- edit(M)                       fix(M)

You can delete any variable from your array; e.g.

M["Dave"] <- NULL      OR           M[,3]<-NULL

Note that you can use the cbind(column-bind) and rbind (row-bind) function to combine lists, such as a, b and d of section 2 above, to form a table. Then you can give it row and column names, as follows:

Y<-cbind(a,b,d)
colnames(Y) <- c("Col1", "Col2","Col3")
rownames(Y) <- c("Row1", "Row2", "Row3","Row4","Row5","Row6","Row7","Row8")
Y

Of course, you can give your rows and columns any names you like! You can also bind your set of vectors the other way:

Z<-rbind(a,b,d)
rownames(Z) <- c("Row1", "Row2","Row3")
colnames(Z) <- c("Col1","Col2", "Col3","Col4","Col5","Col6","Col7","Col8")
Z

However, to perform statistical analysis on any table constructed this way, you must first turn it into a dataframe, as follows:

Y<-data.frame(Y)
attach(Y)
summary(Y)

Note: You can enter data straight from the keyboard. Thus, to enter a string of numbers as an object called d, simply type:

d <- scan()        OR  for character variables:        d <- scan( , "") 

and enter as many data as you like, hitting return after each and finishing by hitting return twice.

DEAL WITH MISSING DATA (NA)

Let’s read in a new file, MarksNA.txt, which also contains lists of students’ marks, but which includes some missing data (NAs).

MM<-read.table("http://training-course-material.com/images/6/64/MarksNA.txt",h=T)
attach(MM)
MM

On your screen you’ll see:

               Joe    Bill   Roger   Karen   Steve
Geog           46      71      NA      88     62
French         60      47      NA      69     NA
German         59      50      51      78     65
SocStud        NA      NA      57      73     55
HomeEc         72      59      70      86     NA
Art            49      60      NA      89     NA

R provides several commands or operators to deal with missing values. Here are four of the most useful: is.na(MM) - identifies missing data in your array complete.cases(MM) - a logical vector telling you which rows are complete MM[complete.cases(MM), ] - lists only complete rows (i.e. that have no missing data) na.omit(MM) - omits NAs and so is similar to MM[complete.cases(MM), ] na.rm=T - enables you to use any function (e.g. mean or standard deviation) by ignoring NAs.

Try the following:

is.na(MM)   complete.cases(MM)  MM[!complete.cases(MM),]    any(is.na(MM))  
which(is.na(MM))

You will find the function na.rm = T very useful. It allows you to perform calculations using many built-in functions, simply by ignoring missing values.

mean(MM, na.rm=T)       median(MM, na.rm=T)       sd(na.rm=T)

You can replace missing values element-by-element or by variable (column).

MM[2,3] <- mean(MM[,3],na.rm=T)          MM[2,3] <- median(MM[,3],na.rm=T)
MM[is.na(MM$Steve),"Steve"] <- median(MM$Steve,na.rm=T)  

The above code replaces all of Steve’s NAs with his median mark, but more generic is:

MM[is.na(MM[,5]),5] <- median(MM[,5],na.rm=T)

In a later section you will find an R programme (script) that performs this replacement on an entire array at once!

READ AND WRITE DATAFILES

R likes text files but can read most file types. OK, what do the following commands do?

Grades <- read.table("http://www.oxford.ac.uk//Users//PhysicsGrades.dat", h=T)
Grades <- read.csv(file="grades2009.csv",sep=",",h=T)
Grades<- read.spss(grades2009,  . . . . )
Grades<- read.table(file("clipboard"),h=T)    OR     Grades <- read.clipboard()

Writing a new file is easy!

write.table(M, "FinalGrades.txt")

However, your table won’t look so nice, so try:

write.table(M, "FinalGrades.txt", sep="\t", quote=F)

The last two arguments write your table in neat columns and get rid of unwanted quotation marks.



PLOTTING IN R

Let’s create a simple plot using the plot() command. First set up a vector B

B <- c(2, 4, 5, 7, 12, 14, 16)

Now let’s graph B

plot(B)

That was simple! However, our plot is very basic.


Let’s start again and enhance the plot Plot B using dark green points

plot(B, type="o", col="darkgreen")  # type="o" produces symbols joined by straight lines 

Now let’s create a title

title(main="My Plot", font.main=2, col.main="blue")

Notice how to create a title.


Let’s take a more complex example and enhance further

B1 <- c(2, 4, 5, 7, 12, 14, 16)
B2 <- c(3, 6, 7, 8, 9, 11, 12)

Graph B1 using a y axis from 0 to 20

plot(B1, type="o", col=" darkgreen ", ylim=c(0,20))

Graph B2

lines(B2, type="o", pch=22, lty=2, col="blue")

Notice how we plotted the first curve and then added the second using the lines function. Let’s create a title

title(main=" My Plot ", col.main="red", font.main=2)

You can look up the various arguments for colour and font type on line. Note the default labels for the horizontal and vertical axes


Another example, in which we automate the calculation of the y-axis limits Define 3 vectors

B1 <- c(2, 4, 5, 7, 12, 14, 16)
B2 <- c(3, 6, 7, 8, 9, 11, 12)
B3 <- c(1, 7, 3, 2, 2, 7, 9)

Calculate the maximum value of B1, B2 and B3

yaxismax <- max(B1, B2, B3)

Plot using a vertical axis from 0 to yaxismax. Then disable the default axes so that we can create our own axes.

plot(B1, pch = 15, type="o", col="blue", ylim=c(0, yaxismax), axes=FALSE, ann=FALSE)   
axis(1, at=1:7, lab=c("A","B","C","D","E", "F", "G")) 

The first argument in the axis command (1) specifies the horizontal axis. The at command specifies where to place the axis labels. The vector called lab stores the actual labels. Create y axis with horizontal labels that display ticks every 2 marks.

axis(2, las=1, at=2*0: yaxismax) # las controls the orientation of the axis labels. Try entering other values. 

Create a box around the plot

box()
lines(B2, pch = 16, type="o", lty=2, col="red")
lines(B3, pch = 17, type="o", lty=3, col="darkgreen")

Create a title

title(main="My Plot", col.main="darkgreen", font.main=2)

Label the x and y axes

title(xlab="Letters", col.lab="purple")
title(ylab="Values", col.lab="brown")

Create a legend at the location (1, yaxismax)

legend(1, yaxismax, c("B1","B2", "B3"), cex=0.7, col=c("blue", "red", "darkgreen"), pch=c(15, 16,17), lty=1:2)

The cex argument gives you control over the size of fonts, where the default is 1. You can check out the parameters pch, type and lty for yourselves.


In this example, we save our plot as a png file and create our own axes.

A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)  # reading our data from a csv file. 

Find the maximum y value

ymax <- max(A)
print(A)

Prepare your colours

colours <- c("darkgreen","grey","orange","brown")

Start the device driver to create a png file

png(filename="MyPlot.png")

Disable the default axes and set up a y axis

plot(A$Bill, pch = 25, type="o", col= colours [1], ylim=c(0, ymax), axes=FALSE, ann=FALSE)

Create an x axis

axis(1, at=1:5, lab=c("H", "I", "J", "K", "L"))

Create a y axis with horizontal labels with ticks every 2 units.

axis(2, las=1, at=2*0:ymax)

Create a box

box()

Graph Robert

lines(A$Robert, type="o", pch=15, lty=2, lwd=2, col=colours[2])

Graph David

lines(A$David, type="o", pch=16, lty=3, lwd=2, col=colours[3])

Graph Anne

lines(A$Anne, type="o", pch=17, lty=4, lwd=2,col=colours[4])

Create a title

title(main="MY PLOT TITLE", col.main="forestgreen", font.main=2)

Label the x and y axes

title(xlab= "Letters", col.lab="red")
title(ylab= "Y Values", col.lab="darkgreen")

Create a legend at (1, ymax)

legend(1, ymax, names(A), cex=0.7, col=colours, pch=c(25, 15,16,17), lty=1:3);

Close the device driver

dev.off()

Now you should have your plot as a png file in your working directory.

CREATE A BAR CHART

B <- c(2, 4, 5, 7, 12, 14, 16)
barplot(B) # It’s that simple! However, this one is not very sophisticated! 

ANOTHER BAR CHART

Read values from csv file

A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)
print(A)

Create Your bar chart with labels

barplot(A$David, main="MY BARPLOT", xlab="Letters",
ylab="VALUES", names.arg=c("A","B","C","D ", "E"),
border="red", density=c(90,70,50,40,30)) 

Try other values of the density parameter and see what you get. What does the command names.arg do?


Another bar chart, this time using R’s rainbow palette. We have five scores for each of four people.

A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)
print(A)

Plot a bar chart of A with adjacent bars

barplot(as.matrix(A), main="My Barchart", ylab= "Numbers",
beside=TRUE, col=rainbow(5))

Create a legend at the top-left corner without any frame

legend("topleft", c("First score","Second score","Third score","Fourth score", "Fifth score"), cex=0.6,bty="n", fill=rainbow(5))  # The bty parameter controls borders

We will talk later about the as.matrix function.


HISTOGRAMS

B <- c(2, 4, 5, 7, 12, 14, 16)

Create a histogram for B

hist(B)

Again this was easy, but you need more from your histogram.


Let’s create a histogram from all the data in an array.

A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)
print(A)

Transform the four vectors into a single vector and make a histogram of all elements.

B <- c(A$Bill, A$Robert, A$David, A$Anne)

Create a histogram in light blue

hist(B, col="lightblue", ylim=c(0,10))


Now a more complex example.

 A <- read.table("http://training-course-material.com/images/6/68/MyPlot.txt", header=T)
print(A)

Transform the four vectors, as before.

B <- c(A$Bill, A$Robert, A$David, A$Anne)

Find the maximum value so that we can create a bar for each value.

max <- max(B)

Disable right-closing of cell intervals, and make the horizontal axis labels horizontal

hist(B, col=  "darkgreen", breaks=max, xlim=c(0,max), right=F, main="My Histogram", las=2, xlab = "Values", cex.lab = 1.3)


PIE CHARTS

B <- c(2, 4, 5, 7, 12, 14, 16)

Create a simple pie chart

pie(B)


Now let's create a pie chart with a heading, use nice colours, and define our own labels using R’s rainbow palette

B <- c(2, 4, 5, 7, 12, 14, 16)
pie(B, main="My Piechart", col=rainbow(length(B)),
labels=c("Mon","Tue","Wed","Thu","Fri"))


Here is a more complex example, using percentages and a legend

B <- c(5, 3, 1, 8, 9, 4, 6)  # A vector of data, one for each day of the week

Set up colours for printing

cols <- c("grey90","grey50","black","grey30", "white","grey70","grey50")

Calculate the percentage for each day, at one decimal place

percentlabels <- round(100*B/sum(B), 1)  # Rounding the percentage labels to one decimal place

Add a '%' sign to each percentage value using the paste command.

pielabels <- paste(percentlabels, "%", sep="")  # What does the paste command do? 
pie(B, main="My Best Piechart", col=cols, labels= pielabels, cex=0.8)

Create a legend to the right

legend(1.5, 0.5, c("Mon","Tue","Wed","Thu","Fri"), cex=0.5, fill=cols)

A SIMPLE PLOT INVOLVING A CALCULATION

Produce a simple plot and develop it iteratively! Note: abline(a,b) draws a line with intercept a and slope b. The function lm (Linear Model) performs a least squares regression.

x<-seq(-5,5,0.2)
y<-3*x^2+2*x-5
plot(x,y)
plot(x,y, xlab="X Values",ylab="Y Values",pch=16, col="red",main="A NICE QUADRATIC")
lines(x,y)
text(-0.8,40.0,"Place text anywhere on your plot!")

Now your text is located at the point (-0.8,40) on your plot.

abline(lm(y~x))

Admittedly, a least squares fit doesn’t help much with a quadratic, but you get the point!

abline(h=3)
abline(v=3)
legend(-4,66,"This is my Legend")

Your legend is now located at the point(-4,66)

rug(x)

What does the rug command do?

A More Complex Plot

Produce a more complicated plot (note the use of the expression and paste functions). Copy and paste the following syntax (which I have modified from a very fine book called "Statistics: an introduction using R", by Michael J. Crawley) into R and try to understand what each line is doing.

x <- seq(-4, 4, len = 101)
plot(x, cos(x),type="l",xaxt="n", col="red", main="A DEMONSTRATION PLOT",
xlab=expression(paste("Angle",  theta)),
ylab=expression("cos "*theta))
axis(1, at = c(-pi, -pi/2, 0, pi/2, pi),
lab = expression(-pi, -pi/2, 0, pi/2, pi))
text(pi/2+0.3,0.5,substitute(chi^2=="3.2"))
text(pi/4, -0.5, expression(paste(frac(1, sigma*sqrt(2*pi)), " ",
e^{frac(-(x-mu)^2, 2*sigma^2)})))
text(pi/2+1.5,0,expression(hat(alpha) %+-% theta))

Using Loops for Multiple Plots

You can use loops to develop complex plots. Try the following (also modified from other sources):

x <- c(1:5); y <- 2*x+3 # create some data  
par(pch=20, col="blue") # plotting symbol and colour 
par(mfrow=c(2,4)) # all plots on one page  
choices = c("p","l","o","b","c","s","S","h")  
for(i in 1:length(choices)){  
heading = paste("Type=", choices[i]) 
plot(x, y, type="n", main=heading) 
lines(x, y, type=choices[i]) 
}

WRITE R PROGRAMMES (SCRIPTS)

R provides control statements similar to those of other languages. Try copying the code given in sections 9.1 to 9.4 into the R workspace to see what they do:

The if Command

mark<- 87
if(mark>=50) {print("Pass")} else {print("Fail")}

The ifelse Loop

y<-1:20
ifelse(y<10, y, 99)

The for Loop

for(i in 1:100) print(2*i^2 - 7)

The while Loop

n<-0
while(n<5) {
n<-n+1
print(n)
}

A Simple R Programme and how to Run it

You are not yet ready to write your own programmes, but examine the following code and analyse what each step is doing. This code comes from the accompanying text file - ProgNA.txt. To run the code, type: source("ProgNA.txt"). The programme should read the file MarksNA.txt (which contains missing data), print the number of rows and the number of columns on your screen, replace all missing data by the median of the existing data for each column (variable), and write a new text file called FinalMarks.txt. The syntax: sep="\t", quote=F formats the newly written file and makes it pleasing to the eye.

M<-read.table("MarksNA.txt", h=T)
nrows <- dim(M)[1]  # Can also use: nrows <- nrow(M)
ncols <- dim(M)[2]    # Can also use: ncols  <- ncol(M)
print(nrows)  
print(ncols)
print(M)
for(j in 1: ncols) 
{
M[is.na(M[, j]), j] <- median(M[, j], na.rm=T) 
} 
write.table(M,"FinalMarks.txt", sep="\t", quote=F)

PERFORM STATISTICAL MODELLING

If you are using R for the first time to perform a regression or ANOVA, you may need some assistance. The two commands you’ll use here are lm (short for Linear Model), which performs a regression, and aov (short for Analysis of Variance). Try out the following code.

Try a simple regression

The following regression uses a made-up datafile (MarksRegression.txt), consisting of marks out of 10 in a trial examination and those scored by the same students in the final examination. You wish to know how well the trial exam marks predicted the final exam marks. So – perform a regression!

par(mfrow=c(1,1))
mod<-read.table("http://training-course-material.com/images/e/e2/MarksRegression.txt",h=T)
attach(mod)
names(mod)
plot(Trial,Final,pch=16,xlab="TRIAL EXAM",ylab="FINAL EXAM", main="TRIAL AND FINAL MARKS", cex=1.4) 
lm(Final~Trial)
summary(lm(Final~Trial)) 

mod<-read.table("http://training-course-material.com/images/e/e2/MarksRegression.txt",h=T)
attach(mod)
names(mod)
plot(Trial,Final,pch=16,xlab="PRIOR EXAM",ylab="FINAL EXAM", main="PRIOR AND FINAL MARKS", cex=1.5, cex.lab = 1.5, cex.main = 1.6, xlim=c(0,100), ylim=c(0,100))
plot(Trial,Final,pch=16,xlab="PRIOR EXAM",ylab="FINAL EXAM", main="PRIOR AND FINAL MARKS", cex=1.5, cex.lab = 1.5, cex.main = 1.6)
lm(Final~Trial)
summary(lm(Final~Trial)) 

Your model is: Final Mark = 0.83*Trial Mark + 16.3. This model explains over 90% of the variance in the Final Mark – not bad! Its p-value is 1.148e-05, which is much less than 0.05, so that’s pretty good too! OK, but now you can go a little further and see if you can improve the model. First, let’s plot the regression and its residuals.

abline(lm(Final~Trial))

Let’s set up the regression model as an object called regmodel.

regmodel<-predict(lm(Final~Trial))

Note: the predict function gives you actual points from your regression. Now draw lines to highlight the differences between the final and fitted points.

for (k in 1:10) lines (c(Trial [k], Trial [k]),c(Final [k], regmodel [k]))

Why not remind ourselves of the results of the regression using slightly different syntax?

model<-lm(Final~Trial)
summary(model)

The next command gives another breakdown:

summary.aov(model)

Now you can use several R diagnostics to improve your model by eliminating outliers.

plot(model)

The first plot (residuals vs. fitted values) should look more or less random, and that’s what your model provides. The second plot (normal Q-Q errors) will give a straight line if the errors are distributed normally, so again we’re OK. The third plot is similar to the first and should look random (and ours does!). The last plot (Cook’s distance) tells you which points have the greatest influence on the regression. Any that are excessively influential are candidates for removal. This plot suggests that the tenth point is an outlier. You can eliminate it to investigate what its removal does to your model. But where is the tenth point on the Trial axis?

Trial[10]

OK, so the tenth point is located at 67 on the Trial axis. Now you can use the update and subset commands to re-run the regression on a subset of the original data (i.e. without the point Trial =67). The update command allows you to re-run your model, but including any changes you wish to specify. The subset command provides an easy way of altering your input data.

model2<-update(model,subset=(Trial!= 67))
summary(model2)

(Remember that != means not equal to). OK. This second regression has greater explanatory power (it explains 96% of the variance) and an even smaller p value (p = 1.037e-06). Now you have a very good model that you may prefer over your first attempt.

Try a simple ANOVA

The following ANOVA uses a datafile of test marks (MarksAnova.txt), scored by a class of 24 students, where gender provides a two-level categorical variable (i.e. male or female).

[Categorical variables allocate subjects into categories that cannot be quantified (e.g. gender, ethnicity, country of birth, school attended, eye colour)].

In this example, the idea is to test for a significant difference between the distribution of marks achieved by males and that achieved by females.

ms<-read.table("http://training-course-material.com/images/c/c7/MarksAnova.txt",h=T)
attach(ms)
names(ms)
ms

Now let’s do an initial plot of the marks.

plot(1:24,ms$Marks,pch = 23, bg = c("red", "blue")[unclass(ms$Gender)], main="MARKS FOR BOYS AND GIRLS", xlab = "STUDENT", ylab = "MARK" )

Now let’s look at the means for each gender.

tapply(Marks, Gender, mean)

Note that I have alternated the marks for males and females. Strictly, you don’t need to do this, but it is convenient for plotting (as you’ll see below). We don’t have identifier for each student, so we plot their marks in the order in which their marks are given to us.

summary(aov(Marks ~ Gender))

OK. Your p value is about 0.004. OK. Now plot your 24 data.

plot(1:24,Marks,ylim=c(0,10),ylab="Marks",xlab="Student Order", pch=16, col="blue", main="DISTRIBUTION OF MARKS")

Now let’s draw a horizontal line at the mean value.

abline(mean(Marks),0)

Now let’s draw lines to connect each datum with the mean value in order to highlight the difference visually.

for(j in 1:24) lines(c(j,j),c(mean(Marks), Marks [j]))

So far, your plot hasn’t told you very much about the difference between boys’ marks and girls’ marks. So, let’s do a better job by separating the boys from the girls! First suppress R’s desire to plot your data using the syntax type="n". Then plot the boys’ marks and the girls’ marks separately. Note the syntax: Marks[Gender=="M"] and Marks[Gender=="F"] which works because gender is a two-level categorical variable. This syntax subsets the original distribution of marks into one set for boys and one for girls.

plot(1:24, Marks,ylim=c(0,10),type="n",ylab="Marks",xlab="Student Order", main="GIRLS’ AND BOYS’ MARKS")
points(seq(1,23,2), Marks [Gender=="M"],pch=1)
points(seq(2,24,2), Marks [Gender=="F"],pch=16)

The girls’ marks (black dots) are clearly better than the boys’ (open circles). Question: why did we plot the boys’ marks on a grid from 1 to 23 and the girls’ marks on a grid from 2 to 24? Now plot the mean boys’ mark and the mean girls’ mark.

abline(mean(Marks[Gender=="M"]),0)
abline(mean(Marks[Gender=="F"]),0)

OK, so now draw lines to highlight the residuals for boys and girls.

j<- -1
for (i in 1:12){
  j<-j+2
  lines(c(j,j),c(mean(Marks[Gender =="M"]),Marks[Gender =="M"] [i]))
  lines(c(j+1,j+1),c(mean(Marks [Gender =="F"]), Marks [Gender =="F"] [i]))
}

Let’s re-run the ANOVA in the light of this plot.

summary(aov(Marks ~ Gender))

As we saw with the regression of the last section, you can use diagnostics to improve your model.

plot(aov(Marks ~ Gender))

The first and third plots indicate that the variances of the marks for males and those for females are roughly the same, which is what you’re hoping for. The second is a Normal Quantile-Quantile plot and is a straight line (again, this is good because the errors are not significantly non-normal). The fourth plot gives the Cook’s residuals, and indicates points that are particularly influential. Points 7, 9 and 20 are very influential, so try again without them! First define your new subset ms2 (effectively ms, but without the seventh, ninth and twentieth points).

ms2=(1:24 != 7 & 1:24 != 9 & 1:24 != 20)
ms2

Now re-run your ANOVA on the new datset.

A <-  aov(Marks ~ Gender, subset=ms2)
summary(A)

The p value (0.0004) suggests that the revised model is better than the first. However, whether or not you prefer this revised model over the original is an entirely separate question! You may also wish to plot box and whisker plots, one for each factor. First, change the level names so as to get a better description on your boxplot.

levels(Gender) <- c("FEMALE", "MALE")

Now use the plot command, which automatically produces boxplots when one variable has been defined as a factor. Note the order in which you write the arguments (factor first, then the continuous variable).

print(model.tables(A,"means"), digits=3)  
#report the means and the number of subjects/cell
plot(Gender, Marks, col = "red", xlab = "GENDER", ylab = "MARKS", col.lab = "darkgreen", cex.lab = 1.3, main = "FINAL MARKS BY GENDER", cex.main = 1.6, col.main = "blue", ylim = c(0,10))
abline(5,0)

Try a Two-Way Analysis of Variance (Two Factors and looking for an interaction between the factors). We are modelling consumers' ratings of products against cost and time for manufacture.

productappearance <-   c(6.5, 6.2, 5.8, 6.5, 6.5, 6.9, 7.2, 6.9, 6.1, 6.3,  6.7, 6.6, 7.2, 7.1, 6.8, 7.1, 7.0, 7.2, 7.5, 7.6)
cost <-  factor(gl(2,10), labels = c("LowC", "HighC"))
time <-  factor(gl(2, 5, len = 20), labels = c("LowT", "HighT"))
Z <- data.frame(productappearance, cost, time) 
Z
fit <-  aov(productappearance ~ cost * time)
summary(fit) 
print(model.tables(fit,"means"), digits=3)       
#report the means and the number of subjects/cell

boxplot(productappearance ~ cost*time, data = Z)    
#graphical summary of means of the 4 cells

interaction.plot(cost, time, productappearance)  

Also try the following approach to setting up a three level factor and plotting the variable according to its factor levels:

app= cut(productappearance,3)
app
table(app)
levels(app) <-  c("SMALL", "MED", "LARGE")
nlevels <-  length(levels(app))
levels(app) 
table(app)

Z <- cbind(productappearance,app)
Z
index <- 1:20
npoints <- length(index) 
plot(index, productappearance,  type="n",xlab="ORDER",ylab="APPEARANCE",xlim=c(0,max(index)), ylim = c(5,max(productappearance)))
ind<-split(index, app)
appfac<-split(productappearance, app)

for (k in 1: nlevels) { points(ind[[k]], appfac[[k]], pch = ( k + 12 )) }  # Now plot the data for each level 

PLOT REGRESSION LINES FOR EACH FACTOR

for (k in 1:nlevels) abline(lm(appfack ~ indk)) 

reg1<-predict(lm((appfac[[1]] ~ ind[[1]])))
reg2<-predict(lm((appfac[[2]] ~ ind[[2]])))
reg3<-predict(lm((appfac[[3]] ~ ind[[3]])))

reg <- list(c(0,nlevels))
for (k in 1: nlevels) {
regk <- predict(lm((appfack ~ indk)))  }  # We have set up three regression models; one for each level

NOW PLOT THE RESIDUALS

for (k in 1: nlevels) {{
for (i in 1: length(indk)) lines (c(indk[i], indk[i]), c(appfack[i], regk[i]))  }}

ANOTHER WAY OF PLOTTING THE RESIDUALS

for (i in 1: length(ind1)) lines (c(ind1[i], ind1[i]), c(appfac1[i], reg1[i]))
for (i in 1: length(ind2)) lines (c(ind2[i], ind2[i]), c(appfac2[i], reg2[i]))
for (i in 1: length(ind3)) lines (c(ind3[i], ind3[i]), c(appfac3[i], reg3[i]))


Try a MANOVA

Set up a dataset of three dependent variables (ratings of the strength, colour and weight of a firm’s product) and two factors (cost of manufacture and total time to manufacture).

strength <-  c(6.5, 6.2, 5.8, 6.5, 6.5, 6.9, 7.2, 6.9, 6.1, 6.3,  6.7, 6.6, 7.2 , 7.1, 6.8, 7.1, 7.0, 7.2, 7.5, 7.6)
colour   <-  c(9.5, 9.9, 9.6, 9.6, 9.2, 9.1, 10.0, 9.9, 9.5, 9.4,  9.1, 9.3, 8.3 , 8.4, 8.5, 9.2, 8.8, 9.7, 10.1, 9.2)
weight   <-  c(4.4, 6.4, 3.0, 4.1, 0.8, 5.7, 2.0, 3.9, 1.9, 5.7, 2.8, 4.1, 3.8 ,   1.6, 3.4, 8.4, 5.2, 6.9, 2.7, 1.9)

Now bind these dependent variables together.

Y <-  cbind(strength, colour, weight)
Y

Now set up two factors.

cost <-  factor(gl(2,10), labels = c("LowC", "HighC"))
totaltime <-  factor(gl(2, 5, len = 20), labels = c("LowT", "HighT"))


Example #1

First let’s make up a table or two.

tapply(strength, list(cost, totaltime), mean)
tapply(colour, list(cost, totaltime), mean)
tapply(weight, list(cost, totaltime), mean)

Now perform the fit.

fit <-  manova(Y ~ cost * totaltime)

Now get your summary of the MANOVA analysis. First – draw up a Univariate ANOVA table.

summary.aov(fit)   

We see that both cost and totaltime are significant predictors of the variable 'strength', but there is no interaction between these two factors. Further, cost is a predictor of colour but time is not, and nor is there any interaction. Neither factor is a predictor of weight.

Now draw up an ANOVA table of Wilks' Lambda.

summary(fit, test="Wilks") 

Can also bind the categorical variables into the dataset, but must exclude them from that part of the dataset that holds the dependent variables.

Z <-cbind(Y,cost,totaltime)
fitz <- manova(Z[ , c(1:3) ] ~ cost * totaltime)
summary.aov(fitz)            
summary(fitz, test="Wilks")