R - Testing Means
Binomial distribution
See: Introduction to Hypothesis Testing, James Bond Example
> pbinom(12, prob=0.5, lower.tail=FALSE, size=16)
[1] 0.01063538
or
> binom.test(13, n=16, p=0.5, alternative="greater")

Exact binomial test

data:  13 and 16
number of successes = 13, number of trials = 16, p-value = 0.01064
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.5834277 1.0000000
sample estimates:
probability of success
                0.8125
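The two calls agree because pbinom(12, ..., lower.tail=FALSE) is P(X > 12), which for a count is the same as P(X >= 13), the one-sided tail that binom.test reports. As a quick check, the same tail can be summed term by term with dbinom (the variable names below are only illustrative):

```r
# Upper tail P(X >= 13) for X ~ Binomial(size = 16, prob = 0.5), computed three ways
p_tail_cdf  <- pbinom(12, size = 16, prob = 0.5, lower.tail = FALSE)  # 1 - P(X <= 12)
p_tail_sum  <- sum(dbinom(13:16, size = 16, prob = 0.5))              # P(13) + ... + P(16)
p_tail_test <- binom.test(13, n = 16, p = 0.5,
                          alternative = "greater")$p.value            # exact one-sided test

print(c(cdf = p_tail_cdf, sum = p_tail_sum, test = p_tail_test))
```

All three values should match the 0.01063538 shown above.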
Difference between means - independent samples
"Do the population means for urban and rural residents differ on a test of energy use?"
- Create a null hypothesis for a one-tailed and a two-tailed test
- Interpret the result
Load the data:
> e <- read.table("http://training-course-material.com/images/e/e4/Energy_use.txt", header=TRUE)
Check variances:
> sapply(e, var)
  Urban   Rural
2915935 1859019
Or nicely formatted:
> format(sapply(e, var), big.mark=",")
      Urban       Rural
"2,915,935" "1,859,019"
The difference looks quite big, so let us test whether we can assume the variances are equal:
> var.test(e$Urban, e$Rural)

F test to compare two variances

data:  e$Urban and e$Rural
F = 1.5685, num df = 19, denom df = 19, p-value = 0.3349
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.620845 3.962825
sample estimates:
ratio of variances
          1.568534
With p-value = 0.3349 > 0.05 there is not enough evidence to reject equal variances, so the t-test below assumes equal variances.
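The F statistic reported by var.test is simply the ratio of the two sample variances, and the two-sided p-value comes from the F distribution with 19 and 19 degrees of freedom. A minimal sketch, using the rounded variances printed earlier rather than the raw data (so the last digits may differ slightly):

```r
# Rounded sample variances copied from the sapply(e, var) output
var_urban <- 2915935
var_rural <- 1859019
df_urban  <- 19   # n - 1 for each group, as reported by var.test
df_rural  <- 19

f_stat <- var_urban / var_rural

# Two-sided p-value: twice the smaller of the two tail areas
lower_tail  <- pf(f_stat, df_urban, df_rural)
p_two_sided <- 2 * min(lower_tail, 1 - lower_tail)

print(c(F = f_stat, p.value = p_two_sided))
```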
Convert the data:
> energy <- stack(e)  # Reshape to long format: one column of values and one factor naming the source column
> names(energy) <- c("EnergyUse", "Type")
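To see what this reshaping does, here is a small sketch on made-up numbers. The data frame toy and its values are hypothetical and are not taken from Energy_use.txt:

```r
# Hypothetical toy data: two columns of equal length (NOT the Energy_use.txt values)
toy <- data.frame(Urban = c(10, 12, 11), Rural = c(7, 9, 8))

# stack() puts all values in one column and records the source column in a factor
toy_long <- stack(toy)
names(toy_long) <- c("EnergyUse", "Type")
print(toy_long)

# The long format is what the formula interface (EnergyUse ~ Type) expects
group_means <- tapply(toy_long$EnergyUse, toy_long$Type, mean)
print(group_means)
```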
And test the means:
> t.test(EnergyUse ~ Type, data=energy, var.equal=TRUE)

Two Sample t-test

data:  EnergyUse by Type
t = -4.9907, df = 38, p-value = 1.367e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3427.706 -1449.394
sample estimates:
mean in group Rural mean in group Urban
            2978.65             5417.20
- H0: Means are equal
- H1: Means are not equal
- If the means were equal, the probability of observing a difference at least this large by pure chance would be tiny: 0.00001367 < 0.05.
- Therefore, we reject the null hypothesis.
- The result is statistically significant.
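The t statistic, p-value and confidence interval above can be reproduced by hand from the group means and variances. This is a sketch of the standard pooled-variance (equal variances) formula, assuming 20 observations per group (the F test reported 19 degrees of freedom for each) and using the rounded variances printed earlier:

```r
# Summary numbers copied from the outputs above (variances are rounded)
n_urban <- 20; n_rural <- 20
mean_rural <- 2978.65; mean_urban <- 5417.20
var_urban  <- 2915935; var_rural  <- 1859019

df <- n_urban + n_rural - 2

# Pooled variance and standard error of the difference in means
pooled_var <- ((n_urban - 1) * var_urban + (n_rural - 1) * var_rural) / df
se_diff    <- sqrt(pooled_var * (1 / n_urban + 1 / n_rural))

mean_diff <- mean_rural - mean_urban          # Rural minus Urban, as in the output
t_stat    <- mean_diff / se_diff
p_value   <- 2 * pt(-abs(t_stat), df)         # two-sided p-value
ci        <- mean_diff + c(-1, 1) * qt(0.975, df) * se_diff

print(c(t = t_stat, df = df, p.value = p_value))
print(ci)
```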
Exercise
"Is there a difference in contribution levels to nonprofits between married and never married females?"
- Create a null hypothesis and an alternative hypothesis
- Interpret the result and draw a conclusion
https://training-course-material.com/images/c/c9/Non-profit-contribution.txt
nonprofit <- read.table("https://training-course-material.com/images/c/c9/Non-profit-contribution.txt", header=TRUE, fill=TRUE)
Difference between means - paired
"Does an intervention program reduce the number of cigarettes smoked each day?"
Assumptions
- The number of points in each data set must be the same
- They must be organized in pairs, in which there is a definite relationship between each pair of data points.
- In our case the people asked were the same people after and before the program.
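Because the observations come in pairs, the paired t-test is equivalent to a one-sample t-test on the per-person differences. A minimal sketch on made-up numbers; the vectors before and after are hypothetical and are not the Smoking.txt data:

```r
# Hypothetical before/after counts for 6 people (NOT the Smoking.txt data)
before <- c(20, 15, 30, 25, 10, 18)
after  <- c(18, 15, 24, 26,  8, 15)

# Paired test on the two measurements
paired_res <- t.test(before, after, paired = TRUE)

# One-sample test on the differences, against a mean difference of 0
diff_res <- t.test(before - after, mu = 0)

print(c(paired = paired_res$p.value, differences = diff_res$p.value))
```

Both calls should give the same t statistic, degrees of freedom and p-value.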
Assumed significance level alpha = 0.05 (the maximum tolerable probability of rejecting H0 when it is actually true)
Two Tails
- H0 - means are the same (mb - ma = 0, or mb = ma)
- H1 - they are different
smoke <- read.table("http://training-course-material.com/images/1/14/Smoking.txt", header=TRUE)
t.test(smoke$Before, smoke$After, alternative='two.sided', conf.level=.95, paired=TRUE)

Paired t-test

data:  smoke$Before and smoke$After
t = 1.5782, df = 19, p-value = 0.131
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.7665942  5.4665942
sample estimates:
mean of the differences
                   2.35
- P-value = 0.131024
- This is the probability of seeing a difference between the means at least this large purely by chance, given that the means are equal in reality.
- That is quite probable (more probable than our alpha).
- Therefore there is not enough evidence to reject the null hypothesis.
- There is not enough evidence to say that the program reduced the number of cigarettes smoked.
- It doesn't mean that the program didn't work!
One Tail
smoke <- read.table("http://training-course-material.com/images/1/14/Smoking.txt", header=TRUE)
t.test(smoke$Before, smoke$After, alternative='greater', conf.level=.95, paired=TRUE)
- H0 - mbefore <= mafter (i.e. mb - ma <= 0) - number of cigarettes smoked increased or hasn't changed
- H1 - mbefore > mafter (i.e. mb-ma > 0) - people decreased the number of cigarettes smoked
- P-value = 0.065512
- With 0.065512 > 0.05 it is still fairly probable that a reduction this large would appear by pure chance, so we still cannot reject H0 at the 5% level.
- How would the result change if the significance level were 10%?
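The one-tailed p-value is half of the two-tailed one here because the observed difference lies in the direction stated by H1. Both can be recovered from the reported t statistic and degrees of freedom alone. A small sketch, using the rounded t = 1.5782 from the paired test output (so the last digits may differ slightly):

```r
# Values copied from the paired t-test output
t_stat <- 1.5782
df     <- 19

# Two-sided: area in both tails beyond |t|
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# One-sided ("greater"): area in the upper tail only
p_greater <- pt(t_stat, df, lower.tail = FALSE)

print(c(two.sided = p_two_sided, greater = p_greater))
```

Either value can then be compared against whichever significance level is chosen.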
Exercises
Exercise 1
Is there a difference in weekly sales levels in units sold between Region 1 and Region 2?
http://training-course-material.com/images/c/c7/Sales-in-regions.txt
sales <- read.table("http://training-course-material.com/images/c/c7/Sales-in-regions.txt", header=TRUE)
sales.f <- stack(sales[c("Sales.R1","Sales.R2")])
tapply(sales.f$values,sales.f$ind,mean)
t.test(values~ind, alternative='less', conf.level=.95, var.equal=FALSE,data=sales.f)
Exercise 2 (proportion test)
A company has been accused of racism. Only 4 green people had been promoted, compared with 196 pinks. It turned out that there were 2310 pink applicants and 32 green applicants.
- Would this suggest that pink people were discriminated against (12.5% success rate for greens versus 8.5% for pinks)?
- What is the probability that this would happen by pure chance?
- How would the situation look if 3 green people had been promoted instead of 4?
prop.test(c(4,196),c(32,2310))
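The success rates quoted in the exercise are just promoted divided by applicants for each group. A quick check of that arithmetic (the vector names are only illustrative):

```r
# Counts taken from the exercise text
promoted   <- c(green = 4,  pink = 196)
applicants <- c(green = 32, pink = 2310)

# Sample proportions that prop.test compares
rates <- promoted / applicants
print(round(100 * rates, 1))   # success rates in percent
```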