Statistics for Decision Makers - 11.04 - Hypothesis Testing

From Training Material
title
11.04 - Hypothesis Testing
author
Bernard Szlachta (NobleProg Ltd)

Prerequisites

Hypothesis Testing。

  • I once asked out a statistician.
  • She failed to reject me.

A Light Bulb。

How many statisticians does it take to change a light bulb?
A: 5–7, with p-value 0.01

Tool Description。

Names
Hypothesis Testing
Usages
Checking whether an observed difference could plausibly be due to chance alone
Examples
Is the new version of software better than the previous one?
Do women watch YouTube more often than men?
Does a blue background make people less tired than a red one?

Questions。

  1. How can we distinguish between two things?
  2. What is the probability that the conclusion is not due to pure chance?
  3. What is the difference between:
    • The probability of an event
    • The probability of a state of the world
  4. How do we define the "null hypothesis" and the "alternative hypothesis"?

Lady Tasting Tea。

[Images: R. A. Fisher; Mary Cassatt, The Cup of Tea (1880)]

Ronald Fisher explained the concept of hypothesis testing with a story of a lady tasting tea.

  • The lady in question claimed to be able to tell whether the tea or the milk was added first to a cup.
  • Fisher gave her eight cups, four of each variety, in random order.
  • The woman got all eight cups correct.
  • What is the probability that she got it right, but just by pure chance?

Answer。

  • There is a 1 in 70 chance (70 is the number of combinations of 8 taken 4 at a time) that, if she couldn't tell the difference, she would guess all 8 cups correctly
  • This is about 1.4%, below the conventionally used 5% significance level
  • More on Wikipedia.
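
The 1 in 70 figure can be checked with a few lines of Python. This is only a sketch of the counting argument above; the variable names are ours and are not part of the original material.

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups had the milk added first.
ways = comb(8, 4)

# Only one of those selections is entirely correct, so someone who is
# purely guessing gets all eight cups right with probability 1 / ways.
p_all_correct = 1 / ways

print(ways)                     # 70
print(round(p_all_correct, 4))  # 0.0143
```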

James Bond Example。

Problem

  • James Bond insists that Martinis should be shaken rather than stirred
  • We want to determine whether Mr. Bond can tell the difference between a shaken and a stirred Martini

Experiment

  • Suppose we gave Mr. Bond a series of 16 taste tests
  • In each test, we flipped a fair coin to determine whether to stir or shake the Martini
  • Then we presented the martini to Mr. Bond and asked him to decide whether it was shaken or stirred

Results

  • Let's say Mr. Bond was correct on 13 of the 16 taste tests
  • Can he tell the difference?

Interpretation

  • This result does not prove that he can!
  • It could be he was just lucky and guessed right 13 out of 16 times
  • How plausible is the explanation that he was just lucky?


Answer。

  • To assess its plausibility, we determine the probability that someone who was just guessing would be correct on 13 or more of the 16 tests
  • This probability can be computed from the binomial distribution
  • http://www.stat.tamu.edu/~west/applets/binomialdemo.html
  • In a spreadsheet (e.g. Google Sheets), either of these formulas gives the result:
    • 1-binomdist(12,16,0.5,true)
    • binomdist(13,16,0.5,false)+binomdist(14,16,0.5,false)+binomdist(15,16,0.5,false)+binomdist(16,16,0.5,false)
  • A binomial distribution calculator shows it to be 0.0106
  • Someone who was only guessing would do this well about once in every hundred such experiments


  • So either Mr. Bond was very lucky, or he can tell whether the drink was shaken or stirred
  • The hypothesis that he was guessing is not proven false, but considerable doubt is cast on it
  • Therefore, there is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred
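
For readers without a spreadsheet at hand, the same tail probability can be sketched in Python. The helper name binomial_tail is our own and is not from the original material; it simply adds up the binomial probabilities for 13, 14, 15 and 16 correct answers.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Mr. Bond: 13 or more correct out of 16 if he were only guessing.
p_value = binomial_tail(13, 16)
print(round(p_value, 4))  # 0.0106
```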

Physicians' Reactions。

Problem

  • Do physicians spend less time with obese patients?

Experiment

  • Physicians were sampled randomly and each was shown a chart of a patient complaining of a migraine headache
  • They were then asked to estimate how long they would spend with the patient
  • The charts were identical except that for half the charts, the patient was obese and for the other half, the patient was of normal weight
  • The chart a particular physician viewed was determined randomly
  • 31 physicians viewed charts of average-weight patients and 38 physicians viewed charts of obese patients

Results

  • The reported mean time spent with patients:
    • obese: 24.7 min
    • average-weight: 31.4 min
  • How might this difference between means have occurred?

Interpretation。

  • Two possibilities:
    • physicians were influenced by the weight of the patients
    • the difference arose by pure chance
  • Random assignment of charts does not ensure that the groups will be equal in all respects other than the chart they viewed
  • In fact, it is certain the groups differed in many ways by chance (e.g. mean age, gender, race, etc.)
  • How plausible is it that these chance differences are responsible for the difference in times?
  • What is the probability of getting a difference as large as or larger than the observed difference (6.7 min) due to chance?
  • This probability can be computed to be 0.0057 (one in 175 experiments) - see Differences between Two Means (Independent Groups)
  • Since this is a low probability, we have confidence that the difference in times is due to the patient's weight and is not due to chance
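
The idea of "a difference this large arising by chance" can be illustrated with a permutation test. This is a different technique from the test for two independent means referenced above, and the times below are made up for illustration; they are not the study's data. The function name and its parameters are also our own.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Share of random relabelings whose difference in means is at least
    as large (in absolute value) as the observed difference."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_shuffles):
        # Reassign the pooled values to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical times in minutes (NOT the real study data).
average_weight = [35, 28, 40, 30, 33, 25, 38, 31]
obese = [22, 27, 20, 30, 24, 26, 19, 28]

p_value = permutation_p_value(average_weight, obese)
print(p_value)
```

A small result means that random relabeling of the physicians rarely produces a gap as large as the one observed, which is the same logic as the 0.0057 quoted above.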

The Probability Value。

  • The probability value is also known as "P", "P-value" or "p"
  • In the James Bond example, the computed probability of 0.0106 is the probability he would be correct on 13 or more taste tests (out of 16) if he were just guessing (i.e. by pure chance)
  • The 0.0106 is NOT the probability he cannot tell the difference
  • The probability of 0.0106 is the probability of a certain outcome (13 or more out of 16) assuming a certain state of the world (James Bond was only guessing)
  • It is not the probability that a state of the world is true

An example - a bird which knows how to divide。

  • An animal trainer claims that a trained bird can determine whether or not numbers are evenly divisible by 7
  • In an experiment assessing this claim, the bird is given a series of 16 test trials
  • On each trial, a number is displayed on a screen and the bird pecks at one of two keys to indicate its choice
  • The numbers are chosen in such a way that the probability of any number being evenly divisible by 7 is 0.50
  • The bird is correct on 9/16 choices

Answer。

  • From binomial distribution, the probability of being correct nine or more times out of 16 if one is only guessing is 0.40
  • Since a bird who is only guessing would do this well 40% of the time, these data do not provide convincing evidence that the bird can tell the difference between the two types of numbers
  • The 40% does NOT mean that there is a 0.40 probability that the bird can tell the difference!!!
  • The probability value is the probability of an outcome (9/16 or better) and not the probability of a particular state of the world (the bird can tell whether a number is divisible by 7)
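
One way to see what "would do this well 40% of the time" means is to simulate many birds that only guess. The sketch below is an illustration under our own assumptions (a fixed random seed, 20,000 simulated birds, a hypothetical function name); it estimates the same quantity that the binomial distribution gives exactly.

```python
import random

def share_of_guessers_at_least(k, n_trials=16, n_birds=20_000, seed=1):
    """Fraction of simulated guessing birds that get k or more of
    n_trials choices right, each choice being a fair coin flip."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_birds):
        correct = sum(rng.random() < 0.5 for _ in range(n_trials))
        if correct >= k:
            hits += 1
    return hits / n_birds

estimate = share_of_guessers_at_least(9)
print(estimate)  # close to 0.40; the exact value depends on the seed
```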

State of the world vs an outcome。

  • Hypotheses are the possible states of the world
  • The probability value is the probability of an outcome given the hypothesis
  • It is not the probability of the hypothesis given the outcome
  • If the probability of the outcome given the hypothesis is sufficiently low, we have evidence that the hypothesis is false
  • However, we do not compute the probability that the hypothesis is false
  • In the James Bond example, the hypothesis is that he cannot tell the difference between shaken and stirred martinis
  • The probability value is low (0.0106), thus providing evidence that he can tell the difference
  • However, we have not computed the probability that he can tell the difference
  • A branch of statistics called Bayesian statistics provides methods for computing the probabilities of hypotheses
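
To make the contrast concrete, here is a minimal Bayesian sketch for the James Bond example. It rests on two assumptions that are ours, not the source's: before the experiment we give "he is guessing" a probability of 0.5, and if he can tell the difference we assume he is right 75% of the time. Different assumptions give a different answer, which is exactly why this number is not the P-value.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

prior_guessing = 0.5   # ASSUMPTION: prior probability that he is only guessing
skill_accuracy = 0.75  # ASSUMPTION: his accuracy if he really can tell

# Probability of the observed outcome (13 of 16) under each state of the world.
like_guessing = binomial_pmf(13, 16, 0.5)
like_skill = binomial_pmf(13, 16, skill_accuracy)

# Bayes' rule: probability of the hypothesis given the outcome.
posterior_guessing = (prior_guessing * like_guessing) / (
    prior_guessing * like_guessing + (1 - prior_guessing) * like_skill
)
print(round(posterior_guessing, 3))
```

Under these particular assumptions the probability that he was guessing comes out at roughly 0.04, which is not the same number as the P-value of 0.0106 and answers a different question.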

Why Null Hypothesis is called Null Hypothesis。

  • A statement is called falsifiable if it is possible to conceive an observation or an argument which proves the statement in question to be false
  • We agreed that good hypotheses must be falsifiable
  • In this sense, falsify is synonymous with nullify, meaning not "to commit fraud" but "show to be false"
  • Therefore the hypothesis which needs to be disproved is called "The Null Hypothesis"

The Null Hypothesis。

The null hypothesis is that an apparent effect is due to chance.


In the Physicians' Reactions example, the null hypothesis is that in the population of physicians, the mean time expected to be spent with obese patients is equal to the mean time expected to be spent with average-weight patients:

H0: μobese = μaverage
or
H0: μobese - μaverage = 0.

In a correlational study of the relationship between high-school grades and college grades, the null hypothesis would be that the population correlation is 0:

H0: ρ = 0

The test for a biased coin:

H0: π = 0.5
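
As a sketch of how the coin hypothesis could be tested, the function below computes a two-sided P-value: the probability, if π = 0.5, of a number of heads at least as far from half the flips as the number observed. The function name and the example of 14 heads in 20 flips are hypothetical and chosen only for illustration.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """P-value for H0: pi = 0.5, counting every outcome at least as far
    from flips / 2 as the observed number of heads."""
    distance = abs(heads - flips / 2)
    total = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return total / 2**flips

# Hypothetical data: 14 heads in 20 flips.
print(round(two_sided_p_value(14, 20), 4))  # 0.1153
```

With these made-up numbers the P-value is above 0.05, so the data would not give convincing evidence that the coin is biased.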


The null hypothesis is typically the opposite of the researcher's hypothesis:
  • The physicians were expected to spend less time with obese patients, but the null hypothesis is that they do not
  • If the null hypothesis were true, a difference as large as or larger than the sample difference of 6.7 minutes would be very unlikely to occur
  • Therefore, the researchers rejected the null hypothesis of no difference and concluded that in the population, physicians intend to spend less time with obese patients

The alternative hypothesis。

  • If the null hypothesis is rejected, then the alternative hypothesis is accepted
  • The alternative hypothesis is the reverse of the null hypothesis
H0: μobese = μaverage
If H0 is rejected, there are two alternatives:
H1: μobese < μaverage
or
H1: μobese > μaverage

The direction of the sample means determines which alternative is adopted.

Quiz。

1 Tommy claims that he blindly guessed on a 20-question true/false test, but then he got 80% of the questions correct. Using the binomial calculator, you find out that the probability of getting 16 or more correct out of 20 when p = 0.5 is 0.0059. This probability of 0.0059 is the probability that...

he would get 80% correct if he took the test again.
he would get this score or better if he were just guessing.
he was guessing blindly on the test.

Answer >>

If Tommy were guessing blindly, the probability that he would have gotten 16 or more of the 20 questions right is 0.0059. This is NOT the probability that he was guessing blindly. Remember, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome.


2 A researcher believes that 2nd graders will score higher than 1st graders on a particular test. Which of the following is the two-tailed null hypothesis?

Mean of the 1st graders < Mean of the 2nd graders.
Mean of the 1st graders > Mean of the 2nd graders.
Mean of the 1st graders = Mean of the 2nd graders.

Answer >>

The null hypothesis says that any apparent effect is due to chance, so in this case, the null hypothesis would be that the two population means for the 1st and 2nd graders are equal. The null hypothesis is usually not the researcher's hypothesis.


3 The researchers hypothesized that there would be a correlation between how much people studied and their GPAs. The null hypothesis is that the population correlation is equal to

Answer >>

The null hypothesis says that any apparent effect is due to chance, so in this case, the null hypothesis would be that the population correlation was 0.


4 Is the new version of a piece of software better than the previous one?

  1. Two randomly chosen groups of 30 people each were selected to assess a new version of a piece of software on a scale of 1 to 10
  2. The averages of the two groups were calculated
  3. The null hypothesis is that there is no difference between the versions
  4. The P-value was calculated to be 0.06

What does it mean for a decision maker?

There is no difference! Both must be equally popular with absolute certainty!
There is a very small probability (6 in 100) that there is a difference
If there is no difference in reality, a difference this large would still arise by pure chance fairly often (about 6 times in 100) with this sample size. The test is inconclusive.

Answer >>

The P-value is the probability of getting a difference at least as large as the one observed if H0 is true, that is, by pure chance alone. Usually, if the P-value is greater than 5%, the test is treated as inconclusive.


5 Do women watch YouTube more often than men?

  1. Two randomly selected groups of 30 men and 30 women were asked how often they watch YouTube
  2. Average times spent watching YouTube were calculated. The women's average time was greater than the men's.
  3. The null hypothesis is that there is no difference
  4. An alternative hypothesis is that women watch YouTube more
  5. The P-value was calculated to be 0.001

What does it mean for a decision maker?

Women watch YouTube more
More men than women watch YouTube
We cannot tell

Answer >>

The P-value is the probability of getting a difference at least as large as the one observed if H0 is true, that is, by pure chance alone. Here that probability is only 0.001, so it is unlikely that the observed difference was due to pure chance.



