<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-GB">
	<id>https://training-course-material.com/index.php?action=history&amp;feed=atom&amp;title=R_-_Multiple_regression</id>
	<title>R - Multiple regression - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://training-course-material.com/index.php?action=history&amp;feed=atom&amp;title=R_-_Multiple_regression"/>
	<link rel="alternate" type="text/html" href="https://training-course-material.com/index.php?title=R_-_Multiple_regression&amp;action=history"/>
	<updated>2026-05-13T23:41:26Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.1</generator>
	<entry>
		<id>https://training-course-material.com/index.php?title=R_-_Multiple_regression&amp;diff=29339&amp;oldid=prev</id>
		<title>Bernard Szlachta at 12:32, 6 March 2016</title>
		<link rel="alternate" type="text/html" href="https://training-course-material.com/index.php?title=R_-_Multiple_regression&amp;diff=29339&amp;oldid=prev"/>
		<updated>2016-03-06T12:32:43Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{Cat|Statistics}}&lt;br /&gt;
{{Cat|Intro to R}}&lt;br /&gt;
{{Cat|Forecasting}}&lt;br /&gt;
&lt;br /&gt;
== Covered Materials ==&lt;br /&gt;
* assumptions of multiple regression &lt;br /&gt;
* regression equation&lt;br /&gt;
* regression coefficient (weights)&lt;br /&gt;
* beta weight&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;R&amp;#039;&amp;#039;&amp;#039; and &amp;#039;&amp;#039;&amp;#039;r&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
* partial slope&lt;br /&gt;
* sum of squares explained in a multiple regression compared to sum of squares in simple regression&lt;br /&gt;
* R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; and proportion explained&lt;br /&gt;
* R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; significance test&lt;br /&gt;
* Complete and reduced model for significance&lt;br /&gt;
&lt;br /&gt;
== Assumptions ==&lt;br /&gt;
# No assumptions are necessary for:&lt;br /&gt;
#* computing the regression coefficients&lt;br /&gt;
#* partitioning the sum of squares&lt;br /&gt;
&lt;br /&gt;
# Interpreting the inferential statistics rests on three major assumptions&lt;br /&gt;
# Moderate violations of the assumptions do not pose a serious problem for &amp;#039;&amp;#039;&amp;#039;testing the significance&amp;#039;&amp;#039;&amp;#039; of &amp;#039;&amp;#039;&amp;#039;predictor variables&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
# Even small violations of these assumptions pose problems for &amp;#039;&amp;#039;&amp;#039;confidence intervals&amp;#039;&amp;#039;&amp;#039; on predictions for specific observations&lt;br /&gt;
&lt;br /&gt;
=== Residuals are normally distributed ===&lt;br /&gt;
* residuals are the errors of prediction (differences between the actual and the predicted scores)&lt;br /&gt;
* a Q-Q plot is usually used to test residual normality&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 qqnorm(m$residuals)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
[[File:Rplot-qqnorm-qqplot.png|400px]]&lt;br /&gt;
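&lt;br /&gt;
A formal test can complement the visual Q-Q check; a minimal sketch using base R&amp;#039;s Shapiro-Wilk test (assuming the model &amp;#039;&amp;#039;m&amp;#039;&amp;#039; built in the [[#Building Model|Building Model]] section):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 # Shapiro-Wilk test; a small p-value suggests non-normal residuals&lt;br /&gt;
 shapiro.test(m$residuals)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;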
&lt;br /&gt;
=== Homoscedasticity ===&lt;br /&gt;
* variances of the errors of prediction are the same for all predicted values&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 plot(m$fitted.values,m$residuals)&lt;br /&gt;
 # see also plot(m)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
[[File:Rplot-homoscedasticity-plot.png|400px]]&lt;br /&gt;
* In the plot above, the errors of prediction are much larger for observations with low-to-medium predicted scores than for observations with high predicted scores&lt;br /&gt;
* A confidence interval on a low predicted UGPA would therefore underestimate the uncertainty&lt;br /&gt;
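&lt;br /&gt;
Homoscedasticity can also be checked numerically; a minimal sketch using the Breusch-Pagan test from the &amp;#039;&amp;#039;lmtest&amp;#039;&amp;#039; package (an assumption here; this package is not used elsewhere on this page):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 install.packages(&amp;quot;lmtest&amp;quot;)&lt;br /&gt;
 library(&amp;quot;lmtest&amp;quot;)&lt;br /&gt;
 # a small p-value suggests heteroscedasticity&lt;br /&gt;
 bptest(m)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;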
&lt;br /&gt;
=== Linearity ===&lt;br /&gt;
* relationship between each predictor and the criterion variable is linear &lt;br /&gt;
* If this assumption is not met, the predictions may systematically overestimate the actual values for one range of values on a predictor variable and underestimate them for another&lt;br /&gt;
* If it is known that the relationship is not linear, a transformation (e.g. a log transformation) can be used to bring it to linear form&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
plot(UGPA ~ HSGPA)&lt;br /&gt;
plot(UGPA ~ SAT)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Rplot-liniearity-assuption.png|400px]]&lt;br /&gt;
&lt;br /&gt;
== Data Used ==&lt;br /&gt;
[[File:MultipleRegression-GPA.txt]]&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
rawdata &amp;lt;- read.table(&amp;quot;http://training-course-material.com/images/4/4b/MultipleRegression-GPA.txt&amp;quot;,h=T)&lt;br /&gt;
head(rawdata)&lt;br /&gt;
&lt;br /&gt;
# Just naming the variables more conveniently&lt;br /&gt;
HSGPA &amp;lt;- rawdata$high_GPA&lt;br /&gt;
SAT &amp;lt;- rawdata$math_SAT + rawdata$verb_SAT&lt;br /&gt;
UGPA &amp;lt;- rawdata$univ_GPA&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Building Model ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 m &amp;lt;- lm(UGPA ~ HSGPA + SAT)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 lm(formula = UGPA ~ HSGPA + SAT)&lt;br /&gt;
 &lt;br /&gt;
 Coefficients:&lt;br /&gt;
 (Intercept)        HSGPA          SAT  &lt;br /&gt;
   0.5397848    0.5414534    0.0007918  &lt;br /&gt;
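&lt;br /&gt;
The fitted model can be used directly for prediction; a minimal sketch (the input values below are made up for illustration):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 # predicted UGPA for a hypothetical student&lt;br /&gt;
 predict(m, newdata = data.frame(HSGPA = 3.5, SAT = 1200))&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;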
&lt;br /&gt;
== Validating Model ==&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 summary(m)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 Call:&lt;br /&gt;
 lm(formula = UGPA ~ HSGPA + SAT)&lt;br /&gt;
 &lt;br /&gt;
 Residuals:&lt;br /&gt;
      Min       1Q   Median       3Q      Max &lt;br /&gt;
 -0.68072 -0.13667  0.01137  0.17012  0.92983  &lt;br /&gt;
 &lt;br /&gt;
 Coefficients:&lt;br /&gt;
              Estimate Std. Error t value Pr(&amp;gt;|t|)    &lt;br /&gt;
 (Intercept) 0.5397848  0.3177784   1.699   0.0924 .  &lt;br /&gt;
 HSGPA       0.5414534  0.0837479   6.465 3.53e-09 ***&lt;br /&gt;
 SAT         0.0007918  0.0003868   2.047   0.0432 *  &lt;br /&gt;
 ---&lt;br /&gt;
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 &lt;br /&gt;
 &lt;br /&gt;
 Residual standard error: 0.2772 on 102 degrees of freedom&lt;br /&gt;
 Multiple R-squared: 0.6232,	Adjusted R-squared: 0.6158 &lt;br /&gt;
 F-statistic: 84.35 on 2 and 102 DF,  p-value: &amp;lt; 2.2e-16&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results ==&lt;br /&gt;
The regression is expressed by the formula:&lt;br /&gt;
 UGPA = b&amp;lt;sub&amp;gt;1&amp;lt;/sub&amp;gt;*HSGPA + b&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;*SAT + A&lt;br /&gt;
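&lt;br /&gt;
The formula can be verified against the fitted values; a minimal sketch using the coefficients of the model built above:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 b &amp;lt;- coef(m)&lt;br /&gt;
 # manual prediction using the regression equation&lt;br /&gt;
 manual &amp;lt;- b[&amp;quot;(Intercept)&amp;quot;] + b[&amp;quot;HSGPA&amp;quot;] * HSGPA + b[&amp;quot;SAT&amp;quot;] * SAT&lt;br /&gt;
 all.equal(unname(manual), unname(m$fitted.values))  # TRUE&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;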
&lt;br /&gt;
* b&amp;lt;sub&amp;gt;1&amp;lt;/sub&amp;gt; and b&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; are &amp;#039;&amp;#039;&amp;#039;regression coefficients&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
* a regression coefficient is the slope of the linear relationship between:&lt;br /&gt;
** the criterion variable and &lt;br /&gt;
** the part of a predictor variable that is &amp;#039;&amp;#039;&amp;#039;independent of&amp;#039;&amp;#039;&amp;#039; all other predictor variables&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
* the regression coefficient for HSGPA can be computed by: &lt;br /&gt;
# first predicting HSGPA from SAT&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 m.HSGPA.SAT = lm(HSGPA ~ SAT)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
 Call:&lt;br /&gt;
 lm(formula = HSGPA ~ SAT)&lt;br /&gt;
 &lt;br /&gt;
 Coefficients:&lt;br /&gt;
 (Intercept)          SAT  &lt;br /&gt;
   -1.313853     0.003594   &lt;br /&gt;
# calculating the errors of prediction (m.HSGPA.SAT$residuals)&lt;br /&gt;
&lt;br /&gt;
# find the slope of the UGPA ~ m.HSGPA.SAT$residuals&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 lm(UGPA ~ m.HSGPA.SAT$residuals)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
 Call:&lt;br /&gt;
 lm(formula = UGPA ~ m.HSGPA.SAT$residuals)&lt;br /&gt;
 &lt;br /&gt;
 Coefficients:&lt;br /&gt;
           (Intercept)  m.HSGPA.SAT$residuals  &lt;br /&gt;
                3.1729                 0.5415&lt;br /&gt;
&lt;br /&gt;
The 0.5415 is the same value as b&amp;lt;sub&amp;gt;1&amp;lt;/sub&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Further coefficient interpretations ===&lt;br /&gt;
The regression coefficient for HSGPA (b&amp;lt;sub&amp;gt;1&amp;lt;/sub&amp;gt;) is the slope of the relationship between the criterion variable (UGPA) and &amp;#039;&amp;#039;&amp;#039;the part&amp;#039;&amp;#039;&amp;#039; of HSGPA that is &amp;#039;&amp;#039;&amp;#039;independent of (uncorrelated with)&amp;#039;&amp;#039;&amp;#039; the other &amp;#039;&amp;#039;&amp;#039;predictor variables &amp;#039;&amp;#039;&amp;#039; (SAT in this case). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It represents the change in the criterion variable associated with a change of one in the predictor variable when all other predictor variables &amp;#039;&amp;#039;&amp;#039;are held constant&amp;#039;&amp;#039;&amp;#039;. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Since the regression coefficient for HSGPA is 0.54, this means that, holding SAT constant, a change of one in HSGPA is associated with a change of 0.54 in UGPA. &lt;br /&gt;
&lt;br /&gt;
=== Partial Slope ===&lt;br /&gt;
The slope of the relationship between the part of a predictor variable independent of other predictor variables and the criterion is its &amp;#039;&amp;#039;&amp;#039;partial slope&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Thus the regression coefficient of 0.541 for HSGPA and the regression coefficient of 0.008 for SAT are partial slopes. &lt;br /&gt;
&lt;br /&gt;
Each partial slope represents the relationship between the predictor variable and the criterion holding constant all of the other predictor variables.&lt;br /&gt;
&lt;br /&gt;
== Beta Weight ==&lt;br /&gt;
* It is difficult to compare the coefficients for different variables directly because they are measured on &amp;#039;&amp;#039;&amp;#039;different scales&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
* A difference of 1 in HSGPA is a fairly large difference (0.54), whereas a difference of 1 on the SAT is negligible (0.008). &lt;br /&gt;
*  standardization of the variables to standard deviation of 1 solves the problem&lt;br /&gt;
&lt;br /&gt;
* A regression weight for standardized variables is called a &amp;#039;&amp;#039;&amp;#039;beta weight&amp;#039;&amp;#039;&amp;#039; (designated as β)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 install.packages(&amp;quot;yhat&amp;quot;)&lt;br /&gt;
 library(&amp;quot;yhat&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 r &amp;lt;- regr(m)&lt;br /&gt;
 r$Beta&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
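&lt;br /&gt;
Equivalently, the beta weights can be obtained without the &amp;#039;&amp;#039;yhat&amp;#039;&amp;#039; package by standardizing all variables first; a minimal sketch:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 # the slopes of the standardized regression are the beta weights&lt;br /&gt;
 lm(scale(UGPA) ~ scale(HSGPA) + scale(SAT))&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;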
&lt;br /&gt;
* For these data, the beta weights are 0.625 and 0.198&lt;br /&gt;
* These values represent the change in the criterion (in standard deviations) associated with a change of one standard deviation on a predictor [holding constant the value(s) on the other predictor(s)]&lt;br /&gt;
* Clearly, a change of one standard deviation in HSGPA is associated with a larger change in UGPA than a change of one standard deviation in SAT&lt;br /&gt;
&lt;br /&gt;
* In practical terms, this means that if you know a student&amp;#039;s HSGPA, knowing the student&amp;#039;s SAT does not aid the prediction of UGPA much&lt;br /&gt;
* However, if you do not know the student&amp;#039;s HSGPA, his or her SAT can aid in the prediction since the β weight in the simple regression predicting UGPA from SAT is 0.68&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 m1 &amp;lt;- lm(UGPA ~ SAT)&lt;br /&gt;
 regr(m1)$Beta_Weights&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
* the β weight in the simple regression predicting UGPA from HSGPA is 0.78&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 m1 &amp;lt;- lm(UGPA ~ HSGPA)&lt;br /&gt;
 regr(m1)$Beta_Weights&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* As is typically the case, the partial slopes are smaller than the slopes in simple regression.&lt;br /&gt;
&lt;br /&gt;
== Sum of Squares ==&lt;br /&gt;
&amp;lt;math&amp;gt;\mathrm{TSS} =  \mathrm{ESS} + \mathrm{RSS},&amp;lt;/math&amp;gt;&lt;br /&gt;
[[File:Partion of Sum of Squares.jpg]]&lt;br /&gt;
=== Total sum of squares (TSS or SSY) ===&lt;br /&gt;
: Sum of the squared differences between the actual Y and the mean of Y&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;y_i&amp;lt;/math&amp;gt; is the ith data point&lt;br /&gt;
* &amp;lt;math&amp;gt;\overline{y}&amp;lt;/math&amp;gt; is the estimate of the mean&lt;br /&gt;
* &amp;lt;math&amp;gt;y_i - \overline{y}&amp;lt;/math&amp;gt; is the deviation of the ith data point from the mean&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt; TSS = \sum_{i=1}^n\left(y_i-\overline{y}\,\right)^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* tells us how much variation there is in the dependent variable&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
sum((UGPA - mean(UGPA))^2)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Explained sum of squares (ESS or SSY&amp;#039;) ===&lt;br /&gt;
: Sum of the squared differences between the predicted Y and the mean of  Y, &lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\hat{y}_i=\hat{a}+\hat{b_1}x_{1i} + \hat{b_2}x_{2i} + \cdots \,  &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\text{ESS} = \sum_{i=1}^n \left(\hat{y}_i - \bar{y}\right)^2.&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{y}&amp;lt;/math&amp;gt; = Yhat&lt;br /&gt;
* ESS tells us how much of the variation in the dependent variable our model explained&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 sum((m$fitted.values - mean(UGPA))^2)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Residual sum of squares (RSS or SSE) ===&lt;br /&gt;
: Sum of the squared differences between the actual Y and the predicted Y&lt;br /&gt;
&amp;lt;math&amp;gt;RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - f(x_i))^2 = \sum_{i=1}^n \epsilon_i^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In simple regression this becomes &amp;lt;math&amp;gt;RSS = \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* RSS tells us how much of the variation in the dependent variable our model did not explain&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
sum((UGPA - m$fitted.values)^2)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Calculating sum of squares in R ===&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 TSS = sum((UGPA - mean(UGPA))^2)&lt;br /&gt;
 ESS = sum((m$fitted.values - mean(UGPA))^2)&lt;br /&gt;
 RSS = sum((UGPA - m$fitted.values)^2)&lt;br /&gt;
 # TODO: add aov function method&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;gt; TSS&lt;br /&gt;
 [1] 20.79814&lt;br /&gt;
 &amp;gt; ESS&lt;br /&gt;
 [1] 12.96135&lt;br /&gt;
 &amp;gt; RSS&lt;br /&gt;
 [1] 7.836789&lt;br /&gt;
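&lt;br /&gt;
The partition &amp;lt;math&amp;gt;\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}&amp;lt;/math&amp;gt; can be confirmed numerically:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 # TRUE up to floating-point tolerance&lt;br /&gt;
 all.equal(TSS, ESS + RSS)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;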
&lt;br /&gt;
== Multiple Correlation and Proportion Explained ==&lt;br /&gt;
 Proportion Explained = SSY&amp;#039;/SSY = R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 ProportionExplained = ESS/TSS&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
 ProportionExplained&lt;br /&gt;
 [1] 0.6231977&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 m &amp;lt;- lm(UGPA ~ HSGPA + SAT)&lt;br /&gt;
 summary(m)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
 Call:&lt;br /&gt;
 lm(formula = UGPA ~ HSGPA + SAT)&lt;br /&gt;
 &lt;br /&gt;
 Residuals:&lt;br /&gt;
      Min       1Q   Median       3Q      Max &lt;br /&gt;
 -0.68072 -0.13667  0.01137  0.17012  0.92983  &lt;br /&gt;
 &lt;br /&gt;
 Coefficients:&lt;br /&gt;
              Estimate Std. Error t value Pr(&amp;gt;|t|)    &lt;br /&gt;
 (Intercept) 0.5397848  0.3177784   1.699   0.0924 .  &lt;br /&gt;
 HSGPA       0.5414534  0.0837479   6.465 3.53e-09 ***&lt;br /&gt;
 SAT         0.0007918  0.0003868   2.047   0.0432 *  &lt;br /&gt;
 ---&lt;br /&gt;
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 &lt;br /&gt;
 &lt;br /&gt;
 Residual standard error: 0.2772 on 102 degrees of freedom&lt;br /&gt;
 Multiple R-squared: 0.6232,	Adjusted R-squared: 0.6158 &lt;br /&gt;
 F-statistic: 84.35 on 2 and 102 DF,  p-value: &amp;lt; 2.2e-16&lt;br /&gt;
&lt;br /&gt;
== Confounding ==&lt;br /&gt;
* the sum of squares explained  (ESS) for these data is 12.96&lt;br /&gt;
* How is this value divided between HSGPA and SAT?&lt;br /&gt;
* it is not the same as prediction of UGPA in separate simple regressions for HSGPA and SAT&lt;br /&gt;
&lt;br /&gt;
 Predictors	Sum of Squares&lt;br /&gt;
 HSGPA	        12.64 (simple regression)&lt;br /&gt;
 SAT	        9.75  (simple regression)&lt;br /&gt;
 HSGPA and SAT	12.96 (multiple regression)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 m &amp;lt;- lm(UGPA ~ HSGPA + SAT)&lt;br /&gt;
 ESS = sum((m$fitted.values - mean(UGPA))^2)&lt;br /&gt;
 # SAT has been left out&lt;br /&gt;
 m.hsgpa &amp;lt;- lm(UGPA ~  HSGPA)&lt;br /&gt;
 ess.hsgpa &amp;lt;- sum((m.hsgpa$fitted.values - mean(UGPA))^2)&lt;br /&gt;
 # HSGPA has been left out&lt;br /&gt;
 m.sat &amp;lt;- lm(UGPA ~  SAT)&lt;br /&gt;
 ess.sat &amp;lt;- sum((m.sat$fitted.values - mean(UGPA))^2)&lt;br /&gt;
 ess.hsgpa # 12.63942&lt;br /&gt;
 ess.sat   # 9.749823&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
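&lt;br /&gt;
The unique contribution of each predictor can also be tested formally by comparing the complete and reduced models (see &amp;quot;Complete and reduced model for significance&amp;quot; in the Covered Materials); a minimal sketch using the models built above:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 # F-test: does adding SAT to the HSGPA-only model significantly improve the fit?&lt;br /&gt;
 anova(m.hsgpa, m)&lt;br /&gt;
 # F-test: does adding HSGPA to the SAT-only model significantly improve the fit?&lt;br /&gt;
 anova(m.sat, m)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;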
&lt;br /&gt;
* If the sum of the ESS values from the simple regressions is higher than the ESS from the multiple regression, the &amp;#039;&amp;#039;&amp;#039;predictors are correlated&amp;#039;&amp;#039;&amp;#039; (here r = .78)&lt;br /&gt;
* Much of the variance in UGPA is &amp;#039;&amp;#039;&amp;#039;confounded&amp;#039;&amp;#039;&amp;#039; between HSGPA and SAT&lt;br /&gt;
* That variance could be explained by either HSGPA or SAT, so it is counted twice if the sums of squares for HSGPA and SAT are simply added&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Proportion of variance explained ===&lt;br /&gt;
 Source	                       SS   Proportion&lt;br /&gt;
 HSGPA (unique)	             3.21   0.15&lt;br /&gt;
 SAT (unique)	             0.32   0.02&lt;br /&gt;
 HSGPA and SAT (Confounded)   9.43   0.45&lt;br /&gt;
 Error	                     7.84   0.38&lt;br /&gt;
 Total	                    20.80  1.00&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;rsplus&amp;quot;&amp;gt;&lt;br /&gt;
 HSGPA.UNIQUE &amp;lt;- ESS - ess.sat&lt;br /&gt;
 HSGPA.UNIQUE # 3.211531&lt;br /&gt;
 SAT.UNIQUE   &amp;lt;- ESS - ess.hsgpa&lt;br /&gt;
 SAT.UNIQUE    # 0.3219346&lt;br /&gt;
 HSGPA.SAT.Confounded = ESS - (HSGPA.UNIQUE + SAT.UNIQUE)&lt;br /&gt;
 HSGPA.SAT.Confounded # 9.427889&lt;br /&gt;
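 &lt;br /&gt;
 # Proportion column of the table above (assumes TSS and RSS from the Sum of Squares section)&lt;br /&gt;
 HSGPA.UNIQUE / TSS          # ~0.15&lt;br /&gt;
 SAT.UNIQUE / TSS            # ~0.02&lt;br /&gt;
 HSGPA.SAT.Confounded / TSS  # ~0.45&lt;br /&gt;
 RSS / TSS                   # ~0.38&lt;br /&gt;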
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;/div&gt;</summary>
		<author><name>Bernard Szlachta</name></author>
	</entry>
</feed>