# R - Multiple regression

## Covered Materials

• assumptions of multiple regression
• regression equation
• regression coefficient (weights)
• beta weight
• R and r
• partial slope
• sum of squares explained in a multiple regression compared to sum of squares in simple regression
• R2 and proportion explained
• R2 significance test
• Complete and reduced model for significance

## Assumptions

1. No assumptions are necessary for:
   • computing the regression coefficients
   • partitioning the sum of squares
2. Interpreting the inferential statistics relies on 3 major assumptions
3. Moderate violations of the assumptions do not pose a serious problem for testing the significance of predictor variables
4. Even small violations of these assumptions pose problems for confidence intervals on predictions for specific observations

### Residuals are normally distributed

• residuals are the errors of prediction (differences between the actual and the predicted scores)
• a Q-Q plot is usually used to test residual normality
```r
qqnorm(m$residuals)
```

### Homoscedasticity

• variances of the errors of prediction are the same for all predicted values

```r
plot(m$fitted.values, m$residuals) # see also plot(m)
```

• In the picture below the errors of prediction are much larger for observations with low-to-medium predicted scores than for observations with high predicted scores
• A confidence interval on a low predicted UGPA would therefore underestimate the uncertainty

### Linearity

• the relationship between each predictor and the criterion variable is linear
• If this assumption is not met, the predictions may systematically overestimate the actual values for one range of values on a predictor variable and underestimate them for another
• if the variables are known not to be linearly related, a transformation to a linear form can be used

```r
plot(UGPA ~ HSGPA)
plot(UGPA ~ SAT)
```

## Data Used

```r
rawdata <- read.table("http://training-course-material.com/images/4/4b/MultipleRegression-GPA.txt", h = T)
head(rawdata)

# Just naming them more conveniently
HSGPA <- rawdata$high_GPA
SAT   <- rawdata$math_SAT + rawdata$verb_SAT
UGPA  <- rawdata$univ_GPA
```

## Building Model

```r
m <- lm(UGPA ~ HSGPA + SAT)
m
```

```
Call:
lm(formula = UGPA ~ HSGPA + SAT)

Coefficients:
(Intercept)        HSGPA          SAT
  0.5397848    0.5414534    0.0007918
```

## Validating Model

```r
summary(m)
```

```
Call:
lm(formula = UGPA ~ HSGPA + SAT)

Residuals:
     Min       1Q   Median       3Q      Max
-0.68072 -0.13667  0.01137  0.17012  0.92983

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5397848  0.3177784   1.699   0.0924 .
HSGPA       0.5414534  0.0837479   6.465 3.53e-09 ***
SAT         0.0007918  0.0003868   2.047   0.0432 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2772 on 102 degrees of freedom
Multiple R-squared:  0.6232,	Adjusted R-squared:  0.6158
F-statistic: 84.35 on 2 and 102 DF,  p-value: < 2.2e-16
```

## Interpreting the results

The regression is expressed by the formula:

UGPA = b1*HSGPA + b2*SAT + A

• b1 and b2 are regression coefficients
• a regression coefficient is the slope of the linear relationship between:
  • the criterion variable and
  • the part of a predictor variable that is independent of all other predictor variables

### Example

The regression coefficient for HSGPA can be computed by:

1. first predicting HSGPA from SAT

```r
m.HSGPA.SAT <- lm(HSGPA ~ SAT)
m.HSGPA.SAT
```

```
Call:
lm(formula = HSGPA ~ SAT)

Coefficients:
(Intercept)          SAT
  -1.313853     0.003594
```

2. calculating the errors of prediction (m.HSGPA.SAT$residuals)
3. finding the slope of UGPA ~ m.HSGPA.SAT$residuals

```r
lm(UGPA ~ m.HSGPA.SAT$residuals)
```

```
Call:
lm(formula = UGPA ~ m.HSGPA.SAT$residuals)

Coefficients:
          (Intercept)  m.HSGPA.SAT$residuals
               3.1729                 0.5415
```


This slope of 0.5415 is the same value as b1 in the multiple regression above.

### Further coefficient interpretations

The regression coefficient for HSGPA (b1) is the slope of the relationship between the criterion variable (UGPA) and the part of HSGPA that is independent of (uncorrelated with) the other predictor variables (SAT in this case).

It represents the change in the criterion variable associated with a change of one in the predictor variable when all other predictor variables are held constant.

Since the regression coefficient for HSGPA is 0.54, this means that, holding SAT constant, a change of one in HSGPA is associated with a change of 0.54 in UGPA.
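This "holding constant" interpretation can be checked directly with predict(). A minimal sketch, assuming m and the variables from the Data Used section are in the workspace; the two student profiles are made up for illustration:

```r
# Two hypothetical students who differ by exactly 1 in HSGPA, with SAT held constant
p1 <- predict(m, newdata = data.frame(HSGPA = 3.0, SAT = 1100))
p2 <- predict(m, newdata = data.frame(HSGPA = 4.0, SAT = 1100))

# The difference between the predictions is exactly the HSGPA coefficient (0.5414534 above)
p2 - p1
coef(m)["HSGPA"]
```

Because the model is linear, this equality holds exactly for any pair of profiles differing by 1 in HSGPA.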

### Partial Slope

The slope of the relationship between the criterion and the part of a predictor variable that is independent of the other predictor variables is called its partial slope.

Thus the regression coefficient of 0.541 for HSGPA and the regression coefficient of 0.0008 for SAT are partial slopes.

Each partial slope represents the relationship between the predictor variable and the criterion holding constant all of the other predictor variables.
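The same residual construction from the Example above, run in the other direction, recovers the SAT partial slope. A sketch assuming the variables from the Data Used section are loaded:

```r
# Part of SAT that is independent of HSGPA
m.SAT.HSGPA <- lm(SAT ~ HSGPA)

# The slope of UGPA on those residuals equals b2 from the multiple regression (~0.0008)
coef(lm(UGPA ~ m.SAT.HSGPA$residuals))[2]
coef(lm(UGPA ~ HSGPA + SAT))["SAT"]
```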

## Beta Weight

• It is difficult to compare the coefficients of different variables directly because the variables are measured on different scales
• A difference of 1 in HSGPA corresponds to a fairly large change in UGPA (0.54), whereas a difference of 1 on the SAT corresponds to a negligible change (0.0008)
• standardizing all variables to a standard deviation of 1 solves the problem
• A regression weight for standardized variables is called a beta weight (designated as β)
```r
install.packages("yhat")
library("yhat")

r <- regr(m)
r$Beta_Weights
```

• For these data, the beta weights are 0.625 and 0.198
• These values represent the change in the criterion (in standard deviations) associated with a change of one standard deviation on a predictor [holding constant the value(s) on the other predictor(s)]
• Clearly, a change of one standard deviation on HSGPA is associated with a larger difference than a change of one standard deviation on SAT
• In practical terms, this means that if you know a student's HSGPA, knowing the student's SAT does not aid the prediction of UGPA much
• However, if you do not know the student's HSGPA, his or her SAT can aid in the prediction, since the β weight in the simple regression predicting UGPA from SAT is 0.68

```r
m1 <- lm(UGPA ~ SAT)
regr(m1)$Beta_Weights
```

• the β weight in the simple regression predicting UGPA from HSGPA is 0.78
```r
m1 <- lm(UGPA ~ HSGPA)
regr(m1)$Beta_Weights
```
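The beta weights do not require the yhat package: a raw coefficient can be standardized by hand as β = b · sd(predictor)/sd(criterion), or the model can be refit on scale()d variables. A sketch assuming m and the variables above:

```r
# Standardize the raw coefficients by hand
b    <- coef(m)[-1]                            # drop the intercept
beta <- b * c(sd(HSGPA), sd(SAT)) / sd(UGPA)
beta                                           # ~0.625 and ~0.198

# Equivalent: refit with all variables scaled to sd = 1
coef(lm(scale(UGPA) ~ scale(HSGPA) + scale(SAT)))[-1]
```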


### Residual sum of squares (RSS or SSE)

Sum of the squared differences between the actual Y and the predicted Y

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$= \sum_{i=1}^{n} (y_i - f(x_i))^2$$

$$= \sum_{i=1}^{n} \epsilon_i^2$$

$$= \sum_{i=1}^{n} \left(y_i - (\alpha + \beta x_i)\right)^2 \quad \text{(in the simple-regression case)}$$

• it is the amount of variation in the dependent variable that our model did not explain
```r
TSS <- sum((UGPA - mean(UGPA))^2)            # total sum of squares
ESS <- sum((m$fitted.values - mean(UGPA))^2) # explained sum of squares
RSS <- sum((UGPA - m$fitted.values)^2)       # TODO: add aov function method
```

```
> TSS
[1] 20.79814
> ESS
[1] 12.96135
> RSS
[1] 7.836789
```

## Multiple Correlation and Proportion Explained

Proportion Explained = SSY'/SSY = R2

```r
ProportionExplained <- ESS/TSS
ProportionExplained
```

```
[1] 0.6231977
```

This is the same value as the Multiple R-squared (0.6232) reported by summary(m) in the Validating Model section.

## Confounding

• the sum of squares explained (ESS) for these data is 12.96
• How is this value divided between HSGPA and SAT?
• it is not the same as the sums of squares explained by separate simple regressions of UGPA on HSGPA and on SAT:

Predictors      Sum of Squares
HSGPA           12.64 (simple regression)
SAT              9.75 (simple regression)
HSGPA and SAT   12.96 (multiple regression)

```r
m   <- lm(UGPA ~ HSGPA + SAT)
ESS <- sum((m$fitted.values - mean(UGPA))^2)

# SAT has been left out
m.hsgpa   <- lm(UGPA ~ HSGPA)
ess.hsgpa <- sum((m.hsgpa$fitted.values - mean(UGPA))^2)

# HSGPA has been left out
m.sat   <- lm(UGPA ~ SAT)
ess.sat <- sum((m.sat$fitted.values - mean(UGPA))^2)

ess.hsgpa # 12.63942
ess.sat   # 9.749823
```

• The explained sums of squares from the two simple regressions (12.64 + 9.75) add up to more than the ESS of the multiple regression (12.96) because the predictors are correlated (r = .78)
• Much of the variance in UGPA is confounded between HSGPA and SAT
• This variance could be explained by either HSGPA or SAT
• It is therefore counted twice if the sums of squares for HSGPA and SAT are simply added
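The confounding is a direct consequence of the correlation between the two predictors, which can be checked directly, assuming the variables above:

```r
# Correlation between the predictors; the .78 cited above
cor(HSGPA, SAT)
```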

### Proportion of variance explained

Source                        SS      Proportion
HSGPA (unique)                 3.21    0.15
SAT (unique)                   0.32    0.02
HSGPA and SAT (confounded)     9.43    0.45
Error                          7.84    0.38
Total                         20.80    1.00

```r
HSGPA.UNIQUE <- ESS - ess.sat
HSGPA.UNIQUE          # 3.211531
SAT.UNIQUE   <- ESS - ess.hsgpa
SAT.UNIQUE            # 0.3219346
HSGPA.SAT.Confounded <- ESS - (HSGPA.UNIQUE + SAT.UNIQUE)
HSGPA.SAT.Confounded  # 9.427889
```
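The Covered Materials list also mentions the complete-versus-reduced-model significance test, which these unique sums of squares feed into: the F test asks whether dropping a predictor significantly increases the residual sum of squares. A minimal sketch with anova(), assuming the variables above:

```r
# Does SAT contribute significantly beyond HSGPA alone?
m.reduced  <- lm(UGPA ~ HSGPA)        # reduced model
m.complete <- lm(UGPA ~ HSGPA + SAT)  # complete model
anova(m.reduced, m.complete)          # F test on the drop in RSS
```

For a single dropped predictor, this F test is equivalent to the t test on that coefficient in summary(m): F equals the square of the t value.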