Multivariable Regression
When performing a regression and finding a relationship between the regressor and the response, you have to be careful that there’s not some other variable that actually explains the relationship.
As an example, you may find in your tests that there’s a relationship between breath mints and loss in pulmonary function. You’d probably think that smokers use more breath mints and that smoking is the likely cause.
How do you establish this? You'd consider smokers and non-smokers separately and see whether their lung function differs by breath mint usage within each group. Multivariable regression is an 'automated' way of doing this, holding all other variables constant while looking at the regression of a single variable.
General Linear Model
$$ Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} + \epsilon_i = \sum_{k=1}^p X_{ik}\beta_k + \epsilon_i $$
The least squares criterion is pretty much the same as for a single variable model: $\sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ki}\beta_k\right)^2$
The important linearity is linearity in the coefficients. If you squared or cubed the X values, the model would still be linear in the coefficients.
Two Variable Example
Consider a regression with two variables, so that the least squares criterion ends up as
$$\sum_{i=1}^n (Y_i - X_{1i}\beta_1 - X_{2i}\beta_2)^2$$
What it works out to be is that the regression slope $\beta_1$ is exactly what you would obtain if you took the residual of $X_2$ out of $X_1$, took the residual of $X_2$ out of $Y$, and did regression through the origin.
So what multivariable regression is doing is this: $\beta_1$ is the coefficient you get having removed $X_2$, the other covariate, from both the response and the first predictor $X_1$.
Similarly, $\beta_2$ is what you get if you were to remove $X_1$ from both the response $Y$ and the second predictor $X_2$.
Summary: A multi-variable coefficient is one where the linear effect of all the other variables on that predictor and the response has been removed.
Consider $Y_i = \beta_1 X_{1i} + \beta_2 X_{2i}$ where $X_{2i} = 1$, so the second regressor is just an intercept term.
In an intercept-only regression the fitted value is simply $\bar{Y}$, so the residuals are $e_{i, Y|X_2} = Y_i - \bar{Y}$, the centred version of the $Y$'s.
If we regress $X_{1i}$ on $X_{2i}$, since $X_{2i}$ is just a constant the coefficient is $\bar{X}_1$, the mean of that covariate. Thus $e_{i, X_1|X_2} = X_{1i} - \bar{X}_1$.
So the fitted regression slope is:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n e_{i, Y|X_2}\, e_{i, X_1|X_2}}{\sum_{i=1}^n e_{i, X_1|X_2}^2} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \mathrm{Cor}(X, Y)\frac{\mathrm{Sd}(Y)}{\mathrm{Sd}(X)}$$
Let’s do a computed example (the code chunks in these notes assume the tidyverse and broom packages are loaded):
set.seed(1)
mv <- tibble(
n = 1:100,
x1 = rnorm(n),
x2 = rnorm(n),
x3 = rnorm(n),
y = x1 + 2*x2 + 3*x3 + rnorm(n, sd = .3)
)
e_y <- mv %>% lm(y ~ x2 + x3, data = .) %>% resid()
e_x <- mv %>% lm(x1 ~ x2 + x3, data = .) %>% resid()
sum(e_x * e_y) / sum(e_x^2)
## [1] 0.9826123
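For comparison, the coefficients below come from the full multivariable fit. The chunk that produced them isn’t shown, but it was presumably something like:
mv %>% lm(y ~ x1 + x2 + x3, data = .) %>% coef()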
## (Intercept) x1 x2 x3
## 0.01581932 0.98261234 1.98351744 3.03138694
Interpretation of Coefficients
The regression predictor is the sum of the x’s times their coefficients:
$$E[Y | X_1 = x_1, \dots, X_p = x_p] = \sum_{k=1}^p x_k\beta_k$$
If the first regressor is incremented by one:
$$E[Y | X_1 = x_1 + 1, \dots, X_p = x_p] = (x_1 + 1)\beta_1 + \sum_{k=2}^p x_k\beta_k$$
Then we subtract the two expectations:
$$E[Y | X_1 = x_1 + 1, \dots, X_p = x_p] - E[Y | X_1 = x_1, \dots, X_p = x_p] = (x_1 + 1)\beta_1 + \sum_{k=2}^p x_k\beta_k - \sum_{k=1}^p x_k\beta_k = \beta_1$$
Notice that all of the other $x_k$ were held fixed. So the interpretation of a multivariable regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
Fitted Values, Residuals and Residual Variation
All of our simple linear regression quantities can be extended:
- Model $Y_i = \sum_{k=1}^p X_{ik}\beta_k + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$
- Fitted responses $\hat{Y}_i = \sum_{k=1}^p X_{ik}\hat{\beta}_k$
- Residuals $e_i = Y_i - \hat{Y}_i$
- Variance estimate $\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^n e_i^2$ (a quick numerical check follows this list)
- It’s $n - p$ because the residuals satisfy $p$ linear constraints, so only $n - p$ of them are free.
- We don’t really have $n$ residuals, we effectively have $n - p$.
- The coefficients have standard errors $\hat{\sigma}_{\hat{\beta}_k}$
- We can test whether the coefficients are 0 with a t-statistic: $\frac{\hat{\beta}_k - \beta_k}{\hat{\sigma}_{\hat{\beta}_k}}$
- Predicted responses have standard errors and we can calculate predicted and expected response intervals.
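A minimal numerical check (my own sketch, reusing the mv data simulated above): the residual variance estimate divides by n − p, where p counts all fitted coefficients including the intercept.
fit <- lm(y ~ x1 + x2 + x3, data = mv)
n <- nrow(mv)
p <- length(coef(fit))
# Manual estimate of sigma versus the one reported by summary()
c(manual = sqrt(sum(resid(fit)^2) / (n - p)), reported = summary(fit)$sigma)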
Linear Models
Linear models are the single most important applied statistical and machine learning technique.
Some things you can accomplish with a linear model:
- Decompose a signal into its harmonics.
- The discrete Fourier transform can be thought of as the fit from a linear model (see the sketch after this list).
- Flexibly fit complicated functions.
- Fit factor variables as predictors.
- Uncover complex multivariate relationships with the response.
- Build accurate prediction models.
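As a quick illustration of the first point (my own sketch, not from the original notes): regressing a signal on sine and cosine terms at a known frequency recovers the harmonic’s amplitude, which is exactly what a Fourier coefficient measures.
# Recover the amplitude of a 5 Hz harmonic by linear regression on a sine/cosine basis
t <- seq(0, 1, length.out = 256)
signal <- 2 * sin(2 * pi * 5 * t) + rnorm(256, sd = 0.2)
harmonic_fit <- lm(signal ~ sin(2 * pi * 5 * t) + cos(2 * pi * 5 * t))
coef(harmonic_fit)  # the sine coefficient should be close to 2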
Confounding
Let’s look at the swiss dataset and perform a linear regression of Fertility on all of the other variables:
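The chunk that produced the table below isn’t shown; it was presumably something like (tidy() is from broom):
swiss %>% lm(Fertility ~ ., data = .) %>% tidy()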
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 66.9 10.7 6.25 0.000000191
## 2 Agriculture -0.172 0.0703 -2.45 0.0187
## 3 Examination -0.258 0.254 -1.02 0.315
## 4 Education -0.871 0.183 -4.76 0.0000243
## 5 Catholic 0.104 0.0353 2.95 0.00519
## 6 Infant.Mortality 1.08 0.382 2.82 0.00734
We see a 0.17 decrease in standardised fertility for every 1% increase in the percentage of males involved in agriculture.
We now look at how our model selection changes the outcome. We regress Fertility on Agriculture alone:
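Again the chunk isn’t shown; presumably:
swiss %>% lm(Fertility ~ Agriculture, data = .) %>% tidy()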
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 60.3 4.25 14.2 3.22e-18
## 2 Agriculture 0.194 0.0767 2.53 1.49e- 2
It now appears that the percentage of males involved in agriculture increases rather than decreases fertility. What has happened? This is Simpson’s Paradox, where ‘a trend appears in different groups but disappears or reverses when the groups are combined’.
We can see this by assigning colour to one of the variables. There’s a clear positive linear trend between Agriculture and Fertility, but we can also see a trend (viewed via the colour) of Catholic with Fertility.
swiss %>%
ggplot(aes(Agriculture, Fertility)) +
geom_point(aes(colour = Catholic)) +
geom_smooth(method = 'lm', se = F)
With the single regressor, the linear regression is picking up effects of other variables and associating them with Agriculture. With all of the variables included, their linear effects are removed when the regression is occurring.
Factor Variables
Consider a model $Y_i = \beta_0 + X_{i1}\beta_1 + \epsilon_i$ where each $X_{i1}$ is binary: say $X_{i1} = 1$ for people in a group of interest and 0 otherwise.
For the people in the group, $E[Y_i] = \beta_0 + \beta_1$, and for those not in the group, $E[Y_i] = \beta_0$.
$\beta_1$ is interpreted as the increase or decrease in the mean for those in the group compared to those not.
It can be extended to a three-level factor with $Y_i = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2 + \epsilon_i$. $X_{i1}$ could be Liberal/not Liberal, and $X_{i2}$ could be Labour/not Labour. When both X’s are 0, it’s the third outcome, perhaps Independents.
- $\beta_1$ compares Liberal to Independents
- $\beta_2$ compares Labour to Independents
- $\beta_1 - \beta_2$ compares Liberal to Labour
The reference level in the factor is very important.
We take a look via the InsectSprays data set:
InsectSprays %>%
ggplot(aes(spray, count)) +
geom_violin(aes(fill = spray)) +
labs(x = 'Spray', y = 'Insect Count')
We fit a model of count to the factor variable spray:
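The fit itself isn’t shown; presumably:
InsectSprays %>% lm(count ~ spray, data = .) %>% tidy()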
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 14.5 1.13 12.8 1.47e-19
## 2 sprayB 0.833 1.60 0.520 6.04e- 1
## 3 sprayC -12.4 1.60 -7.76 7.27e-11
## 4 sprayD -9.58 1.60 -5.99 9.82e- 8
## 5 sprayE -11. 1.60 -6.87 2.75e- 9
## 6 sprayF 2.17 1.60 1.35 1.81e- 1
Notice that there’s no sprayA - this is the reference level and everything is compared with it. So the sprayB estimate of 0.833 is the change in the mean between sprayB and sprayA.
The intercept is the average count for sprayA, so the average count for sprayB is 14.5 + 0.833.
Use fct_relevel() to relevel the factors:
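Judging by the intercept of 2.08 (the sprayC mean), the reference level was presumably moved to C with something like:
InsectSprays %>%
  mutate(spray = fct_relevel(spray, 'C')) %>%
  lm(count ~ spray, data = .) %>%
  tidy()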
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.08 1.13 1.84 7.02e- 2
## 2 sprayA 12.4 1.60 7.76 7.27e-11
## 3 sprayB 13.2 1.60 8.28 8.51e-12
## 4 sprayD 2.83 1.60 1.77 8.14e- 2
## 5 sprayE 1.42 1.60 0.885 3.79e- 1
## 6 sprayF 14.6 1.60 9.11 2.79e-13
If you just want the means for each of the groups, you remove the intercept:
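Presumably by removing the intercept from the formula:
InsectSprays %>% lm(count ~ spray - 1, data = .) %>% tidy()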
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sprayA 14.5 1.13 12.8 1.47e-19
## 2 sprayB 15.3 1.13 13.5 1.00e-20
## 3 sprayC 2.08 1.13 1.84 7.02e- 2
## 4 sprayD 4.92 1.13 4.34 4.95e- 5
## 5 sprayE 3.5 1.13 3.09 2.92e- 3
## 6 sprayF 16.7 1.13 14.7 1.57e-22
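These coefficients match the group means computed directly, presumably via something like:
InsectSprays %>% group_by(spray) %>% summarise(mean(count))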
## # A tibble: 6 x 2
## spray `mean(count)`
## <fct> <dbl>
## 1 A 14.5
## 2 B 15.3
## 3 C 2.08
## 4 D 4.92
## 5 E 3.5
## 6 F 16.7
There’s no difference in the model; it’s just that the coefficients have a different interpretation. The p-values in the first fit test whether each mean differs from sprayA’s, whereas in the fit without the intercept they test whether each mean differs from 0.
Interaction Terms
Taking our swiss dataset, we create a new variable which is 1 if the province is > 50% Catholic, and 0 if it is not.
There are a few different models we can apply:
We could fit $E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1$, which is a line that disregards the religion of the province.
We could do $E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, however in this instance the slope is the same for both binary outcomes of $X_2$; only the intercept shifts.
Finally, we can do $E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$. In this case we end up with:
$$\begin{aligned} X_2 = 0 &: \quad \beta_0 + \beta_1 X_1 \\ X_2 = 1 &: \quad \beta_0 + \beta_1 X_1 + \beta_2 + \beta_3 X_1 = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1 \end{aligned}$$
When we include an interaction term, we have two different lines with two different slopes and intercepts.
# Add a binary variable
swiss %>%
mutate(MajCatholic = Catholic > 50) -> swiss_maj
# Set up the graph
swiss_maj %>%
ggplot() +
geom_point(aes(Agriculture, Fertility, colour = MajCatholic)) -> g
# Perform the linear regression with no interaction
swiss_maj %>%
lm(Fertility ~ Agriculture + MajCatholic, data = .) -> swiss_lm_noint
# We see in this one, the 'MajCatholicTRUE' is added to the intercept when
# MajCatholic = TRUE. The slope does not change.
summary(swiss_lm_noint)
##
## Call:
## lm(formula = Fertility ~ Agriculture + MajCatholic, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.803 -6.701 1.382 6.855 20.435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8322 4.1059 14.816 <2e-16 ***
## Agriculture 0.1242 0.0811 1.531 0.1329
## MajCatholicTRUE 7.8843 3.7484 2.103 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.39 on 44 degrees of freedom
## Multiple R-squared: 0.2046, Adjusted R-squared: 0.1685
## F-statistic: 5.66 on 2 and 44 DF, p-value: 0.006492
# Graph the two regressions
f <- swiss_lm_noint
g +
geom_abline(intercept = coef(f)[1], slope = coef(f)[2]) +
geom_abline(intercept = coef(f)[1] + coef(f)[3], slope = coef(f)[2])
swiss_maj %>%
lm(Fertility ~ Agriculture * MajCatholic, data = .) -> swiss_lm_int
# With the interaction term, we have the intercept for MajCatholic < 50, and
# the slope for Agriculture with MajCatholic < 50.
# The second two are the intercept and slope for MajCatholic > 50, but these need
# to be added to the < 50 main effects (the first two)
summary(swiss_lm_int)
##
## Call:
## lm(formula = Fertility ~ Agriculture * MajCatholic, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.840 -6.668 1.016 7.092 20.242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.04993 4.78916 12.956 <2e-16 ***
## Agriculture 0.09612 0.09881 0.973 0.336
## MajCatholicTRUE 2.85770 10.62644 0.269 0.789
## Agriculture:MajCatholicTRUE 0.08914 0.17611 0.506 0.615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.49 on 43 degrees of freedom
## Multiple R-squared: 0.2094, Adjusted R-squared: 0.1542
## F-statistic: 3.795 on 3 and 43 DF, p-value: 0.01683
f <- swiss_lm_int
g +
geom_abline(intercept = coef(f)[1], slope = coef(f)[2]) +
geom_abline(intercept = coef(f)[1] + coef(f)[3], slope = coef(f)[2] + coef(f)[4])
Residuals, Diagnostics, Variation
We recall that we had our linear models with IID, normally distributed error terms, and our residuals are the vertical distances away from the fitted line. Our residual variation is the average of the squared residuals, except that we divide by $n - p$.
Influence Measures
Calling a point an ‘outlier’ is vague. They can be caused by both real and spurious processes.
You can use ?influence.measures to get a list of different influence measures.
- rstandard() - standardised residuals, divided by their standard deviation. Allows comparison between different response units.
- rstudent() - standardised residuals which have been externally studentised - the improbably large observation is removed from the variance calculation.
- hatvalues() - a measure of leverage. A point can have high leverage but not exert it, e.g. a high-leverage point that sits on the regression line. Very useful for finding data entry errors.
- dffits() - for every data point, how much the fitted value changes when the fit does vs doesn’t include that observation. One dffit per data point.
- dfbetas() - how the slope coefficients change when the fit does vs doesn’t contain the observation. One dfbeta per coefficient.
- cooks.distance() - a summary of the dfbetas.
- resid() - returns the ordinary residuals.
- resid(fit) / (1 - hatvalues(fit)) - the PRESS residuals. Less useful for influence and outliers and more for general model fit.
Be wary of simple rules and diagnostic plots - what counts as influential often depends on the context.
Examples
In this example, we take uncorrelated X and Y values, then add a point out at (10,10). Without the influential point there would be no correlation, but with the point there is now a linear trend.
influence_d <- tibble(
x = rnorm(100),
y = rnorm(100)
) %>%
add_row(x = 10, y = 10)
influence_fit <- lm(x ~ y, influence_d)
influence_d %>%
ggplot(aes(x,y)) +
geom_point() +
geom_smooth(method = 'lm', se = F)
We’ll take a look at some diagnostic plots:
# The 101st point has a very large dfbeta
dfbetas(influence_fit) %>%
as_tibble() %>%
rownames_to_column('ob_id') %>%
top_n(5, y)
## # A tibble: 5 x 3
## ob_id `(Intercept)` y
## <chr> <dbl> <dbl>
## 1 9 0.0605 0.0500
## 2 20 0.0782 0.0666
## 3 31 -0.114 0.0992
## 4 86 0.142 0.0867
## 5 101 0.675 6.83
# We also see the hat value is very large compared to the other values
influence_fit %>%
augment() %>%
rownames_to_column('ob_id') %>%
top_n(5, .hat) %>%
select(ob_id, .hat)
## # A tibble: 5 x 2
## ob_id .hat
## <chr> <dbl>
## 1 11 0.0446
## 2 49 0.0364
## 3 57 0.0395
## 4 95 0.0344
## 5 101 0.524
orly <- read_table2(
'https://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_ramp.txt',
col_names = F,
col_types = cols(
X1 = col_double(),
X2 = col_double(),
X3 = col_double(),
X4 = col_double(),
X5 = col_double(),
X6 = col_logical()
)
) %>% select(-X6)
# The p values show that every regressor is significant
orly_lm <- orly %>% lm(X1 ~ ., data = .)
orly_lm %>% summary()
##
## Call:
## lm(formula = X1 ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0652 -0.9878 0.1419 0.7958 1.9560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001194 0.021170 0.056 0.955
## X2 0.986325 0.128630 7.668 2.56e-14 ***
## X3 1.899763 0.195782 9.703 < 2e-16 ***
## X4 2.716371 0.262909 10.332 < 2e-16 ***
## X5 3.709886 0.337727 10.985 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.001 on 2293 degrees of freedom
## Multiple R-squared: 0.05, Adjusted R-squared: 0.04834
## F-statistic: 30.17 on 4 and 2293 DF, p-value: < 2.2e-16
# The residuals versus the fitted values
orly_lm %>%
augment() %>%
ggplot(aes(.fitted, .resid)) +
geom_point(size = .1)
The residuals versus the fitted values show a clear systematic pattern. Usually we want to model the systematic elements, so that all that is left over is random noise.
Model Selection
How do we choose what variables to include in a regression model? No single easy answer exists - ‘it depends’.
A model is a lens through which you look at the data. Under this philosophy, what is the right model? Whatever model connects the data to a true, parsimonious statement about what you’re studying.
General Rules
- Including variables that we shouldn’t have increases standard errors of the regression variables.
- Actually, any new variables increase the standard errors of the other regressors.
- $R^2$ increases monotonically as more regressors are included.
Not including important regressors increases bias, and including unimportant regressors causes variance inflation.
If a new regressor is highly correlated with a regressor already in the model, the variance of that coefficient is going to be inflated. You need to ensure you’re not putting such variables into the model unnecessarily.
Variance Inflation Factors
Variance inflation is much worse when you include a variable that is highly correlated with the regressor of interest. We don’t know $\sigma$, so we can only estimate the increase in the actual standard error of the coefficients for including a regressor.
However, $\sigma$ drops out of the relative standard errors. If we sequentially add variables, we can check the variance (or SD) inflation from including each one.
When the other regressors are actually orthogonal (uncorrelated) to the regressor of interest, then there is no variance inflation.
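A minimal simulation (my own sketch, not from the notes) showing that adding a regressor correlated with x1 inflates its standard error, while adding an orthogonal one barely changes it:
set.seed(2)
n <- 1000
x1 <- rnorm(n)
x2_orth <- rnorm(n)                # unrelated to x1
x2_corr <- x1 + rnorm(n, sd = .3)  # strongly related to x1
y <- x1 + rnorm(n)
se_x1 <- function(fit) summary(fit)$coefficients['x1', 'Std. Error']
c(alone      = se_x1(lm(y ~ x1)),
  orthogonal = se_x1(lm(y ~ x1 + x2_orth)),
  correlated = se_x1(lm(y ~ x1 + x2_corr)))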
The Variance Inflation Factor (VIF) is the increase in variance for the ith regressor compared to the ideal setting where it is orthogonal to the other regressors. It’s only part of the picture; we sometimes need to include variables that are pertinent even if they dramatically inflate our variance.
We can look at the swiss dataset:
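The VIFs below were presumably computed with vif() from the car package:
car::vif(lm(Fertility ~ ., data = swiss))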
## Agriculture Examination Education Catholic
## 2.284129 3.675420 2.774943 1.937160
## Infant.Mortality
## 1.107542
So Agriculture having a VIF of 2.28 means that the variance of the Agriculture coefficient is about 2.3 times what it would be if Agriculture were orthogonal to the other regressors (its standard error is inflated by roughly $\sqrt{2.28} \approx 1.5$).
Variance
If we correctly fit or over fit the model, including all necessary covariates and/or unnecessary covariates, the variance estimate is unbiased. However the variance of the variance estimate gets inflated if we include unnecessary variables.
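A small simulation (my own sketch) of that claim: adding junk covariates leaves the residual variance estimate roughly unbiased, but makes it more variable.
set.seed(3)
sims <- replicate(2000, {
  x1 <- rnorm(50)
  junk <- matrix(rnorm(50 * 10), ncol = 10)  # unnecessary covariates
  y <- x1 + rnorm(50)                        # true sigma^2 = 1
  c(correct = summary(lm(y ~ x1))$sigma^2,
    overfit = summary(lm(y ~ x1 + junk))$sigma^2)
})
apply(sims, 1, mean)  # both close to 1 (roughly unbiased)
apply(sims, 1, var)   # the overfit estimate is more variable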
Principal Components
Principal components or factor analytic models on the covariates are often useful for reducing covariate spaces - i.e. when you have a large variable space. This comes with consequences - your interpretability is reduced.
Nested Models
If the models of interest are nested and don’t have lots of parameters differentiating them, it’s fairly straightforward to compare them with nested model tests (ANOVA).
So for example you may have a treatment. You first look at the model with the response against treatment, then the model with response against treatment and age, then treatment, age and gender, etc.
swiss %>%
do(
fit_a = lm(Fertility ~ Agriculture, data = .),
fit_b = lm(Fertility ~ Agriculture + Examination, data = .),
fit_c = lm(Fertility ~ Agriculture + Examination + Education, data = .)
) %>%
gather(fit_name, fit_model) %>%
pull(fit_model) %>%
do.call(anova, .)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination
## Model 3: Fertility ~ Agriculture + Examination + Education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 6283.1
## 2 44 4072.7 1 2210.38 29.880 2.172e-06 ***
## 3 43 3180.9 1 891.81 12.056 0.001189 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An explanation of the columns:
- Residual degrees of freedom is the number of data points minus the number of parameters.
- RSS is the residual sum of squares.
- Df is the excess degrees of freedom going between each model - we added one parameter between each successive model.
- F is the F-statistic for testing whether the added terms improve the fit.
- Pr(>F) is the p-value for the F-statistic.
It’s important to note that the models must be nested for this to work - i.e. the later models must include all of the variables that the earlier models had.
All models are wrong, some models are useful.
Quiz
Question 1
Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a confounder. Give the adjusted estimate for the expected change in mpg comparing 8 cylinders to 4.
- 33.991
- -3.206
- -6.071
- -4.256
Answer: the answer is -6.071
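The summary below presumably came from something like:
mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  lm(mpg ~ cyl + wt, data = .) %>%
  summary()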
##
## Call:
## lm(formula = mpg ~ cyl + wt, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9908 1.8878 18.006 < 2e-16 ***
## cyl6 -4.2556 1.3861 -3.070 0.004718 **
## cyl8 -6.0709 1.6523 -3.674 0.000999 ***
## wt -3.2056 0.7539 -4.252 0.000213 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
We see that 4 cylinders is used as the base, so there is a -6.071 estimate for the expected change in mpg going from 4 cylinders to 8.
Question 2
Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a possible confounding variable. Compare the effect of 8 versus 4 cylinders on mpg for the adjusted and unadjusted by weight models. Here, adjusted means including the weight variable as a term in the regression model and unadjusted means the model without weight included. What can be said about the effect comparing 8 and 4 cylinders after looking at models with and without weight included?
- Including or excluding weight does not appear to change anything regarding the estimated impact of number of cylinders on mpg.
- Holding weight constant, cylinder appears to have more of an impact on mpg than if weight is disregarded.
- Within a given weight, 8 cylinder vehicles have an expected 12 mpg drop in fuel efficiency.
- Holding weight constant, cylinder appears to have less of an impact on mpg than if weight is disregarded.
Answer: It is both true and sensible that including weight would attenuate the effect of number of cylinders on mpg.
# Our unadjusted model
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
lm(mpg ~ cyl, data = .) %>%
summary()
##
## Call:
## lm(formula = mpg ~ cyl, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2636 -1.8357 0.0286 1.3893 7.2364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
## cyl6 -6.9208 1.5583 -4.441 0.000119 ***
## cyl8 -11.5636 1.2986 -8.905 8.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
## F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
In this regression without weight, the cylinders appear to have more of an impact on mpg than when weight is included.
Question 3
Consider the mtcars data set. Fit a model with mpg as the outcome that considers number of cylinders as a factor variable and weight as confounder. Now fit a second model with mpg as the outcome model that considers the interaction between number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model using 0.05 as a type I error rate significance benchmark.
- The P-value is small (less than 0.05). Thus it is surely true that there is no interaction term in the true model.
- The P-value is small (less than 0.05). Thus it is surely true that there is an interaction term in the true model.
- The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms is necessary.
- The P-value is small (less than 0.05). So, according to our criterion, we reject, which suggests that the interaction term is not necessary.
- The P-value is small (less than 0.05). So, according to our criterion, we reject, which suggests that the interaction term is necessary
- The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms may not be necessary.
Answer: large p-value, so the term may not be necessary.
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
do(
fit1 = lm(mpg ~ cyl + wt, data = .),
fit2 = lm(mpg ~ cyl + wt + cyl * wt, data = .)
) %>%
gather(name, fit) %>%
pull(fit) %>%
do.call(anova, .)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + wt
## Model 2: mpg ~ cyl + wt + cyl * wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 183.06
## 2 26 155.89 2 27.17 2.2658 0.1239
The term may not be necessary.
Question 4
Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight included in the model as lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars).
How is the wt coefficient interpreted?
- The estimated expected change in MPG per one ton increase in weight. -> No
- The estimated expected change in MPG per one ton increase in weight for a specific number of cylinders (4, 6, 8).
- The estimated expected change in MPG per half ton increase in weight for for a specific number of cylinders (4, 6, 8). -> No
- The estimated expected change in MPG per half ton increase in weight for the average number of cylinders. -> No
- The estimated expected change in MPG per half ton increase in weight. -> No
Answer: The estimated expected change in MPG per one ton increase in weight for a specific number of cylinders (4, 6, 8)
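For reference, the Call line in the output below shows the fit was:
summary(lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars))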
##
## Call:
## lm(formula = mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.991 1.888 18.006 < 2e-16 ***
## I(wt * 0.5) -6.411 1.508 -4.252 0.000213 ***
## factor(cyl)6 -4.256 1.386 -3.070 0.004718 **
## factor(cyl)8 -6.071 1.652 -3.674 0.000999 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
Question 5
Consider the following data set:
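The x and y vectors were omitted above; they are reproduced here from the course quiz question so the code below runs:
# Data as given in the quiz question
x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)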
Give the hat diagonal for the most influential point
- 0.2804
- 0.2287
- 0.9946
- 0.2025
Answer: 0.9946
tibble(x = x, y = y) %>%
lm(y ~ x, data = .) %>%
augment() %>%
mutate(n = n()) %>%
top_n(1, .hat) %>%
select(.hat, n)
## # A tibble: 1 x 2
## .hat n
## <dbl> <int>
## 1 0.995 5
Question 6
Consider the following data set:
Give the slope dfbeta for the point with the highest hat value.
- -.00134
- 0.673
- -134
- -0.378
Answer: -134
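The matrix below (using the same x and y as Question 5) was presumably produced by something like:
dfbetas(lm(y ~ x))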
## (Intercept) x
## 1 1.06212391 -0.37811633
## 2 0.06748037 -0.02861769
## 3 -0.01735756 0.00791512
## 4 -1.24958248 0.67253246
## 5 0.20432010 -133.82261293
Question 7
Consider a regression relationship between Y and X with and without adjustment for a third variable Z. Which of the following is true about comparing the regression coefficient between Y and X with and without adjustment for Z?
- The coefficient can’t change sign after adjustment, except for slight numerical pathological cases.
- It is possible for the coefficient to reverse sign after adjustment. For example, it can be strongly significant and positive before adjustment and strongly significant and negative after adjustment.
- For the coefficient to change sign, there must be a significant interaction term.
- Adjusting for another variable can only attenuate the coefficient toward zero. It can’t materially change sign.
Answer: It is possible for the coefficient to reverse sign.