Multivariable Regression
When performing a regression and finding a relationship between the regressor and the response, you have to be careful that there’s not some other variable that actually explains the relationship.
As an example, you may find in your tests that there’s a relationship between breath mints and loss in pulmonary function. You’d probably think that smokers use more breath mints and that smoking is the likely cause.
How do you establish this? You'd consider smokers and non-smokers separately and see whether their lung function differs by breath mint usage within each group. Multivariable regression is an 'automated' way of doing this, holding all other variables constant while looking at the regression of a single variable.
General Linear Model
$$ Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} + \epsilon_i = \sum_{k=1}^p X_{ik}\beta_k + \epsilon_i $$
The least squares criterion is pretty much the same as for a single variable model: $\sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ki}\beta_k\right)^2$
The important linearity is linearity in the coefficients. If you squared or cubed the X values, the model would still be linear in the coefficients.
Two Variable Example
Consider a regression with two variables, so that the least squares criterion ends up as
$$\sum_{i=1}^n (Y_i - X_{1i}\beta_1 - X_{2i}\beta_2)^2$$
What it works out to be is that the regression slope $\beta_1$ is exactly what you would obtain if you took the residual of $X_2$ out of $X_1$, took the residual of $X_2$ out of $Y$, and did regression through the origin.
So what multivariable regression is doing is this: $\beta_1$ is the coefficient you get having removed $X_2$, the other covariate, from both the response and the first predictor $X_1$.
Similarly, $\beta_2$ is what you get if you were to remove $X_1$ from both the response $Y$ and the second predictor $X_2$.
Summary: A multi-variable coefficient is one where the linear effect of all the other variables on that predictor and the response has been removed.
Consider $Y_i = \beta_1 X_{1i} + \beta_2 X_{2i}$ where $X_{2i} = 1$, so the second regressor is just an intercept term.
In an intercept-only regression the fitted value is simply $\bar{Y}$, so the residuals are $e_{i, Y|X_2} = Y_i - \bar{Y}$, the centred version of the $Y$'s.
If we regress $X_{1i}$ on $X_{2i}$, since $X_{2i}$ is just a constant the coefficient is $\bar{X}_1$, the mean of that covariate. Thus $e_{i, X_1|X_2} = X_{1i} - \bar{X}_1$.
So the fitted regression slope is:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n e_{i, Y|X_2}\, e_{i, X_1|X_2}}{\sum_{i=1}^n e_{i, X_1|X_2}^2} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \mathrm{Cor}(X, Y)\frac{\mathrm{Sd}(Y)}{\mathrm{Sd}(X)}$$
Let’s do a computed example (the code chunks in these notes assume the tidyverse and broom packages are loaded):
set.seed(1)
mv <- tibble(
n = 1:100,
x1 = rnorm(n),
x2 = rnorm(n),
x3 = rnorm(n),
y = x1 + 2*x2 + 3*x3 + rnorm(n, sd = .3)
)
e_y <- mv %>% lm(y ~ x2 + x3, data = .) %>% resid()
e_x <- mv %>% lm(x1 ~ x2 + x3, data = .) %>% resid()
sum(e_x * e_y) / sum(e_x^2)
## [1] 0.9826123
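For comparison, the coefficients below come from the full multivariable fit. The chunk that produced them isn’t shown, but it was presumably something like:
mv %>% lm(y ~ x1 + x2 + x3, data = .) %>% coef()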
## (Intercept) x1 x2 x3
## 0.01581932 0.98261234 1.98351744 3.03138694
Interpretation of Coefficients
The regression predictor is the sum of the x’s times their coefficients:
$$E[Y | X_1 = x_1, \dots, X_p = x_p] = \sum_{k=1}^p x_k\beta_k$$
If the first regressor is incremented by one:
$$E[Y | X_1 = x_1 + 1, \dots, X_p = x_p] = (x_1 + 1)\beta_1 + \sum_{k=2}^p x_k\beta_k$$
Then we subtract the two expectations:
$$E[Y | X_1 = x_1 + 1, \dots, X_p = x_p] - E[Y | X_1 = x_1, \dots, X_p = x_p] = (x_1 + 1)\beta_1 + \sum_{k=2}^p x_k\beta_k - \sum_{k=1}^p x_k\beta_k = \beta_1$$
Notice that all of the other $x_k$ were held fixed. So the interpretation of a multivariable regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
Fitted Values, Residuals and Residual Variation
All of our simple linear regression quantities can be extended:
- Model $Y_i = \sum_{k=1}^p X_{ik}\beta_k + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$
- Fitted responses $\hat{Y}_i = \sum_{k=1}^p X_{ik}\hat{\beta}_k$
- Residuals $e_i = Y_i - \hat{Y}_i$
- Variance estimate $\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^n e_i^2$ (a quick numerical check follows this list)
- It’s $n - p$ because the residuals satisfy $p$ linear constraints, so only $n - p$ of them are free.
- We don’t really have $n$ residuals, we effectively have $n - p$.
- The coefficients have standard errors $\hat{\sigma}_{\hat{\beta}_k}$
- We can test whether the coefficients are 0 with a t-statistic: $\frac{\hat{\beta}_k - \beta_k}{\hat{\sigma}_{\hat{\beta}_k}}$
- Predicted responses have standard errors and we can calculate predicted and expected response intervals.
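A minimal numerical check (my own sketch, reusing the mv data simulated above): the residual variance estimate divides by n − p, where p counts all fitted coefficients including the intercept.
fit <- lm(y ~ x1 + x2 + x3, data = mv)
n <- nrow(mv)
p <- length(coef(fit))
# Manual estimate of sigma versus the one reported by summary()
c(manual = sqrt(sum(resid(fit)^2) / (n - p)), reported = summary(fit)$sigma)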
Linear Models
Linear models are the single most important applied statistical and machine learning technique.
Some things you can accomplish with a linear model:
- Decompose a signal into its harmonics.
- The discrete Fourier transform can be thought of as the fit from a linear model (see the sketch after this list).
- Flexibly fit complicated functions.
- Fit factor variables as predictors.
- Uncover complex multivariate relationships with the response.
- Build accurate prediction models.
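As a quick illustration of the first point (my own sketch, not from the original notes): regressing a signal on sine and cosine terms at a known frequency recovers the harmonic’s amplitude, which is exactly what a Fourier coefficient measures.
# Recover the amplitude of a 5 Hz harmonic by linear regression on a sine/cosine basis
t <- seq(0, 1, length.out = 256)
signal <- 2 * sin(2 * pi * 5 * t) + rnorm(256, sd = 0.2)
harmonic_fit <- lm(signal ~ sin(2 * pi * 5 * t) + cos(2 * pi * 5 * t))
coef(harmonic_fit)  # the sine coefficient should be close to 2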
Confounding
Let’s look at the swiss dataset and perform a linear regression of Fertility on all of the other variables:
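The chunk that produced the table below isn’t shown; it was presumably something like (tidy() is from broom):
swiss %>% lm(Fertility ~ ., data = .) %>% tidy()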
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 66.9 10.7 6.25 0.000000191
## 2 Agriculture -0.172 0.0703 -2.45 0.0187
## 3 Examination -0.258 0.254 -1.02 0.315
## 4 Education -0.871 0.183 -4.76 0.0000243
## 5 Catholic 0.104 0.0353 2.95 0.00519
## 6 Infant.Mortality 1.08 0.382 2.82 0.00734
We see a 0.17 decrease in standardised fertility for every 1% increase in the percentage of males involved in agriculture.
We now look at how our model selection changes the outcome. We regress Fertility on Agriculture alone:
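Again the chunk isn’t shown; presumably:
swiss %>% lm(Fertility ~ Agriculture, data = .) %>% tidy()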
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 60.3 4.25 14.2 3.22e-18
## 2 Agriculture 0.194 0.0767 2.53 1.49e- 2
It now appears that the percentage of males involved in agriculture increases rather than decreases fertility. What has happened? This is Simpson’s Paradox, where ‘a trend appears in different groups but disappears or reverses when the groups are combined’.
We can see this by assigning colour to one of the variables. There’s a clear positive linear trend between Agriculture and Fertility, but we can also see a trend (viewed via the colour) of Catholic with Fertility.
swiss %>%
ggplot(aes(Agriculture, Fertility)) +
geom_point(aes(colour = Catholic)) +
geom_smooth(method = 'lm', se = F)
With the single regressor, the linear regression is picking up effects of other variables and associating them with Agriculture. With all of the variables included, their linear effects are removed when the regression is occurring.
Factor Variables
Consider a model $Y_i = \beta_0 + X_{i1}\beta_1 + \epsilon_i$ where each $X_{i1}$ is binary: say $X_{i1} = 1$ for people in a group of interest and 0 otherwise.
For the people in the group, $E[Y_i] = \beta_0 + \beta_1$, and for those not in the group, $E[Y_i] = \beta_0$.
$\beta_1$ is interpreted as the increase or decrease in the mean for those in the group compared to those not.
It can be extended to a three-level factor with $Y_i = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2 + \epsilon_i$. $X_{i1}$ could be Liberal/not Liberal, and $X_{i2}$ could be Labour/not Labour. When both X’s are 0, it’s the third outcome, perhaps Independents.
- $\beta_1$ compares Liberal to Independents
- $\beta_2$ compares Labour to Independents
- $\beta_1 - \beta_2$ compares Liberal to Labour
The reference level in the factor is very important.
We take a look via the InsectSprays data set:
InsectSprays %>%
ggplot(aes(spray, count)) +
geom_violin(aes(fill = spray)) +
labs(x = 'Spray', y = 'Insect Count')
We fit a model of count to the factor variable spray:
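The fit itself isn’t shown; presumably:
InsectSprays %>% lm(count ~ spray, data = .) %>% tidy()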
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 14.5 1.13 12.8 1.47e-19
## 2 sprayB 0.833 1.60 0.520 6.04e- 1
## 3 sprayC -12.4 1.60 -7.76 7.27e-11
## 4 sprayD -9.58 1.60 -5.99 9.82e- 8
## 5 sprayE -11. 1.60 -6.87 2.75e- 9
## 6 sprayF 2.17 1.60 1.35 1.81e- 1
Notice that there’s no sprayA - this is the reference level and everything is compared with it. So the sprayB estimate of 0.833 is the change in the mean between sprayB and sprayA.
The intercept is the average count for sprayA, so the average count for sprayB is 14.5 + 0.833.
Use fct_relevel() to relevel the factors:
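Judging by the intercept of 2.08 (the sprayC mean), the reference level was presumably moved to C with something like:
InsectSprays %>%
  mutate(spray = fct_relevel(spray, 'C')) %>%
  lm(count ~ spray, data = .) %>%
  tidy()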
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.08 1.13 1.84 7.02e- 2
## 2 sprayA 12.4 1.60 7.76 7.27e-11
## 3 sprayB 13.2 1.60 8.28 8.51e-12
## 4 sprayD 2.83 1.60 1.77 8.14e- 2
## 5 sprayE 1.42 1.60 0.885 3.79e- 1
## 6 sprayF 14.6 1.60 9.11 2.79e-13
If you just want the means for each of the groups, you remove the intercept:
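Presumably by removing the intercept from the formula:
InsectSprays %>% lm(count ~ spray - 1, data = .) %>% tidy()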
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sprayA 14.5 1.13 12.8 1.47e-19
## 2 sprayB 15.3 1.13 13.5 1.00e-20
## 3 sprayC 2.08 1.13 1.84 7.02e- 2
## 4 sprayD 4.92 1.13 4.34 4.95e- 5
## 5 sprayE 3.5 1.13 3.09 2.92e- 3
## 6 sprayF 16.7 1.13 14.7 1.57e-22
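These coefficients match the group means computed directly, presumably via something like:
InsectSprays %>% group_by(spray) %>% summarise(mean(count))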
## # A tibble: 6 x 2
## spray `mean(count)`
## <fct> <dbl>
## 1 A 14.5
## 2 B 15.3
## 3 C 2.08
## 4 D 4.92
## 5 E 3.5
## 6 F 16.7
There’s no difference in the model; it’s just that the coefficients have a different interpretation. The p-values in the first fit test whether each mean differs from sprayA’s, whereas in the fit without the intercept they test whether each mean differs from 0.
Interaction Terms
Taking our swiss dataset, we create a new variable which is 1 if the province is > 50% Catholic, and 0 if it is not.
There are a few different models we can apply:
We could fit $E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1$, which is a line that disregards the religion of the province.
We could do $E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, however in this instance the slope is the same for both binary outcomes of $X_2$; only the intercept shifts.
Finally, we can do $E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$. In this case we end up with:
$$\begin{aligned} X_2 = 0 &: \quad \beta_0 + \beta_1 X_1 \\ X_2 = 1 &: \quad \beta_0 + \beta_1 X_1 + \beta_2 + \beta_3 X_1 = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1 \end{aligned}$$
When we include an interaction term, we have two different lines with two different slopes and intercepts.
# Add a binary variable
swiss %>%
mutate(MajCatholic = Catholic > 50) -> swiss_maj
# Set up the graph
swiss_maj %>%
ggplot() +
geom_point(aes(Agriculture, Fertility, colour = MajCatholic)) -> g
# Perform the linear regression with no interaction
swiss_maj %>%
lm(Fertility ~ Agriculture + MajCatholic, data = .) -> swiss_lm_noint
# We see in this one, the 'MajCatholicTRUE' is added to the intercept when
# MajCatholic = TRUE. The slope does not change.
summary(swiss_lm_noint)
##
## Call:
## lm(formula = Fertility ~ Agriculture + MajCatholic, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.803 -6.701 1.382 6.855 20.435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8322 4.1059 14.816 <2e-16 ***
## Agriculture 0.1242 0.0811 1.531 0.1329
## MajCatholicTRUE 7.8843 3.7484 2.103 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.39 on 44 degrees of freedom
## Multiple R-squared: 0.2046, Adjusted R-squared: 0.1685
## F-statistic: 5.66 on 2 and 44 DF, p-value: 0.006492
# Graph the two regressions
f <- swiss_lm_noint
g +
geom_abline(intercept = coef(f)[1], slope = coef(f)[2]) +
geom_abline(intercept = coef(f)[1] + coef(f)[3], slope = coef(f)[2])
swiss_maj %>%
lm(Fertility ~ Agriculture * MajCatholic, data = .) -> swiss_lm_int
# With the interaction term, we have the intercept for MajCatholic < 50, and
# the slope for Agriculture with MajCatholic < 50.
# The second two are the intercept and slope for MajCatholic > 50, but these need
# to be added to the < 50 main effects (the first two)
summary(swiss_lm_int)
##
## Call:
## lm(formula = Fertility ~ Agriculture * MajCatholic, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.840 -6.668 1.016 7.092 20.242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.04993 4.78916 12.956 <2e-16 ***
## Agriculture 0.09612 0.09881 0.973 0.336
## MajCatholicTRUE 2.85770 10.62644 0.269 0.789
## Agriculture:MajCatholicTRUE 0.08914 0.17611 0.506 0.615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.49 on 43 degrees of freedom
## Multiple R-squared: 0.2094, Adjusted R-squared: 0.1542
## F-statistic: 3.795 on 3 and 43 DF, p-value: 0.01683
f <- swiss_lm_int
g +
geom_abline(intercept = coef(f)[1], slope = coef(f)[2]) +
geom_abline(intercept = coef(f)[1] + coef(f)[3], slope = coef(f)[2] + coef(f)[4])
Residuals, Diagnostics, Variation
We recall that we had our linear models with IID, normally distributed error terms, and our residuals are the vertical distances away from the fitted line. Our residual variation is the average of the squared residuals, except that we divide by $n - p$.
Influence Measures
Calling a point an ‘outlier’ is vague. They can be caused by both real and spurious processes.
You can use ?influence.measures to get a list of different influence measures.
- rstandard() - standardised residuals, divided by their standard deviation. Allows comparison between different response units.
- rstudent() - standardised residuals which have been externally studentised - the improbably large observation is removed from the variance calculation.
- hatvalues() - a measure of leverage. A point can have high leverage but not exert it, e.g. a high-leverage point that sits on the regression line. Very useful for finding data entry errors.
- dffits() - for every data point, how much the fitted value changes when the fit does vs doesn’t include that observation. One dffit per data point.
- dfbetas() - how the slope coefficients change when the fit does vs doesn’t contain the observation. One dfbeta per coefficient.
- cooks.distance() - a summary of the dfbetas.
- resid() - returns the ordinary residuals.
- resid(fit) / (1 - hatvalues(fit)) - the PRESS residuals. Less useful for influence and outliers and more for general model fit.
Be wary of simple rules and diagnostic plots - what counts as influential often depends on the context.
Examples
In this example, we take uncorrelated X and Y values, then add a point out at (10,10). Without the influential point there would be no correlation, but with the point there is now a linear trend.
influence_d <- tibble(
x = rnorm(100),
y = rnorm(100)
) %>%
add_row(x = 10, y = 10)
influence_fit <- lm(x ~ y, influence_d)
influence_d %>%
ggplot(aes(x,y)) +
geom_point() +
geom_smooth(method = 'lm', se = F)
We’ll take a look at some diagnostic plots:
# The 101st point has a very large dfbeta
dfbetas(influence_fit) %>%
as_tibble() %>%
rownames_to_column('ob_id') %>%
top_n(5, y)
## # A tibble: 5 x 3
## ob_id `(Intercept)` y
## <chr> <dbl> <dbl>
## 1 9 0.0605 0.0500
## 2 20 0.0782 0.0666
## 3 31 -0.114 0.0992
## 4 86 0.142 0.0867
## 5 101 0.675 6.83
# We also see the hat value is very large compared to the other values
influence_fit %>%
augment() %>%
rownames_to_column('ob_id') %>%
top_n(5, .hat) %>%
select(ob_id, .hat)
## # A tibble: 5 x 2
## ob_id .hat
## <chr> <dbl>
## 1 11 0.0446
## 2 49 0.0364
## 3 57 0.0395
## 4 95 0.0344
## 5 101 0.524
orly <- read_table2(
'https://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_ramp.txt',
col_names = F,
col_types = cols(
X1 = col_double(),
X2 = col_double(),
X3 = col_double(),
X4 = col_double(),
X5 = col_double(),
X6 = col_logical()
)
) %>% select(-X6)
# The p values show that every regressor is significant
orly_lm <- orly %>% lm(X1 ~ ., data = .)
orly_lm %>% summary()
##
## Call:
## lm(formula = X1 ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0652 -0.9878 0.1419 0.7958 1.9560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001194 0.021170 0.056 0.955
## X2 0.986325 0.128630 7.668 2.56e-14 ***
## X3 1.899763 0.195782 9.703 < 2e-16 ***
## X4 2.716371 0.262909 10.332 < 2e-16 ***
## X5 3.709886 0.337727 10.985 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.001 on 2293 degrees of freedom
## Multiple R-squared: 0.05, Adjusted R-squared: 0.04834
## F-statistic: 30.17 on 4 and 2293 DF, p-value: < 2.2e-16
# The residuals versus the fitted values
orly_lm %>%
augment() %>%
ggplot(aes(.fitted, .resid)) +
geom_point(size = .1)
The residuals versus the fitted values show a clear systematic pattern. Usually we want to model the systematic elements, so that all that is left over is random noise.
Model Selection
How do we choose what variables to include in a regression model? No single easy answer exists - ‘it depends’.
A model is a lens through which you look at the data. Under this philosophy, what is the right model? Whatever model connects the data to a true, parsimonious statement about what you’re studying.
General Rules
- Including variables that we shouldn’t have increases standard errors of the regression variables.
- Actually, any new variables increase the standard errors of the other regressors.
- $R^2$ increases monotonically as more regressors are included.
Not including important regressors increases bias, and including unimportant regressors causes variance inflation.
If a new regressor is highly correlated with a regressor already in the model, the variance of that coefficient is going to be inflated. You need to ensure you’re not putting such variables into the model unnecessarily.
Variance Inflation Factors
Variance inflation is much worse when you include a variable that is highly correlated with the regressor of interest. We don’t know $\sigma$, so we can only estimate the increase in the actual standard error of the coefficients for including a regressor.
However, $\sigma$ drops out of the relative standard errors. If we sequentially add variables, we can check the variance (or SD) inflation from including each one.
When the other regressors are actually orthogonal (uncorrelated) to the regressor of interest, then there is no variance inflation.
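A minimal simulation (my own sketch, not from the notes) showing that adding a regressor correlated with x1 inflates its standard error, while adding an orthogonal one barely changes it:
set.seed(2)
n <- 1000
x1 <- rnorm(n)
x2_orth <- rnorm(n)                # unrelated to x1
x2_corr <- x1 + rnorm(n, sd = .3)  # strongly related to x1
y <- x1 + rnorm(n)
se_x1 <- function(fit) summary(fit)$coefficients['x1', 'Std. Error']
c(alone      = se_x1(lm(y ~ x1)),
  orthogonal = se_x1(lm(y ~ x1 + x2_orth)),
  correlated = se_x1(lm(y ~ x1 + x2_corr)))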
The Variance Inflation Factor (VIF) is the increase in variance for the ith regressor compared to the ideal setting where it is orthogonal to the other regressors. It’s only part of the picture; we sometimes need to include variables that are pertinent even if they dramatically inflate our variance.
We can look at the swiss dataset:
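The VIFs below were presumably computed with vif() from the car package:
car::vif(lm(Fertility ~ ., data = swiss))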
## Agriculture Examination Education Catholic
## 2.284129 3.675420 2.774943 1.937160
## Infant.Mortality
## 1.107542
So Agriculture having a VIF of 2.28 means that the variance of the Agriculture coefficient is about 2.3 times what it would be if Agriculture were orthogonal to the other regressors (its standard error is inflated by roughly $\sqrt{2.28} \approx 1.5$).
Variance
If we correctly fit or over fit the model, including all necessary covariates and/or unnecessary covariates, the variance estimate is unbiased. However the variance of the variance estimate gets inflated if we include unnecessary variables.
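A small simulation (my own sketch) of that claim: adding junk covariates leaves the residual variance estimate roughly unbiased, but makes it more variable.
set.seed(3)
sims <- replicate(2000, {
  x1 <- rnorm(50)
  junk <- matrix(rnorm(50 * 10), ncol = 10)  # unnecessary covariates
  y <- x1 + rnorm(50)                        # true sigma^2 = 1
  c(correct = summary(lm(y ~ x1))$sigma^2,
    overfit = summary(lm(y ~ x1 + junk))$sigma^2)
})
apply(sims, 1, mean)  # both close to 1 (roughly unbiased)
apply(sims, 1, var)   # the overfit estimate is more variable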
Principal Components
Principal components or factor analytic models on the covariates are often useful for reducing covariate spaces - i.e. when you have a large variable space. This comes with consequences - your interpretability is reduced.
Nested Models
If the models of interest are nested and don’t have lots of parameters differentiating them, it’s fairly straightforward to compare them with nested model tests (ANOVA).
So for example you may have a treatment. You first look at the model with the response against treatment, then the model with response against treatment and age, then treatment, age and gender, etc.
swiss %>%
do(
fit_a = lm(Fertility ~ Agriculture, data = .),
fit_b = lm(Fertility ~ Agriculture + Examination, data = .),
fit_c = lm(Fertility ~ Agriculture + Examination + Education, data = .)
) %>%
gather(fit_name, fit_model) %>%
pull(fit_model) %>%
do.call(anova, .)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination
## Model 3: Fertility ~ Agriculture + Examination + Education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 6283.1
## 2 44 4072.7 1 2210.38 29.880 2.172e-06 ***
## 3 43 3180.9 1 891.81 12.056 0.001189 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An explanation of the columns:
- Residual degrees of freedom is the number of data points minus the number of parameters.
- RSS is the residual sum of squares.
- Df is the excess degrees of freedom going between each model - we added one parameter between each successive model.
- F is the F-statistic for testing whether the added terms improve the fit.
- Pr(>F) is the p-value for the F-statistic.
It’s important to note that the models must be nested for this to work - i.e. the later models must include all of the variables that the earlier models had.
All models are wrong, some models are useful.
Quiz
Question 1
Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a confounder. Give the adjusted estimate for the expected change in mpg comparing 8 cylinders to 4.
- 33.991
- -3.206
- -6.071
- -4.256
Answer: the answer is -6.071
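The summary below presumably came from something like:
mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  lm(mpg ~ cyl + wt, data = .) %>%
  summary()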
##
## Call:
## lm(formula = mpg ~ cyl + wt, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9908 1.8878 18.006 < 2e-16 ***
## cyl6 -4.2556 1.3861 -3.070 0.004718 **
## cyl8 -6.0709 1.6523 -3.674 0.000999 ***
## wt -3.2056 0.7539 -4.252 0.000213 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
We see that 4 cylinders is used as the base, so there is a -6.071 estimate for the expected change in mpg going from 4 cylinders to 8.
Question 2
Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a possible confounding variable. Compare the effect of 8 versus 4 cylinders on mpg for the adjusted and unadjusted by weight models. Here, adjusted means including the weight variable as a term in the regression model and unadjusted means the model without weight included. What can be said about the effect comparing 8 and 4 cylinders after looking at models with and without weight included?
- Including or excluding weight does not appear to change anything regarding the estimated impact of number of cylinders on mpg.
- Holding weight constant, cylinder appears to have more of an impact on mpg than if weight is disregarded.
- Within a given weight, 8 cylinder vehicles have an expected 12 mpg drop in fuel efficiency.
- Holding weight constant, cylinder appears to have less of an impact on mpg than if weight is disregarded.
Answer: It is both true and sensible that including weight would attenuate the effect of number of cylinders on mpg.
# Our unadjusted model
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
lm(mpg ~ cyl, data = .) %>%
summary()
##
## Call:
## lm(formula = mpg ~ cyl, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2636 -1.8357 0.0286 1.3893 7.2364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
## cyl6 -6.9208 1.5583 -4.441 0.000119 ***
## cyl8 -11.5636 1.2986 -8.905 8.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
## F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
In this regression without weight, the cylinders appear to have more of an impact on mpg than when weight is included.
Question 3
Consider the mtcars data set. Fit a model with mpg as the outcome that considers number of cylinders as a factor variable and weight as confounder. Now fit a second model with mpg as the outcome model that considers the interaction between number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model using 0.05 as a type I error rate significance benchmark.
- The P-value is small (less than 0.05). Thus it is surely true that there is no interaction term in the true model.
- The P-value is small (less than 0.05). Thus it is surely true that there is an interaction term in the true model.
- The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms is necessary.
- The P-value is small (less than 0.05). So, according to our criterion, we reject, which suggests that the interaction term is not necessary.
- The P-value is small (less than 0.05). So, according to our criterion, we reject, which suggests that the interaction term is necessary
- The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms may not be necessary.
Answer: large p-value, so the term may not be necessary.
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
do(
fit1 = lm(mpg ~ cyl + wt, data = .),
fit2 = lm(mpg ~ cyl + wt + cyl * wt, data = .)
) %>%
gather(name, fit) %>%
pull(fit) %>%
do.call(anova, .)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + wt
## Model 2: mpg ~ cyl + wt + cyl * wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 183.06
## 2 26 155.89 2 27.17 2.2658 0.1239
The term may not be necessary.
Question 4
Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight included in the model as lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars).
How is the wt coefficient interpreted?
- The estimated expected change in MPG per one ton increase in weight. -> No
- The estimated expected change in MPG per one ton increase in weight for a specific number of cylinders (4, 6, 8).
- The estimated expected change in MPG per half ton increase in weight for for a specific number of cylinders (4, 6, 8). -> No
- The estimated expected change in MPG per half ton increase in weight for the average number of cylinders. -> No
- The estimated expected change in MPG per half ton increase in weight. -> No
Answer: The estimated expected change in MPG per one ton increase in weight for a specific number of cylinders (4, 6, 8)
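For reference, the Call line in the output below shows the fit was:
summary(lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars))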
##
## Call:
## lm(formula = mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.991 1.888 18.006 < 2e-16 ***
## I(wt * 0.5) -6.411 1.508 -4.252 0.000213 ***
## factor(cyl)6 -4.256 1.386 -3.070 0.004718 **
## factor(cyl)8 -6.071 1.652 -3.674 0.000999 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
Question 5
Consider the following data set:
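The x and y vectors were omitted above; they are reproduced here from the course quiz question so the code below runs:
# Data as given in the quiz question
x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)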
Give the hat diagonal for the most influential point
- 0.2804
- 0.2287
- 0.9946
- 0.2025
Answer: 0.9946
tibble(x = x, y = y) %>%
lm(y ~ x, data = .) %>%
augment() %>%
mutate(n = n()) %>%
top_n(1, .hat) %>%
select(.hat, n)
## # A tibble: 1 x 2
## .hat n
## <dbl> <int>
## 1 0.995 5
Question 6
Consider the following data set:
Give the slope dfbeta for the point with the highest hat value.
- -.00134
- 0.673
- -134
- -0.378
Answer: -134
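The matrix below (using the same x and y as Question 5) was presumably produced by something like:
dfbetas(lm(y ~ x))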
## (Intercept) x
## 1 1.06212391 -0.37811633
## 2 0.06748037 -0.02861769
## 3 -0.01735756 0.00791512
## 4 -1.24958248 0.67253246
## 5 0.20432010 -133.82261293
Question 7
Consider a regression relationship between Y and X with and without adjustment for a third variable Z. Which of the following is true about comparing the regression coefficient between Y and X with and without adjustment for Z?
- The coefficient can’t change sign after adjustment, except for slight numerical pathological cases.
- It is possible for the coefficient to reverse sign after adjustment. For example, it can be strongly significant and positive before adjustment and strongly significant and negative after adjustment.
- For the coefficient to change sign, there must be a significant interaction term.
- Adjusting for another variable can only attenuate the coefficient toward zero. It can’t materially change sign.
Answer: It is possible for the coefficient to reverse sign.