Introduction
Regression models are the workhorse of data science. A key strength of regression models is that they produce highly interpretable fits, unlike machine learning algorithms, which often sacrifice interpretability for improved prediction or automation.
Basic Least Squares
Let’s look at Francis Galton’s data on parents’ and children’s heights:
library(UsingR, include.only = 'galton')
library(tidyverse)

galton %>%
  gather(person, height, c(child, parent)) %>%
  ggplot() +
  geom_histogram(aes(height, fill = person), colour = 'black', binwidth = 1) +
  facet_wrap(vars(person)) +
  labs(x = 'Height', y = 'Count')
Let’s consider only the children’s heights. How do we describe “the middle”?
One way is to let \(Y_i\) be the height of child \(i\) for \(i = 1, \ldots, n = 928\), then define the middle as the value of \(\mu\) that minimises:
\[ \sum_{i = 1}^n (Y_i - \mu)^2 \]
This is the physical centre of mass of the histogram. The answer is \(\mu = \bar{Y}\), the mean of the data.
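As a quick check, here is a minimal sketch (assuming the galton data is loaded as above) that numerically minimises the sum of squares over a plausible height range and compares the result with the sample mean:

# Numerically minimise the sum of squared deviations and compare with the mean
sse <- function(mu, y) sum((y - mu)^2)
optimise(sse, interval = c(60, 75), y = galton$child)$minimum
mean(galton$child)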
Parent vs Child Scatter Plots
If we produce a plain scatter plot, we get overplotting because the heights are recorded to only one decimal place:
galton %>%
  ggplot(aes(parent, child)) +
  geom_point() +
  labs(x = 'Parent Height', y = 'Child Height')
We can modify the size of each point to show the count of the data points:
galton %>%
  ggplot(aes(parent, child)) +
  geom_count(colour = 'black', fill = 'lightblue', shape = 'circle filled') +
  labs(x = 'Parent Height', y = 'Child Height')
Regression Through the Origin
Suppose that \(X_i\) are the parents’ heights. We want to pick a slope that minimises
\[ \sum_{i = 1}^n (Y_i - X_i\beta)^2 \]
This is using the origin as a pivot point. We pick a gradient \(\beta\) that minimises the sum of the squared vertical distances of the points to the line.
If the x and y values are centred by subtracting their means, the origin ends up sitting in the middle of our observations. We can then perform the regression without calculating an intercept, and this is equivalent to calculating both the slope and the intercept with the original data.
Let’s look at the solution - the \(-1\) tells lm() not to calculate the intercept.
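The call that produced the output below (reconstructed from its Call: line) would be:

lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)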
##
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) -
## 1, data = galton)
##
## Coefficients:
## I(parent - mean(parent))
## 0.6463
Least Squares
Notation
- \(X_1, X_2, \ldots, X_n\) represents \(n\) data points.
- If we have \(\{1, 2, 3\}\), then \(X_1 = 1,\ X_2 = 2,\ X_3 = 3\).
Mean
The empirical mean is \(\frac{1}{n} \sum_{i=1}^n X_i\).
If we subtract the mean from the data points, \(\tilde{X}_i = X_i - \bar{X}\), the resulting data have mean 0.
Variance
The empirical variance is:
\[ S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X})^2 \\ = \frac{1}{n-1} \bigg( \sum_{i=1}^n X_i^2 - n\bar{X}^2 \bigg) \]
The empirical standard deviation is defined as \(S = \sqrt{S^2}\).
The data defined by \(X_i / S\) have empirical standard deviation 1. This is called ‘scaling’ the data.
Normalised data is calculated using:
\[ Z_i = \frac{X_i - \bar{X}}{S} \]
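As a minimal sketch (using the children’s heights from the galton data), centring, scaling and normalising look like this:

# Centring (mean 0), scaling (sd 1), and normalising (mean 0, sd 1)
centred    <- galton$child - mean(galton$child)
scaled     <- galton$child / sd(galton$child)
normalised <- (galton$child - mean(galton$child)) / sd(galton$child)

c(mean(normalised), sd(normalised))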
Covariance
Consider pairs of data \((X_i, Y_i)\). The empirical covariance is:
\[ Cov(X,Y) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) \\ = \frac{1}{n - 1} \bigg( \sum_{i=1}^n X_iY_i - n\bar{X}\bar{Y}\bigg) \]
Correlation
The correlation is the covariance standardised into a unitless quantity:
\[ Cor(X,Y) = \frac{ Cov(X,Y) }{ S_x S_y } \] Some facts about correlation:
- \(Cor(X,Y) = Cor(Y,X)\)
- \(-1 \le Cor(X,Y) \le 1\)
- \(Cor(X,Y) = 1\) and \(Cor(X,Y) = -1\) occur only when the \(X\) and \(Y\) observations fall perfectly on a positively or negatively sloped line, respectively.
- \(Cor(X,Y)\) measures the strength of the linear relationship between the \(X\) and \(Y\) data.
- \(Cor(X,Y) = 0\) implies no linear relationship.
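As a quick check of these definitions on the Galton data, standardising the covariance reproduces cor():

# Covariance divided by the product of the standard deviations is the correlation
with(galton, cov(parent, child) / (sd(parent) * sd(child)))
with(galton, cor(parent, child))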
Linear Least Squares
Using the Galton data discussed previously, we let \(Y_i\) be the \(i^{th}\) child’s height and \(X_i\) be the \(i^{th}\) (average over the pair of) parents’ heights.
We want to find the best line where \(\text{Child's Height} = \beta_0 + \text{Parents' Height} \times \beta_1\)
What is the criterion for ‘best’? We’ll use least squares: we want to minimise the sum of the squared vertical distances between the data points and the points on the fitted line.
\[ \sum_{i=1}^n \{ Y_i - (\beta_0 + \beta_1 X_i) \}^2 \]
The least squares model fit to the line \(Y = \beta_0 + \beta_1 X\) through the data pairs \((X_i, Y_i)\) obtains the line \(Y = \hat{\beta_0} + \hat{\beta_1} X\), where
\[ \hat{\beta_1} = Cor(X,Y) \frac{ \sigma_Y }{ \sigma_X } \\ \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X} \]
The \(\hat{\beta_1}\) coefficient has the units of \(Y/X\): \(Cor(X,Y)\) is a unitless quantity, so the units come from the ratio of the standard deviations.
The \(\hat{\beta_0}\) coefficient has the units of \(Y\).
The line always passes through the point \((\bar{X}, \bar{Y})\). We see this if we re-arrange the second formula to \(\bar{Y} = \hat{\beta_0} + \hat{\beta_1} \bar{X}\)
The slope is the same one you would get if you centered the data and did the regression through the origin.
If you normalised the data, both standard deviations become 1 and centring does not change the correlation, so the slope is \(Cor(X,Y) \times \frac{1}{1} = Cor(X,Y)\).
Let’s do a comparison:
galton %>%
  summarise(
    beta_1 = cor(parent, child) * sd(child) / sd(parent),
    beta_0 = mean(child) - beta_1 * mean(parent)
  ) %>%
  dplyr::select(beta_0, beta_1)
## beta_0 beta_1
## 1 23.94153 0.6462906
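For comparison, the lm() call that presumably produced the coefficients below is:

coef(lm(child ~ parent, data = galton))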
## (Intercept) parent
## 23.9415302 0.6462906
B1-Hat Derivation
We are fitting the line \(y = x\beta\) with \(y_1, \ldots, y_n\) and \(x_1, \ldots, x_n\).
We want to minimise the quantity \(\sum_{i=1}^n (y_i - x_i\beta)^2\). Let \(\hat{\beta}\) be the solution.
We can add in a zero:
\[ = \sum_{i=1}^n (y_i - x_i\hat{\beta} + x_i\hat{\beta} - x_i\beta)^2 \]
Expand this out:
\[ = \sum_{i=1}^n (y_i - x_i\hat{\beta})^2 - 2\sum_{i=1}^n (y_i - x_i\hat{\beta})(x_i\hat{\beta} - x_i\beta) + \sum_{i=1}^n (x_i\hat{\beta} - x_i\beta)^2 \]
The last term is a sum of squares and hence non-negative, so dropping it gives a quantity that is no larger:
\[ \ge \sum_{i=1}^n (y_i - x_i\hat{\beta})^2 - 2\sum_{i=1}^n (y_i - x_i\hat{\beta})(x_i\hat{\beta} - x_i\beta) \]
Looking at the second summation: if \(\hat{\beta}\) makes this term zero, then the least squares criterion for any \(\beta\) is at least as large as the value obtained by plugging in \(\hat{\beta}\). That \(\hat{\beta}\) must therefore be the minimiser, because every other \(\beta\) gives a criterion that is at least as large.
\[ \ge \sum_{i=1}^n (y_i - x_i\hat{\beta})^2 \]
So what we need to do is make that second term equal to zero.
\[ \sum_{i=1}^n (y_i - x_i\hat{\beta}) x_i (\hat{\beta} - \beta) = 0 \]
The factor \((\hat{\beta} - \beta)\) doesn’t depend on \(i\), so it can be factored out and divided out of both sides:
\[ \sum_{i=1}^n (y_i - x_i\hat{\beta}) x_i = 0 \]
Solving for \(\hat{\beta}\) yields the following formula for regression through the origin.
\[ \hat{\beta} = \frac{ \sum_{i=1}^n y_i x_i }{ \sum_{i=1}^n x_i^2 } \]
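A quick sketch checking this formula against lm() on the (uncentred) Galton data:

# Closed-form regression through the origin vs lm() with the intercept suppressed
with(galton, sum(child * parent) / sum(parent^2))
coef(lm(child ~ parent - 1, data = galton))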
Regression to the Mean
- Why are the children of tall parents tall, but not as tall as their parents?
- Why are the children of short parents short, but not as short as their parents?
We can think about it in terms of pairs of standard normals. The largest of the first measurements will be the largest partly by chance; put another way, \(P(Y < x \mid X = x)\) gets bigger as \(x\) heads into the very large values.
Let’s simulate this. We generate the normals and arrange them in descending order by x. We calculate whether \(x > y\) and separate them into 4 distinct groups. We then calculate the proportion of true values by each group.
set.seed(1)
tibble(
  x = rnorm(100),
  y = rnorm(100)
) %>%
  arrange(desc(x)) %>%
  mutate(
    x_gt_y = x > y,
    g = gl(4, 25)
  ) %>%
  group_by(g) %>%
  summarise(prob = mean(x_gt_y))
## # A tibble: 4 x 2
## g prob
## <fct> <dbl>
## 1 1 0.84
## 2 2 0.72
## 3 3 0.48
## 4 4 0.2
This is “100% regression to the mean”, but usually there is a blend of some intrinsic component and some noise.
Consider the galton data set with children’s and parents’ heights. If the data is normalised then the slope of the line is \(Cor(X,Y)\). If the parents’ height were a perfect predictor with no noise, the regression line would lie on the identity \(y = x\). If there were only noise, the line would fall along \(y = 0\).
How far the slope (the correlation) is shrunk from the identity line towards the zero line gives the extent of the regression to the mean.
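A minimal sketch of this on the Galton data: after normalising both variables, the fitted slope equals the correlation.

# With both variables normalised, the regression slope equals the correlation
galton %>%
  mutate(
    child_n  = (child - mean(child)) / sd(child),
    parent_n = (parent - mean(parent)) / sd(parent)
  ) %>%
  lm(child_n ~ parent_n, data = .) %>%
  coef()

cor(galton$parent, galton$child)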
Quiz
Question 1
Consider the data set:
and weights given by:
Give the value of \(\mu\) that minimizes the least squares equation:
\[ \sum_{i=1}^n w_i (x_i - \mu)^2 \]
- 1.077
- 0.300
- 0.0025
- 0.1471
## x
## 0.2957306
Question 2
Consider the following data set
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
Fit the regression through the origin and get the slope treating y as the outcome and x as the regressor. (Hint, do not center the data since we want regression through the origin, not through the means of the data.)
- 0.8263
- 0.59915
- -1.713
- -0.04462
Answer: we can use the derived regression through the origin formula \(\frac{ \sum_{i=1}^n y_i x_i }{ \sum_{i=1}^n x_i^2 }\), or use the lm() function and specify regression through the origin.
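A sketch of both approaches (likely how the output below was produced):

# Closed-form regression through the origin
sum(y * x) / sum(x^2)

# Equivalent lm() fit with the intercept suppressed
coef(lm(y ~ x - 1))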
## [1] 0.8262517
## x
## 0.8262517
Question 3
Using the mtcars data set, fit a model with mpg as the outcome and weight (wt) as the predictor. Give the slope coefficient.
- -5.344
- 30.2851
- -9.559
- 0.5591
Answer: we use the \(\hat{ \beta }_1\) formula and lm() to calculate the slope coefficient.
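A sketch of both approaches (likely how the output below was produced):

# Slope via the correlation and standard deviation formula
mtcars %>%
  summarise(beta_1 = cor(wt, mpg) * sd(mpg) / sd(wt))

# Slope via lm()
coef(lm(mpg ~ wt, data = mtcars))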
## beta_1
## 1 -5.344472
## (Intercept) wt
## 37.285126 -5.344472
Question 4
Consider data with an outcome \(Y\) and a predictor \(X\). The standard deviation of the predictor is one half that of the outcome. The correlation between the two variables is 0.5. What value would the slope coefficient be for the regression model with \(Y\) as the outcome and \(X\) as the predictor?
- 3
- 0.25
- 4
- 1
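Answer: using the slope formula, \(\hat{\beta}_1 = Cor(X,Y) \frac{ \sigma_Y }{ \sigma_X } = 0.5 \times 2 = 1\).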
Question 5
Students were given two hard tests and scores were normalized to have empirical mean 0 and variance 1. The correlation between the scores on the two tests was 0.4. What would be the expected score on Quiz 2 for a student who had a normalized score of 1.5 on Quiz 1?
- 0.6
- 0.4
- 0.16
- 1.0
Answer: if the data has been normalised (\(\sigma = 1, \mu = 0\)), then the correlation between the two is the regression line slope: \(q_2 = .4 q_1\). The expected score is \(.4 \times 1.5 = 0.6\).
Question 6
Consider the data given by the following:
What is the value of the first measurement if x were normalized (to have mean 0 and variance 1)?
- 8.86
- 9.31
- 8.58
- -0.9719
Answer: we normalise by taking the mean away from each value and dividing by the standard deviation:
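A sketch of the calculation, assuming x holds the values given in the question (not reproduced here):

# First element of the normalised vector
((x - mean(x)) / sd(x))[1]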
## [1] -0.9718658
Question 7
Consider the following data set:
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
What is the intercept for fitting the model with x as the predictor and y as the outcome?
- 1.252
- 1.567
- -1.713
- 2.105
Answer:
# Using the formulas
tibble(x = x, y = y) %>%
  summarise(
    beta_1 = cor(x, y) * (sd(y) / sd(x)),
    beta_0 = mean(y) - beta_1 * mean(x)
  ) %>%
  select(beta_0)
## # A tibble: 1 x 1
## beta_0
## <dbl>
## 1 1.57
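The lm() fit (reconstructed from the Call: line in the output below):

# Using lm()
tibble(x = x, y = y) %>%
  lm(y ~ x, data = .)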
##
## Call:
## lm(formula = y ~ x, data = .)
##
## Coefficients:
## (Intercept) x
## 1.567 -1.713
Question 8
You know that both the predictor and response have mean 0. What can be said about the intercept when you fit a linear regression?
- It is undefined as you have to divide by zero.
- Nothing about the intercept can be said from the information given.
- It must be exactly one.
- It must be identically 0.
Answer: we know that any regression line must pass through \((\mu_x, \mu_y)\). If the mean of both of these is zero, then the regression line must pass through the origin.
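In terms of the intercept formula: \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} = 0 - \hat{\beta}_1 \times 0 = 0\), so the intercept must be identically 0.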
Question 9
Consider the data given by:
What value minimizes the sum of the squared distances between these points and itself?
- 0.44
- 0.573
- 0.36
- 0.8
Answer: as in the Basic Least Squares section, the value \(\mu\) that minimises \(\sum_{i=1}^n (x_i - \mu)^2\) is the mean of the data, which here is 0.573.
Question 10
Let the slope from fitting \(Y\) as the outcome and \(X\) as the predictor be denoted as \(\beta_1\). Let the slope from fitting \(X\) as the outcome and \(Y\) as the predictor be denoted as \(\gamma_1\). Suppose that you divide \(\beta_1\) by \(\gamma_1\). In other words, consider \(\beta_1 / \gamma_1\). What is this ratio always equal to?
- \(Cor(Y,X)\)
- \(1\)
- \(Var(Y)/Var(X)\)
- \(2SD(Y)/SD(X)\)
Answer: \(\beta_1 = Cor(X,Y) * \frac{ sd(Y) }{ sd(X) }\) and \(\gamma_1 = Cor(X,Y) * \frac{ sd(X) }{ sd(Y) }\), so we have:
\[ \frac{ \beta_1 }{ \gamma_1 } = \frac{ Cor(X,Y) * \frac{ sd(Y) }{ sd(X) } }{ Cor(X,Y) * \frac{ sd(X) }{ sd(Y) } } \\ = \frac{ \frac{ sd(Y) }{ sd(X) } }{ \frac{ sd(X) }{ sd(Y) } } \\ = \frac{ sd(Y)^2 }{ sd(X)^2 } = \frac{ Var(Y) }{ Var(X) } \]
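As a quick numerical check of this result on the Galton data (a sketch, not part of the original answer):

# The ratio of the two slopes equals the ratio of the variances
beta_1  <- coef(lm(child ~ parent, data = galton))[2]
gamma_1 <- coef(lm(parent ~ child, data = galton))[2]

c(beta_1 / gamma_1, var(galton$child) / var(galton$parent))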