Predicting with Trees
Iteratively split the data into groups based on the variables and evaluate the ‘homogeneity’ within each group. You can split again if necessary.
Trees are easy to interpret and perform well in nonlinear settings. However, without pruning or cross-validation they can overfit, it’s hard to estimate uncertainty, and the results may be variable.
The basic algorithm:
- Start with all variables in one group.
- Find the variable & split that best separates the outcomes.
- Divide the data into two groups (‘leaves’) on that split (‘node’).
- Within each split, repeat the search for the best variable and split.
- Continue until the groups are too small or sufficiently ‘pure’.
Measures of Impurity
$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in \text{Leaf}_m} \mathbb{1}(y_i = k)$$
This is the proportion of the $N_m$ observations in leaf $m$ that belong to class $k$: count the number of times class $k$ occurs in the leaf and divide by the leaf size.
The misclassification error is then
$$1 - \hat{p}_{m\,k(m)}; \quad k(m) = \text{the most common class } k \text{ in leaf } m$$
So 0 is perfect purity and 0.5 is no purity (with two classes, anything beyond 0.5 would just mean the class labels should be swapped).
Another method is deviance/information gain
$$-\sum_{k=1}^{K} \hat{p}_{mk} \log_2(\hat{p}_{mk})$$
If you use $\log_e$ it’s called deviance; if you use $\log_2$ it’s called information gain.
It’s minus the sum, over classes, of the probability of being assigned to class $k$ in leaf $m$ times the log of that same probability.
0 is perfect purity, 1 means no purity.
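A quick sanity check of both measures for a single hypothetical leaf holding 10 observations of one class and 5 of another:
p <- c(10, 5) / 15                # class proportions within the leaf
1 - max(p)                        # misclassification error: 1/3
-sum(p * log2(p))                 # deviance/information gain (base 2): about 0.92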
Example
library(modelr)   # resample_partition()
iris_smpl <-
iris %>%
resample_partition(c(train = .8, test = .2))
iris_smpl$train %>%
as_tibble() %>%
ggplot() +
geom_point(aes(Petal.Width, Sepal.Width, colour = Species)) +
labs(
title = 'Iris Petal vs Sepal Width by Species',
x = 'Petal Width',
y = 'Sepal Width'
)
library(caret)
library(rattle)   # fancyRpartPlot()
iris_smpl$train %>%
as_tibble() %>%
train(Species ~ ., method = 'rpart', data = .) ->
iris_tree
iris_tree$finalModel
## n= 119
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 119 75 setosa (0.3697479 0.3025210 0.3277311)
## 2) Petal.Length< 2.45 44 0 setosa (1.0000000 0.0000000 0.0000000) *
## 3) Petal.Length>=2.45 75 36 virginica (0.0000000 0.4800000 0.5200000)
## 6) Petal.Length< 4.95 41 5 versicolor (0.0000000 0.8780488 0.1219512) *
## 7) Petal.Length>=4.95 34 0 virginica (0.0000000 0.0000000 1.0000000) *
Let’s look at how good it is at predicting:
iris_smpl$test %>%
as_tibble() %>%
mutate(
pred = predict(iris_tree, newdata = .),
res = Species == pred
) %>%
summarise(Categorisation_Accuracy = mean(res))
## # A tibble: 1 x 1
## Categorisation_Accuracy
## <dbl>
## 1 0.903
Notes
- Classification trees are non-linear models
- They use interactions between variables
- If you have a large number of classes, the models can overfit.
- Data transformations may be less important
- Any monotone transformation of a predictor produces the same splits.
- Trees can also be used for regression problems
- RMSE is used as the purity measure
Bagging
Short for bootstrap aggregating. If you fit complicated models, averaging those models together can give you a smoother model fit.
- Resample with replacement cases and recalculate predictions
- Average or majority vote
You get similar bias but a reduced variance. It’s most useful for non-linear functions.
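A hand-rolled sketch of the idea (the loess smoother and its span are arbitrary choices; the point is resampling cases and averaging the fits):
grid <- data.frame(Petal.Width = seq(0.5, 2.0, by = 0.1))
preds <- sapply(1:25, function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]               # resample cases with replacement
  fit  <- loess(Sepal.Width ~ Petal.Width, data = boot, span = 0.9)
  predict(fit, newdata = grid)                                     # recalculate predictions
})
bagged <- rowMeans(preds, na.rm = TRUE)                            # average the resampled fits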
Bagging in Caret
In caret you can use the bagEarth, treebag, or bagFDA method options in the train() function.
You can also bag any model using the bag() function.
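A minimal sketch of the bag() interface (ctreeBag is one of the fit/predict/aggregate helper sets shipped with caret; the iris columns are used purely for illustration):
predictors <- iris[, c('Petal.Length', 'Petal.Width')]
bagged_mdl <- bag(
  predictors, iris$Sepal.Length, B = 10,
  bagControl = bagControl(
    fit       = ctreeBag$fit,        # fit one conditional-inference tree per resample
    predict   = ctreeBag$pred,       # predict new data with each fitted tree
    aggregate = ctreeBag$aggregate   # average the predictions across trees
  )
)
head(predict(bagged_mdl, predictors))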
Notes
- Most useful for non-linear models
- Often used with trees
- Several models use bagging in caret’s train() function.
Random Forests
Random forests can be thought of as an extension of bagging for classification trees.
- Bootstrap samples
- At each split, bootstrap variables
- Thus only a subset of variables is considered at each split.
- Grow multiple trees and vote or average the trees
Pros:
- Accuracy
Cons:
- Speed
- Interpretability
- Overfitting
Once the trees have been built, you predict an outcome through each tree. At the end you take an average of all of the predictions.
Example
library(randomForest)
set.seed(10)
iris_smpl <- iris %>% resample_partition(c(train = .8, test = .2))
iris_rtree <-
iris_smpl$train %>%
as_tibble() %>%
train(Species ~ ., data = ., method = 'rf', prox = TRUE)
iris_rtree
## Random Forest
##
## 119 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 119, 119, 119, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9391505 0.9062563
## 3 0.9454635 0.9159994
## 4 0.9446999 0.9148553
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
You can use the getTree() function to look at an individual tree from the forest. Each row is a split: it shows the left and right daughter nodes, the splitting variable, the point at which that variable is split, and the prediction made out of the split.
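The call itself isn’t shown above; a sketch of it would be (the tree index k = 2 is an arbitrary choice):
getTree(iris_rtree$finalModel, k = 2)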
## left daughter right daughter split var split point status prediction
## 1 2 3 4 0.70 1 0
## 2 0 0 0 0.00 -1 1
## 3 4 5 4 1.65 1 0
## 4 6 7 3 4.95 1 0
## 5 8 9 1 5.95 1 0
## 6 0 0 0 0.00 -1 2
## 7 10 11 3 5.05 1 0
## 8 12 13 2 3.10 1 0
## 9 0 0 0 0.00 -1 3
## 10 0 0 0 0.00 -1 3
## 11 14 15 4 1.55 1 0
## 12 0 0 0 0.00 -1 3
## 13 0 0 0 0.00 -1 2
## 14 0 0 0 0.00 -1 3
## 15 0 0 0 0.00 -1 2
Use classCenter() to find the center of each class.
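A minimal sketch for the model above, assuming the proximity matrix was stored because prox = TRUE was passed to train() (the two plotted features are an arbitrary choice):
iris_train <- as_tibble(iris_smpl$train)
classCenter(
  iris_train[, c('Petal.Width', 'Sepal.Width')],   # features in which to locate the centres
  iris_train$Species,                              # class labels
  iris_rtree$finalModel$proximity                  # proximity matrix from the forest
)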
To predict values:
iris_smpl$test %>%
as_tibble() %>%
pull(Species) %>%
confusionMatrix(predict(iris_rtree, iris_smpl$test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 17 0 0
## versicolor 0 9 0
## virginica 0 0 5
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8878, 1)
## No Information Rate : 0.5484
## P-Value [Acc > NIR] : 8.16e-09
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.5484 0.2903 0.1613
## Detection Rate 0.5484 0.2903 0.1613
## Detection Prevalence 0.5484 0.2903 0.1613
## Balanced Accuracy 1.0000 1.0000 1.0000
Notes
- Random forests are usually one of the two top performing algorithms (along with boosting) in prediction contests.
- Random forests are difficult to interpret but often very accurate.
- Care should be taken to avoid overfitting
- See the rfcv() function.
Boosting
Basic idea:
- Take lots of possibly weak predictors
- Weight them and add them up
- Get a stronger predictor
Diving in a little deeper:
- Start with a set of classifiers $h_1, \ldots, h_k$.
- Examples: all possible trees, all possible regression models, all possible cutoffs.
- Create a classifier that combines classification functions
- $f(x) = \mathrm{sgn}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$, where:
  - $\alpha_t$ is the weight
  - $h_t(x)$ is the classifier
  - $\mathrm{sgn}()$ is the signum function
- Goal is to minimise error on training set.
- Iterative, select one h at each step.
- Calculate weights based on errors
- Upweight missed classifications and select next h
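A from-scratch sketch of this recipe, using AdaBoost-style weights with depth-one rpart stumps as the weak classifiers (the two-class subset of iris and the number of rounds are arbitrary choices):
library(rpart)
dat <- droplevels(subset(iris, Species != 'setosa'))
y   <- ifelse(dat$Species == 'versicolor', 1, -1)
w   <- rep(1 / nrow(dat), nrow(dat))                 # start with uniform case weights
alphas <- numeric(0); stumps <- list()
for (t in 1:10) {
  fit <- rpart(factor(y) ~ Petal.Length + Petal.Width, data = dat, weights = w,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  h   <- as.numeric(as.character(predict(fit, dat, type = 'class')))
  err <- max(sum(w * (h != y)) / sum(w), 1e-10)      # weighted training error
  alphas[t] <- 0.5 * log((1 - err) / err)            # weight for this classifier
  stumps[[t]] <- fit
  w <- w * exp(-alphas[t] * y * h)                   # upweight the missed classifications
  w <- w / sum(w)
}
# combined classifier: sign of the weighted sum of the weak classifiers
score <- Reduce(`+`, lapply(seq_along(stumps), function(t)
  alphas[t] * as.numeric(as.character(predict(stumps[[t]], dat, type = 'class')))))
table(predicted = sign(score), actual = y)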
Notes
- Can be done with any subset of classifiers
- One large class is gradient boosting
- Boosting libraries:
  - gbm does boosting with trees
  - mboost does model based boosting
  - ada does statistical boosting based on additive logistic regression
  - gamBoost boosts generalised additive models
Example
library(ISLR)   # Wage data (assuming the ISLR package, as in the course)
set.seed(1)
wage_smpl <- Wage %>%
select(-logwage) %>%
resample_partition(c(train = .8, test = .2))
wage_mod <-
wage_smpl$train %>%
as_tibble() %>%
train(wage ~ ., method = 'gbm', data = ., verbose = FALSE)
wage_mod
## Stochastic Gradient Boosting
##
## 2399 samples
## 9 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2399, 2399, 2399, 2399, 2399, 2399, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 35.29895 0.3355328 23.98975
## 1 100 34.63944 0.3470334 23.55519
## 1 150 34.56330 0.3484458 23.53197
## 2 50 34.59568 0.3499714 23.48109
## 2 100 34.51537 0.3501258 23.49626
## 2 150 34.64395 0.3459081 23.62387
## 3 50 34.47805 0.3525904 23.42600
## 3 100 34.57368 0.3484831 23.56461
## 3 150 34.74272 0.3433984 23.73508
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth
## = 3, shrinkage = 0.1 and n.minobsinnode = 10.
wage_smpl$test %>%
as_tibble() %>%
mutate(pred = predict(wage_mod, newdata = .)) %>%
ggplot() +
geom_point(aes(pred, wage)) +
labs(
title = 'Boosted Wage Prediction',
x = 'Predicted Wage',
y = 'Real Wage'
)
Model Based Prediction
- Assume the data follow a probabilistic model
- Use Bayes’ theorem to identify optimal classifiers
Pros:
- Can take advantage of structure of the data.
- May be computationally convenient.
- Are reasonably accurate for real problems.
Cons:
- Make additional assumptions about the data.
- When the model is incorrect you may get reduced accuracy.
Approach
Our goal is to build a parametric model for conditional distribution P(Y=k|X=x).
A typical approach is to apply Bayes’ theorem:
$$Pr(Y = k \mid X = x) = \frac{Pr(X = x \mid Y = k)\,Pr(Y = k)}{\sum_{\ell=1}^{K} Pr(X = x \mid Y = \ell)\,Pr(Y = \ell)} = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}$$
In the last expression, $f_k(x)$ is a parametric model assumed for the distribution of the features given the class, and $\pi_k = Pr(Y = k)$ is the ‘prior’ probability that an element comes from class $k$.
Thus we can model $Pr(Y = k \mid X = x)$ through a model for the $x$ variables and a model for the prior probability.
Priors are generally set in advance, and a common choice for $f_k(x)$ is the Gaussian distribution (a multivariate Gaussian if there are multiple variables). The parameters $(\mu_k, \sigma_k^2)$ are estimated from the data.
Classify to the class with the highest value of P(Y=k|X=x).
Uses
- Linear discriminant analysis assumes fk(x) is a multivariate Gaussian with the same covariances.
- Quadratic discriminant analysis assumes fk(x) is multivariate Gaussian with different covariances.
- Model based prediction assumes more complicated versions of the covariance matrix.
- Naive Bayes assumes independence between features.
Discriminant Function
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k)$$
Where
- $\mu_k$ is the mean of class $k$ for all our features.
- $\Sigma^{-1}$ is the inverse of the pooled covariance matrix (assumed common to all classes in LDA).
- $x^T$ is the transpose of our predictor vector.
- $\pi_k$ is the prior probability of class $k$.
We plug our new data value into this function and pick the value of $k$ that produces the largest value: $\hat{Y}(x) = \arg\max_k \delta_k(x)$.
Parameters are estimated with maximum likelihood.
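A hand-computed sketch of these discriminant scores for iris (equal priors and a single overall covariance matrix are simplifications used here for illustration; MASS::lda pools the within-class covariances properly):
X       <- as.matrix(iris[, 1:4])
classes <- levels(iris$Species)
mu      <- sapply(classes, function(k) colMeans(X[iris$Species == k, ]))   # class means
Sigma_inv <- solve(cov(X))            # crude common covariance estimate
pi_k    <- rep(1 / 3, 3)              # equal priors, assumed for simplicity
delta <- sapply(1:3, function(k)
  drop(X %*% Sigma_inv %*% mu[, k]) -
    drop(0.5 * t(mu[, k]) %*% Sigma_inv %*% mu[, k]) + log(pi_k[k]))
# classify each observation to the class with the largest discriminant value
table(predicted = classes[max.col(delta)], actual = iris$Species)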
Naive Bayes
Suppose we have many predictors; we want to model $P(Y = k \mid X_1, \ldots, X_m)$.
We can use Bayes’ theorem to get:
$$P(Y = k \mid X_1, \ldots, X_m) = \frac{\pi_k P(X_1, \ldots, X_m \mid Y = k)}{\sum_{\ell=1}^{K} \pi_\ell P(X_1, \ldots, X_m \mid Y = \ell)} \propto \pi_k P(X_1, \ldots, X_m \mid Y = k)$$
You can expand out each X probability in turn, so you end up with:
$$\pi_k P(X_1 \mid Y = k)\, P(X_2, \ldots, X_m \mid X_1, Y = k)$$
$$= \pi_k P(X_1 \mid Y = k)\, P(X_2 \mid X_1, Y = k) \ldots P(X_m \mid X_1, \ldots, X_{m-1}, Y = k)$$
So you can see that the variables are dependent on each other. To make this simple you can assume that the predictors are independent, in which case the conditioning on the other predictors drops out:
$$\pi_k P(X_1 \mid Y = k)\, P(X_2 \mid Y = k) \ldots P(X_m \mid Y = k)$$
This is a naive assumption, which is why it’s called Naive Bayes.
Examples
set.seed(10)
iris_smpl <-
iris %>%
resample_partition(c(train = .8, test = .2))
iris_smpl$train %>%
as_tibble() %>%
train(Species ~ ., data = ., method = 'lda') %>%
predict(newdata = as_tibble(iris_smpl$test)) %>%
confusionMatrix(as_tibble(iris_smpl$test)$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 17 0 0
## versicolor 0 9 0
## virginica 0 0 5
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8878, 1)
## No Information Rate : 0.5484
## P-Value [Acc > NIR] : 8.16e-09
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.5484 0.2903 0.1613
## Detection Rate 0.5484 0.2903 0.1613
## Detection Prevalence 0.5484 0.2903 0.1613
## Balanced Accuracy 1.0000 1.0000 1.0000
iris_smpl$train %>%
as_tibble() %>%
train(Species ~ ., data = ., method = 'nb') %>%
predict(newdata = as_tibble(iris_smpl$test)) %>%
confusionMatrix(as_tibble(iris_smpl$test)$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 17 0 0
## versicolor 0 8 0
## virginica 0 1 5
##
## Overall Statistics
##
## Accuracy : 0.9677
## 95% CI : (0.833, 0.9992)
## No Information Rate : 0.5484
## P-Value [Acc > NIR] : 2.165e-07
##
## Kappa : 0.9456
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8889 1.0000
## Specificity 1.0000 1.0000 0.9615
## Pos Pred Value 1.0000 1.0000 0.8333
## Neg Pred Value 1.0000 0.9565 1.0000
## Prevalence 0.5484 0.2903 0.1613
## Detection Rate 0.5484 0.2581 0.1613
## Detection Prevalence 0.5484 0.2581 0.1935
## Balanced Accuracy 1.0000 0.9444 0.9808
Quiz
Question 1
For this quiz we will be using several R packages. R package versions change over time; the right answers have been checked using the following versions of the packages.
AppliedPredictiveModeling: v1.1.6
caret: v6.0.47
ElemStatLearn: v2012.04-0
pgmm: v1.1
rpart: v4.1.8
If you aren’t using these versions of the packages, your answers may not exactly match the right answer, but hopefully should be close.
Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:
- Subset the data to a training set and testing set based on the Case variable in the data set.
- Set the seed to 125 and fit a CART model with the rpart method using all predictor variables and default caret settings.
- In the final model what would be the final model prediction for cases with the following variable values:
- TotalIntench2 = 23,000; FiberWidthCh1 = 10; PerimStatusCh1=2
- TotalIntench2 = 50,000; FiberWidthCh1 = 10;VarIntenCh4 = 100
- TotalIntench2 = 57,000; FiberWidthCh1 = 8;VarIntenCh4 = 100
- FiberWidthCh1 = 8;VarIntenCh4 = 100; PerimStatusCh1=2
library(AppliedPredictiveModeling)   # cell segmentation data
data(segmentationOriginal)
set.seed(125)
segmentationOriginal %>%
dplyr::filter(Case == 'Train') %>%
train(Class ~ ., method = 'rpart', data = .) ->
seg_rpart
library(rattle)
fancyRpartPlot(seg_rpart$finalModel)
- PS, Not possible to predict, PS, Not possible to predict
- WS, WS, PS, Not possible to predict
- a. PS, b. WS, c. PS, d. Not possible to predict
- PS, WS, PS, WS
Question 2
If K is small in a K-fold cross validation is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger?
If K is small is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger?
Is K large or small in leave one out cross validation?
- The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to the sample size.
- The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to one.
- The bias is smaller and the variance is bigger. Under leave one out cross validation K is equal to one.
- The bias is smaller and the variance is smaller. Under leave one out cross validation K is equal to the sample size.
Answer: in k-fold CV, the data set is split into K ‘folds’, and LOOCV is a special case where k = n.
With a smaller K, each model is trained on fewer observations, so the estimate of the out-of-sample error has higher bias; in exchange, the variance of the estimate is smaller.
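As an illustration of the trade-off, caret’s createFolds() shows how much data each model is trained on for different K (iris is used here purely as an example):
library(caret)
folds_2  <- createFolds(iris$Species, k = 2,  returnTrain = TRUE)
folds_10 <- createFolds(iris$Species, k = 10, returnTrain = TRUE)
sapply(folds_2, length)    # each model sees only ~75 of the 150 rows (more bias)
sapply(folds_10, length)   # each model sees ~135 rows (less bias, more variance)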
Question 3
Load the olive oil data using the commands:
(NOTE: If you have trouble installing the pgmm package, you can download the olive dataset here: olive_data.zip. After unzipping the archive, you can load the file using the load() function in R.)
These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of area for the following data frame using the tree command with all defaults:
What is the resulting prediction? Is the resulting prediction strange? Why or why not?
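The quiz commands aren’t reproduced in these notes; a sketch consistent with the question (caret’s rpart method for the tree and a newdata frame of column means are assumptions here):
library(pgmm)
data(olive)
olive <- olive[, -1]                                   # drop the Region column
olive_tree <- train(Area ~ ., method = 'rpart', data = olive)
predict(olive_tree, newdata = as.data.frame(t(colMeans(olive))))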
## 1
## 2.783282
- 2.783. It is strange because Area should be a qualitative variable - but tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata
- 4.59965. There is no reason why the result is strange.
- 0.005291005 0 0.994709 0 0 0 0 0 0. The result is strange because Area is a numeric variable and we should get the average within each leaf.
- 0.005291005 0 0.994709 0 0 0 0 0 0. There is no reason why the result is strange.
Question 4
Load the South Africa Heart Disease Data and create training and test sets with the following code:
library(ElemStatLearn)
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
trainSA = SAheart[train,]
testSA = SAheart[-train,]
Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tobacco, type-A behavior, and low density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the “response” scale:
What is the misclassification rate on the training set? What is the misclassification rate on the test set?
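The missClass() helper isn’t reproduced in these notes; a version consistent with the description above (a 0.5 cut-off on the response scale) would be:
missClass <- function(values, prediction) {
  # proportion of observations where the thresholded prediction disagrees with the truth
  sum(((prediction > 0.5) * 1) != values) / length(values)
}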
trainSA %>%
glm(
chd ~ age + alcohol + obesity + tobacco + typea + ldl,
data = .,
family = 'binomial'
) -> SA_mdl
missClass(trainSA$chd, predict(SA_mdl, newdata = trainSA, type = 'response'))
missClass(testSA$chd, predict(SA_mdl, newdata = testSA, type = 'response'))
## [1] 0.2813853
## [1] 0.3116883
- Test Set Misclassification: 0.32
Training Set: 0.30
- Test Set Misclassification: 0.31
Training Set: 0.27
- Test Set Misclassification: 0.38
Training Set: 0.25
- Test Set Misclassification: 0.35
Training Set: 0.31
Question 5
Load the vowel.train and vowel.test data sets:
Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables. Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr The caret package uses by default the Gini importance.
Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?
[NOTE: Use randomForest() specifically, not caret, as there have been some issues reported with that approach. 11/6/2016]
- The order of the variables is: x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7,x.10
- The order of the variables is: x.10, x.7, x.9, x.5, x.8, x.4, x.6, x.3, x.1,x.2
- The order of the variables is: x.1, x.2, x.3, x.8, x.6, x.4, x.5, x.9, x.7,x.10
- The order of the variables is: x.10, x.7, x.5, x.6, x.8, x.4, x.9, x.3, x.1,x.2
library(ElemStatLearn)          # provides vowel.train and vowel.test
library(randomForest)
data(vowel.train, vowel.test)
set.seed(33833)
vowel.test %>%
mutate(y = factor(y)) %>%
randomForest(y ~ ., data = .) %>%
varImp() %>%
rownames_to_column() %>%
arrange(desc(Overall))
## rowname Overall
## 1 x.1 91.02628
## 2 x.2 79.04752
## 3 x.4 52.13079
## 4 x.5 34.87761
## 5 x.6 30.33223
## 6 x.7 30.16330
## 7 x.3 29.09812
## 8 x.9 27.55066
## 9 x.8 23.12069
## 10 x.10 21.72502