Introduction
This course focuses on developing the tools and techniques for understanding, building and testing prediction functions.
What is Prediction?
The central dogma of prediction is:
- Take a large group of the things you want to predict about.
- Pick a training set.
- Measure characteristics of the training set and use them to build a prediction function.
- Evaluate whether the prediction function works well or not.
Components of a predictor:
- Define the question: what are you trying to predict?
- Collect the best input data that you can.
- Determine the features you think are useful in predicting an outcome.
- Choose an algorithm and estimate the parameters it will use.
- Evaluate the prediction function.
Relative Order of Importance
The question is the most important part of the machine learning process. Next comes collecting the data, which may not be readily available. After this come the features, and finally the algorithm, which is often the least important part.
Input Data
Quote from John Tukey:
“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data”
Garbage in = garbage out. Often more data is more important than better models.
Features
Properties of good features:
- Lead to data compression
- Retain relevant information
- Are created based on expert application knowledge
Common mistakes:
- Trying to automate feature selection
- Black-box feature selection can be useful, but it can also change on a dime if we’re not paying attention to how the features actually predict the outcome.
- Not paying attention to data-specific quirks.
- Throwing away information unnecessarily.
Issues to Consider
- Is the model interpretable?
- Simplicity - easy to explain
- Accurate - interpretability and simplicity may be traded off against accuracy.
- Fast - easy to build and test
- Scalable - scales out to larger data sets.
Prediction is about accuracy tradeoffs.
In Sample vs Out Of Sample
One of the most fundamental concepts.
In sample error is the error you get on the sample you used to build your predictor. It is also known as resubstitution error.
Out of sample error is the error rate you get on a new data set. It is also known as generalisation error.
Out of sample error is what you care about. Optimising for in sample error leads to overfitting.
Data has two parts:
- Signal
- Noise
The goal of the predictor is to find the signal. You can always design a perfect in-sample predictor, but you’re capturing both the signal and the noise when you do that. It won’t perform as well on new samples.
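A minimal illustration of this in R, using simulated data (the sine signal, noise level, and degree-20 polynomial are arbitrary choices for the sketch):

```r
set.seed(1)
# simulate a noisy signal: y = sin(x) + noise (purely illustrative)
train <- data.frame(x = runif(30, 0, 2 * pi))
train$y <- sin(train$x) + rnorm(30, sd = 0.3)
test <- data.frame(x = runif(30, 0, 2 * pi))
test$y <- sin(test$x) + rnorm(30, sd = 0.3)

# a very flexible model chases the noise in the training sample
fit <- lm(y ~ poly(x, 20), data = train)

rmse <- function(pred, truth) sqrt(mean((pred - truth)^2))
rmse(predict(fit), train$y)                  # in sample (resubstitution) error: small
rmse(predict(fit, newdata = test), test$y)   # out of sample error: typically much larger
```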
Prediction Study Design
- Define your error rate
- Split data into test, train and optionally validation
- On the training set pick features
- Can use cross-validation
- On the training set pick a prediction function.
- Use cross-validation
- If there’s no validation set, apply the model once to the test set.
- If there is a validation set
- Apply to the test set and refine
- Apply the final model once to the validation set.
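One way these steps fit together in R — a sketch only, assuming the caret package and a hypothetical data frame `df` with a factor outcome `y`, and no separate validation set:

```r
library(caret)
set.seed(1234)

# split once into training and test sets
inTrain  <- createDataPartition(df$y, p = 0.6, list = FALSE)
training <- df[inTrain, ]
testing  <- df[-inTrain, ]

# pick features and the prediction function on the training set only,
# using 10-fold cross-validation to compare candidates
fit <- train(y ~ ., data = training, method = "glm",
             trControl = trainControl(method = "cv", number = 10))

# apply exactly once to the test set to estimate out of sample error
confusionMatrix(predict(fit, newdata = testing), testing$y)
```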
Sample Sizes
Avoid small sample sizes. Imagine a binary outcome, which is like flipping a coin. The probability of perfect classification by chance is approximately \(\left(\frac{1}{2}\right)^{\text{sample size}}\).
So when \(n = 1\), you’ve got a 50% chance of 100% accuracy. When \(n = 10\), you’ve got a 0.10% chance of 100% accuracy.
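Checking those numbers in R:

```r
p_perfect <- function(n) 0.5 ^ n   # chance of perfect classification by guessing
p_perfect(1)    # 0.5       -> 50% chance of 100% accuracy
p_perfect(10)   # ~0.00098  -> about a 0.10% chance of 100% accuracy
```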
Good rules of thumb, for a large data set:
- 60% training
- 20% test
- 20% validation
For a medium training set:
- 60% training
- 40% test
If you have a small sample size:
- Perform cross-validation
- Report the caveat of a small sample size.
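A base-R sketch of the 60/20/20 rule of thumb above, for a hypothetical data frame `df`:

```r
set.seed(42)
n   <- nrow(df)          # df is a hypothetical data frame
idx <- sample(n)         # shuffle row indices (sampling without replacement)
training   <- df[idx[1:floor(0.6 * n)], ]
testing    <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
validation <- df[idx[(floor(0.8 * n) + 1):n], ]
```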
Principles
- The test/validation set is set aside and not looked at.
- In general, randomly sample training set.
- If the predictions evolve with time, split the train/test sets into chunks of time.
- This is called backtesting in finance.
- All subsets should reflect as much diversity as possible.
- Random assignment does this
- Can also balance by features, but this is difficult.
Types of Errors
Thinking about a binary prediction problem, positive = identified and negative = rejected.
- True positive (TP) = correctly identified
- Sick people diagnosed as sick
- False positive (FP) = incorrectly identified
- Healthy people diagnosed as sick
- True negative (TN) = correctly rejected
- Healthy people diagnosed as healthy
- False Negative (FN) = incorrectly rejected
- Sick people identified as healthy
- Sensitivity = Pr( positive test | disease )
- TP / (TP + FN)
- Specificity = Pr( negative test | no disease )
- TN / (FP + TN)
- Positive Predictive Value = Pr( disease | positive test )
- TP / (TP + FP)
- Negative Predictive Value = Pr( no disease | negative test )
- TN / (FN + TN)
- Accuracy = Pr( correct outcome )
- (TP + TN) / (TP + TN + FP + FN)
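These definitions translate directly into R; the counts below are made up for illustration:

```r
# hypothetical counts from a 2x2 confusion matrix
tp <- 90; fp <- 10; tn <- 80; fn <- 20

sensitivity <- tp / (tp + fn)               # Pr( positive test | disease )
specificity <- tn / (fp + tn)               # Pr( negative test | no disease )
ppv         <- tp / (tp + fp)               # Pr( disease | positive test )
npv         <- tn / (fn + tn)               # Pr( no disease | negative test )
accuracy    <- (tp + tn) / (tp + tn + fp + fn)
```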
Continuous Data
The goal is to see how close you are to the truth. One common measure is mean squared error (MSE):
\[ \frac{1}{n} \sum_{i=1}^n (\text{Prediction}_i - \text{Truth}_i)^2\]
Root mean squared error (RMSE)
\[ \sqrt{ \frac{1}{n} \sum_{i=1}^n (\text{Prediction}_i - \text{Truth}_i)^2 }\]
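In R, with hypothetical vectors of predictions and true values:

```r
# hypothetical predictions and true values
pred  <- c(2.3, 4.1, 3.8, 5.0)
truth <- c(2.0, 4.5, 4.0, 4.8)

mse  <- mean((pred - truth)^2)
rmse <- sqrt(mse)
```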
Common Error Measures
- MSE or RMSE
- Continuous, sensitive to outliers
- Median absolute deviation
- Continuous, often more robust.
- \(\tilde{X} = median(X)\)
- \(MAD = median(|X_i - \tilde{X}|)\) (see the sketch after this list)
- Sensitivity
- If you want few missed positives
- Specificity
- If you want few negatives called positives
- Accuracy
- Weights false positives/negatives equally
- Concordance
- Used in multi-class data
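A quick check of the median absolute deviation formula above, on made-up data. Note that base R’s `mad()` multiplies by a constant of 1.4826 by default, so set `constant = 1` to match the definition as written:

```r
x <- c(1, 2, 3, 4, 100)                    # toy data with an outlier
mad_manual <- median(abs(x - median(x)))   # matches the formula above: 1
mad(x, constant = 1)                       # base R equivalent (default constant is 1.4826)
```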
Receiver Operating Characteristic (ROC Curves)
In binary classification you’re predicting one of two categories, but most prediction algorithms assign a probability (between 0 and 1) or a score on some scale (say 1 to 10). The cutoff you choose gives different results.
The ROC curve plots one minus the specificity (the false positive rate) on the x-axis and the sensitivity (the true positive rate) on the y-axis.
Then every point along the curve shows the results for a different cutoff.
The “best” curve is usually determined by the area under the ROC curve. If the area is 0.5 (the axes run from 0 to 1), then the classifier is as good as random guessing. An area of 1 is a perfect classifier.
In general an area of 0.8 is considered good. A curve sitting on the 45-degree line means the true positive rate equals the false positive rate, which is no better than guessing. The further the curve sits towards the upper left of the plot the better, as you’re getting more true positives relative to false positives as the cutoff varies.
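A base-R sketch that traces out a ROC curve by sweeping the cutoff over hypothetical scores, and approximates the area under it:

```r
set.seed(7)
# hypothetical scores and true labels (1 = positive, 0 = negative)
truth <- rbinom(200, 1, 0.5)
score <- rnorm(200, mean = truth)            # scores tend to be higher for positives

# sweep the cutoff from high to low and record the error rates
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))
sens <- sapply(cutoffs, function(cut) mean(score[truth == 1] >= cut))  # true positive rate
fpr  <- sapply(cutoffs, function(cut) mean(score[truth == 0] >= cut))  # 1 - specificity

plot(fpr, sens, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)                        # the 45-degree "random guessing" line

# area under the curve via the trapezoid rule
sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)
```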
Cross Validation
Take the training set and split it into sub-training and sub-test sets. Build the model on the sub-training set and evaluate it on the sub-test set. The key is then to repeat this process and average the estimated errors.
Cross-validation can be used for picking the variables in the model, picking the type of prediction function, picking the parameters, or comparing different predictors.
Random Subsampling
Take a number of different random samples from the training set, splitting it into training and test sets.
K-Fold
Break the data set up into \(K\) equal-sized subsets. For each fold, build the model on the other subsets and test on the subset held out for that fold.
Leave One Out Cross Validation (LOOCV)
Leave out one sample, train the model on the remaining \(n-1\) samples. Test on the one sample that was left out. This is the same as K-Fold but where \(k = n\).
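A minimal K-fold sketch in base R, assuming a hypothetical data frame `df` with a continuous outcome `y` and a linear model as the prediction function:

```r
set.seed(123)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(df)))   # assign each row to one of k folds

cv_rmse <- sapply(1:k, function(i) {
  sub_train <- df[folds != i, ]    # build the model on k-1 folds
  sub_test  <- df[folds == i, ]    # evaluate on the held-out fold
  fit  <- lm(y ~ ., data = sub_train)
  pred <- predict(fit, newdata = sub_test)
  sqrt(mean((pred - sub_test$y)^2))
})
mean(cv_rmse)          # average the estimated errors across folds
# setting k <- nrow(df) gives leave-one-out cross-validation
```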
Considerations
For time-series data this doesn’t work directly; you have to use contiguous blocks of time. This is because one time point may depend on a number of time points that came before it.
For K-fold:
- Larger \(k\) = less bias, more variance.
- Smaller \(k\) = more bias, less variance.
The random sampling must be done without replacement. Random sampling with replacement is called the bootstrap.
- It underestimates the error. This is because some samples appear more than once, and if you get one of them right you’ve got its duplicates right as well.
- This can be corrected, but it’s complicated (the “.632 bootstrap”).
If you cross-validate to pick predictors you must estimate the errors on independent data. The cross-validated error rates won’t be a good estimate of the out-of-sample error rate; the only way to get that is to apply the model once to your test set.
What Data Should You Use?
Polling data is an example of using like data to predict like outcomes: “To predict X, use data related to X”.
The looser the connection, the harder the prediction.
Data properties matter. Knowing how the data connects back to what you’re trying to predict is vitally important. Unrelated data is the most common mistake. It’s the old “correlation versus causation”.
Quiz
Question 1
Which of the following are components in building a machine learning algorithm?
- Statistical inference
- Machine learning
- Collecting data to answer the question.
- Artificial intelligence
- Training and test sets
Answer: training and test sets are components in the process of building an ML algorithm.
Question 2
Suppose we build a prediction algorithm on a data set and it is 100% accurate on that data set. Why might the algorithm not work well if we collect a new data set?
- We have used neural networks which has notoriously bad performance.
- We may be using bad variables that don’t explain the outcome.
- We are not asking a relevant question that can be answered with machine learning.
- Our algorithm may be overfitting the training data, predicting both the signal and the noise.
Answer: the algorithm will be over-fitting our data on the training set.
Question 3
What are typical sizes for the training and test sets?
- 10% test set, 90% training set
- 60% in the training set, 40% in the testing set.
- 50% training set, 50% test set
- 90% training set, 10% test set
Answer: 60% in the training, 40% in the test.
Question 4
What are some common error rates for predicting binary variables (i.e. variables with two possible values like yes/no, disease/normal, clicked/didn’t click)? Check the correct answer(s).
- Accuracy
- Median absolute deviation
- R^2
- Root mean squared error
- Correlation
Answer: Accuracy. The others relate to continuous variables.
Question 5
Suppose that we have created a machine learning algorithm that predicts whether a link will be clicked with 99% sensitivity and 99% specificity. The rate the link is clicked is 1/1000 of visits to a website. If we predict the link will be clicked on a specific visit, what is the probability it will actually be clicked?
- 89.9%
- 50%
- 9%
- 0.009%
Answer:
We’re looking for Pr( click | prediction ). This is the positive predictive value, given by \(TP / (TP + FP)\).
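The calculation behind the result below, in R:

```r
sensitivity <- 0.99
specificity <- 0.99
prevalence  <- 1 / 1000

# positive predictive value: TP / (TP + FP)
(sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
```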
The result is 0.09016393, i.e. about 9%.