Hierarchical Clustering

An agglomerative approach - find the closest two things, put them together, find the next closest.

We need to define distance, and a merging approach. We get a tree showing how close things are to each other.

Defining Closeness

Most important step - “garbage in -> garbage out”.

Distance or similarity
Continuous - euclidean distanc.e
Continuous - correlation similarity.
Binary - manhattan distance.

Need to pick a distance metrix that makes sense to your problem.

Distances

Euclidean is your standard $ $. This is the L2 norm.

The Manhattan distance comes from taxis in New York city traversing the grid. In general it is $ |A_1 - A_2| + |B_1 - B_2| + + |Z_1 - Z_2| $. It is the L1 norm.

Example

library(ggdendro)
set.seed(1)
tibble(
    x = rnorm(12, mean = rep(1:3, each = 4), sd = 0.2),
    y = rnorm(12, mean = rep(c(1,2,3), each = 4), sd = 0.2)
) -> cluster_data

cluster_data %>%
    ggplot(aes(x,y)) +
    geom_point()

# Distance matrix - defaults to Euclidean distance.
dist(cluster_data)

##             1          2          3          4          5          6
## 2  0.35751195                                                       
## 3  0.35173109 0.69834230                                            
## 4  0.45905269 0.51771058 0.53955180                                 
## 5  1.63572696 1.76972836 1.45464569 1.25272041                      
## 6  1.62723974 1.81690764 1.39102380 1.30451088 0.29960614           
## 7  1.77634318 1.92568087 1.57526126 1.40798566 0.17043440 0.26272648
## 8  1.77919526 1.91654625 1.58982463 1.39944664 0.14687995 0.31951778
## 9  3.21662371 3.34956720 3.00761449 2.83447868 1.58429292 1.62066703
## 10 3.07611050 3.22103256 2.85760442 2.70425296 1.45154707 1.46731394
## 11 3.23565762 3.34276974 3.04994740 2.83366357 1.60170276 1.68314947
## 12 2.79905917 2.88945851 2.63381337 2.38526081 1.17930058 1.30904061
##             7          8          9         10         11
## 2                                                        
## 3                                                        
## 4                                                        
## 5                                                        
## 6                                                        
## 7                                                        
## 8  0.06771202                                            
## 9  1.44053389 1.43885262                                 
## 10 1.30093910 1.30491361 0.17834636                      
## 11 1.47490680 1.46163123 0.25212124 0.39001345           
## 12 1.07382039 1.04837571 0.58285336 0.57147153 0.46982962

# The dendrogram
dist(cluster_data) %>%
    hclust() %>%
    ggdendrogram(theme_dendro = F)

You can then cut at different points to get differnt number of clusters.

How do we merge points together? - Average linkage - average of the components - Complete linkage - furthest points away from each other.

The other type of visualisation to consider is a heatmap:

cluster_data %>%
    as.matrix() %>%
    heatmap()

It runs a hierarchical analysis on the rows and the columns of the data. Very useful for quickly visualising high dimensional table data.

Considerations

The picture may be unstable
- Change a few points.
- Have different missing values.
- Pick a different distance.
- Change the merging strategy.
- Change the scale of points for one variable.
However it is deterministic.
Choosing where to cut isn’t always obvious.
Primarily used for exploration.

K-Means Clustering

A partitioning approach:

Fix a number of clusters
Get centroids of each cluster.
Assign things to the closest centroid.
Recalculate centroids.

Required a defined distance metric, a number of clusters, and an initial guess as to the cluster centroids.

It produces the final estimate of cluster centroids and an assignment of each point to clusters.

Let’s use the same cluster data as before:

set.seed(1)
tibble(
    x = rnorm(12, mean = rep(1:3, each = 4), sd = 0.2),
    y = rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)
) -> kmeans_data

kmeans_data %>%
    ggplot(aes(x,y)) +
    geom_point()

# Calculate the kmeans
km <- kmeans(kmeans_data, centers = 3)

# We can see the group allocations
km$cluster

##  [1] 3 3 3 3 2 2 2 2 1 1 1 1

# We can also see the centroids
km$centers

##          x         y
## 1 3.108601 0.9893163
## 2 2.036740 2.1171384
## 3 1.015842 0.9122028

kmeans_data %>%
    ggplot(aes(x,y)) +
    geom_point(aes(colour = as.factor(km$cluster))) +
    geom_point(data = as_tibble(km$centers), shape = 'cross')

Heatmaps

Another way of viewing the K-means data is with heatmaps.

image(t(cluster_data)[, order(km$cluster)])

In this image, the rows have been reordered in the data frame so that the clusters have been put together.

Considerations

K-means requires a number of clusters.
- Pick by eye/intuition
- No easy rule
- Can use cross-validation
K-means is not deterministic.

Dimension Reduction

Consider having multivariate variables $ X_1 X_n $ so $ X_1 = (X_{11}, , X_{1m}) $

Find a new set of multivariate variables that are uncorrelated and explain as much variable as possible.
- This can be done using principal component analysis.
If you putall the variables together in one matrix, find the best matrix created with fewer variables (lower rank) that explains the original data.
- This can be done using singular value decomposition.

The first goal is statistical, the second goal is **data compression.

SVD

If $X$ is a matrix then the SVD is a matrix decomposition:

\[ X = UDV^T \] where the columns of $U$ are orthogonal (left singular vectors), and the columns of $V$ are orthogonal (right singular vectors) and $D$ is a diagonal matrix.

PCA

The principal components are qual to the right singular values if you first scale the variables.

SVD

We create some random data ($n = 10,\ p = 40$), then we shift some of the data to create a skew in some of the columns.

set.seed(2349234)
# Generate the matrix
rand_data <- matrix(rnorm(400), ncol = 40)
heatmap(t(rand_data))

# Create a shift in some of the rows
rand_shift <- apply(rand_data, 2, function(x) {
    if(rbinom(1,1,.5)) {
        x + rep(c(0,3), each = 5)
    } else {
        x
    }
})

heatmap(t(rand_shift))

We can no apply SVD to the data.

rand_svd <- svd(scale(rand_shift))
tibble(
    n = 1:length(rand_svd$u[,1]),
    u = rand_svd$u[,1]
) %>%
    ggplot(aes(n,u)) +
    geom_point()

We can see that the left singular vector picked up on the shift in an unsupervised manner.

The $D$ matrix can be used to calculate the variance explained:

tibble(
    x = 1:length(rand_svd$d),
    y = rand_svd$d^2 / sum(rand_svd$d^2) * 100
) %>%
    ggplot(aes(x,y)) +
    geom_point() +
    labs(x = 'SVD Raw Singular Values', y = 'Proportion of Variance Explained (%)')

SVD vs PCA

The SVD and the PCA are essentially the same thing. Below we calculate the PCA and then plot the first principal component on the X-axis, and the first right singular value on the Y-axis:

rand_pca <- prcomp(scale(rand_shift))
tibble(
    pca_first_comp = rand_pca$rotation[ , 1],
    svd_first_right_singular = rand_svd$v[ , 1]
) %>%
    ggplot(aes(pca_first_comp, svd_first_right_singular)) +
    geom_point() +
    geom_line()

We can see that they fall exactly on a line.

Missing Values

One of the issues with both svd() and prcomp() is that you can’t run it on data with missing values.

One option is to impute the missing values. For example you can use the impute library to impute the missing data points. The code below uses impute.knn() to take a missin row and imputes by the K-nearest neighbours to that row.

missing_matrix <- rand_shift
missing_matrix[sample(1:100, size = 10, replace = F)] <- NA
try( svd(scale(missing_matrix)) )

## Error in svd(scale(missing_matrix)) : infinite or missing values in 'x'

library(impute)
fixed_matrix <- impute.knn(missing_matrix, k = 5)$data

fixed_svd <- svd(scale(fixed_matrix))
summary(fixed_svd)

##   Length Class  Mode   
## d  10    -none- numeric
## u 100    -none- numeric
## v 400    -none- numeric

Summary

Scale matters
PCA / SVD may mix real patterns.

Working with Colour

The main problem that comes out is that the standard colours are not very good, and the colours don’t align with the data that they’re representing.

The grDevices package has two functions that can assist: - colorRamp() - takes a palette of colours and returns a function that takes values between 0 and 1, indicating the extremes of the colour palette. - Similar to the gray() function. - colorRampPalette() - takes a palette of colours and returns a function that takes integer arguments and returns a vector of colours interpolating the palette.

gray(c(0, 0.1, 0.2, 0.8, 1))

## [1] "#000000" "#1A1A1A" "#333333" "#CCCCCC" "#FFFFFF"

# Create our pallete of two colours
pal <- colorRamp(c('red', 'blue'))

# RGB values, all red.
pal(0)

##      [,1] [,2] [,3]
## [1,]  255    0    0

# RGB values, all green
pal(1)

##      [,1] [,2] [,3]
## [1,]    0    0  255

# Halfway between
pal(.5)

##       [,1] [,2]  [,3]
## [1,] 127.5    0 127.5

pal <- colorRampPalette(c('red', 'yellow'))

# Returns the two colours in the palette
pal(2)

## [1] "#FF0000" "#FFFF00"

# Returns 10 colours, interpolating between the two palette colours
pal(1)

## [1] "#FF0000"

THe RColorBrewer package that contains interesting palettes.

There are three types of palettes:

Sequential - ordered data.
Diverging - deviate from something, think deviations from the mean.
Qualitative - data that are not ordered.

The data from this package can be passed to the colorRamp*() functions.

library(RColorBrewer)

cols <- brewer.pal(3, 'BuGn')
cols

## [1] "#E5F5F9" "#99D8C9" "#2CA25F"

pal <- colorRampPalette(cols)

image(volcano, col = pal(20))

Another function that uses colour brewer is the smoothScatter() function.

set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
smoothScatter(x,y)

RGB Function

The RGB function can be used to produce any colour via RGB proportions. There’s also an alpha parameter to add transparency.

# RGB function returns a hexadecimal string.
rgb(.5, .5, 0, .5)

## [1] "#80800080"

# Normal plot
plot(x,y, pch = 16)

# Plot with different colour and transparency
plot(x,y, pch = 16, col = rgb(.5,.5,0,.1))

## Warning in plot.xy(xy, type, ...): semi-transparency is not supported on
## this device: reported only once per page

Summary

Use of colours can make it easier to interpret graphs / visualisations.
The RColorBrewer package provides additional color palettes for sequential, divergent and categorical data.
colorRamp() and colorRampPalette() can be used in conjunction with palettes to connect data to colors.
Tansparency can help clarify plots with many overlapping points.

Course 3 - Getting and Cleaning Data - Week 3 - Notes

Greg Foletta

2019-10-30

Hierarchical Clustering

Defining Closeness

Distances

Example

Considerations

K-Means Clustering

Heatmaps

Considerations

Dimension Reduction

SVD

SVD vs PCA

Missing Values

Summary

Working with Colour

RGB Function

Summary