---
title: "Plotting clustering trees"
author: "Luke Zappia"
date: "Last updated: 12 June 2020"
output:
    rmarkdown::html_vignette:
        fig_width: 7
        fig_height: 7
        toc: TRUE
        number_sections: TRUE
bibliography: "references.bib"
vignette: >
  %\VignetteIndexEntry{Plotting clustering trees}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# What is a clustering tree?
 
Clustering analysis is used in many contexts to group similar samples. One
problem when conducting this kind of analysis is how many clusters to use. This
is usually controlled by a parameter provided to the clustering algorithm, such
as $k$ for $k$-means clustering.

Statistics designed to help you make this choice typically either compare two
clusterings or score a single clustering. A clustering tree is different in that
it visualises the relationships between at a range of resolutions.

To build a clustering tree we need to look at how cells move as the clustering
resolution is increased. Each cluster forms a node in the tree and edges are
constructed by considering the cells in a cluster at a lower resolution
(say $k = 2$) that end up in a cluster at the next highest resolution
(say $k = 3$). By connecting clusters in this way we can see how clusters are
related to each other, which are clearly distinct and which are unstable. Extra
information about the cells in each node can also be overlaid in order to help
make the decision about which resolution to use. For more information about
clustering trees please refer to our associated publication [@Zappia2018-lz].

# A simple example

To demonstrate what a clustering tree looks like we will work through a short
example using the `nba_clusts` dataset.

## The data

The `nba_clusts` dataset consists of some basic statistics from 150 NBA players
in 2017, 50 from each of three positions (Center, Point Guard and Shooting
Guards). This dataset has then been clustered using $k$-means with
different values of $k$. For more information see `?nba_clusts`. Let's load the
data and take a look:

```{r load-nba}
library(clustree)
data("nba_clusts")

head(nba_clusts)
```

Here we have a `data.frame` with player positions and some statistics
(turnover percentage, rebound percentage, assist percentage and field goal 
percentage) as well as some additional columns. These columns contain the
cluster assignments from clustering this data using $k$-means with values of
$k$ from $k = 1$ to $k = 5$.

## Plotting a tree

This clustering information is all we need to build a clustering tree. Each
column must consist of numeric values indicating which cluster each sample
has been assigned to. To plot the tree we just pass this information to the
`clustree` function. We also need to specify a `prefix` string to indicate
which columns contain the clusterings.

```{r nba-plot}
clustree(nba_clusts, prefix = "K")
```

We can see that one cluster is very distinct and does not change with the value
of $k$. This is the Center players which are very different to the other
positions. On the other side of the tree we see a single cluster that splits
into the two clusters we would expect to see. After this the tree becomes
messier and there are node with multiple incoming edges. This is a good 
indication that we have over clustered the data.

## Controlling aesthetics

By default the size of each node is related to the number of samples in each
cluster and the colour indicates the clustering resolution. Edges are coloured
according to the number of samples they represent and the transparency shows
the incoming node proportion, the number of samples in the edge divided by the
number of samples in the node it points to. We can control these aesthetics
by setting them to specific values:

```{r nba-aes-static}
clustree(nba_clusts, prefix = "K", node_colour = "purple", node_size = 10,
         node_alpha = 0.8)
```

We can also link these aesthetics to other information we have about the 
samples. All the additional columns in the dataset are available to be added as 
attributes to the nodes in our tree. Because each node represents multiple
samples we need to supply an aggregation function to use as well specifying a 
column name. Let's try colouring the nodes according to the rebound percentage:

```{r nba-aes}
clustree(nba_clusts, prefix = "K", node_colour = "ReboundPct",
         node_colour_aggr = "mean")
```

We can clearly see that the distinct cluster containing the Center
players has a higher rebound percentage on average compared to the other
clusters.

### SC3 stability index

Apart from information in the dataset itself it can useful to display measures
of clustering quality as aesthetics. The stability index from the **{SC3}** 
package [@Kiselev2017-an] measures the stability of clusters across resolutions 
and is automatically calculated when a clustering tree is built. It can be 
accessed by setting an aesthetic to `"sc3_stability"` and because it is
calculated by cluster we don't need to provide an aggregation function. For 
example:

```{r nba-stability}
clustree(nba_clusts, prefix = "K", node_colour = "sc3_stability")
```

## Layout

By default the tree is drawn using the Reingold-Tilford tree layout algorithm
which tries to place nodes below their parents [@Reingold1981-iy]. Alternatively 
we could use the Sugiyama layout by specifying the `layout` argument. This 
algorithm tries to minimise the number of crossing edges [@Sugiyama1981-wu] and 
can produce more attractive trees in some cases.

```{r nba-layout}
clustree(nba_clusts, prefix = "K", layout = "sugiyama")
```

For both of these layout algorithms `clustree()` uses slightly modified versions 
of the tree by default. Only the core network of edges, those that are the 
highest in-proportion edge for a node, are used when creating the layout. In most 
cases  this leads to more attractive trees that are easier to interpret. To turn 
this off, and use all edges for deciding the layout, we can set `use_core_edges` 
to `FALSE`.

```{r nba-layout-nocore}
clustree(nba_clusts, prefix = "K", layout = "sugiyama", use_core_edges = FALSE)
```

## Adding labels

To make it easy to identify clusters the cluster nodes are labelled with their
cluster number (controlled using the `node_text` arguments) but sometimes it is
useful to add labels with additional information. This is done the same way
as the other aesthetics. Here we label nodes with the maximum assist percentage:

```{r nba-labels}
clustree(nba_clusts, prefix = "K", node_label = "AssistPct",
         node_label_aggr = "max")
```

One way this can be useful is if we have assigned labels to the samples. Here
is a custom function that labels a cluster if all the players are the same
position, otherwise it labels the cluster as "mixed":

```{r nba-labels-custom}
label_position <- function(labels) {
    if (length(unique(labels)) == 1) {
        position <- as.character(unique(labels))
    } else {
        position <- "mixed"
    }
    return(position)
}

clustree(nba_clusts, prefix = "K", node_label = "Position",
         node_label_aggr = "label_position")
```

# Clustering trees for scRNA-seq data

Clustering has become a core tool for analysing single-cell RNA-sequencing
(scRNA-seq) datasets. These datasets contain gene expression measurements from
hundreds to hundreds of thousands of cells. Often samples come from complex
tissues containing many types of cells and clustering is used to group similar
cells together. To make it easier to produce clustering trees for these kinds
of datasets we provide interfaces for some of the objects commonly used to
analyse scRNA-seq data.

The clustree package contains an example simulated scRNA-seq data that has been
clustered using the **{SC3}** and **{Seurat}** [@Satija2015-or] packages.

```{r sc-example}
data("sc_example")
names(sc_example)
```

## `SingleCellExperiment` objects

```{r sce-present, echo = FALSE}
sce_present <- requireNamespace("SingleCellExperiment", quietly = TRUE)
```

```{r sce-not, echo = FALSE, results = "asis", eval = !sce_present}
cat("> **NOTE:** This section requires the SingleCellExperiment package.",
    "This package isn't installed so the results won't be shown.")
```

The `SingleCellExperiment` is one of these common objects, used across a range
of Bioconductor packages. Let's have a look at an example, but first we need
to convert the example dataset to a `SingleCellExperiment` object:

```{r sce, eval = sce_present}
suppressPackageStartupMessages(library("SingleCellExperiment"))

sce <- SingleCellExperiment(assays = list(counts = sc_example$counts,
                                          logcounts = sc_example$logcounts),
                            colData = sc_example$sc3_clusters,
                            reducedDims = SimpleList(TSNE = sc_example$tsne))
```

The clustering information is held in the `coldata` slot.

```{r sce-colData, eval = sce_present}
head(colData(sce))
```

We can plot a clustering tree in the same way we did with a `data.frame`. In
this case the clustering column names contain a suffix that needs to be stripped
away, so we will pass that along as well.

```{r sce-plot, eval = sce_present}
clustree(sce, prefix = "sc3_", suffix = "_clusters")
```

## `Seurat` objects

```{r seurat-present, echo = FALSE}
seurat_present <- requireNamespace("Seurat", quietly = TRUE) &&
    packageVersion("Seurat") >= package_version(x = "5.0.0")
```

```{r seurat-not, echo = FALSE, results = 'asis', eval = !seurat_present}
cat("> **NOTE:** This section requires the Seurat package (>= 5.0.0).",
    "This package isn't installed so the results won't be shown.")
```

Clustering trees can also be produced directly from `Seurat` objects. Let's
convert our `SingleCellExperiment` to `Seurat` format:

```{r seurat, eval = seurat_present}
suppressPackageStartupMessages(library("Seurat"))

# Create the Seurat object, Seurat >= expects a sparse matrix
seurat <- CreateSeuratObject(counts = as(sc_example$counts, "sparseMatrix"),
                             data = as(sc_example$logcounts, "sparseMatrix"),
                             meta.data = sc_example$seurat_clusters)

# Add the t-SNE embedding
seurat[['TSNE']] <- CreateDimReducObject(embeddings = sc_example$tsne,
                                         key = "tSNE_")
```

In this case the clustering information is held in the `meta.data` slot which
can be accessed using the `[[` operator:

```{r seurat-meta, eval = seurat_present}
head(seurat[[]])
```

We can now produce a clustering tree using this object. In this example the
prefix for clustering columns is `res.` but in most cases the default prefix
from `Seurat` will be automatically used.

```{r seurat-plot, eval = seurat_present}
clustree(seurat, prefix = "res.")
```

> Note: This example uses the newer Seurat object available in version 3.0.0 or
> greater of Seurat. There is also an interface for the older seurat object but
> this may be deprecated in the future.

## Using genes as aesthetics

As well as being able to use any additional columns for aesthetics we can also
use the expression of individual genes. Let's colour the nodes in the `Seurat`
tree by `Gene730` (a highly variable gene). Again we need to supply an
aggregation function.

```{r plot-gene, eval = seurat_present}
clustree(seurat, prefix = "res.",
         node_colour = "Gene730", node_colour_aggr = "median")
```

# Overlaying clustering trees

One way to think of clustering trees is that they add an extra dimension to the
data that shows how clusters "evolve" over time (increasing resolution). Some
times it can be useful to overlay this dimension on other dimensions in the 
data, particularly those that come from dimensionality reduction techniques. We
can do this using the `clustree_overlay` function:

```{r nba-overlay}
clustree_overlay(nba_clusts, prefix = "K", x_value = "PC1", y_value = "PC2")
```

The easiest way to understand this plot is to imagine that you are looking down
on the clustering tree from above. The x and y axes are the chosen dimensions in
the data and the z axis is the clustering resolution. Each cluster node is 
placed at the mean x and y values of the samples it contains. We can also see
points corresponding to the individual samples.

## Choosing what to colour

Due to the way `ggplot2` works we can only colour one element in the plot (the
node points actually use the fill aesthetic). By default the tree edges are 
coloured according to the clustering resolution they originate from. 
Alternatively we can choose to colour the sample points by their highest 
resolution cluster. The colour of whichever element isn't using the colour
aesthetic can be set using the `alt_colour` argument.

```{r nba-overlay-colour}
clustree_overlay(nba_clusts, prefix = "K", x_value = "PC1", y_value = "PC2",
                 use_colour = "points", alt_colour = "blue")
```

Aesthetics for the clustering tree nodes can be controlled in the same way as
for the regular `clustree` function and there are additional arguments for
adjusting the appearance of edges and points.

## Labelling nodes

One of the downsides of this visualisation is that it can be hard to identify
the clusters that each node represents as they can often overlap. To make this
a bit easier we can label each node with the resolution and cluster ID.

```{r nba-overlay-labels}
clustree_overlay(nba_clusts, prefix = "K", x_value = "PC1", y_value = "PC2",
                 label_nodes = TRUE)
```

## Showing the side view

While the main overlay plot show us the tree from above it can also be useful
to see it from the side, showing one of the x or y dimensions against the
resolution dimension. We can get these views by setting the `plot_sides` option
to `TRUE`. This will return a list of `ggplot` objects instead of a single plot.

```{r nba-overlay-sides}
overlay_list <- clustree_overlay(nba_clusts, prefix = "K", x_value = "PC1",
                                 y_value = "PC2", plot_sides = TRUE)

names(overlay_list)

overlay_list$x_side
overlay_list$y_side
```

# Modifying appearance

The `clustree()` function returns a `ggplot` object which can be modified using
functions in the **{ggplot2}** or **{ggraph}** packages. For example we could 
change the colour scales used for the nodes and edges:

```{r modify}
clustree(nba_clusts, prefix = "K") +
    scale_color_brewer(palette = "Set1") +
    scale_edge_color_continuous(low = "blue", high = "red")
```

## Legends

The way `ggplot` objects is displayed is relative to the size of the plotting
window or output file. While the main plot will always fit sometimes legends
will be placed outside the visible area. One solution to this is to simply 
increase the size of the image. An alternative solution is to turn off some of
the legends, either by setting some of the aesthetics to static values or by
using the `guides` function. We could also move them to somewhere they might
fit better. For example let's remove the edge legends and move the rest
to the bottom:

```{r legends}
clustree(nba_clusts, prefix = "K") +
    guides(edge_colour = FALSE, edge_alpha = FALSE) +
    theme(legend.position = "bottom")
```

# Citing clustree

If you find **{clustree}** or the clustering trees approach useful for your work
please cite our associated publication:

```{r citation}
citation("clustree")
```

# References