Single-cell RNA sequencing (scRNA-seq) allows researchers to collect large catalogues detailing the transcriptomes of individual cells. Unsupervised clustering is of central importance for the analysis of these data, as it is used to identify putative cell types. However, clustering these data poses many challenges. We discuss why clustering is a challenging problem from a computational point of view and which aspects of the data make it challenging. We also consider the difficulties related to the biological interpretation and annotation of the identified clusters.
Change history
22 January 2019
During typesetting of this article, errors were inadvertently introduced to the hyperlinked URLs of some of the clustering tools in table 1 (Seurat, CIDR, pcaReduce and mpath), as well as to the numbering of the bold-text annotations in the reference list. The article has now been corrected online. The editors apologize for this error.
Related Links
BackSPIN: https://github.com/linnarsson-lab/BackSPIN
CIDR: https://github.com/VCCRI/CIDR
GiniClust: https://github.com/lanjiangboston/GiniClust
pcaReduce: https://github.com/JustinaZ/pcaReduce
mpath: https://github.com/JinmiaoChenLab/Mpath
PhenoGraph: https://github.com/jacoblevine/PhenoGraph
RaceID: https://github.com/dgrun/RaceID
RaceID2: https://github.com/dgrun/StemID
RaceID3: https://github.com/dgrun/RaceID3_StemID2
SC3: http://bioconductor.org/packages/release/bioc/html/SC3.html
scanpy: https://github.com/theislab/scanpy
Seurat (latest): https://satijalab.org/seurat/
SIMLR: https://bioconductor.org/packages/release/bioc/html/SIMLR.html
SINCERA: https://github.com/xu-lab/SINCERA
SNN-Cliq: http://bioinfo.uncc.edu/SNNCliq/
TSCAN: https://bioconductor.org/packages/release/bioc/html/TSCAN.html
- Unsupervised clustering
The process of grouping objects based on similarity but without any ground truth or labelled training data.
- Feature selection
A collection of statistical approaches that identify and retain only variables that are most relevant to the underlying structure of the data set.
- Dimensionality reduction
A collection of statistical approaches that reduce the number of variables in a data set. It often refers specifically to methods that recombine the original variables into a new set of non-redundant variables. Dimensionality reduction can help in identifying important patterns and reducing the amount of computation needed.
- Greedy
An algorithm that, at each step, chooses the option that leads to the greatest reduction of the cost function. Greedy algorithms are often fast, but they may fail to find the optimal solution.
- Graphs
A graph consists of a set of nodes connected to each other by a set of edges. In single-cell RNA sequencing, nodes are cells, and edges are determined according to cell–cell pairwise distances.
- Heuristic optimization
A method for solving a problem that is designed to sacrifice accuracy in favour of speed. These methods are often based on approximations and cannot be guaranteed to find the best solution.
- Bootstrapping
A statistical approach in which data sets are randomly resampled (typically with replacement) and reanalysed to assess the robustness of a result.
- Gaussian mixture model
A statistical model that describes data as a weighted combination of one or more normal distributions. When fitted to data, each normal distribution can be interpreted as a distinct cluster of points.
- Cell ontology
A hierarchical organization of controlled vocabulary to describe properties of (and relationships between) different cell types.
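
Several of the glossary terms above (feature selection, graphs and unsupervised clustering) can be made concrete with a small sketch. The example below is illustrative only: the function names and the toy expression matrix are hypothetical, variance ranking stands in for the more refined feature selection used in practice, and connected components of a k-nearest-neighbour graph stand in for the community detection algorithms applied by graph-based tools. It is not the method of any tool listed under Related Links.

```python
import math
from statistics import pvariance


def select_variable_genes(cells, n_keep):
    """Feature selection: return indices of the n_keep highest-variance genes."""
    n_genes = len(cells[0])
    variances = [pvariance([cell[g] for cell in cells]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_keep])


def knn_graph(cells, k):
    """Graph: nodes are cells; each cell is linked to its k nearest cells."""
    edges = set()
    for i, cell in enumerate(cells):
        others = [j for j in range(len(cells)) if j != i]
        others.sort(key=lambda j: math.dist(cell, cells[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store as undirected edge
    return edges


def connected_components(n_nodes, edges):
    """Clustering stand-in: label each node by its connected component."""
    neighbours = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    labels = [None] * n_nodes
    next_label = 0
    for start in range(n_nodes):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in neighbours[node]:
                if labels[nb] is None:
                    labels[nb] = next_label
                    stack.append(nb)
        next_label += 1
    return labels


# Toy data (invented): 6 cells x 4 genes. Genes 0 and 1 separate two
# groups of cells; genes 2 and 3 are nearly constant.
cells = [
    [10.0, 0.0, 1.0, 1.0],
    [11.0, 1.0, 1.0, 1.1],
    [9.0, 0.0, 1.1, 1.0],
    [0.0, 10.0, 1.0, 1.0],
    [1.0, 11.0, 1.0, 1.1],
    [0.0, 9.0, 1.1, 1.0],
]

kept = select_variable_genes(cells, n_keep=2)
reduced = [[cell[g] for g in kept] for cell in cells]
edges = knn_graph(reduced, k=2)
labels = connected_components(len(reduced), edges)

print(kept)    # [0, 1]
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No labels are supplied at any point, which is what makes the procedure unsupervised. On real data the number of retained genes and the value of k are choices that can change the resulting clusters, so the values used here should be read as placeholders.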
