Review
Cancer Inform. 2015 Mar 22:14:1-19.
doi: 10.4137/CIN.S22080. eCollection 2015.

HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree


Askar Obulkasim et al. Cancer Inform.

Abstract

Hierarchical clustering (HC) is one of the most frequently used methods in computational biology for the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not used in selecting the cutoff, the induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes the various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that "haunt" high-dimensional genomics data. Since the clustering process is guided by the background information, clusters are easy to interpret. Unlike existing packages, HCsnip places no constraint on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decompose the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic and can be combined with other algorithms that operate on detected clusters.
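The contrast between a fixed-height cut and piecewise snipping can be illustrated with a small sketch. This is not the HCsnip API (the package is written in R); it is a toy Python illustration under stated assumptions. The `Node` class, the functions `cut_fixed` and `snip_piecewise`, and the toy tree with its merge heights are all hypothetical, and the per-branch heights are picked by hand here rather than selected under the guidance of background information as the abstract describes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Node of a binary HC tree: leaves carry a label, internal nodes a merge height."""
    height: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None

    def leaves(self) -> List[str]:
        if self.label is not None:
            return [self.label]
        return self.left.leaves() + self.right.leaves()


def leaf(name: str) -> Node:
    return Node(label=name)


def merge(left: Node, right: Node, height: float) -> Node:
    return Node(height=height, left=left, right=right)


def cut_fixed(node: Node, h: float) -> List[List[str]]:
    """Fixed-height cut: each maximal subtree merged at or below h is one cluster."""
    if node.label is not None or node.height <= h:
        return [node.leaves()]
    return cut_fixed(node.left, h) + cut_fixed(node.right, h)


def snip_piecewise(root: Node, h_left: float, h_right: float) -> List[List[str]]:
    """Variable-height snip (hypothetical helper): cut the two branches below
    the root at different heights and concatenate the resulting clusters."""
    return cut_fixed(root.left, h_left) + cut_fixed(root.right, h_right)


# Toy tree (assumed heights): a tight, nested left branch and a loose right branch.
left_branch = merge(merge(leaf("a"), leaf("b"), 1.0),
                    merge(leaf("c"), leaf("d"), 1.2), 2.0)
right_branch = merge(merge(leaf("e"), leaf("f"), 3.0), leaf("g"), 4.0)
root = merge(left_branch, right_branch, 6.0)

fixed = cut_fixed(root, 2.5)               # one height for the whole tree
snipped = snip_piecewise(root, 1.5, 4.5)   # a different height per branch

print(fixed)    # [['a', 'b', 'c', 'd'], ['e'], ['f'], ['g']]
print(snipped)  # [['a', 'b'], ['c', 'd'], ['e', 'f', 'g']]
```

In this toy tree no single cut height reproduces the snipped partition: separating {a, b} from {c, d} requires a height below 2.0, while keeping {e, f, g} together requires a height of at least 4.0.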

Keywords: R package; data integration; hierarchical clustering; high-dimensional data; semi-supervised clustering.


Figures

Figure 1
An example HC tree.
Figure 2
Schemata of the optimal partition selection procedure.
Figure 3
Illustration of the indirect projection principle.
Figure 4
Illustration of the three scenarios considered in the Concordance index calculation.
Figure 5
Schemata of group assignment from TwoHC_assign function for a new sample.
Figure 6
The HC tree that is derived from the first 30 samples’ expression profiles in the Leukemia data set. Partitions induced by the piecewise snipping are displayed at the bottom of the HC tree.
Figure 7
Kaplan–Meier survival analysis by clusters that were induced on the new samples.
Figure 8
The gene expression entropy distributions of the two clusters that were induced on the 50 new samples. The shape of the violin plot shows the distribution of the gene expression entropy values within each cluster and the white dot in each denotes the mean entropy level. Distributions are relatively different between cluster1 and cluster2, with a larger mean gene expression entropy value observed in cluster1.
Figure 9
PCA plot. The 50 samples are shown in the 2D plane spanned by their first two principal components. This type of plot is useful for visualizing cluster homogeneity: the smaller the connected area, the more similar the samples within a cluster are. The scatter plots show that there is less within-cluster variation in cluster2 (right), which is indicative of a higher level of within-cluster homogeneity resulting from low within-cluster gene expression entropy. In cluster1 (left), the samples are more spread out, indicative of a higher degree of within-cluster variation resulting from a higher within-cluster gene expression entropy.
Figure 10
(A) The density of the observed relative risk ratios (rrobs). (B) The distribution of the test statistics from the null model obtained by permutation (Tperm). Note: The black broken line denotes the location of the observed test statistics.
Figure 11
The HC tree that has been derived by using the DNA copy number profiles in the GBM data set. The optimal partition extracted by the two approaches is displayed at the bottom.
Figure 12
Kaplan–Meier survival analysis by clusters displayed in Figure 11. For visualization purposes, clusters in each branch are plotted separately.
Figure 13
Gene expression entropy distribution by clusters displayed in Figure 11. The shape of the violin plot shows the distribution of the gene expression entropy values within each cluster, and the white dot in each denotes the mean entropy level.
Figure 14
(A) The empirical distribution of rrobs. (B) The distribution of the test statistics from the null model obtained by permutation (Tperm). Note: The black broken line denotes the location of the observed test statistics.
